Paper Title
Generation-Distillation for Efficient Natural Language Understanding in Low-Data Settings
Paper Authors
Paper Abstract
Over the past year, the emergence of transfer learning with large-scale language models (LMs) has led to dramatic performance improvements across a broad range of natural language understanding tasks. However, the size and memory footprint of these large LMs make them difficult to deploy in many scenarios (e.g., on mobile phones). Recent research points to knowledge distillation as a potential solution, showing that when training data for a given task is abundant, it is possible to distill a large (teacher) LM into a small task-specific (student) network with minimal loss of performance. However, when such data is scarce, there remains a significant performance gap between large pretrained LMs and smaller task-specific models, even when training via distillation. In this paper, we bridge this gap with a novel training approach, called generation-distillation, that leverages large finetuned LMs in two ways: (1) to generate new (unlabeled) training examples, and (2) to distill their knowledge into a small network using these examples. Across three low-resource text classification datasets, we achieve comparable performance to BERT while using 300x fewer parameters, and we outperform prior approaches to distillation for text classification while using 3x fewer parameters.
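To make the distillation step of generation-distillation concrete, the sketch below shows one common way such an objective can be written: a KL-divergence loss between temperature-softened teacher and student class distributions, applied to unlabeled examples sampled from a finetuned LM. This is a minimal illustration under assumed details, not the paper's exact implementation; the PyTorch framing, the `distillation_loss` function, the temperature value, and the `teacher`/`student`/`generated_texts_loader` names are all hypothetical.

```python
# Minimal sketch of a distillation objective on generated (unlabeled) examples.
# Assumptions (not from the paper): the teacher is a large finetuned LM exposed as a
# classifier over C classes, and the student is a small task-specific network.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Hypothetical training loop over examples generated by the finetuned LM:
# for batch in generated_texts_loader:
#     with torch.no_grad():
#         t_logits = teacher(batch)      # large finetuned LM (teacher)
#     s_logits = student(batch)          # small task-specific network
#     loss = distillation_loss(s_logits, t_logits)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Because the generated examples carry no gold labels, the student in this sketch learns only from the teacher's soft predictions, which is what lets the approach operate in low-data settings.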