Paper Title
DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization
Paper Authors
Paper Abstract
Large-scale pre-trained sequence-to-sequence models like BART and T5 achieve state-of-the-art performance on many generative NLP tasks. However, such models pose a great challenge in resource-constrained scenarios owing to their large memory requirements and high latency. To alleviate this issue, we propose to jointly distill and quantize the model, where knowledge is transferred from the full-precision teacher model to the quantized and distilled low-precision student model. Empirical analyses show that, despite the challenging nature of generative tasks, we were able to achieve a 16.5x model footprint compression ratio with little performance drop relative to the full-precision counterparts on multiple summarization and QA datasets. We further pushed the limit of compression ratio to 27.7x and presented the performance-efficiency trade-off for generative tasks using pre-trained models. To the best of our knowledge, this is the first work aiming to effectively distill and quantize sequence-to-sequence pre-trained models for language generation tasks.
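The abstract describes transferring knowledge from a full-precision teacher to a quantized, distilled low-precision student. As a rough illustration only, the minimal PyTorch sketch below combines a fake-quantized student layer (using a straight-through estimator) with a standard logit-distillation loss against a full-precision teacher. All module and function names here are hypothetical stand-ins, not the DQ-BART codebase, and the toy linear models stand in for the actual BART/T5 seq2seq architectures and the paper's exact distillation objectives.

# Minimal sketch (not the authors' code) of joint distillation + quantization:
# a full-precision teacher guides a low-precision student whose weights are
# fake-quantized with a straight-through estimator. The loss weighting and
# quantizer are illustrative assumptions, not DQ-BART's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric uniform quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax, qmax) * scale
    # Straight-through: forward pass uses the quantized weights,
    # backward pass treats the quantizer as the identity.
    return w + (w_q - w).detach()


class QuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized on every forward pass."""

    def __init__(self, in_features: int, out_features: int, bits: int = 8):
        super().__init__(in_features, out_features)
        self.bits = bits

    def forward(self, x):
        return F.linear(x, fake_quantize(self.weight, self.bits), self.bias)


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence against the teacher."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kl


if __name__ == "__main__":
    torch.manual_seed(0)
    teacher = nn.Linear(16, 10)             # stands in for the full-precision teacher
    student = QuantLinear(16, 10, bits=2)   # low-precision, distilled student
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)

    x = torch.randn(8, 16)
    labels = torch.randint(0, 10, (8,))
    with torch.no_grad():
        teacher_logits = teacher(x)

    loss = distillation_loss(student(x), teacher_logits, labels)
    loss.backward()
    opt.step()
    print(f"joint distillation+quantization step, loss={loss.item():.4f}")

In this sketch the compression comes from the student holding low-bit weights while gradients still flow through the straight-through estimator; the reported 16.5x to 27.7x footprint reductions in the paper additionally reflect distilling to fewer decoder layers, which the toy example above does not model.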