Paper Title

Combining Compressions for Multiplicative Size Scaling on Natural Language Tasks

Authors

Rajiv Movva, Jinhao Lei, Shayne Longpre, Ajay Gupta, Chris DuBois

Abstract

Quantization, knowledge distillation, and magnitude pruning are among the most popular methods for neural network compression in NLP. Independently, these methods reduce model size and can accelerate inference, but their relative benefit and combinatorial interactions have not been rigorously studied. For each of the eight possible subsets of these techniques, we compare accuracy vs. model size tradeoffs across six BERT architecture sizes and eight GLUE tasks. We find that quantization and distillation consistently provide greater benefit than pruning. Surprisingly, except for the pair of pruning and quantization, using multiple methods together rarely yields diminishing returns. Instead, we observe complementary and super-multiplicative reductions to model size. Our work quantitatively demonstrates that combining compression methods can synergistically reduce model size, and that practitioners should prioritize (1) quantization, (2) knowledge distillation, and (3) pruning to maximize accuracy vs. model size tradeoffs.
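As a rough illustration of how two of the three compressions discussed here can be stacked, the sketch below applies magnitude pruning and then post-training dynamic quantization to a BERT classifier using standard PyTorch utilities. This is an assumption-laden example, not the paper's actual pipeline: the checkpoint name and the 30% sparsity level are placeholders, and the knowledge distillation step (which requires a teacher model and a fine-tuning loop) is omitted.

```python
# Minimal sketch (not the authors' code): magnitude pruning followed by
# post-training dynamic quantization of a BERT sequence classifier.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

# Illustrative checkpoint; in practice this would be a model already
# fine-tuned on a GLUE task (and possibly distilled from a teacher).
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Magnitude pruning: zero out the 30% smallest-magnitude weights in every
# linear layer, then make the pruning permanent (placeholder sparsity).
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# Post-training dynamic quantization: store linear-layer weights in int8.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```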
