Paper Title

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

Paper Authors

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He

Paper Abstract

How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers, due to their prohibitive memory/computation requirements. In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed ZeroQuant. ZeroQuant is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained, hardware-friendly quantization scheme for both weights and activations; (2) a novel, affordable layer-by-layer knowledge distillation algorithm (LKD) that works even without access to the original training data; (3) highly optimized quantization system backend support that removes the quantization/dequantization overhead. As such, we are able to show that: (1) ZeroQuant can reduce the precision of weights and activations to INT8 in a cost-free way for both BERT and GPT3-style models with minimal accuracy impact, which leads to up to 5.19x/4.16x speedup on those models compared to FP16 inference; (2) ZeroQuant plus LKD can affordably quantize the weights in the fully-connected module to INT4, along with INT8 weights in the attention module and INT8 activations, resulting in a 3x memory footprint reduction compared to the FP16 model; (3) ZeroQuant can be directly applied to two of the largest open-sourced language models, GPT-J6B and GPT-NeoX20B, for which our INT8 model achieves accuracy similar to the FP16 model with up to 5.2x better efficiency.
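
The fine-grained scheme in component (1) refers to group-wise quantization for weights and token-wise dynamic quantization for activations. Below is a minimal PyTorch sketch of those two ideas, assuming symmetric INT8 quantization; the function names and group layout are illustrative, not the paper's released implementation.

```python
import torch

def quantize_weight_groupwise(w: torch.Tensor, num_groups: int = 4, bits: int = 8):
    """Symmetric group-wise quantization: the weight matrix is split into
    groups and each group gets its own scale, which is finer-grained than
    a single per-tensor scale while staying hardware friendly."""
    qmax = 2 ** (bits - 1) - 1                                   # 127 for INT8
    groups = w.reshape(num_groups, -1)                           # one scale per group
    scales = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax).to(torch.int8)
    return q.reshape(w.shape), scales

def quantize_activation_tokenwise(x: torch.Tensor, bits: int = 8):
    """Symmetric token-wise dynamic quantization: x is [tokens, hidden] and
    each token (row) gets its own scale, computed on the fly at inference."""
    qmax = 2 ** (bits - 1) - 1
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales

# Round-trip a weight matrix to inspect the quantization error.
w = torch.randn(768, 768)
qw, scales = quantize_weight_groupwise(w, num_groups=4)
w_hat = (qw.reshape(4, -1).float() * scales).reshape(w.shape)
print("max abs weight error:", (w - w_hat).abs().max().item())
```

Note that token-wise scales must be computed per batch at inference time, which is why component (3) matters: without fused kernels, the extra scale computation and dequantization steps would eat into the INT8 speedup.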
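Component (2), LKD, distills the network one layer at a time: the original full-precision layer serves as the teacher for its quantized copy, and the objective is the MSE between their outputs on calibration inputs, which is why the original training data is not required. The sketch below illustrates the idea for a single linear layer with a per-tensor straight-through estimator; all names here are hypothetical, and the paper applies this to full Transformer layers with the group-wise scheme above rather than to a plain nn.Linear.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize with a straight-through estimator, so gradients
    flow to the underlying FP weights during distillation."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()          # forward: w_q; backward: identity

class QuantLinear(nn.Module):
    """A linear layer whose weight is fake-quantized on every forward pass."""
    def __init__(self, linear: nn.Linear, bits: int = 8):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.bias = nn.Parameter(linear.bias.detach().clone())
        self.bits = bits

    def forward(self, x):
        return F.linear(x, fake_quant(self.weight, self.bits), self.bias)

def lkd_one_layer(teacher: nn.Linear, calib_batches, bits=8, steps=200, lr=1e-4):
    """Layer-by-layer distillation: minimize the MSE between the FP teacher
    layer and its quantized student on calibration data (random tensors
    here, mirroring the no-original-training-data setting)."""
    student = QuantLinear(teacher, bits)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for step in range(steps):
        x = calib_batches[step % len(calib_batches)]
        with torch.no_grad():
            target = teacher(x)            # teacher output on the same input
        loss = F.mse_loss(student(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

# Distill a single 768x768 layer with random calibration batches.
teacher = nn.Linear(768, 768)
calib = [torch.randn(8, 768) for _ in range(16)]
student = lkd_one_layer(teacher, calib)
```

Because each layer is optimized independently against a frozen teacher, the memory and compute cost of LKD stays close to that of inference, which is what makes the INT4/INT8 mixed-precision setting in result (2) affordable.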
