Paper Title

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

Paper Authors

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He

Paper Abstract

How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers, due to their prohibitive memory/computation requirements. In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed ZeroQuant. ZeroQuant is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained, hardware-friendly quantization scheme for both weights and activations; (2) a novel, affordable layer-by-layer knowledge distillation algorithm (LKD) that works even without access to the original training data; (3) highly optimized quantization system backend support that removes the quantization/dequantization overhead. As such, we are able to show that: (1) ZeroQuant can reduce the precision of weights and activations to INT8 in a cost-free way for both BERT and GPT3-style models with minimal accuracy impact, which leads to up to 5.19x/4.16x speedup on those models compared to FP16 inference; (2) ZeroQuant plus LKD can affordably quantize the weights in the fully-connected module to INT4, along with INT8 weights in the attention module and INT8 activations, resulting in a 3x memory footprint reduction compared to the FP16 model; (3) ZeroQuant can be directly applied to two of the largest open-sourced language models, GPT-J6B and GPT-NeoX20B, for which our INT8 model achieves accuracy similar to the FP16 model with up to 5.2x better efficiency.
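
The fine-grained scheme in component (1) refers to group-wise quantization for weights and token-wise dynamic quantization for activations. Below is a minimal PyTorch sketch of those two ideas, assuming symmetric INT8 quantization; the function names and group layout are illustrative, not the paper's released implementation.

```python
import torch

def quantize_weight_groupwise(w: torch.Tensor, num_groups: int = 4, bits: int = 8):
    """Symmetric group-wise quantization: the weight matrix is split into
    groups and each group gets its own scale, which is finer-grained than
    a single per-tensor scale while staying hardware friendly."""
    qmax = 2 ** (bits - 1) - 1                                   # 127 for INT8
    groups = w.reshape(num_groups, -1)                           # one scale per group
    scales = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax).to(torch.int8)
    return q.reshape(w.shape), scales

def quantize_activation_tokenwise(x: torch.Tensor, bits: int = 8):
    """Symmetric token-wise dynamic quantization: x is [tokens, hidden] and
    each token (row) gets its own scale, computed on the fly at inference."""
    qmax = 2 ** (bits - 1) - 1
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales

# Round-trip a weight matrix to inspect the quantization error.
w = torch.randn(768, 768)
qw, scales = quantize_weight_groupwise(w, num_groups=4)
w_hat = (qw.reshape(4, -1).float() * scales).reshape(w.shape)
print("max abs weight error:", (w - w_hat).abs().max().item())
```

Note that token-wise scales must be computed per batch at inference time, which is why component (3) matters: without fused kernels, the extra scale computation and dequantization steps would eat into the INT8 speedup.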
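Component (2), LKD, distills the network one layer at a time: the original full-precision layer serves as the teacher for its quantized copy, and the objective is the MSE between their outputs on calibration inputs, which is why the original training data is not required. The sketch below illustrates the idea for a single linear layer with a per-tensor straight-through estimator; all names here are hypothetical, and the paper applies this to full Transformer layers with the group-wise scheme above rather than to a plain nn.Linear.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize with a straight-through estimator, so gradients
    flow to the underlying FP weights during distillation."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()          # forward: w_q; backward: identity

class QuantLinear(nn.Module):
    """A linear layer whose weight is fake-quantized on every forward pass."""
    def __init__(self, linear: nn.Linear, bits: int = 8):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.bias = nn.Parameter(linear.bias.detach().clone())
        self.bits = bits

    def forward(self, x):
        return F.linear(x, fake_quant(self.weight, self.bits), self.bias)

def lkd_one_layer(teacher: nn.Linear, calib_batches, bits=8, steps=200, lr=1e-4):
    """Layer-by-layer distillation: minimize the MSE between the FP teacher
    layer and its quantized student on calibration data (random tensors
    here, mirroring the no-original-training-data setting)."""
    student = QuantLinear(teacher, bits)
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for step in range(steps):
        x = calib_batches[step % len(calib_batches)]
        with torch.no_grad():
            target = teacher(x)            # teacher output on the same input
        loss = F.mse_loss(student(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

# Distill a single 768x768 layer with random calibration batches.
teacher = nn.Linear(768, 768)
calib = [torch.randn(8, 768) for _ in range(16)]
student = lkd_one_layer(teacher, calib)
```

Because each layer is optimized independently against a frozen teacher, the memory and compute cost of LKD stays close to that of inference, which is what makes the INT4/INT8 mixed-precision setting in result (2) affordable.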
