Paper Title
SQuAT: Sharpness- and Quantization-Aware Training for BERT
Paper Authors
Paper Abstract
Quantization is an effective technique to reduce the memory footprint, inference latency, and power consumption of deep learning models. However, existing quantization methods suffer from accuracy degradation compared to full-precision (FP) models due to the errors introduced by coarse gradient estimation through non-differentiable quantization layers. The existence of sharp local minima in the loss landscapes of overparameterized models (e.g., Transformers) tends to aggravate this performance penalty in low-bit (2, 4 bits) settings. In this work, we propose sharpness- and quantization-aware training (SQuAT), which encourages the model to converge to flatter minima while performing quantization-aware training. Our proposed method alternates training between a sharpness objective and a step-size objective, which could let the model learn the most suitable parameter update magnitude to reach convergence near flat minima. Extensive experiments show that our method can consistently outperform state-of-the-art quantized BERT models under 2-, 3-, and 4-bit settings on the GLUE benchmark by 1%, and can sometimes even outperform full-precision (32-bit) models. Our empirical measurements of sharpness also suggest that our method leads to flatter minima than other quantization methods.
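To make the alternating scheme concrete, the following is a minimal PyTorch sketch of the idea described in the abstract: a SAM-style weight perturbation stands in for the sharpness objective, and an LSQ-style learnable step size with a straight-through estimator stands in for the step-size (quantization) objective. This is an illustrative assumption of how such alternation could look, not the authors' implementation; all names (FakeQuantLinear, squat_like_step, rho) and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FakeQuantLinear(nn.Module):
    """Linear layer with symmetric fake weight quantization and a learnable step size (LSQ-style)."""

    def __init__(self, in_features, out_features, n_bits=2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.step_size = nn.Parameter(torch.tensor(0.05))  # quantization scale (illustrative init)
        self.qmax = 2 ** (n_bits - 1) - 1

    def forward(self, x):
        # Straight-through estimator: round in the forward pass, identity gradient in backward.
        w = torch.clamp(self.weight / self.step_size, -self.qmax - 1, self.qmax)
        w = w + (w.round() - w).detach()
        return F.linear(x, w * self.step_size, self.bias)


def squat_like_step(model, loss_fn, inputs, targets, opt_weights, opt_steps, rho=0.05):
    """One alternating iteration: sharpness-aware update of the weights, then a step-size update."""
    params = [p for p in model.parameters() if p.requires_grad]

    # --- Sharpness objective: SAM-style ascent step, then descent using the perturbed gradient ---
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        grad_norm = torch.norm(torch.stack(
            [p.grad.norm() for p in params if p.grad is not None]))
        eps = []
        for p in params:
            e = torch.zeros_like(p) if p.grad is None else rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)            # climb toward a nearby high-loss point
            eps.append(e)
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()   # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)            # undo the perturbation before the real update
    opt_weights.step()           # updates weights/biases only
    model.zero_grad()

    # --- Step-size objective: quantization-aware update of the scales only ---
    loss_fn(model(inputs), targets).backward()
    opt_steps.step()
    model.zero_grad()


# Usage sketch on a toy regression task.
model = nn.Sequential(FakeQuantLinear(16, 16), nn.ReLU(), FakeQuantLinear(16, 1))
steps = [p for n, p in model.named_parameters() if n.endswith("step_size")]
weights = [p for n, p in model.named_parameters() if not n.endswith("step_size")]
opt_weights = torch.optim.AdamW(weights, lr=1e-3)
opt_steps = torch.optim.AdamW(steps, lr=1e-4)
x, y = torch.randn(32, 16), torch.randn(32, 1)
squat_like_step(model, nn.MSELoss(), x, y, opt_weights, opt_steps)
```

The two parameter groups are deliberately given separate optimizers so that the sharpness-aware pass touches only the weights while the scales are learned in their own pass, mirroring the alternation between the two objectives described above.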