Paper Title

No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models

Paper Authors

Chen Liang, Haoming Jiang, Simiao Zuo, Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, Tuo Zhao

Paper Abstract

Recent research has shown the existence of significant redundancy in large Transformer models. One can prune the redundant parameters without significantly sacrificing the generalization performance. However, we question whether the redundant parameters could have contributed more if they were properly trained. To answer this question, we propose a novel training strategy that encourages all parameters to be trained sufficiently. Specifically, we adaptively adjust the learning rate for each parameter according to its sensitivity, a robust gradient-based measure reflecting this parameter's contribution to the model performance. A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate. In contrast, a parameter with high sensitivity is well-trained, and we regularize it by decreasing its learning rate to prevent further overfitting. We conduct extensive experiments on natural language understanding, neural machine translation, and image classification to demonstrate the effectiveness of the proposed schedule. Analysis shows that the proposed schedule indeed reduces the redundancy and improves generalization performance.
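The abstract only states that sensitivity is a "robust gradient-based measure" of a parameter's contribution, and that low-sensitivity parameters receive a larger learning rate while high-sensitivity ones receive a smaller one. Below is a minimal PyTorch sketch of that idea, assuming a common first-order sensitivity proxy |θ · ∇θL| smoothed with an exponential moving average; the class name, the per-tensor normalization, and the [0.5, 1.5] scaling range are illustrative choices, not the paper's actual algorithm.

```python
# Minimal sketch of a sensitivity-guided per-parameter learning-rate scale.
# Assumption (not stated in the abstract): sensitivity is approximated by the
# first-order term |theta * grad|, i.e. the estimated loss change if the
# parameter were zeroed out, smoothed with an exponential moving average.

import torch


class SensitivityScaledSGD(torch.optim.Optimizer):
    """Plain SGD whose per-parameter step size is rescaled by smoothed sensitivity:
    low-sensitivity parameters get a larger step, high-sensitivity ones a smaller step."""

    def __init__(self, params, lr=0.1, beta=0.9, eps=1e-8):
        super().__init__(params, dict(lr=lr, beta=beta, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr, beta, eps = group["lr"], group["beta"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "sens" not in state:
                    state["sens"] = torch.zeros_like(p)
                # Assumed sensitivity proxy: |theta * grad|.
                sens = (p * p.grad).abs()
                state["sens"].mul_(beta).add_(sens, alpha=1 - beta)
                # Normalize smoothed sensitivity to [0, 1] within the tensor, then
                # map low sensitivity -> larger step, high sensitivity -> smaller step.
                s = state["sens"]
                s_norm = (s - s.min()) / (s.max() - s.min() + eps)
                scale = 1.0 + (0.5 - s_norm)  # in [0.5, 1.5]; an illustrative choice
                p.add_(p.grad * scale, alpha=-lr)


# Usage sketch: drop-in replacement for torch.optim.SGD on any model.
model = torch.nn.Linear(16, 2)
opt = SensitivityScaledSGD(model.parameters(), lr=0.05)
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
```

The only property this sketch tries to preserve from the abstract is that the per-parameter step size decreases monotonically as the smoothed sensitivity grows; everything else (the proxy, the normalization, the scaling range) would need to follow the paper itself.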
