Paper Title
Stochastic Weight Averaging Revisited
Paper Authors
Paper Abstract
Averaging neural network weights sampled by a backbone stochastic gradient descent (SGD) is a simple yet effective approach to assist the backbone SGD in finding better optima, in terms of generalization. From a statistical perspective, weight averaging (WA) contributes to variance reduction. Recently, a well-established stochastic weight averaging (SWA) method was proposed, which is characterized by the application of a cyclical or high-constant (CHC) learning rate schedule (LRS) when generating weight samples for WA. A new insight into WA then emerged, stating that WA helps to discover wider optima, which in turn leads to better generalization. We conduct extensive experimental studies of SWA, involving a dozen modern DNN model architectures and a dozen benchmark open-source image, graph, and text datasets. We disentangle the contributions of the WA operation and the CHC LRS to SWA, showing that the WA operation in SWA still contributes to variance reduction but does not always lead to wide optima. The experimental results also indicate that there are global-scale geometric structures in the DNN loss landscape. We then present an algorithm termed periodic SWA (PSWA), which employs a series of WA operations to discover these global geometric structures. PSWA outperforms its backbone SGD remarkably, providing experimental evidence for the existence of such global geometric structures. Code for reproducing the experimental results is available at https://github.com/ZJLAB-AMMI/PSWA.
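As an illustration of the WA operation described in the abstract, below is a minimal PyTorch-style sketch of averaging weight snapshots produced by a backbone SGD run. This is a generic sketch, not the authors' PSWA implementation (see the linked repository for that); `train_one_epoch` and `swa_start` are hypothetical placeholders standing in for the reader's own training loop and the epoch at which averaging begins.

```python
# Minimal sketch of the weight averaging (WA) operation: maintain a running
# average of weight snapshots taken from a backbone SGD run.
import copy
import torch


def update_weight_average(avg_model, model, num_averaged):
    """In-place running average: avg <- avg + (w - avg) / (num_averaged + 1)."""
    with torch.no_grad():
        for p_avg, p in zip(avg_model.parameters(), model.parameters()):
            p_avg.add_((p - p_avg) / (num_averaged + 1))
    return num_averaged + 1


def train_with_wa(model, train_one_epoch, num_epochs, swa_start):
    # `train_one_epoch` and `swa_start` are hypothetical placeholders for the
    # user's own backbone SGD epoch routine and the averaging start epoch.
    avg_model, n_avg = copy.deepcopy(model), 0
    for epoch in range(num_epochs):
        train_one_epoch(model)              # one epoch of backbone SGD
        if epoch >= swa_start:              # sample weights after a warm-up phase
            n_avg = update_weight_average(avg_model, model, n_avg)
    # Note: if the model contains BatchNorm layers, their running statistics
    # should be recomputed for avg_model before evaluation (e.g., with
    # torch.optim.swa_utils.update_bn).
    return avg_model                        # the WA solution
```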