Paper Title
MimicNorm: Weight Mean and Last BN Layer Mimic the Dynamic of Batch Normalization
Paper Authors
Paper Abstract
Substantial experiments have validated the success of the Batch Normalization (BN) layer in benefiting convergence and generalization. However, BN requires extra memory and floating-point computation. Moreover, BN is inaccurate on micro-batches, as it depends on batch statistics. In this paper, we address these problems by simplifying BN regularization while keeping two fundamental impacts of BN layers, i.e., data decorrelation and adaptive learning rate. We propose a novel normalization method, named MimicNorm, to improve the convergence and efficiency of network training. MimicNorm consists of only two light operations: a modified weight mean operation (subtracting the mean value from the weight parameter tensor) and one BN layer before the loss function (the last BN layer). We leverage neural tangent kernel (NTK) theory to prove that our weight mean operation whitens activations and transitions the network into the chaotic regime, as a BN layer does, and consequently leads to enhanced convergence. The last BN layer provides an autotuned learning rate and also improves accuracy. Experimental results show that MimicNorm achieves accuracy comparable to BN for various network structures, including ResNets and lightweight networks like ShuffleNet, while reducing memory consumption by about 20%. The code is publicly available at https://github.com/Kid-key/MimicNorm.
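To make the two operations concrete, below is a minimal PyTorch sketch of the idea described in the abstract: convolutions whose kernels are mean-centered on every forward pass, no per-layer BN inside the network, and a single BN layer applied to the logits right before the loss. The class names (CenteredConv2d, MimicNormNet) and the choice to center each output filter separately are illustrative assumptions, not taken from the authors' released code; the linked repository is authoritative for the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenteredConv2d(nn.Conv2d):
    """Conv2d with the weight mean operation: the kernel mean is
    subtracted from the weight tensor before every convolution.
    (Centering per output filter is an assumption of this sketch.)"""
    def forward(self, x):
        w = self.weight
        w = w - w.mean(dim=(1, 2, 3), keepdim=True)  # zero-mean each filter
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

class MimicNormNet(nn.Module):
    """Tiny CNN illustrating MimicNorm: centered convolutions inside the
    network (no intermediate BN layers) and a single BN layer on the
    class logits just before the loss (the 'last BN layer')."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            CenteredConv2d(3, 32, 3, padding=1, bias=False), nn.ReLU(),
            CenteredConv2d(32, 64, 3, stride=2, padding=1, bias=False), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, num_classes)
        self.last_bn = nn.BatchNorm1d(num_classes)  # last BN layer over logits

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.last_bn(self.fc(x))

if __name__ == "__main__":
    model = MimicNormNet()
    logits = model(torch.randn(8, 3, 32, 32))       # batch of 8 RGB 32x32 images
    loss = F.cross_entropy(logits, torch.randint(0, 10, (8,)))
    loss.backward()
    print(loss.item())
```

Compared with a standard BN network, this sketch keeps only one BN layer and replaces the others with weight centering, which is where the roughly 20% memory saving reported in the abstract would come from.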