Paper Title
Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks
Paper Authors
Paper Abstract
Batch normalization dramatically increases the largest trainable depth of residual networks, and this benefit has been crucial to the empirical success of deep residual networks on a wide range of benchmarks. We show that this key benefit arises because, at initialization, batch normalization downscales the residual branch relative to the skip connection, by a normalizing factor on the order of the square root of the network depth. This ensures that, early in training, the function computed by normalized residual blocks in deep networks is close to the identity function (on average). We use this insight to develop a simple initialization scheme that can train deep residual networks without normalization. We also provide a detailed empirical study of residual networks, which clarifies that, although batch normalized networks can be trained with larger learning rates, this effect is only beneficial in specific compute regimes, and has minimal benefits when the batch size is small.
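To make the scaling claim concrete: in a residual block x_{l+1} = x_l + BN(f_l(x_l)), the normalized branch has variance of order one while the skip path accumulates variance of order l, so at initialization the residual branch is roughly 1/sqrt(l) smaller than the skip connection. The Python sketch below is a minimal NumPy illustration of this variance growth under simplifying assumptions (a fully connected stand-in for f_l, no learned scale or shift in the normalization, arbitrary width, depth, and batch size); it is not the paper's code.

# Minimal NumPy sketch (not the paper's code): with a normalized residual branch,
# the variance of the skip path grows roughly linearly in the block index l,
# so the branch-to-skip ratio shrinks like 1/sqrt(l) at initialization.
import numpy as np

rng = np.random.default_rng(0)
width, batch, depth = 512, 1024, 64

def residual_branch(x, rng):
    # One He-initialized linear layer followed by ReLU, a stand-in for f_l.
    w = rng.normal(0.0, np.sqrt(2.0 / x.shape[1]), size=(x.shape[1], x.shape[1]))
    return np.maximum(x @ w, 0.0)

def batch_norm(h):
    # Per-feature normalization over the batch, without learned scale/shift.
    return (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-5)

x = rng.normal(size=(batch, width))
for l in range(1, depth + 1):
    f = batch_norm(residual_branch(x, rng))   # normalized branch, variance ~1
    skip_var, branch_var = x.var(), f.var()
    x = x + f                                  # skip connection
    if l in (1, 4, 16, 64):
        # The branch/skip std ratio shrinks roughly like 1/sqrt(l).
        print(f"block {l:3d}: skip var ~ {skip_var:6.1f}, "
              f"branch/skip std ratio ~ {np.sqrt(branch_var / skip_var):.3f}")

The initialization scheme the abstract alludes to follows the same logic: if the residual branch is instead multiplied by a scalar initialized at or near zero, deep unnormalized blocks also start close to the identity function. The abstract does not spell out the details, so this is only a sketch of the idea.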