Paper Title
Batchless Normalization: How to Normalize Activations Across Instances with Minimal Memory Requirements
Paper Authors
Paper Abstract
In training neural networks, batch normalization has many benefits, not all of them entirely understood. But it also has some drawbacks. Arguably the foremost is memory consumption: computing the batch statistics requires all instances within the batch to be processed simultaneously, whereas without batch normalization they could be processed one by one while the weight gradients are accumulated. Another drawback is that the distribution parameters (mean and standard deviation) are unlike all other model parameters in that they are not trained by gradient descent but require special treatment, which complicates implementation. In this paper, I show a simple and straightforward way to address these issues. The idea, in short, is to add terms to the loss that, for each activation, minimize the negative log likelihood of the Gaussian distribution used to normalize that activation. Among other benefits, this will hopefully contribute to the democratization of AI research by lowering the hardware requirements for training larger models.
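To make the idea concrete, below is a minimal sketch in PyTorch of a normalization layer along these lines. It is my own illustration under stated assumptions, not the paper's reference implementation: the class name BatchlessNorm, the log-sigma parameterization, the detaching choice, and the usage snippet are all assumptions for illustration. The layer keeps learned per-activation parameters mu and sigma, normalizes each activation with them, and exposes a Gaussian negative-log-likelihood term to be added to the task loss, so that mu and sigma are trained by ordinary gradient descent, one instance at a time.

```python
import torch
import torch.nn as nn

class BatchlessNorm(nn.Module):
    """Sketch of batchless normalization: each activation is normalized by a
    learned Gaussian (mu, sigma) whose parameters are trained by adding a
    negative-log-likelihood term to the loss, so no batch statistics are needed."""

    def __init__(self, num_features: int, eps: float = 1e-5):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_features))
        # Parameterize sigma via its log so it stays positive under gradient descent.
        self.log_sigma = nn.Parameter(torch.zeros(num_features))
        self.eps = eps
        self.nll = torch.tensor(0.0)  # refreshed on every forward pass

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sigma = self.log_sigma.exp() + self.eps
        # Negative log likelihood of the activations under N(mu, sigma^2), up to
        # an additive constant. Detaching x is an assumption of this sketch: the
        # NLL term then trains only the distribution parameters mu and sigma.
        self.nll = (self.log_sigma + 0.5 * ((x.detach() - self.mu) / sigma) ** 2).mean()
        # Normalize with the learned statistics instead of batch statistics,
        # so instances can be processed one at a time.
        return (x - self.mu) / sigma

# Hypothetical usage: add the NLL terms of all normalization layers to the task
# loss, then back-propagate as usual; a batch size of one suffices.
# loss = task_loss(model(x), target) + sum(m.nll for m in model.modules()
#                                          if isinstance(m, BatchlessNorm))
```

In a sketch like this, mu and sigma are ordinary parameters updated by the same optimizer as the weights, so they would need none of the special treatment that batch normalization requires, and no batch statistics ever have to be held in memory.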