Paper Title

A Practical Layer-Parallel Training Algorithm for Residual Networks

Authors

Qi Sun, Hexin Dong, Zewei Chen, Weizhen Dian, Jiacheng Sun, Yitong Sun, Zhenguo Li, Bin Dong

Abstract

Gradient-based algorithms for training ResNets typically require a forward pass of the input data, followed by back-propagating the objective gradient to update parameters, which is time-consuming for deep ResNets. To break the dependencies between modules in both the forward and backward modes, auxiliary-variable methods such as the penalty and augmented Lagrangian (AL) approaches have recently attracted much interest due to their ability to exploit layer-wise parallelism. However, we observe that large communication overhead and a lack of data augmentation are two key challenges of these methods, which may lead to a low speedup ratio and an accuracy drop across multiple compute devices. Inspired by the optimal control formulation of ResNets, we propose a novel serial-parallel hybrid training strategy that enables the use of data augmentation, together with downsampling filters to reduce the communication cost. The proposed strategy first trains the network parameters by solving a succession of independent sub-problems in parallel and then corrects them through a full serial forward-backward propagation of data. Such a strategy can be applied to most existing layer-parallel training methods that use auxiliary variables. As an example, we validate the proposed strategy using the penalty and AL methods on ResNet and WideResNet across the MNIST, CIFAR-10, and CIFAR-100 datasets, achieving significant speedup over traditional layer-serial training methods while maintaining comparable accuracy.
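To make the hybrid strategy concrete, below is a minimal PyTorch sketch of a penalty-based variant on a toy residual network. The stage split, the penalty weight `rho`, and all names (`ResBlock`, `parallel_phase`, `serial_correction`) are illustrative assumptions, not the authors' implementation; in particular, the downsampling filters and the AL variant described in the abstract are omitted. The key point it shows: once the auxiliary activations are frozen, each stage's penalized sub-problem decouples from the others and can be solved in parallel, after which a full serial pass corrects the parameters and refreshes the auxiliary variables.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual stage: x -> x + f(x). Toy stand-in for a ResNet block."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.relu(self.fc(x))

K, dim, n_cls = 4, 32, 10                        # illustrative sizes
stages = nn.ModuleList(ResBlock(dim) for _ in range(K))
head = nn.Linear(dim, n_cls)
criterion = nn.CrossEntropyLoss()

def parallel_phase(y, aux, rho=1.0, lr=0.1):
    # Penalty phase: with the auxiliary activations `aux` frozen, each stage's
    # sub-problem  min_{W_k} rho * ||F_k(u_k; W_k) - u_{k+1}||^2  is independent
    # of the others, so the K updates below could run concurrently on K devices.
    for k in range(K):
        params = list(stages[k].parameters())
        if k == K - 1:
            params += list(head.parameters())    # last stage also trains the head
        opt = torch.optim.SGD(params, lr=lr)
        opt.zero_grad()
        pred = stages[k](aux[k])                 # stage k only sees its own input
        if k < K - 1:
            loss = rho * (pred - aux[k + 1]).pow(2).mean()  # coupling penalty
        else:
            loss = criterion(head(pred), y)      # last stage carries the data loss
        loss.backward()
        opt.step()

def serial_correction(x, y, lr=0.1):
    # Correction phase: one full forward-backward pass through the whole net.
    # This is also where fresh (possibly augmented) batches would enter, and it
    # refreshes the auxiliary variables for the next parallel phase.
    params = list(stages.parameters()) + list(head.parameters())
    opt = torch.optim.SGD(params, lr=lr)
    opt.zero_grad()
    aux, h = [x], x
    for blk in stages:
        h = blk(h)
        aux.append(h.detach())                   # snapshot each stage's output
    criterion(head(h), y).backward()
    opt.step()
    return aux

x = torch.randn(8, dim)
y = torch.randint(0, n_cls, (8,))
aux = serial_correction(x, y)                    # initialize auxiliary variables
for _ in range(5):
    parallel_phase(y, aux)                       # layer-parallel sub-problems
    aux = serial_correction(x, y)                # serial correction pass
```

Note the role of `detach()` in `serial_correction`: the snapshots cut the computation graph between stages, which is what makes each sub-problem in `parallel_phase` depend only on its own stage's parameters.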
