Paper title
Quantizing data for distributed learning
Paper authors
Paper abstract
We consider machine learning applications that train a model by leveraging data distributed over a trusted network, where communication constraints can create a performance bottleneck. A number of recent approaches propose to overcome this bottleneck by compressing gradient updates. However, as models become larger, so does the size of the gradient updates. In this paper, we propose an alternative approach that learns from distributed data by quantizing data instead of gradients, and can support learning in applications where the size of gradient updates is prohibitive. Our approach leverages the dependency of the computed gradient on the data samples, which lie in a much smaller space, in order to perform the quantization in the smaller-dimensional data space. At the cost of an extra gradient computation, the gradient estimate can be refined by conveying the difference between the gradient at the quantized data point and the original gradient using a small number of bits. Lastly, to save communication, our approach adds a layer that decides whether or not to transmit a quantized data sample based on its importance for learning. We analyze the convergence of the proposed approach for smooth convex and non-convex objective functions and show that we can achieve order-optimal convergence rates with communication that depends mostly on the data dimension rather than the model (gradient) dimension. We use the proposed algorithm to train ResNet models on the CIFAR-10 and ImageNet datasets, and show that we can achieve an order of magnitude savings over gradient compression methods. These communication savings come at the cost of increased computation at the learning agent, and thus our approach is beneficial in scenarios where communication load is the main problem.
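To make the mechanism in the abstract concrete, below is a minimal sketch (not the authors' released implementation) of the three ingredients it describes: the agent quantizes a data sample rather than the gradient, the server recomputes the gradient from the quantized sample, and the agent optionally sends a coarsely quantized correction between its true gradient and the gradient at the quantized sample, gated by a simple importance test. The random-feature model, the uniform quantizer, the bit-widths, and the norm-based gate are all illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)
d_data, d_model = 32, 4096          # data dimension << model (gradient) dimension
P = rng.normal(size=(d_model, d_data)) / np.sqrt(d_data)  # fixed feature map, shared by agent and server

def features(x):
    return np.tanh(P @ x)

def grad(w, x, y):
    """Gradient of 0.5 * (w^T phi(x) - y)^2 with respect to w; lives in d_model."""
    f = features(x)
    return (w @ f - y) * f

def uniform_quantize(v, bits, lo, hi):
    """Elementwise b-bit uniform quantizer on [lo, hi]."""
    levels = 2 ** bits - 1
    idx = np.round((np.clip(v, lo, hi) - lo) / (hi - lo) * levels)
    return lo + idx * (hi - lo) / levels

# --- learning agent side ------------------------------------------------------
w = rng.normal(size=d_model) * 0.01
x, y = rng.uniform(-1, 1, size=d_data), 1.0

bits_data, bits_corr = 8, 1          # many bits for the small data vector, 1 bit/coord for the correction
x_q = uniform_quantize(x, bits_data, -1.0, 1.0)       # costs d_data * bits_data bits

g_true = grad(w, x, y)               # the "extra gradient computation" at the agent
g_at_xq = grad(w, x_q, y)
residual = g_true - g_at_xq
scale = np.max(np.abs(residual)) + 1e-12
correction = uniform_quantize(residual, bits_corr, -scale, scale)
# (the scalar `scale` would also be transmitted alongside the 1-bit codes)

# importance gate: only transmit samples whose gradient is informative (threshold is a heuristic here)
transmit = np.linalg.norm(g_true) > 1e-3

# --- server side --------------------------------------------------------------
if transmit:
    # the server recomputes the gradient from the quantized data, then applies the coarse refinement
    g_est = grad(w, x_q, y) + correction
    print("relative gradient error:",
          np.linalg.norm(g_est - g_true) / np.linalg.norm(g_true))
```

In this toy setup the transmitted payload scales with the data dimension (32 coordinates at 8 bits) plus one bit per gradient coordinate for the refinement, rather than with full-precision gradients of dimension 4096, which is the trade-off the abstract highlights: less communication in exchange for extra gradient computations.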