Paper Title
Federated Optimization Algorithms with Random Reshuffling and Gradient Compression
Paper Authors
Paper Abstract
Gradient compression is a popular technique for improving the communication complexity of stochastic first-order methods in distributed training of machine learning models. However, existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well known in practice, and was recently confirmed in theory, that stochastic methods based on without-replacement sampling, e.g., the Random Reshuffling (RR) method, perform better than those that sample the gradients with replacement. In this work, we close this gap in the literature and provide the first analysis of methods that combine gradient compression with without-replacement sampling. We first develop a naïve combination of random reshuffling with gradient compression (Q-RR). Perhaps surprisingly, the theoretical analysis of Q-RR does not show any benefit of using RR, and our extensive numerical experiments confirm this phenomenon; it is caused by the additional compression variance. To reveal the true advantages of RR in distributed learning with compression, we propose a new method called DIANA-RR that reduces the compression variance and has provably better convergence rates than existing counterparts that use with-replacement sampling of stochastic gradients. Next, to better fit Federated Learning applications, we incorporate local computation: we propose and analyze variants of Q-RR and DIANA-RR, called Q-NASTYA and DIANA-NASTYA, that use local gradient steps and different local and global stepsizes. Finally, we conduct several numerical experiments to illustrate our theoretical results.
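To make the naïve Q-RR combination described in the abstract concrete, below is a minimal NumPy sketch of one possible instantiation (not the authors' pseudocode): each worker reshuffles its local data once per epoch, compresses its single-sample gradient with an unbiased Rand-k sparsifier, and the server averages the compressed messages and takes a step. The helper names `rand_k_compress` and `qrr`, the least-squares objective, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def rand_k_compress(v, k):
    """Unbiased Rand-k sparsification: keep k random coordinates, rescale by d/k."""
    d = v.shape[0]
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = v[idx] * (d / k)
    return out


def qrr(A_workers, b_workers, x0, stepsize, epochs, k):
    """Sketch of naive Q-RR: per-epoch reshuffling on each worker plus
    compressed single-sample gradients averaged by the server."""
    x = x0.copy()
    n = len(A_workers)            # number of workers
    m = A_workers[0].shape[0]     # local samples per worker
    for _ in range(epochs):
        perms = [rng.permutation(m) for _ in range(n)]  # without-replacement order
        for t in range(m):
            msgs = []
            for i in range(n):
                a = A_workers[i][perms[i][t]]
                bi = b_workers[i][perms[i][t]]
                grad = (a @ x - bi) * a          # single-sample least-squares gradient
                msgs.append(rand_k_compress(grad, k))
            x -= stepsize * np.mean(msgs, axis=0)  # server step on averaged messages
    return x


# toy problem: 4 workers, 32 local samples each, dimension 20
d, n, m = 20, 4, 32
x_star = rng.normal(size=d)
A_workers = [rng.normal(size=(m, d)) for _ in range(n)]
b_workers = [A @ x_star for A in A_workers]
x_hat = qrr(A_workers, b_workers, np.zeros(d), stepsize=1e-2, epochs=50, k=5)
print("distance to solution:", np.linalg.norm(x_hat - x_star))
```

Under this reading, the DIANA-RR modification described in the abstract would instead compress the difference between the local gradient and a learned shift vector (updating the shift from the compressed message), which reduces the compression variance that erases the benefit of RR in plain Q-RR; Q-NASTYA and DIANA-NASTYA would additionally perform local gradient steps with separate local and global stepsizes between communications.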