Paper Title
On the Outsized Importance of Learning Rates in Local Update Methods
Paper Authors
Paper Abstract
We study a family of algorithms, which we refer to as local update methods, that generalize many federated learning and meta-learning algorithms. We prove that for quadratic objectives, local update methods perform stochastic gradient descent on a surrogate loss function which we exactly characterize. We show that the choice of client learning rate controls the condition number of that surrogate loss, as well as the distance between the minimizers of the surrogate and true loss functions. We use this theory to derive novel convergence rates for federated averaging that showcase this trade-off between the condition number of the surrogate loss and its alignment with the true loss function. We validate our results empirically, showing that in communication-limited settings, proper learning rate tuning is often sufficient to reach near-optimal behavior. We also present a practical method for automatic learning rate decay in local update methods that helps reduce the need for learning rate tuning, and highlight its empirical performance on a variety of tasks and datasets.
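To make the setup concrete, below is a minimal sketch of a local update method in the FedAvg style on quadratic client objectives, which is the setting the abstract analyzes. This is an illustrative toy, not the paper's exact algorithm or its surrogate-loss characterization; names such as client_lr, num_local_steps, and fedavg_round are hypothetical choices introduced here for illustration.

```python
# Minimal sketch of a local update method (FedAvg-style) on quadratic
# client objectives f_i(w) = 0.5 * (w - c_i)^T A_i (w - c_i).
# Illustrative only; client_lr and num_local_steps are assumed names.
import numpy as np

def local_steps(w, A, c, client_lr, num_local_steps):
    """Run full-gradient local updates on one client's quadratic objective."""
    for _ in range(num_local_steps):
        grad = A @ (w - c)              # gradient of 0.5*(w-c)^T A (w-c)
        w = w - client_lr * grad
    return w

def fedavg_round(w_global, clients, client_lr, num_local_steps):
    """One communication round: each client updates locally, server averages."""
    updates = [local_steps(w_global.copy(), A, c, client_lr, num_local_steps)
               for (A, c) in clients]
    return np.mean(updates, axis=0)

# Toy example: two clients with different quadratic minimizers.
rng = np.random.default_rng(0)
d = 5
clients = []
for _ in range(2):
    M = rng.normal(size=(d, d))
    A = M @ M.T + np.eye(d)             # positive-definite curvature
    c = rng.normal(size=d)              # this client's minimizer
    clients.append((A, c))

w = np.zeros(d)
for _ in range(20):
    w = fedavg_round(w, clients, client_lr=0.05, num_local_steps=10)
```

In this sketch, varying client_lr and num_local_steps changes the fixed point that the averaged iterates approach, which is one way to see informally why the client learning rate matters for how well the implicit surrogate objective aligns with the true loss.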