Paper Title
Dynamic backup workers for parallel machine learning
Paper Authors
Paper Abstract
The most popular framework for distributed training of machine learning models is the (synchronous) parameter server (PS). This paradigm consists of $n$ workers, which iteratively compute updates to the model parameters, and a stateful PS, which waits for and aggregates all updates to generate a new estimate of the model parameters and sends it back to the workers for a new iteration. Transient computation slowdowns or transmission delays can intolerably lengthen each iteration. An efficient way to mitigate this problem is to let the PS wait only for the fastest $n-b$ updates before generating the new parameters. The slowest $b$ workers are called backup workers. The optimal number of backup workers $b$ depends on the cluster configuration and workload, but also (as we show in this paper) on the hyper-parameters of the learning algorithm and the current stage of the training. We propose DBW, an algorithm that dynamically decides the number of backup workers during the training process to maximize the convergence speed at each iteration. Our experiments show that DBW 1) removes the need to tune $b$ through preliminary, time-consuming experiments, and 2) makes training up to a factor of $3$ faster than the optimal static configuration.
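As a rough illustration of the mechanism described in the abstract, the sketch below shows a synchronous PS step that aggregates only the fastest $n-b$ gradient updates per iteration. This is not the authors' DBW algorithm; the worker function `compute_gradient`, the simulated delays, and the values of `n`, `b`, and `lr` are assumptions made purely for illustration.

```python
# Minimal sketch of a synchronous parameter-server step with backup workers:
# the PS aggregates only the fastest n - b gradient updates per iteration.
# Not the authors' DBW implementation; worker gradients and delays are simulated.
import concurrent.futures
import random
import time

import numpy as np

n, b, lr = 8, 2, 0.1            # workers, backup workers, learning rate (assumed values)
params = np.zeros(4)            # model parameters held by the PS

def compute_gradient(worker_id, params):
    """Simulated worker: a random delay stands in for compute/transmission time."""
    time.sleep(random.uniform(0.01, 0.1))
    return np.random.randn(*params.shape)   # placeholder stochastic gradient

with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
    for iteration in range(5):
        futures = [pool.submit(compute_gradient, w, params) for w in range(n)]
        # Collect only the fastest n - b updates; the slowest b workers are the backups.
        fastest = []
        for fut in concurrent.futures.as_completed(futures):
            fastest.append(fut.result())
            if len(fastest) == n - b:
                break
        params = params - lr * np.mean(fastest, axis=0)   # aggregate and update
        print(f"iteration {iteration}: aggregated {len(fastest)}/{n} updates")
```

A dynamic scheme such as DBW would adjust $b$ between iterations rather than keeping it fixed throughout training.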