Paper Title
Sequential Gradient Coding For Straggler Mitigation
Paper Authors
Paper Abstract
In distributed computing, slower nodes (stragglers) usually become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers. In this paper, we consider the distributed computation of a sequence of gradients $\{g(1),g(2),\ldots,g(J)\}$, where processing of each gradient $g(t)$ starts in round-$t$ and finishes by round-$(t+T)$. Here $T\geq 0$ denotes a delay parameter. For the GC scheme, coding is only across computing nodes, and this results in a solution where $T=0$. On the other hand, having $T>0$ allows for designing schemes which exploit the temporal dimension as well. In this work, we propose two schemes that demonstrate improved performance compared to GC. Our first scheme combines GC with selective repetition of previously unfinished tasks and achieves improved straggler mitigation. In our second scheme, which constitutes our main contribution, we apply GC to a subset of the tasks and use repetition for the remaining tasks. We then multiplex these two classes of tasks across workers and rounds in an adaptive manner, based on past straggler patterns. Using theoretical analysis, we demonstrate that our second scheme achieves a significant reduction in the computational load. In our experiments, we study a practical setting of concurrently training multiple neural networks over an AWS Lambda cluster involving 256 worker nodes, where our framework naturally applies. We demonstrate that the latter scheme can yield a 16\% improvement in runtime over the baseline GC scheme, in the presence of naturally occurring, non-simulated stragglers.
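For readers unfamiliar with GC, the sketch below illustrates the fractional-repetition variant of gradient coding from Tandon et al., which the schemes in the abstract build on: each group of $s+1$ workers redundantly computes the same partial gradient sum, so the master can recover the full gradient from any $n-s$ finished workers. This is a minimal illustrative sketch, not code from the paper; the NumPy setup, function names, and toy parameters are assumptions made for demonstration.

```python
import numpy as np

def assign_partitions(n_workers, s):
    """Fractional-repetition assignment (Tandon et al.): split the n workers
    into n/(s+1) groups; every worker in group j computes the same s+1 data
    partitions, so each group's partial gradient survives up to s stragglers."""
    assert n_workers % (s + 1) == 0
    return {w: list(range((w // (s + 1)) * (s + 1), (w // (s + 1) + 1) * (s + 1)))
            for w in range(n_workers)}

def decode(finished_workers, partial_sums, n_workers, s):
    """Recover the full gradient sum from any n-s finished workers by picking
    one surviving worker per group and adding up their partial sums."""
    survivors = {}
    for w in finished_workers:
        survivors.setdefault(w // (s + 1), w)      # keep first survivor per group
    assert len(survivors) == n_workers // (s + 1), "more than s stragglers"
    return sum(partial_sums[w] for w in survivors.values())

# Toy run: n = 6 workers, tolerate s = 2 stragglers, 6 data partitions.
n, s = 6, 2
grads = [np.random.randn(4) for _ in range(n)]         # per-partition gradients
partial = {w: sum(grads[p] for p in parts)
           for w, parts in assign_partitions(n, s).items()}
finished = [0, 3, 4, 5]                                 # workers 1 and 2 straggle
assert np.allclose(decode(finished, partial, n, s), sum(grads))
```

In this baseline, every worker carries a load of $s+1$ out of $n$ partitions in every round; the paper's contribution is to reduce this load by also coding across rounds (the delay parameter $T$) and by adapting task placement to past straggler patterns.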