Paper Title

Scheduling to Optimize Sojourn Time of Successful Jobs

Authors

Yuan Yao, Marco Paolieri, Leana Golubchik

Abstract

Deep neural network training jobs and other iterative computations frequently include checkpoints where jobs can be canceled based on the current value of monitored metrics. While most existing results focus on the performance of all jobs (both successfully completed and canceled), in this work we explore scheduling policies that improve the sojourn time of successful jobs, which are typically more valuable to the user. Our model assumes that each job has a known discrete size distribution (e.g., estimated from previous execution logs) where the largest size value indicates a successful completion, while other size values correspond to termination checkpoints. In the single-server case where all jobs are available for scheduling simultaneously, we prove that optimal schedules do not preempt jobs, even when preemption overhead is negligible. Based on this, we develop a scheduling policy that minimizes the sojourn time of successful jobs asymptotically, i.e., when the number of jobs grows to infinity. Through an extensive numerical study, we show that this policy performs better than existing alternatives even when the number of jobs is finite. For more realistic scenarios with multiple servers and dynamic job arrivals, we propose an online approach based on our single-server scheduling policy. Through an extensive simulation study using real-world traces, we demonstrate that this online approach results in better average sojourn time for successful jobs compared to existing techniques.
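
The job model in the abstract can be made concrete with a small sketch. The code below is an illustration only, not the policy developed in the paper: it assumes independent job sizes, represents each job as a hypothetical list of (size, probability) pairs, scores a non-preemptive order by the expected sum of sojourn times of successful jobs (an assumed objective), and finds the best order by brute force, which is only feasible for a handful of jobs. All function and variable names are invented for this example.

```python
from itertools import permutations

# Assumed representation: a job is a list of (size, probability) pairs.
# The largest size is a successful completion; smaller sizes are
# checkpoints at which the job is canceled.

def mean_size(job):
    """Expected time the job occupies the server."""
    return sum(size * prob for size, prob in job)

def success_outcome(job):
    """Return (full_size, success_probability) for the largest size."""
    return max(job)  # tuples compare by size first

def expected_successful_sojourn(order):
    """Expected sum of sojourn times of successful jobs for a
    non-preemptive order on one server, assuming independent sizes."""
    elapsed = 0.0  # expected time consumed by earlier jobs
    total = 0.0
    for job in order:
        full_size, p_success = success_outcome(job)
        # A job contributes only if it succeeds, in which case it
        # finishes after the earlier jobs plus its own full size.
        total += p_success * (elapsed + full_size)
        elapsed += mean_size(job)
    return total

def best_order(jobs):
    """Brute-force search over all non-preemptive orders."""
    return min(permutations(jobs), key=expected_successful_sojourn)

# Hypothetical jobs: job_a is canceled at size 1 half of the time,
# job_b always completes at size 2.
job_a = [(1, 0.5), (4, 0.5)]
job_b = [(2, 1.0)]

print(expected_successful_sojourn([job_a, job_b]))  # 6.5
print(expected_successful_sojourn([job_b, job_a]))  # 5.0
```

In this toy instance, running the job that always succeeds first gives the lower expected value (5.0 versus 6.5), because the uncertain job then delays no successful completion other than its own.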
