Paper Title

Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters

Authors

Qiong Wu, Zhenming Liu

Abstract

Large-scale interactive web services and advanced AI applications make sophisticated decisions in real time by executing a massive number of computation tasks on thousands of servers. Task schedulers, which often operate in heterogeneous and volatile environments, require high throughput, i.e., scheduling millions of tasks per second, and low latency, i.e., incurring minimal scheduling delays for millisecond-level tasks. Scheduling is further complicated by other users' workloads in a shared system, other background activities, and the diverse hardware configurations inside datacenters. We present Rosella, a new self-driving, distributed approach for task scheduling in heterogeneous clusters. Rosella automatically learns the compute environment and adjusts its scheduling policy in real time. The solution provides high throughput and low latency simultaneously because it runs in parallel on multiple machines with minimal coordination and performs only simple operations for each scheduling decision. Our learning module monitors total system load and uses this information to dynamically determine the optimal strategy for estimating the backends' compute power. Rosella generalizes power-of-two-choices algorithms to handle heterogeneous workers, reducing the maximum queue length from the O(log n) obtained by prior algorithms to O(log log n). We evaluate Rosella with a variety of workloads on a 32-node AWS cluster. Experimental results show that Rosella significantly reduces task response time and adapts quickly to changes in the environment.
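
To make the power-of-two-choices idea in the abstract concrete, here is a minimal Python sketch of one heterogeneity-aware variant. It is an illustration under stated assumptions, not Rosella's actual algorithm: the abstract does not specify the sampling or tie-breaking rules, so the speed-proportional sampling, the names `pick_worker` and `simulate_placements`, and the `speeds` estimates are all hypothetical.

```python
import random
from typing import List, Optional


def pick_worker(queue_lengths: List[int],
                speeds: List[float],
                rng: Optional[random.Random] = None) -> int:
    """Return the index of the worker that should receive the next task.

    Assumption (not taken from the paper): two candidates are sampled with
    probability proportional to their estimated speed, and the task goes
    to whichever candidate currently has the shorter queue.
    """
    rng = rng or random.Random()
    indices = range(len(queue_lengths))
    # Sample two candidates (with replacement), weighted by estimated speed.
    a, b = rng.choices(indices, weights=speeds, k=2)
    # Classic power-of-two-choices step: join the shorter of the two queues.
    return a if queue_lengths[a] <= queue_lengths[b] else b


def simulate_placements(num_tasks: int,
                        speeds: List[float],
                        seed: int = 0) -> List[int]:
    """Place num_tasks tasks one by one, with no departures, and return
    the resulting queue lengths. This is a toy model for illustration."""
    rng = random.Random(seed)
    queues = [0] * len(speeds)
    for _ in range(num_tasks):
        queues[pick_worker(queues, speeds, rng)] += 1
    return queues


if __name__ == "__main__":
    # Hypothetical cluster: two fast workers and six slow ones.
    speeds = [4.0, 4.0] + [1.0] * 6
    print(simulate_placements(1000, speeds))
```

Each decision inspects only two queues, which is why schemes of this kind can run on many scheduler instances in parallel with little coordination. A real scheduler would also have to model task departures and learn the `speeds` estimates online, which this sketch leaves out.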
