Paper Title


DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference

Authors

Gupta, Udit, Hsia, Samuel, Saraph, Vikram, Wang, Xiaodong, Reagen, Brandon, Wei, Gu-Yeon, Lee, Hsien-Hsin S., Brooks, David, Wu, Carole-Jean

Abstract


Neural personalized recommendation is the corner-stone of a wide collection of cloud services and products, constituting significant compute demand of the cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity saving. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in at-scale production datacenter shows over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.
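The abstract's core idea, maximizing latency-bounded throughput by adapting scheduling decisions to query characteristics, can be illustrated with a minimal sketch. This is not the paper's actual DeepRecSched implementation; the linear latency model (`fixed_ms`, `per_item_ms`) and the function names are hypothetical stand-ins for measured hardware behavior, used only to show why there is an optimal batch size under a tail-latency SLA.

```python
# Illustrative sketch: pick the per-request batch size that maximizes
# throughput (items per ms) while keeping latency within an SLA target.
# The latency model below is an assumed placeholder, not measured data.

def latency_ms(batch_size, fixed_ms=2.0, per_item_ms=0.5):
    """Hypothetical inference latency: fixed overhead plus per-item cost."""
    return fixed_ms + per_item_ms * batch_size

def best_batch_under_sla(sla_ms, max_batch=256):
    """Return the (batch size, throughput) maximizing items/ms within the SLA."""
    best, best_tput = None, 0.0
    for b in range(1, max_batch + 1):
        lat = latency_ms(b)
        if lat > sla_ms:
            break  # latency grows monotonically with batch size here
        tput = b / lat
        if tput > best_tput:
            best, best_tput = b, tput
    return best, best_tput

if __name__ == "__main__":
    batch, tput = best_batch_under_sla(sla_ms=20.0)
    print(batch, round(tput, 3))  # largest SLA-feasible batch wins
```

Under this toy model, throughput rises with batch size because the fixed overhead is amortized, so the scheduler pushes the batch to the largest size the latency target allows; the paper's contribution is making that trade-off dynamically against real query-size and arrival-time distributions and heterogeneous hardware.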
