Paper Title

PICASSO: Unleashing the Potential of GPU-centric Training for Wide-and-deep Recommender Systems

Authors

Yuanxing Zhang, Langshi Chen, Siran Yang, Man Yuan, Huimin Yi, Jie Zhang, Jiamang Wang, Jianbo Dong, Yunlong Xu, Yue Song, Yong Li, Di Zhang, Wei Lin, Lin Qu, Bo Zheng

Abstract

The development of personalized recommendation has significantly improved the accuracy of information matching and the revenue of e-commerce platforms. Two trends have emerged recently: 1) recommender systems must be trained in a timely manner to cope with ever-growing new products and ever-changing user interests from online marketing and social networks; 2) SOTA recommendation models introduce DNN modules to improve prediction accuracy. Traditional CPU-based recommender systems cannot keep up with these two trends, and GPU-centric training has become a popular approach. However, we observe that GPU devices are underutilized when training recommender systems, and that they do not attain the throughput improvement that has been achieved in the CV and NLP areas. This issue can be explained by two characteristics of these recommendation models. First, they contain up to a thousand input feature fields, which introduces fragmentary and memory-intensive operations. Second, the multiple constituent feature interaction submodules introduce a substantial number of small compute kernels. To remove this roadblock to the development of recommender systems, we propose a novel framework named PICASSO to accelerate the training of recommendation models on commodity hardware. Specifically, we conduct a systematic analysis to reveal the bottlenecks encountered in training recommendation models. We leverage the model structure and data distribution to unleash the potential of the hardware through our packing, interleaving, and caching optimizations. Experiments show that PICASSO increases hardware utilization by an order of magnitude over SOTA baselines and brings up to 6x throughput improvement for a variety of industrial recommendation models. Using the same hardware budget in production, PICASSO on average shortens the walltime of daily training tasks by 7 hours, significantly reducing the delay of continuous delivery.
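The abstract does not describe how the packing optimization is implemented. As a rough illustration of the general idea only, the sketch below shows how many small per-field embedding lookups can be replaced by one lookup over a concatenated table with row offsets. All names here (`pack_tables`, `lookup_packed`, the toy tables) are hypothetical and are not taken from PICASSO; plain Python lists stand in for GPU tensors, where the packed lookup would correspond to a single gather kernel instead of one kernel per field.

```python
from itertools import accumulate


def lookup_per_field(tables, ids):
    """Baseline: one small lookup per feature field."""
    return [tables[f][i] for f, i in enumerate(ids)]


def pack_tables(tables):
    """Concatenate per-field tables into one table and record row offsets."""
    sizes = [len(t) for t in tables]
    # Offset of field f = total number of rows in all earlier fields.
    offsets = [0] + list(accumulate(sizes))[:-1]
    packed = [row for t in tables for row in t]
    return packed, offsets


def lookup_packed(packed, offsets, ids):
    """Shift each field-local id by its offset, then gather from one table."""
    flat_ids = [off + i for off, i in zip(offsets, ids)]
    return [packed[j] for j in flat_ids]


# Toy data (assumed for illustration): three feature fields, embedding size 2.
tables = [
    [[0.1, 0.2], [0.3, 0.4]],              # field 0: 2 rows
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],  # field 1: 3 rows
    [[2.0, 2.1]],                          # field 2: 1 row
]
ids = [1, 2, 0]  # one id per field for a single sample

packed, offsets = pack_tables(tables)
# The packed lookup returns the same embeddings as the per-field lookups.
assert lookup_packed(packed, offsets, ids) == lookup_per_field(tables, ids)
```

With up to a thousand feature fields, the per-field version would issue on the order of a thousand tiny operations per step, while the packed version issues one larger operation, which is the kind of fragmentation the abstract identifies as a cause of low GPU utilization.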
