Paper Title
Distributed Graph Neural Network Training: A Survey
Paper Authors
Paper Abstract
Graph neural networks (GNNs) are a class of deep learning models that are trained on graphs and have been successfully applied in various domains. Despite their effectiveness, it remains challenging for GNNs to scale efficiently to large graphs. As a remedy, distributed computing is a promising solution for training large-scale GNNs, since it can provide abundant computing resources. However, the dependencies imposed by the graph structure make high-efficiency distributed GNN training difficult to achieve, as training suffers from massive communication and workload imbalance. In recent years, many efforts have been devoted to distributed GNN training, and an array of training algorithms and systems have been proposed. Yet there is a lack of a systematic review of the optimization techniques for the distributed execution of GNN training. In this survey, we analyze three major challenges in distributed GNN training: massive feature communication, loss of model accuracy, and workload imbalance. We then introduce a new taxonomy of the optimization techniques in distributed GNN training that address these challenges. The taxonomy classifies existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. We carefully discuss the techniques in each category. Finally, we summarize existing distributed GNN systems for multi-GPU, GPU-cluster, and CPU-cluster platforms, respectively, and discuss future directions for distributed GNN training.