Paper title
Machine Learning for CUDA+MPI Design Rules
Paper authors
Paper abstract
We present a new strategy for automatically exploring the design space of key CUDA+MPI programs and providing design rules that distinguish slow from fast implementations. In such programs, the order of operations (e.g., GPU kernels, MPI communication) and the assignment of operations to resources (e.g., GPU streams) make the space of possible designs enormous. Systems experts have the task of redesigning and reoptimizing these programs to effectively utilize each new platform. This work provides a prototype tool to reduce that burden. In our approach, a directed acyclic graph of CUDA and MPI operations defines the design space for the program. Monte Carlo tree search discovers regions of the design space that have a large impact on the program's performance. A sequence-to-vector transformation defines features for each explored implementation, and each implementation is assigned a class label according to its relative performance. A decision tree is trained on the features and labels to produce design rules for each class; these rules can be used by systems experts to guide their implementations. We demonstrate our strategy using a key kernel from scientific computing, sparse matrix-vector multiplication, on a platform with multiple MPI ranks and GPU streams.
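To make the pipeline in the abstract concrete, here is a minimal Python sketch of its four stages on a toy problem. It is not the authors' tool, and everything in it is an assumption made for illustration: the operation names (`pack`, `send`, `recv`, `local_spmv`, `remote_spmv`), the dependency graph, and the cost model are hypothetical. Two stages are also swapped for simpler stand-ins: brute-force enumeration of valid orderings replaces Monte Carlo tree search, and a single-split decision stump replaces a full decision tree. Stream assignment is left out entirely.

```python
from itertools import permutations

# Hypothetical toy DAG: each op maps to the set of ops that must precede it.
DEPS = {
    "pack": set(),
    "send": {"pack"},
    "recv": {"send"},
    "remote_spmv": {"recv"},
    "local_spmv": set(),
}
OPS = sorted(DEPS)


def valid_orders(deps):
    """Yield every ordering that respects the DAG (brute force, toy sizes only)."""
    ops = sorted(deps)
    for perm in permutations(ops):
        pos = {op: i for i, op in enumerate(perm)}
        if all(pos[d] < pos[op] for op in ops for d in deps[op]):
            yield perm


def features(order):
    """Sequence-to-vector: one boolean per op pair (a, b), true if a runs before b."""
    pos = {op: i for i, op in enumerate(order)}
    return {
        (a, b): pos[a] < pos[b]
        for i, a in enumerate(OPS)
        for b in OPS[i + 1:]
    }


def toy_cost(order):
    """Made-up cost model: starting the send before local compute hides latency."""
    pos = {op: i for i, op in enumerate(order)}
    return 1.0 if pos["send"] < pos["local_spmv"] else 2.0


def best_stump(samples):
    """Find the single rule 'feature == value implies fast' with the highest accuracy."""
    best = None
    for pair in samples[0][0]:
        for value in (True, False):
            hits = sum(
                (feats[pair] == value) == (label == "fast")
                for feats, label in samples
            )
            acc = hits / len(samples)
            if best is None or acc > best[2]:
                best = (pair, value, acc)
    return best


orders = list(valid_orders(DEPS))
fastest = min(toy_cost(o) for o in orders)
# Class label by relative performance: "fast" if the order matches the best cost.
samples = [
    (features(o), "fast" if toy_cost(o) == fastest else "slow") for o in orders
]
rule = best_stump(samples)
(a, b), value, acc = rule
relation = "before" if value else "after"
print(f"fast if {a} runs {relation} {b} (accuracy {acc:.2f})")
```

The pairwise "a before b" encoding is one plausible sequence-to-vector transformation, chosen here because each learned split reads directly as an ordering rule a systems expert could act on. In a real setting the cost would come from timing each implementation on the target platform rather than from a formula.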