Paper Title

Multi-GPU Performance Optimization of a CFD Code using OpenACC on Different Platforms

Paper Authors

Weicheng Xue, Christopher J. Roy

Paper Abstract

This paper investigates the multi-GPU performance of a 3D buoyancy-driven cavity solver using MPI and OpenACC directives on different platforms. The paper shows that decomposing the total problem in different dimensions affects the strong scaling performance significantly for the GPU. Without proper performance optimizations, 1D domain decomposition is shown to scale poorly on multiple GPUs due to noncontiguous memory access. Regardless of the decomposition used, the performance can benefit from the series of optimizations presented in the paper. Since the buoyancy-driven cavity code is latency-bound on the clusters examined, a series of optimizations, both platform-agnostic and platform-specific, are designed to reduce the latency cost and improve memory throughput between hosts and devices efficiently. First, a parallel message packing/unpacking strategy developed for noncontiguous data movement between hosts and devices improves the overall performance by about a factor of 2. Second, transferring different amounts of data based on the stencil size of each variable further reduces the communication overhead. These two optimizations are general enough to benefit stencil computations with ghost exchanges on all of the clusters tested. Third, GPUDirect is used to improve communication on clusters that have hardware and software support for direct communication between GPUs without staging through the CPU's memory. Finally, overlapping communication and computation is shown to be inefficient on multiple GPUs when using only MPI or MPI+OpenACC. Although we believe our implementation exposes enough overlap, the actual runs do not exploit it well due to a lack of asynchronous progression.
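
The post itself contains no code, but a minimal sketch may help illustrate the first and third optimizations described in the abstract. The example below is in C with MPI and OpenACC and assumes a structured grid stored as a flattened array phi of size nx*ny*nz that is already resident on the device (e.g. via acc enter data), a 1D decomposition along x with one ghost plane per side, and a CUDA-aware MPI build; the names pack_halo_x, unpack_halo_x, exchange_halo_right, sendbuf and recvbuf are illustrative and not taken from the paper.

#include <mpi.h>

#define IDX(i, j, k, ny, nz) (((i) * (ny) + (j)) * (nz) + (k))

/* Pack the noncontiguous plane i = i_plane of phi into a contiguous
 * device buffer with a GPU-parallel loop (the "parallel packing" idea). */
void pack_halo_x(const double *phi, double *buf, int i_plane, int ny, int nz)
{
    #pragma acc parallel loop collapse(2) present(phi, buf)
    for (int j = 0; j < ny; ++j)
        for (int k = 0; k < nz; ++k)
            buf[j * nz + k] = phi[IDX(i_plane, j, k, ny, nz)];
}

/* Unpack a received contiguous buffer into the ghost plane i = i_plane. */
void unpack_halo_x(double *phi, const double *buf, int i_plane, int ny, int nz)
{
    #pragma acc parallel loop collapse(2) present(phi, buf)
    for (int j = 0; j < ny; ++j)
        for (int k = 0; k < nz; ++k)
            phi[IDX(i_plane, j, k, ny, nz)] = buf[j * nz + k];
}

/* Exchange one ghost plane with the right neighbor (the left-neighbor
 * exchange is symmetric). With CUDA-aware MPI, host_data use_device hands
 * device pointers to MPI_Sendrecv so the transfer can go GPU-to-GPU
 * (GPUDirect) without staging through host memory. */
void exchange_halo_right(double *phi, double *sendbuf, double *recvbuf,
                         int nx, int ny, int nz, int right, MPI_Comm comm)
{
    pack_halo_x(phi, sendbuf, nx - 2, ny, nz);    /* last interior plane */

    #pragma acc host_data use_device(sendbuf, recvbuf)
    {
        MPI_Sendrecv(sendbuf, ny * nz, MPI_DOUBLE, right, 0,
                     recvbuf, ny * nz, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);
    }

    unpack_halo_x(phi, recvbuf, nx - 1, ny, nz);  /* right ghost plane */
}

On clusters without GPUDirect support, the host_data region would presumably be replaced by acc update host(...) before the MPI call and acc update device(...) after it, and the abstract's second optimization would amount to packing and sending only as many ghost planes as each variable's stencil actually requires.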
