Title
Gadget3 on GPUs with OpenACC
Authors
Abstract
We present preliminary results of a GPU port of all main Gadget3 modules (gravity computation, SPH density computation, SPH hydrodynamic force, and thermal conduction) using OpenACC directives. We assign one GPU to each MPI rank and exploit both the host and accelerator capabilities by overlapping computations on the CPUs and GPUs: while GPUs asynchronously compute interactions between particles within their MPI ranks, CPUs perform tree-walks and MPI communication of neighbouring particles. We profile various portions of the code to understand the origin of our speedup, and find that the peak speedup is not reached because of time-steps with few active particles. We run a hydrodynamic cosmological simulation from the Magneticum project with $2\cdot10^{7}$ particles and find a final overall speedup of $\approx 2$. We also present the results of an encouraging scaling test of a preliminary gravity-only OpenACC port, run in the context of the EuroHack17 event, where the prototype of the port maintained a constant speedup on up to $1024$ GPUs.
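The CPU/GPU overlap strategy described in the abstract can be illustrated with a minimal OpenACC sketch in C. This is not the authors' code: it replaces Gadget3's tree-based interaction lists with a plain direct-sum gravity kernel, and the function names (compute_forces_overlapped, cpu_tree_walk, mpi_exchange_neighbours) are hypothetical placeholders. Only the directive pattern is meant to reflect the technique the abstract describes: interaction kernels queued asynchronously on the GPU per batch of particles, host-side tree-walk and MPI work done in between, and a final wait before the results are used.

    #include <math.h>

    /* Sketch of overlapping GPU interaction kernels with host work.
     * n_local: number of particles local to this MPI rank.
     * pos: 3*n_local packed positions; acc_out: 3*n_local accelerations. */
    void compute_forces_overlapped(const float *pos, float *acc_out,
                                   int n_local, int n_batches)
    {
        int batch = n_local / n_batches;   /* assume it divides evenly */

        #pragma acc data copyin(pos[0:3*n_local]) copyout(acc_out[0:3*n_local])
        {
            for (int b = 0; b < n_batches; b++) {
                int first = b * batch;

                /* Queue the interaction kernel on the GPU and return
                 * immediately; async(b % 2) alternates between two
                 * asynchronous queues. */
                #pragma acc parallel loop async(b % 2)
                for (int i = first; i < first + batch; i++) {
                    float ax = 0.f, ay = 0.f, az = 0.f;
                    for (int j = 0; j < n_local; j++) {
                        float dx = pos[3*j]   - pos[3*i];
                        float dy = pos[3*j+1] - pos[3*i+1];
                        float dz = pos[3*j+2] - pos[3*i+2];
                        /* Plummer softening avoids the i == j singularity. */
                        float r2 = dx*dx + dy*dy + dz*dz + 1.0e-4f;
                        float inv_r  = 1.0f / sqrtf(r2);
                        float inv_r3 = inv_r * inv_r * inv_r;
                        ax += dx * inv_r3;
                        ay += dy * inv_r3;
                        az += dz * inv_r3;
                    }
                    acc_out[3*i]   = ax;
                    acc_out[3*i+1] = ay;
                    acc_out[3*i+2] = az;
                }

                /* While the GPU works, the host would walk the tree for
                 * the next batch and exchange neighbouring particles over
                 * MPI; both calls below are hypothetical placeholders. */
                /* cpu_tree_walk(b + 1);       */
                /* mpi_exchange_neighbours(b); */
            }

            /* Block until all queued GPU work has completed. */
            #pragma acc wait
        }
    }

In this pattern the host thread is free between the kernel launch and the wait, which is what lets the CPU-side tree-walks and MPI exchanges proceed concurrently with the GPU computation, as the abstract describes.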