Paper Title

Implementing a neural network interatomic model with performance portability for emerging exascale architectures

Paper Authors

Desai, Saaketh, Reeve, Samuel Temple, Belak, James F.

Paper Abstract

The two main thrusts of computational science are more accurate predictions and faster calculations; to this end, the zeitgeist in molecular dynamics (MD) simulations is pursuing machine-learned and data-driven interatomic models, e.g. neural network potentials, and novel hardware architectures, e.g. GPUs. Current implementations of neural network potentials are orders of magnitude slower than traditional interatomic models, and while looming exascale computing offers the ability to run large, accurate simulations with these models, achieving portable performance for MD with new and varied exascale hardware requires rethinking traditional algorithms, using novel data structures, and adopting library solutions. We re-implement a neural network interatomic model in CabanaMD, an MD proxy application, built on libraries developed for performance portability. Our implementation shows significantly improved on-node scaling in this complex kernel as compared to a current LAMMPS implementation, across both strong and weak scaling. Our single-source solution results in improved performance in many cases, with thread scalability enabling simulations up to 21 million atoms on a single CPU node and 2 million atoms on a single GPU. We also explore parallelism and data layout choices (using flexible data structures called AoSoAs) and their effect on performance, seeing up to ~25% and ~10% improvements in performance on a GPU simply by choosing the right level of parallelism and data layout, respectively.
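
The flexible AoSoA (Array-of-Structs-of-Arrays) data structures and the "level of parallelism" choices mentioned in the abstract come from the Cabana and Kokkos libraries on which CabanaMD is built. The following is a minimal sketch, assuming the public Cabana AoSoA API, of how per-atom data might be declared with a tunable inner vector length and traversed with a flat Kokkos parallel loop; the member layout, vector length of 8, and initialization kernel are illustrative placeholders, not the paper's actual neural network implementation.

```cpp
// Illustrative sketch only: per-atom data in a Cabana AoSoA with a
// compile-time vector length (the data-layout knob discussed in the abstract).
#include <Cabana_Core.hpp>
#include <Kokkos_Core.hpp>

int main( int argc, char* argv[] )
{
    Kokkos::ScopeGuard scope_guard( argc, argv );

    // Hypothetical per-atom members: position[3], force[3], type.
    using MemberTypes = Cabana::MemberTypes<double[3], double[3], int>;
    const int vector_length = 8; // inner SoA chunk size; a layout tuning parameter
    using ExecSpace = Kokkos::DefaultExecutionSpace;
    using DeviceType = Kokkos::Device<ExecSpace, ExecSpace::memory_space>;

    const int num_atoms = 1000;
    Cabana::AoSoA<MemberTypes, DeviceType, vector_length> atoms( "atoms", num_atoms );

    // Slices give Kokkos-View-like access to a single member across all atoms.
    auto x = Cabana::slice<0>( atoms );
    auto f = Cabana::slice<1>( atoms );

    // Flat, atom-level parallelism; Cabana also offers vectorized (struct +
    // array index) and hierarchical (team/vector) loops, which is the
    // parallelism-level choice the abstract refers to.
    Kokkos::parallel_for(
        "init_atoms", Kokkos::RangePolicy<ExecSpace>( 0, num_atoms ),
        KOKKOS_LAMBDA( const int i ) {
            for ( int d = 0; d < 3; ++d )
            {
                x( i, d ) = 0.0;
                f( i, d ) = 0.0;
            }
        } );
    Kokkos::fence();

    return 0;
}
```

Changing `vector_length` or switching to a vectorized or hierarchical loop alters the memory layout and the mapping of work to threads without touching the physics, which is the kind of single-source tuning the abstract reports gains from.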
