Paper title

The Gaia AVU-GSR parallel solver: preliminary studies of a LSQR-based application in perspective of exascale systems

Authors

Valentina Cesare, Ugo Becciani, Alberto Vecchiato, Mario Gilberto Lattanzi, Fabio Pitari, Mario Raciti, Giuseppe Tudisco, Marco Aldinucci, Beatrice Bucciarelli

Abstract

The Gaia Astrometric Verification Unit-Global Sphere Reconstruction (AVU-GSR) Parallel Solver aims to find the astrometric parameters for $\sim 10^8$ stars in the Milky Way, the attitude and the instrumental specifications of the Gaia satellite, and the global parameter $\gamma$ of the post-Newtonian formalism. The code iteratively solves a system of linear equations, $\mathbf{A} \times \vec{x} = \vec{b}$, where the coefficient matrix $\mathbf{A}$ is large ($\sim 10^{11} \times 10^8$ elements) and sparse. To solve this system of equations, the code exploits a hybrid implementation of the iterative PC-LSQR algorithm, in which the computation related to different horizontal portions of the coefficient matrix is assigned to separate MPI processes. In the original code, the computation on each matrix portion is further parallelized over OpenMP threads. To further improve the code performance, we ported the application to the GPU, replacing the OpenMP parallelization language with OpenACC. In this port, $\sim 95\%$ of the data is copied from the host to the device at the beginning of the entire cycle of iterations, making the code compute-bound rather than data-transfer-bound. The OpenACC code presents a speedup of $\sim 1.5$ over the OpenMP version, but further optimizations are in progress to obtain higher gains. The code runs on multiple GPUs and was tested on the CINECA supercomputer Marconi100, in anticipation of a port to the pre-exascale system Leonardo, which will be installed at CINECA in 2022.
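The row-wise decomposition described in the abstract can be illustrated with a small sketch. It is not the AVU-GSR code: it is a plain (unpreconditioned) LSQR iteration in pure Python on a tiny dense matrix, and the MPI processes are only simulated by looping over contiguous row blocks. The names `split_rows`, `matvec`, `rmatvec`, and `lsqr`, as well as the simplified stopping test, are assumptions made for illustration. The point it shows is that $\mathbf{A}\vec{v}$ needs no communication between row blocks, while $\mathbf{A}^T\vec{u}$ requires a global sum of per-block partial results (the role an all-reduce would play in an MPI code).

```python
import math

def split_rows(n_rows, n_ranks):
    """Contiguous [start, stop) row ranges, one per simulated rank (hypothetical helper)."""
    base, extra = divmod(n_rows, n_ranks)
    ranges, start = [], 0
    for r in range(n_ranks):
        stop = start + base + (1 if r < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def matvec(A, x, ranges):
    """A @ x: each simulated rank fills only the entries of its own row block."""
    y = [0.0] * len(A)
    for start, stop in ranges:              # one pass = one rank's work
        for i in range(start, stop):
            y[i] = sum(a * xj for a, xj in zip(A[i], x))
    return y

def rmatvec(A, u, ranges, n_cols):
    """A^T @ u: per-rank partial vectors, then a global sum (all-reduce stand-in)."""
    total = [0.0] * n_cols
    for start, stop in ranges:
        partial = [0.0] * n_cols
        for i in range(start, stop):
            for j, a in enumerate(A[i]):
                partial[j] += a * u[i]
        total = [t + p for t, p in zip(total, partial)]
    return total

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def lsqr(A, b, n_ranks=2, max_iter=100, tol=1e-12):
    """Plain LSQR (Paige-Saunders recurrences), no preconditioning, no damping."""
    n_cols = len(A[0])
    ranges = split_rows(len(A), n_ranks)
    x = [0.0] * n_cols
    norm_b = norm(b)
    if norm_b == 0.0:
        return x
    beta = norm_b
    u = [c / beta for c in b]
    v = rmatvec(A, u, ranges, n_cols)
    alpha = norm(v)
    if alpha == 0.0:
        return x
    v = [c / alpha for c in v]
    w = list(v)
    phi_bar, rho_bar = beta, alpha
    for _ in range(max_iter):
        # Golub-Kahan bidiagonalization step
        u = [av - alpha * ui for av, ui in zip(matvec(A, v, ranges), u)]
        beta = norm(u)
        if beta > 0.0:
            u = [c / beta for c in u]
        v = [atu - beta * vi for atu, vi in zip(rmatvec(A, u, ranges, n_cols), v)]
        alpha = norm(v)
        if alpha > 0.0:
            v = [c / alpha for c in v]
        # Plane rotation that updates the QR factorization of the bidiagonal matrix
        rho = math.hypot(rho_bar, beta)
        c, s = rho_bar / rho, beta / rho
        theta = s * alpha
        rho_bar = -c * alpha
        phi = c * phi_bar
        phi_bar = s * phi_bar
        # Update the solution and the search direction
        x = [xi + (phi / rho) * wi for xi, wi in zip(x, w)]
        w = [vi - (theta / rho) * wi for vi, wi in zip(v, w)]
        # Simplified stopping test (assumption): small residual norm, or small
        # norm of the normal-equations residual A^T r
        if phi_bar <= tol * norm_b or phi_bar * alpha * abs(c) <= tol * norm_b:
            break
    return x
```

For example, `lsqr([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]], [2.0, 2.0, 3.0, -1.0], n_ranks=2)` solves a small consistent overdetermined system split across two simulated ranks. A production solver would store $\mathbf{A}$ in a sparse format and replace the loop over `ranges` with real MPI processes and accelerator kernels.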
