Paper Title
Geostatistical Modeling and Prediction Using Mixed-Precision Tile Cholesky Factorization
Paper Authors
Paper Abstract
Geostatistics represents one of the most challenging classes of scientific applications due to the desire to incorporate an ever-increasing number of geospatial locations to accurately model and predict environmental phenomena. For example, the evaluation of the Gaussian log-likelihood function, which constitutes the main computational phase, involves solving systems of linear equations with a large, dense, symmetric, and positive-definite covariance matrix. The standard algorithm, Cholesky factorization, requires O(n^3) floating-point operations and has an O(n^2) memory footprint, where n is the number of geographical locations. Here, we present a mixed-precision tile algorithm to accelerate the Cholesky factorization during the log-likelihood function evaluation. Under an appropriate ordering, it operates with double-precision arithmetic on tiles around the diagonal, while reducing to single-precision arithmetic for tiles sufficiently far away from the diagonal. This improves performance without any deterioration in the numerical accuracy of the application. We rely on the StarPU dynamic runtime system to schedule the tasks and to overlap them with data movement. To assess the performance and accuracy of the proposed mixed-precision algorithm, we use synthetic and real datasets on various shared- and distributed-memory systems, possibly equipped with hardware accelerators. We compare our mixed-precision Cholesky factorization against the double-precision reference implementation as well as an independent-block approximation method. We obtain an average 1.6X performance speedup on massively parallel architectures while maintaining the accuracy necessary for modeling and prediction.
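The precision rule described in the abstract can be illustrated with a short sketch. The Python/NumPy code below is a serial, illustrative analogue of a mixed-precision tile Cholesky factorization, not the authors' StarPU-based task-parallel implementation; the function name tile_cholesky_mixed and the tile_size/band parameters are assumptions made for the example. It assumes the locations are already ordered so that tiles far from the diagonal carry weak correlations, keeps tiles within a band of the diagonal in double precision, and demotes the remaining tiles to single precision. In the geostatistics setting, both the log-determinant and the quadratic-form term of the Gaussian log-likelihood are then obtained from the resulting factor.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def tile_cholesky_mixed(A, tile_size, band):
    """Illustrative right-looking tile Cholesky of an SPD matrix A.

    Tiles within `band` tile-rows of the diagonal are kept in double
    precision; tiles farther away are stored and updated in single
    precision. Parameters are hypothetical, for demonstration only.
    """
    n = A.shape[0]
    assert n % tile_size == 0, "sketch assumes n is a multiple of tile_size"
    nt = n // tile_size

    def blk(i, j):
        return A[i*tile_size:(i+1)*tile_size, j*tile_size:(j+1)*tile_size]

    # Lower-triangular tile layout; demote far-off-diagonal tiles to float32.
    T = [[blk(i, j).astype(np.float64 if i - j <= band else np.float32)
          for j in range(i + 1)] for i in range(nt)]

    for k in range(nt):
        # POTRF: factor the diagonal tile, always in double precision.
        T[k][k] = cholesky(T[k][k], lower=True)
        for i in range(k + 1, nt):
            # TRSM: T[i][k] <- T[i][k] * L_kk^{-T}, computed in double,
            # then cast back to the tile's own storage precision.
            prec = T[i][k].dtype
            X = solve_triangular(T[k][k], T[i][k].astype(np.float64).T,
                                 lower=True).T
            T[i][k] = X.astype(prec)
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):
                # SYRK (j == i) / GEMM (j < i) trailing update in double,
                # then cast back to the tile's storage precision.
                prec = T[i][j].dtype
                upd = T[i][k].astype(np.float64) @ T[j][k].astype(np.float64).T
                T[i][j] = (T[i][j].astype(np.float64) - upd).astype(prec)

    # Assemble the lower-triangular Cholesky factor in double precision.
    L = np.zeros_like(A, dtype=np.float64)
    for i in range(nt):
        for j in range(i + 1):
            L[i*tile_size:(i+1)*tile_size, j*tile_size:(j+1)*tile_size] = T[i][j]
    return np.tril(L)

# Quick check on a synthetic SPD matrix (illustrative sizes only).
rng = np.random.default_rng(0)
G = rng.standard_normal((512, 512))
A = G @ G.T + 512.0 * np.eye(512)
L = tile_cholesky_mixed(A, tile_size=64, band=1)
print(np.linalg.norm(A - L @ L.T) / np.linalg.norm(A))  # small relative residual
```

In the paper's setting, each POTRF/TRSM/SYRK/GEMM tile operation above would instead be submitted as a task to the StarPU runtime, which schedules the tasks across CPUs and possible accelerators and overlaps them with data movement.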