Title
Fast hardware-aware matrix-free algorithm for higher-order finite-element discretized matrix multivector products on distributed systems
Authors
Abstract
Recent hardware-aware matrix-free algorithms for higher-order finite-element (FE) discretized matrix-vector multiplications reduce floating point operations and data access costs compared to traditional sparse matrix approaches. This work proposes efficient matrix-free algorithms for evaluating FE discretized matrix-multivector products on both multi-node CPU and GPU architectures. We address a critical gap in existing matrix-free implementations, which are well suited only for the action of FE discretized matrices on a single vector. We employ batched evaluation strategies, with the batch size tailored to the underlying hardware architecture, leading to better data locality and enabling further parallelization. On CPUs, we utilize even-odd decomposition, SIMD vectorization, and overlapping computation and communication strategies. On GPUs, we employ strategies to overlap compute and data movement in conjunction with GPU shared memory, constant memory, and kernel fusion to reduce data accesses. Our implementation outperforms the baselines for the Helmholtz operator action, achieving up to 1.4x improvement on one CPU node and up to 2.8x on one GPU node, while reaching up to 4.4x and 1.5x improvement on multiple nodes for CPUs ($\sim 3000$ cores) and GPUs ($\sim$ 25 GPUs), respectively. We further benchmark the performance of the proposed implementation for solving a model eigenvalue problem for the 1024 smallest eigenvalue-eigenvector pairs by employing the Chebyshev Filtered Subspace Iteration method, achieving up to 1.5x improvement on one CPU node and up to 2.2x on one GPU node, while reaching up to 3.0x and 1.4x improvement on multi-node CPUs ($\sim 3000$ cores) and GPUs ($\sim$ 25 GPUs), respectively.
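The core idea behind matrix-free FE operator action on a batch of vectors can be illustrated with a small sketch. The snippet below is not the paper's implementation: it shows, for a hypothetical 1D reference operator `S` applied as a tensor product on an element, how sum factorization evaluates the operator on a batch of element-local vectors with axis-wise contractions instead of assembling the full dense element matrix (names `S`, `U`, `batch`, and the polynomial order `p` are assumptions chosen for this example).

```python
import numpy as np

# Illustrative sketch: matrix-free action of a tensor-product element operator
# A_e = S (x) S (x) S on a batch of vectors via sum factorization, avoiding the
# (p+1)^3 x (p+1)^3 dense matrix. All names here are hypothetical.

p = 3                       # polynomial order -> n = p + 1 nodes per direction
n = p + 1
batch = 8                   # number of vectors acted on simultaneously
rng = np.random.default_rng(0)

S = rng.standard_normal((n, n))            # 1D reference operator
U = rng.standard_normal((batch, n, n, n))  # batch of element-local vectors

# Sum factorization: contract one tensor direction at a time.
# Each step costs O(batch * n^4), versus O(batch * n^6) for the dense matvec.
V = np.einsum('ai,bijk->bajk', S, U)   # contract direction 1
V = np.einsum('aj,bijk->biak', S, V)   # contract direction 2
V = np.einsum('ak,bijk->bija', S, V)   # contract direction 3

# Reference: dense Kronecker-product operator applied vector by vector.
A = np.kron(np.kron(S, S), S)
V_ref = (U.reshape(batch, -1) @ A.T).reshape(batch, n, n, n)
assert np.allclose(V, V_ref)
```

Batching the vectors, as in the leading `batch` axis above, is what exposes the extra data locality and parallelism the abstract refers to: the same small operator `S` is reused across all vectors in the batch before being evicted from fast memory.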