Vectorization and Minimization of Memory Footprint for Linear High-Order Discontinuous Galerkin Schemes

Paper title

Vectorization and Minimization of Memory Footprint for Linear High-Order Discontinuous Galerkin Schemes

Authors

Jean-Matthieu Gallard, Leonhard Rannabauer, Anne Reinarz, Michael Bader

Abstract

We present a sequence of optimizations to the performance-critical compute kernels of the high-order discontinuous Galerkin solver of the hyperbolic PDE engine ExaHyPE, successively tackling bottlenecks due to SIMD operations, cache hierarchies and restrictions in the software design. Starting from a generic scalar implementation of the numerical scheme, our first optimized variant applies state-of-the-art optimization techniques by vectorizing loops, improving the data layout and using Loop-over-GEMM to perform tensor contractions via highly optimized matrix multiplication functions provided by the LIBXSMM library. We show that memory stalls due to a memory footprint exceeding our L2 cache size hindered the vectorization gains. We therefore introduce a new kernel that applies a sum factorization approach to reduce the kernel's memory footprint and improve its cache locality. With the L2 cache bottleneck removed, we were able to exploit additional vectorization opportunities by introducing a hybrid Array-of-Structure-of-Array data layout that resolves the data layout conflict between the matrix multiplication kernels and the point-wise functions that implement PDE-specific terms. With this last kernel, evaluated in a benchmark simulation at high polynomial order, only 2% of the floating point operations are still performed using scalar instructions and 22.5% of the available performance is achieved.
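
The hybrid Array-of-Structure-of-Array layout mentioned in the abstract can be illustrated with a small index-mapping sketch. This is not the ExaHyPE implementation: the function names, the cubic cell with n nodes per dimension, and the choice of x as the fastest-running index are all assumptions made for illustration. In the plain Array-of-Structures layout the variables of one node sit next to each other, whereas in the hybrid layout assumed here each (z, y) line stores one contiguous run of x-values per variable, so a point-wise function can read a variable along x with unit stride.

```python
def aos_index(z, y, x, var, n, n_vars):
    """Array-of-Structures: all variables of one node are adjacent."""
    return ((z * n + y) * n + x) * n_vars + var


def aosoa_index(z, y, x, var, n, n_vars):
    """Hybrid layout (assumed): per (z, y) line, one contiguous x-run per variable."""
    return ((z * n + y) * n_vars + var) * n + x


def aos_to_aosoa(data, n, n_vars):
    """Copy a flat AoS buffer into a flat buffer using the hybrid layout."""
    out = [0.0] * len(data)
    for z in range(n):
        for y in range(n):
            for x in range(n):
                for var in range(n_vars):
                    src = aos_index(z, y, x, var, n, n_vars)
                    dst = aosoa_index(z, y, x, var, n, n_vars)
                    out[dst] = data[src]
    return out


def x_line(buf, z, y, var, n, n_vars):
    """Return the contiguous run of one variable along x in the hybrid layout."""
    start = aosoa_index(z, y, 0, var, n, n_vars)
    return buf[start:start + n]


# Tiny example: 2 nodes per dimension, 3 variables per node.
n, n_vars = 2, 3
aos = [float(i) for i in range(n ** 3 * n_vars)]
aosoa = aos_to_aosoa(aos, n, n_vars)
print(aosoa[:n * n_vars])          # first (z, y) line in the hybrid layout
print(x_line(aosoa, 1, 1, 2, n, n_vars))
```

In a compiled kernel the slice returned by `x_line` would correspond to a unit-stride range that a SIMD loop can process directly. Padding of the x-dimension to the vector width, which a real kernel would likely need, is left out of this sketch.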
