Paper title
Vectorization and Minimization of Memory Footprint for Linear High-Order Discontinuous Galerkin Schemes
Paper authors
Abstract
We present a sequence of optimizations to the performance-critical compute kernels of the high-order discontinuous Galerkin solver of the hyperbolic PDE engine ExaHyPE, successively tackling bottlenecks due to SIMD operations, cache hierarchies and restrictions in the software design. Starting from a generic scalar implementation of the numerical scheme, our first optimized variant applies state-of-the-art optimization techniques by vectorizing loops, improving the data layout and using Loop-over-GEMM to perform tensor contractions via highly optimized matrix multiplication functions provided by the LIBXSMM library. We show that memory stalls due to a memory footprint exceeding our L2 cache size hindered the vectorization gains. We therefore introduce a new kernel that applies a sum factorization approach to reduce the kernel's memory footprint and improve its cache locality. With the L2 cache bottleneck removed, we were able to exploit additional vectorization opportunities by introducing a hybrid Array-of-Structure-of-Array data layout that resolves the data layout conflict between the matrix multiplication kernels and the point-wise functions that implement PDE-specific terms. With this last kernel, evaluated in a benchmark simulation at high polynomial order, only 2\% of the floating-point operations are still performed using scalar instructions and 22.5\% of the available performance is achieved.
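The data layout conflict mentioned in the abstract can be made concrete with a small index-arithmetic sketch. The code below is an illustration only, not ExaHyPE code: the function names, the flat one-dimensional point numbering, and the `block` parameter are all assumptions introduced here. It contrasts an Array-of-Structures layout (all variables of one point contiguous), a Structure-of-Arrays layout (one variable across all points contiguous), and a hybrid layout that groups points into fixed-size blocks and stores each block in SoA order.

```python
# Illustrative sketch only: index arithmetic for three layouts of
# n_points nodes, each carrying n_vars unknowns, in one flat buffer.
# All names are hypothetical and not taken from ExaHyPE.

def aos_index(point, var, n_points, n_vars):
    """Array-of-Structures: the variables of one point are contiguous."""
    return point * n_vars + var


def soa_index(point, var, n_points, n_vars):
    """Structure-of-Arrays: one variable across all points is contiguous."""
    return var * n_points + point


def aosoa_index(point, var, n_points, n_vars, block):
    """Hybrid layout: points are grouped into blocks of `block` points,
    and inside each block the data is stored in SoA order."""
    assert n_points % block == 0  # assumption: no padding in this sketch
    b, lane = divmod(point, block)
    return b * (n_vars * block) + var * block + lane


def aos_to_aosoa(data, n_points, n_vars, block):
    """Copy a flat AoS buffer into the hybrid layout."""
    out = [None] * (n_points * n_vars)
    for p in range(n_points):
        for v in range(n_vars):
            src = aos_index(p, v, n_points, n_vars)
            dst = aosoa_index(p, v, n_points, n_vars, block)
            out[dst] = data[src]
    return out


if __name__ == "__main__":
    # 4 points with 2 variables each, blocks of 2 points.
    print(aos_to_aosoa(list(range(8)), n_points=4, n_vars=2, block=2))
```

In such a hybrid layout, each block holds `block` consecutive values of the same variable, which is the access pattern a SIMD lane group wants for point-wise functions, while the blocks themselves remain ordered by point. Choosing `block` to match the SIMD width is a common convention, assumed here rather than taken from the abstract.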