Paper Title

Sparse GPU Kernels for Deep Learning

Paper Authors

Trevor Gale, Matei Zaharia, Cliff Young, Erich Elsen

Abstract

Scientific workloads have traditionally exploited high levels of sparsity to accelerate computation and reduce memory requirements. While deep neural networks can be made sparse, achieving practical speedups on GPUs is difficult because these applications have relatively moderate levels of sparsity that are not sufficient for existing sparse kernels to outperform their dense counterparts. In this work, we study sparse matrices from deep learning applications and identify favorable properties that can be exploited to accelerate computation. Based on these insights, we develop high-performance GPU kernels for two sparse matrix operations widely applicable in neural networks: sparse matrix-dense matrix multiplication and sampled dense-dense matrix multiplication. Our kernels reach 27% of single-precision peak on Nvidia V100 GPUs. Using our kernels, we demonstrate sparse Transformer and MobileNet models that achieve 1.2-2.1x speedups and up to 12.8x memory savings without sacrificing accuracy.
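For reference, the two operations named in the abstract have simple mathematical semantics. The sketch below is a minimal NumPy/SciPy CPU reference, not the paper's GPU kernels; the function names `spmm` and `sddmm` and the example shapes are illustrative assumptions.

```python
# Illustrative reference for SpMM and SDDMM semantics (not the paper's CUDA kernels).
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

def spmm(a_sparse, b_dense):
    """Sparse matrix-dense matrix multiplication: C = A @ B, with A sparse."""
    return np.asarray(a_sparse @ b_dense)

def sddmm(mask, a_dense, b_dense):
    """Sampled dense-dense matrix multiplication: compute (A @ B) only at the
    nonzero positions of `mask`, returning a sparse result with mask's pattern."""
    rows, cols = mask.nonzero()
    # One dot product per sampled output location: A[row, :] . B[:, col].
    vals = np.einsum("ij,ij->i", a_dense[rows, :], b_dense[:, cols].T)
    return csr_matrix((vals, (rows, cols)), shape=mask.shape)

# SpMM: a moderately sparse (e.g. pruned) weight matrix times dense activations.
w = sparse_random(256, 512, density=0.3, format="csr", dtype=np.float32)
x = np.random.rand(512, 64).astype(np.float32)
y = spmm(w, x)            # dense, shape (256, 64)

# SDDMM: dense-dense product sampled at the sparsity pattern of `w`.
lhs = np.random.rand(256, 64).astype(np.float32)
rhs = np.random.rand(64, 512).astype(np.float32)
g = sddmm(w, lhs, rhs)    # sparse, same pattern and shape as w (256, 512)
```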
