Paper Title

DNN Training Acceleration via Exploring GPGPU Friendly Sparsity

Authors

Zhuoran Song, Yihong Xu, Han Li, Naifeng Jing, Xiaoyao Liang, Li Jiang

Abstract

The training phase of a deep neural network (DNN) consumes enormous processing time and energy. Compression techniques that exploit the sparsity of DNNs can effectively accelerate the inference phase, but they are rarely applied to training, because training involves dense matrix multiplication on general-purpose graphics processing units (GPGPUs), which favor a regular and structured data layout. In this paper, we first propose Approximate Random Dropout, which replaces the conventional random dropout of neurons and synapses with regular, online-generated row-based or tile-based dropout patterns to eliminate unnecessary computation and data accesses in multilayer perceptron (MLP) and long short-term memory (LSTM) training. We then develop an SGD-based search algorithm that produces the distribution of row-based or tile-based dropout patterns so as to compensate for the potential accuracy loss. Moreover, targeting convolutional neural network (CNN) training acceleration, we first explore the importance and sensitivity of input feature maps, and then propose a sensitivity-aware dropout method that dynamically drops input feature maps according to their sensitivity, achieving greater forward and backward training acceleration while better preserving network accuracy. To facilitate DNN programming, we build a DNN training computation framework that unifies the proposed techniques in the software stack. As a result, the GPGPU only needs to support a single basic operator, matrix multiplication, and can achieve significant performance improvement regardless of the DNN model.
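
To make the GPGPU-friendly structure concrete, below is a minimal NumPy sketch (not the authors' implementation) of row-based dropout for a single fully connected layer: instead of zeroing individual neurons, entire output rows of the weight matrix are dropped, so the layer collapses to a smaller dense matrix multiplication. The function name `row_dropout_forward` and the parameter `drop_prob` are illustrative choices of ours; the paper's SGD-based search over pattern distributions and the sensitivity-aware CNN variant are not reproduced here.

```python
# Illustrative sketch only: row-based dropout for one fully connected layer.
# Whole output rows are kept or dropped, so the forward pass reduces to a
# smaller dense GEMM, which is the regular, GPGPU-friendly sparsity the
# abstract describes. All names here are hypothetical, not the authors' API.
import numpy as np

def row_dropout_forward(x, W, b, drop_prob=0.5, rng=None):
    """Forward pass with row-based dropout.

    x : (batch, in_features) input activations
    W : (out_features, in_features) weight matrix
    b : (out_features,) bias
    Returns the full-width output (dropped rows are zero) and the keep mask,
    so a backward pass could skip the same rows.
    """
    rng = np.random.default_rng() if rng is None else rng
    out_features = W.shape[0]

    # Draw a regular, structured pattern: keep or drop entire rows.
    keep_rows = rng.random(out_features) >= drop_prob
    kept = np.flatnonzero(keep_rows)

    # Only the kept rows participate in the matrix multiplication,
    # i.e. a smaller dense GEMM instead of irregular fine-grained sparsity.
    y_kept = x @ W[kept].T + b[kept]

    # Scatter back into the full-width output with inverted-dropout scaling.
    y = np.zeros((x.shape[0], out_features), dtype=x.dtype)
    y[:, kept] = y_kept / (1.0 - drop_prob)
    return y, keep_rows

# Tiny usage example.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8)).astype(np.float32)
    W = rng.standard_normal((16, 8)).astype(np.float32)
    b = np.zeros(16, dtype=np.float32)
    y, mask = row_dropout_forward(x, W, b, drop_prob=0.5, rng=rng)
    print(y.shape, int(mask.sum()), "rows kept")
```

Tile-based patterns follow the same idea at a coarser granularity (blocks of the weight matrix rather than single rows), which maps even more directly onto tiled GEMM kernels.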
