Paper Title

APack: Off-Chip, Lossless Data Compression for Efficient Deep Learning Inference

Paper Authors

Alberto Delmas Lascorz, Mostafa Mahmoud, Andreas Moshovos

Paper Abstract

Data accesses between on- and off-chip memories account for a large fraction of overall energy consumption during inference with deep learning networks. We present APack, a simple and effective lossless off-chip memory compression technique for fixed-point quantized models. APack reduces data widths by exploiting the non-uniform value distribution in deep learning applications. APack can be used to increase the effective memory capacity, to reduce off-chip traffic, and/or to achieve desired performance/energy targets while using smaller off-chip memories. APack builds upon arithmetic coding, encoding each value as an arithmetically coded variable-length prefix plus an offset. To maximize the compression ratio, a heuristic software algorithm partitions the value space into groups, each sharing a common prefix. APack exploits memory access parallelism by using several pipelined encoder/decoder units in parallel and keeps up with the high data bandwidth demands of deep learning. APack can be used with any machine learning accelerator. In the demonstrated configuration, APack is placed just before the off-chip memory controller so that the rest of the on-chip memory and compute units see the original data stream. We implemented the APack compressor and decompressor in Verilog for a 65nm technology node, demonstrating its performance and energy efficiency. Indicatively, APack reduces the data footprint of weights and activations to 60% and 48%, respectively, on average over a wide set of 8-bit quantized models. It naturally adapts to and compresses models that use even more aggressive quantization methods. When integrated with a Tensorcore-based accelerator, APack boosts speedup and energy efficiency to 1.44X and 1.37X, respectively.
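
The abstract describes APack's core idea: each value is encoded as an arithmetically coded variable-length group prefix plus an offset within its group, with the grouping chosen offline by a heuristic software algorithm. Below is a minimal Python sketch of the group/offset split only. The group boundaries here are hypothetical, and both the arithmetic coding of the prefix and the heuristic partitioning are omitted, so this illustrates the concept rather than the paper's actual hardware design.

    from math import ceil, log2

    # Hypothetical group boundaries over 8-bit values: frequent small values
    # get small groups (few offset bits); rare large values share wide groups.
    # The real APack picks groups with a frequency-driven heuristic and
    # arithmetic-codes each group's prefix, which this sketch omits.
    GROUPS = [(0, 1), (1, 2), (2, 4), (4, 8), (8, 16), (16, 64), (64, 256)]

    def encode(value):
        """Split an 8-bit value into (group_id, offset, offset_bits)."""
        for gid, (lo, hi) in enumerate(GROUPS):
            if lo <= value < hi:
                bits = ceil(log2(hi - lo)) if hi - lo > 1 else 0
                return gid, value - lo, bits
        raise ValueError("value outside 8-bit range")

    def decode(gid, offset):
        """Recover the original value from its group id and offset."""
        lo, _ = GROUPS[gid]
        return lo + offset

    # A value of 3 lands in group 2 and needs only 1 offset bit, versus
    # 8 bits uncompressed; values 64..255 still need 8 bits plus a prefix.
    gid, off, bits = encode(3)
    assert (gid, off, bits) == (2, 1, 1)
    assert decode(gid, off) == 3

Because deep learning tensors are dominated by small-magnitude values, most values fall into groups with short offsets, which is where the compression comes from; rare large values cost slightly more than their original width.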
