Paper Title

DSP-Packing: Squeezing Low-precision Arithmetic into FPGA DSP Blocks

Authors

Jan Sommer, M. Akif Özkan, Oliver Keszocze, Jürgen Teich

Abstract

The number of Digital Signal Processor (DSP) resources available in Field Programmable Gate Arrays (FPGAs) is often quite limited. Therefore, full utilization of the available DSP resources for the computationally intensive parts of an algorithm is paramount for optimizing the non-functional properties of an implementation (i.e., performance, power, and area). The DSPs available in Xilinx devices implement large bit-width operators (i.e., a 48-bit accumulator or an $18 \times 27$ multiplier). However, using such a DSP for low-precision quantized data (as is common in image processing or machine learning applications) leaves the DSP resources underutilized. As a remedy, a method has been proposed to pack and compute four 4-bit multiplications on a single DSP in a single clock cycle. This paper presents a generalization of this scheme to arbitrary bit widths and numbers of multiplications. We also demonstrate that the previously proposed approach leads to errors (Mean Absolute Error (MAE) = 0.37). Furthermore, we explain where these errors come from and how they can be corrected. On top of that, we introduce a novel approximate method called "Overpacking", which makes it possible to squeeze even more multiplications into a single DSP at the cost of small errors (MAE = 0.47). Overpacking allows six 4-bit multiplications to be squeezed into a single DSP, compared to just four in the literature. Finally, we introduce an alternative method for packing multiple small bit-width additions into a single 48-bit accumulator for use in applications such as Spiking Neural Networks.
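
To make the core packing idea concrete, the following is a minimal Python sketch for unsigned 4-bit operands using generous 8-bit field spacing. It only illustrates the general principle (placing several small operands in disjoint bit fields of a single wide multiplication and slicing the partial products out of the result); it is not the paper's exact scheme, which handles signed operands, tighter packing, overpacking, and the resulting error correction. The function name pack_mul4x4 and the chosen field offsets are illustrative assumptions, not taken from the paper.

def pack_mul4x4(a1, a2, b1, b2):
    # Compute a1*b1, a2*b1, a1*b2 and a2*b2 with ONE wide multiplication:
    # a 12-bit x 20-bit unsigned product, which fits within the unsigned
    # range of an 18 x 27 DSP multiplier.
    assert all(0 <= x < 16 for x in (a1, a2, b1, b2)), "unsigned 4-bit only"

    A = a1 | (a2 << 8)     # two operands packed into one 12-bit word
    B = b1 | (b2 << 16)    # two operands packed into one 20-bit word

    P = A * B              # single wide multiplication (32-bit result)

    # Each 4x4 product is at most 15*15 = 225 < 2^8, so with 8-bit spacing
    # the four partial products occupy disjoint byte fields of P.
    return ((P >> 0)  & 0xFF,   # a1*b1
            (P >> 8)  & 0xFF,   # a2*b1
            (P >> 16) & 0xFF,   # a1*b2
            (P >> 24) & 0xFF)   # a2*b2

# Exhaustive check against the four individual multiplications.
for a1 in range(16):
    for a2 in range(16):
        for b1 in range(16):
            for b2 in range(16):
                assert pack_mul4x4(a1, a2, b1, b2) == (a1*b1, a2*b1, a1*b2, a2*b2)

The accumulator-packing method mentioned at the end of the abstract relies on the same field-partitioning idea: several small-width sums share disjoint bit fields of the 48-bit accumulator (presumably with guard bits so that carries do not cross field boundaries), so one wide addition updates all of them at once.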
