Paper Title
Hardware-Efficient Mixed-Precision CP Tensor Decomposition
Paper Authors
Paper Abstract
Tensor decomposition has been widely used in machine learning and high-volume data analysis. However, large-scale tensor factorization often incurs huge memory and computing costs. Meanwhile, modern computing hardware such as tensor processing units (TPUs) and Tensor Core GPUs has opened a new window for hardware-efficient computing via mixed- or low-precision arithmetic. In this paper, we exploit low-precision representations in tensor factorization and propose a mixed-precision block stochastic gradient descent (SGD) method to reduce the cost of CP tensor decomposition. Our method achieves robust and fast convergence via a two-stage optimization: SignSGD followed by mixed-precision SGD. A detailed theoretical analysis is provided to prove the convergence of the proposed mixed-precision algorithm. Numerical experiments on both synthetic and realistic tensor datasets show the superior efficiency of our mixed-precision algorithm compared with full-precision CP decomposition. This work can remarkably reduce the memory, computing, and energy costs on resource-constrained edge computing devices. We demonstrate this benefit via an FPGA prototype.
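For concreteness, below is a minimal NumPy sketch of the two-stage idea described in the abstract: a SignSGD warm-up followed by mixed-precision SGD on a rank-R CP model of a 3-way tensor. This is an illustration, not the authors' implementation: the precision choices (fp16 gradient passes with fp32 master factors), learning rates, and iteration counts are assumptions, and the full gradients here stand in for the block-sampled stochastic gradients of the proposed block SGD method.

```python
# Hypothetical sketch of two-stage mixed-precision CP decomposition:
# Stage 1 uses SignSGD (scale-free updates) for robust early progress;
# Stage 2 evaluates gradients in fp16 and accumulates updates into
# fp32 master copies of the factor matrices.
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J*K, R) from (J, R) and (K, R)."""
    R = B.shape[1]
    return (B[:, None, :] * C[None, :, :]).reshape(-1, R)

def cp_gradients(X, A, B, C):
    """Gradients of ||X - [[A, B, C]]||_F^2 w.r.t. the factor matrices."""
    I, J, K = X.shape
    Xhat = np.einsum('ir,jr,kr->ijk', A, B, C)
    E = Xhat - X                                   # residual tensor
    gA = 2.0 * E.reshape(I, -1) @ khatri_rao(B, C)
    gB = 2.0 * E.transpose(1, 0, 2).reshape(J, -1) @ khatri_rao(A, C)
    gC = 2.0 * E.transpose(2, 0, 1).reshape(K, -1) @ khatri_rao(A, B)
    return gA, gB, gC

def mixed_precision_cp(X, rank, sign_iters=100, mp_iters=300,
                       lr_sign=1e-2, lr_mp=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # fp32 "master" copies of the factor matrices
    A, B, C = [0.1 * rng.standard_normal((n, rank)).astype(np.float32)
               for n in X.shape]

    # Stage 1: SignSGD warm-up in full precision
    for _ in range(sign_iters):
        gA, gB, gC = cp_gradients(X, A, B, C)
        A -= lr_sign * np.sign(gA)
        B -= lr_sign * np.sign(gB)
        C -= lr_sign * np.sign(gC)

    # Stage 2: mixed-precision SGD -- fp16 gradient passes,
    # fp32 accumulation into the master factors
    X16 = X.astype(np.float16)
    for _ in range(mp_iters):
        gA, gB, gC = cp_gradients(X16, A.astype(np.float16),
                                  B.astype(np.float16),
                                  C.astype(np.float16))
        A -= lr_mp * gA.astype(np.float32)
        B -= lr_mp * gB.astype(np.float32)
        C -= lr_mp * gC.astype(np.float32)
    return A, B, C

# Example: recover a rank-5 synthetic tensor
I, J, K, R = 30, 40, 50, 5
rng = np.random.default_rng(1)
At, Bt, Ct = (rng.standard_normal((n, R)) for n in (I, J, K))
X = np.einsum('ir,jr,kr->ijk', At, Bt, Ct)
A, B, C = mixed_precision_cp(X, rank=R)
```

In this sketch the fp32 master copies mirror common mixed-precision training practice: low-precision arithmetic cuts memory traffic and compute in the gradient pass, while full-precision accumulation keeps small updates from being rounded away.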