Paper Title
Hardware-Efficient Mixed-Precision CP Tensor Decomposition
Paper Authors
Paper Abstract
Tensor decomposition has been widely used in machine learning and high-volume data analysis. However, large-scale tensor factorization often incurs huge memory and computing costs. Meanwhile, modern computing hardware such as tensor processing units (TPUs) and Tensor Core GPUs has opened a new window for hardware-efficient computing via mixed- or low-precision arithmetic. In this paper, we exploit low-precision representations in tensor factorization and propose a mixed-precision block stochastic gradient descent (SGD) method to reduce the cost of CP tensor decomposition. Our method achieves robust and fast convergence via a two-stage optimization: SignSGD followed by mixed-precision SGD. A detailed theoretical analysis is provided to prove the convergence of the proposed mixed-precision algorithm. Numerical experiments on both synthetic and realistic tensor datasets show the superior efficiency of our mixed-precision algorithm compared with full-precision CP decomposition. This work can remarkably reduce the memory, computing, and energy costs on resource-constrained edge computing devices. We demonstrate this benefit via an FPGA prototype.
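For concreteness, below is a minimal NumPy sketch of the two-stage idea described in the abstract: a SignSGD warm-up followed by mixed-precision SGD on a rank-R CP model of a 3-way tensor. This is an illustration, not the authors' implementation: the precision choices (fp16 gradient passes with fp32 master factors), learning rates, and iteration counts are assumptions, and the full gradients here stand in for the block-sampled stochastic gradients of the proposed block SGD method.

```python
# Hypothetical sketch of two-stage mixed-precision CP decomposition:
# Stage 1 uses SignSGD (scale-free updates) for robust early progress;
# Stage 2 evaluates gradients in fp16 and accumulates updates into
# fp32 master copies of the factor matrices.
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J*K, R) from (J, R) and (K, R)."""
    R = B.shape[1]
    return (B[:, None, :] * C[None, :, :]).reshape(-1, R)

def cp_gradients(X, A, B, C):
    """Gradients of ||X - [[A, B, C]]||_F^2 w.r.t. the factor matrices."""
    I, J, K = X.shape
    Xhat = np.einsum('ir,jr,kr->ijk', A, B, C)
    E = Xhat - X                                   # residual tensor
    gA = 2.0 * E.reshape(I, -1) @ khatri_rao(B, C)
    gB = 2.0 * E.transpose(1, 0, 2).reshape(J, -1) @ khatri_rao(A, C)
    gC = 2.0 * E.transpose(2, 0, 1).reshape(K, -1) @ khatri_rao(A, B)
    return gA, gB, gC

def mixed_precision_cp(X, rank, sign_iters=100, mp_iters=300,
                       lr_sign=1e-2, lr_mp=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # fp32 "master" copies of the factor matrices
    A, B, C = [0.1 * rng.standard_normal((n, rank)).astype(np.float32)
               for n in X.shape]

    # Stage 1: SignSGD warm-up in full precision
    for _ in range(sign_iters):
        gA, gB, gC = cp_gradients(X, A, B, C)
        A -= lr_sign * np.sign(gA)
        B -= lr_sign * np.sign(gB)
        C -= lr_sign * np.sign(gC)

    # Stage 2: mixed-precision SGD -- fp16 gradient passes,
    # fp32 accumulation into the master factors
    X16 = X.astype(np.float16)
    for _ in range(mp_iters):
        gA, gB, gC = cp_gradients(X16, A.astype(np.float16),
                                  B.astype(np.float16),
                                  C.astype(np.float16))
        A -= lr_mp * gA.astype(np.float32)
        B -= lr_mp * gB.astype(np.float32)
        C -= lr_mp * gC.astype(np.float32)
    return A, B, C

# Example: recover a rank-5 synthetic tensor
I, J, K, R = 30, 40, 50, 5
rng = np.random.default_rng(1)
At, Bt, Ct = (rng.standard_normal((n, R)) for n in (I, J, K))
X = np.einsum('ir,jr,kr->ijk', At, Bt, Ct)
A, B, C = mixed_precision_cp(X, rank=R)
```

In this sketch the fp32 master copies mirror common mixed-precision training practice: low-precision arithmetic cuts memory traffic and compute in the gradient pass, while full-precision accumulation keeps small updates from being rounded away.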