Paper Title

Distributed Out-of-Memory NMF on CPU/GPU Architectures

Paper Authors

Ismael Boureima, Manish Bhattarai, Maksim Eren, Erik Skau, Philip Romero, Stephan Eidenbenz, Boian Alexandrov

Paper Abstract

We propose an efficient distributed out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for heterogeneous high-performance computing (HPC) systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory (OOM) problems, where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores, when available). The input/output (I/O) latency associated with batch copies between host and device is hidden by using CUDA streams to overlap data transfers with computation asynchronously, and the latency associated with collective communications (both intra-node and inter-node) is reduced by using optimized NVIDIA Collective Communications Library (NCCL) based communicators. Benchmark results show a significant improvement, from 32x to 76x speedup, of the new GPU implementation over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte matrix and an 11 Exabyte sparse matrix of density 10^-6.
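
To make the batching/tiling idea concrete, below is a minimal, hypothetical sketch (in Python with CuPy) of row-batched multiplicative-update NMF, in which only the factor H and one row batch of X and W are device-resident at a time. This is not the authors' NMFk implementation: the function name oom_nmf and its parameters are illustrative, the distributed (NCCL) layer and the sparse-matrix paths described in the abstract are omitted, and true copy/compute overlap additionally requires pinned host buffers.

```python
import numpy as np
import cupy as cp

def oom_nmf(X, k, n_iter=100, batch=4096, eps=1e-9, seed=0):
    """Out-of-core NMF sketch: batched multiplicative updates.

    X is a nonnegative host (NumPy) array of shape (m, n). Only H
    (k x n) plus one row batch of X and W live on the GPU at a time,
    so matrices larger than GPU memory can still be factorized.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k), dtype=np.float32)              # host-resident
    H = cp.asarray(rng.random((k, n), dtype=np.float32))  # device-resident
    copy_stream = cp.cuda.Stream(non_blocking=True)

    for _ in range(n_iter):
        # Accumulate W^T X and W^T W one row batch at a time.
        WtX = cp.zeros((k, n), dtype=cp.float32)
        WtW = cp.zeros((k, k), dtype=cp.float32)
        for i in range(0, m, batch):
            # Issue host-to-device copies on a side stream; with pinned
            # host memory these can overlap compute on other streams.
            with copy_stream:
                Xb = cp.asarray(X[i:i + batch])
                Wb = cp.asarray(W[i:i + batch])
            copy_stream.synchronize()
            WtX += Wb.T @ Xb
            WtW += Wb.T @ Wb
        # H update: H <- H * (W^T X) / (W^T W H)
        H *= WtX / (WtW @ H + eps)

        # W update is batch-local: W_b <- W_b * (X_b H^T) / (W_b H H^T)
        HHt = H @ H.T
        for i in range(0, m, batch):
            Xb = cp.asarray(X[i:i + batch])
            Wb = cp.asarray(W[i:i + batch])
            Wb *= (Xb @ H.T) / (Wb @ HHt + eps)
            W[i:i + batch] = cp.asnumpy(Wb)

    return W, cp.asnumpy(H)
```

In a multi-GPU setting, each rank would own a slab of rows of X and W, and the per-batch partial sums WtX and WtW would be combined with an all-reduce (e.g., over an NCCL communicator) before the H update; the paper additionally hides these collectives and the host-device copies behind CUDA streams.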
