Title
A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference
Authors
Abstract
The need to repeatedly shuttle synaptic weight values between memory and processing units has been a key source of energy inefficiency in hardware implementations of artificial neural networks. Analog in-memory computing (AIMC) with spatially instantiated synaptic weights holds high promise to overcome this challenge by performing matrix-vector multiplications (MVMs) directly within the network weights stored on a chip to execute an inference workload. However, to achieve end-to-end improvements in latency and energy consumption, AIMC must be combined with on-chip digital operations and communication to move towards configurations in which a full inference workload is realized entirely on-chip. Moreover, it is highly desirable to achieve high MVM and inference accuracy without application-wise re-tuning of the chip. Here, we present a multi-core AIMC chip designed and fabricated in 14-nm complementary metal-oxide-semiconductor (CMOS) technology with backend-integrated phase-change memory (PCM). The fully integrated chip features 64 AIMC cores, each with a 256 × 256 crossbar array, interconnected via an on-chip communication network. It also implements the digital activation functions and processing involved in ResNet convolutional neural networks and long short-term memory (LSTM) networks. We demonstrate near-software-equivalent inference accuracy with ResNet and LSTM networks while implementing all the computations associated with the weight layers and the activation functions on-chip. The chip can achieve a maximal throughput of 63.1 TOPS at an energy efficiency of 9.76 TOPS/W for 8-bit input/output matrix-vector multiplications.
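As an illustrative sketch only (not the chip's actual circuit model), the 8-bit input/output MVM described in the abstract can be modeled in NumPy: weights are held as analog conductances with an assumed Gaussian programming-noise term standing in for PCM non-idealities, inputs and outputs are symmetrically quantized to 8 bits, and the result is compared against the exact digital MVM. All noise scales and the quantization scheme are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256  # crossbar dimension, matching the 256 x 256 AIMC core size

# Weights stored as analog PCM conductances: add small programming noise
# (the 1% noise scale is an illustrative assumption, not a measured value).
W = rng.standard_normal((N, N)) * 0.1
W_analog = W + rng.normal(scale=0.01, size=W.shape)

def quantize8(v, scale):
    """Symmetric 8-bit quantization (illustrative; not the chip's ADC/DAC)."""
    return np.clip(np.round(v / scale), -128, 127) * scale

x = rng.standard_normal(N)
x_q = quantize8(x, scale=np.abs(x).max() / 127)  # 8-bit input (DAC side)

# Analog MVM: Ohm's law + Kirchhoff's current law performed in one step
# inside the crossbar, followed by 8-bit output quantization (ADC side).
y_analog = W_analog @ x_q
y_q = quantize8(y_analog, scale=np.abs(y_analog).max() / 127)

# Compare against the exact digital MVM.
y_exact = W @ x
err = np.linalg.norm(y_q - y_exact) / np.linalg.norm(y_exact)
print(f"relative MVM error: {err:.3f}")
```

As a side note on the reported figures, dividing peak throughput by energy efficiency gives the implied power envelope: 63.1 TOPS / 9.76 TOPS/W ≈ 6.5 W.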