Paper Title
Towards Heterogeneous Multi-core Accelerators Exploiting Fine-grained Scheduling of Layer-Fused Deep Neural Networks
Paper Authors
Paper Abstract
To keep up with the ever-growing performance demands of neural networks, specialized hardware (HW) accelerators are shifting towards multi-core and chiplet architectures. So far, these multi-accelerator systems exploit the increased parallelism by pipelining different NN layers across input batches on different cores to increase throughput. Yet, when pursued with the non-batched, layer-by-layer scheduling required by latency-critical applications, this approach fails to fully exploit the available HW resources towards energy-efficient execution at the edge. This work, therefore, enables fine-grained, depth-first scheduling of layer-fused DNNs onto multi-core architectures through an open-source modeling framework called Stream. Stream is capable of representing a wide range of scheduling granularities and HW architectures, and optimizes execution schedules towards minimal energy, minimal latency, and/or minimal memory footprint for constrained edge devices. We validate Stream against three SotA HW implementations employing layer-fused scheduling, showing tight matching with measured efficiencies. Using Stream in further explorations, we demonstrate that high-level architectural decisions greatly impact hardware efficiency under the fine-grained scheduling paradigm, reducing the energy-delay product by 2.4x for single-core architectures and by up to 30x for heterogeneous multi-core architectures compared to traditional scheduling at layer granularity.
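To make the scheduling contrast concrete, the following is a minimal, self-contained Python sketch (not the actual Stream API; all names and the toy workload are illustrative assumptions) comparing traditional layer-by-layer scheduling with fine-grained depth-first (layer-fused) scheduling on a two-layer 1-D convolution chain. It illustrates the core idea behind layer fusion: computing the second layer tile-by-tile means only a small slice of the intermediate feature map must ever be live, shrinking the peak on-chip memory footprint.

import numpy as np

def conv1d(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Valid 1-D convolution: output length = len(x) - len(k) + 1."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def layer_by_layer(x, k1, k2):
    """Traditional scheduling: fully materialize the intermediate tensor."""
    inter = conv1d(x, k1)       # whole intermediate feature map lives in memory
    out = conv1d(inter, k2)
    return out, len(inter)      # peak intermediate footprint = full feature map

def depth_first_fused(x, k1, k2, tile=4):
    """Layer-fused scheduling: produce layer-2 outputs tile-by-tile, keeping
    only the slice of layer 1's output that each output tile depends on."""
    halo = len(k2) - 1          # extra layer-1 outputs a layer-2 tile needs
    out_len = len(x) - len(k1) - len(k2) + 2
    out = np.empty(out_len)
    peak = 0
    for start in range(0, out_len, tile):
        stop = min(start + tile, out_len)
        # Stream/recompute only the required slice of the intermediate tensor.
        inter_slice = conv1d(x[start:stop + len(k1) - 1 + halo], k1)
        out[start:stop] = conv1d(inter_slice, k2)
        peak = max(peak, len(inter_slice))
    return out, peak            # peak intermediate footprint = one tile + halo

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, k1, k2 = rng.standard_normal(64), rng.standard_normal(3), rng.standard_normal(3)
    ref, full = layer_by_layer(x, k1, k2)
    fused, peak = depth_first_fused(x, k1, k2)
    assert np.allclose(ref, fused)  # identical results, different schedules
    print(f"intermediate footprint: layer-by-layer={full}, depth-first={peak}")

On this toy chain the fused schedule cuts the live intermediate data from the full 62-element feature map to a 6-element tile-plus-halo slice; the same principle, applied per-core across a multi-core accelerator, is what Stream's fine-grained schedule exploration optimizes for energy, latency, and memory footprint.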