Paper Title
Model-Architecture Co-Design for High Performance Temporal GNN Inference on FPGA
Paper Authors
Paper Abstract
Temporal Graph Neural Networks (TGNNs) are powerful models for capturing temporal, structural, and contextual information on temporal graphs. The temporal node embeddings they generate outperform those of other methods in many downstream tasks. Real-world applications require high-performance inference on real-time streaming dynamic graphs. However, these models usually rely on complex attention mechanisms to capture relationships among temporal neighbors. In addition, maintaining vertex memory suffers from an intrinsic temporal data dependency that hinders task-level parallelism, making it inefficient on general-purpose processors. In this work, we present a novel model-architecture co-design for inference in memory-based TGNNs on FPGAs. The key modeling optimizations we propose include a lightweight method to compute attention scores and a related temporal neighbor pruning strategy to further reduce computation and memory accesses. These are holistically coupled with key hardware optimizations that leverage FPGA hardware. We replace the temporal sampler with an on-chip FIFO-based hardware sampler and the time encoder with a look-up table. We train our simplified models using knowledge distillation to ensure similar accuracy vis-à-vis the original model. Taking advantage of the model optimizations, we propose a principled hardware architecture using batching, pipelining, and prefetching techniques to further improve performance. We also propose a hardware mechanism to ensure chronological vertex updates without sacrificing computation parallelism. We evaluate the performance of the proposed hardware accelerator on three real-world datasets.
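To make the FIFO-based sampler substitution concrete, the following is a minimal Python sketch of how a fixed-depth FIFO per vertex can stand in for a most-recent-neighbor temporal sampler; it models the on-chip behavior in software. The class name `FIFONeighborSampler`, the FIFO depth `k`, and the `(src, dst, t)` event format are illustrative assumptions, not details taken from the paper.

```python
from collections import deque

# Hypothetical software model of a FIFO-based most-recent-neighbor
# sampler. The FIFO depth k and the event format (src, dst, t) are
# assumptions for illustration.

class FIFONeighborSampler:
    def __init__(self, num_vertices, k=4):
        # One bounded FIFO per vertex: appending past capacity evicts
        # the oldest entry, so each FIFO always holds the k most
        # recent temporal neighbors of that vertex.
        self.fifos = [deque(maxlen=k) for _ in range(num_vertices)]

    def insert(self, src, dst, t):
        # A streaming edge updates both endpoints' neighbor FIFOs.
        self.fifos[src].append((dst, t))
        self.fifos[dst].append((src, t))

    def sample(self, v):
        # Sampling is a plain read of the FIFO contents; no sorting
        # or search over the full temporal neighbor list is needed.
        return list(self.fifos[v])

# Usage on a toy stream of timestamped edges.
sampler = FIFONeighborSampler(num_vertices=5, k=2)
for src, dst, t in [(0, 1, 1.0), (0, 2, 2.0), (0, 3, 3.0)]:
    sampler.insert(src, dst, t)
print(sampler.sample(0))  # [(2, 2.0), (3, 3.0)]: the two most recent
```

The design choice this illustrates is that a bounded FIFO turns temporal sampling into a constant-time read, which maps naturally onto on-chip buffers.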
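Similarly, here is a minimal sketch of the look-up-table time encoder, assuming the original encoder has the functional form z(Δt) = cos(Δt·ω + φ) commonly used in memory-based TGNNs. The class name, table size, and nearest-entry quantization scheme are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical look-up-table time encoder, assuming the original
# encoder computes z(dt) = cos(dt * w + b). Table size and
# quantization step are illustrative, not the paper's values.

class LUTTimeEncoder:
    def __init__(self, w, b, t_max, n_entries=1024):
        # Precompute cos(t * w + b) for n_entries quantized time
        # deltas in [0, t_max]; inference then reduces to one table read.
        self.dt = t_max / (n_entries - 1)
        ts = np.arange(n_entries) * self.dt
        self.table = np.cos(ts[:, None] * w[None, :] + b[None, :])

    def __call__(self, t):
        # Quantize the time delta to the nearest table index,
        # clamping deltas beyond t_max to the last entry.
        idx = int(min(round(t / self.dt), len(self.table) - 1))
        return self.table[idx]

# Usage: an 8-dimensional encoding over time deltas up to 10_000.
rng = np.random.default_rng(0)
enc = LUTTimeEncoder(w=rng.normal(size=8), b=rng.normal(size=8), t_max=10_000)
print(enc(123.4))
```

The trade-off sketched here is the usual one for a LUT substitution: trigonometric evaluation is replaced by a single memory read at the cost of quantization error controlled by the table size.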