PyGFI: Analyzing and Enhancing Robustness of Graph Neural Networks Against Hardware Errors

Paper Title

PyGFI: Analyzing and Enhancing Robustness of Graph Neural Networks Against Hardware Errors

Authors

Ruixuan Wang, Fred Lin, Daniel Moore, Sriram Sankar, Xun Jiao

Abstract

Graph neural networks (GNNs) have recently emerged as a promising learning paradigm in learning graph-structured data and have demonstrated wide success across various domains such as recommendation systems, social networks, and electronic design automation (EDA). Like other deep learning (DL) methods, GNNs are being deployed in sophisticated modern hardware systems, as well as dedicated accelerators. However, despite the popularity of GNNs and the recent efforts of bringing GNNs to hardware, the fault tolerance and resilience of GNNs have generally been overlooked. Inspired by the inherent algorithmic resilience of DL methods, this paper conducts, for the first time, a large-scale and empirical study of GNN resilience, aiming to understand the relationship between hardware faults and GNN accuracy. By developing a customized fault injection tool on top of PyTorch, we perform extensive fault injection experiments on various GNN models and application datasets. We observe that the error resilience of GNN models varies by orders of magnitude with respect to different models and application datasets. Further, we explore a low-cost error mitigation mechanism for GNN to enhance its resilience. This GNN resilience study aims to open up new directions and opportunities for future GNN accelerator design and architectural optimization.
