Paper Title


FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Authors

Kai Zhao, Sheng Di, Sihuan Li, Xin Liang, Yujia Zhai, Jieyang Chen, Kaiming Ouyang, Franck Cappello, Zizhong Chen

Abstract


Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this paper, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementation. (2) We design a novel workflow integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%~8% in both error-free and error-injected situations).
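To make the checksum idea concrete, the sketch below illustrates one way a checksum-based ABFT scheme can detect soft errors in a convolutional layer, independent of how the convolution itself is implemented: because convolution is linear in the kernel, the sum of the per-filter outputs must equal a single convolution with the summed (checksum) filter. This is a minimal illustrative sketch, not the paper's actual implementation; the function names (`conv2d`, `abft_check`), the single-channel setting, and the tolerance are all assumptions for demonstration.

```python
import numpy as np

def conv2d(x, w):
    # Naive valid-mode 2D cross-correlation on a single-channel input.
    # Stands in for any convolution implementation to be protected.
    H, W = x.shape
    kh, kw = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def abft_check(x, filters, outputs, tol=1e-6):
    # Filter-checksum verification (illustrative): convolution is linear
    # in the kernel, so sum_k conv(x, W_k) == conv(x, sum_k W_k).
    # A mismatch indicates a soft error corrupted some output element.
    checksum_filter = sum(filters)
    expected = conv2d(x, checksum_filter)
    actual = sum(outputs)
    return np.allclose(expected, actual, atol=tol)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 8))
    filters = [rng.standard_normal((3, 3)) for _ in range(4)]
    outputs = [conv2d(x, w) for w in filters]
    print(abft_check(x, filters, outputs))   # clean run passes the check
    outputs[1][2, 2] += 1.0                  # inject a soft error
    print(abft_check(x, filters, outputs))   # checksum mismatch detected
```

The check costs one extra convolution regardless of the number of filters, which hints at why checksum schemes like those in the paper can keep the runtime overhead low; correcting (rather than just detecting) an error requires additional checksums to localize the faulty element.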
