Paper Title
Non-Intrusive Detection of Adversarial Deep Learning Attacks via Observer Networks
Paper Authors
Paper Abstract
Recent studies have shown that deep learning models are vulnerable to specifically crafted adversarial inputs that are quasi-imperceptible to humans. In this letter, we propose a novel method to detect adversarial inputs by augmenting the main classification network with multiple binary detectors (observer networks) that take inputs from the hidden layers of the original network (convolutional kernel outputs) and classify the input as clean or adversarial. During inference, the detectors are treated as part of an ensemble network, and the input is deemed adversarial if at least half of the detectors classify it as such. The proposed method avoids the trade-off between classification accuracy on clean and adversarial samples, since the original classification network is not modified during the detection process. The use of multiple observer networks makes attacking the detection mechanism non-trivial even when the attacker is aware of the victim classifier. We achieve a 99.5% detection accuracy on the MNIST dataset and 97.5% on the CIFAR-10 dataset using the Fast Gradient Sign Attack in a semi-white-box setup. The number of false positive detections is a mere 0.12% in the worst-case scenario.
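To make the detection protocol concrete, the following is a minimal PyTorch sketch of the scheme the abstract describes: binary observer networks tap hidden convolutional activations of an unmodified backbone classifier, and a majority vote over their outputs flags an input as adversarial. The Backbone and Observer architectures and the choice of tapped layers here are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Main classifier; its hidden activations are tapped by the observers
    but the network itself is never modified by the detection mechanism."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        h1 = self.conv1(x)   # tapped by observer 1
        h2 = self.conv2(h1)  # tapped by observer 2
        logits = self.fc(self.pool(h2).flatten(1))
        return logits, (h1, h2)

class Observer(nn.Module):
    """Binary detector over one hidden layer: clean (0) vs. adversarial (1)."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2))

    def forward(self, h):
        return self.net(h)

def detect(backbone, observers, x):
    """Ensemble rule: the input is deemed adversarial if at least half
    of the observers classify it as such."""
    with torch.no_grad():
        logits, hiddens = backbone(x)
        votes = torch.stack(
            [obs(h).argmax(dim=1) for obs, h in zip(observers, hiddens)])
    is_adv = votes.float().mean(dim=0) >= 0.5
    return logits.argmax(dim=1), is_adv

backbone = Backbone()
observers = [Observer(32), Observer(64)]  # one per tapped hidden layer
preds, is_adv = detect(backbone, observers, torch.randn(4, 3, 32, 32))
print(preds, is_adv)

In this sketch the observers would be trained as ordinary binary classifiers on pairs of clean and adversarially perturbed inputs while the backbone's weights stay frozen, which is what preserves clean-sample accuracy.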