Paper Title
Utilizing a null class to restrict decision spaces and defend against neural network adversarial attacks
Paper Authors
Paper Abstract
Despite recent progress, deep neural networks generally continue to be vulnerable to so-called adversarial examples--input images with small perturbations that change the output classification, despite no such change in the semantic meaning to human viewers. This is true even for seemingly simple challenges such as the MNIST digit classification task. In part, this suggests that these networks are not relying on the same set of object features that humans use to make these classifications. In this paper we examine an additional, and largely unexplored, cause behind this phenomenon--namely, the use of the conventional training paradigm in which the entire input space is parcellated among the training classes. Owing to this paradigm, the learned decision spaces for individual classes span excessively large regions of the input space and include images that have no semantic similarity to images in the training set. In this study, we train models that include a null class. That is, models may "opt out" of classifying an input image as one of the digit classes. During training, null images are created through a variety of methods, in an attempt to create tighter and more semantically meaningful decision spaces for the digit classes. The best-performing models classify nearly all adversarial examples as nulls, rather than mistaking them for members of an incorrect digit class, while simultaneously maintaining high accuracy on the unperturbed test set. The use of a null class and the training paradigm presented herein may provide an effective defense against adversarial attacks for some applications. Code for replicating this study will be made available at https://github.com/mattroos/null_class_adversarial_defense.
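The sketch below is not the authors' code (the paper's repository is linked above); it is a minimal illustration of the training paradigm the abstract describes: an MNIST classifier with an added 11th "null" output, trained on digit images plus synthetic null images. The specific null-image generators here (uniform noise and convex mixtures of digit pairs) are illustrative assumptions, since the abstract only states that null images are "created through a variety of methods."

```python
# Minimal sketch of null-class training, assuming PyTorch and two hypothetical
# null-image generators (uniform noise and mixed digit pairs).
import torch
import torch.nn as nn

NULL_CLASS = 10  # digits occupy classes 0-9; class 10 is the null ("opt-out") class


class DigitOrNullNet(nn.Module):
    """Small CNN with 11 outputs: 10 digit classes plus one null class."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 7 * 7, 11)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))


def make_null_batch(digit_batch: torch.Tensor):
    """Create null images and labels from a batch of digit images.

    Two assumed generators: (1) uniform random noise, and (2) convex mixtures
    of two different digit images, both of which lie off the digit manifold.
    """
    noise = torch.rand_like(digit_batch)
    mixed = 0.5 * digit_batch + 0.5 * digit_batch.roll(shifts=1, dims=0)
    nulls = torch.cat([noise, mixed], dim=0)
    labels = torch.full((nulls.size(0),), NULL_CLASS, dtype=torch.long)
    return nulls, labels


if __name__ == "__main__":
    # Usage inside one training step: append null images to each digit batch so
    # the model learns tighter decision regions for the digit classes.
    model = DigitOrNullNet()
    loss_fn = nn.CrossEntropyLoss()
    digits = torch.rand(8, 1, 28, 28)          # stand-in for real MNIST images
    digit_labels = torch.randint(0, 10, (8,))  # stand-in for real MNIST labels
    null_imgs, null_labels = make_null_batch(digits)
    x = torch.cat([digits, null_imgs], dim=0)
    y = torch.cat([digit_labels, null_labels], dim=0)
    loss = loss_fn(model(x), y)
    loss.backward()
    print(f"loss on one mixed batch: {loss.item():.3f}")
```

At test time, any input whose arg-max output is the null class is treated as rejected rather than assigned to a digit, which is how, per the abstract, adversarial examples can be flagged instead of misclassified.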