Paper Title
On the Transferability of Adversarial Attacks against Neural Text Classifiers
Paper Authors
Paper Abstract
Deep neural networks are vulnerable to adversarial attacks, where a small perturbation to an input alters the model prediction. In many cases, malicious inputs intentionally crafted for one model can fool another model. In this paper, we present the first study to systematically investigate the transferability of adversarial examples for text classification models and explore how various factors, including network architecture, tokenization scheme, word embedding, and model capacity, affect the transferability of adversarial examples. Based on these studies, we propose a genetic algorithm to find an ensemble of models that can be used to induce adversarial examples to fool almost all existing models. Such adversarial examples reflect the defects of the learning process and the data bias in the training set. Finally, we derive word replacement rules that can be used for model diagnostics from these adversarial examples.
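The abstract describes a genetic algorithm that searches for an ensemble of models against which adversarial examples are crafted so that they transfer to other models. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: the pool size, ensemble budget, and the `transfer_rate` fitness function are all assumed placeholders (in practice the fitness would craft adversarial examples against the selected ensemble and measure how often they fool held-out models).

```python
import random

# Assumed settings for this sketch; the paper's actual values are not given here.
NUM_MODELS = 12          # size of the pool of trained text classifiers
ENSEMBLE_BUDGET = 4      # maximum number of models in the ensemble
POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 0.1

def transfer_rate(mask):
    # Placeholder fitness: in a real setup, attack the models selected by
    # `mask` and return the fraction of held-out models that are fooled.
    rng = random.Random(sum(i * bit for i, bit in enumerate(mask)))
    return rng.random()

def random_mask():
    # Binary mask over the model pool with exactly ENSEMBLE_BUDGET models on.
    chosen = random.sample(range(NUM_MODELS), ENSEMBLE_BUDGET)
    return [1 if i in chosen else 0 for i in range(NUM_MODELS)]

def crossover(a, b):
    # Single-point crossover of two parent masks.
    point = random.randrange(1, NUM_MODELS)
    return a[:point] + b[point:]

def mutate(mask):
    # Flip bits with small probability, then trim back to the budget.
    mask = [bit ^ (random.random() < MUTATION_RATE) for bit in mask]
    ones = [i for i, bit in enumerate(mask) if bit]
    for i in random.sample(ones, max(0, len(ones) - ENSEMBLE_BUDGET)):
        mask[i] = 0
    return mask

population = [random_mask() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    scored = sorted(population, key=transfer_rate, reverse=True)
    parents = scored[: POP_SIZE // 2]                      # selection
    children = [mutate(crossover(*random.sample(parents, 2)))
                for _ in range(POP_SIZE - len(parents))]   # crossover + mutation
    population = parents + children

best = max(population, key=transfer_rate)
print("selected model indices:", [i for i, bit in enumerate(best) if bit])
```

The binary-mask encoding and single-point crossover are one common way to run a genetic search over subsets; swapping in a real attack-and-evaluate routine for `transfer_rate` is the part that ties this sketch to the transferability study described above.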