Paper Title

A Geometry-Inspired Attack for Generating Natural Language Adversarial Examples

Authors

Zhao Meng, Roger Wattenhofer

Abstract

Generating adversarial examples for natural language is hard, as natural language consists of discrete symbols, and examples are often of variable lengths. In this paper, we propose a geometry-inspired attack for generating natural language adversarial examples. Our attack generates adversarial examples by iteratively approximating the decision boundary of Deep Neural Networks (DNNs). Experiments on two datasets with two different models show that our attack fools natural language models with high success rates, while only replacing a few words. Human evaluation shows that adversarial examples generated by our attack are hard for humans to recognize. Further experiments show that adversarial training can improve model robustness against our attack.
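The core idea of "iteratively approximating the decision boundary" can be illustrated with a minimal sketch. The toy below is an illustration, not the paper's implementation: it assumes a linear classifier over averaged word embeddings, computes the minimal (DeepFool-style) perturbation onto the linear decision boundary, and then realizes that continuous perturbation discretely by greedily replacing one word per iteration. The vocabulary, embeddings, and model here are all invented for the example.

```python
import numpy as np

# Toy vocabulary and embeddings (assumed; the paper uses a real model
# and pretrained embeddings).
rng = np.random.default_rng(0)
vocab = ["good", "great", "fine", "bad", "poor", "awful"]
emb = {tok: rng.normal(size=8) for tok in vocab}

# Toy linear "DNN" f(x) = w.x + b over the averaged sentence embedding.
w = rng.normal(size=8)
b = 0.1

def sentence_vec(words):
    return np.mean([emb[t] for t in words], axis=0)

def boundary_attack(words, max_iters=10):
    """Iteratively approximate the decision boundary: at each step,
    compute the minimal L2 perturbation of the sentence embedding that
    crosses the (linear) boundary, then approximate that continuous
    perturbation by replacing the single word whose substitution moves
    the sentence embedding closest to the projected point."""
    words = list(words)
    orig_sign = np.sign(w @ sentence_vec(words) + b)
    for _ in range(max_iters):
        x = sentence_vec(words)
        score = w @ x + b
        if np.sign(score) != orig_sign:
            return words  # label flipped: adversarial example found
        # Closed-form projection onto a linear boundary, with a small
        # overshoot so the projected point lies just past the boundary.
        delta = -score * w / (w @ w)
        target = x + 1.05 * delta
        # Greedy discrete step: try every single-word replacement.
        best = None
        for i, old in enumerate(words):
            for new in vocab:
                if new == old:
                    continue
                cand = words[:i] + [new] + words[i + 1:]
                dist = np.linalg.norm(sentence_vec(cand) - target)
                if best is None or dist < best[0]:
                    best = (dist, cand)
        words = best[1]
    return words

adv = boundary_attack(["good", "fine"])
print(adv)
```

The split between a continuous boundary projection and a discrete word substitution mirrors the difficulty the abstract points out: language is made of discrete symbols, so the continuous perturbation can only ever be approximated by swapping a few words.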
