Paper Title
Learning to Ignore Adversarial Attacks
Paper Authors
Paper Abstract
Despite the strong performance of current NLP models, they can be brittle against adversarial attacks. To enable effective learning against adversarial inputs, we introduce the use of rationale models that can explicitly learn to ignore attack tokens. We find that the rationale models can successfully ignore over 90% of attack tokens. This approach leads to consistent sizable improvements ($\sim$10%) over baseline models in robustness on three datasets for both BERT and RoBERTa, and also reliably outperforms data augmentation with adversarial examples alone. In many cases, we find that our method is able to close the gap between model performance on a clean test set and an attacked test set and hence reduce the effect of adversarial attacks.
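To make the select-then-predict idea concrete, here is a minimal sketch of a rationale classifier in PyTorch: a selector scores each token and produces a hard keep/drop mask, and a predictor classifies using only the kept tokens. This is only an illustration under assumed simplifications (a toy embedding-and-pooling predictor, a straight-through hard gate, made-up module names and sizes); it is not the paper's BERT/RoBERTa-based implementation or training procedure.

# Toy select-then-predict rationale classifier (illustrative sketch only).
import torch
import torch.nn as nn

class ToyRationaleClassifier(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Selector scores each token; a high score means the token is kept.
        self.selector = nn.Linear(emb_dim, 1)
        # Predictor sees only the tokens the selector kept.
        self.predictor = nn.Linear(emb_dim, num_classes)

    def forward(self, token_ids):
        x = self.emb(token_ids)                   # (batch, seq, emb_dim)
        probs = torch.sigmoid(self.selector(x))   # (batch, seq, 1) keep-probabilities
        hard = (probs > 0.5).float()              # hard 0/1 token mask
        # Straight-through estimator: hard mask in the forward pass,
        # soft gradient through the probabilities in the backward pass.
        mask = hard + probs - probs.detach()
        # Mean-pool only the kept tokens, then classify.
        pooled = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.predictor(pooled), mask.squeeze(-1)

model = ToyRationaleClassifier()
logits, mask = model(torch.randint(0, 10000, (4, 32)))
print(logits.shape, mask.shape)  # torch.Size([4, 2]) torch.Size([4, 32])

In this sketch, an adversarially inserted token would ideally receive a low selector score and be zeroed out before pooling, which is the mechanism by which a rationale model can "learn to ignore" attack tokens.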