Paper Title

Input-specific Attention Subnetworks for Adversarial Detection

Paper Authors

Emil Biju, Anirudh Sriram, Pratyush Kumar, Mitesh M. Khapra

Paper Abstract

Self-attention heads are characteristic of Transformer models and have been well studied for interpretability and pruning. In this work, we demonstrate an altogether different utility of attention heads, namely for adversarial detection. Specifically, we propose a method to construct input-specific attention subnetworks (IAS) from which we extract three features to discriminate between authentic and adversarial inputs. The resultant detector significantly improves (by over 7.5%) the state-of-the-art adversarial detection accuracy for the BERT encoder on 10 NLU datasets with 11 different adversarial attack types. We also demonstrate that our method (a) is more accurate for larger models which are likely to have more spurious correlations and thus vulnerable to adversarial attack, and (b) performs well even with modest training sets of adversarial examples.

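The abstract describes the approach only at a high level, so the following is a minimal, hypothetical sketch rather than the authors' actual procedure: it keeps a per-input subset of BERT attention heads via the `head_mask` argument of Hugging Face `transformers` to form an input-specific subnetwork, then computes a few ad-hoc detector features (subnetwork confidence, full-vs-subnetwork prediction agreement, fraction of heads kept). The head-scoring rule, the `keep_fraction` parameter, and the feature choices are illustrative assumptions and are not the three features defined in the paper.

```python
# Hypothetical sketch of an input-specific attention subnetwork (IAS)-style
# detector feature extractor. Not the paper's method; see lead-in above.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# In practice a task-specific fine-tuned checkpoint would be used here.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

def subnetwork_features(text, keep_fraction=0.5):
    """Return simple features from a per-input attention subnetwork."""
    inputs = tokenizer(text, return_tensors="pt")

    # Full-network pass to score heads for this input.
    # Assumed proxy score: mean attention mass placed on the [CLS] token.
    with torch.no_grad():
        full = model(**inputs, output_attentions=True)
    # full.attentions: one (1, num_heads, seq_len, seq_len) tensor per layer.
    cls_mass = torch.stack(
        [a[0, :, :, 0].mean(dim=-1) for a in full.attentions]
    )  # shape: (num_layers, num_heads)

    # Keep the top-scoring heads per layer; prune the rest via head_mask.
    num_layers, num_heads = cls_mass.shape
    k = max(1, int(keep_fraction * num_heads))
    head_mask = torch.zeros(num_layers, num_heads)
    top = cls_mass.topk(k, dim=-1).indices
    head_mask.scatter_(1, top, 1.0)

    # Forward pass through the input-specific subnetwork.
    with torch.no_grad():
        sub = model(**inputs, head_mask=head_mask)

    # Example features a downstream detector could consume.
    full_probs = full.logits.softmax(-1)
    sub_probs = sub.logits.softmax(-1)
    return {
        "sub_confidence": sub_probs.max().item(),
        "agreement": (full_probs.argmax() == sub_probs.argmax()).float().item(),
        "kept_fraction": head_mask.mean().item(),
    }

print(subnetwork_features("The movie was surprisingly good."))
```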