Paper Title

Character-level White-Box Adversarial Attacks against Transformers via Attachable Subwords Substitution

Paper Authors

Liu, Aiwei, Yu, Honghai, Hu, Xuming, Li, Shu'ang, Lin, Li, Ma, Fukun, Yang, Yawen, Wen, Lijie

Paper Abstract


We propose the first character-level white-box adversarial attack method against transformer models. The intuition of our method comes from the observation that words are split into subtokens before being fed into the transformer models, and the substitution between two close subtokens has a similar effect to a character modification. Our method mainly contains three steps. First, a gradient-based method is adopted to find the most vulnerable words in the sentence. Then we split the selected words into subtokens to replace the original tokenization result from the transformer tokenizer. Finally, we utilize an adversarial loss to guide the substitution of attachable subtokens, in which the Gumbel-softmax trick is introduced to ensure gradient propagation. Meanwhile, we introduce visual and length constraints in the optimization process to achieve minimal character modifications. Extensive experiments on both sentence-level and token-level tasks demonstrate that our method outperforms previous attack methods in terms of success rate and edit distance. Furthermore, human evaluation verifies that our adversarial examples preserve their original labels.
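The third step relies on the Gumbel-softmax trick to make a discrete choice among candidate subtokens while keeping the relaxation differentiable for gradient propagation. Below is a minimal, self-contained sketch of that sampling step; the function name, candidate scores, and temperature value are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the Gumbel-softmax trick: sample a relaxed one-hot
# distribution over candidate subtoken substitutions. All names and
# values here are illustrative, not taken from the paper's code.
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw a relaxed one-hot sample over candidate subtokens.

    logits: unnormalized scores, one per candidate substitution.
    Returns a list of non-negative weights summing to 1; as the
    temperature approaches 0, the sample approaches a hard one-hot
    choice, which is what makes the trick usable for discrete
    substitution decisions in a gradient-based optimization.
    """
    # Perturb each logit with Gumbel(0, 1) noise: g = -log(-log(u)),
    # u ~ Uniform(0, 1). Clamp u away from 0 to avoid log(0).
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / temperature)
    # Numerically stabilized softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    z = sum(exps)
    return [e / z for e in exps]


# Example: scores for three hypothetical candidate subtokens of a word.
weights = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5)
chosen = max(range(len(weights)), key=lambda i: weights[i])
```

In an actual attack, the relaxed weights would multiply the candidate subtokens' embeddings so the adversarial loss can backpropagate through the (soft) substitution choice.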
