Paper Title

Estimating the Adversarial Robustness of Attributions in Text with Transformers

Paper Authors

Adam Ivankay, Mattia Rigotti, Ivan Girardi, Chiara Marchiori, Pascal Frossard

Paper Abstract

Explanations are crucial parts of deep neural network (DNN) classifiers. In high stakes applications, faithful and robust explanations are important to understand and gain trust in DNN classifiers. However, recent work has shown that state-of-the-art attribution methods in text classifiers are susceptible to imperceptible adversarial perturbations that alter explanations significantly while maintaining the correct prediction outcome. If undetected, this can critically mislead the users of DNNs. Thus, it is crucial to understand the influence of such adversarial perturbations on the networks' explanations and their perceptibility. In this work, we establish a novel definition of attribution robustness (AR) in text classification, based on Lipschitz continuity. Crucially, it reflects both attribution change induced by adversarial input alterations and perceptibility of such alterations. Moreover, we introduce a wide set of text similarity measures to effectively capture locality between two text samples and imperceptibility of adversarial perturbations in text. We then propose our novel TransformerExplanationAttack (TEA), a strong adversary that provides a tight estimation for attribution robustness in text classification. TEA uses state-of-the-art language models to extract word substitutions that result in fluent, contextual adversarial samples. Finally, with experiments on several text classification architectures, we show that TEA consistently outperforms current state-of-the-art AR estimators, yielding perturbations that alter explanations to a greater extent while being more fluent and less perceptible.
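To make the abstract's Lipschitz-based notion of attribution robustness concrete, here is a minimal sketch of how such a definition is commonly written; the paper's exact formulation may differ, and all notation below (F, A, d, d_text, N(s)) is assumed purely for illustration:

\hat{r}(s) \;=\; \max_{\tilde{s} \in \mathcal{N}(s),\; F(\tilde{s}) = F(s)} \; \frac{d\big(A(\tilde{s}, F),\, A(s, F)\big)}{d_{\text{text}}(\tilde{s}, s)}

where F is the text classifier, A(., F) its attribution map, d a distance between attributions, d_text a text similarity measure capturing perceptibility of the perturbation, and N(s) the set of admissible perturbed sentences. A large value of r(s) means a small, hard-to-perceive change can alter the explanation strongly, i.e. the attribution is less robust.

The abstract also describes TEA as using language models to extract word substitutions that yield fluent, contextual adversarial samples. Below is a minimal sketch of that building block, using a masked language model to propose in-context replacements for a single word. This is not the authors' TEA implementation; the model choice, helper function and scoring are assumptions for illustration only.

# Hedged sketch: propose fluent, in-context word substitutions with a masked LM.
# Model name and selection logic are assumptions, not the paper's implementation.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def candidate_substitutions(words, position, top_k=5):
    # Mask one word and let the language model suggest in-context replacements.
    masked = list(words)
    masked[position] = fill_mask.tokenizer.mask_token
    predictions = fill_mask(" ".join(masked), top_k=top_k)
    # Each prediction carries the proposed token and its LM score (a fluency proxy).
    return [(p["token_str"].strip(), p["score"]) for p in predictions]

words = "the movie was absolutely wonderful".split()
print(candidate_substitutions(words, position=4))

In a full attribution-robustness estimator along the lines described in the abstract, each candidate substitution would additionally be scored by how much it changes the attribution map while keeping the prediction unchanged and the text-similarity constraint satisfied.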
