Paper Title
Textual Manifold-based Defense Against Natural Language Adversarial Examples
Paper Authors
Paper Abstract
Recent studies on adversarial images have shown that they tend to leave the underlying low-dimensional data manifold, making it significantly more challenging for current models to predict them correctly. This so-called off-manifold conjecture has inspired a novel line of defenses against adversarial attacks on images. In this study, we find that a similar phenomenon occurs in the contextualized embedding space induced by pretrained language models, in which adversarial texts tend to have embeddings that diverge from the manifold of natural ones. Based on this finding, we propose Textual Manifold-based Defense (TMD), a defense mechanism that projects text embeddings onto an approximated embedding manifold before classification. This projection reduces the complexity of potential adversarial examples, which ultimately enhances the robustness of the protected model. In extensive experiments, our method consistently and significantly outperforms previous defenses under various attack settings without trading off clean accuracy. To the best of our knowledge, this is the first NLP defense that leverages the manifold structure against adversarial attacks. Our code is available at \url{https://github.com/dangne/tmd}.
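To make the projection step concrete, here is a minimal, hypothetical PyTorch sketch of on-manifold projection via latent-space search. It assumes a generator network (here called `generator`) that maps latent codes to contextualized embeddings and has been fit on embeddings of natural text; the function name, latent dimension, and optimization schedule are illustrative assumptions, not the authors' exact TMD implementation.

```python
import torch
import torch.nn as nn

def project_onto_manifold(embedding: torch.Tensor,
                          generator: nn.Module,
                          latent_dim: int = 64,
                          steps: int = 100,
                          lr: float = 0.1) -> torch.Tensor:
    """Find a point on the generator's learned manifold close to `embedding`
    by gradient descent in the generator's latent space.

    Assumption: `generator` maps latents of size `latent_dim` to embeddings
    and was trained on embeddings of natural (non-adversarial) text.
    """
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Distance between the candidate on-manifold point and the input.
        loss = torch.norm(generator(z) - embedding)
        loss.backward()
        opt.step()
    with torch.no_grad():
        # On-manifold surrogate embedding, passed to the classifier
        # in place of the raw (potentially adversarial) embedding.
        return generator(z)
```

The downstream classifier then receives the projected embedding instead of the raw one. Since adversarial embeddings tend to lie off the natural manifold, projection should strip much of the adversarial perturbation while leaving natural inputs largely unchanged, which is consistent with the abstract's claim of robustness gains without a loss in clean accuracy.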