Paper Title

Attention Hijacking in Trojan Transformers

Authors

Weimin Lyu, Songzhu Zheng, Tengfei Ma, Haibin Ling, Chao Chen

Abstract

Trojan attacks pose a severe threat to AI systems. Recent work on Transformer models has gained explosive popularity, and self-attention is now indispensable. This raises a central question: can we reveal Trojans through the attention mechanisms in BERTs and ViTs? In this paper, we investigate the attention hijacking pattern in Trojan AIs, i.e., the trigger token "kidnaps" the attention weights when a specific trigger is present. We observe a consistent attention hijacking pattern in Trojan Transformers from both the Natural Language Processing (NLP) and Computer Vision (CV) domains. This intriguing property helps us understand the Trojan mechanism in BERTs and ViTs. We also propose an Attention-Hijacking Trojan Detector (AHTD) to discriminate Trojan AIs from clean ones.
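The "hijacking" signal described above can be illustrated with a minimal toy sketch: measure how much attention mass every query position pays to each token, and flag the token that absorbs a disproportionate share. The attention matrix and the helper function below are synthetic illustrations, not the paper's actual AHTD implementation.

```python
import numpy as np

def attention_received(attn: np.ndarray, token_idx: int) -> float:
    """Average attention weight that all query positions pay to `token_idx`.

    `attn` is one head's attention matrix: rows are query positions,
    columns are key positions, and each row sums to 1.
    """
    return float(attn[:, token_idx].mean())

# Synthetic 4-token attention matrix in which token 2 "kidnaps"
# most of the attention mass, mimicking a trigger token.
attn = np.array([
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.15, 0.75, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.80, 0.10],
])

scores = [attention_received(attn, i) for i in range(attn.shape[1])]
hijacker = int(np.argmax(scores))
print(hijacker, scores[hijacker])  # token 2 receives the bulk of attention
```

In a real Transformer this statistic would be computed per head from the model's attention outputs, and a detector could compare its distribution between inputs with and without a candidate trigger.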
