Paper Title
A Sweet Rabbit Hole by DARCY: Using Honeypots to Detect Universal Trigger's Adversarial Attacks
Authors
Abstract
The Universal Trigger (UniTrigger) is a recently proposed, powerful adversarial textual attack method. Using a learning-based mechanism, UniTrigger generates a fixed phrase that, when added to any benign input, can drop the prediction accuracy of a textual neural network (NN) model to near zero on a target class. To defend against this potentially harmful attack, we borrow the "honeypot" concept from the cybersecurity community and propose DARCY, a honeypot-based defense framework against UniTrigger. DARCY greedily searches for and injects multiple trapdoors into an NN model to "bait and catch" potential attacks. Through comprehensive experiments across four public datasets, we show that DARCY detects UniTrigger's adversarial attacks with a true positive rate (TPR) of up to 99% and a false positive rate (FPR) below 2% in most cases, while maintaining the prediction accuracy (in F1) for clean inputs within a 1% margin. We also demonstrate that DARCY with multiple trapdoors is robust to a diverse set of attack scenarios in which attackers have varying levels of knowledge and skill. Source code will be released upon the acceptance of this paper.
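To make the reported detection metrics concrete, the following is a minimal sketch of how TPR and FPR could be computed for a trapdoor-style detector. It is not DARCY's implementation: the trapdoor tokens, the `is_flagged` helper, and the exact-token-match rule are all hypothetical stand-ins for the trapdoors and detector that the abstract only describes at a high level.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical trapdoor tokens, chosen only for illustration.
# The abstract says DARCY finds its trapdoors by a greedy search.
TRAPDOORS = frozenset({"zyx", "qwv"})


def is_flagged(text: str, trapdoors: Iterable[str] = TRAPDOORS) -> bool:
    """Flag an input when any whitespace-separated token is a trapdoor.

    Assumption: exact token matching stands in for a real detector.
    """
    trapdoor_set = set(trapdoors)
    return any(token in trapdoor_set for token in text.lower().split())


def tpr_fpr(texts: Sequence[str], is_adversarial: Sequence[bool]) -> Tuple[float, float]:
    """Return (TPR, FPR) of the detector over labeled inputs.

    TPR = flagged adversarial inputs / all adversarial inputs.
    FPR = flagged clean inputs / all clean inputs.
    """
    true_pos = false_pos = n_adv = n_clean = 0
    for text, adversarial in zip(texts, is_adversarial):
        flagged = is_flagged(text)
        if adversarial:
            n_adv += 1
            true_pos += int(flagged)
        else:
            n_clean += 1
            false_pos += int(flagged)
    tpr = true_pos / n_adv if n_adv else 0.0
    fpr = false_pos / n_clean if n_clean else 0.0
    return tpr, fpr


# Toy inputs: three "attacks" (two of which hit a trapdoor) and two clean texts.
texts = ["zyx great movie", "terrible plot", "qwv fine acting", "lovely film", "bad xx film"]
labels = [True, False, True, False, True]
tpr, fpr = tpr_fpr(texts, labels)
```

In this toy setup, an attack that lands on a planted token is caught, while one that avoids every trapdoor counts against TPR; a clean input that happens to contain a trapdoor token would count against FPR.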