Paper Title

Universal Backdoor Attacks Detection via Adaptive Adversarial Probe

Authors

Yuhang Wang, Huafeng Shi, Rui Min, Ruijia Wu, Siyuan Liang, Yichao Wu, Ding Liang, Aishan Liu

Abstract

Extensive evidence has demonstrated that deep neural networks (DNNs) are vulnerable to backdoor attacks, which motivates the development of backdoor attack detection. Most detection methods are designed to verify whether a model is infected with presumed types of backdoor attacks, yet in practice the adversary is likely to mount diverse backdoor attacks that are unforeseen to defenders, which challenges current detection strategies. In this paper, we focus on this more challenging scenario and propose a universal backdoor attack detection method named Adaptive Adversarial Probe (A2P). Specifically, we posit that the challenge of universal backdoor attack detection lies in the fact that different backdoor attacks often exhibit diverse characteristics in their trigger patterns (i.e., sizes and transparencies). Therefore, our A2P adopts a global-to-local probing framework, which adversarially probes images with adaptive regions/budgets to fit various backdoor triggers of different sizes/transparencies. Regarding the probing region, we propose an attention-guided region generation strategy that generates region proposals with different sizes/locations based on the attention of the target model, since trigger regions often manifest higher model activation. Considering the attack budget, we introduce box-to-sparsity scheduling that iteratively increases the perturbation budget from a box to a sparse constraint, so that we can better activate latent backdoors with different transparencies. Extensive experiments on multiple datasets (CIFAR-10, GTSRB, Tiny-ImageNet) demonstrate that our method outperforms state-of-the-art baselines by large margins (+12%).
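The abstract describes two components: attention-guided region proposals (windows with high model attention are likely trigger locations) and a box-to-sparsity schedule that gradually raises the perturbation budget. The NumPy sketch below is a minimal, hypothetical illustration of those two ideas only; the function names, the summed-attention window score, and the linear budget schedule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def attention_region_proposals(attention_map, sizes, top_k=3):
    """Hypothetical proposal step: for each window size, return the top-k
    square windows whose summed attention is highest, since trigger regions
    tend to show stronger model activation (per the abstract)."""
    H, W = attention_map.shape
    proposals = []
    for s in sizes:
        scored = []
        for y in range(0, H - s + 1):
            for x in range(0, W - s + 1):
                score = attention_map[y:y + s, x:x + s].sum()
                scored.append((score, (y, x, s)))
        scored.sort(key=lambda t: -t[0])           # highest attention first
        proposals.extend(region for _, region in scored[:top_k])
    return proposals

def box_to_sparsity_schedule(eps_box, eps_sparse, steps):
    """Hypothetical budget schedule: monotonically grow the per-pixel
    perturbation budget from a tight box constraint toward a looser,
    region-restricted (sparse) constraint over `steps` iterations."""
    return [eps_box + (eps_sparse - eps_box) * i / (steps - 1)
            for i in range(steps)]

if __name__ == "__main__":
    # Toy attention map with a "hot" 2x2 corner standing in for a trigger.
    att = np.zeros((8, 8))
    att[0:2, 0:2] = 1.0
    print(attention_region_proposals(att, sizes=[2], top_k=1))
    print(box_to_sparsity_schedule(0.05, 0.5, 5))
```

In a full probing loop, each proposed region would be adversarially perturbed under the current budget and the model's label flips inspected for backdoor-like behavior; that loop is omitted here because it depends on the target model.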
