Paper Title
Neural network fragile watermarking with no model performance degradation
Paper Authors
Paper Abstract
Deep neural networks are vulnerable to malicious fine-tuning attacks such as data poisoning and backdoor attacks. Accordingly, recent research has proposed methods to detect malicious fine-tuning of neural network models, but these methods usually degrade the performance of the protected model. We therefore propose a novel fragile neural network watermarking method that causes no model performance degradation. In the watermarking process, we train a generative model with a specific loss function and a secret key to generate triggers that are sensitive to fine-tuning of the target classifier. In the verification process, we feed each fragile trigger to the watermarked classifier and record the resulting label. Malicious fine-tuning can then be detected by comparing the secret key against these labels. Experiments on classic datasets and classifiers show that the proposed method effectively detects malicious model fine-tuning with no model performance degradation.
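The verification step described in the abstract reduces to a label comparison. Below is a minimal sketch of that step, assuming a PyTorch classifier, a batch of pre-generated trigger images, and the secret key stored as the expected label sequence; all names here (`verify_watermark`, `triggers`, `secret_key`) are illustrative, not taken from the paper.

```python
import torch

def verify_watermark(classifier, triggers, secret_key):
    """Fragile-watermark check: compare classifier outputs on the trigger
    set against the secret key (the expected label sequence).

    Illustrative assumptions: `classifier` is a torch.nn.Module returning
    logits of shape (N, num_classes), `triggers` is a tensor of N trigger
    images, and `secret_key` is a length-N tensor of expected labels.
    """
    classifier.eval()
    with torch.no_grad():
        preds = classifier(triggers).argmax(dim=1)
    # Any mismatch between predicted labels and the secret key signals
    # that the model's parameters have changed since watermarking.
    mismatches = (preds != secret_key).sum().item()
    return mismatches == 0, mismatches

# Example usage (hypothetical tensors):
# intact, n_flipped = verify_watermark(model, trigger_batch, key_labels)
```

Because the triggers are trained to be fragile, an unmodified model reproduces the secret key exactly, so any nonzero mismatch count flags that the classifier has been fine-tuned after watermarking.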