Paper Title
SSLGuard: A Watermarking Scheme for Self-supervised Learning Pre-trained Encoders
Authors
Abstract
Self-supervised learning is an emerging machine learning paradigm. Compared to supervised learning, which leverages high-quality labeled datasets, self-supervised learning relies on unlabeled datasets to pre-train powerful encoders that can then serve as feature extractors for various downstream tasks. The huge amounts of data and computational resources consumed make the encoders themselves valuable intellectual property of the model owner. Recent research has shown that a machine learning model's copyright is threatened by model stealing attacks, which aim to train a surrogate model to mimic the behavior of a given model. We empirically show that pre-trained encoders are highly vulnerable to model stealing attacks. However, most current copyright protection efforts, such as watermarking, concentrate on classifiers, and the intrinsic challenges of copyright protection for pre-trained encoders remain largely unstudied. We fill this gap by proposing SSLGuard, the first watermarking scheme for pre-trained encoders. Given a clean pre-trained encoder, SSLGuard injects a watermark into it and outputs a watermarked version. A shadow training technique is also applied to preserve the watermark under potential model stealing attacks. Our extensive evaluation shows that SSLGuard is effective in watermark injection and verification, and that it is robust against model stealing and other watermark removal attacks such as input noising, output perturbing, overwriting, model pruning, and fine-tuning.
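To make the threat model concrete, below is a minimal, hypothetical sketch of a model stealing attack against a pre-trained encoder: the attacker queries the victim encoder with unlabeled data and trains a surrogate encoder to reproduce the returned embeddings. This is not the paper's implementation; the surrogate architecture, loss, and all hyperparameters are illustrative assumptions.

```python
# Illustrative sketch of encoder stealing (not the authors' code).
# Assumptions: black-box query access to a victim encoder that maps images to
# embedding vectors, and an unlabeled attacker dataset wrapped in a DataLoader.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallEncoder(nn.Module):
    """Hypothetical surrogate encoder mapping images to embedding vectors."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, embed_dim)

    def forward(self, x):
        return self.head(self.features(x))


def steal_encoder(victim, surrogate, loader, epochs: int = 10, lr: float = 1e-3):
    """Train `surrogate` to reproduce `victim`'s embeddings on attacker data."""
    victim.eval()
    opt = torch.optim.Adam(surrogate.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:                 # labels are ignored (unlabeled attack)
            with torch.no_grad():
                target = victim(x)          # black-box query to the victim encoder
            pred = surrogate(x)
            # Push surrogate embeddings toward the victim's via cosine similarity.
            loss = 1.0 - F.cosine_similarity(pred, target, dim=1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return surrogate
```

A watermarking scheme for encoders, such as SSLGuard, is then evaluated by whether its verification procedure still succeeds on a surrogate obtained this way, in addition to surviving removal attacks like pruning and fine-tuning.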