Paper Title
Distillation-Resistant Watermarking for Model Protection in NLP
Paper Authors
Paper Abstract
How can we protect the intellectual property of trained NLP models? Modern NLP models are prone to stealing by querying and distilling from their publicly exposed APIs. However, existing protection methods such as watermarking only work for images but are not applicable to text. We propose Distillation-Resistant Watermarking (DRW), a novel technique to protect NLP models from being stolen via distillation. DRW protects a model by injecting watermarks into the victim's prediction probability corresponding to a secret key and is able to detect such a key by probing a suspect model. We prove that a protected model still retains the original accuracy within a certain bound. We evaluate DRW on a diverse set of NLP tasks including text classification, part-of-speech tagging, and named entity recognition. Experiments show that DRW protects the original model and detects stealing suspects at 100% mean average precision for all four tasks while the prior method fails on two.
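The abstract describes the core idea at a high level: perturb the victim model's output probabilities with a signal derived from a secret key, then detect that signal by probing a suspect model. Below is a minimal illustrative sketch of that general idea, not the authors' implementation; the key, the perturbation strength, the period, and the helper names (`watermark_probs`, `detect_watermark`) are all hypothetical choices made for this example.

```python
# Minimal sketch (illustrative, not the paper's implementation):
# inject a secret-key-dependent periodic perturbation into prediction
# probabilities, then probe a suspect model for that signal.
import hashlib
import numpy as np

SECRET_KEY = "my-secret-key"   # hypothetical secret key
EPSILON = 0.05                 # hypothetical watermark strength
PERIOD = 8                     # hypothetical period of the keyed signal


def _key_angle(text: str) -> float:
    """Map an input query to a deterministic phase derived from the secret key."""
    digest = hashlib.sha256((SECRET_KEY + text).encode()).hexdigest()
    return 2 * np.pi * (int(digest, 16) % PERIOD) / PERIOD


def watermark_probs(text: str, probs: np.ndarray) -> np.ndarray:
    """Perturb the victim's class probabilities with a keyed sinusoidal shift."""
    shift = EPSILON * np.sin(_key_angle(text))
    noisy = probs.copy()
    noisy[0] += shift                      # perturb a designated class probability
    noisy = np.clip(noisy, 1e-6, None)
    return noisy / noisy.sum()             # renormalize to a valid distribution


def detect_watermark(queries, suspect_predict, num_classes: int) -> float:
    """Correlate a suspect model's outputs with the keyed signal.

    A model distilled from the watermarked victim tends to inherit the
    perturbation, yielding a noticeably higher correlation than an
    independently trained model.
    """
    signal, response = [], []
    for q in queries:
        signal.append(np.sin(_key_angle(q)))
        probs = suspect_predict(q)                     # suspect's probability output
        response.append(probs[0] - 1.0 / num_classes)  # deviation of the probed class
    return float(np.corrcoef(signal, response)[0, 1])
```

In this sketch, detection reduces to a correlation test: the defender replays key-indexed queries against the suspect API and checks whether the probed class probability oscillates in phase with the secret signal.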