Paper Title

Extracting Training Data from Large Language Models

Paper Authors

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel

Paper Abstract

It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data. We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.
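
To make the attack described in the abstract concrete, below is a minimal sketch of the general extraction recipe: sample many generations from a public GPT-2 checkpoint, then rank candidates by how confidently the model predicts them, since unusually low perplexity can indicate memorized training text. This is an illustrative sketch only; the model name ("gpt2" via the Hugging Face transformers library), the sampling parameters, and the perplexity-only ranking are assumptions for demonstration, whereas the paper's full pipeline generates far more samples, seeds generation with Internet-scraped prefixes, and combines several membership-inference metrics.

```python
# Illustrative sketch of a training data extraction attack on GPT-2.
# Assumes the Hugging Face transformers and PyTorch libraries; all
# hyperparameters here are placeholders, not the paper's configuration.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()


def sample_candidates(n_samples=100, max_length=64):
    """Unconditionally sample short generations from the model."""
    input_ids = torch.full(
        (n_samples, 1), tokenizer.bos_token_id, dtype=torch.long, device=device
    )
    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            do_sample=True,
            top_k=40,
            max_length=max_length,
            pad_token_id=tokenizer.eos_token_id,
        )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


def perplexity(text):
    """Perplexity of `text` under the model; lower values flag possible memorization."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    if ids.numel() < 2:  # not enough tokens to compute a language-modeling loss
        return float("inf")
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()


# Rank sampled candidates by perplexity and inspect the most "confident" ones,
# which an attacker would then check against known sources for verbatim matches.
scored = sorted((perplexity(text), text) for text in sample_candidates())
for ppl, text in scored[:10]:
    print(f"{ppl:8.2f}  {text[:80]!r}")
```

In practice, ranking by raw perplexity alone surfaces many trivially repetitive strings; the paper therefore also compares perplexity against reference measures (for example, a smaller model or zlib compression entropy) to better isolate genuinely memorized sequences.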
