论文标题
忽略以前的提示:语言模型的攻击技术
Ignore Previous Prompt: Attack Techniques For Language Models
论文作者
论文摘要
基于变压器的大型语言模型(LLMS)为大规模客户面向客户的应用程序的自然语言任务提供了有力的基础。但是,探索其恶意用户互动中出现的脆弱性的研究很少。通过提出刺激,这是针对基于掩模的迭代对抗及时构图的平淡的对齐框架,我们研究了如何通过简单的手工制作的输入来轻松地将生产中最广泛部署的语言模型GPT-3错误地错过。特别是,我们研究了两种类型的攻击 - 目标劫持和提示泄漏 - 并证明即使是低接合,但有足够意外的代理人,也可以轻松利用GPT-3的随机性质,从而产生长尾风险。可在https://github.com/agencyenterprise/promptinject上获得威信的代码。
Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3's stochastic nature, creating long-tail risks. The code for PromptInject is available at https://github.com/agencyenterprise/PromptInject.