Paper Title
Red-Teaming the Stable Diffusion Safety Filter
Paper Authors
Paper Abstract
Stable Diffusion is a recent open-source image generation model comparable to proprietary models such as DALL·E, Imagen, or Parti. Stable Diffusion comes with a safety filter that aims to prevent the generation of explicit images. Unfortunately, the filter is obfuscated and poorly documented. This makes it hard for users to prevent misuse in their applications and to understand the filter's limitations and improve it. We first show that it is easy to generate disturbing content that bypasses the safety filter. We then reverse-engineer the filter and find that while it aims to prevent sexual content, it ignores violence, gore, and other similarly disturbing content. Based on our analysis, we argue that safety measures in future model releases should strive to be fully open and properly documented to stimulate security contributions from the community.
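To make the abstract's claim concrete, the kind of filter discussed here can be sketched as an embedding-similarity check: the generated image is embedded, compared against a fixed list of "unsafe concept" embeddings, and blocked if any similarity exceeds a threshold. The sketch below is purely illustrative, assuming toy 4-dimensional embeddings and made-up thresholds rather than the actual filter's concept list or values; such a filter only blocks content close to the concepts it was given, which is consistent with the finding that sexual content is targeted while violence and gore are not.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_flagged(image_embedding, concept_embeddings, thresholds):
    """Flag an image if its embedding is close to any 'unsafe concept'.

    A concept absent from `concept_embeddings` (e.g. violence, gore)
    can never trigger the filter, no matter how disturbing the image.
    """
    return any(cosine_sim(image_embedding, c) > t
               for c, t in zip(concept_embeddings, thresholds))

# Toy, made-up embeddings standing in for CLIP-style vectors.
concepts = [np.array([1.0, 0.0, 0.0, 0.0]),
            np.array([0.0, 1.0, 0.0, 0.0])]
thresholds = [0.8, 0.8]

near_concept = np.array([0.9, 0.1, 0.0, 0.0])   # close to concept 0
unrelated    = np.array([0.0, 0.0, 1.0, 0.0])   # far from all concepts

print(is_flagged(near_concept, concepts, thresholds))  # True
print(is_flagged(unrelated, concepts, thresholds))     # False
```

The key design property this illustrates is that the filter's coverage is exactly its concept list: extending coverage to other categories of disturbing content would amount to adding embeddings for those concepts.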