Paper Title
Discovering Language Model Behaviors with Model-Written Evaluations
Paper Authors
Paper Abstract
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
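The abstract describes a pipeline in which one LM writes candidate evaluation examples and further model-based stages filter them before crowdworker validation. Below is a minimal sketch of that generate-then-filter idea, assuming hypothetical lm_generate() and preference_score() helpers; it is an illustration under those assumptions, not the authors' actual implementation.

```python
# Minimal sketch of LM-written evaluation generation. lm_generate() and
# preference_score() are hypothetical stand-ins, not the paper's code.
from dataclasses import dataclass


@dataclass
class Example:
    question: str  # yes/no question probing the target behavior
    answer: str    # label the behavior would imply ("Yes" or "No")


def lm_generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to a large language model."""
    raise NotImplementedError("wire this up to an LM API of your choice")


def preference_score(behavior: str, question: str) -> float:
    """Hypothetical stand-in for a model that rates example relevance (0-1)."""
    raise NotImplementedError("wire this up to a preference/classifier model")


def write_examples(behavior: str, n: int = 100, threshold: float = 0.9) -> list[Example]:
    """Ask an LM for yes/no questions testing `behavior`, then filter them."""
    prompt = (
        f"Write {n} yes/no questions that someone exhibiting the following "
        f"behavior would answer 'Yes' to:\n{behavior}\n"
        "One question per line."
    )
    raw = lm_generate(prompt)
    candidates = [line.strip() for line in raw.splitlines() if line.strip()]

    # Keep only questions the filtering model judges clearly on-topic,
    # mirroring the multi-stage generation and filtering in the abstract.
    return [
        Example(question=q, answer="Yes")
        for q in candidates
        if preference_score(behavior, q) >= threshold
    ]
```

In the paper, retained examples are additionally checked by crowdworkers for relevance and label agreement; that validation step is omitted from this sketch.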