Paper Title

Self-critiquing models for assisting human evaluators

Paper Authors

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike

Paper Abstract

We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. Our models help find naturally occurring flaws in both model and human written summaries, and intentional flaws in summaries written by humans to be deliberately misleading. We study scaling properties of critiquing with both topic-based summarization and synthetic tasks. Larger models write more helpful critiques, and on most tasks, are better at self-critiquing, despite having harder-to-critique outputs. Larger models can also integrate their own self-critiques as feedback, refining their own summaries into better ones. Finally, we motivate and introduce a framework for comparing critiquing ability to generation and discrimination ability. Our measurements suggest that even large models may still have relevant knowledge they cannot or do not articulate as critiques. These results are a proof of concept for using AI-assisted human feedback to scale the supervision of machine learning systems to tasks that are difficult for humans to evaluate directly. We release our training datasets, as well as samples from our critique assistance experiments.
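
The self-critique-and-refine loop described in the abstract can be sketched roughly as follows. This is an illustrative outline only: `query_model` is a hypothetical placeholder for sampling from the fine-tuned model, and the prompt formats are assumptions for exposition, not the paper's actual prompts or interface.

```python
# Illustrative sketch of the critique-and-refine loop from the abstract.
# NOTE: `query_model` and the prompt formats below are assumptions for
# illustration; they are not the paper's released code or prompts.

def query_model(prompt: str) -> str:
    """Placeholder: sample a completion from a fine-tuned language model."""
    raise NotImplementedError("wire this to an actual model or API")


def critique_and_refine(passage: str, topic: str) -> dict:
    # 1. Generate an initial topic-based summary of the passage.
    summary = query_model(
        f"Passage:\n{passage}\n\nSummarize the passage with respect to: {topic}\nSummary:"
    )
    # 2. Ask the same model to critique its own summary in natural language.
    critique = query_model(
        f"Passage:\n{passage}\n\nSummary:\n{summary}\n\n"
        "List flaws in the summary (inaccuracies, omissions, irrelevance):"
    )
    # 3. Condition on the self-critique as feedback to produce a refined summary.
    refined = query_model(
        f"Passage:\n{passage}\n\nSummary:\n{summary}\n\nCritique:\n{critique}\n\n"
        "Rewrite the summary to address the critique:"
    )
    return {"summary": summary, "critique": critique, "refined": refined}
```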
