Paper Title
Training language models to follow instructions with human feedback
Paper Authors
Paper Abstract
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
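The abstract describes training a reward model from labeler rankings of model outputs before the reinforcement-learning stage. The snippet below is a minimal sketch (not the authors' implementation) of the pairwise preference loss commonly used for such reward models: the preferred response's score is pushed above the rejected one's via -log sigmoid(r_chosen - r_rejected). The class and function names (ToyRewardModel, preference_loss) and the use of fixed-size response embeddings in place of a full language-model backbone are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: reward-model training on preference pairs, using a toy scorer
# over pre-computed response embeddings instead of a full language model.
import torch
import torch.nn as nn

class ToyRewardModel(nn.Module):
    """Maps a fixed-size response embedding to a scalar reward (illustrative stand-in)."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(response_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected),
    # averaged over the batch of labeler comparisons.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyRewardModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Stand-ins for embeddings of (preferred, rejected) response pairs from human rankings.
    chosen = torch.randn(32, 128)
    rejected = torch.randn(32, 128)

    for step in range(100):
        loss = preference_loss(model(chosen), model(rejected))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"final ranking loss: {loss.item():.4f}")
```

In the pipeline the abstract outlines, a reward model trained this way would then supply the scalar reward signal for the reinforcement-learning fine-tuning step applied to the supervised model.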