Paper Title

CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning

Authors

Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C. H. Hoi

Abstract

Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model only from pairs of natural-language problem descriptions and ground-truth programs. Such a paradigm largely ignores some important but potentially useful signals in the problem specification, such as unit tests, and thus often results in poor performance when solving complex, unseen coding tasks. To address the limitations, we propose "CodeRL", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark.
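
To make the actor-critic training signal concrete, below is a minimal, self-contained Python sketch. Every name in it (ToyActor, ToyCritic, run_unit_tests) is a hypothetical stand-in, not the authors' implementation: a real setup would use a pretrained CodeT5-style LM as the actor and a learned correctness classifier as the critic. The sketch only illustrates the core idea from the abstract, i.e. turning a sparse unit-test reward into a dense, token-level policy-gradient signal by weighting with critic scores.

import torch
import torch.nn as nn

VOCAB, HIDDEN, MAX_LEN = 100, 32, 16

class ToyActor(nn.Module):
    # Stands in for the code-generating LM (the "actor").
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # (batch, seq, vocab) next-token logits

class ToyCritic(nn.Module):
    # Stands in for the critic predicting functional correctness per token.
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, 1)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h).squeeze(-1)  # (batch, seq) correctness scores

def run_unit_tests(program_tokens):
    # Hypothetical stand-in: +1 if the sampled program passes, -1 otherwise.
    return 1.0 if program_tokens.sum().item() % 2 == 0 else -1.0

actor, critic = ToyActor(), ToyCritic()
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# Sample a "program" autoregressively from the actor.
tokens = torch.randint(0, VOCAB, (1, 4))  # toy problem prompt
for _ in range(MAX_LEN):
    logits = actor(tokens)[:, -1, :]
    nxt = torch.distributions.Categorical(logits=logits).sample()
    tokens = torch.cat([tokens, nxt.unsqueeze(1)], dim=1)

reward = run_unit_tests(tokens)      # sparse, sequence-level test signal
with torch.no_grad():
    dense = critic(tokens)           # dense per-token credit from the critic

# REINFORCE-style update: the sparse reward is redistributed over tokens
# by the critic's estimates, giving the actor a dense feedback signal.
logp = torch.log_softmax(actor(tokens[:, :-1]), dim=-1)
chosen = logp.gather(2, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
loss = -(reward * dense[:, 1:] * chosen).mean()
opt.zero_grad(); loss.backward(); opt.step()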
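
The inference-time procedure can likewise be sketched as a sample-test-regenerate loop. Again, all names here (sample_program, passes_example_tests, critic_score) are hypothetical stand-ins; the abstract only states that programs are regenerated based on feedback from the example unit tests and the critic's scores, so the regeneration step is deliberately simplified.

import random

def sample_program(prompt, seed):
    # Hypothetical stand-in for sampling one candidate program from the actor LM.
    rng = random.Random(seed)
    return f"def solve(x):\n    return x + {rng.randint(0, 9)}"

def passes_example_tests(program):
    # Hypothetical stand-in: would actually execute the candidate program
    # against the example unit tests given in the problem specification.
    return program.endswith("0")

def critic_score(program):
    # Hypothetical stand-in for the critic's learned correctness estimate.
    return random.random()

def critic_sampling(prompt, n_samples=8, n_refine=2, rounds=3):
    candidates = [sample_program(prompt, s) for s in range(n_samples)]
    for _ in range(rounds):
        passing = [p for p in candidates if passes_example_tests(p)]
        if passing:
            # Return the passing program the critic is most confident in.
            return max(passing, key=critic_score)
        # No candidate passes: keep the top-scoring failures as seeds and
        # regenerate a fresh batch (simplified here to plain resampling;
        # the paper's procedure also uses the test feedback on failures).
        seeds = sorted(candidates, key=critic_score, reverse=True)[:n_refine]
        candidates = seeds + [sample_program(prompt, random.randrange(10**6))
                              for _ in range(n_samples - n_refine)]
    return max(candidates, key=critic_score)  # fall back to best-scored guess

print(critic_sampling("Add a constant to x."))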
