Paper Title
Interactive Code Generation via Test-Driven User-Intent Formalization
Paper Authors
Paper Abstract
Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, when interacting with LLMs, users have no guarantee that the code suggestions produced correctly satisfy the intent they provided. In fact, it is hard to define a notion of correctness, since natural language can be ambiguous and lacks a formal semantics. In this paper, we propose the workflow of {\it interactive test-driven code generation}, which leverages lightweight user feedback to (a) formalize the user intent using generated tests that can be useful for debugging, and (b) produce an improved set of code suggestions by pruning and ranking candidate code suggestions. We describe a language-agnostic abstract algorithm and a concrete implementation, TiCoder. We perform an automated evaluation of TiCoder on the \emph{MBPP} and \emph{HumanEval} code generation benchmarks. Our results with the OpenAI Codex LLM are promising: our best algorithm improves the \passk{1} code generation accuracy (in absolute percentages) by between $22.49\%$ and $37.71\%$ for MBPP and between $24.79\%$ and $53.98\%$ for HumanEval, using between 1 and 5 simulated user queries.
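To make the workflow in the abstract concrete, the following is a minimal sketch of one plausible interactive prune-and-rank loop, not the authors' actual TiCoder implementation. All names here (`generate_codes`, `generate_tests`, `user_approves`, `run_test`, `max_queries`) are hypothetical placeholders for the LLM-backed and user-facing components, injected as callables so the sketch stays language- and model-agnostic.

```python
from typing import Callable, List, Tuple


def interactive_test_driven_codegen(
    intent: str,
    generate_codes: Callable[[str], List[str]],  # hypothetical: LLM maps NL intent to candidate programs
    generate_tests: Callable[[str], List[str]],  # hypothetical: LLM maps NL intent to candidate tests
    user_approves: Callable[[str], bool],        # hypothetical: user says whether a test matches their intent
    run_test: Callable[[str, str], bool],        # hypothetical: executes a test against one candidate program
    max_queries: int = 5,                        # budget of user queries, mirroring the 1-5 queries in the evaluation
) -> Tuple[List[str], List[str]]:
    """Prune and rank code candidates using user-approved generated tests."""
    codes = generate_codes(intent)
    approved_tests: List[str] = []

    # Each query shows the user one generated test; an approved test
    # partially formalizes the intent and doubles as a debugging artifact.
    for test in generate_tests(intent)[:max_queries]:
        if user_approves(test):
            approved_tests.append(test)
            # Prune candidates that fail the approved test; keep the old
            # pool if pruning would eliminate every candidate.
            survivors = [c for c in codes if run_test(c, test)]
            codes = survivors or codes

    # Rank the remaining candidates by how many approved tests they pass.
    codes.sort(
        key=lambda c: sum(run_test(c, t) for t in approved_tests),
        reverse=True,
    )
    return codes, approved_tests
```

Under these assumptions, the loop returns both an improved suggestion list and the set of intent-approved tests, matching the two outputs (a) and (b) described in the abstract.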