Code to Comment "Translation": Data, Metrics, Baselining & Evaluation

Paper Title

Code to Comment "Translation": Data, Metrics, Baselining & Evaluation

Authors

David Gros, Hariharan Sezhiyan, Prem Devanbu, Zhou Yu

Abstract

The relationship of comments to code, and in particular the task of generating useful comments given the code, has long been of interest. The earliest approaches were based on strong syntactic theories of comment structure and relied on textual templates. More recently, researchers have applied deep learning methods to this task, specifically trainable generative translation models, which are known to work very well for natural language translation (e.g., from German to English). We carefully examine the underlying assumption here: that the task of generating comments sufficiently resembles the task of translating between natural languages, so that similar models and evaluation metrics can be used. We analyze several recent code-comment datasets for this task: CodeNN, DeepCom, FunCom, and DocString. We compare them with WMT19, a standard dataset frequently used to train state-of-the-art natural language translators. We found some interesting differences between the code-comment data and the WMT19 natural language data. Next, we describe and conduct some studies to calibrate BLEU (which is commonly used as a measure of comment quality) using "affinity pairs" of methods: pairs drawn from different projects, from the same project, from the same class, and so on. Our study suggests that the current performance on some datasets might need to be improved substantially. We also argue that fairly naive information retrieval (IR) methods do well enough at this task to be considered a reasonable baseline. Finally, we make some suggestions on how our findings might be used in future research in this area.
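
To make the idea of a naive IR baseline concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say which retrieval method is used, so the Jaccard token-overlap similarity, the crude tokenizer, the function names, and the toy training pairs are all assumptions chosen for illustration. The baseline "generates" a comment by returning the comment of the most similar code snippet in the training data.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Jaccard similarity between two token collections (assumed similarity measure)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def retrieve_comment(query_code, training_pairs):
    """Return the comment of the training snippet most similar to query_code."""
    query_tokens = tokenize(query_code)
    _, best_comment = max(
        training_pairs,
        key=lambda pair: jaccard(query_tokens, tokenize(pair[0])),
    )
    return best_comment


# Hypothetical toy data: (code, comment) pairs standing in for a training set.
training_pairs = [
    ("def add(a, b): return a + b", "Return the sum of two numbers."),
    ("def read_file(path): return open(path).read()", "Read a file and return its contents."),
]

predicted = retrieve_comment("def add_numbers(x, y): return x + y", training_pairs)
print(predicted)
```

A baseline of this kind involves no training at all, which is why it is a useful reference point: a learned model that cannot clearly outperform retrieval under the same metric has not demonstrated much.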
