Paper Title

BashExplainer: Retrieval-Augmented Bash Code Comment Generation based on Fine-tuned CodeBERT

Authors

Chi Yu, Guang Yang, Xiang Chen, Ke Liu, Yanlin Zhou

Abstract

Developers use shell commands for many tasks, such as file system management, network control, and process management. Bash is one of the most commonly used shells and plays an important role in Linux system development and maintenance. Due to the language flexibility of Bash code, developers who are not familiar with Bash often have difficulty understanding the purpose and functionality of Bash code. In this study, we investigate the Bash code comment generation problem and propose an automatic method, BashExplainer, based on a two-stage training strategy. In the first stage, we train a Bash encoder by fine-tuning CodeBERT on our constructed Bash code corpus. In the second stage, we first retrieve the code most similar to the target code from the code repository based on semantic and lexical similarity. Then we use the trained Bash encoder to generate two vector representations. Finally, we fuse these two vector representations via a fusion layer and generate the code comment through the decoder. To show the competitiveness of our proposed method, we construct a high-quality corpus by combining the corpus shared in the previous NL2Bash study and the corpus shared in the NLC2CMD competition. This corpus contains 10,592 Bash code snippets and their corresponding comments. We then select ten baselines from previous studies on automatic code comment generation, covering information retrieval methods, deep learning methods, and hybrid methods.
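The retrieval step in the second stage combines semantic and lexical similarity to pick the most similar code from the repository. A minimal sketch of this idea is shown below; it is not the paper's implementation. The `embed` function here is a hashed bag-of-tokens stand-in for the fine-tuned CodeBERT encoder, and the equal weighting `alpha = 0.5` is an assumed choice for illustration.

```python
import math

def lexical_sim(a: str, b: str) -> float:
    # Token-level Jaccard overlap as a simple lexical similarity measure.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def embed(code: str, dim: int = 64) -> list[float]:
    # Stand-in for the fine-tuned CodeBERT encoder:
    # a hashed bag-of-tokens vector (illustration only).
    v = [0.0] * dim
    for tok in code.split():
        v[hash(tok) % dim] += 1.0
    return v

def cosine(u: list[float], v: list[float]) -> float:
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(target: str, corpus: list[str], alpha: float = 0.5) -> str:
    # Score each candidate by a weighted sum of semantic (cosine over
    # embeddings) and lexical (Jaccard) similarity; return the best match.
    tv = embed(target)
    return max(
        corpus,
        key=lambda c: alpha * cosine(tv, embed(c))
        + (1 - alpha) * lexical_sim(target, c),
    )
```

For example, given a small repository of Bash snippets, `retrieve("ls -l /home", corpus)` would return the candidate sharing the most tokens and the closest embedding, such as `"ls -l /tmp"`. The retrieved code's comment would then feed the fusion layer and decoder.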
