Paper Title

Unveiling Code Pre-Trained Models: Investigating Syntax and Semantics Capacities

Paper Authors

Wei Ma, Shangqing Liu, Mengjie Zhao, Xiaofei Xie, Wenhan Wang, Qiang Hu, Jie Zhang, Yang Liu

Paper Abstract

Past research has examined how well code pre-trained models grasp code syntax, yet their understanding of code semantics remains underexplored. We extensively analyze seven code models to investigate how they represent code syntax and semantics, covering four prominent code pre-trained models (CodeBERT, GraphCodeBERT, CodeT5, and UniXcoder) and three large language models (StarCoder, CodeLlama, and CodeT5+). We develop four probing tasks to evaluate the models' abilities to learn code syntax and semantics. These tasks focus on reconstructing code syntax and semantic structures, such as the abstract syntax tree (AST), control flow graph (CFG), control dependence graph (CDG), and data dependence graph (DDG), within the models' representation spaces; these structures are fundamental to understanding code. Additionally, we explore the role of syntax tokens in each token representation and the extended dependencies among code tokens. Furthermore, we examine the distribution of attention weights with respect to code semantic structures. Through detailed analysis, our results highlight the strengths and weaknesses of the various code models in mastering code syntax and semantics. The findings reveal that these models are proficient at grasping code syntax, effectively capturing the relationships and roles of syntax tokens. However, their ability to encode code semantics shows more variability. This study enriches our understanding of how well code models analyze syntax and semantics. Our findings offer valuable insights for future code model enhancements, helping optimize their application across a range of code-related tasks.
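
To make the probing setup concrete, here is a minimal, hypothetical sketch of a structural probing task: a frozen encoder produces token representations, and a small bilinear probe is trained to predict whether a token pair is connected by a structural edge such as an AST edge. The checkpoint choice (microsoft/codebert-base), the probe architecture, and the labels are all illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a structural probing task over frozen CodeBERT
# representations; model choice, probe design, and labels are assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()  # the encoder stays frozen; only the probe is trained

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state.squeeze(0)  # (seq_len, 768)

class EdgeProbe(nn.Module):
    """Bilinear probe: scores whether tokens i and j share a structural edge."""
    def __init__(self, dim):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, h_i, h_j):
        return self.bilinear(h_i, h_j).squeeze(-1)  # edge logit

probe = EdgeProbe(hidden.size(-1))
logit = probe(hidden[1:2], hidden[2:3])  # score one (i, j) token pair
label = torch.tensor([1.0])              # toy label: 1 if the pair is an AST edge
loss = nn.functional.binary_cross_entropy_with_logits(logit, label)
loss.backward()  # gradients reach only the probe; the encoder is frozen
```

Keeping the probe this small is the point of the method: if such a simple classifier can recover the structure, the structural information must already be readily accessible in the frozen representations.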
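The attention-distribution analysis can be sketched in the same spirit: extract per-layer attention matrices and compare the attention mass on token pairs linked by a semantic edge against the overall mean. The DDG edge list below is a toy placeholder; a real analysis would derive edges with a program-analysis tool and align them to subword positions.

```python
# Hypothetical sketch of the attention-weight analysis over semantic edges;
# the edge list is a toy placeholder, not derived from real program analysis.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

code = "x = 1\ny = x + 2"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (1, num_heads, seq_len, seq_len) tensor per layer;
# average over layers and heads to get a single (seq_len, seq_len) matrix.
attn = torch.stack(out.attentions).mean(dim=(0, 2)).squeeze(0)

ddg_edges = [(1, 5)]  # placeholder (query, key) pair with a data dependency
edge_mass = torch.stack([attn[i, j] for i, j in ddg_edges]).mean()
print(f"mean attention on DDG edges: {edge_mass.item():.4f}; "
      f"overall mean: {attn.mean().item():.4f}")
```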
