Paper title
Syntax-Aware On-the-Fly Code Completion
Paper authors
Paper abstract
Code completion aims to improve developers' productivity by suggesting the next code tokens from a given context. Various approaches have been proposed to incorporate abstract syntax tree (AST) information into model training so that code completion is aware of the syntax of the programming language. However, existing syntax-aware code completion approaches are not on-the-fly: we found that for two-thirds of the characters that developers type, the AST cannot be extracted because extraction requires syntactically correct source code, which limits the practicality of these approaches in real-world scenarios. On the other hand, existing on-the-fly code completion approaches do not yet consider syntactic information. In this paper, we propose PyCoder, which leverages token types, a kind of lightweight syntactic information that is readily available and aligns with the natural order of source code. PyCoder is trained in a multi-task manner: by learning the supporting task of predicting token types during the training phase, the models achieve better performance on predicting tokens and lines of code without needing token types in the inference phase. Comprehensive experiments show that PyCoder ranks first on the CodeXGLUE leaderboard with an accuracy of 77.12% for token-level predictions, which is 0.43%-24.25% more accurate than the baselines. In addition, PyCoder achieves an exact match of 43.37% for line-level predictions, which is 3.63%-84.73% more accurate than the baselines. These results lead us to conclude that token type information (an alternative to syntactic information), which has rarely been used in the past, can greatly improve the performance of code completion approaches without requiring syntactically correct source code as AST-based approaches do. PyCoder is publicly available on HuggingFace and GitHub.
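The contrast between AST extraction and token types can be illustrated with a minimal sketch that uses only Python's standard `ast`, `tokenize`, and `token` modules. It is an illustration under assumptions, not the authors' pipeline: the helper names `ast_parses` and `token_types` are hypothetical, and the abstract does not say which tokenizer or token-type vocabulary PyCoder actually uses.

```python
import ast
import io
import token
import tokenize

# Layout-only token types that are skipped in this sketch (an assumption,
# not something the abstract specifies).
_LAYOUT = (token.NEWLINE, token.NL, token.INDENT, token.DEDENT, token.ENDMARKER)


def ast_parses(code: str) -> bool:
    """Return True if `code` is syntactically correct enough for ast.parse."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def token_types(code: str) -> list[tuple[str, str]]:
    """Return (token text, token type name) pairs in source order.

    If the tokenizer itself raises on the partial input, the pairs
    collected up to that point are returned.
    """
    pairs = []
    readline = io.StringIO(code).readline
    try:
        for tok in tokenize.generate_tokens(readline):
            if tok.type in _LAYOUT:
                continue
            pairs.append((tok.string, token.tok_name[tok.type]))
    except (tokenize.TokenError, SyntaxError):
        pass
    return pairs


# A prefix as a developer might have typed it, with the cursor mid-expression.
partial = "def add(a, b):\n    return a +"
print(ast_parses(partial))
print(token_types(partial))
```

On this prefix `ast.parse` raises a `SyntaxError`, so no AST is available, while the tokenizer still labels every token typed so far in its natural left-to-right order. That is the property the abstract relies on when it calls token types readily available.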