Paper Title

Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering

Authors

Egor Bogomolov, Vladimir Kovalenko, Yurii Rebryk, Alberto Bacchelli, Timofey Bryksin

Abstract

Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use for software engineering.
