Authorship Identification of Source Code Segments Written by Multiple Authors Using Stacking Ensemble Method

Paper Title

Authorship Identification of Source Code Segments Written by Multiple Authors Using Stacking Ensemble Method

Authors

Mahbub, Parvez; Oishie, Naz Zarreen; Haque, S M Rafizul

Abstract

Source code segment authorship identification is the task of identifying the author of a source code segment through supervised learning. It is important in plagiarism detection, digital forensics, and several other law enforcement settings. However, typical author identification methods no longer work when a source code segment is written by multiple authors. We propose an author identification technique that uses a stacking ensemble classifier and can predict the authorship of source code segments even when they have multiple authors. The technique is built upon several deep neural network, random forest, and support vector machine classifiers. We show that a single classification technique is no longer sufficient for identifying the author group, and that a deep neural network-based stacking ensemble method can significantly improve accuracy. We compare the performance of the proposed technique with existing methods that only handle source code segments written by exactly one author. Although authorship identification for segments written by multiple authors is the harder task, the proposed technique achieves promising identification accuracy relative to those single-author methods.
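
The stacking idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn, uses a small multilayer perceptron as a stand-in for the deep neural networks, and replaces real code-derived features with synthetic data from `make_classification`, where each class label is assumed to represent one author group. All hyperparameters are illustrative choices, not values from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for feature vectors extracted from code segments.
# Assumption: each of the 4 labels represents one author group.
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=20,
    n_classes=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base learners: a neural network, a random forest, and an SVM.
# The MLP and SVM are scale-sensitive, so they get a scaler in front.
base_learners = [
    ("mlp", make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
    )),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", make_pipeline(
        StandardScaler(),
        SVC(probability=True, random_state=0),
    )),
]

# Meta-learner: a neural network trained on the base learners'
# out-of-fold class probabilities (5-fold cross-validation).
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=MLPClassifier(
        hidden_layer_sizes=(32,), max_iter=1000, random_state=0
    ),
    cv=5,
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Training the meta-learner on out-of-fold predictions keeps it from simply memorizing base-learner outputs on data those learners have already seen, which is the usual reason to prefer stacking over a plain vote.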

Paper Title