RoPGen: Towards Robust Code Authorship Attribution via Automatic Coding Style Transformation

Paper Title

RoPGen: Towards Robust Code Authorship Attribution via Automatic Coding Style Transformation

Authors

Li, Zhen; Chen, Guenevere; Chen, Chen; Zou, Yayi; Xu, Shouhuai

Abstract

Source code authorship attribution is an important problem often encountered in applications such as software forensics, bug fixing, and software quality analysis. Recent studies show that current source code authorship attribution methods can be compromised by attackers exploiting adversarial examples and coding style manipulation. This calls for robust solutions to the problem of code authorship attribution. In this paper, we initiate the study on making Deep Learning (DL)-based code authorship attribution robust. We propose an innovative framework called Robust coding style Patterns Generation (RoPGen), which essentially learns authors' unique coding style patterns that are hard for attackers to manipulate or imitate. The key idea is to combine data augmentation and gradient augmentation at the adversarial training phase. This effectively increases the diversity of training examples, generates meaningful perturbations to gradients of deep neural networks, and learns diversified representations of coding styles. We evaluate the effectiveness of RoPGen using four datasets of programs written in C, C++, and Java. Experimental results show that RoPGen can significantly improve the robustness of DL-based code authorship attribution, by respectively reducing 22.8% and 41.0% of the success rate of targeted and untargeted attacks on average.
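
To make the data augmentation idea in the abstract concrete, the following is a minimal sketch of one semantics-preserving coding style transformation: renaming identifiers from camelCase to snake_case and adding the restyled copy to the training set under the same author label. It is an illustration only, not RoPGen's implementation. The function names (`camel_to_snake`, `restyle_identifiers`, `augment`), the choice of transformation, and the assumption that the identifiers to rename are supplied by the caller are all hypothetical. Gradient augmentation is not shown.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case, e.g. totalSum -> total_sum."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", lambda m: "_" + m.group(1).lower(), name)


def restyle_identifiers(source, identifiers):
    """Rename each listed identifier consistently throughout the source text.

    Assumption: the caller supplies only user-defined identifiers whose new
    names do not collide with existing ones. This simple regex version would
    also rewrite matches inside string literals and comments.
    """
    for name in identifiers:
        source = re.sub(r"\b" + re.escape(name) + r"\b", camel_to_snake(name), source)
    return source


def augment(samples):
    """Return the original samples plus one restyled copy of each.

    samples: list of (source, author, identifiers) tuples (hypothetical layout).
    The restyled copy keeps the original author label.
    """
    augmented = []
    for source, author, identifiers in samples:
        augmented.append((source, author))
        augmented.append((restyle_identifiers(source, identifiers), author))
    return augmented


if __name__ == "__main__":
    code = "int totalSum = 0; for (int loopIdx = 0; loopIdx < n; loopIdx++) totalSum += a[loopIdx];"
    for src, author in augment([(code, "author_1", ["totalSum", "loopIdx"])]):
        print(author, "|", src)
```

A real pipeline would parse the program rather than use regular expressions, so that only genuine identifier occurrences are renamed.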
