论文标题
funcgnn:一种图形神经网络方法编程相似性
funcGNN: A Graph Neural Network Approach to Program Similarity
论文作者
论文摘要
程序相似性是一个基本概念,是解决软件工程任务的核心,例如软件pla窃,克隆识别,代码重构和代码搜索。程序之间的准确相似性估计需要深入了解其结构,语义和流程。控制流程图(CFG)是一个程序的图形表示,该程序捕获其逻辑控制流,从而捕获其语义。一种常见的方法是通过使用图形相似性度量(例如图编辑距离(GED)。但是,图编辑距离是一个NP硬性问题,计算上的昂贵,这使得图形相似性技术在复杂的软件程序中的应用不切实际。这项研究旨在通过分析相关的控制流程图来检查图神经网络对估计程序相似性的有效性。我们介绍了Funcgnn,这是一个在标记的CFG对中训练的图形神经网络,可通过利用有效的嵌入向量来预测看不见的程序对之间的GED。据我们所知,这是第一次将图形神经网络应用于标签的CFGS上,以估算高级语言程序之间的相似性。结果:我们证明了Funcgnn估计程序之间的GED的有效性,而我们的实验分析表明了它如何达到较低的错误率(0.00194),其速度更快(比最快的传统GED近似方法快23倍),并且与ART方法的状态相比,它更快。 Funcgnn具有推断程序结构并推广到看不见的程序的归纳学习能力。我们方法学提出的程序的图形嵌入可以应用于几个相关的软件工程问题(例如代码窃和克隆识别),从而打开多个研究方向。
Program similarity is a fundamental concept, central to the solution of software engineering tasks such as software plagiarism, clone identification, code refactoring and code search. Accurate similarity estimation between programs requires an in-depth understanding of their structure, semantics and flow. A control flow graph (CFG), is a graphical representation of a program which captures its logical control flow and hence its semantics. A common approach is to estimate program similarity by analysing CFGs using graph similarity measures, e.g. graph edit distance (GED). However, graph edit distance is an NP-hard problem and computationally expensive, making the application of graph similarity techniques to complex software programs impractical. This study intends to examine the effectiveness of graph neural networks to estimate program similarity, by analysing the associated control flow graphs. We introduce funcGNN, which is a graph neural network trained on labeled CFG pairs to predict the GED between unseen program pairs by utilizing an effective embedding vector. To our knowledge, this is the first time graph neural networks have been applied on labeled CFGs for estimating the similarity between high-level language programs. Results: We demonstrate the effectiveness of funcGNN to estimate the GED between programs and our experimental analysis demonstrates how it achieves a lower error rate (0.00194), with faster (23 times faster than the quickest traditional GED approximation method) and better scalability compared with the state of the art methods. funcGNN posses the inductive learning ability to infer program structure and generalise to unseen programs. The graph embedding of a program proposed by our methodology could be applied to several related software engineering problems (such as code plagiarism and clone identification) thus opening multiple research directions.