Paper Title

GraphRepo: Fast Exploration in Software Repository Mining

Authors

Alex Serban, Magiel Bruntink, Joost Visser

Abstract

Mining and storage of data from software repositories is typically done on a per-project basis, where each project uses a unique combination of data schema, extraction tools, and (intermediate) storage infrastructure. We introduce GraphRepo, a tool that enables a unified approach to extracting data from Git repositories, storing it, and sharing it across repository mining projects. GraphRepo uses Neo4j, an ACID-compliant graph database management system, and allows modular plug-in of components for repository extraction (drillers), analysis (miners), and export (mappers). The graph enables a natural way to query the data by removing the need for data normalisation. GraphRepo is built in Python and offers multiple ways to interface with the rich Python ecosystem and with big data solutions. The schema of the graph database is generic and extensible. Using GraphRepo for software repository mining offers several advantages over creating project-specific infrastructure: (i) high performance for short-iteration exploration and scalability to large data sets, (ii) easy distribution of extracted data (e.g., for replication) or sharing of extracted data among projects, and (iii) extensibility and interoperability. A set of benchmarks on four open source projects demonstrates that GraphRepo allows very fast querying of repository data once it has been extracted and indexed. More information can be found in the project's documentation (available at https://tinyurl.com/grepodoc) and in the project's repository (available at https://tinyurl.com/grrepo). A video demonstration is also available online (https://tinyurl.com/grrepov).
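
The driller, miner, and mapper roles described in the abstract can be illustrated with a minimal, self-contained Python sketch. This is not GraphRepo's API and it does not use Neo4j: the `Graph` class, the function names (`drill`, `files_touched_by`, `to_rows`), the node kinds, the relation names, and the sample records are all hypothetical, chosen only to show how a graph lets a query follow edges directly instead of joining normalised tables.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny in-memory stand-in for a graph database (hypothetical, not Neo4j)."""
    nodes: dict = field(default_factory=dict)  # node id -> properties
    edges: dict = field(default_factory=lambda: defaultdict(list))  # (src, relation) -> [dst]

    def add_node(self, node_id, **props):
        self.nodes.setdefault(node_id, {}).update(props)

    def add_edge(self, src, relation, dst):
        self.edges[(src, relation)].append(dst)

    def neighbours(self, src, relation):
        return self.edges.get((src, relation), [])


def drill(records, graph):
    """Driller role: load raw commit records into the graph."""
    for rec in records:
        commit_id = f"commit:{rec['hash']}"
        dev_id = f"dev:{rec['author']}"
        graph.add_node(commit_id, kind="Commit", hash=rec["hash"])
        graph.add_node(dev_id, kind="Developer", name=rec["author"])
        graph.add_edge(dev_id, "AUTHORED", commit_id)
        for path in rec["files"]:
            file_id = f"file:{path}"
            graph.add_node(file_id, kind="File", path=path)
            graph.add_edge(commit_id, "TOUCHED", file_id)


def files_touched_by(graph, author):
    """Miner role: follow Developer -> Commit -> File edges."""
    touched = set()
    for commit_id in graph.neighbours(f"dev:{author}", "AUTHORED"):
        for file_id in graph.neighbours(commit_id, "TOUCHED"):
            touched.add(graph.nodes[file_id]["path"])
    return touched


def to_rows(paths):
    """Mapper role: export a miner result as sorted row dictionaries."""
    return [{"path": p} for p in sorted(paths)]


# Made-up sample data standing in for commits extracted from a Git repository.
records = [
    {"hash": "a1", "author": "alice", "files": ["src/app.py", "README.md"]},
    {"hash": "b2", "author": "bob", "files": ["src/app.py"]},
    {"hash": "c3", "author": "alice", "files": ["src/util.py"]},
]
graph = Graph()
drill(records, graph)
rows = to_rows(files_touched_by(graph, "alice"))
```

In this sketch, extraction happens once in `drill`, after which each exploratory question is a short traversal over the stored graph. That is the workflow the abstract argues for, although the real tool delegates storage and querying to Neo4j.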
