Paper Title
An Automatically Created Novel Bug Dataset and its Validation in Bug Prediction
Paper Authors
Abstract
Bugs are inescapable during software development due to frequent code changes, tight deadlines, and similar pressures; it is therefore important to have tools that find these errors. One way to identify bugs is to analyze the characteristics of source code elements that were buggy in the past and to predict present ones from the same characteristics, for example with machine learning models. To support model building, code elements and their characteristics are collected in so-called bug datasets, which serve as the input for learning. We present the BugHunter Dataset: a novel kind of automatically constructed and freely available bug dataset containing code elements (files, classes, methods) with a wide set of code metrics and bug information. Other available bug datasets follow the traditional approach of gathering the characteristics of all source code elements (buggy and non-buggy) at one or more pre-selected release versions of the code. Our approach, on the other hand, captures the buggy and the fixed states of the same source code elements from the narrowest timeframe we can identify for a bug's presence, regardless of release versions. To show the usefulness of the new dataset, we built and evaluated bug prediction models and achieved F-measure values over 0.74.
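For readers unfamiliar with the evaluation metric named above, the following is a minimal sketch of how an F-measure is computed from a classifier's confusion counts. It assumes the balanced F1 variant (the harmonic mean of precision and recall); the abstract does not state which variant was used. The function name `f_measure` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Return the F-beta score from confusion counts (beta=1 gives F1).

    tp: buggy elements correctly predicted as buggy
    fp: non-buggy elements wrongly predicted as buggy
    fn: buggy elements the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, chosen only to illustrate the arithmetic.
score = f_measure(tp=80, fp=20, fn=30)
print(f"F-measure: {score:.3f}")
```

Because the F-measure combines precision and recall, a model cannot reach a high value simply by labeling every element as buggy or every element as non-buggy.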