Paper Title

Employing Partial Least Squares Regression with Discriminant Analysis for Bug Prediction

Authors

Rudolf Ferenc, István Siket, Péter Hegedűs, Róbert Rajkó

Abstract

Forecasting the defect proneness of source code has long been a major research concern. Estimating which parts of a software system most likely contain bugs may help focus testing efforts, reduce costs, and improve product quality. Many prediction models and approaches introduced during the past decades try to forecast bugged code elements based on static source code metrics, change and history metrics, or both. However, there is still no universal best solution to this problem, as the most suitable features and models vary from dataset to dataset and depend on the context in which we use them. Therefore, novel approaches and further studies on this topic are highly necessary. In this paper, we employ a chemometric approach, Partial Least Squares with Discriminant Analysis (PLS-DA), for predicting bug-prone classes in Java programs using static source code metrics. To the best of our knowledge, PLS-DA has never been used before as a statistical approach in the software maintenance domain for predicting software errors. In addition, we have used rigorous statistical treatments, including bootstrap resampling and randomization (permutation) tests, as well as an evaluation for representing the software engineering results. We show that our PLS-DA based prediction model achieves superior performance (an F-measure of 0.44-0.47 at the 90% confidence level) compared to state-of-the-art approaches when no data re-sampling is applied, and is comparable to them when up-sampling is applied on the largest open bug dataset, while training the model is significantly faster, so finding optimal parameters is much easier. In terms of completeness, which measures the amount of bugs contained in the Java classes predicted to be defective, PLS-DA outperforms every other algorithm: it found 69.3% and 79.4% of the total bugs with no re-sampling and up-sampling, respectively.
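The two evaluation measures quoted in the abstract can be illustrated with a small sketch. It is not the authors' evaluation code: the function names and the toy data are hypothetical, and the completeness formula (bugs in classes predicted defective, divided by all bugs) is an assumption based on the abstract's one-sentence description.

```python
from typing import Sequence


def f_measure(bug_counts: Sequence[int], predicted: Sequence[bool]) -> float:
    """F-measure (harmonic mean of precision and recall) at class level.

    A class counts as actually defective when its bug count is > 0.
    """
    pairs = list(zip(bug_counts, predicted))
    tp = sum(1 for bugs, flagged in pairs if flagged and bugs > 0)
    fp = sum(1 for bugs, flagged in pairs if flagged and bugs == 0)
    fn = sum(1 for bugs, flagged in pairs if not flagged and bugs > 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def completeness(bug_counts: Sequence[int], predicted: Sequence[bool]) -> float:
    """Share of all bugs located in classes predicted defective.

    Assumed definition; the abstract only describes it informally.
    """
    total = sum(bug_counts)
    if total == 0:
        return 0.0
    found = sum(bugs for bugs, flagged in zip(bug_counts, predicted) if flagged)
    return found / total


# Hypothetical toy data: six classes with their bug counts and predictions.
bug_counts = [3, 0, 1, 0, 2, 0]
predicted = [True, True, False, False, True, False]

print(f"F-measure:    {f_measure(bug_counts, predicted):.3f}")
print(f"Completeness: {completeness(bug_counts, predicted):.3f}")
```

Under this reading the two measures can diverge: a model that flags the few classes holding most of the bugs scores high on completeness even when its class-level F-measure is moderate.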
