Estimating the Prediction Performance of Spatial Models via Spatial k-Fold Cross Validation

Paper title

Estimating the Prediction Performance of Spatial Models via Spatial k-Fold Cross Validation

Authors

Jonne Pohjankukka, Tapio Pahikkala, Paavo Nevalainen, Jukka Heikkonen

Abstract

In machine learning, one often assumes that the data are independent when evaluating model performance. However, this assumption rarely holds in practice. Geographic information data sets are an example: data points depend more strongly on each other the closer they are geographically. This phenomenon, known as spatial autocorrelation (SAC), causes standard cross validation (CV) methods to produce optimistically biased prediction performance estimates for spatial models, which can result in increased costs and accidents in practical applications. To overcome this problem, we propose a modified version of the CV method called spatial k-fold cross validation (SKCV), which provides a useful estimate of model prediction performance without optimistic bias due to SAC. We test SKCV on three real-world cases involving open natural data, showing that the estimates produced by ordinary CV are up to 40% more optimistic than those of SKCV. Both regression and classification cases are considered in our experiments. In addition, we show how the SKCV method can be applied as a criterion for selecting the data sampling density for new research areas.
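
The abstract does not spell out how SKCV builds its folds, so the following is only a minimal sketch of one plausible reading: ordinary k-fold splitting, followed by dropping every training point that lies within a chosen distance of any test point. The function name spatial_kfold_splits, the parameter dead_zone_radius, and the use of Euclidean distance are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def spatial_kfold_splits(coords, k, dead_zone_radius, seed=0):
    """Yield (train_idx, test_idx) pairs with a spatial buffer around each test fold.

    Hypothetical helper: the buffer rule below is an assumed reading of SKCV,
    not the authors' reference implementation.
    """
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    rng = np.random.default_rng(seed)
    # Ordinary k-fold assignment: shuffle the indices, then cut into k folds.
    folds = np.array_split(rng.permutation(n), k)
    for test_idx in folds:
        # Pairwise Euclidean distances from every point to every test point,
        # shape (n, n_test).
        diff = coords[:, None, :] - coords[None, test_idx, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        # Keep only points strictly farther than the radius from all test points.
        # Test points are at distance 0 from themselves, so they are never kept.
        keep = dist.min(axis=1) > dead_zone_radius
        yield np.flatnonzero(keep), np.sort(test_idx)
```

With dead_zone_radius set to 0 and no duplicated coordinates, this reduces to ordinary k-fold CV. Increasing the radius removes spatially close training points, which is the kind of separation that would be expected to reduce SAC-induced optimism in the performance estimate.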
