CSMOUTE: Combined Synthetic Oversampling and Undersampling Technique for Imbalanced Data Classification

Paper Title

CSMOUTE: Combined Synthetic Oversampling and Undersampling Technique for Imbalanced Data Classification

Authors

Koziarski, Michał

Abstract

In this paper, we propose a novel data-level algorithm for handling data imbalance in classification tasks: the Synthetic Majority Undersampling Technique (SMUTE). SMUTE leverages the concept of interpolating nearby instances, previously introduced for oversampling in SMOTE. Furthermore, we combine both in the Combined Synthetic Oversampling and Undersampling Technique (CSMOUTE), which integrates SMOTE oversampling with SMUTE undersampling. The results of the experimental study demonstrate the usefulness of both the SMUTE and the CSMOUTE algorithms, especially when combined with more complex classifiers, namely MLP and SVM, and when applied to datasets containing a large number of outliers. This leads us to the conclusion that the proposed approach shows promise for further extensions that accommodate local data characteristics, a direction discussed in more detail in the paper.
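
The abstract describes SMUTE and CSMOUTE only at a high level, so the following is a minimal sketch of the idea rather than the author's implementation. It assumes that SMUTE repeatedly replaces a randomly chosen majority instance and one of its `k` nearest majority neighbors with a single interpolated point, and that CSMOUTE splits the class-size gap between SMOTE and SMUTE through a hypothetical `ratio` parameter. The function names, the parameters `k` and `ratio`, and the merging rule are all illustrative assumptions.

```python
import numpy as np


def smote_oversample(X_min, n_new, k=5, rng=None):
    """SMOTE-style oversampling: add n_new points, each interpolated between
    a minority instance and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n_new <= 0 or n < 2:
        return X_min.copy()
    k = min(k, n - 1)
    # Pairwise Euclidean distances; the diagonal is masked so that a point
    # is never selected as its own neighbor.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))
    synthetic = X_min[base] + lam * (X_min[chosen] - X_min[base])
    return np.vstack([X_min, synthetic])


def smute_undersample(X_maj, n_target, k=5, rng=None):
    """Assumed SMUTE step: merge a majority instance and a nearby majority
    neighbor into one interpolated point until n_target instances remain."""
    rng = np.random.default_rng(rng)
    arr = np.asarray(X_maj, dtype=float).copy()
    while len(arr) > max(n_target, 1):
        i = rng.integers(len(arr))
        dist = np.linalg.norm(arr - arr[i], axis=1)
        dist[i] = np.inf
        kk = min(k, len(arr) - 1)
        j = rng.choice(np.argsort(dist)[:kk])
        lam = rng.random()
        merged = arr[i] + lam * (arr[j] - arr[i])
        keep = np.ones(len(arr), dtype=bool)
        keep[[i, j]] = False
        # Two instances are removed and one is added: net reduction of one.
        arr = np.vstack([arr[keep], merged])
    return arr


def csmoute(X_maj, X_min, ratio=0.5, k=5, rng=None):
    """Assumed combination: `ratio` is the share of the class-size gap closed
    by oversampling; the remainder is closed by undersampling."""
    rng = np.random.default_rng(rng)
    gap = len(X_maj) - len(X_min)
    n_over = int(round(ratio * gap))
    n_under = gap - n_over
    X_min_new = smote_oversample(X_min, n_over, k=k, rng=rng)
    X_maj_new = smute_undersample(X_maj, len(X_maj) - n_under, k=k, rng=rng)
    return X_maj_new, X_min_new


if __name__ == "__main__":
    data_rng = np.random.default_rng(0)
    X_majority = data_rng.normal(0.0, 1.0, size=(100, 2))
    X_minority = data_rng.normal(2.0, 0.5, size=(20, 2))
    maj, mino = csmoute(X_majority, X_minority, ratio=0.5, rng=1)
    print(maj.shape, mino.shape)
```

Under these assumptions, `ratio=1.0` reduces to pure SMOTE oversampling and `ratio=0.0` to pure SMUTE undersampling. Because every new point is a convex combination of existing points of the same class, the resampled data never leaves the region already spanned by that class.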
