Automated Imbalanced Classification via Layered Learning

Paper title

Automated Imbalanced Classification via Layered Learning

Authors

Vitor Cerqueira, Luis Torgo, Paula Branco, Colin Bellinger

Abstract

In this paper, we address imbalanced binary classification (IBC) tasks. Applying resampling strategies to balance the class distribution of training instances is a common approach to these problems. Many state-of-the-art methods find instances of interest close to the decision boundary to drive the resampling process. However, under-sampling the majority class may lead to the loss of important information. Over-sampling may also increase the chance of overfitting by propagating the information contained in instances from the minority class. The main contribution of our work is a new method called ICLL for tackling IBC tasks, which is not based on resampling training observations. Instead, ICLL follows a layered learning paradigm to model the data in two stages. In the first layer, ICLL learns to distinguish cases close to the decision boundary from cases that are clearly from the majority class, where this dichotomy is defined using hierarchical clustering analysis. In the subsequent layer, we use instances close to the decision boundary and instances from the minority class to solve the original predictive task. A second contribution of our work is the automatic definition of the layers that comprise the layered learning strategy, using a hierarchical clustering model. This is a relevant discovery, as this process is usually performed manually according to domain knowledge. We carried out extensive experiments using 100 benchmark data sets. The results show that the proposed method leads to better performance relative to several state-of-the-art methods for IBC.
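
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name LayeredSketch, the n_clusters parameter, the Ward linkage, the decision-tree base learner, and the rule "a cluster is clearly majority when it contains no minority instance" are all assumptions made for illustration, as is the choice to group minority instances with the boundary cases in the first layer. The toy data is synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier


class LayeredSketch:
    """Hypothetical two-layer classifier loosely following the abstract.

    Labels are assumed to be 0 (majority) and 1 (minority).
    """

    def __init__(self, base=None, n_clusters=4):
        self.base = base if base is not None else DecisionTreeClassifier(random_state=0)
        self.n_clusters = n_clusters  # assumed knob, not taken from the paper

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        # Hierarchical (Ward) clustering of all training instances,
        # cut into at most n_clusters flat clusters.
        tree = linkage(X, method="ward")
        groups = fcluster(tree, t=self.n_clusters, criterion="maxclust")
        # Assumed rule: clusters holding at least one minority instance are
        # "near the boundary"; all other clusters are clearly majority.
        mixed = np.unique(groups[y == 1])
        near_boundary = np.isin(groups, mixed)
        # Layer 1: clearly-majority (0) vs. boundary-or-minority (1).
        self.layer1_ = clone(self.base).fit(X, near_boundary.astype(int))
        # Layer 2: the original task, restricted to boundary and minority cases.
        self.layer2_ = clone(self.base).fit(X[near_boundary], y[near_boundary])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        pred = np.zeros(len(X), dtype=int)  # default: majority class
        passed = self.layer1_.predict(X) == 1
        if passed.any():
            # Only cases that layer 1 lets through reach layer 2.
            pred[passed] = self.layer2_.predict(X[passed])
        return pred


# Synthetic toy data: a distant majority blob, plus a majority blob that
# overlaps the minority class.
rng = np.random.default_rng(0)
far_majority = rng.normal(loc=[-10.0, -10.0], scale=0.5, size=(100, 2))
near_majority = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(40, 2))
minority = rng.normal(loc=[0.5, 0.5], scale=0.5, size=(20, 2))
X = np.vstack([far_majority, near_majority, minority])
y = np.array([0] * 140 + [1] * 20)

model = LayeredSketch().fit(X, y)
train_pred = model.predict(X)
center_pred = model.predict(np.array([[-10.0, -10.0]]))
```

In this sketch no training instance is added or removed, which mirrors the abstract's point that the method does not resample the training observations; the imbalance is instead handled by letting the first layer filter out cases that are clearly from the majority class.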
