Title
Understanding CNN Fragility When Learning With Imbalanced Data
Authors
Abstract
Convolutional neural networks (CNNs) have achieved impressive results on imbalanced image data, but they still have difficulty generalizing to minority classes, and their decisions are difficult to interpret. These problems are related, because the method by which CNNs generalize to minority classes, which requires improvement, is wrapped in a black box. To demystify CNN decisions on imbalanced data, we focus on their latent features. Although CNNs embed the pattern knowledge learned from a training set in model parameters, the effect of this knowledge is contained in feature and classification embeddings (FE and CE). These embeddings can be extracted from a trained model, and their global, class properties (e.g., frequency, magnitude, and identity) can be analyzed. We find that important information regarding the ability of a neural network to generalize to minority classes resides in the class top-K CE and FE. We show that a CNN learns a limited number of class top-K CE per category, and that their number and magnitudes vary based on whether the same class is balanced or imbalanced. This calls into question whether a CNN has learned intrinsic class features, or merely frequently occurring ones that happen to exist in the sampled class distribution. We also hypothesize that latent class diversity is as important as the number of class examples, which has important implications for re-sampling and cost-sensitive methods. These methods generally focus on rebalancing model weights, class numbers, and margins, rather than diversifying class latent features through augmentation. We also demonstrate that a CNN has difficulty generalizing to test data if the magnitudes of its top-K latent features do not match the training set. For our experiments, we use three popular image datasets and two cost-sensitive algorithms commonly employed in imbalanced learning.
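The abstract describes extracting feature embeddings from a trained model and analyzing their per-class top-K magnitudes. A minimal numpy sketch of that analysis step is shown below; the function name and the toy data are illustrative, and the FE matrix stands in for penultimate-layer activations extracted from a trained CNN (the paper's actual extraction and ranking procedure may differ).

```python
import numpy as np

def class_top_k_features(fe, labels, cls, k=3):
    """Return top-K feature indices and their mean magnitudes for one class.

    fe     : (n_examples, n_features) array of feature embeddings (FE),
             e.g. penultimate-layer activations from a trained CNN.
    labels : (n_examples,) integer class labels.
    cls    : class to analyze.
    k      : number of top features to keep.
    """
    class_fe = fe[labels == cls]              # embeddings of this class only
    mean_mag = np.abs(class_fe).mean(axis=0)  # mean magnitude per feature
    top_idx = np.argsort(mean_mag)[::-1][:k]  # K features with largest magnitude
    return top_idx, mean_mag[top_idx]

# Toy example: 6 examples, 4 features, two classes.
rng = np.random.default_rng(0)
fe = rng.normal(size=(6, 4))
fe[:3, 1] += 5.0                              # class 0 relies heavily on feature 1
labels = np.array([0, 0, 0, 1, 1, 1])

idx, mags = class_top_k_features(fe, labels, cls=0, k=2)
print(idx[0])  # feature 1 dominates for class 0
```

Comparing the index sets and magnitudes returned for a balanced versus an imbalanced version of the same class is one way to probe the abstract's claim that top-K latent features shift with class imbalance.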