Paper Title
Domain Adaptation in Highly Imbalanced and Overlapping Datasets
Paper Authors
Abstract
In many machine learning domains, datasets are characterized by highly imbalanced and overlapping classes. Particularly in the medical domain, a specific list of symptoms can be labeled as any one of several different conditions, and some of these conditions may be more prevalent than others by several orders of magnitude. Here we present a novel unsupervised domain adaptation scheme for such datasets. The scheme, based on a specific type of quantification, is designed to work under both label shift and conditional shift. It is demonstrated on datasets generated from electronic health records and provides high-quality results for both quantification and domain adaptation in very challenging scenarios. The potential benefits of using this scheme in the current COVID-19 outbreak, for estimating the prevalence and probability of infection, are also discussed.
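For readers unfamiliar with quantification, the following is a minimal sketch of one textbook baseline, adjusted classify-and-count, which estimates class prevalence in an unlabeled target sample. It is not the scheme proposed in this paper, whose details the abstract does not give. It assumes pure label shift (class-conditional feature distributions unchanged between source and target) and a binary classifier whose true and false positive rates were measured on labeled source data. The function name and parameters are hypothetical.

```python
def adjusted_classify_and_count(predictions, tpr, fpr):
    """Estimate positive-class prevalence in an unlabeled target sample.

    Illustrative baseline only (an assumption, not the paper's method).
    predictions: iterable of 0/1 hard classifier outputs on target data.
    tpr, fpr: true/false positive rates measured on labeled source data.
    """
    predictions = list(predictions)
    if not predictions:
        raise ValueError("predictions must be non-empty")
    if tpr <= fpr:
        raise ValueError("classifier must satisfy tpr > fpr")

    # Fraction of target items the classifier labels positive.
    observed_rate = sum(predictions) / len(predictions)

    # Under pure label shift: observed_rate = tpr * p + fpr * (1 - p).
    # Solve for the true prevalence p, then clip to the valid range [0, 1].
    estimate = (observed_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


# Hypothetical numbers: 140 of 1000 target items are predicted positive
# by a classifier with tpr = 0.9 and fpr = 0.1 on the source domain.
prevalence = adjusted_classify_and_count([1] * 140 + [0] * 860, tpr=0.9, fpr=0.1)
print(round(prevalence, 4))
```

With a rare class, the raw positive rate (0.14 here) can overstate prevalence several times over because false positives dominate; the correction recovers an estimate near 0.05. This baseline does not handle conditional shift, which the scheme in the abstract is stated to address.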