Paper title
Multi-Label Quantification
Paper authors
Paper abstract
Quantification, variously called "supervised prevalence estimation" or "learning to quantify", is the supervised learning task of generating predictors of the relative frequencies (a.k.a. "prevalence values") of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems and, to a lesser extent, single-label multiclass problems, the multi-label setting (i.e., the scenario in which the classes of interest are not mutually exclusive) remains by and large unexplored. A straightforward solution to the multi-label quantification problem could simply consist of recasting the problem as a set of independent binary quantification problems. Such a solution is simple but naïve, since the independence assumption upon which it rests is, in most cases, not satisfied. In these cases, knowing the relative frequency of one class could be of help in determining the prevalence of other related classes. We propose the first truly multi-label quantification methods, i.e., methods for inferring estimators of class prevalence values that strive to leverage the stochastic dependencies among the classes of interest in order to predict their relative frequencies more accurately. We show empirical evidence that natively multi-label solutions outperform the naïve approaches by a large margin. The code to reproduce all our experiments is available online.
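To make the naïve baseline described in the abstract concrete, the following is a minimal sketch of treating multi-label quantification as a set of independent binary problems, using plain "classify and count" per class. It is not the method the paper proposes, and the function names (`prevalences`, `naive_quantify`) and the toy keyword classifiers are hypothetical, introduced only for illustration.

```python
from typing import Callable, List, Sequence


def prevalences(label_matrix: Sequence[Sequence[int]]) -> List[float]:
    """Per-class relative frequency in a multi-label sample.

    label_matrix[i][j] is 1 if item i carries class j, else 0.
    Classes are not mutually exclusive, so the values need not sum to 1.
    """
    n_items = len(label_matrix)
    if n_items == 0:
        raise ValueError("cannot estimate prevalence on an empty sample")
    n_classes = len(label_matrix[0])
    return [sum(row[j] for row in label_matrix) / n_items
            for j in range(n_classes)]


def naive_quantify(items: Sequence[str],
                   classifiers: Sequence[Callable[[str], int]]) -> List[float]:
    """Naive baseline: one independent binary classifier per class.

    Each class is counted on its own; co-occurrence between classes is
    ignored, which is the independence assumption the abstract criticises.
    """
    predictions = [[clf(x) for clf in classifiers] for x in items]
    return prevalences(predictions)


# Hypothetical toy classifiers: each flags a class by keyword match.
toy_classifiers = [
    lambda text: int("goal" in text),     # class 0: "sports"
    lambda text: int("match" in text),    # class 1: "football"
    lambda text: int("budget" in text),   # class 2: "politics"
]

sample = [
    "late goal decides the match",
    "a goal in every match this season",
    "parliament votes on the budget",
    "weather stays mild",
]

estimated = naive_quantify(sample, toy_classifiers)
```

In this toy sample the first two classes always co-occur, so the prevalence of one carries information about the other; the per-class counting above has no way to use that information.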