Paper Title
Neighborhood-based Pooling for Population-level Label Distribution Learning
Authors
Abstract
Supervised machine learning often requires human-annotated data. While annotator disagreement is typically interpreted as evidence of noise, population-level label distribution learning (PLDL) treats the collection of annotations for each data item as a sample of the opinions of a population of human annotators, among whom disagreement may be proper and expected, even with no noise present. From this perspective, a typical training set may contain a large number of very small samples, one for each data item, none of which is large enough on its own to be considered representative of the underlying population's beliefs about that item. We propose an algorithmic framework and new statistical tests for PLDL that account for sample size. We apply them to previously proposed methods for sharing labels across similar data items. We also propose new approaches for label sharing, which we call neighborhood-based pooling.
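
The abstract does not say how neighborhoods are defined, so the following is only an illustrative sketch of the general idea of sharing labels across similar items, not the paper's algorithm. It assumes each item has a feature vector, that "similar" means within a fixed Euclidean radius, and that pooling means summing raw annotation counts before normalizing. The function name `pool_label_counts`, the `radius` parameter, and the toy data are all hypothetical.

```python
from math import dist

def pool_label_counts(features, label_counts, radius):
    """Pool annotation counts over each item's neighborhood.

    Assumption (not from the paper): a neighborhood is every item
    within `radius` of the center item in Euclidean distance,
    including the center item itself.
    """
    pooled = []
    for center in features:
        totals = [0] * len(label_counts[0])
        for other, counts in zip(features, label_counts):
            if dist(center, other) <= radius:
                # Add this neighbor's per-label annotation counts.
                for k, c in enumerate(counts):
                    totals[k] += c
        n = sum(totals)
        # Normalize pooled counts into an empirical label distribution.
        pooled.append([t / n for t in totals] if n else totals)
    return pooled

# Toy data: three items, two labels, four annotations per item.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
label_counts = [[3, 1], [1, 3], [0, 4]]
pooled = pool_label_counts(features, label_counts, radius=0.5)
print(pooled)
```

In this toy example the first two items lie close together, so each is estimated from eight pooled annotations rather than four, while the isolated third item keeps its own four. That is the trade-off the abstract points to: pooling enlarges the per-item sample, on the assumption that nearby items draw similar opinions from the annotator population.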