Paper title

Learning from Positive and Unlabeled Data by Identifying the Annotation Process

Authors

Naji Shajarisales, Peter Spirtes, Kun Zhang

Abstract

In binary classification, Learning from Positive and Unlabeled data (LePU) is semi-supervised learning in which labeled examples come from only one class. Most research on LePU relies on some form of independence between the selection process of annotated examples and the features of the annotated class, known as the Selected Completely At Random (SCAR) assumption. Yet the annotation process is an important part of data collection, and in many cases it naturally depends on certain features of the data (e.g., the intensity of an image and the size of the object to be detected in the image). Without any constraints on the model for the annotation process, classification results in the LePU problem will be highly non-unique, so proper yet flexible constraints are needed. In this work we incorporate more flexible and realistic models for the annotation process than SCAR and, more importantly, offer a solution for the challenging LePU problem. On the theory side, we establish the identifiability of the properties of the annotation process and the classification function under the considered constraints on the data-generating process. We also propose an inference algorithm to learn the parameters of the model, with successful experimental results on both simulated and real data. Finally, we propose a novel real-world dataset for LePU as a benchmark dataset for future studies.
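
To illustrate the contrast the abstract draws between SCAR and a feature-dependent annotation process, here is a minimal simulation sketch. It is not the authors' model or inference algorithm. The one-dimensional Gaussian features, the constant propensity of 0.3, the logistic propensity, and all function names are assumptions made for illustration only. The sketch compares the mean feature of annotated positives with the mean feature of all positives: under SCAR the two should roughly agree, while under feature-dependent annotation the annotated sample is shifted.

```python
import math
import random


def simulate(n, propensity, seed=0):
    """Draw n points and return (x, y, s) triples; s=1 marks an annotated positive."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0
        # Assumed toy features: positives ~ N(1, 1), negatives ~ N(-1, 1).
        x = rng.gauss(1.0 if y == 1 else -1.0, 1.0)
        # Only positives can be annotated; negatives always stay unlabeled.
        s = 1 if (y == 1 and rng.random() < propensity(x)) else 0
        data.append((x, y, s))
    return data


def scar(x):
    """SCAR: the chance of annotating a positive does not depend on x."""
    return 0.3


def feature_dependent(x):
    """Hypothetical alternative: larger x makes annotation more likely."""
    return 1.0 / (1.0 + math.exp(-2.0 * x))


def mean(values):
    return sum(values) / len(values)


def annotation_bias(data):
    """Mean feature of annotated positives minus mean feature of all positives."""
    annotated = [x for x, y, s in data if s == 1]
    positives = [x for x, y, s in data if y == 1]
    return mean(annotated) - mean(positives)


bias_scar = annotation_bias(simulate(20000, scar))
bias_dependent = annotation_bias(simulate(20000, feature_dependent))
print("bias under SCAR:", round(bias_scar, 3))
print("bias under feature-dependent annotation:", round(bias_dependent, 3))
```

In this toy setting the annotated positives are a representative sample of the positive class only under SCAR, which is why methods built on SCAR can be misled when annotation depends on the features.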
