Uncertainty Estimation for Community Standards Violation in Online Social Networks

Paper Title

Uncertainty Estimation For Community Standards Violation In Online Social Networks

Authors

Torabi, Narjes; Arora, Nimar S.; Yu, Emma; Shah, Kinjal; Liu, Wenshun; Tingley, Michael

Abstract

Online Social Networks (OSNs) provide a platform for users to share their thoughts and opinions with their community of friends or with the general public. To keep the platform safe for all users and compliant with local laws, OSNs typically create a set of community standards organized into policy groups, and use Machine Learning (ML) models to identify and remove content that violates any of the policies. However, of the billions of content items uploaded daily, only a small fraction is so unambiguously violating that it can be removed by the automated models. Prevalence estimation is the task of estimating the fraction of violating content among the residual items by sending a small sample of these items to human labelers to obtain ground truth labels. This task is exceedingly hard because, even though ML scores or features are easily available for all of the billions of items, ground truth labels can be obtained for only a few thousand of them due to practical considerations. Indeed, the prevalence can be so low that, even after a judicious choice of items to be labeled, there can be many days on which not a single item is labeled violating. A pragmatic choice for such low-prevalence regimes, $10^{-4}$ to $10^{-5}$, is to report the upper bound prevalence (UBP), that is, the $97.5\%$ confidence bound on prevalence, which takes the uncertainties of the sampling and labeling processes into account and gives a smoothed estimate. In this work we present two novel techniques for this UBP task, Bucketed-Beta-Binomial and Bucketed-Gaussian Process, and demonstrate on real and simulated data that they have much better coverage than the commonly used bootstrapping technique.
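
To make the idea of an upper bound on prevalence concrete, here is a minimal sketch of a $97.5\%$ upper bound for a single pool of labeled items. It is not the paper's Bucketed-Beta-Binomial or Bucketed-Gaussian Process method: it assumes one bucket, a uniform Beta(1, 1) prior, simple random sampling, and error-free labels. The function names `binom_cdf` and `upper_bound_prevalence` are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def upper_bound_prevalence(violations, labeled, level=0.975):
    """Upper `level` quantile of the Beta(violations + 1, labeled - violations + 1)
    posterior, i.e. a Binomial likelihood with an assumed uniform prior.

    Intended for small violation counts, as in the low-prevalence regime.
    """
    # For integer parameters, the Beta(k + 1, n - k + 1) CDF at p equals
    # 1 - P(Binomial(n + 1, p) <= k), so we solve
    # P(Binomial(n + 1, p) <= k) = 1 - level by bisection.
    target = 1.0 - level
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2
        # The binomial CDF decreases as p grows, so move right while above target.
        if binom_cdf(violations, labeled + 1, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    # Hypothetical day: 3000 labeled items, none labeled violating.
    print(upper_bound_prevalence(0, 3000))
```

Even with zero observed violations the bound stays strictly positive (about $1.2 \times 10^{-3}$ for 3000 labels under these assumptions), which is why a day with no violating labels still yields a usable report.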
