Achieving Representative Data via Convex Hull Feasibility Sampling Algorithms

Paper Title

Achieving Representative Data via Convex Hull Feasibility Sampling Algorithms

Authors

Laura Niss, Yuekai Sun, Ambuj Tewari

Abstract

Sampling biases in training data are a major source of algorithmic biases in machine learning systems. Although there are many methods that attempt to mitigate such algorithmic biases during training, the most direct and obvious way is simply collecting more representative training data. In this paper, we consider the task of assembling a training dataset in which minority groups are adequately represented from a given set of data sources. In essence, this is an adaptive sampling problem: determine whether a given point lies in the convex hull of the means of a set of unknown distributions. We present adaptive sampling methods to determine, with high confidence, whether it is possible to assemble a representative dataset from the given data sources. We also demonstrate the efficacy of our policies in simulations in Bernoulli and multinomial settings.
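
To make the convex hull feasibility question concrete, here is a minimal Python sketch of the one-dimensional Bernoulli case, where the convex hull of the source means is simply the interval between the smallest and largest mean. It is not the authors' policy: it samples every source uniformly and stops on Hoeffding confidence bounds, which is an assumption made for illustration. The names `hull_feasibility_1d`, `hoeffding_radius`, and `bernoulli`, and the parameters `delta` and `max_rounds`, are hypothetical.

```python
import math
import random


def hoeffding_radius(n, num_sources, delta):
    """Hoeffding confidence radius after n samples per source.

    The log term applies a union bound over sources and rounds, so the
    intervals hold simultaneously with probability at least 1 - delta.
    """
    return math.sqrt(math.log(4 * num_sources * n * n / delta) / (2 * n))


def hull_feasibility_1d(sources, target, delta=0.05, max_rounds=5000):
    """Decide whether `target` lies between the smallest and largest source mean.

    `sources` is a list of zero-argument callables, each returning 0 or 1.
    Returns "feasible", "infeasible", or "undecided" if the budget runs out.
    """
    k = len(sources)
    sums = [0.0] * k
    for n in range(1, max_rounds + 1):
        # Uniform (non-adaptive) allocation: one draw from every source per round.
        for i, draw in enumerate(sources):
            sums[i] += draw()
        r = hoeffding_radius(n, k, delta)
        lower = [s / n - r for s in sums]
        upper = [s / n + r for s in sums]
        # Feasible: some mean is confidently >= target and some mean confidently <= target.
        if max(lower) >= target and min(upper) <= target:
            return "feasible"
        # Infeasible: every mean is confidently on the same side of target.
        if max(upper) < target or min(lower) > target:
            return "infeasible"
    return "undecided"


def bernoulli(p, rng):
    """Return a sampler for a Bernoulli(p) source (hypothetical data source)."""
    return lambda: 1 if rng.random() < p else 0


rng = random.Random(0)
# Target proportion 0.5 lies between the source means 0.2 and 0.8.
print(hull_feasibility_1d([bernoulli(0.2, rng), bernoulli(0.8, rng)], target=0.5))
# Target proportion 0.3 lies below both source means 0.7 and 0.9.
print(hull_feasibility_1d([bernoulli(0.7, rng), bernoulli(0.9, rng)], target=0.3))
```

An adaptive policy would concentrate samples on the sources whose means are hardest to separate from the target, instead of sampling all sources equally. In the multinomial setting the interval check becomes a test of whether the target vector lies in a higher-dimensional convex hull.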
