提高了公平最大多元化的近似和可伸缩性

论文标题

提高了公平最大多元化的近似和可伸缩性

Improved Approximation and Scalability for Fair Max-Min Diversification

论文作者

Addanki, Raghavendra, McGregor, Andrew, Meliou, Alexandra, Moumoulidou, Zafeiria

论文摘要

Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i\in [m]$, such that the minimum pairwise distance between selected points is maximized.该问题由Moumoulidou等人提出。 [ICDT 2021]，并且是由于需要在各种应用中进行大型数据集的需要而激发的，以便派生的样本在多样性上达到平衡，即，包括一对选定点之间的最小距离，即公平性，即确保每个类别的足够点。我们证明了以下结果： 1。我们首先考虑通用度量空间。我们提出了一种随机多项式时间算法，该算法将$ 2 $ approximation返回多样性，但仅满足预期的公平限制。在此结果的基础上，我们提出了$ 6 $的APPRXIMATION，可以保证满足任何常数$ε$的公平限制$ 1-ε$。我们还提出了一种线性时间算法，该算法返回$ M+1 $ $ $ a的近似值。最好的先前结果是$ 3M-1 $ $近似。 2。然后，我们专注于欧几里得指标。我们首先表明问题可以在一个维度上精确解决。对于常数尺寸，类别和任何常数$ε> 0 $，我们提供了$ 1+ε$近似算法，该算法在$ O（nk）+2^{o（k）} $ time中运行，其中$ k = k_1+\ ldots+k_m $。我们可以将运行时间提高到$ O（nk）+ poly（k）$，而只需从[m] $中选择$（1-ε）k_i $点。最后，我们提出了适合处理大量数据集的算法，包括单频数据流算法和分布式处理的合并核心。

Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i\in [m]$, such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance over diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results: 1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a factor $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-ε$ for any constant $ε$. We also present a linear time algorithm returning an $m+1$ approximation with exact fairness. The best previous result was a $3m-1$ approximation. 2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories and any constant $ε>0$, we present a $1+ε$ approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time where $k=k_1+\ldots+k_m$. We can improve the running time to $O(nk)+ poly(k)$ at the expense of only picking $(1-ε) k_i$ points from category $i\in [m]$. Finally, we present algorithms suitable to processing massive data sets including single-pass data stream algorithms and composable coresets for the distributed processing.

下载PDF全文

下载文献需遵守相关版权规定

论文标题