论文标题
关于k-nearthign问题的I/O复杂性
On the I/O complexity of the k-nearest neighbor problem
论文作者
论文摘要
我们考虑了$ k $ - neart邻居($ k $ -nn)问题的精确和近似版本的静态外部内存索引,并在标准不可分割的假设下显示新的下限: - 高维$ k $ -nn的多项式空间索引方案无法利用块转移:$ω(k)$ block读取需要读取查询。 - 对于$ \ ell_ \ infty $ metric,即使我们允许$ c $ appoximate最近的邻居返回,下限也保持着(1,3)$。 - 限制在$ c <3 $的限制是必须的:对于每个指标,在Hellerstein等人的索引模型中都存在一个索引方案。 - 对于特定的指标,可能具有更好近似因素的数据结构。对于$ k $ -nn的锤子空间和每个近似因子$ c> 1 $,存在一个多项式空间数据结构,该结构返回$ \ lceil k/b \ rceil $ i/os,返回$ k $ c $ c $ c $ approxapproxapproxapproxapproxightimate。 为了显示这些下限,我们开发了两种新技术:首先,要处理近似算法在决定返回的结果方面具有更大的自由度,我们开发了Hellerstein等人的$λ$ -SET工作负载技术的轻松版本。这项技术使我们可以显示出在$ d \ geq n $尺寸中的下限。为了将下限延伸至$ d = o(k \ log(n/k))$尺寸,我们开发了一种可能具有独立关注的新确定性缩小技术。
We consider static, external memory indexes for exact and approximate versions of the $k$-nearest neighbor ($k$-NN) problem, and show new lower bounds under a standard indivisibility assumption: - Polynomial space indexing schemes for high-dimensional $k$-NN in Hamming space cannot take advantage of block transfers: $Ω(k)$ block reads are needed to to answer a query. - For the $\ell_\infty$ metric the lower bound holds even if we allow $c$-appoximate nearest neighbors to be returned, for $c \in (1, 3)$. - The restriction to $c < 3$ is necessary: For every metric there exists an indexing scheme in the indexability model of Hellerstein et al.~using space $O(kn)$, where $n$ is the number of points, that can retrieve $k$ 3-approximate nearest neighbors using $\lceil k/B\rceil$ I/Os, which is optimal. - For specific metrics, data structures with better approximation factors are possible. For $k$-NN in Hamming space and every approximation factor $c>1$ there exists a polynomial space data structure that returns $k$ $c$-approximate nearest neighbors in $\lceil k/B\rceil$ I/Os. To show these lower bounds we develop two new techniques: First, to handle that approximation algorithms have more freedom in deciding which result set to return we develop a relaxed version of the $λ$-set workload technique of Hellerstein et al. This technique allows us to show lower bounds that hold in $d\geq n$ dimensions. To extend the lower bounds down to $d = O(k \log(n/k))$ dimensions, we develop a new deterministic dimension reduction technique that may be of independent interest.