OOD-Diskann：用于分发远离查询的高效且可扩展的图表

论文标题

OOD-Diskann：用于分发远离查询的高效且可扩展的图表

OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries

论文作者

Jaiswal, Shikhar, Krishnaswamy, Ravishankar, Garg, Ankit, Simhadri, Harsha Vardhan, Agrawal, Sheshansh

论文摘要

用于近似最近的邻居搜索（ANN）的最新算法（例如Diskann，Faiss-IVF和HNSW）构建了数据依赖性指数，这些指数可通过过度拟合到索引数据分布，从而提供了比数据不合时式指数更好的准确性和搜索效率。当查询数据是从不同的分布中绘制的 - 例如，当索引代表图像嵌入并查询代表文本嵌入时，此类算法将失去大部分性能优势。在各种数据集中，对于固定的召回目标，与分布（ID）查询相比，延迟的延迟较大或更大的数量级或更多的数量级或更多。我们在这项工作中解决的问题是，如果给出索引构造的少量示例集，是否可以使ANN算法有效地用于OOD查询。我们通过呈现OOD-Diskann的积极回答，该ood-diskann使用了宽度样本（索引设置大小的1％）查询，并提供了相似内存足迹的SOTA算法的平均查询延迟最多提高40％。 OOD-Diskann是可扩展的，具有基于图的ANN指数的效率。我们的一些贡献也可以提高ID查询的查询效率。

State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) such as DiskANN, FAISS-IVF, and HNSW build data dependent indices that offer substantially better accuracy and search efficiency over data-agnostic indices by overfitting to the index data distribution. When the query data is drawn from a different distribution - e.g., when index represents image embeddings and query represents textual embeddings - such algorithms lose much of this performance advantage. On a variety of datasets, for a fixed recall target, latency is worse by an order of magnitude or more for Out-Of-Distribution (OOD) queries as compared to In-Distribution (ID) queries. The question we address in this work is whether ANNS algorithms can be made efficient for OOD queries if the index construction is given access to a small sample set of these queries. We answer positively by presenting OOD-DiskANN, which uses a sparing sample (1% of index set size) of OOD queries, and provides up to 40% improvement in mean query latency over SoTA algorithms of a similar memory footprint. OOD-DiskANN is scalable and has the efficiency of graph-based ANNS indices. Some of our contributions can improve query efficiency for ID queries as well.

下载PDF全文

下载文献需遵守相关版权规定

论文标题