Paper Title
AugTriever: Unsupervised Dense Retrieval and Domain Adaptation by Scalable Data Augmentation
Paper Authors
Paper Abstract
Dense retrievers have made significant strides in text retrieval and open-domain question answering. However, most of these achievements have relied heavily on extensive human-annotated supervision. In this study, we aim to develop unsupervised methods for improving dense retrieval models. We propose two approaches that enable annotation-free and scalable training by creating pseudo query-document pairs: query extraction and transferred query generation. The query extraction method involves selecting salient spans from the original document to generate pseudo queries. The transferred query generation method, on the other hand, utilizes generation models trained for other NLP tasks, such as summarization, to produce pseudo queries. Through extensive experimentation, we demonstrate that models trained with these augmentation methods can achieve comparable, if not better, performance than multiple strong dense baselines. Moreover, combining these strategies leads to further improvements, yielding superior performance in unsupervised dense retrieval, unsupervised domain adaptation, and supervised fine-tuning, benchmarked on both the BEIR and ODQA datasets. Code and datasets are publicly available at https://github.com/salesforce/AugTriever.
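The query-extraction idea described in the abstract can be illustrated with a toy heuristic. The sketch below scores contiguous word spans by in-document token frequency and takes the highest-scoring span as the pseudo query; this particular salience score is our assumption for illustration, not necessarily the scoring strategy used in the paper.

```python
import re
from collections import Counter


def extract_pseudo_query(document: str, span_len: int = 8):
    """Select a salient span from `document` as a pseudo query.

    Salience here is a simple stand-in heuristic: the sum of
    within-document token frequencies over a fixed-length span.
    Returns a (pseudo_query, document) training pair.
    """
    words = re.findall(r"\w+", document)
    freq = Counter(w.lower() for w in words)

    best_span, best_score = words[:span_len], -1.0
    # Score every contiguous span of `span_len` words.
    for i in range(max(1, len(words) - span_len + 1)):
        span = words[i:i + span_len]
        score = sum(freq[w.lower()] for w in span)
        if score > best_score:
            best_score, best_span = score, span

    return " ".join(best_span), document


query, doc = extract_pseudo_query(
    "Dense retrievers map queries and documents into a shared vector "
    "space. Training dense retrievers usually requires labeled pairs."
)
```

Pairs produced this way can be fed directly into a standard contrastive training loop (pseudo query as the anchor, its source document as the positive), which is what makes the procedure annotation-free and scalable.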