Paper Title
Questions Are All You Need to Train a Dense Passage Retriever
Paper Authors
Paper Abstract
We introduce ART, a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. ART, in contrast, only requires access to unpaired inputs and outputs (e.g., questions and potential answer documents). It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question. Training for retrieval based on question reconstruction enables effective unsupervised learning of both document and question encoders, which can later be incorporated into complete Open QA systems without any further finetuning. Extensive experiments demonstrate that ART obtains state-of-the-art results on multiple QA retrieval benchmarks with only generic initialization from a pre-trained language model, removing the need for labeled data and task-specific losses.
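To make the abstract's two-step scheme concrete, below is a minimal toy sketch in PyTorch of retrieval-then-reconstruction training. It is an illustrative sketch under assumptions, not the paper's implementation: the dual encoders are bag-of-embedding stand-ins, the corpus is random token ids, and the frozen bilinear `recon` module stands in for the pre-trained language model that scores question reconstruction given a document; matching the retriever's top-K score distribution to the reconstruction-likelihood distribution via KL divergence is one plausible way to turn that probability into a training signal.

```python
# Toy sketch of ART-style unsupervised retriever training (assumptions noted above).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, K = 1000, 64, 4          # toy vocabulary size, embedding dim, top-K

# Dual encoder: separate question and document encoders (bags of embeddings here;
# the real system uses pre-trained transformer encoders).
q_enc = torch.nn.EmbeddingBag(VOCAB, DIM)
d_enc = torch.nn.EmbeddingBag(VOCAB, DIM)

# Stand-in for the frozen pre-trained LM that scores question reconstruction,
# i.e. log p(question | document). It is NOT trained.
recon = torch.nn.Bilinear(DIM, DIM, 1)
for p in recon.parameters():
    p.requires_grad = False

docs = torch.randint(0, VOCAB, (100, 32))    # toy corpus: 100 docs of 32 tokens
question = torch.randint(0, VOCAB, (1, 16))  # one toy question

opt = torch.optim.Adam(list(q_enc.parameters()) + list(d_enc.parameters()), lr=1e-3)

for step in range(10):
    q_vec = q_enc(question)                  # (1, DIM)
    d_vecs = d_enc(docs)                     # (100, DIM)

    # (1) Retrieve the top-K evidence documents by inner-product score.
    scores = (d_vecs @ q_vec.T).squeeze(1)   # (100,)
    top_scores, top_idx = scores.topk(K)

    # (2) Score how well each retrieved document "reconstructs" the question.
    with torch.no_grad():
        recon_ll = recon(d_vecs[top_idx], q_vec.expand(K, -1)).squeeze(1)

    # Train the retriever to match the reconstruction-based relevance
    # distribution over the top-K documents (KL divergence).
    loss = F.kl_div(F.log_softmax(top_scores, dim=0),
                    F.softmax(recon_ll, dim=0), reduction="sum")
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: loss {loss.item():.4f}")
```

Because only the question and document encoders receive gradients, this setup needs no labeled question-document pairs, which matches the unpaired-data setting the abstract describes.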