Paper Title
ConTextual Masked Auto-Encoder for Dense Passage Retrieval
Paper Authors
Paper Abstract
Dense passage retrieval aims to retrieve the passages relevant to a query from a large corpus based on dense representations (i.e., vectors) of the query and the passages. Recent studies have explored improving pre-trained language models to boost dense retrieval performance. This paper proposes CoT-MAE (ConTextual Masked Auto-Encoder), a simple yet effective generative pre-training method for dense passage retrieval. CoT-MAE employs an asymmetric encoder-decoder architecture that learns to compress sentence semantics into a dense vector through self-supervised and context-supervised masked auto-encoding. Specifically, self-supervised masked auto-encoding learns to model the semantics of the tokens inside a text span, while context-supervised masked auto-encoding learns to model the semantic correlation between text spans. We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines, demonstrating the high efficiency of CoT-MAE. Our code is available at https://github.com/caskcsg/ir/tree/main/cotmae.
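The abstract describes an asymmetric encoder-decoder trained with two masked auto-encoding losses: a self-supervised loss on the encoded span and a context-supervised loss in which a shallow decoder reconstructs a neighboring span conditioned on the encoder's dense vector. The sketch below is a minimal, illustrative PyTorch rendering of that objective under assumed hyperparameters (model sizes, vocabulary, layer counts, and the pre-masked inputs are all assumptions for illustration, not the paper's exact implementation).

```python
# Minimal sketch of a CoT-MAE-style pre-training objective (assumptions, not the
# official implementation): a deep encoder does standard masked-LM on span A,
# and a shallow decoder reconstructs masked tokens of the contextual span B
# conditioned on span A's dense [CLS] vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, IGNORE = 30522, 768, -100  # assumed BERT-like vocab and hidden size

class CoTMAESketch(nn.Module):
    def __init__(self, enc_layers=12, dec_layers=1):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        make_layer = lambda: nn.TransformerEncoderLayer(DIM, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(make_layer(), enc_layers)  # deep encoder
        self.decoder = nn.TransformerEncoder(make_layer(), dec_layers)  # shallow decoder
        self.lm_head = nn.Linear(DIM, VOCAB)  # shared LM head for both losses

    def forward(self, span_a, labels_a, span_b, labels_b):
        # span_a / span_b: already-masked token ids; labels_*: original ids at
        # masked positions, IGNORE elsewhere.
        # Self-supervised masked auto-encoding on span A.
        hid_a = self.encoder(self.embed(span_a))
        loss_self = F.cross_entropy(self.lm_head(hid_a).flatten(0, 1),
                                    labels_a.flatten(), ignore_index=IGNORE)
        # Context-supervised masked auto-encoding: the decoder sees only span A's
        # dense vector plus the masked span B, forcing the encoder to compress
        # span semantics into that single vector.
        cls_a = hid_a[:, :1]                                   # (batch, 1, DIM)
        dec_in = torch.cat([cls_a, self.embed(span_b)], dim=1)
        hid_b = self.decoder(dec_in)[:, 1:]                    # drop the prepended vector
        loss_ctx = F.cross_entropy(self.lm_head(hid_b).flatten(0, 1),
                                   labels_b.flatten(), ignore_index=IGNORE)
        return loss_self + loss_ctx
```

Keeping the decoder shallow (here a single layer, an assumed setting) is the key design choice: because the decoder is too weak to reconstruct the contextual span on its own, the reconstruction signal flows back through the encoder's dense vector, which is the representation later used for retrieval.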