Paper Title

UnifieR: A Unified Retriever for Large-Scale Retrieval

Paper Authors

Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Guodong Long, Kai Zhang, Daxin Jiang

Abstract

Large-scale retrieval is the task of recalling relevant documents from a huge collection given a query. It relies on representation learning to embed documents and queries into a common semantic encoding space. According to the encoding space, recent retrieval methods based on pre-trained language models (PLMs) can be coarsely categorized into either the dense-vector or the lexicon-based paradigm. These two paradigms unveil the PLMs' representation capability at different granularities, i.e., global sequence-level compression and local word-level contexts, respectively. Inspired by their complementary global-local contextualization and distinct representing views, we propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability. Experiments on passage retrieval benchmarks verify its effectiveness in both paradigms. A uni-retrieval scheme is further presented with even better retrieval quality. We lastly evaluate the model on the BEIR benchmark to verify its transferability.
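To make the contrast between the two paradigms concrete, here is a minimal sketch of how dense-vector scoring, lexicon-based scoring, and a combined "uni-retrieval"-style score could look. This is not the authors' implementation: the function names (`dense_score`, `lexicon_score`, `uni_score`), the toy shapes and weights, and the fusion hyperparameter `alpha` are all assumptions made for illustration.

```python
# Illustrative sketch (not the UnifieR authors' code) of the two scoring
# paradigms the abstract contrasts, plus a hypothetical fused score.
import numpy as np

def dense_score(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
    # Global sequence-level view: each text is compressed into one vector,
    # and relevance is an inner product in the shared encoding space.
    return float(q_vec @ d_vec)

def lexicon_score(q_weights: dict[str, float], d_weights: dict[str, float]) -> float:
    # Local word-level view: each text is a sparse bag of term weights,
    # and relevance accumulates over the overlapping terms.
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def uni_score(q_vec, d_vec, q_weights, d_weights, alpha: float = 0.5) -> float:
    # Hypothetical linear fusion of the two views, in the spirit of the
    # paper's uni-retrieval scheme; `alpha` is an assumed hyperparameter.
    return alpha * dense_score(q_vec, d_vec) + (1.0 - alpha) * lexicon_score(q_weights, d_weights)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_vec, d_vec = rng.normal(size=768), rng.normal(size=768)  # toy embeddings
    q_w = {"large": 1.2, "scale": 0.7, "retrieval": 2.0}       # toy term weights
    d_w = {"retrieval": 1.5, "document": 0.8}
    print(uni_score(q_vec, d_vec, q_w, d_w))
```

In a dual-representing model as described in the abstract, both the dense vector and the sparse term weights would come from a single PLM encoder pass; how they are produced and fused is specified in the paper itself, not in this sketch.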
