Topic-Grained Text Representation-based Model for Document Retrieval

Paper Title

Topic-Grained Text Representation-based Model for Document Retrieval

Authors

Mengxue Du, Shasha Li, Jie Yu, Jun Ma, Bin Ji, Huijun Liu, Wuhang Lin, Zibo Yi

Abstract

Document retrieval enables users to find the documents they need accurately and quickly. To meet retrieval-efficiency requirements, prevalent deep neural methods adopt a representation-based matching paradigm, which saves online matching time by pre-storing document representations offline. However, this paradigm consumes vast local storage space, especially when documents are stored as word-grained representations. To tackle this, we present TGTR, a Topic-Grained Text Representation-based Model for document retrieval. Following the representation-based matching paradigm, TGTR stores document representations offline to ensure retrieval efficiency, but it significantly reduces storage requirements by using novel topic-grained representations rather than traditional word-grained ones. Experimental results on TREC CAR and MS MARCO demonstrate that TGTR is consistently competitive with word-grained baselines in retrieval accuracy while requiring less than 1/10 of their storage space. Moreover, TGTR overwhelmingly surpasses global-grained baselines in retrieval accuracy.
