On the Interpolation of Contextualized Term-based Ranking with BM25 for Query-by-Example Retrieval

Paper title

On the Interpolation of Contextualized Term-based Ranking with BM25 for Query-by-Example Retrieval

Authors

Amin Abolghasemi, Arian Askari, Suzan Verberne

Abstract

Term-based ranking with pre-trained transformer-based language models has recently gained attention because it brings the contextualization power of transformer models to highly efficient term-based retrieval. In this work, we examine the generalizability of two of these deep contextualized term-based models in the context of query-by-example (QBE) retrieval, in which a seed document acts as the query to find relevant documents. In this setting, where queries are much longer than common keyword queries, BERT inference at query time is problematic because its cost grows quadratically with query length. We investigate TILDE and TILDEv2, both of which use the BERT tokenizer as their query encoder. With this approach, no BERT inference is needed at query time, and the query can be of any length. Our extensive evaluation on the four QBE tasks of the SciDocs benchmark shows that, in a query-by-example retrieval setting, TILDE and TILDEv2 are still less effective than a cross-encoder BERT ranker. However, we observe that BM25 can achieve ranking quality competitive with TILDE and TILDEv2, in contrast to the relative performance of these three models on short-query retrieval reported in prior work. This result raises the question of whether contextualized term-based ranking models are beneficial in the QBE setting. We follow up on our findings by studying score interpolation between the relevance scores from TILDE (TILDEv2) and BM25. We conclude that these two contextualized term-based ranking models capture different relevance signals than BM25, and that combining the different term-based rankers results in statistically significant improvements in QBE retrieval. Our work sheds light on the challenges of retrieval settings that differ from the common evaluation benchmarks.
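
The score interpolation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the version below is an assumption: each ranker's scores for one query are min-max normalized, then combined linearly with a weight alpha. The function names, the parameter alpha, and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one ranker's scores to [0, 1]; constant scores map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def interpolate(
    bm25: Dict[str, float],
    tilde: Dict[str, float],
    alpha: float,
) -> List[Tuple[str, float]]:
    """Assumed linear interpolation: alpha * BM25 + (1 - alpha) * TILDE.

    Both inputs map document id -> raw relevance score for a single query.
    A document missing from one ranker contributes 0.0 for that ranker.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    bm25_n = min_max_normalize(bm25)
    tilde_n = min_max_normalize(tilde)
    fused = {
        doc_id: alpha * bm25_n.get(doc_id, 0.0)
        + (1.0 - alpha) * tilde_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(tilde_n)
    }
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy scores for one seed document used as the query (made-up values).
bm25_scores = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
tilde_scores = {"d1": -30.0, "d2": -10.0, "d3": -20.0}
ranking = interpolate(bm25_scores, tilde_scores, alpha=0.5)
print([doc_id for doc_id, _ in ranking])  # ['d2', 'd1', 'd3']
```

In this toy case the two rankers disagree on the top document, and the fused ranking places d2 first because it scores reasonably well under both. In practice alpha would be tuned on held-out data; normalizing per query is one common way to put scores from different rankers on a comparable scale.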
