Paper Title
Transformer Based Language Models for Similar Text Retrieval and Ranking
Paper Authors
Paper Abstract
Most approaches for similar text retrieval and ranking with long natural language queries rely at some level on queries and responses having words in common with each other. Recent applications of transformer-based neural language models to text retrieval and ranking problems have been very promising, but still involve a two-step process in which result candidates are first obtained through bag-of-words-based approaches, and then reranked by a neural transformer. In this paper, we introduce novel approaches for effectively applying neural transformer models to similar text retrieval and ranking without an initial bag-of-words-based step. By eliminating the bag-of-words-based step, our approach is able to accurately retrieve and rank results even when they have no non-stopwords in common with the query. We accomplish this by using bidirectional encoder representations from transformers (BERT) to create vectorized representations of sentence-length texts, along with a vector nearest neighbor search index. We demonstrate both supervised and unsupervised means of using BERT to accomplish this task.
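The abstract's core pipeline is to embed sentence-length texts with BERT and query a vector nearest neighbor index directly, with no bag-of-words candidate step. The sketch below illustrates one way such a pipeline could look for the unsupervised case; it is not the authors' code. It assumes the Hugging Face `transformers` library, the `faiss` library, the `bert-base-uncased` checkpoint, and mean pooling over token embeddings, all of which are illustrative choices not specified in the abstract.

```python
# Minimal sketch: BERT sentence vectors + exact nearest-neighbor retrieval.
# Assumptions (not from the paper): model checkpoint, mean pooling, FAISS inner-product index.
import torch
import faiss
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(texts):
    """Mean-pool BERT's last hidden states into one L2-normalized vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # zero out padding tokens
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    vecs = pooled.cpu().numpy().astype("float32")
    faiss.normalize_L2(vecs)                                 # cosine similarity via inner product
    return vecs

corpus = [
    "How do I reset my account password?",
    "The quarterly report is due next Friday.",
    "Steps for recovering access to a locked profile.",
]
index = faiss.IndexFlatIP(768)   # exact inner-product search over BERT vectors
index.add(embed(corpus))

# A query sharing no non-stopwords with the relevant texts can still retrieve them.
query = "I forgot my login credentials"
scores, ids = index.search(embed([query]), 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {corpus[i]}")
```

In this setup, retrieval quality rests entirely on the embedding; the supervised variant mentioned in the abstract would fine-tune BERT on labeled query-response pairs before indexing, whereas the sketch above uses off-the-shelf BERT vectors.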