Paper title
Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval
Authors
Abstract
Passage retrieval is a fundamental task in information retrieval (IR) research and has drawn much attention recently. For English, the availability of large-scale annotated datasets (e.g., MS MARCO) and the emergence of deep pre-trained language models (e.g., BERT) have resulted in substantial improvements to existing passage retrieval systems. For Chinese, however, and especially for specific domains, passage retrieval systems are still immature because high-quality annotated datasets are limited in scale. In this paper, we therefore present a novel multi-domain Chinese dataset for passage retrieval (Multi-CPR). The dataset is collected from three different domains: E-commerce, Entertainment video, and Medical. Each dataset contains millions of passages and a certain number of human-annotated query-passage relevance pairs. We implement various representative passage retrieval methods as baselines. We find that the performance of retrieval models trained on general-domain data inevitably decreases on specific domains. Nevertheless, a passage retrieval system built on an in-domain annotated dataset can achieve significant improvement, which demonstrates the necessity of in-domain labeled data for further optimization. We hope the release of the Multi-CPR dataset can serve as a benchmark for Chinese passage retrieval in specific domains and advance future studies.