Paper Title
Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages
Authors
Abstract
MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc retrieval across 18 different languages, which collectively encompass over three billion native speakers around the world. These languages have diverse typologies, originate from many different language families, and are associated with varying amounts of available resources -- including what researchers typically characterize as high-resource as well as low-resource languages. Our dataset is designed to support the creation and evaluation of models for monolingual retrieval, where the queries and the corpora are in the same language. In total, we have gathered over 700k high-quality relevance judgments for around 77k queries over Wikipedia in these 18 languages, where all assessments have been performed by native speakers hired by our team. Our goal is to spur research that will improve retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have been traditionally underserved. This overview paper describes the dataset and baselines that we share with the community. The MIRACL website is live at http://miracl.ai/.