Paper Title

Towards Best Practices for Training Multilingual Dense Retrieval Models

Authors

Xinyu Zhang, Kelechi Ogueji, Xueguang Ma, Jimmy Lin

Abstract

Dense retrieval models using a transformer-based bi-encoder design have emerged as an active area of research. In this work, we focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design. Although recent work with multilingual transformers demonstrates that they exhibit strong cross-lingual generalization capabilities, there remain many open research questions, which we tackle here. Our study is organized as a "best practices" guide for training multilingual dense retrieval models, broken down into three main scenarios: where a multilingual transformer is available, but relevance judgments are not available in the language of interest; where both models and training data are available; and, where training data are available but not models. In considering these scenarios, we gain a better understanding of the role of multi-stage fine-tuning, the strength of cross-lingual transfer under various conditions, the usefulness of out-of-language data, and the advantages of multilingual vs. monolingual transformers. Our recommendations offer a guide for practitioners building search applications, particularly for low-resource languages, and while our work leaves open a number of research questions, we provide a solid foundation for future work.
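To make the bi-encoder design the abstract refers to concrete, below is a minimal sketch of dense retrieval scoring with a multilingual encoder. The model choice (bert-base-multilingual-cased), the mean-pooling strategy, and the dot-product scoring are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
# Minimal bi-encoder dense retrieval sketch, assuming a Hugging Face
# multilingual encoder; pooling and scoring choices are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumption: any multilingual encoder could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def encode(texts: list[str]) -> torch.Tensor:
    """Encode texts into dense vectors via masked mean pooling over token embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)      # (batch, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Queries and passages share one encoder (the two "towers" are tied here).
query_vec = encode(["what is dense retrieval?"])
passage_vecs = encode([
    "Dense retrieval encodes queries and passages into vectors for nearest-neighbor search.",
    "An unrelated passage about cooking pasta.",
])
scores = query_vec @ passage_vecs.T  # dot-product relevance scores, shape (1, 2)
print(scores)
```

Because queries and passages are encoded independently, passage vectors can be precomputed and indexed, which is what makes this design practical for search; fine-tuning (including the multi-stage and cross-lingual transfer setups the abstract studies) adjusts the encoder so these scores reflect relevance in the language of interest.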
