Paper Title

A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank

Authors

Dan Malkin, Tomasz Limisiewicz, Gabriel Stanovsky

Abstract

We show that the choice of pretraining languages affects downstream cross-lingual transfer for BERT-based models. We inspect zero-shot performance in balanced data conditions to mitigate data size confounds, classifying pretraining languages that improve downstream performance as donors, and languages that are improved in zero-shot performance as recipients. We develop a method of quadratic time complexity in the number of languages to estimate these relations, instead of an exponential exhaustive computation of all possible combinations. We find that our method is effective on a diverse set of languages spanning different linguistic features and two downstream tasks. Our findings can inform developers of large-scale multilingual language models in choosing better pretraining configurations.
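To make the complexity claim concrete, the following is a minimal sketch of how the number of pretraining configurations grows under the two regimes. It assumes, purely for illustration, that the quadratic method works over unordered language pairs and that the exhaustive alternative enumerates every non-empty subset of languages; the abstract does not spell out either detail. The language codes and function names are hypothetical.

```python
from itertools import combinations


def pairwise_configs(languages):
    """Return all unordered language pairs: n * (n - 1) / 2 configurations.

    Assumption: the quadratic method is taken here to operate on pairs.
    """
    return list(combinations(languages, 2))


def exhaustive_config_count(n):
    """Count every non-empty subset of n languages: 2**n - 1 configurations."""
    return 2 ** n - 1


# Hypothetical language set, chosen only to illustrate the counts.
languages = ["en", "ru", "ar", "he", "zh"]
pairs = pairwise_configs(languages)

print(len(pairs))                               # quadratic growth
print(exhaustive_config_count(len(languages)))  # exponential growth
```

With 5 languages the gap is small (10 pairs versus 31 subsets), but under the same assumptions 20 languages would give 190 pairs versus 1,048,575 subsets, which is why an exhaustive search quickly becomes impractical.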
