Paper Title


Don't Use English Dev: On the Zero-Shot Cross-Lingual Evaluation of Contextual Embeddings

Paper Authors

Phillip Keung, Yichao Lu, Julian Salazar, Vikas Bhardwaj

Paper Abstract


Multilingual contextual embeddings have demonstrated state-of-the-art performance in zero-shot cross-lingual transfer learning, where multilingual BERT is fine-tuned on one source language and evaluated on a different target language. However, published results for mBERT zero-shot accuracy vary as much as 17 points on the MLDoc classification task across four papers. We show that the standard practice of using English dev accuracy for model selection in the zero-shot setting makes it difficult to obtain reproducible results on the MLDoc and XNLI tasks. English dev accuracy is often uncorrelated (or even anti-correlated) with target language accuracy, and zero-shot performance varies greatly at different points in the same fine-tuning run and between different fine-tuning runs. These reproducibility issues are also present for other tasks with different pre-trained embeddings (e.g., MLQA with XLM-R). We recommend providing oracle scores alongside zero-shot results: still fine-tune using English data, but choose a checkpoint with the target dev set. Reporting this upper bound makes results more consistent by avoiding arbitrarily bad checkpoints.
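The core recommendation can be illustrated with a minimal sketch of checkpoint selection. The accuracy numbers and checkpoint steps below are entirely invented for illustration, not taken from the paper; the point is only the mechanism: the checkpoint with the best English dev accuracy can have poor target-language transfer, while the oracle pick uses the target dev set instead.

```python
# Sketch of the checkpoint-selection issue described in the abstract.
# All accuracy numbers below are illustrative, not from the paper.

# Per-checkpoint dev/test accuracies recorded during one fine-tuning run.
checkpoints = [
    {"step": 1000, "en_dev": 0.90, "tgt_dev": 0.70, "tgt_test": 0.69},
    {"step": 2000, "en_dev": 0.93, "tgt_dev": 0.62, "tgt_test": 0.61},
    {"step": 3000, "en_dev": 0.94, "tgt_dev": 0.55, "tgt_test": 0.54},  # best English dev, poor transfer
    {"step": 4000, "en_dev": 0.92, "tgt_dev": 0.74, "tgt_test": 0.73},  # best target dev
]

# Standard practice: select the checkpoint with the best English dev accuracy.
standard = max(checkpoints, key=lambda c: c["en_dev"])

# Oracle (the recommended upper bound): still fine-tune on English data,
# but select the checkpoint with the target-language dev set.
oracle = max(checkpoints, key=lambda c: c["tgt_dev"])

print(f"English-dev selection -> step {standard['step']}, target test {standard['tgt_test']:.2f}")
print(f"Oracle selection      -> step {oracle['step']}, target test {oracle['tgt_test']:.2f}")
```

In this toy run, English dev accuracy is anti-correlated with target accuracy, so the standard selection lands on an arbitrarily bad checkpoint while the oracle avoids it; this is exactly the inconsistency the paper argues reporting the oracle score would expose.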
