Paper Title

DEMETR: Diagnosing Evaluation Metrics for Translation

Authors

Marzena Karpinska, Nishant Raj, Katherine Thai, Yixiao Song, Ankita Gupta, Mohit Iyyer

Abstract

While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence or absence of certain words. The operations of newer learned metrics (e.g., BLEURT, COMET), which leverage pretrained language models to achieve higher correlations with human quality judgments than BLEU, are opaque in comparison. In this paper, we shed light on the behavior of these learned metrics by creating DEMETR, a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories. All perturbations were carefully designed to form minimal pairs with the actual translation (i.e., differ in only one aspect). We find that learned metrics perform substantially better than string-based metrics on DEMETR. Additionally, learned metrics differ in their sensitivity to various phenomena (e.g., BERTScore is sensitive to untranslated words but relatively insensitive to gender manipulation, while COMET is much more sensitive to word repetition than to aspectual changes). We publicly release DEMETR to spur more informed future development of machine translation evaluation metrics.
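To make the minimal-pair setup concrete, the following is a small hypothetical sketch (not taken from the paper or its released data) of how one might probe a string-based metric's sensitivity to a single perturbation, here word repetition, applied to an otherwise plausible candidate translation. The sentences are invented for illustration, and the sacrebleu package is assumed to be installed.

```python
# Hypothetical sketch of DEMETR-style sensitivity probing (not the paper's code).
# Assumes the `sacrebleu` package is installed; the sentences are invented examples.
import sacrebleu

reference = "The committee approved the proposal yesterday."
actual    = "Yesterday the committee approved the proposal."          # plausible translation
perturbed = "Yesterday the committee approved the proposal proposal." # minimal pair: one word repeated

for label, hypothesis in [("actual", actual), ("perturbed", perturbed)]:
    bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score
    print(f"{label:9s} BLEU = {bleu:.1f}")

# A metric that is sensitive to this perturbation should score the perturbed
# candidate clearly lower than the actual one; DEMETR measures this kind of drop
# across 35 perturbation types for both string-based and learned metrics
# (e.g., BERTScore, COMET, BLEURT), which could be slotted into the same loop.
```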
