Paper title
CoCoA-MT: A Dataset and Benchmark for Contrastive Controlled MT with Application to Formality
Paper authors
Paper abstract
The machine translation (MT) task is typically formulated as that of returning a single translation for an input segment. However, in many cases, multiple different translations are valid and the appropriate translation may depend on the intended target audience, characteristics of the speaker, or even the relationship between speakers. Specific problems arise when dealing with honorifics, particularly translating from English into languages with formality markers. For example, the sentence "Are you sure?" can be translated in German as "Sind Sie sich sicher?" (formal register) or "Bist du dir sicher?" (informal). Using wrong or inconsistent tone may be perceived as inappropriate or jarring for users of certain cultures and demographics. This work addresses the problem of learning to control target language attributes, in this case formality, from a small amount of labeled contrastive data. We introduce an annotated dataset (CoCoA-MT) and an associated evaluation metric for training and evaluating formality-controlled MT models for six diverse target languages. We show that we can train formality-controlled models by fine-tuning on labeled contrastive data, achieving high accuracy (82% in-domain and 73% out-of-domain) while maintaining overall quality.
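To make the idea of contrastive formality data and a formality accuracy score concrete, here is a minimal Python sketch. It is not the evaluation metric defined in the paper: the `ContrastivePair` record, the token-overlap heuristic, and the function names are all assumptions introduced for illustration. The only data used is the German example from the abstract.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class ContrastivePair:
    """Hypothetical record: one source segment with two contrastive references."""
    source: str
    formal: str
    informal: str


def tokens(text: str) -> Set[str]:
    # Lowercased word tokens. Note: lowercasing conflates German "Sie"
    # (formal "you") with "sie" ("she"/"they"), a known limit of this sketch.
    return set(re.findall(r"\w+", text.lower()))


def classify_formality(hypothesis: str, pair: ContrastivePair) -> Optional[str]:
    """Label a hypothesis by the reference-specific tokens it contains.

    Assumed heuristic: tokens found in only one of the two references are
    treated as formality markers. Returns None when the hypothesis has
    markers from neither reference or from both.
    """
    formal_only = tokens(pair.formal) - tokens(pair.informal)
    informal_only = tokens(pair.informal) - tokens(pair.formal)
    hyp = tokens(hypothesis)
    has_formal = bool(hyp & formal_only)
    has_informal = bool(hyp & informal_only)
    if has_formal and not has_informal:
        return "formal"
    if has_informal and not has_formal:
        return "informal"
    return None


def formality_accuracy(
    examples: Sequence[Tuple[str, ContrastivePair, str]]
) -> float:
    """Fraction of hypotheses whose detected label equals the requested one.

    Each example is (hypothesis, contrastive pair, requested formality).
    """
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        classify_formality(hyp, pair) == requested
        for hyp, pair, requested in examples
    )
    return correct / len(examples)


# The example pair from the abstract.
pair = ContrastivePair(
    source="Are you sure?",
    formal="Sind Sie sich sicher?",
    informal="Bist du dir sicher?",
)
```

Under these assumptions, `classify_formality("Sind Sie sicher?", pair)` yields `"formal"`, and scoring one formal and one informal hypothesis against a request for the formal register gives an accuracy of 0.5. A real evaluation would need to handle ambiguous hypotheses and morphological variation more carefully than plain token overlap does.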