Paper Title
On Vision Features in Multimodal Machine Translation
Paper Authors
Paper Abstract
Previous work on multimodal machine translation (MMT) has focused on how to incorporate vision features into translation, but little attention has been paid to the quality of the vision models themselves. In this work, we investigate the impact of vision models on MMT. Given that Transformers are becoming popular in computer vision, we experiment with various strong models (such as the Vision Transformer) and enhanced features (such as object detection and image captioning). We develop a selective attention model to study the patch-level contribution of an image to MMT. On detailed probing tasks, we find that stronger vision models are helpful for learning translation from the visual modality. Our results also suggest the need to carefully examine MMT models, especially when current benchmarks are small-scale and biased. Our code is available at \url{https://github.com/libeineu/fairseq_mmt}.
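To make the selective-attention idea concrete, below is a minimal PyTorch sketch of one plausible form of such a layer: text encoder states attend over image patch features, and a learned gate controls how much visual context is mixed back into each text position. The class name, dimensions, and gating scheme are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class SelectiveAttention(nn.Module):
    """Sketch of a selective-attention fusion layer for MMT.

    Text states attend over image patch features (e.g., ViT patch
    embeddings); a learned gate then decides, per text position, how
    much visual context to mix in. Hypothetical reconstruction, not
    the paper's official code.
    """

    def __init__(self, d_text: int, d_img: int):
        super().__init__()
        self.q = nn.Linear(d_text, d_text)
        self.k = nn.Linear(d_img, d_text)
        self.v = nn.Linear(d_img, d_text)
        self.gate = nn.Linear(2 * d_text, 1)

    def forward(self, h_text: torch.Tensor, h_img: torch.Tensor) -> torch.Tensor:
        # h_text: (batch, src_len, d_text); h_img: (batch, n_patches, d_img)
        q, k, v = self.q(h_text), self.k(h_img), self.v(h_img)
        scores = torch.matmul(q, k.transpose(-1, -2)) / q.size(-1) ** 0.5
        attn = scores.softmax(dim=-1)        # patch-level attention weights
        h_attn = torch.matmul(attn, v)       # text-conditioned visual context
        lam = torch.sigmoid(self.gate(torch.cat([h_text, h_attn], dim=-1)))
        return (1 - lam) * h_text + lam * h_attn  # gated fusion


# Usage: fuse Transformer encoder states with ViT patch features.
fusion = SelectiveAttention(d_text=512, d_img=768)
h_text = torch.randn(2, 20, 512)   # source-sentence encoder output
h_img = torch.randn(2, 49, 768)    # e.g., 7x7 grid of ViT patch embeddings
out = fusion(h_text, h_img)        # (2, 20, 512)
```

Inspecting `attn` in a layer like this is one way to probe which image patches contribute to each source token, which matches the patch-level analysis the abstract describes.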