Paper Title

Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization

Authors

Yunlong Liang, Fandong Meng, Jinan Xu, Jiaan Wang, Yufeng Chen, Jie Zhou

Abstract

Multimodal abstractive summarization (MAS) aims to produce a concise summary given multimodal data (text and vision). Existing studies mainly focus on how to effectively use the visual features from the perspective of the article, and have achieved impressive success on the high-resource English dataset. However, less attention has been paid to the visual features from the perspective of the summary, which may limit model performance, especially in low- and zero-resource scenarios. In this paper, we propose to improve summary quality through summary-oriented visual features. To this end, we devise two auxiliary tasks: a vision-to-summary task and a masked image modeling task. Together with the main summarization task, we optimize the MAS model via the training objectives of all these tasks. In this way, the MAS model can be enhanced by capturing summary-oriented visual features, thereby yielding more accurate summaries. Experiments on 44 languages, covering mid-high-, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach, which achieves state-of-the-art performance under all scenarios. Additionally, we will contribute a large-scale multilingual multimodal abstractive summarization (MM-Sum) dataset.
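
To illustrate the joint training described above, the sketch below combines the main summarization loss with the two auxiliary losses. This is a minimal sketch assuming a simple weighted sum; the names loss_v2s, loss_mim and the weighting coefficients are hypothetical placeholders, not the paper's actual formulation.

```python
import torch

def joint_loss(loss_sum: torch.Tensor,
               loss_v2s: torch.Tensor,
               loss_mim: torch.Tensor,
               lambda_v2s: float = 1.0,
               lambda_mim: float = 1.0) -> torch.Tensor:
    # Main summarization loss plus the two summary-oriented vision
    # losses (vision-to-summary and masked image modeling).
    # The lambda weights are assumed hyperparameters for illustration.
    return loss_sum + lambda_v2s * loss_v2s + lambda_mim * loss_mim
```

Under this reading, gradients from the two auxiliary objectives push the vision encoder toward features that are predictive of the summary rather than of the full article.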
