Object Relational Graph with Teacher-Recommended Learning for Video Captioning

Paper title

Object Relational Graph with Teacher-Recommended Learning for Video Captioning

Authors

Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, Zhengjun Zha

Abstract

Taking full advantage of the information from both vision and language is critical for the video captioning task. Existing models lack adequate visual representation because they neglect the interactions between objects, and they lack sufficient training for content-related words because of the long-tailed problem. In this paper, we propose a complete video captioning system including both a novel model and an effective training strategy. Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich the visual representation. Meanwhile, we design a teacher-recommended learning (TRL) method that makes full use of a successful external language model (ELM) to integrate abundant linguistic knowledge into the caption model. The ELM generates more semantically similar word proposals, which extend the ground-truth words used for training, to deal with the long-tailed problem. Experimental evaluations on three benchmarks (MSVD, MSR-VTT and VATEX) show that the proposed ORG-TRL system achieves state-of-the-art performance. Extensive ablation studies and visualizations illustrate the effectiveness of our system.

Paper title