Paper Title
Boosting Video Representation Learning with Multi-Faceted Integration
Paper Authors
Paper Abstract
Video content is multifaceted, consisting of objects, scenes, interactions or actions. The existing datasets mostly label only one of the facets for model training, resulting in a video representation that is biased toward only one facet depending on the training dataset. There is no study yet on how to learn a video representation from multifaceted labels, or on whether multifaceted information is helpful for video representation learning. In this paper, we propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content. Technically, MUFI formulates the problem as visual-semantic embedding learning, which explicitly maps video representation into a rich semantic embedding space, and jointly optimizes video representation from two perspectives. One is to capitalize on the intra-facet supervision between each video and its own label descriptions, and the other is to predict the "semantic representation" of each video from the facets of other datasets as the inter-facet supervision. Extensive experiments demonstrate that learning 3D CNN via our MUFI framework on a union of four large-scale video datasets plus two image datasets leads to superior capability of video representation. The pre-learnt 3D CNN with MUFI also shows clear improvements over other approaches on several downstream video applications. More remarkably, MUFI achieves 98.1%/80.9% on UCF101/HMDB51 for action recognition and 101.5% in terms of CIDEr-D score on MSVD for video captioning.
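To make the two supervision signals in the abstract concrete, below is a minimal sketch (not the authors' released code) of how an intra-facet ranking term and an inter-facet term on a shared embedding space could be combined. The function names, the hinge and cosine forms, the margin, and the weight `alpha` are illustrative assumptions, not MUFI's exact objective.

```python
import torch
import torch.nn.functional as F

def intra_facet_loss(video_emb, label_emb, margin=0.2):
    # Hinge-based ranking loss between each video and its own label
    # description -- a standard visual-semantic embedding objective.
    v = F.normalize(video_emb, dim=-1)   # (B, D) video embeddings
    t = F.normalize(label_emb, dim=-1)   # (B, D) label/text embeddings
    sim = v @ t.t()                      # (B, B) cosine similarity matrix
    pos = sim.diag().unsqueeze(1)        # similarity of matched video-label pairs
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    cost = (margin + sim - pos).clamp(min=0.0) * mask  # penalize mismatched pairs
    return cost.mean()

def inter_facet_loss(video_emb, predicted_sem_emb):
    # Pull the video embedding toward the "semantic representation"
    # predicted from another dataset's facet (inter-facet supervision).
    v = F.normalize(video_emb, dim=-1)
    s = F.normalize(predicted_sem_emb, dim=-1)
    return (1.0 - (v * s).sum(dim=-1)).mean()  # mean (1 - cosine similarity)

def mufi_style_loss(video_emb, label_emb, predicted_sem_emb, alpha=1.0):
    # Joint objective combining the two perspectives described in the abstract.
    return intra_facet_loss(video_emb, label_emb) + \
           alpha * inter_facet_loss(video_emb, predicted_sem_emb)
```

In this reading, the inter-facet target would come from a module that predicts, for each video, the semantic embedding of a facet that is not annotated in the video's own source dataset; the actual architecture and loss weighting are specified in the paper itself.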