Paper Title

Exploring Motion and Appearance Information for Temporal Sentence Grounding

Paper Authors

Daizong Liu, Xiaoye Qu, Pan Zhou, Yang Liu

Paper Abstract

This paper addresses temporal sentence grounding. Previous works typically solve this task by learning frame-level video features and aligning them with the textual information. A major limitation of these works is that, due to frame-level feature extraction, they fail to distinguish ambiguous video frames with only subtle appearance differences. Recently, a few methods have adopted Faster R-CNN to extract detailed object features in each frame to differentiate fine-grained appearance similarities. However, the object-level features extracted by Faster R-CNN lack motion analysis, since the object detection model performs no temporal modeling. To address this issue, we propose a novel Motion-Appearance Reasoning Network (MARN), which incorporates both motion-aware and appearance-aware object features to better reason about object relations for modeling the activity among successive frames. Specifically, we first introduce two individual video encoders to embed the video into corresponding motion-oriented and appearance-oriented object representations. Then, we develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations, respectively. Finally, the motion and appearance information from the two branches is associated to generate more representative features for final grounding. Extensive experiments on two challenging datasets (Charades-STA and TACoS) show that our proposed MARN outperforms previous state-of-the-art methods by a large margin.
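The abstract describes a two-branch pipeline: object features are embedded into motion-oriented and appearance-oriented representations, each branch reasons over object relations, and the two branches are associated for grounding. Below is a minimal, hypothetical PyTorch sketch of that idea; the class name MotionAppearanceSketch, the use of self-attention as the relation-reasoning step, the concatenation-based fusion, and all dimensions are illustrative assumptions, not the authors' MARN implementation.

import torch
import torch.nn as nn

class MotionAppearanceSketch(nn.Module):
    """Two-branch sketch: separate motion/appearance encoders, separate
    relation-reasoning branches, then association of both outputs."""

    def __init__(self, obj_dim=1024, hid_dim=512, n_heads=8):
        super().__init__()
        # Hypothetical projections into motion- and appearance-oriented spaces.
        self.motion_enc = nn.Linear(obj_dim, hid_dim)
        self.appear_enc = nn.Linear(obj_dim, hid_dim)
        # Self-attention over objects as a stand-in for motion-guided /
        # appearance-guided object-relation reasoning.
        self.motion_branch = nn.MultiheadAttention(hid_dim, n_heads, batch_first=True)
        self.appear_branch = nn.MultiheadAttention(hid_dim, n_heads, batch_first=True)
        # Associate both branches into one representation used for grounding.
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)

    def forward(self, motion_obj_feats, appear_obj_feats):
        # Both inputs: (batch, num_objects, obj_dim)
        m = self.motion_enc(motion_obj_feats)
        a = self.appear_enc(appear_obj_feats)
        m_rel, _ = self.motion_branch(m, m, m)  # motion-guided object relations
        a_rel, _ = self.appear_branch(a, a, a)  # appearance-guided object relations
        fused = self.fuse(torch.cat([m_rel, a_rel], dim=-1))
        return fused  # per-object features to be matched against the query

if __name__ == "__main__":
    model = MotionAppearanceSketch()
    motion_feats = torch.randn(2, 10, 1024)  # e.g. object features from a motion stream
    appear_feats = torch.randn(2, 10, 1024)  # e.g. Faster R-CNN object features
    print(model(motion_feats, appear_feats).shape)  # torch.Size([2, 10, 512])

In this sketch the two branches share no parameters, mirroring the paper's claim that motion and appearance cues are modeled separately before being associated; how MARN actually fuses them and matches against the sentence query is not specified by the abstract.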
