Paper Title
Gaussian Kernel-based Cross Modal Network for Spatio-Temporal Video Grounding
Paper Authors
Paper Abstract
Spatio-Temporal Video Grounding (STVG) is a challenging task that aims to semantically localize the spatio-temporal tube of the object of interest according to a natural language query. Most previous works not only rely heavily on anchor boxes extracted by Faster R-CNN, but also simply treat the video as a series of individual frames, thus lacking temporal modeling. Instead, in this paper, we are the first to propose an anchor-free framework for STVG, called Gaussian Kernel-based Cross Modal Network (GKCMN). Specifically, we utilize Gaussian kernel-based heatmaps learned for each video frame to locate the query-related object. A mixed serial and parallel connection network is further developed to leverage both spatial and temporal relations among frames for better grounding. Experimental results on the VidSTG dataset demonstrate the effectiveness of our proposed GKCMN.
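The anchor-free design replaces box regression over region proposals with per-frame heatmaps whose peak marks the center of the query-related object. The abstract does not give the exact formulation, so the NumPy sketch below only illustrates the common way such a Gaussian-kernel target heatmap is constructed; the function name, heatmap resolution, and sigma value are illustrative assumptions, not GKCMN's actual parameters.

```python
import numpy as np

def gaussian_heatmap(height, width, center, sigma):
    """Render a 2D Gaussian kernel centered on an object location.

    The peak value is 1.0 at `center` and decays with distance,
    giving a soft target for anchor-free center localization.
    (Illustrative sketch; not the paper's exact formulation.)
    """
    cx, cy = center
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    dist_sq = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2))

# Example: a 96x96 heatmap for an object centered at (40, 25) on the
# feature map; in practice sigma is typically tied to the box size.
heatmap = gaussian_heatmap(96, 96, center=(40, 25), sigma=6.0)
print(heatmap.shape, heatmap.max())  # (96, 96) 1.0
```

At inference, the predicted heatmap's argmax would give the object center in each frame, and linking these per-frame detections over time yields the spatio-temporal tube.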