Paper Title
Unsupervised Temporal Video Grounding with Deep Semantic Clustering
Paper Authors
Paper Abstract
Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Although existing works have achieved respectable results on this task, they rely heavily on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this is the first work to address TVG in an unsupervised setting. Since no paired supervision is available, we propose a novel Deep Semantic Clustering Network (DSCNet) that leverages all the semantic information in the whole query set to compose the possible activities in each video for grounding. Specifically, we first develop a language semantic mining module, which extracts implicit semantic features from the whole query set. These language semantic features then serve as guidance for composing the activities in each video via a video-based semantic aggregation module. Finally, we utilize a foreground attention branch to filter out redundant background activities and refine the grounding results. To validate the effectiveness of DSCNet, we conduct experiments on the ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance and even outperforms most weakly-supervised approaches.
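The abstract only names the three modules, so the following is a minimal, hypothetical PyTorch sketch of how such a pipeline could be wired together. All class names, feature dimensions, the soft k-means prototype update in the mining module, and the attention-based aggregation are illustrative assumptions made for this sketch; they are not the authors' actual implementation.

```python
# Hypothetical sketch of a DSCNet-style pipeline (not the paper's code).
# Assumptions: 256-d query features, 1024-d clip features, 8 semantic
# prototypes, and one soft k-means step as the "mining" operation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageSemanticMining(nn.Module):
    """Distill K implicit semantic prototypes from the whole query set."""

    def __init__(self, query_dim: int, num_prototypes: int):
        super().__init__()
        # Learnable prototypes stand in for the mined semantic clusters.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, query_dim))

    def forward(self, query_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: (num_queries, query_dim), encoded offline.
        # Soft-assign every query to the prototypes, then update each
        # prototype as the assignment-weighted mean (one soft k-means step).
        assign = F.softmax(query_feats @ self.prototypes.t(), dim=-1)  # (Q, K)
        mined = assign.t() @ query_feats / (assign.sum(0).unsqueeze(1) + 1e-6)
        return mined  # (K, query_dim) mined semantic features


class VideoSemanticAggregation(nn.Module):
    """Compose clip features into activities guided by the mined semantics."""

    def __init__(self, clip_dim: int, query_dim: int):
        super().__init__()
        self.proj = nn.Linear(clip_dim, query_dim)
        self.attn = nn.MultiheadAttention(query_dim, num_heads=4, batch_first=True)

    def forward(self, clip_feats: torch.Tensor, semantics: torch.Tensor) -> torch.Tensor:
        # clip_feats: (B, T, clip_dim); semantics: (K, query_dim)
        clips = self.proj(clip_feats)
        sem = semantics.unsqueeze(0).expand(clips.size(0), -1, -1)
        # Each clip attends to the mined semantics to form activity-aware features.
        fused, _ = self.attn(clips, sem, sem)
        return fused  # (B, T, query_dim)


class ForegroundAttention(nn.Module):
    """Score each clip as foreground vs. background to refine grounding."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.scorer(fused)).squeeze(-1)  # (B, T)


if __name__ == "__main__":
    queries = torch.randn(500, 256)   # whole query set, unpaired with videos
    video = torch.randn(2, 64, 1024)  # 2 videos x 64 clips x C3D-like features
    mining = LanguageSemanticMining(256, num_prototypes=8)
    aggregate = VideoSemanticAggregation(1024, 256)
    foreground = ForegroundAttention(256)
    fg_scores = foreground(aggregate(video, mining(queries)))
    print(fg_scores.shape)  # torch.Size([2, 64]) per-clip foreground scores
```

The one property the sketch does capture faithfully from the abstract is the unsupervised setup: no query is ever paired with a specific video, since the mining module consumes the query set as a whole and each video interacts only with the mined semantic prototypes.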