Paper Title

Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization

Paper Authors

Kun Xia, Le Wang, Sanping Zhou, Nanning Zheng, Wei Tang

Paper Abstract

The main challenge of Temporal Action Localization is to retrieve subtle human actions from various co-occurring ingredients, e.g., context and background, in an untrimmed video. While prior approaches have achieved substantial progress through devising advanced action detectors, they still suffer from these co-occurring ingredients which often dominate the actual action content in videos. In this paper, we explore two orthogonal but complementary aspects of a video snippet, i.e., the action features and the co-occurrence features. Especially, we develop a novel auxiliary task by decoupling these two types of features within a video snippet and recombining them to generate a new feature representation with more salient action information for accurate action localization. We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features, and then synthesizes a new action-dominated video representation. Extensive experimental results and ablation studies on THUMOS14 and ActivityNet v1.3 demonstrate that our new representation, combined with a simple action detector, can significantly improve the action localization performance.
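The abstract describes the approach only at a high level, so the snippet below is a minimal PyTorch sketch of the decouple-and-recombine idea it outlines: factorize snippet features into an action stream and a co-occurrence stream, regularize the latter, and recombine them into an action-dominated representation. The module name, layer choices, feature dimensions, and the simple scaling used to suppress the co-occurrence stream are assumptions for illustration, not the paper's actual RefactorNet design.

```python
import torch
import torch.nn as nn


class FeatureRefactorSketch(nn.Module):
    """Hypothetical decouple-and-recombine module; not the paper's exact RefactorNet."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Two parallel encoders factorize each snippet feature into
        # action content and co-occurrence (context/background) content.
        self.action_encoder = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, padding=1), nn.ReLU())
        self.cooccur_encoder = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, padding=1), nn.ReLU())
        # Recombination head synthesizes a new video representation
        # from the two factorized streams.
        self.recombine = nn.Conv1d(2 * hidden_dim, feat_dim, kernel_size=1)

    def forward(self, snippet_feats):
        # snippet_feats: (batch, feat_dim, num_snippets), e.g. pre-extracted I3D/TSN features.
        f_act = self.action_encoder(snippet_feats)
        f_co = self.cooccur_encoder(snippet_feats)
        # Placeholder for regularizing the co-occurrence stream so that the
        # synthesized feature is dominated by action information.
        f_co = 0.1 * f_co
        refined = self.recombine(torch.cat([f_act, f_co], dim=1))
        return refined  # fed to a downstream action detector


if __name__ == "__main__":
    x = torch.randn(2, 2048, 100)        # 2 videos, 100 snippets each
    refined = FeatureRefactorSketch()(x)
    print(refined.shape)                  # torch.Size([2, 2048, 100])
```

In this sketch the refactored features keep the original dimensionality, so any existing temporal action detector could consume them unchanged, which mirrors the abstract's claim that the new representation can be combined with a simple action detector.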
