Equivalent Classification Mapping for Weakly Supervised Temporal Action Localization

Paper Title

Equivalent Classification Mapping for Weakly Supervised Temporal Action Localization

Authors

Tao Zhao, Junwei Han, Le Yang, Dingwen Zhang

Abstract

Weakly supervised temporal action localization has recently emerged as a widely studied topic. Existing methods can be categorized into two localization-by-classification pipelines: the pre-classification pipeline and the post-classification pipeline. The pre-classification pipeline first performs classification on each video snippet and then aggregates the snippet-level classification scores to obtain the video-level classification score. In contrast, the post-classification pipeline first aggregates the snippet-level features and then predicts the video-level classification score from the aggregated feature. Although the classifiers in these two pipelines are used in different ways, the role they play is exactly the same: to classify the given features and identify the corresponding action categories. An ideal classifier should therefore make both pipelines work. This inspires us to learn the two pipelines simultaneously in a unified framework to obtain an effective classifier. Specifically, in the proposed learning framework, we implement two parallel network streams that model the two localization-by-classification pipelines simultaneously, and we make the two streams share the same classifier. This realizes the novel Equivalent Classification Mapping (ECM) mechanism. Moreover, we observe that an ideal classifier may possess two characteristics: 1) the frame-level classification scores obtained from the pre-classification stream and the feature aggregation weights in the post-classification stream should be consistent; 2) the classification results of the two streams should be identical. Based on these two characteristics, we further introduce a weight-transition module and an equivalent training strategy into the proposed learning framework, which help to thoroughly mine the equivalence mechanism.
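
The two pipelines and the shared classifier described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the linear classifier, the softmax aggregation weights, the toy feature values, and all function and variable names below are assumptions made purely for illustration. Under these assumptions, when both streams share one linear classifier and use the same aggregation weights (which sum to 1), the video-level scores of the two streams coincide.

```python
import math

def linear_classifier(feature, weights, bias):
    """Score one feature vector against every class: W x + b (assumed linear)."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, bias)]

def softmax(values):
    """Normalize arbitrary real values into weights that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def pre_classification(snippets, agg_weights, weights, bias):
    """Classify every snippet first, then aggregate the snippet-level scores."""
    snippet_scores = [linear_classifier(x, weights, bias) for x in snippets]
    video_scores = [sum(a * s[c] for a, s in zip(agg_weights, snippet_scores))
                    for c in range(len(bias))]
    return video_scores, snippet_scores

def post_classification(snippets, agg_weights, weights, bias):
    """Aggregate snippet features first, then classify the pooled feature."""
    dim = len(snippets[0])
    pooled = [sum(a * x[d] for a, x in zip(agg_weights, snippets))
              for d in range(dim)]
    return linear_classifier(pooled, weights, bias)

# Hypothetical toy video: 4 snippets, 3-dimensional features, 2 action classes.
snippets = [[0.2, 1.0, -0.5], [1.5, 0.1, 0.3], [-0.4, 0.8, 0.9], [0.0, -1.2, 0.6]]
W = [[0.5, -0.3, 0.8], [-0.6, 0.9, 0.1]]   # shared classifier weights (assumed)
b = [0.1, -0.2]                            # shared classifier bias (assumed)
agg_weights = softmax([0.3, 1.2, -0.7, 0.5])  # stand-in for learned weights

pre_scores, snippet_scores = pre_classification(snippets, agg_weights, W, b)
post_scores = post_classification(snippets, agg_weights, W, b)

# Largest per-class difference between the two streams' video-level scores.
max_gap = max(abs(p - q) for p, q in zip(pre_scores, post_scores))
print(pre_scores, post_scores, max_gap)
```

In this linear toy case the agreement follows directly from linearity. With a nonlinear classifier or different aggregation rules in each stream, the two outputs would generally differ, which is presumably where consistency constraints such as the ones named in the abstract become relevant.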
