Paper Title

Representing Videos as Discriminative Sub-graphs for Action Recognition

Authors

Dong Li, Zhaofan Qiu, Yingwei Pan, Ting Yao, Houqiang Li, Tao Mei

Abstract

Human actions are typically of combinatorial structures or patterns, i.e., subjects, objects, plus the spatio-temporal interactions in between. Discovering such structures is therefore a rewarding way to reason about the dynamics of interactions and recognize the actions. In this paper, we introduce a new design of sub-graphs to represent and encode the discriminative patterns of each action in videos. Specifically, we present the MUlti-scale Sub-graph LEarning (MUSLE) framework, which builds space-time graphs and clusters them into compact sub-graphs at each scale with respect to the number of nodes. Technically, MUSLE produces 3D bounding boxes, i.e., tubelets, in each video clip as graph nodes and takes the dense connectivity between tubelets as graph edges. For each action category, we perform online clustering to decompose the graph into sub-graphs at each scale by learning a Gaussian Mixture Layer, and select the discriminative sub-graphs as action prototypes for recognition. Extensive experiments are conducted on the Something-Something V1 & V2 and Kinetics-400 datasets, and superior results are reported compared to state-of-the-art methods. More remarkably, MUSLE achieves the best accuracy reported to date, 65.0%, on the Something-Something V2 validation set.
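The clustering step described in the abstract can be illustrated with a toy sketch: fit a Gaussian mixture by EM over per-node (tubelet) feature vectors, then assign each graph node to its highest-responsibility component, which partitions the graph into sub-graphs. This is a simplified stand-in, not the paper's learnable Gaussian Mixture Layer; the spherical covariance, the feature shapes, and the deterministic farthest-point initialization are all assumptions made for the demo.

```python
import numpy as np

def fit_gmm(X, k, iters=50):
    """Minimal EM for a spherical Gaussian mixture over node features X (n, d).
    Returns component means (k, d) and soft assignments / responsibilities (n, k).
    Initialization: greedy farthest-point picks (assumption, for determinism)."""
    n, d = X.shape
    means = [X[0]]
    for _ in range(k - 1):
        # distance of every point to its nearest chosen mean; pick the farthest
        d2 = np.min([((X - m) ** 2).sum(1) for m in means], axis=0)
        means.append(X[d2.argmax()])
    means = np.array(means)
    var = np.full(k, X.var() + 1e-6)   # one scalar variance per component
    pi = np.full(k, 1.0 / k)           # mixing weights
    for _ in range(iters):
        # E-step: responsibilities from component log-densities
        d2 = ((X[:, None, :] - means[None]) ** 2).sum(-1)          # (n, k)
        log_p = np.log(pi) - 0.5 * (d * np.log(2 * np.pi * var) + d2 / var)
        log_p -= log_p.max(1, keepdims=True)                       # stabilize
        r = np.exp(log_p)
        r /= r.sum(1, keepdims=True)
        # M-step: re-estimate means, variances, and mixing weights
        nk = r.sum(0) + 1e-9
        means = (r.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - means[None]) ** 2).sum(-1)
        var = (r * d2).sum(0) / (nk * d) + 1e-6
        pi = nk / n
    return means, r

# Toy "tubelet features": two well-separated groups of 8 graph nodes each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (8, 4)),
               rng.normal(5.0, 0.1, (8, 4))])
means, resp = fit_gmm(X, k=2)
sub_graph = resp.argmax(1)   # hard sub-graph label per node
```

In this sketch the hard `argmax` assignment splits the node set into `k` sub-graphs; a discriminative-selection step over those sub-graphs, as in the paper, would then score each cluster per action category.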
