Paper Title
Approximated Bilinear Modules for Temporal Modeling
Paper Authors
Paper Abstract
We consider two less-emphasized temporal properties of video: 1. Temporal cues are fine-grained; 2. Temporal modeling needs reasoning. To tackle both problems at once, we exploit approximated bilinear modules (ABMs) for temporal modeling. Two main points make the modules effective: two-layer MLPs can be seen as a constrained approximation of bilinear operations, and thus can be used to construct deep ABMs in existing CNNs while reusing pretrained parameters; and frame features can be divided into static and dynamic parts because of visual repetition in adjacent frames, which makes temporal modeling more efficient. Multiple ABM variants and implementations are investigated, ranging from high performance to high efficiency. Specifically, we show how two-layer subnets in CNNs can be converted into temporal bilinear modules by adding an auxiliary branch. In addition, we introduce snippet sampling and shifting inference to boost sparse-frame video classification performance. Extensive ablation studies demonstrate the effectiveness of the proposed techniques. Our models outperform most state-of-the-art methods on the Something-Something v1 and v2 datasets without Kinetics pretraining, and are also competitive on other YouTube-like action recognition datasets. Our code is available at https://github.com/zhuxinqimac/abm-pytorch.
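The "approximated bilinear" view can be illustrated with a standard low-rank factorized bilinear interaction between adjacent frame features. The sketch below is illustrative only: the dimensions, variable names, and the Hadamard-product factorization are assumptions for exposition, not the authors' implementation (see the linked repository for that).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: feature dim d, factorization rank r, output dim o.
d, r, o = 8, 4, 3
U = rng.standard_normal((d, r))   # projects the current frame's features
V = rng.standard_normal((d, r))   # projects the next frame's features
P = rng.standard_normal((r, o))   # pools the rank dimension into outputs

def temporal_bilinear(x_t, x_t1):
    """Low-rank bilinear interaction between adjacent frames:
    z_k = x_t^T W_k x_t1, with W_k = U @ diag(P[:, k]) @ V.T."""
    return ((U.T @ x_t) * (V.T @ x_t1)) @ P

x_t  = rng.standard_normal(d)   # frame t
x_t1 = rng.standard_normal(d)   # frame t+1
z = temporal_bilinear(x_t, x_t1)

# Sanity check: the factorized form matches the explicit bilinear
# form for output unit k = 0.
W0 = U @ np.diag(P[:, 0]) @ V.T
assert np.allclose(z[0], x_t @ W0 @ x_t1)
```

Because each `W_k` is restricted to rank at most `r`, the two projection layers plus an element-wise product act as a constrained approximation of a full bilinear operation, which is the structural analogy the abstract draws to two-layer subnets in a CNN.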