Paper Title

Dynamic Temporal Filtering in Video Models

Paper Authors

Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei

Paper Abstract

Video temporal dynamics is conventionally modeled with a 3D spatial-temporal kernel or its factorized version, comprised of a 2D spatial kernel and a 1D temporal kernel. The modeling power, nevertheless, is limited by the fixed window size and static weights of a kernel along the temporal dimension. The pre-determined kernel size severely limits the temporal receptive field, and the fixed weights treat every spatial location across frames equally, resulting in a sub-optimal solution for long-range temporal modeling in natural scenes. In this paper, we present a new recipe for temporal feature learning, namely the Dynamic Temporal Filter (DTF), which performs spatial-aware temporal modeling in the frequency domain with a large temporal receptive field. Specifically, DTF dynamically learns a specialized frequency filter for every spatial location to model its long-range temporal dynamics. Meanwhile, the temporal feature of each spatial location is transformed into a frequency spectrum via a 1D Fast Fourier Transform (FFT); the spectrum is modulated by the learnt frequency filter and then transformed back to the temporal domain with the inverse FFT. In addition, to facilitate the learning of the frequency filter in DTF, we perform frame-wise aggregation that enhances the primary temporal feature with its temporal neighbors via inter-frame correlation. The DTF block can be plugged into both ConvNets and Transformers, yielding DTF-Net and DTF-Transformer. Extensive experiments conducted on three datasets demonstrate the superiority of our proposals. More remarkably, DTF-Transformer achieves an accuracy of 83.5% on the Kinetics-400 dataset. Source code is available at https://github.com/FuchenUSTC/DTF.
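The abstract describes the core operation concretely enough to sketch. Below is a minimal PyTorch sketch of the frequency-domain filtering step, assuming a (batch, channel, time, height, width) feature layout. The module name `DynamicTemporalFilter`, the two-layer filter generator, and the choice of the temporally averaged feature as its input are assumptions inferred from the abstract, not the authors' implementation (see the linked repository for that); the frame-wise aggregation step is omitted.

```python
import torch
import torch.nn as nn


class DynamicTemporalFilter(nn.Module):
    """Illustrative sketch of spatial-aware temporal filtering in the
    frequency domain: per spatial location, the temporal feature is
    moved to the frequency domain with a 1D FFT, modulated by a
    dynamically predicted frequency filter, and transformed back with
    the inverse FFT. Hypothetical design inferred from the abstract."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.num_freqs = num_frames // 2 + 1  # rfft spectrum length
        # Predict a complex filter (real + imaginary parts) for every
        # spatial location from its temporally averaged feature --
        # an assumed generator, not necessarily the paper's design.
        self.filter_gen = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, 2 * self.num_freqs),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # Per-location context: average over time -> (B, H, W, C).
        ctx = x.mean(dim=2).permute(0, 2, 3, 1)
        filt = self.filter_gen(ctx)  # (B, H, W, 2 * num_freqs)
        filt = torch.view_as_complex(
            filt.reshape(b, h, w, self.num_freqs, 2).contiguous()
        )  # (B, H, W, num_freqs), complex
        # 1D FFT along the temporal axis -> (B, C, num_freqs, H, W).
        spec = torch.fft.rfft(x, dim=2)
        # Modulate the spectrum with the location-specific filter.
        spec = spec * filt.permute(0, 3, 1, 2).unsqueeze(1)
        # Inverse FFT back to the temporal domain -> (B, C, T, H, W).
        return torch.fft.irfft(spec, n=t, dim=2)


if __name__ == "__main__":
    # Quick shape check with hypothetical sizes.
    dtf = DynamicTemporalFilter(channels=64, num_frames=8)
    out = dtf(torch.randn(2, 64, 8, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 8, 14, 14])
```

Because the filter is predicted per spatial location and multiplication in the frequency domain corresponds to circular convolution over all frames, this block's temporal receptive field spans the whole clip rather than a fixed kernel window, which is the property the abstract emphasizes.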
