Paper Title
Representation Learning for Compressed Video Action Recognition via Attentive Cross-modal Interaction with Motion Enhancement
Authors
Abstract
Compressed video action recognition has recently drawn growing attention because it substantially reduces storage and computational cost by replacing raw videos with sparsely sampled RGB frames and compressed motion cues (e.g., motion vectors and residuals). However, the task suffers severely from coarse and noisy dynamics and from insufficient fusion of the heterogeneous RGB and motion modalities. To address these two issues, this paper proposes a novel framework, the Attentive Cross-modal Interaction Network with Motion Enhancement (MEACI-Net). It follows a two-stream architecture, with one stream for the RGB modality and the other for the motion modality. In particular, the motion stream employs a multi-scale block embedded with a denoising module to enhance representation learning. The interaction between the two streams is then strengthened by the Selective Motion Complement (SMC) and Cross-Modality Augment (CMA) modules: SMC complements the RGB modality with spatio-temporally attentive local motion features, and CMA further combines the two modalities with selective feature augmentation. Extensive experiments on the UCF-101, HMDB-51 and Kinetics-400 benchmarks demonstrate the effectiveness and efficiency of MEACI-Net.
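
To make the "compressed motion cues" mentioned in the abstract concrete, the following is a minimal toy sketch of how a predicted frame in a compressed video can be stored as per-block motion vectors plus a residual relative to a reference frame. It is not the MEACI-Net implementation and does not reflect any particular codec: the 4x4 frames, the 2x2 block size, and the helper names (`predict_frame`, `compute_residual`, `reconstruct`) are all assumptions chosen for illustration.

```python
# Toy illustration only: a predicted frame stored as motion vectors + residual.
# All names and sizes here are hypothetical and not taken from the paper.

BLOCK = 2  # block size in pixels (assumed; real codecs use larger blocks)


def predict_frame(reference, motion_vectors):
    """Build a motion-compensated prediction of the current frame.

    motion_vectors[by][bx] = (dy, dx) means every pixel of block (by, bx)
    in the current frame is copied from the reference frame at an offset
    of (dy, dx). Source coordinates are clamped to the frame border.
    """
    height, width = len(reference), len(reference[0])
    predicted = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = motion_vectors[y // BLOCK][x // BLOCK]
            src_y = min(max(y + dy, 0), height - 1)
            src_x = min(max(x + dx, 0), width - 1)
            predicted[y][x] = reference[src_y][src_x]
    return predicted


def compute_residual(current, predicted):
    """Residual = what motion compensation alone fails to explain."""
    return [[c - p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(current, predicted)]


def reconstruct(predicted, residual):
    """Decoder side: prediction plus residual recovers the frame."""
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(predicted, residual)]


# Reference frame: a bright 2x2 patch in the top-left corner.
reference = [
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# Current frame: the patch has moved 2 pixels right and one pixel dimmed.
current = [
    [0, 0, 9, 9],
    [0, 0, 9, 8],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# One (dy, dx) vector per 2x2 block of the current frame.
motion_vectors = [
    [(0, 2), (0, -2)],  # top-left copies empty area; top-right copies the patch
    [(0, 0), (0, 0)],   # bottom blocks are static
]

predicted = predict_frame(reference, motion_vectors)
residual = compute_residual(current, predicted)
decoded = reconstruct(predicted, residual)
```

In this toy case the motion vectors capture the coarse block-level displacement, while the residual is zero everywhere except at the single dimmed pixel. Cues of this kind are cheap to obtain from a compressed stream but coarse, which is consistent with the abstract's remark about "coarse and noisy dynamics".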