Paper Title
Learnable Sampling 3D Convolution for Video Enhancement and Action Recognition
Paper Authors
Paper Abstract
A key challenge in video enhancement and action recognition is fusing useful information from neighboring frames. Recent works suggest establishing accurate correspondences between neighboring frames before fusing temporal information. However, the generated results heavily depend on the quality of the correspondence estimation. In this paper, we propose a more robust solution: \emph{sampling and fusing multi-level features} across neighboring frames to generate the results. Based on this idea, we introduce a new module that improves the capability of 3D convolution, namely, learnable sampling 3D convolution (\emph{LS3D-Conv}). We add learnable 2D offsets to 3D convolution, which sample locations on the spatial feature maps across frames; the offsets can be learned for specific tasks. \emph{LS3D-Conv} can flexibly replace the 3D convolution layers in existing 3D networks, yielding new architectures that learn sampling at multiple feature levels. Experiments on video interpolation, video super-resolution, video denoising, and action recognition demonstrate the effectiveness of our approach.
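To make the sampling idea concrete, below is a minimal PyTorch sketch of an LS3D-Conv-style layer. It is an illustrative approximation, not the paper's released implementation: the class name LS3DConvSketch, the per-temporal-tap offset predictor, the use of normalized-coordinate offsets, and the averaging of resampled features before a standard 3D convolution are all assumptions made for brevity. The offset predictor is initialized to zero so the layer starts out as an ordinary 3D convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LS3DConvSketch(nn.Module):
    """Hypothetical sketch of a learnable-sampling 3D convolution.

    For each temporal tap of the 3D kernel, a small conv predicts a
    per-pixel 2D offset field. Neighboring-frame features are resampled
    with grid_sample at the offset positions, and a standard 3D
    convolution fuses the resampled stack. Names and design choices are
    illustrative, not taken from the paper's code.
    """

    def __init__(self, in_channels, out_channels, kernel_size=(3, 3, 3)):
        super().__init__()
        self.kt = kernel_size[0]
        # One 2D offset map (dx, dy) per temporal tap, predicted per pixel.
        self.offset_pred = nn.Conv3d(in_channels, 2 * self.kt,
                                     kernel_size=3, padding=1)
        # Standard 3D convolution applied to the offset-sampled features.
        self.conv3d = nn.Conv3d(in_channels, out_channels,
                                kernel_size=kernel_size,
                                padding=tuple(k // 2 for k in kernel_size))
        # Zero init: the layer initially behaves like a regular 3D conv.
        nn.init.zeros_(self.offset_pred.weight)
        nn.init.zeros_(self.offset_pred.bias)

    def forward(self, x):
        # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        offsets = self.offset_pred(x)                  # (N, 2*kt, T, H, W)
        offsets = offsets.view(n, self.kt, 2, t, h, w)

        # Base sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        base_grid = torch.stack((xs, ys), dim=-1)      # (H, W, 2)

        sampled = []
        for k in range(self.kt):
            # Offsets are treated as normalized coordinates for simplicity.
            off = offsets[:, k].permute(0, 2, 3, 4, 1)  # (N, T, H, W, 2)
            grid = base_grid + off                      # broadcast over N, T
            frames = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
            warped = F.grid_sample(frames, grid.reshape(n * t, h, w, 2),
                                   align_corners=True)
            sampled.append(warped.view(n, t, c, h, w).permute(0, 2, 1, 3, 4))

        # Average the per-tap resampled features, then fuse with the 3D conv.
        x_sampled = torch.stack(sampled, dim=0).mean(dim=0)
        return self.conv3d(x_sampled)


# Example usage on a dummy clip of 5 frames with 64 channels.
layer = LS3DConvSketch(64, 64)
y = layer(torch.randn(2, 64, 5, 32, 32))   # -> (2, 64, 5, 32, 32)
```

Because the sketch only swaps sampling locations before a regular nn.Conv3d, it can be dropped into an existing 3D network in place of a convolution layer with matching channel counts, which mirrors the plug-in replacement described in the abstract.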