Paper Title
Masked Motion Encoding for Self-Supervised Video Representation Learning
Paper Authors
Paper Abstract
Learning discriminative video representations from unlabeled videos is challenging but crucial for video analysis. The latest attempts seek to learn a representation model by predicting the appearance contents in masked regions. However, simply masking and recovering appearance contents may not be sufficient to model temporal clues, as appearance contents can be easily reconstructed from a single frame. To overcome this limitation, we present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues. In MME, we focus on addressing two critical challenges to improve representation performance: 1) how to adequately represent possible long-term motion across multiple frames; and 2) how to obtain fine-grained temporal clues from sparsely sampled videos. Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions. Moreover, given the sparse video input, we require the model to reconstruct dense motion trajectories in both the spatial and temporal dimensions. Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details. Code is available at https://github.com/XinyuSun/MME.
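
To make the pre-training objective concrete, below is a minimal, self-contained sketch of a masked-motion-encoding training step. All module names, tensor dimensions, and the trajectory format are illustrative assumptions chosen for exposition, not the authors' implementation; the official code is at the repository linked above.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the MME objective described in the abstract.
# Every name, dimension, and the trajectory format here is an assumption
# made for exposition; see https://github.com/XinyuSun/MME for the real code.

class ToyMME(nn.Module):
    def __init__(self, patch_dim=768, depth=2, n_heads=8, traj_dim=48):
        super().__init__()
        layer = nn.TransformerEncoderLayer(patch_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, patch_dim))
        # Regression head: one trajectory vector per masked patch, e.g.
        # K interpolated time steps x (2D position offset + shape descriptors),
        # covering the two kinds of change the abstract mentions.
        self.traj_head = nn.Linear(patch_dim, traj_dim)

    def forward(self, patches, mask):
        # patches: (B, N, patch_dim) tokenized video clip
        # mask:    (B, N) bool, True where the patch is hidden from the model
        x = torch.where(mask.unsqueeze(-1), self.mask_token.to(patches.dtype), patches)
        x = self.backbone(x)
        # Predict motion trajectories only at the masked positions.
        return self.traj_head(x[mask])

def mme_step(model, patches, mask, target_traj):
    # target_traj: (num_masked, traj_dim) ground-truth trajectories for the
    # masked regions, e.g. precomputed with optical flow or a point tracker.
    pred_traj = model(patches, mask)
    return nn.functional.mse_loss(pred_traj, target_traj)

# Shape check with random tensors (2 clips, 196 patches each, ~90% masked):
model = ToyMME()
patches = torch.randn(2, 196, 768)
mask = torch.rand(2, 196) > 0.1
loss = mme_step(model, patches, mask, torch.randn(int(mask.sum().item()), 48))
loss.backward()
```

The key difference from appearance-based masked video modeling is the regression target: rather than reconstructing pixels in the masked patches, the head regresses a per-patch motion trajectory, which per the abstract encodes position and shape changes densely interpolated between the sparsely sampled frames.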