Paper Title

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Paper Authors

Zhan Tong, Yibing Song, Jue Wang, Limin Wang

Paper Abstract

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
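
To make the tube-masking idea concrete, the minimal sketch below samples one random spatial mask at roughly a 90% ratio and repeats it across every frame, so the same patch positions are hidden throughout the clip. This is not taken from the official repository; the function name, tensor shapes, and patch counts are illustrative assumptions only.

```python
import torch

def tube_masking(num_frames: int, num_patches_per_frame: int,
                 mask_ratio: float = 0.9) -> torch.Tensor:
    """Sample a tube mask: the same spatial patches are masked in every frame.

    Returns a boolean mask of shape (num_frames, num_patches_per_frame),
    where True marks a masked (hidden) patch. Illustrative sketch, not the
    official VideoMAE implementation.
    """
    num_masked = int(num_patches_per_frame * mask_ratio)
    # Randomly choose which spatial positions to mask (shared across time).
    perm = torch.randperm(num_patches_per_frame)
    spatial_mask = torch.zeros(num_patches_per_frame, dtype=torch.bool)
    spatial_mask[perm[:num_masked]] = True
    # Repeat the same spatial mask along the temporal axis ("tube" masking),
    # so no patch is visible in one frame but hidden in another.
    return spatial_mask.unsqueeze(0).expand(num_frames, -1)


# Example (assumed geometry): a 16-frame, 224x224 clip with 16x16 patches
# gives 14 * 14 = 196 patches per frame; mask ~90% of them.
mask = tube_masking(num_frames=16, num_patches_per_frame=196, mask_ratio=0.9)
print(mask.shape, mask.float().mean().item())  # torch.Size([16, 196]), ~0.90
```

Because video content is temporally redundant, masking the same positions in all frames prevents the model from trivially copying a patch from a neighboring frame, which is what allows the masking ratio to be pushed far higher than in image MAE.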
