Paper Title
Masked Autoencoders As Spatiotemporal Learners
Paper Authors
Paper Abstract
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to information redundancy of the data. A high masking ratio leads to a large speedup, e.g., > 4x in wall-clock time or even more. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.
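To make the masking step described above concrete, below is a minimal sketch (not the paper's released code) of spacetime-agnostic random masking over flattened spacetime patch tokens at a 90% ratio. The function name `random_spacetime_masking`, the token shapes, and the PyTorch framing are illustrative assumptions; only the idea of keeping a random 10% of patches per clip comes from the abstract.

```python
import torch

def random_spacetime_masking(tokens, mask_ratio=0.9):
    """Spacetime-agnostic random masking of patch tokens (illustrative sketch).

    tokens: (batch, num_patches, dim) -- flattened spacetime patch embeddings.
    Returns the visible (kept) tokens and the indices needed to restore order.
    """
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))  # e.g. keep only 10% of patches at a 90% ratio

    # Sample a random permutation per example; the first `num_keep` indices are kept,
    # with no structure imposed across space or time.
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)
    ids_keep = ids_shuffle[:, :num_keep]

    # Gather only the visible tokens; the encoder then runs on ~10% of the patches,
    # which is where the large wall-clock speedup mentioned in the abstract comes from.
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_restore


# Hypothetical usage: a clip split into 8 x 14 x 14 = 1568 spacetime patches of dimension 768.
tokens = torch.randn(2, 1568, 768)
visible, ids_restore = random_spacetime_masking(tokens, mask_ratio=0.9)
print(visible.shape)  # torch.Size([2, 156, 768])
```

The decoder (not sketched here) would reinsert mask tokens using `ids_restore` and reconstruct the missing patches in pixel space.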