Paper Title
Watching the World Go By: Representation Learning from Unlabeled Videos
Paper Authors
Paper Abstract
Recent single-image unsupervised representation learning techniques show remarkable success on a variety of tasks. The basic principle in these works is instance discrimination: learning to differentiate between two augmented versions of the same image and a large batch of unrelated images. Networks learn to ignore the augmentation noise and extract semantically meaningful representations. Prior work uses artificial data augmentation techniques such as cropping and color jitter, which affect the image only in superficial ways and do not reflect how objects actually change, e.g., through occlusion, deformation, or viewpoint change. In this paper, we argue that videos offer this natural augmentation for free. Videos can provide entirely new views of objects, show deformation, and even connect semantically similar but visually distinct concepts. We propose Video Noise Contrastive Estimation, a method for using unlabeled video to learn strong, transferable single-image representations. We demonstrate improvements over recent unsupervised single-image techniques, as well as over fully supervised ImageNet pretraining, across a variety of temporal and non-temporal tasks. Code and the Random Related Video Views dataset are available at https://www.github.com/danielgordon10/vince.
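To make the instance-discrimination objective described in the abstract concrete, below is a minimal PyTorch sketch of an InfoNCE-style noise contrastive loss where the positive pair is two frames sampled from the same video, so the "augmentation" is natural change over time rather than synthetic jitter. This is an illustrative sketch of the general technique, not the paper's actual implementation; the function name, `temperature` value, and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def video_nce_loss(anchors: torch.Tensor,
                   positives: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive loss.

    anchors[i] and positives[i] are embeddings of two different frames
    from the same video (a natural positive pair); every other row in
    the batch serves as a negative. Shapes: (batch_size, embed_dim).
    """
    # Normalize so the dot product is cosine similarity.
    anchors = F.normalize(anchors, dim=1)
    positives = F.normalize(positives, dim=1)

    # logits[i, j] = similarity between anchor i and positive j.
    logits = anchors @ positives.t() / temperature

    # The matching pair sits on the diagonal, so the "correct class"
    # for row i is index i; cross-entropy pushes matched frames
    # together and mismatched videos apart.
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)
```

In the setting the abstract describes, the two embeddings would come from passing two frames of the same unlabeled video through a shared image encoder, so the network must learn representations invariant to occlusion, deformation, and viewpoint change rather than merely to crops and color jitter.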