Paper Title
Unsupervised Learning of Dense Visual Representations
Paper Authors
Paper Abstract
Contrastive self-supervised learning has emerged as a promising approach to unsupervised visual representation learning. In general, these methods learn global (image-level) representations that are invariant to different views (i.e., compositions of data augmentation) of the same image. However, many visual understanding tasks require dense (pixel-level) representations. In this paper, we propose View-Agnostic Dense Representation (VADeR) for unsupervised learning of dense representations. VADeR learns pixelwise representations by forcing local features to remain constant over different viewing conditions. Specifically, this is achieved through pixel-level contrastive learning: matching features (that is, features that describe the same location of the scene on different views) should be close in an embedding space, while non-matching features should be apart. VADeR provides a natural representation for dense prediction tasks and transfers well to downstream tasks. Our method outperforms ImageNet supervised pretraining (and strong unsupervised baselines) in multiple dense prediction tasks.
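To make the pixel-level contrastive objective described in the abstract concrete, below is a minimal sketch of an InfoNCE-style loss over matched pixel features. It assumes dense features for two views of the same image have already been extracted and paired by scene location; the function name, tensor shapes, and temperature value are illustrative assumptions, not VADeR's actual implementation.

```python
# A minimal sketch of pixel-level contrastive (InfoNCE) learning, assuming
# precomputed dense features for two views and known pixel correspondences.
# Names, shapes, and hyperparameters here are assumptions for illustration.
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(feats_a, feats_b, temperature=0.07):
    """InfoNCE over matched pixel features.

    feats_a, feats_b: (N, D) tensors where row i of feats_a and row i of
    feats_b describe the same scene location under two different views;
    all other rows act as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    feats_a = F.normalize(feats_a, dim=1)
    feats_b = F.normalize(feats_b, dim=1)

    # Similarity of every feature in view A to every feature in view B.
    logits = feats_a @ feats_b.t() / temperature  # (N, N)

    # Matching pairs lie on the diagonal: feature i in A matches i in B,
    # so the cross-entropy pulls matches together and pushes others apart.
    targets = torch.arange(feats_a.size(0), device=feats_a.device)
    return F.cross_entropy(logits, targets)

# Toy usage: 256 matched pixel features of dimension 128, where the second
# view is a slightly perturbed copy of the first.
a = torch.randn(256, 128)
b = a + 0.1 * torch.randn(256, 128)
print(pixel_contrastive_loss(a, b).item())
```

In this sketch, the other N-1 features in the second view serve as the non-matching (negative) set for each pixel; a full pipeline would additionally need the encoder and a way to recover pixel correspondences under the data augmentations.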