Paper title
EBMs vs. CL: Exploring Self-Supervised Visual Pretraining for Visual Question Answering
Paper authors
Paper abstract
The limited availability of clean and diverse labeled data is a major roadblock for training models on complex tasks such as visual question answering (VQA). The extensive work on large vision-and-language models has shown that self-supervised learning is effective for pretraining multimodal interactions. In this technical report, we focus on visual representations. We review and evaluate self-supervised methods that leverage unlabeled images to pretrain a model, which we then fine-tune on a custom VQA task that allows controlled evaluation and diagnosis. We compare energy-based models (EBMs) with contrastive learning (CL). While EBMs are growing in popularity, they have not been evaluated on downstream tasks. We find that both EBMs and CL can learn representations from unlabeled images that enable training a VQA model on very little annotated data. In a simple setting similar to CLEVR, we find that CL representations also improve systematic generalization, and even match the performance of representations from a larger, supervised, ImageNet-pretrained model. However, we find EBMs difficult to train because of instabilities and high variability in their results. Although EBMs prove useful for out-of-distribution (OOD) detection, other results on supervised energy-based training and uncertainty calibration are largely negative. Overall, CL currently seems preferable to EBMs.
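To make the contrastive side of the comparison concrete, the following is a minimal sketch of an InfoNCE-style contrastive loss in plain Python. The abstract does not say which CL objective the report uses, so the choice of InfoNCE, the function name `info_nce_loss`, the `temperature` default, and the toy embeddings are all assumptions made for illustration and are not the authors' implementation. Row `i` of each input is assumed to hold the embedding of one augmented view of image `i`, and embeddings are assumed to be non-zero.

```python
import math


def _normalize(vector):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce_loss(views_a, views_b, temperature=0.1):
    """Average InfoNCE loss over a batch of paired embeddings.

    views_a[i] and views_b[i] are treated as two views of the same image
    (the positive pair); every other row of views_b acts as a negative.
    """
    a = [_normalize(v) for v in views_a]
    b = [_normalize(v) for v in views_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity between the anchor and every candidate, scaled
        # by the temperature.
        logits = [
            sum(x * y for x, y in zip(anchor, candidate)) / temperature
            for candidate in b
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Cross-entropy with the positive pair (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(a)


# Hypothetical toy embeddings, chosen only to exercise the function.
views = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = info_nce_loss(views, views)
mismatched = info_nce_loss(views, views[1:] + views[:1])
```

In this toy setup, `aligned` pairs each embedding with itself, while `mismatched` pairs each embedding with a different one, so the loss should be lower in the aligned case. In an actual pretraining pipeline the embeddings would come from an image encoder applied to two augmentations of each unlabeled image, and the loss would be minimized with respect to the encoder's parameters.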