Paper Title
Image Retrieval from Contextual Descriptions
Paper Authors
Paper Abstract
The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. As such, each description contains only the details that help distinguish between images. Because of this, descriptions tend to be complex in terms of syntax and discourse and require drawing pragmatic inferences. Images are sourced from both static pictures and video frames. We benchmark several state-of-the-art models, including both cross-encoders such as ViLBERT and bi-encoders such as CLIP, on ImageCoDe. Our results reveal that these models dramatically lag behind human performance: the best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 in humans. Furthermore, we experiment with new model variants that are better equipped to incorporate visual and temporal context into their representations, which achieve modest gains. Our hope is that ImageCoDe will foster progress in grounded language understanding by encouraging models to focus on fine-grained visual differences.
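To make the bi-encoder evaluation setup concrete, the following is a minimal sketch of scoring one contextual description against a set of 10 candidate images with CLIP through the Hugging Face `transformers` API. The checkpoint name, file paths, and example description are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch of bi-encoder retrieval on one ImageCoDe-style example.
# Assumptions: candidate images are local files and CLIP is loaded from the
# Hugging Face hub; the paper's exact checkpoints/preprocessing may differ.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical contextual description and candidate file names.
description = "The man is about to catch the frisbee with his left hand."
candidates = [Image.open(f"candidate_{i}.jpg") for i in range(10)]

inputs = processor(text=[description], images=candidates,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (1, 10): similarity of the description to each image.
scores = outputs.logits_per_text.squeeze(0)
predicted_index = scores.argmax().item()
print(f"Predicted image index: {predicted_index}")
```

Retrieval accuracy on the benchmark is then simply the fraction of examples for which `predicted_index` matches the index of the target image.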