Paper Title
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Paper Authors
Paper Abstract
Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance on a range of vision-language (VL) tasks. However, several challenges remain in measuring the community's progress toward building general multi-modal intelligence. First, most downstream VL datasets are annotated using raw images that were already seen during pre-training, which may lead to an overestimation of current VLP models' generalization ability. Second, recent VLP work mainly focuses on absolute performance but overlooks the efficiency-performance trade-off, which is also an important indicator of progress. To this end, we introduce the Vision-Language Understanding Evaluation (VLUE) benchmark, a multi-task, multi-dimensional benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off ("Pareto SOTA") of VLP models. We demonstrate that there is a sizable generalization gap for all VLP models when tested on out-of-distribution test sets annotated on images from a more diverse distribution that spans cultures. Moreover, we find that measuring the efficiency-performance trade-off of VLP models yields complementary insights into several design choices in VLP. We release the VLUE benchmark to promote research on building vision-language models that generalize well to more diverse images and concepts unseen during pre-training, and that are practical in terms of the efficiency-performance trade-off.
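The "Pareto SOTA" criterion in the abstract can be made concrete: a model sits on the efficiency-performance Pareto frontier when no other model is at least as good on both axes and strictly better on one. The sketch below illustrates such a check; it is not code from the paper, and the model names, latency figures, and scores are purely hypothetical.

```python
# Minimal sketch of a Pareto-frontier check over (efficiency, performance)
# pairs, as implied by the "Pareto SOTA" framing. All numbers are invented
# for illustration, not results from the VLUE paper.

from dataclasses import dataclass


@dataclass
class Model:
    name: str
    latency_ms: float  # efficiency proxy: lower is better
    score: float       # task performance: higher is better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Return the models not dominated by any other model.

    A model is dominated if some other model is at least as fast and at
    least as accurate, and strictly better in one of the two dimensions.
    """
    frontier = []
    for m in models:
        dominated = any(
            o.latency_ms <= m.latency_ms
            and o.score >= m.score
            and (o.latency_ms < m.latency_ms or o.score > m.score)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return frontier


if __name__ == "__main__":
    # Hypothetical candidates: "slow-weak" is dominated by the others.
    candidates = [
        Model("large-vlp", latency_ms=120.0, score=78.5),
        Model("base-vlp", latency_ms=60.0, score=75.0),
        Model("distilled-vlp", latency_ms=25.0, score=71.0),
        Model("slow-weak", latency_ms=150.0, score=70.0),
    ]
    for m in pareto_frontier(candidates):
        print(f"{m.name}: {m.latency_ms} ms, score {m.score}")
```

Under this view, a new model can count as progress by improving the frontier at any operating point, not only by topping the absolute-performance leaderboard.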