Paper Title
CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination
Paper Authors
Paper Abstract
As humans, we can modify our assumptions about a scene by imagining alternative objects or concepts in our minds. For example, we can easily anticipate the implications of the sun being overcast by rain clouds (e.g., the street will get wet) and accordingly prepare for that. In this paper, we introduce a new task/dataset called Commonsense Reasoning for Counterfactual Scene Imagination (CoSIm) which is designed to evaluate the ability of AI systems to reason about scene change imagination. In this task/dataset, models are given an image and an initial question-response pair about the image. Next, a counterfactual imagined scene change (in textual form) is applied, and the model has to predict the new response to the initial question based on this scene change. We collect 3.5K high-quality and challenging data instances, with each instance consisting of an image, a commonsense question with a response, a description of a counterfactual change, a new response to the question, and three distractor responses. Our dataset contains various complex scene change types (such as object addition/removal/state change, event description, environment change, etc.) that require models to imagine many different scenarios and reason about the changed scenes. We present a baseline model based on a vision-language Transformer (i.e., LXMERT) and ablation studies. Through human evaluation, we demonstrate a large human-model performance gap, suggesting room for promising future work on this challenging counterfactual scene imagination task. Our code and dataset are publicly available at: https://github.com/hyounghk/CoSIm
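To make the task format concrete, below is a minimal sketch in Python of what a CoSIm instance might look like and how an LXMERT-based baseline could rank the four candidate responses. The field names, the linear scoring head, and the random placeholder visual features are illustrative assumptions, not the paper's released schema or pipeline; the actual baseline may concatenate the inputs and extract image features differently.

```python
# Hedged sketch of the CoSIm task format and a candidate-scoring baseline.
# Field names and the scoring head are assumptions for illustration only.
from dataclasses import dataclass
from typing import List

import torch
from transformers import LxmertModel, LxmertTokenizer


@dataclass
class CoSImInstance:
    image_path: str         # the scene image
    question: str           # commonsense question about the image
    initial_response: str   # response before the imagined change
    scene_change: str       # counterfactual scene change, in text form
    candidates: List[str]   # the new response plus three distractors
    answer_idx: int         # index of the correct new response


tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")
score_head = torch.nn.Linear(model.config.hidden_size, 1)  # assumed head


def predict(instance: CoSImInstance) -> int:
    """Score each candidate response against the text context and image."""
    # Placeholder visual inputs; a real pipeline would extract region
    # features (e.g., 36 Faster R-CNN boxes, 2048-d) and normalized
    # box coordinates (4-d) from instance.image_path.
    visual_feats = torch.randn(1, 36, 2048)
    visual_pos = torch.rand(1, 36, 4)

    scores = []
    for cand in instance.candidates:
        # Concatenate question, initial response, change, and candidate.
        text = " ".join([instance.question, instance.initial_response,
                         instance.scene_change, cand])
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        out = model(**inputs, visual_feats=visual_feats,
                    visual_pos=visual_pos)
        scores.append(score_head(out.pooled_output))
    return int(torch.cat(scores).argmax())
```

Under this framing, a model succeeds only if its chosen response reflects the imagined change (e.g., rain clouds) rather than what is literally visible in the image, which is what makes the task a test of counterfactual scene imagination.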