Paper Title


Grounded Situation Recognition

Authors

Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, Aniruddha Kembhavi

Abstract


We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing: the primary activity, entities engaged in the activity with their roles (e.g., agent, tool), and bounding-box groundings of entities. GSR presents important technical challenges: identifying semantic saliency, categorizing and localizing a large and diverse set of entities, overcoming semantic sparsity, and disambiguating roles. Moreover, unlike in captioning, GSR is straightforward to evaluate. To study this new task, we create the Situations With Groundings (SWiG) dataset, which adds 278,336 bounding-box groundings to the 11,538 entity classes in the imSitu dataset. We propose a Joint Situation Localizer and find that jointly predicting situations and groundings with end-to-end training handily outperforms independent training on the entire grounding metric suite, with relative gains between 8% and 32%. Finally, we show initial findings on three exciting future directions enabled by our models: conditional querying, visual chaining, and grounded semantic-aware image retrieval. Code and data are available at https://prior.allenai.org/projects/gsr.
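To make the abstract's notion of a "structured semantic summary" concrete, here is a minimal sketch of what one grounded situation frame might look like as a data structure. The class and field names are illustrative assumptions, not the SWiG dataset's actual schema: a verb (primary activity) plus a mapping from role names to entities, each with an optional bounding box.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical sketch only — names and types are assumptions for illustration,
# not the actual SWiG/imSitu annotation format.

@dataclass
class GroundedRole:
    # Noun class filling the role, e.g. "dog"; None if the role is unfilled.
    entity: Optional[str]
    # Bounding-box grounding as (x1, y1, x2, y2); None if the entity
    # is named but not localized in the image.
    bbox: Optional[Tuple[float, float, float, float]]

@dataclass
class GroundedSituation:
    # Primary activity depicted in the image, e.g. "jumping".
    verb: str
    # Role name -> grounded entity, e.g. "agent", "obstacle", "place".
    roles: Dict[str, GroundedRole]

# Example frame for an image of a dog jumping over a fence in a backyard.
situation = GroundedSituation(
    verb="jumping",
    roles={
        "agent": GroundedRole("dog", (12.0, 30.0, 180.0, 200.0)),
        "obstacle": GroundedRole("fence", (60.0, 120.0, 220.0, 210.0)),
        "place": GroundedRole("backyard", None),  # named but ungrounded
    },
)
```

The optional `bbox` reflects that, as the abstract notes, grounding is layered on top of the existing imSitu entity annotations, so some role fillers may lack a localization.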
