Paper Title

Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution

Paper Authors

Georgios Tziafas, Hamidreza Kasaei

Paper Abstract

Service robots should be able to interact naturally with non-expert human users, not only to help them in various tasks but also to receive guidance in order to resolve ambiguities that might be present in the instruction. We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description. Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains, therefore relying heavily on large datasets. Additionally, their transfer performance in RGB-D datasets suffers due to high visual discrepancy between the benchmark and the target domains. Modular approaches marry learning with domain modeling and exploit the compositional nature of language to decouple visual representation from language parsing, but either rely on external parsers or are trained in an end-to-end fashion due to the lack of strong supervision. In this work, we seek to tackle these limitations by introducing a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations. We exploit rich scene graph annotations generated in a synthetic domain and train each module independently. Our approach is evaluated both in simulation and in two real RGB-D scene datasets. Experimental results show that the decoupled nature of our framework allows for easy integration with domain adaptation approaches for Sim-To-Real visual recognition, offering a data-efficient, robust, and interpretable solution to visual grounding in robotic applications.
