Paper Title
LEBP -- Language Expectation & Binding Policy: A Two-Stream Framework for Embodied Vision-and-Language Interaction Task Learning Agents
Paper Authors
Abstract
People have long desired an embodied agent that can perform tasks by understanding language instructions. Moreover, they want to monitor the agent and confirm that it understands commands in the way they intended. However, how to build such an embodied agent remains unclear. Recently, this problem has become explorable through the Vision-and-Language Interaction benchmark ALFRED, which requires an agent to perform complicated daily household tasks in unseen scenes by following natural language instructions. In this paper, we propose LEBP -- Language Expectation and Binding Policy -- to tackle ALFRED. LEBP is a two-stream process: 1) It first runs a language expectation module to generate an expectation describing how to perform the task by understanding the language instruction. The expectation consists of a sequence of sub-steps for the task (e.g., pick up an apple). The expectation allows people to access and check how the instruction was understood before the agent takes any actual actions, in case the task might go wrong. 2) It then uses a binding policy module to bind the sub-steps in the expectation to actual actions in the specific scenario. Actual actions include navigation and object manipulation. Experimental results suggest our approach achieves performance comparable to currently published SOTA methods, while avoiding the large performance decay from seen to unseen scenes.
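The two-stream process described above can be sketched as a minimal pipeline. This is an illustrative stub, not the paper's implementation: the actual modules are learned models, and the function names, the toy keyword rule, and the action strings below are all hypothetical placeholders standing in for them.

```python
# Hypothetical sketch of LEBP's two streams: a language expectation
# stage producing inspectable sub-steps, then a binding stage mapping
# each sub-step to concrete actions in a scene. All names are made up.

def language_expectation(instruction: str) -> list[str]:
    """Stream 1: map an instruction to an expectation, i.e. a
    human-readable sub-step sequence (toy rule, not a learned model)."""
    if "apple" in instruction.lower():
        return ["GotoLocation counter", "PickupObject apple"]
    return ["GotoLocation unknown"]

def binding_policy(sub_step: str, scene: dict) -> list[str]:
    """Stream 2: bind one sub-step to actual actions
    (navigation or object manipulation) in the given scene."""
    verb, target = sub_step.split(" ", 1)
    if verb == "GotoLocation":
        # Navigation: resolve the named location in this scene.
        return [f"MoveAhead->{scene.get(target, 'search')}"]
    # Manipulation: a single interaction action on the bound object.
    return [f"{verb}({target})"]

def run_agent(instruction: str, scene: dict) -> list[str]:
    # The expectation is available here for a human to inspect and
    # correct before any action is executed in the environment.
    expectation = language_expectation(instruction)
    actions = []
    for step in expectation:
        actions.extend(binding_policy(step, scene))
    return actions
```

The key design point this sketch mirrors is the decoupling: the expectation depends only on language, while scene-specific grounding happens entirely in the binding stage, which is what lets the same expectation transfer from seen to unseen scenes.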