Paper Title

See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction

Paper Authors

Maria Attarian, Advaya Gupta, Ziyi Zhou, Wei Yu, Igor Gilitschenski, Animesh Garg

Paper Abstract

Cognitive planning is the structural decomposition of complex tasks into a sequence of future behaviors. In the computational setting, performing cognitive planning entails grounding plans and concepts in one or more modalities in order to leverage them for low-level control. Since real-world tasks are often described in natural language, we devise a cognitive planning algorithm via language-guided video prediction. Current video prediction models do not support conditioning on natural language instructions. Therefore, we propose a new video prediction architecture which leverages the power of pre-trained transformers. The network is endowed with the ability to ground concepts based on natural language input, with generalization to unseen objects. We demonstrate the effectiveness of this approach on a new simulation dataset, where each task is defined by a high-level action described in natural language. Our experiments compare our method against one video generation baseline without planning or action grounding and showcase significant improvements. Our ablation studies highlight the improved generalization to unseen objects that natural language embeddings offer to concept grounding, as well as the importance of planning towards the visual "imagination" of a task.
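
To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of language-conditioned video prediction in the spirit of the abstract: a transformer text encoder (used here as a stand-in for the frozen pre-trained transformer the abstract refers to) embeds the instruction, the embedding is fused with features of the current frame, and a decoder emits a short horizon of future frames. All module names, dimensions, and the fusion scheme are assumptions made for illustration; this is not the paper's released architecture.

```python
# Illustrative sketch of language-conditioned video prediction (assumed design,
# not the paper's actual model): embed the instruction with a transformer,
# fuse it with frame features, and predict a few future frames.
import torch
import torch.nn as nn


class LanguageConditionedVideoPredictor(nn.Module):
    def __init__(self, vocab_size=10000, text_dim=256, img_channels=3, horizon=4):
        super().__init__()
        self.horizon = horizon
        # Stand-in text encoder; in practice a frozen pre-trained transformer
        # (e.g., a BERT- or CLIP-style text encoder) would be used instead.
        self.token_emb = nn.Embedding(vocab_size, text_dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Simple convolutional frame encoder / decoder.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(img_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(64 + text_dim, 64, 1)
        self.frame_decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels * horizon, 4, stride=2, padding=1),
        )

    def forward(self, frame, instruction_tokens):
        # frame: (B, C, H, W); instruction_tokens: (B, T) token ids.
        text = self.text_encoder(self.token_emb(instruction_tokens)).mean(dim=1)  # (B, text_dim)
        feat = self.frame_encoder(frame)                                           # (B, 64, H/4, W/4)
        # Broadcast the instruction embedding over the spatial grid and fuse.
        text_map = text[:, :, None, None].expand(-1, -1, feat.shape[2], feat.shape[3])
        fused = self.fuse(torch.cat([feat, text_map], dim=1))
        out = self.frame_decoder(fused)                                            # (B, C*horizon, H, W)
        b, _, h, w = out.shape
        return out.view(b, self.horizon, -1, h, w)                                 # predicted future frames


if __name__ == "__main__":
    model = LanguageConditionedVideoPredictor()
    frames = torch.randn(2, 3, 64, 64)             # current observation
    tokens = torch.randint(0, 10000, (2, 12))      # tokenized instruction
    future = model(frames, tokens)
    print(future.shape)                            # torch.Size([2, 4, 3, 64, 64])
```

The single fused feature map is the simplest possible conditioning scheme; a model closer to the one the abstract describes would likely attend over the instruction tokens and predict the plan autoregressively, one high-level action at a time.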
