Paper Title
Multimedia Generative Script Learning for Task Planning
Paper Authors
Paper Abstract
Goal-oriented generative script learning aims to generate subsequent steps to reach a particular goal, which is an essential task to assist robots or humans in performing stereotypical activities. An important aspect of this process is the ability to capture historical states visually, which provides detailed information that is not covered by text and will guide subsequent steps. Therefore, we propose a new task, Multimedia Generative Script Learning, to generate subsequent steps by tracking historical states in both text and vision modalities, as well as presenting the first benchmark containing 5,652 tasks and 79,089 multimedia steps. This task is challenging in three aspects: the multimedia challenge of capturing the visual states in images, the induction challenge of performing unseen tasks, and the diversity challenge of covering different information in individual steps. We propose to encode visual state changes through a selective multimedia encoder to address the multimedia challenge, transfer knowledge from previously observed tasks using a retrieval-augmented decoder to overcome the induction challenge, and further present distinct information at each step by optimizing a diversity-oriented contrastive learning objective. We define metrics to evaluate both generation and inductive quality. Experiment results demonstrate that our approach significantly outperforms strong baselines.
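Note: To make the diversity-oriented contrastive learning objective mentioned above more concrete, the following is a minimal sketch of an InfoNCE-style loss that pulls the predicted next step toward the gold next step while pushing it away from generic or repetitive negatives. The function name, tensor shapes, and temperature value are illustrative assumptions and do not reflect the authors' actual implementation.

```python
# Minimal sketch of a diversity-oriented contrastive objective (assumption:
# step embeddings are contrasted against one gold next step and k negatives
# such as repeated or generic steps). Not the paper's official code.
import torch
import torch.nn.functional as F


def diversity_contrastive_loss(step_emb: torch.Tensor,
                               positive_emb: torch.Tensor,
                               negative_embs: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over step embeddings.

    step_emb:      (batch, dim)    embedding of the predicted next step
    positive_emb:  (batch, dim)    embedding of the gold next step
    negative_embs: (batch, k, dim) embeddings of k non-informative negatives
    """
    step_emb = F.normalize(step_emb, dim=-1)
    positive_emb = F.normalize(positive_emb, dim=-1)
    negative_embs = F.normalize(negative_embs, dim=-1)

    # Cosine similarity with the positive and with each negative.
    pos_sim = (step_emb * positive_emb).sum(-1, keepdim=True)       # (batch, 1)
    neg_sim = torch.einsum("bd,bkd->bk", step_emb, negative_embs)   # (batch, k)

    # The positive is placed at index 0; cross-entropy encourages the model
    # to rank it above all negatives.
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature    # (batch, 1+k)
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Toy usage with random embeddings.
    batch, k, dim = 4, 8, 256
    loss = diversity_contrastive_loss(torch.randn(batch, dim),
                                      torch.randn(batch, dim),
                                      torch.randn(batch, k, dim))
    print(loss.item())
```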