Paper Title
ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts
Paper Authors
Paper Abstract
Vision-Language Navigation (VLN) is a challenging task that requires an embodied agent to perform action-level modality alignment, i.e., to sequentially carry out the actions asked by the instruction in complex visual environments. Most existing VLN agents learn directly from instruction-path data and cannot sufficiently exploit the action-level alignment knowledge inside the multi-modal inputs. In this paper, we propose modAlity-aligneD Action PrompTs (ADAPT), which provides the VLN agent with action prompts so that it can explicitly learn action-level modality alignment for successful navigation. Specifically, an action prompt is defined as a modality-aligned pair of an image sub-prompt and a text sub-prompt, where the former is a single-view observation and the latter is a phrase such as "walk past the chair". When navigation starts, the instruction-related action prompt set is retrieved from a pre-built action prompt base and passed through a prompt encoder to obtain the prompt feature. The prompt feature is then concatenated with the original instruction feature and fed to a multi-layer transformer for action prediction. To collect high-quality action prompts into the prompt base, we use the Contrastive Language-Image Pretraining (CLIP) model, which has strong cross-modality alignment ability. A modality alignment loss and a sequential consistency loss are further introduced to strengthen the alignment within each action prompt and to encourage the agent to attend to the related prompts sequentially. Experimental results on both R2R and RxR show the superiority of ADAPT over state-of-the-art methods.
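The pipeline described in the abstract (CLIP-based retrieval of action prompts, a prompt encoder, and concatenation with the instruction before a multi-layer transformer) can be illustrated in code. The following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: all module sizes, the top-k retrieval rule, and the names (retrieve_action_prompts, PromptEncoder, AdaptFusionSketch) are illustrative assumptions, and the CLIP features of the prompt base are assumed to be precomputed.

```python
# Minimal sketch (assumptions, not the authors' code) of the fusion path:
# retrieve instruction-related action prompts, encode them, concatenate with
# the instruction tokens, and score candidate actions with a transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F


def retrieve_action_prompts(instr_clip_feat, base_txt_feats, k=4):
    """Pick the k prompts whose CLIP text features best match the instruction.

    instr_clip_feat: (B, D) CLIP feature of the instruction (assumed precomputed)
    base_txt_feats:  (N, D) CLIP text features of the prompt base
    returns:         (B, k) indices into the prompt base
    """
    sims = F.cosine_similarity(
        instr_clip_feat.unsqueeze(1), base_txt_feats.unsqueeze(0), dim=-1
    )  # (B, N)
    return sims.topk(k, dim=-1).indices


class PromptEncoder(nn.Module):
    """Fuses each (image sub-prompt, text sub-prompt) pair into one prompt feature."""

    def __init__(self, clip_dim=512, hidden_dim=768):
        super().__init__()
        self.img_proj = nn.Linear(clip_dim, hidden_dim)
        self.txt_proj = nn.Linear(clip_dim, hidden_dim)
        self.fuse = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())

    def forward(self, img_prompts, txt_prompts):
        # both inputs: (B, K, clip_dim) -> output: (B, K, hidden_dim)
        pair = torch.cat([self.img_proj(img_prompts), self.txt_proj(txt_prompts)], dim=-1)
        return self.fuse(pair)


class AdaptFusionSketch(nn.Module):
    """Runs a transformer over [instruction; prompts; candidate views] and
    scores each candidate view as a possible next action."""

    def __init__(self, hidden_dim=768, num_layers=4, num_heads=8):
        super().__init__()
        self.prompt_encoder = PromptEncoder(hidden_dim=hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        self.action_head = nn.Linear(hidden_dim, 1)

    def forward(self, instr_feats, img_prompts, txt_prompts, cand_feats):
        # instr_feats: (B, L, H) encoded instruction tokens
        # cand_feats:  (B, C, H) candidate-view features at the current step
        prompt_feats = self.prompt_encoder(img_prompts, txt_prompts)       # (B, K, H)
        tokens = torch.cat([instr_feats, prompt_feats, cand_feats], dim=1)
        out = self.transformer(tokens)
        cand_out = out[:, -cand_feats.size(1):]                            # candidate slots
        return self.action_head(cand_out).squeeze(-1)                      # (B, C) logits


if __name__ == "__main__":
    B, L, K, C, H, D = 2, 20, 4, 6, 768, 512
    base_txt = torch.randn(100, D)   # CLIP text features of the prompt base (assumed precomputed)
    base_img = torch.randn(100, D)   # CLIP image features of the prompt base (assumed precomputed)
    idx = retrieve_action_prompts(torch.randn(B, D), base_txt, k=K)
    model = AdaptFusionSketch(hidden_dim=H)
    logits = model(torch.randn(B, L, H), base_img[idx], base_txt[idx], torch.randn(B, C, H))
    print(logits.shape)              # torch.Size([2, 6])
```

The sketch only covers the forward fusion path; in the paper, the retrieved prompts are additionally supervised by the modality alignment loss and the sequential consistency loss during training.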