Vision-Dialog Navigation by Exploring Cross-modal Memory

Paper Title

Vision-Dialog Navigation by Exploring Cross-modal Memory

Authors

Zhu, Yi; Zhu, Fengda; Zhan, Zhaohuan; Lin, Bingqian; Jiao, Jianbin; Chang, Xiaojun; Liang, Xiaodan

Abstract

Vision-dialog navigation, posed as a new holy-grail task in the vision-language field, aims to learn an agent that can converse continually in natural language to ask for help and navigate according to human responses. Beyond the common challenges of vision-language navigation, vision-dialog navigation also requires handling the language intentions behind a series of questions that depend on the temporal context of the dialog history, and co-reasoning over both dialogs and visual scenes. In this paper, we propose the Cross-modal Memory Network (CMN) for remembering and understanding the rich information relevant to historical navigation actions. Our CMN consists of two memory modules, the language memory module (L-mem) and the visual memory module (V-mem). Specifically, L-mem learns latent relationships between the current language interaction and the dialog history by employing a multi-head attention mechanism. V-mem learns to associate the current visual views with the cross-modal memory of previous navigation actions. The cross-modal memory is generated via vision-to-language attention and language-to-vision attention. Benefiting from the collaborative learning of L-mem and V-mem, our CMN is able to explore the memory of decisions made for historical navigation actions to inform the current step. Experiments on the CVDN dataset show that our CMN outperforms the previous state-of-the-art model by a significant margin in both seen and unseen environments.
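
To make the attention steps named in the abstract more concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes plain scaled dot-product attention with no learned projections, toy 4-dimensional features, and a hypothetical way of combining the two attention directions into a memory entry (an element-wise sum). The abstract specifies none of these details, and all function and variable names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def multi_head_attend(query, memory, num_heads):
    """Split features into num_heads chunks, attend per chunk, concatenate.

    Assumption: no learned projection matrices, unlike a trained model.
    """
    dim = len(query)
    assert dim % num_heads == 0, "feature size must divide evenly into heads"
    size = dim // num_heads
    output = []
    for head in range(num_heads):
        lo, hi = head * size, (head + 1) * size
        chunked = [m[lo:hi] for m in memory]
        output.extend(attend(query[lo:hi], chunked, chunked))
    return output

# Toy features (hypothetical values): one vector per dialog turn or past view.
dialog_history = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
current_question = [1.0, 0.0, 0.0, 1.0]
past_views = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
current_view = [0.5, 0.0, 0.0, 0.5]

# L-mem-like step: the current question attends over the dialog history.
language_context = multi_head_attend(current_question, dialog_history, num_heads=2)

# Cross-modal memory (assumed wiring): each past view attends to the dialog
# (vision-to-language), the language context attends to the past views
# (language-to-vision), and the two results are summed element-wise.
language_to_vision = attend(language_context, past_views, past_views)
cross_modal_memory = []
for view in past_views:
    vision_to_language = attend(view, dialog_history, dialog_history)
    cross_modal_memory.append(
        [a + b for a, b in zip(vision_to_language, language_to_vision)]
    )

# V-mem-like step: the current view attends over the cross-modal memory.
visual_context = attend(current_view, cross_modal_memory, cross_modal_memory)
```

In a trained model the queries, keys, and values would pass through learned linear projections, and the resulting contexts would feed an action predictor; both are omitted here to keep the sketch short.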
