LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation

Paper title

LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation

Authors

Zhang, Yue; Kordjamshidi, Parisa

Abstract

Understanding spatial and visual information is essential for a navigation agent that follows natural language instructions. Current Transformer-based VLN agents entangle orientation and vision information, which limits the gain from learning each information source. In this paper, we design a neural agent with explicit Orientation and Vision modules. These modules learn to ground spatial information and landmark mentions in the instructions to the visual environment more effectively. To strengthen the spatial reasoning and visual perception of the agent, we design specific pre-training tasks to feed and better utilize the corresponding modules in our final navigation model. We evaluate our approach on both the Room2room (R2R) and Room4room (R4R) datasets and achieve state-of-the-art results on both benchmarks.
