Paper title
HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation
Paper authors
Abstract
Pre-training has been adopted in a few recent works on Vision-and-Language Navigation (VLN). However, previous pre-training methods for VLN either lack the ability to predict future actions or ignore trajectory contexts, which are essential for a greedy navigation process. In this work, to promote the learning of spatio-temporal visual-textual correspondence as well as the agent's decision-making capability, we propose a novel history-and-order aware pre-training paradigm (HOP) with VLN-specific objectives that exploit past observations and support future action prediction. Specifically, in addition to the commonly used Masked Language Modeling (MLM) and Trajectory-Instruction Matching (TIM), we design two proxy tasks to model temporal order information: Trajectory Order Modeling (TOM) and Group Order Modeling (GOM). Moreover, our navigation action prediction is enhanced by introducing the task of Action Prediction with History (APH), which takes into account historical visual perceptions. Extensive experimental results on four downstream VLN tasks (R2R, REVERIE, NDH, RxR) demonstrate the effectiveness of our proposed method compared with several state-of-the-art agents.
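The abstract names order-modeling proxy tasks without spelling out how their training samples are built. As a rough illustration only, the sketch below shows one plausible way to construct a sample for an order-prediction task such as TOM: shuffle the steps of a trajectory and keep the permutation as the label. The function names `make_order_sample` and `restore_order`, the string placeholders for views, and the label convention are all assumptions made for this example. They are not taken from the paper and should not be read as the authors' implementation.

```python
import random
from typing import List, Sequence, Tuple


def make_order_sample(
    steps: Sequence[str], rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Shuffle trajectory steps and return (shuffled_steps, labels).

    Hypothetical convention: labels[i] is the original position of the step
    that now sits at slot i, so recovering the labels recovers the order.
    """
    order = list(range(len(steps)))
    rng.shuffle(order)  # in-place random permutation of the indices
    shuffled = [steps[i] for i in order]
    return shuffled, order


def restore_order(shuffled: Sequence[str], labels: Sequence[int]) -> List[str]:
    """Undo the shuffle given ground-truth (or predicted) labels."""
    restored = [""] * len(shuffled)
    for slot, original_pos in enumerate(labels):
        restored[original_pos] = shuffled[slot]
    return restored


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder strings stand in for per-step visual features.
    trajectory = ["view_0", "view_1", "view_2", "view_3"]
    shuffled, labels = make_order_sample(trajectory, rng)
    print(shuffled, labels)
    print(restore_order(shuffled, labels))
```

In a real pre-training setup the placeholder strings would be visual features and a model would be trained to predict `labels` from the shuffled input together with the instruction. A group-level variant in the spirit of GOM could presumably apply the same idea to shuffled segments rather than single steps, though the abstract does not confirm that detail.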