1st Place Solutions for RxR-Habitat Vision-and-Language Navigation Competition (CVPR 2022)

Paper title

1st Place Solutions for RxR-Habitat Vision-and-Language Navigation Competition (CVPR 2022)

Authors

Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, Jing Shao

Abstract

This report presents the methods of the winning entry of the RxR-Habitat Competition at CVPR 2022. The competition addresses Vision-and-Language Navigation in Continuous Environments (VLN-CE), which requires an agent to follow step-by-step natural language instructions to reach a target. We present a modular plan-and-control approach for the task. Our model consists of three modules: the candidate waypoints predictor (CWP), the history-enhanced planner, and the tryout controller. In each decision loop, the CWP first predicts a set of candidate waypoints from depth observations taken from multiple views, which reduces the complexity of the action space and facilitates planning. The history-enhanced planner then selects one of the candidate waypoints as the subgoal. The planner additionally encodes historical memory to track navigation progress, which is especially effective for long-horizon navigation. Finally, we propose a non-parametric heuristic controller named tryout, which executes low-level actions to reach the planned subgoal. It is based on a trial-and-error mechanism that helps the agent avoid obstacles and escape when it gets stuck. The three modules work hierarchically until the agent stops. We further adopt several recent advances in Vision-and-Language Navigation (VLN) to improve performance, such as pretraining on a large-scale synthetic in-domain dataset, environment-level data augmentation, and snapshot model ensembling. Our model won the RxR-Habitat Competition 2022, with 48% and 90% relative improvements over existing methods on the nDTW and SR metrics, respectively.
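The hierarchical decision loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the grid world, the function names (`predict_waypoints`, `plan_subgoal`, `tryout_controller`, `navigate`), and the greedy rules are all assumptions made for illustration. In particular, the real CWP predicts waypoints from depth observations and the real planner is a learned model conditioned on the instruction; here both are replaced by simple hand-written stand-ins so that only the control flow (predict candidates, pick a subgoal, reach it by trial and error, repeat until stop) is shown.

```python
from typing import Callable, List, Optional, Set, Tuple

Cell = Tuple[int, int]
# Low-level actions in the toy grid (hypothetical; the real action space differs).
MOVES: List[Cell] = [(1, 0), (0, 1), (0, -1), (-1, 0)]


def manhattan(a: Cell, b: Cell) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def make_env_step(walls: Set[Cell]) -> Callable[[Cell, Cell], Cell]:
    """Toy environment: a move into a wall leaves the agent where it was."""
    def env_step(pos: Cell, move: Cell) -> Cell:
        nxt = (pos[0] + move[0], pos[1] + move[1])
        return pos if nxt in walls else nxt
    return env_step


def predict_waypoints(pos: Cell, walls: Set[Cell], radius: int = 3) -> List[Cell]:
    """Stand-in for the CWP: every free cell within `radius` of the agent."""
    out = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            cell = (pos[0] + dx, pos[1] + dy)
            if 0 < abs(dx) + abs(dy) <= radius and cell not in walls:
                out.append(cell)
    return out


def plan_subgoal(pos: Cell, candidates: List[Cell],
                 history: List[Cell], goal: Cell) -> Optional[Cell]:
    """Stand-in for the planner: stop at the goal, otherwise pick the
    candidate nearest the goal, preferring cells not already in history."""
    if pos == goal:
        return None  # None plays the role of the stop decision
    fresh = [c for c in candidates if c not in history]
    return min(fresh or candidates, key=lambda c: manhattan(c, goal))


def tryout_controller(pos: Cell, subgoal: Cell,
                      env_step: Callable[[Cell, Cell], Cell],
                      max_steps: int = 12) -> Cell:
    """Trial-and-error control: try the most promising move first and fall
    back to the next one whenever the position did not change."""
    for _ in range(max_steps):
        if pos == subgoal:
            break
        ranked = sorted(
            MOVES,
            key=lambda m: manhattan((pos[0] + m[0], pos[1] + m[1]), subgoal),
        )
        for move in ranked:
            new_pos = env_step(pos, move)
            if new_pos != pos:  # the trial succeeded, keep the new position
                pos = new_pos
                break
    return pos


def navigate(start: Cell, goal: Cell, walls: Set[Cell],
             max_decisions: int = 10) -> Tuple[Cell, List[Cell]]:
    """Run the three stand-in modules in a loop until the planner stops."""
    env_step = make_env_step(walls)
    pos, history = start, []
    for _ in range(max_decisions):
        history.append(pos)
        candidates = predict_waypoints(pos, walls)
        subgoal = plan_subgoal(pos, candidates, history, goal)
        if subgoal is None:
            break
        pos = tryout_controller(pos, subgoal, env_step)
    return pos, history


final, history = navigate(start=(0, 0), goal=(4, 0), walls={(2, 0)})
print(final, history)  # (4, 0) [(0, 0), (3, 0), (4, 0)]
```

In this toy run the controller never sees the wall map. It only notices that a move toward the subgoal left its position unchanged and then tries the next-best move, which is the sense in which the sketch is trial and error. The greedy rule has no memory within a single subgoal, so in more cluttered layouts it can oscillate; `max_steps` bounds that.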
