Paper Title
Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views
Paper Authors
Paper Abstract
We study the task of semantic mapping - specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map ("what is where?") from egocentric observations of an RGB-D camera with known pose (via localization sensors). Towards this goal, we present SemanticMapNet (SMNet), which consists of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame, (2) a Feature Projector that projects egocentric features to appropriate locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan length × width × feature-dims that learns to accumulate projected egocentric features, and (4) a Map Decoder that uses the memory tensor to produce semantic top-down maps. SMNet combines the strengths of (known) projective camera geometry and neural representation learning. On the task of semantic mapping in the Matterport3D dataset, SMNet significantly outperforms competitive baselines by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1 metrics. Moreover, we show how to use the neural episodic memories and spatio-semantic allocentric representations built by SMNet for subsequent tasks in the same space - navigating to objects seen during the tour ("Find chair") or answering questions about the space ("How many chairs did you see in the house?"). Project page: https://vincentcartillier.github.io/smnet.html.
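The Feature Projector described above relies on known projective camera geometry: each egocentric pixel is back-projected into 3D using depth and intrinsics, transformed by the known camera pose, and binned into a ground-plane grid cell. The following is a minimal numpy sketch of that projection step under simplifying assumptions (simple per-cell averaging instead of SMNet's learned accumulation; the function name, grid size, and cell size are illustrative, not from the paper):

```python
import numpy as np

def project_features_to_floorplan(depth, features, K, cam_to_world,
                                  map_size=(100, 100), cell_size=0.1):
    """Project per-pixel egocentric features onto an allocentric
    top-down grid, given known camera intrinsics and pose.

    depth:        (H, W) depth map in meters
    features:     (H, W, C) per-pixel feature vectors
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera-to-world transform (from localization)
    """
    H, W, C = features.shape
    # Back-project each pixel to 3D camera coordinates via the pinhole model.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (H, W, 4)
    pts_world = pts_cam.reshape(-1, 4) @ cam_to_world.T       # (H*W, 4)

    # Bin the ground-plane (x, z) world coordinates into grid cells.
    gx = np.floor(pts_world[:, 0] / cell_size).astype(int)
    gy = np.floor(pts_world[:, 2] / cell_size).astype(int)
    valid = (gx >= 0) & (gx < map_size[1]) & (gy >= 0) & (gy < map_size[0])

    # Accumulate features per cell; here a plain average stands in for
    # the learned Spatial Memory Tensor update.
    grid = np.zeros((*map_size, C))
    count = np.zeros(map_size)
    feats = features.reshape(-1, C)
    np.add.at(grid, (gy[valid], gx[valid]), feats[valid])
    np.add.at(count, (gy[valid], gx[valid]), 1)
    grid[count > 0] /= count[count > 0, None]
    return grid
```

In SMNet, the resulting allocentric tensor would be decoded into a semantic top-down map; this sketch only shows the geometric projection that the Feature Projector exploits.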