Paper Title
LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic Segmentation
Paper Authors
Paper Abstract
Recent works in autonomous driving have widely adopted the bird's-eye-view (BEV) semantic map as an intermediate representation of the world. Online prediction of these BEV maps involves non-trivial operations such as multi-camera data extraction, as well as fusion and projection into a common top-view grid. This is usually done with error-prone geometric operations (e.g., homography or back-projection from monocular depth estimation) or expensive direct dense mapping between image pixels and pixels in BEV (e.g., with MLP or attention). In this work, we present 'LaRa', an efficient encoder-decoder, transformer-based model for vehicle semantic segmentation from multiple cameras. Our approach uses a system of cross-attention to aggregate information over multiple sensors into a compact, yet rich, collection of latent representations. These latent representations, after being processed by a series of self-attention blocks, are then reprojected with a second cross-attention in the BEV space. We demonstrate that our model outperforms the best previous works using transformers on nuScenes. The code and trained models are available at https://github.com/valeoai/LaRa.
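To make the encoder-decoder flow described in the abstract concrete, the following is a minimal PyTorch sketch of a LaRa-style pipeline: cross-attention aggregates multi-camera features into a small set of latents, self-attention blocks process those latents, and a second cross-attention from a BEV query grid decodes them into a segmentation map. This is not the authors' implementation (see https://github.com/valeoai/LaRa for the official code); all module and parameter names, dimensions, and the assumption that geometric (ray) embeddings are already baked into the input features are illustrative assumptions.

```python
# Hypothetical sketch of a LaRa-style encoder-decoder; not the official model.
import torch
import torch.nn as nn


class LaRaSketch(nn.Module):
    def __init__(self, dim=256, num_latents=512, num_self_blocks=4,
                 bev_size=200, num_classes=1):
        super().__init__()
        # Learned latent vectors that aggregate information from all cameras.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        # Encoder: cross-attention with latents as queries, image tokens as keys/values.
        self.enc_xattn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Processing: a stack of self-attention blocks over the latents.
        self.self_blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead=8, dim_feedforward=4 * dim,
                                       batch_first=True)
            for _ in range(num_self_blocks)
        ])
        # Decoder: cross-attention with BEV grid queries attending to the latents.
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, dim))
        self.dec_xattn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)
        self.bev_size = bev_size

    def forward(self, cam_feats):
        # cam_feats: (B, N_tokens, dim) features from all cameras, flattened together,
        # assumed to already include geometric (e.g., ray) embeddings.
        b = cam_feats.shape[0]
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        latents, _ = self.enc_xattn(latents, cam_feats, cam_feats)
        for blk in self.self_blocks:
            latents = blk(latents)
        bev = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        bev, _ = self.dec_xattn(bev, latents, latents)
        logits = self.head(bev)  # (B, H*W, num_classes)
        return logits.transpose(1, 2).reshape(b, -1, self.bev_size, self.bev_size)


if __name__ == "__main__":
    model = LaRaSketch(dim=64, num_latents=32, bev_size=16)
    feats = torch.randn(2, 6 * 100, 64)  # e.g., 6 cameras x 100 tokens each
    print(model(feats).shape)  # torch.Size([2, 1, 16, 16])
```

The key design point this sketch illustrates is that the number of latents is fixed and small, so the cost of attending over all camera tokens and over the BEV grid scales with the latent count rather than with a dense pixel-to-pixel mapping.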