Paper Title
Attention Attention Everywhere: Monocular Depth Prediction with Skip Attention
Paper Authors
Paper Abstract
Monocular Depth Estimation (MDE) aims to predict pixel-wise depth given a single RGB image. For both convolutional and recent attention-based models, encoder-decoder architectures have been found useful due to the simultaneous requirement of global context and pixel-level resolution. Typically, a skip connection module is used to fuse the encoder and decoder features, comprising feature map concatenation followed by a convolution operation. Inspired by the demonstrated benefits of attention in a multitude of computer vision problems, we propose an attention-based fusion of encoder and decoder features. We pose MDE as a pixel query refinement problem, where the coarsest-level encoder features are used to initialize pixel-level queries, which are then refined to higher resolutions by the proposed Skip Attention Module (SAM). We formulate the prediction problem as ordinal regression over bin centers that discretize the continuous depth range, and introduce a Bin Center Predictor (BCP) module that predicts the bins at the coarsest level using the pixel queries. Apart from the benefit of image-adaptive depth binning, the proposed design helps learn improved depth embeddings in the initial pixel queries via direct supervision from the ground truth. Extensive experiments on the two canonical datasets, NYUV2 and KITTI, show that our architecture outperforms the state-of-the-art by 5.3% and 3.9%, respectively, along with an improved generalization performance of 9.4% on the SUNRGBD dataset. Code is available at https://github.com/ashutosh1807/PixelFormer.git.
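The sketch below illustrates the two ideas the abstract describes: attention-based fusion of encoder skip features into pixel queries (the role of SAM) and depth prediction as a probability-weighted combination of image-adaptive bin centers (the role of BCP). It is a minimal illustration, not the authors' implementation from the linked repository; module names, tensor shapes, and the use of standard multi-head cross-attention are assumptions for clarity.

```python
# Minimal sketch (assumptions, not the PixelFormer code):
#  - SkipAttentionModule: pixel queries cross-attend to same-resolution encoder skip features.
#  - depth_from_bins: per-pixel depth as the softmax-weighted sum of image-adaptive bin centers.
import torch
import torch.nn as nn

class SkipAttentionModule(nn.Module):
    """Fuse encoder skip features into decoder pixel queries via cross-attention."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, queries, skip_feats):
        # queries:    (B, N, C) pixel queries from the previous (coarser) decoder stage
        # skip_feats: (B, N, C) encoder features at the same resolution, flattened
        q = self.norm_q(queries)
        kv = self.norm_kv(skip_feats)
        fused, _ = self.attn(q, kv, kv)   # queries attend to the skip features
        return queries + fused            # residual refinement of the pixel queries

def depth_from_bins(logits, bin_centers):
    # logits:      (B, K, H, W) per-pixel scores over K depth bins
    # bin_centers: (B, K) image-adaptive bin centers (e.g. from a BCP-style predictor)
    probs = torch.softmax(logits, dim=1)                     # per-pixel bin probabilities
    return torch.einsum('bkhw,bk->bhw', probs, bin_centers)  # expected depth per pixel

if __name__ == "__main__":
    B, N, C, K, H, W = 2, 64, 32, 16, 8, 8
    sam = SkipAttentionModule(dim=C)
    refined = sam(torch.randn(B, N, C), torch.randn(B, N, C))
    depth = depth_from_bins(torch.randn(B, K, H, W), torch.rand(B, K) * 10.0)
    print(refined.shape, depth.shape)  # torch.Size([2, 64, 32]) torch.Size([2, 8, 8])
```

In this reading, the cross-attention replaces the usual concatenate-then-convolve skip connection, and the weighted sum over bin centers is what makes the binning image-adaptive rather than fixed across the dataset.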