Disentangled Latent Transformer for Interpretable Monocular Height Estimation

Paper Title

Disentangled Latent Transformer for Interpretable Monocular Height Estimation

Authors

Zhitong Xiong, Sining Chen, Yilei Shi, Xiao Xiang Zhu

Abstract

Monocular height estimation (MHE) from remote sensing imagery has high potential in generating 3D city models efficiently for a quick response to natural disasters. Most existing works pursue higher performance; however, little research has explored the interpretability of MHE networks. In this paper, we aim to explore how deep neural networks predict height from a single monocular image. Towards a comprehensive understanding of MHE networks, we propose to interpret them at multiple levels: 1) Neurons: unit-level dissection, exploring the semantic and height selectivity of the learned internal deep representations; 2) Instances: object-level interpretation, studying the effects of different semantic classes, scales, and spatial contexts on height estimation; 3) Attribution: pixel-level analysis, understanding which input pixels are important for height estimation. Based on this multi-level interpretation, a disentangled latent Transformer network is proposed towards a more compact, reliable, and explainable deep model for monocular height estimation. Furthermore, a novel unsupervised semantic segmentation task based on height estimation is introduced for the first time in this work. We also construct a new dataset for joint semantic segmentation and height estimation. Our work provides novel insights for both understanding and designing MHE models.
