Paper Title

Siamese Network for RGB-D Salient Object Detection and Beyond

Authors

Fu, Keren; Fan, Deng-Ping; Ji, Ge-Peng; Zhao, Qijun; Shen, Jianbing; Zhu, Ce

Abstract

Existing RGB-D salient object detection (SOD) models usually treat RGB and depth as independent information and design separate networks for feature extraction from each. Such schemes can easily be constrained by a limited amount of training data or over-reliance on an elaborately designed training process. Inspired by the observation that RGB and depth modalities actually present certain commonality in distinguishing salient objects, a novel joint learning and densely cooperative fusion (JL-DCF) architecture is designed to learn from both RGB and depth inputs through a shared network backbone, known as the Siamese architecture. In this paper, we propose two effective components: joint learning (JL), and densely cooperative fusion (DCF). The JL module provides robust saliency feature learning by exploiting cross-modal commonality via a Siamese network, while the DCF module is introduced for complementary feature discovery. Comprehensive experiments using five popular metrics show that the designed framework yields a robust RGB-D saliency detector with good generalization. As a result, JL-DCF significantly advances the state-of-the-art models by an average of ~2.0% (max F-measure) across seven challenging datasets. In addition, we show that JL-DCF is readily applicable to other related multi-modal detection tasks, including RGB-T (thermal infrared) SOD and video SOD, achieving comparable or even better performance against state-of-the-art methods. We also link JL-DCF to the RGB-D semantic segmentation field, showing its capability of outperforming several semantic segmentation models on the task of RGB-D SOD. These facts further confirm that the proposed framework could offer a potential solution for various applications and provide more insight into the cross-modal complementarity task.
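
The abstract describes the architecture only at a concept level: a single shared (Siamese) backbone encodes both RGB and depth, and a fusion module combines the two streams for complementary feature discovery. Below is a minimal PyTorch-style sketch of that idea; the class name JLDCFSketch, the layer sizes, and the sum/product fusion are illustrative assumptions, not the authors' released JL-DCF code (the actual DCF module fuses features densely across multiple backbone levels).

```python
import torch
import torch.nn as nn

class JLDCFSketch(nn.Module):
    """Minimal sketch: one shared (Siamese) backbone processes both RGB and
    depth, and a cooperative fusion step combines the two feature streams.
    All names and layer sizes are illustrative placeholders."""

    def __init__(self, feat_dim=64):
        super().__init__()
        # Shared backbone: the SAME weights encode both modalities,
        # exploiting cross-modal commonality (the "joint learning" idea).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Cooperative fusion: a conv over element-wise sum and product of the
        # two streams stands in for the (much denser) DCF module.
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, 3, padding=1)
        self.head = nn.Conv2d(feat_dim, 1, 1)  # per-pixel saliency logit

    def forward(self, rgb, depth):
        # Depth is replicated to 3 channels so the shared backbone accepts it.
        f_rgb = self.backbone(rgb)
        f_dep = self.backbone(depth.expand(-1, 3, -1, -1))
        fused = self.fuse(torch.cat([f_rgb + f_dep, f_rgb * f_dep], dim=1))
        return self.head(fused)  # saliency map logits at input resolution

# Usage: a batch of RGB images and single-channel depth maps.
model = JLDCFSketch()
rgb = torch.randn(2, 3, 64, 64)
depth = torch.randn(2, 1, 64, 64)
saliency = model(rgb, depth)  # -> torch.Size([2, 1, 64, 64])
```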
