Paper Title
360-MLC: Multi-view Layout Consistency for Self-training and Hyper-parameter Tuning
Paper Authors
Paper Abstract
We present 360-MLC, a self-training method based on multi-view layout consistency for fine-tuning monocular room-layout models using only unlabeled 360-images. This is valuable in practical scenarios where a pre-trained model must be adapted to a new data domain without any ground truth annotations. Our simple yet effective assumption is that multiple layout estimations in the same scene must define a consistent geometry regardless of their camera positions. Based on this idea, we leverage a pre-trained model to project estimated layout boundaries from several camera views into 3D world coordinates. We then re-project them back into spherical coordinates and build a probability function, from which we sample pseudo-labels for self-training. To handle unconfident pseudo-labels, we evaluate the variance in the re-projected boundaries as an uncertainty value that weights each pseudo-label in our loss function during training. In addition, since ground truth annotations are available neither during training nor testing, we leverage the entropy information in multiple layout estimations as a quantitative metric for the geometric consistency of the scene, allowing us to evaluate any layout estimator for hyper-parameter tuning, including model selection, without ground truth annotations. Experimental results show that our solution achieves favorable performance against state-of-the-art methods when self-training from three publicly available source datasets to a unique, newly labeled dataset consisting of multiple views of the same scenes.
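
The abstract compresses several concrete steps (project boundaries to 3D, re-project into a reference view, fuse into pseudo-labels, weight the loss by disagreement, and score consistency by entropy). Below is a minimal NumPy sketch of that pipeline under simplifying assumptions, not the authors' released implementation: each panorama's floor-wall boundary is given as one latitude angle per image column (radians below the horizon), camera poses are reduced to known floor-plane positions sharing one camera height, and the pseudo-label is taken as the per-column median instead of being sampled from a learned probability function. All function names and parameter choices are illustrative.

import numpy as np

def boundary_to_floor_xy(phi, cam_pos, cam_height):
    # Project per-column boundary angles (phi, radians below the horizon)
    # from one panorama onto the floor plane in world XY coordinates.
    n = phi.shape[0]
    theta = np.linspace(-np.pi, np.pi, n, endpoint=False)  # column longitudes
    dist = cam_height / np.tan(phi)                        # horizontal range to the wall
    xy = np.stack([dist * np.cos(theta), dist * np.sin(theta)], axis=1)
    return xy + cam_pos[None, :]

def floor_xy_to_columns(xy, cam_pos, cam_height, n_cols):
    # Re-project world floor points into the spherical coordinates of a
    # reference camera, collecting boundary angles per image column.
    rel = xy - cam_pos[None, :]
    theta = np.arctan2(rel[:, 1], rel[:, 0])
    phi = np.arctan2(cam_height, np.linalg.norm(rel, axis=1))
    cols = ((theta + np.pi) / (2 * np.pi) * n_cols).astype(int) % n_cols
    per_col = [[] for _ in range(n_cols)]
    for c, p in zip(cols, phi):
        per_col[c].append(p)
    return per_col

def pseudo_label_and_sigma(per_col):
    # Collapse per-column samples into a pseudo-label (median here) and an
    # uncertainty value (spread of the re-projected boundaries).
    label = np.array([np.median(p) if p else np.nan for p in per_col])
    sigma = np.array([np.std(p) if len(p) > 1 else np.inf for p in per_col])
    return label, sigma

def weighted_l1(pred, label, sigma, eps=1e-6):
    # Uncertainty-weighted L1: columns where the views disagree
    # (large sigma) contribute less to the self-training loss.
    valid = ~np.isnan(label) & np.isfinite(sigma)
    w = 1.0 / (sigma[valid] + eps)
    return np.sum(w * np.abs(pred[valid] - label[valid])) / np.sum(w)

def consistency_entropy(per_col, n_bins=64, phi_max=np.pi / 2):
    # Average per-column entropy of the boundary distribution: lower values
    # mean the multi-view estimates agree, usable as a label-free score for
    # hyper-parameter tuning and model selection.
    ents = []
    for p in per_col:
        if len(p) < 2:
            continue
        hist, _ = np.histogram(p, bins=n_bins, range=(0.0, phi_max))
        prob = hist / max(hist.sum(), 1)
        prob = prob[prob > 0]
        ents.append(-np.sum(prob * np.log(prob)))
    return float(np.mean(ents)) if ents else float("nan")

# Example: fuse noisy boundary estimates from three camera positions of one scene.
cams = [np.array([0.0, 0.0]), np.array([0.5, 0.2]), np.array([-0.3, 0.4])]
views = [np.full(256, 0.6) + 0.02 * np.random.randn(256) for _ in cams]
points = np.concatenate([boundary_to_floor_xy(b, c, 1.6) for b, c in zip(views, cams)])
per_col = floor_xy_to_columns(points, cams[0], 1.6, 256)
label, sigma = pseudo_label_and_sigma(per_col)
print(weighted_l1(views[0], label, sigma), consistency_entropy(per_col))

The same per-column collections feed both quantities the abstract describes: the variance-weighted loss used during self-training and the entropy score used as a label-free proxy for evaluating layout estimators without ground truth.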