Paper Title
NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors
Paper Authors
Paper Abstract
2D-to-3D reconstruction is an ill-posed problem, yet humans are good at solving it thanks to prior knowledge of the 3D world developed over years. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representation by minimizing a diffusion loss on its arbitrary-view renderings with a pretrained image diffusion model, under the input-view constraint. We leverage off-the-shelf vision-language models and introduce two-section language guidance as a conditioning input to the diffusion model. This substantially improves multiview content coherence, as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method can synthesize novel views of higher quality even compared to existing methods trained on this dataset. We also demonstrate its generalizability in zero-shot NeRF synthesis for in-the-wild images.
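The abstract compresses the method into three loss terms. Below is a minimal, PyTorch-style sketch of one training iteration as described: an input-view photometric constraint, a score-distillation-style diffusion loss on a randomly sampled novel view conditioned on the language guidance, and a depth-based geometric regularizer. The objects and methods it assumes (nerf.render, nerf.render_depth, diffusion.sds_loss, the pose sampler, and the loss weights) are hypothetical stand-ins for exposition, not the paper's actual API or hyperparameters.

import math
import torch
import torch.nn.functional as F

def sample_view_pose():
    # Hypothetical sampler: random azimuth/elevation on a sphere around the object.
    azimuth = torch.rand(()) * 2 * math.pi
    elevation = torch.rand(()) * math.pi / 3
    return azimuth, elevation

def training_step(nerf, diffusion, optimizer,
                  input_pose, input_image, depth_estimate, caption):
    """One optimization step over the NeRF parameters.

    nerf           -- differentiable NeRF renderer (hypothetical interface)
    diffusion      -- frozen, text-conditioned 2D diffusion model exposing an
                      SDS-style loss (hypothetical interface)
    input_pose     -- known camera pose of the single input view
    input_image    -- the single input photo, shape (3, H, W)
    depth_estimate -- monocular depth map estimated for the input view
    caption        -- the two-section language guidance (semantic + visual)
    """
    optimizer.zero_grad()

    # 1) Input-view constraint: the rendering at the known pose must match the photo.
    rendered = nerf.render(input_pose)              # (3, H, W), hypothetical API
    loss_input = F.mse_loss(rendered, input_image)

    # 2) General image prior on an arbitrary novel view: a score-distillation-style
    #    diffusion loss, conditioned on the language guidance so that different
    #    views stay semantically consistent with the input image.
    novel_view = nerf.render(sample_view_pose())
    loss_prior = diffusion.sds_loss(novel_view, prompt=caption)  # hypothetical

    # 3) Geometric regularization: align rendered depth at the input view with the
    #    monocular estimate (a scale/shift-invariant comparison would be more
    #    robust; plain MSE is used here only to keep the sketch short).
    rendered_depth = nerf.render_depth(input_pose)  # hypothetical API
    loss_geom = F.mse_loss(rendered_depth, depth_estimate)

    # Placeholder weights; the paper balances its losses with its own schedule.
    loss = loss_input + 1.0 * loss_prior + 0.1 * loss_geom
    loss.backward()
    optimizer.step()
    return loss.item()

Note the division of labor this loop implies: the diffusion model stays frozen and gradients flow only into the NeRF parameters, which is how a purely 2D prior ends up shaping 3D content, while the input-view and depth terms anchor the optimization to the one observed image.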