3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation

Paper Title

3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation

Authors

Jiang, Zutao; Lu, Guansong; Liang, Xiaodan; Zhu, Jihua; Zhang, Wei; Chang, Xiaojun; Xu, Hang

Abstract

Text-guided 3D object generation aims to generate 3D objects described by user-defined captions, offering a flexible way to visualize what we imagine. Although some works have been devoted to this challenging task, they either use explicit 3D representations (e.g., meshes), which lack texture and require post-processing to render photo-realistic views, or require time-consuming optimization for each individual case. Here, we make the first attempt to achieve generic text-guided cross-category 3D object generation via a new 3D-TOGO model, which integrates a text-to-views generation module and a views-to-3D generation module. The text-to-views generation module is designed to generate different views of the target 3D object given an input caption. Prior-guidance, caption-guidance, and view contrastive learning are proposed to achieve better view consistency and caption similarity. Meanwhile, a pixelNeRF model is adopted for the views-to-3D generation module to obtain an implicit 3D neural representation from the previously generated views. Our 3D-TOGO model generates 3D objects in the form of a neural radiance field with good texture and requires no time-consuming per-caption optimization. In addition, 3D-TOGO can control the category, color, and shape of the generated 3D objects through the input caption. Extensive experiments on the largest 3D object dataset (i.e., ABO) verify that 3D-TOGO generates higher-quality 3D objects according to the input captions across 98 different categories, in terms of PSNR, SSIM, LPIPS, and CLIP-score, compared with text-NeRF and Dreamfields.
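The abstract lists PSNR among its evaluation metrics. As a reminder of what that number measures, here is a minimal sketch of the standard PSNR definition in pure Python. It is not the authors' evaluation code: the function name `psnr`, the flat-sequence image representation, and the default pixel range of [0, 1] are all assumptions made for illustration.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio (dB) between two equally sized pixel sequences.

    Hypothetical helper: images are assumed to be flattened to 1-D sequences
    with pixel values in [0, max_value].
    """
    if len(reference) == 0 or len(reference) != len(rendered):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a rendered view closer to the reference view. SSIM, LPIPS, and CLIP-score capture structural, perceptual, and text-image similarity respectively and are not covered by this sketch.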
