Paper Title

UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance

Paper Authors

Wei Li, Xue Xu, Xinyan Xiao, Jiachen Liu, Hu Yang, Guohao Li, Zhanpeng Wang, Zhifan Feng, Qiaoqiao She, Yajuan Lyu, Hua Wu

Paper Abstract

Diffusion generative models have recently greatly improved the power of text-conditioned image generation. Existing image generation models mainly include text-conditional diffusion models and cross-modal guided diffusion models, which are good at small-scene image generation and complex-scene image generation, respectively. In this work, we propose a simple yet effective approach, namely UPainting, to unify simple and complex scene image generation, as shown in Figure 1. Based on architecture improvements and diverse guidance schedules, UPainting effectively integrates cross-modal guidance from a pretrained image-text matching model into a text-conditional diffusion model that uses a pretrained Transformer language model as the text encoder. Our key finding is that combining the power of a large-scale Transformer language model in understanding language with that of an image-text matching model in capturing cross-modal semantics and style is effective in improving the sample fidelity and image-text alignment of image generation. In this way, UPainting has a more general image generation capability and can generate images of both simple and complex scenes more effectively. To comprehensively compare text-to-image models, we further create a more general benchmark, UniBench, with well-written Chinese and English prompts covering both simple and complex scenes. We compare UPainting with recent models and find that UPainting greatly outperforms them in terms of caption similarity and image fidelity in both simple and complex scenes. UPainting project page: https://upainting.github.io/.
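To make the mechanism described in the abstract more concrete, below is a minimal, hypothetical sketch of one guided denoising step that combines classifier-free guidance from a text-conditional diffusion model with a gradient-based cross-modal guidance term from an image-text matching model. It is not the authors' implementation: the interfaces (`unet`, `match_score`), the guidance scales, and the x0-prediction approximation are all illustrative assumptions.

```python
import torch

def guided_denoise_step(unet, match_score, x_t, t, alpha_bar_t,
                        text_emb, null_emb, text_tokens,
                        cfg_scale=7.5, cm_scale=100.0):
    """One denoising step combining classifier-free guidance with
    cross-modal (image-text matching) guidance.

    Assumed interfaces (not from the paper):
      unet(x, t, cond)         -> predicted noise eps
      match_score(img, tokens) -> image-text similarity score
    """
    x_t = x_t.detach().requires_grad_(True)

    # Classifier-free guidance: mix conditional and unconditional predictions.
    eps_cond = unet(x_t, t, text_emb)
    eps_uncond = unet(x_t, t, null_emb)
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)

    # Estimate the clean image x0 from x_t and eps, score it with the
    # image-text matching model, and take the gradient w.r.t. x_t.
    x0_pred = (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5
    score = match_score(x0_pred, text_tokens).sum()
    grad = torch.autograd.grad(score, x_t)[0]

    # Shift the noise prediction against the matching-score gradient
    # (classifier-guidance-style update applied to eps).
    eps_guided = eps - (1.0 - alpha_bar_t) ** 0.5 * cm_scale * grad
    return eps_guided.detach()
```

In a full sampler, `eps_guided` would replace the raw noise prediction when computing x at step t-1; the paper additionally varies the guidance schedule over timesteps, which this sketch omits.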
