Paper Title
Progressive Text-to-Image Generation
Authors
Abstract
Recently, Vector Quantized AutoRegressive (VQ-AR) models have shown remarkable results in text-to-image synthesis by predicting discrete image tokens in the latent space from top left to bottom right, treating all tokens equally. Although this simple generative process works surprisingly well, is it the best way to generate an image? For instance, human creation tends to proceed from the outline of an image to its fine details, whereas VQ-AR models do not consider the relative importance of image patches. In this paper, we present a progressive model for high-fidelity text-to-image generation. The proposed method creates new image tokens from coarse to fine, in parallel, based on the existing context, and applies this procedure recursively together with the proposed error revision mechanism until the image sequence is complete. The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable. Extensive experiments on the MS COCO benchmark demonstrate that the progressive model produces significantly better FID scores than the previous VQ-AR method across a wide variety of categories and aspects. Moreover, parallel generation in each step allows more than $13\times$ inference acceleration with only a slight loss in performance.
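To make the decoding loop described in the abstract more concrete, the following is a toy sketch of iterative parallel token generation with a revision step. It is not the paper's implementation: the predictor `toy_predict` is a random stand-in for a trained model, and the confidence ranking, the linear fill schedule, and the `revise_threshold` parameter are all assumptions introduced here for illustration.

```python
import random

MASK = -1  # placeholder for a position that has not been generated yet


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for a trained model.

    Returns one (token, confidence) pair per position. A real model would
    condition on the text prompt and on the tokens generated so far.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def progressive_decode(seq_len, vocab_size, steps, revise_threshold=0.1, seed=0):
    """Fill a token sequence over several parallel steps (assumed schedule)."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        preds = toy_predict(tokens, vocab_size, rng)

        # Revision (assumed form): re-mask committed tokens that the
        # predictor now scores below the threshold, so they can be redone.
        for i, (_, conf) in enumerate(preds):
            if tokens[i] != MASK and conf < revise_threshold:
                tokens[i] = MASK

        # Linear schedule (assumption): after step k, k/steps of the
        # sequence should be filled.
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        target_filled = seq_len * step // steps
        n_new = target_filled - (seq_len - len(masked))

        # Commit the most confident masked positions in parallel.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:max(n_new, 0)]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(progressive_decode(seq_len=16, vocab_size=8, steps=4))
```

With `steps` much smaller than `seq_len`, the loop needs far fewer model calls than one-token-at-a-time decoding, which is the general source of the speedup the abstract reports. The actual ordering and revision rules of the paper may differ from this sketch.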