Paper title
Simultaneous Multiple-Prompt Guided Generation Using Differentiable Optimal Transport
Paper authors
Paper abstract
Recent advances in deep learning, such as powerful generative models and joint text-image embeddings, have provided the computational creativity community with new tools, opening new perspectives for artistic pursuits. Text-to-image synthesis approaches that operate by generating images from text cues provide a case in point. These images are generated with a latent vector that is progressively refined to agree with text cues. To do so, patches are sampled within the generated image and compared with the text prompts in the common text-image embedding space; the latent vector is then updated, using gradient descent, to reduce the mean (average) distance between these patches and text cues. While this approach provides artists with ample freedom to customize the overall appearance of images through their choice of generative model, the reliance on a simple criterion (mean of distances) often causes mode collapse: the entire image is drawn to the average of all text cues, thereby losing their diversity. To address this issue, we propose using matching techniques found in the optimal transport (OT) literature, resulting in images that are able to faithfully reflect a wide diversity of prompts. We provide numerous illustrations showing that OT avoids some of the pitfalls arising from estimating vectors with mean distances, and demonstrate the capacity of our proposed method to perform better in experiments, qualitatively and quantitatively.
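The contrast between the mean-of-distances criterion and an OT matching criterion can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which OT solver, cost function, or marginals are used, so the entropy-regularized Sinkhorn iterations, the cosine cost, the uniform marginals, and all function names below (`cosine_cost`, `mean_distance_loss`, `sinkhorn_loss`) are assumptions chosen for illustration. The embeddings are toy 2-D vectors standing in for a joint text-image embedding space.

```python
import numpy as np


def cosine_cost(patches, prompts):
    """Pairwise cosine distance, shape (n_patches, n_prompts)."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    q = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    return 1.0 - p @ q.T


def mean_distance_loss(cost):
    """Baseline criterion: average distance over all patch-prompt pairs."""
    return float(cost.mean())


def sinkhorn_loss(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT cost between uniform weights on patches and prompts.

    Assumption: uniform marginals and plain Sinkhorn scaling iterations.
    Returns the transport cost and the transport plan.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # mass on each patch
    b = np.full(m, 1.0 / m)          # mass on each prompt
    kernel = np.exp(-cost / epsilon)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)         # match row marginals
        v = b / (kernel.T @ u)       # match column marginals
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan


# Two orthogonal toy "prompt" embeddings.
prompts = np.array([[1.0, 0.0], [0.0, 1.0]])

# Diverse image: half of the patches match each prompt.
diverse = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

# Collapsed image: every patch sits at the average of the prompts.
collapsed = np.tile(prompts.mean(axis=0), (4, 1))

for name, patches in [("diverse", diverse), ("collapsed", collapsed)]:
    cost = cosine_cost(patches, prompts)
    ot_cost, _ = sinkhorn_loss(cost)
    print(name, "mean:", mean_distance_loss(cost), "OT:", ot_cost)
```

In this toy setting the mean criterion scores the collapsed image better than the diverse one, while the OT criterion reverses the ranking, because each patch only pays for the prompt it is matched to. In an actual pipeline the same loss would be written in an automatic-differentiation framework so that gradients can flow back to the latent vector.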