Paper Title
Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer
Paper Authors
Paper Abstract
Although deep generative models have gained a lot of attention, most of the existing works are designed for unimodal generation. In this paper, we explore a new method for unconditional image-text pair generation. We design Multimodal Cross-Quantization VAE (MXQ-VAE), a novel vector quantizer for joint image-text representations, with which we discover that a joint image-text representation space is effective for semantically consistent image-text pair generation. To learn a multimodal semantic correlation in a quantized space, we combine VQ-VAE with a Transformer encoder and apply an input masking strategy. Specifically, MXQ-VAE accepts a masked image-text pair as input and learns a quantized joint representation space, so that the input can be converted to a unified code sequence; we then perform unconditional image-text pair generation with this code sequence. Extensive experiments show the correlation between the quantized joint space and the multimodal generation capability on synthetic and real-world datasets. In addition, we demonstrate the superiority of our approach in both aspects over several baselines. The source code is publicly available at: https://github.com/ttumyche/MXQ-VAE.
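The abstract describes a pipeline in which a masked image-text pair is encoded, fused by a Transformer encoder, and quantized into a single unified code sequence. The following is a minimal, illustrative sketch of that idea under assumptions not stated in the abstract (patch size, model dimensions, layer counts, and the class name `MXQVAESketch` are all hypothetical); it is not the paper's implementation, which is available at the repository above.

```python
# Illustrative sketch: masked image-text pair -> joint Transformer fusion ->
# shared-codebook quantization -> one unified code sequence.
import torch
import torch.nn as nn


class MXQVAESketch(nn.Module):
    def __init__(self, vocab_size=1000, codebook_size=512, dim=256):
        super().__init__()
        # Image patches -> embeddings (assumed 16x16 patches on 64x64 images).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Text token ids -> embeddings.
        self.text_embed = nn.Embedding(vocab_size, dim)
        # Transformer encoder mixing both modalities into a joint space.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        # Shared codebook for the joint (cross-modal) quantization.
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, image, text_ids):
        # image: (B, 3, 64, 64); text_ids: (B, T). Masking of patches/tokens
        # would be applied to these inputs before this call.
        img_tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, dim)
        txt_tokens = self.text_embed(text_ids)                           # (B, T, dim)
        joint = torch.cat([img_tokens, txt_tokens], dim=1)               # (B, N+T, dim)
        fused = self.fusion(joint)
        # Nearest-neighbour lookup against the shared codebook yields one
        # unified code sequence covering both modalities.
        book = self.codebook.weight.unsqueeze(0).expand(fused.size(0), -1, -1)
        codes = torch.cdist(fused, book).argmin(dim=-1)                  # (B, N+T)
        quantized = self.codebook(codes)
        return codes, quantized


# Usage: the unified code sequence is what a downstream autoregressive model
# would be trained on for unconditional image-text pair generation.
model = MXQVAESketch()
codes, quantized = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 12)))
print(codes.shape)  # torch.Size([2, 28]) -> 16 image codes + 12 text codes
```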