CM3: A Causal Masked Multimodal Model of the Internet

Paper title

CM3: A Causal Masked Multimodal Model of the Internet

Authors

Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer

Abstract

We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of their original positions. The causal masking objective provides a type of hybrid of the more common causal and masked language models, by enabling full generative modeling while also providing bidirectional context when generating the masked spans. We train causally masked language-image models on large-scale web and Wikipedia articles, where each document contains all of the text, hypertext markup, hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they appear in the original HTML source (before masking). The resulting CM3 models can generate rich, structured, multi-modal outputs while conditioning on arbitrary masked document contexts, and thereby implicitly learn a wide range of text, image, and cross-modal tasks. They can be prompted to recover, in a zero-shot fashion, the functionality of models such as DALL-E, GENRE, and HTLM. We set a new state of the art in zero-shot summarization, entity linking, and entity disambiguation while maintaining competitive performance in the fine-tuning setting. We can generate images unconditionally, generate images conditioned on text (like DALL-E), and do captioning, all in a zero-shot setting with a single model.
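
The span-moving step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `causal_mask`, the `<mask:i>` sentinel format, and the choice to pass spans in explicitly (rather than sample them) are all assumptions made for illustration. It only shows the idea that masked spans are cut from their original positions and generated at the end of the string, so that a left-to-right model sees the surrounding context on both sides before producing them.

```python
from typing import List, Sequence, Tuple


def causal_mask(tokens: Sequence[str], spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Move each [start, end) span to the end of the token sequence.

    Each span is replaced in place by a sentinel token, then re-emitted
    after the rest of the document, prefixed by the same sentinel.
    The "<mask:i>" sentinel format is hypothetical, not taken from the paper.
    """
    ordered = sorted(spans)

    # Spans must be non-empty, within range, and non-overlapping.
    prev_end = 0
    for start, end in ordered:
        if not (prev_end <= start < end <= len(tokens)):
            raise ValueError(f"invalid or overlapping span: ({start}, {end})")
        prev_end = end

    body: List[str] = []  # document with spans replaced by sentinels
    tail: List[str] = []  # sentinels followed by the spans they stand for
    cursor = 0
    for i, (start, end) in enumerate(ordered):
        sentinel = f"<mask:{i}>"
        body.extend(tokens[cursor:start])
        body.append(sentinel)
        tail.append(sentinel)
        tail.extend(tokens[start:end])
        cursor = end
    body.extend(tokens[cursor:])

    return body + tail


if __name__ == "__main__":
    doc = "the cat sat on the mat".split()
    print(causal_mask(doc, [(1, 3)]))
    # ['the', '<mask:0>', 'on', 'the', 'mat', '<mask:0>', 'cat', 'sat']
```

With no spans the sequence is returned unchanged, which corresponds to ordinary left-to-right language modeling; with spans, the tokens after the final sentinel are generated with the whole surrounding document already in context.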
