Paper Title
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Paper Authors
Paper Abstract
Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic description of an image can span different granularities, from simple to comprehensive — an individual label, a phrase, and a natural sentence — we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification (MOC), Masked Region Phrase Generation (MRPG), Image-Sentence Matching (ISM), and Masked Sentence Generation (MSG). In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it to four vision-language perception and generation downstream tasks.
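The "inter-modal interaction" at the heart of the sentence decoder is cross-attention: sentence-side tokens query the object encoder's region features. The following is a minimal NumPy sketch of that mechanism only; all tensor shapes, variable names, and the single-head formulation are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: (num_tokens, d)  -- e.g., sentence-encoder outputs
    keys, values: (num_regions, d) -- e.g., object-encoder outputs
    Returns fused token features of shape (num_tokens, d).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (num_tokens, num_regions)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ values

# Hypothetical dimensions: 5 sentence tokens, 3 image regions, width 8.
rng = np.random.default_rng(0)
sentence_tokens = rng.normal(size=(5, 8))
object_features = rng.normal(size=(3, 8))
fused = cross_attention(sentence_tokens, object_features, object_features)
```

In a full two-stream model this fusion step would sit inside each decoder layer, interleaved with self-attention over the sentence tokens; the sketch isolates only the cross-modal part.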