Paper Title
I-Tuning: Tuning Frozen Language Models with Image for Lightweight Image Captioning
Paper Authors
Paper Abstract
Image captioning is a traditional vision-and-language task that aims to generate a natural-language description of an image. Recent studies focus on scaling up the model size and the amount of training data, which significantly increases the cost of model training. Unlike these heavy-cost models, we introduce a lightweight image captioning framework (I-Tuning) that contains only a small number of trainable parameters. We design a novel I-Tuning cross-attention module to connect a frozen (non-trainable) pre-trained language decoder, GPT2, and a frozen vision encoder, CLIP-ViT. Since most parameters do not need to be updated during training, our framework is lightweight and fast to train. Experimental results on three image captioning benchmarks show that our framework achieves comparable or better performance than large-scale baseline systems, while containing up to 10 times fewer trainable parameters and requiring far less training data than state-of-the-art baselines.
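The core idea described in the abstract, a trainable cross-attention bridge between two frozen backbones, can be illustrated with a minimal numpy sketch. This is not the paper's actual I-Tuning module; the function and variable names (`cross_attention`, `Wq`, `Wk`, `Wv`) are illustrative assumptions. The key point it shows is that queries come from the frozen decoder's hidden states, keys and values come from the frozen image features, and only the small projection matrices are trainable.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_hidden, image_feats, Wq, Wk, Wv):
    """Single-head cross-attention sketch.

    text_hidden: (T, d) hidden states from the frozen language decoder
    image_feats: (P, d) patch features from the frozen vision encoder
    Wq, Wk, Wv:  (d, d_k) projections -- the only trainable parameters here
    """
    Q = text_hidden @ Wq           # queries from decoder states
    K = image_feats @ Wk           # keys from image patches
    V = image_feats @ Wv           # values from image patches
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V     # (T, d_k): image-conditioned update

# Toy shapes: 4 text positions, 10 image patches, width 8, head dim 16.
rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))
img = rng.standard_normal((10, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 16)) for _ in range(3))
out = cross_attention(text, img, Wq, Wk, Wv)
print(out.shape)  # (4, 16)
```

In this setup the frozen backbones contribute no gradient-updated parameters; only the projection matrices (and in the real framework, the rest of the inserted module) are optimized, which is what keeps the trainable parameter count small.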