Paper Title
Controllable Image Captioning via Prompting
Paper Authors
Paper Abstract
Despite the remarkable progress of image captioning, existing captioners typically lack the controllable capability to generate desired image captions, e.g., describing the image in a rough or detailed manner, from a factual or emotional viewpoint, etc. In this paper, we show that a unified model is qualified to perform well in diverse domains and to freely switch among multiple styles. Such a controllable capability is achieved by embedding prompt learning into the image captioning framework. Specifically, we design a set of prompts to fine-tune the pre-trained image captioner. These prompts allow the model to absorb stylized data from different domains for joint training, without performance degradation in any single domain. Furthermore, we optimize the prompts with learnable vectors in the continuous word embedding space, avoiding heuristic prompt engineering while exhibiting superior performance. At inference, our model generates the desired stylized captions by selecting the corresponding prompts. Extensive experiments verify the controllable capability of the proposed method. Notably, we achieve outstanding performance on two diverse image captioning benchmarks, the COCO Karpathy split and TextCaps, using a unified model.
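The abstract does not specify implementation details, so the following is only a minimal sketch of the core idea it describes: learnable prompt vectors living in the continuous word embedding space, one set per style, prepended to the caption tokens of a pre-trained captioner. The class name `SoftPromptCaptioner`, the decoder interface, and the `prompt_len` default are hypothetical illustrations, not taken from the paper.

```python
# A minimal sketch, assuming PyTorch; the paper's actual captioner
# architecture, prompt length, and training recipe are not given in
# the abstract.
import torch
import torch.nn as nn

class SoftPromptCaptioner(nn.Module):
    """Wraps a pre-trained captioner with learnable per-style prompts."""

    def __init__(self, decoder: nn.Module, embed: nn.Embedding,
                 num_styles: int, prompt_len: int = 8):
        super().__init__()
        self.decoder = decoder   # pre-trained caption decoder (fine-tuned)
        self.embed = embed       # pre-trained word embedding table
        embed_dim = embed.embedding_dim
        # One learnable prompt matrix (prompt_len x embed_dim) per style,
        # optimized directly in the continuous word embedding space
        # instead of hand-crafted discrete prompt tokens.
        self.prompts = nn.Parameter(
            torch.randn(num_styles, prompt_len, embed_dim) * 0.02)

    def forward(self, image_feats, caption_ids, style_id: int):
        # Prepend the chosen style's prompt vectors to the token embeddings.
        tok_emb = self.embed(caption_ids)                    # (B, T, D)
        prompt = self.prompts[style_id]                      # (P, D)
        prompt = prompt.unsqueeze(0).expand(tok_emb.size(0), -1, -1)
        dec_in = torch.cat([prompt, tok_emb], dim=1)         # (B, P+T, D)
        return self.decoder(image_feats, dec_in)
```

Under this sketch, switching styles at inference amounts to passing a different `style_id`, which swaps in a different learned prompt, e.g., one prompt trained on factual COCO data and another on TextCaps-style data.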