Paper Title
ILLUME: Rationalizing Vision-Language Models through Human Interactions
Paper Authors
Paper Abstract
Bootstrapping from pre-trained language models has proven to be an efficient approach for building vision-language models (VLMs) for tasks such as image captioning or visual question answering. However, the outputs of these models rarely align with users' rationales for specific answers. To improve this alignment and reinforce commonsense reasoning, we propose a tuning paradigm based on human interactions with machine-generated data. Our ILLUME executes the following loop: given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides feedback via preference selection, which is then used for fine-tuning. This loop increases the training data and gradually carves out the VLM's rationalization capabilities so that they align with human intent. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly less training data and requiring only minimal feedback.
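The sample-critique-finetune loop described in the abstract can be summarized in pseudocode. The sketch below is only an illustration of that loop as stated, not the authors' implementation: the `vlm` object, its `generate` and `finetune` methods, and the helper names (`sample_rationales`, `human_preference_selection`, `illume_loop`) are all hypothetical placeholders.

```python
# Minimal sketch of an ILLUME-style tuning loop, reconstructed only from the
# abstract. All names and interfaces here are assumptions, not the paper's API.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Example:
    image: object                      # image input for the VLM
    question: str
    answer: str
    rationales: List[str] = field(default_factory=list)


def sample_rationales(vlm, ex: Example, k: int = 5) -> List[str]:
    """Sample k candidate rationales for an image-question-answer prompt."""
    prompt = f"Q: {ex.question}\nA: {ex.answer}\nBecause:"
    # Assumes the VLM exposes a generate(image, prompt, ...) method.
    return [vlm.generate(ex.image, prompt, temperature=1.0) for _ in range(k)]


def human_preference_selection(candidates: List[str]) -> List[str]:
    """Stand-in for the human critic who keeps only acceptable rationales."""
    # In practice this step is interactive; here it is a placeholder filter.
    return [c for c in candidates if c]


def illume_loop(vlm, dataset: List[Example], rounds: int = 3, k: int = 5):
    """Iteratively grow training data from human-approved rationales."""
    training_data: List[Example] = []
    for _ in range(rounds):
        for ex in dataset:
            candidates = sample_rationales(vlm, ex, k)
            preferred = human_preference_selection(candidates)
            if preferred:
                training_data.append(
                    Example(ex.image, ex.question, ex.answer, preferred)
                )
        # Fine-tune on all human-approved rationales collected so far.
        vlm.finetune(training_data)
    return vlm
```

Under this reading, each round enlarges the pool of approved rationales, so the fine-tuning signal grows over iterations while the human effort per round stays limited to preference selection.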