Paper Title


PLOT: Prompt Learning with Optimal Transport for Vision-Language Models

Authors

Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, Kun Zhang

Abstract


With the increasing attention to large vision-language models such as CLIP, there has been a significant amount of effort dedicated to building efficient prompts. Unlike conventional methods of only learning one single prompt, we propose to learn multiple comprehensive prompts to describe diverse characteristics of categories such as intrinsic attributes or extrinsic contexts. However, directly matching each prompt to the same visual feature is problematic, as it pushes the prompts to converge to one point. To solve this problem, we propose to apply optimal transport to match the vision and text modalities. Specifically, we first model images and the categories with visual and textual feature sets. Then, we apply a two-stage optimization strategy to learn the prompts. In the inner loop, we optimize the optimal transport distance to align visual features and prompts by the Sinkhorn algorithm, while in the outer loop, we learn the prompts by this distance from the supervised data. Extensive experiments are conducted on the few-shot recognition task and the improvement demonstrates the superiority of our method. The code is available at https://github.com/CHENGY12/PLOT.
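The inner loop described in the abstract aligns a set of local visual features with a set of prompt features by minimizing an entropic-regularized optimal transport distance via Sinkhorn iterations. Below is a minimal NumPy sketch of that Sinkhorn step; the cost construction, variable names, and hyperparameters are illustrative assumptions, not the authors' implementation (see the linked repository for the actual code).

```python
import numpy as np

def sinkhorn(cost, mu, nu, eps=0.1, n_iters=100):
    """Entropic-regularized OT via Sinkhorn iterations (illustrative sketch).

    cost: (M, N) cost matrix, e.g. 1 - cosine similarity between
          M local visual features and N prompt features.
    mu:   (M,) marginal over visual features (sums to 1).
    nu:   (N,) marginal over prompts (sums to 1).
    Returns the (M, N) transport plan whose marginals match mu and nu.
    """
    K = np.exp(-cost / eps)  # Gibbs kernel from the regularized problem
    u = np.ones_like(mu)
    for _ in range(n_iters):
        # Alternate scaling updates so the plan's marginals approach mu, nu
        u = mu / (K @ (nu / (K.T @ u)))
    v = nu / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)

# Toy example: transport 4 visual patches onto 2 prompts
rng = np.random.default_rng(0)
C = rng.random((4, 2))
T = sinkhorn(C, np.full(4, 0.25), np.full(2, 0.5))
```

In the outer loop, the resulting OT distance (e.g. `(T * C).sum()`) would serve as the matching score that drives prompt learning from the supervised data.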
