Paper Title
CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose
Paper Authors
Paper Abstract
Animal pose estimation is challenging for existing image-based methods because of limited training data and large intra- and inter-species variances. Motivated by the progress of visual-language research, we propose that pre-trained language models (e.g., CLIP) can facilitate animal pose estimation by providing rich prior knowledge for describing animal keypoints in text. However, we find that building effective connections between pre-trained language models and visual animal keypoints is non-trivial, since the gap between text-based descriptions and keypoint-based visual features of animal pose can be significant. To address this issue, we introduce a novel prompt-based Contrastive learning scheme for connecting Language and AniMal Pose (CLAMP) effectively. CLAMP attempts to bridge the gap by adapting the text prompts to the animal keypoints during network training. The adaptation is decomposed into spatial-aware and feature-aware processes, and two novel contrastive losses are devised correspondingly. In practice, CLAMP enables the first cross-modal animal pose estimation paradigm. Experimental results show that our method achieves state-of-the-art performance under the supervised, few-shot, and zero-shot settings, outperforming image-based methods by a large margin.
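To make the core idea concrete, below is a minimal, hypothetical sketch of how a feature-aware contrastive loss could align per-keypoint visual features with CLIP-style text prompt embeddings. This is not the paper's released implementation; the function name, tensor shapes, and temperature value are illustrative assumptions, and the spatial-aware loss is omitted.

```python
# A minimal illustrative sketch (NOT the official CLAMP code) of a
# feature-aware contrastive loss that aligns K keypoint prompt embeddings
# with visual features pooled at K keypoint locations.
import torch
import torch.nn.functional as F

def feature_aware_contrastive_loss(kpt_feats, text_embeds, temperature=0.07):
    """kpt_feats: (B, K, D) visual features at K keypoint locations.
    text_embeds: (K, D) embeddings of K learnable keypoint text prompts.
    Returns a symmetric InfoNCE-style loss that pulls each keypoint
    feature toward its own prompt and away from the other K-1 prompts."""
    B, K, D = kpt_feats.shape
    v = F.normalize(kpt_feats, dim=-1)      # (B, K, D), unit-norm visual features
    t = F.normalize(text_embeds, dim=-1)    # (K, D), unit-norm prompt embeddings
    logits = v @ t.t() / temperature        # (B, K, K): each keypoint vs. every prompt
    targets = torch.arange(K, device=logits.device).expand(B, K)
    # Visual-to-text and text-to-visual directions, averaged.
    loss_v2t = F.cross_entropy(logits.reshape(B * K, K), targets.reshape(-1))
    loss_t2v = F.cross_entropy(logits.transpose(1, 2).reshape(B * K, K),
                               targets.reshape(-1))
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage: 2 images, 17 keypoints, 512-dim CLIP-style embeddings.
loss = feature_aware_contrastive_loss(torch.randn(2, 17, 512),
                                      torch.randn(17, 512))
print(loss.item())
```

In this sketch, the prompt embeddings would be produced by the text encoder from learnable prompts and updated during training, which is one plausible reading of "adapting the text prompts to the animal keypoints" described in the abstract.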