Paper Title
TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition
Paper Authors
Paper Abstract
Recognizing transformation types applied to a video clip (RecogTrans) is a long-established paradigm for self-supervised video representation learning, yet it achieves much lower performance than instance discrimination approaches (InstDisc) in recent works. However, based on a thorough comparison of representative RecogTrans and InstDisc methods, we observe great potential of RecogTrans on both semantic-related and temporal-related downstream tasks. Being based on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals in pre-training. To mitigate this problem, we develop TransRank, a unified framework for recognizing Transformations in a Ranking formulation. TransRank provides accurate supervision signals by recognizing transformations relatively, consistently outperforming classification-based formulations. Meanwhile, the unified framework can be instantiated with an arbitrary set of temporal or spatial transformations, demonstrating good generality. With the ranking-based formulation and several empirical practices, we achieve competitive performance on video retrieval and action recognition. Under the same setting, TransRank surpasses the previous state-of-the-art method by 6.4% on UCF101 and 8.3% on HMDB51 for action recognition (Top-1 Acc), and improves video retrieval on UCF101 by 20.4% (R@1). These promising results validate that RecogTrans is still a paradigm worth exploring for video self-supervised learning. Code will be released at https://github.com/kennymckormick/TransRank.
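To make the "ranking instead of hard-label classification" idea concrete, below is a minimal, hypothetical sketch of a pairwise margin ranking loss over per-clip transformation scores. It is an illustration of the general ranking formulation the abstract describes, not the paper's exact TransRank objective; the function name, margin value, and score layout are assumptions for this example.

```python
def ranking_transform_loss(scores, applied_idx, margin=1.0):
    """Pairwise margin ranking loss over transformation scores for ONE clip.

    scores      : list of floats, one score per candidate transformation
                  (e.g. playback speeds), produced by the model for this clip.
    applied_idx : index of the transformation actually applied to the clip.
    margin      : required gap between the applied transformation's score and
                  every other transformation's score (hypothetical value).

    Instead of a hard-label cross-entropy over transformation classes, the
    applied transformation only needs to rank above the others for the SAME
    clip, which is a relative (per-clip) supervision signal.
    """
    pos = scores[applied_idx]
    # hinge on each (applied, other) pair: penalize when pos - s < margin
    losses = [max(0.0, margin - (pos - s))
              for i, s in enumerate(scores) if i != applied_idx]
    return sum(losses) / len(losses)
```

For example, if the applied transformation already outscores the alternatives by more than the margin (`scores = [3.0, 0.5, 0.2]`, `applied_idx = 0`), the loss is zero; if it ranks below another candidate (`scores = [0.5, 1.0, 0.2]`), the hinge terms become positive and push the applied transformation's score up relative to the others.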