Paper Title
Universal Prototype Transport for Zero-Shot Action Recognition and Localization
Paper Authors
Paper Abstract
This work addresses the problem of recognizing action categories in videos when no training examples are available. The current state-of-the-art enables such zero-shot recognition by learning universal mappings from videos to a semantic space, trained either on large-scale seen actions or on objects. While effective, we find that universal action and object mappings are biased toward specific regions in the semantic space. These biases lead to a fundamental problem: many unseen action categories are simply never inferred during testing. For example, on UCF-101, a quarter of the unseen actions are out of reach with a state-of-the-art universal action model. To that end, this paper introduces universal prototype transport for zero-shot action recognition. The main idea is to re-position the semantic prototypes of unseen actions by matching them to the distribution of all test videos. For universal action models, we propose to match distributions through a hyperspherical optimal transport from unseen action prototypes to the set of all projected test videos. The resulting transport couplings in turn determine the target prototype for each unseen action. Rather than directly using the target prototype as the final result, we re-position unseen action prototypes along the geodesic spanned by the original and target prototypes as a form of semantic regularization. For universal object models, we outline a variant that defines target prototypes based on an optimal transport between unseen action prototypes and object prototypes. Empirically, we show that universal prototype transport diminishes the biased selection of unseen action prototypes and boosts both universal action and object models for zero-shot classification and spatio-temporal localization.
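The pipeline the abstract describes (optimal transport from unseen action prototypes to projected test videos, coupling-weighted target prototypes, then geodesic re-positioning) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes L2-normalized embeddings, uniform transport marginals, and entropy-regularized (Sinkhorn) optimal transport; the function names, the cosine cost, and the regularization and interpolation parameters are all illustrative choices.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport with uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport coupling matrix

def slerp(p, q, t):
    """Point at fraction t along the hyperspherical geodesic from unit vector p to q."""
    omega = np.arccos(np.clip(p @ q, -1.0, 1.0))
    if omega < 1e-8:
        return p
    return (np.sin((1.0 - t) * omega) * p + np.sin(t * omega) * q) / np.sin(omega)

def transport_prototypes(prototypes, videos, t=0.5, reg=0.1):
    """Re-position unseen action prototypes toward their coupling-weighted targets.

    prototypes: (n, d) L2-normalized unseen action prototypes.
    videos:     (m, d) L2-normalized projected test videos (or object prototypes
                for the object-model variant).
    """
    cost = 1.0 - prototypes @ videos.T          # cosine cost on the hypersphere
    T = sinkhorn(cost, reg)
    targets = T @ videos                         # coupling-weighted barycenters
    targets /= np.linalg.norm(targets, axis=1, keepdims=True)
    # Semantic regularization: stop partway along the geodesic to the target.
    return np.array([slerp(p, q, t) for p, q in zip(prototypes, targets)])
```

With `t = 0` the original prototypes are kept; with `t = 1` they are replaced by the targets outright; intermediate values give the regularized re-positioning the abstract refers to.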