Paper Title
Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection
Paper Authors
Paper Abstract
Generic Boundary Detection (GBD) aims at locating the general boundaries that divide videos into semantically coherent and taxonomy-free units, and could serve as an important pre-processing step for long-form video understanding. Previous works often handle these different types of generic boundaries separately, with task-specific deep network designs ranging from simple CNNs to LSTMs. Instead, in this paper, we present Temporal Perceiver, a general Transformer-based architecture that offers a unified solution to the detection of arbitrary generic boundaries, ranging from shot-level and event-level to scene-level GBD. The core design is to introduce a small set of latent feature queries as anchors to compress the redundant video input into a fixed-size representation via cross-attention blocks. Thanks to the fixed number of latent units, the quadratic complexity of the attention operation is greatly reduced to a form linear in the number of input frames. Specifically, to explicitly leverage the temporal structure of videos, we construct two types of latent feature queries: boundary queries and context queries, which handle the semantic incoherence and coherence, respectively. Moreover, to guide the learning of the latent feature queries, we propose an alignment loss on the cross-attention maps to explicitly encourage the boundary queries to attend to the top boundary candidates. Finally, we present a sparse detection head on the compressed representation that directly outputs the final boundary detection results without any post-processing module. We test our Temporal Perceiver on a variety of GBD benchmarks. Our method obtains state-of-the-art results on all benchmarks with RGB single-stream features: SoccerNet-v2 (81.9% avg-mAP), Kinetics-GEBD (86.0% avg-F1), TAPOS (73.2% avg-F1), MovieScenes (51.9% AP and 53.1% mIoU) and MovieNet (53.3% AP and 53.2% mIoU), demonstrating the generalization ability of our Temporal Perceiver.
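To make the compression step concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a fixed set of K learnable boundary and context queries cross-attends over T frame features, so attention cost scales as O(T·K) rather than the O(T²) of frame self-attention. All names here (TemporalCompressor, num_boundary, alignment_loss, etc.) are illustrative assumptions rather than the authors' released code, and the alignment loss shown is only one plausible form of the attention-map supervision the abstract describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalCompressor(nn.Module):
    """Compress T frame features into K latent units via cross-attention.

    K = num_boundary + num_context is fixed, so the attention cost is
    O(T * K), i.e. linear in the number of input frames T.
    """
    def __init__(self, dim: int, num_boundary: int = 16,
                 num_context: int = 16, num_heads: int = 8):
        super().__init__()
        # Learnable latent queries: boundary queries are meant to latch onto
        # semantic change points, context queries onto coherent segments.
        self.boundary_queries = nn.Parameter(torch.randn(num_boundary, dim))
        self.context_queries = nn.Parameter(torch.randn(num_context, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor):
        # frames: (B, T, dim) per-frame RGB features.
        B = frames.size(0)
        queries = torch.cat([self.boundary_queries, self.context_queries], dim=0)
        queries = queries.unsqueeze(0).expand(B, -1, -1)          # (B, K, dim)
        # latents: (B, K, dim); attn: (B, K, T) cross-attention map,
        # which is what the alignment loss below supervises.
        latents, attn = self.cross_attn(queries, frames, frames)
        return self.norm(latents), attn

def alignment_loss(attn: torch.Tensor, boundary_mask: torch.Tensor,
                   num_boundary: int) -> torch.Tensor:
    """Illustrative alignment loss (one plausible form, not the paper's exact
    formulation): push the boundary queries' attention mass toward frames
    flagged as boundary candidates. boundary_mask: (B, T) in {0, 1}."""
    bnd_attn = attn[:, :num_boundary].mean(dim=1)   # (B, T) avg boundary attention
    target = boundary_mask / boundary_mask.sum(dim=1, keepdim=True).clamp(min=1)
    return F.mse_loss(bnd_attn, target)

# Usage with dummy inputs:
frames = torch.randn(2, 512, 256)       # B=2 clips, T=512 frames, dim=256
compressor = TemporalCompressor(dim=256)
latents, attn = compressor(frames)      # (2, 32, 256), (2, 32, 512)
```

In this sketch, a detection head would then operate on the 32 compressed latents instead of the 512 raw frames, which is the source of the efficiency gain the abstract claims.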