Title
A Generalized Framework for Video Instance Segmentation
Authors
Abstract
The handling of long videos with complex and occluded sequences has recently emerged as a new challenge in the video instance segmentation (VIS) community. However, existing methods have limitations in addressing this challenge. We argue that the biggest bottleneck in current approaches is the discrepancy between training and inference. To effectively bridge this gap, we propose a Generalized framework for VIS, namely GenVIS, that achieves state-of-the-art performance on challenging benchmarks without designing complicated architectures or requiring extra post-processing. The key contribution of GenVIS is the learning strategy, which includes a query-based training pipeline for sequential learning with a novel target label assignment. Additionally, we introduce a memory that effectively acquires information from previous states. Thanks to the new perspective, which focuses on building relationships between separate frames or clips, GenVIS can be flexibly executed in both online and semi-online manners. We evaluate our approach on popular VIS benchmarks, achieving state-of-the-art results on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS). Notably, we greatly outperform the state-of-the-art on the long VIS benchmark (OVIS), improving by 5.6 AP with a ResNet-50 backbone. Code is available at https://github.com/miranheo/GenVIS.
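To make the clip-wise, memory-based processing described above more concrete, the following is a minimal, hypothetical PyTorch sketch of semi-online inference in which instance queries from previous clips are carried forward through a memory. The class names (ClipSegmenter, QueryMemory), the exponential-moving-average memory update, and all tensor shapes are assumptions for illustration only and are not taken from the GenVIS implementation; see the repository above for the actual method.

import torch
import torch.nn as nn


class QueryMemory:
    """Keeps a running summary of instance queries from previous clips."""

    def __init__(self, momentum: float = 0.8):
        self.momentum = momentum
        self.state = None  # (num_queries, dim)

    def update(self, queries: torch.Tensor) -> torch.Tensor:
        if self.state is None:
            self.state = queries.detach()
        else:
            # Exponential moving average over past states (an assumed design,
            # standing in for "acquiring information from previous states").
            self.state = self.momentum * self.state + (1 - self.momentum) * queries.detach()
        return self.state


class ClipSegmenter(nn.Module):
    """Stand-in for a query-based clip-level segmenter head."""

    def __init__(self, dim: int = 256, num_queries: int = 100):
        super().__init__()
        self.num_queries = num_queries
        self.proj = nn.Linear(dim, dim)

    def forward(self, clip_feats: torch.Tensor, prev_queries: torch.Tensor) -> torch.Tensor:
        # Queries from the previous clip condition the current clip's queries,
        # so instance association is carried by the queries themselves rather
        # than by extra post-processing.
        q = self.proj(prev_queries)
        # ... cross-attention against clip_feats would go here in a real model ...
        return q


def run_video(frames: torch.Tensor, clip_len: int = 5) -> torch.Tensor:
    dim, num_queries = 256, 100
    model = ClipSegmenter(dim, num_queries)
    memory = QueryMemory()
    queries = torch.zeros(num_queries, dim)  # learned initial queries in practice

    # Process the video clip by clip; queries carry identities across clips.
    for start in range(0, frames.shape[0], clip_len):
        clip = frames[start:start + clip_len]   # (T, C, H, W)
        clip_feats = clip.flatten(1)             # placeholder "features"
        queries = model(clip_feats, queries)
        queries = memory.update(queries)          # fold in previous states
    return queries


if __name__ == "__main__":
    video = torch.randn(20, 3, 64, 64)  # dummy 20-frame video
    out = run_video(video, clip_len=5)
    print(out.shape)  # (100, 256)

In this toy setup, clip_len=1 would correspond to the online mode mentioned in the abstract, while clip_len>1 corresponds to the semi-online mode.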