论文标题
捍卫视频实例细分的在线模型
In Defense of Online Models for Video Instance Segmentation
论文作者
论文摘要
近年来,视频实例细分(VIS)在很大程度上是通过离线模型提出的,而在线模型由于其性能较低而逐渐吸引了人们的关注。但是,在线方法在处理长期视频序列和持续视频时具有固有的优势,而由于计算资源的限制,离线模型失败了。因此,如果在线模型可以实现可比甚至更好的性能,那将是非常可取的。通过解剖当前的在线模型和离线模型,我们证明了性能差距的主要原因是由特征空间中不同实例之间相似外观引起的框架之间存在错误的关联。观察到这一点,我们提出了一个基于对比度学习的在线框架,该框架能够学习更多的歧视实例嵌入以进行关联,并充分利用历史信息以达到稳定性。尽管它很简单,但我们的方法在三个基准测试中均优于在线和离线方法。具体来说,我们在YouTube-VIS 2019上实现了49.5 AP,比先前的在线和离线艺术分别取得了13.2 AP和2.1 AP的显着提高。此外,我们在OVIS上实现了30.2 AP,这是一个更具挑战性的数据集,具有大量的拥挤和遮挡,超过了14.8 AP的先前艺术。提出的方法在第四次大规模视频对象分割挑战(CVPR2022)的视频实例细分轨道中赢得了第一名。我们希望我们方法的简单性和有效性以及我们对当前方法的见解,可以阐明VIS模型的探索。
In recent years, video instance segmentation (VIS) has been largely advanced by offline models, while online models gradually attracted less attention possibly due to their inferior performance. However, online methods have their inherent advantage in handling long video sequences and ongoing videos while offline models fail due to the limit of computational resources. Therefore, it would be highly desirable if online models can achieve comparable or even better performance than offline models. By dissecting current online models and offline models, we demonstrate that the main cause of the performance gap is the error-prone association between frames caused by the similar appearance among different instances in the feature space. Observing this, we propose an online framework based on contrastive learning that is able to learn more discriminative instance embeddings for association and fully exploit history information for stability. Despite its simplicity, our method outperforms all online and offline methods on three benchmarks. Specifically, we achieve 49.5 AP on YouTube-VIS 2019, a significant improvement of 13.2 AP and 2.1 AP over the prior online and offline art, respectively. Moreover, we achieve 30.2 AP on OVIS, a more challenging dataset with significant crowding and occlusions, surpassing the prior art by 14.8 AP. The proposed method won first place in the video instance segmentation track of the 4th Large-scale Video Object Segmentation Challenge (CVPR2022). We hope the simplicity and effectiveness of our method, as well as our insight into current methods, could shed light on the exploration of VIS models.