Paper Title
Self-supervised Amodal Video Object Segmentation
Paper Authors
Paper Abstract
Amodal perception requires inferring the full shape of an object that is partially occluded. This task is particularly challenging on two levels: (1) it requires more information than what is contained in the instant retina or imaging sensor, and (2) it is difficult to obtain enough well-annotated amodal labels for supervision. To this end, this paper develops a new framework for Self-supervised amodal Video object segmentation (SaVos). Our method efficiently leverages the visual information of video temporal sequences to infer the amodal masks of objects. The key intuition is that the occluded part of an object can be explained away if that part is visible in other frames, possibly deformed, as long as the deformation can be reasonably learned. Accordingly, we derive a novel self-supervised learning paradigm that efficiently utilizes the visible object parts as supervision to guide training on videos. In addition to learning a type prior for completing the masks of known types, SaVos also learns a spatiotemporal prior, which is likewise useful for the amodal task and can generalize to unseen types. The proposed framework achieves state-of-the-art performance on the synthetic amodal segmentation benchmark FISHBOWL and the real-world benchmark KINS-Video-Car. Further, it lends itself well to being transferred to novel distributions using test-time adaptation, outperforming existing models even after the transfer to a new distribution.
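To make the self-supervision signal concrete, below is a minimal sketch in PyTorch of the idea stated in the abstract: the amodal mask predicted at frame t, once warped to frame t+1 by a learned deformation, must cover whatever is visible at t+1. This is not the authors' released implementation; all names here (warp_mask, visible_part_loss, flow_t1_to_t, and so on) are illustrative assumptions, and the full SaVos objective contains additional terms beyond this one.

```python
# Minimal sketch (assumed names, not the authors' code) of the
# "visible parts as supervision" signal for amodal video segmentation.
import torch
import torch.nn.functional as F


def warp_mask(mask: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a soft mask (B,1,H,W) with a dense flow field (B,2,H,W).

    `flow` maps each frame-(t+1) pixel back to its frame-t location, so the
    output is the frame-t mask expressed in frame-(t+1) coordinates.
    """
    b, _, h, w = mask.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(b, -1, -1, -1)
    coords = base + flow
    # grid_sample expects sampling locations normalized to [-1, 1], (x, y) last
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(mask, grid, align_corners=True)


def visible_part_loss(amodal_logits_t, flow_t1_to_t, visible_mask_t1):
    """Visible pixels at t+1 supervise the warped amodal prediction from t.

    Pixels that are NOT visible at t+1 are left unconstrained here: they may
    be occluded, so other frame pairs (or a learned shape prior) must explain
    them. This is only one piece of the training objective.
    """
    warped = warp_mask(torch.sigmoid(amodal_logits_t), flow_t1_to_t)
    vis = visible_mask_t1 > 0.5
    return F.binary_cross_entropy(
        warped.clamp(1e-6, 1 - 1e-6)[vis], visible_mask_t1[vis]
    )
```

Applied jointly across many frame pairs, a loss of this form is what lets occluded pixels be "explained away" by the frames in which they are visible, while the deformation model accounts for how the object moves and changes shape in between.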