Paper title
Video Object of Interest Segmentation
Authors
Abstract
In this work, we present a new computer vision task named video object of interest segmentation (VOIS). Given a video and a target image of interest, our objective is to simultaneously segment and track all objects in the video that are relevant to the target image. This problem combines the traditional video object segmentation task with an additional image that indicates the content users care about. Since no existing dataset is well suited to this new task, we construct a large-scale dataset called LiveVideos, which contains 2418 pairs of target images and live videos with instance-level annotations. In addition, we propose a transformer-based method for this task. We revisit Swin Transformer and design a dual-path structure to fuse video and image features. Then, a transformer decoder is employed to generate object proposals for segmentation and tracking from the fused features. Extensive experiments on the LiveVideos dataset show the superiority of our proposed method.
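The abstract does not say how the dual-path structure fuses video and image features. As a rough illustration only, the sketch below assumes a single-head cross-attention step in which each video token attends to the target-image tokens and adds the result back as a residual. The function name `cross_attention_fuse`, the absence of learned projections, and the residual addition are all assumptions made for this example, not details taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention_fuse(
    video_tokens: List[Vector], image_tokens: List[Vector]
) -> List[Vector]:
    """Fuse each video token with an attention-weighted mix of image tokens.

    Hypothetical simplification: one head, no learned query/key/value
    projections, and a plain residual addition. The paper's actual
    dual-path design may differ.
    """
    dim = len(video_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in video_tokens:
        # Scaled dot-product scores of this video token against every image token.
        weights = softmax([dot(query, key) * scale for key in image_tokens])
        # Attention-weighted average of the image tokens.
        context = [
            sum(w * key[i] for w, key in zip(weights, image_tokens))
            for i in range(dim)
        ]
        # Residual connection keeps the original video feature.
        fused.append([q + c for q, c in zip(query, context)])
    return fused
```

In this simplified form the output keeps one fused vector per video token, so a downstream decoder could consume it in place of the raw video features.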