Paper Title
Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation
Authors
Abstract
Video Instance Segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence. Most existing methods accomplish this task with a multi-stage, top-down approach that typically involves separate networks to detect and segment objects in each frame, followed by associating these detections across consecutive frames using a learned tracking head. In this work, however, we introduce a simple end-to-end trainable bottom-up approach that achieves instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach. Unlike contemporary frame-based models, our network pipeline processes an input video clip as a single 3D volume to incorporate temporal information. The central idea of our formulation is to solve the video instance segmentation task as a tag assignment problem, such that generating distinct tag values essentially separates individual object instances across the video sequence (each tag can be any arbitrary value between 0 and 1). To this end, we propose a novel spatio-temporal tagging loss that allows for sufficient separation of different objects as well as the necessary discrimination between different instances of the same object. Furthermore, we present a tag-based attention module that refines instance tags while concurrently learning instance propagation within a video. Evaluations demonstrate that our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, with the lowest run-time among state-of-the-art methods.
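The abstract does not spell out the spatio-temporal tagging loss, so the sketch below is only a rough illustration of the general idea: a pull/push-style objective over per-pixel tags in [0, 1], where pixels of the same instance are pulled toward one tag value across the whole clip and the mean tags of different instances are pushed apart. The function name `spatio_temporal_tagging_loss`, the `margin` value, and the tensor layout are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def spatio_temporal_tagging_loss(tags, masks, margin=0.3):
    """Illustrative pull/push tagging objective over a video clip (a sketch, not the paper's loss).

    tags:  float tensor (T, H, W), predicted per-pixel tag values in [0, 1] for a T-frame clip.
    masks: bool tensor (N, T, H, W), ground-truth masks for N object instances.
    """
    pull = tags.new_zeros(())
    means = []
    for inst in masks:                        # one (T, H, W) mask per instance
        inst = inst.bool()
        if not inst.any():
            continue
        inst_tags = tags[inst]                # tags of this instance's pixels across all frames
        mean_tag = inst_tags.mean()
        # pull term: pixels of one instance should share a single tag value over the whole clip
        pull = pull + ((inst_tags - mean_tag) ** 2).mean()
        means.append(mean_tag)

    # push term: mean tags of different instances should stay at least `margin` apart
    push = tags.new_zeros(())
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            push = push + torch.relu(margin - (means[i] - means[j]).abs())

    return pull + push

# Hypothetical usage: an 8-frame clip with two instances occupying the left and right halves.
tags = torch.rand(8, 64, 64)
masks = torch.zeros(2, 8, 64, 64, dtype=torch.bool)
masks[0, :, :, :32] = True
masks[1, :, :, 32:] = True
loss = spatio_temporal_tagging_loss(tags, masks)
```

Under this kind of objective, enforcing one consistent tag per instance over all frames is what would let distinct tag values separate instances across the video sequence, which matches the tag-assignment view described above.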