Paper Title
Directional Deep Embedding and Appearance Learning for Fast Video Object Segmentation
Paper Authors
Paper Abstract
Most recent semi-supervised video object segmentation (VOS) methods rely on fine-tuning deep convolutional neural networks online, using the given mask of the first frame or the predicted masks of subsequent frames. However, this online fine-tuning is usually time-consuming, which limits the practical use of such methods. We propose a directional deep embedding and appearance learning (DDEAL) method for fast VOS that is free of online fine-tuning. First, a global directional matching module, which can be implemented efficiently with parallel convolutional operations, is proposed to learn a semantic pixel-wise embedding as internal guidance. Second, an effective appearance model based on directional statistics is proposed to represent the target and the background in a spherical embedding space for VOS. Equipped with the global directional matching module and the directional appearance model learning module, DDEAL learns static cues from the labeled first frame and dynamically updates cues from subsequent frames for object segmentation. Our method achieves state-of-the-art VOS performance without online fine-tuning. Specifically, it reaches a J&F mean score of 74.8% on the DAVIS 2017 dataset and an overall score G of 71.3% on the large-scale YouTube-VOS dataset, while running at 25 fps on a single NVIDIA TITAN Xp GPU. Furthermore, a faster version runs at 31 fps with only a small loss in accuracy. Our code and trained networks are available at https://github.com/YingjieYin/Directional-Deep-Embedding-and-Appearance-Learning-for-Fast-Video-Object-Segmentation.
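To make the two modules named in the abstract concrete, the following minimal PyTorch sketch illustrates the general idea under our own assumptions (it is an illustration, not the authors' released implementation): global matching realised as a single convolution whose 1x1 kernels are the L2-normalised first-frame pixel embeddings (so each output channel is a cosine similarity), and a directional appearance model kept as foreground/background mean directions on the unit embedding sphere, updated with an assumed exponential moving average.

import torch
import torch.nn.functional as F


def global_directional_matching(curr_emb, ref_emb, ref_mask):
    """curr_emb: (C, H, W) embedding of the current frame.
    ref_emb:  (C, H0, W0) embedding of the labeled first frame.
    ref_mask: (H0, W0) binary target mask of the first frame (must contain
              both foreground and background pixels in this sketch).
    Returns a (2, H, W) map of the maximum cosine similarity of each pixel
    to the foreground and to the background reference pixels."""
    curr = F.normalize(curr_emb, dim=0)                     # unit-length pixel embeddings
    ref = F.normalize(ref_emb, dim=0).flatten(1)            # (C, N) reference pixels
    fg = ref_mask.flatten().bool()
    # Each reference pixel becomes a 1x1 convolution kernel, so all cosine
    # similarities are computed by one parallel convolution.
    kernels = ref.t().unsqueeze(-1).unsqueeze(-1)           # (N, C, 1, 1)
    sims = F.conv2d(curr.unsqueeze(0), kernels).squeeze(0)  # (N, H, W)
    return torch.stack([sims[fg].amax(dim=0), sims[~fg].amax(dim=0)])


class DirectionalAppearanceModel:
    """Foreground/background mean directions on the embedding sphere,
    updated frame by frame (the momentum value here is a hypothetical choice)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.means = None                                   # (2, C) fg/bg mean directions

    def update(self, emb, mask):
        emb = F.normalize(emb, dim=0).flatten(1)            # (C, N)
        fg = mask.flatten().bool()
        new = torch.stack([
            F.normalize(emb[:, fg].mean(dim=1), dim=0),     # foreground mean direction
            F.normalize(emb[:, ~fg].mean(dim=1), dim=0),    # background mean direction
        ])
        self.means = new if self.means is None else F.normalize(
            self.momentum * self.means + (1 - self.momentum) * new, dim=1)

    def score(self, emb):
        emb = F.normalize(emb, dim=0)                       # (C, H, W)
        return torch.einsum('kc,chw->khw', self.means, emb) # (2, H, W) cosine scores

In use, one would initialise the appearance model from the labeled first frame, and for every subsequent frame combine the matching map and the appearance scores (for example by summation) before thresholding into a mask; how DDEAL actually fuses and refines these cues is described in the paper itself.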