Paper Title
Bridging Images and Videos: A Simple Learning Framework for Large Vocabulary Video Object Detection
Paper Authors
Paper Abstract
Scaling object taxonomies is one of the important steps toward robust real-world deployment of recognition systems. We have seen remarkable progress on images since the introduction of the LVIS benchmark. To continue this success in videos, a new video benchmark, TAO, was recently presented. Given the recent encouraging results from both the detection and tracking communities, we are interested in marrying those two advances and building a strong large vocabulary video tracker. However, supervision in LVIS and TAO is inherently sparse or even missing, posing two new challenges for training large vocabulary trackers. First, there is no tracking supervision in LVIS, which leads to inconsistent learning of detection (with LVIS and TAO) and tracking (only with TAO). Second, the detection supervision in TAO is partial, which results in catastrophic forgetting of absent LVIS categories during video fine-tuning. To resolve these challenges, we present a simple but effective learning framework that takes full advantage of all available training data to learn detection and tracking while not losing the ability to recognize any LVIS categories. With this new learning scheme, we show consistent improvements across various large vocabulary trackers, setting strong baseline results on the challenging TAO benchmark.
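To make the second challenge concrete, the sketch below illustrates one common way to keep unannotated categories from being suppressed during video fine-tuning: a per-category classification loss in which categories without supervision in the current dataset contribute no gradient. This is only an illustrative PyTorch sketch under our own assumptions (the function name, tensor shapes, and masking scheme are hypothetical), not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def masked_detection_class_loss(class_logits, target_labels, annotated_mask):
    # class_logits:   [num_boxes, num_classes] raw classification scores
    # target_labels:  [num_boxes, num_classes] binary targets for annotated classes
    # annotated_mask: [num_classes] bool, True where the current dataset provides labels
    per_class = F.binary_cross_entropy_with_logits(
        class_logits, target_labels.float(), reduction="none")
    # Zero out categories without supervision so absent LVIS classes
    # receive no gradient (and are not pushed toward background).
    per_class = per_class * annotated_mask.float()
    denom = annotated_mask.float().sum().clamp(min=1.0)
    return per_class.sum(dim=1).div(denom).mean()

# Illustrative usage with made-up sizes: a frame from a partially
# annotated video dataset where only the first 300 of 1200 vocabulary
# categories carry detection labels.
num_classes = 1200
logits = torch.randn(8, num_classes)               # 8 candidate boxes
targets = torch.zeros(8, num_classes)
targets[0, 5] = 1.0                                # one annotated positive
mask = torch.zeros(num_classes, dtype=torch.bool)
mask[:300] = True                                  # supervised categories only
loss = masked_detection_class_loss(logits, targets, mask)
```

Under this kind of masking, image data (fully annotated for detection) and video data (partially annotated for detection, plus track identities) can share one classifier head without the unlabeled portion of the vocabulary drifting during fine-tuning; the tracking loss would analogously be applied only to frames that carry track supervision.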