Paper Title

TAP-Vid: A Benchmark for Tracking Any Point in a Video

Paper Authors

Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang

Paper Abstract

Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed, until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
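The abstract mentions that the annotation pipeline uses optical flow estimates to compensate for easy short-term motion (e.g. camera shake) so annotators only need to correct the harder parts. The sketch below is only a minimal illustration of that general idea, not the authors' actual pipeline: it assumes grayscale video frames as NumPy arrays, uses OpenCV's Farneback flow estimator as one possible choice, and the helper name `propagate_point` is hypothetical.

```python
# Illustrative sketch of flow-assisted point propagation (NOT the TAP-Vid pipeline).
# Assumes `frames` is a list of grayscale uint8 numpy arrays and `point_xy` is the
# (x, y) location an annotator marked in the first frame.
import cv2
import numpy as np

def propagate_point(frames, point_xy):
    """Chain frame-to-frame optical flow to carry one point through the clip."""
    x, y = float(point_xy[0]), float(point_xy[1])
    track = [(x, y)]
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Farneback dense flow; positional args are pyr_scale, levels, winsize,
        # iterations, poly_n, poly_sigma, flags.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = flow.shape[:2]
        xi = int(np.clip(round(x), 0, w - 1))
        yi = int(np.clip(round(y), 0, h - 1))
        dx, dy = flow[yi, xi]          # flow vector at the current point location
        x, y = x + float(dx), y + float(dy)
        track.append((x, y))
    # An annotator would then only need to correct frames where the flow drifts,
    # e.g. across occlusions or fast motion, rather than clicking every frame.
    return track
```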
