Paper Title

IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation

Paper Authors

Zhongwei Qiu, Qiansheng Yang, Jian Wang, Dongmei Fu

Abstract

Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos. Recent transformer-based approaches focus on capturing spatiotemporal information from sequential 2D poses, which cannot model contextual depth features effectively since the visual depth cues are lost in the 2D pose estimation step. In this paper, we simplify the paradigm into an end-to-end framework, Instance-guided Video Transformer (IVT), which effectively learns spatiotemporal contextual depth information from visual features and predicts 3D poses directly from video frames. In particular, we first formulate video frames as a series of instance-guided tokens, where each token is in charge of predicting the 3D pose of one human instance. These tokens contain body structure information since they are extracted under the guidance of joint offsets from the human center to the corresponding body joints. Then, these tokens are sent into IVT to learn spatiotemporal contextual depth. In addition, we propose a cross-scale instance-guided attention mechanism to handle the varying scales among multiple persons. Finally, the 3D pose of each person is decoded from its instance-guided tokens by coordinate regression. Experiments on three widely-used 3D pose estimation benchmarks show that the proposed IVT achieves state-of-the-art performance.
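
To make the pipeline above concrete, below is a minimal sketch, in PyTorch, of how instance-guided tokens, a spatiotemporal transformer, and coordinate regression could be wired together. This is not the authors' implementation: the module names (InstanceGuidedTokenizer, IVTSketch), the offset-based token sampling, and the use of a plain nn.TransformerEncoder in place of the paper's cross-scale instance-guided attention are simplifying assumptions for illustration only.

```python
# Minimal sketch of an IVT-style pipeline as described in the abstract.
# NOT the authors' implementation: names, shapes, and the token-sampling
# scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InstanceGuidedTokenizer(nn.Module):
    """Samples per-joint features at offsets predicted from each person's
    center, a stand-in for the paper's instance-guided token extraction."""

    def __init__(self, feat_dim=256, num_joints=17):
        super().__init__()
        self.num_joints = num_joints
        # Predicts a (dx, dy) offset from the center to every joint.
        self.offset_head = nn.Conv2d(feat_dim, num_joints * 2, kernel_size=1)

    def forward(self, feat, centers):
        # feat: (B*T, C, H, W) frame features; centers: (B*T, N, 2) in [-1, 1]
        bt = feat.shape[0]
        n = centers.shape[1]
        offsets = self.offset_head(feat)                         # (B*T, J*2, H, W)
        # Read the offset field at each person center.
        ctr = centers.view(bt, n, 1, 2)
        off = F.grid_sample(offsets, ctr, align_corners=False)   # (B*T, J*2, N, 1)
        off = off.view(bt, self.num_joints, 2, n).permute(0, 3, 1, 2)
        joint_locs = centers.unsqueeze(2) + off                  # (B*T, N, J, 2)
        # Gather features at the predicted joint locations -> one token per joint.
        tokens = F.grid_sample(
            feat, joint_locs.view(bt, n * self.num_joints, 1, 2),
            align_corners=False)                                 # (B*T, C, N*J, 1)
        return tokens.squeeze(-1).permute(0, 2, 1)               # (B*T, N*J, C)


class IVTSketch(nn.Module):
    """Backbone features -> instance-guided tokens -> spatiotemporal
    transformer -> 3D joint coordinates via regression."""

    def __init__(self, feat_dim=256, num_joints=17, depth=4, heads=8):
        super().__init__()
        self.tokenizer = InstanceGuidedTokenizer(feat_dim, num_joints)
        layer = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
        self.spatiotemporal = nn.TransformerEncoder(layer, depth)
        self.regress = nn.Linear(feat_dim, 3)                    # (x, y, z) per joint

    def forward(self, feats, centers):
        # feats: (B, T, C, H, W) features of T frames; centers: (B, T, N, 2)
        b, t, c = feats.shape[0], feats.shape[1], feats.shape[2]
        n = centers.shape[2]
        tokens = self.tokenizer(feats.flatten(0, 1), centers.flatten(0, 1))
        tokens = tokens.reshape(b, -1, c)        # join space and time tokens
        tokens = self.spatiotemporal(tokens)     # spatiotemporal contextual depth
        poses = self.regress(tokens)             # coordinate regression
        return poses.view(b, t, n, -1, 3)        # (B, T, N, J, 3)


if __name__ == "__main__":
    model = IVTSketch()
    feats = torch.randn(1, 4, 256, 32, 32)       # 4 frames of backbone features
    centers = torch.rand(1, 4, 2, 2) * 2 - 1     # 2 persons per frame, in [-1, 1]
    print(model(feats, centers).shape)           # torch.Size([1, 4, 2, 17, 3])
```

In this sketch the person centers are assumed to be given, and spatial and temporal tokens are simply concatenated before one shared encoder; the paper's cross-scale attention over multi-resolution features is omitted for brevity.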
