Paper Title
Dual networks based 3D Multi-Person Pose Estimation from Monocular Video
Paper Authors
Abstract
Monocular 3D human pose estimation has made progress in recent years. Most methods focus on a single person and estimate poses in person-centric coordinates, i.e., coordinates relative to the center of the target person. Hence, these methods are inapplicable to multi-person 3D pose estimation, where absolute coordinates (e.g., camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single-person pose estimation due to inter-person occlusion and close human interactions. Existing top-down multi-person methods rely on human detection and thus suffer from detection errors, so they cannot produce reliable pose estimates in multi-person scenes. Meanwhile, existing bottom-up methods do not use human detection and are therefore unaffected by detection errors, but because they process all persons in a scene at once, they are prone to errors, particularly for persons at small scales. To address all these challenges, we propose integrating the top-down and bottom-up approaches to exploit their strengths. Our top-down network estimates human joints of all persons in an image patch instead of only one, making it robust to possibly erroneous bounding boxes. Our bottom-up network incorporates human-detection-based normalized heatmaps, allowing the network to be more robust in handling scale variations. Finally, the estimated 3D poses from the top-down and bottom-up networks are fed into our integration network to obtain the final 3D poses. To address the common gap between training and testing data, we perform optimization at test time, refining the estimated 3D human poses with a high-order temporal constraint, a re-projection loss, and bone-length regularization. Our evaluations demonstrate the effectiveness of the proposed method. Code and models are available: https://github.com/3dpose/3D-Multi-Person-Pose.
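The abstract's integration step feeds per-person 3D estimates from the top-down and bottom-up branches into a learned integration network. As a minimal sketch of only the data flow (the actual network is learned; the confidence-weighted average below, including the function name `fuse_poses`, is a hypothetical stand-in, not the paper's method):

```python
import numpy as np

def fuse_poses(pose_td, pose_bu, conf_td, conf_bu):
    """Per-joint fusion of 3D estimates from two branches.

    pose_td, pose_bu: (J, 3) camera-space joints from the top-down and
                      bottom-up branches for the same person.
    conf_td, conf_bu: (J,) non-negative per-joint confidences.

    Hypothetical stand-in for the paper's learned integration network:
    a confidence-weighted average of the two branch estimates.
    """
    w_td = conf_td / (conf_td + conf_bu + 1e-8)  # per-joint weight for top-down
    return w_td[:, None] * pose_td + (1.0 - w_td)[:, None] * pose_bu
```

When both branches are equally confident this reduces to a plain average; when one branch's confidence dominates (e.g., the top-down branch for a small-scale person inside a good bounding box), its estimate dominates the output.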
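The test-time refinement named in the abstract optimizes three terms: a re-projection loss, bone-length regularization, and a high-order temporal constraint. A minimal sketch of what each term could compute, assuming a pinhole camera with intrinsics `K` and interpreting "high-order temporal" as second-order (acceleration) smoothness (the exact formulation is in the paper, not reproduced here):

```python
import numpy as np

def reprojection_loss(pose3d, pose2d, K):
    """Mean squared error between projected 3D joints and observed 2D joints.
    pose3d: (J, 3) camera-space joints; pose2d: (J, 2); K: (3, 3) intrinsics."""
    proj = (K @ pose3d.T).T            # (J, 3) homogeneous image coordinates
    proj = proj[:, :2] / proj[:, 2:3]  # perspective divide by depth
    return np.mean(np.sum((proj - pose2d) ** 2, axis=1))

def bone_length_loss(pose3d, bones, ref_lengths):
    """Penalize deviation of bone lengths from reference lengths.
    bones: list of (parent, child) joint-index pairs; ref_lengths: (len(bones),)."""
    lengths = np.array([np.linalg.norm(pose3d[c] - pose3d[p]) for p, c in bones])
    return np.mean((lengths - ref_lengths) ** 2)

def temporal_loss(poses3d):
    """Second-order (acceleration) smoothness over a pose sequence (T, J, 3)."""
    accel = poses3d[2:] - 2.0 * poses3d[1:-1] + poses3d[:-2]
    return np.mean(np.sum(accel ** 2, axis=-1))
```

At test time, a weighted sum of these three terms would be minimized over the estimated pose sequence (e.g., by gradient descent), pulling the poses toward the 2D evidence while keeping limb lengths and motion plausible.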