Paper Title
Progressive Multi-view Human Mesh Recovery with Self-Supervision
Paper Authors
Paper Abstract
To date, little attention has been given to multi-view 3D human mesh estimation, despite its real-life applicability (e.g., motion capture, sport analysis) and robustness to single-view ambiguities. Existing solutions typically suffer from poor generalization to new settings, largely due to the limited diversity of image-mesh pairs in multi-view training data. To address this shortcoming, prior work has explored the use of synthetic images. However, beyond the usual impact of the visual gap between rendered and target data, synthetic-data-driven multi-view estimators also suffer from overfitting to the camera viewpoint distribution sampled during training, which usually differs from real-world distributions. Tackling both challenges, we propose a novel simulation-based training pipeline for multi-view human mesh recovery, which (a) relies on intermediate 2D representations that are more robust to the synthetic-to-real domain gap; (b) leverages learnable calibration and triangulation to adapt to more diversified camera setups; and (c) progressively aggregates multi-view information in a canonical 3D space to remove ambiguities in the 2D representations. Through extensive benchmarking, we demonstrate the superiority of the proposed solution, especially for unseen in-the-wild scenarios.
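As background for point (b) of the abstract, the classical building block behind lifting 2D observations from calibrated views into 3D is linear (DLT) triangulation: each view's projection matrix and 2D keypoint contribute two rows to a homogeneous linear system whose least-squares solution is the 3D point. The sketch below is a minimal, generic illustration of that operation, not the paper's learnable calibration/triangulation module; the function name and interface are hypothetical.

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """Linear (DLT) triangulation of one 3D point from N calibrated views.

    proj_mats: list of 3x4 camera projection matrices.
    points_2d: list of (x, y) observations in each view, matching proj_mats.
    """
    # Each view contributes two rows to the homogeneous system A @ X = 0,
    # derived from the cross product of the observation with P @ X.
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # Least-squares solution: right singular vector for the smallest
    # singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize
```

A learnable variant (as the abstract suggests) would predict or refine the calibration parameters and weight each view's contribution, but the underlying geometry is the linear system above.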