Paper title
Data standardization for robust lip sync
Paper authors
Paper abstract
Lip sync is a fundamental audio-visual task. However, existing lip sync methods fall short of being robust in the wild. One important cause may be distracting factors in the visual input, which make it difficult to extract lip motion information. To address this issue, this paper proposes a data standardization pipeline that standardizes the visual input for lip sync. Building on recent advances in 3D face reconstruction, we first create a model that consistently disentangles lip motion information from raw images. Standardized images are then synthesized from the disentangled lip motion information, with all other attributes related to the distracting factors set to predefined values independent of the input, reducing their effects. Using the synthesized images, existing lip sync methods improve in data efficiency and robustness, and they achieve competitive performance on the active speaker detection task.
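The core idea of the pipeline can be sketched as follows. This is a minimal illustration only, assuming a 3DMM-style parameterization in which a face image is explained by separate coefficient groups (identity, texture, pose, lighting, expression); all names, dimensions, and the `standardize` helper are hypothetical and not the paper's actual API.

```python
import numpy as np

# Predefined template values for the distracting factors (identity, pose,
# texture, lighting) -- chosen once, independent of any input image.
# Dimensions are illustrative, loosely following common 3DMM setups.
TEMPLATE = {
    "identity": np.zeros(80),   # mean-face shape
    "texture": np.zeros(80),    # mean-face albedo
    "pose": np.zeros(6),        # frontal, centered
    "lighting": np.zeros(27),   # fixed illumination
}

def standardize(coeffs: dict) -> dict:
    """Keep only the disentangled lip-motion (expression) coefficients from
    the input; overwrite every other attribute with fixed template values."""
    out = dict(TEMPLATE)
    out["expression"] = coeffs["expression"]
    return out

# Example: two frames of the same utterance under different pose/lighting.
frame_a = {"identity": np.random.randn(80), "texture": np.random.randn(80),
           "pose": np.random.randn(6), "lighting": np.random.randn(27),
           "expression": np.array([0.3, -0.1])}
frame_b = dict(frame_a, pose=np.random.randn(6), lighting=np.random.randn(27))

std_a, std_b = standardize(frame_a), standardize(frame_b)
# After standardization, only lip motion can differ between frames; a
# renderer would then synthesize the standardized image from these
# coefficients before passing it to the lip sync model.
assert np.allclose(std_a["pose"], std_b["pose"])
assert np.allclose(std_a["expression"], frame_a["expression"])
```

The design choice this sketch highlights is that the downstream lip sync model never sees the nuisance variation: every standardized frame shares identical identity, pose, texture, and lighting, so the only signal left is lip motion.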