FlexLip: A Controllable Text-to-Lip System

Paper title

FlexLip: A Controllable Text-to-Lip System

Authors

Dan Oneata, Beata Lorincz, Adriana Stan, Horia Cucu

Abstract

The task of converting text input into video content is becoming an important topic in synthetic media generation. Several methods have been proposed, some of which reach close-to-natural performance in constrained tasks. In this paper, we tackle a subproblem of text-to-video generation: converting text into lip landmarks. We do so using a modular, controllable system architecture, and we evaluate each of its individual components. Our system, called FlexLip, is split into two separate modules, text-to-speech and speech-to-lip, both built on controllable deep neural network architectures. This modularity makes it easy to replace either component, and it also enables fast adaptation to new speaker identities by disentangling or projecting the input features. We show that with as little as 20 min of data for the audio generation component, and as little as 5 min for the speech-to-lip component, the objective measures of the generated lip landmarks are comparable with those obtained when using a larger set of training samples. We also introduce a series of objective evaluation measures over the complete flow of our system, taking into consideration several aspects of the data and system configuration. These aspects pertain to the quality and amount of training data, the use of pretrained models and the data contained therein, and the identity of the target speaker. With regard to the latter, we show that we can perform zero-shot lip adaptation to an unseen identity by simply updating the shape of the lips in our model.
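
The two-module structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `TextToLipPipeline`, the type aliases, and the toy stand-in functions are all hypothetical, and the data formats are assumptions, since the abstract does not specify them. The sketch only shows how a text-to-speech stage and a speech-to-lip stage might be composed so that either one can be replaced independently.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type aliases; the abstract does not specify data formats.
Audio = List[float]                    # mono waveform samples
Landmarks = List[Tuple[float, float]]  # (x, y) lip points for one video frame


@dataclass
class TextToLipPipeline:
    """Chains two independent stages: text -> audio -> lip landmarks."""
    text_to_speech: Callable[[str], Audio]
    speech_to_lip: Callable[[Audio], List[Landmarks]]

    def __call__(self, text: str) -> List[Landmarks]:
        audio = self.text_to_speech(text)
        return self.speech_to_lip(audio)


def toy_tts(text: str) -> Audio:
    # Placeholder, not a real synthesizer: one "sample" per character.
    return [float(ord(ch) % 10) / 10.0 for ch in text]


def toy_speech_to_lip(audio: Audio) -> List[Landmarks]:
    # Placeholder: one frame per sample, with two points whose
    # vertical gap follows the sample value.
    return [[(0.0, s), (0.0, -s)] for s in audio]


pipeline = TextToLipPipeline(toy_tts, toy_speech_to_lip)
frames = pipeline("hi")

# Swapping one stage leaves the other untouched.
constant_tts_pipeline = TextToLipPipeline(lambda text: [1.0], toy_speech_to_lip)
swapped_frames = constant_tts_pipeline("any text")
```

Because each stage is only a callable with a fixed input and output type, a different speech synthesizer or landmark predictor could be substituted without changing the other stage, which is the property the abstract attributes to the modular design.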
