Paper Title
Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion
Paper Authors
Paper Abstract
One-shot voice conversion (VC), which uses only a single utterance of the target speaker as reference, has become a hot research topic. Existing works generally disentangle timbre, while information about pitch, rhythm and content remains mixed together. To further disentangle these speech components and perform one-shot VC effectively, we employ random resampling for the pitch and content encoders, and use the variational contrastive log-ratio upper bound of mutual information together with gradient reversal layer based adversarial mutual information learning, ensuring that each part of the latent space contains only the desired disentangled representation during training. Experiments on the VCTK dataset show that the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility. In addition, speech representation disentanglement allows us to transfer the timbre, pitch and rhythm characteristics of one-shot VC separately. Our code, pre-trained models and demo are available at https://im1eon.github.io/IS2022-SRDVC/.
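The gradient reversal layer (GRL) mentioned in the abstract is a standard building block for adversarial feature learning: it acts as the identity in the forward pass but negates (and optionally scales) gradients in the backward pass, so an adversarial classifier's loss pushes the upstream encoder *away* from encoding the unwanted factor. A minimal PyTorch sketch is shown below; the class and function names are illustrative, not taken from the paper's released code.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambd in backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd          # remember the scaling factor for backward
        return x.view_as(x)        # identity mapping

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing to the encoder;
        # the second return value (None) corresponds to the lambd argument.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    """Apply gradient reversal to tensor x with scaling factor lambd."""
    return GradReverse.apply(x, lambd)
```

In a disentanglement setup, encoder outputs would pass through `grad_reverse` before an auxiliary classifier (e.g. a speaker classifier on the content embedding), so that minimizing the classifier's loss maximally confuses it with respect to the factor being removed.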