Paper title
Investigation into Target Speaking Rate Adaptation for Voice Conversion
Authors
Abstract
Disentangling the speaker and content attributes of a speech signal into separate latent representations, then decoding the content with an exchanged speaker representation, is a popular approach to voice conversion that can be trained on non-parallel and unlabeled speech data. However, previous approaches perform disentanglement only implicitly, via some form of information bottleneck or normalization, which usually makes it hard to find a good trade-off between voice conversion and content reconstruction. Furthermore, previous works usually either do not adapt the speaking rate to the target speaker or impose major restrictions on the data or use case. The contribution of this work is therefore two-fold. First, we employ an explicit and fully unsupervised disentanglement approach, which has previously been used only for representation learning, and show that it yields both superior voice conversion and superior content reconstruction. Second, we investigate simple and generic approaches to linearly adapt the length of a speech signal, and hence its speaking rate, to a target speaker, and show that the proposed adaptation increases speaking rate similarity with respect to the target speaker.
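To make the idea of linear length adaptation concrete, the following is a minimal sketch, not the method used in the paper. It assumes the content is available as a sequence of feature frames and that a speaking rate estimate (for example, in units per second) exists for both the source utterance and the target speaker. The function name `adapt_length` and both rate parameters are hypothetical; the abstract does not specify how rates are estimated or which interpolation scheme is used.

```python
def adapt_length(frames, source_rate, target_rate):
    """Linearly stretch or compress a frame sequence along the time axis.

    frames:      list of equal-length feature vectors (hypothetical input format)
    source_rate: assumed speaking rate estimate of the source utterance
    target_rate: assumed speaking rate estimate of the target speaker
    """
    if source_rate <= 0 or target_rate <= 0:
        raise ValueError("speaking rates must be positive")
    if not frames:
        return []

    n = len(frames)
    # A faster target speaker needs fewer frames for the same content.
    new_len = max(1, round(n * source_rate / target_rate))
    if n == 1 or new_len == 1:
        return [list(frames[0]) for _ in range(new_len)]

    # Map each output frame to a fractional position in the input
    # and linearly interpolate between the two neighbouring frames.
    step = (n - 1) / (new_len - 1)
    adapted = []
    for i in range(new_len):
        pos = i * step
        lo = min(int(pos), n - 2)
        w = pos - lo
        adapted.append(
            [(1.0 - w) * a + w * b for a, b in zip(frames[lo], frames[lo + 1])]
        )
    return adapted
```

In this sketch a single global factor `source_rate / target_rate` rescales the whole utterance, which is the simplest reading of "linearly adapt the length"; local tempo variations within the utterance are left unchanged.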