Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers

Paper Title

Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers

Authors

Cheng-Ping Hsieh, Subhankar Ghosh, Boris Ginsburg

Abstract

Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers. However, this approach has some challenges. Fine-tuning usually requires several hours of high-quality speech per speaker. There is also a risk that fine-tuning will degrade the quality of speech synthesis for previously learned speakers. In this paper we propose an alternative approach to TTS adaptation based on parameter-efficient adapter modules. In the proposed approach, a few small adapter modules are added to the original network. The original weights are frozen, and only the adapters are fine-tuned on speech from the new speaker. This parameter-efficient fine-tuning approach produces a new model that shares a high proportion of its parameters with the original model. Our experiments on the LibriTTS, HiFi-TTS and VCTK datasets validate the effectiveness of the adapter-based method through objective and subjective metrics.
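The idea of freezing the original weights and training only small per-speaker adapters can be illustrated with a toy sketch. The abstract does not specify the adapter architecture, so the code below assumes a common residual bottleneck design (down-projection, ReLU, up-projection, skip connection) with a zero-initialized up-projection. All names (`FrozenLayer`, `BottleneckAdapter`, `hidden_for`, the speaker keys) and all sizes are hypothetical and are not taken from the paper.

```python
import random


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class FrozenLayer:
    """Stand-in for one pretrained TTS layer; its weights are never updated."""

    def __init__(self, dim, rng):
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.weights, x)

    def num_params(self):
        return len(self.weights) * len(self.weights[0])


class BottleneckAdapter:
    """Assumed adapter form: x + up(relu(down(x)))."""

    def __init__(self, dim, bottleneck, rng):
        self.down = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Zero-initialized up-projection: the adapter starts as the identity,
        # so adding it does not change the base model's output before training.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.down, x)]
        return [xi + ui for xi, ui in zip(x, matvec(self.up, hidden))]

    def num_params(self):
        return 2 * len(self.down) * len(self.down[0])


rng = random.Random(0)
dim, bottleneck = 16, 2  # hypothetical sizes

base = FrozenLayer(dim, rng)  # shared by every speaker
adapters = {name: BottleneckAdapter(dim, bottleneck, rng)
            for name in ("new_speaker_a", "new_speaker_b")}


def hidden_for(x, speaker=None):
    """Run the shared layer, then the speaker's adapter if one is selected."""
    h = base(x)
    return adapters[speaker](h) if speaker is not None else h


x = [1.0] * dim
base_out = hidden_for(x)
adapted_out = hidden_for(x, "new_speaker_a")

# Parameters to store: one shared base plus small adapters,
# versus a full fine-tuned copy of the layer per new speaker.
shared_total = base.num_params() + sum(a.num_params() for a in adapters.values())
full_copies_total = len(adapters) * base.num_params()
```

In this sketch only the adapter matrices would receive gradient updates during adaptation. The base layer is reused unchanged for every speaker, which is what keeps previously learned voices unaffected and parameter sharing high.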
