Paper Title

Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing

Authors

Jacob J. Webber, Cassia Valentini-Botinhao, Evelyn Williams, Gustav Eje Henter, Simon King

Abstract

Most state-of-the-art Text-to-Speech systems use the mel-spectrogram as an intermediate representation, to decompose the task into acoustic modelling and waveform generation. A mel-spectrogram is extracted from the waveform by a simple, fast DSP operation, but generating a high-quality waveform from a mel-spectrogram requires computationally expensive machine learning: a neural vocoder. Our proposed "autovocoder" reverses this arrangement. We use machine learning to obtain a representation that replaces the mel-spectrogram, and that can be inverted back to a waveform using simple, fast operations including a differentiable implementation of the inverse STFT. The autovocoder generates a waveform 5 times faster than the DSP-based Griffin-Lim algorithm, and 14 times faster than the neural vocoder HiFi-GAN. We provide perceptual listening test results to confirm that the speech is of comparable quality to HiFi-GAN in the copy synthesis task.
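
To make the arrangement concrete, below is a minimal PyTorch sketch of the idea described in the abstract: an autoencoder over STFT frames whose decoder predicts complex STFT coefficients, which are converted back to a waveform with the differentiable inverse STFT (torch.istft). The class name AutovocoderSketch, the layer sizes, and the representation dimension are illustrative assumptions, not the architecture, losses, or training setup used in the paper.

```python
# Illustrative sketch only: layer sizes, channel counts, and the encoder/decoder
# structure below are assumptions, not taken from the Autovocoder paper.
import torch
import torch.nn as nn

N_FFT, HOP = 1024, 256
N_BINS = N_FFT // 2 + 1          # one-sided STFT bins
REPR_DIM = 128                   # size of the learned representation (assumed)

class AutovocoderSketch(nn.Module):
    """Autoencoder over STFT frames; the decoder output is turned back into a
    waveform with a fast, differentiable inverse STFT (torch.istft)."""
    def __init__(self):
        super().__init__()
        self.register_buffer("window", torch.hann_window(N_FFT))
        # Encoder: complex STFT (real + imag stacked) -> learned representation.
        self.encoder = nn.Sequential(
            nn.Linear(2 * N_BINS, 512), nn.ReLU(),
            nn.Linear(512, REPR_DIM),
        )
        # Decoder: learned representation -> real + imag STFT coefficients.
        self.decoder = nn.Sequential(
            nn.Linear(REPR_DIM, 512), nn.ReLU(),
            nn.Linear(512, 2 * N_BINS),
        )

    def encode(self, wav):                                   # wav: (batch, samples)
        spec = torch.stft(wav, N_FFT, HOP, window=self.window,
                          return_complex=True)               # (batch, N_BINS, frames)
        feats = torch.cat([spec.real, spec.imag], dim=1)     # (batch, 2*N_BINS, frames)
        return self.encoder(feats.transpose(1, 2))           # (batch, frames, REPR_DIM)

    def decode(self, repr_, length=None):
        out = self.decoder(repr_).transpose(1, 2)            # (batch, 2*N_BINS, frames)
        real, imag = out.chunk(2, dim=1)
        spec = torch.complex(real, imag)
        # Simple, differentiable DSP step: inverse STFT back to a waveform.
        return torch.istft(spec, N_FFT, HOP, window=self.window, length=length)

    def forward(self, wav):
        return self.decode(self.encode(wav), length=wav.shape[-1])

if __name__ == "__main__":
    model = AutovocoderSketch()
    wav = torch.randn(1, 16000)      # 1 s of dummy audio at 16 kHz
    recon = model(wav)
    print(recon.shape)               # torch.Size([1, 16000])
```

In this setup, waveform generation at inference time is just the decoder projection followed by torch.istft, which is the kind of cheap DSP step that gives the reported speed advantage over neural vocoders such as HiFi-GAN.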
