Paper Title

Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing

Authors

Jacob J. Webber, Cassia Valentini-Botinhao, Evelyn Williams, Gustav Eje Henter, Simon King

Abstract

Most state-of-the-art Text-to-Speech systems use the mel-spectrogram as an intermediate representation, to decompose the task into acoustic modelling and waveform generation. A mel-spectrogram is extracted from the waveform by a simple, fast DSP operation, but generating a high-quality waveform from a mel-spectrogram requires computationally expensive machine learning: a neural vocoder. Our proposed "autovocoder" reverses this arrangement. We use machine learning to obtain a representation that replaces the mel-spectrogram, and that can be inverted back to a waveform using simple, fast operations including a differentiable implementation of the inverse STFT. The autovocoder generates a waveform 5 times faster than the DSP-based Griffin-Lim algorithm, and 14 times faster than the neural vocoder HiFi-GAN. We provide perceptual listening test results to confirm that the speech is of comparable quality to HiFi-GAN in the copy synthesis task.
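
To make the arrangement concrete, below is a minimal PyTorch sketch of the idea described in the abstract: an autoencoder over STFT frames whose decoder predicts complex STFT coefficients, which are converted back to a waveform with the differentiable inverse STFT (torch.istft). The class name AutovocoderSketch, the layer sizes, and the representation dimension are illustrative assumptions, not the architecture, losses, or training setup used in the paper.

```python
# Illustrative sketch only: layer sizes, channel counts, and the encoder/decoder
# structure below are assumptions, not taken from the Autovocoder paper.
import torch
import torch.nn as nn

N_FFT, HOP = 1024, 256
N_BINS = N_FFT // 2 + 1          # one-sided STFT bins
REPR_DIM = 128                   # size of the learned representation (assumed)

class AutovocoderSketch(nn.Module):
    """Autoencoder over STFT frames; the decoder output is turned back into a
    waveform with a fast, differentiable inverse STFT (torch.istft)."""
    def __init__(self):
        super().__init__()
        self.register_buffer("window", torch.hann_window(N_FFT))
        # Encoder: complex STFT (real + imag stacked) -> learned representation.
        self.encoder = nn.Sequential(
            nn.Linear(2 * N_BINS, 512), nn.ReLU(),
            nn.Linear(512, REPR_DIM),
        )
        # Decoder: learned representation -> real + imag STFT coefficients.
        self.decoder = nn.Sequential(
            nn.Linear(REPR_DIM, 512), nn.ReLU(),
            nn.Linear(512, 2 * N_BINS),
        )

    def encode(self, wav):                                   # wav: (batch, samples)
        spec = torch.stft(wav, N_FFT, HOP, window=self.window,
                          return_complex=True)               # (batch, N_BINS, frames)
        feats = torch.cat([spec.real, spec.imag], dim=1)     # (batch, 2*N_BINS, frames)
        return self.encoder(feats.transpose(1, 2))           # (batch, frames, REPR_DIM)

    def decode(self, repr_, length=None):
        out = self.decoder(repr_).transpose(1, 2)            # (batch, 2*N_BINS, frames)
        real, imag = out.chunk(2, dim=1)
        spec = torch.complex(real, imag)
        # Simple, differentiable DSP step: inverse STFT back to a waveform.
        return torch.istft(spec, N_FFT, HOP, window=self.window, length=length)

    def forward(self, wav):
        return self.decode(self.encode(wav), length=wav.shape[-1])

if __name__ == "__main__":
    model = AutovocoderSketch()
    wav = torch.randn(1, 16000)      # 1 s of dummy audio at 16 kHz
    recon = model(wav)
    print(recon.shape)               # torch.Size([1, 16000])
```

In this setup, waveform generation at inference time is just the decoder projection followed by torch.istft, which is the kind of cheap DSP step that gives the reported speed advantage over neural vocoders such as HiFi-GAN.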
