Title
High Fidelity Neural Audio Compression
Authors
Abstract
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion. We simplify and speed up training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss-balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model, including the training objective, architectural changes, and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baseline methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.
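The loss-balancer idea described in the abstract can be illustrated with a minimal sketch: each loss's gradient is renormalized so that its contribution to the total gradient matches its weight, independent of the raw scale of that loss. The function name and the single-step normalization below are assumptions for illustration; the paper's actual balancer additionally smooths gradient norms with an exponential moving average before rescaling.

```python
import math

def balance_gradients(grads, weights, ref_norm=1.0):
    """Hypothetical sketch of a loss balancer.

    Rescales each per-loss gradient so its norm equals the fraction of a
    fixed gradient budget (ref_norm) given by its relative weight,
    regardless of the raw scale of the loss.

    grads:   dict mapping loss name -> gradient (list of floats)
    weights: dict mapping loss name -> relative weight
    """
    total_w = sum(weights.values())
    balanced = {}
    for name, g in grads.items():
        # Norm of this loss's gradient (guard against division by zero).
        norm = math.sqrt(sum(x * x for x in g)) or 1e-12
        # Scale so the balanced gradient's norm is ref_norm * weight fraction.
        scale = ref_norm * (weights[name] / total_w) / norm
        balanced[name] = [scale * x for x in g]
    return balanced

# Two losses on very different scales: the reconstruction gradient is 100x
# larger, yet after balancing each contributes exactly its weighted share
# (0.25 and 0.75 of the unit gradient budget).
grads = {"reconstruction": [300.0, 400.0], "adversarial": [3.0, 4.0]}
weights = {"reconstruction": 1.0, "adversarial": 3.0}
out = balance_gradients(grads, weights)
```

This is what "the weight of a loss now defines the fraction of the overall gradient it should represent" means in practice: tuning `weights` no longer depends on whether a given loss happens to be numerically large or small.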