Paper Title

It's Raw! Audio Generation with State-Space Models

Authors

Karan Goel, Albert Gu, Chris Donahue, Christopher Ré

Abstract

Developing architectures suitable for modeling raw audio is a challenging problem due to the high sampling rates of audio waveforms. Standard sequence modeling approaches like RNNs and CNNs have previously been tailored to fit the demands of audio, but the resultant architectures make undesirable computational tradeoffs and struggle to model waveforms effectively. We propose SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling. We identify that S4 can be unstable during autoregressive generation, and provide a simple improvement to its parameterization by drawing connections to Hurwitz matrices. SaShiMi yields state-of-the-art performance for unconditional waveform generation in the autoregressive setting. Additionally, SaShiMi improves non-autoregressive generation performance when used as the backbone architecture for a diffusion model. Compared to prior architectures in the autoregressive generation setting, SaShiMi generates piano and speech waveforms which humans find more musical and coherent respectively, e.g. 2x better mean opinion scores than WaveNet on an unconditional speech generation task. On a music generation task, SaShiMi outperforms WaveNet on density estimation and speed at both training and inference even when using 3x fewer parameters. Code can be found at https://github.com/HazyResearch/state-spaces and samples at https://hazyresearch.stanford.edu/sashimi-examples.
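
The stability improvement mentioned in the abstract can be illustrated with a small sketch. The following is a minimal, hypothetical example and not the paper's exact S4/SaShiMi parameterization; the class and parameter names (StableDiagonalSSM, log_neg_real, log_dt) are invented for illustration. It parameterizes a diagonal state matrix so every eigenvalue has a strictly negative real part, i.e. the matrix is Hurwitz, which keeps the discretized autoregressive recurrence from growing without bound during generation.

# Hypothetical sketch: a Hurwitz-constrained diagonal state matrix for a
# state-space recurrence. Not the authors' exact parameterization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StableDiagonalSSM(nn.Module):
    def __init__(self, state_size: int):
        super().__init__()
        # Unconstrained parameters; softplus maps log_neg_real to a positive
        # value, so the real part -softplus(...) is always negative (Hurwitz).
        self.log_neg_real = nn.Parameter(torch.zeros(state_size))
        self.imag = nn.Parameter(torch.randn(state_size))
        self.log_dt = nn.Parameter(torch.full((1,), -3.0))  # log step size

    def transition(self) -> torch.Tensor:
        real = -F.softplus(self.log_neg_real)   # Re(lambda) < 0 by construction
        lam = torch.complex(real, self.imag)    # diagonal of the state matrix A
        dt = torch.exp(self.log_dt)
        return torch.exp(dt * lam)              # discretized A_bar, |A_bar| < 1

ssm = StableDiagonalSSM(64)
A_bar = ssm.transition()
assert torch.all(A_bar.abs() < 1)  # stable step-by-step autoregressive recurrence

Because |exp(dt * lambda)| = exp(dt * Re(lambda)) < 1 whenever Re(lambda) < 0, the hidden state cannot blow up across generation steps; this is the intuition behind the connection to Hurwitz matrices described in the abstract.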
