Generative Spoken Dialogue Language Modeling

Paper Title

Generative Spoken Dialogue Language Modeling

Authors

Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux

Abstract

We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn-taking compared to a text-based cascaded model.
