Paper Title
CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning
Paper Authors
Paper Abstract
More than half of the 7,000 languages in the world are in imminent danger of going extinct. Traditional methods of documenting language proceed by collecting audio data followed by manual annotation by trained linguists at different levels of granularity. This time-consuming and painstaking process could benefit from machine learning. Many endangered languages do not have any orthographic form but usually have speakers that are bilingual and trained in a high-resource language. It is relatively easy to obtain textual translations corresponding to speech. In this work, we provide a multimodal machine learning framework for speech representation learning by exploiting the correlations between two modalities, namely speech and its corresponding text translation. Here, we construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech. The audio encoder is trained to perform a speech-translation retrieval task in a contrastive learning framework. By evaluating the learned representations on a phone recognition task, we demonstrate that linguistic representations emerge in the audio encoder's internal representations as a by-product of learning to perform the retrieval task.
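The contrastive speech-translation retrieval objective described above can be illustrated with a minimal sketch. The code below assumes a margin-based triplet ranking loss over a batch of paired (speech, translation) embeddings, a common formulation for cross-modal retrieval; the paper's exact objective, margin, and similarity function may differ, and the function name and margin value here are illustrative.

```python
import numpy as np

def triplet_margin_loss(speech_emb, text_emb, margin=1.0):
    """Illustrative contrastive retrieval loss (hypothetical, not the
    paper's exact objective): for each matched (speech, translation)
    pair in the batch, its similarity score should exceed the score of
    every mismatched pair by at least `margin`.

    speech_emb, text_emb: (B, D) arrays of row-aligned embeddings,
    where row i of each matrix forms a matched pair.
    """
    scores = speech_emb @ text_emb.T      # (B, B) similarity matrix
    pos = np.diag(scores)                 # matched-pair scores
    # Hinge against imposter translations (fix speech, vary text) ...
    cost_text = np.maximum(0.0, margin + scores - pos[:, None])
    # ... and against imposter speech (fix text, vary speech).
    cost_speech = np.maximum(0.0, margin + scores - pos[None, :])
    # Matched pairs are not their own imposters.
    np.fill_diagonal(cost_text, 0.0)
    np.fill_diagonal(cost_speech, 0.0)
    return (cost_text.sum() + cost_speech.sum()) / scores.shape[0]
```

With well-separated embeddings (matched pairs scoring far above mismatched ones) the loss reaches zero, while swapped pairings incur a positive penalty, which is what drives the audio encoder to align speech with its translation.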