Paper Title
Transformer based unsupervised pre-training for acoustic representation learning
Paper Authors
Paper Abstract
Recently, a variety of acoustic tasks and related applications have emerged. For many acoustic tasks, the amount of labeled data may be limited. To address this problem, we propose an unsupervised pre-training method using a Transformer-based encoder to learn a general and robust high-level representation for all acoustic tasks. Experiments have been conducted on three kinds of acoustic tasks: speech emotion recognition, sound event detection, and speech translation. All of the experiments show that pre-training on each task's own training data can significantly improve performance. With a larger pre-training corpus combining the MuST-C, Librispeech, and ESC-US datasets, the UAR for speech emotion recognition further improves by an absolute 4.3% on the IEMOCAP dataset. For sound event detection, the F1 score further improves by an absolute 1.5% on the DCASE2018 Task 5 development set and 2.1% on the evaluation set. For speech translation, the BLEU score further improves by a relative 12.2% on the En-De dataset and 8.4% on the En-Fr dataset.
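To make the setup concrete, below is a minimal sketch of how a Transformer-based encoder could be pre-trained without labels on acoustic features. The abstract does not specify the pre-training objective, input features, or model sizes, so the masked-frame reconstruction loss, the 80-dimensional log-Mel inputs, and all names (AcousticEncoder, pretrain_step) are assumptions made purely for illustration, written here with PyTorch.

# Minimal sketch of Transformer-based unsupervised pre-training on acoustic
# features. The paper's exact objective is not given in the abstract; a
# masked-frame reconstruction loss is assumed here purely for illustration.
import torch
import torch.nn as nn


class AcousticEncoder(nn.Module):
    """Transformer encoder over log-Mel frames (hypothetical configuration)."""

    def __init__(self, n_mels: int = 80, d_model: int = 256,
                 n_heads: int = 4, n_layers: int = 6):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Reconstruction head used only during pre-training; downstream tasks
        # would consume the encoder output directly.
        self.reconstruct = nn.Linear(d_model, n_mels)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels) -> high-level representation (batch, time, d_model)
        return self.encoder(self.input_proj(mel))


def pretrain_step(model: AcousticEncoder, mel: torch.Tensor,
                  mask_prob: float = 0.15) -> torch.Tensor:
    """One unsupervised step: mask random frames and reconstruct them."""
    mask = torch.rand(mel.shape[:2], device=mel.device) < mask_prob
    corrupted = mel.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = model.reconstruct(model(corrupted))
    # L1 reconstruction loss computed on the masked frames only
    return nn.functional.l1_loss(pred[mask], mel[mask])


if __name__ == "__main__":
    model = AcousticEncoder()
    dummy_mel = torch.randn(8, 200, 80)   # (batch, frames, mel bins)
    loss = pretrain_step(model, dummy_mel)
    loss.backward()
    print(f"pre-training loss: {loss.item():.4f}")

After pre-training under whatever objective the paper actually uses, the encoder output would be fine-tuned with task-specific heads for speech emotion recognition, sound event detection, or speech translation.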