Paper Title

Masked Spectrogram Prediction For Self-Supervised Audio Pre-Training

Paper Authors

Dading Chong, Helin Wang, Peilin Zhou, Qingcheng Zeng

Paper Abstract

Transformer-based models attain excellent results and generalize well when trained on sufficient amounts of data. However, constrained by the limited data available in the audio domain, most transformer-based models for audio tasks are fine-tuned from models pre-trained in other domains (e.g., images), which have a notable gap from the audio domain. Other methods explore self-supervised learning directly in the audio domain but currently do not perform well on downstream tasks. In this paper, we present a novel self-supervised learning method for transformer-based audio models, called masked spectrogram prediction (MaskSpec), to learn powerful audio representations from unlabeled audio data (AudioSet in this paper). Our method masks random patches of the input spectrogram and reconstructs the masked regions with an encoder-decoder architecture. Without using extra model weights or supervision, experimental results on multiple downstream datasets demonstrate that MaskSpec achieves a significant performance gain over supervised methods and outperforms previous pre-trained models. In particular, our best model reaches 0.471 mAP on AudioSet, 0.854 mAP on OpenMIC2018, 0.982 accuracy on ESC-50, 0.976 accuracy on SCV2, and 0.823 accuracy on DCASE2019 Task1A.
