Paper Title
ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification
Paper Authors
Paper Abstract
Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data-hungry nature of transformers and the limited amount of labelled data, most transformer-based models for audio tasks are fine-tuned from ImageNet-pretrained models, despite the huge gap between the domain of natural images and audio. This has motivated research into self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labelled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose the \textbf{L}ocal-\textbf{G}lobal \textbf{A}udio \textbf{S}pectrogram v\textbf{I}sion \textbf{T}ransformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts performance on all tasks and sets new state-of-the-art performance on five audio and speech classification tasks, outperforming recent methods, including approaches that use additional datasets for pretraining.
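To make the two ingredients named in the abstract concrete, below is a minimal, hedged sketch (not the authors' implementation) of group masked model learning combined with self-distillation on spectrogram patches, written in PyTorch. All module names, dimensions, temperatures, and the masking ratio are illustrative assumptions; ASiT's actual architecture and loss formulation are specified in the paper itself.

```python
# Minimal sketch (illustrative only) of the two ideas named in the abstract:
# group masked model learning on spectrogram patches, and self-distillation
# between a student and a teacher network. All hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def group_mask(num_patches: int, group: int = 5, ratio: float = 0.5) -> torch.Tensor:
    """Mask contiguous groups of patches (time-frequency blocks) rather than
    independent patches, so the model must infer local context to fill them."""
    mask = torch.zeros(num_patches, dtype=torch.bool)
    while mask.float().mean() < ratio:
        start = torch.randint(0, num_patches - group + 1, (1,)).item()
        mask[start:start + group] = True
    return mask


class TinyViT(nn.Module):
    """Stand-in transformer encoder over pre-embedded spectrogram patches."""
    def __init__(self, dim: int = 192, depth: int = 4, heads: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)


# Student and teacher share the architecture; in training the teacher would be
# an EMA copy of the student, here it is simply initialized from it and frozen.
dim, num_patches, batch = 192, 64, 8
student, teacher = TinyViT(dim), TinyViT(dim)
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)

mask_token = nn.Parameter(torch.zeros(dim))
patches = torch.randn(batch, num_patches, dim)   # pre-embedded spectrogram patches

mask = group_mask(num_patches)                    # one shared mask for the batch
corrupted = patches.clone()
corrupted[:, mask] = mask_token                   # replace masked groups with a token

student_out = student(corrupted)
with torch.no_grad():
    teacher_out = teacher(patches)                # teacher sees the intact input

# Local objective: reconstruct teacher features at the masked positions.
local_loss = F.mse_loss(student_out[:, mask], teacher_out[:, mask])

# Global objective: self-distillation over pooled, temperature-scaled views.
s_global = F.log_softmax(student_out.mean(dim=1) / 0.1, dim=-1)
t_global = F.softmax(teacher_out.mean(dim=1) / 0.04, dim=-1)
global_loss = F.kl_div(s_global, t_global, reduction="batchmean")

loss = local_loss + global_loss
```

In this sketch, the local term forces the student to predict teacher features for masked time-frequency groups, while the global term aligns pooled representations of the corrupted and intact inputs, mirroring the local-global contextual learning the abstract describes.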