Paper Title
Learnable Front Ends Based on Temporal Modulation for Music Tagging
Paper Authors
Paper Abstract
While end-to-end systems are becoming popular in auditory signal processing, including automatic music tagging, models that use raw audio as input need large amounts of data and computational resources in the absence of domain knowledge. Inspired by the fact that temporal modulation is regarded as an essential component of auditory perception, we introduce the Temporal Modulation Neural Network (TMNN), which combines Mel-like data-driven front ends and temporal modulation filters with a simple ResNet back end. The structure includes a set of temporal modulation filters that capture long-term patterns across all frequency channels. Experimental results show that the proposed front ends surpass state-of-the-art (SOTA) methods on the MagnaTagATune dataset for automatic music tagging, and they are also helpful for keyword spotting on speech commands. Moreover, the per-tag model performance suggests that genre or instrument tags with complex rhythms, as well as mood tags, benefit especially from temporal modulation.
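To make the front-end idea concrete, the following is a minimal sketch (not the paper's implementation) of a temporal modulation filter bank: a Mel-like spectrogram is convolved along the time axis with a set of Gabor-like kernels, one per modulation rate, shared across all frequency channels. The function name, kernel shape, and rate values are illustrative assumptions.

```python
import numpy as np

def temporal_modulation_filterbank(spec, mod_rates_hz, frame_rate_hz, filter_len=33):
    """Filter a Mel-like spectrogram along time with modulation-rate kernels.

    spec          : array of shape (n_mels, n_frames)
    mod_rates_hz  : iterable of modulation rates in Hz (illustrative values)
    frame_rate_hz : spectrogram frame rate in Hz
    Returns an array of shape (n_rates, n_mels, n_frames).
    """
    # Time axis of the kernel, centered at zero.
    t = (np.arange(filter_len) - filter_len // 2) / frame_rate_hz
    window = np.hanning(filter_len)
    outputs = []
    for rate in mod_rates_hz:
        # Hann-windowed cosine tuned to one modulation rate (Gabor-like).
        kernel = window * np.cos(2.0 * np.pi * rate * t)
        # Convolve every frequency channel with the same temporal kernel,
        # so long-term patterns are captured uniformly across channels.
        filtered = np.apply_along_axis(
            lambda row: np.convolve(row, kernel, mode="same"), 1, spec)
        outputs.append(filtered)
    return np.stack(outputs)

# Usage: 128 Mel bands, 256 frames, three hypothetical modulation rates.
spec = np.random.randn(128, 256)
feats = temporal_modulation_filterbank(spec, [2.0, 4.0, 8.0], frame_rate_hz=100.0)
print(feats.shape)  # (3, 128, 256)
```

In a full model, a stack like `feats` would feed the ResNet back end; here the kernels are fixed, whereas the paper's filters are learnable.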