Paper Title
Relation-guided acoustic scene classification aided with event embeddings
Paper Authors
Paper Abstract
In real life, acoustic scenes and audio events are naturally correlated. Humans instinctively rely on fine-grained audio events as well as overall sound characteristics to distinguish diverse acoustic scenes. Yet most previous approaches treat acoustic scene classification (ASC) and audio event classification (AEC) as two independent tasks. The few studies on joint scene and event classification either use synthetic audio datasets that hardly match the real world, or simply use a multi-task framework to perform the two tasks simultaneously. Neither approach makes full use of the implicit and inherent relation between fine-grained events and coarse-grained scenes. To this end, this paper proposes a relation-guided ASC (RGASC) model to further exploit and coordinate the scene-event relation for the mutual benefit of scene and event recognition. The TUT Urban Acoustic Scenes 2018 dataset (TUT2018) is annotated with pseudo labels of events by PANN, a simple and efficient pre-trained audio model that is among the state-of-the-art AEC models. Then, a prior scene-event relation matrix is defined as the average probability of the presence of each event type in each scene class. Finally, the two-tower RGASC model is jointly trained on the real-life dataset TUT2018 for both scene and event classification. The following results are achieved. 1) RGASC effectively coordinates the true information of coarse-grained scenes and the pseudo information of fine-grained events. 2) The event embeddings learned from pseudo labels under the guidance of prior scene-event relations help reduce the confusion between similar acoustic scenes. 3) Compared with other (non-ensemble) methods, RGASC improves scene classification accuracy on the real-life dataset.
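The prior scene-event relation matrix described in the abstract — the average predicted probability of each event type's presence within each scene class — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, array shapes, and toy inputs are assumptions, and `event_probs` stands in for the clip-level event probabilities produced by PANN.

```python
import numpy as np

def scene_event_relation_matrix(event_probs, scene_labels, num_scenes):
    """Prior scene-event relation matrix.

    relation[s, e] is the mean predicted probability of event e
    over all clips whose (true) scene label is s.
    """
    event_probs = np.asarray(event_probs, dtype=float)   # (n_clips, n_events)
    scene_labels = np.asarray(scene_labels)              # (n_clips,)
    num_events = event_probs.shape[1]
    relation = np.zeros((num_scenes, num_events))
    for s in range(num_scenes):
        mask = scene_labels == s
        if mask.any():
            # Average the pseudo event probabilities within this scene class.
            relation[s] = event_probs[mask].mean(axis=0)
    return relation

# Toy example: 3 clips, 2 event types, 2 scene classes.
probs = [[0.9, 0.1],   # clip of scene 0
         [0.7, 0.3],   # clip of scene 0
         [0.2, 0.8]]   # clip of scene 1
R = scene_event_relation_matrix(probs, [0, 0, 1], num_scenes=2)
# R[0] averages the first two rows; R[1] is the last row.
```

In the toy example, R[0] becomes [0.8, 0.2] and R[1] becomes [0.2, 0.8], i.e. each scene class is summarized by the typical event profile of its clips.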