Paper Title
1-D CNN based Acoustic Scene Classification via Reducing Layer-wise Dimensionality
Paper Authors
Abstract
This paper presents an alternative to the commonly used time-frequency representations for acoustic scene classification (ASC). A raw audio signal is represented with a pre-trained convolutional neural network (CNN) through its various intermediate layers. The study assumes that the representations obtained from the intermediate layers are intrinsically low-dimensional. To obtain low-dimensional embeddings, principal component analysis is performed, and the analysis shows that only a few principal components are significant. However, the appropriate number of significant components is not known in advance. To address this, an automatic dictionary-learning framework is used to approximate the underlying subspace. Further, the low-dimensional embeddings are aggregated in a late-fusion manner within an ensemble framework to incorporate the hierarchical information learned at the various intermediate layers. The experimental evaluation is performed on the publicly available DCASE 2017 and 2018 ASC datasets using a pre-trained 1-D CNN, SoundNet. Empirically, it is observed that deeper layers admit higher compression ratios than others. At a 70% compression ratio across the different datasets, the performance is similar to that obtained without any dimensionality reduction. The proposed framework outperforms time-frequency-representation-based methods.
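As an illustration of the layer-wise PCA step the abstract describes, the sketch below projects a matrix of intermediate-layer activations onto the few principal components that explain most of the variance. This is not the authors' code: the shapes, the variance threshold, and the synthetic low-rank data are all hypothetical stand-ins, using NumPy only.

```python
import numpy as np

def pca_embed(features, var_threshold=0.95):
    """Project row-wise feature vectors onto the smallest number of
    principal components whose cumulative explained variance reaches
    var_threshold. Returns the embeddings and the component count."""
    centered = features - features.mean(axis=0)
    # SVD of the centered matrix (rows = audio clips, cols = layer dims)
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    return centered @ Vt[:k].T, k

# Toy stand-in for intermediate-layer activations of 100 clips that lie
# in a 10-D subspace of a 256-D layer (hypothetical shapes).
rng = np.random.default_rng(0)
activations = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 256))
embeddings, k = pca_embed(activations, var_threshold=0.99)
# k is at most 10 (the intrinsic rank), far below the 256 raw dimensions
print(k, embeddings.shape)
```

In the paper this selection problem (choosing k) is what the automatic dictionary-learning framework addresses, since a fixed variance threshold like the one above must otherwise be hand-tuned per layer.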