An Open-set Recognition and Few-Shot Learning Dataset for Audio Event Classification in Domestic Environments

Paper title

An Open-set Recognition and Few-Shot Learning Dataset for Audio Event Classification in Domestic Environments

Authors

Naranjo-Alcazar, Javier, Perez-Castanos, Sergi, Zuccarello, Pedro, Torres, Ana M., Lopez, Jose J., Ferri, Francesc J., Cobos, Maximo

Abstract

The problem of training with a small set of positive samples is known as few-shot learning (FSL). It is widely known that traditional deep learning (DL) algorithms usually perform very well when trained on large datasets. However, in many applications it is not possible to obtain such a large number of samples. In the image domain, typical FSL applications include those related to face recognition. In the audio domain, music fraud or speaker recognition can clearly benefit from FSL methods. This paper deals with the application of FSL to the detection of specific and intentional acoustic events produced by different types of sound alarms, such as doorbells or fire alarms, using a limited number of samples. These sounds typically occur in domestic environments, where many events corresponding to a wide variety of sound classes take place. Therefore, detecting such alarms in a practical scenario can be considered an open-set recognition (OSR) problem. To address the lack of a dedicated public dataset for audio FSL, researchers usually modify other available datasets. This paper aims to provide the audio recognition community with a carefully annotated dataset (https://zenodo.org/record/3689288) for FSL in an OSR context, comprising 1360 clips from 34 classes divided into pattern sounds and unwanted sounds. To facilitate and promote research in this area, results obtained with state-of-the-art baseline systems based on transfer learning are also presented.
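To make the FSL-within-OSR framing concrete, here is a minimal sketch of one generic way to combine the two: build one prototype (the mean embedding) per known pattern class from its few labelled shots, assign a query clip to the nearest prototype, and reject it as unknown (an unwanted sound) if even the nearest prototype is farther than a distance threshold. This is not the paper's baseline. It assumes clip embeddings have already been extracted, for example by a pretrained audio network in a transfer-learning setup; the class names, toy 2-D embeddings, and threshold value below are all hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Mean of a small set of support embeddings: the class prototype."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_prototypes(support: Dict[str, List[Vector]]) -> Dict[str, List[float]]:
    """One prototype per known (pattern) class, built from its few labelled shots."""
    return {label: centroid(shots) for label, shots in support.items()}


def classify_open_set(
    query: Vector, prototypes: Dict[str, List[float]], threshold: float
) -> Optional[str]:
    """Return the nearest known class, or None ("unknown") if every prototype is too far."""
    best_label, best_dist = None, math.inf
    for label, proto in prototypes.items():
        d = euclidean(query, proto)
        if d < best_dist:
            best_label, best_dist = label, d
    # Open-set rejection: a query far from all known classes is treated as unwanted.
    return best_label if best_dist <= threshold else None


if __name__ == "__main__":
    # Hypothetical 2-shot support set with toy 2-D embeddings.
    support = {
        "door_bell": [[1.0, 0.0], [0.9, 0.1]],
        "fire_alarm": [[0.0, 1.0], [0.1, 0.9]],
    }
    protos = build_prototypes(support)
    print(classify_open_set([0.95, 0.05], protos, threshold=0.5))  # door_bell
    print(classify_open_set([5.0, 5.0], protos, threshold=0.5))    # None
```

The threshold is what turns a closed-set few-shot classifier into an open-set one; in practice it would have to be tuned on held-out known and unwanted clips, and the trade-off between missed alarms and false alarms depends on that choice.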
