Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset

Paper Title

Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset

Authors

Grant Van Horn, Rui Qian, Kimberly Wilber, Hartwig Adam, Oisin Mac Aodha, Serge Belongie

Abstract

We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for advancing research on audiovisual fine-grained categorization. While our community has made great strides in fine-grained visual categorization on images, the counterparts in audio and video fine-grained categorization are relatively unexplored. To encourage advancements in this space, we have carefully constructed the SSW60 dataset to enable researchers to experiment with classifying the same set of categories in three different modalities: images, audio, and video. The dataset covers 60 species of birds and comprises images from existing datasets along with brand new, expert-curated audio and video datasets. We thoroughly benchmark audiovisual classification performance and modality fusion experiments using state-of-the-art transformer methods. Our findings show that audiovisual fusion methods outperform exclusively image-based or audio-based methods on the task of video classification. We also present interesting modality transfer experiments, enabled by the unique construction of SSW60 to encompass three different modalities. We hope the SSW60 dataset and accompanying baselines spur research in this fascinating area.
