Paper Title
VGGSound: A Large-scale Audio-Visual Dataset
Paper Authors
Abstract
Our goal is to collect a large-scale audio-visual dataset with low label noise from videos in the wild using computer vision techniques. The resulting dataset can be used for training and evaluating audio recognition models. We make three contributions. First, we propose a scalable pipeline based on computer vision techniques to create an audio dataset from open-source media. Our pipeline involves obtaining videos from YouTube; using image classification algorithms to localize audio-visual correspondence; and filtering out ambient noise using audio verification. Second, we use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes. Third, we investigate various Convolutional Neural Network (CNN) architectures and aggregation approaches to establish audio recognition baselines for our new dataset. Compared to existing audio datasets, VGGSound ensures audio-visual correspondence and is collected under unconstrained conditions. Code and the dataset are available at http://www.robots.ox.ac.uk/~vgg/data/vggsound/.
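The three pipeline stages in the abstract (retrieve candidate videos, keep clips where an image classifier confirms the sound source is visible, then filter out ambient noise with audio verification) can be sketched as below. This is only an illustrative sketch: the `Clip` fields, thresholds, and filtering logic are assumptions for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch of the curation pipeline described in the abstract.
# Scores and thresholds are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Clip:
    video_id: str
    visual_score: float  # image-classifier confidence that the sound source is on screen
    audio_score: float   # audio-verification confidence that the target sound is audible

def curate(clips, visual_thresh=0.5, audio_thresh=0.5):
    """Stage 2: keep clips whose labelled source is visually confirmed.
    Stage 3: keep clips whose audio passes verification (not just ambient noise)."""
    visually_ok = [c for c in clips if c.visual_score >= visual_thresh]
    return [c for c in visually_ok if c.audio_score >= audio_thresh]

# Stage 1 (video retrieval from YouTube) is stood in for by this toy candidate list.
candidates = [
    Clip("a", 0.9, 0.8),  # source visible and audible -> kept
    Clip("b", 0.9, 0.1),  # visible but drowned in ambient noise -> dropped
    Clip("c", 0.2, 0.9),  # audible but source off-screen -> dropped
]
kept = curate(candidates)
print([c.video_id for c in kept])  # -> ['a']
```

Applying the visual check before the audio check mirrors the order given in the abstract; each stage only shrinks the candidate set, which is what keeps label noise low at scale.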