Audio Retrieval with WavText5K and CLAP Training

Paper Title

Audio Retrieval with WavText5K and CLAP Training

Authors

Soham Deshmukh, Benjamin Elizalde, Huaming Wang

Abstract

Audio-Text retrieval takes a natural language query to retrieve relevant audio files in a database. Conversely, Text-Audio retrieval takes an audio file as a query to retrieve relevant natural language descriptions. Most of the literature trains retrieval systems with a single audio captioning dataset, and the benefit of training with multiple datasets is underexplored. Moreover, retrieval systems have to learn the alignment between elaborate sentences and the audio content they describe, which varies in length from a few seconds to several minutes. In this work, we propose a new collection of web audio-text pairs and a new framework for retrieval. First, we provide a new collection of about five thousand web audio-text pairs that we refer to as WavText5K. When used to train our retrieval system, WavText5K improved performance more than other audio captioning datasets. Second, our framework learns to connect language and audio content by using a text encoder, two audio encoders, and a contrastive learning objective. Combining both audio encoders helps to process variable-length audio. The two contributions beat state-of-the-art performance for AudioCaps and Clotho on Text-Audio retrieval by a relative 2% and 16%, and on Audio-Text retrieval by 6% and 23%.
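
The abstract names a text encoder, two audio encoders, and a contrastive learning objective, but gives no implementation details. The following is a minimal NumPy sketch of that general idea, not the authors' code. Every function name here is hypothetical, and three choices are assumptions: the two audio embeddings are fused by concatenation followed by a linear projection, the objective is the common symmetric softmax cross-entropy form, and the temperature defaults to 0.07.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def combine_audio_embeddings(emb_a, emb_b, projection):
    """Fuse the outputs of two audio encoders into one shared-space embedding.

    Assumption: concatenate, then apply a (learned) linear projection.
    The abstract does not state how the two encoders are combined.
    """
    return np.concatenate([emb_a, emb_b], axis=-1) @ projection


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N paired embeddings.

    Row i of audio_emb and row i of text_emb are a matching pair; all
    other rows in the batch act as negatives. The temperature value is
    an assumption, not taken from the paper.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature  # [i, j] = similarity(audio i, text j)
    idx = np.arange(logits.shape[0])
    loss_audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)


def rank_audio_for_text(text_emb, audio_emb):
    """For each text query, return audio indices sorted by descending similarity."""
    sims = l2_normalize(text_emb) @ l2_normalize(audio_emb).T
    return np.argsort(-sims, axis=1)
```

At retrieval time, the same similarity matrix that drives the loss is used to rank candidates: `rank_audio_for_text` ranks audio clips for each text query, and swapping its two arguments ranks captions for each audio query.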
