Paper Title


Active Contrastive Learning of Audio-Visual Video Representations

Paper Authors

Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

Paper Abstract


Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance. However, obtaining a tight lower bound requires a sample size exponential in MI and thus a large set of negative samples. We can incorporate more samples by building a large queue-based dictionary, but there are theoretical limits to performance improvements even with a large number of negative samples. We hypothesize that random negative sampling leads to a highly redundant dictionary that results in suboptimal representations for downstream tasks. In this paper, we propose an active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items, which improves the quality of negative samples and improves performance on tasks where there is high mutual information in the data, e.g., video classification. Our model achieves state-of-the-art performance on challenging audio and visual downstream benchmarks including UCF101, HMDB51, and ESC50. Code is available at: https://github.com/yunyikristy/CM-ACC
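To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (1) a queue-based InfoNCE contrastive loss and (2) an "actively sampled" dictionary that prefers diverse, non-redundant negatives over random ones. This is not the authors' released implementation: the function names `info_nce_loss` and `active_select`, and the greedy least-redundancy selection heuristic, are illustrative assumptions, not the paper's exact active-sampling criterion.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive_key, queue, temperature=0.07):
    """InfoNCE loss: each query is contrasted against its one positive
    key and a shared dictionary (queue) of negative keys.
    query, positive_key: (B, D); queue: (K, D); all rows L2-normalized."""
    pos_logits = torch.sum(query * positive_key, dim=1, keepdim=True)  # (B, 1)
    neg_logits = query @ queue.t()                                     # (B, K)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    # The positive is always at column 0 of the logits.
    labels = torch.zeros(query.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

def active_select(candidates, queue, k):
    """Greedily pick k candidates that are least similar to anything
    already in the dictionary (and to each other), approximating a
    diverse dictionary instead of random negative sampling.
    candidates: (N, D); queue: (K, D); rows L2-normalized."""
    selected = []
    pool = queue
    for _ in range(k):
        # Redundancy of each candidate = max cosine similarity to the pool.
        redundancy = (candidates @ pool.t()).max(dim=1).values  # (N,)
        idx = int(torch.argmin(redundancy))  # least redundant candidate
        selected.append(idx)
        # Adding the pick to the pool drives its own redundancy to ~1,
        # so it will not be selected again on later iterations.
        pool = torch.cat([pool, candidates[idx:idx + 1]], dim=0)
    return candidates[selected]

# Toy usage with random unit vectors standing in for encoder features.
B, D, K, N = 8, 128, 256, 1024
q     = F.normalize(torch.randn(B, D), dim=1)   # e.g., visual-stream queries
k_pos = F.normalize(torch.randn(B, D), dim=1)   # e.g., matching audio keys
queue = F.normalize(torch.randn(K, D), dim=1)
cands = F.normalize(torch.randn(N, D), dim=1)

queue = torch.cat([queue, active_select(cands, queue, k=64)], dim=0)
loss = info_nce_loss(q, k_pos, queue)
```

The sketch reflects the abstract's argument: with random sampling, many queue entries are near-duplicates and contribute little to tightening the MI bound, whereas a selection rule that penalizes similarity to the existing dictionary keeps the negatives informative.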
