Paper Title

Weakly-Supervised Action Detection Guided by Audio Narration

Paper Authors

Keren Ye, Adriana Kovashka

Paper Abstract

Videos are better-organized, curated data sources for visual concept learning than images. Unlike 2-dimensional images, which involve only spatial information, the additional temporal dimension bridges and synchronizes multiple modalities. However, in most video detection benchmarks, these additional modalities are not fully utilized. For example, EPIC Kitchens is the largest dataset in first-person (egocentric) vision, yet it still relies on crowdsourced information to refine action boundaries and provide instance-level action annotations. We explore how to eliminate the expensive annotations in video detection data that provide refined boundaries. We propose a model that learns from narration supervision and utilizes multimodal features, including RGB, motion flow, and ambient sound. Our model learns to attend to the frames related to the narration label while suppressing irrelevant frames. Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
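To make the attention mechanism described above concrete, below is a minimal PyTorch sketch of one standard way to realize it: attention-weighted temporal pooling trained with only video-level (narration-derived) labels, where the learned frame weights localize the action at inference time. This is an illustration under assumed feature dimensions and class counts, not the paper's exact architecture.

import torch
import torch.nn as nn

class NarrationGuidedAttention(nn.Module):
    # Sketch: per-frame attention plus a classifier, supervised only
    # by a video-level label derived from the audio narration.
    def __init__(self, feat_dim=2048, num_classes=97):  # placeholder sizes
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)           # frame-relevance scorer
        self.cls = nn.Linear(feat_dim, num_classes)  # action classifier

    def forward(self, frames):
        # frames: (batch, time, feat_dim); each frame feature could
        # concatenate RGB, motion-flow, and ambient-sound embeddings.
        weights = torch.softmax(self.attn(frames), dim=1)  # (B, T, 1)
        # Weighted temporal pooling: relevant frames dominate the
        # video representation, irrelevant frames are suppressed.
        video_feat = (weights * frames).sum(dim=1)         # (B, D)
        return self.cls(video_feat), weights.squeeze(-1)

# Train with cross-entropy on the video-level narration label; at test
# time, threshold the per-frame attention weights to recover temporal
# action boundaries without instance-level annotations.
model = NarrationGuidedAttention()
logits, frame_relevance = model(torch.randn(2, 64, 2048))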
