Movie Genre Classification by Language Augmentation and Shot Sampling

Paper Title

Movie Genre Classification by Language Augmentation and Shot Sampling

Authors

Zhongping Zhang, Yiwen Gu, Bryan A. Plummer, Xin Miao, Jiayi Liu, Huayan Wang

Abstract

Video-based movie genre classification has garnered considerable attention due to its various applications in recommendation systems. Prior work has typically addressed this task by adapting models from traditional video classification tasks, such as action recognition or event detection. However, these models often neglect the language elements (e.g., narrations or conversations) present in videos, which can implicitly convey high-level semantics of movie genres, such as storylines or background context. Additionally, existing approaches are primarily designed to encode the entire content of the input video, which makes genre prediction inefficient. A few shots may be enough to determine a movie's genres accurately, rendering a comprehensive understanding of the entire video unnecessary. To address these challenges, we propose a Movie genre Classification method based on Language augmentatIon and shot samPling (Movie-CLIP). Movie-CLIP mainly consists of two parts: a language augmentation module that recognizes language elements from the input audio, and a shot sampling module that selects representative shots from the entire video. We evaluate our method on the MovieNet and Condensed Movies datasets, achieving an improvement of approximately 6-9% in mean Average Precision (mAP) over the baselines. We also generalize Movie-CLIP to the scene boundary detection task, achieving a 1.1% improvement in Average Precision (AP) over the state-of-the-art. We release our implementation at github.com/Zhongping-Zhang/Movie-CLIP.
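For readers unfamiliar with the metric reported above, the following is a minimal sketch of how mean Average Precision is commonly computed for multi-label genre prediction: an Average Precision is computed per genre from ranked scores, then averaged across genres. The function names, the input layout, and the toy data are assumptions made for illustration; the abstract does not specify the exact evaluation protocol, so this should not be read as the authors' evaluation code.

```python
def average_precision(labels, scores):
    """AP for one genre: mean of the precision values at each rank
    where a positive (label == 1) item appears in the score ranking."""
    # Rank item indices by descending predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    # Convention assumed here: a genre with no positives contributes 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(label_matrix, score_matrix):
    """label_matrix[i][g] and score_matrix[i][g] refer to video i and genre g
    (hypothetical layout). Returns the unweighted mean of per-genre APs."""
    num_genres = len(label_matrix[0])
    aps = []
    for g in range(num_genres):
        labels = [row[g] for row in label_matrix]
        scores = [row[g] for row in score_matrix]
        aps.append(average_precision(labels, scores))
    return sum(aps) / len(aps)


# Toy data: three videos, two genres.
toy_labels = [[1, 0], [0, 1], [1, 0]]
toy_scores = [[0.9, 0.1], [0.8, 0.9], [0.7, 0.2]]
toy_map = mean_average_precision(toy_labels, toy_scores)
```

In the toy data, the first genre ranks its two positives at positions 1 and 3 and the second genre ranks its single positive first, so the per-genre APs differ and the mAP is their plain average.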
