Paper Title

Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering

Paper Authors

Min Peng, Chongyang Wang, Yuan Gao, Yu Shi, Xiang-Dong Zhou

Paper Abstract

Video question answering (VideoQA) is challenging given its multimodal combination of visual understanding and natural language processing. While most existing approaches ignore the visual appearance-motion information at different temporal scales, it is unknown how to incorporate the multilevel processing capacity of a deep learning model with such multiscale information. Targeting these issues, this paper proposes a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA. MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR). With multiscale sampling, RMI iterates the interaction between the appearance-motion information at each scale and the question embeddings to build multilevel question-guided visual representations. Thereon, with a shared transformer encoder, PVR infers the visual cues at each level in parallel, accommodating different question types that may rely on visual information at the relevant levels. Through extensive experiments on three VideoQA datasets, we demonstrate improved performance over previous state-of-the-art methods and justify the effectiveness of each part of our method.
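To make the two-module structure more concrete, below is a minimal PyTorch sketch of the pipeline described in the abstract. It is an illustration under assumptions, not the authors' implementation: the class names RecurrentMultimodalInteraction, ParallelVisualReasoning, and MHNSketch, the feature dimension, the number of temporal scales, and the use of cross-attention for question guidance are choices made for this example.

```python
# Illustrative sketch only: module names and all hyperparameters below are
# assumptions for this example, not taken from the authors' released code.
import torch
import torch.nn as nn


class RecurrentMultimodalInteraction(nn.Module):
    """Fuses appearance-motion features at one temporal scale with the
    question embedding, conditioned on the previous level's output."""

    def __init__(self, dim):
        super().__init__()
        self.visual_proj = nn.Linear(2 * dim, dim)  # concat(appearance, motion) -> dim
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, appearance, motion, question, prev=None):
        # appearance, motion: (B, T_s, dim) clip features sampled at this scale
        visual = self.visual_proj(torch.cat([appearance, motion], dim=-1))
        if prev is not None:
            # carry the previous level's summary into the current level
            visual = visual + prev.mean(dim=1, keepdim=True)
        fused, _ = self.cross_attn(visual, question, question)  # question-guided
        return self.norm(visual + fused)                         # (B, T_s, dim)


class ParallelVisualReasoning(nn.Module):
    """Applies one shared transformer encoder to every level's representation."""

    def __init__(self, dim, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, levels):
        # levels: list of (B, T_s, dim) tensors, one per temporal scale
        pooled = [self.encoder(x).mean(dim=1) for x in levels]  # (B, dim) each
        return torch.stack(pooled, dim=1).mean(dim=1)           # aggregate levels


class MHNSketch(nn.Module):
    def __init__(self, dim=256, num_scales=3, num_answers=1000):
        super().__init__()
        self.rmi = nn.ModuleList(
            [RecurrentMultimodalInteraction(dim) for _ in range(num_scales)]
        )
        self.pvr = ParallelVisualReasoning(dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, appearances, motions, question):
        # appearances/motions: lists of per-scale features; question: (B, L, dim)
        levels, prev = [], None
        for rmi, a, m in zip(self.rmi, appearances, motions):
            prev = rmi(a, m, question, prev)
            levels.append(prev)
        return self.classifier(self.pvr(levels))
```

The key structural points from the abstract are reflected here: one RMI block per temporal scale, each conditioned on the previous level's output, and a single transformer encoder whose weights are shared across levels in PVR, so that all levels are reasoned over in parallel before answer classification.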
