Paper Title
Trying Bilinear Pooling in Video-QA
Paper Authors
Paper Abstract
Bilinear pooling (BLP) refers to a recently developed family of operations for fusing features from different modalities, devised predominantly for VQA models. A bilinear (outer-product) expansion is thought to encourage models to learn interactions between two feature spaces, and has experimentally outperformed `simpler' vector operations (concatenation and element-wise addition/multiplication) on VQA benchmarks. Successive BLP techniques have yielded higher performance with lower computational expense and are often implemented alongside attention mechanisms. However, despite significant progress in VQA, BLP methods have not been widely applied to more recently explored video question answering (video-QA) tasks. In this paper, we begin to bridge this research gap by applying BLP techniques to various video-QA benchmarks, namely: TVQA, TGIF-QA, Ego-VQA and MSVD-QA. We share results on the TVQA baseline model and on the recently proposed heterogeneous-memory-enhanced multimodal attention (HME) model. Our experiments include both simply replacing feature concatenation in the existing models with BLP, and a modified version of the TVQA baseline, which we name the `dual-stream' model, designed to accommodate BLP. We find that our relatively simple integration of BLP does not increase, and mostly harms, performance on these video-QA benchmarks. Using recently proposed theoretical multimodal fusion taxonomies, we offer insight into why BLP-driven performance gains may be more difficult to achieve on video-QA benchmarks than in earlier VQA models. We suggest a few additional `best practices' to consider when applying BLP to video-QA. We stress that video-QA models should carefully consider where the complex representational potential of BLP is actually needed, in order to avoid the computational expense of `redundant' fusion.
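To make the abstract's contrast concrete, here is a minimal PyTorch sketch (not taken from the paper's code) of the `simpler' vector fusions versus a full bilinear (outer-product) fusion, plus a low-rank factorised variant in the style of later, cheaper BLP techniques such as MFB. The feature sizes, variable names, and projection matrices are illustrative assumptions only.

```python
import torch

# Hypothetical feature sizes for two modalities (assumed for illustration).
d_v, d_q = 8, 8
v = torch.randn(d_v)  # e.g. a video/visual feature vector
q = torch.randn(d_q)  # e.g. a question feature vector

# 'Simple' vector fusions the abstract refers to:
concat = torch.cat([v, q])  # concatenation -> (d_v + d_q,) dims
elemwise = v * q            # element-wise product (requires d_v == d_q)

# Full bilinear pooling: the outer product captures every pairwise
# interaction between the two feature spaces, at quadratic (d_v * d_q) cost.
bilinear = torch.outer(v, q).flatten()  # -> (d_v * d_q,) dims

# Low-rank factorised sketch (MFB-style): project both features, take an
# element-wise product, then sum-pool over factors of size k to get an
# o-dimensional fused vector at far lower cost than the full outer product.
k, o = 5, 4                   # assumed factor and output sizes
U = torch.randn(d_v, k * o)   # learned projection in a real model
V = torch.randn(d_q, k * o)   # learned projection in a real model
fused = ((v @ U) * (q @ V)).view(o, k).sum(dim=1)  # -> (o,) dims

print(concat.shape, elemwise.shape, bilinear.shape, fused.shape)
# torch.Size([16]) torch.Size([8]) torch.Size([64]) torch.Size([4])
```

The quadratic growth of the full outer product is why successive BLP methods factorise it; it is also why the abstract cautions against spending this representational capacity on `redundant' fusion points in a video-QA pipeline.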