Paper Title
Dynamic Multimodal Fusion
Paper Authors
Paper Abstract
Deep multimodal learning has achieved great progress in recent years. However, current fusion approaches are static in nature, i.e., they process and fuse multimodal inputs with identical computation, without accounting for diverse computational demands of different multimodal data. In this work, we propose dynamic multimodal fusion (DynMM), a new approach that adaptively fuses multimodal data and generates data-dependent forward paths during inference. To this end, we propose a gating function to provide modality-level or fusion-level decisions on-the-fly based on multimodal features and a resource-aware loss function that encourages computational efficiency. Results on various multimodal tasks demonstrate the efficiency and wide applicability of our approach. For instance, DynMM can reduce the computation costs by 46.5% with only a negligible accuracy loss (CMU-MOSEI sentiment analysis) and improve segmentation performance with over 21% savings in computation (NYU Depth V2 semantic segmentation) when compared with static fusion approaches. We believe our approach opens a new direction towards dynamic multimodal network design, with applications to a wide range of multimodal tasks.
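The abstract describes a gating function that makes modality-level or fusion-level decisions from multimodal features, trained with a resource-aware loss that penalizes computation. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the module name DynMMGate, the two fusion branches, the feature dimensions, and the per-branch cost values are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynMMGate(nn.Module):
    """Toy gating network: selects one of several fusion branches per sample.

    A sketch of the idea described in the abstract; branch choices,
    dimensions, and costs are assumptions, not the paper's code.
    """

    def __init__(self, dim_a, dim_b, num_branches, hidden=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_branches),
        )

    def forward(self, feat_a, feat_b, hard=True):
        logits = self.gate(torch.cat([feat_a, feat_b], dim=-1))
        # Gumbel-softmax yields (approximately) discrete yet differentiable decisions.
        return F.gumbel_softmax(logits, tau=1.0, hard=hard)


def resource_aware_loss(task_loss, gate_probs, branch_costs, weight=0.1):
    """Adds an expected-computation penalty to the task loss.

    branch_costs: assumed per-branch cost proxy (e.g., relative MAdds).
    """
    expected_cost = (gate_probs * branch_costs).sum(dim=-1).mean()
    return task_loss + weight * expected_cost


# Usage sketch: two branches (a cheap unimodal path vs. full multimodal fusion).
if __name__ == "__main__":
    gate = DynMMGate(dim_a=32, dim_b=32, num_branches=2)
    feat_a, feat_b = torch.randn(4, 32), torch.randn(4, 32)
    probs = gate(feat_a, feat_b)                                 # (4, 2) near-one-hot weights
    costs = torch.tensor([1.0, 3.0])                             # assumed relative branch costs
    branch_outs = torch.stack([feat_a, feat_a + feat_b], dim=1)  # (4, 2, 32) branch outputs
    fused = (probs.unsqueeze(-1) * branch_outs).sum(dim=1)       # gated combination
    task_loss = fused.pow(2).mean()                              # placeholder task loss
    loss = resource_aware_loss(task_loss, probs, costs)
    loss.backward()
```

The key design choice this sketch mirrors is that the computation penalty is applied to the gate's expected branch cost, so training trades a small amount of task loss for routing more samples through cheaper forward paths.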