Extending Compositional Attention Networks for Social Reasoning in Videos

Paper title

Extending Compositional Attention Networks for Social Reasoning in Videos

Authors

Christina Sartzetaki, Georgios Paraskevopoulos, Alexandros Potamianos

Abstract

We propose a novel deep architecture for the task of reasoning about social interactions in videos. We leverage the multi-step reasoning capabilities of Compositional Attention Networks (MAC), and propose a multimodal extension (MAC-X). MAC-X is based on a recurrent cell that performs iterative mid-level fusion of input modalities (visual, auditory, text) over multiple reasoning steps, by use of a temporal attention mechanism. We then combine MAC-X with LSTMs for temporal input processing in an end-to-end architecture. Our ablation studies show that the proposed MAC-X architecture can effectively leverage multimodal input cues using mid-level fusion mechanisms. We apply MAC-X to the task of Social Video Question Answering in the Social IQ dataset and obtain a 2.5% absolute improvement in terms of binary accuracy over the current state-of-the-art.
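The iterative mid-level fusion described in the abstract can be illustrated with a toy sketch. This is not the authors' MAC-X implementation: it has no learned parameters, and the function names (`temporal_attention`, `fuse_step`, `reason`), the mean-based fusion, and the 0.5/0.5 memory update are all assumptions made for illustration. It only shows the general pattern of attending over time within each modality and fusing the results into a shared memory across several reasoning steps.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def temporal_attention(query, sequence):
    # Score every time step of one modality against the query,
    # then return the attention-weighted sum of its frames.
    weights = softmax([dot(query, frame) for frame in sequence])
    dim = len(sequence[0])
    return [sum(w * frame[i] for w, frame in zip(weights, sequence))
            for i in range(dim)]

def fuse_step(memory, question, modalities):
    # One reasoning step (hypothetical rule): build a query from the
    # current memory and the question, attend over each modality
    # separately, and fuse the attended vectors by a simple mean.
    query = [m + q for m, q in zip(memory, question)]
    attended = [temporal_attention(query, seq) for seq in modalities.values()]
    fused = [sum(vals) / len(attended) for vals in zip(*attended)]
    # Assumed update: blend the old memory with the fused read.
    return [0.5 * m + 0.5 * f for m, f in zip(memory, fused)]

def reason(question, modalities, steps=3):
    # Run several fusion steps, starting from an all-zero memory.
    memory = [0.0] * len(question)
    for _ in range(steps):
        memory = fuse_step(memory, question, modalities)
    return memory

if __name__ == "__main__":
    # Toy features: every modality shares the same feature size (2),
    # but each has its own number of time steps.
    toy = {
        "visual": [[1.0, 0.0], [0.0, 1.0]],
        "audio": [[0.5, 0.5]],
        "text": [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
    }
    print(reason([1.0, 0.0], toy, steps=3))
```

In a real model the query, attention scores, and memory update would come from learned projections, and the per-modality sequences would be encoded first (the abstract mentions LSTMs for this); the sketch fixes those choices only to stay self-contained.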
