Title

Multimodal Vision Transformers with Forced Attention for Behavior Analysis

Authors

Agrawal, Tanay, Balazia, Michal, Müller, Philipp, Brémond, François

Abstract

Human behavior understanding requires looking at minute details in the large context of a scene containing multiple input modalities. It is necessary as it allows the design of more human-like machines. While transformer approaches have shown great improvements, they face multiple challenges such as lack of data or background noise. To tackle these challenges, we introduce the Forced Attention (FAt) Transformer, which utilizes forced attention with a modified backbone for input encoding and the use of additional inputs. In addition to improving performance on different tasks and inputs, the modification requires less time and memory. We provide a model for generalised feature extraction for tasks concerning social signals and behavior analysis. Our focus is on understanding behavior in videos where people are interacting with each other or talking into the camera, which simulates the first-person point of view in social interaction. FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition. We achieve state-of-the-art results on the Udiva v0.5, First Impressions v2 and MPII Group Interaction datasets. We further provide an extensive ablation study of the proposed architecture.
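The abstract does not spell out how forced attention is realized. One plausible minimal reading is that attention logits over background tokens are penalized so that attention mass is forced onto person regions (e.g., from a segmentation mask). The sketch below illustrates that idea for a single attention head; the function name `forced_attention`, the additive-bias formulation, and the `person_mask` input are our illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def forced_attention(q, k, v, person_mask, bias=-10.0):
    """Single-head attention where keys outside the person mask get a
    large negative bias, forcing attention onto person tokens.
    person_mask: 1.0 for person tokens, 0.0 for background tokens.
    (Illustrative sketch only; not the paper's exact formulation.)"""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                        # (Tq, Tk)
    logits = logits + bias * (1.0 - person_mask)[None, :]  # penalize background keys
    attn = softmax(logits, axis=-1)
    return attn @ v, attn

# toy example: 4 tokens with the last two marked as background
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
mask = np.array([1.0, 1.0, 0.0, 0.0])
out, attn = forced_attention(q, k, v, mask)
# nearly all attention mass now falls on the two person tokens
```

A hard mask (`bias = -inf`) would exclude background entirely; a finite bias, as here, merely steers attention, which is often preferable when the segmentation mask itself is noisy.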
