Paper title
Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition
Paper authors
Abstract
Feature shifting has been shown to be useful for action recognition with CNN-based models since the Temporal Shift Module (TSM) was proposed. TSM is based on frame-wise feature extraction with late fusion, and shifts layer features along the time direction to enable temporal interaction. TokenShift, a recent model based on Vision Transformer (ViT), also uses a temporal feature shift mechanism, but it does not fully exploit the structure of Multi-head Self-Attention (MSA) in ViT. In this paper, we propose Multi-head Self/Cross-Attention (MSCA), which fully utilizes the attention structure. TokenShift is based on a frame-wise ViT whose features are temporally shifted between successive frames (at times t+1 and t-1). In contrast, the proposed MSCA replaces MSA in the frame-wise ViT, so that some heads attend to successive frames instead of the current frame. The computation cost is the same as that of the frame-wise ViT and TokenShift, because MSCA only changes the target of the attention. There is a choice as to which of the key, query, and value are taken from the successive frames, and we compare these variants experimentally on Kinetics400. We also investigate other variants in which the proposed MSCA is applied along the patch dimension of ViT instead of the head dimension. Experimental results show that one variant, MSCA-KV, performs best, outperforming TokenShift by 0.1% and ViT by 1.2%.
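The key/value variant described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`shift_heads`, `msca_kv`), the number of heads assigned to the previous and next frames (`n_back`, `n_fwd`), and the zero-filling of frames outside the clip are all assumptions made for illustration. The sketch takes per-frame, per-head query/key/value tensors and lets some heads draw their keys and values from frame t-1 or t+1 while the queries stay on frame t.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def shift_heads(x, n_back, n_fwd):
    """Shift groups of heads of x, shaped (T, H, N, D), along time.

    Heads [0, n_back) read frame t-1, heads [n_back, n_back + n_fwd)
    read frame t+1, and the remaining heads keep frame t.
    Frames outside the clip are zero-filled (an assumption).
    """
    out = x.copy()
    out[1:, :n_back] = x[:-1, :n_back]   # take features of previous frame
    out[0, :n_back] = 0.0                # no frame before the first one
    fwd = slice(n_back, n_back + n_fwd)
    out[:-1, fwd] = x[1:, fwd]           # take features of next frame
    out[-1, fwd] = 0.0                   # no frame after the last one
    return out


def msca_kv(q, k, v, n_back=1, n_fwd=1):
    """Self/cross-attention with shifted keys and values.

    q, k, v: arrays of shape (T frames, H heads, N tokens, D dims).
    Queries always come from the current frame; only K and V shift.
    """
    k_s = shift_heads(k, n_back, n_fwd)
    v_s = shift_heads(v, n_back, n_fwd)
    d = q.shape[-1]
    attn = softmax(q @ k_s.swapaxes(-1, -2) / np.sqrt(d))  # (T, H, N, N)
    return attn @ v_s                                      # (T, H, N, D)
```

Because the shift only re-indexes which frame each head reads from, the matrix products have the same shapes as in plain frame-wise MSA, which is consistent with the abstract's statement that the computation cost is unchanged. Setting `n_back=0, n_fwd=0` in this sketch reduces it to ordinary per-frame self-attention.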