Human Activity Recognition Using Cascaded Dual Attention CNN and Bi-Directional GRU Framework

Paper Title

Human Activity Recognition Using Cascaded Dual Attention CNN and Bi-Directional GRU Framework

Authors

Hayat Ullah, Arslan Munir

Abstract

Vision-based human activity recognition has emerged as one of the essential research areas in the video analytics domain. Over the last decade, numerous advanced deep learning algorithms have been introduced to recognize complex human actions from video streams, and they have shown impressive performance on the human activity recognition task. However, these newly introduced methods focus exclusively either on model performance or on computational efficiency and robustness, resulting in a biased tradeoff when dealing with challenging human activity recognition problems. To overcome these limitations of contemporary deep learning models, this paper presents a computationally efficient yet generic spatial-temporal cascaded framework that exploits deep discriminative spatial and temporal features for human activity recognition. For efficient representation of human actions, we propose a dual attention convolutional neural network (CNN) architecture that leverages a unified channel-spatial attention mechanism to extract human-centric salient features from video frames. The dual channel-spatial attention layers, together with the convolutional layers, learn to be more attentive to the spatial receptive fields containing objects across the feature maps. The extracted discriminative salient features are then forwarded to a stacked bi-directional gated recurrent unit (Bi-GRU) for long-term temporal modeling and recognition of human actions using both forward and backward pass gradient learning. Extensive experiments show that the proposed framework attains up to a 167-fold improvement in execution time, measured in frames per second, compared to most contemporary action recognition methods.
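
The cascade described in the abstract (channel-spatial attention on per-frame feature maps, followed by a bi-directional GRU over the frame sequence) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the layer sizes, the random weights, the omission of bias terms, the pointwise spatial gate (standing in for a convolution), the use of a single Bi-GRU layer rather than a stack, and all function names (`channel_attention`, `spatial_attention`, `bi_gru`, `classify_clip`, `make_params`) are assumptions chosen only to keep the example short.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Gate each channel of a (C, H, W) feature map with a weight in (0, 1)."""
    avg = feat.mean(axis=(1, 2))                 # (C,) average-pooled descriptor
    mx = feat.max(axis=(1, 2))                   # (C,) max-pooled descriptor
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0) # shared two-layer MLP with ReLU
    gate = sigmoid(mlp(avg) + mlp(mx))
    return feat * gate[:, None, None]

def spatial_attention(feat, k):
    """Gate each spatial location; k mixes channel-wise mean and max maps
    (a pointwise stand-in for a learned convolution, assumed for brevity)."""
    gate = sigmoid(k[0] * feat.mean(axis=0) + k[1] * feat.max(axis=0))
    return feat * gate[None, :, :]

def init_gru(in_dim, hid, rng):
    """Random weights for the update (z), reset (r) and candidate (n) gates."""
    p = {}
    for g in "zrn":
        p["W" + g] = rng.normal(0, 0.1, (hid, in_dim))
        p["U" + g] = rng.normal(0, 0.1, (hid, hid))
    return p

def gru_step(x, h, p):
    """One GRU update (biases omitted)."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)
    n = np.tanh(p["Wn"] @ x + p["Un"] @ (r * h))
    return (1 - z) * n + z * h

def bi_gru(seq, p_fwd, p_bwd, hid):
    """Run one GRU forward and one backward in time; concatenate per-step states."""
    T = len(seq)
    h_f, h_b = np.zeros(hid), np.zeros(hid)
    out_f, out_b = [], [None] * T
    for t in range(T):
        h_f = gru_step(seq[t], h_f, p_fwd)
        out_f.append(h_f)
    for t in reversed(range(T)):
        h_b = gru_step(seq[t], h_b, p_bwd)
        out_b[t] = h_b
    return np.concatenate([np.stack(out_f), np.stack(out_b)], axis=1)  # (T, 2*hid)

def make_params(channels, hid, n_classes, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0, 0.1, (channels // 2, channels)),
        "w2": rng.normal(0, 0.1, (channels, channels // 2)),
        "k": rng.normal(0, 1.0, 2),
        "fwd": init_gru(channels, hid, rng),
        "bwd": init_gru(channels, hid, rng),
        "Wc": rng.normal(0, 0.1, (n_classes, 2 * hid)),
        "hid": hid,
    }

def classify_clip(frames, params):
    """frames: (T, C, H, W) per-frame CNN feature maps -> class probabilities."""
    feats = []
    for f in frames:
        f = channel_attention(f, params["w1"], params["w2"])
        f = spatial_attention(f, params["k"])
        feats.append(f.mean(axis=(1, 2)))        # global average pool to (C,)
    states = bi_gru(np.stack(feats), params["fwd"], params["bwd"], params["hid"])
    logits = params["Wc"] @ states.mean(axis=0)  # average over time, then classify
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

For example, `classify_clip(np.random.default_rng(1).normal(size=(5, 8, 4, 4)), make_params(8, 6, 3))` returns a length-3 probability vector for a 5-frame clip with 8-channel 4x4 feature maps. With untrained random weights the probabilities carry no meaning; the sketch only shows how data flows through the two stages.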
