Paper Title
Exploiting the ConvLSTM: Human Action Recognition using Raw Depth Video-Based Recurrent Neural Networks
Paper Authors
Paper Abstract
As in many other fields, deep learning has become the main approach in most computer vision applications, such as scene understanding, object recognition, computer-human interaction or human action recognition (HAR). Research efforts within HAR have mainly focused on how to efficiently extract and process both the spatial and temporal dependencies of video sequences. In this paper, we propose and compare two neural networks based on the convolutional long short-term memory unit, namely ConvLSTM, with differences in the architecture and the long-term learning strategy. The former uses a video-length-adaptive input data generator (\emph{stateless}), whereas the latter explores the \emph{stateful} ability of general recurrent neural networks, applied here to the particular case of HAR. This stateful property allows the model to accumulate discriminative patterns from previous frames without compromising computer memory. Experimental results on the large-scale NTU RGB+D dataset show that the proposed models achieve competitive recognition accuracies at a lower computational cost than state-of-the-art methods, and prove that, in the particular case of videos, the rarely used stateful mode of recurrent neural networks significantly improves the accuracy obtained with the standard mode. The recognition accuracies obtained are 75.26\% (CS) and 75.45\% (CV) for the stateless model, with an average time consumption of 0.21 s per video, and 80.43\% (CS) and 79.91\% (CV) with 0.89 s for the stateful version.
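The abstract contrasts the standard stateless mode of recurrent layers with the stateful mode, where cell state is carried across successive input chunks of the same video instead of being reset after every batch. The paper itself provides no code; the following is a minimal Keras sketch of the idea, in which all dimensions, layer sizes, and names (BATCH, CHUNK, build_convlstm, etc.) are illustrative assumptions rather than the authors' actual configuration:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical dimensions: fixed-length chunks of 8 single-channel
# 64x64 depth maps, with 4 videos processed in parallel per batch.
BATCH, CHUNK, H, W, C = 4, 8, 64, 64, 1
NUM_CLASSES = 60  # NTU RGB+D defines 60 action classes

def build_convlstm(stateful: bool) -> tf.keras.Model:
    """ConvLSTM classifier; with stateful=True the cell state is
    carried across successive chunks of the same video."""
    return models.Sequential([
        # Stateful RNNs in Keras require a fixed batch size.
        layers.Input(shape=(CHUNK, H, W, C), batch_size=BATCH),
        layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                          stateful=stateful, return_sequences=False),
        layers.GlobalAveragePooling2D(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

model = build_convlstm(stateful=True)

# In stateful mode, a long video is fed as consecutive fixed-length
# chunks; the layer accumulates discriminative patterns across them
# without ever holding the whole sequence in memory.
video = np.random.rand(BATCH, 3 * CHUNK, H, W, C).astype("float32")
for t in range(0, video.shape[1], CHUNK):
    preds = model(video[:, t:t + CHUNK], training=False)

# Clear the accumulated state before the next batch of videos.
model.reset_states()
```

This sketch illustrates why the stateful variant trades speed for accuracy in the reported numbers: each video requires several sequential forward passes with state carried between them, whereas the stateless model classifies a whole (length-adapted) clip in a single pass.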