Paper Title
Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection
Paper Authors
Paper Abstract
Video Anomaly Detection (VAD) has traditionally been tackled with two main methodologies: the reconstruction-based approach and the prediction-based one. Because reconstruction-based methods learn to reconstruct the input image, the model tends to merely learn an identity function, which causes the so-called generalization problem: the model reconstructs abnormal inputs almost as well as normal ones. In contrast, since prediction-based methods learn to predict a future frame from several previous frames, they are less sensitive to the generalization problem. However, it is still uncertain whether such a model can learn the spatio-temporal context of a video. Our intuition is that understanding the spatio-temporal context of a video plays a vital role in VAD, as it provides precise information on how the appearance of an event changes within a video clip. Hence, to fully exploit contextual information for anomaly detection in video, we design a transformer model with three different contextual prediction streams: masked, whole, and partial. By learning to predict the missing frames of consecutive normal frames, our model can effectively learn various normality patterns in the video, which leads to high reconstruction errors for abnormal cases that do not fit the learned context. To verify the effectiveness of our approach, we evaluate our model on the public benchmark datasets UCSD Pedestrian 2, CUHK Avenue, and ShanghaiTech, using a reconstruction-error-based anomaly score. The results demonstrate that our proposed approach achieves competitive performance compared to existing video anomaly detection methods.
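The abstract does not give architectural details, so the following is a minimal PyTorch sketch of the multi-contextual prediction idea only: one shared video transformer is trained under three masking patterns ("masked", "whole", and "partial"), with the loss computed solely on the hidden frames. The module names, dimensions, and the exact masking patterns below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of multi-contextual masked-frame prediction for VAD.
# All hyperparameters and masking patterns are hypothetical.
import torch
import torch.nn as nn

class VideoTransformerPredictor(nn.Module):
    """Predicts flattened frame features for a clip of T frames."""
    def __init__(self, frame_dim=1024, d_model=256, n_heads=4, n_layers=4, T=8):
        super().__init__()
        self.embed = nn.Linear(frame_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, T, d_model))       # learned positions
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))  # placeholder for hidden frames
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, frame_dim)

    def forward(self, frames, mask):
        # frames: (B, T, frame_dim); mask: (T,) bool, True = frame is hidden.
        x = self.embed(frames)
        x = torch.where(mask.view(1, -1, 1), self.mask_token.expand_as(x), x)
        x = self.encoder(x + self.pos)
        return self.head(x)

def contextual_masks(T):
    """Hypothetical masking patterns for the three prediction streams."""
    masked = torch.zeros(T, dtype=torch.bool)
    masked[T // 2] = True           # "masked": hide a middle frame
    whole = torch.zeros(T, dtype=torch.bool)
    whole[-1] = True                # "whole": hide the future frame
    partial = torch.zeros(T, dtype=torch.bool)
    partial[T // 2:] = True         # "partial": hide the latter half
    return {"masked": masked, "whole": whole, "partial": partial}

model = VideoTransformerPredictor()
clip = torch.randn(2, 8, 1024)      # batch of flattened frame features
loss = sum(
    (model(clip, m) - clip)[:, m].pow(2).mean()  # error only on hidden frames
    for m in contextual_masks(8).values()
)
loss.backward()
```

At test time, the same per-frame reconstruction error on the hidden frames would serve as the anomaly score: frames whose events do not fit the learned normal context are predicted poorly and therefore score high.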