Paper Title
Inductive Attention for Video Action Anticipation
Paper Authors
Paper Abstract
Anticipating future actions based on spatiotemporal observations is essential in video understanding and predictive computer vision. Moreover, a model capable of anticipating the future has important applications: it can enable precautionary systems to react before an event occurs. However, unlike in the action recognition task, future information is inaccessible at observation time -- a model cannot directly map the observed video frames to the target action to solve the anticipation task. Instead, temporal inference is required to associate relevant evidence with possible future actions. Consequently, existing solutions built on action recognition models are suboptimal. Recently, researchers have proposed extending the observation window to capture longer pre-action profiles from past moments and leveraging attention to retrieve subtle evidence and improve anticipation predictions. However, existing attention designs typically use frame inputs as the query, which is suboptimal because a video frame is only weakly connected to the future action. To this end, we propose an inductive attention model, dubbed IAM, which leverages the current prediction priors as the query to infer future actions and can efficiently process long video content. Furthermore, our method accounts for the uncertainty of the future via a many-to-many association in the attention design. As a result, IAM consistently outperforms state-of-the-art anticipation models on multiple large-scale egocentric video datasets while using significantly fewer model parameters.
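The abstract's central idea is to use the model's current prediction prior, rather than frame features, as the attention query, and to keep several query slots so that multiple possible futures are considered (the many-to-many association). The following is a minimal sketch of that idea, not the authors' actual IAM architecture: the module name PriorQueryAttention, the num_futures parameter, the uniform initial prior, and the averaging fusion are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): attention whose query is
# derived from the current prediction prior over action classes, with several
# query slots standing in for multiple hypothesized futures.
import torch
import torch.nn as nn


class PriorQueryAttention(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int, num_futures: int = 4):
        super().__init__()
        # Map the prediction prior (class probabilities) to several query
        # slots, one per hypothesized future action.
        self.to_queries = nn.Linear(num_classes, feat_dim * num_futures)
        self.num_futures = num_futures
        self.feat_dim = feat_dim
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) features of the observed frames (keys/values)
        # prior:       (B, C) current prediction prior over action classes
        B = frame_feats.size(0)
        queries = self.to_queries(prior).view(B, self.num_futures, self.feat_dim)
        # Each query slot attends over the long observation to gather evidence.
        evidence, _ = self.attn(queries, frame_feats, frame_feats)
        # One action distribution per hypothesized future; average as a simple fusion.
        logits = self.classifier(evidence)            # (B, num_futures, C)
        return logits.mean(dim=1).softmax(dim=-1)     # updated prior, (B, C)


if __name__ == "__main__":
    B, T, D, C = 2, 64, 256, 100
    module = PriorQueryAttention(num_classes=C, feat_dim=D)
    feats = torch.randn(B, T, D)                  # long observed video features
    prior = torch.full((B, C), 1.0 / C)           # start from a uniform prior
    for _ in range(3):                            # iteratively refine the prior
        prior = module(feats, prior)
    print(prior.shape)                            # torch.Size([2, 100])
```

Deriving the query from the prior lets the module be applied iteratively: each pass attends over the long observation with an updated hypothesis about the future action, which is one plausible reading of the "inductive" attention described in the abstract; the exact query construction, fusion, and training setup in IAM may differ.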