Paper Title
Knowledge Fusion Transformers for Video Action Recognition
Paper Authors
Paper Abstract
We introduce Knowledge Fusion Transformers for video action classification. We present a self-attention based feature enhancer that fuses action knowledge into the 3D-inception-based spatio-temporal context of the video clip to be classified. We show how using only a single-stream network, with little or no pretraining, can pave the way to performance close to the current state of the art. Additionally, we present how different self-attention architectures used at different levels of the network can be blended in to enhance feature representation. Our architecture is trained and evaluated on the UCF-101 and Charades datasets, where it is competitive with the state of the art, and it exceeds single-stream networks with little or no pretraining by a large margin.
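To make the idea concrete, below is a minimal sketch of a self-attention feature enhancer applied to spatio-temporal features from a 3D-inception backbone. The module name, tensor shapes, and single-attention-layer design are illustrative assumptions for this sketch, not the paper's exact KFT architecture.

```python
import torch
import torch.nn as nn

class SelfAttentionFeatureEnhancer(nn.Module):
    """Hypothetical sketch: fuse spatio-temporal context with self-attention.

    Assumes features from a 3D-inception backbone (e.g., I3D) of shape
    (batch, channels, T, H, W). The abstract does not specify the exact
    KFT design, so this uses a single residual multi-head attention layer.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = feats.shape
        # Flatten the spatio-temporal grid into a token sequence: (B, T*H*W, C).
        tokens = feats.flatten(2).transpose(1, 2)
        # Self-attention lets every spatio-temporal position attend to all
        # others; the residual connection preserves the backbone features.
        fused, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + fused)
        # Restore the original (B, C, T, H, W) layout.
        return tokens.transpose(1, 2).reshape(b, c, t, h, w)

# Example with dummy backbone features (shapes are assumptions).
feats = torch.randn(2, 256, 8, 7, 7)  # (batch, channels, T, H, W)
enhanced = SelfAttentionFeatureEnhancer(dim=256)(feats)
print(enhanced.shape)  # torch.Size([2, 256, 8, 7, 7])
```

Since the enhanced features keep the backbone's shape, such a module could in principle be inserted at different levels of the network, which is one way the abstract's blending of attention architectures across levels might be realized.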