Paper Title

Collaborative Distillation in the Parameter and Spectrum Domains for Video Action Recognition

Paper Authors

Haisheng Su, Jing Su, Dongliang Wang, Weihao Gan, Wei Wu, Mengmeng Wang, Junjie Yan, Yu Qiao

Abstract

Recent years have witnessed significant progress on the action recognition task with deep networks. However, most current video networks require large memory and computational resources, which hinders their application in practice. Existing knowledge distillation methods are limited to the image-level spatial domain, ignoring the temporal and frequency information, which provides structural knowledge and is important for video analysis. This paper explores how to train small and efficient networks for action recognition. Specifically, we propose two distillation strategies in the frequency domain, namely the feature spectrum distillation and the parameter distribution distillation. Our insight is that appealing action recognition performance requires \textit{explicitly} modeling the temporal frequency spectrum of video features. Therefore, we introduce a spectrum loss that enforces the student network to mimic the temporal frequency spectrum of the teacher network, instead of \textit{implicitly} distilling features as in many previous works. Second, the parameter frequency distribution is further adopted to guide the student network to learn the appearance modeling process from the teacher. Besides, a collaborative learning strategy is presented to optimize the training process from a probabilistic view. Extensive experiments are conducted on several action recognition benchmarks, such as Kinetics, Something-Something, and Jester, which consistently verify the effectiveness of our approach and demonstrate that our method can achieve higher performance than state-of-the-art methods with the same backbone.
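The abstract does not give the exact form of the spectrum loss, but the idea it describes (having the student mimic the teacher's temporal frequency spectrum rather than the raw features) can be sketched as follows. This is a minimal illustration under assumed choices, not the paper's implementation: `temporal_spectrum` and `spectrum_loss` are hypothetical names, the FFT is taken along an assumed temporal axis, and the L1 distance between magnitude spectra is one plausible choice of metric.

```python
import numpy as np

def temporal_spectrum(features):
    """Magnitude spectrum along the temporal axis.

    features: array of shape (T, C) -- T frames, C feature channels.
    Returns the magnitude of the real FFT taken over the T axis.
    """
    return np.abs(np.fft.rfft(features, axis=0))

def spectrum_loss(student_feat, teacher_feat):
    """L1 distance between student and teacher temporal magnitude spectra
    (an assumed metric; the paper may use a different distance)."""
    s = temporal_spectrum(student_feat)
    t = temporal_spectrum(teacher_feat)
    return float(np.mean(np.abs(s - t)))

# Toy check: identical features yield zero spectrum loss.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4))  # 8 frames, 4 channels
print(spectrum_loss(feat, feat))  # prints 0.0
```

In a training loop, this loss would be computed on intermediate feature maps of the student and teacher and added to the usual classification objective; because it compares spectra rather than raw activations, it constrains the temporal structure of the student's features explicitly.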
