Paper Title
STIP: A SpatioTemporal Information-Preserving and Perception-Augmented Model for High-Resolution Video Prediction
Paper Authors
Paper Abstract
Although recurrent neural network (RNN) based video prediction methods have achieved significant results, their performance on high-resolution datasets is still far from satisfactory because of the information loss problem and the perception-insensitive mean square error (MSE) based loss functions. In this paper, we propose a Spatiotemporal Information-Preserving and Perception-Augmented Model (STIP) to solve these two problems. To address the information loss problem, the proposed model preserves the spatiotemporal information of videos during both feature extraction and state transitions. Firstly, a Multi-Grained Spatiotemporal Auto-Encoder (MGST-AE) is designed based on the X-Net structure. The proposed MGST-AE helps the decoders recall multi-grained information from the encoders in both the temporal and spatial domains, so that more spatiotemporal information can be preserved during feature extraction for high-resolution videos. Secondly, a Spatiotemporal Gated Recurrent Unit (STGRU) is designed based on the standard Gated Recurrent Unit (GRU) structure, which can efficiently preserve spatiotemporal information during the state transitions. The proposed STGRU achieves more satisfactory performance with a much lower computational load than the popular Long Short-Term Memory (LSTM) based predictive memories. Furthermore, to improve on the traditional MSE loss functions, a Learned Perceptual Loss (LP-loss) is designed based on Generative Adversarial Networks (GANs), which helps obtain a satisfactory trade-off between objective quality and perceptual quality. Experimental results show that the proposed STIP can predict videos with more satisfactory visual quality than a variety of state-of-the-art methods. The source code is available at \url{https://github.com/ZhengChang467/STIPHR}.
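The abstract describes the STGRU as following the standard GRU gating structure while preserving spatiotemporal information across state transitions. As a rough illustration of that general idea only, the sketch below implements a minimal convolutional GRU cell in PyTorch; the class name ConvGRUCell, the channel sizes, and the convolutional gating are illustrative assumptions and do not reproduce the paper's actual STGRU design.

```python
# Hypothetical sketch: a convolutional GRU-style cell that carries a hidden
# feature map across frames. This shows only the standard GRU gating pattern
# adapted to 2-D feature maps; it is NOT the STGRU proposed in the paper.
import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        # Update and reset gates computed from the current frame features
        # concatenated with the previous hidden state.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 2 * hidden_channels,
                               kernel_size, padding=padding)
        # Candidate hidden state.
        self.candidate = nn.Conv2d(in_channels + hidden_channels, hidden_channels,
                                   kernel_size, padding=padding)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # x: (B, C_in, H, W) features of the current frame; h: (B, C_h, H, W) previous state.
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        # Convex combination of the old state and the candidate, as in a standard GRU.
        return (1 - z) * h + z * h_tilde


if __name__ == "__main__":
    cell = ConvGRUCell(in_channels=64, hidden_channels=64)
    h = torch.zeros(1, 64, 32, 32)
    for _ in range(4):  # unroll over a short frame sequence
        x = torch.randn(1, 64, 32, 32)
        h = cell(x, h)
    print(h.shape)  # torch.Size([1, 64, 32, 32])
```

As in the standard GRU, a single gating pass decides how much of the previous state to keep, which is where the computational saving over LSTM-style memories (one fewer gate and no separate cell state) comes from.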