Paper Title
End-to-end Transformer for Compressed Video Quality Enhancement
Paper Authors
Paper Abstract
Convolutional neural networks have achieved excellent results on the compressed video quality enhancement task in recent years. State-of-the-art methods explore the spatiotemporal information of adjacent frames mainly through deformable convolution. However, the offset fields in deformable convolution are difficult to train, and their training instability often leads to offset overflow, which reduces the efficiency of correlation modeling. In this work, we propose a transformer-based compressed video quality enhancement (TVQE) method, consisting of a Swin-AutoEncoder based Spatio-Temporal feature Fusion (SSTF) module and a Channel-wise Attention based Quality Enhancement (CAQE) module. The proposed SSTF module learns both local and global features with the help of the Swin-AutoEncoder, which improves the ability of correlation modeling. Meanwhile, the window mechanism-based Swin Transformer and the encoder-decoder structure greatly improve the execution efficiency. On the other hand, the proposed CAQE module computes channel-wise attention, which aggregates the temporal information across channels in the feature map, and thereby achieves efficient fusion of inter-frame information. Extensive experimental results on the JCT-VC test sequences show that the proposed method achieves better average performance in terms of both subjective and objective quality. Meanwhile, our proposed method outperforms existing methods in terms of both inference speed and GPU memory consumption.
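The abstract does not detail the CAQE implementation. As a rough illustration of the channel-wise attention idea it describes (aggregating temporal information across channels and reweighting them), here is a minimal squeeze-and-excitation style sketch in PyTorch; the class name, reduction ratio, and shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Hypothetical sketch of channel-wise attention fusion.

    Assumes the input stacks features from adjacent frames along the
    channel dimension, so reweighting channels acts as a soft temporal
    fusion. Not the paper's actual CAQE module.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Squeeze: global spatial average, one scalar per channel.
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Excite: small bottleneck MLP producing per-channel weights in (0, 1).
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W), where C concatenates per-frame feature maps.
        n, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        # Reweight each channel by its learned temporal importance.
        return x * weights

# Usage: fuse a stack of frame features (e.g., 64 channels at 96x96).
feats = torch.randn(1, 64, 96, 96)
fused = ChannelAttentionFusion(channels=64)(feats)
```

The design choice illustrated here is that a per-channel scalar gate is cheap to compute (one pooled vector and a small MLP), which is consistent with the abstract's claims about inference speed and GPU memory consumption.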