Paper Title
DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment
Paper Authors
Paper Abstract
The temporal relationships between frames and their influences on video quality assessment (VQA) remain under-studied in existing works. These relationships lead to two important types of effects on video quality. Firstly, some temporal variations (such as shaking, flicker, and abrupt scene transitions) cause temporal distortions and lead to extra quality degradation, while other variations (e.g., those related to meaningful events) do not. Secondly, the human visual system often pays different attention to frames with different contents, resulting in their different importance to the overall video quality. Based on the prominent time-series modeling ability of transformers, we propose a novel and effective transformer-based VQA method to tackle these two issues. To better differentiate temporal variations and thus capture temporal distortions, we design a transformer-based Spatial-Temporal Distortion Extraction (STDE) module. To tackle temporal quality attention, we propose the encoder-decoder-like Temporal Content Transformer (TCT). We also introduce temporal sampling on features to reduce the input length of the TCT, so as to improve the learning effectiveness and efficiency of this module. Consisting of the STDE and the TCT, the proposed Temporal Distortion-Content Transformers for Video Quality Assessment (DisCoVQA) reaches state-of-the-art performance on several VQA benchmarks without any extra pre-training datasets, and achieves up to 10% better generalization ability than existing methods. We also conduct extensive ablation experiments to prove the effectiveness of each part of our proposed model, and provide visualizations to demonstrate that the proposed modules achieve our intention of modeling these temporal issues. We will publish our code and pretrained weights later.
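The abstract describes a two-stage temporal pipeline: a transformer-based distortion extractor (STDE), temporal sampling on features, and an encoder-decoder-like content transformer (TCT) that weights frames before score fusion. The sketch below is a minimal PyTorch illustration of that flow, assuming pre-extracted per-frame features; all class names, dimensions, the uniform sampling strategy, and the weighted-pooling head are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of the STDE -> temporal sampling -> TCT flow from the abstract.
# Every design detail here (dims, sampling, score fusion) is an assumption.
import torch
import torch.nn as nn


class TemporalDistortionExtractor(nn.Module):
    """Transformer encoder over per-frame features (stand-in for STDE)."""

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats):           # (B, T, dim)
        return self.encoder(frame_feats)      # temporally contextualized features


class TemporalContentTransformer(nn.Module):
    """Encoder-decoder-like transformer producing per-frame attention weights (stand-in for TCT)."""

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=dim, nhead=heads,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        self.attn_head = nn.Linear(dim, 1)    # per-frame importance logits

    def forward(self, feats):                 # (B, T', dim)
        out = self.transformer(feats, feats)  # self-conditioned decoding (an assumption)
        return self.attn_head(out).softmax(dim=1)  # (B, T', 1) frame weights


class DisCoVQALikeModel(nn.Module):
    """End-to-end sketch: distortion features -> sampling -> content-weighted pooling."""

    def __init__(self, dim=256, sample_len=8):
        super().__init__()
        self.stde = TemporalDistortionExtractor(dim)
        self.tct = TemporalContentTransformer(dim)
        self.quality_head = nn.Linear(dim, 1)  # per-frame quality score
        self.sample_len = sample_len

    def forward(self, frame_feats):            # (B, T, dim) pre-extracted features
        feats = self.stde(frame_feats)
        # Uniform temporal sampling on features to shorten the TCT input.
        idx = torch.linspace(0, feats.size(1) - 1, self.sample_len).long()
        sampled = feats[:, idx]                # (B, sample_len, dim)
        weights = self.tct(sampled)            # (B, sample_len, 1)
        frame_scores = self.quality_head(sampled)
        # Attention-weighted pooling into one overall quality score per video.
        return (weights * frame_scores).sum(dim=1).squeeze(-1)  # (B,)


if __name__ == "__main__":
    model = DisCoVQALikeModel()
    dummy = torch.randn(2, 32, 256)            # 2 clips, 32 frames of 256-d features
    print(model(dummy).shape)                  # torch.Size([2])
```

The split mirrors the abstract's reasoning: the encoder stage models temporal variations that may constitute distortions, while the sampled encoder-decoder stage assigns content-dependent importance to frames before the scores are fused.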