Paper Title
Semantically Video Coding: Instill Static-Dynamic Clues into Structured Bitstream for AI Tasks
Paper Authors
Paper Abstract
Traditional media coding schemes typically encode images and videos into a binary stream with no exposed semantic structure, which cannot directly support downstream intelligent tasks at the bitstream level. The Semantically Structured Image Coding (SSIC) framework makes the first attempt to enable decoding-free or partial-decoding intelligent image analysis via a Semantically Structured Bitstream (SSB). However, SSIC only considers image coding, and the SSB it generates contains only static object information. In this paper, we extend the idea of semantically structured coding from the perspective of video coding and propose an advanced Semantically Structured Video Coding (SSVC) framework to support heterogeneous intelligent applications. Video signals contain richer dynamic motion information and exhibit more redundancy because adjacent frames are similar. We therefore present a reformulation of the semantically structured bitstream (SSB) in SSVC that contains both static object characteristics and dynamic motion clues. Specifically, we introduce optical flow to encode continuous motion information and reduce cross-frame redundancy via a predictive coding architecture; the optical flow and residual information are then reorganized into the SSB, which enables the proposed SSVC to adaptively support video-based downstream intelligent applications. Extensive experiments demonstrate that the proposed SSVC framework can directly support multiple intelligent tasks using only a partially decoded bitstream. This avoids full bitstream decompression and thus significantly reduces bitrate/bandwidth consumption for intelligent analytics. We verify this on the tasks of image object detection, pose estimation, video action recognition, video object segmentation, etc.
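The idea of supporting a task from a partially decoded bitstream can be illustrated with a minimal sketch. The container layout below (a length-prefixed JSON index followed by independently compressed segments), the function names `pack_ssb` and `read_segment`, and the segment names are all hypothetical and chosen for illustration only; they are not the SSB format or the codec used in the paper, and general-purpose `zlib` compression stands in for the learned coding components.

```python
import json
import struct
import zlib


def pack_ssb(segments):
    """Pack named byte segments into one blob with a leading index.

    Hypothetical layout: 4-byte big-endian index length, JSON index,
    then the zlib-compressed segments stored back to back.
    """
    index = {}
    payload = bytearray()
    for name, raw in segments.items():
        compressed = zlib.compress(raw)
        # Record where this segment lives inside the payload area.
        index[name] = [len(payload), len(compressed)]
        payload += compressed
    index_bytes = json.dumps(index).encode("utf-8")
    return struct.pack(">I", len(index_bytes)) + index_bytes + bytes(payload)


def read_segment(blob, name):
    """Decompress a single named segment without decoding the others."""
    (index_len,) = struct.unpack_from(">I", blob, 0)
    index = json.loads(blob[4:4 + index_len].decode("utf-8"))
    offset, length = index[name]
    start = 4 + index_len + offset
    return zlib.decompress(blob[start:start + length])


# Toy stand-ins for the kinds of content the abstract mentions.
blob = pack_ssb({
    "static_features": b"object features of the key frame",
    "motion": b"optical flow between adjacent frames",
    "residual": b"prediction residual",
})

# A motion-oriented task would fetch only the motion segment.
motion_only = read_segment(blob, "motion")
```

Because each segment is compressed independently and located through the index, a consumer that only needs motion clues reads the index and one segment, leaving the static features and residual untouched. That is the property the abstract refers to as partial decoding.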