Paper Title
Compressed Vision for Efficient Video Understanding
Paper Authors
Paper Abstract
Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours, or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos requires more scalable approaches even just to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g., JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed, and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard augmentation techniques if done naively. We address this by introducing a small network that can apply transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using compressed representations.
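The abstract describes a three-stage pipeline: a neural compressor maps raw video to compact latent codes, a small network applies augmentation-like transformations directly in latent space, and a regular video network consumes the latents instead of pixels. Below is a minimal sketch of that idea in PyTorch, assuming illustrative module names, architectures, and shapes throughout; it is not the paper's actual design.

```python
# Minimal sketch of the compressed-vision pipeline from the abstract.
# All module names, layer choices, and hyperparameters are assumptions
# made for illustration, not the authors' architecture.
import torch
import torch.nn as nn

class NeuralCompressor(nn.Module):
    """Stand-in neural encoder: replaces JPEG-style codecs by mapping
    raw video frames to compact latent codes (compress once, store latents)."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.encode = nn.Sequential(
            # Aggressive spatial downsampling is where the compression comes from.
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 4, 4), padding=1),
            nn.ReLU(),
            nn.Conv3d(32, latent_dim, kernel_size=3, stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):           # video: (B, 3, T, H, W)
        return self.encode(video)       # latents: (B, latent_dim, T, H/8, W/8)

class LatentAugmenter(nn.Module):
    """Small network that mimics a pixel-space augmentation (e.g. flips
    or crops) directly on latent codes, since naive augmentation of
    compressed signals is precluded."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.transform = nn.Conv3d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, latents):
        return self.transform(latents)

class LatentVideoClassifier(nn.Module):
    """Regular video network, unchanged except that its input channels
    match the latent dimension rather than RGB."""
    def __init__(self, latent_dim=64, num_classes=600):  # e.g. Kinetics600
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(latent_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, latents):
        return self.backbone(latents)

# Usage: compress once offline, then train cheaply on the stored latents.
video = torch.randn(2, 3, 16, 64, 64)        # tiny clip for illustration
latents = NeuralCompressor()(video)          # store these, not the pixels
augmented = LatentAugmenter()(latents)       # "augment" in latent space
logits = LatentVideoClassifier()(augmented)  # standard supervised training
```

The efficiency gains claimed in the abstract follow from this layout: latents are smaller than pixels at every stage (disk, transfer, GPU memory), and the expensive decode-to-RGB step disappears from the training loop, which is what makes hour-long videos tractable on hardware sized for second-long clips.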