Paper Title

RGB no more: Minimally-decoded JPEG Vision Transformers

Paper Authors

Jeongsoo Park, Justin Johnson

Paper Abstract

Most neural networks for computer vision are designed to infer using RGB images. However, these RGB images are commonly encoded in JPEG before saving to disk; decoding them imposes an unavoidable overhead for RGB networks. Instead, our work focuses on training Vision Transformers (ViT) directly from the encoded features of JPEG. This way, we can avoid most of the decoding overhead, accelerating data load. Existing works have studied this aspect but they focus on CNNs. Due to how these encoded features are structured, CNNs require heavy modification to their architecture to accept such data. Here, we show that this is not the case for ViTs. In addition, we tackle data augmentation directly on these encoded features, which, to our knowledge, has not been explored in depth for training in this setting. With these two improvements -- ViT and data augmentation -- we show that our ViT-Ti model achieves up to 39.2% faster training and 17.9% faster inference with no accuracy loss compared to the RGB counterpart.
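The abstract's core idea, treating JPEG's encoded 8x8 DCT blocks as ViT tokens instead of RGB patches, can be illustrated with a minimal sketch. The code below is not the authors' implementation: it assumes the luma DCT coefficients have already been extracted from the JPEG bitstream (the extraction step, chroma handling, and the paper's exact feature grouping are omitted) and simply shows how such blocks could be linearly projected into ViT-Ti-sized tokens.

```python
# Minimal sketch (assumed, not the paper's code): embed JPEG DCT blocks as ViT tokens.
# Assumes luma DCT coefficients are already available as (B, H/8, W/8, 64) tensors.
import torch
import torch.nn as nn

class DCTPatchEmbed(nn.Module):
    """Project each 8x8 DCT block (64 coefficients) to a ViT token embedding."""
    def __init__(self, coeffs_per_block: int = 64, embed_dim: int = 192):
        super().__init__()
        self.proj = nn.Linear(coeffs_per_block, embed_dim)

    def forward(self, dct_blocks: torch.Tensor) -> torch.Tensor:
        # dct_blocks: (B, H/8, W/8, 64) -> tokens: (B, N, embed_dim)
        b, h, w, c = dct_blocks.shape
        return self.proj(dct_blocks.reshape(b, h * w, c))

# A 224x224 luma plane yields a 28x28 grid of 8x8 DCT blocks, i.e. 784 tokens,
# matching the token count of a ViT that uses 8x8 RGB patches.
embed = DCTPatchEmbed(embed_dim=192)   # 192 = ViT-Ti width
dummy = torch.randn(2, 28, 28, 64)     # placeholder DCT coefficients
print(embed(dummy).shape)              # torch.Size([2, 784, 192])
```

Because the tokenizer is just a linear projection over whatever features are given to it, no architectural surgery is needed to accept frequency-domain inputs; this is the property the abstract contrasts with CNNs, which are tied to spatial RGB layouts.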
