Pure Transformer with Integrated Experts for Scene Text Recognition

Paper Title

Pure Transformer with Integrated Experts for Scene Text Recognition

Authors

Tan, Yew Lee; Kong, Adams Wai-kin; Kim, Jung-Jae

Abstract

Scene text recognition (STR) is the task of reading text in cropped images of natural scenes. Conventional STR models employ a convolutional neural network (CNN) followed by a recurrent neural network in an encoder-decoder framework. Recently, the transformer architecture has been widely adopted in STR because it shows a strong capability for capturing long-term dependencies, which appear to be prominent in scene text images. Many researchers have used the transformer as part of a hybrid CNN-transformer encoder, often followed by a transformer decoder. However, such methods make use of long-term dependencies only midway through the encoding process. Although the vision transformer (ViT) can capture such dependencies at an early stage, its use remains largely unexplored in STR. This work proposes a transformer-only model as a simple baseline that outperforms hybrid CNN-transformer models. Furthermore, two key areas for improvement are identified. First, the first decoded character has the lowest prediction accuracy. Second, images of different original aspect ratios react differently to patch resolutions, whereas ViT employs only one fixed patch resolution. To explore these areas, Pure Transformer with Integrated Experts (PTIE) is proposed. PTIE is a transformer model that can process multiple patch resolutions and decode in both the original and reverse character orders. It is evaluated on 7 commonly used benchmarks and compared with over 20 state-of-the-art methods. The experimental results show that the proposed method outperforms them and obtains state-of-the-art results on most benchmarks.
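The two ideas named in the abstract, multiple patch resolutions and decoding in both character orders, can be illustrated with a minimal sketch. This is not the authors' implementation: the 32x128 image size, the 8x4 and 4x8 patch sizes, and the helper names `patchify` and `decoding_targets` are assumptions chosen only for illustration.

```python
def patchify(image, patch_h, patch_w):
    """Split a 2-D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            # Flatten one patch row by row into a single token vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_h)
                     for c in range(left, left + patch_w)]
            patches.append(patch)
    return patches


def decoding_targets(text):
    """Return the label sequence in original and reverse character order."""
    return {"forward": list(text), "reverse": list(text[::-1])}


# Hypothetical 32x128 grayscale image filled with dummy pixel values.
image = [[(r * 128 + c) % 256 for c in range(128)] for r in range(32)]

# Two assumed patch resolutions applied to the same image.
tall_patches = patchify(image, 8, 4)  # tall, narrow patches
wide_patches = patchify(image, 4, 8)  # short, wide patches

targets = decoding_targets("text")
```

With these assumed sizes, both patch resolutions yield the same number of tokens with the same length, so a single transformer could in principle consume either sequence; only the spatial layout of each token differs. The reverse target makes the last character of the word the first one decoded.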
