Paper Title
RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition
Paper Authors
Paper Abstract
The attention-based encoder-decoder framework has recently achieved impressive results for scene text recognition, and many variants have emerged with improved recognition quality. However, it performs poorly on contextless text (e.g., random character sequences), which is unacceptable in most real application scenarios. In this paper, we first deeply investigate the decoding process of the decoder. We empirically find that a representative character-level sequence decoder utilizes not only context information but also positional information. The heavy reliance of existing approaches on contextual information causes the problem of attention drift. To suppress this side effect, we propose a novel position enhancement branch and dynamically fuse its outputs with those of the decoder attention module for scene text recognition. Specifically, it contains a position-aware module that enables the encoder to output feature vectors encoding their own spatial positions, and an attention module that estimates glimpses using only the positional clue (i.e., the current decoding time step). The dynamic fusion is conducted via an element-wise gate mechanism to yield more robust features. Theoretically, our proposed method, dubbed \emph{RobustScanner}, decodes individual characters with a dynamic ratio between context and positional clues, relying more on positional ones when decoding sequences with scarce context, and thus is robust and practical. Empirically, it achieves new state-of-the-art results on popular regular and irregular text recognition benchmarks without much performance drop on contextless benchmarks, validating its robustness in both contextual and contextless application scenarios.
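The element-wise gated fusion described above can be sketched as follows. This is a minimal illustration, not the paper's exact parameterization: it assumes the gate is produced by a single linear layer (hypothetical weights `W`, `b`) applied to the concatenated context and position glimpses, followed by a sigmoid, so each feature dimension independently trades off the two clues.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dynamic_fusion(g_context, g_position, W, b):
    """Element-wise gated fusion of a context glimpse and a position glimpse.

    Hypothetical parameterization: gate = sigmoid(linear([g_context; g_position])).
    A gate value near 1 trusts the contextual clue; near 0 trusts the positional one.
    """
    gate = sigmoid(np.concatenate([g_context, g_position]) @ W + b)  # shape (d,)
    # Convex combination per feature dimension: the fused vector always lies
    # between the two glimpses element-wise.
    return gate * g_context + (1.0 - gate) * g_position

# Toy usage with random glimpses and randomly initialized gate parameters.
rng = np.random.default_rng(0)
d = 8
g_c = rng.normal(size=d)            # context glimpse from the decoder attention module
g_p = rng.normal(size=d)            # glimpse from the position enhancement branch
W = rng.normal(size=(2 * d, d)) * 0.1
b = np.zeros(d)
fused = dynamic_fusion(g_c, g_p, W, b)
```

On contextless text the gate would ideally push toward the positional glimpse, which is the behavior the abstract attributes to the dynamic ratio between the two clues.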