Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images

Paper Title

Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images

Authors

Zhang, Hongkuan; Whittaker, Edward; Kitagishi, Ikuo

Abstract

Digitization of scanned receipts aims to extract text from receipt images and save it into structured documents. This is usually split into two sub-tasks: text localization and optical character recognition (OCR). Most existing OCR models focus only on cropped text instance images, which require bounding box information from a text region detection model. Introducing an additional detector to identify the text instance images in advance adds complexity; however, instance-level OCR models have very low accuracy when applied directly to a whole image for document-level OCR, as with receipt images containing multiple text lines arranged in various layouts. To this end, we propose a localization-free document-level OCR model that transcribes all the characters in a receipt image into an ordered sequence end-to-end. Specifically, we finetune the pretrained instance-level model TrOCR with randomly cropped image chunks, and gradually increase the image chunk size to generalize the recognition ability from instance images to full-page images. In our experiments on the SROIE receipt OCR dataset, the model finetuned with our strategy achieved an F1-score of 64.4 and a character error rate (CER) of 22.8%, outperforming the baseline results of 48.5 F1-score and 50.6% CER. The best model, which splits the full image into 15 equally sized chunks, gives an F1-score of 87.8 and a CER of 4.98% with minimal additional pre- or post-processing of the output. Moreover, the characters in the generated document-level sequences are arranged in reading order, which is practical for real-world applications.
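To make two quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' code. It assumes that "equally sized chunks" means horizontal strips stacked top to bottom (the abstract does not state the orientation), and that CER is the usual character-level edit distance divided by the reference length (the paper's exact evaluation protocol is not given here). The function names `chunk_bounds`, `edit_distance`, and `cer` are hypothetical.

```python
def chunk_bounds(height: int, n_chunks: int) -> list[tuple[int, int]]:
    """Return (top, bottom) pixel rows for n_chunks near-equal horizontal strips.

    Assumption: chunks are horizontal strips; the abstract only says the
    full image is split into equally sized chunks.
    """
    if n_chunks <= 0 or height < n_chunks:
        raise ValueError("need 0 < n_chunks <= height")
    # Integer edges so that strips tile the page with no gaps or overlaps.
    edges = [i * height // n_chunks for i in range(n_chunks + 1)]
    return list(zip(edges[:-1], edges[1:]))


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of r
                curr[j - 1] + 1,         # insertion of h
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # A hypothetical 1500-pixel-tall scan split into 15 strips.
    print(chunk_bounds(1500, 15)[:3])
    # One substituted character in an 11-character reference.
    print(cer("TOTAL 12.50", "TOTAL 12.80"))
```

Under these assumptions, a full-page transcription would be formed by recognizing each strip in order and concatenating the outputs; how the paper actually merges chunk outputs is not specified in the abstract.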
