Paper Title
Transferring General Multimodal Pretrained Models to Text Recognition
Paper Authors
Paper Abstract
This paper proposes a new method, OFA-OCR, for transferring multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance on the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR and demonstrate that it achieves performance competitive with a product-level API. The code (https://github.com/OFA-Sys/OFA) and demo (https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary) are publicly available.
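The core idea of the abstract, recasting text recognition as image captioning, can be illustrated with a minimal toy sketch. The stub scorer below stands in for the OFA vision-language model (its name and behavior are hypothetical, purely for illustration); the point is that the recognized text is produced by the same greedy autoregressive decoding loop used for captioning, with the "caption" being the transcription.

```python
# Toy sketch: text recognition framed as image captioning.
# In OFA-OCR a vision-language transformer would replace
# `score_next_token`; here a stub maps fake "image features"
# (the ground-truth string itself) to per-character scores so
# the decoding loop is self-contained and runnable.

def score_next_token(image_features, prefix, vocab):
    # Hypothetical stub scorer: assigns probability mass to the
    # next unread character encoded in the fake image features.
    idx = len(prefix)
    scores = {tok: 0.0 for tok in vocab}
    if idx < len(image_features):
        scores[image_features[idx]] = 1.0
    else:
        scores["<eos>"] = 1.0
    return scores

def caption_decode(image_features, vocab, max_len=32):
    # Greedy autoregressive decoding, exactly as in image
    # captioning: emit one token at a time until <eos>.
    prefix = []
    for _ in range(max_len):
        scores = score_next_token(image_features, prefix, vocab)
        token = max(scores, key=scores.get)
        if token == "<eos>":
            break
        prefix.append(token)
    return "".join(prefix)

vocab = list("文本识别") + ["<eos>"]
print(caption_decode("文本识别", vocab))  # prints 文本识别
```

The design choice this mirrors is that no recognition-specific decoder head is needed: the captioning interface of the pretrained vision-language model is reused unchanged for the OCR end task.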