Paper Title
Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting
Authors
Abstract
Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which intuitively helps Optical Character Recognition (OCR) tasks due to the rich visual and textual information in scene text images. However, these methods cannot cope well with OCR tasks because of the difficulty in both instance-level text encoding and image-text pair acquisition (i.e., images and the texts captured in them). This paper presents a weakly supervised pre-training method, oCLIP, which can acquire effective scene text representations by jointly learning and aligning visual and textual information. Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features, respectively, as well as a visual-textual decoder that models the interaction between textual and visual features for learning effective scene text representations. With the learning of textual features, the pre-trained model can attend to texts in images well with character awareness. Besides, these designs enable learning from weakly annotated texts (i.e., partial texts in images without text bounding boxes), which greatly mitigates the data annotation constraint. Experiments over the weakly annotated images in ICDAR2019-LSVT show that our pre-trained model improves the F-score by +2.5\% and +4.8\% when transferring its weights to other text detection and spotting networks, respectively. In addition, the proposed method consistently outperforms existing pre-training techniques across multiple public datasets (e.g., +3.2\% and +1.3\% on Total-Text and CTW1500).
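To make the described architecture concrete, below is a minimal PyTorch sketch of how a character-aware text encoder and a visual-textual decoder could interact during pre-training. The module names (`CharAwareTextEncoder`, `VisualTextualDecoder`), layer counts, feature dimensions, and the toy character vocabulary are all hypothetical assumptions for illustration, not the authors' released implementation; the image encoder is stood in for by random features from any backbone.

```python
# Hypothetical sketch of an oCLIP-style forward pass (illustrative only).
import torch
import torch.nn as nn

class CharAwareTextEncoder(nn.Module):
    """Encodes each weakly annotated text instance from its character sequence."""
    def __init__(self, vocab_size=100, dim=256, max_chars=25):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.pos_embed = nn.Parameter(torch.zeros(max_chars, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)

    def forward(self, char_ids):
        # char_ids: (batch, num_text_instances, num_chars) character indices
        b, t, c = char_ids.shape
        x = self.char_embed(char_ids) + self.pos_embed[:c]
        x = self.encoder(x.view(b * t, c, -1))
        return x.mean(dim=1).view(b, t, -1)   # one feature per text instance

class VisualTextualDecoder(nn.Module):
    """Cross-attends text-instance features to image features and predicts characters."""
    def __init__(self, dim=256, vocab_size=100):
        super().__init__()
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, text_feats, image_feats):
        out = self.decoder(tgt=text_feats, memory=image_feats)
        return self.classifier(out)           # logits over the character vocabulary

# Toy usage: image features flattened from any backbone to (batch, HW, dim).
image_feats = torch.randn(2, 49, 256)         # e.g., a 7x7 feature map per image
char_ids = torch.randint(1, 100, (2, 3, 25))  # 3 partial text instances per image
text_feats = CharAwareTextEncoder()(char_ids)
logits = VisualTextualDecoder()(text_feats, image_feats)
print(logits.shape)                           # torch.Size([2, 3, 100])
```

In this sketch, pre-training supervision could come from predicting masked or partially given characters of each text instance, which is consistent with the abstract's point that only partial texts, without bounding boxes, are needed.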