Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features

Paper Title

Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features

Authors

Naiman, J. P., Williams, Peter K. G., Goodman, Alyssa

Abstract

Scientific articles published prior to the "age of digitization" in the late 1990s contain figures which are "trapped" within their scanned pages. While progress has been made in extracting figures and their captions, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, after Optical Character Recognition (OCR), that uses both grayscale and OCR features. When applied to the astrophysics literature holdings of the Astrophysics Data System (ADS), we find F1 scores of 90.9% (92.2%) for figures (figure captions) at an intersection-over-union (IOU) cut-off of 0.9, which is a significant improvement over other state-of-the-art methods.
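
The abstract reports F1 scores at an IOU cut-off of 0.9. As a rough illustration of what that metric means, the sketch below computes the IOU of two bounding boxes and an F1 score in which a predicted box counts as a true positive only when its IOU with a not-yet-matched ground-truth box is at least the cut-off. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the greedy one-to-one matching are assumptions made for this example; the abstract does not describe the authors' exact evaluation protocol.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(predicted: List[Box], truth: List[Box], cutoff: float = 0.9) -> float:
    """F1 score with greedy one-to-one matching (an assumption of this sketch)."""
    unmatched = list(truth)
    true_pos = 0
    for p in predicted:
        # Pick the remaining ground-truth box that overlaps this prediction most.
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= cutoff:
            true_pos += 1
            unmatched.remove(best)
    false_pos = len(predicted) - true_pos
    false_neg = len(truth) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


if __name__ == "__main__":
    truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
    predicted = [(0, 0, 10, 9.5), (20, 20, 30, 25)]  # IOUs of 0.95 and 0.5
    print(f1_at_iou(predicted, truth, cutoff=0.9))
```

With a cut-off this strict, the second prediction above is rejected even though it covers half of its target, which is why scores at an IOU of 0.9 are harder to achieve than scores at looser cut-offs.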

Paper Title