Paper title
Extracting Body Text from Academic PDF Documents for Text Mining
Paper authors
Paper abstract
Accurate extraction of body text from PDF-formatted academic documents is essential for text-mining applications that aim at deeper semantic understanding. The objective is to extract complete sentences of the body text into a txt file that preserves the original sentence flow and paragraph boundaries. Existing tools for extracting text from PDF documents often mix body and nonbody text. We devise and implement a system called PDFBoT that detects multiple-column layouts using a line-sweeping technique, removes nonbody text using computed text features and syntactic tagging in backward traversal, and aligns the remaining text back into sentences and paragraphs. We show that PDFBoT is highly accurate, with average F1 scores of 0.99 on extracting sentences, 0.96 on extracting paragraphs, and 0.98 on removing text in tables, figures, and charts, over a corpus of PDF documents randomly selected from arXiv.org across multiple academic disciplines.
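The abstract names a line-sweeping technique for detecting multiple-column layouts but does not spell it out. The following is only a minimal sketch of one way such a sweep could work, not PDFBoT's actual implementation. It assumes text lines are already available as bounding boxes. The `Box` layout, the function name `find_column_gaps`, and the `min_gap` and `step` parameters are all hypothetical. A vertical line is moved from left to right across the text area, and any sufficiently wide run of positions where it crosses no box is reported as a candidate column gutter.

```python
from typing import List, Tuple

# Hypothetical representation of a text line: (x0, y0, x1, y1) in page units.
Box = Tuple[float, float, float, float]


def find_column_gaps(
    boxes: List[Box],
    min_gap: float = 10.0,  # assumed minimum gutter width
    step: float = 1.0,      # assumed sweep resolution
) -> List[Tuple[float, float]]:
    """Sweep a vertical line left to right over the text area and return
    the x-intervals (start, end) in which the line crosses no text box."""
    if not boxes:
        return []
    left = min(b[0] for b in boxes)
    right = max(b[2] for b in boxes)

    gaps: List[Tuple[float, float]] = []
    gap_start = None
    x = left
    while x <= right:
        # The sweep line at position x crosses a box if x lies in its x-extent.
        crossed = any(b[0] <= x <= b[2] for b in boxes)
        if not crossed and gap_start is None:
            gap_start = x                      # entering an empty stretch
        elif crossed and gap_start is not None:
            if x - gap_start >= min_gap:       # wide enough to be a gutter
                gaps.append((gap_start, x))
            gap_start = None                   # leaving the empty stretch
        x += step
    return gaps


if __name__ == "__main__":
    # Two columns of three lines each, separated by an empty vertical strip.
    two_columns = [(50, y, 290, y + 10) for y in (100, 115, 130)]
    two_columns += [(310, y, 550, y + 10) for y in (100, 115, 130)]
    print(find_column_gaps(two_columns))
```

In this sketch, a page with one reported gap would be treated as a two-column layout, and a page with no gap as single-column. Full-width elements such as titles or wide figures cross every gutter, so a practical version would have to sweep within horizontal bands or filter such elements first. That refinement is left out here.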