Paper title

Structured references from PDF articles: assessing the tools for bibliographic reference extraction and parsing

Authors

Alessia Cioffi, Silvio Peroni

Abstract

Many solutions have been proposed for extracting bibliographic references from PDF papers. Machine learning, rule-based and regular-expression approaches are among the methods most commonly adopted by tools addressing this task. This work aims to identify and evaluate all and only the tools which, given a full-text paper in PDF format, can recognise, extract and parse bibliographic references. We identified seven tools: Anystyle, Cermine, ExCite, Grobid, Pdfssa4met, Scholarcy and Science Parse. We compared and evaluated them against a corpus of 56 PDF articles published in 27 subject areas. Anystyle obtained the best overall score, followed by Cermine. However, in some subject areas, other tools had better results for specific tasks.
