Paper title
SMAuC -- The Scientific Multi-Authorship Corpus
Paper authors
Paper abstract
The rapidly growing volume of scientific publications offers an interesting challenge for research on methods for analyzing the authorship of documents with one or more authors. However, most existing datasets lack scientific documents or the necessary metadata for constructing new experiments and test cases. We introduce SMAuC, a comprehensive, metadata-rich corpus tailored to scientific authorship analysis. Comprising over 3 million publications across various disciplines from over 5 million authors, SMAuC is the largest openly accessible corpus for this purpose. It encompasses scientific texts from humanities and natural sciences, accompanied by extensive, curated metadata, including unambiguous author IDs. SMAuC aims to significantly advance the domain of authorship analysis in scientific texts.