Paper title
Investigating Software Usage in the Social Sciences: A Knowledge Graph Approach
Paper authors
Paper abstract
Knowledge about the software used in scientific investigations is necessary for several reasons, including provenance of the results, measuring software impact to attribute developers, and bibliometric software citation analysis in general. Additionally, providing information about whether and how the software and the source code are available allows an assessment of the state and role of open source software in science in general. While such analyses can be done manually, large-scale analyses require automated methods of information extraction and linking. In this paper, we present SoftwareKG, a knowledge graph that contains information about software mentions from more than 51,000 scientific articles from the social sciences. A silver standard corpus, created by a distant and weak supervision approach, and a gold standard corpus, created by manual annotation, were used to train an LSTM-based neural network to identify software mentions in scientific articles. The model achieves a recognition rate of 0.82 F-score on exact matches. As a result, we identified more than 133,000 software mentions. For entity disambiguation, we used the public domain knowledge base DBpedia. Furthermore, we linked the entities of the knowledge graph to other knowledge bases such as the Microsoft Academic Knowledge Graph, the Software Ontology, and Wikidata. Finally, we illustrate how SoftwareKG can be used to assess the role of software in the social sciences.
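To make the "F-score on exact matches" figure concrete, the following is a minimal sketch of how such a score is commonly computed for mention recognition. It is an illustration under stated assumptions, not the authors' evaluation code: the `Mention` tuple layout, the function name `exact_match_scores`, and the toy data are all hypothetical. A predicted mention counts as correct only if its document and both span boundaries agree with a gold annotation.

```python
from typing import Set, Tuple

# Hypothetical representation: (document id, start offset, end offset).
# All three fields must agree for a prediction to count as an exact match.
Mention = Tuple[str, int, int]


def exact_match_scores(
    gold: Set[Mention], predicted: Set[Mention]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), counting only exact span matches."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (invented for illustration): the second prediction is off by one
# character at its end boundary, so it is not an exact match.
gold = {("doc1", 10, 14), ("doc1", 40, 45), ("doc2", 3, 9)}
predicted = {("doc1", 10, 14), ("doc1", 40, 46), ("doc2", 3, 9)}

precision, recall, f1 = exact_match_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this strict criterion a boundary error costs both a false positive and a false negative, which is why exact-match scores are usually lower than scores that accept partially overlapping spans.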