Paper Title

Entity Extraction from Wikipedia List Pages

Authors

Heist, Nicolas, Paulheim, Heiko

Abstract

When it comes to factual knowledge about a wide range of domains, Wikipedia is often the prime source of information on the web. DBpedia and YAGO, as large cross-domain knowledge graphs, encode a subset of that knowledge by creating an entity for each page in Wikipedia, and connecting them through edges. It is well known, however, that Wikipedia-based knowledge graphs are far from complete. Especially, as Wikipedia's policies permit pages about subjects only if they have a certain popularity, such graphs tend to lack information about less well-known entities. Information about these entities is oftentimes available in the encyclopedia, but not represented as an individual page. In this paper, we present a two-phased approach for the extraction of entities from Wikipedia's list pages, which have proven to serve as a valuable source of information. In the first phase, we build a large taxonomy from categories and list pages with DBpedia as a backbone. With distant supervision, we extract training data for the identification of new entities in list pages that we use in the second phase to train a classification model. With this approach we extract over 700k new entities and extend DBpedia with 7.5M new type statements and 3.8M new facts of high precision.
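To make the second phase described in the abstract more concrete, below is a minimal sketch (not the authors' code) of how distantly supervised labels could be used to train a classifier that decides whether an entry on a Wikipedia list page denotes an entity of the list's type. The toy features, the example data, and the labeling rule sketched in the comments are illustrative assumptions; the paper's actual features and the distant-supervision heuristics built on the category/list-page taxonomy are considerably richer.

```python
# Hypothetical sketch of the classification phase, assuming scikit-learn.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Distant supervision (assumed form): list-page entries that link to an
# existing DBpedia entity whose type matches the list's inferred type are
# positives (label 1); entries linking to entities with a conflicting type
# are negatives (label 0). Unknown entries are left unlabeled and classified
# afterwards. Feature names here are made up for illustration.
training_entries = [
    ({"links_to_known_entity": 1, "position_in_line": 0, "is_bold": 1}, 1),
    ({"links_to_known_entity": 1, "position_in_line": 3, "is_bold": 0}, 0),
    ({"links_to_known_entity": 0, "position_in_line": 0, "is_bold": 1}, 1),
    ({"links_to_known_entity": 0, "position_in_line": 5, "is_bold": 0}, 0),
]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform([features for features, _ in training_entries])
y = [label for _, label in training_entries]

# Train a simple classifier on the distantly labeled entries.
model = LogisticRegression().fit(X, y)

# An unlabeled entry from a list page: predict whether it is a new entity
# of the list's type (feature values are again hypothetical).
candidate = {"links_to_known_entity": 0, "position_in_line": 0, "is_bold": 1}
print(model.predict(vectorizer.transform([candidate])))  # e.g. [1] -> new entity
```

Entries predicted as positives would then yield the new entities, type statements, and facts reported in the abstract; the choice of classifier above is arbitrary and only stands in for whatever model the authors train.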
