Paper Title
Keyphrase Extraction with Span-based Feature Representations
Paper Authors
Paper Abstract
Keyphrases provide semantic metadata that characterizes a document and gives an overview of its content. Because keyphrase extraction facilitates the management, categorization, and retrieval of information, it has received much attention in recent years. There are three approaches to keyphrase extraction: (i) the traditional two-step ranking method, (ii) sequence labeling, and (iii) generation using neural networks. The two-step ranking approach relies on feature engineering, which is labor intensive and domain dependent. Sequence labeling cannot handle overlapping phrases. Generation methods (i.e., sequence-to-sequence neural network models) overcome these shortcomings, so they have been widely studied and achieve state-of-the-art performance. However, generation methods cannot utilize context information effectively. In this paper, we propose a novel Span Keyphrase Extraction model that extracts span-based feature representations of keyphrases directly from all the content tokens. In this way, our model obtains a representation for each keyphrase and further learns to capture the interaction between keyphrases in one document to get better ranking results. In addition, with the help of tokens, our model is able to extract overlapping keyphrases. Experimental results on benchmark datasets show that our proposed model outperforms existing methods by a large margin.
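To make the idea of span-based candidates concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how spans are enumerated or how their representations are built, so the maximum span length, the pooling scheme (start vector, end vector, and span mean concatenated), and the names `enumerate_spans` and `span_representation` are all illustrative assumptions. It shows only why span enumeration, unlike one-label-per-token sequence labeling, can represent overlapping phrases.

```python
from typing import List, Tuple


def enumerate_spans(n_tokens: int, max_len: int) -> List[Tuple[int, int]]:
    """Return every (start, end) pair, end inclusive, covering at most max_len tokens."""
    return [
        (start, end)
        for start in range(n_tokens)
        for end in range(start, min(start + max_len, n_tokens))
    ]


def span_representation(
    token_vecs: List[List[float]], start: int, end: int
) -> List[float]:
    """Build one vector per span.

    Assumed pooling scheme (hypothetical, not taken from the paper):
    concatenate the start-token vector, the end-token vector, and the
    mean of all token vectors inside the span.
    """
    width = end - start + 1
    dim = len(token_vecs[0])
    mean = [
        sum(token_vecs[i][d] for i in range(start, end + 1)) / width
        for d in range(dim)
    ]
    return token_vecs[start] + token_vecs[end] + mean


# Toy input: three tokens with made-up 2-dimensional vectors.
tokens = ["neural", "keyphrase", "extraction"]
token_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

spans = enumerate_spans(len(tokens), max_len=2)
span_vecs = {span: span_representation(token_vecs, *span) for span in spans}

# (0, 1) "neural keyphrase" and (1, 2) "keyphrase extraction" share token 1,
# yet each gets its own representation and could be scored independently.
for (start, end), vec in span_vecs.items():
    print(" ".join(tokens[start : end + 1]), vec)
```

In a real model the toy vectors would come from a contextual encoder, and a scoring layer over the span vectors would rank the candidates; both are outside the scope of this sketch.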