Paper Title
What do we Really Know about State of the Art NER?
Paper Authors
Paper Abstract
Named Entity Recognition (NER) is a well-researched NLP task and is widely used in real-world NLP scenarios. NER research typically focuses on creating new ways of training NER models, with relatively less emphasis on resources and evaluation. Further, state-of-the-art (SOTA) NER models, trained on standard datasets, typically report only a single performance measure (F-score), and we don't really know how well they perform for different entity types and genres of text, or how robust they are to new, unseen entities. In this paper, we perform a broad evaluation of NER using a popular dataset, taking into consideration the various text genres and sources that constitute the dataset at hand. Additionally, we generate six new adversarial test sets through small perturbations of the original test set, replacing selected entities while retaining the context. We also train and test our models on randomly generated train/dev/test splits, followed by an experiment where the models are trained on a select set of genres but tested on genres not seen in training. These comprehensive evaluation strategies were performed using three SOTA NER models. Based on our results, we recommend some useful reporting practices for NER researchers that could help provide a better understanding of a SOTA model's performance in the future.
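As a rough illustration of the entity-replacement perturbation described in the abstract, the sketch below swaps each entity span in a CoNLL-style BIO-tagged sentence for a same-type substitute while leaving all surrounding context tokens untouched. This is a minimal sketch under assumptions: the replacement lexicon, the function name perturb_sentence, and the tag scheme are hypothetical illustrations, not the paper's actual procedure or data.

import random

# Hypothetical pools of unseen surface forms per entity type (not from the paper).
REPLACEMENTS = {
    "PER": [["Amina", "Diallo"], ["Kenji", "Watanabe"]],
    "ORG": [["Northwind", "Analytics"]],
    "LOC": [["Port", "Elodie"]],
}

def perturb_sentence(tokens, tags, rng=random):
    """Replace each entity span with a same-type span, keeping all
    non-entity context tokens (and their O tags) unchanged."""
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-") and tags[i][2:] in REPLACEMENTS:
            etype = tags[i][2:]
            # Consume the whole original span (the B- tag and any I- tags).
            j = i + 1
            while j < len(tags) and tags[j] == f"I-{etype}":
                j += 1
            new_span = rng.choice(REPLACEMENTS[etype])
            out_tokens.extend(new_span)
            out_tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(new_span) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags

# Example: only the PER and LOC spans change; "visited" and "." are preserved.
tokens = ["Barack", "Obama", "visited", "Paris", "."]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(perturb_sentence(tokens, tags))

Applying a transformation of this shape to every sentence of a test set, with different replacement pools, is one plausible way to obtain several adversarial variants like the six the paper reports, while keeping the gold annotations aligned with the perturbed text.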