Revisiting DocRED -- Addressing the False Negative Problem in Relation Extraction

Paper Title

Revisiting DocRED -- Addressing the False Negative Problem in Relation Extraction

Authors

Qingyu Tan, Lu Xu, Lidong Bing, Hwee Tou Ng, Sharifah Mahani Aljunied

Abstract

The DocRED dataset is one of the most popular and widely used benchmarks for document-level relation extraction (RE). It adopts a recommend-revise annotation scheme so as to have a large-scale annotated dataset. However, we find that the annotation of DocRED is incomplete, i.e., false negative samples are prevalent. We analyze the causes and effects of the overwhelming false negative problem in the DocRED dataset. To address the shortcoming, we re-annotate 4,053 documents in the DocRED dataset by adding the missed relation triples back to the original DocRED. We name our revised DocRED dataset Re-DocRED. We conduct extensive experiments with state-of-the-art neural models on both datasets, and the experimental results show that the models trained and evaluated on our Re-DocRED achieve performance improvements of around 13 F1 points. Moreover, we conduct a comprehensive analysis to identify the potential areas for further improvement. Our dataset is publicly available at https://github.com/tonytan48/Re-DocRED.
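To make the false negative problem concrete, the sketch below scores the same predictions against an incomplete and a completed gold set. It is a toy illustration only: the triples, the relation names, and the `micro_f1` helper are invented here and are not taken from DocRED, Re-DocRED, or the authors' evaluation code. It shows how a correct prediction that is missing from the gold annotation is counted as a false positive, which lowers precision and F1.

```python
from typing import Set, Tuple

# (head entity, relation, tail entity); toy representation assumed for this sketch
Triple = Tuple[str, str, str]


def micro_f1(predicted: Set[Triple], gold: Set[Triple]) -> float:
    """Micro F1 of a set of predicted triples against a gold set."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical complete annotation of one document.
complete_gold = {
    ("Alice", "born_in", "Paris"),
    ("Paris", "capital_of", "France"),
    ("Alice", "citizen_of", "France"),
}

# Incomplete annotation: one true triple was never labeled (a false negative).
missed = {("Alice", "citizen_of", "France")}
incomplete_gold = complete_gold - missed

# A model that recovers every true triple.
predicted = set(complete_gold)

f1_incomplete = micro_f1(predicted, incomplete_gold)  # penalized for a correct triple
f1_complete = micro_f1(predicted, complete_gold)      # scored fairly

print(f"F1 against incomplete gold: {f1_incomplete:.2f}")
print(f"F1 against completed gold:  {f1_complete:.2f}")
```

In this toy case, adding the missed triple back to the gold set is the only change between the two scores; the predictions themselves are identical.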
