The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual Relation Classification

Paper Title

The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual Relation Classification

Authors

Abdullatif Köksal, Arzucan Özgür

Abstract

Relation classification is one of the key topics in information extraction, which can be used to construct knowledge bases or to provide useful information for question answering. Current approaches for relation classification are mainly focused on the English language and require lots of training data with human annotations. Creating and annotating a large amount of training data for low-resource languages is impractical and expensive. To overcome this issue, we propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup, which significantly improves the baseline with distant supervision. For evaluation, we introduce a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish, called RELX. We also provide the RELX-Distant dataset, which includes hundreds of thousands of sentences with relations from Wikipedia and Wikidata collected by distant supervision for these languages. Our code and data are available at: https://github.com/boun-tabi/RELX
