Paper Title
Knowledge Graph Question Answering Datasets and Their Generalizability: Are They Enough for Future Research?
Authors
Abstract
Existing approaches to Question Answering over Knowledge Graphs (KGQA) generalize poorly. This is often due to the standard i.i.d. assumption on the underlying dataset. Recently, three levels of generalization for KGQA were defined: i.i.d., compositional, and zero-shot. We analyze 25 well-known KGQA datasets for 5 different Knowledge Graphs (KGs). We show that, according to this definition, many existing KGQA datasets that are available online are either not suited to training a generalizable KGQA system or are based on discontinued and outdated KGs. Generating new datasets is a costly process and is thus not an option for smaller research groups and companies. In this work, we propose a mitigation method that re-splits available KGQA datasets so that they can be used to evaluate generalization, without any cost or manual effort. We test our hypothesis on three KGQA datasets: LC-QuAD, LC-QuAD 2.0, and QALD-9. Experiments on the re-split KGQA datasets demonstrate the method's effectiveness for generalizability. The code and a unified way to access 18 available datasets are online at https://github.com/semantic-systems/KGQA-datasets as well as https://github.com/semantic-systems/KGQA-datasets-generalization.
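The three generalization levels named in the abstract can be illustrated with a minimal sketch. It is not the authors' re-splitting method: it assumes each question is reduced to the set of KG schema items (relations and classes) in its query, and it uses that set as a rough stand-in for the query's structure. The function name `classify_generalization` and the example identifiers are hypothetical.

```python
from typing import FrozenSet, Set


def classify_generalization(
    test_items: FrozenSet[str],
    train_compositions: Set[FrozenSet[str]],
) -> str:
    """Assign a test question to one of the three generalization levels.

    Assumption: a question is represented only by the set of schema items
    in its query, which approximates (but does not equal) its query template.
    """
    # Every schema item that occurs in at least one training question.
    seen_items = set().union(*train_compositions)

    # Zero-shot: at least one schema item never occurs in training.
    if not test_items <= seen_items:
        return "zero-shot"

    # i.i.d.: the same combination of schema items occurs in training.
    if test_items in train_compositions:
        return "i.i.d."

    # Compositional: all items are known, but this combination is new.
    return "compositional"


# Hypothetical training questions, each reduced to its schema items.
train = {
    frozenset({"dbo:author", "dbo:Book"}),
    frozenset({"dbo:birthPlace"}),
}

level_a = classify_generalization(frozenset({"dbo:author", "dbo:Book"}), train)
level_b = classify_generalization(frozenset({"dbo:author", "dbo:birthPlace"}), train)
level_c = classify_generalization(frozenset({"dbo:spouse"}), train)
```

With a classifier of this kind, a dataset could be re-split by assigning questions to the test set so that each level is represented. How the levels are balanced is a design choice that this sketch does not address.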