Paper Title
GCDT: A Chinese RST Treebank for Multigenre and Multilingual Discourse Parsing
Paper Authors
Paper Abstract
A lack of large-scale human-annotated data has hampered the hierarchical discourse parsing of Chinese. In this paper, we present GCDT, the largest hierarchical discourse treebank for Mandarin Chinese in the framework of Rhetorical Structure Theory (RST). GCDT covers over 60K tokens across five genres of freely available text, using the same relation inventory as contemporary RST treebanks for English. We also report parsing experiments on this dataset, including state-of-the-art (SOTA) scores for Chinese RST parsing and for RST parsing on the English GUM dataset, obtained with cross-lingual training on Chinese and English using multilingual embeddings.