MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages

Paper Title

MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages

Authors

Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, Prem Natarajan

Abstract

We present the MASSIVE dataset--Multilingual Amazon Slu resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation. MASSIVE contains 1M realistic, parallel, labeled virtual assistant utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. MASSIVE was created by tasking professional translators to localize the English-only SLURP dataset into 50 typologically diverse languages from 29 genera. We also present modeling results on XLM-R and mT5, including exact match accuracy, intent classification accuracy, and slot-filling F1 score. We have released our dataset, modeling code, and models publicly.
