TaTa: A Multilingual Table-to-Text Dataset for African Languages

Paper Title

TaTa: A Multilingual Table-to-Text Dataset for African Languages

Authors

Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur Parikh, Clara Rivera

Abstract

Existing data-to-text generation datasets are mostly limited to English. To address this lack of data, we create Table-to-Text in African languages (TaTa), the first large multilingual table-to-text dataset with a focus on African languages. We created TaTa by transcribing figures and accompanying text in bilingual reports by the Demographic and Health Surveys Program, followed by professional translation to make the dataset fully parallel. TaTa includes 8,700 examples in nine languages including four African languages (Hausa, Igbo, Swahili, and Yorùbá) and a zero-shot test language (Russian). We additionally release screenshots of the original figures for future research on multilingual multi-modal approaches. Through an in-depth human evaluation, we show that TaTa is challenging for current models and that less than half the outputs from an mT5-XXL-based model are understandable and attributable to the source data. We further demonstrate that existing metrics perform poorly for TaTa and introduce learned metrics that achieve a high correlation with human judgments. We release all data and annotations at https://github.com/google-research/url-nlp.
