Paper title
Toward More Meaningful Resources for Lower-resourced Languages
Paper authors
Paper abstract
In this position paper, we describe our perspective on how meaningful resources for lower-resourced languages should be developed in connection with the speakers of those languages. We first examine two massively multilingual resources in detail. We explore the contents of the names stored in Wikidata for a few lower-resourced languages and find that many of them are not, in fact, in the languages they claim to be, and that correcting them requires non-trivial effort. We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand-annotated data. We then discuss the importance of creating annotation for lower-resourced languages in a thoughtful and ethical way that includes the languages' speakers as part of the development process. We conclude with recommended guidelines for resource development.