Paper Title
Is Machine Learning Speaking my Language? A Critical Look at the NLP-Pipeline Across 8 Human Languages
Authors
Abstract
Natural Language Processing (NLP) is increasingly used as a key ingredient in critical decision-making systems such as resume parsers used in sorting a list of job candidates. NLP systems often ingest large corpora of human text, attempting to learn from past human behavior and decisions in order to produce systems that will make recommendations about our future world. Over 7000 human languages are being spoken today and the typical NLP pipeline underrepresents speakers of most of them while amplifying the voices of speakers of other languages. In this paper, a team including speakers of 8 languages - English, Chinese, Urdu, Farsi, Arabic, French, Spanish, and Wolof - takes a critical look at the typical NLP pipeline and how even when a language is technically supported, substantial caveats remain to prevent full participation. Despite huge and admirable investments in multilingual support in many tools and resources, we are still making NLP-guided decisions that systematically and dramatically underrepresent the voices of much of the world.