Development of POS tagger for English-Bengali Code-Mixed data

Paper title

Development of POS tagger for English-Bengali Code-Mixed data

Authors

Tathagata Raha, Sainik Kumar Mahata, Dipankar Das, Sivaji Bandyopadhyay

Abstract

Code-mixed texts are widespread nowadays due to the advent of social media. Because these texts combine two languages within a sentence, they give rise to various research problems in Natural Language Processing. In this paper, we explore one such problem, namely part-of-speech (POS) tagging of code-mixed texts. We have built a system that can POS tag English-Bengali code-mixed data in which the Bengali words are written in Roman script. Our approach begins with the collection and cleaning of English-Bengali code-mixed tweets, which were used as a development dataset for building the system. The proposed system is modular: it first tags individual tokens with their respective languages and then passes them to separate POS taggers designed for each language (English and Bengali, in our case). The tags produced by the two taggers are then joined together, and the final result is mapped to a universal POS tag set. The system was evaluated on 100 manually POS-tagged code-mixed sentences and achieved an accuracy of 75.29%.
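
The modular pipeline described in the abstract (token-level language identification, routing to per-language taggers, merging, and mapping to a universal tag set) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the word lists, the lexicon-lookup "taggers", the native tags, the tag mappings, and all function names are hypothetical stand-ins and do not reproduce the authors' models or tag sets.

```python
from typing import Dict, List, Tuple

# Hypothetical toy resources. A real system would use trained models;
# these small dictionaries only stand in for them.
BENGALI_ROMAN_WORDS = {"ami", "khub", "bhalo", "achi"}

# Illustrative per-language lexicons with language-specific (native) tags.
EN_LEXICON: Dict[str, str] = {"today": "NN", "happy": "JJ", "feeling": "VBG"}
BN_LEXICON: Dict[str, str] = {"ami": "PRP", "khub": "RB", "bhalo": "JJ", "achi": "VB"}

# Illustrative mappings from native tags to a universal tag set.
EN_TO_UNIVERSAL = {"NN": "NOUN", "JJ": "ADJ", "VBG": "VERB"}
BN_TO_UNIVERSAL = {"PRP": "PRON", "RB": "ADV", "JJ": "ADJ", "VB": "VERB"}


def identify_language(token: str) -> str:
    """Step 1: tag a token as Bengali ('bn') or English ('en') by word-list lookup."""
    return "bn" if token.lower() in BENGALI_ROMAN_WORDS else "en"


def tag_with_lexicon(tokens: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Step 2: stand-in for a language-specific POS tagger (unknown words get 'UNK')."""
    return [lexicon.get(t.lower(), "UNK") for t in tokens]


def tag_code_mixed(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Run the whole pipeline and return (token, language, universal tag) triples."""
    langs = [identify_language(t) for t in tokens]
    merged: List[Tuple[str, str, str]] = [("", "", "")] * len(tokens)
    taggers = {"en": (EN_LEXICON, EN_TO_UNIVERSAL), "bn": (BN_LEXICON, BN_TO_UNIVERSAL)}
    for lang, (lexicon, to_universal) in taggers.items():
        # Route only this language's tokens to its tagger, remembering positions.
        positions = [i for i, l in enumerate(langs) if l == lang]
        native_tags = tag_with_lexicon([tokens[i] for i in positions], lexicon)
        # Steps 3 and 4: put tags back in sentence order and map them to
        # universal tags ('X' for anything the mapping does not cover).
        for i, tag in zip(positions, native_tags):
            merged[i] = (tokens[i], lang, to_universal.get(tag, "X"))
    return merged


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Token-level accuracy: the fraction of tokens whose predicted tag matches gold."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    for triple in tag_code_mixed(["ami", "today", "khub", "happy"]):
        print(triple)
```

Keeping the language identifier, the two taggers, and the tag mappings as separate pieces mirrors the modular design the abstract describes: any one of them could be swapped for a stronger model without changing the rest of the pipeline. The `accuracy` helper assumes token-level accuracy against manually assigned tags; the abstract does not spell out how its 75.29% figure was computed.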

Paper title