Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network

Paper title

Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network

Authors

Karim, Md. Rezaul; Chakravarthi, Bharathi Raja; McCrae, John P.; Cochez, Michael

Abstract

The exponential growth of social media and micro-blogging sites not only provides platforms for empowering freedom of expression and individual voices but also enables people to express anti-social behaviour such as online harassment, cyberbullying, and hate speech. Numerous works have been proposed to utilize these data for social and anti-social behaviour analysis, document characterization, and sentiment analysis by predicting the contexts, mostly for highly resourced languages such as English. However, some languages are under-resourced, e.g., South Asian languages like Bengali, Tamil, Assamese, and Telugu, which lack computational resources for NLP tasks. In this paper, we provide several classification benchmarks for Bengali, an under-resourced language. We prepared three datasets covering expressions of hate, commonly used topics, and opinions for hate speech detection, document classification, and sentiment analysis, respectively. We built the largest Bengali word embedding model to date, based on 250 million articles, which we call BengFastText. We perform three different experiments, covering document classification, sentiment analysis, and hate speech detection. We incorporate word embeddings into a Multichannel Convolutional-LSTM (MConv-LSTM) network for predicting different types of hate speech, document classification, and sentiment analysis. Experiments demonstrate that BengFastText can capture the semantics of words from their respective contexts correctly. Evaluations against several baseline embedding models, e.g., Word2Vec and GloVe, yield F1-scores of up to 92.30%, 82.25%, and 90.45% for document classification, sentiment analysis, and hate speech detection, respectively, during 5-fold cross-validation tests.
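
The abstract reports F1-scores obtained under 5-fold cross-validation. As a rough illustration of that evaluation protocol only, the sketch below splits sample indices into five folds and averages a binary F1-score across them. It is not the authors' code: the function names (`k_fold_indices`, `f1_binary`, `cross_validate_f1`), the contiguous unshuffled folds, and the binary (rather than multi-class) F1 are all assumptions made for the example, and the abstract does not say how F1 was averaged.

```python
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous folds (hypothetical helper).

    Returns one (train_indices, test_indices) pair per fold.
    """
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first folds
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, folds[i]))
    return splits


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_validate_f1(
    labels: Sequence[int],
    fit_predict: Callable[[List[int], List[int]], List[int]],
    k: int = 5,
) -> float:
    """Average F1 over k folds.

    fit_predict(train_idx, test_idx) is an assumed callback that trains on the
    training indices and returns one predicted label per test index.
    """
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        y_pred = fit_predict(train, test)
        y_true = [labels[i] for i in test]
        scores.append(f1_binary(y_true, y_pred))
    return sum(scores) / len(scores)
```

In practice the `fit_predict` callback would wrap model training and inference for each fold; a stratified, shuffled split is usually preferable when classes are imbalanced.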

Paper title