Paper Title

Deep Learning for Hindi Text Classification: A Comparison

Paper Authors

Ramchandra Joshi, Purvi Goel, Raviraj Joshi

Paper Abstract

Natural Language Processing (NLP), and especially natural language text analysis, has seen great advances in recent times. The use of deep learning has revolutionized text processing techniques and achieved remarkable results. Different deep learning architectures such as CNN, LSTM, and, more recently, the Transformer have been used to achieve state-of-the-art results on a variety of NLP tasks. In this work, we survey a host of deep learning architectures for text classification tasks. The work is specifically concerned with the classification of Hindi text. Research on the classification of the morphologically rich and low-resource Hindi language, written in the Devanagari script, has been limited due to the absence of large labeled corpora. In this work, we use translated versions of English datasets to evaluate models based on CNN, LSTM, and attention. Multilingual pre-trained sentence embeddings based on BERT and LASER are also compared to evaluate their effectiveness for the Hindi language. The paper also serves as a tutorial for popular text classification techniques.
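
Since the abstract notes that the paper also serves as a tutorial for popular text classification techniques, the following is a minimal illustrative sketch (not the authors' code) of one of the compared model families: a bidirectional LSTM classifier in PyTorch. The vocabulary size, label count, and hyperparameters are hypothetical placeholders; an actual experiment would tokenize the Devanagari text and train on the translated datasets described above.

import torch
import torch.nn as nn

class LSTMTextClassifier(nn.Module):
    """Bidirectional LSTM sentence classifier (illustrative sketch only)."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer indices produced by a Hindi tokenizer
        embedded = self.embedding(token_ids)                  # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)                  # hidden: (2, batch, hidden_dim)
        sentence = torch.cat([hidden[0], hidden[1]], dim=-1)  # concatenate both directions
        return self.fc(sentence)                              # (batch, num_classes) logits

# Toy usage with a hypothetical 10,000-word vocabulary and binary labels
model = LSTMTextClassifier(vocab_size=10_000)
batch = torch.randint(1, 10_000, (4, 32))                     # 4 sequences of 32 token ids
logits = model(batch)
print(logits.shape)                                           # torch.Size([4, 2])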
