TechTexC: Classification of Technical Texts using Convolution and Bidirectional Long Short Term Memory Network

Paper Title

TechTexC: Classification of Technical Texts using Convolution and Bidirectional Long Short Term Memory Network

Authors

Omar Sharif, Eftekhar Hossain, Mohammed Moshiul Hoque

Abstract

This paper gives a detailed description of a technical text classification system, and of its results, developed as part of participation in the shared task TechDofication 2020. The shared task consists of two sub-tasks: (i) the first task identifies the coarse-grained technical domain of a given text in a specified language, and (ii) the second task classifies a text from the computer science domain into fine-grained sub-domains. A classification system (called 'TechTexC') is developed to perform the classification tasks using three techniques: a convolutional neural network (CNN), a bidirectional long short-term memory (BiLSTM) network, and a CNN combined with BiLSTM. Results show that the combined CNN with BiLSTM model outperforms the other techniques on sub-tasks a, b, c and g of task 1 and on task 2a. This combined model obtained F1 scores of 82.63 (sub-task 1a), 81.95 (sub-task 1b), 82.39 (sub-task 1c), 84.37 (sub-task 1g), and 67.44 (task 2a) on the development dataset. Moreover, on the test set, the combined CNN with BiLSTM approach achieved higher accuracy on sub-tasks 1a (70.76%), 1b (79.97%), 1c (65.45%), 1g (49.23%) and 2a (70.14%).
