Paper Title


Are We Really Making Much Progress in Text Classification? A Comparative Review

Authors

Lukas Galke, Ansgar Scherp, Andor Diera, Fabian Karl, Bao Xin Lin, Bhakti Khera, Tim Meuser, Tushar Singhal

Abstract


We analyze various methods for single-label and multi-label text classification across well-known datasets, categorizing them into bag-of-words, sequence-based, graph-based, and hierarchical approaches. Despite the surge in methods like graph-based models, encoder-only pre-trained language models, notably BERT, remain state-of-the-art. However, recent findings suggest that simpler models like logistic regression and trigram-based SVMs outperform newer techniques. While decoder-only generative language models show promise in learning with limited data, they lag behind encoder-only models in performance. We emphasize the superiority of discriminative language models like BERT over generative models for supervised tasks. Additionally, we highlight the literature's lack of robustness in method comparisons, particularly concerning basic hyperparameter optimizations like the learning rate in fine-tuning encoder-only language models. Data availability: The source code is available at https://github.com/drndr/multilabel-text-clf. All datasets used for our experiments are publicly available except the NYT dataset.
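The simple baselines the abstract refers to (bag-of-words logistic regression and a trigram-based SVM) can be sketched as follows. This is a minimal illustration using scikit-learn with toy data; the library choice, pipeline setup, and example texts are our assumptions, not the paper's actual experimental configuration.

```python
# Two classic text-classification baselines: a bag-of-words logistic
# regression and a linear SVM over word n-grams up to length 3 (trigrams).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy single-label data; the paper's experiments use standard benchmarks.
train_texts = [
    "the match ended in a draw",
    "stocks fell sharply today",
    "the striker scored twice in the match",
    "the central bank raised interest rates",
]
train_labels = ["sports", "finance", "sports", "finance"]

# Bag-of-words (unigram) TF-IDF features + logistic regression.
logreg = make_pipeline(TfidfVectorizer(), LogisticRegression())
logreg.fit(train_texts, train_labels)

# Trigram-based SVM: TF-IDF over word 1- to 3-grams + linear SVM.
svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LinearSVC())
svm.fit(train_texts, train_labels)

print(logreg.predict(["the striker scored in the match"]))
print(svm.predict(["the bank raised rates"]))
```

Despite their simplicity, pipelines of this shape are the kind of strong baseline the review finds competitive with, or better than, many newer graph-based methods.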
