Paper Title


A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning

Authors

Md Mofijul Islam, Gustavo Aguilar, Pragaash Ponnusamy, Clint Solomon Mathialagan, Chengyuan Ma, Chenlei Guo

Abstract


Subword tokenization is a commonly used input pre-processing step in most recent NLP models. However, it limits the models' ability to leverage end-to-end task learning. Its frequency-based vocabulary creation compromises tokenization in low-resource languages, leading models to produce suboptimal representations. Additionally, the dependency on a fixed vocabulary limits the subword models' adaptability across languages and domains. In this work, we propose a vocabulary-free neural tokenizer by distilling segmentation information from heuristic-based subword tokenization. We pre-train our character-based tokenizer by processing unique words from a multilingual corpus, thereby extensively increasing word diversity across languages. Unlike the predefined and fixed vocabularies in subword methods, our tokenizer allows end-to-end task learning, resulting in optimal task-specific tokenization. The experimental results show that replacing the subword tokenizer with our neural tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks, with larger gains in low-resource languages. Additionally, our neural tokenizer exhibits robust performance on downstream tasks when adversarial noise is present (typos and misspellings), further increasing the initial improvements over statistical subword tokenizers.
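The distillation step described above — extracting segmentation information from a heuristic subword tokenizer to supervise a character-based model — can be sketched as converting a subword segmentation into character-level boundary labels. The following is a minimal illustration, not the paper's implementation; the example segmentation is from a hypothetical subword tokenizer.

```python
def boundary_labels(word, subwords):
    """Label each character 1 if it begins a subword segment, else 0.

    These character-level labels are the kind of supervision a
    character-based neural tokenizer could be distilled from.
    """
    assert "".join(subwords) == word, "segments must concatenate to the word"
    labels = []
    for piece in subwords:
        labels.append(1)                       # segment-initial character
        labels.extend([0] * (len(piece) - 1))  # segment-internal characters
    return labels

# "tokenization" segmented as ["token", "ization"] by a hypothetical
# subword tokenizer yields one boundary label per character.
print(boundary_labels("tokenization", ["token", "ization"]))
# -> [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

A character-level model trained on such labels predicts segment boundaries directly, so no fixed subword vocabulary is needed at inference time, and the boundary predictor can be fine-tuned jointly with a downstream task.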
