Paper Title
Breaking Character: Are Subwords Good Enough for MRLs After All?
Paper Authors
Paper Abstract
Large pretrained language models (PLMs) typically tokenize the input string into contiguous subwords before any pretraining or inference. However, previous studies have claimed that this form of subword tokenization is inadequate for processing morphologically-rich languages (MRLs). We revisit this hypothesis by pretraining a BERT-style masked language model over character sequences instead of word-pieces. We compare the resulting model, dubbed TavBERT, against contemporary PLMs based on subwords for three highly complex and ambiguous MRLs (Hebrew, Turkish, and Arabic), testing them on both morphological and semantic tasks. Our results show, for all tested languages, that while TavBERT obtains mild improvements on surface-level tasks à la POS tagging and full morphological disambiguation, subword-based PLMs achieve significantly higher performance on semantic tasks, such as named entity recognition and extractive question answering. These results showcase and (re)confirm the potential of subword tokenization as a reasonable modeling assumption for many languages, including MRLs.
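To make the contrast concrete, here is a minimal sketch (not the authors' code) of the two input representations the abstract compares: greedy WordPiece-style subword segmentation, as used by contemporary PLMs, versus the raw character-sequence input with span masking used by a TavBERT-style character-level masked language model. The toy vocabulary, span length, and mask token below are illustrative assumptions, not details taken from the paper.

```python
# Sketch only: contrasts subword tokenization with character-level input
# plus character-span masking. Vocabulary and masking parameters are
# hypothetical, chosen purely for illustration.
import random


def subword_tokenize(text, vocab):
    """Greedy longest-match subword segmentation (WordPiece-like toy)."""
    tokens = []
    for word in text.split():
        start = 0
        while start < len(word):
            end = len(word)
            piece = None
            while end > start:
                cand = word[start:end] if start == 0 else "##" + word[start:end]
                if cand in vocab:
                    piece = cand
                    break
                end -= 1
            if piece is None:  # no vocabulary match: fall back to [UNK] for one character
                piece, end = "[UNK]", start + 1
            tokens.append(piece)
            start = end
    return tokens


def char_tokenize(text):
    """Character-level input: every character (including spaces) is a token."""
    return list(text)


def mask_char_span(chars, span_len=3, mask_token="[MASK]"):
    """Mask one contiguous character span, in the spirit of a character-level MLM."""
    start = random.randrange(0, max(1, len(chars) - span_len))
    return chars[:start] + [mask_token] * span_len + chars[start + span_len:]


if __name__ == "__main__":
    toy_vocab = {"un", "break", "##break", "##ing", "##able"}
    print(subword_tokenize("unbreakable", toy_vocab))  # ['un', '##break', '##able']
    chars = char_tokenize("unbreakable")
    print(mask_char_span(chars))  # e.g. ['u', 'n', '[MASK]', '[MASK]', '[MASK]', 'a', ...]
```

The sketch is meant only to show why the choice matters for MRLs: the subword view commits to a fixed segmentation before pretraining, whereas the character view leaves segmentation implicit and lets the model recover morpheme boundaries (or not) from masked character spans.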