Paper Title
MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab Pretraining
Paper Authors
Paper Abstract
Although the development of pre-trained language models (PLMs) has significantly raised the performance of various Chinese natural language processing (NLP) tasks, the vocabulary of these Chinese PLMs remains the character-based one provided by Google's Chinese BERT \cite{devlin2018bert}. In addition, masked language model pre-training is based on a single vocabulary, which limits downstream task performance. In this work, we first propose a novel method, \emph{seg\_tok}, to form the vocabulary of Chinese BERT with the help of Chinese word segmentation (CWS) and subword tokenization. We then propose three versions of multi-vocabulary pretraining (MVP) to improve the model's expressiveness. Experiments show that: (a) compared with a char-based vocabulary, \emph{seg\_tok} not only improves the performance of Chinese PLMs on sentence-level tasks but also improves efficiency; (b) MVP improves PLMs' downstream performance; in particular, it improves \emph{seg\_tok}'s performance on sequence labeling tasks.
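As a rough illustration of the \emph{seg\_tok} idea, the sketch below first applies CWS to a raw corpus and then trains a subword tokenizer on the segmented text, so frequent Chinese words can enter the vocabulary whole while rarer ones fall back to subword pieces. The specific tools (jieba for segmentation, a BPE model from the HuggingFace tokenizers library) and the vocabulary size are illustrative assumptions; the abstract does not say which segmenter or subword algorithm the paper actually uses.

```python
# Hypothetical sketch of building a seg_tok-style vocabulary:
# CWS first, then subword tokenization over the segmented corpus.
# jieba and BPE are stand-ins; the paper's actual tools may differ.
import jieba
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

def segmented_corpus(lines):
    """Yield each raw line as whitespace-joined CWS words."""
    for line in lines:
        yield " ".join(jieba.lcut(line.strip()))

raw_lines = [
    "今天天气很好,我们去公园散步。",
    "预训练语言模型显著提升了中文自然语言处理任务的表现。",
]

# BPE learns subwords on top of the word-segmented text, so common
# words tend to stay intact and rare words are split into smaller pieces.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=30000,  # assumed size, not taken from the paper
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(segmented_corpus(raw_lines), trainer=trainer)

# Tokenize a new sentence with the same CWS-then-subword pipeline.
print(tokenizer.encode(" ".join(jieba.lcut("我们去公园散步"))).tokens)
```

In a real setting the iterator would stream a large pretraining corpus rather than two example sentences, and the resulting vocabulary would replace the char-based one when pretraining the Chinese BERT model.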