Paper Title
Coupling Distant Annotation and Adversarial Training for Cross-Domain Chinese Word Segmentation
Paper Authors
Paper Abstract
Fully supervised neural approaches have achieved significant progress in the task of Chinese word segmentation (CWS). Nevertheless, the performance of supervised models tends to drop dramatically when they are applied to out-of-domain data. This degradation is caused by the distribution gap across domains and the out-of-vocabulary (OOV) problem. To alleviate these two issues simultaneously, this paper proposes to couple distant annotation and adversarial training for cross-domain CWS. For distant annotation, we rethink the essence of "Chinese words" and design an automatic distant annotation mechanism that requires no supervision or pre-defined dictionaries from the target domain. The approach can effectively discover domain-specific words and distantly annotate the raw texts of the target domain. For adversarial training, we develop a sentence-level training procedure that reduces noise and maximizes the utilization of source-domain information. Experiments on multiple real-world datasets across various domains show the superiority and robustness of our model, which significantly outperforms previous state-of-the-art cross-domain CWS methods.
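The distant annotation idea — segmenting raw target-domain text against an automatically mined word list and turning the result into character-level labels — can be illustrated with a minimal sketch. Note this is an assumption-based baseline (greedy forward maximum matching plus BMES tagging), not the paper's exact annotation mechanism:

```python
def fmm_segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest word found in the lexicon, falling back to one character.
    (Illustrative stand-in for the paper's distant annotator.)"""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in lexicon:
                words.append(cand)
                i += length
                break
    return words

def to_bmes(words):
    """Convert a word sequence into per-character BMES tags,
    the standard label scheme for character-based CWS."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags
```

Given a mined lexicon such as {"中文", "分词"}, `fmm_segment("中文分词", ...)` yields ["中文", "分词"], and `to_bmes` maps it to the tag sequence ["B", "E", "B", "E"], which can then serve as distant supervision for training a segmenter on the target domain.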