Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task

Paper Title

Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task

Authors

Pinzhen Chen, Kenneth Heafield

Abstract

Chinese word segmentation has entered the deep learning era, which greatly reduces the hassle of feature engineering. Recently, some researchers have attempted to treat it as character-level translation, which further simplifies model design, but a performance gap remains between the translation-based approach and other methods. This motivates our work, in which we apply best practices from low-resource neural machine translation to supervised Chinese word segmentation. We examine a series of techniques including regularization, data augmentation, objective weighting, transfer learning, and ensembling. Compared to previous work, our low-resource translation-based method maintains the effortless model design, yet matches the state of the art in the constrained evaluation without using additional data.
