Paper Title

On a Novel Application of Wasserstein-Procrustes for Unsupervised Cross-Lingual Learning

Paper Authors

Guillem Ramírez, Rumen Dangovski, Preslav Nakov, Marin Soljačić

Abstract

The emergence of unsupervised word embeddings, pre-trained on very large monolingual text corpora, is at the core of the ongoing neural revolution in Natural Language Processing (NLP). Initially introduced for English, such pre-trained word embeddings quickly emerged for a number of other languages. Subsequently, there have been a number of attempts to align the embedding spaces across languages, which could enable a number of cross-language NLP applications. Performing the alignment using unsupervised cross-lingual learning (UCL) is especially attractive as it requires little data and often rivals supervised and semi-supervised approaches. Here, we analyze popular methods for UCL and we find that often their objectives are, intrinsically, versions of the Wasserstein-Procrustes problem. Hence, we devise an approach to solve Wasserstein-Procrustes in a direct way, which can be used to refine and to improve popular UCL methods such as iterative closest point (ICP), multilingual unsupervised and supervised embeddings (MUSE) and supervised Procrustes methods. Our evaluation experiments on standard datasets show sizable improvements over these approaches. We believe that our rethinking of the Wasserstein-Procrustes problem could enable further research, thus helping to develop better algorithms for aligning word embeddings across languages. Our code and instructions to reproduce the experiments are available at https://github.com/guillemram97/wp-hungarian.
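The Wasserstein-Procrustes objective mentioned above jointly seeks a permutation (matching words across languages) and an orthogonal map (aligning the two embedding spaces). A common way to attack it, and the one suggested by the repository name `wp-hungarian`, is to alternate between solving the matching exactly with the Hungarian algorithm and solving the rotation in closed form via orthogonal Procrustes (SVD). The sketch below illustrates that alternating scheme; it is not the authors' implementation, and it assumes row-normalized embeddings so that minimizing squared distances is equivalent to maximizing inner products.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def procrustes(X, Y):
    # Closed-form orthogonal Procrustes: the rotation W minimizing
    # ||X W - Y||_F over orthogonal matrices is U V^T, where
    # U S V^T is the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def wasserstein_procrustes(X, Y, n_iters=20):
    # Alternating minimization sketch (illustrative, not the paper's code):
    #   1) fix the rotation W, find the optimal matching with the
    #      Hungarian algorithm (linear_sum_assignment);
    #   2) fix the matching, find the optimal rotation with Procrustes.
    # Assumes X and Y have the same number of rows and that rows are
    # normalized, so maximizing inner products matches the W-P objective.
    W = np.eye(X.shape[1])
    perm = np.arange(len(X))
    for _ in range(n_iters):
        # Negated inner products: Hungarian minimizes total cost.
        cost = -(X @ W) @ Y.T
        _, perm = linear_sum_assignment(cost)
        W = procrustes(X, Y[perm])
    return W, perm
```

On a toy instance where `Y` is a row permutation of `X`, a few iterations recover both the matching and an (approximately) identity rotation; real cross-lingual data requires a good initialization, which is where refinement of ICP- or MUSE-style solutions comes in.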
