Paper Title

Beyond Offline Mapping: Learning Cross-lingual Word Embeddings through Context Anchoring

Authors

Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko Agirre

Abstract

Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have this limitation, while requiring a weak seed dictionary (e.g., a list of identical words) as the only form of supervision. Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them. To that end, we use an extension of skip-gram that leverages translated context words as anchor points, and incorporates self-learning and iterative restarts to reduce the dependency on the initial dictionary. Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.
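The abstract compresses the method into a single sentence, so a toy sketch may help. The NumPy code below is my illustration under stated assumptions, not the authors' implementation: it trains skip-gram with negative sampling on a source-language corpus, except that whenever a context word appears in the seed dictionary, its context vector is looked up in a frozen target-language embedding matrix; those frozen rows act as the anchor points. The vocabularies, `seed_dict`, and hyperparameters are invented for illustration, the frozen target matrix is random rather than pre-trained, and the paper's self-learning and iterative restarts are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Invented toy vocabularies and a weak seed dictionary (source -> target).
src_vocab = ["der", "hund", "bellt", "laut"]
tgt_vocab = ["the", "dog", "barks", "loud"]
seed_dict = {"hund": "dog", "bellt": "barks"}

dim = 16
src_id = {w: i for i, w in enumerate(src_vocab)}
tgt_id = {w: i for i, w in enumerate(tgt_vocab)}

src_in = rng.normal(0.0, 0.1, (len(src_vocab), dim))   # trainable source vectors
src_ctx = rng.normal(0.0, 0.1, (len(src_vocab), dim))  # trainable source contexts
tgt_ctx = rng.normal(0.0, 0.1, (len(tgt_vocab), dim))  # FROZEN target contexts
# (In the real method tgt_ctx would come from a pre-trained target-language
# model; random values are enough to demonstrate the mechanics.)

def context_vector(word):
    """Anchoring: a dictionary word uses the frozen target context vector of
    its translation; every other word keeps a trainable source context."""
    if word in seed_dict:
        return tgt_ctx[tgt_id[seed_dict[word]]], None  # None = never updated
    idx = src_id[word]
    return src_ctx[idx], idx

corpus = [["der", "hund", "bellt", "laut"]] * 50  # toy monolingual source data
lr, window, n_neg = 0.05, 2, 3

for epoch in range(5):
    for sent in corpus:
        for i, center in enumerate(sent):
            v = src_in[src_id[center]]
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i == j:
                    continue
                c, slot = context_vector(sent[j])
                # SGNS positive-pair gradient: loss = -log sigmoid(v . c)
                g = sigmoid(v @ c) - 1.0
                v_grad = g * c
                if slot is not None:          # anchored rows never move
                    src_ctx[slot] -= lr * g * v
                # Negatives drawn from the source context table (simplified).
                for neg in rng.integers(0, len(src_vocab), n_neg):
                    gn = sigmoid(v @ src_ctx[neg])
                    v_grad += gn * src_ctx[neg]
                    src_ctx[neg] -= lr * gn * v
                src_in[src_id[center]] -= lr * v_grad

# Toy sanity check: the anchored word drifts toward its translation's vector.
hund, dog = src_in[src_id["hund"]], tgt_ctx[tgt_id["dog"]]
print(np.dot(hund, dog) / (np.linalg.norm(hund) * np.linalg.norm(dog)))
```

The design point the sketch tries to surface is that freezing `tgt_ctx` is what does the aligning: because the new source vectors are trained to predict the same fixed context rows that the target model was trained against, they land in a space compatible with the target embeddings, with no separate mapping step and no assumption that two independently trained spaces share structure.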
