Paper Title
Compressing Word Embeddings Using Syllables
Paper Authors
Abstract
This work examines the possibility of using syllable embeddings, instead of the often used $n$-gram embeddings, as subword embeddings. We investigate this for two languages: English and Dutch. To this end, we also translated two standard English word embedding evaluation datasets, WordSim353 and SemEval-2017, to Dutch. Furthermore, we provide the research community with datasets of syllabic decompositions for both languages. We compare our approach to full word and $n$-gram embeddings. Compared to full word embeddings, we obtain English models that are 20 to 30 times smaller while retaining 80% of the performance. For Dutch, models are 15 times smaller for 70% performance retention. Although less accurate than the $n$-gram baseline we used, our models can be trained in a matter of minutes, as opposed to hours for the $n$-gram approach. We identify a path toward improving performance in future work. All code is made publicly available, as well as our collected English and Dutch syllabic decompositions and Dutch evaluation set translations.
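The core idea of subword embeddings like these can be sketched briefly: each word is mapped to a small set of syllables (rather than the much larger set of character $n$-grams), and its vector is composed from per-syllable vectors. The decompositions, the summation as composition function, and all names below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Hypothetical syllable decompositions; the paper releases such datasets
# for English and Dutch, but these particular splits are assumptions.
SYLLABLES = {
    "syllable": ["syl", "la", "ble"],
    "embedding": ["em", "bed", "ding"],
}

DIM = 8
rng = np.random.default_rng(0)

# One small vector per syllable instead of one per full word or per n-gram;
# the syllable vocabulary is far smaller, which is where the model-size
# savings come from.
syllable_vectors = {
    syl: rng.standard_normal(DIM)
    for sylls in SYLLABLES.values()
    for syl in sylls
}

def word_vector(word: str) -> np.ndarray:
    """Compose a word embedding as the sum of its syllable embeddings
    (a common subword composition choice; the paper's composition
    function may differ)."""
    return np.sum([syllable_vectors[s] for s in SYLLABLES[word]], axis=0)

print(word_vector("syllable").shape)  # (8,)
```

In a trained model the syllable vectors would be learned (e.g. with a skip-gram objective) rather than randomly initialized; the sketch only shows the composition step.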