Paper Title
Compressing Word Embeddings Using Syllables
Paper Authors
Abstract
This work examines the possibility of using syllable embeddings, instead of the often used $n$-gram embeddings, as subword embeddings. We investigate this for two languages: English and Dutch. To this end, we also translated two standard English word embedding evaluation datasets, WordSim353 and SemEval-2017, to Dutch. Furthermore, we provide the research community with datasets of syllabic decompositions for both languages. We compare our approach to full word and $n$-gram embeddings. Compared to full word embeddings, we obtain English models that are 20 to 30 times smaller while retaining 80% of the performance. For Dutch, models are 15 times smaller for 70% performance retention. Although less accurate than the $n$-gram baseline we used, our models can be trained in a matter of minutes, as opposed to hours for the $n$-gram approach. We identify a path toward improving performance in future work. All code is made publicly available, as well as our collected English and Dutch syllabic decompositions and Dutch evaluation set translations.
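The core idea of subword embeddings like these can be sketched briefly: each word is mapped to a small set of syllables (rather than the much larger set of character $n$-grams), and its vector is composed from per-syllable vectors. The decompositions, the summation as composition function, and all names below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Hypothetical syllable decompositions; the paper releases such datasets
# for English and Dutch, but these particular splits are assumptions.
SYLLABLES = {
    "syllable": ["syl", "la", "ble"],
    "embedding": ["em", "bed", "ding"],
}

DIM = 8
rng = np.random.default_rng(0)

# One small vector per syllable instead of one per full word or per n-gram;
# the syllable vocabulary is far smaller, which is where the model-size
# savings come from.
syllable_vectors = {
    syl: rng.standard_normal(DIM)
    for sylls in SYLLABLES.values()
    for syl in sylls
}

def word_vector(word: str) -> np.ndarray:
    """Compose a word embedding as the sum of its syllable embeddings
    (a common subword composition choice; the paper's composition
    function may differ)."""
    return np.sum([syllable_vectors[s] for s in SYLLABLES[word]], axis=0)

print(word_vector("syllable").shape)  # (8,)
```

In a trained model the syllable vectors would be learned (e.g. with a skip-gram objective) rather than randomly initialized; the sketch only shows the composition step.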