Predicting Embedding Reliability in Low-Resource Settings Using Corpus Similarity Measures

Paper Title

Predicting Embedding Reliability in Low-Resource Settings Using Corpus Similarity Measures

Authors

Dunn, Jonathan; Li, Haipeng; Sastre, Damian

Abstract

This paper simulates a low-resource setting across 17 languages in order to evaluate embedding similarity, stability, and reliability under different conditions. The goal is to use corpus similarity measures before training to predict properties of embeddings after training. The main contribution of the paper is to show that it is possible to predict downstream embedding similarity using upstream corpus similarity measures. This finding is then applied to low-resource settings by modelling the reliability of embeddings created from very limited training data. Results show that it is possible to estimate the reliability of low-resource embeddings using corpus similarity measures that remain robust on small amounts of data. These findings have significant implications for the evaluation of truly low-resource languages in which such systematic downstream validation methods are not possible because of data limitations.
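
The abstract does not say which corpus similarity measure is used, so the following is only a minimal sketch of one common frequency-based approach: rank the most frequent words of two corpora and compute the Spearman correlation between their frequency ranks. The function names, the whitespace tokenizer, and the `top_k` parameter are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def word_frequencies(text):
    """Count lowercase whitespace-separated tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def corpus_similarity(text_a, text_b, top_k=1000):
    """Spearman correlation of word frequencies over the top_k shared features.

    Assumption: features are the top_k most frequent words of the two
    corpora combined; the paper may select features differently.
    """
    freq_a, freq_b = word_frequencies(text_a), word_frequencies(text_b)
    features = [word for word, _ in (freq_a + freq_b).most_common(top_k)]
    if len(features) < 2:
        return 0.0
    xs = [freq_a[word] for word in features]
    ys = [freq_b[word] for word in features]
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(average_ranks(xs), average_ranks(ys))
```

A score near 1 means the two corpora rank their frequent words in nearly the same order. In the setting the abstract describes, a score like this would be computed before training and then related to how similar the resulting embeddings turn out to be.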
