Paper Title
Towards Zero-shot Learning for Automatic Phonemic Transcription
Paper Authors
Paper Abstract
Automatic phonemic transcription tools are useful for low-resource language documentation. However, due to the lack of training sets, only a tiny fraction of languages have phonemic transcription tools. Fortunately, multilingual acoustic modeling provides a solution given limited audio training data. A more challenging problem is to build phonemic transcribers for languages with zero training data. The difficulty of this task is that phoneme inventories often differ between the training languages and the target language, making it infeasible to recognize unseen phonemes. In this work, we address this problem by adopting the idea of zero-shot learning. Our model is able to recognize unseen phonemes in the target language without any training data. In our model, we decompose phonemes into corresponding articulatory attributes such as vowel and consonant. Instead of predicting phonemes directly, we first predict distributions over articulatory attributes, and then compute phoneme distributions with a customized acoustic model. We evaluate our model by training it using 13 languages and testing it using 7 unseen languages. We find that it achieves 7.7% better phoneme error rate on average over a standard multilingual model.
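The core idea — scoring an unseen phoneme through its articulatory attributes rather than recognizing it directly — can be sketched as follows. This is a minimal illustration, not the paper's actual model: the attribute set, the example signatures, and the independent-Bernoulli scoring rule here are all assumptions chosen for clarity.

```python
# Hypothetical articulatory attributes; the paper's full attribute inventory
# is not reproduced here.
ATTRIBUTES = ["vowel", "consonant", "voiced", "bilabial", "high"]

# Each phoneme is described by a binary signature over the attributes
# (1 = phoneme has the attribute). An unseen phoneme like /i/ can still be
# scored, because its signature is known even without training audio.
PHONEME_SIGNATURES = {
    "b": [0, 1, 1, 1, 0],
    "p": [0, 1, 0, 1, 0],
    "i": [1, 0, 1, 0, 1],  # unseen at training time
}

def phoneme_distribution(attr_probs):
    """Map per-attribute probabilities (as an acoustic model might predict
    for one frame) to a distribution over phonemes. Each attribute is
    treated as an independent Bernoulli variable: a phoneme's score is the
    likelihood of its signature, and scores are normalized at the end."""
    scores = {}
    for phoneme, sig in PHONEME_SIGNATURES.items():
        score = 1.0
        for has_attr, p in zip(sig, attr_probs):
            score *= p if has_attr else (1.0 - p)
        scores[phoneme] = score
    total = sum(scores.values())
    return {ph: sc / total for ph, sc in scores.items()}

# Example: the model is confident the frame is a voiced high vowel,
# so the unseen phoneme /i/ dominates the resulting distribution.
attr_probs = [0.9, 0.1, 0.8, 0.05, 0.85]
dist = phoneme_distribution(attr_probs)
print(max(dist, key=dist.get))
```

The design point this illustrates is that only the attribute predictor needs training data; extending the system to a new language amounts to supplying signatures for that language's phoneme inventory.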