Subspace-based Representation and Learning for Phonotactic Spoken Language Recognition

Paper Title

Subspace-based Representation and Learning for Phonotactic Spoken Language Recognition

Authors

Hung-Shin Lee, Yu Tsao, Shyh-Kang Jeng, Hsin-Min Wang

Abstract

Phonotactic constraints can be employed to distinguish languages by representing a speech utterance as a multinomial distribution of phone events. In the present study, we propose a new learning mechanism based on subspace-based representation, which can extract concealed phonotactic structures from utterances, for language verification and dialect/accent identification. The framework mainly involves two successive parts. The first part involves subspace construction. Specifically, it decodes each utterance into a sequence of vectors filled with phone-posteriors and transforms the vector sequence into a linear orthogonal subspace based on low-rank matrix factorization or dynamic linear modeling. The second part involves subspace learning based on kernel machines, such as support vector machines and the newly developed subspace-based neural networks (SNNs). The input layer of SNNs is specifically designed for samples represented by subspaces. The topology ensures that the same output can be derived from identical subspaces by modifying the conventional feed-forward pass to fit the mathematical definition of subspace similarity. Evaluated on the "General LR" test of NIST LRE 2007, the proposed method achieved up to 52%, 46%, 56%, and 27% relative reductions in equal error rates over the sequence-based PPR-LM, PPR-VSM, and PPR-IVEC methods and the lattice-based PPR-LM method, respectively. Furthermore, on the dialect/accent identification task of NIST LRE 2009, the SNN-based system performed better than the aforementioned four baseline methods.
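
The two ideas in the abstract, subspace construction by low-rank factorization and a similarity that depends only on the subspace and not on the basis chosen for it, can be illustrated with a small sketch. This is not the authors' implementation: the use of a truncated SVD, the projection-style similarity (sum of squared cosines of the principal angles), the function names `utterance_subspace` and `projection_similarity`, the rank of 3, and the random Dirichlet "posteriors" are all assumptions made purely for illustration.

```python
import numpy as np


def utterance_subspace(posteriors: np.ndarray, rank: int) -> np.ndarray:
    """Return an orthonormal basis (num_phones x rank) for an utterance.

    posteriors has shape (num_frames, num_phones); each row is one frame's
    phone-posterior vector. A truncated SVD is one common low-rank
    factorization and is assumed here for illustration.
    """
    frames_as_columns = posteriors.T  # shape: (num_phones, num_frames)
    left_vectors, _, _ = np.linalg.svd(frames_as_columns, full_matrices=False)
    return left_vectors[:, :rank]


def projection_similarity(basis_a: np.ndarray, basis_b: np.ndarray) -> float:
    """Sum of squared cosines of the principal angles between two subspaces.

    The value depends only on the spans of the two bases, so any
    orthonormal basis of the same subspace gives the same result.
    """
    return float(np.linalg.norm(basis_a.T @ basis_b, ord="fro") ** 2)


# Hypothetical data: random posterior-like vectors over 8 phone classes.
rng = np.random.default_rng(0)
frames_a = rng.dirichlet(np.ones(8), size=50)
frames_b = rng.dirichlet(np.ones(8), size=60)

U_a = utterance_subspace(frames_a, rank=3)
U_b = utterance_subspace(frames_b, rank=3)

sim_aa = projection_similarity(U_a, U_a)  # equals the rank for identical subspaces
sim_ab = projection_similarity(U_a, U_b)  # lies between 0 and the rank

# Re-express the first subspace in a different orthonormal basis.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
sim_rotated = projection_similarity(U_a @ Q, U_b)  # matches sim_ab up to rounding
```

Under these assumptions, `sim_rotated` agrees with `sim_ab` up to rounding because the similarity is a function of the subspace alone. That is the kind of invariance the abstract requires of the SNN input layer, although the paper's actual layer definition is not given in this excerpt.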

