Context-Dependent Acoustic Modeling without Explicit Phone Clustering

Paper Title

Context-Dependent Acoustic Modeling without Explicit Phone Clustering

Authors

Tina Raissi, Eugen Beck, Ralf Schlüter, Hermann Ney

Abstract

Phoneme-based acoustic modeling for large-vocabulary automatic speech recognition takes advantage of phoneme context. The large number of context-dependent (CD) phonemes and their highly varying statistics require tying or smoothing to enable robust training. Usually, classification and regression trees are used for phonetic clustering, which is standard in hidden Markov model (HMM)-based systems. However, this solution introduces a secondary training objective and does not allow for end-to-end training. In this work, we address direct phonetic context modeling for the hybrid deep neural network (DNN)/HMM approach that does not build on any phone clustering algorithm to determine the HMM state inventory. By performing different decompositions of the joint probability of the center phoneme state and its left and right contexts, we obtain a factorized network consisting of different components, trained jointly. Moreover, the representation of the phonetic context for the network relies on phoneme embeddings. The recognition accuracy of our proposed models on the Switchboard task is comparable to, and slightly better than, that of the hybrid model using standard state-tying decision trees.
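
To make the phrase "different decompositions of the joint probability" concrete, here is a minimal sketch of two chain-rule factorizations. The notation is an assumption for illustration and does not come from the abstract: x is the acoustic input, σ_c the center phoneme state, and φ_ℓ and φ_r the left and right context phonemes. Both lines are exact identities; the abstract does not say which orderings the authors actually use.

```latex
\begin{align}
p(\phi_\ell, \sigma_c, \phi_r \mid x)
  &= p(\phi_r \mid \sigma_c, \phi_\ell, x)\, p(\sigma_c \mid \phi_\ell, x)\, p(\phi_\ell \mid x) \\
  &= p(\phi_\ell \mid \sigma_c, \phi_r, x)\, p(\phi_r \mid \sigma_c, x)\, p(\sigma_c \mid x)
\end{align}
```

Under either ordering, each factor is a distribution over a single phoneme or state inventory rather than over all triphone combinations. This is consistent with the abstract's description of a factorized network made of separate, jointly trained components.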
