Paper Title

Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks

Paper Authors

Shoukang Hu, Xurong Xie, Mingyu Cui, Jiajun Deng, Shansong Liu, Jianwei Yu, Mengzhe Geng, Xunying Liu, Helen Meng

Paper Abstract

State-of-the-art automatic speech recognition (ASR) system development is data and computation intensive. The optimal design of deep neural networks (DNNs) for these systems often requires expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of factored time delay neural networks (TDNN-Fs): i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer. These techniques include the differentiable neural architecture search (DARTS) method integrating architecture learning with lattice-free MMI training; Gumbel-Softmax and pipelined DARTS methods reducing the confusion over candidate architectures and improving the generalization of architecture selection; and Penalized DARTS incorporating resource constraints to balance the trade-off between performance and system complexity. Parameter sharing among TDNN-F architectures allows an efficient search over up to 7^28 different systems. Statistically significant word error rate (WER) reductions of up to 1.2% absolute and a relative model size reduction of 31% were obtained over a state-of-the-art 300-hour Switchboard corpus trained baseline LF-MMI TDNN-F system featuring speed perturbation, i-Vector and learning hidden unit contribution (LHUC) based speaker adaptation as well as RNNLM rescoring. Performance contrasts on the same task against recent end-to-end systems reported in the literature suggest the best NAS auto-configured system achieves state-of-the-art WERs of 9.9% and 11.1% on the NIST Hub5'00 and RT03S test sets respectively with up to 96% model size reduction. Further analysis using Bayesian learning shows that ...
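
The abstract names several search techniques without showing their mechanics. Below is a minimal, hypothetical PyTorch sketch of the core idea for one of the two searched hyper-parameters: a DARTS-style architecture distribution, sampled here with Gumbel-Softmax, over candidate bottleneck dimensions of a single factored TDNN layer. Parameter sharing is implemented by slicing one shared projection, and a Penalized DARTS-style resource term is added on the expected parameter count. The class name, candidate dimensions, penalty weight, and the stand-in loss (in place of the actual LF-MMI objective) are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only; hyper-parameters and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableBottleneck(nn.Module):
    """One TDNN-F hidden layer whose bottleneck width is searched."""
    def __init__(self, in_dim, out_dim, candidate_dims=(25, 50, 80, 100)):
        super().__init__()
        self.candidate_dims = candidate_dims
        max_dim = max(candidate_dims)
        # Shared parameters: smaller candidates reuse slices of the largest
        # projection; this sharing is what makes searching a space as large
        # as 7^28 systems tractable.
        self.down = nn.Linear(in_dim, max_dim, bias=False)  # linear bottleneck
        self.up = nn.Linear(max_dim, out_dim)
        # One architecture logit per candidate bottleneck dimension.
        self.arch_logits = nn.Parameter(torch.zeros(len(candidate_dims)))

    def forward(self, x, tau=1.0):
        # Gumbel-Softmax sample over candidates; annealing tau toward zero
        # sharpens the weights toward one-hot, reducing the confusion over
        # candidate architectures that the abstract mentions.
        w = F.gumbel_softmax(self.arch_logits, tau=tau, hard=False)
        z = self.down(x)
        # Weighted sum of the candidate bottlenecks, each a masked slice
        # of the same shared projection output.
        out = 0.0
        for wi, d in zip(w, self.candidate_dims):
            mask = torch.zeros_like(z)
            mask[..., :d] = 1.0
            out = out + wi * (z * mask)
        return self.up(out)

    def expected_bottleneck_dim(self):
        # Expected bottleneck size under the architecture distribution;
        # used as a Penalized DARTS-style resource term.
        probs = F.softmax(self.arch_logits, dim=-1)
        dims = torch.tensor(self.candidate_dims, dtype=probs.dtype)
        return (probs * dims).sum()

layer = SearchableBottleneck(in_dim=512, out_dim=512)
x = torch.randn(8, 512)
task_loss = layer(x).pow(2).mean()                 # stand-in for LF-MMI
penalty = 1e-4 * layer.expected_bottleneck_dim()   # complexity trade-off
(task_loss + penalty).backward()
```

In the paper's setting the same relaxation is applied jointly across all hidden layers (and to the splicing context offsets), which is how the factored search space grows to the 7^28 candidate systems quoted in the abstract.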
