Paper Title

Stochastic Attention Head Removal: A simple and effective method for improving Transformer Based ASR Models

Authors

Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

Abstract

Recently, Transformer based models have shown competitive automatic speech recognition (ASR) performance. One key factor in the success of these models is the multi-head attention mechanism. However, for trained models, we have previously observed that many attention matrices are close to diagonal, indicating the redundancy of the corresponding attention heads. We have also found that some architectures with reduced numbers of attention heads have better performance. Since the search for the best structure is time-prohibitive, we propose to randomly remove attention heads during training and keep all attention heads at test time, so that the final model is an ensemble of models with different architectures. The proposed method also forces each head to independently learn the most useful patterns. We apply the proposed method to train Transformer based and Convolution-augmented Transformer (Conformer) based ASR models. Our method gives consistent performance gains over strong baselines on the Wall Street Journal, AISHELL, Switchboard and AMI datasets. To the best of our knowledge, we have achieved state-of-the-art end-to-end Transformer based model performance on Switchboard and AMI.
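
The core idea, randomly dropping whole attention heads during training while keeping every head active at test time, is straightforward to express in code. The sketch below is a minimal PyTorch illustration, not the authors' implementation: the module name `StochasticHeadAttention`, the per-head removal probability `p_remove`, the decision to apply the mask to each head's context vectors, and the absence of any rescaling are all assumptions made for clarity.

```python
# Minimal sketch of stochastic attention head removal during training.
# Assumes a standard multi-head self-attention layer; removal probability,
# masking granularity and the lack of rescaling are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StochasticHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, p_remove: float = 0.5):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.p_remove = p_remove  # probability of removing each head (assumed value)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, d_head).
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))

        scores = torch.matmul(q, k.transpose(-2, -1)) / self.d_head ** 0.5
        context = torch.matmul(F.softmax(scores, dim=-1), v)  # (b, h, t, d_head)

        if self.training:
            # Randomly remove heads for this forward pass only;
            # at test time all heads are kept.
            keep = (torch.rand(self.n_heads, device=x.device) > self.p_remove).float()
            if keep.sum() == 0:  # ensure at least one head survives
                keep[torch.randint(self.n_heads, (1,), device=x.device)] = 1.0
            context = context * keep.view(1, -1, 1, 1)

        context = context.transpose(1, 2).reshape(b, t, -1)
        return self.out(context)
```

Because all heads are active at inference, the trained model effectively averages over the sub-architectures sampled during training, which matches the ensemble interpretation given in the abstract.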
