Paper Title
A Mixture of $h-1$ Heads is Better than $h$ Heads
Paper Authors
Paper Abstract
Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead "reallocate" them -- the model learns to activate different heads on different inputs. Drawing connections between multi-head attention and mixture of experts, we propose the mixture of attentive experts model (MAE). MAE is trained using a block coordinate descent algorithm that alternates between updating (1) the responsibilities of the experts and (2) their parameters. Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks. In particular, on the WMT14 English to German translation dataset, MAE improves over "transformer-base" by 0.8 BLEU, with a comparable number of parameters. Our analysis shows that our model learns to specialize different experts to different inputs.
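To make the abstract's training scheme concrete, below is a minimal, illustrative sketch of a gated mixture of experts trained by alternating between updating the gate (which produces the per-input responsibilities) and updating the expert parameters. This is not the paper's implementation: the experts here are toy linear layers rather than subsets of attention heads, the even/odd alternation is a simplified stand-in for the block coordinate descent described in the abstract, and the names (ToyMixtureOfExperts, make_batch) and the toy regression task are assumptions made purely for illustration.

```python
# Sketch only: a gated mixture of experts with alternating ("block coordinate
# descent"-style) updates of (1) the gate / responsibilities and (2) the experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMixtureOfExperts(nn.Module):
    def __init__(self, d_in, d_out, n_experts):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_experts))
        self.gate = nn.Linear(d_in, n_experts)  # produces per-input responsibilities

    def forward(self, x):
        resp = F.softmax(self.gate(x), dim=-1)                # (B, E) responsibilities
        outs = torch.stack([e(x) for e in self.experts], 1)   # (B, E, d_out)
        mixed = (resp.unsqueeze(-1) * outs).sum(dim=1)        # responsibility-weighted output
        return mixed, resp

def make_batch(batch_size=32, d_in=8):
    # Hypothetical toy regression task, used only to make the sketch runnable.
    x = torch.randn(batch_size, d_in)
    y = torch.sin(x).sum(dim=-1, keepdim=True)
    return x, y

model = ToyMixtureOfExperts(d_in=8, d_out=1, n_experts=4)
gate_opt = torch.optim.Adam(model.gate.parameters(), lr=1e-3)
expert_opt = torch.optim.Adam(model.experts.parameters(), lr=1e-3)

for step in range(200):
    x, y = make_batch()
    pred, _ = model(x)
    loss = F.mse_loss(pred, y)
    if step % 2 == 0:
        # Block 1: update only the gate (the responsibilities); experts are not stepped.
        gate_opt.zero_grad()
        loss.backward()
        gate_opt.step()
    else:
        # Block 2: update only the expert parameters; the gate is not stepped.
        expert_opt.zero_grad()
        loss.backward()
        expert_opt.step()
```

The split into two optimizers is what makes the alternation a block coordinate scheme: each phase holds one block of parameters fixed (by simply not stepping it) while improving the other, mirroring the abstract's alternation between responsibilities and expert parameters.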