Paper Title
Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation
Paper Authors
Paper Abstract
Transformer-based models have achieved state-of-the-art results in several fields of Natural Language Processing. However, their direct application to speech tasks is not trivial. The nature of these sequences carries problems such as long sequence lengths and redundancy between adjacent tokens. Therefore, we believe that the regular self-attention mechanism might not be well suited for them. Different approaches have been proposed to overcome these problems, such as the use of efficient attention mechanisms. However, these methods usually come at a cost: a performance reduction caused by information loss. In this study, we present the Multiformer, a Transformer-based model which allows the use of a different attention mechanism on each head. By doing so, the model biases the self-attention towards the extraction of more diverse token interactions, and the information loss is reduced. Finally, we perform an analysis of the head contributions and observe that architectures in which head relevance is uniformly distributed obtain better results. Our results show that mixing attention patterns across the different heads and layers outperforms our baseline by up to 0.7 BLEU.
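For illustration only, the sketch below shows the general idea of head-configurable attention described in the abstract: a multi-head self-attention block in which each head may run a different attention function. This is not the authors' implementation; the class and function names, the choice of local (windowed) attention as the efficient pattern, and all hyperparameters are assumptions made for the example.

```python
# Minimal sketch of head-configurable self-attention (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def full_attention(q, k, v):
    # Standard scaled dot-product attention over the whole sequence.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v


def local_attention(q, k, v, window=16):
    # Attention restricted to a fixed window around each position,
    # one simple example of an "efficient" pattern a head could use.
    n = q.size(-2)
    idx = torch.arange(n, device=q.device)
    mask = (idx[None, :] - idx[:, None]).abs() > window
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


class HeadConfigurableAttention(nn.Module):
    """Multi-head self-attention whose heads can use different attention functions."""

    def __init__(self, d_model, head_fns):
        super().__init__()
        self.h = len(head_fns)
        assert d_model % self.h == 0
        self.d_head = d_model // self.h
        self.head_fns = head_fns  # one attention function per head
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, n, _ = x.shape
        # Project and split into heads: (batch, heads, seq, d_head).
        q = self.q_proj(x).view(b, n, self.h, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, n, self.h, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, n, self.h, self.d_head).transpose(1, 2)
        # Each head applies its own attention mechanism.
        outs = [fn(q[:, i], k[:, i], v[:, i]) for i, fn in enumerate(self.head_fns)]
        out = torch.stack(outs, dim=1).transpose(1, 2).reshape(b, n, -1)
        return self.out_proj(out)


if __name__ == "__main__":
    # Example: two heads with full attention, two with local attention.
    attn = HeadConfigurableAttention(
        d_model=64,
        head_fns=[full_attention, full_attention, local_attention, local_attention],
    )
    x = torch.randn(2, 100, 64)
    print(attn(x).shape)  # torch.Size([2, 100, 64])
```

The design choice illustrated here is simply that the per-head attention function becomes a configuration parameter, so different mixes of patterns can be assigned across heads and layers.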