Paper Title

Learning Hard Retrieval Decoder Attention for Transformers

Paper Authors

Hongfei Xu, Qiuhui Liu, Josef van Genabith, Deyi Xiong

Paper Abstract

The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily. The multi-head attention network performs the scaled dot-product attention function in parallel, empowering the model by jointly attending to information from different representation subspaces at different positions. In this paper, we present an approach to learning a hard retrieval attention where an attention head only attends to one token in the sentence rather than all tokens. The matrix multiplication between attention probabilities and the value sequence in the standard scaled dot-product attention can thus be replaced by a simple and efficient retrieval operation. We show that our hard retrieval attention mechanism is 1.43 times faster in decoding, while preserving translation quality on a wide range of machine translation tasks when used in the decoder self- and cross-attention networks.
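
To make the retrieval operation concrete, below is a minimal inference-time sketch in PyTorch. The function name hard_retrieval_attention and the tensor shapes are illustrative assumptions, not the authors' released code; the abstract specifies only that the matrix multiplication between attention probabilities and the value sequence is replaced by selecting the single highest-scoring token per head.

```python
# A minimal sketch of hard retrieval attention at inference time,
# assuming a PyTorch-style multi-head layout. Names and shapes are
# illustrative, not taken from the authors' implementation.
import torch

def hard_retrieval_attention(q, k, v):
    """For each query, retrieve the one value vector whose key scores
    highest, replacing the probability-weighted sum over all values
    used by standard scaled dot-product attention.

    q: (batch, heads, q_len, d_k)
    k: (batch, heads, k_len, d_k)
    v: (batch, heads, k_len, d_v)
    returns: (batch, heads, q_len, d_v)
    """
    # Raw attention scores. The usual 1/sqrt(d_k) scaling is omitted:
    # argmax is invariant to positive scaling, so it changes nothing here.
    scores = torch.matmul(q, k.transpose(-2, -1))  # (batch, heads, q_len, k_len)
    # Each head attends to exactly one token rather than all tokens.
    idx = scores.argmax(dim=-1)                    # (batch, heads, q_len)
    # Gather the selected value vectors: a cheap retrieval operation in
    # place of the probabilities-times-values matrix multiplication.
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, v.size(-1))
    return torch.gather(v, dim=-2, index=idx)

# Toy usage: 2 sentences, 8 heads, 5 queries over 7 key/value tokens.
q = torch.randn(2, 8, 5, 64)
k = torch.randn(2, 8, 7, 64)
v = torch.randn(2, 8, 7, 64)
out = hard_retrieval_attention(q, k, v)            # (2, 8, 5, 64)
```

Note that this sketch covers only the forward retrieval at decoding time; how the hard, non-differentiable selection is learned during training is the subject of the paper and is not modeled here.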
