Paper Title
How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers
Paper Authors
Paper Abstract
The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones -- the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance -- an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.
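To make the core idea concrete, below is a minimal, hypothetical sketch of replacing input-dependent attention with a constant matrix obtained by averaging attention probabilities over many inputs. This is not the paper's PAPA implementation; the module and function names (SingleHeadAttention, estimate_average_attention, ConstantAttention) are illustrative assumptions, and details such as multi-head structure, masking, and batching are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SingleHeadAttention(nn.Module):
    """Ordinary input-dependent attention head (for reference)."""

    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def attn_probs(self, x):  # x: (seq_len, d_model)
        q, k = self.q_proj(x), self.k_proj(x)
        return F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)

    def forward(self, x):
        return self.out_proj(self.attn_probs(x) @ self.v_proj(x))


def estimate_average_attention(head, inputs, max_len):
    """Average the input-dependent attention probabilities over many inputs."""
    total = torch.zeros(max_len, max_len)
    count = torch.zeros(max_len, max_len)
    with torch.no_grad():
        for x in inputs:  # each x: (seq_len, d_model)
            n = x.shape[0]
            total[:n, :n] += head.attn_probs(x)
            count[:n, :n] += 1
    return total / count.clamp(min=1)  # constant, input-independent matrix


class ConstantAttention(nn.Module):
    """Same head, but the mixing weights are a fixed, pre-averaged matrix."""

    def __init__(self, head, avg_probs):
        super().__init__()
        self.v_proj, self.out_proj = head.v_proj, head.out_proj
        self.register_buffer("avg_probs", avg_probs)

    def forward(self, x):  # x: (seq_len, d_model)
        n = x.shape[0]
        probs = self.avg_probs[:n, :n]
        probs = probs / probs.sum(-1, keepdim=True)  # renormalize truncated rows
        return self.out_proj(probs @ self.v_proj(x))
```

In this sketch, the constant matrix removes all dependence on queries and keys at inference time; only the value and output projections of the original head are reused, which mirrors the abstract's notion of swapping input-dependent attention for an input-independent average.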