Paper Title

Latency Adjustable Transformer Encoder for Language Understanding

Authors

Sajjad Kachuee, Mohammad Sharifkhani

Abstract

Adjusting the latency, power, and accuracy of natural language understanding models is a desirable objective of an efficient architecture. This paper proposes an efficient Transformer architecture that adaptively adjusts the inference computational cost for a desired inference latency speedup. In the fine-tuning phase, the proposed method detects less important hidden sequence elements (word-vectors) and eliminates them in each encoder layer using a proposed Attention Context Contribution (ACC) metric. After the fine-tuning phase, the novel offline-tuning property allows the model's inference latency to be adjusted across a wide range of speedup settings without any further training. Extensive experiments reveal that most word-vectors in higher Transformer layers contribute less to subsequent layers, so removing them improves inference latency. Experimental results on various language understanding, text generation, and instruction tuning tasks and benchmarks demonstrate the approach's effectiveness across diverse datasets, with minimal impact on the input's global context. The technique improves the Time-to-First-Token (TTFT) of Llama3 by up to 2.9x with a minor performance drop. The suggested approach posits that in Large Language Models (LLMs), although the complete network is necessary for training, it can be truncated during the fine-tuning phase.
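The abstract does not spell out the ACC metric or the per-layer pruning schedule, but the mechanism it describes (scoring hidden sequence elements by their attention contribution, dropping the weakest ones in each encoder layer, and exposing a keep ratio that can be changed after fine-tuning) can be sketched roughly as below. The class name PrunedEncoderLayer, the keep_ratio parameter, and the attention-sum scoring rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (PyTorch) of attention-guided word-vector pruning in an encoder layer.
# NOTE: the scoring rule here (total attention a token receives) is a stand-in for the
# paper's ACC metric, which is not defined in the abstract.
import torch
import torch.nn as nn


class PrunedEncoderLayer(nn.Module):
    """Encoder layer that keeps only the highest-scoring tokens for later layers.

    `keep_ratio` is an inference-time knob: lowering it after fine-tuning shortens
    the sequence seen by subsequent layers (the paper's "offline-tuning" idea).
    """

    def __init__(self, d_model: int = 768, n_heads: int = 12, keep_ratio: float = 0.7):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.keep_ratio = keep_ratio  # adjustable without retraining

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention; also return head-averaged attention weights (batch, L, L).
        attn_out, attn_w = self.attn(x, x, x, need_weights=True, average_attn_weights=True)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))

        # Score each token by the total attention it receives from all queries,
        # a proxy for its contribution to the layer's output context.
        scores = attn_w.sum(dim=1)                      # (batch, seq_len)
        k = max(1, int(self.keep_ratio * x.size(1)))
        keep_idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values
        # Gather only the surviving word-vectors; later layers process a shorter sequence.
        return x.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
```

Under these assumptions, stacking such layers shrinks the hidden sequence as depth increases, which is how dropping low-contribution word-vectors in higher layers would translate into the reported TTFT speedup without further training.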
