Paper Title

BiBERT: Accurate Fully Binarized BERT

Paper Authors

Haotong Qin, Yifu Ding, Mingyuan Zhang, Qinghua Yan, Aishan Liu, Qingqing Dang, Ziwei Liu, Xianglong Liu

Paper Abstract

The large pre-trained BERT has achieved remarkable performance on Natural Language Processing (NLP) tasks but is also expensive in computation and memory. As one of the powerful compression approaches, binarization drastically reduces computation and memory consumption by utilizing 1-bit parameters and bitwise operations. Unfortunately, the full binarization of BERT (i.e., 1-bit weight, embedding, and activation) usually suffers a significant performance drop, and few studies have addressed this problem. In this paper, with theoretical justification and empirical analysis, we identify that the severe performance drop can be mainly attributed to information degradation and optimization direction mismatch in the forward and backward propagation, respectively, and propose BiBERT, an accurate fully binarized BERT, to eliminate the performance bottlenecks. Specifically, BiBERT introduces an efficient Bi-Attention structure for maximizing representation information statistically and a Direction-Matching Distillation (DMD) scheme to optimize the fully binarized BERT accurately. Extensive experiments show that BiBERT outperforms both the straightforward baseline and existing state-of-the-art quantized BERTs with ultra-low-bit activations by convincing margins on NLP benchmarks. As the first fully binarized BERT, our method yields impressive 56.3x and 31.2x savings in FLOPs and model size, demonstrating the vast advantages and potential of the fully binarized BERT model in real-world resource-constrained scenarios.
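
For readers unfamiliar with binarization, the sketch below illustrates the general idea referenced in the abstract: reducing weights to 1-bit values via a sign function with a scaling factor. This is a minimal, generic example under common assumptions (per-tensor scale alpha = mean(|W|)); it is not BiBERT's actual Bi-Attention or DMD implementation.

    import torch

    # Minimal sketch (assumption): binarize a weight tensor to {-alpha, +alpha},
    # i.e., 1-bit values scaled by alpha = mean(|W|). This is a generic scheme
    # used in many binarized networks, not BiBERT's exact method.
    def binarize(w: torch.Tensor) -> torch.Tensor:
        alpha = w.abs().mean()        # per-tensor scaling factor (assumed)
        return alpha * torch.sign(w)  # sign(w) in {-1, +1} for nonzero float weights

    # Usage: binarize a BERT-sized linear layer weight
    w = torch.randn(768, 768)
    w_bin = binarize(w)
    print(w_bin.unique().numel())     # effectively two values, +/- alpha

With 1-bit weights and activations, the dense floating-point multiply-accumulates can in principle be replaced by XNOR and popcount operations, which is where the FLOPs and model-size savings quoted in the abstract come from.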
