BiT: Robustly Binarized Multi-distilled Transformer

05/25/2022
by Zechun Liu, et al.

Modern pre-trained transformers have rapidly advanced the state of the art in machine learning, but they have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarizing the weights and activations of the network can significantly alleviate these issues, but it is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers at a much higher accuracy than was previously possible. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher-precision models into lower-precision students. These approaches allow, for the first time, fully binarized transformer models that are at a practical level of accuracy, approaching a full-precision BERT baseline on the GLUE language understanding benchmark within as little as 5.9%.
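As a rough illustration of the ideas in the abstract, the following is a minimal PyTorch sketch of an elastic binary activation with a learned scale and threshold (trained through a straight-through estimator) and a linear layer with sign-binarized weights. It is based only on the description above, not on the authors' implementation; the class names, initial parameter values, and the exact form of the scaling are assumptions.

```python
# Hypothetical sketch of elastic binary activations and binary-weight linear
# layers, assuming a learned scale alpha and threshold beta and a
# straight-through estimator (STE) for the non-differentiable rounding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ElasticBinaryActivation(nn.Module):
    """Binarizes activations to {0, alpha} with learnable alpha and beta."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))  # learned scale (assumed init)
        self.beta = nn.Parameter(torch.tensor(0.0))   # learned threshold (assumed init)

    def forward(self, x):
        # Shift by the threshold, normalize by the scale, and clip to [0, 1].
        z = torch.clamp((x - self.beta) / self.alpha, 0.0, 1.0)
        # Hard rounding to {0, 1}; the detach trick implements the STE so the
        # backward pass treats the rounding as the identity on the clipped value.
        z_bin = z + (torch.round(z) - z).detach()
        return self.alpha * z_bin


class BinaryWeightLinear(nn.Linear):
    """Linear layer whose weights are binarized to {-s, +s}, s = mean |w|."""

    def forward(self, x):
        scale = self.weight.abs().mean()
        w_bin = self.weight.sign() * scale
        # STE for the weight binarization as well.
        w = self.weight + (w_bin - self.weight).detach()
        return F.linear(x, w, self.bias)
```

The detach-based straight-through estimator is a common way to train through hard binarization: the forward pass uses the binarized values, while gradients flow as if the clipped (or full-precision) values had been used.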


Related research

- BitSplit-Net: Multi-bit Deep Neural Network with Bitwise Activation Function (03/23/2019)
- AutoTrans: Automating Transformer Design via Reinforced Architecture Search (09/04/2020)
- OPT: Open Pre-trained Transformer Language Models (05/02/2022)
- Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding (06/01/2023)
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing (06/22/2023)
- GSB: Group Superposition Binarization for Vision Transformer with Limited Training Samples (05/13/2023)
- Binary and Ternary Natural Language Generation (06/02/2023)
