Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model

05/21/2023
by Wenxi Tan, et al.

The prevalence of Transformer-based pre-trained language models (PLMs) has led to their wide adoption for various natural language processing tasks. However, their excessive overhead leads to high latency and computational costs. Static compression methods allocate the same computation to every sample, resulting in redundant computation, while dynamic token pruning methods selectively shorten the input sequences but cannot change the model size and rarely reach the speedups of static pruning. In this paper, we propose a model acceleration approach for large language models that combines dynamic token downsampling with static pruning, optimized by an information bottleneck loss. Our model, Infor-Coef, achieves an 18x FLOPs speedup with an accuracy degradation of less than 8% compared to BERT. This work provides a promising approach to compress and accelerate Transformer-based models for NLP tasks.
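To make the combination concrete, the sketch below shows one way a dynamic token downsampling layer with an information-bottleneck-style penalty could look in PyTorch. The `TokenDownsampler` module, its linear scorer, the `keep_ratio` parameter, and the `beta`-weighted compression term are illustrative assumptions for exposition, not the authors' exact Infor-Coef formulation.

```python
# Minimal sketch: score tokens, keep a per-sample subset, and trade task fit
# against a compression penalty in the spirit of the information bottleneck.
# All module and parameter names here are assumptions, not the paper's API.
import torch
import torch.nn as nn


class TokenDownsampler(nn.Module):
    """Scores tokens and keeps only the top fraction for later layers."""

    def __init__(self, hidden_size: int, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)  # per-token relevance score
        self.keep_ratio = keep_ratio

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size)
        scores = self.scorer(hidden_states).squeeze(-1)        # (batch, seq_len)
        keep_probs = torch.sigmoid(scores)                     # soft "keep" probabilities
        k = max(1, int(hidden_states.size(1) * self.keep_ratio))
        topk = keep_probs.topk(k, dim=-1).indices              # dynamic per-sample selection
        topk, _ = topk.sort(dim=-1)                            # preserve original token order
        idx = topk.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
        pruned = hidden_states.gather(1, idx)                  # shortened sequence (batch, k, hidden)
        return pruned, keep_probs


def ib_style_loss(task_loss: torch.Tensor, keep_probs: torch.Tensor, beta: float = 0.1):
    # Information-bottleneck-style trade-off: fit the task while compressing the
    # representation, approximated here by the expected fraction of kept tokens.
    compression = keep_probs.mean()
    return task_loss + beta * compression


if __name__ == "__main__":
    x = torch.randn(2, 128, 768)                  # e.g. BERT-base hidden states
    downsampler = TokenDownsampler(768, keep_ratio=0.25)
    shorter, probs = downsampler(x)               # shorter: (2, 32, 768)
    loss = ib_style_loss(torch.tensor(0.7), probs)
    print(shorter.shape, loss.item())
```

In a full pipeline, the compression term would stand in for the mutual-information penalty of the IB objective, and the downsampling would sit alongside statically pruned (smaller) Transformer layers so that both the sequence length and the model size shrink; the fixed `keep_ratio` used above is a simplification.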

Related research

03/30/2022 - TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models
Pre-trained language models have prevailed in natural language proc...

10/30/2021 - Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning
Pre-training and then fine-tuning large language models is commonly used...

10/28/2021 - Pruning Attention Heads of Transformer Models Using A* Search: A Novel Approach to Compress Big NLP Architectures
Recent years have seen a growing adoption of Transformer models such as...

06/05/2023 - NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks
Finetuning large language models inflates the costs of NLU applications...

06/28/2023 - An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs
In recent years, Transformer-based language models have become the stand...

02/10/2023 - Step by Step Loss Goes Very Far: Multi-Step Quantization for Adversarial Text Attacks
We propose a novel gradient-based attack against transformer-based langu...

05/24/2023 - SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models
Despite achieving remarkable performance on various vision-language task...
