EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference

11/28/2020
by Thierry Tambe, et al.

Transformer-based language models such as BERT provide significant accuracy improvements for a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy on resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization of multi-task NLP. EdgeBERT employs entropy-based early exit predication to perform dynamic voltage-frequency scaling (DVFS) at sentence granularity, minimizing energy consumption while adhering to a prescribed target latency. Computation and memory footprint overheads are further alleviated by a calibrated combination of adaptive attention span, selective network pruning, and floating-point quantization. Furthermore, to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 12nm scalable hardware accelerator system that integrates a fast-switching low-dropout voltage regulator (LDO), an all-digital phase-locked loop (ADPLL), and high-density embedded non-volatile memories (eNVMs) in which the sparse floating-point bit encodings of the shared multi-task parameters are carefully stored. Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system consumes up to 7x, 2.5x, and 53x less energy than conventional inference without early stopping, the latency-unbounded early exit approach, and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
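To make the abstract's central mechanism concrete, the sketch below pairs entropy-based early exit with a latency-aware DVFS choice. It is a minimal PyTorch illustration under stated assumptions, not the paper's implementation: `encoder_layers`, `exit_heads`, the entropy threshold, and the voltage-frequency table are hypothetical stand-ins, and EdgeBERT itself predicts the exit layer up front and switches operating points in hardware via its on-chip LDO and ADPLL rather than in software.

```python
import torch
import torch.nn.functional as F

def prediction_entropy(logits: torch.Tensor) -> float:
    """Shannon entropy (in nats) of the softmax over class logits.
    Assumes batch size 1, as in per-sentence edge inference."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1).item()

@torch.no_grad()
def early_exit_forward(encoder_layers, exit_heads, hidden, entropy_threshold=0.2):
    """Run encoder layers sequentially; stop at the first layer whose
    per-layer classifier (exit head) is confident enough.
    `hidden` is (batch, seq_len, dim); both module lists are hypothetical."""
    logits, depth = None, 0
    for depth, (layer, head) in enumerate(zip(encoder_layers, exit_heads), start=1):
        hidden = layer(hidden)
        logits = head(hidden[:, 0])              # classify from the [CLS] position
        if prediction_entropy(logits) < entropy_threshold:
            break                                # confident enough: exit early
    return logits, depth

def choose_vf_point(predicted_depth, target_latency_s, layer_time_s_at_fmax, vf_table):
    """Pick the slowest (lowest-voltage, hence lowest-energy) DVFS operating
    point that still finishes `predicted_depth` layers within the per-sentence
    latency target. `vf_table` holds (voltage_V, freq_ratio) pairs, with
    freq_ratio relative to the maximum frequency, sorted ascending."""
    for voltage, freq_ratio in vf_table:
        if predicted_depth * layer_time_s_at_fmax / freq_ratio <= target_latency_s:
            return voltage, freq_ratio
    return vf_table[-1]                          # fall back to the fastest point

# Illustrative table; the voltage/frequency values are made up.
VF_TABLE = [(0.55, 0.25), (0.65, 0.5), (0.8, 0.75), (1.0, 1.0)]
vdd, freq = choose_vf_point(predicted_depth=4, target_latency_s=0.05,
                            layer_time_s_at_fmax=0.005, vf_table=VF_TABLE)
```

The intuition behind running at the slowest adequate operating point is that dynamic energy scales roughly with the square of the supply voltage, so finishing a sentence just within its deadline at reduced voltage and frequency costs far less energy than racing ahead at the nominal point and idling.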


Related Research

Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures (03/25/2023)
Executing machine learning inference tasks on resource-constrained edge ...

I-BERT: Integer-only BERT Quantization (01/05/2021)
Transformer-based models, like BERT and RoBERTa, have achieved state-of-...

MIME: Adapting a Single Neural Network for Multi-task Inference with Memory-efficient Dynamic Pruning (04/11/2022)
Recent years have seen a paradigm shift towards multi-task learning. Thi...

Efficient NLP Inference at the Edge via Elastic Pipelining (07/11/2022)
Natural Language Processing (NLP) inference is seeing increasing adoptio...

EdgeTran: Co-designing Transformers for Efficient Inference on Mobile Edge Platforms (03/24/2023)
Automated design of efficient transformer models has recently attracted ...

Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing (03/04/2021)
BERT is the most recent Transformer-based model that achieves state-of-t...

GANBERT: Generative Adversarial Networks with Bidirectional Encoder Representations from Transformers for MRI to PET synthesis (08/10/2020)
Synthesizing medical images, such as PET, is a challenging task due to t...
