Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures

03/25/2023
by   Zirui Fu, et al.

Executing machine learning inference on resource-constrained edge devices requires careful hardware-software co-design. Recent work has shown that transformer-based deep neural network models such as ALBERT can enable natural language processing (NLP) inference on mobile systems-on-chip equipped with custom hardware accelerators. While these existing solutions effectively reduce the latency, energy, and area costs of running a single NLP task, multi-task inference requires computing over multiple variants of the model parameters, each tailored to one of the targeted tasks. This leads either to prohibitive on-chip memory requirements or to the cost of frequent off-chip memory accesses. This paper proposes adapter-ALBERT, an efficient model optimization for maximal data reuse across different tasks. The proposed model's performance and robustness to data compression methods are evaluated across several language tasks from the GLUE benchmark. Additionally, we demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture through simulations on a validated NLP edge accelerator, extrapolating performance, power, and area improvements over the execution of a traditional ALBERT model on the same hardware platform.
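To illustrate the data-reuse argument, the sketch below shows a generic bottleneck adapter (down-projection, nonlinearity, up-projection, residual connection) and a back-of-the-envelope parameter comparison between storing one full model per task and sharing a frozen backbone with small per-task adapters. All dimensions, parameter counts, and class names here are illustrative assumptions, not the paper's exact adapter-ALBERT architecture.

```python
import numpy as np

def relu(x):
    # Elementwise rectified linear unit.
    return np.maximum(x, 0.0)

class Adapter:
    """Generic bottleneck adapter (illustrative, not the paper's exact design):
    h -> h + ReLU(h @ W_down) @ W_up, with a small bottleneck dimension."""
    def __init__(self, hidden_dim, bottleneck_dim, rng):
        # Small random weights stand in for trained per-task parameters.
        self.w_down = rng.standard_normal((hidden_dim, bottleneck_dim)) * 0.02
        self.w_up = rng.standard_normal((bottleneck_dim, hidden_dim)) * 0.02

    def __call__(self, h):
        # Residual connection keeps the frozen backbone's activations intact.
        return h + relu(h @ self.w_down) @ self.w_up

# Assumed sizes for illustration: ALBERT-base-like shared backbone (~12M
# parameters thanks to cross-layer sharing), 12 layers, 8 target tasks.
hidden, bottleneck, layers, tasks = 768, 64, 12, 8
backbone_params = 12_000_000
adapter_params = 2 * hidden * bottleneck * layers  # per-task adapter weights

# One full parameter variant per task vs. shared backbone + per-task adapters.
multi_copy = tasks * backbone_params
adapter_way = backbone_params + tasks * adapter_params
```

Under these assumed numbers, the shared-backbone scheme stores roughly 21M parameters instead of 96M, which is the kind of reduction that makes an on-chip, heterogeneous-memory mapping plausible: the large reused backbone can sit in dense memory while the small task-specific adapters are swapped or kept in faster arrays.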


Related research:

- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference (11/28/2020). Transformer-based language models such as BERT provide significant accur...
- GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference (05/08/2020). Attention-based models have demonstrated remarkable success in various n...
- Efficient NLP Inference at the Edge via Elastic Pipelining (07/11/2022). Natural Language Processing (NLP) inference is seeing increasing adoptio...
- RCT: Resource Constrained Training for Edge AI (03/26/2021). Neural networks training on edge terminals is essential for edge AI comp...
- Streaming MANN: A Streaming-Based Inference for Energy-Efficient Memory-Augmented Neural Networks (05/21/2018). With the successful development of artificial intelligence using deep le...
- Towards Fast and Energy-Efficient Binarized Neural Network Inference on FPGA (10/04/2018). Binarized Neural Network (BNN) removes bitwidth redundancy in classical ...
- Understanding the Impact of On-chip Communication on DNN Accelerator Performance (12/03/2019). Deep Neural Networks have flourished at an unprecedented pace in recent ...
