Efficient model compression with Random Operation Access Specific Tile (ROAST) hashing

07/21/2022
by Aditya Desai, et al.

Advancements in deep learning are often associated with increasing model sizes. Model size dramatically affects the deployment cost and latency of deep models. For instance, models like BERT cannot be deployed on edge and mobile devices due to their sheer size. As a result, most advances in deep learning are yet to reach the edge. Model compression has received much-deserved attention in the literature across the natural language processing, vision, and recommendation domains. This paper proposes a model-agnostic, cache-friendly model compression approach: Random Operation Access Specific Tile (ROAST) hashing. ROAST collapses the parameters by clubbing them together through a lightweight mapping. Notably, while clubbing these parameters, ROAST exploits cache hierarchies by aligning the memory access pattern with the parameter access pattern. ROAST is up to ∼25× faster to train and ∼50× faster to infer than the popular parameter-sharing method HashedNet. Additionally, ROAST introduces global weight sharing, which is empirically and theoretically superior to the local weight sharing in HashedNet and can be of independent interest in its own right. With ROAST, we present the first compressed BERT, which is 100×-1000× smaller but does not suffer quality degradation. These compression levels on a universal architecture like the transformer are promising for the future of SOTA model deployment on resource-constrained devices such as mobile and edge devices.
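
The abstract describes the idea only at a high level, so the following is a minimal, hypothetical sketch of what it states, not the authors' implementation: one global parameter array is shared across all operations, and each operation's weights are recovered as contiguous tiles whose offsets come from a lightweight hash, keeping memory reads sequential and cache-friendly instead of scattering them element by element as in HashedNet. The names GLOBAL_PARAMS, TILE, and roast_like_lookup are illustrative assumptions, and the actual tiling and hashing details in the paper may differ.

```python
import numpy as np

# Hypothetical sketch of ROAST-style global weight sharing (not the paper's code).
# A single compressed parameter budget is shared by every operation in the model.
rng = np.random.default_rng(0)
GLOBAL_PARAMS = rng.standard_normal(2**16).astype(np.float32)  # compressed budget
TILE = 64  # illustrative tile length


def roast_like_lookup(layer_id: int, shape: tuple) -> np.ndarray:
    """Materialize a virtual weight tensor of `shape` from the shared array."""
    n = int(np.prod(shape))
    n_tiles = (n + TILE - 1) // TILE
    chunks = []
    for t in range(n_tiles):
        # Lightweight hash of (operation id, tile index) -> tile start offset.
        h = hash((layer_id, t)) % (GLOBAL_PARAMS.size - TILE)
        # Contiguous slice: the memory access pattern matches the parameter
        # access pattern, unlike per-element (HashedNet-style) hashing.
        chunks.append(GLOBAL_PARAMS[h : h + TILE])
    return np.concatenate(chunks)[:n].reshape(shape)


# Two "layers" of very different sizes drawing from the same 64K shared floats.
w_fc = roast_like_lookup(layer_id=0, shape=(512, 512))      # 262,144 virtual weights
w_emb = roast_like_lookup(layer_id=1, shape=(10_000, 64))   # 640,000 virtual weights
print(w_fc.shape, w_emb.shape, GLOBAL_PARAMS.size)
```

Because every operation draws tiles from the same array, the sharing is global across the whole model rather than local to one layer, which is the distinction the abstract draws against HashedNet.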

Related research

05/03/2022 - Efficient Fine-Tuning of BERT Models on the Edge
  Resource-constrained devices are increasingly the deployment targets of ...

07/21/2022 - The trade-offs of model size in large recommendation models: A 10000× compressed Criteo-TB DLRM model (100 GB parameters to a mere 10 MB)
  Embedding tables dominate industrial-scale recommendation model sizes, u...

05/30/2021 - NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search
  While pre-trained language models (e.g., BERT) have achieved impressive ...

04/06/2020 - MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
  Natural Language Processing (NLP) has recently achieved great success by...

06/30/2021 - Improving the Efficiency of Transformers for Resource-Constrained Devices
  Transformers provide promising accuracy and have become popular and used...

11/25/2019 - Structured Multi-Hashing for Model Compression
  Despite the success of deep neural networks (DNNs), state-of-the-art mod...

02/04/2020 - Lightweight Convolutional Representations for On-Device Natural Language Processing
  The increasing computational and memory complexities of deep neural netw...
