HashFormers: Towards Vocabulary-independent Pre-trained Transformers

10/14/2022
by Huiyin Xue, et al.

Transformer-based pre-trained language models are vocabulary-dependent, mapping by default each token to its corresponding embedding. This one-to-one mapping results in embedding matrices that occupy a lot of memory (i.e. millions of parameters) and grow linearly with the size of the vocabulary. Previous work on on-device transformers dynamically generates token embeddings on the fly, without an embedding matrix, using locality-sensitive hashing over morphological information; these embeddings are subsequently fed into transformer layers for text classification. However, these models are not pre-trained. Inspired by this line of work, we propose HashFormers, a new family of vocabulary-independent pre-trained transformers that support an unlimited vocabulary (i.e. all possible tokens in a corpus) given a substantially smaller, fixed-size embedding matrix. We achieve this by first introducing computationally cheap hashing functions that bucket individual tokens into shared embeddings. We also propose three variants that do not require an embedding matrix at all, further reducing the memory requirements. We empirically demonstrate that HashFormers are more memory efficient than standard pre-trained transformers while achieving comparable predictive performance when fine-tuned on multiple text classification tasks. For example, our most efficient HashFormer variant incurs a negligible performance degradation (0.4% on GLUE) while using only 99.1K parameters to represent the embeddings, compared to the 12.3M-38M embedding parameters of state-of-the-art models.
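To make the bucketing idea concrete, here is a minimal sketch (not the authors' implementation; the class name, the MD5-based hash, and the bucket count of 1024 are illustrative assumptions) of how an unbounded token vocabulary can be mapped onto a small, fixed-size embedding matrix via a cheap hash:

```python
import hashlib

import torch
import torch.nn as nn


class HashedEmbedding(nn.Module):
    """Illustrative hashed embedding: the matrix has a fixed number of rows
    regardless of how many distinct tokens appear in the corpus."""

    def __init__(self, num_buckets: int = 1024, dim: int = 768):
        super().__init__()
        self.num_buckets = num_buckets
        self.table = nn.Embedding(num_buckets, dim)

    def bucket(self, token: str) -> int:
        # Cheap, deterministic hash of the token string to a bucket id.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_buckets

    def forward(self, tokens: list[str]) -> torch.Tensor:
        ids = torch.tensor([self.bucket(t) for t in tokens])
        return self.table(ids)  # shape: (len(tokens), dim)


emb = HashedEmbedding(num_buckets=1024, dim=768)
vectors = emb(["hash", "##formers", "are", "memory", "efficient"])
print(vectors.shape)  # torch.Size([5, 768])
```

Because several tokens can hash to the same bucket, distinct tokens may share an embedding row; this collision is exactly what keeps the embedding matrix small and independent of the vocabulary size.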


