Towards Structured Dynamic Sparse Pre-Training of BERT

08/13/2021
by   Anastasia Dietrich, et al.

Identifying algorithms for computationally efficient unsupervised training of large language models is an important and active area of research. In this work, we develop and study a straightforward, dynamic, always-sparse pre-training approach for the BERT language modeling task, which leverages periodic compression steps based on magnitude pruning followed by random parameter re-allocation. This approach enables us to achieve Pareto improvements in terms of the number of floating-point operations (FLOPs) over statically sparse and dense models across a broad spectrum of network sizes. Furthermore, we demonstrate that training remains FLOP-efficient when using coarse-grained block sparsity, making it particularly promising for efficient execution on modern hardware accelerators.
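To make the periodic compression step concrete, the sketch below illustrates one prune-and-regrow update on a single linear layer: magnitude pruning of the lowest-magnitude active blocks, followed by random re-allocation of the freed parameters, with `block_size=1` recovering unstructured sparsity. This is a minimal, hypothetical illustration rather than the authors' released code; the function `dynamic_sparse_update`, its arguments, the L1-norm block scoring, and the assumptions that masked weights are held at zero and that the weight dimensions divide evenly by the block size are all choices made for this example.

```python
import torch


def dynamic_sparse_update(weight: torch.Tensor,
                          mask: torch.Tensor,
                          prune_fraction: float = 0.1,
                          block_size: int = 1) -> torch.Tensor:
    """Return a new binary mask with the same overall sparsity.

    weight: 2-D parameter tensor of a linear layer.
    mask:   float tensor of the same shape, 1.0 = active weight.
    prune_fraction: fraction of active blocks to prune and re-allocate.
    block_size: 1 for unstructured sparsity, >1 for square block sparsity.
    """
    rows, cols = weight.shape
    br, bc = rows // block_size, cols // block_size

    # Score each (block_size x block_size) tile by the L1 norm of its active weights.
    tiles = (weight * mask).abs().reshape(br, block_size, bc, block_size)
    scores = tiles.sum(dim=(1, 3))                                     # shape (br, bc)
    block_mask = mask.reshape(br, block_size, bc, block_size).amax(dim=(1, 3))

    active = block_mask.bool()
    # Regrowth targets only blocks that are inactive *before* this update.
    inactive_idx = (~active).flatten().nonzero(as_tuple=True)[0]
    n_drop = min(int(prune_fraction * active.sum().item()), inactive_idx.numel())
    if n_drop == 0:
        return mask

    # 1) Magnitude pruning: drop the n_drop lowest-magnitude active blocks.
    drop_scores = scores.masked_fill(~active, float('inf'))
    drop_idx = torch.topk(drop_scores.flatten(), n_drop, largest=False).indices
    block_mask.view(-1)[drop_idx] = 0.0

    # 2) Random re-allocation: activate n_drop previously inactive blocks at random.
    grow_idx = inactive_idx[torch.randperm(inactive_idx.numel())[:n_drop]]
    block_mask.view(-1)[grow_idx] = 1.0

    # Expand the block mask back to element granularity.
    new_mask = (block_mask.repeat_interleave(block_size, dim=0)
                          .repeat_interleave(block_size, dim=1))

    # Keep surviving weights, zero pruned weights, and start regrown weights at zero.
    weight.data.mul_(mask * new_mask)
    return new_mask
```

In a training loop, an update of this kind would typically be applied to each sparse layer at a fixed interval of optimizer steps, with the returned mask multiplied into the weights (or gradients) between updates so that the model stays sparse throughout pre-training.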
