Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints

12/09/2022
by Aran Komatsuzaki, et al.

Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large-scale regime. In this work, we propose sparse upcycling – a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
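
To make the core idea concrete, the sketch below shows one way an MoE layer can be initialized from a dense checkpoint's feed-forward (MLP) block: every expert starts as a copy of the pretrained dense MLP, while the router is new and randomly initialized. This is only an illustration of the upcycling initialization, not the paper's implementation (the paper works with T5 and ViT models); the PyTorch setup and names such as DenseFFN, UpcycledMoE, num_experts, and top_k are assumptions made for the example.

```python
# Illustrative sketch of sparse upcycling's initialization step (assumed PyTorch
# setup; class and argument names are hypothetical, not taken from the paper).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    """A standard transformer feed-forward block, as found in the dense checkpoint."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff)
        self.wo = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.wo(F.relu(self.wi(x)))


class UpcycledMoE(nn.Module):
    """MoE layer whose experts are all initialized as copies of one pretrained dense FFN."""
    def __init__(self, dense_ffn: DenseFFN, d_model: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Upcycling step: each expert is an exact copy of the dense MLP weights,
        # so the sunk cost of dense pretraining is reused rather than discarded.
        self.experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
        # The router is the only newly initialized component; it is trained from scratch.
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Each token is processed by its top-k experts.
        gate_probs = self.router(x).softmax(dim=-1)           # (num_tokens, num_experts)
        weights, expert_ids = gate_probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

In this style of upcycling, only some of the transformer's MLP blocks would be replaced by MoE layers built from their own weights, and training then continues from the upcycled checkpoint; since every expert initially computes the same function as the original dense block, quality starts close to the dense model rather than from scratch.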


Related research

12/29/2021 · Dense-to-Sparse Gate for Mixture-of-Experts
Mixture-of-experts (MoE) is becoming popular due to its success in impro...

01/11/2021 · Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
In deep learning, models typically reuse the same parameters for all inp...

10/19/2022 · On the Adversarial Robustness of Mixture of Experts
Adversarial robustness is a key desirable property of neural networks. I...

01/14/2022 · DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
As the training of giant dense models hits the boundary on the availabil...

05/24/2022 · Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT
We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with ...

11/21/2016 · Training Sparse Neural Networks
Deep neural networks with lots of parameters are typically used for larg...

07/11/2022 · Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts
Many recent breakthroughs in deep learning were achieved by training inc...
