MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

11/29/2022
by Trevor Gale, et al.

We present MegaBlocks, a system for efficient Mixture-of-Experts (MoE) training on GPUs. Our system is motivated by the limitations of current frameworks, which restrict the dynamic routing in MoE layers to satisfy the constraints of existing software and hardware. These formulations force a tradeoff between model quality and hardware efficiency, as users must choose between dropping tokens from the computation or wasting computation and memory on padding. To address these limitations, we reformulate MoE computation in terms of block-sparse operations and develop new block-sparse GPU kernels that efficiently handle the dynamism present in MoEs. Our approach never drops tokens and maps efficiently to modern hardware, enabling end-to-end training speedups of up to 40% over MoEs trained with the state-of-the-art Tutel library and 2.4x over DNNs trained with the highly-optimized Megatron-LM framework.
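As a rough illustration of the underlying idea, the minimal PyTorch sketch below routes every token to its top-1 expert with no capacity limit, so no tokens are dropped and no padding is added. This is not the MegaBlocks implementation or API: the function, tensor names, and sizes are hypothetical, and the per-expert Python loop stands in for what MegaBlocks computes with block-sparse GPU kernels.

```python
# Minimal sketch of "dropless" top-1 MoE routing (illustrative only, not MegaBlocks).
import torch

def dropless_moe(x, router_w, expert_w1, expert_w2):
    """x: [num_tokens, d_model]; router_w: [d_model, num_experts];
    expert_w1: [num_experts, d_model, d_ff]; expert_w2: [num_experts, d_ff, d_model]."""
    scores = torch.softmax(x @ router_w, dim=-1)           # routing probabilities
    probs, expert_ids = scores.max(dim=-1)                  # top-1 expert per token
    out = torch.zeros_like(x)
    for e in range(router_w.shape[1]):
        idx = (expert_ids == e).nonzero(as_tuple=True)[0]   # every token routed to expert e
        if idx.numel() == 0:
            continue
        h = torch.relu(x[idx] @ expert_w1[e])                # expert FFN on a variable-sized group
        out[idx] = (h @ expert_w2[e]) * probs[idx, None]     # scale by router probability
    return out

# Example usage with hypothetical sizes: 16 tokens, d_model=32, 4 experts, d_ff=64.
x = torch.randn(16, 32)
y = dropless_moe(x, torch.randn(32, 4), torch.randn(4, 32, 64), torch.randn(4, 64, 32))
print(y.shape)  # torch.Size([16, 32])
```

The ragged, expert-sized groups in this loop are the dynamism the abstract refers to; the block-sparse formulation packs them into sparse matrix multiplications so the computation maps efficiently to the GPU however the router distributes tokens.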

Related research

12/29/2021 · Dense-to-Sparse Gate for Mixture-of-Experts
Mixture-of-experts (MoE) is becoming popular due to its success in impro...

06/10/2023 · ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer
Vision Transformers (ViTs) have shown impressive performance and have be...

04/20/2022 · On the Representation Collapse of Sparse Mixture of Experts
Sparse mixture of experts provides larger model capacity while requiring...

09/19/2023 · Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
With the fast growth of parameter size, it becomes increasingly challeng...

05/04/2022 · Optimizing Mixture of Experts using Dynamic Recompilations
The Mixture of Experts architecture allows for outrageously large neural...

12/27/2019 · MoEVC: A Mixture-of-experts Voice Conversion System with Sparse Gating Mechanism for Accelerating Online Computation
With the recent advancements of deep learning technologies, the performa...

06/08/2022 · Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners
Domain generalization (DG) aims at learning generalizable models under d...
