AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers

10/14/2022
by Ganesh Jawahar, et al.

Neural architecture search (NAS) has demonstrated promising results in identifying efficient Transformer architectures that outperform manually designed ones on natural language tasks like neural machine translation (NMT). Existing NAS methods operate on a space of dense architectures, where all of the sub-architecture weights are activated for every input. Motivated by recent advances in sparsely activated models like the Mixture-of-Experts (MoE) model, we introduce sparse architectures with conditional computation into the NAS search space. Given this expressive search space, which subsumes prior densely activated architectures, we develop a new framework, AutoMoE, to search for efficient sparsely activated sub-Transformers. AutoMoE-generated sparse models obtain (i) 3x FLOPs reduction over manually designed dense Transformers and (ii) 23% FLOPs reduction over state-of-the-art NAS-generated dense sub-Transformers, with parity in BLEU score on benchmark datasets for NMT. AutoMoE consists of three training phases: (a) heterogeneous search space design with dense and sparsely activated Transformer modules (e.g., how many experts? where to place them? what should be their sizes?); (b) SuperNet training, which jointly trains several subnetworks sampled from the large search space by weight sharing; (c) evolutionary search for the architecture with the optimal trade-off between task performance and computational constraints like FLOPs and latency. AutoMoE code, data, and trained models are available at https://github.com/microsoft/AutoMoE.
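
The three phases can be illustrated with a minimal, self-contained sketch below. All names and numbers here are hypothetical and greatly simplified stand-ins for the actual implementation in the linked repository: a heterogeneous search space over expert counts, placements, and sizes; a placeholder for SuperNet-based scoring; and a FLOPs-constrained evolutionary loop.

```python
# Minimal sketch of AutoMoE's three phases (hypothetical names, simplified
# logic; see https://github.com/microsoft/AutoMoE for the real code).
import random

# (a) Heterogeneous search space: each layer can be dense (1 expert) or
# sparsely activated (several experts), with its own expert FFN size.
SEARCH_SPACE = {
    "num_layers": [3, 4, 5, 6],
    "num_experts": [1, 2, 4],          # 1 expert == plain dense FFN layer
    "expert_ffn_dim": [1024, 2048, 3072],
    "embed_dim": [512, 640],
}

def sample_subtransformer(rng: random.Random) -> dict:
    """Sample one sub-Transformer configuration from the search space."""
    num_layers = rng.choice(SEARCH_SPACE["num_layers"])
    return {
        "embed_dim": rng.choice(SEARCH_SPACE["embed_dim"]),
        "layers": [
            {
                "num_experts": rng.choice(SEARCH_SPACE["num_experts"]),
                "expert_ffn_dim": rng.choice(SEARCH_SPACE["expert_ffn_dim"]),
            }
            for _ in range(num_layers)
        ],
    }

# (b) SuperNet training (omitted): sample a configuration per step and train
# only its weight-shared slice of the SuperNet, so every sub-Transformer can
# later be scored without training from scratch.

def flops_proxy(cfg: dict) -> float:
    """Crude FLOPs proxy: only the top-1 routed expert is active per token."""
    d = cfg["embed_dim"]
    return sum(2 * d * layer["expert_ffn_dim"] for layer in cfg["layers"])

def fitness(cfg: dict) -> float:
    """Placeholder for the SuperNet-based BLEU estimate of a configuration."""
    return -flops_proxy(cfg)  # stand-in objective: prefer cheaper models

def mutate(parent: dict, rng: random.Random) -> dict:
    """Resample one randomly chosen layer of the parent configuration."""
    child = {"embed_dim": parent["embed_dim"],
             "layers": [dict(layer) for layer in parent["layers"]]}
    idx = rng.randrange(len(child["layers"]))
    child["layers"][idx] = {
        "num_experts": rng.choice(SEARCH_SPACE["num_experts"]),
        "expert_ffn_dim": rng.choice(SEARCH_SPACE["expert_ffn_dim"]),
    }
    return child

# (c) Evolutionary search under a hard computational constraint.
def evolutionary_search(iters: int = 20, pop_size: int = 8,
                        flops_budget: float = 8e6, seed: int = 0) -> dict:
    rng = random.Random(seed)
    population = [sample_subtransformer(rng) for _ in range(pop_size)]
    for _ in range(iters):
        parent = max(population, key=fitness)
        child = mutate(parent, rng)
        if flops_proxy(child) <= flops_budget:   # respect the FLOPs budget
            population.append(child)
        population = sorted(population, key=fitness, reverse=True)[:pop_size]
    return max(population, key=fitness)

if __name__ == "__main__":
    best = evolutionary_search()
    print(best["embed_dim"], [layer["num_experts"] for layer in best["layers"]])
```

In the sketch, the fitness function is only a stand-in; in the actual framework the SuperNet provides a fast task-performance estimate for each sampled sub-Transformer, so the evolutionary loop can trade BLEU against FLOPs or latency without retraining.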

