Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts

06/08/2023
by Ganesh Jawahar, et al.

Weight-sharing supernets have become a vital component for performance estimation in state-of-the-art (SOTA) neural architecture search (NAS) frameworks. Although a supernet can directly generate different subnetworks without retraining, there is no guarantee of the quality of these subnetworks because of weight sharing. In NLP tasks such as machine translation and pre-trained language modeling, we observe that, given the same model architecture, there is a large performance gap between the supernet and training from scratch. Hence, the supernet cannot be used directly, and retraining is necessary after finding the optimal architectures. In this work, we propose mixture-of-supernets, a generalized supernet formulation in which mixture-of-experts (MoE) is adopted to enhance the expressive power of the supernet model with negligible training overhead. In this way, different subnetworks do not share the model weights directly but through an architecture-based routing mechanism. As a result, the model weights of different subnetworks are customized for their specific architectures, and the weight generation is learned by gradient descent. Compared to existing weight-sharing supernets for NLP, our method minimizes the retraining time, greatly improving training efficiency. In addition, the proposed method achieves SOTA performance in NAS for building fast machine translation models, yielding a better latency-BLEU tradeoff than HAT, the state-of-the-art NAS framework for MT. We also achieve SOTA performance in NAS for building memory-efficient, task-agnostic BERT models, outperforming NAS-BERT and AutoDistil across various model sizes.
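
To illustrate the idea of architecture-routed weight generation, the following is a minimal PyTorch-style sketch, not the authors' code: the class name MoSLinear, the architecture encoding, the number of experts, and the dimension-slicing scheme are illustrative assumptions. It shows a linear layer whose effective weights are a convex combination of expert weights, with mixing coefficients produced by a small router conditioned on an encoding of the sampled architecture, so each subnetwork receives weights customized to its own architecture while the whole construction remains differentiable.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoSLinear(nn.Module):
    """Hypothetical sketch of an architecture-routed MoE linear layer."""
    def __init__(self, max_in, max_out, num_experts=2, arch_dim=8):
        super().__init__()
        # One full-size weight/bias per expert; subnetworks use slices of them.
        self.expert_weights = nn.Parameter(torch.randn(num_experts, max_out, max_in) * 0.02)
        self.expert_biases = nn.Parameter(torch.zeros(num_experts, max_out))
        # Router maps an architecture encoding to expert mixing coefficients.
        self.router = nn.Linear(arch_dim, num_experts)

    def forward(self, x, arch_encoding, in_dim, out_dim):
        # Mixing coefficients depend on the architecture, not on the input tokens,
        # so every subnetwork gets its own customized weights.
        alpha = F.softmax(self.router(arch_encoding), dim=-1)           # (num_experts,)
        weight = torch.einsum('e,eoi->oi', alpha, self.expert_weights)  # (max_out, max_in)
        bias = torch.einsum('e,eo->o', alpha, self.expert_biases)       # (max_out,)
        # Slice out the subnetwork's dimensions, as in standard weight-sharing supernets.
        return F.linear(x, weight[:out_dim, :in_dim], bias[:out_dim])

# Usage: sample a subnetwork, encode its architecture, and run the routed layer.
layer = MoSLinear(max_in=512, max_out=512)
x = torch.randn(4, 16, 384)            # inputs for a subnetwork with hidden size 384
arch_encoding = torch.randn(8)         # placeholder encoding of the sampled architecture
y = layer(x, arch_encoding, in_dim=384, out_dim=256)
print(y.shape)                         # torch.Size([4, 16, 256])

Because the router sees only the architecture encoding, gradient descent on the standard task loss jointly trains the expert weights and the routing, which is how the weight generation described above can be learned end to end.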


Related research

03/29/2020 - Disturbance-immune Weight Sharing for Neural Architecture Search
  Neural architecture search (NAS) has gained increasing attention in the ...

10/14/2022 - AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers
  Neural architecture search (NAS) has demonstrated promising results on i...

10/06/2019 - Improving One-shot NAS by Suppressing the Posterior Fading
  There is a growing interest in automated neural architecture search (NAS...

06/21/2023 - Balanced Mixture of SuperNets for Learning the CNN Pooling Architecture
  Downsampling layers, including pooling and strided convolutions, are cru...

02/16/2021 - AlphaNet: Improved Training of Supernet with Alpha-Divergence
  Weight-sharing neural architecture search (NAS) is an effective techniqu...

10/08/2021 - Speeding up Deep Model Training by Sharing Weights and Then Unsharing
  We propose a simple and efficient approach for training the BERT model. ...

02/08/2017 - Exploiting Domain Knowledge via Grouped Weight Sharing with Application to Text Categorization
  A fundamental advantage of neural models for NLP is their ability to lea...
