COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search

06/05/2023
by Shibal Ibrahim, et al.

The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up model capacity in various domains, such as natural language processing and vision. Sparse-MoEs select a subset of the "experts" (thus, only a portion of the overall network) for each input sample using a sparse, trainable gate. Existing sparse gates are prone to convergence and performance issues when trained with first-order optimization methods. In this paper, we introduce two improvements to current MoE approaches. First, we propose a new sparse gate: COMET, which relies on a novel tree-based mechanism. COMET is differentiable, can exploit sparsity to speed up computation, and outperforms state-of-the-art gates. Second, due to the challenging combinatorial nature of sparse expert selection, first-order methods are typically prone to low-quality solutions. To deal with this challenge, we propose a novel, permutation-based local search method that can complement first-order methods in training any sparse gate, e.g., Hash routing, Top-k, DSelect-k, and COMET. We show that local search can help networks escape bad initializations or solutions. We perform large-scale experiments on various domains, including recommender systems, vision, and natural language processing. On standard vision and recommender systems benchmarks, COMET+ (COMET with local search) achieves up to 13% improvement in ROC AUC over popular gates, e.g., Hash routing and Top-k, and up to 9% over prior differentiable gates, e.g., DSelect-k. When Top-k and Hash gates are combined with local search, we see up to 100× reduction in the budget needed for hyperparameter tuning. Moreover, for language modeling, our approach improves over the state-of-the-art MoEBERT model for distilling BERT on 5/7 GLUE benchmarks as well as the SQuAD dataset.
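
As context for the gating mechanism described above, the following is a minimal, generic Top-k sparse-MoE layer in PyTorch. It is an illustrative sketch of the standard Top-k gate that the paper compares against, not the COMET tree-based gate or the permutation-based local search; the class names (TopKGate, SparseMoE) and the dimensions are assumptions chosen for illustration.

# Minimal, generic Top-k sparse-MoE sketch (illustration only; not the COMET gate).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKGate(nn.Module):
    """Selects k experts per input with a trainable linear router."""

    def __init__(self, d_in: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_in, n_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)                             # (batch, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)   # keep only k experts
        # Sparse gate: softmax over the selected logits, zeros elsewhere.
        weights = torch.zeros_like(logits)
        weights.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
        return weights                                      # sparse mixture weights


class SparseMoE(nn.Module):
    """Combines expert outputs using the sparse gate weights."""

    def __init__(self, d_in: int, d_out: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.gate = TopKGate(d_in, n_experts, k)
        self.experts = nn.ModuleList([nn.Linear(d_in, d_out) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.gate(x)                                                  # (batch, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)     # (batch, n_experts, d_out)
        return (w.unsqueeze(-1) * expert_out).sum(dim=1)                  # weighted sum over experts


if __name__ == "__main__":
    moe = SparseMoE(d_in=16, d_out=4)
    y = moe(torch.randn(32, 16))
    print(y.shape)  # torch.Size([32, 4])

Note that this sketch evaluates every expert densely for clarity; a practical Sparse-MoE implementation dispatches each input only to its k selected experts, which is where the computational savings come from.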


