Soft Merging of Experts with Adaptive Routing

06/06/2023
by Mohammed Muqeeth et al.

Sparsely activated neural networks with conditional computation learn to route their inputs through different "expert" subnetworks, providing a form of modularity that densely activated models lack. Despite their possible benefits, models with learned routing often underperform their parameter-matched densely activated counterparts as well as models that use non-learned heuristic routing strategies. In this paper, we hypothesize that these shortcomings stem from the gradient estimation techniques used to train sparsely activated models that use non-differentiable discrete routing decisions. To address this issue, we introduce Soft Merging of Experts with Adaptive Routing (SMEAR), which avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters. By routing activations through a single merged expert, SMEAR does not incur a significant increase in computational costs and enables standard gradient-based training. We empirically validate that models using SMEAR outperform models that route based on metadata or learn sparse routing through gradient estimation. Furthermore, we provide qualitative analysis demonstrating that the experts learned via SMEAR exhibit a significant amount of specialization. All of the code used in our experiments is publicly available.
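
The core idea, routing activations through a single expert whose parameters are a probability-weighted average of all experts' parameters, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch implementation, not the authors' released code: it assumes two-layer feed-forward experts and a per-example router that pools over the sequence, and the names (SMEARLayer, d_model, d_hidden, num_experts) are illustrative.

```python
# Minimal sketch of a SMEAR-style layer (hypothetical; names and shapes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMEARLayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        # Each expert is a two-layer feed-forward block; parameters are stacked
        # along a leading expert dimension so they can be averaged in one einsum.
        self.w1 = nn.Parameter(torch.randn(num_experts, d_model, d_hidden) * 0.02)
        self.b1 = nn.Parameter(torch.zeros(num_experts, d_hidden))
        self.w2 = nn.Parameter(torch.randn(num_experts, d_hidden, d_model) * 0.02)
        self.b2 = nn.Parameter(torch.zeros(num_experts, d_model))
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Routing probabilities are computed from the
        # mean token representation of each example (an assumption for brevity).
        probs = F.softmax(self.router(x.mean(dim=1)), dim=-1)  # (batch, num_experts)

        # Build one "merged" expert per example as a weighted average of all
        # expert parameters. Gradients flow through the softmax weights, so no
        # discrete routing decision or gradient estimator is needed.
        w1 = torch.einsum("be,edh->bdh", probs, self.w1)
        b1 = torch.einsum("be,eh->bh", probs, self.b1)
        w2 = torch.einsum("be,ehd->bhd", probs, self.w2)
        b2 = torch.einsum("be,ed->bd", probs, self.b2)

        # Run the input through the single merged expert
        # (roughly the compute cost of one expert, not all of them).
        h = F.relu(torch.einsum("bsd,bdh->bsh", x, w1) + b1.unsqueeze(1))
        return torch.einsum("bsh,bhd->bsd", h, w2) + b2.unsqueeze(1)
```

Because the merge happens in parameter space before the forward pass, the per-example compute matches that of a single expert, while the softmax keeps routing fully differentiable and trainable with standard gradient descent, which is the property the abstract highlights.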

Related research

- StableMoE: Stable Routing Strategy for Mixture of Experts (04/18/2022)
- Soft Conditional Computation (04/10/2019)
- Taming Sparsely Activated Transformer with Stochastic Experts (10/08/2021)
- Exploring Sparse Expert Models and Beyond (05/31/2021)
- Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks (06/07/2023)
- BASE Layers: Simplifying Training of Large, Sparse Models (03/30/2021)
- SkipNet: Learning Dynamic Routing in Convolutional Networks (11/26/2017)
