SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

05/07/2021
by   Zhao You, et al.

Recently, Mixture of Experts (MoE) based Transformers have shown promising results in many domains, largely due to two advantages of this architecture: first, an MoE based Transformer can increase model capacity without increasing computational cost at either training or inference time; second, it is a dynamic network that can adapt to the varying complexity of input instances in real-world applications. In this work, we explore an MoE based model for speech recognition, named SpeechMoE. To further control the sparsity of router activations and improve the diversity of gate values, we propose a sparsity L1 loss and a mean importance loss, respectively. In addition, SpeechMoE uses a new router architecture that simultaneously exploits information from a shared embedding network and the hierarchical representations of different MoE layers. Experimental results show that SpeechMoE achieves lower character error rate (CER) than traditional static networks at comparable computation cost, providing 7.0...
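The abstract mentions three mechanisms: a router that selects experts per input, a sparsity L1 loss on router activations, and a mean importance loss that balances gate values across experts. The sketch below is a minimal PyTorch illustration of how such a layer could be wired up, assuming top-1 routing and a router conditioned on the layer input concatenated with a shared embedding; the class name MoELayer and the exact forms of the two auxiliary losses are assumptions for illustration, not the paper's released code.

```python
# Hypothetical sketch of a SpeechMoE-style layer (assumed design, not the paper's code):
# a top-1 routed MoE block whose router sees both the previous layer's output and a
# shared embedding, plus two auxiliary losses in the spirit of the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_embed: int, n_experts: int, d_ff: int):
        super().__init__()
        # Router conditions on the layer input concatenated with the shared embedding.
        self.router = nn.Linear(d_model + d_embed, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x, shared_emb):
        # x: (batch, d_model), shared_emb: (batch, d_embed)
        gates = F.softmax(self.router(torch.cat([x, shared_emb], dim=-1)), dim=-1)

        # Top-1 routing: each frame is processed only by its highest-scoring expert.
        top1 = gates.argmax(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                out[mask] = gates[mask, e].unsqueeze(-1) * expert(x[mask])

        # Sparsity L1 loss (assumed form): L1 norm of each unit-normalized gate vector,
        # which is smallest when the router activation is peaky / near one-hot.
        l_sparsity = (gates / gates.norm(dim=-1, keepdim=True).clamp_min(1e-9)).sum(-1).mean()

        # Mean importance loss (assumed form): penalize uneven per-expert mean gate
        # values so that, on average, all experts receive comparable traffic.
        importance = gates.mean(dim=0)                      # (n_experts,)
        l_importance = gates.size(-1) * (importance ** 2).sum()

        return out, l_sparsity, l_importance
```

In a training loop these auxiliary terms would typically be added to the main recognition loss with small weights, e.g. loss = ce_loss + alpha * l_sparsity + beta * l_importance, with alpha and beta tuned on a development set; the weighting scheme here is illustrative rather than taken from the paper.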


Related research:

11/23/2021  SpeechMoE2: Mixture-of-Experts Model with Improved Routing
Mixture-of-experts based acoustic models with dynamic routing mechanisms...

10/08/2021  Taming Sparsely Activated Transformer with Stochastic Experts
Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can ...

03/14/2023  I3D: Transformer architectures with input-dependent dynamic depth for speech recognition
Transformer-based end-to-end speech recognition has achieved great succe...

06/05/2018  Deep Mixture of Experts via Shallow Embedding
Larger networks generally have greater representational power at the cos...

07/12/2023  Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition
Multilingual speech recognition for both monolingual and code-switching ...

09/21/2021  DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers
Dynamic networks have shown their promising capability in reducing theor...

06/28/2021  Complexity-based partitioning of CSFI problem instances with Transformers
In this paper, we propose a two-steps approach to partition instances of...
