Scaling End-to-End Models for Large-Scale Multilingual ASR

04/30/2021
by Bo Li, et al.

Building ASR models across many language families is a challenging multi-task learning problem due to large language variations and heavily unbalanced data. Existing work has shown positive transfer from high-resource to low-resource languages. However, degradation on high-resource languages is commonly observed due to interference from the heterogeneous multilingual data and the reduction in per-language capacity. We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.7K to 54.7K hours. We adopt GShard [1] to efficiently scale up to 10B parameters. Empirically, we find that (1) scaling the number of model parameters is an effective way to solve the capacity bottleneck - our 500M-param model is already better than monolingual baselines, and scaling it to 1B and 10B brings further quality gains; (2) larger models are not only more data efficient, but also more efficient in terms of training cost as measured in TPU days - the 1B-param model reaches the same accuracy at 34% of the training time of the 500M-param model; (3) given a fixed capacity budget, adding depth usually works better than adding width, and large encoders tend to do better than large decoders.
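To make the depth-versus-width trade-off concrete, here is a minimal Python sketch that estimates encoder parameter counts for a generic Transformer-style layer. The layer structure, the ignored bias and normalization terms, and the example dimensions are illustrative assumptions, not the Conformer configurations used in the paper.

    # Rough parameter-count estimate for a Transformer-style encoder layer.
    # The layer structure (self-attention + feed-forward) and the example
    # dimensions below are illustrative assumptions, not the paper's configs.

    def layer_params(d_model: int, ffn_mult: int = 4) -> int:
        """Approximate parameters of one encoder layer (biases ignored)."""
        attn = 4 * d_model * d_model              # Q, K, V and output projections
        ffn = 2 * d_model * (ffn_mult * d_model)  # two feed-forward projections
        return attn + ffn

    def encoder_params(num_layers: int, d_model: int) -> int:
        """Total parameters of a stack of identical encoder layers."""
        return num_layers * layer_params(d_model)

    if __name__ == "__main__":
        # Two hypothetical ways to spend a similar ~0.6B-parameter budget:
        deep_narrow = encoder_params(num_layers=48, d_model=1024)
        shallow_wide = encoder_params(num_layers=24, d_model=1448)
        print(f"deep/narrow  (48 x 1024): {deep_narrow / 1e9:.2f}B params")
        print(f"shallow/wide (24 x 1448): {shallow_wide / 1e9:.2f}B params")

Under this kind of accounting, the deep/narrow and shallow/wide configurations land at roughly the same parameter budget, which is the fixed-capacity setting in which the paper reports depth helping more than width.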
