. It brings renewed interest in building universal automatic speech recognition (ASR) systems that can recognize speech from any language. Traditionally, building multilingual models required a predefined language agnostic representation, such as a global phoneme set like the International Phonetic Alphabet (IPA) [international1999handbook], Speech Assessment Methods Phonetic Alphabet (SAMPA) [wells1995computer] and Worldbet [hieronymus1993ascii], or a universal speech representation like articulations [frankel2001asr], all of which require expert knowledge. In the past decade, deep neural networks have been widely adopted in the speech community [hinton2012deep], especially the recently developed end-to-end (E2E) models that merge the modeling of acoustics, lexicon and language into a single neural network [li2020comparison, Ryan19, KimHoriWatanabe17, zhang2020transformer, saon2021advancing]. This largely simplifies the building of multilingual models by learning shared representations directly from data. It also speeds up the sharing of techniques between ASR and other machine learning fields such as neural machine translation [arivazhagan2019massively]. Experiments on fewer than 10 languages have shown promising capabilities of such E2E models in modeling dialects of a particular language [li2018multi], languages from the same family [kannan2019large] and languages from different families [li2019bytes, pratap2020massively, hou2020large, adams2019massively]. Recently, massively multilingual experiments using more than 50 languages [adams2019massively, pratap2020massively, hou2020large] have also shown comparable or better performance compared to monolingual systems.
A major focus of multilingual ASR has been improving performance on low resource languages, which benefit from the data pooling of similar languages, the cross language joint optimization and the consequent positive transfer from higher resource languages [adams2019massively, pratap2020massively, hou2020large, zhou2018multilingual, chuangsuwanich2016multilingual]. However, high resource languages, for which sufficient monolingual data exists, suffer from interference and constrained capacity [pratap2020massively, wang2020balancing], and often see a degradation in performance. Improving performance across the board on both high and low resource languages is an under-studied and challenging task.
From a machine learning perspective, a statistical model’s generalization capability builds on the inductive bias of the learning algorithm [mitchell1980need]. The underlying inductive bias for multilingual systems is that the learning signal from one language should benefit the quality of other languages [caruana1997multitask]. Under this assumption, the model will generalize better with an increasing number of languages due to the additional information brought by the new languages. This positive transfer is best observed for low resource languages. However, as we increase the number of languages, the modeling task becomes more challenging due to the large language variations and heavy data imbalance [wang2020balancing]. With a fixed model capacity, which is loosely measured in terms of the number of free parameters in neural networks, the positive/negative transfer boundary becomes salient, and high resource languages start to regress due to task interference and the reduction in per-task capacity. In [wang2020gradient], a simple and scalable optimization procedure, namely Gradient Vaccine, is developed to address the gradient interference problem from different tasks. Distilling knowledge from single task models to the multi-task model [li2020knowledge] has also been found to address the interference problem.
In this paper, we investigate the high resource language performance degradation problem of multilingual models from a capacity perspective [arivazhagan2019massively, lepikhin2020gshard]. Prior work explored as many as 50 [pratap2020massively] to 100 [adams2019massively] languages. However, the scale of the datasets is very limited: the largest language used in [pratap2020massively] has just over 1K hours of speech. In our study, the amount of data per language ranges from 7.7K to 54.7K hours, which leads to high quality monolingual baselines and makes it challenging to build a single multilingual model that can outperform them. We present a capacity solution with thorough empirical studies demonstrating how it is devised, unlike [pratap2020massively], which provides no details on how its best 1B model was developed. We adopt the GShard [lepikhin2020gshard] technique to efficiently scale our model up to 10B parameters, which yields further, though relatively small, word error rate (WER) reductions. With the increase of model capacity, we manage to recover the performance of monolingual models on all the high resource languages. We ablate the various factors in increasing the model capacity and find that depth generally does better than width and that encoder capacity correlates well with recognition performance. We observe that with a fixed capacity, how the language information is fed becomes less important. Moreover, large models are more sample efficient [kaplan2020scaling] and cost efficient, requiring fewer training iterations and less TPU time to reach similar performance.
2 Multilingual E2E Models
2.1 Model Architecture
Our multilingual ASR system is an attention-based encoder-decoder model. For the encoder, we use full-context Conformer layers [gulati2020conformer]. It consists of an input projection layer and a relative position embedding layer, followed by a stack of Conformer layers organized into three blocks. The first Conformer block consists of 4 Conformer layers followed by a time stacking layer that concatenates the current output with one frame on its left. This doubles the output dimension while achieving a 2X time reduction. The second block consists of a single Conformer layer and a projection layer which halves the feature dimension and brings it back to the same dimension as the other layers. The remaining Conformer layers make up the third block. Similar to [bo21better], we use the existing convolution module to provide relative positional information and group normalization to address variations across languages in each Conformer layer.
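The time stacking and projection steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual Lingvo layer: the function names `time_stack` and `project` are hypothetical, the pairing of frames (each odd-indexed frame with its left neighbour) and the dropping of a trailing unpaired frame are assumptions, and the projection weights are placeholders rather than learned values.

```python
from typing import List

Frame = List[float]


def time_stack(frames: List[Frame]) -> List[Frame]:
    """Concatenate each frame with its left neighbour, keeping every second result.

    Input:  T frames of dimension D.
    Output: T // 2 frames of dimension 2 * D (a 2X time reduction).
    Assumption: a trailing unpaired frame (odd T) is dropped.
    """
    stacked = []
    for t in range(1, len(frames), 2):
        # [left neighbour ; current frame]
        stacked.append(frames[t - 1] + frames[t])
    return stacked


def project(frame: Frame, weights: List[List[float]]) -> Frame:
    """Linear projection y = W x, used here to halve the stacked dimension."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


# Toy example: 4 frames of dimension 2 become 2 frames of dimension 4.
toy_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
toy_stacked = time_stack(toy_frames)

# Placeholder 2x4 projection that averages the two stacked halves.
toy_weights = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
toy_projected = [project(f, toy_weights) for f in toy_stacked]
```

The stacked frames have twice the feature dimension and half the sequence length; the projection in the second block then restores the original dimension so that the third block operates at the same width as the first.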
We experiment with two types of decoders. Firstly, a unidirectional Long Short-Term Memory (LSTM)[Hasim14] based decoder is used together with a 4-head additive content-based attention to form a Listen, Attend and Spell (LAS) model [Chan15]. Secondly, we also adopt a Transformer decoder with masked self attention and cross attention to the encoder outputs [Vaswani17, zhang2020transformer].
The output vocabulary of our multilingual ASR is a unified grapheme set with 3,328 tokens; among those, 3,315 tokens appear at least 1,000 times in the training data and the remaining ones are special tokens and padded placeholders. The majority of the graphemes (3,055) come from Chinese; even so, Chinese is the only language that has an out-of-vocabulary (OOV) grapheme problem, due to the selection threshold and the coverage of the training data. We feed language information via a one-hot embedding vector into the encoder, either as an additional input or through language adapters (c.f. Section 4.2).
2.2 Scaling Up Model Capacity
There are multiple ways to scale up an encoder-decoder based multilingual model. In this work, we empirically study the effect of the following factors: (1) width vs. depth; (2) encoder vs. decoder; (3) language dependent capacity vs. language independent capacity; (4) architecture vs. capacity. Strictly speaking, model capacity is not equivalent to the number of parameters, i.e. the model size. For models with built-in language dependent components, such as adapter models [li2020knowledge], the inference capacity is smaller than the training capacity, because during inference only the shared parameters and those corresponding to a specific language are active. To simplify the discussion, we look at the training model capacity and use model size and capacity interchangeably.
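The distinction between training capacity and inference capacity for a model with language dependent components can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the parameter counts (150M shared parameters, 5M per language adapter set) are hypothetical values chosen for the example, not figures reported in this paper.

```python
def adapter_model_sizes(shared_params: int,
                        adapter_params_per_language: int,
                        num_languages: int) -> tuple:
    """Return (training_size, inference_size) in number of parameters.

    Training touches the shared parameters plus the adapters of every language.
    Inference for one utterance only activates the shared parameters plus the
    adapters of that utterance's language.
    """
    training = shared_params + num_languages * adapter_params_per_language
    inference = shared_params + adapter_params_per_language
    return training, inference


# Hypothetical numbers, for illustration only.
train_size, infer_size = adapter_model_sizes(
    shared_params=150_000_000,
    adapter_params_per_language=5_000_000,
    num_languages=15,
)
print(f"training capacity:  {train_size / 1e6:.0f}M parameters")
print(f"inference capacity: {infer_size / 1e6:.0f}M parameters")
```

For a model without language dependent components the two numbers coincide, which is why model size and capacity can be used interchangeably in the rest of the discussion.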
Scaling up models comes with various practical challenges, including model parallelism support, computation cost and infrastructure support. Recently, the GShard annotation API has been developed for parallel execution [lepikhin2020gshard] and released in Lingvo [shen2019lingvo], which makes building giant models simpler. Additionally, a new compiler infrastructure, namely the Single Program Multiple Data (SPMD) partitioner, has been developed, which makes the compilation time near constant regardless of the number of partitions [lepikhin2020gshard]. This allows us to scale more efficiently to thousands of partitions. We hence adopt these advances to scale up our multilingual models to 1B parameters and beyond.
3 Experimental Details
Experiments are conducted on a dataset of 15 languages from 9 distinct language families. In total, there are 235.4M utterances, which correspond to 364.9K hours of speech data from Google’s Voice Search traffic. This is more than 20 times the data used in [pratap2020massively]. To the best of our knowledge, this is the first work looking at building multilingual ASR systems at such a large scale. The data is anonymized and human transcribed. Per language data statistics are listed in Table 1. The number of utterances for each language ranges from 4.7M to 35.3M, roughly corresponding to 7.7K to 54.7K hours of speech data. The unbalanced data distribution poses a modeling challenge. Unlike other existing multilingual work, we focus on investigating the interference problem between high resource languages. The smallest language in our setup has around 7.7K hours of training data, which is about 7 times the amount of the highest resource language used in [pratap2020massively]. This large scale dataset in turn brings training efficiency challenges. The test set for each language contains around 319K utterances sampled from Google’s Voice Search traffic with no overlap with the training set. Similarly, they are anonymized and hand-transcribed for evaluation purposes.
We use 80D log Mel features that are computed using 32ms windows with a 10ms hop. Features from 3 contiguous frames are stacked and subsampled to form a 240D input representation with 30ms frame rate. A 16D one-hot language vector is fed into the encoder as an additional input. SpecAugment [Park_2019] is used to improve models’ robustness against noise. Specifically, two frequency masks with a maximum length of 27 and two time masks with a maximum length of 50 are used.
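The input pipeline described above (80D frames at a 10ms hop, stacked in groups of three and subsampled, plus a 16D one-hot language vector) can be sketched as follows. This is a minimal illustration, not the production front end: the function names are hypothetical, the choice of non-overlapping groups of three frames and the dropping of leftover frames are assumptions, and appending the language vector to every stacked frame is one plausible reading of "fed into the encoder as an additional input".

```python
from typing import List

Frame = List[float]


def stack_and_subsample(frames: List[Frame], stack: int = 3) -> List[Frame]:
    """Stack `stack` contiguous frames and advance by `stack` frames each time.

    With 80D frames at a 10 ms hop and stack=3 this gives 240D frames at 30 ms.
    Assumption: leftover frames that do not fill a full group are dropped.
    """
    out = []
    for start in range(0, len(frames) - stack + 1, stack):
        merged: Frame = []
        for frame in frames[start:start + stack]:
            merged.extend(frame)
        out.append(merged)
    return out


def one_hot(index: int, size: int = 16) -> Frame:
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def add_language_vector(stacked: List[Frame], language_index: int,
                        size: int = 16) -> List[Frame]:
    """Append the same one-hot language vector to every stacked frame."""
    lang = one_hot(language_index, size)
    return [frame + lang for frame in stacked]


# Nine dummy 80D frames (90 ms of audio) become three 240D frames,
# and 256D once the language vector is appended.
dummy = [[float(t)] * 80 for t in range(9)]
stacked = stack_and_subsample(dummy)
encoder_input = add_language_vector(stacked, language_index=2)
```

The 16D vector leaves one unused slot for the 15 languages in this study; which index maps to which language is not specified here and is left to the caller.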
All the models are trained in TensorFlow [AbadiAgarwalBarhamEtAl15] using the Lingvo [shen2019lingvo] toolkit on Google’s Tensor Processing Units (TPU) V3 with a global batch size of 4,096 utterances. Models are trained with 512 TPU cores, except for the 10B models, which use 1024 TPU cores, mainly because of the 16G per-core high bandwidth memory (HBM) limit. Models are optimized using synchronized stochastic gradient descent. For LSTM-decoder models, we use the Adam optimizer [KingmaBa15] with parameters β1=0.9 and β2=0.999; for Transformer-decoder models, Adafactor [shazeer2018adafactor] with parameters β1=0.9 and β2=0.99 is used. A transformer learning rate schedule [Vaswani17] with peak learning rate 3e-4 and 10K warm-up steps is used.
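For reference, the shape of the learning rate schedule can be sketched as below. The schedule of [Vaswani17] increases the rate linearly during warm-up and then decays it with the inverse square root of the step number; here it is rescaled so that the maximum equals the stated peak of 3e-4 at step 10K. The function name and this particular parameterization are assumptions made for illustration; the exact form used in the Lingvo configuration is not given in this paper.

```python
import math


def transformer_lr(step: int, peak_lr: float = 3e-4,
                   warmup_steps: int = 10_000) -> float:
    """Linear warm-up to `peak_lr`, then inverse square root decay.

    Equivalent in shape to d_model**-0.5 * min(step**-0.5, step * warmup**-1.5),
    rescaled so that the value at `warmup_steps` is exactly `peak_lr`.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    warmup_factor = step / warmup_steps
    decay_factor = math.sqrt(warmup_steps / step)
    return peak_lr * min(warmup_factor, decay_factor)


for s in (1_000, 10_000, 40_000, 1_000_000):
    print(s, transformer_lr(s))
```

Under this parameterization the rate is back to half of its peak after four times the warm-up length (40K steps) and keeps decaying slowly over the remainder of training.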
4 Results and Discussions
In this section, we present our study of building high quality multilingual models on a large scale dataset. For simplicity, we use the average WER for comparisons and only report the per language performance at the end.
4.1 Monolingual Baselines
Conformer has been shown to perform the best on many English tasks [gulati2020conformer, bo21better]. We hence adopt it for our monolingual baselines. Specifically, we use the same Conformer encoder and LSTM decoder architecture as [bo21better], but in a full context setup. The encoder consists of 17 Conformer layers, with 12 layers in the last block (c.f. Section 2.1). Each Conformer layer has a model dimension of 512. 8-head attention is used in the self-attention layer and the convolution kernel size is 15. The decoder is an LSTM based LAS decoder, consisting of 2 layers of 640D LSTM with 2048 hidden units. 4-head content-based additive attention is used in the LAS attention module. Each monolingual model has around 140M parameters and is trained to predict language dependent graphemes only. The average WER is 9.29% and the per language breakdown is depicted in Figure 1(c). Across languages, WER ranges from 4.6% for English (US) to 20.2% for Marathi (IN). Languages with more data tend to have lower WERs.
4.2 Multilingual Encoder Architecture
To justify the effectiveness of Conformer encoders for multilingual modeling, we compare three encoders with the same LSTM-based decoder: (1) an LSTM encoder with 8 layers, 2048D hidden units and 640D output units; (2) a ContextNet encoder with 24 layers, 640D hidden units and a channel scale of 2 [han2020contextnet]; and (3) a Conformer encoder with 17 layers and a 512D hidden dimension, the same as the monolingual baselines. Language adapters are inserted between encoder layers. The specific configurations of these three encoders are chosen such that the total number of model parameters is roughly the same, around 220M. The increase in model size compared to the monolingual models comes from the additional language adapters and the larger output vocabulary. The average WERs of these three models are 11.86%, 10.77% and 9.43% respectively. This clearly demonstrates the effectiveness of Conformer for multilingual ASR. Compared to the monolingual models, this baseline multilingual Conformer still lags behind in quality, but it does reasonably well in recognizing all 15 languages. It converges in around 1.2M training steps, which roughly corresponds to 21 epochs, while the monolingual models normally train for up to 50 epochs.
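The correspondence between training steps and epochs follows directly from the global batch size (4,096 utterances) and the training set size (235.4M utterances) given in Section 3. The helper below is a small sketch of that arithmetic; the function name is hypothetical, and it assumes every step consumes one full global batch.

```python
def steps_to_epochs(steps: int, batch_size: int = 4096,
                    num_utterances: int = 235_400_000) -> float:
    """Convert a number of training steps into passes over the training data."""
    return steps * batch_size / num_utterances


# One epoch is 235.4M / 4096, i.e. roughly 57.5K steps.
steps_per_epoch = 235_400_000 / 4096
print(f"steps per epoch: {steps_per_epoch:.0f}")
print(f"1.2M steps = {steps_to_epochs(1_200_000):.1f} epochs")
```

With these numbers, 1.2M steps come out at just under 21 epochs, consistent with the figure quoted above.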
To understand the effect of the language adapters, we conduct the following ablation studies. For quick experimentation, we compare models at 200K steps (roughly 3.5 epochs), which we find sufficient for model selection.
The necessity of language dependent parameters. The use of language adapters brings in both language dependent model parameters and a small increase in model size. To understand which helps more, we train a single adapter model that forces all the languages to share the same adapter transformation. In this way we can isolate the model size increase from the adapter model. This model achieves 10.86% average WER vs. the baseline language dependent adapter model’s 10.38% @200K steps. This suggests feeding in language information and learning language dependent parameters are important.
The necessity of language dependent transforms. A simpler way of incorporating language information is to append the one-hot language vector to the input features. This effectively adds language dependent biases, whereas adapters bring in additional language dependent transformations. At 200K steps, the bias-only model achieves an average WER of 10.93%, which is worse than the adapter model. However, it also has fewer parameters (146M vs. 220M). We further increase the bias-only model to the same 220M, which yields an average WER of 10.37% @200K steps, similar to the adapter model. Though they have the same number of parameters, the bias-only model has a slightly higher inference cost than the adapter model, whose adapter components are only partially activated depending on the language. For simplicity, however, we will iterate based on the bias-only Conformer encoder model.
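The claim that appending a one-hot language vector amounts to a language dependent bias can be checked directly: for a linear layer applied to [x ; e_l], the columns of the weight matrix that multiply the one-hot part simply select one extra bias vector per language. The sketch below verifies this on toy numbers and contrasts it with a residual bottleneck adapter. All names are hypothetical, and the adapter form (x + U_l relu(D_l x)) is a commonly used design assumed here for illustration; the exact adapter architecture is not specified in this section.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def linear_with_language_input(x: Vector, lang: int, num_langs: int,
                               weights: Matrix, bias: Vector) -> Vector:
    """y = W [x ; onehot(lang)] + b, i.e. the one-hot vector is appended to x."""
    one_hot = [1.0 if i == lang else 0.0 for i in range(num_langs)]
    return [y + b for y, b in zip(matvec(weights, x + one_hot), bias)]


def linear_with_language_bias(x: Vector, lang: int,
                              weights: Matrix, bias: Vector) -> Vector:
    """Same layer rewritten as W_x x + b + (column of W selected by lang)."""
    dim = len(x)
    w_x = [row[:dim] for row in weights]
    lang_bias = [row[dim + lang] for row in weights]
    return [y + b + lb for y, b, lb in zip(matvec(w_x, x), bias, lang_bias)]


def residual_adapter(x: Vector, lang: int,
                     down: List[Matrix], up: List[Matrix]) -> Vector:
    """Assumed adapter form: x + U_lang relu(D_lang x), a per-language transform."""
    hidden = [max(0.0, h) for h in matvec(down[lang], x)]
    return [xi + ui for xi, ui in zip(x, matvec(up[lang], hidden))]


# Toy check with 2D features and 3 languages.
x = [1.0, 2.0]
W = [[1.0, 0.0, 0.1, 0.2, 0.3], [0.0, 1.0, -0.1, -0.2, -0.3]]
b = [0.5, -0.5]
via_input = linear_with_language_input(x, lang=1, num_langs=3, weights=W, bias=b)
via_bias = linear_with_language_bias(x, lang=1, weights=W, bias=b)
```

The two linear variants give identical outputs, so the one-hot input can only shift activations per language, whereas the adapter applies a language specific function of x itself.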
4.3 Multilingual Decoder Architecture
Besides using a single shared decoder, multi-head models [pratap2020massively] that use different decoders for different languages/families can be used to add per task capacity. Similar to [pratap2020massively], we use per language family decoders. Five families are used in total: Germanic, Italic, Arabic, Indo-Iranian and others. For comparison, we ensure the single-decoder and the multi-decoder models have the same number of parameters: (1) the single-decoder model has 6 layers of 768D LSTM with 3074 hidden units and (2) the multi-decoder model has 5 decoders, each of which has 2 layers of 640D LSTM with 2048 hidden units. Both models have 354M parameters. At 200K steps, the single-decoder model yields an average WER of 10.13% compared to the multi-decoder model’s 10.28%. This suggests that, given the same model size, it is more beneficial to use a single decoder, which encourages more cross language/family sharing.
To push the performance of our multilingual model, we further increase the 354M model to 500M by increasing the encoder’s width from 512D to 640D and depth from 17 to 22 layers. It reaches an average WER of 9.63% @200K and converges to 9.13% @1.1M steps, outperforming the monolingual models. However, its training speed is less than 1/3 of the baseline 220M model due to the error back propagation through time for RNN models. This makes it unfavourable for further scaling up.
The Transformer decoder [Vaswani17], in contrast, does not have the time recurrence constraint and has much higher parallelism in training. With the same encoder architecture, we build a Transformer decoder model with 500M parameters in total, which leads to 12 Transformer layers with 768D model dimension, 3072D hidden dimension and 8 attention heads. It converges to a slightly higher WER of 9.26% but with a training speed similar to the 220M baseline. We hence use Transformer decoders for the following studies.
4.4 Scaling up with GShard
In this experiment, we want to find the best way to further scale up the Conformer encoder and Transformer decoder model to 1B parameters. The set of experiments is listed in Table 2. Comparing E1 vs. E2 and E5 vs. E6, for both encoder and decoder, deeper models have better WER than wider models. However, deeper models are slower to train. Comparing E1-E4 with E5-E7, adding capacity to the encoder does slightly better than adding it to the decoder in terms of WER, although larger decoders tend to have better training loss. This might suggest that the decoder’s modeling task on the current speech training data is relatively simpler than the encoder’s, and that larger decoders show signs of overfitting. E4, which equally splits the additional capacity between width and depth [kaplan2020scaling], does not work well on this task; instead, we obtain the best WER with more capacity allocated to depth (i.e. E3). Lastly, E8, which first equally splits the capacity between encoder and decoder and then allocates more to depth, performs similarly to E3.
E3 converges around 600K steps, roughly 10 epochs, halving the number of training iterations needed for the smaller models though each step runs longer. It is more data efficient. More importantly, it achieves an average WER of 9.07%.
4.5 Towards 10B-Param Model
Based on the previous experiments, we further scale up the model size towards 10B parameters by focusing more capacity on the encoder and depth. Specifically, we increased the E3 encoder depth from 33 to 86 and width from 1024 to 2048 and kept the decoder the same. It converges to 9.04% at 330K steps (6 epochs). Although the WER reduction compared to the 1B model is marginal, the larger capacity leaves more room for scaling up from 15 languages to more in the future.
Besides the performance gains, we find that large models tend to be more data and training cost efficient, i.e. they can reach the same level of performance with fewer optimization steps (Figure 1(a)) and fewer TPU days (Figure 1(b)). This is similar to the observation in [kaplan2020scaling]. We did not scale beyond 10B mainly due to the slow training speed with current hardware. As shown in Figure 1(a), the 10B model has better sample efficiency than the 1B model, i.e. fewer training epochs are needed to reach the same WER, but the longer TPU time required per step makes it impractical for now. Sparse models [lepikhin2020gshard] have been found to scale up more efficiently, which will be explored in future work.
4.6 Human-in-the-loop Data Balancing
For simplicity, we have so far only compared the average WER across models. To understand how they do on each language, we plot the breakdown in Figure 1(c) for the monolingual, 220M, 500M, 1B and 10B models at convergence. Larger models win over monolingual models on most languages; however, there are languages on which they lag behind, especially Russian (RU). We suspect this is because of the unbalanced data distribution. The amount of data per language mainly depends on the data collection project and takes less account of the language complexity itself. To validate this, we take the 1B model, increase the mixing ratio for Russian (RU) to 0.4 with all the other languages sharing equal weights, and continue training for another 130K steps. This reduces the WER on Russian (RU) from 6.3% to 5.1%, which wins over the monolingual model’s 5.5%. As we still maintain a small weight for the other languages, no clear degradation is observed and the average WER is further reduced from 9.07% to 8.87%. This suggests it is beneficial to find a better data mixing ratio, or a schedule of mixing ratios, for multilingual models, which will be explored in future work.
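The rebalancing experiment can be illustrated with a simple weighted sampler. This is a sketch under stated assumptions: we read the setup as giving Russian a sampling weight of 0.4 and splitting the remaining 0.6 equally among the other 14 languages, the language identifiers other than "ru" are placeholders, and the function names are hypothetical. The actual data pipeline used for training is not described here.

```python
import random
from typing import Dict, List


def mixing_weights(languages: List[str], boosted: str,
                   boosted_weight: float = 0.4) -> Dict[str, float]:
    """Give `boosted` a fixed weight and split the rest equally (assumption)."""
    others = [lang for lang in languages if lang != boosted]
    share = (1.0 - boosted_weight) / len(others)
    weights = {lang: share for lang in others}
    weights[boosted] = boosted_weight
    return weights


def sample_languages(weights: Dict[str, float], num_samples: int,
                     seed: int = 0) -> List[str]:
    """Draw the language of each training example according to the mixing weights."""
    rng = random.Random(seed)
    langs = list(weights)
    probs = [weights[lang] for lang in langs]
    return rng.choices(langs, weights=probs, k=num_samples)


# 15 languages: "ru" plus 14 placeholder identifiers.
all_languages = ["ru"] + [f"lang{i:02d}" for i in range(1, 15)]
weights = mixing_weights(all_languages, boosted="ru")
batch_languages = sample_languages(weights, num_samples=4096)
ru_fraction = batch_languages.count("ru") / len(batch_languages)
```

Under this reading, each of the other languages keeps a weight of about 0.043, which matches the observation that a small but non-zero weight is enough to avoid a clear degradation on them.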
In this work, we investigate the problem of building multilingual end-to-end ASR models on high resource languages with large scale datasets, where language interference becomes more prominent. We address this problem by scaling up model capacity and empirically show that we can build models with up to 10B parameters. With larger models, we observe consistent performance gains. Moreover, the larger models are more sample and training cost efficient, i.e. they require fewer optimization steps and less TPU time, though each training step of a giant model runs longer. With increased capacity, we can build a single multilingual model that outperforms the monolingual models on high resource languages on a large scale multilingual dataset. On some languages, the multilingual model still lags behind. Empirical evidence suggests this is a data balancing problem, which will be investigated in future work.
We would like to thank Brian Farris, Chung-Cheng Chiu, Jiahui Yu, Wei Han, Sergey Kishchenko, Ron Weiss and Zhehuai Chen for helpful discussions.