Multilingual Speech Recognition using Knowledge Transfer across Learning Processes

by Rimita Lahiri, et al.

Multilingual end-to-end (E2E) models have shown great potential for expanding language coverage in automatic speech recognition (ASR). In this paper, we aim to enhance multilingual ASR performance in two ways: 1) studying the impact of feeding a one-hot vector identifying the language, and 2) formulating the task with a meta-learning objective combined with self-supervised learning (SSL). We associate every language with a distinct task manifold and attempt to improve performance by transferring knowledge across the learning processes themselves, rather than through the final model parameters. We employ this strategy on a dataset comprising 6 languages for an in-domain ASR task, by minimizing an objective related to the expected gradient path length. Experimental results reveal that the best pre-training strategy yields a 3.55% relative reduction in overall WER, and that a combination of LEAP and SSL yields a 3.51% relative reduction in overall WER when using language ID.


1 Introduction

In the past few decades, the speech recognition community has seen remarkable technological advancement, especially in the field of multilingual ASR [toshniwal2018multilingual, NGaur2021, zhou2021configurable]. Deploying a single ASR model for multiple languages is particularly challenging because of the inherent differences between the sub-word units, lexicons and word inventories of different languages. These challenges have led to growing interest in learning multilingual models with representations shared across languages, in order to circumvent burdensome explicit data requirements.

Most of the early efforts [dahl2011context] in the ASR field employed 3 independent modules, namely an acoustic model (AM), a language model (LM) and a pronunciation model (PM). Later, E2E ASR models gained popularity because their simpler structure encapsulates the AM, LM and PM in a single network while maintaining competitive performance. Most of the previous efforts in multilingual ASR have been limited to multilingual AMs employing shared hidden layers [heigold2013multilingual], stacked bottleneck features [cui2015multilingual, sercu2017network], multitask learning [chen2015multitask] and so on. Watanabe et al. [watanabe2017language] introduced an E2E language-independent model for joint language identification and multilingual ASR. Later, Kannan et al. [Kannan2019] proposed a streaming E2E multilingual model that combines conditioning on a language vector with language-specific adapter modules.

Language ID carries meaningful information about language representations for downstream tasks, and is therefore a key component of multilingual ASR systems. Traditional systems ran multiple monolingual ASR systems in parallel along with a language ID predictor module, and the predicted language ID determined which downstream transcript was used. Because this approach is extremely expensive, joint modelling of ASR and language ID later gained attention in the literature. Waters et al. [waters2019leveraging] use a recurrent neural network based streaming language detector and feed the detector output as an auxiliary input to the encoder of a multilingual ASR system. Prior works [li2019bytes, seki2018end, muller2018neural, Hou2020] have established that multilingual ASR performance is usually enhanced by auxiliary inputs for language representation. These promising results have prompted us to use language ID as an input for the in-domain multilingual ASR task.

Recently, self-supervised learning (SSL) has been shown to be very effective in the field of ASR [AConneau2006, zhang2020pushing, wang2021unispeech, zhang2021bigssl, Karimi2022]. SSL algorithms attempt to find a good representation from unlabeled data. One of the major challenges in developing a multilingual ASR system is learning a cross-lingual representation for both high-resource and low-resource languages [AConneau2006, wang2021unispeech, Karimi2022]. SSL is particularly effective in such cases: a general representation is learnt from a substantial amount of unlabeled data, and the learnt representation is then leveraged for the supervised ASR task using limited labeled data.

Another research direction gaining significant attention is meta-learning, owing to its success in computer vision tasks [rusu2018meta, snell2017prototypical]. These algorithms treat the learning of a new task as a learning problem in itself. Usually, meta-learning problems are solved by training a meta-learner via backpropagation through the entire training process. Backpropagating through thousands of gradient steps is prone to becoming unstable. To avoid such issues, Flennerhag et al. introduced LEAP [flennerhag2018transferring], a meta-learning framework that efficiently transfers knowledge across learning processes by learning an initialization with the shortest expected path length. That work reports superior performance compared to other state-of-the-art meta-learning methods on computer vision tasks. Motivated by this prior work, we combine the benefits of both approaches in a single framework. We formulate the multilingual ASR task with a meta-learning objective of minimizing the expected length of the path traversed by the learning processes from the initialization to the final set of parameters, and combine it with iterative self-training and pre-training in the usual SSL setup.

In this work, we contribute to the multilingual ASR task along 2 directions. First, we investigate whether feeding language ID as an input improves multilingual ASR performance. Second, we focus on enhancing ASR performance with a strategy that combines SSL and knowledge transfer across learning processes.

The rest of the paper is organized as follows. Section 2 describes the methodologies employed to boost multilingual ASR performance in this work. Section 3 provides the experimental setup and details of the dataset. Key outcomes of the experiments are tabulated and interpreted in Section 4. Finally, Section 5 provides conclusions and highlights possible future extensions.

2 Methodology

2.1 Language ID as an input

One of the major challenges faced by the multilingual ASR research community is the varying availability of transcribed data across languages. As a result, ASR models tend to be tailored to the resource-rich, over-represented languages. We investigate conditioning on language ID as a strategy to address this imbalance in the availability of labeled data.

The objective is for the ASR model to learn individual language traits based on the language ID, instead of being dominated by the languages with more data. Prior works have introduced language vectors in non-streaming E2E multilingual [toshniwal2018multilingual] and multidialect models [li2018multi]. The language identification information can be fed in numerous ways, such as a simple one-hot vector, language-specific learnt embeddings, or clusters trained using cluster adaptive training (CAT) [tan2015cluster]. Prior works [toshniwal2018multilingual, Kannan2019] have reported that simple one-hot vectors yield competitive performance compared to the more complex methods. Following that, in this work we study the impact of using language information by concatenating the language representation with the input features fed to the neural network.
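The concatenation described above can be illustrated with a minimal sketch. It assumes the one-hot vector is appended to every frame of the acoustic features; the language ordering, the helper names `one_hot` and `append_language_id`, and the zero-valued toy frames are hypothetical and not taken from the paper.

```python
# Assumed ordering of the six training languages (hypothetical, for illustration).
LANGUAGES = ["DE", "IT", "RU", "ES", "FR", "PT"]

def one_hot(language, languages=LANGUAGES):
    """Return a one-hot list with 1.0 at the index of `language`."""
    if language not in languages:
        raise ValueError(f"unknown language: {language}")
    return [1.0 if name == language else 0.0 for name in languages]

def append_language_id(frames, language):
    """Concatenate the same one-hot language vector to every feature frame."""
    lang_vec = one_hot(language)
    return [list(frame) + lang_vec for frame in frames]

# Toy utterance: 3 frames of 80-dimensional features (zeros as placeholders).
frames = [[0.0] * 80 for _ in range(3)]
conditioned = append_language_id(frames, "RU")
print(len(conditioned), len(conditioned[0]))
```

With 80-dimensional features and 6 languages, each conditioned frame has 86 dimensions, so only the input layer of the network needs to change.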

2.2 Self-supervised learning (SSL)

To obtain a good multilingual representation, we employ log-filter-bank-energy-to-vector (lfbe2vec), similar to [zhang2020pushing, Karimi2022], as the SSL step. Figure 1 shows the schematic block diagram of lfbe2vec. As shown in Figure 1, initial time steps are sampled randomly, and the features are downsampled with a convolutional network followed by a linear transformation. Starting from each sampled time step, we mask the subsequent 10 time steps, using a mask probability of 0.065. The masked features are fed to the Transformer encoder to yield masked context vectors. In this work, we use the Transformer with relative positional embeddings [li2020comparison] instead of the Conformer used in [zhang2020pushing]. Negative samples are drawn randomly from other positions of the target vectors within the same utterance. Finally, lfbe2vec optimizes the contrastive loss between the context vectors at the masked positions and the target vectors. We use multilingual data for SSL. The pretrained encoder is connected to the transducer decoder, which consists of the label predictor, joint network, uni-directional LSTMs and a softmax layer. The whole Transformer transducer is further optimized with LEAP, described in the next section.
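The masking and contrastive steps can be sketched roughly as follows. This is an illustrative approximation, not the lfbe2vec implementation: it assumes each time step is chosen as a span start with probability 0.065 and that a span covers 10 steps including the start, and it uses a cosine-similarity contrastive loss with a hypothetical temperature of 0.1. The function names are made up for this example.

```python
import math
import random

def sample_mask(num_steps, mask_prob=0.065, span=10, seed=0):
    """Pick each step as a span start with probability `mask_prob`,
    then mask that step and the following `span - 1` steps."""
    rng = random.Random(seed)
    mask = [False] * num_steps
    for start in range(num_steps):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span, num_steps)):
                mask[t] = True
    return mask

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def contrastive_loss(context, target, negatives, temperature=0.1):
    """Negative log-probability of the true target among target + negatives."""
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, neg) / temperature for neg in negatives]
    top = max(logits)  # subtract the maximum for numerical stability
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]

mask = sample_mask(200)
print(sum(mask), "of", len(mask), "time steps masked")
print(contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

The loss is small when the context vector at a masked position is closer to its own target than to the negatives drawn from other positions, which is the behaviour the pretraining step rewards.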

Figure 1: Multi-lingual self-supervised learning (SSL)

2.3 Transferring knowledge across task manifolds

LEAP is a meta-learning framework that accumulates information about the task geometries observed during the training process. In contrast to traditional transfer learning methods that only consider the final parameters, this framework aggregates information from the task manifolds throughout the learning process, thus avoiding information loss. To judge the worth of an initialization, the notion of path length is used as an objective criterion. The Euclidean distance between the initial and final parameters is not accurate enough here, as it ignores the trajectory of the learning process. In contrast, a notion of length associated with the gradient-based trajectory on the loss surface can encode the information related to the learning task.

Ideally, consecutive gradient updates point in the same direction, which indicates a smooth task; harder tasks exhibit frequent updates in opposing directions, which is undesirable. The LEAP framework exploits this insight by formulating a meta-learning objective that minimizes the expected length of the gradient descent trajectory across tasks. The framework aims to find the initialization that results in the shortest path for all tasks while respecting the constraints imposed by the task geometries throughout the training process.

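Following the notation of [flennerhag2018transferring], let us denote the input and target samples as $x$ and $y$, respectively. A task $\tau$ can be defined as a learning process that obtains a mapping from $x$ to $y$ based on samples drawn from the distribution $p_\tau(x, y)$. It starts with a random initialization $\theta^0$ and gradually progresses across iterations using the update rule $\theta^{i+1}_\tau = u_\tau(\theta^i_\tau)$. Assuming that it requires $K_\tau$ gradient updates to minimize the objective function $f_\tau$, the sequence $\{\theta^i_\tau\}_{i=0}^{K_\tau}$ specifies the approximate trajectory traversed on the task manifold $M_\tau$, with distance $d(\theta^0; M_\tau)$. Given a distribution of tasks $p(\tau)$, each candidate solution $\theta^0$ is associated with a specific expected gradient path length, $\mathbb{E}_{\tau \sim p(\tau)}[d(\theta^0; M_\tau)]$, which denotes the quality of the initialization. The candidate initialization with the shortest expected gradient path length is likely to transfer the most knowledge across learning processes and is considered Pareto optimal in that sense.

Starting from a second-best candidate solution $\psi^0$, the framework first finds the baseline gradient trajectories $\Psi_\tau = \{\psi^i_\tau\}_{i=1}^{K_\tau}$ for each task $\tau$ in a batch $B$, where all tasks share the same initialization, $\psi^0_\tau = \psi^0$. These baselines are used to update the gradient path distance metric using the equation

$$\bar{d}(\theta^0; M_\tau, \Psi_\tau) = \sum_{i=0}^{K_\tau - 1} \left\| \bar{\gamma}^{i+1}_\tau - \gamma^i_\tau \right\|_2^2,$$

where $\bar{\gamma}^i_\tau = (\psi^i_\tau, f_\tau(\psi^i_\tau))$ is the frozen forward point from the baseline and $\gamma^i_\tau = (\theta^i_\tau, f_\tau(\theta^i_\tau))$ is the point on the updated gradient path initialized at $\theta^0$.

Formally, the above distance metric encodes all the constraints. Optimizing it with respect to $\theta^0$ pulls the initialization forward along each task-specific gradient path, and the objective can be stated as

$$\min_{\theta^0} \bar{F}(\theta^0; \Psi) = \mathbb{E}_{\tau \sim p(\tau)}\left[ \bar{d}(\theta^0; M_\tau, \Psi_\tau) \right].$$
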
To better understand how this framework operates, let us consider 2 distinct tasks, as shown in Fig. 2.

Figure 2: Illustration of working principle of LEAP
Figure 3: Overall training strategy

The extent of knowledge transfer across these tasks depends on the suitability of the initialization. Starting from an initial candidate, each of the two tasks generates its own trajectory path. LEAP then follows the green path to improve the initialization, with the objective of minimizing the expected gradient path length. For example, the improved initialization yields shorter expected task trajectories (denoted by the red arrow-lines in Fig. 2) for both tasks. If 2 candidate initializations yield equivalent performance on each task, the one with the shortest expected gradient path length encodes the most knowledge sharing.
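The idea of pulling an initialization toward shorter expected gradient paths can be illustrated on a toy problem. The sketch below is not the training code of this work: it assumes two quadratic tasks with hypothetical centers, plain gradient descent as the inner update rule, a squared path length over the points (theta, f(theta)), and a meta-gradient in which Jacobians are approximated by the identity. All function names and hyperparameters are illustrative.

```python
def make_task(center):
    """Quadratic toy task f(theta) = 0.5 * ||theta - center||^2."""
    def loss(theta):
        return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, center))
    def grad(theta):
        return [t - c for t, c in zip(theta, center)]
    return loss, grad

def run_task(theta0, task, lr=0.1, steps=20):
    """Gradient descent from theta0; returns parameters and losses on the path."""
    loss, grad = task
    thetas = [list(theta0)]
    for _ in range(steps):
        g = grad(thetas[-1])
        thetas.append([t - lr * gi for t, gi in zip(thetas[-1], g)])
    return thetas, [loss(t) for t in thetas]

def path_length(thetas, losses):
    """Sum of squared segment lengths through the points (theta^i, f(theta^i))."""
    total = 0.0
    for i in range(len(thetas) - 1):
        d_theta = sum((a - b) ** 2 for a, b in zip(thetas[i + 1], thetas[i]))
        total += d_theta + (losses[i + 1] - losses[i]) ** 2
    return total

def expected_path_length(theta0, tasks):
    """Average path length over all tasks, started from the shared theta0."""
    return sum(path_length(*run_task(theta0, task)) for task in tasks) / len(tasks)

def leap_meta_gradient(theta0, tasks):
    """Meta-gradient of the surrogate path length (identity-Jacobian assumption)."""
    meta = [0.0] * len(theta0)
    for task in tasks:
        _, grad = task
        thetas, losses = run_task(theta0, task)
        for i in range(len(thetas) - 1):
            d_f = losses[i + 1] - losses[i]
            g = grad(thetas[i])
            for k in range(len(theta0)):
                d_theta = thetas[i + 1][k] - thetas[i][k]
                meta[k] += -2.0 * (d_f * g[k] + d_theta) / len(tasks)
    return meta

tasks = [make_task([2.0, 0.0]), make_task([0.0, 2.0])]  # two toy "languages"
theta0 = [-3.0, -3.0]
before = expected_path_length(theta0, tasks)
for _ in range(30):  # meta-updates with a hypothetical meta learning rate of 0.01
    meta = leap_meta_gradient(theta0, tasks)
    theta0 = [t - 0.01 * m for t, m in zip(theta0, meta)]
after = expected_path_length(theta0, tasks)
print(before > after, theta0)
```

In this toy setting the shared initialization drifts toward a point between the two task optima, which shortens both trajectories at once; real tasks add stochastic gradients and far higher dimensionality.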

2.4 Training strategy

In this work, we consider each of the 6 languages as a distinct task. Accordingly, we expect LEAP to provide an initialization that works well across all 6 languages. We follow the 3-step training recipe shown in Fig. 3 to incorporate the benefits of the LEAP framework on top of SSL. In the first step, we pre-train the network to minimize the contrastive loss between the masked context vectors and the target vectors. In the second step, we fine-tune the pre-trained model using the LEAP meta-learning framework. In the third step, we further fine-tune the resulting model to minimize the standard Transformer Transducer (T-T) loss [li2020comparison, QZhang2020]. The motivation for incorporating the LEAP framework is to find an initialization for the tasks that is likely to facilitate a smoother training trajectory in the following step.

3 Experimental Setup

3.1 Dataset

| Language | Hours |
|---|---|
| German (DE) | 5688 |
| Italian (IT) | 8049 |
| Russian (RU) | 3055 |
| Spanish (ES) | 8332 |
| French (FR) | 6930 |
| Portuguese (PT) | 10318 |

Table 1: Duration of training data: 6 language data with 6.5k BPEs (42,372 hours)

| Language | Code | Number of utterances | Number of words |
|---|---|---|---|
| German | DE | 48708 | 446215 |
| Italian | IT | 24881 | 211163 |
| Russian | RU | 6328 | 108736 |
| French | FR-CA | 17708 | 178392 |
| French | FR-FR | 25371 | 273150 |
| Spanish | ES-MX | 19712 | 267438 |
| Spanish | ES-ES | 23199 | 291183 |
| Portuguese | PT-BR | 18643 | 262894 |
| Portuguese | PT-PT | 2560 | 44070 |

Table 2: Test dataset details

| Experiments | DE | IT | RU | FR-FR | FR-CA | ES-ES | ES-MX | PT-PT | PT-BR | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| No pretraining without Lang ID | 20.44 | 18.74 | 32 | 25.03 | 21.98 | 18.52 | 20.81 | 23.19 | 22.1 | 21.65 |
| No pretraining with Lang ID | 17.79 | 16.76 | 25.26 | 19.85 | 22.02 | 16.18 | 18.54 | 21.39 | 21.62 | 19.13 |
| LEAP-SSL with Lang ID | 17.12 | 16.40 | 25.22 | 19.31 | 20.87 | 15.41 | 17.55 | 21.68 | 20.83 | 18.45 |

Table 3: Evaluation of multi-lingual pretraining methods in the in-domain language task (WER).

| Experiments | DE | IT | RU | FR-FR | FR-CA | ES-ES | ES-MX | PT-PT | PT-BR | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| No pretraining with Lang ID | 19.93 | 20.05 | 28.73 | 24.37 | 22.07 | 18.17 | 20.08 | 24.92 | 23.04 | 21.43 |
| SSL only with Lang ID | 20.27 | 19.47 | 28.86 | 23.6 | 21.83 | 17.98 | 19.97 | 24.1 | 22.87 | 21.25 |
| LEAP-SSL with Lang ID | 19.19 | 18.67 | 28.08 | 22.76 | 21.14 | 17.58 | 19.84 | 23.81 | 23.05 | 20.67 |
| No pretraining without Lang ID | 25.18 | 23.88 | 35.04 | 25.27 | 29.61 | 22.60 | 24.66 | 27.54 | 26.26 | 25.71 |
| SSL only without Lang ID | 23.26 | 22.2 | 31.64 | 27.83 | 24.08 | 20.59 | 22.97 | 24.97 | 23.59 | 23.92 |
| LEAP-SSL without Lang ID | 23.03 | 22.38 | 32.42 | 29.19 | 24.01 | 21.04 | 23.65 | 25.52 | 24.83 | 24.42 |

Table 4: Effect of language ID input for SSL methods with a subset of data (WER).

The dataset used for training consists of 6 languages totaling 42,372 hours. As shown in Table 1, our training dataset comprises data from German, Italian, French, Spanish, Portuguese and Russian. Note that we report experimental results for multiple variants of Portuguese (PT-BR and PT-PT), French (FR-FR and FR-CA) and Spanish (ES-ES and ES-MX). Table 2 lists the statistics of our test set for each locale: the number of utterances and the number of words. The dataset of each locale covers not only various speakers but also different speech recognition tasks, such as command-and-control tasks in mobile, office and car scenarios, Cortana phrases, dictation, video indexing, and conversational speech in telecommunication or meetings.

For training a network, we further split the whole training data set into training and validation sets in order to determine convergence. The amount of training data differs across languages, as shown in Table 1. We therefore sample the lower-resource data more frequently to balance the language distribution during training.
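The paper does not state the exact sampling scheme, so the following is only one common way to oversample lower-resource languages: temperature-based sampling over the training hours of Table 1. The temperature value and the function name `sampling_probs` are assumptions made for this illustration.

```python
# Training hours per language, taken from Table 1.
TRAIN_HOURS = {"DE": 5688, "IT": 8049, "RU": 3055,
               "ES": 8332, "FR": 6930, "PT": 10318}

def sampling_probs(hours, temperature=2.0):
    """Flatten the natural data distribution with an exponent of 1/temperature.

    temperature = 1.0 keeps the natural shares; larger values move the
    distribution toward uniform, so low-resource languages are drawn more often.
    """
    total = sum(hours.values())
    scaled = {lang: (h / total) ** (1.0 / temperature) for lang, h in hours.items()}
    norm = sum(scaled.values())
    return {lang: value / norm for lang, value in scaled.items()}

probs = sampling_probs(TRAIN_HOURS)
for lang, p in sorted(probs.items()):
    natural = TRAIN_HOURS[lang] / sum(TRAIN_HOURS.values())
    print(f"{lang}: natural share {natural:.3f} -> sampling probability {p:.3f}")
```

Under this assumed scheme, Russian, the smallest language in Table 1, is sampled more often than its natural share, while Portuguese, the largest, is sampled less often.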

3.2 Implementation details

We use 80-dimensional LFBE features extracted every 10 ms. The Transformer transducer architecture consists of a Transformer encoder and an LSTM decoder. The Transformer encoder comprises 2 convolutional layers followed by 18 relative positional Transformer blocks. Each Transformer block contains two 2048-dimensional feed-forward layers and multi-head attention with 8 heads and relative positional embeddings. The embedding dimension is fixed at 512. The ASR decoder consists of 2 blocks of uni-directional LSTM layers followed by a 1024-dimensional feed-forward layer. The Adam optimizer is used for SSL pretraining and AdamW is used for conventional fine-tuning. For LEAP-SSL fine-tuning, Adam is used as the meta optimizer to minimize the expected path length, while stochastic gradient descent (SGD) is used to minimize the T-T loss for the individual tasks. In SSL, the numbers of model updates were 25,000 and 325,000 for the warm-up and linear decay stages, respectively. For fine-tuning, we performed model updates until the loss on the validation data converged. The same batch size was used for all the fine-tuning experiments.
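The SSL learning-rate schedule described above can be sketched as a simple function of the update count. The stage lengths (25,000 warm-up and 325,000 linear-decay updates) come from the text; the peak learning rate, the linear shape of the warm-up, and the decay to zero are assumptions made for this illustration.

```python
def learning_rate(step, peak_lr=1e-3, warmup_steps=25_000, decay_steps=325_000):
    """Linear warm-up to `peak_lr`, then linear decay to zero.

    `peak_lr` is a hypothetical value; the paper does not report it.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    return peak_lr * (1.0 - progress)

for step in (0, 12_500, 25_000, 187_500, 350_000):
    print(step, learning_rate(step))
```

The schedule spans 350,000 updates in total, with the peak reached at the end of the warm-up stage.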

4 Results and Analysis

We first investigate how much the combination of SSL and LEAP pretraining can improve multilingual ASR performance. Table 3 shows the word error rates (WERs), with one method per row and one locale per column. As baselines, Table 3 includes the recognition performance without pretraining. The overall WER in the last column is the average WER weighted by the number of words in each locale's test set. Table 3 shows that LEAP-SSL pretraining provides the best overall accuracy (an overall WER of 18.45 versus 19.13 without pretraining, both with language ID). It also shows that the language ID one-hot vector leads to better accuracy (19.13 versus 21.65 without pretraining). Interestingly, for specific locales, using language ID as an auxiliary input yields a significant improvement (21.06% and 20.69% relative WER reduction for the RU and FR-FR locales, respectively), while for others there is no significant improvement (the PT-BR and FR-CA locales).
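The two quantities used in this analysis, the word-weighted overall WER and the relative WER reduction, can be recomputed from Tables 2 and 3 with a short script. The helper names are illustrative; the numbers are copied from the tables, and small deviations from the reported overall values can arise because the per-locale WERs are rounded.

```python
# Number of test words per locale, from Table 2.
TEST_WORDS = {"DE": 446215, "IT": 211163, "RU": 108736, "FR-FR": 273150,
              "FR-CA": 178392, "ES-ES": 291183, "ES-MX": 267438,
              "PT-PT": 44070, "PT-BR": 262894}

# Per-locale WERs of the two no-pretraining rows of Table 3.
WER_NO_LID = {"DE": 20.44, "IT": 18.74, "RU": 32.0, "FR-FR": 25.03,
              "FR-CA": 21.98, "ES-ES": 18.52, "ES-MX": 20.81,
              "PT-PT": 23.19, "PT-BR": 22.1}
WER_LID = {"DE": 17.79, "IT": 16.76, "RU": 25.26, "FR-FR": 19.85,
           "FR-CA": 22.02, "ES-ES": 16.18, "ES-MX": 18.54,
           "PT-PT": 21.39, "PT-BR": 21.62}

def overall_wer(wer, words):
    """Average WER weighted by the number of test words per locale."""
    total_words = sum(words.values())
    return sum(wer[locale] * words[locale] for locale in words) / total_words

def relative_reduction(baseline, improved):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

print(f"overall without Lang ID: {overall_wer(WER_NO_LID, TEST_WORDS):.2f}")
print(f"overall with Lang ID:    {overall_wer(WER_LID, TEST_WORDS):.2f}")
for locale in ("RU", "FR-FR"):
    gain = relative_reduction(WER_NO_LID[locale], WER_LID[locale])
    print(f"{locale}: {gain:.2f}% relative reduction from language ID")
```

The same `relative_reduction` helper applied to the overall WERs of the last two rows of Table 3 (19.13 and 18.45) gives the roughly 3.55% gain attributed to LEAP-SSL pretraining.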

Next, we analyze the effect of language ID information when pretraining is employed. Table 4 shows the WERs for each method and locale. To generate the numbers in Table 4 quickly, we used a quarter of the training data set for each language. Table 4 shows that the language ID one-hot vector improves recognition accuracy for every method. Comparing the results without pretraining to those with SSL only, we can see that the SSL method by itself improves overall multilingual recognition accuracy. The improvement from SSL is especially prominent when the input language ID is unknown. Notably, we use the same data for unlabeled and labeled training in these experiments. We are accordingly led to believe that SSL with data from multiple languages produces a better multilingual representation. Table 4 also shows that LEAP combined with SSL provides an additional improvement when the language ID input is used: LEAP-SSL with language ID performs best in all locales except PT-BR. In contrast, the experiments without language ID show the opposite trend: for most locales, performance is slightly degraded by LEAP-SSL relative to SSL only. One possible explanation is that, without language information, the LEAP framework is unable to obtain a good initialization, but the cause of this performance drop deserves further investigation.

5 Conclusion

In this paper, we have investigated different strategies for enhancing multilingual ASR performance on an in-domain language task. In particular, we focus on feeding a language ID one-hot vector and on minimizing the expected gradient path length across task manifolds. The experiments using language ID yield lower WERs overall and for nearly all locales considered in our in-domain multilingual ASR task. We incorporate the LEAP meta-learning framework into the SSL scheme, with the added objective of minimizing the expected gradient path length across the language-based tasks. When language ID is used, combining LEAP with SSL lowers the overall WER and improves most locales.

In the future, we plan to carry out an extensive set of experiments on the entire dataset and its subsets to better understand the potential of the adopted strategies. We will also analyze the efficiency of these training strategies by comparing them with other state-of-the-art meta-learning paradigms. Extending the current work to a larger set of languages is another promising research direction.