Unsupervised learning methods are gaining significant traction in acoustic model training [wav2vec, vqwav2vec, wav2vec2.0, XLSR, WavLM, swietojanski2012unsupervised]. These methods exploit large amounts of unlabelled data, which is easily available. For instance, wav2vec[wav2vec] uses unlabelled data to pretrain the acoustic model, which is subsequently finetuned with task specific labelled data. Acoustic models initialized from such pretrained models show superior performance compared to random initialization. The gains are often larger for low resource languages[XLSR, Ai4Bharat]. Therefore, an emerging acoustic model training paradigm is to pretrain a seed model with a large amount of multilingual unlabelled data and subsequently finetune it with language specific labelled data.
Transfer learning[TL_kunze-etal-2017, TL_2, TL_3, Joshi2020] and multilingual modeling[SHL_1, SHL_2, MLT_1, MLT_2, Arindam_Mrnnt, Multilingual_RNNT, Besacier-ASRUnderResourcedSurvey, joshi2021multiple, LanguageIndependentAM, Lin-MultilingualAM] are also widely used to improve speech recognition accuracy. It is now common practice to pretrain a seed model using multilingual labelled data and subsequently finetune it with language specific labelled data. While such pretrained models improve performance, they do not leverage unlabelled data. Alternatively, self-supervised learning[amzn_million, BigSSL] methods use both labelled and unlabelled data for pretraining. These methods use a well trained teacher model to obtain labels or posteriors for the unlabelled data and subsequently use them along with the labelled data to pretrain the acoustic models. Broadly, pretrained models can be grouped into the following three categories:
Unsupervised seed: Trained with unlabelled data alone using unsupervised learning methods.
Supervised seed: Trained with labelled data alone using cross entropy loss.
Self-supervised seed: Trained with both labelled and unlabelled data.
While unsupervised seed models perform better than random initialization, they are often found to be inferior to supervised seed models, especially in industry scale settings. This is because the labelled data pooled from multiple languages to train the supervised seed model is often large. Self-supervised seed models can perform better than supervised and unsupervised seed models; however, they need a significantly larger amount of unlabelled data than labelled data to obtain meaningful gains. [amzn_million] and [BigSSL] use close to a million hours of unlabelled data with and hours of labelled data. Though it is possible to get a million hours of unlabelled data, it is expensive to train models with such a large amount of data. In general, using unlabelled data during pretraining poses the following challenges:
Data processing: Unlabelled data needs processing to extract speech only segments and remove silences, noise, and other sounds. We also need to obtain the labels or posteriors via decoding or a forward pass. Hence, processing large amounts of unlabelled data can be tedious and computationally expensive.
Model training: Training models with large amounts of unlabelled data requires a large number of compute instances. This cost can be prohibitively high or can severely constrain the experimentation process.
Model updates: Pretrained models are regularly updated with the arrival of new data or with new modeling enhancements. Such updates are constrained by the expensive data preparation and model training.
Given the above challenges, we propose an acoustic modeling framework that utilizes any available seed model and leverages unlabelled data during finetuning. It needs a much smaller amount of unlabelled data, as the amount of labelled data used in finetuning is considerably smaller than that used in pretraining. We show improvements with hours of unlabelled data, when using hours of labelled data for finetuning.
More specifically, we propose to train the model with a joint objective function consisting of the cross entropy loss, suited to the classification task on labelled data, and the contrastive loss, suited to learning contextual representations from unlabelled data. The model is therefore trained to learn representations that classify sound units as well as contextual acoustic representations. The proposed approach is referred to as WavFT, as it learns contextual acoustic representations during finetuning. We conduct experiments on hybrid automatic speech recognition (ASR) systems and show the efficacy of WavFT on two Indian languages, namely Gujarati (Gu-IN) and Bengali (Bn-IN). WavFT shows and word error rate relative (WERR) reduction over conventional finetuning on Gu-IN and Bn-IN, respectively. WavFT is readily applicable to production scale systems, as the only changes to acoustic model training are in finetuning: the use of unlabelled data and the accordingly modified objective function.
2 Relation to prior work
Transfer learning [TL_kunze-etal-2017, TL_2, TL_3, Joshi2020] leverages a well trained seed model from a high resource language to improve accuracy on a low resource language. A natural extension of transfer learning is to train multilingual seed models[SHL_1, SHL_2, MLT_1, MLT_2, Arindam_Mrnnt, Multilingual_RNNT, Besacier-ASRUnderResourcedSurvey, joshi2021multiple, LanguageIndependentAM, Lin-MultilingualAM] with data pooled from multiple languages. While both these methods are effective, they do not leverage easily available unlabelled data.
Recently, numerous studies have used unsupervised and self-supervised learning to improve speech recognition performance, especially on low resource languages. Wav2vec[wav2vec] shows the efficacy of pretraining the model to learn contextual acoustic representations using unlabelled data. Vqwav2vec[vqwav2vec] introduces learning of discrete vector representations. Wav2vec2.0[wav2vec2.0] improves further by masking the speech input in the latent space and solving a contrastive task defined over quantized latent representations. The XLSR[XLSR] model showed significant improvements on low resource languages by pretraining wav2vec2.0 with multilingual data. The effectiveness of such pretrained multilingual models on Indian languages is shown in [Ai4Bharat]. While models pretrained with the unsupervised learning methods discussed above significantly improve over random initialization, their performance is often inferior to supervised seed initialization, as they do not use labelled data during pretraining. The authors of [amzn_million] show improvements with semi-supervised learning using a million hours of unlabelled data. A detailed study of unsupervised and self-supervised methods for pretraining is presented in [BigSSL], whose authors show the benefits of such pretrained models on numerous downstream tasks. UniSpeech [UniSpeech] proposed learning unified speech representations using labelled and unlabelled data during pretraining, and showed improvements with such pretrained models on downstream tasks. All the methods discussed above use unlabelled data during the pretraining stage, whereas we propose to use it during finetuning, for the advantages described earlier. Our method can also improve on top of these methods by using the corresponding pretrained models.
During finetuning, the acoustic model learns representations to classify senones using labelled data. Such labelled data is often small, which can lead to over-fitting and poor generalization. On the other hand, the model can learn robust contextual acoustic representations from unlabelled data using unsupervised learning[wav2vec, wav2vec2.0] methods. We hypothesize that learning contextual acoustic representations along with representations to classify senones can help the model generalize better. We achieve this by training the model with a joint objective function consisting of: a) the cross entropy loss to learn representations for the classification task, computed on labelled data only, and b) the contrastive loss to learn acoustic representations for better generalization, computed on both labelled and unlabelled data. We next discuss the model architecture and training objective in detail.
Model architecture: Fig. 1 depicts the model architecture and the proposed approach to finetune the acoustic model with labelled and unlabelled data. The acoustic model consists of convolutional transformer (convTransformer) [convTransformer_speech, convTransformer_vision] blocks. Each block consists of multi-head attention[Transformer_Attn] with relative positional embedding, depth-wise convolution and feed forward neural network (FFN) modules. The input log-mel filter bank (LFB) features are randomly masked and passed through the convolutional sub-sampling block. The resulting features are fed to the convTransformer model to produce latent representations. These representations are projected to the label dimension using a linear projection layer and passed through a softmax to obtain the posterior probabilities over the labels, senones in our case. The cross entropy loss is computed between the input labels and the posterior probabilities for the labelled data only, as shown in Fig. 1. The latent representations from the convTransformer model are also passed through an FFN to produce context vectors. The LFB features are transformed with a linear layer, instead of quantization, to produce target vectors. The contrastive loss is computed between the context and target vectors for both labelled and unlabelled data.
Training objective: Consider the set of labelled batches, and an utterance consisting of audio-label pairs drawn from one such batch. The cross entropy loss is computed on each such utterance as described next.
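With notation assumed here for illustration (an utterance $u$ with $T$ frames $x_1, \dots, x_T$, label dimension $K$, one-hot reference vector $y_t$ with components $y_t^{k}$, and posterior $p(k \mid x_t)$; none of these symbols are taken from the source), and assuming a plain sum over frames without normalization, the utterance-level cross entropy loss can be sketched as:

```latex
\mathcal{L}_{ce}(u) = -\sum_{t=1}^{T} \sum_{k=1}^{K} y_t^{k} \, \log p(k \mid x_t)
```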
The loss is accumulated over all frames of the utterance and over the label dimension. At each time step, the reference is the one-hot vector representing the output label, and the prediction is the posterior probability of that label obtained by processing the input frame with the acoustic model.
Unlabelled batches consist of audio only. Each audio utterance is processed through the acoustic model to produce context and target vectors for every time instant. Given the context vector at a time instant, the model is trained to identify the right target vector from a set of candidate representations using the contrastive loss.
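In the wav2vec2.0 form, and with symbols assumed here for illustration ($c_t$ for the context vector, $q_t$ for the target vector, $Q_t$ for the candidate set containing $q_t$ and the distractors, and $\kappa$ for the temperature used in wav2vec2.0), the per-frame contrastive loss can be sketched as:

```latex
\mathcal{L}_{c}(t) = -\log
\frac{\exp\big(\mathrm{sim}(c_t, q_t)/\kappa\big)}
     {\sum_{\tilde{q} \in Q_t} \exp\big(\mathrm{sim}(c_t, \tilde{q})/\kappa\big)},
\qquad
\mathrm{sim}(a, b) = \frac{a^{\top} b}{\lVert a \rVert \, \lVert b \rVert}
```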
The similarity between a context vector and a candidate target vector is measured by their cosine similarity. The contrastive loss is computed as in wav2vec2.0[wav2vec2.0], except that we use a linear layer instead of quantization, as done in [BigSSL]. During training, a labelled batch is selected with a fixed probability and an unlabelled batch otherwise. The final loss depends on the type of the selected batch.
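Assuming symbols chosen here for illustration ($B$ for the selected batch, $\mathcal{B}_l$ and $\mathcal{B}_u$ for the sets of labelled and unlabelled batches, and a weight $\alpha \in [0, 1]$ placed on the cross entropy term), the batch-dependent loss can be sketched as a convex combination. Section 5.3 describes one extreme of the weight as using only the contrastive loss and the other as using only the cross entropy loss on labelled batches; which extreme corresponds to which value in the original notation is an assumption here.

```latex
\mathcal{L}(B) =
\begin{cases}
\alpha \, \mathcal{L}_{ce}(B) + (1 - \alpha) \, \mathcal{L}_{c}(B), & B \in \mathcal{B}_l \\
\mathcal{L}_{c}(B), & B \in \mathcal{B}_u
\end{cases}
```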
Here the cross entropy and contrastive losses are computed on the entire batch of data. If the selected batch is labelled, the final loss is a weighted combination of the cross entropy and contrastive losses. If the selected batch is unlabelled, the final loss is the contrastive loss alone. A hyper-parameter determines the weight between the cross entropy and contrastive losses. During inference, we do not mask the features and use only the convTransformer model along with the projection layer, ignoring the rest of the model blocks.
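The training objective described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the dictionary layout of a batch, the temperature value, the averaging over frames, and the use of all other frames of the utterance as distractors are assumptions made for this example, as is the convention that `alpha` weights the cross entropy term.

```python
import math
import random


def cross_entropy(posteriors, labels):
    """Mean negative log-posterior of the reference label over all frames."""
    return -sum(math.log(p[y]) for p, y in zip(posteriors, labels)) / len(labels)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive(context, targets, temperature=0.1):
    """wav2vec2.0-style loss, averaged over frames.

    Simplification: for frame t, the targets of all other frames act as
    distractors (no masking and no distractor sampling).
    """
    total = 0.0
    for t, c in enumerate(context):
        logits = [cosine(c, q) / temperature for q in targets]
        log_norm = math.log(sum(math.exp(l) for l in logits))
        total += log_norm - logits[t]
    return total / len(context)


def wavft_loss(batch, alpha):
    """Weighted loss for a labelled batch, contrastive loss alone otherwise."""
    l_c = contrastive(batch["context"], batch["targets"])
    if batch.get("labels") is None:  # unlabelled batch
        return l_c
    l_ce = cross_entropy(batch["posteriors"], batch["labels"])
    return alpha * l_ce + (1.0 - alpha) * l_c


def pick_batch(labelled, unlabelled, p_labelled, rng=random):
    """Draw a labelled batch with probability p_labelled, else an unlabelled one."""
    pool = labelled if rng.random() < p_labelled else unlabelled
    return rng.choice(pool)
```

In a real system the posteriors, context vectors and target vectors would come from the acoustic model's forward pass; here they are plain lists so that the control flow of the objective is easy to follow.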
4 Experimental details
Data: We conduct experiments on two Indian languages, namely Bengali (Bn-IN) and Gujarati (Gu-IN), prominent languages spoken in the eastern and western parts of India, respectively. We use hours of Gujarati and hours of Bengali labelled data for finetuning. Approximately hours of labelled data from seven Indian languages, namely Indian English, Hindi, Gujarati, Tamil, Telugu, Bengali and Marathi, is used for supervised seed model training. We use and hours of Gujarati and Bengali unlabelled data, respectively. Unlabelled data is extracted from videos containing tutorials, news broadcasts and continuous conversations, such that they largely contain speech segments. The evaluation set consists of Gujarati and Bengali utterances from different scenarios such as dictation, call center, conversational speech and voice commands.
We conduct experiments on a conventional hybrid ASR system consisting of a separate acoustic model, language model and lexicon. -dimensional log mel filter bank (LFB) features are computed every milliseconds (ms). The adjacent LFB features are concatenated to obtain -dimensional features. They are further sampled with a sampling factor of to obtain a -dimensional feature vector for every . The acoustic model consists of convolutional transformer blocks. Each block consists of multi-head attention with relative positional embedding, a convolutional block and an FFN. Multi-head attention uses attention heads with inner dimension of . The convolutional module uses a kernel size of and the FFN has dimension of . We use the Adam optimizer with linear warm-up of the learning rate for of data, followed by linear decay for the rest of the data. Unlabelled data is processed through voice activity detection to obtain speech only segments, which are further processed to obtain dimensional LFB features for every . The 5-gram Gujarati and Bengali language models are trained on Gujarati and Bengali text corpora, respectively. The labelled batch sampling probability is set to in all the experiments.
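The learning rate schedule mentioned above (linear warm-up followed by linear decay) can be sketched as follows. The function name, the step-based parameterization and the example values are hypothetical; the warm-up fraction and peak learning rate used in the experiments are not restated here.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at a 0-indexed training step.

    The rate rises linearly to peak_lr over the first warmup_steps steps,
    then decays linearly, reaching zero at step == total_steps.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = total_steps - warmup_steps
    if remaining <= 0:
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / remaining)


# Hypothetical configuration: 100 updates, 10 of them warm-up.
schedule = [linear_warmup_decay(s, 100, 10, 1e-3) for s in range(100)]
```

Expressing the warm-up in steps rather than as a fraction of the data keeps the sketch free of rounding concerns; a fraction can be converted to steps before calling the function.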
5 Discussion of results
The discussion of results is organized as follows. We first compare different initialization methods and select the best one for further experiments. Next, we discuss results for WavFT and compare it with the corresponding baseline. We then discuss the results of tuning the hyperparameter that decides the relative importance of the cross entropy and contrastive losses. We then discuss our experiments on varying the amount of unlabelled data. Finally, we discuss the impact of using language specific versus multilingual unlabelled data during finetuning.
5.1 Comparison of initialization methods
Table 1 depicts the WER for acoustic models finetuned with the following three initialization methods: a) random initialization, b) unsupervised seed initialization, where the seed model is trained with unlabelled data only using the contrastive loss in wav2vec2.0 fashion[wav2vec2.0], and c) supervised seed initialization, where the seed model is trained with approximately hours of multilingual labelled data using the cross entropy loss. The acoustic model for each initialization method is trained by finetuning the initialized model with the corresponding labelled data. The Gu-IN and Bn-IN models are finetuned with and hours of corresponding labelled data. As seen from Table 1, unsupervised seed initialization shows and word error rate relative (WERR) reduction compared to random initialization on Gu-IN and Bn-IN locales, respectively. On Gu-IN, the supervised seed initialization has lower WER than the other two initialization methods, with WERR reduction over random initialization. It also shows WERR reduction on the Bn-IN locale compared to random initialization. Though unsupervised seed initialization performs better than supervised seed initialization on Bn-IN, we still use supervised seed initialization for both Bn-IN and Gu-IN for the sake of consistency. Note that WavFT is agnostic to the choice of seed model, and any seed model can be used. Other methods like self-supervised seed initialization can also be used for WavFT finetuning; however, we do not experiment with them, as the main focus of this work is to use unlabelled data during finetuning.
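For reference, the WERR figures quoted in this section follow the usual definition of relative reduction with respect to the baseline WER. A minimal sketch, with a hypothetical function name and hypothetical input values that are not results from this work:

```python
def werr(baseline_wer, new_wer):
    """Relative WER reduction, in percent, of new_wer with respect to baseline_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only: a drop from 20.0 to 18.0 WER.
example = werr(20.0, 18.0)
```

A positive value means the new system has lower WER than the baseline; a negative value means it is worse.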
5.2 WavFT results
Table 1: WER with random initialization, unsupervised seed initialization and supervised seed initialization.
Table 2 depicts WER for conventional finetuning (baseline) and the proposed WavFT for Gu-IN and Bn-IN locales. The baseline Gu-IN and Bn-IN acoustic models (AMs) are trained by finetuning the supervised seed model with Gu-IN and Bn-IN labelled data, respectively. The corresponding WavFT AMs are trained by finetuning the same supervised seed model with both labelled and unlabelled data using the WavFT approach discussed in section 3. WavFT shows and WERR reduction over conventional finetuning on Gu-IN and Bn-IN locales, respectively. The hyperparameter is set to for Gu-IN and for Bn-IN, as these values showed the best results. It is not clear why the gain is larger for Gu-IN, even though more unlabelled data is available for Bn-IN. We will investigate this in future work. We next discuss our experiments on hyperparameter tuning, on using different amounts of unlabelled data, and on using multilingual unlabelled data. We conduct these experiments on the Bn-IN locale, as it has more unlabelled data.
5.3 Results with varying loss weight
The hyper-parameter determines the weight between the cross entropy and contrastive losses in the training objective of the WavFT approach, as defined in Eqn. 3. At one extreme of this weight, only the contrastive loss is used for both labelled and unlabelled data, which is equivalent to finetuning with unlabelled data alone. As expected, the WER at this extreme is very high, as the model forgets to classify senones and only learns acoustic representations. Moving the weight towards the cross entropy term gives more importance to the classification task than to learning contextual acoustic representations, and vice versa. Table 3 shows WER with varying values of the weight on the Bn-IN locale. The lowest WER is seen for an intermediate value, indicating that a combination of cross entropy and contrastive loss is better for labelled data. At the other extreme, only the cross entropy loss is used for labelled data and the contrastive loss is used only for unlabelled data. This is identical to conventional finetuning, except for the use of unlabelled data and the corresponding loss applied to it. At this extreme, we see WERR reduction compared to conventional finetuning, thereby showing the importance of using unlabelled data.
5.4 Results with varying amount of unlabelled data
We conduct experiments on the Bn-IN language with varying amounts of unlabelled data while keeping the labelled data fixed to hours. We define the ratio as the number of hours of unlabelled data divided by the number of hours of labelled data. Table 4 shows WER with different values of this ratio. implies using hours of unlabelled data and corresponds to using the entire unlabelled data, as the unlabelled data is roughly times our labelled data. For every value of the ratio, the necessary number of hours of unlabelled data is randomly sampled from the entire set. For these experiments, we set the loss weight so that the contrastive loss is not applied to labelled data, as any other value would imply using all the audio from the labelled data to learn acoustic representations and may not truly reflect the impact of different amounts of unlabelled data. Hence, the WER with the entire unlabelled data matches the corresponding WER in Table 3. WER decreases as the amount of unlabelled data increases; however, the gains are not significant after . We did not expect improvements with hours of unlabelled data () and also we expected more improvements with increase in the unlabelled data. A possible reason for observing good gains even with a small amount of unlabelled data is that we kept the labelled batch sampling probability the same for all values of the ratio. Therefore, the model is updated with a similar number of labelled and unlabelled batches at any time, and even hours of unlabelled data () could be diverse enough to learn the acoustic representations.
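The random sampling of unlabelled data for a given ratio can be sketched as below. The function name, the `(utterance_id, duration_hours)` layout and the fixed seed are assumptions made for illustration; the sampling procedure used in the experiments is only described as random.

```python
import random


def sample_unlabelled(utterances, labelled_hours, ratio, seed=0):
    """Randomly pick unlabelled utterances until ratio * labelled_hours is reached.

    utterances: list of (utterance_id, duration_hours) pairs.
    Returns the chosen ids and the total number of hours selected.
    """
    budget = ratio * labelled_hours
    pool = list(utterances)
    random.Random(seed).shuffle(pool)  # seeded for reproducible subsets
    chosen, total = [], 0.0
    for utt_id, hours in pool:
        if total >= budget:
            break
        chosen.append(utt_id)
        total += hours
    return chosen, total


# Hypothetical pool: 40 half-hour recordings, 5 labelled hours, ratio 2.
pool = [(f"utt{i}", 0.5) for i in range(40)]
subset, hours = sample_unlabelled(pool, labelled_hours=5, ratio=2)
```

Because whole utterances are added until the budget is met, the selected total can exceed the budget by at most one utterance's duration.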
5.5 WavFT results with locale specific and multilingual unlabelled data
Table 5 shows results for Bn-IN AMs finetuned in WavFT fashion with locale specific and multilingual unlabelled data. The locale specific WavFT model is trained with hours of Bn-IN unlabelled data. The multilingual WavFT model is trained with hours of unlabelled data collected from seven Indian languages. hours of Bengali data is part of the hours of multilingual unlabelled data. In both models, the same hours of labelled data is used. The batch sampling probability and the hyperparameter are set to for both models. The WER results in Table 5 suggest that it is better to use locale specific unlabelled data than multilingual unlabelled data.
In this work, we propose WavFT, an acoustic model finetuning approach that uses labelled and unlabelled data. The method selects a labelled or unlabelled batch according to a fixed sampling probability and updates the model with an objective function that is a weighted combination of the cross entropy and contrastive losses for a labelled batch, and the contrastive loss alone for an unlabelled batch. Our approach shows and WERR reduction over conventional finetuning on Gu-IN and Bn-IN locales, respectively. WavFT needs a smaller amount of unlabelled data than most unsupervised and self-supervised learning methods, which use unlabelled data in pretraining, and is hence computationally less expensive. Our approach can leverage any existing seed model. It is readily usable in production scale acoustic models, as it differs from conventional finetuning only in the use of unlabelled data and the accordingly modified objective function. In future, we will conduct experiments with different values of the sampling probability, explore newer unsupervised objective functions and newer model architectures, and test the efficacy of the approach on end-to-end ASR systems.