CLSRIL-23: Cross Lingual Speech Representations for Indic Languages

by   Anirudh Gupta, et al.

We present CLSRIL-23, a self-supervised audio pre-trained model that learns cross-lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across all languages. We compare the language-wise loss during pretraining to assess the effects of monolingual and multilingual pretraining. Performance on a downstream fine-tuning task for speech recognition is also compared, and our experiments show that multilingual pretraining outperforms monolingual pretraining, both in learning speech representations that encode the phonetic similarity of languages and in performance on downstream tasks. A relative decrease of over 5% in WER and over 9% in CER is observed when the multilingual pretrained model is used for finetuning in Hindi. All the code and models are also open sourced. CLSRIL-23 is trained on 23 languages and almost 10,000 hours of audio data to facilitate research in speech recognition for Indic languages. We hope that new state-of-the-art systems will be created using the self-supervised approach, especially for low-resource Indic languages.


1 Introduction

Speech recognition has made remarkable progress in the past few years, especially after the advent of deep learning hinton. End-to-end (E2E) models have simplified the modelling process, but they are also known for requiring huge amounts of data, which is a particular problem for low-resource languages besacier2014automatic.

This is of particular importance for countries with many languages and dialects, such as India, which has 22 official languages and an additional 1500 minor languages and dialects. Apart from a few major languages, most of these are low resource, which makes it more difficult to develop speech-related technologies bourlard2011current.

E2E networks are an attractive choice for multilingual ASR since they combine the acoustic, pronunciation and lexicon models into a single network. One way to handle multiple languages with a single model is to train a multilingual ASR model: we take the union of the character sets of all languages and jointly train one model on all of them. Even with that approach, however, a huge amount of data is needed per language.

In recent years, self-supervised learning has emerged as a new paradigm in which representations are learnt from the data itself and then fine-tuned on downstream tasks. This approach has been widely successful in natural language processing (NLP) applications devlin2019bert, peters-etal-2018-deep and is an active area of research in other fields.

In the past few years, self-supervised learning has also been actively studied for speech recognition. In jiang2019improving the authors perform unsupervised pretraining with masked predictive coding using a Transformer model. Most of the work in this space has been on monolingual speech recognition chung2018speech2vec, tjandra2019vqvae, jiang2019improving, harwath2020learning. Our approach is based on wav2vec 2.0 baevski2020wav2vec, the details of which are explained in the following sections. An approach that uses multiple languages in pre-training and fine-tuning is described in conneau2020unsupervised, which also shows that cross-lingual pre-training outperforms monolingual pre-training. We extend the work of rivire2020unsupervised and conneau2020unsupervised by pretraining only on Indic languages, with the aim of improving speech recognition performance on those languages.

Languages spoken in the South Asian region belong to at least four major language families: Indo-European (most of which belong to its sub-branch Indo-Aryan), Dravidian, Austro-Asiatic, and Sino-Tibetan. Almost one third of the mother tongues in India belong to the Indo-Aryan family of languages, and the Dravidian languages form the second major linguistic group of the country. Since most of the languages we have used belong to common language families, we aim to utilise language similarity to aid representation learning for low-resource languages.

2 Modeling Approach

The method we use masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations, which are learned jointly baevski2020wav2vec; it shows that powerful representations can be learnt from speech audio alone. The approach encodes raw speech audio via a multi-layer convolutional network and then masks the resulting latent speech representations, similar to devlin2019bert. The latent representations are fed to a Transformer network to build contextualized representations, and the model is trained via a contrastive task in which the true latent is to be distinguished from distractors. The discrete speech units used to represent the latent representations in the contrastive task are learnt with a Gumbel softmax jang2017categorical.

2.1 Pre-training

The main body of the architecture consists of a CNN-based feature encoder, a Transformer-based sequence network and a quantization module. The feature encoder maps the raw waveform $X$ to latent speech representations $z_1, \dots, z_T$. These representations are then fed to a Transformer block to generate context representations $c_1, \dots, c_T$, capturing the information in the entire sequence. The quantization module is used to discretize the latent speech representations $z$ into $q$. We are given $G$ codebooks with $V$ entries each, $e \in \mathbb{R}^{V \times d/G}$, where each row corresponds to an entry in the codebook. We choose one entry from each codebook and concatenate the resulting vectors $e_1, \dots, e_G$. This concatenated vector is then mapped to $q \in \mathbb{R}^f$ using a linear transform. The Gumbel softmax enables choosing discrete codebook entries in a fully differentiable way, and the probability of choosing the $v$-th codebook entry of group $g$ is

$$p_{g,v} = \frac{\exp\left((l_{g,v} + n_v)/\tau\right)}{\sum_{k=1}^{V} \exp\left((l_{g,k} + n_k)/\tau\right)} \tag{1}$$

where $l_{g,v}$ are the logits over codebook entries computed from the feature encoder output, $\tau$ is a non-negative temperature, $n = -\log(-\log(u))$ and $u$ are uniform samples from $\mathcal{U}(0, 1)$.
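As a rough illustration of the Gumbel softmax selection described above, the following sketch in plain Python draws Gumbel noise from uniform samples, adds it to the logits of one codebook group, and applies a temperature-scaled softmax. The logits, the temperature value and the function name are illustrative assumptions, not values from the actual training configuration.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot probabilities over the entries of one codebook group."""
    noise = []
    for _ in logits:
        u = rng.random()
        while u == 0.0:  # rng.random() may return 0.0; avoid log(0)
            u = rng.random()
        # Gumbel(0, 1) sample obtained from a uniform sample.
        noise.append(-math.log(-math.log(u)))
    scores = [(l + n) / temperature for l, n in zip(logits, noise)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a group with four codebook entries.
rng = random.Random(0)
probs = gumbel_softmax([2.0, 0.5, -1.0, 0.1], temperature=0.5, rng=rng)
# Index of the entry with the largest probability (the discrete choice).
chosen = max(range(len(probs)), key=probs.__getitem__)
```

Lower temperatures make the probabilities more peaked, so the relaxed choice approaches a discrete selection while remaining differentiable with respect to the logits.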

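The speech representations during pre-training are learnt by a contrastive task $\mathcal{L}_m$. This is augmented by a codebook diversity loss $\mathcal{L}_d$:

$$\mathcal{L} = \mathcal{L}_m + \alpha \mathcal{L}_d \tag{2}$$

where $\alpha$ is a tuned hyperparameter and

$$\mathcal{L}_m = -\log \frac{\exp\left(\mathrm{sim}(c_t, q_t)/\kappa\right)}{\sum_{\tilde{q} \in Q_t} \exp\left(\mathrm{sim}(c_t, \tilde{q})/\kappa\right)} \tag{3}$$

is the contrastive loss that makes the model distinguish the true quantized representation $q_t$ from the latent distractors in the candidate set $Q_t$, which contains $q_t$ and the distractors $\tilde{q}$. In equation 3, $\kappa$ is a temperature and $\mathrm{sim}(a, b) = a^\top b / (\lVert a \rVert \lVert b \rVert)$ is the cosine similarity. The diversity loss $\mathcal{L}_d$ is designed to increase use of the quantized codebook representations.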
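The two loss terms can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the contrastive term is written as a temperature-scaled softmax over cosine similarities, and the diversity term uses the negative-entropy form given in baevski2020wav2vec; the temperature value `kappa=0.1`, the toy vectors and all function names are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, kappa=0.1):
    """Negative log-probability of the true target among target + distractors."""
    candidates = [target] + list(distractors)
    scores = [cosine(context, q) / kappa for q in candidates]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[0]


def diversity_loss(avg_probs_per_group):
    """Negative entropy of averaged codebook probabilities, scaled by 1/(G*V)."""
    num_groups = len(avg_probs_per_group)
    num_entries = len(avg_probs_per_group[0])
    total = 0.0
    for probs in avg_probs_per_group:
        total += sum(p * math.log(p) for p in probs if p > 0.0)
    return total / (num_groups * num_entries)


def total_loss(l_m, l_d, alpha):
    """Weighted sum of the contrastive and diversity terms."""
    return l_m + alpha * l_d


# Toy check: the loss is small when the context matches the true target.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

Under this form, the diversity term is lowest when every codebook entry is used equally often, which is what pushes the model to use the whole codebook.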

2.2 Fine-tuning

Pre-trained models are fine-tuned by adding a fully connected layer on top of the context network, with the size of the output layer equal to the vocabulary of the task. Models are optimized using a CTC loss Graves06connectionisttemporal. During fine-tuning, only the weights of the Transformer module are updated; those of the feature encoder are kept fixed.
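To make the role of the output layer concrete, the sketch below shows best-path (greedy) CTC decoding in plain Python: the highest-scoring label is taken per frame, repeated labels are collapsed, and the blank symbol used by the CTC formulation is dropped. The three-symbol vocabulary and the frame scores are hypothetical and chosen only for illustration.

```python
def ctc_greedy_decode(frame_scores, vocab, blank=0):
    """Collapse repeated per-frame labels and drop blanks (best-path decoding)."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    output = []
    previous = None
    for idx in best:
        # Emit a symbol only when it differs from the previous frame's label
        # and is not the blank symbol.
        if idx != previous and idx != blank:
            output.append(vocab[idx])
        previous = idx
    return "".join(output)


# Hypothetical vocabulary: index 0 is the CTC blank.
vocab = ["<blank>", "a", "b"]
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
    [0.9, 0.05, 0.05],
]
decoded = ctc_greedy_decode(frames, vocab)
```

The blank between the second and fourth frames is what allows the same character to be emitted twice in a row.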

3 Training Data

All our data has been processed through the open-sourced framework Vakyansh. The basic steps of the process are:

  • Download the audio and convert it to wav format with a fixed sample rate, number of channels and bit depth per sample.

  • Split each audio file into voiced chunks using voice activity detection. We make sure that every voiced chunk falls within a fixed range of durations.

  • To detect and reject noisy samples we use the signal-to-noise ratio (SNR) approach described by wadasnr. Any audio sample with an SNR below a fixed threshold is treated as noise and excluded from the training data.

  • We perform speaker and gender identification on our audio data. A high-level representation of the voice is learnt using a voice encoder based on wan2020generalized. For each audio sample the voice encoder creates a fixed-dimensional encoding that summarizes characteristics of the spoken voice. For gender identification we train a support vector machine on these embeddings with manually labelled data. Our goal for speaker identification was to get a sense of the number of speakers in a particular audio source. To estimate it, we use hierarchical clustering to group embeddings that are similar in terms of cosine similarity. The number of speakers is then the number of clusters.
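The speaker-count estimate in the last step can be sketched as a small agglomerative clustering over voice embeddings. This is a minimal illustration and not the Vakyansh implementation: it assumes average linkage, a hypothetical cosine-distance threshold of 0.3, and toy two-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def count_speakers(embeddings, threshold=0.3):
    """Merge the closest clusters until none are closer than the threshold."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise cosine distance between two clusters.
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are treated as distinct speakers
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return len(clusters)


# Toy embeddings: two pairs of near-identical voices and one distinct voice.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.98], [-1.0, 0.02]]
n_speakers = count_speakers(toy)
```

The threshold controls how aggressively embeddings are merged, so in practice it would need to be tuned on audio with a known number of speakers.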

Assamese Bengali Bodo Dogri English Gujarati Hindi Kannada Total
Maithili Konkani Malayalam Manipuri Marathi Nepali Odia Punjabi
Santali Tamil Telugu Urdu Kashmiri Sanskrit
Table 1: Language wise duration of train and valid sets in hours

Table 1 reports the total duration of our audio data, almost 10,000 hours across 23 Indic languages, together with its split into training and validation sets. To compare cross-language exchange, we also pretrain a model using only the Hindi data.

3.1 Finetuning data and Language Model

For our current work we use labelled data only for Hindi. The labelled data is a combination of purchased data and transcripts generated from commercial speech-to-text engines. We normalize the text before any finetuning: punctuation is removed and numbers are converted to words. For language modelling we use a statistical n-gram language model based on KenLM heafield-2011-kenlm, combined with beam search decoding. The text for the language model consists of the transcribed speech and open-sourced Hindi text data.

4 Model Training

All models are implemented using the fairseq library ott2019fairseq.

4.1 Pretraining

We use the base architecture of the wav2vec 2.0 framework, which has 12 Transformer blocks with a model dimension of 768 and 8 attention heads. Pretraining is restored from a checkpoint trained on LibriSpeech data. We chose the base architecture over the large one since it has considerably fewer parameters and is therefore faster to train. The base architecture also has a much lower inference time when finetuned for speech recognition. Audio samples are cropped to a fixed maximum number of frames, and dropout is applied. We optimize using Adam kingma2017adam, where the initial steps are used as warmup updates for the learning rate, after which it is linearly decayed. The diversity loss in equation 2 is weighted by a tuned hyperparameter, and the quantization module uses multiple codebooks with the same number of entries each. We train on 16 Nvidia A100 GPUs when pretraining on 23 languages and on Tesla V100 GPUs when training on Hindi. Training was continued until we no longer saw any gain in code perplexity. More details about training can be found in the training logs.
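The learning-rate schedule described above, a warmup phase followed by linear decay, can be sketched as a simple function of the update step. The peak learning rate, warmup length and total number of steps used below are hypothetical placeholders, and decaying to zero at the final step is an assumption of this sketch.

```python
def learning_rate(step, peak_lr, warmup_steps, total_steps, end_lr=0.0):
    """Linear warmup to peak_lr, then linear decay to end_lr at total_steps."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        return end_lr
    # Decay: fraction of the decay phase that is still ahead.
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return end_lr + (peak_lr - end_lr) * remaining


# Hypothetical settings, chosen only to keep the arithmetic easy to follow.
schedule = [
    learning_rate(s, peak_lr=1e-3, warmup_steps=100, total_steps=1100)
    for s in (0, 50, 100, 600, 1100)
]
```

With these placeholder settings the rate rises from zero to the peak over the first 100 steps, falls back to half the peak at step 600, and reaches zero at step 1100.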

4.2 Finetuning

To finetune a model on the downstream speech recognition task, a fully connected layer is added on top of the Transformer block, whose output labels are the characters of the respective language. While finetuning, the weights of the feature encoder are kept fixed. We finetune until we reach the lowest WER on the validation set. Part of the feature encoder output is masked for data augmentation, a technique similar to SpecAugment that is described in baevski2020wav2vec. All finetuning is performed on Tesla V100 GPUs.

Figure 1: Language specific pretraining loss on valid set

5 Results

We first demonstrate that multilingual pretraining outperforms monolingual pretraining by calculating the language-specific loss for all languages. Our experiments also show a decrease in WER when the multilingual pretrained model is used instead of the monolingual one.

5.1 Effectiveness of Cross Lingual Representation Learning

We calculate the language-wise loss (contrastive and codebook) in both scenarios: when all 23 languages are used in pretraining and when only Hindi is used. Figure 1 shows that for all languages apart from Hindi, the loss is lower in the multilingual pretraining case. This is expected, since low-resource languages benefit from multilingual pretraining. A lower loss also indicates that more meaningful speech representations are being learnt, as shown in conneau2020unsupervised.

We also analyze the shared discrete speech representations for different languages. For each language, we sample utterances and extract the quantized representations from the pretrained model. These are normalized per language to obtain fixed-size vectors. K-means clustering is performed on these vectors, and their dimensionality is then reduced by PCA.

Figure 2: Quantized speech representations

From Figure 2 we see that most phonetically similar languages are clustered together. The colours correspond to the clusters obtained by K-means. We perform clustering before PCA to avoid any information loss. Assamese, Bengali and Odia are in one cluster. Hindi, Sanskrit, Urdu and Punjabi also fall in one cluster. Most of the South Indian languages are clustered together as well, apart from Kannada. Indian-accented English is not a monolith: in our training data, English contains accents from different parts of India. As a result, it lies far from many of the major languages and is grouped with other low-resource languages. The purpose of this plot is not to recover the underlying language families, but to check whether pretraining was able to learn phonetic information from a language.

5.2 Effect on finetuning

We finetune on Hindi using pretrained checkpoints from monolingual and multilingual pretraining. We see that even the high-resource language, in our case Hindi, benefits from multilingual pretraining. On a separate test set, we see a relative decrease of about 5.7% in WER (26.2 to 24.7) and about 9.6% in CER (9.4 to 8.5) when decoding is done without a language model. Table 2 also shows that WER decreases by about 3.4% (16.26 to 15.7) and CER by about 9.1% (8.9 to 8.09) when a language model is used while decoding.

Pretraining   Finetuning  Decoding  WER (%)  CER (%)
monolingual   Hindi       Viterbi   26.2     9.4
multilingual  Hindi       Viterbi   24.7     8.5
monolingual   Hindi       KenLM     16.26    8.9
multilingual  Hindi       KenLM     15.7     8.09
Table 2: Effect of multilingual and monolingual pretraining on WER and CER
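The relative improvements implied by Table 2 can be checked with a short calculation. The sketch below takes the monolingual result as the baseline for each decoding method; the helper name `relative_reduction` is introduced here only for illustration.

```python
def relative_reduction(baseline, improved):
    """Relative decrease, in percent, from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline


# (monolingual, multilingual) error rates taken from Table 2.
table2 = {
    "Viterbi": {"WER": (26.2, 24.7), "CER": (9.4, 8.5)},
    "KenLM": {"WER": (16.26, 15.7), "CER": (8.9, 8.09)},
}

for decoding, metrics in table2.items():
    for name, (mono, multi) in metrics.items():
        change = relative_reduction(mono, multi)
        print(f"{decoding} {name}: {change:.1f}% relative reduction")
```

This gives roughly 5.7% (WER) and 9.6% (CER) for Viterbi decoding, and roughly 3.4% (WER) and 9.1% (CER) when decoding with KenLM.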

6 Conclusion

In this work we present a multilingual pretrained model for Indic languages in which representations are learnt from raw waveforms. Our results indicate that multilingual pretraining outperforms monolingual pretraining, both in the speech representations learnt during pre-training and in finetuning performance on a downstream speech recognition task. We also show that the model is able to encode phonetic similarity in its speech representations. We hope this work kick-starts the development of high-quality speech recognition for Indic languages, especially low-resource ones. All our code and models are open source. Our pretrained model can also be used for unsupervised speech recognition, especially for languages for which no labelled data is available, using the methods described in baevski2021unsupervised. We plan to report finetuning results on all Indic languages in the future.

7 Acknowledgements

All authors gratefully acknowledge Ekstep Foundation for supporting this project financially and providing infrastructure. A special thanks to Dr. Vivek Raghavan for constant support, guidance and fruitful discussions. We also thank Rajat Singhal, Heera Ballabh, Niresh Kumar R, Sreejith V, Soujyo Sen and Amulya Ahuja for automated data pipelines and infrastructure support for data processing and model training.