Speech recognition has made remarkable progress in the past few years, especially after the advent of deep learning hinton. End-to-end (E2E) models have simplified the modelling process, but they are known to require large amounts of data, which is especially problematic for low-resource languages besacier2014automatic.
This is of particular importance for countries with many languages and dialects, such as India, which has 22 official languages and an additional 1500 minor languages and dialects. Apart from a few major languages, most are low resource, which makes it difficult to develop speech technologies for them bourlard2011current.
E2E networks are an attractive choice for multilingual ASR since they combine the acoustic, pronunciation and lexicon models into a single network. One way to handle multiple languages with a single model is to train a multilingual ASR model: we take a union over the character sets of all languages and jointly train one model on all of them. Even with this approach, however, a large amount of data is needed per language.
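The union over character sets mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_joint_vocab`, the reserved `<blank>` symbol at id 0, and the two-language toy corpus are all assumptions made for the example.

```python
def build_joint_vocab(transcripts_by_language):
    """Map every character seen in any language to one shared integer id."""
    chars = set()
    for lines in transcripts_by_language.values():
        for line in lines:
            chars.update(line)
    # Sort for a deterministic id assignment; id 0 is reserved for a blank symbol.
    vocab = {"<blank>": 0}
    for index, ch in enumerate(sorted(chars), start=1):
        vocab[ch] = index
    return vocab


# Toy corpus (hypothetical): one transcript per language.
corpus = {"hi": ["नमस्ते"], "en": ["hello"]}
vocab = build_joint_vocab(corpus)
```

A single output layer sized `len(vocab)` can then serve all languages, at the cost of a larger label set than any monolingual model would need.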
In recent years self-supervised learning has emerged as a new paradigm in which representations are learnt from the data itself and then fine-tuned on several downstream tasks. This approach has been widely successful in natural language processing (NLP) applications devlin2019bert, peters-etal-2018-deep and is an active area of research in other fields.
In the past few years self-supervised learning has been actively studied for speech recognition. In jiang2019improving the authors perform unsupervised pretraining with masked predictive coding using a Transformer model. Most of the work in this space has also been on monolingual speech recognition chung2018speech2vec, tjandra2019vqvae, jiang2019improving, harwath2020learning. Our approach is based on wav2vec baevski2020wav2vec, the details of which are explained in the coming sections. An approach that uses multiple languages in pre-training and fine-tuning is described in conneau2020unsupervised, which also shows that cross-lingual pre-training outperforms monolingual pre-training. We extend the work of rivire2020unsupervised and conneau2020unsupervised by pretraining only on Indic languages, so that speech recognition performs better on Indic languages.
Languages spoken in the South Asian region belong to at least four major language families: Indo-European (most of which belong to its sub-branch Indo-Aryan), Dravidian, Austro-Asiatic, and Sino-Tibetan. Almost one third of the mother tongues in India ( languages) belong to the Indo-Aryan family, spoken by % of Indians. The Dravidian languages, in number, form the second major linguistic group of the country (% ) (https://www.education.gov.in). Since most of the languages we have used belong to common language families, we aim to utilise language similarity to aid representation learning for low-resource languages.
2 Modeling Approach
The method we use masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations, which are jointly learned baevski2020wav2vec. That work shows that powerful representations can be learnt from speech audio alone. The approach encodes raw speech audio via a multi-layer convolutional network and then masks the resulting latent speech representations, similar to devlin2019bert. The latent representations are fed to a Transformer network to build contextualized representations, and the model is trained via a contrastive task in which the true latent must be distinguished from distractors. The discrete speech units that represent the latents in the contrastive task are learnt with a Gumbel softmax jang2017categorical.
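The main body of the architecture consists of a CNN-based feature encoder, a Transformer-based sequence network and a quantization module. The feature encoder $f: \mathcal{X} \mapsto \mathcal{Z}$ maps the raw waveform $\mathcal{X}$ to latent speech representations $z_1, \dots, z_T$. These representations are then fed to a Transformer block $g: \mathcal{Z} \mapsto \mathcal{C}$ to generate context representations $c_1, \dots, c_T$, capturing the information in the entire sequence. The quantization module is used to discretize the latent speech representations into $q_t$. We are given $G$ codebooks with $V$ entries each, where each codebook is a matrix $e \in \mathbb{R}^{V \times d/G}$ and each row corresponds to an entry in the codebook. We choose one entry from each codebook and concatenate the resulting vectors $e_1, \dots, e_G$. This concatenated vector is then mapped by a linear transformation to $q \in \mathbb{R}^{f}$. With logits $l \in \mathbb{R}^{G \times V}$ computed from the feature encoder output, the probabilities for choosing the $v$-th codebook entry of group $g$ are

$$p_{g,v} = \frac{\exp\left((l_{g,v} + n_v)/\tau\right)}{\sum_{k=1}^{V} \exp\left((l_{g,k} + n_k)/\tau\right)} \qquad (1)$$

where $\tau$ is a non-negative temperature, $n = -\log(-\log(u))$, and $u$ are uniform samples from $\mathcal{U}(0, 1)$.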
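As an illustration of the Gumbel softmax selection probabilities for a single codebook group, the following is a minimal stdlib-only sketch. The function name `gumbel_softmax_probs`, the example logits, and the temperature value are assumptions chosen for the example, not values from this work.

```python
import math
import random


def gumbel_softmax_probs(logits, tau, rng):
    """Softmax over (logit + Gumbel noise) / tau for one codebook group."""
    noisy = []
    for logit in logits:
        # Clamp away from 0 so that log(u) is always defined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a codebook group with three entries.
probs = gumbel_softmax_probs([1.0, 2.0, 3.0], tau=2.0, rng=random.Random(0))
```

Lower temperatures push the distribution towards a one-hot choice, while higher temperatures keep it closer to uniform.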
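The speech representations during pre-training are learnt through a contrastive task $\mathcal{L}_m$. This is augmented by a codebook diversity loss $\mathcal{L}_d$:

$$\mathcal{L} = \mathcal{L}_m + \alpha \mathcal{L}_d \qquad (2)$$

where $\alpha$ is a tuned hyperparameter and

$$\mathcal{L}_m = -\log \frac{\exp\left(\mathrm{sim}(c_t, q_t)/\kappa\right)}{\sum_{\tilde{q} \in Q_t} \exp\left(\mathrm{sim}(c_t, \tilde{q})/\kappa\right)} \qquad (3)$$

$$\mathcal{L}_d = \frac{1}{GV} \sum_{g=1}^{G} -H(\bar{p}_g) = \frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} \bar{p}_{g,v} \log \bar{p}_{g,v} \qquad (4)$$

Here $\mathrm{sim}$ denotes cosine similarity, $\kappa$ is a temperature, $Q_t$ is the candidate set containing $q_t$ and the distractors, and $\bar{p}_g$ is the softmax distribution over the entries of codebook $g$ averaged over a batch of utterances, following baevski2020wav2vec.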
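The two loss terms can be sketched for a single masked time step as below. This follows the wav2vec 2.0 formulation cited in the text and is not taken from the authors' code. The function names, the toy vectors, the temperature `kappa=0.1` and the weight `alpha=0.1` are assumptions for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, q_true, distractors, kappa):
    """-log softmax score of the true quantized latent among all candidates."""
    scores = [cosine(context, q) / kappa for q in [q_true] + distractors]
    peak = max(scores)
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_denominator - scores[0]


def diversity_loss(avg_probs):
    """Negative entropy of avg_probs[g][v], averaged over G groups and V entries."""
    groups, entries = len(avg_probs), len(avg_probs[0])
    neg_entropy = sum(p * math.log(p) for row in avg_probs for p in row if p > 0)
    return neg_entropy / (groups * entries)


# Toy values (hypothetical): the context vector matches the true latent.
l_m = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]], kappa=0.1)
l_d = diversity_loss([[0.5, 0.5]])
total_loss = l_m + 0.1 * l_d  # 0.1 stands in for the tuned weight
```

The diversity term is lowest when codebook entries are used uniformly, which is why adding it encourages the model to use the whole codebook.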
Pre-trained models are fine-tuned by adding a fully connected layer on top of the context network, with the size of the output layer equal to the vocabulary of the task. Models are optimized using a CTC loss Graves06connectionisttemporal. During fine-tuning, only the weights of the Transformer module are updated, while those of the feature encoder are kept fixed.
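To make the CTC output convention concrete, here is a minimal sketch of greedy CTC decoding, which turns per-frame label predictions into a character string by merging repeats and dropping blanks. It illustrates the label convention only and is not the decoder used in this work. The function name, the blank id of 0, and the toy label sequence are assumptions.

```python
def ctc_greedy_decode(frame_label_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated labels, then remove blank labels."""
    decoded = []
    previous = None
    for label in frame_label_ids:
        if label != previous and label != blank_id:
            decoded.append(id_to_char[label])
        previous = label
    return "".join(decoded)


# Hypothetical per-frame argmax labels; the blank between the two 1s keeps both.
text = ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0], {1: "a", 2: "b"})
```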
3 Training Data
All our data has been processed through the open-source framework Vakyansh (https://open-speech-ekstep.github.io/). The basic steps of the process are:
Download and convert audio to wav format with sample rate , number of channels and bit rate per sample of .
We split each audio file into voiced chunks using voice activity detection (https://webrtc.org/). We make sure that all the voiced chunks lie between and seconds.
To detect and reject noisy samples we use the signal-to-noise ratio (SNR) approach described by wadasnr. We consider any audio sample below an SNR value of as noise and do not include it in the training data.
We perform speaker and gender identification on our audio data. A high-level representation of the voice is learnt using a voice encoder based on wan2020generalized. For each audio sample the voice encoder creates a dimensional encoding that summarizes the characteristics of the spoken voice. For gender identification we train a support vector machine on the embeddings with manually labelled data. Our goal for speaker identification was to get a sense of the number of speakers in a particular audio source. To estimate it, we use a hierarchical clustering approach to cluster embeddings that are similar in the sense of cosine similarity. The number of speakers is then the number of clusters.
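The speaker-count estimate in the last step can be sketched as below. This is a simplified stand-in, not the pipeline's implementation: it uses single-linkage agglomerative clustering cut at a cosine-distance threshold, computed as connected components with a union-find structure. The linkage choice, the function names, the `max_distance` value, and the two-dimensional toy embeddings are all assumptions.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def count_speakers(embeddings, max_distance):
    """Number of single-linkage clusters when merging pairs within max_distance."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_distance(embeddings[i], embeddings[j]) <= max_distance:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(embeddings))})


# Toy embeddings (hypothetical): two voices, two samples each.
samples = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
num_speakers = count_speakers(samples, max_distance=0.1)
```

The threshold controls how aggressively samples are merged, so the resulting count is an estimate rather than an exact speaker tally.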
From table it can be seen that we have a total of hours of audio data, out of which hours is training data and hours is validation data, in 23 Indic languages overall. To study cross-lingual transfer, we also pretrain a model using only hours of Hindi data.
3.1 Finetuning data and Language Model
For our current work we use labelled data only for Hindi. The labelled data is a combination of purchased data and transcripts generated by commercial speech-to-text engines. We normalize the text before any finetuning: punctuation is removed and numbers are converted to word format. For language modelling we use a statistical language model based on KenLM heafield-2011-kenlm. We use a -gram language model with a beam size of . The text for the language model consists of the transcribed speech text and the open-source Hindi data available at https://indicnlp.ai4bharat.org/corpora/.
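The text normalization described above might look like the following sketch. It is not the authors' code. The function name `normalize_text` is hypothetical, and the number-to-words conversion is left to a caller-supplied function because the source does not say how it is done. The one-entry lookup table is a toy stand-in.

```python
import re
import unicodedata


def normalize_text(text, number_to_words):
    """Spell out digit runs, replace punctuation with spaces, squeeze whitespace."""
    # Replace each run of digits with its spoken form.
    text = re.sub(r"\d+", lambda m: " " + number_to_words(m.group()) + " ", text)
    # Unicode categories starting with "P" are punctuation, including the danda.
    kept = [" " if unicodedata.category(ch).startswith("P") else ch for ch in text]
    return " ".join("".join(kept).split())


# Toy lookup (hypothetical) standing in for a real number-to-words converter.
toy_numbers = {"2": "दो"}
normalized = normalize_text("मेरे पास 2 किताबें हैं।", lambda digits: toy_numbers[digits])
```

Devanagari vowel signs are combining marks rather than punctuation, so they are preserved by the category check.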
4 Model Training
All models are implemented using the fairseq library ott2019fairseq.
We use the base architecture of the wav2vec framework. It has blocks with a model dimension of and attention blocks. Pretraining is initialized from a checkpoint trained on hours of LibriSpeech data. We chose the base architecture over the large one since it is faster to train, having almost half the parameters of the large architecture. The base architecture also has a much lower inference time when finetuned for speech recognition. We crop audio samples at audio frames or seconds and use a dropout of . The model is trained for almost steps and starts with a learning rate of . We optimize using Adam kingma2017adam, where the first steps are used as warmup updates for the learning rate, after which it is linearly decayed. A weight of is used for the diversity loss in equation 2. We use codebooks with entries each for the quantization module. We train on 16 Nvidia A100 GPUs when pretraining on languages and on Tesla V100 GPUs when training on Hindi. It took around hours to reach a stage where we did not see any further gain in code perplexity. More details about training can be seen in the training logs (https://wandb.ai/harveenchadha/EKSTEP-PRETRAINING?workspace=user-agupta12).
To finetune a model on the downstream speech recognition task, a fully connected layer is added on top of the Transformer block, with the characters of the respective language as output labels. While finetuning, the weights of the feature encoder are fixed. We finetune until we reach the lowest WER on the validation set. Some outputs of the feature encoder are masked for data augmentation, a technique similar to SpecAugment that is detailed in baevski2020wav2vec. All finetuning is performed on Tesla V100 GPUs.
We first demonstrate that multilingual pretraining outperforms monolingual pretraining by calculating the language-specific loss for all languages. Our experiments also show that WER decreases when a multilingually pretrained model is used instead of a monolingual one.
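The warmup-then-linear-decay learning rate schedule can be sketched as below. The source does not state the start or end values, so the linear warmup from zero and the decay to zero at the final step are assumptions. The function name and all numeric values in the example are placeholders, not the settings used in this work.

```python
def learning_rate(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Placeholder settings (hypothetical): 10 warmup steps out of 110 total.
schedule = [learning_rate(s, peak_lr=1.0, warmup_steps=10, total_steps=110)
            for s in (0, 5, 10, 60, 110)]
```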
5.1 Effectiveness of Cross Lingual Representation Learning
We calculate the language-wise loss (contrastive and codebook) on the training audio in both scenarios: when we have languages in pretraining and when we have just Hindi. Figure 1 shows that for all languages apart from Hindi, the loss is lower in the multilingual pretraining case. This is expected since low-resource languages benefit from multilingual pretraining. A lower loss also indicates that more meaningful speech representations are being learnt, as shown in conneau2020unsupervised.
We also analyze shared discrete speech representations across languages. For each language, we sample utterances and extract the quantized representations of the pretrained model. These vectors are normalized for each language to obtain vectors of size . K-means clustering is performed on these vectors, and their dimensionality is then reduced with PCA.
From Figure 2 we see that most phonetically similar languages are clustered together. The colours correspond to the clusters obtained by K-means. We perform clustering before PCA to avoid any information loss. Assamese, Bengali and Odia are in one cluster. Hindi, Sanskrit, Urdu and Punjabi are also in one cluster. Most of the South Indian languages are clustered together as well, apart from Kannada. English in an Indian accent is not a monolith: in our training data, English contains accents from different parts of India. As a result, it lies far from many major languages and is grouped with other low-resource languages. The purpose of this plot is not to recover the underlying language families, but to check whether pretraining was able to learn phonetic information about each language.
5.2 Effect on finetuning
We finetune on using pretrained checkpoints from monolingual and multilingual pretraining. We see that even the high-resource language, in our case Hindi, benefits from multilingual pretraining. On a separate hour test set we see a % decrease in WER and a % decrease in CER when decoding is done without a language model. It can also be seen from table 2 that WER decreases by % and CER by % when a language model is used while decoding.
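For reference, WER and CER are edit-distance ratios, and a decrease between two systems can be expressed as a relative percentage. The sketch below shows one common way to compute them. It is not the evaluation script used here, the source does not say whether its reported decreases are relative or absolute, and counting characters as Unicode code points (rather than grapheme clusters) is a simplifying assumption. All names and example values are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_decrease_percent(baseline, improved):
    return 100.0 * (baseline - improved) / baseline


# Toy example (hypothetical): one substitution and one deletion in four words.
example_wer = wer("a b c d", "a x c")
example_cer = cer("abc", "abd")
example_gain = relative_decrease_percent(20.0, 15.0)
```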
In this work we present a multilingual pretrained model for Indic languages in which representations are learnt from raw waveforms. Our results indicate that multilingual pretraining outperforms monolingual pretraining, both in learning speech representations during pre-training and in downstream speech recognition performance after finetuning. We also show that the model is able to encode phonetic similarity in its speech representations. We hope this work kick-starts the development of high-quality speech recognition for Indic languages, especially low-resource ones. All our code (https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation/tree/v2-hydra) and models (https://github.com/Open-Speech-EkStep/vakyansh-models) are open source. Our pretrained model can also be used for unsupervised speech recognition, especially for languages where no labelled text is available, using the methods described in baevski2021unsupervised. We plan to report finetuning results on all Indic languages in the future.
All authors gratefully acknowledge Ekstep Foundation for supporting this project financially and providing infrastructure. A special thanks to Dr. Vivek Raghavan for constant support, guidance and fruitful discussions. We also thank Rajat Singhal, Heera Ballabh, Niresh Kumar R, Sreejith V, Soujyo Sen and Amulya Ahuja for automated data pipelines and infrastructure support for data processing and model training.