
Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition

by Anirudh Gupta, et al.

In recent years, end to end (E2E) automatic speech recognition (ASR) systems have achieved promising results given sufficient resources. Even for languages with little labelled data, state of the art E2E ASR systems can be developed by pretraining on large amounts of data from high resource languages and finetuning on the low resource language. For many low resource languages this approach is still challenging, since labelled data is often not available in the open domain. In this paper we present an approach to create labelled data for Maithili, Bhojpuri and Dogri by utilising pseudo labels from text to speech for forced alignment. The created data was inspected for quality and then used to train a transformer based wav2vec 2.0 ASR model. All data and models are available in the open domain.





1 Introduction

E2E ASR systems have gained immense popularity in the recent past and have repeatedly been shown to outperform traditional ASR approaches [hinton]. However, a huge amount of labelled data is needed for E2E systems to perform well for a particular language. This makes training models easier for high resource languages whose data is available in the public domain. For most languages, however, only a very limited amount of training data is available, and for some languages none is available at all. As a result, few speech and natural language processing (NLP) tools exist for those languages. Moreover, manually labelling data is both time consuming and expensive. Researchers have therefore been interested in approaches that bootstrap data collection for low resource languages using high resource languages.

Transfer learning techniques for speech recognition aim at effectively transferring knowledge from high-resource languages to low-resource languages and have been extensively studied. In [schultz2001language] the authors describe a language independent and language adaptive approach to acoustic modelling. For deep neural networks, it was shown in [huang2013cross] that hidden layers can be made language independent while the fully connected layer can be language specific. Similar approaches were discussed in [tong2017multilingual, toshniwal2018multilingual], where efforts were made to train a single neural network for acoustic modelling of multiple languages. A popular paradigm for transfer learning in E2E systems is to pretrain a model on labelled speech from one or more high-resource languages and then fine-tune all or parts of the model on speech from the low-resource language. In recent years, semi-supervised learning approaches have shown that state of the art models can be trained with only minutes of labelled audio, provided a huge amount of data is used to learn speech representations during pretraining [baevski2020wav2vec]. An approach which uses multilingual finetuning is described in [conneau2020unsupervised]. It shows that low resource languages benefit from multilingual pretraining by learning shared speech representations across languages.

A sensible approach to training ASR models is therefore to start from pretrained models and finetune them on a small amount of labelled data. We first generate labelled data using forced alignment with pseudo labels from a TTS engine, and then perform finetuning. To the best of the authors' knowledge, no other study reports WER for Maithili, Bhojpuri and Dogri on continuous speech recognition.

2 Approach

Languages spoken in the South Asian region belong to at least four major language families: Indo-European (most of which belong to its sub-branch Indo-Aryan), Dravidian, Austro-Asiatic, and Sino-Tibetan. Almost one third of the mother tongues in India ( languages) belong to the Indo-Aryan family of languages, spoken by % of Indians. Maithili, Bhojpuri and Dogri are also Indo-Aryan languages, and all three are predominantly written in the Devanagari script. Maithili is mostly spoken in Nepal and northern parts of Bihar, a state in India. Bhojpuri is a dialect of Hindi and is spoken in eastern Uttar Pradesh and parts of Bihar, Jharkhand and Nepal. Dogri is chiefly spoken in Jammu and Kashmir and parts of Punjab and Himachal Pradesh.

In [maithili], the phonetic similarity between Hindi and Maithili is estimated at % by comparing the most common words used in both languages. Dogri sounds more similar to Punjabi than to Hindi. This phonetic similarity, together with the script shared with Hindi, forms the basis for using text to speech pseudo labels generated with Hindi models. Forced alignment is the process of determining, for each fragment of the transcript, the time interval (in the audio file) containing the spoken text of the fragment. Most tools for forced alignment use speech recognition algorithms, which is a big challenge for low resource languages because of the non-availability of data. espeak is a lightweight TTS engine which uses the formant synthesis method to synthesize sounds from phonemes and prosody information. Since all three languages use the Devanagari script, Hindi TTS models were used and no new rules were written for text to phoneme conversion. Sound synthesis was also not altered, and forced alignment was performed on the assumption that the phonetic similarity between Hindi and Maithili, Bhojpuri and Dogri would produce good quality aligned segments.
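As a rough illustration of the synthesis step, the sketch below builds an espeak command line that renders Devanagari text with the Hindi voice into a WAV file. The helper names `build_espeak_command` and `synthesize`, the sample text and the output path are hypothetical. The flags `-v` (voice) and `-w` (write to a WAV file) are standard espeak options, but the exact invocation used by the authors is not given in the paper, so this is only one plausible way to do it.

```python
import subprocess
from typing import List


def build_espeak_command(text: str, wav_path: str, voice: str = "hi") -> List[str]:
    """Return an espeak argument list that writes synthesized speech to wav_path.

    voice="hi" selects espeak's Hindi voice, used here as a stand-in for
    Maithili, Bhojpuri and Dogri text written in Devanagari. This mirrors
    the setup described in the text; the authors' exact command is an assumption.
    """
    return ["espeak", "-v", voice, "-w", wav_path, text]


def synthesize(text: str, wav_path: str, voice: str = "hi") -> None:
    """Run espeak; requires the espeak binary to be installed and on PATH."""
    subprocess.run(build_espeak_command(text, wav_path, voice), check=True)


if __name__ == "__main__":
    # Hypothetical text fragment and output path, for illustration only.
    print(build_espeak_command("नमस्कार", "fragment_0001.wav"))
```

Keeping command construction separate from execution makes it possible to inspect the command without espeak being installed.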

3 Experiment Details

We break our process down into two parts. The first part is the creation of labelled data and the second part is training an ASR model on this data.

3.1 Dataset Details

News on AIR is the news service division of All India Radio. Multiple news bulletins are made public in various languages on a daily basis. Many radio stations broadcast news in their regional languages, and the verbatim text of these bulletins is also made public in PDF format.

3.1.1 Text Extraction

The first step in the process is to extract text from the PDF bulletins so that it can be used for forced alignment. The text was machine readable but not in a standard encoding: for all three languages it used the legacy KrutiDev font encoding. Special care was taken to convert all of this text to UTF-8.

3.1.2 Forced Alignment

Some attributes of the extracted text could adversely affect the quality of forced alignment:

  • The PDF bulletin sometimes contained additional information in English, such as the time of day and the speaker's name, which was not part of the audio bulletin.

  • Before reading out the news, the speaker introduced himself or herself. The text of such portions was not present in the PDF bulletin.

  • Sometimes there were long music interludes in between.

  • In some bulletins the text was not machine readable.

First we generate audio segments from the text using the espeak TTS engine. All three languages, Maithili, Bhojpuri and Dogri, use the Devanagari script, the same as Hindi. Also, Maithili and Bhojpuri are phonetically very similar to Hindi since they are spoken in the same regions of India. The TTS engine did not support these languages, so we used its Hindi voice to generate audio for their text. The Dynamic Time Warping (DTW) algorithm is then used to align the Mel-frequency cepstral coefficient (MFCC) representation of the given audio with that of the audio obtained by synthesizing the text fragments.
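To make the alignment step concrete, here is a minimal sketch of classic DTW over two sequences of feature frames. It assumes the MFCC frames have already been computed elsewhere and uses Euclidean distance with the standard three-way step pattern; the function names `euclidean` and `dtw` are illustrative, and the paper does not specify which DTW variant or tool was used.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one feature vector, e.g. the MFCCs of one audio frame


def euclidean(a: Frame, b: Frame) -> float:
    """Euclidean distance between two frames of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(real: Sequence[Frame], synth: Sequence[Frame]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the total alignment cost and the warping path.

    The path is a list of (real_index, synth_index) pairs mapping frames of
    the recorded audio to frames of the synthesized audio.
    """
    n, m = len(real), len(synth)
    inf = float("inf")
    # cost[i][j] = best cost of aligning real[:i] with synth[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(real[i - 1], synth[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    # Toy one-dimensional "features"; real MFCC frames would have more dimensions.
    recorded = [[0.0], [1.0], [2.0]]
    synthesized = [[0.0], [1.0], [1.0], [2.0]]
    print(dtw(recorded, synthesized))
```

Since the synthesized audio is generated fragment by fragment, the frame range of each text fragment in the synthesized signal is known, and the warping path carries those boundaries over to the recorded audio. This quadratic version is meant for illustration; long bulletins would need a banded or otherwise constrained variant.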

We addressed the issues in mismatch between audio and text bulletins by manually analysing the aligned segments. The following steps helped to improve alignment quality.

  • Text in any script other than Devanagari was removed before alignment.

  • At the beginning of each audio bulletin the news presenters introduced themselves. We found that these introductions were merged into the first or second aligned segment. As a result, we discarded the first five aligned segments of each bulletin when compiling the labelled data.

  • Any segment longer than 15 seconds was rejected. Most of these segments contained musical interludes, and we observed that longer segments had a higher chance of bad alignment.
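The filtering steps above can be sketched as follows. The `Segment` class and the function names are hypothetical, and the Devanagari filter is a simplification: it keeps only characters in the Unicode Devanagari block (U+0900 to U+097F) plus whitespace, which also drops ASCII digits and punctuation. The 15 second threshold and the five skipped segments come from the text.

```python
import re
from dataclasses import dataclass
from typing import List

# Anything that is neither in the Devanagari block (U+0900-U+097F) nor whitespace.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F\s]")
MAX_SEGMENT_SECONDS = 15.0    # segments longer than this are rejected
SKIPPED_LEADING_SEGMENTS = 5  # presenter introductions land in the first segments


@dataclass
class Segment:
    """One aligned fragment: start and end time in seconds, plus its text."""
    start: float
    end: float
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def keep_devanagari(text: str) -> str:
    """Remove non-Devanagari characters and collapse runs of whitespace."""
    return " ".join(NON_DEVANAGARI.sub(" ", text).split())


def filter_segments(segments: List[Segment]) -> List[Segment]:
    """Drop the leading segments of a bulletin and any overly long segment."""
    return [
        seg
        for seg in segments[SKIPPED_LEADING_SEGMENTS:]
        if seg.duration <= MAX_SEGMENT_SECONDS
    ]


if __name__ == "__main__":
    print(keep_devanagari("AIR News 8:00 समाचार सुनिए"))
    bulletin = [Segment(i * 10.0, i * 10.0 + 5.0, "पाठ") for i in range(6)]
    bulletin.append(Segment(60.0, 80.0, "पाठ"))  # 20 s, likely a music interlude
    print(filter_segments(bulletin))
```

`keep_devanagari` would run on the extracted text before alignment, while `filter_segments` would run on the aligner output for each bulletin.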

3.1.3 Aligned Data

Total duration of dataset combining all languages is hours. Language wise data duration after forced alignment is shown in Table 1.

Language Train Valid
Table 1: Data after alignment

Figure 1 depicts the distribution of audio length segments for the three languages. It can be seen that most of the segment lengths are between seconds to seconds.

Figure 1: Segment duration distribution

To get a better sense of the distribution of the aligned data, we estimate the number of speakers and the gender distribution for each language. The generalized end-to-end (GE2E) loss, introduced recently, makes the training of speaker verification models more efficient [8462665]. We use an implementation that derives a high-level representation of the voice: given an audio clip, it creates a fixed-length embedding vector that summarizes the characteristics of the spoken voice.

We use the same embedding that we used for speaker clustering as features for gender identification. Data was created manually and labelled as male or female to train a classifier. Total male audio duration was hours and total female data was hours. To make our model more robust, we trained on clean data as well as data with background noise. We also included multiple languages: Hindi, Telugu, Tamil and Kannada. In our analysis, a support vector machine with a radial basis function kernel worked best. We created a separate test set with new voices and environments, in which the labels were balanced. To understand performance on other languages, the test set also included data from Marathi and Bengali. The total duration of the test set was hours. The final accuracy on the test set was %.

Figure 2: Gender distribution

3.2 Model Training

wav2vec 2.0 is a self-supervised pre-training framework for learning speech representations. It is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages [baevski2020wav2vec]. The resulting model is then fine-tuned on labelled data with a CTC loss function [Graves06connectionisttemporal], where the characters of a particular language act as labels. We start from the CLSRIL pretrained model, which is trained on hours of Indic languages [clsril]. The pretraining data contained hours of Maithili data and hours of Dogri data, but no Bhojpuri. This pretrained model is a base model of the wav2vec architecture and is finetuned further for each language separately. Training continues until no improvement in word error rate (WER) on the validation set is observed. We use the same fine-tuning recipe as described in [baevski2020wav2vec].
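To show how character labels relate to frame-level outputs under CTC, the sketch below applies the standard CTC collapse rule to a sequence of per-frame label indices: merge consecutive repeats, then drop blanks. The vocabulary, the blank index and the function name `ctc_greedy_decode` are hypothetical, and this greedy rule is only an illustration, not necessarily the decoder used for the reported results.

```python
from typing import List, Optional, Sequence


def ctc_greedy_decode(frame_ids: Sequence[int], vocab: Sequence[str], blank_id: int = 0) -> str:
    """Collapse per-frame argmax label ids into a character string.

    Consecutive repeats of the same id are merged, and blank ids are removed.
    A blank between two identical ids keeps both characters.
    """
    chars: List[str] = []
    prev: Optional[int] = None
    for idx in frame_ids:
        if idx != prev and idx != blank_id:
            chars.append(vocab[idx])
        prev = idx
    return "".join(chars)


if __name__ == "__main__":
    # Hypothetical character vocabulary with the blank symbol at index 0.
    vocab = ["<blank>", "क", "म", "ल"]
    print(ctc_greedy_decode([1, 1, 0, 2, 2, 2, 0, 0, 3], vocab))
```

In practice the frame ids would come from taking the argmax over the model's output distribution at each time step.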

3.3 Results

The WERs on the validation sets are reported in Table 2. An n-gram KenLM language model [heafield-2011-kenlm], trained on text from the training data, is used during decoding.

Language WER
Table 2: Finetuning Results
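For reference, WER is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. The sketch below computes it for a single utterance with whitespace tokenization; the function name `word_error_rate` is illustrative, and corpus-level WER is usually obtained by summing errors and reference lengths over all utterances rather than averaging per-utterance values.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```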

4 Conclusions

Getting labelled data for ASR is difficult, and even more so for low resource languages. We propose a method to generate labelled data for speech recognition in low resource languages using forced alignment, by leveraging TTS pseudo labels from a phonetically similar language. We observe that WER is highest for Dogri. This is partly because Dogri sounds more like Punjabi than Hindi. We plan to extend this approach to other languages which are phonetically similar to a high resource language and use the same script.

5 Acknowledgements

All authors gratefully acknowledge Ekstep Foundation for supporting this project financially and providing infrastructure. Special thanks go to Dr. Vivek Raghavan for constant support, guidance and fruitful discussions. We also thank Ankit Katiyar, Heera Ballabh, Niresh Kumar R, Sreejith V, Soujyo Sen, Amulya Ahuja and Nikita Tiwari for helping out when needed and for infrastructure support for data processing and model training.