Automatic speech recognition (ASR) is the task of converting an audio signal to text. ASR systems are a critical part of voice assistants such as Siri and Cortana. Technology giants like Google and Amazon leverage vast amounts of private data to build state-of-the-art ASR systems, which makes it difficult for other players to reproduce similar performance. In this paper, we investigate whether publicly available data can be used to train ASR systems that compete with the state of the art. If so, it would empower startups, academics, and others to build competent ASR systems. Publicly available speech data, e.g. from YouTube, may be contaminated with ambient noise and background music, which makes it difficult to use for training ASR systems. Hence, we propose to first clean the noisy data with speech enhancement techniques and then train ASR systems on both the original data and its enhanced (cleaned) version.
Speech enhancement (SE) is a well-studied problem that aims to improve audio quality by removing contaminations such as white noise and background music. Several GAN-based models, e.g. SEGAN and FSEGAN, have been shown to perform well for speech enhancement. In this work, we use SEGAN [Pascual, Bonafonte, and Serra 2017], which operates at the waveform level to remove noise from a given noisy speech signal. SEGAN uses CNNs instead of RNNs for its encoder and decoder modules, which makes it faster. It operates end to end on the raw audio signal, so it is free of any assumptions made for feature extraction. Lastly, the authors have released their code, which makes the model more reproducible. Hence, we chose SEGAN over other speech enhancement techniques.
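SEGAN's fully convolutional, waveform-level design can be illustrated with a toy NumPy sketch: each strided 1-D convolution halves the time axis, which is how such an encoder compresses raw audio without any hand-crafted feature extraction. The kernel here is random and purely illustrative; the published SEGAN stacks many such layers with wide, learned filters.

```python
import numpy as np

def conv1d_strided(x, kernel, stride):
    """Valid 1-D convolution with stride; downsamples the waveform."""
    out_len = (len(x) - len(kernel)) // stride + 1
    return np.array([
        np.dot(x[i * stride : i * stride + len(kernel)], kernel)
        for i in range(out_len)
    ])

rng = np.random.default_rng(0)
wave = rng.standard_normal(1024)   # stand-in for a raw waveform chunk
k = rng.standard_normal(4)         # random kernel, illustrative only

# Two "encoder" layers: each strided conv roughly halves the time axis.
h1 = conv1d_strided(wave, k, stride=2)
h2 = conv1d_strided(h1, k, stride=2)
print(len(wave), len(h1), len(h2))  # → 1024 511 254
```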
There are different architectures for speech enhancement in ASR systems. Deep learning approaches to building robust ASR systems can be classified into three groups: front-end, back-end, and joint front- and back-end techniques [Zhang et al. 2018]. In the front-end setting, the speech enhancement and recognition systems are independent of each other: noisy speech is first enhanced during pre-processing, and the recognizer is then trained on the enhanced speech. In the back-end setting, noisy and enhanced speech are used together to train the recognizer. Lastly, in the joint front- and back-end setting, speech enhancement and recognition are treated as a single block and trained end to end. In this work, we focus on the back-end technique, as shown in Fig. 1.
One of the most popular approaches in the back-end setting is multi-condition training, which builds a more robust recognition system by training on multiple acoustic variants of the training dataset. In our case, we propose to use publicly available noisy speech along with its cleaned variant (via SE) to build ASR systems.
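As a minimal sketch, multi-condition training amounts to pairing each transcript with every acoustic variant of the same utterance (clean, noisy, enhanced, ...) and mixing them in a single training set. The file names below are hypothetical, for illustration only:

```python
import random

def build_multicondition_set(pairs):
    """Flatten (transcript, variants) pairs into one training list:
    every acoustic variant of an utterance shares the same transcript."""
    examples = []
    for transcript, variants in pairs:
        for audio_path in variants:
            examples.append((audio_path, transcript))
    random.shuffle(examples)  # mix acoustic conditions within each epoch
    return examples

# Hypothetical utterances with clean and noisy variants.
pairs = [
    ("the quick brown fox", ["utt1_clean.wav", "utt1_noisy.wav"]),
    ("jumped over the dog", ["utt2_clean.wav", "utt2_noisy.wav"]),
]
train_set = build_multicondition_set(pairs)
print(len(train_set))  # → 4 (two variants per utterance)
```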
Existing datasets for speech enhancement are limited in size, and ASR systems trained on them may not generalize to diverse real-world conditions. Hence, we decided to curate our own dataset. For clean speech, we used the LibriSpeech dataset [Panayotov et al. 2015], which is derived from public-domain audiobooks. This dataset is fairly large (460 hrs) and comes with transcripts, which makes it suitable for ASR training. Next, we used a diverse set of background music and ambient noises to simulate different real-world conditions. For ambient noise, we used popular datasets such as UrbanSound and ESC-50 along with YouTube, from which we hand-picked videos reflecting background noise in trains, traffic, restaurants, rain, etc. For background music, we used YouTube to extract movie theme songs and instrumental music from genres such as Latin, Native American, Japanese, Indian, African, and heavy metal. Lastly, we added the ambient noise and background music to the clean speech. This resulted in 205 hrs of noisy mixture for which we possess the clean variant along with the transcript.
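The mixing step can be sketched as scaling each noise clip to a target signal-to-noise ratio (SNR) before adding it to the clean speech. The 10 dB target and the random signals below are only illustrative; the actual SNR levels used for the dataset are a design choice not fixed here:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaling the noise so the resulting
    speech-to-noise power ratio equals `snr_db` decibels."""
    # Tile/crop the noise clip to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve 10*log10(p_speech / (gain^2 * p_noise)) = snr_db for gain.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for 1 s of clean speech
noise = rng.standard_normal(8000)    # stand-in for an ambient-noise clip
noisy = mix_at_snr(speech, noise, snr_db=10)
```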
First, we investigated whether training on noisy speech together with its clean variant really helps. We trained the Deep Speech model [Hannun et al. 2014] (recognizer) on 100 hrs each of clean speech, noisy mixture, and clean+noisy mixture, and tested the models on 5 hrs each of clean and noisy data.
For evaluation, we use the de-facto standard metric for ASR systems, word error rate (WER). WER is the word-level edit distance between the system output and the reference transcript, i.e. the number of substituted, deleted, and inserted words divided by the number of reference words (lower is better).
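WER can be computed with a standard word-level edit distance; the following is a minimal reference implementation for illustration, not the exact scorer used in the experiments:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of reference words, via word-level
    Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table, d[i][j] = distance
    # between the first i reference words and first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Two deleted words out of six reference words → WER ≈ 0.333.
print(wer("the cat sat on the mat", "the cat sat mat"))
```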
As shown in Fig. 2, the Deep Speech model trained on clean data performs well on the clean test set but lags on the noisy test set. Similarly, when trained on noisy data, it performs better on the noisy test set but lags on the clean test set. Finally, the model trained on the clean+noisy mixture outperforms the other two on both test sets. So clearly, training on noisy data together with its clean version helps.
To compare our results, we consider three cases: the real-world scenario (noisy), the ideal case (noisy+clean), and our solution (noisy+enhanced). In the real-world scenario, we can only gather noisy data from public sources, so we trained the ASR system on noisy data alone. For the ideal case, we trained Deep Speech on the noisy dataset together with its clean version; only if our speech enhancement model works really well can we approach this performance. Lastly, we implemented our approach: we first processed the noisy dataset with a pretrained SEGAN to obtain an enhanced dataset, and then trained the Deep Speech model on the noisy dataset together with its enhanced version. The first two cases represent the back-end approach, because there is no preprocessing involved and the model is left to decide what is noise and what is not. Our approach adds a front-end step, since we clean the speech before training.
In multi-condition ASR training, noisy speech together with its cleaned version can significantly reduce the word error rate. Since we do not have clean versions of publicly available data, we replaced the clean speech with enhanced speech. Noisy speech combined with its SEGAN-enhanced version performed well for ASR training: we observed a 9.5% reduction in WER compared with training on noisy speech alone. Speech enhanced with SEGAN performed on par with the ideal case on the noisy test set; on the clean test set, its error rate is slightly higher than the ideal case, which may be attributed to artifacts introduced by the speech enhancement model on clean speech. In conclusion, this work is a proof of concept that found data treated with a speech enhancement model helps make ASR systems more robust and accurate.
Conclusion & Future Work
Our work shows that publicly available data, together with speech enhancement models, can be leveraged to build robust ASR systems.
Next, we intend to test our approach with other SE models such as FSEGAN and Wave-U-Net. It will also be interesting to compare the back-end approach with the end-to-end approach.
Overall, we believe this work will motivate broader research on building state-of-the-art ASR systems from publicly available/found data.
- [Hannun et al.2014] Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; Diamos, G.; Elsen, E.; Prenger, R.; Satheesh, S.; Sengupta, S.; Coates, A.; et al. 2014. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
- [Panayotov et al.2015] Panayotov, V.; Chen, G.; Povey, D.; and Khudanpur, S. 2015. Librispeech: an asr corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, 5206–5210. IEEE.
- [Pascual, Bonafonte, and Serra2017] Pascual, S.; Bonafonte, A.; and Serra, J. 2017. Segan: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452.
- [Zhang et al.2018] Zhang, Z.; Geiger, J.; Pohjalainen, J.; Mousa, A. E.-D.; Jin, W.; and Schuller, B. 2018. Deep learning for environmentally robust speech recognition: An overview of recent developments. ACM Transactions on Intelligent Systems and Technology (TIST) 9(5):49.