Speech Recognition with Augmented Synthesized Speech

09/25/2019
by Andrew Rosenberg, et al.

The recent success of the Tacotron speech synthesis architecture and its variants in producing natural-sounding multi-speaker synthesized speech has raised the exciting possibility of replacing the expensive, manually transcribed, domain-specific human speech used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker, and style variations derived from input acoustic representations, thereby allowing manipulation of the synthesized speech. In this paper, we evaluate the feasibility of enhancing speech recognition performance with speech synthesis, using two corpora from different domains. We explore algorithms that provide the acoustic and lexical diversity needed for robust speech recognition. Finally, we demonstrate the feasibility of this approach as a data augmentation strategy for domain transfer. We find that improvements to speech recognition performance are achievable by augmenting training data with synthesized material. However, a substantial performance gap remains between recognizers trained on human speech and those trained on synthesized speech.
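The augmentation strategy described in the abstract amounts to pooling manually transcribed utterances with TTS-generated ones before training the recognizer. The sketch below illustrates that idea only; the Utterance record, the synth_ratio mixing parameter, and the build_augmented_training_set helper are assumptions for illustration, not the authors' implementation.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    audio: str          # path to a waveform or precomputed features
    transcript: str     # reference text
    synthesized: bool   # True if produced by the TTS model


def build_augmented_training_set(
    human: List[Utterance],
    synthesized: List[Utterance],
    synth_ratio: float = 0.5,
    seed: int = 0,
) -> List[Utterance]:
    """Mix human speech with TTS output for ASR training.

    synth_ratio is the fraction of the final set drawn from synthesized
    material; the remainder is real, manually transcribed speech.
    """
    rng = random.Random(seed)
    # Number of synthesized utterances needed to hit the target ratio.
    n_synth = int(len(human) * synth_ratio / (1.0 - synth_ratio))
    sampled = rng.sample(synthesized, min(n_synth, len(synthesized)))
    mixed = human + sampled
    rng.shuffle(mixed)
    return mixed


# Example: a 50/50 mix of real and synthesized training utterances.
real = [Utterance(f"real_{i}.wav", f"text {i}", False) for i in range(100)]
tts = [Utterance(f"tts_{i}.wav", f"text {i}", True) for i in range(200)]
training_set = build_augmented_training_set(real, tts, synth_ratio=0.5)
```

In practice the synthesized side would be generated by sampling speaker, prosody, and style embeddings from the multi-speaker TTS model to supply the acoustic diversity the paper discusses; the mixing step itself stays this simple.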

