Towards Selection of Text-to-speech Data to Augment ASR Training

05/30/2023
by Shuo Liu, et al.

This paper presents a method for selecting appropriate synthetic speech samples from a large text-to-speech (TTS) dataset as supplementary training data for an automatic speech recognition (ASR) model. We trained a neural network, which can be optimised using cross-entropy loss or ArcFace loss, to measure the similarity of synthetic data to real speech. We found that incorporating synthetic samples with considerable dissimilarity to real speech, owing in part to lexical differences, into ASR training is crucial for boosting recognition performance. Experimental results on the LibriSpeech test sets indicate that, while maintaining the same speech recognition accuracy as when using all TTS data, our proposed solution can reduce the TTS data to below 30% of its original size, outperforming several baseline methods.
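The ArcFace objective mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the two-class setup (synthetic vs. real), the function names, the `scale` and `margin` defaults, and the use of cosine similarity to a "real" class centre as the similarity score are all assumptions made for illustration. The margin handling is also simplified and ignores the case where the angle plus the margin exceeds pi.

```python
import numpy as np


def _l2_normalise(x):
    # Scale each row to unit length so dot products become cosines.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def arcface_logits(embeddings, class_weights, labels, scale=30.0, margin=0.5):
    """Add an angular margin to the target-class angle, then rescale.

    embeddings:    (batch, dim) utterance embeddings (hypothetical encoder output)
    class_weights: (num_classes, dim) learnable class centres
    labels:        (batch,) integer class ids
    """
    cos = np.clip(_l2_normalise(embeddings) @ _l2_normalise(class_weights).T, -1.0, 1.0)
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    adjusted = cos.copy()
    # Only the target-class entry receives the margin: cos(theta + m).
    adjusted[rows, labels] = np.cos(theta[rows, labels] + margin)
    return scale * adjusted


def cross_entropy(logits, labels):
    # Numerically stable softmax cross-entropy, averaged over the batch.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def real_speech_similarity(embeddings, class_weights, real_class=1):
    # Assumed scoring rule: cosine similarity to the "real speech" class centre.
    return _l2_normalise(embeddings) @ _l2_normalise(class_weights)[real_class]
```

With a margin of zero, `arcface_logits` reduces to scaled cosine logits, so the same `cross_entropy` function covers both losses named in the abstract. The scores from `real_speech_similarity` could then be used to rank synthetic utterances; how the ranking is turned into a selected subset is left open here.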

research
11/02/2018

Training Neural Speech Recognition Systems with Synthetic Speech Augmentation

Building an accurate automatic speech recognition (ASR) system requires ...
research
06/01/2021

A Neural Acoustic Echo Canceller Optimized Using An Automatic Speech Recognizer And Large Scale Synthetic Data

We consider the problem of recognizing speech utterances spoken to a dev...
research
06/02/2020

An ASR Guided Speech Intelligibility Measure for TTS Model Selection

The perceptual quality of neural text-to-speech (TTS) is highly dependen...
research
11/29/2022

Evaluating and reducing the distance between synthetic and real speech distributions

While modern Text-to-Speech (TTS) systems can produce speech rated highl...
research
03/27/2023

Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis

Adapting generic speech recognition models to specific individuals is a ...
research
10/21/2021

Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition

With recent advances in speech synthesis, synthetic data is becoming a v...
research
12/11/2020

Improved Robustness to Disfluencies in RNN-Transducer Based Speech Recognition

Automatic Speech Recognition (ASR) based on Recurrent Neural Network Tra...
