Evaluating and reducing the distance between synthetic and real speech distributions

11/29/2022
by Christoph Minixhofer, et al.

While modern Text-to-Speech (TTS) systems can produce speech that is rated highly in subjective evaluation, the distance between real and synthetic speech distributions remains understudied. Here, we use the term distribution to mean the sample space of all possible real speech recordings from a given set of speakers, or of the synthetic samples that could be generated for the same set of speakers. We evaluate the distance between real and synthetic speech distributions along the dimensions of acoustic environment, speaker characteristics and prosody, using a range of speech processing measures and the Wasserstein distances between their respective distributions. We reduce these distances along said dimensions by providing the model with utterance-level information derived from the measures, and show that this information can be generated at inference time. The improvements along these dimensions translate into a reduction of the overall distribution distance, which we approximate using Automatic Speech Recognition (ASR) by evaluating the fitness of the synthetic data as training data.
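
To make the idea of a per-measure Wasserstein distance concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single scalar utterance-level measure (here a hypothetical mean pitch per utterance) and uses randomly generated placeholder values in place of real measurements. The function name `measure_distance` and all numbers are illustrative assumptions; only `scipy.stats.wasserstein_distance` is an existing library call.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Placeholder data (assumption): one scalar value per utterance, e.g. mean
# pitch in Hz. Real use would compute these from real and synthetic audio.
real_pitch = rng.normal(loc=180.0, scale=30.0, size=2000)
synthetic_pitch = rng.normal(loc=175.0, scale=15.0, size=2000)


def measure_distance(real, synthetic):
    """1-D Wasserstein distance between two empirical distributions of an
    utterance-level measure (hypothetical helper, not from the paper)."""
    return wasserstein_distance(real, synthetic)


# Distance between the real and the synthetic distribution of the measure.
d_real_synth = measure_distance(real_pitch, synthetic_pitch)

# Reference point: distance between two disjoint halves of the real data,
# which reflects sampling noise only.
d_real_real = measure_distance(real_pitch[:1000], real_pitch[1000:])

print(f"real vs. synthetic: {d_real_synth:.2f}")
print(f"real vs. real:      {d_real_real:.2f}")
```

Comparing the real-versus-synthetic distance against a real-versus-real reference is one way to judge whether a distance is large relative to sampling noise; whether the paper uses such a reference is not stated in the abstract.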


