Automatic Evaluation of Speaker Similarity

07/01/2022
by Deja Kamil, et al.

We introduce a new automatic evaluation method for speaker similarity assessment that is consistent with human perceptual scores. Modern neural text-to-speech models require a vast amount of clean training data, which is why many solutions have moved from single-speaker models to models trained on examples from many different speakers. Multi-speaker models bring new possibilities, such as faster creation of new voices, but also a new problem: speaker leakage, where the speaker identity of a synthesized example might not match that of the target speaker. Currently, the only way to discover this issue is through costly perceptual evaluations. In this work, we propose an automatic method for the assessment of speaker similarity. For that purpose, we extend recent work on speaker verification systems and evaluate how well different metrics and speaker embedding models reflect Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) scores. Our experiments show that we can train a model to predict speaker similarity MUSHRA scores from speaker embeddings with 0.96 accuracy and a significant correlation of up to 0.78 (Pearson) at the utterance level.
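
The kind of comparison described in the abstract can be illustrated with a minimal sketch: score each synthesized utterance by the similarity of its speaker embedding to the target speaker's embedding, then check how well those scores track listener ratings using the Pearson correlation. Everything concrete below is an assumption made for illustration only: the choice of cosine similarity as the metric, the 16-dimensional random vectors standing in for real speaker embeddings, and the made-up MUSHRA values. It is not the authors' model or data.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.dot(xc, yc) / np.sqrt(np.dot(xc, xc) * np.dot(yc, yc)))


# Hypothetical stand-ins for real speaker embeddings: a target-speaker vector
# and five "synthesized" vectors that drift away from it with growing noise.
rng = np.random.default_rng(0)
target = rng.normal(size=16)
noise_levels = [0.1, 0.5, 1.0, 2.0, 4.0]
synthesized = [target + s * rng.normal(size=16) for s in noise_levels]

# One automatic similarity score per synthesized utterance.
similarities = [cosine_similarity(target, e) for e in synthesized]

# Made-up utterance-level MUSHRA scores (0-100), for illustration only.
mushra = [90.0, 75.0, 60.0, 40.0, 20.0]

r = pearson(similarities, mushra)
print(f"Pearson correlation between similarity and MUSHRA: {r:.2f}")
```

With real data, the embeddings would come from a speaker verification model and the MUSHRA scores from a listening test; the abstract additionally describes training a predictor on the embeddings, which this sketch does not attempt.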
