VoiceMe: Personalized voice generation in TTS

03/29/2022
by Pol van Rijn, et al.

Novel text-to-speech systems can generate entirely new voices that were not seen during training. However, efficiently creating personalized voices from a high-dimensional speaker space remains difficult. In this work, we use speaker embeddings from a state-of-the-art speaker verification model (SpeakerNet) trained on thousands of speakers to condition a TTS model. We employ a human sampling paradigm to explore this speaker latent space. We show that users can create voices that match photos of faces, art portraits, and cartoons. We recruit online participants to collectively manipulate the voice of a speaking face. We show that (1) a separate group of human raters confirms that the created voices match the faces, (2) speaker gender apparent from the face is well recovered in the voice, and (3) participants consistently move towards the real voice prototype for the given face. Our results demonstrate that this technology can be applied in a wide range of applications, including character voice development in audiobooks and games, personalized speech assistants, and individual voices for people with speech impairments.
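The abstract does not spell out how the human sampling paradigm works, so the following is only a rough sketch of one way a speaker latent space could be explored with human feedback. It assumes (none of this is stated above) that the embedding has been reduced to a few dimensions, that each participant adjusts one dimension at a time with a slider, and that the result is passed to the next participant. The names `explore_speaker_space` and `simulated_rater` are hypothetical, and the rater function merely stands in for synthesizing a voice and asking a person how well it matches the face.

```python
import random


def simulated_rater(target):
    """Stand-in for a human judgment (assumption): a higher score means the
    candidate voice is closer to a hidden target voice for the face."""
    def rate(voice):
        return -sum((v - t) ** 2 for v, t in zip(voice, target))
    return rate


def explore_speaker_space(rate, n_dims=4, n_iterations=3,
                          slider_steps=11, low=-1.0, high=1.0, seed=0):
    """Coordinate-wise exploration of a (reduced) speaker space.

    Each inner step mimics one participant moving a slider along a single
    dimension and keeping the position they rate highest, while all other
    dimensions stay fixed. The result is handed on to the next step.
    """
    rng = random.Random(seed)
    # Start from a random point in the space.
    voice = [rng.uniform(low, high) for _ in range(n_dims)]
    # Discrete slider positions available to the participant.
    grid = [low + i * (high - low) / (slider_steps - 1)
            for i in range(slider_steps)]
    for _ in range(n_iterations):
        for d in range(n_dims):
            voice[d] = max(
                grid,
                key=lambda v: rate(voice[:d] + [v] + voice[d + 1:]),
            )
    return voice


if __name__ == "__main__":
    hidden_target = [0.4, -0.2, 0.0, 0.8]
    final_voice = explore_speaker_space(simulated_rater(hidden_target))
    print(final_voice)
```

In a real study the `rate` callback would be replaced by actual participant responses to synthesized audio; the simulated rater is only there to make the sketch runnable.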
