Learning Disentangled Phone and Speaker Representations in a Semi-Supervised VQ-VAE Paradigm

10/21/2020
by Jennifer Williams, et al.

We present a new approach to disentangle speaker voice and phone content by introducing new components to the VQ-VAE architecture for speech synthesis. The original VQ-VAE does not generalize well to unseen speakers or content. To alleviate this problem, we have incorporated a speaker encoder and speaker VQ codebook that learns global speaker characteristics entirely separate from the existing sub-phone codebooks. We also compare two training methods: self-supervised with global conditions and semi-supervised with speaker labels. Adding a speaker VQ component improves objective measures of speech synthesis quality (estimated MOS, speaker similarity, ASR-based intelligibility) and provides learned representations that are meaningful. Our speaker VQ codebook indices can be used in a simple speaker diarization task and perform slightly better than an x-vector baseline. Additionally, phones can be recognized from sub-phone VQ codebook indices in our semi-supervised VQ-VAE better than self-supervised with global conditions.
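The core mechanism described above is vector quantization against two separate codebooks: a frame-level sub-phone codebook and a global speaker codebook fed by a pooled speaker encoder, so that speaker identity is captured in one discrete index while phone content varies per frame. The following is a minimal NumPy sketch of that two-codebook lookup; the codebook sizes (256 sub-phone codes, 32 speaker codes), the 64-dimensional latent, and mean-pooling as the speaker encoder are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def vq_lookup(z, codebook):
    """Nearest-neighbor vector quantization.

    Maps each row of z (shape: frames x dim) to the closest entry of
    codebook (shape: num_codes x dim), returning the quantized vectors
    and their codebook indices.
    """
    # Squared Euclidean distance from every frame to every code.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices

rng = np.random.default_rng(0)

# Two separate codebooks (sizes are assumptions for illustration):
phone_codebook = rng.normal(size=(256, 64))    # frame-level sub-phone codes
speaker_codebook = rng.normal(size=(32, 64))   # global speaker codes

# Frame-level encoder output for one utterance (100 frames, 64-dim).
frames = rng.normal(size=(100, 64))
# Stand-in speaker encoder: mean-pool frames to one utterance vector.
utterance = frames.mean(axis=0, keepdims=True)

# Sub-phone content: one discrete code per frame.
phone_q, phone_idx = vq_lookup(frames, phone_codebook)
# Speaker identity: a single discrete code for the whole utterance.
spk_q, spk_idx = vq_lookup(utterance, speaker_codebook)
```

The per-frame `phone_idx` sequence is what the paper probes for phone recognition, while the utterance-level `spk_idx` is the kind of discrete speaker representation used in the diarization comparison; in a real VQ-VAE both lookups would use a straight-through gradient estimator during training.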


