Exploring Disentanglement with Multilingual and Monolingual VQ-VAE

05/04/2021
by Jennifer Williams, et al.

This work examines the content and usefulness of disentangled phone and speaker representations from two separately trained VQ-VAE systems: one trained on multilingual data and another trained on monolingual data. We explore the multi- and monolingual models using four small proof-of-concept tasks: copy-synthesis, voice transformation, linguistic code-switching, and content-based privacy masking. From these tasks, we reflect on how disentangled phone and speaker representations can be used to manipulate speech in a meaningful way. Our experiments demonstrate that the VQ representations are suitable for these tasks, including creating new voices by mixing speaker representations together. We also present our novel technique to conceal the content of targeted words within an utterance by manipulating phone VQ codes, while retaining speaker identity and intelligibility of surrounding words. Finally, we discuss recommendations for further increasing the viability of disentangled representations.
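The two manipulations the abstract highlights — creating new voices by mixing speaker representations, and concealing targeted words by overwriting their phone VQ codes — can be sketched at the representation level. This is a minimal illustrative sketch, not the paper's implementation: the variable names (`phone_codes`, `spk_a`, `spk_b`, `mask_code`), the dimensions, and the use of simple linear interpolation are all assumptions for clarity.

```python
# Hypothetical sketch of manipulating disentangled VQ-VAE representations.
# In a real system, phone codes index a learned codebook and a decoder/vocoder
# resynthesizes speech; here we only show the representation-level edits.

# Per-frame phone VQ code indices for a toy 10-frame utterance (assumed).
phone_codes = [12, 87, 3, 54, 201, 9, 33, 77, 150, 5]

# Toy 4-dimensional speaker representations for two speakers (assumed).
spk_a = [0.2, 0.2, 0.2, 0.2]
spk_b = [-0.6, -0.6, -0.6, -0.6]

def mix_speakers(a, b, alpha=0.5):
    """Create a new voice by linearly interpolating two speaker vectors."""
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]

def mask_content(codes, start, end, mask_code=0):
    """Conceal targeted words by overwriting their phone VQ codes,
    leaving surrounding frames (and the speaker vector) untouched."""
    masked = list(codes)
    masked[start:end] = [mask_code] * (end - start)
    return masked

new_voice = mix_speakers(spk_a, spk_b, alpha=0.3)   # a blended speaker vector
masked_codes = mask_content(phone_codes, 3, 6)       # frames 3-5 concealed
```

Because phone content and speaker identity live in separate representations, the content mask leaves the speaker vector untouched, which is why the paper can retain speaker identity and the intelligibility of surrounding words.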

