Unsupervised Acoustic Unit Representation Learning for Voice Conversion using WaveNet Auto-encoders

08/16/2020
by Mingjie Chen, et al.

Unsupervised representation learning of speech has attracted keen interest in recent years, as is evident, for example, in the wide participation in the ZeroSpeech challenges. This work presents a new method for learning frame-level representations based on WaveNet auto-encoders. Of particular interest in the ZeroSpeech Challenge 2019 were models with discrete latent variables, such as the Vector Quantized Variational Auto-Encoder (VQ-VAE). However, these models generate speech of relatively poor quality. In this work we aim to address this with two approaches: first, WaveNet is used as the decoder to generate waveform data directly from the latent representation; second, the low complexity of the latent representations is improved with two alternative disentanglement learning methods, namely instance normalization and sliced vector quantization. The method was developed and tested in the context of the recent ZeroSpeech Challenge 2020. The system output submitted to the challenge obtained the top position for naturalness (Mean Opinion Score 4.06), the top position for intelligibility (Character Error Rate 0.15), and third position for the quality of the representation (ABX test score 12.5). These results and further analysis in this paper illustrate that the quality of the converted speech and of the acoustic unit representation can be well balanced.
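To make the sliced vector quantization idea mentioned in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released code: module and parameter names such as `SlicedVectorQuantizer`, `n_slices`, and `codebook_size` are illustrative assumptions. Each frame-level latent vector is split into slices, and each slice is quantized against a codebook using the standard straight-through VQ-VAE estimator; the quantized frames would then condition a WaveNet decoder (together with a speaker identity) for waveform generation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlicedVectorQuantizer(nn.Module):
    """Quantize each frame-level latent vector slice-by-slice (VQ-VAE style sketch)."""

    def __init__(self, latent_dim=64, n_slices=4, codebook_size=256):
        super().__init__()
        assert latent_dim % n_slices == 0, "latent_dim must be divisible by n_slices"
        self.n_slices = n_slices
        self.slice_dim = latent_dim // n_slices
        # A single codebook shared across slices; per-slice codebooks are another option.
        self.codebook = nn.Embedding(codebook_size, self.slice_dim)
        self.codebook.weight.data.uniform_(-1.0 / codebook_size, 1.0 / codebook_size)

    def forward(self, z):
        # z: (batch, frames, latent_dim), the continuous encoder output.
        b, t, d = z.shape
        z_s = z.reshape(b, t, self.n_slices, self.slice_dim)
        # Nearest codebook entry for every slice (Euclidean distance).
        flat = z_s.reshape(-1, self.slice_dim)
        dist = torch.cdist(flat, self.codebook.weight)
        codes = dist.argmin(dim=-1)  # discrete acoustic-unit indices per slice
        z_q = self.codebook(codes).reshape(b, t, self.n_slices, self.slice_dim)
        # Standard VQ-VAE objectives: pull the codebook toward the encoder outputs and back.
        codebook_loss = F.mse_loss(z_q, z_s.detach())
        commitment_loss = F.mse_loss(z_s, z_q.detach())
        # Straight-through estimator so gradients reach the encoder.
        z_q = z_s + (z_q - z_s).detach()
        return (z_q.reshape(b, t, d),
                codes.reshape(b, t, self.n_slices),
                codebook_loss, commitment_loss)


# Example usage with dummy encoder outputs (8 utterances, 100 frames each).
vq = SlicedVectorQuantizer(latent_dim=64, n_slices=4, codebook_size=256)
z = torch.randn(8, 100, 64)
z_q, codes, cb_loss, commit_loss = vq(z)
```

Instance normalization, the alternative disentanglement method named in the abstract, would instead normalize each latent channel per utterance (e.g. `nn.InstanceNorm1d`) so that speaker-dependent statistics are removed from the content representation.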

Related research

05/18/2020 · Robust Training of Vector Quantized Bottleneck Models
In this paper we demonstrate methods for reliable and efficient training...

02/15/2023 · Topological Neural Discrete Representation Learning à la Kohonen
Unsupervised learning of discrete representations from continuous ones i...

05/19/2020 · Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge
In this paper, we explore vector quantization for acoustic unit discover...

10/24/2020 · A Comparison of Discrete Latent Variable Models for Speech Representation Learning
Neural latent variable models enable the discovery of interesting struct...

04/16/2019 · Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks
For our submission to the ZeroSpeech 2019 challenge, we apply discrete l...

03/27/2023 · Object Discovery from Motion-Guided Tokens
Object discovery – separating objects from the background without manual...

07/11/2022 · DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders
Current text to speech (TTS) systems usually leverage a cascaded acousti...
