Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning

10/28/2019
by Alexander H. Liu, et al.

In this paper, we propose a Sequential Representation Quantization AutoEncoder (SeqRQ-AE) that learns from primarily unpaired audio data to produce sequences of representations closely matching the phoneme sequences of speech utterances. This is achieved through proper temporal segmentation, which makes the representations phoneme-synchronized, and proper phonetic clustering, which keeps the number of distinct representations close to the number of phonemes. The mapping between these distinct representations and phonemes is then learned from a small amount of annotated paired data. Preliminary experiments on LJSpeech showed that the relative locations of the learned vowel representations in the latent space closely parallel those in the IPA vowel chart defined by linguists. With less than 20 minutes of annotated speech, our method outperformed existing methods on phoneme recognition and was able to synthesize intelligible speech that surpasses our baseline model.
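Neither the model nor the training code appears on this page, so the snippet below is only a rough sketch of the two ideas named in the abstract: phonetic clustering, approximated here by snapping frame-level latents to a small codebook whose size is close to the phoneme inventory, and temporal segmentation, approximated by collapsing runs of identical codewords into one token each. All function names, shapes, and the use of plain NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize_frames(frames, codebook):
    """Snap each encoder output frame to its nearest codebook vector.

    frames:   (T, D) array of frame-level latent vectors from an encoder.
    codebook: (K, D) array of codewords; K is chosen close to the number
              of phonemes so each codeword can stand in for one phone.
    Returns the codeword index per frame (T,) and the quantized frames (T, D).
    """
    # Pairwise squared distances between frames and codewords.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)          # nearest codeword per frame
    return idx, codebook[idx]

def segment_indices(idx):
    """Collapse consecutive repeats so the output is phoneme-synchronized:
    one token per contiguous run of identical codewords."""
    keep = np.concatenate(([True], idx[1:] != idx[:-1]))
    return idx[keep]

# Toy usage with random data (shapes only, not trained values).
rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 64))     # 120 frames, 64-dim latents
codebook = rng.normal(size=(40, 64))    # ~40 codewords, roughly a phoneme inventory
idx, quantized = quantize_frames(frames, codebook)
print(segment_indices(idx)[:10])        # phoneme-like token sequence (prefix)
```

In the paper's setting, the mapping from codewords to actual phoneme labels would then be learned from the small amount of paired data mentioned in the abstract.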
