Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

04/01/2021
by Adam Polyak, et al.

We propose using self-supervised discrete representations for the task of speech resynthesis. To generate a disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows us to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each, considering both reconstruction quality and disentanglement properties. Specifically, we evaluate F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), the recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can reach a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found at https://resynthesis-ssl.github.io/.
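The factorization described in the abstract, discrete content units, quantized prosody (F0) codes, and a speaker embedding fed jointly to a vocoder, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' released code: ContentEncoder, PitchQuantizer, SpeakerEncoder, and UnitVocoder are hypothetical placeholders standing in for, e.g., a self-supervised model followed by k-means quantization, a VQ-VAE over the F0 contour, a speaker-verification network, and a unit-based neural vocoder.

```python
# Minimal sketch of a disentangled speech-resynthesis pipeline.
# All four sub-modules are hypothetical placeholders, not the paper's code;
# shapes, frame rates, and codebook sizes are illustrative only.
from typing import Optional

import torch


class ResynthesisPipeline(torch.nn.Module):
    def __init__(self, content_encoder, pitch_quantizer, speaker_encoder, vocoder):
        super().__init__()
        self.content_encoder = content_encoder  # waveform -> discrete content units
        self.pitch_quantizer = pitch_quantizer  # waveform -> discrete F0 (prosody) codes
        self.speaker_encoder = speaker_encoder  # waveform -> single speaker embedding
        self.vocoder = vocoder                  # (units, pitch, speaker) -> waveform

    @torch.no_grad()
    def forward(
        self,
        wav: torch.Tensor,
        target_speaker_wav: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        content_units = self.content_encoder(wav)   # low-bitrate content stream
        pitch_codes = self.pitch_quantizer(wav)     # low-bitrate prosody stream
        # Voice conversion follows from the factorization: take the speaker
        # embedding from a different recording while keeping the content and
        # prosody streams of the source utterance unchanged.
        spk_source = target_speaker_wav if target_speaker_wav is not None else wav
        speaker_emb = self.speaker_encoder(spk_source)
        return self.vocoder(content_units, pitch_codes, speaker_emb)
```

Because each stream is extracted independently and quantized to a small discrete vocabulary, the combined bitrate stays very low, which is what enables the ultra-lightweight codec use case mentioned in the abstract.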
