Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations

10/27/2021
by Hyeong-Seok Choi, et al.

We present a neural analysis and synthesis (NANSY) framework that can manipulate the voice, pitch, and speed of an arbitrary speech signal. Most previous work has relied on an information bottleneck to disentangle analysis features for controllable synthesis, which usually results in poor reconstruction quality. We address this issue by proposing a novel training strategy based on information perturbation. The idea is to perturb information in the original input signal (e.g., formant, pitch, and frequency response), so that the synthesis networks learn to take only the essential attributes from each analysis feature when reconstructing the input signal. Because NANSY does not need any bottleneck structure, it achieves both high reconstruction quality and controllability. Furthermore, NANSY does not require any labels associated with the speech data, such as text or speaker information. Instead, it uses a new set of analysis features, namely a wav2vec feature and a newly proposed pitch feature, Yingram, which allows for fully self-supervised training. Because training is fully self-supervised, NANSY can be extended to a multilingual setting simply by training it on a multilingual dataset. The experiments show that NANSY achieves significant performance improvements in several applications, such as zero-shot voice conversion, pitch shift, and time-scale modification.
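To make the idea of information perturbation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a mono waveform stored as a NumPy array and uses two simple stand-in perturbations: a random spectral tilt (altering the frequency response) and naive resampling (shifting pitch and formants together, and also changing duration). The function names `random_frequency_response`, `random_resample`, and `perturb`, as well as the parameter ranges, are hypothetical and chosen only for illustration.

```python
import numpy as np


def random_frequency_response(x, rng, max_tilt_db=6.0):
    """Apply a random linear spectral tilt (in dB) across frequency.

    Illustrative stand-in for a frequency-response perturbation.
    """
    spectrum = np.fft.rfft(x)
    tilt_db = rng.uniform(-max_tilt_db, max_tilt_db)
    # Gain ramps linearly from 0 dB at DC to tilt_db at Nyquist.
    gain_db = np.linspace(0.0, tilt_db, spectrum.shape[0])
    gain = 10.0 ** (gain_db / 20.0)
    return np.fft.irfft(spectrum * gain, n=x.shape[0])


def random_resample(x, rng, max_ratio=1.2):
    """Resample by a random ratio using linear interpolation.

    This naive approach moves pitch and formants together and changes
    the signal length; it is an assumption made for brevity.
    """
    ratio = rng.uniform(1.0 / max_ratio, max_ratio)
    n_out = max(1, int(round(x.shape[0] / ratio)))
    positions = np.linspace(0.0, x.shape[0] - 1, n_out)
    return np.interp(positions, np.arange(x.shape[0]), x)


def perturb(x, seed=0):
    """Chain the two illustrative perturbations on a mono waveform."""
    rng = np.random.default_rng(seed)
    return random_frequency_response(random_resample(x, rng), rng)


# Example input: one second of a 220 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2.0 * np.pi * 220.0 * t)
perturbed = perturb(clean, seed=0)
```

In a training setup along the lines the abstract describes, a perturbed copy like `perturbed` would be fed to an analysis feature extractor while the synthesis network is asked to reconstruct `clean`, so the extractor's output cannot be relied on for the attributes that were perturbed.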


