Rep2wav: Noise Robust Text-to-Speech Using Self-Supervised Representations

08/28/2023
by Qiushi Zhu, et al.

Benefiting from the development of deep learning, text-to-speech (TTS) techniques using clean speech have achieved significant performance improvements. However, data collected from real scenes often contains noise and generally needs to be denoised by speech enhancement models. Noise-robust TTS models are therefore often trained on the enhanced speech, which still suffers from speech distortion and background noise that degrade the quality of the synthesized speech. Meanwhile, self-supervised pre-trained models have been shown to exhibit excellent noise robustness on many speech tasks, implying that the learned representations are more tolerant of noise perturbations. In this work, we therefore explore pre-trained models to improve the noise robustness of TTS models. Based on HiFi-GAN, we first propose a representation-to-waveform vocoder, which learns to map the representations of pre-trained models to the waveform. We then propose a text-to-representation FastSpeech2 model, which learns to map text to pre-trained model representations. Experimental results on the LJSpeech and LibriTTS datasets show that our method outperforms those using speech enhancement methods on both subjective and objective metrics. Audio samples are available at: https://zqs01.github.io/rep2wav.
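
The two-stage data flow described in the abstract (text to representation, then representation to waveform) can be illustrated with a toy sketch. This is not the authors' implementation: both functions are hypothetical stand-ins with no learned parameters, and the feature size, hop length, and fixed per-character duration are assumptions chosen only to make the shapes concrete.

```python
import math
from typing import List

# All constants below are illustrative assumptions, not values from the paper.
REP_DIM = 8            # toy feature size (real pre-trained models use far larger vectors)
HOP_SAMPLES = 320      # assumed waveform samples per representation frame
FRAMES_PER_TOKEN = 3   # assumed fixed duration; a real model would predict durations


def text_to_representation(text: str) -> List[List[float]]:
    """Hypothetical stand-in for the text-to-representation model.

    Emits FRAMES_PER_TOKEN deterministic pseudo-feature frames per character.
    """
    frames = []
    for ch in text:
        for k in range(FRAMES_PER_TOKEN):
            frames.append([math.sin(ord(ch) * (d + 1) + k) for d in range(REP_DIM)])
    return frames


def representation_to_waveform(frames: List[List[float]]) -> List[float]:
    """Hypothetical stand-in for the representation-to-waveform vocoder.

    Upsamples each frame to HOP_SAMPLES samples of a sine scaled by the frame mean.
    """
    wav: List[float] = []
    for frame in frames:
        amp = sum(frame) / len(frame)  # mean of sines, so it lies in [-1, 1]
        wav.extend(
            amp * math.sin(2 * math.pi * n / HOP_SAMPLES) for n in range(HOP_SAMPLES)
        )
    return wav


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> representation frames -> waveform samples."""
    return representation_to_waveform(text_to_representation(text))
```

Calling `synthesize("hi")` produces one waveform chunk of `HOP_SAMPLES` samples per representation frame. The point of the split is that the intermediate target is a pre-trained model's representation rather than a spectrogram of enhanced speech.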

research
09/28/2022

Speech Enhancement Using Self-Supervised Pre-Trained Model and Vector Quantization

With the development of deep learning, neural network-based speech enhan...
research
06/14/2023

Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement

Large, pre-trained representation models trained using self-supervised l...
research
02/16/2023

Speech Enhancement with Multi-granularity Vector Quantization

With advances in deep learning, neural network based speech enhancement ...
research
05/26/2022

Joint Training of Speech Enhancement and Self-supervised Model for Noise-robust ASR

Speech enhancement (SE) is usually required as a front end to improve th...
research
04/07/2021

Utilizing Self-supervised Representations for MOS Prediction

Speech quality assessment has been a critical issue in speech processing...
research
09/22/2022

The Microsoft System for VoxCeleb Speaker Recognition Challenge 2022

In this report, we describe our submitted system for track 2 of the VoxC...
