Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement

by   Dongyang Dai, et al.
Tsinghua University
ByteDance Inc.

With the rise of deep neural networks, speech synthesis has improved significantly in recent years thanks to the end-to-end encoder-decoder framework, and applications that rely on speech synthesis are now widely used in daily life. A robust speech synthesis model depends on high-quality, customized data, which takes considerable effort to collect. It is therefore worth investigating how to use the low-quality, low-resource voice data that can easily be obtained from the Internet to synthesize personalized voices. In this paper, the proposed end-to-end speech synthesis model uses a speaker embedding and a noise representation as conditional inputs to model speaker and noise information respectively. First, the speech synthesis model is pre-trained on both multi-speaker clean data and noisy augmented data; then the pre-trained model is adapted on noisy low-resource data from a new speaker; finally, by setting the clean speech condition, the model synthesizes the new speaker's clean voice. Experimental results show that speech generated by the proposed approach receives better subjective evaluation scores than speech from a method that directly fine-tunes a pre-trained multi-speaker speech synthesis model on denoised new-speaker data.




1 Introduction

Text-to-speech (TTS) technology is widely used in products such as e-books, voice assistants, and automatic navigation. Recently, with the development of neural networks, end-to-end TTS models such as Tacotron [wang2017tacotron], Char2Wav [sotelo2017char2wav], DeepVoice3 [ping2017deep] and Tacotron2 [shen2018natural] have gradually become mainstream. End-to-end TTS models use an attention-based encoder-decoder structure and learn patterns from large amounts of data, and they can produce more natural speech than traditional parametric TTS systems [zen2009statistical].

Building on the end-to-end model, many researchers have begun to study how to control the style, tone, and other attributes of synthesized speech. [wang2018style] proposed Global Style Tokens to represent the style information of speech. [zhang2019learning] applied a variational autoencoder (VAE) to model the distribution of style features. As for personalized TTS, [NIPS20188206] introduced speaker adaptation and speaker encoding approaches for voice cloning. As demonstrated in [NIPS20188206], the speaker adaptation approach, which fine-tunes a pre-trained multi-speaker model for an unseen speaker using a few samples, achieves better performance. Given the abundance of audio and video proliferating on the Internet, finding an effective way to synthesize a wide variety of personalized voices from widely available low-quality voice data has become an interesting topic.

Noise-robust TTS, that is, training a stable TTS model on noisy, low-quality data, has long been of interest to researchers in the field. The authors of [valentini2016investigating] introduced speech enhancement methods for noise-robust TTS. In their solution, a recurrent neural network (RNN) based speech enhancement model maps acoustic features extracted from noisy speech to features describing clean speech; the enhanced data is then used to adapt a pre-trained hidden Markov model (HMM) based TTS acoustic model; finally, the STRAIGHT [kawahara1999restructuring] vocoder generates the waveform from the acoustic features. However, preprocessing with a speech enhancement model reduces the speaker information to some extent. Besides, because they offer better sound quality and naturalness, neural network-based acoustic models and vocoders have replaced HMM models and STRAIGHT as the mainstream.

To leverage low-quality crowd-sourced data to train multi-speaker TTS models that can synthesize clean speech for all speakers, [hsu2019disentangling] built on Tacotron2 and introduced conditional generative reference encoders and adversarial training to learn disentangled representations that independently control the speaker identity and the background noise in generated signals. The method applies a speaker encoder to learn a speaker-related latent variable and a residual encoder to extract a second latent variable that models unlabelled attributes (e.g. acoustic conditions). Because the speaker encoder is followed by a speaker classifier and by a gradient reversal layer with an augmentation classifier, the speaker-related variable represents noise-free speaker information, while the residual variable represents noise-related information. However, the noise signal is not always stationary and varies over time. A fixed-length vector is therefore not enough to model the noise information, especially for speech data with a low signal-to-noise ratio (SNR).

Recently, neural network-based methods using masks have achieved excellent results in tasks such as speech enhancement and speech separation, and have become mainstream [narayanan2013ideal, wang2014training]. Inspired by these works, we use variable-length Mel-spectrogram denoise masks instead of a fixed-length vector as the representation of noise information. We assume that noisy speech is generated by adding a noise signal to clean speech, so for each point in the Mel frequency domain, the energy $E_{noisy}$ contains the energy of the clean speech $E_{clean}$ and the energy of the noise signal $E_{noise}$. The value of the Mel-spectrogram denoise mask at the corresponding point is then calculated as $E_{clean} / (E_{clean} + E_{noise})$. In this paper, based on the voice cloning framework, we adopt the Mel-spectrogram denoise masks (noisy masks for noisy speech and clean masks for clean speech) as the noise representation that conditions the end-to-end speech synthesis model, and we pre-train the TTS model on multi-speaker enhancement data for noise-robust personalized TTS. Our work is summarized in the following three aspects:

  1. The proposed method uses both the speaker embedding and the noise representation as conditional inputs of the basic end-to-end speech synthesis model, which gives independent control over the noise and the speaker identity of the synthesized speech. After being pre-trained on multi-speaker enhancement data, the model is adapted on a new speaker's low-quality data and can synthesize clean speech for that speaker.

  2. The proposed model uses Mel-spectrogram denoise masks as the noise representation. Compared with a fixed-length vector, Mel-spectrogram denoise mask can better characterize the noise information. The model is pre-trained on both noisy and clean multi-speaker data. The noise representation includes noisy masks (extracted from noisy speech) and clean masks (all elements equal to 1). The model accepts noisy masks as conditional input to generate speech with corresponding noise, and similarly accepts clean masks to generate clean speech. When the noise representation varies from noisy masks to clean masks, the generated speech changes from noisy to clean.

  3. The proposed model takes features extracted by a pre-trained speaker recognition model as the speaker embedding, and the TTS model is pre-trained on multi-speaker voice data. Together these enable personalized speech synthesis from low-quality, low-resource data of new, unseen speakers.
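
The mask definition from the introduction can be sketched as follows. This is a minimal illustration that assumes the mask is the ratio of clean energy to total (clean plus noise) energy at each time-frequency point, with Mel-spectrograms stored as plain nested lists to avoid dependencies. The function names and the `eps` guard are hypothetical, not taken from the paper.

```python
def denoise_mask(clean_energy, noise_energy, eps=1e-10):
    """Per-bin mask E_clean / (E_clean + E_noise) for 2-D lists [frames][mel_bins].

    eps is an assumed guard against division by zero in silent bins.
    """
    return [
        [c / (c + n + eps) for c, n in zip(c_row, n_row)]
        for c_row, n_row in zip(clean_energy, noise_energy)
    ]


def clean_mask(num_frames, num_bins):
    """Mask used for clean speech: every element equals 1."""
    return [[1.0] * num_bins for _ in range(num_frames)]


# Toy example: two frames, two Mel bins.
clean = [[4.0, 1.0], [0.5, 2.0]]
noise = [[0.0, 1.0], [1.5, 2.0]]
noisy_mask = denoise_mask(clean, noise)
all_ones = clean_mask(2, 2)
```

A bin with no noise gets a mask value close to 1, and a bin dominated by noise gets a value close to 0, so the mask varies over time along with the noise.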

2 Methodology

2.1 Proposed Approach

Figure 1: The proposed approach

As depicted in Figure 1, the proposed approach consists of three stages:

  1. The pre-training stage: The end-to-end TTS model is pre-trained on clean and noisy multi-speaker voice data. The model accepts a speaker embedding and a noise representation as conditional inputs. The speaker embedding is extracted by a speaker recognition model, and the noise representation consists of Mel-spectrogram denoise masks, which are either noisy masks or clean masks. The noisy masks are the Mel-spectrogram denoise masks that a speech enhancement model predicts from noisy speech, while the clean masks correspond to clean speech and have all values equal to 1.

  2. The adaptation stage: The pre-trained model is adapted on the new low-quality low-resource speaker data. The new speaker data only contains noisy speech, so the noise representation only contains noisy masks.

  3. The inference stage: The adapted model accepts clean masks as conditional input to synthesize the clean voice of the new speaker. Here the clean masks are the Mel-spectrogram denoise masks of clean speech, with every element set to 1.

The following sections will describe the speaker embedding and noise representation extraction, as well as the details of the model.

2.2 Speaker Embedding Extraction

To preserve the relative relationships between speakers and to handle unseen speakers more conveniently, we do not use one-hot encoding directly. Instead, the speaker embedding is extracted by a speaker recognition model that is pre-trained on an internal dataset containing about 20,000 speakers. We follow the approach in [xie2019utterance] to implement the speaker recognition model, which consists of a modified ResNet [he2016deep] applied in a fully convolutional way to extract frame-level features, followed by a GhostVLAD [zhong2018ghostvlad] layer that aggregates features along the temporal axis. To extract more discriminative speaker embeddings, additive margin softmax [wang2018additive] is adopted to train the speaker recognition model.

2.3 Noise Representation Extraction

Inspired by recent advances in speech enhancement [narayanan2013ideal, valin2018hybrid], we use Mel-spectrogram denoise masks as the noise representation to model the noise information. The Mel-spectrogram denoise masks are extracted by a speech enhancement model that accepts a Mel-spectrogram as input. The model structure is a variant of the CNN-RNN-FC structure and is very similar to the model in [valin2018hybrid]. Compared with [valin2018hybrid], the proposed model has the following three main differences:

  1. The input feature of the model is Mel-spectrogram instead of log spectrogram.

  2. The LSTM layer is replaced with DFSMN [zhang2018deep], which reduces the model size (from about 26M parameters to about 4.76M). Inference is also about 9x faster because the computation of the DFSMN layer can be parallelized (speed benchmarked on Nvidia T1080).

  3. The activation function of the output layer is sigmoid, and its output is assumed to be the Mel-spectrogram denoise mask $M$. Since mean square error (MSE) is enough to extract effective Mel-spectrogram denoise masks according to our experiments, we directly use MSE as the model's loss function, as shown in Equation 1:

    $$L = \frac{1}{N} \left\| M \odot X - Y \right\|_2^2 \qquad (1)$$

    where $X$ denotes the Mel-spectrogram of noisy speech, $Y$ denotes the corresponding clean speech's Mel-spectrogram, $\odot$ means element-wise multiplication, $M \odot X$ represents the denoised Mel-spectrogram, and $N$ is the number of TF-bins in the Mel-spectrogram.
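
The MSE objective described in item 3 can be illustrated with a short sketch. It assumes the loss is the mean over all TF-bins of the squared difference between the masked noisy Mel-spectrogram and the clean one. The function and argument names are hypothetical, and nested lists stand in for tensors.

```python
def enhancement_mse(mask, noisy_mel, clean_mel):
    """Mean squared error between (mask * noisy_mel) and clean_mel over all TF-bins.

    All three arguments are 2-D lists shaped [frames][mel_bins].
    """
    total = 0.0
    count = 0
    for m_row, x_row, y_row in zip(mask, noisy_mel, clean_mel):
        for m, x, y in zip(m_row, x_row, y_row):
            total += (m * x - y) ** 2  # squared error of the denoised bin
            count += 1
    return total / count


# One frame, two bins: the first bin is perfectly denoised, the second is not.
loss = enhancement_mse(
    mask=[[0.5, 1.0]],
    noisy_mel=[[2.0, 3.0]],
    clean_mel=[[1.0, 2.0]],
)
```

In this toy case only the second bin contributes an error of 1, so the loss averages to 0.5 over the two bins.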


2.4 Noise Robust Personalized TTS Model

2.4.1 Basic TTS model

We use an encoder-decoder-based, Tacotron-like end-to-end neural network as the basic TTS model. Compared with the original Tacotron [wang2017tacotron], two improvements have been made. First, GMM-based attention [battenberg2020location] is applied to improve the stability of synthesis and reduce bad cases. Second, to improve the quality of the synthesized sound, we use a neural network-based vocoder to generate sound from the Mel-spectrogram, as Tacotron2 [shen2018natural] does. The output of the Post-Net is therefore a log-level Mel-spectrogram instead of a linear spectrogram, and the model is optimized with both the Before Loss and the After Loss. More details about the decoder are given in Section 2.4.3.

2.4.2 Speaker embedding condition

Figure 2: Speaker embedding as the conditional input to encoder

To control the speaker characteristics of the synthesized speech, the speaker embedding is used as a conditional input to the encoder of the TTS model. As shown in Figure 2, in addition to being concatenated with the encoder outputs, the speaker embedding passes through two separate dense layers, whose outputs serve as an additional conditional input to the Highway Network and as the initial state of the GRU layer, respectively.

2.4.3 Noise representation condition

Figure 3: Noise representation as the conditional input to decoder

To model the noise information, we use the noise representation (the Mel-spectrogram denoise masks extracted by the speech enhancement model) as the conditional input of the decoder. As shown in Figure 3, the decoder is an autoregressive structure composed of a Pre-Net, an Attention RNN layer, a Decoder RNN layer, a Linear Projection layer, a Stop Token Projection layer and a Post-Net.

In our model, the Post-Net is used to model noise information. To avoid introducing noise information into other parts of the decoder, we use the denoised or clean Mel-spectrogram as the target output of the Linear Projection layer when calculating the Before Loss (shown in Figure 3). The Post-Net accepts both the processed noise representation and the output of the Linear Projection layer as input, and its output is added to the output of the Linear Projection layer to predict the original Mel-spectrogram and calculate the After Loss. Since the TTS model outputs log-level Mel-spectrograms, the noise representation is processed before being concatenated with the Linear Projection layer's output as the Post-Net's input: it is first clipped to a value between 0.1 and 1, then converted to log level, and finally normalized to a value between -4 and 4.
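
The clip, log, and normalize steps applied to the noise representation can be sketched as below. The paper does not spell out the logarithm base or the normalization formula, so this sketch assumes a natural logarithm and a linear rescaling of the range [log 0.1, log 1] onto [-4, 4]; the function name is hypothetical.

```python
import math


def preprocess_mask(mask, low=0.1, high=1.0, out_min=-4.0, out_max=4.0):
    """Clip each mask value to [low, high], take the log, rescale to [out_min, out_max].

    mask is a 2-D list shaped [frames][mel_bins]. The linear rescaling is an
    assumption; the source only states the clipping bounds and output range.
    """
    log_low, log_high = math.log(low), math.log(high)
    processed = []
    for row in mask:
        new_row = []
        for value in row:
            clipped = min(max(value, low), high)
            # Position of the log value inside [log_low, log_high], from 0 to 1.
            fraction = (math.log(clipped) - log_low) / (log_high - log_low)
            new_row.append(out_min + fraction * (out_max - out_min))
        processed.append(new_row)
    return processed


# A clean-mask value (1.0), a fully masked value (0.0), and the clipping bound (0.1).
example = preprocess_mask([[1.0, 0.0, 0.1]])
```

Under these assumptions a clean mask (all ones) maps to a constant 4 at every bin, while values at or below the clipping bound map to -4.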

The model is pre-trained on multi-speaker clean data and noisy augmented data. The noisy augmented speech is generated by mixing the clean speech with a noise signal, so each augmented utterance has a corresponding clean utterance. In the pre-training stage, the model therefore calculates the Before Loss against the clean Mel-spectrogram obtained from the corresponding clean speech. In the adaptation stage, where no clean counterpart exists, the Before Loss is calculated against the denoised Mel-spectrogram that the speech enhancement model produces from the noisy speech.
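
The choice of loss targets in the two training stages can be summarized in a small sketch. It only selects the targets and does not implement the network. The function name, the stage labels, and the use of linear-domain values (the log conversion used by the TTS model is omitted) are illustrative assumptions.

```python
def loss_targets(stage, original_mel, mask=None, paired_clean_mel=None):
    """Return (before_target, after_target) for one utterance.

    Mel-spectrograms and masks are 2-D lists shaped [frames][mel_bins].
    """
    if stage == "pretrain":
        # A clean utterance is its own target; an augmented noisy utterance
        # uses the Mel-spectrogram of its paired clean recording.
        before = paired_clean_mel if paired_clean_mel is not None else original_mel
    elif stage == "adapt":
        # Only noisy speech is available, so the enhancement model's mask
        # yields the denoised Mel-spectrogram used as the target.
        before = [
            [m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, original_mel)
        ]
    else:
        raise ValueError(f"unknown stage: {stage}")
    # The After Loss always compares the Post-Net output with the original Mel.
    return before, original_mel


before_t, after_t = loss_targets("adapt", [[2.0, 4.0]], mask=[[0.5, 0.25]])
```

Keeping the Before Loss target noise-free in both stages is what confines the noise modeling to the Post-Net.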

3 Experiments and Analysis

3.1 Experimental Setup

First, we add noise signals from Audio Set [45857] to an internal clean multi-speaker TTS corpus to generate a noisy augmented multi-speaker corpus. The proposed model uses both the clean and the noisy augmented corpus during pre-training. The clean multi-speaker TTS corpus contains about 1100 speakers, each with about 30 minutes of voice data. Then, the model is adapted on the low-resource, low-quality new-speaker data. In our experiments, data from four new speakers is used, with 200-300 utterances per speaker. The low-quality data is constructed with the Microsoft Scalable Noisy Speech Dataset [reddy2019scalable], ensuring that the SNR of each utterance is less than 5dB.
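
Constructing noisy speech at a chosen SNR is commonly done by scaling the noise before adding it to the clean waveform. The sketch below shows that generic procedure; it is not the exact pipeline used for these experiments, and the function names and toy signals are hypothetical.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Scale noise so that clean + scaled noise has the requested SNR in dB.

    clean and noise are equal-length lists of waveform samples.
    """
    clean_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Required noise power is clean_power / 10^(snr_db / 10).
    scale = math.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(clean, noise)]


def measured_snr_db(clean, mixed):
    """SNR in dB, treating (mixed - clean) as the noise component."""
    clean_power = sum(s * s for s in clean)
    noise_power = sum((m - s) ** 2 for m, s in zip(mixed, clean))
    return 10 * math.log10(clean_power / noise_power)


clean_wave = [1.0, -1.0, 1.0, -1.0]
noise_wave = [0.5, 0.5, -0.5, -0.5]
mixed_wave = mix_at_snr(clean_wave, noise_wave, snr_db=5.0)
check = measured_snr_db(clean_wave, mixed_wave)
```

Measuring the SNR of the mixture against the clean signal recovers the requested 5 dB up to floating-point error.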

To verify the effect of our proposed model, we use a denoise-then-synthesize approach as the baseline. The baseline model is the same as the proposed one, except that the Post-Net only accepts the output of the Linear Projection layer as input, without concatenating the Mel-spectrogram denoise masks. The baseline model is pre-trained only on clean multi-speaker data, and the new speaker's Mel-spectrograms denoised by the speech enhancement model are used for adaptation. In the baseline, we use the denoised Mel-spectrogram rather than the original noisy Mel-spectrogram for adaptation because the speech synthesized with the latter approach is much worse.

A speech enhancement model first needs to be pre-trained for both the baseline and the proposed approach. In our experiments, we train the Mel-spectrogram level enhancement model on noisy multi-speaker TTS data and its corresponding clean data. The Mel-spectrogram denoise masks predicted by the speech enhancement model are then used as the conditional input of the proposed model, and the denoised Mel-spectrograms obtained from the enhancement model are used for the adaptation of the baseline method.

3.2 Experimental Results and Analysis

We first verified the effect of the speech enhancement model, which is the basis for noise-robust TTS. The speech enhancement model was tested under different SNRs: -5dB, 0dB and 5dB. Under each SNR setting, 61 noisy utterances were generated with another internal noise set. The scale-invariant signal-to-distortion ratio (SI-SDR) [le2019sdr] calculated on the Mel-spectrogram is adopted as the metric and shown in Table 1. Table 1 shows that the speech enhancement model can significantly reduce the noise in speech, especially at low SNR.
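
For reference, SI-SDR can be computed as in the sketch below, which follows the standard definition: project the estimate onto the reference, then compare the energy of that projection with the energy of the residual. Applying it to a Mel-spectrogram by flattening all TF-bins into one vector is an assumption here, since the exact procedure is not described; the function name is hypothetical.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between two 2-D lists shaped [frames][mel_bins]."""
    est = [v for row in estimate for v in row]
    ref = [v for row in reference for v in row]
    # Optimal scaling of the reference that best explains the estimate.
    alpha = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [alpha * r for r in ref]
    target_energy = sum(t * t for t in target)
    error_energy = sum((e - t) ** 2 for e, t in zip(est, target))
    return 10 * math.log10(target_energy / error_energy)


reference_mel = [[1.0, 0.0], [0.0, 1.0]]
enhanced_mel = [[1.0, 0.1], [0.0, 1.0]]
score = si_sdr(enhanced_mel, reference_mel)
# Rescaling the estimate leaves the score unchanged (scale invariance).
rescaled_score = si_sdr([[2.0, 0.2], [0.0, 2.0]], reference_mel)
```

Because the reference is rescaled to fit the estimate before the comparison, a global gain applied to the enhanced output does not change the score.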

Table 1: SI-SDR (dB) results of our speech enhancement model at different SNR levels

An illustration of the enhanced Mel-spectrogram for a -5dB utterance is shown in Figure 4. Figure 4 shows that some voice information contained in the original Mel-spectrogram was removed by the speech enhancement model (marked with a red rectangle), which indicates that adapting the TTS model directly on the denoised data will result in an unstable voice. Comparing Figure 4-(c) with Figure 4-(a) and Figure 4-(b), it can be seen that the Mel-spectrogram denoise masks represent the noise information well, which is why we adopt the Mel-spectrogram denoise mask (noise representation) as the decoder's conditional input.

Figure 4: The speech enhancement result and Mel-spectrogram denoise masks

We synthesized 40 utterances with the same content for 4 new speakers under the baseline and the proposed approach respectively, for a total of 40 x 4 x 2 = 320 utterances. We conducted mean opinion score (MOS) tests, in which 40 listeners were asked to evaluate the synthesized speech in terms of speech quality. Each listener was asked to score 40 utterances randomly selected from the 320. The score ranges from 1 to 5, where 1 represents the worst and 5 represents the best.

Speaker     Baseline   Proposed
Speaker1    3.018      3.227
Speaker2    3.462      3.560
Speaker3    3.139      3.181
Speaker4    3.313      3.424
Table 2: MOS results in terms of speech quality

The mean opinion score of each setting is shown in Table 2. With the proposed method, the speech quality of every speaker improves over the baseline, and the score increases by 0.115 on average. The results demonstrate that our model can synthesize higher quality speech from low-resource, low-quality data. We speculate that this is because the denoised Mel-spectrogram used directly by the baseline loses some voice information, which leads to unstable speech synthesis.
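
The average improvement follows directly from the per-speaker scores in Table 2, as this short check shows (the variable names are illustrative):

```python
# (baseline, proposed) MOS per speaker, copied from Table 2.
mos = {
    "Speaker1": (3.018, 3.227),
    "Speaker2": (3.462, 3.560),
    "Speaker3": (3.139, 3.181),
    "Speaker4": (3.313, 3.424),
}

gains = {name: proposed - baseline for name, (baseline, proposed) in mos.items()}
mean_gain = sum(gains.values()) / len(gains)
print(round(mean_gain, 3))  # 0.115
```

The per-speaker gains range from about 0.042 (Speaker3) to about 0.209 (Speaker1).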

4 Conclusion

In this paper, we propose a novel method for synthesizing personalized speech with an end-to-end network model when only low-quality, low-resource data is available. The model accepts a speaker embedding and a Mel-spectrogram denoise mask as conditional inputs to model speaker and noise information respectively. The model is first pre-trained on clean multi-speaker data and augmented noisy multi-speaker data, then adapted on the low-resource, low-quality new-speaker data, and finally used to synthesize clean speech for the new speaker. Subjective evaluation results show that the proposed approach synthesizes better speech than the baseline method, which fine-tunes the pre-trained multi-speaker TTS model directly on the denoised new-speaker data.