Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under A Multi-task Learning Framework

by   Shan Yang, et al.

In this paper, we aim at improving the performance of synthesized speech in statistical parametric speech synthesis (SPSS) based on a generative adversarial network (GAN). In particular, we propose a novel architecture combining the traditional acoustic loss function and the GAN's discriminative loss under a multi-task learning (MTL) framework. The mean squared error (MSE) is usually used to estimate the parameters of deep neural networks, which only considers the numerical difference between the raw audio and the synthesized one. To mitigate this problem, we introduce the GAN as a second task to determine if the input is a natural speech with specific conditions. In this MTL framework, the MSE optimization improves the stability of GAN, and at the same time GAN produces samples with a distribution closer to natural speech. Listening tests show that the multi-task architecture can generate more natural speech that satisfies human perception than the conventional methods.









1 Introduction

Statistical parametric speech synthesis (SPSS) has attracted significant attention since the successful use of hidden Markov models (HMMs) [1, 2, 3]. In HMM-based systems, a Gaussian mixture model (GMM) was used to model the hidden states of observations. Considering the limitations of the decision tree clustering procedure in modeling the complex context dependencies in HMM-based statistical parametric speech synthesis [4, 5], deep neural networks (DNNs) have been proposed for acoustic modeling, which can produce more natural synthesized speech [4, 6]. More recently, advanced estimation criteria and novel network architectures have been introduced to further improve the performance of SPSS [7, 8, 9, 10, 11].

Since the purpose of training in statistical methods is to maximize the likelihood, or specifically to minimize the mean squared error (MSE) between the synthesized (i.e., network outputs) and the original speech parameters in neural network based text-to-speech (TTS), the synthesized speech may reach only a suboptimal perceptual quality. The underlying hypothesis, reasonable but not necessarily optimal, is that the most natural synthesized speech has the minimal numerical loss, and relying on it may lead to the perceptual deficiency problem. In other words, the reduction in numerical errors may not necessarily lead to better perceived speech [12]. In this paper, we propose to use generative adversarial networks (GANs) [13] to remedy this deficiency.

Significant efforts have been made to remedy the perceptual deficiency problem by improving the training criteria [14, 15, 16, 17]. In [14], by incorporating the whole sequence parameters into training, the sequence generation error (SGE) minimization was proposed to eliminate the mismatch between training and testing. Considering the independence of frames in DNNs, the minimum trajectory error training was adopted to take into account the dynamic constraints from a wide acoustic context during training [15]. In SPSS, the speech features must be invertible for reconstruction through a vocoder, and this rules out the use of many perceptual representations of speech that can not be reconstructed to speech waveform. Hence one solution to the perceptual suboptimality issue is to bring more representative perceptual features into acoustic modeling [12]. In [12], under a multi-task learning (MTL) framework, along with the invertible spectral feature used in the vocoder, extra perceptual representations of speech, e.g., spectro-temporal excitation pattern, were included as a second prediction target in DNN-based SPSS.

In this paper, we propose to use GANs to solve the perceptual deficiency problem in acoustic modeling. GAN is a powerful generative model that has been successfully used in image generation [13, 18, 19] and other tasks [20, 21]. It consists of a generator $G$, which is treated as an acoustic model in our framework to generate speech, and a discriminator $D$ for discriminating between the generated speech and the genuine speech. Specifically, the objective of $G$ is to capture the distribution of the natural speech, while $D$ aids the training of $G$ by examining the data generated by $G$ in reference to real data, and hence helps $G$ learn the distribution that underpins the real data [13]. In our framework, GAN naturally addresses the perceptual deficiency problem: the update of the generator does not come directly from the data samples, but from the back propagation of the discriminator. This means $D$ can capture the essential difference between the natural speech and the synthesized speech, and this ‘perceptual’ difference is used to guide the generator $G$. Considering the mode collapse problem of the generated samples in GAN [18], we take conditional linguistic features as a guidance to control the generation process. Moreover, since the gradients of GANs are not stable, we also use the conventional MSE loss function to stabilize the training process. More specifically, inspired by [16, 12, 22, 23], we combine the MSE loss with the GAN loss under an MTL framework. The objective experiments show that our framework has comparable performance in numerical loss compared to the baseline BLSTM-based TTS, while the subjective listening experiments indicate that the proposed architecture achieves a significant improvement. That is, the proposed GAN approach results in better perceptual speech quality.

2 Related Works

We notice that there are several recent attempts of using GANs to improve the quality of synthesized speech. In [24], GAN was treated as a post-filter for acoustic models to overcome the over-smoothing problem. Specifically, natural speech was used as a conditional guidance of GAN, which tries to reproduce the natural speech texture from the synthesized one. In [25], variational autoencoding Wasserstein GAN (VAW-GAN) was proposed to build a voice conversion system from unaligned data, in which the GAN objective was incorporated into the decoder to improve the conditional variational autoencoder (C-VAE).

Our approach shares a similar idea with [16]. In order to compensate the difference between the synthesized speech and the natural speech in acoustic modeling, an Anti-Spoofing Verification (ASV) module (like the discriminator in GAN) was introduced to distinguish between the natural and the synthetic speech. The speech generator has no difference with a typical neural network acoustic model [4], i.e., learning a non-linear mapping from linguistic features to speech parameters, but the ASV discrimination loss was combined with the minimal generation error (MGE) loss, under an MTL framework, to train the network.

It is noted that our approach is different from [16] in terms of motivation and implementation. Instead of addressing the over-smoothed problem with additional ASV constraint as compensation, we propose to use GANs to directly produce speech samples with closer distribution to natural speech from a uniformly random noise distribution. In other words, the input of our speech generator is random noise, while linguistic features are introduced to both the generator and the discriminator as conditions. As such, the prior uniformly random noise distribution creates new samples that approximate the training data distribution, and it brings diversity with conditions to the synthesized speech from the generator while the linguistic conditions add direct linguistic-discriminative information to the discriminator. On the other hand, as the Nash equilibrium is hard to achieve in network estimation, the training process becomes unstable during the adversarial game. To tackle this problem, we take other optimization methods, such as variational auto-encoder [22] or MSE, to restrain the process. Finally, in our implementation, we use state-of-the-art BLSTM network as a benchmark, which can produce speech with much better quality than feed-forward networks used in [16].

3 GAN-Based Multi-Task Learning

Fig. 1 illustrates the architecture of the proposed GAN-based MTL framework, which consists of a Generator and a Discriminator. In the training process, different from [16], we use random noise as the input of the Generator, and introduce the linguistic features to each hidden layer as the conditional information. The Generator then produces the synthesized speech, and the Discriminator distinguishes between the synthesized speech and the natural speech under the same conditions. The estimation of this process is made up of two aspects: 1) For the Discriminator, the OR operator means that the synthesized samples and natural samples are alternately used to train a binary classifier (whether synthesized or genuine speech). 2) For the Generator, the AND operator is related to the MSE calculation and its optimization, and meanwhile the OR operator signifies that the discriminant error will also affect the estimation. In the synthesis stage, given random noise and specific linguistic features, we can easily generate speech from the Generator in the forward direction. We describe the framework in detail in the following.

Figure 1: System diagram of GAN-based multi-task learning framework.

3.1 Generative Adversarial Networks

GAN is a generative model that can learn a complex relationship between a random noise input vector $z$ and output parameters $x$ by an adversarial process [13]. The estimation of GANs consists of two models: a generative model $G$ that captures the data distribution from random noise $z$, and a discriminative model $D$ that maximizes the probability of correctly discriminating between the real examples and fake samples generated from $G$.

In this adversarial process, the generator $G$ tends to learn a mapping function $G(z)$ to fit the real data distribution $p_{data}(x)$ from a uniformly random noise distribution $p_z(z)$, while the purpose of the discriminator $D$ is to perfectly judge whether the sample is from $G(z)$ or $p_{data}(x)$. So $G$ and $D$ are both trained simultaneously in the two-player min-max game with value function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

In the above generative model, the modes of generated samples cannot be controlled because of the weak guidance. So the conditional generative adversarial network (CGAN) [18] is proposed to direct the generation by considering additional information $y$. Then the loss function can be expressed as

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)|y))] \tag{2}$$
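To make the conditional value function concrete, the following minimal sketch estimates it from discriminator outputs. The function name `cgan_value` and the probabilities fed to it are hypothetical illustration values, not results from this work; it only assumes that $D$ outputs a probability in (0, 1) that its input is natural speech under the given condition.

```python
import numpy as np

def cgan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of E[log D(x|y)] + E[log(1 - D(G(z|y)|y))].

    d_real: D's outputs on natural samples (given their conditions).
    d_fake: D's outputs on generated samples (given the same conditions).
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# Hypothetical discriminator outputs (assumed values for illustration only).
confident = cgan_value([0.9, 0.8], [0.1, 0.2])  # D separates real from fake
fooled = cgan_value([0.5, 0.5], [0.5, 0.5])     # D cannot tell them apart
```

The discriminator tries to push this value up, while the generator tries to push it down, which is why a discriminator that cannot separate the two sets (all outputs at 0.5) yields the lower value of the two cases above.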


3.2 Multi-Task Learning with GANs in SPSS

In the traditional acoustic model for SPSS, we usually minimize the MSE between the predicted parameters and the natural speech during the estimation. The objective can be written as


As Eq. (3) shows, only the numerical difference (in terms of MSE) is considered in the estimation, and a reduction in numerical error may not necessarily lead to a perceptual improvement in the synthesized speech [12]. To solve this problem, we propose to use GANs to learn the essential differences between the synthesized speech and the natural speech through a discriminative process.

GAN is able to generate data rather than estimate the density function. Due to the mode collapse problem of the generative model in GAN [18], we propose the following generator loss function in order to guide the GAN to converge to an optimal solution such that the generative model produces the desired data:


where the synthesized output $\hat{x}$ is generated by the generator $G$ using uniformly random noise $z$ under condition $y$. Combining Eq. (2) and Eq. (4), the final objective of our MTL framework is:


We treat the linguistic features as the additional vector $y$, and make the input noise $z$ obey a uniform distribution in the interval [-1,1]. Then our framework can generate the speech $\hat{x}$ by $G(z|y)$, and $G$ and $D$ are estimated at the same time during training. Note that the input of our speech generator is uniformly random noise and linguistic features are used as conditions for both the generator and the discriminator, which is different from [16].
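As a sketch only, the three objectives discussed in this section can be written in one notation. Read against the text, they play the roles of the MSE objective (Eq. (3)), the generator loss (Eq. (4)) and the final MTL objective (Eq. (5)). The names $L_{mse}$ and $L_G$, the frame count $N$, the per-frame normalization and the unweighted sum of the two terms are assumptions chosen to agree with the surrounding definitions, and need not match the authors' exact form.

```latex
\begin{align}
L_{mse}(x, \hat{x}) &= \frac{1}{N} \sum_{n=1}^{N} \lVert x_n - \hat{x}_n \rVert_2^2,
  \qquad \hat{x} = G(z|y), \\
L_G &= \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z|y)|y)\big)\big]
  + L_{mse}(x, \hat{x}), \\
\min_G \max_D V(D, G) &= \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x|y)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z|y)|y)\big)\big]
  + L_{mse}(x, \hat{x}).
\end{align}
```

Because $L_{mse}$ does not depend on $D$, adding it leaves the discriminator's maximization unchanged and only affects the generator's update.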

Since the effective likelihood of GAN is unknown and intractable [22], several auto-encoder GAN variants use a zero-mean Laplace distribution [26, 27] to solve this problem. In order to directly show the likelihood of these variants, we can simply replace the reconstruction loss with the $L_2$ norm, and then we obtain the MSE form used in traditional methods. That is to say, we can use another explicit likelihood (e.g., MSE) to address the intractable inference of GANs. The reconstruction loss will be investigated in the near future.
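The combination of the adversarial term and the MSE term described in this section can be sketched numerically as follows. This is a minimal illustration in NumPy, not the training code of this work: the function names are hypothetical, the two terms are summed without a weight (an assumption), and the discriminator outputs are placeholder probabilities. The 200-dimensional uniform noise and the 63-dimensional acoustic features follow the experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noise(n_frames, dim=200):
    """Draw generator input noise from the uniform distribution on [-1, 1]."""
    return rng.uniform(-1.0, 1.0, size=(n_frames, dim))

def generator_loss(d_fake, x_hat, x, eps=1e-12):
    """Adversarial term E[log(1 - D(G(z|y)|y))] plus the MSE between x_hat and x."""
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    adversarial = np.mean(np.log(1.0 - d_fake))
    mse = np.mean((np.asarray(x_hat) - np.asarray(x)) ** 2)
    return float(adversarial + mse)

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Negative value function, so that the discriminator minimizes it."""
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(-(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))))

# Placeholder data: 4 frames of 63-dimensional acoustic features.
z = sample_noise(4)
x = np.zeros((4, 63))
x_hat = x.copy()  # a perfect reconstruction, so the MSE term vanishes
g_loss = generator_loss(np.full(4, 0.5), x_hat, x)
d_loss = discriminator_loss(np.full(4, 0.5), np.full(4, 0.5))
```

Both losses are updated in alternation during training; the MSE term gives the generator a stable gradient even when the discriminator's feedback is noisy.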

3.3 Phoneme Discrimination for GANs

In Section 3.2, the discriminator is a binary classifier to judge whether the data is from $G(z|y)$ or $p_{data}(x)$ under the condition $y$. We also try to use phoneme information to guide the discrimination process in our multi-task framework, as shown in Fig. 2.

Figure 2: The discriminator with phoneme information.


The phoneme label is a one-hot encoded vector representing the phoneme class, which is the category of both fake and real samples for $D$. Our goal is then to minimize the cross entropy (CE) for the real samples and to maximize this loss for the fake ones; the latter means that we do not know which phoneme a fake sample belongs to. So the target function of the GAN can be updated with


We obtain the new loss function considering the phoneme classification as follows.
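The sign convention described in this section (minimize the cross entropy on real samples, maximize it on fake samples) can be sketched as below. The function names, the three-class toy labels and the posterior values are hypothetical; the unweighted difference of the two CE terms is an assumption made for illustration.

```python
import numpy as np

def cross_entropy(probs, one_hot, eps=1e-12):
    """Mean cross entropy between predicted posteriors and one-hot labels."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    one_hot = np.asarray(one_hot, dtype=float)
    return float(-np.mean(np.sum(one_hot * np.log(probs), axis=1)))

def phoneme_discriminator_loss(p_real, p_fake, labels):
    """Loss minimized by D: CE on real frames minus CE on fake frames."""
    return cross_entropy(p_real, labels) - cross_entropy(p_fake, labels)

# Toy example with 3 phoneme classes and 2 frames (assumed values).
labels = [[1, 0, 0], [0, 1, 0]]
p_real = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]   # D recognizes the real phonemes
p_fake = [[1 / 3, 1 / 3, 1 / 3]] * 2          # D is uncertain on fake frames
loss = phoneme_discriminator_loss(p_real, p_fake, labels)
```

Minimizing this quantity drives the discriminator toward confident phoneme predictions on natural speech and uninformative ones on generated speech.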


4 Experiments

4.1 Experimental Setup

In the experiments, a Chinese speech corpus was used to evaluate the performance of our approach. The corpus consists of about 10,000 utterances from a single female speaker. We randomly selected around 8,000 utterances for network training, 1,000 utterances for model validation and another 1,000 for testing. Each speech waveform was sampled at 16 kHz, and we used WORLD [28] (D4C edition [29]) to extract 60-dimensional Mel-Cepstral Coefficients (MCCs), 1-dimensional band aperiodicities (BAP) and $F_0$ in log-scale with a 5-ms step. So the final acoustic features were 63-dimensional, including one extra binary voiced/unvoiced flag. As for the text, we built a complex text analysis module to get 138-dimensional linguistic features, including phoneme information, prosody boundary labeling, part of speech tagging, state information and corresponding position index.

To benchmark the performance of the GAN-based MTL framework, we compared four systems, listed as follows.


  • BLSTM: We used a bidirectional long short-term memory (BLSTM) based acoustic model as the baseline, which contained three feed-forward layers with 512 nodes/layer, followed by two BLSTM layers with 512 cells and a fully-connected output layer.

  • GAN-based MTL (GAN): The proposed framework shown in Fig. 1. For the generator $G$, we also used three feed-forward and two BLSTM layers, corresponding to the baseline, but the input was replaced with 200-dimensional random noise drawn from the uniform distribution on [-1,1]. The linguistic features were added to the output of each hidden layer in $G$ as conditions. As for the discriminator $D$, two convolutional layers were used, with the activation function followed by batch normalization [24, 19]. Besides, there was a fully-connected layer after the convolutional architecture and a binary classification layer at the end. The linguistic conditions were also introduced to all hidden layers in $D$.

  • ASV as a second task [16] (ASV): We implemented the ASV approach of [16]. The network architecture was the same as BLSTM.

  • GAN with phoneme classification (GAN-PC): The same GAN model architecture was used, except that the output layer of $D$ became a 63-category phoneme classification task, as described in Section 3.3.
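The way linguistic features condition the generator can be illustrated with a small forward pass. This is a simplified sketch, not the network used in the experiments: it keeps only three feed-forward layers (the BLSTM layers are omitted), uses `tanh` and random weights as placeholder choices, and assumes that "adding" the conditions means concatenating them to each hidden layer output. The dimensions (200, 138, 512, 63) are taken from the setup above; all function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_DIM, LING_DIM, HIDDEN, OUT_DIM = 200, 138, 512, 63

def init_layer(n_in, n_out):
    """Small random weights and zero bias (placeholder initialization)."""
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)

# Every layer after the first receives [hidden output ; linguistic features].
layers = [
    init_layer(NOISE_DIM, HIDDEN),
    init_layer(HIDDEN + LING_DIM, HIDDEN),
    init_layer(HIDDEN + LING_DIM, HIDDEN),
]
out_w, out_b = init_layer(HIDDEN + LING_DIM, OUT_DIM)

def generate(z, y):
    """Map noise z (frames x 200) and conditions y (frames x 138) to features."""
    h = z
    for w, b in layers:
        h = np.tanh(h @ w + b)
        h = np.concatenate([h, y], axis=1)  # inject the linguistic condition
    return h @ out_w + out_b

n_frames = 5
z = rng.uniform(-1.0, 1.0, size=(n_frames, NOISE_DIM))
y = rng.normal(size=(n_frames, LING_DIM))  # stand-in linguistic features
x_hat = generate(z, y)
```

Feeding the condition into every layer keeps the linguistic information available at each depth, so the output cannot drift toward a function of the noise alone.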

All the above methods were optimized using the Adam optimizer [30, 31], and we implemented all the systems with TensorFlow.


4.2 Objective Evaluation

We first conducted objective measurements to evaluate the performance of the four systems on the testing data. Specifically, Mel-cepstral distortion (MCD) was used to evaluate the distortion of the spectrum, and RMSE was used to calculate the $F_0$ error. Besides, the V/UV error rate was used to measure the accuracy of the voiced/unvoiced flag decisions.
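For reference, the three objective measures can be computed as in the sketch below. It uses the commonly used MCD definition; excluding the 0th (energy) coefficient, computing the $F_0$ RMSE only on frames that are voiced in both signals, and marking unvoiced frames with $F_0 = 0$ are conventions assumed here, since the exact variants used in the experiments are not stated.

```python
import numpy as np

def mel_cepstral_distortion(mcc_ref, mcc_syn):
    """Mean MCD in dB over frames; inputs are (frames x coefficients)."""
    diff = np.asarray(mcc_ref, dtype=float)[:, 1:] - np.asarray(mcc_syn, dtype=float)[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def f0_rmse(f0_ref, f0_syn):
    """RMSE in Hz over frames that are voiced (F0 > 0) in both signals."""
    f0_ref, f0_syn = np.asarray(f0_ref, dtype=float), np.asarray(f0_syn, dtype=float)
    voiced = (f0_ref > 0) & (f0_syn > 0)
    return float(np.sqrt(np.mean((f0_ref[voiced] - f0_syn[voiced]) ** 2)))

def vuv_error_rate(f0_ref, f0_syn):
    """Percentage of frames whose voiced/unvoiced decision differs."""
    f0_ref, f0_syn = np.asarray(f0_ref), np.asarray(f0_syn)
    return float(100.0 * np.mean((f0_ref > 0) != (f0_syn > 0)))

# Toy values (assumed, for illustration only).
mcc_ref = np.zeros((2, 3))
mcc_syn = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
f0_ref = [100.0, 0.0, 200.0, 150.0]
f0_syn = [110.0, 0.0, 0.0, 140.0]
mcd = mel_cepstral_distortion(mcc_ref, mcc_syn)
rmse = f0_rmse(f0_ref, f0_syn)
vuv = vuv_error_rate(f0_ref, f0_syn)
```

All three measures are frame-level numerical distances, which is why they may fail to reflect perceptual differences between systems.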

Methods     MCD (dB)    RMSE (Hz)    V/UV (%)
BLSTM
ASV [16]
GAN
GAN-PC

Table 1: Objective evaluation results.

Table 1 shows the objective results. As shown, there seems to be no remarkable differences among these systems. Since one purpose of our framework was to compensate the deficiency of traditional numerical loss function by GANs, these numerical measures may not be suitable for evaluation because of the internal squared error [24]. But in another aspect, we can find that the proposed MTL framework can compensate the instability and the mode collapse issues of GANs and generate stable and diverse speech with the help of traditional loss function. That is to say, GANs can utilize the numerical loss function to limit its adversarial process.

4.3 Subjective Evaluation

We conducted listening tests to assess the quality of the synthesized speech. We carried out four pairs of A/B preference tests: BLSTM vs. GAN, GAN vs. ASV, GAN vs. GAN-PC and BLSTM vs. GAN-PC. For the listening tests, 20 sentences were randomly selected from the test data, and all listening pairs were presented in a shuffled order. There were 20 listeners participating in the test. In each test session, the listeners were asked to choose the better one considering the perceived speech quality, or choose the “neutral” option if there was no difference.
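As an illustration of how such A/B results can be summarized, the sketch below turns vote counts into preference percentages and computes a two-sided exact sign test on the non-neutral votes. The choice of test is an assumption (the statistical test behind the reported significance is not specified here), the votes are treated as independent, and the counts are made-up numbers for a pair with 20 sentences and 20 listeners (400 judgments).

```python
from math import comb

def preference_scores(n_a, n_neutral, n_b):
    """Percentage of votes for system A, neutral, and system B."""
    total = n_a + n_neutral + n_b
    return tuple(100.0 * n / total for n in (n_a, n_neutral, n_b))

def sign_test_p_value(n_a, n_b):
    """Two-sided exact binomial test on non-neutral votes under p = 0.5."""
    n = n_a + n_b
    k = min(n_a, n_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical vote counts for one A/B pair (not results from this work).
scores = preference_scores(200, 80, 120)
p_value = sign_test_p_value(200, 120)
```

Neutral votes are excluded from the test because they carry no preference direction, although they still count toward the percentages shown in a preference bar.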

Figure 3: The preference score (%) of A/B test.

Fig. 3 shows the preference bars of the four pairs. The first bar of GAN vs. BLSTM indicates that the GAN-based MTL framework can significantly improve the quality of the synthesized speech. The listeners pointed out that the GAN system could produce speech with less buzzy sounds in most cases and more natural prosody in some samples. The second bar of GAN vs. ASV shows that the proposed GAN approach is better than the ASV approach. As discussed in [16], the ASV optimization aims to make the distribution of the synthesized speech close to that of natural speech. But this method theoretically lacks the linguistic conditional guidance in distinguishing between the distributions of natural and synthesized speech. As a result, compared with ASV, we find that the proposed GAN approach can capture both subtle and rapid changes, leading to better brightness of the synthesized speech. Fig. 4 shows the distance of averaged global variance (GV) between natural speech and synthesized speech from different approaches. Smaller values mean that the GVs of synthesized speech are more similar to those of natural speech. The result indicates that the GVs of GAN are closer to the natural GVs than those of ASV, especially in the first few coefficients.
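The GV comparison can be sketched as follows: the GV of an utterance is the per-coefficient variance of its mel-cepstral trajectory, averaged over utterances and then compared with the natural reference. The absolute difference used as the distance and the function names are assumptions for illustration; the toy arrays are not data from this work.

```python
import numpy as np

def global_variance(mcc):
    """Per-coefficient variance over the frames of one utterance."""
    return np.var(np.asarray(mcc, dtype=float), axis=0)

def averaged_gv_distance(natural_utts, synthesized_utts):
    """Absolute difference between utterance-averaged GVs, per coefficient."""
    gv_nat = np.mean([global_variance(u) for u in natural_utts], axis=0)
    gv_syn = np.mean([global_variance(u) for u in synthesized_utts], axis=0)
    return np.abs(gv_nat - gv_syn)

# Toy utterances with 2 frames and 2 coefficients each (assumed values).
natural = [np.array([[0.0, 0.0], [2.0, 4.0]])]
synthesized = [np.array([[0.0, 0.0], [1.0, 2.0]])]
distance = averaged_gv_distance(natural, synthesized)
```

A synthesized trajectory that is over-smoothed has a smaller variance than the natural one, so it shows up as a larger distance in this measure.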

Figure 4: The difference of averaged GVs per mel-cepstral coefficients compared to natural speech.

The third bar of GAN vs. GAN-PC in Fig. 3 shows that there is a decline in performance with phoneme classification in GANs. In order to explain this phenomenon, we compared the BLSTM system with GAN-PC, as shown in the last bar. In this preference test, GAN-PC slightly outperforms BLSTM, but the difference between the two systems is not significant. As we know, the phoneme information is related to the contents of speech, which highly correlates with the intelligibility of the synthesized samples. The BLSTM-based acoustic model can already produce speech with high intelligibility. However, the purpose of treating the phoneme label as a guidance for the discriminator in GAN-PC is to improve the intelligibility, not to make the distribution of synthesized examples closer to that of the natural samples. So simply letting the discriminator distinguish whether the input is a natural sample, as in GAN, can make the synthesized speech more related to human perception, resulting in better subjective listening performance.

5 Conclusion

This paper proposed to use GANs to improve the quality of synthesized speech. We use a multi-task learning architecture with GANs, where the GAN compensates for the deficiency of the traditional MSE loss function while the MSE helps to reduce the instability of the GAN. Evaluation results show that the proposed method can compensate for the weakness of the numerical loss function and improve the performance of SPSS. Compared to traditional acoustic models, the proposed framework incurs only a small increase in computation cost during the generation process, as the extra computation only comes from noise generation and feature concatenation.

In our future work, we will focus on improving the performance of the generator in GAN and try to use GAN in end-to-end speech synthesis [33, 34, 35, 36, 37]. Since the MSE loss is still used to stabilize the adversarial process in our framework, we will attempt to find a self-stabilizing architecture to directly estimate the distribution of synthesized speech, such as using Wasserstein GAN [31] and VAE-GAN [22].