High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram

12/03/2019, by Leyuan Sheng et al.

In speech synthesis and speech enhancement systems, mel-spectrograms need to be precise acoustic representations. However, the generated spectrograms are often over-smoothed and cannot produce high-quality synthesized speech. Inspired by image-to-image translation, we address this problem with a learning-based post-filter that combines Pix2PixHD and ResUnet to reconstruct mel-spectrograms at super-resolution. From the resulting super-resolution spectrogram network, we generate enhanced spectrograms that yield high-quality synthesized speech. Our proposed model achieves improved mean opinion scores (MOS) of 3.71 and 4.01 over baseline results of 3.29 and 3.84, when using the Griffin-Lim and WaveNet vocoders, respectively.


1 Introduction

Text-to-speech (TTS) synthesis aims at producing intelligible and natural speech for a given text input. A traditional TTS system includes two parts, the front-end (text analyzer) and the back-end (speech synthesizer), which usually consist of many domain-specific modules. The front-end requires a strong background in linguistics for text analysis and feature extraction. The back-end requires an understanding of the vocal mechanism and of speech signal processing in order to complete the speech synthesis from the features extracted by the front-end.

Recently, TTS approaches based on deep learning have been proposed and extensively investigated. Deep learning can integrate both the front-end and the back-end into a single “end-to-end” learning model; examples include Tacotron [1], Char2Wav [2], Tacotron 2 [3] and ClariNet [4]. These end-to-end generative text-to-speech models typically perform two tasks: 1) directly mapping the text into time-aligned acoustic features, as in Tacotron and Tacotron 2; and 2) converting the generated acoustic features into a waveform using a vocoder such as the Griffin-Lim algorithm [5], WaveNet [6] or WaveGlow [7]. The mel-spectrogram representation has been found to be reasonable and effective for speech synthesis systems [1][10]. Therefore, we use mel-spectrograms to bridge the two steps of a TTS system. Note that after obtaining the mel-spectrograms, Tacotron and Tacotron 2 include a post-net module that refines the mel-spectrogram predictions and thereby improves the quality of the synthesized speech. The closer the mel-spectrograms predicted by the model are to the mel-spectrograms extracted from ground-truth audio, the higher the quality of the synthesized speech.

As research on generative models progresses, new advanced Generative Adversarial Networks (GANs) have started to emerge in speech processing. A speech enhancement GAN (SEGAN) [8] and a Convolutional Neural Network (CNN)-based GAN [9] were proposed for speech enhancement (SE). GAN-based post-filters for short-term Fourier transform (STFT) spectrograms were proposed in [10] and [11]. Moreover, a conditional GAN (cGAN) architecture adopting the pixel-to-pixel (Pix2Pix) framework has been investigated for SE [12] and Music Information Retrieval (MIR) [13]. Since the high-resolution image synthesis architecture Pix2PixHD [14] improves on Pix2Pix, the authors of [15] implemented Pix2PixHD within a cGAN framework to reduce over-smoothness in speech synthesis.

In this work, we propose a novel model that improves mel-spectrogram prediction for high-quality speech synthesis by combining the advantages of Pix2PixHD [14] and the deep residual U-Net (ResUnet) [16]. Several researchers have investigated applying image-translation models to speech, but these works either adapt the image-translation model by modifying the acoustic feature inputs [10], [13], [17], or modify the image-translation model itself [18]. In practice, the image-translation model is combined with acoustic feature processing. The contributions of this paper are: 1) we demonstrate that the spectrogram can be treated entirely as an image, so that speech signal processing tasks can be handled by image restoration or translation models; 2) our proposed model can effectively improve speech quality based on the generated enhanced spectrogram images.

2 Related Work

Our proposed super-resolution mel-spectrogram model is based on the Pix2PixHD conditional GAN-based image translation framework. In contrast to the original Pix2PixHD model, we use only a local enhancer network, in which a ResUnet replaces the usual GAN generator. In this section, we first review conditional GANs and then briefly describe the Pix2PixHD framework.

2.1 Conditional GANs

As described in [19], GANs are generative adversarial networks, which consist of two adversarial models: the generator G and the discriminator D. The generator network learns to map noise variables z ~ p_z(z) to a complex data distribution, where z is the noise sample and x is the data sample. The generative model G attempts to make the distribution of generated samples G(z) indistinguishable from the actual data distribution p_data(x). The discriminative model D is trained to distinguish samples drawn from the data distribution from samples produced by G. Here G and D play a min-max game, which can be represented by a value function:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (1)

Due to the weak guidance of the generative model, the generated samples cannot be controlled. Therefore, the conditional GAN was proposed to guide the generation by considering additional information c. The training objective of the conditional GAN can be expressed as:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid c)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid c)))]    (2)
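As an illustration, the following is a minimal PyTorch sketch of the conditional GAN objective in Eq. 2, written in the classic log-loss form (Pix2PixHD itself uses a least-squares variant). The generator G, discriminator D and the tensors x_real, cond and z are placeholder interfaces, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def cgan_losses(G, D, x_real, cond, z):
    """Conditional GAN losses of Eq. (2): D scores samples given the condition."""
    x_fake = G(z, cond)
    # Discriminator: push real pairs toward 1 and generated pairs toward 0.
    d_real = D(x_real, cond)
    d_fake = D(x_fake.detach(), cond)
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    # Generator: try to make the discriminator score generated pairs as real.
    d_gen = D(x_fake, cond)
    loss_g = F.binary_cross_entropy_with_logits(d_gen, torch.ones_like(d_gen))
    return loss_d, loss_g
```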

2.2 Pix2pixHD

The Pix2PixHD framework improved on the Pix2Pix [20] framework and achieved good results in image translation by using:

  1. A coarse-to-fine generator. Coarse: a residual global generator network (G1) trained on lower-resolution images. Fine: a residual local enhancer network (G2) appended to G1, after which the two networks are trained jointly on the high-resolution images.

  2. Multi-scale discriminators that have identical network structures but operate at different image scales.

  3. A robust adversarial learning objective that is based on the conditional GAN loss in Eq. 2 and is combined with a feature matching loss to stabilize training, since the generator has to produce realistic image features at different scales. The feature matching loss is given by:

    \mathcal{L}_{FM}(G, D_k) = \mathbb{E}_{(s, x)} \sum_{i=1}^{T} \frac{1}{N_i} \left[ \left\| D_k^{(i)}(s, x) - D_k^{(i)}(s, G(s)) \right\|_1 \right]    (3)

    where s is the conditional information (a semantic label map), x is a corresponding original image, D_k^{(i)} denotes the i-th layer of discriminator D_k, T is the total number of layers and N_i denotes the number of elements in the i-th layer.

These enhancements enable conditional GANs to synthesize high-resolution image samples.
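For concreteness, a minimal sketch of the feature matching term in Eq. 3 is shown below. It assumes a discriminator that exposes its per-layer activations as a list of tensors, which is an interface assumption rather than the paper's code.

```python
import torch

def feature_matching_loss(feats_real, feats_fake):
    """L1 distance between discriminator activations on real and generated
    pairs, weighted by the number of elements per layer as in Eq. (3).
    feats_real and feats_fake are lists of tensors, one entry per layer."""
    loss = 0.0
    for f_real, f_fake in zip(feats_real, feats_fake):
        # torch.mean already divides by the number of elements N_i in the layer.
        loss = loss + torch.mean(torch.abs(f_real.detach() - f_fake))
    return loss
```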

3 Super-resolution mel-spectrogram

The goal of generating a super-resolution mel-spectrogram is to produce a high-resolution synthetic mel-spectrogram with fine-grained details from a given coarse mel-spectrogram (as shown in Fig. 1). We propose to use an adversarial strategy to train a ResUnet (as shown in Fig. 3) that replaces the generator of Pix2PixHD. The components of the ResUnet and its contrasts with Pix2PixHD are explained in the following paragraphs.

Figure 1: Illustration of cGAN system architecture.
Figure 2: The differences between the Pix2PixHD and our model: (a) a modified local enhancer network. (b) The global generator network is not used in our model.
Figure 3: Local enhancer generator architecture. (a) Components of the ResUnet architecture. (b) ResUnet with long-skip residual connection. (c) Downsampling and upsampling blocks.

3.1 Local enhancer generator

Unlike Pix2PixHD, which decomposes the generator into two sub-networks (G1 and G2) as shown in Fig. 2, our model requires only the local enhancer, because the input mel-spectrograms are already similar to the real mel-spectrograms; we only need local enhancements that restore the lost information. The ResUnet is mainly used to rebuild and restructure this information and then represent the images in a local neighborhood of image space. Our local enhancer network consists of a ResUnet with a set of residual downsampling and upsampling blocks, as shown in Fig. 3; a sketch of one such block is given below.
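The exact layer configuration of these blocks is not specified in the text above, so the following PyTorch snippet is only one plausible residual downsampling block, with assumed channel sizes and normalization choices.

```python
import torch
import torch.nn as nn

class ResidualDownBlock(nn.Module):
    """A plausible residual downsampling block for the local enhancer:
    two convolutions with a strided shortcut (all sizes are assumptions)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm2d(out_ch),
        )
        # 1x1 strided convolution so the shortcut matches the body's output shape.
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2)

    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))
```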

3.2 Adversarial loss

As in Pix2PixHD, the full objective combines a multi-scale discriminator loss and a feature matching loss. We use four multi-scale discriminators to distinguish the real mel-spectrograms from the generated ones, using the loss function in Eq. 4. In addition, we also add a Structural Similarity Index (SSIM) [21] loss and a Mean Squared Error (MSE) loss to increase the stability of training.

\min_G \max_{D_1, \ldots, D_4} \sum_{k=1}^{4} \mathcal{L}_{GAN}(G, D_k)    (4)
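A minimal sketch of this multi-scale term is shown below. It assumes discriminators that take an (image, condition) pair and uses a least-squares GAN formulation, a common choice in Pix2PixHD-style models; both the interfaces and the pooling-based scale pyramid are assumptions.

```python
import torch
import torch.nn.functional as F

def multiscale_d_loss(discriminators, x_real, x_fake, cond):
    """Sum of per-scale discriminator losses as in Eq. (4). Each discriminator
    sees the inputs average-pooled by a factor of 2 relative to the previous one."""
    loss = 0.0
    for k, D in enumerate(discriminators):
        factor = 2 ** k
        r = F.avg_pool2d(x_real, factor) if factor > 1 else x_real
        f = F.avg_pool2d(x_fake, factor) if factor > 1 else x_fake
        c = F.avg_pool2d(cond, factor) if factor > 1 else cond
        d_real, d_fake = D(r, c), D(f.detach(), c)
        # Least-squares GAN targets: 1 for real pairs, 0 for generated pairs.
        loss = loss + torch.mean((d_real - 1) ** 2) + torch.mean(d_fake ** 2)
    return loss
```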

SSIM is one of the state-of-the-art perceptual metrics for measuring image quality and has been shown to be sensitive to structural information and texture. SSIM takes values between -1 and 1, where 1 indicates perfect perceptual quality relative to the ground truth. For a pixel p of images x and y, the SSIM value is computed as:

SSIM(p) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}    (5)

where μ_x and μ_y are the means of the local regions around p in x and y; σ_x and σ_y are the standard deviations of these regions; σ_xy is their covariance; and C_1 and C_2 are constant values for stabilizing the denominator. The SSIM loss for every pixel p is expressed as:

\mathcal{L}_{SSIM}(p) = 1 - SSIM(p)    (6)
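The snippet below is a small differentiable sketch of Eqs. 5-6, computing the local statistics with a uniform window via average pooling; the window size and the [0, 1] input range are assumptions, not settings reported in the paper.

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y, window=11, C1=0.01 ** 2, C2=0.03 ** 2):
    """Mean of 1 - SSIM(p) over all pixels (Eqs. 5-6) for batched images
    shaped (N, C, H, W) with values assumed to lie in [0, 1]."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, window, stride=1, padding=pad)
    sigma_x = F.avg_pool2d(x * x, window, stride=1, padding=pad) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, window, stride=1, padding=pad) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, window, stride=1, padding=pad) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x ** 2 + sigma_y ** 2 + C2))
    return (1 - ssim).mean()
```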

The MSE loss, commonly used for image restoration tasks, is defined as:

\mathcal{L}_{MSE} = \frac{1}{N} \sum_{p=1}^{N} \left( x(p) - y(p) \right)^2    (7)

where p denotes a pixel and N denotes the total number of pixels in images x and y.

Our full objective function is defined as the combination of two parts: (1) the GAN and feature matching losses, and (2) the intensity loss (MSE) and the structural loss (SSIM). This is expressed as:

\min_G \left( \left( \max_{D_1, \ldots, D_4} \sum_{k=1}^{4} \mathcal{L}_{GAN}(G, D_k) \right) + \lambda \sum_{k=1}^{4} \mathcal{L}_{FM}(G, D_k) + \lambda \left( \mathcal{L}_{MSE} + \mathcal{L}_{SSIM} \right) \right)    (8)

where λ is a constant that controls the relative importance of the loss terms.
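In code, combining the terms reduces to a weighted sum; the sketch below merely makes the generator side of Eq. 8 concrete, with the individual loss values assumed to be computed elsewhere and λ = 10 as an illustrative placeholder rather than the paper's setting.

```python
def full_generator_objective(loss_gan, loss_fm, loss_mse, loss_ssim, lam=10.0):
    """Generator side of Eq. (8): adversarial term plus lambda-weighted
    feature matching, intensity (MSE) and structural (SSIM) terms."""
    return loss_gan + lam * (loss_fm + loss_mse + loss_ssim)
```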

4 Experiments

4.1 Corpus and features

Our work uses the open-source LJSpeech dataset [22], which consists of pairs of text and short audio clips from a single English speaker. The speech waveforms have a sampling frequency of 22,050 Hz, a total duration of roughly 24 hours, and no phoneme-level alignment. We randomly selected a subset of utterances for training the networks and a disjoint subset as testing data.

The mel-spectrograms are extracted from the original speech data using a window length of 1024, a hop length of 256 and an FFT size of 1024 as the parameters of the fast Fourier transform (FFT). To obtain the input (baseline) mel-spectrograms, we use the Griffin-Lim algorithm to invert the original mel-spectrograms into time-domain waveforms, from which we extract the input mel-spectrograms with the same parameters. We note that the Griffin-Lim algorithm produces characteristic artifacts and low-quality audio, a common problem in speech synthesis, which creates a gap between the input and output mel-spectrograms.
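A minimal librosa sketch of this extraction pipeline is given below; the file name and the choice of 80 mel channels are illustrative assumptions, while the window, hop and FFT sizes follow the values stated above.

```python
import librosa

# Extract the "original" mel-spectrogram, then build the coarse baseline by
# inverting it with Griffin-Lim and re-extracting with the same parameters.
y, sr = librosa.load("LJ001-0001.wav", sr=None)          # illustrative file name
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, win_length=1024, hop_length=256, n_mels=80)
y_gl = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, win_length=1024, hop_length=256)
mel_coarse = librosa.feature.melspectrogram(
    y=y_gl, sr=sr, n_fft=1024, win_length=1024, hop_length=256, n_mels=80)
```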

Another key idea is to treat the mel-spectrogram as an image. The stored mel-spectrogram undergoes a series of scaling transformations, all of which are reversible. The size of our extracted mel-spectrogram is M x T, where M is the number of Mel filterbank channels and T is the number of frames in the mel-spectrogram. Due to limited computing resources, we scaled the model-input mel-spectrograms down to a fixed image size; with a larger input scale, better performance might be obtained than with other scale settings.
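The exact transformations are not spelled out above, but one simple reversible scaling is a log transform followed by min-max normalization, as sketched here; the epsilon value and the [0, 1] target range are assumptions.

```python
import numpy as np

def mel_to_image(mel, eps=1e-5):
    """Map a mel-spectrogram to a [0, 1] image, keeping the statistics
    needed to invert the transform later."""
    log_mel = np.log(mel + eps)
    lo, hi = log_mel.min(), log_mel.max()
    return (log_mel - lo) / (hi - lo), (lo, hi)

def image_to_mel(img, stats, eps=1e-5):
    """Invert mel_to_image using the stored min/max statistics."""
    lo, hi = stats
    return np.exp(img * (hi - lo) + lo) - eps
```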

4.2 Vocoder methods

In our experiments, we use the Griffin-Lim algorithm and the WaveNet vocoder to synthesize speech from mel-spectrograms and then evaluate the quality of the synthesized speech. If the mel-spectrograms generated by our proposed model can restore the information lost in the coarse mel-spectrograms, better quality of synthesized speech may be achieved.

4.2.1 Griffin-Lim

The Griffin-Lim algorithm iteratively estimates the missing phase information by converting between modified magnitude spectrograms and actual signal spectrograms. This iterative process may degrade the quality of the synthesized speech. For simplicity, we use a fixed number of Griffin-Lim iterations to convert from the frequency domain to the time domain in our experiments.
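For reference, librosa exposes this inversion directly, as in the sketch below; the sampling rate and iteration count shown are illustrative defaults, not the settings used in the paper.

```python
import librosa

def mel_to_waveform(mel, sr=22050, n_iter=60):
    """Invert a mel-spectrogram to audio with Griffin-Lim phase estimation,
    matching the analysis parameters of Sec. 4.1."""
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=1024, win_length=1024, hop_length=256, n_iter=n_iter)
```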

4.2.2 WaveNet

WaveNet is a typical autoregressive generative model with a convolutional architecture based on dilated convolutions; it learns to model raw signals directly in the time domain and achieves very high-quality synthetic speech. The log-scale mel-spectrogram has been found to be a good conditioning feature, and WaveNet reconstructs waveforms from it more accurately than the Griffin-Lim algorithm.

4.3 Model setup

We trained our model on NVidia GeForce GTX TITAN X GPUs with (coarse mel-spectrogram, original mel-spectrogram) image pairs as input, training for several days. We used the Adam optimizer [23] with a fixed initial learning rate; after an initial number of epochs, the learning rate was decayed linearly. We first trained the local multi-scale enhancer network with two, three and four scales, and found that using four scales most effectively accelerated the decline of the loss function. Correspondingly, four discriminators with identical network structures were used at different image scales.

4.4 Evaluation

To evaluate the quality of the enhanced speech, we used the Griffin-Lim algorithm and a pre-trained WaveNet model (available at https://github.com/r9y9/wavenet_vocoder) to synthesize speech from the predicted, coarse and original mel-spectrograms.

4.4.1 Objective Evaluation

We selected mel-spectrograms generated by our model from the test dataset and compared them with the coarse and original mel-spectrograms, as shown in Fig. 4. We observed that our model can not only emphasize the harmonic structure, but can also reproduce, from the coarse spectrograms, detailed structures that are close to those in the original mel-spectrograms. Moreover, we use the SSIM value as an image-quality measure for the mel-spectrograms and the Short-Time Objective Intelligibility (STOI) [24] index as a measure of speech intelligibility. Table 1 shows that the enhanced mel-spectrograms bring a significant improvement, and Table 2 shows that the intelligibility of speech generated from the predicted mel-spectrograms approaches that of speech generated from the original ones. We also found that the SSIM and STOI values are relatively well correlated.
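Both metrics are available in common Python packages; the sketch below shows how a single test pair could be scored, assuming the mel-spectrograms and waveforms are already loaded as NumPy arrays and that the audio sampling rate is 22,050 Hz.

```python
from skimage.metrics import structural_similarity
from pystoi import stoi

def evaluate_pair(mel_pred, mel_orig, wav_pred, wav_orig, sr=22050):
    """SSIM on mel-spectrogram images and STOI on the corresponding waveforms,
    the two objective measures used in this evaluation."""
    ssim_val = structural_similarity(
        mel_orig, mel_pred, data_range=mel_orig.max() - mel_orig.min())
    stoi_val = stoi(wav_orig, wav_pred, sr, extended=False)
    return ssim_val, stoi_val
```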

Mel-spectrogram SSIM
Coarse (baseline)
Predicted (ours)
Original
Table 1: Mel-spectrogram comparison in terms of SSIM
Mel-spectrogram Griffin-Lim WaveNet
Coarse (baseline)
Predicted (ours)
Original
Table 2: Speech intelligibility comparison in terms of STOI
Figure 4: Comparison of Coarse mel-spectrogram (top), Predicted mel-spectrogram (middle) and Original mel-spectrogram (bottom).

4.4.2 Subjective evaluation

The commonly used Mean Opinion Score (MOS) test was conducted to compare audio synthesized from the coarse mel-spectrograms with audio synthesized from the predicted mel-spectrograms. To make sure the results are legitimate, we used unseen coarse mel-spectrograms to generate the predicted mel-spectrograms. We then used the Griffin-Lim and WaveNet vocoders to synthesize speech, which was sent to Amazon’s Mechanical Turk human rating service, where raters listened to the audio and rated it on a five-point scale. Our samples were rated by native listeners and the subjective MOS was calculated.

The results of the subjective MOS evaluation are shown in Table 3, with confidence intervals computed from the t-distribution for the various systems. As we can see, our best model produced better mel-spectrograms than the coarse mel-spectrograms (baseline). The absolute improvement was 0.42 MOS when using Griffin-Lim as the vocoder and 0.17 MOS when using WaveNet, both improving the quality of the synthesized speech. The mel-spectrograms predicted by our model achieve a subjective MOS comparable to that of the original mel-spectrograms, regardless of whether the speech is synthesized by Griffin-Lim or WaveNet. Audio samples are available at https://speech-enhancer.github.io/.

Mel-spectrogram       Griffin-Lim   WaveNet
Coarse (baseline)     3.29          3.84
Predicted (ours)      3.71          4.01
Original
  • Note: ground truth:

Table 3: Mean Opinion Scores

5 Conclusions

In this work, we have proposed an image-to-image translation model for high-quality speech synthesis from mel-spectrograms. The proposed model combines the advantages of Pix2PixHD and ResUnet (residual learning and U-Net). The multi-scale local enhancer network and the skip connections within the residual units can reproduce the detailed structures of mel-spectrograms. Our method significantly outperforms the baseline in subjective MOS listening tests. However, the results also show a remaining gap between the synthesized speech and the ground truth, which may be caused by limitations of the vocoder. In future work, we will design a better vocoder to tackle this problem.

References