WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks

09/25/2018 · by Kou Tanaka, et al.

We propose a learning-based filter that allows us to directly modify a synthetic speech waveform into a natural speech waveform. Speech-processing systems using a vocoder framework, such as statistical parametric speech synthesis and voice conversion, are convenient, especially when only a limited amount of data is available, because they represent and process interpretable acoustic features over a compact space, such as the fundamental frequency (F0) and mel-cepstrum. However, a well-known problem that degrades the quality of generated speech is the over-smoothing effect, which eliminates some of the detailed structure of the generated/converted acoustic features. To address this issue, we propose a synthetic-to-natural speech waveform conversion technique that uses cycle-consistent adversarial networks and requires no explicit assumptions about the speech waveform during adversarial learning. In contrast to current techniques, since our modification is performed at the waveform level, we expect that the proposed method will also make it possible to generate "vocoder-less" sounding speech even if the input speech is synthesized using a vocoder framework. The experimental results demonstrate that our proposed method can 1) alleviate the over-smoothing effect of the acoustic features despite operating directly on the waveform and 2) greatly improve the naturalness of the generated speech sounds.

1 Introduction

Speech processing systems such as statistical parametric speech synthesis [1] and statistical voice conversion [2] are well-known frameworks. These approaches, which use a vocoder framework, have a significant advantage, especially when only a limited amount of data is available, because they can represent interpretable acoustic features over a compact space, such as the fundamental frequency (F0) and mel-cepstrum, which are lower-dimensional acoustic features than a short-term Fourier transform (STFT) spectrogram. Although these systems aim to produce speech with a quality indistinguishable from that of clean, real speech, processed and synthesized speech can usually be distinguished from natural speech. Realizing synthetic-to-natural speech waveform conversion would therefore provide a significant benefit to many speech processing approaches, especially those using a vocoder framework. Three major factors reported in [3] degrade speech synthesized by a statistical parametric speech synthesis technique: the accuracy of the acoustic models; over-smoothing, which eliminates some of the detailed structure of the generated/converted acoustic features; and vocoding. In this paper, we focus on vocoding and over-smoothing.

Figure 1: Three major factors [3] that degrade the quality of synthesized speech in statistical parametric speech synthesis, and general approaches to generating more natural sounding speech through post-processing. Our proposed framework corresponds to process b), which can address not only the over-smoothing problem but also the vocoding error.

To address the over-smoothing effect, several techniques for restoring the fine structure of natural speech over the acoustic features have been proposed [2, 4, 5]. These approaches, as shown in Fig. 1 a), have achieved significant improvements in the naturalness of synthesized speech in their respective directions.

However, heuristic approaches such as the enhancement of the global variance [2] and the modulation spectrum [4] cannot cover all the negative factors. On the other hand, although a learning-based postfilter [5] enables us to restore not only the global variance and modulation spectrum but also other factors that degrade the quality of synthesized speech, it is still insufficient for generating natural speech because the postfilter is applied not to the waveform but to heuristic acoustic features such as the mel-cepstrum. Furthermore, all of these approaches suffer from vocoding error because they use the vocoder framework to synthesize the speech waveform.

To avoid this limitation, an end-to-end speech enhancement method [6] has been proposed within a generative adversarial framework. As shown in Fig. 1 b), since the waveform of the input speech is directly operated on to obtain that of the desired speech after the vocoding part, [6] has the potential to address not only the over-smoothing effect but also the vocoding error. Furthermore, the generative adversarial framework does not require us to design, in advance, any hand-crafted feature that creates a gap between natural speech and synthetic speech. In preliminary experiments, however, we found that this method is unsuitable when the alignments¹ between the input waveform and the desired waveform are not perfect. For example, noise reduction of noisy speech simulated by adding noise to a speech waveform recorded in an ideal environment succeeds because of the perfect alignment between the simulated noisy speech and the clean source speech. However, converting synthetic speech generated by text-to-speech synthesis or voice conversion into natural speech is not easy to achieve with this method because of the alignment problem mentioned above.

¹ In this paper, we define alignment considering both the magnitude information and the phase information of speech because we focus on modifying the speech waveform rather than the acoustic features.

In this paper, we propose a learning-based filter that allows us to convert a synthetic speech waveform into a natural speech waveform using cycle-consistent adversarial networks with a fully convolutional architecture. We adopt cycle-consistent adversarial networks because they do not require a dataset forcibly paired at the time-frame level and, as the name implies, they are trained with adversarial learning. In contrast to [7], which is also inspired by cycle-consistent adversarial networks [8] but converts acoustic features rather than the speech waveform, our modification is performed at the waveform level; we therefore expect that the proposed method will make it possible to generate "vocoder-less" sounding speech even if the input speech is synthesized using a vocoder framework.

Furthermore, we adopt a gated convolutional neural network (CNN) architecture [9], which is able to capture long- and short-term dependencies in the speech waveform. The experimental results demonstrate that our proposed method can 1) alleviate the over-smoothing effect of the acoustic features despite operating directly on the waveform and 2) greatly improve the naturalness of the generated speech sounds.

2 SEGAN: Speech Enhancement Generative Adversarial Network

2.1 Generative Adversarial Networks

Generative adversarial networks (GANs) [10] are generative models consisting of two neural networks. One is a generator G that learns to convert a sample z drawn from a prior distribution p_z(z) into a target sample resembling those drawn from the data distribution p_data(y) of the training data. The generator aims to learn a projection that imitates the true feature distribution and generates samples related to the training data. The other is a discriminator D that learns the boundary between the imitated features produced by the generator and the true features picked from the training data.

The adversarial characteristic arises from the fact that the discriminator D tries to classify the instances y obtained from the true data distribution as real and the candidates G(z) produced by the generator as fake, while the generator tries to make the discriminator classify them as real. Through back-propagation, the generator becomes able to generate better candidates and the discriminator becomes better able to distinguish the generated samples G(z) from the real data y. The objective of adversarial learning is formulated as the following minimax game between G and D:

\min_G \max_D V(D, G) = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\log D(y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (1)

Although GANs achieve state-of-the-art results in a variety of generative tasks [11, 12], the difficulty of training them is a well-known problem. For instance, the classic approach suffers from a vanishing-gradient problem due to the sigmoid cross-entropy loss used for training.

Several adversarial training techniques have been proposed to overcome this difficulty. The least-squares GAN (LSGAN) approach [13] stabilizes the training process by replacing the cross-entropy loss shown in Eq. 1 with least-squares functions as follows:

\min_D V_{\mathrm{LSGAN}}(D) = \tfrac{1}{2}\,\mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[(D(y) - 1)^2] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}[D(G(z))^2]    (2)
\min_G V_{\mathrm{LSGAN}}(G) = \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}[(D(G(z)) - 1)^2]    (3)
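For concreteness, the following is a minimal sketch of these least-squares losses in PyTorch; the framework choice and function names are ours, not from [13].

```python
import torch

def lsgan_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Eq. 2: push D's outputs toward 1 on real samples and 0 on generated ones."""
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Eq. 3: push D's outputs on generated samples toward 1."""
    return 0.5 * ((d_fake - 1.0) ** 2).mean()
```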

2.2 GANs for Speech Enhancement

Figure 2: Generator network for speech enhancement reported in [6]. The structure is similar to that of an auto-encoder.

To retain the linguistic information of the speech samples, [6] adopts a conditioned version of the GAN that provides some extra information to G and D to perform the mapping and classification. As shown in Fig. 2, in the generator G, whose structure is similar to an auto-encoder, a noisy speech signal x̃, the input of the network, is encoded as a vector c. After concatenating a random vector z with the encoded vector c, which is treated as a conditional vector, the decoding part of the network applies transposed convolutions (a.k.a. deconvolutions or fractionally strided convolutions) to obtain the enhanced waveform.

To generate speech samples that are closer to clean speech, a secondary component is added to the loss of G. [6] adopts the L1 norm, as it has been proven effective in the image manipulation domain [14, 15]. In this way, the adversarial component is encouraged to produce more fine-grained and realistic results. A new hyperparameter λ controls the magnitude of the L1 norm. Finally, the loss function of the generator becomes

\min_G V_{\mathrm{SEGAN}}(G) = \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z),\, \tilde{x} \sim p_{\mathrm{data}}(\tilde{x})}[(D(G(z, \tilde{x}), \tilde{x}) - 1)^2] + \lambda\,\| G(z, \tilde{x}) - x \|_1    (4)
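As a sketch, the generator loss of Eq. 4 can be written as the least-squares adversarial term plus a weighted L1 term; the names below are ours, and `lam` plays the role of λ.

```python
import torch

def segan_g_loss(d_fake: torch.Tensor, enhanced: torch.Tensor,
                 clean: torch.Tensor, lam: float) -> torch.Tensor:
    """Least-squares adversarial term plus a weighted L1 distance to the clean target."""
    adv = 0.5 * ((d_fake - 1.0) ** 2).mean()
    l1 = (enhanced - clean).abs().mean()
    return adv + lam * l1
```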

3 Synthetic-to-Natural Speech Waveform Conversion Using Cycle-Consistent Adversarial Networks

3.1 Concept

In preliminary experiments, we found that SEGAN [6] could not easily be applied to the conversion of a synthetic speech waveform into a natural speech waveform. One possible reason is that the misalignment caused by the different lengths and generation processes of synthetic and natural speech makes it difficult to ensure that the generator G operates as a bijective function. Specifically, the phase information of a speech waveform synthesized using the vocoder framework is very far from that of natural speech, even if the magnitude information of the synthetic speech is close to that of natural speech. We assume that these factors induce "mode collapse", a well-known problem when training GANs, so that SEGAN does not guarantee that an individual input and output are paired up in a meaningful way. In mode collapse, all input speech signals map to the same output speech signal and the optimization fails to make progress [10].

To solve this problem, we focus on cycle-consistent adversarial networks [8]. This approach introduced a "cycle-consistent" property, which ensures a return to the original sample [16]. Mathematically, if we have a converter G_{X→Y} and another converter G_{Y→X}, G_{X→Y} and G_{Y→X} should be inverses of each other, and both mappings should be bijections. We incorporate this property into SEGAN by training the mapping functions G_{X→Y} and G_{Y→X} simultaneously and adding a cycle consistency loss [17] that encourages G_{Y→X}(G_{X→Y}(x)) ≈ x and G_{X→Y}(G_{Y→X}(y)) ≈ y. Combining the cycle consistency loss with the adversarial losses defines our full objective for a training procedure that does not require perfect alignment.

Furthermore, we focus on a convolutional architecture called the gated CNN. The gated CNN has recently been shown to be powerful for modeling long-term sequential data. It was originally introduced for language modeling and was shown to outperform long short-term memory (LSTM) language models trained in a similar setting [9]. We previously applied a gated CNN architecture to acoustic feature sequence modeling, and its effectiveness has already been confirmed [18, 19]. With a gated CNN, the output of a hidden layer of the network is described as a linear projection modulated by an output gate. Similar to an LSTM [20] and a gated recurrent unit (GRU) [21], the output gate controls what information should be propagated through the hierarchy of layers and allows the capture of long-term structure.

3.2 Cycle-Consistent Adversarial Networks

Figure 3: Training procedures of cycle-consistent adversarial networks: a) Forward-inverse mapping to consider forward cycle consistency and b) inverse-forward mapping to consider backward cycle consistency.

For each speech sample x ∈ X, the speech waveform conversion cycle shown in Fig. 3 a) constrains the samples to return to the original speech through the target domain Y: x → G_{X→Y}(x) → G_{Y→X}(G_{X→Y}(x)) ≈ x. This cycle consistency is called forward cycle consistency. Similarly, as shown in Fig. 3 b), for each speech waveform y ∈ Y, G_{X→Y} and G_{Y→X} are constrained by a backward cycle consistency, y → G_{Y→X}(y) → G_{X→Y}(G_{Y→X}(y)) ≈ y. These constraints are described by the following cycle consistency loss:

\mathcal{L}_{\mathrm{cyc}}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\| G_{Y \to X}(G_{X \to Y}(x)) - x \|_1] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\| G_{X \to Y}(G_{Y \to X}(y)) - y \|_1]    (5)
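A minimal sketch of Eq. 5, assuming the two generators are callables mapping waveform tensors between the domains:

```python
import torch

def cycle_consistency_loss(g_xy, g_yx, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Eq. 5: forward cycle G_{Y->X}(G_{X->Y}(x)) ~ x plus backward cycle
    G_{X->Y}(G_{Y->X}(y)) ~ y, both measured with the L1 norm."""
    forward = (g_yx(g_xy(x)) - x).abs().mean()
    backward = (g_xy(g_yx(y)) - y).abs().mean()
    return forward + backward
```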

Finally, the objective function is

\mathcal{L}_{\mathrm{full}} = \mathcal{L}_{\mathrm{adv}}(G_{X \to Y}, D_Y) + \mathcal{L}_{\mathrm{adv}}(G_{Y \to X}, D_X) + \lambda_{\mathrm{cyc}}\,\mathcal{L}_{\mathrm{cyc}}(G_{X \to Y}, G_{Y \to X})    (6)

where λ_cyc is a hyperparameter used to control the weight of the cycle consistency loss.

Figure 4: Network architectures of generator G and discriminator D. "Conv", "GLU", "IN", "PS", "FC", and "Sigmoid" denote convolution, gated linear unit, instance normalization, pixel shuffler, fully connected, and sigmoid layers, respectively. In an input or output layer, w and c represent the width and the number of channels, respectively. In each convolutional layer, k, c, and s denote the kernel size, number of channels, and stride, respectively.

3.3 Identity-Mapping Loss

The cycle consistency loss reduces the space of possible mapping functions by constraining their structure. However, in a waveform modification task, incorporating only the cycle consistency loss does not always preserve the linguistic information. The identity-mapping loss reported in [22] preserves the composition between the input samples and the converted samples. [8] applied this approach to color preservation and demonstrated its effectiveness. Note that the secondary component of Eq. 4 is also an identity-mapping loss. To encourage the generators G_{X→Y} and G_{Y→X} to preserve linguistic information, we also incorporate this property as follows:

\mathcal{L}_{\mathrm{id}}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\| G_{X \to Y}(y) - y \|_1] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\| G_{Y \to X}(x) - x \|_1]    (7)

In practice, this loss, weighted by a hyperparameter λ_id that controls the identity-mapping loss, is added to Eq. 6.
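For illustration, Eq. 7 and its combination with Eq. 6 might look as follows; this is a sketch, and the weighting variables are our own names for λ_cyc and λ_id.

```python
import torch

def identity_mapping_loss(g_xy, g_yx, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Eq. 7: a generator fed a sample already in its target domain
    should leave it (approximately) unchanged."""
    return (g_xy(y) - y).abs().mean() + (g_yx(x) - x).abs().mean()

# Full generator objective, with the adversarial terms computed elsewhere:
# loss_g = adv_xy + adv_yx \
#        + lambda_cyc * cycle_consistency_loss(g_xy, g_yx, x, y) \
#        + lambda_id * identity_mapping_loss(g_xy, g_yx, x, y)
```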

3.4 Sequential Modeling with Gated CNN

To capture long- and short-term dependencies in speech waveforms, we use a gated CNN [9] to construct both the generator and discriminator networks of the GAN. Gated CNNs are CNNs equipped with gated linear units (GLUs) as activation functions instead of regular rectified linear units (ReLUs) [23] or tanh activations. The output of hidden layer l of a gated CNN is described as a linear projection modulated by an output gate:

H_{l+1} = (H_l * W_l + b_l) \otimes \sigma(H_l * V_l + c_l)    (8)

where W_l, V_l, b_l, and c_l are the network parameters to be trained, σ is the sigmoid function, and ⊗ indicates the element-wise product. Similar to LSTMs, the output gate σ(H_l * V_l + c_l) multiplies each element of H_l * W_l + b_l and controls what information should be propagated through the hierarchy of layers in a data-driven manner.

4 Experimental Evaluation

4.1 Experimental Conditions

Datasets (Natural): We used a Japanese speech dataset consisting of utterances by one professional female narrator. To evaluate the performance, we used 30 sentences (5.3 minutes of speech). To train the models, we used about 6,500 sentences for the baseline system and 400 sentences (1.2 hours of speech) for the conventional and proposed methods. The sampling rate of the speech signals was 22.05 kHz. Audio samples can be accessed on our web page.²

² http://www.kecl.ntt.co.jp/people/tanaka.ko/projects/s2n/s2n_speech_waveform_conversion.html

Baseline system (Baseline): We used a DNN-based statistical parametric speech synthesis method [1] as the baseline. From the speech data, 40 mel-cepstral coefficients, the logarithmic F0, and 5-band aperiodicities were extracted every 5 ms with the STRAIGHT analysis system [24, 25]. The contextual features used as the input were 506-dimensional linguistic features including phonemes and mora positions. The output consisted of the 40 mel-cepstral coefficients, log F0, 5-band aperiodicities, their delta and delta-delta features, and a voiced/unvoiced binary value. The DNN architecture was a feed-forward network with 5 hidden layers of 1,024 units each.

Figure 5: Average modulation spectra of the first 1k indices for the a) 10th, b) 20th, c) 30th, and d) 40th mel-cepstral coefficient sequences.

Conventional method (GANv): As a conventional approach, we used a GAN-based postfilter [5], applied not to the speech waveform but to the acoustic features. The system setting was the same as the reported setting, except for the excitation signals. Although [5] used the excitation signals of natural speech, we used the excitation signals generated by vocoding so that all of the synthetic speech could be evaluated. We applied the conventional method only to voiced segments.

Our proposed method (Proposed): We designed a network based on the recent success of image modeling [26]. Figure 4 shows the network architectures of our proposed model. The network includes downsampling layers, residual blocks [27], and upsampling layers. We used instance normalization (IN) [28] instead of batch normalization [29]. We used a pixel shuffler (PS) for upsampling, whose effectiveness was demonstrated in high-resolution image generation [26]. We normalized the speech waveforms to zero mean and unit variance using the training sets. To stabilize the training, we used a least-squares GAN [13]. We set λ_cyc to 10. To guide the learning process, we set λ_id to 5 for the first 20k iterations and linearly decayed it to 0 over the next 20k iterations. We optimized the model parameters using the Adam optimizer [30] with a mini-batch size of 32. The learning rates were set to 0.0001 for the discriminators and 0.0002 for the generators. We kept these learning rates for the first 250k iterations and linearly decayed them to 0 over the next 250k iterations. The other parameters of the Adam optimizer, β1 and β2, were set to 0.5 and 0.99, respectively. Note that since the generators are fully convolutional, they can handle an input of arbitrary length.

4.2 Modulation Spectrum over Acoustic Features

To confirm the alleviation of the over-smoothing effect on the acoustic features, we applied the conventional and proposed methods to speech synthesized by the baseline system and obtained the modulation spectra of the mel-cepstrum sequences for each system. Although the modulation spectrum is traditionally defined as a value calculated using the Fourier transform of a parameter sequence [31], this paper defines the modulation spectrum as its logarithmic power spectrum. We used 8,192 FFT points.
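Under this definition, the quantity plotted in Fig. 5 can be computed roughly as follows; this is a NumPy sketch with our own function name, and the input is one mel-cepstral coefficient sequence extracted at a 5-ms frame shift.

```python
import numpy as np

def modulation_spectrum(coeff_seq: np.ndarray, n_fft: int = 8192) -> np.ndarray:
    """Logarithmic power spectrum of a parameter sequence, zero-padded to n_fft points."""
    spectrum = np.fft.rfft(coeff_seq, n=n_fft)
    return np.log(np.abs(spectrum) ** 2 + 1e-12)  # small floor avoids log(0)

# e.g. the first 1k indices, as plotted in Fig. 5:
# ms = modulation_spectrum(mcep_sequence)[:1000]
```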

The average modulation spectra of the first 1k indices for the 10th, 20th, 30th, and 40th mel-cepstral coefficient sequences are shown in Fig. 5. We found that Baseline suffered more from the over-smoothing effect than GANv and Natural. On the other hand, GANv and Proposed are close to Natural. As with the GAN-based postfilter for the acoustic features (GANv), the result demonstrates that our proposed method for the speech waveform (Proposed) successfully alleviated the over-smoothing effect caused by the statistical parametric speech synthesis process.

4.3 Subjective Evaluation for Naturalness

Figure 6: Subjective 5-scale mean opinion scores for naturalness, with 95% confidence intervals.

We conducted a subjective 5-scale mean opinion score test regarding the naturalness of the generated speech. Ten listeners participated, and each listener evaluated 120 speech samples (30 speech samples × 4 systems). We applied the conventional and proposed methods to the same speech waveforms (Baseline) as in Sec. 4.2.

Figure 6 shows that our proposed method (Proposed) achieved a significant improvement in the naturalness of the generated speech compared with Baseline and GANv. This result indicates that our approach is more effective than using postfilters on the acoustic features because it can address both the over-smoothing problem and the vocoding error. Furthermore, with Proposed, the listeners commented that the "buzzy" sound peculiar to vocoding was sufficiently reduced. However, there is still a gap between Proposed and Natural. One possible reason for the gap is the "hoarse" sound of Proposed; the listeners also noted that Proposed was distinguishable from Natural because it sometimes had a "hoarse" sound.

5 Conclusion

In this paper, to realize a synthetic-to-natural speech filter, we proposed a learning-based filter that allows us to convert a synthetic speech waveform into a natural speech waveform using cycle-consistent adversarial networks. Since our process is applied after the synthesis part of statistical parametric speech synthesis, we expected our approach to address not only the over-smoothing problem but also the vocoding error. The experimental results demonstrated that our proposed method 1) alleviated the over-smoothing effect of the acoustic features despite operating directly on the waveform and 2) dramatically improved the naturalness of the generated speech sounds. In future work, we will further close the gap between natural speech and synthetic speech by considering auditory properties.

Acknowledgment

This work was supported by JSPS KAKENHI 17H01763.

References