WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation

04/05/2019 ∙ by Kou Tanaka, et al. ∙ 0

WaveCycleGAN has recently been proposed to bridge the gap between natural and synthesized speech waveforms in statistical parametric speech synthesis and provides fast inference with a moving average model rather than an autoregressive model and high-quality speech synthesis with the adversarial training. However, the human ear can still distinguish the processed speech waveforms from natural ones. One possible cause of this distinguishability is the aliasing observed in the processed speech waveform via down/up-sampling modules. To solve the aliasing and provide higher quality speech synthesis, we propose WaveCycleGAN2, which 1) uses generators without down/up-sampling modules and 2) combines discriminators of the waveform domain and acoustic parameter domain. The results show that the proposed method 1) alleviates the aliasing well, 2) is useful for both speech waveforms generated by analysis-and-synthesis and statistical parametric speech synthesis, and 3) achieves a mean opinion score comparable to those of natural speech and speech synthesized by WaveNet (open WaveNet) and WaveGlow while processing speech samples at a rate of more than 150 kHz on an NVIDIA Tesla P100.




1 Introduction

Speech processing systems using classical parametric vocoder frameworks such as STRAIGHT [4, 5] and WORLD [6] are popular for various tasks such as statistical parametric speech synthesis (SPSS) [7] and statistical voice conversion (VC) [8]. The key advantage of the classical parametric vocoder frameworks is that speech signals can be represented by interpretable and compact acoustic parameters such as the fundamental frequency ($F_0$) and Mel-cepstrum rather than a short-time Fourier transform (STFT) spectrogram. On the other hand, the critical drawback is that the generated speech can usually be distinguished from natural speech due to vocoding error [9], even though we only re-synthesize the speech waveform from the analyzed acoustic parameters. Moreover, operating on acoustic parameters and their statistics in SPSS and VC usually leads to an over-smoothing effect [9] and increases the differences between synthetic and natural speech.

To address these drawbacks, we previously proposed a learning-based time-domain filter called WaveCycleGAN [10], which converts a synthetic speech waveform into a natural speech waveform using cycle-consistent adversarial networks with a fully convolutional architecture. Waveform conversion with deep learning faces two difficulties: collecting parallel speech waveform data is hard, and the phase information of speech waveforms is ambiguous. Notably, the phase ambiguity prevents proper learning of a mapping function from synthetic speech to natural speech (e.g., when the training data of natural speech contain two speech waveforms, one with certain phase spectra and the other with the reversed phase spectra, minimizing the objective function results in converting to silence). The cyclic model makes it possible to address these difficulties of operating on waveforms. Moreover, WaveCycleGAN is trained with adversarial learning, so no explicit assumption about the speech waveform is required. As a result, when WaveCycleGAN was applied to SPSS trained on a Japanese dataset, the filtered speech achieved a mean opinion score higher than 4. However, there is still a gap between natural speech and filtered speech. In a preliminary experiment in which WaveCycleGAN was applied to speech waveforms synthesized from the acoustic parameters of natural speech, the filtered speech was of lower quality than the input to WaveCycleGAN. We found that one possible reason for this gap and degradation in quality is the aliasing introduced into the filtered speech waveform by the down/up-sampling modules in the model architecture.
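The phase-ambiguity example can be made concrete with a small numerical sketch. It assumes a plain squared-error regression objective (an assumption; the text does not name the objective) and a toy sine wave standing in for speech. When one input is paired with a waveform and its sign-flipped copy, the error-minimizing output is their mean, which is silence.

```python
import numpy as np

# Toy "natural" waveform and a copy with reversed (sign-flipped) phase.
# The 220 Hz tone and 22.05 kHz rate are illustrative choices only.
t = np.arange(1000) / 22050.0
wave = np.sin(2 * np.pi * 220.0 * t)
targets = np.stack([wave, -wave])

# Under a squared-error objective, the best single output for an input
# paired with several targets is the mean of those targets.
best_output = targets.mean(axis=0)


def squared_error(candidate):
    """Mean squared error of one candidate output against both targets."""
    return float(np.mean((targets - candidate) ** 2))


print("peak amplitude of optimum:", np.max(np.abs(best_output)))
print("silence beats either target:",
      squared_error(best_output) < squared_error(wave))
```

This is why a direct regression loss on paired waveforms is problematic here, and why the cyclic, adversarial formulation is attractive.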

To bridge this gap, including the aliasing, we propose WaveCycleGAN2, an improved variant of WaveCycleGAN that 1) uses generators without down/up-sampling modules and 2) combines discriminators of the waveform domain and acoustic parameter domains such as the Mel-spectrogram, Mel-frequency cepstral coefficients, and phase spectrum. We analyzed the effect of each technique on an internal Japanese speech dataset and a public domain English speech dataset [12]. (Footnote 2: On our web page, we used an alternative speech dataset, the CMU Arctic database [11], which we are allowed to publish.) Experimental evaluations showed that the proposed method 1) alleviates the aliasing well, 2) is useful for both speech waveforms generated by analysis-and-synthesis (AnaSyn) and SPSS, and 3) achieves a mean opinion score comparable to those of natural speech and speech synthesized by WaveNet [1] (open WaveNet [2]) and WaveGlow [3] while processing audio samples at a rate of more than 150 kHz on an NVIDIA Tesla P100.
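To put the reported throughput in perspective, the short calculation below converts the processing rate into a speed-up over real time. It assumes the 22.05 kHz sampling rate used in the experiments and treats 150 kHz as a lower bound, as stated; the variable names are illustrative.

```python
# Reported lower bound on throughput: more than 150,000 samples per second.
samples_per_second = 150_000
# Sampling rate of the speech used in the experiments (22.05 kHz).
sampling_rate_hz = 22_050

# How many seconds of audio are produced per second of computation.
speedup_over_real_time = samples_per_second / sampling_rate_hz
print(round(speedup_over_real_time, 2))
```

Under these assumptions the post-filter runs roughly 6.8 times faster than real time.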

2 Related Works

2.1 Vocoder for Waveform Generation

To generate a speech waveform from given acoustic parameters, neural-network-based waveform models [1, 13, 3, 14] have been proposed and have performed outstandingly at numerous tasks involving speech signal processing. There are two types of neural-network-based waveform models: autoregressive (AR) models [1, 13] and moving-average (MA) models [3, 14] (a.k.a. non-AR models). As an AR model, WaveNet [1] synthesizes speech with high fidelity, but its training procedure is complex (Footnote 3: Generated speech sometimes collapses, as reported by Wu et al. [15].) and its inference is quite slow because the AR process never allows several waveform sampling points to be inferred in parallel.

For MA models, which allow the inference to be parallelized, distilled models [16, 17] that demand complex training criteria have been proposed. To avoid the complex training criteria, WaveGlow [3] was proposed as a theoretically powerful model with a tractable exact log-likelihood. Although WaveGlow makes it possible to train the theoretically exact mapping function by using only one criterion, it requires large-scale computational resources and a long training time. To make the inference procedure interpretable, a neural source-filter model [14] has also been proposed as an MA model. All these models can work well if the given acoustic parameters are close to the natural ones seen in the training data. Otherwise, training requires several steps rather than one because it must be combined with other training procedures such as fine-tuning [18] or the methods described in the next subsection. In contrast, even if the given acoustic parameters are not close to the natural ones in the training data, our approach, which is a kind of MA model, makes it possible to directly train the mapping function from the generated speech waveform to the natural one in one step because it allows the use of a classical parametric vocoder, which does not need to be trained.

2.2 Acoustic Parameter Generation/Modification

To address the over-smoothing effect [9], several techniques have been proposed [8, 19, 20] to restore the fine structure of the acoustic parameters of natural speech. (Footnote 4: Of course, accurate modeling approaches have also been proposed, such as generative adversarial network-based text-to-speech [21, 22] and voice conversion [23].) In their respective directions, these approaches have significantly improved the naturalness of acoustic parameters generated by SPSS and VC. However, these approaches are still insufficient to generate natural-sounding speech because they post-filter acoustic parameters in a heuristically limited compact space, even in the generative adversarial network (GAN)-based approach [20]. Moreover, it is impossible to address the vocoding error [9] when we use the classical parametric vocoder to generate the speech waveform. In contrast, our approach allows us to address both the over-smoothing effect and the vocoding error because it operates after the waveform has been generated.

3 Conventional WaveCycleGAN

We briefly review our previous work, WaveCycleGAN [10], which is a kind of cyclic model (a.k.a., dual learning [24]).

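Let us use one-dimensional vectors $x \in X$ and $y \in Y$ to denote sequences belonging to the sets of synthetic and natural speech waveforms, $X$ and $Y$, respectively. Inspired by CycleGAN [25], WaveCycleGAN uses three training criteria (adversarial loss $\mathcal{L}_{adv}$ [26], cycle-consistency loss $\mathcal{L}_{cyc}$ [27], and identity-mapping loss $\mathcal{L}_{id}$ [28]) to train a mapping function $G_{X \to Y}$ that converts the waveform of synthetic speech into that of natural speech without relying on parallel data.

The adversarial loss is written as

$$\mathcal{L}_{adv}(G_{X \to Y}, D_Y) = \mathbb{E}_{y \sim P_Y(y)}\left[\log D_Y(y)\right] + \mathbb{E}_{x \sim P_X(x)}\left[\log\left(1 - D_Y(G_{X \to Y}(x))\right)\right],$$

where $D_Y$ indicates a discriminator trying to differentiate between a real sample $y$ and the samples $G_{X \to Y}(x)$ converted by the generator, while $G_{X \to Y}$ is trained to convert $x$ into $G_{X \to Y}(x)$ that can deceive $D_Y$ as $y$. This criterion focuses only on whether $G_{X \to Y}(x)$ can deceive $D_Y$ or not, so $x$ might be converted into samples that have different linguistic information. To retain the linguistic information of the input $x$, the cycle-consistency and identity-mapping losses are used:

$$\mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P_X(x)}\left[\left\| G_{Y \to X}(G_{X \to Y}(x)) - x \right\|_1\right] + \mathbb{E}_{y \sim P_Y(y)}\left[\left\| G_{X \to Y}(G_{Y \to X}(y)) - y \right\|_1\right],$$

$$\mathcal{L}_{id}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{y \sim P_Y(y)}\left[\left\| G_{X \to Y}(y) - y \right\|_1\right] + \mathbb{E}_{x \sim P_X(x)}\left[\left\| G_{Y \to X}(x) - x \right\|_1\right],$$

where $G_{Y \to X}$ indicates another generator that has the reverse direction to $G_{X \to Y}$. Note that to guide the learning direction, $\mathcal{L}_{id}$ is usually used only in the early stage of the training.

Finally, the full objective function can be written as

$$\mathcal{L}_{full} = \mathcal{L}_{adv}(G_{X \to Y}, D_Y) + \mathcal{L}_{adv}(G_{Y \to X}, D_X) + \lambda_{cyc}\,\mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) + \lambda_{id}\,\mathcal{L}_{id}(G_{X \to Y}, G_{Y \to X}),$$

where $\lambda_{cyc}$ and $\lambda_{id}$ indicate hyperparameters controlling the cycle-consistency and identity-mapping losses.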
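As a minimal sketch of the two content-preserving terms, the code below evaluates a cycle-consistency loss and an identity-mapping loss, assuming the usual CycleGAN L1 formulation. The generators here are hypothetical stand-ins (simple scalings), not the fully convolutional networks of WaveCycleGAN, and the function names are illustrative.

```python
import numpy as np


def l1(a, b):
    """Mean absolute difference between two waveforms."""
    return float(np.mean(np.abs(a - b)))


def cycle_consistency_loss(x, y, g_xy, g_yx):
    """Penalize X -> Y -> X and Y -> X -> Y round trips that change the input."""
    return l1(g_yx(g_xy(x)), x) + l1(g_xy(g_yx(y)), y)


def identity_mapping_loss(x, y, g_xy, g_yx):
    """Penalize each generator for altering samples already in its target domain."""
    return l1(g_xy(y), y) + l1(g_yx(x), x)


# Hypothetical toy generators: exact inverses of each other.
g_xy = lambda w: 2.0 * w
g_yx = lambda w: 0.5 * w

x = np.array([1.0, -1.0, 2.0])   # stands in for a synthetic waveform
y = np.array([2.0, 0.0, -2.0])   # stands in for a natural waveform

cyc = cycle_consistency_loss(x, y, g_xy, g_yx)
idt = identity_mapping_loss(x, y, g_xy, g_yx)
print(cyc, idt)
```

Because the toy generators invert each other, the cycle term vanishes, while the identity term stays positive since neither generator leaves its target-domain input unchanged.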

4 Proposed WaveCycleGAN2

4.1 Aliasing Issue of Conventional WaveCycleGAN

Many model architectures using convolutional neural networks involve down/up-sampling modules as the de facto standard [29, 30, 31, 32] because they have a significant advantage in terms of computational cost. We also adopted strided convolution in WaveCycleGAN because of its computational advantage. As a result, we achieved a mean opinion score higher than 4 in terms of naturalness. However, we found that aliasing is observed in the processed speech waveform, as shown in Fig. 1 (c). This phenomenon has also been reported in several other tasks such as image classification [33] and deep speech processing [34]. Aliasing occurs when the Nyquist-Shannon sampling theorem [35] is not satisfied, and the classical strided convolution is not guaranteed to satisfy it. This is reasonable because the classical strided convolution has no guaranteed anti-aliasing mechanism, whereas in pure signal processing we never perform down-sampling without anti-aliasing processing. Note that in acoustic parameter trajectory smoothing [36], the acoustic parameter differences at high modulation frequencies are hardly perceived by humans. (Footnote 5: Modulation frequency is the frequency of modulation spectra, which are the power spectra of a given acoustic parameter sequence.) Therefore, even if aliasing occurs in the acoustic parameter sequence, we will not notice it. The aliasing issue is thus specific to waveform conversion. To generate more natural-sounding speech, it remains to be solved.
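The effect of subsampling without an anti-aliasing filter can be reproduced with a single sinusoid. The sketch below is illustrative only: the 16 kHz rate and 6 kHz tone are assumed values chosen so that the tone lies above the Nyquist frequency after a stride of 2, and plain slicing stands in for the subsampling step of a strided convolution.

```python
import numpy as np

fs = 16000        # original sampling rate in Hz (illustrative)
tone_hz = 6000    # above the post-subsampling Nyquist frequency (4000 Hz)
n = np.arange(2048)
x = np.sin(2 * np.pi * tone_hz * n / fs)

# Stride-2 subsampling with no low-pass filter beforehand.
y = x[::2]
fs_down = fs // 2

# Locate the strongest frequency component of the subsampled signal.
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1.0 / fs_down)
peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)
```

The 6 kHz tone cannot be represented at 8 kHz, so its energy folds back to 8000 - 6000 = 2000 Hz, a component that was never present in the input.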

4.2 Improved Generator: Addressing Aliasing Issue

Figure 1: Spectrogram of natural waveform and waveforms generated from models in Sec. 5.1.2. Dashed red box indicates aliasing.

To alleviate the aliasing described in Sec. 4.1, we have two options. One is to explicitly add anti-aliasing processing into the model architecture. Following this concept, a linear pooling with Gaussian weights has been proposed [37]. This pooling operation is equivalent to the down-sampling after Gaussian filtering. We can regard the Gaussian filtering as the approximation of low-pass filtering using the cardinal sine function (a.k.a., sinc function) so that the aliasing will be alleviated well. However, there is a fundamental trade-off between the performance and the combination of shift invariance and anti-aliasing.
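The first option, Gaussian filtering followed by down-sampling, can be sketched as follows. The kernel width (sigma of 1 sample, radius 4), the 16 kHz rate, and the 6 kHz test tone are all assumptions chosen for illustration; this is not the pooling layer of [37] itself, only the signal-processing idea behind it.

```python
import numpy as np


def gaussian_kernel(sigma=1.0, radius=4):
    """Normalized Gaussian low-pass kernel of length 2 * radius + 1."""
    k = np.arange(-radius, radius + 1)
    w = np.exp(-0.5 * (k / sigma) ** 2)
    return w / w.sum()


def amplitude_at(signal, fs, freq_hz):
    """Approximate amplitude of the spectral component nearest freq_hz."""
    spec = np.abs(np.fft.rfft(signal)) / (len(signal) / 2)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return spec[np.argmin(np.abs(freqs - freq_hz))]


fs = 16000
n = np.arange(2048)
x = np.sin(2 * np.pi * 6000 * n / fs)   # folds to 2000 Hz after stride 2

naive = x[::2]                                               # no filtering
smoothed = np.convolve(x, gaussian_kernel(), mode="same")[::2]

a_naive = amplitude_at(naive, fs // 2, 2000)
a_gauss = amplitude_at(smoothed, fs // 2, 2000)
print(a_naive, a_gauss)
```

The aliased component is strongly attenuated but not removed, and the same kernel also attenuates content that legitimately lies below the new Nyquist frequency, which illustrates the kind of trade-off this option involves.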

Another option is to use a dilated convolution [38], which was introduced to deep learning for semantic image segmentation, rather than the classical strided convolution. This technique reduces the number of model parameters and improves computational efficiency while maintaining a large receptive field to capture long-range dependencies. Note that recent neural-network-based vocoders such as WaveNet have also adopted the dilated convolution. Toward a high-quality neural post-filter for speech waveform generation, and assuming that down/up-sampling modules are not suitable for speech waveform conversion, unlike acoustic parameter conversion, we replace the classical strided convolution with the dilated convolution in the architecture of WaveCycleGAN.
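A minimal NumPy sketch of a dilated convolution shows the two properties that matter here: the output keeps the input's length (no subsampling, so no folding of frequencies), and the receptive field grows with the dilation. The functions are illustrative helpers, not the model's layers; they assume stride 1, an odd kernel length, and zero padding.

```python
import numpy as np


def dilated_conv1d(x, w, dilation=1):
    """Zero-padded, stride-1 dilated convolution (cross-correlation form).

    Assumes an odd-length kernel so the output aligns with the input.
    """
    k = len(w)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    # Each kernel tap reads the input at an offset of i * dilation samples.
    return sum(w[i] * xp[i * dilation: i * dilation + len(x)] for i in range(k))


def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    return 1 + sum((k - 1) * d for k, d in layers)


x = np.zeros(32)
x[10] = 1.0                                    # unit impulse
y = dilated_conv1d(x, np.array([1.0, 2.0, 3.0]), dilation=4)

print(len(y), np.nonzero(y)[0])                # taps land 4 samples apart
print(receptive_field([(15, 1), (15, 2), (15, 4)]))
```

With three taps and a dilation of 4, the impulse response spreads over samples 6, 10, and 14 while the sequence length is unchanged; stacking kernel-15 layers with dilations 1, 2, and 4 already covers 99 samples.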

4.3 Improved Discriminator: Multiple Domains

In the preliminary experiment, although using the dilated convolution instead of the strided convolution made it possible to alleviate the aliasing, the processed speech became noisy, as shown in Fig. 1 (d). Theoretically, the generator should imitate the distribution of the real data if the training succeeds. However, in practice, the gradient of the generator vanishes when the discriminator successfully rejects generated samples with high confidence. For this reason, we used discriminators with a small number of model parameters, but this insufficient capability of the discriminator might make the decision boundaries non-optimal.

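To find the best decision boundaries while avoiding the vanishing gradient problem of the generator, we propose discriminators combining multiple domains such as the waveform domain and Mel spectrogram domain as follows:

$$\mathcal{L}_{adv}(G_{X \to Y}, D_Y, D'_Y) = \mathbb{E}_{y \sim P_Y(y)}\left[\log D_Y(y)\right] + \mathbb{E}_{x \sim P_X(x)}\left[\log\left(1 - D_Y(G_{X \to Y}(x))\right)\right] + \mathbb{E}_{y \sim P_Y(y)}\left[\log D'_Y(f(y))\right] + \mathbb{E}_{x \sim P_X(x)}\left[\log\left(1 - D'_Y(f(G_{X \to Y}(x)))\right)\right],$$

where $D_Y$ and $D'_Y$ are the discriminators of the waveform domain and the Mel spectrogram domain, respectively, and $f$ indicates a linear mapping function described as a convolution of the Hanning window, followed by a fast Fourier transform (FFT) matrix and Mel-filter bank. Unlike the L1 and L2 losses on spectra [16, 39, 14], we use adversarial losses for the multiple domains, so the objective function related to the generator still does not depend directly on $y$ at all, and our approach makes this objective function resistant to the over-smoothing problems [9] in the same way as the conventional WaveCycleGAN.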
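The mapping into the Mel spectrogram domain (Hanning window, FFT, Mel filter bank) can be sketched in NumPy as below. This is a rough illustration, not the paper's implementation: the frame length, hop size, and number of Mel bands are assumed values, the Mel scale uses the common 2595 log10(1 + f / 700) formula, and the sketch takes the magnitude of the FFT before applying the filter bank.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters with edges equally spaced on the Mel scale."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, len(freqs)))
    for m in range(n_mels):
        lo, ctr, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def mel_spectrogram(wave, sr=22050, n_fft=1024, hop=256, n_mels=80):
    """Frame the waveform, apply a Hanning window, FFT, then Mel filters."""
    n_frames = 1 + (len(wave) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(n_fft)
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return magnitude @ mel_filterbank(sr, n_fft, n_mels).T


# One second of noise stands in for a speech waveform.
wave = np.random.default_rng(0).standard_normal(22050)
mel = mel_spectrogram(wave)
print(mel.shape)
```

A second discriminator would then score `mel_spectrogram(y)` against `mel_spectrogram(G(x))`, so the generator receives feedback in both the waveform and spectral domains without a direct regression target.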

5 Experiments

5.1 SPSS using Internal Japanese Dataset

5.1.1 Dataset

We used a Japanese speech dataset consisting of utterances by one professional female narrator. To train the models, we used about 6,500 sentences for a baseline system and 400 sentences (speech sections of 1.2 hours) each for WaveCycleGAN and WaveCycleGAN2. To evaluate the performance, we used 30 sentences (speech sections of 5.3 minutes). The sampling rate of the speech signals was 22.05 kHz.

5.1.2 Systems

We used a deep neural network (DNN)-based SPSS [7] and WaveCycleGAN [10] as a baseline system (SPSS) and a conventional system (V1), respectively. As a proposed system, V2 indicates WaveCycleGAN2 with only the speech waveform domain's discriminators. V2+ indicates WaveCycleGAN2 incorporating the discriminators of the acoustic parameter domains such as the Mel spectrogram (V2msp), Mel-frequency cepstrum coefficients (V2mfcc), and phase spectrogram (V2ph). The architecture of the generator was a linear projection (number of channels, kernel size, dilation: 64, 15, 1) followed by a residual block (128, 15, 2), five residual blocks (128, 15, 4), and a linear projection (1, 15, 1). We applied the conventional and proposed systems to the speech waveforms generated by SPSS. We kept the learning rate constant for the first 80k iterations and linearly decayed it to 0 over the next 80k iterations. The other conditions are the same as in our previous work [10].
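The learning-rate schedule described above (constant for 80k iterations, then linear decay to 0 over the next 80k) can be written as a small function. The function name and the `base_lr` argument are illustrative; the initial learning rate itself is not stated here, so it is left as a required input rather than guessed.

```python
def learning_rate(step, base_lr, hold_steps=80_000, decay_steps=80_000):
    """Constant learning rate for `hold_steps`, then linear decay to zero."""
    if step < hold_steps:
        return base_lr
    remaining = hold_steps + decay_steps - step
    return base_lr * max(remaining, 0) / decay_steps


# Using base_lr = 1.0 shows the schedule as a fraction of the initial rate.
for step in (0, 80_000, 120_000, 160_000):
    print(step, learning_rate(step, base_lr=1.0))
```

The schedule stays at the full rate through iteration 80k, reaches half of it at 120k, and is zero from 160k onward.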

5.1.3 Objective Evaluation

To evaluate the capability of addressing the over-smoothing effect caused by the SPSS, we calculated modulation spectrum differences (MSD) for the Mel cepstral coefficient of natural speech (Recorded). Although the modulation spectrum is traditionally defined as a value calculated using the Fourier transform of the parameter sequence [40], this paper defines the modulation spectrum as its logarithmic power spectrum. We used 8,192 FFT points.
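As a rough sketch of this measure, the code below computes the modulation spectrum of one parameter trajectory as the logarithmic power spectrum of its 8,192-point FFT, then compares a noisy trajectory with a smoothed copy. The random trajectory, the 9-point moving average standing in for over-smoothing, and the per-bin difference used as the MSD are all assumptions for illustration.

```python
import numpy as np


def modulation_spectrum(trajectory, n_fft=8192):
    """Log power spectrum of one acoustic-parameter trajectory over time."""
    spec = np.fft.rfft(trajectory, n=n_fft)
    return np.log(np.abs(spec) ** 2 + 1e-12)   # small floor avoids log(0)


rng = np.random.default_rng(0)
natural = rng.standard_normal(1000)            # stand-in for one Mel-cepstral coefficient
smoothed = np.convolve(natural, np.ones(9) / 9.0, mode="same")

ms_natural = modulation_spectrum(natural)
ms_smoothed = modulation_spectrum(smoothed)

# Per-bin difference; positive values mean the smoothed trajectory lost power.
msd = ms_natural - ms_smoothed
print(ms_natural.shape, msd[2048:].mean())
```

The smoothed trajectory loses power mainly at high modulation frequencies, which is the signature of over-smoothing that the MSD is meant to expose.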

Figure 2 showed that SPSS significantly suffered from the over-smoothing effect. Although V1 alleviated its over-smoothing effect, there was still a gap. On the other hand, V2+ restored the modulation spectrum of Recorded well. Note that as described in Sec. 4.3, the speech generated by V2 was more different from natural speech than that generated by the combination methods V2+.

5.1.4 Subjective Evaluation

We conducted a listening test with a 5-scale mean opinion score regarding naturalness. For each system, 200 speech samples (10 participants × 20 randomly selected speech samples) were evaluated.

Figure 3 showed that V1 significantly improved the naturalness of the generated speech compared with SPSS. V2msp and V2mfcc were closer to natural speech, and there is no statistical difference from natural speech because p values of two-sided Mann-Whitney tests are more than 0.05. In contrast, V2 suffered from noisy speech. Note that the score of V2ph was significantly degraded because the silence sections somehow became quite noisy. These results suggest that it is better to combine the waveform domain discriminator and the amplitude spectrum domain discriminator.

Figure 2: Modulation spectrum differences from natural speech (Recorded) for Mel cepstral coefficients.

Figure 3: Subjective 5-scale mean opinion scores regarding naturalness, with 95% confidence intervals. Dashed line indicates results of recorded natural speech.

5.2 Analysis and Synthesis using LJSpeech Dataset [12]

5.2.1 Dataset

We used a public domain English speech dataset [12] containing 13,100 utterances. To evaluate the performance, we used 40 sentences disjoint from the training data. The sampling rate of the speech signals was 22.05 kHz.

5.2.2 Systems

We used WORLD [6] and Griffin-Lim [41] vocoders as the parametric and phase vocoder, respectively. For the neural-network-based vocoder, we used open WaveNet [2] employing a mixture of logistics distribution [42] and official WaveGlow [3]. The audio samples of open WaveNet and official WaveGlow were brought from a public folder 666 Speech samples can be accessed in a public folder of Google Drive: http://bit.ly/2JTDetX of R. Valle who is a co-author of WaveGlow [3]. For the proposed method, we used WaveCycleGAN2 incorporating the Mel spectrum domain discriminator V2msp. Note that our proposed method worked in both the parallel-data condition V2msp (paired) and non-parallel-data condition V2msp (unpaired) where the mini-batches of natural speech differed from those of synthesized speech in every iteration.

5.2.3 Objective and Subjective Evaluations

To evaluate the capability of WaveCycleGAN2 for an analysis-and-synthesis task, we calculated log spectral distortions (LSD) and conducted a listening test with 5-scale mean opinion scores regarding naturalness. For each system, 210 speech samples (14 participants × 15 randomly selected speech samples) were evaluated subjectively.

The LSD results in Tab. 1 show that Griffin-Lim has the lowest distortion. On the other hand, WORLD had higher distortion because it is a parametric vocoder. In the comparison of open WaveNet and WaveGlow, open WaveNet has larger distortion. One possible reason is that open WaveNet might generate speech waveforms whose amplitude spectra differ from the given acoustic parameters when the previous outputs are captured more strongly than the given acoustic parameters. In contrast, V2msp has smaller distortion than WaveGlow. This might be because WaveCycleGAN2 has the advantage of speech waveform conversion, in which the input and output domains are closer than those of WaveGlow.
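For readers unfamiliar with the metric, the sketch below computes a log spectral distortion in dB between two waveforms. The exact definition used for Tab. 1 is not given here, so this follows one common form (root-mean-square difference of log-magnitude spectra per frame, averaged over frames); the frame length, hop size, and magnitude floor are assumed values.

```python
import numpy as np


def stft_magnitude(wave, n_fft=1024, hop=256):
    """Magnitude spectra of Hanning-windowed frames."""
    n_frames = 1 + (len(wave) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.abs(np.fft.rfft(wave[idx] * np.hanning(n_fft), axis=1))


def log_spectral_distortion(reference, generated, floor=1e-10):
    """Frame-averaged RMS difference of log-magnitude spectra, in dB."""
    ref_db = 20.0 * np.log10(np.maximum(stft_magnitude(reference), floor))
    gen_db = 20.0 * np.log10(np.maximum(stft_magnitude(generated), floor))
    per_frame = np.sqrt(np.mean((ref_db - gen_db) ** 2, axis=1))
    return float(np.mean(per_frame))


# Noise stands in for speech; doubling the amplitude shifts every bin by ~6.02 dB.
reference = np.random.default_rng(0).standard_normal(22050)
lsd_same = log_spectral_distortion(reference, reference)
lsd_gain = log_spectral_distortion(reference, 2.0 * reference)
print(lsd_same, lsd_gain)
```

Identical signals give 0 dB, and a pure gain of 2 gives 20 log10(2), about 6.02 dB, which also shows that this form of the metric is sensitive to level mismatches as well as to spectral shape.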

In the results of the listening test, there is no statistically significant difference in only two pairs, WORLD and Griffin-Lim, and open WaveNet and V2msp (unpaired), because the p values of two-sided Mann-Whitney tests are more than 0.05. Remarkably, V2msp (paired) outperformed open WaveNet and WaveGlow. Note that our proposed method is specialized for speech signals, whereas WaveGlow theoretically works on not only speech signals but also audio signals such as music. Moreover, open WaveNet is not an official implementation, so this might not be the best result of WaveNet [1]. Nevertheless, these results are encouraging, and we also had the following feedback from the participants: 1) WaveGlow sometimes had artifacts like Griffin-Lim, 2) open WaveNet sometimes had artifacts like the collapsed speech samples reported by Wu et al. [15], and 3) V2msp sometimes had artifacts caused by the unvoiced/voiced detection error of the WORLD vocoder. Note that the tendency of the results compared with Recorded differs from the tendency described in Sec. 5.1.4 because the LJSpeech dataset [12] suffers from reverb.

System                    LSD [dB]         Naturalness
Recorded [12]             -                4.590 ± 0.082
WORLD [6] + mcep          4.414 ± 0.022    3.124 ± 0.150
Griffin-Lim [41]          1.546 ± 0.016    3.300 ± 0.143
open WaveNet (MoL) [2]    4.971 ± 0.041    3.657 ± 0.162
WaveGlow [3]              4.540 ± 0.036    3.443 ± 0.164
V2msp (paired)            4.318 ± 0.019    4.023 ± 0.124
V2msp (unpaired)          4.339 ± 0.020    3.833 ± 0.127

Table 1: Log spectral distortions (LSD) and subjective 5-scale mean opinion scores regarding naturalness (Naturalness), with 95% confidence intervals. Mcep indicates Mel cepstral coefficient.
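A mean opinion score with a 95% confidence interval, in the "mean ± half-width" form used in Table 1, can be computed as sketched below. The ratings are made-up values for illustration only, and the normal approximation (1.96 standard errors) is an assumption; the method actually used to obtain the intervals in the table is not stated.

```python
import numpy as np


def mos_with_ci(scores, z=1.96):
    """Mean opinion score and half-width of a normal-approximation 95% CI."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    standard_error = scores.std(ddof=1) / np.sqrt(len(scores))
    return float(mean), float(z * standard_error)


# Hypothetical 5-scale ratings, not data from the listening tests.
ratings = [4, 5, 3, 4, 4, 5, 4, 3, 5, 4]
mean, half_width = mos_with_ci(ratings)
print(f"{mean:.3f} ± {half_width:.3f}")
```

With only ten ratings the interval is wide; the roughly 200 ratings per system collected in the listening tests are what make the intervals in the table much narrower.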

6 Conclusions

We proposed a time-domain neural post-filter for speech waveform generation, WaveCycleGAN2. Experimental results demonstrated that the proposed method 1) outperformed the conventional WaveCycleGAN, 2) is useful for both speech waveforms generated by analysis-and-synthesis and statistical parametric speech synthesis, and 3) generated speech waveforms comparable to those of natural speech and speech synthesized by WaveNet [1] (open WaveNet [2]) and WaveGlow [3].

7 Acknowledgements

This work was supported by a grant from the Japan Society for the Promotion of Science (JSPS KAKENHI 17H01763). The authors thank Ryuichi Yamamoto and the authors of WaveGlow.