Speech processing systems such as statistical parametric speech synthesis and statistical voice conversion are well-known frameworks. These approaches, which use a vocoder framework, have a significant advantage, especially when the amount of data is limited, because they can represent interpretable acoustic features over a compact space, such as the fundamental frequency ($F_0$) and mel-cepstrum, which are lower-dimensional acoustic features than a
short-term Fourier transform (STFT) spectrogram. Although these systems aim to produce speech with a quality indistinguishable from that of clean, real speech, processed and synthesized speech can usually be distinguished from natural speech. Realizing synthetic-to-natural speech waveform conversion would therefore benefit many speech processing approaches, especially those using a vocoder framework. Three major factors have been reported to degrade speech synthesized by statistical parametric speech synthesis: the accuracy of the acoustic models; over-smoothing, which eliminates some of the detailed structure of the generated/converted acoustic features; and vocoding. In this paper, we focus on vocoding and over-smoothing.
To address the over-smoothing effect, several techniques for restoring the fine structure of natural speech over acoustic features have been proposed [2, 4, 5].
These approaches, as shown in Fig. 1 a), have achieved significant improvements in the naturalness of synthesized speech, each in its respective direction.
However, heuristic approaches such as the enhancement of the global variance and the modulation spectrum are unsuitable for covering all the negative factors. On the other hand, although a learning-based postfilter enables us to restore not only the global variance and modulation spectrum but also other factors that degrade the quality of synthesized speech, it is still insufficient for generating natural speech because the postfilter is applied not to the waveform but to heuristic acoustic features such as the mel-cepstrum. Furthermore, all of these approaches suffer from vocoding error because they use the vocoder framework to synthesize the speech waveform.
To avoid this limitation, an end-to-end speech enhancement method has been proposed within a generative adversarial framework. As shown in Fig. 1 b), since the waveform of the input speech is operated on directly to obtain that of the desired speech after the vocoding part, this method has the potential to address not only the over-smoothing effect but also the vocoding error. Furthermore, the generative adversarial framework does not require us to design, in advance, any hand-crafted feature that creates a gap between natural speech and synthetic speech. In preliminary experiments, however, we found that this method is unsuitable when the alignments between the input waveform and the desired waveform are not perfect (in this paper, we define alignment considering both the magnitude information and the phase information of speech, because we focus on modifying the speech waveform rather than the acoustic features). For example, noise reduction of noisy speech simulated by adding noise to a speech waveform recorded in an ideal environment succeeds because of the perfect alignment between the simulated noisy speech and the clean source speech. However, converting synthetic speech generated by text-to-speech synthesis or voice conversion into natural speech is not easy to achieve with this method because of the alignment problem mentioned above.
In this paper, we propose a learning-based filter that allows us to convert a synthetic speech waveform into a natural speech waveform using cycle-consistent adversarial networks with a fully convolutional architecture. We adopt cycle-consistent adversarial networks because they do not require a dataset forcibly paired at the time-frame level and, as the name implies, they are trained within the adversarial learning framework. In contrast to prior work that is also inspired by cycle-consistent adversarial networks but converts acoustic features rather than the speech waveform, our modification is performed at the waveform level, so we expect the proposed method to generate “vocoder-less”-sounding speech even if the input speech is synthesized using a vocoder framework.
Furthermore, we adopt a gated convolutional neural network (CNN) architecture, which is able to capture long- and short-term dependencies in the speech waveform. The experimental results demonstrate that our proposed method can 1) alleviate the over-smoothing effect of the acoustic features despite the direct modification method used for the waveform and 2) greatly improve the naturalness of the generated speech sounds.
2 SEGAN: Speech Enhancement Generative Adversarial Network
2.1 Generative Adversarial Networks
Generative Adversarial Networks (GANs) are generative models consisting of two neural networks. One is a generator $G$ that learns to convert a sample $z$ from a prior distribution $p_z(z)$ to a target sample from the data distribution $p_{\mathrm{data}}(x)$, which is represented by the training data. The generator aims to learn a projection that can imitate the true feature distribution and to generate samples related to the training data. The other is a discriminator $D$ that learns the boundary between the imitated features generated by the generator and the true features picked up from the training data.
The adversarial characteristic arises from the fact that the discriminator
tries to classify the instances obtained from the true data distribution as real and the candidates produced by the generator as fake, while the generator tries to make the discriminator classify them as real. Through back-propagation, the generator becomes able to generate better candidates and the discriminator becomes better able to distinguish the generated ones from the real data. The objective function of the adversarial learning is formulated as the following minimax game between $G$ and $D$:
$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$.   (1)
However, the difficulty of training GANs is a well-known problem. For instance, the classic approach suffers from a vanishing gradient problem due to the sigmoid cross-entropy loss used for training. Several adversarial training techniques have been proposed to overcome this difficulty. The least-squares GAN (LSGAN) approach stabilizes the training process by replacing the cross-entropy loss shown in Eq. 1 with least-squares functions as follows:
$\min_D \; \tfrac{1}{2}\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[(D(x) - 1)^2] + \tfrac{1}{2}\mathbb{E}_{z \sim p_z(z)}[D(G(z))^2]$,   (2)
$\min_G \; \tfrac{1}{2}\mathbb{E}_{z \sim p_z(z)}[(D(G(z)) - 1)^2]$.   (3)
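For concreteness, the least-squares objectives above can be sketched in a few lines of Python; the function names and toy discriminator outputs below are ours, purely for illustration.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """LSGAN discriminator loss: push D(real) toward 1 and D(fake) toward 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """LSGAN generator loss: push D(G(z)) toward 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

# Toy discriminator outputs (raw scores, not probabilities: LSGAN drops the sigmoid).
d_real = np.array([0.9, 0.8, 1.1])
d_fake = np.array([0.1, -0.2, 0.3])
print(lsgan_d_loss(d_real, d_fake))  # small when D separates real from fake
print(lsgan_g_loss(d_fake))          # shrinks as G fools D
```

Because the quadratic penalty grows with distance from the target label, gradients do not vanish for confidently misclassified samples, which is the stabilizing property exploited here.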
2.2 GANs for Speech Enhancement
To retain the linguistic information of the speech samples, SEGAN adopts a conditioned version of the GAN that provides extra information to $G$ and $D$ to perform the mapping and classification. As shown in Fig. 2, in the structure of the generator $G$, which is similar to an auto-encoder, a noisy speech signal $\tilde{x}$, which is the input of the network, is encoded as a vector $c$. After concatenating the random vector $z$ with the encoded vector $c$, which is treated as a conditional vector, the decoding part of the network is performed with transposed convolutions (a.k.a. deconvolutions or fractionally strided convolutions) to obtain the enhanced waveform $\hat{x}$.
To achieve the generation of speech samples that are closer to clean speech, a secondary component is added to the loss of $G$. SEGAN adopts the L1 norm, as it has proven effective in the image manipulation domain [14, 15]. In this way, the adversarial component is encouraged to produce more fine-grained and realistic results. A new hyper-parameter $\lambda$ controls the magnitude of the L1 norm. Finally, the loss function of the generator becomes
$\min_G \; \tfrac{1}{2}\mathbb{E}[(D(G(z, \tilde{x}), \tilde{x}) - 1)^2] + \lambda \, \mathbb{E}\big[\|G(z, \tilde{x}) - x\|_1\big]$.   (4)
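The combined generator loss can be sketched as follows; this is a minimal illustration in our own notation (the function name, the default value of `lam`, and the toy signals are assumptions, not the paper's implementation).

```python
import numpy as np

def segan_g_loss(d_fake, enhanced, clean, lam=100.0):
    """Generator loss sketch: an LSGAN-style adversarial term plus a weighted
    L1 term pulling the enhanced waveform toward the clean reference.
    `lam` plays the role of the hyper-parameter scaling the L1 norm."""
    adv = 0.5 * np.mean((d_fake - 1.0) ** 2)
    l1 = np.mean(np.abs(enhanced - clean))
    return adv + lam * l1

clean = np.linspace(-1.0, 1.0, 8)   # toy clean waveform
enhanced = clean + 0.01             # nearly perfect enhancement
d_fake = np.full(4, 0.9)            # discriminator almost fooled
print(segan_g_loss(d_fake, enhanced, clean))
```

The L1 term dominates early in training and anchors the output to the reference, while the adversarial term refines the fine structure.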
3 Synthetic-to-Natural Speech Waveform Conversion Using Cycle-Consistent Adversarial Networks
In preliminary experiments, we found that SEGAN could not easily be applied to the conversion of a synthetic speech waveform into a natural speech waveform. One possible reason is that the misalignment caused by the different lengths and generation processes of synthetic and natural speech makes it difficult to ensure the operation of the bijective function in the generator $G$. Specifically, the phase information of a speech waveform synthesized using the vocoder framework is very far from that of natural speech, even if the magnitude information of the synthetic speech is close to that of natural speech. We assume that these factors induce “mode collapse”, a well-known problem when training GANs, and that SEGAN does not guarantee that an individual input and output are paired up in a meaningful way. That is, all input speech signals map to the same output speech signal and the optimization fails to make progress.
To solve this problem, we focus on cycle-consistent adversarial networks. This approach introduces a “cycle consistency” property, which ensures a return to the original sample. Mathematically, if we have a converter $G_{X \to Y}$ and another converter $G_{Y \to X}$, then $G_{X \to Y}$ and $G_{Y \to X}$ should be the inverse of each other, and both mappings should be bijections. We incorporate this property into SEGAN by training the mapping functions $G_{X \to Y}$ and $G_{Y \to X}$ simultaneously and adding a cycle consistency loss that encourages $G_{Y \to X}(G_{X \to Y}(x)) \approx x$ and $G_{X \to Y}(G_{Y \to X}(y)) \approx y$. Combining the cycle consistency loss with the adversarial losses defines our full objective for a training procedure that does not require perfect alignment.
Furthermore, we focus on a convolutional architecture called a gated CNN. The gated CNN has recently been shown to be powerful for modeling long-term sequential data. It was originally introduced for language modeling and was shown to outperform long short-term memory (LSTM) language models trained in a similar setting. We previously applied a gated CNN architecture to acoustic feature sequence modeling, and its effectiveness has already been confirmed [18, 19]. With a gated CNN, the output of a hidden layer of the network is described as a linear projection modulated by an output gate. Similar to an LSTM and a gated recurrent unit (GRU), the output gate controls what information should be propagated through the hierarchy of layers and allows the capture of long-term structures.
3.2 Cycle-Consistent Adversarial Networks
For each speech sample $x$, the speech waveform conversion cycle shown in Fig. 3 a) constrains the samples to return to the original speech through the target domain, $x \to G_{X \to Y}(x) \to G_{Y \to X}(G_{X \to Y}(x)) \approx x$. This cycle consistency is called forward cycle consistency. Similarly, as shown in Fig. 3 b), for each speech waveform $y$, $G_{X \to Y}$ and $G_{Y \to X}$ are constrained by a backward cycle consistency, $y \to G_{Y \to X}(y) \to G_{X \to Y}(G_{Y \to X}(y)) \approx y$. Therefore, these are described by the following cycle consistency loss:
$L_{\mathrm{cyc}}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x}\big[\|G_{Y \to X}(G_{X \to Y}(x)) - x\|_1\big] + \mathbb{E}_{y}\big[\|G_{X \to Y}(G_{Y \to X}(y)) - y\|_1\big]$.   (5)
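The forward and backward terms of the cycle consistency loss can be sketched directly; the function name and the toy invertible mappings below are ours, for illustration only.

```python
import numpy as np

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 cycle loss: forward cycle x -> G_XY(x) -> G_YX(G_XY(x)) should
    return to x, and backward cycle y -> G_YX(y) -> G_XY(G_YX(y)) to y."""
    forward = np.mean(np.abs(g_yx(g_xy(x)) - x))
    backward = np.mean(np.abs(g_xy(g_yx(y)) - y))
    return forward + backward

# Toy mappings that are exact inverses incur zero cycle loss.
g_xy = lambda w: w * 2.0
g_yx = lambda w: w / 2.0
print(cycle_consistency_loss(np.arange(4.0), np.arange(4.0), g_xy, g_yx))  # 0.0
```

Any pair of mappings that is not mutually inverse incurs a positive penalty, which is exactly the constraint that rules out degenerate many-to-one mappings.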
Finally, the objective function is
$L_{\mathrm{full}} = L_{\mathrm{adv}}(G_{X \to Y}, D_Y) + L_{\mathrm{adv}}(G_{Y \to X}, D_X) + \lambda_{\mathrm{cyc}} L_{\mathrm{cyc}}(G_{X \to Y}, G_{Y \to X})$,   (6)
where $\lambda_{\mathrm{cyc}}$ is a hyper-parameter used to control the cycle consistency loss.
3.3 Identity-Mapping Loss
The cycle consistency loss allows us to reduce the space of possible mapping functions by constraining the structure. However, in a waveform modification task, the linguistic information is not always preserved by incorporating the cycle consistency loss alone. The identity-mapping loss preserves the compositions of the input samples and the converted samples; it has previously been applied to color preservation, and its effectiveness has been demonstrated. Note that the secondary component of Eq. 4 is also an identity-mapping loss. To encourage the generators $G_{X \to Y}$ and $G_{Y \to X}$ to preserve linguistic information, we also incorporate this property as follows:
$L_{\mathrm{id}}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{y}\big[\|G_{X \to Y}(y) - y\|_1\big] + \mathbb{E}_{x}\big[\|G_{Y \to X}(x) - x\|_1\big]$.   (7)
In practice, the loss weighted by a hyper-parameter $\lambda_{\mathrm{id}}$ that controls the identity-mapping loss, $\lambda_{\mathrm{id}} L_{\mathrm{id}}$, is added to Eq. 6.
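A minimal sketch of the identity-mapping term, in our own notation (function name and toy generators are illustrative assumptions):

```python
import numpy as np

def identity_mapping_loss(x, y, g_xy, g_yx):
    """Identity-mapping loss: a generator fed a sample already in its target
    domain should leave it unchanged, i.e. G_XY(y) ~ y and G_YX(x) ~ x."""
    return np.mean(np.abs(g_xy(y) - y)) + np.mean(np.abs(g_yx(x) - x))

# A generator acting as the identity on its target domain incurs zero loss.
identity = lambda w: w
print(identity_mapping_loss(np.ones(4), np.zeros(4), identity, identity))  # 0.0
```

Because the term penalizes any change to samples already in the target domain, it discourages the generators from rewriting content (here, linguistic information) that should pass through untouched.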
3.4 Sequential Modeling with Gated CNN
To capture long- and short-term dependencies in speech waveforms, we use a gated CNN 
to construct both the generator and discriminator networks of the GAN. Gated CNNs are CNNs equipped with gated linear units (GLUs) as activation functions instead of the regular rectified linear units (ReLUs) or tanh activations. The output of a hidden layer of a gated CNN is described as a linear projection modulated by an output gate:
$H_{l+1} = (H_l * W_l + b_l) \otimes \sigma(H_l * V_l + c_l)$,   (8)
where $W_l$, $b_l$, $V_l$, and $c_l$ are the network parameters to be trained, $\sigma$ is the sigmoid function, and $\otimes$ indicates the element-wise product. Similar to LSTMs, the output gate multiplies each element of $H_l * W_l + b_l$ and controls what information should be propagated through the hierarchy of layers in a data-driven manner.
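The GLU computation reduces to two affine maps and an element-wise product. The sketch below uses plain matrix products instead of convolutions (an illustrative simplification; names are ours):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def glu_layer(h, w, b, v, c):
    """Gated linear unit: linear projection (h @ w + b) modulated element-wise
    by a sigmoid output gate sigmoid(h @ v + c)."""
    return (h @ w + b) * sigmoid(h @ v + c)

rng = np.random.default_rng(0)
h = rng.standard_normal((2, 3))       # batch of 2 hidden vectors
w, v = rng.standard_normal((2, 3, 4)) # projection and gate weights
b, c = np.zeros(4), np.zeros(4)
print(glu_layer(h, w, b, v, c).shape)  # (2, 4)
```

When the gate saturates at 1 the unit passes the linear projection through unchanged, and when it saturates at 0 the unit blocks it, which is how the gate selects what propagates up the layer hierarchy.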
4 Experimental Evaluation
4.1 Experimental Conditions
Datasets (Natural): We used a Japanese speech dataset consisting of utterances by one professional female narrator. To evaluate the performance, we used 30 sentences (speech sections of 5.3 minutes). To train the models, we used about 6,500 sentences for the baseline system and 400 sentences (speech sections of 1.2 hours) for the conventional and proposed methods. The sampling rate of the speech signals was 22.05 kHz. Audio samples can be accessed on our web page (http://www.kecl.ntt.co.jp/people/tanaka.ko/projects/s2n/s2n_speech_waveform_conversion.html).
Baseline system (Baseline): We used a DNN-based statistical parametric speech synthesis method as the baseline. From the speech data, 40 mel-cepstral coefficients, the logarithmic $F_0$, and 5-band aperiodicities were extracted every 5 ms with the STRAIGHT analysis system [24, 25]. The contextual features used as the input were 506-dimensional linguistic features including phonemes and mora positions. The output consisted of 40 mel-cepstral coefficients, log $F_0$, 5-band aperiodicities, their delta and delta-delta features, and a voiced/unvoiced binary value. The DNN architectures were feed-forward networks with 5 hidden layers, each with 1,024 units.
Conventional method (GANv): As a conventional approach, we used a GAN-based postfilter applied not to the speech waveform but to the acoustic features. The system setting was the same as the reported setting, except for the excitation signals. Although the original work used the excitation signals of natural speech, we used the excitation signals generated by vocoding to evaluate all of the synthetic speech. We applied the conventional method only to voiced segments.
Our proposed method (Proposed): We designed a network based on the recent success of image modeling. Figure 4 shows the network architectures of our proposed model. The network includes downsampling layers, residual blocks, and upsampling layers. We used instance normalization (IN) instead of batch normalization. We used a pixel shuffler (PS) for upsampling, whose effectiveness has been demonstrated in high-resolution image generation. We normalized the speech waveforms to zero mean and unit variance using the training sets. To stabilize the training, we used a least-squares GAN. We set $\lambda_{\mathrm{cyc}}$ at 10. To guide the learning process, we set $\lambda_{\mathrm{id}}$ at 5 for the first 20k iterations and linearly decayed it to 0 over the next 20k iterations. We optimized the model parameters using the Adam optimizer with a mini-batch size of 32. The learning rates were set at 0.0001 for the discriminators and 0.0002 for the generators. We used the same learning rates for the first 250k iterations and linearly decayed them to 0 over the next 250k iterations. The other parameters of the Adam optimizer, $\beta_1$ and $\beta_2$, were set at 0.5 and 0.99, respectively. Note that since the generators are fully convolutional, they can handle an input of arbitrary length.
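The hold-then-linear-decay schedule used for both the identity-mapping weight and the learning rates can be sketched as one small helper (the function name is ours; the numbers below match the 5-for-20k-then-decay-20k schedule described above):

```python
def linear_decay(step, start_value, hold_steps, decay_steps):
    """Keep start_value for hold_steps iterations, then decay linearly to 0
    over the next decay_steps iterations."""
    if step < hold_steps:
        return start_value
    t = min(step - hold_steps, decay_steps)
    return start_value * (1.0 - t / decay_steps)

print(linear_decay(10_000, 5.0, 20_000, 20_000))   # 5.0 (still holding)
print(linear_decay(30_000, 5.0, 20_000, 20_000))   # 2.5 (halfway through decay)
print(linear_decay(100_000, 5.0, 20_000, 20_000))  # 0.0 (decay finished)
```

The same helper with `start_value=0.0002`, `hold_steps=250_000`, and `decay_steps=250_000` would reproduce the generator learning-rate schedule.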
4.2 Modulation Spectrum over Acoustic Features
To confirm the alleviation of the over-smoothing effect on the acoustic features, we applied the conventional and proposed methods to speech synthesized by the baseline system and obtained the modulation spectra of the mel-cepstrum sequences for each system. Although the modulation spectrum is traditionally defined as a value calculated using the Fourier transform of a parameter sequence, this paper defines the modulation spectrum as its logarithmic power spectrum. We used 8,192 FFT points.
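The modulation spectrum as defined here (the log power spectrum of a parameter sequence) can be computed in a few lines; the function name, the mean removal, and the numerical floor are our illustrative choices.

```python
import numpy as np

def modulation_spectrum(param_seq, n_fft=8192):
    """Log power spectrum of an acoustic parameter sequence, with the mean
    removed and a small floor added for numerical stability."""
    seq = np.asarray(param_seq, dtype=float)
    seq = seq - seq.mean()
    spec = np.fft.rfft(seq, n=n_fft)
    return np.log(np.abs(spec) ** 2 + 1e-12)

# A strongly modulated (sinusoidal) sequence shows a peak at its modulation
# frequency; an over-smoothed sequence would show attenuated high bins.
frames = np.sin(2 * np.pi * 4 * np.arange(200) / 100.0)
ms = modulation_spectrum(frames)
print(ms.shape)  # (4097,)
```

With 8,192 FFT points, `rfft` yields 4,097 bins covering modulation frequencies from 0 up to half the frame rate.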
The average modulation spectra of the first 1k indices for the 10th, 20th, 30th, and 40th mel-cepstral coefficient sequences are shown in Fig. 5. We found that Baseline suffered more from the over-smoothing effect than GANv and Natural. On the other hand, GANv and Proposed are close to Natural. As with the GAN-based postfilter for the acoustic features (GANv), the results demonstrate that our proposed method for the speech waveform (Proposed) successfully alleviated the over-smoothing effect caused by the statistical parametric speech synthesis process.
4.3 Subjective Evaluation for Naturalness
We conducted a subjective 5-scale mean opinion score test regarding the naturalness of the generated speech. Ten listeners participated, and each listener evaluated 120 speech samples (30 speech samples × 4 systems). We applied the conventional and proposed methods to the same speech waveforms (Baseline) as in Sec. 4.2.
Figure 6 shows that our proposed method (Proposed) achieved a significant improvement in the naturalness of the generated speech compared with Baseline and GANv. This result indicates that our approach is more effective than using postfilters on the acoustic features because it can address both the over-smoothing problem and the vocoding error. Furthermore, with Proposed, the listeners commented that the “buzzy” sound peculiar to vocoding was sufficiently improved. However, there is still a gap between Proposed and Natural: the listeners noted that Proposed was distinguishable from Natural because it sometimes had a “hoarse” sound.
5 Conclusions
In this paper, to realize synthetic-to-natural speech conversion, we proposed a learning-based filter that allows us to convert a synthetic speech waveform into a natural speech waveform using cycle-consistent adversarial networks. Since our process is applied after the synthesis part of statistical parametric speech synthesis, we expected our approach to address not only the over-smoothing problem but also the vocoding error. The experimental results demonstrated that our proposed method 1) alleviated the over-smoothing effect on the acoustic features despite directly modifying the waveform and 2) dramatically improved the naturalness of the generated speech. In the future, we will further close the gap between natural speech and synthetic speech by taking auditory properties into account.
This work was supported by JSPS KAKENHI 17H01763.
-  Heiga Zen, Andrew Senior, and Mike Schuster, “Statistical parametric speech synthesis using deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7962–7966.
-  Tomoki Toda, Alan W Black, and Keiichi Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
-  Heiga Zen, Keiichi Tokuda, and Alan W Black, “Statistical parametric speech synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
-  Shinnosuke Takamichi, Tomoki Toda, Graham Neubig, Sakriani Sakti, and Satoshi Nakamura, “A postfilter to modify the modulation spectrum in HMM-based speech synthesis,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 290–294.
-  Takuhiro Kaneko, Hirokazu Kameoka, Nobukatsu Hojo, Yusuke Ijima, Kaoru Hiramatsu, and Kunio Kashino, “Generative adversarial network-based postfilter for statistical parametric speech synthesis,” in Proc. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2017), 2017, pp. 4910–4914.
-  Santiago Pascual, Antonio Bonafonte, and Joan Serrà, “SEGAN: Speech enhancement generative adversarial network,” arXiv preprint arXiv:1703.09452, 2017.
-  Takuhiro Kaneko and Hirokazu Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv preprint arXiv:1711.11293, 2017.
-  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint arXiv:1703.10593, 2017.
-  Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier, “Language modeling with gated convolutional networks,” arXiv preprint arXiv:1612.08083, 2016.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
-  Alec Radford, Luke Metz, and Soumith Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
-  Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, “Statistical parametric speech synthesis incorporating generative adversarial networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 1, pp. 84–96, 2018.
-  Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley, “Least squares generative adversarial networks,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 2813–2821.
-  Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint, 2017.
-  Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
-  Richard W Brislin, “Back-translation for cross-cultural research,” Journal of cross-cultural psychology, vol. 1, no. 3, pp. 185–216, 1970.
-  Tinghui Zhou, Philipp Krahenbuhl, Mathieu Aubry, Qixing Huang, and Alexei A Efros, “Learning dense correspondence via 3D-guided cycle consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 117–126.
-  Takuhiro Kaneko, Hirokazu Kameoka, Kaoru Hiramatsu, and Kunio Kashino, “Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks,” Proc. Interspeech 2017, pp. 1283–1287, 2017.
-  Kou Tanaka, Hirokazu Kameoka, and Kazuho Morikawa, “VAE-SPACE: Deep generative model of voice fundamental frequency contours,” Proc. ICASSP 2018, 2018.
-  Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
-  Yaniv Taigman, Adam Polyak, and Lior Wolf, “Unsupervised cross-domain image generation,” arXiv preprint arXiv:1611.02200, 2016.
-  Vinod Nair and Geoffrey E Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
-  Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain De Cheveigne, “Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds,” Speech communication, vol. 27, no. 3, pp. 187–207, 1999.
-  Hideki Kawahara, Jo Estill, and Osamu Fujimura, “Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system straight,” in Second International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, 2001.
-  Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al., “Photo-realistic single image super-resolution using a generative adversarial network,” arXiv preprint, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” CoRR, vol. abs/1607.08022, 2016.
-  Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning, 2015, pp. 448–456.
-  Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  Les Atlas and Shihab A. Shamma, “Joint acoustic and modulation frequency,” EURASIP Journal on Advances in Signal Processing, vol. 2003, no. 7, pp. 310290, June 2003.