1 Introduction
This paper addresses the problem of reconstructing a time-domain signal solely from a magnitude spectrogram.
The magnitude spectrograms of real-world audio signals tend to be highly structured in terms of both spectral and temporal regularities. For example, pitch contours and formant trajectories are clearly visible in a magnitude spectrogram representation of speech, whereas they are hard to discern in the time-domain signal. Therefore, there are many cases where processing magnitude spectrograms deals with problems more easily than directly processing time-domain signals. In fact, many methods for monaural audio source separation operate on magnitude spectrograms [1, 2, 3]. Furthermore, a magnitude spectrogram representation was recently found to be reasonable and effective for use with speech synthesis systems [4, 5].
Since a magnitude spectrogram does not contain phase information, we must restore or infer phase information to reconstruct a time-domain signal. This problem is called the signal (or phase) reconstruction problem. One widely used method for solving the signal reconstruction problem was proposed by Griffin and Lim [6] (hereafter referred to as the Griffin-Lim method). One of the drawbacks of the Griffin-Lim method is that it usually requires many iterations to obtain high-quality audio signals. This makes it particularly difficult to apply to real-time systems. Furthermore, there are some cases where high-quality audio signals can never be obtained no matter how many iterations the algorithm is run for. To overcome these shortcomings of the Griffin-Lim method, we apply a learning-based approach to the signal reconstruction problem. Specifically, we propose modeling the reconstruction process of a time-domain signal from a magnitude spectrogram using a deep neural network (DNN) and introducing the idea of the generative adversarial network (GAN) [7] for training the signal generator network.
The remainder of the paper is organized as follows. We provide an overview of the phase reconstruction problem in Section 2, introduce the Griffin-Lim method in Section 3, and present our GAN-based approach in Section 4. Experimental evaluations and supplementary techniques for training our model are provided in Section 5. Finally, we offer our conclusions in Section 6.
2 Signal Reconstruction Problem
In this section, we provide an overview of the signal reconstruction problem.
We use $x \in \mathbb{R}^N$ to denote a time-domain signal and $s_{\omega,t} \in \mathbb{C}$ to denote the $(\omega, t)$-th time-frequency component of $x$, where $\omega$ and $t$ indicate frequency and time indices, respectively. By defining $\psi_{\omega,t} \in \mathbb{C}^N$ as a complex sinusoid of frequency $\omega$ modulated by a window function centered at time $\tau_t$, $s_{\omega,t}$ is defined by the inner product between $x$ and $\psi_{\omega,t}$, namely $s_{\omega,t} = \psi_{\omega,t}^{H} x$. With a short-time Fourier transform (STFT), $\tau_t$ corresponds to the center time of frame $t$ and $\psi_{\omega,t}$ is the modulated complex sinusoid padded with zeros over the range outside the frame. By using $s \in \mathbb{C}^{FT}$ to denote a vector obtained by stacking all the time-frequency components $s_{\omega,t}$, the relationship between $x$ and $s$ can be written as

$s = A x,$  (1)

where $A$ is an $FT \times N$ matrix whose rows are the $\psi_{\omega,t}^{H}$. Hereafter, we call $s$ a complex spectrogram. Since the total number $FT$ of time-frequency points is usually set at more than the number $N$ of sample points of the time-domain signal, $s$ is a redundant representation of $x$: namely, $s$ belongs to the $N$-dimensional linear subspace $\mathcal{S} \subset \mathbb{C}^{FT}$ spanned by the column vectors of $A$. With an STFT, all the elements of a complex spectrogram must satisfy certain conditions to ensure that the waveforms within the overlapping segment of consecutive frames are consistent. By using $a$ to denote the magnitude spectrogram of $x$, where each element of $a$ is given by the absolute value of the corresponding element of $s$, the signal reconstruction problem can be cast as an optimization problem of estimating $s$ solely from $a$ using the redundancy constraint as a clue.

3 Griffin-Lim Method
One widely used way of solving the phase reconstruction problem is the Griffin-Lim method [6]. In this section, we derive the iterative algorithm of the Griffin-Lim method following the derivation given in [8].
Whether or not a given $s \in \mathbb{C}^{FT}$ satisfies the redundancy constraint, so that $s$ is a complex spectrogram associated with a time-domain signal, can be evaluated by examining whether or not the orthogonal projection $A A^{\dagger} s$ of $s$ onto the subspace $\mathcal{S}$ matches $s$. Here, $A^{\dagger}$ is a pseudo-inverse matrix of $A$ satisfying

$A^{\dagger} A = I.$  (2)

With an STFT, multiplication by $A^{\dagger}$ corresponds to an inverse STFT. Thus, $A A^{\dagger} s$ is the STFT of the inverse STFT of $s$. Now, by using $\phi$ to denote a vector whose elements are the phases $\phi_{\omega,t}$, the phase reconstruction problem for a given $a$ is formulated as an optimization problem of estimating $\phi$ that minimizes

$g(\phi) = \| a \odot e^{j\phi} - A A^{\dagger} (a \odot e^{j\phi}) \|_2^2,$  (3)

where $\odot$ denotes an element-wise product. Now, from (2), $A A^{\dagger} (a \odot e^{j\phi})$ is the point closest to $a \odot e^{j\phi}$ in the subspace $\mathcal{S}$. Thus, we can rewrite (3) as

$g(\phi) = \min_{x} \| a \odot e^{j\phi} - A x \|_2^2.$  (4)

According to the principle of the majorization-minimization algorithm [9], it can be shown that $g^{+}(\phi, \tilde{s}) = \| a \odot e^{j\phi} - A A^{\dagger} \tilde{s} \|_2^2$ is a majorizer of $g(\phi)$, where $\tilde{s}$ is an auxiliary variable, and that a stationary point of $g(\phi)$ can be found by iteratively performing the following updates:

$\tilde{s} \leftarrow A A^{\dagger} (a \odot e^{j\phi}),$  (5)
$e^{j\phi} \leftarrow \tilde{s} / |\tilde{s}|.$  (6)

Here, the division in (6) is performed element-wise, i.e., each element of $\tilde{s}$ is divided by its absolute value. With an STFT, Eq. (5) can be interpreted as the inverse STFT of $a \odot e^{j\phi}$ followed by the STFT, whereas Eq. (6) is a procedure for replacing the phase $\phi$ with the phase of $\tilde{s}$ updated via (5). This algorithm is procedurally equivalent to the Griffin-Lim method [6].
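For illustration, updates (5) and (6) can be sketched with SciPy's STFT routines. This is a minimal sketch: the window type, frame length, iteration count, and random initialization below are arbitrary choices for the example, not the settings used in the paper's experiments.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=100, nperseg=1024, noverlap=512, seed=0):
    """Reconstruct a waveform from a magnitude STFT `mag` (freq x frames).

    Each iteration applies the projection A A^+ (inverse STFT followed by
    STFT, Eq. (5)) and then restores the known magnitudes while keeping
    only the phase of the projection (Eq. (6)).
    """
    rng = np.random.default_rng(seed)
    # Start from a random phase estimate.
    phase = np.exp(1j * rng.uniform(-np.pi, np.pi, mag.shape))
    s = mag * phase
    for _ in range(n_iter):
        # Eq. (5): project onto the subspace of consistent spectrograms.
        _, x = istft(s, nperseg=nperseg, noverlap=noverlap)
        _, _, s_proj = stft(x, nperseg=nperseg, noverlap=noverlap)
        # Eq. (6): keep the projection's phase, restore the magnitudes.
        s = mag * s_proj / np.maximum(np.abs(s_proj), 1e-12)
    _, x = istft(s, nperseg=nperseg, noverlap=noverlap)
    return x
```

In practice the iteration is stopped after a fixed budget; the experiments in Section 5 run the baseline for 400 iterations.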
The Griffin-Lim method usually requires many iterations to obtain a high-quality audio signal, which makes it particularly difficult to apply to real-time systems. Furthermore, there are some cases where high-quality audio signals can never be obtained no matter how many iterations the algorithm is run for, for example when $a$ is an artificially created magnitude spectrogram. In the next section, we propose a learning-based approach to the phase reconstruction problem to overcome these shortcomings of the Griffin-Lim method.
4 GAN-Based Signal Reconstruction
4.1 Modeling the Phase Reconstruction Process
By using $s^{(0)}$ to denote the initial value of $a \odot e^{j\phi}$, and defining $f(s) = A A^{\dagger} s$ and $h(s) = a \odot s / |s|$, the iterative algorithm of the Griffin-Lim method can be expressed as a multilayer composite function

$s^{(L)} = h(f(\cdots h(f(s^{(0)})) \cdots)).$  (7)

Here, $f$ is a linear projection whereas $h$ is a nonlinear operation applied to the output of $f$. Hence, (7) can be viewed as a deep neural network (DNN) in which the weight parameters and the activation functions are fixed. From this point of view, finding an algorithm that converges more quickly to a better solution than the Griffin-Lim algorithm can be regarded as learning the weight parameters (and the activation functions) of this DNN. This idea is inspired by the deep unfolding framework [10], which uses a learning strategy to obtain an improved version of a deterministic iterative inference algorithm by unfolding the iterations and treating them as layers in a DNN. Fortunately, an unlimited amount of paired data of $a$ and $s$ can be collected very easily by computing the complex, magnitude, and phase spectrograms of time-domain signals. This is very advantageous for efficiently training our DNN.

In the following, we consider a DNN that uses $a$ and $\phi$ as inputs and generates $\hat{s}$ (or $\hat{x}$) as an output. We call this DNN a generator and express the relationship between the input and output as $\hat{s} = G(a, \phi)$.
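As a small numerical illustration of this "fixed-weight network" view, the sketch below builds the composite function (7) out of a fixed linear projection $f$ and a fixed magnitude-restoring nonlinearity $h$. The analysis matrix here is an arbitrary small random matrix standing in for the STFT, purely for illustration.

```python
import numpy as np

def make_layers(A, a, n_layers):
    """Express the Griffin-Lim iteration as a stack of fixed 'layers'.

    A : analysis matrix (rows play the role of windowed sinusoids)
    a : target magnitudes
    f is a fixed linear 'layer', h a fixed 'activation'; deep unfolding
    would replace these with learnable weights.
    """
    proj = A @ np.linalg.pinv(A)           # orthogonal projector A A^+ onto range(A)

    def f(s):                              # linear layer: s -> A A^+ s
        return proj @ s

    def h(s):                              # nonlinearity: restore |s| = a, keep phase
        return a * s / np.maximum(np.abs(s), 1e-12)

    def network(s0):                       # Eq. (7): L-fold composition h(f(...))
        s = s0
        for _ in range(n_layers):
            s = h(f(s))
        return s

    return network
```

Every output of the final $h$ layer has magnitudes exactly equal to $a$, mirroring the phase-only update of the Griffin-Lim iteration.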
4.2 Learning Criterion
For the generator training, one natural choice for the learning criterion would be a similarity metric (e.g., the $\ell_1$ norm) between the generator output and a target complex spectrogram (or signal). Manually defining a similarity metric amounts to assuming a specific form of the probability distribution of the target data (e.g., a Laplacian distribution for the $\ell_1$ norm). However, the true data distribution is unknown. If we use a similarity metric defined in the data space as the learning criterion, the generator will be trained in such a way that outputs that fit the target data on average are considered optimal. As a result, the generator will learn to generate only over-smoothed signals. This is undesirable, as the over-smoothing of reconstructed signals causes audio quality degradation. To avoid this, we propose using a similarity metric implicitly learned with a generative adversarial network (GAN) [7]. In addition to the generator network, we introduce a discriminator network that learns to correctly discriminate the complex spectrograms generated by the generator from the complex spectrograms of real audio signals. Given a target complex spectrogram $s$, the discriminator is expected to find a feature space in which $s$ and $\hat{s} = G(a, \phi)$ are as separate as possible. Thus, we expect that minimizing the distance between $s$ and $\hat{s}$ measured in a hidden layer of the discriminator would make $\hat{s}$ indistinguishable from $s$ in the data space. By using $D$ to denote the discriminator network, we first consider the following criterion for the discriminator:

$\mathcal{J}_D(D) = \mathbb{E}\big[(D(s) - 1)^2\big] + \mathbb{E}\big[D(G(a, \phi))^2\big].$  (8)
Here, the target label corresponding to real data is assumed to be 1 and that corresponding to the data generated by the generator is 0. Thus, (8) becomes 0 only if the discriminator correctly distinguishes the "fake" complex spectrograms generated by the generator from the "real" complex spectrograms of real audio signals. Therefore, the goal of $D$ is to minimize $\mathcal{J}_D(D)$. As for the generator $G$, one of its goals is to deceive the discriminator, i.e., to make the "fake" complex spectrograms as indistinguishable as possible from the "real" complex spectrograms. This can be accomplished by minimizing the following criterion:

$\mathcal{J}_{\mathrm{adv}}(G) = \mathbb{E}\big[(D(G(a, \phi)) - 1)^2\big].$  (9)
Another goal for $G$ is to make $\hat{s} = G(a, \phi)$ as close as possible to the target complex spectrogram $s$. By using $D^{(m)}$ to denote the output of the $m$-th layer of the discriminator $D$, we would also like to minimize

$\mathcal{J}_{\mathrm{feat}}(G) = \sum_{m} \mu_m \, \mathbb{E}\big[\| D^{(m)}(s) - D^{(m)}(G(a, \phi)) \|_1\big],$  (10)

where $\mu_m$ is a fixed weight, which weighs the importance of the $m$-th layer feature space. Here, the $0$-th layer corresponds to the input layer, namely $D^{(0)}(s) = s$.
The learning objectives for $G$ and $D$ can thus be summarized as follows:

$\min_{D} \; \mathcal{J}_D(D),$  (11)
$\min_{G} \; \mathcal{J}_{\mathrm{adv}}(G) + \lambda \mathcal{J}_{\mathrm{feat}}(G),$  (12)

where $\lambda$ is a fixed weight.
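As a concrete reading of criteria (8)-(10) and objective (12), the following NumPy sketch computes the discriminator and generator losses from precomputed per-layer discriminator features. The function and variable names are illustrative, not from the paper, and the expectations are replaced by sample means over a minibatch.

```python
import numpy as np

def gan_losses(feats_real, feats_fake, mu, lam):
    """LSGAN-style criteria (8)-(9) with the feature-space term (10).

    feats_real / feats_fake : lists of per-layer discriminator features,
        where index 0 is the input itself and the last entry is the
        scalar real/fake score.
    mu  : per-layer weights for the feature term.
    lam : weight of the feature term in the generator objective (12).
    """
    score_real = feats_real[-1]
    score_fake = feats_fake[-1]

    # (8): D should output 1 on real data and 0 on generated data.
    J_D = np.mean((score_real - 1.0) ** 2) + np.mean(score_fake ** 2)

    # (9): G tries to make D output 1 on generated data.
    J_adv = np.mean((score_fake - 1.0) ** 2)

    # (10): weighted L1 distances in the discriminator's feature spaces.
    J_feat = sum(
        mu_m * np.mean(np.abs(fr - ff))
        for mu_m, fr, ff in zip(mu, feats_real, feats_fake)
    )

    # (12): total generator objective.
    J_G = J_adv + lam * J_feat
    return J_D, J_G
```

In an actual training loop, $D$ would be updated to decrease `J_D` and $G$ to decrease `J_G`, alternating between the two.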
A general framework for training a generator network in such a way that it can deceive a real/fake discriminator network is called a generative adversarial network (GAN) [7]. The novelty of our proposed approach lies in adapting the GAN framework to the signal reconstruction problem by incorporating the additional term (10). The GAN framework using (8) and (9) as the learning criteria is called the least squares GAN (LSGAN) [11]. Note that GAN frameworks using other learning criteria, such as [12], have also been proposed; thus, we can also use the learning criteria employed in [7], [12], or others instead of (8) and (9).
5 Experimental Evaluation
We tested our method and the Griffin-Lim method using real speech samples.
5.1 Experimental Settings
5.1.1 Dataset
We used clean speech signals excerpted from [13] as the experimental data. The speech data consisted of utterances of 30 speakers. The utterances of 28 speakers were used as the training set and the remaining utterances were used as the evaluation set. For the minibatch training, we divided each training utterance into 1-second-long segments with an overlap of 0.5 seconds. All the speech data were downsampled to 16 kHz. Magnitude spectrograms were obtained with an STFT using a Blackman window that was 64 ms long with a 32 ms overlap.
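Under these settings, the magnitude spectrograms can be computed along the following lines. This sketch uses SciPy's STFT; the paper does not specify the exact STFT implementation used for feature extraction, so treat the helper below as one plausible realization.

```python
import numpy as np
from scipy.signal import stft

FS = 16000                      # all data downsampled to 16 kHz
NPERSEG = int(0.064 * FS)       # 64 ms Blackman window -> 1024 samples
NOVERLAP = int(0.032 * FS)      # 32 ms overlap -> 512 samples

def magnitude_spectrogram(x):
    """Magnitude spectrogram with the STFT settings stated in the paper."""
    _, _, S = stft(x, fs=FS, window='blackman',
                   nperseg=NPERSEG, noverlap=NOVERLAP)
    return np.abs(S)

def split_segments(x, seg=FS, hop=FS // 2):
    """1-second-long training segments with 0.5-second overlap."""
    return [x[i:i + seg] for i in range(0, len(x) - seg + 1, hop)]
```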
5.1.2 Network Architecture
Fig. 1 shows the network architectures we constructed for this experiment. The left half shows the architecture of the generator $G$ and the right half shows that of the discriminator $D$. The light blue blocks indicate convolutional layers, and the labels attached to each convolutional layer represent its hyperparameters. The yellow blocks indicate activation functions: PReLU [14] was used for the generator $G$ and Leaky ReLU [15] was used for the discriminator $D$. The violet blocks indicate element-wise sums, and the green block indicates the concatenation of features along the channel axis. The red blocks indicate fully-connected layers. Blocks without symbols have the same hyperparameters as the preceding blocks. Note that we referred to [16] when constructing these architectures. The generator $G$ is fully convolutional [17], thus allowing an input to have an arbitrary length. The weight constants in (10) and (12) were fixed in advance. RMSprop [18] was used as the optimization algorithm, with fixed settings for the learning rate, minibatch size, and number of epochs.

Instead of directly feeding an input magnitude spectrogram and a randomly generated phase spectrogram into the generator $G$, we used a complex spectrogram reconstructed with 5 iterations of the Griffin-Lim method as the input. Both the input and output of the generator $G$ have 2 channels, one corresponding to the real part and the other to the imaginary part of the complex spectrogram. For preprocessing, we normalized the complex spectrograms of the training data to zero mean and unit variance at each frequency. At test time, the scale of the generator output at each frequency was restored.
We added a block that applies an inverse STFT to the generator output before feeding it into the discriminator $D$. We found this particularly important, as the training did not work well without it.
5.2 Data Augmentation
It is well known that the difference between two signals is hardly perceptible to human ears when their magnitude spectrograms and inter-frame phase differences are the same. This implies that there is an arbitrariness in the initial phases of spectrograms that are perceived similarly. By utilizing this property, we can augment the training data for $G$ and $D$ by preparing many different waveforms that are identical except for their initial phases. We expect this data augmentation to allow the generator to concentrate on learning a way of inferring appropriate inter-frame phase differences given a magnitude spectrogram, thus facilitating efficient learning.
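One simple way to realize such an augmentation, sketched below, is to apply a random per-frequency phase offset in the STFT domain: multiplying every frame of a bin by the same unit-magnitude constant leaves the magnitudes and the frame-to-frame phase differences unchanged. This particular scheme is an assumption of the sketch, not necessarily the exact procedure used in the paper.

```python
import numpy as np
from scipy.signal import stft, istft

def random_initial_phase(x, rng, nperseg=1024, noverlap=512):
    """Augmentation sketch: keep magnitudes and inter-frame phase
    differences, but draw a new initial phase for each frequency bin."""
    _, _, S = stft(x, nperseg=nperseg, noverlap=noverlap)
    # One random offset per bin, shared across all frames, so that
    # |S| and the frame-to-frame phase differences are unchanged.
    theta = rng.uniform(-np.pi, np.pi, size=(S.shape[0], 1))
    S_aug = S * np.exp(1j * theta)
    _, x_aug = istft(S_aug, nperseg=nperseg, noverlap=noverlap)
    return x_aug
```

Note that `S_aug` is generally no longer a consistent spectrogram, so the inverse STFT returns the closest waveform in the least-squares sense.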
5.3 Dimensionality Reduction
Note that the real and imaginary parts of the Fourier transform of a real-valued signal are even and odd functions, respectively. Owing to this symmetric structure, it is sufficient to restore/infer the spectral components within the frequency range from 0 up to the Nyquist frequency. We can therefore restrict the sizes of the input and output of the generator $G$ to this frequency range.
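This symmetry can be checked directly with NumPy's real-FFT helpers, which store exactly the bins from 0 up to the Nyquist frequency:

```python
import numpy as np

# For a real-valued frame, the negative-frequency half of the spectrum is
# the complex conjugate of the positive half, so only the bins from 0 up
# to the Nyquist frequency need to be stored or inferred.
frame = np.random.default_rng(0).standard_normal(1024)
full = np.fft.fft(frame)     # 1024 complex bins (redundant for real input)
half = np.fft.rfft(frame)    # 513 bins: 0 ... Nyquist
assert np.allclose(full[:513], half)
assert np.allclose(np.conj(full[513:]), full[1:512][::-1])
assert np.allclose(np.fft.irfft(half, n=1024), frame)
```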
5.4 Subjective Evaluation
We compared our proposed method with the Griffin-Lim method in terms of the perceptual quality of the reconstructed signals by conducting an AB test, where "A" and "B" were reconstructed signals obtained with the proposed and baseline methods, respectively. In this listening test, "A" and "B" were presented in random order to eliminate bias with regard to the order of stimuli. Five listeners participated in our listening test. Each listener was presented with pairs of {"A", "B"} signals and asked to select "A" or "B" for each pair. The Griffin-Lim method was run for 400 iterations. The signals were 2 to 5 seconds long.
The preference scores are shown in Fig. 2. As the result shows, the reconstructed signals obtained with the proposed method were preferred by the listeners for 76% of the 50 pairs.
5.5 Generalization ability
To confirm the generalization ability of the proposed method, we tested it on musical audio signals excerpted from [19]. Examples of the reconstructed signals are shown in Fig. 3. In these examples, we can observe a discontinuity in the reconstructed signal obtained with the Griffin-Lim method. On the other hand, the proposed method appears to have worked successfully, even though the model was trained using speech data.
5.6 Comparison of Processing Times
We further compared the proposed method with the Griffin-Lim method in terms of the processing time needed to reconstruct time-domain signals. For comparison, we measured the processing times for various speech lengths, using speech data shorter than 6 seconds for the evaluation. Here, the network architecture of our proposed method was the same as in Fig. 1, and the Griffin-Lim method was run for 400 iterations. The CPU used in this experiment was an Intel Core i7-6850K CPU @ 3.60 GHz, and the GPU was an NVIDIA GeForce GTX 1080. We implemented the Griffin-Lim method using the fast Fourier transform function in NumPy [20] and implemented our model with Chainer [21]. Fig. 4 shows the results. As the speech data become longer, the processing time increases linearly. When executing the proposed method on the GPU, the time needed to reconstruct a signal was only about one-tenth of the length of that signal. On the other hand, the Griffin-Lim method executed on the CPU took about the same time as the length of the signal. Therefore, given a GPU, the proposed method can be run in real time. However, on the CPU, the proposed method took about three times longer than the length of the signal. If we want to execute the proposed method in real time using a CPU, we would need to construct a more compact architecture than that shown in Fig. 1. One simple way would be to replace the convolutional layers with downsampling and upsampling layers.
6 Conclusion
This paper proposed a GAN-based approach to signal reconstruction from magnitude spectrograms. The idea was to model the signal reconstruction process using a DNN and train it with a similarity metric implicitly learned by a GAN discriminator. Through subjective evaluations, we showed that the proposed method was able to reconstruct higher-quality time-domain signals than the Griffin-Lim method run for 400 iterations. Furthermore, we showed that the proposed method can be executed in real time when using a GPU. Future work will include the investigation of network architectures appropriate for CPU implementations.
References
 [1] P. Smaragdis, C. Févotte, G. J. Mysore, N. Mohammadiha, and M. Hoffman, “Static and dynamic source separation using nonnegative factorizations: A unified view,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 66–75, 2014.
[2] T. Virtanen, J. F. Gemmeke, B. Raj, and P. Smaragdis, “Compositional models for audio processing: Uncovering the structure of sound mixtures,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 125–144, 2015.
[3] H. Kameoka, “Non-negative matrix factorization and its variants for audio signal processing,” in Applied Matrix and Tensor Variate Data Analysis, T. Sakata (Ed.), Springer Japan, 2016.
[4] S. Takaki, H. Kameoka, and J. Yamagishi, “Direct modeling of frequency spectra and waveform generation based on phase recovery for DNN-based speech synthesis,” in Proc. Interspeech, pp. 1128–1132, 2017.
[5] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, et al., “Tacotron: A fully end-to-end text-to-speech synthesis model,” arXiv preprint arXiv:1703.10135, 2017.
[6] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. ASSP, vol. 32, no. 2, pp. 236–243, 1984.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, et al., “Generative adversarial nets,” in Adv. NIPS, pp. 2672–2680, 2014.
[8] J. Le Roux, H. Kameoka, N. Ono, and S. Sagayama, “Fast signal reconstruction from magnitude STFT spectrogram based on spectrogram consistency,” in Proc. DAFx, pp. 397–403, 2010.

[9] D. R. Hunter and K. Lange, “Quantile regression via an MM algorithm,” Journal of Computational and Graphical Statistics, vol. 9, pp. 60–77, 2000.
[10] J. R. Hershey, J. Le Roux, and F. Weninger, “Deep unfolding: Model-based inspiration of novel deep architectures,” arXiv preprint arXiv:1409.2574, 2014.
 [11] X. Mao, Q. Li, H. Xie, R.Y. Lau, Z. Wang, et al., “Least squares generative adversarial networks,” in Proc. ICCV, pp. 2813–2821, 2017.
 [12] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” arXiv preprint arXiv:1701.07875, 2017.
 [13] C. ValentiniBotinhao, “Noisy speech database for training speech enhancement algorithms and TTS models, [dataset],” University of Edinburgh. School of Informatics. CSTR, 2016. http://dx.doi.org/10.7488/ds/1356.

[14] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. ICCV, pp. 1026–1034, 2015.
[15] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30, no. 1, p. 3, 2013.
[16] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, et al., “Photo-realistic single image super-resolution using a generative adversarial network,” arXiv preprint arXiv:1609.04802, 2016.
 [17] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. CVPR, pp. 3431–3440, 2015.

[18] T. Tieleman and G. Hinton, “Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
 [19] CAFÉ DEL CHILLIA, “In The Story That We Say,” https://www.jamendo.com/track/1455877/inthestorythatwesay, 2017.
[20] S. van der Walt, S. C. Colbert, and G. Varoquaux, “The NumPy array: a structure for efficient numerical computation,” Computing in Science & Engineering, vol. 13, no. 2, pp. 22–30, 2011.

[21] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a next-generation open source framework for deep learning,” in Proc. Workshop on Machine Learning Systems (LearningSys) in the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), vol. 5, 2015.