Generative adversarial network-based approach to signal reconstruction from magnitude spectrograms

04/06/2018, by Keisuke Oyamada et al.

In this paper, we address the problem of reconstructing a time-domain signal (or a phase spectrogram) solely from a magnitude spectrogram. Since magnitude spectrograms do not contain phase information, we must restore or infer phase information to reconstruct a time-domain signal. One widely used approach for dealing with the signal reconstruction problem was proposed by Griffin and Lim. This method usually requires many iterations for the signal reconstruction process and depending on the inputs, it does not always produce high-quality audio signals. To overcome these shortcomings, we apply a learning-based approach to the signal reconstruction problem by modeling the signal reconstruction process using a deep neural network and training it using the idea of a generative adversarial network. Experimental evaluations revealed that our method was able to reconstruct signals faster with higher quality than the Griffin-Lim method.


1 Introduction

This paper addresses the problem of reconstructing a time-domain signal solely from a magnitude spectrogram.

The magnitude spectrograms of real-world audio signals tend to be highly structured in terms of both spectral and temporal regularities. For example, pitch contours and formant trajectories are clearly visible in a magnitude spectrogram representation of speech, whereas they are hard to discern in a time-domain signal. There are therefore many cases where a problem is easier to handle by processing magnitude spectrograms than by directly processing time-domain signals. In fact, many methods for monaural audio source separation operate on magnitude spectrograms [1, 2, 3]. Furthermore, magnitude spectrogram representations were recently found to be reasonable and effective for use in speech synthesis systems [4, 5].

Since a magnitude spectrogram does not contain phase information, we must restore or infer phase information to reconstruct a time-domain signal. This problem is called the signal (or phase) reconstruction problem. One widely used method for solving the signal reconstruction problem was proposed by Griffin and Lim [6] (hereafter referred to as the Griffin-Lim method). One drawback of the Griffin-Lim method is that it usually requires many iterations to obtain high-quality audio signals, which makes it particularly difficult to apply to real-time systems. Furthermore, there are some cases where high-quality audio signals can never be obtained even though the algorithm is run for many iterations. To overcome these shortcomings of the Griffin-Lim method, we apply a learning-based approach to the signal reconstruction problem. Specifically, we propose modeling the reconstruction process of a time-domain signal from a magnitude spectrogram with a deep neural network (DNN) and training this signal generator network using the idea of the generative adversarial network (GAN) [7].

The remainder of the paper is organized as follows. We provide an overview of the signal reconstruction problem in Section 2, introduce the Griffin-Lim method in Section 3, and present our GAN-based approach in Section 4. Experimental evaluations and supplementary techniques for training our model are provided in Section 5. Finally, we offer our conclusions in Section 6.

2 Signal Reconstruction Problem

In this section, we provide an overview of the signal reconstruction problem.

We use $x \in \mathbb{R}^{N}$ to denote a time-domain signal and $c_{f,t} \in \mathbb{C}$ to denote the $(f,t)$-th time-frequency component of $x$, where $f$ and $t$ indicate frequency and time indices, respectively. By defining $w_{f,t}$ as a complex sinusoid of frequency $f$ modulated by a window function centered at time $t$, $c_{f,t}$ is defined by the inner product between $w_{f,t}$ and $x$, namely $c_{f,t} = \langle w_{f,t}, x \rangle = w_{f,t}^{\mathsf H} x$. With a short-time Fourier transform (STFT), $t$ corresponds to the center time of frame $t$ and $w_{f,t}$ is the modulated complex sinusoid padded with zeros over the range outside the frame. By using $c \in \mathbb{C}^{FT}$ to denote a vector obtained by stacking all the time-frequency components $c_{f,t}$, the relationship between $x$ and $c$ can be written as

$c = W x$   (1)

where $W \in \mathbb{C}^{FT \times N}$ is a matrix whose rows are the $w_{f,t}^{\mathsf H}$. Hereafter, we call $c$ a complex spectrogram. Since the total number of time-frequency points $FT$ is usually set larger than the number $N$ of sample points of the time-domain signal, $c$ is a redundant representation of $x$. Namely, $c$ belongs to an $N$-dimensional linear subspace $\mathcal{W}$ spanned by the column vectors of $W$. With an STFT, all the elements of a complex spectrogram must satisfy certain conditions to ensure that the waveforms within the overlapping segment of consecutive frames are consistent. By using $a$ to denote the magnitude spectrogram of $x$, where each element of $a$ is given by the absolute value of the corresponding element of $c$, the signal reconstruction problem can be cast as an optimization problem of estimating $c$ solely from $a$, using the redundancy constraint as a clue.
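For illustration (this example is not part of the paper), the redundancy and the consistency property can be checked numerically with SciPy's STFT, which plays the role of $W$ here:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
x = np.random.randn(fs)                      # 1 s of noise as a stand-in signal

# c = Wx: the complex spectrogram (frequency bins x frames).
_, _, c = stft(x, fs=fs, nperseg=1024, noverlap=512)
a = np.abs(c)                                # magnitude spectrogram: the only observation

# Redundancy: more time-frequency points than signal samples.
print(c.size, ">", x.size)

# Consistency: the spectrogram of a real signal is (numerically) a fixed
# point of the projection c -> STFT(ISTFT(c)), i.e. c -> W W^+ c.
_, x_rec = istft(c, fs=fs, nperseg=1024, noverlap=512)
_, _, c_proj = stft(x_rec[:len(x)], fs=fs, nperseg=1024, noverlap=512)
print(np.allclose(c, c_proj))                # True up to numerical error
```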

3 Griffin-Lim Method

One widely used way of solving the phase reconstruction problem is the Griffin-Lim method [6]. In this section, we derive the iterative algorithm of the Griffin-Lim method following the derivation given in [8].

Whether or not a given $c$ satisfies the redundancy constraint, so that $c$ is a complex spectrogram associated with a time-domain signal, can be evaluated by examining whether or not the orthogonal projection $W W^{+} c$ of $c$ onto the subspace $\mathcal{W}$ matches $c$. Here, $W^{+}$ is a pseudo-inverse matrix of $W$ satisfying

$W^{+} = (W^{\mathsf H} W)^{-1} W^{\mathsf H}$   (2)

With an STFT, (2) corresponds to an inverse STFT. Thus, $W W^{+} c$ is the STFT of the inverse STFT of $c$. Now, by using $e^{j\phi}$ to denote a vector whose elements are the phase terms $e^{j\phi_{f,t}}$, the phase reconstruction problem for a given $a$ is formulated as an optimization problem of estimating $\phi$ that minimizes

$\mathcal{I}(\phi) = \lVert a \odot e^{j\phi} - W W^{+} (a \odot e^{j\phi}) \rVert_2^2$   (3)

where $\odot$ denotes an element-wise product. Now, from (2), $W W^{+} c$ is the point closest to $c$ in the subspace $\mathcal{W}$. Thus, we can rewrite (3) as

$\mathcal{I}(\phi) = \min_{x} \lVert a \odot e^{j\phi} - W x \rVert_2^2$   (4)

According to the principle of the majorization-minimization algorithm [9], it can be shown that $\mathcal{I}^{+}(\phi, \tilde{c}) = \lVert a \odot e^{j\phi} - W W^{+} \tilde{c} \rVert_2^2$ is a majorizer of $\mathcal{I}(\phi)$, where $\tilde{c}$ is an auxiliary variable, and a stationary point of $\mathcal{I}(\phi)$ can be found by iteratively performing the following updates:

$\tilde{c} \leftarrow W W^{+} (a \odot e^{j\phi})$   (5)
$e^{j\phi} \leftarrow \tilde{c} \oslash \lvert \tilde{c} \rvert$   (6)

Here, $\oslash$ denotes an operation that divides each element of a vector by its absolute value. With an STFT, Eq. (5) can be interpreted as the inverse STFT of $a \odot e^{j\phi}$ followed by the STFT, whereas Eq. (6) is a procedure for replacing the phase $\phi$ with the phase of $\tilde{c}$ updated via (5). This algorithm is procedurally equivalent to the Griffin-Lim method [6].
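For reference, a minimal NumPy/SciPy sketch of updates (5)-(6) is given below (an illustration, not the paper's code; it assumes `a` was computed by `scipy.signal.stft` with the same window settings, so that frame counts round-trip):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(a, n_iter=100, nperseg=1024, noverlap=512, seed=0):
    """Alternate Eq. (5) (projection onto consistent spectrograms via
    inverse STFT followed by STFT) and Eq. (6) (keep the phase, restore
    the target magnitude). `a` has shape (freq_bins, frames)."""
    rng = np.random.default_rng(seed)
    phase = np.exp(1j * rng.uniform(-np.pi, np.pi, size=a.shape))
    for _ in range(n_iter):
        _, x = istft(a * phase, nperseg=nperseg, noverlap=noverlap)   # Eq. (5), part 1
        _, _, c = stft(x, nperseg=nperseg, noverlap=noverlap)         # Eq. (5), part 2
        c = c[:, :a.shape[1]]                                         # align frame counts
        phase = c / np.maximum(np.abs(c), 1e-12)                      # Eq. (6)
    _, x = istft(a * phase, nperseg=nperseg, noverlap=noverlap)
    return x
```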

The Griffin-Lim method usually requires many iterations to obtain a high-quality audio signal. This makes it particularly difficult to apply to real-time systems. Furthermore, there are some cases where high-quality audio signals can never be obtained even though the algorithm is run for many iterations, for example when $a$ is an artificially created magnitude spectrogram. In the next section, we propose a learning-based approach to the phase reconstruction problem to overcome these shortcomings of the Griffin-Lim method.

4 GAN-Based Signal Reconstruction

4.1 Modeling the Phase Reconstruction Process

By using $c^{(0)}$ to denote the initial value of $c$, and defining $f(c) = W W^{+} c$ and $g(c) = a \odot (c \oslash \lvert c \rvert)$, the iterative algorithm of the Griffin-Lim method can be expressed as a multilayer composite function

$c^{(M)} = g(f(g(f(\cdots g(f(c^{(0)})) \cdots))))$   (7)

Here, $f$ is a linear projection whereas $g$ is a nonlinear operation applied to the output of $f$. Hence, (7) can be viewed as a deep neural network (DNN) where the weight parameters and the activation functions are fixed. From this point of view, finding an algorithm that converges more quickly to a better solution than the Griffin-Lim algorithm can be regarded as learning the weight parameters (and the activation functions) of the DNN. This idea is inspired by the deep unfolding framework [10], which uses a learning strategy to obtain an improved version of a deterministic iterative inference algorithm by unfolding the iterations and treating them as layers in a DNN. Fortunately, an unlimited amount of paired data of $a$ and $c$ can be collected very easily by computing the complex, magnitude, and phase spectrograms of time-domain signals. This is very advantageous for efficiently training our DNN.

In the following, we consider a DNN that uses $a$ and $\phi$ as inputs and generates $\hat{c}$ (or $\hat{x}$) as an output. We call this DNN a generator and express the relationship between the input and output as $\hat{c} = G(a, \phi)$.
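For concreteness, the following PyTorch sketch (an illustrative stand-in with arbitrary layer sizes, not the authors' Chainer implementation or the Fig. 1 architecture) shows a generator of this form, taking an initial complex spectrogram stacked as real/imaginary channels:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Illustrative stand-in for G: refines a 2-channel (real/imag)
    complex spectrogram into a new 2-channel complex spectrogram."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, hidden, kernel_size=3, padding=1), nn.PReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.PReLU(),
            nn.Conv2d(hidden, 2, kernel_size=3, padding=1),
        )

    def forward(self, c0):             # c0: (batch, 2, freq_bins, frames)
        return c0 + self.net(c0)       # residual correction of the input

c0 = torch.randn(1, 2, 513, 100)       # e.g., magnitude with random phase
c_hat = Generator()(c0)                # same shape as the input
```

Being fully convolutional, such a network accepts inputs of arbitrary length, and the residual connection reflects the unfolding view above: the network learns a correction to the initial estimate.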

4.2 Learning Criterion

For the generator training, one natural choice for the learning criterion would be a similarity metric (e.g., the $\ell_1$ norm) between the generator output and a target complex spectrogram (or signal). Manually defining a similarity metric amounts to assuming a specific form of the probability distribution of the target data (e.g., a Laplacian distribution for the $\ell_1$ norm). However, the true data distribution is unknown. If we use a similarity metric defined in the data space as the learning criterion, the generator will be trained in such a way that outputs that fit the target data on average are considered optimal. As a result, the generator will learn to generate only oversmoothed signals. This is undesirable, as the oversmoothing of reconstructed signals causes audio quality degradation. To avoid this, we propose using a similarity metric implicitly learned with a generative adversarial network (GAN) [7]. In addition to the generator network $G$, we introduce a discriminator network that learns to correctly discriminate between the complex spectrograms generated by the generator and the complex spectrograms of real audio signals. Given a target complex spectrogram $c$, the discriminator is expected to find a feature space in which $\hat{c}$ and $c$ are as separate as possible. Thus, we expect that minimizing the distance between $\hat{c}$ and $c$ measured in a hidden layer of the discriminator would make $\hat{c}$ indistinguishable from $c$ in the data space. By using $D$ to denote the discriminator network, we first consider the following criterion for the discriminator:

$\mathcal{L}_D = \mathbb{E}_{c \sim p_{\mathrm{real}}}\big[(D(c) - 1)^2\big] + \mathbb{E}_{\hat{c} \sim p_G}\big[D(\hat{c})^2\big]$   (8)

Here, the target label corresponding to real data is assumed to be 1 and that corresponding to data generated by the generator is 0. Thus, (8) means that $\mathcal{L}_D$ becomes 0 only if the discriminator correctly distinguishes the “fake” complex spectrograms generated by the generator from the “real” complex spectrograms of real audio signals. The goal of $D$ is therefore to minimize $\mathcal{L}_D$. As for the generator $G$, one of its goals is to deceive the discriminator so as to make the “fake” complex spectrograms as indistinguishable as possible from the “real” ones. This can be accomplished by minimizing the following criterion:

$\mathcal{L}_{\mathrm{adv}} = \mathbb{E}_{\hat{c} \sim p_G}\big[(D(\hat{c}) - 1)^2\big]$   (9)

Another goal for $G$ is to make $\hat{c}$ as close as possible to the target complex spectrogram $c$. By using $D^{(l)}$ to denote the output of the $l$-th layer of the discriminator $D$, we would also like to minimize

$\mathcal{L}_{\mathrm{fm}} = \sum_{l} \lambda_l \big\lVert D^{(l)}(c) - D^{(l)}(\hat{c}) \big\rVert$   (10)

where $\lambda_l$ is a fixed weight, which weighs the importance of the $l$-th layer feature space. Here, the $0$-th layer corresponds to the input layer, namely $D^{(0)}(c) = c$.

The learning objectives for $G$ and $D$ can thus be summarized as follows:

$\min_{G} \; \mathcal{L}_{\mathrm{adv}} + \lambda \mathcal{L}_{\mathrm{fm}}$   (11)
$\min_{D} \; \mathcal{L}_D$   (12)

where $\lambda$ is a fixed weight.

A general framework for training a generator network in such a way that it can deceive a real/fake discriminator network is called a generative adversarial network (GAN) [7]. The novelty of our proposed approach is that we have successfully adapted the GAN framework to the signal reconstruction problem by incorporating an additional term (10). The GAN framework using (8) and (9) as the learning criteria is called the least squares GAN (LSGAN) [11]. Note that GAN frameworks using other learning criteria such as [12] have also been proposed. Thus, we can also use the learning criteria employed in [7], [12] or others instead of (8) and (9).
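As a sketch of how (8)-(11) fit together (illustrative PyTorch code, not the authors' implementation: `D` is any discriminator, `feats` is an assumed helper returning its per-layer activations with layer 0 being the input itself, and the L1 distance in the feature-matching term is just one possible choice of norm):

```python
import torch

def d_loss(D, c_real, c_fake):
    # Eq. (8), LSGAN: push D(real) toward 1 and D(fake) toward 0.
    return ((D(c_real) - 1) ** 2).mean() + (D(c_fake.detach()) ** 2).mean()

def g_loss(D, feats, c_real, c_fake, lambdas, lam):
    # Eq. (9): the generator tries to make D output 1 on its fakes.
    adv = ((D(c_fake) - 1) ** 2).mean()
    # Eq. (10): feature matching over discriminator layers,
    # feats(c) = [c, h_1, ..., h_L] (layer 0 is the input).
    fm = sum(w * (fr - ff).abs().mean()
             for w, fr, ff in zip(lambdas, feats(c_real), feats(c_fake)))
    return adv + lam * fm              # Eq. (11)
```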

5 Experimental Evaluation

We tested our method and the Griffin-Lim method using real speech samples.

5.1 Experimental Settings

5.1.1 Dataset

We used clean speech signals excerpted from [13] as the experimental data. The speech data consisted of utterances of 30 speakers. The utterances of 28 speakers were used as the training set and the remaining utterances were used as the evaluation set. For the mini-batch training, we divided each training utterance into 1-second-long segments with an overlap of 0.5 seconds. All the speech data were downsampled to 16 kHz. Magnitude spectrograms were obtained with an STFT using a Blackman window that was 64 ms long with a 32 ms overlap.
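In code, these analysis settings correspond to the following sketch (SciPy is used for illustration; `x` stands in for one 1-second training segment):

```python
import numpy as np
from scipy.signal import stft

fs = 16000                        # speech downsampled to 16 kHz
x = np.random.randn(fs)           # placeholder for one 1-second training segment
nperseg = int(0.064 * fs)         # 64 ms Blackman window -> 1024 samples
noverlap = int(0.032 * fs)        # 32 ms overlap -> 512 samples
_, _, c = stft(x, fs=fs, window="blackman", nperseg=nperseg, noverlap=noverlap)
a = np.abs(c)                     # magnitude spectrogram (513 x frames)
```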

5.1.2 Network Architecture

Fig. 1 shows the network architectures we constructed for this experiment. The left half shows the architecture of the generator $G$ and the right half shows that of the discriminator $D$. The light blue blocks indicate convolutional layers, and the symbols $k$, $s$, and $c$ on each convolutional layer represent its hyper-parameters (kernel size, stride size, and number of channels). The yellow blocks indicate activation functions: PReLU [14] was used for the generator $G$ and Leaky ReLU [15] was used for the discriminator $D$. The violet blocks indicate element-wise sums, and the green block indicates the concatenation of features along the channel axis. The red blocks indicate fully-connected layers. Blocks without symbols have the same hyper-parameters as the preceding blocks. Note that we referred to [16] when constructing these architectures. The generator $G$ is fully convolutional [17], thus allowing an input to have an arbitrary length. The weight constants $\lambda_l$ in (10), set separately for the input layer and the hidden layers, and the weight $\lambda$ in (11) were fixed by hand. RMSprop [18] was used as the optimization algorithm, with the learning rate, mini-batch size, and number of epochs also fixed in advance.

Instead of directly feeding an input magnitude spectrogram and a randomly-generated phase spectrogram into the generator $G$, we used a complex spectrogram reconstructed with the Griffin-Lim method after 5 iterations as the input. Both the input and output of the generator $G$ have 2 channels, one corresponding to the real part and the other to the imaginary part of the complex spectrogram. For pre-processing, we normalized the complex spectrograms of the training data to zero mean and unit variance at each frequency. At test time, the scale of the generator output at each frequency was restored.
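A sketch of this per-frequency normalization (with placeholder array shapes; not the authors' code):

```python
import numpy as np

# Placeholder batch of training spectrograms: (examples, 2, freq_bins, frames),
# channel 0 = real part, channel 1 = imaginary part.
C = np.random.randn(100, 2, 513, 63)

# Zero mean and unit variance per frequency bin (and channel), pooling the
# statistics over examples and frames.
mean = C.mean(axis=(0, 3), keepdims=True)
std = C.std(axis=(0, 3), keepdims=True) + 1e-8
C_norm = (C - mean) / std                 # input to the generator

# At test time, restore the per-frequency scale of the generator output.
g_out = C_norm[0]                         # stand-in for a generator output
c_hat = g_out * std[0] + mean[0]
```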

We added a block that applies an inverse STFT to the generator output before feeding it into the discriminator $D$. We found this particularly important, as the training did not work well without this block.
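Such a block can be realized with a differentiable inverse STFT, for example with PyTorch's `torch.istft` (an illustrative stand-in for the authors' Chainer implementation; the Blackman window and sizes match the analysis settings above):

```python
import torch

def to_waveform(c_hat, n_fft=1024, hop=512):
    # c_hat: (batch, 2, freq_bins, frames) with real/imag channels.
    # torch.istft is differentiable, so gradients flow back into G.
    z = torch.complex(c_hat[:, 0], c_hat[:, 1])
    window = torch.blackman_window(n_fft)
    return torch.istft(z, n_fft=n_fft, hop_length=hop, window=window)

x_hat = to_waveform(torch.randn(1, 2, 513, 100))   # waveform fed to D
```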

Figure 1: Network architectures of the generator and discriminator. Light blue blocks indicate convolutional layers; in each convolutional layer, $k$, $s$, and $c$ represent the kernel size, stride size, and number of channels, respectively, and a kernel size written as a single number indicates a one-dimensional convolutional layer with that kernel size. Red blocks indicate fully-connected layers; in each fully-connected layer, the number represents the size of the output.

5.2 Data Augmentation

It is a well-known fact that the difference between two signals is hardly perceptible to human ears when their magnitude spectrograms and inter-frame phase differences are the same. This implies an arbitrariness in the initial phase: spectrograms that differ only in their initial phases are perceived as the same sound. By exploiting this property, we can augment the training data for $G$ and $D$ by preparing many different waveforms that are identical except for their initial phases. We expect this data augmentation to let the generator concentrate on learning to infer appropriate inter-frame phase differences from a magnitude spectrogram, thus facilitating efficient learning.
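The paper does not spell out the exact augmentation procedure, so the following is one plausible realization (an assumption on our part): rotate the phase of every sinusoidal component by a constant angle via the analytic signal, which leaves magnitudes and inter-frame phase differences essentially unchanged:

```python
import numpy as np
from scipy.signal import hilbert

def shift_initial_phase(x, theta):
    """Rotate the phase of every sinusoidal component of x by a constant
    angle theta. Magnitude spectrograms and inter-frame phase differences
    are (nearly) unchanged, so the result sounds the same as x."""
    return np.real(hilbert(x) * np.exp(1j * theta))

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                    # placeholder waveform
augmented = [shift_initial_phase(x, theta)
             for theta in rng.uniform(-np.pi, np.pi, size=8)]
```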

5.3 Dimensionality Reduction

Note that the real and imaginary parts of the Fourier transform of a real-valued signal become even and odd functions, respectively. Owing to this symmetric structure, it is sufficient to restore/infer spectral components within the frequency range from 0 up to the Nyquist frequency. We can therefore restrict the sizes of the input and output of the generator to this frequency range.
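A quick numerical check of this symmetry (illustrative, using the 1024-sample frame length from Section 5):

```python
import numpy as np

n = 1024                                  # frame length (64 ms at 16 kHz)
frame = np.random.randn(n)
spec_full = np.fft.fft(frame)             # n bins, conjugate-symmetric
spec_half = np.fft.rfft(frame)            # n // 2 + 1 bins: DC up to Nyquist

# Bins above the Nyquist frequency are mirrored conjugates of those below it:
print(np.allclose(spec_full[n // 2 + 1:], np.conj(spec_full[1:n // 2])[::-1]))
print(spec_half.shape)                    # (513,) instead of (1024,)
```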

5.4 Subjective Evaluation

We compared our proposed method with the Griffin-Lim method in terms of the perceptual quality of the reconstructed signals by conducting an AB test, where “A” and “B” were reconstructed signals obtained with the proposed and baseline methods, respectively. In this listening test, “A” and “B” were presented in random order to eliminate bias regarding the order of stimuli. Five listeners participated in our listening test. Each listener was presented with 10 {“A”, “B”} signal pairs and asked to select “A” or “B” for each pair. The Griffin-Lim method was run for 400 iterations. The signals were 2 to 5 seconds long.

The preference scores are shown in Fig. 2. As the result shows, the reconstructed signals obtained with the proposed method were preferred by the listeners for 76% of the 50 pairs.

Figure 2: Result of the AB test. The orange area indicates the rate of the A and B pairs for which the listeners preferred A (proposed). The black bar indicates the 95% confidence interval.

5.5 Generalization Ability

To confirm the generalization ability of the proposed method, we tested it on musical audio signals excerpted from [19]. Examples of the reconstructed signals are shown in Fig. 3. With these examples, we can observe a discontinuous point in the reconstructed signal obtained with the Griffin-Lim method. On the other hand, the proposed method appears to have worked successfully, even though the model was trained using speech data.

Figure 3: Waveforms of reconstructed music data [19]. The first row shows the acoustic signal reconstructed with the Griffin-Lim method, the second shows the proposed method, and the third is the target acoustic signal (real-world acoustic signal).

5.6 Comparison of Processing Times

We further compared the proposed method with the Griffin-Lim method in terms of the processing times needed to reconstruct time-domain signals. For comparison, we measured the processing times for various speech lengths, using speech data shorter than 6 seconds for the evaluation. Here, the network architecture of our proposed method was the same as that shown in Fig. 1, and the Griffin-Lim method was run for 400 iterations. The CPU used in this experiment was an Intel Core i7-6850K (3.60 GHz) and the GPU was an NVIDIA GeForce GTX 1080. We implemented the Griffin-Lim method using the fast Fourier transform function in NumPy [20] and implemented our model with Chainer [21]. Fig. 4 shows the results. As the speech data become longer, the processing time increases linearly. When executing the proposed method on the GPU, the time needed to reconstruct a signal was only about one-tenth of the length of that signal, whereas the Griffin-Lim method executed on the CPU took about the same time as the length of the signal. Therefore, given a GPU, the proposed method can be run in real time. However, on the CPU, the proposed method took about three times the length of the signal. If we want to execute the proposed method in real time on a CPU, we would need to construct a more compact architecture than that shown in Fig. 1. One simple way would be to replace the convolutional layers with downsampling and upsampling layers.

Figure 4: The change in processing time with respect to speech length. Blue points show the processing time of the proposed method on a GPU. Green points show that of the Griffin-Lim method on a CPU. Orange points show that of the proposed method on a CPU.

6 Conclusion

This paper proposed a GAN-based approach to signal reconstruction from magnitude spectrograms. The idea was to model the signal reconstruction process using a DNN and train it with a similarity metric implicitly learned by a GAN discriminator. Through subjective evaluations, we showed that the proposed method was able to reconstruct higher-quality time-domain signals than the Griffin-Lim method run for 400 iterations. Furthermore, we showed that the proposed method can be executed in real time when using a GPU. Future work will include the investigation of network architectures appropriate for CPU implementations.

References