I Introduction
Many audio and speech processing approaches represent the signal in a time-frequency domain, most commonly through the short-time discrete Fourier transform (STFT). After this transformation, the signal is represented by the magnitude and phase of its complex-valued coefficients. However, over the past three decades researchers have focused on modeling and processing the STFT magnitude, and the phase has been largely ignored [2].
However, as soon as reconstruction is desired, phase information becomes essential. When the magnitude is modified, it is often considered sufficient to reuse the original phase to recover the signal, although this may lead to undesired artifacts. Some researchers focus on applications in which the original phase is not available [3]. In this case, STFT phase retrieval algorithms construct a new valid phase from the modified magnitude, allowing the existing phase to be discarded entirely.
Research on phase enhancement has shown that enhancing the phase spectrogram of noisy speech improves perceptual quality [4]. Instead of enhancing the magnitude and phase responses of noisy speech separately, recent work enhances them jointly to further improve perceptual quality [5]. If a spectrogram is modified, the result may no longer correspond to the STFT of any time-domain signal; such a spectrogram is called inconsistent [2]. The majority of speech enhancement approaches either modify only the magnitude or estimate the complex spectrogram directly, and both will most likely produce an inconsistent spectrogram. It is worth noting that the consistent spectrograms, i.e., those obtained as the STFT of a time-domain signal, form only a small subset of all complex spectrograms. In this letter, we propose a joint real and imaginary reconstruction algorithm that operates on consistent spectrograms. In other words, given the complex spectrum of noisy speech, we recover a consistent spectrum of clean speech. Because the optimization space of our method is restricted to consistent spectrograms, the proposed speech enhancement algorithm can achieve a fast convergence rate and high accuracy.
This letter is organized as follows. Section II reviews masking-based speech enhancement methods and the inconsistent spectrogram problem. Section III proposes the Consistent Spectrogram Masking algorithm. Section IV describes the experimental setup used to evaluate the proposed model and reports the results. Finally, Section V presents conclusions.
II Masking Methods and the Inconsistent Spectrogram Problem
The common speech enhancement setup consists of STFT analysis, spectral modification, and a subsequent inverse STFT (ISTFT). The analyzed digital signal yields the complex-valued STFT coefficients , and this procedure can be compactly described as . Recently, phase processing has emerged as a further lever for speech enhancement, with notable work including Phase Sensitive Masking (PSM) [6] and Complex Ratio Masking (cRM) [7, 1]. Williamson et al. [1] showed that the real and imaginary spectrograms exhibit clear temporal and spectral structure, and therefore proposed the cRM, which is defined as follows:
(1) 
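For reference, the cRM of [1] can be written out in a few lines. The sketch below is not taken from this letter: the symbols `Y` (noisy STFT) and `S` (clean STFT), the function name, and the small `eps` guard are assumptions chosen for illustration. With these names, the standard definition is the complex ratio M = S / Y, whose real part is (Y_r S_r + Y_i S_i) / (Y_r^2 + Y_i^2) and whose imaginary part is (Y_r S_i - Y_i S_r) / (Y_r^2 + Y_i^2).

```python
import numpy as np

def complex_ratio_mask(Y, S, eps=1e-12):
    """Ideal complex ratio mask M such that M * Y recovers S (eps avoids division by zero)."""
    Yr, Yi, Sr, Si = Y.real, Y.imag, S.real, S.imag
    denom = Yr ** 2 + Yi ** 2 + eps
    mask_real = (Yr * Sr + Yi * Si) / denom
    mask_imag = (Yr * Si - Yi * Sr) / denom
    return mask_real + 1j * mask_imag

# Toy spectrograms (frames x bins); random values stand in for real STFT data.
rng = np.random.default_rng(0)
S = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))   # "clean"
N = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))   # "noise"
Y = S + N                                                            # "noisy"

M = complex_ratio_mask(Y, S)
S_rec = M * Y   # complex multiplication applies the mask to both real and imaginary parts
```

Applying the ideal mask reproduces the clean coefficients bin by bin, but nothing in this construction checks whether an estimated `M * Y` is still the STFT of some waveform.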
However, the methods mentioned above all ignore the inconsistent spectrogram problem. This problem, described by Gerkmann et al. [2], is a major challenge for speech enhancement. Because the STFT analysis uses overlapping analysis windows, any modification of an individual signal component (a sinusoid or an impulse) is spread over multiple frames and multiple STFT frequency bins.
Le Roux et al. [8] concisely derived the consistency constraints for STFT spectrograms. Let be a set of complex numbers, where will correspond to the frame index and to the frequency band index, and , are analysis and synthesis window functions satisfying the perfect reconstruction conditions for a frame shift . For any complex spectrogram , we can get the following equation.
can be divided into and . can be obtained from the STFT of a time signal . There is a one-to-one mapping between and and a many-to-one mapping between and . The resynthesized time signal ISTFT () has a consistent spectrogram after the STFT. As a consequence, the relation between and can be shown in the following equation.
(2) 
Because the mapping between and is many-to-one while the mapping between and is one-to-one, as illustrated in Fig. 2, the space of is much larger than the space of . Therefore, the clean spectrogram estimated by a speech enhancement system tends to fall into the space of inconsistent spectrograms. The commonly ignored inconsistent spectrogram problem not only introduces artifacts into resynthesized signals, because overlapping frames no longer agree, but also makes model convergence harder, because the search space expands to include inconsistent spectrograms.
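The many-to-one mapping described in this section can be checked numerically. The sketch below is an illustration only and rests on assumptions not stated in the letter: a Hamming window of length 64, a hop of 32, a least-squares overlap-add ISTFT, and a random real-valued mask standing in for a spectral modification. All function and variable names are hypothetical.

```python
import numpy as np

def stft(x, win, hop):
    """One-sided STFT: window each frame of x, then take a real FFT."""
    n = len(win)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[m * hop:m * hop + n] * win for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(H, win, hop):
    """Windowed overlap-add, divided by the summed squared window."""
    n = len(win)
    length = n + hop * (H.shape[0] - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    for m, frame in enumerate(np.fft.irfft(H, n=n, axis=1)):
        x[m * hop:m * hop + n] += frame * win
        norm[m * hop:m * hop + n] += win ** 2
    return x / norm   # the Hamming window is nonzero everywhere, so norm > 0

def project(H, win, hop):
    """Map any complex spectrogram to a consistent one: STFT(ISTFT(H))."""
    return stft(istft(H, win, hop), win, hop)

rng = np.random.default_rng(0)
win, hop = np.hamming(64), 32
x = rng.standard_normal(64 + hop * 9)             # 10 frames

H = stft(x, win, hop)                             # consistent by construction
H_masked = H * rng.uniform(0.0, 1.0, H.shape)     # modified, generally inconsistent
H_proj = project(H_masked, win, hop)

consistent_gap = np.linalg.norm(project(H, win, hop) - H)
masked_gap = np.linalg.norm(H_proj - H_masked)
same_waveform = np.allclose(istft(H_masked, win, hop), istft(H_proj, win, hop))
```

Under these assumptions, `consistent_gap` is at the level of rounding error, `masked_gap` is clearly nonzero, and the masked spectrogram and its projection resynthesize to the same waveform, which is the many-to-one behaviour described above.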
III Consistent Spectrogram Masking
III-A Masking with Consistency Constraints
Most model-based speech enhancement methods can be regarded as minimizing the following objective function:
(3) 
where is the estimated clean spectrogram, denotes the clean signal, i.e., the ground truth for the model, and is a tunable parameter that scales the distance.
Because is estimated from a nonlinear function of noisy speech (the nonlinear function can be a neural network, an HMM, etc.), these nonlinear operations may destroy the correspondence between neighbouring frames and cannot guarantee the consistency of . As a result, an objective function defined on the spectrogram incurs the aforementioned inconsistent spectrogram problem. Here we derive the difference between objective functions defined on consistent and inconsistent spectrograms. If we apply both the ISTFT and the STFT to the terms of Eq. 3, we obtain the following equations. Since the consistency of the that the model estimates cannot be guaranteed, can be deduced from Eq. 2 and is not equal to . Therefore, the following objective functions are not equal to the objective function in Eq. 3. It is worth noting that the last two equations in Eq. 4 show the equivalent forms of the objective function in the time domain and on the consistent spectrogram.
(4)  
Following the motivation given in Section II and the derivation of Eq. 4, we introduce an objective function termed , which is defined on the consistent spectrogram domain . We name our method Consistent Spectrogram Masking (CSM) because it iteratively minimizes this objective function and derives the mask on a consistent spectrogram. By contracting the search space to consistent spectrograms, the proposed method is intended to remove resynthesis artifacts and to speed up model training.
(5) 
Although and are different, and are the same in the time domain (as illustrated by Fig. 2 and Eq. 2). Thus, we obtain the useful form of the objective function in Eq. 5. Eq. 5 bears some similarity to the Griffin-Lim algorithm [9], because the optimization procedure requires many ISTFT and STFT computations. In the Griffin-Lim algorithm, however, the phase is derived solely from the magnitude of the spectrogram, whereas our method estimates both magnitude and phase, in the form of complex numbers, on the consistent spectrogram. Given the complex spectrogram of noisy speech, we therefore define Consistent Spectrogram Masking (CSM) as follows:
(6) 
where , represent the mask for the real and imaginary spectrogram at time and frequency .
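The difference between an objective defined directly on the estimated spectrogram and one defined on its consistent counterpart can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the letter's implementation: the squared-error distance, the Hamming window, the hop of 32, and the names `spectrogram_loss` and `consistent_loss` are all choices made here for illustration.

```python
import numpy as np

def stft(x, win, hop):
    """One-sided STFT of a real signal."""
    n = len(win)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[m * hop:m * hop + n] * win for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(H, win, hop):
    """Windowed overlap-add, divided by the summed squared window."""
    n = len(win)
    length = n + hop * (H.shape[0] - 1)
    x, norm = np.zeros(length), np.zeros(length)
    for m, frame in enumerate(np.fft.irfft(H, n=n, axis=1)):
        x[m * hop:m * hop + n] += frame * win
        norm[m * hop:m * hop + n] += win ** 2
    return x / norm

def spectrogram_loss(S_hat, S):
    """Mean squared error taken directly on the estimated spectrogram."""
    return np.mean(np.abs(S_hat - S) ** 2)

def consistent_loss(S_hat, S, win, hop):
    """Same distance, taken after resynthesis and re-analysis of the estimate."""
    return spectrogram_loss(stft(istft(S_hat, win, hop), win, hop), S)

rng = np.random.default_rng(1)
win, hop = np.hamming(64), 32
S = stft(rng.standard_normal(64 + hop * 9), win, hop)   # clean, consistent target

# Build an error that lives entirely outside the consistent set.
G = rng.standard_normal(S.shape) + 1j * rng.standard_normal(S.shape)
E = G - stft(istft(G, win, hop), win, hop)
S_hat = S + E                                           # inconsistent estimate

plain = spectrogram_loss(S_hat, S)
cons = consistent_loss(S_hat, S, win, hop)
```

Here `S_hat` resynthesizes to exactly the clean waveform, so `cons` is zero up to rounding, while `plain` still reports a sizeable error. The two objectives therefore rank estimates differently, and only the second one matches what is heard after the ISTFT.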
III-B The Framework of Our Proposed End-to-End Model
Following the methodology above and the principle of optimizing the model under a consistency constraint, we designed an end-to-end speech enhancement model that comprises a densely connected convolutional neural network (CNN) with integrated Quasi-Layers (QL). A high-level depiction of the proposed model is presented in Fig. 1. The CNN module adaptively modifies the spectrogram of the input signal, while the QL is a module that supports backpropagation and is designed to simulate the STFT and its inverse, thereby making it possible to accumulate the loss directly on the consistent spectrogram.
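One way to see how an STFT can be simulated by convolution layers is to build the real and imaginary DFT kernels explicitly and apply them as strided 1-D filters. The sketch below uses NumPy rather than a deep learning framework, and its window (Hamming, length 64), hop (32), and function names are assumptions made for illustration; the letter's actual layer configuration may differ.

```python
import numpy as np

def dft_kernels(win):
    """Windowed cosine and negative-sine kernels, one row per frequency bin."""
    n = len(win)
    k = np.arange(n // 2 + 1)[:, None]      # frequency bin index
    t = np.arange(n)[None, :]               # sample index inside a frame
    angle = 2.0 * np.pi * k * t / n
    return win * np.cos(angle), -win * np.sin(angle)

def conv1d_strided(x, kernels, stride):
    """Valid cross-correlation of a 1-D signal with a bank of kernels,
    without kernel flipping, as in a CNN convolution layer."""
    n = kernels.shape[1]
    n_frames = 1 + (len(x) - n) // stride
    idx = stride * np.arange(n_frames)[:, None] + np.arange(n)[None, :]
    return x[idx] @ kernels.T               # shape: (frames, bins)

rng = np.random.default_rng(0)
win, hop = np.hamming(64), 32
x = rng.standard_normal(64 + hop * 9)

k_real, k_imag = dft_kernels(win)
real = conv1d_strided(x, k_real, hop)       # real spectrogram channel
imag = conv1d_strided(x, k_imag, hop)       # imaginary spectrogram channel

# Reference: frame, window, and FFT in the usual way.
frames = np.stack([x[m * hop:m * hop + 64] * win for m in range(10)])
reference = np.fft.rfft(frames, axis=1)
```

Because the two kernel banks are ordinary filter weights, a framework layer initialized this way starts out as an exact STFT and can then be updated by gradient descent along with the rest of the network.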
CNN-based acoustic models have been used in speech enhancement and source separation tasks and have been shown to improve performance [10]. The connection structure and weight sharing of a CNN make it capable of learning feature representations by applying convolutional filters to the spectrogram of the audio. However, there is an intrinsic trade-off between kernel size and feature resolution: a larger kernel can exploit more contextual information along the time dimension, or learn patterns over a wider frequency band, but yields lower-resolution features. In this work, we use a densely connected fully convolutional network (FCN) [11], which can learn multi-scale features efficiently, to address this trade-off. In a standard feed-forward network, the output of the $\ell$-th layer is computed as $x_\ell = H_\ell(x_{\ell-1})$, where the network input is denoted as $x_0$ and $H_\ell(\cdot)$ is a nonlinear transformation, which can be a composite function of operations such as nonlinear activation, pooling, or convolution [11].
The idea of DenseNet is to use the concatenation of the feature maps produced in preceding layers as the input to succeeding layers:
$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])$ (7)
where $[x_0, x_1, \ldots, x_{\ell-1}]$ refers to the concatenation of the feature maps produced in layers $0, \ldots, \ell-1$ [11]. Such dense connectivity enables every layer not only to receive the gradient directly but also to reuse features computed in preceding layers. This avoids recomputing similar features in different layers and allows the network to learn features at different levels within the same layer [11]. The experimental results show that our DenseNet-based approach yields a considerable improvement over the DNN-based models.
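The dense connectivity pattern of [11] is easy to state in code. The following is a minimal NumPy sketch, not the network used in this letter: each transformation here is a hypothetical 1x1 channel-mixing layer followed by a ReLU, and the channel counts (2 input channels, growth rate 4, 3 layers) are arbitrary illustrative values.

```python
import numpy as np

def make_layer(in_channels, growth_rate, rng):
    """Return a toy transformation H: 1x1 channel mixing followed by ReLU."""
    W = rng.standard_normal((growth_rate, in_channels)) / np.sqrt(in_channels)
    def H(x):                                   # x: (in_channels, time, freq)
        return np.maximum(np.einsum("oc,ctf->otf", W, x), 0.0)
    return H

def dense_block(x0, layers):
    """Each layer receives the concatenation of all earlier feature maps."""
    features = [x0]
    for H in layers:
        features.append(H(np.concatenate(features, axis=0)))
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
c0, growth, n_layers = 2, 4, 3
# Layer l sees c0 + l * growth input channels because of the concatenation.
layers = [make_layer(c0 + l * growth, growth, rng) for l in range(n_layers)]

x0 = rng.standard_normal((c0, 16, 8))           # e.g. two channels x time x frequency
out = dense_block(x0, layers)
```

The output keeps the input channels untouched and appends `growth` new channels per layer, so features from every depth remain directly available to later layers.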
The FCN is the backbone of our model, and the pre-processing and post-processing modules, the Quasi-Layers, are also vital parts of the whole system. The Quasi-STFT layer uses two 1-dimensional convolutions, initialized with the real and imaginary parts of the discrete Fourier transform kernels respectively, following the definition of the STFT:
(8) 
for , and the Quasi-ISTFT layer is similar. These modules are built on ordinary convolutional layers, so they are easy to integrate into a neural-network-based model. The Quasi-Layers bring two benefits. First, the Quasi-ISTFT makes it possible to define the objective function on a consistent spectrogram, as in Eq. 5. Second, integrating the STFT and ISTFT into the end-to-end model makes the Fourier transform kernels and the window function learnable through backpropagation.
IV Experiment
IV-A Experimental Setup
We conducted our experiments on the Centre for Speech Technology Research Voice Cloning Toolkit (VCTK) [12] and the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) [13]. The training data are taken from VCTK, which includes 400 x 109 sentences uttered by 109 native speakers of English with various accents, and the model is evaluated on TIMIT. Training and testing on different datasets supports the reliability of the results. Four broadband noises are used: speech babble (Babble), cafeteria (Cafe), factory floor noise (Factory), and transportation noise (Road). The training set is composed by combining ten random segments from the first half of each noise with each training sample at SNR levels of -6, -3, 0, 3 and 6 dB. The test set is generated by mixing 60 clean utterances with the last half of the above noises at different SNRs. Dividing the noises into two halves ensures that the testing noise segments are unseen during training.
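Mixing a clean utterance with a noise segment at a prescribed SNR amounts to scaling the noise by a single gain. The sketch below shows one common way to do this; the function names, the use of white noise as stand-in data, and the SNR levels (assumed here to be -6, -3, 0, 3 and 6 dB) are illustrative assumptions rather than details of the letter's data pipeline.

```python
import numpy as np

def mix_at_snr(clean, noise, target_snr_db):
    """Scale `noise` so that clean + scaled noise has the requested SNR in dB."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (target_snr_db / 10.0)))
    return clean + scale * noise

def measure_snr_db(clean, signal):
    """SNR of `signal` relative to the clean reference, in dB."""
    residual = signal - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for a clean utterance
noise = rng.standard_normal(16000)   # stand-in for a noise segment of equal length

measured = {
    snr: measure_snr_db(clean, mix_at_snr(clean, noise, snr))
    for snr in (-6, -3, 0, 3, 6)
}
```

The same `measure_snr_db` function can be applied to an enhanced waveform to obtain the kind of output SNR reported alongside PESQ.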
The proposed model, termed QL-FCN-CSM, is shown in Fig. 1. Ahead of the FCN, the raw audio input of 66048 samples is transformed into a 512 x 16 x 2 matrix by the Quasi-STFT layer, whose window length and hop length are set to 1024 and 512 respectively. Mean and variance normalization is applied to the input vector to stabilize training and testing. The perceptual evaluation of speech quality (PESQ) [14] and the signal-to-noise ratio (SNR) are used to evaluate the quality and intelligibility of the different signals.
IV-B Experimental Results
IV-B1 Comparison Between Different Objective Functions
We conducted experiments with models based on different objective functions. The model trained to minimize the error between the complex spectrogram of clean speech and the one estimated from its noisy version is denoted QL-FCN-cRM (the same as QL-FCN-CSM, but with CSM replaced by cRM), and the model that estimates the magnitude only is denoted QL-FCN-IRM (again the same as QL-FCN-CSM, but with CSM replaced by IRM).
Table 1 shows a performance gap between QL-FCN-CSM and QL-FCN-cRM, and between QL-FCN-CSM and QL-FCN-IRM, which supports the effectiveness of CSM, i.e., of optimizing the model with an objective function defined on the consistent spectrogram and synthesizing waveforms directly. The average PESQ scores and SNRs of QL-FCN-CSM and QL-FCN-cRM are better than those of the other models, which supports the effectiveness of the proposed end-to-end model. Our best results in the 0 dB condition are even more encouraging: the PESQ score is 0.38 higher than that of DNN-cRM, a state-of-the-art DNN approach.
It is noteworthy that QL-FCN-CSM converges faster than the other models while also achieving better performance. This reinforces our view that constraining the estimated spectrogram to the set of consistent spectrograms leads to faster convergence, as shown in Fig. 4.
IV-B2 Comparison Between Different Network Architectures
To compare our FCN-based model with DNN-based ones, we also evaluate DNN-cRM [1] and DNN-IRM [15]. These models use a deep neural network instead of the FCN, and the QL is not used because they involve no convolution.
From Table 1, we can observe that QL-FCN-CSM and QL-FCN-cRM outperform DNN-cRM and DNN-IRM in all conditions, which supports our choice of network architecture. However, the results of QL-FCN-CSM are comparable to those of QL-FCN-cRM in the -6 dB and 6 dB conditions. This is because the artifacts caused by the loss of phase information are negligible at very high or very low SNRs [16].
Table 1: PESQ and SNR (dB) for each noise type and input SNR (dB).
Noise    Row  PESQ                               SNR
              -6     -3     0      3      6      -6     -3     0      3      6
Babble   a    1.179  1.301  1.489  1.672  1.998  -6.00  -3.00  0.00   3.00   6.00
         b    1.951  2.112  2.682  2.855  3.106  5.93   8.47   11.32  13.82  16.43
         c    1.953  2.068  2.543  2.833  2.966  5.89   8.13   10.76  13.91  16.14
         d    1.967  2.077  2.515  2.710  2.976  5.92   8.07   10.83  13.16  15.66
         e    1.914  1.836  2.299  2.517  2.843  4.67   6.87   8.38   10.98  14.73
         f    1.809  1.787  2.113  2.442  2.798  4.09   6.53   8.05   10.12  13.09
Cafe     a    1.413  1.676  1.894  2.123  2.342  -6.00  -3.00  0.00   3.00   6.00
         b    2.365  2.517  2.720  2.878  3.021  6.34   8.59   11.42  14.0   16.47
         c    2.363  2.501  2.686  2.880  3.004  6.30   8.37   10.96  14.03  16.18
         d    2.362  2.496  2.690  2.836  2.975  6.29   8.28   11.01  13.26  15.7
         e    2.272  2.426  2.516  2.698  2.937  5.01   7.23   8.58   11.12  15.03
         f    2.240  2.401  2.493  2.647  2.833  4.59   6.85   8.24   10.44  13.22
Factory  a    0.987  1.119  1.265  1.468  1.695  -6.00  -3.00  0.00   3.00   6.00
         b    1.783  1.911  2.121  2.304  2.460  7.16   8.82   11.58  14.19  16.53
         c    1.778  1.89   2.106  2.302  2.441  7.10   8.55   11.37  14.16  16.25
         d    1.78   1.893  2.101  2.246  2.408  7.12   8.59   11.30  13.36  15.75
         e    1.687  1.813  1.908  2.113  2.381  5.89   7.55   8.78   11.47  15.33
         f    1.625  1.765  1.874  2.046  2.240  5.09   6.93   8.34   10.55  13.27
Road     a    2.182  2.363  2.547  2.721  2.903  -6.00  -3.00  0.00   3.00   6.00
         b    2.995  3.095  3.265  3.405  3.529  7.46   9.03   11.74  14.28  16.63
         c    2.982  3.084  3.253  3.403  3.530  7.26   8.88   11.53  14.25  16.65
         d    2.98   3.078  3.249  3.356  3.493  7.22   8.79   11.45  13.43  15.89
         e    2.905  3.007  3.084  3.253  3.467  6.03   7.64   8.87   11.53  15.39
         f    2.853  2.966  3.059  3.185  3.352  5.19   7.01   8.47   10.42  13.37
V Conclusions
Our work builds on two observations from prior work: (a) phase processing is essential to speech enhancement, and (b) masking on the spectrogram can violate the consistency constraints. In this letter, we show that the inconsistent spectrogram problem slows model convergence and causes unintended artifacts. To estimate the clean spectrogram (both magnitude and phase) from the STFT of noisy speech under the consistency constraint, we design CSM on the complex spectrogram and derive the loss function on the consistent spectrogram, which addresses the inconsistent spectrogram problem and phase processing jointly.
On the technical side, we implement new Quasi-Layers that emulate the STFT with convolution layers in the neural network, which makes it possible to optimize the model with an objective function defined on the consistent spectrogram. DenseNet is selected as the basis of our model framework rather than a vanilla CNN or DNN, for its superior ability to extract features at various scales in a spectrogram. The experimental results show both the expected acceleration of convergence and an improvement in quality.
References
[1] D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
[2] T. Gerkmann, M. Krawczyk-Becker, and J. L. Roux, “Phase processing for single-channel speech enhancement: History and recent advances,” IEEE Signal Process. Mag., vol. 32, no. 2, pp. 55–66, 2015. [Online]. Available: https://doi.org/10.1109/MSP.2014.2369251
[3] Z. Prusa and P. Rajmic, “Toward high-quality real-time signal reconstruction from STFT magnitude,” IEEE Signal Process. Lett., vol. 24, no. 6, pp. 892–896, 2017. [Online]. Available: https://doi.org/10.1109/LSP.2017.2696970
[4] K. K. Paliwal, K. K. Wójcicki, and B. J. Shannon, “The importance of phase in speech enhancement,” Speech Communication, vol. 53, no. 4, pp. 465–494, 2011.
[5] D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio masking for joint enhancement of magnitude and phase,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016. IEEE, 2016, pp. 5220–5224. [Online]. Available: https://doi.org/10.1109/ICASSP.2016.7472673
[6] H. Erdogan, J. R. Hershey, S. Watanabe, and J. L. Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, 2015, pp. 708–712. [Online]. Available: https://doi.org/10.1109/ICASSP.2015.7178061
[7] D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio masking for joint enhancement of magnitude and phase,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016. IEEE, 2016, pp. 5220–5224.
[8] J. Le Roux, “Phase-controlled sound transfer based on maximally-inconsistent spectrograms,” in Proceedings of the Acoustical Society of Japan Spring Meeting, no. 1-Q-51, Mar. 2011.
[9] S. Nawab, T. Quatieri, and J. Lim, “Signal reconstruction from short-time Fourier transform magnitude,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 31, no. 4, pp. 986–998, 1983.
[10] S. Fu, Y. Tsao, X. Lu, and H. Kawai, “End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks,” CoRR, vol. abs/1709.03658, 2017.
[11] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[12] C. Veaux, J. Yamagishi, K. MacDonald et al., “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2017.
[13] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon technical report n, vol. 93, 1993.
[14] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2001, 7-11 May, 2001, Salt Palace Convention Center, Salt Lake City, Utah, USA, Proceedings. IEEE, 2001, pp. 749–752.
[15] M. Tu and X. Zhang, “Speech enhancement based on deep neural networks with skip connections,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017. IEEE, 2017, pp. 5565–5569.
[16] P. C. Loizou, Speech enhancement: theory and practice. CRC Press, 2013.