End-to-End Model for Speech Enhancement by Consistent Spectrogram Masking

by   Xingjian Du, et al.
Shanghai University

Recently, phase processing has been attracting increasing interest in the speech enhancement community. Some researchers integrate a phase estimation module into speech enhancement models by using complex-valued short-time Fourier transform (STFT) spectrogram based training targets, e.g. the Complex Ratio Mask (cRM) [1]. However, masking a spectrogram would violate its consistency constraints. In this work, we prove that the inconsistency problem enlarges the solution space of the speech enhancement model and causes unintended artifacts. Consistent Spectrogram Masking (CSM) is proposed to estimate the complex spectrogram of a signal with the consistency constraint in a simple but not trivial way. Experiments comparing our CSM based end-to-end model with other methods confirm that CSM accelerates model training and yields significant improvements in speech quality. From our experimental results, we assured that our method could enha



I Introduction

Many audio and speech processing approaches represent the signal through a time-frequency transformation, most commonly the short-time discrete Fourier transform (STFT). After this transformation, the signal is represented by its magnitude and phase in complex-valued form. However, the phase has been largely ignored over the past three decades, while researchers focused on modeling and processing the STFT magnitude [2].

However, as soon as reconstruction is desired, phase information becomes essential. When the magnitude is modified, the original phase is often simply reused to recover the signal, which may lead to undesired artifacts. Some researchers focus on applications in which the original phase is not available [3]. In this case, STFT phase retrieval algorithms construct a new valid phase from the modified magnitude, allowing the existing phase to be discarded completely.

Research on phase enhancement has shown that enhancing the phase spectrogram of noisy speech leads to perceptual quality improvements [4]. Instead of separately enhancing the magnitude and phase responses of noisy speech, recent work focuses on jointly enhancing both to further improve perceptual quality [5]. If a spectrogram is modified, the modified spectrogram may no longer correspond to the STFT of any time-domain signal; this is the so-called inconsistent spectrogram [2]. The majority of speech enhancement approaches either modify only the magnitude or estimate the complex spectrogram, both of which will most likely lead to an inconsistent spectrogram. It is worth mentioning that the consistent spectrograms, i.e. those obtained from the STFT of a time-domain signal, form only a small subset of all complex spectrograms. In this letter, we propose a joint real and imaginary reconstruction algorithm on the consistent spectrogram. In other words, given the complex spectrum of noisy speech, we recover the consistent spectrum of clean speech. Because the optimization space of our method is restricted to consistent spectrograms, the proposed speech enhancement algorithm can achieve a fast convergence rate and high accuracy.

This paper is organized as follows. Section II reviews masking based speech enhancement methods and the inconsistent spectrogram problem. Section III proposes the Consistent Spectrogram Masking algorithm. Section IV describes the experimental setup used to evaluate the performance of the proposed model. Finally, Section V presents conclusions.

Fig. 1: The framework of our proposed end-to-end model for speech enhancement

II Masking Methods and the Inconsistent Spectrogram Problem

The common speech enhancement setup consists of STFT analysis, spectral modification, and a subsequent inverse STFT (ISTFT). Analyzing the digital signal $x(n)$ yields the complex-valued STFT coefficients $X_{t,f}$; this procedure can be compactly described as $X = \mathrm{STFT}(x)$. Recently, phase processing has emerged as a further lever for speech enhancement tasks, including notable work such as Phase Sensitive Masking (PSM) [6] and Complex Ratio Masking (cRM) [7, 1]. Williamson et al. illustrated that the real and imaginary spectrograms exhibit clear temporal and spectral structure, so they proposed the cRM, which is defined as follows:

$$M_{t,f} = \frac{Y_r S_r + Y_i S_i}{Y_r^2 + Y_i^2} + j\,\frac{Y_r S_i - Y_i S_r}{Y_r^2 + Y_i^2}$$

where $Y = Y_r + jY_i$ and $S = S_r + jS_i$ are the STFT coefficients of the noisy and the clean speech at time $t$ and frequency $f$.
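As a minimal sketch of the complex ratio mask for a single time-frequency bin, the snippet below writes the mask out in its real and imaginary parts and checks that applying it to the noisy coefficient returns the clean one. Here $Y$ denotes the noisy STFT coefficient and $S$ the clean one; the numeric values and the helper name `complex_ratio_mask` are assumptions chosen for illustration, not taken from the paper.

```python
# Toy values for one time-frequency bin (assumed for illustration only).
y = complex(0.8, -0.3)  # noisy STFT coefficient Y = Y_r + jY_i
s = complex(0.5, 0.2)   # clean STFT coefficient S = S_r + jS_i


def complex_ratio_mask(y: complex, s: complex) -> complex:
    """Return the mask M with M * y == s, expanded into real and imaginary parts."""
    denom = y.real ** 2 + y.imag ** 2
    m_r = (y.real * s.real + y.imag * s.imag) / denom
    m_i = (y.real * s.imag - y.imag * s.real) / denom
    return complex(m_r, m_i)


m = complex_ratio_mask(y, s)
s_hat = m * y  # applying the mask is a complex multiplication per bin
```

The expanded form is simply the complex division $S / Y$; keeping the two parts separate mirrors how a network would predict a real and an imaginary mask as two output channels.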


However, the methods mentioned above all ignore the inconsistent spectrogram problem. This problem, illustrated by Gerkmann et al. [2], is a great challenge for speech enhancement. Because the STFT analysis uses overlapping analysis windows, any modification of individual signal components (sinusoids, impulses) will be spread over multiple frames and multiple STFT frequency locations.

Le Roux et al. [8] derived the consistency constraints for STFT spectrograms concisely. Let $(X_{m,n})$ be a set of complex numbers, where $m$ corresponds to the frame index and $n$ to the frequency band index, and let $w_a$ and $w_s$ be analysis and synthesis window functions verifying the perfect reconstruction conditions for a frame shift $R$. For any complex spectrogram $X$, we can get the following equation.

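The set of complex spectrograms $X$ can be divided into consistent spectrograms $X_{\mathrm{con}}$ and inconsistent spectrograms $X_{\mathrm{inc}}$. $X_{\mathrm{con}}$ can be obtained from the STFT of a time signal $x$. There is a one-to-one mapping between $x$ and $X_{\mathrm{con}}$ and a many-to-one mapping between $X_{\mathrm{inc}}$ and $x$. The resynthesized time signal $\mathrm{ISTFT}(X_{\mathrm{inc}})$ has a consistent spectrogram after the STFT. As a consequence, the relation between $X_{\mathrm{con}}$ and $X_{\mathrm{inc}}$ can be shown in the following equation.

$$X_{\mathrm{con}} = \mathrm{STFT}\big(\mathrm{ISTFT}(X_{\mathrm{inc}})\big) \tag{2}$$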


Because of the many-to-one mapping between the inconsistent spectrograms $X_{\mathrm{inc}}$ and the time signal $x$, and the one-to-one mapping between the consistent spectrograms $X_{\mathrm{con}}$ and $x$, as illustrated in Fig. 2, the space of $X_{\mathrm{inc}}$ is much larger than the space of $X_{\mathrm{con}}$. Therefore, the estimated clean spectrogram in a speech enhancement system tends to fall into the space of inconsistent spectrograms. The commonly ignored inconsistent spectrogram problem not only introduces artifacts into resynthesized signals because of the inconsistency of overlapping frames, but also makes model convergence more difficult because of the enlarged spectrogram space.

Fig. 2: An illustration of the notion of consistency. The STFT is an injective function that maps distinct valid signals to their corresponding consistent spectrograms $X_{\mathrm{con}}$, i.e. there is a perfect one-to-one correspondence between the set of time signals and the set of consistent spectrograms. However, the STFT is not guaranteed to be invertible for inconsistent spectrograms $X_{\mathrm{inc}}$. There is a many-to-one mapping between $X_{\mathrm{inc}}$ and the time signal $x$, as indicated by the red arrows.
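The notion of consistency described above can be checked numerically. The following sketch, kept to the Python standard library, builds a tiny STFT/ISTFT pair, masks a consistent spectrogram, and measures how far the masked result is from its consistent counterpart. The frame size, the sine window, the zero padding, and all function names are assumptions made for this illustration; they are not the configuration used in the paper.

```python
import cmath
import math
import random

N, HOP = 8, 4  # assumed toy frame length and frame shift (50% overlap)
# Sine window: w[n]^2 + w[n + N/2]^2 == 1, so overlap-add reconstructs perfectly.
WINDOW = [math.sin(math.pi * (n + 0.5) / N) for n in range(N)]


def dft(frame, sign):
    """Naive DFT (sign=-1) or unnormalized inverse DFT (sign=+1)."""
    size = len(frame)
    return [sum(v * cmath.exp(sign * 2j * math.pi * k * n / size)
                for n, v in enumerate(frame)) for k in range(size)]


def stft(x):
    """Windowed DFT of overlapping frames; pads HOP zeros on both sides."""
    padded = [0.0] * HOP + list(x) + [0.0] * HOP
    return [dft([padded[start + n] * WINDOW[n] for n in range(N)], -1)
            for start in range(0, len(padded) - N + 1, HOP)]


def istft(spec):
    """Weighted overlap-add; keeps the real part since audio is real-valued."""
    out = [0.0] * ((len(spec) - 1) * HOP + N)
    for t, frame in enumerate(spec):
        time_frame = dft(frame, +1)
        for n in range(N):
            out[t * HOP + n] += (time_frame[n] / N).real * WINDOW[n]
    return out[HOP:-HOP]  # drop the padding added by stft


def spec_distance(a, b):
    return math.sqrt(sum(abs(u - v) ** 2
                         for ra, rb in zip(a, b) for u, v in zip(ra, rb)))


def max_abs_diff(a, b):
    return max(abs(u - v) for u, v in zip(a, b))


random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(32)]
X = stft(x)  # consistent by construction
mask = [[random.uniform(0.0, 1.0) for _ in range(N)] for _ in X]
X_hat = [[g * v for g, v in zip(gains, frame)] for gains, frame in zip(mask, X)]
X_proj = stft(istft(X_hat))  # consistent counterpart of the masked spectrogram

round_trip_error = max_abs_diff(istft(X), x)                      # close to zero
inconsistency = spec_distance(X_hat, X_proj)                      # clearly non-zero
same_signal_error = max_abs_diff(istft(X_hat), istft(X_proj))     # close to zero
```

In this toy setting the masked spectrogram differs from its consistent counterpart, yet both resynthesize to the same waveform, which is the many-to-one behaviour indicated in Fig. 2.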

III Consistent Spectrogram Masking

III-A Masking with Consistency Constraints

Most model-based speech enhancement methods can be regarded as minimizing the following objective function:


where $\hat{S}$ is the estimated clean spectrogram, $S$ denotes the clean signal, i.e. the ground truth for the model, and a tunable parameter scales the distance.

Because the estimate $\hat{S}$ is obtained from a non-linear function of the noisy speech (the non-linear function can be a neural network, an HMM, etc.), these non-linear operations may destroy the correspondence between neighbouring frames and cannot guarantee the consistency of $\hat{S}$. As a result, an objective function defined on the spectrogram incurs the aforementioned inconsistent spectrogram problem. Here we derive the difference between objective functions defined on consistent and inconsistent spectrograms.

If we apply both the ISTFT and the STFT to the terms of Eq. 3, we obtain the following equations. Since the consistency of the spectrogram $\hat{S}$ that the model estimates cannot be guaranteed, $\mathrm{STFT}(\mathrm{ISTFT}(\hat{S}))$ can be deduced from Eq. 2 and is not equal to $\hat{S}$. Therefore, the following objective functions are not equal to the objective function in Eq. 3. It is worth noting that the last two equations in Eq. 4 show the equivalent form of the objective functions in both the time domain and the consistent spectrogram domain.


Following the motivations noted in Section II and the derivation of Eq. 4, we naturally consider introducing an objective function defined on the consistent spectrogram domain. We name our method Consistent Spectrogram Masking (CSM) because it iteratively minimizes the objective function and derives the mask on a consistent spectrogram. Our proposed method can dispel the artifacts of the resynthesized signal and speed up model training, owing to the contraction of the solution space to consistent spectrograms.


Although an inconsistent spectrogram $X_{\mathrm{inc}}$ and its consistent counterpart $X_{\mathrm{con}}$ are different, $\mathrm{ISTFT}(X_{\mathrm{inc}})$ and $\mathrm{ISTFT}(X_{\mathrm{con}})$ are the same in the time domain (illustrated by Fig. 2 and Eq. 2). Thus, we have the useful form of the objective function in Eq. 5. Incidentally, there are some similarities between Eq. 5 and the Griffin-Lim algorithm [9], because many ISTFT and STFT calculations are needed in the optimization procedure. In the Griffin-Lim algorithm, phase information is derived solely from the magnitude of the spectrogram. In contrast, our method estimates both magnitude and phase information in the form of complex numbers on the consistent spectrogram. Thus, given the complex spectrogram of noisy speech, we define Consistent Spectrogram Masking (CSM) as follows:


where $M_r(t,f)$ and $M_i(t,f)$ represent the masks for the real and imaginary spectrograms at time $t$ and frequency $f$.
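The statement that a spectrogram and its consistent counterpart resynthesize to the same waveform can be spelled out in a few lines. The derivation below is a sketch that assumes only the perfect reconstruction condition $\mathrm{ISTFT}\circ\mathrm{STFT} = \mathrm{Id}$ on time signals; the operator name $P$ is introduced here for illustration and does not appear in the original derivation.

```latex
\begin{aligned}
P &:= \mathrm{STFT}\circ\mathrm{ISTFT}, \qquad \mathrm{ISTFT}\circ\mathrm{STFT} = \mathrm{Id}, \\
P\big(P(X)\big) &= \mathrm{STFT}\Big(\big(\mathrm{ISTFT}\circ\mathrm{STFT}\big)\big(\mathrm{ISTFT}(X)\big)\Big)
                 = \mathrm{STFT}\big(\mathrm{ISTFT}(X)\big) = P(X), \\
\mathrm{ISTFT}\big(P(X)\big) &= \big(\mathrm{ISTFT}\circ\mathrm{STFT}\big)\big(\mathrm{ISTFT}(X)\big)
                 = \mathrm{ISTFT}(X).
\end{aligned}
```

Under this assumption $P$ is idempotent, so $P(X)$ is always consistent, and a loss measured on $P(\hat{S})$ depends on $\hat{S}$ only through the waveform it resynthesizes to.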

III-B The Framework of Our Proposed End-to-End Model

Following the aforementioned principle of optimizing the model under the consistency constraint, we designed an end-to-end speech enhancement model that comprises a densely connected convolutional neural network (CNN) and integrated Quasi-Layers (QL). A high-level depiction of our proposed model is presented in Fig. 1. Specifically, the CNN module adaptively modifies the spectrogram of the input signal, and the QL is a module that supports backpropagation and is designed to simulate the STFT and its inverse, thereby making it possible to accumulate the loss directly on the consistent spectrogram.

CNN based acoustic models have been used in speech enhancement and source separation tasks and have been shown to improve performance [10]. The unique connection structure and weight sharing make a CNN capable of learning feature representations by applying convolutional filters to the spectrogram of the audio. However, there is an intrinsic trade-off between kernel size and feature resolution. In other words, a larger kernel can exploit more contextual information in the time dimension or learn patterns over a wider band, but yields lower-resolution features. In this work, we utilize a densely connected fully convolutional network (FCN) [11], which can learn multi-scale features efficiently, to resolve this trade-off. In a standard feedforward network, the output of the $\ell$-th layer is computed as $x_\ell = H_\ell(x_{\ell-1})$, where the network input is denoted as $x_0$ and $H_\ell(\cdot)$ is a nonlinear transformation, which can be a composite function of operations such as nonlinear activation, pooling or convolution [11].

The idea of DenseNet is to use the concatenation of the feature maps produced in preceding layers as the input to succeeding layers:

$$x_\ell = H_\ell\big([x_0, x_1, \ldots, x_{\ell-1}]\big)$$

where $[x_0, x_1, \ldots, x_{\ell-1}]$ refers to the concatenation of the feature maps produced in layers $0, \ldots, \ell-1$ [11]. Such dense connectivity enables all layers not only to receive the gradient directly but also to reuse features computed in preceding layers. This avoids re-computing similar features in different layers and lets the network learn features of different levels in the same layer [11]. The experimental results show that our DenseNet based approach achieves a considerable improvement over the DNN based model.
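To make the dense connectivity pattern concrete, the following toy sketch tracks how the input of each layer grows when every layer receives the concatenation of all earlier outputs. Feature maps are reduced to plain lists of numbers, and the stand-in transformation `h`, the growth rate, and the layer count are assumptions for illustration; a real DenseNet layer would be a convolutional block.

```python
import math


def h(inputs, growth_rate):
    """Stand-in for the layer transformation H_l: any input width -> growth_rate outputs."""
    mean = sum(inputs) / len(inputs)
    return [math.tanh(mean + 0.1 * i) for i in range(growth_rate)]


def dense_block(x0, num_layers, growth_rate):
    """Apply x_l = H_l([x_0, ..., x_{l-1}]) and record each layer's input width."""
    outputs = [list(x0)]  # outputs[0] is the block input x_0
    input_widths = []
    for _ in range(num_layers):
        concatenated = [v for out in outputs for v in out]  # [x_0, ..., x_{l-1}]
        input_widths.append(len(concatenated))
        outputs.append(h(concatenated, growth_rate))
    all_features = [v for out in outputs for v in out]
    return all_features, input_widths


features, widths = dense_block([0.5, -0.2, 0.1], num_layers=4, growth_rate=2)
```

With three input features and a growth rate of two, the layer inputs have widths 3, 5, 7 and 9: every layer sees all features computed before it, which is the feature reuse described above.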

The FCN is the backbone of our model, and the preprocessing and postprocessing modules, the Quasi-Layers, are also vital parts of the whole system. The Quasi-STFT layer uses two 1-dimensional convolutions, initialized with the real and imaginary parts of the discrete Fourier transform kernels respectively, following the definition of the STFT:

$$X_{t,f} = \sum_{n=0}^{N-1} x(n + tR)\, w(n)\, e^{-j 2\pi f n / N}$$

where $w$ is the analysis window of length $N$ and $R$ is the frame shift. The Quasi-ISTFT layer is constructed similarly. These modules are built on ordinary convolutional layers, so they are easy to integrate into a neural network based model. The Quasi-Layers bring two benefits. First, the Quasi-ISTFT makes it possible to define the objective function on a consistent spectrogram, as in Eq. 5. Second, integrating the STFT and ISTFT into the end-to-end model makes the Fourier transform kernels and the window function learnable through backpropagation.
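The idea of a Quasi-STFT layer, two strided 1-dimensional convolutions whose kernels are the real and imaginary parts of windowed DFT basis functions, can be sketched without any deep-learning framework. The code below is an illustrative assumption rather than the authors' implementation: the frame length, hop, Hann window, and the helper names `conv1d`, `quasi_stft` and `reference_stft` are all chosen here. It checks that the two convolution outputs agree with a directly computed STFT.

```python
import cmath
import math

N, HOP = 8, 4  # assumed toy window length and hop length
BINS = N // 2 + 1
window = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]  # periodic Hann

# One kernel per frequency bin: windowed cosine (real part), negated windowed sine (imaginary part).
real_kernels = [[window[n] * math.cos(2 * math.pi * f * n / N) for n in range(N)]
                for f in range(BINS)]
imag_kernels = [[-window[n] * math.sin(2 * math.pi * f * n / N) for n in range(N)]
                for f in range(BINS)]


def conv1d(signal, kernels, stride):
    """Strided 1-D cross-correlation without bias or padding, indexed [kernel][frame]."""
    starts = range(0, len(signal) - len(kernels[0]) + 1, stride)
    return [[sum(k[n] * signal[s + n] for n in range(len(k))) for s in starts]
            for k in kernels]


def quasi_stft(signal):
    """Return the real and imaginary spectrograms as two convolution outputs."""
    return conv1d(signal, real_kernels, HOP), conv1d(signal, imag_kernels, HOP)


def reference_stft(signal):
    """Direct evaluation of the windowed DFT for comparison, indexed [bin][frame]."""
    return [[sum(signal[s + n] * window[n] * cmath.exp(-2j * math.pi * f * n / N)
                 for n in range(N))
             for s in range(0, len(signal) - N + 1, HOP)]
            for f in range(BINS)]


signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(32)]
real_part, imag_part = quasi_stft(signal)
reference = reference_stft(signal)
max_error = max(
    abs(complex(real_part[f][t], imag_part[f][t]) - reference[f][t])
    for f in range(BINS)
    for t in range(len(reference[0]))
)
```

Because the kernels are ordinary convolution weights, a framework implementation could leave them trainable, which is how the Fourier kernels and the window become learnable.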

IV Experiment

IV-A Experimental Setup

We conducted our experiments on the Center for Speech Technology Voice Cloning Toolkit (VCTK) [12] and the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) [13] corpora. The training data is drawn from VCTK, which includes 400 × 109 sentences uttered by 109 native speakers of English with various accents, and the model is evaluated on TIMIT. Training and testing on different datasets supports the reliability of the results. Moreover, the following broadband noises are used: speech babble (Babble), cafeteria (Cafe), factory floor noise (Factory), and transportation noise (Road). The training set is composed by combining ten random parts from the first half of each noise with each training sample at SNR levels of -6, -3, 0, 3 and 6 dB. The test set is generated by mixing 60 clean utterances with the last half of the above noises at different SNRs. Dividing the noises into two halves ensures that the testing noise segments are unseen during training.
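The mixing procedure described above (a noise recording split into halves, a random segment scaled to a target SNR, then added to the clean utterance) can be sketched as follows. The synthetic signals, the lengths, and the helper names `mix_at_snr` and `snr_db` are assumptions for illustration; the paper does not specify its mixing code.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)


def snr_db(clean, noise):
    """Signal-to-noise ratio in dB between a clean signal and a noise signal."""
    return 10.0 * math.log10(power(clean) / power(noise))


def mix_at_snr(clean, noise, target_db):
    """Scale `noise` so that the clean-to-noise power ratio equals target_db, then add."""
    gain = math.sqrt(power(clean) / (power(noise) * 10.0 ** (target_db / 10.0)))
    scaled = [gain * v for v in noise]
    return [c + n for c, n in zip(clean, scaled)], scaled


random.seed(0)
clean = [math.sin(0.05 * n) for n in range(1000)]              # stand-in for an utterance
noise_recording = [random.gauss(0.0, 1.0) for _ in range(4000)]
train_pool = noise_recording[:2000]   # first half: training noise
test_pool = noise_recording[2000:]    # last half: held out for testing

start = random.randrange(len(train_pool) - len(clean) + 1)
segment = train_pool[start:start + len(clean)]

achieved = {}
for target in (-6, -3, 0, 3, 6):
    mixture, scaled_noise = mix_at_snr(clean, segment, target)
    achieved[target] = snr_db(clean, scaled_noise)
```

Drawing training segments only from `train_pool` and test segments only from `test_pool` is what keeps the test noise unseen during training.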

The proposed model, termed QL-FCN-CSM, is shown in Fig. 1. Ahead of the FCN, the raw audio input of 66048 samples is transformed into a 512 × 16 × 2 matrix by the Quasi-STFT layer, whose window length and hop length are set to 1024 and 512 respectively. Mean and variance normalization is applied to the input vector to stabilize training and testing. The perceptual evaluation of speech quality (PESQ) [14] and the signal-to-noise ratio (SNR) are used to evaluate the quality and intelligibility of different signals.

IV-B Experimental Results

IV-B1 Comparison Between Different Objective Functions

We conducted experiments with models based on different objective functions. The model trained to minimize the error between the complex spectrogram of clean speech and its noisy version is denoted QL-FCN-cRM (identical to QL-FCN-CSM, but with CSM replaced by cRM), and the model that estimates only the magnitude is denoted QL-FCN-IRM (again identical to QL-FCN-CSM, but with CSM replaced by IRM).

Table I shows a substantial performance gap between QL-FCN-CSM and QL-FCN-cRM, and between QL-FCN-CSM and QL-FCN-IRM, which supports the effectiveness of CSM, which optimizes the model with an objective function defined on the consistent spectrogram and synthesizes waveforms directly. It is observed that the average PESQ scores and SNR of QL-FCN-CSM and QL-FCN-cRM are always better than those of the other models, which supports the effectiveness of the proposed end-to-end model. Our best results in the 0 dB condition are even more encouraging: the PESQ score is 0.38 higher than that of DNN-cRM, a state-of-the-art DNN approach.

It is noteworthy that QL-FCN-CSM converges faster than the other models while also performing better. This reinforces our view that constraining the estimated spectrogram to the space of consistent spectrograms leads to the faster convergence shown in Fig. 4.

IV-B2 Comparison Between Different Network Architectures

To compare our FCN based model with DNN based ones, we compare it with DNN-cRM [1] (no QL is used, as there is no convolution here; a deep neural network is used instead of the FCN) and DNN-IRM [15].

From Table I, we can observe that QL-FCN-CSM and QL-FCN-cRM outperform DNN-cRM and DNN-IRM in all conditions. These results support our choice of network architecture. However, the results of QL-FCN-CSM are comparable to those of QL-FCN-cRM in the 6 dB and -6 dB conditions. This is because artifacts caused by the loss of phase information are negligible in very high or very low SNR conditions [16].

       PESQ at input SNR (dB)                    SNR (dB) at input SNR (dB)
Model  -6      -3      0       3       6         -6      -3      0       3       6

a      1.179   1.301   1.489   1.672   1.998     -6.00   -3.00   0.00    3.00    6.00
b      1.951   2.112   2.682   2.855   3.106     5.93    8.47    11.32   13.82   16.43
c      1.953   2.068   2.543   2.833   2.966     5.89    8.13    10.76   13.91   16.14
d      1.967   2.077   2.515   2.710   2.976     5.92    8.07    10.83   13.16   15.66
e      1.914   1.836   2.299   2.517   2.843     4.67    6.87    8.38    10.98   14.73
f      1.809   1.787   2.113   2.442   2.798     4.09    6.53    8.05    10.12   13.09

a      1.413   1.676   1.894   2.123   2.342     -6.00   -3.00   0.00    3.00    6.00
b      2.365   2.517   2.720   2.878   3.021     6.34    8.59    11.42   14.0    16.47
c      2.363   2.501   2.686   2.880   3.004     6.30    8.37    10.96   14.03   16.18
d      2.362   2.496   2.690   2.836   2.975     6.29    8.28    11.01   13.26   15.7
e      2.272   2.426   2.516   2.698   2.937     5.01    7.23    8.58    11.12   15.03
f      2.240   2.401   2.493   2.647   2.833     4.59    6.85    8.24    10.44   13.22

a      0.987   1.119   1.265   1.468   1.695     -6.00   -3.00   0.00    3.00    6.00
b      1.783   1.911   2.121   2.304   2.460     7.16    8.82    11.58   14.19   16.53
c      1.778   1.89    2.106   2.302   2.441     7.10    8.55    11.37   14.16   16.25
d      1.78    1.893   2.101   2.246   2.408     7.12    8.59    11.30   13.36   15.75
e      1.687   1.813   1.908   2.113   2.381     5.89    7.55    8.78    11.47   15.33
f      1.625   1.765   1.874   2.046   2.240     5.09    6.93    8.34    10.55   13.27

a      2.182   2.363   2.547   2.721   2.903     -6.00   -3.00   0.00    3.00    6.00
b      2.995   3.095   3.265   3.405   3.529     7.46    9.03    11.74   14.28   16.63
c      2.982   3.084   3.253   3.403   3.530     7.26    8.88    11.53   14.25   16.65
d      2.98    3.078   3.249   3.356   3.493     7.22    8.79    11.45   13.43   15.89
e      2.905   3.007   3.084   3.253   3.467     6.03    7.64    8.87    11.53   15.39
f      2.853   2.966   3.059   3.185   3.352     5.19    7.01    8.47    10.42   13.37
TABLE I: PESQ and SNR performance for the 5 models: No enhancement (a), QL-FCN-CSM (b), QL-FCN-cRM (c), QL-FCN-IRM (d), DNN-cRM (e), DNN-IRM (f).
Fig. 3: A random clip (768 samples) from the waveform of the experimental results. Red line indicates the clean signal. The green line and the red line indicate the output of QL-FCN-CSM and QL-FCN-IRM respectively. It is obvious that estimating spectrogram masks in a consistent manner can reduce distortion of results in the time domain.
Fig. 4: Training the CSM-QL and cRM models on the VCTK dataset. CSM-QL surpasses the cRM model in performance and converges faster.

V Conclusions

We draw two concepts from prior work: a) phase processing is essential to speech enhancement tasks; b) masking a spectrogram would violate its consistency constraints. In this letter, we show that the inconsistent spectrogram problem slows the convergence of the model and causes unintended artifacts. To estimate the clean spectrogram (including magnitude and phase) from the STFT of noisy speech under the consistency constraint, we design a CSM on the complex spectrogram and derive the loss function on the consistent spectrogram, which addresses the inconsistent spectrogram problem and phase processing simultaneously and jointly.

In terms of technical details, we implement new Quasi-Layers that emulate the STFT with convolution layers in the neural network, which makes it possible to optimize our model with an objective function on the consistent spectrogram. DenseNet is selected as the basis of our model framework rather than a vanilla CNN or DNN, for its superior ability to extract features at various scales in a spectrogram. The experimental results show both faster convergence and improved speech quality.


  • [1] D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
  • [2] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, “Phase processing for single-channel speech enhancement: History and recent advances,” IEEE Signal Process. Mag., vol. 32, no. 2, pp. 55–66, 2015. [Online]. Available: https://doi.org/10.1109/MSP.2014.2369251
  • [3] Z. Prusa and P. Rajmic, “Toward high-quality real-time signal reconstruction from STFT magnitude,” IEEE Signal Process. Lett., vol. 24, no. 6, pp. 892–896, 2017. [Online]. Available: https://doi.org/10.1109/LSP.2017.2696970
  • [4] K. K. Paliwal, K. K. Wójcicki, and B. J. Shannon, “The importance of phase in speech enhancement,” Speech Communication, vol. 53, no. 4, pp. 465–494, 2011.
  • [5] D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio masking for joint enhancement of magnitude and phase,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016.   IEEE, 2016, pp. 5220–5224. [Online]. Available: https://doi.org/10.1109/ICASSP.2016.7472673
  • [6] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, 2015, pp. 708–712. [Online]. Available: https://doi.org/10.1109/ICASSP.2015.7178061
  • [7] D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio masking for joint enhancement of magnitude and phase,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016.   IEEE, 2016, pp. 5220–5224.
  • [8] J. Le Roux, “Phase-controlled sound transfer based on maximally-inconsistent spectrograms,” in Proceedings of the Acoustical Society of Japan Spring Meeting, no. 1-Q-51, Mar. 2011.
  • [9] S. Nawab, T. Quatieri, and J. Lim, “Signal reconstruction from short-time Fourier transform magnitude,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 31, no. 4, pp. 986–998, 1983.
  • [10] S. Fu, Y. Tsao, X. Lu, and H. Kawai, “End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks,” CoRR, vol. abs/1709.03658, 2017.
  • [11] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [12] C. Veaux, J. Yamagishi, K. MacDonald et al., “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2017.
  • [13] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon Technical Report N, vol. 93, 1993.
  • [14] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2001, 7-11 May, 2001, Salt Palace Convention Center, Salt Lake City, Utah, USA, Proceedings.   IEEE, 2001, pp. 749–752.
  • [15] M. Tu and X. Zhang, “Speech enhancement based on deep neural networks with skip connections,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017.   IEEE, 2017, pp. 5565–5569.
  • [16] P. C. Loizou, Speech enhancement: theory and practice.   CRC press, 2013.