1 Introduction
Noise interference can severely degrade perceptual quality and intelligibility in speech communication. Related tasks, such as automatic speech recognition (ASR), can likewise be heavily affected by noise.
Speech enhancement is thus a highly desired task: it takes noisy speech as input and produces enhanced speech with better quality and intelligibility, and sometimes better results on downstream criteria (e.g., lower error rate in ASR). Recently, deep learning (DL) methods have achieved promising results in speech enhancement, especially in dealing with non-stationary noise in challenging conditions. DL can benefit both single-channel (monaural) and multi-channel speech enhancement, depending on the application. In this paper, we focus on DL-based single-channel speech enhancement for better perceptual quality and intelligibility, particularly targeting real-time processing with low model complexity. The Interspeech 2020 deep noise suppression (DNS) challenge has provided a common testbed for this purpose [reddy2020interspeech].
1.1 Related work
Formulated as a supervised learning problem, noisy speech can be enhanced by neural networks either in the time-frequency (TF) domain or directly in the time domain. Time-domain approaches fall into two further categories: direct regression [fu2018end, stoller2018wave] and adaptive front-end approaches [luo2019conv, luo2019dual, zhang2020furcanext]. The former directly learns a regression function from the waveform of a speech-noise mixture to the target speech without an explicit signal front-end, typically using some form of 1-D convolutional neural network (Conv1d). The latter, taking a time-domain signal in and out, usually adopts a convolutional encoder-decoder (CED) or U-Net framework, which resembles the short-time Fourier transform (STFT) and its inversion (iSTFT). The enhancement network is then inserted between the encoder and the decoder, typically using networks capable of temporal modeling, such as the temporal convolutional network (TCN) [luo2019conv, bai2018empirical] and long short-term memory (LSTM) [weninger_erdogan_watanabe_vincent_roux_hershey_schuller_2015].
As another mainstream, TF-domain approaches [srinivasan2006binary, narayanan2013ideal, zhao2016dnn, xu2013experimental, yin2019phasen] work on the spectrogram, with the belief that the fine-detailed structures of speech and noise are more separable in TF representations after STFT. The convolution recurrent network (CRN) [tan2018convolutional] is a recent approach that also employs a CED structure similar to the one in the time-domain approaches, but extracts high-level features for better separation from the noisy speech spectrogram with 2-D CNNs (Conv2d). Specifically, the CED can take a complex-valued or real-valued spectrogram as input. A complex-valued spectrogram can be decomposed into magnitude and phase in polar coordinates, or into real and imaginary parts in Cartesian coordinates. For a long time, phase was believed to be intractable to estimate. Hence, early studies focused only on magnitude-related training targets and ignored phase [huang2014deep, xu2014regression, takahashi2018mmdenselstm], resynthesizing the estimated speech by simply combining the estimated magnitude with the noisy speech phase. This limits the upper bound of performance, since the phase of the estimated speech deviates significantly under serious interference. Although many recent approaches have been proposed for phase reconstruction to address this issue [wang2015deep, liu2019supervised], the neural network remains real-valued.
Typically, training targets defined in the TF domain fall into two groups: masking-based targets, which describe the time-frequency relationships between clean speech and background noise, and mapping-based targets, which correspond to the spectral representations of clean speech. In the masking family, the ideal binary mask (IBM) [wang2005ideal], ideal ratio mask (IRM) [narayanan2013ideal] and spectral magnitude mask (SMM) [wang2014training] only use the magnitudes of clean speech and mixture speech, ignoring the phase information. In contrast, the phase-sensitive mask (PSM) [erdogan2015phase] was the first to utilize phase information, showing the feasibility of phase estimation. Subsequently, the complex ratio mask (CRM) [williamson2015complex] was proposed, which can reconstruct speech perfectly by enhancing both the real and imaginary components of the ratio of the clean speech spectrogram to the mixture speech spectrogram simultaneously. Later, Tan et al. [tan2019complex] proposed a CRN with one encoder and two decoders for complex spectral mapping (CSM), which estimates the real and imaginary spectrograms of clean speech from the mixture simultaneously. It is worth noting that CRM and CSM possess the full information of a speech signal, so they can in theory achieve the best oracle speech enhancement performance.
The above approaches are all learned with a real-valued network, although phase information has been taken into consideration. Recently, the deep complex U-Net (DCUNET) [choi2019phase] has combined the advantages of both a deep complex network [trabelsi2017deep] and a U-Net [ronneberger2015u] to deal with complex-valued spectrograms. In particular, DCUNET is trained to estimate CRM and optimizes the scale-invariant source-to-noise ratio (SI-SNR) loss [luo2019conv] after transforming the output TF-domain spectrogram to a time-domain waveform by iSTFT. While it achieves state-of-the-art performance with temporal modeling ability, it uses many convolution layers to extract important context information, leading to a large model size and complexity, which limits its practical use in efficiency-sensitive applications.
1.2 Contributions
In this paper, we build upon previous network architectures to design a new complex-valued speech enhancement network, called the deep complex convolution recurrent network (DCCRN), optimized with an SI-SNR loss. The network effectively combines the advantages of both DCUNET and CRN, using LSTM to model temporal context with significantly fewer trainable parameters and lower computational cost. Under the proposed DCCRN framework, we also compare various training targets, and the best performance is obtained by the complex network with the complex target. In our experiments, we find that the proposed DCCRN outperforms CRN [tan2019complex] by a large margin. With only 1/6 of the computational complexity, DCCRN achieves performance competitive with DCUNET [choi2019phase] under a similar configuration of model parameters. Targeting real-time speech enhancement, with only 3.7M parameters, our model achieves the best MOS in the real-time track and the second-best in the non-real-time track according to the P.808 subjective evaluation in the DNS challenge.
2 The DCCRN Model
2.1 Convolution recurrent network architecture
The convolution recurrent network (CRN), originally described in [tan2018convolutional], is an essentially causal CED architecture with two LSTM layers between the encoder and the decoder. Here, LSTM is specifically used to model temporal dependencies. The encoder consists of five Conv2d blocks that extract high-level features from the input features while reducing the resolution. Subsequently, the decoder reconstructs the low-resolution features to the original size of the input, giving the encoder-decoder structure a symmetric design. In detail, each encoder/decoder Conv2d block is composed of a convolution/deconvolution layer followed by batch normalization and an activation function. Skip connections, which concatenate encoder and decoder features, help the gradient flow.
Unlike the original CRN with magnitude mapping, Tan et al. [tan2019complex] recently proposed a modified structure with one encoder and two decoders to model the real and imaginary parts of the complex STFT spectrogram, mapping the input mixture to clean speech. Compared with the traditional magnitude-only target, enhancing magnitude and phase simultaneously yields remarkable improvement. However, they treat the real and imaginary parts as two input channels and only apply a real-valued convolution operation with one shared real-valued convolution filter, which is not constrained by the rules of complex multiplication. Hence, the networks may learn the real and imaginary parts without prior knowledge. To address this issue, the proposed DCCRN modifies CRN substantially with complex CNN and complex batch normalization layers in the encoder/decoder, and complex LSTM is also considered as a replacement for the traditional LSTM. Specifically, the complex modules model the correlation between magnitude and phase by simulating complex multiplication.
2.2 Encoder and decoder with complex network
The complex encoder block includes complex Conv2d, complex batch normalization [trabelsi2017deep] and real-valued PReLU [he2015delving]. The complex batch normalization and PReLU follow the implementations of the original papers. We design the complex Conv2d block according to that in DCUNET [choi2019phase]. Complex Conv2d consists of four traditional Conv2d operations, which control the complex information flow throughout the encoder. The complex-valued convolutional filter is defined as $W = W_r + jW_i$, where the real-valued matrices $W_r$ and $W_i$ represent the real and imaginary parts of a complex convolution kernel, respectively. At the same time, we define the input complex matrix $X = X_r + jX_i$. Therefore, we can get the complex output from the complex convolution operation $X \circledast W$:
$$F_{out} = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r) \tag{1}$$
where $F_{out}$ denotes the output feature of one complex layer.
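As a minimal sketch of Eq. (1), the snippet below builds a complex convolution from four real-valued convolutions and checks it against direct complex arithmetic. It is an illustration only: it uses plain Python lists and a 1-D valid-mode convolution instead of the Conv2d layers of the model, and the function names are hypothetical.

```python
def conv1d_real(x, w):
    """Valid-mode 1-D convolution (cross-correlation form) of two real sequences."""
    n_out = len(x) - len(w) + 1
    return [sum(x[t + k] * w[k] for k in range(len(w))) for t in range(n_out)]


def complex_conv1d(x_r, x_i, w_r, w_i):
    """Complex convolution assembled from four real convolutions, following Eq. (1)."""
    rr = conv1d_real(x_r, w_r)  # X_r * W_r
    ii = conv1d_real(x_i, w_i)  # X_i * W_i
    ri = conv1d_real(x_r, w_i)  # X_r * W_i
    ir = conv1d_real(x_i, w_r)  # X_i * W_r
    out_r = [a - b for a, b in zip(rr, ii)]  # real part of F_out
    out_i = [a + b for a, b in zip(ri, ir)]  # imaginary part of F_out
    return out_r, out_i


def complex_conv1d_reference(x, w):
    """Same operation computed directly with Python complex numbers."""
    n_out = len(x) - len(w) + 1
    return [sum(x[t + k] * w[k] for k in range(len(w))) for t in range(n_out)]
```

Because the four real convolutions share the two kernels $W_r$ and $W_i$, the layer cannot treat the real and imaginary parts as unrelated channels, which is the constraint that a plain two-channel real convolution lacks.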
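Similar to complex convolution, given the real and imaginary parts of the complex input, $X_r$ and $X_i$, the complex LSTM output $F_{out}$ can be defined as:
$$F_{rr} = \mathrm{LSTM}_r(X_r), \quad F_{ir} = \mathrm{LSTM}_r(X_i) \tag{2}$$
$$F_{ri} = \mathrm{LSTM}_i(X_r), \quad F_{ii} = \mathrm{LSTM}_i(X_i) \tag{3}$$
$$F_{out} = (F_{rr} - F_{ii}) + j(F_{ri} + F_{ir}) \tag{4}$$
where $\mathrm{LSTM}_r$ and $\mathrm{LSTM}_i$ represent two traditional LSTMs for the real part and the imaginary part, and $F_{ri}$ is calculated from the input $X_r$ with $\mathrm{LSTM}_i$.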
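The combination rule of the complex LSTM can be sketched with any pair of sequence-to-sequence layers. In the snippet below, the two layers are passed in as plain Python callables, and `make_toy_recurrent` is a hypothetical linear recurrence standing in for a real LSTM, so only the way the four outputs are combined reflects the description above.

```python
def make_toy_recurrent(decay, gain):
    """Stand-in for an LSTM: h_t = decay * h_{t-1} + gain * x_t (assumption, not an LSTM)."""
    def layer(xs):
        h, out = 0.0, []
        for v in xs:
            h = decay * h + gain * v
            out.append(h)
        return out
    return layer


def complex_recurrent(layer_r, layer_i, x_r, x_i):
    """Combine two real-valued recurrent layers into one complex-valued layer."""
    f_rr = layer_r(x_r)  # real layer applied to the real input
    f_ir = layer_r(x_i)  # real layer applied to the imaginary input
    f_ri = layer_i(x_r)  # imaginary layer applied to the real input
    f_ii = layer_i(x_i)  # imaginary layer applied to the imaginary input
    out_r = [a - b for a, b in zip(f_rr, f_ii)]
    out_i = [a + b for a, b in zip(f_ri, f_ir)]
    return out_r, out_i
```

When both layers are simple scalings, the result reduces to an ordinary complex multiplication, which is a convenient sanity check of the sign pattern.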
2.3 Training target
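When training, DCCRN estimates CRM and is optimized by signal approximation (SA). Given the complex-valued STFT spectrograms of clean speech $S$ and noisy speech $Y$, CRM can be defined as
$$\mathrm{CRM} = \frac{Y_r S_r + Y_i S_i}{Y_r^2 + Y_i^2} + j\,\frac{Y_r S_i - Y_i S_r}{Y_r^2 + Y_i^2} \tag{5}$$
where $Y_r$ and $Y_i$ denote the real and imaginary parts of the noisy complex spectrogram, respectively. The real and imaginary parts of the clean complex spectrogram are represented by $S_r$ and $S_i$. The magnitude target SMM can also be used for comparison: $\mathrm{SMM} = |S| / |Y|$, where $|S|$ and $|Y|$ indicate the magnitudes of clean speech and noisy speech, respectively. We apply signal approximation, which directly minimizes the difference between the magnitude or complex spectrogram of clean speech and that of noisy speech with the mask applied. The loss function of SA becomes $\mathrm{CSA} = \mathrm{Loss}(\tilde{M} \cdot Y, S)$ and $\mathrm{MSA} = \mathrm{Loss}(|\tilde{M}| \cdot |Y|, |S|)$, where CSA and MSA denote the CRM-based SA and the SMM-based SA, respectively. Alternatively, the Cartesian coordinate representation $\tilde{M} = \tilde{M}_r + j\tilde{M}_i$ can also be expressed in polar coordinates:
$$\tilde{M}_{mag} = \sqrt{\tilde{M}_r^2 + \tilde{M}_i^2}, \qquad \tilde{M}_{phase} = \operatorname{arctan2}(\tilde{M}_i, \tilde{M}_r) \tag{6}$$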
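To make the training targets concrete, the following sketch evaluates the CRM and the SMM for a single time-frequency bin and converts a mask to polar form. It assumes the standard definition of the CRM as the complex ratio of clean to noisy speech; the function names are illustrative, and the noisy bin is assumed to be non-zero.

```python
import cmath


def complex_ratio_mask(s, y):
    """CRM for one TF bin: the complex ratio s / y written in real arithmetic."""
    denom = y.real ** 2 + y.imag ** 2  # assumes y != 0
    m_r = (y.real * s.real + y.imag * s.imag) / denom
    m_i = (y.real * s.imag - y.imag * s.real) / denom
    return complex(m_r, m_i)


def spectral_magnitude_mask(s, y):
    """SMM for one TF bin: ratio of clean magnitude to noisy magnitude."""
    return abs(s) / abs(y)


def to_polar(m):
    """Magnitude and phase of a mask; cmath.phase(m) equals atan2(m.imag, m.real)."""
    return abs(m), cmath.phase(m)
```

Multiplying the noisy bin by its oracle CRM recovers the clean bin exactly, whereas the SMM only recovers the clean magnitude and leaves the noisy phase untouched.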
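We can use three multiplicative patterns for DCCRN, which are compared in the experiments below. Specifically, the estimated clean speech $\tilde{S}$ can be calculated from the noisy spectrogram $Y$ and the estimated mask $\tilde{M}$ as follows.

DCCRN-R:
$$\tilde{S} = (Y_r \cdot \tilde{M}_r) + j(Y_i \cdot \tilde{M}_i) \tag{7}$$
DCCRN-C:
$$\tilde{S} = (Y_r \cdot \tilde{M}_r - Y_i \cdot \tilde{M}_i) + j(Y_r \cdot \tilde{M}_i + Y_i \cdot \tilde{M}_r) \tag{8}$$
DCCRN-E:
$$\tilde{S} = Y_{mag} \cdot \tilde{M}_{mag} \cdot e^{j(Y_{phase} + \tilde{M}_{phase})} \tag{9}$$
Here $Y_{mag}$ and $Y_{phase}$ are the magnitude and phase of $Y$. DCCRN-C obtains $\tilde{S}$ in the manner of CSA, and DCCRN-R estimates the masks of the real and imaginary parts of $Y$ separately. Moreover, DCCRN-E works in polar coordinates and is mathematically similar to DCCRN-C. The difference is that DCCRN-E uses the tanh activation function to limit the mask magnitude to the range 0 to 1.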
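The three multiplicative patterns can be compared on a single time-frequency bin with the sketch below. The function names are hypothetical, and tanh is assumed here as the activation that bounds the mask magnitude in the polar variant.

```python
import cmath
import math


def apply_dccrn_r(y, m):
    """DCCRN-R: mask the real and imaginary parts of y independently."""
    return complex(y.real * m.real, y.imag * m.imag)


def apply_dccrn_c(y, m):
    """DCCRN-C: full complex multiplication of y by the mask."""
    out_r = y.real * m.real - y.imag * m.imag
    out_i = y.real * m.imag + y.imag * m.real
    return complex(out_r, out_i)


def apply_dccrn_e(y, m):
    """DCCRN-E: polar form, with the mask magnitude bounded by tanh (assumption)."""
    mask_mag = math.tanh(abs(m))  # lies in [0, 1) because abs(m) >= 0
    phase = cmath.phase(y) + cmath.phase(m)
    return cmath.rect(abs(y) * mask_mag, phase)
```

With this formulation, DCCRN-E and DCCRN-C rotate the noisy phase by the same angle and differ only in how the magnitude is scaled, so the enhanced magnitude of DCCRN-E never exceeds the noisy magnitude.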
2.4 Loss function
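The loss function for model training is SI-SNR, which has been commonly used as an evaluation metric in place of the mean square error (MSE). SI-SNR is defined as:
$$s_{target} = \frac{\langle \tilde{s}, s \rangle \cdot s}{\|s\|_2^2}, \quad e_{noise} = \tilde{s} - s_{target}, \quad \text{SI-SNR} = 10 \log_{10} \frac{\|s_{target}\|_2^2}{\|e_{noise}\|_2^2} \tag{10}$$
where $s$ and $\tilde{s}$ are the clean and estimated time-domain waveforms, respectively.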
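A minimal sketch of the SI-SNR computation, assuming the standard definition (project the estimate onto the clean signal, then compare the projection with the residual). It uses plain Python lists rather than tensors, and the function name is illustrative.

```python
import math


def si_snr(s, s_hat):
    """SI-SNR in dB between a clean waveform s and an estimate s_hat."""
    dot = sum(a * b for a, b in zip(s_hat, s))   # <s_hat, s>
    energy = sum(a * a for a in s)               # ||s||^2
    s_target = [dot / energy * a for a in s]     # projection of s_hat onto s
    e_noise = [b - t for b, t in zip(s_hat, s_target)]
    num = sum(t * t for t in s_target)
    den = sum(e * e for e in e_noise)            # zero for a perfect estimate
    return 10.0 * math.log10(num / den)
```

Rescaling the estimate leaves the value unchanged, which is the scale invariance the name refers to. A practical implementation would usually add a small constant to the denominators to avoid division by zero, and a training loop would minimize the negative of this value.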
In Eq. (10), $\langle \cdot, \cdot \rangle$ denotes the dot product between two vectors and $\|\cdot\|_2$ is the Euclidean norm ($L_2$ norm). In detail, we use an STFT-kernel-initialized convolution/deconvolution module to analyze/synthesize the waveform [gu2019end] before sending it to the network and calculating the loss function.
3 Experiments
3.1 Datasets
In our experiments, we first evaluate the proposed models as well as several baselines on a dataset simulated from WSJ0 [garofolo1993csr]; the best-performing models are then further evaluated on the Interspeech 2020 DNS challenge dataset [reddy2020interspeech]. For the first dataset, we select 24500 utterances (about 50 hours) from WSJ0 [garofolo1993csr], covering 131 speakers (66 male and 65 female). We shuffle and split them into training, validation, and evaluation sets of 20000, 3000 and 1500 utterances, respectively. The noise dataset contains 6.2 hours of free-sound noise and 42.6 hours of music from MUSAN [musan2015], of which we use 41.8 hours for training and validation and the remaining 7 hours for evaluation. The speech-noise mixtures for training and validation are generated by randomly selecting utterances from the speech set and the noise set and mixing them at a random SNR between 5 dB and 20 dB. The evaluation set is generated at 5 typical SNRs (0 dB, 5 dB, 10 dB, 15 dB, 20 dB).
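Mixing at a target SNR can be sketched as follows. This is an illustration under simple assumptions, not the exact simulation pipeline: speech and noise are equal-length lists of samples, power is measured as the mean square over the whole segment, and the helper names are hypothetical.

```python
import math
import random


def mean_power(x):
    """Mean-square power of a waveform."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then add."""
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))  # assumes non-silent noise
    return [s + scale * n for s, n in zip(speech, noise)]


def draw_snr(low_db, high_db, rng):
    """Draw a random SNR uniformly from [low_db, high_db]."""
    return rng.uniform(low_db, high_db)
```

For example, `mix_at_snr(speech, noise, draw_snr(5.0, 20.0, random.Random(0)))` produces one training mixture with an SNR drawn from the range stated above.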
The second, larger dataset is based on the data provided by the DNS challenge. The 180-hour DNS challenge noise set includes 150 classes and 65,000 noise clips, and the clean speech set includes over 500 hours of clips from 2150 speakers. To make full use of the dataset, we simulate the speech-noise mixtures with dynamic mixing during model training. In detail, at each training epoch, we first convolve speech and noise with a room impulse response (RIR) randomly selected from a set of 3000 RIRs simulated by the image method [allen1979image]; the speech-noise mixtures are then generated dynamically by mixing the reverberant speech and noise at a random SNR between 5 and 20 dB. The total data 'seen' by the model is over 5000 hours after 10 epochs of training. We use the official test set for objective scoring and final model selection.
3.2 Training setup and baselines
For all of the models, the window length and hop size are 25 ms and 6.25 ms, respectively, and the FFT length is 512. We use PyTorch to train the models with the Adam optimizer. The initial learning rate is set to 0.001 and is decayed by a factor of 0.5 when the validation loss goes up. All waveforms are resampled to 16 kHz. The models are selected by early stopping. To choose the model for the DNS challenge, we compare several models on the WSJ0 simulation dataset, described as follows.

LSTM: a semi-causal model containing two LSTM layers with 800 units each; we add one Conv1d layer with a kernel size of 7 in the time dimension and a look-ahead of 6 frames to make the model semi-causal. The output layer is a 257-unit fully connected layer. The input and output are the noisy and estimated clean spectrograms with MSA, respectively.

CRN: a semi-causal model containing one encoder and two decoders with the best configuration in [tan2019complex]. The input and output are the real and imaginary parts of the noisy and estimated STFT complex spectrograms. The two decoders process the real and imaginary parts separately. The kernel size is (3,2) in the frequency and time dimensions, and the stride is set to (2,1). For the encoder, we concatenate the real and imaginary parts in the channel dimension, so the shape of the input feature is [BatchSize, 2, Frequency, Time]. Moreover, the output channels of the encoder layers are {16,32,64,128,256,256}. The hidden LSTM units are 256, and a dense layer with 1280 units follows the last LSTM. On account of the skip connections, the input channels of the layers in the real or imaginary decoder are {512,512,256,128,64,32}.

DCCRN: four models, namely DCCRN-R, DCCRN-C, DCCRN-E and DCCRN-CL (with masking like DCCRN-E). The direct current component of all these models is removed. The numbers of channels for the first three DCCRNs are {32,64,128,128,256,256}, while those of DCCRN-CL are {32,64,128,256,256,256}. The kernel size and stride are set to (5,2) and (2,1), respectively. The first three DCCRNs use real LSTMs with two layers of 256 units, while DCCRN-CL uses a complex LSTM with 128 units each for the real part and the imaginary part. A dense layer with 1280 units follows the last LSTM.

DCUNET: we use DCUNET-16 for comparison, and the stride in the time dimension is set to 1 to comply with the DNS challenge rules. Moreover, the encoder channels are set to [72,72,144,144,144,160,160,180].
For the implementation of semi-causal convolution [bahmaninezhad2019unified], there are only two differences from the commonly used causal convolution in practice. First, we pad zeros at the front of the time dimension for each Conv2d in the encoder. Second, for the decoder, we look ahead one frame in each convolution layer. This eventually leads to a look-ahead of 6 frames, a total of 6 × 6.25 = 37.5 ms, which is within the DNS challenge limit of 40 ms.
3.3 Experimental results and discussion
The model performance is first assessed by PESQ (https://www.itu.int/rec/T-REC-P.862-200102-I/en) on the simulated WSJ0 dataset. Table 1 presents the PESQ scores on the test sets. In each case, the best result is highlighted by a boldface number.
Table 1: PESQ on the simulated WSJ0 test sets.
Model     Para.(M)  0 dB   5 dB   10 dB  15 dB  20 dB  Ave.
Noisy     -         2.062  2.388  2.719  3.049  3.370  2.518
LSTM      9.6       2.783  3.103  3.371  3.593  3.781  3.326
CRN       6.1       2.850  3.143  3.374  3.561  3.717  3.329
DCCRN-R   3.7       2.832  3.192  3.488  3.717  3.891  3.424
DCCRN-C   3.7       2.832  3.187  3.477  3.707  3.840  3.409
DCCRN-E   3.7       2.859  3.203  3.492  3.718  3.891  3.433
DCCRN-CL  3.7       2.972  3.301  3.559  3.755  3.901  3.498
DCUNET    3.6       2.971  3.297  3.556  3.760  3.916  3.500
On the simulated WSJ0 test set, the four DCCRNs outperform the baseline LSTM and CRN, which indicates the effectiveness of complex convolution. DCCRN-CL achieves better performance than the other DCCRNs, which further shows that complex LSTM is also beneficial to complex target training. Moreover, the fully complex-valued networks DCCRN-CL and DCUNET are similar in PESQ. It is worth noting that, according to our run-time test, the computational complexity of DCUNET is almost 6 times that of DCCRN-CL.
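Table 2: PESQ on the DNS challenge test set. [T1] and [T2] mark the models associated with Track 1 and Track 2.
Model                       Para.(M)  Look-ahead (ms)  no reverb  reverb  Ave.
Noisy                       -         -                2.454      2.752   2.603
NSNet (Baseline) [9054254]  1.3       0                2.683      2.453   2.568
DCCRN-E [T1]                3.7       37.5             3.266      3.077   3.171
DCCRN-E-Aug [T2]            3.7       37.5             3.209      3.219   3.214
DCCRN-CL [T2]               3.7       37.5             3.262      3.101   3.181
DCUNET [T2]                 3.6       37.5             3.223      2.796   3.001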
Table 3: MOS from the P.808 subjective evaluation on the DNS challenge final blind test set.
Track    Model                       Para.(M)  no reverb  reverb  realrec  Ave.
-        Noisy                       -         3.13       2.64    2.83     2.85
-        NSNet (Baseline) [9054254]  1.3       3.49       2.64    3.00     3.03
Track 1  DCCRN-E                     3.7       4.00       2.94    3.37     3.42
Track 1  Team 9                      UNK       3.87       2.97    3.28     3.39
Track 1  Team 17                     UNK       3.83       3.05    3.27     3.34
Track 2  Team 9                      UNK       4.07       3.19    3.40     3.52
Track 2  DCCRN-E-Aug                 3.7       3.90       2.96    3.34     3.38
Track 2  Team 17                     UNK       3.83       3.15    3.28     3.38
In the DNS challenge, we evaluate the two best DCCRN models and DCUNET on the DNS dataset. Table 2 shows the PESQ scores on the test set. Similarly, DCCRN-CL achieves slightly better PESQ than DCCRN-E in general. However, in our internal subjective listening, we find that DCCRN-CL may over-suppress the speech signal on some clips, leading to an unpleasant listening experience. DCUNET obtains relatively good PESQ on the synthetic non-reverb set, but its PESQ drops significantly on the synthetic reverb set. We believe that subjective listening becomes critical when the objective scores of different systems are close. For these reasons, DCCRN-E was finally chosen for the real-time track. To improve the performance on the reverb set, we add more RIRs to the training set, resulting in a model called DCCRN-E-Aug, which was chosen for the non-real-time track. According to the results on the final blind test set in Table 3, the MOS of DCCRN-E-Aug shows a small improvement of 0.02 on the reverb set. Table 3 summarizes the final P.808 subjective evaluation results for several top systems in both tracks, as provided by the challenge organizer. Our submitted models perform well in general: DCCRN-E achieves an average MOS of 3.42 on all sets and 4.00 on the non-reverb set. The processing time for one frame with our PyTorch implementation of DCCRN-E (exported to ONNX) is 3.12 ms, measured empirically on an Intel i5-8250U PC. Some of the enhanced audio clips can be found at https://huyanxin.github.io/DeepComplexCRN.
4 Conclusions
In this study, we have proposed a deep complex convolution recurrent network for speech enhancement. The DCCRN model utilizes a complex network for complex-valued spectrum modeling. Constrained by the rules of complex multiplication, DCCRN achieves better performance than the other models in terms of PESQ and MOS under a similar configuration of model parameters. In the future, we will try to deploy DCCRN in low-computation scenarios such as edge devices. We will also improve the noise suppression ability of DCCRN in reverberant conditions.