Speech enhancement (SE) aims to separate the speech components from background noise interference. It is intended to improve speech quality and intelligibility in many communication applications, such as front-ends for automatic speech recognition (ASR) systems and hearing aids loizou2013speech. In recent years, owing to the strong ability of deep neural networks (DNNs) to handle non-stationary noise in low signal-to-noise ratio (SNR) conditions, many DNN-based approaches have demonstrated superior performance in single-channel SE wang2018supervised. These DNN-based methods can be divided into two categories, namely masking-based approaches wang2014training; hummersone2014ideal and mapping-based approaches lu2013speech; xu2014regression; yuan2020time; meng2018adversarial; liao2018noise. The masking-based approaches estimate a time-frequency (T-F) mask, e.g., the ideal binary mask (IBM) or the ideal ratio mask (IRM), that is applied to the noisy speech signal for enhancement. The mapping-based approaches train a mapping network that directly transforms noisy speech features (e.g., magnitude spectra, log-power spectra) into clean ones.
More recently, generative adversarial networks (GANs) goodfellow2014generative have demonstrated comparable performance for SE. Like most other DNN-based approaches, they require a large amount of paired training data, which may be difficult to obtain in practical applications. To alleviate the difficulty of obtaining parallel recordings of speech and noise in real scenarios, the cycle-consistent GAN (CycleGAN) zhu2017unpaired has been suggested for SE. Moreover, CycleGAN-based approaches have also shown promising performance with parallel data, owing to their capability of preserving the speech structure and reducing speech distortion meng2018cycle; xiang2020parallel; wang2020improved.
Nevertheless, conventional CycleGAN-based approaches have two major limitations for SE tasks. Firstly, to ensure cycle-consistency between the original noisy speech domain and the target clean speech domain, the enhanced signal always retains some of the original noise information. In other words, cycle-consistency-based methods leave audible residual noise that is challenging to eliminate. Secondly, previous CycleGAN-based SE systems only estimated the magnitude spectrum, log-power spectrum or Mel-cepstral coefficients, and combined them with the non-oracle (i.e., noisy) phase to reconstruct the time-domain waveform. This leads to severe phase distortion in low-SNR scenarios, resulting in serious degradation of speech quality and intelligibility.
Multi-stage learning approaches decompose the original difficult task into multiple more manageable sub-tasks and have demonstrated better performance than single-stage methods in many areas, such as image inpainting hedjazi2020texture and image deraining li2018recurrent. In this paper, we combine a CycleGAN-based magnitude spectrogram mapping network (dubbed CycleGAN-MM) and a deep complex-valued denoising network (dubbed DCD-Net) into a two-stage approach for SE. In the first stage, we use CycleGAN-MM to estimate only the magnitude of the clean spectra, and we introduce a relativistic average loss in the discriminators to stabilize training. Motivated by recent SE studies on better sequence modeling tang2020joint; zheng2020interactive, we employ temporal-frequency attention (T-FA) in the generators to capture global dependencies along the temporal and frequency dimensions, respectively. It has recently been shown that the magnitude and phase are difficult to optimize simultaneously, especially in extremely low SNR conditions wang2020complex. In our preliminary one-stage CycleGAN-based complex mapping SE system (dubbed CycleGAN-CM), optimizing both the real and imaginary (RI) components may destabilize the training of the generators and discriminators, so that its performance becomes even worse than when only the magnitude of the clean speech spectrum is estimated. This is because, when phase information is optimized by estimating the complex spectrum, the magnitude estimation may gradually deviate from its optimal convergence path li2021icassp. Therefore, we only optimize the magnitude of the clean spectra in the first stage, which is then coupled with the corresponding noisy phase to obtain a coarsely enhanced complex spectrum.
Subsequently, in the second stage, we introduce a deep complex denoising network to further suppress the residual noise left by the previous stage, while simultaneously reconstructing the clean speech phase by estimating both the real and imaginary components of the clean spectra. Instead of using real-valued neural networks, DCD-Net is built from complex-valued convolutional networks and complex T-FA blocks to refine the coarsely enhanced complex spectrum. To the best of our knowledge, this is the first attempt to handle complex spectral mapping in CycleGAN-based SE systems. To validate the proposed scheme, we compare our model with recent state-of-the-art (SOTA) GAN-based and non-GAN-based SE systems on two public datasets. Experimental results demonstrate that the proposed two-stage approach outperforms one-stage CycleGAN-based systems by a significant margin and achieves SOTA performance.
The remainder of this paper is organized as follows. Section 2 introduces the related works, including CycleGAN for SE, deep complex-valued SE systems, and multi-stage SE approaches. In Section 3, the proposed framework is described in detail. The experimental setup is presented in Section 4, and the experimental results and analysis are provided in Section 5. Finally, some conclusions are drawn in Section 6.
2 Related works
2.1 Cycle-consistent GAN-based approaches for SE
A recent breakthrough in the SE area comes from the application of GANs as feature mapping networks. A GAN consists of a generator network ($G$) and a discriminator network ($D$) that play a min-max game against each other. Through adversarial training, the objective of $G$ is to synthesize fake samples that are indistinguishable from the target data distribution, whilst $D$ attempts to discriminate between the real and fake samples. SEGAN is the pioneering work employing a GAN for the SE task; it directly maps the noisy raw waveform to the clean raw waveform in the time domain pascual2017segan. More recently, other GAN-based SE algorithms in the time domain have been proposed to leverage different loss functions baby2019sergan; fu2019metricgan or generator structures liu2020cp; pascual2019time. Another mainstream class of GAN-based SE algorithms operates in the time-frequency (T-F) domain, where $G$ is used to estimate a T-F mask soni2018time.
However, with only adversarial losses, conventional GAN-based methods may map noisy features to any random permutation of the clean features in the target domain, so there is no guarantee that an individual enhanced feature is exactly paired with its target clean one zhu2017unpaired. In other words, these methods cannot ensure that the contextual information of the noisy features and that of the enhanced features are always cycle-consistent. As a variant of GAN, CycleGAN is widely used for the unpaired image-to-image translation task, while in the speech area it also demonstrates excellent performance on voice conversion kaneko2018cyclegan, music style transfer brunner2018symbolic, and SE meng2018cycle; xiang2020parallel; wang2020improved. CycleGAN-based SE methods have proven effective in improving SE performance, especially in maintaining speech integrity and reducing speech distortion. CycleGAN-based approaches for SE contain a noisy-to-clean generator $G_{X\to Y}$ and an inverse clean-to-noisy generator $G_{Y\to X}$; the former transforms the noisy features into enhanced ones, and the latter performs the reverse mapping. As illustrated in Fig. 1, a forward noisy-clean-noisy cycle and a backward clean-noisy-clean cycle jointly constrain $G_{X\to Y}$ and $G_{Y\to X}$ to be cycle-consistent, and both are optimized with an adversarial loss, a cycle-consistency loss, and an identity-mapping loss. Discriminators $D_X$ and $D_Y$ are trained to classify the target speech features as real and the generated speech features as fake.
Nevertheless, in the standard CycleGAN-based approaches, the enhanced signal always contains the original noise information and retains audible residual background noise due to the cycle-consistency constraint. Moreover, to the best of our knowledge, phase recovery has not been well investigated in previous CycleGAN-based SE approaches.
2.2 Two-stage approaches for noise reduction
Recently, some studies have tackled the original SE task with multi-stage networks, which can significantly improve the estimated speech quality. In hao2020masking, Hao et al. proposed a masking and inpainting SE approach for low SNR and non-stationary noise. This two-stage approach consists of binary masking and spectrogram inpainting. In the first stage, a binary masking model is trained to remove the T-F points dominated by severe noise in low SNR conditions and to retain the spectra with the T-F points dominated by clean speech, while in the second stage an inpainting model is used to recover the missing T-F points. To address the difficulty of estimating the phase spectrum, Du et al. introduced a joint framework composed of a Mel-domain denoising autoencoder and a deep generative vocoder for monaural speech enhancement du2020joint, in which the clean speech waveform is reconstructed without using the phase. The first stage enhances the Mel-power spectrum of noisy speech with the denoising autoencoder, while the deep generative vocoder synthesizes the speech waveform in the second stage. More recently, Li et al. proposed a two-stage approach named CTS-Net for SE in the complex domain li2021icassp, in which a coarse magnitude estimation network (CME-Net) and a complex spectrum refined network (CSR-Net) jointly optimize the noisy spectra. In the first stage, the target spectral magnitude is coarsely estimated and then coupled with the noisy phase to obtain a coarsely estimated complex spectrum. In the second stage, CSR-Net is trained to estimate both RI components of the clean complex spectrum, thus further reducing the residual noise, restoring the clean speech phase, and inpainting missing details of the estimated spectrum.
2.3 Deep complex neural networks for phase-aware SE
Conventional SE methods only estimate the magnitude of the speech, and the time-domain speech waveform is reconstructed from the noisy phase and the estimated magnitude. The phase was not enhanced because it was believed to be unimportant for SE wang1982unimportance, and because directly estimating the clean speech phase is intractable owing to its lack of structure. However, more recent studies paliwal2011importance show that an accurate phase can significantly improve perceptual speech quality, especially in low SNR conditions. Subsequently, some SE algorithms mowlaee2012phase; kulmer2014phase were developed to address this problem, and they consistently show objective speech quality improvements when the phase is enhanced.
More recently, deep learning-based phase-aware SE algorithms can be divided into two categories. The first operates in the time domain by estimating speech signals from raw noisy waveforms without using any explicit T-F representation pascual2017segan; baby2019sergan; pascual2019time; pandey2020densely, thereby avoiding the problem of phase estimation. However, it has been reported that the fine-detailed structures of noise and speech components are more separable in a T-F representation. Hence, another mainstream class of phase-aware SE approaches optimizes both RI components of the complex spectrum by using a complex-valued ratio mask (CRM) williamson2015complex or a direct complex-mapping network tan2019complex; tan2019learning; zheng2020interactive. These methods thereby estimate both magnitude and phase information in the frequency domain. Although the above approaches have been proposed to address this issue, they are limited because the neural network still conducts real-valued operations. To this end, the deep complex U-Net (named DCUNET) choi2019phase and the Deep Complex Convolution Recurrent Network (named DCCRN) hu2020dccrn were proposed to conduct phase-aware SE via complex-valued neural networks trabelsi2017deep. DCUNET incorporates a deep complex network and a U-Net structure to enhance the complex spectra of noisy speech. Note that DCUNET is trained to estimate a bounded CRM and optimizes the weighted source-to-distortion ratio (wSDR) loss venkataramani2017adaptive after reconstructing the enhanced time-domain waveform by the inverse STFT. DCCRN combines the advantages of DCUNET and CRN tan2018convolutional to estimate the complex spectra of clean speech, in which an LSTM is utilized to model temporal context with significantly fewer trainable parameters. However, using an LSTM to model temporal dependencies may cause a memory bottleneck, which can reduce training efficiency.
Motivated by these studies, we propose a two-stage deep complex approach, which incorporates a complex-valued denoising network (named DCD-Net) with a CycleGAN-based magnitude mapping network (named CycleGAN-MM). Specifically, CycleGAN-MM is adopted to coarsely estimate the clean spectral magnitude in the first step, while DCD-Net aims to further suppress the intractable residual background noise and simultaneously recover the clean phase information implicitly by estimating both RI components of the clean spectrum.
3.1 Training Target
The overall network architecture is presented in Fig. 2. It comprises two sub-networks, namely CycleGAN-MM and DCD-Net. In our SE task, the mixture signal in the time domain is formulated as $x(t) = s(t) + n(t)$, where $x(t)$, $s(t)$ and $n(t)$ denote noisy speech, clean speech and noise, respectively. With the STFT, the noisy speech in the time-frequency domain can be modeled as
$$X_{t,f} = S_{t,f} + N_{t,f}, \tag{1}$$
where $X_{t,f}$, $S_{t,f}$ and $N_{t,f}$ denote the time-frequency (T-F) representations of noisy speech, clean speech and noise, respectively. The T-F indices are omitted below for brevity. The input to CycleGAN-MM is the magnitude of the noisy spectrum $|X|$. After the first stage, the estimated magnitude is coupled with the original noisy phase $\theta_X$ to obtain a coarsely enhanced complex spectrum. In the second stage, DCD-Net receives the coarsely enhanced complex spectrum to estimate the CRM, which can be defined as
$$\mathrm{CRM} = \frac{X_r S_r + X_i S_i}{X_r^2 + X_i^2} + j\,\frac{X_r S_i - X_i S_r}{X_r^2 + X_i^2}, \tag{2}$$
where $X_r$ and $X_i$ denote the RI components of the noisy speech spectrum, respectively. Similarly, $S_r$ and $S_i$ indicate the RI components of the clean speech spectrum. The real and imaginary parts of the CRM are represented by $M_r$ and $M_i$. Alternatively, the polar coordinate representation of the CRM can be written as
$$\mathrm{CRM} = |M|\, e^{j\theta_M}, \quad |M| = \sqrt{M_r^2 + M_i^2}, \quad \theta_M = \operatorname{arctan2}(M_i, M_r), \tag{3}$$
where $|M|$ and $\theta_M$ denote the magnitude and phase of the complex-valued mask, respectively. To bound the mask within the unit circle in the complex plane, we use the tanh activation function to limit the mask magnitude to the range from 0 to 1, as in choi2019phase. Hence, the final estimated clean complex spectrum of DCD-Net in polar coordinates can be calculated by
$$\tilde{S} = |X| \cdot |\tilde{M}| \cdot e^{j(\theta_X + \theta_{\tilde{M}})}, \tag{4}$$
where $|\tilde{M}|$ and $\theta_{\tilde{M}}$ are the bounded magnitude and the phase of the estimated mask. Note that DCD-Net is optimized by signal approximation (SA), which directly minimizes the difference between the complex spectrum of the clean speech and that of the noisy speech after the bounded CRM is applied.
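The mask definition and its bounded polar application can be illustrated for a single T-F bin. The following is a minimal sketch, not the paper's implementation: the function names are hypothetical, tanh is assumed as the bounding activation (following the bounded mask of choi2019phase), and `eps` is an added numerical guard.

```python
import cmath
import math


def ideal_crm(noisy: complex, clean: complex, eps: float = 1e-12) -> complex:
    """Unbounded complex ratio mask for one T-F bin, so that mask * noisy ~= clean."""
    x_r, x_i = noisy.real, noisy.imag
    s_r, s_i = clean.real, clean.imag
    denom = x_r * x_r + x_i * x_i + eps  # eps is an assumed guard for empty bins
    return complex((x_r * s_r + x_i * s_i) / denom,
                   (x_r * s_i - x_i * s_r) / denom)


def apply_bounded_crm(noisy: complex, mask: complex) -> complex:
    """Apply a mask in polar form after squashing its magnitude with tanh."""
    bounded_mag = math.tanh(abs(mask))  # assumed bounding function, lies in [0, 1)
    phase = cmath.phase(noisy) + cmath.phase(mask)  # phases add under multiplication
    return abs(noisy) * bounded_mag * cmath.exp(1j * phase)


if __name__ == "__main__":
    noisy, clean = 3 + 4j, 1 - 2j
    mask = ideal_crm(noisy, clean)
    print(mask * noisy)  # close to the clean bin
    print(abs(apply_bounded_crm(noisy, mask)) <= abs(noisy))
```

Because the bounded magnitude never exceeds 1, the masked bin cannot be louder than the input bin, while the phase term is left free to rotate the noisy phase toward the clean one.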
3.2 CycleGAN-based Magnitude Mapping network
As shown in Fig. 2, two generators, dubbed $G_{X\to Y}$ and $G_{Y\to X}$, and two discriminators, dubbed $D_X$ and $D_Y$, are employed in CycleGAN-MM simultaneously. Each generator is composed of three components, namely three downsampling layers, six dilated residual attention (DRA) blocks and three corresponding upsampling layers; its detailed structure is illustrated in Fig. 3. Each downsampling/upsampling layer is composed of a 2D convolution/deconvolution layer, followed by instance normalization (IN), a parametric ReLU (PReLU) activation function and gated linear units (GLUs). GLUs control the information flow throughout the network dauphin2017language and have proven effective in modeling the sequential structure of speech. In the generators, we introduce temporal-frequency self-attention (T-F SA) and temporal-frequency attention gates (dubbed T-F AG) in the DRA blocks and the upsampling layers, respectively. T-F SA and T-F AG are used to capture relative contextual dependencies along the time and frequency dimensions and to directly pass on the salient information of the source speech features. Each discriminator is composed of six 2D convolution layers, each followed by spectral normalization (SN) and PReLU, which compress the feature maps into a high-level representation. SN was proposed to stabilize the training of the discriminator miyato2018spectral and is effective in avoiding vanishing or exploding gradients. The kernel size of each 2D convolution layer is (3, 5) along the temporal and frequency axes, respectively, except for the last layer, where it is (1, 1). The stride is set to (1, 2) along the temporal and frequency axes for the first five downsampling layers, while it is set to (1, 1) for the last layer. The numbers of channels of the six 2D convolution layers are (32, 32, 64, 64, 128, 1).
3.2.1 Temporal-frequency attention
The attention mechanism vaswani2017attention has been widely used in speech processing tasks, as it can leverage the contextual information in the time-frequency dimensions and further emphasize the salient speech information during feature learning. Moreover, using attention in the SE task can mimic human auditory perception, in which more attention is paid to speech and less to the surrounding background noise anderson2013dynamic. Following the terminology in tang2020joint, we compute the attention function on a set of queries, keys, and values simultaneously, and pack them together into feature maps $Q$, $K$ and $V \in \mathbb{R}^{B \times T \times F \times C}$, respectively. Here, $B$ denotes the batch size of the input features, $T$ denotes the number of frames, $F$ denotes the number of frequency bins and $C$ denotes the number of channels in each feature map. Inside the attention module, the query and key feature maps are first projected into two feature spaces by convolutions to calculate the attention maps. However, the size of the attention weight matrix in the original attention mechanism is $(T \cdot F) \times (T \cdot F)$, which would be extremely large and computationally expensive. To address this problem, we introduce temporal-frequency attention (T-FA) to capture the global dependencies along the temporal and frequency dimensions, respectively. As discussed in tang2020joint; zheng2020interactive, by factorizing the original attention into temporal attention (TA) and frequency attention (FA), the large attention weight matrix can be subdivided into two much smaller ones, i.e., $T \times T$ and $F \times F$. As shown in Fig. 4, the output of TA can be computed as,
where , , and , respectively. Here, is a learnable scalar coefficient and initialized as 0. Similarly, is employed after TA block, which can be expressed as,
where , , and , respectively. In the T-F AGs, the keys come from the output of the previous layer or the final DRA block, while the queries and values come from the output of the corresponding downsampling layers. For the T-F SA in the DRA blocks, the queries, keys and values all come from the same output of the previous dilated residual layers.
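The factorization into temporal and frequency attention can be sketched on a single feature map. This is a simplified illustration under stated assumptions, not the paper's module: the learned convolutional projections are replaced by identity projections (self-attention on the raw features), no score scaling is applied, the batch dimension is dropped, and `gamma` stands in for the learnable scalar that is initialized to 0. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Dot-product attention over a sequence of feature vectors."""
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def temporal_attention(x, gamma=0.0):
    """x[t][f][c]: attend across frames, so the weight matrix is T x T."""
    T, F, C = len(x), len(x[0]), len(x[0][0])
    seq = [[x[t][f][c] for f in range(F) for c in range(C)] for t in range(T)]
    att = attend(seq, seq, seq)
    # Residual connection scaled by gamma; gamma = 0 returns the input unchanged.
    return [[[gamma * att[t][f * C + c] + x[t][f][c] for c in range(C)]
             for f in range(F)] for t in range(T)]


def frequency_attention(x, gamma=0.0):
    """x[t][f][c]: attend across frequency bins, so the weight matrix is F x F."""
    T, F, C = len(x), len(x[0]), len(x[0][0])
    seq = [[x[t][f][c] for t in range(T) for c in range(C)] for f in range(F)]
    att = attend(seq, seq, seq)
    return [[[gamma * att[f][t * C + c] + x[t][f][c] for c in range(C)]
             for f in range(F)] for t in range(T)]


def attention_weight_counts(T, F):
    """Entries in the joint (T*F) x (T*F) map versus the two factorized maps."""
    return (T * F) ** 2, T * T + F * F
```

Applying `frequency_attention` to the output of `temporal_attention` mirrors the TA-then-FA order described above. The count helper shows why the factorization matters: the joint map grows with the square of the product of both dimensions, the factorized maps only with the sum of their squares.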
3.2.2 CycleGAN-MM Loss function
CycleGAN-MM uses the following three losses to jointly optimize the magnitude estimation process, namely relativistic adversarial losses, cycle-consistency losses, and an identity mapping loss.
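Relativistic adversarial loss: For the noisy-to-clean mapping, the relativistic average least-square (RaLS) adversarial loss jolicoeur2018relativistic is used to make the enhanced magnitude spectra $G_{X\to Y}(x)$ indistinguishable from the clean ones $y$, which can be expressed as
$$\mathcal{L}_{adv}(D_Y) = \mathbb{E}_{y}\big[(D_Y(y) - \mathbb{E}_{x}[D_Y(G_{X\to Y}(x))] - 1)^2\big] + \mathbb{E}_{x}\big[(D_Y(G_{X\to Y}(x)) - \mathbb{E}_{y}[D_Y(y)] + 1)^2\big], \tag{7}$$
$$\mathcal{L}_{adv}(G_{X\to Y}) = \mathbb{E}_{x}\big[(D_Y(G_{X\to Y}(x)) - \mathbb{E}_{y}[D_Y(y)] - 1)^2\big] + \mathbb{E}_{y}\big[(D_Y(y) - \mathbb{E}_{x}[D_Y(G_{X\to Y}(x))] + 1)^2\big], \tag{8}$$
where $x$ and $y$ are the magnitude spectrum of noisy speech and that of clean speech (e.g., $x \in X$, $y \in Y$), respectively. Here, $\mathcal{L}_{adv}(D_Y)$ indicates the adversarial loss of the discriminator $D_Y$, and $\mathcal{L}_{adv}(G_{X\to Y})$ indicates the adversarial loss of the noisy-to-clean generator $G_{X\to Y}$. In the above equations, the generator $G_{X\to Y}$ tries to generate enhanced magnitude spectra $G_{X\to Y}(x)$ that can deceive the discriminator $D_Y$, while $D_Y$ attempts to find the best decision boundary between the clean magnitude spectra $y$ and the enhanced ones $G_{X\to Y}(x)$. Similarly, we impose two relativistic average adversarial losses $\mathcal{L}_{adv}(D_X)$ and $\mathcal{L}_{adv}(G_{Y\to X})$ for the inverse clean-to-noisy mapping.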
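The relativistic average least-square objective can be written compactly on batches of raw discriminator scores. The sketch below follows the RaLSGAN form of jolicoeur2018relativistic; the function names are hypothetical, the scores are plain Python lists rather than tensors, and some implementations additionally halve each loss, which is omitted here.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def rals_discriminator_loss(d_real, d_fake):
    """Push real scores one unit above the average fake score, and vice versa."""
    mean_real, mean_fake = mean(d_real), mean(d_fake)
    real_term = mean([(r - mean_fake - 1.0) ** 2 for r in d_real])
    fake_term = mean([(f - mean_real + 1.0) ** 2 for f in d_fake])
    return real_term + fake_term


def rals_generator_loss(d_real, d_fake):
    """Mirror image of the discriminator loss: fake scores should exceed real ones."""
    mean_real, mean_fake = mean(d_real), mean(d_fake)
    fake_term = mean([(f - mean_real - 1.0) ** 2 for f in d_fake])
    real_term = mean([(r - mean_fake + 1.0) ** 2 for r in d_real])
    return fake_term + real_term


if __name__ == "__main__":
    d_real = [1.0, 1.0]  # discriminator scores on clean magnitude spectra
    d_fake = [0.0, 0.0]  # discriminator scores on enhanced magnitude spectra
    print(rals_discriminator_loss(d_real, d_fake))
    print(rals_generator_loss(d_real, d_fake))
```

Unlike a standard least-squares GAN, each score is compared against the batch average of the opposite class, so the generator is rewarded for making enhanced spectra look more realistic than the clean ones on average rather than for hitting a fixed target value.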
Cycle-consistency loss: With only the adversarial loss, $G_{X\to Y}$ can map the noisy feature space to any random permutation of the clean feature space, so many different learned mapping functions can produce an output distribution that matches the target distribution. Hence, we apply the cycle-consistency loss to limit the space of possible mapping functions and preserve speech context integrity, which can be defined as follows:
$$\mathcal{L}_{cyc} = \mathbb{E}_{x}\big[\|G_{Y\to X}(G_{X\to Y}(x)) - x\|_1\big] + \mathbb{E}_{y}\big[\|G_{X\to Y}(G_{Y\to X}(y)) - y\|_1\big], \tag{9}$$
where $\|\cdot\|_1$ indicates the $L_1$-norm reconstruction error.
Identity-mapping loss: We regularize the generators $G_{X\to Y}$ and $G_{Y\to X}$ to be close to identity mappings by minimizing the identity-mapping loss as in zhu2017unpaired, which is given by:
$$\mathcal{L}_{id} = \mathbb{E}_{y}\big[\|G_{X\to Y}(y) - y\|_1\big] + \mathbb{E}_{x}\big[\|G_{Y\to X}(x) - x\|_1\big], \tag{10}$$
where the magnitude spectra $y$ and $x$ of the target domains (i.e., $y \in Y$ and $x \in X$) are provided as the inputs to the generators (i.e., $G_{X\to Y}$ and $G_{Y\to X}$), respectively. It helps to preserve the composition (i.e., linguistic information) of the source domain and the target domain meng2018cycle, while encouraging the generators to better map to the target distribution.
Finally, the total loss function of CycleGAN-MM can be summarized as follows:
where and are tunable hyper-parameters, which are initialized as 5 and 10, respectively.
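How the three terms combine can be sketched with toy generators. This is an illustrative sketch only: the generators are arbitrary callables on flat lists, the L1 error is averaged over elements (an assumption), and the two weights are passed in explicitly under the hypothetical names `lambda_cyc` and `lambda_id` rather than hard-coded.

```python
def l1(a, b):
    """Mean absolute error between two equally long flat lists."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def cycle_loss(g_xy, g_yx, x, y):
    """Noisy-clean-noisy and clean-noisy-clean reconstruction errors."""
    return l1(g_yx(g_xy(x)), x) + l1(g_xy(g_yx(y)), y)


def identity_loss(g_xy, g_yx, x, y):
    """Each generator should leave samples of its own target domain unchanged."""
    return l1(g_xy(y), y) + l1(g_yx(x), x)


def total_generator_loss(adv, cyc, idt, lambda_cyc, lambda_id):
    """Weighted sum of adversarial, cycle-consistency and identity terms."""
    return adv + lambda_cyc * cyc + lambda_id * idt


if __name__ == "__main__":
    # Toy generators: "denoising" subtracts a constant offset, "noising" adds it.
    g_xy = lambda v: [e - 1.0 for e in v]
    g_yx = lambda v: [e + 1.0 for e in v]
    x, y = [2.0, 3.0], [1.0, 2.0]
    cyc = cycle_loss(g_xy, g_yx, x, y)
    idt = identity_loss(g_xy, g_yx, x, y)
    print(cyc, idt, total_generator_loss(0.5, cyc, idt, 10.0, 5.0))
```

The toy pair is perfectly cycle-consistent yet has a non-zero identity loss, which illustrates that the two terms constrain different things: the first ties the two mappings to each other, the second discourages a generator from altering inputs that already lie in its target domain.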
3.3 Deep Complex Denoising Net (DCD-Net)
After the first stage of enhancement, we combine the estimated magnitude with the noisy phase to obtain the coarsely enhanced complex spectrogram. In the second stage, DCD-Net is proposed to further suppress the background noise and simultaneously recover the clean phase spectrum. As shown in Fig. 5, DCD-Net consists of eight complex encoder/decoder layers and six complex temporal-frequency self-attention (CT-F SA) blocks.
3.3.1 Complex-valued encoder-decoder
The complex encoder/decoder layers are composed of complex-valued 2D convolution/deconvolution blocks, followed by complex instance normalization (IN) and complex PReLU (CPReLU) trabelsi2017deep. The complex IN and CPReLU apply instance normalization and the parametric ReLU activation, respectively, to both the real and imaginary values. We design the complex 2D convolution block according to that in DCUNET choi2019phase and DCCRN hu2020dccrn. The numbers of channels of the encoder layers are (32, 32, 64, 64, 128, 128, 256, 256), while the kernel size and stride are set to (3, 5) and (1, 2) along the time and frequency axes, respectively. In practice, a complex 2D convolution can be implemented as four traditional real-valued convolution operations, as presented in Fig. 6. For the complex-valued convolution operation, we define the complex input as $X = X_r + jX_i$. Meanwhile, the complex-valued convolutional filter is defined as $W = W_r + jW_i$, where $W_r$ and $W_i$ represent the real and imaginary parts of a complex convolution kernel, respectively. The complex output is obtained from the complex 2D convolution operation $X \circledast W$, which can be expressed as
$$F_{out} = (X_r \ast W_r - X_i \ast W_i) + j\,(X_r \ast W_i + X_i \ast W_r), \tag{12}$$
where $F_{out}$ is the output of the complex-valued 2D convolution.
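The decomposition of one complex convolution into four real-valued ones can be checked numerically. The sketch below is a toy illustration on nested lists, assuming a single channel, unit stride and no padding; `conv2d_valid` and `complex_conv2d` are hypothetical names, and the reference result is computed with Python's built-in complex type.

```python
def conv2d_valid(x, w):
    """2D 'valid' cross-correlation on nested lists (works for real or complex)."""
    rows, cols = len(x), len(x[0])
    k_rows, k_cols = len(w), len(w[0])
    return [[sum(x[i + a][j + b] * w[a][b]
                 for a in range(k_rows) for b in range(k_cols))
             for j in range(cols - k_cols + 1)]
            for i in range(rows - k_rows + 1)]


def complex_conv2d(x_r, x_i, w_r, w_i):
    """Complex convolution built from four real-valued convolutions."""
    rr = conv2d_valid(x_r, w_r)
    ii = conv2d_valid(x_i, w_i)
    ri = conv2d_valid(x_r, w_i)
    ir = conv2d_valid(x_i, w_r)
    out_r = [[a - b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(rr, ii)]
    out_i = [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(ri, ir)]
    return out_r, out_i


if __name__ == "__main__":
    x_r, x_i = [[1, 2, 3], [4, 5, 6]], [[0, 1, 0], [2, 0, 1]]
    w_r, w_i = [[1, 0], [0, 1]], [[0, 1], [1, 0]]
    out_r, out_i = complex_conv2d(x_r, x_i, w_r, w_i)
    # Reference: the same operation carried out directly on complex numbers.
    x_c = [[complex(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(x_r, x_i)]
    w_c = [[complex(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(w_r, w_i)]
    print(conv2d_valid(x_c, w_c))
    print(out_r, out_i)
```

In a deep learning framework the same identity lets a complex layer be built from two real-valued kernels applied to the real and imaginary feature maps, which is the reason no native complex convolution is required.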
3.3.2 Deep complex T-F attention
To make DCD-Net capable of capturing long-term temporal dependencies with complex-valued features, we introduce complex-valued attention blocks as proposed in yang2020complex. Following the temporal-frequency attention described in Section 3.2.1, we propose complex temporal-frequency self-attention (CT-F SA) blocks between the complex encoder and decoder in DCD-Net, which are calculated by applying T-FA in a complex-valued manner. Given the complex-valued input $X$, we first project it to the query matrix $Q$, the key matrix $K$ and the value matrix $V$, and then calculate the complex attention output, which is defined as:
where , , and denote the real-valued convolutional filter. and denote the real and the imaginary part of the complex attention result, respectively. Hence, the complex temporal attention (CTA) can be expressed as:
where TA denotes the temporal attention as mentioned above. Similarly, the complex frequency attention (CFA) is calculated by using FA in the complex-valued attention manner.
3.4 Loss function
The loss function of the proposed two-stage model is defined as follows. In the first step, we pretrain CycleGAN-MM alone with its total loss, calculated as in Eq. (11), until convergence. Then, CycleGAN-MM and DCD-Net are jointly trained, where the parameters of the first sub-network are initialized with the pretrained CycleGAN-MM, and the overall loss function is expressed as:
where $\mathcal{L}_{Mag}$ and $\mathcal{L}_{RI}$ denote the loss functions on the spectral magnitude and the RI components, respectively. Here, $\tilde{S}_r$ and $\tilde{S}_i$ represent the RI components of the estimated clean speech spectrum, while $S_r$ and $S_i$ represent the target RI components of the clean speech spectrum.
In our experiments, we choose two public datasets for comparison. We first evaluate the proposed models as well as several SOTA baselines on the widely used VoiceBank + DEMAND dataset, and then further evaluate our model on the WSJ0-SI84 + DNS-Challenge dataset.
VoiceBank + DEMAND: This widely used dataset, proposed in valentini2016investigating, is a selection of 30 speakers from the Voice Bank corpus veaux2013voice. The training set includes 11572 utterances from 28 speakers in the same accent region (England), while the test set contains 824 utterances from two speakers (one male and one female). The total duration of the training set is around 10 hours and that of the test set is around 30 minutes. For both the training and test sets, the average speech signal length is three seconds. For the training set, each audio sample is mixed with one of 10 noise types (2 artificial and 8 from the DEMAND database thiemann2013diverse) at four SNRs of 0, 5, 10 and 15 dB. The noise comes from different environments including offices, public spaces, transportation stations, and streets. The test set is created with 5 test-noise types (all from the DEMAND database, but totally unseen in the training set) at SNRs of 2.5, 7.5, 12.5 and 17.5 dB. The chosen test noise types include living room, office, bus, and street noise.
WSJ0-SI84 + DNS-Challenge: For further evaluation, we use the WSJ0-SI84 dataset paul1992design, which includes 7138 utterances by 83 speakers (42 males and 41 females). In our experiments, we split 5428 and 957 utterances by 77 speakers into the training set and validation set, respectively. For the test set, we use two subsets of 150 utterances each: in the first, the speakers are totally unseen in the training set, while in the second, the speakers are included in the training set. For the mixed set, we randomly select 20000 noises from the DNS-Challenge (https://github.com/microsoft/DNS-Challenge) to obtain a 55-hour noise set for training. During each mixing process, a randomly cut noise segment is mixed with a randomly selected clean utterance. In this way, we create 15000 and 1500 noisy-clean pairs at SNRs of [-5 dB, -4 dB, -3 dB, -2 dB, -1 dB, 0 dB] for training and validation, respectively. The total duration of the training set is about 30 hours. For testing, we select two noises (i.e., babble and factory1) from NOISEX92 varga1993assessment to obtain a total of 900 utterances (450 for seen speakers, 450 for unseen speakers) at three SNRs of -5 dB, 0 dB and 5 dB.
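The mixing step, in which a randomly cut noise segment is added to a clean utterance at a target SNR, can be sketched as follows. This is a minimal sketch under stated assumptions: signals are plain lists of samples, SNR is defined from average power over the whole utterance, the noise recording is at least as long as the utterance, and `mix_at_snr` is a hypothetical helper name.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db, rng):
    """Cut a random noise segment and scale it so the mixture has the target SNR.

    Returns the mixture and the scaled noise segment.
    Assumes len(noise) >= len(clean) and a noise segment with non-zero power.
    """
    start = rng.randrange(len(noise) - len(clean) + 1)
    segment = noise[start:start + len(clean)]
    clean_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum(n * n for n in segment) / len(segment)
    # Solve clean_power / (scale^2 * noise_power) = 10^(snr_db / 10) for scale.
    scale = math.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled = [scale * n for n in segment]
    mixture = [s + n for s, n in zip(clean, scaled)]
    return mixture, scaled


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [math.sin(0.1 * i) for i in range(200)]           # stand-in utterance
    noise = [rng.uniform(-1.0, 1.0) for _ in range(1000)]     # stand-in noise file
    mixture, scaled = mix_at_snr(clean, noise, snr_db=-5.0, rng=rng)
    achieved = 10.0 * math.log10(
        (sum(s * s for s in clean) / len(clean))
        / (sum(n * n for n in scaled) / len(scaled)))
    print(round(achieved, 6))
```

Scaling the noise rather than the speech keeps the clean reference untouched, so the same clean utterance can serve as the training target at every SNR.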
4.2 Implementation Setup
The original raw waveforms are downsampled from 48 kHz to 16 kHz. A Hanning window of length 20 ms is used to produce a set of time frames, with 50% overlap between adjacent frames, and the STFT length is 320. When training on the VoiceBank dataset, we randomly crop a fixed-length segment (128 frames) with the batch size set to 8. For the WSJ0-SI84 dataset, utterances are chunked to a maximum length of 8 seconds and the batch size is set to 16. We adopt the Adam optimizer kingma2014adam with momentum terms $\beta_1$ and $\beta_2$. In the first stage (20 epochs), we only train CycleGAN-MM, with initial learning rates (LR) of 0.0002 for the discriminators and 0.0005 for the generators. We use the identity-mapping loss only for the first 20 epochs to guide composition preservation. In the second stage, DCD-Net is jointly trained with the pre-trained CycleGAN-MM, with the learning rates set to 0.001 and 0.0001 for DCD-Net and CycleGAN-MM, respectively. The learning rates are kept constant for the first 50 epochs and then decay linearly over the remaining epochs. We train the proposed model for 80 epochs on the WSJ0-SI84 + DNS dataset and for 100 epochs on the Voice Bank + DEMAND dataset.
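The framing parameters and the learning-rate schedule described above reduce to simple arithmetic. The following sketch makes two assumptions that the text does not state: epochs are counted from 0, and the linear decay ends at zero after the final epoch. Both function names are hypothetical.

```python
def stft_frame_params(sample_rate_hz=16000, window_ms=20, overlap=0.5):
    """Window length, hop size (in samples) and number of one-sided frequency bins."""
    window = sample_rate_hz * window_ms // 1000
    hop = int(window * (1.0 - overlap))
    bins = window // 2 + 1  # one-sided spectrum when the FFT length equals the window
    return window, hop, bins


def learning_rate(epoch, base_lr, total_epochs, constant_epochs=50):
    """Constant rate for the first epochs, then linear decay (assumed to reach zero)."""
    if epoch < constant_epochs:
        return base_lr
    remaining = total_epochs - constant_epochs
    return base_lr * max(0.0, (total_epochs - epoch) / remaining)


if __name__ == "__main__":
    print(stft_frame_params())
    for epoch in (0, 49, 50, 65, 79):
        print(epoch, learning_rate(epoch, base_lr=0.001, total_epochs=80))
```

With a 20 ms window at 16 kHz, the window length matches the stated STFT length of 320 samples, so each frame yields 161 frequency bins and consecutive frames are 10 ms apart.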
4.3 Evaluation Metrics
We use the following metrics to evaluate the objective and subjective quality of the enhanced speech. The objective metrics measure the similarity between the enhanced signal and the clean reference of the test set files. The subjective quality is evaluated by DNSMOS reddy2020dnsmos, which is a robust nonintrusive perceptual speech quality metric designed to stack rank noise suppressors with great accuracy. Higher values of all metrics indicate better performance.
PESQ: The perceptual evaluation of speech quality (PESQ) rix2001perceptual score is the most commonly used metric to evaluate speech quality. We use the wide-band version recommended in ITU-T P.862.2 (from -0.5 to 4.5).
STOI: Short-time objective intelligibility (STOI) taal2010short is a robust measure of the effect of nonlinear processing of noisy speech (e.g., noise reduction) on speech intelligibility. The value of STOI ranges from 0 to 1.
SSNR: Segmental signal-to-noise ratio, ranging from 0 to $\infty$.
CSIG: MOS prediction of the speech signal distortion, ranging from 1 to 5 hu2007evaluation, where MOS denotes the mean opinion score.
CBAK: MOS prediction of the intrusiveness of background noise, ranging from 1 to 5 hu2007evaluation.
COVL: MOS prediction of the overall effect, ranging from 1 to 5 hu2007evaluation.
DNSMOS: The speech quality ratings of the processed clips varied from very poor (MOS=1) to excellent (MOS=5) reddy2020dnsmos.
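Of the metrics listed above, the segmental SNR is simple enough to sketch directly. This is an illustrative version only, not the evaluation code used for the reported scores: it uses non-overlapping frames, a hypothetical default frame length of 320 samples, and optional per-frame clamping limits that are left disabled by default because the text does not specify them.

```python
import math


def segmental_snr(clean, enhanced, frame_len=320, min_db=None, max_db=None,
                  eps=1e-10):
    """Average per-frame SNR in dB between a clean reference and an enhanced signal.

    Requires at least one full frame; eps is an added guard against log(0).
    """
    frame_values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        est = enhanced[start:start + frame_len]
        signal_energy = sum(v * v for v in ref)
        error_energy = sum((a - b) ** 2 for a, b in zip(ref, est))
        snr_db = 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))
        if min_db is not None:
            snr_db = max(snr_db, min_db)  # optional lower clamp (assumption)
        if max_db is not None:
            snr_db = min(snr_db, max_db)  # optional upper clamp (assumption)
        frame_values.append(snr_db)
    return sum(frame_values) / len(frame_values)


if __name__ == "__main__":
    clean = [1.0] * 8
    enhanced = [0.9] * 8  # every sample is off by 0.1
    print(segmental_snr(clean, enhanced, frame_len=4))
```

Averaging in the dB domain over short frames weights quiet and loud segments equally, which is why this metric reacts more strongly to residual noise in low-energy regions than a single utterance-level SNR does.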
For the comparison on Voice Bank + DEMAND dataset, we adopt both GAN-based and Non-GAN-based methods, which are summarized as follows:
SEGAN pascual2017segan is the first SE approach based on the adversarial framework and works end-to-end on the raw audio. It applies skip connections in the generator, connecting each encoding layer to its corresponding decoding layer.
MMSE-GAN soni2018time introduces a time-frequency masking-based SE approach based on a modified GAN and learns the mask implicitly while predicting the clean T-F representation.
RSGAN and RaSGAN baby2019sergan introduce relativistic GANs with a relativistic cost function at their discriminators and use a gradient penalty to improve speech enhancement performance in the time domain. Note that RSGAN-GP employs the relativistic binary cross-entropy loss, while RaSGAN-GP employs the relativistic average binary cross-entropy loss.
CP-GAN liu2020cp is a novel GAN-based SE system for coarse-to-fine noise suppression, which contains a densely-connected feature pyramid generator and a dynamic context granularity discriminator.
MetricGAN fu2019metricgan aims to optimize the generator with respect to one or multiple evaluation metrics such as PESQ and STOI, thus guiding the generators in GANs to generate data with improved metric scores.
Wave-U-Net stoller2018wave uses the U-Net architecture for speech enhancement, which performs end-to-end audio source separation directly in the time domain.
LSTM weninger2014discriminatively and BiLSTM erdogan2015phase are two RNN-based speech enhancement approaches. Both have two layers of RNN cells followed by a third, fully connected layer.
proposes a fully-convolutional context aggregation network using a deep feature loss at the raw waveform level, which is based on comparing the internal feature activations in a different network.
CRN-MSE tan2018convolutional is a typical convolutional recurrent network with encoder-decoder architecture, and MSE loss is used on the estimated and clean log-magnitude spectrogram. Note that we directly use the reported scores of RSGAN, RaSGAN, LSTM, BiLSTM and CRN-MSE from zhang2020loss.
TFSNN yuan2020time proposes a time-frequency smoothing neural network for SE, which effectively models the correlation in the time and frequency dimensions by using LSTM and CNN, respectively.
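The difference between the two relativistic discriminator losses mentioned above for RSGAN-GP and RaSGAN-GP can be illustrated with a minimal sketch. It assumes the standard relativistic formulations, in which raw discriminator scores are compared either pairwise or against the batch mean of the opposite class. The function names are hypothetical, the gradient penalty is omitted, and this is not the baselines' actual code.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x)); note -log(sigmoid(x)) == softplus(-x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def _mean(values):
    return sum(values) / len(values)


def rsgan_d_loss(real_scores, fake_scores):
    """Relativistic BCE: each real score is compared with its paired fake score."""
    return _mean([_softplus(-(r - f)) for r, f in zip(real_scores, fake_scores)])


def rasgan_d_loss(real_scores, fake_scores):
    """Relativistic average BCE: scores are compared with the mean of the other class."""
    mean_real, mean_fake = _mean(real_scores), _mean(fake_scores)
    # -log(sigmoid(r - mean_fake)) for the real samples.
    real_term = _mean([_softplus(-(r - mean_fake)) for r in real_scores])
    # -log(1 - sigmoid(f - mean_real)) for the fake samples.
    fake_term = _mean([_softplus(f - mean_real) for f in fake_scores])
    return real_term + fake_term
```

In both variants the discriminator is rewarded for ranking real samples above fake ones rather than for absolute scores; the averaged variant replaces the pairwise comparison with a comparison against the opposite-class mean.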
To further evaluate the proposed method at different SNRs on WSJ0-SI84 + DNS, we re-implement three state-of-the-art baselines, namely Noncausal-GCRN tan2019learning, DCUNET-20 choi2019phase and Noncausal-DCCRN hu2020dccrn. Noncausal-GCRN tan2019learning is a complex spectral mapping network based on CRN, where both the real and imaginary components are estimated. Notably, to make a fair comparison, we re-implement GCRN in a non-causal configuration, where a bi-directional LSTM is used for temporal modeling. DCUNET-20 choi2019phase adopts complex-valued building blocks and a bounded complex ratio mask (CRM) to deal with the complex-valued spectrum. We use the structure of DCUNET-20 with the stride along the time dimension set to 1. The channels in the encoders are set to [8,16,32,32,64,64,128,128,256,256]. Noncausal-DCCRN hu2020dccrn introduces a deep complex convolution recurrent network for speech enhancement, where both CNN and RNN structures handle complex-valued operations. We re-implement DCCRN and employ a bi-directional LSTM to train the model in a non-causal configuration. Note that we also implement the bounded CRM and optimize the system with the SI-SNR loss.
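For readers unfamiliar with the two ingredients named above, the following is a minimal sketch of an SI-SNR loss and a tanh-bounded complex ratio mask. It assumes the commonly used definitions (zero-mean projection for SI-SNR, and a mask whose magnitude is squashed by tanh while its phase is kept). All function names, the `eps` constant and the list-based signal representation are illustrative assumptions, not the re-implemented baselines' code.

```python
import cmath
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length real waveforms."""
    # Remove the mean (a common convention, assumed here).
    est_mean = sum(estimate) / len(estimate)
    tgt_mean = sum(target) / len(target)
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target to obtain the scaled target part.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    s_target = [dot / tgt_energy * t for t in tgt]
    e_noise = [e - s for e, s in zip(est, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(n * n for n in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def si_snr_loss(estimate, target):
    """Negative SI-SNR, so that minimizing the loss maximizes SI-SNR."""
    return -si_snr(estimate, target)


def bounded_crm(raw_mask):
    """Bound the mask magnitude to [0, 1) with tanh while keeping its phase."""
    return cmath.rect(math.tanh(abs(raw_mask)), cmath.phase(raw_mask))


def apply_mask(raw_mask_bins, noisy_bins):
    """Complex multiplication of the bounded mask with each noisy T-F bin."""
    return [bounded_crm(m) * y for m, y in zip(raw_mask_bins, noisy_bins)]
```

Because the mask is complex, the multiplication in `apply_mask` rotates the phase of each noisy bin as well as scaling its magnitude, which a real-valued mask cannot do.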
5 Results and Analysis
5.1 Ablation study
We first investigate the effectiveness of different attention mechanisms and loss functions on the VoiceBank + DEMAND dataset. As shown in Table 2, we take the CycleGAN with self-attention (SA) and least-square (LS) GAN loss as the baseline. Besides, we implement a CycleGAN-based one-stage network for complex spectral mapping (dubbed CycleGAN-CM), which uses the same configuration as CycleGAN-MM(III) except that two decoders separately decode the real and imaginary components. From the results, we make the following observations.
When only the first stage is trained, CycleGAN-MM using the RaLS loss achieves improvements of 0.07 PESQ, 0.49dB SSNR, 0.08 CSIG, 0.06 CBAK and 0.09 COVL over using the conventional LS loss, while using T-FA results in better performance than using SA. CycleGAN-MM(III), which integrates the RaLS loss and T-FA, obtains improvements of 0.12 PESQ, 1.03dB SSNR, 0.5% STOI, 0.13 CSIG, 0.05 CBAK and 0.17 COVL over CycleGAN-MM(I). Additionally, when handling one-stage complex spectral mapping, CycleGAN-CM achieves worse performance than CycleGAN-MM. This reveals the difficulty that conventional GAN-based systems have in directly mapping the complex spectrum, which is likely caused by the unstable training of the generators and discriminators. In other words, optimizing both RI components is an intractable challenge for GAN-based SE approaches, which therefore perform worse than when only the magnitude is estimated.
Subsequently, in the two-stage systems, DCD-Net is jointly trained with CycleGAN-MM(III) as the first-stage system. When only using the RI components loss in DCD-Net, we obtain a marginal improvement over CycleGAN-MM in reducing background noise (e.g., 1.49dB SSNR and 0.25 CBAK improvements), while achieving similar PESQ and STOI. This reveals the necessity and significance of the second stage in residual noise suppression. After adding the magnitude MSE loss between the estimated and clean spectra, CycleGAN-DCD(II) improves PESQ by 0.10, STOI by 1.2% and CSIG by 0.23. This indicates that using both RI and Mag losses in DCD-Net achieves better performance in speech quality (i.e., PESQ), speech intelligibility (i.e., STOI) and speech distortion (i.e., CSIG). Note that both CycleGAN-DCD(I) and CycleGAN-DCD(II) are trained without any attention block in DCD-Net. Compared with CycleGAN-DCD(II), CycleGAN-DCD(III) trained with CT-F SA blocks provides gains of 0.08 on PESQ, 0.41dB on SSNR, 0.10 on CSIG, 0.09 on CBAK and 0.10 on COVL. These results verify the effectiveness of the proposed attention mechanism and loss function for improving speech quality in terms of all objective metrics.
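The combination of an RI loss and a magnitude MSE loss discussed above can be sketched as follows. This is only an illustrative form: the equal default weighting (`mag_weight`), the averaging over T-F bins and the function name are assumptions, and the exact loss used for DCD-Net is the one defined in the method section.

```python
def ri_mag_loss(est_bins, clean_bins, mag_weight=1.0):
    """MSE on real/imaginary parts plus a weighted MSE on magnitudes.

    est_bins and clean_bins are equal-length sequences of complex T-F bins.
    """
    n = len(est_bins)
    # RI term: squared error on the real and imaginary components.
    ri = sum((e.real - c.real) ** 2 + (e.imag - c.imag) ** 2
             for e, c in zip(est_bins, clean_bins)) / n
    # Mag term: squared error on the spectral magnitudes only.
    mag = sum((abs(e) - abs(c)) ** 2 for e, c in zip(est_bins, clean_bins)) / n
    return ri + mag_weight * mag
```

The magnitude term is insensitive to phase errors, so it adds an explicit penalty on magnitude mismatch on top of the RI term, while the RI term alone carries the phase information.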
Fig. 8 shows the spectrograms of the clean utterance, the noisy utterance and the utterances enhanced by CycleGAN-MM and CycleGAN-DCD. From the figure, we observe that CycleGAN-DCD can effectively suppress the noise components that CycleGAN-MM struggles to eliminate. For example, as shown in the area marked in red in Fig. 8 (c) and (d), CycleGAN-DCD achieves better performance under the pure background noise condition. Besides, the area marked in green shows that CycleGAN-DCD can also effectively suppress the unnatural residual noise while preserving the speech components well when background noise and speech are heavily mixed.
5.2 Comparison with different two-stage structures
To validate the efficacy of DCD-Net and the proposed two-stage complex structure, we also conduct experiments on the VoiceBank + DEMAND dataset. Specifically, we first compare DCD-Net with other existing complex spectrum enhancement networks (i.e., Noncausal-GCRN tan2019learning, Noncausal-DCCRN hu2020dccrn) and then design their corresponding two-stage structures (i.e., CycleGAN-GCRN, CycleGAN-DCCRN) for comparison. Note that the numbers of trainable parameters of Noncausal-GCRN, Noncausal-DCCRN and DCD-Net are 9.8 million, 3.7 million and 3.5 million, respectively. As shown in Table 4, one can observe the following phenomena. Firstly, compared with the real-valued complex-mapping network, DCD-Net outperforms GCRN in all metrics by a large margin. For example, DCD-Net provides average improvements of 0.19 PESQ, 0.21 CSIG, 0.03 CBAK and 0.12 COVL over GCRN with relatively lower model complexity, which indicates the merit of complex-valued networks. Secondly, DCD-Net surpasses another complex-valued network (i.e., DCCRN) in terms of COVL and CBAK scores, while providing similar PESQ and STOI scores. This indicates that DCD-Net can effectively suppress the background noise and reduce the speech distortion simultaneously. Finally, to demonstrate the merit of the proposed two-stage combination, we combine CycleGAN-MM with GCRN, DCCRN and DCD-Net as different two-stage structures (i.e., CycleGAN-GCRN, CycleGAN-DCCRN and the proposed CycleGAN-DCD) for comparison. Note that we also use the bounded CRM in CycleGAN-GCRN for better performance. Among the different two-stage methods, CycleGAN-DCD performs consistently better than the real-valued CycleGAN-GCRN by a significant margin, and it outperforms the complex-valued CycleGAN-DCCRN in terms of PESQ, STOI, CBAK and COVL. This validates that the proposed CycleGAN-DCD surpasses other two-stage structures, including both real-valued and complex-valued methods.
5.3 Comparison with the State-of-the-Art
Table 3 shows the comparisons with the mentioned baselines on the VoiceBank + DEMAND dataset. Note that CycleGAN-MM and CycleGAN-DCD employ the best configurations from the ablation study. First, we observe that CycleGAN-DCD achieves a notable improvement over most existing GAN-based methods. For example, CycleGAN-DCD exceeds SEGAN by a large margin in PESQ, STOI, CSIG, CBAK and COVL, with gains of 0.73, 1.8%, 0.76, 0.63 and 0.69, respectively. Compared with the more recently proposed CP-GAN and MetricGAN, our model still achieves better performance in speech quality and speech intelligibility. Note that the consistent improvements in CSIG, CBAK and COVL also indicate that CycleGAN-DCD performs better in preserving speech integrity while removing the background noise. Then, when it comes to recently proposed Non-GAN methods, the proposed model also achieves better performance across most metrics. For example, CycleGAN-DCD provides improvements of 0.38 CSIG, 0.24 CBAK and 0.27 COVL over DFL-SE, while CycleGAN-MM gets a lower CBAK and a similar COVL compared with DFL-SE. This indicates that the two-stage denoising system demonstrates consistently superior performance in background noise suppression, while further reducing the speech distortion and improving the overall effect.
Table 4 shows the comparisons with DCUNET, GCRN and DCCRN on the WSJ0-SI84 + DNS dataset. Firstly, we observe that DCUNET, Noncausal-DCCRN and Noncausal-GCRN obtain better performance than CycleGAN-MM under different noise conditions. This is because CycleGAN-MM only estimates the magnitude spectrum and reuses the noisy phase to reconstruct the waveform, which causes severe phase distortion under low SNRs. For example, Noncausal-GCRN provides average PESQ improvements of 0.28 and 0.27 over CycleGAN-MM on Babble and Factory1 noises, along with STOI improvements of 3.71% and 1.86%. Secondly, when adding the denoising net to refine the coarsely enhanced complex spectrum, CycleGAN-DCD outperforms the one-stage model by a large margin in all metrics. For example, CycleGAN-DCD provides average PESQ improvements of 0.32 and 0.30 over CycleGAN-MM on Babble and Factory1 noises, along with SSNR gains of 0.99dB and 1.03dB. This indicates the necessity and significance of the proposed DCD-Net in improving the speech quality and intelligibility, while further suppressing the residual noise. It can also be observed that the proposed two-stage model consistently outperforms the baselines in terms of all metrics. For example, compared with the best baseline, Noncausal-GCRN, CycleGAN-DCD obtains average improvements of 0.06, 0.33% and 0.21dB in terms of PESQ, STOI and SSNR, respectively.
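The effect of reusing the noisy phase can be illustrated with a toy single-bin example. The sketch below assumes a perfectly estimated magnitude and noise in quadrature with the speech; the values and function names are hypothetical and are not taken from the experiments. It only shows that, in this toy setting, the residual error caused by the noisy phase grows as the local SNR drops.

```python
import cmath
import math


def recombine_with_noisy_phase(est_magnitude, noisy_bin):
    """Pair an estimated magnitude with the phase of the noisy T-F bin."""
    return cmath.rect(est_magnitude, cmath.phase(noisy_bin))


def phase_limited_error(clean_bin, noise_bin):
    """Error that remains when the magnitude is exact but the phase is noisy."""
    noisy_bin = clean_bin + noise_bin
    estimate = recombine_with_noisy_phase(abs(clean_bin), noisy_bin)
    return abs(estimate - clean_bin)


# Toy clean bin with unit magnitude (hypothetical value).
clean = cmath.rect(1.0, 0.3)
for snr_db in (10.0, 0.0, -5.0):
    # Noise in quadrature with the speech; its magnitude sets the local SNR.
    noise = cmath.rect(10 ** (-snr_db / 20.0), 0.3 + math.pi / 2)
    print(snr_db, round(phase_limited_error(clean, noise), 3))
```

A magnitude-only system cannot reduce this error no matter how accurate its magnitude estimate is, which is the motivation for refining the complex spectrum in a second stage.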
Besides, the CSIG, CBAK and COVL improvements (i.e., ΔCSIG, ΔCBAK and ΔCOVL) over the unprocessed mixtures are shown in Fig. 9. We can observe that our proposed approach produces considerable improvements over all the baselines in all metrics, especially in CBAK. This reveals the superior capability of CycleGAN-DCD in reducing residual noise and speech distortion, while consistently improving the overall speech quality.
The evaluation of the subjective speech quality on the WSJ0-SI84 + DNS dataset is presented in Table 5. We can observe that our method yields the best performance on both seen and unseen speakers with the Babble and Factory1 noise types. For example, CycleGAN-DCD yields average DNSMOS improvements of 0.27 and 0.30 over the one-stage CycleGAN-MM for the Factory1 and Babble noise types, respectively. This indicates that our two-stage system can dramatically improve the perceptual quality of enhanced speech over CycleGAN-MM and other baseline systems under various noisy conditions.
6 Conclusion and Future work
In this work, a deep complex-valued denoising sub-net is integrated into a CycleGAN-based magnitude mapping sub-net as a two-stage SE approach, which aims at estimating both the magnitude and phase of the clean speech spectrum. In the first stage, a CycleGAN-based network is trained to estimate the spectral magnitude with relativistic average least-square losses, cycle-consistency losses and an identity mapping loss. Then, the coarsely estimated magnitude is coupled with the original noisy phase as the input to a complex denoising net, which aims to suppress the residual noise and recover the clean phase. Notably, the denoising net directly estimates both RI components of the clean spectrum by applying a complex ratio mask. Additionally, the temporal-frequency attention mechanism is employed in both stages for modeling the global dependencies along the temporal and frequency dimensions, respectively. To the best of our knowledge, this is the first CycleGAN-based approach to estimate both the clean magnitude and phase information for single-channel SE. Experimental results on the VoiceBank and WSJ0-SI84 datasets verify that the proposed method outperforms the conventional one-stage CycleGAN-based SE model and other state-of-the-art GAN-based as well as Non-GAN-based baselines by a considerable margin.
In future work, we will investigate the proposed CycleGAN-DCD as a complex spectral mapping network for multi-microphone speech enhancement, in which accurate phase estimation is likely more essential. Considering the promising performance of power compression and phase estimation on the speech dereverberation task, as discussed in li2021importance, we will investigate using the compressed spectral magnitude as the input feature to the first stage. Besides, we will attempt to decompose the two-stage SE task into two much easier sub-tasks. In the first sub-task, we plan to employ a CycleGAN-based network to transform non-stationary noise into stationary noise, similar to noise whitening; in the second sub-task, we plan to use a denoising net to suppress the stationary noise.
This work was supported in part by the National Natural Science Foundation of China under Grant 61631016 and Grant 61501410, and in part by the Fundamental Research Funds for the Central Universities under Grant 3132018XNG1805. This work was also supported by the Open Research Project of the State Key Laboratory of Media Convergence and Communication, Communication University of China, China (No. SKLMCC2020KF005).