Signals contaminated by various kinds of background interference in daily life vastly degrade speech quality for human listeners. In this context, speech enhancement (SE) aims to improve hearing comfort by separating the human voice from noise signals. Traditional SE methods [boll1979suppression, hu2013cepstrum, gerkmann2011unbiased]
have been extensively studied over the past decades, achieving remarkable improvements in general but breaking down in non-stationary scenarios. Recently, this issue has been well addressed with the involvement of deep learning paradigms.
According to the training patterns they work on, existing single-channel SE methods can be classified into two categories: waveform-based time-domain speech estimation and time-frequency (T-F) domain spectrum reconstruction. As the spectrum contains more distinct feature patterns, suppressing noise in the T-F domain rather than the time domain appears more advantageous [li2021icassp, yin2020phasen].
Most previous studies explore the recovery of the magnitude spectrogram and directly incorporate the noisy phase for speech waveform reconstruction [tan2018gated, lan2020redundant, tai2021idanet]. However, as shown in Fig. 1, the unprocessed phase restricts the performance upper bound of magnitude-based SE methods. Based on this observation, Yin et al. [yin2020phasen] proposed a two-stream framework that utilizes magnitude information to facilitate phase prediction.
In this paper, we argue that phase information appears irregular, meaning that operating directly on the phase spectrum makes it intractable to model the phase accurately, regardless of the assistance of external knowledge. Considering the pros and cons of each training pattern, i.e., magnitude spectrum-based methods can exploit the apparent spectral regularity, whereas complex spectrum-based methods can implicitly estimate phase information, we present a unified two-branch framework that fosters the strengths and circumvents the weaknesses of the different paradigms. Concretely, we split an SE task into two sub-targets: one branch for direct magnitude spectrum reconstruction and the other for implicit phase estimation.
In the design of the two branches, we utilize information from the magnitude-based stream as an additional supervisory signal to facilitate feature processing in the complex-based stream. Meanwhile, the implicitly estimated phase from the complex-based stream serves as a substitute for the noisy one. Besides, inspired by expert learning [jain2019multi], we replace the regular convolution layer with our proposed collaborative expert block (CEB) and its variants to increase the model's capability of feature processing. Comprehensive experiments on the TIMIT dataset verify the effectiveness and superiority of our framework.
We design a speech enhancement system with two objectives in mind. First, it should make full use of the merits of each training paradigm. To this end, we propose a two-branch SE framework that recovers the magnitude and phase spectrum simultaneously. Second, each branch should avoid its corresponding drawbacks as much as possible. Considering that vocals are easily distinguishable on the magnitude spectrum, and that there exist potential associations between the magnitude and complex spectrum, we use external knowledge from the magnitude-based stream to calibrate feature processing in the complex-based stream. In addition, the phase information implicitly optimized by the complex-based stream is combined with the estimated magnitude spectrum from the magnitude-based stream to reconstruct the speech signal.
We present the diagram of the proposed approach in Fig. 2. In this paper, we denote $Y$
as the noisy complex spectrum after the short-time Fourier transform (STFT), while $\tilde{S}_r$ and $\tilde{S}_i$ are the estimated real and imaginary parts from the complex-based stream. Then, the estimated phase $\tilde{\theta}$ can be obtained from $(\tilde{S}_r, \tilde{S}_i)$ via the trigonometric function $\tilde{\theta} = \arctan\big(\tilde{S}_i / \tilde{S}_r\big)$. Finally, the estimated magnitude $\tilde{S}_{mag}$ from the magnitude-based stream is used along with $\tilde{\theta}$ to reconstruct the time-domain speech waveform by inverse STFT.
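As a concrete illustration, this fusion step can be sketched in a few lines of NumPy; `combine_mag_phase` is a hypothetical helper name and the shapes are placeholders, not part of the described system.

```python
import numpy as np

def combine_mag_phase(mag, s_r, s_i):
    """Fuse the magnitude-branch estimate with the phase implied by the
    complex-branch real/imaginary estimates, yielding a complex spectrum
    ready for the inverse STFT."""
    theta = np.arctan2(s_i, s_r)      # implicitly estimated phase
    return mag * np.exp(1j * theta)

# Sanity check: with oracle estimates, the clean spectrum is recovered exactly.
rng = np.random.default_rng(0)
S = rng.standard_normal((161, 10)) + 1j * rng.standard_normal((161, 10))
S_hat = combine_mag_phase(np.abs(S), S.real, S.imag)
```

Note that `arctan2` is preferred over `arctan` here because it resolves the phase quadrant from the signs of both components.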
2.2 The proposed two-branch framework
Both sub-networks use the complex spectrum as input and share a similar network topology, consisting of a standard encoder-decoder framework. Instead of using regular convolutional layers as the basic units of the encoder and decoder, the CEB and its variants are utilized to better capture the intrinsic harmonic features of the spectrum. In the bottleneck layer, we use the stacked temporal convolution modules (S-TCM) proposed in [li2021two] to capture both short- and long-range sequence dependencies. Fig. 2(c) presents the architecture of the S-TCM. It first squeezes the input feature into a lower dimension through a convolution layer. A self-gating operation is applied to better control the information flow, followed by a convolution layer. Each branch in the self-gating part uses 1-D convolution blocks with increasing dilation factors. We use 3 groups of S-TCMs as the sequential module, each of which stacks 6 S-TCM units with different dilation rates, namely (1, 2, 4, 8, 16, 32).
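To give a sense of the temporal context this bottleneck covers, the receptive field of the stacked dilated units can be computed directly; the kernel size of 3 used below is an assumption for illustration, not a figure stated above.

```python
def tcm_receptive_field(kernel_size=3, dilations=(1, 2, 4, 8, 16, 32), groups=3):
    """Number of frames covered by stacked dilated 1-D convolutions:
    each unit with dilation d extends the field by (kernel_size - 1) * d."""
    per_group = sum((kernel_size - 1) * d for d in dilations)
    return groups * per_group + 1

# 3 groups of 6 units with dilations (1, 2, 4, 8, 16, 32) and kernel size 3
# span 379 frames, i.e. roughly 3.8 s of context at a 10 ms frame hop.
```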
2.3 Collaborative expert block (CEB)
With the growing demand for speech-related services over the past few years, the scenarios requiring speech enhancement are becoming more diverse and complex. This trend raises a concern about current SE systems: do we need to train a separate model for each scenario to guarantee denoising performance? Considering that training multiple models is cumbersome from a commercialization point of view, we draw inspiration from expert systems and design the CEB (Fig. 2(d)) as a substitute for the regular convolutional layer. Note that for the encoder in the complex-based branch, we use a variant of the CEB dubbed CCEB (compensatory and collaborative expert block) to utilize additional supervisory signals from the magnitude-based stream.
We use two parallel convolutions as different experts. Different experts are expected to have different views on the same object. Each expert's output offers guidance and controls the other's information flow via a gating mechanism. To ease the parameter burden, we squeeze and excite the features along the channel dimension before and after expert learning by applying convolution layers.
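A minimal sketch of this mutual gating between the two experts, assuming the common sigmoid gate; the function names are illustrative rather than the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mutual_gating(expert_a, expert_b):
    """Each expert's output controls the other's information flow:
    a is modulated by a gate derived from b, and vice versa."""
    return expert_a * sigmoid(expert_b), expert_b * sigmoid(expert_a)

a = np.array([2.0, -1.0, 0.5])
b = np.zeros(3)                 # a zero gate input lets half of a pass through
gated_a, gated_b = mutual_gating(a, b)
```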
2.4 Loss function
Since the two sub-networks have different training targets, we apply different loss functions to optimize them until convergence. Specifically, we train the magnitude-based branch with a mean absolute error (MAE) loss on the magnitude, and the complex-based branch with MAE losses on the real and imaginary parts. The two branches are trained jointly, and the total loss is defined as:
$$\mathcal{L} = \alpha \frac{1}{N}\sum_{n=1}^{N}\big\lVert |\tilde{S}_n| - |S_n| \big\rVert_1 + (1-\alpha)\frac{1}{N}\sum_{n=1}^{N}\Big(\big\lVert \tilde{S}_{r,n} - S_{r,n} \big\rVert_1 + \big\lVert \tilde{S}_{i,n} - S_{i,n} \big\rVert_1\Big),$$
where $N$ refers to the number of training samples, $|\tilde{S}|$ and $|S|$ represent the enhanced and clean magnitude, $S_r$ and $S_i$ denote the real and imaginary parts of the ground truth, and $\alpha$ is the weighting coefficient, set to 0.5 in this work.
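The joint objective can be sketched as follows; the exact form of combining the two terms with α and (1 − α) is an assumption for illustration.

```python
import numpy as np

def two_branch_loss(mag_est, mag_ref, re_est, re_ref, im_est, im_ref, alpha=0.5):
    """Weighted sum of the magnitude MAE and the real/imaginary MAEs
    (assumed weighting: alpha on magnitude, 1 - alpha on the complex parts)."""
    l_mag = np.mean(np.abs(mag_est - mag_ref))
    l_ri = np.mean(np.abs(re_est - re_ref)) + np.mean(np.abs(im_est - im_ref))
    return alpha * l_mag + (1.0 - alpha) * l_ri
```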
We conducted our experiments on the TIMIT corpus [garofolo1993darpa]. 2000, 100, and 192 clean utterances are randomly selected for training, validation, and testing, respectively. We create the training and validation datasets under SNR levels ranging from -5 dB to 10 dB at 1 dB intervals, and generate the testing dataset under SNR conditions of (-5 dB, 0 dB, 5 dB, 10 dB). 100 non-speech noises from [hu2010tandem] and 5 everyday noises (cafeteria, restaurant, park, office, and meeting) from [thiemann2013demand] are used for training and validation. Another five types of noise (babble, f16, factory2, m109, and white) from [varga1993assessment]
are used to evaluate the generalization capacity of all networks. All noises are first concatenated into a long vector. Then, an excerpt of this vector with the same length as the clean utterance is randomly chosen, scaled, and mixed with the given clean signal to produce the desired SNR condition. As a result, we generate approximately 36 hours of data for training, 1.5 hours for validation, and 1 hour for testing.
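The mixing procedure above can be sketched as follows: the noise excerpt is scaled so that the ratio of clean to noise power matches the target SNR before summation.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale a noise excerpt (same length as clean) to hit the desired SNR,
    then mix it with the clean signal."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```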
3.2 Experimental setting
All utterances are sampled at 16 kHz. A 20 ms Hamming window is used, with 50% overlap between adjacent frames. A 320-point FFT is applied, leading to 161-D spectral features. The batch size is set to 2, and utterances are zero-padded to match the length of the longest in a mini-batch. All networks are trained for 30 epochs with the learning rate fixed at 0.0002, using the Adam optimizer. Short-Time Objective Intelligibility (STOI) [taal2010short] and Perceptual Evaluation of Speech Quality (PESQ) [rix2001perceptual] are selected for speech quality evaluation.
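These framing parameters are mutually consistent, as a quick arithmetic check shows:

```python
sr = 16000                 # sampling rate in Hz
win = int(0.020 * sr)      # 20 ms Hamming window -> 320 samples
hop = win // 2             # 50% overlap -> 160-sample frame hop
n_fft = 320                # FFT size equals the window length
n_bins = n_fft // 2 + 1    # one-sided spectrum -> 161-D spectral features
```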
We compare our proposed model with several baselines, including CCRN [tan2019complex], GCRN [tan2019gcrn], PHASEN [yin2020phasen], and CTS-Net [li2021two], all of which consider phase optimization. Specifically, CCRN is a complex spectrum-based model that incorporates convolution and LSTM. GCRN extends CCRN by replacing regular convolutions with gated linear units. PHASEN explicitly estimates the phase with the assistance of the magnitude spectrum. CTS-Net is a two-stage approach that successively models the magnitude and complex spectrum.
We use a complex mapping MAE loss for the pure complex-domain methods (CCRN, GCRN), while we use both complex mapping and magnitude mapping MAE losses for the others (PHASEN, CTS-Net, ours). Besides, as PHASEN originally operates on spectrograms computed with a 512-point FFT, we adapt it to a 320-point FFT for fair comparison.
3.4 Experimental results
Table 1 shows the SE performance of our model and existing state-of-the-art models. We can see that methods that additionally consider the magnitude spectrum yield better results than methods that merely use the complex spectrum under low SNR conditions. When the SNR is high, the influence of the magnitude tends to be small. The rationale behind this phenomenon is that, as the SNR degrades, the clearer harmonic structure in the magnitude spectrum offers a better direction for network optimization. Our proposed model outperforms the previous best models, PHASEN and CTS-Net, on both metrics. Compared with PHASEN, we adopt a roundabout way to obtain the phase from the real and imaginary spectra rather than directly estimating phase information, which avoids the difficulty brought by its irregular texture. In addition, unlike CTS-Net, which simply supplies the magnitude knowledge at the beginning of the phase estimation, we provide a more fine-grained information-sharing scheme by utilizing synchronous knowledge to calibrate information at each layer.
3.5 Visualizing phase influence
The benefit of our framework for phase prediction is visualized in Fig. 3, where we plot phase differences to illustrate how the magnitude-based stream compensates for explicit and implicit phase estimation. With the assistance of external knowledge from the magnitude-based stream, both PHASEN and our model are able to identify texture borders. The phase spectrum estimated by our framework is much closer to the ground truth, especially in the areas between harmonics (white box), effectively improving the human auditory perception of the estimated speech. This observation verifies our assumption that exploring the potential regularity of the chaotic phase spectrum is much more difficult than that of the complex spectrum.
3.6 Ablation study
In an ablation study, we compare our framework with three variants to investigate the effect of each part. Among these variants, w/o phase means that we use the estimated magnitude and the unaltered noisy phase to restore the speech; w/o multi-experts means that we replace each CEB/CCEB with a regular convolution; w/o compensation means that we forgo the interaction from the magnitude-based stream to the complex spectrum-based stream. As shown in Table 2, each part indeed improves SE performance.
We proposed a two-branch framework with collaborative learning for monaural speech enhancement. Considering the apparent spectral regularity and the potential associations between the magnitude and complex spectrum, we use external knowledge from the magnitude-based branch to facilitate complex spectrum reconstruction. Meanwhile, the implicit phase information from the complex spectrum-based branch replaces the noisy one. By doing so, the advantages of magnitude-based and complex-based methods can be well combined.