Noises and echoes are well-known sources of degradation for speech quality in a full-duplex communication (FDC) system. In an FDC system, several speech and noise sources can be at play at each end of the communication line, in addition to the echo signal incident on the nearend microphone due to the playback of the farend loopback signal. Consider a realistic scenario with two desired speech sources, one nearend and one farend, with noise sources on each end. The input signal to the nearend microphone is given by
$$y(t) = s_{\text{ne}}(t) + v_{\text{ne}}(t) + f(x(t)), \qquad x(t) = s_{\text{fe}}(t) + v_{\text{fe}}(t),$$
where $t$ is the time index; $x(t)$ is the farend microphone input; $f(\cdot)$ is a distortion function responsible for both linear and nonlinear transformations in the communication system; $s_{\text{ne}}(t)$ and $s_{\text{fe}}(t)$ are speech sources from the nearend and farend speakers, respectively; and $v_{\text{ne}}(t)$ and $v_{\text{fe}}(t)$ are noise sources from the nearend and farend environments, respectively. For simplicity, we absorb any linear time-invariant transformation of the sound sources due to the environment and the sensors into the signal terms. We denote the complex-valued time-frequency (TF) domain counterpart of each signal by its corresponding boldface uppercase letter, e.g., $\mathbf{Y}(k,m)$ for $y(t)$, with $k$ and $m$ being the frequency and frame indices, respectively.
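As a toy illustration of this signal model, the following NumPy sketch synthesizes a nearend microphone signal from stand-in sources, with the distortion function modeled as a linear echo path followed by clipping; all signal values, the echo path, and the clipping threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 16000  # one second at a hypothetical 16 kHz rate

# Stand-in source signals (placeholders for real speech/noise recordings).
s_ne = rng.standard_normal(T) * 0.1   # nearend speech
v_ne = rng.standard_normal(T) * 0.01  # nearend noise
s_fe = rng.standard_normal(T) * 0.1   # farend speech
v_fe = rng.standard_normal(T) * 0.01  # farend noise

# Farend microphone input: x(t) = s_fe(t) + v_fe(t)
x = s_fe + v_fe

def f(x, h, clip=0.08):
    """Toy distortion: a linear echo path h followed by loudspeaker clipping."""
    echo = np.convolve(x, h)[: len(x)]  # linear room/loudspeaker response
    return np.clip(echo, -clip, clip)   # simple memoryless nonlinearity

h = np.zeros(64)
h[32] = 0.5  # a pure 32-sample delay with 6 dB attenuation

# Nearend microphone input: y(t) = s_ne(t) + v_ne(t) + f(x(t))
y = s_ne + v_ne + f(x, h)
```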
In traditional acoustic echo cancellation (AEC), which utilized only linear filtering techniques, the noise terms are often ignored or absorbed into the speech terms. Hence, the signal model is often viewed as some variant of
$$y(t) = s_{\text{ne}}(t) + h(t) * x(t) + r(t),$$
with the goal of estimating the linear echo path $h(t)$, in either the time or TF domain, to remove the linearly transformed component of the farend signal. The term $r(t)$ is often called the residual echo term, which captures any nonlinear components that are not removable by the linear AEC system. Many recent AEC systems have adopted a two-stage approach with a traditional digital signal processing (DSP) module for linear AEC followed by a neural network module for residual echo suppression (RES) [Valin2021Low-ComplexityPercepnet, Peng2021ICASSPSuppression, Halimeh2021CombiningCancellation, Ivry2021DeepSuppression, Wang2021WeightedAEC-Challenge, Pfeifenberger2021AcousticLearning, Seidel2021Y-NetSuppression, Peng2021AcousticInformation].
Neural RES is closely related to the task of deep noise suppression (DNS), the latter focusing on noise removal rather than echo cancellation. In a broad sense, residual echoes can be seen as a noise source; thus, a few works have adapted DNS models as the RES module for AEC systems [Sridhar2021ICASSPResults, Braun2021InterspeechChallenge]. Given that AEC, RES, and DNS share more similarities than differences, a more unifying formulation naturally follows. Both speech enhancement tasks share a common goal of retrieving the clean speech term $s_{\text{ne}}(t)$, with the differences mainly lying in the sources of degradation. Moreover, neural networks are often capable of performing both linear and nonlinear filtering, rendering the linear DSP modules somewhat redundant and inefficient. In particular, modules such as adaptive filters and delay compensators require a 'warm-up' period for convergence and can be susceptible to changes in the acoustic environment [Braun2021InterspeechChallenge], especially in a doubletalk scenario. On the other hand, neural modules, once trained, operate on a fixed set of weights and provide consistent enhancement without the need for an initial convergence period.
With these considerations in mind, we propose an end-to-end joint noise and echo suppression model based on the D3Net building blocks [Takahashi2021DenselyTasks]. Inspired by [Hu2020DCCRN:Enhancement], we introduce a pseudocomplex extension of the D3Net, allowing fast computation even on hardware without native complex number support. Unlike most systems based on convolutional neural networks (CNNs), the proposed architecture exploits the multi-resolution nature of the D3Net building blocks and operates without any pooling, preserving full feature resolution throughout the network.
We further propose a dual-masking technique using two complex-valued output masks to allow the model to simultaneously perform echo/noise suppression and speech enhancement. The complex-valued nature of the masks allows the model to natively compensate for delays between the loopback and the microphone signals, eliminating the need for an additional delay compensation module. Due to its purely convolutional nature, the model, with only about 354K parameters, can easily be adapted for a causal real-time implementation in future work.
2 Proposed System
2.1 Complex Densely-Connected Multidilated DenseNet
The densely-connected multidilated DenseNet (D3Net) was formally introduced in [Takahashi2021DenselyTasks] as a building block of a U-Net architecture for semantic segmentation and music source separation, achieving state-of-the-art performance on both. Variations of the D3Net have also been used successfully with near state-of-the-art performance on sound event localization and detection [shimada2021accdoa, Shimada2021EnsembleDetection].
Although the U-Net structure has been shown to be very effective in source separation, noise suppression, and speech enhancement, it is notoriously computationally expensive and can rarely be adapted for real-time implementation. Moreover, while pooling operations have been useful in non-dilated CNNs for capturing features at different scales, the multidilated nature of the D3Net is already capable of capturing these scale-dependent features without pooling, which degrades output resolution. As a result, the proposed model was designed in a sequential manner to ensure adaptability for real-time implementation, and without pooling to preserve feature resolution.
In this work, we adapted the D3Net building blocks described in [Takahashi2021DenselyTasks] by extending them to the complex-valued domain using the pseudocomplex technique in [Hu2020DCCRN:Enhancement]. For a layer $g_\theta$ with trainable parameters $\theta$, such as a convolutional layer, a pseudocomplex extension is given by
$$\tilde{g}(\mathbf{Z}) = \left[g_{\theta_r}(\Re\{\mathbf{Z}\}) - g_{\theta_i}(\Im\{\mathbf{Z}\})\right] + j\left[g_{\theta_r}(\Im\{\mathbf{Z}\}) + g_{\theta_i}(\Re\{\mathbf{Z}\})\right],$$
where $g_{\theta_r}$ and $g_{\theta_i}$ are independent layers of identical form. For a parameterless operation $\phi$, such as an activation, the pseudocomplex extension is simply applied on the real and imaginary parts separately, that is,
$$\tilde{\phi}(\mathbf{Z}) = \phi(\Re\{\mathbf{Z}\}) + j\,\phi(\Im\{\mathbf{Z}\}).$$
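The pseudocomplex rule can be sketched in NumPy as follows, using a real 1-D convolution as a stand-in for a parameterized layer; for a linear layer, the construction coincides with a genuinely complex convolution with kernel $w_r + j\,w_i$. The helper names are ours, not from the D3Net or DCCRN code.

```python
import numpy as np

def conv1d(x, w):
    """Real-valued length-preserving 1-D convolution, standing in for a conv layer."""
    return np.convolve(x, w)[: len(x)]

def pseudocomplex_apply(z, w_r, w_i):
    """Pseudocomplex extension of a parameterized layer g:
    out = [g_r(Re z) - g_i(Im z)] + j [g_r(Im z) + g_i(Re z)],
    where g_r and g_i are independent real layers of identical form."""
    re = conv1d(z.real, w_r) - conv1d(z.imag, w_i)
    im = conv1d(z.imag, w_r) + conv1d(z.real, w_i)
    return re + 1j * im

def pseudocomplex_activation(z, phi=lambda u: np.where(u > 0, u, 0.01 * u)):
    """Parameterless ops (here a leaky ReLU) act on real and imaginary parts separately."""
    return phi(z.real) + 1j * phi(z.imag)
```

Since both branches operate only on real tensors, the extension runs efficiently on hardware without native complex arithmetic.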
The overall architecture, the complex densely-connected multidilated DenseNet (cD3Net), is shown in Figure 1. We replaced the ReLU activations in the original model with leaky ReLU to ensure the output masks can take any value in the entire complex plane. The pseudocomplex extension was applied at the layer level, i.e., all convolutional layers were individually pseudocomplexified. Since the pseudocomplex extensions are real-valued modules under the hood, training and inference of the model can be done efficiently even on hardware without native complex number support.
2.2 Input Representation
The complex-valued TF-domain nearend microphone signal $\mathbf{Y}$ and the farend loopback signal $\mathbf{X}$ are used as the input signals. The input tensor $\mathbf{Z}$ was formed via channel-wise stacking of the real and imaginary parts of the TF-domain input signals such that
$$\mathbf{Z} = \left[\Re\{\mathbf{Y}\}, \Im\{\mathbf{Y}\}, \Re\{\mathbf{X}\}, \Im\{\mathbf{X}\}\right].$$
The latter two channels were explicitly provided to the model to facilitate the implicit learning of the distortion function required to generate the echo cancellation masks. They also provide temporal cues to the model, especially when there is a non-negligible time offset between the echo and the loopback signals.
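Assuming the stacked input consists of the real and imaginary parts of $\mathbf{Y}$ and $\mathbf{X}$ (one plausible reading, giving four real channels with the latter two carrying the loopback), the construction could look like this minimal sketch; the toy STFT, frame sizes, and signal lengths are illustrative.

```python
import numpy as np

def stft(x, win, hop):
    """Minimal STFT: windowed frames -> one-sided FFT, shape (freq, frames)."""
    n = len(win)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)]
    return np.fft.rfft(np.stack(frames, axis=1), axis=0)

n_fft, hop = 512, 256
win = np.sqrt(np.hanning(n_fft))  # square-root Hann analysis window

rng = np.random.default_rng(0)
y_t = rng.standard_normal(4096)  # stand-in nearend microphone signal
x_t = rng.standard_normal(4096)  # stand-in farend loopback signal

Y, X = stft(y_t, win, hop), stft(x_t, win, hop)

# Channel-wise stacking: real/imag parts of Y, then real/imag parts of X,
# so the latter two channels are the loopback channels.
Z = np.stack([Y.real, Y.imag, X.real, X.imag], axis=0)
```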
2.3 Time-Frequency Masking
Conventionally, the most popular TF masking method used in AEC, DNS, and several other speech enhancement tasks has been to apply a mask $\mathbf{M}$ (real- or complex-valued) on the nearend microphone signal to obtain an enhanced speech signal $\hat{\mathbf{S}} = \mathbf{M} \odot \mathbf{Y}$, where $\odot$ is the elementwise multiplication operator. Although this method works very well in single-track enhancement, it discards the rich acoustic information already available in the farend loopback signal for echo suppression.
Since the farend signal is available, we consider the use of two separate masks, $\mathbf{M}_x$ and $\mathbf{M}_y$, where $\mathbf{M}_x$ is mainly responsible for echo suppression and $\mathbf{M}_y$ for noise suppression and speech enhancement. The dual masking operation is given by
$$\hat{\mathbf{S}} = \mathbf{M}_y \odot \mathbf{Y} + \mathbf{M}_x \odot \mathbf{X}.$$
Since $\mathbf{M}_x$ is complex-valued, unless there is a significant misalignment between the farend loopback and the nearend microphone signals beyond the short-time Fourier transform (STFT) window size, there is no need for time-domain compensation, as $\mathbf{M}_x$ itself can account for the delay operation.
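A minimal sketch of the dual-masking operation, and of how a complex mask can encode a sub-window delay as a per-bin linear phase ramp; the mask values here are random placeholders rather than network outputs, and the delay value is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 257, 15  # frequency bins, frames

Y = rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))  # nearend mic
X = rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))  # farend loopback

# Placeholder network outputs: M_y enhances the mic path, M_x cancels echo.
M_y = np.ones((K, M), dtype=complex)
M_x = 0.1 * rng.standard_normal((K, M)) + 0j

# Dual masking: S_hat = M_y ⊙ Y + M_x ⊙ X
S_hat = M_y * Y + M_x * X

# A complex mask can absorb a delay d below the window size as a unit-modulus
# phase ramp over frequency, multiplying each bin of X.
d, n_fft = 32, 512
k = np.arange(K)
phase_ramp = np.exp(-2j * np.pi * k * d / n_fft)
```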
3 Experimental Setup
3.1 Dataset
All models in this paper were trained using only the synthetic dataset of the Microsoft AEC Challenges [Sridhar2021ICASSPResults, Braun2021InterspeechChallenge]. The synthetic dataset provides four signals for each acoustic scene: a clean nearend speech signal, a potentially noisy farend loopback signal, a potentially noisy echo signal, and a potentially noisy nearend microphone signal (a mixture of nearend speech, nearend noise, and echo). We used the nearend microphone and the farend loopback signals as inputs. The clean nearend speech signals used as the training targets were first scaled to the same amplitude as the speech content in the nearend microphone signal, using the scaling factors provided in the ground truths. The provided echo signals were not used in this paper. The official train and test sets consist of 9.5K and 500 acoustic scenes, respectively. We took another 500 samples out of the train set to form a validation set, leaving 9K acoustic scenes for training.
All models in this paper were trained for the same number of epochs, with each epoch drawing a random subset of scenes from the train set. All models were trained with the same effective batch size using an Adam optimizer [Kingma2014Adam:Optimization], with fixed initial learning rate and weight decay, and with the learning rate reduced by a constant factor after every three epochs without validation loss improvement. Since the audio signals were provided at a 16 kHz sampling rate, we used a 512-sample (32 ms) square-root Hann window with overlapping frames for the STFT.
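The square-root Hann choice can be sanity-checked numerically: when the same window is used for analysis and synthesis, the effective window is a plain Hann, which satisfies the constant overlap-add (COLA) condition. The 50% overlap and the periodic window form below are our assumptions.

```python
import numpy as np

n_fft, hop = 512, 256  # 512 samples = 32 ms at 16 kHz; 50% overlap assumed
n = np.arange(n_fft)
win = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * n / n_fft))  # periodic sqrt-Hann

# Analysis + synthesis with the same sqrt-Hann window applies win**2 (a Hann)
# to each frame; at 50% overlap these sum to a constant in the fully
# overlapped region, enabling artifact-free overlap-add resynthesis.
cola = np.zeros(n_fft + hop)
for start in (0, hop):
    cola[start:start + n_fft] += win ** 2
```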
3.2 Loss Function
To balance echo and noise cancellation in an energy-based sense against speech enhancement in a perceptual sense, we consider a weighted loss function combining the negative signal-to-distortion ratio (SDR) loss and the perceptual PMSQE loss [Martin-Donas2018AQuality], given by
$$\mathcal{L} = w_{\text{SDR}}\,\mathcal{L}_{\text{SDR}} + w_{\text{PMSQE}}\,\mathcal{L}_{\text{PMSQE}},$$
where $w_{\text{SDR}}$ and $w_{\text{PMSQE}}$ are weight terms which sum to one.
Unlike most systems utilizing SDR-based losses [Kolbaek2020OnEnhancement], our negative SDR loss is implemented using the scale-dependent SDR (SD-SDR) instead of the scale-invariant counterpart (SI-SDR), in order to encourage the model to preserve the amplitude scaling of the clean speech estimates [LeRoux2019SDRDone], especially since the amplitude scaling information between the clean speech and the nearend microphone signals is available. (We also repeated the reported experiments with the standard SDR and SI-SDR losses in place of the SD-SDR loss; the SD-SDR loss indeed performed best in this task across all metrics. The full experimental details are omitted due to space constraints.) For a zero-mean estimate $\hat{s}$ and target $s$, the negative SD-SDR loss is given by
$$\mathcal{L}_{\text{SDR}} = -10\log_{10}\frac{\|\alpha s\|^2}{\|s - \hat{s}\|^2}, \qquad \alpha = \frac{\hat{s}^{\top} s}{\|s\|^2}.$$
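A sketch of the negative SD-SDR loss and the weighted composite, following our reading of [LeRoux2019SDRDone]; the function names, the epsilon guard, and the PMSQE stand-in hook are ours. Unlike SI-SDR, a uniformly rescaled estimate does not receive an unbounded score, so amplitude errors are penalized.

```python
import numpy as np

def neg_sd_sdr(est, target, eps=1e-8):
    """Negative scale-dependent SDR for zero-mean 1-D signals.
    alpha rescales the target in the numerator, while the denominator keeps
    the raw error (target - est), so amplitude mismatches are penalized."""
    alpha = np.dot(est, target) / (np.dot(target, target) + eps)
    num = np.sum((alpha * target) ** 2)
    den = np.sum((target - est) ** 2) + eps
    return -10.0 * np.log10(num / den + eps)

def composite_loss(est, target, w_sdr=0.25, pmsqe_fn=None):
    """Weighted mix of negative SD-SDR and a perceptual loss (PMSQE stand-in);
    the weights sum to one, e.g. w_sdr=0.25 for a 1:3 SDR-to-PMSQE setting."""
    loss = w_sdr * neg_sd_sdr(est, target)
    if pmsqe_fn is not None:
        loss += (1.0 - w_sdr) * pmsqe_fn(est, target)
    return loss
```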
3.3 Data Augmentation
We experimented with a few data augmentation techniques in the raw signal domain to make the model more robust to realistic signal degradation in an FDC system. (We also experimented with additional noise and codec companding, but these techniques did not result in appreciable performance gains.) For this work, augmentation was always applied to the farend loopback signal, since not all components of the signals that constitute the nearend microphone signal were available in the dataset. Whenever activated, each augmentation technique was applied with a fixed probability, independently of one another.
Specifically, we experimented with time shifting between the farend loopback and nearend microphone signals to simulate asynchronous signal arrivals, and with amplitude scaling to make the model more robust to discrepancies between the amplitudes of the audio content in the loopback signal and the echo signal. A time shift of up to 512 samples, the STFT block size used in this paper, was applied to the farend loopback signal. Amplitude scaling by a random factor was applied to the farend loopback signal.
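The two augmentations can be sketched as follows. Only the 512-sample shift bound is tied to the text (the STFT block size); the activation probability and scale range are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_loopback(x, p=0.5, max_shift=512, scale_range=(0.5, 2.0), rng=rng):
    """Independently applied farend-loopback augmentations:
    - time shift: delays the loopback to simulate asynchronous signal arrival
    - amplitude scaling: mimics loopback/echo level mismatch
    p and scale_range are hypothetical values, not from the paper."""
    x = x.copy()
    if rng.random() < p:  # random time shift (zero-padded at the front)
        shift = int(rng.integers(0, max_shift + 1))
        x = np.concatenate([np.zeros(shift), x[: len(x) - shift]])
    if rng.random() < p:  # random amplitude scaling
        x *= rng.uniform(*scale_range)
    return x
```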
4 Results and Discussion
4.1 Baseline Models
For comparison, we adapted NSNet [Xia2020WeightedEnhancement], Conv-TasNet [Luo2019Conv-TasNet:Separation], and DCCRN [Hu2020DCCRN:Enhancement] for AEC. All three models had their first layers modified to support a multichannel input consisting of the nearend microphone and farend loopback signals. NSNet was additionally modified to use the same STFT parameters as the cD3Net. All other settings of DCCRN and Conv-TasNet followed the defaults provided by Asteroid [Pariente2020Asteroid:Researchers]. All baseline models used the traditional single-mask method and were trained using the same amount of data and the same number of epochs as the proposed models.
4.2 Evaluation Metrics
We report two sets of results for each model: one on the test split of the synthetic dataset [Sridhar2021ICASSPResults] used for development, and one on the blind test set used for the Interspeech 2021 AEC Challenge [Braun2021InterspeechChallenge].
For the synthetic dataset used for the development, we evaluate the performance of the models based on four objective metrics commonly used in AEC: scale-invariant source-to-distortion ratio (SI-SDR) [LeRoux2019SDRDone], standard source-to-distortion ratio (SDR) [Vincent2006PerformanceSeparation], short-time objective intelligibility (STOI) [Taal2010ASpeech], and perceptual evaluation of speech quality (PESQ) [Rix2001PerceptualCodecs].
For the blind real test set, we evaluate the models using objective proxies of the mean opinion score (MOS) via the DNSMOS system [Reddy2020TheFramework] and of the degradation MOS via the AECMOS system [Cutler2021CrowdsourcingImpairment, Braun2021InterspeechChallenge]. The AECMOS provides two degradation MOS (DMOS) scores: the Echo DMOS captures degradation due to farend echo, while the Other DMOS captures degradation due to any other sources. Both DMOS are rated on a 1-to-5 scale, with a higher score reflecting better audio quality. As per the convention in the ICASSP and Interspeech AEC Challenges [Sridhar2021ICASSPResults, Braun2021InterspeechChallenge], we report MOS for the nearend singletalk scenario, Echo DMOS for the farend singletalk scenario, and both Echo and Other DMOS for the doubletalk scenario.
Table 1: Results on the synthetic test set and the blind real test set (MOS/DMOS). "NE ST", "FE ST", and "DT" denote nearend singletalk, farend singletalk, and doubletalk.

| Model | Params | $w_{\text{SDR}}$ | $w_{\text{PMSQE}}$ | Aug. | SI-SDR | SDR | STOI | PESQ | MOS (NE ST) | Echo DMOS (FE ST) | Echo DMOS (DT) | Other DMOS (DT) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cD3Net, single mask | 352K | 1 | 0 | – | 13.10 | 13.60 | 0.90 | 1.77 | 3.94 | 4.11 | 4.16 | 2.60 |
| cD3Net, dual mask | 354K | 1 | 0 | – | 13.22 | 13.74 | 0.90 | 1.80 | 3.94 | 4.19 | 4.12 | 3.13 |
| + farend shift and scale | 354K | 0.25 | 0.75 | ✓ | 13.21 | 13.70 | 0.91 | 2.06 | 3.99 | 4.44 | 4.26 | 4.00 |
4.3 Discussion
4.3.1 Model architectures
The results for all models evaluated are shown in Table 1. Note that we did not perform any preprocessing or postprocessing on the test sets, i.e., no delay compensation or gain adjustment. The networks are fully responsible for all processing required to perform end-to-end echo/noise suppression and speech enhancement on the test set tracks as-is.
On both the synthetic and real test sets, cD3Net models in every configuration consistently outperformed the three baseline models across most metrics, despite having only about a tenth of the parameters of the smallest baseline model. This demonstrates the ability of the cD3Net architecture to learn useful intermediate features in a very parameter-efficient manner. Interestingly, NSNet, which performed best among the baselines on the synthetic test set for most metrics, has the simplest architecture among the baselines and also does not use pooling. The unexpectedly poor performance of DCCRN and Conv-TasNet could be due to the limited number of epochs all models were trained for, as well as inherent shortcomings in their architectures: DCCRN performs frequency downsampling in each layer, while Conv-TasNet uses a learned basis kernel in lieu of the STFT [Luo2019Conv-TasNet:Separation]. The small 16-sample time-domain kernel is likely insufficient for learning effective representations for echo cancellation, given the inherent time offsets between the nearend and loopback signals.
4.3.2 Masking techniques
The dual-mask variant of cD3Net shows a slight improvement over the single-mask variant on the synthetic test set. Although the real test set results are comparable for nearend MOS, farend singletalk Echo DMOS, and doubletalk Echo DMOS, the doubletalk Other DMOS shows a significant improvement with the dual-mask technique. This is somewhat expected by design, as the single-mask technique is unlikely to be able to cope with both echo/noise suppression and speech enhancement in a doubletalk scenario.
4.3.3 Loss function
The models discussed so far were all trained purely with the negative SDR loss. Using a pure PMSQE loss on the dual-mask cD3Net, we found that the SDR and SI-SDR metrics on the synthetic test set degraded considerably, even though the real test set MOS did not show a corresponding degradation. Due to the lack of ground truth data on the real test set, it is not possible to precisely analyze the cause of this discrepancy, although we suspect that PMSQE may simply teach the model to optimize for perceived speech quality without regard to the actual amount of echo or noise suppressed.
Regardless, we found that using a weighted composite loss of negative SDR and PMSQE results in a better trade-off between the SDR-based metrics and the perceptual proxies, and even improves STOI and PESQ with the 1:3 and 1:1 SDR-to-PMSQE weightings.
4.3.4 Data Augmentation
Using the 1:3 weighting, which performed best on the synthetic test set, we experimented with data augmentation in the form of time shifting and amplitude scaling on the loopback signal, as described in Section 3.3. It is important to note that the echo and the loopback signals in the synthetic test set are nearly exactly time-aligned. On the other hand, the time offsets between the loopback and the echo signals in the real test set can be much more significant. Since we are using end-to-end models without offset compensation, it is not surprising that adding random time shifts to the farend loopback signal considerably improved the model performance on the real test set. The slight drop in the performance on the synthetic test set is likely due to the domain shift introduced by the augmentation.
Although introducing random farend amplitude scaling alone worsened performance across both test sets, when used together with random shifting, amplitude scaling further improved the DMOS performance on the real test set, achieving the best performance across the MOS metrics.
5 Conclusion
In this work, we proposed cD3Net, an end-to-end pooling-free network for joint echo cancellation, noise suppression, and speech enhancement, using a complex-valued extension of the D3Net building block, with a very small parameter count of only 354K. The complex-valued extension eliminates the need for additional linear filtering or preprocessing modules, while the multidilated building blocks allow feature extraction with large receptive fields without pooling, preserving full feature resolution throughout the network. To fully utilize the farend loopback signal information, we also introduced a time-frequency domain enhancement technique using dual masks for joint AEC, DNS, and speech enhancement. Using mixed-loss training with scale-dependent SDR and PMSQE to account for enhancement in both energy-based and perceptual senses, evaluation on both synthetic and real test sets showed promising results. The proposed model achieved upwards of 13 dB for SI-SDR and SDR and upwards of 0.90 for STOI, and scored 4.0 or above for all deep proxies of degradation MOS.