Self-Supervised Learning based Monaural Speech Enhancement with Multi-Task Pre-Training

by Yi Li, et al.

In self-supervised learning, it is challenging to reduce the gap between the enhancement performance on the estimated and target speech signals with existing pre-tasks. In this paper, we propose a multi-task pre-training method to improve speech enhancement performance with self-supervised learning. Within the pre-training autoencoder (PAE), only a limited set of clean speech signals is required to learn their latent representations. Meanwhile, to overcome the limitation of a single pre-task, the proposed masking module exploits the dereverberation mask and the estimated ratio mask to denoise the mixture as the second pre-task. Different from the PAE, where the target speech signals are estimated, the downstream task autoencoder (DAE) utilizes a large number of unlabelled and unseen reverberant mixtures to generate the estimated mixtures. The trained DAE is shared by the learned representations and masks. Experimental results on a benchmark dataset demonstrate that the proposed method outperforms the state-of-the-art approaches.



1 Introduction

Deep learning techniques have been extensively utilized in speech enhancement for teleconferencing, automatic speech recognition (ASR), and hearing aids [1][2]. However, these networks are predominantly trained in a supervised manner: a vast training set of clean speech signals must be well-labelled in the training stage, and the approach suffers from drawbacks such as a strong possibility of mismatch between the training and inference conditions [3][4]. To relax the constraints of supervised learning approaches, self-supervised learning (SSL) based speech enhancement aims to train the model without large labelled datasets to reconstruct the target speech signal from noisy speech, which makes it highly practical and attractive.

Recently, SSL techniques have been applied to the speech enhancement problem. Wang et al. use an autoencoder to learn a latent representation of clean speech signals [3]; however, the pre-training stage consists of only one pre-task, the mapping of the clean speech spectrogram. Kataria et al. propose a framework called Perceptual Ensemble Regularization Loss (PERL), which shows effectiveness on SSL PASE+ models [5][6]; however, PERL is limited by its requirement of massive training data.

Following our previous work [7], to further improve the speech enhancement performance, we introduce both the dereverberation mask (DM) and the estimated ratio mask (ERM) to provide the time-frequency relationships between the clean speech signal and the reverberant mixture. Hence, inspired by [8], we propose a multi-pre-task SSL method which only needs a limited set of randomly selected clean speech signals and the corresponding mixture recordings in pre-training.

Figure 1: The block diagram of the proposed method is shown in (a); the masking module is shown in (b). Features are extracted as the input to the pre-training autoencoder (PAE). The latent representation of the clean speech signal is learnt; meanwhile, the target speech signal in the reverberant mixture is estimated in the masking module. The estimated mixture is produced by the downstream task autoencoder (DAE), which shares the learned representation and masks. The enhanced signal is obtained from the output of the decoder in the testing stage.

The contributions of this paper are summarized as follows:

Multiple pre-tasks with self-supervised training are proposed to solve the speech enhancement problem.

To address the speech enhancement problem in reverberant environments, we apply the dereverberation mask in the masking module to dereverberate the mixture.

2 Proposed Method

2.1 Multi pre-tasks based autoencoders

The block diagram of the proposed method is shown in Fig. 1 (a). In the training stage, we exploit two variational autoencoders for different tasks: the pre-training autoencoder (PAE) and the downstream task autoencoder (DAE).

The input of the pre-task consists of a limited set of clean speech signals, background noise, and the reverberated versions of both the speech and noise signals. The mel-frequency cepstral coefficients (MFCC) feature [9] is first extracted. The encoder takes the features as input and produces latent representations of both the clean speech signal and the mixture. In the proposed method, we consider two pre-tasks for pre-training: latent representation learning and mask estimation. The first task aims to learn the latent representation of only clean speech signals, while the second task trains the DM and ERM to describe the representation relationships between the target speech signal and the mixture. Both the latent representation and the masks are trained by minimizing the discrepancy between the input representation and the corresponding reconstruction. The decoder takes the averaged masked representations from the two tasks and produces the estimated speech signal.

Both the encoder and decoder of the PAE consist of four 1-D convolutional layers. In the encoder, the size of the hidden dimension sequentially decreases from 512 to 256, 128, and 64. Consequently, the dimension of the latent space is set to 64, and the convolutions use a stride of 1 sample with a kernel size of 7. Different from the encoder, the decoder increases the size of the latent dimensions inversely.
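As an illustration, the PAE encoder and decoder described above can be sketched in PyTorch. The input MFCC dimension (40) and the "same" padding are assumptions for illustration, since the text only specifies the hidden sizes, kernel size, and stride:

```python
import torch
import torch.nn as nn

class PAEEncoder(nn.Module):
    """Four 1-D conv layers; hidden size decreases 512 -> 256 -> 128 -> 64."""
    def __init__(self, in_dim=40):  # in_dim: assumed MFCC feature size
        super().__init__()
        dims = [in_dim, 512, 256, 128, 64]
        self.layers = nn.ModuleList(
            nn.Conv1d(dims[i], dims[i + 1], kernel_size=7, stride=1, padding=3)
            for i in range(4)
        )

    def forward(self, x):  # x: (batch, in_dim, frames)
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x  # latent: (batch, 64, frames)

class PAEDecoder(nn.Module):
    """Mirror of the encoder; hidden size increases 64 -> 128 -> 256 -> 512."""
    def __init__(self, out_dim=40):
        super().__init__()
        dims = [64, 128, 256, 512, out_dim]
        self.layers = nn.ModuleList(
            nn.Conv1d(dims[i], dims[i + 1], kernel_size=7, stride=1, padding=3)
            for i in range(4)
        )

    def forward(self, z):  # z: (batch, 64, frames)
        for layer in self.layers[:-1]:
            z = torch.relu(layer(z))
        return self.layers[-1](z)  # reconstruction: (batch, out_dim, frames)
```

With padding 3, kernel 7, and stride 1, the time dimension is preserved through every layer, so the latent representation keeps the frame resolution of the input features.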

Different from the PAE, the DAE only needs access to the reverberant mixture. The feature is extracted from the reverberant mixture and fed into the DAE encoder, which outputs the latent representation of the mixture. The learnt representation and masks from the PAE are exploited to modify the loss functions and to learn a shared latent space between the clean speech and mixture representations. Benefiting from the pre-tasks, a mapping from the mixture domain to the target speech domain is learnt with the latent representation of the clean speech signal. Furthermore, the DAE decoder is trained to produce the estimated mixture as the downstream task.

The DAE network follows a similar architecture to the PAE. The encoder consists of six 1-D convolutional layers whose hidden sizes decrease from 512 to 400, 300, 200, 100, and 64, and the decoder increases the sizes inversely.

In the testing stage, the features extracted from the reverberant mixtures are fed into the trained DAE encoder. As aforementioned, the DAE is trained with a loss that maps the latent space from the mixture domain to the target speech domain. Thus, the trained encoder produces an estimated latent representation of the reverberant mixture. Then, the estimated masks are used to dereverberate and denoise the mixture representation. Finally, the trained decoder takes the reconstructed representation and maps it to the target speech signal.

2.2 Masking Module

As aforementioned, the masking module is exploited to train the DM and ERM to describe the representation relationships between the target speech signal and the mixture. The architecture of the masking module is depicted in Fig. 1 (b).

The masking module has three sub-layers. The encoder produces the mixture representation Z_m, and the aim of the masking module is to estimate the target speech representation Z_s. The first two sub-layers consist of two time-frequency (TF) masks, DM and ERM, respectively. Following [10], the DM is presented as:

DM(t, f) = (|S(t, f)| + |N(t, f)|) / |M(t, f)|
where ⊙ denotes the element-wise (dot) product, S(t, f) and N(t, f) are the clean speech signal and the interference, respectively, and M(t, f) is the reverberant mixture. The dereverberated mixture is obtained as:

M_d(t, f) = DM_hat(t, f) ⊙ M(t, f)
where DM_hat(t, f) is the estimated DM. However, in practice, obtaining the dereverberated mixtures is very challenging [11]. Although most of the reverberation is removed by the DM, the residual reverberation in M_d(t, f) still limits the performance [7]. Thus, in the second sub-layer, we exploit the ERM to further improve speech enhancement in reverberant environments, which can be defined as:

ERM(t, f) = |S(t, f)| / |M_d(t, f)|
Then, the background noise and the remaining reverberation are removed by the ERM. Moreover, a ReLU activation is applied to each mask before it produces the output for the next sub-layer. Additionally, a residual connection [12] is applied in the masking module to ease its training. Finally, the target speech representation is obtained with a PReLU activation [13] as:

Z_s = PReLU(ERM_hat ⊙ DM_hat ⊙ Z_m + Z_m)
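Under the mask definitions above, the three-sub-layer chain can be sketched with NumPy as follows; the fixed PReLU slope and the exact composition order of the two masks are assumptions for illustration:

```python
import numpy as np

def prelu(x, alpha=0.25):
    """PReLU activation; alpha is learnable in the paper, fixed here."""
    return np.where(x >= 0.0, x, alpha * x)

def masking_module(mix_rep, dm, erm, alpha=0.25):
    """Sketch of the two-mask chain: the DM dereverberates, the ERM
    denoises, and a residual connection adds the input representation
    back before a final PReLU. All inputs are magnitude-domain arrays
    of the same shape."""
    derev = np.maximum(dm, 0.0) * mix_rep     # ReLU on the mask, apply DM
    denoised = np.maximum(erm, 0.0) * derev   # ReLU on the mask, apply ERM
    return prelu(denoised + mix_rep, alpha)   # residual connection + PReLU
```

With both masks equal to one, the module simply doubles the input representation, which makes the residual path easy to verify.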
The overall loss used to train the masking module is a combination of three terms:

L = L_rec + λ·L_KL + L_cyc
Here, L_KL denotes the Kullback-Leibler (KL) loss and is applied to keep the latent representation close to a normal distribution [3], and λ is the coefficient of L_KL, empirically set to 0.001. L_rec denotes the loss between the target speech signal and the corresponding reconstruction, for which the ℓ2 norm of the error is exploited. Similarly, the cycle loss L_cyc consists of L_rec and the loss between the latent representation and the corresponding reconstruction:

L_cyc = L_rec + γ·||Z_s − Z_s_hat||_2

where Z_s_hat is the estimated representation of the target speech signal, and γ is the coefficient of the representation loss, empirically set to 0.001. Finally, the combination of these losses is utilized in the PAE to improve the speech enhancement performance.
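A minimal sketch of such a combined loss in PyTorch, assuming a VAE-style KL term toward a standard normal distribution and squared-ℓ2 reconstruction errors (the exact norms, reductions, and variable names are assumptions, as the text does not specify them):

```python
import torch

def pae_loss(s_hat, s, z_hat, z, mu, logvar, lam=0.001, gamma=0.001):
    """Sketch of the combined PAE training loss:
    - rec: l2 loss between target speech s and reconstruction s_hat
    - kl:  KL divergence pushing the latent toward N(0, I), weighted by lam
    - cyc: rec plus the l2 loss between the latent z and its estimate
           z_hat, weighted by gamma
    """
    rec = torch.mean((s_hat - s) ** 2)
    kl = -0.5 * torch.mean(1 + logvar - mu ** 2 - logvar.exp())
    cyc = rec + gamma * torch.mean((z_hat - z) ** 2)
    return rec + lam * kl + cyc
```

When the reconstruction is perfect and the latent posterior already matches the standard normal (mu = 0, logvar = 0), every term vanishes and the loss is zero.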

3 Experimental Results

3.1 Experimental Setup

The proposed method is trained using the Adam optimizer with a learning rate of 0.001 and a batch size of 20. The numbers of epochs for the PAE and DAE are 700 and 1500, respectively. All experiments are run on a workstation with four Nvidia GTX 1080 GPUs and 16 GB of RAM. The magnitude spectrograms have 513 frequency bins per frame, as a Hanning window and a discrete Fourier transform (DFT) size of 1024 samples are applied.
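The spectrogram configuration above can be sketched as follows; the hop size of 256 samples is an assumption, since the paper does not state the frame shift:

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=1024, hop=256):
    """Frame the signal with a Hann window and take a 1024-point DFT.
    The one-sided spectrum yields n_fft // 2 + 1 = 513 frequency bins
    per frame, matching the setup in the paper; hop is assumed."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # (n_frames, 513)
```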

To evaluate the proposed model, we use composite metrics that approximate the Mean Opinion Score (MOS): CSIG, a MOS predictor of signal distortion; CBAK, a MOS predictor of background-noise intrusiveness; COVL, a MOS predictor of overall signal quality [14]; and the Perceptual Evaluation of Speech Quality (PESQ). Higher values of these measurements imply that the desired speech signal is better estimated.

3.2 Comparisons and Datasets

We compare the proposed method with two recent SSL speech enhancement approaches [3][8]. The first is SSE [3], which exploits two autoencoders to process the pre-task and the downstream task, respectively; its architecture is similar to that of the proposed method. The second is pre-training fine-tuning (PT-FT) [8], which uses three models and three SSL approaches for pre-training: speech enhancement, the masked acoustic model with alteration (MAMA) used in TERA [15], and the continuous contrastive task (CC) used in wav2vec 2.0 [16]. We reproduce the PT-FT method with the DPTNet model [17] and speech enhancement as the pre-task, because this combination shows the best enhancement performance in [8].

Table 1: Comparison of SSL speech enhancement approaches (SSE [3], PT-FT [8], and the proposed method) in terms of the need for paired data, multiple models, and a single pre-task. More specifically, the PT-FT method uses 50,800 paired utterances in the training stage, whereas only 200 utterances are required by the proposed method. Besides, three pre-tasks are trained in the PT-FT method, while we train two pre-tasks in the proposed method.

To evaluate the speech enhancement performance, in the training stage, 600 clean utterances from 20 speakers with three room environments are randomly selected from the DAPS dataset [18]. The training data consists of 10 male and 10 female speakers, each reading out 5 utterances, recorded in different indoor environments with different real room impulse responses (RIRs). In each environment, we first randomly select 12 utterances to generate the pairs of training data, as clean speech signals and mixtures, to train the PAE. Then, the remaining 188 utterances are exploited by the DAE to obtain the estimated mixtures. Moreover, we use three background noises from the NOISEX dataset [19] and three SNR levels (-5, 0, and 5 dB) to generate the mixtures. In the testing stage, 300 clean utterances of 10 speakers are randomly selected and used to generate the mixtures with the same background noises and SNR levels as in the training stage.
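As an illustration of generating mixtures at the SNR levels above, the following sketch scales the noise relative to the speech power before adding it; this is a standard recipe, as the paper does not specify its exact mixing procedure:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db,
    then add it to the (possibly reverberant) speech signal."""
    noise = noise[: len(speech)]                 # match lengths
    p_s = np.mean(speech ** 2)                   # speech power
    p_n = np.mean(noise ** 2)                    # noise power
    scale = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Applying this with snr_db in {-5, 0, 5} reproduces the three mixing conditions used in training.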

3.3 Results and Discussions

In the evaluations, we first conduct experiments in three cases with different interferences in the PAE:

Case 1: The interference only consists of the three background noises from the NOISEX dataset.

Case 2: In the SSE method [3], only a limited amount of clean speech signals and unlabelled mixtures are available in the training stage. Therefore, to further evaluate the proposed masking module, we randomly generate Gaussian noise to produce the reverberant mixture as the interference. Hence, compared with [3], no extra information is introduced.

Case 3: To evaluate the performance with various interferences, we use both the background noise (Case 1) and the unlabelled mixture (Case 2) to generate the interference. In both Cases 2 and 3, the mixtures used in the PAE and the DAE are unseen to each other.

3.3.1 Case 1

Method PESQ CSIG CBAK COVL
SSE [3] 1.48 2.28 1.90 1.84
PT-FT [8] 1.58 2.34 2.04 1.91
Proposed 1.71 2.45 2.16 1.97
Table 2: Averaged speech enhancement performance (Case 1) over three room environments, three noise interferences, and three SNR levels.

From Table 2, it is clearly observed that the proposed method outperforms the state-of-the-art methods in terms of all performance measurements. In [8], the original PT-FT method is trained with the Libri1Mix train-360 set [20], which contains 50,800 utterances. However, in the comparison experiments, we use a limited amount of training utterances (200). Therefore, the speech enhancement performance of PT-FT suffers a significant degradation compared with the original paper. The latent representation and the masking module each have limitations, but the proposed method takes advantage of both approaches and mitigates their individual weaknesses. Thus, the speech enhancement performance is improved compared with only learning the clean speech representation as in the SSE method.

3.3.2 Case 2

Method PESQ CSIG CBAK COVL
SSE [3] 1.39 2.31 1.82 1.75
PT-FT [8] 1.44 2.34 1.89 1.90
Proposed 1.64 2.38 2.11 1.92
Table 3: Averaged speech enhancement performance (Case 2) over three room environments, three noise interferences, and three SNR levels.

It can be seen from Table 3 that the proposed method always achieves the highest enhancement performance compared with SSE and PT-FT. However, the performance in Case 2 suffers a degradation compared with Case 1, because in Case 2 the interference consists of the undesired speech signal, the background noise, and the reverberation of both speech signals and noises. It is highlighted that, due to the different distributions of the speech and noise interference domains, the task of personalized speech enhancement from a mixture with undesired speech signals is more challenging than from noise interference alone [21].

3.3.3 Case 3

In this case, we use two background noises from the NOISEX dataset [19] and two SNR levels (-5 and 5 dB). The experimental results are shown in Table 4.

Method PESQ CSIG CBAK COVL
SSE [3] 1.37 2.27 1.77 1.66
PT-FT [8] 1.49 2.25 1.84 1.87
Proposed 1.69 2.29 2.13 1.90
Table 4: Averaged speech enhancement performance (Case 3) over three room environments, two noise interferences, and two SNR levels.

It can be observed from Table 4 that the speech enhancement performance is significantly improved by the proposed method compared to the baselines. Although the interference in this case combines the background noise (Case 1) and the reverberant mixture (Case 2), the improvement in terms of PESQ, CBAK, and COVL is more obvious than in the other two cases.

In all comparison experiments, we can observe that: (1) The proposed method outperforms the recent SSL-based speech enhancement methods. (2) When the interference contains both background noise and the undesired speech signal, the enhancement performance is degraded. (3) The proposed method still improves the speech enhancement performance in the hardest case (Case 3), because the unseen scenario is also considered in training. Moreover, the improvement becomes more significant as the case becomes more challenging.

4 Conclusion

In order to address the monaural speech enhancement problem in reverberant environments, a multi-pre-task SSL method was proposed. In the pre-training stage, the latent representation of the clean speech signal was learnt as the first pre-task. Meanwhile, in the PAE, a DM- and ERM-based masking module was applied to assist in estimating the target speech representation. We evaluated the proposed method in three cases with different interferences. The experimental results showed that pre-training with multiple pre-tasks provides better speech enhancement performance than the state-of-the-art approaches on the benchmark dataset.


  • [1] Y. Sun, Y. Xian, W. Wang, and S. M. Naqvi, “Monaural source separation in complex domain with long short-term memory neural network,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 359 – 369, 2019.
  • [2] Y. Xian, Y. Sun, W. W. Wang, and S. M. Naqvi, “A multi-scale feature recalibration network for end-to-end single channel speech enhancement,” IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 1, pp. 143 – 155, 2020.
  • [3] Y.-C. Wang, S. Venkataramani, and P. Smaragdis, “Self-supervised learning for speech enhancement,” International Conference on Machine Learning (ICML), 2020.
  • [4] Z. H. Du, M. Lei, J. Q. Han, and S. L. Zhang, “Self-supervised adversarial multi-task learning for vocoder-based monaural speech enhancement,” Interspeech, 2020.
  • [5] S. Kataria, J. Villalba, and N. Dehak, “Perceptual loss based speech denoising with an ensemble of audio pattern recognition and self-supervised models,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
  • [6] M. Ravanelli, J. Y. Zhong, S. Pascual, P. Swietojanski, J. Monteiro, J. Trmal, and Y. Bengio, “Multi-task self-supervised learning for robust speech recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
  • [7] Y. Li, Y. Sun, and S. M. Naqvi, “Single-channel dereverberation and denoising based on lower band trained SA-LSTMs,” IET Signal Processing, vol. 14, no. 10, pp. 774 – 782, 2021.
  • [8] S.-F. Huang, S.-P. Chuang, D.-R. Liu, Y.-C. Chen, G.-P. Yang, and H.-Y. Lee, “Stabilizing label assignment for speech separation by self-supervised pre-training,” Interspeech, 2021.
  • [9] M. Xu, L.-Y. Duan, J. F. Cai, L.-T. Chia, C. S. Xu, and Q. Tian, “HMM-based audio keyword generation,” Advances in Multimedia Information Processing: 5th Pacific Rim Conference on Multimedia, pp. 566 – 574, 2004.
  • [10] Y. Sun, W. Wang, J. A. Chambers, and S. M. Naqvi, “Two-stage monaural source separation in reverberant room environments using deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 125–138, 2019.
  • [11] Y. Zhao and D. L. Wang, “Noisy-reverberant Speech Enhancement Using DenseUNet with Time-frequency Attention,” Interspeech, 2020.
  • [12] K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, “Deep residual learning for image recognition,” Computer Vision and Pattern Recognition, 2016.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: surpassing human-level performance on ImageNet classification,” IEEE International Conference on Computer Vision, 2015.
  • [14] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229 – 238, 2008.
  • [15] A. T. Liu, S.-W. Li, and H.-Y. Lee, “TERA: self-supervised learning of transformer encoder representation for speech,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2351 – 2366, 2021.
  • [16] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: a framework for self-supervised learning of speech representations,” Neural Information Processing Systems (NeurIPS), 2020.
  • [17] J. J. Chen, Q. R. Mao, and D. Liu, “Dual-path transformer network: direct context-aware modeling for end-to-end monaural speech separation,” Interspeech, 2020.
  • [18] G. J. Mysore, “Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges,” IEEE Signal Processing Letters, vol. 22, no. 8, pp. 1006 – 1010, 2014.
  • [19] A. Varga and H. J. M. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems,” Speech Communication, vol. 12, no. 3, pp. 247 – 251, 1993.
  • [20] J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “LibriMix: an open-source dataset for generalizable speech separation,” Interspeech, 2020.
  • [21] T. Afouras, J. S. Chung, and A. Zisserman, “The conversation: deep audio-visual speech enhancement,” Interspeech, 2018.