Incorporating Real-world Noisy Speech in Neural-network-based Speech Enhancement Systems

09/11/2021 ∙ by Yangyang Xia, et al. ∙ Facebook ∙ Carnegie Mellon University

Supervised speech enhancement relies on parallel databases of degraded speech signals and their clean reference signals during training. This setting prohibits the use of real-world degraded speech data that may better represent the scenarios where such systems are used. In this paper, we explore methods that enable supervised speech enhancement systems to train on real-world degraded speech data. Specifically, we propose a semi-supervised approach for speech enhancement in which we first train a modified vector-quantized variational autoencoder that solves a source separation task. We then use this trained autoencoder to further train an enhancement network using real-world noisy speech data by computing a triplet-based unsupervised loss function. Experiments show promising results for incorporating real-world data in training speech enhancement systems.




1 Introduction

Supervised single-channel speech enhancement (SE) has seen considerable improvement in the last few years, primarily due to the use of deep neural networks (DNNs) [17]. Training an effective SE system requires parallel databases of simulated degraded speech signals and their reference signals, as the learning objective is often a function of the clean speech signals. The performance of SE systems trained on such artificially generated noisy speech inputs depends heavily on (a) the variety and amount of noise recordings available, and (b) whether the simulated degradation is realistic. While these supervised SE systems have surpassed non-data-driven approaches by a large margin [8], concerns about their generalization capabilities remain. Enabling SE systems to learn from real-world noisy speech can ensure that the networks are trained on real acoustical conditions rather than synthetic ones. Moreover, such data are readily available and can be obtained with relative ease. Lastly, such methods can also enable a system trained on simulated data to adapt to a new environment.

The primary challenge in incorporating real-world noisy speech for training SE systems is the lack of corresponding clean speech signals as training targets. A few recently proposed methods seek alternative reference signals. Mixture-invariant training (MixIT) [18] attempts unsupervised speaker separation by forcing the network to separate mixtures of mixtures. However, it can suffer from an over-separation problem. Following MixIT, noisy-target training [2] treats real-world noisy speech data as the reference and mixes them with noise signals to generate “more noisy” signals for training the SE system.

Another possibility to relax supervision is through the prediction or generation of pseudo ground truth. Although it is tempting to calculate the loss through a no-reference speech quality prediction network [1], experiments have shown that DNNs might over-optimize one perceptual metric without necessarily improving others [9, 21], let alone a prediction of them. Wang et al. used a pair of generative adversarial networks to map speech signals from noisy to clean [16]. The trained generator is then used to generate a pseudo reference signal. A similar setup was also proposed and studied by Xiang and Bao [20] with multiple learning objectives. These studies were inspired by unpaired image-to-image translation through cycle-consistency constraints [22]. However, in [16], which uses multiple encoders, the cycle-consistency constraint did not force clean speech embeddings and degraded speech embeddings to share the same latent space.

Generation of pseudo reference signals can also be done through a latent representation. In particular, methods based on self-supervised learning (SSL) frameworks can be used. In this framework, a speech signal is typically transformed to a latent space by an autoencoder. Then, SSL tasks are assigned in the latent space to establish correlations between a measure taken in this space and a physically meaningful measure taken in the signal domain. For example, the context encoder learns to generate the content of a masked region in an image based on its surrounding pixels [10].

In this paper, we propose two unsupervised loss functions for speech enhancement enabled by SSL. These unsupervised loss functions do not require the reference clean speech and allow us to incorporate real-world noisy speech in the training process. Our semi-supervised approach consists of two stages. The first, supervised stage includes a novel modification to the vector-quantized variational autoencoder (VQ-VAE) that solves a source separation task using a corpus of paired data. In the second, semi-supervised stage, the learned VQ-VAE is used to transform any given degraded speech signal to a pseudo noise ground truth and a pseudo speech ground truth. We then construct unsupervised losses based on a triplet formulation using these estimated ground truths. These losses are used to train an enhancement system along with the supervised losses from the paired data. Note that the framework is designed for a semi-supervised setting, under the assumption that some amount of paired data and potentially (much) more unpaired (real-world) data are available during training. The unpaired data can be real-world noisy speech recordings for which corresponding clean references are not available.

Organization of this paper. In Section 2, we provide some necessary background on supervised SE and the VQ-VAE. We then describe our method in Section 3. Experimental setups are described in Section 4 and results are discussed in Section 5. Section 6 concludes our paper.

2 Background

2.1 Supervised DNN-based speech enhancement

We assume that the observed degraded speech contains clean speech corrupted by additive noise. This relationship can be established in the short-time Fourier transform (STFT) domain as

$$X_{t,f} = S_{t,f} + N_{t,f}, \tag{1}$$

where $X_{t,f}$, $S_{t,f}$, and $N_{t,f}$ represent the STFT at frame $t$ and frequency index $f$ of the degraded speech, clean speech, and noise, respectively. One common SE method is to train a DNN to predict a magnitude gain $G_{t,f}$, so that the STFT magnitude (STFTM) of the enhanced speech signal can be obtained by

$$|\hat{S}_{t,f}| = G_{t,f} \, |X_{t,f}|. \tag{2}$$

Finally, the phase of the degraded signal is combined with the enhanced STFTM to reconstruct the enhanced speech signal through the inverse STFT.
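The gain-and-phase step described above can be illustrated with a minimal NumPy sketch. It assumes the STFT is already available as a complex array; the function name `apply_magnitude_gain`, the toy array sizes, and the oracle ratio-style gain that stands in for the DNN output are all illustrative choices, not part of the original system.

```python
import numpy as np

def apply_magnitude_gain(noisy_stft, gain):
    """Scale the noisy STFT magnitude by a real-valued gain and reuse the noisy phase."""
    enhanced_mag = gain * np.abs(noisy_stft)
    noisy_phase = np.angle(noisy_stft)
    return enhanced_mag * np.exp(1j * noisy_phase)

rng = np.random.default_rng(0)
shape = (5, 4)  # (frames, frequency bins), toy size chosen for illustration
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise  # additive model in the STFT domain

# Stand-in for the DNN prediction: an oracle ratio-style gain in [0, 1].
oracle_gain = np.abs(clean) / (np.abs(clean) + np.abs(noise))
enhanced = apply_magnitude_gain(noisy, oracle_gain)
```

In a real system the gain would come from the trained network, and `enhanced` would be passed to an inverse STFT to obtain the time-domain signal.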

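Conventionally, paired sets of degraded and clean speech are required during training. The supervised training involves a reconstruction loss,

$$\mathcal{L}_{\text{sup}} = D(\mathbf{s}, \hat{\mathbf{s}}), \tag{3}$$

where $\mathbf{s}$ and $\hat{\mathbf{s}}$ denote the clean and enhanced STFTM in vector form, and $D$ is a distance measure such as the mean squared error (MSE).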

2.2 Encoder-Decoder in self-supervised learning

Self-supervised learning (SSL) methods usually construct tasks in a learned representation space. These tasks can be solved without requiring any labels for a given dataset. The assumption usually is that the representation learned by solving these pretext tasks will be useful for the downstream tasks. We follow the well-known encoder-decoder framework to learn such representations from speech signals. This autoencoding process can be described by


where Encoder and Decoder are realized by DNNs and the loss term denotes a feature reconstruction loss function such as the MSE. Within this paradigm, SSL could impose an auxiliary task on the encoded features, the decoded features, or both. A generic representation of this process can be described by


where each loss term denotes the loss function of an auxiliary task. “Transform” refers to a manipulation that gives the auxiliary task a distinctive goal. In context encoders [10], for example, partial occlusion is applied to the input image, forcing the encoder to learn features that would extrapolate the occluded pixel values.

It should be noted that the labels used in the pretext tasks are readily available in the original dataset and therefore the training targets in Eq. (7) and Eq. (8) shall not incur additional labeling effort. More specifically, we shall design the task in such a way that it does not require the clean reference speech for real-world degraded noisy speech. This task shall ultimately enable an unsupervised loss function,


As opposed to Eq. (3), this loss function can be used to train an SE system on real-world degraded speech data. In the next section, we will describe a procedure that enables this process.

3 Method

Figure 1: Supervised training procedure of the modified VQ-VAE using paired data (top) and unsupervised training procedure of a speech enhancement system using unpaired data (bottom).

Our approach consists of two training stages. The first stage consists of training a modified VQ-VAE that is constrained to separate speech and noise from the degraded speech signal in both the latent and feature domains. Then, we describe two loss functions derived from this model that can be used to train any DNN-based SE system in a semi-supervised manner. The unsupervised loss functions enable the incorporation of real-world noisy speech during training.

3.1 Modified VQ-VAE for source separation

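Compared to traditional autoencoders, the VQ-VAE has an additional quantization step in the latent space that avoids issues like high variance [15]. The whole process can be described by

$$\mathbf{z} = \mathrm{Encoder}(\mathbf{x}), \tag{10}$$

$$\mathbf{z}_q = \arg\min_{\mathbf{e} \in E} D(\mathbf{z}, \mathbf{e}), \tag{11}$$

$$\hat{\mathbf{x}} = \mathrm{Decoder}(\mathbf{z}_q), \tag{12}$$

where $E$ is a set of learnable vectors, and $D$ is a distance function. We define $\mathbf{x}$ to be the log-power spectra of degraded speech in vector form.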

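We design the VQ-VAE to perform an acoustical source separation task with a codebook-lookup constraint. To achieve this goal, we partition the codebook $E$ in Eq. (11) into two equal halves,

$$E = E_s \cup E_n, \quad |E_s| = |E_n|. \tag{13}$$

The quantization and decoding processes in Eq. (11) and Eq. (12) are then modified to produce three outputs,

$$\mathbf{z}_q^{s} = \arg\min_{\mathbf{e} \in E_s} D(\mathbf{z}, \mathbf{e}), \tag{14}$$

$$\mathbf{z}_q^{n} = \arg\min_{\mathbf{e} \in E_n} D(\mathbf{z}, \mathbf{e}), \tag{15}$$

$$\mathbf{z}_q^{x} = \arg\min_{\mathbf{e} \in E} D(\mathbf{z}, \mathbf{e}), \tag{16}$$

$$\hat{\mathbf{x}}^{i} = \mathrm{Decoder}(\mathbf{z}_q^{i}), \tag{17}$$

where $\mathbf{z}$ is the encoder output and $i$ denotes one of speech ($s$), noise ($n$), and degraded speech ($x$).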
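As a rough illustration of a partitioned codebook lookup, the NumPy sketch below quantizes each embedding three times: against a speech half, against a noise half, and against the whole codebook. The codebook size, the embedding dimension, and in particular the choice to use the full codebook for the degraded-speech output are assumptions made for this example, not details confirmed by the text.

```python
import numpy as np

def cosine_distance(vectors, codebook):
    """Pairwise cosine distance between rows of `vectors` (N, d) and `codebook` (K, d)."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    c = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    return 1.0 - v @ c.T

def quantize(vectors, codebook):
    """Replace each row of `vectors` by its nearest codebook entry."""
    nearest = np.argmin(cosine_distance(vectors, codebook), axis=1)
    return codebook[nearest]

rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 3))   # 8 entries of dimension 3 (toy sizes)
speech_half, noise_half = codebook[:4], codebook[4:]

embeddings = rng.standard_normal((10, 3))  # stand-in for encoder outputs
quantized = {
    "speech": quantize(embeddings, speech_half),
    "noise": quantize(embeddings, noise_half),
    "degraded": quantize(embeddings, codebook),  # assumption: full codebook
}
```

Each of the three quantized arrays would then be passed through the shared decoder to obtain one reconstructed feature per source.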

3.2 Encoder and decoder architectures

Figure 2: Flow diagram of our VQ-VAE system. The order of parameters follows the PyTorch convention.

Both the encoder and decoder of our model are implemented using convolutional neural networks (CNNs). Specifically, the encoder consists of three two-dimensional convolutional layers, each followed by two-dimensional batch normalization and a ReLU activation. The final convolutional layer is followed by two residual blocks; a residual block is defined as two convolutional layers with an additive connection between the input and the output of the final layer [4]. The decoder’s architecture is a mirror image of that of the encoder, with each convolutional layer replaced by a transposed convolutional layer.

The overall architecture is illustrated in Figure 2. For a $T$-by-$F$ log-power spectrogram, each energy bin is transformed to an $M$-dimensional embedding by the encoder. The quantizer described in Section 3.1 then transforms each embedding to its closest cluster center in terms of cosine distance. Finally, the decoder transforms the quantized embeddings back to the log-power spectrogram.

3.3 Training procedure for modified VQ-VAE

The training procedure of the modified VQ-VAE is depicted in the top half of Figure 1. Log-power spectrograms of degraded speech signals pass through the VQ-VAE in order to obtain the embedding and reconstructed feature for each source. We train the VQ-VAE using the reconstruction loss in Eq. (3), the VQ loss, and the commitment loss as described in [15]. Note that each loss now consists of three components based on the degraded speech, clean speech, and noise, respectively. Although the MSE was originally used in [15] as the distance function in Eq. (11), we found that cosine distance made training more stable for this particular task. The VQ loss and the commitment loss are modified accordingly.
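For orientation, the standard single-source VQ-VAE objective from [15] is sketched below. The symbols $z_e(x)$ (encoder output), $e$ (selected codebook vector), $\mathrm{sg}[\cdot]$ (stop-gradient), and $\beta$ (commitment weight) follow the usual notation of [15] rather than this paper, and the reconstruction term is written as a generic distance $D_{\text{rec}}$, which is an assumption of this sketch.

```latex
\mathcal{L}_{\text{VQ-VAE}}
  = \underbrace{D_{\text{rec}}\big(x, \hat{x}\big)}_{\text{reconstruction}}
  + \underbrace{\big\lVert \mathrm{sg}[z_e(x)] - e \big\rVert_2^2}_{\text{VQ loss}}
  + \beta \, \underbrace{\big\lVert z_e(x) - \mathrm{sg}[e] \big\rVert_2^2}_{\text{commitment loss}}
```

In the modified model described above, the squared Euclidean terms would be replaced by cosine-distance counterparts, and each term would be summed over the degraded speech, clean speech, and noise components.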

3.4 Unsupervised loss functions for enhancement

After training using paired data and supervised loss functions, the VQ-VAE is frozen while training a speech enhancement system. Specifically, a supervised SE system takes in a degraded speech signal and outputs the STFTM of the enhanced speech. This output is then transformed to a log-power spectrogram and passed through the frozen VQ-VAE, which outputs the continuous embedding, the quantized embeddings, and the decoded features in the process. We define the unsupervised embedding-space loss as


where the margin is a constant and the distance is the cosine distance. Note that the continuous embedding is used instead of the quantized embedding as the latter is not differentiable. Similarly, the unsupervised feature-space loss can be derived from the decoded features,


where the speech and noise features are decoded from their corresponding quantized embeddings using Eq. (17). Note that both losses are calculated per time-frequency bin. We believe that the source separation task imposed on the VQ-VAE makes the quantized speech embedding and the decoded speech feature pseudo-positive targets, and the quantized noise embedding and the decoded noise feature pseudo-negative targets. We used the triplet margin [14] because neither target is ideal.
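The generic triplet margin formulation from [14], combined with a cosine distance, can be sketched as follows. The assignment of roles (anchor, positive, negative), the margin value of 0.5, and the averaging over items are illustrative assumptions; the exact form of the losses used in this work is not reproduced here.

```python
import numpy as np

def cosine_distance(a, b):
    """Row-wise cosine distance between two arrays of shape (N, d)."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - num / den

def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Mean of max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    per_item = cosine_distance(anchor, positive) - cosine_distance(anchor, negative) + margin
    return float(np.mean(np.maximum(per_item, 0.0)))

# Toy embeddings: one row per time-frequency bin.
anchor = np.array([[1.0, 0.0], [0.0, 1.0]])    # e.g. continuous embeddings of enhanced speech
positive = np.array([[2.0, 0.0], [0.0, 3.0]])  # same direction as the anchor
negative = np.array([[0.0, 1.0], [1.0, 0.0]])  # orthogonal to the anchor

loss_good = triplet_margin_loss(anchor, positive, negative)
loss_bad = triplet_margin_loss(anchor, negative, positive)
```

The loss is zero once the anchor is closer to the positive than to the negative by at least the margin, so neither pseudo target has to be matched exactly.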

The bottom half of Figure 1 shows how to train a DNN-based SE system using real-world data. After obtaining the enhanced log-power spectrogram from the system, the frozen VQ-VAE is used to calculate the continuous embedding, the quantized embeddings, and the reconstructed features. The unsupervised embedding loss can be calculated by Eq. (18); the unsupervised feature loss can be calculated by Eq. (19). This loss is backpropagated to adapt the parameters of the SE system. If this system is also trained on paired data, the entire procedure is a semi-supervised training process.

In the next section, we will describe the experimental setup used to evaluate the effectiveness of these unsupervised losses for SE.

4 Experimental setup

4.1 Dataset

We used the clean speech of the Interspeech 2020 Deep Noise Suppression (DNS) Challenge dataset [12] and the ESC-50 dataset [11] for simulating paired data in all our experiments. The DNS training set contains a total of 500 hours of clean speech. The ESC-50 dataset contains 50 different types of environmental sounds (noises). In our experiments, we used fractions of these datasets to synthesize the paired data for training both the VQ-VAE and the supervised part of the SE system. The real-world noisy speech, i.e., the unpaired data, was obtained from the Audioset dataset [3]. Audio recordings in Audioset tagged with the “speech” class were further filtered by a sound event detector [6] to ensure that a large part of each recording contains speech along with other sounds. All audio recordings are sampled at 16 kHz. The average SNR of the filtered Audioset data, estimated by the WADA algorithm [5], is around 10 dB.

4.2 Training procedure for VQ-VAE

To train our VQ-VAE, we randomly sampled a 1-second speech segment from the DNS dataset and a 1-second noise segment from the ESC-50 dataset. We then mixed the two signals at an SNR randomly sampled from a range of values in dB. The mixed signals were scaled to provide a dynamic range of 40 dB. The resulting degraded signal was the input to the VQ-VAE.
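Mixing a speech segment and a noise segment at a target SNR is commonly done by rescaling the noise, as in the NumPy sketch below. The random signals stand in for real recordings, and the sampling range of -10 dB to 10 dB is a hypothetical choice for this example; the subsequent level scaling to a 40 dB dynamic range is not shown.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`, then add."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
sample_rate = 16000                         # 1-second segments at 16 kHz
speech = rng.standard_normal(sample_rate)   # stand-in for a clean speech segment
noise = rng.standard_normal(sample_rate)    # stand-in for a noise segment

snr_db = rng.uniform(-10.0, 10.0)           # hypothetical SNR range
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db)

# Check the SNR that was actually achieved.
achieved_snr_db = 10.0 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
```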

4.3 Training and evaluation procedure for enhancement

We used the stacked gated recurrent units described in [19] as the baseline system for real-time speech enhancement in our experiments. Similar to the training procedure for the VQ-VAE, we simulated degraded speech from speech signals in the DNS dataset and noise from the ESC-50 dataset. The degraded-clean pairs were used to train the SE system with the supervised loss function in Eq. (3). We consider three different conditions for training the enhancement system: (1) Baseline: the enhancement model is trained using only the paired data with supervised losses, (2) Paired-Unsupervised: the unsupervised loss functions (either Eq. (18) or Eq. (19)) are calculated from the paired data, and (3) Unpaired-Unsupervised: the unsupervised losses are calculated from the real-world unpaired data in addition to the paired data. We summarize the setup of these systems in Table 1.

Method                Training data        Unsupervised loss function
Baseline              paired               -
Paired-Embedding      paired               Eq. (18)
Paired-Feature        paired               Eq. (19)
Unpaired-Embedding    paired & unpaired    Eq. (18)
Unpaired-Feature      paired & unpaired    Eq. (19)

Table 1: System configurations

To evaluate the quality of enhanced speech signals, we used the perceptual evaluation of speech quality (PESQ) [13] and scale-invariant signal-to-distortion ratio (SI-SDR) [7] metrics.
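For reference, SI-SDR as defined in [7] projects the estimate onto the reference and compares the energy of that projection with the energy of the residual. The NumPy sketch below follows that definition; removing the mean of both signals first is a common implementation convention and an assumption here, and the random signals are placeholders for real speech.

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB."""
    reference = reference - np.mean(reference)
    estimate = estimate - np.mean(estimate)
    # Optimal scaling of the reference towards the estimate.
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(residual ** 2))

rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
distortion = rng.standard_normal(16000)

mild = reference + 0.1 * distortion    # lightly distorted estimate
severe = reference + 1.0 * distortion  # heavily distorted estimate

si_sdr_mild = si_sdr(mild, reference)
si_sdr_severe = si_sdr(severe, reference)
si_sdr_rescaled = si_sdr(3.0 * mild, reference)  # same value as si_sdr_mild up to rounding
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant does not change the score, which is what makes the metric scale-invariant.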

Figure 3: Absolute SI-SDR improvement (top) and PESQ improvement (bottom) averaged across all evaluation conditions for noise types seen during training. The averaged SNR and PESQ of unprocessed speech are 0 dB and 1.39 MOS, respectively.

Figure 4: Absolute SI-SDR improvement (top) and PESQ improvement (bottom) averaged across all evaluation conditions for noise types unseen during training. The averaged SNR and PESQ of unprocessed speech are 0 dB and 1.41 MOS, respectively.

5 Experimental results and discussions

Method    SNR: -10 dB    -5 dB    0 dB    5 dB    10 dB

Table 2: Evaluation of speech enhancement systems trained on 20% supervised data: seen (unseen) noise conditions

5.1 Effect of the amount of paired data

We present the absolute improvement of all SE systems under the seen noise condition as a function of the amount of paired training data in Figure 3. With the minimum amount of paired data (10%), training based on the unsupervised losses was not able to improve over the supervised baseline; in fact, many systems performed noticeably worse than the baseline. As the amount of paired data increased to 20% of the DNS speech and ESC-50 noise, all unsupervised loss functions improved substantially and surpassed the baseline performance. This indicates that a sufficient amount of paired data is necessary for the VQ-VAE to learn a reliable representation of speech and noise. Finally, as more paired data was presented in training, the benefit of the unsupervised losses decreased. This suggests that the supervised loss function eventually outweighs the unsupervised losses.

5.2 Generalization to unseen noise types

We present the absolute improvement of all SE systems under the unseen noise condition as a function of the amount of paired training data in Figure 4. We note a trend similar to that observed for the seen noise conditions: unsupervised loss functions require a certain amount of supervised training to benefit the system. The overall performance is generally worse than in the seen noise condition and improves more slowly as the amount of paired data increases. This phenomenon is generally true for supervised SE systems. At 30% of supervised data, however, we observe an improvement similar to that in the seen noise condition when including the unsupervised losses calculated from the paired data. This indicates that the unsupervised losses calculated from the paired data generalize to unseen noise conditions.

5.3 Effect of unsupervised loss functions

As Figure 3 and Figure 4 revealed that 20% of supervised data is the minimum in our setting at which the SE systems start benefiting from unsupervised loss functions, we present the detailed evaluation across noise conditions in Table 2. Results show that the paired-embedding loss is the best across most SNR conditions. The paired-feature loss is slightly better under some unseen noise types. The unpaired loss functions had more impact at higher SNRs. We believe that this could be because the filtered Audioset data has a relatively high SNR.

5.4 Learned embedding margin

To verify that the source separation task imposed on the VQ-VAE was effective, we present the averaged triplet margin calculated on the validation set in Figure 5. As defined in Eq. (18), the margin should be high when the global SNR is low, and low when the global SNR is high. As Figure 5 shows, using 10% or 20% of the supervised data was not enough to learn the correct relationship. While using 30% of the supervised data worked, using 40% yielded a more drastic improvement. This shows that the more training data is available, the better the VQ-VAE learns the SSL tasks, which in turn would improve the quality of the unsupervised losses for SE.

Figure 5: Averaged triplet margin on the validation set as a function of global SNR. Lines with more negative slopes correspond to better learned representations. The margin was calculated on the validation set in three-dimensional latent space. The standard deviations from the smallest amount of data to the largest amount of data were 0.55, 0.28, 0.31, and 1.53, respectively.

6 Conclusions

In this paper, we introduced two novel unsupervised loss functions for speech enhancement that were enabled by a modified vector-quantized variational autoencoder and a self-supervised learning task. We showed that the loss functions calculated on supervised data were able to improve supervised speech enhancement systems when the amount of training data is small. We also showed that the loss functions calculated on real-world noisy speech data were able to improve the supervised SE systems in some noise conditions. In the future, we plan to fine-tune the VQ-VAE on enhanced speech data. We will also explore techniques for sampling real-world data to better match the evaluation condition.


References

  • [1] A. A. Catellier and S. D. Voran (2020) Wawenets: a no-reference convolutional waveform-based approach to estimating narrowband and wideband speech quality. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 331–335.
  • [2] T. Fujimura, Y. Koizumi, K. Yatabe, and R. Miyazaki (2021) Noisy-target training: a training strategy for dnn-based speech enhancement without clean speech. arXiv preprint arXiv:2101.08625.
  • [3] J. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio Set: an ontology and human-labeled dataset for audio events. In ICASSP.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
  • [5] C. Kim and R. M. Stern (2008) Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis. In Ninth Annual Conference of the International Speech Communication Association.
  • [6] A. Kumar and V. Ithapu (2020) A sequential self teaching approach for improving generalization in sound event recognition. In International Conference on Machine Learning, pp. 5447–5457.
  • [7] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey (2019) SDR–half-baked or well done?. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630.
  • [8] Y. Luo and N. Mesgarani (2019) Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM transactions on audio, speech, and language processing, pp. 1256–1266.
  • [9] J. M. Martin-Donas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado (2018) A deep learning loss function based on the perceptual evaluation of the speech quality. IEEE Signal processing letters 25 (11), pp. 1680–1684.
  • [10] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536–2544.
  • [11] K. J. Piczak (2015) ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 1015–1018.
  • [12] C. K. Reddy, E. Beyrami, H. Dubey, V. Gopal, R. Cheng, R. Cutler, S. Matusevych, R. Aichner, A. Aazami, S. Braun, et al. (2020) The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework. arXiv preprint arXiv:2001.08662.
  • [13] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra (2001) Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In IEEE ICASSP, Vol. 2, pp. 749–752.
  • [14] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823.
  • [15] A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017) Neural discrete representation learning. In 31st International Conference on Neural Information Processing Systems, pp. 6309–6318. ISBN 9781510860964.
  • [16] Y. Wang, S. Venkataramani, and P. Smaragdis (2020) Self-supervised learning for speech enhancement. arXiv preprint arXiv:2006.10388.
  • [17] Y. Wang, A. Narayanan, and D. Wang (2014) On training targets for supervised speech separation. IEEE/ACM transactions on audio, speech, and language processing 22 (12), pp. 1849–1858.
  • [18] S. Wisdom, E. Tzinis, H. Erdogan, R. J. Weiss, K. Wilson, and J. R. Hershey (2020) Unsupervised sound separation using mixture invariant training. In NeurIPS.
  • [19] Y. Xia, S. Braun, C. K. Reddy, H. Dubey, R. Cutler, and I. Tashev (2020) Weighted speech distortion losses for neural-network-based real-time speech enhancement. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 871–875.
  • [20] Y. Xiang and C. Bao (2020) A parallel-data-free speech enhancement method using multi-objective learning cycle-consistent generative adversarial network. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 1826–1838.
  • [21] Y. Zhao, B. Xu, R. Giri, and T. Zhang (2018) Perceptually guided speech enhancement using deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5074–5078.
  • [22] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE ICCV, pp. 2223–2232.