
A Study on Speech Enhancement Based on Diffusion Probabilistic Model

Diffusion probabilistic models have demonstrated an outstanding capability to model natural images and raw audio waveforms through paired diffusion and reverse processes. The unique property of the reverse process (namely, eliminating non-target signals from the Gaussian noise and noisy signals) can be utilized to restore clean signals. Based on this property, we propose a diffusion probabilistic model-based speech enhancement (DiffuSE) model that aims to recover clean speech signals from noisy signals. The fundamental architecture of the proposed DiffuSE model is similar to that of DiffWave, a high-quality audio waveform generation model with a relatively low computational cost and footprint. To attain better enhancement performance, we designed an advanced reverse process, termed the supportive reverse process, which adds noisy speech to the predicted speech at each time step. The experimental results show that DiffuSE yields performance comparable to related audio generative models on the standardized Voice Bank corpus SE task. Moreover, relative to the generally suggested full sampling schedule, the proposed supportive reverse process especially benefits fast sampling, requiring only a few steps to yield better enhancement results than the conventional full-step inference process.



1 Introduction

The goal of speech enhancement (SE) is to improve the intelligibility and quality of speech by mapping distorted speech signals to clean signals. The SE unit has been widely used as a front-end processor in various speech-related applications, such as speech recognition [15, 3, 2], speaker recognition [18], assistive hearing technologies [7, 12], and audio attack protection [41]. Recently, deep neural network (DNN) models have been widely used as fundamental tools in SE systems, yielding promising results [17, 37, 39, 40, 27, 24, 13]. Compared to traditional SE methods, DNN-based methods can more effectively characterize the nonlinear mapping between noisy and clean signals, particularly under extremely low signal-to-noise-ratio (SNR) scenarios and/or non-stationary noise environments [32, 10, 23].

Traditional SE methods calculate the noisy-to-clean mapping through discriminative methods in the time-frequency (T-F) domain or the time domain. For the T-F domain methods, the time-domain speech signals are first converted into spectral features through a short-time Fourier transform (STFT). The mapping of noisy to clean spectral features is then formulated by either a direct mapping function [17, 40] or a masking function [37, 38, 31]. The enhanced spectral features are reconstructed to time-domain waveforms, using the phase of the noisy speech, via the inverse STFT operation [36]. Compared with T-F domain methods, it has been shown that time-domain SE methods can avoid the distortion caused by inaccurate phase information [5, 6]. To date, several audio generation models have been directly applied or moderately modified to perform SE by estimating the distribution of the clean speech signal, such as generative adversarial networks (GANs) [21, 29, 4], autoregressive models [25], variational autoencoders (VAEs) [14], and flow-based models [30].
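As a concrete illustration of the T-F masking pipeline described above, the following NumPy sketch converts a waveform into spectral features, applies a mask to the magnitude, and keeps the noisy phase for reconstruction. The `mask_fn` here is a hypothetical stand-in for a trained masking network, and the STFT parameters are arbitrary choices, not those of the paper.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed short-time Fourier transform (analysis only)."""
    win = np.hanning(n_fft)
    frames = np.stack([x[i:i + n_fft] * win
                       for i in range(0, len(x) - n_fft + 1, hop)])
    return np.fft.rfft(frames, axis=-1)

def mask_enhance(noisy, mask_fn, n_fft=256, hop=64):
    """Apply a T-F mask to the noisy magnitude; keep the noisy phase."""
    spec = stft(noisy, n_fft, hop)
    mag, phase = np.abs(spec), np.angle(spec)
    mask = mask_fn(mag)                  # mask values expected in [0, 1]
    return mag * mask * np.exp(1j * phase)

# A near-identity mask stands in for the output of a trained network.
noisy = np.random.randn(16000)
enhanced_spec = mask_enhance(noisy, lambda m: np.clip(m / (m + 1e-8), 0, 1))
```

A real system would train `mask_fn` on noisy/clean pairs and invert the masked spectrum with an inverse STFT, as described in [36].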

The diffusion probabilistic model, proposed in [28], has shown strong generation capability. It includes a diffusion/forward process and a reverse process. The diffusion process converts clean input data into an isotropic Gaussian distribution by adding Gaussian noise to the original signal at each step. In the reverse process, the diffusion probabilistic model predicts a noise signal and subtracts it from the noisy input to retrieve the clean signal. The model is trained by optimizing the evidence lower bound (ELBO) during the diffusion process. Recently, diffusion probabilistic models have been shown to provide outstanding performance in generative modeling for natural images [8, 19] and raw audio waveforms [11, 16]. As reported in [11], the DiffWave model, built on the diffusion probabilistic model, can yield state-of-the-art performance on both conditional and unconditional waveform generation tasks with a small number of parameters.

In this study, we propose a novel diffusion probabilistic model-based SE method, called DiffuSE. The basic model structure of DiffuSE is similar to that of DiffWave. Since the target task is SE, DiffuSE uses noisy spectral features as the conditioner, rather than the clean Mel-spectral features used in DiffWave. Meanwhile, we modified the reverse process to start from the noisy speech instead of isotropic Gaussian noise. To further improve the quality of the enhanced speech, we pretrained the model using clean Mel-spectral features as the conditioner. After pretraining, we replaced the conditioner with noisy spectral features, reset the parameters of the conditioner encoder, and preserved the other parameters for the SE training.

The contributions of this study are three-fold: (1) it is the first study to apply the diffusion probabilistic model to SE tasks; (2) we derive a novel supportive reverse process, designed specifically for the SE task, which incorporates the noisy speech signal during the reverse process; (3) the experimental results confirm the effectiveness of DiffuSE, which provides comparable or even better performance than related time-domain generative SE methods.

The remainder of this paper is organized as follows. We present diffusion models in Section 2 and introduce the DiffuSE architecture in Section 3. We provide the experimental settings in Section 4, report the results in Section 5, and conclude the paper in Section 6.

2 Diffusion probabilistic models

This section introduces the diffusion and reverse procedures of the diffusion probabilistic model. A detailed mathematical derivation of the model's ELBO can be found in [8]; here, we only discuss the diffusion and reverse processes together with their algorithms.

Figure 1: The diffusion process (solid arrows) and the reverse process (dashed arrows) of the diffusion probabilistic model.
for iteration $= 1, 2, \dots$ do
     Sample $x_0 \sim q_{data}$, $\epsilon \sim \mathcal{N}(0, I)$, and $t \sim \mathrm{Uniform}(\{1, \dots, T\})$
     Take a gradient step on $\nabla_\theta \lVert \epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t) \rVert$
     according to Eq. 6
end for
Algorithm 1 Training

2.1 Diffusion and Reverse Processes

A diffusion model with $T$ steps is composed of two processes: the diffusion process with steps $t = 1, \dots, T$ and the reverse process [28]. The input data distribution of the diffusion process is defined as $q_{data}(x_0)$ on $\mathbb{R}^L$, where $L$ is the data dimension. $x_t \in \mathbb{R}^L$ is the step-dependent variable at diffusion step $t$, with the same dimension $L$. The diffusion and reverse processes are illustrated in Figure 1.

In Figure 1, the solid arrows represent the diffusion process from the data $x_0$ to the latent variable $x_T$:

$q(x_1, \dots, x_T \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),$  (1)

which is formulated as a fixed Markov chain, $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I)$, with a small positive constant ratio $\beta_t$, so that Gaussian noise is added to the previous distribution $x_{t-1}$. The overall process gradually converts the data $x_0$ to a latent variable with an isotropic Gaussian distribution $\mathcal{N}(0, I)$, according to the predefined schedule $\beta_1, \dots, \beta_T$.

The sampling distribution at the $t$-th step, $q(x_t \mid x_0)$, can also be derived from the distribution of $x_0$ in closed form by marginalizing $x_1, \dots, x_{t-1}$:

$q(x_t \mid x_0) = \mathcal{N}(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I),$  (2)

where $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$. Empirically, we can sample the $t$-th step distribution from the initial data $x_0$ directly. In contrast, the dashed arrows in Figure 1 represent the reverse process, converting the latent variable $x_T$ back to $x_0$, which is also defined by a Markov chain:

$p_\theta(x_0, \dots, x_{T-1} \mid x_T) = \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),$  (3)

where $p_\theta(x_{t-1} \mid x_t)$ is the distribution of the reverse process with learnable parameters $\theta$. Because the marginal likelihood is intractable in general, the model is trained using the ELBO. Recently, [8] showed that under a certain parameterization, the ELBO can be calculated using a closed-form solution.
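The closed-form sampling of Eq. 2 can be sketched directly in NumPy. The linear schedule below is an assumed example, not the schedule used in the experiments:

```python
import numpy as np

T = 50
beta = np.linspace(1e-4, 0.05, T)     # assumed linear noise schedule
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)         # \bar{alpha}_t = prod_s alpha_s

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form (Eq. 2)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16000)       # stand-in for a clean waveform
x_t, eps = q_sample(x0, T - 1, rng)
```

Given the sampled noise $\epsilon$, the clean signal can be recovered exactly by inverting the same linear combination, which is the identity the reverse process exploits.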

Sample $x_T \sim \mathcal{N}(0, I)$
for $t = T, T-1, \dots, 1$ do
     Compute $\mu_\theta(x_t, t)$ and $\sigma_t$, and sample $x_{t-1}$
     according to Eq. 7
end for
Algorithm 2 Sampling

2.2 Training through Parameterization

2.2.1 Parameterization

The transition probability in the reverse process in Eq. 3 can be represented by two parameters, $\mu_\theta$ and $\sigma_t$, as $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I)$, with learnable parameters $\theta$. $\mu_\theta$ is an $L$-dimensional vector that estimates the mean of the distribution of $x_{t-1}$, and $\sigma_t$ denotes the standard deviation (a real number) of the distribution. Note that $\mu_\theta$ takes two inputs: the diffusion step $t$ and the variable $x_t$. Further, Eq. 2 can also be reparameterized as $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ for $\epsilon \sim \mathcal{N}(0, I)$. $\sigma_t$ is set as a time-dependent parameter.

2.2.2 Training and Sampling

In the reverse process, $p_\theta(x_{t-1} \mid x_t)$ in Eq. 3 aims to predict the previous distribution from the current mixed data $x_t$, which contains the extra Gaussian noise added in the diffusion process. Therefore, the predicted mean $\mu_\theta(x_t, t)$ is estimated by eliminating the Gaussian noise in the mixed data $x_t$. According to the derivations in [8], $\mu_\theta(x_t, t)$ can be computed from a given $x_t$ and the predicted noise $\epsilon_\theta(x_t, t)$ as in Eq. 4:

$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right).$  (4)

Note that the real Gaussian noise $\epsilon$ added in the diffusion process is unknown in the reverse process; therefore, the model is designed to predict it as $\epsilon_\theta(x_t, t)$. In contrast, $\sigma_t$, the standard deviation of $p_\theta(x_{t-1} \mid x_t)$, can be fixed to a constant for every step $t$ as in Eq. 5:

$\sigma_t = \sqrt{\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t}.$  (5)

Therefore, for predicting $\mu_\theta(x_t, t)$ in the reverse process, the model parameters $\theta$ aim to estimate the Gaussian noise $\epsilon$ from the inputs $x_t$ and $t$. During the diffusion process, the training loss is defined to reduce the distance between the estimated noise $\epsilon_\theta$ and the Gaussian noise $\epsilon$ in the mixed data $x_t$, as shown in Eq. 6:

$\min_\theta\ \mathbb{E}_{x_0, \epsilon, t} \left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right) \right\rVert.$  (6)
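A single training term of Eq. 6 can be computed as follows. The zero predictor used here is a hypothetical stand-in for the trained $\epsilon_\theta$ network, so the loss simply equals the norm of the sampled noise; a trained network would drive this value toward zero:

```python
import numpy as np

def diffusion_loss(eps_model, x0, t, alpha_bar, rng):
    """Single training term of Eq. 6: || eps - eps_theta(x_t, t) ||."""
    eps = rng.standard_normal(x0.shape)
    # Reparameterized forward sample (Eq. 2): x_t from x_0 in one shot.
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.linalg.norm(eps - eps_model(x_t, t))

rng = np.random.default_rng(1)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.05, 50))  # assumed schedule
x0 = rng.standard_normal(1024)
loss = diffusion_loss(lambda x, t: np.zeros_like(x), x0, 10, alpha_bar, rng)
```

In practice this scalar would be differentiated with respect to the network parameters and minimized with stochastic gradient descent, as in Algorithm 1.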


After the training process, $x_{t-1}$ is computed using Eq. 7, where $z \sim \mathcal{N}(0, I)$ for $t > 1$ and $z = 0$ for $t = 1$:

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z.$  (7)
To summarize, the model is trained during the diffusion process by estimating the Gaussian noise inside the mixed signal $x_t$, and it samples data through the reverse process. We describe the diffusion and reverse processes in Algorithms 1 and 2, respectively. Table 1 lists the parameters of the diffusion probabilistic models.
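The full reverse process of Algorithm 2, with the fixed variance of Eq. 5, can be sketched in NumPy as follows. The zero-noise predictor is again a hypothetical stand-in for a trained $\epsilon_\theta$ network:

```python
import numpy as np

def reverse_sample(eps_model, shape, beta, rng):
    """Ancestral sampling via Eq. 7, starting from isotropic Gaussian noise."""
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)
    # Fixed variance of Eq. 5: sigma_t^2 = (1 - abar_{t-1}) / (1 - abar_t) * beta_t
    abar_prev = np.concatenate(([1.0], alpha_bar[:-1]))
    sigma = np.sqrt((1.0 - abar_prev) / (1.0 - alpha_bar) * beta)
    x = rng.standard_normal(shape)                     # x_T ~ N(0, I)
    for t in range(len(beta) - 1, -1, -1):
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = (x - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_model(x, t)) \
            / np.sqrt(alpha[t]) + sigma[t] * z
    return x

rng = np.random.default_rng(2)
beta = np.linspace(1e-4, 0.05, 50)                     # assumed schedule
out = reverse_sample(lambda x, t: np.zeros_like(x), (1024,), beta, rng)
```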

Process     Parameter                                    Meaning
Diffusion   $\alpha_t = 1 - \beta_t$                     ratio of $x_{t-1}$ in $x_t$
Diffusion   $\beta_t$                                    ratio of noise added in $x_t$
Diffusion   $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$    ratio of $x_0$ in $x_t$
Diffusion   $\epsilon$                                   isotropic Gaussian noise
Reverse     $\epsilon_\theta$                            predicted noise from model
Reverse     $\mu_\theta$                                 predicted mean from model
Reverse     $\sigma_t$                                   standard deviation of $p_\theta(x_{t-1} \mid x_t)$
Table 1: Parameters in the diffusion probabilistic models

3 DiffuSE architecture

In the proposed DiffuSE model, we derive a novel supportive reverse process to replace the original reverse process, to eliminate noise signals from the noisy input more effectively.

3.1 Supportive Reverse Process

In the original diffusion probabilistic model, Gaussian noise is the starting point of the reverse process. Since the clean speech signal is unseen during the reverse process, the recovered speech signal may become distorted as the process proceeds from step $T$. To address this issue, we propose a supportive reverse process that starts the sampling from the noisy speech signal $y$ and combines $y$ at each reverse step while reducing the additional Gaussian noise.

The noisy speech signal $y$ can be considered a combination of the clean speech signal $x_0$ and background noise $n$, i.e., $y = x_0 + n$. In the supportive reverse process, we define a new variable $\hat{x}_t$, which is a combination of the noisy speech $y$ and the predicted mean, as shown in Eq. 8, where the scaling of $y$ follows from the diffusion process, in which the mean of $x_t$ is known to be $\sqrt{\bar\alpha_t}\,x_0$. We then fill the remaining noise power with a Gaussian signal, as in Eq. 9.

In diffusion models, $\epsilon_\theta$ is used to predict the noise signal from $x_t$. For the SE task, the objective of $\epsilon_\theta$ can also be considered as predicting the non-speech part, which is then used to recover the clean speech signal from the mixed signal. Therefore, although the noise in the combined signal of the supportive reverse process is not Gaussian, $\epsilon_\theta$ retains the ability to predict the non-speech components from the noisy signal at the $t$-th step, based on the knowledge about different speech-noise combinations learned during the diffusion process. In addition, because $x_t$ in the standard reverse process is a combination of the clean speech signal and Gaussian noise, the supportive reverse process directly uses the noisy speech signal as the input of the reverse process, rather than Gaussian noise, to achieve more efficient clean speech recovery. Meanwhile, at each reverse step, the supportive reverse process combines the noisy speech and Gaussian noise to form the input of $\epsilon_\theta$. After the overall reverse process is completed, we follow the suggestion in [1] and combine the enhanced signal with the original noisy signal to obtain the final enhanced speech. The detailed procedure of the supportive reverse process is shown in Algorithm 3.
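Since the exact combination coefficients are defined by Eqs. 8 and 9, the following sketch is schematic only: it assumes a single fixed interpolation weight `gamma` (hypothetical) in place of the per-step schedule, and a zero-noise predictor as a stand-in for the trained $\epsilon_\theta$ network.

```python
import numpy as np

def supportive_reverse(eps_model, y, beta, gamma, rng):
    """Schematic supportive reverse process: start the sampling from the
    noisy speech y and re-inject y at every step, shrinking the Gaussian
    term accordingly. `gamma` stands in for the schedule of Eqs. 8-9."""
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)
    x = y.copy()                                   # start from noisy speech, not N(0, I)
    for t in range(len(beta) - 1, -1, -1):
        # Standard reverse-step mean (Eq. 4) using the predicted noise.
        mu = (x - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_model(x, t)) \
             / np.sqrt(alpha[t])
        z = rng.standard_normal(y.shape) if t > 0 else np.zeros_like(y)
        # Convex combination with the noisy speech; reduced Gaussian noise.
        x = (1.0 - gamma) * mu + gamma * y + (1.0 - gamma) * np.sqrt(beta[t]) * z
    return 0.8 * x + 0.2 * y                       # final 80/20 blend, following [1]

rng = np.random.default_rng(4)
y = rng.standard_normal(2048)                      # stand-in for a noisy waveform
out = supportive_reverse(lambda x, t: np.zeros_like(x), y,
                         np.linspace(1e-4, 0.05, 50), 0.2, rng)
```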

Initialize $\hat{x}_T$ from the noisy speech $y$
for $t = T, T-1, \dots, 1$ do
     Compute $\epsilon_\theta(\hat{x}_t, t)$ and $\hat{x}_{t-1}$
     according to Eq. 8 and 9
end for
Algorithm 3 Supportive Reverse Sampling

3.2 Model Structure

3.2.1 DiffWave Architecture

The model architecture of DiffWave is similar to that of WaveNet [20]. Without the autoregressive generation constraint, the dilated convolution is replaced with a bidirectional dilated convolution (Bi-DilConv). The non-autoregressive property of DiffWave yields a major advantage over WaveNet: generation is much faster. The network comprises a stack of $N$ residual layers with $C$ residual channels. These layers are grouped into $m$ blocks, each containing $n = N/m$ layers. The kernel size of the Bi-DilConv is 3, and the dilation is doubled at each layer within a block, i.e., $[1, 2, 4, \dots, 2^{n-1}]$. Each residual layer has a skip connection to the output, as in WaveNet.
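The dilation pattern of the Bi-DilConv stack can be sketched as follows; the 30-layer, 3-block configuration matches the Base DiffuSE setting described in Section 4.2:

```python
def dilation_schedule(n_layers, cycle_len):
    """Dilation doubles layer by layer within each block, then resets."""
    return [2 ** (i % cycle_len) for i in range(n_layers)]

# 30 residual layers grouped into 3 blocks of 10 layers each.
dilations = dilation_schedule(30, 10)
```

Each block therefore spans dilations 1 through 512, giving the stack a large receptive field without autoregression.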

Figure 2: The architecture of the proposed DiffuSE model

3.2.2 DiffuSE Architecture

Figure 2 shows the model structure of DiffuSE. As in DiffWave, the conditioner in DiffuSE aims to keep the output signal similar to the target speech signal, enabling the model to separate the noise and the clean speech from the mixed data. Thus, we replace the input of the conditioner from clean Mel-spectral features with noisy spectral features. The hyperparameters of DiffuSE were set to be similar to those used in the DiffWave model [11].

3.3 Pretraining with Clean Mel-spectral Conditioner

To generate high-quality speech signals, we pretrained the DiffuSE model with the clean Mel-spectral features. In DiffWave, the conditional information is directly adopted from the clean speech, allowing the model to separate the clean speech and noise from the mixed-signals. After pretraining, we changed the conditioner from clean Mel-spectral features to the noisy spectral features, reset the parameters in the conditioner encoder, and preserved other parameters for the SE training.

3.4 Fast Sampling

Given a model trained with Algorithm 1, the authors of [11] discovered that the most effective denoising steps in sampling occur near $t = 0$ and accordingly derived a fast sampling algorithm. The algorithm collapses the $T$-step diffusion process into a reverse process with far fewer steps using a proposed variance schedule. This motivates us to apply fast sampling in DiffuSE to reduce the number of denoising steps. In addition, by computing the combined variable and its Gaussian complement with Eq. 8 and Eq. 9, respectively, the fast sampling schedule can be combined with the supportive reverse process.

4 Experiments

4.1 Data

We evaluated the proposed DiffuSE on the VoiceBank-DEMAND dataset [34]. The dataset contains 30 speakers from the VoiceBank corpus [35], divided into a training set of 28 speakers and a testing set of 2 speakers. The training utterances were mixed with eight real-world noise samples from the DEMAND database [33] and two artificial (babble and speech-shaped) samples at SNR levels of 0, 5, 10, and 15 dB. The testing utterances were mixed with different noise samples at SNR values of 2.5, 7.5, 12.5, and 17.5 dB, forming 824 utterances (0.6 h). Additionally, utterances from two speakers were used to form a validation set for model development, resulting in 8.6 h and 0.7 h of data for training and validation, respectively. All utterances were resampled to a 16 kHz sampling rate.
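Mixing a clean utterance with noise at a target SNR, as in the data preparation above, amounts to scaling the noise power. The following sketch illustrates this with random signals standing in for real recordings:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then mix."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(3)
clean = rng.standard_normal(16000)   # stand-in for a 1 s clean utterance at 16 kHz
noise = rng.standard_normal(16000)   # stand-in for a DEMAND noise segment
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```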

for each step $s$ in the fast sampling schedule do
     Compute $\hat{x}_s$ and $\hat{x}_{s-1}$ according to the fast sampling variance schedule
end for
Algorithm 4 Fast Sampling

4.2 Model Setting and Training Strategy

The DiffuSE model was constructed using 30 residual layers with three dilation cycles and a kernel size of three. Based on the design of DiffWave in [11], we set the number of diffusion steps and residual channels following the Base and Large DiffWave configurations, respectively. The training noise schedule was linearly spaced for both Base DiffuSE and Large DiffuSE. The same learning rate was used for both pretraining (using the clean Mel-spectrum) and fine-tuning the DiffuSE model. The dimension of the Mel-spectrum was 80, and the dimension of the noisy spectrum was 513, for the same window size of 1024 with a 256-sample shift. The parameter $\gamma_t$ in the supportive reverse process was set to a fixed value for $t$ larger than 1, and $\gamma_1$ was set to 0.2. During pretraining, we followed the instructions in [11], where the vocoder model was trained for one million iterations, and the large model for three hundred thousand iterations, for better initialization. In training the SE model, we trained for 300 thousand iterations for Base DiffuSE and 700 thousand iterations for Large DiffuSE. The batch size was 16 for Base DiffuSE and 15 for Large DiffuSE owing to resource limitations. Both pretraining and fine-tuning of DiffuSE used an early stopping scheme.

4.3 Evaluation Metrics

We report standardized evaluation metrics for performance comparison, including the perceptual evaluation of speech quality (PESQ) [26] (the wide-band version in ITU-T P.862.2), the prediction of the signal distortion (CSIG), the prediction of the background intrusiveness (CBAK), and the prediction of the overall speech quality (COVL) [9]. Higher scores indicate better SE performance for all evaluation metrics.

Base DiffuSE  Schedule  PESQ  CSIG  CBAK  COVL
Noisy         -         1.97  3.35  2.44  2.63
RP            Fast      1.96  3.13  2.22  2.52
              Full      1.97  3.21  2.22  2.57
RP-           Fast      2.07  3.21  2.57  2.62
              Full      2.05  3.27  2.48  2.64
RP-           Fast      2.05  3.31  2.21  2.64
              Full      2.12  3.38  2.25  2.72
RP-           Fast      2.29  3.47  2.67  2.85
              Full      2.31  3.51  2.61  2.88
SRP           Fast      2.41  3.61  2.82  2.99
              Full      2.39  3.60  2.79  2.97
(a) Evaluation results of the Base DiffuSE model.
Large DiffuSE  Schedule  PESQ  CSIG  CBAK  COVL
Noisy          -         1.97  3.35  2.44  2.63
RP             Fast      2.09  3.29  2.31  2.67
               Full      2.16  3.39  2.31  2.75
RP-            Fast      2.18  3.35  2.60  2.74
               Full      2.20  3.42  2.48  2.78
RP-            Fast      2.16  3.42  2.30  2.76
               Full      2.17  3.45  2.29  2.78
RP-            Fast      2.37  3.56  2.69  2.94
               Full      2.33  3.55  2.56  2.91
SRP            Fast      2.43  3.63  2.81  3.00
               Full      2.39  3.63  2.75  2.99
(b) Evaluation results of the Large DiffuSE model.
Table 2: Evaluation results of (a) the Base DiffuSE model and (b) the Large DiffuSE model; both DiffuSE models adopted the original reverse process (RP) and the supportive reverse process (SRP). From "RP", we further implemented "RP-" by replacing the Gaussian noise with the noisy signal, and "RP-" by adding the noisy signal at the generated output; the third "RP-" variant combines both. The results of the fast and full sampling schedules are listed as "Fast" and "Full", respectively. The results of the original noisy speech (denoted as "Noisy") are also listed for comparison.
Model            PESQ  CSIG  CBAK  COVL
Noisy            1.97  3.35  2.44  2.63
SEGAN            2.16  3.48  2.94  2.80
DSEGAN           2.39  3.46  3.11  3.50
SE-Flow          2.28  3.70  3.03  2.97
DiffuSE (Base)   2.41  3.61  2.82  2.99
DiffuSE (Large)  2.43  3.63  2.81  3.00
Table 3: Evaluation results of DiffuSE and comparative time-domain generative SE models. DiffuSE with the Base and Large models are denoted as DiffuSE(Base) and DiffuSE(Large), respectively. All metric scores for the comparative methods are taken from their source papers.

5 Experimental Results

In this section, we first present the DiffuSE results with the original reverse process and the proposed supportive reverse process. Next, we compare DiffuSE with other state-of-the-art (SOTA) time-domain generative SE models. Finally, we justify the effectiveness of DiffuSE by visually analyzing the spectrogram and waveform plots of the enhanced signals.

5.1 Supportive Reverse Process Results

In the supportive reverse process, we adopted two sampling schedules, namely a fast sampling schedule and a full sampling schedule. For the fast sampling schedule, the variance schedules for Base DiffuSE and Large DiffuSE were set as suggested in [11]. The full sampling schedule used the same variance schedule as that used in the diffusion process.

Tables 2 (a) and (b) list the results of the Base DiffuSE model and the Large DiffuSE model, respectively. In the tables, the results of DiffuSE using the original reverse process and the supportive reverse process are denoted as "RP" and "SRP," respectively. The tables report the results of both fast and full sampling schedules. To investigate the effectiveness of the supportive reverse process, we further tested performance when including the noisy speech signal at the input, at the output, and at both the input and output of the DiffuSE model with the original reverse process; the results are denoted by "RP-," "RP-," and "RP-," respectively, in Table 2. When adding noisy speech at the input, we directly replaced the Gaussian noise with the noisy speech signal. When adding noisy speech at the output, the final enhanced speech is a weighted average of the enhanced speech (80%) and the noisy speech signal (20%).
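The output-side combination described above is a simple weighted average:

```python
import numpy as np

def combine_output(enhanced, noisy, w=0.8):
    """Weighted average of the enhanced (80%) and noisy (20%) signals."""
    return w * enhanced + (1.0 - w) * noisy

out = combine_output(np.ones(4), np.zeros(4))  # -> array of 0.8
```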

From Table 2 (a), we first note that, except for RP, all of the DiffuSE setups achieved improved performance over “Noisy” with a notable margin (for both fast and full sampling schedules). Next, we observe that “RP-,” “RP-,” and “RP-” outperform “RP,” showing that including the noisy speech at the input and output can enable the original reverse process to attain better enhancement performance. Finally, we note that “SRP” outperforms “RP,” “RP-,” “RP-,” and “RP-” for both fast and full sampling schedules, confirming the effectiveness of the proposed supportive reverse process for DiffuSE.

Next, from Table 2 (b), we observe that the results of the Large DiffuSE model present trends similar to those of the Base DiffuSE model (shown in Table 2 (a)). All of the DiffuSE setups provided improved performance over "Noisy," and "SRP" achieved the best performance among the DiffuSE setups. Comparing Tables 2 (a) and (b), the Large DiffuSE model yielded better enhancement results than the Base DiffuSE model, revealing that a more complex DiffuSE model can provide better enhancement results.

From Tables 2 (a) and (b), we notice that for “RP” and “RP-,” the full sampling schedule provided better results than the fast sampling schedule, which is consistent with the findings reported in DiffWave [11]. In contrast, for “RP-,” “RP-,” and “SRP,” the fast sampling schedule yielded better results than the full sampling schedule. A possible reason is that the noisy speech signal is a combination of clean speech and noise signals and presents clearly different properties from the pure Gaussian noise. Therefore, when including noisy speech in the input, it is more suitable to apply a fast sampling schedule than the full sampling schedule.

In addition to quantitative evaluations, we present spectrogram and waveform plots to qualitatively analyze the enhanced speech signals obtained from the DiffuSE models. Figures 3 and 4, respectively, show the spectrogram and waveform plots of (a) clean, (b) noisy, (c) enhanced speech using DiffuSE with the original reverse process (denoted as DiffuSE+RP), and (d) enhanced speech using DiffuSE with the supportive reverse process (denoted as DiffuSE+SRP). From Figure 3, we first note that both the original and supportive reverse processes can effectively remove noise components from a noisy spectrogram. Next, we observe notable speech distortions in (c) DiffuSE+RP, especially in the high-frequency regions (marked with red rectangles). For (d) DiffuSE+SRP, although some noise components remain, the speech structures are better preserved than in (c) DiffuSE+RP. From Figure 4, the waveform plots present similar trends: the waveform of (d) DiffuSE+SRP preserves speech structures better than that of (c) DiffuSE+RP (compare the two waveforms around 0.8 s and 1.3 s). The observations in Figures 3 and 4 help explain the gains of the supportive reverse process over the original reverse process reported in Table 2. Samples of the DiffuSE-enhanced signals can be found online.

Figure 3: Spectrogram plots of (a) clean speech, (b) noisy signal, (c) enhanced speech by DiffuSE with the original reverse process (DiffuSE+RP), and (d) enhanced speech by DiffuSE with the supportive reverse process (DiffuSE+SRP).
Figure 4: Waveform plots of (a) clean speech, (b) noisy signal, (c) enhanced speech by DiffuSE with the original reverse process (DiffuSE+RP), and (d) enhanced speech by DiffuSE with the supportive reverse process (DiffuSE+SRP).

5.2 Comparing DiffuSE with Related SE Methods

The proposed DiffuSE model is a time-domain generative SE model. For comparison, we selected three SOTA baselines that are also time-domain generative SE models, namely SEGAN [21], SE-Flow [30], and the improved deep SEGAN (DSEGAN) [22]. The experimental results of the three comparative SE methods are presented in Table 3, along with the results of DiffuSE with the supportive reverse process, where DiffuSE(Base) and DiffuSE(Large) denote the base and large models, respectively. Compared with the three baselines, the PESQ scores of DiffuSE(Base) and DiffuSE(Large) are 2.41 and 2.43, respectively, both higher than those obtained by the comparative methods. The CSIG scores of DiffuSE(Base) and DiffuSE(Large) are 3.61 and 3.63, respectively, again notably higher than those achieved by SEGAN and DSEGAN. These results confirm that the proposed DiffuSE method provides competitive performance against SOTA generative SE models.

6 Conclusions

In this study, we proposed DiffuSE, the first diffusion probabilistic model-based SE method. To enable an efficient sampling procedure, we modified the original reverse process into a supportive reverse process, specially designed for the SE task. Experimental results show that the supportive reverse process can improve the quality of the generated speech in only a few steps, yielding better performance than the full reverse process. The results also show that DiffuSE achieves SE performance comparable to that of other SOTA time-domain generative SE models. The results of DiffuSE are reproducible, and the code will be released online. We believe these results will shed light on further extensions of the diffusion probabilistic model for the SE task. In future work, we will further improve the DiffuSE model through different network structures.

7 Acknowledgement

We would like to thank Alexander Richard at Facebook for his valuable comments about this work.


  • [1] M. Abd El-Fattah, M. I. Dessouky, S. Diab, and F. Abd El-Samie (2008) Speech enhancement using an adaptive wiener filtering approach. Progress In Electromagnetics Research M 4, pp. 167–184. Cited by: §3.1.
  • [2] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey (2015) Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks. In Sixteenth Annual Conference of the International Speech Communication Association. Cited by: §1.
  • [3] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux (2015) Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 708–712. Cited by: §1.
  • [4] S. Fu, C. Liao, Y. Tsao, and S. Lin (2019) MetricGAN: generative adversarial networks based black-box metric scores optimization for speech enhancement. In International Conference on Machine Learning, pp. 2031–2041. Cited by: §1.
  • [5] S. Fu, T. Wang, Y. Tsao, X. Lu, and H. Kawai (2018) End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (9), pp. 1570–1584. Cited by: §1.
  • [6] F. G. Germain, Q. Chen, and V. Koltun (2018) Speech denoising with deep feature losses. arXiv preprint arXiv:1806.10522. Cited by: §1.
  • [7] E. W. Healy, J. L. Vasko, and D. Wang (2019) The optimal threshold for removing noise from speech is similar across normal and impaired hearing—a time-frequency masking study. The Journal of the Acoustical Society of America 145 (6), pp. EL581–EL586. Cited by: §1.
  • [8] J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239. Cited by: §1, §2.1, §2.2.2, §2.
  • [9] Y. Hu and P. C. Loizou (2007) Evaluation of objective quality measures for speech enhancement. IEEE Transactions on audio, speech, and language processing 16 (1), pp. 229–238. Cited by: §4.3.
  • [10] M. Kolbæk, Z. Tan, and J. Jensen (2016) Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (1), pp. 153–167. Cited by: §1.
  • [11] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro (2020) Diffwave: a versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761. Cited by: §1, §3.2.2, §3.4, §4.2, §5.1, §5.1.
  • [12] Y. Lai, F. Chen, S. Wang, X. Lu, Y. Tsao, and C. Lee (2016) A deep denoising autoencoder approach to improving the intelligibility of vocoded speech in cochlear implant simulation. IEEE Transactions on Biomedical Engineering 64 (7), pp. 1568–1578. Cited by: §1.
  • [13] J. Le Roux, S. Watanabe, and J. R. Hershey (2013) Ensemble learning for speech enhancement. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 1–4. Cited by: §1.
  • [14] S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud (2020) A recurrent variational autoencoder for speech enhancement. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 371–375. Cited by: §1.
  • [15] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach (2014) An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (4), pp. 745–777. Cited by: §1.
  • [16] S. Liu, Y. Cao, D. Su, and H. Meng (2021) DiffSVC: a diffusion probabilistic model for singing voice conversion. arXiv preprint arXiv:2105.13871. Cited by: §1.
  • [17] X. Lu, Y. Tsao, S. Matsuda, and C. Hori (2015) Speech enhancement based on deep denoising autoencoder. In Interspeech, pp. 436–440. Cited by: §1, §1.
  • [18] D. Michelsanti and Z. Tan (2017) Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification. arXiv preprint arXiv:1709.01703. Cited by: §1.
  • [19] A. Nichol and P. Dhariwal (2021) Improved denoising diffusion probabilistic models. arXiv preprint arXiv:2102.09672. Cited by: §1.
  • [20] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §3.2.1.
  • [21] S. Pascual, A. Bonafonte, and J. Serra (2017) SEGAN: speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452. Cited by: §1, §5.2.
  • [22] H. Phan, I. V. McLoughlin, L. Pham, O. Y. Chén, P. Koch, M. De Vos, and A. Mertins (2020) Improving GANs for speech enhancement. IEEE Signal Processing Letters 27, pp. 1700–1704. Cited by: §5.2.
  • [23] J. Qi, J. Du, S. M. Siniscalchi, and C. Lee (2019) A theory on deep neural network based vector-to-vector regression with an illustration of its expressive power in speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 1932–1943. Cited by: §1.
  • [24] J. Qi, H. Hu, Y. Wang, C. H. Yang, S. M. Siniscalchi, and C. Lee (2020) Exploring deep hybrid tensor-to-vector network architectures for regression based speech enhancement. arXiv preprint arXiv:2007.13024. Cited by: §1.
  • [25] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson (2017) Speech enhancement using Bayesian WaveNet. In Interspeech, pp. 2013–2017. Cited by: §1.
  • [26] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra (2001) Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs. Vol. 2, pp. 749–752. Cited by: §4.3.
  • [27] S. M. Siniscalchi (2021) Vector-to-vector regression via distributional loss for speech enhancement. IEEE Signal Processing Letters 28, pp. 254–258. Cited by: §1.
  • [28] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. Cited by: §1, §2.1.
  • [29] M. H. Soni, N. Shah, and H. A. Patil (2018) Time-frequency masking-based speech enhancement using generative adversarial network. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5039–5043. Cited by: §1.
  • [30] M. Strauss and B. Edler (2021) A flow-based neural network for time domain speech enhancement. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5754–5758. Cited by: §1, §5.2.
  • [31] A. S. Subramanian, S. Chen, and S. Watanabe (2018) Student-teacher learning for BLSTM mask-based speech enhancement. arXiv preprint arXiv:1803.10013. Cited by: §1.
  • [32] K. Tan, X. Zhang, and D. Wang (2019) Real-time speech enhancement using an efficient convolutional recurrent network for dual-microphone mobile phones in close-talk scenarios. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5751–5755. Cited by: §1.
  • [33] J. Thiemann, N. Ito, and E. Vincent (2013) The diverse environments multi-channel acoustic noise database (demand): a database of multichannel environmental noise recordings. In Proceedings of Meetings on Acoustics ICA2013, Vol. 19, pp. 035081. Cited by: §4.1.
  • [34] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi (2016) Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. In SSW, pp. 146–152. Cited by: §4.1.
  • [35] C. Veaux, J. Yamagishi, and S. King (2013) The Voice Bank corpus: design, collection and data analysis of a large regional accent speech database. In 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), pp. 1–4. Cited by: §4.1.
  • [36] D. Wang and J. Chen (2018) Supervised speech separation based on deep learning: an overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (10), pp. 1702–1726. Cited by: §1.
  • [37] Y. Wang, A. Narayanan, and D. Wang (2014) On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (12), pp. 1849–1858. Cited by: §1, §1.
  • [38] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller (2015) Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In International Conference on Latent Variable Analysis and Signal Separation, pp. 91–99. Cited by: §1.
  • [39] B. Xia and C. Bao (2014) Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification. Speech Communication 60, pp. 13–29. Cited by: §1.
  • [40] Y. Xu, J. Du, L. Dai, and C. Lee (2014) A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (1), pp. 7–19. Cited by: §1, §1.
  • [41] C. Yang, J. Qi, P. Chen, X. Ma, and C. Lee (2020) Characterizing speech adversarial examples using self-attention u-net enhancement. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3107–3111. Cited by: §1.