Traditional speech enhancement systems modify a noisy mixture to reduce the amount of noise it contains, but in doing so they introduce distortion in the speech. The distortion increases when there is more noise in the mixture, leading to poor quality speech [1]. In contrast, speech synthesis systems generate high quality speech from only textual information. These text-to-speech (TTS) systems are complex because they must generate a realistic acoustic representation without a reference audio signal. In this work, we propose to combine these two methods, i.e., to use speech synthesis techniques for speech enhancement. This is an easier task than TTS because we have a reference noisy audio signal from which we can extract the desired prosody instead of having to invent it. By predicting the “acoustic features” of the clean speech from the noisy speech in the speech enhancement system, we can generate high quality noise-free resyntheses.
Parametric Resynthesis (PR) systems [2, 3] predict clean acoustic parameters from noisy speech and synthesize speech from these predicted parameters using a speech synthesizer or vocoder. Current speech synthesizers are trained to generate high quality speech for a single speaker. In previous work [2, 3], we showed that a single-speaker PR system can synthesize very high quality clean speech and performs better than the corresponding TTS system. A critical question, then, is whether these systems can generalize to unknown speakers. The main contribution of the current work is to show that, when trained on a large number of speakers, neural vocoders can successfully generalize to unseen speakers. Furthermore, we show that PR systems using these neural vocoders can also generalize to unseen speakers in the presence of noise.
In this work, we test the speaker dependence of neural vocoders and their effect on the enhancement quality of PR. We show that when trained on 56 speakers, WaveGlow [4], WaveNet [5], and LPCNet [6] are able to generalize to unseen speakers. We compare the noise reduction quality of PR with three state-of-the-art speech enhancement models and show that PR-LPCNet outperforms every other system, including an oracle Wiener mask-based system. In terms of objective metrics, the proposed PR-WaveGlow scores best on signal and overall quality.
1.1 Related work
Traditional speech enhancement systems generally predict a time-frequency mask to reduce noise in the magnitude spectrum domain, for example [7, 8]. Recent works perform speech enhancement in the time domain directly, which has the additional advantage of reconstructing the phase of the signal. A modified WaveNet was proposed for speech denoising [9], using non-causal convolutions on noisy speech and predicting both the clean speech and the noise signal. Another approach is to progressively downsample the noisy audio to a bottleneck feature and then upsample it, with skip connections from the corresponding downsampled features, to enhance the speech. SEGAN [10] uses this approach in a GAN setting and Wave-U-Net [11, 12] uses it in a U-Net setting. The aim of these approaches is to remove noise from the audio at different scales. Compared to these systems, we do not focus on modelling noise but only on modelling speech. We evaluate our approach against three of these systems [10, 11, 9]. These papers publish results on the same dataset we use, and each provides several enhanced files, which we utilize in our listening tests.
2 System Overview
Our PR models have two parts. First is a prediction model that estimates the clean acoustic features from noisy audio. Second, a vocoder synthesizes “clean” speech from the predicted “clean” acoustic parameters. The aim of the prediction model is to reduce noise while the vocoder synthesizes high quality audio.
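In code, the pipeline is a simple composition of these two parts. The sketch below is illustrative only, with hypothetical function names standing in for the trained components:

```python
import numpy as np

def enhance(noisy_audio, mel_fn, prediction_model, vocoder):
    """Parametric resynthesis: predict clean features, then resynthesize.

    All three callables are placeholders for trained components:
      mel_fn           -- waveform -> noisy mel-spectrogram
      prediction_model -- noisy features -> predicted clean acoustic features
      vocoder          -- acoustic features -> "clean" waveform
    """
    noisy_mel = mel_fn(noisy_audio)            # analysis
    clean_feats = prediction_model(noisy_mel)  # noise reduction in feature space
    return vocoder(clean_feats)                # high-quality resynthesis
```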
2.1 Prediction model
The prediction model is trained with parallel clean and noisy speech. It takes the noisy mel-spectrogram as input and is trained to predict clean acoustic features. The predicted clean acoustic features vary based on the vocoder used. In this work we use WaveGlow, WaveNet, LPCNet, and WORLD [13] as vocoders. For WaveGlow and WaveNet, we predict clean mel-spectrograms. For LPCNet, we predict 18-dimensional Bark-scale frequency cepstral coefficients (BFCC) and two pitch parameters: period and correlation. For WORLD, we predict the spectral envelope, aperiodicity, and F0. For WORLD and LPCNet, we also predict the Δ and ΔΔ of these acoustic features for smoother outputs. The prediction model is trained to minimize the mean squared error (MSE) of the acoustic features:
$$L = \sum_t \lVert \hat{y}_t - y_t \rVert^2$$

where $\hat{y}_t$ are the predicted and $y_t$ are the clean acoustic features. The Adam optimizer [14] is used for training. At test time, given a noisy mel-spectrogram, the clean acoustic parameters are predicted. For LPCNet and WORLD, we use the maximum likelihood parameter generation (MLPG) [15] algorithm to refine the estimate of the clean acoustic features from the predicted static features and their Δ and ΔΔ.
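As a concrete illustration, a minimal PyTorch sketch of this MSE training step might look as follows; the model, optimizer, and tensor shapes are assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

def training_step(model, optimizer, noisy_mel, clean_feats):
    """One MSE training step for the prediction model.

    noisy_mel   : (batch, time, n_mel) noisy mel-spectrogram input
    clean_feats : (batch, time, n_out) parallel clean acoustic features
    """
    optimizer.zero_grad()
    predicted = model(noisy_mel)                           # y_hat
    loss = nn.functional.mse_loss(predicted, clean_feats)  # L = ||y_hat - y||^2
    loss.backward()
    optimizer.step()
    return loss.item()

# Adam, as in the paper:
# optimizer = torch.optim.Adam(model.parameters())
```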
The second part of PR resynthesizes speech from the predicted acoustic parameters using a vocoder. The vocoders are trained on clean speech samples $x$ and clean acoustic features $y$. During synthesis, we use the predicted acoustic parameters $\hat{y}$ to generate predicted clean speech $\hat{x}$. In the rest of this section we describe the four vocoders, three neural (WaveGlow, WaveNet, LPCNet) and one non-neural (WORLD).
WaveGlow: WaveGlow [4] is a flow-based speech synthesis model based on Glow [16]. It learns an invertible transformation from speech samples to a Gaussian distribution conditioned on the mel spectrogram. For inference, WaveGlow samples a latent variable $z$ from the learned Gaussian distribution and applies the inverse transformation, conditioned on the mel spectrogram, to reconstruct the speech sample $\hat{x}$. The model is trained to maximize the log-likelihood of the clean speech:

$$\log p(x) = \log p(z) + \sum_j \log \left| \det J_j \right|$$

where $z = f^{-1}(x)$, the $J_j$ are the Jacobians of the flow layers, and $\log p(z)$ is the log-likelihood of the spherical zero-mean Gaussian with variance $\sigma^2$. During training, $\sigma = 1$ is used. We use the officially published WaveGlow implementation (https://github.com/NVIDIA/waveglow) with the original setup, i.e., 12 coupling layers, each consisting of 8 layers of dilated convolution with 512 residual and 256 skip connections. We refer to the PR system with WaveGlow as its vocoder as PR-WaveGlow.
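The objective above follows directly from the change-of-variables formula. A schematic version, assuming the flow already returns $z$ and the per-layer log-determinants, is sketched below (illustrative, not the official implementation):

```python
import torch

def waveglow_nll(z, log_dets, sigma=1.0):
    """Negative log-likelihood under a spherical zero-mean Gaussian prior.

    z        : latent tensor z = f^{-1}(x) produced by the flow
    log_dets : iterable of per-layer log|det J| terms
    sigma    : prior std; sigma = 1 during training (constant terms omitted)
    """
    gaussian_ll = -torch.sum(z * z) / (2.0 * sigma ** 2)
    return -(gaussian_ll + sum(ld.sum() for ld in log_dets))
```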
LPCNet: LPCNet [6] is a variation of WaveRNN [17] that simplifies the vocal tract response using linear prediction from previous time-step samples:

$$p_t = \sum_{k=1}^{16} a_k\, x_{t-k}$$

The LPC coefficients $a_k$ are computed from the 18-band BFCC. The network predicts the LPC residual $e_t$ at time $t$; the sample $x_t$ is then generated by adding $p_t$ and $e_t$. A frame conditioning feature $f$ is generated from the 20 input features (18-band BFCC and 2 pitch parameters) via two convolutional and two fully connected layers. The probability $p(e_t)$ is predicted from $x_{t-1}$, $p_t$, $e_{t-1}$, and $f$ via two GRUs [18] (A and B) combined with a dual fully connected (DualFC) layer followed by a softmax. The weight matrix of the larger GRU (GRU-A) is forced to be sparse for faster synthesis. The model is trained on the categorical cross-entropy loss between the true excitation $e_t$ and the predicted probability of the excitation $p(e_t)$. Speech samples are 8-bit mu-law quantized. We use the officially published LPCNet implementation (https://github.com/mozilla/LPCNet) with 384 units in GRU-A and 16 units in GRU-B. We refer to the PR system with LPCNet as its vocoder as PR-LPCNet.
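The sample-level recurrence above is compact enough to state in code. A schematic numpy version of the linear-prediction step, with the per-frame coefficients and the network-predicted excitation assumed given:

```python
import numpy as np

def lpc_predict(history, lpc_coeffs):
    """Linear prediction p_t from the most recent samples.

    history    : past samples, oldest first (length = LPC order)
    lpc_coeffs : a_1..a_k, recomputed per frame from the 18-band BFCC
    """
    # p_t = sum_k a_k * x_{t-k}; reverse so a_1 multiplies x_{t-1}
    return float(np.dot(lpc_coeffs, history[::-1]))

def next_sample(history, lpc_coeffs, excitation):
    """x_t = p_t + e_t, where e_t is the network-predicted residual."""
    return lpc_predict(history, lpc_coeffs) + excitation
```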
WaveNet: WaveNet [5] is an autoregressive speech waveform generation model built with dilated causal convolutional layers. The generation of one speech sample $x_t$ at time step $t$ is conditioned on all previous time-step samples $x_1, \ldots, x_{t-1}$. We use the Nvidia implementation (https://github.com/NVIDIA/nv-wavenet), which is the Deep Voice [19] variant of WaveNet, for faster synthesis. Speech samples are mu-law quantized to 8 bits. The normalized log mel-spectrogram is used for local conditioning. WaveNet is trained on the cross-entropy between the quantized sample $x_t$ and the predicted quantized sample $\hat{x}_t$.
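The 8-bit mu-law companding used here (and by LPCNet) follows the standard formula; a small numpy sketch:

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """Map samples in [-1, 1] to integer mu-law codes 0..mu."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(codes, mu=255):
    """Invert mulaw_encode back to samples in [-1, 1]."""
    y = 2.0 * codes / mu - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu
```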
For WaveNet, we used a smaller model that is able to synthesize speech with moderate quality, which lets us test the PR model's dependence on synthesis quality. The model has fewer layers and smaller residual and skip connections than the original, with 256 gate channels and a reduced maximum dilation. It can synthesize clean speech with a moderate average predicted mean opinion score (MOS) for a single speaker. The PR system with WaveNet as its vocoder is referred to as PR-WaveNet.
WORLD: Lastly, we use the non-neural vocoder WORLD [13], which synthesizes speech from three acoustic parameters: spectral envelope, aperiodicity, and F0. We use WORLD with the Merlin toolkit [20] (https://github.com/CSTR-Edinburgh/merlin). WORLD is a source-filter model that takes the previously mentioned parameters and synthesizes speech. We also use spectral enhancement to modify the predicted parameters, as is standard in Merlin [20].
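For reference, a WORLD analysis-synthesis round trip can be sketched with the pyworld Python bindings (our assumption; Merlin wraps WORLD differently). In the PR system, the three parameters come from the prediction model rather than from analysis of clean speech:

```python
import numpy as np
import pyworld as pw

def world_roundtrip(x, fs):
    """Analyze speech into (F0, spectral envelope, aperiodicity), then resynthesize."""
    x = x.astype(np.float64)          # pyworld expects float64
    f0, sp, ap = pw.wav2world(x, fs)  # WORLD analysis
    return pw.synthesize(f0, sp, ap, fs)
```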
[Tables 1 and 2 appear here: Table 1 reports vocoder synthesis quality for a seen speaker and for unseen male and female speakers; Table 2 reports enhancement quality. The published scores for the comparison systems in Table 2 are:]

|System||CSIG||CBAK||COVL||STOI|
|Wave-U-Net (from [11])||3.5||3.2||3.0||–|
|SEGAN (from [10])||3.5||2.9||2.8||–|
3 Experiments
3.1 Dataset
We use the publicly available noisy VCTK dataset [21] for our experiments. The dataset contains 56 speakers for training: 28 male and 28 female speakers from the US and Scotland. The test set contains two unseen voices, one male and one female. There is also a second available training set, consisting of 14 male and 14 female speakers from England, which we use to test generalization to more speakers.
The noisy training set contains ten types of noise: two artificially created and eight chosen from DEMAND [22]. The two artificially created noises are speech-shaped noise and babble noise. The eight from DEMAND are recorded in a kitchen, meeting room, car, metro, subway car, cafeteria, restaurant, and subway station. The noisy training files are available at four SNR levels: 0, 5, 10, and 15 dB. The noisy test set contains five other noises from DEMAND: living room, office, public square, open cafeteria, and bus. The test files have higher SNRs: 2.5, 7.5, 12.5, and 17.5 dB. All files are downsampled to 16 kHz for comparison with other systems. There are 23,075 training audio files and 824 testing audio files.
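As a practical note, the 48 kHz source files can be brought to 16 kHz with standard tools. A one-function sketch using librosa and soundfile (our assumption, not necessarily the toolchain used in the paper):

```python
import librosa
import soundfile as sf

def downsample_to_16k(in_path, out_path):
    """Load a file at its native rate (48 kHz for this corpus) and write 16 kHz."""
    y, sr = librosa.load(in_path, sr=None)  # keep native sample rate
    y16 = librosa.resample(y, orig_sr=sr, target_sr=16000)
    sf.write(out_path, y16, 16000)
```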
3.2 Exp 1: Speaker independence of neural vocoders
Firstly, we test whether WaveGlow and WaveNet can generalize to unseen speakers on clean speech. Using the data described above, we train both models on a large number of speakers (56) and test them on 6 unseen speakers. Next, we compare their performance to LPCNet, which has previously been shown to generalize to unseen speakers. In this test, each neural vocoder synthesizes speech from the original clean acoustic parameters. Following the three baseline papers [10, 11, 9], we measure synthesis quality with objective enhancement quality metrics [23] consisting of three composite scores: CSIG, CBAK, and COVL. These three measures are on a scale from 1 to 5, with higher being better. CSIG provides an estimate of the signal quality, CBAK an estimate of the background noise reduction, and COVL an estimate of the overall quality.
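These scores can be computed with publicly available implementations; the sketch below assumes the third-party pysepm (for the composite measures) and pystoi packages, which may differ slightly from the original MATLAB metrics:

```python
from pystoi import stoi   # pip install pystoi
import pysepm             # pip install pysepm (composite CSIG/CBAK/COVL)

def objective_scores(clean, enhanced, fs=16000):
    """CSIG, CBAK, COVL on a 1-5 scale, plus STOI in [0, 1]."""
    csig, cbak, covl = pysepm.composite(clean, enhanced, fs)
    return {"CSIG": csig, "CBAK": cbak, "COVL": covl,
            "STOI": stoi(clean, enhanced, fs, extended=False)}
```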
LPCNet is trained for 120 epochs with a batch size of 48, where each sequence has 15 frames. WaveGlow is trained for 500 epochs and WaveNet for 200 epochs, both with a batch size of 4 utterances. For WaveNet and WaveGlow we use GPU synthesis, while for LPCNet CPU synthesis is used, as it is faster (we also found that the GPU synthesis code was incomplete as of commit 3a7ef33). WaveGlow and WaveNet synthesize from clean mel-spectrograms with a window length of 64 ms and a hop size of 16 ms. LPCNet acoustic features use a window size of 20 ms and a hop size of 10 ms.
We report the synthesis quality for three unseen male and three unseen female speakers, and compare them with unseen utterances from one known male speaker. For each speaker, the average quality is calculated over 10 files. Table 1 shows the composite quality results along with the objective intelligibility score from STOI [24]. We observe that WaveGlow has the best quality scores on all measures. The unseen female speakers' scores are close to the known speaker's, while the unseen male speakers' scores are a little lower. We note that these values are not as high as for a single-speaker WaveGlow, which can synthesize speech very close to the ground truth. We also note that LPCNet scores are lower than those of WaveGlow but better than WaveNet's. For LPCNet and WaveNet, we do not observe a significant difference in synthesis quality between male and female voices. Although WaveNet has lower scores, it is consistent across known and unknown speakers. Thus, we can say that WaveNet generalizes to unseen speakers.
3.3 Exp 2: Speaker independence of parametric resynthesis
Next, we test the generalizability of the PR system across different SNRs and unseen voices. We use the test set of 824 files at 4 different SNRs. The prediction model is a 3-layer bi-directional LSTM with 800 units per layer, trained with the Adam optimizer. For WORLD, the filter size is 1024 and the hop length is 5 ms. We compare the PR models with a mask-based oracle, the oracle Wiener mask (OWM), which has the clean signal available during testing.
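A minimal PyTorch sketch of such a predictor is below; the input/output feature sizes and the final projection layer are our assumptions, since only the LSTM depth and width are specified:

```python
import torch.nn as nn

class PredictionModel(nn.Module):
    """3-layer bi-directional LSTM with 800 units per direction (as specified);
    feature dimensions are placeholders."""

    def __init__(self, n_in=80, n_out=80, hidden=800):
        super().__init__()
        self.lstm = nn.LSTM(n_in, hidden, num_layers=3,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_out)  # 2x: forward + backward states

    def forward(self, noisy_mel):      # (batch, time, n_in)
        h, _ = self.lstm(noisy_mel)
        return self.proj(h)            # (batch, time, n_out)
```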
Table 2 reports the objective enhancement quality metrics and STOI. We observe that the OWM performs best, and PR-WaveGlow performs better than Wave-U-Net and SEGAN on CSIG and COVL. PR-WaveGlow's CBAK score is lower, which is expected since this score is not very high even when we synthesize clean speech (as shown in Table 1). Among the PR models, PR-WaveGlow scores best and PR-WaveNet worst on CSIG; the moderate synthesis quality of the smaller WaveNet model hurts the performance of the PR system. PR-WORLD and PR-LPCNet scores are lower as well, but we observe that both of these models sound much better than the objective scores would suggest. We believe that, because both of these models predict F0, even a slight error in F0 prediction affects the objective scores adversely. To test this, we run PR-LPCNet using the noisy F0 instead of the predicted F0, and the objective quality scores increase; in informal listening, however, the subjective quality with the noisy F0 is similar to or worse than with the predicted F0. Hence we can say that the objective enhancement metrics are not a very good measure of quality for PR-LPCNet and PR-WORLD.
We also test the objective quality of the PR models and the OWM across different SNRs and noise types. The results are shown in Figure 1. We observe that with decreasing SNR, CBAK quality for the PR models stays the same, while for the OWM the CBAK score decreases rapidly. This shows that the noise has a smaller effect on background quality for PR than for a mask-based system, i.e., the background quality is related more to the presence of synthesis artifacts than to recorded background noise.
3.4 Listening tests
Next, we test the subjective quality of the PR systems with a listening test. We choose 12 of the 824 test files, with four files from each of the 2.5, 7.5, and 12.5 dB SNRs; we observed that the 17.5 dB files have very little noise, and all systems perform well on them. In the listening test, we also compare with the OWM and the three comparison models. For these comparison systems, we included the publicly available output files in our listening tests, selecting five files from each: for Wave-U-Net, 3 from 12.5 dB and 2 from 2.5 dB (no 7.5 dB files were publicly available); for WaveNet-denoise and SEGAN, 2 common files from 2.5 dB, 2 more files each from 7.5 dB, and 1 from 12.5 dB.
The listening test follows the Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) paradigm [25]. Subjects were presented with 8-10 anonymized and randomized versions of each file to facilitate direct comparison: 4 PR systems (PR-WaveNet, PR-WaveGlow, PR-LPCNet, PR-WORLD), 4 comparison speech enhancement systems (OWM, Wave-U-Net, WaveNet-denoise, and SEGAN), and the clean and noisy signals. Subjects were also provided reference clean and noisy versions of each file. All files are available at http://mr-pc.org/work/icassp20/. Five subjects took part in the listening test. They were told to rate the speech quality, noise-suppression quality, and overall quality of the speech from 0 to 100, with 100 being the best. We observed the intelligibility of all of the files to be very high, so instead of running an intelligibility listening test, we asked subjects to rate subjective intelligibility on a scale from 0 to 100.
Figure 3 shows the results of the quality listening test. PR-LPCNet performs best on all three quality scores, followed by PR-WaveGlow and PR-WORLD. The next best model is the oracle Wiener mask, followed by Wave-U-Net. Table 3 shows the subjective intelligibility ratings, where PR-LPCNet has the highest subjective intelligibility, followed by the OWM, PR-WaveGlow, and PR-WORLD. It also reports the objective quality metrics on the 12 files selected for the listening test, for comparison with Table 2 on the full test set. We observe that while PR-LPCNet and PR-WORLD have very similar objective metrics (both quality and intelligibility), they have very different subjective ratings, with PR-LPCNet rated much higher.
3.5 Tolerance to error
Finally, we measure the tolerance of the PR models to inaccuracy in the prediction LSTM using the two best performing vocoders, WaveGlow and LPCNet. For this test, we randomly select 30 noisy test files. We perturb the predicted features as $\hat{y}' = \hat{y} + \lambda n$, where $\lambda$ scales the amount of added error. The random noise $n$ is generated from a Gaussian distribution with the same mean and variance at each frequency as $\hat{y}$. We then synthesize with the vocoder from $\hat{y}'$. For WaveGlow, $\hat{y}$ is the mel-spectrogram; for LPCNet, $\hat{y}$ is the 20-dimensional feature vector. We repeat the LPCNet test adding noise to all features and to only the 18 BFCC features (i.e., not adding noise to F0).
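The perturbation itself is straightforward to reproduce. A numpy sketch under the notation above (with $\lambda$ as the scale):

```python
import numpy as np

def perturb(y_hat, lam, seed=0):
    """Return y_hat + lam * n, where n is Gaussian noise whose per-frequency
    mean and variance match those of y_hat (shape: time x n_features)."""
    rng = np.random.default_rng(seed)
    mu = y_hat.mean(axis=0)
    sd = y_hat.std(axis=0)
    noise = rng.normal(mu, sd, size=y_hat.shape)
    return y_hat + lam * noise
```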
Figure 2 shows the objective metrics for these files. We observe that for WaveGlow, small amounts of added noise do not affect the synthesis quality very much, and increasing $\lambda$ decreases performance incrementally. For LPCNet, we observe that errors in the BFCC are tolerated better than errors in F0.
4 Conclusion
We show that the neural vocoders WaveGlow, WaveNet, and LPCNet can be used for speaker-independent speech synthesis when trained on 56 speakers. We also show that, using these three vocoders, the parametric resynthesis model is able to generalize to new noises and new speakers across different SNRs. We find that PR-LPCNet outperforms the oracle Wiener mask-based system in subjective quality.
References
[1] Jingdong Chen, Jacob Benesty, Yiteng Huang, and Simon Doclo, “New insights into the noise reduction Wiener filter,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1218–1234, 2006.
[2] Soumi Maiti and Michael I. Mandel, “Parametric resynthesis with neural vocoders,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2019, to appear.
[3] Soumi Maiti and Michael I. Mandel, “Speech denoising by parametric resynthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, pp. 6995–6999.
[4] Ryan Prenger, Rafael Valle, and Bryan Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” arXiv preprint arXiv:1811.00002, 2018.
[5] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu, “WaveNet: A generative model for raw audio,” in Proc. ISCA SSW, Sept. 2016, p. 125.
[6] Jean-Marc Valin and Jan Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, pp. 5891–5895.
[7] Yuxuan Wang, Arun Narayanan, and DeLiang Wang, “On training targets for supervised speech separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849–1858, 2014.
[8] Hakan Erdogan, John R. Hershey, Shinji Watanabe, and Jonathan Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2015.
[9] Dario Rethage, Jordi Pons, and Xavier Serra, “A WaveNet for speech denoising,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2018, pp. 5069–5073.
[10] Santiago Pascual, Antonio Bonafonte, and Joan Serrà, “SEGAN: Speech enhancement generative adversarial network,” arXiv preprint arXiv:1703.09452, 2017.
[11] Craig Macartney and Tillman Weyde, “Improved speech enhancement with the Wave-U-Net,” arXiv preprint arXiv:1811.11307, 2018.
[12] Daniel Stoller, Sebastian Ewert, and Simon Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
[13] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, Jul. 2016.
[14] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, Dec. 2014.
[15] Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000, vol. 3, pp. 1315–1318.
[16] Diederik P. Kingma and Prafulla Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” arXiv preprint arXiv:1807.03039, 2018.
[17] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu, “Efficient neural audio synthesis,” arXiv preprint arXiv:1802.08435, 2018.
[18] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[19] Sercan Ö. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al., “Deep Voice: Real-time neural text-to-speech,” in Proceedings of the International Conference on Machine Learning, 2017, pp. 195–204.
[20] Zhizheng Wu, Oliver Watts, and Simon King, “Merlin: An open source neural network speech synthesis system,” in Proc. SSW, 2016.
[21] Cassia Valentini-Botinhao et al., “Noisy speech database for training speech enhancement algorithms and TTS models,” University of Edinburgh, School of Informatics, Centre for Speech Technology Research (CSTR), 2017.
[22] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent, “The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,” in Proceedings of Meetings on Acoustics, 2013, vol. 19, p. 035081.
[23] Yi Hu and Philipos C. Loizou, “Evaluation of objective measures for speech enhancement,” in Proceedings of Interspeech, 2006.
[24] Cees H. Taal, Richard C. Hendriks, Richard Heusdens, and Jesper Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010, pp. 4214–4217.
[25] “Method for the subjective assessment of intermediate quality level of audio systems,” Tech. Rep. BS.1534-3, International Telecommunication Union Radiocommunication Standardization Sector (ITU-R), 2015.