WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement

04/06/2020 ∙ by Tsun-An Hsieh, et al.

Due to the simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. In order to improve the performance of the E2E model, the locality and temporal sequential properties of speech should be efficiently taken into account when modelling. However, in most current E2E models for SE, these properties are either not fully considered, or are too complex to be realized. In this paper, we propose an efficient E2E SE model, termed WaveCRN. In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU). Unlike a conventional temporal sequential model that uses a long short-term memory (LSTM) network, which is difficult to parallelize, SRU can be efficiently parallelized in calculation with even fewer model parameters. In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the embedded features in the hidden layers instead of on the physical spectral features commonly used in speech separation tasks. Experimental results on speech denoising and compressed speech restoration tasks confirm that with the lightweight architecture of SRU and the feature-mapping-based RFM, WaveCRN performs comparably with other state-of-the-art approaches with notably reduced model complexity and inference time.




1 Introduction

Speech-related applications, such as automatic speech recognition (ASR), voice communication, and assistive hearing devices, play an important role in modern society. However, most of these applications are not robust to noise, so speech enhancement (SE) [9, 7] has been used as a fundamental tool in them. SE aims to improve the quality and intelligibility of the original speech signal. Traditional SE approaches are derived from the statistical properties of speech and distortion signals (e.g., Wiener filtering). Although these approaches perform well under many conditions, their performance degrades when the assumed statistical properties do not hold.

In recent years, researchers have tried to incorporate deep learning algorithms into the SE task. Many SE systems carry out enhancement on frequency-domain acoustic features; well-known examples include the fully connected neural network [10, 27], convolutional neural network (CNN) [3], recurrent neural network (RNN) [25, 1], and their combinations [28, 20]. Although these frequency-domain approaches already provide outstanding performance, the enhanced speech signals remain imperfect because accurate phase information is lacking. To tackle this problem, some studies [26, 18] follow the frequency-domain pipeline in a phase-aware manner, while others [16] perform enhancement directly on the raw waveform.

In this paper, we propose an end-to-end raw waveform-mapping-based SE method using a convolutional recurrent neural network, termed WaveCRN. Two tasks are used to test the proposed WaveCRN SE model: (1) speech denoising and (2) compressed speech restoration. For speech denoising, we evaluate our method on an open-source dataset [23] and obtain state-of-the-art PESQ (perceptual evaluation of speech quality) scores [15] using a relatively simple architecture and an L1 loss function. For compressed speech restoration, evaluated on the TIMIT database [4], the proposed WaveCRN model recovers extremely compressed speech (speech samples compressed from 16-bit to 2-bit) with a notable relative STOI (short-time objective intelligibility) [17] improvement of 75.51% (from 0.49 to 0.86).

2 Related Works

In this section, we review existing raw-waveform-based SE approaches. Several studies have shown that phase information is important when converting spectral features to waveforms. A class of studies [26, 18] conducted phase-aware SE using a complex ratio mask (cRM) to jointly reconstruct magnitude and phase. Tan and Wang [19] proposed using a convolutional recurrent neural network (CRN) for SE in a close-talk scenario. Although these approaches outperform previous works based on the magnitude spectrogram and ideal ratio mask (IRM), it is still difficult to estimate the phase spectrogram perfectly. In the field of ASR, researchers have found that using raw waveform input can achieve lower word error rates than using hand-crafted features [2, 6]. For the SE task, the fully convolutional network (FCN) has been widely used to perform waveform mapping directly [3, 16, 12, 14, 13]. Compared to a fully connected architecture, the FCN retains more local information and can thus model the high-frequency components of speech signals more accurately. More recently, Pandey and Wang proposed a temporal convolutional neural network (TCNN) to characterize temporal features more precisely and perform SE in the time domain [11].

3 Methodology

In this section, we describe the details of our SE system. The architecture is a fully differentiable end-to-end neural network that requires neither pre-processing nor handcrafted features. We leverage the advantages of CNN and RNN to model spatial and temporal information. The model, whose stages mirror those of conventional SE methods, is shown in Fig. 1.

3.1 1D Convolutional Input Module

Most previous deep-learning-based SE approaches use the log-power spectrum (LPS) as input. Therefore, pre-processing is required to convert the raw waveform into LPS features, which are then fed into the deep-learning model. The phase information of the noisy speech is then used to reconstruct the enhanced waveform. To perform time-domain SE, we design a lightweight 1D CNN input module to mimic the behavior of the short-time Fourier transform (STFT). Being a neural network, the CNN module is fully trainable. An input noisy audio $X$ ($X \in \mathbb{R}^{N \times 1 \times L}$) is convolved with a two-dimensional tensor $W$ ($W \in \mathbb{R}^{C \times K}$) to extract the feature map $F \in \mathbb{R}^{N \times C \times T}$, where $N$, $C$, $K$, $T$, and $L$ are the batch size, number of channels, kernel size, time steps, and audio length, respectively. Notably, to reduce the sequence length for computational efficiency, we set the convolution stride to half the size of the kernel, so the length of $F$ is reduced from $L$ to $T = 2L/K + 1$.
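The length reduction can be checked with a short sketch. It uses the standard output-length rule for a strided 1D convolution and assumes zero padding of half the kernel size on each side; the padding amount, the example lengths, and all function names are illustrative assumptions rather than details taken from the paper.

```python
def conv1d_output_length(length, kernel, stride, padding=0):
    """Number of frames produced by a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def frame_starts(length, kernel, stride, padding=0):
    """Start index of every kernel window over the padded signal."""
    padded = length + 2 * padding
    return list(range(0, padded - kernel + 1, stride))


# Illustrative numbers (assumptions): 16000 samples, a 96-sample kernel,
# and a stride equal to half the kernel size.
audio_length, kernel_size = 16000, 96
stride = kernel_size // 2
frames = conv1d_output_length(audio_length, kernel_size, stride,
                              padding=kernel_size // 2)
starts = frame_starts(audio_length, kernel_size, stride,
                      padding=kernel_size // 2)
print(frames, len(starts))
```

With a stride of half the kernel size, consecutive windows overlap by 50%, which is the same framing pattern an STFT with 50% overlap would use.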

3.2 Temporal Encoder

In this work, we use a bidirectional RNN to capture the temporal correlation of the feature maps extracted by the input module in both directions. Each feature map $F_n \in \mathbb{R}^{C \times T}$, with $C$ channels and $T$ time steps, can be formulated as a sequence $[f_1, \dots, f_T]$, $f_t \in \mathbb{R}^{C}$, which is then passed to the recurrent feature extractor. The hidden states extracted in both directions are concatenated as $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$. An affine transform is used to ensure that the input and output feature maps have the same dimensions.

Figure 1: The architecture of the proposed WaveCRN model. For local feature extraction, a 1D CNN maps the noisy audio x into a 2D feature map F. Bi-SRU then encodes F into a restricted feature mask (RFM) M, which is multiplied element-wise by F to generate a masked feature map F’. Finally, a transposed 1D convolution layer recovers the enhanced waveform y from F’.

3.3 Restricted Feature Mask

The restricted optimal ratio mask (ORM) has been widely used in SE and speech separation tasks [8]. For our task, an alternative to the restricted ORM, called the restricted feature mask (RFM) $M$, whose elements all lie in the range of -1 to 1, is applied to mask the feature map $F$ as:

$$F' = M \circ F,$$

where $F'$ is the masked feature map, obtained by element-wise multiplication of the mask $M$ and the feature map $F$, and is used for waveform generation. The main difference between the restricted ORM and the RFM is that the former is applied in the time-frequency domain, whereas the latter is applied to the learned feature map.
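The masking step can be illustrated with a minimal sketch. The text only states that the mask elements lie between -1 and 1; using tanh to enforce that range is an assumption made here for illustration, and the function names and toy values are hypothetical.

```python
import math


def restricted_feature_mask(scores):
    """Squash raw encoder outputs into (-1, 1); tanh is an assumed choice."""
    return [[math.tanh(v) for v in row] for row in scores]


def apply_mask(mask, features):
    """Element-wise (Hadamard) product of a mask and a same-shaped feature map."""
    return [[m * f for m, f in zip(m_row, f_row)]
            for m_row, f_row in zip(mask, features)]


# Toy feature map with 2 channels and 3 time steps (made-up values).
features = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
encoder_out = [[0.0, 10.0, -10.0], [1.0, -1.0, 0.3]]

mask = restricted_feature_mask(encoder_out)
masked = apply_mask(mask, features)
```

Because the mask may be negative, it can flip the sign of a feature as well as attenuate it, which a mask restricted to the range 0 to 1 could not do.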

3.4 Waveform Generation

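As described in Section 3.1, the sequence length is reduced from $L$ (the audio length) to $T$ (the number of time steps) due to the stride in the convolution process. Length restoration is essential to generate an output waveform of the same length as the input. Given the input length, output length, stride, and padding as $L_{in}$, $L_{out}$, $S$, and $P$, the relation between $L_{in}$ and $L_{out}$ for a transposed convolution with kernel size $K$ can be formulated as:

$$L_{out} = (L_{in} - 1) \times S - 2 \times P + (K - 1) + 1.$$

Let $L_{in} = T$, $S = K/2$, and $P = K/2$; with $T = 2L/K + 1$, we have $L_{out} = L$. That is, the input and output lengths are guaranteed to be the same.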
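A small round-trip check makes the length argument concrete. The sketch below uses the standard length rules for a 1D convolution and a 1D transposed convolution (dilation 1, no output padding) with stride and padding both equal to half the kernel size; the kernel size and audio lengths are illustrative assumptions.

```python
def conv1d_output_length(length, kernel, stride, padding):
    """Frames produced by a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def conv_transpose1d_output_length(length, kernel, stride, padding):
    """Samples produced by a 1D transposed convolution (no output padding)."""
    return (length - 1) * stride - 2 * padding + (kernel - 1) + 1


kernel = 96                      # illustrative kernel size in samples
stride = padding = kernel // 2   # stride and padding are half the kernel

round_trip = []
for audio_length in (960, 4800, 16032):  # all divisible by the stride
    frames = conv1d_output_length(audio_length, kernel, stride, padding)
    restored = conv_transpose1d_output_length(frames, kernel, stride, padding)
    round_trip.append((audio_length, frames, restored))

# A length that is NOT divisible by the stride does not round-trip exactly.
frames_odd = conv1d_output_length(16000, kernel, stride, padding)
restored_odd = conv_transpose1d_output_length(frames_odd, kernel, stride, padding)
```

The last two lines show why the input length needs to be a multiple of the stride: a 16000-sample input would come back shorter, so the input has to be padded first.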

3.5 Model Structure Overview

In summary, as shown in Fig. 1, our model leverages the benefits of CNN and RNN. Given a noisy speech utterance, a 1D CNN first extracts local features by mapping the noisy audio x into a 2D feature map F. Bi-SRU then encodes F into an RFM M, which is multiplied element-wise by F to generate a masked feature map F’. Finally, a transposed 1D convolution layer recovers the enhanced waveform y from F’.

3.6 Comparing LSTM and SRU

SRU [21] parallelizes better than LSTM. LSTM encodes a sequence recursively, with gates that depend on the previous hidden state, and this dependency leads to slow training and inference. In contrast, all gates in the SRU depend only on the input at the corresponding time step, and the temporal correlation is captured by adding a highway connection between the recurrent layers. Therefore, the gates in the SRU can be computed simultaneously for all time steps. Furthermore, replacing the matrix multiplication with the Hadamard product when computing the state vectors speeds up both the forward and backward passes.
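The difference can be made concrete with a simplified single-layer sketch in plain Python. It follows the description above (gates computed from the input only, an element-wise state update, and a highway connection) and is not the authors' implementation; the tanh on the state, the parameter names, and the toy inputs are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sru_layer(inputs, W, W_f, b_f, W_r, b_r):
    """Simplified SRU layer; input and state sizes must match."""
    # Step 1: everything here depends only on x_t, so no time step waits on
    # another. A real implementation would do this as one batched matmul.
    candidates = [matvec(W, x) for x in inputs]
    forget = [[sigmoid(a + b) for a, b in zip(matvec(W_f, x), b_f)]
              for x in inputs]
    reset = [[sigmoid(a + b) for a, b in zip(matvec(W_r, x), b_r)]
             for x in inputs]

    # Step 2: a light recurrence that uses only element-wise products.
    c = [0.0] * len(b_f)
    outputs = []
    for x, x_tilde, f, r in zip(inputs, candidates, forget, reset):
        c = [fi * ci + (1.0 - fi) * xi for fi, ci, xi in zip(f, c, x_tilde)]
        # Highway connection: blend the transformed state with the raw input.
        h = [ri * math.tanh(ci) + (1.0 - ri) * xi
             for ri, ci, xi in zip(r, c, x)]
        outputs.append(h)
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
sequence = [[1.0, -2.0], [0.5, 4.0], [0.0, 1.0]]
hidden = sru_layer(sequence, identity, zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```

An LSTM could not be split into these two steps, because its gates need the previous hidden state and therefore require a matrix multiplication inside the time loop.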

4 Experiments

This section presents the datasets, experimental setup, experimental results, and analyses.

4.1 Datasets

4.1.1 Speech Denoising

For the speech denoising task, an open-source dataset [23] is used, which combines the voice bank corpus [24] and DEMAND [22] to generate noisy speech. In the voice bank corpus, 28 out of 30 speakers are used for training and the remaining two speakers are used for testing. For the training set, the clean speech is mixed with 10 types of noise under 4 SNR conditions (0, 5, 10, and 15 dB); for the testing set, 5 types of unseen noise are mixed with the clean speech under 4 different SNR conditions (2.5, 7.5, 12.5, and 17.5 dB).
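For readers unfamiliar with SNR-controlled mixing, the sketch below shows the usual recipe: scale the noise so that the ratio of clean power to noise power matches the target, then add it to the clean signal. The dataset's actual mixing procedure is not described here, so this is an assumed, generic version, and the synthetic signals are stand-ins for real recordings.

```python
import math


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(clean, noise, target_snr_db):
    """Scale `noise` to reach the target SNR, then add it to `clean`."""
    scale = math.sqrt(power(clean) / (power(noise) * 10.0 ** (target_snr_db / 10.0)))
    scaled_noise = [scale * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled_noise)]
    return noisy, scaled_noise


def measure_snr_db(clean, noise):
    return 10.0 * math.log10(power(clean) / power(noise))


# Deterministic stand-ins for a clean utterance and a noise recording.
clean = [math.sin(0.05 * t) for t in range(1000)]
noise = [((t * 7919) % 13 - 6) / 6.0 for t in range(1000)]

train_snrs = [0, 5, 10, 15]  # training conditions in dB
achieved = [measure_snr_db(clean, mix_at_snr(clean, noise, snr)[1])
            for snr in train_snrs]
```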

4.1.2 Compressed (2-bit) Speech Restoration

For compressed speech restoration, we use the TIMIT [4] corpus, which consists of clean speech only. The original speech samples are in 16-bit format. In this work, each sample was compressed to a 2-bit format, i.e., each compressed sample was represented by -1, 0, or +1. Saving 87.5% of the bits considerably reduces the cost of data transmission and storage, which is favorable for IoT applications. Expressing the clean speech as $y$ and the compressed speech as $x$, the optimization process becomes:

$$\arg\min_{\theta} \lVert y - g_{\theta}(x) \rVert_1,$$

where $g_{\theta}$ denotes the SE process.
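The compression setting can be sketched as follows. The exact quantization rule is not given in the text, so the threshold-based mapping and the threshold value below are hypothetical; only the three output levels, the 2-bit versus 16-bit budget, and the use of an L1 objective come from the paper.

```python
def compress_to_ternary(samples, threshold):
    """Map each sample to -1, 0, or +1 (the threshold rule is hypothetical)."""
    return [0 if abs(s) <= threshold else (1 if s > 0 else -1) for s in samples]


def l1_loss(target, estimate):
    """Mean absolute error between two equal-length signals."""
    return sum(abs(t - e) for t, e in zip(target, estimate)) / len(target)


bits_saved = 1 - 2 / 16  # fraction of bits saved by a 2-bit code

pcm16 = [0, 1200, -300, 20000, -15000, 450]   # made-up 16-bit samples
compressed = compress_to_ternary(pcm16, threshold=500)

# L1 distance between the normalized clean signal and the untouched
# compressed signal, i.e. the error an SE model would start from.
clean = [s / 32768 for s in pcm16]
baseline_loss = l1_loss(clean, compressed)
```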

4.2 Model Architecture

In the input module, we extract local features with a 1D convolutional layer that has 256 channels, a 6 ms kernel, and a 3 ms stride. To ensure that the recovered audio has the same length as the input, we reflectively pad the input sequence on both sides so that the input length is divisible by the stride. For the temporal encoder, Bi-SRU is used. To match the features extracted in the previous stage, the size of the hidden state is set to 256 with 6 stacked layers, and each hidden state is transformed to half of its dimension. Next, all hidden states are concatenated to form a mask, which is multiplied element-wise by the feature map generated in the first stage. Finally, in the waveform generation step, a transposed convolutional layer maps the 2D feature map into a 1D sequence, which is then passed through a hyperbolic tangent activation function to output the final predicted waveform.
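The kernel and stride durations translate into sample counts once a sampling rate is fixed, and the reflective padding can be written in a few lines. The sketch below assumes a 16 kHz sampling rate (not stated in this section) and splits the padding as evenly as possible between the two sides; both choices, and the function names, are assumptions.

```python
def ms_to_samples(milliseconds, sample_rate):
    return int(round(milliseconds * sample_rate / 1000))


def reflect_pad_to_multiple(signal, multiple):
    """Reflect-pad both ends so that the length is divisible by `multiple`."""
    signal = list(signal)
    remainder = len(signal) % multiple
    if remainder == 0:
        return signal
    total = multiple - remainder
    left, right = total // 2, total - total // 2
    # Reflection skips the edge sample: [a, b, c] -> [c, b | a, b, c | b, a].
    left_part = signal[left:0:-1]
    right_part = signal[-2:-right - 2:-1]
    return left_part + signal + right_part


sample_rate = 16000                        # assumed sampling rate in Hz
kernel = ms_to_samples(6, sample_rate)     # 6 ms kernel
stride = ms_to_samples(3, sample_rate)     # 3 ms stride

padded = reflect_pad_to_multiple(list(range(100)), stride)
```

Reflection avoids the artificial silence that zero padding would introduce at the utterance boundaries.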

4.3 Experimental Results and Analyses

4.3.1 Speech Denoising

For the speech denoising task, we adopted five standardized evaluation metrics: CSIG, a mean opinion score for signal distortion; CBAK, which represents the background intrusiveness; COVL, which stands for the overall speech quality; SSNR, which measures the segmental signal-to-noise ratio; and PESQ. In addition to Wiener filtering and SEGAN [12], we also list several well-known SE approaches that use an L1-norm-based objective function. A comparative system that combines CNN and BLSTM (termed ConvBLSTM) was also implemented, in which the SRU in Fig. 1 was replaced by LSTM. As shown in Table 1, WaveCRN outperforms all the compared models on all five perceptual and signal-level evaluation metrics. Compared with ConvBLSTM, WaveCRN achieves better performance with a simpler architecture. Next, we visually investigate an example speech utterance enhanced by WaveCRN. To see the effect of RFM, we implemented a comparative WaveCRN without RFM. In this system, the SRU in Fig. 1 directly generates enhanced features without estimating the RFM; the system is termed WaveCRN(w/o RFM). Fig. 2 (a), (b), (c), and (d) depict the spectrograms of the noisy speech, the clean speech, and the speech enhanced by WaveCRN and WaveCRN(w/o RFM), respectively. As can be seen from the figure, the silent parts are denoised properly by WaveCRN both with and without RFM. Moreover, with RFM, WaveCRN preserves more consonant information within the target speech, which is positively related to speech intelligibility.

Model            PESQ   CSIG   CBAK   COVL   SSNR
Noisy            1.97   3.35   2.44   2.63    1.68
Wiener           2.22   3.23   2.68   2.67    5.07
SEGAN [12]       2.16   3.48   2.94   2.80    7.73
Wavenet [14]     -      3.62   3.23   2.98    -
Wave-U-Net [5]   2.62   3.91   3.35   3.27   10.05
ConvBLSTM        2.54   3.83   3.25   3.18    9.33
WaveCRN          2.64   3.94   3.37   3.29   10.26
Table 1: The results of the speech denoising task. A higher score indicates better performance. WaveCRN achieves the best score in every column.
Figure 2: Magnitude spectrograms of (a) noisy speech, (b) clean speech, and speech enhanced by (c) WaveCRN and (d) WaveCRN(w/o RFM).
Model       Time (sec)   #parameters (K)
ConvBLSTM   58.039       9093
WaveCRN      2.289       4655
Table 2: Execution time and number of parameters of WaveCRN and ConvBLSTM.
Figure 3: Magnitude spectrograms of (a) compressed speech, (b) the ground truth, and speech restored by (c) LPS–SRU and (d) WaveCRN.

4.3.2 Compressed Speech Restoration

For the compressed speech restoration task, we applied WaveCRN to restore uncompressed speech from the compressed speech. For comparison, we implemented another SRU-based system, termed LPS–SRU. In LPS–SRU, the SRU structure was identical to the one used in WaveCRN, but the input was the log-power spectrum, and the STFT and inverse STFT were used for speech analysis and generation, respectively. The performance was evaluated in terms of the PESQ and STOI scores. From Table 3, we can see that WaveCRN/LPS–SRU improves the PESQ score from 1.39 to 2.41/1.97, and the STOI score from 0.49 to 0.86/0.79. Both WaveCRN and LPS–SRU achieve significant improvements, and WaveCRN clearly outperforms LPS–SRU.

We further visually investigate the resulting spectrograms. As Fig. 3 (a) and (b) show, the speech quality is notably reduced when the speech samples are compressed to a 2-bit format. With LPS–SRU and WaveCRN, the restored speech presents a clearer structure, as shown in Fig. 3 (c) and (d). Moreover, the white-block regions show that WaveCRN can restore speech patterns more effectively than LPS–SRU without losing phase information.

Next, we compare WaveCRN and ConvBLSTM in terms of inference time and model complexity. The first column in Table 2 shows the execution time of the forward pass, and the second column presents the number of parameters. Under the same hyper-parameter setting, WaveCRN (with SRU) is 25.36 times faster than ConvBLSTM and uses only 51% as many parameters.
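Both ratios follow directly from the Table 2 entries, as the short check below shows; the dictionary layout is just an illustrative way to hold the table values.

```python
# Values copied from Table 2.
table2 = {
    "ConvBLSTM": {"time_sec": 58.039, "params_k": 9093},
    "WaveCRN": {"time_sec": 2.289, "params_k": 4655},
}

speedup = table2["ConvBLSTM"]["time_sec"] / table2["WaveCRN"]["time_sec"]
param_ratio = table2["WaveCRN"]["params_k"] / table2["ConvBLSTM"]["params_k"]

print(f"{speedup:.2f}x faster with {param_ratio:.0%} of the parameters")
```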

Model        PESQ   STOI
Compressed   1.39   0.49
LPS–SRU      1.97   0.79
WaveCRN      2.41   0.86
Table 3: The results of the compressed speech restoration task.
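The relative STOI improvement of 75.51% quoted in the introduction can be reproduced from the Table 3 scores. The helper below is an illustrative sketch; relative improvement is taken to mean the gain divided by the score of the compressed input.

```python
def relative_improvement(before, after):
    """Gain over the baseline score, as a percentage of the baseline."""
    return 100.0 * (after - before) / before


# STOI scores copied from Table 3.
stoi = {"Compressed": 0.49, "LPS-SRU": 0.79, "WaveCRN": 0.86}

wavecrn_gain = relative_improvement(stoi["Compressed"], stoi["WaveCRN"])
lps_sru_gain = relative_improvement(stoi["Compressed"], stoi["LPS-SRU"])
```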

5 Conclusion

In this paper, we have proposed the WaveCRN model for E2E SE. By taking advantage of CNN and SRU, WaveCRN uses a bidirectional architecture to model the temporal correlation of the extracted feature map. Experimental results on speech denoising and compressed speech restoration tasks show that the proposed WaveCRN model is both effective and computationally efficient compared with related works and well-known methods using L1 loss. In summary, the contributions of this study are fourfold: (a) WaveCRN is the first work that combines SRU with CNN to perform E2E SE; (b) a novel RFM is derived to directly transform noisy features into enhanced features; (c) the SRU model is relatively simple, yet performs comparably with other state-of-the-art SE models using the same L1 loss; (d) a new and practical application (i.e., compressed speech restoration) was tested, and promising results were obtained using the proposed WaveCRN model.


  • [1] X. Cui, Z. Chen, and F. Yin (2020) Speech enhancement based on simple recurrent unit network. Applied Acoustics 157, pp. 107019. Cited by: §1.
  • [2] et al. (2015) Analysis of CNN-based speech recognition system using raw speech as input. In Proc. Interspeech, Cited by: §2.
  • [3] S.-W. Fu, T. Wang, Y. Tsao, X. Lu, and H. Kawai (2018) End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks. IEEE/ACM TASLP 26 (9), pp. 1570–1584. Cited by: §1, §2.
  • [4] J. S. Garofolo et al. (1993) DARPA TIMIT acoustic-phonetic continuous speech corpus cd-rom. NIST speech disc 1-1.1. NASA STI/Recon Technical Report 93. Cited by: §1, §4.1.2.
  • [5] R. Giri, U. Isik, and A. Krishnaswamy (2019) Attention Wave-U-Net for speech enhancement. In Proc. WASPAA, Cited by: Table 1.
  • [6] P. Golik, Z. Tüske, R. Schlüter, and H. Ney (2015) Convolutional neural networks for acoustic modeling of raw time signal in LVCSR. In Proc. Interspeech, Cited by: §2.
  • [7] M. Kolbæk, Z.-H. Tan, and J. Jensen (2017) Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems. IEEE/ACM TASLP 25 (1), pp. 153–167. Cited by: §1.
  • [8] S. Liang, W. Liu, W. Jiang, and W. Xue (2013) The optimal ratio time-frequency mask for speech separation in terms of the signal-to-noise ratio. The Journal of the Acoustical Society of America 134 (5), pp. EL452–EL458. Cited by: §3.3.
  • [9] P. C. Loizou (2013) Speech enhancement: theory and practice. 2nd edition, CRC Press, Inc.. Cited by: §1.
  • [10] X. Lu, Y. Tsao, S. Matsuda, and C. Hori (2013) Speech enhancement based on deep denoising autoencoder. In Proc. Interspeech, Cited by: §1.
  • [11] A. Pandey and D. Wang (2019) TCNN: temporal convolutional neural network for real-time speech enhancement in the time domain. In Proc. Interspeech, Cited by: §2.
  • [12] S. Pascual, A. Bonafonte, and J. Serra (2017) SEGAN: speech enhancement generative adversarial network. In Proc. Interspeech, Cited by: §2, §4.3.1, Table 1.
  • [13] K. Qian et al. (2017) Speech enhancement using Bayesian Wavenet. In Proc. Interspeech, Cited by: §2.
  • [14] D. Rethage, J. Pons, and X. Serra (2017) A Wavenet for speech denoising. In Proc. Interspeech, Cited by: §2, Table 1.
  • [15] A. W. Rix et al. (2001) Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proc. ICASSP, Cited by: §1.
  • [16] T. N. Sainath et al. (2015) Learning the speech front-end with raw waveform CLDNNs. In Proc. Interspeech, Cited by: §1, §2.
  • [17] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen (2011) An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE/ACM TASLP 19 (7), pp. 2125–2136. Cited by: §1.
  • [18] N. Takahashi, P. Agrawal, N. Goswami, and Y. Mitsufuji (2018) PhaseNet: discretized phase modeling with deep neural networks for audio source separation.. In Proc. Interspeech, Cited by: §1, §2.
  • [19] K. Tan and D. Wang (2018) A convolutional recurrent neural network for real-time speech enhancement. In Proc. Interspeech, Cited by: §2.
  • [20] K. Tan and D. Wang (2019) Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement. IEEE/ACM TASLP 28, pp. 380–390. Cited by: §1.
  • [21] T. Lei et al. (2018) Simple recurrent units for highly parallelizable recurrence. In Proc. EMNLP, Cited by: §3.6.
  • [22] Cited by: §4.1.1.
  • [23] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi (2016) Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. In Proc. SSW, Cited by: §1, §4.1.1.
  • [24] C. Veaux, J. Yamagishi, and S. King (2013) The voice bank corpus: design, collection and data analysis of a large regional accent speech database. In Proc. O-COCOSDA/CASLRE, Cited by: §4.1.1.
  • [25] F. Weninger et al. (2015) Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In Proc. LVA/ICA, Cited by: §1.
  • [26] D. S. Williamson and D. Wang (2017) Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM TASLP 25 (7), pp. 1492–1501. Cited by: §1, §2.
  • [27] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee (2015) A regression approach to speech enhancement based on deep neural networks. IEEE/ACM TASLP 23, pp. 7–19. Cited by: §1.
  • [28] H. Zhao, S. Zarar, I. Tashev, and C.-H. Lee (2018) Convolutional-recurrent neural networks for speech enhancement. In Proc. ICASSP, Cited by: §1.