A Dual-Staged Context Aggregation Method Towards Efficient End-To-End Speech Enhancement

08/18/2019 ∙ by Kai Zhen, et al. ∙ 0

In speech enhancement, an end-to-end deep neural network converts a noisy speech signal to clean speech directly in the time domain, without time-frequency transformation or mask estimation. However, aggregating contextual information from a high-resolution time domain signal at an affordable model complexity remains challenging. In this paper, we propose a densely connected convolutional and recurrent network (DCCRN), a hybrid architecture, to enable dual-staged temporal context aggregation. With dense connectivity and a cross-component identical shortcut, DCCRN consistently outperforms competing convolutional baselines, with an average STOI improvement of 0.23 and a PESQ improvement of 1.38 at three SNR levels. The proposed method is computationally efficient, with only 1.38 million parameters. Its generalization to unseen noise types is still decent considering its low complexity, although it is weaker than that of Wave-U-Net, which has 7.25 times more parameters.




1 Introduction

Monaural speech enhancement can be described as a process that extracts the target speech signal by suppressing the background interference in a speech mixture recorded with a single microphone. Various classic methods, such as spectral subtraction [BollSF79ieeeassp], Wiener filtering [brown1992introduction] and non-negative matrix factorization [schmidt2006single], remove the noise without introducing objectionable distortion or too many artifacts, so that the denoised speech is of decent quality and intelligibility. Recently, the deep neural network (DNN), a data-driven computational paradigm, has been extensively studied thanks to its powerful parameter estimation capacity and correspondingly promising performance [williamson2016complex][huang2014deep].

DNNs formulate monaural speech enhancement either as mask estimation [narayanan2013ideal] or end-to-end mapping [pascual2017segan]. In terms of mask estimation, DNNs usually take acoustic features in time-frequency (T-F) domain to estimate a T-F mask, such as ideal binary mask (IBM) [wang2005ideal], etc. In comparison, both the input and output of end-to-end speech enhancement DNNs can be T-F spectrograms, or even time domain signals directly without any feature engineering.

In both mask estimation and end-to-end mapping DNNs, dilated convolution [yu2015multi] plays a critical role in aggregating contextual information through an enlarged receptive field. The gated residual network (GRN) [tan2018gated] employs dilated convolutions to accumulate context in the temporal and frequency domains, leading to better performance than a long short-term memory (LSTM) cell-based model [chen2017long]. In the end-to-end setting, WaveNet [oord2016wavenet] and its variations also adopt dilated convolution for speech enhancement.

For real-time systems deployed in resource-constrained environments, however, the oversized receptive field from dilated convolution can cause a severe delay issue. Although causal convolution can enable real-time speech denoising [tan2018convolutional], it performs less well than its dilated counterpart [tan2018gated]. Besides, when the receptive field is too large, the zeros padded at the beginning of the sequence and the large buffer needed for online processing can impose a burdensome space complexity on a small device. Meanwhile, recurrent neural networks (RNNs) can also aggregate context through frame-by-frame processing without relying on a large receptive field. However, the responsiveness of a practical RNN system, such as an LSTM [chen2017long], comes at the cost of an increased number of model parameters, which makes the model neither easy to train nor resource-efficient. Although there have been efforts to apply dilated DenseNet [li2019densely] or a hybrid architecture to source separation [takahashi2018mmdenselstm], a mechanism that enables dual-staged context aggregation through a heterogeneous model topology has not been addressed.

To achieve efficient end-to-end monaural speech enhancement, we propose a densely connected convolutional and recurrent network (DCCRN), which conducts dual-level context aggregation. The first level of context aggregation in DCCRN is achieved by a dilated 1D convolutional neural network (CNN) component, encapsulated in the DenseNet architecture [huang2017densely]. It is followed by a compact gated recurrent unit (GRU) component [chung2014empirical] that further utilizes the contextual information in a “many-to-one” fashion. Note that we also employ a cross-component identical shortcut linking the output of the DenseNet component to the output of the GRU component to reduce the complexity of the GRU cells. We also propose a training procedure specifically designed for DCCRN that trains the CNN and RNN components separately and then finetunes the entire model. Experimental results show that the hybrid architecture of dilated DenseNet and GRU in DCCRN consistently outperforms other CNN variations with only one level of context aggregation on untrained speakers. Our model is computationally efficient and provides reasonable generalizability to untrained noises with only 1.38 million parameters.

We describe the proposed method in Section 2, and then provide experimental validation in Section 3. We conclude in Section 4.

Figure 1: A schematic diagram of the DCCRN training procedure including dilated DenseNet and GRU components.

2 Model Description

2.1 Context aggregation with dilated DenseNet

Residual learning has become a critical technique to tackle the gradient vanishing issue [he2016deep] when tuning a deep convolutional neural network (CNN), such that the deep CNN can achieve better performance but with a lower model complexity. ResNet illustrates a classic way to enable residual learning by adding identical shortcuts across bottleneck structures [he2016deep]. Although the bottleneck structure includes direct paths to feedforward information from earlier layers to later layers, it does not extend to its full capacity of the information flow. Therefore, ResNet is sometimes accompanied by a gating mechanism, a technique heavily used in RNNs, such as LSTM or GRU, to further facilitate the gradient propagation in convolutional networks [tan2018gated].

In comparison, DenseNet [huang2017densely] resolves the issue by redefining the skip connections. The dense block differs from the bottleneck structure in that each layer takes concatenated outputs from all preceding layers as its input, while its own output is fed to all subsequent layers (Figure 1). Consequently, DenseNet requires fewer model parameters to achieve a competitive performance.

In fully convolutional architectures, dilated convolution is a popular technique to enlarge the receptive field to cover longer sequences [oord2016wavenet], and it has shown promising results in speech enhancement [tan2018gated]. Because of its lower model complexity, dilated convolution is considered a cheaper alternative to the recurrence operation. Our model adopts this technique sparingly, with a receptive field size that does not exceed the frame size.

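We use $\mathcal{F}^{(l)}$ to denote a convolution operation between the input $X^{(l)}$ and the filter $H^{(l)}$ in the $l$-th layer with a dilation rate $\gamma^{(l)}$:

$$X^{(l+1)} \leftarrow \mathcal{F}^{(l)}\big(X^{(l)}, H^{(l)}, \gamma^{(l)}\big), \quad (1)$$

$$X^{(l+1)}_{\tau,d} = \sum_{n + k\gamma^{(l)} = \tau} X^{(l)}_{n,d}\, H^{(l)}_{k,d}, \quad (2)$$

where $n$, $\tau$, $d$, and $k$ are the indices of the input features, output features, channels, and filter coefficients, respectively. Note that $k$ is an integer with a range $|k| \le \lfloor K/2 \rfloor$, where $K$ is the 1D kernel size. In our system we have two kernel sizes: 5 and 55 (Table 1). As DCCRN is based on 1D convolution, the tensors are always in the shape of (features) $\times$ (channels). Zero padding keeps the number of features the same across layers. Given a dilation rate $\gamma^{(l)} > 1$ [yu2015multi], the convolution operation defined in (2) has its dilation activated.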
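As a rough illustration of the dilated convolution with zero padding described above, the following pure-Python sketch filters a single-channel signal so that the output has as many features as the input. The function name `dilated_conv1d`, the toy impulse, and the three-tap filter are assumptions made for this example only; the actual model operates on multi-channel tensors with the kernel sizes listed in Table 1.

```python
def dilated_conv1d(x, h, dilation=1):
    """Single-channel 1-D dilated convolution with zero padding ("same" length).

    x: list of input samples; h: odd-length list of filter taps;
    dilation: spacing between the input samples seen by adjacent taps.
    """
    if len(h) % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel size")
    half = len(h) // 2
    y = []
    for tau in range(len(x)):                # output feature index
        acc = 0.0
        for k in range(-half, half + 1):     # filter index, |k| <= floor(K/2)
            n = tau - k * dilation           # input index, so n + k*dilation == tau
            if 0 <= n < len(x):              # samples outside the frame act as zeros
                acc += x[n] * h[k + half]
        y.append(acc)
    return y


impulse = [0.0] * 4 + [1.0] + [0.0] * 4
taps = [1.0, 2.0, 3.0]
plain = dilated_conv1d(impulse, taps, dilation=1)
dilated = dilated_conv1d(impulse, taps, dilation=2)
print(plain)    # taps land on adjacent positions
print(dilated)  # taps land two positions apart
```

With a dilation of 2 the same three taps cover five input positions, which is how dilation enlarges the receptive field without adding filter coefficients.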

In DCCRN, a dense block combines five such convolutional layers. In each block, the input to the $l$-th layer is a channel-wise concatenated tensor of all preceding feature maps in the same block, thus substituting (1) with

$$X^{(l+1)} \leftarrow \mathcal{F}^{(l)}\big([X^{(l)}, X^{(l-1)}, \ldots, X^{(l_b)}], H^{(l)}, \gamma^{(l)}\big), \quad (3)$$

where $X^{(l_b)}$ denotes the first input feature map to the $b$-th block. Note that in this DenseNet architecture, $H^{(l)}$ grows its depth accordingly, i.e., with a growth rate $G$, the depth of $H^{(l)}$ is $(l - l_b + 1)G$. In the final layer of a block, the concatenated input channels collapse down to $G$, which forms the input to the next block. The first dense block in Figure 1 depicts this process. We stack four dense blocks, with the dilation rate of the middle layer in each block set to 1, 2, 4 and 8, respectively. Different from the original DenseNet architecture, we do not apply any transition in between blocks, except for the very first layer, prior to the stacked dense blocks, which expands the channel of the input from 1 to $G$, and another layer right after the stacked dense blocks, which reduces it back to 1. In Table 1, $G = 32$. This forms our fully convolutional DenseNet baseline. In all the convolutional layers, we use leaky ReLU as the activation.
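The channel bookkeeping inside one dense block can be sketched as follows. This is only an illustration under the assumption that every layer outputs 32 channels and that each layer sees the block input plus the outputs of all earlier layers in the block, which is what the kernel shapes in Table 1 suggest; the names `GROWTH_RATE` and `dense_block_input_channels` are made up for the example.

```python
GROWTH_RATE = 32        # output channels of every layer in a block (from Table 1)
LAYERS_PER_BLOCK = 5    # a dense block combines five convolutional layers


def dense_block_input_channels(growth_rate=GROWTH_RATE,
                               num_layers=LAYERS_PER_BLOCK):
    """Return the number of input channels seen by each layer of one dense block."""
    channels = []
    available = growth_rate          # the block input already has `growth_rate` channels
    for _ in range(num_layers):
        channels.append(available)   # this layer reads everything concatenated so far
        available += growth_rate     # its own output is appended for the next layer
    return channels


block_inputs = dense_block_input_channels()
print(block_inputs)  # input channels per layer, matching the kernel shapes in Table 1
```

The last layer reads 160 channels but still emits only 32, so the next block starts again from a 32-channel input.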

2.2 Context aggregation with gated recurrent network

DCCRN further employs RNN layers following the dilated DenseNet component (Figure 1). Between LSTM and GRU, the two best-known RNN variations, DCCRN chooses GRU for its reduced computational complexity compared to LSTM. The information flow within each unit is outlined as follows:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t, \quad (4)$$

$$\tilde{h}_t = \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1})\big), \quad (5)$$

$$z_t = \sigma\big(W_z x_t + U_z h_{t-1}\big), \quad (6)$$

$$r_t = \sigma\big(W_r x_t + U_r h_{t-1}\big), \quad (7)$$

where $t$ is the index in the sequence. $h_t$ and $\tilde{h}_t$ are the hidden state and the newly proposed one, which are mixed by the update gate $z_t$ in a complementary fashion as in (4). The GRU cell computes the tanh unit $\tilde{h}_t$ from a linear combination of the input $x_t$ and the previous hidden state $h_{t-1}$ gated by the reset gate $r_t$, as in (5). Similarly, the gates are estimated using sigmoid units as in (6) and (7). In all linear operations, GRU uses the corresponding weight matrices, $W_h, W_z, W_r, U_h, U_z, U_r$. We omit bias terms in the equations.

The GRU component in this work follows a “many-to-one” mapping style for an additional level of context aggregation. During training, it looks back over the sub-frames of the current frame (four time steps in Table 1) and generates the output corresponding to the last time step. To this end, DCCRN reshapes the output of the CNN part, a 1024-sample vector, into four sub-frames, each of which is a 256-dimensional input vector to the GRU cell. We have two GRU layers, one with 32 hidden units and the other with 256 units to match the output dimensionality of the system. Furthermore, to ease the optimization and to limit the model complexity of the GRU layers, we pass the last sub-frame output of the DenseNet component to the output of the GRU component via a skip connection, which is additive as in the ResNet architecture: the denoised speech is the sum of the outputs of both components. With the dilated DenseNet component well tuned, its output will already be close to the clean speech, which leaves less work for the GRU to optimize, as detailed in Section 2.5.
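To make the gate arithmetic and the “many-to-one” use of the GRU concrete, here is a minimal pure-Python sketch of a bias-free GRU cell that consumes a short sequence and keeps only the final state. It assumes the common convention in which the update gate weights the newly proposed state and its complement weights the previous state; the parameter names (`Wz`, `Uz`, and so on), the tiny two-dimensional sizes, and the all-zero weights are illustrative assumptions, not the trained model.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x, h_prev, p):
    """One bias-free GRU update; p maps names to weight matrices."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]          # reset gate applied to h_prev
    h_new = [math.tanh(a + b)
             for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated))]
    # Complementary mixing of the previous state and the proposed state.
    return [(1.0 - zi) * hp + zi * hn for zi, hp, hn in zip(z, h_prev, h_new)]


def many_to_one(sub_frames, h0, p):
    """Run the cell over all sub-frames and return only the last hidden state."""
    h = h0
    for x in sub_frames:
        h = gru_step(x, h, p)
    return h


zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
frames = [[0.3, -0.2]] * 4                    # four time steps, as in the reshaped frame
final_state = many_to_one(frames, [1.0, -1.0], params)
print(final_state)
```

With all-zero weights both gates sit at 0.5 and the proposed state is zero, so each step simply halves the previous state; this makes the mixing rule easy to check by hand.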

2.3 Data flow

During training, as illustrated in Figure 1, the noisy frame is first fed to the DenseNet component (line 3 in Algorithm 1). It comprises consecutive convolutional layers that are grouped into four dense blocks. The output frame of DenseNet, containing 1024 samples, is then reshaped into a sequence of four 256-dimensional vectors, which serve as the input of the GRU component. The cleaned-up signal corresponds to the final state of the GRU, with a dimension of 256.

At test time, the output sub-frame of DCCRN is weighted by a Hann window and overlapped with its adjacent sub-frames. Note that to generate the last 256 samples, DCCRN relies only on the current and past samples, up to 1024 within that frame, without seeing future samples, which is similar to causal convolution. Hence, the delay of DCCRN is the sub-frame size (256 samples, or 0.016 second at 16 kHz). If only the DenseNet component were used, as in the convolutional baselines compared in Section 3, the Hann window with the same overlap rate would still be applied, but the model output would be all 1024 samples of the corresponding frame instead of the last 256 samples.

Table 1 summarizes the network architecture. The current topology is designed for speech sampled at 16 kHz.
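The reshaping step and the resulting delay can be checked with a few lines of Python. The sketch below assumes the sizes given in Table 1 (a 1024-sample frame split into four sub-frames) and the 16 kHz sampling rate; the helper name `to_sub_frames` and the integer ramp used as a stand-in frame are hypothetical.

```python
FRAME_SIZE = 1024      # samples per input frame (Table 1)
NUM_SUB_FRAMES = 4     # GRU sequence length (Table 1)
SAMPLE_RATE = 16000    # Hz


def to_sub_frames(frame, num_sub_frames=NUM_SUB_FRAMES):
    """Split a frame into equally sized, non-overlapping sub-frames, oldest first."""
    if len(frame) % num_sub_frames != 0:
        raise ValueError("frame length must be divisible by the number of sub-frames")
    size = len(frame) // num_sub_frames
    return [frame[i * size:(i + 1) * size] for i in range(num_sub_frames)]


frame = list(range(FRAME_SIZE))            # stand-in for the DenseNet output frame
sub_frames = to_sub_frames(frame)
delay_seconds = len(sub_frames[-1]) / SAMPLE_RATE

print(len(sub_frames), len(sub_frames[0]))  # sequence length and sub-frame size
print(delay_seconds)                        # delay implied by one sub-frame
```

Only the last sub-frame is emitted per frame, so the latency is governed by the sub-frame length rather than by the full frame.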

1:  Input: 1024 samples from the noisy utterance
2:  Output: the last 256 samples of the denoised signal
3:  DenseNet denoising
4:  Reshaping
6:  GRU denoising
7:  Post windowing (at test time only)
Algorithm 1: The feedforward procedure in DCCRN
Components      Input shape   Kernel shape          Output shape
Change channel  (1024, 1)     (55, 1, 32)           (1024, 32)
DenseNet        (1024, 32)    [ (5, 32, 32)         (1024, 32)
                                (5, 64, 32)
                                (55, 96, 32) †
                                (5, 128, 32)
                                (5, 160, 32) ] × 4
Change channel  (1024, 32)    (55, 32, 1)           (1024, 1)
Reshape         (1024, 1)     -                     (4, 256, 1)
GRU             (4, 256, 1)   (256+32, 32)          (256, 1)
                              (32+256, 256)
Table 1: Architecture of DCCRN. For the CNN layers, the data tensor sizes are given as (size in samples, channels), and the CNN kernel shape is (size in samples, input channels, output channels). For the GRU layers, an additional leading dimension of the data tensors gives the length of the sequence, and the kernel sizes describe the linear operations as (input features, output features). The middle layer of each dense block, marked by a dagger, has the larger kernel size and an optional dilation with the rate of 1, 2, 4, and 8 for the four dense blocks, respectively.
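As a sanity check on the kernel sizes and dilation rates in Table 1, the receptive field of the convolutional stack can be tallied with a short script. The sketch assumes stride-1 layers applied in sequence, dilation only in the middle layer of each dense block, and the standard rule that each layer adds (kernel size − 1) × dilation samples; the function names are hypothetical.

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions given (kernel, dilation) pairs."""
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


def dccrn_cnn_layers(block_dilations=(1, 2, 4, 8), small=5, large=55):
    """List the (kernel size, dilation) pairs of the CNN part as read from Table 1."""
    layers = [(large, 1)]                            # first channel-changing layer
    for d in block_dilations:                        # four dense blocks
        layers += [(small, 1), (small, 1), (large, d), (small, 1), (small, 1)]
    layers.append((large, 1))                        # last channel-changing layer
    return layers


cnn_receptive_field = receptive_field(dccrn_cnn_layers())
print(cnn_receptive_field)
```

Under these assumptions the tally comes to 983 samples, which stays below the 1024-sample frame and is consistent with the statement that the receptive field does not exceed the frame size.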

2.4 Objective function

It is known that the mean squared error (MSE) by itself cannot directly measure perceptual quality or intelligibility, both of which are usually the actual metrics for evaluation. To address the discrepancy, the MSE can be replaced by a measure more closely tied to intelligibility, such as short-time objective intelligibility (STOI) [fu2018end]. However, improved intelligibility does not guarantee better perceptual quality. The objective function in this work is defined in (8), which is still based on MSE but is accompanied by a regularizer that compares mel spectra between the target and output signals. The T-F domain regularizer compensates for the fact that the end-to-end DNN would otherwise operate only in the time domain. Empirically, it is shown to achieve better perceptual quality, as proposed in [new_paper_bloombergEndtoEnd].
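One plausible way to write an objective of this kind, going only by the description above, is a time-domain MSE term plus a weighted term on mel spectra. The symbols $s$ (clean signal), $\hat{s}$ (model output), $\lambda$ (regularizer coefficient) and the operator $\mathrm{Mel}(\cdot)$ are notation assumed for this sketch, and the use of MSE as the distance between mel spectra is also an assumption; the exact form of (8) may differ.

```latex
\mathcal{E}(s, \hat{s}) \;=\; \mathrm{MSE}(s, \hat{s})
  \;+\; \lambda \,\mathrm{MSE}\big(\mathrm{Mel}(s), \mathrm{Mel}(\hat{s})\big)
```

Setting $\lambda = 0$ in such a form would recover plain time-domain MSE training.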


2.5 Model training scheme

We train the CNN and RNN components separately, and then finetune the combined network.

  • CNN training: First, we train the CNN component to minimize the error of its own output.

  • RNN training: Next, we train the RNN part by minimizing the error of the final output, while the CNN parameters are locked.

  • Integrative finetuning: Once both the CNN and RNN components are pretrained, we unlock the CNN parameters and finetune both components to minimize the final error. Note that the learning rate for integrative finetuning should be smaller.
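The three stages above can be summarized as a small schedule. The sketch below only records which component is trainable in each stage, using the epoch counts and learning rates reported in Section 3.2; the `Stage` class, the stage names, and the `is_frozen` helper are hypothetical and do not correspond to any particular training framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple      # components whose parameters are updated in this stage
    epochs: int
    learning_rate: float


# Epochs and learning rates as reported in Section 3.2.
SCHEDULE = (
    Stage("cnn_training", ("cnn",), 100, 1e-4),
    Stage("rnn_training", ("rnn",), 20, 5e-6),
    Stage("integrative_finetuning", ("cnn", "rnn"), 20, 5e-7),
)


def is_frozen(component, stage):
    """A component is locked in a stage if it is not listed as trainable."""
    return component not in stage.trainable


for stage in SCHEDULE:
    frozen = [c for c in ("cnn", "rnn") if is_frozen(c, stage)]
    print(stage.name, stage.epochs, stage.learning_rate, "frozen:", frozen)
```

The learning rate shrinks from stage to stage, in line with the note that integrative finetuning should use a smaller rate.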

Metrics            SDR (dB)             SIR (dB)             SAR (dB)             STOI                 PESQ
SNR level (dB)        -5      0      5     -5      0      5     -5      0      5     -5      0      5     -5      0      5
Unprocessed        -5.01  -0.00   5.01  -4.97   0.04   5.06  25.43  27.97  30.56   0.65   0.72   0.78   1.06   1.11   1.21
Dilated DenseNet   13.54  15.67  17.35  19.78  21.15  22.77  14.85  17.25  19.22   0.88   0.92   0.94   1.95   2.32   2.58
DenseNet+GRU       13.89  16.63  18.72  20.92  23.21  25.28  14.78  17.17  19.68   0.90   0.93   0.95   1.96   2.35   2.57
DCCRN              15.11  17.51  19.29  21.42  24.64  26.70  16.08  18.82  20.77   0.92   0.95   0.96   2.14   2.55   2.83
DCCRN*             15.08  17.45  19.31  21.30  24.71  26.33  16.22  18.76  20.83   0.92   0.94   0.96   2.13   2.54   2.82
Table 2: SDR, SIR, SAR, STOI and PESQ comparison on untrained speakers and trained noise types

3 Experiments

3.1 Experimental setup

Our experiments use the TIMIT corpus [timit]. We consider two experimental settings. For model training, we randomly select 1000 utterances from the TIMIT training subset. Five types of non-stationary noise (birds, cicadas, computer keyboard, machine guns and motorcycles) from [vincent2006performance] are used to create mixtures. Concretely, each clean signal is mixed with a random cut of each of these noise types at an SNR level randomly drawn from a set of integer values in dB. Therefore, 5,000 noisy speech samples, totaling 3.5 hours, are used for model training. At test time, we randomly select 100 unseen utterances from the TIMIT test subset and mix each utterance with those five types of noise to construct a test set with unseen speakers. The noise is randomly cut to match the length of the test utterances. The mixtures are generated at three SNR levels (-5 dB, 0 dB and +5 dB), yielding 1500 test utterances in total.
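Mixing speech and noise at a prescribed SNR amounts to rescaling the noise before adding it. The sketch below shows one common way to do this in pure Python and also reproduces the dataset sizes quoted above; the function names and the synthetic stand-in signals are assumptions for illustration and are not taken from the authors' data pipeline.

```python
import math


def power(signal):
    """Mean power of a signal given as a list of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so that speech power / noise power matches the target SNR, then add."""
    if len(speech) != len(noise):
        raise ValueError("cut the noise to the utterance length first")
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (target_snr_db / 10.0)))
    scaled_noise = [gain * v for v in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(speech, noise):
    return 10.0 * math.log10(power(speech) / power(noise))


# Synthetic stand-ins for a clean utterance and a noise cut of the same length.
speech = [math.sin(0.1 * i) for i in range(200)]
noise = [((i * 37) % 11 - 5) / 5.0 for i in range(200)]
mixture, scaled = mix_at_snr(speech, noise, -5.0)
print(round(measured_snr_db(speech, scaled), 6))

num_train_mixtures = 1000 * 5        # utterances x noise types
num_test_mixtures = 100 * 5 * 3      # utterances x noise types x SNR levels
print(num_train_mixtures, num_test_mixtures)
```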

3.2 Baselines

To validate the two-staged context aggregation method, we compare DCCRN to regular DenseNets with one aggregation mechanism.

  • Dilated DenseNet uses dilated convolution in the middle layer of each dense block; in other words, it is DCCRN without the context aggregation from the GRU layers.

  • DenseNet+GRU refers to the DenseNet architecture coupled with two GRU layers, but with no dilation.

  • DCCRN is our full model with the proposed training scheme in Section 2.5. We train the CNN part for 100 epochs and the GRU component for 20 epochs, followed by integrated finetuning for 20 epochs. The learning rates are 1e-4, 5e-6 and 5e-7, respectively.

  • DCCRN* is DCCRN with an enlarged frame size.

All models are trained to our best effort with Adam optimizer [adam]. The batch size is 32 frames. The regularizer coefficient, , is . The GRU gradients are clipped in the range of .

3.3 Performance analysis on untrained speakers

To evaluate the performance, we use the BSS_Eval toolbox [vincent2006performance]. The BSS_Eval toolbox provides an objective evaluation of source separation performance by decomposing the overall error, measured by the signal-to-distortion ratio (SDR), into components of specific error types. In this work, we focus on the signal-to-interference ratio (SIR) and the signal-to-artifacts ratio (SAR). We also choose short-time objective intelligibility (STOI) [stoi] and perceptual evaluation of speech quality (PESQ) with the wide-band extension (P862.3) [pesq] to measure the intelligibility and quality of the denoised speech. Note that narrow-band PESQ scores (P862) [pesqnb] are greater by approximately 0.5 (e.g., for unprocessed utterances at 0 dB SNR in our case).

As shown in Table 2, by coupling both context aggregation techniques, DCCRN consistently outperforms the other baseline models in all metrics. In terms of SDR, the average improvement over the Dilated DenseNet baseline is about 1.78 dB, computed from the three SNR levels in Table 2. The comparison with unprocessed mixtures shows an average STOI improvement of 0.23 and a PESQ improvement of 1.38. The performance is not further improved with the enlarged frame size (DCCRN*), due to the trade-off between the increased difficulty of GRU optimization and the additional temporal context in each sequence. Denoised samples are available at http://pages.iu.edu/~zhenk/speechenhancement.html.
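The average gains can be recomputed directly from Table 2. The snippet below copies the relevant rows and averages each metric over the three SNR levels; the dictionary layout and function names are just one convenient arrangement for this check.

```python
# Rows copied from Table 2; each tuple holds the values at -5, 0 and 5 dB SNR.
TABLE2 = {
    "Unprocessed": {
        "SDR": (-5.01, -0.00, 5.01), "STOI": (0.65, 0.72, 0.78), "PESQ": (1.06, 1.11, 1.21),
    },
    "Dilated DenseNet": {
        "SDR": (13.54, 15.67, 17.35), "STOI": (0.88, 0.92, 0.94), "PESQ": (1.95, 2.32, 2.58),
    },
    "DCCRN": {
        "SDR": (15.11, 17.51, 19.29), "STOI": (0.92, 0.95, 0.96), "PESQ": (2.14, 2.55, 2.83),
    },
}


def mean(values):
    return sum(values) / len(values)


def average_gain(metric, system, reference):
    """Difference of SNR-averaged scores between a system and a reference row."""
    return mean(TABLE2[system][metric]) - mean(TABLE2[reference][metric])


stoi_gain = average_gain("STOI", "DCCRN", "Unprocessed")
pesq_gain = average_gain("PESQ", "DCCRN", "Unprocessed")
sdr_gain = average_gain("SDR", "DCCRN", "Dilated DenseNet")
print(round(stoi_gain, 2), round(pesq_gain, 2), round(sdr_gain, 2))
```

The STOI and PESQ figures agree with the averages quoted in the abstract.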

3.4 Generalizability for untrained speakers and noises

Figure 2: Comparisons in terms of (a) STOI, (b) PESQ, and (c) model complexity (in millions of parameters) on untrained speakers and noises against the Wave-U-Net model reported in [liao2019incorporating].

To evaluate model performance in an open condition with unseen speakers and noise sources, we scale up the experimental setting. The training dataset is constructed from 3696 utterances from the TIMIT training set. Each utterance is mixed with 100 noise types from [hu2004100] at 6 different SNR levels (20 dB, 15 dB, 10 dB, 5 dB, 0 dB, and -5 dB), which yields 40 hours of training data (135 GB). 100 unseen utterances are randomly selected from the TIMIT test set, and each is mixed with three untrained noises (Buccaneer1, Destroyer engine, and HF channel from the NOISEX-92 corpus [varga1993assessment]). Note that the performance of Wave-U-Net [stoller2018wave] was reported in terms of STOI and PESQ in the narrowband mode [liao2019incorporating].

We evaluate models in terms of STOI and PESQ improvements (Fig. 2 (a) and (b)). Wave-U-Net achieves a better speech quality improvement. In terms of speech intelligibility, DCCRN gives higher STOI scores in the +3 dB and +6 dB cases. Note that Wave-U-Net contains about 7.25 times more parameters. Options to further improve the performance of DCCRN include expanding the size of the receptive field [tan2019gated] or incorporating extra phonetic content [liao2019incorporating], although one of the main focuses of DCCRN is to achieve an affordable model complexity for end-to-end speech enhancement.

4 Conclusion

The paper introduces DCCRN, a hybrid residual network that aggregates temporal context at two levels for efficient end-to-end speech enhancement. DCCRN first suppresses the noise in the time domain with a dilated DenseNet, followed by a GRU component that further leverages the temporal context in a many-to-one manner. To tune this heterogeneous model, we present a component-wise training scheme followed by finetuning. Experiments show that our method consistently outperforms the other baseline models in various metrics. It generalizes well to untrained speakers and gives reasonable performance on untrained noises with only 1.38 million parameters.