Efficient Context Aggregation for End-to-End Speech Enhancement Using a Densely Connected Convolutional and Recurrent Network

08/18/2019 ∙ Kai Zhen, et al. ∙ Indiana University Bloomington ∙ ETRI

In speech enhancement, an end-to-end deep neural network converts a noisy speech signal into clean speech directly in the time domain, without time-frequency transformation or mask estimation. However, aggregating contextual information from a high-resolution time-domain signal at an affordable model complexity remains challenging. In this paper, we propose a hybrid architecture, incorporating densely connected convolutional networks (DenseNet) and gated recurrent units (GRU), to enable dual-level temporal context aggregation. Thanks to the dense connectivity pattern and a cross-component identity shortcut, the proposed model consistently outperforms competing convolutional baselines, improving STOI by 0.23 and PESQ by 1.38 on average over unprocessed mixtures across three SNR levels. In addition, the proposed hybrid architecture is computationally efficient, with 1.38 million parameters.


1 Introduction

Monaural speech enhancement can be described as the process of extracting the target speech signal by suppressing background interference in a speech mixture in the single-microphone setting. Various classic methods exist, such as spectral subtraction [boll1979suppression], Wiener filtering [brown1992introduction] and non-negative matrix factorization [schmidt2006single], which remove the noise without introducing objectionable distortion or too many artifacts, such that the denoised speech is of decent quality and intelligibility. Recently, the deep neural network (DNN), a data-driven computational paradigm, has been extensively studied thanks to its powerful parameter estimation capacity and correspondingly promising performance [williamson2016complex][tan2019gated][huang2014deep].

DNNs formulate monaural speech enhancement either as mask estimation [narayanan2013ideal] or as end-to-end mapping [pascual2017segan]. For mask estimation, DNNs usually take acoustic features in the time-frequency (T-F) domain and estimate a T-F mask, such as the ideal binary mask (IBM) [wang2005ideal], ideal ratio mask (IRM) [narayanan2013ideal] or phase-sensitive mask (PSM) [erdogan2015phase]. In comparison, both the input and output of end-to-end speech enhancement DNNs can be T-F spectrograms [zhao2018convolutional], or even time-domain signals directly, without any feature engineering [rethage2018wavenet].

In both mask estimation and end-to-end mapping DNNs, dilated convolution [yu2015multi] serves a critical role in aggregating contextual information with an enlarged receptive field. The gated residual network (GRN) [tan2018gated] employs dilated convolutions to accumulate context in the temporal and frequency domains, leading to better performance than a long short-term memory (LSTM) cell-based model [chen2017long]. In the end-to-end setting, WaveNet [oord2016wavenet] and its variations also adopt dilated convolution for speech enhancement [rethage2018wavenet].

For real-time systems deployed in resource-constrained environments, however, the oversized receptive field from dilated convolution can cause a severe delay issue [tan2018gated]. Although causal convolution can enable real-time speech denoising [tan2018convolutional], it performs less well than its dilated counterpart [tan2018gated]. Besides, when the receptive field is too large, the zeros padded at the beginning of the sequence and the large buffer required for online processing impose a burdensome spatial complexity on a small device. Meanwhile, recurrent neural networks (RNN) can also aggregate context through frame-by-frame processing, without relying on a large receptive field. However, the responsiveness of a practical RNN system, such as LSTM [hochreiter1997long], comes at the cost of an increased number of model parameters, which makes it neither easy to train nor resource-efficient.

To leverage the benefit of temporal context aggregation, but with low delay and complexity in the end-to-end setting, we propose a densely connected convolutional and recurrent network (DCCRN), which conducts dual-level context aggregation. The first level of context aggregation in DCCRN is achieved by a dilated 1D convolutional neural network (CNN) component, encapsulated in the DenseNet architecture [huang2017densely]. It is followed by a compact gated recurrent unit (GRU) component [chung2014empirical] that further utilizes the contextual information in a "many-to-one" fashion. Note that we also employ a cross-component identity shortcut linking the output of the DenseNet component to the output of the GRU component, to reduce the complexity of the GRU cells. We also propose a specifically designed training procedure for DCCRN that trains the CNN and RNN components separately and then finetunes the entire model. We show that the hybrid architecture of dilated DenseNet and GRU in DCCRN consistently outperforms other CNN variations with only one level of context aggregation. Our model is computationally less complex even with the additional GRU layers.

Figure 1: A schematic diagram of the DCCRN training procedure including dilated DenseNet and GRU components.

2 Model Description

2.1 Context aggregation with dilated DenseNet

Residual learning has become a critical technique for tackling the gradient vanishing issue [he2016deep] when tuning a deep convolutional neural network (CNN), such that the deep CNN can achieve better performance with a lower model complexity. ResNet illustrates a classic way to enable residual learning by adding identity shortcuts across bottleneck structures [he2016deep]. Although the bottleneck structure includes direct paths that feedforward information from earlier layers to later layers, it does not exploit the full capacity of the information flow. Therefore, ResNet is sometimes accompanied by a gating mechanism, a technique heavily used in RNNs such as LSTM or GRU, to further facilitate the gradient propagation in convolutional networks [tan2018gated].

In comparison, DenseNet [huang2017densely] resolves the issue by redefining the skip connections. The dense block differs from the bottleneck structure in that each layer takes concatenated outputs from all preceding layers as its input, while its own output is fed to all subsequent layers (Figure 1). Consequently, DenseNet requires fewer model parameters to achieve a competitive performance.

In fully convolutional architectures, dilated convolution is a popular technique for enlarging the receptive field to cover longer sequences [oord2016wavenet], and it has shown promising results in speech enhancement [tan2018gated]. Because of its lower model complexity, dilated convolution is considered a cheaper alternative to the recurrence operation. Our model adopts this technique sparingly, with a receptive field size that does not exceed the frame size.

We use * to denote a convolution operation between the input x^{(l-1)} and the filter \mathcal{K}^{(l)} in the l-th layer, with a dilation rate d:

y^{(l)}_{f,n} = \sum_{c} \sum_{k} \mathcal{K}^{(l)}_{k,c,n} \, x^{(l-1)}_{f+k,\,c}    (1)

y^{(l)}_{f,n} = \sum_{c} \sum_{k} \mathcal{K}^{(l)}_{k,c,n} \, x^{(l-1)}_{f+kd,\,c}    (2)

where f, n, c and k are the indices of the input features, output features, channels, and filter coefficients, respectively. Note that k is an integer with the range 0 ≤ k < K, where K is the 1D kernel size. In our system we have two kernel sizes: K = 5 and K = 55 (Table 1). As DCCRN is based on 1D convolution, the tensors are always in the shape of (features) × (channels). Zero padding keeps the number of features the same across layers. Given the dilation rate d [yu2015multi], the convolution operation is defined in (2) with the dilation being activated.
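To make the indexing in (1) and (2) concrete, a minimal NumPy sketch of the dilated convolution could read as follows; the function name and the valid-only (unpadded) indexing are illustrative choices, whereas the actual layers keep the feature count fixed with zero padding.

```python
import numpy as np

def dilated_conv1d(x, kernel, d=1):
    """Dilated 1D convolution as in (2); d=1 recovers the plain case (1).

    x:      input feature map of shape (F, C_in), i.e., (features, channels)
    kernel: filter bank of shape (K, C_in, C_out)
    Returns y of shape (F - d*(K-1), C_out); no zero padding here, unlike
    the paper's layers, which pad to keep F constant.
    """
    K, C_in, C_out = kernel.shape
    F_out = x.shape[0] - d * (K - 1)
    y = np.zeros((F_out, C_out))
    for f in range(F_out):
        for k in range(K):
            y[f] += x[f + k * d] @ kernel[k]  # tap every d-th sample; sum over C_in
    return y

x = np.random.randn(1024, 32)    # one 1024-sample frame with 32 channels
w = np.random.randn(5, 32, 32)   # kernel size 5, as in the dense blocks
y = dilated_conv1d(x, w, d=4)    # dilation rate of the third dense block
```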

In DCCRN, a dense block combines five such convolutional layers. In each block, the input to the l-th layer is a channel-wise concatenated tensor of all preceding feature maps in the same block, thus substituting (1) with

y^{(l)} = \big( x^{(0)} \oplus y^{(1)} \oplus \cdots \oplus y^{(l-1)} \big) * \mathcal{K}^{(l)}    (3)

where \oplus denotes channel-wise concatenation and x^{(0)} denotes the first input feature map to the block. Note that in this DenseNet architecture the input grows its depth accordingly, i.e., with a growth rate g = 32, the depth of the input to the l-th layer is gl. In the final layer of a block, the concatenated input channels collapse back down to g, which forms the input to the next block. The first dense block in Figure 1 depicts this process. We stack four dense blocks, with the dilation rate of the middle layer in each block set to 1, 2, 4 and 8, respectively. Different from the original DenseNet architecture, we do not apply any transition in-between blocks, except for the very first layer, prior to the stacked dense blocks, expanding the channel of the input from 1 to 32, and another layer right after the stacked dense blocks to reduce it back to 1. This forms our fully convolutional DenseNet baseline. In all the convolutional layers, we use leaky ReLU as the activation.
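As a sketch of how the dense connectivity in (3) plays out, the following PyTorch module concatenates the block input and all preceding outputs channel-wise before each layer; the layer count, kernel sizes, and growth rate follow Table 1, while the class and variable names are our own.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Five conv layers; layer l takes the channel-wise concatenation of the
    block input and all preceding outputs, as in (3). Growth rate g = 32."""
    def __init__(self, g=32, dilation=1):
        super().__init__()
        self.layers = nn.ModuleList()
        for l in range(5):
            k = 55 if l == 2 else 5            # larger kernel mid-block
            d = dilation if l == 2 else 1      # dilation mid-block only
            self.layers.append(nn.Conv1d(g * (l + 1), g, kernel_size=k,
                                         dilation=d, padding=(k // 2) * d))
        self.act = nn.LeakyReLU()

    def forward(self, x0):                     # x0: (batch, 32, 1024)
        feats = [x0]
        for conv in self.layers:
            feats.append(self.act(conv(torch.cat(feats, dim=1))))
        return feats[-1]                       # final layer collapses to g channels

# Four blocks, with the middle-layer dilation rates 1, 2, 4 and 8.
blocks = nn.Sequential(*[DenseBlock(dilation=d) for d in (1, 2, 4, 8)])
```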

2.2 Context aggregation with gated recurrent network

DCCRN further employs RNN layers following the dilated DenseNet component (Figure 1). Between LSTM and GRU, the two most well-known RNN variations, DCCRN chooses GRU for its reduced computational complexity compared to LSTM [irie2016lstm]. The information flow within each unit is outlined as follows:

h_t = (1 - u_t) \odot h_{t-1} + u_t \odot \tilde{h}_t    (4)

\tilde{h}_t = \tanh\big( W x_t + U (r_t \odot h_{t-1}) \big)    (5)

u_t = \sigma\big( W_u x_t + U_u h_{t-1} \big)    (6)

r_t = \sigma\big( W_r x_t + U_r h_{t-1} \big)    (7)

where t is the index in the sequence. h_t and \tilde{h}_t are the hidden state and the newly proposed one, which are mixed up by the update gate u_t in a complementary fashion as in (4). The GRU cell computes the tanh unit by using a linear combination of the input x_t and the gated previous hidden state r_t \odot h_{t-1} as in (5). Similarly, the gates are estimated using sigmoid units as in (6) and (7). In all linear operations, GRU uses the corresponding weight matrices, W, U, W_u, U_u, W_r and U_r. We omit bias terms in the equations.
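A literal NumPy transcription of (4)-(7) may help fix the notation; the helper below is a sketch of one recurrence step, with bias terms omitted as in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, Wu, Uu, Wr, Ur):
    """One GRU update following (4)-(7)."""
    u = sigmoid(Wu @ x_t + Uu @ h_prev)             # update gate, (6)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate, (7)
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))   # candidate state, (5)
    return (1.0 - u) * h_prev + u * h_tilde         # complementary mix, (4)
```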

The GRU component in this work follows a "many-to-one" mapping style for an additional level of context aggregation. During training, it looks back T = 4 time steps and generates the output corresponding to the last time step. To this end, DCCRN reshapes the output of the CNN part, a 1024-dimensional vector, into T = 4 sub-frames, each of which is a 256-dimensional input vector to the GRU cell. We have two GRU layers, one with 32 hidden units and the other one with 256 units, to match the output dimensionality of the system. Furthermore, to ease the optimization and to limit the model complexity of the GRU layers, we pass the last sub-frame output of the DenseNet component to the output of the GRU component via a skip connection, which is additive as in the ResNet architecture: the denoised speech is the sum of the outputs of both components. With the dilated DenseNet component well-tuned, its output will already be close to the clean speech, which leaves less work for the GRU to optimize, as detailed in Section 2.5.
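The many-to-one mapping and the additive cross-component shortcut can be sketched in PyTorch as below; the dimensions come from Table 1, while the class name and interface are our own.

```python
import torch
import torch.nn as nn

class GRUHead(nn.Module):
    """Reshape a 1024-sample CNN output into T=4 sub-frames of 256 samples,
    run two GRU layers, and add the last DenseNet sub-frame as a skip."""
    def __init__(self):
        super().__init__()
        self.gru1 = nn.GRU(input_size=256, hidden_size=32, batch_first=True)
        self.gru2 = nn.GRU(input_size=32, hidden_size=256, batch_first=True)

    def forward(self, cnn_out):              # cnn_out: (batch, 1024)
        sub = cnn_out.view(-1, 4, 256)       # (batch, T=4, 256)
        h, _ = self.gru1(sub)
        h, _ = self.gru2(h)                  # (batch, 4, 256)
        return h[:, -1, :] + sub[:, -1, :]   # last time step + additive skip
```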

2.3 Data flow

During training, as illustrated in Figure 1, the noisy frame of 1024 samples is first fed to the DenseNet component (line 3 in Algorithm 1). It comprises 20 consecutive convolutional layers grouped into four dense blocks of five layers each, plus a channel-expanding layer before and a channel-reducing layer after the blocks (Table 1). The output frame of DenseNet, containing 1024 samples, is then reformulated into a sequence of four 256-dimensional vectors, which serve as the input of the GRU component. The cleaned-up signal corresponds to the final state of the GRU, with a dimension of 256.

At test time, the output sub-frame of DCCRN is weighted by a Hann window and overlap-added with its adjacent sub-frames. Note that to generate the last 256 samples, DCCRN only relies on the current and past samples, up to 1024 within that frame, without seeing future samples, which is similar to causal convolution. Hence, the delay of DCCRN is the sub-frame size (256 samples, i.e., 0.016 second at 16 kHz). If it were just for the DenseNet component only, such as the convolutional baselines compared in Section 3, the Hann window with the same overlap rate would still be applied, but the model output would be all 1024 samples of the corresponding frame, instead of the last 256 samples.
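The test-time cross-fading amounts to a windowed overlap-add over the denoised sub-frames; the sketch below assumes 50% overlap (a hop of 128 samples) for illustration, since only the use of a Hann window is fixed above.

```python
import numpy as np

def overlap_add(sub_frames, hop=128):
    """Hann-weighted overlap-add of denoised 256-sample sub-frames."""
    win = np.hanning(256)
    out = np.zeros(hop * (len(sub_frames) - 1) + 256)
    for i, s in enumerate(sub_frames):
        out[i * hop : i * hop + 256] += win * s
    return out
```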

Table 1 summarizes the network architecture.

1:  Input: 1024 samples of the noisy utterance, x
2:  Output: The last 256 samples of the denoised signal, \hat{y}
3:  DenseNet denoising: y ← DenseNet(x)
4:  Reshaping: y → (y_1, y_2, y_3, y_4),
5:             a sequence of four 256-dimensional sub-frames
6:  GRU denoising: \hat{y} ← GRU(y_1, …, y_4) + y_4
7:  Post windowing: Hann windowing and overlap-add    {# at test time only}
Algorithm 1 The feedforward procedure in DCCRN
Components      Input shape   Kernel shape    Output shape
Change channel  (1024, 1)     (55, 1, 32)     (1024, 32)
DenseNet        (1024, 32)    (5, 32, 32)     (1024, 32)
(×4 blocks)                   (5, 64, 32)
                              (55, 96, 32)†
                              (5, 128, 32)
                              (5, 160, 32)
Change channel  (1024, 32)    (55, 32, 1)     (1024, 1)
Reshape         (1024, 1)     -               (4, 256, 1)
GRU             (4, 256, 1)   (256+32, 32)    (256, 1)
                              (32+256, 256)
Table 1: Architecture of DCCRN. For the CNN layers, the data tensor sizes are represented as (size in samples, channels), while the CNN kernel shape is (size in samples, input channels, output channels); the five kernel rows of the DenseNet entry repeat for each of the four dense blocks. For the GRU layers, an additional dimension of the data tensors defines the length of the sequence, T = 4, while the kernel sizes define the linear operations as (input features, output features). The middle layer of each dense block, marked by a dagger (†), has the larger kernel size of 55 and an optional dilation with the rate of 1, 2, 4, and 8 for the four dense blocks, respectively.
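Putting the pieces together, the full forward pass of Table 1 might be assembled as follows; this reuses the DenseBlock and GRUHead sketches above and is our reading of the architecture rather than the authors' code.

```python
import torch
import torch.nn as nn

class DCCRN(nn.Module):
    """Channel expansion, four dense blocks, channel reduction, GRU head."""
    def __init__(self):
        super().__init__()
        self.expand = nn.Conv1d(1, 32, kernel_size=55, padding=27)
        self.blocks = nn.Sequential(*[DenseBlock(dilation=d)
                                      for d in (1, 2, 4, 8)])
        self.reduce = nn.Conv1d(32, 1, kernel_size=55, padding=27)
        self.head = GRUHead()

    def forward(self, x):                    # x: (batch, 1024) noisy frame
        h = self.expand(x.unsqueeze(1))      # (batch, 32, 1024)
        h = self.blocks(h)
        h = self.reduce(h).squeeze(1)        # (batch, 1024)
        return self.head(h)                  # (batch, 256) denoised sub-frame
```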

2.4 Objective function

It is known that the mean squared error (MSE) by itself cannot directly measure perceptual quality or intelligibility, which are usually the actual metrics for evaluation [fu2018end]. To address the discrepancy, the MSE can be replaced by a more intelligibility-salient measure, such as short-time objective intelligibility (STOI) [fu2018end]. However, improved intelligibility does not guarantee better perceptual quality. The objective function in this work is defined in (8); it is still based on MSE, but accompanied by a regularizer that compares the mel spectra of the target and output signals. The T-F domain regularizer compensates for the end-to-end DNN, which would otherwise operate only in the time domain. Empirically, it is shown to achieve better perceptual quality, as proposed in [new_paper_bloombergEndtoEnd].

\mathcal{E} = \sum_{t} \big( y_t - \hat{y}_t \big)^2 + \lambda \sum_{f,m} \big( \mathcal{M}(y)_{f,m} - \mathcal{M}(\hat{y})_{f,m} \big)^2    (8)

where \mathcal{M}(\cdot) denotes the mel spectrogram of a signal and \lambda weights the regularizer.
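A PyTorch rendition of (8) with torchaudio's mel transform is sketched below; the FFT size, mel-band count, and the weight lam are illustrative values, not the paper's settings.

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000,
                                           n_fft=256, n_mels=40)

def dccrn_loss(y_hat, y, lam=0.1):
    """Time-domain MSE plus the mel-spectral regularizer of (8)."""
    mse = torch.mean((y_hat - y) ** 2)
    mel_mse = torch.mean((mel(y_hat) - mel(y)) ** 2)
    return mse + lam * mel_mse
```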

2.5 Model training scheme

We train the CNN and RNN components separately, and then finetune the combined network.

  • CNN training: First, we train the CNN component to minimize the error between its output and the clean target.

  • RNN training: Next, we train the RNN part by minimizing the error of the full network output, while the CNN parameters are locked.

  • Integrative finetuning: Once both CNN and RNN components are pretrained, we unlock the CNN parameters and finetune both components to minimize the final error. Note that the learning rate for integrative finetuning should be smaller (see the training sketch after this list).
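A sketch of this three-stage schedule is given below; the epoch counts and learning rates follow Section 3.1, while the batch interface, the stage-1 target (the full clean frame), and the clipping value are our own assumptions.

```python
import torch

def train_dccrn(model, loader, loss_fn):
    """Stage-wise training per Section 2.5, for the DCCRN sketch above."""
    cnn = (list(model.expand.parameters()) + list(model.blocks.parameters())
           + list(model.reduce.parameters()))
    rnn = list(model.head.parameters())

    def cnn_only(x):                         # DenseNet output, 1024 samples
        h = model.expand(x.unsqueeze(1))
        return model.reduce(model.blocks(h)).squeeze(1)

    def run(forward, target, params, lr, epochs):
        opt = torch.optim.Adam(params, lr=lr)
        for _ in range(epochs):
            for x, y in loader:              # y: clean 1024-sample frame
                opt.zero_grad()
                loss_fn(forward(x), target(y)).backward()
                torch.nn.utils.clip_grad_value_(params, 1.0)  # value assumed
                opt.step()

    run(cnn_only, lambda y: y, cnn, 1e-4, 100)              # 1) CNN training
    for p in cnn:
        p.requires_grad_(False)                             # lock the CNN
    run(model, lambda y: y[:, -256:], rnn, 5e-6, 20)        # 2) RNN training
    for p in cnn:
        p.requires_grad_(True)                              # unlock the CNN
    run(model, lambda y: y[:, -256:], cnn + rnn, 5e-7, 20)  # 3) finetuning
```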

Metrics            SDR (dB)              SIR (dB)              SAR (dB)              STOI               PESQ
SNR level (dB)     -5     0      5      -5     0      5      -5     0      5      -5    0     5      -5    0     5
Unprocessed       -5.01  -0.00   5.01   -4.97   0.04   5.06  25.43  27.97  30.56  0.65  0.72  0.78   1.06  1.11  1.21
Gated ResNet      10.72  14.51  17.13   16.99  20.67  23.11  11.93  15.76  18.45  0.86  0.91  0.94   1.63  2.01  2.27
DenseNet          12.54  14.86  16.77   18.23  20.64  22.67  13.61  16.02  17.96  0.87  0.91  0.93   1.70  1.98  2.22
Dilated DenseNet  13.75  15.75  17.44   19.81  21.27  22.54  14.98  17.26  19.19  0.89  0.92  0.94   1.97  2.30  2.57
DenseNet+GRU      14.19  16.92  18.57   21.72  24.53  26.66  14.81  17.67  19.29  0.90  0.93  0.95   1.96  2.36  2.57
DCCRN             15.09  17.48  19.39   21.37  23.76  25.61  16.11  18.61  20.55  0.92  0.95  0.96   2.16  2.55  2.82
DCCRN*            15.06  17.47  19.39   21.34  23.71  25.63  16.08  18.61  20.56  0.91  0.94  0.96   2.15  2.52  2.82
Table 2: SDR, SIR, SAR, STOI and PESQ scores for the six models in comparison; DCCRN* is the variant with the enlarged frame size (Section 3.1).
Figure 2: A comparison of (a) STOI and (b) WB-PESQ scores, and (c) the spatial complexity of the models.

3 Experiments

3.1 Experimental setup

In this paper, the experiments run on the TIMIT corpus [timit]. For model training, we randomly choose 50 male and 50 female speakers from the TIMIT training subset. Five types of non-stationary noise (birds, cicadas, computer keyboard, machine guns and motorcycles) from [vincent2006performance] are used to create mixtures. Concretely, each clean signal is mixed with a random cut of each of these noise types at an integer SNR level (in dB) drawn at random. Therefore, 5,000 noisy speech samples, totaling 3.5 hours, are used for model training. At test time, we randomly select 2 male and 2 female speakers from the TIMIT test subset, and mix each utterance with those 5 types of noise. To add each signal and noise, the noise is randomly cut to match the length of the clean signal. The mixtures are generated at 3 SNR levels (-5 dB, 0 dB and +5 dB), yielding 600 test utterances in total.
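As a concrete illustration of this mixing procedure, a small helper in the spirit of the setup could read as follows; the function name is ours, and the noise recording is assumed to be at least as long as the clean signal.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a clean signal with a random noise cut at the requested SNR."""
    start = np.random.randint(0, len(noise) - len(clean) + 1)
    cut = noise[start : start + len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    gain = np.sqrt(np.sum(clean ** 2)
                   / (np.sum(cut ** 2) * 10.0 ** (snr_db / 10.0)))
    return clean + gain * cut
```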

Experiments are conducted on 6 models for comparison:


  • Gated ResNet refers to the gating mechanism-enhanced bottleneck residual blocks extracted from GRN [tan2018gated]. All layers in Gated ResNet are 1D convolutional layers with a kernel size of 55. The output channel sizes of the narrow and wide layers in the bottleneck structures are 20 and 200, respectively. We repurpose the original GRN so that this variation can handle time-domain samples, as the original GRN is designed for T-F spectrograms.

  • DenseNet refers to the CNN part of DCCRN with no dilation. Its architecture corresponds to the first three rows in Table 1 without dilation, reshaping, and GRUs.

  • Dilated DenseNet applies dilated convolution in the middle layer of each dense block.

  • DenseNet+GRU refers to the DenseNet architecture coupled with two GRU layers, but with no dilation.

  • DCCRN is our full model with the proposed training scheme in Section 2.5. We train the CNN part for 100 epochs and the GRU component for 20 epochs, followed by integrative finetuning for 20 epochs. The learning rates are 1e-4, 5e-6 and 5e-7, respectively.

  • DCCRN* is DCCRN with an enlarged frame size.

All models are trained to our best effort with the Adam optimizer [adam]. The batch size is 32 frames. The regularizer coefficient λ in (8) is fixed throughout, and the GRU gradients are clipped during training.

3.2 Experimental results

To evaluate the performance in terms of blind source separation, we calculate SDR, SIR and SAR with the BSS_Eval toolbox [vincent2006performance]. We also choose short-time objective intelligibility (STOI) [stoi] and the perceptual evaluation of speech quality (PESQ) with the wide-band extension (P.862.3) [pesq] to measure the intelligibility and quality of the denoised speech. Note that narrow-band PESQ scores (P.862) [pesqnb] are approximately 0.5 greater (e.g., for unprocessed utterances at 0 dB SNR).
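For reference, the intelligibility and quality scores can be computed with the third-party pystoi and pesq Python packages; this tooling is our assumption, as the paper does not name its implementation.

```python
import numpy as np
from pystoi import stoi   # pip install pystoi (assumed tooling)
from pesq import pesq     # pip install pesq   (assumed tooling)

def evaluate(clean: np.ndarray, denoised: np.ndarray, fs: int = 16000):
    """Return (STOI, wide-band PESQ, narrow-band PESQ) for one utterance."""
    return (stoi(clean, denoised, fs),        # intelligibility in [0, 1]
            pesq(fs, clean, denoised, 'wb'),  # P.862.3, wide-band
            pesq(fs, clean, denoised, 'nb'))  # P.862, narrow-band
```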

First, we find that both the DenseNet and Gated ResNet architectures can serve as our CNN baselines with comparable performance (Table 2). Note that the performance of Gated ResNet does not represent the original GRN [tan2018gated] due to the missing T-F transformations. The DenseNet baseline is preferred in that it has much fewer model parameters (Figure 2(c)). In addition, both the dilated convolution (Dilated DenseNet) and the individual GRU component (DenseNet+GRU) provide effective context aggregation, with superior performance over the standalone DenseNet baseline; DenseNet+GRU scores marginally higher than the in-frame dilation. By coupling the two context aggregation techniques, DCCRN consistently outperforms the other baseline models in all metrics. In terms of SDR, the average improvement over the DenseNet baseline is about 2.6 dB. STOI and PESQ scores are also displayed in Figure 2 (a) and (b). The comparison with unprocessed mixtures shows an average STOI improvement of 0.23 and PESQ improvement of 1.38. The performance is not further improved by the enlarged frame size (DCCRN*), due to the trade-off between the increased difficulty of GRU optimization and the additional temporal context in each sequence.

The comparison of model complexity is summarized in Figure 2 (c). Our method has far fewer parameters than the LSTM [chen2017long] and WaveNet-based speech enhancement models [rethage2018wavenet]. Furthermore, even though the number of parameters in Gated ResNet is already lower than in the original GRN [tan2018gated], both our DenseNet baseline and DCCRN with the GRU layers are even smaller, while performing better. DCCRN* with the enlarged frame size increases the model complexity by 0.64 million parameters but does not improve the performance. Denoised samples are available at http://pages.iu.edu/~zhenk/speechenhancement.html

4 Conclusion

This paper introduces DCCRN, a hybrid residual network that aggregates temporal context at dual levels for end-to-end speech enhancement. DCCRN first suppresses the noise in the time domain with a dilated DenseNet, followed by a GRU component that further leverages the temporal context in a many-to-one manner. To train this heterogeneous model, we present a component-wise training scheme followed by finetuning. Experiments showed that our method consistently outperforms other baseline models in various metrics. We plan to extend the experiments to investigate the model's generalizability to unseen data, at untrained SNR levels and with more noise types.

References