In real environments, clean speech is often contaminated by background interference, which may significantly reduce the performance of automatic speech recognition [graves2013speech], speaker verification [reynolds2000speaker], and hearing aids [dillon2008hearing]. Monaural speech enhancement aims to extract the target speech from the mixture when only one microphone is available [loizou2013speech]. In recent years, deep neural networks (DNNs) have shown promising performance on monaural speech enhancement, even in highly non-stationary noise environments, owing to their superior capability in modeling complex nonlinearity [wang2018supervised]. Typical DNN-based methods can be categorized into two classes according to their estimation targets: masking-based methods [wang2014training] and spectral mapping-based methods [xu2014regression].
Conventional DNNs usually adopted fully-connected (FC) layers for noise reduction [wang2014training, xu2014regression]. To tackle the speaker generalization problem, Chen et al. proposed stacked long short-term memory (SLSTM) networks [chen2016large], which significantly outperformed DNNs. Recently, various convolutional neural networks (CNNs) with complex topologies were proposed [fu2017complex, tan2018gated, pandey2019tcnn, pirhosseinloo2019monaural], which could reduce the number of trainable parameters. More recently, Tan et al. combined a convolutional auto-encoder (CAE) [badrinarayanan2015segnet] with LSTM to form the convolutional recurrent neural network (CRN) [tan2018convolutional], where the CAE helps to learn time-frequency (T-F) patterns and the LSTM captures dynamic sequence correlations.
Although a variety of models with more complex topologies have been proposed recently [tan2018gated, pandey2019tcnn, pirhosseinloo2019monaural, tan2018convolutional] and have shown improved performance, they still have two limitations. First, the number of parameters is often restricted to meet the low-latency requirement, which heavily limits the depth of the network. Second, increasing the depth makes the gradient vanishing problem more likely. Recently, progressive learning was proposed [gao2016snr, li2019convolutional], which decomposes the mapping process into multiple stages. Experimental results in [li2019convolutional] have shown that it dramatically decreases the number of trainable parameters while maintaining the performance. Based on this concept, recursive learning [li2020recursive] was proposed, which reuses the network for multiple stages and links the output of each stage through a memory mechanism. It further alleviates the parameter burden and deepens the network without introducing extra parameters.
Humans tend to generate adaptive attention with dynamic neuron circuits to perceive complicated environments [anderson2013dynamic], which is also described by the auditory dynamic attending theory for continuous speech processing [jones1976time]. For example, when a person hears an utterance in a real environment, the more dominant the noise components are, the more neural attention is needed to figure out the meaning, and vice versa. This phenomenon reveals the dynamic mechanism of the auditory perception system. Motivated by this physiological phenomenon, we propose a novel network that combines dynamic attention and recursive learning. Different from previous networks [tan2018gated, pandey2019tcnn, pirhosseinloo2019monaural, tan2018convolutional], in which a single complex network is designed for the task, the framework comprises a major network and an auxiliary sub-network in parallel, namely the noise reduction module (NRM) and the attention generator module (AGM), respectively. The workflow of the framework is as follows: at each intermediate stage, the noisy feature and the estimation from the last stage are combined into the current input. AGM generates the attention set, which is subsequently applied to NRM through a pointwise convolution and a sigmoid function. In this way, AGM serves as a type of perception module that flexibly adjusts the weight distribution of NRM, leading to better noise suppression. To the best of our knowledge, this is the first time that a dynamic attention mechanism has been introduced for the speech enhancement task.
2 Problem formulation and notation
In the time domain, a noisy signal can be modeled as $x(n) = s(n) + d(n)$, where $n$ is the discrete time index. With the short-time Fourier transform (STFT), it can be further rewritten as:

$$X(k, t) = S(k, t) + D(k, t),$$

where $X(k, t)$, $S(k, t)$, and $D(k, t)$ respectively refer to the noisy, clean, and noise components at the frequency bin index $k$ and the time frame index $t$. In this study, the network is deployed to estimate the magnitude of spectrum (MS), which is then combined with the noisy phase to recover the estimated spectrum. The inverse short-time Fourier transform (iSTFT) is used to reconstruct the waveform in the time domain.
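The reconstruction step described above can be illustrated with a minimal NumPy sketch. The function name `recombine_with_noisy_phase` and the toy arrays are assumptions made for illustration, and the scaled noisy magnitude only stands in for a network output.

```python
import numpy as np

def recombine_with_noisy_phase(noisy_spec, est_mag):
    """Pair an estimated magnitude with the phase of the noisy spectrum."""
    noisy_phase = np.angle(noisy_spec)           # the phase is left unprocessed
    return est_mag * np.exp(1j * noisy_phase)    # complex spectrum passed to the iSTFT

# Toy data: 161 frequency bins and 4 frames (hypothetical sizes).
rng = np.random.default_rng(0)
noisy_spec = rng.standard_normal((161, 4)) + 1j * rng.standard_normal((161, 4))
est_mag = 0.5 * np.abs(noisy_spec)               # stand-in for the network estimate
est_spec = recombine_with_noisy_phase(noisy_spec, est_mag)
```

Only the magnitude is modified here, so any phase error of the noisy input is carried over to the enhanced spectrum.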
For simplicity of notation, we define the principal notations used in this paper. $|X|$, $|S|$, $|\tilde{S}^{l}|$, and $|\tilde{S}|$ denote the magnitude of the noisy spectrum, the magnitude of the clean spectrum, the estimated magnitude of the spectrum in the $l$th stage, and the final estimated magnitude of the clean spectrum, respectively. $T$ and $F$ refer to the timestep and the feature length, respectively. As recursive learning is used, the superscript $l$ denotes the stage index, and the number of stages is denoted as $Q$.
3 Architecture illustration
3.1 Stage recurrent neural network
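The stage recurrent neural network (SRNN) was first proposed in [li2020recursive] and is the core component of recursive learning. It aggregates information across different stages with a memory mechanism and comprises two parts, namely a two-dimensional convolutional (2-D Conv) block and a convolutional-RNN (Conv-RNN) block. The first part projects the input features into a latent representation, and the Conv-RNN then updates the state in the current stage. Assuming the outputs of 2-D Conv and Conv-RNN at stage $l$ are respectively denoted as $\hat{h}^{l}$ and $h^{l}$, the inference of SRNN is formulated as:

$$\hat{h}^{l} = f_{conv}\left(|X|, |\tilde{S}^{l-1}|\right), \qquad h^{l} = f_{conv\_rnn}\left(\hat{h}^{l}, h^{l-1}\right),$$

where $f_{conv}$ and $f_{conv\_rnn}$ refer to the functions of 2-D Conv and Conv-RNN, respectively, and $h^{l-1}$ is the state of the last stage. In this study, ConvGRU [ballas2015delving] is adopted as the unit in Conv-RNN, which is calculated as follows:

$$
\begin{aligned}
z^{l} &= \sigma\left(W_{z} \circledast \hat{h}^{l} + U_{z} \circledast h^{l-1}\right), \\
r^{l} &= \sigma\left(W_{r} \circledast \hat{h}^{l} + U_{r} \circledast h^{l-1}\right), \\
n^{l} &= \tanh\left(W_{n} \circledast \hat{h}^{l} + U_{n} \circledast \left(r^{l} \odot h^{l-1}\right)\right), \\
h^{l} &= \left(1 - z^{l}\right) \odot h^{l-1} + z^{l} \odot n^{l},
\end{aligned}
$$

where $W$ and $U$ represent the weight matrices of the cell. $\sigma$ and $\tanh$ denote the sigmoid and the tanh activation functions, respectively. $\circledast$ represents the convolutional operator and $\odot$ is the element-wise multiplication. Note that biases are omitted for notational convenience.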
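As a rough illustration of how a GRU-style state is carried from one stage to the next, the sketch below follows the standard ConvGRU gating with two simplifying assumptions: every convolution is replaced by a pointwise (1x1) convolution, and the 2-D Conv output is replaced by random data. All names (`gru_stage_update`, `params`, the weight keys) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pointwise_conv(w, x):
    """1x1 convolution: mix the channels of x (C_in, T, F) with w (C_out, C_in)."""
    return np.einsum("oc,ctf->otf", w, x)

def gru_stage_update(h_hat, h_prev, p):
    """One GRU-style state update across stages (biases omitted)."""
    z = sigmoid(pointwise_conv(p["Wz"], h_hat) + pointwise_conv(p["Uz"], h_prev))      # update gate
    r = sigmoid(pointwise_conv(p["Wr"], h_hat) + pointwise_conv(p["Ur"], h_prev))      # reset gate
    n = np.tanh(pointwise_conv(p["Wn"], h_hat) + pointwise_conv(p["Un"], r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * n

rng = np.random.default_rng(0)
C, T, F = 4, 3, 5
params = {k: 0.1 * rng.standard_normal((C, C)) for k in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")}
h = np.zeros((C, T, F))                        # state before the first stage
for stage in range(3):
    h_hat = rng.standard_normal((C, T, F))     # stand-in for the 2-D Conv output
    h = gru_stage_update(h_hat, h, params)
```

Because the same `params` are used in every iteration, adding stages does not add parameters, which is the property recursive learning relies on.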
3.2 Attention gate
Attention U-Net (AU-Net) was first proposed in [oktay2018attention] to improve the accuracy of segmentation-related tasks, where attention gates (AGs) are inserted between the convolutional encoder and the decoder. Compared with a standard U-Net, AU-Net is capable of automatically suppressing irrelevant regions and emphasizing important features. As the spectrum includes abundant frequency components, where formants are often dominant in the low-frequency regions and the high-frequency regions have a sparse distribution, it is necessary to weight different spectral regions differently. The schematic of the AG adopted in this paper is shown in Figure 1. Assuming the inputs of the unit are $a$ and $b$, where $a$ and $b$ refer to the feature of a decoding layer and its corresponding feature in an encoding layer, respectively, the output $c$ can be calculated as:

$$c = \sigma\left(W_{r} \circledast \mathrm{ReLU}\left(W_{a} \circledast a + W_{b} \circledast b\right)\right) \odot b,$$

where $W_{a}$, $W_{b}$ and $W_{r}$ are the convolution kernels. Note that the unit consists of two branches: one merges the information of both inputs and generates the attention coefficients through a sigmoid function, and the other copies the information of $b$ and multiplies it by the coefficients. After the output of the AG is obtained, it is concatenated with the feature from the corresponding decoding layer along the channel dimension as the input of the next decoding layer.
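The two-branch structure of the attention gate can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the three kernels are pointwise (1x1) convolutions, the intermediate ReLU is borrowed from the AU-Net formulation, and the names and tensor sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pointwise_conv(w, x):
    """1x1 convolution: mix the channels of x (C_in, T, F) with w (C_out, C_in)."""
    return np.einsum("oc,ctf->otf", w, x)

def attention_gate(dec_feat, enc_feat, w_dec, w_enc, w_coef):
    """Gate the encoder feature with coefficients computed from both inputs."""
    merged = pointwise_conv(w_dec, dec_feat) + pointwise_conv(w_enc, enc_feat)
    merged = np.maximum(merged, 0.0)                  # ReLU (assumption taken from AU-Net)
    coeff = sigmoid(pointwise_conv(w_coef, merged))   # shape (1, T, F), values in (0, 1)
    return coeff * enc_feat                           # coefficients broadcast over channels

rng = np.random.default_rng(0)
C, T, F = 8, 4, 10
dec_feat = rng.standard_normal((C, T, F))
enc_feat = rng.standard_normal((C, T, F))
gated = attention_gate(dec_feat, enc_feat,
                       rng.standard_normal((C, C)),
                       rng.standard_normal((C, C)),
                       rng.standard_normal((1, C)))
skip_input = np.concatenate([gated, dec_feat], axis=0)  # concatenation along channels
```

Since every coefficient lies in (0, 1), the gate can only attenuate encoder features, never amplify them.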
3.3 Proposed architecture
An overview of the proposed architecture is depicted in Figure 2-(a). It has two modules, namely AGM and NRM, which are executed in an interleaved manner during the whole process. The architecture operates recursively, i.e., the whole forward stream can be unfolded into multiple stages. In each stage, the original noisy spectrum and the estimation from the last stage are concatenated to serve as the network input. This input is sent to AGM to generate the current attention set, which represents the attention distribution at the current stage. The attention set is subsequently applied to NRM to control the information flow throughout the network. NRM also receives the same input to estimate the MS. As a consequence, the output of AGM dynamically depends on how well the MS was estimated in the last stage, i.e., AGM is capable of re-weighting the attention distribution according to the previous feedback from the noise reduction system.
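Denoting the mapping functions of AGM and NRM as $G_{A}$ and $G_{R}$, respectively, the calculation procedure of the proposed architecture is as follows:

$$
\begin{aligned}
a^{l} &= G_{A}\left(|X|, |\tilde{S}^{l-1}|; \theta_{A}\right), \\
|\tilde{S}^{l}| &= G_{R}\left(|X|, |\tilde{S}^{l-1}|, a^{l}; \theta_{R}\right),
\end{aligned}
$$

where $a^{l}$ is the generated attention set at stage $l$, and $\theta_{A}$ and $\theta_{R}$ represent the network parameters of AGM and NRM, respectively.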
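The stage-wise procedure can be summarized by the following control-flow sketch. The functions `agm` and `nrm` are deliberately trivial stand-ins rather than the actual modules, and initializing the first-stage estimate with the noisy magnitude is an assumption, since the text does not state how the first stage is seeded.

```python
import numpy as np

def agm(net_input):
    """Stub attention generator: one gate per T-F unit, values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-net_input.mean(axis=0)))

def nrm(net_input, attention):
    """Stub noise reduction module: attenuates the noisy magnitude."""
    noisy_mag = net_input[0]
    return attention * noisy_mag

def recursive_enhance(noisy_mag, num_stages=3):
    estimate = noisy_mag.copy()                      # assumed seed for the first stage
    outputs = []
    for _ in range(num_stages):
        net_input = np.stack([noisy_mag, estimate])  # two input channels
        attention = agm(net_input)                   # attention set of this stage
        estimate = nrm(net_input, attention)         # refined magnitude estimate
        outputs.append(estimate)
    return outputs

rng = np.random.default_rng(0)
noisy_mag = np.abs(rng.standard_normal((5, 161)))    # (timestep, feature length)
outputs = recursive_enhance(noisy_mag, num_stages=3)
```

The same two functions are called in every stage, and the attention of a stage depends on the estimate produced by the previous one.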
In this study, we use a typical U-Net [ronneberger2015u] topology for AGM, which consists of a convolutional encoder and a decoder. The encoder consists of five successive 2-D convolutional layers, each of which is followed by batch normalization (BN) [ioffe2015batch] and an exponential linear unit (ELU) [clevert2015fast]. The numbers of channels through the encoder are (16, 32, 32, 64, 64). The decoder mirrors the encoder, except that all the convolutions are replaced by deconvolutions [noh2015learning] to effectively enlarge the mapping size. Similarly, the numbers of channels through the decoder are (64, 64, 32, 32, 16). The kernel size and stride of both the encoder and the decoder are (2, 5) and (1, 2), respectively. Skip connections are introduced to compensate for the information loss during the encoding process.
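To make the effect of the stride concrete, the short calculation below traces the frequency dimension through five encoder layers. It assumes that the second element of the kernel and stride tuples acts on the frequency axis, that no padding is used along frequency, and that the input has 161 frequency bins; if padding is applied in practice, the sizes differ.

```python
def conv_out_len(n, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (n - kernel) // stride + 1

freq_sizes = [161]                       # assumed number of input frequency bins
for _ in range(5):                       # five encoder layers
    freq_sizes.append(conv_out_len(freq_sizes[-1], kernel=5, stride=2))
```

Under these assumptions the frequency axis shrinks roughly by half per layer, which is what the deconvolutions in the decoder have to undo.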
The details of NRM are shown in Figure 2-(b). It includes three parts, namely SRNN, AU-Net and a series of GLUs [tan2018gated]. The input of the network has a feature length of 161 and 2 input channels. The output of SRNN and the six consecutive convolutional blocks is subsequently reshaped, and six concatenated GLUs proposed in [tan2018gated] are used to explore the contextual correlations efficiently. The output of the GLUs is reshaped back and then sent to the decoder to expand the feature size and estimate the MS. The numbers of channels for the encoder and decoder in AU-Net are (16, 16, 32, 32, 64, 64) and (64, 32, 32, 16, 16, 1), respectively. The kernel size and stride in NRM are the same as in AGM except for the last layer, which uses a pointwise convolution followed by Softplus as the nonlinearity [zheng2015improving] to obtain the MS. Note that, different from the direct skip connections in a standard U-Net, the feature maps from the encoder are multiplied by the gating coefficients from the AGs before they are concatenated with the decoding features, which helps to weigh the feature importance in multiple encoding layers.
The connection between AGM and NRM is shown in Figure 2-(c), where each of the intermediate features in the decoder of AGM is passed through a pointwise convolution and a sigmoid function and then multiplied with the corresponding feature in the encoder of NRM. Note that the sigmoid function is applied to constrain the values to the range $(0, 1)$.
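The connection can be sketched as a gating operation. In the minimal example below, the channel counts, the tensor sizes, and the names `gate_nrm_feature` and `w_point` are hypothetical; the pointwise convolution is used to map the AGM feature to the channel count of the NRM feature.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pointwise_conv(w, x):
    """1x1 convolution: mix the channels of x (C_in, T, F) with w (C_out, C_in)."""
    return np.einsum("oc,ctf->otf", w, x)

def gate_nrm_feature(agm_feat, nrm_feat, w_point):
    """Scale an NRM encoder feature with a gate derived from an AGM decoder feature."""
    gate = sigmoid(pointwise_conv(w_point, agm_feat))   # same shape as nrm_feat, values in (0, 1)
    return gate * nrm_feat

rng = np.random.default_rng(0)
agm_feat = rng.standard_normal((32, 4, 17))   # hypothetical AGM decoder feature
nrm_feat = rng.standard_normal((16, 4, 17))   # hypothetical NRM encoder feature
w_point = rng.standard_normal((16, 32))       # maps 32 channels to 16
gated_feat = gate_nrm_feature(agm_feat, nrm_feat, w_point)
```

Unlike the attention gate inside AU-Net, the gate here is computed by a separate network, so it can change from stage to stage while the NRM weights stay fixed.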
3.4 Loss function
As the network is trained for multiple stages, each of which yields an intermediate estimation, the accumulated loss can be defined as $\mathcal{L} = \sum_{l=1}^{Q} \lambda_{l} \mathcal{L}^{l}$, where $\lambda_{l}$ is the weighting coefficient for each stage and $\mathcal{L}^{l}$ is the loss function for the $l$th stage. We set $\lambda_{l}$ to the same value for $l = 1, \ldots, Q$ in this study, i.e., the same emphasis is given to each training stage.
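A minimal sketch of the accumulated loss is given below. It assumes that the per-stage loss is the mean-square error between the stage estimate and the clean magnitude, and that equal weighting is realized with a weight of 1.0 per stage; both the function name and the default weights are illustrative choices.

```python
import numpy as np

def accumulated_loss(stage_estimates, clean_mag, weights=None):
    """Weighted sum of per-stage MSE losses."""
    if weights is None:
        weights = [1.0] * len(stage_estimates)       # equal emphasis (assumed value)
    return sum(w * np.mean((est - clean_mag) ** 2)
               for w, est in zip(weights, stage_estimates))

clean_mag = np.zeros((2, 2))
stage_estimates = [np.ones((2, 2)), 2.0 * np.ones((2, 2))]   # per-stage MSE: 1 and 4
total = accumulated_loss(stage_estimates, clean_mag)
```

Because every stage contributes to the loss, the intermediate estimates are supervised directly rather than only the final one.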
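Table 1: Objective results for seen noise cases.

| Model | PESQ (-5 dB) | PESQ (0 dB) | PESQ (5 dB) | PESQ (10 dB) | PESQ (Avg.) | STOI in % (-5 dB) | STOI in % (0 dB) | STOI in % (5 dB) | STOI in % (10 dB) | STOI in % (Avg.) |
|---|---|---|---|---|---|---|---|---|---|---|
| Proposed (Q = 3) | 2.60 | 2.91 | 3.21 | 3.46 | 3.05 | 83.74 | 89.07 | 93.17 | 95.79 | 90.44 |

Table 2: Objective results for unseen noise cases.

| Model | PESQ (-5 dB) | PESQ (0 dB) | PESQ (5 dB) | PESQ (10 dB) | PESQ (Avg.) | STOI in % (-5 dB) | STOI in % (0 dB) | STOI in % (5 dB) | STOI in % (10 dB) | STOI in % (Avg.) |
|---|---|---|---|---|---|---|---|---|---|---|
| Proposed (Q = 3) | 2.37 | 2.75 | 3.06 | 3.34 | 2.88 | 79.16 | 87.54 | 91.61 | 95.02 | 88.33 |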
4 Experimental setup
The experiments are conducted on the TIMIT corpus [garofolo1993darpa]. 4856, 800 and 100 clean utterances are selected for training, validation and testing, respectively. The training and validation datasets are created at SNR levels ranging from -5 dB to 10 dB with an interval of 1 dB, whilst the model is tested under the SNR conditions of (-5 dB, 0 dB, 5 dB, 10 dB). The 130 types of noise used in [li2020recursive] are used for training and validation. Another 5 types of noise from NOISEX92 (babble, f16, factory2, m109 and white) are used to explore the generalization capacity of the networks. All the collected noises are first concatenated into a long vector. During each mixing process, a random cutting point is generated, and the noise segment starting from that point is mixed with an utterance at a given SNR level. As a result, we create 40,000, 4,000 and 800 noisy-clean pairs for training, validation and testing, respectively.
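The mixing procedure can be sketched as follows. The example uses synthetic signals, and scaling the noise (rather than the speech) to reach the target SNR is an assumption; the function name `mix_at_snr` is hypothetical.

```python
import numpy as np

def mix_at_snr(clean, noise_bank, snr_db, rng):
    """Cut a noise segment at a random point and mix it with the utterance."""
    start = rng.integers(0, len(noise_bank) - len(clean) + 1)   # random cutting point
    noise = noise_bank[start:start + len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) equals snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = scale * noise
    return clean + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)            # one second at 16 kHz (synthetic)
noise_bank = rng.standard_normal(160000)      # stand-in for the concatenated noises
mixture, scaled_noise = mix_at_snr(clean, noise_bank, snr_db=5.0, rng=rng)
achieved_snr = 10.0 * np.log10(np.mean(clean ** 2) / np.mean(scaled_noise ** 2))
```

Drawing a new cutting point for every utterance means the same noise recording contributes many different segments to the training set.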
In this study, four networks are selected as the baselines, namely SLSTM [chen2016large], CRN [tan2018convolutional], GRN [tan2018gated] and DCN [pirhosseinloo2019monaural], all of which have recently achieved state-of-the-art performance. For SLSTM, four LSTM layers with 1024 units are stacked, followed by one FC layer to obtain the MS. The input of SLSTM is the concatenation of the current frame and the previous ten frames. CRN is a real-time architecture combining CNN and LSTM. GRN and DCN are typical fully-convolutional networks with gating mechanisms.
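The SLSTM input described above can be built by stacking each frame with its ten predecessors, which gives 11 x 161 = 1771 values per frame. The sketch below assumes that missing past frames at the start of an utterance are filled with zeros; the function name `stack_context` is hypothetical.

```python
import numpy as np

def stack_context(features, num_prev=10):
    """Concatenate each frame with its num_prev preceding frames."""
    num_frames, feat_len = features.shape
    # Zero-pad the past so the first frames also have a full context (assumption).
    padded = np.concatenate([np.zeros((num_prev, feat_len)), features], axis=0)
    return np.stack([padded[t:t + num_prev + 1].reshape(-1)
                     for t in range(num_frames)])

rng = np.random.default_rng(0)
features = rng.standard_normal((20, 161))     # (frames, feature length)
stacked = stack_context(features)
```

Only past frames are used, so the stacked input does not introduce any look-ahead.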
4.3 Parameter setup
All the utterances are sampled at 16 kHz. A 20 ms Hamming window is applied, with a 10 ms overlap between adjacent frames. A 320-point STFT is adopted, leading to a 161-D feature vector in each frame. All the models are trained with the mean-square error (MSE) criterion and optimized by Adam [kingma2014adam]. The learning rate is initialized at 0.001; it is halved when the validation loss increases 3 consecutive times, and training is early-stopped when 10 validation loss increments occur. All the models are trained for 50 epochs. The minibatch size is set to 4 at the utterance level. Within a minibatch, utterances shorter than the longest one are padded with zeros.
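One possible reading of the learning-rate rule is sketched below. It assumes that an increment means the validation loss is higher than in the previous epoch, that both thresholds count consecutive increments, and that the rate is halved at every third consecutive increment; the text leaves these details open, so the function `lr_schedule` is only illustrative.

```python
def lr_schedule(val_losses, lr=1e-3, halve_every=3, stop_at=10):
    """Return the learning rate used after each epoch until early stopping."""
    streak = 0                                   # consecutive validation loss increments
    history = []
    for epoch, loss in enumerate(val_losses):
        if epoch > 0 and loss > val_losses[epoch - 1]:
            streak += 1
        else:
            streak = 0
        if streak > 0 and streak % halve_every == 0:
            lr *= 0.5                            # halve at the 3rd, 6th, 9th increment
        history.append(lr)
        if streak >= stop_at:
            break                                # early stopping
    return history

short_run = lr_schedule([1.0, 0.9, 1.0, 1.1, 1.2, 0.8])
long_run = lr_schedule([float(i) for i in range(12)])
```

In `short_run` the rate is halved once after three consecutive increments, while `long_run` stops early after the tenth increment.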
5 Results and analysis
This section evaluates the performance of different models with perceptual evaluation of speech quality (PESQ) [recommendation2001perceptual] and short-time objective intelligibility (STOI) scores [taal2010short].
5.1 Objective results
Tables 1 and 2 summarize the results of different models for seen and unseen noise cases, respectively. From the two tables, one can observe the following phenomena. First, CRN, GRN, DCN, and the proposed model consistently outperform SLSTM in both seen and unseen noise cases. This is because SLSTM solely considers the sequence correlations but neglects the implicit T-F patterns, which are crucial for spectrum recovery. Moreover, stacked LSTM tends to cause an attenuation effect due to the gradient vanishing problem, which limits the performance. Second, compared with the baselines, the proposed architecture obtains notable improvements in both metrics. For example, when going from CRN to the proposed model, PESQ is improved by 0.16 and STOI is improved by 1.01% on average for seen cases. A similar trend is also observed for unseen cases, indicating that the proposed model has a good noise generalization capability. Third, we observe that GRN and DCN achieve close performance. This can be explained by the similar topology of the two networks, in which dilated convolutions combined with a gating mechanism are applied for sequence modeling.
5.2 Impact of stages
We study the impact of the number of stages Q, and the results are given in Figure 3. One can observe that both PESQ and STOI consistently improve as Q increases when $Q \leq 3$. This indicates that SRNN can effectively refine the performance of the network through its memory mechanism. We also find that when Q increases from 3 to 5, the PESQ value slightly degrades whilst STOI still improves. This is because a distance-based loss such as MSE is utilized, so the loss function and the optimization process cannot guarantee consistent optimization for both metrics, which is consistent with the previous study in [li2020recursive].
5.3 Parameter comparison
Table 3 summarizes the number of trainable parameters of the different models. One can see that the proposed model dramatically decreases the number of trainable parameters compared with the baselines. This demonstrates the superior parameter efficiency of the proposed architecture.
In a complicated scenario, a person usually adjusts attention dynamically as the environment changes during continuous speech. Based on this neural phenomenon, we propose a framework combining dynamic attention and recursive learning. To adaptively control the information flow of the noise reduction network, a separate sub-network is designed to update the attention representation in each stage, which is subsequently applied to the major network. As a recursive paradigm is adopted for training, the network is reused for multiple stages. As a result, we obtain a refined estimation stage by stage. Experimental results show that, compared with previous strong state-of-the-art models, the proposed model achieves consistently better performance while further decreasing the parameter burden.