Full Attention Bidirectional Deep Learning Structure for Single Channel Speech Enhancement

As the cornerstone of other important technologies, such as speech recognition and speech synthesis, speech enhancement is a critical area in audio signal processing. In this paper, a new deep learning structure for speech enhancement is demonstrated. The model introduces a "full" attention mechanism to a bidirectional sequence-to-sequence method to make use of latent information after each focal frame. This is an extension of the previous attention-based RNN method. The proposed bidirectional attention-based architecture achieves better performance in terms of speech quality (PESQ), compared with OM-LSA, CNN-LSTM, T-GSA and the unidirectional attention-based LSTM baseline.


1 Introduction

Neural networks have seen great success in the field of speech enhancement and noise suppression in the last decade [Narayanan2013Ideal, geiger2014investigating, Erdogan2015Phase, wang2014training, pascual2017segan, rethage2018wavenet, soni2018time, ephraim1985speech]. This approach now outperforms most traditional model-based statistical approaches, such as spectral subtraction [boll1979suppression], the minimum mean-square error log-spectral method [ephraim1985speech], Wiener filtering [scalart1996speech] or OM-LSA [tran2010speech]. Recurrent neural networks (RNNs) have been widely used [chen2017long, weninger2015speech, sun2017multiple]. Various RNN-based structures, such as gated recurrent unit (GRU) networks and long short-term memory (LSTM) networks, as well as combinations and variations [valin2018hybrid], have been explored due to their powerful sequence learning ability, and they perform strongly in audio signal processing. The introduction of an attention mechanism further enhances the potential of these seq-to-seq methods [li2019multi]. The proposed approach uses Mel frequency features [wu2000word] to represent the sound sequence and introduces a new bidirectional attention-based structure for speech enhancement.

Our work is inspired by the recent success of unidirectional attention-RNN models in various seq-to-seq learning tasks [hao2019attention], which allow the clean speech to be perceived with high attention while the noise and background receive low attention. We extend this idea to make full use of the information around the focal frame, as indicated by a high level of attention, and refer to this as a "full" attention method, as opposed to the partial or half attention typical of previous approaches. The new architecture uses a bidirectional connection structure combined with this full attention mechanism.

The rest of the paper is organized as follows. In Section 2, our method is described in detail. The experiments and results are presented in Section 3. Finally we draw conclusions in Section 4.

2 Method

As illustrated in Figure 1, the process is divided into three major parts: feature extraction, deep learning and filtering. The input audio signal is noisy speech and the output audio signal is expected to be clean. Feature extraction calculates Mel filter bank (FBank) features, while the neural network performs regression and generates clean estimates of the FBank features [wu2000word] for filtering.

Figure 1: System Overview.

2.1 Feature Extraction

Frequency features are extracted from the short-time Fourier transform (STFT) of the input audio signal. In contrast to typical methods that use all sampling points in the spectrum as features, we choose to extract FBank features to reduce the number of weights to estimate in the system. We use the input and output feature vectors to conduct spectrum suppression as described in Section 2.3.
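The feature extraction step can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the standard 2595/700 mel formula, the use of magnitude (rather than power) spectra, and the reading of "128 points overlap" as a hop of 512 - 128 = 384 samples are all assumptions. The frame length, sampling rate and 42 filters come from the experiment setup.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula (assumed; the paper does not state which one it uses).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=42, n_fft=512, sample_rate=16000):
    """Triangular mel filters with unit peak, shape (n_filters, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_filters + 2 edge points, equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def fbank_features(signal, n_fft=512, overlap=128, n_filters=42, sample_rate=16000):
    """Return FBank features of shape (n_frames, n_filters)."""
    hop = n_fft - overlap  # assumed interpretation of the stated overlap
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    return magnitude @ mel_filterbank(n_filters, n_fft, sample_rate).T

features = fbank_features(np.random.default_rng(0).standard_normal(16000))
print(features.shape)
```

Under these assumptions each frame is reduced from 257 spectral points to 42 band values, which is the reduction in model size that motivates the FBank representation.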


2.2 Attention-based Bidirectional Architecture

The deep learning architecture is shown in Figure 2. The input is the sequence $(x_1, \dots, x_T)$, where $x_t$ is the vector that represents the amplitude at specific frequency points of noisy speech at frame $t$, and $T$ represents the total number of frames.

Figure 2: Deep Learning Structure: The left and right sides respectively represent the calculation process of the forward and reverse attention parameters.

A dense layer serves as the encoder to give a high-level representation from the input features.


The number of output features requires fine adjustment to balance representation and generalization. The encoder output is used in both the forward LSTM cell and the backward LSTM cell to generate the key and the query vectors.


The key and query vectors are computed separately for each direction, with a subscript denoting the backward or forward direction. The backward LSTM cell can be regarded as a normal LSTM cell that receives the input sequence in reverse order.

This approach will make it possible to use latent information in the attention mechanism. A dense layer is also added before generating the final query, but this step is omitted here for the simplicity of explanation. In forward and backward processing, the attention mechanism generates two attention weight vectors.


Following the correlation calculation in [hao2019attention] and [luong2015effective], normalized backward and forward attention weights are learned from the key and query vectors.
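One plausible form of these weights, written in the style of the cited attention literature, is given below. The notation is an assumption introduced here for illustration: $h^{K}_{d,t'}$ and $h^{Q}_{d,t}$ denote key and query vectors for direction $d \in \{f, b\}$, $W_d$ is a learned matrix, and $\mathcal{W}_d(t)$ is the set of frames attended to from focal frame $t$ in that direction.

```latex
\begin{align}
  s^{d}_{t,t'} &= \left(h^{K}_{d,t'}\right)^{\top} W_d \, h^{Q}_{d,t},
  \qquad d \in \{f, b\}, \\
  \alpha^{d}_{t,t'} &= \frac{\exp\left(s^{d}_{t,t'}\right)}
  {\sum_{\tau \in \mathcal{W}_d(t)} \exp\left(s^{d}_{t,\tau}\right)},
  \qquad t' \in \mathcal{W}_d(t).
\end{align}
```

With this normalization the weights in each direction are non-negative and sum to one over the attended frames.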


The attention weight vectors indicate the degree of relevance of neighboring frames to the time frame $t$, which we refer to as a focal frame. If the utterance is too long, the attention weights of distant frames may be nearly zero. Based on human pronunciation patterns, it is reasonable to assume that the length of backward effective attention sequences and the length of forward effective attention sequences may be different. Based on this, the backward and forward attention sequence lengths are implemented as separate constants.

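Near the boundaries of the utterance the attention window is truncated: if it would extend before the first frame, it starts at the first frame, and if it would extend past the last frame, it ends at the last frame of the input sequence. The context vector containing the information of the backward and forward key vectors can then be computed.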
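The windowing and context computation can be sketched as below. This is an illustrative NumPy version under stated assumptions, not the paper's code: plain dot-product scores stand in for the learned correlation, indices are 0-based, the names `n_past` and `n_future` are hypothetical, and the choice that the forward direction attends to preceding frames while the backward direction attends to following frames is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def windowed_context(keys, queries, t, start, end):
    """Attend from queries[t] over keys[start..end] (inclusive)."""
    window = keys[start:end + 1]
    weights = softmax(window @ queries[t])  # dot-product scores (assumed)
    return weights @ window, weights

def bidirectional_context(keys_f, queries_f, keys_b, queries_b, t, n_past, n_future):
    """Concatenate forward and backward context vectors for focal frame t."""
    n_frames = keys_f.shape[0]
    start = max(0, t - n_past)               # clamp window at the first frame
    end = min(n_frames - 1, t + n_future)    # clamp window at the last frame
    c_f, w_f = windowed_context(keys_f, queries_f, t, start, t)
    c_b, w_b = windowed_context(keys_b, queries_b, t, t, end)
    return np.concatenate([c_f, c_b]), w_f, w_b

rng = np.random.default_rng(0)
k_f, q_f, k_b, q_b = (rng.standard_normal((10, 4)) for _ in range(4))
context, w_f, w_b = bidirectional_context(k_f, q_f, k_b, q_b, t=2, n_past=5, n_future=3)
print(context.shape, len(w_f), len(w_b))
```

For a focal frame near the start of the utterance, the past window is shorter than `n_past`, which is the truncation described above.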


The computation involves the concatenation of two vectors. Another dense layer serves as a decoder that makes use of the context vector, the query vectors and the input feature. Similar to [hao2019attention], the model generates an enhancement vector and the final gain.
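A possible form of this decoder step, loosely following the unidirectional formulation it extends, is sketched below. All symbols are assumptions introduced here: $c_t$ is the context vector, $h^{Q}_{f,t}$ and $h^{Q}_{b,t}$ are the forward and backward query vectors, $x_t$ is the input feature, $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, and $W_e$, $W_g$, $b_e$, $b_g$ are learned parameters. The exact inputs and activations used in the paper may differ.

```latex
\begin{align}
  e_t &= \tanh\left(W_e \left[c_t ;\, h^{Q}_{f,t} ;\, h^{Q}_{b,t} ;\, x_t\right] + b_e\right), \\
  g_t &= \sigma\left(W_g \, e_t + b_g\right), \\
  \hat{y}_t &= g_t \odot x_t .
\end{align}
```

Here $\hat{y}_t$ would be the estimated clean FBank vector, obtained by applying the gain $g_t \in (0, 1)$ element-wise to the noisy input.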


2.3 Mel Filter Bank Generator

The model generates an FBank feature vector for every frame to use for noise filtering. The gain of each triangular filter at its peak response is calculated from the input vector and the output vector. The amplitude at the other points of the filter bank is then adjusted proportionally according to the gain at the peak.
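One way to read this filtering step is sketched below. It is an interpretation, not the authors' code: the gain is taken as the ratio of estimated clean to noisy FBank values, clipping to [0, 1] is an added assumption, and "adjusted proportionally" is approximated by linear interpolation between filter peaks. The function names and the `peak_bins` argument (the spectral bin index of each filter's peak) are hypothetical.

```python
import numpy as np

def band_gains(noisy_fbank, clean_fbank, eps=1e-8):
    """Gain at each filter peak: estimated clean / noisy, clipped to [0, 1]."""
    noisy_fbank = np.asarray(noisy_fbank, dtype=float)
    clean_fbank = np.asarray(clean_fbank, dtype=float)
    return np.clip(clean_fbank / (noisy_fbank + eps), 0.0, 1.0)

def bin_gains(gains, peak_bins, n_bins):
    """Spread peak gains over all spectral bins by linear interpolation.

    Bins below the first peak or above the last peak take the nearest peak gain.
    """
    return np.interp(np.arange(n_bins), peak_bins, gains)

def apply_gains(noisy_spectrum, gains, peak_bins):
    """Suppress one frame of the noisy magnitude spectrum."""
    return noisy_spectrum * bin_gains(gains, peak_bins, len(noisy_spectrum))

gains = band_gains([2.0, 4.0], [1.0, 4.0])
print(bin_gains(gains, np.array([2, 6]), 9))
```

The enhanced magnitude would then be combined with the noisy phase and inverted with the inverse STFT, a step the text does not detail.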

3 Results

3.1 Datasets

We tested the quality of the speech enhancement on two free speech databases, THCHS-30 [wang2015thchs] and QUT-NOISE-TIMIT.

THCHS-30 is a free Chinese speech database, which involves more than 30 hours of clean speech signals recorded by a single microphone in silence and with white noise, car noise and cafeteria noise. We extend the noise conditions further by creating a mix of car noise and cafeteria noise, implemented along with the white noise condition. The signal-to-noise ratio (SNR) ranges from -10 dB to 10 dB.

QUT-NOISE-TIMIT is synthesized by mixing 5 different background noise sources with TIMIT [garofolo1993darpa]. For the training set, -5 and 5 dB SNR data were used, but the evaluation set contains SNRs ranging from -10 dB to 15 dB. The total lengths of the training and test data are 25 hours and 12 hours, respectively.

3.2 Experiment Setup

The sampling rate of the audio files is 16 kHz, with 512-point Hanning-windowed frames and a 128-point overlap.

In both datasets, we randomly divide the data into a training set and a test set in a 4:1 ratio. The result is verified using 5-fold cross-validation. The input and the output are both 42-dimensional FBank vectors. The batch size is set to 96, the number of LSTM cells is set to 350, and the two attention sequence length constants are set to 15 and 5. The loss function is mean square error (MSE), and dropout regularization is used. The learning rate is dynamically adjusted according to the SNR.
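The 4:1 split and the 5-fold cross-validation fit together naturally, since each fold holds out one fifth of the data. The following is a small illustrative sketch in plain Python; the function name, the seed and the utterance-level splitting are assumptions, as the paper does not describe how the folds were drawn.

```python
import random

def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

for train, test in five_fold_splits(100):
    print(len(train), len(test))
```

Each item appears in exactly one test fold, so averaging the score over the five folds uses every utterance once for evaluation.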

In the experiments on the THCHS-30 dataset, we compared our approach with OM-LSA, LSTM-RNN and LSTM with attention (LSTM-Att) [hao2019attention]. It is worth mentioning that the LSTM-Att method is unidirectional, which makes the comparison more convincing.

In the experiments on the QUT-NOISE-TIMIT dataset, we use CNN-LSTM, O-T, T-AB, T-GSA and C-T-GSA as baselines [kim2019transformer], in order to compare our model with other end-to-end sequence structures. The perceptual evaluation of speech quality (PESQ) [rix2001perceptual] is used as the evaluation criterion.

3.3 Main Results

Results for THCHS-30 are shown in Table 1. The new approach outperforms the three baselines in most instances, especially when the noise pattern is irregular and SNR is relatively low. The bidirectional model performs better than the unidirectional model or the model without an attention mechanism. This sheds light on the effectiveness of using additional context when implementing the attention mechanism.

Method Raw OM-LSA LSTM-RNN LSTM-Att Bi-Att
cafe&car white cafe&car white cafe&car white cafe&car white cafe&car white
-10 dB 0.6424 0.0310 0.8337 0.8258 1.3532 1.3750 1.3786 1.3756
-5 dB 1.1098 0.4169 1.2895 1.4132 1.7820 1.6757 1.8277 1.7512
0 dB 1.5216 0.9977 2.1160 1.9523 2.2624 2.2322 2.4584 2.2324
5 dB 2.1597 1.6791 2.2753 2.4723 2.4023 2.6439 2.4330 2.4468
10 dB 2.6210 2.3163 2.6620 2.7027 2.4732 2.8659 2.4942 2.5165
Table 1: PESQ results on THCHS-30. The test set covers 5 SNR levels: -10, -5, 0, 5 and 10 dB. Bold indicates best results.

Results for QUT-NOISE-TIMIT are shown in Table 2. The bidirectional attention model outperforms the selected baseline models in most SNR conditions. The improvement is most significant in the lower part of the SNR range, up to 5 dB. Our hypothesis is that the FBank structure can better suppress the full frequency band under high noise conditions, but that in the case of low noise energy, insufficiently fine point-to-point suppression may reduce the quality of the entire speech signal.

SNR (dB) -10 -5 0 5 10 15
Raw 1.07 1.08 1.13 1.26 1.44 1.72
CNN-LSTM 1.43 1.65 1.89 2.16 2.35 2.54
O-T 1.29 1.45 1.63 1.87 2.07 2.29
T-AB 1.49 1.67 1.85 2.01 2.28 2.50
T-GSA 1.54 1.76 2.00 2.28 2.74
C-T-GSA 1.43 1.64 1.88 2.17 2.40 2.67
Bi-Att 2.47
Table 2: PESQ results on QUT-NOISE-TIMIT. The test set covers 6 SNR levels: -10, -5, 0, 5, 10 and 15 dB. Baseline results follow [kim2019transformer].

Based on the asymmetry of speech in the time domain, it is reasonable to assume that the attention may vary in different ways in the forward and backward directions. The results of using separate forward and backward attention sequence length constants are shown in Figure 3, for the THCHS-30 dataset in mixed car and cafe noise at an SNR of 0 dB. The resulting PESQ changes only slightly as the two constants vary, and larger values result in no obvious improvement. The model performs best when both constants are in the range of 5 to 20 frames.

Figure 3: PESQ at different values of the forward and backward attention sequence lengths. Darker colors represent larger PESQ.

The impact of the attention weight is illustrated in Figure 4 for a test example from THCHS-30 (251 frames), using fixed backward and forward attention sequence lengths. In each panel, the axes index the focal frame and the contextual frame, and each point represents an attention weight. As seen in the figure, the model gives different weights to the contextual frames. For illustration, the spectrum of a speech utterance is shown in Figure 5. The original speech signal is superimposed with the café and car noise to form noisy speech with an SNR of 0 dB.

Figure 4: The visualization of bidirectional attention weights. Top: positive direction. Bottom: negative direction.
Figure 5: A test example before and after speech enhancement. Top: noisy speech. Middle: speech processed by our model. Bottom: clean speech.

4 Conclusion

This paper demonstrates a new attention-based LSTM architecture for speech enhancement. The primary innovation is the use of a two-way attention network to take full advantage of latent information. Mel filter bank features are used to reduce complexity. Results demonstrate that the perceptual quality of the enhanced speech is higher than that of the baseline methods in most cases. The attention-based approach also generalizes better to irregular noise conditions. In addition, asymmetric forward and backward attention lengths are implemented and evaluated. Overall, the proposed approach demonstrates the positive impact of increasing the extent of speech and language context for noise suppression applications.