Transformer with Gaussian weighted self-attention for speech enhancement

10/13/2019 · by Jaeyoung Kim, et al.

The Transformer architecture recently replaced recurrent neural networks such as LSTM or GRU on many natural language processing (NLP) tasks by presenting new state-of-the-art performance. Self-attention is a core building block of the Transformer: it not only enables parallelization of sequence computation but also provides a constant path length between symbols, which is essential for learning long-range dependencies. However, the Transformer has not performed well for speech enhancement because it does not account for the physical characteristics of speech and noise. In this paper, we propose Gaussian weighted self-attention, which attenuates attention weights according to the distance between target and context symbols. The experimental results showed that the proposed attention scheme significantly improved over the original Transformer as well as recurrent networks.


1 Introduction

In recent years, deep neural networks have shown great success in speech enhancement [12, 6, 22, 13, 15, 16]. Neural networks can directly learn a complicated nonlinear mapping from data without any prior assumption, and have outperformed existing model-based statistical approaches such as MMSE-STSA [4] or OM-LSA [5, 2].

Although many different neural network architectures have been used for speech enhancement, recurrent neural networks such as LSTM [9] or GRU [1] have been the most popular due to their powerful sequence learning. Recently, a new sequence learning architecture, the Transformer [20], presented significant improvements over recurrent networks for machine translation and many other natural language processing tasks. The Transformer uses a self-attention mechanism to compute symbol-by-symbol correlations in parallel over the entire input sequence; these correlations measure the similarity between the target symbol and its neighboring context symbols. The resulting similarity vector is normalized by a softmax function and used as attention weights to combine the context symbols.

Unlike recurrent networks, the Transformer can process an input sequence in parallel, which significantly reduces training and inference time. Moreover, the Transformer provides a fixed path length: the number of time steps to traverse before computing attention weights or symbol correlations does not depend on how far apart the symbols are. Recurrent networks, by contrast, have a path length proportional to the distance between target and context symbols due to their sequential processing, which makes it difficult to learn long-range dependencies between symbols. The Transformer resolves this issue with the self-attention mechanism.

However, the Transformer has not shown a similar improvement on the speech denoising problem. The main issue is that the fixed path length property that benefits many NLP tasks is not well matched to the physical characteristics of speech and noise: speech and noise processes tend to be more correlated for components that are closer in time. A positional encoding that penalizes attention weights in proportion to the distance between symbols is therefore needed to reflect these signal characteristics.

In this paper, we propose a Gaussian weighted self-attention scheme that attenuates attention weights according to the distance between the correlated symbols. The attenuation is determined by a Gaussian variance that can be learned during training. The evaluation results showed that the proposed scheme significantly improved over the existing Transformer architectures as well as the previous best recurrent model based on the LSTM architecture.

2 Proposed Architectures

Figure 1: Block Diagram of Transformer Encoder for Speech Enhancement

Figure 1 shows the proposed denoising network based on the Transformer encoder architecture. The original Transformer consists of encoder and decoder networks, but we only use the encoder network because the input and output sequences have the same length; therefore, alignment between input and output sequences is not necessary for the speech denoising problem. The network input |Y_i[t, f]| is the noisy magnitude spectrum obtained by the short-time Fourier transform (STFT) of the time-domain noisy signal y_i[n], where i is an utterance index, t is a frame index and f is a frequency index. The input noisy signal is given by

y_i[n] = x_i[n] + v_i[n],    (1)

where x_i[n] is clean speech and v_i[n] is the noise signal. Each encoder layer consists of multi-head self-attention, layer normalization and a fully-connected layer, the same as the original Transformer encoder. The network output is a time-frequency mask M_i[t, f] that predicts clean speech when multiplied with the noisy input:

|X̂_i[t, f]| = M_i[t, f] · |Y_i[t, f]|.    (2)

The estimated clean magnitude spectrum |X̂_i[t, f]| is combined with the input (noisy) phase spectrum and transformed into a time-domain signal x̂_i[n] by the inverse short-time Fourier transform (ISTFT).
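For concreteness, the pipeline above can be sketched as follows. This is a minimal illustration assuming PyTorch, a Hann window and illustrative FFT/hop sizes (the paper does not specify them here); `denoiser` stands in for the Transformer encoder that predicts the mask.

```python
import torch

def denoise_waveform(noisy_wave, denoiser, n_fft=512, hop=256):
    """Magnitude masking pipeline: STFT -> mask -> reuse noisy phase -> ISTFT.

    `denoiser` is assumed to map a magnitude spectrogram (B, T, F) to a
    mask of the same shape; hyper-parameters are illustrative only.
    """
    window = torch.hann_window(n_fft, device=noisy_wave.device)
    # Complex STFT of the noisy signal: (B, F, T)
    Y = torch.stft(noisy_wave, n_fft, hop_length=hop, window=window,
                   return_complex=True)
    mag, phase = Y.abs(), Y.angle()            # noisy magnitude and phase
    mask = denoiser(mag.transpose(1, 2))       # (B, T, F) time-frequency mask
    est_mag = mask.transpose(1, 2) * mag       # Eq. (2): masked magnitude
    X_hat = torch.polar(est_mag, phase)        # recombine with the noisy phase
    return torch.istft(X_hat, n_fft, hop_length=hop, window=window,
                       length=noisy_wave.shape[-1])
```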

2.1 Gaussian Weighted Self-Attention

Figure 2: Block diagram of the proposed multi-head self-attention: the G.W. block element-wise multiplies the Gaussian weighting matrix with the generated score matrix. The matrix dimensions are denoted to the right of each signal.

Figure 2 describes the proposed Gaussian weighted multi-head self-attention. B, L and D are the batch size, sequence length and input dimension, respectively, and h is the number of self-attention heads. The query, key and value matrices are defined as follows:

Q = H W^Q,    (3)
K = H W^K,    (4)
V = H W^V,    (5)

where H is the hidden layer output and W^Q, W^K and W^V are network parameters.
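A minimal sketch of Eqs. (3)-(5), assuming the projections are linear layers and the shapes shown in Figure 2; the tensor and layer names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

B, L, D, h = 8, 100, 1024, 8          # batch, sequence length, model dim, heads
d_head = D // h

W_q = nn.Linear(D, D, bias=False)     # W^Q
W_k = nn.Linear(D, D, bias=False)     # W^K
W_v = nn.Linear(D, D, bias=False)     # W^V

H = torch.randn(B, L, D)              # hidden layer output
# Project and split into heads: (B, h, L, d_head)
Q = W_q(H).view(B, L, h, d_head).transpose(1, 2)
K = W_k(H).view(B, L, h, d_head).transpose(1, 2)
V = W_v(H).view(B, L, h, d_head).transpose(1, 2)
```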

The main difference in the proposed self-attention scheme is to apply a Gaussian weighting matrix to the score matrix computed from the key and query matrix multiplication:

S̃ = G ∘ S,  with  S = Q K^T / √d,    (6)

where G is the Gaussian weighting matrix, ∘ denotes element-wise multiplication, S is the score matrix and d is the dimension of each attention head.

G = [g_{i,j}],   g_{i,j} = exp(−(i − j)² / σ²),    (7)

where i is a target frame index, j is a context frame index and σ is a trainable parameter. For example, g_{i,j} corresponds to the weighting factor for context frame j when the target frame index is i. The diagonal terms of G correspond to the weighting factors for the target frames themselves and are always 1. g_{i,j} decays with the distance between target and context frames, which provides stronger fading for distant context frames and smaller attenuation for closer ones. Furthermore, since σ is a trainable parameter, the degree of context localization can be learned from the speech and noise training data.
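A sketch of one way to build the Gaussian weighting matrix of Eq. (7) with a trainable σ; giving each attention head its own σ is an assumption made here for illustration.

```python
import torch
import torch.nn as nn

class GaussianWeight(nn.Module):
    """Builds G[i, j] = exp(-(i - j)^2 / sigma^2) with trainable sigma."""

    def __init__(self, num_heads):
        super().__init__()
        # One trainable width per attention head (this granularity is an assumption).
        self.sigma = nn.Parameter(torch.ones(num_heads))

    def forward(self, seq_len, device=None):
        idx = torch.arange(seq_len, device=device, dtype=torch.float32)
        dist2 = (idx[None, :] - idx[:, None]) ** 2        # (L, L) squared frame distances
        sigma2 = self.sigma.clamp(min=1e-3) ** 2          # keep sigma strictly positive
        # Broadcast to (h, L, L); the diagonal is exp(0) = 1 for every head.
        return torch.exp(-dist2[None] / sigma2[:, None, None])
```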

After the softmax function, the self-attention matrix is multiplied by the value matrix V:

A = softmax(|G ∘ S|) V.    (8)

One thing to note is that the absolute value of the weighted score matrix is applied to the softmax function. The reason is that, unlike in NLP tasks, negative correlation is as important as positive correlation for signal estimation. By taking the absolute value of the Gaussian weighted score matrix, the resulting self-attention weights depend only on the score magnitude, which allows both positive and negative correlations to be utilized equally.
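Putting Eqs. (6)-(8) together, a hedged sketch of the proposed attention step (it accepts either single-head (B, L, d) or multi-head (B, h, L, d) projections); function and variable names are illustrative.

```python
import math
import torch

def gaussian_weighted_attention(Q, K, V, G):
    """Gaussian weighted self-attention (Eqs. 6-8).

    Q, K, V: (..., L, d) projections; G: (L, L) Gaussian weighting matrix.
    """
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # (..., L, L) score matrix S
    weighted = G * scores                             # Eq. (6): element-wise weighting
    attn = torch.softmax(weighted.abs(), dim=-1)      # |.| keeps negative correlations
    return attn @ V                                   # Eq. (8)
```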

2.2 Comparison with the existing positional encoding method

The attention biasing [17] used for acoustic model design is a different positional encoding scheme. It is an additive weighting applied to the score matrix in the self-attention block. Its purpose is similar to Gaussian weighting: to provide locality over the context sequence around the target frame. The main difference is that the encoded terms in attention biasing are always negative and are added to the score matrix. Therefore, attention biasing cannot utilize negative correlations in the score matrix, because a negative correlation only becomes more negative after biasing. In contrast, the Gaussian weighting matrix is a non-negative multiplicative factor, as in Eq. 6. Since G and the score matrix S remain separable, the negative correlations in S can be utilized simply by taking the absolute value of the weighted score. Section 3 shows the significant improvement obtained from using both positive and negative correlations.
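A toy numeric illustration of this difference, with made-up score values: additive biasing drives a negative correlation further below zero before the softmax, while the non-negative Gaussian weight only scales its magnitude, which the absolute value can then expose.

```python
import torch

score = torch.tensor([0.8, -0.9, 0.1])     # toy correlations with three context frames
bias  = torch.tensor([0.0, -0.5, -2.0])    # additive attention biasing terms (<= 0)
gauss = torch.tensor([1.0, 0.6, 0.1])      # multiplicative Gaussian weights (>= 0)

# Attention biasing: the negative score (-0.9) only gets pushed lower (-1.4).
attn_bias = torch.softmax(score + bias, dim=-1)

# Gaussian weighting + absolute value: |-0.9 * 0.6| = 0.54 still attracts attention.
attn_gauss = torch.softmax((gauss * score).abs(), dim=-1)

print(attn_bias, attn_gauss)
```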

2.3 Extension to Complex Transformer Architecture

Figure 3: Block Diagram of Complex Transformer architecture

Figure 3 depicts the complex Transformer architecture for speech enhancement. Compared with the real Transformer architecture in Figure 1, the complex network has two inputs and two outputs, corresponding to the real and imaginary parts, respectively. The network inputs Y_r and Y_i are the real and imaginary parts of the complex noisy spectrum. One trick for the complex Transformer denoiser is to feed the absolute values of the real and imaginary spectra, which showed significantly better SDR and PESQ performance. The network output is a complex mask M = M_r + jM_i that generates the complex denoised output as follows:

X̂_r = M_r ∘ Y_r − M_i ∘ Y_i,    (9)
X̂_i = M_r ∘ Y_i + M_i ∘ Y_r,    (10)

where the subscript r denotes a real part and the subscript i denotes an imaginary part. The right grey block in Figure 3 describes the decoder layer of the complex Transformer network, whose real and imaginary outputs are passed to the next layer. The first multi-head self-attention blocks are applied to the real and imaginary inputs separately. After layer normalization, the second multi-head attention receives mixed input from the real and imaginary paths: for example, the left second multi-head attention takes the right-side layer normalization output as its key and value inputs (the K and V inputs in Figure 2), while its query input comes from the left-side layer normalization. The main idea is to exploit the cross-correlation between real and imaginary parts by mixing them in the attention block. After another layer normalization, a complex fully-connected layer is applied; it has real and imaginary weights, and the standard complex multiplication is performed on the complex input from the second layer normalization.
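A sketch of the complex masking in Eqs. (9)-(10); the tensors are assumed to be the real and imaginary parts of the predicted mask and of the noisy spectrum, and the multiplication is element-wise per time-frequency bin.

```python
import torch

def apply_complex_mask(M_r, M_i, Y_r, Y_i):
    """Complex masking (Eqs. 9-10): X_hat = (M_r + j M_i) * (Y_r + j Y_i)."""
    X_hat_r = M_r * Y_r - M_i * Y_i   # real part of the denoised spectrum
    X_hat_i = M_r * Y_i + M_i * Y_r   # imaginary part of the denoised spectrum
    return X_hat_r, X_hat_i
```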

2.4 End-to-End Metric Optimization

The authors of [11] proposed a multi-task denoising scheme that jointly optimizes SDR and PESQ metrics. Their denoising framework outperformed the existing spectral mask estimation schemes [6, 12, 22] and generative models [13, 15, 16], providing new state-of-the-art performance. In this paper, we apply it to train the real and complex Transformer networks. Figure 2 in [11] illustrates the overall training framework. First, the denoised complex spectrum is transformed into a time-domain acoustic signal via the Griffin-Lim ISTFT [8]. Second, the SDR and PESQ loss functions proposed in [11] are computed on the acoustic signal. The two loss functions are combined as follows:

L = α · L_SDR + (1 − α) · L_PESQ,    (11)

where L_SDR and L_PESQ are the SDR and PESQ loss functions defined in Eq. 20 and Eq. 32 of [11], respectively, and α is a hyper-parameter that adjusts the relative weighting between the two losses, set by grid search on the validation set. All the neural network models evaluated in Section 3 were trained to minimize L.
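A hedged sketch of the combined objective of Eq. (11): the SDR term below is a standard negative-SDR formulation rather than a verbatim copy of Eq. 20 in [11], `pesq_loss` is a placeholder for the differentiable PESQ proxy of [11], and the value of `alpha` is illustrative.

```python
import torch

def sdr_loss(est, ref, eps=1e-8):
    """Negative SDR in dB (a standard formulation, not necessarily Eq. 20 of [11])."""
    num = torch.sum(ref ** 2, dim=-1)
    den = torch.sum((ref - est) ** 2, dim=-1) + eps
    return -10.0 * torch.log10(num / den + eps).mean()

def combined_loss(est, ref, pesq_loss, alpha=0.5):
    """Eq. (11): weighted combination of SDR and PESQ losses.

    `pesq_loss` is assumed to be a callable implementing the differentiable
    PESQ approximation of [11]; `alpha=0.5` is a placeholder value.
    """
    return alpha * sdr_loss(est, ref) + (1.0 - alpha) * pesq_loss(est, ref)
```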

3 Experimental Results

SDR PESQ
Model -10 dB -5 dB 0 dB 5 dB 10 dB 15 dB -10 dB -5 dB 0 dB 5 dB 10 dB 15 dB
Noisy Input -11.82 -7.33 -3.27 0.21 2.55 5.03 1.07 1.08 1.13 1.26 1.44 1.72
CNN-LSTM -2.31 1.80 4.36 6.51 7.79 9.65 1.43 1.65 1.89 2.16 2.35 2.54
TF Origin. -3.25 0.92 3.39 5.35 6.39 8.10 1.29 1.45 1.63 1.87 2.07 2.29
TF A.B. -2.80 1.18 3.67 5.67 6.78 8.18 1.49 1.67 1.85 2.01 2.28 2.50
TF G.W. -1.66 2.35 4.95 7.10 8.40 10.36 1.54 1.76 2.00 2.28 2.51 2.74
TF Complex -1.57 2.51 5.03 7.36 8.58 10.40 1.43 1.64 1.88 2.17 2.40 2.67
Table 1: SDR and PESQ results on QUT-NOISE-TIMIT. The test set consists of 6 SNR conditions: -10, -5, 0, 5, 10 and 15 dB. The best SDR or PESQ score for each SNR condition is highlighted in bold.

3.1 Experimental Settings

Two datasets were used for training and evaluation of the proposed Transformer architectures:

QUT-NOISE-TIMIT [3]: QUT-NOISE-TIMIT is synthesized by mixing 5 different background noise sources with the TIMIT corpus [7]. For the training set, -5 and 5 dB SNR data were used, but the evaluation set contains all SNR ranges. The total lengths of the training and test data are 25 hours and 12 hours, respectively. The detailed data selection is described in Table 1 of [11].

VoiceBank-DEMAND [19]: 30 speakers selected from the Voice Bank corpus [21] were mixed with 10 noise types: 8 from the DEMAND dataset [18] and 2 artificially generated ones. The test set is generated with 5 DEMAND noise types that do not coincide with those used for the training data.

3.2 Main Result

Table 1 shows the SDR and PESQ performance of the Transformer models on the QUT-NOISE-TIMIT corpus. CNN-LSTM is the prior best-performing recurrent model, comprised of convolutional and LSTM layers; its network architecture is described in Section 3 of [11]. TF Origin. denotes the original Transformer encoder, TF A.B. is the Transformer with the attention biasing explained in Section 2.2, TF G.W. is the Transformer with Gaussian weighted self-attention and TF Complex is the complex Transformer model. The real Transformers consist of 10 encoder layers and the complex Transformer has 6 decoder layers. The encoder and decoder layers are those shown in Figures 1 and 3, with 1024 input and output dimensions.

First, TF Origin. showed a large performance degradation compared with CNN-LSTM over all SNR ranges. Second, the Transformer with attention biasing substantially improved SDR and PESQ performance over TF Origin., which suggests that positional encoding is an important factor for improving Transformer performance on this denoising problem. However, the Transformer with attention biasing still suffered a large loss compared with the recurrent model, CNN-LSTM. Finally, with the proposed Gaussian weighting, the Transformer model significantly outperformed all the previous networks, including CNN-LSTM. In particular, the large performance gap between attention biasing and Gaussian weighting suggests that using negative correlations is as important as using positive ones.

The complex Transformer showed a 0.1 to 0.2 dB SDR improvement over the real Transformer across all SNR ranges. However, its PESQ performance degraded relative to the real Transformer. The reason could be overfitting due to the larger parameter count, or the difficulty of predicting the phase spectrum. Making the complex network deliver consistent performance on both SDR and PESQ metrics is left as future work.

3.3 Comparison with Generative Models

Models CSIG CBAK COVL PESQ SSNR (dB) SDR (dB)
Noisy Input 3.37 2.49 2.66 1.99 2.17 8.68
SEGAN 3.48 2.94 2.80 2.16 7.73 -
WAVENET 3.62 3.23 2.98 - - -
TF-GAN 3.80 3.12 3.14 2.53 - -
CNN-LSTM 4.09 3.54 3.55 3.01 10.44 19.14
TF G.W. 4.18 3.59 3.62 3.06 10.78 19.57
Table 2: Evaluation on VoiceBank-DEMAND corpus

Table 2 shows a comparison with other generative models. All results except CNN-LSTM and TF G.W. come from the original papers: SEGAN [13], WAVENET [15] and TF-GAN [16]. CSIG, CBAK and COVL are objective measures for which a higher value means better speech quality [10]: CSIG is the mean opinion score (MOS) of signal distortion, CBAK is the MOS of background noise intrusiveness and COVL is the MOS of the overall effect. SSNR is the segmental SNR defined in [14].

The proposed Transformer model outperformed all the generative models on all the perceptual speech metrics listed in Table 2 by a large margin. The main improvement came from the joint SDR and PESQ optimization scheme of [11], which benefited both CNN-LSTM and TF G.W. However, TF G.W. showed consistently better performance than CNN-LSTM on all metrics, which agrees with the results in Table 1.

4 Conclusion

In this paper, Gaussian weighted self-attention was proposed. The proposed self-attention scheme attenuates attention weights according to the distance between the correlated symbols and utilizes positive and negative correlations equally. The evaluation results showed that the proposed self-attention scheme significantly improved SDR and PESQ performance over the recurrent network as well as the Transformer model with attention biasing.

References

  • [1] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §1.
  • [2] I. Cohen and B. Berdugo (2001) Speech enhancement for non-stationary noise environments. Signal processing 81 (11), pp. 2403–2418. Cited by: §1.
  • [3] D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason (2010) The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010. Cited by: §3.1.
  • [4] Y. Ephraim and D. Malah (1984) Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Transactions on acoustics, speech, and signal processing 32 (6), pp. 1109–1121. Cited by: §1.
  • [5] Y. Ephraim and D. Malah (1985) Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE transactions on acoustics, speech, and signal processing 33 (2), pp. 443–445. Cited by: §1.
  • [6] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux (2015) Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 708–712. Cited by: §1, §2.4.
  • [7] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett (1993) DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N 93. Cited by: §3.1.
  • [8] D. Griffin and J. Lim (1984) Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (2), pp. 236–243. Cited by: §2.4.
  • [9] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
  • [10] Y. Hu and P. C. Loizou (2008) Evaluation of objective quality measures for speech enhancement. IEEE Transactions on audio, speech, and language processing 16 (1), pp. 229–238. Cited by: §3.3.
  • [11] J. Kim, M. El-Kharmy, and J. Lee (2019) End-to-end multi-task denoising for joint sdr and pesq optimization. arXiv preprint arXiv:1901.09146. Cited by: §2.4, §3.1, §3.2, §3.3.
  • [12] A. Narayanan and D. Wang (2013) Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 7092–7096. Cited by: §1, §2.4.
  • [13] S. Pascual, A. Bonafonte, and J. Serra (2017) SEGAN: speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452. Cited by: §1, §2.4, §3.3.
  • [14] S. R. Quackenbush (1986) Objective measures of speech quality. Cited by: §3.3.
  • [15] D. Rethage, J. Pons, and X. Serra (2018) A wavenet for speech denoising. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5069–5073. Cited by: §1, §2.4, §3.3.
  • [16] M. H. Soni, N. Shah, and H. A. Patil (2018) Time-frequency masking-based speech enhancement using generative adversarial network. Cited by: §1, §2.4, §3.3.
  • [17] M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel (2018) Self-attentional acoustic models. arXiv preprint arXiv:1803.09519. Cited by: §2.2.
  • [18] J. Thiemann, N. Ito, and E. Vincent (2013) The diverse environments multi-channel acoustic noise database: a database of multichannel environmental noise recordings. The Journal of the Acoustical Society of America 133 (5), pp. 3591–3591. Cited by: §3.1.
  • [19] C. Valentini, X. Wang, S. Takaki, and J. Yamagishi (2016) Investigating rnn-based speech enhancement methods for noise-robust text-to-speech. In 9th ISCA Speech Synthesis Workshop, pp. 146–152. Cited by: §3.1.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
  • [21] C. Veaux, J. Yamagishi, and S. King (2013) The voice bank corpus: design, collection and data analysis of a large regional accent speech database. In Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013 International Conference, pp. 1–4. Cited by: §3.1.
  • [22] Y. Wang, A. Narayanan, and D. Wang (2014) On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22 (12), pp. 1849–1858. Cited by: §1, §2.4.