
Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation

Several speech processing systems have demonstrated considerable performance improvements when deep complex neural networks (DCNN) are coupled with self-attention (SA) networks. However, the majority of DCNN-based studies on speech dereverberation that employ self-attention do not explicitly account for the inter-dependencies between real and imaginary features when computing attention. In this study, we propose a complex-valued time-frequency attention (TFA) module that models spectral and temporal dependencies by computing two-dimensional attention maps across the time and frequency dimensions. We validate the effectiveness of our proposed complex-valued TFA module with the deep complex convolutional recurrent network (DCCRN) using the REVERB challenge corpus. Experimental findings indicate that integrating our complex-TFA module with DCCRN improves overall speech quality and the performance of back-end speech applications, such as automatic speech recognition, compared to earlier self-attention approaches.


1 Introduction

Speech dereverberation is the task of suppressing the distortions introduced by surrounding reflections when speech is captured by a distant microphone. Statistical speech enhancement techniques have played a crucial role as front-end processors in several speech processing pipelines [dereverb1, dereverb2, dereverb3, GTSAD]. The need for robust speech systems that support the recognition of naturalistic speech recorded by distant microphones has become more pressing as human-machine interaction technologies gain traction [challenge1, challenge2, fs01, fs03]. Additionally, advances in technologies such as hearing aids require speech systems to enhance the perceptual quality of speech captured in adverse environmental conditions, thereby improving human hearing abilities. Several deep learning (DL)-based speech enhancement systems have been successfully developed to jointly improve perceptual quality and the performance of back-end speech and language applications using fully convolutional neural networks (FCN) and recurrent neural networks (RNN) [monauralTCN, monauralDNN, monauralWRN, monauralGCRN]. The majority of these approaches operate on the complex short-term Fourier transform (STFT) of distorted speech, either to enhance the log-power spectrum (LPS) and reuse the unaltered distorted phase [Mag_1, Mag_2, Ernst, skipconvnet, skipconvgan], or to estimate a complex ratio mask (cRM) [CRM_1, CRM_2, crm3] and directly enhance the complex spectrogram to restore a cleaner time-domain signal. Enhancing the magnitude response or LPS enables back-end speech applications to operate more efficiently, since most back-end speech applications are trained on LPS-derived speech features. Alternatively, speech applications aimed at enhancing the perceived quality and intelligibility of speech make extensive use of complex spectrograms to recover the magnitude and phase of the distorted signal using DL approaches.

As deep neural networks (DNN) advance to be compatible with complex representations, researchers have investigated many speech enhancement strategies that estimate the cRM using deep complex neural networks (DCNN). To address reverberation, which distorts the signal in both time and frequency, many sequence-to-sequence learning strategies such as recurrent neural networks (RNN) and long short-term memory (LSTM) networks [LSTM, monauralWRN] have also been explored. In addition to FCNs, these methods capture and leverage temporal correlations for speech dereverberation. In recent years, self-attention (SA) has become a widely utilized mechanism for sequence-to-sequence learning tasks [attn1, attn2, selfattn_conv, attn_speech1]. SA is a mechanism for selective context aggregation that generates an output sequence by computing a weighted average of the input sequence. The learned weights represent the level of attention the network pays to subsets of the input sequence while generating the output sequence. For the speech dereverberation task, SA allows the network to attend to time-frequency (T-F) locations and thereby reduce the smearing effects of reverberation. However, conventional SA approaches used in DCNN-based networks do not account for the inter-dependencies between the real and imaginary components of complex-valued features.

The purpose of this study is to develop a complex-valued time-frequency (T-F) self-attention mechanism that computes attention jointly over the real and imaginary components in order to accurately model temporal dependencies with deep neural networks. To demonstrate the effectiveness of our proposed complex-valued SA mechanism, we integrate two baseline SA approaches with the DCCRN: (i) the conventional self-attention mechanism [attn1], and (ii) the sample-independent dual attention block (SDAB) [SDAB] operating on channel-wise concatenated real and imaginary components. The REVERB challenge corpus is used to examine the improvements in overall speech quality and back-end speech application performance achieved by integrating our proposed self-attention into the fully convolutional and recurrent network, DCCRN.

2 Problem Formulation

For a given acoustic environment, a speech signal received by an omni-directional microphone can be modeled as:

y[n] = Σ_{l=0}^{N−1} h[l] x[n−l] + d[n] = h[n] ∗ x[n] + d[n]        (1)

where y[n] is the signal observed by the distant microphone, x[n] is the clean speech signal from the source, d[n] is additive background noise, h[n] is the room impulse response (RIR) from the source to the microphone, ∗ denotes linear convolution, and N represents the number of samples in the RIR. The same relation in the frequency domain can be represented as Y(t,f) = X(t,f)H(t,f) + D(t,f), where Y, X, H, and D represent the STFTs of the observed noisy reverberated speech, the clean speech from the source, the RIR, and the background noise, respectively. The goal of speech dereverberation is to estimate the clean complex spectrogram from Y(t,f). For this, we estimate a cRM using a DCNN to jointly estimate the real and imaginary components of the enhanced speech, as shown in Eq-(2),

Ŝ_r = M_r ⊙ Y_r − M_i ⊙ Y_i;   Ŝ_i = M_r ⊙ Y_i + M_i ⊙ Y_r        (2)

where M = M_r + jM_i is the cRM estimated by the DCNN, ⊙ denotes element-wise multiplication, and M_r, M_i, Y_r, Y_i, Ŝ_r, Ŝ_i are the real and imaginary components of the estimated cRM and of the complex spectrograms corresponding to the reverberant and enhanced speech, respectively.
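To make Eq-(2) concrete, the following is a minimal PyTorch sketch of how an estimated cRM is applied to a reverberant complex spectrogram; the function and tensor names are illustrative assumptions, not the authors' implementation.

```python
import torch

def apply_crm(spec_r, spec_i, mask_r, mask_i):
    """Apply a complex ratio mask (cRM) to a complex spectrogram.

    Implements Eq-(2): complex multiplication of the estimated mask
    M = M_r + jM_i with the reverberant spectrogram Y = Y_r + jY_i.
    All tensors share the same (batch, freq, time) shape.
    """
    enh_r = mask_r * spec_r - mask_i * spec_i  # real part of M * Y
    enh_i = mask_r * spec_i + mask_i * spec_r  # imaginary part of M * Y
    return enh_r, enh_i

# Toy usage: a 4-utterance batch of 257-bin, 100-frame spectrograms.
Y_r, Y_i = torch.randn(4, 257, 100), torch.randn(4, 257, 100)
M_r = torch.tanh(torch.randn(4, 257, 100))  # bounded mask components
M_i = torch.tanh(torch.randn(4, 257, 100))
S_r, S_i = apply_crm(Y_r, Y_i, M_r, M_i)
```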

3 Complex-Valued Deep Network

In this section, we describe the DCCRN model used as the base architecture to evaluate the self-attention mechanisms. Unlike a real-valued network, a complex network allows us to associate each intermediate feature map with its real and imaginary components. Complex convolutions within such a network are realized as real-valued convolutions applied to the real and imaginary components of the feature maps and weights.

Consider X = X_r + jX_i, a feature map with 2C_in channels (the first C_in channels holding the real component and the remaining C_in channels the imaginary component), provided as input to a complex 2-D convolution layer that produces an output feature map with 2C_out channels. In order to perform a complex equivalent of the conventional convolution, we assume the kernel to be a complex-valued weight matrix W = W_r + jW_i, with its real and imaginary counterparts represented by two real-valued matrices W_r and W_i. The output of the complex convolution is computed as shown in Eq-(3):

F_out = (X_r ∗ W_r − X_i ∗ W_i) + j (X_r ∗ W_i + X_i ∗ W_r)        (3)
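A minimal PyTorch sketch of the complex convolution in Eq-(3) is shown below; it assumes real and imaginary parts are stacked along the channel axis, as described above, and is not the authors' released code.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex 2-D convolution via four real-valued convolutions (Eq-(3)).

    Input/output tensors stack real and imaginary parts along the channel
    axis: the first half of the channels is the real component, the second
    half the imaginary component. A sketch, not a reference implementation.
    """
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)

    def forward(self, x):
        x_r, x_i = torch.chunk(x, 2, dim=1)        # split real/imag channels
        y_r = self.conv_r(x_r) - self.conv_i(x_i)  # Re(W * X)
        y_i = self.conv_i(x_r) + self.conv_r(x_i)  # Im(W * X)
        return torch.cat([y_r, y_i], dim=1)

# Toy usage: 2*C_in = 8 channels in, 2*C_out = 16 channels out.
layer = ComplexConv2d(in_ch=4, out_ch=8, kernel_size=3, padding=1)
out = layer(torch.randn(2, 8, 64, 50))             # -> (2, 16, 64, 50)
```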

Similarly, complex-valued versions of activation functions (ReLU) and normalization (BatchNorm) are used to design our DCCRN model. For further details on complex modules, we refer readers to [cmplxpaper, pandey]. A detailed block diagram of the DCCRN model used in this study is shown in Fig. 1, where the annotations denote the input/output channels of each encoder/decoder block and the kernel size, stride, and padding parameters of the convolution layers. We use a 6-layer U-Net [UNet], which is an encoder-decoder network with two layers of gated recurrent units (GRU). The encoder extracts spectral and temporal features from an input complex spectrogram, and the decoder constructs an enhanced complex spectrogram from the encoded features. The real-valued convolutions within each encoder and decoder layer of a conventional U-Net are replaced with their complex counterparts. Each encoder and decoder block consists of a complex-valued ReLU activation, a complex-valued convolutional layer, a self-attention module, a complex-valued dense block, and a complex-valued normalization, see Fig. 1. In this study, we replace the self-attention module within the encoder and decoder with alternate existing versions and the proposed method; the remaining pipeline is left unaltered for a fair comparison.
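For orientation, a hypothetical encoder block following the ordering described above might be assembled as follows. Every sub-module here is a simplified real-valued stand-in operating on channel-stacked real/imaginary feature maps, and the complex dense block is omitted for brevity, so this is a structural sketch only.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Hypothetical DCCRN encoder block following the ordering in the text:
    ReLU -> convolution -> self-attention -> normalization. The sub-modules
    are simplified stand-ins for their complex-valued counterparts; the
    complex dense block is elided to keep the sketch short.
    """
    def __init__(self, in_ch, out_ch, attention):
        super().__init__()
        self.act = nn.ReLU()                    # stand-in for complex ReLU
        self.conv = nn.Conv2d(2 * in_ch, 2 * out_ch, 3,
                              stride=(2, 1), padding=1)
        self.attn = attention                   # pluggable SA module
        self.norm = nn.BatchNorm2d(2 * out_ch)  # stand-in for complex BatchNorm

    def forward(self, x):
        x = self.conv(self.act(x))
        x = self.attn(x)
        return self.norm(x)

# The attention argument is deliberately pluggable, mirroring how this study
# swaps SA mechanisms while leaving the rest of the pipeline unaltered.
block = EncoderBlock(4, 8, attention=nn.Identity())
out = block(torch.randn(2, 8, 64, 50))          # freq axis downsampled to 32
```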

4 Attention Mechanisms

In this section, we describe two alternative self-attention approaches and propose a fully complex self-attention mechanism. Similar to [SDAB], we use these attention mechanisms to estimate attention maps over time and frequency in parallel to address the smearing effects of reverberation.

4.1 Sample-Independent Dual Attention Block (SDAB)

In early speech enhancement studies, lost harmonics along the frequency axis were regenerated using uniform non-linear functions; likewise, non-linear recursive relations along the time axis were used to estimate the signal-to-noise ratio (SNR). Motivated by this, the sample-independent dual attention block (SDAB) was recently proposed in [SDAB], see Fig. 2. Unlike conventional self-attention mechanisms, SDAB estimates attention using fully-connected (FC) layers. Intermediate feature maps are reshaped into stacks of 1-D vectors along the time and frequency axes to form time-wise and frequency-wise matrices. Fully-connected layers are then used to learn weights for each vector based on its correlations with the other vectors along a given dimension. These weights are analogous to the weights learned by a conventional self-attention mechanism, with the exception that they might not sum to one. In [SDAB], real-valued convolutions are employed on complex feature maps stacked as channels. In this study, however, we use complex convolutions and compute the SDAB attention on the real and imaginary parts of a complex feature map in parallel, as sketched after Fig. 2.

Figure 2: Sample-Independent Dual Attention (SDAB) computed independently for real and imaginary components over time and frequency.
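A minimal sketch of one SDAB branch is given below. The fixed feature-map sizes and module names are illustrative assumptions; the key property, as described above, is that the learned fully-connected weights are input-independent.

```python
import torch
import torch.nn as nn

class SDABranch(nn.Module):
    """One SDAB branch: a fully-connected layer mixing 1-D vectors along a
    chosen axis (time or frequency). A sketch of the idea in [SDAB]; the
    learned weights are sample-independent and need not sum to one.
    """
    def __init__(self, axis_len):
        super().__init__()
        self.fc = nn.Linear(axis_len, axis_len)  # input-independent mixing weights

    def forward(self, x, dim):
        # Move the attended axis last, mix it with the FC layer, move it back.
        x = x.transpose(dim, -1)
        x = self.fc(x)
        return x.transpose(dim, -1)

# Parallel time and frequency branches on the real part of a feature map;
# the imaginary part is processed identically in a second pass.
x_real = torch.randn(2, 4, 64, 50)               # (batch, C, freq, time)
time_branch = SDABranch(axis_len=50)
freq_branch = SDABranch(axis_len=64)
y = time_branch(x_real, dim=3) + freq_branch(x_real, dim=2)
```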


Figure 3: Attention computed over time and frequency for complex domain: (a) conventional SA computed independently for real and imaginary components, (b) proposed complex-valued SA computed using real and imaginary components

4.2 Conventional Self-Attention (SA)

A conventional self-attention method used for speech applications would ideally map each T-F bin by estimating the contributions of every other T-F bin in the spectrogram. However, in this study, similar to the previous strategy (SDAB), we use a conventional self-attention mechanism to learn the contributions of 1-D vectors along the time and frequency axes, see Fig. 3-a. Self-attention is a three-step process performed on its three major components: the query (Q), key (K), and value (V) matrices, which are linear projections of the input sequence, see Eq-(4). First, correlations between the query and key are computed using an outer product. These correlations are then converted into contributions using a "SoftMax" function, which results in an attention map. Each row of this attention map represents the contributions of all rows of the key matrix towards a particular row of the query matrix. Finally, the attention map is used to linearly combine the rows of the value matrix to obtain each row of the output.

Attention(Q, K, V) = SoftMax(Q K^T / √d) V        (4)

where d is the dimensionality of the rows of Q and K.

We use real-valued convolutions for the linear projections of the intermediate feature maps. The real and imaginary parts of these linear projections are then reshaped to construct (T × C·F)- and (F × C·T)-dimensional query, key, and value matrices, which are used to compute attention over the time and frequency axes, respectively. Here, C represents the number of real (equivalently, imaginary) channels of an intermediate feature map, and F and T are its frequency and time dimensions. Fig. 3-a illustrates self-attention over the time axis for only the imaginary component; similar computations are carried out in the other branches.
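The computation can be sketched as follows for the time axis and a single (real or imaginary) component. To keep the example short, the learned Q/K/V projections are replaced with identity mappings, so this illustrates Eq-(4) rather than reproducing the exact model.

```python
import torch
import torch.nn.functional as F

def time_axis_attention(x):
    """Conventional scaled dot-product self-attention over the time axis
    for one component (real or imaginary) of a feature map, per Eq-(4).

    x: (batch, C, freq, time). Each time step is treated as a (C*freq)-dim
    vector; identity projections stand in for the learned Q/K/V convolutions.
    """
    b, c, f, t = x.shape
    seq = x.reshape(b, c * f, t).transpose(1, 2)    # (batch, T, C*F) rows
    q, k, v = seq, seq, seq                         # learned projections omitted
    scores = q @ k.transpose(1, 2) / (c * f) ** 0.5 # (batch, T, T) correlations
    attn = F.softmax(scores, dim=-1)                # each row sums to one
    out = attn @ v                                  # weighted average of rows
    return out.transpose(1, 2).reshape(b, c, f, t)

out = time_axis_attention(torch.randn(2, 4, 64, 50))
```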

4.3 Proposed Complex-Valued Time-Frequency SA

A self-attention mechanism computed separately for the real and imaginary components does not fully account for the inter-dependencies between these components in the complex domain. Capturing these inter-dependencies is particularly important for speech applications that need the real and imaginary components to be enhanced jointly in order to reverse distortions present in the phase. Thus, modifying these components individually might not be an efficient solution for systems aiming to improve speech quality. Therefore, we propose a fully complex self-attention mechanism that leverages both real and imaginary components to estimate the attention over a given dimension, see Fig. 3-b. As in the conventional self-attention mechanism, linear projections of the intermediate feature maps are performed, here using complex convolutions. These projections are then reshaped to form complex query, key, and value matrices. The complex correlations between query and key are computed using Hermitian transposition and complex multiplication, Eq-(5). A "SoftMax" operation over the magnitudes of the correlation scores then yields the attention map. Finally, this attention map is applied to the complex value matrix to obtain each row of the output.

Attention(Q, K, V) = SoftMax(|Q K^H| / √d) V        (5)

where (·)^H denotes the Hermitian (conjugate) transpose and |·| the element-wise magnitude.

The proposed fully complex self-attention has the same number of trainable parameters as a conventional self-attention mechanism applied to the real and imaginary components individually. We strongly believe that a fully complex self-attention mechanism, which accounts for the cross-relations between the real and imaginary components, should help improve the performance of a DCNN for speech applications.
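A sketch of the proposed-style complex attention over one axis follows. The shapes and names are illustrative assumptions; only the Hermitian correlation and the SoftMax over magnitudes are taken from the description of Eq-(5).

```python
import torch
import torch.nn.functional as F

def complex_time_attention(q, k, v):
    """Complex self-attention over the time axis in the spirit of Eq-(5).

    q, k, v: complex tensors of shape (batch, time, dim), e.g. obtained by
    reshaping complex Q/K/V projections. Correlations use the Hermitian
    product Q K^H; SoftMax is taken over the magnitudes, so a single
    real-valued attention map jointly weights real and imaginary parts.
    """
    d = q.shape[-1]
    corr = q @ k.conj().transpose(1, 2)              # complex (batch, T, T)
    attn = F.softmax(corr.abs() / d ** 0.5, dim=-1)  # real-valued attention map
    return attn.to(v.dtype) @ v                      # applied to complex values

Q = torch.randn(2, 50, 256, dtype=torch.cfloat)
K, V = torch.randn_like(Q), torch.randn_like(Q)
out = complex_time_attention(Q, K, V)
```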

                  |       CD (↓)        |       LLR (↓)       |     FWSegSNR (↑)     |       PESQ (↑)      |       SRMR (↑)      | SRMR (↑)
Room              |   #1     #2     #3  |   #1     #2     #3  |   #1     #2     #3   |   #1     #2     #3  |   #1     #2     #3  |  (Real)
Far Microphone
No Processing     | 2.672  5.207  4.962 | 0.518  0.701  0.941 |  9.781  6.854  6.035 | 2.621  2.028  1.909 | 4.586  2.973  2.731 |  3.175
WPE               | 2.456  5.163  4.900 | 0.466  0.678  0.918 | 10.150  7.103  6.211 | 2.720  2.082  1.951 | 4.840  3.204  2.885 |  3.431
CplxUNet          | 3.632  4.113  3.783 | 0.510  0.667  0.572 |  6.872  6.287  7.346 | 2.567  2.328  2.258 | 5.277  4.956  4.320 |  5.343
SDAB              | 2.316  3.966  3.754 | 0.318  0.632  0.667 | 10.725  7.594  7.780 | 2.900  2.394  2.325 | 5.117  4.650  4.163 |  4.629
Conventional SA   | 2.177  3.584  3.340 | 0.244  0.517  0.509 | 10.771  6.423  7.297 | 2.920  2.409  2.252 | 5.370  5.206  4.351 |  5.648
Proposed          | 2.134  3.548  3.287 | 0.233  0.513  0.496 |  9.362  6.383  7.777 | 2.996  2.451  2.328 | 6.050  5.335  4.429 |  5.785
Near Microphone
No Processing     | 1.992  4.634  4.384 | 0.467  0.452  0.742 | 10.440  8.712  7.418 | 3.142  2.419  2.303 | 4.498  3.746  3.572 |  3.192
WPE               | 1.854  4.577  4.302 | 0.444  0.423  0.708 | 10.694  8.967  7.623 | 3.289  2.483  2.356 | 4.626  3.991  3.855 |  3.507
CplxUNet          | 3.305  3.575  3.562 | 0.427  0.474  0.465 |  8.890  8.403  7.835 | 2.825  2.676  2.591 | 5.084  5.040  4.831 |  5.494
SDAB              | 1.793  3.575  3.326 | 0.230  0.445  0.507 | 13.359 10.936  9.614 | 3.377  2.750  2.701 | 4.955  4.828  4.732 |  4.794
Conventional SA   | 1.933  2.878  2.817 | 0.201  0.345  0.386 | 10.518  9.333  8.899 | 3.418  2.942  2.702 | 5.385  5.387  4.662 |  5.838
Proposed          | 1.904  2.783  2.762 | 0.200  0.329  0.380 | 10.113  9.561  9.387 | 3.479  3.003  2.719 | 5.506  5.624  4.738 |  6.012
Table 1: Improvements in Speech Quality Measures on the Eval data of the REVERB Challenge. All metrics except the final column are computed on simulated data (Rooms #1-#3); the final SRMR column is computed on real recordings. ↓: lower is better; ↑: higher is better.

5 Experiments

All networks studied and compared in this work are evaluated on the REVERB Challenge corpus [challenge1, challenge2], a collection of simulated and real recordings of speech sampled at 16 kHz. The simulated data is generated using clean speech from WSJCAM0 and room impulse responses (RIRs) collected from three different-sized rooms (small, medium, and large) and two microphone placements (near, far) for single-microphone, 2-channel, and 8-channel microphone arrays. For further details on the corpus, see [challenge1, challenge2]. Models in this study are trained on 7,861 simulated reverberant and clean utterance pairs, corresponding to approximately 15 hours of speech. We use the evaluation set of the corpus (approximately 5 hours) to compare the performance of the DCCRN with various attention mechanisms against a widely used statistical dereverberation algorithm, weighted prediction error (WPE). We use the metrics provided for the REVERB Challenge to evaluate improvements in speech quality; we refer readers to [challenge1, challenge2, KaldiASR] for details on these quality metrics. We also evaluate improvements in the performance of back-end systems such as automatic speech recognition (ASR) and speaker verification (SV) by monitoring word error rates (WER) and equal error rates (EER).

5.1 Experimental Setup

Figure 1 summarizes the network setup and the details of each building block in our system development. The complex kernels are initialized with unitary matrices for better generalization. For a given speech utterance, we generate complex spectral images by first computing the STFT with a frame length of 32 ms and 75% overlap. Next, the lower half of the complex STFT is divided into batches of consecutive frames to form fixed-size complex spectral images. Similar to our previous work [skipconvnet], we perform optimal smoothing of the power spectral density (PSD) of the reverberant and clean complex spectral image pairs fed to the network. The output of the DCCRN is a predicted complex mask, which is applied to the input (as formulated in Eq-(2)) to obtain an enhanced complex spectrogram. In this study, we use a combination of loss functions aimed at reducing estimation errors in both the magnitude and complex domains. We also apply dynamic compression to the magnitudes of the clean and estimated signals before computing the loss:

L = λ Σ_{t,f} | |S|^c e^{jφ_S} − |Ŝ|^c e^{jφ_Ŝ} |² + (1 − λ) Σ_{t,f} ( |S|^c − |Ŝ|^c )²        (6)

where c is the compression exponent applied to the magnitudes of the clean and estimated complex spectrograms (whose phases φ_S and φ_Ŝ are left uncompressed), and λ is the weight factor that combines the magnitude-only loss with the complex loss. Similar to [cmplxloss], we set both c and λ to 0.3. All investigated networks in the study are trained using the Adam optimizer for 20 epochs with a batch size of 4. We report improvements on the performance metrics mentioned in earlier sections for the Eval set of the corpus.
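A PyTorch sketch of the compressed loss of Eq-(6) follows. The relative weighting of the two terms is a reconstruction from the description above and [cmplxloss], so treat it as an assumption rather than the paper's exact training code.

```python
import torch

def compressed_spectral_loss(S, S_hat, c=0.3, lam=0.3):
    """Magnitude-compressed complex + magnitude loss in the spirit of Eq-(6).

    S, S_hat: clean and estimated complex spectrograms (torch.cfloat).
    c compresses only the magnitudes (phases are kept); lam blends the
    complex-domain and magnitude-only terms. c = lam = 0.3 follows the text;
    the exact placement of lam is reconstructed, not verbatim from the paper.
    """
    mag, mag_hat = S.abs(), S_hat.abs()
    eps = 1e-8                                     # guard against 0**c gradients
    Sc = (mag + eps) ** c * torch.exp(1j * S.angle())
    Sc_hat = (mag_hat + eps) ** c * torch.exp(1j * S_hat.angle())
    loss_cplx = (Sc - Sc_hat).abs().pow(2).mean()  # complex-domain term
    loss_mag = ((mag + eps) ** c - (mag_hat + eps) ** c).pow(2).mean()
    return lam * loss_cplx + (1 - lam) * loss_mag

S = torch.randn(4, 257, 100, dtype=torch.cfloat)
loss = compressed_spectral_loss(S, 0.9 * S)
```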

In order to evaluate the performance improvements for speaker verification (SV), we train a conventional x-vector extractor [xvectors] on the VoxCeleb 1+2 datasets [VoxCeleb1, VoxCeleb2]. A probabilistic linear discriminant analysis (PLDA) model is then trained on the clean Train set of the REVERB corpus. Similarly, to evaluate the performance improvements for an ASR system, we train a hybrid DNN-HMM system [KaldiASR] on clean speech from the Train set of the REVERB corpus. We then use the original and enhanced speech utterances from the Eval set for evaluations.

                           |    Simulated    |      Real
                           |  WER%    EER%   |  WER%    EER%
REVERB (No Processing)     |  38.00   8.21   |  96.04   6.14
WPE                        |  32.65   8.50   |  94.26   6.14
CplxUNet                   |  19.15   7.07   |  66.24   9.94
CplxUNet + SDAB            |  13.29   5.06   |  67.93   6.43
CplxUNet + SA              |  11.66   3.15   |  42.51   5.85
CplxUNet + Proposed FCSA   |  10.42   3.05   |  38.42   5.56
Table 2: Performance of Back-End Speech Systems on the Eval data (lower is better for both WER and EER).

5.2 Experimental Results & Discussion

We compare the speech quality scores for reverberant speech and the complex networks with various self-attention mechanisms in Table-(1). Each metric in the table is marked with either a '↓' or '↑' to indicate the direction of improvement. We see that, irrespective of the type of self-attention mechanism, introducing attention over time and frequency improves the overall quality of speech. This is also confirmed by the performance improvements achieved for the back-end systems. Although the SDAB mechanism outperformed all other self-attention mechanisms in improving FWSegSNR, it did not show similar trends on any other quality metric. A reasonable conclusion is that SDAB chiefly improves system performance for speech distorted by additive noise. Similarly, the self-attention mechanism applied in parallel to the real and imaginary components could not provide improvements comparable to the proposed fully complex self-attention. The proposed attention achieves {1.93, 7.59, 18.09}% and {4.28, 27.13, 10.54}% relative improvements, averaged over all speech quality metrics, for simulated and real speech recordings respectively, compared to the DCCRN with the other attention mechanisms. Likewise, Table-(2) shows the performance of back-end speech systems on speech signals enhanced by the various discussed strategies. We see {10.63, 21.59, 45.5}% relative improvements in WER on simulated speech and {4.09, 29.51, 27.82} percentage-point absolute improvements on real speech compared to the other networks. Although the corpus was designed to evaluate systems for speech quality and ASR, we use the same Eval set of the corpus, despite its limited speaker variability, to evaluate the speaker verification system for a fair comparison. Similar to the ASR performance, the proposed attention mechanism outperforms the others with {33.24, 20.84}% relative improvements in EER.
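As a quick sanity check on how these numbers are derived, the relative and percentage-point absolute WER improvements quoted above can be recomputed from Table 2 (an illustrative calculation, not part of the original paper; small differences come from rounding):

```python
# WER improvements of the proposed FCSA over the other networks (Table 2).
proposed_sim, proposed_real = 10.42, 38.42
baselines_sim = {"SA": 11.66, "SDAB": 13.29, "CplxUNet": 19.15}
baselines_real = {"SA": 42.51, "SDAB": 67.93, "CplxUNet": 66.24}

for name, wer in baselines_sim.items():
    rel = 100 * (wer - proposed_sim) / wer
    print(f"simulated vs {name}: {rel:.2f}% relative")  # ~10.6, 21.6, 45.6

for name, wer in baselines_real.items():
    diff = wer - proposed_real
    print(f"real vs {name}: {diff:.2f} pp absolute")    # 4.09, 29.51, 27.82
```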

6 Conclusion

In this study, we proposed a fully complex self-attention mechanism for a DCNN, which improved the network's ability to map complex reverberant spectrograms to their anechoic counterparts. We compared the proposed method's performance against two distinct self-attention approaches employed in speech systems. In comparison to these alternative SA techniques and the widely utilized WPE dereverberation algorithm, a DCNN coupled with our proposed SA improved speech quality for both real and simulated speech. Additionally, we demonstrated how the proposed attention mechanism benefits back-end speech systems such as ASR and speaker verification. The gains in speech quality and in the performance of back-end speech applications demonstrate that better attention estimates can be computed by our proposed SA, which accounts for the inter-dependencies between the real and imaginary components of features in the complex domain.

References