TPARN: Triple-path Attentive Recurrent Network for Time-domain Multichannel Speech Enhancement

by Ashutosh Pandey, et al.

In this work, we propose a new model called triple-path attentive recurrent network (TPARN) for multichannel speech enhancement in the time domain. TPARN extends a single-channel dual-path network to a multichannel network by adding a third path along the spatial dimension. First, TPARN processes speech signals from all channels independently using a dual-path attentive recurrent network (ARN), which is a recurrent neural network (RNN) augmented with self-attention. Next, an ARN is introduced along the spatial dimension for spatial context aggregation. TPARN is designed as a multiple-input and multiple-output architecture to enhance all input channels simultaneously. Experimental results demonstrate the superiority of TPARN over existing state-of-the-art approaches.






1 Introduction

Speech enhancement is concerned with improving the intelligibility and quality of a speech signal degraded by noise and reverberation. The most basic approach to speech enhancement is monaural processing, where recordings from a single microphone are utilized [loizou2013speech]. Single-channel methods can obtain good enhancement, but they can exploit only time-frequency (T-F) information. Multichannel speech enhancement aims at utilizing both T-F and spatial information by using recordings from multiple microphones [benesty2008microphone, gannot2017consolidated].

Supervised speech enhancement using deep neural networks (DNNs) represents the mainstream methodology for speech enhancement [wang2017supervised]. For multichannel processing, a popular approach is to combine DNNs with traditional spatial filters, such as an MVDR beamformer [erdogan2016improved, heymann2016neural]. A DNN is first used to estimate second-order statistics of speech and noise, which are then used for computing beamformer weights. Another approach is to train a DNN with spatial features, such as inter-channel time, phase, or level differences [wang2018combining, wang2018multi]. A more recent trend is to use end-to-end supervised learning, where spatial information becomes an implicit part of supervised learning [tolooshams2020channel, wang2020multi]. Wang et al. [wang2020multi] proposed a dense convolutional recurrent network (DCRN) for multi-microphone complex spectral mapping, where real and imaginary components of the clean spectrum are directly predicted from the multichannel noisy spectrum. In [tolooshams2020channel], the authors drew inspiration from complex beamforming to propose a novel channel-attention mechanism inside a dense UNet.

Moreover, time-domain speech enhancement using DNNs has also gained considerable attention in recent years [Pandey2018, luo2019conv, pandey2021dense, luo2020dual]. Time-domain networks directly map noisy speech samples to clean speech samples, and as a result, feature extraction becomes an implicit part of the learning process. Even though highly effective in removing additive interference, time-domain approaches have not yet been established for removing room reverberation, a convolutive interference [luo2018real]. Time-domain networks have also been explored for end-to-end multichannel speech enhancement [tawara2019multi, liu2020multichannel, luo2020end]; however, the reported performances are far from satisfactory.

In this work, we propose a novel approach for end-to-end time-domain multichannel speech enhancement. We refer to it as TPARN: Triple-path Attentive Recurrent Network. The key idea in the TPARN design is to extend a dual-path network [luo2020dual] with a third path along the spatial dimension. The audio input signals from all channels are first divided into short chunks which are then processed by the TPARN system in three stages. These three stages for processing include: intra-chunk processing for local temporal modeling, inter-chunk processing for global temporal modeling, and inter-channel processing for spatial modeling.

Intra-chunk and inter-chunk processing are performed independently for all the channels using a dual-path attentive recurrent network (ARN) [pandey2020dual], which is an RNN augmented with self-attention [merity2019single, pandey2021self]. A combination of RNN and self-attention has been proven to be effective for speech processing tasks [pandey2020dual, tan2020sagrnn, pandey2021self, chen2020dual]. The inter-channel processing that we introduce can be modeled using different methods, and we explore an RNN, a self-attention network, and an ARN along the spatial dimension. We find ARN to be slightly superior to the other two. Besides, with an explicit capability to capture spatial information through neural-network-based inter-channel processing, our TPARN framework has additional desirable characteristics. For example, TPARN is designed as a multiple-input and multiple-output (MIMO) architecture to enhance all input channels simultaneously. These multichannel outputs can be further processed by a downstream system if needed. We train and evaluate TPARN on two different datasets with varying degrees of reverberation and noise. We show that TPARN obtains better results than state-of-the-art approaches on both datasets. We also find the performance improvements to be larger on the more difficult dataset.

2 Model Description

2.1 Problem Definition

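A multichannel noisy signal $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_P] \in \mathbb{R}^{P \times N}$, where $P$ is the number of microphones and $N$ is the number of samples, is modeled as

$$x_p(n) = y_p(n) + v_p(n) = d_p(n) + r_p(n) + v_p(n) = d_p(n) + u_p(n)$$

where $p = 1, \dots, P$ and $n = 0, \dots, N-1$. $x_p$ represents the noisy signal at microphone $p$. $y_p$ is the received speech including direct-path speech $d_p$ and reverberated speech $r_p$. $v_p$ is the noise and $u_p$ is the overall interference including noise and reverberation. The goal of multichannel speech enhancement is to get a close estimate of the direct-path clean speech at a reference microphone from $\mathbf{X}$.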

Figure 1: The proposed TPARN architecture for multichannel speech enhancement.

2.2 Triple-path Attentive Recurrent Network

The overall schema of the proposed TPARN architecture is shown in Fig. 1. It consists of an input linear layer, four TPARN blocks, and an output linear layer.

An input signal with $P$ channels and $N$ samples is first converted into frames, $\mathbf{T} \in \mathbb{R}^{P \times T \times L}$, using a frame size of $L$ samples and a frame shift of $J$ samples. $T$ is the total number of frames. The frames in $\mathbf{T}$ are arranged into chunks with a chunk size of $K$ and a chunk shift of $S$, so that the input is represented as a tensor of shape $P \times C \times K \times L$, where $C$ is the number of chunks. Next, the frames of size $L$ are projected to $D$ dimensions using the input linear layer, and the result is processed by a stack of 4 TPARN blocks. The architecture of a TPARN block is shown in Fig. 2. The TPARN blocks are densely connected. The inputs to the TPARN blocks are 4-D tensors of shape $P \times C \times K \times bD$, which are obtained by concatenating the output of the input linear layer and the outputs of the preceding TPARN blocks. $b \in \{1, 2, 3, 4\}$ denotes the block id.

For $b > 1$, a linear layer is used at the input to project features of size $bD$ to $D$. Within a TPARN block, the inputs are processed using a stack of three ARNs: intra-chunk ARN, inter-chunk ARN, and inter-channel ARN. The intra-chunk ARN operates independently over all chunks by rearranging its input to shape $[P \cdot C, K, D]$ and using an ARN that treats the first, second, and third dimensions as batch, sequence, and feature dimensions, respectively. Similarly, the inter-chunk ARN combines all chunks together by rearranging its input to shape $[P \cdot K, C, D]$. The inter-channel ARN operates along the channel dimension (spatial dimension) by rearranging its input to shape $[C \cdot K, P, D]$. The sequence length and the batch size for an utterance for different ARNs are given in Table 1.
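The bookkeeping behind these rearrangements can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the tail-padding convention, and the numeric sizes in the usage note are assumptions made for the example, not values taken from the paper.

```python
def num_segments(length, size, shift):
    """Number of windows of `size` taken every `shift` steps (tail assumed padded)."""
    if length <= size:
        return 1
    # ceil((length - size) / shift) + 1 windows are needed to cover the sequence.
    return -(-(length - size) // shift) + 1


def arn_input_shapes(num_mics, num_samples, frame_size, frame_shift,
                     chunk_size, chunk_shift, feat_dim):
    """Return the (batch, sequence, feature) shapes seen by the three ARNs."""
    num_frames = num_segments(num_samples, frame_size, frame_shift)
    num_chunks = num_segments(num_frames, chunk_size, chunk_shift)
    return {
        "num_frames": num_frames,
        "num_chunks": num_chunks,
        # Intra-chunk: every (mic, chunk) pair is one sequence of frames.
        "intra_chunk": (num_mics * num_chunks, chunk_size, feat_dim),
        # Inter-chunk: every (mic, position-in-chunk) pair is one sequence of chunks.
        "inter_chunk": (num_mics * chunk_size, num_chunks, feat_dim),
        # Inter-channel: every (chunk, position-in-chunk) pair is one sequence of mics.
        "inter_channel": (num_chunks * chunk_size, num_mics, feat_dim),
    }
```

For instance, with hypothetical settings of 4 microphones, 64000 samples, frames of 16 samples shifted by 8, chunks of 100 frames shifted by 50, and 128 features, the inter-channel ARN sees many very short sequences (length 4), whereas the intra-chunk and inter-chunk ARNs see fewer but longer ones.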

Figure 2: TPARN block.
Figure 3: (a) RNN block, (b) Attention block, (c) Feedforward block.

Fig. 2 also depicts the architecture of an ARN, which comprises a stack of three blocks: an RNN block, an attention block, and a feedforward block. The architectures of the RNN block, the attention block, and the feedforward block are shown in Fig. 3. The input to the RNN block is normalized using two independent layer normalization layers. We use separate normalization so that the network can scale a given signal differently at different locations inside the network.

The input to the attention block is layer-normalized using two independent layer-normalization layers. The first layer-normalized input is used as the query, and the second layer-normalized input is used as the key and value for the attention module that follows. The attention mechanism of the attention module, shown in Fig. 4, is borrowed from [merity2019single], where its effectiveness with RNNs for natural language processing tasks has been demonstrated. Its effectiveness for speech enhancement has also been established in [pandey2020dual, pandey2021self].

                     Batch Size   Seq. Length   Feature Size
Intra-chunk ARN      P·C          K             D
Inter-chunk ARN      P·K          C             D
Inter-channel ARN    C·K          P             D

Table 1: Input size to different ARNs for an utterance with C chunks (P: number of microphones, K: chunk size, D: feature size).

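The attention module comprises three trainable vectors $\mathbf{q}$, $\mathbf{k}$, and $\mathbf{v}$, and its inputs are the query $\mathbf{Q}$, key $\mathbf{K}$, and value $\mathbf{V}$, each of shape $B \times M \times D$, where $B$ is the batch size, $M$ is the sequence length (see Table 1), and $D$ is the feature size. $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are refined using a gating mechanism in which each input is multiplied elementwise by a gate derived from the corresponding trainable vector through Sigm(), the sigmoidal nonlinearity, and Lin(), a linear layer. $\mathbf{q}$, $\mathbf{k}$, and $\mathbf{v}$ are broadcast to match the shape of $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$. Note that a gate computed from a trainable vector is a deterministic vector; hence this operation is used only during training to better optimize the trainable vector, and its final value is stored as a vector to use during evaluation.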

The final output of the attention block is computed as


The input to the feedforward block in ARN is layer-normalized independently using two different layer normalization layers. The first layer-normalized input is projected to size using a linear layer followed by Gaussian error linear unit (GELU) nonlinearity and a dropout with dropout rate . Finally, it is projected back to size and added to the second layer-normalized input. Note that we use dense connection in the RNN block and residual connections in the attention and the feedforward block.
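As a rough illustration of the gated attention idea described above, the following pure-Python sketch gates the query, key, and value with the sigmoid of a trainable vector and then applies single-head scaled dot-product attention. Several things here are assumptions rather than the paper's exact formulation: the gates omit the linear layer Lin(), the same simple gate form is used for all three inputs, the scores are scaled by the square root of the feature size, and no residual connection is included. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def gate(sequence, vector):
    """Scale each feature by Sigm(vector), broadcasting the gate over time steps."""
    gates = [sigmoid(v) for v in vector]
    return [[g * x for g, x in zip(gates, step)] for step in sequence]


def gated_attention(query, key_value, q_vec, k_vec, v_vec):
    """Single-head attention over one sequence with sigmoid-gated Q, K and V.

    query and key_value are lists of time steps, each a list of `dim` floats.
    """
    dim = len(q_vec)
    q = gate(query, q_vec)
    k = gate(key_value, k_vec)
    v = gate(key_value, v_vec)
    output = []
    for q_step in q:
        scores = [sum(a * b for a, b in zip(q_step, k_step)) / math.sqrt(dim)
                  for k_step in k]
        weights = softmax(scores)
        output.append([sum(w * v_step[d] for w, v_step in zip(weights, v))
                       for d in range(dim)])
    return output
```

Because the gates depend only on trainable vectors and not on the input, they can be computed once after training and reused at evaluation time, which is the property the text points out.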

The output of the final TPARN block is projected back to the frame size using the output linear layer. Next, chunks are combined using chunk overlap-and-add (OLA), and then frames are combined using frame OLA to obtain an enhanced multichannel waveform. Note that TPARN is a MIMO architecture that enhances all input channels simultaneously.
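The two OLA stages can be sketched for a single channel as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, overlapping contributions are simply summed, and no normalization by the overlap count or synthesis window is applied, which the actual system may handle differently.

```python
def overlap_add(segments, shift):
    """Sum equally sized segments of numbers placed every `shift` positions."""
    size = len(segments[0])
    total = shift * (len(segments) - 1) + size
    out = [0.0] * total
    for index, segment in enumerate(segments):
        start = index * shift
        for offset, value in enumerate(segment):
            out[start + offset] += value
    return out


def chunks_to_waveform(chunks, chunk_shift, frame_shift):
    """Chunk OLA followed by frame OLA for one channel.

    chunks[c][k][l] is sample l of frame k in chunk c.
    """
    frame_size = len(chunks[0][0])
    # Chunk OLA: overlap-add along the frame axis, one within-frame index at a time.
    columns = [overlap_add([[frame[l] for frame in chunk] for chunk in chunks],
                           chunk_shift)
               for l in range(frame_size)]
    frames = [list(frame) for frame in zip(*columns)]
    # Frame OLA: overlap-add the recovered frames into a waveform.
    return overlap_add(frames, frame_shift)
```

Applying `chunks_to_waveform` once per microphone yields the multichannel output.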

2.3 Loss Function

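We use a recently proposed phase constrained magnitude (PCM) loss [pandey2020densely] for training. The PCM loss was proposed to overcome an existing artifact issue with magnitude-based losses for time-domain networks. First, we compute an estimate of the overall interference as

$$\hat{\mathbf{u}} = \mathbf{x} - \hat{\mathbf{d}}$$

where $\mathbf{x}$ is the noisy signal and $\hat{\mathbf{d}}$ is the estimated speech. Then, a magnitude-based loss is used between reference and estimated speech and noise.

$$L_{PCM} = \frac{1}{2} L_{SM}(\mathbf{d}, \hat{\mathbf{d}}) + \frac{1}{2} L_{SM}(\mathbf{u}, \hat{\mathbf{u}})$$

$L_{SM}$ is defined as

$$L_{SM}(\mathbf{d}, \hat{\mathbf{d}}) = \frac{1}{T \cdot F} \sum_{t=1}^{T} \sum_{f=1}^{F} \Big| \big( |D_r(t,f)| + |D_i(t,f)| \big) - \big( |\hat{D}_r(t,f)| + |\hat{D}_i(t,f)| \big) \Big|$$

where $\mathbf{D}$ is the STFT of $\mathbf{d}$, and $D_r$ and $D_i$ respectively represent the real and the imaginary components of $\mathbf{D}$. $T$ is the number of frames and $F$ is the number of frequency bins in $\mathbf{D}$.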
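A compact way to see how the speech term and the interference term interact is the following pure-Python sketch of a PCM-style loss. It should be read as an assumption-laden illustration: the naive rectangular-window STFT, the tiny default frame length and hop, the equal weighting of the two terms, and all function names are choices made for the example and are not taken from the paper.

```python
import cmath
import math


def stft(signal, frame_len, hop):
    """Naive STFT with a rectangular window; returns a list of frames of complex bins."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = signal[start:start + frame_len]
        bins = [sum(x * cmath.exp(-2j * math.pi * f * n / frame_len)
                    for n, x in enumerate(segment))
                for f in range(frame_len // 2 + 1)]
        frames.append(bins)
    return frames


def sm_loss(reference, estimate, frame_len=8, hop=4):
    """Mean absolute difference of (|real| + |imag|) over all T-F units."""
    ref_spec = stft(reference, frame_len, hop)
    est_spec = stft(estimate, frame_len, hop)
    total, count = 0.0, 0
    for ref_frame, est_frame in zip(ref_spec, est_spec):
        for r, e in zip(ref_frame, est_frame):
            total += abs((abs(r.real) + abs(r.imag)) - (abs(e.real) + abs(e.imag)))
            count += 1
    return total / count


def pcm_loss(noisy, clean, enhanced, frame_len=8, hop=4):
    """Equal-weight sum of the speech loss and the interference loss (assumed weights)."""
    interference = [x - d for x, d in zip(noisy, clean)]
    est_interference = [x - d for x, d in zip(noisy, enhanced)]
    return (0.5 * sm_loss(clean, enhanced, frame_len, hop)
            + 0.5 * sm_loss(interference, est_interference, frame_len, hop))
```

The interference estimate is obtained by subtracting the enhanced speech from the noisy input, so an error in the enhanced waveform is penalized through both terms.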

Figure 4: Self-attention mechanism.

3 Experiments

3.1 Datasets

We use two datasets for experiments. The first dataset is created using speech from the WSJCAM0 corpus [robinson1995wsjcamo] and noises from the REVERB challenge [kinoshita2016summary]. This dataset uses a 4-microphone circular array with a radius of 10 cm, T60 in the range [0.2, 1.2] seconds, and direct speech-to-noise ratio (SNR) in the range [5, 20] dB. More details about this dataset can be found in [wang2020multi].

We also create a more challenging dataset using speech and noises from the DNS challenge 2020 corpus [reddy2020interspeech]. We select all speakers with one chapter in the dataset and randomly select 90% of speakers for training, 5% for validation, and 5% for evaluation. After this, for each utterance, a random chunk of a randomly sampled length with an activity threshold (from the script in [reddy2020interspeech]) greater than 0.6 is extracted. Utterance lengths are sampled from [3, 10] seconds for training and [3, 10] seconds for test and validation. This results in a total of 53 k utterances for training, 2.6 k for validation, and 3.3 k for test. Next, all the noises from the DNS corpus are randomly divided into training, validation, and test noises in a proportion similar to the number of speech utterances.

The algorithm to generate spatialized multichannel data from DNS speech and noises is given in Algorithm 1. All the points inside a room are sampled at least 0.5 m away from the walls, and the distance between the array center and each sound source is kept between 0.75 m and 2 m. We use Pyroomacoustics [scheibler2018pyroomacoustics] with a hybrid approach, where the image method with order 6 is used to model early reflections and ray tracing is used to model the late reverberation. A similar approach to data generation was used in [kabeli2021online].

           SI-SDR   STOI   PESQ
unproc.    -3.8     70.9   1.63
MSE        10.1     95.7   3.07
PCM        10.4     96.9   3.42

Table 2: Comparison of loss functions for TPARN.

for split in {train, test, validation} do
    for speech utterance in split do
        • Draw room length and width from [5, 10] m, and height from [3, 4] m;
        • Draw 1 array location and 1 speech source location;
        • Get 4 uniformly placed mic locations on a circle of radius 10 cm centered at the array location;
        • Draw the number of noise sources from [5, 10];
        • Draw that many noise locations in the room;
        • Generate RIRs corresponding to the speech source location and the noise locations for the mic locations in the circular array;
        • Draw one noise utterance per noise source from the noises in split;
        • Propagate speech and noise signals to the mics by convolving with the corresponding RIRs;
        • Draw an SNR value from [-10, 10] dB, and add speech and noises at each mic using a scale so that the overall direct speech SNR equals the drawn value;
    end for
end for
Algorithm 1: DNS dataset spatialization process.
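The geometric sampling and SNR scaling steps of Algorithm 1 can be sketched without any room simulator. The sketch below covers only those steps and leaves RIR generation and convolution to a simulator such as Pyroomacoustics. It assumes uniform sampling within the stated ranges, rejection sampling to satisfy the source-to-array distance constraint, all microphones at the height of the array center, and an SNR defined on average signal power. The function names are hypothetical.

```python
import math
import random


def sample_scene(rng):
    """Draw one room, a 4-mic circular array, one speech source and 5 to 10 noise sources."""
    length, width = rng.uniform(5.0, 10.0), rng.uniform(5.0, 10.0)
    height = rng.uniform(3.0, 4.0)

    def point():
        # Every sampled point stays at least 0.5 m away from walls, floor and ceiling.
        return (rng.uniform(0.5, length - 0.5),
                rng.uniform(0.5, width - 0.5),
                rng.uniform(0.5, height - 0.5))

    array = point()

    def source():
        # Rejection sampling keeps each source 0.75 m to 2 m from the array center.
        while True:
            candidate = point()
            if 0.75 <= math.dist(candidate, array) <= 2.0:
                return candidate

    # Four microphones uniformly spaced on a horizontal circle of radius 10 cm.
    mics = [(array[0] + 0.1 * math.cos(2 * math.pi * i / 4),
             array[1] + 0.1 * math.sin(2 * math.pi * i / 4),
             array[2]) for i in range(4)]
    num_noises = rng.randint(5, 10)
    return {
        "room": (length, width, height),
        "array": array,
        "mics": mics,
        "speech": source(),
        "noises": [source() for _ in range(num_noises)],
        "snr_db": rng.uniform(-10.0, 10.0),
    }


def noise_gain(direct_speech, noise, snr_db):
    """Gain applied to `noise` so that the direct-speech-to-noise ratio equals snr_db."""
    speech_power = sum(s * s for s in direct_speech) / len(direct_speech)
    noise_power = sum(v * v for v in noise) / len(noise)
    return math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
```

A scene would be drawn with `sample_scene(random.Random(seed))`, and the returned gain would be applied to the summed noise signal at each microphone before mixing it with the reverberant speech.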

3.2 Experimental settings

All the utterances are resampled to 16 kHz. We use , , , , , and . For RNN, we use bidirectional long short-term memory networks (BLSTMs) with hidden size in each direction. Dropout rate in feedforward blocks is set to %. We use phase constrained magnitude (PCM) loss in Eq. (6) for training TPARN. All the models are trained for 100 epochs on 4-second-long utterances randomly extracted from training samples during training. A batch size of is used. Automatic mixed precision training is utilized for efficient training [micikevicius2017mixed]. Learning rate is initialized with and is dynamically scaled to half if the best validation score does not improve in five consecutive epochs.
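The learning-rate rule described above can be sketched as a small scheduler. This is a sketch under assumptions: the class name is hypothetical, the initial learning rate is left as a parameter, the validation score is treated as higher-is-better by default, and the stale-epoch counter is reset after each halving.

```python
class HalveOnPlateau:
    """Halve the learning rate when the best validation score stops improving."""

    def __init__(self, initial_lr, patience=5, higher_is_better=True):
        self.lr = initial_lr
        self.patience = patience
        self.higher_is_better = higher_is_better
        self.best = None
        self.stale_epochs = 0

    def step(self, score):
        """Record one epoch's validation score and return the learning rate to use next."""
        if self.best is None:
            improved = True
        elif self.higher_is_better:
            improved = score > self.best
        else:
            improved = score < self.best
        if improved:
            self.best = score
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= 0.5
                self.stale_epochs = 0  # assumed: start counting again after a halving
        return self.lr
```

Calling `step` once per epoch with the validation score keeps the rate unchanged for four non-improving epochs and halves it on the fifth.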

We compare TPARN with two recently proposed complex-spectrum-based end-to-end models: DCRN [wang2020multi] and channel-attention dense UNet (CA-DUNet) [tolooshams2020channel]. We also compare it with two time-domain end-to-end models: residual speech denoising fully convolutional network (rSDFCN) [liu2020multichannel] and filter-and-sum network with transform average and concatenate module (Fasnet TAC) [luo2020end].

All the models are compared using three enhancement objective metrics: short-time objective intelligibility (STOI) [taal2010short], perceptual evaluation of speech quality (PESQ) [rix2001perceptual], and scale-invariant signal-to-distortion ratio (SI-SDR). STOI scores are reported in percentage.
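Of the three metrics, SI-SDR is simple enough to sketch directly. The following pure-Python function follows the common definition in which the estimate is projected onto the reference before the ratio is computed. Mean removal, which some implementations apply first, is omitted here for brevity, and the function name is an assumption.

```python
import math


def si_sdr(reference, estimate):
    """Scale-invariant signal-to-distortion ratio in dB for two equal-length signals."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    energy = sum(r * r for r in reference)
    scale = dot / energy  # optimal scaling of the reference toward the estimate
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    # Note: a perfect estimate gives zero residual energy and would divide by zero.
    return 10 * math.log10(target_energy / residual_energy)
```

Rescaling the estimate by any nonzero constant leaves the value unchanged, which is why the metric is well suited to time-domain models whose output gain is not constrained.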

3.3 Experimental results

We begin by comparing time-domain mean squared error (MSE) loss with spectral magnitude based PCM loss in Eq. (5). Results on WSJCAM0 dataset using TPARN are reported in Table 2. We see that even though MSE loss obtains good SI-SDR and STOI scores, it is considerably worse in terms of PESQ. This also suggests that PCM loss obtains bigger improvements for joint denoising and dereverberation compared to denoising in [pandey2021dense]. We use PCM loss for the rest of the experiments with TPARN.

Next, we explore the behavior of spatial (inter-channel) ARNs at different locations inside a TPARN block: Pre (before the intra-chunk and inter-chunk ARNs), Mid (between the intra-chunk and inter-chunk ARNs), and Post (after the intra-chunk and inter-chunk ARNs). We also experiment with three different configurations for spatial processing: Attention (removing the RNN block from ARN), RNN (removing the attention block from ARN), and ARN. The results for different models are summarized in Table 3. We see that for the WSJCAM0 dataset, different configurations of spatial ARN in the TPARN block obtain similar results. However, for the challenging DNS dataset, RNN is considerably worse than Attention and ARN. Moreover, the best scores are obtained at the Pre location for Attention and at the Post location for ARN. These results indicate that RNNs are good for spatial modeling but may not suffice in extreme conditions, as in the DNS dataset.

Additionally, we perform an ablation experiment (not reported here) on the number of spatial ARNs inside TPARN by removing spatial ARNs from TPARN blocks at different locations. We find that a TPARN with 3 spatial ARNs, in the first, second, and fourth TPARN blocks, obtains consistently better results for both datasets and for different learning strategies, such as multiple-input and single-output (MISO) and MIMO.

Finally, we compare the best TPARN results with baseline models in Table 4. We report single-input and single-output (SISO), MISO, and MIMO results for DCRN and TPARN, and MISO results for the rest of the models, as in their original studies. TPARN MIMO is converted to MISO by averaging the output of the final TPARN block. Interestingly, TPARN-SISO is worse than DCRN-SISO, but TPARN-MIMO and TPARN-MISO are better than DCRN-MISO and DCRN-MIMO. This suggests that TPARN exploits spatial information to a larger extent than DCRN. Further, DCRN-MIMO is worse than DCRN-MISO, but TPARN-MIMO is slightly better than TPARN-MISO. This indicates that TPARN is capable of MIMO learning without any performance degradation, which provides the additional advantage of enhancing all channels simultaneously with spatial cue preservation. Moreover, performance improvements over the baseline models are even larger for the difficult DNS dataset. For example, on the DNS dataset, TPARN-MIMO is better by 3.8 dB in SI-SDR, 1.8% in STOI, and 0.18 in PESQ compared to the second best DCRN-MISO. An exception among the baseline models is CA-DUNet, which obtains impressive SI-SDR but drastically worse PESQ for WSJCAM0.

                          WSJCAM0                   DNS
Module      Loc.    SI-SDR   STOI   PESQ     SI-SDR   STOI   PESQ
Unproc.     -       -3.8     70.9   1.38     -7.6     63.8   1.38
Attention   Pre     10.3     96.8   3.40      7.6     91.1   2.66
            Mid     10.1     96.8   3.39      7.0     90.2   2.56
            Post    10.2     96.8   3.40      6.9     90.2   2.56
RNN         Pre     10.3     96.8   3.40      6.9     90.0   2.55
            Mid     10.3     96.9   3.44      7.1     90.4   2.59
            Post    10.3     96.9   3.41      7.2     90.3   2.57
ARN         Pre     10.4     96.8   3.40      7.4     90.5   2.59
            Mid     10.4     96.9   3.41      7.1     90.4   2.59
            Post    10.4     96.9   3.42      7.8     91.1   2.65

Table 3: Comparisons between different spatial processing modules at different locations in TPARN blocks.

                                                      WSJCAM0                   DNS
Type   Approach                                 SI-SDR   STOI   PESQ     SI-SDR   STOI   PESQ
-      Unprocessed                              -3.8     70.9   1.38     -7.6     63.8   1.38
WM     rSDFCN-MISO [liu2020multichannel]         4.2     83.8   2.00     -2.2     68.5   1.49
CRM    CA-DUNet-MISO [tolooshams2020channel]    10.7     96.0   2.88      3.5     83.3   1.99
WM     Fasnet TAC-MISO [luo2020end]              8.2     94.7   2.93      4.7     86.5   2.26
CSM    DCRN-SISO [wang2020multi]                 6.6     93.6   2.90      3.9     89.8   2.60
CSM    DCRN-MISO [wang2020multi]                 9.4     96.5   3.31      4.6     90.1   2.57
CSM    DCRN-MIMO [wang2020multi]                 8.0     95.9   3.27      3.6     89.4   2.57
WM     TPARN-SISO                                5.1     93.6   2.92      3.0     84.1   2.14
WM     TPARN-MISO                               10.2     96.8   3.40      8.2     91.6   2.72
WM     TPARN-MIMO                               10.4     96.9   3.43      8.4     91.9   2.75

Table 4: Comparisons with baseline models. WM: waveform mapping (time-domain). CRM: complex ratio masking. CSM: complex spectral mapping.

4 Conclusions

We have proposed a novel triple-path attentive recurrent network for multichannel speech enhancement in the time domain. TPARN is designed as a simple extension of a single-channel dual-path network to a multichannel network by adding a third path along the spatial dimension. TPARN is a multiple-input and multiple-output (MIMO) architecture that can simultaneously enhance signals at all microphones. We have shown that TPARN obtains significantly better results than other state-of-the-art models in very noisy and reverberant conditions. Future research includes exploring TPARN for ad-hoc array processing and moving sources.