Speech signals captured by microphones placed at a distance from speakers are often corrupted by various undesired acoustic sources, such as interfering speakers, reverberation and ambient noise, which decrease speech quality and intelligibility. In recent years, sound source separation techniques based on deep learning have been proposed that aim at separating out the speakers from the microphone signals and reducing background noise. Sound source separation techniques can be broadly categorized into single-channel and multi-channel methods, based on the number of microphones used. Single-channel source separation methods typically exploit spectro-temporal diversity between the speech and the noise signals [12, 13, 18, 26, 15, 14, 22, 19]. These methods typically perform source separation by estimating masks corresponding to each sound source using convolutional, recurrent or transformer networks. To improve the source separation performance, these methods also aim to learn short-term and long-term temporal dependencies of speech signals using neural structures with large receptive fields [13, 18], deep and wide recurrent layers [12, 15], or dual-path architectures using recurrent or transformer layers [14, 22, 19].
When multiple microphones are available, multi-channel filters make it possible to exploit the spatial diversity between the speakers and the background noise in addition to the spectro-temporal diversity [9, 17, 2, 27, 16]. Multi-channel filters, also often referred to as spatial filters and beamformers [6, 9], perform source separation by linearly filtering and summing the microphone signals. Conventional multi-channel filters are typically estimated based on a linear optimization problem and require estimates of certain parameters, e.g., covariance matrices, directions of arrival (DOAs) or steering vectors of the sound sources [9, 6]. These parameters can be estimated based on masks obtained by, e.g., single-channel source separation techniques [17, 16, 27], or can be estimated directly from the microphone signals. Instead of formulating the multi-channel filter as a linear optimization problem, it has recently been proposed to directly estimate multi-channel filters with a generalized recurrent beamformer (GRNN-BF) network, which learns a non-linear optimization solution. The GRNN-BF network is able to estimate multi-channel filters, although still from covariance matrices computed based on camera-guided DOA estimators. Aiming at end-to-end source separation by directly estimating multi-channel filters from input spectra, we propose a transformer-recurrent-U network (TRUNet) in this paper. To efficiently capture the spatial diversity using a transformer network, we propose transformer architectures that use an attention mechanism across microphone channels. To draw valid conclusions on reverberant sound source separation, we evaluate the proposed network on a challenging and realistic reverberant dataset, generated from measured room impulse responses of an actual microphone array.
The proposed TRUNet is depicted in Fig. 1. TRUNet consists of a spatial processing network, a spectro-temporal processing network and an inverse short-time Fourier transform (iSTFT) layer. First, it accepts spectra of the multi-channel signals as input. Then, the multi-channel filters are estimated by a spatial processing network and a spectro-temporal processing network in an end-to-end fashion. For the spatial processing network, since capturing the spatial diversity is not straightforward, we investigate several transformer architectures operating across microphone channels. For the spectro-temporal processing network, we consider a multi-channel recurrent convolutional network with U structure (RUNet) to efficiently capture spectral and temporal dependencies corresponding to each speaker. To separate out the speakers, the multi-channel filters, which are complex-valued and time-varying, are applied to the multi-channel input spectra. In addition to multi-channel filters, we also consider estimating single-channel filters, which can still benefit from a filter designed from multi-channel spectro-temporal information. Finally, the separated sources are transformed to the time domain using the iSTFT layer, which enforces short-time Fourier transform (STFT) consistency in the network [26, 3].
We train the proposed network to separate out speakers from noisy and reverberant speech mixtures. The reverberant speech signals in the mixtures can be thought of as containing early reflections, which are beneficial for speech intelligibility and naturalness, and a late reverberation component, which is known to have a detrimental effect on speech quality and intelligibility. Therefore, we aim at separating out a low-reverberant speech signal that preserves the early reflections. We use the complex mean-squared error (MSE) loss function compressed with an exponent factor to balance the optimization of small and large MSE errors [7, 3], which was found to be superior to other losses for source separation. We also explore the impact of the exponent factor on the source separation performance. Furthermore, we train the proposed network on a large dataset accounting for the crucial aspects of realistic multi-channel audio, such as a large number of speakers, various noise types, different microphone signal levels and reverberation, as in [2, 3].
We experimentally compare the proposed network with state-of-the-art single-channel and multi-channel source separation methods in challenging noisy and reverberant conditions. The results show that the proposed network outperforms these state-of-the-art source separation methods.
2 Sound source separation system
We consider an acoustic scenario comprising two competing speakers and background noise in a reverberant environment. TRUNet accepts spectra of the multi-channel signals as input. Aiming at separating out the speakers, the network estimates two complex-valued filters, using the spatial processing network and the spectro-temporal processing network. The multi-channel filtering on $M$-channel signals is performed as

$$\hat{S}_i(t,f) = \mathbf{w}_i^{H}(t,f)\,\mathbf{y}(t,f), \qquad (1)$$

where $\hat{S}_i(t,f)$ denotes the separated speech signal corresponding to speaker $i$ in the STFT domain, $\mathbf{w}_i(t,f)$ denotes the multi-channel filter directly estimated by the network, $\mathbf{y}(t,f)$ denotes the stacked vector of all microphone signals, $(\cdot)^{H}$ denotes the conjugate transpose operator, and $t$ and $f$ are the frame index and the frequency index. In addition, a single-channel filtering version of (1) is considered, i.e., $\hat{S}_i(t,f) = w_i(t,f)\,Y_m(t,f)$, where $w_i(t,f)$ denotes the single-channel complex-valued filter and $Y_m(t,f)$ denotes one selected microphone signal in the STFT domain. In the following, we present the spatial processing network and the spectro-temporal processing network of the proposed TRUNet architecture.
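The filter-and-sum operation described above can be illustrated with a minimal NumPy sketch. The function names, array layout (frames, frequencies, microphones) and toy sizes are assumptions made for illustration; in the proposed system the filters are produced by the network, whereas here they are set by hand so that the result is easy to check.

```python
import numpy as np


def multichannel_filter(w, y):
    """Filter-and-sum in the STFT domain.

    w, y: complex arrays of shape (frames, freqs, mics).
    Returns the sum over microphones of conj(w) * y, i.e. w^H y per bin.
    """
    return np.einsum("tfm,tfm->tf", np.conj(w), y)


def single_channel_filter(w, y_ref):
    """Single-channel variant: one complex gain per bin applied to one microphone."""
    return w * y_ref


rng = np.random.default_rng(0)
frames, freqs, mics = 4, 5, 8  # toy sizes, chosen only for this example
y = rng.standard_normal((frames, freqs, mics)) + 1j * rng.standard_normal((frames, freqs, mics))

# Hand-crafted filter that selects microphone 0 in every bin.
w_pick = np.zeros((frames, freqs, mics), dtype=complex)
w_pick[..., 0] = 1.0
s_pick = multichannel_filter(w_pick, y)

# Hand-crafted filter that averages all microphones.
w_avg = np.full((frames, freqs, mics), 1.0 / mics, dtype=complex)
s_avg = multichannel_filter(w_avg, y)
```

Because the filters are indexed by frame and frequency, the same routine covers time-varying filters without any change.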
2.1 Spatial processing network using transformers
The proposed spatial processing network consists of spatial transformer blocks. The spatial transformer blocks are similar to the encoder block of the original transformer architecture; however, they accept spectra as the input and use a spatial attention function whose output is a representation incorporating inter-channel information. The inputs of each transformer block consist of three pairs, i.e., queries $\mathbf{Q}$ with the feature dimension $d$, keys $\mathbf{K}$, and values $\mathbf{V}$, which can be the real and imaginary parts or the magnitude and the phase of the input spectra. To direct the attention of a transformer block to sub-spaces of the spectral feature space, the queries, keys and values are linearly projected as

$$\mathbf{Q}_h = \mathbf{Q}\,\mathbf{W}_h^{Q}, \quad \mathbf{K}_h = \mathbf{K}\,\mathbf{W}_h^{K}, \quad \mathbf{V}_h = \mathbf{V}\,\mathbf{W}_h^{V}, \qquad (2)$$

where $h$ denotes the sub-space index, also referred to as the head, and $\mathbf{W}_h^{Q}$, $\mathbf{W}_h^{K}$ and $\mathbf{W}_h^{V}$ are learnable projection matrices. By this projection, the spatial attention function is applied to the sub-spaces in parallel with a head embedding dimension $d_h$, speeding up the process. The spatial attention function is then performed by weighting the sum of the values, where the weights are computed by (real-valued) dot products of the queries and the keys, followed by a softmax function, as

$$\mathbf{A}_h = \mathrm{softmax}\!\left(\mathbf{Q}_h\,\mathbf{K}_h^{T}\right)\mathbf{V}_h, \qquad (3)$$

where $(\cdot)^{T}$ denotes the transpose operator. The dot product operation in (3) can be seen as analogous to the way covariance matrices are computed in conventional beamforming, and weighting the sum of the values can be seen as analogous to the way beamforming weights are computed. To allow a transformer block to jointly attend to information from different representation sub-spaces at different channels, the attention outputs of the $H$ heads are concatenated and linearly projected with $\mathbf{W}^{O}$, using a multi-head attention, i.e.,

$$\mathrm{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Concat}(\mathbf{A}_1, \dots, \mathbf{A}_H)\,\mathbf{W}^{O}.$$

The multi-head attention output, together with a corresponding residual connection, is then followed by a layer normalization and a fully connected feed-forward network, resulting in the output of a transformer block.
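To make the cross-channel attention more concrete, the following NumPy sketch computes a multi-head attention in which the rows of the input correspond to microphone channels, so that the softmax weights form a channel-by-channel matrix. It is a sketch under stated assumptions: the shapes, the random projection matrices and the names `multi_head_spatial_attention`, `Wq`, `Wk`, `Wv` and `Wo` are hypothetical, and the 1/sqrt(d_h) scaling of the standard transformer is left out because the text does not mention it. Residual connection, layer normalization and feed-forward network are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_spatial_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Attention across microphone channels.

    Q, K, V: real arrays of shape (mics, d), one feature row per channel.
    Wq, Wk, Wv: per-head projections of shape (heads, d, d_h).
    Wo: output projection of shape (heads * d_h, d).
    """
    head_outputs = []
    for h in range(Wq.shape[0]):
        Qh, Kh, Vh = Q @ Wq[h], K @ Wk[h], V @ Wv[h]  # each (mics, d_h)
        weights = softmax(Qh @ Kh.T, axis=-1)         # (mics, mics) channel-to-channel weights
        head_outputs.append(weights @ Vh)             # weighted sum of the values
    return np.concatenate(head_outputs, axis=-1) @ Wo


rng = np.random.default_rng(0)
mics, d, heads = 8, 16, 4          # toy sizes, assumptions for this example
d_h = d // heads
X = rng.standard_normal((mics, d))  # e.g. features derived from the input spectra
Wq, Wk, Wv = (0.1 * rng.standard_normal((heads, d, d_h)) for _ in range(3))
Wo = 0.1 * rng.standard_normal((heads * d_h, d))

out = multi_head_spatial_attention(X, X, X, Wq, Wk, Wv, Wo)

# If every channel carries the same values, attention cannot change them:
V_const = np.tile(rng.standard_normal(d), (mics, 1))
out_const = multi_head_spatial_attention(X, X, V_const, Wq, Wk, Wv, Wo)
```

The second call illustrates that the softmax weights are convex combinations over channels: identical value rows yield identical output rows regardless of the queries and keys.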
Since incorporating spatial diversity into a network is not straightforward, we consider several variants for computing cross-channel attention in the proposed spatial transformer networks (TNets):
TNet–Cat consists of several transformer blocks using the spatial attention function in (3). The output of each transformer block is used as the input to all query, key and value matrices of the next block, i.e., $\mathbf{Q} = \mathbf{K} = \mathbf{V}$. The input of the first transformer block is the concatenation of the real and imaginary parts of the spectra (see Fig. 2a). This approach simply computes the attention based on the real-valued dot product (3) and can be seen as a straightforward approach to feeding all multi-channel spectra into the proposed spatial transformers.
TNet–RealImag uses two separate transformer stacks for real and imaginary parts, respectively (see Fig. 2b). Queries and keys are all computed from the multi-channel spectra, i.e. . Since the softmax function is not well-defined for complex arithmetic, the magnitude of the complex inner product is used instead, i.e., $|\mathbf{Q}\mathbf{K}^{H}|$, where $|\cdot|$ denotes the magnitude operator and $(\cdot)^{H}$ denotes the conjugate transpose operator. The output of the network is computed as the concatenation of the outputs of the last real and imaginary transformer stack blocks. As the real and imaginary parts are processed separately, this approach may still not be able to directly exploit the spatial information between channels, e.g., phase differences.
TNet–MagPhase is analogous to TNet–RealImag, except that the spectral magnitude and the spectral phase are used instead of the real and imaginary parts (see Fig. 2b). The output of the network is computed as the concatenation of the outputs of the last spectral magnitude and phase transformer stack blocks.
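The attention weights based on the magnitude of the complex inner product can be sketched as follows. This is an illustrative NumPy example rather than the exact TNet–RealImag or TNet–MagPhase implementation: the single-head setting, the toy shapes and the choice of the real and imaginary parts as values for the two stacks are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def magnitude_attention(Q, K, V):
    """Attention whose weights come from |Q K^H|.

    Q, K: complex arrays of shape (mics, d); V: real array of shape (mics, d).
    """
    scores = np.abs(Q @ K.conj().T)  # (mics, mics), real and non-negative
    return softmax(scores, axis=-1) @ V


rng = np.random.default_rng(0)
mics, d = 8, 16  # toy sizes, assumptions for this example
Y = rng.standard_normal((mics, d)) + 1j * rng.standard_normal((mics, d))

# Two separate branches sharing complex queries and keys (assumed value assignment).
out_real = magnitude_attention(Y, Y, Y.real)
out_imag = magnitude_attention(Y, Y, Y.imag)
out = np.concatenate([out_real, out_imag], axis=-1)

# A common phase rotation of all channels leaves the weights unchanged.
Y_rot = Y * np.exp(1j * 0.7)
out_rot = magnitude_attention(Y_rot, Y_rot, Y.real)
```

Taking the magnitude keeps the softmax real-valued while the scores still depend on inter-channel phase relations through the complex inner product.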
2.2 Spectro-temporal processing network
Since the spatial processing network may not be able to efficiently capture spectral and temporal diversities, the suppression capability of filters estimated by the spatial network is limited. Therefore, we propose TRUNet as an extension of TNets aiming at also capturing spectral and temporal diversities by using, e.g., a RUNet [20, 2, 3]. In this work, a multi-channel RUNet is adopted, accepting a multi-channel input which is a concatenation of the real and the imaginary parts. It has symmetric convolutional and deconvolutional encoder and decoder layers with kernels of size , aiming to deal with reverberation, and a stride of in time and frequency dimensions. The number of channels increases per encoder layer and decreases in a mirrored fashion in the decoder. The input and the final output channels are . All convolutional layers are followed by leaky ReLU activations. The encoder and decoder are connected by two BLSTM layers, which are fed with all features flattened along the channels. To improve network training while avoiding vanishing gradients, residual connections are used which link convolutional encoder layers and the corresponding decoder layers. In addition, a residual connection linking the BLSTM layers is used. Motivated by the results in for speech enhancement, the residual connections are implemented as convolutions with channels and kernels. The network is then followed by a fully connected layer and a tanh activation to estimate the multi-channel filters of both sources. In addition to multi-channel filters, we also consider estimating single-channel filters using a similar network, however, with an extra convolutional layer after the last decoder layer with the output channel .
2.3 Loss function
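We train the network using a complex MSE loss compressed with an exponent factor $c$ (cMSE), i.e.,

$$\mathcal{L}^{c} = \sum_{t,f} \left|\, |S(t,f)|^{c}\, e^{j\varphi_{S}(t,f)} - |\hat{S}(t,f)|^{c}\, e^{j\varphi_{\hat{S}}(t,f)} \right|^{2},$$

where $\varphi_{S}$ and $\varphi_{\hat{S}}$ denote the spectral phase of the STFT of the target speech signal $S$ and of the separated speech signal $\hat{S}$, respectively.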
To cope with the source-speaker to target-speaker mapping problem, we use utterance-level permutation invariant training (uPIT).
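A minimal NumPy sketch of a compressed complex MSE combined with an utterance-level permutation search is given below. It assumes that the compression is applied to the magnitude while the phase is kept, as in the compressed losses of [7, 3]; the exponent value `c=0.3`, the averaging over bins and the names `cmse` and `upit_loss` are placeholders chosen for illustration and are not taken from this paper.

```python
import itertools

import numpy as np


def cmse(s, s_hat, c=0.3):
    """Compressed complex MSE between two complex STFTs of equal shape.

    The magnitude is compressed with exponent c, the phase is kept.
    c=0.3 is a placeholder value (assumption).
    """
    s_c = np.abs(s) ** c * np.exp(1j * np.angle(s))
    s_hat_c = np.abs(s_hat) ** c * np.exp(1j * np.angle(s_hat))
    return np.mean(np.abs(s_c - s_hat_c) ** 2)


def upit_loss(targets, estimates, c=0.3):
    """Utterance-level permutation invariant loss.

    targets, estimates: complex arrays of shape (speakers, frames, freqs).
    Returns (minimum loss, permutation), where permutation[i] is the index of
    the estimate assigned to target i for the whole utterance.
    """
    best = None
    for perm in itertools.permutations(range(targets.shape[0])):
        loss = np.mean([cmse(targets[i], estimates[p], c) for i, p in enumerate(perm)])
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best


rng = np.random.default_rng(0)
targets = rng.standard_normal((2, 10, 6)) + 1j * rng.standard_normal((2, 10, 6))
estimates = targets[::-1].copy()  # perfect estimates, but in swapped speaker order

loss, perm = upit_loss(targets, estimates)
```

The exhaustive search over permutations is cheap for two speakers; for many speakers an assignment solver would be the more practical choice.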
3 Experimental setup
We use large and realistic training and test sets to ensure generalization of our results to real-world acoustic conditions. For training, validation and testing, we use three different speech databases, i.e. 540 h of speech data from , 40 h from VCTK, and 5 h from DAPS, respectively, and different noise databases from [24, 4, 8]. We consider an 8-channel microphone array on a circle of 5 cm radius.
For training, we simulate RIR sets of random positions in 1000 differently sized rooms using the image method, while for validation and testing we use RIRs measured with the actual microphone array in 10 different rooms. The rooms have reverberation times between 0.2 and 0.8 s and direct-to-reverberant ratios between -12 and 5.8 dB.
Two overlapping speech signals of 30 s length are convolved with an RIR from a randomly chosen position in the same room and mixed with energy ratios drawn from a Gaussian distribution with dB. The reverberant mixture and noise are then mixed with a signal-to-noise ratio (SNR) drawn from a Gaussian distribution with dB. The resulting mixture signals are finally re-scaled to levels distributed with dBFS. The target speech signals are generated as low-reverberant speech signals using the reverberant impulse responses shaped with an exponential decay function, enforcing a maximum reverberation time of 200 ms. We generate training, validation and test sets of 1000 h, 4 h and 4 h, respectively, at a sampling rate of kHz.
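The mixing steps described above (drawing a ratio in dB, scaling one signal to meet it, and re-scaling the result to a target level) can be sketched as follows. The Gaussian parameters and the target level used here are hypothetical placeholders, since the actual values are not given in the text; the sketch also works on single-channel white noise signals and measures levels as RMS relative to a full scale of 1.0, which are further assumptions.

```python
import numpy as np


def power_db(x):
    """Mean power of a signal in dB (relative to a full scale of 1.0)."""
    return 10.0 * np.log10(np.mean(x ** 2) + 1e-12)


def scale_to_ratio(reference, other, ratio_db):
    """Scale `other` so that power(reference) / power(other) equals ratio_db."""
    gain_db = power_db(reference) - power_db(other) - ratio_db
    return other * 10.0 ** (gain_db / 20.0)


def rescale_to_level(x, level_dbfs):
    """Scale a signal so that its mean power matches level_dbfs."""
    return x * 10.0 ** ((level_dbfs - power_db(x)) / 20.0)


rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # stand-in for a reverberant speech mixture
noise = 0.1 * rng.standard_normal(16000)   # stand-in for a noise recording

# Placeholder distribution parameters (assumptions, not the paper's values).
snr_db = rng.normal(loc=8.0, scale=10.0)
level_dbfs = rng.normal(loc=-28.0, scale=10.0)

noise_scaled = scale_to_ratio(speech, noise, snr_db)
mixture = rescale_to_level(speech + noise_scaled, level_dbfs)
```

For a microphone array, the same gain would be applied to all channels of a signal so that inter-channel level differences are preserved.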
3.2 Baseline method
We consider a number of multi-channel source separation methods in our experiments: 1) DBNet, combining direction-of-arrival estimation and conventional beamforming. 2) eDBNet, a DBNet extension using post-masking via convolutional-recurrent networks. 3) RUNet, estimating the multi-channel filters from the multi-channel input spectra as described in Section 2.2. 4) TRNet, similar to the proposed TRUNet, but without the convolutional and deconvolutional encoder and decoder layers.
We also consider the following single-channel methods: 1) mTNet–1DP and mTNet–2DP, transformer masking-based networks as proposed in with either one or two dual-path transformers, respectively. 2) mRUNet and mRUNet–Res, recurrent convolutional masking-based networks with U structure, as in . The latter uses residual connections as proposed in TRUNet. 3) mRNet and mRNet–Res, recurrent masking-based networks using 4 BLSTM layers, as in [12, 15], the latter using residual connections as well. 4) mTRUNet and mTRNet, consisting of mTNet–1DP followed by mRUNet and mRNet–Res. 5) sTNet–1DP–RealImag, exploiting a dual-path transformer for the real and imaginary parts separately to estimate a single-channel complex-valued filter per speaker.
3.3 Algorithmic parameters
We use an STFT frame length of 512 samples, an overlap of between successive frames, a Hann window and an FFT size .
For TNets and their extensions, we set the number of spatial transformer blocks (we also considered and , but no significant performance improvement was observed). In addition, all transformer blocks were used with positional encoding. For the networks using a recurrent convolutional network with U structure, we use a sequence of 5 layers, i.e., , with , , , and filters. For the networks using recurrent layers, BLSTM layers with 1200 units are used. For mTNet–1DP and mTNet–2DP, we use 4 layers of transformers in each path. We train all networks using the loss function , except where a different loss function is explicitly stated. All networks were trained with the Adam optimizer. In addition, we use gradient clipping with a maximum norm of 5, similarly as used in .
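Gradient clipping with a maximum norm of 5 can be sketched in NumPy as follows. The sketch assumes that the norm is the global L2 norm over all parameter gradients, which is the convention used, for example, by `torch.nn.utils.clip_grad_norm_`; the text does not state which norm was used, and the function name `clip_by_global_norm` is hypothetical.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm


# Large gradients are scaled down to the maximum norm ...
large = [np.full(3, 6.0), np.full(4, 8.0)]
clipped, norm_before = clip_by_global_norm(large, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

# ... while small gradients pass through unchanged.
small = [np.array([0.3, 0.4])]
untouched, _ = clip_by_global_norm(small, max_norm=5.0)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is limited.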
4 Experimental results
In this section, we evaluate the speech separation performance of the proposed networks in terms of the scale-invariant signal-to-distortion ratio (SDR) and the signal-to-interference ratio (SIR) of BSSEval and PESQ. In Section 4.1, we evaluate the performance of the proposed spatial transformer networks (TNets). In Section 4.2, we evaluate the performance of the proposed TNet extensions including TRUNet and benchmark them against multi-channel and single-channel baseline methods.
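As an illustration of a scale-invariant SDR and of how an improvement over the unprocessed mixture is obtained, consider the following NumPy sketch. It uses the common projection-based definition of the scale-invariant SDR, which is an assumption here; the BSSEval toolkit decomposes the estimate differently, and the SIR in particular requires the interfering sources, so this sketch is not a substitute for it.

```python
import numpy as np


def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB for two 1-D time-domain signals."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(residual, residual) + eps))


rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
interference = rng.standard_normal(16000)

mixture = reference + interference         # unprocessed input, roughly 0 dB
estimate = reference + 0.1 * interference  # toy "separated" signal, roughly 20 dB

improvement = si_sdr(reference, estimate) - si_sdr(reference, mixture)
```

Reporting the improvement over the mixture rather than the absolute value removes the dependence on the input condition of each test utterance.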
4.1 Spatial processing network source separation performance
In Table 1, the source separation performance of TNets using different queries, keys and values and different numbers of attention heads is compared with DBNet, which uses conventional beamformers. We observe that TNets result in larger SDR and SIR improvements compared to DBNet. The largest SIR improvement, indicating how well the speakers are separated, is obtained by TNet–MagPhase, using separate spatial transformers for the spectral phase and the spectral magnitude with the complex-valued dot product. In addition, we observe that a lower number of attention heads consistently results in a larger SIR. A lower number of attention heads results in more global attention across spectrum sub-spaces as well as a larger embedding size. Nevertheless, the improvement of TNets is limited, particularly for the SIR, which can be mainly attributed to the limited capability of TNets to efficiently exploit spectral and temporal diversities in addition to the spatial diversity. From now on, we focus on systems using heads, as they outperform the other settings.
4.2 TRUNet source separation performance
[Table 2. Row groups: spatial transformer networks (TNet); TNet extensions using spectro-temporal processing.]
Table 2 shows the source separation performance of TNet extensions versus TNets and the multi-channel baseline methods. We observe that TNet extensions using spectro-temporal processing networks result in larger performance measures, particularly for the SIR and the PESQ improvements, compared to TNets. Nevertheless, we observe that only some TNet extensions (TRNet–Cat*, TRUNet–Cat*, TRUNet–RealImag*, TRUNet–MagPhase*) result in a larger SDR improvement (about dB) and a larger SIR improvement ( dB) compared to the multi-channel baseline methods. Therefore, we investigate the main factors contributing to the performance measure improvement of TNet extensions in the remainder.
We observe that among TNet extensions, those using recurrent U networks (TRUNet–Cat, TRUNet–Cat*) largely outperform networks using only a BLSTM recurrent network (TRNet–Cat, TRNet–Cat*), with a small gain of dB for the SDR improvement and a large gain of dB for the SIR improvement. In addition, we observe that the TNet extensions with single-channel filtering (TRNet–Cat*, TRUNet–Cat*) consistently obtain a larger SIR improvement with a gain of dB and a larger PESQ improvement with a gain of 0.9 compared to TNet extensions with multi-channel filtering (TRNet–Cat, TRUNet–Cat). This may imply that, for the considered networks, single-channel filtering is sufficient and summing all microphone spectra after filtering may be unnecessary. Among the TNet extensions with single-channel filtering, the network using the spectral phase and the spectral magnitude (TRUNet–MagPhase*) obtains the best SDR and SIR improvements, with a gain of dB for the SDR improvement and dB for the SIR improvement over TRUNet–RealImag*, which uses spatial transformers for the real and imaginary parts. In addition, TRUNet–MagPhase* obtains the largest performance measures even compared to all other multi-channel methods, indicating the important role of the recurrent U-structured convolutional network in capturing spectro-temporal information and of the spatial transformers in capturing spatial information. In the remainder, we will focus on the evaluation of TRUNet–MagPhase*.
We further compare the source separation performance of TRUNet–MagPhase* with the single-channel baseline methods (see Table 3). We observe that although all considered methods result in a similar SDR improvement of about dB, TRUNet–MagPhase* stands out with a larger SIR improvement of and a PESQ improvement of . Among all considered single-channel methods, the network estimating single-channel complex-valued filters using a dual-path transformer (sTNet–1DP–RealImag) obtains the lowest SIR improvement. As an interesting observation, when mRNet and mRUNet are used with residual connections (mRNet–Res and mRUNet–Res), the SIR improvement significantly increases by more than 1 dB.
As noted in Table 1 and Table 2, among all considered multi-channel methods, TRUNet–MagPhase* is the network with the smallest model size while it outperforms all others. As for single-channel methods, mTNet–1DP has the smallest model size while obtaining high SDR and SIR measures. Nevertheless, the SIR and the PESQ improvement of mTNet–1DP are still about dB and lower than TRUNet–MagPhase*.
Finally, we explore the impact of the exponent factor on the source separation performance of TRUNet–MagPhase*. Figure 3 depicts SDR and SIR improvements, when the compressed loss function is used. Smaller exponent factors are shown to obtain an increasing SIR improvement (about to dB) and a decreasing SDR improvement (about to dB), compared to large factors. In order to achieve large improvements for both SDR and SIR, we finally experiment with linearly combining two cMSE losses with complementary exponent factors, i.e. . When combining the compressed loss functions using and the combination factor , we obtain additional improvements resulting in final SDR and SIR of 13.40 dB and 9.70 dB, respectively.
5 Conclusion
We proposed an end-to-end multi-channel source separation network that directly estimates multi-channel filters from multi-channel input spectra. The network consists of a spatial processing network using spatial transformers and a spectro-temporal processing network using a recurrent U-structured convolutional network. In addition to multi-channel filters, we also considered estimating single-channel filters from multi-channel input spectra using TRUNet. We trained the network using the cMSE loss function on a large reverberant dataset, and tested it on realistic data using measured RIRs from an actual microphone array. The experimental results show that both the proposed spatial processing network and the spectro-temporal processing network are crucial to obtain competitive performance. In particular, the results show that the proposed spatial processing network using transformers with the complex-valued dot product for the spectral magnitude and the spectral phase results in a larger SIR improvement compared to using transformers for the real and imaginary parts. In addition, the spectro-temporal processing network using a recurrent U-structured convolutional network obtains a larger separation performance than using only a BLSTM recurrent network. Moreover, the results show that our proposed architecture TRUNet achieves a larger separation performance with single-channel filtering than with multi-channel filtering, and an even larger performance than that obtained by single-channel source separation methods.
References
[1] (1979-04) Image method for efficiently simulating small-room acoustics. 65 (4), pp. 943–950.
[2] (2021-06) DBNet: DOA-driven beamforming network for end-to-end reverberant sound source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 211–215.
[3] (2021-06) Towards efficient models for real-time deep noise suppression. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 656–660.
[4] The QUT-NOISE-TIMIT corpus for evaluation of voice activity detection algorithms. In Proceedings of the Annual Conference of the International Speech Communication Association, pp. 3110–3113.
[5] Device and produced speech (DAPS) dataset. https://ccrma.stanford.edu/~gautham/Site/daps.html
[6] (2015-03) Multichannel signal enhancement algorithms for assisted listening devices. IEEE Signal Processing Magazine 32 (2), pp. 18–30.
[7] (2018-07) Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Trans. Graph. 37 (4).
[8] A large-scale open-source acoustic simulator for speaker recognition. IEEE Signal Processing Letters 23 (4), pp. 527–531.
[9] (2017) A consolidated perspective on multimicrophone speech enhancement and source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (4), pp. 692–730.
[10] IEEE ICASSP 2021 Deep Noise Suppression (DNS) Challenge. https://github.com/microsoft/DNS-Challenge
[11] (2014) Adam: a method for stochastic optimization. arXiv preprint.
[12] Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (10), pp. 1901–1913.
[13] (2019-08) Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (8), pp. 1256–1266.
[14] (2020-05) Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 46–50.
[15] (2020) WHAMR!: noisy and reverberant single-channel speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 696–700.
[16] Blind and neural network-guided convolutional beamformer for joint denoising, dereverberation, and source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 6129–6133.
[17] Beam-TasNet: time-domain audio separation network meets frequency-domain beamformer. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 6384–6388.
[18] FurcaNet: an end-to-end deep gated convolutional, long short-term memory, deep neural networks for single channel speech separation. arXiv preprint.
[19] (2021-06) Attention is all you need in speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 21–25.
[20] (2019) Real-time speech enhancement using an efficient convolutional recurrent network for dual-microphone mobile phones in close-talk scenarios. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5751–5755.
[21] (2017) Attention is all you need. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS), Red Hook, NY, USA, pp. 6000–6010.
[22] (2021-06) TSTNN: two-stage transformer based neural network for speech enhancement in the time domain. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 7098–7102.
[23] (2013-01) Effects of spatial and temporal integration of a single early reflection on speech intelligibility. The Journal of the Acoustical Society of America 133 (1), pp. 269–282.
[24] (2019-09) WHAM!: extending speech separation to noisy environments. In Proceedings of INTERSPEECH.
[25] (2018) Exploring tradeoffs in models for low-latency speech enhancement. In Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC), pp. 366–370.
[26] (2019-05) Differentiable consistency constraints for improved deep speech enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 900–904.
[27] (2021-09) Generalized spatio-temporal RNN beamformer for target speech separation. In Proceedings of INTERSPEECH, Brno, Czechia, pp. 3076–3080.
[28] (2019) CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92), [sound]. University of Edinburgh. The Centre for Speech Technology Research.