DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement

by Yuma Koizumi, et al.

Single-channel speech enhancement (SE) is an important task in speech processing. A widely used framework combines an analysis/synthesis filterbank with a mask prediction network, such as the Conv-TasNet architecture. In such systems, the denoising performance and computational efficiency are mainly affected by the structure of the mask prediction network. In this study, we aim to improve the sequential modeling ability of Conv-TasNet architectures by integrating Conformer layers into a new mask prediction network. To make the model computationally feasible, we extend the Conformer using linear complexity attention and stacked 1-D dilated depthwise convolution layers. We trained the model on 3,396 hours of noisy speech data, and show that (i) the use of linear complexity attention avoids high computational complexity, and (ii) our model achieves higher scale-invariant signal-to-noise ratio than the improved time-dilated convolution network (TDCN++), an extended version of Conv-TasNet.



1 Introduction

Speech enhancement (SE) is the task of recovering target speech from a noisy signal [33]. In addition to its applications in telephony and video conferencing [28], single-channel SE is a basic component in larger systems, such as multi-channel SE [7, 12], multi-modal SE [8, 1, 9, 31], and automatic speech recognition (ASR) [6, 17, 21] systems. Therefore, it is important to improve both the denoising performance and the computational efficiency of single-channel SE.

In recent years, rapid progress has been made on SE using deep neural networks (DNNs) [33]. Conv-TasNet [23] is a powerful model for SE that uses a combination of trainable analysis/synthesis filterbanks [26] and a mask prediction network built from stacked 1-D dilated depthwise convolution (1D-DDC) layers. Since the denoising performance and computational efficiency are mainly determined by the mask prediction network, one of the main research topics in SE is improving the mask prediction architecture [15, 36, 3, 29, 34, 22, 2]. For example, the improved time-dilated convolution network (TDCN++) [15, 36] extends Conv-TasNet to improve SE performance.

A promising candidate for improving mask prediction networks is the Conformer architecture. The Conformer [10] architecture has been shown to be effective in ASR [10], diarization [24], and sound event detection [25, 11]. Conformer is derived from the Transformer [32] architecture by including 1-D depthwise convolution layers to enable more effective sequential modeling.

In this paper, we combine Conformer layers with the dilated convolution layers of the TDCN++ architecture. However, this introduces two critical problems related to the short window and hop sizes used in trainable analysis/synthesis filterbanks. The first problem is a large computational cost, because the time-complexity of the multi-head self-attention (MHSA) in the Conformer has a quadratic dependence on sequence length. The second is that the small hop-size between neighboring time-frames reduces the temporal reach of sequential modeling when using temporal convolution layers.

In order to make the model computationally feasible, we use a linear-complexity variant of self-attention in the Conformer, known as fast attention via positive orthogonal random features (FAVOR+), as used in Performer [5]. These ideas are partly inspired by the local-global network for speaker diarization using a time-dilated convolution network (TDCN) [24] which shows that the combination of a linear complexity self-attention and a TDCN improves both local and global sequential modeling. We show in experiments below that the resulting model, which we call the dilated FAVOR Conformer (DF-Conformer), achieves better enhancement fidelity than the TDCN++ of comparable complexity.

2 Preliminaries

2.1 Conv-TasNet and its extensions for speech enhancement

Let the $T$-sample time-domain observation $x \in \mathbb{R}^T$ be a mixture of a target speech $s$ and noise $n$ as $x = s + n$, where $n$ is assumed to be environmental noise and does not include interfering speech signals. The goal of SE is to recover $s$ from $x$.

In mask-based SE, a mask $M$ is estimated using a mask prediction network and applied to the representation of $x$ encoded by an encoder; the estimated signal is then re-synthesized using a decoder. The enhancement procedure can be written as

$\hat{s} = \mathrm{Dec}\left( M \odot \mathrm{Enc}(x) \right), \qquad M = \mathcal{M}\left( \mathrm{Enc}(x) \right),$

where $\mathrm{Enc}$ and $\mathrm{Dec}$ are the signal encoder and decoder, respectively, $N$ is the encoder output dimension, $\odot$ is element-wise multiplication, and $\mathcal{M}$ is the mask prediction network. Early studies used the short-time-Fourier-transform (STFT) and the inverse-STFT (iSTFT) as encoder and decoder [6, 18], respectively. More recent studies use a trainable encoder/decoder [23], which are often called trainable “filterbanks” [27], e.g. in Asteroid [26].
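As a concrete illustration, the encode-mask-decode procedure above can be sketched in NumPy. The random basis, the random projection plus sigmoid, and all shapes below are illustrative stand-ins for the paper's trained filterbank and mask prediction network, not the actual components:

```python
import numpy as np

def frame(x, win, hop):
    """Slice x into overlapping frames of length `win` with stride `hop` -> (T', win)."""
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

def overlap_add(frames, hop, length):
    """Inverse of frame(): sum overlapping frames back into a length-`length` signal."""
    out = np.zeros(length)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + len(f)] += f
    return out

rng = np.random.default_rng(0)
sr = 16000
win, hop = 40, 20                                  # 2.5 ms window, 1.25 ms hop at 16 kHz
N = 256                                            # encoder output dimension

x = rng.standard_normal(sr)                        # 1 s of "noisy" audio
B_enc = 0.1 * rng.standard_normal((win, N))        # stand-in for a trained encoder basis
W_mask = 0.01 * rng.standard_normal((N, N))        # stand-in for the mask prediction network

Z = np.maximum(frame(x, win, hop) @ B_enc, 0.0)    # encoder: framing + basis + ReLU
M = 1.0 / (1.0 + np.exp(-(Z @ W_mask)))            # mask in (0, 1) via logistic sigmoid
s_hat = overlap_add((M * Z) @ B_enc.T, hop, len(x))  # decoder: masked features -> waveform
```

With a 1.25 ms hop, a 1-second input already yields roughly 800 time-frames, which is the sequence length the mask prediction network must model.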

One of the main research topics in SE is the design of the network architecture of $\mathcal{M}$, because the performance and computational efficiency of SE are mainly determined by the structure of $\mathcal{M}$. Conv-TasNet [23] is a powerful model for speech separation and SE whose $\mathcal{M}$ consists of stacked 1D-DDC layers. TDCN++ [15, 36] is an extension of Conv-TasNet; its main differences from Conv-TasNet are the use of instance normalization instead of global layer normalization and the addition of explicit scale parameters after each dense layer. The pseudo-code for $\mathcal{M}$ in TDCN++ is shown in Algorithm 1. TDCN++ consists of stacked TDCN-blocks, and each TDCN-block mainly consists of two dense layers for frame-wise feature modeling and one 1D-DDC layer for sequence modeling. The dilation factor increases exponentially to ensure a temporal context window large enough to exploit the long-range dependencies of the speech signal, and the TDCN-blocks are repeated several times. The time complexity of TDCN++ is roughly linear in the number of frames $T'$ when $T' \gg D$, where $D$ is the input dimension of the TDCN-blocks.
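The effect of the exponentially increasing dilation factor on the temporal context window can be checked with a few lines of arithmetic; the kernel size of 3 and a single repeat of 8 blocks are illustrative assumptions:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in frames, of a stack of dilated 1-D convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# One repeat of 8 blocks with dilations 2^0 .. 2^7 and kernel size 3 (illustrative values).
dilations = [2 ** i for i in range(8)]
rf_frames = receptive_field(3, dilations)   # 1 + 2 * (2^8 - 1) = 511 frames
rf_ms = rf_frames * 1.25                    # ~639 ms at a 1.25 ms hop
```

With linearly spaced (non-exponential) dilations, the same depth and kernel size would cover only a few tens of frames, which is why the exponential schedule matters at such small hop sizes.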

Algorithm 1: Pseudo-code of $\mathcal{M}$ in TDCN++ [15, 36], where $\sigma$ denotes the logistic sigmoid function; MaskPredictorOfTDCN++ loops over stacked TdcnBlock calls and applies $\sigma$ to produce the mask. See [15, 36] for the full listing.

2.2 Conformer

Conformer [10] is a model derived from the Transformer [32] that was originally proposed for ASR [10] and later adopted in audio-related applications such as audio event detection [25, 11] and speech separation [4]. The structure of the Conformer is similar to that of TDCN++, in that it consists of stacked Conformer-blocks [10]. Algorithm 2 shows the pseudo-code of a Conformer-block. Comparing Algorithms 1 and 2, we can see that the constituent layers of the Conformer-block and the TDCN-block are also similar: one Conformer-block mainly consists of several dense layers for frame-wise feature modeling, and one 1-D depthwise convolution layer and one MHSA-module for sequence modeling [10]. One of the main differences between the TDCN-block and the Conformer-block is the MHSA-module: the Conformer enables global sequence modeling by using MHSA-modules instead of dilated depthwise convolution layers with local receptive fields.

Algorithm 2: Pseudo-code of a Conformer-block [10], where BN denotes batch normalization; see [10] for details of the MHSA- and convolution-modules.

3 Proposed method

In this section, we first describe two problems that arise when incorporating the Conformer into the TDCN++ framework in Sec. 3.1; our solutions to these problems are described in Secs. 3.2 and 3.3, respectively.

3.1 Model structure and computational challenges

Based on the successes of Conformer in speech-related tasks, we aim to replace the TDCN blocks in TDCN++ with Conformer-blocks. Unfortunately, the simple combination of trainable filterbanks and Conformer-blocks causes two critical problems. These problems are caused by the short window size of 2.5 ms and hop size of 1.25 ms used in trainable filterbanks for short-time analysis of the input signal.

Problem 1: Computational complexity. The computational cost of the MHSA-module is quadratic in the number of frames $T'$. In the original Conformer model [10], convolutional subsampling limits the size of $T'$. For example, for a 1-second signal, $T'$ is 25. In contrast, for TDCN++ with a 1.25 ms hop, the same signal results in $T' = 800$.

Problem 2: The receptive field for sequence modeling is insufficient. The original Conformer uses a hop-size of 40 ms, while the standard trainable filterbank uses a hop-size of 1.25 ms. This means that the receptive field of the depthwise convolution spans only 6.25 ms when using the default kernel size of 5, which may degrade the accuracy of the analysis of local changes in the signal.
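Both problems follow directly from hop-size arithmetic, which can be verified in a few lines (the kernel size of 5 is the Conformer default mentioned above):

```python
def frames_per_second(hop_ms):
    """Number of time-frames produced per second of audio at a given hop size."""
    return round(1000.0 / hop_ms)

def conv_reach_ms(kernel_size, hop_ms, dilation=1):
    """Temporal span, in milliseconds, covered by one depthwise convolution kernel."""
    return (1 + (kernel_size - 1) * dilation) * hop_ms

# Problem 1: sequence length seen by the MHSA-module for 1 s of audio.
t_conformer = frames_per_second(40.0)    # 25 frames with the original Conformer hop
t_filterbank = frames_per_second(1.25)   # 800 frames with the trainable filterbank hop

# Problem 2: temporal reach of a kernel-size-5, non-dilated depthwise convolution.
reach = conv_reach_ms(5, 1.25)           # only 6.25 ms at the 1.25 ms hop
```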

One possible approach is the dual-path approach [3, 29, 34], which is equivalent to using sparse and block-diagonal attention matrices corresponding to the inter- and intra-transformers, respectively. Instead, we use the FAVOR+ attention introduced in Performer [5], which has linear computational complexity $\mathcal{O}(T')$. The novelty of our approach comes from using linear-complexity FAVOR+ attention to replace softmax dot-product attention, as well as performing local analysis with 1D-DDC layers to replace the non-dilated convolutions in the Conformer. Based on these two characteristics, we name our $\mathcal{M}$ the dilated-FAVOR Conformer (DF-Conformer), and a $J$-layer DF-Conformer is referred to as DF-Conformer-$J$. The pseudo-code of DF-Conformer-$J$ is shown in Algorithm 3. The time complexity of DF-Conformer-$J$ is also roughly linear in $T'$ when $T' \gg D$.

Algorithm 3: Pseudo-code of $\mathcal{M}$ using DF-Conformer-$J$; the differences from TDCN++ and the Conformer-block are the use of the MHSA-FAVOR-module and of the 1D-DDC layer with dilation factor $d$.

3.2 Linear time-complexity MHSA-module using FAVOR+

Recently, many extended Transformer architectures have been proposed to improve computational and memory efficiency [30, 14]. Performer [5] is one of them: a Transformer architecture that uses FAVOR+. In self-attention, the query $Q$, key $K$, and value $V$ matrices are combined as $\mathrm{softmax}(QK^\top)V$ (row-wise softmax, with the scaling factor omitted). In FAVOR+, this is approximated as $\hat{D}^{-1}\left(\phi(Q)\left(\phi(K)^\top V\right)\right)$, for a suitable feature map $\phi$ applied to the rows of each matrix, avoiding the quadratic term $QK^\top$. Here, $\hat{D} = \mathrm{diag}\left(\phi(Q)\left(\phi(K)^\top \mathbf{1}_{T'}\right)\right)$ is a normalizing diagonal matrix, with $\mathbf{1}_{T'}$ an all-ones vector. This approximation is made accurate in FAVOR+ by using a random-projection-based, non-negative-valued $\phi$ of a suitable size [5]. To implement this idea, we replace the softmax dot-product self-attention in Algorithm 2 with FAVOR+ self-attention. Hereafter, we refer to this new module as the “MHSA-FAVOR-module”.
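A minimal sketch of this linear-complexity attention, using a plain Gaussian random projection in place of FAVOR+'s orthogonal random features (the sequence length, head dimension, and feature count are illustrative assumptions):

```python
import numpy as np

def positive_random_features(X, W):
    """Positive feature map for the softmax kernel: exp(x W - ||x||^2 / 2) / sqrt(m)."""
    m = W.shape[1]
    return np.exp(X @ W - 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

def linear_attention(Q, K, V, W):
    """Approximates softmax(Q K^T) V in time linear in the sequence length."""
    phi_q = positive_random_features(Q, W)
    phi_k = positive_random_features(K, W)
    numer = phi_q @ (phi_k.T @ V)               # never materializes the T' x T' matrix
    denom = phi_q @ phi_k.sum(axis=0)[:, None]  # the D-hat^{-1} normalizer, shape (T', 1)
    return numer / denom

rng = np.random.default_rng(0)
T, d, m = 800, 16, 128                          # frames, head dim, random features
Q, K, V = (0.1 * rng.standard_normal((T, d)) for _ in range(3))
W = rng.standard_normal((d, m))
out = linear_attention(Q, K, V, W)              # shape (800, 16), O(T') cost
```

Because the implied attention weights are normalized by the $\hat{D}^{-1}$ term, a constant value matrix passes through unchanged, which is a convenient sanity check for the normalizer.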

3.3 Use of dilated depthwise convolution in Conformer

We strengthen the network’s temporal analysis capability by using 1D-DDC layers instead of the standard 1-D depthwise convolution used in the Conformer-blocks. As in TDCN++, we use an exponentially increasing dilation factor $d$. To implement this idea, the DF-Conformer-block also takes $d$ as an argument, which is passed to the 1D-DDC layer as the dilation parameter.
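The 1D-DDC operation itself is straightforward; the following NumPy sketch applies one filter per channel with a dilation factor and "same" padding (the kernel size, channel count, and dilation below are illustrative assumptions):

```python
import numpy as np

def dilated_depthwise_conv1d(X, kernels, dilation):
    """Per-channel ("depthwise") 1-D convolution with a dilation factor.

    X: (T', C) feature sequence; kernels: (k, C), one filter per channel;
    zero "same" padding keeps the output length equal to the input length.
    """
    T, C = X.shape
    k = kernels.shape[0]
    pad = (k - 1) * dilation // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    out = np.zeros_like(X)
    for tap in range(k):
        out += Xp[tap * dilation : tap * dilation + T] * kernels[tap]
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((800, 32))                     # 1 s of features at a 1.25 ms hop
kernels = rng.standard_normal((3, 32))
Y = dilated_depthwise_conv1d(X, kernels, dilation=8)   # reach: 1 + 2*8 = 17 frames
```

Compared with a dense convolution, the depthwise form mixes no channels, which is why the Conformer- and TDCN-blocks surround it with dense layers for frame-wise feature modeling.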

Following a similar strategy to [24] and the DF-Conformer, the MHSA-FAVOR-module can also be incorporated into the TDCN-block. As an alternative network architecture, we insert an MHSA-FAVOR-module into each TDCN-block of Algorithm 1, and refer to the result as “Conv-Tasformer”.

4 Experiments

We conducted ablation studies and objective experiments in Secs. 4.2 and 4.3, respectively. Audio demos are available at google.github.io/df-conformer/waspaa2021/.

4.1 Experimental setup

Dataset: We used the same dataset as in the SE experiment of [36]. This dataset uses speech from LibriVox (librivox.org) and non-speech sounds from freesound.org. The duration of all samples was 3 s, and the sampling rate was 16 kHz. The training, validation, and test datasets consisted of 4,076,102 (3396.8 hours), 7,417 (6.2 hours), and 7,387 (6.2 hours) examples, respectively. We mixed speech and noise samples in the same manner as [35]. The minimum and maximum signal-to-noise ratios (SNRs) of the noisy input were dB and dB, respectively, and the average extended short-time objective intelligibility (ESTOI) [13] score was 63.7%.

Loss function: We estimated masks for both speech and noise in the same manner as [36, 17]. Each mask was multiplied with the encoder output and re-synthesized to the time domain using the same decoder. A mixture consistency projection layer [35] was applied to ensure that the mixture of the estimated speech and noise equals the noisy input. Finally, the negative thresholded SNR loss [36], in which a soft threshold clamps the loss at a maximum SNR value, was calculated for both speech and noise, and the two terms were mixed with weights of 0.8 for speech and 0.2 for noise.
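A sketch of the negative thresholded SNR loss under the standard soft-threshold formulation; the maximum SNR value of 30 dB is an illustrative assumption, not necessarily the paper's setting:

```python
import numpy as np

def neg_thresholded_snr(s, s_hat, snr_max_db=30.0):
    """Negative soft-thresholded SNR: -10 log10(||s||^2 / (||s - s_hat||^2 + tau ||s||^2)).

    tau = 10**(-snr_max_db / 10) clamps the loss at -snr_max_db dB even for a
    perfect estimate; snr_max_db = 30 is an assumed value for illustration.
    """
    tau = 10.0 ** (-snr_max_db / 10.0)
    num = np.sum(s ** 2)
    return -10.0 * np.log10(num / (np.sum((s - s_hat) ** 2) + tau * num))

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)
loss_perfect = neg_thresholded_snr(s, s)                       # clamped at -30 dB
loss_noisy = neg_thresholded_snr(s, s + 0.1 * rng.standard_normal(16000))
# The final training loss mixes the speech and noise terms with weights 0.8 and 0.2.
```

The clamp keeps near-perfect examples from dominating the gradient: once the estimate exceeds the threshold SNR, the loss saturates and training effort shifts to harder examples.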

Comparison of methods and hyper-parameters: For the ablation studies in Sec. 4.2, we used three Conformer-based models. The first model is Conformer-$J$, which simply replaces the TDCN-blocks in TDCN++ with Conformer-blocks. The second model is F-Conformer-$J$, which uses only FAVOR+ from DF-Conformer-$J$. The last model is Conformer-$J$-STFT, which uses the STFT and iSTFT as encoder and decoder, respectively. For the Conformer-$J$-STFT models, we estimated a complex-valued mask [19]. We could not increase the number of parameters of Conformer-$J$ because of its computational complexity; therefore, we used two different model sizes, 3.7M and 8.75M parameters. The former size was determined by the maximum Conformer-$J$ model that can be trained on third-generation Tensor Processing Units (TPUv3); the latter is the size of the TDCN++ used in previous studies [15, 36]. Different hyper-parameter settings were used for the 3.7M and 8.75M models, while the same numbers of attention heads and random projection features were used in FAVOR+ for both model sizes.

For the SE performance evaluation in Sec. 4.3, we compared DF-Conformer-$J$ and Conv-Tasformer with TDCN++ [15, 36] to confirm the superiority of the proposed models over their base model. For TDCN++, we used the same settings as in [35]. For Conv-Tasformer, we used the same settings as TDCN++, except with reduced hyper-parameter values to keep the number of parameters comparable.

For all models, the window and hop sizes of the trainable filterbanks were 2.5 ms and 1.25 ms, respectively. For the STFT, the window and hop sizes were 30 ms and 10 ms, respectively, and the fast-Fourier-transform size was 512. All models were trained for 500k steps on 128 Google TPUv3 cores with a global batch size of 512. We used the Adam optimizer [16] with weight decay 1e-6 and the Transformer learning-rate schedule [32], in which the learning rate depends on the number of training steps. We clipped the gradient by global norm to 5.0, and stored a separate checkpoint of exponential-moving-averaged weights accumulated over the training steps with a decay rate of 0.9999.
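The learning-rate schedule of [32] can be written as follows; the model dimension and warmup length below are illustrative assumptions, since the paper's exact values are not stated here:

```python
def transformer_lr(step, d_model=256, warmup=10000):
    """Transformer schedule [32]: lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).

    d_model and warmup are assumed values for illustration; the learning rate
    rises linearly for `warmup` steps, then decays as the inverse square root.
    """
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak = transformer_lr(10000)   # the schedule peaks exactly at the warmup step
```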

4.2 Evaluation of FAVOR+

Figure 1: Comparison of RTF. (a) RTF of Conformer-4 increases as duration of input waveform increases, whereas that of F-Conformer-4 becomes constant. (b) RTFs of DF-Conformer-8 and TDCN++ are comparable, whereas that of Conv-Tasformer is larger than others due to additional MHSA-FAVOR-block.

To confirm the effect of FAVOR+, we compared the real-time factors (RTFs) of Conformer-4-STFT, Conformer-4, and F-Conformer-4 using one CPU. Figure 1 (a) shows the results. For Conformer-4-STFT, the RTF does not increase significantly because $T'$ is small in our STFT setting, so the MHSA-module remains feasible. In contrast, the RTF of Conformer-4 grows with the input duration because $T'$ is large in our trainable filterbank setting and the MHSA-module is quadratic in $T'$. Since the time complexity of FAVOR+ is proportional to $T'$, F-Conformer-4 solves this problem.
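The RTF used here is the usual ratio of wall-clock processing time to audio duration; a minimal measurement sketch, with a toy moving-average filter standing in for an SE model:

```python
import time
import numpy as np

def real_time_factor(process_fn, x, sr=16000):
    """RTF = wall-clock processing time / audio duration (RTF < 1: faster than real time)."""
    t0 = time.perf_counter()
    process_fn(x)
    return (time.perf_counter() - t0) / (len(x) / sr)

x = np.random.default_rng(0).standard_normal(16000)    # 1 s of audio at 16 kHz
rtf = real_time_factor(lambda sig: np.convolve(sig, np.ones(64) / 64, mode="same"), x)
```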

Model #Params SI-SNRi ESTOI RTF
Conformer-4-STFT 3.82 M 12.47 83.4 0.02
Conformer-4 3.74 M 13.91 84.8 0.31
F-Conformer-4 3.59 M 12.40 80.5 0.06
Conformer-8-STFT 9.30 M 12.64 84.5 0.03
F-Conformer-8 8.83 M 13.81 83.7 0.13
Table 1: Results of evaluation for FAVOR+. Prefix “F” means the use of FAVOR+, and postfix “STFT” means the use of STFT and iSTFT for and , respectively.

We also compared these methods using two objective metrics: the scale-invariant SNR improvement (SI-SNRi) [20] and ESTOI. Table 1 shows the results. Comparing Conformer-4-STFT and Conformer-4, the use of a trainable filterbank achieved higher scores than the STFT, similar to previous studies [15]. With the small model size, the SI-SNRi score of F-Conformer-4 was almost the same as that of Conformer-4-STFT. Meanwhile, with the 8.75M models, the SI-SNRi of F-Conformer-8 was 1.2 dB higher than that of Conformer-8-STFT, and their ESTOI scores were almost comparable. These results suggest that FAVOR+ can achieve high time-domain SE performance with a larger model while avoiding the increase in computational complexity.

4.3 Objective evaluation

Model #Params SI-SNRi ESTOI RTF
TDCN++ [36] 8.75 M 14.10 85.7 0.10
Conv-Tasformer 8.71 M 14.36 85.6 0.25
DF-Conformer-8 8.83 M 14.43 85.4 0.13
iTDCN++ [36] 17.6 M 14.84 87.1 0.22
iConv-Tasformer 17.5 M 15.25 87.2 0.48
iDF-Conformer-8 17.8 M 15.28 87.1 0.26
iDF-Conformer-12 37.0 M 15.93 88.4 0.46
Table 2: Experimental results. Meaning of prefix and postfix are the same as Table 1. Additional prefixes “D” and “i” mean the use of 1D-DDC and iterative model, respectively.

We compared DF-Conformer-8, TDCN++, and Conv-Tasformer using SI-SNRi, ESTOI, and RTF. As the results in Table 2 show, DF-Conformer-8 and Conv-Tasformer achieved comparable scores, both higher than that of TDCN++. Also, comparing DF-Conformer-8 with F-Conformer-8 in Table 1, the use of 1D-DDC significantly improved the scores without increasing the RTF. These results suggest that using both 1D-DDC and FAVOR+ is effective for SE. We also compared the RTFs of these methods, as shown in Fig. 1 (b). The RTFs of DF-Conformer-8 and TDCN++ were comparable, whereas that of Conv-Tasformer was larger than the others due to the additional MHSA-FAVOR-modules. Therefore, when inserting FAVOR+ into the TDCN-block as in Conv-Tasformer, the position and number of MHSA-FAVOR-modules will need to be chosen carefully to improve computational efficiency.

We also compared the iterative extensions of these models [15]. Using the iterative models improved the scores of all methods, and the results showed the same tendency as the non-iterative models. Furthermore, we evaluated a larger model, iDF-Conformer-12, with 8 attention heads; its size was determined so that its RTF is comparable with that of iConv-Tasformer. As the results show, the scores clearly improved with the larger model; thus, the DF-Conformer appears to scale its performance with model size.

Figure 2: Examples of attention matrices in DF-Conformer-8. Spectrograms of the noisy input and enhanced output (top row), and attention matrices for the first and third (middle row) and last (bottom row) Conformer-blocks, calculated by $\hat{D}^{-1}\phi(Q)\phi(K)^\top$. The x and y axes of the attention matrices denote the key and query, respectively.

Finally, we point out three characteristics of the DF-Conformer’s attention matrices. First, none of the attention matrices has a local structure that focuses only on nearby time-frames. Second, most attention matrices in the earlier layers refer to low-SNR time-frames to capture the noise characteristics (e.g. Fig. 2, middle-left), or to time-frames with similar spectral structures (e.g. Fig. 2, middle-right). Third, some attention matrices of the deeper layers resemble the sum of a nearly diagonal matrix and a block matrix (e.g. Fig. 2, bottom). These results suggest that the earlier layers roughly analyze the speech and noise from the entire utterance, while the later layers refine the mask based on the local structure.

5 Conclusion

In this study, we proposed the DF-Conformer, a Conformer-based time-domain SE network. To improve the computational complexity and local sequential modeling, we extended the Conformer with a linear-complexity attention mechanism and 1-D dilated depthwise convolutions. Experimental results showed that (i) the use of linear-complexity attention solves the computational-complexity problem, and (ii) our model achieves higher performance than TDCN++. We therefore conclude that the DF-Conformer is an effective model for SE. Future work includes joint training of SE and ASR using an all-Conformer model, and a comparison with the dual-path methods [3, 29, 34] on the SE task.


  • [1] T. Afouras, J. S. Chung, and A. Zisserman (2018) The conversation: deep audio-visual speech enhancement. In Proc. Interspeech, Cited by: §1.
  • [2] S. Braun, H. Gamper, C. K. A. Reddy, and I. Tashev (2021) Towards efficient models for real-time deep noise suppression. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §1.
  • [3] J. Chen, Q. Mao, and D. Liu (2020) Dual-path transformer network: direct context-aware modeling for end-to-end monaural speech separation. In Proc. Interspeech, Cited by: §1, §3.1, §5.
  • [4] S. Chen, Y. Wu, Z. Chen, J. Wu, J. Li, T. Yoshioka, C. Wang, S. Liu, and M. Zhou (2020) Continuous speech separation with Conformer. arXiv:2008.05773. Cited by: §2.2.
  • [5] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, and A. Weller (2021) Rethinking attention with performers. In Proc. Int. Conf. Learn. Represent. (ICLR), Cited by: §1, §3.1, §3.2.
  • [6] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux (2015) Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §1, §2.1.
  • [7] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux (2016) Improved MVDR beamforming using single-channel mask prediction networks. In Proc. Interspeech, Cited by: §1.
  • [8] A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg (2018) Seeing through noise: visually driven speaker separation and enhancement. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §1.
  • [9] R. Gu, S. Zhang, Y. Xu, L. Chen, Y. Zou, and D. Yu (2020) Multi-modal multi-channel target speech separation. IEEE J. Sel. Top. Signal Process. 14 (3), pp. 530–541. Cited by: §1.
  • [10] A. Gulati, C.-C. Chiu, J. Qin, J. Yu, N. Parmar, R. Pang, S. Wang, W. Han, Y. Wu, Y. Zhang, and Z. Zhang (2020) Conformer: convolution-augmented transformer for speech recognition. In Proc. Interspeech, Cited by: §1, §2.2, §3.1, 2.
  • [11] T. Hayashi, T. Yoshimura, and Y. Adachi (2020) Conformer-based id-aware autoencoder for unsupervised anomalous sound detection. Technical report, DCASE2020 Challenge. Cited by: §1, §2.2.
  • [12] T. Higuchi, K. Kinoshita, N. Ito, S. Karita, and T. Nakatani (2018) Frame-by-frame closed-form update for mask-based adaptive MVDR beamforming. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §1.
  • [13] J. Jensen and C. H. Taal (2016) An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE Trans. Audio Speech Lang. Process. 24 (11), pp. 2009–2022. External Links: Document Cited by: §4.1.
  • [14] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret (2020) Transformers are RNNs: fast autoregressive transformers with linear attention. In Int. Conf. Mach. Learn. (ICML), Cited by: §3.2.
  • [15] I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson, J. Le Roux, and J. R. Hershey (2019) Universal sound separation. In Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), Cited by: §1, §2.1, §4.1, §4.1, §4.2, §4.3, 1.
  • [16] D. P. Kingma and J. L. Ba (2015) Adam: a method for stochastic optimization. In Proc. Int. Conf. Learn. Represent. (ICLR), Cited by: §4.1.
  • [17] K. Kinoshita, T. Ochiai, M. Delcroix, and T. Nakatani (2020) Improving noise robust automatic speech recognition with single-channel time-domain enhancement network. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §1, §4.1.
  • [18] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda (2018) DNN-based source enhancement to increase objective sound quality assessment score. IEEE/ACM Trans. Audio Speech Lang. Process. 26 (10), pp. 1780–1792. Cited by: §2.1.
  • [19] Y. Koizumi, K. Yatabe, M. Delcroix, Y. Masuyama, and D. Takeuchi (2020) Speech enhancement using self-adaptation and multi-head self-attention. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §4.1.
  • [20] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey (2019) SDR–Half-baked or well done?. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §4.2.
  • [21] C. Li, J. Shi, W. Zhang, A. S. Subramanian, X. Chang, N. Kamo, M. Hira, T. Hayashi, C. Boeddeker, Z. Chen, and S. Watanabe (2021) ESPnet-SE: end-to-end speech enhancement and separation toolkit designed for ASR integration. In Proc. IEEE Spok. Lang. Technol. Workshops (SLT), Cited by: §1.
  • [22] Y. Luo, C. Han, and N. Mesgarani (2021) Ultra-lightweight speech separation via group communication. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §1.
  • [23] Y. Luo and N. Mesgarani (2019) Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27 (8), pp. 1256–1266. Cited by: §1, §2.1, §2.1.
  • [24] S. Maiti, H. Erdogan, K. Wilson, S. Wisdom, S. Watanabe, and J. R. Hershey (2021) End-to-end diarization for variable number of speakers with local-global networks and discriminative speaker embeddings. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §1, §1, §3.3.
  • [25] K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda, and K. Takeda (2020) Conformer-based sound event detection with semi-supervised learning and data augmentation. In Proc. Detect. Classif. Acoust. Scenes Events Workshop (DCASE), Cited by: §1, §2.2.
  • [26] M. Pariente, S. Cornell, J. Cosentino, S. Sivasankaran, E. Tzinis, J. Heitkaemper, M. Olvera, F.-R. Stöter, M. Hu, J. M. Martín-Doñas, D. Ditter, A. Frank, A. Deleforge, and E. Vincent (2020) Asteroid: the PyTorch-based audio source separation toolkit for researchers. In Proc. Interspeech, Cited by: §1, §2.1.
  • [27] M. Pariente, S. Cornell, A. Deleforge, and E. Vincent (2020) Filterbank design for end-to-end speech separation. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §2.1.
  • [28] C. K. A. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan (2021) ICASSP 2021 deep noise suppression challenge. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §1.
  • [29] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong (2021) Attention is all you need in speech separation. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §1, §3.1, §5.
  • [30] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler (2020) Efficient Transformers: a survey. arXiv:2009.06732. Cited by: §3.2.
  • [31] E. Tzinis, S. Wisdom, A. Jansen, S. Hershey, T. Remez, D. P. W. Ellis, and J. R. Hershey (2021) Into the wild with AudioScope: unsupervised audio-visual separation of on-screen sounds. In Proc. Int. Conf. Learn. Represent. (ICLR), Cited by: §1.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), Cited by: §1, §2.2, §4.1.
  • [33] D. Wang and J. Chen (2018) Supervised speech separation based on deep learning: an overview. IEEE/ACM Trans. Audio Speech Lang. Process. 26 (10), pp. 1702–1726. External Links: ISSN 2329-9290 Cited by: §1, §1.
  • [34] K. Wang, B. He, and W.-P. Zhu (2021) TSTNN: two-stage Transformer based neural network for speech enhancement in the time domain. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §1, §3.1, §5.
  • [35] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous (2020) Differentiable consistency constraints for improved deep speech enhancement. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP), Cited by: §4.1, §4.1, §4.1.
  • [36] S. Wisdom, E. Tzinis, H. Erdogan, R. Weiss, K. Wilson, and J. R. Hershey (2020) Unsupervised sound separation using mixture invariant training. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), Cited by: §1, §2.1, §4.1, §4.1, §4.1, §4.1, Table 2, 1.