Speech enhancement (SE) is the task of recovering target speech from a noisy signal. In addition to its applications in telephony and video conferencing, single-channel SE is a basic component in larger systems, such as multi-channel SE [7, 12], multi-modal SE [8, 1, 9, 31], and automatic speech recognition (ASR) [6, 17, 21] systems. Therefore, it is important to improve both the denoising performance and the computational efficiency of single-channel SE.
In recent years, rapid progress has been made on SE using deep neural networks (DNNs). Conv-TasNet is a powerful model for SE that uses a combination of trainable analysis/synthesis filterbanks and a mask prediction network using stacked 1-D dilated depthwise convolution (1D-DDC) layers. Since the denoising performance and computational efficiency are mainly affected by the mask prediction network, one of the main research topics in SE is improving the mask prediction architecture [15, 36, 3, 29, 34, 22, 2]. For example, the improved time-dilated convolution network (TDCN++) [15, 36] extended Conv-TasNet to improve SE performance.
A promising candidate for improving mask prediction networks is the Conformer architecture. The Conformer architecture has been shown to be effective in ASR, diarization, and sound event detection [25, 11]. Conformer is derived from the Transformer architecture by including 1-D depthwise convolution layers to enable more effective sequential modeling.
In this paper, we combine Conformer layers with the dilated convolution layers of the TDCN++ architecture. However, this introduces two critical problems related to the short window and hop sizes used in trainable analysis/synthesis filterbanks. The first problem is the large computational cost, because the time complexity of multi-head self-attention (MHSA) in the Conformer is quadratic in the sequence length. The second problem is that the small hop size between neighboring time frames reduces the temporal reach of sequential modeling when using temporal convolution layers.
To make the model computationally feasible, we use a linear-complexity variant of self-attention in the Conformer, known as fast attention via positive orthogonal random features (FAVOR+), as used in Performer. These ideas are partly inspired by the local-global network for speaker diarization using a time-dilated convolution network (TDCN), which shows that the combination of a linear-complexity self-attention and a TDCN improves both local and global sequential modeling. We show in the experiments below that the resulting model, which we call the dilated FAVOR Conformer (DF-Conformer), achieves better enhancement fidelity than a TDCN++ of comparable complexity.
2.1 Conv-TasNet and its extensions on speech enhancement
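Let the $T$-sample time-domain observation $\mathbf{x} \in \mathbb{R}^{T}$ be a mixture of a target speech $\mathbf{s}$ and noise $\mathbf{n}$ as $\mathbf{x} = \mathbf{s} + \mathbf{n}$, where $\mathbf{n}$ is assumed to be environmental noise and does not include interference speech signals. The goal of SE is to recover $\mathbf{s}$ from $\mathbf{x}$.
In mask-based SE, a mask is estimated using a mask prediction network and applied to the representation of $\mathbf{x}$ encoded by an encoder; the estimated signal is then re-synthesized using a decoder. The enhancement procedure can be written as
$$\hat{\mathbf{s}} = \mathrm{Dec}\left(\mathrm{Enc}(\mathbf{x}) \odot \mathcal{M}\left(\mathrm{Enc}(\mathbf{x})\right)\right),$$
where $\mathrm{Enc}: \mathbb{R}^{T} \to \mathbb{R}^{N \times D_e}$ and $\mathrm{Dec}: \mathbb{R}^{N \times D_e} \to \mathbb{R}^{T}$ are the signal encoder and decoder, respectively, $N$ is the number of time frames, $D_e$ is the encoder output dimension, $\odot$ is the element-wise multiplication, and $\mathcal{M}$ is the mask prediction network. Early studies used the short-time Fourier transform (STFT) and the inverse STFT (iSTFT) as the encoder and decoder [6, 18], respectively. More recent studies use a trainable encoder/decoder, which are often called trainable "filterbanks", e.g. in Asteroid.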
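The mask-based procedure described above (encode, predict a mask, multiply element-wise, decode) can be sketched in a few lines. This is only an illustration of the data flow: `toy_encoder`, `toy_decoder`, and `all_pass_mask` are hypothetical stand-ins (non-overlapping framing with an identity basis), not the trainable filterbanks or the mask prediction network used in this paper.

```python
from typing import Callable, List

Frames = List[List[float]]  # [num_frames][feature_dim]


def enhance(x: List[float],
            encoder: Callable[[List[float]], Frames],
            decoder: Callable[[Frames], List[float]],
            mask_net: Callable[[Frames], Frames]) -> List[float]:
    """Return decoder(encoder(x) * mask_net(encoder(x))) with an element-wise product."""
    z = encoder(x)
    mask = mask_net(z)
    masked = [[zi * mi for zi, mi in zip(z_row, m_row)]
              for z_row, m_row in zip(z, mask)]
    return decoder(masked)


def toy_encoder(x: List[float], frame_len: int = 4) -> Frames:
    """Hypothetical encoder: split the signal into non-overlapping frames."""
    return [x[i:i + frame_len] for i in range(0, len(x), frame_len)]


def toy_decoder(z: Frames) -> List[float]:
    """Hypothetical decoder: concatenate the frames back into a signal."""
    return [v for frame in z for v in frame]


def all_pass_mask(z: Frames) -> Frames:
    """Trivial mask of ones; a real mask network would predict values per bin."""
    return [[1.0] * len(frame) for frame in z]


noisy = [0.1, -0.2, 0.3, 0.0, 0.5, -0.5, 0.25, 0.75]
enhanced = enhance(noisy, toy_encoder, toy_decoder, all_pass_mask)
```

With the all-pass mask the output equals the input, which is a quick sanity check that the encoder and decoder invert each other before a learned mask is plugged in.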
One of the main research topics in SE is the design of the architecture of the mask prediction network, because the performance and computational efficiency of SE are mainly affected by its structure. Conv-TasNet is a powerful model for speech separation and SE whose mask prediction network consists of stacked 1D-DDC layers. TDCN++ [15, 36] is an extension of Conv-TasNet. The main differences between TDCN++ and Conv-TasNet are the use of instance norm instead of global layer norm and the addition of explicit scale parameters after each dense layer. The pseudo-code for the mask prediction network in the TDCN++ is shown in Algorithm 1. TDCN++ consists of stacked TDCN-blocks, and each TDCN-block mainly consists of two dense layers for frame-wise feature modeling and one 1D-DDC layer for sequence modeling. The dilation factor increases exponentially to ensure a sufficiently large temporal context window to take advantage of the long-range dependencies of the speech signal, and TDCN-blocks are repeated times where . The time complexity of TDCN++ is roughly proportional to when , where is the input dimension of TDCN-blocks.
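To make the effect of exponentially increasing dilation concrete, the following sketch computes the dilation factor of each block and the resulting receptive field of a stack of dilated convolutions. The block count, repeat count, and kernel size used here are hypothetical example values chosen for illustration, not settings taken from this paper.

```python
from typing import List


def dilation_schedule(blocks_per_repeat: int, repeats: int) -> List[int]:
    """Dilation 2**i for block i, restarting from 1 at the start of every repeat."""
    return [2 ** i for _ in range(repeats) for i in range(blocks_per_repeat)]


def receptive_field_frames(kernel_size: int, dilations: List[int]) -> int:
    """Receptive field (in frames) of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical configuration: 8 blocks per repeat, 4 repeats, kernel size 3.
schedule = dilation_schedule(blocks_per_repeat=8, repeats=4)
frames = receptive_field_frames(kernel_size=3, dilations=schedule)
context_ms = frames * 1.25  # approximate span at a 1.25 ms hop size
```

Under these assumed values the stack covers a few thousand frames, whereas the same number of non-dilated layers would cover only `1 + 2 * 32 = 65` frames.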
2.2 Conformer
Conformer is a model derived from the Transformer that was originally proposed for ASR and later adopted in audio-related applications such as audio event detection [25, 11] and speech separation. The structure of the Conformer is similar to that of the TDCN++, in that it consists of stacked Conformer-blocks. Algorithm 2 shows the pseudo-code of a Conformer-block. Comparing Algorithms 1 and 2, we can see that the constituent layers of the Conformer-block and the TDCN-block are also similar: one Conformer-block mainly consists of several dense layers for frame-wise feature modeling, and one 1-D depthwise convolution layer and one MHSA-module for sequence modeling. One of the main differences between the TDCN-block and the Conformer-block is the MHSA-module. Conformer enables global sequence modeling by using MHSA-modules instead of dilated depthwise convolution layers with local receptive fields.
3 Proposed method
3.1 Model structure and computational challenges
Based on the successes of Conformer in speech-related tasks, we aim to replace the TDCN blocks in TDCN++ with Conformer-blocks. Unfortunately, the simple combination of trainable filterbanks and Conformer-blocks causes two critical problems. These problems are caused by the short window size of 2.5 ms and hop size of 1.25 ms used in trainable filterbanks for short-time analysis of the input signal.
Problem 1: Computational complexity. The computational cost of the MHSA-module is quadratic in the number of frames $N$. In the original Conformer model, convolutional subsampling limits the size of $N$. For example, for a 1-second signal, $N$ is 25. In contrast, for TDCN++ with a hop size of 1.25 ms, the same signal would result in $N = 800$.
Problem 2: The receptive field for sequence modeling is insufficient. The original Conformer has a hop-size of 40 ms, while the standard trainable filterbank has a hop-size of 1.25 ms. This means that the receptive field for depthwise convolution is 6.25 ms when using the default kernel size of 5, which may degrade the accuracy of the analysis of local changes in the signal.
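The arithmetic behind the two problems can be checked directly. The sketch below uses only the hop sizes and kernel size quoted above; the helper names are illustrative, and the receptive-field estimate follows the simple kernel-size-times-hop calculation used in the text (it ignores window overlap).

```python
def num_frames(duration_s: float, hop_ms: float) -> int:
    """Number of analysis frames for a signal of the given duration."""
    return round(duration_s * 1000.0 / hop_ms)


def attention_matrix_entries(n_frames: int) -> int:
    """Size of one full self-attention matrix, which grows quadratically."""
    return n_frames * n_frames


def conv_receptive_field_ms(kernel_size: int, hop_ms: float) -> float:
    """Temporal span of a non-dilated convolution, estimated as kernel_size * hop."""
    return kernel_size * hop_ms


n_conformer = num_frames(1.0, hop_ms=40.0)    # original Conformer front-end
n_filterbank = num_frames(1.0, hop_ms=1.25)   # trainable filterbank
cost_ratio = (attention_matrix_entries(n_filterbank)
              / attention_matrix_entries(n_conformer))
span_ms = conv_receptive_field_ms(kernel_size=5, hop_ms=1.25)
```

Going from a 40 ms hop to a 1.25 ms hop multiplies the frame count by 32 and the attention matrix size by 32 squared, while the span of a kernel-size-5 convolution shrinks from 200 ms to 6.25 ms.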
One possible approach to Problem 1 is the dual-path approach [3, 29, 34], which is equivalent to using sparse and block-diagonal attention matrices corresponding to the inter- and intra-transformers, respectively. Instead, we use the FAVOR+ attention introduced in Performer, whose computational complexity is linear in the number of frames. The novelty of our approach lies in replacing softmax-dot-product attention with linear FAVOR+ attention and in replacing the non-dilated convolutions in Conformer with 1D-DDC for local analysis. Based on these two characteristics of the proposed method, we name our mask prediction network the dilated-FAVOR-Conformer (DF-Conformer), and we denote a DF-Conformer by appending its number of layers to the name (e.g., DF-Conformer-8). The pseudo-code of DF-Conformer- is shown in Algorithm 3. The time complexity of DF-Conformer- is also roughly in proportion to when .
3.2 Linear time-complexity MHSA-module using FAVOR+
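Recently, many extended Transformer architectures have been proposed to improve computational and memory efficiency [30, 14]. Performer is one of them; it is a Transformer architecture that uses FAVOR+. In self-attention, the query $\mathbf{Q} \in \mathbb{R}^{N \times d}$, key $\mathbf{K} \in \mathbb{R}^{N \times d}$, and value $\mathbf{V} \in \mathbb{R}^{N \times d}$ are combined as $\mathrm{softmax}(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d})\,\mathbf{V}$. In FAVOR+, this is approximated as $\hat{\mathbf{D}}^{-1}\,\phi(\mathbf{Q})\,(\phi(\mathbf{K})^{\top}\mathbf{V})$, for a suitable feature map $\phi: \mathbb{R}^{d} \to \mathbb{R}_{+}^{m}$ applied to the rows of each matrix, avoiding the quadratic term $\mathbf{Q}\mathbf{K}^{\top}$. Here $\hat{\mathbf{D}}$ is a normalizing diagonal matrix with $\hat{\mathbf{D}} = \mathrm{diag}(\phi(\mathbf{Q})\,(\phi(\mathbf{K})^{\top}\mathbf{1}_{N}))$, and $\mathbf{1}_{N}$ is an all-ones vector. This approximation is made accurate in FAVOR+ using a random-projection-based, non-negative-valued $\phi$ of a suitable size $m$. To implement this idea, we replace the softmax-dot-product self-attention in Algorithm 2 with FAVOR+ self-attention. Hereafter, we refer to this new module as the "MHSA-FAVOR-module".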
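The key point of the linear-complexity attention described above is the order of multiplication: the feature-mapped keys are first combined with the values, so no frame-by-frame attention matrix is ever formed. The following is a minimal single-head sketch in plain Python under stated assumptions: it uses positive random features of the form exp(w·x − |x|²/2)/√m with i.i.d. Gaussian projections and omits the orthogonalization step of FAVOR+; the function names and toy sizes are hypothetical, and this is not the implementation used in the experiments.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def positive_features(x: List[float], omegas: Matrix) -> List[float]:
    """Non-negative random feature map: exp(w.x - |x|^2 / 2) / sqrt(m)."""
    half_sq_norm = 0.5 * sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(omegas))
    return [scale * math.exp(sum(w * v for w, v in zip(omega, x)) - half_sq_norm)
            for omega in omegas]


def favor_style_attention(queries: Matrix, keys: Matrix, values: Matrix,
                          omegas: Matrix) -> Matrix:
    """Linear-time attention: phi(Q) (phi(K)^T V), normalized row by row."""
    s = len(queries[0]) ** -0.25  # splits the 1/sqrt(d) scaling between Q and K
    phi_q = [positive_features([s * v for v in q], omegas) for q in queries]
    phi_k = [positive_features([s * v for v in k], omegas) for k in keys]
    m, d_v = len(omegas), len(values[0])
    # Aggregate keys and values once: cost O(N * m * d_v), not O(N^2).
    kv = [[sum(pk[j] * v[c] for pk, v in zip(phi_k, values)) for c in range(d_v)]
          for j in range(m)]
    k_sum = [sum(pk[j] for pk in phi_k) for j in range(m)]
    out = []
    for pq in phi_q:
        norm = sum(a * b for a, b in zip(pq, k_sum))  # diagonal normalizer entry
        out.append([sum(pq[j] * kv[j][c] for j in range(m)) / norm
                    for c in range(d_v)])
    return out


def softmax_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Exact softmax-dot-product attention (quadratic in N), for comparison."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(scores)
        w = [math.exp(sc - peak) for sc in scores]
        total = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, values)) / total
                    for c in range(len(values[0]))])
    return out


rng = random.Random(0)
N, D, M = 6, 4, 64  # toy sizes: frames, head dimension, random features
Q = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(N)]
K = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(N)]
V = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(N)]
OMEGAS = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(M)]
approx = favor_style_attention(Q, K, V, OMEGAS)
exact = softmax_attention(Q, K, V)
```

Because the features are non-negative, every output row is a convex combination of the value rows, just as in softmax attention; how closely `approx` tracks `exact` depends on the number of random features and the scale of the inputs.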
3.3 Use of dilated depthwise convolution in Conformer
We strengthen the network’s temporal analysis capability by using 1D-DDC instead of the standard 1-D depthwise convolution used in the Conformer-blocks. As in TDCN++, we use an exponentially increasing dilation factor . To implement this idea, DF-Conformer-block also takes as an argument, and it is passed to the 1D-DDC layer as the dilation parameter.
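As an illustration of what the 1D-DDC layer computes, the sketch below applies one filter per channel with a configurable dilation and zero "same" padding. It is a plain-Python reference written for clarity, with a hypothetical function name and an odd kernel size assumed; it is not the layer implementation used in the experiments.

```python
from typing import List

Matrix = List[List[float]]


def dilated_depthwise_conv1d(x: Matrix, kernels: Matrix, dilation: int) -> Matrix:
    """x: [frames][channels]; kernels: [channels][kernel_size] (odd kernel_size).

    Each channel is filtered independently (depthwise), and the kernel taps are
    spaced `dilation` frames apart. Frames outside the signal are treated as zero.
    """
    n_frames, n_channels = len(x), len(x[0])
    kernel_size = len(kernels[0])
    half = kernel_size // 2
    out = [[0.0] * n_channels for _ in range(n_frames)]
    for t in range(n_frames):
        for ch in range(n_channels):
            acc = 0.0
            for j in range(kernel_size):
                src = t + (j - half) * dilation
                if 0 <= src < n_frames:
                    acc += kernels[ch][j] * x[src][ch]
            out[t][ch] = acc
    return out


# A unit impulse at frame 4 shows where the dilated taps land.
impulse = [[1.0 if t == 4 else 0.0] for t in range(9)]
response = dilated_depthwise_conv1d(impulse, kernels=[[1.0, 2.0, 3.0]], dilation=2)
```

With a dilation of 2, the three taps respond at frames 2, 4, and 6, so the same kernel size covers twice the temporal span of a non-dilated convolution.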
4.1 Experimental setup
Dataset: We used the same dataset used in the SE experiment of . This dataset uses speech from LibriVox (librivox.org) and non-speech sounds from freesound.org. The duration of each sample was 3 s, and the sampling rate was 16 kHz. The training, validation, and test datasets consisted of 4,076,102 (3396.8 hours), 7,417 (6.2 hours), and 7,387 (6.2 hours) examples, respectively. We mixed speech and noise samples in the same manner as . The minimum and maximum signal-to-noise ratio (SNR) of the noisy input were dB and dB, respectively, and the average extended short-time objective intelligibility measure (ESTOI) score was 63.7%.
Loss function: We estimated masks for both speech and noise in the same manner as [36, 17]. Each mask was multiplied with the encoder output, and the result was re-synthesized to the time domain using the same decoder. A mixture consistency projection layer was applied to ensure that the sum of the estimated speech and noise equals the noisy input. Finally, the negative thresholded SNR loss, which uses a soft threshold to clamp the loss at a maximum SNR, was calculated for both speech and noise, and the two losses were combined with weights of 0.8 for speech and 0.2 for noise.
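The two steps in this loss computation can be sketched as follows. The code assumes the unweighted form of the mixture consistency projection (the residual is split equally among the estimates) and the commonly used form of the negative thresholded SNR, −10·log10(‖s‖² / (‖s − ŝ‖² + τ‖s‖²)) with τ = 10^(−SNR_max/10). Both forms, the function names, and the 30 dB value in the example are assumptions made for illustration; the threshold actually used is not restated here.

```python
import math
from typing import List


def mixture_consistency(mixture: List[float],
                        estimates: List[List[float]]) -> List[List[float]]:
    """Add an equal share of the residual so the estimates sum to the mixture."""
    k = len(estimates)
    residual = [m - sum(e[t] for e in estimates) for t, m in enumerate(mixture)]
    return [[e[t] + residual[t] / k for t in range(len(mixture))]
            for e in estimates]


def neg_thresholded_snr(target: List[float], estimate: List[float],
                        snr_max_db: float) -> float:
    """Negative SNR in dB, soft-clamped at -snr_max_db (target must be non-zero)."""
    tau = 10.0 ** (-snr_max_db / 10.0)
    signal_power = sum(s * s for s in target)
    error_power = sum((s - e) ** 2 for s, e in zip(target, estimate))
    return -10.0 * math.log10(signal_power / (error_power + tau * signal_power))


def combined_loss(speech_loss: float, noise_loss: float,
                  speech_weight: float = 0.8) -> float:
    """Weighted sum of the speech and noise losses (0.8 / 0.2 as in the text)."""
    return speech_weight * speech_loss + (1.0 - speech_weight) * noise_loss


mixture = [1.0, 2.0, 3.0]
raw_speech, raw_noise = [0.5, 1.0, 1.0], [0.0, 0.5, 1.0]
speech_est, noise_est = mixture_consistency(mixture, [raw_speech, raw_noise])
loss = combined_loss(
    neg_thresholded_snr([1.0, 1.0, 2.0], speech_est, snr_max_db=30.0),  # assumed 30 dB
    neg_thresholded_snr([0.0, 1.0, 1.0], noise_est, snr_max_db=30.0),
)
```

The threshold term keeps the loss bounded: for a perfect estimate the error power is zero and the loss saturates at the negative of the maximum SNR instead of diverging.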
Comparison of methods and hyper-parameters: For the ablation studies in Sec. 4.2, we used three Conformer-based models. The first model is Conformer- which simply replaces TDCN-blocks in TDCN++ with Conformer-blocks. The second model is F-Conformer- which uses only FAVOR+ from DF-Conformer-. The last model is Conformer--STFT which uses the STFT and iSTFT as the encoder and decoder, respectively. For Conformer--STFT models, we estimated a complex-valued mask . Because the number of parameters of Conformer- cannot be increased due to its computational complexity, we used two different model sizes: 3.7M and 8.75M parameters. The former size was determined according to the maximum model size of Conformer- that can be trained on third-generation Tensor Processing Units (TPUv3). The latter size is that of the TDCN++ used in previous studies [15, 36]. The hyper parameters were and were used for 3.7M models, and , , and were used for 8.75M models. For both model sizes, attention heads and random projection features were used in FAVOR+.
For the SE performance evaluation in Sec. 4.3, we compared DF-Conformer- and Conv-Tasformer with TDCN++ [15, 36] to confirm the superiority of the proposed models over their base model. For TDCN++, we used the same setting as in , namely, , , and . For Conv-Tasformer, we used the same setting as TDCN++ except and to reduce the number of parameters.
For all models, , and the window and hop sizes of trainable filterbanks were 2.5 ms and 1.25 ms, respectively. For STFT, the window and hop sizes were 30 ms and 10 ms, respectively, and fast-Fourier-transform size was 512. All models were trained for 500k steps on 128 Google TPUv3 cores with a global batch size of 512. We configured the Adam optimizer  with weight decay 1e-6, and learning rate schedule  of , where is a number of training steps. We clipped the gradient by global norm to 5.0. We stored a separate checkpoint of exponential-moving-averaged weights accumulated over training steps with decay rate 0.9999.
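Two of the training details above, gradient clipping by global norm (5.0) and the exponential moving average of the weights (decay 0.9999), are simple enough to sketch. The helper names are hypothetical and the code operates on plain lists rather than on framework tensors; the optimizer, weight decay, and learning-rate schedule are not covered.

```python
import math
from typing import List, Tuple


def clip_by_global_norm(grads: List[List[float]],
                        max_norm: float) -> Tuple[List[List[float]], float]:
    """Rescale all gradients jointly so their combined L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / global_norm) if global_norm > 0.0 else 1.0
    return [[g * scale for g in vec] for vec in grads], global_norm


def ema_update(ema_weights: List[float], weights: List[float],
               decay: float = 0.9999) -> List[float]:
    """One step of exponential moving averaging of the model weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


grads = [[3.0, 4.0], [0.0, 12.0]]          # global norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
ema = ema_update([1.0, 1.0], [0.0, 2.0])   # averaged copy kept in a separate checkpoint
```

Clipping by global norm preserves the direction of the full gradient vector, unlike per-tensor clipping, and the averaged weights change very slowly at a decay of 0.9999.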
4.2 Evaluation of FAVOR+
To confirm the effects of FAVOR+, we compared the real-time factor (RTF) of Conformer-4-STFT, Conformer-4, and F-Conformer-4 on a single CPU. Figure 1 (a) shows the comparison results. In the case of Conformer-4-STFT, RTF does not increase significantly because was in our STFT setting and it is still feasible with MHSA-module. In contrast, the RTF of Conformer-4 increases linearly as was in our trainable filterbank setting and MHSA-module. Since the time complexity of FAVOR+ is linear in the number of frames, F-Conformer-4 solves this problem.
We also compared these methods using two objective metrics: scale-invariant SNR improvement (SI-SNRi) and ESTOI. Table 1 shows the results. Comparing Conformer-4-STFT and Conformer-4, the use of a trainable filterbank achieved higher scores than the STFT, which is consistent with previous studies. With the small models, the SI-SNRi score of F-Conformer-4 was almost the same as that of Conformer-4-STFT. Meanwhile, with the 8.75M models, the SI-SNRi of F-Conformer-8 was 1.2 dB higher than that of Conformer-8-STFT, and their ESTOI scores were comparable. These results suggest that the use of FAVOR+ can achieve high time-domain SE performance with a larger model while avoiding the increase in computational complexity.
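For reference, SI-SNRi can be computed as the SI-SNR of the enhanced signal minus the SI-SNR of the noisy input, both measured against the clean target. The sketch below follows the usual projection-based definition of SI-SNR without mean removal; the function names and the small `eps` stabilizer are assumptions for illustration rather than the evaluation code used here.

```python
import math
from typing import List


def si_snr(estimate: List[float], target: List[float], eps: float = 1e-12) -> float:
    """Scale-invariant SNR in dB: project the estimate onto the target first."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / (target_energy + eps)           # optimal scaling of the target
    projection = [alpha * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_power = sum(p * p for p in projection)
    noise_power = sum(r * r for r in residual)
    return 10.0 * math.log10(signal_power / (noise_power + eps))


def si_snr_improvement(enhanced: List[float], noisy: List[float],
                       target: List[float]) -> float:
    """SI-SNRi: gain of the enhanced signal over the unprocessed noisy input."""
    return si_snr(enhanced, target) - si_snr(noisy, target)


target = [1.0, 0.0, -1.0, 0.0]
noisy = [1.0, 1.0, -1.0, -1.0]        # target plus strong orthogonal noise
enhanced = [1.0, 0.1, -1.0, -0.1]     # same noise attenuated by a factor of 10
improvement_db = si_snr_improvement(enhanced, noisy, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate by any non-zero gain leaves the score unchanged.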
4.3 Objective evaluation
|Method|#Params|SI-SNRi [dB]|ESTOI [%]|RTF|
|---|---|---|---|---|
|TDCN++|8.75 M|14.10|85.7|0.10|
|iTDCN++|17.6 M|14.84|87.1|0.22|
We compared DF-Conformer-8, TDCN++, and Conv-Tasformer using SI-SNRi, ESTOI, and RTF. As shown in Table 2, DF-Conformer-8 and Conv-Tasformer achieved comparable scores, and these scores were higher than those of TDCN++. Also, comparing DF-Conformer-8 with F-Conformer-8 in Table 1, the use of 1D-DDC significantly improved the scores without increasing RTF. These results suggest that the use of both 1D-DDC and FAVOR+ is effective in SE. We also compared the RTF of these methods, as shown in Fig. 1 (b). The RTFs of DF-Conformer-8 and TDCN++ were comparable, whereas that of Conv-Tasformer was larger than the others due to the additional MHSA-FAVOR-block. Therefore, when inserting FAVOR+ into the TDCN-block as in Conv-Tasformer, the position and number of MHSA-FAVOR-modules need to be chosen carefully to improve computational efficiency.
We also compared the iterative extensions of these models. Using the iterative models improved the scores of all methods, and the trends were similar to those of the non-iterative models. Furthermore, we evaluated a larger model, iDF-Conformer-12, with , , and 8 attention heads. The model size was chosen so that its RTF is comparable to that of iConv-Tasformer. As the results show, the scores clearly improved with the larger model; thus, the performance of DF-Conformer can be expected to scale with model size.
Finally, we point out three characteristics of DF-Conformer's attention matrices. First, none of the attention matrices has a local structure that focuses only on nearby time frames. Second, most attention matrices in earlier layers referred to low-SNR time frames to capture the noise characteristics (e.g. Fig. 2 middle-left), or referred to time frames with similar spectral structures (e.g. Fig. 2 middle-right). Third, some attention matrices in deep layers resemble a sum of a nearly diagonal matrix and a block matrix (e.g. Fig. 2 bottom). These results suggest that the earlier layers roughly analyze the speech and noise from the entire utterance, and later layers refine the mask based on the local structure.
In this study, we proposed DF-Conformer, a Conformer-based time-domain SE network. To reduce the computational complexity and improve local sequential modeling, we extended Conformer using a linear-complexity attention mechanism and 1-D dilated separable convolutions. Experimental results showed that (i) the use of linear-complexity attention solves the computational-complexity problem, and (ii) our model achieves higher performance than TDCN++. From these results, we conclude that DF-Conformer is an effective model for SE. Future work includes joint training of SE and ASR using an all-Conformer model, and comparison with the dual-path methods [3, 29, 34] on the SE task.
References
[1] (2018) The conversation: deep audio-visual speech enhancement. In Proc. Interspeech.
[2] (2021) Towards efficient models for real-time deep noise suppression. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[3] Dual-path transformer network: direct context-aware modeling for end-to-end monaural speech separation. In Proc. Interspeech.
[4] (2020) Continuous speech separation with Conformer. arXiv:2008.05773.
[5] (2021) Rethinking attention with performers. In Proc. Int. Conf. Learn. Represent. (ICLR).
[6] Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[7] (2016) Improved MVDR beamforming using single-channel mask prediction networks. In Proc. Interspeech.
[8] (2018) Seeing through noise: visually driven speaker separation and enhancement. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[9] (2020) Multi-modal multi-channel target speech separation. IEEE J. Sel. Top. Signal Process. 14 (3), pp. 530–541.
[10] (2020) Conformer: convolution-augmented transformer for speech recognition. In Proc. Interspeech.
[11] Conformer-based id-aware autoencoder for unsupervised anomalous sound detection. Technical report DCASE2020 Challenge.
[12] (2018) Frame-by-frame closed-form update for mask-based adaptive MVDR beamforming. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[13] (2016) An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE Trans. Audio Speech Lang. Process. 24 (11), pp. 2009–2022.
[14] (2020) Transformers are RNNs: fast autoregressive transformers with linear attention. In Int. Conf. Mach. Learn. (ICML).
[15] (2019) Universal sound separation. In Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA).
[16] (2015) Adam: a method for stochastic optimization. In Proc. Int. Conf. Learn. Represent. (ICLR).
[17] (2020) Improving noise robust automatic speech recognition with single-channel time-domain enhancement network. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[18] (2018) DNN-based source enhancement to increase objective sound quality assessment score. IEEE/ACM Trans. Audio Speech Lang. Process. 26 (10), pp. 1780–1792.
[19] (2020) Speech enhancement using self-adaptation and multi-head self-attention. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[20] (2019) SDR–Half-baked or well done? In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[21] (2021) ESPnet-SE: end-to-end speech enhancement and separation toolkit designed for ASR integration. In Proc. IEEE Spok. Lang. Technol. Workshops (SLT).
[22] (2021) Ultra-lightweight speech separation via group communication. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[23] (2019) Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27 (8), pp. 1256–1266.
[24] (2021) End-to-end diarization for variable number of speakers with local-global networks and discriminative speaker embeddings. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[25] Conformer-based sound event detection with semi-supervised learning and data augmentation. In Proc. Detect. Classif. Acoust. Scenes Events Workshop (DCASE).
[26] Asteroid: the PyTorch-based audio source separation toolkit for researchers. In Proc. Interspeech.
[27] (2020) Filterbank design for end-to-end speech separation. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[28] (2021) ICASSP 2021 deep noise suppression challenge. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[29] (2021) Attention is all you need in speech separation. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[30] (2020) Efficient Transformers: a survey. arXiv:2009.06732.
[31] (2021) Into the wild with AudioScope: unsupervised audio-visual separation of on-screen sounds. In Proc. Int. Conf. Learn. Represent. (ICLR).
[32] (2017) Attention is all you need. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS).
[33] Supervised speech separation based on deep learning: an overview. IEEE/ACM Trans. Audio Speech Lang. Process. 26 (10), pp. 1702–1726.
[34] (2021) TSTNN: two-stage Transformer based neural network for speech enhancement in the time domain. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[35] (2020) Differentiable consistency constraints for improved deep speech enhancement. In Proc. Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP).
[36] (2020) Unsupervised sound separation using mixture invariant training. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS).