1 Introduction
The tasks of speech enhancement and general audio source separation have greatly benefited from advances in neural networks in the past decade. Methods for speech enhancement have traditionally used statistical characteristics of noise and speech to estimate the spectrum of the noise and subsequently of clean speech
[1, 2]. However, such methods usually suffer in non-stationary noise conditions, which can be remedied using machine learning methods. Architectures with recurrent neural networks like LSTMs
[3], the attention mechanism [4] and temporal convolutional networks [5] have shown notable performance improvements in recent years [6, 7, 8, 9, 10].
A neural network-based approach that has shown good results and flourished with the increased capacity and computational power of modern processors is learned-domain speech processing, that is, training models that learn representations of the audio inputs and perform the processing steps on them. This is done in an end-to-end manner, without hand-crafted, fixed transforms like the short-time Fourier transform (STFT). These methods were popularized by Conv-TasNet [11], which fostered research and preceded many other works on end-to-end encoder/masker/decoder architectures.
Nevertheless, later contributions have shown that the good performance obtained by Conv-TasNet does not come specifically from the freely-learned convolutional encoder/decoder pair. In [12], equivalent results were obtained by replacing the learned encoder with a multi-phase gammatone analysis filterbank. In [13], the authors show that gains from Conv-TasNet can be attributed to the high time resolution and the time-domain loss.
While learned-domain methods deliver excellent performance and have the benefit of not decoupling magnitude from phase, they usually operate on very short frames (around 2ms), which implies dealing with a much larger number of frames compared to traditional STFT frame sizes (around 32ms). This is especially problematic in models using the attention mechanism, whose computational complexity grows quadratically with the sequence length due to the dot-product operation.
Dual-path methods have managed to alleviate some issues related to the modeling of long sequences for speech applications: in [14], the authors proposed segmenting the sequence of frames into chunks and processing the sequences with LSTMs inside the chunks, followed by processing across chunks, thereby modeling both short- and long-term dependencies with a reduced computational footprint; the authors of [15] proposed what they called the improved Transformer, combining multi-head self-attention with LSTMs; and in [16] the SepFormer model was introduced, relying on attention only, as in the original Transformer paper [17].
Although dual-path methods reduce the complexity to a feasible level, the number of frames remains large, requiring large amounts of memory during training. Another drawback common to existing learned-domain approaches is that the models usually work with 8kHz audio data, a considerable disadvantage compared to wideband methods. Additionally, the learned-encoder features are less interpretable than well-established, fixed filters such as the STFT. Time-frequency representations with longer frames therefore remain desirable, although working with complex-valued representations brings additional challenges.
Studies on the contribution of STFT magnitude and phase spectra in speech processing [18, 19, 20, 21] have shown that the relative importance of phase varies considerably with frame size. In particular, the loss of spectral resolution renders the magnitude less relevant at very short frames (around 2ms). This does not apply to the phase spectrum, which encodes information related to the zero-crossings of the signal [18]. For longer frames (around 32ms), the opposite holds: the magnitude is more important than the phase [22]. These effects are also reflected in enhancement/separation evaluation metrics. Based on these observations, with an appropriate choice of frame size we can easily port existing attention-based models to time-frequency representations by processing only the magnitudes. In this paper we investigate which compromises and benefits can be attained when working with magnitudes of longer frames.
2 Speech enhancement
The task of speech enhancement is formulated in this work as follows: we consider a mixture $x(t)$ consisting of clean speech $s(t)$ and additive noise $n(t)$ at time samples $t$:
$x(t) = s(t) + n(t).$   (1)
Except where explicitly needed, we will drop the indexes and represent the signals as vectors (in bold). The estimated clean signal $\hat{\mathbf{s}}$ is obtained via masking of the noisy input. First, the time-domain signal is encoded into a representation more adequate for separation:
$\mathbf{X} = \mathrm{Encoder}(\mathbf{x}),$   (2)
then a mask $\mathbf{M}$ is applied to this encoded representation:
$\hat{\mathbf{S}} = \mathbf{M} \odot \mathbf{X},$   (3)
where $\odot$ denotes the element-wise multiplication operator. We then apply a decoder to the masked representation in order to return to the time domain, resulting in the clean speech estimate
$\hat{\mathbf{s}} = \mathrm{Decoder}(\hat{\mathbf{S}}).$   (4)
This process is illustrated in Figure 1.
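For illustration, the encode-mask-decode pipeline of Eqs. (2)-(4) can be sketched in a few lines of PyTorch-style Python. The Encoder, Masker and Decoder below are placeholders for the blocks in Figure 1(a), not the exact implementation used in our experiments.

```python
def enhance(x, encoder, masker, decoder):
    """Sketch of Eqs. (2)-(4); x is the noisy waveform, e.g. a (batch, samples) tensor."""
    X = encoder(x)          # Eq. (2): encode the noisy input
    M = masker(X)           # estimate a mask with the same shape as X
    S_hat = M * X           # Eq. (3): element-wise masking
    s_hat = decoder(S_hat)  # Eq. (4): return to the time domain
    return s_hat
```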
3 Architecture
The models in this work are organized in the structure of an encoder, a masking part and a decoder, as depicted in Figure 1(a). Experiments were conducted using either a freely-learned encoder/decoder pair, or the STFT as the encoder and the inverse STFT (iSTFT) as the decoder.
Figure 1: (a) Encoder/masker/decoder structure of the models; (b) magnitude STFT variant, in which the masked magnitude is recombined with the noisy phase before the iSTFT.
3.1 STFT encoder/decoder
As presented in Section 1, a commonly used representation for audio signals is the time-frequency domain, which captures the structure of the audio signals more prominently and also facilitates source separation tasks due to increased sparsity. The method used in this work for obtaining the time-frequency representation is the STFT, obtained by applying the discrete Fourier transform (DFT) to windowed, overlapping frames:
$\tilde{X}(k, l) = \sum_{m=0}^{N-1} w(m)\, x(m + lH)\, e^{-j 2\pi k m / N},$   (5)
where $k$ and $l$ are the frequency bin and frame indexes, respectively, $m$ is the local time index, $w$ is a window function, $N$ is the window length and $H$ is the hop length. We use the one-sided DFT, so with this type of encoder the representation has $N/2 + 1$ frequency bins. We can define the frame overlap ratio as $(N - H)/N$, which in this work will be given as a percentage. The input to the masker in this case is the magnitude of the complex time-frequency representation:
$X(k, l) = |\tilde{X}(k, l)|.$   (6)
The magnitudes masked by the masker are recombined with the noisy phase and then fed to the decoder, which performs the inverse STFT. This is shown in Figure 1(b).
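A minimal sketch of this STFT encoder/decoder pair using torch.stft and torch.istft is given below, assuming the 512-sample Hann window with 75% overlap (hop of 128 samples) used later in Section 4.2; details such as centering and normalization may differ from our actual implementation.

```python
import torch

def stft_encode(x, n_fft=512, hop=128):
    """Return magnitude (masker input, Eq. (6)) and noisy phase of the STFT."""
    window = torch.hann_window(n_fft)
    X = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
    return X.abs(), torch.angle(X)

def istft_decode(masked_mag, noisy_phase, n_fft=512, hop=128, length=None):
    """Recombine the masked magnitude with the noisy phase and apply the iSTFT (Fig. 1(b))."""
    window = torch.hann_window(n_fft)
    S_hat = torch.polar(masked_mag, noisy_phase)  # complex spectrogram
    return torch.istft(S_hat, n_fft=n_fft, hop_length=hop, window=window, length=length)
```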
3.2 Learned encoder/decoder
In the case of a learned encoder/decoder pair, similar to [11, 16], we use one-dimensional convolutional layers (Conv1d). The encoder transforms the audio signal directly into a higher-dimensional representation, and a rectified linear unit (ReLU) enforces a non-negativity constraint, so we have
$\mathbf{X} = \mathrm{ReLU}(\mathrm{Conv1d}(\mathbf{x})),$   (7)
where $\mathbf{X}$ is the resulting non-negative representation. The decoder uses a transposed one-dimensional convolutional layer (ConvTranspose1d) to convert the masked estimate back to the time domain:
$\hat{\mathbf{s}} = \mathrm{ConvTranspose1d}(\hat{\mathbf{S}}).$   (8)
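The learned encoder/decoder pair can be sketched as follows with the hyperparameters from Section 4.2 (256 filters, kernel size 32, stride 16); the bias setting and exact initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnedEncoderDecoder(nn.Module):
    """Sketch of Eqs. (7)-(8): Conv1d encoder with ReLU, ConvTranspose1d decoder."""
    def __init__(self, n_filters=256, kernel_size=32, stride=16):
        super().__init__()
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)

    def encode(self, x):        # x: (batch, 1, samples)
        return torch.relu(self.encoder(x))   # Eq. (7): non-negative representation

    def decode(self, S_hat):    # S_hat: (batch, n_filters, frames)
        return self.decoder(S_hat)           # Eq. (8): back to the time domain
```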
3.3 Masker network
The masker DNN in our experiments is a reduced version of the SepFormer [16], based on Huang's implementation [23]. This architecture is displayed in Figure 2 and follows the dual-path principle introduced in [14]: the mixture frames are chunked into overlapping segments, which are then stacked. As can be seen in Figure 2(b), the sequence modeling steps are applied first along the chunks (intra-chunk processing); afterwards, the dimensions are transposed and the sequence processing is executed across chunks (inter-chunk). This process is repeated several times.
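The chunking step can be sketched as follows; this is an illustrative segmentation of a frame sequence into 50%-overlapping chunks and the corresponding overlap-add merge, without the padding and normalization handling of a production implementation.

```python
import torch

def chunk(X, chunk_size, hop):
    """Split a (batch, features, frames) tensor into overlapping chunks.
    Returns a (batch, features, n_chunks, chunk_size) tensor; hop = chunk_size // 2
    gives the 50% chunk overlap used in this work."""
    return X.unfold(dimension=-1, size=chunk_size, step=hop)

def merge(chunks, hop, n_frames):
    """Overlap-add processed chunks back into a (batch, features, n_frames) sequence.
    Overlapping regions are simply summed; normalization is omitted for brevity."""
    batch, feat, n_chunks, chunk_size = chunks.shape
    out = chunks.new_zeros(batch, feat, n_frames)
    for i in range(n_chunks):
        out[..., i * hop : i * hop + chunk_size] += chunks[..., i, :]
    return out
```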
Following [16], the intra- and inter-chunk processing is done without recurrent neural networks, using instead only a sequence of transformer blocks. These transformer blocks include a multi-head attention (MHA) stage and a feed-forward (FFW) block, both preceded by layer normalization (Norm) steps and equipped with skip connections. Positional encoding is added at the blocks' inputs to introduce position information to the model. This is displayed in Figure 2(c). The output of the SepFormer block is further processed by parametric ReLU (PReLU) and convolutional (Conv) layers, and the chunks are merged back via the overlap-add method.
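A pre-norm transformer block of this kind could be sketched as below; the number of heads and the model dimension are illustrative choices rather than the exact values of our configuration, and the positional encoding is assumed to be added to the input beforehand.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Norm -> MHA -> skip, then Norm -> feed-forward -> skip, as in Fig. 2(c)."""
    def __init__(self, d_model=256, n_heads=8, d_ff=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffw = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):       # x: (batch, sequence, d_model)
        h = self.norm1(x)
        x = x + self.mha(h, h, h, need_weights=False)[0]
        return x + self.ffw(self.norm2(x))
```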
Figure 2(a) illustrates the masking principle used, the same as presented in [15], with sigmoid ($\sigma$) and hyperbolic tangent (Tanh) branches performing gating and limiting the mask values to the interval $(-1, 1)$. A ReLU layer follows this step, enforcing coherence with the non-negative inputs to be masked.
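As a sketch, the gated mask estimation can be written as follows, with 1x1 convolutions standing in for the actual layer shapes, which are illustrative assumptions.

```python
import torch.nn as nn

class MaskGate(nn.Module):
    """Tanh branch gated by a sigmoid branch, followed by ReLU, as in Fig. 2(a)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.tanh_branch = nn.Sequential(nn.Conv1d(d_model, d_model, 1), nn.Tanh())
        self.gate_branch = nn.Sequential(nn.Conv1d(d_model, d_model, 1), nn.Sigmoid())
        self.relu = nn.ReLU()

    def forward(self, h):       # h: (batch, d_model, frames)
        return self.relu(self.tanh_branch(h) * self.gate_branch(h))
```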
The attention mechanism has quadratic complexity with respect to the sequence length $L$, which can be a problem when using short frames: with the number of frames given by $L \approx T/H$, where $T$ is the total number of samples and $H$ the hop length, a system with a short window size will produce a large number of elements to process. The dual-path architecture reduces the complexity from $\mathcal{O}(L^2)$ to $\mathcal{O}(L\sqrt{L})$ in the best-case scenario, when the chunk size is on the order of $\sqrt{2L}$. However, this only mitigates the problem to a certain extent; as the number of frames becomes much larger than the chunk size, the inter-chunk processing stage starts to dominate and the complexity tends towards quadratic again.
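The scaling argument can be made concrete with some rough bookkeeping of attention-score computations, ignoring constants, heads and feature dimensions; the hop sizes below correspond to the learned-encoder (1ms hop) and STFT (8ms hop) configurations of Section 4.2.

```python
def attention_scores(n_samples, hop_samples, chunk_size):
    """Rough count of attention scores for dual-path processing of one utterance."""
    L = n_samples // hop_samples          # number of frames
    S = max(1, 2 * L // chunk_size)       # number of 50%-overlapping chunks
    intra = S * chunk_size ** 2           # grows linearly with L for a fixed chunk size
    inter = chunk_size * S ** 2           # grows quadratically with L for a fixed chunk size
    return L, intra, inter

# 16kHz audio: learned encoder (hop 16 samples, chunk 250) vs. STFT (hop 128, chunk 50)
for seconds in (10, 60):
    n = seconds * 16_000
    print(seconds, attention_scores(n, 16, 250), attention_scores(n, 128, 50))
```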
Figure 2: (a) Mask estimation with Tanh and sigmoid gating; (b) dual-path intra-/inter-chunk processing; (c) transformer block with multi-head attention and feed-forward layers.
4 Experiments
4.1 Dataset
The models were trained on the DNS-Challenge dataset [24]. We generated 100 hours of 4-second-long noisy mixtures sampled at 16kHz, with 20% reserved for validation. Testing was performed on clean samples from a subset of the WSJ0 corpus [25] mixed with noise from the CHiME3 Challenge dataset [26], at SNRs ranging from -10dB to 15dB in 5dB steps.
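For reference, mixing a clean utterance with noise at a prescribed SNR can be done as in the sketch below; this is an illustrative helper, not the DNS-Challenge or test-set generation script itself.

```python
import torch

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db, then add it."""
    noise = noise[..., : speech.shape[-1]]                 # assumes noise is at least as long
    p_speech = speech.pow(2).mean()
    p_noise = noise.pow(2).mean().clamp_min(1e-12)
    scale = torch.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```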
4.2 Model
The number of blocks used in the SepFormer was , and for the number of repetitions of the attention mechanism in the intra- and inter-chunk transformer blocks, respectively. The number of dimensions used in the feed forward layers was 256. The chunking was performed with 50% overlap.
In the case of the convolutional encoder/decoder pair, the number of learned filters was 256. The filter size was set to 32 (or 2ms at 16kHz), with stride 16, therefore 50% overlap. For the STFT case, the window function is the Hann window, with length 512 (or 32ms at 16kHz), with 50% or 75% overlap. All configurations of the model contain approximately 6.6 million parameters.
For model training, ADAM [27] was used as the optimizer, with an initial learning rate that was halved after 5 epochs without improvement. Gradient clipping at 5 was employed. We did not use dynamic mixing during training. Following the original SepFormer paper, the loss function used for training was the scale-invariant signal-to-distortion ratio (SI-SDR) [28].
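A minimal sketch of the negative SI-SDR training loss is given below; the SpeechBrain-based recipe we build on may differ in implementation details such as zero-mean normalization and epsilon handling.

```python
import torch

def si_sdr_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant SDR [28], averaged over the batch; shapes (batch, samples)."""
    target = target - target.mean(dim=-1, keepdim=True)
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    # project the estimate onto the target (optimal scaling factor)
    alpha = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    projection = alpha * target
    residual = estimate - projection
    si_sdr = 10 * torch.log10(projection.pow(2).sum(-1) / (residual.pow(2).sum(-1) + eps) + eps)
    return -si_sdr.mean()
```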
5 Results and discussion
5.1 Enhancement performance
The estimated utterances were evaluated on instrumental perceptual metrics: POLQA [29] for speech quality and ESTOI [30] for intelligibility. The results of the different configurations are organized in Table 1.
In the learned-domain case, the chunk size of 250 used in [16] performs better than a setup with shorter chunks, hinting at the importance of modeling short-term relations in the sequence. Nevertheless, long-term relations also play a role in performance, as can be seen in the magnitude STFT experiments: the configuration with chunk size 50 seems to find a balance between short- and long-term modeling compared to the models with chunk sizes 25 and 100. Setting the frame overlap to 50% to obtain even fewer frames resulted in a degradation of the quality metrics. The performance of the STFT model at different input SNRs remained consistent with the learned-encoder case. Informal listening evaluations confirmed the findings from the instrumental scores and revealed that the learned-encoder estimates contain a buzzing sound that is absent from the magnitude STFT outputs. Audio examples are available online at https://uhh.de/inf-sp-magnitudetransformer.
5.2 Execution profiling
Table 1: Enhancement metrics and execution profile of the evaluated configurations.
Model | Frame (ms) | Frame Overlap | Chunk Size | GMACs | GPU Time (ms) | CPU Time (ms) | POLQA | ESTOI
Learned-domain SepFormer [16] | 2 | 50% | 250 | 45.75 | 69 | 909 | 2.98 | 0.79
 | 2 | 50% | 100 | 45.10 | 66 | 895 | 2.91 | 0.79
Magnitude STFT SepFormer (ours) | 32 | 75% | 100 | 6.26 | 11 | 109 | 2.92 | 0.77
 | 32 | 75% | 50 | 5.93 | 11 | 153 | 3.01 | 0.78
 | 32 | 75% | 25 | 5.99 | 11 | 106 | 2.95 | 0.78
 | 32 | 50% | 25 | 3.08 | 11 | 89 | 2.78 | 0.76
The metrics concerning execution are also given in Table 1, in the form of the number of giga multiply–accumulate operations (GMACs) and the execution time for a 10-second input, averaged over 10 executions. The profiling was executed on a computer equipped with an Intel Core i9-10900X CPU at 3.70GHz and an NVIDIA GeForce RTX 2080 Ti graphics card.
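The timing measurements can be reproduced with a simple loop of the following form; this is a generic measurement sketch (warm-up run, synchronization on GPU, averaging over several runs) rather than the exact profiling script, and the MAC counts were obtained with a separate operation-counting tool.

```python
import time
import torch

def time_inference(model, seconds=10, sample_rate=16_000, runs=10, device="cuda"):
    """Average wall-clock inference time for a single `seconds`-long input."""
    model = model.to(device).eval()
    x = torch.randn(1, seconds * sample_rate, device=device)  # assumes (batch, samples) input
    with torch.no_grad():
        model(x)                                              # warm-up
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs
```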
In terms of performance profiling, the chunk size does not play a big role in offline processing: larger chunks increase the computational effort of the intra-chunk step but reduce it for the inter-chunk part, and the opposite happens with shorter chunks. Frame size and overlap, however, have a more important impact on execution, since they control the number of elements to process. Going from a learned-encoder setup with 2ms frames overlapped at 50% to an STFT configuration of 32ms frames with 75% overlap, we can see a significant reduction in complexity. Taking the best performing model from each category (chunk size of 250 for the learned encoder and 50 for its STFT counterpart), we observe a reduction of the number of multiply-accumulate operations by a factor of 7.7. This reduction also translates to execution time, which is reduced approximately by a factor of 6 for both GPU and CPU inference.
Looking at further practical aspects of a speech enhancement model implementation, memory allocation was profiled on CPU, with the results presented in Figure 3. The learned-encoder variant allocates around 2GB of RAM shortly after 30 seconds of input, and over 4GB after the 60 second mark. For low-resource, embedded devices, usage of such a model is therefore severely hindered. For the whole range of values analyzed, up to 2.5 minutes of uninterrupted data, the magnitude STFT SepFormer kept memory usage below 2GB.
Considering a potential application in an online speech enhancement scenario, chunk processing times have to be analyzed. The time it takes for a new chunk to be filled with incoming frames acts as an upper bound for feasible online execution. Taking into account that the intra-chunk attention complexity is tied to the fixed chunk size, and assuming the attention matrices for previous chunks can be stored to save operations, the most important step to examine is the inter-chunk processing.
For practical online operation, the time it takes to process a chunk must be shorter than its length plus its shift. Figure 4 shows that in the learned-encoder case this threshold is already reached for sequences shorter than 10 seconds, whereas the magnitude STFT version kept the execution below the threshold for at least 35 seconds in all configurations tested. Note that due to the frame length and overlap configuration, the algorithmic latency of the STFT encoder in the best performing configuration (chunk size of 50) is double that of the learned encoder. This makes it unsuitable for real-time communication applications, but can be remedied by reducing the chunk size or increasing its overlap, at the expense of a slight degradation in enhancement quality or additional computation. We therefore included in Figure 4 the magnitude model with a chunk size of 25, whose algorithmic latency is similar to the learned-encoder case. It is also worth mentioning that other blocks in the model add processing time, such as the encoder and the chunking step; their contribution to the overall complexity is nevertheless not as significant as the inter-chunk attention.
Figure 3: CPU memory allocation of the learned-domain and magnitude STFT models as a function of input length.
Figure 4: Inter-chunk processing time of the learned-domain and magnitude STFT models as a function of input length.
6 Conclusion
Recent advances in learned-domain speech processing show excellent performance, but the models are demanding in terms of memory and processing power, which limits their feasibility in practical applications. Motivated by previous contributions on learned and traditional filterbanks and on the relation between frame size and magnitude/phase processing, we show that by replacing the learned features with STFT magnitudes we can obtain equivalent performance in terms of perceptually-motivated metrics while considerably reducing resource allocation and processing time. These findings are an important step towards making state-of-the-art transformer-based speech enhancement systems viable in real-life applications, especially on embedded devices.
7 Acknowledgements
This work has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 247465126. We would like to thank J. Berger and Rohde&Schwarz SwissQual AG for their support with POLQA.
References
- [1] T. Gerkmann and E. Vincent, “Spectral masking and filtering,” in Audio Source Separation and Speech Enhancement, E. Vincent, T. Virtanen, and S. Gannot, Eds. John Wiley & Sons, 2018, ch. 5, pp. 65–85.
- [2] R. C. Hendriks, T. Gerkmann, and J. Jensen, DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement: A Survey of the State of the Art. Morgan & Claypool Publishers, 2013.
- [3] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [4] D. Bahdanau, K. Cho, and Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 2015. http://arxiv.org/abs/1409.0473
- [5] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," in SSW, vol. 125, 2016, p. 2.
- [6] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, “Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR,” in Latent Variable Analysis and Signal Separation, E. Vincent, A. Yeredor, Z. Koldovský, and P. Tichavský, Eds. Springer International Publishing, 2015, pp. 91–99. http://link.springer.com/10.1007/978-3-319-22482-4_11
- [7] K. Tan and D. Wang, “A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement,” in Interspeech 2018. ISCA, Sep. 2018, pp. 3229–3233. https://www.isca-speech.org/archive/interspeech_2018/tan18_interspeech.html
- [8] J. Kim, M. El-Khamy, and J. Lee, “T-GSA: Transformer with Gaussian-Weighted Self-Attention for Speech Enhancement,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE, May 2020, pp. 6649–6653. https://ieeexplore.ieee.org/document/9053591/
- [9] J. Richter, G. Carbajal, and T. Gerkmann, “Speech enhancement with stochastic temporal convolutional networks,” in ISCA Interspeech, Shanghai, China, Oct. 2020.
- [10] R. Rehr and T. Gerkmann, “SNR-based features and diverse training data for robust DNN-based speech enhancement,” IEEE/ACM Trans. Audio, Speech, Language Proc., vol. 29, pp. 1937–1949, 2021.
- [11] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, Aug. 2019. https://ieeexplore.ieee.org/document/8707065/
- [12] D. Ditter and T. Gerkmann, “A Multi-Phase Gammatone Filterbank for Speech Separation Via Tasnet,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE, May 2020, pp. 36–40. https://ieeexplore.ieee.org/document/9053602/
- [13] J. Heitkaemper, D. Jakobeit, C. Boeddeker, L. Drude, and R. Haeb-Umbach, “Demystifying TasNet: A Dissecting Approach,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE, May 2020, pp. 6359–6363. https://ieeexplore.ieee.org/document/9052981/
- [14] Y. Luo, Z. Chen, and T. Yoshioka, “Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE, May 2020, pp. 46–50. https://ieeexplore.ieee.org/document/9054266/
- [15] J. Chen, Q. Mao, and D. Liu, "Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation," in Interspeech 2020. ISCA, Oct. 2020, pp. 2642–2646. https://www.isca-speech.org/archive/interspeech_2020/chen20l_interspeech.html
- [16] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, "Attention Is All You Need In Speech Separation," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE, Jun. 2021, pp. 21–25. https://ieeexplore.ieee.org/document/9413901/
- [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” Advances in Neural Information Processing Systems, vol. 30, Dec. 2017. http://arxiv.org/abs/1706.03762
- [18] M. Kazama, S. Gotoh, M. Tohyama, and T. Houtgast, “On the significance of phase in the short term Fourier spectrum for speech intelligibility,” The Journal of the Acoustical Society of America, vol. 127, no. 3, pp. 1432–1439, Mar. 2010. http://asa.scitation.org/doi/10.1121/1.3294554
- [19] M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “Filterbank Design for End-to-end Speech Separation,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE, May 2020, pp. 6364–6368. https://ieeexplore.ieee.org/document/9053038/
- [20] T. Peer and T. Gerkmann, “Intelligibility Prediction of Speech Reconstructed From Its Magnitude or Phase,” in ITG Conference on Speech Communication, 2021, p. 5.
- [21] ——, “Phase-Aware Deep Speech Enhancement: It’s All About The Frame Length,” arXiv:2203.16222 [cs, eess], Mar. 2022, arXiv:2203.16222. http://arxiv.org/abs/2203.16222
- [22] D. Wang and Jae Lim, “The unimportance of phase in speech enhancement,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, no. 4, pp. 679–681, Aug. 1982. http://ieeexplore.ieee.org/document/1163920/
- [23] S.-F. Huang, S.-P. Chuang, D.-R. Liu, Y.-C. Chen, G.-P. Yang, and H.-y. Lee, “Stabilizing Label Assignment for Speech Separation by Self-Supervised Pre-Training,” in Interspeech 2021. ISCA, Aug. 2021, pp. 3056–3060. https://www.isca-speech.org/archive/interspeech_2021/huang21h_interspeech.html
- [24] C. K. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results,” in Interspeech 2020. ISCA, Oct. 2020, pp. 2492–2496. https://www.isca-speech.org/archive/interspeech_2020/reddy20_interspeech.html
- [25] J. S. Garofolo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) Complete,” LDC93S6A. Web Download. Philadelphia: Linguistic Data Consortium, 1993.
- [26] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Analysis and outcomes,” Computer Speech & Language, vol. 46, pp. 605–626, Nov. 2017. https://linkinghub.elsevier.com/retrieve/pii/S088523081630122X
- [27] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 2015. http://arxiv.org/abs/1412.6980
- [28] J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – Half-baked or Well Done?” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, United Kingdom: IEEE, May 2019, pp. 626–630. https://ieeexplore.ieee.org/document/8683855/
- [29] J. G. Beerends, M. Obermann, R. Ullmann, J. Pomy, and M. Keyhl, “Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part I–Temporal Alignment,” J. Audio Eng. Soc., vol. 61, no. 6, p. 19, 2013.
- [30] J. Jensen and C. H. Taal, “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, Nov. 2016. http://ieeexplore.ieee.org/document/7539284/