The performance of speech enhancement and source separation systems has vastly improved with the introduction of deep learning and neural network based techniques [1, 2, 3, 4, 5, 6, 7, 8, 9]. Some recent advances include the use of generative adversarial networks, sophisticated adaptations of U-Net based architectures [12, 13], and many more. Designing end-to-end systems that directly estimate the waveform of the enhanced speech signal by operating on the noisy speech waveform has proven to be beneficial and has resulted in several high-performance models [2, 3, 4, 5, 13, 6].
End-to-end speech enhancement networks typically replace the short-time Fourier transform (STFT) operation with a learnable “front-end” layer [2, 3, 4, 5]. To this end, the first layer of such neural models performs windowing followed by a dense matrix multiplication. To transform the data back into the waveform domain, these models also employ a trainable back-end layer, which inverts the front-end via another dense matrix multiplication and the overlap-add method.
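As a rough illustration of the dense front-end and back-end described above, the following NumPy sketch frames a signal, applies a window and a dense matrix, and inverts the result with a second dense matrix and overlap-add. The helper names, the frame size, the hop and the square-root Hann window are assumptions made for this example; the dense matrices are simply initialized to the DFT and its inverse, whereas a learnable front-end would treat their entries as free parameters.

```python
import numpy as np

def frame_signal(x, n_fft, hop):
    """Stack overlapping length-n_fft frames of x as the columns of a matrix."""
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[:, None] + hop * np.arange(n_frames)[None, :]
    return x[idx]  # shape (n_fft, n_frames)

def overlap_add(frames, hop):
    """Sum the columns of `frames` back into a waveform at their hop offsets."""
    n_fft, n_frames = frames.shape
    out = np.zeros(n_fft + hop * (n_frames - 1))
    for t in range(n_frames):
        out[t * hop : t * hop + n_fft] += frames[:, t]
    return out

n_fft, hop = 16, 8                              # illustrative sizes only
# Periodic square-root Hann window (assumed); its square overlap-adds to 1 at 50% overlap.
window = np.sqrt(np.hanning(n_fft + 1)[:-1])

# Dense front-end and back-end matrices, here initialized to the DFT and inverse DFT.
front = np.fft.fft(np.eye(n_fft))               # n_fft * n_fft complex entries
back = np.conj(front) / n_fft

rng = np.random.default_rng(0)
x = rng.standard_normal(64)

frames = frame_signal(x, n_fft, hop)
coeffs = front @ (window[:, None] * frames)             # windowing, then dense matmul
recon_frames = window[:, None] * (back @ coeffs).real   # dense inverse, then synthesis window
y = overlap_add(recon_frames, hop)

# Samples covered by two frames are reconstructed; the first and last `hop` samples are not.
print(np.allclose(y[hop:-hop], x[hop:-hop]))
```

The dense matrices grow quadratically with the transform size, which is the storage cost that motivates the structured front-end proposed in this work.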
With growing interest in “hearables” and other wearable audio devices, low-compute and real-time scenarios are increasingly encountered in audio processing applications. These devices come with low power, low memory and stringent compute requirements but offer the opportunity for audio processing everywhere. In these contexts, storing and performing inference with dense matrices inside a trainable STFT can be prohibitively expensive or downright infeasible. For example, to learn an STFT with -point transforms takes parameters. The front-end parameters alone could fill the L2 cache of a modern processor, leaving no room for the rest of the model to be evaluated without cache-misses. We aim to address this issue by creating an efficient front-end for low-compute models operating directly on the waveform.
In this work we propose an efficient learnable STFT front-end for low-compute audio applications and show how it can be used to improve end-to-end speech enhancement. The trainable FFT mirrors the butterfly mechanism of the Fast Fourier Transform (FFT) and, when initialized appropriately, computes the Fourier transform. We also propose replacing the standard fixed window of the STFT with a trainable windowing layer. In terms of computational advantages, our model requires no increase in compute and a minimal increase in the number of required parameters compared to the fixed FFT. Using our model also leads to significant savings in memory compared to standard adaptive front-end implementations. We evaluate our model on the VCTK speech enhancement task and demonstrate that the proposed front-end outperforms STFT front-ends on a variety of perceptually-motivated speech enhancement metrics.
2 The FFT as a Network Layer
Fast Fourier Transforms (FFT) are based on factoring the Discrete Fourier Transform (DFT) matrix into a set of sparse matrices. We implement these sparse matrix operations efficiently in MXNet to make a trainable layer for low-compute environments. The factorization we use is based on the butterfly mechanism and is as follows.
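Recall that, given a vector $\mathbf{x}$ of length $N$, the $N$-point DFT applies a transformation $\mathbf{F}_N$ to get the DFT coefficients $\mathbf{X} = \mathbf{F}_N \mathbf{x}$; with the element-wise version of this operation being,

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad k = 0, \dots, N-1. \tag{1}$$

Here, $W_N = e^{-j 2\pi / N}$ denotes the twiddle-factor of the DFT. As usual, split Eq. 1 into two $N/2$-point DFT operations on the even-indexed and the odd-indexed elements of $\mathbf{x}$,

$$X[k] = \sum_{n=0}^{N/2-1} x[2n]\, W_N^{2nk} + \sum_{n=0}^{N/2-1} x[2n+1]\, W_N^{(2n+1)k} \tag{2}$$

$$X[k] = \sum_{n=0}^{N/2-1} x[2n]\, W_{N/2}^{nk} + W_N^{k} \sum_{n=0}^{N/2-1} x[2n+1]\, W_{N/2}^{nk}. \tag{3}$$

The twiddle-factors are odd-symmetric about $N/2$, i.e., $W_N^{k+N/2} = -W_N^{k}$, while $W_{N/2}^{n(k+N/2)} = W_{N/2}^{nk}$. Thus, for $k = 0, \dots, N/2-1$,

$$X[k + N/2] = \sum_{n=0}^{N/2-1} x[2n]\, W_{N/2}^{nk} - W_N^{k} \sum_{n=0}^{N/2-1} x[2n+1]\, W_{N/2}^{nk}. \tag{4}$$

Defining a diagonal matrix of twiddle-factor values, $\boldsymbol{\Omega}_{N/2} = \mathrm{diag}(W_N^{0}, W_N^{1}, \dots, W_N^{N/2-1})$, we rewrite Eqs. 3 and 4 in matrix form as,

$$\mathbf{X} = \mathbf{F}_N \mathbf{x} = \begin{bmatrix} \mathbf{F}_{N/2}\,\mathbf{x}_{\text{even}} + \boldsymbol{\Omega}_{N/2}\,\mathbf{F}_{N/2}\,\mathbf{x}_{\text{odd}} \\ \mathbf{F}_{N/2}\,\mathbf{x}_{\text{even}} - \boldsymbol{\Omega}_{N/2}\,\mathbf{F}_{N/2}\,\mathbf{x}_{\text{odd}} \end{bmatrix}.$$

In this equation, $\mathbf{x}_{\text{odd}}$ and $\mathbf{x}_{\text{even}}$ denote the odd-indexed and even-indexed terms of $\mathbf{x}$.

Then, we factor out the $N/2$-point DFT, and apply an even/odd permutation matrix $\mathbf{P}_N$, defined by $\mathbf{P}_N \mathbf{x} = [\mathbf{x}_{\text{even}}^{\top}, \mathbf{x}_{\text{odd}}^{\top}]^{\top}$, to get,

$$\mathbf{F}_N \mathbf{x} = \begin{bmatrix} \mathbf{I}_{N/2} & \boldsymbol{\Omega}_{N/2} \\ \mathbf{I}_{N/2} & -\boldsymbol{\Omega}_{N/2} \end{bmatrix} \begin{bmatrix} \mathbf{F}_{N/2} & \mathbf{0} \\ \mathbf{0} & \mathbf{F}_{N/2} \end{bmatrix} \mathbf{P}_N\, \mathbf{x}.$$

Disregarding the data, we can write $\mathbf{F}_N$ as,

$$\mathbf{F}_N = \begin{bmatrix} \mathbf{I}_{N/2} & \boldsymbol{\Omega}_{N/2} \\ \mathbf{I}_{N/2} & -\boldsymbol{\Omega}_{N/2} \end{bmatrix} \begin{bmatrix} \mathbf{F}_{N/2} & \mathbf{0} \\ \mathbf{0} & \mathbf{F}_{N/2} \end{bmatrix} \mathbf{P}_N.$$

Substitute $\mathbf{B}_N$ for the matrix of twiddle factors to get,

$$\mathbf{F}_N = \mathbf{B}_N \begin{bmatrix} \mathbf{F}_{N/2} & \mathbf{0} \\ \mathbf{0} & \mathbf{F}_{N/2} \end{bmatrix} \mathbf{P}_N, \qquad \mathbf{B}_N = \begin{bmatrix} \mathbf{I}_{N/2} & \boldsymbol{\Omega}_{N/2} \\ \mathbf{I}_{N/2} & -\boldsymbol{\Omega}_{N/2} \end{bmatrix}.$$

To simplify even further, $\mathbf{F}_{N/2}$ is,

$$\mathbf{F}_{N/2} = \mathbf{B}_{N/2} \begin{bmatrix} \mathbf{F}_{N/4} & \mathbf{0} \\ \mathbf{0} & \mathbf{F}_{N/4} \end{bmatrix} \mathbf{P}_{N/2}.$$

Thus, we can write $\mathbf{F}_N$ in terms of $\mathbf{F}_{N/4}$ as,

$$\mathbf{F}_N = \mathbf{B}_N \begin{bmatrix} \mathbf{B}_{N/2} & \mathbf{0} \\ \mathbf{0} & \mathbf{B}_{N/2} \end{bmatrix} \begin{bmatrix} \mathbf{F}_{N/4} & & & \\ & \mathbf{F}_{N/4} & & \\ & & \mathbf{F}_{N/4} & \\ & & & \mathbf{F}_{N/4} \end{bmatrix} \begin{bmatrix} \mathbf{P}_{N/2} & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_{N/2} \end{bmatrix} \mathbf{P}_N.$$

It is necessary to stack the factorization of $\mathbf{F}_{N/2}$ since it occurs more than once. The component matrix $\mathbf{B}_N$ and the stacked copies of $\mathbf{B}_{N/2}$ are composed of stacks of diagonal matrices. Generalizing this further, we can represent an $N$-point FFT as a series of matrix multiplications where the first matrix applied to the data is a permutation matrix and all other matrices are sparse matrices formed by stacks of diagonal matrices. Mathematically, we can write the $N$-point DIT-FFT algorithm as a matrix multiplication by

$$\mathbf{F}_N = \mathbf{W}_N\, \mathbf{W}_{N/2} \cdots \mathbf{W}_4\, \mathbf{W}_2\, \mathbf{P},$$

where $\mathbf{W}_k$ denotes the $N \times N$ twiddle factor matrix for butterflies of size $k$, $\mathbf{P}$ denotes the product of all permutation matrices (the bit-reversal permutation), and $M = \log_2 N$ is the number of twiddle factor matrix multiplications involved. We can write a general formula to construct the twiddle factor matrix $\mathbf{W}_k$ using identity matrices of size $N/k$ and $k/2$, as well as the Kronecker product $\otimes$, as follows,

$$\mathbf{W}_k = \mathbf{I}_{N/k} \otimes \begin{bmatrix} \mathbf{I}_{k/2} & \boldsymbol{\Omega}_{k/2} \\ \mathbf{I}_{k/2} & -\boldsymbol{\Omega}_{k/2} \end{bmatrix}, \qquad \boldsymbol{\Omega}_{k/2} = \mathrm{diag}(W_k^{0}, \dots, W_k^{k/2-1}), \qquad k \in \{2, 4, \dots, N\}.$$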
Fig. 1 visualizes these matrices and the associated sparsity patterns for an -point DIT-FFT algorithm.
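As an illustrative example, consider the FFT matrix derivation for a $4$-point FFT operation. Given the permuted data samples, we first apply a $2$-point DFT operation on successive pairs of input samples. This step can be written as a matrix multiplication operation by matrix $\mathbf{W}_2$ where,

$$\mathbf{W}_2 = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}.$$

The next step applies a matrix multiplication operation using matrix $\mathbf{W}_4$ where we can write $\mathbf{W}_4$ in terms of the twiddle factor $W_4 = e^{-j 2\pi / 4} = -j$ as,

$$\mathbf{W}_4 = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & W_4 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -W_4 \end{bmatrix}.$$

The overall 4-point FFT operation can be expressed as,

$$\mathbf{F}_4 = \mathbf{W}_4\, \mathbf{W}_2\, \mathbf{P},$$

where $\mathbf{P}$ is the permutation that reorders the input samples as $x[0], x[2], x[1], x[3]$.

We see that the matrix $\mathbf{W}_2$ is a sparse matrix with a block diagonal structure. Similarly, $\mathbf{W}_4$ is also sparse and composed of stacks of diagonal matrices.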
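The factorization in this section can be checked numerically. The sketch below is a plain NumPy illustration, not the MXNet implementation used in this work, and the function names are hypothetical. It builds each sparse stage with a Kronecker product, accumulates the even/odd permutations, and compares the product of all factors against the dense DFT matrix.

```python
from functools import reduce

import numpy as np

def twiddle_matrix(k, n):
    """n x n stage matrix I_{n/k} (x) [[I, Omega], [I, -Omega]] for butterflies of size k."""
    half = k // 2
    omega = np.diag(np.exp(-2j * np.pi * np.arange(half) / k))
    eye = np.eye(half)
    butterfly = np.block([[eye, omega], [eye, -omega]])
    return np.kron(np.eye(n // k), butterfly)

def even_odd_permutation(m):
    """m x m matrix that moves even-indexed entries ahead of odd-indexed entries."""
    order = np.concatenate([np.arange(0, m, 2), np.arange(1, m, 2)])
    return np.eye(m)[order]

def fft_factors(n):
    """Return ([W_n, W_{n/2}, ..., W_2], P) for a power-of-two n."""
    stages, perm = [], np.eye(n)
    k = n
    while k >= 2:
        stages.append(twiddle_matrix(k, n))
        # Each recursion level permutes within blocks of size k, after the coarser levels.
        perm = np.kron(np.eye(n // k), even_odd_permutation(k)) @ perm
        k //= 2
    return stages, perm

n = 8
stages, perm = fft_factors(n)
dft_from_factors = reduce(np.matmul, stages + [perm])

print(np.allclose(dft_from_factors, np.fft.fft(np.eye(n))))
print([int(np.count_nonzero(s)) for s in stages], "nonzeros per stage vs", n * n, "dense entries")
```

Each stage has two nonzero entries per row, so the number of stored values grows as $2N\log_2 N$ rather than $N^2$.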
2.1 Trainable FFT layer
To make the above FFT layer trainable, the set of matrix multiplies can be represented as a neural network with several sparsely connected layers. The DFT on the other hand is a single layer with dense connectivity. We preserve the general FFT structure by only connecting network nodes on the block diagonal structure given by the FFT. This preserves the speed and structure of the FFT while both reducing the number of parameters in the front-end and operating on the raw waveform. Fig. 2 illustrates the DFT and FFT connectivity structures and how they may be interpreted as neural network layers. In practice, we explicitly implement all complex operations with real values. When initialized to do so, the model returns identical results to the FFT algorithm. All of these operations can be efficiently implemented and trained with sparse matrix routines.
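A minimal NumPy sketch of this idea follows. It is an illustration under stated assumptions rather than the MXNet layer used in this work: the class name, the untied real-valued weights and the plain gradient step are all hypothetical choices, and this untied parametrization has more free weights than the parameter counts reported later in the paper. Each complex stage is stored as a real matrix acting on stacked real and imaginary parts, and a binary mask restricts updates to the FFT connectivity pattern.

```python
import numpy as np

def fft_stage(k, n):
    """Complex n x n butterfly stage for butterflies of size k."""
    half = k // 2
    omega = np.diag(np.exp(-2j * np.pi * np.arange(half) / k))
    eye = np.eye(half)
    return np.kron(np.eye(n // k), np.block([[eye, omega], [eye, -omega]]))

def bit_reversal(n):
    """Input ordering expected by the decimation-in-time stages."""
    bits = n.bit_length() - 1
    return np.array([int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)])

def to_real(mat):
    """Real 2n x 2n matrix that applies `mat` to stacked [real; imag] vectors."""
    return np.block([[mat.real, -mat.imag], [mat.imag, mat.real]])

class SparseFFTLayers:
    """Hypothetical container: one masked, real-valued weight matrix per FFT stage."""

    def __init__(self, n):
        self.n = n
        self.perm = bit_reversal(n)
        stages = [fft_stage(2 ** s, n) for s in range(1, n.bit_length())]  # sizes 2, 4, ..., n
        self.weights = [to_real(st) for st in stages]
        # A complex connection touches all four real blocks, so the mask is tiled 2 x 2.
        self.masks = [np.kron(np.ones((2, 2)), (np.abs(st) > 0).astype(float)) for st in stages]

    def forward(self, x):
        x = np.asarray(x, dtype=complex)[self.perm]
        z = np.concatenate([x.real, x.imag])
        for w in self.weights:
            z = w @ z
        return z[: self.n] + 1j * z[self.n :]

    def sgd_step(self, grads, lr):
        """Masked update: entries outside the FFT connectivity never change."""
        for w, m, g in zip(self.weights, self.masks, grads):
            w -= lr * m * g

rng = np.random.default_rng(0)
layer = SparseFFTLayers(8)
x = rng.standard_normal(8)
print(np.allclose(layer.forward(x), np.fft.fft(x)))  # initialized layers reproduce the FFT
```

For clarity the sketch stores each stage as a dense array with zeros; an efficient implementation would keep only the masked entries in a sparse format.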
2.2 Inverse FFT
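To compute the inverse FFT we use the time reversal conjugate trick. Given the DFT representation $\mathbf{X}$ of the time domain frame $\mathbf{x}$, we can compute the inverse Fourier transform of $\mathbf{X}$ using only the FFT. In particular,

$$\mathbf{x} = \frac{1}{N}\, \overline{\mathbf{F}_N\, \overline{\mathbf{X}}},$$

where $\mathbf{F}_N$ is the $N$-point DFT computed by the FFT and the overline denotes complex conjugation.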
In our model we leverage this property. However, the forward FFT layer and the inverse FFT layer do not share parameters. We use different learned FFTs for the forward and inverse transforms. These FFT layers are initialized and updated as separate entities.
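The property can be verified with a short NumPy check. The function names below are illustrative, and `np.fft.fft` stands in for the forward transform; in the proposed model a separately parametrized learned FFT would take its place. Two equivalent forms are shown, one using conjugation and one using index reversal.

```python
import numpy as np

def ifft_via_conjugation(X, fft=np.fft.fft):
    """Inverse DFT from a forward transform: conj(FFT(conj(X))) / N."""
    return np.conj(fft(np.conj(X))) / len(X)

def ifft_via_reversal(X, fft=np.fft.fft):
    """Inverse DFT from a forward transform: FFT(X) read at indices (-n mod N), over N."""
    n = len(X)
    return fft(X)[(-np.arange(n)) % n] / n

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
X = np.fft.fft(x)

print(np.allclose(ifft_via_conjugation(X), x))
print(np.allclose(ifft_via_reversal(X), x))
```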
3 The Adaptive STFT for Speech Enhancement
Given a trainable FFT front-end, we can now operate the model on time domain waveforms. For our models we use 256-point FFTs such that this model’s front-end has about two orders of magnitude fewer weights than a typical trainable STFT front-end. In addition to making the FFT and IFFT trainable, we also show how we can make trainable synthesis and analysis windows.
3.1 Learning a Window
The typical $N$-point STFT with a hop-size of $h$ chunks an input audio sequence $s$ into windows of size $N$ with an overlap of $N - h$. These overlapping frames are stacked to form the columns of a matrix. Call this matrix $\mathbf{Y}$, where the first column of $\mathbf{Y}$ holds values $s[0:N]$, the second column holds values $s[h:N+h]$, and so on in the standard Python notation. To apply a windowing function to $\mathbf{Y}$, construct a windowing matrix $\mathbf{D}$. The matrix $\mathbf{D}$ is diagonal with the desired $N$-point window on its diagonal. The windowed version of $\mathbf{Y}$ is then $\mathbf{D}\mathbf{Y}$. This same logic applies for windowing in the inverse STFT.
By making $\mathbf{D}$ a network parameter, we can learn a windowing function suited to the learned transform. For all our experiments, $\mathbf{D}$ is initialized as a Hann window. During training, $\mathbf{D}$ is freely updated without non-negativity or constant-overlap-add constraints. In Fig. 3, we show that the learned analysis and synthesis windows are highly structured.
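The framing and windowing steps can be sketched in NumPy as follows. The sizes, the toy loss, the learning rate and the use of NumPy's symmetric `np.hanning` are assumptions for illustration; the exact Hann variant and optimizer are not specified here. The last lines show one unconstrained gradient step on the window, using the fact that only the diagonal of the windowing matrix is a parameter.

```python
import numpy as np

def frame_matrix(s, n_fft, hop):
    """Columns are s[0:n_fft], s[hop:n_fft+hop], ... (overlap of n_fft - hop)."""
    n_frames = 1 + (len(s) - n_fft) // hop
    return np.stack([s[t * hop : t * hop + n_fft] for t in range(n_frames)], axis=1)

n_fft, hop = 8, 4                     # illustrative sizes only
s = np.arange(20, dtype=float)        # toy input sequence
frames = frame_matrix(s, n_fft, hop)  # shape (n_fft, n_frames)

window = np.hanning(n_fft)            # Hann initialization (symmetric variant assumed)
window_matrix = np.diag(window)       # diagonal windowing matrix
windowed = window_matrix @ frames     # same as scaling row i of `frames` by window[i]

# Toy loss 0.5 * sum(windowed ** 2); its gradient with respect to the output is `windowed`.
grad_output = windowed
grad_window = np.sum(grad_output * frames, axis=1)   # gradient w.r.t. the diagonal entries
updated_window = window - 0.01 * grad_window         # no sign or overlap-add constraint

print(frames.shape, np.allclose(windowed, window[:, None] * frames))
```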
3.2 Model Architecture
In conjunction with the learned transforms and windows, we use a masking based separation network. The learned FFT front-end transforms each column of the windowed frame matrix. Let $\mathcal{F}$ represent the trainable FFT front-end and $\mathbf{Z}$ the transform-domain representation it produces. The masking network takes $\mathbf{Z}$ and predicts two sigmoid masks: $\mathbf{M}_r$ and $\mathbf{M}_i$. These masks are applied via element-wise multiplication to produce an estimate $\hat{\mathbf{S}}$ of the clean speech in the transform domain, in terms of its real and imaginary parts. Specifically,

$$\hat{\mathbf{S}} = \mathbf{M}_r \odot \mathcal{R}(\mathbf{Z}) + j\, \mathbf{M}_i \odot \mathcal{I}(\mathbf{Z}).$$

Here $\mathcal{R}$ and $\mathcal{I}$ compute the element-wise real and imaginary components of a complex matrix, and $\odot$ denotes the element-wise multiplication operation. In our experiments, we found that using separate real and imaginary masks outperformed a single magnitude mask. Fig. 4 illustrates the full model pipeline.
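The masking step can be illustrated with a few lines of NumPy. The function and variable names are hypothetical, and the random logits merely stand in for the outputs of the masking network.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def apply_masks(spec, mask_real, mask_imag):
    """Scale the real and imaginary parts of `spec` by separate element-wise masks."""
    return mask_real * spec.real + 1j * (mask_imag * spec.imag)

rng = np.random.default_rng(0)
n_bins, n_frames = 4, 3                                   # illustrative sizes only
spec = rng.standard_normal((n_bins, n_frames)) + 1j * rng.standard_normal((n_bins, n_frames))

# Stand-ins for the two network outputs; sigmoid keeps every mask value in (0, 1).
mask_real = sigmoid(rng.standard_normal((n_bins, n_frames)))
mask_imag = sigmoid(rng.standard_normal((n_bins, n_frames)))

estimate = apply_masks(spec, mask_real, mask_imag)
print(estimate.shape, estimate.dtype)
```

Because the two masks are independent, the estimate can change the phase of each bin as well as its magnitude, which a single magnitude mask cannot do.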
We experimented with a mask prediction RNN containing 80k parameters. This network is composed of two linear layers and a gated recurrent unit (GRU). The GRU is unidirectional for our intended use case of real-time speech enhancement. Instead of performing complex valued back-propagation, we simply stack the real and imaginary components of the input and output. The masking network architecture is shown in Fig. 5.
We use the 56-speaker VCTK training and testing setup, where each speaker has about 400 sentences. During training, we mix speech and noise at signal-to-noise ratios (SNRs) of 0 dB, 5 dB, 10 dB, and 15 dB. We train all models until convergence on the training set before evaluating them. During evaluation, we use SNRs of 2.5 dB, 7.5 dB, 12.5 dB, and 17.5 dB.
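Mixing at a target SNR amounts to scaling one of the two signals. The sketch below assumes the noise is scaled while the speech is left unchanged, with power measured as the mean square over the whole clip; both choices and the function name are assumptions, and the dataset's own mixing procedure may differ.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`, then add."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-ins for real recordings
noise = rng.standard_normal(16000)

for snr_db in (0, 5, 10, 15):         # training SNRs listed above
    mixture, scaled_noise = mix_at_snr(speech, noise, snr_db)
    measured = 10.0 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
    print(snr_db, round(float(measured), 6))
```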
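For the loss function, we use the complex loss given in Eq. 16. Over informal experiments, we found that this loss function performed better than time domain loss functions in terms of perceptual metrics. The loss function is a weighted combination of magnitude mean squared error and complex mean squared error loss. Here, $\hat{\mathbf{S}}$ is the predicted clean Fourier spectrum, $\mathbf{S}$ is the true clean Fourier spectrum, $p$ is the power, and $\lambda$ is the weight.

$$\mathcal{L}(\hat{\mathbf{S}}, \mathbf{S}) = \left\| \, |\hat{\mathbf{S}}|^{p} - |\mathbf{S}|^{p} \, \right\|_2^2 + \lambda \left\| \, \hat{\mathbf{S}}^{p} - \mathbf{S}^{p} \, \right\|_2^2 \tag{16}$$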
The power is applied element-wise and in the case of complex numbers is applied on the predicted magnitude and then multiplied with the predicted phase. For our own experiments, we use and .
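A hedged NumPy sketch of a loss with this structure is given below. Several details are assumptions made for illustration: the weight is placed on the complex term, the errors are averaged rather than summed, and the function names and the values passed for `power` and `weight` are arbitrary placeholders rather than the settings used in our experiments.

```python
import numpy as np

def power_compress(spec, power):
    """Raise the magnitude to `power` element-wise and keep the original phase."""
    return np.abs(spec) ** power * np.exp(1j * np.angle(spec))

def complex_loss(pred, target, power, weight):
    """Magnitude MSE plus `weight` times complex MSE, both on power-compressed spectra."""
    magnitude_term = np.mean((np.abs(pred) ** power - np.abs(target) ** power) ** 2)
    complex_term = np.mean(
        np.abs(power_compress(pred, power) - power_compress(target, power)) ** 2
    )
    return magnitude_term + weight * complex_term

rng = np.random.default_rng(0)
target = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
pred = target + 0.1 * (rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3)))

# Placeholder hyperparameters for illustration only.
print(complex_loss(pred, target, power=0.5, weight=1.0) > 0.0)
print(complex_loss(target, target, power=0.5, weight=1.0) == 0.0)
```

The magnitude term ignores phase errors, while the complex term penalizes them, so the weight controls how strongly phase accuracy is enforced.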
With low-compute scenarios in mind, we examined a model with approximately 80k parameters. The majority of the parameters are used in the masking network with the learned FFT and learned window using 512 parameters each. For all model runs, we initialize the windows as Hann windows and the trainable FFTs as FFTs. In our experiments, we compared four models with the attributes described below. The tested setups are: (1) Fixed Window Fixed FFT, (2) Trainable Window Fixed FFT, (3) Fixed Window Trainable FFT, (4) Trainable Window Trainable FFT. In the above list, fixed denotes parameters that were frozen and not updated during training. The Fixed Window Fixed FFT model has (1024) fewer parameters than the Trainable Window Trainable FFT model.
The first model has fixed Hann windows and a fixed FFT. This model only learns a masking network and serves as a benchmark for our adaptations. The other models serve to illustrate the benefits of and relationship between trainable windows and trainable FFTs.
4.4 Evaluation Metrics
We evaluate the above models on a speech enhancement task and compare them on the following metrics: signal distortion ($C_{\mathrm{sig}}$), noise distortion ($C_{\mathrm{bak}}$), overall quality ($C_{\mathrm{ovl}}$), Perceptual Evaluation of Speech Quality (PESQ), and segmental SNR (SSNR). $C_{\mathrm{sig}}$, $C_{\mathrm{bak}}$, $C_{\mathrm{ovl}}$, and PESQ are perceptual metrics intended to imitate a mean opinion score test. $C_{\mathrm{sig}}$ estimates distortion of the speech signal, $C_{\mathrm{bak}}$ estimates intrusiveness of the noise, $C_{\mathrm{ovl}}$ summarizes the overall quality, and PESQ estimates the speech quality. For all of these metrics, higher is better.
Table 1 gives the results of our experiments for the 80k parameter model. We use the fixed FFT, fixed window version without any trainable parameters in its front-end as the baseline model. In general, we observe that making the FFT layer trainable improves speech enhancement performance. This improvement is consistently observed both in the case of a fixed window (compare Fixed Window Fixed FFT to Fixed Window Trainable FFT) and when the window is trainable (compare Trainable Window Fixed FFT to Trainable Window Trainable FFT). The effects of the window function are less consistent. A trainable window with a fixed FFT degrades separation performance (compare Fixed Window Fixed FFT to Trainable Window Fixed FFT). In contrast, a trainable window used with a trainable FFT improves upon the fixed window, trainable FFT model (compare Fixed Window Trainable FFT to Trainable Window Trainable FFT). Overall, when a trainable window is used in conjunction with a trainable FFT, we get the best performance across all metrics.
In light of the need for high-performance speech systems in low-compute contexts, we proposed an alternative to learning a DFT matrix in trainable STFT systems. Our efficient front-end leverages the sparse structure of FFTs to reduce both computational and memory requirements. The trainable front-end is made up of several highly structured sparse linear layers and a learned window. We demonstrated an application of this front-end in speech enhancement. Using a trainable FFT and a trainable window improves speech enhancement performance over a fixed STFT system with no increase in computational complexity.
-  Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis, “Deep learning for monaural speech separation,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 1562–1566.
-  Shrikant Venkataramani, Jonah Casebeer, and Paris Smaragdis, “End-to-end source separation with adaptive front-ends,” in 2018 52nd Asilomar Conference on Signals, Systems, and Computers. IEEE, 2018, pp. 684–688.
-  Yi Luo and Nima Mesgarani, “Tasnet: time-domain audio separation network for real-time, single-channel speech separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 696–700.
-  Yi Luo and Nima Mesgarani, “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
-  Ziqiang Shi, Huibin Lin, Liu Liu, Rujie Liu, Jiqing Han, and Anyan Shi, “Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation,” Proc. Interspeech 2019, pp. 3183–3187, 2019.
-  Dario Rethage, Jordi Pons, and Xavier Serra, “A wavenet for speech denoising,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5069–5073.
-  John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31–35.
-  Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 241–245.
-  Felix Weninger, John R Hershey, Jonathan Le Roux, and Björn Schuller, “Discriminatively trained recurrent neural networks for single-channel speech separation,” in 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 2014, pp. 577–581.
-  Santiago Pascual, Antonio Bonafonte, and Joan Serrà, “SEGAN: Speech Enhancement Generative Adversarial Network,” Proc. Interspeech 2017, pp. 3642–3646, 2017.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  Craig Macartney and Tillman Weyde, “Improved Speech Enhancement with the Wave-U-Net,” arXiv preprint arXiv:1811.11307, 2018.
-  Daniel Stoller, Sebastian Ewert, and Simon Dixon, “Wave-u-net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
-  Cassia Valentini-Botinhao et al., “Noisy speech database for training speech enhancement algorithms and TTS models,” University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR), 2017.
-  Charles Van Loan, Computational frameworks for the fast Fourier transform, vol. 10, Siam, 1992.
-  Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
-  Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech.,” in SSW, 2016, pp. 146–152.
-  Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  Kevin Wilson, Michael Chinen, Jeremy Thorpe, Brian Patton, John Hershey, Rif A Saurous, Jan Skoglund, and Richard F Lyon, “Exploring tradeoffs in models for low-latency speech enhancement,” in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2018, pp. 366–370.
-  Yi Hu and Philipos C Loizou, “Evaluation of objective measures for speech enhancement,” in Ninth International Conference on Spoken Language Processing, 2006.
-  ITU-T Recommendation, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Rec. ITU-T P. 862, 2001.