1 Introduction
The performance of speech enhancement and source separation systems has vastly improved with the introduction of deep learning and neural network based techniques [1, 2, 3, 4, 5, 6, 7, 8, 9]. Some recent advances include the use of generative adversarial networks [10], sophisticated adaptations of U-Net [11] based architectures [12, 13], and many more. Designing end-to-end systems that directly estimate the waveforms of the enhanced speech signal by operating on the noisy speech waveforms has proven to be beneficial and resulted in several high-performance models [2, 3, 4, 5, 13, 6].
End-to-end speech enhancement networks typically replace the Short-Time Fourier Transform (STFT) operation by a learnable “front-end” layer [2, 3, 4, 5]. To this end, the first layer of such neural models performs windowing followed by a dense matrix multiplication. To transform the data back into the waveform domain, these models also employ a trainable back-end layer which inverts the front-end via another dense matrix multiplication and the overlap-add method.
With growing interest in “hearables” and other wearable audio devices, low-compute and real-time scenarios are increasingly encountered in audio processing applications. These devices come with low power, low memory and stringent compute requirements but offer the opportunity for audio processing everywhere. In these contexts, storing and performing inference with dense matrices inside a trainable STFT can be prohibitively expensive or downright infeasible. For example, to learn an STFT with point transforms takes parameters. The front-end parameters alone could fill the L2 cache of a modern processor, leaving no room for the rest of the model to be evaluated without cache misses. We aim to address this issue by creating an efficient front-end for low-compute models operating directly on the waveform.
In this work we propose an efficient learnable STFT front-end for low-compute audio applications and show how it can be used to improve end-to-end speech enhancement. The trainable FFT mirrors the butterfly mechanism of the Fast Fourier Transform (FFT) and, when initialized appropriately, computes the Fourier transform. We also propose replacing the standard fixed window of the STFT by a trainable windowing layer. In terms of computational advantages, our model requires no increase in compute and a minimal increase in the number of required parameters compared to the fixed FFT. Using our model also leads to significant savings in memory compared to standard adaptive front-end implementations. We evaluate our model on the VCTK speech enhancement task [14] and demonstrate that the proposed front-end outperforms STFT front-ends on a variety of perceptually-motivated speech enhancement metrics.
2 The FFT as a Network Layer
Fast Fourier Transforms (FFT) are based on factoring the Discrete Fourier Transform (DFT) matrix into a set of sparse matrices [15]. We implement these sparse matrix operations efficiently in MXNet [16] to make a trainable layer for low-compute environments. The factorization we use is based on the butterfly mechanism and is as follows.
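Recall that, given a vector $\mathbf{x}$ of length $N$, the $N$-point DFT applies a transformation $\mathbf{F}_N$ to get the DFT coefficients $\mathbf{X} = \mathbf{F}_N \mathbf{x}$, with the elementwise version of this operation being,
$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad k = 0, 1, \dots, N-1. \tag{1}$$
Here, $W_N = e^{-j 2\pi / N}$ denotes the twiddle factor of the DFT. As usual, split Eq. 1 into two $N/2$-point DFT operations on the even-indexed and the odd-indexed elements of $\mathbf{x}$,
$$X[k] = \sum_{n=0}^{N/2-1} x[2n]\, W_N^{2nk} + \sum_{n=0}^{N/2-1} x[2n+1]\, W_N^{(2n+1)k} \tag{2}$$
$$X[k] = \sum_{n=0}^{N/2-1} x[2n]\, W_{N/2}^{nk} + W_N^{k} \sum_{n=0}^{N/2-1} x[2n+1]\, W_{N/2}^{nk}. \tag{3}$$
The twiddle factors are odd-symmetric about $N/2$, i.e., $W_N^{k+N/2} = -W_N^{k}$. Thus, for $k = 0, \dots, N/2-1$,
$$X[k + N/2] = \sum_{n=0}^{N/2-1} x[2n]\, W_{N/2}^{nk} - W_N^{k} \sum_{n=0}^{N/2-1} x[2n+1]\, W_{N/2}^{nk}. \tag{4}$$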
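The even/odd split can be checked numerically. The following is a minimal NumPy sketch, not the paper's implementation; the helper names `dft_direct` and `dft_split_once` are hypothetical and chosen only for illustration. It compares one level of the split against a direct DFT and against `numpy.fft.fft`.

```python
import numpy as np

def dft_direct(x):
    """Direct O(N^2) DFT: X[k] = sum_n x[n] * exp(-2j*pi*n*k/N)."""
    N = len(x)
    n = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(n, n) / N) @ x

def dft_split_once(x):
    """One even/odd split of the DFT; the length of x must be even."""
    N = len(x)
    E = dft_direct(x[0::2])  # N/2-point DFT of the even-indexed samples
    O = dft_direct(x[1::2])  # N/2-point DFT of the odd-indexed samples
    tw = np.exp(-2j * np.pi * np.arange(N // 2) / N)  # twiddle factors
    # First half adds the twiddled odd part, second half subtracts it.
    return np.concatenate([E + tw * O, E - tw * O])

rng = np.random.default_rng(0)
x = rng.standard_normal(8) + 1j * rng.standard_normal(8)
split_matches_direct = np.allclose(dft_split_once(x), dft_direct(x))
split_matches_numpy = np.allclose(dft_split_once(x), np.fft.fft(x))
```

Applying the same split recursively to the two half-length DFTs yields the usual radix-2 decimation-in-time FFT.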
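Defining a diagonal matrix of twiddle-factor values, $\mathbf{\Omega}_{N/2} = \mathrm{diag}(W_N^{0}, W_N^{1}, \dots, W_N^{N/2-1})$ with $W_N = e^{-j 2\pi / N}$, and writing $\mathbf{F}_{N/2}$ for the $N/2$-point DFT matrix, we rewrite Eq. 4 in matrix form as,
$$\mathbf{X} = \begin{bmatrix} \mathbf{F}_{N/2}\mathbf{x}_e + \mathbf{\Omega}_{N/2}\mathbf{F}_{N/2}\mathbf{x}_o \\ \mathbf{F}_{N/2}\mathbf{x}_e - \mathbf{\Omega}_{N/2}\mathbf{F}_{N/2}\mathbf{x}_o \end{bmatrix}. \tag{5}$$
In this equation, $\mathbf{x}_o$ and $\mathbf{x}_e$ denote the odd-indexed and even-indexed terms of $\mathbf{x}$,
$$\mathbf{x}_e = \big[x[0], x[2], \dots, x[N-2]\big]^{T}, \qquad \mathbf{x}_o = \big[x[1], x[3], \dots, x[N-1]\big]^{T}. \tag{6}$$
Then, we factor out the $N/2$-point DFT, and apply an even/odd permutation matrix $\mathbf{P}_N$ to get,
$$\mathbf{X} = \begin{bmatrix} \mathbf{I}_{N/2} & \mathbf{\Omega}_{N/2} \\ \mathbf{I}_{N/2} & -\mathbf{\Omega}_{N/2} \end{bmatrix} \begin{bmatrix} \mathbf{F}_{N/2} & \mathbf{0} \\ \mathbf{0} & \mathbf{F}_{N/2} \end{bmatrix} \mathbf{P}_N \mathbf{x}. \tag{7}$$
Disregarding the data, we can write $\mathbf{F}_N$ as,
$$\mathbf{F}_N = \begin{bmatrix} \mathbf{I}_{N/2} & \mathbf{\Omega}_{N/2} \\ \mathbf{I}_{N/2} & -\mathbf{\Omega}_{N/2} \end{bmatrix} \begin{bmatrix} \mathbf{F}_{N/2} & \mathbf{0} \\ \mathbf{0} & \mathbf{F}_{N/2} \end{bmatrix} \mathbf{P}_N. \tag{8}$$
Substitute $\mathbf{T}_N$ for the matrix of twiddle factors to get,
$$\mathbf{F}_N = \mathbf{T}_N \begin{bmatrix} \mathbf{F}_{N/2} & \mathbf{0} \\ \mathbf{0} & \mathbf{F}_{N/2} \end{bmatrix} \mathbf{P}_N. \tag{9}$$
To simplify even further, $\mathbf{F}_{N/2}$ is,
$$\mathbf{F}_{N/2} = \mathbf{T}_{N/2} \begin{bmatrix} \mathbf{F}_{N/4} & \mathbf{0} \\ \mathbf{0} & \mathbf{F}_{N/4} \end{bmatrix} \mathbf{P}_{N/2}. \tag{10}$$
Thus, we can write $\mathbf{F}_N$ in terms of $\mathbf{F}_{N/4}$ as,
$$\mathbf{F}_N = \mathbf{T}_N\, \tilde{\mathbf{T}}_{N/2}\, \mathrm{blkdiag}\big(\mathbf{F}_{N/4}, \mathbf{F}_{N/4}, \mathbf{F}_{N/4}, \mathbf{F}_{N/4}\big)\, \tilde{\mathbf{P}}_{N/2}\, \mathbf{P}_N, \tag{11}$$
where,
$$\tilde{\mathbf{T}}_{N/2} = \begin{bmatrix} \mathbf{T}_{N/2} & \mathbf{0} \\ \mathbf{0} & \mathbf{T}_{N/2} \end{bmatrix}, \qquad \tilde{\mathbf{P}}_{N/2} = \begin{bmatrix} \mathbf{P}_{N/2} & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_{N/2} \end{bmatrix}. \tag{12}$$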
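It is necessary to stack the factorization of $\mathbf{F}_{N/2}$ since it occurs more than once. The component twiddle factor matrices $\mathbf{T}_N$ and $\tilde{\mathbf{T}}_{N/2}$ are composed of stacks of diagonal matrices. Generalizing this further, we can represent an $N$-point FFT as a series of matrix multiplications where the first matrix is a permutation matrix and all other matrices are sparse matrices formed by stacks of diagonal matrices. Mathematically, we can write the $N$-point DIT-FFT algorithm as a matrix multiplication by
$$\mathbf{F}_N = \mathbf{T}^{(m)}\, \mathbf{T}^{(m-1)} \cdots \mathbf{T}^{(1)}\, \mathbf{P}, \tag{13}$$
where $\mathbf{T}^{(k)}$ denotes the $k$'th twiddle factor matrix, $\mathbf{P}$ denotes the product of all permutation matrices, and $m = \log_2 N$ is the number of twiddle factor matrix multiplications involved. We can write a general formula to construct the $k$'th of the $m$ twiddle factor matrices using identity matrices of size $2^{k-1}$ and $N/2^{k}$, as well as the Kronecker product $\otimes$, as follows,
$$\mathbf{T}^{(k)} = \mathbf{I}_{N/2^{k}} \otimes \begin{bmatrix} \mathbf{I}_{2^{k-1}} & \mathbf{\Omega}_{2^{k-1}} \\ \mathbf{I}_{2^{k-1}} & -\mathbf{\Omega}_{2^{k-1}} \end{bmatrix}, \qquad \mathbf{\Omega}_{2^{k-1}} = \mathrm{diag}\big(W_{2^{k}}^{0}, \dots, W_{2^{k}}^{2^{k-1}-1}\big), \tag{14}$$
with $W_{2^{k}} = e^{-j 2\pi / 2^{k}}$.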
Fig. 1 visualizes these matrices and the associated sparsity patterns for an point DIT-FFT algorithm.
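To make the factorization concrete, the sketch below builds the sparse factors with Kronecker products and checks that their product equals the dense DFT matrix. It is a minimal NumPy illustration under our own assumptions: the function names are hypothetical, the permutation is taken to be the standard bit-reversal ordering of radix-2 decimation in time, and $N = 8$ is an arbitrary example size.

```python
import numpy as np

def bit_reversal_permutation(N):
    """Permutation matrix that reorders samples into bit-reversed order."""
    bits = int(np.log2(N))
    idx = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(N)]
    return np.eye(N)[idx]

def twiddle_matrix(k, N):
    """k-th sparse factor: identity kron [[I, Om], [I, -Om]] with blocks of size 2^(k-1)."""
    half = 2 ** (k - 1)
    om = np.diag(np.exp(-2j * np.pi * np.arange(half) / (2 * half)))
    eye = np.eye(half)
    butterfly = np.block([[eye, om], [eye, -om]])
    return np.kron(np.eye(N // (2 * half)), butterfly)

def fft_matrix_from_factors(N):
    """Multiply the permutation and all sparse stages into one N x N matrix."""
    m = int(np.log2(N))
    F = bit_reversal_permutation(N).astype(complex)
    for k in range(1, m + 1):  # stage 1 is applied first, stage m last
        F = twiddle_matrix(k, N) @ F
    return F

N = 8
F = fft_matrix_from_factors(N)
dense_dft = np.fft.fft(np.eye(N))  # dense DFT matrix for comparison
factorization_ok = np.allclose(F, dense_dft)
# Each sparse factor holds 2N nonzero entries instead of N^2.
nonzeros_per_factor = [int(np.count_nonzero(twiddle_matrix(k, N))) for k in range(1, 4)]
```

Because every factor has only $2N$ nonzero entries, the $\log_2 N$ factors together hold far fewer values than the $N^2$ entries of the dense DFT matrix.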
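As an illustrative example, consider the FFT matrix derivation for a 4-point FFT operation. Given the permuted data samples, we first apply a 2-point DFT operation on successive pairs of input samples. This step can be written as a matrix multiplication operation by matrix $\mathbf{T}^{(1)}$ where,
$$\mathbf{T}^{(1)} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}.$$
The next step applies a matrix multiplication operation using matrix $\mathbf{T}^{(2)}$ where we can write $\mathbf{T}^{(2)}$ in terms of the twiddle factor $W_4 = e^{-j 2\pi / 4} = -j$ as,
$$\mathbf{T}^{(2)} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & W_4 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -W_4 \end{bmatrix}.$$
The overall 4-point FFT operation can be expressed as,
$$\mathbf{F}_4 = \mathbf{T}^{(2)}\, \mathbf{T}^{(1)}\, \mathbf{P},$$
where $\mathbf{P}$ reorders the samples $[x[0], x[1], x[2], x[3]]$ into $[x[0], x[2], x[1], x[3]]$.
We see that the matrix $\mathbf{T}^{(1)}$ is a sparse matrix with a block diagonal structure. Similarly, $\mathbf{T}^{(2)}$ is also sparse and composed of stacks of diagonal matrices.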
2.1 Trainable FFT layer
To make the above FFT layer trainable, the set of matrix multiplies can be represented as a neural network with several sparsely connected layers. The DFT on the other hand is a single layer with dense connectivity. We preserve the general FFT structure by only connecting network nodes on the block diagonal structure given by the FFT. This preserves the speed and structure of the FFT while both reducing the number of parameters in the front-end and operating on the raw waveform. Fig. 2 illustrates the DFT and FFT connectivity structures and how they may be interpreted as neural network layers. In practice, we explicitly implement all complex operations with real values. When initialized to do so, the model returns identical results to the FFT algorithm. All of these operations can be efficiently implemented and trained with sparse matrix routines.
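As a rough illustration of this layer, the sketch below stores each stage as separate real and imaginary weight arrays together with a fixed sparsity mask, and evaluates the complex products with real arithmetic only. It is an assumption-laden NumPy sketch rather than the MXNet implementation used in the paper: the names `butterfly_stage`, `params` and `forward` are hypothetical, the weights are held in dense arrays for readability where a practical version would use sparse routines, and no gradient step is shown.

```python
import numpy as np

N = 8
m = int(np.log2(N))

def butterfly_stage(half, N):
    """Sparse FFT stage that merges DFTs of size `half` into DFTs of size 2*half."""
    om = np.diag(np.exp(-2j * np.pi * np.arange(half) / (2 * half)))
    eye = np.eye(half)
    return np.kron(np.eye(N // (2 * half)), np.block([[eye, om], [eye, -om]]))

# Fixed bit-reversal ordering of the input samples (not trained).
perm = [int(format(i, f"0{m}b")[::-1], 2) for i in range(N)]

# FFT initialization: real part, imaginary part, and a fixed sparsity mask per stage.
params = []
for k in range(m):
    W = butterfly_stage(2 ** k, N)
    params.append((W.real.copy(), W.imag.copy(), (W != 0).astype(float)))

def forward(params, frame):
    """Apply the layer to a real-valued frame using real arithmetic only."""
    x_re, x_im = frame[perm], np.zeros(N)
    for W_re, W_im, mask in params:
        W_re, W_im = W_re * mask, W_im * mask  # keep only the FFT connections
        # (W_re + j W_im)(x_re + j x_im) expanded into real operations
        x_re, x_im = W_re @ x_re - W_im @ x_im, W_re @ x_im + W_im @ x_re
    return x_re, x_im

frame = np.random.default_rng(0).standard_normal(N)
out_re, out_im = forward(params, frame)
matches_fft = np.allclose(out_re + 1j * out_im, np.fft.fft(frame))
```

Multiplying by the mask inside `forward` means that any update to the weight arrays can only affect entries on the FFT connectivity pattern, which is one simple way to keep the structure fixed during training.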
2.2 Inverse FFT
To compute the inverse FFT we use the time reversal conjugate trick. Given the DFT representation of the time domain frame , we can compute the inverse Fourier transform of using only the FFT. In particular,
In our model we leverage this property. However, the forward FFT layer and the inverse FFT layer do not share parameters. We use different learned FFTs for the forward and inverse transforms. These FFT layers are initialized and updated as separate entities.
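For reference, two standard identities compute an inverse DFT using only a forward FFT: one conjugates the input and output, the other reverses the output index. The NumPy sketch below checks both against `numpy.fft.ifft`. Which exact form the model uses is not recoverable from the text here, so both are shown as assumptions, and the function names are hypothetical.

```python
import numpy as np

def ifft_via_conjugation(X):
    """x = conj(FFT(conj(X))) / N."""
    return np.conj(np.fft.fft(np.conj(X))) / len(X)

def ifft_via_time_reversal(X):
    """x[n] = FFT(X)[(-n) mod N] / N."""
    N = len(X)
    return np.fft.fft(X)[(-np.arange(N)) % N] / N

rng = np.random.default_rng(0)
X = rng.standard_normal(16) + 1j * rng.standard_normal(16)
conjugation_ok = np.allclose(ifft_via_conjugation(X), np.fft.ifft(X))
reversal_ok = np.allclose(ifft_via_time_reversal(X), np.fft.ifft(X))
```

With a trainable forward transform, either identity lets the inverse layer reuse the same sparse structure while keeping its own separately learned weights.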
3 The Adaptive STFT for Speech Enhancement
Given a trainable FFT front-end, we can now operate the model on time domain waveforms. For our models we use 256-point FFTs such that this model’s front-end has about two orders of magnitude fewer weights than a typical trainable STFT front-end. In addition to making the FFT and IFFT trainable, we also show how we can make trainable synthesis and analysis windows.
3.1 Learning a Window
The typical $N$-point STFT with a hop size of $h$ chunks an input audio sequence $s$ into windows of size $N$ with an overlap of $N - h$. These overlapping frames are stacked to form columns of a matrix. Call this matrix $\mathbf{S}$, where the first column of $\mathbf{S}$ holds values $s[0:N]$, the second column holds values $s[h:N+h]$, and so on in the standard Python notation. To apply a windowing function to $\mathbf{S}$, construct a windowing matrix $\mathbf{D}$. The matrix $\mathbf{D}$ is diagonal with the desired $N$-point window on its diagonal. The windowed version of $\mathbf{S}$ is then $\mathbf{D}\mathbf{S}$. This same logic applies for windowing in the inverse STFT.
By making $\mathbf{D}$ a network parameter, we can learn a windowing function suited to the learned transform. For all our experiments, $\mathbf{D}$ is initialized as a Hann window. During training, $\mathbf{D}$ is freely updated without non-negativity or constant-overlap-add constraints. In Fig. 3, we show that the learned analysis and synthesis windows are highly structured.
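The framing and windowing steps can be sketched in a few lines of NumPy. This is an illustrative sketch only: the helper `frame_signal`, the hop size of 64 samples and the periodic form of the Hann window are our assumptions, not values taken from the paper.

```python
import numpy as np

def frame_signal(s, n_fft, hop):
    """Stack overlapping frames as columns: column t holds s[t*hop : t*hop + n_fft]."""
    n_frames = 1 + (len(s) - n_fft) // hop
    return np.stack([s[t * hop : t * hop + n_fft] for t in range(n_frames)], axis=1)

n_fft, hop = 256, 64  # hop is a hypothetical choice for this example
s = np.random.default_rng(0).standard_normal(1024)
frames = frame_signal(s, n_fft, hop)  # shape (n_fft, n_frames)

n = np.arange(n_fft)
hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / n_fft)  # Hann initialization of the window
window_matrix = np.diag(hann)  # the diagonal entries would be the trainable parameters
windowed = window_matrix @ frames  # equivalent to hann[:, None] * frames
```

Since the windowing matrix is diagonal, only its $N$ diagonal entries need to be stored and updated, and the product reduces to an elementwise scaling of each frame.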
3.2 Model Architecture
In conjunction with the learned transforms and windows, we use a masking based separation network. The learned FFT front-end transforms each column of the windowed frame matrix. Let $\mathcal{F}$ represent the trainable FFT front-end, and let $\mathbf{Z}$ denote the result of applying $\mathcal{F}$ to every windowed frame. The masking network takes $\mathbf{Z}$ and predicts two sigmoid masks: $\mathbf{M}_r$ and $\mathbf{M}_i$. These masks are applied via elementwise multiplication to produce an estimate $\hat{\mathbf{Y}}$ of the clean speech in the transform domain, in terms of its real and imaginary parts. Specifically,
$$\Re(\hat{\mathbf{Y}}) = \mathbf{M}_r \odot \Re(\mathbf{Z}), \qquad \Im(\hat{\mathbf{Y}}) = \mathbf{M}_i \odot \Im(\mathbf{Z}). \tag{15}$$
Here $\Re$ and $\Im$ compute the elementwise real and imaginary components of a complex matrix, and $\odot$ denotes the elementwise multiplication operation. In our experiments, we found that using separate real and imaginary masks outperformed a single magnitude mask. Fig. 4 illustrates the full model pipeline.
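The masking step itself is a pair of elementwise products. The sketch below is a minimal NumPy illustration in which the mask logits stand in for the output of the masking network; the function names and the array shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def apply_masks(Z, logits_re, logits_im):
    """Scale the real and imaginary parts of Z by separate sigmoid masks."""
    M_re, M_im = sigmoid(logits_re), sigmoid(logits_im)
    return M_re * Z.real + 1j * (M_im * Z.imag)

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
# Zero logits give masks of 0.5 everywhere, so the output is half the input.
half_scaled = apply_masks(Z, np.zeros((4, 3)), np.zeros((4, 3)))
```

Because each sigmoid mask lies between 0 and 1, the real and imaginary parts can be attenuated independently but never amplified or sign-flipped.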
We experimented with a mask prediction RNN containing 80k parameters. This network is composed of two linear layers and a gated recurrent unit (GRU). The GRU is unidirectional for our intended use case of real-time speech enhancement. Instead of performing complex valued backpropagation, we simply stack the real and imaginary components of the input and output. The masking network architecture is shown in Fig. 5.
4 Experiments
4.1 Dataset
We use the 56-speaker VCTK training and testing setup [14] where each speaker has about 400 sentences. During training, we mix speech and noise at signal to noise ratios (SNRs) of 0dB, 5dB, 10dB, and 15dB. We train all models until convergence on the training set before evaluating them. During evaluation, we use SNRs of 2.5dB, 7.5dB, 12.5dB, and 17.5dB [17].
4.2 Training
We used the MXNet [16] framework for all our experiments. To optimize the network parameters, we use the Adam algorithm [18], and for the loss function, we use the complex loss given in Eq. 16 [19]. In informal experiments, we found that this loss function performed better than time domain loss functions in terms of perceptual metrics. The loss function is a weighted combination of magnitude mean squared error and complex mean squared error loss. Here, is the predicted clean Fourier spectrum and is the true clean Fourier spectrum.
(16)
The power is applied elementwise and in the case of complex numbers is applied on the predicted magnitude and then multiplied with the predicted phase. For our own experiments, we use and [19].
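One plausible reading of this description is sketched below in NumPy. The exact form of Eq. 16 is not recoverable from the text here, so the placement of the weight on the complex term, the parameter names `alpha` and `beta`, and the placeholder values used in the example are all assumptions for illustration and are not the settings from [19].

```python
import numpy as np

def compress(Y, alpha):
    """Apply the power to the magnitude, then recombine with the original phase."""
    return np.abs(Y) ** alpha * np.exp(1j * np.angle(Y))

def compressed_spectral_loss(Y_hat, Y, alpha, beta):
    """Magnitude MSE plus beta times complex MSE, both on power-compressed spectra."""
    mag_mse = np.mean((np.abs(Y_hat) ** alpha - np.abs(Y) ** alpha) ** 2)
    cplx_mse = np.mean(np.abs(compress(Y_hat, alpha) - compress(Y, alpha)) ** 2)
    return mag_mse + beta * cplx_mse

rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
Y_hat = Y + 0.1 * (rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3)))
# alpha and beta below are placeholder values, not the paper's settings.
loss_same = compressed_spectral_loss(Y, Y, alpha=0.5, beta=1.0)
loss_diff = compressed_spectral_loss(Y_hat, Y, alpha=0.5, beta=1.0)
```

The magnitude term ignores phase errors entirely, while the complex term penalizes them, so the weight trades off between the two.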
4.3 Models
With low-compute scenarios in mind, we examined a model with approximately 80k parameters. The majority of the parameters are used in the masking network, with the learned FFT and learned window using 512 parameters each. For all model runs, we initialize the windows as Hann windows and the trainable FFTs as FFTs. In our experiments, we compared four models with the attributes described below. The tested setups are: (1) Fixed Window Fixed FFT, (2) Trainable Window Fixed FFT, (3) Fixed Window Trainable FFT, (4) Trainable Window Trainable FFT. In the above list, fixed denotes parameters that were frozen and not updated during training. The Fixed Window Fixed FFT model has 1024 fewer parameters than the Trainable Window Trainable FFT model.
The first model has fixed Hann windows and a fixed FFT. This model only learns a masking network and serves as a benchmark for our adaptations. The other models serve to illustrate the benefits of and relationship between trainable windows and trainable FFTs.
4.4 Evaluation Metrics
We evaluate the above models on a speech enhancement task and compare them on the following metrics: signal distortion ($C_{sig}$), noise distortion ($C_{bak}$), overall quality ($C_{ovl}$) [20], Perceptual Evaluation of Speech Quality (PESQ) [21], and segmental SNR (SSNR). $C_{sig}$, $C_{bak}$, $C_{ovl}$, and PESQ are perceptual metrics intended to imitate a mean opinion score test. $C_{sig}$ estimates distortion of the speech signal, $C_{bak}$ estimates intrusiveness of the noise, $C_{ovl}$ summarizes the overall quality, and PESQ estimates the speech quality. For all of these metrics, higher is better.
4.5 Results
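| Model | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| Trainable Window | | ✓ | | ✓ |
| Trainable FFT | | | ✓ | ✓ |
| $C_{sig}$ | 3.586 | 3.580 | 3.624 | 3.686 |
| $C_{bak}$ | 2.820 | 2.791 | 2.868 | 2.942 |
| $C_{ovl}$ | 2.878 | 2.868 | 2.944 | 3.018 |
| PESQ | 2.217 | 2.204 | 2.312 | 2.395 |
| SSNR | 5.572 | 5.256 | 5.657 | 6.137 |
| LOSS | 0.079 | 0.080 | 0.071 | 0.070 |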
Table 1 gives the results of our experiments for the 80k parameter model. We use the fixed FFT, fixed window version without any trainable parameters in its front-end as the baseline model. In general, we observe that making the FFT layer trainable improves speech enhancement performance. This improvement is consistently observed both in the case of a fixed window (compare column 1 to column 3) and when the window is trainable (compare column 2 to column 4). The effect of the window function is less consistent. A trainable window with a fixed FFT degrades separation performance (compare column 1 to column 2). In contrast, a trainable window used with a trainable FFT improves upon the fixed window, trainable FFT model (compare column 3 to column 4). Overall, when a trainable window is used in conjunction with a trainable FFT, we get the best performance across all metrics.
5 Conclusion
In light of the need for high-performance speech systems in low-compute contexts, we proposed an alternative to learning a DFT matrix in trainable STFT systems. Our efficient front-end leverages the sparse structure of FFTs to reduce both computational and memory requirements. The trainable front-end is made up of several highly structured sparse linear layers and a learned window. We demonstrate an application of this front-end in speech enhancement. Using a trainable FFT and a trainable window improves speech enhancement performance over a fixed STFT system with no increase in computational complexity.
References
 [1] Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis, “Deep learning for monaural speech separation,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 1562–1566.
 [2] Shrikant Venkataramani, Jonah Casebeer, and Paris Smaragdis, “End-to-end source separation with adaptive front-ends,” in 2018 52nd Asilomar Conference on Signals, Systems, and Computers. IEEE, 2018, pp. 684–688.
 [3] Yi Luo and Nima Mesgarani, “TasNet: time-domain audio separation network for real-time, single-channel speech separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 696–700.
 [4] Yi Luo and Nima Mesgarani, “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
 [5] Ziqiang Shi, Huibin Lin, Liu Liu, Rujie Liu, Jiqing Han, and Anyan Shi, “Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation,” Proc. Interspeech 2019, pp. 3183–3187, 2019.
 [6] Dario Rethage, Jordi Pons, and Xavier Serra, “A wavenet for speech denoising,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5069–5073.
 [7] John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31–35.
 [8] Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 241–245.
 [9] Felix Weninger, John R Hershey, Jonathan Le Roux, and Björn Schuller, “Discriminatively trained recurrent neural networks for single-channel speech separation,” in 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 2014, pp. 577–581.
 [10] Santiago Pascual, Antonio Bonafonte, and Joan Serrà, “SEGAN: Speech Enhancement Generative Adversarial Network,” Proc. Interspeech 2017, pp. 3642–3646, 2017.
 [11] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
 [12] Craig Macartney and Tillman Weyde, “Improved Speech Enhancement with the Wave-U-Net,” arXiv preprint arXiv:1811.11307, 2018.
 [13] Daniel Stoller, Sebastian Ewert, and Simon Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
 [14] Cassia Valentini-Botinhao et al., “Noisy speech database for training speech enhancement algorithms and TTS models,” University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR), 2017.
 [15] Charles Van Loan, Computational frameworks for the fast Fourier transform, vol. 10, Siam, 1992.
 [16] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
 [17] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech,” in SSW, 2016, pp. 146–152.
 [18] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [19] Kevin Wilson, Michael Chinen, Jeremy Thorpe, Brian Patton, John Hershey, Rif A Saurous, Jan Skoglund, and Richard F Lyon, “Exploring tradeoffs in models for low-latency speech enhancement,” in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2018, pp. 366–370.
 [20] Yi Hu and Philipos C Loizou, “Evaluation of objective measures for speech enhancement,” in Ninth International Conference on Spoken Language Processing, 2006.
 [21] ITU-T Recommendation, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Rec. ITU-T P. 862, 2001.