Speech enhancement is the process of improving the quality of speech that has been degraded by additive noise. It is a critical preprocessing step for improving the performance of automatic speech recognition (ASR) systems operating in the presence of background noise: noisy speech is first fed into a speech enhancement system to produce enhanced speech, which is then fed into the ASR model. Speech enhancement also plays a critical role in improving the quality of speech in devices like hearing aids and cochlear implants.
The authors in [1, 2] demonstrated speech enhancement using classical methods. More recently, researchers have applied deep learning methods to speech enhancement [3, 4, 5], and the authors in [6, 7] demonstrated speech enhancement using generative adversarial networks (GANs).
Electroencephalography (EEG) is a non-invasive way of measuring the electrical activity of the human brain. In [9, 10] the authors demonstrated that EEG features can be used to overcome the performance loss of ASR systems in the presence of background noise. Though references [9, 10, 11, 12] demonstrated isolated and continuous speech recognition using EEG signals under various experimental conditions, they did not specifically study the speech enhancement problem. In this paper we demonstrate that EEG features can be used to improve the quality of speech recorded in the presence of background noise. We make use of GAN [8] and long short-term memory (LSTM) [13] networks to demonstrate speech enhancement using EEG features. In [14] the authors demonstrated EEG-based attention-driven speech enhancement using Wiener filters, where EEG was used to detect auditory attention; in this paper, by contrast, we demonstrate speech enhancement for spoken speech using EEG features, and no auditory attention detection module is required. Our idea is mainly inspired by the results in [9], where the authors demonstrated that EEG features are less affected by external background noise.
2 Design of Experiments for Building the Training and Test Sets
Five female and five male subjects took part in the training set experiment, and five male and three female subjects took part in the test set experiment. Except for two subjects, all were native English speakers in both databases. All subjects were UT Austin undergraduate or graduate students in their early twenties.
For the training set, the 10 subjects were asked to speak the first 30 sentences from the USC-TIMIT database [15] while their speech and EEG signals were recorded simultaneously. This data was recorded in the absence of external background noise. We then asked each subject to repeat the same experiment two more times, giving 30 speech-EEG recording examples for each sentence.
For the test set, the 8 subjects were asked to repeat the same experiment, but this time background music played from our lab computer was used to generate background noise. Here we had 24 speech-EEG recording examples for each sentence. The training and test set experiments had two subjects in common.
We used Brain Vision EEG recording hardware. Our EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 1. We used EEGLab [16] to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.
3 EEG and Speech feature extraction details
We followed the same methodology used by the authors in [9, 10, 12] for EEG and speech preprocessing. EEG signals were sampled at 1000 Hz, and a fourth-order IIR band-pass filter with cut-off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut-off frequency of 60 Hz was used to remove power line noise. EEGLab's independent component analysis (ICA) toolbox was used to remove other biological signal artifacts, such as electrocardiography (ECG), electromyography (EMG), and electrooculography (EOG), from the EEG signals. We extracted five statistical features per EEG channel, namely root mean square, zero-crossing rate, moving window average, kurtosis, and power spectral entropy [9, 10, 12]. In total we extracted 31 (channels) × 5 = 155 features for the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
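The five per-channel statistics above can be computed with a short sliding-window routine. The sketch below is a minimal illustration, not the authors' code: the 100 ms window length, the helper name `eeg_window_features`, and the base-2 entropy are assumptions; only the five feature names and the 100 Hz feature rate come from the text.

```python
import numpy as np

def eeg_window_features(channel, fs=1000, feat_fs=100, win_s=0.1):
    """Five windowed statistics for one EEG channel sampled at fs Hz.

    Windows are hopped every fs // feat_fs samples so features come out
    at feat_fs Hz. Returns an array of shape (n_windows, 5) holding
    [RMS, zero-crossing rate, moving-window average, kurtosis,
     power spectral entropy] per window.
    """
    hop = fs // feat_fs                # 10 samples -> 100 Hz feature rate
    n = int(win_s * fs)                # window length (assumed: 100 ms)
    feats = []
    for start in range(0, len(channel) - n + 1, hop):
        w = channel[start:start + n]
        rms = np.sqrt(np.mean(w ** 2))
        signs = np.signbit(w).astype(np.int8)
        zcr = np.mean(np.abs(np.diff(signs)))          # fraction of sign flips
        avg = np.mean(w)
        m2 = np.mean((w - avg) ** 2)
        kurt = np.mean((w - avg) ** 4) / (m2 ** 2 + 1e-12) - 3.0  # excess kurtosis
        psd = np.abs(np.fft.rfft(w)) ** 2
        p = psd / (psd.sum() + 1e-12)                  # normalized power spectrum
        pse = -np.sum(p * np.log2(p + 1e-12))          # power spectral entropy
        feats.append([rms, zcr, avg, kurt, pse])
    return np.asarray(feats)
```

Stacking this over all 31 channels gives the 155-dimensional feature vector per 100 Hz frame described above.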
The recorded speech signal was sampled at 16 kHz. We extracted Mel-frequency cepstral coefficients (MFCC) as features for the speech signal. We extracted 13 MFCC features, sampled at 100 Hz to match the sampling frequency of the EEG features, so that the two feature sequences are aligned frame by frame and no sequence-to-sequence alignment problem arises.
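Matching the MFCC rate to the EEG rate comes down to the hop size: at 16 kHz, a 100 Hz frame rate means a hop of 160 samples. The sketch below illustrates this alignment; the 25 ms (400-sample) analysis window is an assumption, as the paper does not state it.

```python
import numpy as np

SR = 16000           # speech sampling rate (Hz)
FEAT_FS = 100        # shared feature rate for MFCC and EEG (Hz)
HOP = SR // FEAT_FS  # 160 audio samples per feature frame

def n_frames(n_samples, win=400, hop=HOP):
    """Number of analysis windows (assumed 25 ms) hopped every 10 ms."""
    return 1 + (n_samples - win) // hop

audio = np.zeros(SR)             # 1 s of audio
print(n_frames(len(audio)))      # 98 MFCC frames, one per 100 Hz EEG frame
```

With both streams at 100 Hz, each MFCC frame pairs with exactly one EEG feature frame.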
4 EEG Feature Dimension Reduction Algorithm Details
After extracting EEG and acoustic features as explained in the previous section, we used non-linear methods for feature dimension reduction in order to obtain a set of EEG features that better represents the acoustic features. We reduced the 155 EEG features to a dimension of 30 by applying Kernel Principal Component Analysis (KPCA) [17]. We plotted cumulative explained variance versus number of components to identify the right feature dimension, as shown in Figure 2. We used KPCA with a polynomial kernel of degree 3 [9, 10, 12], performed with the Python scikit-learn library. The cumulative explained variance plot is not supported by the library for KPCA, since KPCA projects features into a different feature space; we therefore used ordinary PCA to obtain the explained variance plot, and after identifying the right dimension we used KPCA to perform the dimension reduction.
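For illustration, a minimal numpy sketch of polynomial-kernel KPCA follows. It simplifies what scikit-learn's `KernelPCA(kernel='poly', degree=3)` computes (the kernel scale `gamma` and offset `coef0` are fixed to 1 here, which differs from scikit-learn's defaults), and the function name is hypothetical.

```python
import numpy as np

def kpca_poly(X, n_components=30, degree=3):
    """Project samples with polynomial-kernel KPCA (simplified sketch)."""
    K = (X @ X.T + 1.0) ** degree                 # polynomial kernel matrix
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one    # center kernel in feature space
    vals, vecs = np.linalg.eigh(Kc)               # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]        # sort descending
    vals = np.clip(vals, 1e-12, None)             # guard tiny negative eigenvalues
    alphas = vecs[:, :n_components] / np.sqrt(vals[:n_components])
    return Kc @ alphas                            # (n_samples, n_components)

X = np.random.default_rng(0).normal(size=(50, 155))  # 50 frames of 155 EEG features
Z = kpca_poly(X)
print(Z.shape)  # (50, 30)
```

In practice the scikit-learn implementation should be preferred; the point here is only the 155 → 30 projection through a degree-3 polynomial kernel.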
5 Speech Enhancement models
We used two different types of models for performing speech enhancement using EEG features. We first performed experiments using a simple LSTM regression model and then performed speech enhancement experiments using a GAN model. In the sections below we explain the architecture of our models and the experimental setup. Our GAN model architecture is different from the ones used by the authors in [6, 7].

We added Gaussian noise with zero mean and standard deviation 10 to the recorded MFCC features from the training set to generate noisy MFCC features. These noisy MFCC features were used during training of the models, as explained in the subsections below. The Gaussian noise was not added to the EEG features from the training set, as our hypothesis was that the effect of background noise on EEG features is negligible. The Gaussian noise was also not added to the test set data.
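The noise-injection step is straightforward; a small sketch (the array here is a random stand-in for the recorded MFCC frames, not real data):

```python
import numpy as np

rng = np.random.default_rng(0)
clean_mfcc = rng.normal(size=(1000, 13))    # stand-in for recorded clean MFCC frames
noise = rng.normal(loc=0.0, scale=10.0, size=clean_mfcc.shape)  # zero mean, std 10
noisy_mfcc = clean_mfcc + noise             # training inputs for the models
# EEG features and the test set are left untouched, per the hypothesis above.
```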
5.1 LSTM Regression Model
Our LSTM regression model consists of two LSTM layers with 128 hidden units each, followed by a time-distributed dense layer with 13 hidden units. The LSTM regression model architecture is shown in Figure 3. The model was trained for 1000 epochs to observe loss convergence, using the Adam optimizer [18] with a batch size of 100 and mean squared error (MSE) as the loss function.
During training, we concatenate the generated noisy MFCC features (after adding Gaussian noise) and the recorded EEG features from the training set and feed them as a single input vector to the LSTM regression model, with the corresponding clean MFCC features of dimension 13 from the training set as targets.
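Concretely, with 13 MFCC dimensions and 30 KPCA-reduced EEG dimensions, each 100 Hz frame yields a 43-dimensional model input. A minimal sketch (zeros stand in for real features):

```python
import numpy as np

T = 300                          # frames at 100 Hz (a 3 s utterance)
noisy_mfcc = np.zeros((T, 13))   # noisy MFCC features
eeg = np.zeros((T, 30))          # KPCA-reduced EEG features
x = np.concatenate([noisy_mfcc, eeg], axis=-1)  # per-frame model input
y = np.zeros((T, 13))            # clean MFCC regression targets
print(x.shape)  # (300, 43)
```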
During testing, we concatenate the MFCC and EEG features from the test set and feed them as a single input vector to the trained LSTM regression model, which outputs the corresponding enhanced MFCC. The Griffin-Lim reconstruction algorithm [19] is used to convert the enhanced MFCC to speech.
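Griffin-Lim recovers a phase for a given magnitude spectrogram by alternating between the magnitude constraint and STFT consistency. The sketch below shows only that core iteration; the step from MFCC back to a magnitude spectrogram (inverting the DCT and mel filterbank) is omitted, and the function name and iteration count are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=50, nperseg=512, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `mag`.

    mag: magnitude spectrogram produced by scipy.signal.stft with the
    same nperseg. Starts from random phase and iteratively refines it.
    """
    rng = np.random.default_rng(seed)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * angles, nperseg=nperseg)   # impose the magnitude
        _, _, Z = stft(x, nperseg=nperseg)            # re-estimate the phase
        Z = Z[:, :mag.shape[1]]
        if Z.shape[1] < mag.shape[1]:                 # keep shapes aligned
            Z = np.pad(Z, ((0, 0), (0, mag.shape[1] - Z.shape[1])))
        angles = np.exp(1j * np.angle(Z))
    _, x = istft(mag * angles, nperseg=nperseg)
    return x
```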
5.2 GAN Model
A generative adversarial network (GAN) [8] consists of two networks, a generator model and a discriminator model, which are trained simultaneously. The generator learns to generate data from a latent space, and the discriminator evaluates whether the data produced by the generator is fake or comes from the true data distribution. The training objective of the generator is to fool the discriminator.
Our generator model consists of two parallel LSTMs with 128 hidden units each. The outputs of the two parallel LSTMs are concatenated and fed into another LSTM with 128 hidden units, followed by a time-distributed dense layer with 13 hidden units. The architecture of the discriminator model is similar to that of the generator, except that in place of the time-distributed dense layer, a dense layer with a single hidden unit and sigmoid activation is used. The last time-step output of the preceding LSTM layer is fed into this dense layer.
During training, the generator always takes noisy MFCC (obtained by adding Gaussian noise to clean MFCC from the training set) and clean EEG (from the training set) as input pairs and outputs fake MFCC. The generator model architecture is shown in Figure 4. The discriminator can take three possible pairs of inputs during training. Let $D_f$ be the sigmoid output of the discriminator for the (fake MFCC, clean EEG) input pair, $D_c$ the sigmoid output for the (clean MFCC, clean EEG) input pair, and $D_n$ the sigmoid output for the (noisy MFCC, clean EEG) input pair. Treating the clean pair as real and the other two pairs as fake, we can define the loss function of the generator as $\mathcal{L}_G = \log(1 - D_f)$ and the loss function of the discriminator as $\mathcal{L}_D = -\log D_c - \log(1 - D_f) - \log(1 - D_n)$ for speech enhancement. The model was trained for 200 epochs using the Adam optimizer [18]. The discriminator model architecture is shown in Figure 5. Input 1 and Input 2 in the figure refer to the three possible input pairs of the discriminator during training. Figures 6 and 7 show the training loss for the generator and discriminator models.
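These per-batch losses can be computed directly from the three sigmoid outputs. The sketch below assumes the standard binary cross-entropy form, with the (clean MFCC, clean EEG) pair treated as real and the fake and noisy pairs as fake; the function and argument names are hypothetical.

```python
import numpy as np

def gan_losses(d_fake, d_clean, d_noisy, eps=1e-8):
    """GAN losses from discriminator sigmoid outputs in (0, 1).

    d_fake:  D output for (fake MFCC, clean EEG) pairs
    d_clean: D output for (clean MFCC, clean EEG) pairs
    d_noisy: D output for (noisy MFCC, clean EEG) pairs
    """
    # Generator minimizes log(1 - D_f): it wants D to score fakes as real.
    loss_g = np.mean(np.log(1.0 - d_fake + eps))
    # Discriminator minimizes -log D_c - log(1 - D_f) - log(1 - D_n):
    # score the clean pair as real and the other two pairs as fake.
    loss_d = -np.mean(np.log(d_clean + eps)
                      + np.log(1.0 - d_fake + eps)
                      + np.log(1.0 - d_noisy + eps))
    return loss_g, loss_d
```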
During testing, the trained generator model takes an (MFCC, EEG) input pair from the test set and outputs enhanced MFCC; we use the Griffin-Lim reconstruction algorithm to convert the enhanced MFCC to speech.
6 Results

To evaluate the quality of the enhanced speech we computed three performance metrics, namely signal-to-noise ratio (SNR) in dB, Perceptual Evaluation of Speech Quality (PESQ) [20], and Short-Time Objective Intelligibility (STOI) [21], for the test set data and for the corresponding enhanced speech output by the models when the test set data was given as input. All three metrics were higher for the enhanced speech than for the test set data, as shown in Table 1, indicating that the enhanced speech was of better quality.
Since the STOI and PESQ calculations require a clean audio signal as reference, we computed STOI and PESQ values only for the two subjects common to the test and training sets, as only for these two subjects did the training set provide a clean reference speech signal. Table 1 shows the average STOI and PESQ values over the test and corresponding enhanced utterances of these two subjects, and the average SNR in dB over the test and corresponding enhanced utterances of all eight subjects.
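Of the three metrics, SNR is simple enough to sketch directly (PESQ and STOI require their own dedicated implementations). The example below assumes the noise is taken as the sample-wise difference from the clean reference; the function name is hypothetical.

```python
import numpy as np

def snr_db(clean, degraded):
    """SNR in dB of `degraded` against the reference `clean` signal,
    with noise taken as the sample-wise difference."""
    noise = degraded - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s reference tone
noisy = clean + 0.1 * rng.normal(size=clean.shape)
print(snr_db(clean, noisy))   # roughly 17 dB: 10*log10(0.5 / 0.01)
```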
7 Conclusion

In this paper we demonstrated the cleaning of noisy spoken speech using EEG features recorded in parallel with the spoken speech. We make use of state-of-the-art deep learning models and EEG signal processing principles to derive our results. To the best of our knowledge, this is the first time spoken-speech enhancement using EEG features has been demonstrated with deep learning models. We further plan to publish the data sets used in this work to help advance research in this area.
8 Acknowledgements

We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work.
References

[1] Michael Berouti, Richard Schwartz, and John Makhoul, "Enhancement of speech corrupted by acoustic noise," in ICASSP'79. IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1979, vol. 4, pp. 208–211.
[2] Yariv Ephraim, "Statistical-model-based speech enhancement systems," Proceedings of the IEEE, vol. 80, no. 10, pp. 1526–1555, 1992.
[3] Shahla Parveen and Phil Green, "Speech enhancement with missing data techniques using recurrent neural networks," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2004, vol. 1, pp. I–733.
[4] Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori, "Speech enhancement based on deep denoising autoencoder," in Interspeech, 2013, pp. 436–440.
[5] Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R Hershey, and Björn Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 91–99.
[6] Santiago Pascual, Joan Serrà, and Antonio Bonafonte, "Towards generalized speech enhancement with generative adversarial networks," arXiv preprint arXiv:1904.03418, 2019.
[7] Santiago Pascual, Antonio Bonafonte, and Joan Serra, "SEGAN: Speech enhancement generative adversarial network," arXiv preprint arXiv:1703.09452, 2017.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[9] Gautam Krishna, Co Tran, Jianguo Yu, and Ahmed Tewfik, "Speech recognition with no speech or with noisy speech," in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on. IEEE, 2019.
[10] Gautam Krishna, Co Tran, Mason Carnahan, and Ahmed Tewfik, "Advancing speech recognition with no speech or with noisy speech," in 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 2019.
[11] Gautam Krishna, Yan Han, Co Tran, Mason Carnahan, and Ahmed H Tewfik, "State-of-the-art speech recognition using EEG and towards decoding of speech spectrum from EEG," arXiv preprint arXiv:1908.05743, 2019.
[12] Gautam Krishna, Co Tran, Yan Han, Mason Carnahan, and Ahmed H Tewfik, "Speech recognition with no speech or with noisy speech beyond English," arXiv preprint arXiv:1906.08045, 2019.
[13] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[14] Neetha Das, Simon Van Eyndhoven, Tom Francart, and Alexander Bertrand, "EEG-based attention-driven speech enhancement for noisy speech mixtures using N-fold multi-channel Wiener filters," in 2017 25th European Signal Processing Conference (EUSIPCO). IEEE, 2017, pp. 1660–1664.
[15] Shrikanth Narayanan, Asterios Toutios, Vikram Ramanarayanan, Adam Lammert, Jangwon Kim, Sungbok Lee, Krishna Nayak, Yoon-Chul Kim, Yinghua Zhu, Louis Goldstein, et al., "Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (TC)," The Journal of the Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
[16] Arnaud Delorme and Scott Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
[17] Sebastian Mika, Bernhard Schölkopf, Alex J Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch, "Kernel PCA and de-noising in feature spaces," in Advances in Neural Information Processing Systems, 1999, pp. 536–542.
[18] Diederik P Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[19] Daniel Griffin and Jae Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
[20] Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra, "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs," in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221). IEEE, 2001, vol. 2, pp. 749–752.
[21] Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen, "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.