Recently, researchers have started exploring the possibility of synthesizing speech and text from neural signals. In [2, 3] the authors demonstrated synthesizing intelligible speech from electrocorticography (ECoG) signals. Performing speech synthesis and speech recognition using neural signals could help people with speaking disabilities or difficulties communicate with virtual personal assistants like Alexa, Bixby, and Siri, improving technology accessibility, and it would also allow them to have more natural conversations with their loved ones. In [4] the authors proposed a brain-computer interface (BCI) system that controls a cursor to select letters one by one to spell out words, but users can transmit only up to 10 words per minute with their system, a rate far slower than the average 150 words per minute of natural speech; a continuous neural-signal-based speech recognition system, in contrast, would be capable of producing output at natural speaking rates. Electrocorticography (ECoG) is an invasive way of measuring the electrical activity of the brain, where brain surgery is performed to implant the ECoG electrodes. Electroencephalography (EEG), on the other hand, is a non-invasive way of measuring the electrical activity of the human brain: the EEG sensors are placed on the scalp of the subject to obtain the recordings. Like ECoG, EEG offers high temporal resolution, although its spatial resolution and signal-to-noise ratio (SNR) are lower than those of ECoG. Since EEG is non-invasive, it is safer and easier to deploy and study than ECoG. In [5] the authors provided preliminary results for synthesizing speech from EEG signals. In [6, 7, 8] the authors demonstrated continuous and isolated speech recognition using EEG signals for a limited English vocabulary in the presence and absence of background noise.
The results described in [6, 8] demonstrate that continuous speech recognition using EEG is extremely challenging, mainly due to the poor SNR of EEG signals. In a recent work [1], the authors proposed various techniques to improve the performance of EEG based continuous speech recognition systems. They showed that using an external language model, and adding deep layers to the speech recognition encoder with weights initialized from an EEG to acoustic + articulatory regression model, helps improve the recognition results at test time. Even though their proposed method improved on the results described in [6, 8], additional sensors are needed to record the acoustic or articulatory features required to implement it.
In [9] the authors introduced the concept of generative adversarial networks (GAN), in which two networks, a generator model and a discriminator model, are trained simultaneously. The generator learns to generate data from a latent space, and the discriminator evaluates whether the data produced by the generator is fake or comes from the true data distribution. The training objective of the generator is to fool the discriminator. In this paper we show that the GAN framework can be used to generate more meaningful EEG features from raw EEG features to improve the performance of EEG based continuous speech recognition systems.
We improve the results demonstrated in [1], using their data sets, for some of the test time experiments; for the remaining cases our results are comparable with theirs. Our proposed approach can be implemented without any additional sensor information, whereas in [1] the authors used additional features like acoustic or articulatory information to improve the performance of EEG based continuous speech recognition systems.
2 Generative Adversarial Network Model
Our generator model, as shown in Figure 1, is very similar to the encoder part of the CTC ASR model described in [1]. However, we initialize the gated recurrent unit (GRU) layers with random weights. After the temporal convolutional network (TCN) [13] layer, an average pooling layer is used; it computes the average of the TCN layer outputs across all time steps. The average pooling output is passed to a dense layer with two hidden units and a softmax activation function to produce fake label tokens. The label tokens are one-hot encoded.
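The following Keras sketch illustrates this generator architecture. The GRU hidden sizes, the number of GRU layers, and the use of a single dilated causal Conv1D layer as a stand-in for the TCN block are assumptions made for illustration; the exact hyperparameters follow Figure 1 and [1].

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_generator(eeg_dim=30, tcn_filters=32, num_labels=2):
    """Sketch of the generator: stacked GRUs, a temporal convolution
    (stand-in for the TCN), average pooling over time, and a softmax
    over the label tokens."""
    eeg_in = layers.Input(shape=(None, eeg_dim), name="eeg_features")
    x = layers.GRU(128, return_sequences=True)(eeg_in)   # hidden size is an assumption
    x = layers.GRU(128, return_sequences=True)(x)
    # Dilated causal Conv1D used here as a simplified stand-in for the TCN layer;
    # it produces the 32-dimensional per-time-step EEG representations.
    tcn_out = layers.Conv1D(tcn_filters, kernel_size=3, padding="causal",
                            dilation_rate=2, activation="relu", name="tcn")(x)
    pooled = layers.GlobalAveragePooling1D()(tcn_out)     # average over all time steps
    fake_label = layers.Dense(num_labels, activation="softmax")(pooled)
    return Model(eeg_in, [fake_label, tcn_out], name="generator")
```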
In order to define the loss functions for our generator and discriminator models, let us first define a few terms. Let $D_{fake}$ be the sigmoid output of the discriminator for the (real EEG features, fake label) input pair, and let $D_{real}$ be the sigmoid output of the discriminator for the (real EEG features, real label) input pair during training. Then we can define the loss function of the generator as $\log(1 - D_{fake})$ and the loss function of the discriminator as $-\left(\log(D_{real}) + \log(1 - D_{fake})\right)$.
The intuition here is that since a GAN also learns the loss function during training, the generator, which is similar to the encoder of the CTC ASR model in [1], will learn the most accurate EEG to label (text) mapping. In particular, the TCN layer in the generator will learn fine representations of the input EEG features that are easily mapped to labels. During test time, the first GRU layer of the trained generator model takes EEG features of dimension 30 as input and we take the output from the TCN layer of the generator, which produces EEG representations of dimension 32. These 32-dimensional EEG representations are then used to perform the continuous speech recognition experiments.
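A minimal sketch of this test-time feature extraction, assuming the generator was built as in the sketch above with a layer named "tcn":

```python
import tensorflow as tf

def tcn_feature_extractor(generator):
    """Map 30-dimensional raw EEG features to the 32-dimensional
    representations produced by the generator's TCN layer."""
    return tf.keras.Model(inputs=generator.input,
                          outputs=generator.get_layer("tcn").output)

# eeg_seq: array of shape (1, time_steps, 30) -> output shape (1, time_steps, 32)
# gan_features = tcn_feature_extractor(trained_generator).predict(eeg_seq)
```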
Figure 3 shows the generator and discriminator training losses. The GAN model was trained for 201 epochs using the Adam optimizer with a batch size of 50.
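A sketch of one training update under the losses defined above is given below. The discriminator is assumed to be a two-input Keras model taking an (EEG features, label) pair and producing a sigmoid score; its exact architecture, the optimizer settings beyond those stated above, and the numerical-stability constant are assumptions.

```python
import tensorflow as tf

gen_opt = tf.keras.optimizers.Adam()
disc_opt = tf.keras.optimizers.Adam()
EPS = 1e-7  # numerical stability inside the logs

@tf.function
def train_step(eeg_batch, real_labels, generator, discriminator):
    """One GAN update: the generator minimizes log(1 - D_fake), the
    discriminator minimizes -(log(D_real) + log(1 - D_fake))."""
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_labels, _ = generator(eeg_batch, training=True)
        d_fake = discriminator([eeg_batch, fake_labels], training=True)  # (real EEG, fake label)
        d_real = discriminator([eeg_batch, real_labels], training=True)  # (real EEG, real label)
        gen_loss = tf.reduce_mean(tf.math.log(1.0 - d_fake + EPS))
        disc_loss = -tf.reduce_mean(tf.math.log(d_real + EPS) +
                                    tf.math.log(1.0 - d_fake + EPS))
    gen_opt.apply_gradients(zip(g_tape.gradient(gen_loss, generator.trainable_variables),
                                generator.trainable_variables))
    disc_opt.apply_gradients(zip(d_tape.gradient(disc_loss, discriminator.trainable_variables),
                                 discriminator.trainable_variables))
    return gen_loss, disc_loss
```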
3 ASR model used for performing continuous speech recognition experiments
We performed continuous speech recognition using the raw EEG features of dimension 30 (baseline) and also using the EEG features of dimension 32 generated by the TCN layer of the generator described above.
For these experiments we used the connectionist temporal classification (CTC) [15, 11] model described in Figure 1 of [1], with exactly the same hyperparameters and training parameters used by the authors in [1], but the encoder layers of the CTC model were initialized with random weights [6, 8]. An external language model was used during inference, like the one used by the authors in [1].
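For concreteness, the sketch below shows how a CTC objective is typically attached to a recurrent encoder in Keras; it is only an illustration and not the exact architecture or hyperparameters of the model in [1].

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_ctc_asr(feature_dim, vocab_size):
    """Minimal CTC-style ASR encoder: recurrent encoder + per-frame softmax
    over the vocabulary plus the CTC blank token."""
    feats = layers.Input(shape=(None, feature_dim))        # 30-dim raw or 32-dim GAN features
    x = layers.GRU(128, return_sequences=True)(feats)      # encoder size is an assumption
    probs = layers.Dense(vocab_size + 1, activation="softmax")(x)  # +1 for the CTC blank
    return tf.keras.Model(feats, probs)

def ctc_loss(y_true, y_pred, input_length, label_length):
    """CTC loss over a batch of padded label sequences."""
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```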
4 Data Sets used for performing experiments
For performing continuous speech recognition experiments using EEG, we used Data sets A and B from [1]. First we perform continuous speech recognition experiments using the 30-dimensional EEG features from Data sets A and B, and then we pass these 30-dimensional features to our GAN generator model described in Figure 1 and take the output of its TCN layer to obtain EEG features of dimension 32. Experiments are then performed using those 32-dimensional EEG features.
For training the GAN model we used the combined EEG data for the first two unique sentences from Data sets A and B of [1], consisting of a total of 102 EEG recording examples. Since there were only two unique sentences, the generator model's final dense layer had two hidden units with a softmax activation function. We considered EEG samples for only two unique sentences because we were interested in faster and more stable training of the GAN model. More details of the data sets, EEG experiment design, EEG recording hardware, etc. are covered in [6, 1].
For each data set we used 80% of the data as the training set and the remainder as the test set. The train-test split was done randomly, with no overlap between the training and test sets. The way we split the data for the continuous speech recognition experiments in this work is exactly the same as the method used by the authors in [1].
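A minimal sketch of such a random 80/20 split, assuming the per-sentence EEG feature arrays and their transcripts are held in the hypothetical variables `eeg_examples` and `transcripts`:

```python
from sklearn.model_selection import train_test_split

# Random 80/20 split with a fixed seed so the split is reproducible;
# the seed value is an assumption for illustration.
train_x, test_x, train_y, test_y = train_test_split(
    eeg_examples, transcripts, test_size=0.2, random_state=0)
```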
5 EEG feature extraction details
The EEG signals were sampled at 1000 Hz and a fourth-order IIR band-pass filter with cut-off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut-off frequency of 60 Hz was used to remove power line noise. The EEGLAB independent component analysis (ICA) toolbox was used to remove other biological artifacts such as electrocardiography (ECG), electromyography (EMG), and electrooculography (EOG) from the EEG signals. Figure 4 shows the ICA component classification for the various artifacts present in an EEG recording sample from a subject in Data set B; a significant EMG artifact due to speech production can be observed. Figure 5 shows the EEG signals for the 31 channels after removing all artifacts, for a 5-second recording sample from Data set B.
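A sketch of the filtering stage is given below, assuming a Butterworth design for the fourth-order IIR band-pass filter and a notch quality factor of 30 (both assumptions, since the exact designs are not specified here); the ICA-based artifact removal performed with EEGLAB is not reproduced.

```python
import numpy as np
from scipy.signal import butter, iirnotch, filtfilt

FS = 1000.0  # EEG sampling rate in Hz

def filter_eeg(eeg, fs=FS):
    """Band-pass (0.1-70 Hz, 4th-order IIR) and 60 Hz notch filtering.
    eeg: array of shape (n_channels, n_samples)."""
    b_bp, a_bp = butter(4, [0.1, 70.0], btype="bandpass", fs=fs)
    eeg = filtfilt(b_bp, a_bp, eeg, axis=-1)
    b_n, a_n = iirnotch(60.0, Q=30.0, fs=fs)   # remove power line noise
    return filtfilt(b_n, a_n, eeg, axis=-1)
```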
We extracted five statistical features from the EEG, namely root mean square, zero crossing rate, moving window average, kurtosis, and power spectral entropy [7, 6]. In total we extracted 31 (channels) × 5 = 155 features from the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
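A sketch of the per-channel, per-window feature computation is shown below; the window length and hop size (chosen so that the features come out at 100 Hz) are not specified here and are left to the caller.

```python
import numpy as np
from scipy.stats import kurtosis
from scipy.signal import periodogram

def window_features(window, fs=1000.0):
    """Five statistical features for one EEG channel window (1-D array):
    RMS, zero crossing rate, moving window average, kurtosis,
    and power spectral entropy."""
    rms = np.sqrt(np.mean(window ** 2))
    zcr = np.mean(np.abs(np.diff(np.sign(window))) > 0)
    mwa = np.mean(window)
    kurt = kurtosis(window)
    _, psd = periodogram(window, fs=fs)
    p = psd / (np.sum(psd) + 1e-12)
    pse = -np.sum(p * np.log2(p + 1e-12))
    return np.array([rms, zcr, mwa, kurt, pse])
```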
6 EEG Feature Dimension Reduction Algorithm Details
After extracting EEG features as explained in the previous section, we used kernel principal component analysis (KPCA) [17] to perform initial denoising of the EEG feature space, as explained in [6, 7]. We reduced the 155 EEG features to a dimension of 30 by applying KPCA for both data sets; we plotted the cumulative explained variance against the number of components to identify the right feature dimension. We used KPCA with a polynomial kernel of degree 3 [7, 6]. These 30-dimensional EEG features are used as the baseline features for the continuous speech recognition experiments, and at test time they are passed to the generator model described in Figure 1 to obtain EEG features of dimension 32 from its TCN layer.
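A minimal sketch of this reduction with scikit-learn, where `eeg_155` is assumed to be the (n_frames, 155) feature matrix:

```python
from sklearn.decomposition import KernelPCA

# Reduce the 155 raw EEG features (31 channels x 5 features) to 30 dimensions
# using a degree-3 polynomial kernel, as described above.
kpca = KernelPCA(n_components=30, kernel="poly", degree=3)
eeg_30 = kpca.fit_transform(eeg_155)
```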
7 Results
We used word error rate (WER) as the performance metric for the continuous speech recognition experiments at test time.
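For reference, WER can be computed as the word-level edit distance between the reference and hypothesis transcripts divided by the number of reference words, as in this minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over word tokens,
    normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```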
Table 1 shows the test time results obtained for the continuous speech recognition experiments on Data set A. For the baseline results we use the 30-dimensional EEG features with the CTC encoder initialized with random weights; we then compare the results obtained using our proposed method with the results obtained by the authors in [1], specifically those reported in Table 1 of [1]. As seen from Table 1, continuous speech recognition using the 32-dimensional EEG features generated by the TCN layer of our GAN generator model (Figure 1) always outperformed the baseline, and gave a lower WER than the method introduced in [1] for some of the test time experiments, with comparable results for the remaining cases.
Table 2 shows the test time results obtained for the continuous speech recognition experiments on Data set B. The observations made for Table 1 also hold for the results described in Table 2.
The results in Tables 1 and 2 show that our proposed method can be used to generate EEG features that improve the performance of continuous EEG based speech recognition systems. Unlike the method used by the authors in [1], our proposed method does not depend on additional features such as acoustic or articulatory features.
8 Conclusion and Future work
In this paper we demonstrate that, by making use of the ability of generative adversarial networks (GAN) to learn the loss function during training, a model can be trained to generate more meaningful EEG features from raw EEG features and thereby improve the performance of EEG based continuous speech recognition systems. We compare our method with the method described by the authors in [1]: when trained and tested on the same data sets, our method outperforms theirs for some of the test time experiments and gives comparable results for the remaining cases, and it does not need the additional acoustic or articulatory features required to implement the method described in [1].
For future work we would like to improve the current results by adding the CTC loss to our generator loss and by including a non-differentiable external language model with the generator, but that will require a larger training set with more EEG examples and data from a larger number of subjects. We would also like to combine our proposed method with the method introduced by the authors in [1] to see whether that helps establish a new baseline for state-of-the-art continuous EEG based speech recognition.
We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work.
References
[1] G. Krishna, C. Tran, M. Carnahan, Y. Han, and A. H. Tewfik, “Improving EEG based continuous speech recognition,” arXiv preprint arXiv:1911.11610, 2019.
[2] G. K. Anumanchipalli, J. Chartier, and E. F. Chang, “Speech synthesis from neural decoding of spoken sentences,” Nature, vol. 568, no. 7753, p. 493, 2019.
[3] M. Angrick, C. Herff, E. Mugler, M. C. Tate, M. W. Slutzky, D. J. Krusienski, and T. Schultz, “Speech synthesis from ECoG using densely connected 3D convolutional neural networks,” Journal of Neural Engineering, vol. 16, no. 3, p. 036019, 2019.
[4] J. S. Brumberg, K. M. Pitt, A. Mantie-Kozlowski, and J. D. Burnison, “Brain-computer interfaces for augmentative and alternative communication: A tutorial,” American Journal of Speech-Language Pathology, vol. 27, no. 1, pp. 1–12, 2018.
[5] G. Krishna, C. Tran, Y. Han, M. Carnahan, and A. Tewfik, “Speech synthesis using EEG,” in Acoustics, Speech and Signal Processing (ICASSP), 2020 IEEE International Conference on. IEEE, 2020.
[6] G. Krishna, C. Tran, M. Carnahan, and A. Tewfik, “Advancing speech recognition with no speech or with noisy speech,” in 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 2019.
[7] G. Krishna, C. Tran, J. Yu, and A. Tewfik, “Speech recognition with no speech or with noisy speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on. IEEE, 2019.
[8] G. Krishna, Y. Han, C. Tran, M. Carnahan, and A. H. Tewfik, “State-of-the-art speech recognition using EEG and towards decoding of speech spectrum from EEG,” arXiv preprint arXiv:1908.05743, 2019.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[10] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[11] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
[12] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[13] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
[14] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015.
[15] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
[16] A. Delorme and S. Makeig, “EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
[17] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, “Kernel PCA and de-noising in feature spaces,” in Advances in Neural Information Processing Systems, 1999, pp. 536–542.