An automatic speech recognition (ASR) system maps acoustic features to text. ASR systems form the front end or back end of many state-of-the-art voice assistants such as Bixby, Alexa, Siri, and Cortana. Most current state-of-the-art ASR systems are trained only on acoustic features and can operate only on acoustic input; this limits the accessibility of the technology for people who cannot speak at all or who have speaking disabilities such as stuttering. Electroencephalography (EEG), on the other hand, is a non-invasive way of measuring the electrical activity of the human brain. EEG sensors are placed on the scalp of the subjects to obtain the EEG recordings. Recently, in  the authors demonstrated isolated speech recognition using EEG features on a limited English vocabulary of four words and five vowels. In  the authors demonstrated continuous speech recognition using connectionist temporal classification (CTC) and an attention model on an English vocabulary of 20 unique sentences, using EEG features and a combination of EEG and acoustic features as input. In  the authors demonstrated continuous speech recognition using different EEG feature sets. In [2, 5] the authors used state-of-the-art end-to-end ASR models to directly map EEG features to text.
In  the authors demonstrated that combining articulatory features with acoustic features improves the performance of ASR systems, and in  the authors demonstrated that combining EEG features with acoustic features also improves the performance of ASR systems operating in the presence of background noise. In [5, 2] the authors did not take articulatory features into account when performing speech recognition. In this paper we show that articulatory features can be used to design more robust ASR EEG encoder models and can help improve the performance of EEG-based continuous speech recognition systems.
In  the authors initialized the weights of the recurrent neural network (RNN) encoder in their CTC network with random weights. In this paper we demonstrate that initializing the weights of the first few RNN layers of the CTC network with the weights of an RNN trained to predict the concatenation of acoustic and articulatory features from EEG features significantly improves the performance of the EEG-based speech recognizer. We further demonstrate predicting articulatory features from EEG features using a temporal convolutional network (TCN) model.
In [2, 5] the authors did not use an external language model during inference. In this paper we demonstrate that using an external language model during inference significantly improves the beam search decoding performance of the EEG-based speech recognizer. Finally, we demonstrate EEG-based speech recognition results for a larger English vocabulary than the ones used by the authors in [2, 5].
II Connectionist Temporal Classification (CTC)
The encoder of our CTC network consists of two gated recurrent unit (GRU) layers with 128 and 64 hidden units, respectively, connected to a temporal convolutional network (TCN)  layer with 32 filters, as shown in Figure 1. The GRU layers use dropout regularization with a dropout rate of 0.1. The decoder of the CTC network consists of a dense layer followed by softmax activation. The output of the encoder is fed into the decoder at every time step. The two GRU layers in the encoder network are initialized with the weights of the GRU layers of the model shown in Figure 2. The model in Figure 2 consists of two GRU layers with 128 and 64 hidden units, respectively, connected to a time-distributed dense layer with 19 hidden units and linear activation, which predicts the concatenation of acoustic features, i.e., Mel-frequency cepstral coefficients (MFCCs) of dimension 13, and articulatory features of dimension 6, a net feature dimension of 19, at every time step. The model shown in Figure 2 was trained for 500 epochs with mean squared error (MSE) as the loss function, the Adam optimizer, and a batch size of one. In essence, the model shown in Figure 2 is a GRU-based regression model that predicts the combination of acoustic and articulatory features from EEG features of dimension 30 at every time step.
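The GRU regression model of Figure 2 can be sketched as follows in tf.keras; the helper name `build_regression_model` and any layer options beyond those stated in the text (128/64 GRU units, 19-dimensional time-distributed linear output, MSE loss, Adam optimizer) are our assumptions:

```python
import tensorflow as tf

def build_regression_model(eeg_dim=30, out_dim=19):
    # Two GRU layers (128 and 64 units) returning full sequences, followed by
    # a time-distributed dense layer with linear activation that predicts the
    # 13 MFCC + 6 articulatory features (19 values) at every time step.
    inputs = tf.keras.Input(shape=(None, eeg_dim))
    x = tf.keras.layers.GRU(128, return_sequences=True)(inputs)
    x = tf.keras.layers.GRU(64, return_sequences=True)(x)
    outputs = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(out_dim, activation='linear'))(x)
    model = tf.keras.Model(inputs, outputs)
    # MSE loss and Adam optimizer, as stated in the text (batch size one).
    model.compile(loss='mse', optimizer='adam')
    return model
```

The trained GRU layers of this model supply the initial weights for the first two layers of the CTC encoder described above.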
The batch size was set to 32 for the CTC model. The TCN layer in the CTC encoder was initialized with random weights. The main motivation behind this design is that the first two GRU layers of the CTC encoder help discover acoustic and articulatory representations from the input EEG features, while the TCN layer learns the mapping from those representations to text. There is no fixed number of time steps for the encoder of the CTC model; as usual, the number of time steps equals the product of the sampling frequency of the input features and the input sequence length. During inference we used a combination of a CTC beam search decoder and an external 4-gram language model: along with the beam search score, we include the log of the probability assigned by the language model to the label sequence. We used a character-based CTC model for this work, which predicts a character at every time step. Figure 3 shows the loss convergence of the CTC ASR model and Figure 4 shows the loss convergence of the GRU regression model described in Figure 2. All scripts were written in Python using the Keras deep learning API and the TensorFlow 2.0 framework.
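The CTC encoder and its weight-transfer initialization can be sketched as follows, assuming tf.keras. Since the exact TCN block is not spelled out here, a single causal dilated Conv1D with 32 filters stands in for it, and the helper name `build_ctc_encoder` is ours:

```python
import tensorflow as tf

def build_ctc_encoder(eeg_dim=30, n_chars=27, regression_model=None):
    # Two GRU layers (128 and 64 units, dropout 0.1) feed a TCN layer with
    # 32 filters (approximated here by one causal dilated Conv1D); the
    # decoder is a time-distributed dense + softmax over the character set
    # plus the CTC blank symbol.
    inputs = tf.keras.Input(shape=(None, eeg_dim))
    gru1 = tf.keras.layers.GRU(128, return_sequences=True, dropout=0.1)
    gru2 = tf.keras.layers.GRU(64, return_sequences=True, dropout=0.1)
    x = gru2(gru1(inputs))
    x = tf.keras.layers.Conv1D(32, kernel_size=3, padding='causal',
                               dilation_rate=2, activation='relu')(x)
    outputs = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(n_chars + 1, activation='softmax'))(x)
    model = tf.keras.Model(inputs, outputs)
    if regression_model is not None:
        # Copy the pre-trained regression model's GRU weights into the first
        # two encoder layers; the TCN layer keeps its random initialization.
        src = [l for l in regression_model.layers
               if isinstance(l, tf.keras.layers.GRU)]
        gru1.set_weights(src[0].get_weights())
        gru2.set_weights(src[1].get_weights())
    return model
```

Training would then use the CTC loss over the softmax outputs; the beam search with the external 4-gram language model is applied only at inference time.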
III Model to predict articulatory features from EEG features
For predicting articulatory features of dimension 6 from EEG features of dimension 30, we used the model described in Figure 5. The model consists of a temporal convolutional network (TCN)  layer with 128 filters connected to a time-distributed dense layer with 6 hidden units and linear activation, which predicts articulatory features of dimension 6 at every time step. Dropout  regularization with a rate of 0.2 was applied after the TCN layer. Mean squared error (MSE) was used as the loss function for this regression model, which was trained for 1000 epochs with the Adam optimizer and a batch size of one. Figure 6 shows the loss convergence of the model.
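A minimal sketch of the Figure 5 model, again with a causal dilated Conv1D stack standing in for the TCN layer (the actual TCN block follows the cited TCN reference, so this is only an approximation, and the function name is ours):

```python
import tensorflow as tf

def build_articulatory_model(eeg_dim=30, out_dim=6):
    # TCN-style stack of causal dilated Conv1D layers with 128 filters,
    # dropout 0.2 after the stack, then a time-distributed dense layer with
    # linear activation predicting 6 articulatory features per time step.
    inputs = tf.keras.Input(shape=(None, eeg_dim))
    x = inputs
    for d in (1, 2, 4):
        x = tf.keras.layers.Conv1D(128, 3, padding='causal',
                                   dilation_rate=d, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(out_dim, activation='linear'))(x)
    model = tf.keras.Model(inputs, outputs)
    # MSE loss with the Adam optimizer, batch size one, 1000 epochs.
    model.compile(loss='mse', optimizer='adam')
    return model
```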
IV Design of Experiments for building the database
Seven male subjects took part in our EEG-speech experiment. All seven subjects were UT Austin graduate students in their mid-twenties, and English was not their mother tongue.
They were asked to speak the first 30 English sentences from the USC-TIMIT database , and their simultaneous speech and EEG signals were recorded. The data were recorded in the presence of 65 dB background noise; music played from our lab computer served as the noise source. Each subject was asked to repeat the experiment two more times. The subjects read aloud the English sentences shown to them on a computer screen. We used 80% of the total data as the training set and the remainder as the test set for all models.
We used Brain Products' ActiChamp EEG amplifier. Our EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 7. We used EEGLab  to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.
V EEG and Speech feature extraction details
We followed the same EEG and speech preprocessing methods used by the authors in [1, 2]. EEG signals were sampled at 1000 Hz, and a fourth-order IIR band-pass filter with cut-off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut-off frequency of 60 Hz was used to remove power line noise. EEGLab's independent component analysis (ICA) toolbox was used to remove other biological signal artifacts, such as electrocardiography (ECG), electromyography (EMG), and electrooculography (EOG), from the EEG signals. We extracted five statistical features for EEG, namely root mean square, zero crossing rate, moving window average, kurtosis, and power spectral entropy [1, 2]. In total we therefore extracted 31 (channels) × 5 = 155 features from the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
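The five per-channel statistics can be computed per analysis window roughly as follows; the 100-sample window length is our assumption, as only the 100 Hz output rate is given in the text:

```python
import numpy as np
from scipy.stats import kurtosis

def eeg_window_features(x, fs=1000, win=100):
    # x: one EEG channel sampled at fs. A hop of fs/100 samples gives the
    # 100 Hz feature rate stated in the text; the window length `win` is an
    # assumption, as the paper does not specify it.
    hop = fs // 100
    feats = []
    for i in range(0, len(x) - win + 1, hop):
        w = x[i:i + win]
        rms = np.sqrt(np.mean(w ** 2))                    # root mean square
        zcr = np.mean(np.abs(np.diff(np.sign(w)))) / 2.0  # zero crossing rate
        mwa = np.mean(w)                                  # moving window average
        kur = kurtosis(w)                                 # kurtosis
        p = np.abs(np.fft.rfft(w)) ** 2
        p = p / (p.sum() + 1e-12)
        pse = -np.sum(p * np.log2(p + 1e-12))             # power spectral entropy
        feats.append([rms, zcr, mwa, kur, pse])
    return np.asarray(feats)
```

Stacking these five statistics across all 31 channels yields the 155-dimensional feature vector per frame.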
The recorded speech signal was sampled at 16 kHz. We extracted Mel-frequency cepstral coefficients (MFCCs) of dimension 13 as features for the speech signal. The MFCC features were also sampled at 100 Hz, the same sampling frequency as the EEG features.
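A self-contained sketch of 13-dimensional MFCC extraction at a 100 Hz frame rate (hop of 160 samples at 16 kHz); the paper does not name its MFCC toolchain, so the FFT size, Hamming window, and 26-filter mel bank here are our assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_100hz(wav, sr=16000, n_mfcc=13, n_fft=512, n_mels=26):
    # Hop of sr/100 = 160 samples gives the 100 Hz frame rate in the text.
    hop = sr // 100
    frames = np.array([wav[i:i + n_fft] * np.hamming(n_fft)
                       for i in range(0, len(wav) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank between 0 and sr/2.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II of the log mel energies; keep the first 13 coefficients.
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```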
We used the acoustic-to-articulatory speech inversion tool introduced by the authors in  to extract articulatory features of dimension 6 from the recorded speech signal. The articulatory features were extracted at the same 100 Hz sampling frequency as the MFCC and EEG features. The six articulatory tract variables (TVs) extracted were Lip Aperture (LA), Lip Protrusion (LP), Tongue Body Constriction Location (TBCL), Tongue Body Constriction Degree (TBCD), Tongue Tip Constriction Location (TTCL), and Tongue Tip Constriction Degree (TTCD).
VI EEG Feature Dimension Reduction Algorithm Details
After extracting EEG and acoustic features as explained in the previous section, we used Kernel Principal Component Analysis (KPCA)  to denoise the EEG feature space, as explained by the authors in [1, 2]. We reduced the 155 EEG features to a dimension of 30 by applying KPCA for both data sets. We plotted cumulative explained variance versus the number of components to identify the right feature dimension, as shown in Figure 8. We used KPCA with a polynomial kernel of degree 3 [1, 2].
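This dimension reduction maps directly onto scikit-learn's `KernelPCA`; a sketch under the stated settings (30 components, polynomial kernel of degree 3), with the helper name ours:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def reduce_eeg_features(feats_155):
    # feats_155: (n_frames, 155) raw EEG feature matrix. Returns the
    # denoised (n_frames, 30) representation used as model input.
    kpca = KernelPCA(n_components=30, kernel='poly', degree=3)
    return kpca.fit_transform(feats_155)
```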
VII Results
We used word error rate (WER) as the performance metric to evaluate the CTC ASR model on test-set data for various numbers of sentences, as shown in Table 1. The average WER is reported in Table 1. The language model (LM) was included during inference. Without the language model we observed higher error rates than those reported in Table 1, roughly 5 to 10% higher for most experiments. The results in Table 1 make it evident that initializing the GRU layers in the encoder of the CTC network with pre-trained regression model weights significantly improves the test-time performance of the EEG-based CTC speech recognizer, especially as the vocabulary size increases.
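For reference, WER is the word-level Levenshtein (edit) distance divided by the reference length; the sketch below is our illustration, not the paper's evaluation script:

```python
import numpy as np

def wer(ref, hyp):
    # Word error rate: minimum substitutions + insertions + deletions needed
    # to turn the hypothesis into the reference, divided by reference length.
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + cost) # substitution / match
    return d[len(r), len(h)] / len(r)
```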
We used two performance metrics to evaluate the TCN regression model for predicting articulatory features from EEG features: root mean squared error (RMSE) and normalized RMSE between the articulatory features predicted at test time and the ground-truth articulatory features from the test set. The RMSE values were normalized by dividing them by the absolute difference between the maximum and minimum values in the test-set observation vector. We observed an average RMSE of 0.632 and an average normalized RMSE of 0.115 on the test-set data.
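The two metrics can be computed as follows; the helper name `rmse_and_normalized` is ours:

```python
import numpy as np

def rmse_and_normalized(pred, truth):
    # RMSE between predictions and ground truth, plus the RMSE divided by
    # the range (max minus min) of the ground-truth test observations.
    rmse = np.sqrt(np.mean((pred - truth) ** 2))
    nrmse = rmse / np.abs(truth.max() - truth.min())
    return rmse, nrmse
```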
VIII Conclusion and Future work
In this paper we demonstrated several techniques to improve the performance of EEG-based continuous speech recognition systems. We further demonstrated predicting articulatory features from EEG features with very low RMSE and normalized RMSE.
For future work we plan to conduct experiments with data collected from subjects with speaking disabilities and to build a larger speech-EEG database.
We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work. The first author would like to thank Prof. Alex Dimakis from the ECE department at UT Austin for suggesting that we carry out experiments using TCN layers.
-  G. Krishna, C. Tran, J. Yu, and A. Tewfik, “Speech recognition with no speech or with noisy speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on. IEEE, 2019.
-  G. Krishna, C. Tran, M. Carnahan, and A. Tewfik, “Advancing speech recognition with no speech or with noisy speech,” in 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 2019.
-  A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.
-  J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
-  G. Krishna, Y. Han, C. Tran, M. Carnahan, and A. H. Tewfik, “State-of-the-art speech recognition using EEG and towards decoding of speech spectrum from EEG,” arXiv preprint arXiv:1908.05743, 2019.
-  K. Kirchhoff, G. A. Fink, and G. Sagerer, “Combining acoustic and articulatory feature information for robust speech recognition,” Speech Communication, vol. 37, no. 3-4, pp. 303–319, 2002.
-  S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
-  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural computation, vol. 1, no. 2, pp. 270–280, 1989.
-  A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
-  S. Narayanan, A. Toutios, V. Ramanarayanan, A. Lammert, J. Kim, S. Lee, K. Nayak, Y.-C. Kim, Y. Zhu, L. Goldstein et al., “Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (TC),” The Journal of the Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
-  A. Delorme and S. Makeig, “EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of neuroscience methods, vol. 134, no. 1, pp. 9–21, 2004.
-  G. Sivaraman, V. Mitra, H. Nam, M. K. Tiede, and C. Y. Espy-Wilson, “Vocal tract length normalization for speaker independent acoustic-to-articulatory speech inversion.” in INTERSPEECH, 2016, pp. 455–459.
-  S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, “Kernel pca and de-noising in feature spaces,” in Advances in neural information processing systems, 1999, pp. 536–542.