In recent years there has been a lot of interesting work in the fields of lip reading and audio-visual speech recognition. In [1] the authors demonstrated end-to-end sentence-level lip reading, and in [2] the authors demonstrated deep-learning-based end-to-end audio-visual speech recognition. Similarly, many new results have been published in the field of speech recognition using bio-signals, mainly electrocorticography (ECoG) and electroencephalography (EEG). ECoG is an invasive way of measuring the electrical activity of the human brain, where a subject needs to undergo brain surgery to have ECoG electrodes implanted. EEG, on the other hand, is a non-invasive technique, where signals are recorded by placing EEG sensors on the scalp of the subject. In [3] the authors demonstrated isolated speech recognition using EEG, and a combination of EEG and acoustic features, on a limited English vocabulary of four words and five vowels. In [4, 5] the authors demonstrated continuous speech recognition using EEG features in clean and noisy environments, showing that EEG features are helpful in improving the robustness of automatic speech recognition (ASR) systems operating in noisy environments. In this paper we investigate whether EEG features are more helpful than acoustic features for improving the performance of continuous visual speech recognition systems.
Given the limited amount of data at hand, we implemented our own custom model for recognition instead of using state-of-the-art computer vision feature extraction networks such as ResNet [6] or VGG [7]. The main goal of the work explained in this paper is to investigate the feasibility of using EEG features for continuous visual speech recognition.
Our results demonstrate that EEG features are in fact helpful in improving the performance of continuous visual speech recognition systems. We demonstrate our results for a limited English vocabulary consisting of 30 unique sentences.
II. Visual Speech Recognition System Models
Figure 1 shows the architecture of the model used for recognition from a combination of video and EEG data, or video and acoustic data. The part of the network used for extracting features from EEG and acoustic data consists of three gated recurrent unit (GRU) [8] layers with 128, 64 and 32 hidden units respectively. Each GRU layer includes dropout regularization [9] with a dropout rate of 0.1. Depending on how the model is trained, this part of the network takes EEG features, acoustic features, or a concatenation of the two as input. The part of the network used for extracting features from video frames consists of two-dimensional convolutional layers and a two-dimensional max pooling layer. We used two convolutional layers and one max pooling layer. The convolutional layers had 100 filters with the ReLU [10] activation function and a kernel size of (1, 3), and the max pooling layer had a pool size of (1, 2). The output of the max pooling layer is flattened and reshaped so that it can be concatenated with the features extracted by the EEG/acoustic branch. The concatenated features are fed into a temporal convolutional network (TCN) [11] layer with 32 filters, whose output is fed into the decoder of the connectionist temporal classification (CTC) [12, 5, 4] network. The CTC decoder consists of a dense layer, which performs an affine transformation, followed by a softmax activation, and it receives the encoder output at every time step. The number of encoder time steps equals the product of the sampling frequency of the input features and the sequence length; there was no fixed time-step value since different subjects spoke at different rates. We used a character-based CTC model that predicts a character at every time step.
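As a concrete illustration, the fusion architecture described above can be sketched in Keras. This is a minimal sketch, not the authors' code: the time-step count `T`, the character-set size `N_CHARS`, the 32 × 32 frame size (instead of 100 × 100, to keep the sketch small), and the `same`/`causal` padding choices are all assumptions, and a causal Conv1D stands in for the TCN since Keras has no built-in TCN layer.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

T = 50          # illustrative number of time steps (varies with speech rate)
N_CHARS = 30    # hypothetical character-set size, including the CTC blank

# EEG/acoustic branch: three GRU layers (128, 64, 32 units), dropout 0.1 each.
feat_in = layers.Input(shape=(T, 30), name="eeg_or_acoustic")
x = layers.GRU(128, return_sequences=True, dropout=0.1)(feat_in)
x = layers.GRU(64, return_sequences=True, dropout=0.1)(x)
x = layers.GRU(32, return_sequences=True, dropout=0.1)(x)

# Video branch: two Conv2D layers (100 filters, (1, 3) kernels, ReLU) and one
# (1, 2) max pooling layer over the per-frame pixel axis, flattened per step.
vid_in = layers.Input(shape=(T, 32 * 32), name="video")
v = layers.Reshape((T, 32 * 32, 1))(vid_in)
v = layers.Conv2D(100, (1, 3), activation="relu", padding="same")(v)
v = layers.Conv2D(100, (1, 3), activation="relu", padding="same")(v)
v = layers.MaxPooling2D(pool_size=(1, 2))(v)
v = layers.Reshape((T, -1))(v)

# Fuse both streams, apply the temporal convolution (causal Conv1D as a
# TCN stand-in), then the CTC decoder: dense + softmax at every time step.
z = layers.Concatenate()([x, v])
z = layers.Conv1D(32, 3, padding="causal", activation="relu")(z)
out = layers.TimeDistributed(layers.Dense(N_CHARS, activation="softmax"))(z)

model = Model([feat_in, vid_in], out)
```

The video-only model of Figure 2 would keep only the `vid_in` branch and drop the concatenation.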
The details of the CTC loss function are covered in [12, 5, 4]. At inference time we used a CTC beam search decoder combined with an external 4-gram language model, a technique known as shallow fusion [13]. Figure 2 shows the architecture of the model used for speech recognition with video features alone. It is very similar to the model in Figure 1, except that it contains no additional network layers to extract acoustic or EEG features.
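The shallow-fusion scoring rule can be illustrated with a toy example. The interpolation weight `lam` and the candidate probabilities below are made up for illustration; the real decoder combines per-step CTC scores with 4-gram language model scores over entire beam prefixes.

```python
import math

def fused_score(log_p_ctc, log_p_lm, lam=0.5):
    # shallow fusion: score(y) = log P_ctc(y | x) + lam * log P_lm(y)
    return log_p_ctc + lam * log_p_lm

# Made-up scores for two competing beam hypotheses ending in a homophone.
candidates = {
    "their": (math.log(0.30), math.log(0.40)),
    "there": (math.log(0.35), math.log(0.10)),
}
best = max(candidates, key=lambda w: fused_score(*candidates[w]))
```

Here the language model overrides the slightly higher acoustic/CTC score of "there", which is exactly the kind of correction shallow fusion provides.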
III. Design of Experiments for Building the Database
Seven male UT Austin graduate students in their early to mid twenties took part in the speech-EEG-video experiment. Each subject was asked to speak the first 30 English sentences from the USC-TIMIT database [15], and their simultaneous speech, EEG and video were recorded. The sentences were shown on a computer screen and the subjects read them out loud. The data were recorded in the presence of background noise at 65 dB, with music played from our lab computer serving as the noise source. Each subject then repeated the same experiment two more times, giving us 90 speech-EEG-video recordings per subject.
We used the Brain Products ActiChamp EEG amplifier. Our EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 4. We used EEGLAB [16] to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.
Data from the first six subjects was used as the training set, and data from the last subject was used as the test set.
IV. Feature Extraction and Preprocessing Details
We followed the same EEG and speech preprocessing methods used by the authors in [3, 4]. EEG signals were sampled at 1000 Hz, and a fourth-order IIR band-pass filter with cut-off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut-off frequency of 60 Hz was used to remove power line noise. EEGLAB's independent component analysis (ICA) toolbox was used to remove artifacts from other biological signals, such as electrocardiography (ECG), electromyography (EMG) and electrooculography (EOG), from the EEG signals. We extracted five statistical features per EEG channel, namely root mean square, zero crossing rate, moving window average, kurtosis and power spectral entropy [3, 4], giving 31 (channels) × 5 = 155 features in total. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
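The five per-window statistics can be sketched as follows. This is a minimal NumPy sketch for a single channel; the window and hop lengths that yield the 100 Hz feature rate are not restated here, and the synthetic sine window is only a stand-in for real EEG samples.

```python
import numpy as np

def eeg_features(window):
    """Return the five statistics [RMS, zero-crossing rate,
    moving-window average, kurtosis, power spectral entropy]
    for one window of one EEG channel."""
    w = np.asarray(window, dtype=float)
    rms = np.sqrt(np.mean(w ** 2))
    zcr = np.mean(np.abs(np.diff(np.sign(w))) > 0)       # zero-crossing rate
    mwa = np.mean(w)                                     # moving-window average
    m = w - w.mean()
    kurt = np.mean(m ** 4) / (np.mean(m ** 2) ** 2) - 3  # excess kurtosis
    psd = np.abs(np.fft.rfft(w)) ** 2                    # power spectrum
    p = psd / psd.sum()
    entropy = -np.sum(p * np.log2(p + 1e-12))            # spectral entropy
    return np.array([rms, zcr, mwa, kurt, entropy])

feats = eeg_features(np.sin(np.linspace(0, 8 * np.pi, 200)))
```

Stacking this vector across the 31 channels yields the 155-dimensional feature frame described above.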
The recorded speech signal was sampled at 16 kHz. We extracted Mel-frequency cepstral coefficients (MFCC) of dimension 13 as features for the speech signal. The MFCC features were also sampled at 100 Hz, the same as the sampling frequency of the EEG features.
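A self-contained MFCC pipeline matching the stated parameters (16 kHz audio, 13 coefficients, a 10 ms hop giving the 100 Hz feature rate) can be sketched as follows. The 25 ms window, 26-filter mel bank, 512-point FFT and Hamming window are conventional choices assumed here, not values reported by the authors.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, nfft=512,
         n_filt=26, n_ceps=13):
    # 25 ms windows with a 10 ms hop -> 100 Hz feature rate at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    pspec = np.abs(np.fft.rfft(frames * np.hamming(frame_len), nfft)) ** 2 / nfft
    # triangular mel-spaced filterbank
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # log mel energies, then DCT; keep the first n_ceps coefficients
    feat = np.log(pspec @ fbank.T + 1e-10)
    return dct(feat, type=2, axis=1, norm="ortho")[:, :n_ceps]

feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0))
```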
We extracted 100 frames per second from the recorded video and used the YOLO [17] object recognition model to perform face detection on the extracted frames. Figure 6 shows a raw extracted RGB video frame and Figure 7 shows the corresponding face frame extracted using YOLO. The maximum x dimension in our extracted face data set was 426 pixels and the maximum y dimension was 381 pixels, with a z dimension of 3 (RGB). Our initial plan was to run experiments on RGB frames, but we were constrained by the memory requirements of our computing hardware, so we converted the RGB face frames to grayscale and resized all of them to 100 × 100 pixels. Figure 8 shows the corresponding grayscale resized face frame. We used the Python Imaging Library (PIL) for resizing the images. We also tried extracting mouth frames from the extracted face frames, but observed a missing-data problem, so we fed the face frames rather than the mouth frames to the model. We strongly encourage researchers with sufficient computing resources to use RGB frames, and to replace the two-dimensional convolutional and max pooling layers in the models of Figures 1 and 2 with three-dimensional ones.
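The frame preprocessing step can be sketched with Pillow. A random synthetic image stands in here for a YOLO-cropped face frame of the maximum observed size (426 × 381 RGB); the division by 255 to normalise pixel values is an assumption about the model input.

```python
import numpy as np
from PIL import Image

# Synthetic stand-in for a YOLO-cropped RGB face frame (height 381, width 426).
frame = Image.fromarray(
    np.random.randint(0, 256, (381, 426, 3), dtype=np.uint8))

gray = frame.convert("L")       # RGB -> grayscale, to cut memory use
face = gray.resize((100, 100))  # resize to the 100 x 100 model input size
pixels = np.asarray(face, dtype=np.float32) / 255.0  # normalised input
```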
V. EEG Feature Dimension Reduction Algorithm Details
After extracting EEG and acoustic features as explained in the previous section, we used kernel principal component analysis (KPCA) [18] to denoise the EEG feature space, as explained by the authors in [4, 3]. We reduced the 155 EEG features to a dimension of 30 by applying KPCA with a polynomial kernel of degree 3 [3, 4] to both data sets. We plotted cumulative explained variance versus number of components to identify the right feature dimension, as shown in Figure 5.
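This reduction step can be sketched with scikit-learn's KernelPCA, using the paper's settings (30 components, degree-3 polynomial kernel). The random matrix below is only a stand-in for the real 155-dimensional EEG feature frames.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 155))  # 200 hypothetical 155-dim EEG feature frames

# Degree-3 polynomial KPCA, reducing 155 features to 30 components.
kpca = KernelPCA(n_components=30, kernel="poly", degree=3)
X_reduced = kpca.fit_transform(X)
```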
VI. Results
We used word error rate (WER) as the performance metric of the model at test time. Table 1 shows the average WER on the test set for various feature-set inputs. The results demonstrate that EEG features are more helpful than acoustic features for improving the performance of continuous visual speech recognition systems operating in noisy environments. Using all modalities (acoustic, EEG, video) gave the best test-time performance, i.e. the lowest word error rate on the test set.
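For reference, the WER metric is the word-level Levenshtein distance between the reference and the hypothesis, normalised by the reference length. The sketch below is the standard dynamic-programming computation, not the authors' exact evaluation script.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / #ref words."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

score = wer("this was easy for us", "this was not easy for us")
```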
VII. Conclusion and Future Work
In this paper we demonstrated the feasibility of using EEG features to improve visual speech recognition systems operating in noisy environments. For future work we would like to build a much larger data set and validate our results on a larger corpus. We encourage other researchers to make a joint effort to build a state-of-the-art speech-video-EEG database to help advance research in this area.
We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work.
The first author would like to thank Prof. Yann Soullard of Université de Rennes 2 for his crucial help with debugging the CTC model code.
-  Y. M. Assael, B. Shillingford, S. Whiteson, and N. De Freitas, “Lipnet: End-to-end sentence-level lipreading,” arXiv preprint arXiv:1611.01599, 2016.
-  T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE transactions on pattern analysis and machine intelligence, 2018.
-  G. Krishna, C. Tran, J. Yu, and A. Tewfik, “Speech recognition with no speech or with noisy speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on. IEEE, 2019.
-  G. Krishna, C. Tran, M. Carnahan, and A. Tewfik, “Advancing speech recognition with no speech or with noisy speech,” in 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 2019.
- G. Krishna, Y. Han, C. Tran, M. Carnahan, and A. H. Tewfik, “State-of-the-art speech recognition using EEG and towards decoding of speech spectrum from EEG,” arXiv preprint arXiv:1908.05743, 2019.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
- A. F. Agarap, “Deep learning using rectified linear units (ReLU),” arXiv preprint arXiv:1803.08375, 2018.
-  S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
- A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
-  S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 369–375.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  S. Narayanan, A. Toutios, V. Ramanarayanan, A. Lammert, J. Kim, S. Lee, K. Nayak, Y.-C. Kim, Y. Zhu, L. Goldstein et al., “Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (tc),” The Journal of the Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
- A. Delorme and S. Makeig, “EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
- S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, “Kernel PCA and de-noising in feature spaces,” in Advances in Neural Information Processing Systems, 1999, pp. 536–542.