Predicting Video Features from EEG and Vice Versa

05/16/2020 · by Gautam Krishna, et al.

In this paper we explore predicting facial or lip video features from electroencephalography (EEG) features, and predicting EEG features from recorded facial or lip video frames, using deep learning models. The subjects were asked to read aloud English sentences shown to them on a computer screen while their EEG signals and facial video frames were recorded simultaneously. Our model was able to generate only very broad characteristics of the facial or lip video frame from input EEG features. Our results demonstrate a first step towards synthesizing high-quality facial or lip video from recorded EEG features. We report results for a data set consisting of seven subjects.

1 Introduction

Electroencephalography (EEG) is a non-invasive technique for measuring the electrical activity of the human brain, where EEG sensors are placed on the scalp of the subject to obtain the recordings. EEG signals reflect the electrical activity of millions of synchronous cortical neurons sharing similar spatial orientation. EEG offers very high temporal resolution, although its spatial resolution and signal-to-noise ratio are poor. The non-invasive nature of EEG makes it easier to study and deploy than invasive neural recording techniques such as electrocorticography (ECoG) and local field potentials. EEG signals are commonly used to drive various brain-computer interface (BCI) applications. For example, in [1, 2] the authors show that EEG signals can be used to perform continuous and silent speech recognition, where EEG signals recorded in parallel with speech are translated to text. In [3] the authors demonstrate continuous silent speech recognition, translating to text the EEG signals recorded while subjects silently read English sentences in their mind. Similarly, in [4, 5] the authors provide preliminary results for synthesizing speech from EEG features.

In [1, 2] the authors also demonstrated that EEG features can be used to enhance the performance of automatic speech recognition (ASR) systems operating in the presence of background noise. The references [6, 7, 8, 9, 10] demonstrated continuous audio-visual speech recognition and end-to-end lip reading. Technologies such as EEG-based speech recognition and lip reading can help people who cannot produce voice, or people with speaking disabilities, to use virtual personal assistants like Amazon Alexa and Apple Siri, thereby improving technology accessibility.

The performance of visual speech recognition and lip reading systems degrades in darkness, and the performance of audio-visual speech recognition systems degrades in the presence of background noise. In [11] the authors demonstrated that EEG features can be used to improve the performance of visual and audio-visual speech recognition systems. In this paper we study the problem of predicting facial video features from recorded EEG features and its inverse problem, i.e., predicting EEG features from recorded video frames. We make use of the data set used by the authors in [11] for this work, and we demonstrate our results for seven subjects at test time.

Our deep learning model was able to generate facial video frames with only very broad characteristics from input EEG features, and our results demonstrate a first step towards the end goal of synthesizing high-quality video frames from EEG features. Generating facial frames from EEG signals, and vice versa, might help in better understanding the underlying neuroscience principles behind lip reading, facial expressions, etc.

2 Deep Learning Models

Figure 1 shows the architecture of the deep learning model used to predict video from input EEG features. The model takes EEG features of shape [batch size, time steps, 30] as input and produces video of shape [batch size, time steps, 100, 100] as output. The temporal convolutional network (TCN) [12] layer had 128 filters, and the time distributed dense layers used linear activation functions. The first time distributed dense layer contained 10000 hidden units and the final time distributed dense layer contained 100 hidden units. The first time distributed dense layer's output is reshaped to [batch size, time steps, 100, 100]. Each of the two dimensional convolutional transpose layers consists of 100 filters with a kernel size of (1,1) and a rectified linear unit (ReLU) activation function. The outputs of the two dimensional convolutional transpose layers are passed to a two dimensional up-sampling layer with size (1,1). The model was trained for 500 epochs using adam [13] as the optimizer with the batch size set to 100. We used mean squared error (MSE) as the loss function, and the validation split hyperparameter was set to 0.05. Figure 3 shows the training and validation loss.
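To make the architecture concrete, the following is a minimal Keras sketch of the EEG-to-video model described above. The fixed sequence length, the number of convolutional transpose stages (two here), and the use of the third-party keras-tcn package for the TCN layer are our assumptions, not the authors' released code.

```python
# Minimal sketch of the EEG-to-video model (Figure 1); sequence length,
# number of Conv2DTranspose stages, and the keras-tcn TCN layer are assumptions.
from tensorflow.keras import layers, models
from tcn import TCN  # pip install keras-tcn

TIME_STEPS = 10          # assumed fixed sequence length for illustration
EEG_DIM = 30             # EEG feature dimension after kernel PCA
FRAME_H = FRAME_W = 100  # grayscale video frame size

eeg_in = layers.Input(shape=(TIME_STEPS, EEG_DIM))
x = TCN(nb_filters=128, return_sequences=True)(eeg_in)              # TCN with 128 filters
x = layers.TimeDistributed(
        layers.Dense(FRAME_H * FRAME_W, activation='linear'))(x)    # 10000 hidden units
x = layers.Reshape((TIME_STEPS, FRAME_H, FRAME_W))(x)
for _ in range(2):  # number of transpose/up-sampling stages is assumed
    x = layers.Conv2DTranspose(100, (1, 1), activation='relu')(x)
    x = layers.UpSampling2D(size=(1, 1))(x)
video_out = layers.TimeDistributed(
        layers.Dense(FRAME_W, activation='linear'))(x)               # final 100-unit dense

eeg2video = models.Model(eeg_in, video_out)
eeg2video.compile(optimizer='adam', loss='mse')
# eeg2video.fit(eeg_features, video_frames, epochs=500, batch_size=100,
#               validation_split=0.05)
```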

Figure 2 shows the architecture of the deep learning model used to predict EEG features from input video frames. The model takes video of shape [batch size, time steps, 100, 100] as input and produces EEG features of shape [batch size, time steps, 30] as output. Each of the two dimensional convolutional layers had 100 filters with a kernel size of (1,3) and a ReLU activation function. The two dimensional max pooling layer had a pool size of (1,2). After flattening the max pooling layer output, it is reshaped to [batch size, time steps, flattened dimension / time steps]. The time distributed dense layer consists of 30 hidden units with a linear activation function. The model was trained for 1000 epochs with adam as the optimizer. The batch size was set to 100 and the validation split hyperparameter was set to 0.05. We used MSE as the loss function.
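As a companion to Figure 2, here is a similarly hedged Keras sketch of the video-to-EEG model; the number of Conv2D layers (two) and the fixed sequence length are again assumptions.

```python
# Minimal sketch of the video-to-EEG model (Figure 2); layer count and
# fixed sequence length are assumptions, not the authors' released code.
from tensorflow.keras import layers, models

TIME_STEPS = 10
EEG_DIM = 30
FRAME_H = FRAME_W = 100

video_in = layers.Input(shape=(TIME_STEPS, FRAME_H, FRAME_W))
y = layers.Conv2D(100, (1, 3), activation='relu')(video_in)
y = layers.Conv2D(100, (1, 3), activation='relu')(y)
y = layers.MaxPooling2D(pool_size=(1, 2))(y)
y = layers.Flatten()(y)
flat_dim = y.shape[-1]
# reshape the flattened features back to [batch, time steps, flat_dim / time steps]
y = layers.Reshape((TIME_STEPS, flat_dim // TIME_STEPS))(y)
eeg_out = layers.TimeDistributed(layers.Dense(EEG_DIM, activation='linear'))(y)

video2eeg = models.Model(video_in, eeg_out)
video2eeg.compile(optimizer='adam', loss='mse')
# video2eeg.fit(video_frames, eeg_features, epochs=1000, batch_size=100,
#               validation_split=0.05)
```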

For each subject and each experiment we used 10% of the data as the test set, 85% as the training set, and 5% as the validation set. The train-test-validation split was done randomly and there was no overlap between the training, test, and validation sets.
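A minimal sketch of this per-subject random split, assuming a NumPy permutation and an arbitrary seed:

```python
import numpy as np

def split_indices(num_examples, seed=0):
    """Random, non-overlapping 85/5/10 train/validation/test split."""
    idx = np.random.default_rng(seed).permutation(num_examples)
    n_test = int(0.10 * num_examples)
    n_val = int(0.05 * num_examples)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]
```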

Figure 1: Model for predicting video frames from EEG features
Figure 2: Model for predicting EEG features from video frames
Figure 3: Training and Validation loss for model used to predict video from EEG features

3 Data Sets used for performing experiments

We used the data set used by the authors in [11] for this work. It consists of simultaneous recordings of EEG signals and facial video of subjects uttering English sentences, collected from seven subjects. More details of the experiment design, data set, and EEG recording hardware are given in [11].

4 EEG and video feature extraction details

We followed the same EEG and video preprocessing techniques used by the authors in [11] to process the EEG and video data.

The EEG signals were sampled at 1000 Hz, and a fourth-order IIR band-pass filter with cutoff frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cutoff frequency of 60 Hz was used to remove power line noise. EEGLAB's [14] independent component analysis (ICA) toolbox was used to remove other biological signal artifacts such as electrocardiography (ECG), electromyography (EMG), and electrooculography (EOG) from the EEG signals.
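A hedged SciPy sketch of this filtering stage is given below; the paper specifies a fourth-order IIR band-pass (0.1 to 70 Hz) and a 60 Hz notch, but the exact filter design (Butterworth here) and the notch quality factor are assumptions, and the EEGLAB ICA step is not reproduced.

```python
import numpy as np
from scipy.signal import butter, iirnotch, lfilter

FS = 1000  # EEG sampling rate in Hz

def filter_eeg(raw, fs=FS):
    """raw: array of shape (channels, samples)."""
    # fourth-order IIR band-pass, 0.1-70 Hz (Butterworth design assumed)
    b_bp, a_bp = butter(4, [0.1, 70.0], btype='bandpass', fs=fs)
    out = lfilter(b_bp, a_bp, raw, axis=-1)
    # 60 Hz notch for power-line noise (Q = 30 assumed)
    b_n, a_n = iirnotch(60.0, Q=30.0, fs=fs)
    return lfilter(b_n, a_n, out, axis=-1)
```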

We then extracted five statistical features for EEG, namely root mean square, zero crossing rate, moving window average, kurtosis, and power spectral entropy [2, 1]. In total there were 155 features (31 channels × 5) for the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
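These per-channel features can be computed with standard NumPy/SciPy routines; the sketch below assumes a windowed computation whose window length and 10 ms hop (giving the 100 Hz feature rate) are our choices.

```python
import numpy as np
from scipy.stats import kurtosis
from scipy.signal import welch

def channel_features(window, fs=1000):
    """Five statistical features for one window of a single EEG channel."""
    rms = np.sqrt(np.mean(window ** 2))                       # root mean square
    zcr = np.mean(np.abs(np.diff(np.sign(window)))) / 2.0     # zero crossing rate
    mwa = np.mean(window)                                     # moving window average
    kur = kurtosis(window)                                    # kurtosis
    _, psd = welch(window, fs=fs, nperseg=min(len(window), 256))
    p = psd / (np.sum(psd) + 1e-12)
    pse = -np.sum(p * np.log2(p + 1e-12))                     # power spectral entropy
    return np.array([rms, zcr, mwa, kur, pse])

# Sliding this over 31 channels yields 31 x 5 = 155 features per 10 ms hop.
```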

As explained by the authors in [11], we extracted 100 frames per second from the recorded video. We used the YOLO [15] object detection model to detect faces in the extracted video frames. All RGB face frames were then converted to grayscale and resized to 100 × 100 using the Python Imaging Library (PIL). We further extracted lip or mouth frames from the grayscale face frames using DLib and the iBug face landmark predictor with 68 landmarks [16]. The iBug face landmark predictor was not able to detect the mouth or lips in all face frames, hence we used a mixture of facial and lip or mouth frames, keeping the original facial frames whenever the iBug face landmark predictor failed to make an accurate mouth or lip detection.
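The sketch below illustrates the grayscale conversion, resizing, and dlib/iBug mouth cropping for a single face crop; the upstream YOLO face detection, the landmark model path, and the fallback condition are simplified assumptions.

```python
import numpy as np
import dlib
from PIL import Image

# iBug 68-point landmark model; path is an assumption
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def preprocess_face_frame(face_rgb):
    """face_rgb: RGB face crop (numpy array) from the YOLO face detector."""
    gray = np.asarray(Image.fromarray(face_rgb).convert("L").resize((100, 100)))
    rect = dlib.rectangle(0, 0, gray.shape[1], gray.shape[0])
    shape = predictor(gray, rect)
    # landmark indices 48-67 cover the mouth region
    pts = [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]
    xs, ys = zip(*pts)
    mouth = gray[min(ys):max(ys), min(xs):max(xs)]
    if mouth.size == 0:          # fall back to the full face frame
        return gray
    return np.asarray(Image.fromarray(mouth).resize((100, 100)))
```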

5 EEG Feature Dimension Reduction Algorithm Details

The 155-dimensional EEG feature space was reduced to 30 dimensions using non-linear principal component analysis. We used kernel PCA [17] with a polynomial kernel of degree 3 to perform the non-linear PCA. Cumulative explained variance plots were used to estimate the optimal dimension [1]. The non-linear dimension reduction was performed to further denoise the EEG feature space.
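A brief scikit-learn sketch of this reduction step; the feature matrix name is illustrative.

```python
from sklearn.decomposition import KernelPCA

# eeg_155d: (num_frames, 155) array of raw EEG features (hypothetical name)
kpca = KernelPCA(n_components=30, kernel='poly', degree=3)
eeg_30d = kpca.fit_transform(eeg_155d)
```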

6 Results

We used the root mean square error (RMSE) computed between the test-time predictions and the ground truth from the test set as the performance metric to evaluate the models for each of the seven subjects. The results are summarized in Figure 4. For predicting video from EEG features, subject 4 showed the lowest test-time RMSE of 12.3, whereas for predicting EEG features from video, subject 1 showed the lowest test-time RMSE of 108.3.
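For reference, the metric is simply the root of the mean squared difference between prediction and ground truth:

```python
import numpy as np

def rmse(pred, truth):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(truth)) ** 2)))
```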

Figure 5 shows a video face frame for subject 1 from the test set, and Figure 6 shows the corresponding facial video frame predicted from the input EEG features at test time. It is clear from Figure 6 that only very broad characteristics of the facial video frame were captured in the prediction; the boundary of the face can be observed. This might be a first step towards the final goal of predicting high-quality video frames from EEG, as our model was able to learn only very broad characteristics of facial video frames from input EEG features.

Figure 4: Test time results
Figure 5: Ground truth video face frame from the subject 1 test set
Figure 6: Corresponding predicted facial frame for subject 1 at test time

7 Conclusions and Future work

In this paper we explored predicting grayscale facial or lip video frames from input EEG features, as well as predicting EEG features from input grayscale video frames, using deep learning models. At test time we observed that our model was able to learn only very broad characteristics of the facial video frames from the input EEG features. Our results might be a first step towards the final goal of synthesizing high-quality video from input EEG features.

For future work we would like to improve our current results by training the models with a larger data set. It might also be worth exploring the use of generative adversarial networks (GANs) [18] to solve these problems, provided a larger data set is available to train the GAN model; in this work we observed poor performance in our GAN experiments.

8 Acknowledgements

We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models described in this work.

References