Neural networks have recently been shown to achieve outstanding performance in several machine learning domains such as image recognition and voice recognition. Most of these breakthroughs have been achieved with convolutional neural networks (CNNs), but some promising results have also been demonstrated using recurrent neural networks (RNNs) for tasks such as speech and handwriting recognition [12, 11], usually with the long short-term memory (LSTM) architecture.
The results reported for EEG, despite some success, do not show the same dramatic progress over the previous state of the art that ‘deep learning’ methods achieved elsewhere: while in areas such as image or voice recognition ‘deep’ neural networks have reached classification accuracy that far exceeds other methods, this has not yet been the case with EEG in general and P300 detection specifically. The small number of samples typically available in neuroscience (or BCI) is most likely one of the main reasons. In addition, the high dimensionality of the EEG signal, the low signal-to-noise ratio (SNR) and the existence of outliers in the data pose further difficulties when trying to use neural networks for BCI tasks. The main question in this research is whether the RNN model, and particularly LSTM, can enhance the accuracy of P300-based BCI systems and, if so, under what conditions.
P300-based BCI systems can identify when a subject’s attention is directed toward a target event by examining the subject’s electroencephalogram (EEG) data. The first system that used the P300 effect was presented by Farwell and Donchin [9], and since then different versions of P300-based BCI systems have been suggested. One example of such a paradigm is the P300 rapid serial visual presentation (RSVP) speller. In this paradigm letters are presented one after the other in a random order, and the subject is asked to pay attention to only one of the letters, called the target letter or target stimulus (by silently counting its appearances, for example). Whenever the subject attends to the target letter, a characteristic waveform called the P300 is expected to occur; this waveform appears when a person’s attention is drawn toward a rare event. It is called P300 since there is usually a peak in the EEG amplitude 300ms after the presentation of a target event. The advantage of the RSVP paradigm is that it does not require any eye movements, and can thus be operated by patients who have lost control of their eye gaze completely.
2.1 “Deep” Neural Networks - Overview
Deep Neural Networks (DNNs) are artificial neural networks (ANNs) constructed from multiple layers between the input and output. There are two main types of ANN architectures: Feed Forward Neural Networks (FFNN or FF) and Recurrent Neural Networks (RNN). In an FFNN directed cycles are not allowed (i.e., data can flow only to the next layer), while the RNN architecture allows directed cycles within the network (i.e., data can also flow between “neurons” in the same layer). The directed cycles in an RNN allow the network to “remember” past events, making it suitable for sequence learning.
The architecture we propose is a combination of several ANN layer types described below:
Fully Connected Layer - FC - A layer in which each neuron in the input is connected to each neuron in the output is often called a Fully Connected layer, or FC. In an FC layer, the output is obtained by the following equation:
$$y = g(Wx + b)$$
where $x$ is the input vector, $W$ is a matrix that represents a linear mapping and $b$ is a vector that reflects the translation, called the bias ($b$ holds one value for each output unit).
$g$ represents an element-wise non-linearity such as ReLU, sigmoid or tanh:
$$\mathrm{ReLU}(z) = \max(0, z), \qquad \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}, \qquad \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
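As a minimal sketch of the fully connected layer described above, the following pure-Python snippet applies an element-wise non-linearity to a linear mapping plus bias. The function names (`relu`, `sigmoid`, `fc_forward`) and the toy numbers are illustrative assumptions and are not taken from the original implementation.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fc_forward(W, x, b, activation):
    """Fully connected layer: activation(W x + b), one output unit per row of W."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

# Toy layer with 2 inputs and 3 output units (values are hypothetical).
W = [[1.0, -1.0], [0.5, 0.5], [-2.0, 0.0]]
x = [2.0, 1.0]
b = [0.0, 1.0, 0.5]
y = fc_forward(W, x, b, relu)  # third unit has a negative pre-activation, so ReLU zeroes it
```

Swapping `relu` for `sigmoid` gives the kind of single-cell sigmoid output used as the last layer of a binary classifier.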
Convolutional Neural Network Layer - CNN - A layer that utilizes local correlations between adjacent cells of the input layer. Since, unlike in an FC layer, the output can have multiple dimensions, we refer to the output as a feature map. The feature maps are obtained by applying trainable multi-dimensional kernels across the input layer. Equation 1 describes the output of element $i$ of feature map $k$ in a 1D CNN layer:
$$y_{i,k} = g\Big(\sum_{j=1}^{m} W_{j,k}\, x_{i+j-1} + b_k\Big) \qquad (1)$$
where $g$ is an element-wise non-linearity and $b_k$ is the bias of feature map $k$. If the kernel is a 1D filter with a length of $m$ and the next layer has $n$ outputs, then $W$ is a matrix of size $m \times n$.
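The 1D convolution above can be sketched in a few lines of pure Python. This is an illustration only: the helper name `conv1d_valid`, the ReLU non-linearity, the absence of padding and the toy kernels are all assumptions made for the example.

```python
def conv1d_valid(x, kernels, biases, stride=1):
    """1D convolution layer without padding.

    x:        list of floats (a single input channel)
    kernels:  list of n kernels, each a list of m weights
    biases:   list of n biases, one per feature map
    Returns n feature maps, each a list of ReLU-activated outputs.
    """
    m = len(kernels[0])
    positions = range(0, len(x) - m + 1, stride)
    return [[max(0.0, sum(k[j] * x[i + j] for j in range(m)) + b)
             for i in positions]
            for k, b in zip(kernels, biases)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernels = [[1.0, 0.0, -1.0],   # difference filter
           [0.5, 0.5, 0.5]]    # scaled moving sum
biases = [0.0, 0.0]

overlapping = conv1d_valid(x, kernels, biases)                # stride 1
non_overlapping = conv1d_valid(x, kernels, biases, stride=3)  # stride equals kernel length
```

Setting the stride equal to the kernel length makes each filter process consecutive, non-overlapping windows of the input.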
Recurrent Neural Network Layer - RNN - The simplest form of an RNN [24, 30] is a layer whose output is fed back into its input. Unlike in FF layers, in an RNN the result at time $t$ ($h_t$) is a function of both the current input $x_t$ and the previous state $h_{t-1}$:
$$h_t = g(W x_t + U h_{t-1} + b)$$
where $W$ and $U$ are the input and recurrent weight matrices, $b$ is the bias and $g$ is a non-linear activation function. The structure of an RNN layer allows the network to contain memory, since it has access to information from previous time steps. RNNs are known to suffer from two phenomena called "vanishing gradient" and "exploding gradient": while training, the gradient of the loss function may not propagate to the first layers (i.e., the layers closer to the input layer) or may reach very large values (thus updating the layer weights too much). These problems prevent RNNs from learning long temporal dependencies. A common solution for these problems is the Long Short Term Memory layer.
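To make the vanishing gradient concrete, here is a small sketch of a scalar RNN with a tanh activation, together with the derivative of the final state with respect to the initial state. The weights, the input sequence and the function names are hypothetical choices made for illustration.

```python
import math

def rnn_forward(xs, w, u, b, h0=0.0):
    """Scalar RNN: h_t = tanh(w * x_t + u * h_{t-1} + b). Returns all states."""
    h, hs = h0, []
    for x in xs:
        h = math.tanh(w * x + u * h + b)
        hs.append(h)
    return hs

def grad_wrt_initial_state(hs, u):
    """d h_T / d h_0 by the chain rule: the product of u * (1 - h_t^2) over time."""
    g = 1.0
    for h in hs:
        g *= u * (1.0 - h * h)
    return g

w, u, b = 1.0, 0.5, 0.0
states = rnn_forward([0.5] * 20, w, u, b)
short_grad = grad_wrt_initial_state(states[:2], u)  # dependency across 2 steps
long_grad = grad_wrt_initial_state(states, u)       # dependency across 20 steps
```

Because every factor in the product has magnitude of at most `u` here, the influence of the initial state shrinks geometrically with the sequence length; with a recurrent weight whose factors exceed 1 the same product would grow instead.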
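Long Short Term Memory Layer - LSTM - A type of RNN with a special architecture designed to overcome the difficulty RNNs have in learning long-term dependencies. Each unit in an LSTM layer has two outputs: $h_t$, the main output, and $c_t$, the unit’s memory. LSTM uses an architecture with a set of "gates" to control the data flow and thereby overcome the vanishing and exploding gradient problems mentioned above. The LSTM equations, in their standard form, are:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
Here $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication. The matrices $W$ represent the weights of the current state (applied to the current input $x_t$) and the matrices $U$ represent the weights of the previous state (applied to $h_{t-1}$).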
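The following sketch implements one time step of a single-unit LSTM in its standard formulation (input, forget and output gates plus a candidate memory). It is meant only to illustrate how the gates control the memory; the parameter layout (a dict mapping a gate name to an input weight, a recurrent weight and a bias) is an assumption made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM with scalar input.

    p maps each of "i", "f", "o", "c" to a tuple (W, U, b):
    W weights the current input, U weights the previous output.
    Returns the new output h and the new memory c.
    """
    def pre(gate):
        W, U, b = p[gate]
        return W * x + U * h_prev + b

    i = sigmoid(pre("i"))            # input gate
    f = sigmoid(pre("f"))            # forget gate
    o = sigmoid(pre("o"))            # output gate
    c_tilde = math.tanh(pre("c"))    # candidate memory
    c = f * c_prev + i * c_tilde     # keep part of the old memory, add new content
    h = o * math.tanh(c)
    return h, c

# With all-zero parameters every gate equals 0.5 and the candidate is 0,
# so the memory is simply halved at each step.
zero_params = {g: (0.0, 0.0, 0.0) for g in ("i", "f", "o", "c")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
```

Because the memory update is additive and scaled by the forget gate, a forget gate close to 1 lets information (and gradients) persist over many steps.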
2.2 P300 Classification
Many methods have been proposed for identifying the P300 target signal in a BCI task. Blankertz et al. suggest selecting the time intervals with maximal separation between the target and non-target samples, averaging the electric potential within them and using shrinkage LDA to classify these features. A drawback of this method is the low complexity of the LDA model. The winner of the BCI competition III: dataset II used an ensemble of support vector machines (SVM), and other methods include hidden Markov models, k-nearest neighbours, and more.
More recently, given the success of ‘deep’ neural networks, there have been several attempts to apply ‘deep learning’ to BCI-related tasks. Cecotti and Graser were the first to use CNNs for P300 spelling. In their work, they train an ensemble of CNN-based P300 classifiers to identify the existence of a P300. Manor and Geva used a CNN for the RSVP P300 classification task and suggested a new spatio-temporal regularization, which improved performance.
Unlike feed forward network models such as CNN and multi-layer perceptron (MLP), the RNN architecture allows directed cycles within the network, which enables the model to “memorize past events”. LSTM is a type of RNN that includes a special node which can be described as a differentiable memory cell. The specific architecture of LSTM enables it to overcome some of the weaknesses of simple RNNs.
There are several reasons why LSTM is a good candidate for modelling the P300 pattern. First, RNN and LSTM have shown success when modelling time series for tasks such as handwriting and speech recognition [12, 11, 32]. In addition, RNN is known to have the capability to approximate dynamical systems, which makes it a natural candidate for modelling the dynamics of EEG data. Another motivation is that RNNs can be seen as a powerful form of hidden Markov models (HMM), which have been shown to classify EEG successfully [26, 21, 7]; RNNs can be seen as HMMs with an exponentially large state space and an extremely compact parametrization.
Bashivan et al. [2] modeled inter-subject EEG features for identifying cognitive load by using a convolutional LSTM. They created a video from three different band powers in each electrode. The major differences between their work and ours are that we use the original signal without any feature extraction (such as band power), and that we focus specifically on the P300 speller.
We compared the performance of LSTM-based methods with other methods on a dataset from an RSVP P300 speller study . We used the average prediction across 10 trials to measure the P300 speller accuracy, as applied in .
3.1 P300 speller experiment settings
The dataset includes 55 channels of EEG recordings from 11 subjects. Each subject is presented with 10 repetitions of 60 to 70 sets of 30 different letters and symbols. In total there are approximately 20,000 samples for each subject where 1/30 of them are supposed to contain a P300 target. While the original experiment contains 3 different settings (interval of 116ms with/without colors and 83ms with color), we used the experiment setting of 116ms intervals with letters in different colors. For more detail, see .
In addition to the filters applied in , all models that we used share the same pre-processing stage of down-sampling the input from 200 Hz to 25 Hz. As a result, each learning sample is a matrix of 55 channels with 25 time samples each, or $55 \times 25 = 1375$ features. Each sample thus covers exactly 1 second around the target event, at times [-200, 800] ms.
3.2 Formulating the BCI task
In a P300 speller the task is to identify the letter to which a subject paid attention by identifying a P300 pattern in the EEG. This can be done by finding a function $f$ that, when given an EEG sample $x$, returns the probability that a P300 pattern is found in it. By identifying the EEG sample with the maximal score among the EEG samples of all the different letters we can identify the target stimulus (i.e., the letter that the subject focuses on). For the sake of simplicity, the explanation in the following section focuses on identifying a single letter.
3.3 RSVP P300 speller trials
$f$ - P300 identification function
$x_c$ - An EEG sample that was recorded when character $c$ was presented
$C$ - The set of all the available stimuli
$\hat{c}$ - The predicted target stimulus (i.e., the letter the model predicts the subject focused on)
$c^{*}$ - The true target stimulus (i.e., the letter the subject was supposed to focus on)
$T$ - Group of trials
$x_c^t$ - An EEG sample that was recorded when character $c$ was presented on trial $t$
A trial is the presentation of all the characters in an alphabet, one after the other, in a random order. In each trial, only one letter is the actual target stimulus.
The predicted character in a single trial is computed by finding the character with the maximal value of $f$ among the group of all the letters ($C$):
$$\hat{c} = \arg\max_{c \in C} f(x_c)$$
To make character recognition robust, a common approach is to repeat each trial several times and average the predictions for each stimulus. The group of all repetitions of the same trial is called $T$. The EEG recording in which stimulus $c$ was presented in trial $t$ is called $x_c^t$. The formula for predicting the target stimulus from $|T|$ attempts is:
$$\hat{c} = \arg\max_{c \in C} \frac{1}{|T|} \sum_{t \in T} f(x_c^t)$$
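The averaging rule described above can be sketched as follows. The scores are made-up numbers standing in for the classifier output on each EEG sample, and `predict_target` is a hypothetical helper name, not part of the published code.

```python
def predict_target(scores_per_trial):
    """Pick the stimulus with the highest mean P300 score across trials.

    scores_per_trial: list of dicts, one per trial repetition, each mapping
    a stimulus (e.g. a letter) to the classifier score of its EEG sample.
    """
    stimuli = scores_per_trial[0].keys()
    n_trials = len(scores_per_trial)
    mean_score = {c: sum(trial[c] for trial in scores_per_trial) / n_trials
                  for c in stimuli}
    return max(mean_score, key=mean_score.get)

# Three repetitions of a trial over a toy alphabet of three letters.
trials = [
    {"A": 0.2, "B": 0.6, "C": 0.5},
    {"A": 0.1, "B": 0.3, "C": 0.7},  # a noisy repetition that favours "C"
    {"A": 0.3, "B": 0.7, "C": 0.3},
]
single_trial_guess = predict_target(trials[1:2])  # uses the noisy repetition only
averaged_guess = predict_target(trials)           # uses all three repetitions
```

In this toy example the noisy repetition alone selects the wrong letter, while averaging over the repetitions recovers the letter with the consistently high score.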
3.4 Loss function
In a neural network, the weights are updated by differentiating the loss function with respect to the network weights. If we train the neural network to identify a P300 target directly (as in ), the error $E_i$ depends only on whether the sample contains a P300 target. Here $y_i$ refers to the true label of sample $x_i$:
$$E_i = L(f(x_i), y_i)$$
In this research we use the binary log loss function:
$$L(p, y) = -\big(y \log p + (1 - y) \log(1 - p)\big)$$
The error in a neural network is typically calculated on multiple samples, as follows:
$$E = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$$
Here $N$ indicates the size of the batch of samples.
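A minimal sketch of the binary log loss and its batch average is given below. The clipping constant `eps`, which guards against taking the logarithm of zero, is an assumption added for numerical safety and is not described in the text.

```python
import math

def binary_log_loss(p, y, eps=1e-12):
    """Binary log loss for a predicted probability p and a true label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # assumed safeguard against log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def batch_loss(preds, labels):
    """Mean binary log loss over a batch of samples."""
    return sum(binary_log_loss(p, y) for p, y in zip(preds, labels)) / len(preds)

uninformative = batch_loss([0.5, 0.5], [1, 0])  # equals ln(2) for a coin-flip classifier
confident_right = binary_log_loss(0.9, 1)
confident_wrong = binary_log_loss(0.1, 1)
```

The loss grows sharply as a confident prediction turns out to be wrong, which is what drives the weight updates during training.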
3.5 Classification Models
The models evaluated in this experiment are:
CNN (Fig.1(a)) - The CNN model we use is similar to the one used in . The first layer is composed of 10 spatial filters, each of size $N_{ch} \times 1$, where $N_{ch} = 55$ is the number of channels. The second layer contains 13 different temporal filters of size $1 \times 5$. Each of the temporal filters processes 5 consecutive time stamps without overlapping. The third and fourth layers are simple fully connected layers, followed by a single cell with a sigmoid activation function that emits a scalar.
LSTM large/small (Fig.1(b)) - LSTM large and LSTM small are each composed of a single LSTM layer, with 100 and 30 hidden cells respectively. Both models end with a single cell with a sigmoid activation that emits a scalar.
LSTM-CNN large/small (Fig.1(c)) - The model has a CNN as its first layer (the spatial domain layer) and an LSTM as its second layer, for the temporal domain. The first convolutional layer is the same as in the CNN model. Unlike in the CNN model, the temporal layer is an LSTM layer with 100/30 hidden cells. The last layer contains a single cell with a sigmoid activation that emits a scalar.
In order to examine the power of each method in modelling the inter-subject and intra-subject variance we have conducted the following experiments:
1. Training and testing on each subject’s data separately in order to explore intra-subject generalization.
2. Training and testing on the combined data of all subjects in order to investigate the impact of larger amounts of data.
3. Training on all subjects except one. We conduct this experiment in order to explore the value of a model that was trained off-line, on different subjects, and is then used on a new subject, with or without additional calibration.
A highly desirable property of BCI systems is tolerance to a small degree of noise in the stimulus onset time. In order to evaluate the resistance to such noise, we use a model trained on the original stimulus onset (i.e., noise level = 0ms) and evaluate its performance on shifted stimulus onsets: noise levels of -120ms, -80ms, -40ms, +40ms, +80ms, and +120ms. We conducted this experiment using 10-fold cross validation so that the statistical significance of the results could be tested. This last experiment was conducted only on the CNN and LSTM-CNN models and used data from all subjects (as in experiment 2 described above).
For all the experiments, the different models were trained using the RMSProp optimizer for 30 epochs with a learning rate of 0.001 and then trained for a further 30 epochs with a learning rate of 0.00001.
RMSProp is a stochastic gradient descent (SGD) method. Unlike simple SGD, it adapts the learning rate for each parameter separately, using a moving average of past squared gradients to scale the learning rate per feature. We decided to use RMSProp since it is known to be robust and fast [31, 15, 28].
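The per-parameter scaling performed by RMSProp can be illustrated with the commonly used form of its update rule. In this sketch the decay rate `rho=0.9` and the stabilizer `eps=1e-8` are assumed default-style values; only the learning rate of 0.001 comes from the text, and `rmsprop_step` is a hypothetical helper name.

```python
import math

def rmsprop_step(params, grads, cache, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update.

    cache holds a moving average of squared gradients per parameter;
    each gradient is divided by the square root of its own average.
    """
    for k in range(len(params)):
        cache[k] = rho * cache[k] + (1.0 - rho) * grads[k] ** 2
        params[k] -= lr * grads[k] / (math.sqrt(cache[k]) + eps)
    return params, cache

# Gradients of theta0^2 + 100 * theta1^2 at (1, 1): they differ by a factor of 100.
params = [1.0, 1.0]
grads = [2.0, 200.0]
cache = [0.0, 0.0]
params, cache = rmsprop_step(params, grads, cache)
step_sizes = [1.0 - params[0], 1.0 - params[1]]
```

Although the two gradients differ by two orders of magnitude, both parameters move by nearly the same amount, which is the behaviour that makes the method robust to badly scaled gradients.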
3.6 Implementation and Real Time Possible Usage
The code was implemented using the Keras framework. Training was conducted on a 4-core i7 laptop with 16 GB of RAM. Training took 110 seconds for the small LSTM-CNN model and 24 seconds for the CNN model. The difference is due to the parallel nature of the CNN, which allows much of the computation to be carried out concurrently. Predicting on a single example takes about 0.6 milliseconds. In terms of space, both models require less than 70 kB of disk space.
One of the advantages of using “deep learning” models is that they allow compressing knowledge from many samples into a compact form. As we show in our experiments, it is possible to pre-train a model on multiple subjects and then fine-tune it on a specific subject’s calibration data. For example, training on 3000 calibration samples using the 4-core i7 laptop takes less than a minute (fine-tuning for 30 epochs). After the model is trained, using it for real-time prediction is feasible as well, since predicting each sample takes 0.6 milliseconds. The data and code can be found online at https://github.com/Ori226/p300_lstm.
Tab.1 summarizes the results of the different experiments; all results are based on an average of 10 consecutive trials to detect the target letter, as in . The results for training and testing on the same subject (Tab.2) indicate that LSTM is inferior (82%), and even the combined LSTM-CNN model performs worse than the simple LDA method (86% and 93% for the LSTM-CNN models and 96% using LDA). A possible advantage for LSTM only becomes apparent with larger amounts of data, when training and testing on all the subjects together (Tab.1). The large LSTM model performs poorly (77%); we suspect this is due to over-fitting caused by the large number of trainable parameters (62501). This is why we introduced a CNN as a first layer and reduced the number of hidden LSTM cells.
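The figure of 62501 trainable parameters can be reproduced with a short calculation, assuming the large LSTM receives the 55 channel values at each time step and feeds its 100 hidden cells into a single sigmoid output cell. The count for the small model below is computed under the same assumption and is not a number reported in the text.

```python
def lstm_param_count(input_dim, hidden):
    """Standard LSTM: 4 weight sets (3 gates + candidate), each with
    input weights, recurrent weights and a bias per hidden cell."""
    return 4 * (hidden * (input_dim + hidden) + hidden)

def dense_param_count(input_dim, output_dim):
    """Fully connected layer: one weight per connection plus one bias per output."""
    return input_dim * output_dim + output_dim

n_channels = 55
large = lstm_param_count(n_channels, 100) + dense_param_count(100, 1)
small = lstm_param_count(n_channels, 30) + dense_param_count(30, 1)
```

The recurrent weights grow quadratically with the number of hidden cells, so reducing the layer from 100 to 30 cells cuts the parameter count by roughly a factor of six.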
Tab.2 summarizes the results per single subject. When comparing the accuracy of each subject separately, we can see that there are significant differences among subjects, across the different models. For example, subject fat yields higher accuracy than subject icn regardless of the tested model. Finally, the best network method, which trains on other subjects and recalibrates a combined CNN-LSTM large model, is able to boost the result of the worst subject to 86%.
| subject | LDA | LSTM large | LSTM-CNN large | CNN | LSTM small | LSTM-CNN small |
In the second stage, we continue training the model on the rest of the test subject’s data using a smaller learning rate (0.0001 using RMSProp) for 30 epochs. The second training stage results are presented in the columns “CNN all except one fine tune” and “LSTM-CNN all except one fine tune”. The results indicate that, as in the other cross-subject evaluation, the LDA accuracy is much poorer than that of the CNN and LSTM-CNN models (65% as opposed to 84%). When we allow calibrating the model for each subject, we achieve an average accuracy of 97% for both CNN and LSTM-CNN.
Resistance to temporal noise is displayed in Tab.4. LSTM-CNN shows a significant advantage over both LDA and CNN when testing with a stimulus onset different from the one used for training. LSTM-CNN-small achieves an accuracy higher by 3% and 6% when adding or removing 40ms to the original stimulus onset, and a t-test indicates that the difference between each pair of groups is statistically significant (significant differences are marked in bold). LDA accuracy falls by more than 20% when facing temporal noise.
A possible explanation can be seen when looking at the two models’ saliency maps (Fig.3). In order to investigate the “attention”, or the sensitivity, of the LSTM model and compare it to the CNN model, we used a technique suggested by  and plotted the absolute gradient of the neural network with respect to the input.
If $f$ is a differentiable, scalar-valued function, its gradient $\nabla f$ is the vector whose components are the partial derivatives of $f$; $\nabla f$ is therefore a vector-valued function. In our case $f_\theta$ is the neural network with fixed weights $\theta$ and input $x$. The partial derivatives of $f_\theta$ with respect to $x$ can be interpreted as “how changing each value of $x$ will change the prediction score”. This gradient should not be confused with the gradient used for training, where the goal is to optimize the model parameters $\theta$ when $x$ is fixed.
In the case of P300 prediction, $x$ is a matrix of size $N_{ch} \times N_t$ ($N_{ch}$ - number of channels, $N_t$ - number of time steps) and $f_\theta$ is the neural network, where $\theta$ denotes the model’s weights after training. The gradient (see Eq.8) is a matrix with the same size as the input $x$, where the amplitude of each cell reflects its impact on the function value. Cells with high absolute value can be interpreted as the cells that have a significant influence on the prediction function.
$$\nabla_x f_\theta(x) = \left[\frac{\partial f_\theta(x)}{\partial x_{i,j}}\right]_{i = 1,\dots,N_{ch};\; j = 1,\dots,N_t} \qquad (8)$$
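The idea of an input-gradient saliency map can be sketched without a deep learning framework by approximating the partial derivatives with central finite differences. This is a deliberate substitution for illustration: a trained network would normally provide these gradients through backpropagation. The `toy_model` below is a hypothetical stand-in for the trained classifier, and all weights and inputs are made up.

```python
import math

def toy_model(x, w):
    """Stand-in for a trained network: sigmoid of a weighted sum over a
    channels x time input matrix (weights w are fixed)."""
    z = sum(w[i][j] * x[i][j] for i in range(len(x)) for j in range(len(x[0])))
    return 1.0 / (1.0 + math.exp(-z))

def saliency_map(f, x, h=1e-5):
    """Absolute central finite-difference gradient of f for each input cell."""
    sal = [[0.0] * len(x[0]) for _ in x]
    for i in range(len(x)):
        for j in range(len(x[0])):
            original = x[i][j]
            x[i][j] = original + h
            up = f(x)
            x[i][j] = original - h
            down = f(x)
            x[i][j] = original  # restore the input before moving on
            sal[i][j] = abs(up - down) / (2 * h)
    return sal

# 2 channels x 3 time steps; only the middle time step influences the output.
w = [[0.0, 2.0, 0.0], [0.0, -1.0, 0.0]]
x = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
sal = saliency_map(lambda inp: toy_model(inp, w), x)
```

The resulting matrix has the same shape as the input, and its large entries mark the cells whose perturbation changes the prediction score the most; taking the absolute value means that both positive and negative influences count as salient.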
The results displayed in Fig.2(a) and Fig.2(b) show the average absolute gradient across all the target samples in the test data of a single cross-validation fold: the warm colors correspond to high gradient values, indicating that the model is more sensitive to changes in this input feature. We can see that the sensitivity of the CNN model is spread relatively evenly across the recording, whereas that of the LSTM-CNN model is focused around the 250ms and 450ms time stamps.
5 Discussion and Future Direction
In this work we examined the use of LSTM neural networks for the P300 speller BCI task. Despite its temporal nature, no version of LSTM investigated in this work has shown a significant advantage compared to the CNN model suggested by . We did see LSTM results improve with large amounts of data from multiple subjects, and superior results with a combined CNN-LSTM model; moreover, we have shown that this combined model is significantly more robust to temporal noise in the stimulus onset. We also show that the sensitivity of the LSTM-based model is much more focused on the interval between 250ms and 450ms than that of the CNN-based model. This sensitivity is consistent with what we know about the P300 ERP (a peak around 300ms after the stimulus onset). We thus believe that the narrower region of sensitivity explains the robustness of the LSTM model to noise in the time domain, since it is less sensitive to the data outside the P300 phenomenon.
In our research we have used one of the largest EEG datasets for supervised machine learning, with approximately 20,000 labelled samples per subject. Nevertheless, for a single subject we do not see any advantage in using “deep learning” over simple linear methods such as LDA. This is in contrast to other reports; e.g., Bashivan et al. [2] report an improvement of 8.9% to 15.3% (reduction of error) on a working memory dataset with 2670 labelled samples. The benefits of “deep learning” models can be seen in transfer learning among subjects. We did not find evidence that RNNs are superior to CNNs in classifying EEG patterns, although the LSTM model was more robust to noise. Each sample in our experiment had a fixed length, which allowed us to use feed forward models such as CNN. Future work on RNNs may investigate sequence-to-sequence training with variable-length samples (such as suggested by ), where RNNs have an advantage over feed forward models.
- [1] Laura Acqualagna and Benjamin Blankertz. Gaze-independent BCI-spelling using rapid serial visual presentation (RSVP). Clinical Neurophysiology, 124:901–908, 2013.
- [2] Pouya Bashivan et al. Learning representations from EEG with deep recurrent-convolutional neural networks. arXiv:1511.06448, 2015.
- [3] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
- [4] Benjamin Blankertz et al. Single-trial analysis and classification of ERP components - a tutorial. NeuroImage, 50:814–825, 2011.
- [5] Hubert Cecotti and Axel Graser. Convolutional neural networks for P300 detection with application to brain-computer interfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:433–445, 2011.
- [6] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
- [7] Febo Cincotti, A Scipione, A Timperi, D Mattia, AG Marciani, J Millan, S Salinari, L Bianchi, and F Babiloni. Comparison of different feature classifiers for brain computer interfaces. pages 645–647, 2003.
- [8] PR Davidson, RD Jones, and MTR Peiris. Detecting behavioural microsleeps using EEG and LSTM recurrent neural networks. 27:5754–5757, 2005.
- [9] Lawrence Ashley Farwell and Emanuel Donchin. Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials. Electroencephalography and Clinical Neurophysiology, 70:510–523, 1988.
- [10] Alex Graves. Supervised sequence labelling. pages 5–13, 2012.
- [11] Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, and Santiago Fernández. Unconstrained on-line handwriting recognition with recurrent neural networks. pages 577–584, 2008.
- [12] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. pages 6645–6649, 2013.
- [13] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
- [14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [15] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. pages 3128–3137, 2015.
- [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. pages 1097–1105, 2012.
- [17] Yann LeCun et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
- [18] Xiao-Dong Li, John KL Ho, and Tommy WS Chow. Approximation of dynamical time-variant systems by continuous-time recurrent neural networks. IEEE Transactions on Circuits and Systems II: Express Briefs, 52(10):656–660, 2005.
- [19] Fabien Lotte, Marco Congedo, Anatole Lécuyer, Fabrice Lamarche, and Bruno Arnaldi. A review of classification algorithms for EEG-based brain–computer interfaces. Journal of Neural Engineering, 4(2):R1, 2007.
- [20] Ran Manor and Amir B. Geva. Convolutional neural network for multi-category rapid serial visual presentation BCI. Frontiers in Computational Neuroscience, 9, 2015.
- [21] Bernhard Obermaier, Christoph Guger, Christa Neuper, and Gert Pfurtscheller. Hidden Markov models for online classification of single trial EEG data. Pattern Recognition Letters, 22(12):1299–1309, 2001.
- [22] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. ICML (3), 28:1310–1318, 2013.
- [23] Alain Rakotomamonjy and Vincent Guigue. BCI competition III: dataset II - ensemble of SVMs for BCI P300 speller. IEEE Transactions on Biomedical Engineering, 55(3):1147–1154, 2008.
- [24] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, DTIC Document, 1985.
- [25] Mohammad Soleymani, Sadjad Asghari-Esfeden, Maja Pantic, and Yun Fu. Continuous emotion detection using EEG signals and facial expressions. pages 1–6, 2014.
- [26] Soroosh Solhjoo, Ali Motie Nasrabadi, and Mohammad Reza Hashemi Golpayegani. Classification of chaotic signals using HMM classifiers: EEG-based mental task classification. pages 1–4, 2005.
- [27] Ilya Sutskever, Geoffrey E Hinton, and Graham W Taylor. The recurrent temporal restricted Boltzmann machine. pages 1601–1608, 2009.
- [28] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
- [29] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 2012.
- [30] Paul J Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.
- [31] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 14, pages 77–81, 2015.
- [32] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. pages 4694–4702, 2015.