After revolutionizing fields like computer vision, deep learning methods have recently also been used to improve classification in applications based on brain-computer interfaces (BCIs). A deep belief network model was used to distinguish motor imagery tasks, outperforming support vector machines (SVM), or to extract features of EEG signals. Other approaches to decoding EEG data used, for example, deep convolutional neural networks (deep ConvNets) for feature extraction and visualization, or built a recurrent convolutional neural network architecture to model cognitive events from EEG data, applying multi-dimensional features. Likewise, for intracranial EEG data, deep neural networks supported the classification of epileptic signals [7, 8, 9].
However, the performance of deep learning methods depends strongly on the amount of available data. Furthermore, the different methods are mostly restricted to certain conditions regarding the design of the data. Assumptions such as equal underlying distributions or feature spaces may hold in classical image recognition tasks, but are mostly not satisfied in real-world applications based on human brain signals. Intra- and inter-individual variability leads to conditions in which the performance of exactly the same classifier changes from day to day. Even quite similar tasks can differ considerably in how well their classes can be distinguished. In fields such as computer vision, deep learning methods have been enhanced by approaches for transfer learning [10, 11], especially when only a small amount of data is available to train a network. Models pretrained on extensive databases laid the foundation for significant improvements, for example in object categorization or image segmentation [13, 14, 15]. The networks seem to learn the fundamental structure of the training data and to utilize this information for classification in other, similar sets. Real-life applications depend on smooth and fast handling; long training periods are therefore undesirable, and collecting substantial amounts of real-time data exceeds what is practical in application.
Recently, transfer learning techniques have found their way into the context of BCI implementations. Different approaches have been applied, e.g. to transfer between different types of error-related potentials using a linear support vector machine, or to deal with deviations in latencies or signal variations in brain-controlled interfaces, based on linear discriminant analysis (LDA). Implementations based on deep ConvNets have already generalized across subjects on non-invasive error-related recordings without fine-tuning the network again. However, such techniques are still rarely used, and transfer learning across different error decoding tasks for intracranial human brain data in combination with deep ConvNets has not yet been investigated.
In this paper, we determine the impact of transfer learning in intracranial brain recordings across two different error tasks. The paradigms differ slightly in the way they elicit errors, but both target the same reaction: perceiving self-caused mistakes in instructed tasks. The classification performances are analyzed separately for both data sets and compared to those obtained by algorithms that include transfer learning. We show that under conditions with few available data, pre-training and subsequent transfer can improve decoding in error-related classification tasks.
II System and Experimental Design
Two different paradigms were designed to generate a considerable set of intracranial EEG recordings in which error-related brain activity is accessible. In the first paradigm, a flanker task was intended to elicit error signals under strictly experimental conditions, while in the second paradigm a car game-like environment simulated a more real-life situation. Each participant took part in both experimental paradigms.
II-A Eriksen flanker task (EFT)
This task was designed according to . The participants had to respond to the middle character (either R or L) of a set of letters, acting under time pressure. Audiovisual feedback was given according to a right or a wrong button press, or a reaction slower than a predefined time limit (see figure 1 A). The time limit was set individually to the mean reaction time of a training phase. For details see . An error was defined as a wrong button press, while a right button press was counted as correct.
II-B Car driving task (CDT)
The second paradigm consisted of a car driving task in which participants were instructed to stay on a road while avoiding certain obstacles (e.g. bombs), which were punished with a negative score, and collecting beneficial objects (e.g. coins), which were rewarded with plus points (see figure 1 B). As the speed of the game was fixed, the participant’s goal was to achieve the highest possible score when reaching the finish line. In this task, an error event was registered when an obstacle was hit; when a beneficial object was hit, the event was declared correct.
III Pre-processing, Decoding & Statistics
The data were obtained from intracranial recordings collected in experiments with 15 patients suffering from epilepsy, who gave their informed consent. Using unique trigger pulses generated during each experiment, the acquired data were aligned to the event-related meta-information. The aligned data were re-sampled to and re-referenced to a common average; subsequently, an electrode-wise exponential moving standardization with decay factor was applied. The data were cut into trials and divided into test and training sets according to the specific decoding intervals.
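The two signal-level steps named above, common average re-referencing and electrode-wise exponential moving standardization, can be sketched as follows. This is a minimal illustration in plain Python and not the authors' pipeline: the function names are hypothetical, the recursive update rule is one common form of exponentially weighted standardization, and the value `decay=0.999` is only a placeholder, since the decay factor is not given in the text.

```python
import math


def common_average_reference(data):
    """Subtract the across-channel mean from every channel at each sample.

    data: list of channels, each a list of samples of equal length.
    """
    n_channels = len(data)
    n_samples = len(data[0])
    means = [sum(ch[t] for ch in data) / n_channels for t in range(n_samples)]
    return [[ch[t] - means[t] for t in range(n_samples)] for ch in data]


def exponential_moving_standardize(channel, decay=0.999, eps=1e-4):
    """Standardize one channel with exponentially weighted running statistics.

    decay is an assumed placeholder value; eps guards against division by
    a vanishing standard deviation.
    """
    mean = channel[0]
    var = 0.0
    out = []
    for x in channel:
        # Update the running mean and variance, then standardize the sample.
        mean = decay * mean + (1.0 - decay) * x
        var = decay * var + (1.0 - decay) * (x - mean) ** 2
        out.append((x - mean) / max(math.sqrt(var), eps))
    return out


# Usage on a tiny, made-up two-channel recording.
recording = [[1.0, 2.0, 4.0], [3.0, 4.0, 0.0]]
referenced = common_average_reference(recording)
standardized = [exponential_moving_standardize(ch) for ch in referenced]
```

Standardizing each electrode separately keeps channels with very different amplitudes on a comparable scale before they enter the network.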
Our decoding architecture made use of the open-source deep learning toolbox braindecode for raw time-domain EEG, using the deep ConvNet model Deep4Net (https://robintibor.github.io/braindecode/source/braindecode.models.html) with a stride of 2. The model comprised four convolution-max-pooling blocks. The first block performed a temporal convolution followed by a spatial convolution over all channels and a max pooling; it was followed by three conventional convolution-max-pooling blocks. Finally, a dense softmax classification layer delivered the output. The network used batch normalization and dropout, and exponential linear units (ELUs) served as activation functions. The backward computation of the gradients was based on the categorical cross-entropy loss, which was optimized using Adam. Further details of the basic implementation and the design decisions for the network are discussed in .
A random permutation test was applied to determine significance per participant. A vector consisting of the true distribution of class labels was compared to vectors of randomly shuffled labels to generate a realistic distribution of possible classification outcomes. The numbers of trials per class were highly unbalanced for all participants. To overcome this problem when creating batches during training, a class-balanced batch iterator drew the samples for each batch in inverse proportion to the class frequencies of the actual trials. For the significance test, we addressed the imbalance by computing the label matches per vector separately for each class, then averaging over classes and comparing the outcome to the decoded accuracy to estimate the p-value relative to the underlying distribution. Significance was tested for each participant and each set of decoding parameters. Single sets exceeding a value of were disregarded in further analysis and did not contribute to the final results. The significance of the group differences in figure 4 was determined on the level of trials, using a sign test.
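The class-averaged permutation test described above can be illustrated with a short sketch. It is an assumption-laden example rather than the original analysis code: the function names, the number of permutations (`n_permutations=1000`) and the "+1" correction in the p-value estimate are choices made here for illustration, as the text does not specify them.

```python
import random


def balanced_match(true_labels, other_labels):
    """Fraction of matching labels, computed per class and averaged over classes."""
    classes = sorted(set(true_labels))
    per_class = []
    for c in classes:
        idx = [i for i, y in enumerate(true_labels) if y == c]
        hits = sum(1 for i in idx if other_labels[i] == c)
        per_class.append(hits / len(idx))
    return sum(per_class) / len(per_class)


def permutation_p_value(true_labels, decoded_accuracy, n_permutations=1000, seed=0):
    """Estimate how often shuffled labels match at least as well as the decoder."""
    rng = random.Random(seed)
    shuffled = list(true_labels)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if balanced_match(true_labels, shuffled) >= decoded_accuracy:
            exceed += 1
    # The +1 terms keep the estimate strictly positive (assumed convention).
    return (exceed + 1) / (n_permutations + 1)


# Usage with made-up, strongly unbalanced labels (90 correct, 10 error trials).
labels = [0] * 90 + [1] * 10
p = permutation_p_value(labels, decoded_accuracy=0.75)
```

Averaging the matches over classes puts the chance level of the shuffled vectors near 50 % even when one class dominates, which is why a plain overall match rate would be misleading here.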
IV Decodability of Error-related Signals
For the two data sets, we used the deep ConvNet model Deep4Net to determine the two-class decodability of perceived erroneous/correct events in intracranial human brain recordings. Here and in the following, the available data were split into two sets with a proportion of for training and for testing. For each participant, the decoding accuracy was calculated for different intuitive decoding intervals, defined relative to the occurrence of an event. Figure 2 compares the single accuracies for the different intervals (blue symbols) between the two data sets and additionally depicts the median accuracy over all participants per interval (filled red symbols). This illustration only includes participants who showed significant classification results for both paradigms.
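Cutting a trial for a given decoding interval and splitting the trials into training and test sets might look like the sketch below. All names are hypothetical, and the sampling rate of 100 Hz, the training fraction of 0.8 and the order-preserving split are placeholder assumptions, because the actual values and the splitting strategy are not stated in the text.

```python
def cut_trial(signal, event_sample, sfreq, t_start, t_stop):
    """Return the samples of one channel between t_start and t_stop (seconds),
    measured relative to the event at index event_sample."""
    start = event_sample + int(round(t_start * sfreq))
    stop = event_sample + int(round(t_stop * sfreq))
    if start < 0 or stop > len(signal):
        raise ValueError("decoding interval exceeds the recording")
    return signal[start:stop]


def split_trials(trials, train_fraction):
    """Split trials into a training and a test part, preserving their order."""
    n_train = int(len(trials) * train_fraction)
    return trials[:n_train], trials[n_train:]


# Usage: the interval from -0.5 s to 1 s around an event at sample 500,
# for a made-up channel sampled at an assumed 100 Hz.
channel = [float(i) for i in range(1000)]
trial = cut_trial(channel, event_sample=500, sfreq=100.0, t_start=-0.5, t_stop=1.0)
train_set, test_set = split_trials(list(range(10)), train_fraction=0.8)
```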
The classification yielded median performances of for the car driving task and for the Eriksen flanker task using the decoding interval , and for the interval and finally and for the interval . For both tasks, the interval outperformed the others and was therefore used predominantly for the later implementations to transfer learned features. Table I gives an overview of decoding on various intervals and different numbers of training epochs.
| epochs | -0.5 to 1 s | -0.25 to 0.75 s | 0 to 1 s |
|---|---|---|---|
| 10 | (72.05 ± 3.28) % | (67.63 ± 3.12) % | (71.27 ± 6.29) % |
| 50 | (72.91 ± 2.15) % | (73.36 ± 5.00) % | (73.43 ± 1.87) % |
| 200 | (75.44 ± 3.10) % | (76.94 ± 2.17) % | (74.01 ± 1.99) % |

| epochs | -0.5 to 1 s | -0.25 to 0.75 s | 0 to 1 s |
|---|---|---|---|
| 10 | (73.71 ± 4.16) % | (82.05 ± 7.85) % | (72.95 ± 6.43) % |
| 50 | (70.33 ± 5.60) % | (78.47 ± 5.96) % | (70.33 ± 5.60) % |
| 200 | (81.16 ± 11.10) % | (81.50 ± 9.49) % | (73.56 ± 9.49) % |
V Decoding Transfer Across Paradigms
V-A Responses in the frequency domain
We investigated the single data sets in the frequency domain, using a multitaper method to estimate the power spectral density. Visual inspection and comparison of time-frequency spectra for identical electrodes but different tasks revealed obvious similarities for several electrodes. Figure 3A depicts one example where the resemblance is unambiguous, showing the response for error vs. correct in electrode I2, located in the right insular cortex. The dotted line marks the onset of the event. Nevertheless, other electrodes did not show any effects, or showed strong effects for only one of the tasks, as illustrated in figure 3C. A global overview of all electrodes for this exemplary participant is given in figure 3B. The blue and green markers refer to the electrodes selected for figures 3A and 3C.
Moreover, we analyzed the behaviour of frequency-band power time series of significant channels. Decreases and increases of the power were tagged for both paradigms, CDT and EFT, and compared with each other to estimate similarities in the temporal development of the frequency power. Figure 3D illustrates the outcome of this analysis, dividing the figure into four conditions of overlapping tags for the two paradigms. The color code in each subplot refers to the number of significant channels exhibiting the specific tag indicated by the subplot title, normalized to the number of channels per participant and to the maximal value of significant channels. The individual color values are broken down by frequency band and participant. A significant decrease for both paradigms as well as a significant increase in the lower frequency bands () can be seen in the data for most of the participants. However, for all participants an increase in the gamma band is prominent, covering the bands from to . For some of the participants this effect is present in more channels than for others, depending on the specific implantation and the adjacent brain area.
V-B Compilation of different transfer techniques
Initially, we examined the output of three different transfer techniques, choosing a small number of post-training epochs compared to the number of epochs () in the pre-training with the first data set. The results are compiled in table II, which shows the median accuracy over the participants. Errors were estimated as the interquartile range of the bootstrapped samples per interval and technique. For each of the three implementations, the network was pre-trained on a given data set , while a then unknown set was used for testing or fine-tuning, respectively. All data were processed such that the feature space remained the same for the two sets; an adjustment of the input layer was therefore not necessary.
fine-tuning on (network pre-trained on )

| layers | epochs | -0.5 to 1 s | -0.25 to 0.75 s | 0 to 1 s |
|---|---|---|---|---|
| all | 0 | (50.46 ± 1.11) % | (49.28 ± 0.65) % | (48.82 ± 2.04) % |
| all | 10 | (67.53 ± 1.39) % | (66.84 ± 10.23) % | (69.78 ± 3.32) % |
| last | 50 | (61.32 ± 2.87) % | (62.98 ± 1.50) % | (63.01 ± 5.82) % |

fine-tuning on (network pre-trained on )

| layers | epochs | -0.5 to 1 s | -0.25 to 0.75 s | 0 to 1 s |
|---|---|---|---|---|
| all | 0 | (53.98 ± 4.90) % | (57.46 ± 6.70) % | (54.19 ± 3.56) % |
| all | 10 | (73.44 ± 7.94) % | (72.12 ± 7.92) % | (76.83 ± 13.47) % |
| last | 50 | (66.67 ± 4.02) % | (68.89 ± 2.86) % | (59.48 ± 6.64) % |
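The error estimates in table II are described as the interquartile range of bootstrapped samples. One plausible reading, sketched below, resamples the per-participant accuracies with replacement, takes the median of each resample, and reports the interquartile range of those medians. Whether the authors bootstrapped exactly this statistic is an assumption, as are the function name, the number of resamples and the made-up accuracies in the usage example.

```python
import random
import statistics


def bootstrap_median_iqr(accuracies, n_bootstrap=2000, seed=0):
    """Return the median accuracy and the IQR of bootstrapped medians."""
    rng = random.Random(seed)
    n = len(accuracies)
    medians = []
    for _ in range(n_bootstrap):
        # Resample participants with replacement and store the median.
        sample = [accuracies[rng.randrange(n)] for _ in range(n)]
        medians.append(statistics.median(sample))
    q1, _, q3 = statistics.quantiles(medians, n=4)
    return statistics.median(accuracies), q3 - q1


# Usage with hypothetical per-participant accuracies (not taken from the paper).
median_acc, iqr = bootstrap_median_iqr([0.62, 0.71, 0.68, 0.80, 0.66])
```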
The first approach consisted of pre-training and subsequent classification on the second, unseen set based on the pre-trained weights without fine-tuning. Generalizing from EFT to CDT, the deep ConvNet was not able to predict the true classes and performed around chance level. For the transfer from the CDT to the EFT data set, accuracies were slightly better, exceeding chance level and showing a peak performance of for the interval .
Secondly, the pre-trained network was fine-tuned by training on a then unknown data set for epochs with a smaller learning rate. Here the network indeed learned informative features and obtained accuracies around for both paradigms. However, a comparison with the performances given in table I indicates that pre-training brings no improvement. On the contrary, the accuracies do not reach the values obtained by training directly on the classification data set for epochs.
The third implementation was inspired by techniques from computer vision, where networks are pre-trained on a huge training set and only the last few layers are trained again on a smaller set of similar data to fine-tune the weights in the deeper layers. We adopted this idea, froze all layers after pre-training and adjusted only the weights of the last classification layer. In both data sets, this yielded accuracies of and higher, which however did not reach the values obtained when fine-tuning the whole network, even though the latter used fewer epochs.
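The three transfer variants compared above (direct application, fine-tuning all layers with a smaller learning rate, and fine-tuning only the classification layer) can be illustrated on a deliberately tiny model. The sketch uses a two-layer network in plain Python instead of Deep4Net; the model, the toy data, the learning rates and the number of pre-training epochs are all assumptions for illustration, and only the fine-tuning epochs (0, 10, 50) follow table II.

```python
import copy
import math
import random


def init_model(n_inputs, n_hidden, seed=0):
    """Toy network: a tanh feature layer followed by a logistic classifier."""
    rng = random.Random(seed)
    return {
        "features": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "classifier": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "bias": 0.0,
    }


def forward(model, x):
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
              for row in model["features"]]
    z = sum(v * h for v, h in zip(model["classifier"], hidden)) + model["bias"]
    return hidden, 1.0 / (1.0 + math.exp(-z))


def train(model, data, epochs, lr, freeze_features=False):
    """Plain SGD on the cross-entropy loss; returns a trained copy of model."""
    model = copy.deepcopy(model)
    for _ in range(epochs):
        for x, y in data:
            hidden, p = forward(model, x)
            err = p - y  # gradient of the loss with respect to the logit
            if not freeze_features:
                for j, row in enumerate(model["features"]):
                    g = err * model["classifier"][j] * (1.0 - hidden[j] ** 2)
                    for i, xi in enumerate(x):
                        row[i] -= lr * g * xi
            for j, h in enumerate(hidden):
                model["classifier"][j] -= lr * err * h
            model["bias"] -= lr * err
    return model


# Hypothetical toy data standing in for the two paradigms.
source = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
target = [([0.9, 0.1], 1), ([0.2, 0.8], 0)]

pretrained = train(init_model(2, 3), source, epochs=200, lr=0.1)
direct = train(pretrained, target, epochs=0, lr=0.01)        # 1) no fine-tuning
fine_tuned = train(pretrained, target, epochs=10, lr=0.01)   # 2) all layers
last_only = train(pretrained, target, epochs=50, lr=0.01,
                  freeze_features=True)                      # 3) last layer only
```

In the third variant the feature layer keeps its pre-trained weights exactly, so any change in accuracy on the target set comes from the classification layer alone.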
V-C Performance dependency on the amount of data
Again, the network was pre-trained on a given data set to initialize the weights. To compare conditions with only few data to situations where more data are available, we took the second data set and gradually reduced the amount of data used for fine-tuning from to of the available training data ( of the entire data), once more employing a smaller learning rate than in the pre-training. Median accuracies and the underlying distributions are presented in figure 4A (top) for and figure 4B (top) for . The boxes depict the interquartile range, the whiskers extend to the most extreme data points, and outliers are drawn as asterisks. The plots at the bottom of each subfigure show the distribution of the intra-participant difference between the two compared decoding accuracies. For example, to obtain the values for figure 4A we calculated the difference for each participant, where corresponds to the accuracy gained with pre-training while corresponds to the accuracy achieved without pre-training. For decoding on , figure 4A, there is no substantial difference between the two conditions. Even with less data for the final training, pre-training cannot enhance the performance. In contrast, in figure 4B the pre-training on has the effect that the advantage of pre-training gradually increases as the amount of data decreases, exhibiting significant differences of median accuracies up to for a fraction of of the training data. The distribution of the intra-participant differences for the smaller amounts of data confirms this trend. Median accuracies obtained with pre-training are consistently better than in cases where only the training on the unseen set was performed. Due to the relatively small number of participants, significance was tested on the level of single trials.
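The intra-participant differences and a trial-level sign test can be sketched as follows. The example implements the standard exact two-sided sign test; how the authors paired the trials is not described, so the pairing of per-trial outcomes with and without pre-training, the function names and the made-up outcomes below are assumptions.

```python
from math import comb


def paired_differences(with_pretraining, without_pretraining):
    """Element-wise difference: value with pre-training minus value without."""
    return [a - b for a, b in zip(with_pretraining, without_pretraining)]


def sign_test(differences):
    """Exact two-sided sign test; zero differences are discarded."""
    pos = sum(1 for d in differences if d > 0)
    neg = sum(1 for d in differences if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # Probability of k or fewer minority signs under a fair coin, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Usage with hypothetical per-trial outcomes (1 = decoded correctly, 0 = not).
with_pre = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
without_pre = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
p_value = sign_test(paired_differences(with_pre, without_pre))
```

Only trials on which the two conditions disagree contribute to the test, which makes it usable when the number of participants is too small for a group-level comparison.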
A final comparison tested whether the difference between the two cases originates from a transfer of more general features of the brain signals rather than of the true underlying conditions. To this end, the performed transfer was contrasted with the decoding results obtained when the network was pre-trained on with randomly shuffled labels and then fine-tuned on . In this case, the network was not able to learn the features of the two conditions during pre-training. Indeed, the results show that decoding with unshuffled labels during pre-training performs clearly better for decreasing amounts of data, as illustrated in figure 4C. The lower plot again shows the distribution of the intra-participant difference, where the values were determined by . Here, too, the differences for the smaller amounts of data exhibit positive median values and distributions mainly above zero.
VI Conclusion & Outlook
In this paper, two different issues were analyzed. First, we examined the decodability of error-related signals in the underlying intracranial brain recordings. This was tested for two paradigms, differing in their closeness to real-life application. Error decoding has been investigated several times using EEG data, e.g. when observing and controlling robots [31, 32, 33], or in real interaction simulations, but not yet on the basis of intracranial recordings. Here, we obtained accuracies up to for the CDT and for the EFT. These quite high performances support the use of these data for transfer approaches. However, the high errors indicate non-negligible variation in the results, which should therefore be treated with caution. Different patients had different implantations, which in turn covered different brain areas. Thus, it cannot be excluded that the data sets contained more or less informative channels, necessarily leading to diverse decoding performances. Because of the different implantations, we refrained from an inter-subject transfer.
The second aspect concerned the similarity of the data sets obtained with the different paradigms and their transferability. Time-frequency spectra of the same channels revealed striking similarities for some of the channels. More precise examinations of frequency-band dependent time series of the power spectral density uncovered an extensive increase of significant channels in the gamma band between and , as already indicated in . Likewise, the results indicate a similarity in the characteristics of the data for the two distinct paradigms.
A comparison of several transfer approaches using the full amount of data but a lower number of epochs did not lead to an improvement of the decoding. When the network was trained exclusively and directly on the target data set, higher accuracies were obtained than with pre-training. As already shown by  on EEG data, a direct transfer without further fine-tuning did not succeed.
In many cases, acquiring intracranial data is hardly possible and the acquired data sets are often not extensive. In this paper, we demonstrated a significant improvement of decoding for decreasing amounts of data when the network is pre-trained on a similar set. Interchanging the two data sets led to no improvement, which might be explained by the fact that in this case the pre-training was performed on the set comprising only few trials, so that the general characteristics of the conditions were possibly hardly learnable. This raises the question whether, for a transfer, the ratio of the amounts of data used for pre- and post-training plays a determining role for the applicability of this technique. Certainly, a degree of similarity between the data sets is required, also with respect to the manifestation of the two conditions, which could be shown here by randomization of labels.
Several interesting questions and approaches can be derived from these results. For example, a network might be trained on an extensive set of non-invasive data to learn problem-specific characteristics and subsequently be fine-tuned on a small intracranial data set. Here, a change of network architecture could make a transfer possible for data in different feature spaces. Likewise, data augmentation can help to improve classification in rather small data sets.
The authors would like to thank everyone involved in creating and processing the data sets, but especially the patients for their conscientious participation.
- [1] F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, and B. Arnaldi, “A review of classification algorithms for EEG-based brain–computer interfaces,” Journal of neural engineering, vol. 4, no. 2, p. R1, 2007.
- [2] X. An, D. Kuang, X. Guo, Y. Zhao, and L. He, “A deep learning method for classification of EEG data based on motor imagery,” in International Conference on Intelligent Computing. Springer, 2014, pp. 203–210.
- [3] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM transactions on intelligent systems and technology (TIST), vol. 2, no. 3, p. 27, 2011.
- [4] Y. Ren and Y. Wu, “Convolutional deep belief networks for feature extraction of EEG signal,” in Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014, pp. 2850–2853.
- [5] K. G. Hartmann, R. T. Schirrmeister, and T. Ball, “Hierarchical internal representation of spectral features in deep convolutional networks trained for EEG decoding,” in Brain-Computer Interface (BCI), 2018 6th International Conference on. IEEE, 2018, pp. 1–6.
- [6] P. Bashivan, I. Rish, M. Yeasin, and N. Codella, “Learning representations from EEG with deep recurrent-convolutional neural networks,” arXiv preprint arXiv:1511.06448, 2015.
- [7] D. Ahmedt-Aristizabal, C. Fookes, K. Nguyen, and S. Sridharan, “Deep classification of epileptic signals,” arXiv preprint arXiv:1801.03610, 2018.
- [8] A. Antoniades, L. Spyrou, D. Martin-Lopez, A. Valentin, G. Alarcon, S. Sanei, and C. C. Took, “Detection of interictal discharges with convolutional neural networks using discrete ordered multichannel intracranial EEG,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 12, pp. 2285–2294, 2017.
- [9] M.-P. Hosseini, D. Pompili, K. Elisevich, and H. Soltanian-Zadeh, “Optimized deep learning for EEG big data and seizure prediction BCI via internet of things,” IEEE Transactions on Big Data, vol. 3, no. 4, pp. 392–404, 2017.
- [10] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
- [11] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers, “Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1285–1298, 2016.
- [12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
- [13] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 1, pp. 142–158, 2016.
- [14] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in European Conference on Computer Vision. Springer, 2014, pp. 346–361.
- [15] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
- [16] V. Jayaram, M. Alamgir, Y. Altun, B. Scholkopf, and M. Grosse-Wentrup, “Transfer learning in brain-computer interfaces,” IEEE Computational Intelligence Magazine, vol. 11, no. 1, pp. 20–31, 2016.
- [17] S. K. Kim and E. A. Kirchner, “Handling few training data: classifier transfer between different types of error-related potentials,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 24, no. 3, pp. 320–332, 2016.
- [18] I. Iturrate, R. Chavarriaga, L. Montesano, J. Minguez, and J. Millán, “Latency correction of event-related potentials between different experimental protocols,” Journal of neural engineering, vol. 11, no. 3, p. 036005, 2014.
- [19] I. Iturrate, L. Montesano, and J. Minguez, “Task-dependent signal variations in EEG error-related potentials for brain–computer interfaces,” Journal of neural engineering, vol. 10, no. 2, p. 026024, 2013.
- [20] B. Blankertz, S. Lemm, M. Treder, S. Haufe, and K.-R. Müller, “Single-trial analysis and classification of ERP components—a tutorial,” NeuroImage, vol. 56, no. 2, pp. 814–825, 2011.
- [21] M. Völker, R. T. Schirrmeister, L. D. Fiederer, W. Burgard, and T. Ball, “Deep transfer learning for error decoding from non-invasive EEG,” in Brain-Computer Interface (BCI), 2018 6th International Conference on. IEEE, 2018, pp. 1–6.
- [22] W. J. Gehring, B. Goss, M. G. Coles, D. E. Meyer, and E. Donchin, “A neural system for error detection and compensation,” Psychological science, vol. 4, no. 6, pp. 385–390, 1993.
- [23] C. W. Eriksen and B. A. Eriksen, “Target redundancy in visual search: Do repetitions of the target within the display impair processing?” Perception & Psychophysics, vol. 26, no. 3, pp. 195–205, 1979.
- [24] M. Völker, L. D. Fiederer, S. Berberich, J. Hammer, J. Behncke, P. Kršek, M. Tomášek, P. Marusič, P. C. Reinacher, V. A. Coenen et al., “The dynamics of error processing in the human brain as reflected by high-gamma activity in noninvasive and intracranial EEG,” NeuroImage, 2018.
- [25] R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer, M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter, W. Burgard, and T. Ball, “Deep learning with convolutional neural networks for EEG decoding and visualization,” Human brain mapping, vol. 38, no. 11, pp. 5391–5420, 2017.
- [26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [27] E. J. Pitman, “Significance tests which may be applied to samples from any populations,” Supplement to the Journal of the Royal Statistical Society, vol. 4, no. 1, pp. 119–130, 1937.
- [28] M. Hollander, D. A. Wolfe, and E. Chicken, Nonparametric statistical methods. John Wiley & Sons, 2013, vol. 751.
- [29] D. J. Thomson, “Spectrum estimation and harmonic analysis,” Proceedings of the IEEE, vol. 70, no. 9, pp. 1055–1096, 1982.
- [30] J. C. Mazziotta, A. W. Toga, A. Evans, P. Fox, J. Lancaster et al., “A probabilistic atlas of the human brain: theory and rationale for its development,” Neuroimage, vol. 2, no. 2, pp. 89–101, 1995.
- [31] I. Iturrate, R. Chavarriaga, L. Montesano, J. Minguez, and J. d. R. Millán, “Teaching brain-machine interfaces as an alternative paradigm to neuroprosthetics control,” Scientific reports, vol. 5, p. 13893, 2015.
- [32] A. F. Salazar-Gomez, J. DelPreto, S. Gil, F. H. Guenther, and D. Rus, “Correcting robot mistakes in real time using EEG signals,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 6570–6577.
- [33] J. Behncke, R. T. Schirrmeister, W. Burgard, and T. Ball, “The signature of robot action success in EEG signals of a human observer: Decoding and visualization using deep convolutional neural networks,” in Brain-Computer Interface (BCI), 2018 6th International Conference on. IEEE, 2018, pp. 1–6.
- [34] A. Buttfield, P. W. Ferrez, and J. R. Millan, “Towards a robust BCI: error potentials and online learning,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 14, no. 2, pp. 164–168, 2006.