The recording of increasingly affordable and precise electroencephalographic (EEG) data is creating unprecedented opportunities to understand brain activity, aid personalized prognostics, and promote health through wearable biofeedback systems (Nan et al., 2012). Electroencephalography is non-invasive, safe, inexpensive, and rich in temporal content; in contrast, other brain imaging modalities, such as magnetic resonance imaging, entail higher costs, risks, and restrictions on the periodicity of recordings (Fowle and Binnie, 2000). EEG monitoring is widely used to assess psychiatric disorders, and has been shown to be a valuable source for studying Schizophrenia, a disorder that affects about 1% of the world population and is largely susceptible to misdiagnosis (Owen et al., 2016).
Despite the inherent advantages of monitoring electrophysiological brain activity, its use for diagnosing neuronal diseases is still capped by the limited size of case-control populations (Litjens et al., 2017), as well as by the intrinsic difficulties of mining brain signals. Brain signal data is high-dimensional, multivariate, susceptible to noise/artefacts, rich in temporal-spatial-spectral content, and highly-variable between individuals (da Silva, 2013).
This work proposes a dedicated class of neural networks to extract discriminative features of Schizophrenia from electrophysiological brain data. The proposed approach combines principles from pairwise distance learning and spectral imaging in order to address the aforementioned challenges, enabling superior diagnostics. Accordingly, the proposed approach offers six major contributions:
Ability to learn from small datasets by taking advantage of Siamese network layering, inherently prepared to extract features from a limited number of observations of the data distribution under study (Bromley et al., 1994). The features produced by these networks have proven useful for classification, as they rely on either the homologous or discriminative properties of observation pairs in a pairwise distance domain (Koch et al., 2015);
Ability to deal with the rich and complex spectral and temporal content of EEG data by processing the signal into spectral images with a fine frequency and temporal resolution per electrode, and by subsequently reshaping the Siamese network architecture with adequate convolutional operations;
Robustness to noise and wave instability by assessing distances on the spectral content under a cosine loss. Gathered evidence shows less susceptibility to artefacts and to the inherent variability of electrophysiological potentials, composed of continuously changing, overlapping electrical fields created by localized neurons (da Silva, 2013);
Ability to deal with the multivariate nature of the signal (rich spatial content) by capturing interdependencies between channels as their content is simultaneously used to shape the weights of shared connections in the network;
Ability to handle the extremely high-dimensional nature of the gathered spectral content of brain signal data (a high-resolution spectral image per scalp location) under L1 regularization;
Applicability of the proposed EEG-based diagnostics to alternative populations or diseases, evidenced by: i) the Bayesian optimization step (Snoek et al., 2012) placed for hyperparameter tuning and for fixing feature numerosity; ii) the fully-automated nature of the approach once signals are recorded; and iii) the generalization ability of the learning process on validation data.
In contrast with the traditional stance to neural information processing systems, this manuscript explores whether we can go deep on high-dimensional spatiotemporal data in the presence of a very limited number of data observations. This stance is much needed in healthcare given the limited size of trials, often motivated by disease rarity, the capped size of control populations, trial eligibility requirements, or the facultative nature of EEG assessments.
The manuscript is organized as follows. After formalizing the problem, Section 2 surveys existing contributions on the diagnosis of individuals from brain signal data. Section 3 describes the proposed solution. Section 4 shows extended evidence of its relevance for diagnosing Schizophrenia. Finally, concluding remarks are drawn in Section 5.
1.1 Problem Description
Problem. An EEG recording or brain signal observation is a multivariate time series X = (x_{c,t}), where x_{c,t} is a measure of the electrophysiological activity in scalp channel c at instant t, T is the number of time points, and m is the multivariate order (number of channels). Given a brain signal dataset D = {(X_1, y_1), ..., (X_n, y_n)}, i.e. EEG recordings annotated with a label y_i, we aim to identify a discriminative feature space to classify (unlabeled) observations. Specifically, we are interested in the classification of Schizophrenia against case-control populations.
Background. The electrophysiological signal produced by a specific channel
in the cerebral cortex is a univariate time series that can be decomposed into a frequency-domain series using a discrete Fourier transform. The analysis of the frequency domain of a signal, generally referred to as spectral analysis, determines the predominant waves monitored at a given location. Alternatively, a short-time discrete Fourier transform can be applied along a sliding window of the raw signal to capture potentially relevant changes in the spectral activity of the brain throughout the EEG recording. The spectral content produced by this time-varying form of spectral analysis is here informally referred to as a spectral image, since it measures brain activity along two contiguous axes, time and frequency.
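The sliding-window spectral analysis above can be sketched as a short-time Fourier transform over one channel; the window length, hop size, and Hann taper below are illustrative choices, not the exact settings used in this work.

```python
import numpy as np

def spectral_image(signal, fs, win_sec=2.0, hop_sec=0.5):
    """Short-time Fourier transform of a univariate EEG channel.

    Returns a (frequency bins x windows) magnitude matrix: the 'spectral image'.
    """
    win = int(win_sec * fs)          # samples per window
    hop = int(hop_sec * fs)          # samples between window starts
    taper = np.hanning(win)          # reduce spectral leakage
    starts = range(0, len(signal) - win + 1, hop)
    cols = [np.abs(np.fft.rfft(signal[s:s + win] * taper)) for s in starts]
    return np.array(cols).T          # rows: frequency, cols: time windows

# A 2 s window yields a 0.5 Hz frequency resolution, fine enough to
# resolve delta-band (< 4 Hz) activity. Synthetic 10 Hz oscillation:
fs = 128
t = np.arange(0, 10, 1 / fs)
x = np.sin(2 * np.pi * 10 * t)
img = spectral_image(x, fs)          # peak expected at the 10 Hz bin
```

Since each window of length w seconds has frequency resolution 1/w Hz, longer windows trade temporal resolution for spectral resolution, which motivates the 2-second choice discussed in Section 3.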
2 Related Work
Dvey-Aharon et al. (2015) claim that changes in functional connectivity, together with differences in theta-frequency activity, are the main alterations seen in schizophrenia patients. A classification approach was applied on 1-minute signals recorded by a single electrode. The developed system consists of four stages: preprocessing and breaking the raw signals into relevant intervals; transformation of the EEG signal into a time/frequency representation via the Stockwell transformation; feature extraction from the time/frequency representation; and discrimination of specific time frames, following a given set of stimuli, between the time/frequency matrix representations of healthy subjects and schizophrenia patients. The reported results cover discrimination accuracy, specificity, sensitivity, and p-value significance. Despite promising results, the approach requires the individuals under assessment to perform cognitive tasks throughout the recording. Dvey-Aharon et al. (2017) introduced another way of looking at the EEG signal, using connectivity maps derived from brain activity. To build these maps, a similarity function must be chosen, so one can check which nodes are most similar to which. Results showed that the degradation of connectivity is accelerated in schizophrenia individuals, and that information relay changes abnormally, primarily in the prefrontal area. In terms of performance, the connectivity-maps method reported accuracy, specificity, sensitivity, and p-value significance. This gives good insight into how connectivity maps can be applied to discriminate schizophrenia and, most importantly, shows that a change in a certain region can influence other regions of the brain.
Sabeti et al. (2009) introduce another approach to classify schizophrenia, based on entropy and complexity measures of the EEG signal. The features extracted from the signal were: Shannon entropy, spectral entropy, approximate entropy, Lempel-Ziv complexity and Higuchi fractal dimension. Genetic programming was used for feature selection. With these features, two algorithms were implemented to perform the classification: Adaptive Boosting (Adaboost) and Linear Discriminant Analysis (LDA). Both algorithms were validated with and without the feature selection step. The recordings were done with eyes open; this setup does not rely on the definition of a task, but can still be biased by the environment surrounding the patient. Without feature selection, accuracies of and were reached for LDA and Adaboost, respectively. With genetic programming for feature selection, the accuracies reached and . These results were obtained using leave-one-out cross validation.
Notable examples of connectionist and spectral approaches were introduced to discriminate and characterize Schizophrenia. Nevertheless, there is still a research gap on how to simultaneously explore the rich spectral, temporal and spatial nature of brain signals to perform classification. In spite of the indisputable role of neural network learning for the analysis of complex spatiotemporal signal data, its role for EEG-based diagnostics of psychiatric disorders remains largely unexplored due to the absence of large cohorts and the inherent stochastic complexities associated with electrophysiological data.
2.1 Siamese Neural Network
First introduced by Bromley et al. (1994) as a model for signature verification, aimed at distinguishing signature forgeries from genuine ones, Siamese Neural Networks (SNN) are deep learning architectures with two sub-networks that are instances of the same network, hence the name "siamese". The architecture receives a pair of samples as input, and the outputs of the two sub-networks are joined by a distance function. The distance function proposed between the outputs of the SNN is the cosine similarity (for signatures from the same person the output should be high, and low for forged ones). The model had outstanding results at the time, detecting of the forged signatures and of the genuine signatures. More recently, Koch et al. (2015) successfully used an SNN architecture for one-shot learning (meaning the model only sees each class once in an epoch), reaching accuracy on the test set through a siamese convolutional architecture. Once this kind of network is trained, the representations learned via this supervised metric-based approach are useful to perform tasks like classification, relying on the discriminative properties of the extracted features.
3 Our Approach
Our architecture is inspired by the model proposed in Koch et al. (2015). An advantage of this type of architecture is the ability to augment the original dataset from an instance-based data space to a pair-based one. The proposed approach has two main steps: 1) feature extraction; and 2) classification. In step 1, the internal representations obtained from the SNN architecture model are extracted after training. In step 2, a classification task is performed using these extracted features. Previous to both steps, we perform hyperparameter optimization for every model using Bayesian Optimization (BO) (Snoek et al., 2012).
3.1 Dataset Description
Approaches based on induced stimuli or task performance, followed by the analysis of event-related potentials, are not considered in this work. Instead, a resting-state setup is considered to monitor the underlying brain patterning at the cerebral cortex, independently of the surrounding environment or undertaken task, thus avoiding additional interference with the recorded EEG signal. The findings of Howells et al. (2018) support this setup, showing that differences in spectral activity – such as higher delta and lower alpha synchronization in psychotic disorders – can be optimally detected in resting-state protocols with both open and closed eyes.
The target dataset consists of a total of 84 individuals: 39 healthy controls and 45 schizophrenia subjects. The population consists of adolescents who were screened by a psychiatrist and received either a positive or negative diagnosis of the schizophrenia neuropathology. EEG recordings were sampled at Hz with minute duration. Individuals were in a resting state with eyes closed. In accordance with the 10-20 system of electrode placement, the topographical positions of the placed EEG channels are: F7, F3, F4, F8, T3, C3, Cz, C4, T4, T5, P3, Pz, P4, T6, O1, O2. The dataset is available online, sourced from Gorbachevskaya and Borisov (2002).
3.2 Siamese Neural Network Architecture
The SNN architecture contains two sub-networks that are instances of the same network (twin networks). Both of these twins are referred to as the Base Network (BN). The input and output of the BN are an example and a feature vector, respectively. The output feature vector corresponds to the features extracted in the aforementioned step 1.
In our case, the BN receives as input a Discrete Short-Time Fourier Transform (DSTFT) representation of the EEG signal, extracted from the 1-minute recording of a channel of an individual. The DSTFT is taken with 2-second windows in order to capture frequencies as low as Hz, corresponding to the delta-wave frequencies (Howells et al. (2018) point out that frequencies lower than Hz are relevant to differentiate Schizophrenia). This image is processed through two convolutional layers, followed by a fully connected layer. The activation function used in the convolutional layers is the rectified linear function (Hahnloser et al., 2000), while the fully connected layer uses the softmax activation function, normalizing the domain of the feature representations.
Once the BN network (Fig. 1) is built, a replication of it is made, producing its twin with shared weights. The SNN layout is obtained by joining these twins and computing a distance metric between their outputs, as shown in Fig. 2. In our case, the inputs to the SNN are pairs of DSTFT representations and the output is the computed distance between the representations produced by the BN.
The SNN tries to solve what is known as a neighbor separation problem, consisting of separating the instances of a dataset that contains different classes. In our case there are two classes: schizophrenic individuals and healthy controls. In this neighbor separation problem, pairs of individuals of the same class (schizophrenic with schizophrenic, or healthy with healthy) are called neighbors, and pairs of individuals of different classes (schizophrenic with healthy) are called non-neighbors. The network learns a transformation with the objective of assigning a small distance to neighbors and a large distance to non-neighbors.
With the previously described architecture, the neighbor separation problem can be posed as the minimization of a loss function that depends on such distance. In Hadsell et al. (2006), the Contrastive Loss function is introduced to that end, defined as:

L(x_1, x_2, y) = y · (1/2) D^2 + (1 − y) · (1/2) max(0, m − D)^2

where (x_1, x_2) is the input pair, y = 1 if x_1 and x_2 are neighbors and y = 0 otherwise, D is the distance between the predicted values of x_1 and x_2, and m is the margin value of separation. Minimization of the Contrastive Loss function leads to a scenario where neighbors are pulled together and non-neighbors are pushed apart, according to a certain distance metric. The margin value is sensitive: high values of m increase the separation between non-neighbors (pairs of different class), impacting accuracy positively although making training slower; in contrast, low values of m may cause the model not to learn the desired behavior.
The distance metric considered in our case is the cosine distance. This metric was chosen under the belief that a pattern-based metric (cosine) would perform better than a magnitude-based one (Euclidean), shedding light on how the schizophrenia pathology expresses itself through the EEG.
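The contrastive objective combined with the cosine distance amounts to a few lines of numpy; the neighbor coding (y = 1 for same-class pairs) matches the convention above, while the default margin value is only illustrative.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: small when the two embeddings share a pattern,
    regardless of their magnitudes."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(a, b, y, margin=0.5):
    """Contrastive loss (Hadsell et al., 2006) on one pair of embeddings.

    y = 1 for neighbors (same class): pull the pair together.
    y = 0 for non-neighbors: push apart until the distance reaches the margin.
    """
    d = cosine_distance(a, b)
    return y * 0.5 * d**2 + (1 - y) * 0.5 * max(0.0, margin - d)**2
```

Note that identical embeddings cost nothing when labeled neighbors but incur the full margin penalty when labeled non-neighbors, which is precisely the gradient signal that pushes non-neighbor pairs apart.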
Besides the type of layers and the distance metric, two techniques are integrated into the model: L1 regularization and Dropout layers. L1 regularization helps remove features that are not useful for the task, and is applied at the kernel of all layers. Dropout layers are introduced to improve generalization; the Dropout probability used is the one suggested by Srivastava et al. (2014), applied after each convolutional layer. Adam (Kingma and Ba, 2014) is used to optimize the network during the training session.
3.2.1 Hyperparameter Tuning
The number of layers, as well as their type, are fixed. The remaining hyperparameters (regularization factor, margin value, learning rate, kernel size and output dimension of the BN) are subject to optimization. As previously mentioned, we apply BO to that end. BO is set to run with a maximum number of acquisitions and starts with a number of iterations to perform an initial exploration. In each iteration and acquisition, a k-fold Cross Validation is done over the whole dataset. The combination of hyperparameters with the best average validation accuracy across the k folds is chosen to perform the feature extraction. Each hyperparameter is assigned a value domain to explore: regularization factor, margin value, learning rate, kernel size (the same kernel size is used for both convolutional layers) and final output dimension. The BO surrogate model is a standard Gaussian Process, Expected Improvement is used as the acquisition function, and the Limited-Memory Broyden–Fletcher–Goldfarb–Shanno algorithm as the acquisition optimizer.
The DSTFT magnitudes are normalized, under the hypothesis that there exists a threshold beyond which there is no additional information to identify the schizophrenia pathology. The values are thus normalized by an upper value: magnitudes smaller than the threshold are divided by it, and magnitudes larger than the threshold are capped, so that every frequency magnitude lies within the interval [0, 1] after normalization. We take advantage of the BO exploration to obtain this threshold, by introducing it into the same optimization process used for the SNN hyperparameters, with its own value domain to explore.
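The capped normalization can be expressed in one line; the threshold passed as `upper` stands in for whatever value the BO step selects.

```python
import numpy as np

def normalize_magnitudes(spec, upper):
    """Cap DSTFT magnitudes at the upper threshold and divide by it,
    mapping every value into the interval [0, 1]."""
    return np.minimum(spec, upper) / upper
```

Magnitudes at or above the threshold all map to 1, implementing the hypothesis that such extremes carry no extra diagnostic information.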
3.2.2 Pairwise Dataset Structure
We want the network to learn a valid transformation that generalizes across all channels. To that end, pairs are formed only between equal channels; pairs of different channels are not considered, since we see different channels as correlated spaces with different properties. Fortunately, the SNN is capable of learning over different spaces/classes, as shown in Koch et al. (2015), where the proposed system learns a similar setup. This pairwise schema can be seen as a data augmentation technique, adding noise to the dataset by mixing all the channels under the aforementioned restriction. No other data augmentation scheme, such as image transformations (scaling, rotations), is applied.
From our original EEG dataset, spectral images are derived as examples, and a pairwise dataset is built from them, whose size grows with the number of EEG channels and quadratically with the number of individuals. The SNN training session is done with a batch size that is a multiple of the number of channels, so that each batch holds pairs of individuals and each pair of individuals contributes its channel pairs. This scheme can only be applied to small datasets, since the model does not scale well in terms of space complexity, but our goal is precisely to tackle small datasets through the creation of a whole new optimization space where the variability contained in the data can be exploited in a different way.
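The channel-restricted pairing scheme can be sketched in pure Python; the individuals, labels, and channel names below are synthetic placeholders.

```python
from itertools import combinations

def build_pairs(labels, channels):
    """Pair the spectral images of every two individuals, channel by channel.

    labels: dict individual -> class (e.g. 0 healthy, 1 schizophrenia).
    Only equal channels are paired; y = 1 marks neighbor (same-class) pairs.
    Returns tuples ((indiv_a, ch), (indiv_b, ch), y) referencing the images.
    """
    pairs = []
    for a, b in combinations(sorted(labels), 2):
        y = 1 if labels[a] == labels[b] else 0
        for ch in channels:                  # same channel on both sides
            pairs.append(((a, ch), (b, ch), y))
    return pairs

# n individuals over c channels yield c * n * (n - 1) / 2 pairs --
# the quadratic blow-up that doubles as data augmentation on small cohorts.
```

Because the pair count is quadratic in the number of individuals, the scheme is affordable exactly in the small-cohort regime this work targets.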
After the SNN has been trained (in a fixed number of epochs), the outputs of the BN for every example are the result of our feature extraction process. With these features, the following classifiers were trained to identify schizophrenia: Support Vector Machines (SVM), Random Forest (RF), XGBoost (XGB), Naive Bayes (NB) and k-Nearest Neighbors (kNN). This process was performed with Leave-One-Out Cross Validation (LOOCV). For each of these classifiers, BO hyperparameter tuning is also performed, set up with a maximum number of acquisitions and initial exploration iterations. The hyperparameter domains for each classifier were:
SVM: type of kernel (linear or radial-basis function), cost, and gamma coefficient;
RF: number of estimators;
XGB: maximum depth, learning rate, and number of estimators;
NB: no hyperparameters to tune;
kNN: number of neighbors.
The hyperparameter tuning for the classifiers is also performed in a k-Fold Cross Validation setup, but instead of using the whole dataset (as was the case for the SNN), only the training set of the LOOCV partition is used. As with the BO for the SNN, the combination of hyperparameters with the best average validation accuracy is chosen for each classifier.
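The nested evaluation scheme — an outer leave-one-out loop with an inner k-fold for hyperparameter selection — can be sketched in pure Python; the `score` callable and the candidate list stand in for the BO search and hyperparameter domains described above.

```python
def kfold_splits(items, k):
    """Partition items into k folds, yielding (train, validation) lists."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, val

def loocv(indices, k, candidates, score):
    """Outer leave-one-out loop; inner k-fold picks hyperparameters per split.

    score(train, val, params) -> validation accuracy (placeholder callable).
    Returns each held-out test index with the params chosen for it.
    """
    chosen = []
    for test in indices:
        train_pool = [i for i in indices if i != test]
        best = max(candidates, key=lambda p: sum(
            score(tr, va, p) for tr, va in kfold_splits(train_pool, k)) / k)
        chosen.append((test, best))
    return chosen
```

The key property is that the held-out individual never influences the hyperparameter choice made for its own split, keeping the LOOCV estimate unbiased.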
For the sake of comparison, the classification results obtained with the features extracted by the SNN are compared against a competitive baseline of Fast Fourier Transform (FFT) (Welch, 1967) frequency features, performing a BO for each classifier under the same conditions described in Section 3.3. Further, a comparison with a Convolutional Neural Network (CNN) optimized for classification is made. This CNN differs from the BN by the addition of a final ReLU layer outputting a single label/classification value. Its hyperparameters are obtained using BO over the same domains as the SNN hyperparameters (except for the margin value, which does not apply to this CNN architecture).
The FFT-based feature classifiers are referred to as: (i) FFT-kNN, (ii) FFT-NB, (iii) FFT-RF, (iv) FFT-SVM, (v) FFT-XGB. The CNN classifier is referred to as (vi) DSTFT-CNN. The proposed classifiers based on the SNN extracted features are referred to as: (vii) DSTFT-SNN-kNN, (viii) DSTFT-SNN-NB, (ix) DSTFT-SNN-RF, (x) DSTFT-SNN-SVM, (xi) DSTFT-SNN-XGB.
According to Table 1, the SNN features outperform the FFT frequency baselines by an average of pp in accuracy. One particular observation is that among the FFT features there was a clear difference between classifiers: FFT-kNN and FFT-NB were worse than FFT-XGB, FFT-RF and FFT-SVM. The same was not found among the SNN features, with all of the variants performing comparably. This is due to most of the discriminative work being solved by the SNN transformation of the DSTFT representation of the EEG signal, suggesting that the SNN alone is capable of generalizing better and of producing highly discriminative features.
Further, as expected, the DSTFT-CNN baseline underperformed all of the classifiers that used SNN features by an average of 10pp. This is justified by the low amount of data available to train the model, which is much smaller than the amount of pairwise data used to train the SNN model. In addition, the high dimensionality of the DSTFT features is a characteristic that the DSTFT-CNN is not originally prepared to handle. On the other hand, the regularization techniques still present in the DSTFT-CNN baseline were key to outperforming the FFT baselines.
Complementing Table 1, Figs. 2(a), 2(b) and 2(c) compare the performance of the assessed classifiers for each channel individually. We can observe that, independently of the channel, the SNN classifiers still outperform the considered baselines. The classification performance is considerably stable across channels, and a few percentage points lower than on the original feature space comprising all channels. The classifiers found to have more variability between channels were FFT-XGB, FFT-RF, SNN-XGB and SNN-RF.
The rich nature of the electrophysiological data measured at the cerebral cortex makes deep learning a natural candidate to study disorders disrupting normal brain activity. Nevertheless, the limited size of case-control populations, together with the inherent variability of the spectral content within and among individuals, has left the value of neural network approaches largely unexplored. This manuscript stresses the relevance of revisiting this problem. By reshaping the architecture, loss, and applied regularization, we show that the use of neural networks to classify Schizophrenia can increase the accuracy of diagnostics by 15-to-20 percentage points against peer alternatives (without hampering sensitivity or specificity). Two master principles underlie these results: 1) the mapping of the original data space into a pairwise distance space to support data augmentation while enhancing the discriminative power of the output features; and 2) the exploration of the rich nature of brain patterning through convolution operations on the spectral imaging of the signal, with weights learned under a cosine loss to better account for the inherent noise of electrophysiological data.
As future work, we aim to extend the experimental analysis towards alternative disorders and populations with potentially different EEG instrumentation or protocols; to contrast the performance of the proposed EEG-based learners against state-of-the-art brain imaging learners on a population of individuals with (and without) neurodegenerative conditions currently being monitored; and to establish a method capable of performing a neurofeedback technique to tackle Schizophrenia symptoms, similarly to what was previously proposed by Nan et al. (2012).
This work is supported by national funds through FCT under project iLU DSAIPA/DS/0111/2018 and INESC-ID pluriannual UID/CEC/50021/2019.
- Bromley et al.  J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pages 737–744, 1994.
- da Silva  F. L. da Silva. Eeg and meg: relevance to neuroscience. Neuron, 80(5):1112–1128, 2013.
- Dvey-Aharon et al.  Z. Dvey-Aharon, N. Fogelson, A. Peled, and N. Intrator. Schizophrenia detection and classification by advanced analysis of eeg recordings using a single electrode approach. PloS one, 10(4):e0123033, 2015.
- Dvey-Aharon et al.  Z. Dvey-Aharon, N. Fogelson, A. Peled, and N. Intrator. Connectivity maps based analysis of eeg for the advanced diagnosis of schizophrenia attributes. PloS one, 12(10):e0185852, 2017.
- Fowle and Binnie  A. J. Fowle and C. D. Binnie. Uses and abuses of the eeg in epilepsy. Epilepsia, 41:S10–S18, 2000.
- Gorbachevskaya and Borisov  K. Gorbachevskaya and S. Borisov. Eeg of healthy adolescents and adolescents with symptoms of schizophrenia. http://brain.bio.msu.ru/eeg_schizophrenia.htm, 2002. Online; accessed 1st February 2019.
- Hadsell et al.  R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1735–1742, 2006.
- Hahnloser et al.  R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947, 2000.
- Howells et al.  F. M. Howells, H. S. Temmingh, J. H. Hsieh, A. V. van Dijen, D. S. Baldwin, and D. J. Stein. Electroencephalographic delta/alpha frequency activity differentiates psychotic disorders: a study of schizophrenia, bipolar disorder and methamphetamine-induced psychotic disorder. Translational psychiatry, 8(1):75, 2018.
- Kingma and Ba  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Koch et al.  G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
- Litjens et al.  G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.
- Nan et al.  W. Nan, J. P. Rodrigues, J. Ma, X. Qu, F. Wan, P.-I. Mak, P. U. Mak, M. I. Vai, and A. Rosa. Individual alpha neurofeedback training effect on short term memory. International journal of psychophysiology, 86(1):83–87, 2012.
- Owen et al.  M. J. Owen, A. Sawa, and P. B. Mortensen. Schizophrenia. The Lancet, 388(10039):86–97, 2016. ISSN 0140-6736. doi: https://doi.org/10.1016/S0140-6736(15)01121-6. URL http://www.sciencedirect.com/science/article/pii/S0140673615011216.
- Sabeti et al.  M. Sabeti, S. Katebi, and R. Boostani. Entropy and complexity measures for eeg signal classification of schizophrenic and control participants. Artificial intelligence in medicine, 47(3):263–274, 2009.
- Snoek et al.  J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
- Srivastava et al.  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.
- Welch  P. Welch. The use of fast fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on audio and electroacoustics, 15(2):70–73, 1967.