A brain–computer interface (BCI) is an emerging technology that establishes a communication pathway between a user and an external device (e.g., a computer) by acquiring and analyzing brain signals and translating them into commands the device can understand. Owing to their practicality, electroencephalogram (EEG)-based non-invasive BCIs are widely used [27, 4, 16]. Earlier, Aricò et al. categorized user-centered BCIs into two types: active/reactive BCIs and passive BCIs. In this paper, we focus not only on active BCIs but also on passive BCIs. Generally, two types of brain signals, evoked and spontaneous EEG, are primarily considered for active/reactive BCIs. Evoked BCIs exploit unintentional electrical potentials elicited by external or internal stimuli; examples include steady-state visually evoked potentials (SSVEP) [18, 23] and event-related potentials. Spontaneous BCIs, by contrast, use internal cognitive processes such as event-related desynchronization and event-related synchronization (ERD/ERS) in sensorimotor rhythms, e.g., motor imagery (MI) [18, 5], induced by imagining movements as well as by physical movement. Well-known examples of passive BCIs include the use of sleep/drowsy EEG signals for sleep stage classification or for identifying mental fatigue to alert a driver to a dangerous situation, and the use of seizure EEG patterns for onset detection to warn a patient of a potential seizure.
Generally, machine learning-based BCIs consist of five main processing stages: (i) EEG signal acquisition according to the chosen paradigm, (ii) signal preprocessing (e.g., channel selection and band-pass filtering), (iii) feature representation learning, (iv) classifier learning, and (v) a feedback stage. Most machine learning-based BCI methods follow these stages; however, they require paradigm-specific modifications to classify a user's intention or condition. In other words, machine learning-based methods need prior knowledge of the different EEG paradigms [4, 18, 23, 16, 29]. Therefore, conventional machine learning-based BCIs have discovered EEG representations through highly specialized approaches, e.g., common spatial patterns (CSP) or their variants [30, 1] for MI signals and canonical correlation analysis (CCA) for SSVEP decoding.
While hand-crafted feature representation learning plays a pivotal role in conventional machine learning frameworks [4, 23, 28], deep learning-based representations have produced remarkable results in the BCI community [27, 16, 33]. These deep learning-based methods integrate the feature extraction step with classifier learning so that both steps are jointly optimized, thereby improving performance. Among deep learning methods, convolutional neural networks (CNNs) have the advantage [14, 16, 6] of preserving the structural and configurational information of the original data. In this respect, developing novel CNN architectures for EEG signal representation has taken center stage in BCI studies [27, 31, 25, 14, 32, 15, 3, 7, 9].
However, some challenges still remain. First, existing CNN-based methods [27, 14, 25, 31, 7, 9] mostly comprise stacked convolutional layers; that is, they extract features sequentially. Ignoring multiple ranges of spectral-temporal features is problematic because EEG features for different subjects, paradigms, and signal types are found in diverse ranges. For example, Fig. 1 depicts the MI EEG power spectral density (PSD) curves of two different subjects. Clearly, the two plots follow different distributions even though the PSDs are estimated from the same task. It is therefore important to capture multi-scale spectral information in EEGs for general use in BCI, i.e., for a generic method applicable to various types of BCIs.
In addition, those stacked CNN-based methods [14, 27, 9, 7] have numerous trainable parameters and thus require large amounts of training data, whereas BCIs generally acquire only a limited number of EEG trials. Generalizing conventional stacked CNNs in BCI is therefore difficult, because deep learning is data-hungry, i.e., it rarely generalizes well when data are scarce.
Finally, interpreting a trained stacked CNN from a neurophysiologically appropriate standpoint is complicated, because the CNN identifies complex data patterns in a latent space, making direct explanation difficult.
In this study, we propose a novel deep learning-based BCI method to mitigate the previously discussed difficulties. The main contributions of our study are as follows:
First, we propose a novel CNN architecture that is applicable regardless of the input paradigm or EEG type and can represent multi-scale spatio-spectral-temporal features.
Second, the proposed method achieved favorable performance on five different datasets covering four different paradigms (two for active BCIs and two for passive BCIs). The proposed method outperformed or matched state-of-the-art linear and deep learning methods that were individually designed for each specific paradigm.
Last, we analyze the proposed network using a variety of techniques.
The rest of this paper is organized as follows: Section II reviews previous research on EEG representation learning via linear model-based or deep learning-based methods. In Section III, we propose a novel and compact deep CNN that classifies multi-paradigm EEG by representing multi-scale spatio-spectral-temporal features. Section IV presents experimental settings and results, comparing the proposed method with comparable baselines. In Section V, we analyze our proposed method from several points of view. Finally, Section VI summarizes the proposed study and suggests future research directions.
II Related Work
Learning a class-discriminative feature representation of EEG is still challenging in both theory and practice. Numerous prior studies have attempted to extract features from EEGs. In this section, we briefly discuss linear methods and deep learning models used for EEG signal representation.
II-A Linear Models
Over the past decades, CSP and its variants [1, 30] have played an essential role in decoding MI. Blankertz et al. and Ang et al. independently used spatial filtering-based methods for classifying MI. Ang et al. band-pass filtered the EEG data before applying CSP, thereby decoding EEG signals in a spatio-spectral manner; they named the proposed method filter bank CSP (FBCSP). Furthermore, Suk and Lee also decoded MI by jointly optimizing multiple spectral filters in a Bayesian framework.
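As a concrete illustration, CSP filters can be obtained from the class-conditional covariances via a whitening-plus-eigendecomposition procedure. The NumPy sketch below is a minimal, unregularized version of this idea with hypothetical toy data, not the exact pipeline of the works cited above.

```python
import numpy as np

def csp_filters(X1, X2, n_filters=4):
    """CSP spatial filters for two classes of EEG trials.

    X1, X2: (trials, channels, timepoints) arrays.
    Returns W with shape (n_filters, channels); rows are spatial filters
    taken from both ends of the eigenvalue spectrum.
    """
    def mean_cov(X):
        covs = [x @ x.T / np.trace(x @ x.T) for x in X]
        return np.mean(covs, axis=0)

    C1, C2 = mean_cov(X1), mean_cov(X2)
    d, U = np.linalg.eigh(C1 + C2)          # whiten the composite covariance
    P = np.diag(1.0 / np.sqrt(d)) @ U.T
    lam, B = np.linalg.eigh(P @ C1 @ P.T)   # eigenvalues lie in [0, 1]
    W = B.T @ P
    idx = np.argsort(lam)                   # extremes are most discriminative
    half = n_filters // 2
    pick = np.r_[idx[:half], idx[-(n_filters - half):]]
    return W[pick]

# Hypothetical toy data: class 2 has larger variance on some channels.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((20, 8, 128))
X2 = rng.standard_normal((20, 8, 128)) * np.linspace(0.5, 2.0, 8)[None, :, None]
W = csp_filters(X1, X2)  # (4, 8)
```

Log-variances of the filtered trials would then serve as features for an LDA or similar classifier.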
CCA is commonly utilized for detecting SSVEP owing to its ability to be deployed without a calibration stage. The standard CCA method uses sinusoidal signals as reference signals and estimates the canonical correlation between these references and the input EEG signals to identify the evoked frequency in SSVEP EEGs.
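A minimal NumPy sketch of this frequency identification scheme follows (reference construction plus canonical correlation via QR/SVD); the synthetic "EEG" below is purely illustrative.

```python
import numpy as np

def canonical_corr(X, Y):
    """Largest canonical correlation between the column spaces of X and Y
    (rows = timepoints, columns = variables)."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)[0]

def ssvep_reference(freq, fs, n_samples, n_harmonics=2):
    """Sine/cosine references at the stimulus frequency and its harmonics."""
    t = np.arange(n_samples) / fs
    refs = []
    for h in range(1, n_harmonics + 1):
        refs += [np.sin(2 * np.pi * h * freq * t), np.cos(2 * np.pi * h * freq * t)]
    return np.stack(refs, axis=1)

# Pick the stimulus frequency whose references correlate most with the EEG.
fs, n = 200, 800
t = np.arange(n) / fs
rng = np.random.default_rng(0)
eeg = np.stack([np.sin(2 * np.pi * 8.57 * t + p) for p in (0.0, 1.0, 2.0)], axis=1)
eeg += 0.1 * rng.standard_normal(eeg.shape)   # noisy 3-channel 8.57 Hz signal
scores = {f: canonical_corr(eeg, ssvep_reference(f, fs, n))
          for f in (5.45, 6.67, 8.57, 12.0)}
best = max(scores, key=scores.get)  # -> 8.57
```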
In addition, entropy calculation-based approaches have frequently been used to characterize sleep stages. Sanders et al. classified sleep stages using spectral-temporal EEG features learned from short-time Fourier transforms. Furthermore, Zheng and Lu focused on identifying a driver's mental fatigue during driving. They applied filter banks to the EEG signals to extract spectral information and then transformed the filtered signals into the spectral domain, i.e., estimated the PSD of the filtered EEG signals. By doing so, Zheng and Lu effectively regressed the driver's mental state, labeled using the PERCLOS index, a measure of neurophysiological fatigue.
For epileptic seizure detection, Shoeb and Guttag used filter banks in a channel-wise manner to capture spatio-spectral information. Then, by encoding the temporal evolution of the extracted spatio-spectral feature vectors, they effectively constructed spatio-spectral-temporal features of epileptic seizure EEG signals and classified seizure versus non-seizure features using a support vector machine (SVM). Recently, spectral features derived from principal component analysis (PCA) exhibited superior performance for seizure onset detection. In particular, Lee et al. band-pass filtered the raw signals and calculated the PSD, then applied PCA to extract spectral features of the EEG signals.
These practical linear model-based BCI methods [4, 1, 30, 23, 17, 26, 34] have demonstrated credible performance. However, they require prior neurophysiological knowledge, because their feature extraction stages are specifically designed for each EEG paradigm. Conversely, our method does not need to be specialized for different paradigms.
II-B Deep and Hierarchical Models
Recently, deep learning methods, especially CNNs, have achieved promising results in EEG decoding research. For instance, Schirrmeister et al. introduced the Shallow ConvNet, Deep ConvNet, Hybrid ConvNet, and Residual ConvNet, and evaluated how well each proposed CNN decoded MI. Ko et al. also proposed a novel CNN architecture for MI classification, the deep recurrent spatio-temporal neural network (RSTNN), inspired by recurrent convolutional neural networks.
While the standard CCA has achieved state-of-the-art performance in SSVEP BCI, Kwak et al. developed a CNN for SSVEP feature representation learning. They simply combined spatial and temporal convolutions to let the model learn data patterns in the latent space, thereby generalizing well over EEG features. Meanwhile, Waytowich et al. applied EEGNet to the SSVEP paradigm and achieved higher performance than the standard CCA.
Supratak et al. developed a deep neural network for sleep stage detection. More precisely, they combined a CNN for representation learning with a recurrent neural network for sequential residual learning, and trained the model in two separate steps: individual pre-training followed by fine-tuning. Meanwhile, Gao et al. proposed an EEG-based spatio-temporal convolutional neural network (ESTCNN) for driver fatigue evaluation. The ESTCNN convolves the band-pass filtered EEG to represent temporal dependencies and flattens the extracted features for spatial feature fusion; densely connected layers are then used to identify the user's condition.
To detect seizure types, Asif et al. proposed SeizureNet, a deep CNN for multi-spectral feature learning. They transformed the EEG signals into the spectral domain using saliency-encoded spectrogram generation and fed the extracted spectral features to a deep neural network. Independently, Emami et al. proposed another CNN-based approach for detecting seizure onset: they band-pass filtered and segmented the input EEG patterns and then used a deep CNN for classification.
Recently, Lawhern et al. [16, 32] proposed a novel CNN called EEGNet. Unlike other linear or deep learning-based methods, EEGNet classifies various EEG paradigms with a single architecture, i.e., without paradigm-specific tuning. Further, Lawhern et al. introduced a separable convolution and used it to reduce the number of parameters.
On the one hand, deep and hierarchical models decode EEG signals well without any custom feature extraction stage for their respective paradigm [27, 14, 15, 3, 7, 31, 9] or even across paradigms [16, 32]. On the other hand, these deep CNNs extract EEG features sequentially through stacked convolutional layers without exploiting multi-scale spectral representations. In contrast, the proposed method exploits multi-scale spatio-spectral-temporal features irrespective of the input EEG paradigm.
III Method
In this section, we propose a deep multi-scale neural network (MSNN) that represents EEG features from different paradigms by exploiting spatio-spectral-temporal information at multiple scales.
III-A Multi-Scale Neural Network
As mentioned previously, FBCSP is one of the most successful models for exploiting multi-scale EEG features, especially for MI. Thus, many successful MI decoding algorithms [27, 30], and even classification algorithms for other paradigms, are inspired by the FBCSP model. In this study, the proposed multi-scale neural network (MSNN) also learns multi-scale feature representations; however, the network learns discriminative spectral filters automatically from data, rather than manually defining multiple frequency bands as in FBCSP. Our proposed method consists of three types of blocks, as depicted in Fig. 2: (1) a spectral-temporal feature representation block, (2) a spatial feature representation block, and (3) a classification block.
First, in the spectral-temporal feature representation block, stacked convolutional layers extract spectral-temporal features from the EEG data, as in existing EEG classification methods; however, the proposed model additionally exploits the intermediate activations to gather multi-scale spectral information. The spatial feature representation block then discovers spatial patterns in the extracted multi-scale features. Finally, these multi-scale spatio-spectral-temporal features are concatenated, pooled, and fed into a densely connected layer for classification.
III-B Spectral-Temporal Feature Representation Block
Given an input EEG sample $\mathbf{X} \in \mathbb{R}^{C \times T}$, we reshape it into the form $\mathbb{R}^{1 \times C \times T}$, i.e., a single-feature-map tensor, where $C$ and $T$ denote the number of electrode channels and timepoints, respectively.
In the MSNN, the input EEG data are first temporally convolved in a channel-wise manner by a temporal convolutional layer to expand the number of feature maps. The activated features thus have the form $\mathbb{R}^{F_1 \times C \times T}$, where $f_s$ and $F_1$ denote the sampling frequency and the feature map dimension of the first temporal convolution layer, respectively. The main benefits of using a separable convolution [6, 16] are a significant reduction of tunable weights in the model and, more importantly, an efficient and explicit decoupling of the temporal and feature map dimensions of the input features, accomplished by learning kernels independently for each feature map. Thus, as in the BCI literature, the separable convolution learns temporal kernels individually per feature map (using a depthwise convolution) and then optimally re-combines the feature maps (using a pointwise convolution).
In this block, by setting the kernel size to $1 \times k_l$, where $k_l$ denotes the kernel length of the $l$-th temporal separable convolution, the $l$-th temporal separable convolutional layer represents EEG signal features over windows of $k_l / f_s$ sec, hence around $f_s / k_l$ Hz, where $f_s$ is the frequency property carried over from the first temporal convolutional layer. Therefore, the spectral-temporal feature representation layers can cover different temporal extents, and thus frequency ranges, by using various kernel sizes for the input EEG data.
Additionally, each layer with a different kernel size extracts features in a different frequency and temporal range. In other words, a spectral-temporal convolution layer with a larger kernel represents longer-term temporal features, i.e., a lower range of spectral features, and vice versa. The MSNN then exploits the intermediate activations of each layer, thus learning multi-scale feature representations.
In addition, a separable convolution factorizes the operation into a depthwise and a pointwise step, so the number of parameters is small compared with a conventional convolution. For instance, while the $l$-th separable temporal convolution has only $F_{l-1} \cdot k_l + F_{l-1} \cdot F_l$ parameters, a conventional convolution with the same kernel size has $F_{l-1} \cdot k_l \cdot F_l$ parameters, where $F_l$ denotes the feature map dimension of the $l$-th layer.
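The savings can be checked with simple arithmetic; the feature-map and kernel sizes below are arbitrary illustrative values, not the paper's settings.

```python
def separable_params(f_in, f_out, k):
    """Depthwise kernels (f_in * k) plus 1x1 pointwise mixing (f_in * f_out); biases ignored."""
    return f_in * k + f_in * f_out

def standard_params(f_in, f_out, k):
    """A conventional convolution learns a length-k kernel per (input map, output map) pair."""
    return f_in * f_out * k

# e.g. 32 -> 32 feature maps with a temporal kernel of length 65:
sep = separable_params(32, 32, 65)  # 32*65 + 32*32 = 3104
std = standard_params(32, 32, 65)   # 32*32*65 = 66560
```

Here the separable layer needs roughly 21 times fewer weights than its conventional counterpart.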
Furthermore, as described above, the MSNN uses its intermediate activations to exploit multi-scale representations. In other words, the proposed network obtains $L$ spectral-temporal features $\{\mathbf{S}_l\}_{l=1}^{L}$ as:

$$\mathbf{S}_l = (s_l \circ s_{l-1} \circ \cdots \circ s_1 \circ c)(\mathbf{X}),$$

where $s_l$, $c$, and $\circ$ respectively denote the $l$-th separable convolution, the first temporal convolution, and function composition between arbitrary functions, i.e., $(g \circ h)(x) = g(h(x))$. Thus, by extracting the features $\{\mathbf{S}_l\}$, the MSNN effectively represents spectral-temporal features from a multi-scale viewpoint, thereby enhancing generalization. In addition, as all inputs are zero-padded before each separable temporal convolution, the output features keep the same channel and timepoint dimensions and differ only in the feature map dimension; the $l$-th spectral-temporal feature thus has the form $\mathbb{R}^{F_l \times C \times T}$.
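The idea of tapping every intermediate activation can be mimicked with a toy NumPy pipeline: fixed moving-average kernels stand in for the learned separable convolutions, and zero padding ('same' mode) keeps the time axis intact at every scale.

```python
import numpy as np

def temporal_conv_same(F, kernel):
    """Zero-padded ('same') temporal convolution applied to every feature map."""
    return np.stack([np.convolve(row, kernel, mode='same') for row in F])

# Toy stand-in for the learned separable layers: three fixed averaging kernels
# of growing length, so deeper activations reflect longer-range (lower-frequency)
# structure; every intermediate output is kept, as in the text above.
rng = np.random.default_rng(0)
F0 = rng.standard_normal((4, 256))   # (feature maps, timepoints) after the first conv
activations = []
F = F0
for k in (5, 15, 45):
    F = temporal_conv_same(F, np.ones(k) / k)
    activations.append(F)            # keep every intermediate feature set
```

Thanks to the padding, all collected scales share the channel/timepoint dimensions and can later be concatenated along the feature-map axis.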
III-C Spatial Feature Representation Block
In the spatial feature representation block, a common spatial convolution is used for feature extraction. In this block, the kernel size is constrained to equal the number of EEG channels, i.e., a convolution with a $C \times 1$ kernel is used. By setting the kernel size to the number of electrode channels, similar to many existing deep learning-based BCI methods [27, 14, 16], the proposed MSNN extracts spatial information over the original EEG electrode montage from the multi-scale spectral-temporal features, and can thereby obtain neurophysiologically plausible information from the input data distribution.
Furthermore, the spatial feature representation can be applied without restriction, so in the proposed method we add this block after every extracted spectral-temporal feature $\mathbf{S}_l$:

$$\mathbf{U}_l = v_l(\mathbf{S}_l),$$

where $v_l$ denotes the $l$-th spatial convolution and $\mathbf{U}_l$ is the spatio-spectral-temporal feature estimated from $v_l$ and $\mathbf{S}_l$. We use valid padding for every spatial convolution, so the $l$-th spatio-spectral-temporal feature has the form $\mathbb{R}^{F_l \times 1 \times T}$. By setting the number of spatial convolutions to be identical to the number of spectral-temporal convolutions, unlike much previous deep learning research for BCI [27, 14, 16, 3, 31], we extract spatial features from each range of spectral-temporal features. In other words, unlike many previous stacked CNNs, the proposed architecture uses every intermediate activated feature set to exploit spatial information, thereby gaining the capability to extract various ranges of EEG features at multiple scales.
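A spatial convolution whose kernel spans the whole montage reduces to a learned linear combination of channels. A minimal sketch (random stand-in data; a single spatial kernel is shared across feature maps here purely for brevity):

```python
import numpy as np

# A spatial kernel spanning all C channels collapses (F, C, T) to (F, 1, T)
# under valid padding: each output timepoint is a weighted sum of channels.
rng = np.random.default_rng(1)
F, C, T = 16, 8, 128
S = rng.standard_normal((F, C, T))   # one spectral-temporal feature set
w = rng.standard_normal(C)           # spatial kernel over the montage
U = np.einsum('c,fct->ft', w, S)[:, None, :]
```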
III-D Classification Block
For classifier learning, because we have $L$ spatio-spectral-temporal features $\mathbf{U}_l$ of different (or, when $F_1 = \cdots = F_L$, identical) sizes, the classifier in the proposed method first concatenates the features along the feature map dimension:

$$\mathbf{Z} = [\mathbf{U}_1; \mathbf{U}_2; \ldots; \mathbf{U}_L],$$

where $[\cdot\,;\cdot]$ denotes the concatenation operation.
For the classifier network, let $K$ denote the number of output classification nodes, and assume a single linear mapping layer. We would then need to train $T \cdot \sum_{l=1}^{L} F_l \cdot K$ parameters (disregarding the bias term for convenience), because $\mathbf{Z}$ has the form $\mathbb{R}^{(\sum_l F_l) \times 1 \times T}$, and this would still require a large number of training samples. Therefore, after representing the multi-scale spatio-spectral-temporal features of the input EEG, the proposed MSNN applies one extra operation to reduce the trainable weights: unlike existing deep learning-based BCI methods [16, 27, 14, 15, 3, 31], global average pooling (GAP), which is widely used in the computer vision field, is performed.
The GAP layer, a type of pooling layer, averages the nodes of each feature map, eliminating the need for a window size or stride. By applying GAP, the MSNN efficiently retains significant features. From a BCI perspective, the GAP layer can be understood as a method that emphasizes an important frequency range and its surrounding area within each feature map dimension. Thus, for the extracted multi-scale features in the MSNN, the GAP layer stresses the crucial spectral-temporal parts, yielding concise information for the final decision making.
Additionally, the GAP layer significantly reduces the number of classifier parameters in the proposed MSNN. Specifically, after the GAP layer, the extracted feature is reduced to the form $\mathbb{R}^{(\sum_l F_l) \times 1 \times 1}$, whereas the feature without GAP has the form $\mathbb{R}^{(\sum_l F_l) \times 1 \times T}$. Therefore, we drastically reduce the trainable parameters of the classifier from $T \cdot \sum_l F_l \cdot K$ to $\sum_l F_l \cdot K$.
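The reduction is easy to verify numerically; the sizes below (96 concatenated feature maps, 512 timepoints, 2 classes) are toy values chosen for illustration.

```python
import numpy as np

# Without GAP, a dense classifier sees every timepoint of every feature map;
# with GAP, each map collapses to a single scalar first.
F_total, T, K = 96, 512, 2
Z = np.random.default_rng(2).standard_normal((F_total, T))

z_gap = Z.mean(axis=1)               # GAP: one scalar per feature map
dense_without_gap = F_total * T * K  # 98304 weights (bias ignored)
dense_with_gap = F_total * K         # 192 weights
```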
Then, the MSNN prediction $\hat{\mathbf{y}}$ for the input EEG data $\mathbf{X}$ is:

$$\hat{\mathbf{y}} = \mathrm{softmax}\left(\mathbf{W}^{\top} \mathrm{GAP}(\mathbf{Z}) + \mathbf{b}\right),$$

where $\mathbf{W}$ and $\mathbf{b}$ respectively denote the weight matrix and bias of the classifier.
Finally, the cross-entropy loss $\mathcal{L}$ used for network training is calculated from the prediction and the label:

$$\mathcal{L} = -\frac{1}{B} \sum_{b=1}^{B} \mathbf{y}_b^{\top} \log \hat{\mathbf{y}}_b,$$

where $B$ denotes the mini-batch size, and $\hat{\mathbf{y}}_b$ and $\mathbf{y}_b$ denote the prediction and ground-truth label of the $b$-th training sample in the mini-batch. (All codes used in our experiments are available at https://github.com/DeepBCI/Deep-BCI/tree/master/1_Intelligent_BCI/Multi_Scale_Neural_Network_for_EEG_Representation_Learning_in_BCI.)
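For reference, the mini-batch cross-entropy can be written out in a few lines of NumPy (one-hot labels; a maximally uncertain two-class prediction costs ln 2 per sample):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood over the mini-batch (labels are one-hot)."""
    p = softmax(logits)
    return -np.mean(np.sum(labels * np.log(p + 1e-12), axis=1))

# All-zero logits give a uniform 2-class prediction, costing ln 2 per sample.
logits = np.zeros((4, 2))
labels = np.eye(2)[[0, 1, 0, 1]]
loss = cross_entropy(logits, labels)  # ~0.6931
```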
IV Experiments
In this section, we describe the datasets used for performance evaluation, our experimental settings, and the baseline settings. Furthermore, we present the performance of our method and the competing methods.
IV-A Datasets and Preprocessing
In this study, we used five different publicly available datasets to validate the proposed method on four different EEG data paradigms.
IV-A1 Motor Imagery
First, we used two large MI EEG datasets: GIST-MI (available at http://gigadb.org/dataset/100295) and KU-MI (available at http://gigadb.org/dataset/100542; experimental results on the KU-MI dataset are reported in Supplementary B). The GIST-MI dataset consists of two MI tasks, left-hand and right-hand imagery, acquired from 52 subjects. All EEG signals were recorded from 64 Ag/AgCl electrodes placed according to the standard 10-20 system and sampled at 512 Hz. Each class contained 100 or 120 trials, and each trial was a 3 sec MI task. Because this dataset is not split into training and test samples, we conducted five-fold cross-validation for a fair evaluation. For the MI datasets, we preprocessed the signals by applying large Laplacian filtering (when a target channel did not have four nearest neighbors, we used the available neighbors and their average value), baseline correction by subtracting the mean of the fixation signal from each MI trial, and band-pass filtering between 4 and 40 Hz. We then removed the first and last 0.5 sec of each trial and finally applied Gaussian normalization. The same mean and standard deviation values were applied to normalize the test samples; the multi-channel EEG signals were only shifted and scaled by their respective channel-wise statistics, so the inter-channel relations inherent in the data were preserved.
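The normalization step described above (channel-wise statistics fitted on training data and reused unchanged on test data) can be sketched as follows; the random arrays stand in for preprocessed EEG trials.

```python
import numpy as np

def fit_channel_stats(train):
    """Channel-wise mean/std over all training trials and timepoints.

    train: (trials, channels, timepoints).
    """
    mu = train.mean(axis=(0, 2), keepdims=True)
    sd = train.std(axis=(0, 2), keepdims=True)
    return mu, sd

# Hypothetical data standing in for preprocessed EEG trials.
rng = np.random.default_rng(3)
train = 5.0 + 2.0 * rng.standard_normal((40, 64, 256))
test = 5.0 + 2.0 * rng.standard_normal((10, 64, 256))

mu, sd = fit_channel_stats(train)
train_n = (train - mu) / sd
test_n = (test - mu) / sd  # same shift/scale, so inter-channel relations are kept
```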
IV-A2 Steady-State Visually Evoked Potentials
We also used the KU-SSVEP dataset (available at http://gigadb.org/dataset/100542) for the SSVEP decoding experiments in this study. The KU-SSVEP dataset was acquired from 54 subjects and recorded from 62 Ag/AgCl electrodes using the 10-20 system. It contains four EEG classes corresponding to target stimuli at 5.45, 6.67, 8.57, and 12 Hz, and each class has 25 training and 25 testing trials per session. We preprocessed the SSVEP signals by band-pass filtering between 4 and 15 Hz and selected eight channels over the occipital region (PO3, POz, PO4, PO9, O1, Oz, O2, and PO10), as this region is widely used for SSVEP classification.
IV-A3 Drowsiness
With respect to passive BCI, we considered two further paradigms: seizure EEG signals and vigilance EEG signals. Owing to its theoretical and practical relevance, we conducted experiments identifying drivers' mental fatigue, using the publicly available SEED-VIG dataset (available at http://bcmi.sjtu.edu.cn/seed/download.html) for the drowsy driving task. This dataset consists of 23 experiments (trials), each recorded for approximately 2 hours of simulated driving. The EEG signals were acquired from 17 electrodes according to the 10-20 system and sampled at 200 Hz. For this dataset, we band-pass filtered the EEG signals between 0.5 and 40 Hz and used epochs of 8 sec in length. Because the dataset is labeled with continuous PERCLOS levels, we categorized the label vectors into three classes, awake, tired, and drowsy, using two threshold values (0.35 and 0.7). Then, for each of the 23 experiments, five-fold cross-validation was used for performance estimation.
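The three-class PERCLOS labeling can be reproduced with a one-liner; how boundary values (exactly 0.35 or 0.7) are assigned is an assumption of this sketch, not stated in the text.

```python
import numpy as np

# 0 = awake, 1 = tired, 2 = drowsy, with the thresholds 0.35 and 0.7 used above.
# np.digitize assigns a value equal to a bin edge to the higher class.
perclos = np.array([0.10, 0.34, 0.35, 0.50, 0.69, 0.70, 0.95])
labels = np.digitize(perclos, bins=[0.35, 0.7])  # -> [0 0 1 1 1 2 2]
```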
IV-A4 Seizure
Finally, we conducted seizure onset detection experiments with the widely used and publicly available CHB-MIT dataset (available at https://physionet.org/content/chbmit/1.0.0/). The CHB-MIT dataset contains EEG data from 24 subjects, sampled at 256 Hz and acquired from 23 electrodes (24 or 26 in a few cases) according to the 10-20 system. In this work, we selected the EEG trials sharing the same 23-channel montage and removed the trials acquired with a different montage. Following prior work, we used leave-one-record-out cross-validation: we trained the proposed method on all non-seizure records and all but one of the seizure records, and tested the model on the remaining seizure record. This process was repeated over every seizure record in the dataset, so each seizure record was tested once. For training, the epochs were 10 sec in length. During the validation and testing sessions, 10 sec EEG segments were input to the proposed network with a 1/256 sec stride, and we observed whether the probability value at each EEG timepoint indicated an ictal or a normal state.
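At test time this sliding-window evaluation amounts to one prediction per sample that can anchor a full 10 sec window. A small sketch of the window bookkeeping, using a hypothetical 60 sec record:

```python
import numpy as np

def window_starts(n_samples, win, stride):
    """Start indices of every full window of length `win` over `n_samples`."""
    return np.arange(0, n_samples - win + 1, stride)

fs = 256
record = 60 * fs   # a hypothetical 60 sec record
win = 10 * fs      # 10 sec input, as in the protocol above
starts = window_starts(record, win, stride=1)  # one-sample (1/256 sec) step
```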
For all datasets, the training samples were randomly split into training and validation samples for model selection. Specifically, we divided the training samples at a 9:1 ratio for each subject and used them for training and model selection, respectively.
TABLE I: Performance comparison across the four paradigms: classification accuracy (Mean±Std) for GIST-MI and KU-SSVEP; numbers of false detections for SEED-VIG and CHB-MIT.

| Method | GIST-MI (Acc., Mean±Std) | KU-SSVEP (Acc., Mean±Std) | SEED-VIG (False detections, Mean±Std) | SEED-VIG (False positives, drowsy) | CHB-MIT (False detections, Mean (latency)) |
|---|---|---|---|---|---|
| CSP + LDA | .66±.14 | - | - | - | - |
| FBCSP + LDA | .68±.15 | - | - | - | - |
| PSD + SVM | - | - | 31.20±15.47 | 6.74 | - |
| Shoeb and Guttag | - | - | - | - | 5.35 (5.11) |
| Shallow ConvNet | .63±.11 | .52±.20 | 34.89±19.13 | 6.51 | 19.21 (8.48) |
| Deep ConvNet | .61±.07 | .96±.08 | 41.31±21.04 | 8.65 | 8.74 (7.52) |
| RSTNN | .69±.12 | .65±.20 | 39.84±22.56 | 8.08 | 24.35 (9.31) |
| ESTCNN | .67±.10 | .79±.17 | 41.10±21.31 | 8.71 | 6.41 (7.01) |
| EEGNet [16, 32] | .64±.07 | .93±.10 | 46.63±22.10 | 11.26 | 5.40 (6.23) |
| MSNN (Proposed) | .81±.12 | .93±.08 | 31.10±17.29 | 5.38 | 5.35 (4.98) |
IV-B Experimental Settings
In our work, we compared our method with paradigm-specific linear model-based and deep learning-based methods for each EEG paradigm.
IV-B1 Linear Models - Motor Imagery
First, we built a CSP with linear discriminant analysis (CSP + LDA) and an FBCSP with LDA (FBCSP + LDA) for MI decoding, using four filters and regularized covariances for both CSP and FBCSP. For FBCSP, we additionally used nine non-overlapping filter banks over the 4–40 Hz range, i.e., 4–8, 8–12, …, 36–40 Hz, and finally selected 10 features using the mutual information-based feature selection of FBCSP.
IV-B2 Linear Models - Steady-State Visually Evoked Potentials
We also built a standard CCA for SSVEP classification, with reference signals for each stimulus frequency including the second harmonics. Because the standard CCA requires no training samples for optimization, we estimated its performance over each entire session of the KU-SSVEP dataset.
IV-B3 Linear Models - Drowsiness
IV-B4 Linear Models - Seizure
In addition, we also reimplemented Shoeb and Guttag's method for the seizure onset detection experiment. We applied the PSD to the EEG data in a channel-wise manner, used the 3 sec time-window time-evolution method to capture temporal information, and finally fed the resulting spatio-spectral-temporal features into an SVM with an RBF kernel.
IV-B5 Deep Neural Networks - Motor Imagery
We also implemented deep learning-based BCI models for MI (see 'Appendix A: Architectural Details of Deep Models for BCIs' for detailed architectures and learning schedules). Most existing deep learning models [27, 14, 7, 9] focus on a paradigm-specific BCI task; however, we ran every deep learning model over all datasets to demonstrate the validity of the proposed method. We built the Shallow ConvNet and Deep ConvNet as proposed by Schirrmeister et al. The Shallow ConvNet consists of two convolutions, temporal and spatial, with a squaring nonlinearity, average pooling, and a logarithmic activation. The Deep ConvNet has five convolutions: temporal and spatial convolutions plus three additional temporal convolutions. The RSTNN was also used in these experiments; it consists of three recurrent convolutional layers, each containing three recurrent temporal convolutions and a spatial convolution.
IV-B6 Deep Neural Networks - Steady-State Visually Evoked Potentials
IV-B7 Deep Neural Networks - Drowsiness
IV-B8 Deep Neural Networks - Multi-paradigm
Finally, we also implemented EEGNet in our study. As previously mentioned, we used different kernel sizes for two different EEGNet configurations; nevertheless, the basic architecture was the same across EEG paradigms, comprising a temporal convolution, a depthwise spatial convolution, and a separable temporal convolution.
IV-B9 Proposed Multi-Scale Neural Network
While training our proposed network, depicted in Fig. 2, we used a mini-batch size of 16, an exponentially decreasing learning rate (initial value: 0.03, decrease ratio per epoch: 0.001), and the Adam optimizer. For the first layer, we used a conventional temporal convolution; we then used three spectral-temporal feature representation convolutions, each with its own kernel size and feature map dimension. For the spatial feature representation block, we used three spatial convolutions, because the number of spatial convolutional layers must match the number of spectral-temporal separable convolutional layers. The proposed method used different kernel sizes for the SSVEP dataset, similar to EEGNet, because SSVEP EEG data are driven by the target frequencies [18, 23]; for the KU-SSVEP dataset we accordingly adjusted the kernel sizes of the spectral-temporal feature representation block while keeping the other settings the same, and the SSVEP classification performance reported here was estimated with this modified configuration. Additionally, batch normalization was performed after every convolution. Finally, for the classification block, all activated features from the spatio-spectral-temporal blocks were concatenated and fed into the GAP layer; after flattening, the multi-scale features were linearly mapped by a dense layer. A leaky rectified linear unit (ReLU) activation, an L1-L2 regularizer, and Xavier initialization were used for all tunable parameters, except that the final decision layer was activated by a softmax function instead of a leaky ReLU. We selected the model configuration that showed the best performance on the validation (model selection) samples, as mentioned previously.
IV-C Experimental Results
IV-C1 Motor Imagery
All experimental results are summarized in TABLE I. Our proposed network clearly outperformed the other baselines for MI EEG decoding. Importantly, it achieved higher accuracy than the methods designed specifically for MI classification: CSP, FBCSP, Shallow ConvNet, Deep ConvNet, and RSTNN. Given this clear improvement in accuracy, we expect that our proposed method brings MI-based BCIs a step closer to commercialization.
IV-C2 Steady-State Visually Evoked Potentials
Our proposed MSNN achieved slightly lower performance than CCA, Deep ConvNet, and EEGNet in the SSVEP classification. However, the performance difference between the MSNN and these three baselines was reasonably small, and the proposed method still attained a credible accuracy.
IV-C3 Drowsiness Detection
The proposed MSNN made the smallest number of decision errors in the passive BCI setting . In particular, the proposed method detected a driver's mental fatigue, i.e., drowsiness, from EEG signals, misclassifying 31.10 of 177 test trials on average. Accurately detecting the drowsy state is one of the most important MSNN capabilities for practical use: our model made only 5.38 errors out of 35 drowsy trials on average, thus exhibiting the highest precision score.
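The precision score referred to above can be computed from the confusion-matrix counts as follows; the counts in the usage example are hypothetical, not the paper's exact numbers:

```python
def precision(tp, fp):
    # precision = TP / (TP + FP): of all trials predicted 'drowsy',
    # the fraction that were actually drowsy.
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

# hypothetical example: 30 correct drowsy predictions, 5 false alarms
print(precision(30, 5))  # 0.857...
```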
IV-C4 Seizure Detection
Finally, the MSNN misidentified only 5.35 of 178 total test seizure samples. Furthermore, our proposed network was the fastest at detecting seizures, exhibiting the shortest latency (approximately 4.98 s on average) among the compared methods; it thus demonstrated the best performance despite the shortest latency. The proposed model correctly identified approximately 92% of the seizures within 4.98 s. We do not report standard deviations for this seizure detection experiment because each test trial contained a different number of seizures.
V Analyses and Discussions
In this section, we analyze the proposed network. We examine its feature responses by estimating PSD values and relevance scores  to show the benefits of multi-scale learning. We also visualize the learned weights and the represented features using two methodologies: activation pattern maps  and t-SNE plots. Additionally, we examine practical uses of the proposed method, particularly in the drowsiness and seizure detection experiments.
V-a Multi-Scale EEG Feature Extraction
To demonstrate the ability of the proposed method to capture multi-scale information, we estimated and plotted PSD values and relevance scores  for MI EEG samples. Specifically, we estimated PSD values for subjects 48 and 52 of the GIST-MI dataset , using EEG samples from channels over the motor cortex, and calculated relevance scores for those subjects via layer-wise relevance propagation . In our results, all classification methods generalized well for subject 48 (baselines: 80%; proposed: 85%), whereas only the proposed method achieved superior performance for subject 52 (baselines: 65%; proposed: 80%). As Fig. 3 shows, subject 48's EEG samples are highly activated in a narrow low-frequency range, while subject 52's samples show no clear trend in that range but rather activity over a wider range. Accordingly, our network exhibited high relevance scores in the low-frequency range for subject 48, who showed a clear trend there, whereas the relevance scores for subject 52 were spread roughly evenly over the wider range in which subject 52's PSD showed a less clearly defined trend.
From this observation, we conclude that the proposed MSNN can capture important features over a multi-scale range, not only at a single frequency of interest. In other words, while existing methods gather spatio-spectral-temporal information sequentially, the proposed network exploits multi-scale features, thereby improving its learning ability. (Randomly selected additional results are reported in Supplementary C.)
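The PSD estimates above can be reproduced with a standard Welch-style average of windowed periodograms. The sketch below is a minimal NumPy implementation; the sampling rate and segment length are illustrative placeholders, not the paper's exact settings:

```python
import numpy as np

def welch_psd(x, fs=250.0, nperseg=256):
    """Welch-style PSD: average of Hann-windowed periodograms over
    50%-overlapping segments (one-sided, density-normalized sketch)."""
    step = nperseg // 2
    win = np.hanning(nperseg)
    segs = [x[i:i + nperseg] for i in range(0, len(x) - nperseg + 1, step)]
    psd = np.zeros(nperseg // 2 + 1)
    for s in segs:
        psd += np.abs(np.fft.rfft(s * win)) ** 2
    psd /= (len(segs) * fs * np.sum(win ** 2))
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, psd

# e.g., a 10 Hz oscillation sampled at 250 Hz peaks near 10 Hz:
t = np.arange(2500) / 250.0
freqs, psd = welch_psd(np.sin(2 * np.pi * 10 * t))
```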
V-B Activation Patterns
Earlier, Haufe et al.  proposed activation patterns based on forward-backward modeling in signal processing. The activation pattern method  provides a way to interpret the weight matrices of multivariate neuroimaging models.
The proposed method decodes the input EEG signal into the corresponding label, i.e., it infers a user's intention or condition from an observed EEG pattern; hence, it is a backward (decoding) model. For a concrete and neurophysiologically meaningful interpretation of the learned layers, it is essential to transform this backward model into the corresponding forward model. Therefore, we estimated and visualized the activation patterns of the learned weights, shown in Fig. 3(a). We extracted the spatial convolution weights of Shallow ConvNet , Deep ConvNet , RSTNN , EEGNet , and the proposed model, then estimated their activation patterns and visualized them topographically. We did not estimate activation patterns for ESTCNN , because it has no spatial feature representation layers. The visualized patterns were estimated from the first subject's first fold data in the GIST-MI dataset  and were normalized to the [0, 1] range before visualization.
In this investigation, we observed right-lateralized brain activation/deactivation patterns when a user imagined left-hand movement, and the corresponding left-hemisphere patterns for right-hand imagery. Furthermore, the proposed model shows relatively clearer patterns than the other models; thus, we conclude that our method thoroughly represents the spatial features of the input EEG signals.
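For a linear backward model with extraction filters W, the forward-model transformation of Haufe et al. is A = Cov(X) W Cov(S)^-1, where S = XW are the extracted components. A minimal sketch, with randomly generated data standing in for EEG:

```python
import numpy as np

def activation_patterns(X, W):
    """Haufe et al. (2014) transformation: A = Cov(X) W Cov(S)^-1.
    X: (n_samples, n_channels) data; W: (n_channels, n_components)
    spatial filters; returns A: (n_channels, n_components) patterns."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov_x = (Xc.T @ Xc) / (len(Xc) - 1)
    S = Xc @ W                      # component time courses
    cov_s = (S.T @ S) / (len(S) - 1)
    return cov_x @ W @ np.linalg.pinv(cov_s)

# illustrative use on synthetic data (8 channels, 3 components)
rng = np.random.default_rng(0)
A = activation_patterns(rng.standard_normal((1000, 8)),
                        rng.standard_normal((8, 3)))
```

Each column of A can then be drawn as a scalp topography, as in Fig. 3(a).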
V-C Discriminative Power of EEG Representations
To validate the representation ability of the proposed network, we plotted t-SNE-transformed learned features, shown in Fig. 3(b). Specifically, we show features extracted from the test SSVEP EEG samples at the first, second, and third spatio-spectral-temporal feature representation layers, i.e., , , and  (the first three panels in Fig. 3(b)), together with the final learned feature, i.e., . The intermediate features were temporally pooled for visualization only, as the final feature is. We used the first subject's first session data in the KU-SSVEP dataset , with a perplexity of 10 and a learning rate of 200 for the t-SNE computation and visualization.
From these visualized features, we observe that the final learned feature is more class-discriminative than the intermediate features. More generally, we observe a trend in which features learned by deeper layers are more disentangled than those learned by shallower layers.
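The t-SNE embedding can be reproduced roughly as below. The random features are placeholders for the pooled layer activations, while the perplexity (10) and learning rate (200) match the values reported above:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# stand-in for pooled features: 4 classes x 15 trials, 16-dim each
feats = np.concatenate([rng.normal(c, 0.5, size=(15, 16))
                        for c in range(4)])

# 2-D embedding with the hyperparameters stated in the text
emb = TSNE(n_components=2, perplexity=10, learning_rate=200,
           init="pca", random_state=0).fit_transform(feats)
```

Plotting `emb` colored by class then yields panels like those in Fig. 3(b).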
V-D Mental Fatigue Classification
For the application analysis of drowsiness detection, we visualized the confusion matrices estimated from the experimental results on the SEED-VIG dataset  in Fig. 3(c). Because the labels identifying mental status were determined by PERCLOS levels , labels near the boundary between two classes may be inaccurate. In this respect, the proposed method is well suited to drowsiness detection, because its false detections occur mostly at the boundaries between classes, e.g., 'awake' vs. 'tired' or 'tired' vs. 'drowsy'. In addition, for practical applications, it is essential to detect the drowsy state accurately to avoid dangerous situations such as a car accident. The proposed method achieved the most promising result for detecting drowsiness among all baselines, i.e., the highest precision score for the drowsy state. Therefore, we expect that our method can be applied in real-world situations.
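A confusion matrix like the one in Fig. 3(c) can be tallied directly from the predictions. The three-class coding below (0: awake, 1: tired, 2: drowsy) follows the PERCLOS-based scheme described above, while the example predictions are hypothetical:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    # cm[i, j] counts trials of true class i predicted as class j
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# hypothetical trials: errors land at adjacent-class boundaries
cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 2, 2, 2])
```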
V-E Early Seizure Detection
Early detection  of seizures is one of the most important potential practical applications of this work. Hence, we also validated the benefits of the proposed method in early seizure detection. Specifically, in the training phase, the MSNN was trained on normal and ictal EEG samples with binary labels (0: normal, 1: seizure), as in a conventional training framework. In the testing phase, we fed in EEG samples using a sliding window with a 1/256 stride, and observed the change in the output probability to determine the character of the input (normal or ictal).
Additionally, we visualized these changes in Fig. 3(d), using the first subject's third EEG trial in the CHB-MIT dataset . In Fig. 3(d), the magenta dot-dashed lines denote seizure onset and offset, and the colored solid lines denote the probability traces of the various methods. This visualization shows that the proposed method detects seizures more stably: it assigns the seizure state a strong probability (almost 1), whereas the other methods produce low confidence values (Shoeb and Guttag 's method and ESTCNN ) or even make incorrect decisions about the seizure state (Shallow ConvNet , Deep ConvNet , RSTNN , and EEGNet ).
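The sliding-window decision rule can be sketched as follows; the 0.5 threshold is an assumption, and the probability trace is synthetic rather than an actual model output:

```python
def onset_latency(probs, fs=256.0, stride_samples=1, threshold=0.5):
    """Return the latency (s) of the first window whose seizure
    probability crosses `threshold`, or None if none does.
    Windows advance by `stride_samples` at sampling rate `fs`."""
    for i, p in enumerate(probs):
        if p >= threshold:
            return i * stride_samples / fs
    return None

# synthetic trace: seizure confidence jumps after window 512,
# giving a detection latency of 512 / 256 = 2.0 s
trace = [0.05] * 512 + [0.95] * 64
latency = onset_latency(trace)
```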
VI Conclusion
In this work, we proposed a novel and compact deep multi-scale neural network that learns multi-scale EEG signal features. In our experiments, we validated the architecture's effectiveness across diverse EEG paradigms: MI, SSVEP, seizure, and drowsy EEG signals. Furthermore, we inspected relevance scores to demonstrate the benefits of multi-scale feature extraction, investigated activation pattern maps to understand what neurophysiological phenomena our CNN model learned, and visualized t-SNE plots of the learned features to examine their class discriminability. Finally, we demonstrated that the proposed method can be used for precise drowsiness detection and early seizure detection. In all these respects, we conclude that the proposed deep multi-scale neural network offers significant potential for interpreting EEG signals. Additionally, because the proposed network generalizes to various EEG paradigms, it is expected to combine well with neural architecture search methods , thereby making deep learning-based BCIs adaptable to different paradigms.
From a practical standpoint, many limitations remain with regard to inter-subject variation  in performance. In the present work, we experimented in a subject-dependent manner, whereas for general use a BCI system should operate for any subject in a subject-independent way. Thus, in future work, we will focus on developing a subject-neutral multi-paradigm BCI system using adversarial learning [8, 13] or other learning strategies .
Acknowledgment
This work was supported by the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (No. 2017-0-00451, Development of BCI based Brain and Cognitive Computing Technology for Recognizing User's Intentions using Deep Learning).
References
- (2008) Filter Bank Common Spatial Pattern (FBCSP) in Brain-Computer Interface. In IEEE International Joint Conference on Neural Networks, pp. 2390–2397.
- (2018) Passive BCI Beyond the Lab: Current Trends and Future Directions. Physiological Measurement 39 (8), pp. 08TR02.
- (2019) SeizureNet: A Deep Convolutional Neural Network for Accurate Seizure Type Classification and Seizure Detection. arXiv preprint arXiv:1903.03232.
- (2008) Optimizing Spatial Filters for Robust EEG Single-trial Analysis. IEEE Signal Processing Magazine 25 (1), pp. 41–56.
- (2017) EEG Datasets for Motor Imagery Brain–Computer Interface. GigaScience 6 (7), pp. gix034.
- (2017) Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258.
- (2019) Seizure Detection by Convolutional Neural Network-based Analysis of Scalp Electroencephalography Plot Images. NeuroImage: Clinical 22, pp. 101684.
- (2016) Domain-Adversarial Training of Neural Networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030.
- (2019) EEG-Based Spatio-Temporal Convolutional Neural Network for Driver Fatigue Evaluation. IEEE Transactions on Neural Networks and Learning Systems.
- (2010) Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
- (2014) On the Interpretation of Weight Vectors of Linear Models in Multivariate Neuroimaging. NeuroImage 87, pp. 96–110.
- (2016) Transfer Learning in Brain-Computer Interfaces. IEEE Computational Intelligence Magazine 11 (1), pp. 20–31.
- (2019) Domain Adaptation with Source Selection for Motor-Imagery based BCI. In 2019 7th International Winter Conference on Brain-Computer Interface (BCI), pp. 1–4.
- (2018) Deep Recurrent Spatio-Temporal Neural Network for Motor Imagery based BCI. In 2018 6th International Conference on Brain-Computer Interface (BCI), pp. 1–3.
- (2017) A Convolutional Neural Network for Steady State Visual Evoked Potential Classification Under Ambulatory Environment. PLoS One 12 (2), pp. e0172578.
- (2018) EEGNet: A Compact Convolutional Neural Network for EEG-based Brain–Computer Interfaces. Journal of Neural Engineering 15 (5), pp. 056013.
- (2017) Early Seizure Detection by Applying Frequency-based Algorithm Derived from the Principal Component Analysis. Frontiers in Neuroinformatics 11, pp. 52.
- (2019) EEG Dataset and OpenBMI Toolbox for Three BCI Paradigms: An Investigation into BCI Illiteracy. GigaScience 8 (5), pp. giz002.
- (2017) Learning without Forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947.
- (2015) Recurrent Convolutional Neural Network for Object Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3367–3375.
- (2013) Network in Network. arXiv preprint arXiv:1312.4400.
- (2017) Explaining Nonlinear Classification Decisions with Deep Taylor Decomposition. Pattern Recognition 65, pp. 211–222.
- (2015) A Comparison Study of Canonical Correlation Analysis based Methods for Detecting Steady-State Visual Evoked Potentials. PLoS One 10 (10), pp. e0140703.
- (2019) EEGNAS: Neural Architecture Search for Electroencephalography Data Analysis and Decoding. In International Workshop on Human Brain and Artificial Intelligence, pp. 3–20.
- (2018) Learning Temporal Information for Brain-Computer Interface Using Convolutional Neural Networks. IEEE Transactions on Neural Networks and Learning Systems.
- (2014) Sleep Stage Classification with Cross Frequency Coupling. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 4579–4582.
- (2017) Deep Learning with Convolutional Neural Networks for EEG Decoding and Visualization. Human Brain Mapping 38 (11), pp. 5391–5420.
- (2010) Application of Machine Learning to Epileptic Seizure Detection. In Proceedings of the 27th International Conference on Machine Learning, pp. 975–982.
- (2009) Application of Machine Learning to Epileptic Seizure Onset Detection and Treatment. Ph.D. Thesis, Massachusetts Institute of Technology.
- (2012) A Novel Bayesian Framework for Discriminative Feature Extraction in Brain-Computer Interfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2), pp. 286–299.
- (2017) DeepSleepNet: A Model for Automatic Sleep Stage Scoring based on Raw Single-channel EEG. IEEE Transactions on Neural Systems and Rehabilitation Engineering 25 (11), pp. 1998–2008.
- (2018) Compact Convolutional Neural Networks for Classification of Asynchronous Steady-State Visual Evoked Potentials. Journal of Neural Engineering 15 (6), pp. 066031.
- (2019) A Survey on Deep Learning based Brain Computer Interface: Recent Advances and New Frontiers. arXiv preprint arXiv:1905.04149.
- (2017) A Multimodal Approach to Estimating Vigilance using EEG and Forehead EOG. Journal of Neural Engineering 14 (2), pp. 026017.