An electroencephalogram (EEG) signal is a physiological electrical signal with characteristic features, generated when the brain receives a stimulus. It not only reflects the functional state of the brain effectively but also gives feedback on the current state of a person's physical function, and is therefore widely used in the analysis of neurological diseases, brain-computer interfaces, and the study of cognitive processes. With the help of EEG devices, it is convenient to obtain event-related potential (ERP) data in visual working memory tasks. A strong correlation has been found between individual working memory capacity and the signals produced by the brain nervous system that maintain memory over species. EEG-based classification methods [2, 22] are important for further analysis of the mental activity during working memory.
In [20], a deep convolutional neural network was used to classify EEG signals based on their spatio-temporal features, and the experimental results showed that it outperforms traditional support vector machine-based classification. Zhang et al. [25] combined convolutional and recurrent neural networks in a cascade or parallel manner, and the proposed model better extracts the spatio-temporal features of EEG signals.
These data-driven approaches have been extensively studied in subject-dependent scenarios, and experimental results show that they perform well for specific subjects on tasks such as EEG feature extraction and classification. In practice, however, feature discrepancies between subjects lead to unsatisfactory results when previously trained models are applied to newly introduced subjects. Moreover, calibrating the model requires a large amount of labeled data, which is time-consuming and expensive to collect. At present, a large number of traditional methods, such as kernel common spatial patterns [7, 1], Riemannian space methods, and PCA-based methods, have been applied to EEG transfer learning. Deep learning, as a class of end-to-end methods with powerful feature extraction capability, has also been gradually applied to EEG transfer learning [11, 6, 24]. However, most deep transfer learning methods for EEG lack the ability to handle multi-source data that carry both spatial and temporal information, and how to use transfer learning to solve the brain state recognition task in the working memory EEG field has rarely been investigated.
Inspired by the study on the transferability of deep learning by Long et al. [13], in this paper we propose a cross-subject deep adaptation with spatial attention method, named CS-DASA, that can effectively transfer task models between subjects. The key idea of this work is to establish a neural network architecture that effectively extracts spatio-temporal features and accomplishes subject-to-subject transfer learning. To achieve this goal, we first transform the original 1-D working memory EEG data into "multi-frame EEG images", defined in Section II, which incorporate more spatio-temporal information. Then, CS-DASA feeds the EEG image data from both subjects together into ConvLSTM, which extracts the feature representation shared by both subject domains. Subsequently, the learned feature representations are fed into subject-specific 2-D convolution layers, whose parameters can be fine-tuned for each subject. Meanwhile, the domain discrepancy between the pair of subjects is calculated from each pair of subject-specific layers by MMD in a reproducing kernel Hilbert space. A joint optimization that adds the goal of reducing this domain discrepancy to the loss function of the source-domain task then carries out the cross-subject domain adaptation. Additionally, an attention mechanism that focuses on the discriminative spatial information between the two subject domains is implemented to improve the adaptation performance.
The overall framework of the proposed CS-DASA is shown in Fig. 1. As shown in the figure, CS-DASA mainly consists of three parts: a shared feature representation module with several ConvLSTM layers, a subject-specific feature extraction module with several 2-D convolution layers, and a spatial attention block. We first introduce some preliminaries, including the definition of transfer learning, and then detail each component of the proposed model.
Definition 1 (multi-frame EEG images): In most memory operation tasks, oscillatory cortical activity primarily falls into three frequency bands: theta (4-7 Hz), alpha (8-13 Hz) and beta (13-30 Hz). Using the topology of the 3-D electrode positions, we apply the azimuthal equidistant projection to obtain the 2-D projected locations of the electrodes and construct a single image with 3 channels [3]. Finally, multi-frame EEG images are constructed with a time window in each trial, and a multi-frame EEG image can be represented as $X \in \mathbb{R}^{T \times C \times W \times H}$, where $T$ is the number of frames, and $C$, $W$ and $H$ are the number of channels, the width and the height of each EEG image, respectively.
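The projection step in Definition 1 can be illustrated with a minimal sketch. It assumes the projection is centred on the top of the head (the +z axis), so that the 2-D radius of an electrode equals its angular distance from that pole; the function name and the electrode coordinates are hypothetical and are not taken from the dataset.

```python
import math


def azimuthal_equidistant_projection(x, y, z):
    """Project a 3-D electrode position onto a 2-D plane.

    Assumption: the projection is centred on the +z axis. The 2-D distance
    from the origin equals the angle between the electrode and that axis,
    and the azimuth around the z axis is preserved.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("electrode position must not be the origin")
    # Clamp to guard against rounding slightly outside [-1, 1].
    polar_angle = math.acos(max(-1.0, min(1.0, z / r)))
    azimuth = math.atan2(y, x)
    return polar_angle * math.cos(azimuth), polar_angle * math.sin(azimuth)


# Hypothetical unit-sphere electrode coordinates, for illustration only.
electrodes = {
    "vertex": (0.0, 0.0, 1.0),
    "front": (1.0, 0.0, 0.0),
    "left": (0.0, 1.0, 0.0),
}
projected = {
    name: azimuthal_equidistant_projection(*pos)
    for name, pos in electrodes.items()
}
```

The projected 2-D locations would then be used to interpolate per-band values onto an image grid; that interpolation step is not shown here.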
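Definition 2 (EEG classification with domain adaptation transfer): Given a fully labeled source domain $\mathcal{D}_s$ and a target domain $\mathcal{D}_t$ containing $n_l$ labeled samples ($n_l$ can be 0) and $n_u$ unlabeled samples, EEG classification transfer learning aims to use the knowledge learned from $\mathcal{D}_s$ to obtain the mapping function $f$ from samples to labels in the target domain. The premise is that $\mathcal{X}_s = \mathcal{X}_t$, $\mathcal{Y}_s = \mathcal{Y}_t$, $P_s(x) \neq P_t(x)$, and/or $Q_s(y \mid x) \neq Q_t(y \mid x)$, where $\mathcal{X}_s$ and $\mathcal{X}_t$ represent the feature spaces of $\mathcal{D}_s$ and $\mathcal{D}_t$, $\mathcal{Y}_s$ and $\mathcal{Y}_t$ represent the corresponding label spaces, $P$ denotes the marginal probability distribution, and $Q$ denotes the conditional probability distribution.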
Definition 3 (one-to-one transfer): In this work, in the setting of cross-subject transfer, the source subject domain is one subject's EEG image data and the target subject domain is another subject's EEG image data. In the following sections, we refer to this setting as one-to-one transfer.
II-B Subject-Shared Representation Learning with ConvLSTM
The first part of the proposed model is composed of several ConvLSTM layers that extract common representation features from both subjects. Note that these ConvLSTM layers are trained in the source domain before subject-to-subject transfer, and their parameters are frozen during transfer learning. ConvLSTM performs convolution operations in both the input-to-state and state-to-state transitions. It can therefore capture more spatial information from multi-frame EEG images, such as the topology of the electrode locations, than LSTM, and extract more valuable temporal information than convolutional neural networks [17].
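ConvLSTM replaces matrix multiplication with a convolution operation for each gate in the LSTM cell. In this way, it captures the underlying spatial features by performing convolution operations on multidimensional data. Another major difference between ConvLSTM and LSTM is the number of input dimensions. Unlike most LSTMs, which receive one-dimensional input data, the ConvLSTM used in this paper accepts 3-D EEG image data. It is formulated as follows (shown here in the common form without peephole terms):

$$
\begin{aligned}
i_t &= \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + b_o) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c) \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$

where $t$ denotes the $t$-th step of ConvLSTM; $x_t$ denotes the input data; $h_t$ denotes the hidden state; $c_t$ denotes the state of the memory cell; $i_t$, $f_t$ and $o_t$ are the input gate, forget gate and output gate of ConvLSTM, respectively; $W$ and $b$ are the weights and biases to be learned; and $*$, $\circ$, $\sigma$ and $\tanh$ denote the convolution operation, element-wise multiplication, sigmoid function and tanh function. Let the mapping function $\Phi_l$ denote the $l$-th layer of the stacked ConvLSTM layers. The representation features of the source and target subject EEG image data can then be formulated as:

$$
R^s = \Phi_L(\cdots \Phi_1(X^s)), \qquad R^t = \Phi_L(\cdots \Phi_1(X^t))
$$

where $X^s$ and $X^t$ are the EEG image data of the source and target subjects, and $R^s$ and $R^t$ are the final feature representations of the two subject domains after $L$ ConvLSTM layers.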
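As an illustration of how a ConvLSTM gate replaces matrix multiplication with convolution, the following is a minimal single-channel sketch in plain Python. It assumes the common ConvLSTM form without peephole connections, zero-padded "same" convolutions, and scalar biases; the helper names (`conv2d_same`, `convlstm_step`) are hypothetical and this is not the authors' implementation.

```python
import math


def conv2d_same(image, kernel):
    """Zero-padded 2-D cross-correlation that keeps the input size."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    r, c = i + a - ph, j + b - pw
                    if 0 <= r < h and 0 <= c < w:
                        acc += kernel[a][b] * image[r][c]
            out[i][j] = acc
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def convlstm_step(x, h_prev, c_prev, weights, biases):
    """One ConvLSTM step for a single-channel frame (assumed form).

    weights[g] is a pair (input kernel, hidden kernel) and biases[g] a
    scalar, for g in "i", "f", "o" (gates) and "c" (candidate cell state).
    """
    rows, cols = len(x), len(x[0])

    def pre_activation(g):
        wx, wh = weights[g]
        cx, ch = conv2d_same(x, wx), conv2d_same(h_prev, wh)
        return [[cx[r][k] + ch[r][k] + biases[g] for k in range(cols)]
                for r in range(rows)]

    pre = {g: pre_activation(g) for g in "ifoc"}
    # Cell state: forget part of the old state, add the gated candidate.
    c_new = [[sigmoid(pre["f"][r][k]) * c_prev[r][k]
              + sigmoid(pre["i"][r][k]) * math.tanh(pre["c"][r][k])
              for k in range(cols)] for r in range(rows)]
    # Hidden state: output gate applied to the squashed cell state.
    h_new = [[sigmoid(pre["o"][r][k]) * math.tanh(c_new[r][k])
              for k in range(cols)] for r in range(rows)]
    return h_new, c_new


# Toy usage: all-zero kernels and biases make every gate equal to 0.5.
zero_kernel = [[0.0] * 3 for _ in range(3)]
weights = {g: (zero_kernel, zero_kernel) for g in "ifoc"}
biases = {g: 0.0 for g in "ifoc"}
frame = [[1.0, 2.0], [3.0, 4.0]]
h0 = [[0.0, 0.0], [0.0, 0.0]]
c0 = [[1.0, 1.0], [1.0, 1.0]]
h1, c1 = convlstm_step(frame, h0, c0, weights, biases)
```

Because the hidden and cell states keep the spatial layout of the input frame, the electrode topology encoded in the EEG image is preserved across time steps, which a plain LSTM on flattened vectors would not guarantee.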
II-C Subject-Specific Knowledge Transfer with MMD
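The outputs of the stacked ConvLSTM layers for the source and target subjects are then fed into their subject-specific feature extraction modules, each of which consists of several Conv2D layers. During transfer learning, the maximum mean discrepancy (MMD) imposes a constraint on the subject-specific feature extraction. By embedding the representations learned by the subject-shared ConvLSTM from the two subject domains into a reproducing kernel Hilbert space, MMD reduces the domain discrepancy with the help of the adaptation network shown in Fig. 1. The squared MMD is calculated as:

$$
\mathrm{MMD}^2(X^s, X^t) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(x_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(x_j^t) \right\|_{\mathcal{H}}^2
= \frac{1}{n_s^2}\sum_{i,i'} k(x_i^s, x_{i'}^s) - \frac{2}{n_s n_t}\sum_{i,j} k(x_i^s, x_j^t) + \frac{1}{n_t^2}\sum_{j,j'} k(x_j^t, x_{j'}^t)
$$

where $\mathcal{H}$ denotes the reproducing kernel Hilbert space, $\phi$ is the associated feature map, and $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$ is the kernel function, a Gaussian kernel in this work. $x_i^s$ and $x_j^t$ denote the samples from the two domains $\mathcal{D}_s$ and $\mathcal{D}_t$, which contain $n_s$ and $n_t$ samples. Since the output of the preceding layers is 4-D, we reshape it to make its size suitable for the Conv2D layers. Let $Z_l^s$ and $Z_l^t$ denote the outputs of the $l$-th Conv2D layer for the source and target subjects. The transfer loss from MMD is then calculated as:

$$
\mathcal{L}_{\mathrm{MMD}} = \sum_{l} \mathrm{MMD}^2(Z_l^s, Z_l^t)
$$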
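The squared MMD with a Gaussian kernel can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the biased empirical estimate, a single fixed bandwidth `sigma`, and feature maps that have already been flattened into vectors; the function names and the toy feature values are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(source, target, sigma=1.0):
    """Biased empirical estimate of the squared MMD between two sample sets."""
    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            - 2.0 * mean_kernel(source, target)
            + mean_kernel(target, target))


# Toy flattened features for two subjects (illustrative values only).
source_features = [[0.0, 0.0], [1.0, 0.0]]
target_features = [[3.0, 3.0], [4.0, 3.0]]
discrepancy = mmd_squared(source_features, target_features)
same_domain = mmd_squared(source_features, source_features)
```

The estimate is zero when both sets are identical and grows as the two feature distributions move apart, which is why it can serve as a penalty that pulls the subject-specific features of the two subjects together.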
II-D Subject-to-Subject Spatial Attention
Subjects wearing the same or similar EEG devices generate EEG image data that share very similar spatial patterns, and not all regions of the EEG images contribute equally to the representation of the EEG signals. Therefore, we propose subject-to-subject spatial attention, which assigns an importance to each pair of regions from the source domain and the target domain. We design the model architecture so that the size of the feature matrix matches that of the raw input EEG image, which lets the latent representation correspond to the spatial structure of the raw image as closely as possible. Before the attention matrices are calculated, the output features from the Conv2D layers are reshaped from $C \times W \times H$ to $C \times N$, where $C$, $W$ and $H$ denote the channel, width and height of the output feature from Conv2D and $N = W \times H$. The attention matrix is then calculated with a dot product that involves a transposed feature matrix. Finally, the feature representation that incorporates the subject-to-subject spatial information is formed by a concatenation along a single dimension.
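One way to picture a cross-subject spatial attention matrix is the sketch below. It is an assumption-laden illustration and not the authors' exact formulation: it assumes the attention scores are the dot products between every source position and every target position of the flattened Conv2D features, and that each row is normalised with a softmax. All function names are hypothetical.

```python
import math


def flatten_spatial(feature):
    """Reshape a C x W x H nested list into a C x N matrix with N = W * H."""
    return [[v for row in channel for v in row] for channel in feature]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(source, target):
    """N x N matrix: entry (p, q) relates source position p to target position q.

    Assumption: scores are channel-wise dot products (transposed source
    features times target features), normalised per row with a softmax.
    """
    fs, ft = flatten_spatial(source), flatten_spatial(target)
    channels, n = len(fs), len(fs[0])
    scores = [[sum(fs[c][p] * ft[c][q] for c in range(channels))
               for q in range(n)] for p in range(n)]
    return [softmax(row) for row in scores]


# Toy single-channel 2 x 2 feature maps (illustrative values only).
source_map = [[[1.0, 0.0], [0.0, 0.0]]]
target_map = [[[1.0, 0.0], [0.0, 0.0]]]
attention = cross_attention(source_map, target_map)
```

In this toy case only the first spatial position is active in both subjects, so the first row of the matrix puts the most weight on that position, while rows for inactive positions stay uniform.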
II-E Total Loss Function
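The total loss function of CS-DASA consists of the source-domain task loss and the MMD loss:

$$
\mathcal{L} = \mathcal{L}_c(y^s, \hat{y}^s) + \lambda \, \mathcal{L}_{\mathrm{MMD}}
$$

where $\mathcal{L}_c$ denotes the cross-entropy loss, $y^s$ is the source domain label, $\hat{y}^s$ represents the output of the source domain, $\mathcal{L}_{\mathrm{MMD}}$ is the MMD transfer loss of Section II-C, and $\lambda$ is the domain discrepancy penalty parameter.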
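The joint objective can be sketched as a cross-entropy term on the source predictions plus a weighted sum of per-layer MMD values. The snippet is a minimal illustration under stated assumptions: the cross-entropy is averaged over the batch, the per-layer MMD values are supplied as plain numbers, and the names `total_loss` and `penalty` are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed from raw logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


def total_loss(source_logits, source_labels, mmd_terms, penalty):
    """Mean source cross-entropy plus penalty times the summed MMD terms."""
    ce = sum(cross_entropy(l, y) for l, y in zip(source_logits, source_labels))
    ce /= len(source_labels)
    return ce + penalty * sum(mmd_terms)


# Toy example with four classes (memory load 1-4) and two adapted layers.
logits = [[0.0, 0.0, 0.0, 0.0]]
labels = [2]
loss = total_loss(logits, labels, mmd_terms=[0.1, 0.3], penalty=0.5)
```

A larger penalty pushes the optimisation towards aligning the two subject domains, while a smaller one favours fitting the labeled source data.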
In this version of the work, we conduct one-to-one transfer on a working memory EEG dataset to verify the performance of the proposed model. In a future version, we will add experiments on the many-to-one transfer task and carry out a sensitivity analysis.
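| Method | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 | S11 | S12 | S13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DDC | 63.8/20.2 | 66.5/20.3 | 61.9/16.1 | 70.2/14.3 | 70.9/18.7 | 62.9/24.4 | 62.5/26.8 | 63.5/27.1 | 64.4/25.3 | 62.7/13.8 | 53.6/21.2 | 50.8/26.2 | 37.2/29.2 |
| CS-DASA (nonatt) | 65.5/21.2 | 67.3/21.4 | 62.5/16.6 | 73.3/14.6 | 75.7/19.5 | 62.8/24.4 | 65.3/26.8 | 66.0/26.6 | 65.2/25.5 | 65.2/11.1 | 55.5/20.0 | 50.8/25.8 | 37.3/28.8 |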
III-A Dataset and Model Implementation
We use the working memory EEG dataset from [3], which has 64 electrodes and three frequency bands (theta, alpha and beta). The multi-frame EEG data chosen for our experiments consist of 2670 samples from 13 subjects, which belong to four categories (load 1-4).
Over the coming months, we will keep adjusting the parameters and architecture of our model. Although the final version of this work may differ, we describe the current experimental design here to help readers understand what we do.
The model is implemented with the PyTorch 1.1 framework on two RTX 2080Ti GPUs. The subject-shared network consists of 2 ConvLSTM layers: the first has 2 LSTM layers with 8 and 16 hidden units, and the second also has 2 LSTM layers, with 16 and 16 hidden units. The subject-specific network is made of 2 Conv2D layers with 32 and 8 convolution kernels. Note that the output of the subject-shared network (ignoring the batch dimension) is reshaped before it enters the subject-specific network. Finally, the class prediction network comprises a Conv2D layer with 4 kernels and 2 fully connected layers. The learning rate and batch size are set to 0.0001 and 8, and Adam is used as the optimizer because it performs better than SGD.
III-B Comparison Methods
We briefly introduce the baseline and state-of-the-art models compared in this work. For a fair comparison, all deep learning-based models share the same setup as the proposed model. However, since an overly large feature size may be time-consuming and degrade performance, we take the corresponding single-frame EEG data from [3] and apply average pooling to reduce the feature size.
TCA [16]: This method uses the maximum mean discrepancy to learn transferable components across domains in a reproducing kernel Hilbert space. It reduces the distribution distance between different domains and thus achieves domain adaptation.
W-BDA [23]: This method adaptively weighs the importance of the marginal distribution discrepancy and the conditional distribution discrepancy. It not only considers distribution adaptation but also adaptively changes the weight of each class.
JDA [14]: To address the fact that previous transfer methods do not simultaneously reduce the discrepancy in both the marginal and conditional distributions, JDA jointly adapts these two kinds of distributions during dimensionality reduction, performing domain transfer and establishing new feature representations.
DDC [21]: DDC adds an adaptation layer between the source and target domains and uses a domain confusion loss so that the network learns to classify while reducing the distribution discrepancy between the source and target domains.
Deep-Coral [18]: This method applies CORAL, an unsupervised domain adaptation method, to deep neural networks in the form of a nonlinear transformation that aligns the correlations of layer activations.
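To make the idea of aligning correlations of layer activations more concrete, the following minimal sketch computes a CORAL-style loss as the squared Frobenius distance between the source and target feature covariances, scaled by 1/(4d^2) as in the Deep CORAL paper. The activations are toy values and the function names are hypothetical; the baseline used in the experiments may differ in detail.

```python
def covariance(features):
    """Unbiased covariance matrix of a list of d-dimensional feature vectors."""
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    return [[sum((row[a] - means[a]) * (row[b] - means[b]) for row in features)
             / (n - 1) for b in range(d)] for a in range(d)]


def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    sq_frobenius = sum((cs[a][b] - ct[a][b]) ** 2
                       for a in range(d) for b in range(d))
    return sq_frobenius / (4.0 * d * d)


# Toy layer activations for two domains (illustrative values only).
source_act = [[0.0, 0.0], [2.0, 0.0]]
target_act = [[0.0, 0.0], [0.0, 2.0]]
loss_value = coral_loss(source_act, target_act)
```

Unlike MMD, which compares kernel mean embeddings, this loss only matches second-order statistics, so it is unchanged when a constant offset is added to all activations of one domain.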
III-C Results and Analysis
We carry out one-to-one transfer for all 13 subjects in the dataset, so each target subject has 12 independent source subjects. The mean and standard deviation (Mean/STD) of the results are shown in Table 1.
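The evaluation protocol above can be sketched as follows: every ordered (source, target) pair of the 13 subjects is one transfer experiment, and the Mean/STD for a target subject is taken over its 12 source subjects. The accuracies below are placeholders rather than experimental results, and the use of the population standard deviation is an assumption.

```python
from itertools import permutations
from statistics import mean, pstdev

subjects = [f"S{i}" for i in range(1, 14)]
# Every ordered (source, target) pair is one one-to-one transfer run.
pairs = list(permutations(subjects, 2))


def summarize(accuracy_by_pair):
    """Return {target: (mean, std)} over all source subjects of each target."""
    per_target = {}
    for (source, target), acc in accuracy_by_pair.items():
        per_target.setdefault(target, []).append(acc)
    # Assumption: population standard deviation; the paper does not specify.
    return {t: (mean(a), pstdev(a)) for t, a in per_target.items()}


# Placeholder accuracies (NOT results): the source subject index is reused.
placeholder = {(s, t): float(s[1:]) for s, t in pairs}
summary = summarize(placeholder)
```

With 13 subjects this gives 13 x 12 = 156 transfer runs per method, which explains why the standard deviations reported per target subject can be large.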
Deep learning-based transfer methods clearly outperform the traditional methods, since the traditional ones cannot capture spatio-temporal information well and cannot handle high-dimensional data. Although the CNN-3D model shares the same experimental settings as the proposed CS-DASA, it does not achieve ideal results, because the feature extraction ability of Conv3D layers does not match that of ConvLSTM in this task. Among the three models DDC, Deep-Coral and CS-DASA, the proposed model shows the best performance. Note that the proposed attention mechanism not only improves the classification result but also reduces the STD significantly. Besides, when exploring why our model obtains poor results on S13, we find that negative transfer occurs and that the source-only experiment performs much better. Subject-independent experiments in [3] show the same phenomenon. In a future version of this work, we will further explore the difference in S13 from the viewpoint of the latent feature representation and give a more detailed explanation.
III-D Conclusion and Future Work
In this version of the paper, we propose a cross-subject deep adaptation with spatial attention method (CS-DASA) for transfer learning in workload classification between subjects. Experiments on a public WM EEG dataset support the effectiveness of our model.
In a future version of this paper or in future work, we will carry out a sensitivity analysis on the MMD loss weight, conduct many-to-one experiments, and design an automatic source domain selection method.
[1] (2012) A study of kernel CSP-based motor imagery brain computer interface classification. In 2012 IEEE Signal Processing in Medicine and Biology Symposium (SPMB), pp. 1–4.
[2] (2014) Spectrotemporal dynamics of the EEG during working memory encoding and maintenance predicts individual behavioral capacity. European Journal of Neuroscience 40 (12), pp. 3774–3784.
[3] (2015) Learning representations from EEG with deep recurrent-convolutional neural networks. arXiv preprint arXiv:1511.06448.
[4] (2020) EEG spectral power abnormalities and their relationship with cognitive dysfunction in patients with Alzheimer's disease and type 2 diabetes. Neurobiology of Aging 85, pp. 83–95.
[5] Robust principal component analysis?. Journal of the ACM (JACM) 58 (3), pp. 1–37.
[6] (2016) Unsupervised domain adaptation techniques based on auto-encoder for non-stationary EEG-based emotion recognition. Computers in Biology and Medicine 79, pp. 205–214.
[7] (2018) Transfer kernel common spatial patterns for motor imagery brain-computer interface classification. Computational and Mathematical Methods in Medicine 2018.
[8] (2016) A multi-modal parcellation of human cerebral cortex. Nature 536 (7615), pp. 171–178.
[9] (2003) How many people are able to operate an EEG-based brain-computer interface (BCI)?. IEEE Transactions on Neural Systems and Rehabilitation Engineering 11 (2), pp. 145–147.
[10] (2012) Fuzzy Hopfield neural network clustering for single-trial motor imagery EEG classification. Expert Systems with Applications 39 (1), pp. 1055–1061.
[11] (2019) Multisource transfer learning for cross-subject EEG emotion recognition. IEEE Transactions on Cybernetics 50 (7), pp. 3281–3293.
[12] (2017) Improving cross-day EEG-based emotion classification using robust principal component analysis. Frontiers in Computational Neuroscience 11, pp. 64.
[13] (2015) Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pp. 97–105.
[14] Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2200–2207.
[15] (2015) Real-time neuroimaging and cognitive monitoring using wearable dry EEG. IEEE Transactions on Biomedical Engineering 62 (11), pp. 2553–2567.
[16] (2010) Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22 (2), pp. 199–210.
[17] (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems 28.
[18] (2016) Deep CORAL: correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pp. 443–450.
[19] (2019) Classification of EEG-based single-trial motor imagery tasks using a B-CSP method for BCI. Frontiers of Information Technology & Electronic Engineering 20 (8), pp. 1087–1098.
[20] (2017) Single-trial EEG classification of motor imagery using deep convolutional neural networks. Optik 130, pp. 11–18.
[21] (2014) Deep domain confusion: maximizing for domain invariance. arXiv preprint arXiv:1412.3474.
[22] (2004) Neural activity predicts individual differences in visual working memory capacity. Nature 428 (6984), pp. 748–751.
[23] (2017) Balanced distribution adaptation for transfer learning. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 1129–1134.
[24] (2017) Cross-session classification of mental workload levels using EEG and an adaptive deep learning model. Biomedical Signal Processing and Control 33, pp. 30–47.
[25] (2019) Making sense of spatio-temporal preserving representations for EEG-based human intention recognition. IEEE Transactions on Cybernetics 50 (7), pp. 3033–3044.