Toward Subject Invariant and Class Disentangled Representation in BCI via Cross-Domain Mutual Information Estimator

10/17/2019 ∙ by Eunjin Jeon, et al.

Recently, deep learning-based feature representation methods have shown a promising impact on electroencephalography (EEG)-based brain-computer interfaces (BCIs). Nonetheless, due to high intra- and inter-subject variability, many studies on decoding EEG were designed in a subject-specific manner using calibration samples, with little concern for the impractical, time-consuming, and data-hungry nature of that process. To tackle this problem, recent studies have taken advantage of transfer learning, especially domain adaptation techniques. However, two challenging limitations remain: (i) most domain adaptation methods are designed for a labeled source and an unlabeled target domain, whereas BCI tasks generally involve multiple annotated domains; (ii) most methods do not account for negatively transferable factors that disrupt generalization. In this paper, we propose a novel network architecture that tackles these limitations by estimating mutual information in high-level and low-level representations separately. Specifically, our proposed method extracts domain-invariant and class-relevant features, thereby enhancing generalizability in classification across subjects. It is also noteworthy that our method can be applied to a new subject with a small amount of data via a fine-tuning step only, reducing calibration time for practical use. We validated our proposed method on a big motor imagery EEG dataset, showing promising results compared to competing methods considered in our experiments.





Brain-computer interface (BCI) allows users to directly communicate with or control external devices by thought, typically measured with electroencephalography (EEG) [graimann2009brain]. EEG signals, which measure the electrical activity of the brain, are usually categorized into two types, evoked and spontaneous, depending on how they are induced in non-invasive BCIs. Evoked EEGs, e.g., steady-state visually evoked potentials, steady-state somatosensory evoked potentials, and event-related potentials, are derived from immediate automatic responses to an external stimulus regardless of a user's own will, whereas spontaneous EEGs exhibit event-related (de)synchronization (ERD/ERS) when mental tasks are carried out at a user's own will. Of the various EEG signal types, we focus on motor imagery signals, characterized by ERD/ERS induced by imagining body movements without any physical movement. Taking advantage of controlling a system without explicit commands, there have been many studies on decoding motor imagery using machine learning-based methods.

For motor imagery EEG, because of high variability among subjects (inter-subject) and among sessions of the same subject (intra-subject), on account of inherent background neural activities, fatigue, concentration levels, etc. [jayaram2016transfer], it is challenging to train a generic model that can be applied to different datasets or subjects. Therefore, training a model for each subject is the typical approach to decoding brain signals, despite being a time-consuming and data-intensive process [lin2017improving]. To address this limitation, previous studies involved data from multiple subjects and/or sessions simultaneously when training a single model, via transfer learning [jayaram2016transfer, lin2017improving], and demonstrated its potential.

In this work, we focus on boosting model generalization among subjects in a transfer learning manner [jayaram2016transfer, lin2017improving], by considering each subject as a domain [pan2009survey]. Meanwhile, due to the high inter-subject variability of motor imagery EEG [jayaram2016transfer], a large distributional discrepancy may be observed on the feature representation space between different domains, i.e., different subjects, which is referred to as a domain shift [pan2009survey, chai2016unsupervised, jeon2019domain]. Therefore, mitigating the domain shift is one of the important issues in transfer learning for BCI tasks [chai2016unsupervised, jeon2019domain].

Regarding the domain shift problem, there have been numerous studies in machine learning, referred to as domain adaptation [ganin2016domain, wang2018deep]. However, a direct application of domain adaptation techniques to BCI tasks is challenging because BCI tasks generally exploit multiple annotated domains, i.e., subjects, whereas previous domain adaptation methods have mostly been studied between two domains, a labeled source and an unlabeled target domain [ganin2016domain, long2015learning, tzeng2017adversarial, bousmalis2017unsupervised].

Meanwhile, recent studies have reported that domain adaptation methods may not only improve the feature representation of the target domain, but may also corrupt it. This corruptive phenomenon is called negative transfer [wang2019transferable, peng2019domain, xu2018deep]. To address this issue, [wang2019transferable] indicated that a source domain includes not only transferable factors, but also untransferable ones with respect to domain adaptation. Further, [peng2019domain] categorized source-domain features as domain-invariant, domain-specific, and class-irrelevant. Hence, it is of great importance to differentiate positively and negatively transferable factors in the data.

In applying domain adaptation methods to BCI tasks, we consider domain shifts from multiple domains jointly, in a single procedure, by estimating mutual information across multiple subjects. This differs from previous studies that mostly considered two domains [ganin2016domain, tzeng2017adversarial, long2015learning] or calculated one-to-one discrepancies even for multiple domains (one source vs. one target) [zhao2018adversarial, xu2018deep]. In other words, the proposed method aims at finding domain-invariant, i.e., domain-unspecific, feature representations applicable to multiple domains, allowing quick fine-tuning with a minimal number of samples for a target domain, if necessary.

In terms of transferability, we decompose the domain-invariant features into class-irrelevant and class-relevant parts using another mutual information estimate. Technically, the proposed method minimizes the mutual information between class-relevant and class-irrelevant features, thereby disentangling the feature representations. Further, from the viewpoint of meta-learning [vanschoren2018meta], we build a generic model that can classify EEG signals of different subjects and can quickly adapt to a new subject by fine-tuning with a small amount of the target subject's data.

We evaluated our proposed network on the GIST-Motor Imagery dataset [cho2017eeg] for a classification task. Our experimental results demonstrated that (i) the class disentanglement enhances performance and (ii) maximizing mutual information between subjects diminishes the distributional differences. Our results show promising performance compared to the comparative methods trained individually, even with fewer samples. The main contributions of this work are three-fold:

  • First, we propose a deep neural network that disentangles domain-invariant, class-irrelevant, and class-relevant feature representations by means of mutual information in an end-to-end manner.

  • Second, the proposed network requires less calibration data to generalize to a new subject, thereby offering practical utility.

  • Finally, although the proposed network is trained with multiple subjects’ data, it shows competitive classification performance in comparison with baseline methods trained individually.

Related Work

Transfer learning in BCI

In general, many methods for decoding EEG are devised for individual use, with independent training per subject, due to high inter- and intra-subject variability. However, this requires plenty of data and is time-consuming [lin2017improving]. To cope with this limitation, various studies have focused on training a model with multiple subjects and/or sessions, with the goal of learning a subject-invariant representation. These approaches are applied in two ways: (i) zero-shot or few-shot learning for decoding data of subjects unseen during training [ozdenizci2019transfer, kindermans2014integrating, kang2014bayesian, azab2019weighted] and (ii) improving performance by incorporating other subjects' data [lin2017improving, wei2018subject, jeon2019domain]. In particular, deep learning-based methods [chai2016unsupervised, jeon2019domain, ozdenizci2019transfer] approach this issue from a domain adaptation standpoint. They all constrain the distributional discrepancy between subjects, either by minimizing the maximum mean discrepancy in the latent space [chai2016unsupervised] or by training the encoder to confuse a domain discriminator [jeon2019domain, ozdenizci2019transfer]. Among them, two studies were fundamentally devised for only two subjects [chai2016unsupervised, jeon2019domain]. In [ozdenizci2019transfer], a classifier is trained on top of a pre-trained subject-invariant representation; hence, confusing the domain and identifying the class are not trained in an end-to-end manner. Our proposed network achieves domain adaptation among multiple domains, i.e., subjects, and classification simultaneously, in an end-to-end manner.

Figure 1: Overview of the proposed network. We randomly select trials regardless of subjects for a mini-batch. After a local encoder $E_l$ maps an input $x$ to a local feature $h$, a global encoder $E_g$ receives the local feature and extracts a global feature $z$. To embed a subject-invariant feature, we maximize the mutual information between the two features by using a local discriminator $D_l$ and a global discriminator $D_g$. We split the global feature into a class-relevant feature $z^{rel}$ and a class-irrelevant feature $z^{irr}$. Then, the classifier $C$ takes $z^{rel}$ and discriminates the corresponding class. Through a gradient reversal layer (GRL), we reduce the mutual information between $z^{rel}$ and $z^{irr}$ estimated by a mutual information neural estimator $M$. Finally, the decoder $R$ reconstructs the original input from the concatenated feature $[z^{rel}, z^{irr}]$.

Domain Adaptation

There have been various studies on mitigating the differences between source and target domains. [wang2018deep] categorized them into one-step and multi-step approaches according to the presence of an intermediate domain that decreases the gap between the source and target domains. In one-step domain adaptation, most methods rely on a domain discriminator, which guides the features to be indistinguishable between domains by reversing its gradient during training via adversarial learning [ganin2016domain, tzeng2017adversarial, zhang2018collaborative]. However, these methods consider a single target domain together with a single source domain. In the case of multiple source domains, multiple domain discriminators identify the domain between the target domain and each source domain separately [zhao2018adversarial, xu2018deep]. Since a specific target domain must be designated to take advantage of the existing methods, they are inapplicable to our task, which deals with several annotated domains. Therefore, we propose a novel paradigm based on mutual information for minimizing the distributional discrepancy. To do so, we assume that a latent space that maximizes the mutual information among multiple domains can be viewed as a common space shared among them.

Representation Disentanglement

Recent researchers have concentrated on disentangled representation learning in various fields, such as image translation [zheng2019disentangling, chen2018isolating], few-shot learning [yoon2019plug, ridgeway2018learning], and domain adaptation [liu2018detach, peng2019domain], to find semantic and interpretable information within the latent space. [zheng2019disentangling] encodes the input into class-relevant and class-irrelevant features by using two different encoders: (i) one encoder is trained by minimizing a classification-related loss, and (ii) the other encoder fools the classifier via adversarial learning. [yoon2019plug] factorizes latent representations into identity factors and style factors by minimizing the mutual information between the two factor sets in the disentangler part. With regard to domain adaptation, [peng2019domain] divides a latent representation into domain-invariant, domain-specific, and class-irrelevant features based on minimizing the mutual information among them for domain-agnostic learning. Similar to these approaches, we decompose our feature representations into class-relevant and class-irrelevant ones from the domain-invariant, i.e., subject-invariant, representation in an attempt to prevent negative transfer. In particular, we also utilize the mutual information between the two decomposed feature representations to ensure disentanglement.

Proposed Methods

In this work, we regard each subject as one domain. Thus, we assume that a set of labeled samples $\{(x_i^{(s)}, y_i^{(s)})\}_{i=1}^{N_s}$ is given for each of $S$ subjects. For uncluttered notation, we omit the superscript $(s)$ without loss of generality. Let $x \in \mathbb{R}^{C \times F}$ denote a power spectral density transformed from a raw EEG trial, including spatio-spectral information, where $C$ and $F$ are the number of electrode channels and frequency bins, respectively. We also define $y$ as the corresponding class label. The goal of this work is to build a deep neural network robustly applicable to multiple subjects. In other words, we want to develop an intention identification system that can be generalized over all subjects, i.e., that minimizes the classification errors. Further, the generalized network can be applied to quickly decode a new subject's intention through fine-tuning.

For classification, we propose a method that maximizes the mutual information among subjects and disentangles the latent representation into class-relevant and class-irrelevant features.

An overall framework of our proposed method is shown in Figure 1. Basically, our proposed network has an autoencoder structure. In the encoding-related block (blue box in Figure 1), an input $x$ is embedded into the subject-invariant latent space through an encoder, i.e., $z = E_g(E_l(x))$. In the decoding-related block (yellow box in Figure 1), there are three components: (i) a classifier $C$ to identify the class of an input; (ii) a mutual information estimator $M$ to diminish the dependency between two features, i.e., the class-relevant feature $z^{rel}$ and the class-irrelevant feature $z^{irr}$; (iii) a decoder $R$ to reconstruct the input from the concatenated feature $[z^{rel}, z^{irr}]$. Later, the trained network is fine-tuned for a new subject using his/her data only.

Subject-Invariant Feature Embedding

First, we introduce a method to embed an input into the subject-invariant latent space by maximizing mutual information among all subjects. We exploit Deep InfoMax (DIM), a deep neural network that maximizes the mutual information between an input and the encoder's output [hjelm2018learning]. By randomly composing a mini-batch from samples of multiple subjects in stochastic gradient descent-based learning, the mutual information across subjects can be estimated within the DIM framework. Further, maximizing this mutual information leads to minimizing the discrepancy among subjects.

We utilize an encoder and a discriminator to maximize mutual information, analogous to DIM [hjelm2018learning]. Here, both the encoder and the discriminator have two types (local and global). The local encoder $E_l$ maps an input $x$ to a local feature $h \in \mathbb{R}^{H \times W \times D_l}$, where $H$, $W$, and $D_l$ represent the height, width, and depth of the feature. The global encoder $E_g$ computes a global feature $z \in \mathbb{R}^{D_g}$ from the corresponding local feature, where $D_g$ is the depth of the global feature. In other words, the encoder is structured such that it produces an intermediate local feature and then a global feature, in order.

Then, we estimate and maximize the mutual information between the local feature and the global feature using the local discriminator $D_l$ and the global discriminator $D_g$. While both discriminators take the same local and global features as inputs, they maximize the mutual information in similar but different ways. First, the two encoders $E_l$, $E_g$, and the global discriminator $D_g$ are trained by maximizing

$$\mathcal{L}_{global} = \frac{1}{N}\sum_{n=1}^{N} \hat{I}^{(JSD)}\big(h_n; z_n\big), \tag{1}$$

where $N$ is the number of samples in a mini-batch and $\hat{I}^{(JSD)}$ is derived from Jensen-Shannon divergence (JSD)-based mutual information between the joint distribution $\mathbb{J}$ and the product of marginal distributions $\mathbb{M}$ [nowozin2016f]. Eq. (1) is rewritten as follows:

$$\hat{I}^{(JSD)}(h; z) = \mathbb{E}_{\mathbb{J}}\big[-\mathrm{sp}\big({-D_g(h, z)}\big)\big] - \mathbb{E}_{\mathbb{M}}\big[\mathrm{sp}\big(D_g(h', z)\big)\big], \tag{2}$$

where $h'$ and $h$ are different but sampled from the same distribution, obtained by shuffling the samples from the joint distribution along the batch axis [belghazi2018mutual], and $\mathrm{sp}(a) = \log(1 + e^{a})$.

In the meantime, the two encoders $E_l$, $E_g$, and the local discriminator $D_l$ are also trained by maximizing the mutual information as follows:

$$\mathcal{L}_{local} = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{P}\sum_{p=1}^{P} \hat{I}^{(JSD)}\big(h_n^{(p)}; z_n\big), \tag{3}$$

where $\hat{I}^{(JSD)}$ is also formulated as JSD-based mutual information, $P$ is the number of local patches, and $h^{(p)}$ denotes the $p$-th local patch of the local feature $h$. After calculating the mutual information between the global feature and a local patch of the local feature for all patches, the results are averaged and used to update the parameters of the local encoder, the global encoder, and the local discriminator, respectively. In the end, the resulting global feature is subject-invariant and represents the input from both global and local viewpoints.
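As a concrete illustration, the softplus-based JSD mutual information estimate used above can be sketched in a few lines of numpy. The discriminator scores here are synthetic stand-ins for the outputs of a trained discriminator; all names and values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)  # log(1 + e^a), numerically stable

def jsd_mi_estimate(scores_joint, scores_marginal):
    # E_joint[-sp(-T)] - E_marginal[sp(T)]; the estimate approaches its
    # upper bound of 0 for a confident discriminator and equals -2*log(2)
    # for an uninformative one (T = 0 everywhere).
    return (-softplus(-scores_joint)).mean() - softplus(scores_marginal).mean()

rng = np.random.default_rng(0)
batch = 256
# Synthetic discriminator scores: high on true (local, global) pairs,
# low on pairs whose partners were shuffled along the batch axis.
scores_joint = rng.normal(3.0, 0.5, size=batch)
scores_marginal = rng.normal(-3.0, 0.5, size=batch)

mi_informative = jsd_mi_estimate(scores_joint, scores_marginal)
mi_uninformative = jsd_mi_estimate(np.zeros(batch), np.zeros(batch))
```

Training the encoders to raise this estimate, with mini-batches drawn across subjects, is what encourages a shared, subject-invariant latent space.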

Class Disentanglement

Although we find subject-invariant feature representations by maximizing the local/global mutual information as described above, because those are learned in an unsupervised manner, the resulting global feature is not necessarily useful per se for classification. [peng2019domain, zheng2019disentangling] pointed out that features learned in an unsupervised manner generally include both task-relevant and task-irrelevant information. Hence, we assume that the subject-invariant feature needs to be disentangled into class-relevant and class-irrelevant factors, and that the class-relevant factors are sufficient for the classification task.

For the feature decomposition, we split the global feature $z$ into two parts such that one part is the class-relevant feature $z^{rel}$ and the other is the class-irrelevant feature $z^{irr}$, where $z = [z^{rel}, z^{irr}]$. Then, the classifier $C$ is trained by minimizing the softmax cross-entropy loss with only the class-relevant feature as input:

$$\mathcal{L}_{clf} = -\frac{1}{N}\sum_{n=1}^{N} \mathbf{y}_n^{\top} \log C\big(z^{rel}_n\big), \tag{4}$$

where $\mathbf{y}_n$ denotes a one-hot label vector for an input $x_n$.
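The feature split and the classification loss can be sketched in numpy as follows; the split size, shapes, and toy classifier logits are illustrative assumptions:

```python
import numpy as np

def split_global_feature(z, d_rel):
    # Split along the depth (last) axis into class-relevant and
    # class-irrelevant parts; d_rel is an assumed split size.
    return z[..., :d_rel], z[..., d_rel:]

def softmax_cross_entropy(logits, onehot):
    # Mean softmax cross-entropy over the mini-batch.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(onehot * log_probs).sum(axis=1).mean()

z = np.arange(12.0).reshape(2, 6)            # toy global features (batch of 2)
z_rel, z_irr = split_global_feature(z, 4)    # 4 relevant + 2 irrelevant dims
logits = np.array([[2.0, 0.0], [0.0, 2.0]])  # toy classifier outputs on z_rel
labels = np.eye(2)                           # one-hot labels
loss = softmax_cross_entropy(logits, labels)
```

Only the class-relevant half reaches the classifier; the irrelevant half is used by the mutual information estimator and the decoder.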
To support the classifier $C$ in decomposing the global feature into the two factors $z^{rel}$ and $z^{irr}$, we also exploit mutual information neural estimation (MINE) [belghazi2018mutual], for which another network $M$ is introduced. In [belghazi2018mutual], the parameters of MINE are trained by gradient ascent due to the Donsker-Varadhan representation [donsker1983asymptotic]. In our case, however, we want to minimize the mutual information between the two factors $z^{rel}$ and $z^{irr}$, so we apply a gradient reversal layer (GRL) [ganin2016domain]. A GRL passes features to MINE unchanged during forward propagation, but reverses the gradients during back propagation:

$$\mathcal{L}_{MI} = \mathbb{E}_{\mathbb{J}}\big[M(z^{rel}, z^{irr})\big] - \log \mathbb{E}_{\mathbb{M}}\big[e^{M(z^{rel}, z'^{irr})}\big], \tag{5}$$

where $z'^{irr}$ is obtained by shuffling $z^{irr}$ along the batch axis.
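A gradient reversal layer is easy to sketch outside any particular autograd framework: identity on the forward pass, sign-flipped (and optionally scaled by a hypothetical factor `lam`) gradients on the backward pass:

```python
import numpy as np

class GradReverse:
    """Identity in the forward pass; multiplies incoming gradients by
    -lam in the backward pass, so the encoder is updated to *decrease*
    the mutual information estimate that MINE is trained to increase."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                        # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed gradients

grl = GradReverse(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
out = grl.forward(x)
grad_in = grl.backward(np.ones_like(x))
```

In a full framework this would be registered as a custom autograd op; the numpy version only illustrates the forward/backward contract.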
Since the disentangled features are expected to keep the spatio-spectral information of the input, we lastly build a decoder network $R$ for reconstruction. The split features, i.e., the class-relevant feature $z^{rel}$ and the class-irrelevant feature $z^{irr}$, are concatenated along the depth axis into $[z^{rel}, z^{irr}]$. Then, the concatenated global feature is regarded as an input to the decoder $R$. In our case, we denote a reconstruction loss as follows:

$$\mathcal{L}_{recon} = \frac{1}{N}\sum_{n=1}^{N} \big\| x_n - R\big([z^{rel}_n, z^{irr}_n]\big) \big\|_2^2. \tag{6}$$
We train our proposed network with Eq. (1), (3), (4), (5), and (6) in an end-to-end manner. Details of the learning algorithm are provided in Algorithm 1.

Input : local encoder $E_l$, global encoder $E_g$, local discriminator $D_l$, global discriminator $D_g$, decoder $R$, mutual information estimator $M$, classifier $C$
Output : well-generalized local encoder $E_l$, global encoder $E_g$, classifier $C$
1 while not converged do
2       // Subject-Invariant Feature Embedding
3       Update $E_l$, $E_g$, $D_l$, $D_g$ by (1) and (3);
4       // Class Disentanglement
5       Split global feature $z$ into $z^{rel}$ and $z^{irr}$;
6       Update $E_l$, $E_g$, $C$ by (4);
7       // Mutual Information Estimation
8       Calculate mutual information between the disentangled feature pair ($z^{rel}$, $z^{irr}$) with $M$;
9       Update $M$ by (5);
10      Update $E_l$, $E_g$ by the reversed gradients of (5);
11      // Reconstruction
12      Reconstruct by feeding the concatenated feature $[z^{rel}, z^{irr}]$ to $R$;
13      Update $E_l$, $E_g$, $R$ by (6);
14 end while
Algorithm 1 Learning subject-invariant and class-disentangled representation

Experiments & Results

We expect that our proposed network can be generalized across multiple subjects by learning subject-invariant and class-disentangled representations. In this regard, we consider two applications: (i) enhancing performance through an effect akin to data augmentation, and (ii) transferring to an unseen subject with a small amount of data. We evaluate the proposed method on the public GIST-motor imagery dataset [cho2017eeg].

Data & Preprocessing

The GIST-motor imagery dataset [cho2017eeg] is a large dataset of subjects ( females, years old). It consists of EEG signals from two motor imagery tasks, left-hand and right-hand. All EEG signals were recorded from Ag/AgCl electrodes placed according to the 10-20 system and sampled at Hz. For each subject, each class has or trials, acquired over four sessions. All subjects were asked to rest for seconds and then to imagine the hand movement for seconds, following the instruction given on the monitor. Since two subjects (subjects 29 and 34) showed a high correlation with electromyography (EMG), they were regarded as bad subjects and excluded from analysis in [cho2017eeg]. Therefore, we also conducted experiments using only the remaining subjects' data.

The signals were preprocessed with a large Laplacian filter to reduce noise and a bandpass filter in the range of to Hz, related to sensorimotor rhythms. After segmenting the signals into baseline (2 seconds) and task (3 seconds) periods, we subtracted the mean value of the baseline from each trial for baseline correction. To exploit the frequency properties of the data, we computed the power spectral density (PSD) by applying Welch's method. Consequently, we obtained a []-sized datum per trial and used it as input to our method.
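The Welch PSD step can be sketched with plain numpy as follows; the sampling rate, segment length, and overlap are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def welch_psd(sig, fs, nperseg=256):
    # Split the signal into 50%-overlapping Hann-windowed segments and
    # average their periodograms (one-sided power spectral density).
    win = np.hanning(nperseg)
    step = nperseg // 2
    scale = fs * (win ** 2).sum()
    segments = []
    for start in range(0, len(sig) - nperseg + 1, step):
        seg = sig[start:start + nperseg] * win
        segments.append(np.abs(np.fft.rfft(seg)) ** 2 / scale)
    psd = np.mean(segments, axis=0)
    psd[1:-1] *= 2  # fold negative frequencies into the one-sided spectrum
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, psd

fs = 512                              # assumed sampling rate (toy value)
t = np.arange(0, 3.0, 1.0 / fs)       # a 3-second task window
sig = np.sin(2 * np.pi * 12.0 * t)    # 12 Hz mu-band-like oscillation
freqs, psd = welch_psd(sig, fs)
```

Applying this per channel yields a channels-by-frequency-bins array of the kind described above.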

Method Accuracy [%]
Shallow ConvNet
Ours (wo CD)
Ours (w CD)
Table 1: Average of -fold classification accuracy [%] on all subjects in GIST-motor imagery dataset [cho2017eeg] with common spatial pattern (CSP) [ramoser2000optimal], filter bank common spatial pattern (FBCSP) [ang2008filter], Shallow ConvNet [schirrmeister2017deep], and our proposed method without and with the class disentanglement (wo and w CD).
Figure 2: Classification results. The x-axis and y-axis denote respectively a subject’s ID and classification accuracy [%]. Note that subject and are not used in our experiments.

Experimental Settings

As mentioned above, each subject performed motor imagery over four sessions in total. Therefore, to perform -fold cross-validation, we randomized each subject's trials regardless of session and divided the dataset into training, validation, and test sets at the ratio . For training and validation, we utilized all subjects' training and validation data. We applied Gaussian normalization to all inputs for each channel to preserve spatial information. To verify the generalization of the network, we conducted experiments in two different scenarios.

  • Scenario I: Since our network is designed to learn subject-invariant and class-disentangled representations, we train it using all subjects’ data within a single session and then test it for each subject.

  • Scenario II: After training, the proposed network is able to form a common feature space among subjects. Thus, we take advantage of it to reduce calibration time for a new subject. Here, we assume that we have a few trials (samples) for a new subject, i.e., a target subject. Excluding the target subject, we train our network using the remaining subjects ( subjects in our dataset), then randomly select 30%, 50%, and 70% of the unseen subject's training trials. Subsequently, we perform fine-tuning by minimizing all losses with the target subject's selected data in a subject-wise manner.
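The calibration-data selection in Scenario II can be sketched as a simple uniform subsampling of the target subject's training trials (the uniform-sampling choice and the trial count are illustrative assumptions):

```python
import numpy as np

def sample_calibration_trials(n_trials, ratio, seed=0):
    # Randomly pick a fraction of the new subject's training trials
    # to use for fine-tuning; returns the selected trial indices.
    rng = np.random.default_rng(seed)
    k = int(round(n_trials * ratio))
    return rng.choice(n_trials, size=k, replace=False)

# e.g., the three calibration ratios evaluated in Table 2
idx = {r: sample_calibration_trials(100, r) for r in (0.3, 0.5, 0.7)}
```

The pre-trained network is then fine-tuned on only these indexed trials, which is what keeps the calibration phase short.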

Model Implementation


To capture spatial and spectral information respectively, the local encoder is a spectral convolutional layer with , while the global encoder is composed of a spatial convolutional layer with . The depth of each convolutional layer was set to and , respectively. Consequently, we obtained the local feature and the global feature . In the global discriminator, after embedding the local feature through another spatial convolutional layer with filters, it was concatenated with the global feature along the depth axis. Then, the concatenated feature was fed to two fully-connected layers, with and units per layer. We used the encode-and-dot-product architecture for the local discriminator, in accordance with [hjelm2018learning]. In the local discriminator, the local and global features were passed to a convolutional layer and a fully-connected layer, respectively, where the dimension of both layers was . Since the local feature still had a channel axis, we considered each channel of the local feature as a local patch. After taking the dot product between the global feature and each channel of the local feature, we averaged the results to compute Eq. (3). The classifier was composed of fully-connected layers, where each layer had units. Regarding the mutual information estimator, two fully-connected layers received the class-relevant and class-irrelevant features respectively, where the number of hidden nodes was . We then summed the two outputs and passed the result to another fully-connected layer with . Finally, our decoder was composed of a spatial deconvolutional layer and a spectral convolutional layer with units.

Training Settings

Exponential linear units (ELUs) were used as the nonlinear function in our network. Except for the two discriminators and the mutual information estimator, we applied batch normalization [ioffe2015batch]. In addition, we applied an -regularization with a coefficient of to the classification loss Eq. (4) and dropout [srivastava2014dropout] with a rate of to prevent over-fitting. We trained models using a momentum optimizer (momentum ) with a learning rate of and a mini-batch size of 253. In Scenario II, we need to fine-tune for unseen subjects; thus, we trained the same network under different settings determined by ablation studies. Our network was implemented in TensorFlow.

Ratio 30% 50% 70%
Method CSP FBCSP Ours (w CD) CSP FBCSP Ours (w CD) CSP FBCSP Ours (w CD)
Table 2: -fold average performance of each subject (). denotes .


For evaluation, we compared our proposed method to other motor imagery decoding methods. We briefly introduce two machine-learning-based models and one deep-learning-based model.

  • Common Spatial Pattern (CSP) [ramoser2000optimal]: Thanks to its popularity and simplicity, this is a de facto standard method in motor-imagery BCI. CSP learns spatial filters that decompose multi-channel EEG signals into patterns distinguishable between classes by maximizing the ratio of class variances. We used 2 spatial filters for each class, i.e., left-hand and right-hand.

  • Filter Bank Common Spatial Pattern (FBCSP) [ang2008filter]: This method has achieved good performance in various BCI tasks. First, raw EEG signals are bandpass-filtered into Hz bands at an interval of Hz. In each band, we applied CSP (2 filters per class) for feature extraction and then selected the CSP features with the highest scores through mutual information-based feature selection [ang2008filter].

  • Shallow ConvNet [schirrmeister2017deep]: This deep network showed its power in learning feature representations in a fully data-driven, end-to-end manner. Shallow ConvNet is composed of two convolutional layers capturing spatio-temporal information of raw EEG signals, followed by a fully-connected layer. However, we modified the spatial filter size to match the number of electrode channels () in the GIST-motor imagery dataset [cho2017eeg]. For a fair comparison, we did not use the ‘cropped-training’ strategy considered in the original work [schirrmeister2017deep].

Since the baseline methods were designed for raw EEG trials, we fed them raw EEGs containing spatio-temporal information, unlike our proposed method, which takes PSD inputs. After applying CSP and FBCSP, we employed linear discriminant analysis for classification [vidaurre2010toward]. It is worth noting that we trained the baseline methods individually per subject.
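For reference, the CSP baseline reduces to a whitening step followed by an eigendecomposition of a whitened class covariance. A minimal numpy sketch on precomputed class covariances (toy matrices, not EEG data):

```python
import numpy as np

def csp_filters(cov_a, cov_b, n_filters=2):
    # Whiten with the composite covariance, then eigendecompose the
    # whitened class-a covariance; the extreme eigenvectors give
    # filters that maximize variance for one class while minimizing
    # it for the other.
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    whiten = evecs @ np.diag(evals ** -0.5) @ evecs.T
    _, v = np.linalg.eigh(whiten @ cov_a @ whiten.T)  # ascending order
    w = v.T @ whiten                                  # rows = spatial filters
    return np.vstack([w[-n_filters:], w[:n_filters]])

# Toy class covariances: class a has high variance on channel 0,
# class b on channel 2.
cov_a = np.diag([4.0, 1.0, 1.0])
cov_b = np.diag([1.0, 1.0, 4.0])
W = csp_filters(cov_a, cov_b, n_filters=1)
```

In practice, `cov_a` and `cov_b` would be trial-averaged covariances of the bandpass-filtered EEG, and log-variances of the filtered signals serve as the features for the LDA classifier.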


Scenario I

We trained two models with all subjects’ data under the same experimental settings to validate the effectiveness of the class disentanglement. On the one hand, ours (wo CD) was composed of an encoder, two discriminators, and a classifier, and was trained by minimizing Eq. (1), (3), and (4). On the other hand, ours (w CD) is our full proposed network. Ours (w CD) outperformed ours (wo CD) on average over subjects in -fold cross-validation, as reported in Table 1. In other words, we could enhance the performance by discarding class-irrelevant information from the subject-invariant feature.

Although we trained our network using all subjects within one training session, it showed reasonably high performance compared to the baseline methods trained individually. In particular, as shown in Figure 2, a few subjects who are empirically considered BCI-illiterate, i.e., unable to produce discriminative motor imagery patterns, showed promising performance improvements thanks to joint training with other subjects who do produce discriminative patterns.

Scenario II

In Scenario II, we conducted experiments with CSP [ramoser2000optimal], FBCSP [ang2008filter], and our proposed method. The results are shown in Table 2. Our proposed method was superior to the others in most cases. Additionally, we estimated $p$-values with the two-tailed Wilcoxon signed-rank test to validate the statistical significance of the differences between the proposed method and the baselines. This supports our hypothesis that the feature representation learned in our framework is effective for transfer and can be updated with a minimal number of samples for a new subject.


In this paper, we proposed a deep neural network that learns subject-invariant and class-disentangled representations for BCI tasks via mutual information estimation among features at different levels, in an end-to-end manner. The subject-invariant latent space can be deployed to decode a subject even with less training data, and performance can be improved by adding other subjects’ data, i.e., transfer learning. We evaluated our proposed method on a big motor imagery EEG dataset. Further, we expect our proposed method can be applied to other types of EEG signals.


This work was also supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No. 2017-0-00451; Development of BCI based Brain and Cognitive Computing Technology for Recognizing User’s Intentions using Deep Learning).