Subject Adaptive EEG-based Visual Recognition

10/26/2021 ∙ by Pilhyeon Lee, et al. ∙ Yonsei University

This paper focuses on EEG-based visual recognition, aiming to predict the visual object class observed by a subject based on his/her EEG signals. One of the main challenges is the large variation between signals from different subjects. This variation restricts recognition systems to the subjects involved in model training, which is undesirable for real-world scenarios where new subjects are frequently added. The limitation can be alleviated by collecting a large amount of data for each new user, yet doing so is costly and sometimes infeasible. To make the task more practical, we introduce a novel problem setting, namely subject adaptive EEG-based visual recognition. In this setting, a large amount of pre-recorded data from existing users (source) is available, while only a little training data from a new user (target) is provided. At inference time, the model is evaluated solely on the signals from the target user. This setting is challenging, especially because training samples from source subjects may not be helpful when evaluating the model on the data from the target subject. To tackle the new problem, we design a simple yet effective baseline that minimizes the discrepancy between feature distributions from different subjects, which allows the model to extract subject-independent features. Consequently, our model can learn the common knowledge shared among subjects, thereby significantly improving the recognition performance for the target subject. In the experiments, we demonstrate the effectiveness of our method under various settings. Our code is available at




1 Introduction

Brain-computer interface (BCI) has been a long-standing research topic for decoding human brain activities, playing an important role in reading the human mind with various applications [bci_app1, bci_app2, bci_app3, bci_app_new]. For instance, BCI systems enable a user to comfortably control machines without requiring any peripheral muscular activities [bci_game1, bci_game2]. In addition, BCI is especially helpful for people suffering from speech or movement disorders, allowing them to freely communicate and express their feelings by thinking [bci_kb, bci_emotion, bci_wheelchair, vp_eeg_new]. It can also be utilized to identify abnormal brain states, such as seizures, sleep disorders, and dementia [seizure1, seizure2, dementia, sleep]. Recently, taking it to the next level, numerous works attempt to decode brain signals to identify which audiovisual stimulus a person is perceiving, providing deeper insight for analyzing human perception [eeg_imgnet, bci_visaud, eeg_music, zsl_sbi].

There are different ways to collect brain signals, e.g., electroencephalography (EEG), magnetoencephalography (MEG), and functional magnetic resonance imaging (fMRI). Among them, EEG is considered the most favorable one for analyzing human brain activities since it is non-invasive and promptly acquirable. Owing to these advantages, EEG-based models have been widely explored and developed for various research fields such as disorder detection [disorder1, disorder2], drowsiness detection [dd_sbi, dd_2], and emotion recognition [em_sbi, seed_sbd, deap_sbd].

In this paper, we tackle the task of visual recognition based on EEG signals, whose goal is to classify the visual stimuli observed by subjects. Recently, thanks to the effectiveness of deep neural networks (DNNs), existing models have shown impressive recognition performance [em_sbi, dd_2, eeg_imgnet, eegdnn]. However, they suffer from the large inter-subject variability of EEG signals, which greatly restricts their scalability. Suppose that a model faces a new user not included in the training set, which is a common scenario in the real world. Since the EEG signals from the user are likely to differ largely from those used for training, the model would fail to recognize the classes. Therefore, in order to retain the performance, it is necessary to collect EEG signals for training from the new subject, which incurs additional costs proportional to the number of samples. If we had sufficient training samples for the new subject, the model would show great performance, but this is not the case in real-world scenarios.

To handle this limitation and bypass the expensive cost, we introduce a new practical problem setting, namely subject adaptive EEG-based visual recognition. In this setting, we have access to abundant EEG signals from various source subjects, whereas the signals from a new user (target subject) are scarce, i.e., only a few samples ($k$-shot) are allowed for each visual category. At inference, the model should correctly classify the EEG signals from the target subject. Fig. 1 provides a graphical illustration of the proposed problem setting.

Naturally, involving the copious samples from source subjects in the model training would bring about performance gains compared to the baseline using only signals from the target subject. However, as mentioned above, the signals obtained from the source and the target subjects differ from each other, and thus the performance improvements are limited. To maximize the benefits of pre-acquired data from source subjects, we here provide a simple yet effective baseline method. Our key idea is to allow the model to learn subject-agnostic representations for EEG-based visual recognition. Technically, together with the conventional classification loss, we design a loss to minimize the maximum mean discrepancy (MMD) between feature distributions of EEG signals from different subjects. In experiments under a variety of circumstances, our method shows consistent performance improvements over the vanilla method.

Our contributions are three-fold.

  • We introduce a new realistic problem setting, namely subject-adaptive EEG-based visual recognition. Its goal is to improve the recognition performance for the target subject whose training samples are limited.

  • We design a simple baseline method for the proposed problem setting. It encourages the feature distributions between different subjects to be close so that the model learns subject-independent representations.

  • Through experiments on the public benchmark, we validate the effectiveness of our model. Specifically, in the extreme 1-shot setting, it achieves a performance gain of 6.4% over the vanilla model.

2 Related work

2.1 Brain activity underlying visual perception

Over recent decades, research on visual perception has actively investigated the correlation between brain activity and visual stimuli [vp_eeg1, vp_eeg2, vp_eeg3]. Brain responses induced by visual stimuli come from the occipital cortex, which is the brain region for receiving and interpreting visual signals. In addition, visual information obtained by the occipital lobe is transmitted to the nearby parietal and temporal lobes to perceive higher-level information. Based on this prior knowledge, researchers have tried to analyze brain activities induced by visual stimuli. Eroğlu et al. [vp_eronlu] examine the effect of emotional images with different luminance levels on EEG signals. They also find that the brightness of visual stimuli can be represented by the activity power of the brain cortex. Stewart et al. [vp_stewart] attempt to distinguish the presence of visual stimuli within a single trial in EEG recordings. Their analyses reveal that the individual components of EEG signals are spatially located in the visual cortex and are effective in classifying visual states. More recently, Spampinato et al. [eeg_imgnet] tackle the problem of EEG-based visual recognition by learning a discriminative manifold of brain activities on diverse visual categories. Besides, they build a large-scale EEG dataset for training deep networks and demonstrate that human visual perception abilities can be transferred to deep networks. Kavasidis et al. [brain2image] propose to reconstruct the observed images by decoding EEG signals. They find that EEG contains some patterns related to visual contents, which can be used to effectively generate images that are semantically coherent with the visual stimuli.

In line with these works, we build a visual recognition model to decode EEG signals induced by visual stimuli. In addition, we design and tackle a new practical problem setting where a limited amount of data is allowed for new users.

2.2 Subject-independent EEG-based classification

Subject-dependent EEG-based classification models have been widely studied, achieving noticeable performance [mi1_sbd, mi2_sbd, seed_sbd, deap_sbd, hwang2021bci]. However, since EEG signal patterns vary greatly among individuals, building a subject-independent model remains an important open research topic. Hwang et al. [em_sbi] train a subject-independent EEG-based emotion recognition model by utilizing an adversarial learning approach that prevents the model from predicting the subject labels. Zhang et al. [attention_sbi] propose a convolutional recurrent attention model to classify movement intentions by focusing on the most discriminative temporal periods of EEG signals. In [dd_sbi], an EEG-based drowsy driving detection model is introduced, which is trained in an adversarial manner with gradient reversal layers in order to encourage the feature distributions of different subjects to be close.

Besides, to eliminate the expensive calibration process for new users, zero-training BCI techniques have been introduced, which do not require re-training. Lee et al. [cnn_ztb] try to find the network parameters that generalize well on common features across subjects. Meanwhile, Grizou et al. [ztb] propose a zero-training BCI method that controls virtual and robotic agents in sequential tasks without requiring calibration steps for new users.

Different from the works above, we tackle the problem of EEG-based visual recognition. Moreover, we propose a new problem setting to reduce the cost of acquiring labeled data for new users, as well as introduce a strong baseline.

3 Dataset

Before introducing the proposed method, we first present the dataset details for experiments. We use the publicly available large-scale EEG dataset collected by [eeg_imgnet] that consists of 128-channel EEG sequences lasting for 440 ms from six different subjects (five male and one female). The EEG signals are filtered using a notch filter (49-51 Hz) and a band-pass filter (14-72 Hz) to include two frequency bands, i.e., Beta and Gamma. The dataset contains 40 easily distinguishable object categories from ImageNet [ImageNet], which are listed in Table 1. The number of image samples looked at by subjects is 50 for each class, constituting a total of 2,000 samples. We use the official splits, keeping the ratio of training, validation, and test sets as 4:1:1. The dataset contains a total of 6 splits and we measure the mean and the standard deviation of performance of 6 runs in the experiments. We refer readers to the original paper [eeg_imgnet] for further details about the dataset.

n02106662 German shepherd n02951358 Canoe n03445777 Golf ball n03888257 Parachute
n02124075 Egyptian cat n02992529 Cellular telephone n03452741 Grand piano n03982430 Pool table
n02281787 Lycaenid n03063599 Coffee mug n03584829 Iron n04044716 Radio telescope
n02389026 Sorrel n03100240 Convertible n03590841 Jack-o’-lantern n04069434 Reflex camera
n02492035 Capuchin n03180011 Desktop computer n03709823 Mailbag n04086273 Revolver
n02504458 African elephant n03197337 Digital watch n03773504 Missile n04120489 Running shoe
n02510455 Giant panda n03272010 Electric guitar n03775071 Mitten n07753592 Banana
n02607072 Anemone fish n03272562 Electric locomotive n03792782 Mountain bike n07873807 Pizza
n02690373 Airliner n03297495 Espresso maker n03792972 Mountain tent n11939491 Daisy
n02906734 Broom n03376595 Folding chair n03877472 Pajama n13054560 Bolete
Table 1: The list of object classes utilized for collecting EEG signals with ImageNet [ImageNet] class indices.

4 Method

In this section, we first define the proposed problem setting (Sec. 4.1). Then, we introduce a baseline method with subject-independent learning to tackle the problem. Its network architecture is illustrated in Sec. 4.2, followed by the detailed subject-independent learning scheme (Sec. 4.3). An overview of our method is depicted in Fig. 2.

Figure 2: An overview of the proposed method. Colors and shapes respectively represent subject identities and classes. During feature learning, we train the model to accurately predict the class from the EEG signals. To alleviate the feature discrepancy of source and target signals, we propose a feature adaptation stage which minimizes the maximum mean discrepancy. Consequently, both source and target features are projected on the same manifold, enabling accurate predictions on target signals during inference.

4.1 Subject Adaptive EEG-based Visual Recognition

We start by providing the formulation of the conventional EEG-based visual recognition task. Let $\mathcal{D}^s = \{(x_i^s, y_i^s)\}_{i=1}^{N^s}$ denote the dataset collected from the $s$-th subject. Here, $x_i^s \in \mathbb{R}^{D \times T}$ denotes the $i$-th EEG sample of subject $s$ with its channel dimension $D$ and the duration $T$, while $y_i^s$ is the corresponding ground-truth visual category observed by the subject and $N^s$ is the number of the samples for subject $s$. In general, the EEG samples are abundant for each subject, i.e., $N^s \gg 0$. To train a deep model, multiple datasets from different subjects are assembled to build a single training set $\mathcal{D} = \{\mathcal{D}^1, \mathcal{D}^2, \dots, \mathcal{D}^S\}$, where $S$ is the total number of subjects. At inference, given an EEG sample $x^s$, the model should predict its category. Here, it is assumed that the input signal at test time is obtained from one of the subjects whose samples are used during the training stage, i.e., $s \in \{1, \dots, S\}$. However, this conventional setting is impractical, especially for the case where EEG data from new subjects are scarce.

Instead, we propose a more realistic problem setting, named Subject Adaptive EEG-based Visual Recognition. In this setting, we aim to utilize the knowledge learned from abundant data of source subjects to classify signals from a target subject whose samples are rarely accessible. For that purpose, we first divide the training set into source and target sets, i.e., $\mathcal{D}_{src}$ and $\mathcal{D}_{trg}$. We choose a subject and set it to be the target while the rest become the sources. For example, letting subject $S$ be the target, $\mathcal{D}_{src} = \{\mathcal{D}^1, \dots, \mathcal{D}^{S-1}\}$ and $\mathcal{D}_{trg} = \hat{\mathcal{D}}^S$. Based on the sparsity constraint, the target dataset contains only a few examples, i.e., $\hat{\mathcal{D}}^S = \{(x_i^S, y_i^S)\}_{i=1}^{\hat{N}^S}$, where $\hat{N}^S \ll N^S$. In practice, we make the target set have only $k$ samples with their labels per class ($k$-shot). Note that we here use the $S$-th subject as the target, but any subject can be the target without loss of generality. After being trained on $\mathcal{D}_{src}$ and $\mathcal{D}_{trg}$, the model is supposed to predict the class of an unseen input signal $x^S$ which is obtained from the target subject $S$.
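The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_subject_adaptive_split`, the dictionary layout of the per-subject datasets, and the random selection of the $k$ target samples are hypothetical choices and are not taken from the released code.

```python
import random
from collections import defaultdict


def make_subject_adaptive_split(datasets, target_subject, k, seed=0):
    """Split per-subject datasets into a source set and a k-shot target set.

    datasets: dict mapping subject id -> list of (signal, label) pairs.
    Returns (source, target): source keeps every sample of the non-target
    subjects; target keeps at most k samples per class of the target subject.
    """
    rng = random.Random(seed)
    source = {s: list(samples) for s, samples in datasets.items()
              if s != target_subject}

    # Group the target subject's samples by class label.
    by_class = defaultdict(list)
    for signal, label in datasets[target_subject]:
        by_class[label].append((signal, label))

    # Keep k randomly chosen samples per class (assumption: random choice).
    target = []
    for label in sorted(by_class):
        pool = by_class[label]
        target.extend(rng.sample(pool, min(k, len(pool))))
    return source, target


# Toy data: 3 subjects, 4 classes, 5 samples per class (signals are placeholders).
toy = {s: [(f"eeg_{s}_{c}_{i}", c) for c in range(4) for i in range(5)]
       for s in range(1, 4)}
source, target = make_subject_adaptive_split(toy, target_subject=3, k=1)
print(len(target), sorted(source))
```

With `k=1` the target set holds exactly one labeled sample per class, which corresponds to the 1-shot case, while the source subjects keep all of their samples.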

4.2 Network Architecture

In this section, we describe the architectural details of the proposed simple baseline method. Our network is composed of a sequence encoder $f(\cdot)$, an embedding layer $g(\cdot)$, and a classifier $h(\cdot)$. The sequence encoder $f$ is a single-layer gated recurrent unit (GRU), which takes as input an EEG sample $x$ and outputs the extracted feature representation $u = f(x) \in \mathbb{R}^{d_u}$, where $d_u$ is the feature dimension. Although the encoder produces the hidden representation for every timestamp, we only use the last feature and discard the others since it encodes the information from all timestamps. Afterwards, the feature $u$ is embedded to the semantic manifold by the embedding layer $g$, i.e., $v = g(u) \in \mathbb{R}^{d_v}$, where $d_v$ is the dimension of embedded features. The embedding layer $g$ is composed of a fully-connected (FC) layer with an activation function. As the final step, we feed the embedded feature $v$ to the classifier $h$ consisting of an FC layer with the softmax activation, producing the class probability $p(y \mid x; \theta)$. Here, $\theta$ is a set of the trainable parameters in the overall network. To train our network for the classification task, we minimize the cross-entropy loss as follows.

$$\mathcal{L}_{cls} = -\frac{1}{N_{src} + N_{trg}} \sum_{(x_i, y_i) \in \mathcal{D}_{src} \cup \mathcal{D}_{trg}} \log p(y_i \mid x_i; \theta), \tag{1}$$

where $N_{src}$ and $N_{trg}$ indicate the number of samples in source and target sets.
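To make the data flow concrete, the following NumPy sketch runs one forward pass through a GRU encoder, an FC embedding layer, and an FC softmax classifier, and then evaluates the cross-entropy loss. It is a minimal illustration rather than the authors' implementation: the gate formulation, the weight initialization, the toy sizes, and the Leaky ReLU slope of 0.01 are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_last_state(x, params):
    """Run a single-layer GRU over x of shape (D, T); return the last hidden state."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    h = np.zeros(Uz.shape[0])
    for x_t in x.T:                                       # iterate over time steps
        z = sigmoid(Wz @ x_t + Uz @ h + bz)               # update gate
        r = sigmoid(Wr @ x_t + Ur @ h + br)               # reset gate
        h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h) + bh)   # candidate state
        h = (1.0 - z) * h + z * h_tilde
    return h


def leaky_relu(a, slope=0.01):                            # slope is an assumption
    return np.where(a > 0, a, slope * a)


def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()


def forward(x, gru_params, W_g, b_g, W_h, b_h):
    u = gru_last_state(x, gru_params)                     # sequence encoder
    v = leaky_relu(W_g @ u + b_g)                         # embedding layer
    return softmax(W_h @ v + b_h)                         # classifier


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    return -np.mean([np.log(p[y]) for p, y in zip(probs, labels)])


rng = np.random.default_rng(0)
D, T, d, C = 8, 20, 8, 5                                  # toy sizes, not the paper's


def init(*shape):
    return rng.normal(scale=0.1, size=shape)


# Three (input weight, recurrent weight, bias) triples: update, reset, candidate.
gru_params = tuple(p for _ in range(3)
                   for p in (init(d, D), init(d, d), np.zeros(d)))
W_g, b_g = init(d, d), np.zeros(d)
W_h, b_h = init(C, d), np.zeros(C)

batch = [rng.normal(size=(D, T)) for _ in range(4)]
labels = [0, 1, 2, 3]
probs = [forward(x, gru_params, W_g, b_g, W_h, b_h) for x in batch]
loss = cross_entropy(probs, labels)
print(probs[0].shape, float(loss))
```

Only the hidden state after the final time step is passed on, which mirrors the choice of keeping the last feature of the encoder.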

4.3 Subject-independent Feature Learning

In spite of the learned class-discriminative knowledge, the model might not fully benefit from the data of source subjects due to the feature discrepancy between different subjects. To alleviate this issue and better exploit the source set, we propose a simple yet effective framework, where subject-independent features are learned by minimizing the divergence between feature distributions of source and target subjects. Concretely, for the divergence metric, we estimate the multi-kernel maximum mean discrepancy (MK-MMD) [long2015mmd] between the feature distributions $P^a$ and $P^b$ from two subjects $a$ and $b$ as follows.

$$\mathrm{MMD}(P^a, P^b) = \left\| \frac{1}{N^a} \sum_{i=1}^{N^a} \phi(u_i^a) - \frac{1}{N^b} \sum_{j=1}^{N^b} \phi(u_j^b) \right\|_{F}, \tag{2}$$

where $\phi(\cdot)$ is the mapping function to the reproducing kernel Hilbert space, while $\|\cdot\|_{F}$ indicates the Frobenius norm. $u_i^a$ denotes the $i$-th feature from subject $a$ encoded by the sequence encoder $f$, whereas $N^a$ and $N^b$ are the total numbers of samples from the $a$-th and the $b$-th subjects in the training set, respectively. In practice, we use the samples in an input batch rather than the whole training set due to the memory constraint. We note that the embedded feature $v$ could also be utilized to compute the discrepancy, but we empirically find that it generally performs worse than using $u$ (Sec. 5.3).
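A discrepancy of this kind is usually evaluated through the kernel trick, without forming the feature map explicitly. The NumPy sketch below computes a biased estimate of the squared MMD with an equally weighted mixture of RBF kernels. The bandwidth values, the equal kernel weights, and the use of the squared, biased estimator are assumptions made for illustration; the source does not specify them, and the function name `mk_mmd2` is hypothetical.

```python
import numpy as np


def mk_mmd2(xa, xb, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Biased estimate of the squared MMD between two feature batches.

    xa: (Na, d) features of one subject, xb: (Nb, d) features of another.
    The kernel is an equally weighted average of RBF kernels (assumption).
    """
    def kernel(p, q):
        # Pairwise squared Euclidean distances, shape (len(p), len(q)).
        sq_dist = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)
        return np.mean([np.exp(-sq_dist / (2.0 * s ** 2)) for s in bandwidths],
                       axis=0)

    return (kernel(xa, xa).mean() + kernel(xb, xb).mean()
            - 2.0 * kernel(xa, xb).mean())


rng = np.random.default_rng(0)
same = mk_mmd2(rng.normal(size=(64, 16)), rng.normal(size=(64, 16)))
shifted = mk_mmd2(rng.normal(size=(64, 16)), rng.normal(loc=1.0, size=(64, 16)))
print(same < shifted)
```

Two batches drawn from the same distribution give a value close to zero, whereas a shifted batch gives a clearly larger value; this gap is the quantity the adaptation is meant to shrink.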

Reducing the feature discrepancy between different subjects allows the model to learn subject-independent features. To make feature distributions from all subjects close, we compute and minimize the MK-MMD of all possible pairs of the subjects. Specifically, we design the discrepancy loss that is formulated as:

$$\mathcal{L}_{dis} = \sum_{a=1}^{S} \sum_{b=a+1}^{S} \mathrm{MMD}(P^a, P^b), \tag{3}$$

where $S$ is the number of the subjects in the training data including the target.

By minimizing the discrepancy loss, our model could learn subject-independent features and better utilize the source data to improve the recognition performance for the target subject. The overall training loss of our model is a weighted sum of the losses, which is computed as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{dis}, \tag{4}$$

where $\lambda$ is the weighting factor, which is empirically set to 1.
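The pairwise discrepancy loss and the weighted total loss can be sketched as follows. To keep the example self-contained and easy to check by hand, the per-pair discrepancy is replaced by a stand-in, the squared distance between per-subject feature means (which is the MMD under a linear kernel), not the RBF-based MK-MMD of the paper. Summing over pairs without normalization is also an assumption, and all function names are hypothetical.

```python
from itertools import combinations


def mean_embedding_gap(feats_a, feats_b):
    """Stand-in discrepancy: squared distance between per-subject feature means
    (MMD with a linear kernel). An RBF-based MK-MMD could be passed instead."""
    d = len(feats_a[0])
    mean_a = [sum(f[j] for f in feats_a) / len(feats_a) for j in range(d)]
    mean_b = [sum(f[j] for f in feats_b) / len(feats_b) for j in range(d)]
    return sum((p - q) ** 2 for p, q in zip(mean_a, mean_b))


def discrepancy_loss(features_by_subject, mmd_fn=mean_embedding_gap):
    """Sum the discrepancy over all unordered pairs of subjects in the batch."""
    subjects = sorted(features_by_subject)
    return sum(mmd_fn(features_by_subject[a], features_by_subject[b])
               for a, b in combinations(subjects, 2))


def total_loss(cls_loss, dis_loss, lam=1.0):
    """Weighted sum of classification and discrepancy losses."""
    return cls_loss + lam * dis_loss


feats = {1: [[0.0, 0.0], [2.0, 0.0]],   # source subject, mean (1, 0)
         2: [[1.0, 1.0], [1.0, 3.0]],   # source subject, mean (1, 2)
         3: [[4.0, 0.0]]}               # target subject, mean (4, 0)
dis = discrepancy_loss(feats)
print(dis, total_loss(0.5, dis))
```

Because the target subject is included among the pairs, its few samples are pulled toward the source distributions in the same way the sources are pulled toward each other.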

5 Experiments

5.1 Implementation Details

The input signals for our method contain a total of 128 channels ($D = 128$) with a recording unit of 1 ms, each of which lasts for 440 ms. Following [eeg_imgnet], we only use the signals within the interval of 320-480 ms, resulting in the temporal dimension $T = 160$. As described in Sec. 4.2, our model consists of a single-layer gated recurrent unit (GRU) followed by two fully-connected layers respectively for embedding and classification. For all layers but the classifier, we set the hidden dimensions equal to the input channel dimension to preserve the dimensionality, i.e., $d_u = d_v = 128$. For non-linearity, we put the Leaky ReLU activation after the embedding layer $g$. To estimate multi-kernel maximum mean discrepancy, we use the radial basis function (RBF) kernel [vert2004primer] as the mapping function. For effective learning, we make sure that all the subjects are included in a single batch. Technically, we randomly pick 200 examples from each source dataset and take all samples in the target dataset to configure a batch. Our model is trained in an end-to-end fashion from scratch without pre-training. For model training, we use the Adam [kingma2014adam] optimizer with a learning rate of .
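The batch composition described above can be illustrated with a short Python sketch. The function name `build_batch`, the `"target"` tag, and the data layout are hypothetical; only the rule of drawing 200 random samples per source subject and adding every target sample comes from the text.

```python
import random


def build_batch(source_sets, target_set, per_source=200, seed=None):
    """Compose one batch: per_source random samples from each source subject
    plus every sample of the (small) target set. Each item is tagged with its
    subject id so features can be grouped per subject afterwards."""
    rng = random.Random(seed)
    batch = []
    for subject, samples in source_sets.items():
        picked = rng.sample(samples, min(per_source, len(samples)))
        batch.extend((subject, x, y) for x, y in picked)
    batch.extend(("target", x, y) for x, y in target_set)
    return batch


# Toy data: 5 source subjects with 300 samples each; 1-shot target over 40 classes.
sources = {s: [(f"x{s}_{i}", i % 40) for i in range(300)] for s in range(1, 6)}
target = [(f"xt_{c}", c) for c in range(40)]
batch = build_batch(sources, target, seed=0)
print(len(batch))
```

Keeping the subject tag on every item makes it straightforward to split the encoder features by subject when the pairwise discrepancy is computed on the batch.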

Validation set
Subject top-1 accuracy (%) top-3 accuracy (%)
$k$-shot Vanilla Ours $k$-shot Vanilla Ours
Test set
Subject top-1 accuracy (%) top-3 accuracy (%)
$k$-shot Vanilla Ours $k$-shot Vanilla Ours
Table 2: Quantitative comparison of methods by changing the target subject. For evaluation, we select one subject as a target and set the rest as sources, then compute the top-1 and top-3 accuracy for the test set from the target subject. Note that only a single target sample for each class is included in training, i.e., 1-shot setting. We measure the mean and the standard deviation of a total of 5 runs following the official splits.
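For readers who want to reproduce this kind of evaluation, the sketch below shows one way to compute top-1 and top-3 accuracy and to aggregate results over runs. The function name `top_n_accuracy` and the toy numbers are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import mean, stdev


def top_n_accuracy(scores, labels, n):
    """Percentage of samples whose true label is among the n highest scores."""
    hits = 0
    for row, y in zip(scores, labels):
        top = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:n]
        hits += y in top
    return 100.0 * hits / len(labels)


# Toy class scores for three samples over four classes.
scores = [[0.7, 0.2, 0.1, 0.0],
          [0.1, 0.2, 0.3, 0.4],
          [0.3, 0.4, 0.2, 0.1]]
labels = [0, 2, 3]
top1 = top_n_accuracy(scores, labels, 1)
top3 = top_n_accuracy(scores, labels, 3)

# Aggregate made-up per-run accuracies into mean and standard deviation.
run_accuracies = [30.0, 32.0, 34.0]
print(top1, top3, mean(run_accuracies), stdev(run_accuracies))
```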

5.2 Quantitative Results

To validate the effectiveness of our method, we compare it with two different competitors: the $k$-shot baseline and the vanilla model. First, the $k$-shot method is trained exclusively on the target dataset. As the amount of target data is limited, the model is expected to perform poorly, and it serves as the baseline for investigating the benefit of source datasets. Next, the vanilla model is a variant of our method that discards the discrepancy loss. Its training depends solely on the classification loss without considering subjects, and thus it can demonstrate the effect of abundant data from other unrelated subjects.

Validation set
top-1 accuracy (%) top-3 accuracy (%)
$k$-shot Vanilla Ours $k$-shot Vanilla Ours
Test set
top-1 accuracy (%) top-3 accuracy (%)
$k$-shot Vanilla Ours $k$-shot Vanilla Ours
Table 3: Quantitative comparison of methods by changing the number of target samples per class provided during training. The value of $k$ means that only $k$ samples of the target subject are used for training. We measure the mean and the standard deviation of a total of 5 runs for all subjects following the official splits.

Comparison in the 1-shot setting.

We first explore the most extreme scenario of our subject adaptive EEG-based visual classification, i.e., the 1-shot setting. In this setting, only a single example for each visual category is provided for the target subject. The experimental results are summarized in Table 2. As expected, the $k$-shot baseline performs the worst due to the scarcity of training data. When including the data from source subjects, the vanilla setting improves the performance to an extent. However, we observe that the performance gain is limited due to the representation gap between subjects. On the other hand, our model manages to learn subject-independent information and brings a large performance boost over the vanilla method regardless of the choice of the target subject. Specifically, the top-1 accuracy of subject #1 on the validation set is improved by 7.2% over the vanilla method. This clearly validates the effectiveness of our approach.

top-1 accuracy (%) top-3 accuracy (%)
after $f$ after $g$ after $f$ after $g$
Table 4: Ablation on the location of feature adaptation. We compare two variants that minimize discrepancy after the sequence encoder $f$ and the embedding layer $g$, respectively. We measure the mean and the standard deviation of a total of 5 runs for all subjects.

Comparison with varying $k$.

To investigate the performance in diverse scenarios, we evaluate the models with varying $k$ for the $k$-shot setting. Specifically, we change $k$ from 1 to 5 and the results are provided in Table 3. As expected, increasing $k$ leads to performance improvements for all the methods. It can also be noticed that regardless of the choice of $k$, our method consistently outperforms the competitors with non-trivial margins, indicating the efficacy and the generality of our method. Meanwhile, the performance gaps between the methods get smaller as $k$ grows, since the benefit of source datasets vanishes as the volume of the target dataset increases. We note, however, that a large value of $k$ is impractical and sometimes even unreachable in the real-world setting.

5.3 Analysis on the location of feature adaptation

Our feature adaptation with the discrepancy loss (Eq. 3) can be applied to any layer of the model. To analyze the effect of its location, we compare two variants that minimize the distance of feature distributions after the sequence encoder $f$ and the embedding layer $g$, respectively. The results are shown in Table 4, where the variant “after $f$” generally shows better performance compared to “after $g$” except for the case where $k$ is set to 1. We conjecture that this is because a single GRU encoder (i.e., $f$) cannot align feature distributions from different subjects well when the amount of the target dataset is too small. However, with a sufficiently large $k$, the variant “after $f$” consistently performs better with obvious margins. Based on these results, we compute the MK-MMD on the features after the sequence encoder $f$ by default.

6 Concluding Remarks

In this paper, we introduce a new setting for EEG-based visual recognition, namely subject adaptive EEG-based visual recognition, where plentiful data from source subjects and sparse samples from a target subject are provided for training. This setting is cost-effective and practical in that it is often infeasible to acquire sufficient samples for a new user in the real-world scenario. Moreover, to better exploit the abundant source data, we introduce a strong baseline that minimizes the feature discrepancy between different subjects. In the experiments with various settings, we verify the effectiveness of our method compared to the vanilla model. We hope this work will trigger further research under realistic scenarios with data scarcity, such as subject generalization [ghifary2015domain, jeon2021stylizationDG].


This work was supported by Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00451: Development of BCI based Brain and Cognitive Computing Technology for Recognizing Users Intentions using Deep Learning, No. 2020-0-01361: Artificial Intelligence Graduate School Program (YONSEI UNIVERSITY)).