A novel multimodal approach for hybrid brain-computer interface

04/25/2020 ∙ by Zhe Sun, et al. ∙ Nankai University 0

Brain-computer interface (BCI) technologies have been widely used in many areas. In particular, non-invasive technologies such as electroencephalography (EEG) or near-infrared spectroscopy (NIRS) have been used to detect motor imagery, disease, or mental state. It has been already shown in literature that the hybrid of EEG and NIRS has better results than their respective individual signals. The fusion algorithm for EEG and NIRS sources is the key to implement them in real-life applications. In this research, we propose three fusion methods for the hybrid of the EEG and NIRS-based brain-computer interface system: linear fusion, tensor fusion, and pth-order polynomial fusion. Firstly, our results prove that the hybrid BCI system is more accurate, as expected. Secondly, the pth-order polynomial fusion has the best classification results out of the three methods, and also shows improvements compared with previous studies. For a motion imagery task and a mental arithmetic task, the best detection accuracy in previous papers were 74.20% and 88.1%, whereas our accuracy achieved was 77.53% and 90.19% . Furthermore, unlike complex artificial neural network methods, our proposed methods are not as computationally demanding.



There are no comments yet.


page 1

page 2

page 3

page 4

page 5

page 6

page 8

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Brain-computer interface (BCI) is an important tool for the detection of signal patterns in brain activity. It has been used widely in many areas, such as in robotics control, workload detection or brain-disease detection[muhl2014eeg, ang2015randomized, latchoumane2012multiway, blankertz2010berlin, liu2020efficient, mao2019brain, wu2020electroencephalographic]. BCI technology is usually divided into invasive BCI and non-invasive BCI. Invasive BCI requires the sensors to record brain activities from within the skull. Non-invasive BCI, on the other hand, records brain signals using sensors placed on the scalp, and it is undoubtedly a much safer technology and easier to use [waldert2016invasive]. However, signals from non-invasive BCI sensors are usually full of noise, from the subjects’ unconscious eye movement or ambient noise, for example. Finding a way to use non-invasive sensors to detect brain signals accurately is still a big challenge[zhang2015sparse].

Transitional non-invasive BCI systems use either electroencephalography (EEG) sensors, near-infrared spectroscopy (NIRS) sensors or Magnetic Resonance Imaging (MRI) to record brain activities[zhang2019strength, zhang2014frequency, fazli2012enhanced, sitaram2009hemodynamic]. Compared with the MRI equipment, EEG and NIRS equipments are low-cost and smaller in size. Both EEG and NIRS have been implemented in many real time BCI applications[fazli2012enhanced].

EEG equipment consists of metal electrodes placed on the scalp to record electrical signals [zheng2018using]

. The electrodes record the activity of the surrounding neurons. For the EEG-based BCI system, classical feature extraction methods such as common spatial patterns (CSP)

[CSP1998], power spectrum density (PSD) and auto-regressive modeling (AR)[MAFeature] have been previously proposed to analyze and localize the EEG patterns and activated brain area of motor imagery (MI) and mental arithmetic (MA)[features]

. Then, in regards to classifiers, several machine learning classifiers such as k-nearest neighbours (KNN)


, support vector machine (SVM)

[SVM1, SVM2], and linear discriminant analysis (LDA) [LDA]

are used in classifying these proposed EEG features. In recent years, with the wide application of deep learning, many studies have already shown that convolution neural network (CNN) exhibits good performance for EEG processing or NIRS processing

[zhang2019novel, trakoolwilaiwan2017convolutional]. CNN and CNN-based models, such as EEGNet [lawhern2018eegnet]

or convolutional neural networks-stacked autoencoders (CNN-SAE)

[cnn-sae], have obtained remarkable MI classification results. Moreover, some researchers have developed an LSTM-based framework [lstm] to extract essential features of time-varying EEG signals to classify motor imagery signals.

NIRS-based BCI systems, generally used to measure the hemodynamic signals from target regions of the brain [nirs_intro], can also analyze oxy-hemoglobin NIRS (oxy-NIRS) or deoxy-hemoglobin NIRS (deoxy-NIRS) concentrations in order to localize and classify the brain activity [hishimoto2019mitochondrial]. Generally, when a specific brain area becomes more active, energy metabolism increases, leading to an increased oxygen consumption and increased levels of carbon dioxide in the area. Then, oxy-NIRS signals will decrease while deoxy-NIRS signals will increase. For the signal processing or pattern classification of NIRS, transitional algorithms such as SVM or artificial neural networks are used[naseer2015fnirs].

Instead of using EEG or NIRS mehods individually, using them simultaneously provides a multimodal hybrid BCI. It has been repeatedly proven in literature that multimodal BCI systems have a better accuracy and stability than the BCIs based on a single modality [openaccess2017, shin2018ternary, chiarelli2018deep]

. For instance, Jaeyoung Shin et al. presented an open source benchmark for a hybrid EEG-NIRS fusion BCI system

[openaccess2017, fazli2012enhanced] which also uses a linear discriminant analysis classifier for the multimodal data classifications.

Non-linear feature fusion is commonly used in deep learning, and there are many ways to achieve it. For instance, in [lin2015bilinear] a bilinear method was used to fuse two feature vectors, which can be thought of as the 2-order tensor product fusion. Even further, a trilinear method was used in [ben2017mutan] to fuse three modality features. In addition, some tensor-based methods have been proposed to solve multimodal fusion problems such as Visual Question Answering: for example, [bai2018deep] proposed a Deep Attention Neural Tensor Network which can discover the joint correlations over images, questions and answers with tensor-based representations; [liu2018efficient] proposed a low-rank multimodal fusion method which performs multimodal fusion using low-rank tensors to improve efficiency; finally, [hou2019deep]

proposed a polynomial tensor pooling block for integrating multimodal features by considering high-order moments.

In this study, a novel approach for the hybrid BCI system is proposed. Firstly, CNN was used to get the feature vectors from the EEG, oxy-NIRS, and deoxy-NIRS signals. A single-modal data was calculated to compare with hybrid-system based classification results. Then, three fusion methods were used to process the multimodal feature vectors: linear fusion classification method, tensor fusion classification method and the th-order polynomial fusion classification method. Jaeyoung Shin et al’s benchmark dataset [openaccess2017] was used for comparison purposes to validate the methods proposed in this paper.

The classification results for both the motor imagery (MI) experiment data and the mental arithmetic (MA) experiment data were compared with results from previous works. In both these two different tasks, the th-order polynomial fusion classification shows the best classification performance. In the MI task, the accuracy achieved was 77.53%, while the accuracy achieved in the MA task was 90.19%. Furthermore, compared with the deep neural networks method, our algorithm is simple and uses less computing resources. It can be used for real-time hybrid BCI systems.

The rest of the paper is organized as follows: hybrid BCI benchmark dataset and fusion algorithm are introduced in Section II. Experiments and pre-processing, as well as their results, are presented and discussed in Section III. The discussion and conclusions are given in the Section IV and Section V respectively.

Ii Materials and Methods

Ii-a Datasets

Subjects 29 subjects, consisting of 14 males and 15 females, participated in the data collection procedures. The average age was (meanstandard deviation).

Data Acquisition EEG data was collected by a thirty-channel BrainAmp EEG amplifier (Brain Products GmbH, Gilching, Germany), with a sampling rate of 1000 Hz. These electrodes were placed according to the international 10-5 system. NIRS data was collected by NIRScout (NIRx GmbH, Berlin, Germany) with a sampling rate of 12.5 Hz and containing thirty-six channels. More details about the electrodes’ position can be found in [openaccess2017].

Fig. 1: Locations of the EEG electrodes (solid circles), NIRS sources (yellow hexagons) and detectors (orange hexagons).
Fig. 2: The flow of experiments.
Fig. 3: An example of EEG, oxy-NIRS and deoxy-NIRS data. Notice that the NIRS data are of low-frequency and are filtered by a bandpass filter from 0.01 Hz to 0.1 Hz.

Motor imagery (MI) experiment Subjects were required to conduct kinesthetic MI. Namely, they needed to imagine opening or closing their hand while they were grabbing a ball, so that the MI is actual MI rather than visual. Each trial consisted of three parts: instruction (2s), task (10s) and rest (15-17s). Each session consisted of a pre-rest (60s) , 20 repetitions of the aforementioned trial, and a post-rest (60s). The MI experiment dataset contains the data from 3 sessions.

Mental arithmetic (MA) experiment Subjects were required to conduct subtraction in the form of a three-digit number minus a one-digit number (e.g., 384-8). As in the MI experiment, the MA experiment dataset also contains data on 3 sessions where each session consists of 20 repetitions of the trial. Subjects were given an initial subtraction calculation, and completed this subtraction as their first trial of their total 20. For the remaining 19, the new subtraction was the result from the last calculation minus the initial one-digit number, which thus remained constant throughout the trials.

Data pre-processing

The EEG data was first re-referenced by using common average reference and filtered with the 4th-order Chebyshev type 2 filter with a bandpass of 0.5-50 Hz. Then, independent component analysis (ICA)-based EOG rejection was used to remove artifacts. After that, the EEG data was downsampled to 200 Hz. For each trial, 35s of data was extracted, containing a segment of the last rest (10s), the task (10s) and the final rest (15s). Then, a 3s time window with 1s step was employed to collect data. Eventually, 33 segments were obtained from each trial. Each segment has a shape of

(channelstimes). For the NIRS data, the concentration changes of oxy- and deoxy-hemoglobin (oxy-NIRS and deoxy-NIRS) were first calculated by using the modified Beer-Lambert law. The oxy-NIRS and dexoy-NIRS were then filtered by a 6th-order zero-phase Butterworth filter with a bandpass of 0.01-0.1 Hz. After that, a 3s time window like the EEG was used to segment the NIRS data. After this process, a segment ends up with a shape of . Thus, for both the MI and MA experiments, each subject completed 60 trials, and each trial has 33 segments. Fig. 3 shows the segments of one of the subjects for three modal data from 0s to 3s.

Ii-B Multimodal Classification Method

In this section, we will first show how the classification model is built by using single modal data. Following this, two common methods to fuse features in deep learning are introduced to allow for comparison with our method and a better understanding of it. After that, our th-order polynomial fusion method is introduced. In the end, in order to tackle the unacceptably large amount of parameters, tensor decomposition is used.

Notation Focusing on the tri-modal task, assume , and denote the EEG, oxy-hemoglobin NIRS (oxy-NIRS) and deoxy-hemoglobin NIRS (deoxy-NIRS), respectively. Also assume that , and denote their respective feature vectors obtained from three CNNs. To express the formula concisely, Einstein notation [harrison2016numeric] is applied to describe multiplication between tensors or between a tensor and a vector. In particular, if we assume to denote a vector and to denote a 3rd-order tensor, their product can be written as , which means . Also, when we concatenate , and , the result is then written as . Given a vector , its first copy is written as , and its th copy is .

Single modal classification We have data from three different modalities, and each has a shape of channelstimes. Specifically, these are EEG, oxy-NIRS and deoxy-NIRS, with shapes , and

, respectively. For the single-modal classification, we conduct 1D-CNN and then each convolutional layer is followed by a batch-norm layer and a ReLU layer

[schirrmeister2017deep]. Since oxy-NIRS and deoxy-NIRS have same shape, we use the same CNN structure for them. Table I and II show the CNNs used on EEG and NIRS in detail. Notice that the feature vectors , and are the results of “AvgPool1d”.

Considering the triple modal fusion problem, there are linear and multilinear methods. Fig. 4, 5, and 6 show three ways to fuse multimodal data.

Linear fusion (LF) classification It is commonly used to simply concatenate all feature vectors obtained from the output of “AvgPool1d”. Thus, the linear fusion product can be written as


where is a fusion matrix, and is the fused feature vector with a length of . is then put in a fully connected network (FCN) in order to generate a classification result. Fig. 4 shows the flow of LF classification.

Fig. 4: Diagram of the linear fusion (LF) method. Fused data is classified using a fully connected neural network (FCN).

Tensor fusion (TF) classification For the tensor fusion, the outer product is usually introduced. It is a natural way to obtain a feature containing the interaction amongst multiple feature vectors. In our task, the outer product result can be written as


where is a 3rd-order feature tensor, and are the lengths of three modal feature vectors. After that, a 4th-order weight tensor is used to obtain the fused feature vector :


Then, normalization is applied to . Finally, FNC is used for classification. This operation is the same as the one shown above in the LF classification section. Fig. 5 shows the flowchart of TF classification.

Fig. 5: Diagram of a tensor fusion method. Fused data is classified using a fully connected neural network (FCN).

th-order polynomial fusion (th-PF) classification Tensor fusion considers the interaction between multiple feature vectors. However, the interactions within each feature vector or between two of the feature vectors are not present in the fusion. To tackle this problem, polynomial fusion is introduced. As in with linear fusion, we firstly obtain by concatenating all feature vectors. Then, copies of are made and the outer product is calculated:


Also, as in with tensor fusion, a th-order weight tensor is employed for fusion:


The subsequent operations are the same as shown in the TF classification section. Fig. 6 shows the flowchart of PF classification.

Fig. 6: Diagram of the th-order polynomial fusion method. Fused data is classified using a fully connected neural network (FCN).

Rethink three fusion methods As it turns out, the fusion weight matrix/tensor in these fusion methods can be “decomposed”. It should be noticed that the “decomposition” here is just to decompose the aforementioned formulas in order to make them easier to understand. For LF, Eq. 1 can be written as:




We can see that LF actually only considers the single modal feature and simply adds them together. For TF, the in Eq. 3 cannot be “decomposed” since three modal vectors are entangled together. This may raise the problem of overfitting in the model, because the noise in the sample may be amplified during feature entanglement. For th-PF, LF is a special case of th-PF when . To be concise with the formula, if we let , then Eq. 5 can be written as:






We can therefore see that all interactions between modes are taken into account. Compared with TF, this kind of entanglement may also introduce more noise, but at the same time it also generates more fruitful features. It can be observed, from the experimental results, that this operation has indeed a positive effect.

Dimension reduction by tensor decomposition In deep learning, the usually large amount of parameters can lead to difficult training or overfitting [shi2017deep, zhang2015sparse]. The number of parameters required by TF increases exponentially along with the increase of the mode, and the number of parameters required by PF increases exponentially along with the increase of the order. In order to tackle this issue, canonical/polyadic (CP) [kolda2009tensor] decomposition is used. Its main idea is based on using several core tensors to represent the original weight tensor. Given a weight tensor , this tensor can be represented by:


where is called rank, and is used to control the size of all core tensors. With the increase of , the core tensors can better approximate the weight tensor (i.e. increasing the ability of representation). Thus, Eq. 5 can be written as:


Considering that all of are the same, we can further reduce the parameters by assuming symmetric structure of the core tensors, namely, .

Complexity analysis Assume that three feature vectors have lengths of , and , and the fusion vector has a length of . Let CP rank be . Then, the computational and storage complexity for LF, TF and th-order PF will be , and , respectively. By conducting CP decomposition, the computational and storage complexity for TF and th-order PF will be and . By assuming symmetric structure of the core tensors, the complexity for th-order PF is reduced to . We can thus see that CP decomposition significantly reduces the amount of parameters, and also that symmetric structure can further reduce parameters on th-order PF.

Layer Input (CT) Operation Filter size Stride Padding Output
1 30600 Conv1D 9 4 0 60148
60148 BatchNorm+ReLU 60148
2 60148 Conv1D 3 1 0 60146
60146 BatchNorm+ReLU 60146
3 60146 Conv1D 3 1 0 60144
60144 BatchNorm+ReLU 60144
4 60144 Conv1D 9 4 0 12032
12032 BatchNorm+ReLU 12032
5 12032 Conv1D 3 1 0 12030
12030 BatchNorm+ReLU 12030
6 12030 Conv1D 3 1 0 12028
12028 BatchNorm+ReLU 12028
7 12028 AvgPool1D 120
8 120 Linear 120 60
9 60 ReLU 60
10 60 Linear 60 2
11 2 Softmax 2
TABLE I: 1D-CNN neural network structure to extract the EEG signal features.
Layer Input (CT) Operation Filter size Stride Padding Output
1 3630 Conv1D 9 4 0 7213
7213 BatchNorm+ReLU 7213
2 7213 Conv1D 3 1 0 7211
7211 BatchNorm+ReLU 7211
3 7211 Conv1D 3 1 0 729
729 BatchNorm+ReLU 729
4 729 Conv1D 9 4 0 1447
1447 BatchNorm+ReLU 1447
5 1447 Conv1D 3 1 0 1445
1445 BatchNorm+ReLU 1445
6 1445 Conv1D 3 1 0 1443
1443 BatchNorm+ReLU 1443
7 1443 AvgPool1D 144
8 144 Linear 144 72
9 72 ReLU 72
10 72 Linear 72 2
11 2 Softmax 2
TABLE II: 1D-CNN neural network structur to extract the NIRS signal features.

Iii Experiments

Iii-a Network configuration

For the single-modal model, the network details are shown in Table I and Table II. For the triple-modal model, the CNN model for extracting feature vectors is the same as the one before “Avgpool1d” in the Table I and Table II. The fused feature vector has a length of 128. TF employs CP decomposition, and th-order PF employs CP decomposition as well as symmetric structure. After that, a FCN with one layer is employed for classification.

In the training phase, we applied cross-entropy as the loss function, and the Adam optimizer to train the model, where the learning rate was set to

and other parameters were default. All models were trained by 300 epochs with a mini-batch size of 16.

For the triple-modal data, we have 29 subjects, with each subject carrying out 60 trials. A trial lasts 35 seconds, where resting goes from -10s to 0s, the task goes from 0s to 10s, followed by another resting period from 10s to 25s. We only used 3s data as the input for the model. For each subject, we conducted a 5-fold cross validation and took the average. Finally, all the subjects’ results were averaged.

Iii-B Results

We firstly compared the classification accuracy results with the best results from the following published paper [openaccess2017]. Our methods show significant improvement.

Fig. 7: Classification accuracy for the MI (figure a) and MA (figure b) tasks. Comparison between our methods and the ones in [openaccess2017]. MI: Motion Imagery task; MA: Mental arithmetic task.
Fig. 8: (a): Classification accuracy for the motion imagery (MI) task. (b): Classification accuracy for mental arithmetic (MA) task.

The results of the classification accuracies of our methods are displayed in Fig. 8, where x-axis indicates the left edge of the moving time window and the y-axis shows the accuracy. We can see that all the tri-modal fusion models perform better than single-modal models. Furthermore, in both tasks, PF achieves better results than LF and TF.

In the MI task, the best results for EEG, oxy-NIRS, deoxy-NIRS, LF, TF and PF are obtained by the time window 1 (71.55%), 1 (67.01%), 5 (69.31%), 7 (75.29%), -1 (75.34%) and 1 (77.53%)(5th-order), respectively.

In the MA task, these respective time windows for optimal results are 7 (81.60%), 10 (86.28%), 8 (84.42%), 7 (89.68%), 8 (87.44%) and 8 (90.19%)(3rd-order), respectively. The best results for LF, TF, PF and [openaccess2017] are shown in Fig. 7.

Iv Discussion

Iv-a Combining EEG and NIRS signals

In recent years, a large number of new technologies have been developed for the analysis of brain activity from single-modal BCI equipment. New algorithms are always being developed over time that improve the reliability of the current BCI system. However, a single-model BCI system still has many limitations. For instance, a multimodal BCI has better anti-noise capabilities than a single-model BCI [shin2018improvement, khan2014multi, wallois2012usefulness, wallois2012usefulness]. Multimodality implies a multi-view, and thus a different representation for the same brain activity patterns[fazli2013brain]. It helps the algorithm to detect specific patterns from the signals. Many previous studies are in line with our results having also indicated the improvement by fusion of multi-signals[ahn2017multi].

We selected a NIRS and EEG-based hybrid BCI fusion system because both of these two systems are real-time and low-cost[shin2018ternary, pei2020brainkilter]. Additionally, compared with MRI, these two technologies do not require a medical license. We believe EEG and NIRS-based hybrid BCI methods can and will be used as a critical tool in many applications.

Iv-B Hybrid BCI signal processing based on tensor fusion

According to recent literature, deep learning methods or artificial neural networks have been applied for the detection of patterns in brain activity. Several artificial neural networks have also been applied to this dataset. However, almost all of the neural networks result in an over-fitting problem since the BCI dataset includes small samples. Although tuning the neural structures or parameters could slightly improve the accuracy, the method’s stability leaves much to be desired [zhang2019novel, appriou2020modern].

A tensor is a higher-order array that represents signals from different types of sensors[lahat2015multimodal, acar2017tensor, hunyadi2017tensor]. For example, EEG data or NIRS data can be represented by time × frequency × electrode, and functional MRI data can be represented by voxels × scans × subjects [beckmann2005tensorial, miwakeichi2004decomposing, acar2007multiway]. In addition, tensor decomposition has been widely used to capture brain signal patterns. In this work, the classification accuracy of our proposed methods is much better than neural networks, and our system is also very robust.

V Conclusions

In this paper we have focused on using tensor fusion methods to construct a hybrid BCI system. The hybrid BCI system includes simultaneous EEG signals and NIRS signals. It also includes two tasks: a MI task and a MA task. We used a shallow CNN neural network to detect the EEG and NIRS feature vectors. Then, we used LF, TF and th-order PF fusion methods to integrate the feature vectors. The results show that the th-order PF fusion method has the best classification accuracy out of the three fusion methods and the single-modal classification. Furthermore, when comparing the performance of our method with published literature, th-order shows better results. We believe our method could be useful for hybrid BCI systems.


The authors would like to thank Dr. Kai Zhang for the fruitful advice and Pau Solé-Vilaró for the English review.

V-a References