
EEG-Transformer: Self-attention from Transformer Architecture for Decoding EEG of Imagined Speech

Transformers are groundbreaking architectures that have changed the course of deep learning, and many high-performance models have been built on them. The Transformer is implemented solely with attention, following the encoder-decoder structure of seq2seq without using RNNs, yet it outperforms RNN-based models. Herein, we investigate a decoding technique for electroencephalography (EEG) recorded during imagined speech and overt speech, composed of the self-attention module from the Transformer architecture. We performed classification for nine subjects using a convolutional neural network based on EEGNet that captures temporal-spectral-spatial features from the EEG of imagined speech and overt speech. Furthermore, we applied the self-attention module to EEG decoding to improve performance and reduce the number of parameters. Our results demonstrate the possibility of decoding brain activity during imagined speech and overt speech using attention modules, and suggest that single-channel EEG or ear-EEG may suffice to decode imagined speech for practical BCIs.


I Introduction

Brain-computer interfaces (BCIs) are one of the most important considerations for communication systems in real life. Many researchers have studied BCIs to recognize human cognitive states or intentions from brain signals such as electroencephalography (EEG) by extracting the crucial features of brain activity [39, 11, 40]. To enhance the performance of decoding EEG signals, preprocessing is also important, as it yields high-quality signals with higher decoding accuracy and higher signal-to-noise ratio [3, 23, 13, 24]. Moreover, decoding technologies, including feature extraction and classification, have improved significantly in recent years [14, 29, 19, 15, 25].

Recognizing brain activity during overt or imagined speech has recently attracted considerable attention and is developing rapidly [10, 32]. In particular, imagined speech is regarded as an advanced technology for brain-signal-based communication systems [37, 22, 18]. Imagined speech refers to the internal pronunciation of speech by imagination alone, without auditory output or articulation [33]. Recent studies have revealed some features and potentials of imagined speech decoding [22, 28], but the fundamental neural properties and their practical use remain to be investigated. Therefore, research on decoding imagined speech requires the development of brain-signal decoding techniques for more accurate and practical BCIs [21, 1].

Several deep learning techniques have been published for decoding EEG brain signals, with architectures designed to account for the characteristics of brain signals [31, 16]. These have often been used to decode human intention from motor imagery or event-related potentials, and have shown superior performance to conventional machine learning methods such as linear discriminant analysis and support vector machines [34, 15, 9]. Recently, there have been several attempts to find optimal representations of EEG with deep neural networks based on its three main features: temporal, spectral, and spatial [36, 2]. In addition, EEG-based speaker identification studies have also actively applied machine learning or deep learning techniques [27, 4]. Deep learning may be effective in capturing prominent features from brain signals to verify individual characteristics.

Fig. 1:

Overall framework of this study. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. To perform classification, we use the standard approach of adding an extra learnable classification token to the sequence.


The Transformer [35] is a model from Google's 2017 paper "Attention is all you need," implemented only with attention while following the encoder-decoder structure of seq2seq. The model does not use an RNN, yet even with the encoder-decoder design it outperforms RNNs. It is the basis of famous language models such as GPT-3 and DALL-E, and tools such as the Hugging Face Transformers library have made it easy for machine learning engineers to solve a wide range of NLP tasks, promoting numerous innovations in NLP and other fields [41, 38, 30, 7]. The Transformer's attention was created to overcome the limitation of RNNs, whose computation is slow because it is difficult to parallelize. Transformers do not need to process data sequentially like RNNs, and this processing style permits far more parallelization.

Recently, there have been several attempts to commercialize BCI technology [8, 5]. For example, portable EEG devices placed on non-hair-bearing areas have frequently been investigated to improve the applicability of BCI in real life, and endogenous paradigms such as motor imagery and imagined speech are used rather than exogenous paradigms such as event-related potentials and steady-state visual evoked potentials, which require external devices to deliver stimuli [26]. In particular, ear-EEG, composed of electrodes placed inside or around the ear, has many advantages over conventional scalp-EEG in terms of stability and portability. In addition, since the Broca-Wernicke region, which is mainly analyzed during overt or imagined speech, lies close to the left-ear channels, there is a possibility that only a small number of channels can be used to recognize the user's intention [10, 22].

II Materials and Methods

II-A Data Description

The experimental protocol followed the previous works [22, 21]. Nine subjects (three males; age 25.00 ± 2.96) participated in the study. The study was approved by the Korea University Institutional Review Board [KUIRB-2019-0143-01] and was conducted in accordance with the Declaration of Helsinki. Informed consent was obtained from all subjects.

We recorded scalp EEG signals during overt speech and imagined speech. After two seconds of resting state, two seconds of voice audio for each word/phrase were presented, followed by consecutive trials of imagined or overt speech [22, 20]. Each block repeated the imagined or overt speech four times, but only the first trial of each block was used, to match the number of trials across experimental conditions. Each participant performed 25 randomized repetitions for each of the 12 words, yielding a total of 300 trials per condition. There were 13 classification outputs, consisting of 12 words (ambulance, clock, hello, help me, light, pain, stop, thank you, toilet, TV, water, and yes) and the resting state.

II-B EEG Preprocessing

The EEG signals were down-sampled to 250 Hz and segmented into the two seconds from the start of each trial. Preprocessing was performed with a 5th-order Butterworth filter in the high-gamma band (30–120 Hz), and the baseline was corrected by subtracting the average of the 500 ms preceding the start of each trial. We selected the channels located over Broca's and Wernicke's areas (AF3, F3, F5, FC3, FC5, T7, C5, TP7, CP5, and P5). To remove EOG artifacts and EMG artifacts from muscle activity around the mouth, we applied independent component analysis with EOG and EMG references. All data processing was performed in Python and Matlab using the OpenBMI Toolbox [17], BBCI Toolbox [12], and EEGLAB [6].
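The filtering and baseline-correction steps above can be sketched with SciPy. The original sampling rate is not stated in the text, so the `fs_orig=1000` Hz default and the function name below are assumptions for illustration only:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, decimate

def preprocess_trial(eeg, fs_orig=1000, fs_target=250,
                     band=(30.0, 120.0), baseline_ms=500):
    """Band-pass filter, down-sample, and baseline-correct one EEG trial.

    eeg: array of shape (channels, samples), assumed to begin
         `baseline_ms` before trial onset.
    """
    # 5th-order Butterworth band-pass in the high-gamma band (30-120 Hz)
    sos = butter(5, band, btype="bandpass", fs=fs_orig, output="sos")
    filtered = sosfiltfilt(sos, eeg, axis=-1)

    # Down-sample to 250 Hz
    factor = fs_orig // fs_target
    downsampled = decimate(filtered, factor, axis=-1, zero_phase=True)

    # Baseline correction: subtract the mean of the 500 ms pre-trial window
    n_base = int(baseline_ms / 1000 * fs_target)
    baseline = downsampled[:, :n_base].mean(axis=-1, keepdims=True)
    return downsampled[:, n_base:] - baseline
```

Channel selection and ICA-based artifact removal are omitted here; in practice they would be handled by the toolboxes cited above.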

Fig. 2:

Transformer architecture. Self-attention and the feed-forward networks are followed by dropout and bypassed by a residual connection with subsequent layer normalization. This figure is inspired by the Transformer paper [35].


II-C Architecture

The proposed classification framework consists of convolution layers and separable convolution layers for extracting temporal-spectral-spatial information, as shown in Fig. 1. Given raw signals of shape C × T (channels × time points) as input, the classification output is set to 13 classes. The kernel size of the first layer is set relative to the sampling frequency of the data, performing a temporal convolution that imitates a band-pass filter [36]. Since the support vector machine (SVM) classifier has been reported to be robust for decoding imagined speech [22, 28], we used the squared hinge loss for training, which functions similarly to the SVM margin loss. Evaluation was conducted through 5-fold cross-validation with 1000 training epochs for each condition. The chance level of this experiment was 11.11%, as each subject had the same number of samples per condition.
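As a rough sketch of the loss described above, the multiclass squared hinge loss penalizes the squared margin violations of all incorrect classes. The training itself would use a deep learning framework; this NumPy version is illustrative only:

```python
import numpy as np

def squared_hinge_loss(scores, y_true, margin=1.0):
    """Multiclass squared hinge loss (Weston-Watkins style).

    scores: (batch, n_classes) raw model outputs
    y_true: (batch,) integer class labels
    """
    batch = np.arange(scores.shape[0])
    correct = scores[batch, y_true][:, None]      # score of the true class
    margins = np.maximum(0.0, scores - correct + margin)
    margins[batch, y_true] = 0.0                  # do not penalize the true class
    return np.mean(np.sum(margins ** 2, axis=1))
```

When the true class outscores every other class by at least the margin, the loss is exactly zero, which mirrors the SVM margin behavior mentioned above.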

The self-attention module is shown in Fig. 1 and Fig. 2. The attention module maps a query and a set of key-value pairs to an output, where the output is computed as a weighted sum of the values, and the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. Multi-head attention jointly attends to information from different representation subspaces at different positions, which a single attention head would average away.
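A minimal NumPy sketch of this computation, following the scaled dot-product and multi-head formulation of the Transformer paper [35]. The weight matrices here are hypothetical placeholders, not the trained model's parameters:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # query-key compatibility
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the values

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Split the model dimension into n_heads subspaces and attend in each."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    split = lambda M: M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    heads = scaled_dot_product_attention(split(Q), split(K), split(V))
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo
```

With an all-zero query, the softmax weights are uniform and each output row is simply the mean of the values, illustrating the "weighted sum of values" description above.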

II-D Statistical Analysis

Statistical analyses were performed to verify the classification results. A Kruskal-Wallis non-parametric one-way analysis of variance (ANOVA) was performed to compare the classification performance of imagined speech and overt speech. Post-hoc analysis was conducted with non-parametric permutation-based t-tests. The Kruskal-Wallis test was also performed on classification performance using single-channel EEG to estimate the significance of the selected channel. In addition, a paired t-test was performed to identify significant connectivity changes in Broca's and Wernicke's areas between imagined speech and resting states.
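With SciPy, the Kruskal-Wallis comparison of the two conditions can be sketched as follows. The per-subject accuracies below are hypothetical placeholders for illustration, not the reported results:

```python
from scipy.stats import kruskal

# Hypothetical per-subject accuracies (%) for the two conditions (n = 9 each)
overt    = [52.1, 47.3, 50.8, 49.9, 46.5, 51.2, 48.0, 50.3, 49.4]
imagined = [36.2, 33.1, 35.8, 34.4, 32.9, 36.7, 35.0, 34.1, 33.5]

stat, p = kruskal(overt, imagined)
if p < 0.05:
    print(f"H = {stat:.3f}, p = {p:.4f}: conditions differ significantly")
```

The permutation-based post-hoc t-tests would follow the same pattern, shuffling condition labels to build an empirical null distribution of the test statistic.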

III Results and Discussion

III-A Decoding Performance

We developed frameworks for decoding 13-class speech-related EEG signals under two conditions, imagined speech and overt speech, and compared their performance. The average accuracy for overt speech was 49.5% over the 13 classes across the nine subjects. The EEG signal during overt speech may contain stronger representations of brain activity. Since preprocessing removed the EOG and EMG artifacts around the mouth, the EEG signal contains only the brain activity of the intention to move the mouth and tongue to pronounce each word. The average accuracy for imagined speech was 35.07% over the 13 classes across the nine subjects. The EEG signal during imagined speech includes only brain activity, without EMG, since the subjects did not move their muscles. Therefore, the performance of imagined speech is normally inferior to that of overt speech. The difference between overt speech and imagined speech was statistically significant, but smaller than expected given the anticipated superiority of overt speech.

III-B Attention Module

We showed that a deep learning model with a self-attention module can achieve reasonable performance. The advantages of the self-attention module are that it reduces the total computational complexity per layer, allows much of the computation to be parallelized, and shortens the path length of long-term dependencies in the network.

IV Conclusion

In this study, we proposed an attention module based on the Transformer architecture to decode imagined speech from EEG. As practical BCIs require a robust system and simple hardware usable in the real world, we showed that the proposed method improves BCI performance. Recognizing speech from human intention achieved reasonable performance even though we used only a few channels. We also compared overt speech and imagined speech in terms of performance and statistical analysis. The EEG of overt speech showed superior performance to that of imagined speech; the difference was significant but smaller than we expected. Therefore, decoding imagined speech with an attention module has potential for use in real-world communication systems. In future work, we will develop an architecture with higher performance for imagined speech. Moreover, parameter optimization of the self-attention module may increase performance as well.


  • [1] G. K. Anumanchipalli, J. Chartier, and E. F. Chang (2019) Speech synthesis from neural decoding of spoken sentences. Nature 568 (7753), pp. 493–498. Cited by: §I.
  • [2] M. H. Bhatti, J. Khan, M. U. G. Khan, R. Iqbal, M. Aloqaily, Y. Jararweh, and B. Gupta (2019) Soft computing-based EEG classification by optimal feature selection and neural networks. IEEE Trans. Industr. Inform. 15 (10), pp. 5747–5754. Cited by: §I.
  • [3] T. Castermans, M. Duvinage, M. Petieau, T. Hoellinger, C. De Saedeleer, K. Seetharaman, A. Bengoetxea, G. Cheron, and T. Dutoit (2011-12) Optimizing the performances of a P300-based brain-computer interface in ambulatory conditions. IEEE J. Emerg. Sel. Topics Circuits Syst. 1 (4), pp. 566–577. Cited by: §I.
  • [4] D. Dash, P. Ferrari, and J. Wang (2019) Spatial and spectral fingerprint in the brain: speaker identification from single trial MEG signals.. In Proc. Interspeech, pp. 1203–1207. Cited by: §I.
  • [5] S. Debener, R. Emkes, M. De Vos, and M. Bleichner (2015-11) Unobtrusive ambulatory EEG using a smartphone and flexible printed electrodes around the ear. Sci. Rep. 5, pp. 16743. Cited by: §I.
  • [6] A. Delorme and S. Makeig (2004) EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods 134 (1), pp. 9–21. Cited by: §II-B.
  • [7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §I.
  • [8] L. Fiedler, M. Wöstmann, C. Graversen, A. Brandmeyer, T. Lunner, and J. Obleser (2017) Single-channel in-ear-EEG detects the focus of auditory attention to concurrent tone streams and mixed speech. J. Neural Eng. 14 (3), pp. 036020. Cited by: §I.
  • [9] Z. Gao, W. Dang, M. Liu, W. Guo, K. Ma, and G. Chen (2020) Classification of EEG signals on VEP-based BCI systems with broad learning. IEEE Trans. Syst. Man Cybern.: Syst.. Cited by: §I.
  • [10] A. G. Huth, W. A. De Heer, T. L. Griffiths, F. E. Theunissen, and J. L. Gallant (2016) Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532 (7600), pp. 453–458. Cited by: §I, §I.
  • [11] J. Jeong, K. Shim, D. Kim, and S. Lee (2020) Brain-controlled robotic arm system based on multi-directional cnn-bilstm network using eeg signals. IEEE Trans. Neural Syst. Rehabil. Eng. 28 (5), pp. 1226–1238. Cited by: §I.
  • [12] R. Krepki, B. Blankertz, G. Curio, and K. Müller (2007-02) The Berlin Brain-Computer Interface (BBCI)–towards a new communication channel for online control in gaming applications. Multimed. Tools. Appl. 33 (1), pp. 73–90. Cited by: §II-B.
  • [13] N. Kwak, K. Müller, and S. Lee (2015-08) A lower limb exoskeleton control system based on steady state visual evoked potentials. J. Neural Eng. 12 (5), pp. 056009. Cited by: §I.
  • [14] N. Kwak, K. Müller, and S. Lee (2017-02) A convolutional neural network for steady state visual evoked potential classification under ambulatory environment. PloS One 12 (2), pp. e0172578. Cited by: §I.
  • [15] O. Kwon, M. Lee, C. Guan, and S. Lee (2019) Subject-independent brain–computer interfaces based on deep convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 31 (10), pp. 3839–3852. Cited by: §I, §I.
  • [16] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance (2018-07) EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces. J. Neural Eng. 15 (5), pp. 056013. Cited by: §I.
  • [17] M. Lee, O. Kwon, Y. Kim, H. Kim, Y. Lee, J. Williamson, S. Fazli, and S. Lee (2019-01) EEG dataset and OpenBMI toolbox for three BCI paradigms: An investigation into BCI illiteracy. GigaScience 8 (5), pp. giz002. Cited by: §II-B.
  • [18] M. Lee, J. Williamson, D. Won, S. Fazli, and S. Lee (2018) A high performance spelling system based on EEG-EOG signals with visual feedback. IEEE Trans. Neural Syst. Rehabil. Eng. 26 (7), pp. 1443–1459. Cited by: §I.
  • [19] M. Lee, R. D. Sanders, S. Yeom, D. Won, K. Seo, H. J. Kim, G. Tononi, and S. Lee (2017-12) Network properties in transitions of consciousness during propofol-induced sedation. Sci. Rep. 7 (1), pp. 1–13. Cited by: §I.
  • [20] S. Lee, M. Lee, J. Jeong, and S. Lee (2019) Towards an EEG-based intuitive BCI communication system using imagined speech and visual imagery. In Conf. Proc. IEEE. Int. Conf. Syst. Man Cybern. (SMC), pp. 4409–4414. Cited by: §II-A.
  • [21] S. Lee, M. Lee, and S. Lee (2019) EEG representations of spatial and temporal features in imagined speech and overt speech. In Proc. Asian Conf. Pattern Recognit. (ACPR), pp. 387–400. Cited by: §I, §II-A.
  • [22] S. Lee, M. Lee, and S. Lee (2020) Neural decoding of imagined speech and visual imagery as intuitive paradigms for BCI communication. IEEE Trans. Neural Syst. Rehabil. Eng.. Cited by: §I, §I, §II-A, §II-A, §II-C.
  • [23] Y. Lee, N. Kwak, and S. Lee (2020) A real-time movement artifact removal method for ambulatory brain-computer interfaces. IEEE Trans. Neural Syst. Rehabil. Eng.. Cited by: §I.
  • [24] Y. Lee, M. Lee, and S. Lee (2020) Reconstructing ERP signals using generative adversarial networks for mobile brain-machine interface. arXiv preprint arXiv:2005.08430. Cited by: §I.
  • [25] Y. Lee and M. Lee (2020-02) Decoding visual responses based on deep neural networks with ear-EEG signals. In Int. Winter Conf. Brain-Computer Interface (BCI), Jeongseon, Republic of Korea, pp. 1–6. Cited by: §I.
  • [26] Y. Lee, G. Shin, M. Lee, and S. Lee (2021) Mobile BCI dataset of scalp- and ear-EEGs with ERP and SSVEP paradigms while standing, walking, and running. Sci. Data, pp. 1–12. Cited by: §I.
  • [27] L. A. Moctezuma, A. A. Torres-García, L. Villaseñor-Pineda, and M. Carrillo (2019) Subjects identification using EEG-recorded imagined speech. Expert Syst. Appl. 118, pp. 201–208. Cited by: §I.
  • [28] C. H. Nguyen, G. K. Karavas, and P. Artemiadis (2017) Inferring imagined speech using EEG signals: a new approach using riemannian manifold features. J. Neural Eng. 15 (1), pp. 016002. Cited by: §I, §II-C.
  • [29] A. D. Nordin, W. D. Hairston, and D. P. Ferris (2018-08) Dual-electrode motion artifact cancellation for mobile electroencephalography. J. Neural Eng. 15 (5), pp. 056024. Cited by: §I.
  • [30] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran (2018) Image transformer. In Int. Conf. on Machine Learning, Stockholm, Sweden, pp. 4055–4064. Cited by: §I.
  • [31] R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer, M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter, W. Burgard, and T. Ball (2017-08) Deep learning with convolutional neural networks for EEG decoding and visualization. Hum. Brain Mapp. 38 (11), pp. 5391–5420. Cited by: §I.
  • [32] J. Schoffelen, A. Hultén, N. Lam, A. F. Marquand, J. Uddén, and P. Hagoort (2017) Frequency-specific directed interactions in the human brain network for language. Proc. Natl. Acad. Sci. (PNAS) 114 (30), pp. 8083–8088. Cited by: §I.
  • [33] T. Schultz, M. Wand, T. Hueber, D. J. Krusienski, C. Herff, and J. S. Brumberg (2017) Biosignal-based spoken communication: a survey. IEEE/ACM Trans. Audio Speech Lang. Process. 25 (12), pp. 2257–2271. Cited by: §I.
  • [34] H. Suk and S. Lee (2012-02) A novel bayesian framework for discriminative feature extraction in brain-computer interfaces. IEEE Trans. PAMI 35 (2), pp. 286–299. Cited by: §I.
  • [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural inf. processing syst. (NIPS), California, USA, pp. 5998–6008. Cited by: §I, Fig. 2.
  • [36] N. Waytowich, V. J. Lawhern, J. O. Garcia, J. Cummings, J. Faller, P. Sajda, and J. M. Vettel (2018) Compact convolutional neural networks for classification of asynchronous steady-state visual evoked potentials. J. Neural Eng. 15 (6), pp. 066031. Cited by: §I, §II-C.
  • [37] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan (2002-06) Brain-computer interfaces for communication and control. Clin. Neurophysiol. 113 (6), pp. 767–791. Cited by: §I.
  • [38] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In Int. conf. machine learning (ICML), California, USA, pp. 7354–7363. Cited by: §I.
  • [39] Y. Zhang, H. Zhang, X. Chen, S. Lee, and D. Shen (2017) Hybrid high-order functional connectivity networks using resting-state functional mri for mild cognitive impairment diagnosis. Sci. rep. 7 (1), pp. 1–15. Cited by: §I.
  • [40] Y. Zhang, H. Zhang, X. Chen, M. Liu, X. Zhu, S. Lee, and D. Shen (2019) Strength and similarity guided group-level brain functional network construction for mci diagnosis. Pattern Recognit. 88, pp. 421–430. Cited by: §I.
  • [41] H. Zhao, J. Jia, and V. Koltun (2020) Exploring self-attention for image recognition. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), virtual, pp. 10076–10085. Cited by: §I.