A Novel Deep Learning Architecture for Decoding Imagined Speech from EEG

03/19/2020 ∙ by Jerrin Thomas Panachakel, et al. ∙ Indian Institute of Science

The recent advances in the field of deep learning have not been fully utilised for decoding imagined speech, primarily because of the unavailability of sufficient training samples to train a deep network. In this paper, we present a novel architecture that employs a deep neural network (DNN) for classifying the words "in" and "cooperate" from the corresponding EEG signals in the ASU imagined speech dataset. Nine EEG channels, which best capture the underlying cortical activity, are chosen using common spatial pattern (CSP) and are treated as independent data vectors. Discrete wavelet transform (DWT) is used for feature extraction. To the best of our knowledge, a DNN has not so far been employed as a classifier in decoding imagined speech. Treating the selected EEG channels corresponding to each imagined word as independent data vectors helps in providing a sufficient number of samples to train a DNN. For each test trial, the final class label is obtained by applying majority voting on the classification results of the individual channels considered in the trial. We have achieved accuracies comparable to the state-of-the-art results. The results can be further improved by using a higher-density EEG acquisition system in conjunction with other deep learning techniques such as long short-term memory.




I Introduction

A strong motivation to work on imagined speech arises in the context of speech disability. Speech is one of the most basic and natural forms of communication acquired by individuals, even by the illiterate. Though exact statistics are not available, it is estimated that more than 70 million people around the world have a speech disability. Speech disability due to complete paralysis prevents people from communicating with others in any modality. Complete paralysis may be a congenital disorder or an acquired one, which may be due to disease or accident. A complete recovery in these cases is rare and the cost of treatment is usually beyond the reach of many. Also, these conditions take a toll on the mental health of both the affected person and his/her relatives, reducing their efficiency and economic output. It will greatly help the affected person if, by some means, we are able to decode his/her thoughts, commonly referred to as “imagined speech”.

The interest in imagined speech dates back to the days of Hans Berger, who invented the electroencephalogram (EEG) as a tool for synthetic telepathy [6]. Although it is almost a century since the first EEG recording, the success in decoding imagined speech from EEG signals is rather limited. One of the major reasons is the very low signal-to-noise ratio (SNR) of EEG signals.

The recent developments in the field of machine learning, such as deep neural networks (DNN), have not been exploited to their full potential in decoding imagined speech, since such techniques require a huge amount of training data. In this paper, we select those EEG channels that best represent the underlying cortical activity of each imagined word by using CSP. Each EEG channel so selected for an imagined word is considered as an independent input signal, thus providing more training data. This is in contrast to the earlier approaches, which concatenate the features from all channels to form a single feature vector.

The architecture includes a CSP-based channel selection stage, a discrete wavelet transform (DWT) based feature extraction stage, and a classification stage consisting of a DNN with four dense layers and a majority voting classifier. It has been tested on the ASU dataset of imagined speech [9]. The accuracy obtained is comparable to the state-of-the-art results.

The rest of the paper is organized as follows: Section II describes prior work in the literature in the field of decoding imagined speech. Section III describes the dataset and procedure for generating the feature vectors. Section IV describes the classifiers in some detail. The results obtained are given in Section V.

II Related Work

This section briefly describes the work in the field of imagined speech over the last decade.

C.S. DaSalla et al. developed a BCI system based on vowel imagery [4] in the year 2009. The objective was to discriminate between imagined vowels, /a/ and /u/. The experimental paradigm consisted of three parts:

  1. Imagined mouth opening and imagined vocalisation of vowel /a/.

  2. Imagined lip rounding and imagined vocalisation of vowel /u/.

  3. Control state with no action.

Using CSP-generated spatial filter vectors as features and a nonlinear SVM as the classifier, they achieved an accuracy in the range of 56% to 72%, depending on the subject. As noted by Brigham et al. [2], the relatively higher accuracy obtained might have arisen because of the additional involvement of motor imagery.

Following a similar approach, Wang et al. in 2013 developed a system to classify two monosyllabic Chinese characters meaning “left” and “one” [14]. A visual cue was provided to instruct the subject on the character to be imagined. When the cue disappeared, the subject had to repeatedly imagine the character as many times as possible for a duration of 4 sec. The accuracy they obtained is around 67%.

In 2010, Brigham et al. came up with an algorithm based on autoregressive (AR) coefficients and k-nearest neighbor (k-NN) algorithm for classifying two imagined syllables /ba/ and /ku/ [2]. In this experiment, the subjects were given an auditory cue on the syllable to be imagined, followed by a series of click sounds. After the last click, the subjects were instructed to imagine the syllable once every 1.5 sec for a period of 6 sec. The accuracy reported is around 61%.

In 2016, Min et al. [8] used statistical features such as mean, variance, standard deviation, and skewness for pairwise classification of vowels (/a/, /e/, /i/, /o/, and /u/) using an extreme learning machine (ELM) with radial basis function. In their experimental paradigm, an auditory cue was provided at the beginning of the trial to inform the subject as to which vowel was to be imagined. After the auditory cue, two beeps were played, after which the subject had to imagine the vowel heard at the beginning of the trial. An average accuracy of about 72% was reported.

In 2017, Nguyen, Karavas and Artemiadis [9] came up with an approach based on Riemannian manifold features for classifying four different sets of prompts:

  1. Vowels (/a/, /i/ and /u/).

  2. Short words (“in” and “out”).

  3. Long words (“cooperate” and “independent”).

  4. Short-Long words (“in” and “cooperate”).

The accuracies reported for the four sets of prompts are 49.2%, 50.1%, 66.2% and 80.1%, respectively. This dataset is one of the few imagined speech datasets that are available in the public domain and is referred to as the “ASU dataset”. More information about this dataset is given in Section III-A.

Balaji et al. in 2017 investigated the use of bilingual imagined speech, viz., English: “Yes” & “No” and Hindi: “Haan” (meaning “yes”) & “Na” (meaning “no”), for an imagined speech based BCI system [1]. PCA was used for data reduction and an artificial neural network (ANN) was used as the classifier. Two specific sets of EEG channels corresponding to language comprehension and decision making were utilized. An interesting part of the experimental protocol is that there was no auditory or visual cue; the subjects were instructed to imagine the answer to a binary question posed either in English or Hindi. The study reports an accuracy of 75.4% for the combined English-Hindi task and quite a surprisingly high accuracy of 85.2% for classifying the decision.

In 2017, Sereshkeh et al. came up with an algorithm based on features extracted using discrete wavelet transform (DWT) and regularized neural networks for classifying the imagined decisions of “yes” and “no” [13], similar to the work by Balaji et al. They reported an accuracy of about 67%.

In 2018, Cooney et al. [3] used MFCC features and SVM classifier to classify all the 11 prompts in the KARAONE dataset [16]. The prompts consisted of seven phonemic/syllabic prompts (/iy/, /uw/, /piy/, /tiy/, /diy/, /m/, /n/) and four words (“pat”, “pot”, “knew” and “gnaw”). A maximum accuracy of only 33.33% was achieved. The lower accuracy might have arisen because of a larger number of choices instead of a binary choice as in the previous works.

III Dataset and Methods

III-A Dataset

The ASU dataset consists of 64-channel EEG signals recorded at 1 kHz from 15 healthy subjects during four different types of imagined speech, which were described in Section II. The electrode placement was based on the 10/20 system [5].

Each trial started with a visual cue indicating the word to be imagined. A beep sound repeating every $T$ seconds was played for a period of time, after which the subject was to imagine vocalising the prompt thrice at a rate of one per $T$ seconds. For the prompt “in”, $T$ was 1 sec and for the prompt “cooperate”, $T$ was 1.4 sec.

The highest accuracy reported is for the classification of a short word (“in”) and a long word (“cooperate”). For all participants except “S10” and “S14”, EEG data of 100 trials are available for each prompt. For subjects “S10” and “S14”, data of only 80 trials are available. These two subjects are not included in our analysis due to the mismatch in the number of trials.

The EEG signals are down-sampled to 256 samples/second. A 5th order Butterworth bandpass filter with passband from 8 to 70 Hz is applied to remove electrode drift and EMG artifacts. Further, a notch filter at 60 Hz is applied to remove line noise artifacts.
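The pre-processing chain described above can be sketched with SciPy as follows. This is only an illustrative sketch, not the authors' code: the function name `preprocess`, the zero-phase (forward-backward) filtering, and the notch quality factor `Q = 30` are assumptions that the text does not specify.

```python
from math import gcd

import numpy as np
from scipy import signal


def preprocess(eeg, fs_raw=1000, fs_target=256):
    """Resample and filter EEG of shape (n_channels, n_samples) recorded at fs_raw Hz."""
    # Rational resampling (1000 -> 256 Hz is a factor of 32/125) with anti-aliasing.
    g = gcd(fs_target, fs_raw)
    resampled = signal.resample_poly(eeg, fs_target // g, fs_raw // g, axis=-1)

    # 5th-order Butterworth band-pass, 8 to 70 Hz.
    # Assumption: applied forward and backward to avoid phase distortion.
    sos = signal.butter(5, [8.0, 70.0], btype="bandpass", fs=fs_target, output="sos")
    filtered = signal.sosfiltfilt(sos, resampled, axis=-1)

    # 60 Hz notch for line noise. Assumption: quality factor Q = 30.
    b, a = signal.iirnotch(60.0, Q=30.0, fs=fs_target)
    return signal.filtfilt(b, a, filtered, axis=-1)
```

Applied to a four-second, 64-channel recording at 1 kHz (an array of shape `(64, 4000)`), `preprocess` returns an array of shape `(64, 1024)`.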

III-B Channel selection

Fig. 1: This figure shows the 10-fold cross-validation accuracy obtained (in percentage) for various choices of the number of electrodes for subject “S9”. The maximum accuracy is obtained when nine channels are chosen. As evident from the figure, an increase or decrease in the number of channels reduces the accuracy.

In our work, instead of concatenating the features obtained from several channels, each channel is viewed as a distinct input. This is possible because of the high correlation between the signals in various channels [12].

CSP linearly transforms the multi-channel EEG signal into a low-dimensional subspace such that the variance of the EEG signal from one class is maximized while that from the other class is minimized. Mathematically, CSP extremizes the following objective function:

$$J(w) = \frac{w^{\mathsf{T}} X_1 X_1^{\mathsf{T}} w}{w^{\mathsf{T}} X_2 X_2^{\mathsf{T}} w} = \frac{w^{\mathsf{T}} C_1 w}{w^{\mathsf{T}} C_2 w} \qquad (1)$$

where $^{\mathsf{T}}$ denotes matrix transpose, matrix $X_i$ contains the EEG signals of class $i$, with data samples as columns and channels as rows, $w$ is the spatial filter and $C_i$ is the spatial covariance matrix of class $i$.
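A ratio of two quadratic forms of this kind (the CSP objective of Eq. 1, taken here in its usual Rayleigh-quotient form over the two class covariance matrices) is extremized by the eigenvectors of a generalized eigenvalue problem. The sketch below illustrates that idea with plain, unregularized CSP; it is not the toolbox used in the paper. The helper names and the trace normalization of each trial covariance are assumptions.

```python
import numpy as np
from scipy.linalg import eigh


def class_covariance(trials):
    """Mean trace-normalized spatial covariance of trials shaped (n_trials, n_channels, n_samples)."""
    covs = []
    for x in trials:
        c = x @ x.T                      # channels x channels
        covs.append(c / np.trace(c))     # assumption: per-trial trace normalization
    return np.mean(covs, axis=0)


def csp_filters(trials_1, trials_2):
    """Return (w_max, w_min, eigenvalues) for the ratio of class-1 to class-2 variance."""
    c1 = class_covariance(trials_1)
    c2 = class_covariance(trials_2)
    # Solves C1 w = lambda C2 w; each eigenvalue equals the objective at its eigenvector.
    # scipy returns the eigenvalues in ascending order.
    eigvals, eigvecs = eigh(c1, c2)
    w_min = eigvecs[:, 0]    # minimizes the objective
    w_max = eigvecs[:, -1]   # maximizes the objective
    return w_max, w_min, eigvals
```

For 64-channel data, `w_max` and `w_min` are vectors of length 64, one coefficient per channel.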

The spatial filters can be seen as EEG source distribution vectors [15]. The channels corresponding to the maximal coefficients in the spatial filter that maximizes the objective function in Eq. 1, and the channels corresponding to the maximal coefficients in the spatial filter that minimizes it, carry the maximum information regarding the two classes [15].

Let $w_{\max}$ and $w_{\min}$ respectively be the spatial filter vectors that maximize and minimize the objective function given in Eq. 1. Since 64-channel EEG signals are used in this work, the dimension of $w_{\max}$ and $w_{\min}$ will be $64 \times 1$.

Nine channels per class are selected, corresponding to the top nine coefficients in the vectors $w_{\max}$ and $w_{\min}$. We have observed in our experiments that choosing more or fewer than nine channels decreases the accuracy. The decrease when increasing the number of channels might be because the additional channels may not be capturing relevant information. The decrease when the number of channels is reduced is because of the reduction in the training data. The variation in accuracy for different numbers of channels selected for Subject “S9” is given in Fig. 1. CSP analysis is carried out using the toolbox provided by Lotte et al. [7].
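Picking the channels with the largest filter coefficients amounts to a ranking step, sketched below. The function name `top_channels` is hypothetical, and ranking by absolute value (rather than signed value) is an assumption, since the text only speaks of "maximal coefficients".

```python
import numpy as np


def top_channels(spatial_filter, n_channels=9):
    """Indices of the n_channels largest-magnitude coefficients, strongest first."""
    # Assumption: coefficients are ranked by absolute value.
    order = np.argsort(np.abs(spatial_filter))[::-1]
    return order[:n_channels]
```

Calling it once on the maximizing filter and once on the minimizing filter gives the two ranked lists of nine channel indices used in the rest of the pipeline.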

III-C Feature extraction

Since each EEG channel is considered as an independent data vector, algorithms that extract a single feature vector from the entire set of EEG channels (such as the Riemannian manifold features used by Nguyen et al. [9] and fuzzy entropy features [11]) cannot be used with the proposed architecture. DWT with the Daubechies-4 (db4) wavelet is extensively used in extracting features from EEG signals [10]. We follow the approach of Sereshkeh et al. [13], where four levels of wavelet decomposition are performed on the input EEG signals and the root-mean-square (RMS), variance and entropy are computed for each level. This yields 12 features (4 levels × 3 features) per channel.
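The per-channel feature computation can be sketched in NumPy as follows. Several details are assumptions, because the text does not fix them: the statistics are taken over the detail coefficients of each of the four levels, the signal is extended periodically at its boundaries, and "entropy" is taken to be the Shannon entropy of the normalized coefficient energies. In practice a wavelet library would normally replace the hand-written `dwt_step`.

```python
import numpy as np

# Daubechies-4 low-pass filter (8 taps), as commonly tabulated.
DB4_LO = np.array([
    0.23037781330885523, 0.7148465705525415, 0.6308807679295904,
    -0.02798376941698385, -0.18703481171888114, 0.030841381835986965,
    0.032883011666982945, -0.010597401784997278,
])
# Quadrature-mirror high-pass filter: g[n] = (-1)^n * h[L - 1 - n].
DB4_HI = DB4_LO[::-1] * (-1.0) ** np.arange(DB4_LO.size)


def dwt_step(x):
    """One level of a periodic db4 DWT; returns (approximation, detail) of length len(x) // 2."""
    n = x.size
    # Row k holds the 8 samples starting at position 2k, wrapping around the end.
    idx = (2 * np.arange(n // 2)[:, None] + np.arange(DB4_LO.size)[None, :]) % n
    windows = x[idx]
    return windows @ DB4_LO, windows @ DB4_HI


def shannon_entropy(coeffs):
    """Shannon entropy of the normalized coefficient energies (assumed definition)."""
    energy = coeffs ** 2
    total = energy.sum()
    if total == 0:
        return 0.0
    p = energy[energy > 0] / total
    return float(-(p * np.log(p)).sum())


def dwt_features(x, levels=4):
    """RMS, variance and entropy of the detail coefficients at each level (12 values for 4 levels)."""
    approx = np.asarray(x, dtype=float)
    if approx.size % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    feats = []
    for _ in range(levels):
        approx, detail = dwt_step(approx)
        feats += [np.sqrt(np.mean(detail ** 2)), np.var(detail), shannon_entropy(detail)]
    return np.array(feats)
```

The periodic extension keeps each level an orthogonal transform, so the signal energy is split exactly between the approximation and detail coefficients.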

Let $f_k^{\max}$ be the 12-dimensional feature vector extracted from the channel corresponding to the $k$-th largest coefficient in the spatial filter that maximizes the objective function in Eq. 1, and let $f_k^{\min}$ be the 12-dimensional feature vector extracted from the channel corresponding to the $k$-th largest coefficient in the spatial filter that minimizes it. Now, $F_k$, the feature vector extracted for a specific trial, is obtained by concatenating $f_k^{\max}$ and $f_k^{\min}$. The range of $k$ is from 1 to 9, since we are considering only nine channels corresponding to each trial. Thus, nine feature vectors of length 24 (12 from each of the two channels) are obtained per trial.
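As a small illustration of this pairing, the sketch below builds the nine 24-dimensional vectors of one trial from a per-channel feature matrix. The function and argument names are hypothetical; `max_idx` and `min_idx` stand for the nine ranked channel indices taken from the maximizing and minimizing spatial filters, respectively.

```python
import numpy as np


def trial_feature_vectors(channel_features, max_idx, min_idx):
    """Pair the k-th ranked channels of the two filters and concatenate their features.

    channel_features: array (n_channels, 12), one row of features per EEG channel.
    max_idx, min_idx: ranked channel indices (nine each), strongest first.
    Returns an array of shape (9, 24).
    """
    channel_features = np.asarray(channel_features)
    return np.hstack([channel_features[max_idx], channel_features[min_idx]])
```

Each row of the result is then treated as a separate training or test sample that carries the label of its trial.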

IV Classifier

Fig. 2: Architecture diagram of the DNN used. The numbers in brackets indicate the number of neurons in the layer. The feature vector is obtained by concatenating the features extracted from the channel corresponding to the $k$-th largest coefficient ($k = 1, \dots, 9$) in the spatial filter that maximizes the objective function in Eq. 1 with the features extracted from the channel corresponding to the $k$-th largest coefficient in the spatial filter that minimizes it. The activation function of all the hidden layers except the third is the rectified linear unit. The activation function of layer 3 is hyperbolic tangent. The activation function of the single output neuron is sigmoid. Dropout and batch normalization layers are not shown in the diagram.

A DNN with four hidden layers is used as the primary classifier. The architecture of the DNN used is shown in Fig. 2. Each dense hidden layer has 40 neurons. Dropout and batch normalization layers are added after each dense layer. The dropout ratio is 10% for the dropout layer after the first hidden layer, and all the other dropout layers have a dropout ratio of 30%. The activation function of all the hidden layers except the third is the rectified linear unit. The activation function of layer 3 is hyperbolic tangent. The activation function of the single output neuron is sigmoid.
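To make the layer sizes and activations concrete, the following NumPy sketch runs an inference-time forward pass through a network of this shape. It is a simplified illustration rather than the trained model: the weights are random placeholders, the initialization scheme is an assumption, and the batch normalization layers are omitted for brevity (dropout is inactive at inference in any case).

```python
import numpy as np

LAYER_SIZES = [24, 40, 40, 40, 40, 1]                # input, four hidden layers, output
HIDDEN_ACTIVATIONS = ["relu", "relu", "tanh", "relu"]  # the third hidden layer uses tanh


def init_params(rng):
    """Random placeholder weights (assumed He-style scaling) and zero biases."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])
    ]


def forward(params, x):
    """Map feature vectors of shape (n, 24) to n sigmoid outputs in [0, 1]."""
    h = np.asarray(x, dtype=float)
    for (w, b), act in zip(params[:-1], HIDDEN_ACTIVATIONS):
        z = h @ w + b
        h = np.tanh(z) if act == "tanh" else np.maximum(z, 0.0)
    w_out, b_out = params[-1]
    logits = h @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))
```

Passing the nine feature vectors of one trial as a `(9, 24)` array yields nine outputs, one per vector.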

Since we have derived nine feature vectors per trial, nine outputs are obtained for each trial, one for each feature vector. The final decision is based on majority or hard voting.
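The hard-voting step can be sketched as below. The 0.5 decision threshold on the sigmoid output and the encoding of the two words as labels 0 and 1 are assumptions made for illustration.

```python
import numpy as np


def majority_vote(probabilities, threshold=0.5):
    """Return the trial label (0 or 1) chosen by most of the per-vector decisions."""
    votes = np.asarray(probabilities) >= threshold   # hard decision per feature vector
    return int(2 * votes.sum() > votes.size)
```

With nine votes per trial a tie cannot occur, which is one practical advantage of using an odd number of channels.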

Method/Subject S1 S5 S8 S9
Log + LDA [9]
CSP + SVM [4]
Statistical features + ELM [8]
Tangent + RVM (Method 1) [9]
Tangent + RVM (Method 2) [9]
Proposed Method
TABLE I: Comparison of mean ± std. deviation of the cross-validation accuracies (in percentage) obtained using different methods (given in each row) in classifying the imagined words “in” and “cooperate”. “S1”, “S5”, “S8” and “S9” are the participant IDs.

V Results

10-fold cross-validation is performed on the pre-processed data of each participant. During cross-validation, it is ensured that all the channels corresponding to a trial are either in the training set or in the test set. This is important, since the presence of even a couple of channels from the test trials in the training set can lead to spuriously high accuracy due to data leakage. The results obtained, along with the other results reported in the literature, are listed in Table I. The proposed method gives a higher accuracy than the other methods except “Tangent + RVM (Method 2)”, which has around 2 to 10% higher accuracy. The average accuracy obtained across the participants is and the best accuracy reported in the literature is . Although the accuracy of the proposed method is lower than the best reported in the literature, the motivation for the present study is to explore the possibility of using a DNN in the context of imagined speech. The accuracy may be improved further with a denser EEG acquisition system.
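One way to enforce such a trial-wise split is to assign whole trials, not individual feature vectors, to folds. The sketch below assumes a feature matrix in which the nine vectors of trial `i` occupy consecutive rows; the function name and this row layout are illustrative assumptions, and class stratification is not handled.

```python
import numpy as np


def trialwise_folds(n_trials, vectors_per_trial=9, n_folds=10, seed=0):
    """Yield (train_rows, test_rows) so that all vectors of a trial share one side of the split."""
    rng = np.random.default_rng(seed)
    trial_order = rng.permutation(n_trials)
    # Assumed layout: row r of the feature matrix belongs to trial r // vectors_per_trial.
    trial_of_row = np.repeat(np.arange(n_trials), vectors_per_trial)
    for test_trials in np.array_split(trial_order, n_folds):
        test_mask = np.isin(trial_of_row, test_trials)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)
```

For a participant with 100 trials per prompt (200 trials, 1800 feature vectors), each fold holds out 20 trials, that is, 180 rows.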

VI Conclusion

The present work shows that it is feasible to view each EEG channel as an independent data vector in order to increase the size of the training set for decoding imagined speech using deep learning techniques. We expect improved performance from EEG data acquired using higher-density EEG acquisition systems. For instance, using a 128-channel EEG system instead of the 64-channel system used for the ASU dataset would give more training data for the same number of trials. This can in turn help us use a more complex network for classification.


  • [1] A. Balaji, A. Haldar, K. Patil, T. S. Ruthvik, C. Valliappan, M. Jartarkar, and V. Baths (2017) EEG-based classification of bilingual unspoken speech using ANN. In Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE, pp. 1022–1025.
  • [2] K. Brigham and B. V. Kumar (2010) Imagined speech classification with EEG signals for silent communication: a preliminary investigation into synthetic telepathy. In Bioinformatics and Biomedical Engineering (iCBBE), 2010 4th International Conference on, pp. 1–4.
  • [3] C. Cooney, R. Folli, and D. Coyle (2018) Mel frequency cepstral coefficients enhance imagined speech decoding accuracy from EEG. In 2018 29th Irish Signals and Systems Conference (ISSC), pp. 1–7.
  • [4] C. S. DaSalla, H. Kambara, M. Sato, and Y. Koike (2009) Single-trial classification of vowel speech imagery using common spatial patterns. Neural Networks 22 (9), pp. 1334–1339.
  • [5] G. H. Klem, H. O. Lüders, H. Jasper, C. Elger, et al. (1999) The ten-twenty electrode system of the international federation. Electroencephalogr Clin Neurophysiol 52 (3), pp. 3–6.
  • [6] T. La Vaque (1999) The history of EEG Hans Berger: psychophysiologist. A historical vignette. Journal of Neurotherapy 3 (2), pp. 1–9.
  • [7] F. Lotte and C. Guan (2011) Regularizing common spatial patterns to improve BCI designs: unified theory and new algorithms. IEEE Transactions on Biomedical Engineering 58 (2), pp. 355–362.
  • [8] B. Min, J. Kim, H. Park, and B. Lee (2016) Vowel imagery decoding toward silent speech BCI using extreme learning machine with electroencephalogram. BioMed Research International 2016.
  • [9] C. H. Nguyen, G. K. Karavas, and P. Artemiadis (2017) Inferring imagined speech using EEG signals: a new approach using Riemannian manifold features. Journal of Neural Engineering 15 (1), pp. 016002.
  • [10] L. F. Nicolas-Alonso and J. Gomez-Gil (2012) Brain computer interfaces, a review. Sensors 12 (2), pp. 1211–1279.
  • [11] S. Raghu, N. Sriraam, G. P. Kumar, and A. S. Hegde (2018) A novel approach for real-time recognition of epileptic seizures using minimum variance modified fuzzy entropy. IEEE Transactions on Biomedical Engineering 65 (11), pp. 2612–2621.
  • [12] A. G. Ramakrishnan and J. V. Satyanarayana (2016) Reconstruction of EEG from limited channel acquisition using estimated signal correlation. Biomedical Signal Processing and Control 27, pp. 164–173.
  • [13] A. R. Sereshkeh, R. Trott, A. Bricout, and T. Chau (2017) EEG classification of covert speech using regularized neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (12), pp. 2292–2300.
  • [14] L. Wang, X. Zhang, X. Zhong, and Y. Zhang (2013) Analysis and classification of speech imagery EEG for BCI. Biomedical Signal Processing and Control 8 (6), pp. 901–908.
  • [15] Y. Wang, S. Gao, and X. Gao (2006) Common spatial pattern method for channel selelction in motor imagery based brain-computer interface. In Engineering in Medicine and Biology Society, 2005. IEEE-EMBS 2005. 27th Annual International Conference of the, pp. 5392–5395.
  • [16] S. Zhao and F. Rudzicz (2015) Classifying phonological categories in imagined and articulated speech. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 992–996.