Continuous Silent Speech Recognition using EEG

02/06/2020 ∙ by Gautam Krishna, et al. ∙ 0

In this paper we explore continuous silent speech recognition using electroencephalography (EEG) signals. We implemented a connectionist temporal classification (CTC) automatic speech recognition (ASR) model to translate to text the EEG signals recorded while subjects were reading English sentences in their mind without producing any voice. Our results demonstrate the feasibility of using EEG signals for continuous silent speech recognition. We report results for a limited English vocabulary consisting of 30 unique sentences.




I Introduction

A continuous silent speech recognition model tries to decode what a person is reading in their mind. It can be considered close to the mind reading problem, where thoughts are also decoded. Research in this direction can enable people with severe cognitive disabilities to use virtual assistants like Siri, Alexa, and Bixby, thereby improving technology accessibility. It can also enable people with cognitive disabilities to communicate with other people. Continuous silent speech recognition technology could also allow soldiers and scientists to communicate covertly in sensitive working environments. Finally, it could introduce a new form of thought-based communication for able-bodied people.

Electroencephalography (EEG) is a non-invasive way of measuring the electrical activity of the human brain by placing EEG sensors on the scalp of the subject. EEG signals have high temporal resolution, although their spatial resolution is poor. Electrocorticography (ECoG), on the other hand, is an invasive way of measuring the electrical activity of the human brain. ECoG signals have a temporal resolution similar to that of EEG signals but a better spatial resolution and signal-to-noise ratio (SNR). The major drawback of ECoG is that it is an invasive procedure that requires the subject to undergo brain surgery in order to implant the ECoG electrodes. In this work we use non-invasive EEG signals to decode the thoughts of the subjects, that is, to perform continuous silent speech recognition.

In [1, 2, 3] the authors demonstrated isolated and continuous speech recognition for a limited English vocabulary using EEG signals recorded in parallel while subjects were speaking English sentences out loud and while they were listening to English utterances. The authors in [1, 2, 3] used end-to-end automatic speech recognition (ASR) models such as connectionist temporal classification (CTC) [4], the attention model [5], and the transducer model [6] to translate EEG input features directly to text. In this work we perform continuous silent speech recognition, using a CTC model to map EEG features, recorded while the subjects were reading English sentences in their mind, to text.

Other related works include [7], where the authors demonstrated envisioned speech recognition using a random forest classifier, and [8], where the authors demonstrated imagined speech recognition from EEG signals using synthetic EEG data and a CTC network. In our work we use real experimental EEG data and a larger vocabulary. In [9] the authors demonstrated speech recognition using ECoG signals. In [10] the authors used a classification approach to identify phonological categories in imagined and silent speech, whereas in this paper we demonstrate continuous silent speech recognition.

The major contribution of this work is demonstrating the feasibility of using EEG features to perform continuous silent speech recognition. We believe our results will motivate the research community to improve on them and to develop better models for continuous silent speech recognition.

II Connectionist Temporal Classification (CTC)

The CTC ASR model ideas were first introduced in [11, 4]. The CTC model can perform continuous speech recognition by making the length of the output token sequence equal to the number of time steps of the input features. It does this by allowing output tokens to repeat and by introducing a special token called the blank token. Thus the CTC model is alignment free. The CTC ASR model consists of an encoder, a decoder, and a CTC loss function.
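The role of repeated tokens and the blank token can be illustrated with the standard CTC collapsing rule: merge consecutive repeats, then remove blanks. The following is a minimal sketch, not the code used in this work; the `-` symbol standing in for the blank token and the function name `ctc_collapse` are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "-"  # hypothetical symbol standing in for the CTC blank token


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse one label per time step into the final output text."""
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop the blank tokens that remain after merging.
    return "".join(label for label in merged if label != blank)


# Twelve time steps collapse to a five-character output.
print(ctc_collapse("hh-e-ll-lo--"))  # hello
```

Note that the blank between the two runs of `l` is what allows a genuinely doubled letter to survive the merge step.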

The encoder of our CTC model consists of two gated recurrent unit (GRU) [12] layers, with 128 hidden units in the first GRU layer and 64 hidden units in the second. Each GRU layer had dropout regularization [13] with a dropout rate of 0.1. The GRU layers were followed by a temporal convolutional network (TCN) [14] consisting of 32 filters. The decoder of the CTC model consists of a time distributed dense layer and a softmax activation function. The output of the encoder is fed into the decoder at every time step. The encoder takes the EEG features as input. The number of time steps of the encoder is calculated as the product of the sampling frequency of the input EEG features and the sequence length. There was no fixed value for the number of time steps. We used a dynamic recurrent neural network (RNN) cell.
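The time step calculation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the 100 Hz rate is the feature sampling frequency reported in the feature extraction section, the 30 values per step match the reduced feature dimension reported later, and the recording durations and the helper names `num_time_steps` and `pad_batch` are hypothetical. Padding to the longest sequence in a batch is one common way to handle a variable number of time steps.

```python
FEATURE_RATE_HZ = 100  # feature sampling frequency stated in the paper


def num_time_steps(duration_s, rate_hz=FEATURE_RATE_HZ):
    """Time steps = feature sampling frequency x recording length in seconds."""
    return round(rate_hz * duration_s)


def pad_batch(sequences, pad_value=0.0):
    """Pad variable-length feature sequences to the longest one in the batch.

    Returns the padded batch and the true lengths, which a dynamic RNN
    needs in order to ignore the padded steps.
    """
    lengths = [len(seq) for seq in sequences]
    dim = len(sequences[0][0])
    max_len = max(lengths)
    padded = [
        [list(frame) for frame in seq]
        + [[pad_value] * dim for _ in range(max_len - len(seq))]
        for seq in sequences
    ]
    return padded, lengths


durations_s = [2.5, 3.2]  # hypothetical recording lengths in seconds
batch = [[[0.1] * 30 for _ in range(num_time_steps(d))] for d in durations_s]
padded, lengths = pad_batch(batch)
print(lengths, len(padded[0]), len(padded[1]))  # [250, 320] 320 320
```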

The CTC model was trained for 130 epochs with a batch size of 32 using the Adam [15] optimizer to optimize the CTC loss function. The mathematical details of the CTC loss function are covered in [4, 11, 3, 2]. We used a character based model in this work, so the model predicted a character at every time step. At inference time a CTC beam search decoder is used in combination with an external 4-gram language model [16], a combination popularly known as shallow fusion.
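Shallow fusion is commonly written as a log-linear combination of the acoustic (here, EEG) model score and the language model score. The sketch below applies that combination to a toy list of finished hypotheses. It is a simplification: in a real beam search decoder the language model score is usually applied during the search, not only to the final candidates. The language model weight, the probabilities, and the function names are hypothetical and are not taken from this work.

```python
import math


def shallow_fusion_score(ctc_log_prob, lm_log_prob, lm_weight):
    """Log-linear combination of the CTC score and the language model score."""
    return ctc_log_prob + lm_weight * lm_log_prob


def best_hypothesis(hypotheses, lm_weight=0.5):
    """Pick the hypothesis with the highest fused score.

    hypotheses: list of (text, ctc_log_prob, lm_log_prob) tuples.
    lm_weight: hypothetical interpolation weight, normally tuned on held-out data.
    """
    return max(
        hypotheses,
        key=lambda h: shallow_fusion_score(h[1], h[2], lm_weight),
    )[0]


# Toy candidates with made-up probabilities.
candidates = [
    ("the cat sad", math.log(0.40), math.log(0.01)),
    ("the cat sat", math.log(0.35), math.log(0.20)),
]
print(best_hypothesis(candidates, lm_weight=0.0))  # the cat sad
print(best_hypothesis(candidates, lm_weight=0.5))  # the cat sat
```

With a weight of zero the CTC score alone decides; with a positive weight the language model can overturn a slightly better scoring but implausible transcription.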

We used 80% of the total EEG data as the training set and the remaining data as the test set. The train-test split was done randomly. Figure 1 shows the architecture of our CTC model and Figure 2 shows the CTC training loss convergence. All the scripts were written using TensorFlow 2.0 and the Keras deep learning framework.
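A random 80/20 split of this kind can be sketched as below. The recording index (4 subjects, 30 sentences, 3 repetitions each, as described in the experiment design section), the fixed seed, and the function name are illustrative assumptions; the paper does not state how the shuffle was seeded or implemented.

```python
import random


def random_train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle the items and cut them into a training and a test portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is a hypothetical choice
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# One (subject, sentence, repetition) tuple per EEG recording.
recordings = [
    (subject, sentence, repetition)
    for subject in range(4)
    for sentence in range(30)
    for repetition in range(3)
]
train_set, test_set = random_train_test_split(recordings)
print(len(recordings), len(train_set), len(test_set))  # 360 288 72
```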

Fig. 1: CTC ASR Model
Fig. 2: CTC loss convergence
Fig. 3: EEG channel locations for the cap used in our experiments

III Design of Experiments for building the database

Four male subjects in their early to mid twenties took part in the EEG experiment. Three of the four subjects were non-native English speakers and one was a native English speaker. Each subject was asked to read the first 30 English sentences from the USC-TIMIT database [17] in their mind without producing any voice while their EEG signals were recorded. The English sentences were shown to them on a computer screen. Each subject was then asked to repeat the same experiment two more times. The data was recorded in the absence of background noise. This gave 90 EEG recordings per subject.

We used Brain Products' EEG recording hardware. The EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 3. We used EEGLAB [18] to obtain the EEG sensor location mapping. It is based on the standard 10-20 EEG sensor placement method for 32 electrodes.

IV EEG feature extraction details

We followed the same preprocessing methods used by the authors in [1, 2] to process the EEG data and extract EEG features.

EEG signals were sampled at 1000 Hz and a fourth order IIR band pass filter with cut off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter at 60 Hz was used to remove the power line noise. The independent component analysis (ICA) toolbox of EEGLAB [18] was used to remove other biological signal artifacts such as electrocardiography (ECG) and electrooculography (EOG) from the EEG signals. We extracted five statistical features for EEG, namely root mean square, zero crossing rate, moving window average, kurtosis, and power spectral entropy [1, 2]. Thus in total we extracted 31 (channels) × 5 = 155 features for the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
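The five per-channel features can be sketched for a single analysis window as follows. This is a plain-Python illustration under stated assumptions, not the authors' pipeline: the window length, the non-excess (Pearson) definition of kurtosis, the use of a one-sided power spectrum with a base-2 logarithm for the entropy, and all function names are choices made here for the example.

```python
import cmath
import math


def root_mean_square(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


def zero_crossing_rate(x):
    # Fraction of adjacent sample pairs whose signs differ.
    crossings = sum(1 for a, b in zip(x, x[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(x) - 1)


def window_average(x):
    return sum(x) / len(x)


def kurtosis(x):
    # Non-excess (Pearson) kurtosis: fourth central moment over squared variance.
    mean = sum(x) / len(x)
    m2 = sum((v - mean) ** 2 for v in x) / len(x)
    m4 = sum((v - mean) ** 4 for v in x) / len(x)
    return m4 / (m2 * m2) if m2 > 0 else 0.0


def power_spectral_entropy(x):
    # Naive DFT over the non-negative frequency bins, then Shannon entropy
    # of the normalized power spectrum.
    n = len(x)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0
    probs = [p / total for p in power]
    return -sum(p * math.log2(p) for p in probs if p > 0)


def channel_features(window):
    """Five features for one channel and one analysis window."""
    return [
        root_mean_square(window),
        zero_crossing_rate(window),
        window_average(window),
        kurtosis(window),
        power_spectral_entropy(window),
    ]


def frame_features(channel_windows):
    """Concatenate the per-channel features into one feature vector."""
    features = []
    for window in channel_windows:
        features.extend(channel_features(window))
    return features


# Hypothetical 100-sample window: five full cycles of a sine wave.
window = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
frame = frame_features([window] * 31)  # 31 channels
print(len(frame))  # 155
```

Sliding such a window along each channel with a hop of 10 samples would turn the 1000 Hz signal into features at 100 Hz; the hop size is likewise an assumption, since the window parameters are not given here.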

Fig. 4: Explained variance plot

V EEG Feature Dimension Reduction Algorithm Details

After extracting the EEG features as explained in the previous section, we used non-linear methods to denoise the EEG feature space [2, 3].

We reduced the 155 EEG features to a dimension of 30 by applying Kernel Principal Component Analysis (KPCA) [19]. We plotted the cumulative explained variance versus the number of components to identify the right feature dimension, as shown in Figure 4. We used KPCA with a polynomial kernel of degree 3 [1, 2].
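The cumulative explained variance curve can be computed from the eigenvalues of the (centered) kernel matrix, as sketched below. The eigenvalues and the 90% threshold are made-up values for illustration only; in this work the dimension of 30 was read off the plot in Figure 4, and no numeric threshold is stated. The eigendecomposition step of KPCA itself is not shown.

```python
def cumulative_explained_variance(eigenvalues):
    """Running share of total variance captured by the largest components."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running, curve = 0.0, []
    for value in ordered:
        running += value
        curve.append(running / total)
    return curve


def components_needed(eigenvalues, threshold):
    """Smallest number of components whose cumulative share reaches threshold."""
    curve = cumulative_explained_variance(eigenvalues)
    for k, share in enumerate(curve, start=1):
        if share >= threshold:
            return k
    return len(curve)


toy_eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]  # hypothetical values
print(cumulative_explained_variance(toy_eigenvalues))
# [0.5, 0.75, 0.875, 0.9375, 1.0]
print(components_needed(toy_eigenvalues, threshold=0.9))  # 4
```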

VI Results

We used word error rate (WER) as the performance metric to evaluate the CTC model during test time. Table 1 shows the results obtained during test time for different test set vocabulary sizes. The average WER is reported in Table 1.
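WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference and the predicted sentence, divided by the number of words in the reference. A minimal sketch of that standard definition follows; the function name and the example sentences are illustrative and are not taken from the data set.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)  # assumes a non-empty reference


reference = "the cat sat on the mat"
print(word_error_rate(reference, reference))         # 0.0
print(word_error_rate(reference, "the cat on mat"))  # two deletions out of six words
```

Multiplying the result by 100 gives the percentage form, and averaging over all test sentences gives an average WER.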

For all the test set vocabulary sizes we observed WER values in the 80s, ranging from 83.35% to 86.78%. We believe the test time performance can be improved by training the model with more examples. CTC models are usually trained on larger data sets to reach state-of-the-art performance at test time.

We observed that our overall results were poorer than the results demonstrated by the authors in [20], who used EEG signals recorded in parallel with spoken speech. Our overall results indicate that continuous silent speech recognition remains a challenging problem, and we hope our results will motivate other researchers to develop better decoding models. To the best of our knowledge this is the first time continuous silent speech recognition has been demonstrated using real experimental EEG features.

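Total number   Number of unique   Total number   Number of unique   Number of   WER (%)
of sentences   sentences          of words       words              letters
12             5                  79             29                 343         83.40
24             10                 148            59                 683         85.58
36             15                 224            84                 986         83.87
48             20                 301            106                1347        86.78
60             25                 376            132                1740        86.66
72             30                 435            153                2054        83.35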
TABLE I: WER on test set

VII Conclusion and Future work

In this work we demonstrated the feasibility of using EEG signals to perform continuous silent speech recognition.

Future work will focus on improving our current results, for example by performing EEG source localization and developing better decoding models.

VIII Acknowledgement

We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work.