Understanding effect of speech perception in EEG based speech recognition systems

05/29/2020, by Gautam Krishna, et al.

The electroencephalography (EEG) signals recorded in parallel with speech are used to perform isolated and continuous speech recognition. During the speaking process, one also hears his or her own speech, and this speech perception is reflected in the recorded EEG signals. In this paper we investigate whether it is possible to separate out this speech perception component from EEG signals in order to design more robust EEG based speech recognition systems. We further demonstrate predicting EEG signals recorded in parallel with speaking from EEG signals recorded in parallel with passive listening, and vice versa, with very low normalized root mean squared error (RMSE). We finally demonstrate both isolated and continuous speech recognition using EEG signals recorded in parallel with listening and speaking, and improve the previous connectionist temporal classification (CTC) model results demonstrated by the authors in [1] using their data set.

1 Introduction

Electroencephalography (EEG) signals are non-invasive neural signals recorded by placing EEG sensors on the scalp of the subject. EEG signals demonstrate high temporal resolution, even though their spatial resolution and signal to noise ratio (SNR) are poor. Since EEG is a safe, non-invasive technique, it is easy to test and deploy EEG based brain computer interface (BCI) systems. In recent years there has been a lot of interest in the research community in developing speech recognition systems using EEG signals, where EEG signals recorded in parallel with speech are translated to text using automatic speech recognition (ASR) models. For example, the work demonstrated by the authors in [2] demonstrates isolated speech recognition using EEG signals on a limited English vocabulary consisting of four words and five vowels. Similarly, in [3] the authors demonstrate continuous speech recognition using EEG signals on a limited English vocabulary consisting of 20 unique sentences. In more recent work explained in [4, 1], the authors demonstrated preliminary results for synthesizing speech directly from EEG features. In [1] the authors introduced new EEG feature sets and also demonstrated continuous speech recognition using EEG signals recorded in parallel with speech and listening on a limited English vocabulary consisting of 9 unique sentences. The potential benefits of EEG based speech recognition systems include overcoming the performance loss of ASR systems operating in the presence of background noise [2] and improving technology accessibility for people with speaking disabilities, or people who are not able to produce voice, by allowing them to use voice assistant systems trained to recognize EEG or a combination of EEG and acoustic features.

Even though in references [3, 1, 5, 2] the authors used EEG signals recorded in parallel with speaking for performing speech recognition, during the speaking process a person also receives feedback by hearing his or her own speech. Thus the EEG signals recorded in parallel with speaking include the brain activity responsible for speech production as well as speech perception. In this paper we propose a deep learning model to separate the perception component from the EEG signals recorded in parallel with speech. We demonstrate separating the perception component without causing performance degradation of the speech recognition systems. We further demonstrate predicting EEG signals recorded in parallel with speaking from EEG signals recorded in parallel with passive listening, and vice versa, with very low normalized root mean squared error (RMSE) using Data set B used by the authors in [1]. We finally demonstrate both isolated and continuous speech recognition using EEG signals recorded in parallel with listening, speaking, and the concatenation of listening and speaking, and improve the previous connectionist temporal classification (CTC) model [6] results demonstrated by the authors in [1] using their Data set B. In [1] the authors did not provide results for isolated speech recognition.

Even though the authors demonstrated EEG based speech recognition results using an attention model [7] in [3, 1] and showed lower test time word error rates (WER) for smaller corpus sizes, the visualization of attention weights provided by the authors in [3] suggests that the attention model might be memorizing rather than actually learning the underlying distribution when trained on smaller data sets. Further, in [1] the authors demonstrated that the recurrent neural network (RNN) transducer model [8] produced higher test time WERs; hence we only use the CTC model for performing continuous speech recognition experiments in this paper. We also demonstrate that EEG based isolated speech recognition results can be improved using a siamese network [9] when the training data set has only a few samples per label.

2 Regression Model

Since the EEG signals recorded in parallel with speaking and the EEG signals recorded in parallel with passive listening for the same English sentence and the same subject were of different lengths, we used an encoder-decoder regression model with an attention mechanism to predict EEG signals recorded in parallel with speaking from EEG signals recorded in parallel with passive listening, and vice versa. Our encoder was a single layer of gated recurrent units (GRU) [10] with 128 hidden units, and our decoder was also a GRU with 128 hidden units followed by a time distributed dense layer with 30 hidden units. The time distributed dense layer uses a linear activation function. The encoder GRU layer takes EEG features of dimension 30 as input, and the encoder outputs are passed to a Luong dot product attention layer [11] to derive the attention context vectors, which are passed to the decoder to get the predictions. The details of the dot product attention calculations are explained in [11]. Dropout regularization [12] with a dropout rate of 0.2 is applied after the attention layer. When listen EEG of dimension 30 is taken as input, the model outputs spoken EEG of dimension 30, and vice versa. The details of the data set (spoken and listen EEG) are covered in the sections below. Basically, the subjects were first asked to listen to English utterances and were then instructed to speak out loud the utterances they had listened to. The term listen EEG refers to EEG signals recorded during passive listening, and the term spoken EEG refers to EEG signals recorded in parallel with the speaking process. There was no fixed time step value for the input.

The model was trained for 150 epochs using the Adam optimizer [13] with mean squared error (MSE) as the loss function. The batch size was set to 100 and the validation split hyperparameter was set to a value of 0.1. The script was written using the Keras deep learning framework.
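A minimal Keras sketch of such an encoder-decoder regression model is given below. The paper does not specify how the decoder is driven or how the differing sequence lengths are handled during batching, so this sketch assumes sequences padded to a common length per batch and drives the decoder GRU with the raw input sequence; variable names are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

FEAT_DIM = 30  # KPCA-reduced EEG feature dimension

# Variable-length input sequence of listen (or spoken) EEG features.
src = layers.Input(shape=(None, FEAT_DIM))

# Encoder: a single GRU layer with 128 hidden units.
enc = layers.GRU(128, return_sequences=True)(src)

# Decoder: another 128-unit GRU; here it is driven by the raw input
# sequence, since the paper does not spell out the decoder inputs.
dec = layers.GRU(128, return_sequences=True)(src)

# Luong-style dot-product attention over the encoder outputs,
# followed by dropout with rate 0.2.
ctx = layers.Attention()([dec, enc])
ctx = layers.Dropout(0.2)(ctx)

# Time-distributed dense layer with 30 units and linear activation
# produces the predicted spoken (or listen) EEG features.
out = layers.TimeDistributed(layers.Dense(FEAT_DIM, activation='linear'))(ctx)

model = Model(src, out)
model.compile(optimizer='adam', loss='mse')
# model.fit(listen_eeg, spoken_eeg, epochs=150, batch_size=100,
#           validation_split=0.1)
```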

3 Separation Model

The goal of developing this deep model was to separate out the speech perception component present in spoken EEG: since one hears his or her own speech during the speaking process, the EEG signals recorded in parallel with speaking (spoken EEG) contain brain activity responsible for speech production as well as speech perception. Using the EEG preprocessing methods discussed in the sections below it is possible to remove other biological signal artifacts from EEG signals to some extent, but removing the speech perception component using traditional signal processing methods is extremely challenging, mainly due to the source localization issue and the poor spatial resolution associated with EEG signals. The architecture of the separation model is described in Figure 1. The main idea is to train a regression model and an ASR model simultaneously in such a way that the ASR model helps in separating out the perception component, while at the same time ensuring that the separation process does not remove useful speech production components. Our hypothesis is that the desired EEG signal or features can be modelled as a non-linear function of the difference between spoken EEG and listen EEG features. The regression and ASR models are trained simultaneously to efficiently realize this non-linear function (tanh in our case). The model takes spoken EEG of dimension 30 as input, as shown in Figure 1, and passes it through the regression part of the separation model. The regression part is similar to the regression model described in the previous section, but here we did not use an attention layer, since the separation model needs to perform a subtraction operation and hence the time steps value needs to be preserved. Instead, we trimmed the spoken EEG time steps to make them equal to the listen EEG time steps value; on average we trimmed the spoken EEG time steps by 0.72 seconds. We used a temporal convolutional network (TCN) [14] instead of a GRU for faster training, as the separation model is more complex than the regression model. Since we did not use an attention layer, the decoder in the regression part consists of only the time distributed dense layer. The targets for the regression part are set to the listen EEG features of dimension 30. The intermediate features output by the time distributed dense layer in the regression part are passed to a subtraction arithmetic block, as shown in Figure 1. The subtraction arithmetic block calculates the difference between the input spoken EEG features and the intermediate features. The difference features are then passed through a fully connected dense layer with a tanh non-linearity and finally to an ASR classifier model. The architecture of the ASR classifier model is described in Figure 2. We used MSE as the loss function for the regression part and cross entropy as the loss function for the ASR classifier model. The separation model was trained for 200 epochs using the Adam optimizer with a batch size of 50, until the combined MSE and cross entropy loss showed convergence, and the validation split hyperparameter was set to a value of 0.1. During test time, the TCN layer in the separation model takes spoken EEG features of dimension 30 as input and we take the output from the dense layer with the tanh non-linearity (the layer after the subtraction block).
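The sketch below shows one way to wire up the separation model in Keras under stated assumptions: the TCN layers are approximated by a small stack of dilated causal Conv1D layers (the exact TCN configuration from [14] is not given), the final classifier layer uses a two-unit softmax as described in Section 6, and the two losses are combined with equal weights since the paper does not report the loss weighting.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

FEAT_DIM = 30   # KPCA-reduced EEG feature dimension
N_CLASSES = 2   # two unique sentences in the isolated-recognition subset

def tcn_block(x, filters=128):
    """Stand-in for the TCN layer: a small stack of dilated causal convolutions."""
    for d in (1, 2, 4):
        x = layers.Conv1D(filters, kernel_size=3, padding='causal',
                          dilation_rate=d, activation='relu')(x)
    return x

# Spoken EEG input (time steps trimmed to match the listen EEG length).
spoken = layers.Input(shape=(None, FEAT_DIM), name='spoken_eeg')

# Regression part: predict listen EEG features from spoken EEG features.
reg = tcn_block(spoken)
listen_pred = layers.TimeDistributed(layers.Dense(FEAT_DIM, activation='linear'),
                                     name='listen_pred')(reg)

# Subtraction block followed by a tanh dense layer: this is the separated
# representation taken as the output at test time.
diff = layers.Subtract()([spoken, listen_pred])
separated = layers.TimeDistributed(layers.Dense(FEAT_DIM, activation='tanh'),
                                   name='separated')(diff)

# ASR classifier head: another TCN, last time step into a softmax layer.
clf = tcn_block(separated)
last_step = layers.Lambda(lambda t: t[:, -1, :])(clf)
label = layers.Dense(N_CLASSES, activation='softmax', name='label')(last_step)

model = Model(spoken, [listen_pred, label])
model.compile(optimizer='adam',
              loss={'listen_pred': 'mse', 'label': 'categorical_crossentropy'})
# model.fit(spoken_eeg, {'listen_pred': listen_eeg, 'label': labels},
#           epochs=200, batch_size=50, validation_split=0.1)
```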

4 Isolated Speech recognition Model

For performing isolated speech recognition experiments we used the same ASR classifier model described in Figure 2. The dense layer in the model uses a linear activation function. The last time step output of the TCN layer is passed to the dense layer with two hidden units. The model was trained for 100 epochs using the Adam optimizer with a batch size of one. We used categorical cross entropy as the loss function for the model. The validation split hyperparameter was set to a value of 0.1.

We also tried performing a sentence identification task from a pair of given sentences using EEG features, using the siamese network described by the authors in [9], who used it for an image recognition task. A siamese network is especially useful when there are only a few examples per label to train the classifier or deep model. Our siamese network consists of two TCN layers with 128 filters connected in parallel; the last time step output of each TCN layer is connected to a dense layer with 64 hidden units and a sigmoid activation function to derive the embeddings for the given input pair. The L1 distance is then calculated between the embeddings and passed to a dense layer with sigmoid activation. The model is trained for 100 epochs with a batch size of one using the Adam optimizer. We used binary cross entropy as the loss function and the validation split hyperparameter was set to a value of 0.1. During test time a pair of EEG features is fed to the siamese network as input, and the model outputs one if both features represent the same English sentence; otherwise, the model outputs zero.
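A sketch of such a siamese network in Keras is shown below; it assumes the two branches share weights (the usual siamese choice, not stated explicitly in the paper) and again substitutes a small dilated Conv1D stack for the TCN layer.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

FEAT_DIM = 30

def make_branch():
    """Embedding branch: a dilated Conv1D stack standing in for the TCN,
    last time step fed to a 64-unit sigmoid dense layer."""
    inp = layers.Input(shape=(None, FEAT_DIM))
    x = inp
    for d in (1, 2, 4):
        x = layers.Conv1D(128, kernel_size=3, padding='causal',
                          dilation_rate=d, activation='relu')(x)
    x = layers.Lambda(lambda t: t[:, -1, :])(x)
    emb = layers.Dense(64, activation='sigmoid')(x)
    return Model(inp, emb)

branch = make_branch()            # weights shared between the two inputs
eeg_a = layers.Input(shape=(None, FEAT_DIM))
eeg_b = layers.Input(shape=(None, FEAT_DIM))
emb_a, emb_b = branch(eeg_a), branch(eeg_b)

# L1 distance between the embeddings, then a sigmoid unit predicting
# whether the two EEG segments correspond to the same English sentence.
l1 = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb_a, emb_b])
same = layers.Dense(1, activation='sigmoid')(l1)

siamese = Model([eeg_a, eeg_b], same)
siamese.compile(optimizer='adam', loss='binary_crossentropy',
                metrics=['accuracy'])
# siamese.fit([pairs_a, pairs_b], pair_labels, epochs=100, batch_size=1,
#             validation_split=0.1)
```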

5 Continuous Speech recognition Model

For performing continuous speech recognition experiments we used the connectionist temporal classification (CTC) [15, 6] model described in Figure 1 in [5], with the exact same hyperparameters and training parameters used by the authors in [5], but the encoder layers in the CTC model were initialized with random weights [3, 1]. An external language model was also used during inference time, like the one used by the authors in [5]. The CTC model we used in this work is different from the one used by the authors in [1].
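The exact CTC architecture and hyperparameters follow [5] and are not reproduced here; the sketch below only illustrates, under stated assumptions, how a CTC loss can be attached to a recurrent encoder over EEG features in Keras. The vocabulary size and encoder depth are placeholders, not the values used in [5].

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, backend as K

FEAT_DIM = 30
N_TOKENS = 40   # placeholder: character vocabulary size + 1 for the CTC blank

# Recurrent encoder over EEG features (the real encoder in [5] is deeper).
eeg = layers.Input(shape=(None, FEAT_DIM), name='eeg')
x = layers.GRU(128, return_sequences=True)(eeg)
probs = layers.TimeDistributed(layers.Dense(N_TOKENS, activation='softmax'))(x)

# CTC loss wiring: the label sequences and the true input/label lengths
# are passed as extra inputs and combined with the softmax outputs.
labels = layers.Input(shape=(None,), dtype='int32', name='labels')
input_len = layers.Input(shape=(1,), dtype='int32', name='input_len')
label_len = layers.Input(shape=(1,), dtype='int32', name='label_len')
ctc = layers.Lambda(lambda a: K.ctc_batch_cost(a[0], a[1], a[2], a[3]),
                    name='ctc')([labels, probs, input_len, label_len])

ctc_model = Model([eeg, labels, input_len, label_len], ctc)
# The Lambda layer already outputs the per-example CTC loss,
# so the compiled loss simply passes it through.
ctc_model.compile(optimizer='adam', loss=lambda y_true, y_pred: y_pred)
```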

Figure 1: Architecture of our proposed separation model
Figure 2: Architecture of the ASR classifier model used in our proposed separation model

6 Data Sets used for performing experiments

For performing continuous speech recognition experiments using EEG we used Data set B used by the authors in [1]. It consists of listen EEG and spoken EEG recordings from 15 subjects, recorded in the presence of background noise of 50 dB.

For performing isolated speech recognition experiments we used the first two unique sentences from Data set B used by the authors in [1], consisting of a total of 90 EEG recording examples for the spoken and listen conditions. The same data set was used to train our separation model described in Figure 1. Since there were only two unique sentences, the ASR classifier model's final dense layer had two hidden units with a softmax activation function. We considered EEG samples for only two unique sentences since we were interested in faster training of the simultaneous regression and ASR classifier models. More details of Data set B, the EEG experiment design, the EEG recording hardware, etc. are covered in [1].

For training the separation and regression models we used 80% of the total data as the training set, 10% as the validation set and the remaining 10% as the test set. The train-test split was done randomly, and there was no overlap between the training, validation and test sets. For training the continuous speech recognition and isolated speech recognition models we first performed experiments where we used 80% of the data as the training set, 10% as the validation set and the remaining 10% as the test set, and then performed experiments where we used the data from the first 13 subjects as the training set, the 14th subject's data as the validation set and the last subject's data as the test set.

7 EEG feature extraction details

We followed the same EEG preprocessing methods used by the authors in [2, 3] for extracting raw EEG features for both the spoken and listen conditions. The EEG signals were sampled at 1000 Hz and a fourth order IIR band pass filter with cut off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut off frequency of 60 Hz was used to remove power line noise. The EEGLAB [16] Independent Component Analysis (ICA) toolbox was used to remove other biological signal artifacts such as electrocardiography (ECG), electromyography (EMG) and electrooculography (EOG) from the EEG signals. We extracted five statistical features for EEG, namely root mean square, zero crossing rate, moving window average, kurtosis and power spectral entropy [2, 3]. In total we therefore extracted 31 (channels) x 5 = 155 features for the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
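A sketch of this preprocessing and feature extraction pipeline using SciPy and NumPy is shown below. The window length is an assumption (a 10 ms non-overlapping window matching the 100 Hz feature rate); the paper does not state the exact windowing, and the ICA-based artifact removal performed in EEGLAB is omitted here.

```python
import numpy as np
from scipy.signal import butter, iirnotch, filtfilt
from scipy.stats import kurtosis

FS = 1000          # EEG sampling rate (Hz)
WIN = FS // 100    # assumed 10 ms non-overlapping window -> 100 Hz feature rate

def preprocess(eeg):
    """eeg: array of shape (samples, channels). Band-pass and notch filtering;
    the ICA-based artifact removal done in EEGLAB is not reproduced here."""
    b, a = butter(4, [0.1, 70], btype='bandpass', fs=FS)   # 4th-order IIR band pass
    eeg = filtfilt(b, a, eeg, axis=0)
    bn, an = iirnotch(60, Q=30, fs=FS)                     # 60 Hz power-line notch
    return filtfilt(bn, an, eeg, axis=0)

def power_spectral_entropy(w):
    psd = np.abs(np.fft.rfft(w)) ** 2
    p = psd / (psd.sum() + 1e-12)
    return -(p * np.log(p + 1e-12)).sum()

def statistical_features(eeg):
    """Five features per channel per window: root mean square, zero crossing
    rate, moving window average, kurtosis and power spectral entropy."""
    frames = []
    for start in range(0, eeg.shape[0] - WIN + 1, WIN):
        w = eeg[start:start + WIN]                             # (WIN, channels)
        rms = np.sqrt((w ** 2).mean(axis=0))
        zcr = (np.diff(np.sign(w), axis=0) != 0).mean(axis=0)
        avg = w.mean(axis=0)
        kur = kurtosis(w, axis=0)
        ent = np.apply_along_axis(power_spectral_entropy, 0, w)
        frames.append(np.concatenate([rms, zcr, avg, kur, ent]))
    return np.array(frames)        # (n_frames, 31 channels x 5 = 155)
```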

8 EEG Feature Dimension Reduction Algorithm Details

After extracting EEG features as explained in the previous section, we used Kernel Principal Component Analysis (KPCA) [17] to perform denoising of the EEG feature space, as explained by the authors in [3, 2]. We reduced the 155 EEG features to a dimension of 30 by applying KPCA. We plotted the cumulative explained variance versus the number of components to identify the right feature dimension. We used KPCA with a polynomial kernel of degree 3 [2, 3].
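A sketch of this dimensionality reduction step with scikit-learn, using dummy stand-in data for the stacked 155-dimensional feature frames, might look as follows.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Dummy stand-in for the stacked 155-dimensional EEG feature frames.
rng = np.random.default_rng(0)
train_frames = rng.standard_normal((5000, 155))

# Polynomial-kernel KPCA of degree 3, reducing 155 features to 30 [2, 3].
kpca = KernelPCA(n_components=30, kernel='poly', degree=3)
reduced_train = kpca.fit_transform(train_frames)      # (5000, 30)

# Cumulative explained variance versus number of components, approximated
# from the kernel eigenvalues (exposed as `eigenvalues_` in recent
# scikit-learn releases, `lambdas_` in older ones).
explained = np.cumsum(kpca.eigenvalues_) / kpca.eigenvalues_.sum()
```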

9 Results

We used word error rate (WER) as the performance metric to evaluate the CTC model during test time, and classification test accuracy as the performance metric to evaluate the ASR classifier (isolated speech recognition) model during test time. The classification test accuracy is defined as the ratio of the number of correct predictions given by the model to the total number of predictions given by the model on the test set. The lower the WER value, the better the continuous ASR system's test time performance. For the regression model used for predicting EEG signals recorded in parallel with speaking (spoken EEG) from EEG signals recorded in parallel with passive listening (listen EEG), and vice versa, we used normalized root mean squared error (RMSE) as the performance metric. The RMSE values were normalized by dividing them by the absolute difference between the maximum and minimum values in the test set observation vector. Tables 1, 2 and 3 show the test time results obtained for continuous speech recognition where we used 10% of the total data set as the test set (the same data split method used by the authors in [1] to obtain results for their CTC model), and Table 4 shows the test time results obtained for continuous speech recognition where we used the last subject's data as the test set. Tables 5 and 6 show the test time results obtained for isolated speech recognition using the ASR classifier model. For the spoken+listen condition, we concatenate the spoken and listen EEG features along the time step axis for each English sentence. We observed an average improvement of 5% in test accuracy for the spoken, listen, and spoken+listen conditions using the siamese network over the ASR classifier model for the sentence identification task from a pair of input EEG features.

The overall results from Tables 3, 4, 5 and 6 show that the speech perception components present in EEG provide useful features to the ASR model, as the test time results for the spoken + listen EEG condition were better than for the spoken EEG condition in the majority of the ASR experiments. However, a reliable EEG based speech prosthetic should only use EEG features responsible for speech production. In Table 1 we show the test time results obtained using spoken EEG features with the CTC model after removing the perception components using our separation model. We can observe that our separation model was able to remove the perception components from spoken EEG without causing much performance degradation in recognition. In Tables 1 and 2 the results in the Ref [1] column were taken directly from the results reported by the authors in [1] for the CTC model on Data set B, in the 'Results' section just before their Table 1. In [1] the authors did not provide results for isolated speech recognition for the listen, spoken or listen + spoken conditions.

For predicting spoken EEG features from listen EEG features using the regression model we observed a normalized RMSE of 0.0016446532, and for predicting listen EEG features from spoken EEG features we observed a normalized RMSE of 0.0052599716, during test time.
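For clarity, the normalized RMSE reported above is computed as

\mathrm{NRMSE} = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}}{\max_i y_i - \min_i y_i}

where y_i are the test set observation values, \hat{y}_i are the corresponding model predictions and N is the number of test observations.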

Total Number of Sentences | Total No. of Words | Spoken EEG WER (%) Ref [1] | Spoken EEG WER (%) our CTC Model | Spoken EEG WER (%) after removing perception component
27 | 173 | 73.6 | 66.6 | 62.7
45 | 292 | 83.8 | 76.53 | 77.12
63 | 404 | 91.1 | 81 | 79.18
81 | 525 | 91.5 | 84.2 | 82.72
Table 1: CTC model continuous speech recognition test time results using spoken EEG, where the total data was randomly split to form the test set. The numbers of unique sentences and unique words contained in the test set were the same as those in Table 3 in [1].
Total Number of Sentences | Total No. of Words | Listen EEG WER (%) Ref [1] | Listen EEG WER (%) our CTC Model
27 | 173 | 52.6 | 75.24
45 | 292 | 87.09 | 74.83
63 | 404 | 88.88 | 73.71
81 | 525 | 94.9 | 77.17
Table 2: CTC model continuous speech recognition test time results using listen EEG, where the total data was randomly split to form the test set.
Total Number of Sentences | Total No. of Words | Listen + Spoken EEG WER (%) our CTC Model
27 | 173 | 56.56
45 | 292 | 79
63 | 404 | 80.7
81 | 525 | 84
Table 3: CTC model continuous speech recognition test time results using the concatenation of listen and spoken EEG, where the total data was randomly split to form the test set.
Total Number of Sentences | Spoken EEG WER (%) our CTC Model | Listen EEG WER (%) our CTC Model | Listen + Spoken EEG WER (%) our CTC Model
27 | 81.9 | 71.4 | 78.78
Table 4: CTC model continuous speech recognition test time results where the last subject's data was used as the test set. The number of unique sentences was 9 and the number of unique words contained in the test set was 55.
Spoken EEG Test Acc (%) | Listen EEG Test Acc (%) | Listen + Spoken EEG Test Acc (%)
50 | 50 | 55.56
Table 5: Isolated speech recognition test time results using the ASR classifier model, where the total data was randomly split to form the test set.
Spoken EEG Test Acc (%) | Listen EEG Test Acc (%) | Listen + Spoken EEG Test Acc (%)
50 | 50 | 66.67
Table 6: Isolated speech recognition test time results using the ASR classifier model, where the last subject's data was used as the test set.

10 Conclusion and Future work

In this paper we introduced a deep model to separate the speech perception component from EEG signals recorded in parallel with speech. We further demonstrated isolated and continuous speech recognition using EEG features recorded under various conditions (listen, spoken). We finally demonstrated predicting EEG signals recorded in parallel with speaking (spoken EEG) from EEG signals recorded in parallel with passive listening (listen EEG), and vice versa, with very low normalized root mean squared error (RMSE) during test time. Future work will focus on validating and improving the results using a larger data set and on performing separation experiments in which the ASR classifier model is replaced with a CTC model; training a CTC model instead of a simple ASR classifier model within the separation model would require a larger data set.

11 Acknowledgement

We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models.

References