EEG based Continuous Speech Recognition using Transformers

12/31/2019 · by Gautam Krishna, et al.

In this paper we investigate continuous speech recognition from electroencephalography (EEG) features using the recently introduced end-to-end transformer based automatic speech recognition (ASR) model. Our results show that the transformer based model demonstrates faster inference and training compared to recurrent neural network (RNN) based sequence-to-sequence EEG models, but the performance of the RNN based models was better than that of the transformer based model during test time on a limited English vocabulary.


I Introduction

Continuous speech recognition using non-invasive brain signals, or electroencephalography (EEG) signals, is an emerging area of research in which non-invasive EEG signals recorded from the scalp of the subject are translated to text. EEG based continuous speech recognition technology gives people with speaking disabilities, or people who are unable to speak, better access to technology. Current state-of-the-art voice assistant systems mainly process acoustic input features, limiting technology accessibility for people with speaking disabilities or with no ability to produce voice. In [1] the authors demonstrated deep learning based isolated speech recognition for recognizing five English vowels and four English words using only EEG features as input. In [2] the authors demonstrated EEG based continuous speech recognition using state-of-the-art end-to-end sequence-to-sequence recurrent neural network (RNN) based automatic speech recognition (ASR) models such as CTC [3] and attention [4] on a limited English vocabulary consisting of 20 unique sentences. In [5] the authors demonstrated EEG based continuous speech recognition using EEG signals recorded in parallel with spoken speech as well as EEG signals recorded in parallel with listening to utterances, using different types of state-of-the-art ASR models on a limited English vocabulary consisting of 9 unique sentences.

In [6] the authors introduced a new type of sequence-to-sequence model named the transformer, which can be applied to several sequence-to-sequence problems and achieves state-of-the-art performance on tasks such as machine translation [6], language modeling [7] and speech recognition [8, 9]. Transformers use self attention, stacked layers of self attention and positional encoding [6] instead of recurrent networks such as the gated recurrent unit (GRU) [10] and long short-term memory (LSTM) [11] to learn sequence-to-sequence mappings. Transformers are faster to train than RNN models. To the best of our knowledge, transformer models remain unexplored for the EEG based continuous speech recognition task. In this paper we investigate EEG based continuous speech recognition using the transformer model. We demonstrate our results on an English vocabulary consisting of 30 unique sentences during test time. Our results were better than those of the RNN based model for smaller vocabulary sizes, but as the vocabulary size increased, the RNN based CTC model demonstrated better EEG recognition during test time [12].

II Transformer Speech Recognition Model

Figure 1 shows the architecture of the transformer ASR model used for mapping EEG features to text. The architecture used in this work is very similar to the transformer model introduced by the authors in [6]. At a high level, the model can be considered an encoder-decoder model. The encoder takes EEG features as input and applies non-linear transformations to produce a hidden representation, which is fed into the decoder; the decoder in turn applies non-linear transformations to produce text. We now explain the encoder and decoder blocks in detail. The encoder is composed of a stack of 8 identical layers, and the decoder is likewise composed of a stack of 8 identical layers. Each encoder layer consists of two sub layers, namely a multi head attention layer and a fully connected layer. Each decoder layer is composed of three sub layers, where the first two sub layers function in the same way as the encoder sub layers and the third sub layer performs multi head attention over the output of the encoder stack. Each sub layer in the encoder and decoder is followed by layer normalization [13], and there is a residual connection around each of the sub layers [6]. We use masking in the multi head attention layer of the decoder to prevent it from attending to future positions [6].
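As an illustration of this masking, the snippet below builds a standard look-ahead mask of the kind described in [6]. It is a minimal sketch, not the authors' code; the function name and shape convention are our own.

import tensorflow as tf

def look_ahead_mask(size):
    """Upper-triangular mask that blocks attention to future positions.

    Illustrative sketch only; the exact masking code used in the paper is
    not published, but this is the standard construction from [6].
    A value of 1.0 marks a position that must be masked out.
    """
    mask = 1.0 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # shape (size, size)

# Example: the mask for a 4-word target sequence.
# print(look_ahead_mask(4))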

The parameter d_model was set to a value of 256; it determines the output dimension of the sub layers and the embedding layers [6]. Both the embedding layers in the encoder and decoder blocks share the same set of weights. The details of the self attention and multi head attention calculations are described in [6]. The parameter h, the number of parallel attention layers or attention heads, was set to a value of 32.

The parameter d_ff, the number of hidden units in the fully connected sub layers of the encoder and decoder layers, was set to a value of 1024. The parameters d_k (query and key vector dimension) and d_v (value vector dimension) [6] were set to 8. The multi head attention layer output is computed as the concatenation of the outputs of the individual attention heads multiplied by a weight matrix [6]. For the positional encoding block shown in Figure 1 we used the same sine and cosine implementation used by the authors in [6]. The key and value vectors from the final encoder layer are fed into the third multi head attention sub layer of each decoder layer, and that multi head attention layer takes its query vector from the layer beneath it.
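To make the layer layout concrete, the sketch below assembles one encoder layer with the hyperparameters stated above (d_model = 256, h = 32, d_ff = 1024, so d_k = d_v = 8). It is a hedged sketch using the Keras MultiHeadAttention layer available in recent TensorFlow releases, not the authors' implementation.

import tensorflow as tf

D_MODEL, NUM_HEADS, D_FF = 256, 32, 1024   # values reported in the text

class EncoderLayer(tf.keras.layers.Layer):
    """One encoder layer: multi head self attention plus a position-wise
    feed-forward network, each followed by a residual connection and layer
    normalization [6]. A sketch of the layout described above, not the
    authors' code."""

    def __init__(self):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=NUM_HEADS, key_dim=D_MODEL // NUM_HEADS)  # d_k = d_v = 8
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(D_FF, activation="relu"),
            tf.keras.layers.Dense(D_MODEL),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, x):
        # Self attention sub layer with residual connection and layer norm.
        x = self.norm1(x + self.mha(query=x, value=x, key=x))
        # Fully connected sub layer with residual connection and layer norm.
        return self.norm2(x + self.ffn(x))

# The full encoder stacks 8 such layers on the EEG feature sequence after
# it has been embedded and projected to D_MODEL dimensions.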

We used cross entropy as the loss function for the model. After the decoder block, a dense layer performs an affine transformation and a softmax activation produces the output prediction probabilities. During inference we used a combination of a beam search decoder and an external 4-gram language model, known as shallow fusion [14]. The label prediction process stops after the decoder predicts the end token.
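The shallow fusion scoring rule can be summarized in a few lines: the beam search score for each candidate word is the ASR log probability plus a weighted external language model log probability. The sketch below is illustrative only, and the interpolation weight is an assumed value, not one reported in the paper.

import numpy as np

def shallow_fusion_score(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Score candidate words by adding a weighted external LM term [14].

    asr_log_probs, lm_log_probs: arrays of per-word log probabilities over
    the vocabulary at the current decoding step. lm_weight is a tunable
    interpolation weight (0.3 is an illustrative value).
    """
    return asr_log_probs + lm_weight * lm_log_probs

# A beam search decoder would call this at every step and keep the
# highest-scoring partial hypotheses until the end token is produced.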

The model was trained for 120 epochs using the Adam [15] optimizer with a batch size of 100. We used a word based model in this work: the model predicts a word at every time step. All scripts were written using the Keras and TensorFlow 2.0 deep learning frameworks.
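A minimal sketch of this training configuration is shown below. The model, tokenized targets and tensor names are placeholders, not the authors' code; during training the decoder input is assumed to be the target sentence shifted right by one word.

import tensorflow as tf

def train_transformer(model, eeg_features, shifted_targets, target_tokens):
    """Compile and train with the settings reported above (Adam optimizer,
    cross entropy loss, 120 epochs, batch size 100). Hypothetical sketch."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        # Softmax is applied inside the model, so logits are not expected.
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    )
    # The decoder consumes the shifted target sequence and is trained to
    # predict the next word at every time step.
    model.fit(
        x=[eeg_features, shifted_targets],
        y=target_tokens,
        epochs=120,
        batch_size=100,
    )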

Fig. 1: Transformer ASR Model
Fig. 2: EEG channel locations for the cap used in our experiments

III Design of Experiments for Building the Database

We used data set A and data set B from [2] for this work.

We used the Brain Products ActiChamp EEG amplifier. The EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 2. We used EEGLab [16] to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.

For each data set we used 80% of the data as the training set, 10% as the validation set and the remaining 10% as the test set.

IV EEG Feature Extraction Details

We followed the same EEG preprocessing methods used by the authors in [1, 2]. EEG signals were sampled at 1000 Hz and a fourth order IIR band pass filter with cut off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut off frequency of 60 Hz was used to remove power line noise. EEGLab's [16] independent component analysis (ICA) toolbox was used to remove other biological signal artifacts, such as electrocardiography (ECG), electromyography (EMG) and electrooculography (EOG), from the EEG signals. We extracted five statistical features from the EEG, namely root mean square, zero crossing rate, moving window average, kurtosis and power spectral entropy [1, 2]. In total we extracted 31 (channels) × 5 = 155 features from the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
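As a rough illustration of this pipeline, the sketch below filters one EEG channel and computes the five statistical features over sliding windows at a 100 Hz feature rate. The filter design details, window length and hop are assumptions, and the ICA artifact removal step (done in EEGLab) is omitted.

import numpy as np
from scipy.signal import butter, iirnotch, filtfilt
from scipy.stats import kurtosis

FS = 1000          # EEG sampling rate in Hz
HOP = FS // 100    # a hop of 10 samples gives a 100 Hz feature rate

def preprocess(channel):
    """Fourth order band pass (0.1-70 Hz) plus a 60 Hz notch, per the text."""
    b, a = butter(4, [0.1, 70.0], btype="bandpass", fs=FS)
    channel = filtfilt(b, a, channel)
    b, a = iirnotch(60.0, Q=30.0, fs=FS)
    return filtfilt(b, a, channel)

def window_features(window):
    """Five statistical features for one analysis window of one channel."""
    rms = np.sqrt(np.mean(window ** 2))
    zcr = np.mean(np.abs(np.diff(np.sign(window)))) / 2
    avg = np.mean(window)
    kur = kurtosis(window)
    psd = np.abs(np.fft.rfft(window)) ** 2
    psd = psd / np.sum(psd)
    spectral_entropy = -np.sum(psd * np.log2(psd + 1e-12))
    return [rms, zcr, avg, kur, spectral_entropy]

def extract_features(channel, win_len=100):
    """Feature matrix of shape (num_windows, 5) for a single channel."""
    channel = preprocess(channel)
    return np.array([window_features(channel[i:i + win_len])
                     for i in range(0, len(channel) - win_len + 1, HOP)])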

Fig. 3: Explained variance plot

V EEG Feature Dimension Reduction Algorithm Details

After extracting the EEG features as explained in the previous section, we used kernel principal component analysis (KPCA) [17] to denoise the EEG feature space, as explained by the authors in [1, 2]. We reduced the 155 EEG features to a dimension of 30 by applying KPCA for both data sets. We plotted cumulative explained variance versus number of components to identify the right feature dimension, as shown in Figure 3. We used KPCA with a polynomial kernel of degree 3 [1, 2]. We then computed the first and second order derivatives, or delta and delta-delta coefficients, of the 30-dimensional EEG features, so the final EEG feature dimension was 90 (30 × 3).
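A minimal sketch of this dimension reduction step is given below, using scikit-learn's KernelPCA with the settings stated above. The delta computation shown here is a simple numerical gradient along the time axis, which may differ from the exact delta formula the authors used.

import numpy as np
from sklearn.decomposition import KernelPCA

def reduce_and_add_deltas(eeg_features):
    """Reduce 155-dim EEG features to 30 dims with degree-3 polynomial KPCA
    and append first and second order differences, giving 90 dims.

    eeg_features: array of shape (time_steps, 155)."""
    kpca = KernelPCA(n_components=30, kernel="poly", degree=3)
    reduced = kpca.fit_transform(eeg_features)           # (time_steps, 30)
    delta = np.gradient(reduced, axis=0)                 # first derivative
    delta_delta = np.gradient(delta, axis=0)             # second derivative
    return np.concatenate([reduced, delta, delta_delta], axis=1)  # (time_steps, 90)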

Fig. 4: Visualization of attention weights for attention head 5
Fig. 5: Visualization of attention weights for attention head 10
Fig. 6: Visualization of attention weights for attention head 15

VI Results

We used word error rate (WER) as the performance metric of the model during test time. Table I shows the results obtained during test time for data set A, and Table II shows the results obtained during test time for data set B.

The average WER on the test set is reported in both tables. The transformer model demonstrated lower WER for smaller corpus sizes, and the WER went up as the corpus size increased. Even though the transformer model demonstrated faster training and inference compared to RNN based end-to-end models like CTC, the transformer based results were poor compared to the results demonstrated in [12], especially as the corpus size increased.

Interpretability of the attention weight visualizations for the various attention heads with EEG feature input still remains a challenge. We tried plotting the attention weights for various attention heads but were not able to interpret the plots. Figures 4, 5 and 6 show some of the attention weight visualization plots. The attention weights learn the alignment between the input EEG features and the predictions (text), but since the EEG signal is a complex signal it is not easy to interpret the attention weight plots directly.
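For reference, plots like Figures 4-6 can be produced from an attention head's weight matrix as in the sketch below. The array shape and names are hypothetical; the weights would be extracted from the decoder's cross attention layers.

import matplotlib.pyplot as plt

def plot_attention(attention_weights, head, title=None):
    """Heat map of one attention head's weights (predicted word positions
    versus encoder EEG frames), in the spirit of Figures 4-6.

    attention_weights: hypothetical array of shape
    (num_heads, target_len, source_len)."""
    plt.imshow(attention_weights[head], aspect="auto", cmap="viridis")
    plt.xlabel("EEG feature frame")
    plt.ylabel("Predicted word position")
    plt.title(title or f"Attention head {head}")
    plt.colorbar()
    plt.show()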

Total Number of Sentences | Number of Unique Sentences Contained | Number of Unique Words Contained | EEG WER (%)
15 | 5 | 29 | 67.7
30 | 10 | 59 | 83.95
45 | 15 | 84 | 88.85
60 | 20 | 106 | 91.04
75 | 25 | 132 | 91.15
90 | 30 | 153 | 93.95

TABLE I: WER on test set for Data set A
Total Number of Sentences | Number of Unique Sentences Contained | Number of Unique Words Contained | EEG WER (%)
12 | 5 | 29 | 62.5
24 | 10 | 59 | 76.65
36 | 15 | 84 | 86.83
48 | 20 | 106 | 85.92
60 | 25 | 132 | 98.59
72 | 30 | 153 | 96.8

TABLE II: WER on test set for Data set B

VII Conclusion and Future Work

In this paper we explored EEG based continuous speech recognition using a transformer sequence-to-sequence ASR model. Even though the transformer model demonstrated faster training and inference, its test time WER performance was poor compared to the RNN based CTC network demonstrated in [12]. The interpretability of the attention weight visualizations for the various attention heads with EEG input remains an open problem.

For future work we would like to build a larger speech-EEG database and validate our results on a larger English corpus. We would like to investigate whether our results, as well as the interpretability of the attention weights, can be improved by training the transformer model with a larger data set.

VIII Acknowledgement

We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work.

References