Continuous speech recognition using non-invasive brain signals, or electroencephalography (EEG) signals, is an emerging area of research in which EEG signals recorded from the scalp of a subject are translated to text. EEG-based continuous speech recognition can improve technology accessibility for people with speaking disabilities or people who are unable to speak. Current state-of-the-art voice assistant systems mainly process acoustic input features, which limits accessibility for these users. In , the authors demonstrated deep-learning-based isolated speech recognition for recognizing five English vowels and four English words using only EEG features as input. In , the authors demonstrated EEG-based continuous speech recognition using state-of-the-art end-to-end sequence-to-sequence recurrent neural network (RNN) based automatic speech recognition (ASR) models, such as connectionist temporal classification (CTC)  and attention models , on a limited English vocabulary consisting of 20 unique sentences. In , the authors demonstrated EEG-based continuous speech recognition using EEG signals recorded in parallel with spoken speech, as well as EEG signals recorded while listening to utterances, using different types of state-of-the-art ASR models on a limited English vocabulary consisting of 9 unique sentences.
In , the authors introduced a new type of sequence-to-sequence model called the transformer, which achieves state-of-the-art performance on several sequence-to-sequence tasks such as machine translation , language modeling , and speech recognition [8, 9]. Instead of recurrent networks such as the gated recurrent unit (GRU)  or long short-term memory (LSTM) , transformers learn sequence-to-sequence mappings using self-attention, stacked layers of self-attention, and positional encoding. Transformers are also faster to train than RNN models. To the best of our knowledge, transformer models remain unexplored for the EEG-based continuous speech recognition task. In this paper we investigate EEG-based continuous speech recognition using a transformer model. We demonstrate our results on an English vocabulary consisting of 30 unique sentences during test time. Our results were better than those of an RNN-based model for smaller vocabulary sizes, but as the vocabulary size increased, the RNN-based CTC model demonstrated better EEG recognition during test time .
II Transformer Speech Recognition Model
Figure 1 shows the architecture of the transformer ASR model used for mapping EEG features to text. The architecture used in this work is very similar to the transformer model introduced by the authors in . At a high level, the model can be viewed as an encoder-decoder model. The encoder takes EEG features as input and applies non-linear transformations to produce a hidden representation, which is fed into the decoder; the decoder in turn applies non-linear transformations to produce text. We now explain the encoder and decoder blocks in detail. The encoder is composed of a stack of 8 identical layers, and the decoder is likewise composed of a stack of 8 identical layers. Each encoder layer consists of two sub-layers: a multi-head attention layer and a fully connected feed-forward layer. Each decoder layer is composed of three sub-layers; the first two function in the same way as the encoder sub-layers, while the third performs multi-head attention over the output of the encoder stack. Each sub-layer in the encoder and decoder is followed by layer normalization , and there is a residual connection around each sub-layer. We use masking in the multi-head attention layer of the decoder to prevent it from attending to future positions.
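As a concrete illustration, the masked scaled dot-product attention at the core of each multi-head attention sub-layer can be sketched in NumPy as follows (the function name and the toy dimensions are illustrative, not our training code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V, optionally masked."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        # Masked (False) positions get a large negative score, so they
        # receive essentially zero attention weight after the softmax.
        scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Causal mask: position i may only attend to positions <= i.
T = 4
causal_mask = np.tril(np.ones((T, T), dtype=bool))

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((T, 8))  # d_k = d_v = 8, as in Section II
out, w = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape)  # (4, 8)
```

The causal mask is what prevents the decoder from depending on future positions; the encoder uses the same attention without the mask.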
The parameter d_model was set to a value of 256. This parameter determines the output dimension of the sub-layers and embedding layers . The embedding layers in the encoder and decoder blocks share the same set of weights. The details of the self-attention and multi-head attention calculations are described in . The parameter h, the number of parallel attention layers or attention heads, was set to a value of 32.
The parameter d_ff, the number of hidden units in the fully connected sub-layers of the encoder and decoder layers, was set to a value of 1024. The parameters d_k (query and key vector dimension) and d_v (value vector dimension) were set to 8. The multi-head attention output is computed by concatenating the outputs of the individual attention heads and multiplying by a weight matrix . For the positional encoding block shown in Figure 1 we used the same sine and cosine implementation used by the authors in . The key and value vectors from the final encoder layer are fed into the third multi-head attention sub-layer of each decoder layer, while that sub-layer takes its query vector from the layer beneath it.
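The sine and cosine positional encoding can be sketched as follows (a minimal NumPy version; the maximum sequence length of 100 is an arbitrary illustrative choice):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = positional_encoding(max_len=100, d_model=256)  # d_model = 256 as above
print(pe.shape)  # (100, 256)
```

The encoding is added to the embedded inputs so the model can use the order of the frames despite having no recurrence.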
We used cross entropy as the loss function for the model. After the decoder block, a dense layer performs an affine transformation and a softmax activation produces the output prediction probabilities. During inference we used a beam search decoder combined with an external 4-gram language model, a technique known as shallow fusion . The label prediction process stops once the decoder predicts the end token.
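One step of such a shallow-fusion beam search can be sketched as follows (a toy Python version with placeholder ASR and language model distributions; the LM weight of 0.3 is an illustrative assumption, not the value used in our experiments):

```python
import math

def beam_search_step(beams, asr_log_probs, lm_log_probs, beam_width, lm_weight=0.3):
    """Extend each hypothesis with every candidate token, scoring by
    shallow fusion: score = log P_asr(tok | hyp) + lm_weight * log P_lm(tok | hyp).
    Keep the beam_width best hypotheses."""
    candidates = []
    for hyp, score in beams:
        for tok, asr_lp in asr_log_probs(hyp).items():
            lm_lp = lm_log_probs(hyp).get(tok, math.log(1e-10))
            candidates.append((hyp + [tok], score + asr_lp + lm_weight * lm_lp))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]

# Toy distributions standing in for the ASR decoder and the 4-gram LM.
asr = lambda hyp: {"a": math.log(0.6), "b": math.log(0.4)}
lm = lambda hyp: {"a": math.log(0.1), "b": math.log(0.9)}

beams = beam_search_step([([], 0.0)], asr, lm, beam_width=2)
print(beams[0][0])
```

In this toy example the LM score overturns the ASR ranking, which is exactly the effect shallow fusion is meant to provide.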
III Design of Experiments for Building the Database
We used data set A and data set B from the work of the authors in  for this paper.
We used a Brain Products ActiChamp EEG amplifier. The EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 2. We used EEGLAB  to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.
For each data set we used 80% of the data as the training set, 10% as the validation set, and the remaining 10% as the test set.
IV EEG Feature Extraction Details
We followed the same EEG preprocessing methods used by the authors in [1, 2]. EEG signals were sampled at 1000 Hz, and a fourth-order IIR band-pass filter with cut-off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut-off frequency of 60 Hz was used to remove power line noise. EEGLAB's independent component analysis (ICA) toolbox was used to remove other biological signal artifacts, such as electrocardiography (ECG), electromyography (EMG), and electrooculography (EOG), from the EEG signals. We extracted five statistical features per EEG channel, namely root mean square, zero crossing rate, moving window average, kurtosis, and power spectral entropy [1, 2]. In total we extracted 31 (channels) × 5 = 155 features from the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
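The five per-channel statistical features can be computed per analysis frame roughly as follows (a NumPy sketch; the frame length and exact windowing details follow [1, 2] and are assumptions here):

```python
import numpy as np

def eeg_frame_features(frame):
    """Five statistical features for one frame of one EEG channel."""
    rms = np.sqrt(np.mean(frame ** 2))                      # root mean square
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2      # zero crossing rate
    mwa = np.mean(frame)                                    # moving window average
    centered = frame - frame.mean()
    kurt = np.mean(centered ** 4) / (np.mean(centered ** 2) ** 2 + 1e-12)  # kurtosis
    psd = np.abs(np.fft.rfft(frame)) ** 2                   # power spectrum
    p = psd / (psd.sum() + 1e-12)
    pse = -np.sum(p * np.log2(p + 1e-12))                   # power spectral entropy
    return np.array([rms, zcr, mwa, kurt, pse])

rng = np.random.default_rng(1)
window = rng.standard_normal(100)  # one illustrative frame of one channel
feats = eeg_frame_features(window)
print(feats.shape)  # (5,)
```

Applying this to all 31 channels yields the 155-dimensional feature vector per frame described above.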
V EEG Feature Dimension Reduction Algorithm Details
After extracting EEG features as explained in the previous section, we used kernel principal component analysis (KPCA)  to denoise the EEG feature space, as explained by the authors in [1, 2]. We reduced the 155 EEG features to a dimension of 30 by applying KPCA to both data sets. We plotted cumulative explained variance versus number of components to identify the right feature dimension, as shown in Figure 3. We used KPCA with a polynomial kernel of degree 3 [1, 2]. We then computed the first- and second-order derivatives, or delta and delta-delta coefficients, of the 30-dimensional EEG features, giving a final EEG feature dimension of 90 (30 × 3).
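The delta and delta-delta computation can be sketched as follows (a NumPy version of the standard regression formula; the window width of 2 frames is an illustrative assumption):

```python
import numpy as np

def delta(features, width=2):
    """First-order (delta) regression coefficients over time (axis 0),
    using a window of +/- width frames with edge padding."""
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n ** 2 for n in range(1, width + 1))
    T = features.shape[0]
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        out += n * (padded[width + n:width + n + T] - padded[width - n:width - n + T])
    return out / denom

rng = np.random.default_rng(2)
x = rng.standard_normal((50, 30))  # 50 frames of 30-dim KPCA features
# Delta-delta is simply the delta of the delta stream.
full = np.concatenate([x, delta(x), delta(delta(x))], axis=1)
print(full.shape)  # (50, 90)
```

Stacking the static, delta, and delta-delta streams gives the 90-dimensional features fed to the transformer.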
VI Results

We used word error rate (WER) as the performance metric of the model during test time. Table 1 shows the results obtained during test time for Data set A, and Table 2 shows the results obtained during test time for Data set B.
The average WER on the test set is reported in both tables. The transformer model demonstrated a lower WER for smaller corpus sizes, but the WER went up as the corpus size increased. Even though the transformer model demonstrated faster training and inference than RNN-based end-to-end models like CTC, the transformer-based results were poor compared to the results demonstrated in , especially as the corpus size increased.
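For reference, WER is computed as the word-level Levenshtein edit distance between the hypothesis and the reference, normalized by the reference length (a minimal Python sketch, not the scoring script used in our experiments):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[-1][-1] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions.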
Interpreting the attention weight visualizations for the various attention heads given EEG feature input still remains a challenge. We tried plotting the attention weights for various attention heads but were not able to interpret the plots. Figures 4, 5, and 6 show some of the attention weight visualization plots. The attention weights learn the alignment between the input EEG features and the predictions (text), but since the EEG signal is complex, the attention weight plots are not easy to interpret directly.
VII Conclusion and Future Work
In this paper we explored EEG-based continuous speech recognition using a transformer sequence-to-sequence ASR model. Even though the transformer model demonstrated faster training and inference, its test-time WER was poor compared to the RNN-based CTC network demonstrated in . The interpretability of the attention weight visualizations for the various attention heads with EEG input remains an open problem.
For future work we would like to build a larger speech-EEG database and validate our results on a larger English corpus. We would like to investigate whether our results can be improved by training the transformer model with a larger data set, and whether attention weight interpretability can likewise be improved with more training data.
We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work.
-  G. Krishna, C. Tran, J. Yu, and A. Tewfik, “Speech recognition with no speech or with noisy speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on. IEEE, 2019.
-  G. Krishna, C. Tran, M. Carnahan, and A. Tewfik, “Advancing speech recognition with no speech or with noisy speech,” in 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 2019.
-  A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
-  J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
-  G. Krishna, Y. Han, C. Tran, M. Carnahan, and A. H. Tewfik, “State-of-the-art speech recognition using eeg and towards decoding of speech spectrum from eeg,” arXiv preprint arXiv:1908.05743, 2019.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
-  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
-  S. Li, D. Raj, X. Lu, P. Shen, T. Kawahara, and H. Kawai, “Improving transformer-based speech recognition systems with compressed structure and speech attributes augmentation,” Proc. Interspeech 2019, pp. 4400–4404, 2019.
-  L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
-  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  G. Krishna, C. Tran, M. Carnahan, Y. Han, and A. H. Tewfik, “Improving eeg based continuous speech recognition,” arXiv preprint arXiv:1911.11610, 2019.
-  J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
-  S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 369–375.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  A. Delorme and S. Makeig, “Eeglab: an open source toolbox for analysis of single-trial eeg dynamics including independent component analysis,” Journal of neuroscience methods, vol. 134, no. 1, pp. 9–21, 2004.
-  S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, “Kernel pca and de-noising in feature spaces,” in Advances in neural information processing systems, 1999, pp. 536–542.