Improving EEG based Continuous Speech Recognition

11/24/2019 ∙ by Gautam Krishna, et al.

In this paper we introduce various techniques to improve the performance of continuous speech recognition (CSR) systems based on electroencephalography (EEG) features. A connectionist temporal classification (CTC) based automatic speech recognition (ASR) system was implemented for performing recognition. We introduce techniques to initialize the weights of the recurrent layers in the encoder of the CTC model with more meaningful weights rather than random weights, and we make use of an external language model to improve the beam search during decoding. We finally study the problem of predicting articulatory features from EEG features.


I Introduction

An automatic speech recognition (ASR) system maps acoustic features to text. ASR systems form the front end or back end of many state-of-the-art voice assistants such as Bixby, Alexa, Siri, and Cortana. Most current state-of-the-art ASR systems are trained only with acoustic features and can operate only with acoustic input; this limits accessibility for people who cannot speak at all or who have speaking disabilities such as stuttering. Electroencephalography (EEG), on the other hand, is a non-invasive way of measuring the electrical activity of the human brain; EEG sensors are placed on the scalp of the subjects to obtain the recordings. Recently, in [1] the authors demonstrated isolated speech recognition using EEG features on a limited English vocabulary of four words and five vowels. In [2] the authors demonstrated continuous speech recognition using connectionist temporal classification (CTC) [3] and attention [4] models on an English vocabulary of 20 unique sentences, using EEG and a combination of EEG and acoustic features as input. In [5] the authors demonstrated continuous speech recognition using different EEG feature sets. In [2, 5] the authors used state-of-the-art end-to-end ASR models to directly map EEG features to text.

In [6] the authors demonstrated that combining articulatory features with acoustic features improves the performance of ASR systems, and in [1] the authors demonstrated that combining EEG features with acoustic features also improves the performance of ASR systems operating in the presence of background noise. In [5, 2] the authors did not take articulatory features into account for performing speech recognition. In this paper we show that articulatory features can be used to design more robust EEG-based ASR encoder models and can help improve the performance of EEG based continuous speech recognition systems.

In [5, 2] the authors initialized the weights of the recurrent neural network (RNN) encoder in their CTC network with random weights. In this paper we demonstrate that initializing the weights of the first few RNN layers in the CTC network with the weights of an RNN trained to predict the concatenation of acoustic and articulatory features from EEG features significantly improves the performance of the EEG based speech recognizer. We further demonstrate predicting articulatory features from EEG features using a temporal convolutional network (TCN) [7] model.
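
The initialization idea can be illustrated with the minimal Keras/TensorFlow 2.x sketch below. This is an assumed reconstruction rather than the authors' code: the layer sizes follow Section II, the regression model is taken to be trained beforehand, and the weight copy is done with get_weights/set_weights.

```python
# Minimal sketch (assumed, not the authors' code) of initializing the CTC
# encoder's GRU layers from a GRU regression model that maps EEG features
# (dim 30) to concatenated MFCC + articulatory targets (dim 19).
import tensorflow as tf

regression_model = tf.keras.Sequential([
    tf.keras.layers.GRU(128, return_sequences=True, input_shape=(None, 30)),
    tf.keras.layers.GRU(64, return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(19, activation='linear')),
])
# ... train regression_model on (EEG, MFCC + articulatory) pairs first ...

# GRU layers that will sit at the bottom of the CTC encoder.
ctc_gru_1 = tf.keras.layers.GRU(128, return_sequences=True)
ctc_gru_2 = tf.keras.layers.GRU(64, return_sequences=True)

# Run a dummy EEG-shaped batch through the layers so their weights are created,
# then copy over the trained regression weights.
dummy_eeg = tf.zeros((1, 10, 30))
ctc_gru_2(ctc_gru_1(dummy_eeg))
ctc_gru_1.set_weights(regression_model.layers[0].get_weights())
ctc_gru_2.set_weights(regression_model.layers[1].get_weights())
```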

In [2, 5] the authors did not use an external language model during inference. In this paper we demonstrate that using an external language model at inference time significantly improves the beam search decoding performance of the EEG based speech recognizer. Finally, we demonstrate EEG based speech recognition results for a larger English vocabulary than the ones used by the authors in [2, 5].

II Connectionist Temporal Classification (CTC)

The encoder of our CTC network consists of two gated recurrent unit (GRU) [8] layers with 128 and 64 hidden units respectively, connected to a temporal convolutional network (TCN) [7] layer with 32 filters, as shown in Figure 1. The GRU layers use dropout regularization [9] with a dropout rate of 0.1. The decoder of the CTC network consists of a dense layer followed by a softmax activation, and the output of the encoder is fed into the decoder at every time step. The two GRU layers in the encoder network are initialized with the weights of the GRU layers of the model shown in Figure 2. The model in Figure 2 consists of two GRU layers with 128 and 64 hidden units respectively, connected to a time-distributed dense layer of 19 hidden units with linear activation; at every time step it predicts the concatenation of acoustic or Mel-frequency cepstrum coefficient (MFCC) features of dimension 13 and articulatory features of dimension 6, a net feature dimension of 19. The model shown in Figure 2 was trained for 500 epochs with mean squared error (MSE) as the loss function, the Adam optimizer [10], and a batch size of one. In other words, the model shown in Figure 2 is a GRU based regression model that predicts a combination of acoustic and articulatory features from EEG features of dimension 30 at every time step.
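
A minimal Keras sketch of this CTC architecture is given below. The TCN layer is assumed to come from the third-party keras-tcn package (the paper does not name an implementation), the number of output classes is a hypothetical character set size, and the loss wrapper uses Keras' built-in ctc_batch_cost helper.

```python
# Sketch of the CTC model in Figure 1: GRU(128) -> GRU(64) -> TCN(32 filters)
# encoder, dense + softmax decoder applied at every time step.
import tensorflow as tf
from tcn import TCN  # assumption: the keras-tcn package provides the TCN layer

EEG_DIM = 30        # KPCA-reduced EEG feature dimension
NUM_CLASSES = 28    # hypothetical: 26 letters + space + CTC blank

eeg_input = tf.keras.Input(shape=(None, EEG_DIM))
x = tf.keras.layers.GRU(128, return_sequences=True, dropout=0.1)(eeg_input)
x = tf.keras.layers.GRU(64, return_sequences=True, dropout=0.1)(x)
x = TCN(nb_filters=32, return_sequences=True)(x)      # randomly initialized
char_probs = tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')(x)
ctc_model = tf.keras.Model(eeg_input, char_probs)

def ctc_loss(y_true, y_pred, input_length, label_length):
    # Keras helper that expects softmax outputs plus per-example lengths.
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```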

The CTC model was trained for 120 epochs to optimize the CTC loss using the Adam optimizer, with a batch size of 32. More details of the CTC loss function are described in [3, 11, 2, 5]. The TCN layer in the CTC encoder was initialized with random weights. The main motivation behind this design is that the first two GRU layers of the CTC encoder help discover acoustic and articulatory representations from the input EEG features, while the TCN layer learns the mapping from those representations to text. There was no fixed number of time steps for the encoder of the CTC model; as usual, the number of time steps equals the product of the sampling frequency of the input features and the input sequence length. At inference time we used a combination of a CTC beam search decoder and an external 4-gram language model: along with the beam search score, we include the log of the probability assigned by the language model to the label sequence. We used a character based CTC model, which predicts a character at every time step. Figure 3 shows the loss convergence of the CTC ASR model and Figure 4 shows the loss convergence of the GRU regression model described in Figure 2. All scripts were written in Python using the Keras deep learning library with the TensorFlow 2.0 framework.
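
The decoding step can be sketched as follows: take the top hypotheses from a CTC beam search and re-rank them by adding the log probability assigned by an external language model. The lm_log_prob scoring function and the alphabet mapping are hypothetical stand-ins (for example a 4-gram model), since the paper does not publish its decoder implementation.

```python
# Hedged sketch of LM-augmented CTC decoding: beam search over the softmax
# outputs, then re-ranking of the candidate transcriptions with the log
# probability assigned by an external (e.g. 4-gram) language model.
# Assumes a single utterance per batch.
import numpy as np
import tensorflow as tf

def decode_with_lm(softmax_outputs, seq_lengths, lm_log_prob, alphabet,
                   beam_width=100, top_paths=10, lm_weight=1.0):
    # tf.nn.ctc_beam_search_decoder expects time-major inputs.
    log_probs = tf.math.log(tf.transpose(softmax_outputs, [1, 0, 2]) + 1e-8)
    decoded, ctc_scores = tf.nn.ctc_beam_search_decoder(
        log_probs, seq_lengths, beam_width=beam_width, top_paths=top_paths)
    best_text, best_score = None, -np.inf
    for path_idx, sparse in enumerate(decoded):
        chars = tf.sparse.to_dense(sparse, default_value=-1).numpy()[0]
        text = ''.join(alphabet[i] for i in chars if i >= 0)
        # Combined score: CTC beam score + weighted LM log probability.
        score = float(ctc_scores[0, path_idx]) + lm_weight * lm_log_prob(text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```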

Fig. 1: CTC ASR Model
Fig. 2: Model to derive the initializing weights for the GRU layers of the CTC network
Fig. 3: CTC loss convergence
Fig. 4: GRU regression loss convergence
Fig. 5: Model to predict articulatory features from EEG features
Fig. 6: TCN regression loss convergence
Fig. 7: EEG channel locations for the cap used in our experiments

III Model to predict articulatory features from EEG features

For predicting articulatory features of dimension 6 from EEG features of dimension 30 we used the model described in Figure 5. The model consists of a temporal convolutional network (TCN) [7] layer with 128 filters connected to a time-distributed dense layer of 6 hidden units with linear activation, which predicts the articulatory features at every time step. Dropout [9] regularization with a dropout rate of 0.2 was applied after the TCN layer. Mean squared error (MSE) was used as the loss function, and the model was trained for 1000 epochs with the Adam optimizer and a batch size of one. Figure 6 shows the loss convergence for the model.
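
A minimal sketch of this regression model, again assuming the keras-tcn package for the TCN layer, is shown below; the hyper-parameters follow the description above.

```python
# Sketch of the model in Figure 5: TCN(128 filters) -> dropout(0.2) ->
# time-distributed dense layer with 6 linear units (one per articulatory TV).
import tensorflow as tf
from tcn import TCN  # assumption: the keras-tcn package provides the TCN layer

eeg_input = tf.keras.Input(shape=(None, 30))           # KPCA-reduced EEG features
x = TCN(nb_filters=128, return_sequences=True)(eeg_input)
x = tf.keras.layers.Dropout(0.2)(x)
tv_output = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(6, activation='linear'))(x)  # 6 articulatory TVs
tcn_regressor = tf.keras.Model(eeg_input, tv_output)
tcn_regressor.compile(optimizer='adam', loss='mse')
# tcn_regressor.fit(eeg_train, tv_train, epochs=1000, batch_size=1)
```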

IV Design of Experiments for building the database

Seven male subjects took part in our EEG-speech experiment. All seven subjects were UT Austin graduate students in their mid-twenties, and English was not their mother tongue.

They were asked to speak the first 30 English sentences from the USC-TIMIT database [12] while their speech and EEG signals were recorded simultaneously. The data was recorded in the presence of background noise of 65 dB; music played from our lab computer was used as the noise source. Each subject was asked to repeat the experiment two more times. The subjects read out loud the English sentences that were shown to them on a computer screen. We used 80% of the total data as the training set and the remaining 20% as the test set for all the models.

We used Brain Products' ActiChamp EEG amplifier. Our EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 7. We used EEGLAB [13] to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.

Total sentences | Unique sentences | Total words | Unique words | Total letters | WER (%): GRU layers, random weights + LM | WER (%): GRU layers, pre-trained weights + LM
21 | 5 | 134 | 29 | 575 | 82.93 | 72.57
42 | 10 | 277 | 59 | 1121 | 77.66 | 75.5
63 | 15 | 408 | 84 | 1891 | 85.78 | 82.5
84 | 20 | 536 | 106 | 2334 | 86.3 | 80.64
105 | 25 | 652 | 132 | 2863 | 97.05 | 77.54
126 | 30 | 743 | 153 | 3614 | 103 | 87.7
TABLE I: WER on the test set for the CTC model with randomly initialized GRU layers vs. GRU layers initialized with weights from the pre-trained EEG to MFCC + articulatory GRU regression model

V EEG and Speech feature extraction details

We followed the same EEG and speech preprocessing methods used by the authors in [1, 2]. EEG signals were sampled at 1000 Hz, and a fourth-order IIR band pass filter with cut-off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut-off frequency of 60 Hz was used to remove power line noise. EEGLAB's [13] independent component analysis (ICA) toolbox was used to remove other biological signal artifacts such as electrocardiography (ECG), electromyography (EMG), and electrooculography (EOG) from the EEG signals. We extracted five statistical features per EEG channel, namely root mean square, zero crossing rate, moving window average, kurtosis, and power spectral entropy [1, 2]. In total we therefore extracted 31 (channels) × 5 = 155 features from the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
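
As a hedged illustration, the five per-channel statistics can be computed per analysis window as below. The window and hop lengths here are assumptions chosen only to reproduce the stated 100 Hz feature rate from 1000 Hz EEG; the authors' exact windowing is not specified in this text.

```python
# Sketch of the per-channel statistical EEG features named above.
import numpy as np
from scipy.stats import kurtosis
from scipy.signal import periodogram

def eeg_window_features(window, fs=1000.0):
    rms = np.sqrt(np.mean(window ** 2))
    zero_crossing_rate = np.mean(np.abs(np.diff(np.sign(window))) > 0)
    moving_window_average = np.mean(window)
    kurt = kurtosis(window)
    _, psd = periodogram(window, fs=fs)
    p = psd / (np.sum(psd) + 1e-12)
    power_spectral_entropy = -np.sum(p * np.log2(p + 1e-12))
    return np.array([rms, zero_crossing_rate, moving_window_average,
                     kurt, power_spectral_entropy])

def extract_channel_features(signal, fs=1000, hop=10, win=100):
    # 100-sample windows every 10 samples -> 100 Hz feature rate (assumed windowing).
    return np.stack([eeg_window_features(signal[i:i + win], fs)
                     for i in range(0, len(signal) - win + 1, hop)])
```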

The recorded speech signal was sampled at 16 kHz. We extracted Mel-frequency cepstrum coefficients (MFCC) of dimension 13 as features for the speech signal. The MFCC features were also sampled at 100 Hz, the same as the sampling frequency of the EEG features.
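
For example, 13-dimensional MFCCs at a 100 Hz frame rate can be obtained from 16 kHz audio with librosa using a hop of 160 samples; librosa and the file name are assumptions here, as the paper does not name its feature extraction toolkit.

```python
# Hedged sketch of the MFCC extraction described above.
import librosa

audio, sr = librosa.load('recording.wav', sr=16000)            # hypothetical file
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=160)
mfcc = mfcc.T    # shape: (num_frames, 13), roughly 100 frames per second
```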

We used the acoustic-to-articulatory speech inversion tool introduced by the authors in [14] to extract articulatory features of dimension 6 from the recorded speech signal. The articulatory features were extracted at the same 100 Hz sampling frequency as the MFCC and EEG features. The six articulatory tract variables (TVs) extracted were Lip Aperture (LA), Lip Protrusion (LP), Tongue Body Constriction Location (TBCL), Tongue Body Constriction Degree (TBCD), Tongue Tip Constriction Location (TTCL), and Tongue Tip Constriction Degree (TTCD) [14].

Fig. 8: Explained variance plot

VI EEG Feature Dimension Reduction Algorithm Details

After extracting EEG and acoustic features as explained in the previous section, we used Kernel Principal Component Analysis (KPCA) [15] to denoise the EEG feature space, as explained by the authors in [2, 1]. We reduced the 155 EEG features to a dimension of 30 by applying KPCA for both data sets. We plotted the cumulative explained variance versus the number of components to identify the right feature dimension, as shown in Figure 8. We used KPCA with a polynomial kernel of degree 3 [1, 2].
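
A sketch of this step with scikit-learn is shown below. The use of KernelPCA and the eigenvalue-based cumulative explained variance curve are assumptions about how the reduction and the plot in Figure 8 can be reproduced, not the authors' exact code.

```python
# Hedged sketch: reduce 155-dimensional EEG features to 30 dimensions with
# kernel PCA (polynomial kernel, degree 3) and approximate the cumulative
# explained variance from the kernel eigenvalues.
import numpy as np
from sklearn.decomposition import KernelPCA

eeg_features = np.random.randn(1000, 155)   # placeholder for the real (frames, 155) matrix

# Fit with all components to inspect the cumulative explained variance curve.
kpca_full = KernelPCA(n_components=None, kernel='poly', degree=3)
kpca_full.fit(eeg_features)
eigenvalues = kpca_full.eigenvalues_        # called `lambdas_` in older scikit-learn
cumulative_explained = np.cumsum(eigenvalues) / np.sum(eigenvalues)

# Keep 30 components based on the curve above.
kpca = KernelPCA(n_components=30, kernel='poly', degree=3)
eeg_reduced = kpca.fit_transform(eeg_features)   # shape: (1000, 30)
```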

VII Results

We used word error rate (WER) as the performance metric to evaluate the CTC ASR model on the test set for various numbers of sentences, as shown in Table I; the average WER is reported. The language model (LM) was included during inference. Without the language model we observed higher error rates than those reported in Table I, roughly 5 to 10% higher for most of the experiments. As seen from the results in Table I, it is evident that initializing the GRU layers in the encoder of the CTC network with pre-trained regression model weights significantly improves the test-time performance of the EEG based CTC speech recognizer, especially as the vocabulary size increases.
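
For reference, the WER values in Table I correspond to the word-level edit distance between the hypothesis and the reference transcription divided by the number of reference words; a small self-contained implementation is sketched below.

```python
# Word error rate: (substitutions + insertions + deletions) / reference words.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# Example: one substitution in three reference words -> WER of about 33.3%.
print(wer("the cat sat", "the cat sit"))
```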

We used two performance metrics to evaluate the TCN regression model for predicting articulatory features from EEG features: the root mean squared error (RMSE) and the normalized RMSE between the articulatory features predicted at test time and the ground truth articulatory features from the test set. The RMSE values were normalized by dividing them by the absolute difference between the maximum and minimum values in the test set observation vector. We observed an average RMSE of 0.632 and an average normalized RMSE of 0.115 on the test set.
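
These two metrics can be written out directly; the sketch below follows the normalization described above (RMSE divided by the range of the test-set observations).

```python
# RMSE and range-normalized RMSE between ground truth and predictions.
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def normalized_rmse(y_true, y_pred):
    # Normalize by the absolute difference between the max and min observation.
    return rmse(y_true, y_pred) / np.abs(np.max(y_true) - np.min(y_true))
```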

VIII Conclusion and Future work

In this paper we demonstrated various techniques to improve the performance of continuous speech recognition systems based on EEG features. We further demonstrated predicting articulatory features from EEG features with very low RMSE and normalized RMSE.

For future work we plan to conduct experiments with data collected from subjects with speaking disabilities and to build a larger speech-EEG database.

IX Acknowledgement

We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work. The first author would like to thank Prof. Alex Dimakis from the ECE department at UT Austin for suggesting that we carry out experiments using TCN layers.

References

  • [1] G. Krishna, C. Tran, J. Yu, and A. Tewfik, “Speech recognition with no speech or with noisy speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on.   IEEE, 2019.
  • [2] G. Krishna, C. Tran, M. Carnahan, and A. Tewfik, “Advancing speech recognition with no speech or with noisy speech,” in 2019 27th European Signal Processing Conference (EUSIPCO).   IEEE, 2019.
  • [3] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
  • [4] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
  • [5] G. Krishna, Y. Han, C. Tran, M. Carnahan, and A. H. Tewfik, “State-of-the-art speech recognition using eeg and towards decoding of speech spectrum from eeg,” arXiv preprint arXiv:1908.05743, 2019.
  • [6] K. Kirchhoff, G. A. Fink, and G. Sagerer, “Combining acoustic and articulatory feature information for robust speech recognition,” Speech Communication, vol. 37, no. 3-4, pp. 303–319, 2002.
  • [7] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
  • [8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
  • [9] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [10] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural computation, vol. 1, no. 2, pp. 270–280, 1989.
  • [11] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
  • [12] S. Narayanan, A. Toutios, V. Ramanarayanan, A. Lammert, J. Kim, S. Lee, K. Nayak, Y.-C. Kim, Y. Zhu, L. Goldstein et al., “Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (tc),” The Journal of the Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
  • [13] A. Delorme and S. Makeig, “EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
  • [14] G. Sivaraman, V. Mitra, H. Nam, M. K. Tiede, and C. Y. Espy-Wilson, “Vocal tract length normalization for speaker independent acoustic-to-articulatory speech inversion.” in INTERSPEECH, 2016, pp. 455–459.
  • [15] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, “Kernel pca and de-noising in feature spaces,” in Advances in neural information processing systems, 1999, pp. 536–542.