Continuous Speech Recognition using EEG and Video

12/16/2019, by Gautam Krishna, et al.

In this paper we investigate whether electroencephalography (EEG) features can be used to improve the performance of continuous visual speech recognition systems. We implemented a connectionist temporal classification (CTC) based end-to-end automatic speech recognition (ASR) model for performing recognition.


I Introduction

In recent years there has been a lot of interesting work done in the fields of lip reading and audio-visual speech recognition. In [1] the authors demonstrated end-to-end sentence-level lip reading, and in [2] the authors demonstrated deep-learning-based end-to-end audio-visual speech recognition. Similarly, many new results have been published in the field of speech recognition using bio-signals, mainly electrocorticography (ECoG) and electroencephalography (EEG). ECoG is an invasive way of measuring the electrical activity of the human brain, where a subject needs to undergo brain surgery to have ECoG electrodes implanted. EEG, on the other hand, is a non-invasive way of measuring the electrical activity of the human brain, where signals are recorded by placing EEG sensors on the scalp of the subject. In [3] the authors demonstrated isolated speech recognition using EEG and a combination of EEG and acoustic features on a limited English vocabulary of four words and five vowels. In [4, 5] the authors demonstrated continuous speech recognition using EEG features in clean and noisy environments. In [3] the authors also demonstrated that EEG features are helpful in improving the robustness of automatic speech recognition (ASR) systems operating in noisy environments. In this paper we investigate whether EEG features are more helpful than acoustic features in improving the performance of continuous visual speech recognition systems.

Given the limited amount of data available to us, we implemented our own custom model for performing recognition instead of using state-of-the-art computer vision feature extraction networks like ResNet [6] or VGG Net [7]. The main goal of the work described in this paper is to investigate the feasibility of using EEG features for performing continuous visual speech recognition.

Our results demonstrate that EEG features are in fact helpful in improving the performance of continuous visual speech recognition systems. We demonstrate our results for a limited English vocabulary consisting of 30 unique sentences.

II Visual Speech Recognition System Models

Figure 1 shows the architecture of the recognition model used for recognizing the combination of video and EEG data or video and acoustic data. The part of the network used for extracting features from EEG and acoustic data consists of three gated recurrent unit (GRU) [8] layers with 128, 64 and 32 hidden units respectively. Each GRU layer includes dropout regularization [9] with a dropout rate of 0.1. This part of the network can take EEG features, acoustic features, or a concatenation of EEG and acoustic features as input, depending on how the model is trained. The part of the network used for extracting features from video frames consists of two-dimensional convolutional layers and a two-dimensional max pooling layer: two convolutional layers followed by one max pooling layer. The convolutional layers have 100 filters each, with a ReLU [10] activation function and a kernel size of (1,3), and the max pooling layer has a pool size of (1,2). The output of the max pooling layer is flattened and reshaped so that it can be concatenated with the features extracted by the other part of the network described above. The concatenated features are fed into a temporal convolutional network (TCN) [11] layer consisting of 32 filters, whose output is fed into the decoder of the connectionist temporal classification (CTC) [12, 5, 4] network. The decoder of the CTC network consists of a dense layer, which performs an affine transformation, followed by a softmax activation. The output of the encoder is fed into the decoder of the CTC network at every time step. The number of time steps of the encoder equals the product of the sampling frequency of the input features and the sequence length. There was no fixed time-step value, since different subjects spoke at different rates of speech. We used a character-based CTC model in this work; the model predicts a character at every time step.
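The sketch below is a minimal Keras approximation of the fusion model in Figure 1, written against TensorFlow 2.x. It assumes gray scale 100 x 100 face frames, 30-dimensional EEG (or 13-dimensional MFCC) features per time step, an assumed output alphabet size N_CHARS, and a causal dilated Conv1D standing in for the TCN layer; these choices, and the exact tensor layout of the video branch, are assumptions rather than the authors' exact implementation.

import tensorflow as tf
from tensorflow.keras import layers, models

FEATURE_DIM = 30   # EEG features after KPCA; would be 13 for MFCC-only input
N_CHARS = 29       # assumed alphabet size; last class reserved for the CTC blank

# EEG / acoustic branch: three GRU layers, each with dropout rate 0.1
feat_in = layers.Input(shape=(None, FEATURE_DIM), name="eeg_or_mfcc")
x = layers.GRU(128, return_sequences=True, dropout=0.1)(feat_in)
x = layers.GRU(64, return_sequences=True, dropout=0.1)(x)
x = layers.GRU(32, return_sequences=True, dropout=0.1)(x)

# Video branch: per-time-step 2-D convolutions and max pooling, then flatten
vid_in = layers.Input(shape=(None, 100, 100, 1), name="video_frames")
v = layers.TimeDistributed(layers.Conv2D(100, (1, 3), activation="relu"))(vid_in)
v = layers.TimeDistributed(layers.Conv2D(100, (1, 3), activation="relu"))(v)
v = layers.TimeDistributed(layers.MaxPooling2D(pool_size=(1, 2)))(v)
v = layers.TimeDistributed(layers.Flatten())(v)

# Concatenate both branches and apply a temporal convolution
# (a causal dilated Conv1D is used here as a simple stand-in for the TCN layer)
h = layers.Concatenate()([v, x])
h = layers.Conv1D(32, kernel_size=3, padding="causal",
                  dilation_rate=2, activation="relu")(h)

# CTC decoder: dense (affine) layer + softmax over characters at every time step
out = layers.Dense(N_CHARS, activation="softmax")(h)
model = models.Model([vid_in, feat_in], out)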

The details of the CTC loss function are covered in [12, 5, 4]. At inference time we used a combination of a CTC beam search decoder and an external 4-gram language model, a technique known as shallow fusion [13]. Figure 2 shows the architecture of the model used for performing speech recognition using only video features. It is very similar to the model shown in Figure 1, except that it does not contain the additional network layers used to extract acoustic or EEG features.
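As an illustration only, the following sketch approximates shallow fusion by rescoring the n-best hypotheses returned by TensorFlow's CTC beam search decoder with an external language model score; true shallow fusion applies the language model inside the beam search itself. The callable lm_log_prob, the index_to_char mapping and the fusion weight alpha are hypothetical placeholders, not taken from the paper.

import tensorflow as tf

def decode_with_shallow_fusion(softmax_out, seq_len, lm_log_prob,
                               index_to_char, alpha=0.5,
                               beam_width=100, top_paths=10):
    # softmax_out: (1, time_steps, n_chars) decoder output for one utterance.
    # tf.nn.ctc_beam_search_decoder expects time-major (log-)probabilities
    # and treats the last class as the CTC blank.
    logits = tf.math.log(tf.transpose(softmax_out, [1, 0, 2]) + 1e-8)
    decoded, ctc_scores = tf.nn.ctc_beam_search_decoder(
        logits, seq_len, beam_width=beam_width, top_paths=top_paths)
    best_text, best_score = "", -float("inf")
    for k, sparse in enumerate(decoded):
        chars = tf.sparse.to_dense(sparse)[0].numpy()
        text = "".join(index_to_char[int(c)] for c in chars)
        # Combine the CTC score with the external 4-gram LM score
        score = float(ctc_scores[0][k]) + alpha * lm_log_prob(text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text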

Both models were trained for 120 epochs using the Adam [14] optimizer with a batch size of 100 and a validation split of 0.1. Figure 3 shows the CTC loss convergence of the model during training. All the scripts were written using the Keras and TensorFlow 2.0 deep learning frameworks.
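A minimal sketch of this training setup is given below, assuming the fusion model from the earlier sketch, zero-padded integer label sequences (index 0 used only for padding) and padded NumPy arrays video_frames, eeg_features and labels; it uses Keras' built-in CTC batch cost, which reserves the last output class for the blank symbol.

import tensorflow as tf

def ctc_loss(y_true, y_pred):
    # y_true: (batch, max_label_len) zero-padded character indices
    # y_pred: (batch, time_steps, n_chars) per-time-step softmax outputs
    batch = tf.shape(y_pred)[0]
    input_length = tf.fill(tf.stack([batch, 1]), tf.shape(y_pred)[1])
    label_length = tf.reduce_sum(tf.cast(y_true > 0, tf.int32),
                                 axis=-1, keepdims=True)
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred,
                                           input_length, label_length)

model.compile(optimizer=tf.keras.optimizers.Adam(), loss=ctc_loss)
model.fit([video_frames, eeg_features], labels,
          epochs=120, batch_size=100, validation_split=0.1)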

Fig. 1: Video-EEG Fusion Recognition Model
Fig. 2: Video Recognition Model
Fig. 3: CTC loss convergence for Video-EEG fusion model
Fig. 4: EEG channel locations for the cap used in our experiments

III Design of Experiments for Building the Database

Seven male UT Austin graduate students in their early to mid twenties took part in the speech-EEG-video experiment. Each subject was asked to speak the first 30 English sentences from the USC-TIMIT database [15], and their simultaneous speech, EEG and video were recorded. The sentences were shown to the subjects on a computer screen, and they read them out loud. The data were recorded in the presence of background noise at 65 dB; music played from our lab computer was used as the noise source. Each subject then repeated the same experiment two more times, giving us 90 speech-EEG-video recordings per subject.

We used the Brain Products ActiChamp EEG amplifier. Our EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 4. We used EEGLAB [16] to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.

Data from the first six subjects were used as the training set, and data from the last subject were used as the test set.

Total Number of Sentences: 90
Number of Unique Sentences Contained: 30
Total Number of Words Contained: 552
Number of Unique Words Contained: 153
Number of Letters Contained: 2598
Video WER (%): 96.93
Video + MFCC WER (%): 85.23
Video + EEG WER (%): 84.60
Video + EEG + MFCC WER (%): 83.74
MFCC WER (%): 86.45
TABLE I: WER on Test Set

IV Feature Extraction and Preprocessing Details

We followed the same EEG and speech preprocessing methods used by the authors in [3, 4]. EEG signals were sampled at 1000 Hz, and a fourth-order IIR band-pass filter with cut-off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut-off frequency of 60 Hz was used to remove power-line noise. EEGLAB's [16] independent component analysis (ICA) toolbox was used to remove other biological signal artifacts, such as electrocardiography (ECG), electromyography (EMG) and electrooculography (EOG), from the EEG signals. We extracted five statistical features per EEG channel, namely root mean square, zero crossing rate, moving window average, kurtosis and power spectral entropy [3, 4]. In total we therefore extracted 31 (channels) x 5 = 155 features from the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
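The sketch below illustrates this filtering and feature extraction pipeline with SciPy and NumPy, assuming raw EEG of shape (n_channels, n_samples) at 1000 Hz and a 100 ms window advanced every 10 ms so that features come out at 100 Hz; the exact windowing is not specified in the paper, and the EEGLAB ICA artifact removal step is performed separately and omitted here.

import numpy as np
from scipy.signal import butter, iirnotch, sosfiltfilt, filtfilt
from scipy.stats import kurtosis

FS = 1000  # EEG sampling rate in Hz
sos = butter(4, [0.1, 70], btype="bandpass", fs=FS, output="sos")
b_notch, a_notch = iirnotch(60, Q=30, fs=FS)

def eeg_features(raw, win=100, hop=10):
    """Return (n_frames, n_channels * 5) statistical features at 100 Hz."""
    x = sosfiltfilt(sos, raw, axis=-1)           # 4th-order band-pass 0.1-70 Hz
    x = filtfilt(b_notch, a_notch, x, axis=-1)   # 60 Hz power-line notch
    feats = []
    for start in range(0, x.shape[-1] - win + 1, hop):   # hop = 10 ms -> 100 Hz
        w = x[:, start:start + win]
        rms = np.sqrt(np.mean(w ** 2, axis=-1))                       # root mean square
        zcr = np.mean(np.abs(np.diff(np.sign(w), axis=-1)) > 0, axis=-1)  # zero crossing rate
        mwa = np.mean(w, axis=-1)                                     # moving window average
        kur = kurtosis(w, axis=-1)                                    # kurtosis
        psd = np.abs(np.fft.rfft(w, axis=-1)) ** 2
        p = psd / (psd.sum(axis=-1, keepdims=True) + 1e-12)
        pse = -np.sum(p * np.log(p + 1e-12), axis=-1)                 # power spectral entropy
        feats.append(np.concatenate([rms, zcr, mwa, kur, pse]))
    return np.array(feats)   # 31 channels x 5 features = 155 per frame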

The recorded speech signal was sampled at 16 kHz. We extracted Mel-frequency cepstral coefficients (MFCC) of dimension 13 as features for the speech signal. The MFCC features were also sampled at 100 Hz, the same as the sampling frequency of the EEG features.
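A short sketch of the acoustic feature extraction, using librosa as an assumed implementation (the paper does not name the MFCC toolkit); a 10 ms hop at 16 kHz yields 13-dimensional MFCC frames at 100 Hz, matching the EEG feature rate.

import librosa

def mfcc_features(wav_path):
    audio, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                hop_length=160)   # 160 samples = 10 ms at 16 kHz
    return mfcc.T                                 # (n_frames, 13), 100 frames/s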

We extracted 100 frames per second from the recorded video and used the YOLO [17] object recognition model to detect and extract face frames from them. Figure 6 shows a raw extracted RGB video frame and Figure 7 shows the corresponding face frame extracted using YOLO. The maximum x dimension in our extracted face data set was 426 pixels, the maximum y dimension was 381 pixels, and the z dimension was 3 (RGB). Our initial plan was to perform experiments using the RGB frames, but we were constrained by the memory requirements of our computing hardware, so we converted the RGB face frames to gray scale and resized all of them to a dimension of 100 x 100. Figure 8 shows the corresponding resized gray scale face frame. We used the Python Imaging Library (PIL) for resizing the images. We also tried extracting mouth frames from the extracted face frames, but we observed missing-data problems, so we proceeded with feeding the face frames instead of the mouth frames to the model. We strongly encourage researchers with sufficient computing resources to use RGB frames and three-dimensional convolutional and max pooling layers instead of the two-dimensional layers in the models described in Figures 1 and 2.
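A small sketch of this frame preprocessing step: an already-cropped RGB face frame (as produced by the YOLO detector) is converted to gray scale and resized to 100 x 100 with PIL; the file path and the normalization to [0, 1] are illustrative assumptions.

import numpy as np
from PIL import Image

def preprocess_face_frame(face_frame_path):
    img = Image.open(face_frame_path).convert("L")    # RGB -> gray scale
    img = img.resize((100, 100))                      # resize to 100 x 100
    return np.asarray(img, dtype=np.float32) / 255.0  # normalize to [0, 1] (assumed)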

Fig. 5: Explained variance plot

Fig. 6: Extracted raw video frame
Fig. 7: Extracted RGB face frame from the raw video frame
Fig. 8: RGB frame resized and converted to gray scale

V EEG Feature Dimension Reduction Algorithm Details

After extracting the EEG and acoustic features as explained in the previous section, we used kernel principal component analysis (KPCA) [18] to denoise the EEG feature space, as explained by the authors in [4, 3]. We reduced the 155 EEG features to a dimension of 30 by applying KPCA to both data sets. To identify the right feature dimension, we plotted cumulative explained variance versus the number of components, as shown in Figure 5. We used KPCA with a polynomial kernel of degree 3 [3, 4].
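The sketch below shows this dimensionality reduction step with scikit-learn's KernelPCA as an assumed implementation; eeg_features_155 is a placeholder array of shape (n_frames, 155), and the cumulative explained variance curve of Figure 5 is approximated from the kernel eigenvalues (the eigenvalues_ attribute requires scikit-learn 1.0 or newer).

import numpy as np
from sklearn.decomposition import KernelPCA

# Inspect the cumulative explained variance over all kernel components (cf. Figure 5)
probe = KernelPCA(n_components=None, kernel="poly", degree=3).fit(eeg_features_155)
cumulative_variance = np.cumsum(probe.eigenvalues_) / np.sum(probe.eigenvalues_)

# Reduce the 155-dimensional EEG features to 30 dimensions
kpca = KernelPCA(n_components=30, kernel="poly", degree=3)
eeg_reduced = kpca.fit_transform(eeg_features_155)   # (n_frames, 155) -> (n_frames, 30)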

VI Results

We used word error rate (WER) as the performance metric of the model at test time. Table I shows the average WER on the test set for the various feature set inputs. The results demonstrate that EEG features are more helpful than acoustic features in improving the performance of continuous visual speech recognition systems operating in noisy environments. Using all modalities (acoustic, EEG, video) gave the highest test-time performance, i.e. the lowest word error rate on the test set.
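For reference, the sketch below computes WER as the word-level edit distance between a reference sentence and a decoded hypothesis, divided by the number of reference words; this is the standard definition, not code from the paper.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words (substitutions + deletions + insertions)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)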

VII Conclusion and Future Work

In this paper we demonstrated the feasibility of using EEG features to improve visual speech recognition systems operating in noisy environments. For future work we would like to build a much larger data set and validate our results on a larger corpus. We encourage other researchers to join efforts in building a state-of-the-art speech-video-EEG database to help advance research in this area.

VIII Acknowledgements

We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work.

The first author would like to thank Prof. Yann Soullard from Université de Rennes 2 for his crucial help with debugging the CTC model code.

References