A voice activity detection (VAD) system detects the presence or absence of human speech. VAD systems are typically used to trigger an automatic speech recognition (ASR) system and help improve the performance of the ASR system. Current state-of-the-art VAD systems [1, 2] demonstrate good performance in the absence of background noise, but their performance degrades significantly in the presence of high background noise. For example, experiments carried out in our lab demonstrate that a VAD system fails to detect speech even in the presence of background noise of 40 dB. Designing robust VAD systems is therefore crucial for improving the performance of ASR systems operating in the presence of high background noise.
Electroencephalography (EEG) is a non-invasive way of measuring the electrical activity of the human brain. EEG sensors are placed on the scalp of a subject to obtain EEG readings. In [3] the authors demonstrated isolated speech recognition using EEG features for a limited English vocabulary of four words and five vowels. They also demonstrated that EEG features are less affected by external background noise. In [4, 5] the authors demonstrated continuous speech recognition using EEG features and also introduced different types of EEG feature sets. Motivated by these results, we implemented VAD using EEG features to see whether the performance of VAD systems operating in the presence of high background noise can be improved.
In [6] the authors proposed methods to perform VAD using EEG signals, but they did not provide any results to support their ideas, whereas in this paper we provide results obtained using EEG signals recorded in real experiments. We demonstrate results using two different data sets. VAD using EEG features might also help improve the performance of the EEG-based ASR systems introduced by the authors in [4, 5].
Current VAD systems operate with acoustic input only, thereby limiting technology accessibility for people with speaking disabilities or people who cannot speak at all. VAD using EEG features will allow people with speaking disabilities to use VAD systems, thereby improving technology accessibility. To the best of our knowledge, this is the first time a VAD system has been demonstrated using only real experimental EEG features.
II. Voice Activity Detection Model
Dropout [8] regularization with a dropout rate of 0.2 is applied to the gated recurrent unit (GRU) [7] layers. A time-distributed dense layer with 4 hidden units is applied after the final GRU layer. When used with the first data set, the time-distributed dense layer used sigmoid activation, and when used with the second data set, it used ReLU [9] activation. The time-distributed dense layer is followed by a dense layer of two hidden units which performs an affine transformation. The dense layer outputs, or logits, are passed through a softmax activation to predict the class probabilities. The number of time steps of the GRU is equal to the product of the sampling frequency of the input features and the sequence length. In our case there was no fixed value for the number of time steps.
The GRU-based classifier model predicts at every time step whether the given input contains silence or speech. We used one-hot vectors to label the training data. A target value of 1 was assigned to input feature frames containing speech, and a target value of 0 was assigned to input feature frames containing silence. The models were trained using the Adam [10] optimizer for 200 epochs to observe loss convergence. The loss function used was categorical cross-entropy and the batch size was set to one. For each data set we used 80 % of the data to train the model, 10 % as a validation set, and the remaining 10 % as a test set. The validation set was used to identify the right set of hyperparameters for the model.
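The architecture and training setup above can be sketched in Keras, the framework named later in this section. The GRU width of 128 units is a hypothetical choice, since the layer sizes are not stated here; the 4-unit time-distributed dense layer, dropout rate of 0.2, 2-unit output layer with softmax, Adam optimizer, and categorical cross-entropy loss follow the text.

```python
# Sketch of the GRU-based VAD classifier described above, using Keras.
# The 128-unit GRU width is a hypothetical choice (not stated in the text).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_vad_model(feature_dim, first_dataset=True):
    # Variable-length input: (time_steps, feature_dim); time_steps is not fixed
    inputs = layers.Input(shape=(None, feature_dim))
    # GRU layer with dropout rate 0.2, returning the full sequence
    x = layers.GRU(128, return_sequences=True, dropout=0.2)(inputs)
    # Time-distributed dense layer with 4 hidden units:
    # sigmoid activation for data set 1, ReLU for data set 2 (per the text)
    act = "sigmoid" if first_dataset else "relu"
    x = layers.TimeDistributed(layers.Dense(4, activation=act))(x)
    # 2-unit dense layer (affine transform) + softmax over {silence, speech}
    outputs = layers.TimeDistributed(layers.Dense(2, activation="softmax"))(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```

Training with batch size one over variable-length sequences avoids padding, at the cost of slower convergence.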
We also tried replacing the GRU layers with temporal convolutional network (TCN) [11] layers, but the performance was poor even though TCN training was much faster than training the model with GRU layers. All scripts were written using the Keras deep learning framework in Python.
III. Design of Experiments for Building the Database
We used two data sets for this work. The first was data set B from our earlier work, where 8 subjects were asked to speak the first 30 English sentences from the USC-TIMIT database [12] while their simultaneous speech and EEG signals were recorded. This data was recorded in the presence of background noise of 65 dB. Each subject was asked to repeat the experiment two more times. Here the subjects read out loud the English sentences shown to them on a computer screen.
The second was database B from our earlier work, where 15 subjects were asked to listen to and then speak out the first 9 English sentences from the USC-TIMIT database while their simultaneous speech and EEG signals were recorded. This data was recorded in the presence of background noise of 50 dB. Each subject was asked to repeat the experiment two more times.
We used Brain Vision EEG recording hardware. Our EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 3. We used EEGLab [13] to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.
IV. EEG and Speech Feature Extraction Details
We followed the same EEG and speech preprocessing methods used by the authors in [3, 5] for both data sets. EEG signals were sampled at 1000 Hz, and a fourth-order IIR band-pass filter with cut-off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut-off frequency of 60 Hz was used to remove power line noise. EEGLab's Independent Component Analysis (ICA) toolbox was used to remove other biological signal artifacts, such as electrocardiography (ECG), electromyography (EMG), and electrooculography (EOG), from the EEG signals. We extracted five statistical features for EEG, namely root mean square, zero crossing rate, moving window average, kurtosis, and power spectral entropy [3, 5]. So in total we extracted 31 (channels) × 5 = 155 features for the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
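The five per-channel statistical features named above can be sketched as follows. The framing parameters (100-sample windows with a 10-sample hop) are illustrative assumptions chosen to yield the 100 Hz feature rate the text specifies for 1000 Hz EEG; the original window and stride are not stated.

```python
# Sketch of the five per-channel EEG statistical features: root mean
# square, zero crossing rate, moving window average, kurtosis, and
# power spectral entropy. Window/hop sizes are illustrative assumptions.
import numpy as np
from scipy.stats import kurtosis
from scipy.signal import welch

def eeg_frame_features(frame, fs=1000):
    rms = np.sqrt(np.mean(frame ** 2))                 # root mean square
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0) # zero crossing rate
    mwa = np.mean(frame)                               # moving window average
    kur = kurtosis(frame)                              # excess kurtosis
    # power spectral entropy: entropy of the normalized power spectrum
    f, psd = welch(frame, fs=fs, nperseg=min(len(frame), 256))
    p = psd / np.sum(psd)
    pse = -np.sum(p * np.log2(p + 1e-12))
    return np.array([rms, zcr, mwa, kur, pse])

def eeg_features(channel, fs=1000, frame_len=100, hop=10):
    # A 10-sample hop at 1000 Hz gives the 100 Hz feature rate in the text
    frames = [channel[i:i + frame_len]
              for i in range(0, len(channel) - frame_len + 1, hop)]
    return np.stack([eeg_frame_features(f, fs) for f in frames])
```

Applying this per channel and concatenating across the 31 channels yields the 155-dimensional feature vector described above.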
The recorded speech signal was sampled at 16 kHz. We extracted Mel-frequency cepstral coefficients (MFCC) of dimension 13 as features for the speech signal. The MFCC features were also sampled at 100 Hz, the same as the sampling frequency of the EEG features.
V. EEG Feature Dimension Reduction Algorithm Details
After extracting EEG and acoustic features as explained in the previous section, we used Kernel Principal Component Analysis (KPCA) [14] to denoise the EEG feature space, as explained by the authors in [3, 5]. We reduced the 155 EEG features to a dimension of 30 by applying KPCA for both data sets. We plotted cumulative explained variance versus number of components to identify the right feature dimension, as shown in Figure 4. We used KPCA with a polynomial kernel of degree 3 [3, 5].
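The dimension reduction step above can be sketched with scikit-learn's `KernelPCA` (the original implementation is not named in the text): a polynomial kernel of degree 3, reducing the 155 EEG features to 30 components.

```python
# Sketch of the KPCA-based EEG feature dimension reduction described
# above: polynomial kernel of degree 3, 155 input features reduced to 30.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
# Placeholder matrix standing in for real EEG features: (frames, 155)
eeg_feats = rng.standard_normal((500, 155))

kpca = KernelPCA(n_components=30, kernel="poly", degree=3)
reduced = kpca.fit_transform(eeg_feats)  # shape: (500, 30)
```

In practice the KPCA projection would be fit on the training split only and then applied to the validation and test splits.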
VI. Results
We used classification accuracy as the performance metric to evaluate the VAD model on the test set. Classification accuracy is defined as the ratio of the number of correct predictions given by the model to the total number of predictions given by the model on the test set.
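The metric defined above amounts to frame-level accuracy, which can be computed directly:

```python
# Frame-level classification accuracy as defined above: correct
# predictions divided by total predictions on the test set.
import numpy as np

def classification_accuracy(predictions, targets):
    predictions = np.asarray(predictions)
    targets = np.asarray(targets)
    return np.mean(predictions == targets)

# Toy example with hypothetical per-frame labels (1 = speech, 0 = silence):
# 8 of the 10 predictions match the targets, so the accuracy is 0.8
acc = classification_accuracy([1, 1, 0, 0, 1, 0, 1, 1, 0, 1],
                              [1, 1, 0, 0, 1, 0, 1, 0, 1, 1])
```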
Table 1 shows the test-time results for the VAD model when trained and tested using only acoustic (MFCC) features, only EEG features, and the concatenation of MFCC and EEG features, for both data sets. When trained and tested on the first data set, we observed that concatenating EEG and MFCC features as input gave the highest test-time accuracy, whereas for the second data set we observed that using only EEG features as input gave the highest test-time accuracy. For both data sets, training and testing using only MFCC features as input resulted in the lowest test-time accuracy.
Even though data set 1 was recorded in the presence of higher background noise than data set 2, we observed that the test accuracy using MFCC was higher for data set 1, although the MFCC test accuracy for data set 2 was comparable to that for data set 1. Similar observations were noted for the EEG and EEG + MFCC test-time accuracy values. One possible reason for these observations might be the nature of the data sets. For data set 1 the subjects read out loud the sentences shown on a computer screen, whereas in data set 2 the subjects first listened to the utterances and then spoke them out loud. The EEG recorded in the two cases might have slightly different properties, and these might also depend on the subjects, as each human brain is unique. This needs further exploration and understanding, which we will consider in our future work. However, for both data sets we observed that the test-time accuracy using EEG or EEG + MFCC is higher than using only MFCC as input.
VII. Conclusion and Future Work
In this paper we demonstrated voice activity detection (VAD) using only EEG features, and we demonstrated that concatenating acoustic features with EEG features as input improves the performance of VAD systems operating in the presence of background noise.
To the best of our knowledge, this is the first time a VAD system has been demonstrated using only real experimental EEG features. For future work, we plan to build a much larger speech EEG database and also perform experiments with data collected from subjects with speaking disabilities. We will also investigate whether it is possible to improve the test-time accuracy by training the model with a larger number of examples.
We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work.
-  M. H. Moattar and M. M. Homayounpour, “A simple but efficient real-time voice activity detection algorithm,” in 2009 17th European Signal Processing Conference. IEEE, 2009, pp. 2549–2553.
-  T. Drugman, Y. Stylianou, Y. Kida, and M. Akamine, “Voice activity detection: Merging source and filter-based information,” IEEE Signal Processing Letters, vol. 23, no. 2, pp. 252–256, 2015.
-  G. Krishna, C. Tran, J. Yu, and A. Tewfik, “Speech recognition with no speech or with noisy speech,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
-  G. Krishna, Y. Han, C. Tran, M. Carnahan, and A. H. Tewfik, “State-of-the-art speech recognition using EEG and towards decoding of speech spectrum from EEG,” arXiv preprint arXiv:1908.05743, 2019.
-  G. Krishna, C. Tran, M. Carnahan, and A. Tewfik, “Advancing speech recognition with no speech or with noisy speech,” in 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 2019.
-  M. Koctúrová and J. Juhár, “EEG based voice activity detection,” in 2018 16th International Conference on Emerging eLearning Technologies and Applications (ICETA). IEEE, 2018, pp. 267–272.
-  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
-  S. Narayanan, A. Toutios, V. Ramanarayanan, A. Lammert, J. Kim, S. Lee, K. Nayak, Y.-C. Kim, Y. Zhu, L. Goldstein et al., “Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (TC),” The Journal of the Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
-  A. Delorme and S. Makeig, “EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
-  S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, “Kernel pca and de-noising in feature spaces,” in Advances in neural information processing systems, 1999, pp. 536–542.