Speech Recognition With No Speech Or With Noisy Speech Beyond English

06/17/2019 ∙ by Gautam Krishna, et al.

In this paper we demonstrate continuous noisy speech recognition on a limited Chinese vocabulary using a connectionist temporal classification (CTC) model with electroencephalography (EEG) features and no speech signal as input, and we further demonstrate continuous noisy speech recognition on a limited joint English and Chinese vocabulary using a single CTC model with EEG features and no speech signal as input.


1 Introduction

Electroencephalography (EEG) is a non-invasive way of measuring the electrical activity of the human brain. In [1] we demonstrated deep learning based automatic speech recognition (ASR) using EEG signals for a limited English vocabulary of four words and five vowels. In [2] we demonstrated continuous noisy speech recognition using EEG for a larger English vocabulary using a connectionist temporal classification (CTC) model and an attention model [3]. We use only the CTC model in this work. In this paper we extend our work to a much larger Chinese vocabulary and to a joint Chinese-English, or multilingual, vocabulary.

Inspired by the unique robustness to environmental artifacts exhibited by the human auditory cortex [4, 5], we used very noisy speech data for this work and demonstrated a lower character error rate (CER) for smaller corpus sizes using EEG features.

In [6] the authors decode imagined speech from EEG using synthetic EEG data and a CTC network, whereas in our work we use real EEG data and a multilingual vocabulary. In [7] the authors perform envisioned speech recognition using a random forest classifier, whereas we use an end-to-end state-of-the-art model and perform recognition of noisy speech. In [8] the authors demonstrate speech recognition using electrocorticography (ECoG) signals, which are invasive in nature, whereas in our work we use non-invasive EEG signals.

References [9, 10, 11] describe some of the prior work in the field of multilingual speech recognition, but none of it used EEG signals for recognition. In [11] the authors use a single end-to-end attention model for recognition, whereas in our work we use a single CTC model for multilingual speech recognition.

References [12, 13] describe some of the prior work on joint Chinese and English speech recognition, but EEG features were not used for recognition.

One of the unique abilities of the human brain is multilingualism [14, 15]: our brain is capable of understanding multiple languages. This was another motivating factor for this work. All the subjects who took part in the experiments were multilingual.

We believe speech recognition using EEG will help people with speaking difficulties use voice-activated technologies with a better user experience. As demonstrated in [1], EEG helps ASR systems overcome performance loss in the presence of background noise. This will help ASR systems perform with high accuracy in very noisy environments, such as airports and shopping malls, where there is a high level of background noise. Developing a robust multilingual speech recognition system using EEG will help improve technology accessibility for multilingual people with speaking disabilities. In this work we use Chinese and English, two languages whose scripts have zero overlap, and very noisy data was used. Hence we investigate one of the most challenging cases of multilingual speech recognition in this paper.

The major contribution of this paper is the extension of the results presented in [1] to a larger Chinese corpus, as well as the demonstration of multilingual speech recognition using EEG.

Figure 1: ASR Model

2 Connectionist Temporal Classification (CTC)

The main ideas behind CTC-based ASR were first introduced in [16, 17]. In our work we used a single-layer gated recurrent unit (GRU) [18] with 128 hidden units as the encoder of the CTC network. The decoder consists of a dense (fully connected) layer followed by a softmax activation. The output at every time step of the GRU layer is fed into the decoder network.

The number of time steps of the GRU encoder is equal to the product of the sampling frequency of the input features and the length of the input sequence. Since different speakers have different rates of speech, we used a dynamic recurrent neural network (RNN) cell; there is no fixed number of encoder time steps.

Usually the number of encoder time steps (T) is greater than the number of output tokens in a continuous speech recognition problem. An RNN-based CTC network makes the length of the output token sequence equal to T by allowing output prediction tokens to repeat and by introducing a special blank token [16] across the frames. We trained with the CTC loss function and the Adam optimizer [19], and at inference time we used a CTC beam search decoder.
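
The paper does not name a framework, so the following is a minimal sketch, assuming TensorFlow/Keras, of the model just described: a single-layer GRU encoder with 128 hidden units, a dense decoder whose softmax is folded into the CTC loss, the Adam optimizer, and CTC beam-search decoding at inference. NUM_FEATURES, NUM_CHARS, and the beam width are illustrative placeholders, not the authors' exact settings.

```python
import tensorflow as tf

NUM_FEATURES = 90   # EEG feature dimension after KPCA and deltas (Section 5); illustrative
NUM_CHARS = 100     # size of the character set including the CTC blank; illustrative

# Dynamic-length input: (batch, time, features); the number of time steps varies per utterance.
inputs = tf.keras.Input(shape=(None, NUM_FEATURES))
encoded = tf.keras.layers.GRU(128, return_sequences=True)(inputs)  # single-layer GRU encoder
logits = tf.keras.layers.Dense(NUM_CHARS)(encoded)                 # dense decoder; softmax is applied inside the CTC loss
model = tf.keras.Model(inputs, logits)
optimizer = tf.keras.optimizers.Adam()

def ctc_train_step(x, labels, input_lengths, label_lengths):
    """One training step minimizing the CTC loss over a batch."""
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        loss = tf.reduce_mean(tf.nn.ctc_loss(
            labels=labels, logits=y_pred,
            label_length=label_lengths, logit_length=input_lengths,
            logits_time_major=False, blank_index=-1))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

def ctc_decode(x, input_lengths, beam_width=100):
    """Inference: CTC beam-search decoding of the most likely character sequence."""
    y_pred = tf.transpose(model(x, training=False), [1, 0, 2])  # beam search expects time-major logits
    decoded, _ = tf.nn.ctc_beam_search_decoder(y_pred, input_lengths, beam_width=beam_width)
    return tf.sparse.to_dense(decoded[0])
```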

We now explain the loss function used in our CTC model. Consider a training data set $X = \{x_1, x_2, \ldots, x_N\}$ with $N$ training examples and the corresponding label set $Y = \{y_1, y_2, \ldots, y_N\}$ with target vectors $y_i$. Consider any training example and label pair $(x_i, y_i)$. Let the number of time steps of the RNN encoder for $(x_i, y_i)$ be $T_i$. In a character-based CTC model, the RNN predicts a character at every time step, whereas in a word-based CTC model the RNN predicts a word at every time step. For the sake of simplicity, let us first assume that the length of the target vector $y_i$ is equal to $T_i$. Let the probability vector output by the RNN at time step $t$ be $p_t$, and let the probability it assigns to the $t$-th entry of $y_i$ be denoted by $p_t(y_{i,t})$. The probability that the model outputs $y_i$ on input $x_i$ is given by $P(y_i|x_i) = \prod_{t=1}^{T_i} p_t(y_{i,t})$. During the training phase, we would like to maximize this conditional probability, and thereby define the loss function as $loss(x_i, y_i) = -\log P(y_i|x_i)$.

In the case when the length of $y_i$ is less than $T_i$, we extend the target vector by repeating a few of its values and by introducing the blank token $\emptyset$ to create a target vector of length $T_i$. Let the set of possible extensions of $y_i$ be denoted by $Ext(y_i)$. For example, when $T_i = 3$ and $y_i = (a, b)$, the possible extensions include $(a, a, b)$, $(a, b, b)$ and $(a, \emptyset, b)$. We then define $P(y_i|x_i) = \sum_{\hat{y} \in Ext(y_i)} \prod_{t=1}^{T_i} p_t(\hat{y}_t)$.

In our work we used character based CTC ASR model. Figure 1 explains the architecture of our ASR model. CTC assumes the conditional independence constraint that output predictions are independent given the entire input sequence.

Figure 2: EEG channel locations for the cap used in our experiments

3 Design of Experiments for building the database

We built simultaneous speech-EEG recording databases for English and Chinese for this work. Five female and seven male subjects took part in the experiment. All subjects were UT Austin undergraduate or graduate students in their early twenties. All subjects were native Mandarin Chinese speakers, and English was their foreign language.

The 12 subjects were asked to speak 10 English sentences while their speech and EEG signals were recorded simultaneously. The first 9 English sentences were the first 9 sentences from the USC-TIMIT database [20], while the 10th sentence was "Can I get some water".

This data was recorded in the presence of background noise at 65 dB. Background music played from our lab desktop computer was used as the source of noise. We then asked each subject to repeat the same experiment two more times, so we had 36 speech-EEG recording examples for each sentence.

We then asked the 12 subjects to repeat the same experiment, but this time they were asked to speak the Chinese translations of the 10 English sentences.

We used Brain Vision EEG recording hardware. Our EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 2. We used EEGLab [21] to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.

For this work, we used data from the first 10 subjects for training the model, and the data from the remaining two subjects for the validation and test sets, respectively.

4 EEG and Speech feature extraction details

EEG signals were sampled at 1000 Hz, and a fourth-order IIR band-pass filter with cutoff frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cutoff frequency of 60 Hz was used to remove power line noise. EEGLab's [21] independent component analysis (ICA) toolbox was used to remove other biological signal artifacts such as electrocardiography (ECG), electromyography (EMG) and electrooculography (EOG) from the EEG signals. We extracted five statistical features for EEG, namely root mean square, zero crossing rate, moving window average, kurtosis and power spectral entropy [1]. So in total we extracted 31 (channels) × 5 = 155 features for the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
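
As a concrete illustration, here is a minimal Python sketch of this preprocessing and feature extraction pipeline using SciPy/NumPy; the window length, filter design details, and function names are our assumptions, and the ICA-based artifact removal step (performed with EEGLab in the paper) is treated as already done.

```python
import numpy as np
from scipy import signal
from scipy.stats import kurtosis

FS = 1000     # EEG sampling rate in Hz
FRAME = 100   # assumed window of 0.1 s, giving a 100 Hz feature rate

def preprocess_channel(x, fs=FS):
    """Fourth-order IIR band-pass (0.1-70 Hz) followed by a 60 Hz notch filter."""
    b, a = signal.butter(4, [0.1, 70.0], btype="bandpass", fs=fs)
    x = signal.filtfilt(b, a, x)
    b, a = signal.iirnotch(60.0, Q=30.0, fs=fs)
    return signal.filtfilt(b, a, x)

def power_spectral_entropy(frame):
    """Shannon entropy of the normalized power spectrum of one frame."""
    psd = np.abs(np.fft.rfft(frame)) ** 2
    psd = psd / (psd.sum() + 1e-12)
    return float(-(psd * np.log2(psd + 1e-12)).sum())

def frame_features(frame):
    """The five statistical features for one channel frame."""
    zcr = np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:]))
    return [
        np.sqrt(np.mean(frame ** 2)),    # root mean square
        zcr,                             # zero crossing rate
        np.mean(frame),                  # moving window average
        kurtosis(frame),                 # kurtosis
        power_spectral_entropy(frame),   # power spectral entropy
    ]

def eeg_features(eeg, fs=FS, frame_len=FRAME):
    """eeg: (channels, samples) array -> (frames, channels * 5) feature matrix."""
    clean = np.stack([preprocess_channel(ch, fs) for ch in eeg])
    feats = []
    for start in range(0, clean.shape[1] - frame_len + 1, frame_len):
        row = []
        for ch in clean:
            row.extend(frame_features(ch[start:start + frame_len]))
        feats.append(row)
    return np.asarray(feats)   # e.g. 31 channels x 5 features = 155 per frame
```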

The recorded speech signal was sampled at 16 kHz. We extracted Mel-frequency cepstral coefficients (MFCC) as features for the speech signal.

We first extracted 13 MFCC features and then computed their first and second order differentials (delta and delta-delta), giving 39 MFCC features in total. The MFCC features were also sampled at 100 Hz, the same as the sampling frequency of the EEG features, so that the acoustic and EEG feature sequences are aligned frame by frame and no sequence-to-sequence length mismatch arises.
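
A minimal sketch of this acoustic feature extraction, assuming librosa (the paper does not name its MFCC toolkit); the hop length of 160 samples is chosen so that the feature rate is 100 Hz at a 16 kHz sampling rate.

```python
import numpy as np
import librosa

def mfcc_features(wav_path):
    """Return a (frames, 39) matrix: 13 MFCCs plus delta and delta-delta at 100 Hz."""
    audio, sr = librosa.load(wav_path, sr=16000)
    # hop_length = sr / 100 gives one frame every 10 ms, matching the EEG feature rate.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=160)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T
```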

Figure 3: Explained variance plot

5 EEG Feature Dimension Reduction Algorithm Details

After extracting EEG and acoustic features as explained in the previous section, we used non-linear methods for feature dimension reduction in order to obtain a set of EEG features that better represents the acoustic features. We reduced the 155 EEG features to a dimension of 30 by applying kernel principal component analysis (KPCA) [22]. We plotted the cumulative explained variance versus the number of components to identify the right feature dimension, as shown in Figure 3. We used KPCA with a polynomial kernel of degree 3 [1]. We used the Python scikit-learn library for performing KPCA. The cumulative explained variance plot is not supported by the library for KPCA, since KPCA projects the features into a different feature space; hence, to obtain the explained variance plot we used ordinary PCA, but after identifying the right dimension we used KPCA to perform the dimension reduction. We further computed the delta and delta-delta of these 30 EEG features, so the final EEG feature dimension was 90 (30 × 3).
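
A minimal scikit-learn sketch of this step, under the assumptions stated here: plain PCA is used only for the explained-variance plot, KernelPCA with a degree-3 polynomial kernel performs the actual reduction, and np.gradient is an illustrative way to form the delta and delta-delta features (the paper does not specify how the differentials were computed).

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

def reduce_eeg_features(eeg_feats, n_components=30):
    """eeg_feats: (frames, 155) -> (frames, 90) after KPCA and deltas."""
    # Explained-variance check with plain PCA (KernelPCA does not expose it).
    pca = PCA().fit(eeg_feats)
    cumulative_variance = np.cumsum(pca.explained_variance_ratio_)  # plot vs. number of components

    # Actual reduction: KPCA with a degree-3 polynomial kernel.
    kpca = KernelPCA(n_components=n_components, kernel="poly", degree=3)
    reduced = kpca.fit_transform(eeg_feats)              # (frames, 30)

    # First- and second-order differences (delta, delta-delta) along time.
    delta = np.gradient(reduced, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([reduced, delta, delta2])           # (frames, 90)
```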

6 Results

We used character error rate (CER) as the performance metric to evaluate the model. The CTC model was trained for 400 epochs to observe loss convergence, and the batch size was set to one. Table 1 shows the results for recognition of Chinese sentences at test time using only EEG features. As seen from the table, the error rate goes up as the vocabulary size increases. When the model was trained on the first 3 sentences from the training set and tested on the first 3 sentences from the test set, a low CER of 1.38 % was observed. For number of sentences = {7, 10}, we also tried training the model on the concatenation of MFCC and EEG features and observed error rates of 55.36 % and 66.11 % on the test set respectively, which were slightly lower than the error rates observed when the model was trained using only EEG features, as seen from Table 1. In general, we observed that as the vocabulary size increases, adding MFCC features to the EEG features helps in reducing the error rates.
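
For reference, a minimal sketch of the CER metric as it is conventionally computed: the Levenshtein edit distance between the predicted and reference character sequences, normalized by the reference length. The paper does not show its scoring code, so this is purely illustrative.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / length of the reference."""
    ref, hyp = list(reference), list(hypothesis)
    dp = list(range(len(hyp) + 1))          # edit distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1] / max(len(ref), 1)
```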

Figure 4: CTC loss convergence

If $c = \{c_1, c_2, \ldots, c_k\}$ is the Chinese vocabulary and $e = \{e_1, e_2, \ldots, e_m\}$ is the English vocabulary, then for multilingual training we train the model on $c \cup e$. Table 2 shows the results obtained for multilingual speech recognition using only EEG features. Again, the lowest error rate was observed for the smallest corpus size, and the error rate went up as we increased the corpus size. We believe that because the Chinese vocabulary has a large number of unique characters, the model needs more training examples to generalize better and to give a lower CER as the corpus size increases.
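
As an illustration of this joint training setup, the following hedged sketch builds a single character label set as the union of the Chinese and English character vocabularies, so that one CTC model can emit characters from either script; the function and variable names are illustrative and not taken from the paper.

```python
def build_joint_charset(chinese_sentences, english_sentences):
    """Union of the character vocabularies of both languages, with a shared index space."""
    chars = set()
    for sentence in chinese_sentences + english_sentences:
        chars.update(sentence)                              # characters from both scripts
    charset = sorted(chars)                                 # deterministic ordering
    char_to_id = {c: i for i, c in enumerate(charset)}      # the CTC blank takes the last index
    return charset, char_to_id

def encode(sentence, char_to_id):
    """Map a transcript (Chinese or English) to integer labels for CTC training."""
    return [char_to_id[c] for c in sentence]
```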

In [1] we demonstrated that EEG sensors T7 and T8 contributed most to ASR test-time accuracy, so we tried training the model with EEG features from only the T7 and T8 sensors; the results obtained on the Chinese corpus are shown in Table 3. We observed that for some corpus sizes the error rates were comparable with the error rates shown in Table 1.

We tried reducing the number of hidden units of the GRU to 64 to see whether it could help overcome the performance loss due to the small number of training examples; the results obtained are shown in Table 4. The results indicate a lower CER for larger corpus sizes compared to the results shown in Table 1. Similarly, for T7, T8 training with a 64-hidden-unit GRU on the Chinese vocabulary, we observed slightly reduced CER values of 52.5 % and 69.5 % for number of sentences = {3, 7} respectively. For number of sentences = {5, 10} the error rates were nearly the same as the rates reported in Table 3.

When we trained the 64-hidden-unit GRU model on the multilingual vocabulary, we observed a lower error rate of 44.4 % for number of sentences = {3}; for number of sentences = {5, 7, 10} the error rates were nearly the same as the rates reported in Table 2.

We further observed that for the joint Chinese-English (multilingual) vocabulary, a 2-layer GRU model with 64 hidden units in each layer gave lower error rates of 61.3 %, 72.4 % and 74.3 % for number of sentences equal to {5, 7, 10} respectively, compared to the error rates reported in Table 2.

Figure 4 shows the CTC loss convergence for the 128-hidden-unit GRU model on the Chinese vocabulary for number of sentences = {3}.

Number of Sentences    Number of unique characters    EEG (CER %)
3                      24                             1.38
5                      37                             34.7
7                      59                             64.8
10                     88                             69.4
Table 1: CER on test set for the CTC model on the Chinese vocabulary

Number of Sentences    Number of unique characters    EEG (CER %)
3                      43                             45.6
5                      57                             66.8
7                      81                             78.4
10                     111                            79.4
Table 2: CER on test set for the CTC model on the joint Chinese-English vocabulary

Number of Sentences    Number of unique characters    EEG (CER %)
3                      24                             58.7
5                      37                             69.7
7                      59                             70.8
10                     88                             70.1
Table 3: CER on test set for the CTC model on the Chinese vocabulary using EEG features from only the T7 and T8 electrodes

Number of Sentences    Number of unique characters    EEG (CER %)
3                      24                             3.6
5                      37                             31.6
7                      59                             49.6
10                     88                             65.8
Table 4: CER on test set for the CTC model with a 64-hidden-unit GRU on the Chinese vocabulary

7 Conclusions

In this paper we demonstrated continuous noisy speech recognition using only EEG features on a Chinese vocabulary, as well as multilingual continuous noisy speech recognition using only EEG features on a joint Chinese-English vocabulary. As far as we know, this is the first time continuous speech recognition using only EEG features has been demonstrated for a Chinese or multilingual vocabulary.

We observed that as the corpus size increased, the CER of the CTC model went up, and that concatenating acoustic features with EEG features helps in reducing the CER. Our work demonstrates the feasibility of using EEG features for multilingual speech recognition.

We further plan to publish the speech-EEG database used in this work to help advance research in this area. For future work, we plan to build a much larger speech-EEG database and investigate whether the CTC model results can be improved by training with more examples, by incorporating an external language model at inference time, or by including a language identification model.

8 Acknowledgements

We would like to thank Kerry Loader from Dell, Austin, TX for donating the GPU used to train the CTC model in this work.

References

  • [1] G. Krishna, C. Tran, J. Yu, and A. Tewfik, “Speech recognition with no speech or with noisy speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on.   IEEE, 2019.
  • [2] G. Krishna, C. Tran, M. Carnahan, and A. Tewfik, “Advancing speech recognition with no speech or with noisy speech,” in 2019 27th European Signal Processing Conference (EUSIPCO).   IEEE, 2019.
  • [3] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
  • [4] X. Yang, K. Wang, and S. A. Shamma, “Auditory representations of acoustic signals,” Tech. Rep., 1991.
  • [5] N. Mesgarani and S. Shamma, “Speech processing with a cortical representation of audio,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on.   IEEE, 2011, pp. 5872–5875.
  • [6] K. Wang, X. Wang, and G. Li, “Simulation experiment of bci based on imagined speech eeg decoding,” arXiv preprint arXiv:1705.07771, 2017.
  • [7] P. Kumar, R. Saini, P. P. Roy, P. K. Sahu, and D. P. Dogra, “Envisioned speech recognition using eeg sensors,” Personal and Ubiquitous Computing, vol. 22, no. 1, pp. 185–199, 2018.
  • [8] N. Ramsey, E. Salari, E. Aarnoutse, M. Vansteensel, M. Bleichner, and Z. Freudenburg, “Decoding spoken phonemes from sensorimotor cortex with high-density ecog grids,” Neuroimage, 2017.
  • [9] J. Gonzalez-Dominguez, D. Eustis, I. Lopez-Moreno, A. Senior, F. Beaufays, and P. J. Moreno, “A real-time end-to-end multilingual speech recognition architecture,” IEEE Journal of selected topics in signal processing, vol. 9, no. 4, pp. 749–759, 2015.
  • [10] A. Waibel, H. Soltau, T. Schultz, T. Schaaf, and F. Metze, “Multilingual speech recognition,” in Verbmobil: Foundations of Speech-to-Speech Translation.   Springer, 2000, pp. 33–45.
  • [11] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, and K. Rao, “Multilingual speech recognition with a single end-to-end model,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 4904–4908.
  • [12] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.
  • [13] S. Yu, S. Hu, S. Zhang, and B. Xu, “Chinese-English bilingual speech recognition,” in International Conference on Natural Language Processing and Knowledge Engineering, 2003. Proceedings.   IEEE, 2003, pp. 603–609.
  • [14] A. Costa and N. Sebastián-Gallés, “How does the bilingual experience sculpt the brain?” Nature Reviews Neuroscience, vol. 15, no. 5, p. 336, 2014.
  • [15] E. Higby, J. Kim, and L. K. Obler, “Multilingualism and the brain,” Annual Review of Applied Linguistics, vol. 33, pp. 68–101, 2013.
  • [16] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning.   ACM, 2006, pp. 369–376.
  • [17] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
  • [18] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
  • [19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [20] S. Narayanan, A. Toutios, V. Ramanarayanan, A. Lammert, J. Kim, S. Lee, K. Nayak, Y.-C. Kim, Y. Zhu, L. Goldstein et al., “Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (tc),” The Journal of the Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
  • [21] A. Delorme and S. Makeig, “Eeglab: an open source toolbox for analysis of single-trial eeg dynamics including independent component analysis,” Journal of neuroscience methods, vol. 134, no. 1, pp. 9–21, 2004.
  • [22] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, “Kernel pca and de-noising in feature spaces,” in Advances in neural information processing systems, 1999, pp. 536–542.