Neurodegenerative diseases such as amyotrophic lateral sclerosis (ALS) restrict an individual’s ability to fully engage with their surroundings by hindering communication. Brain-computer interfaces (BCIs) have long been envisioned to assist such patients because they bypass the affected pathways and directly translate neural recordings into text or speech output (Brumberg et al., 2018). These devices are trained to build a model of a subject’s neural activity, and then classify and translate neural signals into commands.
However, practical implementation of this technology has been hindered by the limited speed and accuracy of existing systems (Farwell and Donchin, 1988). Many patients rely on communication devices that use motor imagery (McFarland et al., 2000), or on interfaces that require them to identify and spell out text characters individually, such as the "point and click" cursor method (Pandarinath et al., 2017; Speier et al., 2011; Townsend and Platsko, 2016). Despite significant work on optimizing these systems, inherent limitations in their designs restrict them to communication rates far below that of natural speech (Huggins et al., 2011).
To address these shortcomings, several studies have used electrocorticography (ECoG) and local field potential (LFP) signals for classification and reconstruction of individual phonemes and their acoustic features (Akbari et al., 2019; Pasley et al., 2012). These invasive approaches provide superior signal quality because neural recordings are taken directly on top of the cortex or within the cortical layer, thus capturing events from individual cells or small populations of neurons with high temporal and spatial resolution. Previous work attempted translation to continuous phoneme sequences using invasive neural data (Herff et al., 2015; Moses et al., 2016); however, despite the higher translation speeds reported, these systems are limited to a reduced dictionary (10-100 words). Other design choices meant to enhance phoneme classification capitalize on prior knowledge of the target words, hindering generalization to unmodified, naturalistic scenarios. Additionally, a recent study synthesized speech directly using recordings from the speech cortex. Though it demonstrates partial transferability of its decoder among patients, the accuracy of that model is again measured by a listener selecting the reconstructed word from a given pool of 25 words, and it worsens as the pool size increases (Anumanchipalli et al., 2019).
Thus, whether these approaches can generalize to unconstrained vocabularies is not obvious and, to our knowledge, has not yet been studied. Here, we present the performance of a two-part decoder network comprising an LSTM and a particle filtering algorithm on data gathered from six patients. We provide empirical evidence that our algorithm achieves an average accuracy of 32% using a generalized language model based on the expansive Brown corpus, marking an important, non-incremental step toward the viability of this interface.
The overall system for translating neural signals into text consists of five steps (Fig. 1). First, signals are recorded from depth electrodes implanted for surgical treatment of epilepsy while patients are instructed to speak individual words. Then, spectral features are extracted from these signals to create feature vectors for classification. An LSTM classifier uses these to generate probability distributions over the set of phonemes at each time point. Next, a particle filtering algorithm temporally smooths the probabilities by incorporating prior knowledge from a language model. Finally, the output text is produced and compared with the original spoken words. Each step is elaborated upon in the following sections.
2.1 Experimental Design
Data were obtained from neurosurgical patients implanted with intracranial depth electrodes to identify seizure foci for potential surgical treatment of epilepsy (Tankus et al., 2012). Implantation of electrodes in the relevant areas of the temporal, frontal, and parietal lobes was based on clinical need. This study was approved by the institutional review board and all subjects consented to participate in this research.
During the study, subjects were asked to repeat individual words ("yes", "no"), or singular vowels with or without preceding consonants. During each trial, they were told which word or string to repeat. Then, they were prompted by a beep followed by a 2.25 second window during which they repeated the word. The number of trials varied between subjects based on their comfort, ranging from 55 to 208. Consequently, the number of phonemes per subject varied from 8 (3 consonants, 5 vowels) to 10 (5 consonants, 5 vowels). The sampling rate of these recordings was 30 kHz. Before further processing, electrodes determined visually to have a low signal-to-noise ratio (SNR) were removed.
2.2 Feature Selection
Because we sought to provide our decoder with the differential information that the ECoG signals carry about the production of various phonemes, we designed an experiment that mapped power in frequency bands of the neural recordings to the underlying phoneme pronunciation. This experiment was motivated by the fact that previous studies (Chang et al., 2013; Herff et al., 2015; Moses et al., 2016) used bands up to high gamma (70-150 Hz) to map onto underlying speech, whereas our preliminary analysis found tuning in a greater range of frequencies for several of our subjects (Fig. 2).
Each recording was divided into time windows from -166.67 to 100 ms relative to onset of the speech stimuli. Labels [0, 1] were assigned to the corresponding audio signal as [silence, consonant/vowel], respectively. The power per band was pre-processed by z-scoring and then downsampled to 100 Hz. It then served as input to a linear classifier, which we trained using early stopping and coordinate descent. To additionally ensure that the classifier can correctly identify the silence after completion of the phoneme string, we trained over the 100 ms following speech onset but tested the features captured by the classifier weights over 333.33 ms, since most trials end within this time period.
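The pre-processing of band power described above can be illustrated with a minimal sketch. It assumes block averaging as the downsampling method, which the text does not specify; the function names and the toy signal are hypothetical and stand in for real band-power traces.

```python
from statistics import mean, pstdev

FS_IN = 30_000             # recording sampling rate in Hz (from the text)
FS_OUT = 100               # target feature rate in Hz (from the text)
FACTOR = FS_IN // FS_OUT   # 300 input samples per output sample


def zscore(values):
    """Standardize a sequence to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def downsample_by_mean(values, factor):
    """Average consecutive blocks of `factor` samples.

    Assumption: block averaging; a trailing partial block is dropped.
    """
    n_blocks = len(values) // factor
    return [mean(values[i * factor:(i + 1) * factor]) for i in range(n_blocks)]


# Toy band-power trace covering 0.1 s at 30 kHz (hypothetical data).
band_power = [float(i % 600) for i in range(3000)]
standardized = zscore(band_power)
features = downsample_by_mean(standardized, FACTOR)
```

With these settings, 0.1 s of signal yields ten feature samples per band.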
The input feature set for each subject is thus a concatenated matrix comprising the time-domain signal, the frequency band with the highest z-score for vowel production, and a band encapsulating consonant information. The selected frequency bands (in Hz) are: Subject 1: [150-200, 200-250]; Subject 2: [150-200, 1000-1150]; Subject 3: [70-150]; Subject 4: [200-250, 650-1150]; Subject 5: [200-250, 700-1150]; Subject 6: [150-400, 600-750]. This algorithm was implemented using the STRFlab toolbox (Auditory Science Lab) in MATLAB (The MathWorks Inc., Natick, MA).
2.3 LSTM Model Description
The first part of our decoder is a stacked two-layer bidirectional LSTM (bLSTM) that takes the feature set as input and outputs a probability distribution across all phonemes in the given dataset. We use a bLSTM because of its ability to retain temporally distant dependencies when decoding a sequence (Graves and Schmidhuber, 2005). Further, our analysis reveals that while a single-layer network can differentiate between phonemic classes such as nasals, semivowels, fricatives, front vowels, and back vowels, a two-layer model can distinguish between individual phonemes. There are 256 hidden units for each LSTM cell. The model is trained using the Adam optimizer to minimize weighted cross-entropy error, with weights inversely proportional to the phoneme frequencies. The optimizer is initialized with beta1 = 0.9, beta2 = 0.999 and eps = . Training occurs over 40 epochs with a learning rate of . Using leave-one-out cross-validation, the recurrent network outputs a time sequence of probability distributions. The software was implemented using PyTorch.
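As an illustration of the class weighting used in the loss, the following sketch computes weights inversely proportional to phoneme frequency from a list of frame labels. Normalizing the weights to sum to one is an assumption made here for readability, and the label names are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, inversely proportional to its frequency.

    Assumption: weights are normalized to sum to 1; the text only states
    that they are inversely proportional to the phoneme frequencies.
    """
    counts = Counter(labels)
    raw = {phoneme: 1.0 / count for phoneme, count in counts.items()}
    total = sum(raw.values())
    return {phoneme: weight / total for phoneme, weight in raw.items()}


# Hypothetical frame labels: silence dominates, as is typical for short utterances.
labels = ["sil"] * 6 + ["n"] * 2 + ["o"] * 2
weights = inverse_frequency_weights(labels)
```

In a PyTorch setup, such weights would typically be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, ordered by class index.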
2.4 Language Model
A language model is used to apply prior knowledge about the expected output given the target domain of natural language. In general, such a model creates prior probability distributions for the output based on the sequences seen in a corpus that reflects the target output. In this study, word frequencies were determined using the Brown corpus, which contains roughly one million words compiled from various types of documents published in the United States in 1961 (Francis and Kucera, 1979). These words were translated into their corresponding phonemic sequences using the CMU Pronouncing Dictionary (Weide, 1998). For words with multiple pronunciations, one of the possible pronunciations was randomly chosen for each occurrence of the word. Phoneme prior probabilities were determined by finding the relative frequency of each phoneme in the resulting corpus.
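A minimal sketch of the phoneme prior computation follows. The word counts and the pronunciation table are toy stand-ins for the corpus and the pronouncing dictionary, and the function name is hypothetical.

```python
import random
from collections import Counter


def phoneme_priors(word_counts, pronunciations, seed=0):
    """Relative frequency of each phoneme in a phonemically transcribed corpus.

    word_counts: word -> number of occurrences in the corpus.
    pronunciations: word -> list of alternative phoneme sequences.
    One pronunciation is drawn at random for every occurrence of a word.
    """
    rng = random.Random(seed)
    phoneme_counts = Counter()
    for word, count in word_counts.items():
        options = pronunciations[word]
        for _ in range(count):
            phoneme_counts.update(rng.choice(options))
    total = sum(phoneme_counts.values())
    return {phoneme: c / total for phoneme, c in phoneme_counts.items()}


# Toy data (hypothetical counts; phoneme symbols in CMU-dictionary style).
word_counts = {"no": 3, "knee": 1}
pronunciations = {"no": [["N", "OW"]], "knee": [["N", "IY"]]}
priors = phoneme_priors(word_counts, pronunciations)
```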
To find probabilities of sequences of phonemes, these prior probabilities can be simplified using the nth-order Markov assumption to create an n-gram model (Speier et al., 2017; Manning and Schütze, 1999). While n-gram models are able to capture local phonemic patterns, they allow for sequences that are not valid words in the language. A probabilistic automaton (PA) creates a stronger prior by creating states for every subsequence that starts a word in the corpus (Speier et al., 2015). Thus, the word “no” would result in three states: /n/, /no/, and the start state, which corresponds to a blank string. Each state then links to every state that represents a superstring that is one character longer. Thus, the state /n/ will also link to the state /ni/ (Fig. 3).
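The automaton construction can be sketched as a prefix tree over phoneme sequences. Weighting each transition by the relative corpus frequency of the longer prefix is an assumption made for this illustration, and the toy vocabulary is hypothetical.

```python
from collections import defaultdict


def build_prefix_automaton(word_counts, pronunciations):
    """Build states for every phoneme prefix that starts a word.

    Returns (states, transitions), where transitions[s][t] is the probability
    of moving from prefix s to the one-phoneme-longer prefix t.
    Assumption: transition probabilities are relative prefix frequencies.
    Outgoing probabilities sum to less than 1 when some word ends at s.
    """
    prefix_counts = defaultdict(int)
    for word, count in word_counts.items():
        phones = tuple(pronunciations[word])
        for k in range(len(phones) + 1):   # k = 0 is the empty start state
            prefix_counts[phones[:k]] += count

    transitions = defaultdict(dict)
    for prefix, count in prefix_counts.items():
        if prefix:
            parent = prefix[:-1]
            transitions[parent][prefix] = count / prefix_counts[parent]
    return set(prefix_counts), dict(transitions)


# Toy vocabulary (hypothetical counts and simplified phoneme symbols).
word_counts = {"no": 3, "knee": 1}
pronunciations = {"no": ["n", "o"], "knee": ["n", "i"]}
states, transitions = build_prefix_automaton(word_counts, pronunciations)
```

Here the start state links only to /n/, which in turn links to both /no/ and /ni/.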
In the English language, it is possible for multiple words with different spellings to have identical pronunciation (homophones). The language model accounts for this possibility by keeping a list of the words associated with each node in the model along with their relative frequency in the text corpus. In the current implementation, the model will select the highest probability word associated with the selected node. While this process could lead to incorrect selections in practice if the intended target were a less common homophone, deciding between these options would require a language model that incorporates context that extends beyond single words (see future directions).
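The homophone rule described above amounts to a lookup of the most frequent word stored at a node. The sketch below uses hypothetical relative frequencies for illustration only.

```python
def most_likely_word(node_words):
    """Return the highest-frequency word among the homophones stored at a node.

    node_words: word -> relative frequency in the text corpus.
    """
    return max(node_words, key=node_words.get)


# Hypothetical frequencies for two words sharing one pronunciation.
node_words = {"no": 0.8, "know": 0.2}
selected = most_likely_word(node_words)
```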
2.5 Temporal Smoothing
Laplace (additive) smoothing is applied to the output of the LSTM model so that phonemes that were not seen during training are assigned a non-zero probability. A temporal model is then used to apply the language model to the resulting distributions. For simple n-gram-based models, dynamic programming methods such as hidden Markov models can be used for this purpose. As more sophisticated language models are used, it becomes impractical to fully represent the probability distribution over possible output sequences.
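A minimal sketch of the smoothing step follows, assuming it takes the usual additive form in which a small constant is added to every class before renormalizing. The value of `alpha` is hypothetical, since the text does not report the constant used.

```python
def laplace_smooth(probs, alpha=1e-3):
    """Additively smooth a probability distribution.

    Every class receives an extra pseudo-probability `alpha` (hypothetical
    value), and the result is renormalized to sum to 1.
    """
    total = sum(probs) + alpha * len(probs)
    return [(p + alpha) / total for p in probs]


# Classifier output in which the third phoneme received zero probability.
smoothed = laplace_smooth([0.7, 0.3, 0.0])
```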
In this study, we applied a particle filtering (PF) method previously applied in P300-based brain-computer interface systems (Speier et al., 2015). PF is a method for estimating the probability distribution of sequential outputs by creating a set of realities (called particles) and projecting them through the model based on the observed data (Gordon et al., 1993). Each of these particles contains a reference to a state in the model, a history of previous states, and an amount of time that the particle is going to remain in the current state. The distribution of states occupied by these particles represents an estimation of the true probability distribution.
When the system begins, a set of P particles is generated and each is associated with the root node of the language model. At each time point, samples are drawn from the proposal distribution defined by the transition probabilities from the previous state.
The time that a particle will stay in that state is drawn from a distribution representing how long the subject is expected to spend speaking a specific phoneme. At each time point, a probability weight is computed for each of the particles.
The weights are then normalized, and the probability of each possible output string is found by summing the weights of all particles that correspond to that string. The system keeps a running account of the highest-probability output at each time. The effective number of particles is then computed.
If the effective number falls below a threshold, a new set of particles is drawn from the particle distribution. At each time point, the amount of time for a given particle to remain in a state is decremented. Once that counter reaches zero, the particle transitions to a new state in the language model based on the model transition probabilities.
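The procedure above can be sketched as a small bootstrap particle filter over a prefix automaton. Several details are assumptions made for this illustration and are not taken from the study: the weight update multiplies by the classifier probability of the particle's current phoneme, the effective number of particles uses the standard estimate 1 / sum(w^2), dwell times are drawn uniformly, and the toy model and observations are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Particle:
    state: tuple    # phoneme prefix reached so far (also encodes the history)
    dwell: int      # time steps left before the next transition
    weight: float


def sample_next(transitions, state, rng):
    """Draw a successor state; stay in place if the state has no successors."""
    options = transitions.get(state)
    if not options:
        return state
    states, probs = zip(*options.items())
    return rng.choices(states, weights=probs, k=1)[0]


def particle_filter(observations, transitions, n_particles=200,
                    resample_fraction=0.5, max_dwell=3, seed=0):
    """Return the most probable phoneme string after the last observation."""
    rng = random.Random(seed)
    uniform = 1.0 / n_particles
    particles = [Particle((), 0, uniform) for _ in range(n_particles)]
    best = ()
    for obs in observations:
        for p in particles:
            if p.dwell <= 0:  # counter expired: move through the language model
                p.state = sample_next(transitions, p.state, rng)
                p.dwell = rng.randint(1, max_dwell)  # assumed uniform dwell time
            p.dwell -= 1
            phoneme = p.state[-1] if p.state else "sil"
            p.weight *= obs.get(phoneme, 0.0)  # assumed likelihood weighting
        total = sum(p.weight for p in particles)
        for p in particles:  # normalize; fall back to uniform if all weights vanish
            p.weight = p.weight / total if total > 0 else uniform
        string_prob = {}
        for p in particles:  # sum weights of particles sharing an output string
            string_prob[p.state] = string_prob.get(p.state, 0.0) + p.weight
        best = max(string_prob, key=string_prob.get)
        n_eff = 1.0 / sum(p.weight ** 2 for p in particles)
        if n_eff < resample_fraction * n_particles:  # resample when degenerate
            chosen = rng.choices(particles,
                                 weights=[p.weight for p in particles],
                                 k=n_particles)
            particles = [Particle(c.state, c.dwell, uniform) for c in chosen]
    return best


# Toy language model and classifier output (hypothetical values).
transitions = {(): {("n",): 1.0},
               ("n",): {("n", "o"): 0.5, ("n", "i"): 0.5}}
observations = ([{"n": 0.9, "o": 0.05, "i": 0.05}] * 3
                + [{"n": 0.05, "o": 0.9, "i": 0.05}] * 3)
decoded = particle_filter(observations, transitions)
```

Because the last three frames favor /o/, particles that branched to /ni/ lose weight and are removed by resampling.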
The simplest evaluation metric used is the trial accuracy, which represents the number of trials classified completely correctly, divided by the total number of trials. Trials are only considered correct if the phoneme sequence matches the labels and each of those phonemes overlaps at least partially with the corresponding label.
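The trial-level criterion can be sketched as follows, assuming each decoded phoneme and each label is represented as a (phoneme, start, end) tuple with times in seconds. This representation and the example values are hypothetical.

```python
def intervals_overlap(a, b):
    """True if the half-open intervals a = (start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]


def trial_correct(predicted, labels):
    """A trial is correct if the phoneme sequences match and every decoded
    phoneme overlaps its corresponding label at least partially."""
    if [p[0] for p in predicted] != [l[0] for l in labels]:
        return False
    return all(intervals_overlap(p[1:], l[1:]) for p, l in zip(predicted, labels))


def trial_accuracy(trials):
    """Fraction of (predicted, labels) pairs classified completely correctly."""
    return sum(trial_correct(p, l) for p, l in trials) / len(trials)


# Hypothetical trials: the first is correct, the second has a wrong vowel.
labels = [("n", 0.10, 0.20), ("o", 0.20, 0.40)]
good = [("n", 0.12, 0.22), ("o", 0.22, 0.35)]
bad = [("n", 0.12, 0.22), ("i", 0.22, 0.35)]
accuracy = trial_accuracy([(good, labels), (bad, labels)])
```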
Phoneme-based performance was measured in terms of precision, recall, and phoneme error rate. Here, each phoneme classification was considered either a true positive (i.e., correct phoneme overlapping the label), false positive (i.e., classified phoneme that either does not match the corresponding label or occurs during silence), or false negative (i.e., no detected phoneme during a label). Precision is then the number of true positives divided by the sum of the true positives and false positives. Recall is the number of true positives divided by the sum of the true positives and false negatives. Phoneme error rate is the number of changes that would need to be made to the output sequence in order to match the label sequence (also known as the Levenshtein distance) divided by the length of the label sequence (Moses et al., 2016).
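These phoneme-level metrics can be written compactly. The sketch below uses the standard dynamic-programming form of the Levenshtein distance; the example sequences and counts are hypothetical.

```python
def levenshtein(output, label):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `output` into `label`."""
    prev = list(range(len(label) + 1))
    for i, x in enumerate(output, 1):
        cur = [i]
        for j, y in enumerate(label, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]


def precision(true_pos, false_pos):
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    return true_pos / (true_pos + false_neg)


def phoneme_error_rate(output, label):
    """Levenshtein distance normalized by the length of the label sequence."""
    return levenshtein(output, label) / len(label)


# Hypothetical example: one wrong vowel in a two-phoneme word.
per = phoneme_error_rate(["n", "i"], ["n", "o"])
```

Multiplying the error rate by 100 gives the percentage form used when reporting results.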
For evaluation of the output as a BCI system, we must take into account two factors: the ability of the system to achieve the desired result and the amount of time required to reach that result. Because there is a trade-off between speed and accuracy, evaluation in BCI communication literature is traditionally based on the mutual information between the selected character, x, and the target character, z, referred to as the bits per symbol (B).
In the most common metric, information transfer rate (ITR), the probabilities for all characters are assumed to be the same ($p(x) = 1/N$, where $N$ is the size of the alphabet) and errors are assumed to be uniform across all possible characters, reducing the bits per symbol to

$$B = \log_2 N + P \log_2 P + (1 - P) \log_2 \frac{1 - P}{N - 1},$$

where $P$ is the accuracy of individual character selections. Thus the ITR, given the average number of characters selected per minute ($S$), is $\mathrm{ITR} = S \cdot B$ (Pierce, 1980).
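Under the equal-probability and uniform-error assumptions described above, the bits per symbol takes the widely used closed form B = log2(N) + P log2(P) + (1 - P) log2((1 - P)/(N - 1)), where N is the alphabet size and P is the selection accuracy. The sketch below evaluates it; the variable names and example values are hypothetical.

```python
import math


def bits_per_symbol(n_symbols, accuracy):
    """Bits per selection under uniform priors and uniformly distributed errors."""
    if n_symbols < 2:
        raise ValueError("need at least two symbols")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    bits = math.log2(n_symbols)
    if accuracy > 0.0:   # the term P*log2(P) tends to 0 as P -> 0
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:   # the error term vanishes at perfect accuracy
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_symbols - 1))
    return bits


def information_transfer_rate(n_symbols, accuracy, selections_per_minute):
    """ITR in bits per minute."""
    return bits_per_symbol(n_symbols, accuracy) * selections_per_minute


# Hypothetical speller: 26 characters, 90% accuracy, 5 selections per minute.
itr = information_transfer_rate(26, 0.9, 5.0)
```

At chance-level accuracy (P = 1/N) the formula returns zero bits, which is a quick sanity check on any implementation.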
It has previously been observed that ITR overestimates the amount of information conveyed by the system because characters do not occur with equal frequency (Speier et al., 2013). Also, the amount of information that ITR assigns to a word is based largely on the word’s length. This metric assigns a significantly higher amount of information to incorrect strings that share characters with the target, regardless of whether they make syntactic sense or possibly confuse the meaning. An alternative is to base the metric on word frequency. The accuracy can then be computed as the fraction of correct words, resulting in a conditional probability of a selection, from which the bits per symbol is recomputed at the word level. Multiplying this by the number of words selected per minute gives a bit rate based on mutual information (MI).
Because speeds, accuracies, and bit rates are not normally distributed, significance was tested for all metrics using Wilcoxon signed-rank tests.
Word accuracies varied between subjects, ranging from (subject 1) to (subject 2) (Table 1). On average, of trials were classified completely correctly and an additional had at least one phoneme match. Of the incorrect classifications produced incorrect words either because none of the output phonemes were correct or because the sequences did not align temporally with the audio signal. In the remaining of trials, the system did not detect speech signals, and produced an empty string as output.
|Subject||()||Partial ()||Incorrect ()||Omission ()|
Expected word accuracy for each subject was computed by finding the expected value of having the output match the target word given only the language model. For this computation, the language model was simplified to include only those words from the Brown corpus that could be constructed using the phonemes that a subject uttered. This is a reasonable constraint, and it is far more lenient than those in previous studies, in which the output word is constrained to only a small subset of the analogously feasible word pool. Comparing each of the word accuracies to the expected results from random signals, we found that all subjects performed significantly better than random (p<0.01) (Figure 4).
On average, phoneme classification yielded precision, recall, and error rates of 0.46, 0.51, and 73.32, respectively (Table 2). The higher recall suggests that more errors were a result of incorrectly adding phonemes to an output sequence than of missing phonemes in the classification. The phoneme error rates that we observed in our output sequences were lower on average than those reported previously by Moses et al. (2016).
|Subject||Precision||Recall||Phoneme Error Rate|
|Moses et al. (2016)||-||-||87.56|
To compare the performance of this system with that achieved by existing ERP-based BCI systems, we calculated the bit rate for communication using the mutual information metric described in section 2.6. For the word selection rate, we used the full time that a subject was given to speak a word (2.25 seconds), resulting in a WPM value of 26.67 words per minute for each subject. Word accuracies varied between subjects, ranging from to , with an average value of resulting in an average MI of bits per minute. This value was significantly higher than the results presented in Speier et al. (2011). Since the study presented by Townsend and Platsko (2016) used several different configurations for their subjects, we compare our results here with the best performing subject in their study. The average MI value in this study was over twice the value achieved by their best subject, with all but one subject in this study achieving a higher bit rate.
|Speier et al. (2011)||2.53||92.56||6.54||16.54|
|Townsend and Platsko* (2016)||2.94||100.00||7.33||21.56|
*performance of best single subject in study
Each of the subjects in this study was able to communicate with significantly higher accuracy than chance. Nevertheless, the average word error rate seen in this study (67.8%) was higher than the 53% reported in Anumanchipalli et al. (2019). There were several important differences between these studies, however. The primary difference is that their system produced an audio output that required a human listener to transcribe into a word selection. Despite advances in machine learning and natural language processing, humans have superior ability to use contextual information to find meaning in a signal. Furthermore, that study limited classifications to an output domain set of 50 words, which is generally not sufficient for a realistic communication system.
The communication speeds reported here are based on the trial time of 2.25 seconds. This time was set conservatively to make sure that subjects had time to respond to prompts, and the majority of the time was spent waiting for the next speaking cue. The actual time spent speaking the prompted words was under 400 ms on average across subjects. This speaking time is in line with the average rate of natural speech, which is usually reported to be in the range of 100-125 words per minute (Kemper, 1994). Increasing the rate of word production would further improve the bit rate of a speech decoding system in comparison to existing BCI spellers.
While this study showed significant improvements over existing BCI systems in terms of bit rate and speed, our accuracies are lower than those reported in ERP-based BCI studies (Townsend and Platsko, 2016; Speier et al., 2011). It has been previously reported that BCI users expect an accuracy level (Huggins et al., 2011) that is higher than the accuracy values achieved here. For a BCI system based on translating neural signals to become a practical solution, improvements need to be made in signal acquisition, machine learning translation, or user strategy. One approach could be to sacrifice some of the speed advantage by having users repeat words multiple times. While this would reduce communication speeds below natural speaking rates, it would still greatly exceed ERP-based methods while increasing the signals available for classification, which could improve system accuracy.
4.1 Limitations and Future Work
The language model used in this study was designed to be general enough for application in a realistic BCI system. This generality may have been detrimental to the performance in the current study, however. Language models are designed to introduce bias into a system based on the expected output given prior knowledge. Thus, language models based on natural language will bias towards words that are common in everyday speech. The current study design, however, produced many words that are infrequent in the training corpus. For instance, the single phoneme /u/ maps to the word "ooh", which occurred only once in the full corpus. As a result, the language model actually biased away from this output, making it almost impossible to correctly classify. While it would be possible to retrain the language model on the known output words, the results would then depend on knowing the set of target words, which is not realistic for a general communication system.
The results presented in this study are promising, but they represent offline performance, which does not include several factors that occur in an online implementation. For instance, offline systems do not include user feedback, which can provide additional motivation or allow the user to adjust their strategy. Also, the current study was limited to epilepsy patients, rather than the target population of ALS patients. While it would be impractical to implant electrodes for such a BCI study in the target population, whether the results seen in such invasive studies translate to ALS patients remains to be studied.
The proposed system serves as a step in the direction of a generalized BCI system that can directly translate neural signals into written text. The system achieved bit rates that were significantly higher than the current state of the art in BCI communication. However, communication accuracies are currently insufficient for a practical BCI device, so future work must focus on improving these and developing an interface to present feedback to users.
- Akbari et al. (2019) H. Akbari, B. Khalighinejad, J. L. Herrero, A. D. Mehta, and N. Mesgarani. Towards reconstructing intelligible speech from the human auditory cortex. Scientific Reports, 10:874, 2019.
- Anumanchipalli et al. (2019) G. K. Anumanchipalli, J. Chartier, and E. F. Chang. Speech synthesis from neural decoding of spoken sentences. Nature, 568(7753):493–498, 2019.
- (3) B. Auditory Science Lab. Strflab toolbox. v1.45.
- Brumberg et al. (2018) J. Brumberg, K. Pitt, A. Mantie-Kozlowski, and J. Burnison. Brain-computer interfaces for augmentative and alternative communication: A tutorial. Am J Speech Lang Pathol., 2018.
- Chang et al. (2013) E. F. Chang, C. A. Niziolek, R. T. Knight, S. S. Nagarajan, and J. F. Houde. Human cortical sensorimotor network underlying feedback control of vocal pitch. PNAS, 110(7):2653–8, 2013.
- Farwell and Donchin (1988) L. A. Farwell and E. Donchin. Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials. Electroencephalogr. Clin. Neurophysiol, 79(6), 1988.
- Francis and Kucera (1979) W. Francis and H. Kucera. Brown corpus manual dept of linguistics. Technical report, Brown University, 1979.
- Gordon et al. (1993) N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-gaussian bayesian state estimation. In IEEE Proceedings F (Radar and Signal Processing), pages 107–113, 1993.
- Graves and Schmidhuber (2005) A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Netw., 18:602–610, 2005.
- Herff et al. (2015) C. Herff, D. Heger, A. Pesters, D. Telaar, P. Brunner, G. Schalk, and T. Schultz. Brain-to-text: decoding spoken phrases from phone representations in the brain. Front. Neurosci., 9:217, 2015.
- Huggins et al. (2011) J. E. Huggins, P. A. Wren, and K. L. Gruis. What would brain-computer interface users want? opinions and priorities of potential users with amyotrophic lateral sclerosis. Amyotroph. Lateral Scler., 12(5):318–324, 2011.
- Kemper (1994) S. Kemper. Elderspeak: Speech accommodations to older adults. Aging and Cognition, pages 17–28, 1994.
- Manning and Schütze (1999) C. D. Manning and H. Schütze. Foundations of statistical natural language processing. MIT Press, 1999.
- McFarland et al. (2000) D. J. McFarland, L. A. Miner, T. M. Vaughan, and J. R. Wolpaw. Mu and beta rhythm topographies during motor imagery and actual movements. Brain Topogr., 12(3):177–186, 2000.
- Moses et al. (2016) D. A. Moses, N. Mesgarani, M. K. Leonard, and E. F. Chang. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity. J. Neural Eng., 13(5):56004, 2016.
- (16) The Mathworks Inc., Natick, MA. Matlab and statistics toolbox. Release 2018a.
- Pandarinath et al. (2017) C. Pandarinath, P. Nuyujukian, C. Blabe, B. Sorice, J. Saab, F. Willett, L. Hochberg, K. Shenoy, and J. Henderson. High performance communication by people with paralysis using an intracortical brain-computer interface. eLife, 6:e18554, 2017.
- Pasley et al. (2012) B. N. Pasley, S. V. David, N. Mesgarani, A. Flinker, and S. A. Shamma. Reconstructing speech from human auditory cortex. PLOS Biology, 10(1):1–13, 2012.
- Pierce (1980) J. Pierce. An Introduction to Information Theory. Dover, 1980.
- Speier et al. (2011) W. Speier, C. Arnold, J. Lu, R. K. Taira, and N. Pouratian. Natural language processing with dynamic classification improves p300 speller accuracy and bit rate. J. Neural Eng., page 016004, 2011.
- Speier et al. (2013) W. Speier, C. Arnold, and N. Pouratian. Evaluating true bci communication rate through mutual information and language models. PLoS One, 8:e78432, 2013.
- Speier et al. (2015) W. Speier, C. Arnold, A. Deshpande, J. Knall, and N. Pouratian. Incorporating advanced language models into the p300 speller using particle filtering. J. Neural Eng., 12:046018, 2015.
- Speier et al. (2017) W. Speier, C. Arnold, N. Chandravadia, D. Roberts, S. Pendekanti, and N. Pouratian. Improving p300 spelling rate using language models and predictive spelling. Brain-Computer Interfaces, pages 13–22, 2017.
- Tankus et al. (2012) A. Tankus, I. Fried, and S. Shoham. Structured neuronal encoding and decoding of human speech features. Nat. Commun., 3:1015, 2012.
- Townsend and Platsko (2016) G. Townsend and V. Platsko. Pushing the p300-based brain–computer interface beyond 100 bpm: extending performance guided constraints into the temporal domain. J. Neural Eng., 13(2):26024, 2016.
- Weide (1998) R. L. Weide. The cmu pronouncing dictionary. Technical report, Carnegie Mellon University, 1998.