
Human brain activity for machine attention

Cognitively inspired NLP leverages human-derived data to teach machines about language processing mechanisms. Recently, neural networks have been augmented with behavioral data to solve a range of NLP tasks spanning syntax and semantics. We are the first to exploit neuroscientific data, namely electroencephalography (EEG), to inform a neural attention model about language processing of the human brain with direct cognitive measures. Part of the challenge in working with EEG is that features are exceptionally rich and need extensive pre-processing to isolate signals specific to text processing. We devise a method for finding such EEG features to supervise machine attention through combining theoretically motivated cropping with random forest tree splits. This method considerably reduces the number of dimensions of the EEG data. We employ the method on a publicly available EEG corpus and demonstrate that the pre-processed EEG features are capable of distinguishing two reading tasks. We apply these features to regularise attention on relation classification and show that EEG is more informative than strong baselines. This improvement, however, is dependent on both the cognitive load of the task and the EEG frequency domain. Hence, informing neural attention models through EEG signals has benefits but requires further investigation to understand which dimensions are most useful across NLP tasks.



1 Introduction

Cognitively inspired NLP is a research field at the intersection of Cognitive Neuroscience and Natural Language Processing (NLP) that has lately received ample attention. Cognitive Neuroscience investigates the cognitive processes that occur in the human brain through high-level explanations, whereas NLP has the overall objective of teaching machines to read and understand human language. The recent merger of these fields led to the overarching goal of introducing human bias Wilson et al. (2015); Schwartz et al. (2019); Toneva and Wehbe (2019) into machines, and hence augmenting neural networks with cognitive data when solving NLP tasks.

Human readers process words without notable effort, as most reading processes happen subconsciously. Human text processing can be studied, e.g., through eye-tracking (ET) records from reading, where word-level fixation durations are robustly correlated with cognitive load Drieghe et al. (2005); Fitzsimmons and Drieghe (2011); Rayner et al. (2011). Recently, numerous studies have shown that words represented as ET features can help a wide range of NLP tasks, including Part-of-Speech (POS) tagging Barrett et al. (2016), Relation Detection Hollenstein et al. (2019), and Sentiment Detection Mishra et al. (2017, 2016); Barrett et al. (2018).

Fixation durations correlate with cognitive load, but a prolonged fixation duration does not reveal which cognitive process occurs. Therefore, eye movements are considered indirect measures of human text processing, whereas electroencephalography (EEG) and fMRI technologies are direct measures of human brain activity. EEG measures electric activity in the brain. When used non-invasively, a number of electrodes are placed on the scalp to measure brain surface activity. In the field of Cognitive Science, EEG data plays a vital role in explaining various brain phenomena such as cognitive load Antonenko et al. (2010); Zarjam et al. (2011); Kumar and Kumar (2016). It is even possible to decode human cortical activity to synthesize audible speech Anumanchipalli et al. (2019).

In NLP, however, there has been just a single study investigating how EEG data can enhance machines’ ability to perform named entity recognition, sentiment classification, and relation classification Hollenstein et al. (2019). In contrast to eye movements, EEG contains not only signals about cognitive load and mental state but all brain surface activity (including facial muscle activity, which must be removed from the data after recording). Therefore, EEG requires much more de-noising and pre-processing before any text processing signals are exploitable for NLP.

Providing a rigorous and clear approach for the latter is the main aim of this study. On top of that, we go one step further and show how our approach improves performance on three sequence classification tasks - even with access to little brain data.

1.1 Contributions

  • We devise a method for extracting human language processing signals from EEG recordings of a publicly available EEG corpus Hollenstein et al. (2018). In a sanity check, we demonstrate that these multi-dimensional feature vectors let us distinguish between two different reading tasks, namely Normal Reading (NR) and Task-Specific Reading (TSR).

  • We inject this human bias into a multi-task neural model for sequence classification. In so doing, we extract the most informative word-level signals from these embeddings to regularize the attention mechanism of a Recurrent Neural Network (RNN).

  • We observe that the differences in EEG signals between NR and TSR affect NLP downstream performances differently. Moreover, we show that downstream performance further varies as a function of EEG frequency domains.

Together, these insights have decisive implications about which human EEG signals are most suitable to inject into Machine Learning (ML) models for enhancing their language processing performance. We make our code publicly available.


2 Related work

Recently, an array of studies has investigated how external cognitive signals, and thus the injection of human bias, can enhance the capacity of artificial neural networks (ANNs) to understand natural language Hollenstein et al. (2019); Strzyz et al. (2019); Schwartz et al. (2019); Toneva and Wehbe (2019); Gauthier and Levy (2019), and vice versa, how language processing in ANNs might enhance our understanding of human language processing Hollenstein et al. (2019). Others scrutinized whether machine attention deviates from human attention when disentangling linguistic or visual scenes Barrett et al. (2018); Das et al. (2016).

Most studies, however, focused on the use of ET data, and hence exploited gaze features on the word level. Some utilized gaze features as word embeddings to inform ANNs about which syntactic linguistic features humans deem decisive in their language processing. In so doing, they have successfully refined state-of-the-art Named Entity Recognition (NER) systems Hollenstein and Zhang (2019), POS taggers Barrett et al. (2016), and Dependency Parsers Strzyz et al. (2019). Others have drawn attention to the enhancement of semantic disentanglement, and improved tasks such as sarcasm detection Mishra et al. (2017) or sentiment analysis Mishra et al. (2016) through leveraging human gaze.

One recent study, from which we take inspiration, exploited gaze information to regularize attention in a multi-task-like setting for sequence classification Barrett et al. (2018). Here, gaze information improved grammatical error detection, hate speech detection, and sentiment classification. The authors enabled ANNs to utilize human attention during training without needing to access this information at test time. Although this study is similar to ours and serves as the foundation for our code, we go one step further and regularize attention with human brain activity rather than with gaze, an indirect proxy of such direct measures.

(a) Normal Reading (NR)
(b) Annotation Reading (AR)
Figure 1: EEG activity during Total Reading Time (TRT) in the left (two upper plots) and right (two lower plots) temporal cortex for the θ-frequency band (4–8 Hz) over time for a single test subject.

3 EEG feature extraction

First, we introduce the corpus we exploit for this study, then explain our feature extraction and dimensionality reduction approach in detail, and finally outline our attention-based sequence classification model.

3.1 EEG data

In this study, we utilized the recently created Zurich Cognitive Language Processing Corpus (ZuCo) Hollenstein et al. (2018). It is a corpus of simultaneous ET and EEG recordings of 12 English native speakers reading individual sentences. Thus, through the ET and EEG alignment, neural activity is available for single words in English. The corpus contains cognitive data from the following three reading tasks:

Task 1

Normal Reading (NR) of 400 sentences that contain sentiment annotations from the Stanford Sentiment Treebank Socher et al. (2013).

Task 2

Normal Reading (NR) of 300 sentences that contain relations between named entities extracted from the Wikipedia relation extraction corpus Culotta et al. (2006).

Task 3

Task-Specific Reading (TSR) of 407 sentences that contain relations between named entities, also from Culotta et al. (2006). We refer to this task as Annotation Reading (AR), since participants were required to annotate sentences while reading by pressing a key to answer whether one specific relation is present in the sentence. This results in a different cognitive load for the human reader.

Each of the 12 participants performed all three reading tasks, completing them in two sessions on different days. For our experiments, we exclusively used data from NR (Relation) (i.e., Task 2) and AR, since these tasks contain text from the same domain. In Figure 1, EEG activity in the temporal cortex for the θ-frequency domain is shown for a single test subject across trials, for NR and AR respectively. One can clearly see that EEG activity is stronger for AR than for NR in the left and right (fronto-)temporal cortex, which might contribute to differences in downstream performance. This qualitative insight into a single participant’s brain activity over time is in line with the results of the statistical analyses across subjects, which we elaborate on later in this section. The same activity pattern in (fronto-)temporal areas holds for other frequency domains, but is not shown for simplicity and due to layout constraints.

EEG features correspond to ET features (e.g., First Fixation Duration). As such, eight 105-dimensional EEG vectors were extracted for each ET feature, corresponding to the EEG activity during the specific eye-movement event (i.e., ET feature) w.r.t. this word. Each of the eight 105-dimensional vectors corresponds to one of the following EEG frequency domains: θ1 (4–6 Hz), θ2 (6.5–8 Hz), α1 (8.5–10 Hz), α2 (10.5–13 Hz), β1 (13.5–18 Hz), β2 (18.5–30 Hz), γ1 (30.5–40 Hz) and γ2 (40–49.5 Hz) Hollenstein et al. (2018). Each of the 105 dimensions corresponds to a specific electrode on the 128-channel EEG cap used for recording (see Figure 2). 23 electrodes were removed through Automagic Pedroni et al. (2019) prior to any data analysis, since their position meant they did not contain signals relevant for cognitive processing Hollenstein et al. (2018).

Given that e is the number of recorded ET features, f the number of EEG frequency domains, and d the number of electrodes, this results in f × d = 840 EEG features per word per ET feature, where f = 8 and d = 105 (and e × f × d = 4200 features per word in total, with e = 5). Since most of these features reflect signals that are not relevant to cognitive text processing, we apply various feature extraction techniques to reduce the dimensionality.
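As a concrete sketch of this layout (the variable names and the count of five ET features are our assumptions; the total of 4200 dimensions per word matches the reduction reported later):

```python
import numpy as np

# Raw EEG feature layout per word: e ET features x f frequency domains
# x d electrodes (e = 5 is our assumption, consistent with 4200 total).
N_ET_FEATURES = 5     # e (assumed)
N_FREQ_DOMAINS = 8    # f: theta1/2, alpha1/2, beta1/2, gamma1/2
N_ELECTRODES = 105    # d: electrodes kept after Automagic cleaning

def raw_dims_per_word(e=N_ET_FEATURES, f=N_FREQ_DOMAINS, d=N_ELECTRODES):
    """Total raw EEG dimensions per word before any reduction."""
    return e * f * d

# One word's features as an (e, f, d) tensor of power spectra.
word_eeg = np.zeros((N_ET_FEATURES, N_FREQ_DOMAINS, N_ELECTRODES))
print(word_eeg.size)  # total raw dimensions per word
```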

3.2 EEG feature reduction

We split the EEG data for NR and AR into train, development, and test splits comprising 70/15/15% of the data. Splits were created w.r.t. sentence-level samples. We calculated the feature reduction on the train set, tuned it on the development split, and evaluated the reduced features on the held-out test set.
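The sentence-level split can be sketched as follows (a minimal illustration; the seed and helper name are ours):

```python
import random

def split_sentence_ids(sentence_ids, seed=42):
    """70/15/15 train/dev/test split, computed on sentence-level samples."""
    ids = list(sentence_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_dev = int(0.70 * n), int(0.15 * n)
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]

# e.g. the 300 NR (Relation) sentences
train, dev, test = split_sentence_ids(range(300))
```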

We first investigated how to pre-process the EEG features to find differences between AR and NR. This was done to scrutinize potential differences in EEG signals as a function of cognitive load, and to evaluate whether the dimensionality reduction is able to capture them. As a validation step, we also tuned the method for classifying NR (Sentiment) into its first and second halves respectively. Participants completed NR (Relation) and the first half of NR (Sentiment) in the first session, and AR and the rest of NR (Sentiment) in the second session. Such experimental designs may bias the data with session-specific effects. Hence, this validation step serves as a data quality check of both the dimensionality reduction and the noise removal. Ideally, the text processing signal should remain coherent across both halves of NR (Sentiment). For both steps, we employed a Random Forest classifier on the train set to find the most informative EEG channels for this task.

(a) θ-frequency band (4-8 Hz)
(b) α-frequency band (8.5-13 Hz)
(c) β-frequency band (13.5-30 Hz)
Figure 2: Differences in EEG power spectra for TRT in specific electrodes between Normal Reading (NR) and Annotation Reading (AR) for the frequency bands θ, α, and β. Best viewed in color. Electrodes colored in green denote higher EEG power spectra for AR, whereas red electrodes indicate higher EEG signals for NR. All differences are statistically significant. Blue-colored electrodes refer to no or non-significant differences between the two reading tasks. All electrodes that are not colored were excluded during pre-processing prior to data analysis. The reference electrode (Cz), however, was not excluded as a pre-processing step, but was not considered during statistical analyses since it always has the minimum value of 0.

Firstly, to extract the most predictive features per word, per ET feature, and per EEG frequency domain, we reduced f and e based on the literature.


We binned the eight frequency domains (see above) into four general frequency bands: θ (4–8 Hz), α (8.5–13 Hz), β (13.5–30 Hz) and γ (30.5–49.5 Hz). This strategy was applied to manually decrease the dimensionality prior to exploiting any machine learning techniques, and to reduce computational cost at an early stage. To yield binned frequency domains, we calculated the average power spectrum per electrode for each of the four frequency pairs (e.g., θ = mean(θ1, θ2)), thus f = 4. Because γ frequencies mainly relate to emotion detection Zheng and Lu (2015); Li and Lu (2009); Oathes et al. (2008); Luo et al. (2008), and not to attentiveness, we further reduced f to 3, and computed embeddings in low-dimensional brain space for θ, α, and β only.
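The binning step can be sketched as follows (we assume the eight domains are ordered θ1, θ2, α1, α2, β1, β2, γ1, γ2 along the first axis; the function name is ours):

```python
import numpy as np

def bin_frequency_bands(word_eeg, drop_gamma=True):
    """word_eeg: (8, 105) array of power spectra, one row per frequency
    domain. Averages each sub-band pair per electrode, e.g.
    theta = mean(theta1, theta2), yielding (4, 105); optionally drops the
    gamma row, since gamma relates to emotion rather than attentiveness."""
    bands = word_eeg.reshape(4, 2, -1).mean(axis=1)  # theta, alpha, beta, gamma
    return bands[:3] if drop_gamma else bands

x = np.arange(8 * 105, dtype=float).reshape(8, 105)
binned = bin_frequency_bands(x)
print(binned.shape)  # (3, 105): theta, alpha, beta
```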


In initial data analyses, kernel density estimates (KDE) and t-test bootstrapping showed that power spectra vary across EEG frequency domains but are highly correlated among the different ET features. Therefore, we decided to extract EEG features that correspond to total reading time (TRT), as this ET feature covers all activity related to an individual word, and has proven to be the most informative gaze feature in previous studies Hollenstein and Zhang (2019); Hollenstein et al. (2019). Hence, e = 1.

Furthermore, we exploited a Random Forest classifier Breiman (2001) with 100 trees. Random Forest reveals the respective feature indices it requires to solve the classification task. Those can easily be mapped to electrodes on the EEG cap, and hence help to find the brain regions where activity varied across the different reading tasks. We exploited scikit-learn’s Random Forest implementation with the default parameters Pedregosa et al. (2011). However, we set the bootstrap parameter to false to use the entire data set to build each tree which led to better classification results in initial experiments.
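A minimal sketch of this selection step (toy data; mapping a feature index back to an electrode is a lookup into the cap montage, which we omit):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_k_features(X, y, k, seed=0):
    """Fit a 100-tree Random Forest (bootstrap=False, as described above)
    and return the indices of the k most important input dimensions."""
    rf = RandomForestClassifier(n_estimators=100, bootstrap=False,
                                random_state=seed)
    rf.fit(X, y)
    return np.argsort(rf.feature_importances_)[::-1][:k]

# Toy check: dimension 2 fully determines the "reading task" label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 2] > 0).astype(int)
print(top_k_features(X, y, k=3))
```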

We extracted the k most informative features for distinguishing between NR and AR according to our Random Forest implementation. We conducted experiments over three different values for k (i.e., k ∈ {15, 45, 90}), and scrutinized the performance of the different k-dimensional embeddings as input to an LSTM Hochreiter and Schmidhuber (1996) for classifying sentences into the respective tasks. This notably reduced the feature space from 4200 dimensions to k, where k ∈ {15, 45, 90}.

We computed t-test bootstrapping for each electrode in the four general frequency bands between NR and AR to inspect which electrodes show higher power spectra in which task (see Figure 2). All electrodes in the left temporal cortex, which is responsible for both language comprehension and production, that showed significantly higher power spectra for AR compared to NR were included in the most predictive features used by the Random Forest to solve the binary classification task. This means that the Random Forest assigned more importance to brain signals that are enhanced during AR. We assume that the stronger EEG activity in these (fronto-)temporal areas of the human brain (as shown qualitatively for a single test subject in Figure 1) is due to the higher cognitive load of AR compared to NR in language comprehension and production areas. However, further investigation must go into this line of research before definite conclusions can be drawn.

3.3 Classifying reading task and reading session using reduced EEG features

                  Val                  Test
Task           15     45     90     15     45     90
NR–AR         100    100   93.8    100   98.4   85.9
Ses1–Ses2    48.4   57.8   50.5    45.    50.   48.8
Table 1: LSTM binary classification accuracies with EEG word embeddings of different dimensions for two tasks, on both the development and test set. Embedding dimensions were selected through Random Forest tree splits.

We test the reduced EEG features in a simple binary classification task in order to assess whether we managed to isolate the cognitive processing of text and remove noise. We therefore use the reduced features to classify NR vs. AR, as well as to classify whether a sentence was read in the first or second half of NR (Sentiment).

LSTM architecture

We use a vanilla LSTM with 1 hidden layer, 50 hidden units per layer, a layer dropout rate of 0.5, and the Adam optimizer Kingma and Ba (2015) with its default learning rate of 0.001. Since the main goal was to predict the two different reading tasks, we minimized binary cross-entropy loss through mini-batch training with a batch size of 32. The model was implemented end-to-end in PyTorch Paszke et al. (2019).


h_t = LSTM(x_t, h_{t-1})

LSTM denotes the LSTM function Hochreiter and Schmidhuber (1996), h_t is the hidden state at time step t, and x_t represents the word input at the current time step, where x_t was embedded in k-dimensional EEG space (k ∈ {15, 45, 90}). We experimented with different values for k, which can be seen in Table 1. Results show that NR and AR appear to result in different brain signals and can thus be easily classified into two distinct classes. This is in line with Figure 2. In contrast, sessions could not be classified into their respective days. Hence, it is likely that our features capture a great amount of cognitive text processing signal and little noise, of which the latter could have been confounded by session.
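The sanity-check classifier can be sketched as follows (a minimal PyTorch implementation under the stated hyperparameters; class and variable names are ours):

```python
import torch
import torch.nn as nn

class EEGReadingTaskClassifier(nn.Module):
    """Vanilla LSTM over k-dimensional EEG word embeddings for the binary
    NR-vs.-AR sanity check (1 layer, 50 hidden units, dropout 0.5)."""
    def __init__(self, k=45, hidden=50, dropout=0.5):
        super().__init__()
        self.lstm = nn.LSTM(k, hidden, num_layers=1, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, seq_len, k)
        _, (h_n, _) = self.lstm(x)        # final hidden state
        return self.out(self.drop(h_n[-1])).squeeze(-1)  # logits

model = EEGReadingTaskClassifier(k=45)
optimizer = torch.optim.Adam(model.parameters())   # default lr = 0.001
criterion = nn.BCEWithLogitsLoss()                 # binary cross-entropy
logits = model(torch.randn(32, 12, 45))            # mini-batch of 32 sentences
loss = criterion(logits, torch.randint(0, 2, (32,)).float())
```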

To visualize whether the Random Forest features are capable of classifying words into their respective tasks in low-dimensional space, we plotted each word that appeared in sentences in NR or AR, using only the two most useful features (Figure 3). The plot shows that even in 2D space, the features reflect the differences between NR and AR well.

Figure 3: Dimensionality reduction. Words embedded in two-dimensional brain space through Random Forest tree splits.

4 Sequence labeling with EEG attention

4.1 EEG features to scalar

Attention scores are scalar values that weigh their respective hidden word representations. To regularize attention, we were thus required to further reduce each k-dimensional word vector represented through EEG activity to a single dimension. We experimented with different approaches, such as averaging across all electrodes or taking the maximum EEG power spectrum per word vector. Initial experiments on the development set revealed that taking the mean over a k-dimensional word vector leads to notably worse results than max-pooling. This might be due to the loss of information about brain activity as a result of averaging.

Task                Source          Train sents   % positive   Dev sents   Test sents
Relation Detection  SemEval 2010          8,096         19.3       1,361        1,372
Relation Detection  Wikipedia             1,733         10.0         361          354
NE Detection        Ontonotes 5.0        89,389         29.7      11,289       11,318
Table 2: Overview of the data sets.
Domain specific scores

The final attention scores were computed by taking the maximum electrode value per k-dimensional word embedding for each frequency band. k was one of {15, 45, 90} for concatenated embeddings, and one of {5, 15, 30} for embeddings per frequency domain. To yield values within the range [0, 1), we normalized each EEG attention score by the maximum attention score of the respective sentence. We observed that dividing each normalized attention score by a small constant c leads to better performance. We assume this is because EEG attention scores are somewhat peaky prior to dividing by c. Thus, the computation was as follows:


a_t = max(x_t) / (c · max_{x_j ∈ S} max(x_j))

x_t denotes a word representation embedded in k-dimensional EEG space at time step t for a sentence S. To compute a_t we did not exploit concatenated EEG embeddings but used isolated EEG embeddings for each of the three frequency domains θ, α, and β. Therefore, k was one of {5, 15, 30}. The constant c was set to a fixed small value. The computed EEG attention scores served as inputs for our multi-task sequence classification model to supervise attention in the auxiliary task. Hence, final attention weights were yielded by passing the EEG attention scores through the softmax function. The latter computation, however, happened automatically during training and was not done externally.
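The computation can be sketched as follows (the value of c here is illustrative only; the paper does not restate its exact value):

```python
import numpy as np

def eeg_attention_scores(sent_eeg, c=2.0):
    """sent_eeg: (n_words, k) EEG embeddings of one sentence for a single
    frequency band. Max-pool over electrodes per word, normalise by the
    sentence maximum, then divide by a constant c to de-peak the scores
    before the in-model softmax."""
    pooled = sent_eeg.max(axis=1)        # max electrode value per word
    normed = pooled / pooled.max()       # scale by the sentence maximum
    return normed / c

sent = np.array([[0.2, 0.9, 0.1],
                 [0.4, 0.3, 0.2],
                 [0.1, 0.45, 0.3]])      # 3 words, k = 3
scores = eeg_attention_scores(sent)
```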


Max-pooled EEG attention scores for two example sentences read in NR and AR respectively are depicted in Figure 4. We observe that brain-activity transitions between words are smoother in NR compared to AR. For AR in particular, there are higher activations for words relating to the relation award.

(a) Normal Reading (NR)
(b) Annotation Reading (AR)
Figure 4: Max-pooled EEG attention scores (averaged over all 12 participants) for two example sentences read in Normal Reading (NR) and Annotation Reading (AR) respectively.

4.2 Model

The model is an adaptation of Rei and Søgaard (2018) as leveraged by Barrett et al. (2018). It is a biLSTM architecture that jointly learns the model parameters and the attention function by alternating training Luong et al. (2016), closely related to multi-task learning Dong et al. (2015); Søgaard and Goldberg (2016). The inputs are token-level labelled sequences of EEG scalars to learn the attention function (auxiliary task) and a set of sentence-level labelled sequences for training the model parameters (main task).

If the data point is from the main task, we perform normal training and update the model parameters by comparing the model’s class prediction (ŷ) against the true label (y) on the sentence level.


If the data point is sampled from the EEG corpus, however, we do not update the model parameters. We only modify the attention weights by minimizing the squared error between the EEG attention score a_t and the model’s attention weight â_t:

L_attn = Σ_t (â_t − a_t)²
The model has no access to EEG signals during testing.
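The alternating scheme can be sketched as follows (a toy attention model stands in for the biLSTM; all names are ours, and the auxiliary loss ignores the classifier, so the classifier receives no gradient on EEG batches):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAttentionModel(nn.Module):
    """Toy stand-in: score tokens, softmax-attend, classify the pooled sum."""
    def __init__(self, dim=8):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # attention function (auxiliary task)
        self.clf = nn.Linear(dim, 1)      # sentence classifier (main task)
        self.attention_weights = None

    def forward(self, x):                 # x: (batch, seq, dim)
        a = F.softmax(self.scorer(x).squeeze(-1), dim=-1)
        self.attention_weights = a
        return self.clf((a.unsqueeze(-1) * x).sum(dim=1)).squeeze(-1)

def training_step(model, batch, optimizer):
    """Main batches: binary cross-entropy on the sentence label, updating all
    parameters. EEG batches: squared error between attention weights and the
    EEG scores, so only the attention function is fit."""
    optimizer.zero_grad()
    logits = model(batch["tokens"])
    if batch["task"] == "main":
        loss = F.binary_cross_entropy_with_logits(logits, batch["label"])
    else:
        loss = ((model.attention_weights - batch["eeg"]) ** 2).sum()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinyAttentionModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(2, 4, 8)
l_main = training_step(model, {"task": "main", "tokens": x,
                               "label": torch.ones(2)}, opt)
l_aux = training_step(model, {"task": "eeg", "tokens": x,
                              "eeg": torch.full((2, 4), 0.25)}, opt)
```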

4.3 Experiments

Recall from 3.1 that participants read sentences from the Wikipedia relation extraction corpus Culotta et al. (2006). We evaluate on three widely used English Relation Extraction and NER benchmark data sets, comparing against a baseline model without supervised attention and against models whose attention is supervised either through ET data, as in Barrett et al. (2018), or through BNC word frequencies Kilgarriff (1995). The ET and BNC frequencies are similar to those used by Barrett et al. (2018), who concatenate ZuCo with an even larger eye-tracking corpus, the Dundee Corpus Kennedy et al. (2003). We perform binary classification and adapt all datasets as described below to obtain a sentence-level label. As such, the main task was a one-vs.-the-rest binary sentence classification task. Overall statistics about the data sets are displayed in Table 2.
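The one-vs.-the-rest adaptation can be sketched as follows (the positive relation set follows the SemEval setup described below; the helper name is ours):

```python
def binarize_relations(relations,
                       positive=("Entity-Origin", "Entity-Destination")):
    """One-vs.-the-rest sentence labels: 1 if the sentence's relation is in
    the positive set, else 0."""
    pos = set(positive)
    return [int(r in pos) for r in relations]

labels = binarize_relations(
    ["Cause-Effect", "Entity-Origin", "Message-Topic"])
```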

4.3.1 Relation detection

            SemEval 2010               Wiki                     Ontonotes
Attention   Precision Recall F1       Precision Recall F1      Precision Recall F1
baseline 80.03 63.21 70.56 54.44 55.00 54.67 88.90 64.46 74.72
BNCFreqInv 78.30 58.00 66.52 61.39 60.00 60.64 91.56 67.38 77.61
MeanFixCont 79.59 60.36 68.62 59.99 58.00 58.75 92.21 66.59 77.33
k NR () 77.90 65.29 70.98 58.76 51.00 54.38 90.41 67.84 77.51
k NR () 79.70 61.21 69.14 65.47 57.00 60.74 91.02 67.18 77.28
k NR () 79.66 63.93 70.91 54.58 51.00 52.52 91.25 67.36 77.50
k NR () 78.58 64.43 70.79 61.09 57.00 58.82 91.85 67.00 77.47
k NR () 79.91 60.86 69.04 58.67 55.00 56.65 91.66 67.10 77.47
k NR () 77.88 64.57 70.53 52.83 49.00 50.77 91.04 67.68 77.63
k NR () 80.32 61.14 69.33 56.88 53.00 54.64 92.25 66.68 77.40
k AR () 79.66 59.64 68.18 60.25 49.00 53.80 91.19 66.84 77.13
k AR() 79.55 63.00 70.29 56.42 49.00 52.42 91.01 67.01 77.18
k AR () 79.02 64.57 71.04 56.70 60.00 57.75 90.97 67.17 77.23
k AR () 79.15 63.79 70.60 57.16 51.00 53.60 90.96 67.00 77.16
k AR () 79.14 64.43 71.01 53.78 52.00 52.73 90.57 67.30 77.20
k AR() 79.47 62.00 69.63 60.63 54.00 56.71 91.67 66.92 77.36
k AR () 79.96 64.50 71.34 59.63 52.00 55.16 91.03 66.87 77.10
Table 3: Relation detection and named entity detection. Results in %. Best scores per metric are displayed in bold face.
SemEval 2010

We used the SemEval 2010 Task 8 data set that defines the task as a multi-way classification of semantic relations between pairs of nominals Hendrickx et al. (2010). The data set contains the following nine distinct relations: Cause-Effect, Instrument-Agency, Product-Producer, Content-Container, Entity-Origin, Entity-Destination, Component-Whole, Member-Collection and Message-Topic. Each sentence that contained the relations Entity-Origin or Entity-Destination served as a positive example of this relation. We have chosen those relations due to their higher frequency compared to other relations. Positive examples were tested against sentences that consisted of one of the remaining seven relations.


Wikipedia

The data set provided by Culotta et al. (2006) contains Wikipedia articles labeled with 53 relation types. Since part of this dataset is included in ZuCo, we filtered out those sentences. We chose the sentences including the most frequent relation, job title, as positive examples.

4.4 Named entity detection

Ontonotes 5.0

We use the four CoNLL-2003 NER labels PER, LOC, ORG, MISC in the Ontonotes 5.0 data set Weischedel et al. (2013) as positive examples.

5 Results

All scores are averages over five random seeds.

SemEval 2010

Results are shown in Table 3, where we observe that EEG attention scores clearly help to solve the task. The best EEG-augmented model, k AR (), outperforms all baselines. The most notable improvements are mainly due to higher recall. Precision scores appear similar across all models, although the BNC word frequency augmented model shows slightly lower precision than the rest.


Wikipedia

Results are also in Table 3. The F1 score of the best EEG-augmented model, k NR (), outperformed all baselines. The BNC-augmented model provides a strong baseline, which could be explained both by the significantly larger number of data points available for BNC word frequencies and by the fact that word frequencies correspond strongly to entities, which are crucial for detecting relations.


Ontonotes 5.0

Table 3 shows that all EEG-augmented models outperform the baseline by a small margin. The performance improvement for recall is again notable. Precision scores appear comparable across all models but the baseline. The best model is k NR ().

6 Discussion

Human brain activity appears to help machine attention attend to the most crucial words in a sentence. However, there are differences between the exploited EEG attention scores, measured as the improvement in performance on a particular task. This depends on both the frequency domain and the reading task of the EEG signals, as well as on the number of features captured in the embeddings. In general, we observe smaller performance gains for NER compared to Relation Extraction. We suppose this is because NER results are already fairly strong for English language data Lample et al. (2016); Hollenstein and Zhang (2019); hence, additional support through cognitive data cannot enhance performance much.

Reading tasks

Brain activity extracted from sentences read in NR appears to be more useful for detecting named entities than EEG signals from AR. This is not surprising, since in an NR setting participants read sentences without any additional task to perform (see 3.1), and therefore read each sentence until its end Hollenstein et al. (2018). On the other hand, brain signals distilled from AR better help the model detect relations in a given sentence. This might be because readers were required to search for the respective relation while reading the sentence. Thus, participants paid particular attention to the decisive words that form the relation and skipped the rest (see Figure 4).

Frequency domains

Performance enhancement differed depending on the frequency domain from which the EEG attention scores were extracted. Both the θ and α frequency bands show more useful signals and lead to better performance compared to β across all tasks. Lower frequency bands such as θ (4–8 Hz) and α (8.5–13 Hz) are linked to cognitive control Williams et al. (2019) and attentiveness Klimesch (2012), respectively. This might explain why brain signals from those domains are particularly useful for guiding machine attention. Higher frequency domains such as β (13.5–30 Hz) and γ (30.5–49.5 Hz), however, are linked to motor activities Pogosyan et al. (2009) and enhanced emotional responses Li and Lu (2009); Oathes et al. (2008), which explains why EEG power spectra in AR increase with the hertz rate (see Figure 2) but are less useful for supervising machine attention than brain signals from lower frequency bands (see Table 3).


Embedding dimensions

We tested EEG attention scores that were max-pooled over individual frequency domain embeddings of different dimensionality. Overall, the attention scores distilled from 15- and 30-dimensional embeddings carry slightly more informative signals than attention scores extracted from 5-dimensional embeddings (see Table 3). The difference, however, is marginal, and for the Wikipedia relation detection, max-pooled attention scores over 5-dimensional embeddings outperform all other supervision signals. Conversely, to classify the word-level EEG signals of sentences into their respective reading tasks, lower-dimensional embeddings lead to better performance than higher-dimensional embeddings (see Table 1). The latter is not surprising, since low-dimensional embeddings contain the EEG signals that differ the most between NR and AR. The higher the dimensionality of the embeddings, the more EEG signals are captured that differ less between the two reading tasks.

Improvements despite little data

What is compelling is the fact that we exploited little EEG data to create the attention scores: merely 300 sentences for NR and 407 sentences for AR, averaged over 12 participants. Both the ET and BNC frequency attention scores have many times more sentences for training the auxiliary task of the model. Even with so few EEG data samples, useful signals could be extracted that help neural networks understand language in a manner similar to the human brain. We assume that more data will lead to even higher performance gains, and we plan studies to investigate this.

7 Conclusions

We presented the first study that leverages EEG activity to inform machine attention about the language processing mechanisms of the human brain. This is compelling for three reasons: First, we successfully isolated the text processing signals from noisy EEG data, which considerably reduced their dimensionality. Second, we demonstrated that even a small number of EEG data points from human readers can benefit multi-task neural models for sequence classification. Note that the extracted attention scores may be exploited in various neural architectures that employ any form of attention to solve NLP tasks. Third, we showed that downstream performance varies as a function of both cognitive load and EEG frequency domain. This might have decisive implications for which EEG signals to inject into neural models. We expect that more data will provide deeper and more thorough insights along this avenue.
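To illustrate how such attention scores can supervise a model, one common scheme adds a penalty on the distance between the network's attention distribution and the human-derived scores (a minimal sketch; the loss shape and the weight `lam` are assumptions for illustration, not necessarily the exact formulation used here):

```python
import numpy as np

def regularised_loss(task_loss, model_attention, eeg_attention, lam=0.1):
    """Augment a task loss with the mean squared error between the model's
    attention distribution and human-derived EEG attention scores.
    lam is a hypothetical hyperparameter weighting the attention penalty."""
    attn_mse = np.mean((model_attention - eeg_attention) ** 2)
    return task_loss + lam * attn_mse

model_attn = np.array([0.1, 0.6, 0.3])  # toy attention over a 3-word sentence
eeg_attn = np.array([0.2, 0.5, 0.3])    # toy EEG-derived attention scores
loss = regularised_loss(1.25, model_attn, eeg_attn, lam=0.1)
assert loss > 1.25  # the penalty is non-negative
```

Because the penalty only touches the attention layer, this kind of supervision plugs into any architecture that produces a normalised attention distribution.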


Acknowledgments

We would like to thank Wolfgang Ganglberger, Martin Hebart, Joachim Bingel, and Desmond Elliot for fruitful comments on earlier versions of the paper.


References

  • P. Antonenko, F. Paas, R. Grabner, and T. Van Gog (2010) Using electroencephalography to measure cognitive load. Educational Psychology Review 22 (4), pp. 425–438. Cited by: §1.
  • G. K. Anumanchipalli, J. Chartier, and E. F. Chang (2019) Speech synthesis from neural decoding of spoken sentences. Nature 568 (7753), pp. 493. Cited by: §1.
  • M. Barrett, J. Bingel, N. Hollenstein, M. Rei, and A. Søgaard (2018) Sequence classification with human attention. In Proceedings of the 22nd Conference on Computational Natural Language Learning, CoNLL 2018, Brussels, Belgium, October 31 - November 1, 2018, A. Korhonen and I. Titov (Eds.), pp. 302–312. External Links: Link, Document Cited by: §1, §2, §2.
  • M. Barrett, J. Bingel, F. Keller, and A. Søgaard (2016) Weakly supervised part-of-speech tagging using eye-tracking data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers, External Links: Link, Document Cited by: §1, §2.
  • L. Breiman (2001) Random forests. Mach. Learn. 45 (1), pp. 5–32. External Links: Link, Document Cited by: §3.2.
  • A. Culotta, A. McCallum, and J. Betz (2006) Integrating probabilistic extraction models and data mining to discover relations and patterns in text. In Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 4-9, 2006, New York, New York, USA, R. C. Moore, J. A. Bilmes, J. Chu-Carroll, and M. Sanderson (Eds.), External Links: Link Cited by: §3.1, §3.1, §4.3.1, §4.3.
  • A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra (2016) Human attention in visual question answering: do humans and deep networks look at the same regions?. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, J. Su, X. Carreras, and K. Duh (Eds.), pp. 932–937. External Links: Link, Document Cited by: §2.
  • D. Dong, H. Wu, W. He, D. Yu, and H. Wang (2015) Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pp. 1723–1732. External Links: Link, Document Cited by: §4.2.
  • D. Drieghe, K. Rayner, and A. Pollatsek (2005) Eye movements and word skipping during reading revisited.. Journal of Experimental Psychology: Human Perception and Performance 31 (5), pp. 954. Cited by: §1.
  • G. Fitzsimmons and D. Drieghe (2011) The influence of number of syllables on word skipping during reading. Psychonomic Bulletin & Review 18 (4), pp. 736–741. Cited by: §1.
  • J. Gauthier and R. Levy (2019) Linking artificial and human neural representations of language. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 529–539. External Links: Link, Document Cited by: §2.
  • I. Hendrickx, S. N. Kim, Z. Kozareva, P. Nakov, D. Ó. Séaghdha, S. Padó, M. Pennacchiotti, L. Romano, and S. Szpakowicz (2010) SemEval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval@ACL 2010, Uppsala University, Uppsala, Sweden, July 15-16, 2010, K. Erk and C. Strapparava (Eds.), pp. 33–38. External Links: Link Cited by: §4.3.1.
  • S. Hochreiter and J. Schmidhuber (1996) LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems 9, NIPS, Denver, CO, USA, December 2-5, 1996, M. Mozer, M. I. Jordan, and T. Petsche (Eds.), pp. 473–479. External Links: Link Cited by: §3.2, §3.3.
  • N. Hollenstein, M. Barrett, M. Troendle, F. Bigiolli, N. Langer, and C. Zhang (2019) Advancing NLP with cognitive language processing signals. CoRR abs/1904.02682. External Links: Link, 1904.02682 Cited by: §1, §1, §2, §3.2.
  • N. Hollenstein, A. de la Torre, N. Langer, and C. Zhang (2019) CogniVal: a framework for cognitive word embedding evaluation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Hong Kong, China, pp. 538–549. External Links: Link, Document Cited by: §2.
  • N. Hollenstein, J. Rotsztejn, M. Troendle, A. Pedroni, C. Zhang, and N. Langer (2018) ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading. Scientific data 5, pp. 180291. Cited by: 1st item, §3.1, §3.1, §6.
  • N. Hollenstein and C. Zhang (2019) Entity recognition at first sight: improving NER with eye movement information. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 1–10. External Links: Link, Document Cited by: §2, §3.2, §6.
  • A. Kennedy, R. Hill, and J. Pynte (2003) The Dundee Corpus. In Poster presented at ECEM12: 12th European Conference on Eye Movements, Cited by: §4.3.
  • A. Kilgarriff (1995) BNC database and word frequency lists. http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/bnc-readme.html. Cited by: §4.3.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §3.3.
  • W. Klimesch (2012) Alpha-band oscillations, attention, and controlled access to stored information. Trends in cognitive sciences 16 (12), pp. 606–617. Cited by: §6.
  • N. Kumar and J. Kumar (2016) Measurement of cognitive load in HCI systems using EEG power spectrum: an experimental study. Procedia Computer Science 84, pp. 70–78. Cited by: §1.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, K. Knight, A. Nenkova, and O. Rambow (Eds.), pp. 260–270. External Links: Link, Document Cited by: §6.
  • M. Li and B. Lu (2009) Emotion classification based on gamma-band EEG. In 2009 Annual International Conference of the IEEE Engineering in medicine and biology society, pp. 1223–1226. Cited by: §3.2, §6.
  • Q. Luo, D. Mitchell, X. Cheng, K. Mondillo, D. Mccaffrey, T. Holroyd, F. Carver, R. Coppola, and J. Blair (2008) Visual awareness, emotion, and gamma band synchronization. Cerebral cortex 19 (8), pp. 1896–1904. Cited by: §3.2.
  • M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser (2016) Multi-task sequence to sequence learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §4.2.
  • A. Mishra, K. Dey, and P. Bhattacharyya (2017) Learning cognitive features from gaze data for sentiment and sarcasm classification using convolutional neural network. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, R. Barzilay and M. Kan (Eds.), pp. 377–387. External Links: Link, Document Cited by: §1, §2.
  • A. Mishra, D. Kanojia, S. Nagar, K. Dey, and P. Bhattacharyya (2016) Leveraging cognitive features for sentiment analysis. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, Y. Goldberg and S. Riezler (Eds.), pp. 156–166. External Links: Link, Document Cited by: §1, §2.
  • D. J. Oathes, W. J. Ray, A. S. Yamasaki, T. D. Borkovec, L. G. Castonguay, M. G. Newman, and J. Nitschke (2008) Worry, generalized anxiety disorder, and emotion: evidence from the eeg gamma band. Biological psychology 79 (2), pp. 165–170. Cited by: §3.2, §6.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §3.3.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §3.2.
  • A. Pedroni, A. Bahreini, and N. Langer (2019) Automagic: standardized preprocessing of big EEG data. NeuroImage 200, pp. 460–473. External Links: Link, Document Cited by: §3.1.
  • A. Pogosyan, L. D. Gaynor, A. Eusebio, and P. Brown (2009) Boosting cortical activity at beta-band frequencies slows movement in humans. Current biology 19 (19), pp. 1637–1641. Cited by: §6.
  • K. Rayner, T. J. Slattery, D. Drieghe, and S. P. Liversedge (2011) Eye movements and word skipping during reading: effects of word length and predictability.. Journal of Experimental Psychology: Human Perception and Performance 37 (2), pp. 514. Cited by: §1.
  • D. Schwartz, M. Toneva, and L. Wehbe (2019) Inducing brain-relevant bias in natural language processing models. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 14100–14110. External Links: Link Cited by: §1, §2.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1631–1642. External Links: Link Cited by: §3.1.
  • A. Søgaard and Y. Goldberg (2016) Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers, External Links: Link, Document Cited by: §4.2.
  • M. Strzyz, D. Vilares, and C. Gómez-Rodríguez (2019) Towards making a dependency parser see. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 1500–1506. External Links: Link, Document Cited by: §2, §2.
  • M. Toneva and L. Wehbe (2019) Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain). In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 14928–14938. External Links: Link Cited by: §1, §2.
  • R. Weischedel, M. Palmer, M. Marcus, E. Hovy, S. Pradhan, L. Ramshaw, N. Xue, A. Taylor, J. Kaufman, M. Franchini, et al. (2013) Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium. Cited by: §4.4.
  • C. C. Williams, M. Kappen, C. D. Hassall, B. Wright, and O. E. Krigolson (2019) Thinking theta and alpha: mechanisms of intuitive and analytical reasoning. NeuroImage 189, pp. 574–580. External Links: Link, Document Cited by: §6.
  • A. G. Wilson, C. Dann, C. G. Lucas, and E. P. Xing (2015) The human kernel. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2854–2862. External Links: Link Cited by: §1.
  • P. Zarjam, J. Epps, and F. Chen (2011) Spectral EEG features for evaluating cognitive load. In 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 3841–3844. Cited by: §1.
  • W. Zheng and B. Lu (2015) Investigating critical frequency bands and channels for eeg-based emotion recognition with deep neural networks. IEEE Trans. Auton. Ment. Dev. 7 (3), pp. 162–175. External Links: Link, Document Cited by: §3.2.