Word embeddings are the cornerstones of state-of-the-art NLP models. Distributional representations, which interpret words, phrases, and sentences as high-dimensional vectors in semantic space, have become increasingly popular. These vectors are obtained by training language models on large corpora to encode contextual information. Each vector represents the meaning of a word.
Evaluating and comparing the quality of different word embeddings is a well-known, largely open challenge. Currently, word embeddings are evaluated with extrinsic or intrinsic methods. Extrinsic evaluation is the process of assessing the quality of the embeddings based on their performance on downstream NLP tasks, such as question answering or entity recognition. However, embeddings can be trained and fine-tuned for specific tasks, so good task performance does not mean that they accurately reflect the meaning of words.
On the other hand, intrinsic evaluation methods, such as word similarity and word analogy tasks, merely test single linguistic aspects. These tasks are based on conscious human judgements. Conscious judgements can be biased by subjective factors and the tasks themselves might also be biased nissim2019fair. Additionally, the correlation between intrinsic and extrinsic metrics is not very clear, as intrinsic evaluation results fail to predict extrinsic performance chiu2016intrinsic; gladkova2016intrinsic. Finally, both intrinsic and extrinsic evaluation types often lack statistical significance testing and do not provide a global quality score.
In this paper, we focus on the intrinsic subconscious evaluation method bakarov2018survey, which evaluates English word embeddings against the lexical representations of words in the human brain, recorded when passively understanding language. Cognitive lexical semantics proposes that words are defined by how they are organized in the brain miller1992wordnet. As a result, brain activity data recorded from humans processing language is arguably the most accurate mental lexical representation available sogaard2016evaluating. Recordings of brain activity play a central role in furthering our understanding of how human language works. To accurately encode the semantics of words, we believe that embeddings should reflect this mental lexical representation.
Evaluating word embeddings with cognitive language processing data has been proposed previously. However, the available datasets are not large enough for powerful machine learning models, the recording technologies produce noisy data, and most importantly, only a few datasets are publicly available. Furthermore, since brain activity and eye-tracking data contain very noisy signals, correlating distances between representations does not provide sufficient statistical power to compare embedding types frank2017word. For this reason we evaluate the embeddings by exploring how well they can predict human processing data. We build on sogaard2016evaluating’s theory of evaluating embeddings with this task-independent approach based on cognitive lexical semantics and examine its effectiveness. The design of our framework follows three principles:
Multi-modality: Evaluate against various modalities of recording human signals to counteract the noisiness of the data.
Diversity within modalities: Evaluate against different datasets within one modality to make sure the number of samples is as large as possible.
Correlation of results should be evident across modalities and even between datasets of the same modality.
We present CogniVal, the first framework of cognitive word embedding evaluation to follow these principles and analyze the findings. We evaluate different embedding types against a combination of 15 cognitive data sources, acquired via three modalities: eye-tracking, electroencephalography (EEG) and functional magnetic resonance imaging (fMRI). The word representations are evaluated by assessing their ability to predict cognitive language processing data. After fitting a neural regression model for each combination, we apply multiple hypotheses testing to measure the statistical significance of the results, taking into account multiple comparisons (see Figure 1). This contributes to the consistency of the results and helps to attain a global score of embedding quality. Our main findings when evaluating six state-of-the-art word embeddings with CogniVal show that the majority of embedding types significantly outperform a baseline of random embeddings when predicting a wide range of cognitive features. Additionally, the results show consistent correlations between datasets of the same modality and across different modalities, validating the intuition of our approach. Finally, we present an exploratory but promising correlation analysis between the scores obtained using our intrinsic evaluation methods and the performance on extrinsic NLP tasks.
The code of this evaluation framework is openly available (https://github.com/DS3Lab/cognival). It can be used as is, or in combination with other intrinsic as well as extrinsic evaluation methods for word representations.
2 Related Work
mitchell2008predicting pioneered the use of word embeddings to predict patterns of neural activation when subjects are exposed to isolated word stimuli. More recently, this dataset and other fMRI resources have been used to evaluate learned word representations.
For instance, abnar2018experiential and rodrigues2018predicting evaluate different embeddings by predicting the neuronal activity from the 60 nouns presented by mitchell2008predicting. sogaard2016evaluating shows preliminary results in evaluating embeddings against continuous text stimuli in eye-tracking and fMRI data. Moreover, beinborn2019robust recently presented an extensive set of language–brain encoding experiments. Specifically, they evaluated the ability of an ELMo language model to predict brain responses of multiple fMRI datasets.
EEG data has been used for similar purposes. schwartz2019understanding and ettinger2016modeling show that components of event-related potentials can successfully be predicted with neural network models and word embeddings.
However, these approaches mostly focus on one modality of brain activity data from small individual cognitive datasets. The lack of data sources has been one reason why this type of evaluation has not been too popular until now bakarov2018can. Hence, in this work we collected a wide range of cognitive data sources ranging from eye-tracking to EEG and fMRI to ensure coverage of different features, and consequently of the cognitive processing taking place in the human brain during reading.
Evidence from cognitive neuroscience
murphy2018decoding review computational approaches to the study of language with neuroimaging data and show how different types of words activate neurons in different brain regions. Similarly, mapping fMRI data from subjects listening to stories to the activated brain regions revealed semantic maps of how words are distributed across the human cerebral cortex huth2016natural.
Furthermore, word predictability and semantic similarity show distinct patterns of brain activity during language comprehension: semantic distances can have neurally distinguishable effects during language comprehension frank2017wordpred. These findings support the theory that brain activity data does reflect lexical semantics and is thus an appropriate foundation for evaluating the quality of word embeddings.
3 Word embeddings
Pre-trained word vectors are an essential component in state-of-the-art NLP systems. We chose six commonly used pre-trained embeddings to evaluate against the cognitive data sources. See Table 1 for an overview of the dimensions of each embedding type. We evaluate the following types of word embeddings:
|embeddings||dim.||hidden layer units|
|Glove||50||[30, 26, 20, 5]|
Glove: pennington2014glove provide embeddings of different dimensions trained on aggregated global word-word co-occurrence statistics over a corpus of 6 billion words.
Word2vec: Non-contextual embeddings trained on 100 billion words from a Google News dataset mikolov2013distributed.
WordNet2Vec: These embeddings represent the conversion from semantic networks into semantic spaces saedi2018wordnet. They are trained on WordNet, a lexical ontology for English that comprises over 155,000 lemmas (but trained only on 60,000 words).
FastText: These pre-trained embeddings use character n-grams to compose the vector of the full words mikolov2018advances. We evaluate the embeddings with and without subword information trained on 16 billion tokens of Wikipedia sentences as well as the ones trained on 600 billion tokens of Common Crawl.
ELMo models both complex characteristics of word use (i.e. syntax and semantics), and how these uses vary across linguistic contexts peters2018deep. These word vectors are learned functions of the internal states of a deep bidirectional language model, which is pre-trained on a large text corpus. We take the first of the three output layers, containing the context insensitive word representations.
BERT embeddings are contextual, bidirectional word representations, based on the idea that fine-tuning a pre-trained language model can help the model achieve better results in the downstream tasks devlin2019bert. We take the hidden states of the second to last of 12 output layers as the representation for each token.
4 Cognitive data
|all eye-tracking (aggregated)||text||-||26353||16419||88%|
|Natural Speech broderick2018electrophysiological||speech||19||12000||1625||98%|
|Harry Potter wehbe2014aligning||text||8||5169||1295||92%|
In this paper, we consider three modalities of recording cognitive language processing signals: eye-tracking, electroencephalography (EEG), and functional magnetic resonance imaging (fMRI). All three methods are complementary in terms of temporal and spatial resolution as well as the directness in the measurement of neural activity mulert2013simultaneous. For the word embedding evaluation we selected a wide range of datasets from these three modalities to ensure a more diverse and accurate representation of the brain activity during language processing.
Table 3 shows an overview of the cognitive data sources used, which are described in more detail below. Since the processing in the brain differs depending on whether the information is accessed via the visual or auditory system price2012review, we include data of different stimuli, e.g. participants reading sentences or listening to audio-books. Moreover, our collection of cognitive data sources contains datasets of both isolated (single words) and continuous (words in context, i.e. sentences or stories) stimuli. All datasets include English language stimuli and the participants were native speakers or highly proficient.
Eye-tracking is an indirect measure of cognitive activity. Gaze patterns are highly correlated with the cognitive load associated with different stages of human text processing rayner1998eye. For instance, fixation duration is higher for long, infrequent and unfamiliar words just1980theory.
All eye-tracking datasets used in this work were recorded from natural, self-paced reading. Each dataset provides different eye-tracking features. The most common features, available in all 7 datasets are: first fixation duration, first pass duration, mean fixation duration, total fixation duration and number of fixations. For a complete list and description of the eye-tracking features available in each corpus see Appendix A.1.
Gaze vectors consist of specific features, which are extracted based on the reading times, fixations and regressions on each word. Feature values are aggregated on word type level and scaled between 0 and 1. The feature values were averaged over all subjects within a dataset. This preprocessing step is done separately for each data source before combining them. hollenstein2019entity show that combining gaze data from different sources can be helpful for NLP applications, even when they are recorded with different devices and filtering.
By using as many features as available from each dataset, including features characterizing basic, early and late word processing aspects, the goal is to cover the whole language understanding process on word level.
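The aggregation of a single gaze feature described above can be sketched as follows. This is a minimal illustration, not the CogniVal implementation: the function name `aggregate_gaze_feature` and the `(subject, word, value)` record layout are hypothetical, and pooling all observations of a word type before averaging is an assumption about the order of the averaging steps.

```python
from collections import defaultdict


def aggregate_gaze_feature(records):
    """Aggregate one eye-tracking feature of one dataset on word type level.

    records: iterable of (subject_id, word, value) tuples (hypothetical layout).
    Returns {word_type: value}, min-max scaled to the range [0, 1].
    """
    per_word = defaultdict(list)
    for _subject, word, value in records:
        # Pool all observations of a word type across subjects and occurrences
        # (assumption: a plain mean over the pooled values).
        per_word[word].append(value)

    means = {word: sum(vals) / len(vals) for word, vals in per_word.items()}

    lo, hi = min(means.values()), max(means.values())
    span = hi - lo
    # Guard against a constant feature, where min-max scaling is undefined.
    return {word: (m - lo) / span if span else 0.0 for word, m in means.items()}
```

Running this once per feature and per data source, before any datasets are combined, mirrors the per-source preprocessing described in the text.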
Electroencephalography records electrical activity from the brain. It measures voltage fluctuations through the scalp with high temporal resolution. hauk2004effects present evidence for the modulation of early electrophysiological brain responses by word frequency. This is evidence that lexical access from written word stimuli is an early process that follows stimulus presentation by less than 200 ms.
The EEG datasets used in this work were either recorded from reading sentences or listening to natural speech. Word-level brain activity could be extracted by mapping to eye-tracking cues (ZuCo), by mapping to auditory triggers (Natural Speech), by recording only the last word in each sentence (N400), or through serial presentation of the words (UCL). Standard preprocessing steps for EEG data, including band-pass filtering and artifact removal, are performed in the same manner for all four data sources. See Appendix A.2 for details on EEG preprocessing.
The EEG data is aggregated over all available subjects and over all occurrences of a token. This yields an n-dimensional vector, where n is the number of electrodes, ranging from 32 to 130, depending on the EEG device used to record the data.
EEG data can be aggregated over all subjects within one dataset, because the number and locations of electrodes are identical. However, due to the differences in the number of electrodes between datasets, we cannot aggregate over all EEG datasets.
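As a rough sketch of this aggregation step, the snippet below averages electrode vectors over all subjects and all occurrences of a token within one dataset. The function name `aggregate_eeg` and the input layout are hypothetical; the check on vector length reflects the constraint stated above that electrode counts must match for averaging to be meaningful.

```python
from collections import defaultdict


def aggregate_eeg(samples):
    """Average EEG vectors per token within a single dataset.

    samples: iterable of (subject_id, token, vector), where vector holds one
    value per electrode (hypothetical layout).
    Returns {token: mean vector over all subjects and occurrences}.
    """
    sums = {}
    counts = defaultdict(int)
    for _subject, token, vector in samples:
        if token not in sums:
            sums[token] = [0.0] * len(vector)
        if len(vector) != len(sums[token]):
            # Vectors from devices with different electrode counts cannot be averaged.
            raise ValueError("electrode count differs within the dataset")
        for i, value in enumerate(vector):
            sums[token][i] += value
        counts[token] += 1
    return {tok: [s / counts[tok] for s in vec] for tok, vec in sums.items()}
```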
Functional magnetic resonance imaging is a technique for measuring and mapping brain activity by detecting changes associated with blood flow. fMRI has a temporal resolution of two seconds, which means that with continuous stimuli such as natural reading or story listening, one scan covers multiple words. We use datasets of isolated stimuli (e.g. the Nouns dataset) as well as continuous stimuli (e.g. Harry Potter). While it is easier to extract word-level signals from isolated stimuli, continuous stimuli allow extracting signals in context over a wider vocabulary.
Where multiple trials were available, the brain activation for each word is calculated by taking the mean over the scans. Moreover, if the stimulus is continuous (Harry Potter and Alice datasets), the data is aligned with an offset of four seconds to account for hemodynamic delay. (The fMRI signal measures a brain response to a stimulus with a delay of a few seconds, and it decays slowly over a duration of several seconds miezin2000characterizing.) For continuous stimuli, this means that the response to previous stimuli will have an influence on the current signal. Thus, context of the previous words is taken into account.
fMRI data contains representations of neural activity of millimeter-sized cubes called voxels. Standard fMRI preprocessing methods such as motion correction, slice timing correction and co-registration had already been applied before. To select the voxels to be predicted we use the pipeline provided by beinborn2019robust. This pipeline consists of extracting corresponding scan(s) for each word, and randomly selecting 100, 500 and 1000 voxels (for the Harry Potter, Pereira and Nouns datasets). The published version of the Alice dataset provided the preprocessed signal averaged for six regions of interest, hence for this particular dataset we predict the activation for these regions only. Appendix A.3 contains the details of the preprocessing steps. Finally, the fMRI data is converted to n-dimensional vectors, where n is the number of randomly selected voxels (100, 500 or 1000) or regions (6).
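The random voxel selection can be illustrated with a short sketch. This is not the pipeline of beinborn2019robust; the function name `select_random_voxels`, the dictionary layout and the fixed seed are assumptions made for the example. The one property it aims to show is that the same voxel indices are used for every word, so that the resulting vectors are comparable.

```python
import random


def select_random_voxels(scans, n_voxels, seed=0):
    """Reduce each word's scan to the same randomly chosen subset of voxels.

    scans: {word: list of voxel activations}, all lists of equal length
    (hypothetical layout).
    Returns ({word: reduced vector}, sorted list of selected voxel indices).
    """
    n_total = len(next(iter(scans.values())))
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    indices = sorted(rng.sample(range(n_total), n_voxels))
    reduced = {word: [vec[i] for i in indices] for word, vec in scans.items()}
    return reduced, indices
```

Calling it with `n_voxels` set to 100, 500 or 1000 would yield the three vector sizes mentioned above.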
5 Embedding evaluation method
In order to evaluate the word embeddings against human lexical representations, we fit the embeddings to a wide range of cognitive features, i.e. eye-tracking features and activation levels of EEG and fMRI. This section describes how these models were trained and evaluated. After evaluating each combination separately, we test for statistical significance taking into account the multiple comparisons problem. See Figure 1 for an overview of the evaluation process.
We fit neural regression models to map word embeddings to cognitive data sources. Predicting multiple features from different sources and modalities allows us to evaluate different aspects of capturing the semantics of a word. Hence, separate models are trained for all combinations. For instance, fitting FastText embeddings to EEG vectors from ZuCo, or fitting ELMo embeddings to first fixation durations of the Dundee corpus.
For the regression models, we train neural networks with k input dimensions, one dense hidden layer of n nodes using ReLU activation and an output layer of m nodes using linear activation. The model is a multiple regression with layers of dimension k-n-m, where k is the number of dimensions of the word embeddings and m changes depending on the cognitive data source to be predicted. For predicting single eye-tracking features m equals 1, whereas for predicting EEG or fMRI vectors m is the dimension of the cognitive data vector, or more specifically, the number of electrodes in the EEG data or the number of voxels in the fMRI data. Figure 2 shows this neural architecture. The model is trained to minimize the mean squared error (MSE) using the Adam optimizer with a learning rate of 0.001.
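To make the k-n-m layout concrete, the following dependency-free sketch implements only the forward pass and the MSE of such a network. It is an illustration under stated assumptions rather than the framework's code: the names `init_model`, `forward` and `mse` and the small uniform weight initialization are hypothetical, and training with Adam is deliberately left out.

```python
import random


def init_model(k, n, m, seed=0):
    """Create weights for a k-n-m network (initialization scheme is an assumption)."""
    rng = random.Random(seed)

    def layer(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"W1": layer(n, k), "b1": [0.0] * n, "W2": layer(m, n), "b2": [0.0] * m}


def forward(model, x):
    """Map one k-dimensional embedding to an m-dimensional prediction."""
    # Dense hidden layer with ReLU activation.
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(model["W1"], model["b1"])
    ]
    # Output layer with linear activation.
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(model["W2"], model["b2"])
    ]


def mse(prediction, target):
    """Mean squared error between two vectors of equal length."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)
```

For a single eye-tracking feature the model would be created with `m=1`; for EEG or fMRI, `m` would equal the number of electrodes or voxels.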
5-fold cross validation is performed for each model (80% training data and 20% test data). The optimal number of nodes n in the hidden layer is selected individually for each combination of cognitive data source and embedding type. To this end, a grid search is performed before training, which is evaluated on a validation set consisting of 20% of the training data with 3-fold cross validation (see Table 1 for details on the search space). The best model is then saved and used to predict the cognitive feature for each word in the test set. Finally, the results are measured with the mean squared error, averaged over all predicted words.
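The outer cross-validation loop can be sketched as below. The helper `k_fold_splits` is hypothetical and only illustrates how five folds produce 80%/20% train/test partitions; shuffling with a fixed seed is an assumption.

```python
import random


def k_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: shuffle once before splitting
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

The same helper could be reused with `n_folds=3` on the training indices of each outer fold to evaluate the candidate hidden layer sizes during the grid search.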
CogniVal allows for evaluation against another word embedding type as well as evaluation against a random baseline. To generate a fair baseline, we create a random vector for each word with the same number of dimensions as the embeddings to be evaluated.
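A random baseline of matching dimensionality might be generated along these lines. The function name `random_baseline`, the uniform distribution over [-1, 1] and the fixed seed are assumptions for illustration; the text does not specify how the random values are drawn.

```python
import random


def random_baseline(vocabulary, dim, seed=0):
    """Assign each word a random vector with the same dimensionality as the embeddings."""
    rng = random.Random(seed)
    # Assumption: values drawn uniformly from [-1, 1].
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocabulary}
```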
5.2 Multiple hypotheses testing
With the purpose of achieving consistency and going towards a global quality metric that can be combined with other evaluation methods, we perform statistical significance testing on each hypothesis. A hypothesis consists of comparing the combination of an embedding type and a cognitive data source to the random baseline.
Since the distribution of our test data is unknown and the datasets are small, we perform a Wilcoxon signed-rank test for each hypothesis dror2018hitchhiker. Additionally, to counteract the multiple hypotheses problem, we apply the conservative Bonferroni correction, where the global null hypothesis is rejected if $p \leq \alpha / N$, where $N$ is the number of hypotheses dror2017replicability. In our setting, $N$ is counted separately for each modality: one hypothesis per EEG data source for EEG, one hypothesis per participant of each fMRI data source for fMRI, and one hypothesis per feature per eye-tracking corpus for eye-tracking.
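The Bonferroni step can be sketched in a few lines. This is an illustrative helper, not the framework's code: the name `bonferroni_significant` is hypothetical and the default `alpha=0.01` is an assumed significance level. The p-values are taken as given; they could come from a Wilcoxon signed-rank test such as `scipy.stats.wilcoxon`, which is not called here to keep the example dependency-free.

```python
def bonferroni_significant(p_values, alpha=0.01):
    """Apply the Bonferroni correction to the hypotheses of one modality.

    p_values: one p-value per hypothesis (embedding vs. random baseline).
    alpha: significance level (assumed default).
    Returns (list of per-hypothesis decisions, ratio of significant hypotheses).
    """
    n = len(p_values)
    threshold = alpha / n  # corrected threshold alpha / N
    decisions = [p <= threshold for p in p_values]
    return decisions, sum(decisions) / n
```

With four hypotheses and `alpha=0.01`, for example, each p-value is compared against 0.0025, and the returned ratio corresponds to the kind of significance ratio reported per modality.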
This approach of significance testing can easily be used in combination with other intrinsic and extrinsic evaluation methods. The significance ratios are shown in Figure 3.
6 Results & Discussion
First, we show in Figure 3 how well each word embedding type is able to predict eye-tracking features, EEG and fMRI vectors. As can be seen, the majority of results are significantly better than the random baselines. BERT, ELMo and FastText embeddings achieve the best prediction results. All exact numbers can be found in Appendix B. While a random baseline can be considered a rather naive choice, this setting also allows us to compare the performance between word embedding types.
When predicting single eye-tracking features, the performance varies greatly. For instance, Table 4 shows that the prediction error on number of fixations and total reading time from the ZuCo dataset is much lower than for first fixation duration. This suggests that more general eye-tracking features covering the complete reading process of a word are easier to predict.
In the case of predicting voxel vectors of fMRI data, the results improve when choosing a larger number of voxels (see Table 3). Hence, in the remainder of this work we present only the results for 1000 voxels.
We also examined the EEG results in more depth by analyzing which electrodes are predicted more accurately and which electrode values are very difficult to predict. This is exemplified by Figure 5, which shows the 20 best and worst predicted electrodes of the ZuCo data for the BERT embeddings of 1024 dimensions as well as aggregated over all cognitive data sources. The middle central electrodes are predicted more accurately; they are known to register the activity of the Perisylvian cortex, which is relevant for language related processing catani2005perisylvian. Moreover, it can be speculated that there is a frontal asymmetry between the electrodes on the left and right hemispheres.
Cognitive data implications
The diversity of cognitive data sources chosen for this work allows us to analyze and compare results on several levels and between several cognitive metrics. In order to conduct this evaluation on a collection of 15 datasets from three modalities, many crucial decisions were taken about preprocessing, feature extraction and evaluation type. Since there are different methods on how to process different types of cognitive language understanding signal, it is important to make these decisions transparent and reproducible.
Moreover, it is a challenge to segment brain activity data correctly and meaningfully into word-level signal from naturalistic, continuous language stimulus hamilton2018revolution. This makes consistent preprocessing across data sources even more important.
Another challenge is to consolidate the cognitive features to be predicted. For instance, we chose a wide selection of eye-tracking features that cover early and late word processing. However, choosing only general eye-tracking features such as total reading time would also be a viable strategy. On the other hand, the EEG evaluation could be more coarse-grained: one could also try to predict known ERP effects (e.g. ettinger2016modeling) or features selected based on frequency bands. Moreover, the voxel selection in the fMRI preprocessing could be improved by either predicting all voxels or applying information-driven voxel selection methods beinborn2019robust.
Correlations between modalities
Next, we analyze the correlation between the predictions of the three modalities (Figure 4). There is a strong correlation between the results of predicting eye-tracking, EEG and fMRI features. This implies that word embeddings are actually predicting brain activity signals and not merely preprocessing artifacts of each modality. Moreover, the same correlation is also evident between individual datasets within the same modality. As an example, Figure 6 (bottom) shows the correlation of the results predicted for the Natural Speech and ZuCo EEG datasets, where the first had speech stimuli and the latter text. Figure 6 (top) reveals the same positive correlation for two EEG datasets that were preprocessed differently and were recorded with a different number of electrodes. Moreover, the UCL dataset contains word-by-word reading and the N400 contains natural reading of full sentences.
Correlation with extrinsic evaluation results
We performed a simple comparison between the results of word embeddings predicting cognitive language processing signals and the performance of the same embedding types in downstream tasks. We collected results for two NLP tasks: on the SQuAD 1.1 dataset for question answering rajpurkar2016squad and on the CoNLL-2003 test split for named entity recognition tjong2003introduction.
The SQuAD results are taken from devlin2019bert for BERT, from mikolov2018advances for FastText, and from peters2018deep for ELMo. The NER results are from the same source for ELMo and BERT, for Glove-50 from pennington2014glove and for Glove-200 from ghannay2016word. We correlated these results to the prediction results over all cognitive data sources. Figures 7 and 8 show the correlation plots between the CogniVal results and the two downstream tasks.
While this is merely an exploratory analysis, it shows interesting findings: If the cognitive embedding evaluation correlates with the performance of the embeddings in extrinsic evaluation tasks, it might be used not only for evaluation but also as a predictive framework for word embedding model selection. This is especially noteworthy, since it does not seem to be the case for other intrinsic methods chiu2016intrinsic.
We presented CogniVal, the first multi-modal large-scale cognitive word embedding evaluation framework. The vectorized word representations are evaluated by using them to predict eye-tracking or brain activity data recorded while participants were understanding natural language. We find that the results of eye-tracking, EEG and fMRI data are strongly correlated not only across these modalities but even between datasets within the same modality. Intriguingly, we also find a correlation between our cognitive evaluation and two extrinsic NLP tasks, which raises the question of whether CogniVal can also be used for predicting downstream performance and hence for choosing the best embeddings for specific tasks.
We plan to expand the collection of cognitive data sources as more of them become available, including data from other languages such as the Narrative Brain Dataset (Dutch, fMRI, lopopolo2018narrative) or the Russian Sentence Corpus (eye-tracking, laurinavichyute2017russian). Thanks to naturalistic recording of longer text spans, CogniVal can also be extended to evaluate sentence embeddings or even paragraph embeddings.
CogniVal can become even more effective by combining the results with other intrinsic or extrinsic embedding evaluation frameworks nayak2016evaluating; rogers2018what and building on the multiple hypotheses testing.
We thank Lisa Beinborn, Stefan Frank and Thomas Lemmin for their valuable input on preprocessing EEG and fMRI data.
Appendix A Data Preprocessing
A.1 Eye-tracking Features
|First fixation duration||Dundee, GECO, Provo, UCL, ZuCo|
|First pass duration (first fixation duration in the first pass reading)||Dundee, GECO, Provo, UCL, ZuCo|
|Mean fixation duration||Dundee, GECO, Provo, ZuCo, CFILT-Sarcasm, CFILT-Scanpath|
|Total fixation duration||Dundee, GECO, Provo, ZuCo|
|Total duration of all regression going from this word||Dundee|
|Total duration of all regression going to this word||Dundee|
|Number of fixations||Dundee, GECO, Provo, ZuCo|
|Number of long regression (>3 tokens) going from this word||Dundee|
|Number of long regression (>3 tokens) going to this word||Dundee|
|Number of refixations||Dundee|
|Number of regressions going from this word||Dundee, Provo|
|Number of regressions going to this word||Dundee, Provo|
|The duration of the last fixation on the current word||GECO|
|Go-past time||GECO, Provo, UCL, ZuCo|
|No fixation occurred in first-pass reading||GECO, Provo|
|Right-bounded reading time||UCL|
All four EEG datasets are converted to the EEGLab format (https://sccn.ucsd.edu/eeglab/index.php), if not already provided in this format. The UCL dataset had been preprocessed by the authors. For the other three datasets, bandpass filtering, artifact removal (i.e. removing blinks and other muscle activity) and quality assessment were performed with Automagic (https://github.com/methlabUZH/automagic).
After preprocessing and retaining only the subjects with good data quality, we use the data of 3 subjects from the N400 dataset, 14 subjects from Natural Speech, 12 subjects from ZuCo and the same number of subjects as in the original UCL dataset (i.e. 24).
As mentioned in the main paper, we use the preprocessing pipeline from beinborn2019robust to read the fMRI data, align the scans and select the voxels. We used the Nouns and Pereira readers as is and modified the Harry Potter and Alice readers to extract word-level signals.
Appendix B Detailed Results
B.1 Correlations between datasets
The following plots show example correlations between the prediction results within one modality, but across datasets. They show correlations between different stimuli and different recording procedures.