Documents on the World Wide Web, and seemingly countless other information sources available in a variety of on-line services, have become a central resource in our day-to-day decisions. Because our capacity to find relevant information in large collections is limited, computational recommender systems have been introduced to alleviate information overload. To predict our future needs and intentions, recommender systems rely on the history of observations about our interests. Unfortunately, people are reluctant to provide explicit feedback to recommender systems. As a consequence, acquiring information about user intents has become a major bottleneck to recommendation performance, and sources of information about the individual’s interests have been limited to the implicit monitoring of online behavior, such as which documents they read, which videos they watch, or for which items they shop. An intriguing alternative is to monitor the brain activity of an individual, which could mitigate the cognitive load involved in expressing intentions and enable the direct inference of information about relevance.
To utilize brain signals, we introduce a brain-relevance paradigm for information filtering. The paradigm is based on the hypothesis that relevance feedback on individual words, estimated from brain activity during a reading task, can be utilized to automatically recommend a set of documents relevant to the user’s topic of interest (see Figure 1 for an illustration). Following the brain-relevance paradigm, we introduce the first end-to-end methodology for performing fully automatic information filtering by using only the associated brain activity (Figure 1). The methodology is based on predicting and modeling the user’s informational intents using brain signals and the associated text corpus statistics, and recommending new and unseen information using the estimated intent model.
We demonstrate the effectiveness of the methodology with brain signals naturally evoked during a text reading task. That is, unlike standard active brain-computer interface (BCI) practices, the method used here does not require the user to perform additional, explicit tasks (such as the mental counting of relevant words) that have been previously shown to enhance the signal-to-noise ratio. Instead, the methodology relies solely on the detection of the neural activity patterns associated with relevance, so that applications benefit from truly implicit and passive measurements.
The data from experiments, in which electroencephalography (EEG) was recorded from 15 participants while they were reading texts, shows that the recommendation of new relevant information can be significantly improved using brain signals when compared to a random baseline. The result suggests that relevance can be predicted from brain signals that are naturally evoked when users read, and they can be utilized in recommending new information from the Web as a part of our everyday information-seeking activities.
2 Brain-relevance paradigm for information filtering
We propose a new paradigm for information filtering based on brain activity associated with relevance. The brain-relevance paradigm is based on the following four hypotheses evaluated empirically in this paper:
(H1) Brain activity associated with relevant words is different from brain activity associated with irrelevant words.
(H2) Words can be inferred to be relevant or irrelevant based on the associated brain activity.
(H3) Words inferred to be relevant are more informative for document retrieval than are words inferred to be irrelevant.
(H4) Relevant documents can be recommended based on the inferred relevant and informative words.
The following two sections provide the cognitive neuroscience and the information science motivations as well as existing foundations of the brain-relevance paradigm.
Cognitive neuroscience motivation.
Event-related potentials (ERPs) are obtained by synchronizing electrical potentials from EEG to the onset (“time-locked”) of sensory or motoric events. The last 50 years of psychophysiology have demonstrated beyond a reasonable doubt that ERPs have a neural origin, that mental events can reliably elicit them, and that the measurement of their timing, scalp distribution (“topography”), and amplitude can be invaluable in providing information on normal and neuropathological functioning.
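The time-locking step described above can be illustrated with a minimal sketch. It assumes a single-channel signal stored as a plain list of samples and event onsets given as sample indices; the function names `extract_epochs` and `erp` are hypothetical and are not taken from the analysis pipeline used in this paper.

```python
from statistics import mean

def extract_epochs(signal, onsets, sfreq, tmin, tmax):
    """Cut one fixed-length window per event, from tmin to tmax seconds
    relative to each onset (onsets are sample indices)."""
    start = int(round(tmin * sfreq))
    stop = int(round(tmax * sfreq))
    epochs = []
    for onset in onsets:
        lo, hi = onset + start, onset + stop
        # Skip events whose window would run past the recording.
        if lo >= 0 and hi <= len(signal):
            epochs.append(signal[lo:hi])
    return epochs

def erp(epochs):
    """Average the epochs sample by sample. Activity that is not
    time-locked to the event tends to cancel out in the average."""
    return [mean(samples) for samples in zip(*epochs)]

# Toy signal: the same 4-sample response follows each of two events.
signal = [0, 1, 2, 3, 0, 1, 2, 3]
epochs = extract_epochs(signal, onsets=[0, 4], sfreq=1, tmin=0, tmax=4)
average = erp(epochs)  # [0, 1, 2, 3]
```

A real pipeline would also apply filtering, baseline correction, and artifact rejection before averaging; those steps are omitted here.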
Mentally controlling interfaces through measured ERPs has, to date, principally relied on the P300. The P300 is a distinct, positive potential that occurs at least 300 ms after stimulus onset and is traditionally obtained via so-called oddball paradigms. Sutton et al. presented a fast series of simple stimuli with infrequently occurring deviants (e.g. 1 in 6 tones having a high pitch) and discovered that these rare “oddballs” would on average trigger a positivity compared to the standard stimuli. Later experiments showed that the degree to which the stimulus provided new information and was task-relevant amplified the P300, whereas repetitive, unattended or easily processed stimuli could remove the P300 entirely.
For the language domain, the onset of words normally evokes a negativity at ca. 400 ms which has been attributed to semantic processing. This N400 was first observed as a type of “semantic oddball” since the closing word in a sentence such as “I like my coffee with milk and torpedoes” is semantically improbable, but would amplify the N400 rather than cause a P300. However, if a rare syntactic violation occurs in a sentence (“I likes my coffee [..]”), the deviant word once again evokes a positivity, but now at 600 ms. As this P600 shows similarities to the P300 in polarity and topography, it started the ongoing debate as to whether it is a language-specific “syntactic positive shift”, or a delayed P300 [22, 26, 35]. Finally, research on memory has identified a late positive component (LPC) at a latency similar to the P600. The LPC has been related to semantic priming and is particularly strong in tasks where an explicit judgement on whether a word is old or new is to be made. Consequently, it is often associated with mnemonic operations such as recollection. In the present context, relevant words could cue recollection of the user’s intent, thereby amplifying the LPC.
Although the P300/P600 and N400 are often described as contrasting effects, this is not necessarily the case in predicting term relevance. That is, if an odd, task-relevant stimulus yields a P300 or P600 and a semantically irrelevant stimulus an N400, it follows that the total amount of positivity between an estimated 300 and 700 ms may indicate the summed total semantic task relevance. This was indeed found by Kotchoubey and Lang, who showed that semantically relevant oddballs (animal names) that were randomly intermixed amongst words from four other categories evoked a P300-like response for semantic relevance (but at ca. 600 ms). Likewise, our previous work on inferring term-relevance from event-related potentials showed that a search category elicited either P300s/P600s in response to relevant words or N400s evoked by semantically irrelevant terms.
Information science motivation.
Relevance estimation aims to quantify how well the retrieved information meets the information need of the user. Computational methods are used in estimating statistical relevance measures based on word occurrences in a document collection. These measures are used in many information retrieval applications, such as Web search engines, recommender systems, and digital libraries. One of the most well-known statistical measures of word informativeness or word importance is tf-idf.
The foundation of tf-idf is that low- and medium-frequency words have a higher discriminating power at the level of the document collection, in particular when they have high frequency in an individual document. For example, the word “nucleus” has a low frequency at the collection level but a higher frequency in a document about atoms (i.e., the “Atom” document) and therefore is considered to discriminate this document better than, for example, the word “the,” which has a high frequency at both the collection and document levels. Search and recommendation systems use word-importance statistics to produce a ranked list of documents that match the word list encoding the user’s search intent. The words in documents can be indexed with weights encoding their importance, and ranking models then compute a relevance score for each document in the document collection and rank the documents according to the relevance scores. For example, if the words “the” and “nucleus” are encoding the user’s intent, then a ranking model could estimate that the document “Atom” should be ranked higher because it has a high importance value for the word “nucleus” compared to, for example, the document “Nuclear Magnetic Resonance,” which also contains the words “nucleus” and “the,” but with lower importance.
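The “nucleus” versus “the” example can be made concrete with a small sketch. It assumes one common tf-idf variant (relative term frequency times the logarithm of inverse document frequency) and a simple additive ranking score; the toy token lists and the function names `tf_idf_index` and `rank` are illustrative assumptions, not the weighting or ranking model used in our search engine.

```python
import math
from collections import Counter

def tf_idf_index(docs):
    """docs maps a document name to its list of tokens.
    Returns a mapping: name -> {word: tf-idf weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document
    index = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        index[name] = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return index

def rank(index, query):
    """Score each document by summing the weights of the query words."""
    scores = {
        name: sum(weights.get(word, 0.0) for word in query)
        for name, weights in index.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy collection (invented text, for illustration only).
docs = {
    "Atom": "the atom has a nucleus the nucleus is dense".split(),
    "Nuclear Magnetic Resonance":
        "the nucleus absorbs the radiation in a magnetic field the".split(),
    "Money": "the money is a medium of exchange".split(),
}
index = tf_idf_index(docs)
ranking = rank(index, ["the", "nucleus"])  # "Atom" ranks first
```

In this toy collection “the” occurs in every document, so its weight is zero everywhere and the ranking is driven entirely by “nucleus”, which is relatively more frequent in “Atom” than in “Nuclear Magnetic Resonance”.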
In summary, word relevance is determined by the user given the user’s search intent. Word informativeness is determined by the search system given the document collection. Words that are both relevant and informative are words that discriminate relevant documents from irrelevant documents and are needed to recommend meaningful documents. In addition to the brain-activity findings related to the semantic oddball (introduced in the cognitive neuroscience motivation), recent findings in quantifying brain activity associated with language also suggest a connection between the word class and frequency of the word, and the corresponding brain activity. It has been shown that brain activity is different for different word classes in language and that high-frequency words elicit different activity than low-frequency words.
During the experiment, we recorded the EEG signals of 15 participants while each participant performed a set of eight reading tasks. Experimental details are provided in SI Neural-Activity Recording Experiment.
The text content read by the user consisted of two documents at a time. Each document was chosen from a list of 30 candidate documents, and each document was selected from a different topical area. For example, the documents “Atom,” “Money,” and “Michael Jackson” were part of the list; SI Table 1 provides a detailed list of the documents. One document represented the relevant topic, the other one an irrelevant topic. The user chose the relevant topic herself in the beginning of the experiment. The user read the first six sentences from each document—first the first sentence from both documents, then the second sentence from both documents, and so on. The obtained term-relevance feedback (predicted from brain signals) was then used to retrieve further documents relevant to the user-chosen topic of interest, from among the four million documents available in the database.
In order for the task to be representative of natural reading, no simplifications were done on the text content. In particular, this implies that the sentences have different numbers of words, and word length ranges from very short to very long. Figure 2 illustrates one reading task consisting of the relevant document “Atom” and the irrelevant document “Money” with subsequent document retrieval.
To associate brain activity with relevance, we computed the neural correlates of relevant and irrelevant words for all participants. A participant-specific single-trial prediction model was computed for each participant, and the performance was evaluated on a left-out reading task (leave-one-task-out, a cross-validation scheme). This procedure matches the example of the task illustrated in Figure 2, consisting of the following steps: (1) users perform a new reading task; (2) relevance predictions are made for each word based on a model that was trained on observations collected during previous reading tasks; and (3) documents are retrieved using the relevance predictions for the present reading task. We present results for the (H1) neural correlates of relevance, (H2) term-relevance prediction, (H3) relation between relevance prediction and word importance, and (H4) document recommendation. Results are presented for both individual users and as grand averages. Technical details are in SI Data Analysis Details.
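The leave-one-task-out scheme can be sketched as a generator of train/test index splits. This is a minimal illustration that assumes each word observation carries the identifier of the reading task it came from; `leave_one_task_out` is a hypothetical helper name, and the model fitting itself (LDA in our case, see the SI) is left out.

```python
def leave_one_task_out(task_ids):
    """Yield (held_out_task, train_indices, test_indices), holding out
    all observations of one reading task at a time."""
    for held_out in sorted(set(task_ids)):
        train = [i for i, t in enumerate(task_ids) if t != held_out]
        test = [i for i, t in enumerate(task_ids) if t == held_out]
        yield held_out, train, test

# Five word observations coming from three reading tasks.
task_ids = [0, 0, 1, 1, 2]
splits = list(leave_one_task_out(task_ids))
# For task 1: train on observations 0, 1, 4 and test on 2, 3.
```

Splitting by task rather than by individual word keeps all words of the held-out reading task out of training, which mirrors the situation in which a trained model is applied to a new reading task.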
To quantify the significance and the effect sizes of the brain feedback-based prediction performances, we compared them against performances from prediction models learned from randomized feedback. By comparing against this baseline, we are able to operate with natural and hence non-balanced texts. Standard permutation tests were applied for significance testing.
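The logic of a permutation test can be sketched as follows. This is a simplified illustration and not the exact procedure used here: it shuffles the labels at evaluation time rather than retraining a model on randomized feedback, and the test statistic (difference of class means), the number of permutations, and the function names are assumptions made for the example.

```python
import random

def mean_difference(scores, labels):
    """Mean score of label-1 items minus mean score of label-0 items."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

def permutation_p_value(statistic, scores, labels, n_permutations=1000, seed=0):
    """One-sided p-value: the fraction of label shufflings whose statistic
    is at least as large as the one observed with the true labels."""
    rng = random.Random(seed)
    observed = statistic(scores, labels)
    shuffled = list(labels)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(scores, shuffled) >= observed:
            count += 1
    # The +1 terms count the observed labelling as one permutation.
    return (count + 1) / (n_permutations + 1)

# Toy data: predicted relevance is clearly higher for relevant words.
scores = [0.9, 0.8, 0.85, 0.95, 0.1, 0.2, 0.15, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
p = permutation_p_value(mean_difference, scores, labels)
```

Because shuffling preserves the number of relevant and irrelevant labels, the null distribution reflects the same class imbalance as the real data.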
We used the area under the ROC curve (AUC) to quantify the performance of the classifiers. AUC is a widely used and sensible measure even under the class imbalances of our scenario, and it is a comprehensive measure for comparison against the prediction models based on randomized feedback. From the perspective of document recommendations, it is more important to predict relevant words than to predict irrelevant words. To quantify this, we measured the precision (SI Appendix, Equation 1). To demonstrate the influence of a positive predicted word on the document retrieval problem, we additionally measured the tf-idf-weighted precision (SI Appendix, Equation 2). From the user perspective, the quality of the recommended documents is important. To quantify this, we used cumulative information gain, which measures the sum of the graded relevance values of the returned documents (SI Appendix, Equation 7). AUC and precision are based on participant-specific relevance judgments, and cumulative information gain is based on external topic-level expert judgments. Details on the concrete definition of the evaluation measurements and the assessment process are available in the SI Appendix.
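For readers less familiar with these measures, the following sketch computes AUC and precision on toy data. It uses the standard pairwise (rank-based) formulation of AUC and the usual definition of precision; the exact definitions used in this work are the ones given in the SI Appendix, and the function names here are illustrative.

```python
def auc(scores, labels):
    """Probability that a randomly chosen relevant word receives a higher
    score than a randomly chosen irrelevant word (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision(predicted, labels):
    """Fraction of words predicted relevant that are truly relevant."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    return tp / (tp + fp) if tp + fp else 0.0

auc_value = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])         # 0.75
precision_value = precision([1, 1, 0, 0], [1, 0, 1, 0])     # 0.5
```

AUC depends only on how relevant words are ranked against irrelevant ones, so it does not change with the ratio of the two classes, which is why it remains meaningful for non-balanced texts.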
Neural correlates of relevance.
Grand-average based ERP results show that brain activity associated with relevant words is different from brain activity associated with irrelevant words (H1), over all participants and all reading tasks. The topographic scalp plots in Figure 3a show the spatial interpolation of relevant ERPs minus irrelevant ERPs over all electrodes from 300 ms to 600 ms after a word was shown on screen. The topography of the difference showed an initial fronto-central positivity at 300 ms, relative to the onset of the word on the screen, followed by a centro-parietal positivity from 400 to 600 ms. The maximal effect of relevance can be clearly observed in Figure 3b, with V for relevant words and V for irrelevant words at 367 ms over Pz. Following the negativity a late positivity can be observed for both types of words, which reaches a local maximum at a latency of around 600 ms, implicating a possible P600 or LPC.
For descriptive purposes, we tested the difference between the relevant and irrelevant words of well-known P300, N400, and P600 ERP components and their latencies given in the existing literature. There was no significant difference in the early P3 interval ( ms, paired t-test, , ), which suggests that the system does not rely on the mere visual resemblance between relevant words and the intent category. However, irrelevant words elicited a negativity compared to relevant words in the N400 window ( ms, , ). Moreover, relevant words were found to significantly elicit a positivity compared to irrelevant words in the P600 interval ( ms, paired t-test interval, , ). For the purpose of the subsequent term-relevance prediction, this result verified our approach of computing the temporal features for the ERP classification within the range of 200 ms to 950 ms (this range was determined based on the pilot experiments). SI Figure 1 shows the remaining scalp plots for other time intervals, and SI Figure 2 shows the grand-average-based ERP curves for all channels.
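A component-window comparison of this kind can be sketched in two steps: average the ERP amplitude within a latency window, then compare the per-participant window means for relevant and irrelevant words with a paired t statistic. The window borders, the toy values, and the function names below are placeholders chosen for illustration and are not the intervals or data used in our analysis; the sketch also assumes that index 0 of the ERP corresponds to word onset.

```python
import math
from statistics import mean, stdev

def window_mean(erp, sfreq, t_start, t_end):
    """Mean amplitude of an ERP between t_start and t_end seconds,
    assuming index 0 of the ERP corresponds to stimulus onset."""
    lo = int(round(t_start * sfreq))
    hi = int(round(t_end * sfreq))
    return mean(erp[lo:hi])

def paired_t(x, y):
    """Paired t statistic for two equally long lists of per-participant
    values; the degrees of freedom are len(x) - 1."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Toy ERP sampled at 10 Hz; mean amplitude between 0.2 s and 0.5 s.
amplitude = window_mean([0, 0, 1, 3, 5, 0], sfreq=10, t_start=0.2, t_end=0.5)

# Toy per-participant window means for relevant and irrelevant words.
t_stat = paired_t([2, 4, 6, 8], [1, 2, 3, 4])
```

The p-value follows from comparing the t statistic against a t distribution with the stated degrees of freedom; that lookup is omitted to keep the sketch within the standard library.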
Across participants and reading tasks, the classification of brain signals by models learned from earlier explicit feedback shows significantly better results than with models learned from randomized feedback (Figure 4a; , Wilcoxon test, ). This implies that the prediction models are able to extract and utilize structured signals significantly and that words can be inferred to be relevant or irrelevant based on the associated brain activity (H2).
Figure 5 shows the classification performance in terms of AUC for each participant. For 13 out of the 15 participants, the term-relevance prediction models perform significantly better than does a prediction model learned based on randomized feedback (hence having ; , within-participant permutation tests with iterations). For two participants, the predictions were essentially random, and they were excluded from the rest of the analyses. It is well known that BCI control does not work for a non-negligible portion of participants (ca. 15-30%), and the reported results should be interpreted as being valid for the population of users, which can be rapidly screened by using the system on pre-defined tasks.
Relevance for document retrieval.
For our final goal, the retrieval and recommendation of documents, it is important to be able to detect words that are both relevant and informative (measured by the tf-idf) in discriminating between relevant and irrelevant documents in the full collection. Figure 6 visualizes the relationship between the predicted relevance probability of words and their tf-idf values. Relevant words (according to the user’s own judgement afterwards) are predicted as being more relevant than irrelevant words, but also, their tf-idf values are greater (H3). The figure further indicates that the tf-idf dimension explains more of the difference than does the predicted relevance.
In terms of an information retrieval application, the precision of the prediction models is the most important measure. For document retrieval, the influences of positive predicted words on the search results are not equal but rather dependent on the word-specific tf-idf values within each individual document. For example, a true positive predicted word can still have very low impact on the search result if its tf-idf value is low in the relevant documents. Similarly, a false positive predicted word can have only a low impact on the search result, if its tf-idf value is low. Figure 7 visualizes the mean precision of the prediction models from the perspective of the retrieval problem. It shows the mean precision for each of the 13 participants over all of the reading tasks (based on binarizing the predicted probabilities with the threshold). In addition, it shows what is actually crucial: The precision weighted with the tf-idf values from the relevant document is in all cases, except for one, much higher than the precision weighted with the tf-idf values from the irrelevant document.
In conclusion, the results in Figure 7 explain why the prediction models are useful for document retrieval and recommendation, even though the unweighted precision of the prediction models is limited. In detail, our prediction models tend to predict true positive words with higher tf-idf values, and false positive words with lower tf-idf values. This means that our prediction models tend to predict words that the user would judge to be relevant, and which are also discriminative in terms of the user’s search intent.
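The effect of the weighting can be illustrated with a small sketch. The exact tf-idf-weighted precision is defined in SI Appendix, Equation 2, which is not reproduced here; the version below is an assumed, plausible form in which each predicted-positive word contributes its tf-idf weight instead of a count of one. The function name and the toy values are hypothetical.

```python
def weighted_precision(predicted, labels, weights):
    """Assumed form of a weighted precision: total weight of the true
    positives divided by the total weight of all predicted positives."""
    tp = sum(w for p, y, w in zip(predicted, labels, weights) if p and y)
    fp = sum(w for p, y, w in zip(predicted, labels, weights) if p and not y)
    return tp / (tp + fp) if tp + fp > 0 else 0.0

# Three words predicted relevant, only the first one truly relevant.
predicted = [1, 1, 1, 0]
labels = [1, 0, 0, 1]
tfidf = [0.6, 0.1, 0.1, 0.5]  # invented weights for illustration

unweighted = weighted_precision(predicted, labels, [1, 1, 1, 1])  # 1/3
weighted = weighted_precision(predicted, labels, tfidf)           # about 0.75
```

In this toy case the unweighted precision is low, but the single true positive carries most of the tf-idf mass, so the false positives would have little influence on a tf-idf-based ranking.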
The final step is to use the relevant words—predicted from brain signals—for document retrieval and recommendation, and to evaluate the cumulative gain. Figure 4b shows that across the participants and reading tasks, document retrieval performance based on brain feedback is significantly better than randomized feedback (top-30 documents, , Wilcoxon test, ). Therefore, relevant documents can be recommended based on the inferred relevant and informative words (H4).
Figure 8 shows the document retrieval performance for each participant in terms of mean information gain. Based on the expert scoring, the scale for the mean information gain is from 0 (irrelevant) to 3 (highly relevant). The visualization shows for each participant the mean information gain over all reading tasks based on brain feedback (blue bars) and randomized feedback (purple bars). For 10 participants, the brain feedback results in significantly greater information gain (; two-sided Wilcoxon test). SI Figure 3 also shows the visualizations for top 10 and top 20 retrieved documents. In both cases, the same significant results hold except for one participant (TRPB113).
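The gain measures can be sketched directly from their verbal definition: cumulative gain sums the expert grades (0 to 3) of the top-k retrieved documents, and the mean gain divides that sum by the number of documents considered. The precise definition is given in SI Appendix, Equation 7; the function names and the toy grades below are illustrative assumptions.

```python
def cumulative_gain(grades, k=30):
    """Sum of graded relevance values over the top-k retrieved documents.
    grades are ordered by retrieval rank, best-ranked first."""
    return sum(grades[:k])

def mean_gain(grades, k=30):
    """Average grade over the top-k documents, on the same 0-3 scale
    as the individual expert grades."""
    top = grades[:k]
    return sum(top) / len(top) if top else 0.0

# Expert grades of four retrieved documents, in ranked order.
grades = [3, 2, 0, 1]
top3_gain = cumulative_gain(grades, k=3)  # 5
top4_mean = mean_gain(grades, k=4)        # 1.5
```

Unlike discounted variants of cumulative gain, this measure does not penalize relevant documents for appearing lower in the list; it only reflects how much relevant material is contained in the top-k results.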
By combining insights on information science and cognitive neuroscience, we proposed the brain-relevance paradigm to construct maximally natural interfaces for information filtering: The user just reads, brain activity is monitored, and new information is recommended. To our knowledge, this is the first end-to-end demonstration that the recommendation of new information is possible just by monitoring the brain activity while the user is reading.
The brain-relevance paradigm for information filtering is based on four hypotheses empirically demonstrated in this paper. We showed that (H1) there is a difference in brain activity associated with relevant versus irrelevant words; (H2) it is possible to infer whether words are relevant or irrelevant based on the associated brain activity; (H3) words inferred to be relevant are more informative for document retrieval than words inferred to be irrelevant; and (H4) it is possible to recommend relevant documents based on the detected relevant and informative words.
From a cognitive neuroscience point of view, it is known that specific ERPs can be particularly associated with relevance. In cognitive science, early P300s have been related to task relevance. In psycholinguistics, N400s are commonly associated with semantic processes, as semantically incongruent words amplify the component whereas semantic relevance reduces it. Late positivity has been related to semantic task-relevant stimuli, in particular if characterizing it as a delayed P3 response, due to the assessing of relevance of language, or an LPC, due to mnemonic operations and semantic judgements. In line with these findings, our grand averages indicate that the ERP at a latency of 500–850 ms is most likely the best predictor of perceiving words that are semantically related to a user’s search intent. The present data do not allow for a dissociation among the P300, N400 or P600 as the most likely neural candidate for evoking the observed effect. Indeed, the method is based on the assumption that task relevance and semantic relevance both contribute positively to the inference of relevance when aiming to ultimately predict a user’s search intent without requiring an additional task by the user.
While our results use real data and are also valid beyond the particular experimental setup, our methodology is limited to experimental setups in which it is possible to control strong noise, such as noise due to physical movements, which are known to cause confounding artifacts in the EEG signal. Another limitation is that the comparison setup in our studies considers only two topics at a time, one being relevant and another being irrelevant. While this is a solid experimental design and can rule out many confounding factors, it may not be valid in more realistic scenarios in which users choose among a variety of topics during their information seeking activities. Furthermore, the presented term-relevance prediction is based on a traditional set of event related potentials to demonstrate the feasibility of the methodology. However, it is possible that more advanced feature extraction could improve the solution further, for example by computing phase synchronization statistics in the delta and theta frequencies, which recently have been shown to be sensitive to the detection of relevant lexical information.
Despite these limitations, our work is the first to address an end-to-end methodology for performing fully automatic information filtering by using only the associated brain activity. Our experiments demonstrate that our method works without any requirements of a background task or artificially evoked event-related potentials; the users are just reading text, and new information is recommended. Our findings can enable systems that analyze relevance directly from individuals’ brain signals naturally elicited as a part of our everyday information seeking activities.
The SI Appendix provides extensive details on all technical aspects. SI Database describes the selection process and the criteria for the pool of candidate documents. SI Neural-Activity Recording Experiment provides the experimental details, i.e., the participant recruiting, the procedure and design of the EEG recording experiment, the apparatus and stimuli definition, and details on the pilot experiments. SI Data Analysis Details describes the general prediction evaluation setup, EEG cleaning and preparing, and the EEG feature engineering. SI Term Relevance Prediction gives a description of the Linear Discriminant Analysis (LDA) method used for the prediction models, and specifics on the evaluation measures for prediction. SI Intent Modeling based Recommendation gives details on the intent estimation model based on the LinRel algorithm, and specifics on the evaluation measures for document retrieval.
We thank Khalil Klouche for designing Figure 1. This work has been partly supported by the Academy of Finland (278090; Multivire, 255725; 268999; 292334; 294238; and the Finnish Centre of Excellence in Computational Inference Research COIN), and MindSee (FP7 ICT; Grant Agreement 611570).
Supplemental Information SI
Natural brain-information interfaces: Recommending information by relevance inferred from human brain signals
8 SI Database
The database used in the experiment was the English Wikipedia provided by Wikimedia (database dump of 2014/07/07). For the experiment, our search engine indexed all articles except special pages such as disambiguation pages. The references, notes, and external links were removed from the text of the articles. The final database contained over 4 million articles.
The pool of candidate documents read by the participants during the experiment consisted of 30 documents. The criteria for choosing a document were that (1) the document should describe a topic of general interest and that (2) the first six sentences of the introduction of the document provide a sufficient description of the topic. The final pool of documents fulfilling these criteria is listed in SI Table 1.
9 SI Relevance Judgments of Words and Documents
In order to measure the relevance prediction performance, “ground truth” in the form of relevance judgements for individual words is needed, for a specific reading task, on both the relevant and the irrelevant document. The binary relevance judgment “relevant” or “irrelevant” of a word was provided by each participant during the experiment (see SI Neural Activity Recording Experiment). This allowed us to capture the subjective nature of perceived relevance. In addition, for each document, the word class of each word was defined (nouns, verbs, adjectives, etc.) and three experts judged each word as being “relevant” or “irrelevant” for the given document.
In order to measure the document retrieval performance, “ground truth” relevance judgments of the retrieved documents, given the relevant topic of the reading task, are needed. For each of the 30 documents in the pool, three experts judged all documents that were retrieved in any experiments (brain feedback-based and random feedback-based), resulting in a pool of 13971 retrieved documents. The experts assessed all the documents according to the following criterion: “Would you be satisfied in having this document in the search result list of documents after examining document x? If yes, how satisfied from 1 to 3, if no 0.” The mean Cohen’s Kappa indicated substantial agreement between the experts, .
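As a reminder of how such an agreement score is computed, the sketch below implements the standard unweighted Cohen’s kappa for two raters and averages it over all rater pairs. Averaging over pairs is an assumption about how the “mean” kappa for three experts was obtained, and the toy grades and function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa: observed agreement corrected for the
    agreement expected by chance from each rater's grade frequencies."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:  # both raters always used the same single grade
        return 1.0
    return (observed - expected) / (1 - expected)

def mean_pairwise_kappa(raters):
    """Average kappa over all pairs of raters (assumed aggregation)."""
    pairs = list(combinations(raters, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)

# Toy grades (0-3) from two experts for six retrieved documents.
expert_1 = [0, 1, 2, 3, 0, 1]
expert_2 = [0, 1, 2, 3, 1, 1]
kappa = cohens_kappa(expert_1, expert_2)  # 20/26, about 0.77
```

Because the grades are ordinal, a weighted kappa that gives partial credit to near-misses would also be a reasonable choice; the unweighted form is shown only because it is the simplest.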
10 SI Neural Activity Recording Experiment
We recorded the electroencephalography (EEG) signals of 17 participants while each participant performed a set of eight reading tasks. The following sections provide the experimental details.
Participants were volunteers recruited from the universities of the Helsinki metropolitan area in Finland. They were selected only if they were right-handed, had no self-reported neuropathological history, and were deemed to have sufficient fluency in English. Handedness was assessed using the Edinburgh Handedness Inventory [25, 7] and English fluency using the Cambridge English “Test your English – Adult Learners” online test. Seventeen participants were recruited to participate in the experiment. The data of two participants were discarded due to technical issues. Of the fifteen remaining, 8 were female and 7 male. Their English fluency was assessed as high (, ), and their handedness as right-handed (, ). They were fully briefed as to the nature and purpose of the study prior to the experiment. Furthermore, and in accordance with the Declaration of Helsinki, they signed informed consents and were instructed on their rights as participants, including the right to withdraw from the experiment at any time without fear of negative consequences. They received two movie tickets as compensation for their participation.
10.2 Procedure and Design
Following the initial briefing, the task was explained to the participants in more detail while the EEG equipment was set up. They then received a short training task with two sample topics. When participants indicated their complete understanding of the task, the experiment commenced. Participants completed eight experimental blocks, each consisting of a single reading task with two topics, drawn randomly (without replacement) from the pool of 30 document candidates. At the beginning of the block, they were asked to freely choose which of the two documents described the relevant and irrelevant topic. Every block comprised six trials, each consisting of one sentence from the relevant and one sentence from the irrelevant document, with the presentation order of the sentences randomized between blocks. Each trial consisted of the sequential presentation of words (the word stream), two validity sub-tasks, and an explicit word relevance judgment task (judging the words as “relevant” or “irrelevant”).
Every trial started with a warning signal (the words “Starting trial”), followed by the presentation of the mask. An initial sentence separator was shown before the word stream was shown. The word stream consisted of the sequential presentation of each word in the first sentence, followed by a sentence separator, the words in the second sentence, and concluded by a final sentence separator. Every word and sentence separator was presented for exactly 699 ms (). Punctuation marks were not shown. Masking effects were countered to some extent by the frame resizing, which keeps the level of foveal stimulation constant. Prior tests suggested that people had more difficulty reading with than without short masks between the bursts, so as a consequence we removed them. It is possible these masking effects may be much more significant with strong “flashing”, as would be the case with very short stimulus durations. Here, the words appearing at a slow rate of ca. 700 ms per word made reading very easy.
Following the word stream, two extra sub-tasks were presented to validate that the participants had remembered their chosen topic and that they had paid attention to both sentences. First, they were asked to type in the name of the relevant topic in order to ascertain they had not forgotten. Then, a recall task was presented to prevent the participants from selectively concentrating on one of the two sentences. One of the sentences was selected randomly and presented in full on the screen, with one of the nouns or verbs substituted by question marks. Participants were asked to type in the word missing in the sentence. They were then presented with feedback in points regarding their performance on these two tasks as a motivational instrument (similar to ).
Then, in the final part of the trial, the participants were asked to explicitly rate the relevance of all words from the relevant topic. All words were shown in one (if the sentence comprised fewer than 35 words) or two columns on the screen. A cursor was presented next to each word, indicating a two-alternative forced-choice decision. Pressing the left arrow key on the keyboard would rate the word as irrelevant and pressing the right would rate it as relevant. Participants were instructed prior to the experiment that they should not re-interpret the relevance of the words and instead make a decision based on their previous viewing of the sentence. To facilitate this, they received a maximum of 2 s to respond to each word, after which the cursor moved to the next word in the sentence. After the last word was rated, the trial was completed, with the next trial starting after an inter-trial interval of ca. 1 s, unless it was the last trial in the block.
After completing a block, they were requested to freely write about their chosen, relevant topic; this task was defined to keep the participant engaged. Finally, they filled out a questionnaire with two items for both topics, one regarding their interest (“how interesting do you find topic x”) and one regarding their knowledge (“how much do you know about topic x”) using a 9-point rating scale (1: not at all – 9: extremely so). Three self-timed breaks with a minimum of one minute evenly split the blocks into four parts. The experiment, excluding preparation and instruction, lasted approximately one hour.
10.3 Apparatus and Stimuli
Words were presented in an 18-point Lucida Console black typeface at the center of the 19” LCD screen. They were shown against a silver background in the middle of a pixel pattern mask. The mask was a black rectangle with a grid-like pattern, with an opening to show the word. This was used to control the degree to which word length affected light reaching the eyes (i.e., to make sure longer words were not tantamount to more black pixels on the screen). Sentence separators were word-like character repetitions consisting of 4 to 9 numbers (3333333) or other non-alphabetic characters (&&&&&&), which were designed to mimic the same early visual activity as words without evoking
The screen was positioned approximately 60 cm from the participants and was running at a resolution of 1680 × 1050 and a refresh rate of 60 Hz. Stimulus presentation, timing, and EEG synchronization were controlled using E-Prime 2 Professional on a PC running Windows XP SP3. EEG was recorded from 32 Ag/AgCl electrodes, positioned on standardized (using EasyCap elastic caps, EasyCap GmbH, Herrsching, Germany), equidistant electrode sites of the system via a QuickAmp (BrainProducts GmbH, Gilching, Germany) amplifier running at 200 Hz. Additionally, the electro-oculogram for vertical eye movements (and eye blinks) and horizontal eye movements was recorded using bipolar electrodes positioned respectively 2 cm superior/inferior to the right pupil and 1 cm lateral to the outer canthi of both eyes.
10.4 Pilot experiments
Preliminary versions of the final experimental procedure and design were piloted with four separate participants. In these experiments, we tested and evaluated, for example, the stimulus duration, the explicit feedback task, and the points system. The data from these pilot experiments were not used in the final analysis, except that some basic parameter estimates for the final feature engineering process (e.g., the number of feature windows) were based on cross-validation experiments on these data.
11 SI Data Analysis Details
The evaluation setup for prediction and retrieval followed the general block structure defined by the experimental design. We applied a participant-specific, leave-one-block-out learning and evaluation strategy. The individual prediction models are single-trial prediction models . We report averaged prediction and retrieval performance, unless otherwise noted.
In detail, for a given participant, the blocks with explicit term relevance judgments provided by that participant were available. To retrieve a brain-feedback-based list of relevant documents for a specific block, two steps were executed. First, to obtain a term relevance prediction model for the given block, a classification model was trained using the data from the remaining blocks; its prediction performance was then evaluated on the left-out block. Second, to retrieve the set of documents for the block, the terms that the classifier predicted to be relevant with a probability above a threshold were used. The retrieval performance was evaluated against the expert judgements of document relevance for the relevant topic of the block.
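The leave-one-block-out splitting described above can be sketched as follows. This is a minimal illustration only: the function name `leave_one_block_out` and the representation of an epoch as a `(features, label)` pair are assumptions made for the example, not part of the original analysis code.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical representation: one epoch = (feature vector, relevance label 1/0).
Epoch = Tuple[List[float], int]


def leave_one_block_out(
    blocks: Dict[int, List[Epoch]],
) -> Iterator[Tuple[int, List[Epoch], List[Epoch]]]:
    """Yield (held_out_block, training_epochs, test_epochs) once per block.

    The training set pools the epochs of all other blocks of the same
    participant; the test set is the held-out block itself.
    """
    for held_out in sorted(blocks):
        train = [
            epoch
            for block_id, epochs in blocks.items()
            if block_id != held_out
            for epoch in epochs
        ]
        yield held_out, train, blocks[held_out]


if __name__ == "__main__":
    toy = {0: [([0.1], 1)], 1: [([0.2], 0), ([0.3], 1)], 2: [([0.4], 0)]}
    for block_id, train, test in leave_one_block_out(toy):
        print(block_id, len(train), len(test))
```

A classifier would be fitted on `train` and evaluated on `test` in each iteration, so that no epoch of the evaluated block is seen during training.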
As a baseline comparison, we evaluated the brain-feedback-based performances against random-feedback-based performances. The random-feedback scenario corresponds to standard permutation tests and results in permutation-based p-values . The following sections give concrete details on the methodology used.
11.1 EEG cleaning and preparing
The EEG signals were cleaned and prepared following standard BCI guidelines . During recording, a hardware low-pass filter at 1000 Hz was applied. The continuous EEG recordings were filtered with a 35 Hz FIR1 low-pass filter and a high-pass filter. The signal was then divided into epochs defined relative to the onset of each stimulus. Baseline correction was performed on each epoch using the pre-stimulus period. A simple heuristic was applied to reject invalid channels and epochs. First, invalid epochs were identified based on the epochs’ variances and the max-min criterion. A channel was removed if its number of invalid epochs exceeded a set proportion of all available epochs. After removing all invalid channels, invalid epochs were estimated again and removed. This data cleaning approach was carried out to eliminate noise and potential confounds from common artifacts such as eye movements and blinks, as well as artifacts caused by loose electrodes or a cap that did not fit perfectly. Table 2 shows the statistics for the cleaning process for each participant.
11.2 Feature engineering
Event-related potentials are characterized by their temporal evolution and the corresponding spatial potential distributions. We followed standard feature engineering procedures to create spatio-temporal ERP features for classification . For each epoch, the raw EEG data (after basic cleaning) were available as a spatio-temporal matrix with one row per channel and one column per sampled time point. For each epoch, the interval between 250 ms and 950 ms after stimulus onset was divided into equidistant windows. The number of windows was chosen based on data recorded during the pilot experiments. For each channel, the potential values within one window were averaged, resulting in a reduced spatio-temporal matrix with one row per channel and one column per window. The final feature representation of one epoch was the concatenation of all columns of this matrix into one vector. For a specific block, the full spatio-temporal feature matrix used for classification contained one such feature vector per epoch. Note that the number of channels and the number of epochs were participant-specific, as they depended on the EEG cleaning and preparing procedure. Table 2 shows the concrete numbers for each participant.
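The window-averaging step can be illustrated with a short sketch. The function name `window_features`, the list-of-lists epoch layout, and the half-open window boundaries are assumptions made for this example; the number of windows is passed as a parameter because it was tuned on pilot data.

```python
from typing import List, Sequence


def window_features(
    epoch: Sequence[Sequence[float]],
    times_ms: Sequence[float],
    n_windows: int,
    start_ms: float = 250.0,
    end_ms: float = 950.0,
) -> List[float]:
    """Average each channel within equidistant post-stimulus windows.

    epoch    -- one row per channel, one value per sampled time point
    times_ms -- sample times relative to stimulus onset, same length as a row
    Returns the channel-by-window matrix flattened column by column, i.e.
    all channels of window 1, then all channels of window 2, and so on.
    """
    width = (end_ms - start_ms) / n_windows
    features: List[float] = []
    for w in range(n_windows):
        lo = start_ms + w * width
        hi = lo + width
        # Assumption: windows are half-open intervals [lo, hi).
        idx = [i for i, t in enumerate(times_ms) if lo <= t < hi]
        if not idx:
            raise ValueError(f"no samples fall into window {w}")
        for channel in epoch:
            features.append(sum(channel[i] for i in idx) / len(idx))
    return features


if __name__ == "__main__":
    times = [200, 300, 400, 700, 800, 1000]
    toy_epoch = [[9, 1, 3, 5, 7, 9], [0, 2, 2, 4, 4, 0]]
    print(window_features(toy_epoch, times, n_windows=2))
```

Stacking the returned vectors of all epochs in a block row by row gives the feature matrix passed to the classifier.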
12 SI Term Relevance Prediction
We developed term relevance prediction models within the framework of the linear EEG model  and single-trial ERP classification . In detail, we utilized Linear Discriminant Analysis (LDA, see ) and learned linear binary classifiers, which we used to predict class membership probabilities. The method assumes that the observations have been drawn from two multivariate Normal distributions, one for the class of “relevant” observations and the other for the class of “irrelevant” observations. For the estimation of the models we used shrinkage LDA, a covariance-regularized LDA with a shrinkage parameter selected by the analytical solution developed by Schäfer and Strimmer . This simple method was chosen because of its many successful applications in the BCI community . A further major reason is its robustness against class imbalance , which is inherent in the proposed paradigm (see also Table 2 for the relevance class distribution per participant).
12.1 Leave-one-block-out evaluation
For each participant, we trained a set of eight classifiers. The classifier for a given block was trained with the epochs from the other blocks, i.e., with their spatio-temporal feature matrices, and was evaluated on the epochs from the held-out block, i.e., on that block's feature matrix. The performance measures of interest were the area under the ROC curve (AUC), precision, and tf-idf-weighted precision. The AUC is defined as the area under the ROC curve, which links the true positive rate to the false positive rate. A perfect model has an AUC of 1, and a random model has an AUC of 0.5. AUC is a global quality measure of the classification model. This measure was chosen because it allowed us to correctly evaluate the models in the existing class imbalance scenario and because it is a comprehensive measure for comparison to the random feedback models. Precision is defined as
$$\mathrm{precision} = \frac{tp}{tp + fp},$$
where $tp$ is the number of true positives (i.e., relevant words predicted to be relevant) and $fp$ is the number of false positives (i.e., irrelevant words predicted to be relevant). This measure was chosen because we want to have a high precision (i.e., many correct relevant words) for the document retrieval step. Weighted precision is defined as
$$\mathrm{precision}_w = \frac{tp_w}{tp_w + fp_w},$$
where $tp_w$ is the sum of the term-frequency–inverse document frequency (tf-idf) values of the true positive words, and $fp_w$ is the sum of the tf-idf values of the false positive words. In our case, the tf-idf values come either from the relevant document or from the irrelevant document. For a positively predicted word that does not occur in a document, the tf-idf value is set to 0. This reflects that such a word has no influence on the document retrieval.
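The two precision measures can be written down compactly. The sketch below is illustrative: the function names are hypothetical, and using a single `tfidf` lookup for all words is a simplifying assumption (the text draws the values from the relevant or the irrelevant document). Words missing from the lookup receive a tf-idf of 0, as described above.

```python
from typing import Dict, Set


def precision(predicted: Set[str], relevant: Set[str]) -> float:
    """tp / (tp + fp) over the words predicted to be relevant."""
    tp = len(predicted & relevant)
    fp = len(predicted - relevant)
    return tp / (tp + fp) if (tp + fp) else 0.0


def weighted_precision(
    predicted: Set[str], relevant: Set[str], tfidf: Dict[str, float]
) -> float:
    """Precision in which every predicted word counts with its tf-idf value.

    Words without a tf-idf entry contribute 0, so they do not influence
    the result (mirroring their lack of influence on retrieval).
    """
    tp_w = sum(tfidf.get(word, 0.0) for word in predicted & relevant)
    fp_w = sum(tfidf.get(word, 0.0) for word in predicted - relevant)
    total = tp_w + fp_w
    return tp_w / total if total else 0.0


if __name__ == "__main__":
    predicted = {"atom", "nucleus", "river"}
    relevant = {"atom", "nucleus", "matter"}
    tfidf = {"atom": 3.0, "nucleus": 1.0, "river": 1.0}
    print(precision(predicted, relevant))
    print(weighted_precision(predicted, relevant, tfidf))
```

In the toy example the weighted variant is higher than the plain one because the false positive carries little tf-idf mass compared with the true positives.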
12.2 Random feedback evaluation
For a given block, a classifier was trained and evaluated on data with permuted relevance judgments. If executed for a large number of permutations, this random-feedback strategy is a permutation test, resulting in a permutation-based p-value . The null hypothesis of the test assumes that the brain data and the relevance judgments are independent. A small p-value indicates that the classifier is able to find a significant structure discriminating “relevant” and “irrelevant” brain signal patterns. For each block, a fixed number of permutations was performed, which determines the smallest possible p-value.
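A permutation test of this kind can be sketched in a few lines. The estimator used here, (k + 1) / (n + 1) with k the number of permutations scoring at least as well as the real labels, is one common choice and an assumption of this example; the function name, the `score_fn` callback (which would retrain and evaluate the classifier on the permuted labels), and the seed are likewise hypothetical.

```python
import random
from typing import Callable, List


def permutation_p_value(
    observed_score: float,
    labels: List[int],
    score_fn: Callable[[List[int]], float],
    n_permutations: int,
    seed: int = 0,
) -> float:
    """Estimate how often permuted labels score at least as well as the real ones.

    score_fn receives a permuted copy of the labels and returns the
    performance (e.g. AUC) of a model trained and evaluated on them.
    """
    rng = random.Random(seed)
    at_least_as_good = 0
    for _ in range(n_permutations):
        permuted = labels[:]
        rng.shuffle(permuted)
        if score_fn(permuted) >= observed_score:
            at_least_as_good += 1
    # Assumption: the +1 correction counts the observed labelling itself,
    # so the smallest attainable value is 1 / (n_permutations + 1).
    return (at_least_as_good + 1) / (n_permutations + 1)


if __name__ == "__main__":
    toy_labels = [1, 0, 0, 1, 0, 0]
    # Stand-in scorer for illustration: chance-level performance every time.
    print(permutation_p_value(0.7, toy_labels, lambda _: 0.5, n_permutations=99))
```

With this estimator the number of permutations directly bounds the resolution of the test, which is why it has to be fixed in advance.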
13 SI Intent Modeling-based Recommendation
We developed an intent estimation model to predict how relevant each term the user read is to the topic of interest. This model was then used to retrieve new documents from the database. The motivation for the intent model is that the predictions of the term-relevance model can indicate the relevance to a topical intent, but the individual words for which the predictions are drawn may not represent the whole topic. For example, the words “matter” and “neutrons” are related to the topic “Atom,” but would not alone be sufficient search terms to retrieve information about the topic “Atom.” Therefore, these words are used as positive feedback for the intent model to predict that, for example, the words “atom,” “atomic,” and “nucleus” are also relevant for the user given the positively predicted words “matter” and “neutrons.” We call the resulting model the intent model of the user .
13.1 Document representation
The documents and words are modeled as a term-document matrix with one row per term and one column per document. The vector of a term indicates the weight of the corresponding stemmed word in each of the documents. The words are stemmed using the English Porter Stemmer , and the stemmed words are referred to as terms. Before stemming, English stop words, as given in the Apache Lucene 4.10 stop word list (https://lucene.apache.org/), were removed. We used tf-idf weighting to account for the frequency and specificity of each term .
13.2 Intent model
The intent model estimates a weight for each term based on the input from the term-relevance prediction classifier. Let $X$ denote the term-document matrix, whose row $x_i$ is the vector of term $i$. The feedback from the term-relevance predictions is denoted as $r$ and is available for a subset of terms, whose rows form the matrix $X_t$. We assume that the term-relevance prediction $r_i$ of a term $x_i$ is a random variable with expected value $\mathbb{E}[r_i] = x_i \cdot w$, such that the expected weight is a linear function of the terms. The unknown weight vector $w$ is essentially the representation of the user’s intent and determines the relevance of terms.
To estimate $w$ we utilize the LinRel algorithm . It learns a linear regression model of the form $r = X_t w$. LinRel allows control for the uncertainty related to the term weight estimates. The choice of this method was based on its robustness against suboptimal input, which is the case for potentially noisy predictions of the term-relevance prediction model.
LinRel computes a regularized regression weight vector $a_i$ for each term $x_i$ in $X$:
$$a_i = x_i \left(X_t^\top X_t + \lambda I\right)^{-1} X_t^\top,$$
where $I$ is the identity matrix, $\lambda$ is a regularization parameter, and all terms except $x_i$ on the right-hand side are shared for all keywords. Then for each keyword, the final relevance score at the current iteration is computed by taking into account the feedback obtained so far:
$$s_i = a_i \cdot r + \frac{c}{2} \lVert a_i \rVert,$$
where $r$ is the vector of term-relevance predictions obtained, $a_i$ is the weight vector of a single keyword in the data $X$, $\lVert a_i \rVert$ is the norm of the weight vector, and the constant $c$ is used to adjust the influence of the history (it was chosen to give equal weight to exploration and exploitation). It can be shown that this procedure is equivalent to estimating the upper confidence bound in a linear regression problem .
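To make the scoring step concrete, the following sketch computes LinRel-style scores for a toy problem. It assumes the commonly used form of the algorithm, a ridge-regularized regression weight vector per term plus a confidence term proportional to its norm; the function names, the default values `lam=1.0` and `c=1.0`, and the small Gauss-Jordan solver are choices made for this illustration and are not taken from the original implementation.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def linrel_scores(
    feedback_rows: Matrix,  # term vectors that received feedback
    feedback: Vector,       # predicted relevance for those terms
    candidates: Matrix,     # term vectors to be scored
    lam: float = 1.0,       # assumed regularization strength
    c: float = 1.0,         # assumed exploration constant
) -> Vector:
    """Score each candidate as predicted relevance plus a confidence bonus."""
    d = len(feedback_rows[0])
    # M = X_t^T X_t + lam * I (symmetric, positive definite for lam > 0)
    m = [
        [
            sum(row[i] * row[j] for row in feedback_rows) + (lam if i == j else 0.0)
            for j in range(d)
        ]
        for i in range(d)
    ]
    scores = []
    for x in candidates:
        z = solve(m, x)  # z = M^{-1} x
        a = [sum(rk * zk for rk, zk in zip(row, z)) for row in feedback_rows]
        mean = sum(ai * ri for ai, ri in zip(a, feedback))
        width = math.sqrt(sum(ai * ai for ai in a))
        scores.append(mean + (c / 2.0) * width)
    return scores


if __name__ == "__main__":
    rows = [[1.0, 0.0], [0.0, 1.0]]  # two terms with feedback
    print(linrel_scores(rows, [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

In the toy run the candidate that matches the positively rated term receives the higher score, while the confidence term keeps the other candidate above zero.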
13.3 Retrieval model
The intent model estimates a weight for each term, which, in turn, is used to retrieve new documents from the database to be recommended to the user. We use a unigram language modeling approach to information retrieval . In detail, the vector $s$ of estimated term weights is treated as a sample of a desired document, and documents $d_j$ are ranked by the probability that $s$ would be generated by the respective language model $M_{d_j}$ for the document.
Using maximum likelihood estimation, we get
$$P(s \mid M_{d_j}) = \prod_{i} \hat{P}_{\mathrm{mle}}(k_i \mid M_{d_j})^{s_i},$$
and to avoid zero probabilities and improve the estimation we then compute a smoothed estimate by Bayesian Dirichlet smoothing so that
$$\hat{P}(k_i \mid M_{d_j}) = \frac{n(k_i, d_j) + \mu \, p(k_i \mid C)}{\sum_{k} n(k, d_j) + \mu},$$
where $n(k_i, d_j)$ is the count of term $k_i$ in document $d_j$, $p(k_i \mid C)$ is the occurrence probability (proportion) of term $k_i$ in the whole document collection $C$, and the parameter $\mu$ is set to 2000 as suggested in the literature .
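The ranking step can be sketched with a weighted query-likelihood score under Dirichlet smoothing. The function names and the dictionary-based document representation are assumptions of this example; the smoothing parameter defaults to 2000 as stated above, and scores are computed in log space to avoid numerical underflow.

```python
import math
from typing import Dict, List


def dirichlet_log_likelihood(
    term_weights: Dict[str, float],
    doc_counts: Dict[str, int],
    collection_prob: Dict[str, float],
    mu: float = 2000.0,
) -> float:
    """Weighted log-likelihood of the intent terms under a document model.

    Each term probability is the Dirichlet-smoothed estimate
    (count in document + mu * collection probability) / (document length + mu).
    collection_prob must hold a positive value for every weighted term.
    """
    doc_len = sum(doc_counts.values())
    total = 0.0
    for term, weight in term_weights.items():
        p = (doc_counts.get(term, 0) + mu * collection_prob[term]) / (doc_len + mu)
        total += weight * math.log(p)
    return total


def rank_documents(
    term_weights: Dict[str, float],
    docs: Dict[str, Dict[str, int]],
    collection_prob: Dict[str, float],
    mu: float = 2000.0,
) -> List[str]:
    """Return document ids ordered from most to least likely."""
    return sorted(
        docs,
        key=lambda d: dirichlet_log_likelihood(
            term_weights, docs[d], collection_prob, mu
        ),
        reverse=True,
    )


if __name__ == "__main__":
    collection = {"atom": 0.001}
    documents = {"A": {"atom": 5, "river": 5}, "B": {"river": 10}}
    print(rank_documents({"atom": 1.0}, documents, collection))
```

Thanks to the smoothing, a document that never mentions a term still receives a non-zero probability for it, so a single missing term does not eliminate the document from the ranking.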
13.4 Recommendation evaluation
The evaluation setup for the recommendation was designed analogously to that for term-relevance prediction. Each classifier output for a block was given as input to the intent model. The resulting intent model was used to predict relevant words, and a ranked set of the top-30 documents was retrieved from the whole English Wikipedia corpus.
13.5 Random feedback recommendation
For a given block, the recommendation was evaluated with term-relevance input resulting from permuted relevance judgments. As with the relevance prediction, this random strategy is also a permutation test. A small p-value indicates that the recommendation system retrieves more relevant documents based on the brain input than with random input. Following the evaluation setup of term-relevance prediction, permutations were performed for each block.
13.6 Performance measures
The recommendation performance was evaluated using cumulative information gain (CG) . The cumulative information gain is defined simply as the sum of the relevance scores assigned by the experts to the documents that the retrieval system ranked in the top 30 in response to the input. Formally,
$$CG = \sum_{i=1}^{30} R_i,$$
where $R_i$ is the relevance score of the $i$th document in the ranked list. This measure was chosen because it allows graded relevance assessments: some documents may be highly relevant and some documents may be marginally relevant. The cumulative gain may be different for different topics: some topics may have many highly relevant documents, and some may have only a few.
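As a small worked example, cumulative gain over a ranked list reduces to a truncated sum. The function name and the toy relevance grades are hypothetical; the default cutoff of 30 follows the top-30 evaluation described above.

```python
from typing import Sequence


def cumulative_gain(ranked_relevance: Sequence[float], k: int = 30) -> float:
    """Sum the expert relevance scores of the top-k ranked documents."""
    return sum(ranked_relevance[:k])


if __name__ == "__main__":
    # Expert grades of retrieved documents, in ranked order (toy values).
    print(cumulative_gain([3, 2, 0, 1], k=3))
```

Because the measure is a plain sum, it does not depend on the order of documents within the top k, and its maximum depends on how many highly relevant documents exist for the topic.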
| Document | #Relevant | #Irrelevant | #Retrieved Documents | #Relevant Documents | #Irrelevant Documents | Top-30 Score | Maximum Score |
| Participant | #Recorded Channels | #Accepted Channels | #Blocks | #Recorded Epochs | #Accepted Epochs | #Relevant Epochs | #Irrelevant Epochs |
- Auer  Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397–422, March 2003. ISSN 1532-4435.
- Blankertz et al.  Benjamin Blankertz, Gabriel Curio, and Klaus-Robert Müller. Classifying single trial EEG: Towards brain computer interfacing. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 157–164. MIT Press, 2002.
- Blankertz et al.  Benjamin Blankertz, Steven Lemm, Matthias Treder, Stefan Haufe, and Klaus-Robert Müller. Single-trial analysis and classification of ERP components – A tutorial. NeuroImage, 56(2):814–825, 2011. doi: 10.1016/j.neuroimage.2010.06.048.
- Brunetti et al.  Enzo Brunetti, Pedro E. Maldonado, and Francisco Aboitiz. Phase synchronization of delta and theta oscillations increase during the detection of relevant lexical information. Frontiers in Psychology, 4(308), 2013. doi: 10.3389/fpsyg.2013.00308.
- Cambridge English Language Assessment  Cambridge English Language Assessment. Test your English – Adult Learners, 2014. http://www.cambridgeenglish.org/test-your-english/adult-learners/.
- Cohen  Jacob Cohen. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213–220, 1968. doi: 10.1037/h0026256.
- Cohen  M. S. Cohen. Handedness questionnaire, 2014. http://www.brainmapping.org/shared/Edinburgh.php.
- Donchin and Coles  Emanuel Donchin and Michael G. H. Coles. Is the P300 component a manifestation of context updating? Behavioral and Brain Sciences, 11:357–374, 9 1988. ISSN 1469-1825. doi: 10.1017/S0140525X00058027.
- Eugster et al.  Manuel J. A. Eugster, Tuukka Ruotsalo, Michiel M. Spapé, Ilkka Kosunen, Oswald Barral, Niklas Ravaja, Giulio Jacucci, and Samuel Kaski. Predicting term-relevance from brain signals. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’14, pages 425–434, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2257-7. doi: 10.1145/2600428.2609594.
- Good  Phillip I. Good. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer, 2 edition, 2000. ISBN 038798898X. doi: 10.1007/978-1-4757-3235-1.
- Hagoort et al.  P. Hagoort, C. Brown, and J. Groothusen. The syntactic positive shift (SPS) as an ERP measure of syntactic processing. Language and Cognitive Processes, 8(4):439–483, 1993.
- Hanani et al.  Uri Hanani, Bracha Shapira, and Peretz Shoval. Information filtering: Overview of issues, research and systems. User Modeling and User-Adapted Interaction, 11(3):203–259, August 2001. ISSN 0924-1868. doi: 10.1023/A:1011196000674.
- Hastie et al.  Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. 2 edition, 2009.
- Hauk and Pulvermüller  O. Hauk and F. Pulvermüller. Effects of word length and frequency on the human event-related potential. Clinical Neurophysiology, 115(5):1090–1103, 2004. ISSN 1388-2457. doi: 10.1016/j.clinph.2003.12.020.
- Hillyard and Kutas  Steven A. Hillyard and Marta Kutas. Electrophysiology of cognitive processing. Annual Review of Psychology, 34(1):33–61, 1983.
- Järvelin and Kekäläinen  Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, October 2002. ISSN 1046-8188. doi: 10.1145/582415.582418.
- Jones  Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
- Kelly and Teevan  Diane Kelly and Jaime Teevan. Implicit feedback for inferring user preference: A bibliography. SIGIR Forum, 37(2):18–28, September 2003. ISSN 0163-5840. doi: 10.1145/959258.959260.
- Kok  Albert Kok. Event-related-potential (ERP) reflections of mental resources: A review and synthesis. Biological Psychology, 45(1):19–56, 1997. doi: 10.1016/S0301-0511(96)05221-0.
- Kotchoubey and Lang  Boris Kotchoubey and Simone Lang. Event-related potentials in an auditory semantic oddball task in humans. Neuroscience Letters, 310(2):93–96, 2001. doi: 10.1016/S0304-3940(01)02057-2.
- Kutas and Hillyard  M. Kutas and S. A. Hillyard. Brain potentials during reading reflect word expectancy and semantic association. Nature, 307(5947):161–163, 1984.
- Münte et al.  T. F. Münte, H. J. Heinze, M Matzke, B. M. Wieringa, and S. Johannes. Brain potentials and syntactic violations revisited: No evidence for specificity of the syntactic positive shift. Neuropsychologia, 36(3):217–226, 1998.
- Münte et al.  Thomas F. Münte, Bernardina M. Wieringa, Helga Weyerts, Andras Szentkuti, Mike Matzke, and Sönke Johannes. Differences in brain potentials to open and closed class words: Class and frequency effects. Neuropsychologia, 39(1):91–102, 2001. ISSN 0028-3932. doi: 10.1016/S0028-3932(00)00095-6.
- Ojala and Garriga  Markus Ojala and Gemma C. Garriga. Permutation tests for studying classifier performance. Journal of Machine Learning Research, 11:1833–1863, 2010.
- Oldfield  R. C. Oldfield. The assessment and analysis of handedness: The Edinburgh inventory. Neuropsychologia, 9(1):97–113, 1971.
- Osterhout et al.  L. Osterhout, R. McKinnon, M. Bersick, and V. Corey. On the language specificity of the brain response to syntactic anomalies: Is the syntactic positive shift a member of the P300 family? Journal of Cognitive Neuroscience, 8(6):507–526, 1996.
- Paller and Kutas  Ken A. Paller and Marta Kutas. Brain potentials during memory retrieval provide neurophysiological support for the distinction between conscious recollection and priming. Journal of Cognitive Neuroscience, 4(4):375–392, 1992. doi: 10.1162/jocn.1992.4.4.375.
- Parra et al.  Lucas C. Parra, Clay D. Spence, Adam D. Gerson, and Paul Sajda. Recipes for the linear analysis of EEG. NeuroImage, 28:326–341, 2005.
- Pfefferbaum et al.  Adolf Pfefferbaum, Brant G Wenegrat, Judith M Ford, Walton T Roth, and Bert S Kopell. Clinical application of the P3 component of event-related potentials. ii. dementia, depression and schizophrenia. Electroencephalography and Clinical Neurophysiology/Evoked Potentials Section, 59(2):104–124, 1984. doi: 10.1016/0168-5597(84)90027-3.
- Ponte and Croft  Jay M. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, pages 275–281, New York, NY, USA, 1998. ACM. ISBN 1-58113-015-5. doi: 10.1145/290941.291008.
- Porter  M. F. Porter. An algorithm for suffix stripping. In Karen Sparck Jones and Peter Willett, editors, Readings in Information Retrieval, pages 313–316. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997. ISBN 1-55860-454-5.
- Rugg et al.  Michael D. Rugg, Ruth E. Mark, Peter Walla, Astrid M. Schloerscheidt, Claire S. Birch, and Kevin Allan. Dissociation of the neural correlates of implicit and explicit memory. Nature, 392:595–598, 1998. doi: 10.1038/33396.
- Ruotsalo et al.  Tuukka Ruotsalo, Jaakko Peltonen, Manuel J.A. Eugster, Dorota Glowacka, Ksenia Konyushkova, Kumaripaba Athukorala, Ilkka Kosunen, Aki Reijonen, Petri Myllymäki, Giulio Jacucci, and Samuel Kaski. Directing exploratory search with interactive intent modeling. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM ’13, pages 1759–1764, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2263-8. doi: 10.1145/2505515.2505644.
- Ruotsalo et al.  Tuukka Ruotsalo, Giulio Jacucci, Petri Myllymäki, and Samuel Kaski. Interactive intent modeling: Information discovery beyond search. Commun. ACM, 58(1):86–92, December 2015. ISSN 0001-0782. doi: 10.1145/2656334.
- Sassenhagen et al.  J. Sassenhagen, M. Schlesewsky, and I. Bornkessel-Schlesewsky. The P600-as-P3 hypothesis revisited: Single-trial analyses reveal that the late EEG positivity following linguistically deviant material is reaction time aligned. Brain and Language, 137:29–39, 2014.
- Schäfer and Strimmer  Juliane Schäfer and Korbinian Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005. doi: 10.2202/1544-6115.1175.
- Schein et al.  Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’02, pages 253–260, New York, NY, USA, 2002. ACM. ISBN 1-58113-561-0. doi: 10.1145/564376.564421.
- Spapé et al.  Michiel M. Spapé, Guido P. H. Band, and Bernhard Hommel. Compatibility-sequence effects in the Simon task reflect episodic retrieval but not conflict adaptation: Evidence from LRP and N2. Biological Psychology, 88(1):116–123, 2011. doi: 10.1016/j.biopsycho.2011.07.001.
- Squires et al.  Kenneth C Squires, Emanuel Donchin, Ronald I Herning, and Gregory McCarthy. On the influence of task relevance and stimulus probability on event-related-potential components. Electroencephalography and Clinical Neurophysiology, 42(1):1–14, 1977. doi: 10.1016/0013-4694(77)90146-8.
- Sutton et al.  S. Sutton, M. Braren, J. Zubin, and E. R. John. Evoked-potential correlates of stimulus uncertainty. Science, 150:1187–1188, 1965.
- Sutton et al.  Samuel Sutton, Patricia Tueting, Joseph Zubin, and E. R. John. Information delivery and the sensory evoked potential. Science, 155(3768):1436–1439, 1967. doi: 10.1126/science.155.3768.1436.
- Vidaurre and Blankertz  Carmen Vidaurre and Benjamin Blankertz. Towards a cure for BCI illiteracy. Brain Topography, 23(2):194–198, 2010. doi: 10.1007/s10548-009-0121-6.
- Wikimedia  Wikimedia. Wikimedia downloads, 2014. https://dumps.wikimedia.org/.
- Xue and Titterington  Jing-Hao Xue and D. Michael Titterington. Do unbalanced data have a negative effect on LDA? Pattern Recognition, 41(5):1558–1571, 2008. doi: 10.1016/j.patcog.2007.11.008.
- Zander et al.  Thorsten O. Zander, Christian Kothe, Sabine Jatzev, and Matti Gaertner. Brain-Computer Interfaces: Applying our Minds to Human-Computer Interaction, chapter Enhancing Human-Computer Interaction with Input from Active and Passive Brain-Computer Interfaces, pages 181–199. Springer London, London, 2010. ISBN 978-1-84996-272-8. doi: 10.1007/978-1-84996-272-8_11.
- Zhai and Lafferty  Chengxiang Zhai and John Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’01, pages 334–342, New York, NY, USA, 2001. ACM. ISBN 1-58113-331-6. doi: 10.1145/383952.384019.