Relevance Prediction from Eye-movements Using Semi-interpretable Convolutional Neural Networks

01/15/2020 · by Nilavra Bhattacharya, et al. · The University of Texas at Austin · Lockheed Martin Corp. · IEEE

We propose an image-classification method to predict the perceived-relevance of text documents from eye-movements. An eye-tracking study was conducted where participants read short news articles, and rated them as relevant or irrelevant for answering a trigger question. We encode participants’ eye-movement scanpaths as images, and then train a convolutional neural network classifier using these scanpath images. The trained classifier is used to predict participants’ perceived-relevance of news articles from the corresponding scanpath images. This method is content-independent, as the classifier does not require knowledge of the screen-content or the user’s information-task. Even with little data, the image classifier can predict perceived-relevance with up to 80% accuracy. Compared to similar eye-tracking studies from the literature, this scanpath image classification method outperforms previously reported metrics by appreciable margins. We also attempt to interpret how the image classifier differentiates between scanpaths on relevant and irrelevant documents.







1. Introduction

Information relevance is one of the fundamental concepts in Information Science in general, and Information Retrieval (IR) in particular (Saracevic, 2007, 2016a). The primary purpose of IR systems is to fetch content which is useful and relevant to people. Understanding the cognitive processes of even one individual is challenging enough, and IR systems have to cater to a variety of users, who may have wildly different mental models of what they consider to be useful and relevant. To add another layer of complexity, these mental models are not static: they evolve as users’ knowledge and information needs change. Researchers have investigated various forms of ‘signals’ generated by users interacting with IR systems, which can serve as proxies for their mental processes. Examples include search queries, mouse-clicks, logs of viewed documents, and other forms of interaction-data. These proxies have been studied to infer what kind of information is relevant to users’ needs. Efforts from a system-centred perspective have focused on minimizing the gap between the users’ query and the documents retrieved. The search query is considered to be an exact representation of the users’ information needs. Documents matching the query under a given algorithm are deemed to contain the information that users are searching for, and are therefore relevant. This notion of relevance is regarded as algorithmic-, or system-relevance (Saracevic, 2016b). The limitation of this perspective is that the query is seldom an exact representation of what the user is looking for. As a result, retrieved documents often do not satisfy the user’s information needs.

In a human-centred perspective, relevance arises from interactions between a user’s information need and information objects (Borlund, 2003). This interaction results in several manifestations of relevance (Saracevic, 2016b), and becomes meaningful “only … in relation to goals and tasks” (Hjørland, 2010). Our interest is in situational relevance, or utility. As introduced by Wilson, “situationally relevant items of information are those that answer, or logically help to answer, questions of concern” (Wilson, 1973). In this paper, we refer to situational relevance as the users’ perceived-relevance of the documents they examine for answering a question.

Neuro-physiological methods provide an interesting avenue to observe users while they interact with information systems. One popular method is eye-tracking, which captures the eye-movement patterns of users as they examine information on a screen. Eye-tracking has been frequently used to assess whether the screen-content is relevant to the user (Section 2.1). The method has some distinct advantages. Eye-tracking is non-invasive, and requires minimal to no effort from the user. Even when users are not clicking the mouse or typing a query, they are viewing the screen, and thus helping to provide continuous data in a more natural setting. Eye-tracking can give insights about the focus and progression of an information searcher’s attention in real-time. Eye-movements are sometimes considered to be a closer proxy for human cognition (Just and Carpenter, 1987) than queries and interaction logs.

Despite its many advantages, interpreting eye-tracking data is not straightforward. Often, a variable-length stream of real numbers is collected per stimulus. Given the dearth of standard methods, researchers resort to aggregating this data-stream into a set of single numbers, or features, at various levels of analysis (stimulus level, trial level, and/or participant level). By collapsing the eye-tracking data in this fashion, the fine-grained information about the individual user’s progress is lost. This reduces the robustness and generalizability of insights gained from the analysis.

We propose an image-classification method to predict users’ perceived-relevance from their eye-movement patterns. Our method is free from many of the inherent problems associated with analyzing eye-tracking data identified in the existing literature (Section 2). Specifically, we convert participants’ eye-movement scanpaths into images (Section 4.1), and then transform the relevance-prediction problem into an image-classification problem (Section 4.3.1). For this purpose, we use state-of-the-art image classifiers based on convolutional neural networks. Our method gives promising results, and outperforms many previously reported performances in similar studies by appreciable margins (Section 5.1). We also attempt to interpret how the classifier possibly differentiates between user-reading-patterns on relevant and irrelevant documents (Section 5.4). Finally, we discuss the limitations of our approach, and propose future directions of research (Section 6).

2. Related Work

2.1. Information Relevance and Eye-tracking

One of the earliest studies employing eye-tracking for inferring users’ perceived-relevance was reported by Salojärvi et al. (Salojärvi et al., 2005). Participants saw a question and a list of ten sentences. One sentence had the correct answer to the question, and the others were either relevant or irrelevant to the question. Hidden Markov Models were used to predict the type of sentences the participants were reading. Many subsequent studies have investigated the relationship between eye-movements and viewing relevant vs. irrelevant information. These studies employed similar experimental setups, where participants examined a list of words, sentences, or documents, and judged their relevance in relation to a specific query or task.

In a majority of these relevance assessment studies, a common theme is to collapse the stream of eye-movement data into a set of single-number features, at various levels of analysis (stimulus, trial, or participant level). These features are then used for statistical inferences, classification, and prediction. For instance, some variants of aggregated fixation-count and fixation-duration were used in studies reported in (Puolamäki et al., 2008; Fahey et al., 2011; Loboda et al., 2011; Frey et al., 2013; Gwizdka, 2014a, 2017; Wittek et al., 2016; Wenzel et al., 2017). Eye-dwell time and/or visit time was used by Fahey et al. (Fahey et al., 2011). Salojärvi and colleagues identified a comprehensive list of 22 such features (Salojärvi et al., 2005), which were later used by others (e.g., Hardoon et al. (Hardoon et al., 2007)).

While fixation-count, fixation-duration, and dwell-time are generic eye-movement features applicable to any type of stimuli, several studies used features specific to reading text. These works first labelled each eye-fixation as either reading or scanning/skimming, and then used measures derived from these two types of fixations. Buscher et al. (Buscher et al., 2008) used the reading-to-skimming ratio to infer when participants were reading relevant text. Over a group of studies, Gwizdka et al. (Gwizdka, 2014a, b, 2017; Gwizdka et al., 2017) reported that reading speed, number of fixations on words, count and length of reading sequences, count and percentage of words fixated upon, durations of reading and scanning, and distance covered by scanning proved to be good indicators of perceived-relevance for textual documents.

Research on non-textual relevance assessment has also used the approach of aggregated features. For instance, the relevance of images has been studied in (Zhang et al., 2010; Hardoon and Pasupa, 2010; Klami et al., 2008; Brouwer et al., 2009; Golenia et al., 2015; Golenia et al., 2018; Haji Mirza and Izquierdo, 2010), while that of live webpages was studied in (Loyola et al., 2015; Gwizdka and Zhang, 2015; Wu et al., 2019). Though most studies used features aggregated over the whole stimulus duration, the authors of (Gwizdka et al., 2017) report that features from two-second windows near the end of viewing had more discriminating power than those obtained near the beginning of viewing. Thus, collapsing eye-tracking data, and thereby losing temporal information, reduces our understanding of human relevance assessment.

In terms of models used, most studies employed popular classifiers like Random Forests (RF) and Support Vector Machines (SVM). A few studies employed Hidden Markov Models (Simola et al., 2008) and Neural Networks (Chow and Gedeon, 2015). Performance varied based on the choice of features. For instance, Wu et al. (Wu et al., 2019) predicted user-satisfaction while examining search results. They used advanced mathematical features (e.g., max. and SD of integrated curvature of fixations, using the Frenet frame and Bishop frame) which are usually difficult to conceive in information science research. They obtained F1 scores in the range of 0.5 - 0.7 using RF and SVM. Slanzi et al. (Slanzi et al., 2017) predicted web-surfers’ click-intention from eye-tracking features. They used a battery of classifiers, but the F1 scores were not promising. Thus, appropriate feature selection is crucial to obtain good prediction performance when aggregating eye-tracking data.

In summary, the use of aggregated eye-tracking features and traditional classification techniques has resulted in unpromising performances for relevance prediction. While statistical tests reached significance, the classification and prediction accuracies were rarely more than 70% (Slanzi et al., 2017; Wenzel et al., 2017; Simola et al., 2008; Gwizdka and Zhang, 2015). In our proposed method, we demonstrate that by utilizing the entire eye-tracking data and applying image-classification techniques, we can predict perceived-relevance with up to 80% accuracy.

2.2. Eye-movement Scanpath Analysis

The issues discussed in Section 2.1 arise from the dearth of appropriate analysis methods for eye-tracking data. The entire eye-movement trajectory of a user on a stimulus is called a scanpath. A scanpath has various spatial and temporal attributes associated with it: its geometric shape and size, the count and duration of fixations, and the sequential information of the fixations as they occurred in time. As of this writing, there is no standard lossless method for representing all this information as a set of features. Analyzing the differences between groups of scanpaths on relevant and irrelevant documents therefore becomes tricky, and the results vary based on the chosen set of features.

Several scanpath comparison algorithms have been proposed, which use either (a) the actual fixation points from the eye-movement trajectory (Jarodzka et al., 2010; Dewhurst et al., 2012), or (b) a string representation of the trajectory, using letter-labels to categorize each fixation (Anderson et al., 2015; Holmqvist et al., 2011). The first approach works only with scanpaths having an equal number of fixations. To deal with scanpaths having differing numbers of fixations, the algorithm deletes or clusters some fixations together (a simplification step) such that all scanpaths have an identical number of fixations. We argue that such an approach may work well for non-reading tasks (e.g. viewing images), but for analyzing eye-movements while reading, all fixation points should be preserved: nearby fixations on distinct words should not be clustered into one fixation, as they may contain important information pertinent to the reading task. The second scanpath comparison approach uses a string representation of the two scanpaths, and compares them using either the Levenshtein distance (Brandt and Stark, 1997; Duchowski et al., 2010) or the Needleman-Wunsch algorithm (Cristino et al., 2010; West et al., 2006). This method assumes that annotated data is available for all the fixations. However, such annotations are not available when we do not have pre-existing insights about the eye-movements for our task. A common limitation of both methods is that they work for pairwise comparisons only, and cannot be easily extended to compare between groups of scanpaths.
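For concreteness, the string-based comparison operates on letter-coded scanpaths, where each letter labels the screen region of one fixation. The sketch below is a textbook dynamic-programming Levenshtein distance, not code from the cited works, and the example strings are hypothetical:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two letter-coded scanpaths."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical scanpaths over regions A, B, C: one fixation differs
print(levenshtein("AABCC", "ABBCC"))  # 1
```

A small distance indicates two viewers fixated similar regions in a similar order; as noted above, the approach still requires a region label for every fixation and only compares scanpaths pairwise.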

2.3. Image Classification using Convolutional Neural Networks (CNN)

As introduced in Section 1, we propose an image-classification approach for predicting perceived-relevance. Over the last decade, image classification, and computer vision in general, has seen tremendous improvement, driven largely by the revival of the Convolutional Neural Network (CNN). Although developed in the 1970s, CNNs did not play a major role in computer vision research until the 2010s, due to the lack of adequate computing capabilities for fast execution. In 2012, Ciresan et al. (Ciregan et al., 2012) applied the max-pooling operation after convolution, using dedicated hardware GPUs. This process significantly improved the benchmark performances of numerous computer vision algorithms. Around the same time, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2015) began to be organized annually. The goal of the challenge was to beat previous years’ top performances for object recognition tasks (object recognition encompasses image classification and object detection) on more than 14 million annotated images. Various research institutions began participating in the challenge, and the competition spearheaded the emergence of high-performing CNN architectures that began to be regarded as benchmarks. Examples of such benchmark architectures are VGG (Simonyan and Zisserman, 2014), DenseNet (Huang et al., 2017), ResNet (He et al., 2016a, b), Inception (Szegedy et al., 2015, 2016) and InceptionResNet (a combination of the Inception and ResNet architectures) (Szegedy et al., 2017). The architecture names often have numeric suffixes denoting the number of hidden layers, e.g., VGG16, VGG19, DenseNet121, DenseNet201, etc.

An interesting feature of CNN-based image classifiers is that the ‘knowledge’ learnt by the network for solving one problem can be reused to solve another, related problem. This is called transfer learning. The initial layers of a CNN-based image classifier learn low-level image features (edges, shapes, and corners), while the final layers learn increasingly abstract and task-specific features (Yosinski et al., 2015; Zeiler and Fergus, 2014; Krizhevsky et al., 2012). Since low-level image-feature detection is required in all forms of automated image understanding, transfer learning works well for research problems with relatively small datasets. For this reason, popular deep-learning frameworks (e.g., Keras, PyTorch, etc.) include many benchmark CNN architectures, with their weights pre-trained on the ImageNet challenge. In this work, we utilize several such benchmark CNN image classifiers to predict the perceived-relevance of documents from scanpath images.

A CNN is often considered a “black-box”, because its inner workings are not easily understandable. Various methods have been proposed to understand why the network makes a particular prediction (Springenberg et al., 2014; Selvaraju et al., 2017). One such method is Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017). The Grad-CAM method produces a heatmap, similar to an attention map, which highlights the regions of the input image that were focused on for making the prediction. For ‘known’ research problems (e.g. detecting cats vs. dogs in images), this visualization helps to verify whether the CNN is paying attention to the relevant image regions. In our case of classifying scanpath images according to perceived-relevance, the Grad-CAM visualizations can offer new insights about human reading behaviour on relevant and irrelevant documents.

Examining the challenges involved in relevance prediction using eye-tracking data (Section 2.1 and Section 2.2), and also the opportunities provided by image classification and transfer learning (Section 2.3), we propose an image-classification based solution for the problem of perceived-relevance prediction. The advantages of our method are:

  1. unlike previous studies where eye-tracking data was collapsed into a set of single-number features, our method uses all the collected data points, enabling a more nuanced analysis

  2. the spatial and temporal characteristics of eye-movement scanpaths can be utilized to make inferences

  3. our method is content independent, and does not require knowledge of what the user is viewing on the screen

  4. unlike approaches in reading-related studies, our method does not require additional insights about the data (e.g. need not label fixations as reading, scanning, etc.)

3. User Study

3.1. Experimental Design and Procedure

A controlled lab experiment was conducted in the Department of Kinesiology, University of Maryland, College Park. Participants (college-age students) judged the relevance of short news articles for answering a trigger question. Eye-tracking and EEG signals were recorded. In this paper, we report a novel analysis using only the eye-tracking data.

Figure 1. One trial in the experimental procedure.

The main element of the experimental procedure was a trial (Figure 1). In each trial, a trigger question was shown first. The trigger question was a short, one-sentence question, informing participants what to look for in the subsequently presented documents (e.g. “What is the birth name of Jesse Ventura?”). After the trigger question, a short news article was displayed, then a text relevance response (Y/N) screen appeared, then a list of words for further assessment was shown. Participants progressed between stimuli by pressing the space bar, with the exception of moving from a news article to the text-relevance response screen, which occurred by participants fixating their eyes for two seconds or longer in the lower-right screen-corner to indicate their readiness for relevance judgement. Finally, a fixation screen was shown for one second between trials. The list of words for further assessment is not analysed in this paper. The news articles were chosen to have three levels of relevance with respect to the trigger question:

  • Relevant (R): the article explicitly contained the exact answer asked in the question

  • Topical (T): partially relevant – the article did not contain the exact answer to the question, but was on the topic of the information asked in the question

  • Irrelevant (I): did not contain the answer to the question

We regard this three-level relevance for each news article as the article’s document-relevance. The source of these relevance labels are discussed in Section 3.2.

There were 40 trigger questions; each of them was associated with three news articles designed to contain exactly one R document, and one or more T or I documents. Thus, the following 12 permutations were possible for each question: {RTI, RIT, RTT, RII, TRI, IRT, TRT, IRI, TIR, ITR, TTR, IIR}. This yielded 120 sets of question + news article; each set constituting one experiment trial. The order of trials was randomized for each participant to mitigate order effects. Participants rated the news article as relevant or irrelevant (by pressing Y or N key) based on their judgement of whether the article contained an answer to the trigger question. These binary responses from each participant, for each news article, are regarded as the perceived-relevance for the user-document pair. Participants performed a training task consisting of six trials (two questions and six documents) before the 120 trials.
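The count of 12 orderings follows from placing the single R document in one of three positions and filling the remaining two slots with any of the four T/I combinations (3 × 4 = 12). A quick enumeration confirms this:

```python
from itertools import product

orderings = set()
for pos in range(3):                       # position of the single R document
    for fill in product("TI", repeat=2):   # the remaining two slots
        docs = list(fill)
        docs.insert(pos, "R")
        orderings.add("".join(docs))

print(len(orderings))  # 12
```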

3.2. Stimuli Dataset

The set of 40 trigger questions was selected from the TREC 2005 Question Answering Task (Voorhees and Dang, 2003). The collection of 120 short news articles and their document-relevance labels came from the AQUAINT Corpus of English News Text (Graff, 2002) (the same collection used in the TREC 2005 Q&A Task). The news articles were carefully selected to have similar text-lengths (mean length: 178 words, SD: 30 words).

3.3. Apparatus

Eye-tracking data was recorded in the lab on a Windows laptop PC connected to an SMI RED250 eye-tracker; participant relevance responses were recorded on a remote server. The eye-tracker's sampling rate is 250 Hz, with an accuracy of up to of visual angle. The screen resolution was . Eye-tracking data was captured by the SMI iViewX software, and the stimuli were presented by the SMI Experiment Center 360 v3.0 software. The textual stimuli were entered into Experiment Center's text editor as text elements, and displayed in a black Times font on a light-grey background. Line-height was approximately 32 pixels.

Figure 2. Distribution of recorded fixation-durations, and the corresponding encoding marker for representing fixations belonging to different levels (See Section 4.1.1).

4. Data Analysis

Eye-tracking data was processed using the SMI BeGaze Analysis Software (version 3.2). Data recording for one participant failed; hence we report the analysis for the remaining 24 participants. Fixations were detected using the Velocity-Threshold Identification (I-VT) algorithm, as implemented in the BeGaze software, with default parameter values.

Figure 3. Top: Typical eye-movement patterns when reading relevant, irrelevant, and topical documents. Bottom: Examples of generated scanpath images, which are used to train CNN classifiers for predicting the user’s perceived-relevance of the documents. This figure is best viewed in colour, on screen.

4.1. Generating Scanpath Images

We generated scanpath images from eye-tracking data of user-document pairs, using only three attributes of eye-fixations: screen-coordinates (in pixels), fixation duration (in ms), and start time of the fixation relative to stimulus-onset. We used Python Matplotlib library (Hunter, 2007) to generate the scanpath images. CNNs have been shown to be good at detecting local patterns within images (Andrearczyk and Whelan, 2017; Srinivas et al., 2017). Since we were preparing the images for training a CNN classifier, we made the following design choices:

4.1.1. Fixations:

Eye-fixations were encoded as marker points having varying shapes, sizes, and colours. These were controlled by the fixation duration as follows:

  • 110 - 250 ms: Level 1 fixations, encoded as red circle

  • 250 - 400 ms: Level 2 fixations, encoded as pink star

  • 400 - 550 ms: Level 3 fixations, encoded as yellow pentagon

  • > 550 ms: Level 4 fixations, encoded as white cross

These levels were identified empirically. We examined the distribution of fixation durations in our data, and roughly divided the range into three equal partitions (Figure 2). Fixations with durations less than 110 ms were discarded (Widdel, 1984; Salvucci and Goldberg, 2000). The marker size was made to increase with the level number. The fixation markers were chosen to be markedly different from each other (instead of, say, only circles), so that the CNN could possibly identify spatial patterns of similar-duration fixations.
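The duration-to-marker encoding above amounts to a small lookup. In this sketch, only the duration thresholds and level/colour pairings come from the list above; the function name and the Matplotlib marker codes ('o', '*', 'p', 'X') are our assumptions:

```python
def fixation_marker(duration_ms):
    """Map a fixation duration (ms) to (level, marker, colour) per the encoding above.
    Fixations shorter than 110 ms are discarded (returns None)."""
    if duration_ms < 110:
        return None
    if duration_ms < 250:
        return 1, "o", "red"       # Level 1: red circle
    if duration_ms < 400:
        return 2, "*", "pink"      # Level 2: pink star
    if duration_ms < 550:
        return 3, "p", "yellow"    # Level 3: yellow pentagon
    return 4, "X", "white"         # Level 4: white cross

print(fixation_marker(300))  # (2, '*', 'pink')
```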

4.1.2. Linearized Saccades:

Saccades are rapid eye-movements between two fixation points. Ideally, they follow ballistic paths. To keep things simple for our analysis, we plotted linearized saccades: the effective eye-movement between two fixations, represented as a straight line connecting the two points. For brevity, henceforth we will say ‘saccade’ to mean ‘linearized saccade’. We controlled the colour of the saccade lines to follow a linear colour scale, based on their temporal occurrence (the ‘winter’ colourmap in Matplotlib). The colour of the saccades changed linearly from blue (first saccade) to green (final saccade). Each individual saccade had a solid colour.

We also tested controlling the width of the saccade lines using saccade velocity (ratio of screen-distance covered to time taken). However, doing so made the scanpath-image too crowded, especially for scanpaths having more than 50 fixations. So we kept the width of the saccade lines constant at 2 pixels.

4.1.3. Colours:

Care was taken in selecting the colours of the fixations and the saccades. Using a colour wheel, the colours of the different fixation markers were chosen to be far apart from each other, as well as from the range of colours used to draw the saccades. We hypothesized that these colour choices would enable the CNN classifier to easily distinguish between fixations and saccades, and identify necessary patterns. Examples of typical eye-movement patterns on the three types of documents, and their corresponding generated scanpath images, are shown in Figure 3.
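Under these design choices, the saccade rendering can be sketched with Matplotlib roughly as follows. The fixation coordinates, figure size, background colour, and single marker style are illustrative assumptions (the full duration-based marker encoding is described in Section 4.1.1); only the 'winter' colourmap and the 2-pixel line width come from the text:

```python
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

# Hypothetical fixations: (x, y) screen coordinates in temporal order
fixations = [(100, 80), (220, 85), (340, 90), (150, 140), (270, 145)]
xs, ys = zip(*fixations)

fig, ax = plt.subplots(figsize=(4, 3))
ax.set_facecolor("black")

# Linearized saccades: straight lines between consecutive fixations,
# coloured along the 'winter' colourmap (blue = first, green = last)
colours = plt.cm.winter(np.linspace(0, 1, len(fixations) - 1))
for i in range(len(fixations) - 1):
    ax.plot(xs[i:i + 2], ys[i:i + 2], color=colours[i], linewidth=2)

# One marker per fixation (a single style here for brevity)
ax.scatter(xs, ys, marker="o", color="red", s=40, zorder=3)

ax.invert_yaxis()  # screen coordinates grow downwards
ax.axis("off")
buf = io.BytesIO()
fig.savefig(buf, format="png")
print(len(buf.getvalue()) > 0)  # a non-empty PNG image was produced
```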

4.2. Machine Learning Setup

Data was available for 24 participants, where each participant judged the binary relevance of 120 news articles. In total, we had eye-tracking data for 2,880 user-document pairs, or 2,880 scanpaths. After data cleaning, we decided to use only scanpaths having 10 or more fixations, assuming that at least 10 fixations, or a minimum eye dwell-time of 1 second on the document (at 100 ms / fixation), is required to make a relevance assessment. This left us with 2,579 scanpath images.

4.2.1. Train / Validation / Test Partition:

As human-information-interaction researchers, we are more interested in studying human behaviour. So we used the participants’ perceived-relevance labels as the ground-truth for our classification task (and not the document-relevance obtained from TREC dataset). Out of the 2,579 scanpath images, only 806 (31.2%) were for documents marked relevant. Thus, there was almost a 1:2 class imbalance. Since this is an initial attempt to apply image classification on scanpath images, we decided to use a balanced dataset. So we randomly sampled 806 images from the pool of irrelevant scanpath images, and created a perfectly balanced dataset of 1,612 images. We used an approximate 60-20-20 split to randomly place 966 images in the training set, 314 images in the validation set, and 332 images in the test set. The relevant/irrelevant class balance was preserved in each set. All random selections were performed using the MySQL rand() function.
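The balancing and splitting procedure can be sketched as follows. Python's random module stands in for the MySQL rand() function used in the study, and the image identifiers are placeholders; splitting each class separately is one way to preserve the 1:1 balance in every set (the paper's actual counts were 966/314/332):

```python
import random

random.seed(42)

# Hypothetical pool: 806 relevant and 1,773 irrelevant scanpath images (2,579 total)
pool = [(f"img{i}", "relevant") for i in range(806)] + \
       [(f"img{i}", "irrelevant") for i in range(806, 2579)]

relevant = [p for p in pool if p[1] == "relevant"]
# Randomly undersample the majority class to match the minority class
irrelevant = random.sample([p for p in pool if p[1] == "irrelevant"], len(relevant))

def split_60_20_20(items):
    """Approximate 60-20-20 split of one class."""
    items = items[:]
    random.shuffle(items)
    n = len(items)
    return items[:int(0.6 * n)], items[int(0.6 * n):int(0.8 * n)], items[int(0.8 * n):]

# Split each class separately so the 1:1 balance is preserved in every set
train, val, test = (r + i for r, i in zip(split_60_20_20(relevant),
                                          split_60_20_20(irrelevant)))
print(len(train), len(val), len(test))  # 966 322 324
```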

4.3. Analysis Procedure

4.3.1. Image Classification Setup:

We posed our binary classification problem as follows: given only the scanpath image of a user’s eye movements on a short news article, did the user perceive the article to be relevant for answering a trigger question?

For this binary classification problem, we analysed the performance of six popular CNN-based architectures: VGG16 and VGG19 (Simonyan and Zisserman, 2014), DenseNet121 and DenseNet201 (Huang et al., 2017), ResNet50 (He et al., 2016a, b), and InceptionResNet (version 2) (Szegedy et al., 2017). All the architectures had benchmark performances in the ImageNet challenge (Krizhevsky et al., 2012). To examine whether the obtained results were reproducible in different environments and software versions (Crane, 2018), we ran the analyses independently on two popular Python deep-learning frameworks: TensorFlow-Keras and PyTorch-fastai. The architecture of the TensorFlow-Keras implementation was:

CNN model (initialized with pre-trained ImageNet weights) → Fully Connected Layer (256 nodes, ReLU activation, with/without L1L2 regularization) → Dropout (probability = 0.2) → Fully Connected Layer (1 node, Sigmoid activation). Optimizer: Stochastic Gradient Descent (SGD).
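The classification head described above can be illustrated with a plain NumPy forward pass. Everything here is a stand-in for illustration: the pre-trained CNN backbone is replaced by random 512-dimensional "feature vectors", and the weights are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)

def head_forward(features, w1, b1, w2, b2, drop_p=0.2, train=False):
    """FC(256, ReLU) -> Dropout(0.2) -> FC(1, sigmoid), as in the architecture above."""
    h = np.maximum(0.0, features @ w1 + b1)      # 256-unit ReLU layer
    if train:                                    # inverted dropout, training mode only
        mask = rng.random(h.shape) >= drop_p
        h = h * mask / (1.0 - drop_p)
    logits = h @ w2 + b2                         # single output node
    return 1.0 / (1.0 + np.exp(-logits))         # sigmoid -> P(perceived relevant)

# Stand-ins: pooled CNN features for a batch of 4 scanpath images (512-dim),
# and small random weights in place of trained ones
feats = rng.standard_normal((4, 512))
w1 = rng.standard_normal((512, 256)) * 0.01
b1 = np.zeros(256)
w2 = rng.standard_normal((256, 1)) * 0.01
b2 = np.zeros(1)

probs = head_forward(feats, w1, b1, w2, b2)
print(probs.shape)  # (4, 1), each entry strictly between 0 and 1
```

In "frozen" mode, only the head's weights (w1, b1, w2, b2 here) would be updated by gradient descent, while the backbone weights stay fixed.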

In PyTorch-fastai, we built the classifier using the cnn_learner module, which initializes the model with random weights and trains from scratch. We ran the TensorFlow-Keras implementation on a FloydHub GPU cloud server (NVIDIA Tesla K80 GPU, 12 GB memory, 61 GB RAM), and the PyTorch-fastai implementation on Google Colab (NVIDIA Tesla T4 GPU, 15 GB memory, 26 GB RAM).

We trained the models on the training set, and used the validation set for very basic hyper-parameter tuning (learning rate, number of epochs, optimizer momentum, etc.). Since our intention was to see whether the method works, and not to obtain the best benchmark performance, we performed minimal hyper-parameter tuning. Finally, we took the best set of models obtained after tuning the hyper-parameters (epochs: 6, batch-size: 16, momentum: 0.9), and used them to predict the labels of the test set. The top portion of Table 1 reports the results from the TensorFlow-Keras implementation, while Table 2 reports the results from the PyTorch-fastai implementation. The discussions are centred around the results from the TensorFlow-Keras implementation.

4.3.2. Comparison to Existing Standard:

We also compared our method to existing approaches for inferring relevance using eye-movements, where the data is collapsed into a set of handcrafted features (discussed in Section 2.1). Perceived-relevance of documents is predicted from these features using popular classifiers like Random Forests (Wu et al., 2019; Jimenez-Molina et al., 2018) and Support Vector Machines (SVM) (Slanzi et al., 2017; Li et al., 2018). We computed 20 such hand-engineered features, aggregated at the user-document level, and classified them using a Random Forest and an SVM. This analysis was done using the Python Scikit-learn library. Similar to our approach with the CNN classifiers, we started with the default hyperparameter values of the Random Forest and SVM classifiers from the Scikit-learn library, and then performed basic parameter tuning. Finally, we selected the best performing models. The bottom portion of Table 1 reports these results. The handcrafted features are discussed in Section 5.2.
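A baseline pipeline of this shape can be sketched with Scikit-learn as follows. The 20 features are replaced here by random stand-ins (so the scores will hover near chance); only the structure of the comparison is illustrated, not the paper's actual features or results:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Stand-in for 20 aggregated eye-tracking features per user-document pair
X = rng.standard_normal((1612, 20))
y = rng.integers(0, 2, 1612)  # hypothetical balanced perceived-relevance labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

scores = {}
for clf in (RandomForestClassifier(random_state=0), SVC(kernel="rbf")):
    clf.fit(X_tr, y_tr)
    scores[type(clf).__name__] = f1_score(y_te, clf.predict(X_te))

print({name: round(s, 2) for name, s in scores.items()})
```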

Colour scales rank the performance of each row from best (green) to worst (red), across both methods. Asterisk (*) indicates best performance for each method.

For CNN classifiers: Frozen: if Yes, then weights of the CNN layers (pre-trained on ImageNet) were frozen during training. Regularization: if Yes, then L1 and L2 regularization with decay = 0.01 was used in the Fully Connected Layer (See Section 4.3.1 for neural network architecture).

Table 1. Performances of two different methods to predict perceived-relevance from eye-movements, ordered by decreasing F1 score for the Test Set. Top: CNN classifiers on scanpath images. Bottom: traditional classifiers on aggregate features.

5. Results & Discussion

Figure 4. Attempt to interpret how the CNN classifiers made predictions. Middle column shows heatmaps obtained using Grad-CAM technique for a single image. Right column shows average of all such heatmaps. All heatmaps are generated using the best performing model and hyperparameters from Table 1 (VGG19, F1: 0.81). Inferences are discussed in Section 5.4.

5.1. Scanpath Image Classification

We report the performance of our proposed scanpath image classification method, by testing six different CNN classifier architectures (Table 1, top). To easily compare our results to those reported in previous papers, we report five different metrics: percentages of correct predictions for relevant and irrelevant documents, as True Positive Rate (TPR %) and True Negative Rate (TNR %); accuracy (Acc %); area under the ROC curve (ROC AUC); and F1-score (F1). TPR and TNR are also known as sensitivity and specificity, respectively. We have ranked the image classifiers according to their F1 scores on the Test Set, and have taken care to report the full range of performances – best, average, and worst – to provide realistic expectations of using this method.
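For reference, the first four reported metrics derive from the binary confusion-matrix counts as follows. The counts below are made up for illustration and are not the paper's results:

```python
def report(tp, fn, tn, fp):
    """TPR, TNR, accuracy, and F1 from binary confusion-matrix counts."""
    tpr = tp / (tp + fn)                   # sensitivity: correct on relevant docs
    tnr = tn / (tn + fp)                   # specificity: correct on irrelevant docs
    acc = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)       # harmonic mean of precision and recall
    return tpr, tnr, acc, f1

# Made-up counts on a 332-image balanced test set (166 relevant, 166 irrelevant)
tpr, tnr, acc, f1 = report(tp=134, fn=32, tn=131, fp=35)
print(round(tpr, 2), round(tnr, 2), round(acc, 2), round(f1, 2))  # 0.81 0.79 0.8 0.8
```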

From Table 1, we make the following observations. First, all the classifiers have comparable F1 scores on both the Validation Set and the Test Set. Despite having fewer than 1000 training images (quite low by deep-learning standards), the models did not overfit, and generalized well to the unseen Test Set. Second, the VGG16 and VGG19 architectures show the best performances (seven of the top-10 F1 scores). These are “shallow” networks, with 16 and 19 layers respectively. Very deep models (e.g. DenseNet201 or ResNet50, with 201 and 50 layers, respectively), on the other hand, occur only once each within the top-10 F1 scores. Thus, shallower models performed better than deeper models for relevance-prediction from scanpath images. Third, frozen versions of shallower models performed better than their unfrozen counterparts, while unfrozen versions of deeper models had better F1 scores than their frozen versions. When a model was trained in frozen mode, only the fully connected layers had their weights updated by gradient descent, while the weights of the CNN layers were kept frozen (refer to Section 4.3.1 for the model architecture). Shallower models therefore effectively re-utilized the training received on another object-classification task (the ImageNet challenge (Krizhevsky et al., 2012)), while the deeper models needed to learn new weights to achieve similar performance. Fourth, from Table 2 we see that the F1 scores of the same architectures implemented in PyTorch-fastai are similar to those obtained with the TensorFlow-Keras implementation. The results are thus reproducible across different libraries and software environments.

Accordingly, using the latest CNN classifiers, comparatively little training data, and the power of transfer learning, it is possible to predict the perceived-relevance of documents from scanpath images with an F1 score of up to 0.81 and an accuracy of up to 80% (Table 1, VGG19 row, marked with an asterisk).

Table 2. Results (F1-scores) from PyTorch-fastai implementation, with similar configurations as in Table 1.

5.2. Comparison with Traditional Classifiers

To compare our results against existing approaches, we tested the performance of two popular classifiers – Random Forest and SVM – using 20 handcrafted features informed by the literature (Table 1, bottom). The highest Test Set accuracy obtained was 69% (our proposed method achieves 80%), and the highest F1 score was 0.69 (our method achieves 0.81). The five most important features, as obtained from the Random Forest classifier, were (1) vertical scan speed, (2) HV ratio (the ratio of total horizontal movement to total vertical movement, normalized by screen dimensions), (3) SD of fixation durations, (4) mean saccade length, and (5) task duration. To make this comparison fair against our proposed method, we included in the handcrafted feature set the counts of fixations at the different levels (1-4) that we encoded with special markers in the scanpath images (Section 4.1.1). However, those level-wise fixation counts were placed among the ten least-important features by the Random Forest classifier. Thus, the scanpath image classification method performs much better than using handcrafted features.
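To make these features concrete, here is a sketch of how a few of them could be computed from a raw fixation sequence (the exact definitions used in the paper may differ slightly; the coordinates, durations, and screen dimensions below are illustrative):

```python
# Illustrative computation of a few aggregate eye-tracking features from
# a fixation sequence: x, y in pixels, duration in milliseconds.
import numpy as np

def aggregate_features(x, y, dur, screen_w=1920, screen_h=1080):
    x, y, dur = map(np.asarray, (x, y, dur))
    task_duration = dur.sum()
    dx, dy = np.diff(x), np.diff(y)            # saccade displacement vectors
    saccade_len = np.hypot(dx, dy)
    return {
        "task_duration_ms": task_duration,
        "mean_saccade_length": saccade_len.mean(),
        "sd_fixation_duration": dur.std(),
        # vertical distance covered per unit of task time
        "vertical_scan_speed": np.abs(dy).sum() / task_duration,
        # horizontal vs vertical movement, normalized by screen dimensions
        "hv_ratio": (np.abs(dx).sum() / screen_w) / (np.abs(dy).sum() / screen_h),
    }

feats = aggregate_features([100, 400, 700, 120], [200, 205, 210, 260],
                           [180, 220, 200, 240])
print(feats["hv_ratio"])   # high for reading-like, horizontal scanpaths
```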

5.3. Comparison with Related Works

Our method considerably improves upon the performance measures reported in the literature on related work. We first discuss the studies that had experimental setups similar to our own and also predicted relevance from eye-tracking data. Our best performing classifier surpasses the numbers reported in these studies by at least five percentage points on average, w.r.t. ROC AUC, Accuracy, and/or F1 score. For instance, Chow and Gedeon (Chow and Gedeon, 2015) predicted document-categories from eye-movement data of analysts, using a neural network classifier. They reported an accuracy of 70%, but did not mention the specific eye-tracking features used. Wenzel et al. (Wenzel et al., 2017) inferred the relevance of individual words from fixation duration and EEG features. They reported that using only fixation duration gave an AUC of 0.51 (marginally better than chance), whereas combining fixation durations with EEG features improved the AUC to 0.63. However, they did not state the kind of classifiers used. Compared to the above figures, our best performing model – VGG19, by Test Set F1 score – has an ROC AUC of 0.87, an accuracy of 80%, and an F1 score of 0.81. Gwizdka (Gwizdka, 2014a) predicted relevance of short documents, and reported a maximum accuracy of 74% using decision trees. However, some of the features he used were content-dependent (number of fixations on words, count and percentage of words fixated upon, etc.). Our method, in contrast, is both content- and task-independent.

In our literature search, the only study found to have comparable or better performance than our image classification method, for a similar relevance judgement task, was reported in (Gwizdka et al., 2017). Employing proximal SVM as the classifier, the best performance using only eye-tracking features was reported to have an AUC of 0.95 and an accuracy of 86%. However, their approach differed from ours in two respects. First, all fixations were passed through a two-stage reading model to label them as reading or scanning, and separate features were then calculated for the groups of reading and scanning fixations. In contrast, our method simply takes all the raw fixations and encodes them directly into the scanpath image; there is no need for pre-classification, which requires additional insights about the data. Second, their classification features were computed using windows of 1 second and 2 seconds near the beginning, middle, and end of the reading trials, and higher prediction accuracies were obtained using the values of the end-window than of the beginning-window. Our method instead considers the entire duration of the reading trial for the prediction task.

We now discuss classification and prediction results from other related yet different studies that employed eye-tracking in the domain of interactive IR. Simola et al. (Simola et al., 2008) predicted task-category (word search, question answering, or reading by interest) from the scanpath, obtaining 59.8% accuracy using logistic regression on fixation count, mean and SD of fixation durations, and mean and SD of saccade length. Slanzi et al. (Slanzi et al., 2017) predicted the click-intention of web users. Though they initially considered using eye-tracking features, those were later discarded using Random Lasso feature selection, and EEG and pupillometry features were mainly used. They employed a variety of classifiers, including SVM, neural network, and Logistic Regression. However, the highest F1 score obtained was 0.4, using the neural network classifier. Although they reported the highest accuracy of 71% for Logistic Regression, it had low precision and recall, and thus a low F1 score of 0.33. Gwizdka and Zhang (Gwizdka and Zhang, 2015) predicted visits and revisits to relevant and irrelevant webpages, using fixation-duration, saccade-duration, and saccade-length. They reported a maximum accuracy of 61% using Flexible Discriminant Analysis. Though our prediction problem was somewhat different from the ones discussed above, we hypothesize that our method can possibly obtain better performances on these prediction problems as well, since the scanpath image classifier did not receive any information about the task.

5.4. Interpreting Reasons for Prediction

In this section, we attempt to interpret how the best performing CNN classifier (VGG19, Test Set F1 score: 0.81, from Table 1) made its predictions. We employed Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017) for this purpose. Given a scanpath image, the Grad-CAM technique produces a heatmap (Class Activation Map) indicating which pixels in the image are considered important for making a prediction. This is similar to feature importances in Random Forests, but is specific to each scanpath image. Examples of such heatmaps are shown in the second column of Figure 4. To understand whether the CNN had identified some patterns of human reading behaviour on relevant and irrelevant documents, we generated an average heatmap for each class, using all the scanpath images in the Test Set. These average heatmaps are shown in the third column of Figure 4.
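The core Grad-CAM computation is compact. Given the last convolutional layer's activations and the gradient of the class score with respect to them (both obtainable from any deep-learning framework), the heatmap is a ReLU-rectified, gradient-weighted sum over channels. The numpy sketch below uses random stand-in tensors with a VGG-like shape:

```python
# Grad-CAM from precomputed activations A (H, W, K) and gradients dS/dA
# of the class score w.r.t. A. Framework-agnostic numpy sketch.
import numpy as np

def grad_cam(activations, gradients):
    # Channel weights alpha_k: global-average-pool the gradients.
    alpha = gradients.mean(axis=(0, 1))                      # shape (K,)
    # Weighted sum over channels, then ReLU to keep positive evidence.
    cam = np.maximum((activations * alpha).sum(axis=-1), 0)  # shape (H, W)
    return cam / cam.max() if cam.max() > 0 else cam         # scale to [0, 1]

rng = np.random.default_rng(0)
A = rng.random((14, 14, 512))        # stand-in for a VGG conv-block output
dA = rng.normal(size=(14, 14, 512))  # stand-in for the class-score gradient
heatmap = grad_cam(A, dA)            # upsampled to image size for overlay
```

The per-class average heatmaps in Figure 4 are then simply the element-wise mean of these maps over all Test Set scanpath images of that class.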

For the heatmap of the relevant scanpath image, the classifier appears to have focused more on the right side of the scanpath (Figure 4, second column). This can be explained by findings about reading versus scanning behaviour (Gwizdka, 2014a). When people read relevant documents, their eyes move more horizontally than vertically. They also continue to read till the end of every line, and then move on to the next line. Therefore, fixations occur at the end of most lines, in a consistent manner from top to bottom. Specific to our task, the participants possibly read the news article headlines first, to quickly decide if the answer to the trigger question could be found in the article. When the headline looked relevant, they continued to read into the body of the article, and read till the end of every line (Figure 4, Relevant Scanpath).

In the irrelevant scanpath heatmap, the classifier possibly focused on the top and bottom regions instead (Figure 4, second column). People usually scan or skim irrelevant information, and produce more vertical eye-movements than horizontal. Very few fixations occur near the ends of successive lines, since people rarely read irrelevant documents continuously to the ends of lines for many lines in sequence. For our task, the participants possibly read the headline first, as with relevant articles. However, when the headline appeared irrelevant, they scanned the remainder of the article in long vertical sweeps. They may also have looked at the last few lines of the article, to search for summaries or conclusions about the article content (Figure 4, Irrelevant Scanpath).

These patterns are further reinforced in the average heatmaps (Figure 4, third column). In the relevant case (top), the classifier’s attention is spread over a rectangular region, corresponding to the overall shape of the stimulus news-paragraphs. The right side of the rectangle is brighter than the left, indicating that the classifier looked for fixations near line endings. In the irrelevant case (bottom), attention is focused on islands in the top and bottom regions, with a less-focused central region.

In summary, we hypothesize that participants initially looked at the headlines of both kinds of news articles. For relevant articles, the headlines convinced them to read in more detail, so they produced more fixations in the body of the article, and read till the end of every line. For irrelevant articles, on reading the headlines (and possibly the first few lines), participants understood that the article would not be useful for answering the trigger question, so they quickly skimmed or scanned to the bottom, looking for concluding remarks that could solidify their initial judgement of the article. As a result, both relevant and irrelevant scanpath images contained fixations in the top region (headlines), but the proportion of headline-region fixations to body-region fixations was higher for irrelevant documents. A similar phenomenon was reported in (Li et al., 2018). Therefore, we think the CNN classifier possibly “learnt” the following:

If a large number of fixations are present in the right side of the scanpath image (where line endings are located), it is probably a relevant scanpath; whereas if fixations are concentrated in the top and bottom regions, and sparse in the middle, it is possibly an irrelevant scanpath.
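Purely as an illustration of this hypothesized rule (and not of the CNN's actual decision function), it could be written as a toy heuristic over fixation positions; the region thresholds below are arbitrary:

```python
# Toy heuristic version of the rule the CNN may have learnt. Entirely
# illustrative: thresholds (0.7, 0.2, 0.8) are arbitrary choices.
import numpy as np

def heuristic_relevance(fix_x, fix_y, width, height):
    fix_x, fix_y = np.asarray(fix_x), np.asarray(fix_y)
    # Fraction of fixations near the right margin (line endings).
    right = np.mean(fix_x > 0.7 * width)
    # Fraction of fixations in the top or bottom regions.
    top_bottom = np.mean((fix_y < 0.2 * height) | (fix_y > 0.8 * height))
    return "relevant" if right > top_bottom else "irrelevant"

# Reading-like scanpath: many fixations reach the right margin,
# and they sit in the body of the page rather than at its edges.
print(heuristic_relevance([100, 900, 950, 120, 940],
                          [300, 305, 350, 400, 450],
                          width=1000, height=800))
```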

6. Conclusion

In this paper, we pose the research problem of ‘predicting perceived-relevance from eye-movements’ as a problem of ‘scanpath image classification’. We employ pre-trained Convolutional Neural Networks to predict whether scanpath images correspond to reading relevant or irrelevant news articles. The advantages of our method are: (i) we use all of the eye-tracking datapoints available per user, and do not collapse them into features; (ii) the spatial and temporal aspects of eye-movement scanpaths are preserved; (iii) our method is content independent, and does not require knowledge of the content being viewed (e.g., the actual text of the news articles); and (iv) our method does not need additional insights about the data (e.g. for labelling fixations as reading or scanning).

Our approach has several limitations. First, since we used pre-trained image classifiers, the high-resolution scanpath images were downscaled to the dimensions on which the classifiers were originally trained. This led to some loss of information, and possibly decreased the classifiers’ performance. However, it is standard practice in computer-vision research to downsize images, since using high- or full-resolution images leads to substantially slower execution times and significantly higher memory requirements. Second, due to resource limitations, we were unable to search the hyperparameter space appreciably; it is possible that a simpler or shallower model can achieve better performance on this task. Third, we employed a fairly simple information-search task, and used only short texts of a similar type. More complex information-search tasks on the open web may bring additional challenges. Fourth, our participant pool was relatively homogeneous: all were college-age students attending the same university.

To our knowledge, this is the first attempt to approach both relevance prediction and eye-movement analysis using image classification, or more broadly, computer vision. Our work was aimed at a proof of concept. We demonstrated that even with little data, this method shows promising results. For similar eye-tracking studies from the literature, our scanpath image classification method outperformed previously reported metrics by appreciable margins. By examining aggregated class activation heatmaps, we gained additional insights into how users examine relevant and irrelevant documents. Thus, there is promising scope for improving interactive IR research by employing computer-vision algorithms in non-vision tasks.

The research project was funded in part by Lockheed Martin Corporation. We thank the team from the Department of Kinesiology, University of Maryland, College Park, led by Professor Bradley Hatfield, and including Dr. Rodolphe Gentili, Dr. Joe Dien, and graduate students Hyuk Oh, Kyle James Jaquess, and Li-Chuan Lo, for contributing to the experimental design, implementing it in the SMI Experiment Center software, and collecting the data. We thank Splunk Inc. for the blog post on using mouse trajectories for fraud detection (Esman, 2017), which gave us the idea to adapt this approach for relevance prediction from eye-movements. We also gratefully acknowledge the ACM SIGIR Student Travel Grant awarded to the first author.


  • Anderson et al. (2015) Nicola C. Anderson, Fraser Anderson, Alan Kingstone, and Walter F. Bischof. 2015. A Comparison of Scanpath Comparison Methods. 47, 4 (2015), 1377–1392.
  • Andrearczyk and Whelan (2017) Vincent Andrearczyk and Paul F. Whelan. 2017. Chapter 4 - Deep Learning in Texture Analysis and Its Application to Tissue Image Classification. In Biomedical Texture Analysis, Adrien Depeursinge, Omar S. Al-Kadi, and J.Ross Mitchell (Eds.). Academic Press, 95 – 129.
  • Borlund (2003) Pia Borlund. 2003. The concept of relevance in IR. Journal of the American Society for information Science and Technology 54, 10 (2003), 913–925.
  • Brandt and Stark (1997) Stephan A Brandt and Lawrence W Stark. 1997. Spontaneous eye movements during visual imagery reflect the content of the visual scene. Journal of cognitive neuroscience 9, 1 (1997), 27–38.
  • Brouwer et al. (2009) Anne-Marie Brouwer, Maarten A. Hogervorst, Pawel Herman, and Frank Kooi. 2009. Are You Really Looking? Finding the Answer through Fixation Patterns and EEG. In Foundations of Augmented Cognition. Neuroergonomics and Operational Neuroscience. Vol. 5638. Springer Berlin Heidelberg, Berlin, Heidelberg, 329–338.
  • Buscher et al. (2008) Georg Buscher, Andreas Dengel, and Ludger van Elst. 2008. Eye movements as implicit relevance feedback. In CHI ’08 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’08). ACM, New York, NY, USA, 2991–2996.
  • Chow and Gedeon (2015) C. Chow and T. Gedeon. 2015. Classifying document categories based on physiological measures of analyst responses. In 2015 6th IEEE International Conference on Cognitive Infocommunications (CogInfoCom). 421–425.
  • Ciregan et al. (2012) D. Ciregan, U. Meier, and J. Schmidhuber. 2012. Multi-column deep neural networks for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. 3642–3649.
  • Crane (2018) Matt Crane. 2018. Questionable answers in question answering research: Reproducibility and variability of published results. Transactions of the Association for Computational Linguistics 6 (2018), 241–252.
  • Cristino et al. (2010) Filipe Cristino, Sebastiaan Mathôt, Jan Theeuwes, and Iain D Gilchrist. 2010. ScanMatch: A novel method for comparing fixation sequences. Behavior research methods 42, 3 (2010), 692–700.
  • Dewhurst et al. (2012) Richard Dewhurst, Marcus Nyström, Halszka Jarodzka, Tom Foulsham, Roger Johansson, and Kenneth Holmqvist. 2012. It Depends on How You Look at It: Scanpath Comparison in Multiple Dimensions with MultiMatch, a Vector-Based Approach. 44, 4 (2012), 1079–1100.
  • Duchowski et al. (2010) Andrew T Duchowski, Jason Driver, Sheriff Jolaoso, William Tan, Beverly N Ramey, and Ami Robbins. 2010. Scanpath comparison revisited. In Proceedings of the 2010 symposium on eye-tracking research & applications. ACM, 219–226.
  • Esman (2017) Gleb Esman. 2017. Splunk and Tensorflow for Security: Catching the Fraudster with Behavior Biometrics. [Online; accessed 11-Dec-2019].
  • Fahey et al. (2011) Daniel Fahey, Tom Gedeon, and Dingyun Zhu. 2011. Document Classification on Relevance: A Study on Eye Gaze Patterns for Reading. In Neural Information Processing, Bao-Liang Lu, Liqing Zhang, and James Kwok (Eds.). Number 7063 in Lecture Notes in Computer Science. Springer Berlin Heidelberg, 143–150.
  • Frey et al. (2013) Aline Frey, Gelu Ionescu, Benoit Lemaire, Francisco Lopez-Orozco, Thierry Baccino, and Anne Guerin-Dugue. 2013. Decision-making in information seeking on texts: an Eye-Fixation-Related Potentials investigation. Frontiers in Systems Neuroscience 7, 39 (2013).
  • Golenia et al. (2015) Jan-Eike Golenia, Markus Wenzel, and Benjamin Blankertz. 2015. Live Demonstrator of EEG and Eye-Tracking Input for Disambiguation of Image Search Results. In Symbiotic Interaction, Benjamin Blankertz, Giulio Jacucci, Luciano Gamberini, Anna Spagnolli, and Jonathan Freeman (Eds.). Springer International Publishing, Cham, 81–86.
  • Golenia et al. (2018) Jan-Eike Golenia, Markus A Wenzel, Mihail Bogojeski, and Benjamin Blankertz. 2018. Implicit relevance feedback from electroencephalography and eye tracking in image search. Journal of Neural Engineering 15, 2 (Jan. 2018), 026002.
  • Graff (2002) David Graff. 2002. The AQUAINT Corpus of English News Text. Linguistic Data Consortium.
  • Gwizdka (2014a) Jacek Gwizdka. 2014a. Characterizing Relevance with Eye-Tracking Measures. In Proceedings of the 5th Information Interaction in Context Symposium (IIiX ’14). ACM, New York, NY, USA, 58–67.
  • Gwizdka (2014b) Jacek Gwizdka. 2014b. News Stories Relevance Effects on Eye-movements. In Proceedings of the Symposium on Eye Tracking Research and Applications (ETRA ’14). ACM, New York, NY, USA, 283–286.
  • Gwizdka (2017) Jacek Gwizdka. 2017. Differences in Reading Between Word Search and Information Relevance Decisions: Evidence from Eye-Tracking. In Information Systems and Neuroscience, Fred D. Davis, René Riedl, Jan vom Brocke, Pierre-Majorique Léger, and Adriane B. Randolph (Eds.). Springer International Publishing, Cham, 141–147.
  • Gwizdka et al. (2017) Jacek Gwizdka, Rahilsadat Hosseini, Michael Cole, and Shouyi Wang. 2017. Temporal dynamics of eye-tracking and EEG during reading and relevance decisions. Journal of the Association for Information Science and Technology 68, 10 (Oct. 2017), 2299–2312.
  • Gwizdka and Zhang (2015) Jacek Gwizdka and Yinglong Zhang. 2015. Differences in Eye-Tracking Measures Between Visits and Revisits to Relevant and Irrelevant Web Pages. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’15). ACM, New York, NY, USA, 811–814.
  • Haji Mirza and Izquierdo (2010) Seyed Navid Haji Mirza and Ebroul Izquierdo. 2010. Finding the user’s interest level from their eyes. In Proceedings of the 2010 ACM Workshop on Social, Adaptive and Personalized Multimedia Interaction and Access (SAPMIA ’10). ACM, New York, NY, USA, 25–28.
  • Hardoon et al. (2007) DR Hardoon, J. Shawe-Taylor, A. Ajanki, K. Puolamäki, and S. Kaski. 2007. Information Retrieval by Inferring Implicit Queries from Eye Movements. In Eleventh International Conference on Artificial Intelligence and Statistics, Vol. 2. San Juan, Puerto Rico, 179–186.
  • Hardoon and Pasupa (2010) David R. Hardoon and Kitsuchart Pasupa. 2010. Image ranking with implicit feedback from eye movements. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications (ETRA ’10). ACM, New York, NY, USA, 291–298.
  • He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Identity mappings in deep residual networks. In European conference on computer vision. Springer, 630–645.
  • Hjørland (2010) Birger Hjørland. 2010. The foundation of the concept of relevance. Journal of the american society for information science and technology 61, 2 (2010), 217–237.
  • Holmqvist et al. (2011) Kenneth Holmqvist, Marcus Nyström, Richard Andersson, Richard Dewhurst, Halszka Jarodzka, and Joost Van de Weijer. 2011. Eye tracking: A comprehensive guide to methods and measures. OUP Oxford.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4700–4708.
  • Hunter (2007) John D Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in science & engineering 9, 3 (2007), 90.
  • Jarodzka et al. (2010) Halszka Jarodzka, Kenneth Holmqvist, and Marcus Nyström. 2010. A Vector-Based, Multidimensional Scanpath Similarity Measure. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications - ETRA ’10. ACM Press, 211.
  • Jimenez-Molina et al. (2018) Angel Jimenez-Molina, Cristian Retamal, and Hernan Lira. 2018. Using Psychophysiological Sensors to Assess Mental Workload During Web Browsing. Sensors 18, 2 (Feb. 2018), 458.
  • Just and Carpenter (1987) Marcel Adam Just and Patricia Ann Carpenter. 1987. The Psychology of Reading and Language Comprehension. Allyn & Bacon, Needham Heights, MA, US.
  • Klami et al. (2008) Arto Klami, Craig Saunders, Teófilo E. de Campos, and Samuel Kaski. 2008. Can relevance of images be inferred from eye movements?. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval (MIR ’08). ACM, New York, NY, USA, 134–140.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
  • Li et al. (2018) Xiangsheng Li, Yiqun Liu, Jiaxin Mao, Zexue He, Min Zhang, and Shaoping Ma. 2018. Understanding Reading Attention Distribution During Relevance Judgement. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). ACM, Torino, Italy, 733–742.
  • Loboda et al. (2011) Tomasz D. Loboda, Peter Brusilovsky, and Jöerg Brunstein. 2011. Inferring word relevance from eye-movements of readers. In Proceedings of the 16th International Conference on Intelligent User Interfaces (IUI ’11). ACM, New York, NY, USA, 175–184.
  • Loyola et al. (2015) Pablo Loyola, Gustavo Martinez, Kristofher Muñoz, Juan D. Velásquez, Pedro Maldonado, and Andrés Couve. 2015. Combining eye tracking and pupillary dilation analysis to identify Website Key Objects. Neurocomputing 168 (Nov. 2015), 179–189.
  • Puolamäki et al. (2008) Kai Puolamäki, Antti Ajanki, and Samuel Kaski. 2008. Learning to learn implicit queries from gaze patterns. In Proceedings of the 25th International Conference on Machine Learning (ICML ’08). ACM, New York, NY, USA, 760–767.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115, 3 (2015), 211–252.
  • Salojärvi et al. (2005) Jarkko Salojärvi, Kai Puolamäki, Jaana Simola, Lauri Kovanen, Ilpo Kojo, and Samuel Kaski. 2005. Inferring relevance from eye movements: Feature extraction. In Workshop at NIPS 2005, Whistler, BC, Canada, December 10, 2005.
  • Salojärvi et al. (2005) Jarkko Salojärvi, Kai Puolamäki, and Samuel Kaski. 2005. Implicit Relevance Feedback from Eye Movements. In Artificial Neural Networks: Biological Inspirations – ICANN 2005. Vol. 3696. Springer Berlin Heidelberg, Berlin, Heidelberg, 513–518.
  • Salvucci and Goldberg (2000) Dario D. Salvucci and Joseph H. Goldberg. 2000. Identifying Fixations and Saccades in Eye-tracking Protocols. In Proceedings of the 2000 Symposium on Eye Tracking Research & Applications (ETRA ’00). ACM, New York, NY, USA, 71–78.
  • Saracevic (2007) Tefko Saracevic. 2007. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance. Journal of the American Society for Information Science and Technology 58, 13 (2007), 1915–1933.
  • Saracevic (2016a) Tefko Saracevic. 2016a. The Notion of Relevance in Information Science: Everybody knows what relevance is. But, what is it really? Synthesis Lectures on Information Concepts, Retrieval, and Services 8, 3 (2016), i–109.
  • Saracevic (2016b) Tefko Saracevic. 2016b. Relevance: In search of a theoretical foundation. Theory development in the information sciences (2016), 141–163.
  • Selvaraju et al. (2017) Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision. 618–626.
  • Simola et al. (2008) Jaana Simola, Jarkko Salojärvi, and Ilpo Kojo. 2008. Using hidden Markov model to uncover processing states from eye movements in information search tasks. Cognitive Systems Research 9, 4 (Oct. 2008), 237–251.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Slanzi et al. (2017) Gino Slanzi, Jorge A. Balazs, and Juan D. Velásquez. 2017. Combining eye tracking, pupil dilation and EEG analysis for predicting web users click intention. Information Fusion 35 (May 2017), 51–57.
  • Springenberg et al. (2014) Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. 2014. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806 (2014).
  • Srinivas et al. (2017) Suraj Srinivas, Ravi K. Sarvadevabhatla, Konda R. Mopuri, Nikita Prabhu, Srinivas S.S. Kruthiventi, and R. Venkatesh Babu. 2017. Chapter 2 - An Introduction to Deep Convolutional Neural Nets for Computer Vision. In Deep Learning for Medical Image Analysis, S. Kevin Zhou, Hayit Greenspan, and Dinggang Shen (Eds.). Academic Press, 25 – 52.
  • Szegedy et al. (2017) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1–9.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2818–2826.
  • Voorhees and Dang (2003) Ellen M Voorhees and Hoa Trang Dang. 2003. Overview of the TREC 2002 question answering track. In Trec, Vol. 2003. Citeseer, 54–68.
  • Wenzel et al. (2017) M. A. Wenzel, M. Bogojeski, and B. Blankertz. 2017. Real-time inference of word relevance from electroencephalogram and eye gaze. Journal of Neural Engineering 14, 5 (2017), 056007.
  • West et al. (2006) Julia M West, Anne R Haake, Evelyn P Rozanski, and Keith S Karn. 2006. eyePatterns: software for identifying patterns and similarities across fixation sequences. In Proceedings of the 2006 symposium on Eye tracking research & applications. ACM, 149–154.
  • Widdel (1984) Heino Widdel. 1984. Operational problems in analysing eye movements. In Advances in psychology. Vol. 22. Elsevier, 21–29.
  • Wilson (1973) Patrick Wilson. 1973. Situational relevance. Information storage and retrieval 9, 8 (1973), 457–471.
  • Wittek et al. (2016) Peter Wittek, Ying-Hsang Liu, Sándor Darányi, Tom Gedeon, and Ik Soo Lim. 2016. Risk and Ambiguity in Information Seeking: Eye Gaze Patterns Reveal Contextual Behavior in Dealing with Uncertainty. Frontiers in Psychology 7 (2016).
  • Wu et al. (2019) Yingying Wu, Yiqun Liu, Yen-Hsi Richard Tsai, and Shing-Tung Yau. 2019. Investigating the role of eye movements and physiological signals in search satisfaction prediction using geometric analysis. Journal of the Association for Information Science and Technology 0, 0 (2019).
  • Yosinski et al. (2015) Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. 2015. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579 (2015).
  • Zeiler and Fergus (2014) Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European conference on computer vision. Springer, 818–833.
  • Zhang et al. (2010) Yun Zhang, Hong Fu, Zhen Liang, Zheru Chi, and Dagan Feng. 2010. Eye Movement As an Interaction Mechanism for Relevance Feedback in a Content-based Image Retrieval System. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications (ETRA ’10). ACM, New York, NY, USA, 37–40.