Classifying Eye-Tracking Data Using Saliency Maps

10/24/2020 ∙ by Shafin Rahman, et al. ∙ University of Dhaka

A plethora of research in the literature shows how human eye fixation patterns vary depending on different factors, including genetics, age, social functioning, cognitive functioning, and so on. Analysis of these variations in visual attention has already elicited two potential research avenues: 1) determining the physiological or psychological state of the subject and 2) predicting the tasks associated with the act of viewing from the recorded eye-fixation data. To this end, this paper proposes a novel visual-saliency-based feature extraction method for automatic and quantitative classification of eye-tracking data, which is applicable to both research directions. Instead of directly extracting features from the fixation data, this method employs several well-known computational models of visual attention to predict eye fixation locations as saliency maps. Comparing the saliency amplitudes, similarity, and dissimilarity of saliency maps with the corresponding eye fixation maps gives an extra dimension of information, which is effectively utilized to generate discriminative features to classify the eye-tracking data. Extensive experimentation using the Saliency4ASD, Age Prediction, and Visual Perceptual Task datasets shows that our saliency-based feature can achieve superior performance, outperforming the previous state-of-the-art methods by a considerable margin. Moreover, unlike the existing application-specific solutions, our method demonstrates performance improvement across three distinct problems from the real-life domain: Autism Spectrum Disorder screening, toddler age prediction, and human visual perceptual task classification, providing a general paradigm that utilizes the extra information inherent in saliency maps for a more accurate classification.




I Introduction

Eye-tracking is the technological process of recording gaze movements and viewing patterns across time and task. With an eye-tracking camera, one can capture eye movements like fixations, saccades, and smooth pursuit. The lengths/durations of these movements vary with the subconscious state of the brain. Cognitive processes of attention, such as perception, memory, and decision making, further influence human gaze behavior [11]. As a result, eye-tracking data provides valuable insights for psychology, neuroscience, visual behavior study, visual stimuli response measurement, and so on. Because eye-tracking technology is easily available, cheap, and reduces the expert manual intervention that a subjective process requires, it has become an integral part of gaze-based command and control [5], user interface design [55], and the diagnosis of cognitive disorders [2].

Fig. 1: Problem overview. The classification system of eye-tracking data can be of two types. (a) Subject classification: Given eye-tracking data of one subject while viewing a series of images, the system identifies the category of the subject, e.g., Autism Spectrum Disorder (ASD) screening [13], age group classification [16], etc. (b) Visual perception classification: Given eye-tracking data of multiple subjects while viewing one image, the system identifies the visual perception involved with the image, e.g., viewing task classification [7], traffic hazard perception problem [47], etc. In this paper, we propose a novel feature extraction method that is equally useful across various eye-tracking classification problems.

Eye-tracking data describes human visual attention behavior. In general, three factors dictate this behavior: (a) input stimuli/images, like social cues, the presence or absence of objects, and their locations; (b) the observer's age, psychological or physiological state, and ethnicity; and (c) tasks associated with the act of viewing, like free viewing, object search, etc. Therefore, we can obtain important details about the image, subject, or task by classifying eye-tracking data. Prior research in this line of investigation usually targets a single real-life problem, e.g., (i) Autism Spectrum Disorder (ASD) and Typically Developed (TD) children classification [13], (ii) toddler age prediction [16], (iii) biometric identification [36], (iv) behavior analysis of user interfaces [55], (v) traffic hazard perception, (vi) visual perception of the user [7], etc. Although the end goals of these problems are different, each of these works aims to address one common task: to classify eye-tracking data recorded while viewing images/videos. In this paper, we attempt to solve the common task so that the same solution can seamlessly work across different real-life problems. We categorize state-of-the-art eye-tracking classification research into two types (see Fig. 1). (a) Subject classification, where eye-tracking data includes fixations of each subject for a set of images. The goal is to classify subjects by analyzing their fixations (i - iv). (b) Visual perception classification, where eye-tracking data includes fixations of a group of subjects for an image. The goal is to classify the image type or the visual task involved while capturing the fixation data (v - vi). This paper proposes a novel task-agnostic feature extraction method for eye-tracking classification that works irrespective of the problem domain.

For feature extraction, traditional approaches use HoGs [39], Gist [35], spatial density [32], LM filters [7], and CNN features (VGG [27, 40], ResNet [53]). We notice that the same feature set does not work consistently across problems. The possible reasons could be that (a) different problems require different aspects of the fixation data as distinguishing information, and (b) the learning model cannot get enough supervision from a small amount of fixation data. To address these problems, we employ saliency maps from established saliency models in the literature (e.g., GBVS [22], CovSal [19], SimpSal [25], etc.). We compare the input fixation maps with the saliency maps using saliency evaluation metrics (e.g., sAUC [8], CC [34], NSS [8], etc.) and use the evaluation results to construct our feature set. This comparison measures the variation between the two maps, providing useful information to classify eye-tracking data of different problems. Moreover, since saliency maps estimate where people generally look, they add extra supervision signals to the learning model about the characteristics of the visual stimuli or image. We apply our feature extraction process to three well-known eye-tracking classification problems: ASD screening, toddler age prediction, and human visual perceptual task classification. We establish a new state-of-the-art performance using the Saliency4ASD [18], Age Prediction [16], and Visual Perceptual Task [30] datasets.

Our overall contributions are as follows.

  • We propose a novel feature extraction method that helps to classify eye-tracking data across several real-life problem domains.

  • We show that saliency maps from popular saliency models can be a powerful tool to extract discriminative features for fixation data.

  • We provide extensive experiments on three eye-tracking datasets and achieve state-of-the-art performance on three tasks, namely ASD screening, toddler age prediction, and visual perceptual task prediction.

II Related Work

Eye-tracking data includes fixation patterns and movements of the observers. In a broad sense, research on the classification of eye-tracking data can be of two types. One is to analyze the raw data and detect distinct types of eye movement events (i.e., fixation [33], scan-path [9], saccades [6]), and the other is to classify user groups [16, 13], nationality [55], or visual tasks [21, 7] through analysis of eye-tracking data. This paper mostly follows the second body of work, where we are interested in solving real-life problems with the help of eye-tracking data.

Eye movement data correlates with human psychology, physiology, and many other fields. Researchers analyze the variability in saccade movements for medical diagnosis of neurodegenerative disorders [3], schizophrenia [23], and acute alcohol consumption [15]. In the same vein, the impact of age on visual attention and gaze patterns has been studied widely. Different age groups (i.e., 2, 4-6, 6-8, 8-10 years old) are classified based on gaze data, saliency map agreement, and center bias tendency of subjects [43]. Toddlers of 3 and 30 months are classified based on the differences of their fixations on video stimulus. A data-driven approach using deep learning is adopted to illustrate the factors that drive the inter-individual changes in gaze patterns because of age [16]. Given the fixations of subjects on a fixed set of stimuli, they can classify the age of that subject into two classes (18 months and 30 months).

An automatic ASD detection system allows diagnosing individuals with less complexity and without the aid of an expert physician [44]. The gaze pattern during image viewing provides discriminative features, which are a powerful tool for classifying ASD. ASD individuals show atypical patterns in face-scanning [37], as they pay less attention to prime features of faces like the eyes, nose, and mouth [46]. Another study shows that ASD and control groups spend different amounts of fixation time [17, 51] in different regions of interest in an image. Moreover, some studies have found that ASD and TD groups show distinct types of scan paths during image viewing [4, 45], as their cognitive abilities and areas of interest in an image are distinct from one another. Chen and Zhao [13] reported a study that used temporal information of eye movements during image viewing, which decodes discriminative features of ASD and healthy children. Moreover, this study presented a privileged modality framework that utilizes multiple behavioral data sources and provides better results than the previous state of the art. In this paper, given eye-tracking data of the subject group of interest, we focus on the ASD screening and toddler age prediction tasks.

Apart from the subject group classification problem, eye gaze data can also be used to classify the viewing pattern [21], analyze the viewing scene [48], and identify biometric patterns [36]. Tafaj et al. [48] classify hazardous driving situations based on the driver's eye-tracking data. In another study, Boisvert and Bruce [7] attempted to predict visual tasks from eye movement trajectories of multiple subjects. They studied four different visual tasks: free viewing (observing images without any particular goal), object search (searching for a particular object in images), saliency viewing (finding whether the left or right side of an image is more salient), and explicit judgment (manually selecting the most important location of the image). In this paper, we apply our proposed feature extraction method to visual task-based classification.

III Method

Researchers have already proposed automated systems to classify unique characteristics of the image (hazardous situation in driving), observer’s class (age, ASD-TD prediction), or visual tasks (free view, explicit judgment) prediction based on fixation (eye-tracking) data. From the perspective of algorithmic design, the input to such systems can be of two types. Firstly, for the observer/subject classification case, an algorithm takes fixation data of all images provided by a single subject as input to predict characteristics of the subject under study. Secondly, for image or task classification cases, fixation data of all subjects originating from a single image are provided to the algorithm to classify the image or task involved in the data collection process. In line with the discussion above, now, we formally describe eye-tracking classification problems.

III-A Problem Formulation

We assume that $N$ subjects/observers, $\{s_1, \dots, s_N\}$, observe $M$ images, $\{I_1, \dots, I_M\}$, while providing eye fixation/gaze data. Let $F_{ij}$ denote the fixation data of subject $s_j$ for image $I_i$. Suppose we have $C$ distinct classes, $\mathcal{Y} = \{y_1, \dots, y_C\}$, that may represent a group (like ASD/TD, age groups, etc.) or a viewing activity (like free viewing, saliency viewing, etc.). The classification of eye-tracking data can be of the following two types:

  • Subject classification: The fixation data of subject $s_j$ over $M$ images/trials is $F_{\cdot j} = \{F_{1j}, \dots, F_{Mj}\}$. Given the eye-tracking data, $F_{\cdot t}$, of a test subject $s_t$, our goal is to assign a label $y \in \mathcal{Y}$ by learning a classifier on the training set, $\{(F_{\cdot j}, y_j)\}$. Example: age prediction [16], ASD screening [13].

  • Visual perceptual task classification: The fixation data of image $I_i$ collected from all observers/subjects is $F_{i\cdot} = \{F_{i1}, \dots, F_{iN}\}$. Given the fixation data $F_{t\cdot}$ of a test image $I_t$, our goal is to assign an activity or task label $y \in \mathcal{Y}$ by learning a classifier on the training set, $\{(F_{i\cdot}, y_i)\}$. Example: perceptual task prediction [7].

Fig. 2: Performance (based on sAUC, Info gain, CC, NSS) of the saliency model CIWaM [24] while predicting fixation maps of (a) ASD/TD, (b) 18/30 months-aged, (c) Free-viewing/Explicit Judgment, and (d) Object Search/Explicit Judgment. One can notice that the saliency map predicts TD, 30 months-aged, Saliency-viewing, and Explicit Judgment better than the counterpart category. Therefore, saliency maps can help to distinguish different types of eye-tracking data.

Fig. 3: Our proposed method. (A) The feature extraction process uses saliency maps (generated using saliency models) of an image, which are compared with the fixation map of that image to generate an evaluation vector from each saliency map, with one entry per evaluation metric. The evaluation vectors from all saliency maps are concatenated to get the final feature vector, which is used for training. The two operator symbols in the diagram denote comparison and concatenation, respectively. (B) For subject group classification, to classify the subject, the feature vectors generated from fixation data on images (presented to that subject) are aggregated to form one feature vector of the same dimension. (C) Training for the visual perception task. The fixation data for all subjects on a single image performing the same perception task is used to form a single fixation map, and a single feature vector is created, which is fed forward to the classifier.

Solution overview: Saliency models attempt to simulate human attention mechanisms by predicting eye-tracking data [10, 42]. However, the extent to which a saliency model predicts fixation data should differ from one class (subject or task) to another. For example, in the ASD/TD case, a TD subject pays special attention to socially relevant cues, e.g., faces and eyes, whereas a subject with autism focuses less on people and faces [52]. Here, a saliency model will predict TD fixation data better than ASD data because TD behavior is analogous to the common behavior of people. Similarly, in the toddler age prediction case, saliency maps could model the eye-tracking behavior of 30-month-old toddlers better than that of 18-month-old toddlers, because 30-month-old toddlers fixate on foreground objects whereas 18-month-old toddlers pay attention only to human faces [16]. In the perceptual task prediction case, saliency models predict the explicit judgment data better than the free-viewing data because free-viewing data is more prone to noise than explicit judgment data [43]. In this way, by treating saliency maps as a standard prediction and comparing the fixation maps of different classes against this standard, one can distinguish fixation data. In Fig. 2, using the popular saliency evaluation metrics sAUC [8], Info gain [31], CC [34], and NSS [8], we show that a saliency model like CIWaM [24] predicts TD, 30-month toddler, and explicit judgment data better than ASD, 18-month toddler, and free-viewing data. Motivated by this trend, we propose a feature extraction method that employs saliency maps from established saliency models.

III-B Proposed Model

We illustrate our approach in Fig. 3. It has two components: feature extraction and learning a classifier. Depending on the problem type mentioned above, the input to the feature extraction process varies slightly. Now, we discuss the components of our method in detail.

Feature Extraction: We visualize the complete feature extraction process in Fig. 3(a). Instead of adopting the traditional way of extracting features directly from the fixation map, we additionally employ saliency maps from several established saliency models in the feature extraction process. Suppose $K$ is the total number of saliency models used during feature extraction. Let $\mathcal{S}_i = \{S_i^1, \dots, S_i^K\}$ represent the set of saliency maps for the $i$th image using all saliency models, where $S_i^k$ is the saliency map of the $k$th saliency model. Our method uses both the fixation map $F_i$ and the saliency maps $\mathcal{S}_i$ to extract a feature vector $x_i$ corresponding to the $i$th image. Depending on the problem type, $F_i$ could be either the fixation map of a single subject $j$, $F_{ij}$, or the union of the fixation maps of all subjects, $\bigcup_j F_{ij}$.

To extract features, we evaluate the performance of each saliency model in predicting the fixation map $F_i$. There are many evaluation methods available in the literature. Each one measures the prediction ability of a saliency model considering a different aspect of the attention mechanism. For example, sAUC [8] tries to exclude the effect of center bias, the tendency of human fixations to fall at the center of the scene. In contrast, AUC_judd [28] includes center bias because center bias is a natural viewing pattern that should be a part of the evaluation process. With this motivation, we use $E$ evaluation metrics, $e_1, \dots, e_E$, while comparing $F_i$ and $\mathcal{S}_i$. Then, to get a feature vector for $F_i$, we concatenate the results of all evaluation metrics for all $K$ saliency models. In this way, the $KE$-dimensional feature vector $x_i$ becomes as follows:

$$x_i = \big[\, e_1(S_i^1, F_i), \dots, e_E(S_i^1, F_i), \;\dots,\; e_1(S_i^K, F_i), \dots, e_E(S_i^K, F_i) \,\big]$$

Here, each element of $x_i$ represents the result of a saliency evaluation metric (sAUC/AUC_judd/NSS etc.).
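The feature construction described above can be illustrated with a minimal sketch. It is not the authors' released code: the function names are hypothetical, maps are plain nested lists, only three metrics (CC, NSS, SIM) are implemented, and CC and SIM are computed against the binary fixation map directly, whereas standard implementations usually compare against a blurred fixation density map.

```python
import math


def _flatten(grid):
    """Flatten a 2-D list of numbers into a flat list of floats."""
    return [float(v) for row in grid for v in row]


def _mean_std(values):
    """Population mean and standard deviation of a flat list."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def cc(saliency, fixation):
    """Pearson correlation coefficient between the two maps."""
    s, f = _flatten(saliency), _flatten(fixation)
    (ms, ss), (mf, sf) = _mean_std(s), _mean_std(f)
    if ss == 0.0 or sf == 0.0:
        return 0.0
    cov = sum((a - ms) * (b - mf) for a, b in zip(s, f)) / len(s)
    return cov / (ss * sf)


def nss(saliency, fixation):
    """Normalized scanpath saliency: mean z-scored saliency at fixated pixels."""
    s, f = _flatten(saliency), _flatten(fixation)
    mean, std = _mean_std(s)
    fixated = [a for a, b in zip(s, f) if b > 0]
    if std == 0.0 or not fixated:
        return 0.0
    return sum((a - mean) / std for a in fixated) / len(fixated)


def sim(saliency, fixation):
    """Histogram intersection after normalizing each map to sum to one."""
    s, f = _flatten(saliency), _flatten(fixation)
    total_s, total_f = sum(s), sum(f)
    if total_s == 0.0 or total_f == 0.0:
        return 0.0
    return sum(min(a / total_s, b / total_f) for a, b in zip(s, f))


def extract_features(saliency_maps, fixation, metrics=(cc, nss, sim)):
    """Concatenate every metric for every saliency map (model-major order)."""
    return [metric(s_map, fixation) for s_map in saliency_maps for metric in metrics]
```

With seven saliency models and seven metrics, as listed in the implementation details, the same loop would yield a 49-dimensional vector per fixation map.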

Learning Classifier: We forward the extracted features to a learning model. However, based on the problem type, we may need to process the extracted features further. For the subject classification case shown in Fig. 3(b), for each image, we apply the feature extraction process to the subject's fixation map and the saliency maps of that image separately. Then, we aggregate the individual features of the input fixation maps by averaging. In this way, we get one feature vector per subject that considers the subject-specific fixation maps of all images. Using this feature, we train a classifier to classify the subject. For the visual perceptual task classification shown in Fig. 3(c), we calculate the union of the fixation maps from all subjects for a single image. In this case, we use this union map and the saliency maps of the image to extract a feature vector for that image. Again, we forward this feature vector to train a classifier to classify the visual perceptual task associated with that image.
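The two aggregation steps can be sketched as follows. This is an illustrative reading of the paragraph above, with hypothetical function names; it assumes binary fixation maps stored as nested lists, and a pixel-wise logical OR as the "union" of maps.

```python
def average_features(per_image_features):
    """Subject classification: average one subject's per-image feature vectors."""
    count = len(per_image_features)
    dim = len(per_image_features[0])
    return [sum(vec[d] for vec in per_image_features) / count for d in range(dim)]


def union_fixation_maps(fixation_maps):
    """Task classification: pixel-wise union of binary fixation maps of all subjects."""
    rows, cols = len(fixation_maps[0]), len(fixation_maps[0][0])
    return [
        [1 if any(fmap[r][c] for fmap in fixation_maps) else 0 for c in range(cols)]
        for r in range(rows)
    ]
```

Averaging keeps the subject-level feature at the same dimension as a single per-image feature, so the same classifier input size applies to both problem types.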

IV Experiment

IV-A Setup

Dataset: We experiment with our method using three eye-tracking datasets. Here, we briefly describe those datasets. (a) Saliency4ASD: Duan et al. [18] collected eye movement data from 28 children, half of whom were ASD and the other half TD. This dataset consists of fixation data for 300 selected natural scene images from [29]. For each image, they merged all fixation data from ASD and TD subjects separately to produce fixation maps of the representative classes. (b) Age Prediction dataset: Dalrymple et al. [16] studied the gaze behavior of 18- and 30-month-old children. This study selected 100 images from the Object and Semantic Images Eye-tracking (OSIE) database [54], which includes both social and non-social scenes. Participants observed each image for three seconds while the data was collected. Fixation data of 22 subjects aged 18 months and 19 subjects aged 30 months are available from the study. This dataset provides fixation points for each subject and image. (c) Visual Perceptual Task dataset: Koehler et al. [30] provided a dataset of humans viewing 800 natural images while performing four visual tasks (free-viewing, object search, saliency view, and explicit judgment of the salient region). A total of 20 observers performed the free-viewing, object search, and saliency search tasks, whereas 100 observers completed the explicit judgment task. The dataset provides fixation coordinates of each observer across different tasks.

Evaluation Process: Following the literature on eye-tracking data classification [13, 16, 7], we have evaluated and compared our method with the existing methods using accuracy, sensitivity (i.e., true positive rate), specificity (i.e., true negative rate), and Area Under the ROC Curve (AUC).
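For reference, the four reported quantities can be computed as in the sketch below. The function names are hypothetical, and the AUC here uses the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve for a binary problem; the paper does not state which implementation was used.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (TPR), and specificity (TNR) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        return 0.0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```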

Implementation Details: In this paper, we have used some of the established saliency models for feature extraction. We choose bottom-up saliency models that compute local and global features of the input image to predict saliency without requiring any prior training. The reason for this choice is that the datasets we use [18, 16, 30] are not large enough to train top-down saliency models like GazeGAN [12], EML-Net [26], etc., and inadequate training of saliency models might impact our feature extraction process. Considering these points, in this study, we have used the following saliency models: (i) CovSal [19], (ii) LDS [20], (iii) GBVS [22], (iv) UHF [49], (v) CIWaM [24], (vi) CEoS [38], (vii) SimpSal [25]. While comparing the similarity/dissimilarity between saliency maps and the fixation data, we use standard saliency model evaluation metrics [10], e.g., AUC_Borji [8], AUC_Judd [28], AUC_Shuffled [8], Information Gain (IG) [31], Similarity (SIM) [28], Pearson's Correlation Coefficient (CC) [34], and Kullback-Leibler divergence (KL-Div) [1]. To build our classifiers, we have used the Scikit-learn Python package. For the SVM classifier, we have applied the kernel trick to introduce non-linearity into the classification process. We find our best results using the polynomial kernel and a tuned regularization parameter. For the XGBoost classifier, we have found the best result using the gradient-boosted tree booster with a tuned tree depth and number of estimators (weak learners).

Footnote 1: Codes and evaluation are available at:

ASD/TD Classification Results
Method                       Accuracy  Sensitivity  Specificity  AUC
Chen’19 (Independent) [13]   89.00     86.00        93.00        92.00
Chen’19 (Full) [13]          93.00     93.00        93.00        98.00
Ours (SVM)                   99.50     96.70        99.30        99.50
Ours (XGBoost)               99.80     1.00         99.70        99.80

Toddler Age Classification Results
Method                       Accuracy  Sensitivity  Specificity  AUC
Dalrymple’19 [16]            83.00     90.00        81.00        84.00
Ours (SVM)                   75.60     78.90        72.70        75.80
Ours (XGBoost)               83.00     84.20        81.80        83.00

TABLE I: Subject/Observer classification.
Method             Free/Obj  Free/Sal  Free/Exp  Obj/Sal  Obj/Exp  Sal/Exp

All images and subjects
Boisvert’16 [7]    84.38     66.13     89.75     89.88    97.75    90.00
Ours (SVM)         86.35     78.57     95.33     94.70    97.80    96.20
Ours (XGBoost)     84.20     74.30     96.50     84.25    97.70    96.10

50% images but all subjects
Boisvert’16 [7]    73.41     59.59     -         71.01    -        -
Ours (SVM)         79.54     71.70     86.21     82.31    90.20    91.56
Ours (XGBoost)     78.80     69.60     86.13     82.51    88.60    90.36

All images but 50% subjects
Boisvert’16 [7]    79.98     60.16     -         77.85    -        -
Ours (SVM)         82.30     66.25     78.77     81.33    84.57    83.18
Ours (XGBoost)     77.20     64.32     75.13     80.23    79.00    81.50

TABLE II: Results of Visual Perceptual Task classification. ‘-’ means unavailable results.

IV-B Overall Result

We apply our method on three well-known eye-tracking based classification problems. In this subsection, we report our experimental results.

Autism Spectrum Disorder (ASD) screening: Several works have suggested that eye-tracking data could work as a blueprint for ASD vs. TD classification. Such an eye-tracking based method can automate the lengthy, manual, and subjective process of this classification. To facilitate research in this area, the Saliency4ASD dataset [18] provides 300 fixation maps for each group (ASD and TD), totaling 600 instances. However, the dataset does not provide the fixations of individual subjects. Therefore, similar to the work of Chen and Zhao [13], we report cross-validation results based on the leave-one-image-out method in Table I. Chen and Zhao [13] used an end-to-end deep learning approach that takes images and fixation maps as input, applies a pre-trained ResNet-50 architecture, and uses a variant of the LSTM network for classification. Their approach reported a highest accuracy of 93% between two modalities of feature extraction. Our approach outperforms [13] by a large margin. We get the highest accuracy of 99.8% using XGBoost as the classifier, although the SVM classifier also beats the previous state-of-the-art results. Our proposed feature extraction method plays a key role in this superior performance. The inherent ability of saliency maps to predict TD fixations better than ASD fixations helps to classify ASD/TD with high confidence. In contrast, [13] learned an LSTM model from a small number (~600) of fixation maps, which may not be enough to train a large deep learning model.

Toddler Age Prediction: Eye-tracking data reflects variability in age. Thus, we can classify different age groups by analyzing the variation in fixation data. Here, we predict toddler age by examining gaze behavior. This research could help to monitor the growth of toddlers. In this paper, we perform our experiments on the fixation data of 100 images collected and used by Dalrymple et al. [16] to predict toddlers belonging to two age groups (18 months and 30 months). The original work used two parallel CNNs (for generating features from fixation data) and an SVM for classification using those features. The CNNs are two pre-trained VGG-16 networks that are trained to predict the difference maps of group fixation data from full-scale and half-scale image input. In our method, we create a feature vector for the fixation data of each trial of a subject, and by averaging across the trials, we generate a feature vector for classification. The number of training samples is quite low (41 subjects); as a result, we validate our method with leave-one-subject-out cross-validation, similar to the work of [16], and report the results in Table I. In contrast to the alternative work, even though we require no training to generate the features, we achieve 83% accuracy, which is similar to their results.
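The leave-one-subject-out protocol mentioned above can be sketched as follows. The nearest-centroid rule is only a dependency-free stand-in so that the example runs on its own; the paper trains SVM and XGBoost classifiers instead, and all function names here are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier: return the label whose mean feature vector is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [vec for vec, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(sorted(centroids), key=lambda lab: squared_distance(centroids[lab], x))


def leave_one_out_accuracy(features, labels, predict=nearest_centroid_predict):
    """Hold out each subject once, fit on the rest, and report mean accuracy."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

Any function with the same `(train_x, train_y, x)` signature can be passed as `predict`, which is where a wrapper around a trained SVM or gradient-boosted model would go.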

Perceptual Task Prediction: Here, we experiment with the dataset of Koehler et al. [30], which contains fixation maps for four different visual tasks (i.e., free-viewing, object search, saliency search, and explicit judgment). Given a fixation map of any image, our goal is to predict the visual task performed while capturing the eye-tracking data. For this, we follow the experimental protocol presented in Boisvert and Bruce [7], where a set of binary classification problems is designed based on different combinations of visual tasks (i.e., Free viewing/Object search, Free viewing/Saliency viewing, Free viewing/Exp. judgment, Object search/Saliency viewing, Object search/Exp. judgment, and Saliency viewing/Exp. judgment). Firstly, we perform 10-fold cross-validation using all images and subjects of the dataset. Secondly, we use the fixation data of all subjects from 50% of the images for training and the remaining images for testing. In this case, the same subjects are used in both training and testing. Finally, we use the fixation data of 50% of the subjects from all images for training and the remaining subjects for testing. In this case, the same images are used in both training and testing. We outperform the alternative method [7] (see Table II) in all three experimental setups. The use of saliency maps provides extra/side information in our method, whereas [7] relies on conventional features (like HoGs, Gist, spatial density, LM filters) extracted only from the fixation maps.
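The six binary problems follow directly from pairing the four tasks. A minimal sketch of how such pairs could be enumerated and how samples could be filtered for one pair is given below; the helper names and the `(feature_vector, task)` sample layout are assumptions made for illustration.

```python
from itertools import combinations

TASKS = ["free viewing", "object search", "saliency viewing", "explicit judgment"]


def binary_problems(tasks=TASKS):
    """All unordered task pairs; four tasks give six binary problems."""
    return list(combinations(tasks, 2))


def select_pair(samples, pair):
    """Keep samples of the two tasks in `pair`; label the first task 0, the second 1."""
    features = [vec for vec, task in samples if task in pair]
    labels = [pair.index(task) for _, task in samples if task in pair]
    return features, labels
```

The order produced by `combinations` matches the column order of Table II (Free/Obj, Free/Sal, Free/Exp, Obj/Sal, Obj/Exp, Sal/Exp).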

IV-C Ablation Study

We perform ablation studies in two directions. First, we experiment with varying the number of saliency models used during our proposed feature extraction process. Then, we investigate some alternative baseline feature extraction methods in comparison to our proposed approach.

Varying the number of saliency models: We perform ablation studies by choosing a different number of saliency models during feature extraction and report classification accuracy on the three problems discussed above. Out of the seven saliency models used in this paper, we randomly select 1, 2, 3, 4, 5, 6, and 7 saliency models for each problem. Then, we report the average performance obtained by repeating the same experiment ten times. From Fig. 4, one can notice a clear trend: using more saliency models improves the performance on every problem up to a certain point (from one to six saliency models). After that, the performance becomes stable, and adding more saliency models (i.e., the seventh model) provides no significant improvement. This trend is expected because, beyond a certain number of saliency models, new saliency models cannot add any more discriminative information to the feature extraction process.
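The ablation loop can be sketched as below. It assumes that the feature vector stores all metrics of the first saliency model, then all metrics of the second, and so on (a model-major layout, which is an assumption of this sketch), and that `evaluate` is any user-supplied function returning an accuracy for a feature matrix and labels. All names are hypothetical.

```python
import random


def columns_for_models(model_ids, num_metrics):
    """Feature indices that belong to the chosen saliency models (model-major layout)."""
    return [m * num_metrics + e for m in model_ids for e in range(num_metrics)]


def ablation(features, labels, num_models, num_metrics, evaluate, repeats=10, seed=0):
    """Average accuracy for every subset size, using random model subsets."""
    rng = random.Random(seed)
    results = {}
    for k in range(1, num_models + 1):
        accuracies = []
        for _ in range(repeats):
            chosen = sorted(rng.sample(range(num_models), k))
            cols = columns_for_models(chosen, num_metrics)
            reduced = [[row[c] for c in cols] for row in features]
            accuracies.append(evaluate(reduced, labels))
        results[k] = sum(accuracies) / len(accuracies)
    return results
```

When all seven models are selected there is only one possible subset, so the ten repetitions differ only through any randomness inside `evaluate` itself.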

Baseline methods: In our approach, we use saliency maps to extract features from fixations. As in traditional approaches, a reasonable alternative is to obtain features directly from the fixation maps. In this experiment, we apply some popular feature extraction methods, e.g., Histogram of Oriented Gradients (HoG), GIST, and VGG-16, to the fixation maps and then train an XGBoost classifier for eye-tracking classification. This approach does not consider saliency maps in the process. Thus, we consider it our baseline method. In Table III, we report the accuracy of the baseline methods and compare their performance with our approach. Our approach outperforms those baselines. This indicates that features extracted directly from the fixation maps cannot represent the eye-tracking data effectively across problems. Such baseline approaches may be useful in some cases, like HoG/Gist for perceptual task prediction or VGG16 for age prediction, but none of these features works consistently across different problem settings. In contrast, our proposed feature extraction method (using saliency maps) works in multiple real-life problems. Our advantage is that saliency maps indirectly provide extra supervision to the learning process by adding distinguishing information about the input visual stimuli.

Fig. 4: Average accuracies of three eye-tracking classification problems using a different number of saliency models during the feature extraction process. One can notice that increasing the number of saliency models improves the classification performance up to six saliency models. From six to seven models, the improvement is not significant.

Method   ASD/TD   Age Prediction   Task (Free/Obj)
HoG      57.00    47.00            70.00
Gist     68.00    57.00            81.00
VGG16    63.70    82.90            74.90
Ours     99.80    83.00            84.20

TABLE III: Comparison of our method with several baseline feature extraction methods. Baseline methods apply different well-known feature extraction processes to the fixation maps. All results are based on the XGBoost classifier.

Fig. 5: 2D tSNE [50] visualization of our extracted features for classification of (a) ASD/TD, (b) Free view/Explicit Judgment. (c) Visualizing the correlation among different saliency evaluation metrics obtained from the perceptual task prediction dataset.

IV-D Discussion

In this subsection, we illustrate different critical aspects of our approach and discuss the limitations of this sort of research.

Feature visualization: We visualize our extracted features in Fig. 5(a-b). One can notice that the features of different classes are quite well separated and easy for a simple machine learning model (SVM/XGBoost) to classify. Moreover, we notice that the separation is relatively clear for the cases of ASD/TD and Free viewing/Explicit Judgment, where we get better performance (see Tables I and II). Furthermore, in Fig. 5(c), we visualize the correlation among the different saliency evaluation metrics that make up our feature set. The variation of values in the non-diagonal regions shows that each evaluation metric focuses on different criteria when comparing saliency maps and fixation maps of any particular group. This helps to describe eye-tracking data effectively.

Selection of visual stimuli or social cue: When classifying a particular subject group (ASD/TD/toddler age) or visual perceptual task (free viewing/explicit judgment), the choice of images used to capture fixations is a paramount concern. For example, subjects with ASD mostly ignore social cues such as the face, eyes, or mouth [37]. Toddlers at 18 months of age pay more attention to faces, whereas 30-month-old toddlers focus on foreground objects [16]. Explicit judgment fixations mostly concentrate on everyday objects, whereas free viewing fixations are biased towards the center of the image irrespective of object locations [43]. Therefore, the choice of the input image set has a profound impact on the performance of an eye-tracking based solution. One can consider designing a process to automatically evaluate how suitable an image is for a particular problem. A simple recommendation is the inter-observer congruency (IOC) score [41]. For a given image, a high IOC score means that typical viewers are likely to agree in their viewing patterns. Consequently, if a particular interest group (ASD/TD) views the same image differently, our feature extraction method can readily pick up the distinguishing information needed to classify that group.
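As a rough illustration of how such a score could be computed, the sketch below uses a leave-one-out scheme: for each observer, the most-fixated regions of the remaining observers are collected, and the fraction of the held-out observer's fixations that land in those regions is measured. This is one common way to formulate inter-observer agreement and is only an assumption here; it is not necessarily the formulation of [41]. The coarse grid (used in place of Gaussian smoothing), the `cell` and `top_fraction` parameters, and the function name `ioc_score` are all hypothetical.

```python
def ioc_score(observers, height, width, cell=2, top_fraction=0.25):
    """observers: list of per-observer fixation lists [(row, col), ...].
    Returns the mean leave-one-out hit rate in [0, 1]."""
    grid_h = -(-height // cell)  # ceiling division
    grid_w = -(-width // cell)
    # Number of grid cells kept as the "agreed-upon" region.
    k = max(1, int(top_fraction * grid_h * grid_w))
    rates = []
    for i, held_out in enumerate(observers):
        if not held_out:
            continue
        # Count fixations of all other observers per coarse grid cell.
        counts = {}
        for j, fixations in enumerate(observers):
            if j == i:
                continue
            for r, c in fixations:
                key = (r // cell, c // cell)
                counts[key] = counts.get(key, 0) + 1
        # Keep the k most-fixated cells (fewer if not enough were visited).
        top = set(sorted(counts, key=counts.get, reverse=True)[:k])
        hits = sum((r // cell, c // cell) in top for r, c in held_out)
        rates.append(hits / len(held_out))
    return sum(rates) / len(rates) if rates else 0.0

# Toy data: three observers who agree, and three who do not.
agree = [[(4, 4), (5, 5)], [(4, 5)], [(5, 4)]]
scatter = [[(0, 0)], [(7, 7)], [(0, 7)]]
print(ioc_score(agree, 8, 8), ioc_score(scatter, 8, 8))
```

Under this formulation, images with scores near 1 would be those on which typical observers look at the same regions, which is the property suggested above for selecting stimuli.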

Limitations: A notable limitation of this line of investigation is the lack of large-scale public datasets. All the datasets used in this study (i.e., [18, 16, 30]) are small in scale, containing at most 800 images and 45 subjects. Moreover, datasets for such problems (ASD screening, toddler age prediction, etc.) are not publicly available. In the future, one can collect large-scale data and release it to researchers to further investigate the applicability of eye-tracking data in real-life applications.

V Conclusion

In this paper, we propose a novel feature extraction method for the eye-tracking classification task. The feature extraction process is general enough to address a wide variety of real-life problems. We employ popular visual saliency models for feature extraction from eye fixation data. Such an approach is better than extracting features from the fixation data alone because saliency maps provide extra supervision to our learning system. We apply our proposed feature extraction process to three problems: ASD screening, toddler age prediction, and visual perceptual task prediction. Our experiments show significant performance gains over previous efforts in similar investigations.

References


  • [1] M. Afgani, S. Sinanovic, and H. Haas (2008) Anomaly detection using the kullback-leibler divergence metric. In 2008 First International Symposium on Applied Sciences on Biomedical and Communication Technologies, pp. 1–5. Cited by: §IV-A.
  • [2] J. Ahonniska-Assa, O. Polack, E. Saraf, J. Wine, T. Silberg, A. Nissenkorn, and B. Ben-Zeev (2018) Assessing cognitive functioning in females with rett syndrome by eye-tracking methodology. European Journal of Paediatric Neurology 22 (1), pp. 39–45. Cited by: §I.
  • [3] T. J. Anderson and M. R. MacAskill (2013) Eye movements in patients with neurodegenerative disorders. Nature Reviews Neurology 9 (2), pp. 74–85. Cited by: §II.
  • [4] G. Arru, P. Mazumdar, and F. Battisti (2019) Exploiting visual behaviour for autism spectrum disorder identification. In 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 637–640. Cited by: §II.
  • [5] R. Bates, M. Donegan, H. O. Istance, J. P. Hansen, and K. Räihä (2007) Introducing cogain: communication by gaze interaction. Universal Access in the Information Society 6 (2), pp. 159–166. Cited by: §I.
  • [6] F. Behrens, M. MacKeben, and W. Schröder-Preikschat (2010) An improved algorithm for automatic detection of saccades in eye movement data and for calculating saccade parameters. Behavior Research Methods 42 (3), pp. 701–708. Cited by: §II.
  • [7] J. F. Boisvert and N. D. Bruce (2016) Predicting task from eye movements: on the importance of spatial distribution, dynamics, and image features. Neurocomputing 207, pp. 653–668. Cited by: Classifying Eye-Tracking Data Using Saliency Maps thanks: This research was supported by the ICT Division, Ministry of Posts, Telecommunications and Information Technology of the Government of Bangladesh., Fig. 1, §I, §I, §II, §II, 2nd item, §IV-A, §IV-B, TABLE II.
  • [8] A. Borji, D. N. Sihite, and L. Itti (2013) Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study. IEEE Transactions on Image Processing 22 (1), pp. 55–69. Cited by: §I, §III-A, §III-B, §IV-A.
  • [9] M. Burch, A. Kumar, K. Mueller, T. Kervezee, W. Nuijten, R. Oostenbach, L. Peeters, and G. Smit (2019) Finding the outliers in scanpath data. In Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications, pp. 1–5. Cited by: §II.
  • [10] Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand (2019-03) What do different evaluation metrics tell us about saliency models?. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (3), pp. 740–757. External Links: Document Cited by: §III-A, §IV-A.
  • [11] B. T. Carter and S. G. Luke (2020) Best practices in eye tracking research. International Journal of Psychophysiology 155, pp. 49 – 62. External Links: ISSN 0167-8760, Document, Link Cited by: §I.
  • [12] Z. Che, A. Borji, G. Zhai, X. Min, G. Guo, and P. L. Callet (2020) How is gaze influenced by image transformations? dataset and model. IEEE Transactions on Image Processing 29, pp. 2287–2300. External Links: Document Cited by: §IV-A.
  • [13] S. Chen and Q. Zhao (2019-10) Attention-based autism spectrum disorder screening with privileged modality. In The IEEE International Conference on Computer Vision (ICCV). Cited by: Fig. 1, §I, §II, §II, 1st item, §IV-A, §IV-B, TABLE I.
  • [14] T. Chen and C. Guestrin (2016-08) XGBoost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, External Links: Document Cited by: §IV-A.
  • [15] E. Childs, D. J. Roche, A. C. King, and H. de Wit (2012) Varenicline potentiates alcohol-induced negative subjective responses and offsets impaired eye movements. Alcoholism: Clinical and Experimental Research 36 (5), pp. 906–914. Cited by: §II.
  • [16] K. A. Dalrymple, M. Jiang, Q. Zhao, and J. T. Elison (2019-04) Machine learning accurately classifies age of toddlers based on eye tracking. Scientific Reports 9 (1). External Links: Document Cited by: Fig. 1, §I, §I, §II, §II, 1st item, §III-A, §IV-A, §IV-A, §IV-A, §IV-B, §IV-D, §IV-D, TABLE I.
  • [17] A. B. Dris, A. Alsalman, A. Al-Wabil, and M. Aldosari (2019) Intelligent gaze-based screening system for autism. In 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS), pp. 1–5. Cited by: §II.
  • [18] H. Duan, G. Zhai, X. Min, Z. Che, Y. Fang, X. Yang, J. Gutiérrez, and P. L. Callet (2019) A dataset of eye movements for the children with autism spectrum disorder. In Proceedings of the 10th ACM Multimedia Systems Conference, pp. 255–260. Cited by: §I, §IV-A, §IV-A, §IV-B, §IV-D.
  • [19] E. Erdem and A. Erdem (2013) Visual saliency estimation by nonlinearly integrating features using region covariances. Journal of Vision 13 (4), pp. 11–11. Cited by: §I, §IV-A.
  • [20] S. Fang, J. Li, Y. Tian, T. Huang, and X. Chen (2016) Learning discriminative subspaces on random contrasts for image saliency analysis. IEEE Transactions on Neural Networks and Learning Systems 28 (5), pp. 1095–1108. Cited by: §IV-A.
  • [21] A. Haji-Abolhassani and J. J. Clark (2014) An inverse yarbus process: predicting observers’ task from eye movement patterns. Vision Research 103, pp. 127–142. Cited by: §II, §II.
  • [22] J. Harel, C. Koch, and P. Perona (2007) Graph-based visual saliency. In Advances in Neural Information Processing Systems, pp. 545–552. Cited by: §I, §IV-A.
  • [23] P. S. Holzman, L. R. Proctor, D. L. Levy, N. J. Yasillo, H. Y. Meltzer, and S. W. Hurt (1974) Eye-tracking dysfunctions in schizophrenic patients and their relatives. Archives of General Psychiatry 31 (2), pp. 143–151. Cited by: §II.
  • [24] N. Imamoglu, W. Lin, and Y. Fang (2012) A saliency detection model using low-level features based on wavelet transform. IEEE Transactions on Multimedia 15 (1), pp. 96–105. Cited by: Fig. 2, §III-A, §IV-A.
  • [25] L. Itti, C. Koch, and E. Niebur (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11), pp. 1254–1259. Cited by: §I, §IV-A.
  • [26] S. Jia and N. D.B. Bruce (2020-03) EML-NET: an expandable multi-layer NETwork for saliency prediction. Image and Vision Computing 95, pp. 103887. External Links: Document Cited by: §IV-A.
  • [27] M. Jiang and Q. Zhao (2017) Learning visual attention to identify people with autism spectrum disorder. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3267–3276. Cited by: §I.
  • [28] T. Judd, F. Durand, and A. Torralba (2012) A benchmark of computational models of saliency to predict human fixations. In MIT Technical Report, Cited by: §III-B, §IV-A.
  • [29] T. Judd, K. Ehinger, F. Durand, and A. Torralba (2009) Learning to predict where humans look. In 2009 IEEE 12th International Conference on Computer Vision, pp. 2106–2113. Cited by: §IV-A.
  • [30] K. Koehler, F. Guo, S. Zhang, and M. P. Eckstein (2014) What do saliency models predict?. Journal of Vision 14 (3), pp. 14–14. Cited by: §I, §IV-A, §IV-A, §IV-B, §IV-D.
  • [31] M. Kümmerer, T. Wallis, and M. Bethge (2014) How close are we to understanding image-based saliency?. arXiv preprint arXiv:1409.7686. Cited by: §III-A, §IV-A.
  • [32] K. Kurzhals and D. Weiskopf (2013) Space-time visual analytics of eye-tracking data for dynamic stimuli. IEEE Transactions on Visualization and Computer Graphics 19 (12), pp. 2129–2138. Cited by: §I.
  • [33] L. Larsson, M. Nyström, R. Andersson, and M. Stridh (2015) Detection of fixations and smooth pursuit movements in high-speed eye-tracking data. Biomedical Signal Processing and Control 18, pp. 145–152. Cited by: §II.
  • [34] O. Le Meur, P. Le Callet, and D. Barba (2007) Predicting visual fixations on video based on low-level visual features. Vision Research 47 (19), pp. 2483–2498. Cited by: §I, §III-A, §IV-A.
  • [35] Z. Li and L. Itti (2009) Gist based top-down templates for gaze prediction. Journal of Vision 9 (8), pp. 202–202. Cited by: §I.
  • [36] Z. Liang, F. Tan, and Z. Chi (2012) Video-based biometric identification using eye tracking technique. In 2012 IEEE International Conference on Signal Processing, Communication and Computing (ICSPCC 2012), pp. 728–733. Cited by: §I, §II.
  • [37] W. Liu, M. Li, and L. Yi (2016) Identifying children with autism spectrum disorder based on their face processing abnormality: a machine learning framework. Autism Research 9 (8), pp. 888–898. Cited by: §II, §IV-D.
  • [38] R. Mairon and O. Ben-Shahar (2014) A closer look at context: from coxels to the contextual emergence of object saliency. In European Conference on Computer Vision, pp. 708–724. Cited by: §IV-A.
  • [39] F. Martinez, A. Carbone, and E. Pissaloux (2012) Gaze estimation using local features and non-linear regression. In 2012 19th IEEE International Conference on Image Processing, pp. 1961–1964. Cited by: §I.
  • [40] A. Nebout, W. Wei, Z. Liu, L. Huang, and O. Le Meur (2019) Predicting saliency maps for asd people. In 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 629–632. Cited by: §I.
  • [41] S. Rahman and N. D. B. Bruce (2016) Factors underlying inter-observer agreement in gaze patterns: predictive modelling and analysis. In Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications, ETRA ’16, New York, NY, USA, pp. 155–162. External Links: ISBN 9781450341257, Link, Document Cited by: §IV-D.
  • [42] S. Rahman and N. Bruce (2015) Saliency, scale and information: towards a unifying theory. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2188–2196. External Links: Link Cited by: §III-A.
  • [43] S. Rahman and N. Bruce (2015) Visual saliency prediction and evaluation across different perceptual tasks. PloS One 10 (9), pp. e0138053. Cited by: §II, §III-A, §IV-D.
  • [44] O. Shahid, S. Rahman, S. F. Ahmed, M. A. Arrafi, and M.A.R. Ahad (2020) Data-driven automated detection of autism spectrum disorder using activity analysis: a review. Preprints 2020, 2020100388. External Links: Document Cited by: §II.
  • [45] A. I. Shihab, F. A. Dawood, and A. H. Kashmar (2020) Data analysis and classification of autism spectrum disorder using principal component analysis. Advances in Bioinformatics 2020. Cited by: §II.
  • [46] U. H. Syeda, Z. Zafar, Z. Z. Islam, S. M. Tazwar, M. J. Rasna, K. Kise, and M. A. R. Ahad (2017) Visual face scanning and emotion perception analysis between autistic and typically developing children. In Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable computers, pp. 844–853. Cited by: §II.
  • [47] E. Tafaj, T. C. Kübler, G. Kasneci, W. Rosenstiel, and M. Bogdan (2013) Online classification of eye tracking data for automated analysis of traffic hazard perception. In International Conference on Artificial Neural Networks, pp. 442–450. Cited by: Fig. 1.
  • [48] E. Tafaj, T. C. Kübler, G. Kasneci, W. Rosenstiel, and M. Bogdan (2013) Online classification of eye tracking data for automated analysis of traffic hazard perception. In International Conference on Artificial Neural Networks, pp. 442–450. Cited by: §II.
  • [49] H. R. Tavakoli and J. Laaksonen (2016) Bottom-up fixation prediction using unsupervised hierarchical models. In Asian Conference on Computer Vision, pp. 287–302. Cited by: §IV-A.
  • [50] L. Van Der Maaten (2014) Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research 15 (1), pp. 3221–3245. Cited by: Fig. 5.
  • [51] G. Wan, X. Kong, B. Sun, S. Yu, Y. Tu, J. Park, C. Lang, M. Koh, Z. Wei, Z. Feng, et al. (2019) Applying eye tracking to identify autism spectrum disorder in children. Journal of Autism and Developmental Disorders 49 (1), pp. 209–215. Cited by: §II.
  • [52] S. Wang, M. Jiang, X. M. Duchesne, E. A. Laugeson, D. P. Kennedy, R. Adolphs, and Q. Zhao (2015-11) Atypical visual saliency in autism spectrum disorder quantified through model-based eye tracking. Neuron 88 (3), pp. 604–616. External Links: Document Cited by: §III-A.
  • [53] E. T. Wong, S. Yean, Q. Hu, B. S. Lee, J. Liu, and R. Deepu (2019) Gaze estimation using residual neural network. In 2019 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), pp. 411–414. Cited by: §I.
  • [54] J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, and Q. Zhao (2014-01) Predicting human gaze beyond pixels. Journal of Vision 14 (1), pp. 28–28. External Links: ISSN 1534-7362, Document, Link Cited by: §IV-A.
  • [55] Y. Yin, C. Juan, J. Chakraborty, and M. P. McGuire (2018) Classification of eye tracking data using a convolutional neural network. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 530–535. Cited by: §I, §I, §II.