Predicting Confusion from Eye-Tracking Data with Recurrent Neural Networks

06/19/2019 ∙ by Shane D. Sims, et al. ∙ The University of British Columbia

Encouraged by the success of deep learning in a variety of domains, we investigate the suitability and effectiveness of Recurrent Neural Networks (RNNs) in a domain where deep learning has not yet been used, namely detecting confusion from eye-tracking data. Through experiments with a dataset of user interactions with ValueChart (an interactive visualization tool), we found that RNNs learn a feature representation from the raw data that allows for a more powerful classifier than previous methods that use engineered features. This is evidenced by the stronger performance of the RNN (0.74/0.71 sensitivity/specificity), as compared to a Random Forest classifier (0.51/0.70 sensitivity/specificity), when both are trained on an un-augmented dataset. However, using engineered features allows for simple data augmentation methods to be used. These same methods are not as effective at augmentation for the feature representation learned from the raw data, likely due to an inability to match the temporal dynamics of the data.




1 Introduction

An open problem in Affective Computing is that of detecting and adapting to a user’s affective states in real-time. One such state is that of confusion. When a user is confused while interacting with a system, they can experience a decrease in satisfaction, as well as performance [Yi2008]. When this confusion is not resolved, the user may be unable to complete their task unless they are given some form of help. An obvious prerequisite to resolving a user’s confusion is knowing that they are confused. A system that can detect when its user is confused (i.e. classify when the user is or is not confused) gains a human-like awareness that can be leveraged to provide appropriate interventions to resolve such confusion.

Recurrent Neural Networks (RNNs) are deep learning (DL) methods that have achieved excellent results in domains with large amounts of available data [Karpathy and Fei-Fei2015]. Aside from detecting sentiment from text, work has been done using RNNs to recognize emotion from videos [Ebrahimi Kahou et al.2015] and interaction events [Botelho et al.2017]. The limited amount of work in this field (compared to the vast amount of work on object detection in pictures) is because data capturing affect is difficult to acquire and label. For example, one of the larger public emotion video datasets contains fewer than 1000 training items, with automated ground truth label generation [Dhall et al.2017]. The data scarcity problem is exacerbated when relying on data from eye-trackers because it currently requires comparatively more specialized equipment and data collection in a laboratory setting. However, gaze has been shown to be a strong predictor of affective states such as boredom [Jaques et al.2014] and affective valence [Lallé et al.2018]. This fact, along with the success of deep learning in the other data-scarce domains just mentioned, is encouraging.

Prior work has shown that user confusion can be detected in the specific context of processing information visualizations, using eye-tracking data such as gaze point, pupil changes, and the distance of the user’s head from the screen. [Lallé et al.2016] used a Random Forest (RF) classifier on such data, achieving confused and not confused class accuracies (i.e. sensitivity and specificity) of 57% and 91%, respectively. Their approach used engineered features based on domain knowledge about what is important for detecting confusion. We set out to determine whether these results could be improved by using an RNN, which can take advantage of the sequential nature of eye-tracking data (unlike previous approaches) while also learning a data-driven feature representation, a key difference between deep learning and other machine learning paradigms [Bengio et al.2013].

As the core contribution of this work, we show that RNNs learn a feature representation from raw eye-tracking data that allows for a more powerful classifier for detecting confusion than previous methods that used a Random Forest with engineered features (Section 6). This claim is supported by the results of our experiments, which show that the RNN variants perform better than the RF before the dataset is augmented with synthetic examples. However, while simple data augmentation techniques have been applied to produce state-of-the-art results with engineered features [Lallé et al.2016], we discovered that these same techniques are less effective when applied to the raw eye-tracking dataset used to train the RNNs. Our results serve as a baseline upon which to evaluate our ongoing work in augmenting raw eye-tracking data. To the best of our knowledge, we are the first to use an RNN for the task of classifying a user state from raw eye-tracking data. A secondary contribution of this work is an account of our procedure (Section 5) for using RNNs to classify raw eye-tracking data from a small dataset, establishing some best practices along the way. In particular, we provide a simple method of increasing intra-class variance by cyclically splitting our data items into multiple items while maintaining the sequential structure. Our methods will be useful as more data of this type becomes available and research into related augmentation techniques progresses.

2 Related Work

Most work on predicting affect with deep learning has been in computer vision and natural language, where established methods already exist for classifying emotions from pictures, video, and sound [Ebrahimi Kahou et al.2015] [Xu et al.2016] [Amer et al.2014]. These works focus on predicting one of anger, disgust, fear, happiness, sadness, surprise, or neutral. There is less work that uses DL for emotion recognition based on data from live interaction with users, which is ultimately what we want from an affect-sensitive artificial agent.

Work that uses deep learning for classifying confusion has applied RNNs to sequential interaction data [Jiang et al.2018] [Botelho et al.2017]. These works seek to predict multiple emotions (one of which is confusion) while students interact with an Intelligent Tutoring System. While these works utilize interaction data, we utilize eye-tracking data, which has been shown to be a good predictor of emotional or attentional states such as mind wandering [Bixler and D’Mello2015], as well as boredom, and curiosity, while learning with educational software [Jaques et al.2014]. [Lallé et al.2016] predicted confusion using eye-tracking data, and achieved state of the art results using the Random Forest (RF) algorithm (Section 4). In [Pusiol et al.2016], a patient’s gaze point was superimposed onto the face of a doctor from the patient’s point of view. The result is a video that shows the patient’s gaze point over time. An RNN was then used to predict a developmental disorder. To the best of our knowledge, deep learning has yet to be used for any affect predictive task based on eye-tracking data.

3 Dataset

The dataset used in this paper was collected from a user study using ValueChart, an interactive visualization for preferential choice [Lallé et al.2016]. Figure 1 shows an example of ValueChart configured for selecting rental properties from ten available alternatives (listed in the leftmost column), based on a set of relevant attributes (e.g., location or appliances). Although ValueChart has been evaluated for usability, the complexity of the decision tasks it supports means that it can still generate confusion [Yi2008][Conati et al.2013].

Figure 1: An example of the main elements of ValueChart.

In the user study where the data was collected, 136 participants were recruited to perform tasks with ValueChart. Each user performed a set of tasks relevant to exploring the available options for a given decision problem (e.g. selecting the best home based on the aggregation of price and size). There were 5 tasks, each of which was repeated 8 times (randomly ordered), resulting in 5440 task trials (136 users × 40 trials). The user’s eyes were tracked with a Tobii T120 eye-tracker embedded in the study computer monitor. The study was administered in a windowless room with uniform lighting. Tasks had a mean duration of 13.7s (SD=11.3). To collect ground truth labels, users self-reported their confusion by clicking a button labelled “I am confused” (top right in Figure 1). To verify the labels, each participant was later shown replays of interaction segments centred around their confusion reports. Confusion was reported in 112 trials (2%).

The eye-tracking data used for this prediction task can be expressed at three different levels of abstraction:

1. Raw eye-tracker samples: the data captured by the eye-tracker each time a sample is taken. At the T120’s sampling rate of 120 Hz, a sample is collected roughly every 8 ms. Each sample includes the (x,y) coordinates of the user’s gaze on the study screen, their pupil size, and the distance of each eye from the screen (Table 1). This data is rarely used in applications of eye-tracking, which typically rely on fixations.

2. Fixation-based data: the eye-tracker’s proprietary software takes the raw eye-tracker data and generates sequences of fixations (gaze maintained at one point on the screen), saccades (quick movements of gaze from one fixation to another), the user’s pupil size, and head distance from the screen. Tobii’s I-VT algorithm calculates this data [Olsen, 2012] by translating raw eye-tracker sample points into fixation locations, aggregating the smaller eye movements that occur during fixations, such as tremors, drifts, and flicks.

3. High-level features: this level includes statistics of the user’s gaze (G), pupils (P), and head distance (HD). The G features are fixation rate and duration, saccade length, and relative and absolute saccade angles. Additional features relate to various Areas of Interest (AOIs), such as fixation rate, longest fixation duration, the proportion of fixations, and the number of transitions from a given AOI to every other AOI. The P and HD features are mean, SD, min, and max. The Eye Movement Data Analysis Toolkit (EMDAT) is a package capable of computing such summative gaze statistics [Kardan].

Feature Definition
GazePointXLeft Horiz. screen position of gaze
GazePointYLeft Vert. screen position of gaze
CamXLeft Horiz. location of left pupil in the cam.
CamYLeft Vert. location of the left pupil in cam.
PupilLeft Size of the left pupil (mm)
DistanceLeft Dist. from eye-tracker to eye (mm)
ValidityLeft Confidence that left eye identified
Table 1: Raw eye-tracker features for the left eye.

4 Classifying Confusion with Random Forest from High-level Features

The state-of-the-art approach to classifying confusion with the dataset just described uses a Random Forest classifier [Lallé et al.2016]. The model relies on the high-level features described in the previous section. Each trial in the dataset is represented as a 161-element vector encoding 149 gaze features, 6 pupil features, and 6 distance features, which summarize the user’s viewing and attentional behaviour for that trial. (The last second of each trial before a confusion self-report, or before a pivot point, is removed to exclude signs of the intention to push the “I am confused” button from the data.)

To address the highly imbalanced dataset, the Synthetic Minority Over-sampling Technique (SMOTE) [Chawla et al.2002] was used to augment the minority class with synthetic examples. The best results were obtained by augmenting the minority class at a rate of 200% and randomly downsampling the majority class to balance the training set.

To determine how much data leading up to an episode of confusion is needed to predict it, feature sets were built using two different windows of data. Where no confusion was reported in a trial, a randomly selected pivot point was used to provide a start point for the trial. A short window captures the 5s immediately before the end of a trial, while a full window captures the whole trial. Results from these two window sizes were not statistically different.

To evaluate this setup, Lallé et al. measured sensitivity (the proportion of confused trials correctly identified as such) and specificity (the proportion of not confused trials correctly identified as such). These metrics are also known as the true positive and true negative rates, respectively. Lallé et al. achieved 0.57 sensitivity and 0.91 specificity after augmenting the dataset with SMOTE by 200%. These results were achieved with a 75% data validity requirement (i.e., ensuring each item was at least 75% valid). In both cases, the reported metrics were obtained over the unbalanced test set. Without SMOTE, the RF obtained a sensitivity of 0.51 and a specificity of 0.70. (Lallé et al. also experimented with mouse-based interaction events; we focus on their eye-tracking results in order to keep our problem unimodal for the time being.)
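The two metrics just defined can be computed directly from counts of true/false positives and negatives; a small self-contained sketch (toy labels, not the study's data):

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN): proportion of confused (positive)
    trials correctly identified.  Specificity = TN / (TN + FP):
    proportion of not-confused (negative) trials correctly identified."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    return tp / (tp + fn), tn / (tn + fp)
```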

5 Classifying Confusion with RNNs and Raw Eye-tracker Data

Recurrent Neural Networks. Our approach is first distinguished from the one described above by our use of RNNs. Like other neural network approaches, RNNs learn a feature representation from data. It is important to note that deep neural networks are not just another classification algorithm (even if they can be used as such). In fact, the classification task is performed only at the last layers. The layers before this should be considered a representation learner, in that they serve to disentangle and discover the distinguishing features of the data fed into the model [Bengio et al.2013]. This is in contrast to non-deep-learning approaches, which lack this representation-learning step. Such models are given a set of carefully selected features, the composition of which is generally based on human expertise about what is important for a given classification task. These models do not have to learn which features to use; they only have to learn from the ones they are given.

RNNs are especially suited to sequential data (like ours), and the RNN variants we use are also more likely than others to learn long-term dependencies in sequential data, as they address the vanishing/exploding gradient problem common in the basic RNN when learning from long sequences [Goodfellow et al.2016]. An RNN learns its parameters by repeated application of a forward function; it is mainly the definition of this forward function that differentiates the RNN variants. When the output of the RNN is fully connected to a vector containing a node for each class, RNNs can be used for classification just like any other neural network.
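This output arrangement can be made concrete with a minimal PyTorch sketch (PyTorch is the framework used later in this paper). The sizes here are illustrative, not the authors' exact architecture: a single-layer GRU whose final hidden state is fully connected to one output node per class.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Sketch: single-layer GRU; its final hidden state feeds a fully
    connected layer with one node per class (confused / not confused).
    n_features=14 matches the raw-sample features described later."""

    def __init__(self, n_features=14, hidden_size=256, n_classes=2):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_classes)

    def forward(self, x):              # x: (batch, seq_len, n_features)
        _, h_n = self.rnn(x)           # h_n: (1, batch, hidden_size)
        return torch.log_softmax(self.fc(h_n[-1]), dim=-1)
```

A log-softmax output pairs naturally with a negative log likelihood loss, as used in Section 5.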

We consider two RNN variants, namely Long Short-Term Memory (LSTM) networks [Hochreiter and Schmidhuber1997] and Gated Recurrent Units (GRUs) [Cho et al.2014], for their ability to learn from long sequences like those in our data. LSTMs are a gated variation of the basic RNN that differ in their use of self-loops to facilitate the learning of long-term dependencies while also ensuring long-term gradient flow. A GRU is a gated RNN that is essentially a simplified LSTM, reducing the number of gates (and thus learnable parameters) used [Goodfellow et al.2016].

Given that the primary strength of deep learning approaches is their representation-learning ability, we use the raw eye-tracker sample data (Section 3) as input to our RNNs. We use data at this level of abstraction because it is the lowest level available and thus does not exclude any pattern expressed in the data captured by the eye-tracker: any discriminator that could be lost in moving to a higher level of data abstraction is necessarily maintained at this level. Previous work (Section 4) shows that summary statistics around fixations, saccades, pupil size, and head distance are strong features for classifying confusion. However, there may be other indicators in the data that we are not aware of, and providing data at the lowest level gives the model an opportunity to discover these [Bengio et al.2013]. Each of our data items is therefore a 2D array, with the number of rows corresponding to the number of samples captured in the trial. Each row contains 14 columns: one for each feature in Table 1, for both the left and right eye.

Pre-processing. We process the data before training our models by removing the last second of data and filtering out items shorter than 2s or with fewer than 65% valid rows. We define a row as invalid if both ValidityLeft and ValidityRight (Table 1) indicate that the eye-tracker is not confident in the data it captured. We minimize the number of invalid values by identifying rows where at least one of ValidityLeft or ValidityRight is True, and then replacing the invalid features with the valid ones, since feature values for the left and right eye are similar at a given point in time. The steps described up to this point mimic part of the process that EMDAT uses to prepare high-level features [Kardan]. Applying these steps, we discard 26 confused and 1328 not confused trials (similar to the number of trials discarded by EMDAT). All remaining invalid values are replaced with -1 (a value not otherwise occurring in our data).
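A hedged sketch of these pre-processing rules, assuming each trial arrives as a pandas DataFrame with Table 1-style column names (the actual export format and EMDAT's internals may differ):

```python
import pandas as pd

# Hypothetical left/right feature pairs; a real Tobii export has more.
PAIRS = [("PupilLeft", "PupilRight"), ("DistanceLeft", "DistanceRight")]

def preprocess_trial(df, hz=120, min_secs=2, min_valid=0.65):
    """Return a cleaned copy of one trial, or None if it should be
    discarded (shorter than 2s, or under 65% valid rows)."""
    df = df.iloc[:-hz].copy()                 # drop the last second
    if len(df) < min_secs * hz:
        return None
    left_ok = df["ValidityLeft"].astype(bool)
    right_ok = df["ValidityRight"].astype(bool)
    valid = left_ok | right_ok                # at least one eye tracked
    if valid.mean() < min_valid:
        return None
    feat_cols = [c for pair in PAIRS for c in pair]
    for lcol, rcol in PAIRS:                  # borrow the valid eye's value
        df.loc[~left_ok & right_ok, lcol] = df.loc[~left_ok & right_ok, rcol]
        df.loc[left_ok & ~right_ok, rcol] = df.loc[left_ok & ~right_ok, lcol]
    df.loc[~valid, feat_cols] = -1.0          # sentinel for invalid rows
    return df
```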

While there is no fixed sequence length on which RNNs must operate, they typically require sequences shorter than those in the ValueChart confusion dataset (mean length = 1644 samples) [Neil et al.2016]. [Lallé et al.2016] observed no significant difference between using the full sequence length and using only the last 5s of data, which suggests that we can reduce our sequences down to at least 5s. We ran the RNNs on window sizes of 3, 4, 5, and 6s and found no notable difference in performance. We thus chose a 5s window to strengthen comparability between our models and the Random Forest.

The next step we take to reduce sequence length comes after looking at the values in the rows of the raw eye-tracker samples. We observed that because of the high sampling rate of the eye-tracker, these values change only a small amount from one row to the next (i.e. after 8ms). Thus we split each sequence into 4 items with the same label by performing a cyclic partition (e.g. as when dealing a deck of cards), which preserves the temporal structure of the time series data (Figure 2). Splitting each data item like this reduces our sequence lengths by a factor of 4; the same amount of wall-clock time is now represented in a quarter of the number of rows.
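The cyclic split just described can be written as a short helper over a (seq_len, n_features) array; a minimal sketch:

```python
import numpy as np

def cyclic_partition(trial, k=4):
    """Split one (seq_len, n_features) trial into k interleaved
    sub-sequences, as when dealing a deck of cards: item i keeps
    rows i, i+k, i+2k, ...  Temporal order within each new item is
    preserved, and each item covers the same wall-clock span in
    1/k as many rows."""
    return [trial[i::k] for i in range(k)]
```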

Data Augmentation.

Figure 2: Cyclic partition of a single data item into four.

The cyclic partition of the data serves a second function. By splitting the data we provide our model with multiple opportunities to learn from the same example in a more intelligent way than would be achieved by simply duplicating an example exactly (over-sampling). The difference between resulting items is enough to provide intra-class variance, while the cyclic partition ensures the preservation of the sequential pattern. It is thus our first attempt to address the small size of our dataset. However, the data-intensive nature of our approach and our small dataset necessitates augmentation beyond that achieved by the cyclic partition. Following the approach of previous work, we used SMOTE to increase the number of minority class items in our dataset.
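As a self-contained illustration of the SMOTE idea (not the authors' implementation, which could instead use a library such as imbalanced-learn), a minimal SMOTE-style interpolation over flattened minority-class items might look like:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style oversampling sketch (after Chawla et al.,
    2002): each synthetic item is a random interpolation between a
    minority item and one of its k nearest minority neighbours.
    X_min is a 2D array of flattened minority-class items."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]     # skip the item itself
        j = rng.choice(nbrs)
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(out)
```

Because the interpolation treats each flattened sequence as a plain feature vector, it ignores temporal alignment between sequences, which is the limitation discussed in Sections 6 and 7.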


All of our models are implemented and trained using PyTorch. Limiting our models to a single layer and then following common heuristics for neural network design [Goodfellow et al.2016], we chose a hidden state of 256 units for our models. We use negative log likelihood as our loss function, with the Adam optimizer [Kingma and Ba2014] and an initial learning rate of . We limit training to a maximum of 300 epochs, employing linear learning rate decay and early stopping to end training when validation performance stops improving. We use a batch size of , on a single Nvidia GTX 1080 GPU. Metric calculations are done using the scikit-learn library [Pedregosa et al.2011]. To enhance reproducibility, our code is available at
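The training configuration described above can be sketched as follows; the model, learning rate, and patience are placeholders (the paper elides some of these values), and the per-epoch training pass is omitted:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Placeholder model; the paper's models are single-layer RNNs.
model = nn.Sequential(nn.Flatten(), nn.Linear(14, 2), nn.LogSoftmax(dim=-1))
criterion = nn.NLLLoss()                       # negative log likelihood
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder lr
max_epochs, patience = 300, 10                 # patience is a guess
# Linear learning-rate decay from the initial lr down to 0.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: max(0.0, 1.0 - e / max_epochs))

best_roc, stale = 0.0, 0
for epoch in range(max_epochs):
    # ... forward/backward passes over the balanced training set elided ...
    optimizer.step()                           # placeholder for the real update
    scheduler.step()
    val_roc = 0.5                              # stand-in for validation ROC
    if val_roc > best_roc:                     # early stopping on validation ROC
        best_roc, stale = val_roc, 0
    else:
        stale += 1
    if stale >= patience:
        break
```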

5.1 Experiment Setup and Evaluation

To evaluate our models we employ across-user 10-fold Cross Validation (CV), which ensures that no user has contributed data items to both the training and test or validation sets of a given fold. We maintain the same ten data splits while training and evaluating each model and seed all random number generators in our code in order to enhance comparability.
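Across-user folds of this kind can be built with scikit-learn's GroupKFold, using the user id as the group key. A toy sketch (synthetic data, 20 hypothetical users):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# `users` carries each item's user id, so GroupKFold never places one
# user's items in both the training and test side of a fold.
rng = np.random.default_rng(0)
X = rng.random((200, 5))                 # 200 toy items, 5 toy features
users = np.repeat(np.arange(20), 10)     # 20 users, 10 items each
cv = GroupKFold(n_splits=10)
for train_idx, test_idx in cv.split(X, groups=users):
    assert set(users[train_idx]).isdisjoint(users[test_idx])
```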

In addition to sensitivity and specificity (defined in Section 4), we measure the Receiver Operating Characteristic (ROC) score, which provides a unified measure of classifier performance across a range of decision thresholds. This allows us to choose a decision threshold that maximizes sensitivity and specificity: the threshold corresponding to the point closest to (0,1) on the ROC curve.
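Picking the threshold closest to (0,1) can be done directly from scikit-learn's roc_curve output; a sketch:

```python
import numpy as np
from sklearn.metrics import roc_curve

def best_threshold(y_true, scores):
    """Pick the decision threshold whose (FPR, TPR) point lies
    closest to the ideal corner (0, 1) of the ROC curve."""
    fpr, tpr, thr = roc_curve(y_true, scores)
    dist = np.hypot(fpr - 0.0, tpr - 1.0)   # Euclidean distance to (0, 1)
    return thr[np.argmin(dist)]
```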

Because of the highly imbalanced structure of our dataset (recall that validation and test sets are left unbalanced), sensitivity and specificity provide more meaningful metrics than accuracy. In the latter case, 98% accuracy could be achieved by classifying everything as not confused, since only 2% of our data is labelled confused. Before training, we randomly down-sample not confused items to balance the training set.

For each CV fold, the dataset is split into training (90%) and validation (10%) sets. RNN model parameters (weights and biases) are updated via back-propagation on the loss over the training set. Validation set ROC is monitored to detect over-fitting and to decide when to decrease the learning rate.

6 Results

Our most encouraging result is the performance of both the LSTM and the GRU against the Random Forest when no augmentation with synthetic data items is used. Here we see (in Table 2) that the RNNs exceed the Random Forest in terms of sensitivity while matching specificity. This result provides empirical evidence that the RNNs learn a more powerful feature representation with which to classify confused items than does the Random Forest with high-level features.

While both the GRU and LSTM achieve higher sensitivity scores than RF-SMOTE 200%, the latter still outperforms the RNNs in terms of specificity. We postulate that this is because, with 200% confused-class augmentation, the Random Forest can use three times as many majority-class items in training, which allows for a more complete representation of the not confused class in the training data.

Our experiments using SMOTE (with 200% augmentation) to increase the number of minority class training items for the LSTM and GRU showed no improvement in model performance. We also ran our experiments with the basic RNN, but as results were in all cases lower than those of the LSTM and GRU, we omit them from Table 2 for simplicity.

Classifier Sensitivity Specificity
RF 0.51 0.70
RF-SMOTE 200% 0.57 0.91
GRU 0.70 0.70
LSTM 0.74 0.71
Table 2: Random Forest (RF) performance results of [Lallé et al.2016], compared to RNN validation set performance.

Note that the results in Table 2 are calculated on the validation set. This is not an entirely unseen hold-out set, as it is monitored to detect over-fitting and ultimately to decide when to stop training. Validation set performance is often not a useful result to present, as it is easy to over-fit the model to data that (even indirectly) influences training. However, when we split the dataset into train (60%), validation (30%), and test (10%) partitions, we found (Table 3) that the ROC score calculated on the test set is close to that calculated on the validation set (average absolute difference of 0.0055). This suggests that our training procedure does not result in over-fitting to the validation set. We believe that in this case, reporting validation set performance is not only reasonable but also more useful in estimating the ability of the RNNs to discriminate between confused and not confused data, since doing so allows for training with an extra 30% of the data.

Classifier Test ROC Validation ROC
GRU 0.655 0.661
LSTM 0.653 0.658
Table 3: Comparison of RNN performance on the validation set and the unseen test set.

7 Conclusion and Future Work

In this paper, we have shown that RNNs can be used to classify confusion from raw eye-tracking data (i.e. without any manual feature engineering) and that the strength of this approach in capturing long-term dependencies allows our results to approach the state of the art, even without effective data augmentation. When neither method is augmented with synthetic items, the RNNs outperform the Random Forest. When augmented, the Random Forest achieves a higher specificity but is still outperformed by the RNNs in terms of sensitivity. This provides evidence that the RNNs learn a more powerful feature representation from the raw data than is available in the high-level features used by the previous approach. These results also establish a baseline against which to measure future work in augmenting raw eye-tracking data.

We have investigated initial data augmentation techniques, but have discovered that basic approaches to augmentation such as SMOTE do not work as well as they did for previous (non-sequential) approaches to classifying confusion. This is likely because while algorithms in the SMOTE family are necessarily distance based, they compute distance without matching the temporal dynamics of sequences [Gong and Chen2016]. We have, however, identified promising lines of work for the task of augmenting our time series data, and have taken steps to develop and evaluate these.

There are several limitations to be aware of when considering this work. First, while we support all of our claims with cross-validated experimental results, we did not perform testing to verify the statistical significance of close results. Thus, while it appears likely that the LSTM performed better than the GRU (Table 2), we cannot be sure that this difference is statistically significant.

Throughout our work, we made a number of choices guided by intuition. Examples of this are the length of sequences used as input for our RNNs and the number of partitions in which to split our data. These choices would be better supported by significance testing in the first case and an analysis of variance in the latter. While we investigated window sizes near what previous work established as reasonable, it may be beneficial to use the entire sequence (the full window), in the event that there are signals of confusion earlier on in a trial. In this current work, our choice of RNN variations limited our sequence size, but future work should include investigating the Phased LSTM, which is designed specifically for modelling long sequences [Neil et al.2016].

We performed almost no hyperparameter tuning on our RNNs or the learning algorithm, which is often a key element of success for deep learning models. A consequence of this is that our results may not reflect the true power of the models, but rather provide a lower bound on what can be achieved. Future work in this respect will include optimizing depth and hidden unit size, adding dropout for regularization, and learning rate scheduling.

Another area to explore is using fixation-based features (the middle level of abstraction described in Section 3) with the same methods discussed in this paper. This may simplify data augmentation, due to the less complex structure of this data. The most critical area of future work is that of investigating and developing methods of data augmentation for raw eye-tracking data. We have implemented and experimented with affine transformations, with encouraging preliminary results. Following that, the effectiveness of Recurrent Variational Auto-Encoders [Chung et al.2015] and Generative Adversarial Networks should be investigated for producing synthetic raw eye-tracking samples [Mogren2016].

While we are the first to classify a user affective state using deep learning methods on raw eye-tracking data, the work on classifying confusion has a longer history. Solving this task is critical if we are to build AI systems capable of expressing and responding to human emotions and mental states.


We thank Sebastian Lallé for his patience and help in explaining his approach to predicting confusion from eye-tracker data. This work is supported in part by the Institute for Computing, Information and Cognitive Systems (ICICS) at UBC.


  • [Amer et al.2014] Mohamed R Amer, Behjat Siddiquie, Colleen Richey, and Ajay Divakaran. Emotion detection in speech using deep networks. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 3724–3728. IEEE, 2014.
  • [Bengio et al.2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • [Bixler and D’Mello2015] Robert Bixler and Sidney D’Mello. Automatic gaze-based detection of mind wandering with metacognitive awareness. In International Conference on User Modeling, Adaptation, and Personalization, pages 31–43. Springer, 2015.
  • [Botelho et al.2017] Anthony F Botelho, Ryan S Baker, and Neil T Heffernan. Improving sensor-free affect detection using deep learning. In International Conference on Artificial Intelligence in Education, pages 40–51. Springer, 2017.
  • [Chawla et al.2002] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.
  • [Cho et al.2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [Chung et al.2015] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
  • [Conati et al.2013] Cristina Conati, Enamul Hoque, Dereck Toker, and Ben Steichen. When to adapt: Detecting user’s confusion during visualization processing. In UMAP Workshops. Citeseer, 2013.
  • [Dhall et al.2017] Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon. From individual to group-level emotion recognition: Emotiw 5.0. In Proceedings of the 19th ACM international conference on multimodal interaction, pages 524–528. ACM, 2017.
  • [Ebrahimi Kahou et al.2015] Samira Ebrahimi Kahou, Vincent Michalski, Kishore Konda, Roland Memisevic, and Christopher Pal. Recurrent neural networks for emotion recognition in video. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI ’15, pages 467–474, New York, NY, USA, 2015. ACM.
  • [Gong and Chen2016] Zhichen Gong and Huanhuan Chen. Model-based oversampling for imbalanced sequence classification. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 1009–1018. ACM, 2016.
  • [Goodfellow et al.2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  • [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [Jaques et al.2014] Natasha Jaques, Cristina Conati, Jason M Harley, and Roger Azevedo. Predicting affect from gaze data during interaction with an intelligent tutoring system. In International conference on intelligent tutoring systems, pages 29–38. Springer, 2014.
  • [Jiang et al.2018] Yang Jiang, Nigel Bosch, Ryan S Baker, Luc Paquette, Jaclyn Ocumpaugh, Juliana Ma Alexandra L Andres, Allison L Moore, and Gautam Biswas. Expert feature-engineering vs. deep neural networks: Which is better for sensor-free affect detection? In International Conference on Artificial Intelligence in Education, pages 198–211. Springer, 2018.
  • [Kardan] Samad Kardan. Eye movement data analysis toolkit. Accessed: 2019-06-03.
  • [Karpathy and Fei-Fei2015] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
  • [Kingma and Ba2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [Lallé et al.2016] Sébastien Lallé, Cristina Conati, and Giuseppe Carenini. Prediction of individual learning curves across information visualizations. User Modeling and User-Adapted Interaction, 26(4):307–345, 2016.
  • [Lallé et al.2018] Sébastien Lallé, Cristina Conati, and Roger Azevedo. Prediction of student achievement goals and emotion valence during interaction with pedagogical agents. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 1222–1231. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • [Mogren2016] Olof Mogren. C-rnn-gan: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904, 2016.
  • [Neil et al.2016] Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances in neural information processing systems, pages 3882–3890, 2016.
  • [Pedregosa et al.2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [Pusiol et al.2016] Guido Pusiol, Andre Esteva, Scott S Hall, Michael Frank, Arnold Milstein, and Li Fei-Fei. Vision-based classification of developmental disorders using eye-movements. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 317–325. Springer, 2016.
  • [Xu et al.2016] Baohan Xu, Yanwei Fu, Yu-Gang Jiang, Boyang Li, and Leonid Sigal. Video emotion recognition with transferred deep feature encodings. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, ICMR ’16, pages 15–22, New York, NY, USA, 2016. ACM.
  • [Yi2008] Ji Soo Yi. Visualized decision making: development and application of information visualization techniques to improve decision quality of nursing home choice. PhD thesis, Georgia Institute of Technology, 2008.