Happy Are Those Who Grade without Seeing: A Multi-Task Learning Approach to Grade Essays Using Gaze Behaviour

05/25/2020 ∙ by Sandeep Mathias, et al. ∙ IIT Bombay

The gaze behaviour of a reader is helpful in solving several NLP tasks such as automatic essay grading, named entity recognition, sarcasm detection etc. However, collecting gaze behaviour from readers is costly in terms of time and money. In this paper, we propose a way to improve automatic essay grading using gaze behaviour, where the gaze features are learnt at run time using a multi-task learning framework. To demonstrate the efficacy of this multi-task learning based approach to automatic essay grading, we collect gaze behaviour for 48 essays across 4 essay sets, and learn gaze behaviour for the rest of the essays, numbering over 7000 essays. Using the learnt gaze behaviour, we can achieve a statistically significant improvement in performance over the state-of-the-art system for the essay sets where we have gaze data. We also achieve a statistically significant improvement for 4 other essay sets, numbering about 6000 essays, where we have no gaze behaviour data available. Our approach establishes that learning gaze behaviour improves automatic essay grading.




1 Introduction

Collecting a reader’s psychological input can be very beneficial to a number of Natural Language Processing (NLP) tasks, like complexity Mishra et al. (2017); González-Garduño and Søgaard (2017), sentence simplification Klerke et al. (2016), text understanding Mishra et al. (2016), text quality Mathias et al. (2018), parsing Hale et al. (2018), etc. This psychological information can be extracted using devices like eye-trackers and electroencephalogram (EEG) machines. However, one of the challenges in using a reader’s information is collecting the psycholinguistic data itself. Or, to put it more bluntly:

Why should people have to read the text to get data and then solve the task? Isn’t the whole point of Natural Language Processing having a machine that solves my problem without me having to read the text?

In this paper, we choose the task of automatic essay grading and show how we can predict the score that a human rater would give using both text and learnt gaze behaviour. An essay is a piece of text, written in response to a topic, called a prompt. Automatic essay grading is assigning a score to the essay using a machine. An essay set is a set of essays written in response to the same prompt.

Multi-task learning Caruana (1998) is a machine learning paradigm where we utilize auxiliary tasks to aid in solving a primary task. This is done by exploiting similarities between the primary task and the auxiliary tasks. In our paper, scoring the essay is the primary task, while learning the gaze behaviour is the auxiliary task.

In particular, we show how, using gaze behaviour for a very small number of essays (less than 0.7% of the essays in an essay set), we see an improvement in predicting the overall score of the essays. We also use our gaze behaviour dataset to run experiments on unseen essay sets - i.e., essay sets which have no gaze behaviour data - and observe improvements in the system’s performance in automatically grading essays.

1.1 Contributions

The main contribution of our paper is describing how we use gaze behaviour information, in a multi-task learning framework, to automatically score essays, outperforming the state-of-the-art systems.

1.2 Gaze Behaviour Terminology

An Interest Area (IA) is an area of the screen that we are interested in. These areas are where some text is displayed, and not the white background on the left/right, as well as above/below the text. Each word is a separate and unique IA.

A Fixation is an event when the reader’s eye is focused on a part of the screen. For our experiments, we are concerned only with fixations that occur within the interest areas. Fixations that occur in the background are ignored.

A Saccade is the path of the eye movement, as it goes from one fixation to the next. There are two types of saccades - Progressions and Regressions. Progressions are saccades where the reader moves from the current interest area to a later one. Regressions are saccades where the reader moves from the current interest area to an earlier one.

The rest of the paper is organized as follows. Section 2 describes our motivation for using eye-tracking and learning gaze behaviour from readers, over unseen texts. Section 3 describes some of the related work in the area of automatic essay grading, eye tracking and multi-task learning. Section 4 describes the gaze behaviour attributes used in our experiments, and the intuition behind them. We describe our dataset creation and experiment setup in Section 5. In Section 6, we report our results and present a detailed analysis. We present our conclusions and discuss possible future work in Section 7.

2 Motivation

Most research that uses psycholinguistics to solve NLP problems involves collecting the psycholinguistic data a priori. Mishra and Bhattacharyya (2018), for instance, describe a large body of research on solving multiple NLP problems using the gaze behaviour of readers. However, most of their work involves collecting the gaze behaviour data first, and then splitting the data into training and testing data, before initiating any experiments. Their work did show significant improvements over baseline approaches across multiple NLP tasks (they also propose solutions for a few innovative problems like translation complexity Mishra et al. (2013), and sarcasm understandability Mishra et al. (2016)). However, collecting the gaze behaviour data would be quite expensive, both in terms of time and money.

Therefore, we ask ourselves: “Can we learn gaze behaviour, using a small amount of seed data, to help solve an NLP task?” In order to use gaze behaviour on a large scale, we need to be able to learn it, since we cannot ask a user to read texts every time we wish to use gaze behaviour data. Mathias et al. (2018) describe using gaze behaviour to predict how a reader would rate a piece of text (which is similar to our chosen application). Since they showed that gaze behaviour can help in predicting text quality, we use multi-task learning to simultaneously learn gaze behaviour information (the auxiliary task) and score the essay (the primary task). However, they collect all their gaze behaviour data a priori, while we try to learn the gaze behaviour of a reader and use what our system learns for grading the essays. Hence, while they showed that gaze behaviour could help in predicting how a reader would score a text, their approach requires a reader to read the text, whereas ours does not require one during testing / deployment.

3 Related Work

3.1 Automatic Essay Grading (AEG)

The very first AEG system was proposed by Page (1966). Since then, there have been a lot of other AEG systems (see Shermis and Burstein (2013) for more details).

In 2012, the Hewlett Foundation released a dataset called the Automatic Student Assessment Prize (ASAP) AEG dataset. The dataset contains about 13,000 essays across eight different essay sets. We discuss more about that dataset in Section 5.

With the availability of a large dataset, there has been a lot of research, especially using neural networks, in automatically grading essays - like using Long Short Term Memory (LSTM) Networks Taghipour and Ng (2016); Tay et al. (2018), Convolutional Neural Networks (CNNs) Dong and Zhang (2016), or both Dong et al. (2017). Zhang and Litman (2018) improve on the results of Dong et al. (2017) using co-attention between the source article and the essay for one of the types of essay sets.

3.2 Eye-Tracking

Capturing the gaze behaviour of readers has been found to be quite useful in improving the performance of NLP tasks Mishra and Bhattacharyya (2018). The main idea behind using gaze behaviour is the eye-mind hypothesis Just and Carpenter (1980), which states that whatever text the eye reads, that is what the mind processes. This hypothesis has led to a large body of work in psycholinguistic research that shows a relationship between text processing and gaze behaviour. Mishra and Bhattacharyya (2018) also describe some of the ways that eye-tracking can be used for multiple NLP tasks like translation complexity, sentiment analysis, etc.

Research has been done on learning gaze behaviour in a multi-task approach to solve downstream NLP tasks like sentence simplification Klerke et al. (2016), readability González-Garduño and Søgaard (2018); Singh et al. (2016), part-of-speech tagging Barrett et al. (2016), and sentiment analysis Mishra et al. (2018); Long et al. (2019).

4 Gaze Behaviour Attributes

In our experiments, we use only a subset of the gaze behaviour attributes described by Mathias et al. (2018), because most of the other attributes (like Second Fixation Duration, i.e. the duration of the fixation when the reader fixates on an interest area for the second time) were 0 for most of the interest areas, and learning over them would not have yielded any meaningful results. The gaze behaviour attributes that we use are described in this section.

4.1 Fixation Based Attributes

In our experiments, we use the Dwell Time (DT) and First Fixation Duration (FFD) as fixation-based gaze behaviour attributes. First Fixation Duration is the amount of time that a reader initially focuses on an interest area. The Dwell Time is the total amount of time a reader spends focusing on an interest area.

Larger values for fixation durations (for both DT and FFD) usually indicate that a word could be wrong (either a spelling mistake or grammar error). Errors would force a reader to pause, as they try to understand why the error was made (for example, if the writer wrote “short cat” instead of “short cut”).

4.2 Saccade Based Attribute

In addition to the Fixation based attributes, we also look at a regression-based attribute - IsRegression (IR). This attribute is used to check whether or not a regression occurred from a given interest area.

We don’t focus on progression-based attributes, because progressions are the usual direction of reading. We are mainly concerned with regressions because they often occur when there is a mistake, or a need for disambiguation (like trying to resolve the antecedent of an anaphora).

4.3 IA Based Attributes

Lastly, we also use IA-based attributes, such as the Run Count (RC) and if the IA was Skipped (Skip). The Run Count is the number of times a particular IA was fixated on, and Skip is whether or not the IA was skipped. A well-written text would be read more easily, meaning a lower RC, and higher Skip Mathias et al. (2018).
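As a rough illustration of how these five attributes relate to a reader's fixation sequence, the sketch below derives them from a time-ordered list of fixations. The input format, the name `gaze_attributes`, the choice to count a run as a maximal stretch of consecutive fixations on the same IA, and the choice to treat Skip as "never fixated" are assumptions made for illustration. In the experiments, the values come from the eye-tracker's interest area report rather than from code like this.

```python
from dataclasses import dataclass


@dataclass
class GazeAttributes:
    dwell_time: int = 0               # DT: total fixation time on the IA (ms)
    first_fixation_duration: int = 0  # FFD: duration of the first fixation (ms)
    is_regression: int = 0            # IR: 1 if a regression started from this IA
    run_count: int = 0                # RC: number of separate visits to the IA
    skip: int = 1                     # Skip: 1 if the IA was never fixated


def gaze_attributes(num_words, fixations):
    """fixations: time-ordered list of (ia_index, duration_ms) pairs (assumed format)."""
    attrs = [GazeAttributes() for _ in range(num_words)]
    prev = None
    for ia, duration in fixations:
        current = attrs[ia]
        if current.skip:
            # First time this IA is fixated.
            current.skip = 0
            current.first_fixation_duration = duration
        current.dwell_time += duration
        if ia != prev:
            # A new visit (run) to this IA starts here.
            current.run_count += 1
            if prev is not None and ia < prev:
                # The eye moved backwards: a regression out of the previous IA.
                attrs[prev].is_regression = 1
        prev = ia
    return attrs
```

For example, the sequence `[(0, 200), (1, 150), (3, 300), (2, 100), (3, 120)]` over five words contains one regression (from word 3 back to word 2), two runs on word 3, and leaves word 4 skipped.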

5 Dataset and Experiment Setup

Essay Set | Number of Essays | Score Range | Mean Word Count
Prompt 3 | 1726 | 0-3 | 150
Prompt 4 | 1770 | 0-3 | 150
Prompt 5 | 1805 | 0-4 | 150
Prompt 6 | 1800 | 0-4 | 150
Prompt 1 | 1783 | 2-12 | 350
Prompt 2 | 1800 | 1-6 | 350
Prompt 7 | 1569 | 0-30 | 250
Prompt 8 | 723 | 0-60 | 650
Total | 12976 | 0-60 | 250
Table 1: Statistics of the 8 essay sets from the ASAP AEG dataset. We collect gaze behaviour data only for Prompts 3 - 6 (i.e. the first four rows), as explained in Section 5.3. The other 4 prompts comprise our unseen essay sets.

5.1 Essay Dataset Details

We perform our experiments on the ASAP AEG dataset. The dataset has approximately 13,000 essays, across 8 essay sets. Table 1 reports the statistics of the dataset in terms of Number of Essays, Score Range, and Mean Word Count. The first 4 rows in Table 1 are source-dependent response (SDR) essay sets, which we use to collect our gaze behaviour data. The other essays are used as unseen essay sets. SDRs are essays written in response to a question about a source article. For example, one of the essays that we use is based on an article called The Mooring Mast, by Marcia Amidon Lüsted. (The prompt is “Based on the excerpt, describe the obstacles the builders of the Empire State Building faced in attempting to allow dirigibles to dock there. Support your answer with relevant and specific information from the excerpt.” Due to space constraints, we’ll be adding the original article as part of the supplementary material.)

5.2 Evaluation Metric

For measuring our system’s performance, we use Cohen’s Kappa with quadratic weights - Quadratic Weighted Kappa (QWK) Cohen (1968) - for the following reasons. Firstly, irrespective of whether we use regression or ordinal classification, the final scores that are predicted by the system should be discrete scores. Hence, using Pearson Correlation would not be appropriate for our system. Secondly, F-Score and accuracy do not take into account chance agreements, unlike Cohen’s Kappa. For example, if we were to give everyone an average grade, we would get a positive value for accuracy and F-Score, but a Kappa value of 0. Thirdly, weighted Kappa takes into account the fact that the classes are ordered, i.e. $0 < 1 < 2 < \dots$. Using unweighted Kappa would penalize a 0 graded as a 4 as much as a 0 graded as a 1. Lastly, we use quadratic weights, as opposed to linear weights, because quadratic weights reward agreements and penalize mismatches more than linear weights.
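For concreteness, a minimal sketch of QWK for integer scores is given below. It follows the standard formulation, $\kappa = 1 - \sum_{i,j} w_{ij} O_{ij} / \sum_{i,j} w_{ij} E_{ij}$ with $w_{ij} = (i-j)^2 / (N-1)^2$, where $O$ is the observed confusion matrix and $E$ is the matrix expected by chance. The function name and argument layout are illustrative, and this is not the evaluation script used in the experiments.

```python
def quadratic_weighted_kappa(true_scores, pred_scores, min_score, max_score):
    """Quadratic Weighted Kappa between two lists of integer scores."""
    if max_score <= min_score:
        raise ValueError("need at least two score points")
    n_classes = max_score - min_score + 1
    n = len(true_scores)

    # Observed confusion matrix O[i][j]: true score i, predicted score j.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_scores, pred_scores):
        observed[t - min_score][p - min_score] += 1

    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        # Both raters gave the same single score to every essay.
        return 1.0
    return 1.0 - numerator / denominator
```

With this definition, giving every essay the same grade yields a Kappa of 0, matching the chance-agreement argument above, while perfect agreement yields 1.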

5.3 Creation of the Gaze Behaviour Dataset

In this subsection, we describe how we created our gaze behaviour dataset, how we chose our essays for eye-tracking, and how they were annotated.

5.3.1 Details of Texts

Essay Set | 0 | 1 | 2 | 3 | 4 | Total
Prompt 3 | 2 | 4 | 5 | 1 | N/A | 12
Prompt 4 | 2 | 3 | 4 | 3 | N/A | 12
Prompt 5 | 2 | 1 | 3 | 5 | 1 | 12
Prompt 6 | 2 | 2 | 3 | 4 | 1 | 12
Total | 8 | 10 | 15 | 13 | 2 | 48
Table 2: Number of essays in each essay set for which we collected gaze behaviour, broken down by score (0 to 3 for Prompts 3 and 4, 0 to 4 for Prompts 5 and 6).

As mentioned earlier in Section 5, we used only essays corresponding to prompts 3 to 6 of the ASAP AEG dataset. From each of the four essay sets, we selected 12 essays with a diverse vocabulary as well as all possible scores.

We use a greedy algorithm to select essays, i.e., for each essay set, we pick 12 essays that cover all score points with the maximum number of unique tokens, while each being under 250 words. Table 2 reports the distribution of essays with each score, for each of the 4 essay sets that we use to create our gaze behaviour dataset.
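The exact greedy procedure is not spelled out here, so the following is only one plausible reading of it. The sketch first picks, for each score point, the essay that adds the most unseen tokens, and then keeps adding the essay that contributes the most new tokens until 12 are chosen. The function name, the whitespace tokenization, and the two-phase ordering are all assumptions.

```python
def select_essays(essays, k=12, max_words=250):
    """essays: list of (essay_id, score, text). Returns up to k essay ids."""
    # Keep only essays under the word limit, with their set of unique tokens.
    candidates = [(eid, score, set(text.lower().split()))
                  for eid, score, text in essays
                  if len(text.split()) < max_words]
    chosen, vocab = [], set()

    def pick(pool):
        # Greedy step: take the essay that adds the most unseen tokens.
        best = max(pool, key=lambda c: len(c[2] - vocab))
        chosen.append(best[0])
        vocab.update(best[2])
        candidates.remove(best)

    # Phase 1: make sure every score point is covered.
    for score in sorted({c[1] for c in candidates}):
        if len(chosen) < k:
            pick([c for c in candidates if c[1] == score])

    # Phase 2: fill the remaining slots by vocabulary gain alone.
    while len(chosen) < k and candidates:
        pick(candidates)
    return chosen
```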

To display the essay text on the screen, we use a large font size, so that (a) the text is clear, and (b) the reader’s gaze is captured on the words which they are currently reading. Although this ensures clarity in reading and more accurate recording of the gaze pattern, it also limits the size of the essay that can be used in our experiment. This is why the longest essay in our gaze behaviour dataset is about 250 words.

The original essays have their named entities anonymized. Hence, before running the experiments, we replaced the anonymized placeholders with appropriate named entities (e.g., an instance of @NAME1 with “Al Smith”, an instance of @PLACE1 with “New Jersey”, @MONTH1 with “May”, etc.). Another advantage of using source-dependent essays is that there is a source article which we can use to correctly replace the anonymized named entities.
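The substitution itself can be sketched as a simple pattern replacement. The mapping below reuses the examples from the text. The helper name and the regular expression for the placeholder format (an `@`, capital letters, then digits) are assumptions, and in practice each mapping was chosen by consulting the source article rather than generated automatically.

```python
import re

# Matches anonymization tokens such as @NAME1 or @PLACE1 (assumed format).
PLACEHOLDER = re.compile(r"@[A-Z]+\d+")


def fill_placeholders(text, mapping):
    """Replace each known placeholder; leave unknown ones untouched."""
    return PLACEHOLDER.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


mapping = {"@NAME1": "Al Smith", "@PLACE1": "New Jersey", "@MONTH1": "May"}
print(fill_placeholders("@NAME1 visited @PLACE1 in @MONTH1.", mapping))
```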

5.3.2 Annotator Details

We used a total of 8 annotators in our experiments. (A total of 9 annotators applied. However, we had to reject one of them because his eyesight was not satisfactory.) The annotators were aged between 18 and 31, with an average age of 25 years. All of them were either in college, or had completed a Bachelor’s degree. All but one of them also had experience as a teaching assistant. The annotators were fluent in English, and about half of them had participated earlier in similar experiments. The annotators were adequately compensated for their work. We report details on individual annotators in the supplementary material.

To assess the quality of the individual annotators, we evaluated the scores they provided against the ground truth scores - i.e., the scores given by the original annotators. The QWK measures the agreement between the annotators and the ground truth score. Close is the number of times (out of 48) in which the annotators either agreed with the ground truth scores, or differed from them by at most 1 score point. Correct is the number of times (out of 48) in which the annotators agreed with the ground truth scores. The mean values for the 3 measures were 0.646 (QWK), 42.75 (Close) and 22.25 (Correct).
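The Close and Correct counts are simple to compute from an annotator's scores and the ground truth. A minimal sketch, with an illustrative function name, is:

```python
def close_and_correct(annotator_scores, ground_truth):
    """Return (Close, Correct) counts for one annotator."""
    diffs = [abs(a - g) for a, g in zip(annotator_scores, ground_truth)]
    close = sum(d <= 1 for d in diffs)    # exact match or off by one point
    correct = sum(d == 0 for d in diffs)  # exact match only
    return close, correct
```

Every Correct essay is also counted as Close, so Close is always at least as large as Correct.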

5.4 System Details

We conduct our experiments using well-established norms in eye-tracking research Holmqvist et al. (2011). The essays are displayed on a screen that is kept about 2 feet in front of the participant.

The workflow of the experiment is as follows. First, the camera is calibrated. This is done by having the annotator look at 13 points on the screen, while the camera tracks their eyes. Next, the calibration is validated. In this step, the participant looks at the same points they saw earlier. In case there is a big difference between the participant’s fixation points tracked by the camera and the actual points, calibration is repeated. Then, the reader reads the essay. As the reader reads the essay, we supervise the tracking of their eyes. They are allowed to take as much time as they need to understand the essay fully. The essay is displayed on the screen in Times New Roman typeface with a font size of 23. Finally, the reader scores the essay and provides a justification for their score. As part of our data release, we will release the scores given by each annotator, as well as their justifications for their score.

This is done for all 48 essays. After reading and scoring an essay, the participant takes a small break of about a minute, before continuing. Before the next essay is read, the participant again has to do the calibration and validation. The average time for the participants was about 2 hours, with the fastest completing the task in slightly under one and a half hours.

The entire process of having participants read the essay, and collecting gaze behaviour data, is done using an SR Research EyeLink 1000 eye-tracker (monocular stabilized head mode, with a sampling rate of 500Hz). The machine can collect all the gaze details that we need for our experiments. An interest area report is generated for gaze behaviour using the SR Research Data Viewer software.

Figure 1: Architecture of the proposed gaze behaviour and essay scoring multi-task learning systems, namely (a) - the Self-Attention multi-task learning system, for an essay of sentences - and (b) - the Co-Attention system for an essay of sentences and a source article of sentences.

5.5 Experiment Details

We use five-fold cross-validation to evaluate our system. For each fold, 60% of the data is used for training, 20% for validation, and 20% for testing. The folds are the same as those used by Taghipour and Ng (2016). Prior to running our experiments, we convert the scores from their original score range (given in Table 1) to the range [0, 1], as described by Taghipour and Ng (2016).
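To illustrate the 60/20/20 arrangement, the sketch below rotates five equal chunks so that each chunk serves once as the test set and once as the validation set. This is only a generic illustration with hypothetical names: the experiments reuse the published folds of Taghipour and Ng (2016) rather than generating new ones.

```python
def five_fold_splits(essay_ids, n_folds=5):
    """Yield (train, dev, test) id lists: 3 chunks, 1 chunk, 1 chunk per fold."""
    chunks = [essay_ids[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        dev_index = (i + 1) % n_folds
        test = chunks[i]
        dev = chunks[dev_index]
        train = [essay_id
                 for j, chunk in enumerate(chunks)
                 if j not in (i, dev_index)
                 for essay_id in chunk]
        yield train, dev, test
```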

In order to normalize idiosyncratic reading patterns across different readers, we perform binning for each of the features for each of the readers. For IR and Skip we use only two bins - 0 and 1 - corresponding to their values. For the run count, we use six bins (from 0 to 5), where each bin is the same as the run count (up to 4), and bin 5 contains run counts greater than 4. For the fixation attributes - DT and FFD - we use the same binning scheme as described in Klerke et al. (2016). The binning scheme for fixation attributes is as follows:

$$
\mathrm{Bin}(x) =
\begin{cases}
0 & \text{if } x = 0, \\
1 & \text{if } x > 0 \text{ and } x \le \mu - \sigma, \\
2 & \text{if } x > \mu - \sigma \text{ and } x \le \mu - \sigma/2, \\
3 & \text{if } x > \mu - \sigma/2 \text{ and } x \le \mu + \sigma/2, \\
4 & \text{if } x > \mu + \sigma/2 \text{ and } x \le \mu + \sigma, \\
5 & \text{if } x > \mu + \sigma,
\end{cases}
$$

where $x$ is the value of the given fixation attribute, $\mu$ is the average fixation attribute value for the reader and $\sigma$ is the standard deviation. To calculate $\mu$ and $\sigma$ for the fixation attributes, we exclude all the interest areas that have a zero value, directly assigning them Bin 0.
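A sketch of the per-reader binning is given below. It assumes bin boundaries at one-half and one standard deviation on either side of the reader's mean, following our reading of the Klerke et al. (2016) scheme. The function names and the use of the population standard deviation are also assumptions.

```python
from statistics import mean, pstdev


def fixation_bins(values):
    """Bin one reader's DT or FFD values (one value per interest area)."""
    nonzero = [v for v in values if v > 0]
    if not nonzero:
        return [0] * len(values)
    # Mean and standard deviation are computed over fixated IAs only.
    mu, sigma = mean(nonzero), pstdev(nonzero)

    def bin_of(x):
        if x == 0:
            return 0  # unfixated interest areas go straight to bin 0
        if x <= mu - sigma:
            return 1
        if x <= mu - sigma / 2:
            return 2
        if x <= mu + sigma / 2:
            return 3
        if x <= mu + sigma:
            return 4
        return 5

    return [bin_of(v) for v in values]


def run_count_bin(run_count):
    """Run counts 0 to 4 keep their value; anything larger falls in bin 5."""
    return min(run_count, 5)
```

Because the statistics are computed separately for each reader, a slow reader and a fast reader can receive the same bin for the same word even when their raw durations differ.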

5.6 Network Architecture

Figure 1 shows the architecture of our proposed joint gaze behaviour learning and essay scoring system, based on the co-attention architecture described by Zhang and Litman (2018). Given an essay, we split the essay into sentences. For each sentence, we look up the word embeddings for all words in the Word Embedding layer, using the pre-trained word embeddings mentioned in Section 5.7. The 4000 most frequent words are used as the vocabulary, with all other words mapped to a special unknown word. This sequence of word embeddings forming the sentence is then sent through a Time-Delay Neural Network (TDNN), or 1-d Convolutional Neural Network (CNN), with a fixed filter width (the kernel size is given in Section 5.7).
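The vocabulary step can be sketched as follows. The helper names and the `<unk>` token string are illustrative choices, not taken from the released implementation.

```python
from collections import Counter


def build_vocab(tokenized_essays, max_size=4000, unk="<unk>"):
    """Map the max_size most frequent words to ids; id 0 is the unknown word."""
    counts = Counter(tok for essay in tokenized_essays for tok in essay)
    vocab = {unk: 0}
    for word, _ in counts.most_common(max_size):
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(tokens, vocab, unk="<unk>"):
    """Convert tokens to ids, sending out-of-vocabulary words to the unknown id."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]
```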

The output from CNN is pooled using an attention layer which results in a representation for every sentence - the Word Level Attention Pooling Layer. This representation for every sentence in the essay is sent through a Sentence Level LSTM Layer to obtain the sentence representation of the essay.

A similar procedure is repeated for the source article. We then perform co-attention between the sentence representations of the essay and the source article. Co-attention is performed to learn similarities between the sentences in the essay and the source article. This is done as a way to ensure that the writer sticks to answering the prompt, rather than drifting off topic.

We now represent every sentence in the essay as a weighted combination of the sentence representations of the essay and the source article. The weights are obtained from the output of the co-attention layer. The weights represent how similar each sentence in the essay is to the sentences in the source article. If a sentence in the essay has low weights, this indicates that the sentence is likely to be off topic. A similar procedure is repeated to get a weighted representation of sentences in the source article with respect to the essay.

Finally, we send the sentence representation of the essay and article through a dense layer (i.e. the Modeling Layer) to predict the final essay score, with a sigmoid activation function. As the essay scores are in the range [0, 1], we use sigmoid activation at the output layer. During prediction, we map the output scores from the sigmoid layer back to the original score range. We minimize the mean squared error loss.
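The score conversion in both directions amounts to a linear rescaling. In the sketch below, the function names are illustrative, and rounding to the nearest integer when mapping back is an assumption, since the text only says that outputs are mapped back to the original range.

```python
def to_unit_range(score, low, high):
    """Scale a score from [low, high] to [0, 1] for training."""
    return (score - low) / (high - low)


def from_unit_range(value, low, high):
    """Map a sigmoid output in [0, 1] back to an integer score in [low, high]."""
    return int(round(value * (high - low) + low))
```

For Prompt 1, whose scores range from 2 to 12, a score of 7 becomes 0.5, and a sigmoid output of 0.5 maps back to 7.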

For essay sets without a source article, we use the Self-Attention model proposed by Dong et al. (2017). This is a simpler model which does not consider the source article, and uses only the essay text. This is applicable whenever a source article is not present. Figure 1(a) shows the architecture of the model. Like the earlier system, we get the sentence representation of the essay from the Sentence Level LSTM Layer and send it through the Dense Layer with a sigmoid activation function.

The gaze behaviour learning happens at the Word-Level Convolutional Layer in both the models. This is done because the gaze attributes are learnt at the word-level, while the essay score is predicted at the document-level. The output from the CNN layer is sent through a linear layer followed by sigmoid activation for a particular gaze behaviour. For learning multiple gaze attributes simultaneously, we have multiple linear layers for each of the gaze attributes. In the multi-task setting, we also minimize the mean squared error of the learnt gaze behaviour and the actual gaze behaviour attribute value. We assign a weight to each of the gaze behaviour loss functions to control the importance given to individual gaze behaviour learning tasks.
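The combined objective can be sketched as the essay-score loss plus a weighted sum of per-attribute gaze losses. The plain-Python version below is only meant to show the arithmetic. The weight values are the best weights reported in Section 5.7, while the function names, the dictionary layout, and the choice to skip gaze terms for essays without gaze data are assumptions.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Best per-attribute loss weights reported in Section 5.7.
GAZE_LOSS_WEIGHTS = {"DT": 0.05, "FFD": 0.05, "IR": 0.01, "RC": 0.01, "Skip": 0.1}


def multitask_loss(score_pred, score_true, gaze_pred=None, gaze_true=None,
                   weights=GAZE_LOSS_WEIGHTS):
    """Primary essay-score loss plus weighted auxiliary gaze losses."""
    loss = mse(score_pred, score_true)
    if gaze_pred and gaze_true:
        for name, weight in weights.items():
            # Only attributes present in both dictionaries contribute.
            if name in gaze_pred and name in gaze_true:
                loss += weight * mse(gaze_pred[name], gaze_true[name])
    return loss
```

Keeping the gaze weights small means the auxiliary tasks shape the word-level representation without dominating the essay-scoring objective.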

5.7 Network Hyperparameters

We use the 50 dimension GloVe pre-trained word embeddings Pennington et al. (2014). We run our experiments with a batch size of 100, for 100 epochs, and a dropout rate of 0.5. The Word-level CNN layer has a kernel size of 5, with 100 filters. The Sentence-level LSTM layer and modeling layer both have 100 hidden units. We use the RMSProp Optimizer Dauphin et al. (2015) with an initial learning rate of 0.001 and momentum of 0.9.

In addition to the network hyper-parameters, we also weigh the loss functions of the different gaze behaviours differently, with weight levels of 0.5, 0.1, 0.05, 0.01 and 0.001. We use grid search and pick the weight giving the lowest mean-squared error on the development set. The best weights from grid search are 0.05 for DT and FFD, 0.01 for IR and RC, and 0.1 for Skip.
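The weight selection can be sketched as a one-dimensional search over the candidate values. Whether the weights were tuned jointly or one attribute at a time is not stated, so the sketch assumes each attribute is tuned separately. `dev_mse_for_weight` is a hypothetical callable that trains the model with a given weight and returns its mean squared error on the development set.

```python
CANDIDATE_WEIGHTS = [0.5, 0.1, 0.05, 0.01, 0.001]


def pick_loss_weight(dev_mse_for_weight, candidates=CANDIDATE_WEIGHTS):
    """Return the candidate weight with the lowest development-set MSE."""
    return min(candidates, key=dev_mse_for_weight)
```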

System | Prompt 3 | Prompt 4 | Prompt 5 | Prompt 6 | Mean QWK
Taghipour and Ng (2016) | 0.683 | 0.795 | 0.818 | 0.813 | 0.777
Dong and Zhang (2016) | 0.662 | 0.778 | 0.800 | 0.809 | 0.762
Tay et al. (2018) | 0.695 | 0.788 | 0.815 | 0.810 | 0.777
Self-Attention Dong et al. (2017) | 0.677 | 0.807 | 0.806 | 0.809 | 0.775
Co-Attention Zhang and Litman (2018) | 0.689 | 0.809 | 0.812 | 0.813 | 0.780
Co-Attention+Gaze * | 0.698 | 0.818 | 0.815 | 0.821 | 0.788
Table 3: Results of our experiments in scoring the essays (QWK values) from the essay sets where we collected gaze behaviour. The first 3 rows are results reported from other state-of-the-art deep learning systems. The next 2 rows are the results we obtained on existing systems - self-attention and co-attention - without gaze behaviour. The last row is the results from our system using gaze behaviour data (Co-Attention+Gaze). The result is statistically significant, with p = 0.015 (* denotes a statistically significant result over the baseline system).

System | Prompt 1 | Prompt 2 | Prompt 7 | Prompt 8 | Mean QWK
Taghipour and Ng (2016) | 0.775 | 0.687 | 0.805 | 0.594 | 0.715
Dong and Zhang (2016) | 0.805 | 0.613 | 0.758 | 0.644 | 0.705
Tay et al. (2018) | 0.832 | 0.684 | 0.800 | 0.697 | 0.753
Only Prompt Dong et al. (2017) | 0.816 | 0.667 | 0.792 | 0.678 | 0.738
Extra Essays | 0.828 | 0.672 | 0.802 | 0.685 | 0.747
Extra Essays + Gaze * | 0.833 | 0.681 | 0.806 | 0.699 | 0.754
Table 4: Results of our experiments on the unseen essay sets in our dataset. The first 3 rows are results reported from other state-of-the-art deep learning systems. The next 2 rows are the results obtained without using gaze behaviour (without and with the extra essays). The last row is the results from our system. The result is statistically significant with p = 0.0041 (* denotes a statistically significant result over the baseline system).

5.8 Experiment Configurations

To test our system on essay sets for which we collected gaze behaviour, we run experiments using the following configurations. (a) Self-Attention - This is the implementation of Dong et al. (2017)’s system in TensorFlow by Zhang and Litman (2018). (b) Co-Attention - This is Zhang and Litman (2018)’s system (the implementation of both systems can be downloaded from https://github.com/Rokeer/co-attention). (c) Co-Attention+Gaze - This is our system, which uses gaze behaviour.

In addition to this, we also run experiments on the unseen essay sets using the following training configurations. (a) Only Prompt - This uses our self-attention model, with the training data being only the essays from that essay set. We use this model because there are no source articles for these essay sets. (b) Extra Essays - Here, we augment the training data of (a) with the 48 essays for which we collect gaze behaviour data. (c) Essays+Gaze (listed as Extra Essays + Gaze in Table 4) - Here, we augment the training data of (a) with the 48 essays for which we collect gaze behaviour data, and their corresponding gaze data. We also compare our results with a string kernel based system proposed by Cozma et al. (2018).

Figure 2: Dwell Time of one of the readers for one of the essays. The darker the background, the larger the bin.

6 Results and Analysis

Table 3 reports the results of our experiments on the essay sets for which we collect the gaze behaviour data. The table is divided into 3 parts. The first part (i.e., the first 3 rows) contains the results reported by previously available deep-learning systems, namely Taghipour and Ng (2016), Dong and Zhang (2016), and Tay et al. (2018). The next 2 rows feature results using the self-attention Dong et al. (2017) and co-attention Zhang and Litman (2018) systems. The last row reports results using gaze behaviour on top of co-attention, i.e., Co-Attention+Gaze. The first column lists the different systems. The next 4 columns report the QWK results of each system for each of the 4 essay sets. The last column reports the Mean QWK value across all 4 essay sets.

Our system is able to outperform the Co-Attention system Zhang and Litman (2018) in all the essay sets. Overall, it is also the best system - achieving the highest QWK results among all the systems in 3 out of the 4 essay sets (and the second-best in the other essay set). To test our hypothesis - that the model trained by learning gaze behaviour helps in automatic essay grading - we run the Paired T-Test. Our null hypothesis is: “Learning gaze behaviour to score an essay does not help any more than the self-attention and co-attention systems and whatever improvements we see are due to chance.” The improvements of our system are found to be statistically significant at our chosen significance level (p = 0.015, as reported in Table 3), so we reject the null hypothesis.
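As an illustration of the paired test, the sketch below computes the paired t statistic from two lists of matched QWK values. Which samples were paired is not stated here, so using the four per-prompt values of Co-Attention and Co-Attention+Gaze from Table 3 is an assumption made purely for the example. The p-value would then be read from a t distribution with the returned degrees of freedom, for instance with a library routine such as `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(baseline, system):
    """Return (t statistic, degrees of freedom) for paired samples."""
    diffs = [s - b for b, s in zip(baseline, system)]
    standard_error = stdev(diffs) / sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1


# Per-prompt QWK values copied from Table 3 (Prompts 3 to 6).
co_attention = [0.689, 0.809, 0.812, 0.813]
co_attention_gaze = [0.698, 0.818, 0.815, 0.821]
t_stat, dof = paired_t_statistic(co_attention, co_attention_gaze)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```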

6.1 Results for Unseen Essay Sets

In order to run our experiments on unseen essay sets, we augment the training data with the gaze behaviour data collected. Since none of these essays have source articles, we use the self-attention model of Dong et al. (2017) as the baseline system. We then add the gaze behaviour learning task as the auxiliary task and report the results in Table 4. The last 3 rows of the table are our 3 configurations - Only Prompt, Extra Essays, and Essays+Gaze (listed as Extra Essays + Gaze) - and the columns report the QWK for each unseen essay set, followed by the mean. From Table 4, we observe that our system, which uses both the extra 48 essays and their gaze behaviour, outperforms the other 2 configurations across all 4 unseen essay sets. The improvement when learning gaze behaviour for unseen essay sets is statistically significant (p = 0.0041, as reported in Table 4).

6.2 Comparison with String Kernel System

Since Cozma et al. (2018) haven’t released their data splits (train/test/dev), we ran their system with our data splits. We observed a mean QWK of 0.750 with the string kernel-based system on the essay sets where we have gaze behaviour data, and 0.685 on the unseen essay sets. One possible reason for this could be that while they used cross-validation, they may have used only a training-testing split (as compared to a train/test/dev split).

6.3 Analysis of Gaze Attributes

Gaze Feature | Diff. in QWK
Dwell Time | 0.0137
First Fixation Duration | 0.0136
IsRegression | 0.0090
Run Count | 0.0110
Skip | 0.0091
Table 5: Results of ablation tests for each of the gaze behaviour attributes across all the essay sets. The reported numbers are the difference in QWK before and after ablating the given gaze attribute. The largest difference (Dwell Time) denotes the best gaze attribute.

The results for our ablation tests are reported in Table 5. We found that the most important gaze behaviour attribute across all the essay sets is the Dwell Time, followed closely by the First Fixation Duration. One of the reasons for this is the fact that both DT and FFD were very useful in detecting errors made by the essay writers. The normalized mean squared error of each of the gaze features predicted by our system was between 0.125 and 0.128 for all the gaze behaviour attributes.

From Figure 2, we observe that most of the longest dwell times have come at/around spelling mistakes (tock instead of took), or out-of-context words (bay instead of buy), or incorrect phrases (short cat, instead of short cut). These errors force the reader to spend more time fixating on the word, as we also mentioned earlier. Due to space constraints, we are uploading other heat map examples in the supplementary material.

7 Conclusion and Future Work

In this paper, we describe how learning gaze behaviour can help AEG in a multi-task learning setup. We explained how we collect gaze behaviour data and how, using multi-task learning, we are able to achieve better results than a state-of-the-art system developed by Zhang and Litman (2018). We also analyze the transferability of gaze behaviour patterns across essay sets by training a multi-task learning model on unseen essay sets, thereby establishing that learning gaze behaviour improves automatic essay grading.

In the future, we would like to use gaze behaviour to help in cross-domain AEG, which is mainly needed when we do not have enough training examples in our essay set. We would also like to explore the possibility of generating textual feedback (rather than just a number denoting the score of the essay) based on the justifications that the annotators gave for their grades.


References

  • Barrett et al. (2016) Maria Barrett, Joachim Bingel, Frank Keller, and Anders Søgaard. 2016. Weakly supervised part-of-speech tagging using eye-tracking data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 579–584, Berlin, Germany. Association for Computational Linguistics.
  • Caruana (1998) Rich Caruana. 1998. Multitask Learning, pages 95–133. Springer US, Boston, MA.
  • Cohen (1968) Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin, 70(4):213.
  • Cozma et al. (2018) Mădălina Cozma, Andrei Butnaru, and Radu Tudor Ionescu. 2018. Automated essay scoring with string kernels and word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 503–509, Melbourne, Australia. Association for Computational Linguistics.
  • Dauphin et al. (2015) Yann Dauphin, Harm De Vries, and Yoshua Bengio. 2015. Equilibrated adaptive learning rates for non-convex optimization. In Advances in neural information processing systems, pages 1504–1512.
  • Dong and Zhang (2016) Fei Dong and Yue Zhang. 2016. Automatic features for essay scoring – an empirical study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1072–1077, Austin, Texas. Association for Computational Linguistics.
  • Dong et al. (2017) Fei Dong, Yue Zhang, and Jie Yang. 2017. Attention-based recurrent convolutional neural network for automatic essay scoring. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 153–162, Vancouver, Canada. Association for Computational Linguistics.
  • González-Garduño and Søgaard (2018) Ana V González-Garduño and Anders Søgaard. 2018. Learning to predict readability using eye-movement data from natives and learners. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • González-Garduño and Søgaard (2017) Ana Valeria González-Garduño and Anders Søgaard. 2017. Using gaze to predict text readability. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 438–443, Copenhagen, Denmark. Association for Computational Linguistics.
  • Hale et al. (2018) John Hale, Chris Dyer, Adhiguna Kuncoro, and Jonathan Brennan. 2018. Finding syntax in human encephalography with beam search. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2727–2736, Melbourne, Australia. Association for Computational Linguistics.
  • Holmqvist et al. (2011) Kenneth Holmqvist, Marcus Nyström, Richard Andersson, Richard Dewhurst, Halszka Jarodzka, and Joost Van de Weijer. 2011. Eye tracking: A comprehensive guide to methods and measures. OUP Oxford.
  • Just and Carpenter (1980) Marcel A Just and Patricia A Carpenter. 1980. A theory of reading: From eye fixations to comprehension. Psychological review, 87(4):329.
  • Klerke et al. (2016) Sigrid Klerke, Yoav Goldberg, and Anders Søgaard. 2016. Improving sentence compression by learning to predict gaze. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1528–1533, San Diego, California. Association for Computational Linguistics.
  • Long et al. (2019) Yunfei Long, Rong Xiang, Qin Lu, Chu-Ren Huang, and Minglei Li. 2019. Improving attention model based on cognition grounded data for sentiment analysis. IEEE Transactions on Affective Computing.
  • Mathias et al. (2018) Sandeep Mathias, Diptesh Kanojia, Kevin Patel, Samarth Agrawal, Abhijit Mishra, and Pushpak Bhattacharyya. 2018. Eyes are the windows to the soul: Predicting the rating of text quality using gaze behaviour. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2352–2362, Melbourne, Australia. Association for Computational Linguistics.
  • Mishra and Bhattacharyya (2018) Abhijit Mishra and Pushpak Bhattacharyya. 2018. Cognitively Inspired Natural Language Processing: An Investigation Based on Eye-tracking. Springer.
  • Mishra et al. (2013) Abhijit Mishra, Pushpak Bhattacharyya, and Michael Carl. 2013. Automatically predicting sentence translation difficulty. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 346–351, Sofia, Bulgaria. Association for Computational Linguistics.
  • Mishra et al. (2016) Abhijit Mishra, Diptesh Kanojia, and Pushpak Bhattacharyya. 2016. Predicting readers’ sarcasm understandability by modeling gaze behavior.
  • Mishra et al. (2017) Abhijit Mishra, Diptesh Kanojia, Seema Nagar, Kuntal Dey, and Pushpak Bhattacharyya. 2017. Scanpath complexity: Modeling reading effort using gaze information.
  • Mishra et al. (2018) Abhijit Mishra, Srikanth Tamilselvam, Riddhiman Dasgupta, Seema Nagar, and Kuntal Dey. 2018. Cognition-cognizant sentiment analysis with multitask subjectivity summarization based on annotators’ gaze behavior. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Page (1966) Ellis B Page. 1966. The imminence of… grading essays by computer. The Phi Delta Kappan, 47(5):238–243.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • Shermis and Burstein (2013) Mark D Shermis and Jill Burstein. 2013. Handbook of automated essay evaluation: Current applications and new directions. Routledge.
  • Singh et al. (2016) Abhinav Deep Singh, Poojan Mehta, Samar Husain, and Rajkumar Rajakrishnan. 2016. Quantifying sentence complexity based on eye-tracking measures. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 202–212, Osaka, Japan. The COLING 2016 Organizing Committee.
  • Taghipour and Ng (2016) Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1882–1891, Austin, Texas. Association for Computational Linguistics.
  • Tay et al. (2018) Yi Tay, Minh Phan, Luu Anh Tuan, and Siu Cheung Hui. 2018. Skipflow: Incorporating neural coherence features for end-to-end automatic text scoring.
  • Zhang and Litman (2018) Haoran Zhang and Diane Litman. 2018. Co-attention based neural network for source-dependent essay scoring. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 399–409, New Orleans, Louisiana. Association for Computational Linguistics.