AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses

Deep-learning based Automatic Essay Scoring (AES) systems are being actively used by states and language testing agencies alike to evaluate millions of candidates for life-changing decisions ranging from college applications to visa approvals. However, little research has been put to understand and interpret the black-box nature of deep-learning based scoring algorithms. Previous studies indicate that scoring models can be easily fooled. In this paper, we explore the reason behind their surprising adversarial brittleness. We utilize recent advances in interpretability to find the extent to which features such as coherence, content, vocabulary, and relevance are important for automated scoring mechanisms. We use this to investigate the oversensitivity i.e., large change in output score with a little change in input essay content) and overstability i.e., little change in output scores with large changes in input essay content) of AES. Our results indicate that autoscoring models, despite getting trained as "end-to-end" models with rich contextual embeddings such as BERT, behave like bag-of-words models. A few words determine the essay score without the requirement of any context making the model largely overstable. This is in stark contrast to recent probing studies on pre-trained representation learning models, which show that rich linguistic features such as parts-of-speech and morphology are encoded by them. Further, we also find that the models have learnt dataset biases, making them oversensitive. To deal with these issues, we propose detection-based protection models that can detect oversensitivity and overstability causing samples with high accuracies. We find that our proposed models are able to detect unusual attribution patterns and flag adversarial samples successfully.



There are no comments yet.


page 26

page 27

page 29


My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism

Significant progress has been made in deep-learning based Automatic Essa...

Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers

Although various techniques have been proposed to generate adversarial s...

Automated Speech Scoring System Under The Lens: Evaluating and interpreting the linguistic cues for language proficiency

English proficiency assessments have become a necessary metric for filte...

Alzheimers Dementia Detection using Acoustic Linguistic features and Pre-Trained BERT

Alzheimers disease is a fatal progressive brain disorder that worsens wi...

Explaining AlphaGo: Interpreting Contextual Effects in Neural Networks

In this paper, we propose to disentangle and interpret contextual effect...

Automated Essay Scoring based on Two-Stage Learning

Current state-of-art feature-engineered and end-to-end Automated Essay S...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Automatic Essay Scoring (AES) systems are used in diverse settings such as to alleviate the workload of teachers, save time and costs associated with grading, and to decide admissions to universities and institutions. On average, a British teacher spends 5 hours in a calendar week scoring exams and assignments (micklewright2014teachers). This figure is even higher for developing and low-resource countries where the teacher to student ratio is dismal. While on the one hand, autograding systems effectively reduce this burden, allowing more working hours for teaching activities, on the other, there have been many complaints against these systems for not scoring the way they are supposed to (flawedAlgos; roboGrade; emptyDream; midDay; perelmanBableWebsite). For instance, on the recently released automatic scoring system for the state of Utah, students scored lower by writing question-relevant keywords but higher by including unrelated content (flawedAlgos; roboGrade). Similarly, it has been a common complaint that AES systems focus unjustifiably on obscure and difficult vocabulary (perelmanBable). While earlier, each score generated by the AI systems was verified by an expert human rater, it is concerning to see that now they are scoring a majority of essays independently without any intervention by human experts (ohioAES2). The concerns are further alleviated by the fact that the scores awarded by these systems are used in life-changing decisions ranging from college and job applications to visa approvals (greInterpret; educational2014snapshot; UtahNumbers; OhioNumbers).

Traditionally, autograding systems are built using manually crafted features used with machine learning based models

(kumar2019get). Lately, these systems have been shifting to deep learning based models (ke2019automated). For instance, many companies have started scoring candidates using deep learning based automatic scoring (sltiAIRated; trueNorthSA; duolingoEngTest; laflair2019duolingo; yu2015using; chen2018end; singla2021speakerconditioned; riordan2017investigating; pearsonWhitepaper). However, there are very few research studies on the reliability111A reliable measure is one that measures a construct consistently across time, individuals, and situations (interspeechTutorialETS) and validity222A valid measure is one that measures what it is intended to measure (interspeechTutorialETS) of ML-based AES systems. More specifically, we have tried to address the problems of robustness and validity which plague deep learning based black-box AES models. Simply measuring test set performance may mean that the model is right for the wrong reasons. Hence, much research is required to understand the scoring algorithms used by AES models and to validate them on linguistic and testing criteria. Similar opinions are expressed by madnani2018automated in their position paper on automatic scoring systems.

With this in view, in this paper, we make the following contributions towards understanding current AES systems:

1) Several research studies have shown that essay scoring models are overstable (yoon2018atypical; powers2002stumping; kumar2020calling; pham2018path). Even large changes in essay content do not lead to significant change in scores. For instance, kumar2020calling showed that even after changing 20% words of an essay, the scores do not change much. We extend this line of work by addressing why the models are overstable. Extending these studies further (§3.1), we investigate AES overstability from the perspective of context, coherence, facts, vocabulary, length, grammar and word choice. We do this by using integrated gradients (§2.2), where we find and visualize the most important words for scoring an essay (sundararajan2017axiomatic). We find that the models despite using rich contextual embeddings and deep learning architectures, are essentially behaving as bag-of-words models. Further, with our methods, we also able to improve the adversarial attack strength (§3.1.2). For example, for memory networks scoring model (zhao2017memory), we delete 40% words from essays without significantly changing score (<1%), whereas kumar2020calling observed that deleting a similar number of words resulted in a decrease of 20% scores for the same model.

2) While there has been much work on AES overstability (kumar2020calling; perelman2014state; powers2001stumping), there has been little work on AES oversensitivity. Building on this research gap, by using adversarial triggers, we find that the AES models are also oversensitive, i.e., small changes in an essay can lead to large change in scores (§3.2). We find that, by just adding 3 words in an essay containing 350 words (<1% change), we are able to change the predicted score by 50% (absolute). We explain the oversensitivity of AES systems using integrated gradients (sundararajan2017axiomatic), a principled tool to discover the importance of parts of an input. The results show that the trigger words added to an essay get unusually high attribution. Additionally, we find the trigger words have usually high co-occurence with certain score labels, thus indicating that the models are relying on spurious correlations causing them to be oversensitive (§3.2.4). We validate both the oversensitive and overstable sample in a human study (§4). We ask the annotators whether the scores given by AES models are right by providing them with both original and modified essay responses and scores.

3) While much previous research in the linguistic field studies how essay scoring systems can be fooled, for the first time, we propose models that can detect samples causing overstability and oversensitivity (pham2020out; kumar2020calling; perelman2020). Our models are able to detect both overstability and oversensitivity causing samples with high accuracies (>90% in most cases) (§5). Also, for the first time in the literature, through these solutions, we propose a simple yet effective solution for universal adversarial peturbation (§5.1). These models, apart from defending AES systems against oversensitivity and overstability causing samples, can also inform effective human intervention strategy. For instance, AES deployments either completely rely on double scoring essay samples (human and machine) or solely on machine ratings alone (etsDulingoComparison) can choose to have an effective middle ground. Using our models, AES deployments can select samples for human testing and intervention more effectively. Public school systems, e.g., in Ohio which use automatic scoring without any human interventions can select samples using these models for limited human intervention (ohioAES2; OhioNumbers). For this, we also conduct a small-scale pilot study on the AES deployment of a major language testing company proving the efficacy of the system (§5.3). Previous solutions for human interventions optimization rely on brittle features such as number of words and content modeling approaches like off-topic detection (yoon2018atypical; yoon2017combining). These models cannot detect adversarial samples like the ones we present in our work.

We perform our experiments for three model architectures and eight unique prompts333Here a prompt denotes an instance of a unique question asked to test-takers for eliciting their opinions and answers in an exam. The prompts can come from varied domains including literature, science, logic and society. The responses to a prompt indicate the creative, literary, argumentative, narrative, and scientific aptitude of candidates and are judged on a pre-determined score scale., demonstrating the results on twenty four unique model-dataset pairs. It is worth noting that our goal in this paper is not to show that AES systems can be fooled easily. Rather, our goal is to interpret how deep-learning based scoring models score essays, why they are overstable and oversensitive, and how to solve the problems of oversensitivity and overstability. We release all our code, dataset and tools for public use with the hope that it will spur testing and validation of AES models considering their huge impact on the lives of millions every year (educational2014snapshot).

2 Background

2.1 Task, Models and Dataset

We use the widely cited ASAP-AES dataset which comes from Kaggle Automated Student Assessment Prize (ASAP) for the evaluation of automatic essay scoring systems. The ASAP-AES dataset has been used for automatically scoring essay responses by many research studies (taghipour2016neural; easeGithub; tay2018skipflow). It is one of the largest publicly available datasets (Table 1). The questions covered by the dataset span many different areas such as Sciences and English. The responses were written by high school students and were subsequently double-scored.

Prompt Number 1 2 3 4 5 6 7 8
#Responses 1783 1800 1726 1772 1805 1800 1569 723
Score Range 2-12 1-6 0-3 0-3 0-4 0-4 0-30 0-60
#Avg words per response 350 350 150 150 150 150 250 650
#Avg sentences per response 23 20 6 4 7 8 12 35
Type Argumentative Argumentative RC RC RC RC Narrative Narrative
Table 1: Overview of the ASAP AES Dataset used for evaluation of AS systems. (RC = Reading Comprehension).

We test the following two state-of-the-art models in this work: SkipFlow (tay2018skipflow)

and Memory Augmented Neural Network (

MANN) (zhao2017memory)

. Further, for comparison, we design a BERT based automatic scoring model. The performance is measured using Quadratic Weighted Kappa (QWK) metric, which indicates the agreement between a model’s and the expert human rater’s scores. All models show an improvement of 4-5% over the previous models on the QWK metric. The analysis of these models, especially BERT, is interesting in light of recent studies indicating that pre-trained language models learn rich linguistic features including morphology, parts-of-speech, word-length, noun-verb agreement, coherence, and language delivery

(conneau2018you; hewitt2019structural; shah2021all). This has resulted in pushing the envelope for many NLP applications. The individual models we use are briefly explained as follows:


tay2018skipflow model essay scoring as a regression task. They utilize Glove embeddings for representing the tokens. SkipFlow captures coherence, flow and semantic relatedness over time, which the authors call neural coherence features. Due to the intelligent modelling, it gets an impressive average quadratic weighted kappa score of 0.764. SkipFlow is one of the top performing models (tay2018skipflow; ke2019automated) for AES.


zhao2017memory use memory networks for autoscoring by selecting some responses for each grade. These responses are stored in memory and then used for scoring ungraded responses. The memory component helps to characterize the various score levels similar to what a rubric does. They show an excellent agreement score of 0.78 average QWK outperforming the previous state-of-the-art models.


We also design a BERT-based architecture for scoring essays. It utilizes BERT embeddings (devlin2018bert) to represent essays by passing tokens through the BERT Encoder. The resultant embeddings from the last layer of the encoder are pooled and passed through a fully connected layer of size 1 to produce the score. The network was trained to predict the essay scores by minimizing the mean squared error loss. It achieves an average QWK score of 0.74. We also try out finetuning BERT to see its effect on downstream task (mayfield2020should). It does not result in any statistically significant difference in results. We utilize this architecture as a baseline representative of transformer-based embedding models.

2.2 Attribution Mechanism

The task of attributing a score given by an AES model , on an input essay can be formally defined as producing attributions corresponding to the words contained in the essay . The attributions produced are such that444Proposition 1 in (sundararajan2017axiomatic) , i.e., net attributions of all words () equal the assigned score (). In a way, if is a regression based model, can be thought of as the scores of each word of that essay, which sum to produce the final score, .

We use a path-based attribution method, Integrated Gradients (IGs) (sundararajan2017axiomatic), much like other interpretability mechanisms such as (ribeiro2016lime; lundberg2017shap) for getting the attributions for each of the trained models, . IGs employ the following method to find blame assignments: given an input and a baseline 555Defined as an input containing absence of cause for the output of a model; also called neutral input (shrikumar2016not; sundararajan2017axiomatic)., the integrated gradient along the dimension is defined as:


where represents the gradient of along the dimension of .

We choose the baseline as empty input (all 0s) for essay scoring models since an empty essay should get a score of 0 as per the scoring rubrics. It is the neutral input that models the absence of a cause of any score, thus getting a zero score. Since we want to see the effect of only words on the score, any additional inputs (such as memory in MANN) of the baseline is set to be that of 666

We ensure that IGs are within the acceptable error margin of <5%, where the error is calculated by the property that the attributions’ sum should be equal to the difference between the probabilities of the input and the baseline. IG parameters: Number of Repetitions = 20-50, Internal Batch Size = 20-50

. See Fig. 1 for an example.

We choose IGs over other explainability techniques since they have many desirable properties that make them useful for this task. For instance, the attributions sum to the score of an essay ( = ), they are implementation invariant, do not require any model to be retrained and are readily implementable. Previous literature such as (mudrakarta2018did) also uses Integrated Gradients for explaining the undersensitivity of factoid based question-answer (QA) models. Other interpretability mechanisms like attention require changes in the tested model and are not post-hoc, thus are not a good choice for our task.

Figure 1: Attributions for SkipFlow, MANN and BERT models respectively of an essay sample for Prompt 2. Prompt 2 asks candidates to write an essay to a newspaper reflecting their views on censorship in libraries and express their views if they believe that materials, such as books, etc., should be removed from the shelves if they are found offensive. This essay scored 3 out of 6.

3 Empirical Studies and Results

We perform our overstability (§3.1) and oversensitivity (§3.2) experiments with 100 samples per prompt for the three models discussed in Section 2.1. There are 8 prompt-level datasets in the overall ASAP-AES dataset, therefore we perform our analysis on 24 unique model-dataset pairs, each containing over 100 samples.

3.1 AES Overstability

We first present results on model overstability. Following the previous studies, we test the models’ overstability on different features important for AES scoring such as the knowledge of context (§3.1.1, 3.1.2), coherence (§3.1.3), facts (§3.1.5), vocabulary (§3.1.4), length (§3.1.2), meaning (§3.1.3), and grammar (§3.1.4). This set of features provides an exhaustive coverage of all features important for scoring essays (yan2020handbook).

3.1.1 Attribution of original samples

We take the original human-written essays from the ASAP-AES dataset and do a word-level attribution of scores. Fig. 1 shows the attributions of all models for an essay sample from Prompt 2. We observe that SkipFlow does not attribute any word after the first few lines (first 30% essay content) of the essay while MANN attributions are spread over the complete length of the essay. For BERT-based model, we see that most of the attributions are over non linguistic features (tokens) like ‘CLS’ and ‘SEP’. CLS and SEP tokens are used as delimiters in the BERT model. A similar result was also observed by kovaleva2019revealing.

For SkipFlow, we observe that if a word is negatively attributed at a certain position in an essay sample, it is then commonly negatively attributed in its other occurrences as well. For instance, books, magazines were negatively attributed in all its occurrences while materials, censored were positively attributed and library was not attributed at all. We could not find any patterns in the direction of attribution. In MANN, the same word changes its attribution sign when present in different essays. However, in a single instance of an essay, a word shows the same sign overall despite occurring in very different contexts.

Table 2 lists the top-positive, top-negative attributed words and the mostly unattributed words for all models. For MANN, we notice that the attributions are stronger for function words like, to, of, you, do, and are and lesser for content words like, shelves, libraries, and music. SkipFlow’s top attributions are mostly construct-relevant words while BERT also focuses more on stop-words.

Figure 2: Variation of QWK with iterative addition of response words for SkipFlow, MANN and BERT models. The y-axis notes the relative QWK with respect to the original QWK and the x-axis represents iterative addition of attribute-sorted response words. These results are obtained on Prompt 7, similar results were obtained for all the prompts tested.

3.1.2 Iteratively Deleting Unimportant Words

For this test, we take the original samples and iteratively delete the least attributed words (Eq. 2).


such that , where represents the essay token to be removed, b represents the baseline and represents the attribution on with respect to baseline .

Through this, we note the model’s dependence on a few words without their context. Fig. 2 presents the results. We observe that the performance (measured by QWK) for the BERT model stays within 95% of the original performance even if one of every four words was removed from the essays in the reverse order of their attribution values. The percentage of words deleted were even more for the other models. While Fig. 1 showed that MANN paid attention to the full length of the response, removing words does not seem to affect the scores much. Notably, the words removed are not contiguous but interspersed across sentences, therefore deleting the unattributed words does not produce a grammatically correct response (see Fig. 3), yet can get a similar score thus defeating the whole purpose of testing and feedback.

These findings show that there is a point after which the score flattens out, i.e.

, it does not change in that region either by adding or removing words. This is odd since adding or removing a word from a sentence typically alters its meaning and grammaticality, yet the models do not seem to be affected; they decide their scores only based on 30-50% words. As an example, a sample after deleting bottom 40% attributed words is given here: “

In the end patience rewards better than impatience. A time that I was patient was last year at cheer competition.”

Model Positively Attributed Words
MANN to, of, are, ,, children, do, ’, we
SKIPFLOW of, offensive, movies, censorship, is, our
BERT ., the, to, and, ”, was, caps, [CLS]
Model Negatively Attributed Words
MANN i, shelf, by, shelves, libraries, music, a
SKIPFLOW the, i, to, in, that, do, a, or, be
BERT i, [SEP], said, a, in, time, one
Model Mostly Unattributed Words
MANN t, you, the, think, offensive, from, my
SKIPFLOW it, be, but, their, from, dont, one, what
BERT @, ##1, and, ,, my, patient
Table 2: Top positive, negative and un-attributed words for SkipFlow, MANN and BERT-based model for Prompt 2.

3.1.3 Sentence and Word Shuffle

Coherence and organization are important features for scoring: they measure unity of different ideas in an essay and determine its cohesiveness in the narrative (barzilay2008modeling). To check the dependence of AES models on coherence, we shuffle the order of sentences and words randomly and note the change in score between the original and modified essay (Fig. 3).

We observe little change (<0.002%) in the attributions with sentence shuffle. The attributions are mostly dependent on word identities rather than their position and context for all models. We also find that shuffling sentences results in 10%, 2% and 3% difference in scores for SkipFlow, MANN, and BERT models, respectively. Even for these samples for which we observed a change in the scores, almost half of them increased their scores and the other half was reduced. The results are similar for word-level shuffling. This is surprising since change in the order of ideas in an essay alters the meaning of a prose but the models are unable to detect changes in either idea order or word-order. It indicates that despite getting trained as sentence and paragraph level models with the knowledge of language models, they have essentially become bag-of-words models.

Figure 3: Word-shuffled essay containing 40% of (top-attributed) words for SkipFlow (left), MANN (middle) and BERT (right) models respectively. The perturbed essay scores 26 (SkipFlow), 15 (MANN) and 5 (BERT) out of 30. The original essay was scored 25, 16, 4 respectively by the models.

3.1.4 Modification in Lexicon

Several previous research studies have highlighted the importance vocabulary plays in scoring and how AES models may be biased towards obscure and difficult vocabulary (perelmanBable; perelman2014state; hesse20052005; powers2002stumping; kumar2020calling). To verify their claims, we replace the top and bottom 10% attributed words with ‘similar’ words777Sampled from Glove with the distance calculated using Euclidean distance metric (pennington-etal-2014-glove).

Table 3 shows the results for this test. It can be noted that after replacing all the top and bottom 10% attributed words with their corresponding ‘similar’ words results in an average 4.2% difference in scores across all the models. These results imply that networks are surprisingly not perturbed by modifying even the most attributed words and produce equivalent results with other similarly-placed words. In addition, while replacing a word with a ‘similar’ word often changes the meaning and form of a sentence888For example, consider the replacement of the word ‘agility’ with its synonym ‘cleverness’ in the sentence ‘This exercise requires agility.’ does not produce a sentence with the same meaning., the models do not recognize that change by showing no change in their scores.

Result SkipFlow MANN BERT
Avg score difference 9.8% 2.4% 3%
% of top-20% attributed
words which change
attribution 20.3% 9.5% 34%
% of bottom-20% attributed
words which change
attribution 22.5% 26.0% 45%
Table 3: Statistics obtained after replacing the top and bottom 10% attributed words of each essay with their synonyms.
Figure 4: Attributions for SkipFlow (left), MANN (middle) and BERT (right) models of an essay sample where a false fact has been introduced at the beginning. This essay sample scores (25/30, 18/30, 22/30) by the three models respectively. The original essay (without the added lie) scored (24/30), (18/30) and (21/30) respectively.

3.1.5 Factuality, Common Sense and World Knowledge

Factuality, common sense and world knowledge are important features in scoring essays (yan2020handbook). While a human expert can readily catch a lie, it is difficult for a machine to do so. We randomly sample 100 sample essays of each prompt from the AddLies test case of (kumar2020calling). For constructing these samples, they used various online databases and appended the false information at various positions in the essay. These statements not only introduce false facts in the essay but also perturb its coherence.

A teacher who is responsible for teaching, scoring, and feedback of a student must have knowledge of world knowledge such as, ‘Sun rises in the East’, and ‘The world is not flat’. However, Fig. 4 shows that scoring models do not have the capability to check such common sense. The models tested in fact attribute positive scores to statements like the world is flat if present at the beginning. These results are in contrast with studies like (tenney2019you; zhou2020evaluating) which indicate that BERT and Glove like contextual representations have common sense and world knowledge. ettinger2020bert in their ‘negation test’ also observe similar results to us.

BABEL Semantic Garbage: Linguistic literature has also reported that unexplicably, AES models give high scores to semantic garbage like the one generated using B.S. Essay Language Generator (BABEL generator)999 (perelmanBable; perelmanBableWebsite; perelman2020). These samples are essentially semantic garbage with perfect spellings and obscure and lexically complex vocabulary. In stark contrast to (perelmanBable) and the commonly held notion that writing obscure and difficult to use words fetch more marks, we observed that the models attributed infrequent words such as forbearance, legerdemain, and propinquity negatively while common words such as establishment, celebration, and demonstration were positively scored. Therefore, our results show no evidence for the hypothesis reported by studies like (perelman2020) that writing lexically complex words make the AES systems give better scores.

3.2 AES Oversensitivity

While there has been literature for AES overstability, there is much less literature on AES oversensitivity. Therefore, next using universal adversarial triggers (wallace2019universal), we show the oversensitivity of AES models. We add a few words (adversarial triggers) to the essays and cause them to have large changes in their scores. Post that, we attribute the oversensitivity to essay words and show that trigger words have high attributions and are the ones responsible for the model oversensitivity.

Through this, we test whether an automatically generated small phrase can perform an untargeted attack against a model to increase or decrease the predicted scores irrespective of the original input. Our results show that these models are vulnerable to such attacks, with as few as three tokens increasing / decreasing the scores of of samples. Further, we show the performance of transfer attacks across prompts and find that of them transfer, thus showing that the adversaries are easily domain adaptable and transfer well across prompts101010For the consideration of space, we only report a subset of these results.. We choose to use universal adversarial triggers for this task since they are input-agnostic, consist of a small number of tokens, and since they do not require the model’s white box access for every essay sample, they have the potential of being used as “cheat-codes” where a code once extracted can be used by every test-taker. Our results show that the triggers are highly effective.

Prompt 1 4 6 7 8
Trigger Len Model = SkipFlow
3 68, 43 100, 14 86, 40 56, 81 43, 75
5 79, 38 100, 13 97, 42 65, 83 44, 78
10 85, 44 100, 18 100, 48 78, 88 55, 94
20 93, 68 100, 27 100, 58 90, 91 67, 99
Model = BERT
3 71, 53 89, 31 66, 27 55, 77 46, 61
5 77, 52 90, 33 73, 33 58, 79 49, 64
10 79, 55 91, 41 87, 48 68, 84 55, 75
20 83, 61 94, 49 95, 59 88, 89 61, 89
Model = MANN
3 67, 38 89, 15 86, 40 60, 80 41, 70
5 73, 39 93, 19 96, 42 61, 71 43, 77
10 85, 44 97, 20 99, 48 75, 84 59, 88
20 93, 63 100, 20 100, 59 84, 90 71, 94
Table 4: Single-prompt targeted attack performance results. Percentage of samples whose scores increase, Percentage of samples whose scores decrease on using triggers of length (Rows, Up to Down) on prompt (Columns, Left to Right) against Skipflow, BERT and MANN. The bold values denote the best values for each kind of attack on each prompt. (increasing, decreasing).

3.2.1 Adversarial Trigger Extraction

Following the procedure of wallace2019universal, for a given trigger length (longer triggers are more effective, while shorter triggers are more stealthy), we initialize the trigger sequence by repeating the word “the” and then iteratively replace the tokens in the trigger to minimize the loss for the target prediction over batches of examples from any prompt .

This is a linear approximation of the task loss. We update the embedding for every trigger token to minimizes the loss’ first-order Taylor approximation around the current token embedding:


where is the set of all token embeddings in the model’s vocabulary and is the average gradient of the task loss over a batch. We augment this token replacement strategy with beam search. We consider the top-k token candidates from Equation 3 for each token position in the trigger. We search left to right across the positions and score each beam using its loss on the current batch. We use small beam sizes due to computational constraints, increasing them may improve our results.

3.2.2 Experiments

We conduct two types of experiments namely Single prompt attack and Cross prompt attack.

Single prompt attack Given a prompt , response , model , size criterion , an adversary converts response to response according to Eq. 3. The criterion defines the number of words up to which the original response has to be changed by the adversarial perturbation. We try out different values of ({}).

Cross prompt attack Here the adversarial triggers obtained from a model trained on prompt are tested against other model trained on prompt (where ).

3.2.3 Results

Here, we discuss the results of experiments conducted in the previous section.

Single prompt attack We found that the triggers can increase or decrease the scores very easily, with 3 or 5-word triggers being able to fool the model more than 95% of times correctly. It results in a mean increase of 50% on many occasions. Table 4 shows the percentage of samples that increase/decrease for various prompts and trigger lengths. The success of triggers increases with the number of words as well. Fig. 5 shows a plot of predicted normalized scores before and after attack and how it impacts scores across the entire normalized score range. It shows that the triggers are successful for different prompts and models111111Other prompts had a similar performance so we have only shown a subset of results with one prompt of each type. As an example, adding the words “loyalty gratitude friendship” makes SkipFlow increase the scores of all the essays with a mean normalized increase of 0.5 (out of 1) (prompt 5) whereas adding “grandkids auditory auditory” decreases the scores 97% of the times with a mean normalized decrease of 0.3 (out of 1) (prompt 2).

Figure 5: Single prompt attack for SkipFlow, BERT, Memory-Networks (§2.1). It shows the models’ predicted scores before and after adding 10-word triggers demonstrating the oversensitivity of these models subject to adversarial triggers. The green line indicates the scores given by a model not under attack, while the blue and red lines show the performance on attempting to increase and decrease the scores using the adversarial triggers.

Cross prompt attack We also found that the triggers are able to transfer easily, with 95% of samples increasing with a mean normalized increase of 0.5 on being subjected to 3-word triggers obtained from attacking a different prompt. Fig. 5 shows a similar plot showing the success of triggers obtained from attacking prompt 5 and testing on prompt 4.

Figure 6: Cross prompt attack for 20-word triggers obtained from SkipFlow trained on prompt 5 and tested on prompt 4 showing the transferability across prompts.

3.2.4 Trigger Analysis

We find that it is easier to fool the models to increase the scores than decrease it, with a difference of about in their success (samples increased/decreased). We also observe that some of the triggers selected by our algorithm are very low frequency in the dataset and co-occur with only a few output classes (scores), thus having unusually high relative co-occurrence with certain output classes. We calculate pointwise mutual information (PMI) for such triggers and find that the most harmful triggers have the lowest PMI scores with the classes they effect the most (see Table 5).

Prompt 4 Prompt 3
Score, Grade PMI Value Score, Grade PMI Value
grass, 0 1.58 write, 0 3.28
conclution, 0 1.33 feautures, 0 3.10
adopt, 3 1.86 emotionally, 3 1.33
homesickness, 3 1.78 reservoir, 3 1.27
wich, 1 0.75 seeds, 1 0.93
power, 2 1.03 romshackle, 2 0.96
Table 5: PMI of trigger word-grade pairs for Prompt 4, 3. Other prompts also have similar results.

Further, we analyze the nature of triggers and find that a singificant portion consists of archaic or rare words such as yawing, tallet, straggly with many foreign-derived words as well (wache, bibliotheque)121212All these words were already part of the model vocabulary.. We also find that decreasing triggers are 1.5x more repetitive than increasing triggers and contain half as many adjectives as the increasing ones.

4 Human Baseline

To test how humans perform on the different interpretability tests (§3.1, §3.2), we took 50 samples from each of overstability and oversensitivity tests and asked 2 human expert raters to compare the modified essay samples with the original ones. The expert raters have more than 5 years of experience in the field of language testing. We asked them two questions: (1) whether the score should change after modification and (2) should the score increase or decrease. These questions are easier to answer and produce more objective responses than asking the raters to score responses. We also asked them to give comments behind their ratings.

For all overstability tests except lexicon modification, both raters were in perfect agreement (kappa=1.0) on the answers for the two questions asked. They recommended (1) change in scores and that (2) scores should decrease. In most of the comments for the overstability attacks, the raters wrote that they could not understand the samples after modification

131313For lexicon modification, the raters recommended the above in 78% instances with 0.85 kappa agreement.. For oversensitivity causing samples, they recommended a score decrease but by a small margin due to little change in those samples. This clearly shows that the predictions of auto-scoring models are different from expert human raters and are yet unable to achieve human-level performance despite the recent claims that autoscoring models have surpassed human level agreement (taghipour2016neural; kumar2020explainable; ke2019automated).

5 Oversentivity and Overstability Detection

Next we propose detection-based solutions for oversensitivity (§5.1) and overstability (§5.2) causing samples. Here we propose detection based defence models to protect the automatic scoring models against potentially adversarial samples. The idea is to build another predictor , such that = if has been polluted, and otherwise = . Other techniques to tackle adversaries such as adversarial training have been shown to be ineffective against AES adversaries (ding2020don; pham2020out). It is noteworthy that we do not solve the general problem of cheating or dishonesty

in exams, rather we solve the specific problem of oversensitivity and overstability adversarial attacks on AES models. Preventing cheating such as from copying from the web can be easily solved by proctoring or plagiarism checks. However, proctoring or plagiarism checks cannot solve the deep learning models’ adversarial behavior such as due to adding adversarial triggers or repetition and lexically complex tokens. It has been shown in both computer vision and natural language processing that deep-learning models inherently are adversarially brittle and protection mechanisms are required to make them secure

(zhang2020adversarial; akhtar2018threat).

There is an additional advantage of detection-based adversaries. Most AES systems validate their scores with respect to humans post-deployment (etsDulingoComparison; laflair2019duolingo). However, many deployed systems are now moving towards human-free scoring (etsDulingoComparison; ohioAES2; laflair2019duolingo; sltiAIRated; trueNorthSA). While it may have its advantages such as cost savings, but cheating in the form of overstability and oversensitivity causing samples are a major worry for both the testing companies and score users like universities and companies who rely on these testing scores (midDay; flawedAlgos; emptyDream). The detection based models provide an effective middle-ground where the humans only need to evaluate a few samples flagged by the detector models. A few studies studying this problem have been reported in the past (malinin2017incorporating; yoon2014similarity). We also do a pilot study with a major testing company using the proposed detector models in order to judge their efficacy (§5.3). Studies on the same lines but with different motives have been conducted in the past (powers2001stumping; powers2002stumping).

5.1 IG Based Oversensitive Sample Detection

Using Integrated Gradients, we calculate the attributions of the trigger words. We found that, on average (over 150 essays across all prompts), attribution to trigger words is 3 times the attribution to the words in a normal essay (see Fig.9). This gave us the motivation to detect oversensitive samples automatically.

To detect the presence of triggers () programmatically, we utilize a simple 2 layer LSTM-FC architecture.


The LSTM takes the attributions of all words () in an essay as input and predicts whether a sample is adversarial () based on attribution values and the hidden state (

). We include an equal number of trigger and non-trigger examples in the test set. In the train set, we augment the data by including a single response with different types of triggers so as to make the model learn the attribution pattern of oversensitivity causing words. We train the LSTM based classifier such that there is no overlap between the train and test triggers. Therefore, the classifier has never seen the attributions of any samples with the test-set triggers. Using this setup, we obtained an average test accuracy of 94.3% on a test set size of 600 examples per prompt. We do this testing over all the 24 unique prompt-model pairs. The results for 3 prompts (one each from argumentative, narrative, RC (see Table 

1)) over the attributions of the BERT model are tabulated in the Table 6. As a baseline, trywe take a LSTM model which takes in BERT embeddings and tries to classify the oversensitivity causing adversarial samples using the embedding of the second-last model layer, followed by a dense classification layer. Similar results are obtained for all the model-prompt pairs.

Model Prompt Accuracy F1 Precision Recall
Baseline 2 71 70 80 63
IG-based 2 90 91 84 99
Baseline 6 7490 7491 784 7099
IG-based 6 94 93 90 96
Baseline 8 60 45 68 34
IG-based 8 99 98 96 100
Table 6: Validation metrics for IG attribution-based adversarial sample detection compared with Embedding-dense classification model for 3 representative prompts
Figure 9: Attributions for SkipFlow when adversarial triggers were inserted in the beginning. The figure shows high attributions on the trigger tokens irrespective of length of triggers.

5.2 Language Entropy Based Overstable Sample Detection

For overstability detection, we use a language model to find the text entropy. In psycholinguistics, it is well known that human language has certain fixed entropy (frank2008speaking). To maximize the usage of human communication channel, the number of bits per unit second remains constant (frank2008speaking; jaeger2010redundancy). The principle of uniform information density is followed while reading, speaking, dialogues, grammar, etc (jaeger2010redundancy; frank2008speaking; jaeger2006redundancy). Therefore, semantic garbage (BABEL) or sentence shuffle and word modifications create unexpected language with high entropy. Thus, this inherent property of language can be used to detect overstability causing samples.

We use a GPT-2 language model (radford2019language) to do unsupervised language modelling on our training corpus to learn the grammar and structure of normal essays. We get the perplexity score of all essays after passing through GPT-2.


where and

are the estimated (by language model) and true probabilities of the word sequence

. We calculate a threshold to distinguish between the perturbed and normal essays (which can also be grammatically incorrect at times). For this, we use the Isolation Forest (liu2008isolation), which is a One class (OC) classification technique. Since OC classification only uses 1 type of examples to train, using only the normal essay perplexity, we can train it to detect when the perplexity is anomalous.


where is the mean value of depths a single data point, , reaches in all trees.


where = harmonic number = (Euler’s constant) and is number of points used to construct trees.

We train this IsoForest model on our training perplexities and then test it on our validation set, i.e., other normal essays, shuffled essays (§3.1.3), lexicon-modified essays (§3.1.4) and BABEL samples (§3.1.5). The contamination factor of the IsoForest is set to 1%, corresponding to the number of probable anomalies in the training data. We obtain near perfect accuracies, indicating that our language model has indeed captured the language of a normal essay. Table 7 presents results on three representative prompts (one each from argumentative, narrative, RC (see Table 1)).

Prompt Normal Essays Shuffle Synonyms BABEL
2 99.1 100 82.5 100
6 99.6 98 80 100
8 99.3 98.9 83 100
Table 7: IsoForest accuracy on normal essays, shuffled essays (§3.1.3), lexicon-modified essays (§3.1.4) and BABEL samples (§3.1.5) for three representative prompts

5.3 Pilot Study

To test how well the sample detection systems work in practice, we conduct a small-scale pilot study using essay prompts of a major language testing company. We asked 3 experts and 20 candidate test-takers to try to fool the deployed AES models. The experts had an experience of more than 15 years in the field of language testing and were highly educated (masters in language and above). The test-takers were college graduates from the population served by the company. They were duly compensated for their time according to the local market rate. We provided them with our overstability and oversensitivity tests for their reference.

The pilot study revealed that the test-takers used several strategies to try to bypass the system like using semantic garbage such as what is generated from BABEL generator, sentence and word repetitions, bad grammar, second language use, randomly inserting trigger words, trigger word repetitions, using pseudo and non-words like jabberwocky, and partial question repeats. The models reported were able to catch most of the attacks including the ones with repetitions, trigger words, pseudo and non-word usages, and semantic garbage with high accuracy (0.89 F1 with 0.92 recall scores on an average). However, bad-grammar and partial question repeats were difficult to recognize and identify (0.48 F1 score with 0.52 recall scores on an average). This is especially so since bad grammar could be indicative of both language proficiency and adversaries. While bad grammar was easily detected in semantic garbage category but it was detected with low accuracy when only a few sentences were off. Similarly, candidates often use partial question repeats to start or end answers. Therefore, it forms a construct-relevant strategy and hence cannot be rejected according to rubrics. This problem should be addressed in essay-scoring models by introducing appropriate inductive biases. We leave this task for future works.

6 Related Work

Automatic essay scoring:

Almost all the auto-scoring models are learning-based and treat the task of scoring as a supervised learning task


with a few using reinforcement learning


and semi-supervised learning

(chen2010unsupervised). While the earlier models relied on ML algorithms and hand-crafted rules (page1966imminence; faulkner2014automated; kumar2019get; persing2010modeling), lately the systems are shifting to deep learning algorithms (taghipour2016neural; grover2020multi; dong2016automatic). Approaches have ranged from finding the hierarchical structure of documents (dong2016automatic), using attention over words (dong2017attention), and modelling coherence (tay2018skipflow). In this paper, we interpret and test the recent state-of-the-art scoring models which have shown the best performance on public datasets (tay2018skipflow; zhao2017memory).

AES testing and validation:

Although automatic scoring has seen much work in recent years, model validation and testing still lag in the ML field with only a few contemporary works (kumar2020calling; pham2020out; yoon2014similarity; malinin2017incorporating). kumar2020calling and pham2020out show that AES systems are adversarially unsecure. pham2020out also try adversarial training and obtain no significant improvements. yoon2014similarity and malinin2017incorporating model uncertainty in automatic scoring systems. Most of the scoring model validation work is in the language testing field, which unfortunately has limited AI-expertise (litman2018speech). Due to this, studies have noted that the results there are often conflicting in nature (powers2001stumping; powers2002stumping; bejar2013length; bejar2014vulnerability; perelman2020).

7 Conclusion and Future Work

Automatic scoring, one of the first tasks to be automated using AI (whitlock1964automatic), is now shifting to black box neural-network based automated systems. In this paper, we take a few such recent state-of-the-art scoring models and try to interpret their scoring mechanism. We test the models on various features considered important for scoring such as coherence, factuality, content, relevance, sufficiency, logic, etc and explain the models’ predictions. We find that the models do not see an essay as a unitary piece of coherent text but as a bag-of-words. We find out why essay scoring models are both oversensitive and overstable and propose detection based protection models against such attacks. Through this, we also propose an effective defence against the recently introduced universal adversarial attacks.

Apart from contributing to the discussion of finding effective testing strategies, we hope that our exploratory study initiates further discussion about better modeling automatic scoring and testing systems especially in a sensitive area like essay grading.

=0mu plus 1mu


8 Full Version of Abridged Main Paper Figures

The essays with their sentences shuffled are displayed in Figure. 5.

The full-size attribution images are given in Figures 11 to 14.

Figure 10: Attributions for SkipFlow and MANN respectively of an essay sample where all the sentences have been randomly shuffled. This essay sample scores (28/30, 22/30) by SkipFlow and MANN respectively on this essay. The original essay also scored (28/30) and (22/30) respectively.
Figure 11: Full-Sized Attributions for SkipFlow, MANN and BERT respectively of an essay sample where all the sentences have been randomly shuffled.
Figure 12: Full-Sized Attributions for SkipFlow, MANN and BERT respectively of an BABEL essay sample.
Figure 13: Full-Sized Attributions for SkipFlow, MANN and BERT respectively of an essay sample where all the sentences have an added false fact.
Figure 14: Full-Sized Attributions for SkipFlow, MANN and BERT respectively of a real essay sample.

9 Statistics: Iterative Addition of Words

The results are given in table 8.

80 3.5/1.1/0.002 0.43/0.09/0.05 65/31/0.63 8.9/2.88/91.6 5.1/2/0.06
60 4/0.37/0.001 1.01/1.4/0.14 60/9.2/0.3 17/39.1/99 6.7/2.6/0.14
40 3.1/0.07/0 3.7/5.8/0.23 36/2.24/0 44/88.4/99.6 9.24/6.5/0.24
20 2.09/0.02/0.002 14.7/13.7/0.31 15.6/0.6/0.63 78.5/94.5/99.3 19.5/14.5/0.32
0 61/0/0 0/20/0.52 0/0/0 100/94.5/100 62/22.3/0.5
Table 8: Statistics for iterative addition of the most-attributed words on Prompt 7. Legend (kumar2020calling): {%: % words added to form a response, : Mean difference of positively impacted samples (as % of score range), : Mean difference of negatively impacted samples (as % of score range), : Percentage of positively impacted samples, : Percentage of negatively impacted samples,

Standard deviation of the difference (as % of score range)}

10 Statistics: Iterative Removal of Words

0 0/0/0 0/0/0 0/0/0 0/0/0
20 0/0/0.04 11/1/5 0/.3/1.27 96.1/32/88.4
60 0/0/0.01 26/8/14.8 1.2/0/0.3 97.7/94.5/99.3
80 0.5/0/0 29.9/15/22 5.4/0/0 92.9/94.5/100
Table 9: Statistics for iterative removal of least attributed words on Prompt 7. Legend (kumar2020calling): {%: % words removed from a response, : Mean difference of positively impacted samples (as % of score range), : Mean difference of negatively impacted samples (as % of score range), : Percentage of positively impacted samples, : Percentage of negatively impacted samples }

11 BERT-model Hyperparameters

HyperParameter Value
Optimizer Adam
Learning Rate 2e-5
Batch Size 8
Epochs 5-10 based on Early Stopping
Loss Mean Squared Error
Table 10: Bert Model Hyperparameters and Architecture