
Calling Out Bluff: Attacking the Robustness of Automatic Scoring Systems with Simple Adversarial Testing

Significant progress has been made in deep-learning based Automatic Essay Scoring (AES) systems over the past two decades. Performance, as commonly measured by standard metrics such as Quadratic Weighted Kappa (QWK) and accuracy, points to the same. However, testing these AES systems on common-sense adversarial examples reveals their lack of natural language understanding capability. Inspired by common student behaviour during examinations, we propose a task-agnostic adversarial evaluation scheme for AES systems to test their natural language understanding capabilities and overall robustness.





1 Introduction

Automated Essay Scoring (AES) uses computer programs to automatically characterize the performance of examinees on standardized tests involving writing prose. The earliest mention of scoring as a scientific study dates back to the nineteenth century [spolsky1995measured], and automatic scoring, specifically, to the 1960s [whitlock1964automatic]. The field started in the 1960s with Ajay, Page and Tillet [ajay1973analysis] scoring the essays of their students on punch cards. Each essay was converted to a set of features which were passed through a linear regression model to produce a score. Since then, the field has undergone major changes: punch cards have given way to microphones and keyboards, and linear regression on manually extracted features to deep neural networks. However, over the years, the interpretability of these systems has gone down while the evaluation methodologies (i.e., accuracy and kappa measurement) have largely remained the same. While earlier methods relied on feature engineering, today model designers rely on neural networks to automatically extract scoring patterns from the dataset.

The performance metric most widely used in the field is Quadratic Weighted Kappa (QWK). It measures the agreement between the scoring model and a human expert. According to this metric, automatic essay scoring models have, over time, reached the level of humans [kumar2019get] or even 'surpassed' them [shermis2012contrasting]. However, as our experiments show, despite achieving parity with humans on QWK, models do not score in the same manner as humans do. We demonstrate later in the paper that heavily modifying responses, or even adding false information to them, does not break the scoring systems: the models maintain their high confidence and scores while evaluating the adversarial responses.

In this work, we propose an adversarial evaluation of AES systems. We demonstrate the evaluation scheme on the Automated Student Assessment Prize (ASAP) dataset for essay scoring [ASAP-AES]. Our evaluation scheme consists of evaluating AES systems on inputs which are derived from the original responses but modified heavily enough to change their original meaning. These tests are mostly designed to check for the overstability of the different models. An overview of the adversarial scheme is given in Table 1. We try out the following operations for generating responses: Addition, Deletion, Modification and Generation. Under these four operations, we include many operation subtypes, such as adding Wikipedia lines, modifying the grammar of the response, taking only the first part of the response, etc. As the human evaluation results show (Section 4.1), when these adversarial responses are shown to humans, they perceive the responses as ill-formed, lacking coherence and logic. However, our results demonstrate that no published model is robust to these examples: the models largely maintain the score of the unmodified original response even after all the adversarial modifications. This indicates that the models are largely overstable and unable to distinguish ill-formed examples from well-formed ones. While, on average, humans reduce their score by approximately 3-4 points (on a normalized 1-10 scale), the models are highly overstable and either increase the score for some tests or reduce it for others by only 0-2 points.

We also argue that for deep learning based systems, tracking merely QWK as an evaluation metric is suboptimal for several reasons. 1) While subsequent research papers show iterative improvements in QWK, most of them fail to evaluate how their work generalizes across the different dimensions of scoring, including coherence, cohesion, vocabulary, and even surface metrics like average sentence length, word difficulty, etc. 2) QWK as a metric captures only the overall agreement with human scores; scoring as a science, however, draws on many areas of NLP, such as fact-checking, discourse and coherence, coreference resolution, grammar, and content coverage. A neural network normally tries to learn all of them in one go, which, as our results demonstrate, it is probably not able to do. Therefore QWK, instead of taking the field in the right direction, abstracts away all the details associated with scoring as a task. 3) It does not indicate the failure direction of a machine learning model: oversensitivity or overstability. We quantitatively illustrate the gravity of all these aspects by performing statistical and manual evaluations. We propose that, instead of tracking just QWK to evaluate a model, the field should track QWK for performance and an adversarial evaluation of the models for robustness.

We would also like to acknowledge that we are not the only researchers to notice the lack of comprehensiveness of scoring with neural networks. Many researchers before us have shown that AES models are either easily fooled or pay attention to the wrong features while scoring. [perelman2014state] argued that the then state-of-the-art systems showed a heavy correlation with just the number of words. In [reinertsen2018can], the authors observe that the Australian eWrite AES system rejected writing that did not match the style of its training samples, which is undesirable for broad-based systems like AS systems. In [west2018trustworthy], the authors note that there is no systematic framework for evaluating a model's fitness for learning purposes in either academic or industry applications. This leads to a lack of trust in high-stakes processes such as AES evaluation in circumstances where high skepticism is already commonplace. Similarly, Perelman designed the Basic Automatic B.S. Essay Language Generator (BABEL) [perelmanBable] to show that state-of-the-art AI systems can be fooled by crudely written prose as well [perelmanBableWebsite]. We also use BABEL for one of the tests in our framework (namely BabelGen, given in Section 3.4).

Finally, we present our argument not as a criticism of anyone, but as an effort to refocus the research directions of the field. Since the automated systems that we develop as a community have such high stakes, the research should reflect the same rigor. We sincerely hope to inspire higher-quality reporting of results in the automated scoring community that tracks not just the performance but also the validity of models.

# Category Test Name Description
1 Add AddWikiRelated Addition of Wikipedia lines related to the essay question into a response.
AddWikiUnrelated Addition of Wikipedia lines unrelated to the essay question into a response.
RepeatSent Repetition of some lines of the response within the response.
AddSong Addition of song lyrics into the response.
AddSpeech Addition of excerpts of speeches of popular leaders into a response.
AddRC Addition of lines from reading-comprehension questions into a response.
AddTruth Addition of true lines into a response.
AddLies Addition of universally false lines into a response.
2 Delete DelStart Deletion of lines from the beginning of a response.
DelEnd Deletion of lines from the end of a response.
DelRand Deletion of random lines from a response.
3 Modify ModGrammar Modifying the sentences in a response to have incorrect grammar.
ModFluency Inducing disfluency in the sentences of a response.
ModLexicon Paraphrasing words in the sentences of a response with their respective synonyms.
ShuffleSent Randomly shuffling the sentences in a response.
4 Generate BabelGen Using the essay generated by BABEL as a response.
Table 1: Overview of the testing scheme for Automatic Scoring (AS) models

2 Task and Setup

2.1 Task

We used the widely cited ASAP-AES dataset [ASAP-AES] for the evaluation of Automatic Essay Scoring systems. ASAP-AES has been used for automatically scoring essay responses in many research studies [taghipour2016neural, easeGithub, tay2018skipflow, zhao2017memory]. It is one of the largest publicly available datasets of its kind. The relevant statistics for ASAP-AES are listed in Table 2. The questions covered by the dataset come from many different areas, such as the sciences and English. The responses were written by high school students and were subsequently double-scored.

Prompt Number 1 2 3 4 5 6 7 8
#Responses 1783 1800 1726 1772 1805 1800 1569 723
Score Range 2-12 1-6 0-3 0-3 0-4 0-4 0-30 0-60
#Avg words per response 350 350 150 150 150 150 250 650
#Avg sentences per response 23 20 6 4 7 8 12 35
Type Argumentative Argumentative RC RC RC RC Narrative Narrative
Table 2: Overview of the ASAP AES Dataset used for evaluation of AS systems. (RC = Reading Comprehension)

2.2 Models

We evaluate recent state-of-the-art deep learning and feature-engineering models. We show the adversarial-evaluation results for five such models: [easeGithub, taghipour2016neural, tay2018skipflow, zhao2017memory, liu2019automated].

EASE is an open-source feature-engineering model maintained by EdX [easeGithub]. The model is based on many features such as tags, prompt-word overlap, n-gram based features, etc. Originally, it ranked third among the 154 participating teams in the ASAP-AES competition.

[taghipour2016neural] uses CNN-LSTM based neural networks with mean-over-time layers. They report a 5.6% improvement in QWK over the EASE feature-engineering model.

SkipFlow [tay2018skipflow] provides another deep learning architecture that is said to improve on vanilla neural networks. The authors claim that SkipFlow captures coherence, flow and semantic relatedness over time, which they call neural coherence features. They also note that essays, being long sequences, are difficult for a model to capture; for this reason, SkipFlow accesses the intermediate states of the network. By doing this, it shows an increase of 6% over the EASE feature-engineering model and 10% over a vanilla LSTM model.

[zhao2017memory] use memory networks for AS, where they select some responses for each grade. These responses are stored in memory and then used to score ungraded responses. The memory component helps characterize the various score levels, similar to what a rubric does. They compare their results with the EASE-based model and show better performance on 7 out of 8 prompts.

[liu2019automated] is a recent work in which the authors claim to improve performance on adversarial responses to AS systems. They achieve this by including some adversarially generated samples in the training data of the model. They consider two types of adversarial evaluation: well-written permuted paragraphs and prompt-irrelevant essays (both of these evaluation criteria were explained in [kumar2019get]). For these, they develop a two-stage learning framework in which they calculate semantic, coherence and prompt-relevance scores and concatenate them with engineered features. The paper uses advanced contextual embeddings, viz. BERT [devlin2018bert], for extracting sentence embeddings. However, the authors do not provide any analysis disambiguating the performance gain due to BERT from that of the other techniques they apply. We use their model to show how much models dependent on even advanced embeddings like BERT can learn about coherence, relevance to the prompt, relatedness, and the other properties which our test framework captures.

2.3 Standard Evaluation

Both the original competition in which the dataset was released and the papers referenced use Quadratic Weighted Kappa (QWK) as the evaluation metric. Given the observed score matrix $O$, weight matrix $W$, expected score matrix $E$ and number of possible scores $N$, QWK is calculated as

$$\kappa = 1 - \frac{\sum_{i,j} W_{i,j}\, O_{i,j}}{\sum_{i,j} W_{i,j}\, E_{i,j}}$$

$O_{i,j}$ measures the number of students who received score $i$ from the first grader and score $j$ from the second one. The weight matrix is defined as $W_{i,j} = \frac{(i-j)^2}{(N-1)^2}$, and $E$ is the outer product of the two graders' score histograms, normalized so that $E$ and $O$ have the same sum. The value obtained by comparing human and machine scores is then compared with the value calculated between two human graders. A system is considered better if the machine-human agreement (QWK) is as close as possible to the human-human agreement.
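The QWK computation described above can be sketched in Python. The function and variable names are ours, but the formula follows the standard QWK definition stated in the surrounding text:

```python
def quadratic_weighted_kappa(scores_a, scores_b, min_score, max_score):
    """QWK between two graders; scores are integers in [min_score, max_score]."""
    n = max_score - min_score + 1
    total = len(scores_a)
    # Observed matrix O: O[i][j] counts essays scored i by grader A and j by B.
    O = [[0.0] * n for _ in range(n)]
    for a, b in zip(scores_a, scores_b):
        O[a - min_score][b - min_score] += 1
    # Marginal histograms and expected matrix E (outer product, same sum as O).
    hist_a = [sum(row) for row in O]
    hist_b = [sum(O[i][j] for i in range(n)) for j in range(n)]
    E = [[hist_a[i] * hist_b[j] / total for j in range(n)] for i in range(n)]
    # Quadratic weights penalize disagreements by squared score distance.
    W = [[(i - j) ** 2 / (n - 1) ** 2 for j in range(n)] for i in range(n)]
    num = sum(W[i][j] * O[i][j] for i in range(n) for j in range(n))
    den = sum(W[i][j] * E[i][j] for i in range(n) for j in range(n))
    return 1.0 - num / den
```

Perfect agreement yields a kappa of 1, chance-level agreement 0, and systematic disagreement a negative value.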

3 Adversarial Evaluation

3.1 General Framework

Given a prompt $p$, a response $r$, a bounded size criterion $c_s$, a position criterion $c_p$ and optionally a model $M$, an adversary converts response $r$ to response $r'$ based on a specific set of rules and the criteria $c_s$ and $c_p$. The criterion $c_s$ defines the percentage up to which the original response may be changed by the adversarial perturbation. We try out different values of $c_s$ ({10, 15, 20, 25}). The criterion $c_p$ defines the position at which the adversarial perturbation is induced. We consider three positions (beginning, middle, end) by dividing the response into three equal-sized portions. For considerations of space, we only report a subset of these results. A complete listing of all the results is provided in the supplementary.
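The framework can be illustrated with a short sketch. The function name, the sentence-level granularity, and the choice of a contiguous window inside the selected third are our assumptions for illustration, not details fixed by the paper:

```python
def perturb(sentences, adversary, size_pct, position):
    """Apply `adversary` (a sentence -> sentence function) to at most
    size_pct percent of the sentences, restricted to one third of the
    response: position 0 = beginning, 1 = middle, 2 = end."""
    n = len(sentences)
    third = max(1, n // 3)
    start = position * third
    end = n if position == 2 else start + third   # last third absorbs leftovers
    budget = max(1, round(n * size_pct / 100))    # size criterion in sentences
    out = list(sentences)
    for i in range(start, min(end, start + budget)):
        out[i] = adversary(out[i])
    return out
```

Any of the concrete test cases below (grammar modification, filler insertion, etc.) can then be passed in as the `adversary` argument.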

For the model $M$, using the original scores $s_i$ and adversarial scores $s'_i$, we calculate the following statistics: the number of negatively impacted samples ($N^-$ = number of samples s.t. $s'_i < s_i$), the number of positively impacted samples ($N^+$ = number of samples s.t. $s'_i > s_i$), the mean difference ($\mu = \mathrm{mean}(s'_i - s_i)$), the absolute mean difference ($\mathrm{mean}(|s'_i - s_i|)$), the standard deviation of the difference ($\sigma$), the mean difference of negatively impacted samples ($\mu^-$, over samples s.t. $s'_i < s_i$) and the mean difference of positively impacted samples ($\mu^+$, over samples s.t. $s'_i > s_i$). Since the score ranges and the number of samples vary across prompts, we report the corresponding values in percentages (percentage of total samples and percentage of the score range). For the un-normalized values, readers are encouraged to look into the supplementary.
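These per-test-case statistics are straightforward to compute from paired score lists; a minimal sketch (dictionary keys are our own naming):

```python
def adversarial_stats(orig, adv):
    """Summary statistics for the score differences d_i = adv_i - orig_i."""
    diffs = [a - o for o, a in zip(orig, adv)]
    n = len(diffs)
    neg = [d for d in diffs if d < 0]   # negatively impacted samples
    pos = [d for d in diffs if d > 0]   # positively impacted samples
    mean = sum(diffs) / n
    std = (sum((d - mean) ** 2 for d in diffs) / n) ** 0.5
    return {
        "n_negative": len(neg),
        "n_positive": len(pos),
        "mean_diff": mean,
        "abs_mean_diff": sum(abs(d) for d in diffs) / n,
        "std_diff": std,
        "mean_neg_diff": sum(neg) / len(neg) if neg else 0.0,
        "mean_pos_diff": sum(pos) / len(pos) if pos else 0.0,
    }
```

Normalizing the counts by the number of samples and the mean differences by the prompt's score range yields the percentage figures reported in the tables.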

Using human evaluation (Section 4.1) and the relevant statistics (Section 4), we make sure that an adversary satisfies the following two conditions. First, according to a human, the score of an adversarial response $r'$ should always be less than the score of the original response $r$. In other words, no adversary should increase the quality of the response. Second, a human should be able to detect and differentiate $r'$ from $r$. Notably, these requirements differ from what is "commonly" assumed in the adversarial literature, where the adversarial response is formed such that a human is not able to detect any difference between the original and modified responses but a model (due to its adversarial weakness) detects differences and thus changes its output. For example, in computer vision a few pixels are modified to make a model mispredict a bus as an ostrich [szegedy2013intriguing], and in NLP, paraphrasing is used to churn out racial and hateful slurs from a generative deep learning model [wallace2019universal]. Here, we make sure that humans detect the difference between the original and final response and then evaluate the model's capability to detect and differentiate between them. We call the inability (or under-performance) of models on this task their overstability.

Next, we discuss the various strategies of adversarial perturbations. We also categorize them as majorly impacting syntax, majorly impacting semantics and generative adversaries. An overview of all the perturbations is given in Table 1.

3.2 Majorly Syntax-Modifying Adversaries

Syntax-modifying adversaries are perturbations that modify an example such that its original meaning (i.e., semantics) is largely retained while the syntax of a sentence unit is modified. These are mostly of the type Modify: word/sentence tokens in the prose are neither deleted nor added; existing tokens are only modified.


Original Anita is going to the park for a walk.
Subject-verb-object Errors Anita to the park is going for a walk.
Article Errors Anita is going to an park for the walk.
Subject Verb Agreement Errors Anita go to an park for the walk.
Conventional Errors anita go 2 an park 4 the walk
Table 3: Some examples of the type ModGrammar

ModGrammar. We formed two test cases to simulate common grammatical errors committed by students. The first one focuses on changing the subject-verb-object (SVO) triplet of a sentence: a triplet of this form is selected from each sentence and jumbled up. In the second test case, we first induce article errors by replacing the articles of a sentence with common incorrect forms. Then we alter the subject-verb agreement of that sentence. Following that, we replace a few selected words with their corresponding informal conventions and generic slang. A few examples of this type of perturbation are given in Table 3.


ModFluency. This test case simulates the involuntary disruptions that frequently occur in the flow of spoken speech. We induce disfluencies and filler words in the text to model this test case [zayats2016disfluency]. To introduce disfluency, we repeat a few words at the start of each alternate sentence in a paragraph. For example, "I like apples" becomes "I … I like apples". To introduce filler words, we make a list of common filler words (such as "huh", "uh", "erm", "um", "well", "so", "like", "hmm", etc.) and introduce them randomly in sentences according to the size criterion. For example, "I would like to tell you a story!" is changed to "Well…I would like to tell you hmm…a story!"
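A minimal sketch of the two ModFluency operations; the function names and the exact insertion policy (one filler per call, first word repeated) are our illustrative choices:

```python
import random

FILLERS = ["huh", "uh", "erm", "um", "well", "so", "like", "hmm"]

def add_disfluency(sentence):
    """Repeat the first word: 'I like apples' -> 'I ... I like apples'."""
    first = sentence.split()[0]
    return f"{first} ... {sentence}"

def add_filler(sentence, rng):
    """Insert one random filler word at a random word boundary."""
    words = sentence.split()
    pos = rng.randrange(len(words) + 1)
    words.insert(pos, rng.choice(FILLERS) + "...")
    return " ".join(words)
```

Passing a seeded `random.Random` keeps the perturbations reproducible across runs.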


ModLexicon. In this test case, we use WordNet synsets [miller1995wordnet] to replace one word (excluding stop words) randomly in each sentence of the response with a synonym. The motivation is to understand how the scores given by state-of-the-art AES models vary under synonymy relations. For example, "Tom was a happy man. He lived a simple life." is changed to "Tom was a cheerful man. He lived an elementary life."
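The substitution step can be sketched as follows. The paper uses WordNet synsets for the lookup; here a tiny hand-made dictionary stands in for that lookup so the sketch stays self-contained, and the stop-word list is illustrative only:

```python
import random

# Toy stand-in for a WordNet synset lookup; entries are illustrative only.
SYNONYMS = {"happy": ["cheerful", "glad"], "simple": ["elementary", "plain"]}
STOP_WORDS = {"a", "an", "the", "he", "was", "is"}

def substitute_one_word(sentence, rng):
    """Replace one random non-stop-word that has a known synonym."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words)
                  if w.lower() in SYNONYMS and w.lower() not in STOP_WORDS]
    if not candidates:
        return sentence  # nothing substitutable in this sentence
    i = rng.choice(candidates)
    words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)
```

In the real test case the dictionary lookup would be replaced by a query against WordNet (e.g. via nltk), applied once per sentence of the response.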


ShuffleSent. We randomly shuffle all the sentences of the response. This degrades the readability and coherence of the response; moreover, the transitions between lines are lost. Hence, the response appears disconnected to the reader.

3.3 Semantics Modifying Adversaries

Semantics-modifying adversaries are perturbations which try to modify the meaning of the prose, either at the sentence level or at the level of the overall prose. Through this, we mainly disturb the coherence, relatedness, specificity and readability of a response. We do this by three methods. First, we add lines to the response which change the meaning of the original response and disturb its continuity. Second, we delete sentences from the original response, which again impacts its readability and completeness. Third, we modify the original response by replacing a few words with unrelated words, so that the sentence loses its meaning altogether. While applying all these perturbations, we respect the two constraints, size and position, mentioned in Section 3.1.


AddWikiRelated. We formed a list of important topics for each prompt using key-phrase extraction. The Wikipedia articles for each of these topics were extracted and sentence-tokenized. Then some sentences were randomly selected from these articles and added to each response.


AddWikiUnrelated. For this test case, we took Wikipedia entries which did not occur in the previous list and performed the same procedure as above. Using these perturbations, we wanted to see how AES scores vary with sentences from related and unrelated domains.


AddSong. We used [songsKaggle4, songsKaggle3, songsKaggle2, songsKaggle5, songsKaggle1] to extract 58,000 English song lyrics from a range of years and genres such as Rock, Jazz and Pop. These lyrics are then appended to the responses according to the size and position constraints.


AddSpeech. We collected eight speeches of popular leaders such as Donald Trump, Barack Obama, Hillary Clinton and Queen Elizabeth II. Randomly picked sentences from this speech corpus are then added to the responses. The collected speeches with their sources are given in the supplementary.


AddRC. For reading-comprehension based prompts (refer to Table 2), we randomly pick sentences from the corresponding reading comprehension passage and add them to the responses.


AddTruth. We acquired a list of facts from [trueFacts]. The motivation behind this test case is the general tendency of students to add compelling relevant or irrelevant facts to their responses to make them longer and more informative, especially in argumentative essays.


AddLies. We designed this test case to evaluate whether current Automatic Scoring systems are able to identify false claims or false facts in student responses. We collected various false facts and added them to the responses according to the constraints mentioned above. Through our experiments, we demonstrate that the rubrics for automatic scoring engines focus entirely on organization, writing skills, etc., ignoring the possibility of bluffing with such false statements. This outcome challenges the robustness of AS systems and encourages further research in this direction to make the scoring systems more secure.


RepeatSent. Students intentionally tend to repeat sentences or specific keywords in their responses in order to make them longer, yet not out of context, and to fashion cohesive paragraphs. Such repetition highlights the limited vocabulary of the writer, or meagre knowledge and ideas about the main subject. To simulate this form of bluff, we took three different approaches. First, one or two sentences from the introduction are repeated at the end of the response. Second, one or two sentences from the conclusion are repeated at the beginning of the response. Third, one to three sentences are repeated in the middle of the response.


DelStart. To analyze the performance of state-of-the-art e-raters under highly constrained conditions, we tested the AS systems with just the first line of the response, followed by the first two lines, the first three lines, and so on. This also helped us analyze the overall trend of how scoring is affected as the response is slowly built up into a complete response.


DelEnd. Similar to the above test case, we repeated this scenario with the last line, the last two lines, the last three lines, and so on, to examine the scoring trend in each case.


DelRand. We removed a fixed percentage of sentences randomly from the response in order to disrupt its coherence. This also reduces the length of the response. This test case models responses that may occur when a student tries to cheat in an examination by replicating sentences from some other source. We study whether AS systems detect the resulting gaps in the ideas presented in the response.
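The three deletion tests are simple list operations on the sentence-tokenized response; a minimal sketch (function names are ours):

```python
import random

def del_start(sentences, k):
    """DelStart: drop the first k sentences."""
    return sentences[k:]

def del_end(sentences, k):
    """DelEnd: drop the last k sentences."""
    return sentences[:len(sentences) - k]

def del_rand(sentences, pct, rng):
    """DelRand: drop a fixed percentage of sentences at random, keeping order."""
    n_drop = max(1, round(len(sentences) * pct / 100))
    drop = set(rng.sample(range(len(sentences)), n_drop))
    return [s for i, s in enumerate(sentences) if i not in drop]
```

Iterating `del_start`/`del_end` over increasing `k` reproduces the "first line, first two lines, …" sweep described above.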

3.4 Generative Adversaries

M/P 1 2 3 4 5 6 7 8
Range 2-12 1-6 0-3 0-3 0-4 0-4 0-30 0-60
1 7.1 2.5 1.7 1.1 2.2 1.2 13.8 33.9
2 10 4.4 2 2 3 1.2 19.1 43.1
3 6 2 1.1 0.9 1.3 1.3 12.1 21.9
4 8.4 4 3 3 4 3.9 18.4 40.1
5 10.8 5.6 2.8 2.9 3.8 3.8 26.2 53
Table 4: Scores for BabelGen over all the prompts and models. Ideally, all of the Babel generated essays should have been scored a zero. Legend: M: Model (y-axis), P: Prompt (x-axis), Model Types: 1: LSTM-MoT [taghipour2016neural], 2: EASE [easeGithub], 3: SkipFlow [tay2018skipflow], 4: Memory Networks [zhao2017memory], 5: Adversarial Evaluation + BERT [liu2019automated].


BabelGen. These adversarial samples are completely false samples generated using Les Perelman's B.S. Essay Language Generator (BABEL) [perelmanBable]. BABEL requires a user to enter three keywords, based on which it generates an incoherent, meaningless sample containing a concoction of obscure words and the given keywords pasted together. In 2014, Perelman showed that ETS' e-rater, which is used to grade Graduate Record Exam (GRE) essays (the GRE is a widely popular exam accepted as a standard admission requirement by a majority of graduate schools, and is also used for pre-job screening by a number of companies; Educational Testing Services (ETS) owns and operates it), consistently scored such essays 5-6 on a 1-6 point scale [perelmanBableWebsite, washingtonBabel]. This motivated us to try the same approach with the current state-of-the-art deep learning approaches. We came up with a list of keywords based on the AES questions (the list is presented along with the code in the supplementary). To generate a response, we chose 3 keywords related to a question and gave them as input to BABEL, which then produced a generated adversarial example.

4 Results and Discussion

Figure 1: Adversarial samples of the types AddTruth, RepeatSent, AddSong, AddSpeech, BabelGen. The original and final scores of the different models are: (Prompt 5) {1:(3,2), 2:(2,3), 3:(2,2), 4:(2,1), 5:(3,3)}, (Prompt 7) {1:(21,19), 2:(18,20), 3:(18,22), 4:(13,14), 5:(17,16)}, (Prompt 6) {1:(4,4), 2:(2,3), 3:(3,4), 4:(2,2), 5:(2,2)}, (Prompt 1) {1:(9,10), 2:(7,8), 3:(5,6), 4:(10,11), 5:(6,6)}. The model numbering is the same as given in Table 5.
M/P 1 2 3 4 5 6 7 8
1 (4,21,2,6) (4,19,2,9) (8,27,12,10) (10,28,10,10) (8,35,3,11) (8,52,1,15) (5,19,3,8) (2,10,4,5)
2 (4,3,64,14) (5,3,68,13) (9,3,44,19) (29,22,45,23) (21,7,94,22) (21,7,97,20) (4,3,68,14) (4,3,59,7)
3 (7,5,5,12) (2,2,57,7) (21,9,96,7) (15,24,13,11) (8,4,31,14) (7,5,38,19) (6,5,46,16) (3,2,14,5)
4 (18,23,19,14) (21,20,30,14) (47,29,43,19) (60,35,52,37) (39,22,53,23) (38,16,62,23) (11,17,39,14) (6,5,51,6)
5 (15,14,46,14) (22,14,82,16) (10,35,7,14) (12,34,5,14) (12,30,13,14) (14,28,13,13) (11,22,23,16) (21,12,77,17)
1 (4,21,1,7) (4,19,3,8) (7,26,14,9) (9,28,12,10) (7,35,2,11) (6,53,1,15) (6,19,4,8) (2,10,6,5)
2 (4,3,16,15) (5,4,41,13) (7,3,27,18) (28,23,37,23) (19,6,91,21) (19,4,94,21) (3,3,22,14) (4,3,50,8)
3 (8,4,7,12) (2,2,52,7) (21,7,95,7) (13,24,14,11) (6,4,26,14) (6,5,40,19) (7,5,40,16) (3,2,18,5)
4 (19,23,17,15) (21,21,28,15) (46,30,43,17) (60,35,49,34) (38,23,53,22) (38,17,59,23) (10,18,34,14) (6,5,46,6)
5 (16,14,41,14) (22,14,77,16) (8,35,7,14) (12,35,3,14) (12,30,12,14) (14,29,8,13) (11,22,18,16) (22,12,71,17)
1 (4,22,2,10) (4,18,5,8) (9,27,19,13) (10,28,13,13) (8,35,6,16) (9,51,2,14) (5,19,6,8) (2,10,8,6)
2 (4,3,86,14) (5,3,74,13) (9,3,82,19) (29,22,63,23) (22,6,100,22) (22,7,98,20) (4,3,92,14) (4,3,44,7)
3 (8,5,27,12) (2,2,33,7) (21,8,94,7) (15,24,36,11) (9,4,50,14) (8,5,38,19) (6,5,58,16) (3,2,22,5)
4 (19,23,33,14) (21,20,39,12) (47,29,56,27) (60,34,51,40) (39,21,64,25) (38,16,63,23) (11,17,50,14) (6,5,65,5)
5 (15,14,61,14) (22,14,89,16) (10,35,16,14) (12,34,10,14) (12,30,22,14) (14,28,17,13) (11,22,32,16) (20,12,89,17)
Table 5: Results for the AddLies, DelRand and AddSong test cases over all the prompts and models. Legend: First sub-table: AddLies added at the end of the response; Second sub-table: DelRand; Third sub-table: AddSong with injections at the beginning of the response. M: Model (y-axis), P: Prompt (x-axis), Model Types: 1: LSTM-MoT [taghipour2016neural], 2: EASE [easeGithub], 3: SkipFlow [tay2018skipflow], 4: Memory Networks [zhao2017memory], 5: Adversarial Evaluation + BERT [liu2019automated]. Each entry is a tuple of the statistics defined in Section 3.1. All values are percentages, rounded to the nearest integer.

Tables 4 and 5 report the results for the AddLies, DelRand, AddSong and BabelGen test cases over all the prompts and models (due to lack of space, we present only a small subset of all the results; interested readers are encouraged to look into the supplementary for a complete listing). We also give some real, randomly chosen examples from the different test cases in Figure 1. We observe that, in general, [taghipour2016neural] had a very low percentage of impacted samples, while [zhao2017memory] and [easeGithub] had consistently high percentages. These percentages also varied considerably with the prompt. While some prompts showed lower percentages for some test cases (such as Prompt 4 for the Add-related test cases), others showed a high percentage.

In general, the Del tests impacted the scores negatively; there were very few instances where scores increased after removing a few lines. This was also observed by [perelman2014state], who stated that word count is the most important predictor of an essay's score. Adding obscure and difficult words in place of simpler words increased the scores by a fraction. Curiously, adding speeches and songs impacted the scores, on average, positively, even though those speeches and songs were in no way related to the question being asked. We tried this with different genres of songs; however, the initial experiments showed no particular genre was preferred by the models. We observed that the AddLies test did not succeed as much as the other tests did: false statements such as "Sun rises in the west" impacted scores negatively in many cases. We believe this is because most models used contextual word embeddings as inputs, which may have negatively impacted the scores for such statements.

Figure 2: Results of adversarial training for Prompts 2, 3, 5 and 7, in clockwise order. The x-axis shows the chosen test cases and the y-axis shows four metrics. Representations: solid lines denote the metric value with adversarial training done over the data generated by the same test case; dashed lines denote the value with adversarial training done over the data generated by a different test case; dotted lines denote the value with no adversarial training.

Another category of test case is BabelGen. Ideally, these essays should have been scored zero, but almost all the models awarded the generated essays at least 60% of the maximum score. This strongly suggests that the models were looking for obscure keywords combined with complex sentence formation. We also observed that modifying grammar did not affect the scores much, or affected them negatively. This is largely in congruence with the rubrics of the questions, which indicate that grammar should not be heavily weighted in scoring. However, unexpectedly, in some cases after changing the grammar of the whole response, we observed that the scores started increasing. A few examples demonstrating this are given in the supplementary.

4.1 Human Annotation Results

#  | Perturbation   | Score ↓ (%) | % People ↓ | % People ↑ | Common Reasons of ↓                  | Common Reasons of ↑
1  | ModFluency     | 28.1        | 82.7       | 4.8        | Std English, Readability             | More appropriate
2  | Shuffle        | 24.2        | 68.6       | 14.5       | Transitions, Organization, Relevance | None
3  | ModGrammar     | 39.5        | 91.3       | 6.2        | Grammar, Conventions, Readability    | None
4  | AddWikiRelated | 38.2        | 87.2       | 11.3       | Readability, Relevance, Conventions  | Transitions
5  | RepeatSent     | 15.6        | 71.6       | 13.6       | Organization, Relevance, Repetition  | Clarity
6  | AddLies        | 23.9        | 79.9       | 10.6       | Relevance, Organization              | Conventions
7  | AddTruth       | 29.2        | 88.6       | 8.6        | Relevance, Readability               | Organization
8  | AddSong        | 32.8        | 91.8       | 3.2        | Relevance, Organization, Grammar     | Both equal
9  | DelRand        | 38.2        | 87.2       | 11.3       | Transitions, Organization            | Same, More appropriate

Table 6: Human Annotation Results. (↓ represents a decrease and ↑ represents an increase; "% People ↓" denotes the percentage of people who scored the adversarial response lower than the original response.)
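The per-test summary statistics in Table 6 can be derived from raw (original, adversarial) score pairs per annotator. A minimal sketch with made-up annotation data; the record layout and the `summarize` helper are our own illustration:

```python
from statistics import mean

def summarize(pairs):
    """Aggregate (original, adversarial) score pairs into
    (#people who scored lower, #people who scored higher,
    mean percent score decrease)."""
    lower = sum(a < o for o, a in pairs)
    higher = sum(a > o for o, a in pairs)
    drop = mean((o - a) / o * 100 for o, a in pairs)
    return lower, higher, drop

# Hypothetical annotations: (original_score, adversarial_score) per rater
annotations = {
    "ModGrammar": [(8, 6), (7, 5), (9, 7), (8, 8), (6, 7)],
    "AddSong":    [(7, 5), (8, 5), (9, 6), (7, 7), (8, 6)],
}

for test, pairs in annotations.items():
    lower, higher, drop = summarize(pairs)
    print(f"{test}: %people_down={100 * lower / len(pairs):.1f} "
          f"%people_up={100 * higher / len(pairs):.1f} "
          f"score_drop={drop:.1f}%")
```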

In order to validate that most of our tests are perceived as negatively impacting scoring, we chose a few test cases satisfying three conditions: 1) the mean adversarial score is lower than the mean original score, 2) the drop in score is substantial, and 3) a T-test rejects the hypothesis that the adversarial and original scores come from the same distribution. The motivation behind these three conditions was to choose the test cases where the model is most confident in scoring the adversarial response lower. Through this, we can show that even while being confident, models still fail to penalize scores adequately. In all other test cases, the models either mark the perturbations as better than the original (first condition) or do not detect any significant difference (second and third conditions), both of which are wrong presumptions by the model.
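The selection filter can be sketched as a paired significance check on the score drop. The helper below is our own illustration, not the paper's code; it uses a hand-rolled paired t-statistic against a fixed critical value (t_crit ≈ 2 approximates the 5% level for moderate sample sizes) instead of computing an exact p-value:

```python
from math import sqrt
from statistics import mean, stdev

def confident_negative(orig, adv, t_crit=2.0):
    """True when the model confidently scores adversarial responses
    lower: the mean score drops AND a paired t-statistic on the
    per-response differences exceeds t_crit."""
    diffs = [o - a for o, a in zip(orig, adv)]
    d = mean(diffs)
    if d <= 0:              # scores did not drop on average
        return False
    s = stdev(diffs)
    if s == 0:              # uniform drop: trivially significant
        return True
    t = d / (s / sqrt(len(diffs)))
    return t > t_crit

orig = [8, 7, 9, 8, 6, 7, 8, 9]
adv = [6, 5, 7, 7, 5, 6, 6, 7]
print(confident_negative(orig, adv))  # prints True
```

Only test cases passing this filter were sent to human annotators, since for those the model is unambiguously claiming the perturbation hurt quality.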

Table 6 depicts the results of the human annotations. We divided the annotators into two groups. The first group was shown the original response and its corresponding score and asked to score the adversarial response accordingly. The second group was asked to score both the original and adversarial responses. Whenever an annotator felt that the original and adversarial responses should not receive the same score, we asked them to list supporting reasons. For uniformity in responses, we derived a set of scoring rubrics from those mentioned in our dataset and asked the annotators to choose the most suitable ones. As observed from Table 6, the percentage of people who scored adversarial responses lower than original responses is significantly higher for all selected test cases. The main reasons annotators gave for scoring adversarial responses lower were Relevance, Organization, and Readability. On average, the score decrease was about 30%.

4.2 Adversarial Training

Finally, we tried training on the adversarial samples generated by our framework to see whether the models could pick up some inherent "pattern" of the adversarial samples. Since there is a multitude of adversarial test-case categories, we narrowed them down to a subset of five test cases from those shown to the human annotators, selected such that, on average, they had the maximum deviation between human-annotated scores and machine scores. The training data consisted of an equal number of original and adversarial samples. The target score of each adversarial sample was set to the original score minus the mean difference between the original and human-annotated scores. For example, according to the human annotation study, the mean difference for the ModGrammar case was 2 points below the original score, so in the simulated training data all such samples were scored as the original score minus 2 points. The simulated training data was then appended to the original data and shuffled. Testing was conducted with the respective adversarial test case as well as the others. The results are shown in Figure 2. Adversarial training improves the scores marginally on all four metrics, as shown by the solid lines lying above the dotted lines, though the improvement is barely visible. The increase from adversarial training is highest for the respective test case, and a similar trend is observed across the metrics. For one metric, adversarial training reduces the score for the respective test case compared with non-adversarial testing.
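The construction of the simulated training set described above can be sketched as follows. The `build_adv_training_set` helper, the toy essays, and the `mean_drop` value are hypothetical illustrations of the procedure, not the paper's code:

```python
import random

def build_adv_training_set(originals, adversarials, mean_drop, seed=0):
    """Mix original and adversarial samples for adversarial training.

    originals:    list of (essay, score) pairs.
    adversarials: perturbed versions of the same essays, index-aligned.
    mean_drop:    mean human-annotated score decrease for this test case
                  (e.g. 2 points for ModGrammar in the annotation study).
    Adversarial samples are relabeled as original score minus mean_drop
    (floored at 0); the combined set is then shuffled.
    """
    data = list(originals)
    for adv_text, (_, score) in zip(adversarials, originals):
        data.append((adv_text, max(0, score - mean_drop)))
    random.Random(seed).shuffle(data)
    return data

originals = [("essay one", 8), ("essay two", 6)]
adversarials = ["essay one perturbed", "essay two perturbed"]
train = build_adv_training_set(originals, adversarials, mean_drop=2)
print(len(train))  # prints 4: two original + two adversarial samples
```

The relabeling step is what injects the human judgment into training: the model is shown the same perturbed text a student might submit, paired with the penalty human graders would apply.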

5 Conclusion

Through our experiments, we conclude that recent AES systems, built mainly with feature-extraction techniques and deep neural network algorithms, fail to recognize the presence of common-sense adversaries in student essays and responses. As these common adversaries are popular among students for 'bluffing' during examinations, it is vital for Automated Scoring system developers to think beyond the accuracy of their systems and pay attention to overall robustness, so that these systems are not vulnerable to any form of adversarial attack.