Constructing an essay engages a student’s ability to think critically, analyze, organize, and synthesize ideas. This means that the assessment of student essays is a valuable way to test the upper levels of Bloom’s taxonomy. As such, essay items are often important tools in any comprehensive assessment program. Having essays scored professionally in a standardized testing program comes at a great cost to the states. Factors that influence how the states choose to assess students include increased testing and a general lack of state funds. An Automated Essay Scoring (AES) engine is a statistical model used to evaluate an essay in a manner that is as close to human scoring as possible. The cost of using AES engines has been estimated to be between one fifth and one half of the cost of human scoring.
We generally consider traditional feature-based and LSA-based engines to be in the class of Bag-of-Words (BOW) methods. The performance of these engines depends on the selection and design of useful features. These features can require a great deal of work to test, extrapolate, and implement. Once a suitable set of features is chosen, a classical machine-learning classifier is fit to a set of training data to obtain an AES engine. The features and knowledge of the classifier weights can be beneficial not only for generating feedback for teachers and students, but also for diagnosing why an engine assigned a particular score. The downside of such engines is that they tend to be brittle because language is not adequately encompassed by a finite collection of linguistic features. These engines are modelled on vocabulary observed in the training sample and upon a small collection of weights. Increasing the amount of training data should increase this vocabulary and provide more accurate estimates for an optimal set of weights.
A technology that has seen far less use in production is neural networks [4, 19, 23]. The first neural network-based AES engines to appear in research were based on a mix of convolutional and recurrent layers with attention [4, 23]. These engines used word embeddings, which map words to a semantic vector space. Unlike in LSA, word order is vitally important. These models involve millions of parameters and learn linguistic features implicitly rather than explicitly. These models generally require a lot of data to train; however, we expect that these methods can be more accurate with sufficient data.
The field of neural networks applied to Natural Language Processing (NLP) has undergone a revolution since the development of the transformer. Where LSTM-based models require a lot of data to train, pretrained transformer-based language models can take advantage of vast corpora of unlabelled data. As such, transformer-based models often involve an order of magnitude more parameters that store pre-baked features. We think of transformer-based models as having a general understanding of language before being fine-tuned for a classification task. This pretraining and the number of parameters involved should affect the size of data required to make inferences in an AES setting. As a benchmark, we consider AES engines based on the BERT architecture [13, 19].
We generally possess two types of labels for essay data depending on how the essays were assessed: single-scored data, where the data has been scored once for assessment purposes by teachers or assessment companies, and double-scored data, which has been read by two independent professionally trained assessors. This gives us labelled data of two distinct levels of quality. There is an assumption that high-quality data leads to a high-quality engine; however, this assumption and the framework for implementing AES systems were written with feature-based methods in mind. There are valid reasons that this assumption may break down for neural networks. The neural networks in question are capable of modelling an entire essay prompt. The more language a neural network is exposed to, the greater the number of patterns and features it learns. Testing agencies generally possess far more single-scored than double-scored data; hence, it is our hypothesis that the vast quantities of single-scored data may add value in the context of neural network-based engines. We show that we may use single-scored data to enhance the results of engines trained on double-scored data.
The use of AES is not without controversy, with many citing the ability to game the system in certain ways that are not necessarily conducive to good writing [7, 18]. The common critiques of AES are also predominantly directed toward feature-based techniques. We foresee that, as the techniques in NLP become increasingly sophisticated and accurate, AES engines will become more robust to being gamed in this way. Furthermore, recent advances in language models show the potential to decrease costs while simultaneously increasing the overall quality of scoring. These approaches can also be trained to be more robust, but they are still susceptible to gaming. The biggest downside of neural network-based models is that the features are buried in a sea of parameters and are not as transparent as those of feature-based models.
2. Experimental Design
An essay is typically assigned a score between 0 and 10 reflecting how well the essay is written. A good essay should be organized logically, flow smoothly, and explain a central idea. It is also important to correctly spell words, adhere to the rules of grammar, and appropriately punctuate sentences. These desired qualities have guided the development of a rubric that assesses essays with respect to three different traits:
Organization/Purpose: This measures how well-focused an essay is, how well the author uses citations and transitions, and how well the introduction and conclusion fit with the essay. This trait is scored out of 4.
Elaboration: This measures clarity/readability, how engaging the essay is, and how appropriate the vocabulary is. This trait is scored out of 4.
Conventions: This measures the correctness of spelling and grammar. This trait is scored out of 2.
The final score out of 10 is the sum of the individual scores for each trait. When using AES, for each essay item, we typically design three separate engines which are fit to the traits rather than the final score. Due to the nature of the traits, we expect that the three different types of engines (BOW, LSTM, BERT) benefit from more data differently for each trait. In particular, we expect that the BOW models perform well even with a small amount of data; however, neural network-based models should provide better performance once they are provided with sufficient data. Furthermore, the appropriate parameters governing the implicit features of a BERT model will require more data to become effective, and we expect that the number of effective features an LSTM will learn from a small dataset will not be large enough to effectively model any of the traits.
In developing an engine for a particular item/essay prompt, the first step is to obtain a sufficiently large corpus of text responses to be used as a training set. Each text is assessed independently for each of the three traits by two human raters. If both human raters agree, their agreed-upon score is reported. If the two raters disagree, the score is adjudicated by a third reader, and sometimes by a fourth if the third reader disagrees with both scores, depending on the rules set out by the agency governing the scoring. The aim of having two raters with adjudication and resolution is that the final score is as close as feasible to being a true interpretation of the essay rubric for each trait. While it takes more than twice the time to score in this way, the argument is that the better the labels are, the better the AES engine is.
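In addition to obtaining more accurate scores, two human raters allow us to obtain a measure of how well the item was scored from a psychometric standpoint by gauging inter-rater reliability statistics. The most important metric to evaluate is Cohen’s quadratic weighted kappa (QWK) statistic, defined by
$$\kappa = 1 - \frac{\sum_{i,j} w_{ij}\, x_{ij}}{\sum_{i,j} w_{ij}\, m_{ij}}, \qquad (1)$$
where $x_{ij}$ is the observed probability that the first rater assigns score $i$ and the second rater assigns score $j$, $m_{ij} = \left(\sum_{l} x_{il}\right)\left(\sum_{l} x_{lj}\right)$ is the corresponding probability expected by chance, and
$$w_{ij} = \frac{(i-j)^2}{(k-1)^2},$$
where $k$ is the number of classes.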
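A minimal sketch of the QWK computation may help make the definition concrete. It assumes integer scores in the range 0 to k-1 and the usual quadratic weights (i-j)^2/(k-1)^2; the function name and the toy score lists are illustrative and are not taken from the study.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """QWK between two equal-length lists of integer scores in 0..num_classes-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(rater_a)
    k = num_classes
    # Observed joint probabilities: observed[i][j] = P(rater A gives i, rater B gives j).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    # Marginal score distributions of each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * row[i] * col[j]
    # Undefined (division by zero) if both raters always give the same single score.
    return 1.0 - numerator / denominator


# Two predictions with the same accuracy (4 of 5 correct) but different error sizes.
truth = [0, 1, 2, 3, 4]
near_miss = [1, 1, 2, 3, 4]  # one score off by 1
far_miss = [4, 1, 2, 3, 4]   # one score off by 4
kappa_near = quadratic_weighted_kappa(truth, near_miss, 5)  # roughly 0.94
kappa_far = quadratic_weighted_kappa(truth, far_miss, 5)    # roughly 0.20
```

Both toy predictions have identical accuracy, yet the larger discrepancy is penalized far more heavily, which is the behaviour that motivates preferring QWK over accuracy.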
There are several reasons QWK is favored over accuracy. The first reason is that QWK depends on the entire confusion matrix, not just the diagonal elements. This means that larger score discrepancies have a greater effect on the QWK than smaller ones, whereas accuracy treats all incorrect scores equally. Secondly, the use of observed probabilities in (1) has the effect of taking into account the rarity of the score. One interpretation of this metric is that the QWK captures the level of agreement above and beyond what would be obtained by chance, weighted by the extent of disagreement.
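The QWK and accuracy are not the only two metrics that are used in the calibration of an AES engine; we also consider the standardized mean difference (SMD), which measures the difference between mean scores relative to the overall spread of scores. The SMD is defined by
$$\mathrm{SMD}(y_1, y_2) = \frac{\mu(y_2) - \mu(y_1)}{\sqrt{\left(\sigma(y_1)^2 + \sigma(y_2)^2\right)/2}},$$
where $\mu$ and $\sigma$ are the mean and standard deviation functions.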
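As a rough illustration, the SMD can be computed in a few lines. This sketch assumes the commonly used form in which the difference of means is divided by the root of the averaged variances, and it assumes population (rather than sample) standard deviations; both choices, along with the function name, are assumptions rather than details confirmed by the text.

```python
from math import sqrt
from statistics import mean, pstdev


def standardized_mean_difference(reference, predicted):
    """Difference in mean score, scaled by the pooled standard deviation."""
    # Assumption: population standard deviation; a sample estimate would use stdev.
    pooled = sqrt((pstdev(reference) ** 2 + pstdev(predicted) ** 2) / 2)
    return (mean(predicted) - mean(reference)) / pooled


human = [0, 1, 2, 3, 4]
engine = [1, 2, 3, 4, 5]  # every score shifted up by one point
smd = standardized_mean_difference(human, engine)
```

A positive value indicates that the second set of scores is, on average, higher than the first.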
The framework for implementing AES systems recommends that the QWK between two raters should exceed a minimum threshold before the data is considered as a training set for an AES engine. We generally require that the QWK between the AES engine and the final human-resolved score fall short of the QWK between the two human raters by no more than a fixed tolerance. Furthermore, the SMD between the raters and the SMD between the AES engine and the final score should each remain below a fixed bound. It is for this reason that we need to consider both the SMD and the QWK as we increase the amount of training data.
In our first experiment, we seek to determine how our models improve with the amount of data provided. For this experiment, we use a large corpus of single-scored data and gradually increase the amount of single-scored data used for training. We consider the following two sets of data:
Training: We use a corpus of 15,000 single-scored responses, each of which has been assessed in each of the three traits for assessment purposes.
Validation: We use an additional 2,000 responses from the same source as held-out single-scored validation data.
The training data was divided into a chain of 30 subsets, $S_1 \subset S_2 \subset \cdots \subset S_{30}$, so that $|S_k| = 500k$ and $S_{30}$ is the full training set. That is to say, we have a chain of 30 subsets whose sizes range between 500 and 15,000 in steps of 500. To determine how well the average model does in comparison with humans, we use 5-fold validation by further subdividing each subset into 5 different test/train splits. Each subset, $S_k$, is the disjoint union of the test and train sets of each split, and each subset, $S_k$, is also the disjoint union of the 5 different test sets. Our final QWK is the average over the folds. In this way we determine how each of the types of models responds to increases in data size.
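The nested subsets and the five test/train splits can be sketched as follows. This is only an illustration under stated assumptions: the responses are represented by integer identifiers, the shuffle seed is arbitrary, and the helper names `nested_subsets` and `five_fold_splits` are hypothetical.

```python
import random


def nested_subsets(items, step=500, seed=0):
    """Return a chain of subsets in which the k-th subset holds the first k*step shuffled items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[: k * step] for k in range(1, len(shuffled) // step + 1)]


def five_fold_splits(subset, folds=5):
    """Yield (train, test) pairs whose test sets partition the subset."""
    for f in range(folds):
        test = subset[f::folds]
        train = [x for i, x in enumerate(subset) if i % folds != f]
        yield train, test


responses = list(range(15000))  # stand-in identifiers for the training responses
chain = nested_subsets(responses)
splits = list(five_fold_splits(chain[0]))
```

Because each subset is a prefix of the same shuffled list, every smaller subset is contained in every larger one, and the five test sets of a subset are pairwise disjoint and cover it.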
We consider the final performance to be the average of the QWK on the held-out set for each of the folds. It is important to note that we are not gauging the best possible performance, which is the goal of most research programs in neural networks. This study seeks to gauge average performance, so averaging over the folds has the effect of smoothing out the variability in the resulting QWK. We often found, in the case of the LSTM, that the engine failed to converge altogether; we have retained these results to preserve the fidelity of the experiment. One of the other factors that influences the variability of the QWK measurement is the rarity of scores. Failing to predict rare scores has a more dramatic effect on the QWK than failing to predict common scores. For this reason, we present the score distributions for the validation set in Table 1.
|Essay|Scoring|Trait|0|1|2|3|4|
|---|---|---|---|---|---|---|---|
|Essay #2|Single-Scored|Elaboration|12.8%|19.3%|49.4%|14.75%|3.75%|
In counting the number of models in this experiment, we note that we have a different model for each dimension, subset, and fold. In total, we require an evaluation of 450 models for each type of model and for each essay item, making 2700 models in total (or 900 for each type of model used). Due to the sheer number of models involved, it was not feasible to perform hyper-parameter tuning on this scale. For this reason the results we present are not the best possible results with each architecture. They are to be considered a reflection of how each architecture scales as the result of a single model resulting from generically chosen parameters. The evaluation of these 2700 models on the validation set should give us a clear idea of how the size of data affects the quality of the engine.
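The model counts quoted above follow directly from the experimental design, as this short check shows (the variable names are illustrative):

```python
traits, subsets, folds = 3, 30, 5
essays, engine_types = 2, 3

per_type_per_essay = traits * subsets * folds  # 450
per_type = per_type_per_essay * essays         # 900
total = per_type * engine_types                # 2700
```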
Our second experiment challenges the long-standing assumption that it is more important to have a small amount of good-quality data than it is to have a large quantity of poorly labelled data. To test this, we took the same two essay prompts from the first experiment, for which we had both a large corpus (approximately 50k) of single-scored responses and approximately 2500 double-scored responses that were designed to be used to build an AES engine. This gives us two distinct qualities for the labels used in training. In this experiment, we have three sets of data:
Single-scored Training: We use approximately the full set of 50k responses as training data with their corresponding labels.
Double-scored Training: We use approximately 2000 responses with their final resolved score as the labels used in training.
Validation: We use the remaining 500 double-scored responses, with their final resolved score as the labels, to validate machines trained on both sets of training data.
The inter-rater reliability statistics for the validation set for the two human reads are presented in Table 2.
In this experiment, we have two sets of models, one trained on the single-scored data and another trained on the double-scored data. Our hypothesis is that training on 2000 double-scored responses yields better performance than training on a large quantity of single-scored data. There are good reasons this hypothesis could be false, especially for neural networks.
There is one issue in this comparison which needs to be addressed regarding the nature of the data: are the single-scored labels/scores an accurate representation of the double-scored labels/scores? We know that the corpus of single-scored responses originates from a time in which all responses were scored by a human. While we also know that the double-scored responses were drawn from the same sample, the administrative conditions for assessing responses may have changed between the time the corpus was originally assessed and the time the data was assessed for the purposes of building an AES engine. The average scores and spread of scores could differ, which would adversely affect the SMD and, to a lesser extent, the QWK. What should be true is that the single-scored data should be able to form a first approximation for the double-scored data. For this reason, we consider an extra step: we use the models obtained by training on the single-scored data to define a set of initial weights to be used for training on the double-scored data. In this way, we test the overall usefulness of the single-scored data in training an AES engine for use in production. To our knowledge, this type of data is completely disregarded in the development of AES engines.
We now describe the models used in the specific AES engines to test our hypotheses:
BOW: The BOW-based engine may be considered an ensemble between an LSA-based engine and a feature-based engine. In this engine we need to choose the linguistic features to include and the LSA dimension. We include a list of sixteen features such as the number of punctuation errors, misspellings, typos, and average sentence length. Since we expect that the conventions score is mainly dependent on these features, we chose an LSA dimension of 10 for conventions, while we expect that the elaboration and organization dimensions are driven by the semantic content, so we chose an LSA dimension of 70 for these dimensions. The resulting twenty-six or eighty-six features are then concatenated and an ordinal probit model is applied to produce a classification.
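To illustrate the final step of the BOW engine, the following sketch shows how an ordinal probit model turns a feature vector into a probability for each score. It is a sketch under stated assumptions, not the study's implementation: the weights and cutpoints below are made-up values (in practice they would be fitted by maximum likelihood), and the two-element feature vector stands in for the concatenated linguistic and LSA features.

```python
from math import erf, inf, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def ordinal_probit_probs(features, weights, cutpoints):
    """Return P(score = c) for c = 0..len(cutpoints), given increasing cutpoints."""
    latent = sum(w * x for w, x in zip(weights, features))
    bounds = [-inf] + list(cutpoints) + [inf]
    cdf = [normal_cdf(b - latent) for b in bounds]
    # The probability of class c is the normal mass between consecutive cutpoints.
    return [cdf[c + 1] - cdf[c] for c in range(len(bounds) - 1)]


# Hypothetical values: two features, four cutpoints giving five score classes (0-4).
probs = ordinal_probit_probs([0.5, -1.2], [0.8, 0.1], [-1.0, 0.0, 1.0, 2.0])
predicted_score = max(range(len(probs)), key=probs.__getitem__)
```

Because the classes share a single latent scale separated by ordered cutpoints, the model respects the ordering of the scores, which a plain multinomial classifier does not.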
LSTM: To evaluate an LSTM-based architecture we first established an embedding. We took a large corpus of student texts, tokenized them with the standard spaCy tokenizer, and formed a case-insensitive fastText embedding. This embedding was used to transform the inputs, which were fed into a two-layer bidirectional LSTM with 400 hidden units in each direction in each layer. A simplified attention mechanism consisting of a weighted average of the output of the LSTM was applied, followed by a linear layer to form the output. Optimization was performed with the adaptive minimization algorithm, Adam, with standard learning rates. These models were implemented using PyTorch. No pretraining was involved.
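The architecture described above might look roughly like the following PyTorch module. Several details are assumptions rather than facts from the text: the embedding dimension of 300, the vocabulary size, the use of a learned linear scorer followed by a softmax to produce the attention weights, and the class name `AttentionLSTMScorer`. In practice the embedding layer would be initialized from the fastText vectors.

```python
import torch
import torch.nn as nn


class AttentionLSTMScorer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes, hidden=400, layers=2):
        super().__init__()
        # Index 0 is reserved for padding; weights would be copied from fastText here.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        self.attention = nn.Linear(2 * hidden, 1)  # one relevance score per token
        self.output = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):
        states, _ = self.lstm(self.embedding(token_ids))    # (batch, seq, 2*hidden)
        scores = self.attention(states).squeeze(-1)         # (batch, seq)
        # Ignore padding positions; assumes every essay has at least one real token.
        scores = scores.masked_fill(token_ids == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1).unsqueeze(-1)
        pooled = (weights * states).sum(dim=1)               # weighted average of LSTM outputs
        return self.output(pooled)                           # (batch, num_classes)


model = AttentionLSTMScorer(vocab_size=1000, embed_dim=300, num_classes=5)
optimizer = torch.optim.Adam(model.parameters())  # PyTorch default learning rate
```

The module returns unnormalized class scores, so it would typically be trained with a cross-entropy loss over the score classes of a single trait.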
BERT: To evaluate pretrained transformer-based architectures we chose the standard BERT architecture. For conventions, we used the cased version of the base architecture with 12 layers, while both the elaboration and organization scores used the uncased version of the same architecture. These models were obtained and fine-tuned using the Hugging Face codebase (https://github.com/huggingface/transformers). A version of Adam was used with a standard learning rate.
In our first experiment, we are interested in how these engines perform as we increase the amount of data we provide for training from 500 responses up to 15k responses in increments of 500. We compare the three different engine types on two separate essay prompts, each of which has three traits. In our view, elaboration and organization are grouped together, as they depend on similarly defined and intersecting features, while conventions is defined on an almost disjoint set of features. We start by examining the performance on Elaboration and Organization; the change in the resulting QWK on the validation data for these two traits is presented in Figure 1 and Figure 2 respectively.
One of the difficulties that AES engines have in assessing Elaboration and Organization is that these two traits should be scored independently of the spelling and grammar. This poses a unique difficulty as these two traits benefit from how the AES engines may extrapolate the correct word when one is incorrectly spelled. To adjust for this, the LSA component of the BOW engine was subjected to spell-correction while spell-correction was not applied to the features component of the model.
For both prompts, the performance of the LSTM-based AES engine is heavily dependent on the amount of data. When the LSTM is given very little data, the models do poorly; however, with enough data, the LSTM-based engines can perform comparably with the BOW-based models and, in most cases, exceed their performance. One of the distinct advantages the LSTM model possesses is the use of the fastText embedding. Since fastText embeddings are the result of an average over subwords, if sufficiently many subwords of an incorrectly spelled word are present, it is possible that the averaging mechanism may approximate the meaning of the misspelled word. One of the problems was that this model did not always converge, especially when we used smaller datasets. Most of the time these models would have been discarded in the hyperparameter selection; however, since we did not tune the LSTM models, we see a considerable drop in performance on Essay #2 due to this instability. It was clear that the larger the dataset, the more stable the results were.
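The robustness of subword embeddings to misspellings can be illustrated by looking at character n-grams directly. The sketch below extracts n-grams of length 3 to 6 from a word wrapped in boundary markers, in the manner described for fastText, and measures how many are shared between two words. It only measures overlap and does not compute embeddings; the helper names and the example words are illustrative.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Set of character n-grams of the word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word.lower()}>"
    return {marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)}


def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


misspelled = ngram_overlap("definitely", "definately")  # shares many subwords
unrelated = ngram_overlap("definitely", "elephant")     # shares none
```

Since a word vector is built from the vectors of its n-grams, a misspelling that preserves most n-grams lands close to the intended word, whereas a whole-word embedding would treat it as an unknown token.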
As expected, the BOW-based models show much less of a performance increase as we increase the quantity of data. The size and accuracy of the LSA component should account for some increase in accuracy; however, in some cases, there is even a slight drop in the performance of the BOW-based model as we increase the amount of data used.
In most cases, the BERT engine not only improves with the increase in data, but also shows very solid performance with very little data. This indicates that the pretraining endows the BERT engine with sufficiently many features to be useful for conventions before training begins. These results seem to differ from those of a previous study, but the nature of the data is very different.
When we consider the performance on the trait of conventions, we see that BERT holds a considerable advantage across the board. Whereas we used only 16 textual features in the BOW model, the linear layer determining the classification in the BERT architecture is a function of 768 inputs that depend on the entire input space. While dissecting the features of a BERT model is difficult, we know that conventions could be interpreted as a measure of how discrepant the target text is from the grammatically correct and impeccably spelled texts that BERT was exposed to in pretraining. It is clear that the features learned in this pretraining process seem to capture more of the features required to model conventions than the preprogrammed features of the BOW model.
It is also clear that the LSTM models are at a distinct disadvantage. Since the fastText embedding used was lower-case, correct capitalization cannot be determined by the LSTM model. Furthermore, these models are trying to ascertain the rules of grammar from a small collection of texts. The ability to discern the correct words from incorrectly spelled words works against the ability of the LSTM to score conventions accurately. The fact that the LSTM falls short of the BOW and BERT models is no surprise in this context.
It should be noted that the spread of the data, as measured by the SMD, does not improve significantly with the amount of data used in any of our methods.
If we consider the results of the engines trained on a large set of single-scored data, the results are poor when evaluated on the double-scored data for BOW and BERT and better for LSTM. If we are to believe that the double-scored data represents the interpretation of the rubric with the highest fidelity, then the SMD alone for conventions for each engine would disqualify the engines trained on this data for operational use.
The data seems to indicate a mismatch between the labels of the single-scored data and the double-scored data. We need to bear in mind that the administrative conditions for the creation of these two datasets vary, as the single-scored data was a corpus of responses that predated the use of AES for these two prompts. In any case, the labels are seen to have two distinct natures; hence, we expect some variation in the way in which the rubrics were interpreted. If scores were assigned more leniently in one administrative setting than in the other, this would cause a difference in the spread of scores, which can be seen in Table 4.
|SS||DS Train||DS Val|
There is just one more possibility we wish to explore: that we use the single-scored data to define an initial state for the training of models on the double-scored data. The last part of the second experiment involves using the single-scored data for pretraining. This approach is similar to the work done on classifying tweets. The main idea is that the weaker data is used to define an appropriate set of features that were not abundant in the smaller dataset, which the linear layer may then use to more accurately classify the smaller dataset. This differs from pretraining the underlying model as a language model, in which the LSTM would be used to predict missing or future words. Furthermore, the pretraining set bears a more accurate resemblance to the target dataset in this case. The results of this process are outlined in Table 5.
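The two-stage procedure can be sketched in PyTorch as follows. This is a toy illustration under loudly stated assumptions: a single linear layer stands in for the LSTM or BERT engine, the tensors are random stand-ins for the single-scored and double-scored datasets, and the epoch counts and learning rate are arbitrary.

```python
import copy

import torch
import torch.nn as nn


def train(model, inputs, labels, epochs, lr=1e-3):
    """Plain full-batch training loop with Adam and cross-entropy loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss_fn(model(inputs), labels).backward()
        optimizer.step()
    return model


torch.manual_seed(0)
# Synthetic stand-ins: a large, noisier set and a small, higher-quality set.
single_x, single_y = torch.randn(2000, 16), torch.randint(0, 5, (2000,))
double_x, double_y = torch.randn(200, 16), torch.randint(0, 5, (200,))

# Stage 1: train on the single-scored data and keep a copy of the weights.
pretrained = train(nn.Linear(16, 5), single_x, single_y, epochs=20)
warm_start = copy.deepcopy(pretrained.state_dict())

# Stage 2: start from those weights and continue training on the double-scored data.
engine = nn.Linear(16, 5)
engine.load_state_dict(warm_start)
engine = train(engine, double_x, double_y, epochs=20)
```

The only difference from training on the double-scored data alone is the call to `load_state_dict`, which replaces the random initialization with the weights learned from the single-scored data.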
It is interesting to see that BERT seems to see increases in the performance characteristics for elaboration and organization when subjected to this training and decreases in conventions. We speculate that this may be a symptom of the system forgetting much of the pretraining in the process of training on the single-scored data. Elaboration and organization, on the other hand, are based on features that are more specific to the prompt. What is also interesting is how comparably the LSTM-based engine performs, given that the engine itself (not including the embedding) possesses approximately 3.5 million parameters, while the BERT model possesses 110 million parameters (23 million for the embedding).
One of the aspects of this study we did not delve into is the ability to tune the parameters that define neural networks. Tuning parameters such as learning rate, batch size, dropout, and recurrent dropout can lead to significant improvements in text-classification results. We expect that hyperparameter tuning should improve the performance of BERT and the LSTM model significantly on all traits and may even allow them to exceed human performance. Typically, we see on the order of a 5-10% improvement due to hyperparameter tuning. When tuning these parameters, we often rely on simple grid search methods as well as Bayesian approaches. The only parameter we can tune in the case of the BOW model chosen is the LSA dimension.
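A simple grid search of the kind mentioned above can be sketched with the standard library. The grid values and the scoring function `toy_qwk` are purely hypothetical placeholders; in a real setting the scoring function would train an engine with the given parameters and return its QWK on held-out data.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Evaluate every combination in the grid and return the best (params, score)."""
    names = list(grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score


def toy_qwk(params):
    """Placeholder objective standing in for 'train an engine and measure QWK'."""
    return (-abs(params["learning_rate"] - 1e-3)
            - abs(params["dropout"] - 0.3)
            - abs(params["batch_size"] - 32) / 100)


grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [16, 32],
    "dropout": [0.1, 0.3],
}
best_params, best_score = grid_search(toy_qwk, grid)
```

The number of evaluations grows multiplicatively with each added parameter (twelve here), which is why exhaustive tuning was not feasible across the thousands of models in this study.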
The other aspect not touched upon is the pretraining of an LSTM; the difficulty in doing so has been reported in the literature. We believe that similar results to the BERT and BOW models are possible by pretraining the LSTM as a part of an autoencoder network. It is worth noting that the current state-of-the-art results on the Kaggle dataset are achieved with an attention-based ensemble of convolutional and LSTM units; hence, we can assume the addition of an attention mechanism should improve the results significantly. It may also be possible to do a kind of pretraining for the LSA by using the LSA features defined by a larger dataset as the features for a classifier for a smaller dataset.
An interesting aspect of the conventions trait is that it is based on universal features associated with grammar, spelling, and prose. From an educational standpoint, how these features are assessed depends upon the grade for which the prompt is intended; however, it is the same set of features that is taken into consideration for each grade. This means it might be possible to pool the data from multiple prompts, and so to provide an LSTM with sufficient data to perform comparably on the conventions trait, especially with an embedding that is case-sensitive.
When we initiated this line of research, the consensus seemed to be that single-scored data would provide little to no value in developing an effective AES engine. This seems to be true for traditional AES models built on an ensemble of hand-crafted and LSA features; however, the possibility of transfer learning makes single-scored data useful. This study has illuminated a few things: LSTMs seem to be able to perform comparably to transformer-based and BOW-based models in elaboration and organization with enough data, and conventions seems to benefit more from the initial features defined in the pretrained model.
[1] Attali, Yigal, and Jill Burstein. “Automated essay scoring with e-rater® V. 2.” The Journal of Technology, Learning and Assessment 4, no. 3 (2006).
[2] Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. “Enriching word vectors with subword information.” Transactions of the Association for Computational Linguistics 5 (2017): 135-146.
[3] Dai, Andrew M., and Quoc V. Le. “Semi-supervised sequence learning.” In Advances in neural information processing systems, pp. 3079-3087. 2015.
[4] Dong, Fei, Yue Zhang, and Jie Yang. “Attention-based recurrent convolutional neural network for automatic essay scoring.” In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 153-162. 2017.
[5] Ezen-Can, Aysu. “A Comparison of LSTM and BERT for Small Corpus.” arXiv preprint arXiv:2009.05451 (2020).
[6] Farag, Youmna, Helen Yannakoudakis, and Ted Briscoe. “Neural automated essay scoring and coherence modeling for adversarially crafted input.” arXiv preprint arXiv:1804.06898 (2018).
[7] Kolowich, Steven. “Writing instructor, skeptical of automated grading, pits machine vs. machine.” The Chronicle of Higher Education 28 (2014).
[8] Krathwohl, David R. “A revision of Bloom’s taxonomy: An overview.” Theory into practice 41, no. 4 (2002): 212-218.
[9] Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural computation 9, no. 8 (1997): 1735-1780.
[10] Howard, Jeremy, and Sebastian Ruder. “Universal language model fine-tuning for text classification.” arXiv preprint arXiv:1801.06146 (2018).
[11] Landauer, Thomas K., Peter W. Foltz, and Darrell Laham. “An introduction to latent semantic analysis.” Discourse processes 25, no. 2-3 (1998): 259-284.
[12] Larkey, Leah S. “Automatic essay grading using text categorization techniques.” In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 90-95. 1998.
[13] Mayfield, Elijah, and Alan W. Black. “Should You Fine-Tune BERT for Automated Essay Scoring?” In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 151-162. 2020.
[14] Martinez-Cantin, Ruben, Kevin Tee, and Michael McCourt. “Practical Bayesian optimization in the presence of outliers.” In International Conference on Artificial Intelligence and Statistics, pp. 1722-1731. PMLR, 2018.
[15] McHugh, Mary L. “Interrater reliability: the kappa statistic.” Biochemia medica 22, no. 3 (2012): 276-282.
[16] Page, Ellis Batten. “Project Essay Grade: PEG.” (2003).
[17] Phelps, Richard P. “Estimating the cost of standardized student testing in the United States.” Journal of Education Finance 25, no. 3 (2000): 343-380.
[18] Perelman, Les. “Construct validity, length, score, and time in holistically graded writing assessments: The case against automated essay scoring (AES).” International advances in writing research: Cultures, places, measures (2012): 121-131.
[19] Rodriguez, Pedro Uria, Amir Jafari, and Christopher M. Ormerod. “Language models and Automated Essay Scoring.” arXiv preprint arXiv:1909.09482 (2019).
[20] Shermis, Mark D. “State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration.” Assessing Writing 20 (2014): 53-76.
[21] Shermis, Mark D., and Jill C. Burstein, eds. “Automated essay scoring: A cross-disciplinary perspective.” Routledge, 2003.
[22] Shorten, Connor, and Taghi M. Khoshgoftaar. “A survey on image data augmentation for deep learning.” Journal of Big Data 6, no. 1 (2019): 60.
[23] Taghipour, Kaveh, and Hwee Tou Ng. “A neural approach to automated essay scoring.” In Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 1882-1891. 2016.
[24] Topol, Barry, John Olson, Ed Roeber, and P. Hennon. “Getting to higher-quality assessments: Evaluating costs, benefits, and investment strategies.” Assessment Solutions Group (2012).
[25] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In Advances in neural information processing systems, pp. 5998-6008. 2017.
[26] Wei, Jason, and Kai Zou. “Eda: Easy data augmentation techniques for boosting performance on text classification tasks.” arXiv preprint arXiv:1901.11196 (2019).
[27] Williamson, David M. “A framework for implementing automated scoring.” In Annual Meeting of the American Educational Research Association and the National Council on Measurement in Education, San Diego, CA. 2009.
[28] Kim, Yoon. “Convolutional neural networks for sentence classification.” arXiv preprint arXiv:1408.5882 (2014).
[29] Yuan, S., X. Wu, and Y. Xiang. “Incorporating Pre-Training in Long Short-Term Memory Networks for Tweets Classification.” In 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, pp. 1329-1334. 2016. doi: 10.1109/ICDM.2016.0181.
[30] Zhang, Li. “Review of handbook of automated essay evaluation: Current applications and new directions.” Language Learning & Technology 18, no. 2 (2014): 65-69.