1 Introduction
Free-form constructed textual responses generally fall into one of two categories: essays and short answers. These two categories are not just distinguished by average response length; they are also assessed very differently [4]. Rubrics for essays often take grammar, organization, and argumentation into consideration, whereas rubrics for short answer questions tend to assess specific analytic or comprehension skills. This means that a short answer response is typically not penalized for spelling or grammatical errors. Automated Short Answer Scoring (ASAS) and Automated Essay Scoring (AES) are two classes of techniques that use statistical models to approximate the assessment of constructed textual responses. Given the difference in rubrics, the performance of particular models and the importance of particular features vary greatly between the two settings.
Traditionally, statistical models for AES have been based on bag-of-words (BoW) methods which combine frequency-based and hand-crafted features [2, 8, 17]. As neural networks matured in other areas of NLP, they were increasingly adopted for AES [1, 5, 29]. One of the most important developments in NLP has been the effectiveness of transformer-based pretrained language models such as Bidirectional Encoder Representations from Transformers (BERT) [6], which can be fine-tuned for a range of downstream tasks. The effectiveness of these models on the Kaggle essay dataset has been investigated by numerous authors [16, 18, 30]. More recently, a combination of hand-crafted features and language models defined the state-of-the-art on that dataset [30].
The most effective methods for short answer questions differ depending on the type of response. In the case of the Powergrading dataset, where responses average fewer than twenty words, a simple yet effective clustering technique suffices [3]. In the case of the SemEval-2013 Joint Student Response Analysis (SRA) task, the current state-of-the-art was achieved by fine-tuning BERT models [27]. This work is concerned with the Kaggle Short Answer Scoring (KSAS) dataset (https://www.kaggle.com/c/asap-sas) [24]. Each item consists of a passage and a prompt that asks the student to describe or explain aspects of the passage using evidence [24]. Given the semantic nature of descriptions and explanations, we expect well-trained neural networks to perform well on this task. Despite the advancements of neural networks, the current state-of-the-art for this dataset has been achieved by applying random forest classifiers to a set of rule-based features [10]. What is remarkable from a production standpoint is that the calculations for these models can be done in a low-resource setting.
The goal of this short note is to explore how some of the most popular language models fare on the KSAS dataset. We expected that language models on their own could surpass previous results, but, for single models on average, this is not the case. We show that particular ensembles are capable of exceeding this benchmark, but the computational cost would be prohibitive from a production standpoint. In this sense, this work is the antithesis of [10] in that we simply bludgeon the problem to death with computational power. Even so, there remain a few prompts on which our models fall drastically short of the methods in [10]. Conversely, there are other prompts on which even our most efficient models perform comparably to, or exceed, the rule-based methods, which we believe is sufficient to show that these methods and the results of this paper are of interest.
This paper is outlined as follows: in §2 we specify the way in which we fine-tune, train, and ensemble the pretrained transformer-based language models and feature-based models; in §3 we present the results of the various models produced; and in §4 we discuss some corollaries of this work in terms of future directions.
2 Method
Since BERT was introduced in [6], a veritable cornucopia of language models has appeared, each varying either the underlying architecture of BERT or the way in which it is trained. The General Language Understanding Evaluation (GLUE) benchmark has been one of several benchmarks used to evaluate the performance of these language models on a range of classification, generation, and understanding tasks [31]. We expect that some models, due to their architectural changes or training methods, should perform differently from others. We compare the models by applying the same fine-tuning procedure to each pretrained model for each prompt in the KSAS dataset.
We start by introducing the metrics typically used to evaluate model performance in production AES and ASAS systems [32]. The primary statistic used in automated assessment is Cohen's quadratic weighted kappa (QWK), defined by

$$\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}}, \qquad (1)$$

where $O_{i,j}$ is the observed probability that a response receives a score of $i$ from the first rater and $j$ from the second, $E_{i,j}$ is the corresponding probability expected by chance (the product of the two raters' marginal distributions), and the weights are $w_{i,j} = (i-j)^2/(N-1)^2$, where $N$ is the number of classes. One interpretation of this statistic is that it represents the level of agreement between two scorers after discounting agreement by chance. In production systems, we often require that the QWK between the true scores and the scores predicted by the model is within 0.1 of the QWK between two humans. In an educational setting, most scoring engines are also required to have a standardized mean difference (SMD) with the final score below 0.15, and the discrepancy between the inter-rater (IRR) accuracy and the engine's accuracy must be within some limit [32].
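To make these criteria concrete, the following sketch computes QWK and an SMD for a pair of score vectors. It is our own illustration rather than the evaluation code used here, and the pooled-standard-deviation convention for the SMD is an assumption on our part rather than the exact definition of [32].

```python
# Minimal sketch of the evaluation statistics discussed above. QWK comes
# directly from scikit-learn; the SMD shown uses a pooled standard deviation,
# which is one common convention and an assumption on our part.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def qwk(y_true, y_pred):
    """Cohen's kappa with quadratic weights, i.e. Equation (1)."""
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")

def smd(y_true, y_pred):
    """Standardized mean difference between predicted and true scores."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    pooled_sd = np.sqrt((y_true.var(ddof=1) + y_pred.var(ddof=1)) / 2.0)
    return abs(y_pred.mean() - y_true.mean()) / pooled_sd

human = [0, 1, 2, 2, 3, 1]
engine = [0, 1, 2, 3, 3, 1]
print(qwk(human, engine), smd(human, engine))
```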
Set | train N | dev N | test N | dev QWK | dev Acc | Avg. Length
---|---|---|---|---|---|---
1 | 1337 | 335 | 557 | 0.936 | 87.5% | 52 |
2 | 1022 | 256 | 426 | 0.911 | 84.8% | 65 |
3 | 1446 | 362 | 406 | 0.758 | 78.5% | 53 |
4 | 1325 | 332 | 295 | 0.686 | 78.3% | 46 |
5 | 1436 | 359 | 598 | 0.935 | 95.9% | 28 |
6 | 1437 | 360 | 599 | 0.951 | 97.0% | 28 |
7 | 1439 | 360 | 599 | 0.973 | 96.4% | 46 |
8 | 1439 | 360 | 599 | 0.837 | 83.3% | 60 |
9 | 1438 | 360 | 599 | 0.831 | 80.8% | 54 |
10 | 1312 | 328 | 546 | 0.905 | 90.9% | 45 |
Our first step is to isolate a development set to be used for an early stopping mechanism and for hyperparameter tuning. This set was chosen at random without any stratification. The properties of the development set are specified in Table 1. Because the original test set has been withheld by the organizers of the competition, we use the public test set to validate the models. Unfortunately, this means that we do not have second reads for the validation set; hence, we cannot assess whether the results satisfy the operational criteria of [32].
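A split of this kind amounts to a single call to scikit-learn. The sketch below reflects our reading of the counts in Table 1 (roughly an 80/20 split of the public training data); the file and column names are assumptions based on the Kaggle release and may need adjusting.

```python
# Sketch of the unstratified development split. The 80/20 ratio is inferred
# from the counts in Table 1; the file and column names are assumptions based
# on the Kaggle ASAP-SAS release.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("train.tsv", sep="\t")        # public training data
prompt = data[data["EssaySet"] == 1]             # one prompt at a time
train_df, dev_df = train_test_split(prompt, test_size=0.2, shuffle=True, random_state=0)
```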
The training of a single model, or of a model within an ensemble, was performed using the AdamW optimizer [15] with a linear learning rate scheduler. The loss function used was the usual binary cross-entropy. We train each model for 20 epochs and select the epoch with the best QWK on the development set. To select the learning rate and the batch size, we used the Tree-structured Parzen Estimator (TPE) algorithm [7] with 10 trials, with batch sizes between 6 and 12 and learning rates between 5e-6 and 1e-4; we used the Optuna implementation of the TPE algorithm [28].
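A rough sketch of this search loop, under our own assumptions, is given below: `make_loaders` is a hypothetical helper that tokenizes the responses and returns PyTorch data loaders whose batches contain `input_ids`, `attention_mask`, and `labels`; the checkpoint name is illustrative; and we fall back on the default cross-entropy loss of the Hugging Face classification head rather than reproducing the exact loss described above.

```python
# Sketch of the per-prompt fine-tuning and TPE search: AdamW, a linear
# schedule, 20 epochs with the best development QWK retained, and a 10-trial
# Optuna search over the learning rate and batch size. make_loaders is a
# hypothetical helper; the checkpoint name is illustrative only.
import optuna
import torch
from sklearn.metrics import cohen_kappa_score
from torch.optim import AdamW
from transformers import (AutoModelForSequenceClassification,
                          get_linear_schedule_with_warmup)

EPOCHS, CHECKPOINT, NUM_LABELS = 20, "bert-base-uncased", 4

def evaluate_qwk(model, loader):
    """QWK of the model's argmax predictions over a data loader."""
    model.eval()
    preds, golds = [], []
    with torch.no_grad():
        for batch in loader:
            logits = model(**{k: v.cuda() for k, v in batch.items()
                              if k != "labels"}).logits
            preds += logits.argmax(-1).tolist()
            golds += batch["labels"].tolist()
    return cohen_kappa_score(golds, preds, weights="quadratic")

def objective(trial):
    lr = trial.suggest_float("lr", 5e-6, 1e-4, log=True)
    batch_size = trial.suggest_int("batch_size", 6, 12)
    train_loader, dev_loader = make_loaders(batch_size)   # hypothetical helper
    model = AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT, num_labels=NUM_LABELS).cuda()
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0,
        num_training_steps=EPOCHS * len(train_loader))
    best_qwk = 0.0
    for _ in range(EPOCHS):
        model.train()
        for batch in train_loader:
            loss = model(**{k: v.cuda() for k, v in batch.items()}).loss
            loss.backward()
            optimizer.step(); scheduler.step(); optimizer.zero_grad()
        best_qwk = max(best_qwk, evaluate_qwk(model, dev_loader))  # keep best epoch
    return best_qwk

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=10)
```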
The source code we used to train and score models for this project will be made available in a future version of this paper.
To keep the code accessible, we chose a range of models that are popular, are accessible through a single API (huggingface.co), and achieve high GLUE scores [31]. The selection of models, their references, their approximate GLUE scores, and their respective sizes in millions of parameters are given in Table 2.
Class | Model | Ref. | Params | GLUE
---|---|---|---|---
L | ALBERT (xxL) | [13] | 222M | 89.4
| RoBERTa (large) | [14] | 355M | 88.5
| BERT (large) | [6] | 335M | 88.1
| Electra (large) | [11] | 335M | 89.4
B | BERT (base) | [6] | 110M | 79.5
| XLNet (base) | [34] | 117M | 79.6
| RoBERTa (base) | [14] | 124M | 79.6
| Electra (base) | [11] | 110M | 82.7
| DeBERTa (base) | [9] | 184M |
S | Electra (small) | [11] | 13M | 77.4
| ConvBERT (med) | [12] | 18M | 79.7
| MobileBERT | [26] | 25M | 78.5
| ALBERT (base) | [13] | 17M |
| DistilBERT | [22] | 67M | 77.0
While each of these models is a transformer-based pretrained language model, they differ in some key respects. The RoBERTa models were trained for longer and drop the next-sentence-prediction task of the original BERT [14]. A novel aspect of the ALBERT models is their weight-sharing mechanism [13]; in the base version this is used to drastically reduce model size, and in the large version, to increase the size of the hidden units and feed-forward layers while keeping the number of parameters manageable. The Electra models differ greatly in the way they were trained: they are trained as a generator-discriminator pair in which one model generates tokens in masked positions while the other attempts to distinguish the generated tokens from the original ones [11]. The DeBERTa model uses a disentangled attention mechanism in which the relative positions of words are considered rather than their absolute positions. Furthermore, the DeBERTaV3 model we use is trained similarly to Electra, with a discriminator that detects replaced tokens, combined with gradient-disentangled embedding sharing [9].
The difference between XLNet and BERT is that XLNet predicts tokens autoregressively while considering all permutations of the token prediction order [34]. The ConvBERT model is novel in that it replaces some of the attention heads in BERT with span-based dynamic convolutions that can be computed more efficiently [12]. The MobileBERT architecture uses linear layers, called bottlenecks, to reduce the dimension of the attention matrix computations, which is efficient and effective due to the rank of the attention matrix [26]. Lastly, DistilBERT has 6 layers instead of 12 and is trained using knowledge distillation; the authors claim it is a much smaller and faster version of BERT with 97% of the performance [22].
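Since all of these models are exposed through the same interface, swapping one for another only requires changing the checkpoint name. The sketch below shows the pattern we have in mind; the checkpoint identifiers are illustrative Hugging Face names rather than a statement of the exact weights used.

```python
# Sketch of loading any of the models in Table 2 through the same API.
# Checkpoint names are illustrative; num_labels depends on the prompt's
# score range.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINTS = [
    "albert-xxlarge-v2", "roberta-large", "google/electra-large-discriminator",
    "bert-base-uncased", "microsoft/deberta-v3-base",
    "google/electra-small-discriminator", "distilbert-base-uncased",
]

def load(checkpoint: str, num_labels: int):
    """Return a tokenizer and an encoder with a fresh classification head."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels)
    return tokenizer, model

tokenizer, model = load("bert-base-uncased", num_labels=4)
```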
Name | Dim. | Description
---|---|---|
Sentence-BERT | 364 | The output of the encoder component of Sentence-BERT.
TF-IDF | 100-300 | The transformation induced by the largest 100-300 eigenvectors of the TF-IDF training matrix.
text-overlap | 15 | The number of minutiae that intersect with the prompt.
key words, bigrams and trigrams | 90 | The key terms are extracted and near matches in the target text are counted. |
text-stats | 10 | A number of key statistics like length, average word length, etc. |
Inspired by previous work on both essays and short answer questions, we wanted to ensemble a range of language models with a feature-based model that captures the components found to be important in rule-based approaches [10]. It has been our experience that ensembles of models of a different nature yield better gains in performance than ensembles of similar models. It is in this vein that our feature model includes several key analogues of the features that [10] considered important. These features are summarized in Table 3.
Firstly, we consider the document embedding given by Sentence-BERT [20], a term-frequency inverse-document-frequency (TF-IDF) model, a set of text overlaps, and the frequencies of key n-grams for n between 1 and 3. To evaluate text overlaps, we use minutiae: any spaces, numbers, or punctuation are removed from the response and the prompt, at which point we count all overlapping strings of between 5 and 20 characters, making a total of 15 dimensions. To evaluate "near matches" we use the text difference library difflib (https://docs.python.org/3/library/difflib.html). There is a threshold for the near-match cutoff that ranges between 0.5, where half the characters are correct, and 1, where all the characters are correct; this threshold is varied as part of hyperparameter tuning. The reason for this approach is that short answer scoring should disregard spelling; the near matches make the keyword features more robust to spelling errors. Lastly, we use a number of typical statistics concerned with word and sentence lengths. We standardize each feature to have mean 0 and standard deviation 1.
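The sketch below illustrates two of these feature groups, the minutia overlaps and the spelling-tolerant keyword counts via difflib; the keyword list and cutoff are placeholders, and the exact bookkeeping will differ from our implementation.

```python
# Sketch of the text-overlap ("minutia") and near-match keyword features.
# The keyword list and cutoff value are placeholders.
import re
from difflib import get_close_matches

def minutiae(text: str) -> str:
    """Strip spaces, digits, and punctuation, keeping only lower-case letters."""
    return re.sub(r"[^a-z]", "", text.lower())

def overlap_features(response: str, prompt: str) -> list:
    """Count overlapping substrings of lengths 5 through 19 (15 dimensions)."""
    r, p = minutiae(response), minutiae(prompt)
    counts = []
    for n in range(5, 20):
        substrings = {r[i:i + n] for i in range(len(r) - n + 1)}
        counts.append(sum(s in p for s in substrings))
    return counts

def keyword_features(response: str, keywords, cutoff: float = 0.8) -> list:
    """Count near matches of each keyword; the cutoff is a tuned hyperparameter."""
    tokens = response.lower().split()
    return [len(get_close_matches(k, tokens, n=max(len(tokens), 1), cutoff=cutoff))
            for k in keywords]
```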
Once all the features are compiled, we summarize each document as a single feature vector of dimension between 579 and 779. A multilayer perceptron model is fit to the training set and evaluated on the development set for a fixed learning rate and batch size. The learning rate, batch size, TF-IDF dimension, and near-match cutoff are subjected to a hyperparameter optimization with 20 trials. The goal of the feature model is to produce a model of sufficient performance, and of a sufficiently distinct nature, to ensemble with our pretrained language models.
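A minimal version of this feature model, assuming a single hidden layer and scikit-learn's multilayer perceptron, might look as follows; the hidden-layer size is our assumption, while the learning rate and batch size are the quantities tuned above.

```python
# Sketch of the feature model: standardize the concatenated features and fit
# a small multilayer perceptron. The hidden-layer size is an assumption.
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_feature_model(X_train, y_train, lr=1e-3, batch_size=32):
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(256,), learning_rate_init=lr,
                      batch_size=batch_size, max_iter=200, random_state=0))
    return model.fit(X_train, y_train)
```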
Figure 1: The structure of our general ensemble of models. The linear transformations are the outputs of the classification heads, which give log-probabilities.
Given a selection of models, the structure of our ensemble is simple: we take the log-probabilities of each model on the development set as the training set of a logistic regression, and the output of the logistic regression is taken as the ensemble output. In this regime, the test set is only considered at this final stage. The structure that we use for ensembling is depicted in Figure 1.
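The ensembling step therefore reduces to fitting one logistic regression on stacked log-probabilities, as in the sketch below (ours, with scikit-learn standing in for whatever solver is used).

```python
# Sketch of the ensemble: concatenate each member model's per-class
# log-probabilities on the development set, fit a logistic regression, and
# apply it once to the test set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ensemble_predict(dev_logprobs, y_dev, test_logprobs):
    """Each argument list holds one (n_samples, n_classes) array per model."""
    X_dev, X_test = np.hstack(dev_logprobs), np.hstack(test_logprobs)
    clf = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
    return clf.predict(X_test)
```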
3 Results
We optimized each model using a Xeon E5-2620 v4 @ 2.10GHz with an Nvidia RTX 8000 with 48GB of on-board memory, which accommodated the batch sizes required by the larger models. Since the original test set for the competition has not been made available, we use the results on the publicly available test set, as was done in the studies comparable to our own. The individual models and the ensemble results are shown in Table 5.
QWK by prompt (1-10) and mean
Model | Ref. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | mean
Baseline | [24] | 0.719 | 0.719 | 0.592 | 0.688 | 0.752 | 0.775 | 0.606 | 0.571 | 0.760 | 0.691 | 0.687 |
Ramachandran et al. | [19] | 0.86 | 0.78 | 0.66 | 0.70 | 0.84 | 0.88 | 0.66 | 0.63 | 0.84 | 0.79 | 0.78 |
Riordan et al. | [21] | 0.795 | 0.718 | 0.684 | 0.700 | 0.830 | 0.790 | 0.648 | 0.554 | 0.777 | 0.735 | 0.723 |
Kumar et al. | [10] | 0.872 | 0.824 | 0.745 | 0.743 | 0.845 | 0.858 | 0.725 | 0.624 | 0.843 | 0.832 | 0.791 |
Features | 0.722 | 0.664 | 0.680 | 0.712 | 0.752 | 0.776 | 0.624 | 0.514 | 0.789 | 0.721 | 0.702 | |
ALBERT (xxL) | 0.843 | 0.838 | 0.673 | 0.675 | 0.764 | 0.793 | 0.712 | 0.636 | 0.805 | 0.758 | 0.750 |
BERT(L) | 0.826 | 0.791 | 0.715 | 0.732 | 0.762 | 0.817 | 0.715 | 0.640 | 0.827 | 0.742 | 0.757 | |
Electra(L) 3 | 0.866 | 0.855 | 0.717 | 0.764 | 0.835 | 0.836 | 0.715 | 0.669 | 0.837 | 0.724 | 0.776 | |
RoBERTa (L) 2 | 0.843 | 0.844 | 0.704 | 0.679 | 0.815 | 0.844 | 0.719 | 0.662 | 0.838 | 0.762 | 0.771 | |
BERT (base) | 0.849 | 0.772 | 0.692 | 0.722 | 0.845 | 0.840 | 0.676 | 0.598 | 0.829 | 0.717 | 0.749 | |
DeBERTa V3 (base) 1 | 0.881 | 0.865 | 0.677 | 0.690 | 0.766 | 0.830 | 0.722 | 0.690 | 0.840 | 0.733 | 0.769 | |
Electra (base) | 0.846 | 0.810 | 0.714 | 0.724 | 0.819 | 0.811 | 0.727 | 0.683 | 0.841 | 0.724 | 0.770 | |
RoBERTa (base) | 0.809 | 0.821 | 0.683 | 0.708 | 0.809 | 0.834 | 0.716 | 0.616 | 0.829 | 0.730 | 0.755 | |
XLNet (base) | 0.822 | 0.784 | 0.675 | 0.680 | 0.814 | 0.824 | 0.686 | 0.663 | 0.804 | 0.718 | 0.747 |
ConvBERT | 0.838 | 0.762 | 0.624 | 0.665 | 0.835 | 0.813 | 0.678 | 0.599 | 0.830 | 0.716 | 0.728 | |
DistilBERT | 0.822 | 0.780 | 0.659 | 0.685 | 0.813 | 0.787 | 0.687 | 0.610 | 0.822 | 0.715 | 0.752 |
Electra (small) | 0.838 | 0.780 | 0.702 | 0.731 | 0.805 | 0.832 | 0.654 | 0.629 | 0.818 | 0.727 | 0.751 | |
MobileBERT | 0.846 | 0.779 | 0.698 | 0.740 | 0.829 | 0.824 | 0.676 | 0.611 | 0.816 | 0.725 | 0.754 | |
Ensemble (best 2) | 0.884 | 0.868 | 0.714 | 0.728 | 0.790 | 0.839 | 0.730 | 0.690 | 0.850 | 0.761 | 0.786 | |
Ensemble (best 3) | 0.882 | 0.891 | 0.722 | 0.750 | 0.813 | 0.822 | 0.734 | 0.702 | 0.865 | 0.779 | 0.796 |
From a production standpoint, it is not only important that the QWK is high relative to the agreement between two human raters; it is also important that the SMD is within appropriate bounds [32]. This is where the feature-based model does very well. What matters here is that the SMD is low across models, and we find that the ensembles do remarkably well in this regard. If we consider a typical violation to be a model with an SMD of over 0.15, most violations occur with the small models. Generally speaking, as we found with QWK, the SMDs are far better for large models than for small models, but the ensembles do better still; even ensembles of models with large SMDs have much better controlled SMDs than the individual models in the ensemble. If nothing else, this points to the fact that ensembles of the kind we consider here give us a way of controlling SMDs.
SMD by prompt
Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Features | 0.000 | 0.036 | 0.109 | 0.095 | 0.079 | 0.030 | 0.082 | 0.021 | 0.030 | 0.073 |
ALBERT | 0.002 | 0.016 | 0.064 | 0.069 | 0.000 | 0.032 | 0.018 | 0.026 | 0.002 | 0.027 |
BERT(L) | 0.044 | 0.129 | 0.059 | 0.038 | 0.003 | 0.063 | 0.064 | 0.077 | 0.052 | 0.074 |
Electra(L) | 0.041 | 0.142 | 0.044 | 0.043 | 0.021 | 0.036 | 0.068 | 0.069 | 0.065 | 0.100 |
RoBERTa (L) | 0.025 | 0.077 | 0.097 | 0.208 | 0.029 | 0.089 | 0.043 | 0.014 | 0.015 | 0.086 |
BERT (base) | 0.034 | 0.009 | 0.011 | 0.068 | 0.030 | 0.062 | 0.022 | 0.124 | 0.060 | 0.092 |
DeBERTa V3 (base) | 0.034 | 0.090 | 0.101 | 0.221 | 0.031 | 0.071 | 0.033 | 0.137 | 0.039 | 0.027 |
Electra (base) | 0.056 | 0.101 | 0.125 | 0.090 | 0.028 | 0.079 | 0.043 | 0.068 | 0.021 | 0.117 |
RoBERTa (base) | 0.141 | 0.132 | 0.079 | 0.156 | 0.039 | 0.007 | 0.078 | 0.047 | 0.050 | 0.073 |
XLNet (base) | 0.080 | 0.082 | 0.145 | 0.021 | 0.036 | 0.107 | 0.064 | 0.122 | 0.019 | 0.211
ConvBERT | 0.281 | 0.246 | 0.454 | 0.291 | 0.008 | 0.134 | 0.063 | 0.221 | 0.026 | 0.192 |
DistilBERT | 0.023 | 0.048 | 0.102 | 0.016 | 0.028 | 0.050 | 0.104 | 0.016 | 0.037 | 0.211
Electra (small) | 0.057 | 0.158 | 0.081 | 0.057 | 0.070 | 0.055 | 0.103 | 0.117 | 0.019 | 0.099 |
MobileBERT | 0.023 | 0.091 | 0.045 | 0.072 | 0.060 | 0.044 | 0.060 | 0.106 | 0.060 | 0.069 |
Ensemble (best of 2) | 0.004 | 0.062 | 0.011 | 0.093 | 0.037 | 0.010 | 0.024 | 0.002 | 0.024 | 0.047 |
Ensemble (best of 3) | 0.016 | 0.065 | 0.004 | 0.082 | 0.045 | 0.022 | 0.012 | 0.008 | 0.017 | 0.081 |
We find it interesting that there are a number of prompts on which pretrained models succeed where rule-based methods did not do as well. For example, several individual large pretrained language models exceed the state-of-the-art for prompts 2 and 8, yet neither the pretrained models nor their ensembles come close to the results of [10] for prompt 10. It should also be noted that, on some prompts, even our naive feature model performed better than many of the pretrained language models tested.
The three models that performed best on the development set were the large RoBERTa model and the large and base Electra models, in that order, although this ordering is not reflected on the test set. When we ensemble the two and three models that perform best on average, we obtain results that are on par with, and even slightly exceed, the state-of-the-art results of [10]. That said, the combination of these models has approximately 875 million parameters. We do not believe it is a coincidence that the best models on both the test and development sets were among the largest models with the highest GLUE benchmark scores.
4 Discussion
It seems to be the case that pretrained transformer-based language models on their own can generally be outperformed at automated short answer scoring by a mixture of regular expressions and other classical classifiers [10]. While we managed to exceed the benchmarks with an ensemble of three very large networks, doing so with such computational power is somewhat unsatisfying, and it is clear that such a solution is not feasible from a production standpoint.
We firmly believe that the ideal solution, from a production and accuracy standpoint, would be the ensemble of an efficient network like those of [16] and a rule-based method like that of [10]. Firstly, this would require more careful consideration of the features used and, secondly, careful consideration of how to incorporate these features into the score prediction. For example, concatenating the features to the set of features used by the classification head might yield better results [30]. As noted above, we have generally found that ensembles of models of a different nature yield greater gains than ensembles of similar models.
In terms of architectures, there are a range of models we did not consider that are worth mentioning, such as Reformer, Longformer, FNet, Linformer, Performer, and MPNet. These are all variations on the transformer architecture that modify or approximate the attention mechanism in ways that may prove to be an advantage in short answer scoring.
Lastly, we mention that there is still work to be done in linking the output of the language model to the rubric. Most work on explainable AI has focused on token-level importance; however, more semantically complex elements of a rubric are not simply stated in terms of the presence or absence of particular tokens. Knowing which features work well might be useful in interpreting the directions in the feature space used to assign scores. This would be an important step in establishing a validity argument for these methods beyond their pure statistical performance.
References
- [1] Alikaniotis, Dimitrios, Helen Yannakoudakis, and Marek Rei. “Automatic text scoring using neural networks.” arXiv preprint arXiv:1606.04289 (2016).
- [2] Attali, Yigal, and Jill Burstein. “Automated essay scoring with e-rater® V. 2.” The Journal of Technology, Learning and Assessment 4, no. 3 (2006).
- [3] Basu, Sumit, Chuck Jacobs, and Lucy Vanderwende. ”Powergrading: a clustering approach to amplify human effort for short answer grading.” Transactions of the Association for Computational Linguistics 1 (2013): 391-402.
- [4] Burstein, Jill, Claudia Leacock, and Richard Swartz. ”Automated evaluation of essays and short answers.” (2001).
- [5] Dong, Fei, Yue Zhang, and Jie Yang. "Attention-based recurrent convolutional neural network for automatic essay scoring." In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 153-162. 2017.
- [6] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. ”Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
- [7] Franceschi, Luca, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. "Forward and reverse gradient-based hyperparameter optimization." In International Conference on Machine Learning, pp. 1165-1173. PMLR, 2017.
- [8] Foltz, Peter W., Darrell Laham, and Thomas K. Landauer. “The intelligent essay assessor: Applications to educational technology.” Interactive Multimedia Electronic Journal of Computer-Enhanced Learning 1, no. 2 (1999): 939-944.
- [9] He, Pengcheng, Jianfeng Gao, and Weizhu Chen. “DeBERTaV3: Improving DeBERTa using Electra-style pre-training with gradient-disentangled embedding sharing.” arXiv preprint arXiv:2111.09543 (2021).
- [10] Kumar, Yaman, Swati Aggarwal, Debanjan Mahata, Rajiv Ratn Shah, Ponnurangam Kumaraguru, and Roger Zimmermann. "Get it scored using AutoSAS—an automated system for scoring short answers." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 9662-9669. 2019.
- [11] Clark, Kevin, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ”Electra: Pre-training text encoders as discriminators rather than generators.” arXiv preprint arXiv:2003.10555 (2020).
- [12] Jiang, Zihang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, and Shuicheng Yan. ”Convbert: Improving bert with span-based dynamic convolution.” arXiv preprint arXiv:2008.02496 (2020).
- [13] Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ”Albert: A lite bert for self-supervised learning of language representations.” arXiv preprint arXiv:1909.11942 (2019).
- [14] Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. ”Roberta: A robustly optimized bert pretraining approach.” arXiv preprint arXiv:1907.11692 (2019).
- [15] Loshchilov, Ilya, and Frank Hutter. ”Decoupled weight decay regularization.” arXiv preprint arXiv:1711.05101 (2017).
- [16] Ormerod, Christopher M., Akanksha Malhotra, and Amir Jafari. ”Automated essay scoring using efficient transformer-based language models.” arXiv preprint arXiv:2102.13136 (2021).
- [17] Page, Ellis Batten. “Project Essay Grade: PEG.” (2003).
- [18] Rodriguez, Pedro Uria, Amir Jafari, and Christopher M. Ormerod. “Language models and Automated Essay Scoring.” arXiv preprint arXiv:1909.09482 (2019).
- [19] Ramachandran, L., Cheng, J., and Foltz, P. 2015. Identifying patterns for short answer scoring using graph-based lexico-semantic text matching. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, 97–106.
- [20] Reimers, Nils, and Iryna Gurevych. ”Sentence-bert: Sentence embeddings using siamese bert-networks.” arXiv preprint arXiv:1908.10084 (2019).
- [21] Riordan, B.; Horbach, A.; Cahill, A.; Zesch, T.; and Lee, C. M. 2017. Investigating neural architectures for short answer scoring. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, 159–168
- [22] Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
- [23] Shermis, Mark D. “State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration.” Assessing Writing 20 (2014): 53-76.
- [24] Shermis, Mark D. ”Contrasting state-of-the-art in the machine scoring of short-form constructed responses.” Educational Assessment 20, no. 1 (2015): 46-65.
- [25] Sun, Lichao, Kazuma Hashimoto, Wenpeng Yin, Akari Asai, Jia Li, Philip Yu, and Caiming Xiong. ”Adv-bert: Bert is not robust on misspellings! generating nature adversarial samples on bert.” arXiv preprint arXiv:2003.04985 (2020).
- [26] Sun, Zhiqing, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. ”Mobilebert: a compact task-agnostic bert for resource-limited devices.” arXiv preprint arXiv:2004.02984 (2020).
- [27] Sung, Chul, Tejas Indulal Dhamecha, and Nirmal Mukhi. ”Improving short answer grading using transformer-based pre-training.” In International Conference on Artificial Intelligence in Education, pp. 469-481. Springer, Cham, 2019.
- [28] Akiba, Takuya, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. "Optuna: A next-generation hyperparameter optimization framework." In KDD, 2019.
- [29] Taghipour, Kaveh, and Hwee Tou Ng. "A neural approach to automated essay scoring." In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1882-1891. 2016.
- [30] Uto, Masaki, Yikuan Xie, and Maomi Ueno. "Neural Automated Essay Scoring Incorporating Handcrafted Features." In Proceedings of the 28th International Conference on Computational Linguistics, pp. 6077-6088. 2020.
- [31] Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. ”GLUE: A multi-task benchmark and analysis platform for natural language understanding.” arXiv preprint arXiv:1804.07461 (2018).
- [32] Williamson, David M., Xiaoming Xi, and F. Jay Breyer. ”A framework for evaluation and use of automated scoring.” Educational measurement: issues and practice 31, no. 1 (2012): 2-13.
- [33] Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac et al. ”Huggingface’s transformers: State-of-the-art natural language processing.” arXiv preprint arXiv:1910.03771 (2019).
- [34] Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. "Xlnet: Generalized autoregressive pretraining for language understanding." Advances in Neural Information Processing Systems 32 (2019).