The Key English Test (KET; https://www.cambridgeenglish.org/exams-and-tests/key) and the Preliminary English Test (PET; https://www.cambridgeenglish.org/exams-and-tests/preliminary) are examinations that assess the communication ability of test takers in practical situations. PET and KET include a variety of question types, covering speaking, reading, listening, and writing. In writing questions, examinees are required not only to write an essay precisely and correctly but also to respond to the Task Requirements (TRs). According to the official scoring instructions, an essay with poor task achievement should be assigned a low grade. Some examples of TR writing questions are shown in Table 1.
Lines beginning with * are task requirement questions.
Timely and accurate evaluation of test-takers' performance, especially feedback on the TR achievements of their essays, is essential for improving their writing and communication skills. Such evaluation usually takes experienced teachers a large amount of time, as each essay needs to be graded carefully. However, due to limited teacher resources, most English learners cannot get timely assessments of the quality of their essays. Although many researchers have studied how to score an essay automatically, most current approaches provide only a total score without supporting detail [taghipour2016neural, dong2017attention, wang2018automatic]. A bare score gives students little guidance on how to improve their writing skills.
In the field of natural language processing, machine reading comprehension (MRC) has been studied for a long time and can be employed to provide details on how well TRs have been achieved in students’ essays. In the MRC field, the second version of the Stanford Question Answering Dataset (SQuAD 2.0) is the most widely used benchmark dataset for evaluating model performance [rajpurkar2018know]. However, our experiments show that even a model that achieves the best performance on SQuAD 2.0 cannot be directly used in educational scenarios, as it suffers a significant performance degradation. The main reason is that SQuAD 2.0 is a general-purpose open-source dataset, and educational corpora differ substantially from general-purpose ones.
To alleviate these problems, we construct a real-world educational dataset and propose an end-to-end framework based on the MRC approach, with ELECTRA as the backbone, to detect whether students respond to TRs in their essays [clark2020electra]. Our framework locates the sentences in student essays that respond to the requirements. Experiments on an educational dataset show that the proposed framework achieves an accuracy of 0.93 and an F1 score of 0.85, outperforming many existing approaches. We believe that this research can help automatic essay scoring systems provide interpretable grading results, thereby helping students improve their writing skills.
2 Related Work
2.1 Automated Writing Evaluation
Automated writing evaluation (AWE) has been studied for a long time in both industry and academia [klebanov2020automated, page1966imminence, ke2019automated, wang2020learning]. Since Ellis B. Page published his work in 1966, plenty of automated scoring products and applications, e.g., E-rater, have emerged. Building on AWE, many works on automatic essay scoring (AES) have been published [page1966imminence, taghipour2016neural, dong2017attention, wang2018automatic]. However, these works mainly focused on giving a holistic score, which measures the overall quality of an essay. Taghipour explored several neural network models for AWE and outperformed strong baselines without requiring any feature engineering [taghipour2016neural]. Dong proposed a reinforcement learning framework that incorporates quadratic weighted kappa as guidance to optimize the scoring system [dong2017attention]. In recent years, a variety of studies have focused on fine-grained essay evaluation [carlile2018give, ke2019give, persing2014modeling]. Persing presented a feature-rich approach to scoring the prompt adherence of essays [persing2014modeling]. Ke not only predicted a thesis strength score but also provided the reasons behind it [ke2019give]. Nevertheless, none of these works addresses the problem of detecting TR achievements in AES systems.
2.2 Machine Reading Comprehension
At the document level, finding a student’s response to a TR is similar to the extractive and abstractive MRC task, in which the model is given several reading materials and is expected to answer related questions based on them. MRC models are expected to understand both the context and the question and to be able to perform reasoning. In TR writing, we can regard a student’s essay as the reading material, and the model is supposed to find answers to the TRs in it. If no answer is found, the essay does not respond to the requirement.
Early MRC work used long short-term memory or convolutional neural networks as encoders of questions and contexts and blended in a variety of attention mechanisms, e.g., attention sum and gated attention [hochreiter1997long, o2015introduction, kadlec2016text, dhingra2016gated]. Approaches that mimic how humans do reading comprehension, such as multi-hop reasoning, were also proposed [shen2017reasonet, liu2017stochastic, shen2017empirical]. Recently, pre-trained language models, e.g., BERT, RoBERTa, ALBERT, BART, and ELECTRA, have become the prevalent encoder architectures in MRC and have achieved state-of-the-art performance [devlin2018bert, liu2019roberta, lan2019albert, lewis2019bart, clark2020electra]. Besides these improvements and optimizations of the encoder module, research on the decoder of MRC models has also started to draw attention. Zhang et al. proposed an answer verification method and achieved state-of-the-art single-model performance on the SQuAD 2.0 benchmark with an ELECTRA encoder module [zhang2020retrospective, rajpurkar2018know, rajpurkar2016squad].
Another line of research on MRC is how to construct high-quality datasets, and much work has been done in this direction [nguyen2016human, hewlett2016wikireading, trischler2016newsqa, joshi2017triviaqa, rajpurkar2016squad]. Among them, SQuAD is one of the most widely used reading comprehension benchmarks [rajpurkar2016squad]. However, Rajpurkar et al. showed that success on SQuAD does not ensure robustness to distracting sentences [rajpurkar2018know, jia2017adversarial]. One reason is that SQuAD focuses on questions for which a correct answer is guaranteed to exist in the context document. Therefore, models only need to select the span that seems most related to the question, instead of checking that the answer is actually entailed by the text. Based on SQuAD, Rajpurkar et al. proposed SQuAD 2.0. To do well on SQuAD 2.0, systems must not only answer questions when possible but also determine when no answer is supported by the paragraph and abstain from answering [rajpurkar2018know].
Compared with previous AWE works, to the best of our knowledge, we are the first to use an MRC approach to detect TR achievements in the educational domain. We also construct a Student Essay Dataset (SED), which can be regarded as a counterpart of SQuAD 2.0 in the educational field, and we explore how to combine SQuAD 2.0 and SED.
3 Problem Statement
In the TRs writing evaluation task, let $\mathcal{Q}$ denote a collection of task requirement questions and $q$ denote a single question in $\mathcal{Q}$. Let $q_i$ denote the $i$-th token in the question such that $q = \{q_1, q_2, \dots, q_m\}$. $e = \{e_1, e_2, \dots, e_n\}$ is an essay written by a student, where $e_j$ denotes the $j$-th token in the essay $e$. The problem is then defined as follows: for each requirement $q \in \mathcal{Q}$, is there a sequential text span $s$ in $e$ that responds to the requirement $q$? If such a span exists, $q$ is achieved and $s$ needs to be extracted from the essay $e$; if not, $q$ is not achieved by $e$.
4.1 The Overall Workflow
An overview of our proposed framework is displayed in Figure 1. Our approach is composed of three principal modules: the question normalization (QN) module, the MRC module, and the response locating (RL) module.
4.2 Question Normalization Module
Task requirement questions are proposed from the perspective of examiners, but essays are from examinees’ perspective. This perspective gap brings difficulties to the MRC model. To eliminate the difference, we normalize texts of task requirements with two rule-based methods: switching personal pronouns and deleting redundant words.
4.2.1 Switch Personal Pronouns
We use pre-defined rules to replace personal pronouns in the sentence. For example, the question “What will you do in the summer vacation?” may receive a student’s answer “I will travel to Japan”. If we switch the personal pronoun “you” in the question, it is normalized as “What will I do in the summer vacation?”. The normalized question is closer to the wording of the essay, which makes the task easier for the models.
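The pronoun-switching rule can be illustrated with a minimal Python sketch. The mapping `PRONOUN_MAP` and the function name `switch_pronouns` are hypothetical: the paper does not list its actual rules, so this only shows the general idea of a rule-based second-person to first-person substitution.

```python
import re

# Hypothetical second-person -> first-person mapping (an assumption;
# the paper's actual rule set is not given).
PRONOUN_MAP = {"you": "I", "your": "my", "yours": "mine", "yourself": "myself"}

# Match whole words only, case-insensitively.
_PATTERN = re.compile(r"\b(" + "|".join(PRONOUN_MAP) + r")\b", re.IGNORECASE)


def switch_pronouns(question: str) -> str:
    """Replace second-person pronouns with their first-person counterparts."""
    return _PATTERN.sub(lambda m: PRONOUN_MAP[m.group(0).lower()], question)


print(switch_pronouns("What will you do in the summer vacation?"))
```

A substitution this simple does not repair verb agreement (e.g. “are you” would become “are I”), so a practical rule set would likely need additional cases.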
4.2.2 Delete Redundant Words
We define a set of question words such as “what” and “how”, and then delete redundant words that appear before them. For example, we omit the word “explain” in the question “explain why you need to change the time” and change it to “why you need to change the time”. Similarly, we delete the words “remind Sally” in “remind Sally where you arranged to meet”; together with pronoun switching, this yields the normalized question “Where I arranged to meet”.
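As a rough illustration of this rule, the sketch below drops every word that precedes the first question word. The list `QUESTION_WORDS` and the function name `delete_redundant_words` are assumptions made for the example; the paper only names “what” and “how” explicitly.

```python
# Hypothetical list of question words (only "what" and "how" are named in the text).
QUESTION_WORDS = ("what", "why", "how", "where", "when", "who", "which")


def delete_redundant_words(question: str) -> str:
    """Drop all words that appear before the first question word."""
    tokens = question.split()
    for i, token in enumerate(tokens):
        if token.lower().strip("?,.") in QUESTION_WORDS:
            return " ".join(tokens[i:])
    # No question word found: leave the requirement unchanged.
    return question


print(delete_redundant_words("explain why you need to change the time"))
print(delete_redundant_words("remind Sally where you arranged to meet"))
```

Requirements that contain no question word are returned unchanged here; how the original system treats such cases is not stated.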
4.3 Machine Reading Comprehension Module
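In the MRC module, the normalized task requirement question $q$ and the whole essay $e$ are concatenated with a special symbol $\texttt{[SEP]}$. The entire input sequence to the MRC model can be described as $x = \{\texttt{[CLS]}, q_1, \dots, q_m, \texttt{[SEP]}, e_1, \dots, e_n\}$, where the full length of $x$ is denoted by $l$.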
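A minimal sketch of how such an input sequence might be assembled is given below. The token layout (`[CLS]`, question, `[SEP]`, essay) follows the usual BERT-style convention and is an assumption here, as is the helper name `build_input`; the 512-token limit with truncation at the end is the one stated for the encoder.

```python
def build_input(question_tokens, essay_tokens, max_len=512):
    """Concatenate question and essay tokens into one input sequence.

    Assumed layout: [CLS] question [SEP] essay, truncated at the end.
    Returns the sequence and the index of the first essay token.
    """
    sequence = ["[CLS]"] + list(question_tokens) + ["[SEP]"] + list(essay_tokens)
    essay_offset = len(question_tokens) + 2  # skip [CLS], the question, and [SEP]
    return sequence[:max_len], essay_offset


tokens, offset = build_input(["why", "I"], ["because", "it", "rains"])
print(tokens, offset)
```

Returning the essay offset alongside the sequence is a design choice for this sketch: later steps need it to tell question positions from essay positions.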
4.3.1 ELECTRA Encoder
We use the discriminator module of ELECTRA to encode each token in the input sequence $x$ into a dense vector. The max length of $x$ is 512, and tokens exceeding the max length are truncated at the end. We use $h_i$ to represent the final-layer output of ELECTRA at position $i$, which corresponds to the $i$-th token in $x$. We use $H = \{h_1, h_2, \dots, h_l\}$ to denote the last-layer hidden states of the input sequence, where $h_i \in \mathbb{R}^{d}$. The ELECTRA model is based on a multi-layer bidirectional Transformer encoder with multi-head attention [vaswani2017attention]. Therefore, $h_i$ is able to capture the context of the $i$-th token from both the preceding tokens $x_{<i}$ and the following tokens $x_{>i}$. The attention function in ELECTRA and the output of layer $j$ are shown in eq.(1). In layer $j$, the inputs $Q_j$, $K_j$, $V_j$ are computed by $H^{j-1}W^{Q}$, $H^{j-1}W^{K}$, $H^{j-1}W^{V}$, respectively, where $H^{j-1}$ denotes the output of the previous layer and $W^{Q}$, $W^{K}$, $W^{V}$ are projection matrices. Thus $Q_j$, $K_j$, $V_j$ have the same dimensions $l \times d$, where $d$ is the dimension of vectors in $H$.
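For readers unfamiliar with the attention function, here is a small pure-Python sketch of single-head scaled dot-product attention as defined by Vaswani et al. It is only an illustration: ELECTRA itself uses multi-head attention with learned projections, residual connections, and layer normalization, none of which are shown, and the function names are chosen for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0], [4.0]]))
```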
4.3.2 Span Prediction
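We employ a fully connected layer with a softmax operation, which takes the last-layer hidden states $H$ as input and outputs the start and end probabilities of each token in the input sequence $x$, as shown in eq.(2). Let $p^{s}_i$ and $p^{e}_i$ represent the start and end probabilities of the $i$-th token in $x$ respectively; thus $P^{s} = \{p^{s}_1, \dots, p^{s}_l\}$ is the start probability vector for all tokens in $x$ and $P^{e} = \{p^{e}_1, \dots, p^{e}_l\}$ is the end probability vector for all tokens in $x$.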
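The span prediction head can be sketched as follows, assuming the common design in which one linear projection per boundary produces a logit for every token and the softmax is taken across token positions. The weight vectors `w_start` and `w_end`, the biases, and the function name `span_probabilities` are placeholders for this example, not the paper's trained parameters.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def span_probabilities(hidden_states, w_start, w_end, b_start=0.0, b_end=0.0):
    """Map each token's hidden vector to start/end probabilities.

    Assumption: a linear layer gives one start logit and one end logit
    per token, and softmax normalizes over all token positions.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    start_logits = [dot(h, w_start) + b_start for h in hidden_states]
    end_logits = [dot(h, w_end) + b_end for h in hidden_states]
    return softmax(start_logits), softmax(end_logits)


p_start, p_end = span_probabilities(
    [[1.0, 0.0], [0.0, 1.0]], w_start=[1.0, 0.0], w_end=[0.0, 1.0]
)
print(p_start, p_end)
```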
4.3.3 Answerable Verification
Motivated by Zhang’s work, we introduce the same answerable verification step to determine whether an essay responds to a task requirement [zhang2020retrospective].
We feed $h_1$, the representation vector of the $\texttt{[CLS]}$ token encoded by ELECTRA, into the external front verification (E-FV) module. E-FV uses a fully connected layer followed by a softmax operation to calculate the classification logits $(\hat{y}_{ans}, \hat{y}_{na})$, where $\hat{y}_{ans}$ is a scalar indicating the answered logit and $\hat{y}_{na}$ is a scalar indicating the no-answer logit. We calculate their difference as the external verification score $score_{ext}$ with Equation 3a.
Threshold-based answerable verification (TAV) takes the start and end probabilities as input and outputs the no-answer score $score_{diff}$ computed with Equations 3b, 3c and 3d. $p^{s}_1$ and $p^{e}_1$ in Equation 3c represent the start and end probabilities of the $\texttt{[CLS]}$ token in $x$.
Rear verification (RV) combines $score_{ext}$ and $score_{diff}$ to obtain the final answerable score $v$ as shown in Equation 3e, where $\beta_1$ and $\beta_2$ are weights. The MRC model decides whether question $q$ is answered by essay $e$ by comparing $v$ against a threshold $\delta$, where $\delta$ is a hyperparameter.
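The three verification steps can be combined as in the sketch below. The sign conventions are assumptions borrowed from common Retro-Reader style implementations, since the text here does not spell them out: the external score is taken as the no-answer logit minus the answered logit, the threshold-based score as the null score minus the best span score, and a question counts as answered when the combined score falls below the threshold. The default weights and threshold are placeholders, not the paper's values.

```python
def external_score(logit_answered, logit_no_answer):
    """E-FV score. Assumed direction: larger means 'more likely unanswered'."""
    return logit_no_answer - logit_answered


def threshold_score(p_start, p_end):
    """TAV no-answer score: null score minus best span score (assumed direction)."""
    n = len(p_start)
    if n < 2:
        raise ValueError("need at least one token besides [CLS]")
    score_null = p_start[0] + p_end[0]  # position 0 is the [CLS] token
    score_has = max(
        p_start[i] + p_end[j] for i in range(1, n) for j in range(i, n)
    )
    return score_null - score_has


def is_answered(p_start, p_end, logit_answered, logit_no_answer,
                beta1=0.5, beta2=0.5, delta=0.0):
    """RV: weighted sum of both scores, compared against threshold delta.

    beta1, beta2 and delta are placeholder values, not the paper's settings.
    """
    v = (beta1 * threshold_score(p_start, p_end)
         + beta2 * external_score(logit_answered, logit_no_answer))
    return v < delta


print(is_answered([0.1, 0.8, 0.1], [0.1, 0.1, 0.8], 2.0, -1.0))
```

The exhaustive search over span pairs is quadratic in the sequence length, which is acceptable for inputs of at most 512 tokens.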
4.4 Response Locating Module
The RL module takes the start probabilities, the end probabilities, and the answerable score as input, and decides the start and end positions according to these inputs. A straightforward way to do this is to choose the positions with the highest start and end probabilities as the start and end positions, respectively. All tokens between these two positions are extracted as the student’s response to the task requirement. If the start or the end position falls before the first essay token in the input sequence, in which case part of the question is marked, or if the positions are contradictory, e.g., the start position is greater than the end position, the module decides that the question is not responded to. Finally, the framework outputs both the binary label indicating whether the student’s essay responds to the task requirement and the location of the responsive span if it is available.
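The simple selection rule described above might look like the following sketch. The parameter `essay_offset` (index of the first essay token in the input sequence) and the boolean `answerable` (the outcome of the verification step) are names introduced for this example only.

```python
def locate_response(p_start, p_end, essay_offset, answerable=True):
    """Return (start, end) token positions of the response, or None.

    Picks the positions with the highest start and end probabilities and
    rejects spans that touch the question or are contradictory.
    """
    if not answerable:
        return None
    start = max(range(len(p_start)), key=lambda i: p_start[i])
    end = max(range(len(p_end)), key=lambda i: p_end[i])
    if start < essay_offset or end < essay_offset:
        return None  # part of the question was marked
    if start > end:
        return None  # contradictory positions
    return start, end


# Positions 0-2 hold [CLS], one question token and [SEP]; the essay starts at 3.
print(locate_response([0.0, 0.1, 0.1, 0.7, 0.1], [0.0, 0.1, 0.1, 0.1, 0.7], essay_offset=3))
```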
5.1.1 SQuAD 2.0
SQuAD 2.0 is the most widely used benchmark in machine reading comprehension literature. It combines the first version of SQuAD with over 50,000 unanswerable questions written adversarially by crowd workers to look similar to answerable ones [rajpurkar2016squad]. It contains 130,319 training examples from 442 Wikipedia articles and 11,873 development examples from 78 Wikipedia articles, where each example is made of a question and an article. This dataset requires that a model should not only answer the question when it is possible but also abstain from answering when there is no answer in the reading materials.
5.1.2 SED
SED is a real-world student essay dataset that we collected from a third-party K-12 online learning platform. It consists of 9,450 examples in the training set and 3,357 examples in the test set, where each example contains an essay and a requirement question. There are 3,367 different essays and 593 different task requirement questions in the training set. In the test set, the numbers of essays and requirement questions are 1,655 and 185 respectively. To obtain labels, annotators first decide whether an essay responds to the question and label it positive or negative accordingly. Then, for all positive examples, annotators mark the start and end positions of the span in the essay that responds to the question.
Although SQuAD 2.0 and SED share similarities in terms of task and structure, there are many differences between them. First, SED is in the educational domain, whereas SQuAD 2.0 is drawn from Wikipedia. Second, answers in SED are much longer than answers in SQuAD 2.0. Fig 3 illustrates that most answers in SQuAD 2.0 are between 5 and 20 characters long, while answers in SED are between 25 and 100 characters long. The average length of answers in SQuAD 2.0 is 18.0, while the average length of answers in SED is 103.4. Finally, there are more grammatical errors in SED because its essays are written by second language learners. Therefore, a model that achieves the best performance on SQuAD 2.0 may not be directly deployable in educational scenarios.
5.2 Experimental Setting
In this section, we describe three sets of experiments as follows.
Set 1. This set aims to show that existing SOTA models on SQuAD 2.0 cannot be directly deployed in educational scenarios. In Set 1, all models are trained on SQuAD 2.0 but evaluated on the test set of SED. SAN was trained for 50 epochs with learning rate on SQuAD 2.0 [liu2017stochastic]. Pre-trained language models such as BERT, RoBERTa, ALBERT, and BART were acquired from Hugging Face (https://huggingface.co). Our ELECTRA-based approach was trained for 2 epochs with the default parameters of [zhang2020retrospective].
Set 2. This set aims to show that MRC approaches are effective solutions to TRs writing evaluation when trained on the educational corpus. The training parameters of the models are consistent with those in Set 1. The difference is that all models are trained on SED.
Set 3. This set explores how we can utilize SQuAD 2.0 to further improve model performance on SED. Following the idea that models pre-trained on massive data can be a good warm-up for subsequent finetuning, we first train MRC models on SQuAD 2.0 to acquire basic models, and then finetune them on SED for optimal performance.
In all experiments, we use two evaluation metrics. One is Accuracy (Acc.), which measures the performance of the model on the binary classification task of predicting whether the essay answers the TR. The other is the Answer Overlap F1 score (F1), which measures how well the model predicts the location of the answer span. Accuracy and F1 can be calculated by Equation 4.
$$\mathrm{Acc} = \frac{N_{c}}{N}, \qquad P = \frac{T_{o}}{T_{p}}, \qquad R = \frac{T_{o}}{T_{g}}, \qquad \mathrm{F1} = \frac{2PR}{P + R} \tag{4}$$
In Equation 4, $N$ indicates the number of examples in the test set, and $N_{c}$ is the number of examples that are correctly predicted by the framework. $T_{o}$ represents the number of identical tokens in both the predicted span and the gold span. $T_{p}$ is the total number of tokens in the predicted span and $T_{g}$ is the total number of tokens in the gold span.
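The two metrics can be sketched in Python as follows. The token overlap is computed as a bag-of-tokens intersection, in the style of the SQuAD evaluation script, and an empty predicted or gold span scores 1.0 only when both are empty; both choices are assumptions, since the text does not specify these details.

```python
from collections import Counter


def accuracy(predicted_labels, gold_labels):
    """Fraction of examples whose answered / not-answered label is correct."""
    correct = sum(p == g for p, g in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)


def overlap_f1(predicted_tokens, gold_tokens):
    """Token-overlap F1 between a predicted span and a gold span."""
    if not predicted_tokens or not gold_tokens:
        # Assumed convention: full credit only if both spans are empty.
        return float(predicted_tokens == gold_tokens)
    common = Counter(predicted_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(overlap_f1("I will travel to Japan".split(), "travel to Japan".split()))
```

In the second call the predicted span has 5 tokens, 3 of which are in the gold span, so precision is 0.6, recall is 1.0, and F1 is 0.75.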
5.3 Results and Evaluation
5.3.1 Results of Set 1.
Table 2 shows that existing SOTA models on SQuAD 2.0 suffer a significant performance degradation on SED. All models in Table 2 are well finetuned on SQuAD 2.0, and their F1 scores on the SQuAD 2.0 dev set are all at least 0.66. However, when they are evaluated on the SED test set, performance drops dramatically. For example, RoBERTa and our method achieve F1 scores of 0.83 and 0.89 on the SQuAD 2.0 dev set, but both drop to an F1 score of 0.49 on the SED test set.
5.3.2 Result of Sets 2&3.
Table 3 shows the results of Set 2 and Set 3. From the results of Set 2, we conclude that MRC approaches can solve the TRs writing evaluation problem. Compared with models trained on SQuAD 2.0 (Set 1), models trained on SED achieve significantly better results on the SED test set. Our framework in Set 2 achieves the best F1 score of 0.84 and the best accuracy of 0.91, outperforming our framework in Set 1 by 0.23 in accuracy and 0.35 in F1 score.
| Training Dataset | Methods | SQuAD 2.0 dev Acc. | SQuAD 2.0 dev F1 | SED test Acc. | SED test F1 |
| SQuAD 2.0 (Set 1) | SAN | 0.70 | 0.66 | 0.57 | 0.31 |
If we compare the results of Set 2 and Set 3, we find that optimal performance is obtained by first training models on SQuAD 2.0 and then finetuning them on SED. Specifically, the F1 score of SAN increases by 0.11, and the F1 score of BERT increases by 0.08. Similarly, the accuracy also increases significantly in Set 3.
| Training Dataset | Methods | SED test Acc. | SED test F1 |
| SED (Set 2) | SAN | 0.67 | 0.58 |
| SQuAD 2.0 & SED (Set 3) | SAN | 0.79 (+0.12) | 0.69 (+0.11) |
| | BERT | 0.84 (+0.05) | 0.76 (+0.08) |
| | ALBERT | 0.86 (+0.02) | 0.80 (+0.03) |
| | RoBERTa | 0.88 (+0.07) | 0.80 (+0.09) |
| | BART | 0.89 (+0.07) | 0.82 (+0.09) |
| | Ours | 0.93 (+0.02) | 0.85 (+0.01) |
Compared with Set 2, the accuracy of BART and of our framework increases by 0.07 and 0.02 respectively. Furthermore, our approach achieves the best performance in each of the three sets of experiments and outperforms a variety of SOTA approaches.
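The gains reported in parentheses for Set 3 appear to be absolute differences from Set 2, so the Set 2 scores they imply can be recovered by subtraction. The sketch below does this arithmetic on the Set 3 numbers copied from the table; the helper names are illustrative, and because the published values are rounded to two decimals, the implied scores may differ from the true ones in the last digit.

```python
# Set 3 results copied from the table: (accuracy, gain), (F1, gain).
SET3 = {
    "SAN": ((0.79, 0.12), (0.69, 0.11)),
    "BERT": ((0.84, 0.05), (0.76, 0.08)),
    "ALBERT": ((0.86, 0.02), (0.80, 0.03)),
    "RoBERTa": ((0.88, 0.07), (0.80, 0.09)),
    "BART": ((0.89, 0.07), (0.82, 0.09)),
    "Ours": ((0.93, 0.02), (0.85, 0.01)),
}


def implied_set2(score, gain):
    """Set 2 score implied by a Set 3 score and its reported absolute gain."""
    return round(score - gain, 2)


def implied_set2_table(set3=SET3):
    """Implied (accuracy, F1) in Set 2 for every method."""
    return {
        method: (implied_set2(*acc), implied_set2(*f1))
        for method, (acc, f1) in set3.items()
    }


for method, scores in implied_set2_table().items():
    print(method, scores)
```

For SAN the implied values (0.67, 0.58) match the Set 2 row of the table, and for our framework they match the 0.91 accuracy and 0.84 F1 stated in the text, which supports reading the gains as absolute differences.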
In this paper, we proposed an MRC-based approach which can not only detect whether an essay responds to a requirement question but also find where the essay answers the question. Our experiments and analysis demonstrate that SQuAD 2.0 is very different from our educational dataset, so existing SOTA models on SQuAD 2.0 cannot be directly deployed in educational scenarios. Instead, we propose to first train a basic model on SQuAD 2.0 and then finetune it on educational data for optimal performance. We believe the proposed framework can help automatic essay scoring systems provide detailed grading results, thereby helping students improve their writing skills.
This work was supported in part by National Key R&D Program of China, under Grant No. 2020AAA0104500 and in part by Beijing Nova Program (Z201100006820068) from Beijing Municipal Science & Technology Commission.