Automatic essay scoring (AES) systems were introduced to alleviate the workload of assessors and to improve the feedback cycle in the teaching-learning process. Since their introduction, a number of research activities have been carried out Page (1966); Attali (2011); Zupanc and Bosnić (2017); Tashu and Horváth (2018, 2019). The task of AES has been regarded as a machine learning problem in which a model learns to approximate the assessment process using handcrafted features and supervised learning Attali and Burstein (2006); Foltz et al. (1999); Deerwester et al. (1999). Some of the most frequently extracted features are essay length, sentence length, grammar correctness Burstein et al. (1998), readability Zesch et al. (2015) and textual coherence Chen and He (2013). These handcrafted features, however, require considerable human effort and usually a complex implementation for each new feature.
Recently, natural language processing (NLP) has achieved some success in switching from such linear models via sparse and handcrafted feature inputs to nonlinear neural network models via dense inputs Goldberg (2016). The recent success comes from the application of deep learning models on NLP tasks where these models are capable of modelling intricate patterns in data without handcrafted features. Since these methods do not depend on manually produced features, they can also be used to solve problems in an end-to-end fashion. SENNA Collobert et al. (2011) et al. (2015) are two remarkable examples in NLP that work without external task-specific knowledge. These successes in the processing of natural language tasks attracted scientists in the field of AES Dimitrios Alikaniotis (2016); Taghipour and Ng (2016); Dong and Zhang (2016). The most commonly used model in AES Dimitrios Alikaniotis (2016); Taghipour and Ng (2016)
Scoring essays with handcrafted features such as essay length, sentence length, grammar correctness or readability faces the following problems. First, students can cheat the system by writing and submitting a well-structured essay that is off-topic. A well-written essay that does not address the question topic may still receive a good score from an AES system because of surface linguistic features such as text structure. If every essay submitted to an AES system is evaluated as a standard essay input, user confidence in the AES engine may degrade Higgins et al. (2006). Second, creating handcrafted features is tedious, time-consuming and sometimes inefficient. Therefore, new approaches that score essays semantically without handcrafted features are needed.
To score essays semantically, latent semantic models such as Latent Semantic Analysis (LSA) Deerwester et al. (1999); Shermis and Burstein (2003); Shermis and Hamner (2013), Generalized Latent Semantic Analysis (GLSA) Islam and Latiful Hoque (2010), Latent Dirichlet Allocation (LDA), Bilingual Topic Model (BLTM) Kakkonen et al. (2006), Word Mover’s Distance (WMD) Tashu and Horváth (2018, 2019) and Content Vector Analysis (CVA) Attali and Burstein (2006); Attali (2011) were proposed and applied.
The success of deep learning in natural language processing and the problems related to manual feature engineering led to the use of neural network methods for AES. Taghipour and Ng (2016), Cummins and Rei (2018), Wang et al. (2018), Jin et al. (2018), Farag et al. (2018) and Zhang and Litman (2018) mostly used long short-term memory (LSTM) based neural network models for AES. Most of these studies combined neural models with hand-crafted features, since their objective was mostly to score essays using discourse structure and coherence attributes. Thus, our work looks at advancing AES by exploring other architectures that can capture more in-depth and essential contextual information in scoring essays semantically.
In this study, we introduce an extended latent semantic model utilizing the advantages of deep learning. Our model, called DeLAES, addresses the above stated problems by modifying and combining existing latent semantic models. DeLAES uses a multichannel CNN with different window sizes with the Max Pooling operation and also bidirectional gated recurrent neural networks to capture more in-depth and essential contextual information.
Instead of using an input representation based on bag-of-words, DeLAES considers an essay as a sequence of words with rich contextual structure, and it retains maximum contextual information in its projected latent semantic representation. First, the model projects each word in its context onto a low-dimensional continuous feature vector using skip-gram models. Second, it directly learns and captures the contextual features at the word n-gram level using a three-channel convolutional neural network. Third, instead of aggregating all word n-gram features, the model uses max-pooling to keep only the essential semantic concepts and form a feature vector at the essay level. Fourth, the essay-level feature vector of each channel is passed on to a bidirectional gated recurrent neural network that performs a nonlinear transformation to extract high-level semantic information from the word order of each channel. The high-level semantic representations extracted from each channel are then aggregated to obtain the final semantic representation at the essay level, which is fed into a fully connected layer for score prediction using the sigmoid function. The contribution of this paper can be summarized as follows:
A new latent semantic model is introduced that captures the previous as well as subsequent contextual features for contextual structures at both the word n-gram and essay levels using multichannel convolutional pooling operations with bidirectional gated recurrent neural networks;
An experiment is performed on the data set provided as part of the automated essay scoring competition on the Kaggle website that contains essays by students from eight different prompts (essay discussion questions) or topics;
Experimental results are compared to published state-of-the-art results and also with baseline approaches based on deep learning methods.
The rest of the paper is organized as follows. Section 2 describes related work relevant to our research. Section 3 presents the proposed DeLAES architecture, Section 4 presents the overall experimental settings, implementation and evaluation of the proposed system, and Section 5 presents the results of our system DeLAES and other baselines. Section 6 presents the discussion and conclusions.
2 AES Related works
Research on automatically evaluating and scoring essays has been ongoing for more than a decade, using machine learning (ML), NLP and artificial neural networks (NN) to evaluate answers to essay questions. In this section, we present the characteristics of the majority of AES systems developed by commercial organizations as well as AES systems introduced by the academic community.
Project Essay Grade (PEG) was the first AES system, developed by Ellis Page and his colleagues Page (1966). It evaluates and scores essays by measuring trins and proxes. A trin is defined as an intrinsic higher-level variable, such as punctuation, fluency, diction, grammar, the number of paragraphs, average sentence length, the length of an essay in words, counts of other textual units, etc., which as such cannot be measured directly and has to be approximated by means of other measures, called proxes Shermis et al. (2002); Wang (2005); Zhang (2014). However, the exact set of textual features underlying each dimension and the details concerning the derivation of the overall score were not disclosed Ben-Simon and Bennett (2007); Shermis et al. (2002). The system uses regression methods to score new essays based on a training set.
Intelligent Essay Assessor (IEA) relies on LSA Foltz et al. (1999) to infer characteristics that define the content, organizational, and development-related attributes of the essay. IEA also extracts other essential features that measure lexical sophistication as well as grammatical, mechanical, stylistic and organizational aspects of essays using natural language processing techniques.
The system uses different attributes to measure the above aspects in essays, such as context (e.g. semantic similarity), lexical sophistication, grammar (grammatical errors and grammatical error types), mechanics (e.g. spelling, upper and lower case), style, organization and development (e.g. sentence-to-sentence coherence, general essay coherence and topic development). IEA requires a training set consisting of a representative sample of essays manually annotated by experts.
IntelliMetric was designed and introduced in 1991 by Vantage Learning as a proprietary AES system Shermis and Burstein (2003). IntelliMetric analyzes the semantic, syntactic and discursive attributes of essays to form a composite sense of meaning. The attributes used in IntelliMetric can be grouped into two main groups (content and structure). The content attributes are used to evaluate the subject covered, the breadth of content, the support of advanced concepts, the logic of discourse, the cohesion and consistency of the purpose and main idea. The structural attributes evaluate spelling, punctuation, letter cases, grammar, syntactic literacy, syntactic diversity, sentence complexity, use, legibility, and subject-binding. The system produces multiple predictions using linear analysis, a Bayesian approach and LSA, and combines them into a single final score.
E-Rater Attali and Burstein (2006) works by extracting features such as grammatical errors, word usage errors, writing mechanics errors, presence of essay-based discourse elements, development of essay-based discourse elements, style flaws, content vector analysis (CVA) to evaluate current word usage, and an alternative, differentiated measurement of word usage based on the relative frequency of a word in high-scoring versus low-scoring essays. It also includes attributes that take into account the correct use of prepositions and collocations and the diversity of sentence structures. Finally, regression modelling is applied to predict the score.
The Lexile Writing Analyzer is part of the Lexile Framework for Writing Smith (2009), developed by MetaMetrics. The system is score-, genre-, prompt- and punctuation-independent and uses the Lexile Writer measure, which is an estimate of the student’s ability to express language in writing based on factors related to semantic complexity (the level of words used) and syntactic sophistication (how words are combined into sentences). The system uses a number of attributes that serve as approximate values for writing ability. Lexile perceives the ability to write as a fundamental individual characteristic.
SAGrader Brent and Townsend (2006) is a proprietary AES system developed by IdeaWorks, Inc. SAGrader combines a number of linguistic, statistical and artificial intelligence approaches to evaluate essays automatically. SAGrader operates as follows. The instructor first specifies a task in a prompt. The instructor then creates a section that identifies the "desired properties", i.e. the vital elements of knowledge (facts) to be included in a correct answer, as well as the relationships between these elements, via a semantic network. Fuzzy logic allows the program to recognize the features in students’ essays and compare them with the desired ones. Finally, an expert system evaluates student essays based on the similarities between the desired and observed characteristics. Students immediately receive feedback in the form of their results, along with detailed comments on what they have done well and what remains to be done. The system provides both results and meaningful feedback through ontology-based information extraction.
A system that uses logical reasoning to recognize errors in the statements of an essay was proposed by Gutierrez et al. (2012). The system first transforms the text into a set of relevant clauses using the Open Information Extraction methodology and integrates them into the domain ontology using a manually selected vocabulary mapping. The system then determines whether these statements contradict the ontology and, thus, the domain knowledge. This method regards falsity as inconsistency with respect to the domain.
The automated scoring engine called CRASE Lottridge et al. (2013), developed by Pacific Metrics, passes through the following three phases of the scoring process: Identification of inappropriate attempts, attribute extraction and scoring. The attribute extraction step is constructed around six aspects of the essay: ideas, typesetting, organization, voice, wording, conventions and written presentation. The system analyzes a sample of student responses that have already been scored to create a model of grader scoring behaviour. It is a Java-based application that runs as a web service. The system is adaptable to the configurations used to build machine learning models and to mix human and machine. The application also generates text-based and numerical feedback that can be used to improve the essays.
The first evaluation system made publicly available was Rudner’s Bayesian Essay Test Scoring system, called BETSY Rudner and Liang (2002). BETSY uses Naive Bayes models to classify texts into different classes (e.g. pass or fail) based on content (e.g. word unigrams and bigrams) and style attributes. The classification is based on the assumption that each attribute is independent of the others. BETSY has only served as a demonstration tool for the Bayesian approach to essay evaluation.
An automated evaluation engine with compiled and publicly available source code, called LightSIDE, was introduced in 2010 by Mayfield and Penstein-Rosé (2010). LightSIDE was developed as a tool for non-experts to effectively use text mining technology for a variety of purposes, including essay evaluation. It allows the selection of the set of attributes and algorithms to build a prediction model. The set of attributes focuses mainly on n-grams, part-of-speech tags and "counting" attributes, and the system also has a feature that allows the user to enter the code for new attributes manually.
Because existing LSA-based AES techniques do not consider the word sequence of sentences in the documents, and the creation of the word-by-document matrix is somewhat arbitrary, Chali and Hasan (2013) proposed an AES system that calculates the syntactic similarity between two sentences by parsing the corresponding sentences into syntactic trees and measuring the similarity between the trees. The so-called shallow semantic tree kernel method allows parts of semantic trees to be matched. The kernel function returns the similarity value between a pair of sentences based on their semantic structures.
Islam and Latiful Hoque (2010) developed an AES system using generalized latent semantic analysis (GLSA), which uses an n-gram-by-document matrix instead of the word-by-document matrix used in LSA. The grading procedure consists of the following steps: pre-processing of the training essays, removal of stop words, word stemming, selection of n-gram index terms, creation of the n-gram-by-document matrix, calculation of the singular value decomposition (SVD) of the n-gram-by-document matrix, dimensionality reduction of the SVD matrices and calculation of the similarity score. The main advantage of GLSA is that it preserves the word order in sentences. In order to reduce memory and time consumption without lowering the performance of automated scoring in comparison with human scoring, Zhang (2014) proposed incremental singular value decomposition as a part of incremental LSA to score essays when the dataset is massive.
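The GLSA-style steps listed above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the implementation of Islam and Latiful Hoque (2010): stop-word removal and stemming are skipped, whitespace tokenization is used, and the toy documents, the bigram setting and the rank are hypothetical.

```python
import numpy as np

def word_ngrams(text, n):
    """Return the word n-grams (as tuples) of a lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_document_matrix(docs, n=2):
    """Build an n-gram-by-document count matrix (rows: n-grams, columns: documents)."""
    vocab = sorted({g for d in docs for g in word_ngrams(d, n)})
    index = {g: i for i, g in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for g in word_ngrams(d, n):
            matrix[index[g], j] += 1.0
    return matrix, index

def reduced_document_vectors(matrix, rank):
    """Truncated SVD: keep the `rank` largest singular values; one row per document."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (np.diag(s[:rank]) @ vt[:rank]).T

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

docs = [
    "the cat sat on the mat",           # hypothetical reference essay
    "the cat sat on the sofa",          # shares most bigrams with the reference
    "stock prices fell sharply today",  # unrelated text
]
matrix, index = ngram_document_matrix(docs, n=2)
vectors = reduced_document_vectors(matrix, rank=2)
sim_close = cosine(vectors[0], vectors[1])
sim_far = cosine(vectors[0], vectors[2])
```

In this toy setting the second document ends up much closer to the reference than the third one in the reduced space, which is the kind of similarity score a GLSA-based grader would build on.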
An extension of existing AES systems was introduced in Zupanc and Bosnić (2017) by incorporating additional semantic coherence and consistency attributes. They designed coherence attributes by transforming sequential parts of an essay into the semantic space and measuring changes between them to estimate the coherence of the text, and consistency attributes that detect semantic errors using information extraction and logic reasoning. The resulting system, named SAGE (Semantic Automated Grader for Essays), provides semantic feedback for the writer and achieves significantly higher grading accuracy compared with other state-of-the-art AES systems.
An investigation of the effectiveness of using semantic vector representations for the task of AES was performed in Jin et al. (2017). According to the evaluation results on the standard English dataset, the effectiveness brought by the proposed semantic representations of essays depends on the learning algorithms and the evaluation metrics used. On the other hand, the effectiveness of individual semantic features is stable with respect to different numbers of dimensions.
In Fauzi et al. (2017), an AES system based on n-grams and cosine similarity was described. N-grams were used for feature extraction and were modified to split by word instead of by letter so that the word order would be considered. Based on the evaluation results, this system achieved its best correlation of 0.66 using unigrams on questions that do not consider the order of words in the answer. For questions that consider the order of the words in the answer, bigrams gave the best correlation value of 0.67.
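To make the role of word order concrete, the following sketch compares word unigram and word bigram cosine similarity for a reference answer and a reordered answer. It is a minimal illustration and not the system of Fauzi et al. (2017); the two sentences and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter
from math import sqrt

def word_ngram_counts(text, n):
    """Count the word n-grams of a lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

reference = "water boils at one hundred degrees"   # hypothetical reference answer
answer = "degrees hundred one at boils water"      # same words, reversed order

unigram_sim = cosine_similarity(word_ngram_counts(reference, 1), word_ngram_counts(answer, 1))
bigram_sim = cosine_similarity(word_ngram_counts(reference, 2), word_ngram_counts(answer, 2))
```

The unigram similarity is maximal because both texts contain the same words, while the bigram similarity drops to zero because no adjacent word pair is shared. This is why unigrams suit questions where order is irrelevant and bigrams suit questions where it matters.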
The architecture of an AES system based on a rubric, which combines automated scoring with human scoring, was proposed in Yamamoto et al. (2018). The proposed rubric has five evaluation viewpoints (contents, structure, evidence, style, and skill) as well as 25 evaluation items, which are subdivided by viewpoint. First, the system automatically scores 11 items included in style and skill, such as sentence style, syntax, usage, readability, lexical richness, and so on. Then it predicts the scores of style and skill from these items’ scores using multiple regression models. It also predicts the contents score from the cosine similarity between topics and descriptions.
The AES systems introduced above are based on regression methods applied to a set of carefully designed handcrafted features. The process of feature engineering is the most challenging part of building AES systems. Moreover, it is challenging for humans to consider all the factors that are involved in assigning a score to an essay. To alleviate these issues, AES systems that automatically learn features and the relation between an essay and its score were introduced using deep neural networks Dimitrios Alikaniotis (2016); Taghipour and Ng (2016); Dong and Zhang (2016).
Recently, neural network models have been introduced into AES, making the development of handcrafted features unnecessary or at least optional. Dimitrios Alikaniotis (2016) and Taghipour and Ng (2016) presented AES models that used Long Short-Term Memory (LSTM) networks. In contrast, Dong and Zhang (2016) used a Convolutional Neural Network (CNN) model for essay scoring by applying two CNN layers, first on the word level and then on the sentence level. Later, Dong et al. (2017) presented another work that uses attention pooling to replace the mean-over-time pooling after the convolutional layer at both the word level and the sentence level. Dong and Zhang (2016) showed that a two-layer CNN outperformed other baselines (e.g., Bayesian Linear Ridge Regression) in both in-domain and domain adaptation experiments on the ASAP dataset. This model was later improved by employing attention layers. Specifically, the model learns text representation with LSTMs, which can model the coherence and co-reference among sequences of words and sentences, and uses attention pooling to capture the more relevant words and sentences that contribute to the final quality of essays Dong et al. (2017).
Song et al. (2017) proposed a deep model using gated recurrent units and bidirectional gated recurrent units for identifying discourse modes in an essay. Chen et al. (2014) studied the usefulness of prompt-independent text features and achieved a human-machine rating agreement slightly lower than that obtained with all text features for prompt-dependent essay scoring prediction. A constrained multitask pairwise preference learning approach was proposed by Cummins and Rei (2018), combining essays from multiple prompts for training. However, as shown by Dong and Zhang (2016) and Zesch et al. (2015), straightforward applications of existing AES methods to prompt-independent AES lead to poor performance. In the same year, Jin et al. (2018) proposed a two-stage deep neural network (TDNN) learning framework to address the poor performance of the approach proposed by Cummins and Rei (2018). The TDNN utilizes the prompt-independent features to generate pseudo training data for the target prompt, on which a bidirectional LSTM is used to learn a rating model consuming semantic, part-of-speech, and syntactic signals. Their TDNN model outperforms the baselines and leads to a promising improvement in human-machine agreement.
Most of the proposed neural network models aim at scoring essays using discourse structure, coherence, part-of-speech, and semantic attributes. The results were promising, but more research has to be conducted to obtain a more accurate and robust AES system. Thus, our work looks at advancing AES by exploring other architectures that improve the predictive accuracy by capturing more in-depth and essential contextual information in scoring essays semantically.
In this paper, we propose a novel deep learning architecture for AES. The proposed architecture combines a bidirectional GRU (BGRU) with a multichannel convolutional layer and a Max-Pooling operation, and is referred to as Max-Pooling based BGRU using multichannel convolutional layer. The basic idea of the new architecture is based on the following considerations. The one-dimensional convolutional filters with different window sizes in the convolutional layer extract n-gram features at different positions of the essay and reduce the dimensions of the essay. Since not all the n-gram features extracted by the convolutional layer are relevant to the semantics of the essay, the Max-Pooling operation is used to select the most essential n-gram features. The BGRU is used to learn and extract the preceding and succeeding contextual features from the outputs of the Max-Pooling layer. The outputs from each BGRU channel are then aggregated using the fully connected layer, and the final score is predicted using the sigmoid layer.
3 Max-Pooling based Bidirectional Gated Recurrent Network using Multi-channel Convolutional Filters Model
The prediction task addressed in this paper can be briefly described as follows: Given a set of labeled essays $D = \{(e_i, y_i)\}_{i=1}^{N}$, where each essay $e_i$, a sequence of words, is assigned a score $y_i$, the task is to learn a scoring function $f$ such that the difference between the true score $y_i$ and the predicted score $\hat{y}_i = f(e_i)$ for any essay $e_i$ is minimal.
3.1 DeLAES Architecture
The proposed architecture is illustrated in Figure 1 which can be summarized as follows (detailed descriptions of its layers will be introduced in the following subsections):
The embedding layer maps the words of an input essay into latent vectors utilizing pre-trained word vectors.
Learning parameters, weights and biases are initialized with random values (they will be adjusted through training).
A three-channel convolutional layer extracts contextual features for each word together with its neighboring words, as defined by a window size (we used three different window sizes). The dense matrix representation of the essay is input to the convolutional layer along with the initialized parameters, and the convolutional layer then produces n-gram level features.
A max-pooling operation is applied that accepts the n-gram level feature representation and discovers salient word-n-gram features to form an essay-level feature vector.
The BGRU layer obtains the preceding and the succeeding contextual features. This layer extracts the final high-level semantic feature vector representation for each input word sequence essay.
The fully connected layer combines the semantic feature vector representation of each word to obtain a comprehensive semantic feature representation.
The comprehensive semantic feature representation of the essay is fed into the sigmoid function to get the predicted labels.
The loss of the prediction is computed using mean squared error.
An optimization algorithm is used for loss minimization, adjusting the weights and biases based on the computed loss. For this study, the RMSProp optimizer was used.
The above described process is repeated until the model reaches the desired or the highest accuracy possible. Afterwards, the trained model can be used for evaluation of new essays by predicting the corresponding scores.
In the next subsections, each layer of our model is described in detail.
3.1.1 Text Representation Layer
The first layer of our model projects each word $w_i$ into a $d$-dimensional latent space, resulting in a vector $x_i \in \mathbb{R}^{d}$. Given an input essay consisting of a sequence of $n$ words, its corresponding word embedding results in a matrix $X \in \mathbb{R}^{d \times n}$ with columns $x_1, x_2, \dots, x_n$. The resulting word embedding matrix is used as the input of the 1D convolution for feature extraction and is learned during training. The popular word embedding model proposed by Mikolov et al. (2013), trained using the skip-gram method by maximizing the average log probability of all words, is used in this work.
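For reference, the skip-gram objective mentioned above can be written as follows. The notation is that of Mikolov et al. (2013) and is not used elsewhere in this paper: $T$ is the number of training words, $c$ the size of the context window, $v_w$ and $v'_w$ the input and output vectors of word $w$, and $V$ the vocabulary size. The full softmax is shown for clarity only; in practice it is usually approximated (e.g. by negative sampling).

```latex
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```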
3.1.2 CNN Layer
CNNs, originally proposed for computer vision, have shown strong performance on natural language processing Collobert et al. (2011) and text classification tasks Wang et al. (2015); Zhang et al. (2015).
Once the dense representation of the input sequence is calculated using the embedding layer, it is fed into the recurrent layer of the network. However, it might be advantageous for the network to extract local (salient) features from the sequence before feeding it into the recurrent operation. This can be performed by applying a convolution layer on the output of the embedding layer.
The convolution operation can be considered as a sliding window based feature extraction. It is designed to capture the word n-gram contextual features.
A one-dimensional (1D) convolution layer of width $k$ works by moving a sliding window of size $k$ over the word embedding matrix $X$ of each essay Kim (2014) and applies the filter to each window in the sequence of vectors $x_i, x_{i+1}, \dots, x_{i+k-1}$ (the $i$th, $(i+1)$th, …, $(i+k-1)$th columns of $X$, respectively) for $i = 1, \dots, n-k+1$. Each window is concatenated into a dense vector $x_{i:i+k-1}$, which is transformed as defined in equation (1):
$$c_i = g\left(W x_{i:i+k-1} + b\right), \qquad x_{i:i+k-1} = [x_i; x_{i+1}; \dots; x_{i+k-1}] \tag{1}$$
for each $i = 1, \dots, n-k+1$, where $g$ is a non-linear activation function that is applied element-wise, $W \in \mathbb{R}^{m \times kd}$ is the feature transformation (the weights of the $m$ filters) and $b$ is the bias of the network. We used the rectified linear unit (ReLU) as the nonlinear activation function because it can improve the learning dynamics of the network and significantly reduces the number of iterations required for convergence in deep networks. At the convolutional layer, words within their contexts are projected to vectors that are close to each other if they are semantically similar. The output of the convolutional layer is a sequence of feature vectors whose length is proportional to the length of the input word sequence.
This filter is applied to each possible window of words $x_{1:k}, x_{2:k+1}, \dots, x_{n-k+1:n}$ in the essay to produce a feature map
$$C = [c_1, c_2, \dots, c_{n-k+1}] \tag{2}$$
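The sliding-window convolution can be sketched with NumPy as below. The function name, the toy sizes and the random weights are hypothetical; the code only illustrates how one filter bank of width $k$ turns an embedding matrix into a sequence of ReLU feature vectors, and how three channels with window sizes 2, 3 and 4 produce feature maps of different lengths.

```python
import numpy as np

def conv1d_features(X, W, b):
    """Apply one bank of filters of width k to every window of k word vectors.

    X: (d, n) embedding matrix, one column per word.
    W: (m, k * d) weights of m filters; b: (m,) biases.
    Returns C with shape (n - k + 1, m): one ReLU feature vector per window.
    """
    d, n = X.shape
    m = W.shape[0]
    k = W.shape[1] // d
    C = np.empty((n - k + 1, m))
    for i in range(n - k + 1):
        window = X[:, i:i + k].T.reshape(-1)     # concatenate x_i, ..., x_{i+k-1}
        C[i] = np.maximum(0.0, W @ window + b)   # ReLU activation
    return C

rng = np.random.default_rng(0)
d, n, m = 4, 7, 3   # toy sizes for illustration, not the settings used in the paper
X = rng.normal(size=(d, n))
# One feature map per channel; each channel has its own window size and weights.
channels = {k: conv1d_features(X, rng.normal(size=(m, k * d)), np.zeros(m))
            for k in (2, 3, 4)}
```

A window size of $k$ yields $n-k+1$ window positions, so wider windows give shorter feature maps for the same essay.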
3.1.3 Max-pooling layer
A sequence of local contextual feature vectors is extracted at the convolutional layer, one for each word n-gram. These local features need to be aggregated to obtain an essay-level feature vector representation. Since we want to extract words that have significant importance in the semantics of the essay, we apply a max-pooling operation on the convolutional contextual feature representation. The max pooling operation obtains the most useful local features produced by the convolutional layers, i.e., it selects the highest neuron activation value across all local feature vectors. The local feature vectors are thus reduced by a max pooling layer, resulting in a single $m$-dimensional vector $v$ as defined in equation (3).
$$v(j) = \max_{i = 1, \dots, n-k+1} c_i(j), \qquad j = 1, \dots, m \tag{3}$$
Here $v(j)$ denotes the $j$th element of the max pooling layer $v$, $c_i(j)$ is the $j$th element of the $i$th local feature vector $c_i$, and $m$ is the number of filters. The effect of the max-pooling operation is to get the most salient information across all window positions.
What we have described is the process by which one feature is extracted from one filter window size. Our proposed model uses 100 filters with different window sizes to obtain multiple features, and these features are then passed into a bidirectional gated recurrent unit for further processing. Therefore, the convolution layer can be seen as a function that extracts important feature vectors from n-grams. Since this layer provides n-gram level information to the subsequent layers of the neural network, it can potentially capture local contextual dependencies in the essay and consequently improve the performance of the model.
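As a small illustration of the pooling step, the sketch below takes the maximum of each filter over all window positions. The feature values are made up for the example, and the pooling is taken globally over all positions, following the description above.

```python
import numpy as np

def max_over_time(C):
    """C: (num_windows, m) local feature vectors. Returns v with v[j] = max_i C[i, j]."""
    return C.max(axis=0)

# Toy feature map: 3 window positions (rows) and 3 filters (columns).
C = np.array([[0.1, 2.0, 0.0],
              [0.7, 0.3, 0.0],
              [0.2, 1.5, 0.9]])
v = max_over_time(C)          # one value per filter
winners = C.argmax(axis=0)    # window position that produced each maximum
```

The `winners` array shows which n-gram position each filter responded to most strongly, which can be useful when inspecting what a trained filter has learned.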
3.1.4 Bidirectional Gated Recurrent Unit (BiGRU)
The sequence of word embeddings learned by the embedding layer and further processed by the convolution layer and the max-pooling operation is then passed to a BGRU network layer. The gated recurrent unit (GRU) was recently introduced as an alternative to the LSTM to make each recurrent unit adaptively capture dependencies of different time scales Chung et al. (2014). Similarly to the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, however, without having separate memory cells. Below are the updates performed at each time step $t$ in each direction of a bidirectional GRU parameterized by weight matrices $W_z, U_z, W_r, U_r, W_h, U_h$, where $x_t$ is the input at time step $t$.
$$z_t = \sigma(W_z x_t + U_z h_{t-1})$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1})$$
$$\tilde{h}_t = \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1})\right)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
where $\odot$ is an element-wise multiplication and $\sigma$ is the logistic sigmoid function. The activation $h_t$ at time $t$ is a linear interpolation between the previous activation $h_{t-1}$ and the candidate activation $\tilde{h}_t$. An update gate $z_t$ decides how much the unit updates its activation or content. The reset gate $r_t$ is used to control access to the previous state $h_{t-1}$ and compute a proposed update $\tilde{h}_t$. When off ($r_t$ close to $0$), the reset gate effectively makes the unit act as if it is reading the first symbol of an input sequence, allowing it to forget the previously computed state.
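A single GRU update can be sketched as follows. The sketch follows the formulation of Chung et al. (2014) with bias terms omitted; the parameter names, the helper `init_params`, and the toy dimensions are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update; p holds the matrices W_z, U_z, W_r, U_r, W_h, U_h (no biases)."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)              # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)              # reset gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev))   # candidate activation
    return (1.0 - z) * h_prev + z * h_cand                       # linear interpolation

def init_params(input_dim, hidden_dim, rng):
    """Hypothetical helper: small random weights for the six GRU matrices."""
    shapes = {"W": (hidden_dim, input_dim), "U": (hidden_dim, hidden_dim)}
    return {f"{m}_{g}": rng.normal(scale=0.1, size=shapes[m]) for m in "WU" for g in "zrh"}

rng = np.random.default_rng(0)
params = init_params(input_dim=3, hidden_dim=4, rng=rng)
h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):   # a toy sequence of 5 input vectors
    h = gru_step(x_t, h, params)
```

Because the new state is a convex combination of the previous state and a `tanh` output, every component of `h` stays strictly inside (-1, 1) when the initial state is zero.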
3.1.5 Bidirectional Models
The idea of bidirectional models (BRNN, BLSTM and BGRU) is to present each training input sequence forwards and backwards to two separate networks, both of which are connected to the same output layer. This means that for every point in a given sequence, the bidirectional model has complete, sequential information about all points before and after it. At each time step, the concatenation of the forward and backward hidden states produces the hidden state of the bidirectional model. The prediction at time step $t$ is computed by combining the output of the forward model $\overrightarrow{h}_t$ and that of the backward model $\overleftarrow{h}_t$, formulated as follows:
$$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$$
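The forward and backward passes and their concatenation can be sketched independently of the recurrent cell. To keep the example short, a plain tanh recurrent cell is used here as a stand-in for the GRU; the function names, weights and sizes are hypothetical.

```python
import numpy as np

def run_direction(xs, step, h0):
    """Run a recurrent `step(x_t, h_prev) -> h_t` over xs and return all hidden states."""
    states, h = [], h0
    for x_t in xs:
        h = step(x_t, h)
        states.append(h)
    return np.stack(states)

def bidirectional(xs, step_fwd, step_bwd, hidden_dim):
    """Return one row per time step: forward state concatenated with backward state."""
    h0 = np.zeros(hidden_dim)
    fwd = run_direction(xs, step_fwd, h0)
    bwd = run_direction(xs[::-1], step_bwd, h0)[::-1]   # re-align to original time order
    return np.concatenate([fwd, bwd], axis=1)

def make_cell(W, U):
    # Stand-in cell (plain tanh RNN), used instead of a GRU purely for brevity.
    return lambda x_t, h_prev: np.tanh(W @ x_t + U @ h_prev)

rng = np.random.default_rng(0)
fwd_cell = make_cell(rng.normal(scale=0.5, size=(4, 3)), rng.normal(scale=0.5, size=(4, 4)))
bwd_cell = make_cell(rng.normal(scale=0.5, size=(4, 3)), rng.normal(scale=0.5, size=(4, 4)))
xs = rng.normal(size=(6, 3))                 # toy sequence: 6 steps, 3 features
H = bidirectional(xs, fwd_cell, bwd_cell, hidden_dim=4)
```

Each row of `H` has twice the hidden size. Changing only the last input leaves the forward half of the earlier rows untouched but alters their backward half, which is how the model sees context on both sides of a position.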
3.1.6 Fully-connected Hidden Layer
The outputs obtained from each hidden state are concatenated to form the neural feature vector representation of the essay, and the concatenated feature vector is passed into the final sigmoid layer for score prediction:
$$\hat{y} = \sigma(W_o h + b_o)$$
where $h$ is the concatenation of the outputs of the recurrent layer, $W_o$ and $b_o$ are the parameters of the linear layer and $\sigma$ is the sigmoid function. The output of the layer is a normalized score of the essay. We also normalize all gold-standard scores to [0,1] and use them to train the network. The output of the network is then rescaled to the original score range, and the rescaled scores are used to evaluate the performance of the method.
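The normalization and rescaling of scores can be illustrated as follows. The score range 2 to 12 is a hypothetical example, the function names are made up, and rounding the rescaled prediction to the nearest integer is an assumption that the text does not state.

```python
import math

def normalize_score(score, low, high):
    """Map a gold score from [low, high] to [0, 1] for training."""
    return (score - low) / (high - low)

def rescale_score(norm, low, high):
    """Map a network output in [0, 1] back to the original range (rounding assumed)."""
    return round(norm * (high - low) + low)

def predict_normalized(features, weights, bias):
    """Sigmoid of a linear layer applied to an essay-level feature vector."""
    a = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-a))

low, high = 2, 12                                   # hypothetical prompt score range
target = normalize_score(7, low, high)              # training target in [0, 1]
y_hat = predict_normalized([0.0, 0.0], [1.0, 1.0], 0.0)
predicted = rescale_score(y_hat, low, high)         # back on the original scale
```

Since the sigmoid output always lies in (0, 1), the rescaled prediction can never fall outside the valid score range of the prompt.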
3.1.7 Objective and Optimization
We use the RMSProp optimization algorithm N. Dauphin et al. (2015) to minimize the mean squared error (MSE) loss function over the training data, defined as
$$\mathrm{MSE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
where $y_i$ is the gold-standard score and $\hat{y}_i$ is the score predicted by the model. Given $N$ training essays and their corresponding normalized gold-standard scores $y_i$ (the actual scores), the model computes the predicted normalized score $\hat{y}_i$ for every training essay, and the network parameters are then updated such that the MSE is minimized.
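A small sketch of the loss and of a single RMSProp parameter update is given below. It shows the commonly used form of RMSProp for one scalar parameter; the learning rate, decay and epsilon values are assumptions for illustration, not the settings used in the experiments.

```python
import math


def mse(gold, predicted):
    """Mean squared error between gold-standard and predicted normalized scores."""
    return sum((p - y) ** 2 for y, p in zip(gold, predicted)) / len(gold)


def rmsprop_update(param, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp step for a scalar parameter.

    `cache` is the running average of squared gradients; the default
    hyperparameters are assumed values chosen only for this example.
    """
    cache = decay * cache + (1.0 - decay) * grad ** 2
    param = param - lr * grad / (math.sqrt(cache) + eps)
    return param, cache


print(mse([0.0, 1.0], [0.5, 0.5]))  # 0.25

# Fit a single constant prediction w to the target 0.5 by gradient descent.
w, cache = 0.0, 0.0
for _ in range(1000):
    grad = 2.0 * (w - 0.5)          # derivative of (w - 0.5)^2
    w, cache = rmsprop_update(w, grad, cache)
```

Dividing by the running root-mean-square of the gradient keeps the step size roughly constant across parameters whose gradients differ in scale.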
4 Experimental Settings
In this section, we elucidate the empirical setup, data sets used, performance metrics, baselines, and experimental evaluation.
To determine the hyperparameters and to avoid overfitting, we divide the training data into two non-overlapping sets, called the training and validation datasets, respectively. In the experiments, the models were first trained on the training data using 10-fold cross-validation, and the training parameters were then optimized on the validation data. The weights of the neural network are initialized randomly, as suggested in N. Dauphin et al. (2015). The model is trained using mini-batch stochastic gradient descent. Each mini-batch consists of 128 training samples. To create mini-batches for training, we pad all essays in a mini-batch with a dummy token so that they have the same length. To eliminate the effect of the padding tokens during training, we mask them so that they do not contribute to the gradients.
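The padding and masking step can be sketched as below. The names `pad_batch` and `masked_mean` and the padding id 0 are assumptions for illustration; the point is only that padded positions carry a mask value of 0 and are therefore ignored by anything computed over the sequence.

```python
PAD_ID = 0  # assumed id of the dummy padding token


def pad_batch(batch, pad_id=PAD_ID):
    """Pad token-id sequences to the length of the longest one.

    Returns the padded sequences and a 0/1 mask marking the real tokens.
    """
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, mask


def masked_mean(values, mask):
    """Average per-position values over real (unmasked) positions only."""
    return sum(v * m for v, m in zip(values, mask)) / sum(mask)


padded, mask = pad_batch([[5, 6, 7], [8]])
print(padded)  # [[5, 6, 7], [8, 0, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0]]
print(masked_mean([2.0, 4.0, 9.0], [1, 1, 0]))  # 3.0
```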
Table 1: Hyperparameter settings.
| Hyperparameter | Value |
|---|---|
| CNN window size | 2, 3, 4 |
| Number of CNN filters | 100 |
| Number of BGRU hidden units | 128 |
| Number of epochs | 40 |
| Word embedding dimensions | 300 |
4.2 Data sets and Evaluation Methodology
We perform the experiments on the dataset provided for the Automated Essay Scoring competition on the Kaggle website (https://www.kaggle.com/c/asap-aes). The datasets contain student essays for eight different prompts (essay discussion questions). The anonymized students were from the USA and were drawn from three different grade levels: 7, 8, and 10 (aged 12, 13, and 15, respectively). Four datasets include essays of traditional writing genres such as persuasive, expository or narrative. The other four datasets are source-based, i.e. the students had to discuss questions referring to a previously read source document. At least two human expert graders scored each training set. The authors of the datasets already divided them into fixed training and test sets. Since the test labels have not been released, we split the training set into training and test sets to build scoring models and measure prediction accuracy, respectively. The characteristics of the datasets used are shown in Table 2.
We use 10-fold cross-validation to evaluate the proposed method, since the test set used in the competition is not publicly available. In each fold, 70% of the data is used as our training set, 10% as the validation set, and 20% as the test set. We train all models for 40 epochs and select the best model based on its performance on the validation set. The evaluation is conducted in a prompt-specific (essay topic) fashion. We also use dropout regularization to avoid overfitting Srivastava et al. (2014). The input of the dropout layer is the output of the recurrent layer. Dropout randomly selects entries of its input with probability p and sets them to zero. The remaining entries are copied to the output.
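The dropout operation as described above can be sketched in a few lines. This follows the description literally (selected entries are zeroed, the rest are copied unchanged); the rescaling of surviving entries that many library implementations apply is deliberately left out, and the function name and the use of `random.Random` are assumptions for illustration.

```python
import random


def dropout(values, p, rng):
    """Zero each entry with probability p; copy the other entries unchanged."""
    return [0.0 if rng.random() < p else v for v in values]


rng = random.Random(0)                     # fixed seed for repeatability
hidden = [0.2, -1.3, 0.7, 0.05]
print(dropout(hidden, 0.0, rng))           # nothing is dropped
print(dropout(hidden, 1.0, rng))           # everything is dropped
print(dropout(hidden, 0.5, rng))           # roughly half of the entries are zeroed
```

Dropout is applied only during training; at test time the recurrent output is passed on unchanged.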
We tokenize the essays using the NLTK tokenizer (http://www.nltk.org), lowercase the text, and normalize the gold-standard scores to the range [0,1]. To learn the representation of each essay, we use the freely available word2vec word embeddings (https://code.google.com/archive/p/word2vec/), which provide embeddings for 3 million words and phrases from Google News trained using the approach in Mikolov et al. (2013). During testing, we re-scale the system-generated normalized scores to the original range of scores and measure the performance using the quadratic weighted kappa scores described in the following subsection.
4.3 Evaluation Metric
Model validation in AES systems relies on comparing the score predicted by the model with the score given by the human raters Attali (2011). In this scenario, the scores from human judges are considered the gold standard and function as an explicit criterion for evaluating the performance of AES models. We use the most widely used evaluation metric, quadratic weighted kappa (QWK). QWK measures the degree of agreement between the automated scores and the resolved human scores, and is analogous to the correlation coefficient. The metric typically ranges from 0 to 1. If there is less agreement between the graders than expected by chance, it may go below 0.
Quadratic weighted kappa is calculated as follows. First, a weight matrix $W$ is constructed using Equation 15:
$$W_{i,j} = \frac{(i-j)^2}{(N-1)^2} \tag{15}$$
where $i$ and $j$ are the actual rating (assigned by a human annotator or by the assessor) and the predicted rating (assigned by an AES system), respectively, and $N$ is the number of possible ratings.
An observed score matrix $O$ is calculated such that $O_{i,j}$ denotes the number of essays that receive a rating $i$ by the human annotator and a rating $j$ by the AES system. An expected score (rating) matrix $E$ is calculated as the outer product of the histogram vectors of the two (actual and predicted) ratings, assuming that there is no correlation between rating scores. The matrix $E$ is then normalized such that the sum of elements in $E$ and the sum of elements in $O$ are the same. Finally, given the matrices $O$ and $E$, the QWK score is calculated as indicated in Equation 16:
$$\kappa = 1 - \frac{\sum_{i,j} W_{i,j} O_{i,j}}{\sum_{i,j} W_{i,j} E_{i,j}} \tag{16}$$
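The QWK computation described above can be sketched directly from the three matrices. The function name and argument layout are assumptions for illustration; ratings are taken to be integers between `min_rating` and `max_rating`, and the sketch assumes at least two possible ratings and a non-degenerate expected matrix (otherwise the divisions are undefined).

```python
def quadratic_weighted_kappa(human, system, min_rating, max_rating):
    """Quadratic weighted kappa between two lists of integer ratings."""
    n = max_rating - min_rating + 1          # number of possible ratings
    total = len(human)

    # Observed matrix: O[i][j] counts essays rated i by the human and j by the system.
    observed = [[0] * n for _ in range(n)]
    for h, s in zip(human, system):
        observed[h - min_rating][s - min_rating] += 1

    # Rating histograms of the human and system scores.
    hist_h = [sum(row) for row in observed]
    hist_s = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Outer product of the histograms, normalized to sum to `total`.
            expected = hist_h[i] * hist_s[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator


print(quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], 1, 3))  # perfect agreement
print(quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3))  # complete disagreement
```

Perfect agreement yields a value of 1, while systematically reversed ratings, as in the second call, push the value below 0.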
To analyze the potential benefits of the proposed DeLAES, we compared the predictive accuracy of our system with five other deep learning AES variants, three classical machine learning systems, and other state-of-the-art AES systems:
Hybrid Variants: Experiments were carried out with three hybrid recurrent NN approaches, namely BRNN with multichannel CNN (MCom_BRNN), GRU with multichannel CNN (MCom_GRU), and BLSTM with multichannel CNN (MCom_BLSTM). Wherever applicable, these baselines were implemented with the same parameter settings, as indicated in Table 1.
Finally, we also compared the experimental results of our proposed model with other state-of-the-art commercial and academic AES systems such as SAGE Zupanc and Bosnifa (2017), PEG Page (1966), e-rater Attali and Burstein (2006), IntelliMetric Shermis and Burstein (2003), CRASE Lottridge et al. (2013), LightSIDE Mayfield and Penstein-Rosé (2010), IEA Foltz et al. (1999) and Lexile Smith (2009). Further details can be found in Section 5.2.
5.1 Accuracy of the Semantic-based DeLAES system
In the following experiments, we compared our system with three different systems using multichannel CNN, two bidirectional neural network systems and three machine learning systems to evaluate whether semantic attributes learned using both multichannel CNN and bidirectional approaches yield better model performance. Table 3 shows the quadratic weighted kappa values for DeLAES, three other recurrent neural network variants using multichannel CNN, three bidirectional recurrent neural networks and three classical machine learning approaches on all eight datasets. We also calculated the p-values between DeLAES and all other approaches on all the datasets, shown in Table 4, by running the models using 10-fold cross-validation.
The results in Table 4 show that the prediction accuracy improved on six out of eight datasets and that the improvements were significant (p-values < 0.05) on five datasets when DeLAES is compared to MCom_GRU and MCom_BLSTM. As most of the results are statistically significant when DeLAES is compared to the other baselines, we mark the insignificant results with * in Table 4.
Table 4: p-values between DeLAES and the baseline approaches on the eight datasets (* marks insignificant results).
| Comparison | ES1 | ES2 | ES3 | ES4 | ES5 | ES6 | ES7 | ES8 |
|---|---|---|---|---|---|---|---|---|
| DeLAES vs MCom_GRU | 0.0226 | 0.0744* | 0.5751* | 0.3328* | 0.0227 | 0.0050 | 0.0468 | 0.0166 |
| DeLAES vs MCom_BLSTM | 0.0093 | 0.0093 | 0.0166 | 0.3862* | 0.0468 | 0.2845* | 0.0218 | 0.1394* |
| DeLAES vs BLSTM | 0.0298 | 0.5750* | 0.0386 | 0.0298 | 0.0359 | 0.0202 | 0.0512* | 0.0076 |
| DeLAES vs MCom_BRNN | 0.0125 | 0.0050 | 0.0218 | 0.0050 | 0.0166 | 0.0050 | 0.1141* | 0.0093 |
| DeLAES vs RNN | 0.0284 | 0.0125 | 0.0125 | 0.0298 | 0.0069 | 0.0366 | 0.5750* | 0.0125 |
| DeLAES vs SVM | 0.0050 | 0.0050 | 0.0050 | 0.0050 | 0.0050 | 0.0050 | 0.0050 | 0.0050 |
| DeLAES vs RF | 0.0069 | 0.0050 | 0.0093 | 0.0050 | 0.0069 | 0.0069 | 0.0069 | 0.0593* |
| DeLAES vs NB | 0.0050 | 0.0050 | 0.0050 | 0.0050 | 0.0050 | 0.0069 | 0.0050 | 0.0166 |
We aimed to determine whether semantic-based essay evaluation using multichannel CNN and bidirectional models contributes to higher prediction accuracy. We therefore compared the DeLAES system with the systems without multichannel CNN. The p-values evaluating these comparisons, also shown in Table 4, are colored in gray. The results show that using multichannel CNN and bidirectional models leads to higher kappa values on all eight observed datasets for the two compared system pairs (DeLAES-BRNN and DeLAES-BLSTM), and the results are also statistically significant on six of the eight datasets. This indicates that learning the feature representation and feature selection using multichannel CNN helps the lower layers of our model to learn the semantic feature representation of the essay better and improves the prediction accuracy of the model significantly.
We also compared the performance of DeLAES with three classical machine learning approaches: support vector machine (SVM), random forest (RF) and naive Bayes (NB). The results show that the prediction accuracy improved significantly (p-values < 0.05) on almost all of the datasets. In general, according to the results presented in Table 3, all neural network model variants were able to learn the task correctly and work competitively compared to the baselines. However, DeLAES performs better than all other systems, outperforming MCom_GRU by a wide margin (2.9%) and MCom_BLSTM by 2.0%. The experimental results also show that the ensemble models perform better than the non-ensemble models and lead to improvements in prediction accuracy.
To illustrate how the proposed DeLAES system performs, we provide one example where DeLAES gives a worse prediction and another where it predicts the score accurately. Example 1 presents an essay from the test set for which the score predicted by DeLAES differs from the actual score.
Example 1: “Dear newspaper, @caps1 having kids wasting there whole lives by being on the computer to much. i sure wouldn’t want my son to be like that. being on the computer too much can make you unhealthy from not exercising, not be able to enjoy nature, and keep you away from you friends and family. not much good can came out of being on the computer @num1. please newspaper take my word. @caps4 will make you well known @caps3 you agree with me on this. first of all, being on the computer all day, everyday can make you unhealthy by not exercising. …”
Example 2 showcases an accurate prediction by our DeLAES system: an essay from the test set whose score was predicted exactly. This is one of the most accurately predicted examples from the test set.
Example 2: “dear @caps1 of the @caps2 @caps3, @caps4 teens are online @num1 chatting with friends and surfing the web. i personally think we should get off the computer. if you want to talk to someone, just call them! with a computer, it can start a lot of drama too! and lastly, we should enjoy nature! so obviouly computers have a larg effect on people! keep reading to hear my reasoning. the first reason why i think computers effect us is because we get attached to talking to people from a computer! we are use to going on facebook and @caps5 everyone. …”
5.2 Comparison with the other AES Systems
Table 5: Quadratic weighted kappa of state-of-the-art AES systems on the eight datasets and on average.
| System | ES1 | ES2 | ES3 | ES4 | ES5 | ES6 | ES7 | ES8 | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| SAGE Zupanc and Bosnifa (2017) | 0.93 | 0.79 | 0.83 | 0.81 | 0.89 | 0.79 | 0.88 | 0.81 | 0.84 |
| PEG Page (1966) | 0.82 | 0.72 | 0.75 | 0.82 | 0.83 | 0.81 | 0.84 | 0.73 | 0.79 |
| e-rater Attali and Burstein (2006) | 0.82 | 0.74 | 0.72 | 0.80 | 0.81 | 0.75 | 0.81 | 0.70 | 0.77 |
| IntelliMetric Shermis and Burstein (2003) | 0.78 | 0.70 | 0.73 | 0.79 | 0.83 | 0.76 | 0.81 | 0.68 | 0.76 |
| CRASE Lottridge et al. (2013) | 0.76 | 0.72 | 0.73 | 0.76 | 0.78 | 0.78 | 0.80 | 0.68 | 0.75 |
| LightSIDE Mayfield and Penstein-Rosé (2010) | 0.79 | 0.70 | 0.74 | 0.81 | 0.81 | 0.76 | 0.77 | 0.65 | 0.75 |
| IEA Foltz et al. (1999) | 0.79 | 0.70 | 0.65 | 0.74 | 0.80 | 0.75 | 0.77 | 0.69 | 0.74 |
| Lexile Smith (2009) | 0.66 | 0.62 | 0.65 | 0.67 | 0.64 | 0.65 | 0.58 | 0.63 | 0.64 |
The results of the proposed DeLAES system were also compared with state-of-the-art systems such as SAGE, PEG, e-rater, IntelliMetric, CRASE, LightSIDE, IEA and the Lexile writing analyzer. Since not all of these systems are publicly available for experimentation, their results were taken from Shermis and Hamner (2013) and Zupanc and Bosnifa (2017). The results of the comparison are presented in Table 5.
According to the results presented in Table 5, our system achieves a better result on seven out of eight datasets. On the remaining dataset (ES1), the accuracy of DeLAES was the same as that of SAGE. On average (the rightmost column), DeLAES achieved much better results than the other systems.
6 Discussion and Conclusion
In DeLAES, the purpose of the convolutional layer is to pre-process the input text data. Due to its ability to detect local correlations of spatial or temporal structures, the convolution layer is well suited to extracting n-gram features at different positions of the text from the word vectors through the convolution filters. In text classification, the procedures for generating word embedding vectors can affect classification accuracy. The max-pooling operation is mainly used to identify the influential n-grams in each sentence of the essay. Compared to unidirectional models, bidirectional models can access both the preceding and the succeeding contextual information. Hence, bidirectional models can be more effective in learning the context of each word in the text, as indicated by our experimental results in Table 5. Furthermore, combining CNN-based approaches with bidirectional recurrent neural approaches makes it simpler to capture text semantics and improves the prediction accuracy, as indicated by our experimental results on the AES datasets.
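The role of the convolution filters and max-pooling as n-gram detectors can be sketched as follows. This is a simplified illustration, not the architecture used in the experiments: the function names are hypothetical, bias and nonlinearity are omitted, and global max-over-time pooling is used for brevity, whereas the model described in this paper passes a pooled sequence on to the BGRU. The window sizes 2, 3 and 4 correspond to the CNN window sizes in Table 1.

```python
def ngram_feature(embeddings, filt):
    """Slide one filter over all windows of len(filt) consecutive word vectors
    and keep the strongest response (max-over-time pooling).

    Assumes the text has at least len(filt) words.
    """
    k = len(filt)
    responses = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        # Dot product between the filter and the k stacked word vectors.
        response = sum(
            w * x
            for f_vec, e_vec in zip(filt, window)
            for w, x in zip(f_vec, e_vec)
        )
        responses.append(response)
    return max(responses)


def multichannel_features(embeddings, filters_by_window):
    """Apply every filter of every channel (one channel per window size)."""
    return {
        k: [ngram_feature(embeddings, f) for f in filters]
        for k, filters in filters_by_window.items()
    }


# Toy 1-dimensional "embeddings" for a four-word text and all-ones filters.
words = [[1.0], [2.0], [3.0], [0.0]]
channels = {
    2: [[[1.0], [1.0]]],
    3: [[[1.0], [1.0], [1.0]]],
    4: [[[1.0], [1.0], [1.0], [1.0]]],
}
print(multichannel_features(words, channels))  # {2: [5.0], 3: [6.0], 4: [6.0]}
```

Each channel responds to n-grams of a different length, and pooling keeps only the most influential position per filter, which is why the feature is insensitive to where in the text the n-gram occurs.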
Experiments show that the convolutional layer with the max-pooling operation and the BGRU have an essential influence on the performance. It is worth noting that the lower layers (the recurrent neural network) have a more significant effect on the prediction accuracy than the convolutional layer, since the convolutional layer is used for n-gram based feature selection. For the convolutional layer, the convolution window size and the max-pooling operation also affect the performance, according to our experimental results.
In text classification in general, and in AES specifically, the method used to generate word embedding vectors can also affect the classification (score prediction) accuracy. Compared to pre-trained word embedding vectors, training randomly initialized word embedding vectors requires more parameters and a huge dataset, and it yields relatively lower classification accuracy within a limited number of iterations. The experiments show that pre-trained word embedding vectors achieve better results than randomly initialized ones. Hence, pre-trained word embedding vectors are more suitable.
All experimental results show that the combination of the convolutional layer, the max-pooling operation and the BGRU considerably improves the prediction accuracy of the AES system. For most of the benchmark datasets, DeLAES obtained significantly better results than the other baseline models. This shows that DeLAES has a better classification ability and performs better than other state-of-the-art deep NNs.
In the task of AES, feature extraction and the design of classifiers are very important. LSTM and GRU have shown good performance on many real-world and benchmark text classification problems. However, understanding the semantics remains challenging, and the classification accuracy still needs to be improved. To address these problems, an improved GRU method, namely DeLAES, was presented in this paper, utilizing a multichannel convolutional layer, a max-pooling operation and a BGRU to improve the accuracy of the scoring engine. Experiments conducted on eight real-world benchmark datasets showed that DeLAES could capture semantics more accurately and enhance the performance of the scoring engine in terms of the quality of the final results.
Comparisons with state-of-the-art baseline methods demonstrated that DeLAES is more effective in terms of classification quality in most cases. Future work will focus on integrating attention mechanisms and investigating their effect on the performance of score prediction.
Acknowledgements. This research is supported by the ÚNKP-21-4 New National Excellence Program of the Ministry for Innovation and Technology from the source of the National Research, Development and Innovation Fund. Supported by the Telekom Innovation Laboratories (T-Labs), the Research and Development unit of Deutsche Telekom. Project no. ED_18-1-2019-0030 (Application domain specific, highly reliable IT solutions subprogramme) has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the Thematic Excellence Programme funding scheme.
Conflict of interest
The authors declare that they have no conflict of interest.
References
- Automated Essay Scoring With e-rater® V.2. The Journal of Technology, Learning, and Assessment 4 (3).
- A differential word use measure for content analysis in automated essay scoring. ETS Research Report Series 36 (August).
- Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
- Toward More Substantively Meaningful Automated Essay Scoring. The Journal of Technology, Learning, and Assessment.
- Automated essay grading in the sociology classroom: finding common ground. Machine scoring of student essays: Truth and consequences, pp. 177–198.
- Automated scoring using a hybrid feature identification technique. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, ACL ’98, Stroudsburg, PA, USA, pp. 206–210.
- On the effectiveness of using syntactic and shallow semantic tree kernels for automatic assessment of essays. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, Nagoya, Japan, pp. 767–773.
- Automated essay scoring by capturing relative writing quality. The Computer Journal 57 (9), pp. 1318–1330.
- Automated essay scoring by maximizing human-machine agreement. In EMNLP.
- Learning long-term dependencies in segmented memory recurrent neural networks. In Advances in Neural Networks – ISNN 2004, F. Yin, J. Wang, and C. Guo (Eds.), Berlin, Heidelberg, pp. 362–369.
- Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555.
- Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, pp. 2493–2537.
- Neural multi-task learning in automated assessment. CoRR abs/1801.06830.
- Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41 (6).
- Automatic text scoring using neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 715–725.
- Attention-based recurrent convolutional neural network for automatic essay scoring. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, pp. 153–162.
- Automatic features for essay scoring: an empirical study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1072–1077.
- Neural automated essay scoring and coherence modeling for adversarially crafted input. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, New Orleans, Louisiana, pp. 263–271.
- Automatic essay scoring system using n-gram and cosine similarity for gamification based e-learning. In Proceedings of the International Conference on Advances in Image Processing, ICAIP 2017, New York, NY, USA, pp. 151–155.
- Automated Essay Scoring: Applications to Educational Technology. In World Conference on Educational Multimedia, Hypermedia and Telecommunications (ED-MEDIA).
- A primer on neural network models for natural language processing. J. Artif. Int. Res. 57 (1), pp. 345–420.
- Providing grades and feedback for student summaries by ontology-based information extraction. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, New York, NY, USA, pp. 1722–1726.
- Identifying off-topic student essays without topic-specific training data. Natural Language Engineering 12 (2), pp. 145–159.
- Automated essay scoring using generalized latent semantic analysis. In International Conference on Computer and Information Technology.
- TDNN: a two-stage deep neural network for prompt-independent automated essay scoring. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1), Melbourne, Australia, pp. 1088–1097.
- A study of distributed semantic representations for automated essay scoring. In Knowledge Science, Engineering and Management, G. Li, Y. Ge, Z. Zhang, Z. Jin, and M. Blumenstein (Eds.), Cham, pp. 16–28.
- Applying latent dirichlet allocation to automatic essay grading. In Advances in Natural Language Processing, T. Salakoski, F. Ginter, S. Pyysalo, and T. Pahikkala (Eds.), Berlin, Heidelberg, pp. 110–120.
- Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746–1751.
- Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 337, pp. 325–338.
- Using automated scoring to monitor reader performance and detect reader drift in essay scoring. Handbook of Automated Essay Evaluation: Current Applications and New Directions, pp. 233–250.
- An interactive tool for supporting error analysis for text mining. In Proceedings of the NAACL HLT 2010 Demonstration Session, HLT-DEMO ’10, Stroudsburg, PA, USA, pp. 25–28.
- Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, USA, pp. 3111–3119.
- RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv 35.
- Grading Essays by Computer: Progress Report. In Invitational Conference on Testing Problems.
- Automated essay scoring using Bayes theorem. The Journal of Technology, Learning and Assessment 1 (2).
- Trait Ratings for Automated Essay Grading. Educational and Psychological Measurement 62.
- Contrasting state-of-the-art automated scoring of essays. Handbook of automated essay evaluation, pp. 313–346.
- Automated essay scoring: a cross-disciplinary perspective. British Journal of Mathematical & Statistical Psychology.
- The reading-writing connection. Position paper, MetaMetrics Inc.
- Discourse mode identification in essays. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 112–122.
- Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
- A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1882–1891.
- Pair-wise: automatic essay evaluation using word mover’s distance. In Proceedings of the 10th International Conference on Computer Supported Education - Volume 2: CSEDU, pp. 59–66.
- A layered approach to automatic essay evaluation using word-embedding. In Computer Supported Education, B. M. McLaren, R. Reilly, S. Zvacek, and J. Uhomoibhi (Eds.), Cham, pp. 77–94.
- Semantic clustering and convolutional neural network for short text categorization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China.
- A review of automatic essay scoring: a cross-disciplinary perspective. Journal of Educational and Behavioral Statistics 30 (1).
- Automatic essay scoring incorporating rating schema via reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 791–797.
- Automated essay scoring system based on rubric. In Applied Computing & Information Technology, pp. 177–190.
- Task-independent features for automated essay grading. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, Denver, Colorado, pp. 224–232.
- Co-attention based neural network for source-dependent essay scoring. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, Louisiana, pp. 399–409.
- Review of handbook of automated essay evaluation: current applications and new directions. Language Learning & Technology 18 (2).
- Character level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pp. 649–657.
- Automated essay evaluation with semantic analysis. Knowledge-Based Systems 120, pp. 118–132.