Automated essay scoring (AES) is the use of a statistical model to assign grades to essays in an educational setting. These engines were initially used to reduce the cost of essay scoring [21, 22]. Aside from cost effectiveness, AES is considered to be inherently more consistent and less biased than human raters. We can compare the performance of an AES engine with the performance of human raters using inter-rater reliability (IRR) statistics. Recently, an AES engine with above-human performance was presented, based on an engine in which experts carefully engineered a set of features. AES has been the subject of a number of recent works by Sakaguchi, Shermis and Hammer, and Yannakoudakis.
The function of AES is essentially one of classification, where neural networks are associated with almost all the current state-of-the-art results. Feedforward (static) neural networks are a class of powerful nonlinear statistical models capable of modelling complex relationships between the input space and a set of targets. Many of these feedforward networks are Convolutional Neural Networks (CNNs), which are ubiquitous in image classification tasks. These nonlinear models are fit to a set of training data using backpropagation and a variety of optimization algorithms. Recently, very efficient deep neural network architectures have been used to compute vector representations of words and/or subwords, called embeddings. These models are used heavily in Natural Language Processing (NLP) tasks to convert words and/or subwords to vectors in a manner that has been shown to preserve semantic information.
We also consider AES to be an area of NLP in which another type of network, the dynamic network, is ubiquitous. These dynamic networks are mostly called Recurrent Neural Networks (RNNs) and are powerful tools used to model and classify data that is sequential in nature. These types of networks have been used in engineering and science in the identification and modeling of complex systems. Using an embedding, we may convert a sequence of words into a sequence of vectors that preserves the semantic information. RNNs, in combination with embeddings, have many applications in NLP tasks like sentiment analysis, topic labeling, language detection and machine translation. In recent years, researchers have applied RNNs and deep neural networks to AES.
When there is a very large number of student essays, grading can be a very expensive and time-consuming process. Since scoring essays is a part of the student assessment process conducted by almost all educational testing agencies, many AES engines are used in large-scale formative and summative assessment. The core idea of essay scoring is to evaluate an essay with respect to a rubric, which may depend on traits such as the use of grammar and the organization of the essay, in addition to topic-specific information. An AES engine seeks to extract measurable features that may be used to approximate these traits and hence deduce a probable score based on statistical inference. A comprehensive review of AES engines in production is featured in the work of Shermis et al.
In 2012, Kaggle released a Hewlett Foundation sponsored competition under the name “Automated Student Assessment Prize” (ASAP). Competitors designed and developed statistical AES engines based on techniques like Bag of Words (BOW) in combination with standard Machine Learning (ML) algorithms to extract important features of student responses that correlated well with scores. Subsequent works applied RNN-based engines in combination with word embeddings to the Kaggle AES dataset. This dataset and these results provide us with a benchmark for AES engines and a way of comparing current state-of-the-art neural network architectures against previous results.
Since an abundance of unlabeled text data is available, researchers have started training very deep language models, which are networks designed to predict some part of the text (usually words) based on the other parts. These networks eventually learn contextual information. By adapting these language models to predict labels instead of words or sentences, state-of-the-art results have been achieved in many NLP tasks. Many of these models (see [26, 8, 34, 25]) are built from layers of Transformers, which utilize attention to find the most relevant features to perform a particular task. We concentrate on two such models: the Bidirectional Encoder Representations from Transformers (BERT) and XLNet, which is a variation of the BERT model.
2. Automated Essay Scoring
In this section, we discuss the task of producing an AES engine. This includes the data collection, how we train the models and how we evaluate an AES engine. We include a brief description of some of the standard IRR statistics in the literature used in the context of evaluating models [9, 33].
The first step in producing an AES engine is data collection. Typically, a large sample of essays is collected for the task and scored by expert raters. The raters are trained using a holistic rubric specifying the criteria each essay is required to satisfy to be awarded each score. Exemplar essays are used to demonstrate how the criteria are to be applied. A holistic rubric may take into account a number of factors such as grammar, spelling, organization, clarity and cohesion. Since these essays are the result of specific prompts shown to students, the rubric may include prompt-specific information. The training material for the Kaggle AES dataset was made publicly available. To evaluate the efficacy of an AES engine, we require that every essay is scored by (at least) two different raters. Other quality control mechanisms like resolution reads and targeted backreads help improve the quality of the data.
Once the collection of essays is scored, we divide the essays into three different sets: a training set, a test set and a validation set. From a classification standpoint, the input space is the set of raw text essays, while the targets for this problem are the human-assigned labels. The goal of an AES engine is to use and evaluate a set of features of the training set, either implicitly or explicitly, in such a manner that the labels of the test set may be deduced as accurately as possible using statistical inference. Ultimately, if the features are appropriate and the statistical inference is valid, the AES engine assigns grades to essays in the test set in a way that is statistically similar to how a human would. Once the hyperparameters are optimized for the test set, the engine is applied to the validation set.
In the case of the ASAP data, two raters were used to evaluate the essays. We call the scores of one reader the initial scores and the scores of the second reader the reliability scores. There are two main metrics used to evaluate the agreement between two sets of scores: the exact agreement (accuracy), which measures how often two scores agree, and the quadratic weighted kappa (QWK) statistic, a weighted form of Cohen’s Kappa Score. The QWK of two sets of scores is defined as follows:
$$\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, p_{ij}}{\sum_{i,j} w_{ij}\, p_{i\cdot}\, p_{\cdot j}}, \qquad w_{ij} = \frac{(i-j)^2}{(k-1)^2},$$
where $k$ is the number of classes, $p_{ij}$ is the probability of an essay receiving score $i$ from the first rater and score $j$ from the second, and $p_{i\cdot}$ and $p_{\cdot j}$ are the corresponding marginal probabilities for each rater. The original Cohen’s Kappa Score is defined as
$$\kappa = \frac{p_o - p_e}{1 - p_e},$$
where $p_o$ is the relative observed exact agreement among raters (i.e., accuracy), and $p_e$ is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category. The QWK has the property that
$$\kappa_w = 1$$
if the raters are in complete agreement. The QWK captures the level of agreement above and beyond what would be obtained by chance, weighted by the extent of disagreement. Furthermore, in contrast to the accuracy, QWK is statistically a better measurement for detecting disagreements between raters since it depends on the entire confusion matrix, not just the diagonal entries. Typically, the QWK between two raters is also used to measure the quality or subjectivity of the data used in training.
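As an illustration of how the QWK can be computed in practice, the following is a minimal sketch that assumes integer scores in the range 0 to `num_classes - 1` and uses the standard quadratic weights. The function name `quadratic_weighted_kappa` and the example score lists are hypothetical and are not taken from the ASAP data.

```python
from collections import Counter


def quadratic_weighted_kappa(scores_a, scores_b, num_classes):
    """QWK between two equal-length lists of integer scores in 0..num_classes-1."""
    n = len(scores_a)
    if n == 0 or n != len(scores_b):
        raise ValueError("score lists must be non-empty and of equal length")
    # Observed joint counts and the marginal counts of each rater.
    observed = Counter(zip(scores_a, scores_b))
    marg_a = Counter(scores_a)
    marg_b = Counter(scores_b)
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2 / (num_classes - 1) ** 2
            num += w * observed[(i, j)] / n
            den += w * (marg_a[i] / n) * (marg_b[j] / n)
    # den is zero only if both raters always give one and the same score.
    return 1.0 - num / den


# Hypothetical initial and reliability scores on a 0-3 scale.
initial = [0, 1, 2, 3, 2, 1]
reliability = [0, 1, 2, 2, 2, 0]
kappa = quadratic_weighted_kappa(initial, reliability, num_classes=4)
```

Because the weights vanish on the diagonal of the confusion matrix, complete agreement gives a numerator of zero and hence a QWK of one, while disagreements are penalized according to the squared distance between the two scores.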
We may evaluate an AES engine on the ASAP dataset and compare the engine with a human rater by training an engine on the initial scores and showing that, on the validation set, the scores predicted by the engine are in greater agreement with the initial scores than the reliability scores are. We used the same 5-fold cross validation splits as prior work, where each of the five splits used 60 percent of the data as training data, 20 percent as a test set and 20 percent as a validation set. We also considered hyperparameter tuning at a level in which the very structure of the network was altered.
Automated Essay Scoring is one of the more challenging tasks in NLP. The challenges that are somewhat distinct to essay scoring relate to the length of essays, the quality of the language/spelling and typical training sample sizes. Essays can be long relative to the texts found in sentiment analysis, short answer scoring, language detection and machine translation. Furthermore, while many tasks in NLP can be done sentence by sentence, the length and structure of essays often introduce longer time dependencies, which require more data than is typically available. The amount of data is often restricted due to the expense of hand-scoring. The longer the essay, the more difficult it is for Neural Network models to keep the information from the beginning of the essay in the network. This results in convergence issues or low performance. These are in addition to typical challenges of NLP such as the choice of embedding, different contextual meanings of words and the choice of ML algorithms.
A variety of models have been introduced over the last 50 years in essay scoring. These models started with statistical models using the Bag of Words (BOW) method with logistic regression or other classifiers, SVD methods for feature selection and probabilistic models like Naive Bayes or Gaussian models. In using Neural Network models, we are required to choose an appropriate embedding. An embedding may map characters, words, subwords or sentences into some real $d$-dimensional space in a way that preserves the usage/semantics. This converts a text into a sequence of vectors in $d$-dimensional space, which may be modeled using RNNs like LSTM Networks and GRU Networks, or with CNNs. The gating mechanisms, such as those found in GRUs and LSTM units, mitigate the issue of long term dependencies to some degree; however, it has been shown that long term dependencies are more effectively accommodated by attention mechanisms. Recently, researchers have started to combine these algorithms with each other in order to improve the results.
At a word level, if a word is misrepresented or misspelled, the embedding of that token results in an inconsistent input being used to train the NN models, leading to poor extrapolation. Standard algorithms for correcting words may suggest words that do not fit into the context. The language models in question are masked word models [26, 8, 34, 25], which seek to guess a selection of missing words better than standard algorithms by incorporating context in three different ways. These models use three different embeddings: a word/subword embedding, a sentence embedding and a positional embedding that encodes the position of each word. The probable masked words are calculated by using context at a word and sentence level. By modelling sentences, these models possess much more information than is typically available using typical word embeddings.
Neural networks are inherently non-linear and continuous models; however, to approximate a discrete scoring rubric, a series of boundaries is introduced in the output space that distinguish the various scores. When the output lies close to the boundaries between scores, it is difficult for the models to pick a score correctly. The idea of using a committee (or ensemble) of networks, combined by taking a majority vote or the mean, will be discussed in later sections.
In this section we go into some detail regarding some of the major methods used to develop AES engines. We start with the BOW method, in which the features are explicitly defined. We then go on to describe RNN approaches. In particular, we review how the gating mechanisms in layers of LSTM units allow for long term dependencies. The Multi Layer Perceptron and its variations are classified as static networks, and networks that have delays are also considered RNNs. Lastly, we elucidate the structure and function of the language models featured in this paper.
For ML algorithms, we mostly prefer to have well defined, fixed inputs and targets. An issue with modeling text data is that it is usually very messy, and some techniques are required to pre-process it into useful inputs and targets to feed to ML algorithms. Text needs to be converted to numbers that we can use in machine learning as proper inputs and labels. Converting textual data to vectors is called feature extraction or feature encoding. A bag of words (BOW) model is a technique to extract features from text and use them for modeling. The method is very simple:
Find all occurrences of words within a document.
Build a vocabulary of the unique words.
Form the vector that represents the frequency of each word.
Each dimension of the vector represents the number of occurrences of one word.
Remove the dimensions associated with very high frequency words.
Compute the term frequency (TF): take the raw frequency and divide it by the maximum frequency.
Compute the inverse document frequency (IDF): take the logarithm of the ratio of the total number of documents to the number of documents that contain the term.
Multiply the TF and IDF to obtain the TF-IDF, which is used to identify the most important words.
Normalize the TF-IDF vectors.
The BOW model is completed and each essay is associated with a single vector and the set of vectors with a particular label may be classified by some traditional classifier. We should note that the BOW model will not consider the order of the words and that in each bag it finds the words that have the most textual information.
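The steps above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: tokens are obtained by lower-casing and splitting on whitespace, no explicit removal of very high frequency words is performed (words occurring in every document receive a zero weight through the IDF), and the function name `tfidf_vectors` and the example documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        max_count = max(counts.values()) if counts else 1
        vec = []
        for word in vocabulary:
            tf = counts[word] / max_count             # raw count / max count
            idf = math.log(n_docs / doc_freq[word])   # log(N / document frequency)
            vec.append(tf * idf)
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec] if norm > 0 else vec)
    return vocabulary, vectors


# Hypothetical toy corpus.
docs = ["the cat sat", "the dog sat down", "the bird flew"]
vocab, vecs = tfidf_vectors(docs)
```

Each document is mapped to a single vector whose length equals the vocabulary size, and the word order is discarded, which is the defining property of the BOW representation.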
The output of an RNN is a sequence that depends on the current input to the network but also on the previous inputs and outputs. In other words, the input and output can be delayed and we can also use the state of the network as input. Since these networks have delays, they operate on a sequence of inputs in which the order is important.
An RNN can be a purely feedforward network with delays in the inputs, or it can have feedback connections from the output of the network and/or the state of the network. A variety of recurrent units, which are used to build RNNs, are available, like LSTM, GRU, ELMAN, NARX and Focused Delay Networks. In this section we discuss networks of LSTM units. In order to do so, let us introduce a general notation by describing the most basic unit that makes up all ANNs, that is, the (artificial) neuron, shown in Figure 1(a).
A scalar input $p$ is multiplied by a parameter $w$, called the weight, and the result is added to another parameter $b$, called the bias. Their sum ($n = wp + b$, the net input) goes into a (usually) nonlinear activation function $f$ to get the neuron output $a = f(wp + b)$. By updating the values of $w$ and $b$ through an iterative optimization algorithm called Gradient Descent, this single neuron can find the best parameters that fit the neuron equation (with a set transfer function) to two-dimensional data. In other words, this single modular unit can map input data to the target and approximate the underlying function. By assigning a different weight to each input dimension, a single neuron can be extended to model $R$-dimensional data. In this case, both $\mathbf{p}$ and $\mathbf{w}$ are $R$-dimensional vectors, and the neuron output equation is
$$a = f(\mathbf{w}^{\top}\mathbf{p} + b).$$
By combining multiple neurons together, and stacking multiple layers of these neurons, a Multi-Layer Perceptron (MLP) is formed (Figure 1(b)). The superscript shows the layer number. For example, the forward calculation of the three layers shown in the figure is
$$\mathbf{a}^3 = \mathbf{f}^3\!\left(\mathbf{W}^3\,\mathbf{f}^2\!\left(\mathbf{W}^2\,\mathbf{f}^1\!\left(\mathbf{W}^1\mathbf{p} + \mathbf{b}^1\right) + \mathbf{b}^2\right) + \mathbf{b}^3\right).$$
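The layer-by-layer forward calculation of an MLP can be sketched with NumPy as follows. The layer sizes, the random weights and the choice of `tanh` and identity transfer functions are assumptions made purely for illustration, and `mlp_forward` is a hypothetical helper name.

```python
import numpy as np


def mlp_forward(p, weights, biases, transfer_fns):
    """Apply a = f(W a_prev + b) layer by layer, starting from the input p."""
    a = p
    for W, b, f in zip(weights, biases, transfer_fns):
        a = f(W @ a + b)
    return a


def identity(n):
    return n


rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # 4 inputs, then three layers with 5, 3 and 2 neurons
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
biases = [rng.normal(size=sizes[i + 1]) for i in range(3)]
output = mlp_forward(rng.normal(size=4), weights, biases,
                     [np.tanh, np.tanh, identity])
```

The output of each layer becomes the input of the next one, so the nested expression for a three-layer network unrolls into a simple loop.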
We now introduce the neural network framework that we will use to represent general recurrent networks. We add new notation to that used to represent the MLP, so that we can conveniently represent networks with feedback connections and tapped delay lines.
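The net input $\mathbf{n}^m(t)$ for layer $m$ of an RNN can be computed as follows:
$$\mathbf{n}^m(t) = \sum_{l \in L^f_m} \sum_{d \in DL_{m,l}} \mathbf{LW}^{m,l}(d)\,\mathbf{a}^l(t-d) + \sum_{l \in I_m} \sum_{d \in DI_{m,l}} \mathbf{IW}^{m,l}(d)\,\mathbf{p}^l(t-d) + \mathbf{b}^m, \tag{7}$$
where $\mathbf{p}^l(t)$ is the $l$-th input to the network at time $t$, $\mathbf{IW}^{m,l}$ is the input weight between input $l$ and layer $m$, $\mathbf{LW}^{m,l}$ is the layer weight between layer $l$ and layer $m$, $\mathbf{b}^m$ is the bias vector for layer $m$, $DL_{m,l}$ is the set of all delays in the tapped delay line between layer $l$ and layer $m$ (with $DI_{m,l}$ the analogous set of delays between input $l$ and layer $m$), $I_m$ is the set of indices of input vectors that connect to layer $m$, and $L^f_m$ is the set of indices of layers that connect directly forward to layer $m$. The output of layer $m$ is
$$\mathbf{a}^m(t) = \mathbf{f}^m\!\left(\mathbf{n}^m(t)\right) \tag{8}$$
for $m = 1, 2, \dots, M$, where $\mathbf{f}^m$ is the transfer function at layer $m$. The set of paired equations (7) and (8) describes the general RNN. An RNN can have any number of layers, any number of neurons in any layer, and arbitrary connections between layers (as long as there are no zero-delay loops).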
Training RNNs can be very complex and difficult. The key issues that may arise are vanishing gradients, exploding gradients and instability. Many architectures have been proposed to deal with these issues. The Long Short Term Memory (LSTM) network is one of these architectures and has recently become very popular. The key concept in LSTM is that we would like to predict responses that may be significantly delayed from the corresponding stimulus. For example, words in a previous paragraph can provide context for a translation; therefore, the network must be able to have long term memory.
Long term memories are the network weights and short term memories are the layer outputs. We need a network which has long and short term memory combined. In RNNs, as the weights change during training, the length of the short term memory will change. It will be very difficult to increase the length if the initial weights do not produce a long short term memory. Unfortunately, if the initial weights do produce a long short term memory, the network can easily have unstable outputs. To maintain a long term memory, we need a layer called the Constant Error Carousel (CEC). This layer has a feedback weight matrix with some eigenvalues very close to one, as shown in Figure 2. This has to be maintained during training or the gradients will vanish. In addition, to ensure long memories, the derivative of the transfer function should be constant. Therefore, we set the feedback weight matrix to the identity matrix and use a linear transfer function.
Now, we do not want to indiscriminately remember everything. Thus, we need to create a system that selectively picks what information to remember. The solution is a gating mechanism in which gates act like switches that operate on the input, the CEC layer and the output layer. The input gate allows selective inputs into the CEC, a feedback or forget gate clears the CEC, and the output gate allows selective outputs from the CEC. Each gate is a layer with inputs from the gated outputs and the network inputs. The resulting network is the LSTM, with CEC short term memories that last longer. The key details are:
The operator $\circ$ is the Hadamard product, which is an element by element multiplication.
The weights in the CEC are all fixed to the identity matrix and they are not trained.
The output and the gating layer weights are also fixed to the identity matrix.
It has been shown that the best results are obtained when initializing the bias of the feedback or forget gate to all ones or larger values.
Other weights and biases are randomly initialized to small numbers.
The output of the gating layer generally connects to another layer or ML network with softmax transfer function.
Multiple LSTM layers can be cascaded into each other.
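To make the gating mechanism concrete, the following is a minimal NumPy sketch of one time step of an LSTM cell in the formulation commonly used by deep learning libraries. It is an assumption that this matches the figure in every detail (for instance, the output here passes through a `tanh` before the output gate); the names `lstm_step` and `init_params` are hypothetical.

```python
import numpy as np


def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a standard LSTM cell; returns (h, c)."""
    z = np.concatenate([x, h_prev])  # gates see the input and the gated output
    i = sigmoid(params["W_i"] @ z + params["b_i"])  # input gate
    f = sigmoid(params["W_f"] @ z + params["b_f"])  # forget (feedback) gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])  # output gate
    g = np.tanh(params["W_g"] @ z + params["b_g"])  # candidate input to the CEC
    c = f * c_prev + i * g  # CEC: identity feedback, gated element-wise
    h = o * np.tanh(c)      # selectively output the CEC contents
    return h, c


def init_params(input_size, hidden_size, rng):
    """Small random weights, zero biases, forget-gate bias set to ones."""
    params = {}
    for name in ("i", "f", "o", "g"):
        shape = (hidden_size, input_size + hidden_size)
        params["W_" + name] = 0.1 * rng.normal(size=shape)
        params["b_" + name] = np.zeros(hidden_size)
    params["b_f"] = np.ones(hidden_size)
    return params


rng = np.random.default_rng(0)
params = init_params(input_size=3, hidden_size=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(5, 3)):  # a sequence of five input vectors
    h, c = lstm_step(x, h, c, params)
```

The products `f * c_prev`, `i * g` and `o * np.tanh(c)` are Hadamard products, and the update of `c` has no trainable feedback weights, which is what allows information to persist over many time steps when the forget gate stays close to one.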
We need to note that Deep Learning frameworks unroll these networks with delays: for each time step they create a physical layer and then use the static backpropagation algorithm to calculate the gradients. Then they roll the networks back and average the derivatives with respect to the weights and biases over the physical layers. The unrolling and rolling effect is only an approximation of the true gradient with respect to the weights.
BERT, which stands for Bidirectional Encoder Representations from Transformers, is a Language Model released by the Google AI Language team at the end of 2018. It has become the state-of-the-art model for many different Natural Language Understanding tasks, including sequence and document classification. This is best reflected by the fact that one can only see BERT-like models on the GLUE benchmark leaderboard, with the sole exception of XLNet, which then again is not so different from BERT. The success of BERT can be explained in part by its novel language modeling approach, but also by the use of the Transformer, a Neural Network architecture based solely on Attention mechanisms, which was introduced one year prior, replacing Recurrent Neural Networks (RNNs) as the state-of-the-art Natural Language Understanding (NLU) technique. We will give an overview of how Attention and Transformers work, and then explain BERT’s architecture and its pre-training tasks.
Self-Attention, the kind of Attention used in the Transformer, is essentially a mechanism that allows a Neural Network to learn representations of some text sequence influenced by all the words in the sequence. In an RNN context, this is achieved by using the hidden states of the previous tokens as inputs to the next time step. However, as the Transformer is purely feed-forward, it must find some other way of combining all the words together to map any kind of function in an NLU task. It does so with the following equation:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.$$
Here, $Q$, $K$ and $V$ (query, key and value) are matrices which are obtained by multiplying the embedding matrix $X$ of our input sequence by some trainable weight matrices $W^Q$, $W^K$ and $W^V$. That is, $Q = XW^Q$, $K = XW^K$ and $V = XW^V$. Basically, each row of these matrices corresponds to one word, meaning that each word is mapped to three different projections of its embedding space. These projections serve as abstractions to compute the self-attention function for each word. The dot product between the query for word 1 and all the keys for words 1, 2, …, $n$ tells us how “similar” each word is to word 1, a measure that is normalized by the softmax function across all the words. The output of the softmax weights how much each word should contribute to the representation of the sequence that is drawn from word 1. Thus, the output of the self-attention transfer function for each word is a weighted sum of the values of all the words (including, and mainly, itself), by some parameters that are learnt to get the best representation that fits the problem at hand. $d_k$ is the dimension of the query vectors (512 for the Transformer, and 768 for base BERT and XLNet, before the split into attention heads), and dividing by its square root leads to more stable gradients.
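A minimal NumPy sketch of this self-attention computation is given below. The sequence length, the embedding dimension and the random weight matrices are assumptions chosen only for illustration, and `self_attention` is a hypothetical helper name.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax with the usual max subtraction for stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sequence X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n): row i attends to all words
    return weights @ V, weights                # weighted sum of the values


rng = np.random.default_rng(0)
n, d = 5, 8  # 5 words, embedding dimension 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
```

Row `i` of `attn` holds the normalized similarities between word `i` and every word in the sequence, so row `i` of `out` is the corresponding weighted sum of the value vectors.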
The Transformer model goes one step further than simply computing a Self-Attention function, by implementing what is called Multi-Head Attention. This is basically a set of $h$ Self-Attention computations, each on different sub-vectors, obtained from the original $Q$, $K$ and $V$ by breaking them up into $h$ different $Q_i$, $K_i$ and $V_i$ made up of $d/h$ components each, where $d$ is the embedding dimension (768 for the base BERT and XLNet) and $h$ is the number of Attention Heads (12 for base BERT and XLNet). This is illustrated in Figure 3 under “Segmentation”. After the Self-Attention is computed for each head $i$, the original dimension is obtained by a simple concatenation (“Context Matrix” in Figure 3).
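The segmentation and concatenation steps can be sketched as follows. This is an illustrative NumPy version with a small dimension and random matrices standing in for the projected queries, keys and values; it scales each head by the square root of the per-head dimension, and `multi_head_attention` is a hypothetical helper name.

```python
import numpy as np


def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_attention(Q, K, V, num_heads):
    """Split Q, K, V column-wise into heads, attend per head, concatenate."""
    n, d = Q.shape
    if d % num_heads != 0:
        raise ValueError("embedding dimension must be divisible by num_heads")
    d_head = d // num_heads
    heads = []
    for i in range(num_heads):
        cols = slice(i * d_head, (i + 1) * d_head)
        Q_i, K_i, V_i = Q[:, cols], K[:, cols], V[:, cols]
        weights = softmax(Q_i @ K_i.T / np.sqrt(d_head))
        heads.append(weights @ V_i)         # shape (n, d_head)
    return np.concatenate(heads, axis=1)    # back to shape (n, d)


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 12)) for _ in range(3))
context = multi_head_attention(Q, K, V, num_heads=3)
```

With the base BERT sizes, the same function would be called with matrices of 768 columns and `num_heads=12`, so that each head works on 64 components.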
Although up until this point we have only described the Encoder part of the Transformer, which is actually an Encoder-Decoder architecture, both BERT and XLNet use only an Encoder Transformer, so this is essentially all the architecture these Language Models are made of, with some key changes in the case of XLNet. Now we proceed to describe BERT’s architecture from input to output, and also how it is pre-trained to learn a natural language. First, the actual words in the text are projected into an embedding dimension, which will be explained later in the context of Language Modeling. Once we have the embedding representation of each word, we input them into the first layer of BERT. Such a layer, shown in Figure 3, consists mainly of a Multi-Head Attention Layer, which is identical to that of the Transformer, except for the fact that an attention mask is added to the softmax input. This is done in order to avoid paying attention to padded 0s (which are necessary if one wants to do vectorized mini-batching). The attention mask is a vector made up of 0s for the words we want the model to attend to (the actual words in the sequence), and of large negative values (like -10,000) for the padded 0s. The sums of the keys and queries dot products with this mask go into the softmax, making the attention scores for the masked padded 0s become practically 0. The output of this layer goes into a linear layer of size $d \times d$, where $d$ is the embedding dimension, in order to learn a local linear combination of the Multi-Head Attention output. Layer Normalization is performed on the sum of the output of this layer (after a Dropout) and the input to the BERT layer. This is fed into yet another linear layer of size $d \times d_{ff}$, where $d_{ff} = 3072$ for the base BERT, followed by a GELU (Gaussian Error Linear Unit) transfer function and another linear layer ($d_{ff} \times d$) that maps the higher dimensions back to the embedding dimension, again with Dropout and Layer Normalization. This constitutes one BERT Layer, of which the base model has 12. The outputs of the first layer are treated as the hidden embeddings for each word, which the second layer takes as inputs and performs the same kind of operations on. Once we have gone through all the layers, the output for the first token (a special token “[CLS]” that remains the same for all input sequences) is passed on to another linear layer ($d \times d$) with a tanh transfer function. This layer (Figure 4(b)) acts as a pooler and its output is used as the representation of the whole sequence, which can finally allow learning multiple types of tasks by using other specific-purpose layers or even treating it as the sequence features to input into another kind of Machine Learning model.
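The effect of the additive attention mask on the softmax can be checked with a small NumPy sketch. The mask value of -10,000 follows the text, while the toy scores, the boolean `is_real_token` vector and the function name `masked_attention_weights` are assumptions made for illustration.

```python
import numpy as np


def masked_attention_weights(scores, is_real_token, mask_value=-10000.0):
    """Add 0 for real tokens and mask_value for padding, then apply softmax."""
    mask = np.where(is_real_token, 0.0, mask_value)  # one entry per key position
    masked = scores + mask[None, :]                  # broadcast over the query rows
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)


# Toy example: a sequence of 3 real tokens padded to length 4.
scores = np.zeros((4, 4))                            # query-key dot products
is_real_token = np.array([True, True, True, False])
weights = masked_attention_weights(scores, is_real_token)
```

After the mask is added, the column of the padded position receives a vanishing share of the attention, and the remaining weight is distributed over the real tokens only.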
Now that we have described BERT’s architecture in detail, we will focus on the other main aspect that makes BERT so successful: BERT is, first and foremost, a Language Model. This means that the model is designed to learn useful knowledge about natural language from large amounts of unlabeled text, but also to retain and use this knowledge for supervised downstream tasks. The way Language Modeling usually works in an RNN scenario is by using the previous words as inputs to predict the next word. The model cannot take as input the target word or any words after it (although there are some bidirectional variants), so there is no need for special preprocessing of the text. However, as BERT is a feed-forward architecture that uses attention on all the words in some fixed-length sequence, if nothing is done, the model would be able to attend mainly to the very same word it is trying to predict. One solution would be cutting the attention on all the words after, and including, the target word. However, natural language is not so simple. More often than one would think, words within a sequence only make sense when taking the words after them as context. Thankfully, the attention mechanism can capture both previous and future context, and one can stop the model from attending to the target word by masking it (not to be confused with the attention mask used for the padded zeros). In particular, for each input sequence, 15% of the tokens are randomly masked, and then the model is trained to predict these tokens.
This is done by taking the output of BERT, before the pooler, and mapping the vectors corresponding to each word to the vocabulary size with a linear layer, whose weights are the same as the ones from the input word embedding layer, although an additional bias is included. The result is passed to a softmax function in order to minimize a Categorical Cross-Entropy performance index that is computed with the predicted labels and the true labels (the ids in the token vocabulary, with only the masked words contributing to the loss). Masking words is really straightforward: just replace them with the special token “[MASK]”. This way, the network cannot use information from this word or any other masked words, aside from their position in the text. BERT was also pre-trained to predict whether a sentence B follows another sentence A (both randomly sampled from the text 50% of the time, while the rest of the time sentence B is actually the sentence that comes after sentence A). Although recent research has shown that the same or even better results can be obtained without this second task, in the original implementation the model is optimized to minimize the sum of the losses from each task at the same time. The additional architecture just described is shown in Figure 5.
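A minimal sketch of the masking step, following the simplified description above, is shown below. It assumes the input is already a list of tokens and always replaces the selected positions with “[MASK]”; the original BERT implementation additionally leaves some selected tokens unchanged or replaces them with random tokens, which is omitted here. The function name `mask_tokens` is hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace about mask_prob of the tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    holding the labels that would contribute to the loss.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_to_mask = max(1, round(mask_prob * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels


# Hypothetical tokenized sentence of 20 word-pieces.
tokens = [f"w{i}" for i in range(20)]
masked, labels = mask_tokens(tokens, seed=0)
```

Only the positions stored in `labels` would be used when computing the cross-entropy loss; all other positions are inputs that provide context.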
In addition to the usual word embeddings, positional embeddings are used to give the model information about the position of each word in the sequence (this is also done in the Transformer, although with some differences). Due to the next sentence prediction task, and also for easy adaptation to downstream tasks such as question-answering, a segment embedding to represent each of the sentences is also utilized. The word embeddings used by BERT are WordPiece embeddings, which rely on a tokenization technique in which the words are split into sub-word units. This helps in handling out-of-vocabulary words while keeping the actual vocabulary size small (30,522 unique word-pieces for BERT uncased). The positional embeddings are look-up tables of size $512 \times d$, where $d$ is the embedding dimension, which assign a different embedding vector to each token based on its position within the sequence. A maximum sequence length of 512 was also chosen in the original Transformer, mainly because the complexity of Self-Attention is quadratic in the sequence length, due to the fact that it needs to compute the attention of every word to every other word and also to itself. Just as the first token in every input sequence has the same positional embedding, all the tokens belonging to the same sentence in the pair of sentences A and B share the same segment embedding, i.e., the segment embeddings are look-up tables of size $2 \times d$. All of these embeddings have the same dimensions, so they can simply be added up element-wise to combine them and obtain the input to the first Multi-Head Attention Layer, as shown in Figure 4(a). Notice that these embeddings are learnable, so although pre-trained WordPiece embeddings are used at the beginning for the word embeddings, these are updated to represent the words in a better way during BERT’s pre-training and fine-tuning tasks. This becomes even more crucial in the case of the positional and segment embeddings, which need to be learned from scratch.
It is also worth noting that, although the embedding layers are technically look-up tables, which work with inputs of dimension $n \times 1$ for a sequence of $n$ tokens (containing one unique token vocabulary id for each word), mathematically this is equivalent to having a linear layer (without bias) of size $\text{VocabularySize} \times d$, and one-hot-encoded inputs of dimension $n \times \text{VocabularySize}$. The projection and weight update will be the same, but the first method is much faster because there is no matrix product involved, just a look-up (indexing) operation.
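This equivalence is easy to verify numerically. The sketch below uses a small, randomly initialized embedding table and a handful of token ids, all of which are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4
embedding_table = rng.normal(size=(vocab_size, d))  # one row per vocabulary id

token_ids = np.array([3, 7, 7, 0])                  # a sequence of n = 4 tokens

# Method 1: look-up (indexing), no matrix product involved.
looked_up = embedding_table[token_ids]              # shape (n, d)

# Method 2: one-hot inputs multiplied by the same weights (linear layer, no bias).
one_hot = np.eye(vocab_size)[token_ids]             # shape (n, vocab_size)
projected = one_hot @ embedding_table               # shape (n, d)
```

Both methods return the same rows of the table; the look-up simply avoids multiplying by the zeros of the one-hot encoding.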
XLNet is a recently introduced language model that makes use of the TransformerXL to incorporate information from the previous sequence(s) in order to process the current sequence, achieving a recurrent effect at the sequence level. To do so, it employs a relative positional encoding and a permutation language modeling approach. Although BERT and XLNet share a lot of similarities, there are some key differences that need to be explained.
Firstly, the core operation of XLNet’s Multi-Head Attention is different from the one implemented in BERT and in the Transformer. In this case, instead of just breaking up the original $Q$, $K$ and $V$ into $h$ different $Q_i$, $K_i$ and $V_i$, $h$ linear layers (for each of them) are used to map the input to the Multi-Head Attention layer into these different $Q_i$, $K_i$ and $V_i$, and thus no intermediate $Q$, $K$ and $V$ are computed. This results in the three linear layers of size $d \times d$ being replaced by $3h$ linear layers of size $d \times d/h$, which map the input into smaller subspaces (with the same number of dimensions, which add up to the original dimension). These several (and parallel) computations on different dimensions produce more variability, allowing each word to attend more to other words and not only to itself, which results in a richer final representation of each word, calculated by adding up the results of mapping back each of the sub-representations to the original embedding dimension with again 12 linear layers. This is expressed with the following equation, where $X$ is the input:
$$\text{MultiHead}(X) = \sum_{i=1}^{h} \text{softmax}\!\left(\frac{(XW_i^Q)(XW_i^K)^{\top}}{\sqrt{d_k}}\right) XW_i^V\, W_i^O. \tag{10}$$
Note that the actual implementation is a bit more complex, as shown in Figure 6, but the heart of the operation is indeed in equation (10).
Secondly, XLNet’s Attention differs from BERT’s in two further ways: 1. The keys and values (but not the queries) of the current sequence depend, for each layer, on the hidden states of the previous sequence(s), based on a memory length hyper-parameter. That is, let the hidden state (output) of the preceding layer for the previous sequence be a matrix with one row per token; then, if we choose a memory length of ten tokens, the keys and values of the current sequence for a given layer are computed by concatenating the last ten vectors of that matrix with the current hidden states and then projecting the result using the key and value weight matrices. This recurrence mechanism at the sequence level is illustrated in Figure 7. If the memory length is greater than 512, we can even reuse information from the last two sequences, although this becomes quadratically expensive. 2. The operation just described is only applied to the word embeddings, and not to the sum of the three kinds of embeddings (the inputs to the attention are just the word embeddings). The other two are used in a different way. The (relative) positional embeddings (encodings is a more suitable name) are computed with fixed sinusoidal functions of the relative positions.
Note that these also differ from BERT’s in that they are not learnt. The positional encoding matrix is projected into positional keys by learnable matrices. The dot products between these keys and the positional queries (obtained from the original queries by adding learnable biases to them) are computed, and the relevant elements along the memory dimension are then selected by performing a relative shift between this dimension and the current sequence dimension, resulting in positional attention scores, which are added to the regular attention scores before going into the softmax. This way, XLNet can attend more intelligently to the words of both the previous sequence(s) and the current sequence, by using information that is learnt based on the relative position of each word with respect to every other word of each sequence. To distinguish between the current and previous sequence(s), a segment embedding is also utilized, which consists simply of a one-hot-encoded tensor indicating, for each pair of words, whether they belong to the same sequence or not. Before attending to this segment encoding (which acts as a unique segment key), the original queries are again added to biases and then projected by learnable weight matrices. The result is also added to the attention scores before the softmax, and the operation described in equation (10) is performed with the values. The rest is almost identical to BERT (layers 5 and 6 in Figure 3; layer 4 and the batch norm after it are omitted), taking into account that the output of the first XLNet layer acts as hidden word embeddings that go into the second layer, while the positional and segment encodings fed to this layer remain the same as the ones fed to the first layer, as shown in Figure 7. The detailed architecture is shown in Figure 6.
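The sequence-level recurrence for the keys and values can be illustrated with a short sketch. It is a simplification under stated assumptions: the function name `extend_with_memory`, the toy hidden states and the weight matrices are hypothetical, and positional and segment terms are left out.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def extend_with_memory(previous_hidden, current_hidden, mem_len):
    """Prepend the last mem_len hidden vectors of the previous sequence."""
    memory = previous_hidden[-mem_len:] if mem_len > 0 else []
    return memory + current_hidden


# Hypothetical hidden states: one row per token, two dimensions per token.
previous_hidden = [[float(t), 1.0] for t in range(6)]
current_hidden = [[float(t), -1.0] for t in range(4)]

# Hypothetical projection matrices standing in for the learnt weights.
w_query = [[1.0, 0.0], [0.0, 1.0]]
w_key = [[0.5, 0.0], [0.0, 0.5]]
w_value = [[2.0, 0.0], [0.0, 2.0]]

extended = extend_with_memory(previous_hidden, current_hidden, mem_len=3)

# Queries only come from the current sequence ...
queries = matmul(current_hidden, w_query)
# ... while keys and values also cover the remembered tokens.
keys = matmul(extended, w_key)
values = matmul(extended, w_value)

# Attention scores: one row per current token, one column per
# remembered-or-current token.
scores = matmul(queries, [list(col) for col in zip(*keys)])
```

With a memory of three tokens and a current sequence of four, every current token attends to seven positions, which is why the cost grows quickly as the memory length increases.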
While the architectural differences have been listed above, XLNet also differs from BERT in its pre-training task. XLNet is pre-trained with a permutation language modeling approach. This means that, for any sequence, there are sequence_length! permutations of the factorization order, and autoregressive (AR) language modeling can be performed by maximizing the likelihood under the forward autoregressive factorization of each of them. Note that the order of the sequence remains unchanged: the permutation only affects which words are attended to, by changing the attention mask before the softmax. To predict a given word, the attention mask is set to very small (large negative) numbers for the words that come after it in the current factorization order, so that only the words before and including it in that order are used to compute the attention. The trick here is that the words that come before it change with each permutation, but their positions are kept constant within the sequence, allowing XLNet to capture bidirectional context. Additionally, because utilizing permutations causes slow convergence, XLNet is pre-trained to predict only the last 16.67% (one sixth) of the tokens in each factorization.
In order to use the position of the token that is going to be predicted, but not its content, XLNet introduces a new type of query. The same kind of Multi-Head Attention is performed, starting from a randomly initialized vector (or vectors, if we are predicting more than one token at the same time). This vector is projected by the same linear layers as the normal query to obtain the new type of query, which attends to the same keys and values as the regular query, but with the new attention mask explained before (with the difference that the element corresponding to the word being predicted is also set to a very small value). So, in the pre-training task, this new Multi-Head Attention (named the query stream) and the one from Figure 6 (named the content stream) are performed at the same time, layer by layer, because the query stream needs the outputs of each layer of the content stream to get the content keys and values used to perform the attention on the next layer. The content stream can see the content of the words that come before the target word in the factorization order, and also the target word itself, while the query stream can only see the content of the words that come before it. After going through the 12 XLNet layers and projecting the output of this new query with a linear layer of VocabularySize x target_tokens, the Cross-Entropy loss with the indexes of the real tokens is computed and the model’s parameters are updated to minimize this loss. The method just described allows XLNet to be pre-trained without the need to replace the target tokens with the special token “[MASK]”, which is not present at all during fine-tuning.
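The two attention masks used during pre-training can be derived directly from a factorization order. The following sketch is illustrative only: the function name `permutation_masks` and the example order are hypothetical, and the masks are shown as booleans (`True` means "may attend"), whereas in practice the blocked entries receive very large negative scores before the softmax.

```python
def permutation_masks(order):
    """Build content-stream and query-stream masks.

    order[k] is the position (in the original sequence) of the token that
    comes k-th in the factorization order. Rows index the attending token,
    columns index the attended token, both by original position.
    """
    rank = {position: step for step, position in enumerate(order)}
    size = len(order)
    # Content stream: tokens earlier in the order, plus the token itself.
    content = [[rank[j] <= rank[i] for j in range(size)] for i in range(size)]
    # Query stream: only tokens strictly earlier in the order.
    query = [[rank[j] < rank[i] for j in range(size)] for i in range(size)]
    return content, query


# Hypothetical factorization order for a 4-token sequence:
# position 2 first, then position 0, then 3, then 1.
order = [2, 0, 3, 1]
content_mask, query_mask = permutation_masks(order)

for position, (c_row, q_row) in enumerate(zip(content_mask, query_mask)):
    print(position, c_row, q_row)
```

The two masks differ only on the diagonal, which is exactly what lets the query stream use the position of the target token without seeing its content. The positions themselves never move; only the set of visible tokens changes with the order.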
4. Fine Tuning and Experiments
In this section we provide an overview of how neural language model fine-tuning is done for a downstream classification task such as essay scoring, and explain the experiments we ran in order to improve performance. The output layer(s) used for the pre-training task(s) are replaced with a single classification layer. This layer has as many neurons as there are labels (possible scores for the essays), with a softmax activation function, whose output is then used, together with the target, to compute a cross-entropy performance index as the loss function. In the case of BERT, the last hidden state of the first (and special) token “[CLS]” is used as the representation of the whole essay. Because this representation needs to be adjusted to the particular problem at hand, the whole model is trained. This differs from the way in which transfer learning is done on images, where, if the model was pre-trained using at least some images similar to those of the task at hand, updating all the parameters does not usually provide a boost in performance that is justified by the much longer training time. Regarding XLNet, the same method is applied, but now the “[CLS]” token is located at the end of the essay.
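As a minimal sketch of the classification layer described above, the snippet below maps an essay representation to one logit per possible score and computes the softmax cross-entropy loss. The essay vector, weights, biases and target score are hypothetical toy values; in the real models the essay vector would be the hidden state of the “[CLS]” token.

```python
import math


def classify(essay_vector, weights, biases):
    """Linear layer with one output neuron per possible score."""
    return [sum(w * x for w, x in zip(row, essay_vector)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Softmax shifted by the maximum logit for numerical stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true score."""
    return -math.log(softmax(logits)[target_index])


# Hypothetical 3-dimensional essay representation and 4 possible scores.
essay_vector = [0.2, -0.1, 0.4]
weights = [[0.1, 0.0, -0.2],
           [0.3, 0.1, 0.0],
           [-0.1, 0.2, 0.5],
           [0.0, -0.3, 0.1]]
biases = [0.0, 0.0, 0.0, 0.0]
target_score = 2  # index of the score given by the human rater

logits = classify(essay_vector, weights, biases)
probabilities = softmax(logits)
loss = cross_entropy(logits, target_score)
predicted_score = probabilities.index(max(probabilities))
```

During fine-tuning the gradient of this loss flows back through the classification layer into the whole pre-trained model, which is what adjusts the essay representation to the scoring task.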
In theory, the model should retain most of the knowledge it learnt about the English language during the pre-training tasks. This would provide not only a much better initialization, which drastically reduces the downstream training time, but also an increase in performance when compared with other Neural Networks that need to learn natural language from random initial conditions and from a much smaller corpus. However, in practice, various problems can arise, such as catastrophic forgetting, which means that the model very quickly forgets what it had previously learnt, rendering the main point of transfer learning almost useless. There are various ways of dealing with this: we try gradual unfreezing, discriminative fine-tuning and a combination of both, as proposed in previous work. Gradual unfreezing consists of training only the last layer, which contains the least general information about the language, during the first epoch, and then unfreezing one more layer per epoch, from last to first. Discriminative fine-tuning, on the other hand, consists of using different learning rates for different layers, as they capture different kinds of features in Deep Networks. In particular, BERT has been shown to attend to different kinds of words and to capture diverse linguistic notions on different attention heads. The learning rate across layers follows the formula $\eta^{k-1} = \xi \cdot \eta^{k}$, where $\eta^{k}$ is the learning rate of layer $k$ and $\xi$ is a decay factor usually set close to 1. Another closely related problem is overfitting. To mitigate this, we try using the model’s hidden states at different layers, some data pre-processing to force the model to focus more on other kinds of words, and different dropout values. Lastly, we also use an ensemble of different models trained with the different approaches.
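Both techniques against catastrophic forgetting reduce to simple schedules, sketched below. The function names, the number of layers, the top learning rate and the decay factor are hypothetical values chosen for illustration, and the layer-wise rule assumes that each layer's learning rate is the rate of the layer above it multiplied by the decay factor.

```python
def layer_learning_rates(top_lr, decay, num_layers):
    """Discriminative fine-tuning: the last layer gets top_lr and each
    earlier layer gets the rate of the layer above it times decay."""
    return [top_lr * decay ** (num_layers - 1 - layer)
            for layer in range(num_layers)]


def trainable_layers(epoch, num_layers):
    """Gradual unfreezing: only the last layer is trained during the first
    epoch (epoch 0), and one more layer is unfrozen per epoch."""
    first_unfrozen = max(0, num_layers - 1 - epoch)
    return list(range(first_unfrozen, num_layers))


# Hypothetical setup: 4 layers, illustrative rate and decay.
rates = layer_learning_rates(top_lr=1.0, decay=0.5, num_layers=4)
schedule = [trainable_layers(epoch, num_layers=4) for epoch in range(5)]

print(rates)     # earlier (more general) layers move more slowly
print(schedule)  # layers trained at each epoch, last layer first
```

Combining the two amounts to applying the per-layer rates only to the layers returned by `trainable_layers` for the current epoch, keeping the remaining layers frozen.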
We run our experiments using the pytorch-transformers implementations of BERT and XLNet. We choose Adam as the optimizer, as in the original papers, and try different learning rates, narrowing the best values down to two candidates. We also try different warmup schedules, and find that they make no significant difference. Regarding BERT, there are currently two main versions: “cased” and “uncased”. We find that, overall, “uncased” works slightly better, although for some items the “cased” version is superior; in either case the difference is very small. For XLNet, the only available version is “cased”. We also compare the base and large versions and find that they perform very similarly, so using the large versions is not worthwhile, given that they are much more expensive to fine-tune. Thus, all the results shown are for the base versions. The batch size likewise makes little difference, so we use the largest one that fits in a 12 GB GPU, i.e., 9 for BERT and 8 for XLNet.
Because BERT and XLNet were pre-trained with sequences of 512 tokens (510 when taking into account “[CLS]” and “[SEP]”), and some of our essays are considerably longer than that, we use a sliding-window approach in which longer essays are split into two or more sequences of 510 tokens. We force the last of these sequences to overlap with the second-to-last, in order to avoid meaningless padding on the last split. For prediction, we simply round the average of the scores obtained on each of these splits. We also experimented with inputting only the first 510 tokens, or the first 128 and last 382, as proposed in previous work, but this did not make any significant difference; even if it did, it should be avoided because, in the context of essay scoring, grading only part of an essay could be argued to be unethical.
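The sliding-window procedure can be sketched in a few lines. The function names are hypothetical, tokens are represented by plain integers, and the per-window scores are made-up values; note also that Python's built-in `round` resolves exact halves to the nearest even integer, which is an assumption on our part since the rounding rule is not specified above.

```python
def split_into_windows(tokens, window=510):
    """Split a token list into windows of at most `window` tokens.

    The last window is shifted back so that it is full, which makes it
    overlap with the second-to-last one instead of needing padding.
    """
    if len(tokens) <= window:
        return [list(tokens)]
    starts = list(range(0, len(tokens) - window, window))
    starts.append(len(tokens) - window)  # overlapping final window
    return [tokens[start:start + window] for start in starts]


def essay_score(window_scores):
    """Final prediction: rounded average of the per-window scores."""
    return round(sum(window_scores) / len(window_scores))


# Hypothetical essay of 1200 tokens.
essay_tokens = list(range(1200))
windows = split_into_windows(essay_tokens)

# Hypothetical scores predicted by the model for each window.
final_score = essay_score([3, 4, 4])
```

For the 1200-token essay this yields three windows starting at tokens 0, 510 and 690, so the last 330 tokens of the second window are seen again in the third one.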
In this section we evaluate in more depth each of the approaches we tried on the development set, and then provide the results on the test set by picking the best model on the development set. Table 1 shows the percentage difference in development-set qwk between each combination and the baseline, which is just using BERT/XLNet as they are. It can be seen that, overall, the methods to avoid catastrophic forgetting do not work very well, although for particular items they can give a small boost in performance. Their combination (1+2) also performs poorly on all the items except for BERT on item 8. Increasing the dropout probability is not a good idea either: when it helps, it does so only slightly. However, in the case of BERT, decreasing the input complexity (removing stop-words), the model complexity (using only the first three layers) and a combination of both seems to actually reduce overfitting and works the best overall, which is good news because it is much cheaper than running a combination of the previous methods, even more so when trying different learning rates and warm-up schedules for each of them. On the other hand, XLNet only sees an increase in performance for two items when removing stop-words, and using three layers does not help either. These findings suggest that BERT is more flexible than XLNet, or at least that it can adapt better to extreme changes at the architecture and input levels. Regarding catastrophic forgetting, it looks like XLNet sees more improvement than BERT for the items in which gradual unfreezing, discriminative fine-tuning or their combination boosts performance.
| BERT Experiments / Item | 1 (%) | 2 (%) | 3 (%) | 4 (%) | 5 (%) | 6 (%) | 7 (%) | 8 (%) | Mean (%) |
|---|---|---|---|---|---|---|---|---|---|
| (1) Gradual Unfreezing | -4.58 | -5.17 | +0.04 | -1.49 | -5.76 | -4.19 | -18.87 | -6.41 | -5.80 |
| (2) Discriminative Finetuning () | -3.59 | -0.04 | +0.70 | +1.17 | -1.61 | -1.08 | -0.14 | -10.20 | -1.85 |
| (3) Dropout (0.2) | -3.77 | +0.49 | +0.93 | -0.12 | -2.11 | -0.54 | -3.57 | -31.83 | -5.07 |
| (4) Remove Stop-Words | +2.60 | -2.43 | -0.66 | -0.92 | -4.03 | -2.18 | +0.69 | -0.17 | -0.89 |
| (5) 3 Layers | -0.41 | +0.23 | -0.55 | -0.52 | +0.12 | -1.14 | +0.37 | -2.71 | -0.58 |
| 4 + 5 | +1.82 | +1.35 | +2.04 | -0.05 | -1.90 | -2.20 | +0.77 | -0.65 | +0.15 |

| XLNet Experiments / Item | 1 (%) | 2 (%) | 3 (%) | 4 (%) | 5 (%) | 6 (%) | 7 (%) | 8 (%) | Mean (%) |
|---|---|---|---|---|---|---|---|---|---|
| (1) Gradual Unfreezing | -4.46 | -6.00 | -2.86 | +1.38 | -3.31 | -1.34 | -1.95 | -0.26 | -2.35 |
| (2) Discriminative Finetuning () | +2.41 | -2.02 | -1.05 | -1.02 | -1.75 | -0.41 | -1.26 | +4.50 | -0.08 |
| (3) Dropout (0.2) | -0.59 | -2.28 | -3.16 | -2.65 | -2.53 | -6.88 | -6.89 | -1.87 | -3.36 |
| (4) Remove Stop-Words | +2.66 | -4.19 | -1.72 | -0.79 | -1.59 | -2.36 | +2.04 | -3.19 | -1.14 |
| (5) 3 Layers | -0.96 | -0.29 | -4.07 | -1.54 | +0.06 | -3.70 | -7.95 | -20.06 | -4.81 |
| 4 + 5 | +0.94 | -1.26 | -2.49 | -2.89 | -0.12 | -1.32 | -8.58 | -20.44 | -4.52 |
Table 2 shows the final results on each item for BERT, XLNet, a BERT ensemble, an XLNet ensemble and a BERT + XLNet ensemble. The first two ensembles consist of 6 models obtained using the different experiments from above. We tried taking a majority vote (using the best model out of these 6 to decide when there is a tie), and rounding the mean of the scores predicted by each model. Both methods performed similarly on items 1 to 6, but the majority vote performed significantly worse on items 7 and 8. The BERT + XLNet ensemble consists of 12 models, i.e., it combines the models from the two other ensembles. We also show the results for the LSTM from previous work and its ensemble, which consists of 10 LSTMs and 10 LSTMs with a convolutional layer before them, and which also arrives at the final prediction by taking the mean of the scores. The last two rows correspond to the Bag of Words model and the inter-human agreement.
| Item | 1 qwk (%) | 2 qwk (%) | 3 qwk (%) | 4 qwk (%) | 5 qwk (%) | 6 qwk (%) | 7 qwk (%) | 8 qwk (%) | Avg qwk (%) |
|---|---|---|---|---|---|---|---|---|---|
| BERT + XLNet Ensemble | 80.78 | 69.67 | 70.31 | 81.90 | 80.82 | 81.45 | 80.67 | 60.46 | 75.76 |
| LSTM (+CNN) Ensemble | 82.10 | 68.80 | 69.40 | 80.50 | 80.70 | 81.90 | 80.80 | 64.40 | 76.08 |
| EASE (Bag of Words) | 78.10 | 62.10 | 63.00 | 74.90 | 78.20 | 77.10 | 72.70 | 53.40 | 69.90 |
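The two ways of combining the ensemble predictions, majority vote and rounded mean, can be sketched as follows. The function names and the example predictions are hypothetical, and the tie-breaking rule is our reading of the description: whenever several scores share the highest count, the prediction of the best individual model is returned.

```python
from collections import Counter


def majority_vote(predictions, best_model_index):
    """Most frequent score; ties are decided by the best single model."""
    counts = Counter(predictions)
    top_count = max(counts.values())
    winners = [score for score, n in counts.items() if n == top_count]
    if len(winners) == 1:
        return winners[0]
    return predictions[best_model_index]


def rounded_mean(predictions):
    """Round the mean of the scores predicted by each model."""
    return round(sum(predictions) / len(predictions))


# Hypothetical scores from a 6-model ensemble for two essays.
clear_case = [2, 3, 3, 4, 3, 2]
tied_case = [2, 2, 3, 3, 4, 1]

print(majority_vote(clear_case, best_model_index=0), rounded_mean(clear_case))
print(majority_vote(tied_case, best_model_index=2), rounded_mean(tied_case))
```

The rounded mean uses how far each model is from the others, whereas the vote only counts agreements, which may matter more on items with a wide score range.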
Regarding the individual models, BERT, XLNet and the LSTM obtain very similar average qwk across all the items. This suggests that the essay scoring problem has reached its ceiling in terms of modeling, at least for now. When compared to their individual versions, the ensembles boost performance by 0.51% for BERT, 0.53% for XLNet, and 1.45% for the LSTM. The main difference between the Language Model ensembles and the LSTM ensemble that may account for an almost 3x bigger delta in favor of the LSTM is the number of models (6 vs 20), although it is possible that the convolutional layer on the LSTM produces more variability than using a different number of layers and altering the inputs to the Language Models. The BERT + XLNet ensemble (12 models) does better (+1.01% from BERT and +1.42% from XLNet), which points to the first reason being more likely.
When compared with the Bag of Words method, Neural Networks show a significant superiority, with a 6% higher qwk on average. What is more, there is no single item for which the Bag of Words performs better. And although individual networks are still a bit below the inter-human agreement, the ensembles of these models actually beat humans on average, by 0.79% in the case of the LSTM and by 0.47% in the case of BERT + XLNet. Item by item, Neural Networks achieve higher-than-human qwk on 5 out of the 8 items.
Transfer learning and language models have enhanced the analysis of texts in natural language processing. In this paper, we applied two major transformer-based neural network models, BERT and XLNet, to essay scoring on the Kaggle dataset. Both models are discussed in detail so that researchers can build on them for further improvements. The results of BERT and XLNet are compared with those of traditional methods and with human standards. Overall, we obtained better results than human raters and rule-based techniques. Our major contributions are explaining the network architectures and generalizing them with simple notation, and implementing a classification technique using these models on the essay scoring problem to obtain an automated engine. This engine tends to be more reliable than humans and saves a lot of time and money when grading essays on a large scale.
I would like to thank Balaji Kodeswaran and Paul van Wamelen for their support and discussions.
- (2016) Automatic text scoring using neural networks. arXiv preprint arXiv:1606.04289.
- (2000) Rubrics, scoring guides, and performance criteria: classroom tools for assessing and improving student learning.
- (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- (2015) Gated feedback recurrent neural networks. In International Conference on Machine Learning, pp. 2067–2075.
- (2019-06) What does BERT look at? An analysis of BERT’s attention. arXiv preprint arXiv:1906.0434.
- (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20 (1), pp. 37–46.
- (2019-01) Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
- (2018-10) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2014) Handbook of inter-rater reliability: the definitive guide to measuring the extent of agreement among raters. Advanced Analytics, LLC.
- (1994) Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks 5 (6), pp. 989–993.
- Neural network design, 2nd edition. PWS Publishing, Boston.
- (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- (2018) Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
- (2015) Enhanced recurrent network training. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
- Application of new training methods for neural model reference control. Engineering Applications of Artificial Intelligence 74, pp. 312–321.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- (2014) How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328.
- (2019-07) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- (1967) Grading essays by computer: progress report. In Proceedings of the Invitational Conference on Testing Problems, pp. 87–100.
- (1968) The use of the computer in analyzing student essays. In International Review of Education, pp. 210–225.
- (1966) The imminence of… grading essays by computer. The Phi Delta Kappan 47 (5), pp. 238–243.
- (2013) On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318.
- (2018) Deep contextualized word representations. In Proc. of NAACL.
- (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8).
- (2015) Effective feature integration for automated short answer scoring. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1049–1054.
- (2012) Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152.
- (2015) Contrasting state-of-the-art in the machine scoring of short-form constructed responses. Educational Assessment 20 (1), pp. 46–65.
- (2019-05) How to fine-tune BERT for text classification? arXiv preprint arXiv:1905.05583.
- (2016-11) A neural approach to automated essay scoring. In Conference on Empirical Methods in Natural Language Processing, pp. 1882–1891.
- (2017-06) Attention is all you need. arXiv preprint arXiv:1706.03762.
- (2012) A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice 31 (1), pp. 2–13.
- (2019-06) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
- (2015) Evaluating the performance of automated text scoring systems. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 213–223.