1. Introduction
One of the prominent features of human intelligence is the ability to track one’s current knowledge state in mastering specific skills or concepts. This enables humans to identify gaps in their knowledge and personalise their learning experience. With the success of artificial intelligence (AI) in modeling various areas of human cognition (LeCun et al., 2015; Graves et al., 2013; Yu et al., 2017; Donahue et al., 2015; Liu et al., 2016), the question has arisen: can machines trace human knowledge as humans do? This motivated the study of knowledge tracing (KT), which aims to model the knowledge states of students in mastering skills and concepts through the sequence of learning activities they participate in (Corbett and Anderson, 1994; Piech et al., 2015; Zhang et al., 2017).
Knowledge tracing is of fundamental importance to a wide range of applications in education, such as massive open online courses (MOOCs), intelligent tutoring systems, educational games, and learning management systems. Improvements in knowledge tracing can drive novel techniques to advance human learning. For example, knowledge tracing can be used to discover students’ individual learning needs so that personalised learning and support can be provided to suit the diverse capabilities of each student (Khajah et al., 2016). It can also be used by human experts to design new measures of student learning and new teaching materials based on the learning strengths and weaknesses of students (Piech, 2016). Nonetheless, tracing human knowledge using machines is a challenging task. This is due to the complexity of human learning behaviors (e.g., memorising, guessing, forgetting, etc.) and the inherent difficulties of modeling human knowledge (i.e. skills and prior background) (Piech et al., 2015).
Existing knowledge tracing models can be generally classified into two categories: traditional machine learning KT models and deep learning KT models. Among traditional machine learning KT models, Bayesian Knowledge Tracing (BKT) is the most popular (Corbett and Anderson, 1994); it models the knowledge tracing problem as predicting the state of a dynamical system that has hidden latent variables (i.e. learning concepts). In addition to BKT, probabilistic graphical models such as Hidden Markov Models (HMMs) (Corbett and Anderson, 1994; Baker et al., 2008) or Bayesian belief networks (Villano, 1992) have also been used to model knowledge tracing. To keep the inference computation tractable, traditional machine learning KT models use discrete random state variables with simple transition regimes, which limits their ability to represent complex dynamics between learning concepts. Moreover, these models often assume a first-order Markov chain for an exercise sequence (i.e. the most recent observation is taken to represent the whole history), which also limits their ability to model long-term dependencies in an exercise sequence.
Inspired by recent advances in deep learning (LeCun et al., 2015), several deep learning KT models have been proposed. Pioneering work by Piech et al. (2015) introduced Deep Knowledge Tracing (DKT), which uses a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) units (Graves et al., 2013; Sutskever et al., 2014) to predict student performance on new exercises given their past learning history. In DKT, a student’s knowledge states are represented by a sequence of hidden states that successively encode relevant information from past observations over time. Although DKT has achieved substantial improvements in prediction performance over BKT, it represents a knowledge state by a single hidden state and therefore cannot trace how specific concepts are mastered by a student (i.e., concept states) within a knowledge state. To deal with this limitation, Dynamic Key-Value Memory Networks (DKVMN) (Zhang et al., 2017) were proposed to model a student’s knowledge state as a complex function over all underlying concept states using a key-value memory. The idea of augmenting DKVMN with an auxiliary memory follows the concepts of Memory-Augmented Neural Networks (MANN) (Santoro et al., 2016; Graves et al., 2016). However, DKVMN acquires the knowledge growth through the most recent exercise and thus fails to capture long-term dependencies in an exercise sequence (i.e., relevant past experience to a new observation).
In this paper, we present a new KT model, called Sequential Key-Value Memory Networks (SKVMN). This model provides three advantages over the existing deep learning KT models:

First, SKVMN unifies the strengths of the recurrent modelling capacity of DKT and the memory capacity of DKVMN for modelling student learning. We observe that, although a key-value memory can help trace the concept states of a student, it is not effective in modeling long-term dependencies on sequential data. We remedy this issue by incorporating LSTMs into the sequence modelling for a student’s knowledge states over time. Thus, SKVMN is not only augmented with a key-value memory to enhance the representation capability of knowledge states at each time step, but also provides recurrent modelling capability for capturing dependencies among knowledge states at different time steps in a sequence.

Second, SKVMN uses a modified LSTM with hops, called Hop-LSTM, in its sequence modelling. Hop-LSTM deviates from the standard LSTM architecture by using a triangular layer for discovering sequential dependencies between exercises in a sequence. The model may then hop across LSTM cells according to the relevancy of the latent learning concepts. This enables relevant exercises that correlate to similar concepts to be processed together. In doing so, the inference becomes faster and the capacity for capturing long-term dependencies in an exercise sequence is enhanced.

Third, SKVMN improves the write process of DKVMN in order to better represent knowledge states stored in a key-value memory. In DKVMN, the current knowledge state is not considered when calculating the knowledge growth of a new exercise. This means that the previous learning experience is ignored. For example, when a student attempts the same question multiple times, the same knowledge growth would be added to the knowledge state, regardless of whether the student has previously answered this question or how many times the answers were correct. SKVMN solves this issue by using a summary vector as input for the write process, which reflects both the current knowledge state of a student and the prior difficulty of a new question.
We have extensively evaluated our proposed model SKVMN on five well-established KT benchmark datasets, and compared it with the state-of-the-art KT models. The experimental results show that (1) SKVMN outperforms the existing KT models, including DKT and DKVMN, on all five datasets, (2) SKVMN can better discover the correlation between latent concepts and questions, and (3) SKVMN can trace the knowledge state of students dynamically, and leverage sequential dependencies between exercises in an exercise sequence for improved prediction accuracy.
Figure 1 illustrates how a student’s knowledge states evolve as the student attempts a sequence of 50 exercises in DKVMN and SKVMN. We can see that, compared with the knowledge states of DKVMN depicted in Figure 1.(a), our model SKVMN provides a smoother transition between two successive concept states, as depicted in Figure 1.(b). Moreover, SKVMN captures a smooth, progressive evolution of knowledge states over time (i.e., through this exercise sequence), which more accurately reflects the way a student learns (discussed in detail in Section 5.4).
2. Problem Formulation
Broadly speaking, knowledge tracing tracks students’ knowledge states over time through a sequence of observations on how the students interact with given learning activities. In this paper, we formulate knowledge tracing as a sequence prediction problem in machine learning, which learns the knowledge state of a student based on an exercise answering history.
Let be the set of all distinct question tags in a dataset. Each may have a different level of difficulty, which is not explicitly provided. An exercise is a pair consisting of a question tag
and a binary variable
representing the answer, where means that is incorrectly answered and means that is correctly answered. When a student interacts with the questions in , a history of exercises undertaken by the student can be observed. Based on a history of exercises, we want to predict the probability of correctly answering a new question at time step
by the student, i.e., . We assume that the questions in are associated with latent concepts . The concept state of each latent concept
is a random variable describing the mastery level of the student on this latent concept. At each time step
, the knowledge state of a student is modelled as a set of all concept states of the student, each corresponding to a latent concept in at time step .
3. Sequential Key-Value Memory Networks
In this section, we introduce our model Sequential Key-Value Memory Networks (SKVMN). We first present an overview of SKVMN. Then, we show how a key-value memory can be attended, read and written in our model. To leverage sequential dependencies among latent concepts for prediction, we then present a modified LSTM, called Hop-LSTM. Lastly, we discuss the optimisation techniques used in the model.
3.1. Model Overview
The SKVMN model is augmented with a key-value memory following the work in (Zhang et al., 2017), i.e., a pair of one static matrix of size , called the key matrix, and one dynamic matrix of size , called the value matrix. Both the key matrix and the value matrix have the same number of memory slots, but they may differ in their state dimensions and . The key matrix stores the latent concepts underlying questions, and the value matrix stores the concept states of a student (i.e., the knowledge state), which can change dynamically based on student learning.
Given an input question at time step , the SKVMN model retrieves the knowledge state of a student from the key-value memory , and predicts the probability of correctly answering the question by the student. Figure 2.(a) illustrates the SKVMN model at time step , which consists of four layers: the embedding, memory, sequence and output layers.

The embedding layer is responsible for mapping an input question at time step into a high-dimensional vector space.

The memory layer involves two processes: attention and read, where the attention process provides an addressing mechanism for the input question to locate the relevant information in the key-value memory, and the read process uses the attention vector to retrieve the current knowledge state of the student from the value matrix . The details of the attention and read processes will be discussed in Sections 3.2.1 and 3.2.2.

The output layer generates the probability of correctly answering the input question .
After a student has attempted the input question with the answer , the value matrix in the key-value memory of the SKVMN model needs to be updated in order to reflect the latest knowledge state of the student. Figure 3 depicts how the value matrix transitions from at time step to at time step using the write process. The details of the write process will be discussed in Section 3.2.3.
3.2. Attention, Read and Write
There are three processes relating to access to a key-value memory in our model: attention, read and write. In the following, we elaborate on these processes.
3.2.1. Attention
Given a question as input, the model represents as a “one-hot” vector of length in which all entries are zero, except for the entry that corresponds to .
In order to map into a continuous vector space, is multiplied by an embedding matrix , which generates a continuous embedding vector , where is the embedding dimension. Then, an attention vector is obtained by applying the Softmax function to the inner product between the embedding vector and each key slot in the key matrix as follows:
(1) 
where . Conceptually, represents the correlation between the question and the underlying latent concepts stored in the key matrix .
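The attention step described above can be illustrated with a minimal sketch. It assumes the key matrix is stored as a list of key slots and that the question has already been embedded; the function names `softmax` and `attention` and the toy values are hypothetical and are not taken from the original work.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(embedding, key_matrix):
    """Softmax over the inner products between the embedding and each key slot."""
    scores = [sum(k * e for k, e in zip(slot, embedding)) for slot in key_matrix]
    return softmax(scores)


# Toy values (assumed): three key slots, embedding dimension two.
key_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embedding = [2.0, 0.0]
weights = attention(embedding, key_matrix)
```

Each entry of `weights` is non-negative and the entries sum to one, so the vector can be read as how strongly the question correlates with each latent concept slot.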
3.2.2. Read
For each exercise , the model uses its corresponding attention vector to retrieve the concept states of the student with regard to the question from the value matrix . Specifically, the read process takes the attention vector as input and yields a read vector which is the weighted sum of all values being attended by in the memory slots of the value matrix , i.e.,
(2) 
The read vector is concatenated with the embedding vector . Then, the combined vector is fed to a Tanh layer to calculate the summary vector :
(3) 
where , is the weight matrix of the Tanh layer, and
is the bias vector. While the read vector
represents the student’s knowledge state with respect to the relevant concepts of the current question , the summary vector adds prior information of the question (e.g., the level of difficulty) to this knowledge state. Intuitively, represents how well the student has mastered the latent concepts relevant to the question before attempting this question.
3.2.3. Write
The write process occurs each time after the student has attempted a question. The purpose of this write process is to update the concept states of the student in the value matrix using the knowledge growth gained through attempting the question . This update leads to the transition of the value matrix from to as depicted in Figure 3.
To calculate the knowledge growth, our model considers not only the correctness of the answer for , but also the student’s mastery level of the concept states before attempting . Each is represented as a vector of length and multiplied by an embedding matrix to get a write vector , which represents the knowledge growth of the student obtained by attempting the question.
Similar to other memory augmented networks (Graves et al., 2016; Zhang et al., 2017), the write process proceeds with two gates: erase gate and add gate. The former controls what information to erase from the knowledge state after attempting the latest exercise , while the latter controls what information to add into the knowledge state. From the knowledge tracing perspective, these two gates capture the forgetting and enhancing aspects of learning knowledge, respectively.
With the write vector for the knowledge growth, an erase vector is calculated as:
(4) 
where and is the weight matrix. Using the attention vector , the value matrix after applying the erase vector is
(5) 
Then, an add vector is calculated by
(6) 
where is the weight matrix. Finally, the value matrix for the next knowledge state is updated as:
(7) 
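The read and write processes of Sections 3.2.2 and 3.2.3 can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the erase gate is assumed to use a Sigmoid activation and the add gate a Tanh activation, as is common in memory-augmented networks, the write vector is taken as a given input, and all function names, parameter names and toy values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weight, x, bias):
    """Compute weight @ x + bias, with weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def read(weights, value_matrix):
    """Weighted sum of the value slots, using the attention weights."""
    dim = len(value_matrix[0])
    return [sum(w * slot[j] for w, slot in zip(weights, value_matrix)) for j in range(dim)]


def summary(read_vec, embedding, weight, bias):
    """Tanh layer applied to the concatenation of read vector and embedding."""
    return [math.tanh(z) for z in affine(weight, read_vec + embedding, bias)]


def write(weights, value_matrix, write_vec, w_erase, b_erase, w_add, b_add):
    """Erase-then-add update of every value slot, scaled by its attention weight."""
    erase = [sigmoid(z) for z in affine(w_erase, write_vec, b_erase)]  # assumed activation
    add = [math.tanh(z) for z in affine(w_add, write_vec, b_add)]      # assumed activation
    return [
        [v * (1.0 - w * e) + w * a for v, e, a in zip(slot, erase, add)]
        for w, slot in zip(weights, value_matrix)
    ]


# Toy values (assumed): two slots of dimension two, all weight matrices zero.
values = [[0.5, 0.5], [0.2, 0.2]]
attn = [1.0, 0.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
read_vec = read(attn, values)
summary_vec = summary(read_vec, [1.0, 0.0], [[0.0] * 4, [0.0] * 4], [0.0, 0.0])
new_values = write(attn, values, [1.0, 1.0], zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```

In this toy run the second slot receives zero attention and is left unchanged, while the first slot is half erased because the erase gate evaluates to 0.5 when its weights are zero.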
3.3. Sequence Modelling
Now we discuss the sequence modelling approach used at the sequence layer for predicting the probability of correctly answering the question based on the exercise history.
3.3.1. Sequential dependencies
An exercise history of a student may contain a long sequence of exercises, for example, the average sequence length in the ASSISTments2009 dataset is questions per sequence. However, as different exercises may correlate to different latent concepts, not all exercises in can equally contribute to the prediction of answering a given question . Thus, we observe that, by hopping across irrelevant exercises in with regard to , recurrent models can be applied on a shorter and more relevant sequence, leading to more efficient and accurate prediction performance.
For each question in , since its attention vector reflects the correlation between this question and the latent concepts in , we consider that the similarity between two attention vectors can provide a good indication of how their corresponding questions are relevant in terms of their correlations with latent concepts. For example, suppose that we have three latent concepts , and two attention vectors and that correspond to the questions and , respectively. Then and are considered as being relevant if both and are mapped to a vector where , and refer to the value ranges “low”, “middle” and “high”, respectively.
Now, the question is: how to identify similar attention vectors? For this, we use the triangular membership function (Klir and Yuan, 1995):
(8) 
where the parameters and determine the feet and the parameter determines the peak of a triangle. We use three triangular membership functions for three value ranges: low (0), medium (1), and high (2). Each realvalued component in an attention vector is mapped to one of the three value ranges. Each attention vector is associated with an identity vector . Similar attention vectors have the same identity vector, while dissimilar attention vectors have different identity vectors.
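The mapping from an attention vector to an identity vector can be sketched as below. The triangular membership function is written in its standard textbook form; the three parameter triples in `RANGES` are hypothetical placeholders, since the text states that the actual values are set empirically per dataset, and assigning each component to the range with the highest membership is an assumption of this sketch.

```python
def triangular(x, left, peak, right):
    """Standard triangular membership: 0 at the feet, 1 at the peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical (left foot, peak, right foot) triples for low (0), medium (1), high (2).
RANGES = [(-0.5, 0.0, 0.5), (0.0, 0.5, 1.0), (0.5, 1.0, 1.5)]


def identity_vector(attention_vector):
    """Map each attention weight to the value range with the highest membership."""
    identity = []
    for x in attention_vector:
        memberships = [triangular(x, *params) for params in RANGES]
        identity.append(memberships.index(max(memberships)))
    return tuple(identity)


first = identity_vector([0.1, 0.2, 0.7])
second = identity_vector([0.12, 0.18, 0.70])
third = identity_vector([0.05, 0.15, 0.8])
```

Here `first` and `second` share an identity vector, so their questions would be treated as relevant to each other, while `third` falls into a different range on its last component.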
Then, at each time step , for the current question and an exercise history , we say is sequentially dependent on in with , denoted as , if the following two conditions are satisfied:

The attention vectors of and have the same identity vector, i.e., , and

There is no other in such that and , i.e., is the most recent exercise in that is relevant to .
Over time, given a sequence of questions , we thus have an exercise history in which exercises are partitioned into a set of subsequences with and, for any two consecutive exercises and in the subsequence , is sequentially dependent on , i.e. . In Figure 2, based on the sequential dependencies presented in Figure 2.(b), is partitioned into and , which correspond to the recurrently connected LSTM cells depicted in Figure 2.(a). We will discuss further details in the following.
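The two conditions for sequential dependency amount to linking each exercise to the most recent earlier exercise with the same identity vector. A minimal sketch, assuming identity vectors are available as hashable tuples (the function name and the toy sequence are hypothetical):

```python
def partition_by_identity(identities):
    """Link each step to the latest earlier step with the same identity vector.

    Returns (prev, groups): prev[t] is the index that step t depends on, or
    None if there is no such step; groups lists the resulting subsequences.
    """
    last_seen = {}
    groups = {}
    prev = []
    for t, identity in enumerate(identities):
        prev.append(last_seen.get(identity))
        last_seen[identity] = t
        groups.setdefault(identity, []).append(t)
    return prev, list(groups.values())


# Toy sequence of identity vectors for five exercises (assumed values).
identities = [(0, 2), (2, 0), (0, 2), (0, 2), (2, 0)]
prev, groups = partition_by_identity(identities)
```

Each group is a chain in which every exercise depends on the one before it, which is the structure that the recurrently connected LSTM cells follow.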
3.3.2. Hop-LSTM
Based on sequence dependencies between exercises in a sequence, a modified LSTM with hops, called Hop-LSTM, is used to predict the probability of correctly answering a new question by a student. Different from the standard LSTM architecture, Hop-LSTM allows us to recurrently connect the LSTM cells for questions based on the relevance of their latent concepts. More precisely, two LSTM cells in Hop-LSTM are connected only if the input question of one LSTM cell is sequentially dependent on the input question of the other LSTM cell. This means that Hop-LSTM has the capability of hopping across the LSTM cells when their input questions are irrelevant to the current question .
Formally, at each time step , for the question , if there is an exercise with and , then the current LSTM cell takes the summary vector and the hidden state as input. Moreover, this LSTM cell updates the cell state into the cell state , and generates the new hidden state as output. As in (Hochreiter and Schmidhuber, 1997), the LSTM cell used in our work has three gates: forget gate , input gate , and output gate in addition to the hidden state and the cell state :
(9)  
(10)  
(11)  
(12)  
(13)  
(14) 
Then, the output vector of the current LSTM cell is sent to a Sigmoid layer, which calculates the probability of correctly answering the current question by
(15) 
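The hopping behaviour can be sketched with a standard LSTM cell whose previous state is taken from the exercise it depends on rather than from the immediately preceding time step. This is a sketch under assumptions: the gate equations follow the common LSTM formulation, a zero initial state is assumed when an exercise has no earlier dependency, and the names `lstm_cell`, `hop_lstm`, `predict` and the parameter dictionary are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weight, x, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def lstm_cell(x, h_prev, c_prev, p):
    """One standard LSTM step on the concatenation of input and previous hidden state."""
    z = x + h_prev
    f = [sigmoid(v) for v in affine(p["W_f"], z, p["b_f"])]    # forget gate
    i = [sigmoid(v) for v in affine(p["W_i"], z, p["b_i"])]    # input gate
    o = [sigmoid(v) for v in affine(p["W_o"], z, p["b_o"])]    # output gate
    g = [math.tanh(v) for v in affine(p["W_c"], z, p["b_c"])]  # candidate cell state
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def hop_lstm(summaries, prev, p, hidden):
    """Run the cell per step, reading the state of step prev[t] instead of t - 1."""
    states = []
    for t, x in enumerate(summaries):
        if prev[t] is None:
            h_prev, c_prev = [0.0] * hidden, [0.0] * hidden  # assumed initial state
        else:
            h_prev, c_prev = states[prev[t]]
        states.append(lstm_cell(x, h_prev, c_prev, p))
    return states


def predict(h, w_out, b_out):
    """Sigmoid output layer giving the probability of a correct answer."""
    return sigmoid(sum(w * hj for w, hj in zip(w_out, h)) + b_out)


# Toy parameters (assumed): input size 1, hidden size 1, only the candidate weight is non-zero.
params = {
    "W_f": [[0.0, 0.0]], "b_f": [0.0],
    "W_i": [[0.0, 0.0]], "b_i": [0.0],
    "W_o": [[0.0, 0.0]], "b_o": [0.0],
    "W_c": [[1.0, 0.0]], "b_c": [0.0],
}
states = hop_lstm([[1.0], [-1.0], [1.0]], [None, None, 0], params, hidden=1)
probability = predict(states[2][0], [1.0], 0.0)
```

In the toy run the third step hops over the second one and continues from the first, so its cell state accumulates the first step's contribution rather than the opposite-signed state of the second step.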
3.4. Model Optimisation
To optimise the model, we use the cross-entropy loss function between the predicted probability of being correctly answered
and the true answer . The following objective function is defined over training data:
(16) 
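As an illustration of a cross-entropy objective between predicted probabilities and binary answers, the following sketch sums the per-exercise losses. Whether the original objective is summed or averaged is not recoverable from the text here, so the summation, the clipping constant `eps` and the function name are assumptions.

```python
import math


def cross_entropy(probabilities, answers, eps=1e-12):
    """Summed binary cross-entropy between predictions and 0/1 answers."""
    total = 0.0
    for p, y in zip(probabilities, answers):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


uninformed = cross_entropy([0.5, 0.5], [1, 0])
confident_right = cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = cross_entropy([0.1, 0.9], [1, 0])
```

The loss is smallest when high probabilities are assigned to correctly answered exercises and low probabilities to incorrectly answered ones.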
We initialise the memory matrices ( and ) and embedding matrices ( and
) using a random Gaussian distribution
. For the weights and biases of the neural layers, we use Glorot uniform random initialization (Glorot and Bengio, 2010) for faster convergence. These randomly initialized parameters are optimized using the stochastic gradient descent (SGD) mechanism (Bottou, 2012). As we use Hop-LSTM at the sequence layer, during the backpropagation of the gradients, only the parameters of the connected LSTM cells (i.e. the ones responsible for the current prediction error) are updated. Other parameters, such as the embedding matrices, weight matrices, and bias vectors, are updated in each backpropagation iteration based on loss function values.
Note that, through training, the model can discover relevant latent concepts for each question and store their state values in the value matrix .
4. Experiments
In this section, we present the experiments evaluating our proposed model SKVMN against the state-of-the-art KT models. These experiments aim to answer the following research questions:

What is the optimal size for a key-value memory (i.e., the key and value matrices) of SKVMN?

How does SKVMN perform on predicting a student’s answers to new questions, given an exercise history?

How does SKVMN perform on discovering the correlation between latent concepts and questions?

How does SKVMN perform on capturing the evolution of a student’s knowledge states?
4.1. Datasets
We use five well-established datasets in the KT literature (Khajah et al., 2016; Zhang et al., 2017; Piech et al., 2015). Table 1 summarizes the statistics of the datasets.

Synthetic5 (https://github.com/chrispiech/DeepKnowledgeTracing/tree/master/data/synthetic): This dataset consists of two subsets: one for training and one for testing. Each subset contains distinct questions which were answered by virtual students. A total number of exercises (i.e. ) are contained in the dataset.

ASSISTments2009 (https://sites.google.com/site/assistmentsdata/home/assistment20092010data/skillbuilderdata20092010): This dataset was collected during the school year using the ASSISTments online education website (https://www.assistments.org/). The dataset consists of distinct questions answered by students, which gives a total number of exercises.

ASSISTments2015 (https://sites.google.com/site/assistmentsdata/home/2015assistmentsskillbuilderdata): As an update to the ASSISTments2009 dataset, this dataset was released in . It includes distinct questions answered by students with a total number of exercises. This dataset has the largest number of students among the other datasets. However, the average number of exercises per student is low. The original dataset also has some incorrect answer values (i.e. ), which are removed during preprocessing.

Statics2011 (https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507): This dataset was collected from an engineering statics course at Carnegie Mellon University during Fall . It contains distinct questions answered by undergraduate students with a total number of exercises. This dataset has the highest exercise per student ratio among all datasets.

JunyiAcademy (https://datashop.web.cmu.edu/DatasetInfo?datasetId=1198): This dataset was collected in 2015 from Junyi Academy (https://www.junyiacademy.org/), an education website providing learning materials and exercises on various scientific courses (Chang et al., 2015). It contains distinct questions answered by students with a total number of exercises. It is the largest dataset in terms of the number of exercises.
Dataset  #Questions  #Students  #Exercises  #Exercises per student
Synthetic5  
ASSISTments2009  
ASSISTments2015  
Statics2011  
JunyiAcademy 
Dataset  d  N  SKVMN AUC (%)  SKVMN m  DKVMN AUC (%)  DKVMN m
Synthetic5  10  50  83.11  15k  82.00  12k
Synthetic5  50  50  83.67  30k  82.66  25k
Synthetic5  100  50  84.00  57k  82.73  50k
Synthetic5  200  50  83.73  140k  82.71  130k
ASSISTments2009  10  10  83.63  7.8k  81.47  7k
ASSISTments2009  50  20  82.87  35k  81.57  31k
ASSISTments2009  100  10  82.72  71k  81.42  68k
ASSISTments2009  200  20  82.63  181k  81.37  177k
ASSISTments2015  10  20  74.84  16k  72.68  14k
ASSISTments2015  50  10  74.50  31k  72.66  29k
ASSISTments2015  100  50  74.24  66k  72.64  63k
ASSISTments2015  200  50  74.20  163k  72.53  153k
Statics2011  10  10  84.50  92.8k  82.72  92k
Statics2011  50  10  84.85  199k  82.84  197k
Statics2011  100  10  84.70  342k  82.71  338k
Statics2011  200  10  84.76  653k  82.70  649k
JunyiAcademy  10  20  82.50  16k  79.63  14k
JunyiAcademy  50  10  82.41  31k  79.48  29k
JunyiAcademy  100  50  82.67  66k  79.54  63k
JunyiAcademy  200  50  82.32  163k  80.27  153k
4.2. Baselines
In order to evaluate the performance of our proposed model, we select the following three KT models as the baselines:

Bayesian knowledge tracing (BKT) (Corbett and Anderson, 1994), which is based on Bayesian inference in which a knowledge state is modelled as a set of binary variables, each representing the understanding of a single concept.

Deep knowledge tracing (DKT) (Piech et al., 2015) which uses recurrent neural networks to model student learning.

Dynamic key-value memory networks (DKVMN) (Zhang et al., 2017), which extends memory-augmented neural networks (MANN) with a key-value memory and is considered the state-of-the-art model for knowledge tracing.
Our proposed model is referred to as SKVMN in the experiments.
4.3. Measures
We use the area under the Receiver Operating Characteristic (ROC) curve, referred to as AUC (Ling et al., 2003), to measure the prediction performance of the KT models. The AUC ranges in value from 0 to 1. An AUC score of 0.5 means random prediction (i.e. coin flipping). The higher an AUC score goes above 0.5, the more accurately a predictive model can perform.
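For intuition, the AUC can be computed as the fraction of (correct, incorrect) answer pairs in which the correctly answered exercise receives the higher predicted probability, with ties counted as one half. The sketch below uses this pairwise formulation; it is an illustrative quadratic-time version with a hypothetical function name, not the evaluation code used in the experiments.

```python
def auc(scores, labels):
    """Pairwise AUC: probability that a positive is ranked above a negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
constant = auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
partial = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

A model that outputs the same score for every exercise cannot rank any pair, which is why it lands exactly at the random-prediction level.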
4.4. Evaluation Settings
We divided each dataset into % for training and validation and % for testing, except for Synthetic5. This is because, as mentioned before, Synthetic5 itself contains the training and test subsets of the same size. For each training and validation subset, we further divided it using 5-fold cross validation (e.g. % for training and
% for validation). The validation subset was used to determine the optimal values for the hyperparameters, including the memory slot dimensions
for the key matrix and for the value matrix.
Dataset  BKT  DKT  DKVMN  SKVMN 
Synthetic5  
ASSISTments2009  
ASSISTments2015  
Statics2011  
JunyiAcademy 
We utilised the Adam optimizer (Kingma and Ba, 2015) for SGD implementation with momentum of and learning rate of annealed using a cosine function every epochs for epochs, then it remains fixed at . The LSTM gradients were clipped to improve the training (Pascanu et al., 2013). For the other baselines, we follow the optimisation procedures indicated in the original work for each of them (Corbett and Anderson, 1994; Zhang et al., 2017; Piech et al., 2015).
A minibatch of is selected during the training for all datasets, except Synthetic5, for which we use a minibatch of due to the relatively small number of training samples (i.e. exercises) in the dataset (Bengio, 2012). For each dataset, the training process is repeated five times, each time using a different initialization. We report the average test AUC and the standard deviation over these five runs.
For the Triangular layer, the hyperparameter values () of each triangular membership function are set based on the empirical analysis of each dataset.
5. Results and Discussion
In this section, we present the experimental results and discuss our observations from the obtained results.
5.1. Hyperparameters N and d
To explore how the sizes of the key and value matrices can affect the model performance, we have conducted experiments to compare SKVMN with DKVMN under different numbers of memory slots and state dimensions , where so as to be consistent with the previous work (Zhang et al., 2017). In order to allow a fair comparison between the models SKVMN and DKVMN, we select the same set of state dimensions, (i.e. ) and the same corresponding numbers of memory slots on the datasets Synthetic5, ASSISTments2009, ASSISTments2015 and Statics2011 to report the AUC results, following the settings originally reported in (Zhang et al., 2017). The dataset JunyiAcademy was not considered in the previous work (Zhang et al., 2017). Considering that JunyiAcademy has the largest numbers of students and exercises among all datasets, we use the same settings for the numbers of memory slots and state dimensions as the ones for the second largest dataset ASSISTments2015. Table 2 presents the AUC results for all five datasets.
As shown in Table 2, compared with DKVMN, our model SKVMN can produce better AUC results with comparable parameters on the datasets Synthetic5, ASSISTments2015 and Statics2011, and with fewer parameters on the datasets ASSISTments2009 and JunyiAcademy. In particular, for the dataset ASSISTments2009, SKVMN yields an AUC of 83.63% with N=10, d=10 and m=7.8k, whereas DKVMN yields an AUC of 81.57% with N=20, d=50 and m=31k (nearly four times 7.8k). Similarly, for the dataset JunyiAcademy, SKVMN yields an AUC of 82.67% with N=50, d=100 and m=66k, whereas DKVMN yields an AUC of 80.27% with N=50, d=200 and m=153k (more than twice 66k).
Note that, the optimal value of for ASSISTments2015 is higher than the one for its previous version (i.e. ASSISTments2009). This implies that the number of latent concepts increases in ASSISTments2015 in comparison to ASSISTments2009. Moreover, the optimal value of generally reflects the complexity of the exercises in a dataset, and the dataset JunyiAcademy has exercises of higher complexity than other realworld datasets.
5.2. Prediction Accuracy
We have conducted experiments comparing the AUC results of our model SKVMN with the other three KT models: BKT, DKT, and DKVMN. Table 3 presents the AUC results of all the models. It can be seen that our model SKVMN outperformed the other models on all five datasets. In particular, the SKVMN model achieved an average AUC value that is at least 2% higher than the state-of-the-art model DKVMN on all real-world datasets: ASSISTments2009, ASSISTments2015, Statics2011, and JunyiAcademy. Even for the only synthetic dataset (i.e. Synthetic5), the SKVMN model achieved an average AUC value of , in comparison with achieved by DKVMN. Note that the AUC values on ASSISTments2015 are the lowest among all datasets, regardless of the KT models. This reflects the difficulty of the KT task in this dataset due to its lowest exercise per student ratio, which not only makes the training process more difficult but also limits the effective use of sequence information to enhance the prediction performance. Figure 4 illustrates the ROC curves of these four models for each dataset.
In a nutshell, based on the AUC results in Table 3, we have the following observations. First, the neural models generally performed better than the Bayesian inference model (i.e. BKT). This is due to the power of these models in learning complex student learning patterns without the need to oversimplify the problem’s assumptions to keep the computation tractable, as is the case in BKT. Second, the memory-augmented models DKVMN and SKVMN performed better than the DKT model, which does not utilise an external memory structure. This has empirically verified the effectiveness of external memory structures in storing past learning experiences of students, as well as facilitating access to relevant information to enhance the prediction performance. Third, the use of sequential dependencies among exercises in our SKVMN model enhanced the prediction accuracy in comparison to the DKVMN model, which primarily considers the latest observed exercise.
5.3. Clustering Questions
To provide insights into how our proposed model SKVMN can correlate questions to their latent concepts, in Figures 5.(a) and 5.(b) we present the clustering results of questions based on their correlated concepts in the dataset ASSISTments2009, generated by using DKVMN and SKVMN, respectively. This dataset was selected for two reasons. First, it has a reasonable number of questions (i.e., 110), which keeps the visualization of clusters readable. Second, each question in this dataset is provided with a description as depicted in Table 4, which is useful for validating how well the model discovers correlations between questions and their latent concepts.
As shown in Figure 5, both DKVMN and SKVMN discover that there are 10 latent concepts relating to the 110 questions in the dataset ASSISTments2009, where all questions in one cluster relate to common latent concepts and are labelled using the same color. It can be noticed that SKVMN performs significantly better than DKVMN, since the overlap between different clusters in SKVMN is smaller than in DKVMN. For example, in Figure 5.(a), question 105 is about curve slope, which is close to the cluster for geometric concepts in brown colour, while it is placed in the cluster for equation system concepts in blue colour. Similarly, other overlaps can be observed, such as questions 38, 73, and 26. This indicates the effectiveness of SKVMN in discovering latent concepts as well as discovering questions that relate to these latent concepts.
We can further verify the effectiveness of SKVMN in discovering latent concepts for questions using the question descriptions in Table 4. For example, in Figure 5.(b), three questions fall in the same cluster in pink (top right corner). Their provided descriptions are “Equivalent Fractions”, “Multiplication Fractions”, and “Ordering Fractions”, which are all relevant to fraction concepts. Similarly, three other questions have the descriptions “Area Trapezoid”, “Area Rectangle”, and “Rotations”. These questions fall in the same cluster in light blue (bottom right corner) because they are about geometric concepts, such as area functions and transformations.
Note that SKVMN depends on identity vectors to aggregate questions with common concepts together, whereas DKVMN depends on attention vectors to perform this aggregation.
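The attention-based grouping attributed to DKVMN can be sketched as follows. The sketch assumes the usual DKVMN-style attention (a softmax over inner products between a question embedding and the rows of the key matrix) and assumes that each question is assigned to the concept with the largest attention weight. The assignment rule, the function names, and the toy embeddings are illustrative assumptions, and SKVMN’s identity-vector grouping is not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(question_emb, key_matrix):
    """Attention of one question over all latent concepts (rows of the key matrix)."""
    scores = [sum(q * k for q, k in zip(question_emb, key_row))
              for key_row in key_matrix]
    return softmax(scores)


def cluster_by_attention(question_embs, key_matrix):
    """Assign each question to the concept with the largest attention weight."""
    clusters = {}
    for question_id, emb in question_embs.items():
        weights = attention_weights(emb, key_matrix)
        concept = max(range(len(weights)), key=weights.__getitem__)
        clusters.setdefault(concept, []).append(question_id)
    return clusters


# Toy data (hypothetical): two latent concepts and three questions.
key_matrix = [[1.0, 0.0], [0.0, 1.0]]
question_embs = {1: [2.0, 0.1], 2: [0.2, 1.5], 3: [1.0, 0.0]}

clusters = cluster_by_attention(question_embs, key_matrix)
print(clusters)
```

With trained embeddings, a question whose attention is spread across several concepts can land near a cluster boundary, which is one way overlaps such as those in Figure 5.(a) could arise.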
5.4. Evolution of Knowledge States
As previously discussed in Section 1, the knowledge states of a student may evolve over time, through learning from a sequence of exercises. To illustrate this evolution process, Figure 1 shows a student’s knowledge states over a sequence of 50 exercises from the ASSISTments2009 dataset. At each time step, a knowledge state consists of the concept states of five concepts, which are stored in the value matrix of the key-value memory used by DKVMN or SKVMN. Figure 1.(a) shows this student’s knowledge states captured by DKVMN, while Figure 1.(b) shows this student’s knowledge states captured by SKVMN.
In SKVMN, relevant questions are identified as shown in Table 4. Comparing Figure 1.(a) and Figure 1.(b), SKVMN visibly applies smoother updates to the concept states in the value matrix than DKVMN. For example, considering questions 23, 33 and 61, the student answered the first two incorrectly and the last one correctly, which results in a sudden update to the corresponding concept state in the value matrix of DKVMN but a smoother update to the same concept state in the value matrix of SKVMN. Another example is questions 70 and 88, which correlate to the same latent concepts: the student answered the first one incorrectly and the second one correctly, which resulted in a significant update to the corresponding concept state in DKVMN’s memory around indices 18, 19 and 20, while the same concept state in SKVMN decreased in a smoother manner. This means that SKVMN considers the student’s past performance on this concept. Fundamentally, these differences in capturing concept states arise because DKVMN’s write process only takes the question and the answer to calculate the erase and add vectors, so that the knowledge state of DKVMN is biased towards the most recently observed question. SKVMN resolves this issue by taking into account the summary vector (i.e., the current knowledge state and the difficulty level of the current question) in the write process.
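The erase-followed-by-add write discussed above can be sketched as follows. This is a minimal illustration of the general erase/add rule used in memory-augmented networks, not the authors’ implementation: the erase and add vectors are passed in directly, whereas in the models they are computed by learned layers (from the question and answer in DKVMN, and from the summary vector in SKVMN, as described above). All names and numbers are illustrative assumptions.

```python
def erase_add_write(value_matrix, weights, erase, add):
    """Update each memory slot i as M[i] * (1 - w[i] * e) + w[i] * a, element-wise."""
    return [
        [v * (1.0 - w * e) + w * a for v, e, a in zip(row, erase, add)]
        for row, w in zip(value_matrix, weights)
    ]


# Toy memory (hypothetical): two concept slots with two-dimensional concept states.
value_matrix = [[0.5, 0.5], [0.2, 0.2]]
weights = [1.0, 0.0]   # the current question attends only to the first concept
erase = [1.0, 0.0]     # fully erase dimension 0, keep dimension 1
add = [0.3, 0.1]       # new content derived from the latest interaction

updated = erase_add_write(value_matrix, weights, erase, add)
print(updated)
```

In this toy case the first dimension of the attended slot is overwritten entirely by the latest interaction, regardless of what was stored before, while the unattended slot is left unchanged. This mirrors the bias towards the most recently observed question described for DKVMN.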
6. Related Work
In this section, we provide a brief review of related research work.
One of the early attempts at developing a KT model was introduced by Corbett and Anderson (Corbett and Anderson, 1994). Their KT model, called Bayesian Knowledge Tracing (BKT), assumed a knowledge state to be a binary random variable (i.e., know or do not know) and followed a Bayesian inference approach to estimate the values of knowledge states. However, BKT has limitations in modelling dynamics between different concepts because its representation is oversimplified to keep the Bayesian inference tractable. Baker et al. (Baker et al., 2008) extended BKT by introducing an additional layer to the Bayesian inference to represent contextual information. While their model achieved better results, it still considered only the latest observation, as in a first-order Markov chain. Several attempts have been made to extend BKT by individualizing the prior distribution of Bayesian inference parameters (Pardos and Heffernan, 2010; Yudelson et al., 2013) so as to customize the model for each individual student. These individualization techniques were shown to reduce the prediction errors of the original BKT model. Pardos and Heffernan (Pardos and Heffernan, 2011) introduced auxiliary information, such as item difficulty, into the Bayesian inference process and showed that it can further enhance the prediction accuracy.
With the rise of deep learning models (LeCun et al., 2015) and their breakthroughs in sequence modelling (Schmidhuber, 2015), such as natural language processing (Graves et al., 2013; Yu et al., 2017), video recognition (Donahue et al., 2015; Liu et al., 2016), and signal processing (Wang et al., 2018), recent studies adopted deep learning models to address the KT problem. Piech et al. (Piech et al., 2015) proposed the deep knowledge tracing (DKT) model, which uses a recurrent neural network (RNN) (Mandic and Chambers, 2001) to model dynamics in a past exercise sequence and predict answers for new questions. DKT resolved the limitations of Bayesian inference approaches, as RNN optimization through backpropagation is tractable. Despite this advance, DKT assumed only one hidden state variable for representing a student’s knowledge state, which is an unrealistic assumption for real-world scenarios, as a student’s knowledge can vary significantly across different learning concepts. To address this limitation, Zhang et al. (Zhang et al., 2017) proposed a model called Dynamic Key-Value Memory Networks (DKVMN), which followed the concepts of Memory-Augmented Neural Networks (MANN) (Santoro et al., 2016; Graves et al., 2016). MANN aim at mimicking the functionality of the human brain, which combines neural spiking for computation with memory for storing past experiences (Gallistel and King, 2011). Inspired by MANN, DKVMN is augmented with two auxiliary memory structures: the key matrix and the value matrix. The former is used to keep the concepts underlying exercises, while the latter is used to store a knowledge state across these concepts. Results showed that DKVMN outperformed BKT and DKT on standard KT benchmarks, and it is therefore considered the state-of-the-art KT model. However, DKVMN only considers the latest exercise embedding when updating the value matrix, resulting in biased knowledge states that ignore past learning experience.
As an example, if we have three related exercises in a sequence, two answered correctly and the latest answered incorrectly, DKVMN would be biased towards the latest one and abruptly update the knowledge state with a knowledge loss. In addition, DKVMN has no model capacity to capture long dependencies in an exercise sequence. It effectively assumes a first-order Markov chain to represent a past exercise sequence, which is not satisfactory in many scenarios. Our proposed KT model addresses the limitations of both DKT and DKVMN.
In our proposed KT model, we developed a modified LSTM, called Hop-LSTM, for sequence modelling. Current recurrent neural network (RNN) models and their variants, such as LSTMs (Hochreiter and Schmidhuber, 1997), bidirectional RNNs (Schuster and Paliwal, 1997), or other gated RNNs (Kusupati et al., 2018), provide the capacity to effectively ingest dependencies in sequential data. However, when sequences are long, it is still difficult to capture long-term dependencies. One way to alleviate this issue is to only update a fraction of hidden states based on the current hidden state and input (Jernite et al., 2017). For example, Yu, Lee and Le (Yu et al., 2017) proposed an LSTM model that can jump ahead in a sequence to avoid irrelevant words. The jump decision was controlled by a policy gradient reinforcement learning algorithm that works as an active learning technique to sample only important words for the model. Campos et al. (Campos et al., 2018) proposed a model that augments the standard RNN with a binary state update gate function, which decides whether or not to update the hidden state based on the number of previous updates performed, together with a loss term that balances the number of updates (i.e., learning speed) against the achieved accuracy. Different from these models, we developed Hop-LSTM in relation to a Triangular layer so that only the hidden states of LSTM cells for relevant exercises are connected. This allows our KT model to identify relevant skills and prior background from past learning activities (e.g., exercise sequences) for improved prediction accuracy.
7. Conclusions
In this paper, we introduced a novel model called Sequential Key-Value Memory Networks (SKVMN) for knowledge tracing. SKVMN aims to overcome the limitations of existing KT models. It is augmented with a key-value memory at the memory layer and a modified LSTM, called Hop-LSTM, at the sequence layer. The experimental results showed that our proposed model outperformed the state-of-the-art models on all datasets. Future work will consider techniques to automatically tune hyperparameters for knowledge tracing models.
Acknowledgements.
This research is supported by an Australian Government higher education scholarship, an ANU Vice-Chancellor’s Teaching Enhancement Grant, and NVIDIA through its generous GPU support.
References
More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In Proceedings of the 9th International Conference on Intelligent Tutoring Systems (ITS), Berlin, Heidelberg, pp. 406–415. ISBN 9783540691303.
Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade: Second Edition, pp. 437–478.
Stochastic gradient descent tricks. In Neural networks: Tricks of the trade: Second Edition, pp. 421–436.
Skip RNN: learning to skip state updates in recurrent neural networks. In 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings.
Modeling exercise relationships in e-learning: a unified approach. In Proceedings of the 8th International Conference on Educational Data Mining (EDM), Madrid, Spain, June 26–29, 2015, pp. 532–535.
Knowledge tracing: modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction 4 (4), pp. 253–278. ISSN 1573-1391.
Long-term recurrent convolutional networks for visual recognition and description. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, June 7–12, 2015, pp. 2625–2634.
Memory and the computational brain: why cognitive science will transform neuroscience. Vol. 6, John Wiley & Sons.
Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Society for Artificial Intelligence and Statistics, Chia Laguna Resort, Sardinia, Italy, pp. 249–256.
Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, December 8–12, 2013, pp. 273–278.
Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, pp. 6645–6649. ISSN 1520-6149.
Hybrid computing using a neural network with dynamic external memory. Nature 538, pp. 471.
Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780.
Variable computation in recurrent neural networks. In 5th International Conference on Learning Representations (ICLR), Toulon, France, April 24–26, 2017, Conference Track Proceedings.
How deep is knowledge tracing? In Proceedings of the 9th International Conference on Educational Data Mining (EDM), Raleigh, North Carolina, USA, June 29 – July 2, 2016.
Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
Fuzzy sets and fuzzy logic: theory and applications. ISBN 0131011715.
FastGRNN: a fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems (NeurIPS), 3–8 December 2018, Montréal, Canada, pp. 9017–9028.
Deep learning. Nature 521 (7553), pp. 436.
AUC: a statistically consistent and more discriminating measure than accuracy. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI), San Francisco, CA, USA, pp. 519–524.
Spatio-temporal LSTM with trust gates for 3D human action recognition. In 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part III, Cham, pp. 816–833.
Recurrent neural networks for prediction: learning algorithms, architectures and stability. John Wiley & Sons, Inc., New York, NY, USA. ISBN 0471495174.
Modeling individualization in a Bayesian networks implementation of knowledge tracing. In Proceedings of the 18th International Conference on User Modeling, Adaptation, and Personalization (UMAP), Berlin, Heidelberg, pp. 255–266. ISBN 3642134696, 9783642134692.
KT-IDEM: introducing item difficulty to the knowledge tracing model. In Proceedings of the 19th International Conference on User Modeling, Adaption, and Personalization (UMAP), Berlin, Heidelberg, pp. 243–254. ISBN 9783642223617.
On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, Georgia, USA, pp. III–1310–III–1318.
Deep knowledge tracing. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015 (NeurIPS), December 7–12, 2015, Montreal, Quebec, Canada, Cambridge, MA, USA, pp. 505–513.
Uncovering patterns in student work: machine learning to understand human learning. Ph.D. Thesis, Stanford University.
Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York City, NY, USA, June 19–24, 2016, pp. 1842–1850.
Deep learning in neural networks: an overview. Neural Netw. 61 (C), pp. 85–117. ISSN 0893-6080.
Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681. ISSN 1053-587X.
Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems (NeurIPS), December 8–13, 2014, Montreal, Quebec, Canada, Cambridge, MA, USA, pp. 3104–3112.
Probabilistic student models: Bayesian belief networks and knowledge space theory. In Proceedings of the Second International Conference on Intelligent Tutoring Systems (ITS), London, UK, pp. 491–498. ISBN 3540556060.
LSTM-based EEG classification in motor imagery tasks. IEEE Transactions on Neural Systems and Rehabilitation Engineering 26 (11), pp. 2086–2095. ISSN 1534-4320.
Learning to Skim Text. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada, July 30 – August 4, Volume 1: Long Papers, pp. 1880–1890.
Individualized Bayesian knowledge tracing models. In Artificial Intelligence in Education: 16th International Conference (AIED), Memphis, TN, USA, July 9–13, 2013, Proceedings, pp. 171–180.
Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th International Conference on World Wide Web (WWW), Republic and Canton of Geneva, Switzerland, pp. 765–774. ISBN 9781450349130.