One of the prominent features of human intelligence is the ability to track one's current knowledge state in mastering specific skills or concepts. This enables humans to identify gaps in their knowledge and to personalise their learning experience. With the success of artificial intelligence (AI) in modeling various areas of human cognition (LeCun et al., 2015; Graves et al., 2013; Yu et al., 2017; Donahue et al., 2015; Liu et al., 2016), the question has arisen: can machines trace human knowledge as humans do? This motivated the study of knowledge tracing (KT), which aims to model the knowledge states of students in mastering skills and concepts through the sequence of learning activities they participate in (Corbett and Anderson, 1994; Piech et al., 2015; Zhang et al., 2017).
Knowledge tracing is of fundamental importance to a wide range of applications in education, such as massive open online courses (MOOCs), intelligent tutoring systems, educational games, and learning management systems. Improvements in knowledge tracing can drive novel techniques to advance human learning. For example, knowledge tracing can be used to discover students’ individual learning needs so that personalised learning and support can be provided to fulfill diverse capabilities of each student (Khajah et al., 2016). It can also be used by human experts to design new measures of student learning and new teaching materials based on learning strengths and weaknesses of students (Piech, 2016). Nonetheless, tracing human knowledge using machines is a rather challenging task. This is due to the complexity of human learning behaviors (e.g., memorising, guessing, forgetting, etc.) and the inherent difficulties of modeling human knowledge (i.e. skills and prior background) (Piech et al., 2015).
One of the early KT models is Bayesian Knowledge Tracing (BKT) (Corbett and Anderson, 1994), which models the knowledge tracing problem as predicting the state of a dynamical system that has hidden latent variables (i.e. learning concepts). In addition to BKT, probabilistic graphical models such as Hidden Markov Models (HMMs) (Corbett and Anderson, 1994; Baker et al., 2008) or Bayesian belief networks (Villano, 1992) have also been used to model knowledge tracing. To keep the inference computation tractable, traditional machine learning KT models use discrete random state variables with simple transition regimes, which limits their ability to represent complex dynamics between learning concepts. Moreover, these models often assume a first-order Markov chain for an exercise sequence (i.e. the most recent observation is taken to represent the whole history), which also limits their ability to model long-term dependencies in an exercise sequence.
Inspired by recent advances in deep learning (LeCun et al., 2015), several deep learning KT models have been proposed. A pioneering work by Piech et al. (2015) reported Deep Knowledge Tracing (DKT), which uses recurrent neural networks (... et al., 2013; Sutskever et al., 2014) to predict student performance on new exercises given their past learning history. In DKT, a student’s knowledge states are represented by a sequence of hidden states that successively encode relevant information from past observations over time. Although DKT has achieved substantial improvements in prediction performance over BKT, it represents a knowledge state by a single hidden state and therefore cannot trace how specific concepts are mastered by a student (i.e., concept states) within a knowledge state. To deal with this limitation, Dynamic Key-Value Memory Networks (DKVMN) (Zhang et al., 2017) were proposed to model a student’s knowledge state as a complex function over all underlying concept states using a key-value memory. Their idea of augmenting DKVMN with an auxiliary memory follows the concepts of Memory-Augmented Neural Networks (MANN) (Santoro et al., 2016; Graves et al., 2016). However, DKVMN acquires the knowledge growth through the most recent exercise only and thus fails to capture long-term dependencies in an exercise sequence (i.e., past experience relevant to a new observation).
In this paper, we present a new KT model, called Sequential Key-Value Memory Networks (SKVMN). This model provides three advantages over the existing deep learning KT models:
First, SKVMN unifies the strengths of both recurrent modelling capacity of DKT and memory capacity of DKVMN for modelling student learning. We observe that, although a key-value memory can help trace concept states of a student, it is not effective in modeling long-term dependencies on sequential data. We remedy this issue by incorporating LSTMs into the sequence modelling for a student’s knowledge states over time. Thus, SKVMN is not only augmented with a key-value memory to enhance representation capability of knowledge states at each time step, but also can provide recurrent modelling capability for capturing dependencies among knowledge states at different time steps in a sequence.
Second, SKVMN uses a modified LSTM with hops, called Hop-LSTM, in its sequence modelling. Hop-LSTM deviates from the standard LSTM architecture by using a triangular layer for discovering sequential dependencies between exercises in a sequence. Then, the model may hop across LSTM cells according to the relevancy of the latent learning concepts. This enables relevant exercises that correlate to similar concepts to be processed together. In doing so, the inference becomes faster and the capacity of capturing long-term dependencies in an exercise sequence is enhanced.
Third, SKVMN improves the write process of DKVMN in order to better represent knowledge states stored in a key-value memory. In DKVMN, the current knowledge state is not considered when calculating the knowledge growth of a new exercise. This means that the previous learning experience is ignored. For example, when a student attempts the same question multiple times, the same knowledge growth would be added to the knowledge state, regardless of whether the student has previously answered this question or how many times the answers were correct. SKVMN solves this issue by using a summary vector as input for the write process, which reflects both the current knowledge state of a student and the prior difficulty of a new question.
We have extensively evaluated our proposed model SKVMN on five well-established KT benchmark datasets, and compared it with the state-of-the-art KT models. The experimental results show that (1) SKVMN outperforms the existing KT models, including DKT and DKVMN, on all five datasets, (2) SKVMN can better discover the correlation between latent concepts and questions, and (3) SKVMN can trace the knowledge states of students dynamically, and leverage sequential dependencies between exercises in an exercise sequence for improved prediction accuracy.
Figure 1 illustrates how a student’s knowledge states are evolving as the student attempts a sequence of 50 exercises in DKVMN and SKVMN. We can see that, compared with the knowledge states of DKVMN depicted in Figure 1.(a), our model SKVMN provides a smoother transition between two successive concept states as depicted in Figure 1.(b). Moreover, SKVMN captures a smooth, progressive evolution of knowledge states over time (i.e., through this exercise sequence), which more accurately reflects the way a student learns (will be discussed in detail in Section 5.4).
2. Problem Formulation
Broadly speaking, knowledge tracing is to track down students’ knowledge states over time through a sequence of observations on how the students interact with given learning activities. In this paper, we formulate knowledge tracing as a sequence prediction problem in machine learning, which learns the knowledge state of a student based on an exercise answering history.
Let $Q = \{q_1, \dots, q_{|Q|}\}$ be the set of all distinct question tags in a dataset. Each $q_i \in Q$ may have a different level of difficulty, which is not explicitly provided. An exercise $x_i$ is a pair $(q_i, y_i)$ consisting of a question tag $q_i$ and a binary variable $y_i \in \{0, 1\}$ representing the answer, where $0$ means that $q_i$ is incorrectly answered and $1$ means that $q_i$ is correctly answered. When a student interacts with the questions in $Q$, a history of exercises $X = \langle x_1, x_2, \dots, x_{t-1} \rangle$ undertaken by the student can be observed. Based on a history of exercises $X = \langle x_1, x_2, \dots, x_{t-1} \rangle$, we want to predict the probability of correctly answering a new question at time step $t$ by the student, i.e., $p_t = (y_t = 1 \mid q_t, X)$.
We assume that the questions in $Q$ are associated with $N$ latent concepts $C$. The concept state of each latent concept $c^i \in C$ is a random variable describing the mastery level of the student on this latent concept. At each time step $t$, the knowledge state of a student is modelled as a set of all concept states of the student, each corresponding to a latent concept in $C$ at time step $t$.
3. Sequential Key-Value Memory Networks
In this section, we introduce our model Sequential Key-Value Memory Networks (SKVMN). We first present an overview of SKVMN. Then, we show how a key-value memory can be attended, read and written in our model. To leverage sequential dependencies among latent concepts for prediction, we then present a modified LSTM, called Hop-LSTM. Lastly, we discuss the optimisation techniques used in the model.
3.1. Model Overview
The SKVMN model is augmented with a key-value memory $\langle M^k, M^v \rangle$ following the work in (Zhang et al., 2017), i.e., a pair of one static matrix $M^k$ of size $N \times d_k$, called the key matrix, and one dynamic matrix $M^v$ of size $N \times d_v$, called the value matrix. Both the key matrix and the value matrix have the same $N$ memory slots, but they may differ in their state dimensions $d_k$ and $d_v$. The key matrix stores the latent concepts underlying questions, and the value matrix stores the concept states of a student (i.e., the knowledge state), which can be changed dynamically based on student learning.
Given an input question $q_t$ at time step $t$, the SKVMN model retrieves the knowledge state of a student from the key-value memory $\langle M^k, M^v_t \rangle$, and predicts the probability $p_t$ of correctly answering the question $q_t$ by the student. Figure 2.(a) illustrates the SKVMN model at time step $t$, which consists of four layers: the embedding, memory, sequence and output layers.
The embedding layer is responsible for mapping an input question $q_t$ at time step $t$ into a high-dimensional vector space.
The memory layer involves two processes: attention and read, where the attention process provides an addressing mechanism for the input question $q_t$ to locate the relevant information in the key-value memory, and the read process uses the attention vector to retrieve the current knowledge state of the student from the value matrix $M^v_t$. The details of the attention and read processes will be discussed in Sections 3.2.1 and 3.2.2.
The sequence layer consists of recurrently connected LSTM cells, where the cells are connected according to the sequential dependencies between exercises. The details will be discussed in Section 3.3.
The output layer generates the probability $p_t$ of correctly answering the input question $q_t$.
After a student has attempted the input question $q_t$ with the answer $y_t$, the value matrix in the key-value memory of the SKVMN model needs to be updated in order to reflect the latest knowledge state of the student. Figure 3 depicts how the value matrix transitions from $M^v_t$ at time step $t$ to $M^v_{t+1}$ at time step $t+1$ using the write process. The details of the write process will be discussed in Section 3.2.3.
3.2. Attention, Read and Write
There are three processes relating to access to a key-value memory in our model: attention, read and write. In the following, we elaborate these processes.
3.2.1. Attention
Given a question $q_t$ as input, the model represents $q_t$ as a “one-hot” vector of length $|Q|$ in which all entries are zero, except for the entry that corresponds to $q_t$.
In order to map $q_t$ into a continuous vector space, $q_t$ is multiplied by an embedding matrix $A \in \mathbb{R}^{|Q| \times d_k}$, which generates a continuous embedding vector $k_t \in \mathbb{R}^{d_k}$, where $d_k$ is the embedding dimension. Then, an attention vector $w_t$ is obtained by applying the Softmax function to the inner product between the embedding vector $k_t$ and each key slot $M^k(i)$ in the key matrix $M^k$ as follows:
$$w_t(i) = \mathrm{Softmax}\big(k_t^{\top} M^k(i)\big)$$
where $\mathrm{Softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}$. Conceptually, $w_t$ represents the correlation between the question $q_t$ and the underlying latent concepts stored in the key matrix $M^k$.
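The attention step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the names `A`, `M_k`, `attention` and the toy sizes are assumptions chosen for the example, and the matrices are filled with random values rather than trained parameters.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attention(question_index, A, M_k):
    """Return the embedding k_t and the attention vector w_t for one question.

    A   : (num_questions, d_k) embedding matrix (hypothetical name)
    M_k : (N, d_k) key matrix, one row per memory slot (hypothetical name)
    """
    # Multiplying a one-hot vector by A is the same as selecting a row of A.
    k_t = A[question_index]
    # Inner product of k_t with every key slot, followed by Softmax.
    w_t = softmax(M_k @ k_t)
    return k_t, w_t

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
num_questions, N, d_k = 6, 4, 3
A = rng.normal(size=(num_questions, d_k))
M_k = rng.normal(size=(N, d_k))
k_t, w_t = attention(2, A, M_k)
```

The resulting `w_t` has one non-negative weight per memory slot and the weights sum to one, so it can be read as a soft assignment of the question to the latent concepts.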
3.2.2. Read
For each exercise $x_t = (q_t, y_t)$, the model uses its corresponding attention vector $w_t$ to retrieve the concept states of the student with regard to the question $q_t$ from the value matrix $M^v_t$. Specifically, the read process takes the attention vector $w_t$ as input and yields a read vector $r_t$, which is the weighted sum of all values attended to by $w_t$ in the memory slots of the value matrix $M^v_t$, i.e.,
$$r_t = \sum_{i=1}^{N} w_t(i)\, M^v_t(i)$$
The read vector $r_t$ is concatenated with the embedding vector $k_t$. Then, the combined vector is fed to a Tanh layer to calculate the summary vector $f_t$:
$$f_t = \mathrm{Tanh}\big(W_1^{\top} [r_t, k_t] + b_1\big)$$
where $\mathrm{Tanh}(z_i) = (e^{z_i} - e^{-z_i}) / (e^{z_i} + e^{-z_i})$, $W_1$ is the weight matrix of the Tanh layer, and $b_1$ is the bias vector. While the read vector $r_t$ represents the student’s knowledge state with respect to the relevant concepts of the current question $q_t$, the summary vector $f_t$ adds prior information of the question (e.g., the level of difficulty) to this knowledge state. Intuitively, $f_t$ represents how well the student has mastered the latent concepts relevant to the question $q_t$ before attempting this question.
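As a rough sketch of the read process and the summary vector, the following NumPy snippet computes a weighted sum over the value slots and passes the concatenation with the question embedding through a Tanh layer. The function name, the parameter shapes (in particular the output size `d_f` of the Tanh layer) and the random inputs are assumptions made for illustration only.

```python
import numpy as np

def read_and_summarise(w_t, M_v, k_t, W_1, b_1):
    """Read the value memory and build the summary vector.

    w_t : (N,) attention vector
    M_v : (N, d_v) value matrix
    k_t : (d_k,) question embedding
    W_1 : (d_v + d_k, d_f) weight matrix (d_f is a hypothetical output size)
    b_1 : (d_f,) bias vector
    """
    # Read vector: attention-weighted sum of the value slots.
    r_t = w_t @ M_v
    # Summary vector: Tanh layer over the concatenation [r_t, k_t].
    f_t = np.tanh(W_1.T @ np.concatenate([r_t, k_t]) + b_1)
    return r_t, f_t

# Toy inputs, chosen only for illustration.
rng = np.random.default_rng(0)
N, d_k, d_v, d_f = 4, 3, 5, 5
w_t = np.array([0.1, 0.2, 0.3, 0.4])
M_v = rng.normal(size=(N, d_v))
k_t = rng.normal(size=d_k)
W_1 = rng.normal(size=(d_v + d_k, d_f))
b_1 = np.zeros(d_f)
r_t, f_t = read_and_summarise(w_t, M_v, k_t, W_1, b_1)
```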
3.2.3. Write
The write process occurs each time after the student has attempted a question. The purpose of this write process is to update the concept states of the student in the value matrix $M^v$ using the knowledge growth gained through attempting the question $q_t$. This update leads to the transition of the value matrix from $M^v_t$ to $M^v_{t+1}$ as depicted in Figure 3.
To calculate the knowledge growth, our model considers not only the correctness of the answer $y_t$ for $q_t$, but also the student’s mastery level $f_t$ of the concept states before attempting $q_t$. Each $(f_t, y_t)$ is represented as a vector of length $2|Q|$ and multiplied by an embedding matrix $B \in \mathbb{R}^{2|Q| \times d_v}$ to get a write vector $v_t$, which represents the knowledge growth of the student obtained by attempting the question.
Similar to other memory augmented networks (Graves et al., 2016; Zhang et al., 2017), the write process proceeds with two gates: erase gate and add gate. The former controls what information to erase from the knowledge state after attempting the latest exercise $x_t$, while the latter controls what information to add into the knowledge state. From the knowledge tracing perspective, these two gates capture the forgetting and enhancing aspects of learning knowledge, respectively.
With the write vector $v_t$ for the knowledge growth, an erase vector $e_t$ is calculated as:
$$e_t = \mathrm{Sigmoid}\big(W_e^{\top} v_t + b_e\big)$$
where $\mathrm{Sigmoid}(z_i) = 1 / (1 + e^{-z_i})$ and $W_e$ is the weight matrix. Using the attention vector $w_t$, the value matrix $\tilde{M}^v_{t+1}$ after applying the erase vector $e_t$ is
$$\tilde{M}^v_{t+1}(i) = M^v_t(i) \otimes \big[1 - w_t(i)\, e_t\big]$$
Then, an add vector $a_t$ is calculated by
$$a_t = \mathrm{Tanh}\big(W_a^{\top} v_t + b_a\big)$$
where $W_a$ is the weight matrix. Finally, the value matrix for the next knowledge state is updated as:
$$M^v_{t+1}(i) = \tilde{M}^v_{t+1}(i) + w_t(i)\, a_t$$
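The erase-then-add update can be illustrated with the sketch below. It takes the write vector as a given input (how it is embedded is left out), and all names, shapes and values are assumptions for the example rather than the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def write(M_v, w_t, v_t, W_e, b_e, W_a, b_a):
    """Return the updated value matrix after one erase-then-add step.

    M_v : (N, d_v) current value matrix
    w_t : (N,) attention vector
    v_t : (d_v,) write vector, assumed to be already computed
    W_e, W_a : (d_v, d_v) weight matrices; b_e, b_a : (d_v,) biases
    """
    e_t = sigmoid(W_e.T @ v_t + b_e)   # erase vector, entries in (0, 1)
    a_t = np.tanh(W_a.T @ v_t + b_a)   # add vector, entries in (-1, 1)
    # Erase: each slot i is scaled element-wise by 1 - w_t(i) * e_t.
    M_erased = M_v * (1.0 - np.outer(w_t, e_t))
    # Add: each slot i receives w_t(i) * a_t.
    return M_erased + np.outer(w_t, a_t)

# Toy inputs: all attention is placed on slot 0.
rng = np.random.default_rng(0)
N, d_v = 3, 2
M_v = np.ones((N, d_v))
w_t = np.array([1.0, 0.0, 0.0])
v_t = rng.normal(size=d_v)
W_e, W_a = rng.normal(size=(d_v, d_v)), rng.normal(size=(d_v, d_v))
b_e, b_a = np.zeros(d_v), np.zeros(d_v)
M_next = write(M_v, w_t, v_t, W_e, b_e, W_a, b_a)
```

In this toy case only the first memory slot changes, because the other slots receive zero attention. This shows how the attention vector localises the update to the concepts relevant to the attempted question.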
3.3. Sequence Modelling
Now we discuss the sequence modelling approach used at the sequence layer for predicting the probability of correctly answering the question $q_t$ based on the exercise history $X$.
3.3.1. Sequential dependencies
An exercise history of a student may contain a long sequence of exercises, for example, the average sequence length in the ASSISTments2009 dataset is questions per sequence. However, as different exercises may correlate to different latent concepts, not all exercises in can equally contribute to the prediction of answering a given question . Thus, we observe that, by hopping across irrelevant exercises in with regard to , recurrent models can be applied on a shorter and more relevant sequence, leading to more efficient and accurate prediction performance.
For each question $q_i$ in $Q$, since its attention vector $w_i$ reflects the correlation between this question and the latent concepts in $C$, we consider that the similarity between two attention vectors can provide a good indication of how their corresponding questions are relevant in terms of their correlations with latent concepts. For example, suppose that we have three latent concepts, and two attention vectors $w_i$ and $w_j$ that correspond to the questions $q_i$ and $q_j$, respectively. Then $q_i$ and $q_j$ are considered as being relevant if both $w_i$ and $w_j$ are mapped to the same vector of value ranges, where $0$, $1$ and $2$ refer to the value ranges “low”, “middle” and “high”, respectively.
Now, the question is: how to identify similar attention vectors? For this, we use the triangular membership function (Klir and Yuan, 1995):
$$\mu(x) = \max\left(\min\left(\frac{x - a}{b - a}, \frac{c - x}{c - b}\right), 0\right)$$
where the parameters $a$ and $c$ determine the feet and the parameter $b$ determines the peak of a triangle. We use three triangular membership functions for three value ranges: low (0), medium (1), and high (2). Each real-valued component in an attention vector $w_t$ is mapped to one of the three value ranges. Each attention vector $w_t$ is associated with an identity vector $d_t$. Similar attention vectors have the same identity vector, while dissimilar attention vectors have different identity vectors.
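The mapping from attention vectors to identity vectors can be sketched as follows. The three `(a, b, c)` triples and the rule of assigning each component to the range with the highest membership value are assumptions made for this illustration; the text does not state the actual parameter values, which are set empirically per dataset.

```python
import numpy as np

def triangular(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Hypothetical (a, b, c) triples for the ranges low (0), medium (1), high (2).
RANGES = [(-0.4, 0.0, 0.4), (0.1, 0.5, 0.9), (0.6, 1.0, 1.4)]

def identity_vector(w_t):
    """Map each attention weight to the range with the largest membership."""
    w_t = np.asarray(w_t, dtype=float)
    memberships = np.stack([triangular(w_t, a, b, c) for a, b, c in RANGES])
    # Assumption: ties and overlaps are resolved by taking the argmax.
    return tuple(int(i) for i in memberships.argmax(axis=0))

d_i = identity_vector([0.10, 0.20, 0.70])
d_j = identity_vector([0.12, 0.18, 0.70])
d_k = identity_vector([0.05, 0.05, 0.90])
```

Under these assumed parameters, the first two attention vectors receive the same identity vector and would be treated as relevant to each other, while the third one, which concentrates almost all weight on the last concept, receives a different identity vector.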
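Then, at each time step $t$, for the current question $q_t$ and an exercise history $X$, we say $q_t$ is sequentially dependent on $q_{t-\lambda}$ in $X$ with $\lambda \in (0, t)$, denoted as $q_{t-\lambda} \mapsto q_t$, if the following two conditions are satisfied:
The attention vectors of $q_t$ and $q_{t-\lambda}$ have the same identity vector, i.e., $d_t = d_{t-\lambda}$, and
There is no other $q_j$ in $X$ such that $t-\lambda < j < t$ and $d_j = d_t$, i.e., $q_{t-\lambda}$ is the most recent exercise in $X$ that is relevant to $q_t$.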
Over time, given a sequence of questions , we thus have an exercise history in which exercises are partitioned into a set of subsequences with and, for any two consecutive exercises and in the subsequence , is sequentially dependent on , i.e. . In Figure 2, based on the sequential dependencies presented in Figure 2.(b), is partitioned into and , which correspond to the recurrently connected LSTM cells depicted in Figure 2.(a). We will discuss further details in the following.
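The two conditions and the resulting partition can be illustrated with a short sketch. It assumes that identity vectors are available as hashable tuples, one per time step; the function names and the toy sequence are hypothetical.

```python
def previous_relevant(identities):
    """For each time step, return the index of the most recent earlier step
    with the same identity vector, or None if there is no such step."""
    last_seen = {}
    prev = []
    for t, ident in enumerate(identities):
        prev.append(last_seen.get(ident))
        last_seen[ident] = t
    return prev

def partition(identities):
    """Group time steps into subsequences that share an identity vector,
    keeping the original order inside each subsequence."""
    groups = {}
    for t, ident in enumerate(identities):
        groups.setdefault(ident, []).append(t)
    return list(groups.values())

# Toy identity vectors for five exercises (hypothetical values).
ids = [(0, 2), (2, 0), (0, 2), (0, 2), (2, 0)]
prev = previous_relevant(ids)    # [None, None, 0, 2, 1]
chains = partition(ids)          # [[0, 2, 3], [1, 4]]
```

Each entry of `prev` points to the exercise that the current one is sequentially dependent on, and each list in `chains` is one subsequence of mutually relevant exercises.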
3.3.2. Hop-LSTM
Based on the sequential dependencies between exercises in a sequence, a modified LSTM with hops, called Hop-LSTM, is used to predict the probability of correctly answering a new question by a student. Different from the standard LSTM architecture, Hop-LSTM allows us to recurrently connect the LSTM cells for questions based on the relevance of their latent concepts. More precisely, two LSTM cells in Hop-LSTM are connected only if the input question of one LSTM cell is sequentially dependent on the input question of the other LSTM cell. This means that Hop-LSTM has the capability of hopping across the LSTM cells when their input questions are irrelevant to the current question $q_t$.
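Formally, at each time step $t$, for the question $q_t$, if there is an exercise $x_{t-\lambda}$ with $\lambda \in (0, t)$ and $q_{t-\lambda} \mapsto q_t$, then the current LSTM cell takes the summary vector $f_t$ and the hidden state $h_{t-\lambda}$ as input. Moreover, this LSTM cell updates the cell state $c_{t-\lambda}$ into the cell state $c_t$, and generates the new hidden state $h_t$ as output. As in (Hochreiter and Schmidhuber, 1997), the LSTM cell used in our work has three gates: forget gate $g_t$, input gate $i_t$, and output gate $o_t$ in addition to the hidden state $h_t$ and the cell state $c_t$:
$$i_t = \mathrm{Sigmoid}\big(W_i [h_{t-\lambda}, f_t] + b_i\big)$$
$$g_t = \mathrm{Sigmoid}\big(W_g [h_{t-\lambda}, f_t] + b_g\big)$$
$$o_t = \mathrm{Sigmoid}\big(W_o [h_{t-\lambda}, f_t] + b_o\big)$$
$$\tilde{c}_t = \mathrm{Tanh}\big(W_c [h_{t-\lambda}, f_t] + b_c\big)$$
$$c_t = g_t \otimes c_{t-\lambda} + i_t \otimes \tilde{c}_t$$
$$h_t = o_t \otimes \mathrm{Tanh}(c_t)$$
Then, the output vector $h_t$ of the current LSTM cell is sent to a Sigmoid layer, which calculates the probability $p_t$ of correctly answering the current question $q_t$ by
$$p_t = \mathrm{Sigmoid}\big(W_2^{\top} h_t + b_2\big)$$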
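To make the hopping behaviour concrete, the sketch below runs a plain NumPy LSTM cell in which each step takes its hidden and cell states from the most recent relevant step instead of from the immediately preceding one. It is an illustration under several assumptions: the four gate weight matrices are stacked into one hypothetical matrix `W`, a step without a relevant predecessor starts from zero states, and the dependency list `prev` is given as input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(f_t, h_prev, c_prev, W, b):
    """One LSTM step on the summary vector f_t; W stacks the four gate matrices."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, f_t]) + b
    g = sigmoid(z[:H])               # forget gate
    i = sigmoid(z[H:2 * H])          # input gate
    o = sigmoid(z[2 * H:3 * H])      # output gate
    c_tilde = np.tanh(z[3 * H:])     # candidate cell state
    c = g * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, c

def hop_lstm(summaries, prev, W, b, W_2, b_2):
    """Predict p_t per step; step t is connected to step prev[t], if any."""
    H = b.shape[0] // 4
    hs, cs, probs = [], [], []
    for t, f_t in enumerate(summaries):
        j = prev[t]
        # Assumption: a step with no relevant predecessor starts from zeros.
        h_prev = hs[j] if j is not None else np.zeros(H)
        c_prev = cs[j] if j is not None else np.zeros(H)
        h, c = lstm_cell(f_t, h_prev, c_prev, W, b)
        hs.append(h)
        cs.append(c)
        probs.append(float(sigmoid(W_2 @ h + b_2)))
    return probs

# Toy parameters and inputs, chosen only for illustration.
rng = np.random.default_rng(0)
H, d_f = 4, 3
W = rng.normal(size=(4 * H, H + d_f))
b = np.zeros(4 * H)
W_2 = rng.normal(size=H)
b_2 = 0.0
summaries = rng.normal(size=(5, d_f))
prev = [None, None, 0, 2, 1]
probs = hop_lstm(summaries, prev, W, b, W_2, b_2)
```

Because step 3 is connected only to steps 2 and 0 in this toy dependency list, changing the summary vector of step 1 leaves the prediction at step 3 untouched. This is the sense in which irrelevant exercises are hopped over.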
3.4. Model Optimisation
To optimise the model, we use the cross-entropy loss function between the predicted probability $p_t$ of $q_t$ being correctly answered and the true answer $y_t$. The following objective function is defined over training data:
$$\mathcal{L} = -\sum_t \big( y_t \log p_t + (1 - y_t) \log (1 - p_t) \big)$$
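A minimal NumPy version of this cross-entropy objective is shown below. The clipping constant `eps` is an assumption added for numerical safety and is not part of the formulation in the text.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Sum of binary cross-entropy terms over all predictions.

    p : predicted probabilities of a correct answer
    y : true answers (1 = correct, 0 = incorrect)
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    y = np.asarray(y, dtype=float)
    return float(-np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

loss_uninformed = cross_entropy([0.5, 0.5], [1, 0])   # equals 2 * ln(2)
loss_confident = cross_entropy([0.9, 0.1], [1, 0])    # smaller, predictions agree
```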
We initialise the memory matrices ($M^k$ and $M^v$) and the embedding matrices ($A$ and $B$) using a random Gaussian distribution. For the weights and biases of the neural layers, we use Glorot uniform random initialization (Glorot and Bengio, 2010) for faster convergence. These randomly initialized parameters are optimized using stochastic gradient descent (SGD) (Bottou, 2012). As we use Hop-LSTM at the sequence layer, during the backpropagation of the gradients, only the parameters of the connected LSTM cells (i.e. the ones responsible for the current prediction error) are updated. Other parameters, such as the embedding matrices, weight matrices, and bias vectors, are updated in each backpropagation iteration based on loss function values.
Note that, through training, the model can discover relevant latent concepts for each question and store their state values in the value matrix $M^v$.
4. Experiments
In this section, we present the experiments that evaluate our proposed model SKVMN against the state-of-the-art KT models. These experiments aim to answer the following research questions:
What is the optimal size for a key-value memory (i.e., the key and value matrices) of SKVMN?
How does SKVMN perform on predicting a student’s answers of new questions, given an exercise history?
How does SKVMN perform on discovering the correlation between latent concepts and questions?
How does SKVMN perform on capturing the evolution of a student’s knowledge states?
Synthetic-5 (https://github.com/chrispiech/DeepKnowledgeTracing/tree/master/data/synthetic): This dataset consists of two subsets: one for training and one for testing. Each subset contains distinct questions which were answered by virtual students. A total number of exercises (i.e. ) are contained in the dataset.
ASSISTments2009 (https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data/skill-builder-data-2009-2010): This dataset was collected during the school year using the ASSISTments online education website (https://www.assistments.org/). The dataset consists of distinct questions answered by students which gives a total number of exercises.
ASSISTments2015 (https://sites.google.com/site/assistmentsdata/home/2015-assistments-skill-builder-data): As an update to the ASSISTments2009 dataset, this dataset was released in . It includes distinct questions answered by students with a total number of exercises. This dataset has the largest number of students among the ASSISTments and Statics datasets. However, the average number of exercises per student is low. The original dataset also has some incorrect answer values (i.e. ), which are removed during preprocessing.
Statics2011 (https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507): This dataset was collected from a statistics course at Carnegie Mellon University during Fall . It contains distinct questions answered by undergraduate students with a total number of exercises. This dataset has the highest exercise per student ratio among all datasets.
JunyiAcademy (https://datashop.web.cmu.edu/DatasetInfo?datasetId=1198): This dataset was collected in 2015 from Junyi Academy (https://www.junyiacademy.org/), which is an education website providing learning materials and exercises on various scientific courses (Chang et al., 2015). It contains distinct questions answered by students with a total number of exercises. It is the largest dataset in terms of the number of exercises.
In order to evaluate the performance of our proposed model, we select the following three KT models as the baselines:
Bayesian knowledge tracing (BKT) (Corbett and Anderson, 1994) which uses Bayesian inference over binary knowledge state variables.
Deep knowledge tracing (DKT) (Piech et al., 2015) which uses recurrent neural networks to model student learning.
Dynamic key-value memory networks (DKVMN) (Zhang et al., 2017) which extends memory-augmented neural networks (MANN) with a key-value memory and is considered the state-of-the-art model for knowledge tracing.
Our proposed model is referred to as SKVMN in the experiments.
We use the area under the Receiver Operating Characteristic (ROC) curve, referred to as AUC (Ling et al., 2003), to measure the prediction performance of the KT models. The AUC ranges in value from 0 to 1. An AUC score of 0.5 means random prediction (i.e. coin flipping). The higher an AUC score goes above 0.5, the more accurately a predictive model performs.
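For readers unfamiliar with the metric, AUC can be computed as the fraction of (correct, incorrect) answer pairs in which the correct answer received the higher predicted probability, with ties counted as one half. The sketch below implements this pairwise definition in NumPy; it is an illustration of the metric, not the evaluation code used in the experiments.

```python
import numpy as np

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    pos = s[y == 1]   # scores of correctly answered exercises
    neg = s[y == 0]   # scores of incorrectly answered exercises
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

auc_example = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # 3 of 4 pairs ranked correctly
auc_constant = auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])   # uninformative scores
```

The pairwise form makes the baseline explicit: a model that gives every exercise the same score ties on all pairs and obtains exactly 0.5.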
4.4. Evaluation Settings
We divided each dataset into % for training and validation and % for testing, except for Synthetic-5. This is because, as mentioned before, Synthetic-5 itself contains the training and test subsets of the same size. For each training and validation subset, we further divided it using 5-fold cross validation (e.g. % for training and % for validation). The validation subset was used to determine the optimal values for the hyperparameters, including the memory slot dimensions $d_k$ for the key matrix and $d_v$ for the value matrix.
We utilised the Adam optimizer (Kingma and Ba, 2015) for SGD implementation with momentum of and learning rate of annealed using a cosine function every epochs for epochs, then it remains fixed at . The LSTM gradients were clipped to improve the training (Pascanu et al., 2013). For the other baselines, we follow the optimisation procedures indicated in the original work for each of them (Corbett and Anderson, 1994; Zhang et al., 2017; Piech et al., 2015).
A mini-batch of is selected during the training for all datasets, except Synthetic-5, for which we use a mini-batch of due to the relatively small number of training samples (i.e. exercises) in the dataset (Bengio, 2012). For each dataset, the training process is repeated five times, each time using a different initialization. We report the average test AUC and the standard deviation over these five runs.
For the Triangular layer, the hyper-parameter values ($a$, $b$, $c$) of each triangular membership function are set based on the empirical analysis of each dataset.
5. Results and Discussion
In this section, we present the experimental results and discuss our observations from the obtained results.
5.1. Hyperparameters N and d
To explore how the sizes of the key and value matrices can affect the model performance, we have conducted experiments to compare SKVMN with DKVMN under different numbers of memory slots $N$ and state dimensions $d$, where $d = d_k = d_v$ so as to be consistent with the previous work (Zhang et al., 2017). In order to allow a fair comparison between the models SKVMN and DKVMN, we select the same set of state dimensions (i.e. ) and the same corresponding numbers of memory slots on the datasets Synthetic-5, ASSISTments2009, ASSISTments2015 and Statics2011 to report the AUC results, following the settings originally reported in (Zhang et al., 2017). The dataset JunyiAcademy was not considered in the previous work (Zhang et al., 2017). Considering that JunyiAcademy has the largest numbers of students and exercises among all datasets, we use the same settings for the numbers of memory slots and state dimensions as the ones for the second largest dataset ASSISTments2015. Table 2 presents the AUC results for all five datasets.
As shown in Table 2, compared with DKVMN, our model SKVMN can produce better AUC results with comparable parameters on the datasets Synthetic-5, ASSISTments2015 and Statics2011, and with fewer parameters on the datasets ASSISTments2009 and JunyiAcademy. Particularly, for the dataset ASSISTments2009, SKVMN yields an AUC at 83.63% with N=10, d=10 and m=7.8k, whereas DKVMN yields an AUC at 81.57% with N=20, d=50 and m=31k (nearly 4 times of 7.8k). Similarly, for the dataset JunyiAcademy, SKVMN yields an AUC at 82.67% with N=50, d=100 and m=66k, whereas DKVMN yields an AUC at 80.27% with N=50, d=200 and m=153k (more than twice of 66k).
Note that the optimal value of $N$ for ASSISTments2015 is higher than the one for its previous version (i.e. ASSISTments2009). This implies that the number of latent concepts increases in ASSISTments2015 in comparison to ASSISTments2009. Moreover, the optimal value of $d$ generally reflects the complexity of the exercises in a dataset, and the dataset JunyiAcademy has exercises of higher complexity than other real-world datasets.
5.2. Prediction Accuracy
We have conducted experiments on comparing the AUC results of our model SKVMN with the other three KT models: BKT, DKT, and DKVMN. Table 3 presents the AUC results of all the models. It can be seen that our model SKVMN outperformed the other models over all the five datasets. Particularly, the SKVMN model achieved an average AUC value that is at least 2% higher than the state-of-art model DKVMN on all real-world datasets ASSISTments2009, ASSISTments2015, Statics2011, and JunyiAcademy. Even for the only synthetic dataset (i.e. Synthetic-5), the SKVMN model achieved an average AUC value of , in comparison with achieved by DKVMN. Note that the AUC values on ASSISTments2015 are the lowest among all datasets, regardless of the KT models. This reflects the difficulty of the KT task in this dataset due to its lowest exercise per student ratio, which not only makes the training process more difficult but also limits the effective use of sequence information to enhance the prediction performance. Figure 4 illustrates the ROC curves of these four models for each dataset.
In a nutshell, based on the AUC results in Table 3, we have the following observations. First, the neural models generally performed better than the Bayesian inference model (i.e. BKT). This is due to the power of these models in learning complex student learning patterns without the need to oversimplify the problem’s assumption to keep it within the tractable computation limits as the case in BKT. Second, the memory-augmented models DKVMN and SKVMN performed better than the DKT model that does not utilise an external memory structure. This has empirically verified the effectiveness of external memory structures in storing past learning experiences of students, as well as facilitating the access of relevant information to enhance the prediction performance. Third, the use of sequential dependencies among exercises in our SKVMN model enhanced the prediction accuracy in comparison to the DKVMN model which primarily considers the latest observed exercise.
5.3. Clustering Questions
To provide insights on how our proposed model SKVMN can correlate questions to their latent concepts, in Figure 5.(a)-5.(b) we present the clustering results of questions based on their correlated concepts in the dataset ASSISTments2009, generated by using DKVMN and SKVMN, respectively. This dataset was selected for two reasons. First, it has a reasonable number of questions (i.e., 110), which keeps the visualization of clusters readable. Second, each question in this dataset is provided with a description as depicted in Table 4, which is useful for validating how well the model discovers correlations between questions and their latent concepts.
As shown in Figure 5, both DKVMN and SKVMN discover that there are 10 latent concepts relating to the 110 questions in the dataset ASSISTments2009, where all questions in one cluster relate to common latent concepts and are labelled using the same color. It can be noticed that SKVMN performs significantly better than DKVMN since the overlap between different clusters in SKVMN is smaller than in DKVMN. For example, in Figure 5.(a), question 105 is about curve slope, which is close to the cluster for geometric concepts in brown colour, while it is placed in the cluster for equation system concepts in blue colour. Similarly, other overlaps can be observed such as questions 38, 73, and 26. This indicates the effectiveness of SKVMN in discovering latent concepts as well as discovering questions that relate to these latent concepts.
We can further verify the effectiveness of SKVMN in discovering latent concepts for questions using the question descriptions in Table 4. For example, in Figure 5.(b), the questions , , and fall in the same cluster in pink (top right corner). Their provided descriptions are “Equivalent Fractions”,“Multiplication Fractions”, and “Ordering Fractions”, respectively, which are all relevant to fractions concepts. Similarly, the questions , , and have the descriptions “Area Trapezoid”, “Area Rectangle”, and “Rotations”, respectively. These questions fall in the same cluster in light blue (bottom right corner) because they are about geometric concepts, such as area functions and transformations.
Note that SKVMN depends on identity vectors to aggregate questions with common concepts together, while DKVMN depends on the attention vectors to perform this aggregation.
5.4. Evolution of Knowledge States
As previously discussed in Section 1, the knowledge states of a student may evolve over time, through learning from a sequence of exercises. In order to illustrate this evolution process, Figure 1 shows a student’s knowledge states over a sequence of 50 exercises from the ASSISTments2009 dataset. At each time step, a knowledge state consists of the concept states of five concepts , which are stored in the value matrix of a key-value memory augmented with DKVMN or SKVMN. Figure 1.(a) shows this student’s knowledge states captured by DKVMN, while Figure 1.(b) shows this student’s knowledge states captured by SKVMN.
In SKVMN, relevant questions are identified as shown in Table 4. Comparing Figure 1.(a) and Figure 1.(b), it can be seen that SKVMN makes smoother updates to the concept states in the value matrix than DKVMN. For example, considering questions 23, 33 and 61, the student answered the first two incorrectly and the last one correctly, which results in a sudden update to the corresponding concept state in the value matrix of DKVMN but a smoother update to the same concept state in the value matrix of SKVMN. Another example is questions 70 and 88, which correlate to the same latent concepts: the student answered the first one incorrectly and the second one correctly, which resulted in a significant update to the corresponding concept state in DKVMN’s memory around indices 18, 19 and 20, while the same concept state in SKVMN decreased in a smoother manner. This means that SKVMN considers the student’s past performance relevant to this concept. At their core, these differences in capturing concept states arise because DKVMN’s write process takes only the question and the answer to calculate the erase and add vectors, so the knowledge state of DKVMN is biased towards the most recently observed question. SKVMN resolves this issue by taking into account the summary vector (i.e., the current knowledge state and the level of difficulty of the current question) in the write process.
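The erase-followed-by-add write discussed above can be sketched as follows. This is a minimal illustration, assuming a sigmoid erase gate and a tanh add vector as in the DKVMN formulation; the weight matrices `E` and `D`, the biases, and all shapes are hypothetical placeholders. The only difference between the two models in this sketch is what is passed as `write_input`: an embedding of the (question, answer) pair for DKVMN, or a vector derived from the summary vector for SKVMN.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def write_value_memory(value_memory, weights, write_input, E, b_e, D, b_a):
    """Apply one erase-then-add update to the value matrix.

    value_memory: (N, d_v) concept states, one row per latent concept
    weights:      (N,) correlation of the current question with each concept
    write_input:  (d_in,) vector driving the update (assumption: see lead-in)
    E, D:         (d_in, d_v) hypothetical erase and add projections
    """
    erase = sigmoid(write_input @ E + b_e)  # (d_v,), each entry in (0, 1)
    add = np.tanh(write_input @ D + b_a)    # (d_v,), each entry in (-1, 1)
    # Erase a weighted fraction of each concept state, then add new content.
    erased = value_memory * (1.0 - np.outer(weights, erase))
    return erased + np.outer(weights, add)
```

Because both the erase and add vectors depend only on `write_input`, an input built solely from the latest question and answer leaves no room for earlier evidence to moderate the update, which is consistent with the abrupt changes observed for DKVMN in Figure 1.(a).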
6. Related Work
In this section, we provide a brief review of related research work.
One of the early attempts at developing a KT model was introduced by Corbett and Anderson (Corbett and Anderson, 1994). Their KT model, called Bayesian Knowledge Tracing (BKT), assumed a knowledge state to be a binary random variable (i.e. know or do not know) and followed a Bayesian inference approach to estimate the values of knowledge states. However, BKT has limitations in modelling dynamics between different concepts due to the oversimplified representation it adopts to make the Bayesian inference tractable. Baker et al. (Baker et al., 2008) extended BKT by introducing an additional layer to the Bayesian inference to represent contextual information. While their model achieved better results, it still considered only the latest observation, as in a first-order Markov chain. Several attempts have been made to extend BKT by individualizing the prior distribution of Bayesian inference parameters (Pardos and Heffernan, 2010; Yudelson et al., 2013) so as to customize the model for each individual student. These individualization techniques were shown to reduce the prediction errors of the original BKT model. Pardos and Heffernan (Pardos and Heffernan, 2011) introduced auxiliary information, such as item difficulty, into the Bayesian inference process and showed that it can further enhance the prediction accuracy.
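As a brief illustration of the Bayesian inference step in BKT, the sketch below applies the standard update for a single binary knowledge state. The parameter names (`p_slip`, `p_guess`, `p_learn`) follow the usual BKT formulation, and the numeric values used in the example are hypothetical, chosen only to show the calculation.

```python
def bkt_predict_correct(p_known, p_slip, p_guess):
    """Probability that the next answer is correct given the current estimate."""
    return p_known * (1 - p_slip) + (1 - p_known) * p_guess


def bkt_update(p_known, correct, p_slip, p_guess, p_learn):
    """Update P(known) after observing one answer, then apply the learning transition."""
    if correct:
        evidence_known = p_known * (1 - p_slip)
        evidence_total = evidence_known + (1 - p_known) * p_guess
    else:
        evidence_known = p_known * p_slip
        evidence_total = evidence_known + (1 - p_known) * (1 - p_guess)
    posterior = evidence_known / evidence_total
    # The student may also learn the skill while attempting the exercise.
    return posterior + (1 - posterior) * p_learn


# Hypothetical parameters: prior 0.5, slip 0.1, guess 0.2, learn 0.1.
after_correct = bkt_update(0.5, True, p_slip=0.1, p_guess=0.2, p_learn=0.1)
after_incorrect = bkt_update(0.5, False, p_slip=0.1, p_guess=0.2, p_learn=0.1)
```

Note that each update depends only on the previous estimate and the latest observation, which is the first-order Markov behaviour referred to above.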
Motivated by the success of deep learning in various areas, such as natural language processing (Graves et al., 2013; Yu et al., 2017), video recognition (Donahue et al., 2015; Liu et al., 2016), and signal processing (Wang et al., 2018), recent studies adopted deep learning models to address the KT problem. Piech et al. (Piech et al., 2015) proposed the deep knowledge tracing (DKT) model, which uses a recurrent neural network (RNN) (Mandic and Chambers, 2001) to model dynamics in a past exercise sequence and predict answers for new questions. DKT resolved the limitations of Bayesian inference approaches, as RNN optimization through backpropagation is tractable. Despite this advance, DKT assumed only one hidden state variable for representing a student’s knowledge state, which is an unrealistic assumption for real-world scenarios, as a student’s knowledge can vary significantly across different learning concepts. To address this limitation, Zhang et al. (Zhang et al., 2017) proposed a model called Dynamic Key-Value Memory Networks (DKVMN), which followed the concepts of Memory-Augmented Neural Networks (MANN) (Santoro et al., 2016; Graves et al., 2016). MANNs aim to mimic the functionality of the human brain, which combines neural spiking for computation with memory for storing past experiences (Gallistel and King, 2011). Inspired by MANN, DKVMN is augmented with two auxiliary memory structures: the key matrix and the value matrix. The former is used to keep the concepts underlying exercises, while the latter is used to store a knowledge state across these concepts. Results showed that DKVMN outperformed BKT and DKT on standard KT benchmarks, and it is therefore considered the state-of-the-art KT model. However, DKVMN only considers the latest exercise embedding when updating the value matrix, resulting in biased knowledge states that ignore past learning experience.
As an example, if we have three related exercises in a sequence, two being answered correctly and the latest being answered incorrectly, DKVMN would be biased towards the latest one and would abruptly update the knowledge state to reflect a loss of knowledge. In addition, DKVMN has no capacity to capture long-term dependencies in an exercise sequence. It effectively assumes a first-order Markov chain to represent a past exercise sequence, which is not satisfactory in many scenarios. Our proposed KT model addresses the limitations of both DKT and DKVMN.
In our proposed KT model, we developed a modified LSTM, called Hop-LSTM, for sequence modelling. Current recurrent neural network models (RNNs) and their variants, such as LSTMs (Hochreiter and Schmidhuber, 1997), bi-directional RNNs (Schuster and Paliwal, 1997), or other gated RNNs (Kusupati et al., 2018), provide the capacity to effectively ingest dependencies in sequential data. However, when sequences are long, it is still difficult to capture long-term dependencies. One way to alleviate this issue is to update only a fraction of the hidden states based on the current hidden state and input (Jernite et al., 2017). For example, Yu, Lee and Le (Yu et al., 2017) proposed an LSTM model that can jump ahead in a sequence to avoid irrelevant words. The jump decision was controlled by a policy gradient reinforcement learning algorithm that works as an active learning technique to sample only important words for the model. Campos et al. (Campos et al., 2018) proposed a model that augments the standard RNN with a binary state update gate, which decides whether or not to update the hidden state based on the number of previous updates performed, together with a loss term that balances the number of updates (i.e. learning speed) against the achieved accuracy. Different from these models, we developed Hop-LSTM in relation to a Triangular layer so that only the hidden states of LSTM cells for relevant exercises are connected. This allows our KT model to identify relevant skills and prior background from past learning activities (e.g., exercise sequences) for improved prediction accuracy.
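The idea of connecting only the hidden states of relevant exercises can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation used in our experiments: relevance is assumed to mean that two exercises have identical identity vectors, each step is assumed to receive the state of the most recent relevant step, and `cell` is a hypothetical stand-in for an LSTM cell (any function mapping an input and a previous state to a new state).

```python
def hop_indices(identity_vectors):
    """For each step, return the index of the most recent earlier step with the
    same identity vector, or None if no such step exists."""
    last_seen = {}
    hops = []
    for step, vector in enumerate(identity_vectors):
        key = tuple(vector)
        hops.append(last_seen.get(key))
        last_seen[key] = step
    return hops


def run_hop_recurrence(inputs, hops, cell, initial_state):
    """Run a recurrence in which each step reads the state of its hop target
    instead of the state of the immediately preceding step."""
    states = []
    for step, x in enumerate(inputs):
        previous = initial_state if hops[step] is None else states[hops[step]]
        states.append(cell(x, previous))
    return states


# Toy example: two interleaved groups of relevant exercises.
identities = [(1, 0, 0), (0, 2, 0), (1, 0, 0), (0, 2, 0), (1, 0, 0)]
hops = hop_indices(identities)
# A trivial additive "cell" makes the information flow easy to follow.
states = run_hop_recurrence([1, 10, 2, 20, 3], hops, lambda x, h: x + h, 0)
```

In the toy example, the state at each step accumulates only the inputs of exercises in the same group, so exercises on unrelated concepts that lie in between do not affect it.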
7. Conclusion
In this paper, we introduced a novel model called Sequential Key-Value Memory Networks (SKVMN) for knowledge tracing. SKVMN aims to overcome the limitations of the existing KT models. It is augmented with a key-value memory at the memory layer and a modified LSTM, called Hop-LSTM, at the sequence layer. The experimental results showed that our proposed model outperformed the state-of-the-art models on all datasets. Future work will consider techniques to automatically tune hyper-parameters for knowledge tracing models.
Acknowledgements. This research is supported by an Australian government higher education scholarship and an ANU Vice-Chancellor’s Teaching Enhancement Grant. We also thank NVIDIA for the generous GPU support.
References
- More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In Proceedings of the 9th International Conference on Intelligent Tutoring Systems, ITS, Berlin, Heidelberg, pp. 406–415.
- Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade: Second Edition, pp. 437–478.
- Stochastic gradient descent tricks. In Neural networks: Tricks of the trade: Second Edition, pp. 421–436.
- Skip RNN: learning to skip state updates in recurrent neural networks. In 6th International Conference on Learning Representations, (ICLR), Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
- Modeling exercise relationships in e-learning: a unified approach. In Proceedings of the 8th International Conference on Educational Data Mining, (EDM), Madrid, Spain, June 26-29, 2015, pp. 532–535.
- Knowledge tracing: modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction 4 (4), pp. 253–278.
- Long-term recurrent convolutional networks for visual recognition and description. pp. 2625–2634.
- Memory and the computational brain: why cognitive science will transform neuroscience. Vol. 6, John Wiley & Sons.
- Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). Society for Artificial Intelligence and Statistics, Chia Laguna Resort, Sardinia, Italy, pp. 249–256.
- Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, December 8-12, 2013, pp. 273–278.
- Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, pp. 6645–6649.
- Hybrid computing using a neural network with dynamic external memory. Nature 538, pp. 471.
- Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780.
- Variable computation in recurrent neural networks. In 5th International Conference on Learning Representations, (ICLR), Toulon, France, April 24-26, 2017, Conference Track Proceedings.
- How deep is knowledge tracing? In Proceedings of the 9th International Conference on Educational Data Mining, (EDM), Raleigh, North Carolina, USA, June 29 - July 2, 2016.
- Adam: a method for stochastic optimization. In International Conference on Learning Representations, ICLR.
- Fuzzy sets and fuzzy logic: theory and applications.
- FastGRNN: a fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, (NeurIPS), 3-8 December 2018, Montréal, Canada, pp. 9017–9028.
- Deep learning. Nature 521 (7553), pp. 436.
- AUC: a statistically consistent and more discriminating measure than accuracy. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI, San Francisco, CA, USA, pp. 519–524.
- Spatio-temporal LSTM with trust gates for 3D human action recognition. In 14th European Conference on Computer Vision, (ECCV), Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, Cham, pp. 816–833.
- Recurrent neural networks for prediction: learning algorithms, architectures and stability. John Wiley & Sons, Inc., New York, NY, USA.
- Modeling individualization in a Bayesian networks implementation of knowledge tracing. In Proceedings of the 18th International Conference on User Modeling, Adaptation, and Personalization, UMAP, Berlin, Heidelberg, pp. 255–266.
- KT-IDEM: introducing item difficulty to the knowledge tracing model. In Proceedings of the 19th International Conference on User Modeling, Adaption, and Personalization, UMAP, Berlin, Heidelberg, pp. 243–254.
- On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on International Conference on Machine Learning, ICML, Atlanta, Georgia, USA, pp. III–1310–III–1318.
- Deep knowledge tracing. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, NeurIPS, Cambridge, MA, USA, pp. 505–513.
- Uncovering patterns in student work: machine learning to understand human learning. Ph.D. Thesis, Stanford University.
- Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning, (ICML), New York City, NY, USA, June 19-24, 2016, pp. 1842–1850.
- Deep learning in neural networks: an overview. Neural Netw. 61 (C), pp. 85–117.
- Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681.
- Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, December 8-13 2014, Montreal, Quebec, Canada, NeurIPS, Cambridge, MA, USA, pp. 3104–3112.
- Probabilistic student models: Bayesian belief networks and knowledge space theory. In Proceedings of the Second International Conference on Intelligent Tutoring Systems, ITS, London, UK, pp. 491–498.
- LSTM-based EEG classification in motor imagery tasks. IEEE Transactions on Neural Systems and Rehabilitation Engineering 26 (11), pp. 2086–2095.
- Learning to Skim Text. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, (ACL), Vancouver, Canada, July 30 - August 4 (Volume 1: Long Papers), pp. 1880–1890.
- Individualized Bayesian knowledge tracing models. In Artificial Intelligence in Education - 16th International Conference, (AIED), Memphis, TN, USA, July 9-13, 2013. Proceedings, pp. 171–180.
- Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th International Conference on World Wide Web, WWW, Republic and Canton of Geneva, Switzerland, pp. 765–774.