Knowledge Tracing Machines
Knowledge tracing is a sequence prediction problem where the goal is to predict the outcomes of students over questions as they interact with a learning platform. By tracking the evolution of a student's knowledge, one can optimize instruction. Existing methods are based either on temporal latent variable models or on factor analysis with temporal features. We show here that factorization machines (FMs), a model for regression or classification, encompass several existing models in the educational literature as special cases, notably the additive factor model, the performance factor model, and multidimensional item response theory. We show, using several real datasets of tens of thousands of users and items, that FMs can estimate student knowledge accurately and fast even when student data is sparsely observed, and can handle side information such as multiple knowledge components and the number of attempts at item or skill level. Our approach makes it possible to fit student models of higher dimension than existing models, and provides a testbed to try new combinations of features in order to improve existing models.
In this section, we review several approaches proposed to model student learning.
Knowledge Tracing aims at predicting the sequence of outcomes of a student over questions. It usually relies on modeling the state of the learner throughout the process. After several attempts, students may eventually evolve to a state of mastery.
The most popular model is Bayesian knowledge tracing (BKT), which is a hidden Markov model [corbett1994knowledge]. However, BKT cannot model the fact that a question might require several knowledge components (KCs). New models have been proposed that do handle multiple subskills, such as feature-aware student tracing (FAST) [gonzalez2014general].
Factor analysis methods tend to learn common factors in data in order to generalize observations. They have been successfully applied to matrix completion, where we assume that data is recorded for (user, item) pairs, but many entries are missing. The main difference with sequence prediction for our purposes is that the order in which the data is observed does not matter. If one wants to encode temporality though, it is possible to complement the data with temporal features such as simple counters, as we will see later. In all that follows, $\operatorname{logit}$ will denote the logit function: $\operatorname{logit}(p) = \log \frac{p}{1 - p}$.
The simplest model for factor analysis does not model any change in knowledge between attempts. It is the 1-parameter logistic item response theory model, also known as the Rasch model:
$$\operatorname{logit} \Pr(Y_{ij} = 1) = \theta_i - d_j$$
where $Y_{ij}$ is the binary outcome of student $i$ on question $j$, $\theta_i$ measures the ability of student $i$ (the student bias) and $d_j$ measures the difficulty of question $j$ (the question bias). We will refer to the Rasch model as IRT in the rest of the paper. More recently, Wilson et al. have shown that IRT could outperform deep knowledge tracing (DKT) [wilson2016estimating], even without temporal features [gonzalez2014general]. This may be because DKT models have many parameters to estimate, so they are prone to overfitting, and they are hard to train on long sequences.
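To make the Rasch formula concrete, here is a minimal sketch in Python: the logit of the success probability is the student ability minus the question difficulty. The function names and the numeric values are illustrative assumptions, not taken from the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def rasch_probability(ability: float, difficulty: float) -> float:
    """P(correct) under the Rasch model: logit(p) = ability - difficulty."""
    return sigmoid(ability - difficulty)


# When ability equals difficulty, the student has an even chance of success.
p_equal = rasch_probability(0.5, 0.5)
# A strong student on an easy question succeeds with high probability.
p_easy = rasch_probability(1.0, -1.0)
```

Because the model only has one parameter per student and one per question, it has no way to represent learning between two attempts by the same student.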
The IRT model has been extended to multidimensional abilities:
$$\operatorname{logit} \Pr(Y_{ij} = 1) = \langle \theta_i, d_j \rangle + \delta_j$$
where $\theta_i$ and $d_j$ are now vectors and $\delta_j$ is the easiness of item $j$ (item bias). Multidimensional item response theory (MIRT) has a reputation for being hard to train [desmarais2012review], so it is not frequently encountered in the EDM literature, and the dimensionality used in psychometrics papers is at most 4. We show in this paper how to train those models effectively, up to 20 dimensions.
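The additive factor model (AFM) [cen2006learning] takes into account the number of attempts a learner has made on an item:
$$\operatorname{logit} \Pr(Y_{ij} = 1) = \sum_{k \in KC(j)} \left( \beta_k + \gamma_k N_{ik} \right)$$
where $KC(j)$ is the set of skills required by question $j$, $\beta_k$ is the bias for the skill $k$, and $\gamma_k$ the bias for each opportunity of learning the skill $k$. $N_{ik}$ is the number of attempts of student $i$ over a question that requires the skill $k$.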
Performance factor analysis (PFA) [pavlik2009performance] counts positive and negative attempts separately:
$$\operatorname{logit} \Pr(Y_{ij} = 1) = \sum_{k \in KC(j)} \left( \beta_k + \gamma_k W_{ik} + \delta_k F_{ik} \right)$$
where $KC(j)$ is the set of skills required by question $j$, $\beta_k$ is the bias for the skill $k$, $\gamma_k$ ($\delta_k$) the bias for each opportunity of learning the skill $k$ after a successful (unsuccessful) attempt, and $W_{ik}$ ($F_{ik}$) is the number of successes (failures) of student $i$ over a question that requires the skill $k$. In other words, AFM can be seen as a particular case of PFA where $\gamma_k = \delta_k$ for every skill $k$. Please note that AFM and PFA do not consider item difficulty, presumably to avoid the item cold-start problem. According to [gonzalez2014general], PFA and FAST have comparable performance. By reproducing experiments, [xiong2016going] have managed to match the performance of DKT with PFA.
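The relation between the two models can be checked numerically. The sketch below assumes the usual forms of AFM (a per-skill bias plus a per-skill weight times the attempt count) and PFA (the same, with separate weights for successes and failures); all parameter values are made up for illustration.

```python
def afm_logit(skills, beta, gamma, attempts):
    """AFM: sum over required skills of bias + weight * number of attempts."""
    return sum(beta[k] + gamma[k] * attempts[k] for k in skills)


def pfa_logit(skills, beta, gamma, delta, wins, fails):
    """PFA: successes and failures get separate per-skill weights."""
    return sum(beta[k] + gamma[k] * wins[k] + delta[k] * fails[k] for k in skills)


# Hypothetical question requiring skills 1 and 2, with made-up parameters.
skills = [1, 2]
beta = {1: -0.2, 2: 0.1}
gamma = {1: 0.3, 2: 0.05}
wins = {1: 2, 2: 0}
fails = {1: 1, 2: 3}
attempts = {k: wins[k] + fails[k] for k in skills}

# With identical weights for successes and failures, PFA collapses to AFM.
pfa_value = pfa_logit(skills, beta, gamma, gamma, wins, fails)
afm_value = afm_logit(skills, beta, gamma, attempts)
```

Here the same `gamma` dictionary is passed for both PFA weights, which is exactly the constraint under which PFA reduces to AFM.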
Numerous works have noted the similarity between student modeling and collaborative filtering (CF) in recommender systems [bergner2012model, thai2011factorization]. For CF, factorization machines were designed to provide a way to encode side information about items or users into the model.
[thai2012using] and [sweeney2016next] have used factorization machines in their regression form for student modeling (where the root mean squared error is used as metric) but, to the best of our knowledge, they have not been used in their classification form for student modeling. This is what we present in the next section.
We now introduce the family of models described in this paper, Knowledge Tracing Machines (KTM).
Let $N$ be the number of features. Features can refer either to students, exercises, knowledge components (KCs), opportunities for learning, or extra information about the learning environment. For example, one might want to model the fact that the student attempted an exercise on mobile, or on computer, which might influence their outcome: it may be harder to type a correct answer when using a mobile, so this data should be taken into account in the predictions.
KTMs model the probability of observing binary outcomes of events (right or wrong), based on a sparse set of weights for all features involved in the event. Features involved in an event are encoded by a sparse vector $x$ of length $N$ such that $x_k > 0$ iff feature $k$ is involved in the event. For each event involving $x$, the probability $p(x)$ to observe a positive outcome verifies:
$$\psi(p(x)) = \mu + \sum_{k=1}^{N} w_k x_k + \sum_{1 \le k < l \le N} x_k x_l \langle v_k, v_l \rangle \tag{1}$$
where $\psi$ is a link function such as the logit or the probit, $\mu$ is a global bias, and each feature $k$ is modeled by both a bias $w_k \in \mathbb{R}$ and an embedding $v_k \in \mathbb{R}^d$ for some dimension $d$. In what follows, $w$ will refer to the vector of biases $(w_1, \ldots, w_N)$ and $V$ to the matrix of embeddings $v_k$, $k = 1, \ldots, N$. For each event, only the features that have $x_k > 0$ will contribute to the prediction, see Figure 1.
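The prediction rule of a factorization machine (global bias, plus a bias per active feature, plus a dot product of embeddings for every pair of active features) can be sketched as follows. This is an illustrative sketch under assumptions: the sparse vector is stored as a dictionary from feature index to value, the parameter values are made up, and the second function uses the well-known linear-time rewriting of the pairwise term for factorization machines, which is not necessarily how the authors implemented it.

```python
from itertools import combinations


def ktm_score(x, mu, w, V):
    """Link-transformed prediction: mu + sum_k w_k x_k + sum_{k<l} x_k x_l <v_k, v_l>."""
    active = sorted(x)
    linear = sum(w[k] * x[k] for k in active)
    pairwise = 0.0
    for k, l in combinations(active, 2):
        dot = sum(a * b for a, b in zip(V[k], V[l]))
        pairwise += x[k] * x[l] * dot
    return mu + linear + pairwise


def ktm_score_fast(x, mu, w, V, d):
    """Same value, using 0.5 * sum_f [(sum_k v_kf x_k)^2 - sum_k (v_kf x_k)^2]."""
    linear = sum(w[k] * x[k] for k in x)
    pairwise = 0.0
    for f in range(d):
        total = sum(V[k][f] * x[k] for k in x)
        squares = sum((V[k][f] * x[k]) ** 2 for k in x)
        pairwise += 0.5 * (total * total - squares)
    return mu + linear + pairwise


# Hypothetical model with N = 6 features and embeddings of dimension d = 2.
w = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6]
V = [[0.1, 0.2], [0.0, 0.0], [0.3, -0.1], [0.5, -0.4], [0.2, 0.2], [-0.3, 0.7]]
x = {0: 1.0, 3: 1.0, 5: 2.0}  # only three features are active in this event

slow = ktm_score(x, 0.0, w, V)
fast = ktm_score_fast(x, 0.0, w, V, d=2)
```

Both functions only touch the active features, so the cost of a prediction depends on the number of nonzero entries of the encoding rather than on the total number of features.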
We now describe how to encode the observed data in the learning platform into the sparse vector $x$. First, we need to choose which features will be represented in the modeling.
Let us assume there are $n$ students. The first $n$ features will be for all students. As an example, if student $i$ is involved in the observation, the component $x_i$ will be set to 1, while the ones for the other students will be set to 0. This is called a one-hot vector.
Let us assume there are $m$ questions or items. One can allocate $m$ more features for all questions. If question $j$ is involved in the observation, its component in $x$ will be set to 1, while the ones for the other questions will be set to 0.
We now assume there are $s$ skills. We can then allocate $s$ extra features for those skills. The skills involved in an observation of a student over a question $j$ are the ones of $KC(j)$.
One can allocate $s$ extra features as counters of how many opportunities a student has had to learn each skill involved in the test.
One can also distinguish between successes and failures: allocate $s$ features as opportunities to have learned a skill if the attempt was correct, and $s$ more features as opportunities to have learned a skill if the attempt was incorrect.
More side information can be concatenated to the existing sparse features, such as the school ID and teacher ID of the student, or also other information such as the type of test: low stake (practice) or high stake (posttest), etc.
See Table 1 for an example of encoding of users + items + skills + wins + fails, for a set of observed, chronologically ordered triplets (student, question, outcome) starting with $(2, 2, 1)$ (student 2 attempted question 2 and got it correct). Here, we assume that there are $n$ students, $m = 3$ questions and $s = 3$ skills, and that question 1 does not involve any skill, question 2 involves skills 1 and 2, and question 3 involves skills 2 and 3. At the beginning, user 2 had no opportunity to learn any skill, so counters of wins and fails are all 0. After student 2 got question 2 correct, as it involved skills 1 and 2, the counters of wins for these two skills are incremented, and encoded for the next observation. We thus managed to encode the triplets with $N = n + m + 3s$ features, and at training time, a bias and an embedding will be learned for each one of them.
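The encoding described above (users + items + skills + wins + fails) can be sketched as follows. This is a minimal illustration under assumptions: feature indices are 0-based and laid out as [users | items | skills | wins | fails], the number of students is set to 2, and only the first triplet comes from the text, the other two being hypothetical.

```python
def encode(triplets, n, m, s, kc):
    """Encode chronologically ordered (student, question, outcome) triplets.

    Returns one sparse dict {feature_index: value} per observation.
    """
    wins = {}   # (student, skill) -> number of past successes
    fails = {}  # (student, skill) -> number of past failures
    rows = []
    for student, question, outcome in triplets:
        x = {student - 1: 1, n + question - 1: 1}
        for k in kc[question]:
            x[n + m + k - 1] = 1
            past_wins = wins.get((student, k), 0)
            past_fails = fails.get((student, k), 0)
            if past_wins:
                x[n + m + s + k - 1] = past_wins
            if past_fails:
                x[n + m + 2 * s + k - 1] = past_fails
        rows.append(x)
        # Counters are updated only after the observation has been encoded.
        for k in kc[question]:
            counter = wins if outcome == 1 else fails
            counter[(student, k)] = counter.get((student, k), 0) + 1
    return rows


# q-matrix from the example: question 1 has no skill, 2 -> {1, 2}, 3 -> {2, 3}.
kc = {1: [], 2: [1, 2], 3: [2, 3]}
# The first triplet is the one from the text; the next two are hypothetical.
triplets = [(2, 2, 1), (2, 2, 0), (2, 3, 0)]
rows = encode(triplets, n=2, m=3, s=3, kc=kc)
```

Updating the counters after encoding matters: the features of an observation must only reflect what happened before it, otherwise the outcome being predicted would leak into its own input.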
|Name||Users||Items||Skills||Skills per item||Entries||Sparsity (user, item)||Attempts per user|
When $d = 0$, KTMs include IRT, AFM and PFA. Let us now recover some particular cases, starting with $d = 0$, i.e., only biases are learned for features, no embeddings. We will again assume there are $n$ students, $m$ questions and $s$ skills.
We will write $\mathbb{1}_i^n$ for a one-hot vector of size $n$, which means all its components are 0 except the $i$th one, which is 1.
If $d = 0$, the second sum in Equation 1 disappears and all that is left is a weighted sum of biases.
If all features considered are students and questions (encoding users + items), the encoding of the pair (student $i$, question $j$) is a concatenation of one-hot vectors $\mathbb{1}_i^n$ and $\mathbb{1}_j^m$: $x = (\mathbb{1}_i^n, \mathbb{1}_j^m)$, so $x_k = 1$ iff $k = i$ or $k = n + j$. The expression in Equation 1 becomes:
$$\psi(p(x)) = \mu + w_i + w_{n+j} = \mu + \theta_i + e_j$$
if the first $n$ features (students numbered $i$ where $1 \le i \le n$) have bias $w_i = \theta_i$ and the next $m$ features (questions numbered $j$ where $1 \le j \le m$) have bias $w_{n+j} = e_j$. Therefore, with $\psi = \operatorname{logit}$ and the reparametrization $d_j = -(\mu + e_j)$, KTM becomes the 1-PL IRT model, also referred to as the Rasch model.
Now we will again consider the special case $d = 0$ and an encoding of skills, wins and fails at skill level. For this, we will assume we know the q-matrix, that is, the binary mapping between questions and skills as described in the introduction.
If $d = 0$ and the encoding of “student $i$ attempted question $j$” concatenates the indicator of the skills in $KC(j)$ with the counters $W_{ik}$ and $F_{ik}$ for those skills, where $W_{ik}$ and $F_{ik}$ are the counters of successful and unsuccessful attempts at skill level, then KTM behaves like the PFA model. Similarly, one can recover the AFM model.
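If $d > 0$, KTM becomes a MIRT model with user bias:
$$\psi(p(x)) = \mu + w_i + w_{n+j} + \langle v_i, v_{n+j} \rangle$$
The encoding is the same as for IRT (users + items with one-hot vectors), and the embeddings $v_i$ and $v_{n+j}$ play the roles of the multidimensional ability of student $i$ and of the item vector of question $j$.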
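These two reductions can be verified on a toy example. The sketch below assumes a "users + items" one-hot encoding with 0-based indices and made-up parameters: with embeddings, the score is a user bias plus an item bias plus a dot product (MIRT with user bias), and with empty embeddings (dimension 0) only the two biases remain (IRT up to reparametrization).

```python
from itertools import combinations


def ktm_score(x, mu, w, V):
    """mu + sum_k w_k x_k + sum_{k<l} x_k x_l <v_k, v_l> over active features."""
    active = sorted(x)
    score = mu + sum(w[k] * x[k] for k in active)
    for k, l in combinations(active, 2):
        score += x[k] * x[l] * sum(a * b for a, b in zip(V[k], V[l]))
    return score


n, m = 3, 2                       # hypothetical: 3 students, 2 questions
w = [0.5, -0.1, 0.2, -0.3, 0.4]   # n student biases followed by m question biases
V = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [0.2, 0.2], [-0.4, 0.1]]

i, j = 2, 1                       # student 2 attempts question 1 (1-based)
x = {i - 1: 1, n + j - 1: 1}      # concatenation of two one-hot vectors

# d = 2: user bias + item bias + <user embedding, item embedding>.
mirtb = ktm_score(x, 0.0, w, V)
# d = 0: embeddings are empty, so only the two biases remain.
irt = ktm_score(x, 0.0, w, [[] for _ in w])
```

With exactly two active features there is a single pairwise term, which is why the factorization machine collapses to the familiar MIRT expression.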
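Training of KTMs is made by minimizing the negative log-likelihood over the observations $x^{(1)}, \ldots, x^{(T)}$ and outcomes $y^{(1)}, \ldots, y^{(T)}$:
$$NLL = -\sum_{t=1}^{T} \left[ y^{(t)} \log p(x^{(t)}) + (1 - y^{(t)}) \log\left(1 - p(x^{(t)})\right) \right]$$
Like [rendle2012factorization], we assume some priors over the model parameters in order to guide training and avoid overfitting.
Each bias $w_k$ follows a normal prior, and each embedding component $v_{kf}$ also follows a normal prior, where the mean and the precision of these priors are regularization parameters that themselves follow hyperpriors (a normal hyperprior on the mean and a Gamma hyperprior on the precision in [rendle2012factorization]).
Because of those hyperpriors, we do not need to tune regularization parameters by hand [rendle2012factorization]. As we use $\psi = \operatorname{probit}$, that is, the inverse of the CDF of the standard normal distribution, we can fit the model using Gibbs sampling. Details of the computations can be found in [freudenthaler2011bayesian].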
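As a small illustration of the objective, the sketch below computes the negative log-likelihood of binary outcomes when the link is the probit, that is, when the probability is the standard normal CDF of the score. It only evaluates the loss: the Gibbs sampler used for training is not reproduced here, and the function names are assumptions.

```python
import math


def normal_cdf(z: float) -> float:
    """CDF of the standard normal distribution (inverse of the probit link)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def negative_log_likelihood(scores, outcomes):
    """Sum of -log p for correct outcomes and -log(1 - p) for incorrect ones."""
    total = 0.0
    for z, y in zip(scores, outcomes):
        p = normal_cdf(z)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total


# A score of 0 means probability 0.5, so each observation costs log(2).
nll_uninformed = negative_log_likelihood([0.0, 0.0], [1, 0])
# Confident and correct predictions yield a lower loss.
nll_confident = negative_log_likelihood([2.0, -2.0], [1, 0])
```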
Another advantage of KTMs is that we can visualize the embeddings that they learn. In Figure 2, we show the 2-dimensional embeddings of users, items and skills learned by a knowledge tracing machine on the Fraction subtraction dataset. The user WALL·E is positively correlated with most of the items, but not with skills 2 (separate a whole number from a fraction) and 7 (subtract numerators), which may explain why WALL·E couldn’t solve item 5, which requires these two skills. To know more about the items and skills of this dataset, see [DeCarlo2010].
We used various datasets of different shapes and sizes in order to push our method to its limits. In Table 2, we report the main characteristics of the datasets: number of users, number of items, number of skills, average number of skills per item, total number of observed entries, sparsity of the (user, item) pairs, average number of attempts per user at item level.
For the temporal datasets, students could attempt the same question several times, and potentially learn between attempts.
Assistments. The 2009–2010 dataset of Assistments described in [feng2009addressing]. 4217 students over 26688 questions, 123 KCs. 347k observations. There are many items but they involve 0 to 4 KCs, and there are only 146 combinations of KCs. For this dataset, we also had access to more side information, referred to as “extra” in the experiments:
first_action: attempt, or ask for a hint;
school_id where the problem was assigned;
teacher_id who assigned the problem;
tutor_mode: tutor, test mode, pretest, or posttest.
Berkeley. 1730 students from Berkeley attempting 234 questions from an online CS course, 29 KCs, exactly 1 KC per question, which is actually a category. 650k entries.
For all these datasets, the observations are fully specified: all users attempted all questions. All datasets except Castor6e can be found in the R package CDM [george2016r].
Castor. 58939 middle-school students over 17 CS-related tasks, 2 KCs, 1.47 KCs per task. 1M entries.
ECPE. 2922 students over 28 language-related items, 3 KCs, 1.3 KCs per question on average. 81k entries. This dataset can be found in the CDM R package.
Fraction. 536 middle-school students over 20 fraction subtraction questions, 8 KCs, 2.8 KCs per question on average. 16k entries. A precise description of the items and skills is in [DeCarlo2010].
TIMSS. 757 students over 23 math questions from the TIMSS test in 2003, 13 KCs, 1.65 KCs per task. 17k entries.
From the triplets (user_id, item_id, outcome), we first compute for the temporal datasets the number of successful and unsuccessful attempts at skill level, according to the q-matrix.
For each dataset, we perform 5-fold cross validation. For each fold, entries are separated into a train and test set, then we train different encodings of KTMs using the train set, notably the ones corresponding to existing models, and predict the outcomes in the test set.
KTMs are trained for 1000 epochs on each non-temporal dataset, 500 epochs on the Assistments dataset and 300 epochs on the Berkeley dataset, because this was enough for convergence. At each epoch, we average the results over all 5 folds, in terms of accuracy (ACC), area under the curve (AUC) and negative log-likelihood (NLL).
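The three metrics can be computed as in the sketch below. It is a plain-Python illustration under assumptions: accuracy uses a 0.5 decision threshold, AUC is computed by comparing every (positive, negative) pair with ties counted as one half, and NLL is averaged per observation. The paper does not state these conventions, and the toy labels and predictions are made up.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of observations whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == (y == 1) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random positive is ranked above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    score = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return score / (len(pos) * len(neg))


def mean_nll(y_true, y_prob):
    """Average negative log-likelihood of the observed outcomes."""
    losses = [-math.log(p if y == 1 else 1.0 - p) for y, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)


# Toy predictions for four test entries.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
acc_value = accuracy(y_true, y_prob)
auc_value = auc(y_true, y_prob)
nll_value = mean_nll(y_true, y_prob)
```

The pairwise AUC is quadratic in the number of test entries, which is fine for a sketch but would be replaced by a rank-based computation on datasets of the size used here.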
As special cases, as shown earlier, we have, for the temporal datasets:
AFM is actually “skills, attempts” with $d = 0$
PFA is actually “skills, wins, fails” with $d = 0$
And for every dataset:
IRT is “users, items” with $d = 0$
MIRT plus a user bias (denoted MIRTb) is “users, items” with any $d > 0$.
Results on the Assistments dataset:
|Model||d||ACC||AUC||NLL|
|items, skills, wins, fails, extra||20||0.774||0.819||0.465|
|items, skills, wins, fails, extra||5||0.775||0.819||0.465|
|items, skills, wins, fails, extra||10||0.775||0.818||0.465|
|items, skills, wins, fails, extra||0||0.774||0.815||0.463|
|items, skills, wins, fails||10||0.727||0.767||0.539|
|items, skills, wins, fails||0||0.725||0.759||0.542|
|items, skills, wins, fails||5||0.714||0.75||0.56|
|items, skills, wins, fails||20||0.714||0.75||0.564|
|IRT: users, items||0||0.675||0.691||0.599|
|MIRTb: users, items||20||0.674||0.691||0.602|
|MIRTb: users, items||10||0.673||0.687||0.604|
|MIRTb: users, items||5||0.67||0.685||0.605|
|PFA: skills, wins, fails||0||0.68||0.685||0.604|
|skills, wins, fails||20||0.649||0.684||0.603|
|skills, wins, fails||5||0.649||0.683||0.604|
|skills, wins, fails||10||0.649||0.683||0.604|
|AFM: skills, attempts||0||0.653||0.616||0.631|

Results on the Berkeley dataset:
|Model||d||ACC||AUC||NLL|
|items, skills, wins, fails||20||0.706||0.778||0.563|
|items, skills, wins, fails||10||0.706||0.778||0.563|
|items, skills, wins, fails||5||0.706||0.778||0.563|
|items, skills, wins, fails||0||0.705||0.775||0.566|
|IRT: users, items||0||0.688||0.753||0.586|
|MIRTb: users, items||5||0.685||0.753||0.589|
|MIRTb: users, items||10||0.685||0.752||0.59|
|MIRTb: users, items||20||0.683||0.752||0.591|
|PFA: skills, wins, fails||0||0.631||0.684||0.635|
|skills, wins, fails||10||0.631||0.684||0.635|
|skills, wins, fails||20||0.631||0.684||0.635|
|skills, wins, fails||5||0.631||0.684||0.635|
|AFM: skills, attempts||0||0.621||0.675||0.639|

|Model||d||ACC||AUC||NLL|
|MIRTb: users, items||20||0.619||0.667||0.651|
|MIRTb: users, items||5||0.621||0.666||0.650|
|IRT: users, items||0||0.623||0.666||0.656|
|users, items, skills||0||0.623||0.666||0.656|
|MIRTb: users, items||10||0.618||0.665||0.652|

|Model||d||ACC||AUC||NLL|
|IRT: users, items||0||0.640||0.695||0.63|
|users, items, skills||0||0.639||0.694||0.63|
|MIRTb: users, items||10||0.638||0.694||0.628|
|MIRTb: users, items||20||0.636||0.693||0.629|
Results are reported in Tables 3 to 7 and Figure 3. For convenience, we also reported a summary of the main results in Table 5. Each existing model is matched or outperformed by a KTM. For all non-temporal datasets, we did not consider attempt count, as each user only attempted an item once.
On the Assistments dataset, for $d = 0$, our model KTM(iswfe0) = “items, skills, wins, fails, extra” with $d = 0$ is a logistic regression, so it was faster to train (4 min 30 seconds on CPU for all 5 folds) than DKT (1 hour on CPU), while achieving a higher AUC. For models of higher dimension on this dataset, with the same 31138 features, training took from 17 min to 32 min depending on the dimension.
Given its simplicity, IRT has a remarkable performance on all datasets considered, even on the temporal ones, which may be because the average number of attempts per student is small. When considering all information at hand, the top performing KTM model on the Assistments dataset for $d = 0$ achieves higher performance than the known results of vanilla DKT. This makes sense, as we have access to more side information, and logistic regression is less prone to overfitting.
For all temporal datasets, encoding wins and fails (PFA model) instead of only the number of attempts (AFM model) improves the performance (+0.07 AUC for Assistments, +0.01 for Berkeley). This is consistent with existing work [pavlik2009performance]. KTM models that consider the number of wins and fails (KTM(iswf0) = “items, skills, wins, fails” with $d = 0$) also improve over IRT (+0.07 AUC in Assistments, +0.02 in Berkeley).
For all datasets, considering a bias per item improves the predictions, which is what IRT does but PFA does not. KTM(iswf0) = “items, skills, wins, fails” with $d = 0$ has a +0.07 AUC improvement over KTM(swf0) = PFA in Assistments, and +0.09 in Berkeley. It may be because the number of items is huge, and they do not have the same difficulty. So, it is useful to learn this difficulty parameter using the performance of previous students. This extra parameter enables a big improvement on all datasets, except on the Fraction dataset, which may be because the skills for fraction subtraction are easily known and clearly specified, so they are enough to characterize the items uniquely.
For Fraction (8 KCs), Assistments (123 KCs) and TIMSS (13 KCs), the skills are easy to identify, because the items are math problems. For the other datasets, either there are few skills (ECPE: 3 language-learning KCs, Castor: 2 KCs for CS), or there is only one KC mapped to an item (Berkeley: 29 KCs, categories of CS problems). This is why considering a bias per skill barely increases the performance of the predictions.
On the temporal datasets, there is only a slight improvement of models with higher dimensions (less than +0.01 AUC), which seems to indicate that when there are many features considered (number of successful and unsuccessful attempts at item or skill level), a KTM with $d = 0$ provides good enough predictions. Still, on a similar task, [Vie2018] managed to get an improvement of +0.03 AUC for factorization machines with $d > 0$ compared to logistic regression ($d = 0$), presumably because the side information was considerable for this task.
In this work, we wanted to compare the expressiveness of models typically used for student modeling. Our experiments assess the strong generalization of student models, as students are randomly shuffled into train and test set, and the task of performance prediction is made for totally new students.
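A split that assesses strong generalization keeps all the entries of a given student on the same side, so that test students are entirely unseen at training time. The sketch below is one possible way to do it, not the authors' code: the function name, the test fraction and the toy triplets are assumptions.

```python
import random


def split_by_student(triplets, test_fraction=0.2, seed=0):
    """Hold out whole students: no student appears in both train and test."""
    students = sorted({student for student, _, _ in triplets})
    rng = random.Random(seed)
    rng.shuffle(students)
    n_test = max(1, round(len(students) * test_fraction))
    held_out = set(students[:n_test])
    train = [t for t in triplets if t[0] not in held_out]
    test = [t for t in triplets if t[0] in held_out]
    return train, test


# Toy data: 10 students, 3 questions each, arbitrary outcomes.
toy = [(s, q, (s + q) % 2) for s in range(1, 11) for q in range(1, 4)]
train, test = split_by_student(toy, test_fraction=0.2, seed=0)
```

With this kind of split, student biases and embeddings cannot be learned for test students, so the predictions rely on the item, skill and counter features.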
The vanilla DKT model cannot handle multiple skills, so instead, practitioners treat combinations of skills as new skills, which prevents the transfer of information between skills. The approach described in this paper can be used to handle multiple skills with DKT. Also, more recent results have successfully built upon the vanilla DKT (AUC 0.91 > 0.743), by incorporating dynamic cluster information [Minn2018]. We could indeed combine DKT with side information.
IRT and MIRT were initially designed to provide adaptive testing: choose the best next question to present to a learner, given their previous answers. KTMs could also be used to these ends, as they extend the IRT and MIRT models with extra information, under the form of KCs or several attempts.
Modeling response time could provide better predictions of outcomes, and it has also been used in the encoding of factorization machines in previous works. We could also add to the side information another counter representing how many timesteps have elapsed since a certain item was last asked. The model would then learn how the user reacts to spaced repetition. In some datasets such as Assistments, more data is recorded about students that can be used to improve the predictions. Still, we should be careful about encoding noisy data such as the output of other machine-learning algorithms as side information, because it may degrade performance [Vie2018].
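The spaced-repetition counter suggested above could be computed as in the following sketch. It assumes that a timestep is one attempt by the same student and that an item never seen before yields no value; both conventions, as well as the function name, are illustrative choices.

```python
def steps_since_last_seen(events):
    """For chronologically ordered (student, item) events, return the number of
    the student's own attempts since they last saw that item (None if never)."""
    last_seen = {}  # (student, item) -> index of the student's last attempt on it
    clock = {}      # student -> number of attempts made so far
    gaps = []
    for student, item in events:
        now = clock.get(student, 0)
        previous = last_seen.get((student, item))
        gaps.append(None if previous is None else now - previous)
        last_seen[(student, item)] = now
        clock[student] = now + 1
    return gaps


# Toy log: student 1 sees item "a" three times, student 2 sees it once.
events = [(1, "a"), (1, "b"), (1, "a"), (2, "a"), (1, "a")]
gaps = steps_since_last_seen(events)
```

Each gap could then be appended to the sparse encoding as one more counter feature, next to the counters of wins and fails.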
In this paper, we were limited to pairwise interactions. In his original paper [rendle2012factorization], Rendle mentions higher-order factorization machines, which generalize pairwise interactions to multi-way terms. It could be an interesting direction for future research.
Instead of binary outcomes, one could consider graded outcomes, with the same KTM model, using thresholds, just like the graded response model in item response theory [samejima1997graded]. We leave it to further work.
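One way to picture the threshold idea is the cumulative formulation of the graded response model: the probability of reaching at least a given grade is a sigmoid of the score minus a threshold, and the probability of each grade is the difference between consecutive cumulative probabilities. The sketch below assumes a logistic link and made-up thresholds; it is not an extension that the paper implements.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def graded_probabilities(score, thresholds):
    """Probabilities of grades 0..len(thresholds), for increasing thresholds."""
    # cumulative[c] is the probability of obtaining grade c or higher.
    cumulative = [1.0] + [sigmoid(score - t) for t in thresholds] + [0.0]
    return [cumulative[c] - cumulative[c + 1] for c in range(len(thresholds) + 1)]


# Three grades (0, 1, 2) separated by two hypothetical thresholds.
probs = graded_probabilities(score=0.0, thresholds=[-1.0, 1.0])
```

The score passed to this function would be the right-hand side of the KTM prediction, so the same features and embeddings could be reused with graded outcomes.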
In this paper, we showed how knowledge tracing machines, a family of models that encompasses existing models in the EDM literature as special cases, could be used for the classification problem of knowledge tracing.
We showed, using many datasets of various sizes and characteristics, that they could estimate user and item parameters even when the observations are sparse, and provide better predictions than existing models, including deep neural networks. KTMs are a testbed to try new combinations of data, such as response time or number of attempts at item level.
One can refine the encoding of features in a KTM according to how the data was collected: Are the observations made at skill level or problem level? Does it make sense to count the number of attempts at item level or at skill level? What are extra sources of information that may provide a better understanding of the observations?
Furthermore, as we showed, KTMs are log-bilinear models, so the embeddings they learn are interpretable, and can be used to provide useful feedback to students.
We thank Mohammad Emtiyaz Khan, and the reviewers, for their precious comments. We also thank Armando Fox and Nikunj Jain for providing the Berkeley dataset and Mathias Hiron for providing the Castor dataset. Part of this research was discovered on a plane, so we also thank the flight attendants, who are always working hard to ensure our comfort.