Knowledge Tracing Machines: Factorization Machines for Knowledge Tracing

by Jill-Jênn Vie, et al.
Kyoto University

Knowledge tracing is a sequence prediction problem where the goal is to predict the outcomes of students over questions as they are interacting with a learning platform. By tracking the evolution of the knowledge of some student, one can optimize instruction. Existing methods are either based on temporal latent variable models, or factor analysis with temporal features. We show here that factorization machines (FMs), a model for regression or classification, encompass several existing models in the educational literature as special cases, notably the additive factor model, the performance factor model, and multidimensional item response theory. We show, using several real datasets of tens of thousands of users and items, that FMs can estimate student knowledge accurately and fast even when student data is sparsely observed, and handle side information such as multiple knowledge components and number of attempts at item or skill level. Our approach makes it possible to fit student models of higher dimension than existing models, and provides a testbed to try new combinations of features in order to improve existing models.




Related Work

In this section, we review several approaches proposed to model student learning.

Knowledge Tracing

Knowledge Tracing aims at predicting the sequence of outcomes of a student over questions. It usually relies on modeling the state of the learner throughout the process. After several attempts, students may eventually evolve to a state of mastery.

The most popular model is Bayesian knowledge tracing (BKT), which is a hidden Markov model [corbett1994knowledge]. However, BKT cannot model the fact that a question might require several knowledge components (KCs). New models have been proposed that do handle multiple subskills, such as feature-aware student tracing (FAST) [gonzalez2014general].

As deep learning models have proven successful at predicting sequences, they have been applied to student modeling: deep knowledge tracing (DKT) is a long short-term memory (LSTM) network [piech2015deep]. Several researchers have reproduced the experiment on variations of the Assistments dataset [xiong2016going, wilson2016back, wilson2016estimating], and shown that some factor analysis models could match the performance of DKT, as we will see now.

Factor Analysis

Factor analysis models tend to learn common factors in data in order to generalize observations. They have been successfully applied to matrix completion, where we assume that data is recorded for (user, item) pairs, but many entries are missing. The main difference with sequence prediction for our purposes is that the order in which the data is observed does not matter. If one wants to encode temporality though, it is possible to complement the data with temporal features such as simple counters, as we will see later. In all that follows, $\operatorname{logit}$ will denote the logit function:

$$\operatorname{logit}: p \mapsto \log \frac{p}{1 - p}$$


Item Response Theory

The simplest factor analysis model does not account for any change in knowledge between attempts: it is the 1-parameter logistic item response theory model, also known as the Rasch model:

$$\operatorname{logit} \Pr(X_{ij} = 1) = \theta_i - d_j$$

where $X_{ij}$ is the binary outcome of student $i$ on question $j$, $\theta_i$ measures the ability of student $i$ (the student bias) and $d_j$ measures the difficulty of question $j$ (the question bias). We will refer to the Rasch model as IRT in the rest of the paper. More recently, Wilson et al. have shown that IRT could outperform DKT [wilson2016estimating], even without temporal features [gonzalez2014general]. This may be because DKT models have many parameters to estimate, so they are prone to overfitting, and they are hard to train on long sequences.

The IRT model has been extended to multidimensional abilities:

$$\operatorname{logit} \Pr(X_{ij} = 1) = \langle \theta_i, d_j \rangle + \delta_j$$

where $\theta_i$ and $d_j$ are now vectors and $\delta_j$ is the easiness of item $j$ (item bias). Multidimensional Item Response Theory (MIRT) has a reputation for being hard to train [desmarais2012review], so it is not frequently encountered in the EDM literature, and the dimensionality used in psychometrics papers is at most 4. We show in this paper how to train those models effectively, up to 20 dimensions.


The additive factor model (AFM) [cen2006learning] takes into account the number of attempts a learner has made on an item:

$$\operatorname{logit} \Pr(X_{ij} = 1) = \sum_{k \in KC(j)} \beta_k + \gamma_k N_{ik}$$

where $KC(j)$ is the set of skills required by question $j$, $\beta_k$ is the bias for the skill $k$, and $\gamma_k$ the bias for each opportunity of learning the skill $k$. $N_{ik}$ is the number of attempts of student $i$ over a question that requires the skill $k$.

Performance factor analysis (PFA) [pavlik2009performance] counts positive and negative attempts separately:

$$\operatorname{logit} \Pr(X_{ij} = 1) = \sum_{k \in KC(j)} \beta_k + \gamma_k W_{ik} + \rho_k F_{ik}$$

where $\beta_k$ is the bias for the skill $k$, $\gamma_k$ ($\rho_k$) the bias for each opportunity of learning the skill $k$ after a successful (unsuccessful) attempt, and $W_{ik}$ ($F_{ik}$) is the number of successes (failures) of student $i$ over a question that requires the skill $k$. In other words, since $N_{ik} = W_{ik} + F_{ik}$, AFM can be seen as a particular case of PFA where $\gamma_k = \rho_k$ for every skill $k$. Please note that AFM and PFA do not consider item difficulty, presumably to avoid the item cold-start problem. According to [gonzalez2014general], PFA and FAST have comparable performance. By reproducing experiments, [xiong2016going] have managed to match the performance of DKT with PFA.
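As a rough illustration of the AFM and PFA formulas, the following Python sketch computes both logits for one (student, question) pair. The function names, the dictionary-based parameters (`beta` for skill biases, `gamma` and `rho` for the weights of successes and failures) and the toy values are assumptions made for this example only.

```python
import math

def sigmoid(z):
    """Inverse of the logit function: maps a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def pfa_logit(skills, beta, gamma, rho, wins, fails):
    """PFA: sum over the skills of the question of
    beta_k + gamma_k * wins_k + rho_k * fails_k."""
    return sum(beta[k] + gamma[k] * wins[k] + rho[k] * fails[k] for k in skills)

def afm_logit(skills, beta, gamma, wins, fails):
    """AFM: same sum with a single counter (wins_k + fails_k) per skill."""
    return sum(beta[k] + gamma[k] * (wins[k] + fails[k]) for k in skills)

# Hypothetical toy values: a question requiring skills 1 and 2.
skills = [1, 2]
beta = {1: -0.5, 2: 0.2}
gamma = {1: 0.3, 2: 0.1}
rho = {1: -0.1, 2: 0.05}
wins = {1: 2, 2: 2}
fails = {1: 1, 2: 1}

z_pfa = pfa_logit(skills, beta, gamma, rho, wins, fails)
p_pfa = sigmoid(z_pfa)

# When the failure weights equal the success weights, PFA reduces to AFM.
pfa_equals_afm = math.isclose(
    pfa_logit(skills, beta, gamma, gamma, wins, fails),
    afm_logit(skills, beta, gamma, wins, fails),
)
```

The last check mirrors the remark that AFM is the particular case of PFA in which successes and failures share the same weight for every skill.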

Factorization machines

Numerous works have noted the similarity between student modeling and collaborative filtering (CF) in recommender systems [bergner2012model, thai2011factorization]. For CF, factorization machines were designed to provide a way to encode side information about items or users into the model.

[thai2012using] and [sweeney2016next] have used factorization machines in their regression form for student modeling (where the root mean squared error is used as metric) but to the best of our knowledge, they have not been used in their classification form for student modeling. This is what we present in the next section.

Knowledge tracing machines

Users  Items  Skills  Wins   Fails
0 1    0 1 0  1 1 0   0 0 0  0 0 0
0 1    0 1 0  1 1 0   1 1 0  0 0 0
0 1    0 1 0  1 1 0   1 1 0  1 1 0
0 1    0 0 1  0 1 1   0 2 0  0 1 0
0 1    0 0 1  0 1 1   0 2 0  0 2 1
1 0    0 1 0  1 1 0   0 0 0  0 0 0
1 0    1 0 0  0 0 0   0 0 0  0 0 0
Table 1: An example of encoding for training a Knowledge Tracing Machine.

We now introduce the family of models described in this paper, Knowledge Tracing Machines (KTM).

Let $N$ be the number of features. Features can refer either to students, exercises, knowledge components (KCs), opportunities for learning, or extra information about the learning environment. For example, one might want to model the fact that the student attempted an exercise on mobile, or on computer, which might influence their outcome: it may be harder to type a correct answer when using a mobile, so this data should be taken into account in the predictions.

KTMs model the probability of observing binary outcomes of events (right or wrong), based on a sparse set of weights for all features involved in the event. Features involved in an event are encoded by a sparse vector $x$ of length $N$ such that $x_k > 0$ iff feature $k$ is involved in the event. For each event involving $x$, the probability $p(x)$ to observe a positive outcome verifies:

$$\psi(p(x)) = \mu + \sum_{k=1}^{N} w_k x_k + \sum_{1 \leq k < l \leq N} x_k x_l \langle v_k, v_l \rangle \qquad (1)$$

where $\psi$ is a link function such as logit or probit, $\mu$ is a global bias, and each feature $k$ is modeled by both a bias $w_k \in \mathbb{R}$ and an embedding $v_k \in \mathbb{R}^d$ for some dimension $d$. In what follows, $w$ will refer to the vector of biases $(w_1, \ldots, w_N)$ and $V$ to the matrix of embeddings $v_k$, $k = 1, \ldots, N$. For each event, only the features that have $x_k > 0$ will contribute to the prediction, see Figure 1.

Figure 1: Example of activation of a knowledge tracing machine.
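To make the prediction rule concrete, here is a minimal NumPy sketch of the right-hand side of the KTM equation: a global bias, a weighted sum of biases, and pairwise interactions between embeddings of active features. The function names and random toy parameters are assumptions of this sketch; it is not the libFM implementation. The vectorized version uses the standard factorization machine identity that computes the pairwise term in time linear in the number of features, and is checked against a naive double loop.

```python
import numpy as np

def ktm_score(x, mu, w, V):
    """mu + sum_k w_k x_k + sum_{k<l} x_k x_l <v_k, v_l> for one feature vector x.

    x: (N,) features, mu: global bias, w: (N,) biases, V: (N, d) embeddings.
    """
    linear = mu + w @ x
    # Identity: sum_{k<l} x_k x_l <v_k, v_l>
    #   = 0.5 * (||sum_k x_k v_k||^2 - sum_k x_k^2 ||v_k||^2)
    xv = x @ V
    pairwise = 0.5 * (xv @ xv - (x ** 2) @ (V ** 2).sum(axis=1))
    return linear + pairwise

def ktm_score_naive(x, mu, w, V):
    """Same quantity with an explicit double loop, for checking only."""
    total = mu + sum(w[k] * x[k] for k in range(len(x)))
    for k in range(len(x)):
        for l in range(k + 1, len(x)):
            total += x[k] * x[l] * float(V[k] @ V[l])
    return total

rng = np.random.default_rng(0)
N, d = 6, 3                                   # toy sizes (hypothetical)
w = rng.normal(size=N)
V = rng.normal(size=(N, d))
x = np.array([0, 1, 0, 1, 2, 0], dtype=float)  # sparse event: 3 active features

score = ktm_score(x, 0.1, w, V)
score_naive = ktm_score_naive(x, 0.1, w, V)

# With d = 0 there are no embeddings and only the bias terms remain.
score_d0 = ktm_score(x, 0.1, w, np.zeros((N, 0)))
```

With a logit link, the predicted probability would be `1 / (1 + np.exp(-score))`; with a probit link it would be the normal CDF of `score`.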

Data and encoding of side information

We now describe how to encode the observed data in the learning platform into the sparse vector $x$. First, we need to choose which features will be represented in the modeling.

Users

Let us assume there are $n$ students. The first $n$ features will be for all students. As an example, if student $i$ is involved in the observation, its value will be set to 1, while the ones for the other students will be set to 0. This is called a one-hot vector.

Items

Let us assume there are $m$ questions or items. One can allocate $m$ more features for all questions. If question $j$ is involved in the observation, its component in $x$ will be set to 1, while the ones for the other questions will be set to 0.

Skills

We now assume there are $s$ skills. We can then allocate $s$ extra features for those skills. The skills involved in an observation of a student over a question $j$ are the ones of $KC(j)$, the set of skills required by question $j$.

Attempts

One can allocate $s$ extra features as counters of how many opportunities a student has had to learn a skill involved in the test.

Wins & Fails

One can also distinguish between successes and failures: allocate $s$ features as opportunities to have learned a skill if the attempt was correct, and $s$ more features as opportunities to have learned a skill if the attempt was incorrect.

Extra side information

More side information can be concatenated to the existing sparse features, such as the school ID and teacher ID of the student, or also other information such as the type of test: low stake (practice) or high stake (posttest), etc.

Full example

See Table 1 for an example of encoding of users + items + skills + wins + fails, for a set of seven observed, chronologically ordered (student, question, outcome) triplets, starting with $(2, 2, 1)$ (student 2 attempted question 2 and got it correct). Here, we assume that there are $n = 2$ students, $m = 3$ questions, $s = 3$ skills and question 1 does not involve any skill, question 2 involves skills 1 and 2, question 3 involves skills 2 and 3. At the beginning, user 2 had no opportunity to learn any skill, so counters of wins and fails are all 0. After student 2 got question 2 correct, as it involved skills 1 and 2, the counters of wins for these two skills are incremented, and encoded for the next observation. We thus managed to encode the triplets with $N = n + m + 3s = 14$ features, and at training time, a bias and an embedding will be learned for each one of them.
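The encoding procedure can be sketched in a few lines of Python. The code below is an illustrative reconstruction, not the authors' implementation: the function name `encode` and the dictionary `KC` are hypothetical, the first four outcomes are inferred from the win and fail counters shown in Table 1, and the last three outcomes are arbitrary placeholders (they do not influence any row of the table).

```python
import numpy as np

# Q-matrix of the example: question -> required skills (1-indexed, as in the text).
KC = {1: [], 2: [1, 2], 3: [2, 3]}
N_USERS, N_ITEMS, N_SKILLS = 2, 3, 3

def encode(triplets):
    """Encode chronologically ordered (user, item, outcome) triplets as
    users + items + skills + wins + fails, one row per observation."""
    wins, fails = {}, {}                     # (user, skill) -> counter
    off_items = N_USERS
    off_skills = off_items + N_ITEMS
    off_wins = off_skills + N_SKILLS
    off_fails = off_wins + N_SKILLS
    rows = []
    for user, item, outcome in triplets:
        x = np.zeros(off_fails + N_SKILLS, dtype=int)
        x[user - 1] = 1                      # one-hot user
        x[off_items + item - 1] = 1          # one-hot item
        for k in KC[item]:
            x[off_skills + k - 1] = 1        # skills of the item
            # Counters reflect only what happened before this attempt.
            x[off_wins + k - 1] = wins.get((user, k), 0)
            x[off_fails + k - 1] = fails.get((user, k), 0)
        rows.append(x)
        # Update the counters of the skills involved, after encoding the row.
        counters = wins if outcome == 1 else fails
        for k in KC[item]:
            counters[(user, k)] = counters.get((user, k), 0) + 1
    return np.array(rows)

triplets = [(2, 2, 1), (2, 2, 0), (2, 2, 1), (2, 3, 0),
            (2, 3, 1), (1, 2, 1), (1, 1, 0)]  # last three outcomes: placeholders
X = encode(triplets)
```

Each row of `X` has 14 components and the rows match those of Table 1. Note that counters are only written for the skills involved in the current question, which is why the fourth row shows no wins for skill 1.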

Name         Users  Items  Skills  Skills per item  Entries  Sparsity (user, item)  Attempts per user
fraction     536    20     8       2.800            10720    0.000                  1.000
timss        757    23     13      1.652            17411    0.000                  1.000
ecpe         2922   28     3       1.321            81816    0.000                  1.000
assistments  4217   26688  123     0.796            346860   0.997                  1.014
berkeley     1730   234    29      1.000            562201   0.269                  1.901
castor       58939  17     2       1.471            1001963  0.000                  1.000
Table 2: Datasets used for the experiments

Relation to existing models

With a logit link function, KTMs include IRT, AFM and PFA as special cases. Let us now recover some particular cases, especially when $d = 0$, i.e., only biases are learned for features, no embeddings. We will again assume there are $n$ students, $m$ questions and $s$ skills.

We will write $\mathbf{1}^n_i$ for the one-hot vector of size $n$, which means all its components are 0 except the $i$th one, which is 1.

Relation to IRT

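If $d = 0$, the second sum in Equation 1 disappears and all that is left is a weighted sum of biases.

Assume that all features considered are students and questions (encoding users + items), and that the encoding of the pair (student $i$, question $j$) is a concatenation of the one-hot vectors $\mathbf{1}^n_i$ and $\mathbf{1}^m_j$: $x = (\mathbf{1}^n_i, \mathbf{1}^m_j)$, so that $x_k = 1$ iff $k = i$ or $k = n + j$. The expression in Equation 1 becomes:

$$\psi(p(x)) = \mu + w_i + w_{n+j} = \theta_i - d_j$$

if the first $n$ features (students numbered $i$ where $1 \leq i \leq n$) have bias $w_i = \theta_i - \mu$ and the next $m$ features (questions numbered $j$ where $1 \leq j \leq m$) have bias $w_{n+j} = -d_j$. Therefore, after reparametrization, KTM becomes the 1-PL IRT model, also referred to as the Rasch model.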
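The reduction to IRT can be checked numerically. In the sketch below (toy sizes, 0-indexed students and questions, variable names chosen for this example), student biases are set to the ability minus the global bias and question biases to minus the difficulty, which is one possible reparametrization; the bias-only KTM score then equals ability minus difficulty for every pair.

```python
import numpy as np

n, m = 3, 4                          # hypothetical numbers of students and questions
rng = np.random.default_rng(0)
theta = rng.normal(size=n)           # student abilities
difficulty = rng.normal(size=m)      # question difficulties
mu = 0.7                             # arbitrary global bias

# Student biases absorb the global bias; question biases are minus the difficulty.
w = np.concatenate([theta - mu, -difficulty])

def encode_pair(i, j):
    """Concatenation of a one-hot vector for student i and one for question j."""
    x = np.zeros(n + m)
    x[i] = 1
    x[n + j] = 1
    return x

def ktm_score_d0(x):
    """KTM score when d = 0: global bias plus a weighted sum of biases."""
    return mu + w @ x

max_gap = max(
    abs(ktm_score_d0(encode_pair(i, j)) - (theta[i] - difficulty[j]))
    for i in range(n)
    for j in range(m)
)
```

Here `max_gap` is zero up to floating-point error, confirming that the users + items encoding with no embeddings reproduces the Rasch model.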

Relation to AFM and PFA

Now we will again consider the special case and an encoding of skills, wins and fails at skill level. For this, we will assume we know the q-matrix, that is, the binary mapping between questions and skills as described in the introduction.

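If $d = 0$ and the encoding of “student $i$ attempted question $j$” is the concatenation of the indicator vector of the skills in $KC(j)$, the counters $W_{ik}$ and the counters $F_{ik}$ for $k \in KC(j)$ (with zeros for the other skills), where $W_{ik}$ and $F_{ik}$ are the counters of successful and unsuccessful attempts at skill level, then KTM behaves like the PFA model. Similarly, one can recover the AFM model.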

Relation to MIRT

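If $d > 0$, KTM becomes a MIRT model with user bias:

$$\psi(p(x)) = \mu + w_i + w_{n+j} + \langle v_i, v_{n+j} \rangle$$

The encoding is the same as for IRT (users + items with one-hot vectors), and the embeddings are $v_i = \theta_i$ and $v_{n+j} = d_j$, so that $w_{n+j}$ plays the role of the item easiness $\delta_j$ and $w_i$ is the extra user bias.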


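Training of KTMs is made by minimizing the negative log-likelihood over the observations $x_1, \ldots, x_T$ and outcomes $y_1, \ldots, y_T \in \{0, 1\}$:

$$NLL = -\sum_{t=1}^{T} \left[ y_t \log p(x_t) + (1 - y_t) \log (1 - p(x_t)) \right]$$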

Like [rendle2012factorization], we assume some priors over the model parameters in order to guide training and avoid overfitting.

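Each bias $w_k$ follows a normal prior $\mathcal{N}(\mu_\theta, 1/\lambda_\theta)$ and each embedding component $v_{kf}$, $f = 1, \ldots, d$, also follows $\mathcal{N}(\mu_\theta, 1/\lambda_\theta)$, where $\mu_\theta$ and $\lambda_\theta$ are regularization parameters that themselves follow hyperpriors (normal for the mean, Gamma for the precision).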

Because of those hyperpriors, we do not need to tune regularization parameters by hand [rendle2012factorization]. As we use the probit link $\psi = \Phi^{-1}$, that is, the inverse of the CDF of the normal distribution, we can fit the model using Gibbs sampling. Details of the computations can be found in


The model is learned using the MCMC Gibbs sampler implementation of libFM in C++ [rendle2012factorization], using the pywFM Python wrapper.

Visualizing the embeddings

Another advantage of KTMs is that we can visualize the embeddings that they learn. In Figure 2, we show the 2-dimensional embeddings of users, items and skills learned by a knowledge tracing machine on the Fraction subtraction dataset. The user WALL·E is positively correlated with most items, but not with skills 2 (separate a whole number from a fraction) and 7 (subtract numerators), which may explain why WALL·E couldn’t solve item 5, which requires these two skills. To know more about the items and skills of this dataset, see [DeCarlo2010].

Figure 2: Example of learned 2-dimensional embeddings for the Fraction dataset.


We used various datasets of different shapes and sizes in order to push our method to its limits. In Table 2, we report the main characteristics of the datasets: number of users, number of items, number of skills, average number of skills per item, total number of observed entries, sparsity of the (user, item) pairs, average number of attempts per user at item level.

Temporal Datasets

For the temporal datasets, students could attempt the same question several times, and potentially learn between attempts.


Assistments

The 2009–2010 dataset of Assistments described in [feng2009addressing]. 4217 students over 26688 questions, 123 KCs. 347k observations. There are many items but they involve 0 to 4 KCs, and there are only 146 combinations of KCs. For this dataset, we also had access to more side information, referred to as “extra” in the experiments:

  • first_action: attempt, or ask for a hint;

  • school_id where the problem was assigned;

  • teacher_id who assigned the problem;

  • tutor_mode: tutor, test mode, pretest, or posttest.


Berkeley

1730 students from Berkeley attempting 234 questions from an online CS course, 29 KCs, exactly 1 KC per question, which is actually a category. 650k entries.

Non-temporal Datasets

For all these datasets, the observations are fully specified: all users attempted all questions. All datasets except Castor6e can be found in the R package CDM [george2016r].


Castor

58939 middle-school students over 17 CS-related tasks, 2 KCs, 1.47 KCs per task. 1M entries.


ECPE

2922 students over 28 language-related items, 3 KCs, 1.3 KCs per question on average. 81k entries. This dataset can be found in the CDM R package.


Fraction

536 middle-school students over 20 fraction subtraction questions, 8 KCs, 2.8 KCs per question on average. 10.7k entries (10720 in Table 2). A precise description of the items and skills is in [DeCarlo2010].


TIMSS

757 students over 23 math questions from the TIMSS test in 2003, 13 KCs, 1.65 KCs per task. 17k entries.


From the triplets (user_id, item_id, outcome), we first compute for the temporal datasets the number of successful and unsuccessful attempts at skill level, according to the q-matrix.

For each dataset, we perform 5-fold cross validation. For each fold, entries are separated into a train and test set, then we train different encodings of KTMs using the train set, notably the ones corresponding to existing models, and predict the outcomes in the test set.

KTMs are trained for 1000 epochs on each non-temporal dataset, 500 epochs on the Assistments dataset and 300 epochs on the Berkeley dataset, because it was enough for convergence. At each epoch, we average the results over all 5 folds, in terms of accuracy (ACC), area under the curve (AUC) and negative log-likelihood (NLL).
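For reference, the three metrics can be computed from predicted probabilities as in the following NumPy sketch. The function names are chosen for this example, the 0.5 decision threshold for accuracy is an assumption, and the NLL is averaged per observation here, which is also an assumption (it matches the order of magnitude of the values reported in the result tables).

```python
import numpy as np

def accuracy(y, p, threshold=0.5):
    """Fraction of observations whose thresholded prediction matches the outcome."""
    return float(np.mean((p >= threshold) == (y == 1)))

def auc(y, p):
    """Probability that a random positive is scored above a random negative
    (ties count for one half). Quadratic in memory: fine for small samples."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))

def nll(y, p, eps=1e-12):
    """Mean negative log-likelihood of binary outcomes y under probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Toy predictions (hypothetical values).
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.7])

acc_value = accuracy(y_true, y_prob)
auc_value = auc(y_true, y_prob)
nll_value = nll(y_true, y_prob)
```

For datasets of the size used here, a rank-based AUC routine would be preferable to the pairwise comparison above, which is written for readability.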

As special cases, as shown earlier, we have, for the temporal datasets:

  • AFM is actually “skills, attempts” with $d = 0$;

  • PFA is actually “skills, wins, fails” with $d = 0$.

And for every dataset:

  • IRT is “users, items” with $d = 0$;

  • MIRT plus a user bias (coined as MIRTb) is “users, items” with any $d > 0$.

Results and Discussion

Figure 3: Results for the Assistments dataset.
model                              dim  ACC    AUC    NLL
items, skills, wins, fails, extra  20   0.774  0.819  0.465
items, skills, wins, fails, extra  5    0.775  0.819  0.465
items, skills, wins, fails, extra  10   0.775  0.818  0.465
items, skills, wins, fails, extra  0    0.774  0.815  0.463
items, skills, wins, fails         10   0.727  0.767  0.539
items, skills, wins, fails         0    0.725  0.759  0.542
items, skills, wins, fails         5    0.714  0.75   0.56
items, skills, wins, fails         20   0.714  0.75   0.564
IRT: users, items                  0    0.675  0.691  0.599
MIRTb: users, items                20   0.674  0.691  0.602
MIRTb: users, items                10   0.673  0.687  0.604
MIRTb: users, items                5    0.67   0.685  0.605
PFA: skills, wins, fails           0    0.68   0.685  0.604
skills, wins, fails                20   0.649  0.684  0.603
skills, wins, fails                5    0.649  0.683  0.604
skills, wins, fails                10   0.649  0.683  0.604
skills, attempts                   20   0.623  0.62   0.631
skills, attempts                   5    0.626  0.619  0.63
skills, attempts                   10   0.622  0.619  0.632
AFM: skills, attempts              0    0.653  0.616  0.631
Table 3: Results for the Assistments dataset.
model                       dim  ACC    AUC    NLL
items, skills, wins, fails  20   0.706  0.778  0.563
items, skills, wins, fails  10   0.706  0.778  0.563
items, skills, wins, fails  5    0.706  0.778  0.563
items, skills, wins, fails  0    0.705  0.775  0.566
IRT: users, items           0    0.688  0.753  0.586
MIRTb: users, items         5    0.685  0.753  0.589
MIRTb: users, items         10   0.685  0.752  0.59
MIRTb: users, items         20   0.683  0.752  0.591
PFA: skills, wins, fails    0    0.631  0.684  0.635
skills, wins, fails         10   0.631  0.684  0.635
skills, wins, fails         20   0.631  0.684  0.635
skills, wins, fails         5    0.631  0.684  0.635
skills, attempts            20   0.621  0.675  0.639
AFM: skills, attempts       0    0.621  0.675  0.639
skills, attempts            10   0.621  0.675  0.639
skills, attempts            5    0.621  0.675  0.639
Table 4: Results for the Berkeley dataset.
dataset      AFM     PFA     IRT     MIRTb10  MIRTb20  KTM(iswf0)  KTM(iswf20)  KTM(iswfe5)
assistments  0.6163  0.6849  0.6908  0.6874   0.6907   0.7589      0.7502       0.8186
berkeley     0.675   0.6839  0.7532  0.7521   0.7519   0.7753      0.7780       -
ecpe         -       -       0.6811  0.6807   0.6810   -           -            -
fraction     -       -       0.6662  0.6653   0.6672   -           -            -
timss        -       -       0.6946  0.6939   0.6932   -           -            -
castor       -       -       0.7603  0.7602   0.7599   -           -            -
Table 5: Summary of AUC results for all datasets.
model                 dim  ACC    AUC    NLL
MIRTb: users, items   20   0.619  0.667  0.651
items, skills         5    0.621  0.667  0.650
items, skills         20   0.621  0.666  0.649
MIRTb: users, items   5    0.621  0.666  0.650
IRT: users, items     0    0.623  0.666  0.656
users, items, skills  0    0.623  0.666  0.656
MIRTb: users, items   10   0.618  0.665  0.652
users, skills         5    0.62   0.664  0.649
Table 6: Results for the Fraction dataset.
model                 dim  ACC    AUC    NLL
items, skills         0    0.637  0.695  0.629
IRT: users, items     0    0.640  0.695  0.63
users, items, skills  0    0.639  0.694  0.63
MIRTb: users, items   10   0.638  0.694  0.628
MIRTb: users, items   20   0.636  0.693  0.629
users, skills         0    0.579  0.605  0.67
Table 7: Results for the TIMSS dataset.

Results are reported in Tables 3 to 7 and Figure 3. For convenience, we also reported a summary of the main results in Table 5. Each existing model is matched or outperformed by a KTM. For all non-temporal datasets, we did not consider attempt count, as each user only attempted an item once.

Training time

On the Assistments dataset, for $d = 0$, our model KTM(iswfe0) = “items, skills, wins, fails, extra” is a logistic regression, so it was faster to train (4 min 30 seconds on CPU for all 5 folds) than DKT (1 hour on CPU), while achieving a higher AUC (0.815, see Table 3). For models of higher dimensions on this dataset, with the same 31138 features, experiments took from 17 min to 32 min depending on $d$.

Effect of the side information considered

Given its simplicity, IRT has a remarkable performance on all datasets considered, even on the temporal ones, which may be because the average number of attempts per student is small. When considering all information at hand, the top performing KTM model on the Assistments dataset for $d = 0$ achieves higher performance than the known results of vanilla DKT. This makes sense, as we have access to more side information, and logistic regression is less prone to overfitting.

Wins and fails

For all temporal datasets, encoding wins and fails (PFA model) instead of only the number of attempts (AFM model) improves the performance (+0.07 AUC for Assistments, +0.01 for Berkeley). This is consistent with existing work [pavlik2009performance]. KTM models that consider the number of wins and fails (KTM(iswf0) = “items, skills, wins, fails” with $d = 0$) also improve over IRT (+0.07 AUC in Assistments, +0.02 in Berkeley).

Item bias

For all datasets, considering a bias per item improves the predictions, which is what IRT does but PFA does not. KTM(iswf0) = “items, skills, wins, fails” with $d = 0$ has a +0.07 AUC improvement over KTM(swf0) = PFA in Assistments, and +0.09 in Berkeley. It may be because the number of items is huge, and they do not have the same difficulty. So, it is useful to learn this difficulty parameter using the performance of previous students. This extra parameter enables a big improvement on all datasets, except on the Fraction dataset, which may be because the skills for fraction subtraction are easily known and clearly specified, so they are enough to characterize the items uniquely.


For Fraction (8 KCs), Assistments (123 KCs) and TIMSS (13 KCs), the skills are easy to identify, because the items are math problems. For the other datasets, either there are few skills (ECPE: 3 language-learning KCs, Castor: 2 KCs for CS), or there is only one KC mapped to an item (Berkeley: 29 KCs, categories of CS problems). This is why considering a bias per skill barely increases the performance of the predictions.

Effect of the dimension of features

On the temporal datasets, there is only a slight improvement of models with higher dimensions (less than +0.01 AUC), which seems to indicate that when there are many features considered (number of successful and unsuccessful attempts at item or skill level), a KTM with $d = 0$ provides good enough predictions. Still, on a similar task, [Vie2018] managed to get an improvement of +0.03 AUC for factorization machines with $d > 0$ compared to logistic regression ($d = 0$), presumably because the side information was considerable for this task.

Further work

In this work, we wanted to compare the expressiveness of models typically used for student modeling. Our experiments assess the strong generalization of student models, as students are randomly shuffled into train and test set, and the task of performance prediction is made for totally new students.
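A student-level split of this kind can be sketched as follows. This is only an illustration of the idea under stated assumptions: the helper name `user_folds`, the seed and the use of five folds of near-equal size are choices made for this example, not a description of the exact procedure used in the experiments.

```python
import numpy as np

def user_folds(user_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs such that every student appears
    either only in the train set or only in the test set of a fold."""
    user_ids = np.asarray(user_ids)
    users = np.unique(user_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(users)                       # random assignment of students to folds
    for test_users in np.array_split(users, n_folds):
        test_mask = np.isin(user_ids, test_users)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy log: one entry per attempt, identified by the student who made it.
user_ids = np.array([0, 0, 1, 2, 2, 2, 3, 4, 4, 5, 6, 7, 8, 9, 9])
folds = list(user_folds(user_ids))
```

Because test students never occur in the training entries of their fold, their user features receive no training signal, which is what makes the prediction task one of strong generalization.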

Side information in deep knowledge tracing

The vanilla DKT model cannot handle multiple skills, so instead, practitioners treat combinations of skills as new skills, which prevents the transfer of information between skills. The approach described in this paper can be used to handle multiple skills with DKT. Also, more recent results have successfully built upon the vanilla DKT (AUC 0.91 > 0.743), by incorporating dynamic cluster information [Minn2018]. We could indeed combine DKT with side information.

Adaptive testing

IRT and MIRT were initially designed to provide adaptive testing: choose the best next question to present to a learner, given their previous answers. KTMs could also be used to this end, as they extend the IRT and MIRT models with extra information, in the form of KCs or several attempts.

Response time, spaced repetition, and other data

Modeling response time could provide better predictions of outcomes, and it has also been used in the encoding of factorization machines in previous works. Also, we could add to the side information another counter representing how many timesteps have elapsed since a certain item was last asked. The model would then learn how the user reacts to spaced repetition. In some datasets such as Assistments, more data is recorded about students that can be used to improve the predictions. Still, we should be careful about encoding noisy data such as the output of other machine-learning algorithms as side information, because it may degrade performance.


Higher order factorization machines

In this paper, we were limited to pairwise interactions. In the original paper [rendle2012factorization], Rendle mentions higher-order factorization machines, which generalize pairwise interactions to interactions between more than two features. This could be an interesting direction for future research.

Ordinal regression

Instead of binary outcomes, one could consider graded outcomes, with the same KTM model, using thresholds, just like the graded response model in item response theory [samejima1997graded]. We leave it to further work.


In this paper, we showed how knowledge tracing machines, a family of models that encompasses existing models in the EDM literature as special cases, could be used for the classification problem of knowledge tracing.

We showed, using many datasets of various sizes and characteristics, that they could estimate user and item parameters even when the observations are sparse, and provide better predictions than existing models, including deep neural networks. KTMs are a testbed to try new combinations of data, such as response time, or number of attempts at item level.

One can refine the encoding of features in a KTM according to how the data was collected: Are the observations made at skill level or problem level? Does it make sense to count the number of attempts at item level or at skill level? What are extra sources of information that may provide a better understanding of the observations?

Furthermore, as we showed, KTMs are log-bilinear models, so the embeddings they learn are interpretable, and can be used to provide useful feedback to students.


We thank Mohammad Emtiyaz Khan, and the reviewers, for their valuable comments. We also thank Armando Fox and Nikunj Jain for providing the Berkeley dataset and Mathias Hiron for providing the Castor dataset. Part of this research was discovered on a plane, so we also thank the flight attendants, who are always working hard to ensure our comfort.