LANA: Towards Personalized Deep Knowledge Tracing Through Distinguishable Interactive Sequences

by   Yuhao Zhou, et al.
Sichuan University

In educational applications, Knowledge Tracing (KT), the problem of accurately predicting students' responses to future questions by summarizing their knowledge states, has been widely studied for decades, as it is considered a fundamental task towards adaptive online learning. Among all the proposed KT methods, Deep Knowledge Tracing (DKT) and its variants are by far the most effective ones due to the high flexibility of the neural network. However, DKT often ignores the inherent differences between students (e.g. memory skills, reasoning skills, ...), averaging the performance of all students, leading to a lack of personalization, and was therefore considered insufficient for adaptive learning. To alleviate this problem, in this paper we propose Leveled Attentive KNowledge TrAcing (LANA), which firstly uses a novel student-related features extractor (SRFE) to distill students' unique inherent properties from their respective interactive sequences. Secondly, a pivot module is utilized to dynamically reconstruct the decoder of the neural network on attention of the extracted features, successfully distinguishing the performance between students over time. Moreover, inspired by Item Response Theory (IRT), the interpretable Rasch model is used to cluster students by their ability levels, and leveled learning thereby assigns different encoders to different groups of students. With the pivot module reconstructing the decoder for individual students and leveled learning specializing encoders for groups, personalized DKT is achieved. Extensive experiments conducted on two real-world large-scale datasets demonstrated that our proposed LANA improves the AUC score by at least 1.00% on both EdNet and RAIEd2020 over other SOTA KT methods.





1 Introduction

Figure 1: A sample Knowledge Tracing (KT) process. Given each student's exercising activities in the past, KT aims to accurately predict the probability of whether each student is capable of correctly answering the next question.

Knowledge Tracing (KT) aims to accurately retrieve students' knowledge states at a certain time from their past sequential exercising interactions. To evaluate a KT method, it is asked to predict the correctness of students' future exercises with the retrieved knowledge states, as Figure 1 shows. Traditionally, KT was regarded as a sequential behavior mining task [12, 25], and therefore various methods established models with the theory of Bayesian probability (BKT [4]) and psycho-statistics (IRT [8]), providing excellent interpretability and good performance. Nevertheless, the recently proposed Deep Knowledge Tracing (DKT) [22] and its variants [19, 7, 20, 2, 26] significantly outperform other KT methods in metrics using Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM [10]). However, DKT distinctly lacks personalization compared to BKT and IRT, which are capable of separately training a unique model for each student: DKT only trains one unified model for all students, due to the massive training data and abundant computing resources required by deep learning. Hence, DKT weakly reflects the large gaps in inherent properties (i.e. memory skills, reasoning skills, or even guessing skills) between students.

Is it possible to bring personalization back to DKT? To answer this question, a natural and straightforward thought is to treat the KT task as an exercise-recommending task in which students are the target users. However, a typical recommendation system usually requires a great deal of side information about its targets (e.g. age, sex, education, ...) to help the model realize the differences between users, while KT can only provide little. Moreover, the side information in a recommendation system is often fixed or changes gradually, whereas in KT it is dynamic (e.g. memory skills could be reinforced over time). As a result, directly employing even an excellent recommendation system for KT will mostly fail to properly cluster different students due to insufficient student-related features, resulting in an unsatisfactory personalization experience.

Motivated by the observation that the proactive behavior sequence (i.e. interactive sequence) of each individual is unique and changeable over time, we argue that the minimal personalized unit in KT is "a student at a certain time $t$" instead of just "a student", and a student's inherent properties at time $t$ can be represented by his interactive sequences around time $t$. In such a way, these student-related features could tremendously help personalize the KT process, since they can be used to identify different students at different stages. Consequently, in our proposed Leveled Attentive KNowledge TrAcing (LANA), unique student-related features are distilled from students' interactive sequences by a Student-Related Features Extractor (SRFE). Moreover, inspired by BKT and IRT, which assign completely different models to different students, LANA, as a DKT model, successfully achieves the same goal in a different manner. Specifically, instead of separately training a model for each student like BKT and IRT, LANA learns to learn the correlations between inputs and outputs on attention of the extracted student-related features, and thus becomes transformable for different students at different stages. More concretely, the transformation is accomplished by the pivot module and leveled learning, where the former is a model component that heavily relies on the SRFE, and the latter is a training mechanism that specializes encoders for groups of students with interpretable Rasch-model-defined ability levels. Formally, LANA can be represented by:


$$s^u_t = \mathcal{D}\big(\mathcal{E}(I^u_{<t}),\ \mathcal{S}(I^u_{<t})\big) \qquad \text{(1)}$$

where $s^u_t$ is referred to as student $u$'s knowledge state at time $t$, $I^u_{<t}$ is referred to as the sequential exercising interactions of student $u$ before time $t$, and finally $\mathcal{D}$ (decoder), $\mathcal{E}$ (encoder) and $\mathcal{S}$ (SRFE) are the three main modules that LANA seeks to learn.

The main contributions of this paper were summarized as:

  1. To the best of our knowledge, LANA was the first proposal to distill student-related features from their respective interactive sequences by a novel Student-Related Features Extractor (SRFE), exceedingly reducing the difficulty of achieving personalized KT.

  2. With distilled unique student-related features, novel pivot module and leveled learning were utilized in LANA to make the whole model transformable for different students at different stages, bringing strong adaptability to the DKT domain.

  3. Extensive experiments were conducted on two real-world large-scale KT datasets in comparison with other State-Of-The-Art KT methods. The results demonstrated that LANA outperforms any other KT methods substantially. Ablation studies were also performed to investigate the impact of different key components in LANA. The source code and hyper-parameters of experiments are open-sourced for reproducibility


  4. Visualizations of the intermediate features suggest extra impacts of LANA, such as learning stages transfer and learning path recommendation.

The remainder of this paper was structured as follows. The background of KT was introduced in Section 2. The problem setup of KT was formally stated in Section 3. The comprehensive implementation of the proposed LANA was presented in Section 4. Experiments and results analysis was demonstrated in Section 5. Finally, Section 6 concluded this paper.

2 Literature Review

2.1 Background

With the rapid development of internet-related facilities, many online education systems [35, 1], such as Massive Online Open Courses (MOOC), Tencent Classroom and Fudao QQ, are gaining more and more attention for allowing students to be educated by a wide range of high-quality courses across the world at any time and anywhere, potentially saving students extensive time and thereby improving their learning efficiency extraordinarily [18]. Moreover, influenced by the pandemic in 2020, the number of students learning online has already surpassed 180 million. Together with the over 16,000 courses and 67 degrees provided by over 950 universities, it is fair to say online education is making knowledge truly accessible for everyone in the world.

However, despite the prevalence and convenience of online education, it inherently suffers from the inability to accurately track students' knowledge states, thereby failing to provide personalized learning suggestions [14] and to perceive whether students have reached their objectives [9]. In other words, in comparison with offline education at school, online education offers very little opportunity for students to proactively express their thoughts. Accordingly, the knowledge states of students learning online can be easily overlooked, resulting in a potential risk to students' future development. Motivated by the fact that students' current knowledge states are highly relevant to their previous learning activities, Knowledge Tracing (KT) [4] was therefore introduced to address these problems by modeling students' knowledge states with their past sequential exercising interactions, and is commonly evaluated by predicting students' responses to future exercises:


$$P\big(r^u_t = 1 \mid e^u_t,\ I^u_{<t}\big) = \mathcal{D}\big(\mathcal{E}(I^u_{<t}),\ e^u_t\big) \qquad \text{(2)}$$

where the meaning of each symbol is consistent with Equation 1, and $e^u_t$ and $r^u_t$ denote the question student $u$ answers at time $t$ and the correctness of that answer. In addition, a sample KT process was illustrated in Figure 1, where every student $u$ has sequential exercising interactions $I^u_1, I^u_2, \dots$, each representing that student $u$ correctly/incorrectly solved question $e^u_t$ at time $t$. Finally, a KT method is asked to accurately predict the probability of these students correctly solving the next question in the future.

2.2 Related Work

Extensive approaches [24, 6] were proposed to better model students' knowledge states, from the earlier Bayesian Knowledge Tracing (BKT) to the recent Deep Knowledge Tracing (DKT). In the literature, BKT [4, 5] mainly models each student's knowledge state as a binary array, with each bit representing whether the student has mastered a specific concept. Specifically, BKT utilizes each student's sequential interactions to extract each question's latent concepts, and then solves $\mathcal{E}$ and $\mathcal{D}$ in Equation 2 with a Hidden Markov Model [23] and 4 probabilistic parameters: the initial mastery probability of a concept, the transitioning probability from non-mastery to mastery of a concept, the probability of guessing the correct answers to questions related to a concept, and the slipping probability. However, although BKT owns strong interpretability, the original work could only handle the scenario where each question has one single concept, which is not very practical in real-life situations; hence, much later work was proposed to relax this assumption [36]. Meanwhile, inspired by psycho-statistics, Item Response Theory (IRT) [29, 33, 8] was proposed to trace students' knowledge states with only 2 parameters (i.e. the Rasch model): student ability and question difficulty, thereby drastically simplifying the equation-solving process. Moreover, each question in IRT was considered to have multiple concepts, assuring the effectiveness of IRT in real-world applications. Nevertheless, IRT freezes the parameters of the model once it has been trained and fitted to the data; in other words, performing online knowledge-state updating is considerably complex for IRT. On the other hand, while $\mathcal{E}$ and $\mathcal{D}$ in BKT and IRT were both manually designed by specialists, DKT [22] was recently proposed to utilize RNN/LSTM [10] to automatically learn their representations. The idea of DKT is that the knowledge state of a student at the current time, namely $s^u_t$, can be inferred from his knowledge state history $s^u_{t-1}$ and previous sequential exercising interactions $I^u_{<t}$:


$$s^u_t = \mathcal{E}\big(s^u_{t-1},\ I^u_{<t}\big), \qquad \hat{r}^u_t = \mathcal{D}\big(s^u_t,\ e^u_t\big) \qquad \text{(3)}$$

where $\mathcal{E}$ and $\mathcal{D}$ are also known as the encoder and the decoder in DKT's context. Particularly, SAKT [19] was, to our knowledge, the first work to replace the RNN/LSTM module with the attention mechanism [31], substantially surpassing other KT methods in both speed and metrics. Moreover, SAINT+ [26] stepped further on the basis of SAKT by leveraging the whole transformer to learn the hidden patterns of students' interactive sequences, significantly scaling the model up and successfully achieving SOTA on the EdNet [3] dataset. AKT [7], on the other hand, sought to improve the existing SAKT model: specifically, AKT introduced a Monotonic Attention Mechanism (referred to as the Vanilla Memory Attention (VMA) module later in this paper) to take students' "forgetting" behavior into account. Although these KT methods are all capable of efficiently tracing students' knowledge states over time, they can hardly provide personalized analysis and suggestions for each individual due to their unified model parameters for every student. In fact, from the perspective of these KT methods, all students are the same and share the same inherent properties (e.g. memory skills, reasoning skills or even guessing skills). Furthermore, the lack of adaptability in DKT is actually a setback [21, 37], since both BKT and IRT have such an ability by assigning different models to different students. Consequently, LANA was proposed to bring personalization back to DKT.

Figure 2: The overall model architecture of LANA. There are mainly three differences compared to vanilla transformer-based KT methods [2, 26]: I. Modifications to the basic transformer model. II. The introduced SRFE. III. The introduced Pivot Memory Attention (PMA) Module and Pivot Classification Feed-Forward Network (PC-FFN) Module, which are collectively referred to as the pivot module.

3 Problem Setup

3.1 Knowledge Tracing Problem

Given a sequence of interactions $I = (I_1, I_2, \dots, I_t)$, where $I_t = (e_t, r_t)$, $e_t$ represents the activity of a student answering question $e_t$ at discrete time step $t$, and $r_t \in \{0, 1\}$ represents the correctness of the student's answer to $e_t$ at time $t$, the KT problem aims to predict the probability of $r_t$ in $I_t$, formally:

$$P\big(r_t = 1 \mid e_t,\ I_1, I_2, \dots, I_{t-1}\big) \qquad \text{(4)}$$

Recently, many proposed KT methods have studied the contextual information (i.e. side information) of the questions in the KT problem [32, 20, 15, 7], and found that it can drastically help the KT model converge, suggesting that contextual information is strongly correlated to $r_t$. Consequently, we follow this work and give a more general representation of $I_t$ in the KT problem, namely $I_t = (e_t, c_t, r_t)$, where $c_t$ represents the contextual information of question $e_t$ (e.g. related concepts, part, etc.). Finally, $x_t$ is defined as $\{e_t, c_t, r_t\}$, referring to all features that participate in one interaction $I_t$.
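For concreteness, one interaction $I_t = (e_t, c_t, r_t)$ can be sketched as a small record type (the field names below are illustrative, not taken from the paper or its datasets):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Interaction:
    question_id: int        # e_t: the question being answered
    correct: int            # r_t in {0, 1}
    concepts: List[int] = field(default_factory=list)  # c_t: contextual info
    elapsed_ms: int = 0     # optional timing side information

# A student's history is an ordered sequence of interactions; a KT model
# consumes I_1 .. I_{t-1} and predicts P(r_t = 1 | e_t, history).
history = [
    Interaction(question_id=11, correct=1, concepts=[3]),
    Interaction(question_id=42, correct=0, concepts=[3, 7]),
]
assert all(i.correct in (0, 1) for i in history)
```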

3.2 Student-Related Features

In order to achieve adaptive learning in DKT, the model needs to adapt itself to different students at different stages (i.e. student-related features). That is, when the model makes a personalized prediction, it must be aware of which student is being predicted and what characteristics this student currently has. Inspired by the observation that the proactive behavior sequence of each student actually reflects some of his inner properties, we argue that I. interactive sequences of different students at the same time period are distinguishable (distinctive), and II. interactive sequences of the same student at different time periods are also distinguishable, as long as a. the features in an interaction are sufficiently rich, b. the length of the interactive sequence is sufficiently large, and c. the time interval between the same student's sequences is sufficiently long. Accordingly, students' interactive sequences can be used to identify their owners and to summarize the student-related features. As a result, a mild assumption was made for the later analysis.

Assumption 1

For any interactive sequences $I^u_{t_1}$, $I^v_{t_1}$ and $I^u_{t_2}$ (with $u \neq v$ and $t_1 \neq t_2$) satisfying conditions a, b and c above, $I^u_{t_1}$ can be distinguished from $I^v_{t_1}$ and $I^u_{t_2}$ respectively.

4 Methodology

In this section, we provided a comprehensive presentation of the proposed LANA method.

4.1 Overview

The LANA method is composed of a LANA model and a training mechanism. The overall architecture of the proposed LANA model was presented in Figure 2. As can be seen, the left part of the architecture in the figure is the encoder and the Student-Related Features Extractor (SRFE), while the right part is the decoder. The encoder retrieves useful information from the model's input embedding; the SRFE then further abstracts this information to obtain student-related features (Assumption 1); finally, the decoder makes the prediction with the information collected from both the SRFE and the encoder. The LANA model, like SAINT+ [26], is a transformer-based [31] KT model. However, different from SAINT+, the LANA model has mainly 3 improvements: I. the LANA model considers KT's characteristics and therefore makes modifications to the basic transformer model, such as feeding positional embeddings directly into the attention modules; II. the LANA model uses a novel SRFE to abstract necessary student-related features from the input sequence; III. the LANA model utilizes the pivot module and the extracted student-related features to dynamically construct different decoders for different students. With the reconstructed decoder, retrieved knowledge states, and other contextual information, corresponding personalized responses for future exercises are predicted.

While the pivot module can help the LANA model transform the decoder in terms of students' inherent properties, the encoder, on the other hand, is fixed for all students after training. Considering that in DKT the input sequence is only a portion of the student's whole interactive sequence (to lower the computational requirement), the difficulty of extracting features from the input sequence increases, and therefore the ability of the LANA model to distinguish different students at different stages needs to be greatly strengthened. Consequently, a leveled learning mechanism was proposed to address this problem by specializing different encoders and SRFEs to different groups of students with interpretable Rasch-model-defined student ability levels. The workflow of leveled learning was illustrated in Figure 3.


Figure 3: The workflow of leveled learning: interpretable Rasch model was leveraged to analyze students’ overall ability levels, and then cluster students into multiple layers, where each layer would respectively fine-tune the LANA model by its own training data.

4.2 Base Modifications

There are mainly two base modifications in the LANA model that were made to the basic transformer. Firstly, in the LANA model, the positional information (e.g. positional encoding, positional embedding) was directly fed into the attention module with a private linear projection, instead of being added to the input embedding and sharing the same linear projection matrix with other features in the input layer. Although experiments in [31] suggested that blending the input embedding with positional information is effective, some recent work [27] argued that when the model becomes deeper, it tends to "forget" the positional information fed into the first layer. Moreover, other work [13] believed that adding positional information to the input embedding and offering them to the attention module essentially makes them share the same linear projection matrix, which is not reasonable since the effects of the input embedding and the positional information are clearly distinctive. For exactly the same reason, in the LANA model, multiple input embeddings (i.e. question ID embedding, student ID embedding, etc.) are concatenated instead of added, leading to the second base modification. Specifically, assume there are $n$ input embeddings in total, each with a dimension of $d$. After concatenation, the input embedding would have a total dimension of $n \times d$. Hence, a linear projection layer was used to map the concatenated input embedding of dimension $n \times d$ to the model dimension $d_{model}$.
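As a rough sketch of the concatenate-then-project input layer (dimensions and initializations below are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n_embeddings, d, d_model, seq_len = 3, 64, 128, 100

# Per-feature embeddings (e.g. question ID, part, correctness): one vector
# of dimension d per position in the input sequence.
embeddings = [rng.standard_normal((seq_len, d)) for _ in range(n_embeddings)]

# Concatenate along the feature axis instead of summing, then map the
# (n * d)-dimensional result to the model dimension with one linear layer.
concat = np.concatenate(embeddings, axis=-1)          # (seq_len, n * d)
W = rng.standard_normal((n_embeddings * d, d_model)) * 0.02
b = np.zeros(d_model)
projected = concat @ W + b                            # (seq_len, d_model)

assert concat.shape == (seq_len, n_embeddings * d)
assert projected.shape == (seq_len, d_model)
```

Summing embeddings would force all features through one shared subspace; concatenation keeps each feature's subspace separate before the learned projection mixes them.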

Figure 4: The data shape transformation of two SRFE: Memory-SRFE and Performance-SRFE.
Figure 5: An illustration of the data transformation in the pivot module.
Figure 6: The inner structure of the Pivot Classification Feed Forward Network (PC-FFN) Module.

4.3 Student-Related Features Extractor (SRFE)

The Student-Related Features Extractor (SRFE) summarizes students' inherent properties from their interactive sequences under Assumption 1, for the pivot module to personalize the parameters of the decoder. Specifically, the SRFE contains an attention layer and several linear layers, where the attention layer distills student-related features from the information provided by the encoder, and the linear layers refine and reshape these features. Notably, there are primarily two SRFEs in the LANA model: the memory-SRFE and the performance-SRFE, where the former derives students' memory-related features for the PMA module (introduced later) and the latter distills students' performance-related features (i.e. logical thinking skill, reasoning skill, integration skill, etc.) for the PC-FFN module (also introduced later). The reshaping process was drawn in Figure 4 for better illustration, where $b$, $h$, $s$ and $d_p$ are referred to as the model's batch size, the number of attention heads [31], the length of the input sequence and the dimension of the performance-related features, respectively. The intuition that memory-related features have a second dimension of $h$ comes from the theory that each attention head only pays attention to one perspective of the features; thus it is reasonable that each student has different memory skills for different attention heads (e.g. for different concepts).
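A minimal sketch of the SRFE's shape bookkeeping, with mean-pooling and single linear maps standing in for the paper's attention and linear layers (all names and dimensions here are illustrative):

```python
import numpy as np

b, h, s, d_p = 2, 8, 100, 16   # batch, heads, sequence length, feature dim
d_model = 128

rng = np.random.default_rng(1)
encoded = rng.standard_normal((b, s, d_model))   # stand-in encoder output

# Memory-SRFE: refine the encoded sequence into one memory-skill value per
# attention head, so each head can model a different "memory skill"
# (e.g. per-concept forgetting).
W_mem = rng.standard_normal((d_model, h)) * 0.02
memory_feats = encoded.mean(axis=1) @ W_mem       # (b, h)

# Performance-SRFE: distill a d_p-dimensional performance vector per student.
W_perf = rng.standard_normal((d_model, d_p)) * 0.02
performance_feats = encoded.mean(axis=1) @ W_perf  # (b, d_p)

assert memory_feats.shape == (b, h)
assert performance_feats.shape == (b, d_p)
```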

4.4 Pivot Module

Provided an ordinary input $x \in \mathbb{R}^{d_x}$, student-related features $p \in \mathbb{R}^{d_p}$ and a target output $y \in \mathbb{R}^{d_y}$, the pivot module learns the process of learning how to project $x$ to $y$ based on $p$, instead of simply learning to project $x$ to $y$ (i.e. the pivot module learns to learn), as Equation 5 shows:

$$y = f_p(x) \qquad \text{(5)}$$

where $f_p$ is the function that the pivot module learns to learn. That is, the projection matrix of $f_p$ is adapted to $p$ instead of being fixed. To accomplish this dynamic mapping, the weight and bias of $f_p$ need to be a projection from $p$. Equation 5 could then be formally presented as Equation 6:

$$y = W_p\, x + b_p \qquad \text{(6)}$$

where $W_p \in \mathbb{R}^{d_y \times d_x}$ and $b_p \in \mathbb{R}^{d_y}$. Since $W_p$ and $b_p$ are derived from $p$, the detailed transformation is revealed in Equation 7, which was also depicted in Figure 5 for better illustration:

$$W_p = \mathrm{reshape}\big(W_w\, p + b_w\big), \qquad b_p = W_b\, p + b_b \qquad \text{(7)}$$

where $W_w \in \mathbb{R}^{(d_y \cdot d_x) \times d_p}$, $b_w \in \mathbb{R}^{d_y \cdot d_x}$, $W_b \in \mathbb{R}^{d_y \times d_p}$ and $b_b \in \mathbb{R}^{d_y}$.

By simplification, Equation 5 can be written as Equation 8, named $\mathrm{Pivot}(\cdot)$:

$$y = \mathrm{Pivot}(x, p) \qquad \text{(8)}$$

where $\mathrm{Pivot}(x, p) = \mathrm{reshape}(W_w\, p + b_w)\, x + (W_b\, p + b_b)$, and $W_w$, $b_w$, $W_b$, $b_b$ are the learnable parameters.
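Under the reconstruction above, a pivot linear layer amounts to a small hypernetwork whose weight matrix and bias are generated from $p$. A minimal sketch (names are illustrative, not from the released code; the transposed layout of $W_w$ is a convenience):

```python
import numpy as np

def pivot_linear(x, p, W_w, b_w, W_b, b_b):
    """Linear layer whose weights are generated from the student-related
    features p instead of being fixed (cf. Equations 6-7)."""
    d_x, d_y = x.shape[-1], b_b.shape[-1]
    W_p = (p @ W_w + b_w).reshape(d_y, d_x)   # per-student weight matrix
    b_p = p @ W_b + b_b                        # per-student bias
    return x @ W_p.T + b_p

rng = np.random.default_rng(2)
d_x, d_p, d_y = 32, 16, 8
x = rng.standard_normal(d_x)           # ordinary input
p = rng.standard_normal(d_p)           # student-related features

W_w = rng.standard_normal((d_p, d_y * d_x)) * 0.02
b_w = np.zeros(d_y * d_x)
W_b = rng.standard_normal((d_p, d_y)) * 0.02
b_b = np.zeros(d_y)

y1 = pivot_linear(x, p, W_w, b_w, W_b, b_b)
y2 = pivot_linear(x, p + 1.0, W_w, b_w, W_b, b_b)
assert y1.shape == (d_y,)
assert not np.allclose(y1, y2)   # different students -> different projection
```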

In the LANA model, there are primarily two modules that pertain to the pivot module: the Pivot Memory Attention (PMA) module and the Pivot Classification Feed-Forward Network (PC-FFN) module. In many methods [7, 20], a Vanilla Memory Attention (VMA) module was employed to consider the "forgetting" behavior of students, which is pivotal in KT's context: students are very likely to have done exercises similar to the one they are going to do, and if a student can remember the answers to previous similar exercises, the probability of correctly answering the future related exercises increases greatly. Inspired by the Ebbinghaus forgetting curve [17] and much previous work [20, 7], the "forgetting" behavior of students is defined as exponentially decaying weights of the corresponding interactions along the timeline. In detail, in the original attention module, the weight of item $j$ on item $i$, i.e. $\alpha_{ij}$, is determined by the sigmoid result of the similarity between item $i$ and item $j$:

$$\alpha_{ij} = \sigma\big(\mathrm{sim}(i, j)\big) \qquad \text{(9)}$$

where $\mathrm{sim}(\cdot, \cdot)$ is a function that calculates the similarity between item $i$ and item $j$ by dot product. In order to take "forgetting" behavior into $\alpha_{ij}$'s account (e.g. the further item $j$ is away from item $i$, the lower the weight would be), we replaced Equation 9 with Equation 10:

$$\alpha_{ij} = \sigma\big(\mathrm{sim}(i, j)\big) \cdot \exp\big(-(\theta + m)\, d(i, j)\big) \qquad \text{(10)}$$

where $m$ is the student's memory-related feature extracted by the memory-SRFE, $\theta$ is a private learnable constant that describes all students' average memory skill in the PMA module, and $d(i, j)$ calculates the time distance between item $i$ and item $j$ (e.g. item $i$ is done $d(i, j)$ minutes after item $j$ is done). The reason for representing the memory skill with two learnable parameters is to reduce the difficulty of model convergence, since $m$ has a much longer back-propagation path compared to $\theta$. When $\theta$ is introduced to fit the average memory skill of all students, the distribution of $m$ becomes a Gaussian distribution, which makes the model much easier to learn.
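The decay in Equation 10 can be sketched as follows, assuming dot-product similarity scores are already computed (function and variable names are illustrative):

```python
import numpy as np

def decayed_attention_weights(scores, timestamps, theta, memory_feat):
    """Scale sigmoid-activated attention scores by an exponential
    forgetting factor exp(-(theta + m) * time_distance), in the spirit
    of Equation 10."""
    t = np.asarray(timestamps, dtype=float)
    dist = np.abs(t[:, None] - t[None, :])           # d(i, j), e.g. minutes
    decay = np.exp(-(theta + memory_feat) * dist)
    return 1.0 / (1.0 + np.exp(-scores)) * decay     # sigmoid(sim) * decay

scores = np.zeros((3, 3))   # identical raw similarities for illustration
w = decayed_attention_weights(scores, [0, 10, 60], theta=0.05, memory_feat=0.01)
assert w[0, 0] > w[0, 1] > w[0, 2]   # farther in time -> smaller weight
```

With equal similarities, the weight on an interaction is governed purely by its time distance, which is exactly the intended forgetting effect.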

On the other hand, the PC-FFN makes the final prediction in reference to the performance-related features; its architecture was illustrated in Figure 6. The idea of this module comes from many investigations showing that the early layers of a deep neural network often act as a feature extractor, while the later layers act as a decision-maker deciding which features are useful to the model's output. As a result, these investigations point out that many models actually have similar early layers, and it is the later layers that make these models distinctive in usage. Consequently, the PC-FFN in the LANA model serves as a personalized decision-maker that adaptively makes the final prediction based on students' distinctive inherent properties:

$$\hat{r}_t = \sigma\big(\mathrm{Pivot}(h_t,\ q)\big) \qquad \text{(11)}$$

where $h_t$ is the decoder's hidden state and $q$ is the students' performance-related features extracted by the performance-SRFE.

4.5 Leveled Learning

While the pivot module enables the decoder to be transformable for different students, the encoder and the SRFE of the LANA model, which provide the necessary information for the pivot module, remain the same for all students. This is not problematic if the length of the input sequence is large enough, since Assumption 1 assures that long sequences are always distinguishable unless they both belong to the same student at the same time period. However, DKT, especially transformer-based DKT, can only be fed the latest $n$ (commonly $n = 100$; see Table 3) interactions at once due to the limited memory size and high computational complexity. Consequently, it is possible for the encoder and SRFE to output similar results for two different students, resulting in a failure of the decoder to adapt. To alleviate this problem, it is natural to think of assigning different students different encoders and SRFEs that are highly specialized (sensitive) to their assigned students' patterns. However, in practice it is not feasible to train a unique encoder for each individual student, considering both the limited training time and the limited training data. As a result, a novel leveled learning method was proposed to address this problem, initially inspired by the fine-tuning mechanism in transfer learning [28]: we consider each student a unique task, and we want to efficiently transfer a model that fits well on all students to one student.

Leveled learning holds the view that the earlier layers of a model are similar for similar tasks. Thus, to save training time and enlarge the training set, instead of training each student a unique encoder and SRFE with his private training data, students with similar ability levels are grouped together, sharing their private training data and the same encoder and SRFE. Therefore, LANA firstly utilizes an interpretable Rasch model to analyze the ability level $a_u$ for each student $u$, then groups students into $n$ independent layers $L_1, \dots, L_n$. Assuming the ability distributions of all students and of the students in layer $L_l$ are Gaussian distributions, we have Equation 12:

$$a_u \sim \mathcal{N}(\mu,\ \sigma^2), \qquad a_u \sim \mathcal{N}(\mu_l,\ \sigma_l^2)\ \ \text{for}\ u \in L_l \qquad \text{(12)}$$

In LANA, for simplicity, we consider all layers to share the same variance $\sigma_l^2 = \sigma_c^2$ (in practice, if the number of layers is small, their variances need to be manually measured and tuned based on the targets; if the number of layers is large, then multiple layers can be regarded as one layer, and sharing the same variance for all layers should be fine), and the difference in mean between consecutive layers to be a constant $\delta$. Hence, $\mu_l$ and $\sigma_l$ are given by:

$$\mu_l = \mu + \Big(l - \frac{n + 1}{2}\Big)\,\delta, \qquad \sigma_l = \sigma_c \qquad \text{(13)}$$

where $n$ is the number of layers. With both $\mu_l$ and $\sigma_l$ retrieved for every layer $L_l$, given a student's ability constant $a_u$, we can now calculate the probability of $u$ being grouped into different layers by Equation 14:

$$w^u_l = \frac{\mathcal{N}(a_u;\ \mu_l, \sigma_l^2)}{\sum_{k=1}^{n} \mathcal{N}(a_u;\ \mu_k, \sigma_k^2)} \qquad \text{(14)}$$

where $w^u_l$ is referred to as the probability of student $u$ being grouped into layer $L_l$. As can be seen from Equation 14, students with high ability levels are not necessarily grouped into the layers with high expected ability levels $\mu_l$. Rather, these high-ability students only have a higher probability of being grouped into high-ability layers than low-ability students do, which obeys rules in reality (e.g. high-ability students may also come from normal schools).
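The layer means of Equation 13 and the grouping probabilities of Equation 14 can be sketched as a toy example (the values of $\mu$, $\delta$, $n$ and $\sigma$ below are illustrative):

```python
import math

def layer_means(mu, delta, n):
    """Means of the n layers, spaced delta apart and centred on mu (Eq. 13)."""
    return [mu + (l - (n + 1) / 2) * delta for l in range(1, n + 1)]

def membership(a, mus, sigma):
    """Normalized probability of a student with ability a falling into each
    layer, from the layers' Gaussian densities (Eq. 14)."""
    dens = [math.exp(-((a - m) ** 2) / (2 * sigma ** 2)) for m in mus]
    z = sum(dens)
    return [d / z for d in dens]

mus = layer_means(mu=0.0, delta=1.0, n=5)      # [-2, -1, 0, 1, 2]
probs = membership(a=1.3, mus=mus, sigma=1.0)
assert abs(sum(probs) - 1.0) < 1e-9
assert max(range(5), key=lambda i: probs[i]) == 3   # closest layer, mu = 1
# High-ability students are only *more likely* to land in high-ability layers:
assert probs[4] > probs[0]
```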

Then, the LANA model that has been pre-trained on all students is duplicated $n$ times; each cloned model $M_l$ is assigned to a layer $L_l$ to be dedicatedly fine-tuned with $L_l$'s private training data by weighted back-propagation:

$$\mathcal{L}_l = \sum_{u} w^u_l \cdot \ell\big(\hat{r}^u,\ r^u\big) \qquad \text{(15)}$$

where $\hat{r}^u$ is the prediction of the model $M_l$ and $\ell(\cdot, \cdot)$ is the prediction loss.

While the training phase of leveled learning is promising, its inference phase suffers from two problems. The first problem is how to make the prediction using multiple specialized models. In LANA, the prediction is made by model fusion. In detail, when student $u$'s future responses need to be predicted, LANA firstly computes $w^u_l$ for every layer, then feeds $u$'s interactive sequence to the $k$ models $M_l$ with the largest $w^u_l$, where $k$ needs to be manually set up to control the predicting time. The outputs of these models are then weighted by their (renormalized) $w^u_l$ to form the final prediction. The workflow of leveled learning's inference step is described in Equation 16:

$$\hat{y} = \sum_{l \in \mathrm{top}\text{-}k(w^u)} \frac{w^u_l}{\sum_{k' \in \mathrm{top}\text{-}k(w^u)} w^u_{k'}}\; M_l(x) \qquad \text{(16)}$$

where $\hat{y}$ is leveled learning's final prediction and $x$ is the input of the models. This workflow seems similar to an ensemble, where multiple models are utilized to generate the final answer. Nonetheless, the weights of the models in LANA are probabilities that come from an interpretable Rasch model, so it is clear which model is dominant for $\hat{y}$. Moreover, unlike in an ensemble, where the role of each model is ambiguous, in LANA every model has an explainable effect (e.g. $M_n$ is committed to high-ability students, and therefore a student with a large $w^u_n$ must be similar to those high-ability students in $L_n$), suggesting that leveled learning significantly outperforms the ensemble in interpretability. A detailed comparison was shown in Table 1.

Table 1: Comparison Between Leveled Learning And Ensemble

                  Leveled Learning       Ensemble
Sub-set Select    Psycho-statistics      Random
Interpretability  Good                   Bad
Predicting Time   Controllable (top-k)   Uncontrollable

On the other hand, the second problem of leveled learning is how to compute $a_u$ for students that LANA has never met in training, namely the "cold start" problem [34]. In the vanilla KT context, we can only initialize newly arrived students' ability levels to the average ability level of all students. However, in practice, we can estimate their ability levels more accurately by asking them to do a couple of sample exercises or by using their ranking at school.

5 Experiments

5.1 Experimental Setup

In order to evaluate the effectiveness of the proposed LANA, we applied it to two real-world large-scale datasets in comparison with many other State-Of-The-Art (SOTA) KT methods. Specifically, EdNet [3] and RAIEd2020 [11] are employed in our experiments, where EdNet is currently the largest publicly available benchmark dataset in the education domain, consisting of over 90,000,000 interactions from nearly 800,000 students. RAIEd2020, on the other hand, is a recently published real-world dataset of approximately the same size as EdNet, with nearly 100,000,000 interactions from about 400,000 students. Particularly, the average number of exercising interactions per student in RAIEd2020 is roughly double that of EdNet. Other details of these two datasets can be found in Table 2.

Statistic EdNet RAIEd2020
# of interactions 95,293,926 99,271,300
# of students 784,309 393,656
# of exercises 13,169 13,523
# of parts 7 7
Table 2:
Datasets Comparison Of EdNet [3] And RAIEd2020 [11]
EdNet RAIEd2020
Optimizer AdamW AdamW
Learning Rate 5.00E-04 5.00E-04
Batch Size 256 256
Input Length 100 100
Model Dimension 512 256
Hidden Dimension 512 256
# of Heads 8 8
# of Encoder 4 2
# of Decoder 4 2
Dropout 0 0
1.0 1.0
Table 3:
The Detailed Hyper-parameters Setup Of Experiments Under Two Different Datasets

Moreover, 6 KT methods that had previously achieved SOTA performance participated in the comparison; they are briefly introduced below:

  • DKT [22]: DKT models students’ knowledge states and makes future response predictions with an LSTM [10]. Specifically, the knowledge states of students in DKT are represented by the hidden vectors of the LSTM.

  • DKVMN [38]: DKVMN utilizes two novel matrices, a key matrix and a value matrix, to separately store each question’s knowledge concepts and each student’s mastery level of the corresponding concept.

  • SAKT [19]: SAKT replaces DKT’s LSTM module with transformer’s [31] self-attention module to achieve SOTA on multiple knowledge tracing datasets.

  • SAINT [2]: While SAKT only uses one self-attention module in its architecture, SAINT leverages the whole transformer, which drastically improves the prediction accuracy.

  • SAINT+ [26]: SAINT+ is a follow-up of SAINT, where SAINT+ adds two more features to the input of the model. Experiments demonstrate the effectiveness of SAINT+ compared to SAINT.

  • AKT [7]: AKT utilizes a monotonic attention mechanism to trace students’ knowledge states, which considers the “forgetting” behavior of the human brain.

Dataset Model AUC
EdNet DKT 0.7638
EdNet DKVMN 0.7668
EdNet SAKT 0.7663
EdNet SAINT 0.7816
EdNet SAINT+ 0.7913
EdNet SAINT+ & BM 0.7935
EdNet LANA 0.8059
RAIEd2020 SAKT 0.7832
RAIEd2020 AKT 0.7901
RAIEd2020 SAINT+ 0.7956
RAIEd2020 SAINT+ & BM 0.7991
RAIEd2020 LANA 0.8056
Table 4:
The AUC Comparison Of Different Methods Tested On EdNet And RAIEd2020 datasets
Dataset Base Modification Pivot Module Leveled Learning AUC Improvement
EdNet 0.7913 -
EdNet 0.7935  0.0022
EdNet 0.7997  0.0084
EdNet 0.7923  0.0010
EdNet 0.7933  0.0020
EdNet 0.8029  0.0116
EdNet 0.8015  0.0102
EdNet 0.8038  0.0125
EdNet 0.8050  0.0137
EdNet 0.8059  0.0146
RAIEd2020 0.7956 -
RAIEd2020 0.7991  0.0035
RAIEd2020 0.8020  0.0064
RAIEd2020 0.7965  0.0009
RAIEd2020 0.7977  0.0021
RAIEd2020 0.8031  0.0075
RAIEd2020 0.8027  0.0071
RAIEd2020 0.8035  0.0079
RAIEd2020 0.8051  0.0095
RAIEd2020 0.8056  0.0100
Table 5:
Investigation Of The Effectiveness Of Different Improvements In LANA
Figure 7: The visualization of intermediate features in SAINT+ (a) and in LANA (b). Compared to (a), students in (b) (different colors) are notably clustered (marked arrows). The learning process of student #3 over time in SAINT+ (c) and in LANA (d). Compared to (c), a clear learning path appears in (d).

In terms of the basic experimental environment, all experiments were conducted with PyTorch 1.6 on a Linux server equipped with an Nvidia V100 GPU. For the hyper-parameter setup, the learning rate was set to 5e-4 with the AdamW [16] optimizer, the length of the input sequence was set to 100, the batch size was set to 256, and other detailed configurations are listed in Table 3. The input features in EdNet contain the Question ID, the Question part, the student’s response, the time interval between two consecutive interactions, and the elapsed time of an interaction, whereas RAIEd2020 additionally provides a new input feature indicating whether or not the student checked the correct answer to the previous question.

Finally, the Area Under the receiver operating characteristic Curve (AUC) was used as the performance metric in our experiments, as it has been widely adopted in many other KT-related proposals.
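For reference, AUC equals the probability that a randomly chosen positive (correctly answered) interaction is scored above a randomly chosen negative one. A minimal pure-Python sketch (in practice one would use a library routine such as scikit-learn’s `roc_auc_score`):

```python
# Pure-Python AUC via the pairwise-ranking (Mann-Whitney) definition:
# the fraction of positive/negative pairs ordered correctly by the score,
# counting ties as half a win.

def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two correct (1) and two incorrect (0) responses.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```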

For ease of explanation, Base Modification (Section 4.2), Pivot Module (Section 4.4) and Leveled Learning (Section 4.5) are hereinafter abbreviated as BM, PM and LL respectively.

5.2 Results And Analysis

The overall experimental results of different KT methods on different datasets are illustrated in Table 4. Because we successfully reproduced the performance of SAINT and SAINT+ previously reported in the SAINT+ paper [26] (with considerable precision), the AUCs of the other models are directly cited from that paper.

From the comparison table, it can be seen that on both the EdNet and RAIEd2020 datasets, LANA (marked bold) outperforms the previous SOTA method (marked italic) by 1.46% and 1.00% respectively, readily verifying the effectiveness of our proposed improvements. Moreover, LANA also surpasses SAINT+ & BM by 1.24% and 0.65% respectively, suggesting that adaptability contributes most to LANA’s AUC increment. Considering that the experimented datasets are by far the two largest knowledge tracing datasets in the world, these results provide strong evidence of the validity of the proposed LANA method.

5.3 Ablation Studies

In this section, we investigated the effectiveness of each of our proposed improvements: BM that customizes the basic transformer architecture, PM that enables the decoder to be adaptive to the students’ personal characteristics, and LL that interpretably specializes encoders and SRFEs for better predicting performance. The results of the ablation study were shown in Table 5.

The table shows that in EdNet, applying BM alone was already capable of improving the predicting AUC by approximately 0.2% on average, verifying the importance of both the positional embedding scheme and the personalized linear projection for each input feature in KT’s context. Meanwhile, applying LL alone benefits the model performance as well, by generally 0.2% compared to 0.1% with the vanilla ensemble. Considering that without PM, LL would merely perform fitting on students with different ability levels, the performance gain from LL alone can be interpreted as a reduction in the gaps between students’ inherent properties. Moreover, BM + PM drastically boosts the model performance by nearly 1.25%, suggesting that PM makes proper use of the student-related features extracted by the SRFE to adaptively reparameterize the model’s decoder for different students at different stages, and therefore contributes most to the final performance gain. Finally, by combining all improvements, BM + PM + LL (i.e. LANA) achieves a final AUC of 0.8059, substantially outperforming the previous SOTA by at least 1.46%.

5.4 Features Visualization

To vividly illustrate the validity of student-related feature distilling in LANA, students’ intermediate features from the PC-FFN module were sampled and projected by t-SNE [30] to generate Figure 7. In Figure 7 (a) and (b), each sample represents the intermediate features of different students (shown in different colors) in SAINT+ and LANA respectively. It can be seen that in SAINT+, samples are almost randomly distributed, indicating that the correlation between samples of the same student is no more significant than between samples of different students, since students’ personalities are ignored. In LANA, on the other hand, clusters of samples (marked arrows) have notably appeared in comparison to (a). Thus, we conclude that LANA is capable of successfully extracting student-related features from interactive sequences, summarizing the similarities and differences, which eventually results in more distinguishable features for the final classifier.

Furthermore, we individually visualized student #3’s (randomly picked) samples along the time axis to investigate the transitioning pattern of features in Figure 7 (c) (SAINT+) and (d) (LANA). In (c), there is no clear pattern in the change of features over time, while in (d), a clear transitioning path can be noticed. Since many other students share the same pattern in LANA, we argue that it represents the trajectory of the student’s ability changing with more and more exercising; namely, it is the learning path of the student. Consequently, we contend that it is potentially helpful for other applications, such as learning stage transfer and learning path recommendation.
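The projection behind Figure 7 can be reproduced in outline with scikit-learn’s t-SNE. The synthetic per-student features below are a stand-in for the sampled PC-FFN activations; the feature dimension, sample counts, and cluster structure are illustrative assumptions, not the paper’s actual data.

```python
# Sketch of the Figure 7 style analysis: embed per-student intermediate
# features into 2-D with t-SNE. Synthetic data stands in for PC-FFN features.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_students, samples_per_student, feat_dim = 4, 10, 64

# Hypothetical stand-in for intermediate features: one centre per student
# plus noise, mimicking the per-student clustering LANA is claimed to induce.
centres = rng.normal(0.0, 5.0, size=(n_students, feat_dim))
features = np.vstack([c + rng.normal(0.0, 1.0, size=(samples_per_student, feat_dim))
                      for c in centres])
labels = np.repeat(np.arange(n_students), samples_per_student)

# Perplexity must be smaller than the number of samples (40 here).
embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(features)
print(embedding.shape)  # one 2-D point per sampled feature vector
```

Coloring `embedding` by `labels` (one color per student) would reproduce the kind of scatter plot shown in Figure 7 (a) and (b); sorting one student’s points by timestamp would trace the learning path of (c) and (d).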

6 Conclusion

Knowledge Tracing, as a fundamental task for achieving adaptive learning, requires the tracing method to provide personalized analysis for each individual student. However, while integrating deep learning with knowledge tracing yields DKT, which surpasses all traditional KT methods in metrics, adaptability tends to be ignored by DKT and its variants, since they essentially treat all students identically and learn only the average pattern.

In this paper, we proposed a novel Leveled Attentive KNowledge TrAcing (LANA) method committed to bringing adaptability back to DKT. Specifically, inspired by BKT and IRT, LANA aims to achieve adaptability by providing different model parameters for different students at different stages. However, separately training a massive number of unique models for individual students is clearly impractical. Therefore, instead of directly learning the model parameters of different students, LANA distills students’ inherent properties from their respective interactive sequences with a novel SRFE, and learns a function that reparameterizes the model with these extracted student-related features. Consequently, an innovative pivot module was proposed to produce an adaptive decoder. Besides, to reduce the ambiguity of the input sequence and capture the long-term characteristics of individual students, a novel leveled learning training mechanism was introduced to cluster students by ability levels defined by the interpretable Rasch model, which not only specializes the encoder and thereby enhances the significance of students’ latent features, but also saves much training time. Extensive experiments on the two largest public benchmark datasets in the education domain strongly verify the feasibility and effectiveness of the proposed LANA. Feature visualization also suggests further applications of LANA, be it learning stage transfer or learning path recommendation.

However, LANA has some drawbacks as well. Firstly, configuring the variances in Leveled Learning requires extensive manual effort; an automatic workflow to set up these parameters is desired. Secondly, while Figure 7 illustrates the efficacy of the proposed SRFE module, a more systematic way to quantify the student features needs to be developed. Solving these problems will be our next step.


  • [1] A. Anderson, D. Huttenlocher, J. Kleinberg, and J. Leskovec. Engaging with massive online courses. In Proceedings of the 23rd international conference on World wide web, pages 687–698, 2014.
  • [2] Y. Choi, Y. Lee, J. Cho, J. Baek, B. Kim, Y. Cha, D. Shin, C. Bae, and J. Heo. Towards an appropriate query, key, and value computation for knowledge tracing. In Proceedings of the Seventh ACM Conference on Learning@ Scale, pages 341–344, 2020.
  • [3] Y. Choi, Y. Lee, D. Shin, J. Cho, S. Park, S. Lee, J. Baek, C. Bae, B. Kim, and J. Heo. Ednet: A large-scale hierarchical dataset in education. In International Conference on Artificial Intelligence in Education, pages 69–73. Springer, 2020.
  • [4] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction, 4(4):253–278, 1994.
  • [5] R. S. d Baker, A. T. Corbett, and V. Aleven. More accurate student modeling through contextual estimation of slip and guess probabilities in bayesian knowledge tracing. In International conference on intelligent tutoring systems, pages 406–415. Springer, 2008.
  • [6] M. C. Desmarais and R. S. d Baker. A review of recent advances in learner and skill modeling in intelligent learning environments. User Modeling and User-Adapted Interaction, 22(1):9–38, 2012.
  • [7] A. Ghosh, N. Heffernan, and A. S. Lan. Context-aware attentive knowledge tracing. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2330–2339, 2020.
  • [8] J. González-Brenes, Y. Huang, and P. Brusilovsky. General features in knowledge tracing to model multiple subskills, temporal item response theory, and expert knowledge. In The 7th international conference on educational data mining, pages 84–91. University of Pittsburgh, 2014.
  • [9] R. Grossman and E. Salas. The transfer of training: what really matters. International Journal of Training and Development, 15(2):103–120, 2011.
  • [10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [11] Riiid Inc. Riiid AIEd Challenge 2020, 2020. [Online; accessed 6-Oct-2020].
  • [12] A. Jalal and M. Mahmood. Students’ behavior mining in e-learning environment using cognitive processes with information technologies. Education and Information Technologies, 24(5):2797–2821, 2019.
  • [13] G. Ke, D. He, and T.-Y. Liu. Rethinking the positional encoding in language pre-training. arXiv preprint arXiv:2006.15595, 2020.
  • [14] G. D. Kuh, J. Kinzie, J. A. Buckley, B. K. Bridges, and J. C. Hayek. Piecing together the student success puzzle: Research, propositions, and recommendations: ASHE higher education report, volume 116. John Wiley & Sons, 2011.
  • [15] Q. Liu, Z. Huang, Y. Yin, E. Chen, H. Xiong, Y. Su, and G. Hu. Ekt: Exercise-aware knowledge tracing for student performance prediction. IEEE Transactions on Knowledge and Data Engineering, 33(1):100–115, 2019.
  • [16] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [17] J. M. Murre and J. Dros. Replication and analysis of ebbinghaus’ forgetting curve. PloS one, 10(7):e0120644, 2015.
  • [18] B. A. Oakley and T. J. Sejnowski. What we learned from creating one of the world’s most popular moocs. NPJ science of learning, 4(1):1–7, 2019.
  • [19] S. Pandey and G. Karypis. A self-attentive model for knowledge tracing. arXiv preprint arXiv:1907.06837, 2019.
  • [20] S. Pandey and J. Srivastava. Rkt: Relation-aware self-attention for knowledge tracing. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 1205–1214, 2020.
  • [21] Z. A. Pardos and N. T. Heffernan. Modeling individualization in a bayesian networks implementation of knowledge tracing. In International Conference on User Modeling, Adaptation, and Personalization, pages 255–266. Springer, 2010.
  • [22] C. Piech, J. Spencer, J. Huang, S. Ganguli, M. Sahami, L. Guibas, and J. Sohl-Dickstein. Deep knowledge tracing. arXiv preprint arXiv:1506.05908, 2015.
  • [23] L. Rabiner and B. Juang. An introduction to hidden markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.
  • [24] C. Romero and S. Ventura. Educational data mining: a review of the state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(6):601–618, 2010.
  • [25] S. Shang, L. Chen, C. S. Jensen, J.-R. Wen, and P. Kalnis. Searching trajectories by regions of interest. IEEE Transactions on Knowledge and Data Engineering, 29(7):1549–1562, 2017.
  • [26] D. Shin, Y. Shim, H. Yu, S. Lee, B. Kim, and Y. Choi. Saint+: Integrating temporal features for ednet correctness prediction. arXiv preprint arXiv:2010.12042, 2020.
  • [27] V. L. Shiv and C. Quirk. Novel positional encodings to enable tree-based transformers. In NeurIPS, pages 12058–12068, 2019.
  • [28] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu. A survey on deep transfer learning. In International conference on artificial neural networks, pages 270–279. Springer, 2018.
  • [29] W. J. van der Linden and R. K. Hambleton. Handbook of modern item response theory. Springer Science & Business Media, 2013.
  • [30] L. Van der Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
  • [32] Z. Wang, X. Feng, J. Tang, G. Y. Huang, and Z. Liu. Deep knowledge tracing with side information. In International conference on artificial intelligence in education, pages 303–308. Springer, 2019.
  • [33] K. H. Wilson, Y. Karklin, B. Han, and C. Ekanadham. Back to the basics: Bayesian extensions of irt outperform neural networks for proficiency estimation. arXiv preprint arXiv:1604.02336, 2016.
  • [34] K. H. Wilson, X. Xiong, M. Khajah, R. V. Lindsey, S. Zhao, Y. Karklin, E. G. Van Inwegen, B. Han, C. Ekanadham, J. E. Beck, et al. Estimating student proficiency: Deep learning is not the panacea. In Neural Information Processing Systems, Workshop on Machine Learning for Education, page 3, 2016.
  • [35] B. P. Woolf. Building intelligent interactive tutors: Student-centered strategies for revolutionizing e-learning. Morgan Kaufmann, 2010.
  • [36] Y. Xu and J. Mostow. Using logistic regression to trace multiple sub-skills in a dynamic bayes net. In EDM, pages 241–246. Citeseer, 2011.
  • [37] M. V. Yudelson, K. R. Koedinger, and G. J. Gordon. Individualized bayesian knowledge tracing models. In International conference on artificial intelligence in education, pages 171–180. Springer, 2013.
  • [38] J. Zhang, X. Shi, I. King, and D.-Y. Yeung. Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th international conference on World Wide Web, pages 765–774, 2017.