
Bi-CLKT: Bi-Graph Contrastive Learning based Knowledge Tracing

The goal of Knowledge Tracing (KT) is to estimate how well students have mastered a concept based on their historical learning of related exercises. The benefit of knowledge tracing is that students' learning plans can be better organised and adjusted, and interventions can be made when necessary. With the recent rise of deep learning, Deep Knowledge Tracing (DKT) has utilised Recurrent Neural Networks (RNNs) to accomplish this task with some success. Other works have attempted to introduce Graph Neural Networks (GNNs) and redefine the task accordingly to achieve significant improvements. However, these efforts suffer from at least one of the following drawbacks: 1) they pay too much attention to details of the nodes rather than to high-level semantic information; 2) they struggle to effectively establish spatial associations and complex structures of the nodes; and 3) they represent either concepts or exercises only, without integrating them. Inspired by recent advances in self-supervised learning, we propose a Bi-Graph Contrastive Learning based Knowledge Tracing (Bi-CLKT) to address these limitations. Specifically, we design a two-layer contrastive learning scheme based on an "exercise-to-exercise" (E2E) relational subgraph. It involves node-level contrastive learning of subgraphs to obtain discriminative representations of exercises, and graph-level contrastive learning to obtain discriminative representations of concepts. Moreover, we designed a joint contrastive loss to obtain better representations and hence better prediction performance. Also, we explored two different variants, using RNN and memory-augmented neural networks as the prediction layer for comparison to obtain better representations of exercises and concepts respectively. Extensive experiments on four real-world datasets show that the proposed Bi-CLKT and its variants outperform other baseline models.





1 Introduction

With the rapid development of online education platforms, vast amounts of online learning data are available for tracking students’ learning status accurately and in a timely manner. To trace students’ mastery of specific knowledge points or concepts, a fundamental task called Knowledge Tracing (KT) has been proposed Corbett and Anderson (1994), which uses a series of student interactions with exercises to predict their mastery of the concepts corresponding to those exercises. Specifically, Knowledge Tracing addresses the problem of predicting whether students will be able to correctly answer the next exercise relevant to a concept, given their previous learning interactions. In recent years, KT tasks have received significant attention in academia, and numerous methods have been proposed to deal with this problem. Conventional approaches in this domain are mainly divided into the Bayesian Knowledge Tracing model using Hidden Markov Models Corbett and Anderson (1994) and Deep Knowledge Tracing using Deep Neural Networks Piech et al. (2015) and its derivative methods Graves et al. (2014); Zhang et al. (2017); Pandey and Karypis (2019); Su et al. (2018).

Existing KT methods d Baker et al. (2008); Zhang et al. (2017); Piech et al. (2015) generally build predictive models that target the concept to which an exercise belongs rather than distinguishing between the exercises themselves. Such an approach assumes that a student’s ability to solve the relevant exercises correctly reflects, to a certain extent, that student’s mastery of the concept. Concept-based prediction is therefore a viable option; however, it simplifies the task in order to accommodate the limited performance of the model. Generally, a KT task comprises multiple concepts and a large number of exercises. In most situations a concept is associated with many exercises, and in a proportion of situations an exercise corresponds to multiple concepts. Traditional models can only deal with the former; for the latter, they often have to resort to splitting such cross-concept exercises into multiple single-concept exercises. This enhances the feasibility of these models but interferes with the accuracy of the overall task.

Although these concept-based KT methods have been somewhat successful, the characteristics of the exercises themselves are often overlooked. This can reduce the ultimate predictive accuracy of the model and leave it unable to make predictions for specific exercises. Even if two exercises share the same concept, a difference in their difficulty may lead to a large difference in the probability of them being answered correctly. Therefore, some previous works Minn et al. (2019); Dos Santos et al. (2016); Yang et al. (2020); Abdelrahman and Wang (2019); Xue et al. (2021); Song et al. (2020) have attempted to use exercise features as a supplement to concept input, achieving success to some extent. However, due to the relatively large difference between the number of exercises and the number of exercises students actually interact with, each student may only interact with a very small fraction of the exercises, leading to data sparsity. Furthermore, for exercises that span concepts, simply adding features to the exercises loses potential inter-exercise and inter-concept information. Therefore, the use of higher-order information such as “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) is necessary to address these issues.

The idea of redefining the Knowledge Tracing problem in terms of graphs has recently gained significant momentum due to the widespread deployment of GNNs  Nakagawa et al. (2019); Yang et al. (2020); Liu et al. (2020); Li et al. (2021); Yin et al. (2021) and breakthroughs in addressing the unpredictability of traditional approaches to cross-concept exercises. Traditional KT usually takes sequential data as input in the form of concepts corresponding to the input exercises and their responses. This leads to a lack of information between exercises, and only the relationship between exercises and concepts is available. Recent research in graph theory has opened up the possibility of breaking this bottleneck. Unlike sequential data, graph data can capture the higher order information of “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) very well due to the multivariate node and edge structure of the graph itself. As a result, some research Nakagawa et al. (2019); Song et al. (2021) has turned to redefining this task in terms of graphs. However, these efforts face several problems: 1) too much attention to the details of the nodes rather than high-level semantic information; 2) difficulty in effectively establishing spatial associations and complex structures of the nodes; and 3) representing only concepts or exercises without integrating them.

Due to the difficulty and inaccuracy of data labelling, self-supervised learning has become increasingly popular, with great success in many areas such as computer vision Bachman et al. (2019); He et al. (2020); Tian et al. (2020) and natural language processing Collobert and Weston (2008); Mnih and Kavukcuoglu (2013). The strength of self-supervised learning lies in dealing with low-quality or missing labels: instead of relying on the annotated labels that supervised learning requires, it derives supervisory signals from the input data itself. This can be as powerful as supervised models trained on specifically labelled information, and it removes the tedious, task-specific labelling work that is the biggest bottleneck in supervised learning. Especially for large amounts of network data, obtaining high-quality labels at scale is often very expensive and time-consuming. Self-supervised learning has been shown to excel on textual and image datasets, but is still in its infancy for graph problems such as retrieval, recommendation, graph mining, and social network analysis.

In this paper, we address the problems encountered by traditional GNN-based KT models and propose a self-supervised learning framework, the Bi-Graph Contrastive Learning based Knowledge Tracing (Bi-CLKT) model. Our model employs contrastive learning with local-bilayer and global-bilayer structures, which apply node-level and graph-level GCNs to extract “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relational information, respectively. Finally, students’ performance is predicted by a prediction layer based on a deep neural network.

Our approach has been validated on four benchmark datasets. In terms of prediction performance, our model and its variants outperformed traditional deep learning-based methods, showing great potential for improving knowledge-tracing prediction accuracy. Furthermore, through a series of ablation studies, we have analysed each module separately, making the model more interpretable. Our proposed model also has some technical innovations and improvements. Because this task requires separate representation learning for two related but independent entities, our two-layer contrastive learning structure fits it well. Our specific contributions are described as follows:

  • To the best of our knowledge, we present the first self-supervised learning based knowledge-tracing framework. Through contrastive self-supervised learning, we solve a number of problems encountered by traditional GNN-based knowledge-tracing models leading to significant improvement in the accuracy of the final prediction results.

  • For Knowledge Tracing, we design a two-layer contrastive learning framework, which learns “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relational information at the local and global levels respectively. The resulting representations are effectively combined by a joint contrastive loss function. Such a structure allows an exercise embedding to carry both exercise and concept structural information, which has a positive effect on the final prediction task.

  • We perform thorough experiments on four real-world open datasets, and the results show that our proposed framework and its variants all significantly improve predictive performance compared to the individual baseline models. We also perform ablation studies to analyse the validity of each individual module, which greatly enhances model interpretability.

2 Related Work

2.1 Knowledge Tracing

There are two main approaches to using machine learning for Knowledge Tracing. The first one is the traditional machine learning KT approach represented by Bayesian Knowledge Tracing (BKT) Corbett and Anderson (1994). BKT primarily applies the Hidden Markov Model, which uses Bayes’ rule to update the state of each concept, treated as a binary variable. Several works have extended the basic BKT model and introduced additional variables such as slip and guess probabilities d Baker et al. (2008), concept difficulty Pardos and Heffernan (2011) and student personalisation Pardos and Heffernan (2010); Yudelson et al. (2013); Song et al. (2020). Traditional machine learning KT models also include factor analysis models such as item response theory (IRT) Ebbinghaus (2013) and performance factor analysis (PFA) Pavlik Jr et al. (2009); Li et al. (2021), which tend to focus on learning general parameters from historical data to make predictions.

With the development of Deep Neural Networks, Deep Knowledge Tracing methods have advanced and proved more effective at learning valid representations from large amounts of data for more accurate predictions. For example, Deep Knowledge Tracing (DKT) Piech et al. (2015) uses recurrent neural networks (RNNs) to track students’ knowledge states and became the first deep KT method to achieve excellent results. Another example is the Dynamic Key-Value Memory Network (DKVMN) Zhang et al. (2017), which builds a static and a dynamic matrix to store all concepts and to update students’ learning states respectively. Xu et al. (2018) propose a pioneering deep matrix factorization method for conceptual representation learning from multi-view data. However, these classical models consider the most basic concept features only, and the absence of exercise features leads to unreliable final predictions.

Some deep KT methods have since been proposed that do take into account the features of the exercises in their predictions. For example, the Exercise-Enhanced Recurrent Neural Network with attention mechanism (EERNNA) Su et al. (2018) uses the text of the exercise so that the embedding itself contains features of the exercise; in practice, however, such textual information is difficult to collect, and it introduces too much interference into the embedding. Dynamic Student Classification on Memory Networks (DSCMN) Minn et al. (2019) models problem difficulty to help distinguish between different problems under the same concept. DHKT, on the other hand, augments DKT by using relationships between problems and skills to obtain a representation of the exercises. However, it does not capture the relationships between exercises and concepts due to data sparsity. Because of the long-term dependencies in exercise sequences, the Sequential Key-Value Memory Network (SKVMN) Abdelrahman and Wang (2019) modifies the LSTM to better capture such dependencies, with good results. Our approach differs from these methods in that we build the graph of exercise-influence relations from the original “exercise-to-concept” (E2C) relations under certain assumptions and use node-level and graph-level GCNs, respectively, to extract the “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relational information. In addition, to reduce the interference of too much detailed information, we use a contrastive learning model to learn the representations of concepts and exercises separately.

2.2 Self-supervised Learning

Research on self-supervised learning can be broadly divided into two branches: generative models and contrastive models. The main representative of the generative branch is the widely used autoencoder. On graph data, the main approach is to learn embeddings of the graph’s nodes in a latent space through a GNN, and then reconstruct the structure and properties of the original graph from the learned representations. The node representations are adjusted step by step by reducing the loss between the generated graph and the original graph, so that they encode the structural and attribute features of the original graph. Contrastive learning, on the other hand, uses augmentation methods to structurally perturb the input data, deriving the prediction targets and corresponding labels from the data’s own structure before learning the representations, and finally optimises a contrastive loss that minimises the distance between positive pairs and maximises the distance between negative pairs to achieve a structural grasp of the complete graph. Pioneering methods for learning graph representations with GCNs include Hu et al. Kingma and Ba (2014) and Kaveh et al. Li et al. (2017).

2.3 Contrastive Learning on Graphs

Contrastive learning is a type of self-supervised learning where the target label to be learned is generated from source data itself. It brings similar representations closer together and dissimilar ones further apart by comparison. For graph data, traditional learning methods Becker and Hinton (1992); Wu et al. (2018); Ye et al. (2019) often overemphasise detailed information at the expense of structural information. On the other hand, contrastive learning compensates for this by nicely finding a balance between local and global representation learning. Although contrastive learning on graph data is still in its infancy, it has been demonstrated by numerous models to be powerful in its control of graph structural information.

3 Preliminary and Problem Statement

In this section, we define the task of Knowledge Tracing in our setting. We first formally define the student performance prediction problem, which represents the level of student mastery of each concept by the accuracy with which students interact with the exercises under each concept. Next, we present the important definitions used in our study.

3.1 Problem Definition

In the Knowledge Tracing task, we record a particular student’s practice process as $R_u = \{(e_1, r_1), (e_2, r_2), \dots, (e_t, r_t)\}$, where $u$ represents the student, $e$ represents a specific exercise, $e_t$ represents the exercise that student $u$ does at exercise step $t$, and $r_t$ represents the correctness of the corresponding response. In general, $r_t$ equals 1 if the student answered the exercise correctly; otherwise $r_t$ equals 0. To trace the mastery of a specific concept $c$, we normally observe a sequence of a student’s interactions $(e_1, r_1), \dots, (e_t, r_t)$ and predict the result $r_{t+1}$ of the next exercise $e_{t+1}$. Finally, the probability of a student getting any next exercise correct for a specific concept is taken as the student’s mastery of that concept. The specific definitions are as follows:

Problem 1 (Performance Prediction in KT).

Given a sequence of interaction observations $X = \{x_1, x_2, \dots, x_t\}$ taken by a student $u$ on a specific concept $c$, where $x_t = (e_t, r_t)$ for the exercise $e_t$ being answered at time step $t$, with $r_t \in \{0, 1\}$ indicating whether or not the exercise was answered correctly, the objective of the Knowledge Tracing task is to predict the next interaction $x_{t+1}$.

To solve this problem, we mine the “exercise-to-exercise” (E2E) graph structure from the original “exercise-to-concept” sequence data. The extracted second-order E2E relationships are used to construct an exercise influence graph for each concept. We then obtain separate pre-representations of the exercises and concepts by applying node-level and graph-level GCNs on these exercise influence graphs. Finally, we fuse them into a contrastive learning-based Knowledge Tracing graph to train these representations for the Knowledge Tracing task.

The exercise-level influence graph is derived from students’ transitions between exercises. It assumes that if the majority of students answer two different exercises correctly together, the two exercises have a high degree of similarity or correlation and are therefore connected by a relatively high weight, and vice versa. The creation of the exercise-level influence graph not only ameliorates the existing models’ shortcoming of not being able to distinguish between exercises under the same concept, but also provides rich information on the structure of the “exercise-to-exercise” graph.

Figure 1: An example of creating an exercise influence subgraph from an exercise-to-concept relationship graph.
Definition 1.

(Exercise-level influence graph) Given an exercise set $E$ and an “exercise-to-exercise” interaction set $I$, an exercise influence sub-graph $G = (E, I)$ is a graph that takes the exercises in $E$ as the vertices and the interactions in $I$ as the edges. The weighted influence of the edge between co-occurring exercises $e_i$ and $e_j$ is measured by

$$w_{ij} = \frac{C(e_i, e_j)}{O(e_i, e_j)},$$

where $w_{ij}$ is the co-correctness rate of exercises $e_i$ and $e_j$ among all the answered co-occurring exercises involving $e_i$. $C(\cdot, \cdot)$ and $O(\cdot, \cdot)$ denote the counts of co-correctness and co-occurrence respectively. Edge ($e_i$, $e_j$) only exists when $w_{ij}$ is larger than a certain threshold $\theta$.

Example 1.

As shown in Figure 1, we constructed an example of an exercise influence subgraph by assuming a high degree of similarity in the questions that a student can answer correctly at the same time, in which the properties of different exercises are reflected from node to node. In turn, the properties of different concepts are reflected between different subgraphs. They represent “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relational information, respectively.
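As a rough illustration of Definition 1, the sketch below builds an influence graph from per-student answer logs. It assumes that two exercises "co-occur" when the same student has answered both, and that the edge weight is the co-correctness count divided by the co-occurrence count; the function name `build_influence_graph`, the log format and the default threshold are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_influence_graph(student_logs, threshold=0.5):
    """Return {(e_i, e_j): weight} for exercise pairs above `threshold`.

    student_logs: list of dicts mapping exercise id -> 1 (correct) or 0.
    Assumption: a pair co-occurs when one student answered both exercises.
    """
    co_correct = defaultdict(int)  # pairs answered correctly together
    co_occur = defaultdict(int)    # pairs answered by the same student
    for log in student_logs:
        for e_i, e_j in combinations(sorted(log), 2):
            co_occur[(e_i, e_j)] += 1
            if log[e_i] == 1 and log[e_j] == 1:
                co_correct[(e_i, e_j)] += 1

    edges = {}
    for pair, n_occur in co_occur.items():
        weight = co_correct[pair] / n_occur
        if weight > threshold:  # keep the edge only above the threshold
            edges[pair] = weight
    return edges


logs = [
    {"e1": 1, "e2": 1, "e3": 0},
    {"e1": 1, "e2": 1},
    {"e1": 0, "e2": 1, "e3": 1},
]
graph = build_influence_graph(logs, threshold=0.5)
```

In this toy log only the pair (`e1`, `e2`) survives, with weight 2/3; the pair (`e2`, `e3`) has weight exactly 0.5 and is dropped because the threshold comparison is strict.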

3.2 Representative Solutions Study

The input vector sequence $x_1, \dots, x_T$ of conventional Deep Knowledge Tracing is mapped to the output vector sequence $y_1, \dots, y_T$ by computing a sequence of ‘hidden’ states $h_1, \dots, h_T$. This can be seen as a continuous encoding of information about historical learning performance to make predictions about the future, and DKT makes the connection between input and output via a simple recurrent neural network. The input to the dynamic network is a representation of the student’s historical behaviour, while the prediction is a vector representing the probability of answering each exercise correctly. More details are defined by the equations:

$$h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h),$$

$$y_t = \sigma(W_{yh} h_t + b_y),$$

where the sigmoid function $\sigma(\cdot)$ and the $\tanh$ function are used as activation functions, $W_{hx}$ denotes the input weight matrix, $h_0$ denotes the initial state, $W_{yh}$ denotes the readout weight matrix and $W_{hh}$ denotes the recurrent weight matrix. The biases of the latent and readout units are given by $b_h$ and $b_y$.
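The recurrence described above can be sketched in a few lines of plain Python. This is only an illustration of a simple RNN with a tanh hidden update and a sigmoid readout; the one-hot input of length 2M (exercise index combined with correctness), the random weights standing in for trained parameters, and all function names are assumptions made for the example.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dkt_forward(xs, W_hx, W_hh, W_yh, b_h, b_y, h0):
    """Return one probability vector per time step of the input sequence."""
    h, outputs = h0, []
    for x in xs:
        pre = [a + b + c for a, b, c in zip(matvec(W_hx, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(p) for p in pre]                           # hidden update
        logits = [a + b for a, b in zip(matvec(W_yh, h), b_y)]
        outputs.append([1.0 / (1.0 + math.exp(-z)) for z in logits])  # sigmoid readout
    return outputs


def one_hot_interaction(exercise, correct, num_exercises):
    """Assumed encoding: index = exercise + correct * num_exercises."""
    x = [0.0] * (2 * num_exercises)
    x[exercise + correct * num_exercises] = 1.0
    return x


rng = random.Random(0)
M, H = 3, 4  # number of exercises, hidden size (illustrative values)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


W_hx, W_hh, W_yh = rand_matrix(H, 2 * M), rand_matrix(H, H), rand_matrix(M, H)
xs = [one_hot_interaction(0, 1, M), one_hot_interaction(2, 0, M)]
probs = dkt_forward(xs, W_hx, W_hh, W_yh, [0.0] * H, [0.0] * M, [0.0] * H)
```

Each entry of `probs[t]` is the predicted probability of answering the corresponding exercise correctly after step `t`.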

Another classical approach is DKVMN, which takes a discrete exercise label $q_t$ and outputs the probability of a correct response $p(r_t \mid q_t)$. The memory is then updated with the exercise and response tuple $(q_t, r_t)$. Here, $q_t$ comes from a set with $Q$ distinct exercise labels and $r_t$ is the binary value of whether the student got it right or not. DKVMN assumes that the exercises are based on a set of $N$ latent concepts $\{c^1, c^2, \dots, c^N\}$. The key matrix $M^k$ (size $N \times d_k$) is used to store these concepts. Concept states, i.e. students’ mastery of each concept, are stored in the time-varying value matrix $M^v_t$ (size $N \times d_v$). Ultimately, DKVMN tracks student knowledge by reading and writing to the value matrix using the relevant weights computed from the input exercises and the key matrix.
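To make the read step of such a key-value memory concrete, here is a minimal sketch: attention weights come from a softmax over the inner products between an exercise embedding and the key rows, and the read vector is the weighted sum of the value rows. The write (update) step is omitted, and the names `dkvmn_read`, `k_t`, `key_matrix` and `value_matrix` are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dkvmn_read(k_t, key_matrix, value_matrix):
    """Return (attention weights over concepts, read vector).

    k_t:          embedding of the current exercise (length d_k)
    key_matrix:   N rows of length d_k, one per latent concept
    value_matrix: N rows of length d_v, the current concept states
    """
    scores = [sum(a * b for a, b in zip(k_t, key)) for key in key_matrix]
    weights = softmax(scores)
    d_v = len(value_matrix[0])
    read = [
        sum(weights[i] * value_matrix[i][j] for i in range(len(weights)))
        for j in range(d_v)
    ]
    return weights, read


key_matrix = [[1.0, 0.0], [0.0, 1.0]]              # two latent concepts
value_matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # their current states
weights, read = dkvmn_read([2.0, 0.0], key_matrix, value_matrix)
```

Here the exercise embedding is aligned with the first key, so the first concept receives most of the attention and dominates the read vector.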

In the KT process, existing models often do not link the different concepts well, which leads to the inability of these models to make correct or complete predictions when students encounter exercises on concepts that have not been covered before or when an exercise involves multiple concepts. We, therefore, fill this gap by using the construction of exercise influence subgraphs, with nodes connecting different exercises to each other and subgraphs connecting different concepts to each other.

By building an exercise influence subgraph, we can transform the original sequence data into graph structured data containing “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relational information. We then learn these data into the corresponding embedded representations by using node-level and graph-level GCNs respectively, while learning the best architecture by contrast to maximise the differentiation of these representations so that the final representations contain both information about the exercises and concepts, and their differentiation from other concepts and exercises.

4 The Bi-Graph Contrastive Knowledge Tracing

Inspired by recent developments of contrastive learning in visual representation, a growing body of research has shown that contrastive learning frameworks perform well on graph-structured data as well. Therefore, after various comparisons and studies, we propose Bi-graph Contrastive Knowledge Tracing representation learning (Bi-CLKT) based on graph-level (global) and node-level (local) GCNs. In this section, we describe Bi-CLKT in detail, first by briefly discussing a traditional contrastive learning framework, and then by presenting our proposed Bi-graph contrastive learning framework. Finally, we provide the theoretical rationale behind our approach.

4.1 The Graph CL Paradigm

Figure 2: An overview of Bi-Graph Contrastive Knowledge Tracing framework.

As shown in Figure 2, our proposed Bi-CLKT framework extends the common graph CL paradigm. Common graph CL paradigms typically employ either global or local features and seek to maximise the consistency of graph-level or node-level representations across different views. Specifically, graph CL paradigms first generate two graph views by performing random graph augmentations on the original graph, such as eliminating or adding edges, eliminating nodes, masking attributes, etc. For global graph CL, these two views are treated as a positive pair, while the views of other graphs are treated as negative pairs. For local graph CL, positive and negative pairs are formed between the nodes of these two views. We then employ a contrastive loss that forces the embeddings of each positive pair to be consistent with each other, while pushing all negative pairs apart. Specifically, the graph CL paradigm consists of four main components:

  • Graph data augmentation. The given graph is augmented by eliminating or adding edges, eliminating nodes, masking attributes, etc. to obtain two related views (i.e. augmented graphs), which are treated as a positive pair in global graph CL. In local graph CL, by contrast, both positive and negative pairs are formed from nodes across the two views.

  • GCN-based encoder. Node-level and graph-level GCN-based encoders extract representation vectors from the augmented graphs.

  • Projection head. A non-linear transformation $g(\cdot)$ named projection head maps the augmented representations to another latent space where the contrastive loss is calculated. In graph contrastive learning, $g(\cdot)$ is obtained by a multi-layer perceptron (MLP).

  • Joint contrastive loss function. The joint contrastive loss function is defined to maximise the agreement within positive pairs relative to negative pairs in the two subgraphs respectively. The normalized temperature-scaled cross-entropy loss is used to calculate the loss of the two contrastive learning modules.

4.2 The Bi-CLKT Framework

In general, common graph CL approaches choose either global graph CL at the graph level or local graph CL at the node level, and learn representations by maximising the consistency between views from different perspectives. These approaches therefore capture only one kind of feature, global or local. In the Bi-CLKT model, due to the specificity of the KT task, we need to acquire both “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relational information. These two relational features are local and global in nature, respectively. Therefore, we design a Bi-graph contrastive learning framework with both local and global features.

To match this two-layer framework, we use node-level GCN to learn “exercise-to-exercise” (E2E) embedding and graph-level GCN to learn “concept-to-concept” (C2C) embedding, respectively. In the process of graph data augmentation, we design separate graph augmentation processes for each of these two layers of the contrastive learning framework. We mainly augment the input graph by randomly removing edges and masking node features in the graph. In addition, inspired by Zhu et al. (2021), we introduce differences in the importance of different nodes and edges by calculating the centrality of different nodes through methods such as random walk and PageRank, and those nodes and edges with lower importance are prioritized for elimination during the graph augmentation process.

4.2.1 Graph Data Augmentation

For the exercise-level augmentation, we took two different approaches to augmenting the input graph, i.e. randomly picking edges or nodes to be removed. Formally, we form modified subsets $\tilde{\mathcal{E}}$ and $\tilde{\mathcal{V}}$ by randomly selecting some edges and nodes from the original $\mathcal{E}$ and $\mathcal{V}$ with probabilities

$$P\{e \in \tilde{\mathcal{E}}\} = 1 - p^{e}, \qquad P\{v \in \tilde{\mathcal{V}}\} = 1 - p^{v},$$

where $p^{e}$ and $p^{v}$ are the probabilities of eliminating edge $e$ and node $v$, and $\tilde{\mathcal{E}}$ and $\tilde{\mathcal{V}}$ are the set of edges and the set of nodes after graph augmentation. The probabilities $p^{e}$ and $p^{v}$ reflect the importance of edge $e$ and node $v$, respectively. In this way, unimportant edges or nodes are eliminated preferentially, while the important structure of the graph is not compromised.
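The idea of removing unimportant edges preferentially can be sketched as follows. The exact mapping from importance to removal probability is not recoverable from the text, so this example assumes one plausible normalisation: the removal probability grows linearly as an edge's importance score falls below the maximum, scaled by a base rate `p_base` and truncated at `p_cap`. Both hyperparameters, the importance scores and the function names are hypothetical.

```python
import random


def drop_probabilities(importance, p_base=0.3, p_cap=0.7):
    """Map importance scores to removal probabilities (assumed normalisation)."""
    s_max = max(importance.values())
    s_mean = sum(importance.values()) / len(importance)
    span = s_max - s_mean
    probs = {}
    for item, score in importance.items():
        if span == 0:  # all scores equal: fall back to a uniform rate
            probs[item] = min(p_base, p_cap)
        else:
            probs[item] = min((s_max - score) / span * p_base, p_cap)
    return probs


def augment_edges(importance, rng, p_base=0.3, p_cap=0.7):
    """Sample one augmented edge set; each edge is kept with probability 1 - p."""
    probs = drop_probabilities(importance, p_base, p_cap)
    return [edge for edge in importance if rng.random() >= probs[edge]]


importance = {("e1", "e2"): 0.9, ("e2", "e3"): 0.6, ("e1", "e3"): 0.3}
probs = drop_probabilities(importance)
view_a = augment_edges(importance, random.Random(0))
view_b = augment_edges(importance, random.Random(1))
```

Under this assumed rule the most important edge has removal probability zero and therefore appears in every sampled view, while less important edges are dropped more often. The same function could be applied to node importance scores.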

In concept-level graph augmentation, similar to the augmentation at the exercise level, we need to compute probabilities $p^{e}$ and $p^{v}$ that reflect the importance of the corresponding edges $e$ and nodes $v$. The difference is that in concept-level graph augmentation, to better fit the graph-level GCN, we use the PageRank algorithm to combine $p^{e}$ and $p^{v}$ into a composite probability for the concept level. The specific calculation is defined below


where we compute the maximum and average values of and denote them by and respectively, and is the probability of combining and .

Finally, we generate two corrupted graph views by augmentation for the exercise and conceptual levels, respectively. In Bi-CLKT, the probabilities of generating the two views are different, and to better target the description of these two probabilities we denote them by and respectively.

4.2.2 Node-Level and Graph-Level Encoder

Using the node features of the last encoder layer is the most direct way to obtain a node-level representation $h_v$ of node $v$, i.e. $h_v = h_v^{(L)}$ for an $L$-layer encoder. Skip connections or jumping knowledge are also commonly used to generate node-level representations. However, concatenating the node features of all layers produces a node-level representation with a different dimensionality from the node features. To avoid this problem, we perform a linear transformation of the node features of all layers before joining them together.


where the weight matrix that reduces the size of the dimension is .


The key operation for computing the graph-level representation of a graph is the READOUT function, of which summation and averaging are the most commonly used. For reasons of node permutation invariance, we use a summation over all node representations, i.e.

$$h_G = \sigma\Big(\sum_{i=1}^{N} h_i\Big),$$

where $\sigma$ is the sigmoid function and $N$ denotes the total number of nodes in a given graph.

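Due to the nature of Bi-CLKT’s two-layer structure, we have used node-level and graph-level GCNs as encoders for the exercises and concepts respectively. The specific structural form is as follows

$$H^{(l+1)} = \sigma\big(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)}\big),$$

where $\hat{A} = A + I$ is the adjacency matrix with self-loops, $\hat{D}$ is the corresponding degree matrix, $H^{(l)}$ and $W^{(l)}$ are the node features and trainable weights of layer $l$, and $\sigma(\cdot)$ is the non-linear activation function.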
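A single GCN propagation step followed by a sum readout can be written in plain Python as below. The sketch assumes the standard symmetric normalisation with self-loops, ReLU as the (unspecified) activation, and a sigmoid applied to the summed node features for the graph-level vector; the toy graph, feature matrix and weights are made up for illustration.

```python
import math


def matmul(A, B):
    """Dense matrix product of two list-of-rows matrices."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def normalized_adjacency(A):
    """Compute D^-1/2 (A + I) D^-1/2 for an adjacency matrix A."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(A_norm, H, W):
    """One propagation step; ReLU is an assumed choice of activation."""
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, z) for z in row] for row in Z]


def sum_readout(H):
    """Graph-level vector: sigmoid of the feature-wise sum over all nodes."""
    return [1.0 / (1.0 + math.exp(-sum(col))) for col in zip(*H)]


A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # path graph on 3 nodes
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                 # toy node features
W = [[1.0, -1.0], [0.5, 0.5]]                            # toy layer weights
A_norm = normalized_adjacency(A)
node_repr = gcn_layer(A_norm, X, W)    # node-level (exercise) representations
graph_repr = sum_readout(node_repr)    # graph-level (concept) representation
```

Because the readout sums over nodes, reordering the rows of `node_repr` leaves `graph_repr` unchanged, which is the permutation invariance mentioned in the text.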

4.2.3 Projection Head

Furthermore, in Bi-CLKT we map the augmented representations to a uniform latent space via a non-linear transformation called the projection head $g(\cdot)$, where the contrastive loss is computed. In graph contrastive learning, a multilayer perceptron (MLP) is applied to obtain these mappings, i.e. $z = g(h) = \mathrm{MLP}(h)$.
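A projection head of this kind is typically a small MLP. The sketch below assumes two linear layers with a ReLU in between; the depth, the activation and the toy weights are assumptions for illustration rather than the configuration used in the paper.

```python
def linear(W, b, x):
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]


def projection_head(h, W1, b1, W2, b2):
    """Two-layer MLP g(h): linear, ReLU, linear (assumed architecture)."""
    hidden = [max(0.0, v) for v in linear(W1, b1, h)]
    return linear(W2, b2, hidden)


# Toy weights, chosen by hand so the result is easy to check.
W1, b1 = [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]], [0.5]
z = projection_head([2.0, 3.0], W1, b1, W2, b2)
```

With these weights the hidden layer is `[2.0, 0.0]` and the projected vector is `[2.5]`; the contrastive loss would be computed on such projected vectors rather than on the encoder outputs.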


In terms of the projection head, from the perspective of mutual information (MI), a good view generation should minimise the MI between the two views, $I(v_1; v_2)$, provided that $I(v_1; y) = I(v_2; y) = I(x; y)$, where $x$ is the input and $y$ the downstream label. Intuitively, contrastive learning remains effective as long as the generated views preserve the information that determines the prediction of the downstream task. Under this restriction, greater divergence between views leads to better learning results. From a manifold perspective, we adopt the expansion assumption and find that data augmentation can induce continuity in the neighbourhood of each instance.

4.3 Joint Contrastive Loss Function

To better learn the representations of exercises and concepts, we use a joint contrastive loss function. This loss maximises the agreement between positive pairs of the previously learned mappings relative to the negative pairs. After extensive comparisons, we found the normalised temperature-scaled cross-entropy loss (NT-Xent) to be the most appropriate. When training the node-level GCN, we randomly draw a mini-batch of $N$ nodes from the exercise influence graph under the same concept and learn them by local contrastive learning: for each node, all of its 1-hop neighbours in the same view, together with all of its 1-hop neighbours in the other view, serve as negatives, and the only positive pair is formed by the corresponding node in the two views. When training the graph-level GCN, we randomly draw a mini-batch of $N$ exercise influence graphs, so that $2N$ augmented graphs are generated; the two augmentations of the same graph form a positive pair, while all other graphs serve as negatives. The cosine similarity function is denoted as

$$\mathrm{sim}(z_{n,i}, z_{n,j}) = \frac{z_{n,i}^{\top} z_{n,j}}{\lVert z_{n,i} \rVert \, \lVert z_{n,j} \rVert},$$

where $z_{n,i}$ and $z_{n,j}$ are the projected representations of the two views of the $n$-th graph.

As for the objective function, the prevailing practice is to use a standard binary cross-entropy loss between positive and negative examples, i.e. a noise-contrastive type objective. However, we found that distinguishing positive and negative examples absolutely is detrimental to representation learning, mainly because these contextual subgraphs are extracted from the same original graph and overlap each other. Therefore, we use NT-Xent for model optimisation, so that positive and negative samples are differentiated to some extent, resulting in a high-quality representation. The NT-Xent of the $n$-th graph is defined as

$$\ell_n = -\log \frac{\exp\left(\mathrm{sim}(z_{n,i}, z_{n,j}) / \tau\right)}{\sum_{n'=1,\, n' \neq n}^{N} \exp\left(\mathrm{sim}(z_{n,i}, z_{n',j}) / \tau\right)},$$

where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity between the projected views and $\tau$ denotes the temperature parameter. The exercise and concept losses are each computed over all positive pairs.
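The following sketch shows how an NT-Xent style loss can be computed for a mini-batch of paired views. It assumes the variant in which the denominator sums over the other items in the batch (excluding the positive pair), and it averages the per-item losses; the function names, the plain-list vectors and the default temperature of 0.5 are illustrative assumptions, not values from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_i, view_j, tau=0.5):
    """Mean NT-Xent loss over a batch of paired views.

    view_i[n] and view_j[n] are the two augmented views of item n.
    Assumption: negatives are the second views of all other items,
    so the batch must contain at least two items.
    """
    n_items = len(view_i)
    losses = []
    for a in range(n_items):
        pos = math.exp(cosine_similarity(view_i[a], view_j[a]) / tau)
        neg = sum(
            math.exp(cosine_similarity(view_i[a], view_j[b]) / tau)
            for b in range(n_items)
            if b != a
        )
        losses.append(-math.log(pos / neg))
    return sum(losses) / n_items


if __name__ == "__main__":
    first = [[1.0, 0.0], [0.0, 1.0]]
    matching = [[1.0, 0.0], [0.0, 1.0]]   # views agree with their partners
    swapped = [[0.0, 1.0], [1.0, 0.0]]    # views agree with the wrong item
    print(nt_xent(first, matching), nt_xent(first, swapped))
```

A lower temperature sharpens the contrast between the positive pair and the negatives, which is why it is usually tuned per dataset.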

5 Experimental Settings and Results

We perform extensive experiments on four real-world datasets to evaluate the performance of our Bi-CLKT model and compare it with several state-of-the-art machine learning and deep learning Knowledge Tracing models. We also conduct in-depth ablation experiments, which validate the role of each module in Bi-CLKT and enhance the interpretability of the model.

To implement the baseline and Bi-CLKT models, we used PyTorch and its Geometric Deep Learning extension library. Experiments were conducted on four NVIDIA TITAN V GPUs. Bi-CLKT learns node representations in a self-supervised contrastive fashion, and these representations are then evaluated on node-level and graph-level classification by training and testing a simple linear (logistic regression) classifier directly on them. In pre-processing, we perform row normalisation and apply processing strategies. We normalise the learned embeddings before feeding them into the logistic regression classifier. In training, we use the Adam optimiser with an initial learning rate of 0.001, and the subgraph size does not exceed 20. The dimensionality of the node representation is 1024, and the margin value of the loss function is 0.75.
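The two normalisation steps mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: "row normalisation" is taken to mean scaling each feature row to sum to one, and embedding normalisation is taken to mean scaling each vector to unit L2 norm. Neither choice is confirmed by the text, and the function names are hypothetical.

```python
import math


def row_normalise(matrix):
    """Scale each row so that its entries sum to 1 (all-zero rows are kept)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([x / total for x in row] if total != 0 else list(row))
    return result


def l2_normalise(vector):
    """Scale a vector to unit Euclidean length (zero vectors are kept)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


if __name__ == "__main__":
    features = [[1.0, 3.0], [0.0, 0.0]]
    print(row_normalise(features))
    print(l2_normalise([3.0, 4.0]))
```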

5.1 Datasets

To evaluate our model, the experiments are conducted on the following four widely-used datasets in KT and the detailed statistics are shown in Table 1.

  • ASSISTment 2009 is provided by the online tutoring website ASSISTment and is widely used to validate KT models. This dataset comes with accurate labels and clearly defined exercises and concepts. We have not modified this dataset much except for filtering out corrupt samples.

  • ASSISTment 2015 is likewise from the online tutoring site ASSISTment. It refines the dataset collected in 2009 by reducing the number of concepts to exactly 100 and includes a larger number of students, but with a slightly smaller average number of interaction records per student.

  • ASSISTment Challenge (ASSISTment Chall) was collected for a data mining competition run by ASSISTments in 2017. It has a relatively rich average number of records per student and, because it was used for a competition, it is the most complete and regular of the three ASSISTment datasets.

  • STATICS 2011 differs from the previous three datasets in that it is course-specific, i.e. the data are highly related. It contains 189,297 interactions of 333 students on 1,223 concepts, making it the most intensive of the four datasets.

Table 1 presents the statistics of the datasets, namely the number of students, concepts and interactions in each.

| Dataset | # Students | # Concepts | # Interactions |
|---|---|---|---|
| ASSISTment 2009 | 4,151 | 110 | 325,637 |
| ASSISTment 2015 | 19,917 | 100 | 708,631 |
| ASSISTment Chall | 686 | 102 | 942,816 |
| STATICS 2011 | 333 | 1,223 | 189,297 |

Table 1: Statistics for the datasets.

| Model | ASSISTment 2009 AUC | ASSISTment 2009 ACC | ASSISTment 2015 AUC | ASSISTment 2015 ACC | ASSISTment Chall AUC | ASSISTment Chall ACC | STATICS 2011 AUC | STATICS 2011 ACC |
|---|---|---|---|---|---|---|---|---|
| BKT Corbett and Anderson (1994) | 0.648 | 0.594 | 0.616 | 0.592 | 0.562 | 0.555 | 0.719 | 0.698 |
| DKT Piech et al. (2015) | 0.74 | 0.708 | 0.701 | 0.68 | 0.691 | 0.712 | 0.815 | 0.723 |
| DKVMN Zhang et al. (2017) | 0.739 | 0.618 | 0.705 | 0.68 | 0.689 | 0.614 | 0.814 | 0.722 |
| SAKT Pandey and Karypis (2019) | 0.735 | 0.679 | 0.721 | 0.647 | 0.701 | 0.657 | 0.803 | 0.797 |
| EKT Liu et al. (2019) | 0.754 | 0.702 | 0.737 | 0.754 | 0.72 | 0.727 | 0.842 | 0.819 |
| SAINT+ Shin et al. (2021) | 0.782 | 0.718 | 0.754 | 0.741 | 0.734 | 0.718 | 0.853 | 0.808 |
| Bi-CLKT | **0.857** | **0.802** | **0.765** | **0.757** | **0.775** | **0.764** | **0.865** | **0.835** |

Table 2: Area Under the curve (AUC) and Accuracy (ACC) on four datasets. The best performing runs per metric per dataset are marked in boldface.

5.2 Evaluation Methods

We compare our proposed Bi-CLKT with the following baseline methods.

  • Bayesian Knowledge Tracing Corbett and Anderson (1994) is a classical machine learning Knowledge Tracing model based on the Hidden Markov Model, which uses Bayesian rules to update the state of each concept, considered to be a binary variable.

  • Deep Knowledge Tracing Piech et al. (2015) uses recurrent neural networks (RNNs) to track students’ knowledge states and became the first deep KT method to achieve excellent results.

  • Dynamic Key-Value Memory Networks Zhang et al. (2017), inspired by memory-augmented neural networks, builds a static key matrix and a dynamic value matrix to store all concepts and to update students’ learning states respectively.

  • SAKT Pandey and Karypis (2019) is the first self-attention-based Knowledge Tracing model. It abandons the traditional approach of using RNNs to model a student’s historical interactions and instead makes predictions by taking into account relevant exercises from the student’s past interactions. SAKT has been shown to be far more efficient than RNN-based KT models.

  • EKT Liu et al. (2019) is an extension to the Exercise-Enhanced Recurrent Neural Network (EERNN) framework. Compared to EERNN, EKT further introduces information about the knowledge concepts present in each exercise.

  • SAINT+ Shin et al. (2021), a Transformer-based Knowledge Tracing model that extends SAINT, is unique in that it introduces exercise information and student response information separately, while also embedding two temporal features, elapsed time and lag time, into the student response embedding.

5.3 Experiment discussion

Table 2 compares the predictive performance of Bi-CLKT and its variants with other mainstream ML and DL baseline methods. The Area Under the Curve (AUC) and Accuracy (ACC) are used as evaluation metrics.
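For readers unfamiliar with the two metrics, the sketch below computes them for binary correctness labels and predicted probabilities. AUC is computed from its pairwise-ranking definition (the probability that a random positive is scored above a random negative, with ties counted as one half), and ACC uses a decision threshold of 0.5; the threshold and the function names are assumptions made for this example.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    correct = sum(
        1 for y, s in zip(labels, scores) if (s >= threshold) == (y == 1)
    )
    return correct / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Requires at least one positive (label 1) and one negative (label 0).
    This O(P * N) form is fine for small examples; large evaluations
    normally use a rank-based implementation instead.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    y_true = [1, 0, 1, 0]
    y_prob = [0.9, 0.4, 0.35, 0.2]
    print(auc(y_true, y_prob), accuracy(y_true, y_prob))
```

AUC is insensitive to the decision threshold, whereas ACC depends on it, which is one reason the two metrics can rank models differently on the same dataset.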

The empirical performance is summarised in Table 2. Bi-CLKT outperforms every baseline KT model on all four datasets and on both metrics. Measured against SAINT+, the strongest baseline in terms of AUC, the AUC gain ranges from 0.011 on ASSISTment 2015 to 0.075 on ASSISTment 2009. These results support the effectiveness of our contrastive learning model for the knowledge tracing task: although the existing baselines, including those based on the latest deep learning methods, already achieve high performance, Bi-CLKT still pushes this bound forward.

5.4 Overall Performance

Table 2 summarises the AUC and ACC results of all baseline methods on the four datasets. Our Bi-CLKT model achieves the best performance on all four datasets, ASSISTment 2009, ASSISTment 2015, ASSISTment Chall and STATICS 2011, which confirms its effectiveness. Among the baseline models, deep learning models consistently perform better than traditional machine learning models such as BKT, which justifies the current research trend towards deep learning methods. DKVMN performs slightly worse than DKT on average, as building states for each concept may lose information about the relationships between concepts. SAKT performs worse than our model, suggesting that exploiting higher-order concept-exercise relationships makes a difference compared with selecting only the most relevant exercises from past interactions. Finally, SAINT+, the best performing baseline, applies the Transformer framework to the Knowledge Tracing task, which reflects how well Transformer-based learning adapts to this task. To further dissect our model, we provide ablation studies on its internal components in the following sections.

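| Augmentation | ASSISTment 2009 AUC | ASSISTment 2009 ACC | ASSISTment 2015 AUC | ASSISTment 2015 ACC | ASSISTment Chall AUC | ASSISTment Chall ACC | STATICS 2011 AUC | STATICS 2011 ACC |
|---|---|---|---|---|---|---|---|---|
| Uniform | 0.864 | 0.786 | 0.748 | 0.749 | 0.77 | 0.753 | 0.852 | 0.815 |
| Degree | 0.869 | 0.795 | **0.765** | **0.757** | 0.773 | **0.764** | 0.858 | 0.821 |
| PageRank | **0.875** | **0.802** | 0.757 | 0.752 | **0.775** | **0.764** | **0.865** | **0.835** |

Table 3: Predictive performance (AUC and ACC) with different augmentation methods. The best performing runs per metric per dataset are marked in boldface.

| Variant | Embedding | ASSISTment 2009 AUC | ASSISTment 2009 ACC | ASSISTment 2015 AUC | ASSISTment 2015 ACC | ASSISTment Chall AUC | ASSISTment Chall ACC | STATICS 2011 AUC | STATICS 2011 ACC |
|---|---|---|---|---|---|---|---|---|---|
| Bi-CLKT-M | C2C | 0.838 | 0.768 | 0.762 | 0.746 | 0.733 | 0.752 | 0.802 | 0.777 |
| Bi-CLKT-M | E2E | 0.83 | 0.762 | 0.748 | 0.748 | 0.744 | 0.747 | 0.857 | 0.788 |
| Bi-CLKT-M | Concate | 0.862 | 0.795 | 0.764 | 0.752 | 0.769 | 0.751 | 0.859 | 0.833 |
| Bi-CLKT-R | C2C | 0.847 | 0.784 | 0.764 | 0.754 | 0.761 | 0.76 | 0.849 | 0.817 |
| Bi-CLKT-R | E2E | 0.859 | 0.795 | **0.765** | 0.755 | 0.761 | 0.761 | 0.864 | 0.828 |
| Bi-CLKT-R | Concate | **0.875** | **0.802** | **0.765** | **0.757** | **0.775** | **0.764** | **0.865** | **0.835** |

Table 4: The Effect of Embedding Propagation Layer. The best performing runs per metric per dataset are marked in boldface.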

5.5 Ablation Studies

To gain insight into the effect of each module in Bi-CLKT, we design several ablation studies. Specifically, we investigate the effectiveness of three important components of our proposed model: (1) augmentation methods; (2) embedding methods; (3) the predictive layer. We set up a total of nine comparative settings and report their performance in Table 3 and Table 4.

5.6 Effects of Augmentation methods

We observed that all three variants of Bi-CLKT with different node centrality measures are competitive with or better than the existing KT baseline models on all datasets. We also note that augmentation with degree and PageRank centrality gives two powerful variants that achieve the best or competitive performance on all datasets. Specifically, the variant with PageRank centrality works best on the ASSISTment 2009, ASSISTment Chall and STATICS 2011 datasets, while only on ASSISTment 2015 does the variant with degree centrality outperform PageRank. This shows that our final model generalises well and is not limited to a specific choice of augmentation method for different datasets.
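The degree and PageRank centrality scores that distinguish these variants can be computed as in the sketch below. It only illustrates the two centrality measures on a small undirected graph; how Bi-CLKT turns the scores into augmentation probabilities is not shown here. The adjacency-dictionary layout, the damping factor of 0.85 and the fixed iteration count are assumptions made for this example.

```python
def degree_centrality(adj):
    """Number of neighbours of each node in an adjacency dictionary."""
    return {node: len(neighbours) for node, neighbours in adj.items()}


def pagerank(adj, damping=0.85, iterations=100):
    """PageRank by power iteration on an adjacency dictionary.

    adj maps each node to the list of nodes it links to. Nodes without
    outgoing links spread their score uniformly over all nodes.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        updated = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = adj[node] if adj[node] else nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                updated[target] += share
        rank = updated
    return rank


if __name__ == "__main__":
    # Star-shaped toy graph: exercise "a" is linked to every other exercise.
    graph = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
    print(degree_centrality(graph))
    print(pagerank(graph))
```

On this toy graph both measures rank the hub node highest; on real exercise graphs they can disagree, which is consistent with different variants winning on different datasets.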

5.7 Effects of Different Embedding Methods

The two implicit relations, “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C), are constructed separately for the subgraphs in our Bi-CLKT model. To verify the role of these two embeddings, we used the E2E and C2C embeddings separately as the attribute of each exercise and compared them with their concatenation. From Table 4, we can see that the results obtained by concatenation are generally better than those obtained by using either embedding alone. Moreover, owing to the difference in the prediction layer mechanism, the C2C embedding is overall better than the E2E embedding in the Bi-CLKT-M variant, especially on the ASSISTment 2009 and ASSISTment 2015 datasets. In contrast, in the RNN-based Bi-CLKT-R variant, the E2E embedding alone gives better results, specifically on the ASSISTment 2009 and STATICS 2011 datasets. In general, both embeddings work well separately, and the best results are obtained when they are combined.

5.8 Effects of Different Predictive Layers

To improve the performance of the model, we used two distinct prediction mechanisms, Bi-CLKT-R and Bi-CLKT-M, which apply a Recurrent Neural Network and a Memory-augmented Neural Network, respectively, in the prediction layer. In particular, on the ASSISTment 2009 and STATICS 2011 datasets, our Bi-CLKT-R model achieves an AUC of over 0.85 and an ACC of over 0.8. Compared with Bi-CLKT-M, Bi-CLKT-R shows only slight differences, yet it improves overall performance by at least 3% over all other baseline KT models. On the ASSISTment 2015 dataset the two variants perform fairly close to each other, whereas on the other three datasets there is a gap of at least 2% between them. Therefore, we chose Bi-CLKT-R as the predictive layer of the final model for best results.

6 Conclusion

We transformed the traditional Knowledge Tracing problem into a graph form and proposed the Bi-CLKT model, which exploits contrastive learning to learn from large amounts of unlabelled data. Bi-CLKT consists of three main parts: subgraph construction, contrastive learning and performance prediction. In the contrastive learning part, we adopt two different contrastive learning frameworks, local-local and global-global, for the “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) implicit relationships respectively. The final E2E and C2C embeddings are obtained by node-level and graph-level GCNs and are concatenated as the attributes of each exercise before being fed into the prediction layer. Our proposed approach achieved significantly better performance than previous state-of-the-art methods for Knowledge Tracing on multiple challenging datasets.


  • G. Abdelrahman and Q. Wang (2019) Knowledge tracing with sequential key-value memory networks. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 175–184. Cited by: §1, §2.1.
  • P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910. Cited by: §1.
  • S. Becker and G. E. Hinton (1992) Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 355 (6356), pp. 161–163. Cited by: §2.3.
  • R. Collobert and J. Weston (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pp. 160–167. Cited by: §1.
  • A. T. Corbett and J. R. Anderson (1994) Knowledge tracing: modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction 4 (4), pp. 253–278. Cited by: §1, §2.1, item 1, Table 2.
  • R. S. d Baker, A. T. Corbett, and V. Aleven (2008) More accurate student modeling through contextual estimation of slip and guess probabilities in bayesian knowledge tracing. In International conference on intelligent tutoring systems, pp. 406–415. Cited by: §1, §2.1.
  • L. Dos Santos, B. Piwowarski, and P. Gallinari (2016) Multilabel classification on heterogeneous graphs with gaussian embeddings. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 606–622. Cited by: §1.
  • H. Ebbinghaus (2013) Memory: a contribution to experimental psychology. Annals of neurosciences 20 (4), pp. 155. Cited by: §2.1.
  • A. Graves, G. Wayne, and I. Danihelka (2014) Neural turing machines. arXiv preprint arXiv:1410.5401. Cited by: §1.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.2.
  • J. Li, H. Dani, X. Hu, J. Tang, Y. Chang, and H. Liu (2017) Attributed network embedding for learning in a dynamic environment. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 387–396. Cited by: §2.2.
  • Z. Li, X. Wang, J. Li, and Q. Zhang (2021) Deep attributed network representation learning of complex coupling and interaction. Knowledge-Based Systems 212, pp. 106618. Cited by: §1, §2.1.
  • Q. Liu, Z. Huang, Y. Yin, E. Chen, H. Xiong, Y. Su, and G. Hu (2019) Ekt: exercise-aware knowledge tracing for student performance prediction. IEEE Transactions on Knowledge and Data Engineering 33 (1), pp. 100–115. Cited by: item 5, Table 2.
  • Y. Liu, Y. Yang, X. Chen, J. Shen, H. Zhang, and Y. Yu (2020) Improving knowledge tracing via pre-training question embeddings. arXiv preprint arXiv:2012.05031. Cited by: §1.
  • S. Minn, M. C. Desmarais, F. Zhu, J. Xiao, and J. Wang (2019) Dynamic student classiffication on memory networks for knowledge tracing. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 163–174. Cited by: §1, §2.1.
  • A. Mnih and K. Kavukcuoglu (2013) Learning word embeddings efficiently with noise-contrastive estimation. In Advances in neural information processing systems, pp. 2265–2273. Cited by: §1.
  • H. Nakagawa, Y. Iwasawa, and Y. Matsuo (2019) Graph-based knowledge tracing: modeling student proficiency using graph neural network. In 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pp. 156–163. Cited by: §1.
  • S. Pandey and G. Karypis (2019) A self-attentive model for knowledge tracing. arXiv preprint arXiv:1907.06837. Cited by: §1, item 4, Table 2.
  • Z. A. Pardos and N. T. Heffernan (2010) Modeling individualization in a bayesian networks implementation of knowledge tracing. In International conference on user modeling, adaptation, and personalization, pp. 255–266. Cited by: §2.1.
  • Z. A. Pardos and N. T. Heffernan (2011) KT-idem: introducing item difficulty to the knowledge tracing model. In International conference on user modeling, adaptation, and personalization, pp. 243–254. Cited by: §2.1.
  • P. I. Pavlik Jr, H. Cen, and K. R. Koedinger (2009) Performance factors analysis–a new alternative to knowledge tracing.. Online Submission. Cited by: §2.1.
  • C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, and J. Sohl-Dickstein (2015) Deep knowledge tracing. Advances in neural information processing systems 28, pp. 505–513. Cited by: §1, §1, §2.1, item 2, Table 2.
  • D. Shin, Y. Shim, H. Yu, S. Lee, B. Kim, and Y. Choi (2021) SAINT+: integrating temporal features for ednet correctness prediction. In LAK21: 11th International Learning Analytics and Knowledge Conference, pp. 490–496. Cited by: item 6, Table 2.
  • X. Song, J. Li, S. Sun, H. Yin, P. Dawson, and R. R. M. Doss (2020) SEPN: a sequential engagement based academic performance prediction model. IEEE Intelligent Systems 36 (1), pp. 46–53. Cited by: §1, §2.1.
  • X. Song, J. Li, Y. Tang, T. Zhao, Y. Chen, and Z. Guan (2021) JKT: a joint graph convolutional network based deep knowledge tracing. Information Sciences. Cited by: §1.
  • Y. Su, Q. Liu, Q. Liu, Z. Huang, Y. Yin, E. Chen, C. Ding, S. Wei, and G. Hu (2018) Exercise-enhanced sequential modeling for student performance prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §1, §2.1.
  • Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pp. 776–794. Cited by: §1.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3733–3742. Cited by: §2.3.
  • C. Xu, Z. Guan, W. Zhao, Y. Niu, Q. Wang, and Z. Wang (2018) Deep multi-view concept learning.. In IJCAI, pp. 2898–2904. Cited by: §2.1.
  • G. Xue, M. Zhong, J. Li, J. Chen, C. Zhai, and R. Kong (2021) Dynamic network embedding survey. arXiv preprint arXiv:2103.15447. Cited by: §1.
  • Y. Yang, J. Shen, Y. Qu, Y. Liu, K. Wang, Y. Zhu, W. Zhang, and Y. Yu (2020) GIKT: a graph-based interaction model for knowledge tracing. In ECML/PKDD, Cited by: §1, §1.
  • M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6210–6219. Cited by: §2.3.
  • H. Yin, S. Yang, X. Song, W. Liu, and J. Li (2021) Deep fusion of multimodal features for social media retweet time prediction. World Wide Web 24 (4), pp. 1027–1044. Cited by: §1.
  • M. V. Yudelson, K. R. Koedinger, and G. J. Gordon (2013) Individualized bayesian knowledge tracing models. In International conference on artificial intelligence in education, pp. 171–180. Cited by: §2.1.
  • J. Zhang, X. Shi, I. King, and D. Yeung (2017) Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th international conference on World Wide Web, pp. 765–774. Cited by: §1, §1, §2.1, item 3, Table 2.
  • Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang (2021) Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, pp. 2069–2080. Cited by: §4.2.