1 Introduction
With the growing development of online education platforms, vast amounts of online learning data are available to keep an accurate and timely trace of students’ learning status. To trace students’ mastery of specific knowledge points or concepts, a fundamental task called Knowledge Tracing (KT) has been proposed Corbett and Anderson (1994), which uses a series of student interactions with exercises to predict their mastery of the concepts corresponding to those exercises. Specifically, Knowledge Tracing addresses the problem of predicting whether students will be able to correctly answer the next exercise relevant to a concept, given their previous learning interactions. In recent years, KT tasks have received significant attention
in academic areas, and many scholars have proposed numerous methods to deal with this problem. Conventional approaches in this domain are mainly divided into the Bayesian Knowledge Tracing model using Hidden Markov Models Corbett and Anderson (1994) and Deep Knowledge Tracing using Deep Neural Networks Piech et al. (2015), together with its derivative methods Graves et al. (2014); Zhang et al. (2017); Pandey and Karypis (2019); Su et al. (2018). Existing KT methods Baker et al. (2008); Zhang et al. (2017); Piech et al. (2015) generally target the concept to which an exercise belongs, rather than distinguishing between the exercises themselves, to build predictive models. Such an approach assumes that a student’s ability to solve a relevant exercise correctly directly reflects, to a certain extent, that student’s mastery of the concept. Concept-based prediction is therefore a viable option; however, it simplifies the task itself, given the limited capacity of the model. Generally, a KT task comprises multiple concepts and a large number of exercises, with many situations where a concept is associated with many exercises and a proportion of situations where an exercise corresponds to multiple concepts. Traditional models can only deal with the former; for the latter, they often have to resort to dividing these cross-concept exercises into multiple single-concept exercises. Such an approach, while enhancing the feasibility of these models, nevertheless interferes with the accuracy of the overall task.
Although these concept-based KT methods have been somewhat successful, the characteristics of the exercises themselves are often overlooked. This can lead to a reduction in the ultimate predictive accuracy of the model and a failure to predict specific exercises. Even if two exercises share the same concept, the difference in their difficulty levels may ultimately lead to a large difference in the probability of them being answered correctly. Therefore, some previous works Minn et al. (2019); Dos Santos et al. (2016); Yang et al. (2020); Abdelrahman and Wang (2019); Xue et al. (2021); Song et al. (2020) have attempted to use exercise features as a supplement to concept input, achieving some success. However, because the total number of exercises far exceeds the number each student actually interacts with, every student touches only a very small fraction of the exercises, leading to data sparsity problems. Furthermore, for those exercises that span concepts, simply adding features to the exercises loses potential inter-exercise and inter-concept information. Therefore, the use of higher-order information such as “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relations is necessary to address these issues. The idea of redefining the Knowledge Tracing problem in terms of graphs has recently gained significant momentum, due to the widespread deployment of GNNs Nakagawa et al. (2019); Yang et al. (2020); Liu et al. (2020); Li et al. (2021); Yin et al. (2021) and breakthroughs in addressing the inability of traditional approaches to handle cross-concept exercises. Traditional KT usually takes sequential data as input, in the form of the concepts corresponding to the input exercises and their responses. This leads to a lack of information between exercises; only the relationship between exercises and concepts is available. Recent research in graph learning has opened up the possibility of breaking this bottleneck. Unlike sequential data, graph data can capture the higher-order “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) information very well, thanks to the multivariate node and edge structure of the graph itself. As a result, some research Nakagawa et al. (2019); Song et al. (2021) has turned to redefining this task in terms of graphs.
However, these efforts face several problems: 1) too much attention to the details of the nodes rather than high-level semantic information; 2) difficulty in effectively establishing spatial associations and complex structures of the nodes; and 3) representing only concepts or exercises without integrating them.
Due to the difficulty and inaccuracy of data labelling, self-supervised learning has become increasingly popular, with great success in many areas such as computer vision Bachman et al. (2019); He et al. (2020); Tian et al. (2020) and natural language processing Collobert and Weston (2008); Mnih and Kavukcuoglu (2013). The strength of self-supervised learning lies in dealing with the low-quality or missing labels that supervised learning requires: it uses the input data itself to construct supervisory signals for the learning model. This can be as powerful as supervised models trained with specifically labelled information, while eliminating the tedious labelling work such models require. Specifically, self-supervised learning removes the need to label data for specific tasks, which is the biggest bottleneck in supervised learning. Especially for large amounts of network data, obtaining high-quality labels at scale is often very expensive and time-consuming. Self-supervised learning has been shown to excel in tasks with textual and image datasets, but is still in its infancy for problems on graphs such as retrieval, recommendation, graph mining, and social network analysis. In this paper, we address the problems encountered by traditional GNN-based KT models and propose a self-supervised learning framework together with a BiGraph Contrastive Learning based Knowledge Tracing (BiCLKT) model. Our model employs contrastive learning with global-bilayer and local-bilayer structures, which apply graph-level and node-level GCNs to extract “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relational information, respectively. Finally, the prediction of students’ performance is carried out by a prediction layer based on a deep neural network.
Our approach has been fully validated on four benchmark datasets. In terms of prediction performance, our model and its variants outperform traditional deep learning-based methods, showing great potential for knowledge-tracing prediction accuracy. Furthermore, through a series of ablation studies, we analyse each module separately, making the model more interpretable. Our proposed model also offers several technical innovations and improvements. Because this task requires separate representation learning for two related but independent entities, our two-layer contrastive learning structure fits it well. Our specific contributions are as follows:

To the best of our knowledge, we present the first self-supervised learning based knowledge-tracing framework. Through contrastive self-supervised learning, we solve a number of problems encountered by traditional GNN-based knowledge-tracing models, leading to a significant improvement in the accuracy of the final prediction results.

For Knowledge Tracing, we design a two-layer contrastive learning framework, which captures “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relational information at the global and local levels respectively. The exercise representation is eventually learned and effectively combined through a joint contrastive loss function. Such a structure allows an exercise embedding to carry both exercise and concept structural information, which has a positive effect on the final prediction task.

We perform thorough experiments on four real-world open datasets, and the results show that our proposed framework and its variants all achieve a significant improvement in predictive performance compared to the individual baseline models. We also perform ablation studies to analyse the validity of each individual module, which greatly enhances the interpretability of the model.
2 Related Work
2.1 Knowledge Tracing
There are two main approaches to using machine learning for Knowledge Tracing. The first one is the traditional machine learning KT approach represented by Bayesian Knowledge Tracing (BKT)
Corbett and Anderson (1994). BKT primarily applies the Hidden Markov Model, which uses Bayesian rules to update the state of each concept considered as a binary variable. Several works have extended the basic BKT model and introduced additional variables such as slip and guess probabilities
Baker et al. (2008), concept difficulty Pardos and Heffernan (2011) and student personalisation Pardos and Heffernan (2010); Yudelson et al. (2013); Song et al. (2020). On the other hand, traditional machine learning KT models also include factor analysis models such as item response theory (IRT) Ebbinghaus (2013) and performance factor analysis (PFA) Pavlik Jr et al. (2009); Li et al. (2021), which tend to focus on learning general parameters from historical data to make predictions. With the development of Deep Neural Networks, the literature has seen advances in Deep Knowledge Tracing methods, which have proved more effective at learning valid representations from large amounts of data for more accurate predictions. For example, Deep Knowledge Tracing (DKT) Piech et al. (2015) uses recurrent neural networks (RNNs) to track students’ knowledge states and became the first deep KT method to achieve excellent results. Another example is the Dynamic Key-Value Memory Network (DKVMN) Zhang et al. (2017), which builds a static matrix and a dynamic matrix to store all concepts and to update students’ learning states, respectively. Xu et al. Xu et al. (2018) propose a pioneering deep matrix factorisation method for conceptual representation learning from multi-view data. However, these classical models consider only the most basic concept features, and the absence of exercise features leads to unreliable final predictions.
Deeper KT methods have since been proposed that do take the features of the exercises into account in their predictions. For example, the Exercise-Enhanced Recurrent Neural Network with attention mechanism (EERNNA) Su et al. (2018) uses the text of the exercise so that the embedding itself contains the features of the exercise, but in reality such textual information is difficult to collect, and using it introduces too much interference into the embedding itself. Dynamic Student Classification on Memory Networks (DSCMN) Minn et al. (2019) models problem difficulty to help distinguish between different problems under the same concept. DHKT, on the other hand, augments DKT by using relationships between problems and skills to obtain a representation of the exercises; however, this fails to capture the relationships between exercises and concepts due to data sparsity issues. Because practice sequences exhibit long-term dependencies, the Sequential Key-Value Memory Network (SKVMN) Abdelrahman and Wang (2019) improves the LSTM, with good results, to better capture such dependencies. Our approach differs from these methods in that we build graphs of exercise-influence relations from the original “exercise-to-concept” (E2C) relations under certain assumptions, and use graph-level and node-level GCNs, respectively, to extract the “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relational information. Moreover, to reduce the interference of excessive detail, we use a contrastive learning model to learn representations of the concepts and exercises separately.
2.2 Selfsupervised Learning
Research on self-supervised learning can be broadly divided into two branches: generative models and contrastive models. The main representative of the generative branch is the widely used autoencoder. The main approach on graph data is to learn embeddings of the graph’s nodes in a latent space through a GNN, and then reconstruct the structure and properties of the original graph from the learned representations. The node representations are adjusted by reducing, step by step, the loss between the generated graph and the original graph; the learned representations used to reconstruct the original graph thus encode its structural and attribute features. Contrastive learning, on the other hand, uses augmentation methods to structurally perturb the input data, deriving the prediction targets and corresponding labels from the data’s own structure before learning the representations, and finally optimising a contrastive loss to minimise the distance between positive pairs and maximise the distance between negative pairs, thereby achieving a structural grasp of the complete graph. Pioneering methods in the direction of learning graph representations with GCNs include Hu et al. Kingma and Ba (2014) and Kaveh et al. Li et al. (2017).
2.3 Contrastive Learning on Graphs
Contrastive learning is a type of self-supervised learning where the target label to be learned is generated from the source data itself. It brings similar representations closer together and pushes dissimilar ones further apart by comparison. For graph data, traditional learning methods Becker and Hinton (1992); Wu et al. (2018); Ye et al. (2019) often overemphasise detailed information at the expense of structural information. Contrastive learning compensates for this by finding a balance between local and global representation learning. Although contrastive learning on graph data is still in its infancy, numerous models have demonstrated its power in capturing graph structural information.
3 Preliminary and Problem Statement
In this section, we define the task of Knowledge Tracing in our setting. We first formally define the student performance prediction problem, which represents a student’s mastery of each concept by the accuracy with which the student answers the exercises under that concept. Next, we present the important definitions used in our study.
3.1 Problem Definition
In the Knowledge Tracing task, we record a particular student’s practice process as $X_s = \{(e_1, a_1), (e_2, a_2), \dots, (e_t, a_t)\}$, where $s$ represents the student, $e$ represents a specific exercise, $e_t$ represents the exercise that student $s$ attempts at exercise step $t$, and $a_t$ represents the correctness or otherwise of the corresponding exercise. In general, $a_t$ equals 1 if the student answered the exercise correctly, and otherwise equals 0. To trace the mastery of a specific concept $c$, we normally observe a sequence of a student’s interactions and predict the result of the next exercise $e_{t+1}$. Finally, the probability of the student getting any next exercise correct for a specific concept is taken as the student’s mastery of that concept. The specific definitions are as follows:
Problem 1 (Performance Prediction in KT).
Given a sequence of interaction observations $X = \{(e_1, a_1), \dots, (e_t, a_t)\}$ taken by a student $s$ on a specific concept $c$, where $e_t$ is the exercise being answered at time step $t$ and $a_t \in \{0, 1\}$ indicates whether the exercise was answered correctly, the objective of the Knowledge Tracing task is to predict the next interaction $(e_{t+1}, a_{t+1})$, i.e. the probability $P(a_{t+1} = 1 \mid e_{t+1}, X)$.
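As a minimal illustration of this data layout (the function and type names are ours, not from the paper), a student’s interaction history can be split into the observed prefix and the next interaction a KT model must predict:

```python
from typing import List, Tuple

# One interaction is (exercise id e_t, response a_t in {0, 1}).
Interaction = Tuple[int, int]

def split_history(seq: List[Interaction]) -> Tuple[List[Interaction], Interaction]:
    """Split a sequence into the observed history X and the next
    interaction (e_{t+1}, a_{t+1}) whose correctness is to be predicted."""
    return seq[:-1], seq[-1]

seq = [(12, 1), (7, 0), (12, 1), (3, 1)]
history, target = split_history(seq)
```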
To solve this problem, we dig deeper into the “exercise-to-exercise” (E2E) graph-structured relationships hidden in the original “exercise-to-concept” sequence data. The secondary “exercise-to-exercise” (E2E) relationships thus extracted are used to construct an exercise influence graph for each of the different concepts. Then, we obtain separate pre-representations of the exercises and concepts by applying graph-level and node-level GCNs on these exercise influence graphs. Finally, we fuse them into a contrastive learning based Knowledge Tracing graph to train these representations to achieve optimal results on the Knowledge Tracing task.
The exercise-level influence graph is derived from students’ transitions between exercises. It assumes that if the majority of students answer two different exercises correctly at the same time, the two exercises have a high degree of similarity or correlation, and therefore the edge between them carries a relatively high weight, and vice versa. The creation of the exercise-level influence graph not only ameliorates existing models’ inability to distinguish between exercises under the same concept, but also provides rich information on the structure of the “exercise-to-exercise” graph.
Definition 1.
(Exercise-level influence graph) Given an exercise set $E$ and an “exercise-to-exercise” interaction set $I$, an exercise influence subgraph $G = (E, I)$ is a graph that takes the exercises in $E$ as the vertices and the interactions in $I$ as the edges. The weighted influence of the edge between co-occurred exercises $e_i$ and $e_j$ is measured by

$$W_{i,j} = \frac{\#\mathrm{CoCorrect}(e_i, e_j)}{\#\mathrm{CoOccur}(e_i, e_j)} \qquad (1)$$

where $W_{i,j}$ is the co-correctness rate of exercises $e_i$ and $e_j$ among all the answered co-occurred exercises involving both, and $\#\mathrm{CoCorrect}(\cdot)$ and $\#\mathrm{CoOccur}(\cdot)$ denote the counts of co-correctness and co-occurrence respectively. The edge $(e_i, e_j)$ only exists when $W_{i,j}$ is larger than a certain threshold $\lambda$.
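A sketch of how Definition 1’s edge weights could be computed from raw logs (the helper name, log format, and threshold value are illustrative assumptions, not the authors’ code):

```python
from collections import defaultdict
from itertools import combinations

def exercise_influence_edges(student_logs, threshold=0.5):
    """Build weighted E2E edges: W(i, j) = #co-correct / #co-occur over all
    students who attempted both exercises; edges below `threshold` are dropped,
    in the spirit of Definition 1 and Eq. (1)."""
    co_occur = defaultdict(int)
    co_correct = defaultdict(int)
    for log in student_logs:          # one log: list of (exercise, correct) pairs
        answered = dict(log)          # keep the last response per exercise
        for i, j in combinations(sorted(answered), 2):
            co_occur[(i, j)] += 1
            if answered[i] == 1 and answered[j] == 1:
                co_correct[(i, j)] += 1
    return {pair: co_correct[pair] / co_occur[pair]
            for pair in co_occur
            if co_correct[pair] / co_occur[pair] >= threshold}

# Three students attempt exercises 1 and 2; two of them get both correct.
edges = exercise_influence_edges(
    [[(1, 1), (2, 1)], [(1, 1), (2, 0)], [(1, 1), (2, 1)]])
```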
Example 1.
As shown in Figure 1, we construct an example of an exercise influence subgraph by assuming a high degree of similarity between the questions that a student can answer correctly at the same time. The properties of different exercises are reflected from node to node, while the properties of different concepts are reflected between different subgraphs; these represent the “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relational information, respectively.
3.2 Representative Solutions Study
The vector sequence $\mathbf{x}_1, \dots, \mathbf{x}_T$ given as input to conventional deep knowledge tracing is mapped to the output vector sequence $\mathbf{y}_1, \dots, \mathbf{y}_T$ by computing a sequence of ‘hidden’ states $\mathbf{h}_1, \dots, \mathbf{h}_T$. This can be seen as a continuous encoding of historical learning performance used to make predictions about the future, and DKT makes the connection between input and output via a simple recurrent neural network. The input to the dynamic network is a representation of the student’s historical behaviour, while the prediction is a vector representing the probability of answering each sample exercise correctly. More details are defined by the equations:

$$\mathbf{h}_t = \tanh(\mathbf{W}_{hx}\mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{b}_h), \qquad \mathbf{y}_t = \sigma(\mathbf{W}_{yh}\mathbf{h}_t + \mathbf{b}_y) \qquad (2)$$

where the sigmoid ($\sigma$) and tanh functions are used as activation functions, $\mathbf{W}_{hx}$ denotes the input weight matrix, $\mathbf{h}_0$ denotes the initial state, $\mathbf{W}_{yh}$ denotes the readout weight matrix and $\mathbf{W}_{hh}$ denotes the recurrent weight matrix. The biases of the latent and readout cells are given by $\mathbf{b}_h$ and $\mathbf{b}_y$. Another classical approach is DKVMN, which outputs the probability of a response through a discrete exercise label $q_t$; the exercise-and-response tuples $(q_t, a_t)$ are then used to update the model. Here, $q_t$ comes from a set with $Q$ distinct exercise labels and $a_t$ is the binary value of whether the student answered correctly. DKVMN assumes that each exercise is based on a set of $N$ latent concepts. The key matrix $\mathbf{M}^k$ (size $N \times d_k$) is used to store these concepts, while the concept states, i.e. the students’ mastery of each concept, are stored in the time-varying value matrix $\mathbf{M}_t^v$ (size $N \times d_v$). Ultimately, DKVMN tracks student knowledge by reading and writing to the value matrix using the relevance weights computed from the input exercises and the key matrix.
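To make the DKT recurrence concrete, here is a NumPy sketch of the forward pass described by Eq. (2) (the weights are random placeholders and the dimensions are toy values; this illustrates the recurrence, not a trained model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dkt_forward(xs, W_hx, W_hh, b_h, W_yh, b_y):
    """Vanilla-RNN DKT step, following Eq. (2):
       h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h)
       y_t = sigmoid(W_yh h_t + b_y)   (per-exercise correctness probabilities)"""
    h = np.zeros(W_hh.shape[0])        # initial hidden state h_0
    ys = []
    for x in xs:
        h = np.tanh(W_hx @ x + W_hh @ h + b_h)
        ys.append(sigmoid(W_yh @ h + b_y))
    return np.stack(ys)

rng = np.random.default_rng(0)
n_ex, d_in, d_h = 4, 8, 5              # input is a 2 * n_ex one-hot (exercise, response)
xs = [np.eye(d_in)[i] for i in (0, 5, 2)]
out = dkt_forward(xs,
                  rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),
                  np.zeros(d_h), rng.normal(size=(n_ex, d_h)), np.zeros(n_ex))
```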
In the KT process, existing models often do not link different concepts well, which means they cannot make correct or complete predictions when students encounter exercises on concepts that have not been covered before, or when an exercise involves multiple concepts. We therefore fill this gap by constructing exercise influence subgraphs, with nodes connecting different exercises to each other and subgraphs connecting different concepts to each other.
By building exercise influence subgraphs, we can transform the original sequence data into graph-structured data containing “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relational information. We then encode these data into the corresponding embedded representations using node-level and graph-level GCNs respectively, while contrastively learning the best architecture to maximise the differentiation of these representations, so that the final representations contain both information about the exercises and concepts and their differentiation from other concepts and exercises.
4 The BiGraph Contrastive Knowledge Tracing
Inspired by recent developments of contrastive learning in visual representation, a growing body of research has shown that contrastive learning frameworks also perform well on graph-structured data. Therefore, after various comparisons and studies, we propose a BiGraph Contrastive Knowledge Tracing representation learning model (BiCLKT) based on graph-level (global) and node-level (local) GCNs. In the next section, we describe BiCLKT in detail, first by briefly discussing the traditional contrastive learning framework, and then by presenting our proposed BiGraph contrastive learning framework more specifically. Finally, we provide the theoretical rationale behind our approach.
4.1 The Graph CL Paradigm
As shown in Figure 2, our proposed BiCLKT framework extends the common graph CL paradigm. Common graph CL paradigms typically employ either global or local features, seeking to maximise the consistency of representations between the graph level or node level of different views. Specifically, graph CL paradigms first generate two graph views by performing random graph augmentations on the original graph, such as eliminating or adding edges, eliminating nodes, masking attributes, etc. For the global graph CL, these two views are treated as a positive pair, while the other graphs are treated as negative pairs; for the local graph CL, the positive and negative pairs are found among the nodes of the two views. We then employ a contrastive loss that forces the embeddings of each positive pair to be consistent across views, while trying to distance all negative pairs from each other. Specifically, the graph CL paradigm consists of four main components:

Graph data augmentation. The given graph is augmented by eliminating or adding edges, eliminating nodes, masking attributes, etc. to obtain two related views (i.e. augmented graphs), which are treated as a positive pair in the global graph CL, whereas in the local graph CL all positive and negative pairs exist within the two views.

GCN-based encoder. Node-level and graph-level GCN-based encoders are applied to the augmented graphs to extract the representation vectors.

Projection head. A non-linear transformation $g(\cdot)$ named the projection head maps the augmented representations to another latent space where the contrastive loss is calculated. In graph contrastive learning, $g(\cdot)$ is obtained by a multi-layer perceptron (MLP).

Joint contrastive loss function. The joint contrastive loss function is defined to enforce agreement within positive pairs while maximising the distance between positive and negative pairs in the two subgraphs respectively. The normalised temperature-scaled cross-entropy loss is used to calculate the loss of the two contrastive learning modules.
4.2 The BiCLKT Framework
In general, common graph CL approaches usually choose either global graph CL at the graph level or local graph CL at the node level, seeking to learn representations by maximising the consistency between views from different perspectives. Normally, these approaches have only one kind of feature, global or local. In the BiCLKT model, due to the specificity of the KT task, we need to acquire both “exercise-to-exercise” (E2E) and “concept-to-concept” (C2C) relational information, and these two relational features have the corresponding properties of being local and global respectively. Therefore, we propose a BiGraph contrastive learning framework with both local and global features.
To match this two-layer framework, we use a node-level GCN to learn the “exercise-to-exercise” (E2E) embedding and a graph-level GCN to learn the “concept-to-concept” (C2C) embedding, respectively. In the process of graph data augmentation, we design separate augmentation procedures for each of these two layers of the contrastive learning framework. We mainly augment the input graph by randomly removing edges and masking node features. In addition, inspired by Zhu et al. (2021), we account for differences in the importance of different nodes and edges by calculating node centrality through methods such as random walk and PageRank; nodes and edges with lower importance are prioritised for elimination during graph augmentation.
4.2.1 Graph Data Augmentation
For the exercise-level augmentation, we take two different approaches to augmenting the input graph, i.e. randomly picking edges or nodes to be removed. Formally, we form modified subsets $\tilde{E}$ and $\tilde{V}$ by randomly selecting edges and nodes from the original $E$ and $V$ with selection probabilities

$$P\big[(u,v) \in \tilde{E}\big] = 1 - p^{e}_{uv}, \qquad P\big[v \in \tilde{V}\big] = 1 - p^{v}_{v} \qquad (3)$$

where $p^{e}_{uv}$ and $p^{v}_{v}$ are the probabilities of eliminating edge $(u,v)$ and node $v$, and $\tilde{E}$ and $\tilde{V}$ are the set of edges and the set of nodes after graph augmentation. The probabilities $p^{e}_{uv}$ and $p^{v}_{v}$ reflect the importance of edge $(u,v)$ and node $v$, respectively. By doing so, this scheme ensures that unimportant edges or nodes are eliminated preferentially, while the important structure of the graph is not compromised.
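A small sketch of importance-aware removal in the spirit of Eq. (3); the normalisation and the cap `p_max` are our illustrative choices (BiCLKT derives importance from centrality measures such as PageRank):

```python
import numpy as np

rng = np.random.default_rng(42)

def drop_by_importance(items, importance, p_max=0.7):
    """Remove each item with a probability that grows as its importance falls:
    the most important items get drop probability 0, and no item exceeds
    p_max, so the graph's key structure is preserved."""
    imp = np.asarray(importance, dtype=float)
    p = (imp.max() - imp) / (imp.max() - imp.mean() + 1e-12)
    p = np.minimum(p * p_max, p_max)   # per-item drop probability
    keep = rng.random(len(items)) >= p
    return [it for it, k in zip(items, keep) if k]
```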
In concept-level graph augmentation, similar to the augmentation at the exercise level, we need to compute probabilities $p^{e}$ and $p^{v}$ that reflect the importance of the corresponding edges and nodes. The difference is that, to better fit the graph-level GCN, we use the PageRank algorithm to combine $p^{e}$ and $p^{v}$ into a composite probability for the concept level. The specific calculation is defined below:

$$p_{c} = \min\!\left(\frac{p_{\max} - p}{p_{\max} - p_{\mathrm{avg}}} \cdot p,\; p_{\tau}\right) \qquad (4)$$

where we compute the maximum and average values of $p$ and denote them by $p_{\max}$ and $p_{\mathrm{avg}}$ respectively, $p_{\tau}$ is a truncation threshold, and $p_{c}$ is the composite probability combining $p^{e}$ and $p^{v}$.
Finally, we generate two corrupted graph views by augmentation at the exercise and concept levels, respectively. In BiCLKT, the probabilities of generating the two views differ; to better distinguish them, we denote the exercise-level and concept-level probabilities separately.
4.2.2 NodeLevel and GraphLevel Encoder
Directly using the node features of the last layer of the encoder is the most straightforward way to obtain a node-level representation of node $i$, i.e. $\mathbf{z}_i = \mathbf{h}_i^{(L)}$, and the use of skip connections or jumping knowledge to generate node-level representations is also common. However, concatenating the node features of all layers produces a node-level representation with a different dimensionality from the node features. To avoid this problem, we perform a linear transformation on the node features of all layers before joining them together:

$$\mathbf{z}_i = \big\Vert_{l=1}^{L} \mathbf{W}^{(l)} \mathbf{h}_i^{(l)} \qquad (5)$$

where $\mathbf{W}^{(l)}$ is the weight matrix that reduces the dimensionality and $\Vert$ denotes concatenation.
The key operation for computing the graph-level representation is the READOUT function, of which summation and averaging are the most commonly used variants. For reasons of node-permutation invariance, we aggregate over all node representations, i.e.

$$\mathbf{z}_{G} = \sigma\!\left(\frac{1}{N}\sum_{i=1}^{N} \mathbf{z}_i\right) \qquad (6)$$

where $\sigma$ is the sigmoid function and $N$ denotes the total number of nodes in the given graph. Due to BiCLKT’s two-layer structure, we use node-level and graph-level GCNs as the encoders for the exercises and concepts respectively. The specific structural form is as follows:

$$\mathbf{H}^{(l+1)} = \sigma\!\left(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(l)} \mathbf{W}^{(l)}\right) \qquad (7)$$

where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix with added self-loops, $\tilde{\mathbf{D}}$ is its degree matrix, and $\sigma$ is the non-linear activation function we use (ReLU).
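The GCN propagation and readout steps described above can be sketched in NumPy as follows (a dense toy implementation with ReLU activation and a sigmoid-mean readout; the matrix names follow the equations, everything else is illustrative):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One propagation step in the style of Eq. (7):
       H' = ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                       # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

def readout(Z):
    """Graph-level representation in the style of Eq. (6):
       sigmoid of the mean node embedding."""
    return 1.0 / (1.0 + np.exp(-Z.mean(axis=0)))

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])                             # 3-node path graph
Z = gcn_layer(A, np.eye(3), np.ones((3, 2)))             # identity features, toy weights
g = readout(Z)
```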
4.2.3 Projection Head
Furthermore, in BiCLKT we map the augmented representations to a uniform latent space via a non-linear transformation called the projection head $g(\cdot)$, where the contrastive loss is computed. In graph contrastive learning, a multi-layer perceptron (MLP) is applied to obtain these mappings. Considering view generation in terms of mutual information, a good view generation should minimise the mutual information between the two views, provided that the information relevant to the downstream task is preserved. Intuitively, if the generated views do not affect the information that determines the prediction of the downstream task, the effectiveness of contrastive learning can be guaranteed; under this restriction, increasing the divergence between views leads to better learning results. From a manifold perspective, we adopt the expansion assumption and find that data augmentation can induce continuity in the neighbourhood of each instance.
4.3 Joint Contrastive Loss Function
To better learn the representations of exercises and concepts, we use a joint contrastive loss function. This loss function pulls together the positive pairs of the previously learned mappings while pushing away the negative pairs. Through extensive comparisons, we found the normalised temperature-scaled cross-entropy loss (NT-Xent) to be the most appropriate. In the node-level GCN training process, we randomly draw $N$ mini-batches of nodes on the exercise influence graph under the same concept and learn these nodes by local contrastive learning, where all 1-hop neighbours of a node under its own view, along with all 1-hop neighbours in the other view, are used as its negative pairs, and the only positive pairs are the corresponding nodes across the two views. In the training process of the graph-level GCN, we randomly draw mini-batches of exercise influence graphs, so that the two augmented graphs generated from the same graph form a positive pair, while all other graphs are used as their negative pairs. The cosine similarity function is denoted as

$$\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_j) = \frac{\mathbf{z}_i^{\top}\mathbf{z}_j}{\|\mathbf{z}_i\|\,\|\mathbf{z}_j\|} \qquad (8)$$
As for the objective function, the prevailing practice is to use a standard binary cross-entropy loss between positive and negative examples, i.e. a noise-contrastive objective. However, we have found that absolutely distinguishing positive from negative examples is detrimental to representation learning, mainly because these contextual subgraphs are extracted from the same original graph and overlap with each other. Therefore, we use the normalised temperature-scaled cross-entropy loss (NT-Xent) for model optimisation, so that positive and negative samples are differentiated to an appropriate degree, resulting in high-quality representations. The NT-Xent of the $i$-th graph is defined as

$$\ell_{i} = -\log \frac{\exp\!\big(\mathrm{sim}(\mathbf{z}_{i}, \mathbf{z}'_{i})/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\mathrm{sim}(\mathbf{z}_{i}, \mathbf{z}_{k})/\tau\big)} \qquad (9)$$

where the temperature parameter is denoted by $\tau$ and $\mathbf{z}'_{i}$ is the positive counterpart of $\mathbf{z}_{i}$ in the other view. The exercise loss and the concept loss are each computed over all positive pairs in this way.
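The NT-Xent computation above can be sketched for a single anchor as follows (the batch layout and function names are ours; an efficient implementation would vectorise over the whole batch):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embeddings, as in Eq. (8)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nt_xent(z1, z2, i, tau=0.5):
    """NT-Xent for anchor i, following Eq. (9): z1[i] and z2[i] form the
    positive pair; every other embedding in either view acts as a negative."""
    z = np.concatenate([z1, z2], axis=0)                 # 2N embeddings
    n = len(z1)
    pos = np.exp(cosine_sim(z1[i], z2[i]) / tau)
    denom = sum(np.exp(cosine_sim(z1[i], z[k]) / tau)
                for k in range(2 * n) if k != i)         # indicator 1[k != i]
    return -np.log(pos / denom)

z1 = np.array([[1.0, 0.0], [0.0, 1.0]])
z2_good = np.array([[1.0, 0.0], [0.0, 1.0]])             # positives aligned
z2_bad = np.array([[0.0, 1.0], [1.0, 0.0]])              # positives misaligned
```

As expected, the loss is smaller when the positive pair is aligned across views than when it is not.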
5 Experimental Settings and Results
We perform extensive experiments on four real-world datasets to evaluate the performance of our BiCLKT model, comparing it to several state-of-the-art machine learning and deep learning Knowledge Tracing models. To fully evaluate BiCLKT, we also conduct a large number of in-depth ablation experiments, which validate the role of each module in BiCLKT and enhance the interpretability of the model.
To implement the baseline and BiCLKT models, we used PyTorch and the PyTorch Geometric deep learning extension library. Experiments were conducted on four NVIDIA TITAN V GPUs. BiCLKT learns node representations in a self-supervised contrastive fashion, and these representations are then used to evaluate node-level and graph-level classification by directly training and testing a simple linear (logistic regression) classifier on them. In preprocessing, we perform row normalisation and apply the usual processing strategies, and we normalise the learned embeddings before feeding them into the logistic regression classifier. In training, we use the Adam optimiser with an initial learning rate of 0.001; the subgraph size does not exceed 20, the dimensionality of the node representation is 1024, and the margin value of the loss function is 0.75.
5.1 Datasets
To evaluate our model, we conduct experiments on the following four widely-used KT datasets; detailed statistics are shown in Table 1.

ASSISTment 2009 (https://sites.google.com/site/assistmentsdata/home/assistment20092010data) is provided by the online tutoring platform ASSISTments and is widely used to validate KT methods. The dataset comes with accurate labels and clearly defined exercises and concepts. We did not modify this dataset beyond filtering out corrupted samples.

ASSISTment 2015 (https://sites.google.com/site/assistmentsdata/home/2015assistmentsconceptbuilderdata) also comes from the online tutoring platform ASSISTments. It refines the data collected by ASSISTments in 2009 by collapsing the number of concepts to exactly 100 and introducing a larger number of students, at the cost of a slightly shorter average interaction record per student.

ASSISTment Challenge (https://sites.google.com/view/assistmentsdatamining/dataset) (ASSISTment Chall) was collected for a data mining competition run by ASSISTments in 2017. It has a relatively rich average number of records per student and, because it was prepared for a competition, it is the most complete and regular of the three ASSISTment datasets.

STATICS 2011 (https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507) differs from the previous three datasets in that it is course-specific, i.e., its data are highly correlated. It contains 189,297 interactions from 333 students across 1,223 concepts, making it the densest of the four datasets.
Table 1 presents the statistics of the datasets, where #Students, #Concepts and #Interactions denote the number of students, concepts and interactions respectively.
Table 1: Statistics of the four datasets.

Datasets            #Students   #Concepts   #Interactions
ASSISTment 2009         4,151         110         325,637
ASSISTment 2015        19,917         100         708,631
ASSISTment Chall          686         102         942,816
STATICS 2011              333       1,223         189,297
Table 2: AUC and ACC of BiCLKT and the baseline methods on the four datasets.

Model                              ASSISTment 2009   ASSISTment 2015   ASSISTment Chall   STATICS 2011
                                   AUC      ACC      AUC      ACC      AUC      ACC       AUC      ACC
BKT Corbett and Anderson (1994)    0.648    0.594    0.616    0.592    0.562    0.555     0.719    0.698
DKT Piech et al. (2015)            0.740    0.708    0.701    0.680    0.691    0.712     0.815    0.723
DKVMN Zhang et al. (2017)          0.739    0.618    0.705    0.680    0.689    0.614     0.814    0.722
SAKT Pandey and Karypis (2019)     0.735    0.679    0.721    0.647    0.701    0.657     0.803    0.797
EKT Liu et al. (2019)              0.754    0.702    0.737    0.754    0.720    0.727     0.842    0.819
SAINT+ Shin et al. (2021)          0.782    0.718    0.754    0.741    0.734    0.718     0.853    0.808
BiCLKT                             0.857    0.802    0.765    0.757    0.775    0.764     0.865    0.835
5.2 Evaluation Methods
We compare our proposed BiCLKT with the following baseline methods.

Bayesian Knowledge Tracing Corbett and Anderson (1994) is a classical machine learning Knowledge Tracing model based on the Hidden Markov Model. It uses Bayesian rules to update the state of each concept, which is modelled as a binary variable.

Deep Knowledge Tracing Piech et al. (2015) uses recurrent neural networks (RNNs) to track students' knowledge states and was the first deep KT method to achieve excellent results.

Dynamic Key-Value Memory Networks Zhang et al. (2017), inspired by memory-augmented neural networks, builds a static matrix to store all concepts and a dynamic matrix to update students' learning states.

SAKT Pandey and Karypis (2019) is the first self-attention-based Knowledge Tracing model. It abandons the traditional approach of using RNNs to model a student's historical interactions and instead makes predictions by attending to the relevant exercises in the student's past interactions. SAKT has been shown to be far more efficient than RNN-based KT models.

EKT Liu et al. (2019) is an extension of the Exercise-Enhanced Recurrent Neural Network (EERNN) framework. Compared with EERNN, EKT further introduces information about the knowledge concepts present in each exercise.

SAINT+ Shin et al. (2021), the first Transformer-based Knowledge Tracing model, separately encodes exercise information and student response information, and additionally embeds two temporal features, elapsed time and lag time, into the embedding of student response information.
5.3 Experiment discussion
Table 2 compares the predictive performance of BiCLKT and its variants with the mainstream machine learning and deep learning baseline methods. The Area Under the Curve (AUC) and Accuracy (ACC) are used as evaluation metrics.
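These two metrics can be computed with scikit-learn as sketched below; the 0.5 decision threshold for ACC is an assumption, as the paper does not state its thresholding rule.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the two evaluation metrics used in the experiments.

    y_true: binary ground-truth responses (1 = answered correctly).
    y_prob: predicted probabilities of a correct answer.
    Returns (AUC, ACC); ACC thresholds the probabilities at `threshold`.
    """
    auc = roc_auc_score(y_true, y_prob)
    preds = (np.asarray(y_prob) >= threshold).astype(int)
    acc = accuracy_score(y_true, preds)
    return auc, acc
```

AUC is threshold-free and so is often preferred for imbalanced response data, while ACC depends on the chosen cut-off.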
The empirical results are summarised in Table 2. Overall, our proposed model shows competitive performance on all datasets: BiCLKT consistently outperforms all other baseline KT models by a clear margin, which validates the benefit of our contrastive learning model for the knowledge tracing task. While existing baselines already achieve high performance, BiCLKT pushes this bound further and remains competitive with the latest deep learning methods on all four datasets.
5.4 Overall Performance
Table 2 summarises the AUC and ACC results of all baseline methods on the four datasets. We observe that our BiCLKT model achieves the best performance on all four datasets, ASSISTment 2009, ASSISTment 2015, ASSISTment Chall and STATICS 2011, which validates the effectiveness and superiority of our model. Among the baselines, deep learning models consistently outperform traditional machine learning models such as BKT, which justifies the current research trend towards deep learning methods. We also see that DKVMN performs slightly worse than DKT on average, as building a separate state for each concept may lose information about the relationships between concepts. Furthermore, SAKT performs worse than our model, suggesting that exploiting higher-order concept-exercise relationships differs from merely selecting the most relevant past exercises. Finally, SAINT+, the best-performing baseline, is the first model to apply the Transformer modelling framework to the Knowledge Tracing task, which reflects how well the Transformer architecture adapts to this task. To further dissect our model, we provide extensive ablation studies on its internal constructs in the following sections.
Table 3: Effects of different augmentation methods (node-centrality measures).

Augmentation   ASSISTment 2009   ASSISTment 2015   ASSISTment Chall   STATICS 2011
               AUC      ACC      AUC      ACC      AUC      ACC       AUC      ACC
Uniform        0.864    0.786    0.748    0.749    0.770    0.753     0.852    0.815
Degree         0.869    0.795    0.765    0.757    0.773    0.764     0.858    0.821
PageRank       0.875    0.802    0.757    0.752    0.775    0.764     0.865    0.835
Table 4: Effects of different embedding methods under the two predictive layers.

Variants   Embedding   ASSISTment 2009   ASSISTment 2015   ASSISTment Chall   STATICS 2011
                       AUC      ACC      AUC      ACC      AUC      ACC       AUC      ACC
BiCLKT-M   C2C         0.838    0.768    0.762    0.746    0.733    0.752     0.802    0.777
           E2E         0.830    0.762    0.748    0.748    0.744    0.747     0.857    0.788
           Concate     0.862    0.795    0.764    0.752    0.769    0.751     0.859    0.833
BiCLKT-R   C2C         0.847    0.784    0.764    0.754    0.761    0.760     0.849    0.817
           E2E         0.859    0.795    0.765    0.755    0.761    0.761     0.864    0.828
           Concate     0.875    0.802    0.765    0.757    0.775    0.764     0.865    0.835
5.5 Ablation Studies
To gain insight into the effect of each module in BiCLKT, we design several ablation studies. Specifically, we investigate the effectiveness of three important components of our proposed model: (1) the augmentation methods; (2) the embedding methods; (3) the predictive layer. We set up nine comparative settings in total and report the performances in Table 3 and Table 4.
5.6 Effects of Augmentation methods
We observe that all three variants of BiCLKT with different node-centrality measures outperform the existing KT baselines on all datasets. The degree- and PageRank-based augmentations are the two strongest variants, achieving the best or competitive performance throughout. Specifically, the PageRank-based variant works best on the ASSISTment 2009, ASSISTment Chall and STATICS 2011 datasets; only on ASSISTment 2015 does the degree-based variant outperform PageRank. This shows that our final model generalises well and is not limited to a specific choice of augmentation method for a given dataset.
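Centrality-guided augmentation of this kind is commonly realised by dropping edges with probability inversely related to node importance, in the spirit of Zhu et al. (2021). The sketch below is one plausible realisation under that assumption; the exact weighting scheme used in BiCLKT may differ, and the function and parameter names are illustrative.

```python
import torch

def centrality_edge_drop(edge_index, centrality, p_max=0.7, p_mean=0.3):
    """Drop edges with probability inversely related to node centrality,
    so that edges incident to important (high-centrality) nodes survive
    more often during graph augmentation.

    edge_index: (2, E) tensor of edges.
    centrality: (N,) per-node scores, e.g. degree or PageRank.
    """
    # Edge importance = mean centrality of its two endpoints, log-scaled
    # to dampen the effect of highly skewed centrality distributions.
    ec = (centrality[edge_index[0]] + centrality[edge_index[1]]) / 2
    w = torch.log(ec + 1e-8)
    # Map importance to drop probabilities: the least important edges
    # are dropped most; p_max caps the probability for any single edge.
    p = (w.max() - w) / (w.max() - w.mean() + 1e-8) * p_mean
    p = p.clamp(min=0.0, max=p_max)
    keep = torch.bernoulli(1 - p).bool()
    return edge_index[:, keep]
```

Uniform dropping corresponds to setting every edge's probability to the same constant, which is the "Uniform" row in Table 3.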
5.7 Effects of Different Embedding Methods
The two implicit relations, "exercise-to-exercise" (E2E) and "concept-to-concept" (C2C), are modelled by separate subgraphs in our BiCLKT model. To verify the role of these two embeddings, we used the E2E and C2C embeddings separately as the attribute of each exercise and compared them with their concatenation. From Table 4, we can see that concatenation yields significantly better results than either embedding alone. Moreover, owing to the difference in the prediction-layer mechanism, C2C embedding is overall better than E2E embedding in the BiCLKT-M variant, especially on the ASSISTment 2009 and ASSISTment 2015 datasets. In contrast, in the DKT-style BiCLKT-R variant, using the E2E embedding alone gives better results, particularly on the ASSISTment 2009 and STATICS 2011 datasets. In general, both embeddings work well separately, and the best results are obtained when they are combined.
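The concatenation step can be sketched as below: each exercise's attribute vector is built from its own E2E embedding joined with the C2C embedding of its associated concept. This is a minimal sketch under the assumption of a single concept per exercise; the function name and lookup convention are illustrative.

```python
import torch

def exercise_features(e2e_emb, c2c_emb, exercise_ids, concept_ids):
    """Build the attribute vector for each exercise by concatenating its
    E2E embedding with the C2C embedding of its associated concept.

    e2e_emb:      (num_exercises, d1) exercise embeddings.
    c2c_emb:      (num_concepts, d2) concept embeddings.
    exercise_ids: (B,) indices of the exercises in a batch.
    concept_ids:  (B,) indices of each exercise's concept.
    Returns a (B, d1 + d2) tensor fed to the prediction layer.
    """
    return torch.cat([e2e_emb[exercise_ids], c2c_emb[concept_ids]], dim=-1)
```

Using only one of the two lookups reproduces the single-embedding ablation rows of Table 4.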
5.8 Effects of Different Predictive Layers
To improve model performance, we experimented with two distinct prediction mechanisms, BiCLKT-R and BiCLKT-M, which apply a Recurrent Neural Network and a Memory-augmented Neural Network, respectively, in the prediction layer. In particular, on the ASSISTment 2009 and STATICS 2011 datasets, our BiCLKT-R model achieves an AUC of over 0.85 and an ACC of over 0.8. Although BiCLKT-M already improves overall performance by at least 3% over all other baseline KT models, it still falls slightly behind BiCLKT-R. The two variants perform very closely on the ASSISTment 2015 dataset, whereas on the other three datasets there is a gap of at least 2% between them. We therefore chose BiCLKT-R as the predictive layer of the final model.
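A DKT-style recurrent prediction layer of the kind used in BiCLKT-R can be sketched as follows. This is an illustrative sketch, not the authors' exact architecture; the hidden size and input encoding are assumptions.

```python
import torch
import torch.nn as nn

class RNNPredictor(nn.Module):
    """Recurrent prediction layer: consumes a sequence of
    (exercise-attribute, response) pairs and emits per-step probabilities
    of a correct answer. During training, the output at step t would be
    matched against the response at step t+1, as in DKT.
    """
    def __init__(self, attr_dim, hidden_dim=128):
        super().__init__()
        # Input = exercise attributes concatenated with the binary response.
        self.rnn = nn.LSTM(attr_dim + 1, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, attrs, responses):
        # attrs: (B, T, attr_dim); responses: (B, T) in {0, 1}
        x = torch.cat([attrs, responses.unsqueeze(-1).float()], dim=-1)
        h, _ = self.rnn(x)                      # (B, T, hidden_dim)
        return torch.sigmoid(self.out(h)).squeeze(-1)  # (B, T) probabilities
```

A memory-augmented alternative (BiCLKT-M) would replace the LSTM with DKVMN-style key-value memory read/write operations over the same inputs.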
6 Conclusion
We transformed the traditional Knowledge Tracing problem into a graph form and proposed the BiCLKT model, which exploits contrastive learning to learn from large amounts of unlabelled data. BiCLKT consists of three main parts: subgraph construction, contrastive learning and performance prediction. In the contrastive learning part, we adopt two different contrastive learning frameworks, local-local and global-global, for the "exercise-to-exercise" (E2E) and "concept-to-concept" (C2C) implicit relationships respectively. The final E2E and C2C embeddings are obtained by node-level and graph-level GCNs and are concatenated together as the attributes of each exercise for the prediction layer. Our proposed approach achieves significantly better performance than previous state-of-the-art Knowledge Tracing methods on multiple challenging datasets.
References
 Knowledge tracing with sequential key-value memory networks. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 175–184.
 Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910.
 Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 355 (6356), pp. 161–163.
 A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160–167.
 Knowledge tracing: modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction 4 (4), pp. 253–278.
 More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In International Conference on Intelligent Tutoring Systems, pp. 406–415.
 Multi-label classification on heterogeneous graphs with Gaussian embeddings. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 606–622.
 Memory: a contribution to experimental psychology. Annals of Neurosciences 20 (4), pp. 155.
 Neural Turing machines. arXiv preprint arXiv:1410.5401.
 Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Attributed network embedding for learning in a dynamic environment. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management, pp. 387–396.
 Deep attributed network representation learning of complex coupling and interaction. Knowledge-Based Systems 212, pp. 106618.
 EKT: exercise-aware knowledge tracing for student performance prediction. IEEE Transactions on Knowledge and Data Engineering 33 (1), pp. 100–115.
 Improving knowledge tracing via pre-training question embeddings. arXiv preprint arXiv:2012.05031.
 Dynamic student classification on memory networks for knowledge tracing. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 163–174.
 Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pp. 2265–2273.
 Graph-based knowledge tracing: modeling student proficiency using graph neural network. In 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pp. 156–163.
 A self-attentive model for knowledge tracing. arXiv preprint arXiv:1907.06837.
 Modeling individualization in a Bayesian networks implementation of knowledge tracing. In International Conference on User Modeling, Adaptation, and Personalization, pp. 255–266.
 KT-IDEM: introducing item difficulty to the knowledge tracing model. In International Conference on User Modeling, Adaptation, and Personalization, pp. 243–254.
 Performance factors analysis: a new alternative to knowledge tracing. Online Submission.
 Deep knowledge tracing. Advances in Neural Information Processing Systems 28, pp. 505–513.
 SAINT+: integrating temporal features for EdNet correctness prediction. In LAK21: 11th International Learning Analytics and Knowledge Conference, pp. 490–496.
 SEPN: a sequential engagement based academic performance prediction model. IEEE Intelligent Systems 36 (1), pp. 46–53.
 JKT: a joint graph convolutional network based deep knowledge tracing. Information Sciences.
 Exercise-enhanced sequential modeling for student performance prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
 Contrastive multiview coding. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pp. 776–794.
 Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742.
 Deep multi-view concept learning. In IJCAI, pp. 2898–2904.
 Dynamic network embedding survey. arXiv preprint arXiv:2103.15447.
 GIKT: a graph-based interaction model for knowledge tracing. In ECML/PKDD.
 Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6210–6219.
 Deep fusion of multimodal features for social media retweet time prediction. World Wide Web 24 (4), pp. 1027–1044.
 Individualized Bayesian knowledge tracing models. In International Conference on Artificial Intelligence in Education, pp. 171–180.
 Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th International Conference on World Wide Web, pp. 765–774.
 Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, pp. 2069–2080.