1 Introduction
Both student knowledge modeling and domain knowledge modeling are important problems in the educational data mining community. In this context, student knowledge tracing and knowledge modeling approaches aim to evaluate students’ state of knowledge, or to quantify students’ knowledge of the concepts presented in learning materials, at each point of the learning period [Corbett1994, Baker2008, Yudelson2013, Khajah2014, Zhang2017a, Nagatani2019, Choffin2019, Vie2019]. Domain knowledge modeling, on the other hand, focuses on understanding and quantifying the topics, knowledge components, or concepts that are presented in the learning materials [barnes2005q, casalino2017q, lan2014time]. It is useful in creating a coherent study plan for students, modeling students’ knowledge, and analyzing students’ knowledge gaps.
A successful student knowledge model should be personalized to capture individual differences in learning [Yudelson2013, Lan2014c], understand the association and relevance between learning from various concepts [Sahebi2016b, Zhang2017a], model knowledge gain as a gradual process resulting from student interactions with learning material [GonzalezBrenes2013a, Piech2015, Doan2019a], and allow for occasional forgetting of concepts [Choffin2019, Nagatani2019, Doan2019a]. Despite recent success in capturing these complexities, a simple but important aspect of student learning is still under-investigated: students learn from different types of learning materials. Current research has focused on modeling a single type of learning resource at a time (typically, “problems”), ignoring the heterogeneity of learning resources from which students may learn. Modern online learning systems frequently let students learn and assess their knowledge using various learning resource types, such as readings, video lectures, assignments, quizzes, and discussions. Previous research has demonstrated considerable benefits of interacting with multiple types of materials. For example, worked examples can lead to faster and more effective learning compared to unsupported problem solving [Najar2014], and enriching textbooks with additional forms of content, such as images and videos, increases the helpfulness of learning material [Agrawal2011, Agrawal2014]. Ignoring the diverse types of learning materials in student knowledge modeling limits our understanding of how students learn.
One of the obstacles to considering the combined effect of learning material types is the lack of explicit learning feedback from all of them. Some learning material types, such as problems and quizzes, are gradable. As students interact with such material types, the system can treat the student’s grade as explicit feedback or an indication of student knowledge: if a student receives a high grade on a problem, it is likely that the student has gained the knowledge required to solve that problem. On the other hand, some learning materials are not gradable, and their impact on student knowledge cannot be explicitly measured. For example, we cannot directly measure the knowledge gained from watching a video lecture or studying an example.
As an alternative for quantifying student knowledge gain, the system can measure other quantities, such as a binary indication of student activity with a learning material or the time spent on it. However, such measures may lead to contradictory conclusions [Beck2008, Huang2015a, Hosseini2016]. For example, spending more time studying the examples provided by the system may increase the student’s knowledge, while at the same time indicating a weaker student who does not have enough knowledge of the covered concepts: weaker students may choose to study more examples to compensate for their lower knowledge levels. Consequently, the knowledge gain from studying these auxiliary learning materials is usually overpowered by student selection bias and is not represented correctly in the overall dataset.
A similar issue exists in current domain knowledge models. The automatic domain knowledge models that are based on students’ activities mainly model one type of learning material and ignore the relationships between the various kinds of learning materials [doan2019rank, casalino2017q]. In contrast, an ideal domain knowledge model should be able to model and discover the similarities between learning materials of different types.
In this paper, we simultaneously address the problems of student knowledge modeling and domain knowledge modeling, while considering the heterogeneity of learning material types. We introduce a new student knowledge model that is the first to concurrently represent student interactions with both graded and non-graded learning materials. Meanwhile, we discover the hidden concepts and similarities between different types of learning materials, as in a domain knowledge model. To do this, we pose this concurrent modeling as a multi-view tensor factorization problem, using one tensor to model student interactions with each learning material type. Through experiments on both synthetic and real-world datasets, we show that we can improve student performance prediction on graded learning materials, as measured by Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).
In summary, the contributions of this paper are:
1) proposing a personalized, multi-view student knowledge model (MVKM) that captures learning from multiple learning material types, modeling all of them jointly, while allowing for occasional student forgetting;
2) conducting experiments on both synthetic and real-world datasets showing that our proposed model outperforms conventional methods in predicting student performance;
3) examining the resulting learning material and student knowledge latent features to show the captured similarity between learning material types and the interpretability of the student knowledge model.
2 Related Work
Knowledge Modeling. Student knowledge modeling aims to quantify students’ knowledge state in the concepts or skills covered by learning materials at each learning point.
Pioneering approaches to student knowledge modeling, despite being successful, were not personalized, relied on a predefined (sometimes expert-labeled) set of concepts in learning materials, did not allow learned concepts to be forgotten by students, and modeled each concept independently of the others [Drasgow1990, Corbett1994, Pavlik2009, Linden2013]. Later, some student knowledge models aimed to resolve these shortcomings by learning different parameters for each (type of) student [Pardos2010a, Yudelson2013, Koedinger2013a], including decays to capture forgetting of concepts in learner models [Qiu2011, Lindsey2014a, Mozer2016], and capturing the relationships between concepts present in a course [ThaiNghe2011, GonzalezBrenes2013a]. Yet, these models assume that a correct domain knowledge model, mapping learning materials to course concepts, exists.
In recent years, new approaches aim to learn both the domain knowledge model and the student knowledge model at the same time [Lan2014c, GonzalezBrenes2015, Sahebi2016b, Wang2017, Zhang2017a, Doan2019a]. Our proposed model falls into this latest category, as it does not require any manual labeling of learning materials, while having the ability to use such information if it is available. It is personalized by learning lower-dimensional student representations, allows forgetting of concepts during student learning by adding a rank-based constraint on student knowledge, and models the relationships between learning materials.
Learning from Multiple Material Types. In the educational data mining (EDM) literature, learning materials come in various types, such as problems, examples, videos, and readings. While there have been some studies on the value of having various types of learning materials for educating students [Agrawal2011, Beck2008, Najar2014], the relationships between these material types and their combined effect on student knowledge and student performance are under-investigated.
Multiple learning material types have been studied in the literature for finding insights into differences in activity distributions or cluster patterns between high-grade and low-grade students [Velasquez2014, Wen2014]; they have been used as contextual features in scaffolding or in choosing among existing student models [Yudelson2008, SaoPedro2013]; they have been added to improve existing domain knowledge models, but only for graded material types and ignoring student sequences [Botelho2016, Chen2016, Desmarais2015, Liu2016, Pardos2013, Sahebi2018, Pelanek19]; or they have been classified into beneficial or non-beneficial for students [Alexandron2015]. However, to the best of our knowledge, none of these studies have explicitly modeled the contribution of various kinds of learning materials to student knowledge during the learning period, the interrelations among these learning materials, and their effect on student performance. The Bayesian Evaluation and Assessment framework found that assistance promoted students’ long-term learning. More recently, Huang et al. discovered that adapting their student modeling framework (FAST) to include various activity types may lead researchers to contradictory conclusions [Huang2015a]. More specifically, in one of their formulations, student example activity shows a positive association with model parameters, such as the probability of learning, while in another formulation this type of activity shows a negative association. Also, Hosseini et al. concluded that annotated examples show a negative relationship with students’ learning because of a selection effect: while annotated examples may help students learn, weaker students may study more annotated examples [Hosseini2016]. The model proposed in this paper considers student interactions with multiple learning material types, mitigating the overestimation of student knowledge by transferring information from interactions with graded materials, while accounting for the knowledge increase that happens as a result of student interactions with non-graded materials.
3 Multi-View Knowledge Modeling
3.1 Problem Formulation and Assumptions
We consider an online learning system in which students interact with and learn from multiple ($V$) types of learning materials. Each learning material type $v \in \{1, \dots, V\}$ includes a set of $M_v$ learning materials. A material type can be either graded or non-graded. Students’ normalized grades on tests, success or failure in compiling a piece of code, or scores on solving problems are all examples of graded learning feedback, whereas watching videos, posting comments in discussion forums, or interacting with annotated examples are instances of non-graded learning feedback that the system can receive. We model the learning period as a series of student attempts on learning materials, or time points ($a \in \{1, \dots, A\}$). To represent the interaction feedback of the $N$ students with learning materials of type $v$ during the whole learning period, we use a three-dimensional tensor $\mathcal{X}^v \in \mathbb{R}^{N \times A \times M_v}$. The slice $X^v_a$ of tensor $\mathcal{X}^v$ is a matrix representing student interactions with learning material type $v$ during one snapshot (attempt $a$) of the learning period. A row of this interaction matrix shows the feedback from student $s$’s interactions with all learning materials of type $v$ at attempt $a$; and the tensor element $x^v_{s,a,m}$ is the feedback value of student $s$’s activity on learning material $m$ of type $v$ at learning point $a$.
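As a concrete illustration of this data layout, the interaction tensor for one material type can be built from an activity log; the log format (student, attempt, material, feedback) and all sizes below are illustrative assumptions, not the paper's released data format:

```python
import numpy as np

# Hypothetical interaction log for one material type:
# (student, attempt, material, feedback) tuples.
logs = [(0, 0, 1, 0.8), (0, 1, 0, 1.0), (1, 0, 1, 0.5)]
n_students, n_attempts, n_materials = 2, 2, 3

# NaN marks unobserved cells: each student attempts only some materials.
X = np.full((n_students, n_attempts, n_materials), np.nan)
for s, a, m, feedback in logs:
    X[s, a, m] = feedback

# The slice X[:, a, :] is the student-by-material interaction matrix at attempt a.
```

One such tensor is built per material type, so tensors of different types share the student and attempt modes but differ in their material mode.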
We use the following assumptions in our model: (a) Each learning material covers some concepts that are presented in a course; the set of all course concepts is shared across learning materials; and the training data includes neither the learning materials’ contents nor their concepts. (b) Different learning materials have different difficulty or helpfulness levels for students. For example, one quiz can be more difficult than another, and one video lecture can be more helpful than another. (c) The course may follow a trend in presenting the learning materials, going from easier concepts to more difficult ones or alternating between easy and difficult concepts; despite that, students can freely interact with the learning materials and are not bound to a specific sequence. (d) As students interact with these materials, they learn the concepts presented in them, meaning that their knowledge of these concepts increases. (e) Since students may forget some course concepts, this knowledge increase is not strict. (f) Different students come with different learning abilities and initial knowledge values. (g) The gradual change of knowledge varies among students, but students can be grouped according to how their knowledge changes in different concepts; e.g., some students are fast learners compared to others. (h) Eventually, a student’s performance on a graded learning material, represented by a score, depends on the concepts covered in that material, the student’s knowledge of those concepts, the learning material’s difficulty/helpfulness, and the general student ability.
In addition to the above, we make an essential assumption (i) that connects the different parts of our model: a student’s knowledge obtained from interacting with one learning material type is transferable to other types of learning materials. In other words, students’ knowledge can be modeled and quantified in the same latent space for all learning material types. In the following, we first propose a single-view model for capturing the knowledge gained from one type of learning material (MVKM-Base) and then extend it to a multi-view model that can represent multiple types of learning materials.
3.2 MVKM Factorization Model
The Proposed Base Model (MVKM-Base). Following the assumptions mentioned in Section 3.1, particularly assumptions (a), (g), and (h), and assuming that students interact with only one learning material type, we model the student interaction tensor $\mathcal{X}$ as a factorization (mode tensor product) of three lower-dimensional representations: 1) an $N \times K$ student latent feature matrix $S$, 2) a $K \times A \times C$ temporal dynamic knowledge tensor $\mathcal{T}$, and 3) a $C \times M$ matrix $Q$ serving as a mapping between learning materials and course concepts. In other words, for each attempt $a$ we have $\hat{X}_a = S\, T_a\, Q$, where $T_a$ is the attempt-$a$ slice of $\mathcal{T}$. Matrix $S$ here maps students to latent learning features that can be used to group them (assumption (g)). Tensor $\mathcal{T}$ quantifies the knowledge growth associated with each learning feature in each of the concepts while attempting the learning materials. Accordingly, the product $S\, T_a$ represents each student’s knowledge in each concept at attempt $a$.
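This decomposition can be sketched with NumPy under assumed toy dimensions (all sizes and the random initialization are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, A, C, M = 4, 2, 3, 2, 5   # students, latent features, attempts, concepts, materials
S = rng.random((N, K))           # student latent feature matrix
T = rng.random((K, A, C))        # temporal dynamic knowledge tensor
Q = rng.random((C, M))           # learning-material-to-concept mapping

# Reconstruction: X_hat[s, a, m] = sum_k sum_c S[s, k] * T[k, a, c] * Q[c, m]
X_hat = np.einsum('sk,kac,cm->sam', S, T, Q)

# S combined with T gives each student's knowledge in each concept at each attempt.
knowledge = np.einsum('sk,kac->sac', S, T)
```

The per-attempt view of the same product is `X_hat[:, a, :] = S @ T[:, a, :] @ Q`.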
To increase interpretability, we enforce the contribution of different concepts in each learning material to be non-negative and to sum to one. Similarly, we enforce the same constraints on each student’s membership in the student latent features. Since each student can have a different ability (assumption (f)) and each learning material can have its own difficulty level (assumption (b)), we add two bias terms to our model ($b_s$ for each student $s$, and $b_m$ for each learning material $m$) to account for such differences. To capture the general score trends in the course (assumption (c)), we add a parameter $b_a$ for each attempt $a$. Accordingly, we estimate student $s$’s score on a graded learning material $m$ at attempt $a$ ($\hat{x}_{s,a,m}$) as in Equation 1. Here, $T_a \in \mathbb{R}^{K \times C}$ is a matrix capturing the relationship between student features and concepts at attempt $a$, $\mathbf{s}_s$ represents student $s$’s latent feature vector, and $\mathbf{q}_m$ shows material $m$’s concept vector.

$$\hat{x}_{s,a,m} = \mathbf{s}_s^{\top} T_a\, \mathbf{q}_m + b_s + b_m + b_a \quad (1)$$
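A minimal sketch of this Equation-1-style prediction, with all input values invented for illustration:

```python
import numpy as np

def predict_score(s_vec, T_a, q_vec, b_s, b_m, b_a):
    """Latent interaction (student x knowledge-slice x material) plus bias terms.
    Names follow the text; this is a sketch, not the authors' released code."""
    return float(s_vec @ T_a @ q_vec + b_s + b_m + b_a)

# Toy example: one student, one material, one attempt slice of the knowledge tensor.
s_vec = np.array([1.0, 0.0])                 # student latent features
T_a = np.array([[0.5, 0.2], [0.1, 0.3]])     # feature-to-concept knowledge at attempt a
q_vec = np.array([1.0, 0.0])                 # material concept vector
score = predict_score(s_vec, T_a, q_vec, b_s=0.1, b_m=0.05, b_a=0.0)
```

Here the latent interaction contributes 0.5 and the biases add 0.15, so the estimated score is 0.65.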
We use a sigmoid function $\sigma(z) = 1/(1+e^{-z})$ to estimate student interaction with a non-graded learning material, or with graded ones that have binary feedback: $\hat{x}_{s,a,m} = \sigma(\mathbf{s}_s^{\top} T_a\, \mathbf{q}_m + b_s + b_m + b_a)$.

Modeling Knowledge Gain while Allowing Forgetting. So far, this simple model captures latent feature vectors of students and learning materials, and learns $\mathcal{T}$ as a representation of student knowledge. However, it does not explicitly model students’ gradual knowledge gain (assumption (d)). We note that students’ knowledge increase is associated with the strength of concepts in the learning materials they interact with. As students interact with learning materials covering specific concepts, it is more likely for their predicted scores on the relevant learning materials to increase. With a Markovian assumption, we can say that if students have practiced some concepts at attempt $a$, we expect their scores at attempt $a+1$ to be higher than their scores at attempt $a$: $\hat{x}_{s,a+1,m} \geq \hat{x}_{s,a,m}$.
However, this inequality constraint is too strict, as students may occasionally forget the learned concepts (assumption (e)). To allow for this occasional forgetting and soften the constraint, we model the knowledge increase as a rank-based constraint that permits knowledge loss but penalizes it. We formulate this constraint as maximizing the value of $P$ in Equation 2. Essentially, this penalty term can be viewed as a prediction-consistent regularization: it helps to avoid significant changes in students’ knowledge levels, since their performance is expected to transition gradually over time.

$$P = \sum_{s=1}^{N} \sum_{m=1}^{M} \sum_{a=1}^{A-1} \log \sigma\!\left(\hat{x}_{s,a+1,m} - \hat{x}_{s,a,m}\right) \quad (2)$$
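The rank-based term can be sketched as follows; the tensor layout and toy values are assumptions for illustration:

```python
import numpy as np

def rank_penalty(X_hat):
    """Sum of log-sigmoid of score differences between consecutive attempts
    (a sketch of the Equation-2 term). It is largest (closest to zero) when
    predicted scores keep increasing, and it penalizes, but does not forbid,
    occasional decreases (forgetting)."""
    diffs = X_hat[:, 1:, :] - X_hat[:, :-1, :]        # attempt a+1 minus attempt a
    return float(np.sum(np.log(1.0 / (1.0 + np.exp(-diffs)))))

# Monotonically increasing predictions are penalized less than decreasing ones.
increasing = np.linspace(0.1, 0.9, 12).reshape(1, 4, 3)   # (students, attempts, materials)
decreasing = increasing[:, ::-1, :]
```

Because log-sigmoid is always negative, the term acts as a penalty whose magnitude grows with each predicted drop in score.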
The Proposed Multi-View Model (MVKM). We rely on our main assumption (i) to extend the model to capture learning from different learning material types. So far, we have assumed that course concepts are shared among learning materials (assumption (a)). With the knowledge transfer assumption (i), all learning materials of different types share the same latent concept space. Also, we represent student knowledge and student ability as parameters shared across all learning material types. Consequently, for each set of learning materials of type $v$, we can rewrite Equation 1 as:

$$\hat{x}^v_{s,a,m} = \mathbf{s}_s^{\top} T_a\, \mathbf{q}^v_m + b_s + b^v_m + b_a$$
An illustration of this decomposition, when considering two learning material types, is presented in Figure 1. Note that one shared student matrix $S$ and one shared knowledge gain tensor $\mathcal{T}$ appear in both types of learning materials.
We can learn the parameters of our model by minimizing the sum of squared differences between the observed ($x^v_{s,a,m}$) and estimated ($\hat{x}^v_{s,a,m}$) values over all learning material types $v$. For the learned parameters to generalize to unseen data, we regularize the unconstrained parameters using their L2 norms. As a result, we minimize the objective function in Equation 3, in which $\Omega_v$ denotes the observed interactions of type $v$, $\mu_v$ are hyperparameters that represent the relative importance of different learning material types, and $\lambda_t$ and $\lambda_b$ are hyperparameters that control the weights of the regularization terms on $\mathcal{T}$ and the bias terms.
$$\mathcal{L} = \sum_{v=1}^{V} \mu_v \sum_{(s,a,m) \in \Omega_v} \left( x^v_{s,a,m} - \hat{x}^v_{s,a,m} \right)^2 + \lambda_t \left\lVert \mathcal{T} \right\rVert_F^2 + \lambda_b \Big( \sum_{s} b_s^2 + \sum_{v,m} (b^v_m)^2 + \sum_{a} b_a^2 \Big) \quad (3)$$
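A sketch of this reconstruction objective in NumPy; masking with NaN for unobserved entries, and the function/argument names, are our assumptions:

```python
import numpy as np

def mvkm_loss(Xs, X_hats, mus, T, biases, lam_t, lam_b):
    """Equation-3-style objective: weighted squared error over the observed
    entries of every material type, plus L2 regularization on the
    unconstrained parameters (knowledge tensor and bias terms)."""
    loss = 0.0
    for X, X_hat, mu in zip(Xs, X_hats, mus):
        mask = ~np.isnan(X)                          # observed interactions only
        loss += mu * np.sum((X[mask] - X_hat[mask]) ** 2)
    loss += lam_t * np.sum(T ** 2)                   # squared Frobenius norm of T
    loss += lam_b * sum(np.sum(b ** 2) for b in biases)
    return float(loss)
```

Each material type contributes one (tensor, weight) pair, so graded and non-graded views can be weighted differently via the `mus`.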
Similarly, the knowledge gain and forgetting constraint presented in Equation 2 can be extended to the multi-view model; let $P_v$ denote Equation 2 applied to learning material type $v$. Eventually, we use a combination of the reconstruction objective function (Equation 3) and the learning and forgetting objective function (Equation 2) to model students’ knowledge increase, while representing their personalized knowledge and finding learning material latent features, as in Equation 4. Note that, since our goal is to minimize $\mathcal{L}$ and maximize $P_v$, we minimize $-P_v$. To balance between the accuracy of student performance prediction and the modeling of student knowledge increase, we use a non-negative trade-off parameter $\eta$:
$$\min_{S,\, \mathcal{T},\, Q^v,\, b} \; \mathcal{J} = \mathcal{L} - \eta \sum_{v=1}^{V} P_v \quad (4)$$
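A single stochastic step on the squared-error part of this objective can be sketched as below; the gradients are for the (halved) squared error of one observed entry only, and the regularizers, the rank penalty, and the non-negativity/sum-to-one projections are deliberately omitted:

```python
import numpy as np

def sgd_step(x, s_vec, T_a, q_vec, b, lr=0.05):
    """One stochastic gradient step for a single observed score x.
    b = [b_s, b_m, b_a]. Returns updated copies of all parameters.
    A simplified sketch, not the full MVKM update."""
    err = float(s_vec @ T_a @ q_vec + sum(b)) - x
    s_new = s_vec - lr * err * (T_a @ q_vec)       # d/ds of 0.5*err^2
    T_new = T_a - lr * err * np.outer(s_vec, q_vec)
    q_new = q_vec - lr * err * (T_a.T @ s_vec)
    b_new = [b_i - lr * err for b_i in b]          # each bias has gradient err
    return s_new, T_new, q_new, b_new
```

Iterating such steps over all observed entries of all views, plus the omitted terms, yields the full optimization.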
We use a stochastic gradient descent algorithm to minimize $\mathcal{J}$ in Equation 4. The parameters to learn are the students’ latent feature matrix ($S$), the dynamic knowledge of each concept at each attempt ($\mathcal{T}$), the importance of each concept in every learning material ($Q^v$), each student’s general ability ($b_s$), each learning material’s difficulty/helpfulness ($b^v_m$), and each attempt’s bias ($b_a$).

4 Experiments
We evaluate our model with three sets of experiments. First, to validate whether the model captures the variability of the observed data, we use it to predict unobserved student performances (Sec. 4.3). Second, to check whether our model represents valid student knowledge growth, we study the knowledge increase patterns between different types of students and across different concepts (Sec. 4.4). Finally, to study whether the model meaningfully recovers learning materials’ latent concepts, we analyze their similarities according to the learned latent feature vectors (Sec. 4.5). Without loss of generality, although the model is designed to handle multiple learning material types, we experiment with two learning material types. Before presenting the experiments, we go over our datasets and experiment setup.
Table 1: General statistics of each dataset.

Dataset | Graded type (#) | Non-graded type (#) | #stu | #att. | #rcds. | avg. score
Synthetic_NG | quiz (10) | discussion (15) | 1000 | 20 | 19991 | 0.6230
Synthetic_NG2 | quiz (10) | discussion (15) | 1000 | 20 | 19991 | 0.6984
Synthetic_G | quiz (10) | assignment (15) | 1000 | 20 | 19980 | 0.6255
MORF_QD | assignment (18) | discussion (525) | 459 | 25 | 6800 | 0.8693
MORF_QL | assignment (10) | lecture (52) | 1329 | 76 | 58956 | 0.7731
Canvas_H | quiz (10) | discussion (43) | 1091 | 20 | 13633 | 0.8648
4.1 Datasets
We use three synthetic and three real-world datasets (from two MOOCs) to evaluate the proposed model. Our choice of real-world datasets is guided by two factors, aligned with our assumptions: that they include multiple types of learning material, and that they allow the students to work freely with the learning materials in the order they choose. In the real-world datasets, we select the students who work with both types of learning materials, removing the learning materials that none of these students have interacted with. General statistics of each dataset are presented in Table 1. Figure 2 shows the score distributions of the graded learning material types in these datasets.
Synthetic Data.
We generate three synthetic datasets according to two characteristics: (1) whether both learning material types are graded vs. one of them is non-graded (or has binary observations); (2) whether the student scores are capped and their distribution is highly skewed vs. the score distribution is not capped and less skewed.
For creating the datasets, we follow assumptions similar to the ones made by our model. Expecting $M_1$ learning materials of type 1 and $M_2$ materials of type 2, we first generate a random sequence $\pi_s$ for each student $s$, which represents the student’s attempts on different learning materials. Considering $C$ latent concepts, we then create two random matrices $Q^1$ and $Q^2$ as the mappings between the learning materials and the underlying concepts, such that the concept values for each learning material sum to one. For the student knowledge gain assumption, we represent each student’s knowledge increase separately. Hence, we directly create a student knowledge tensor $\mathcal{K}$, instead of creating $S$ and $\mathcal{T}$ and multiplying them. To generate $\mathcal{K}$, we first generate a random matrix $K_1$ that represents all students’ initial knowledge in all concepts. For generating the knowledge matrices of the next attempts ($K_a$, $a > 1$), we use the following random process. For each student $s$, we generate a random number $p$ representing the probability of forgetting. If $p$ is at most the forgetting threshold, we assume no forgetting happens and increase the student’s row of the knowledge matrix according to the learning material the student has interacted with: $K_a(s) = K_{a-1}(s) + \epsilon\, Q(m_{s,a})$. Here, $\epsilon$ is a random increase effect and $m_{s,a}$ is the learning material that student $s$ has selected to interact with at timestamp $a$. Otherwise (a forget event), we reduce the student’s knowledge values in $K_{a-1}(s)$ by a random factor to obtain $K_a(s)$. We then use the $n$-mode tensor product to build $\mathcal{X}^1$ and $\mathcal{X}^2$, where $X^1_a = K_a Q^1$ and $X^2_a = K_a Q^2$. Finally, according to the student learning sequences $\pi_s$, we remove the “unobserved” values that are not in $\pi_s$ from $\mathcal{X}^1$ and $\mathcal{X}^2$.

To create different data types according to the first characteristic above, for the graded learning material types we keep the values in $\mathcal{X}$. For the non-graded ones, we use the same process as above, except that in the final step we set the observed entries to binary values according to the student sequence $\pi_s$. However, in many real-world scenarios, the score distribution of students is highly skewed, especially towards higher scores (Figure 2 shows this). To represent this skewness, in some of the generated datasets we clip all values greater than one to one.
We then create the following three datasets according to the above process: Synthetic_G, in which both learning material types are graded and scores are skewed; Synthetic_NG, in which only one of the learning material types is graded and scores are skewed; and Synthetic_NG2, in which only one of the learning material types is graded and scores are not skewed. We generate all synthetic data with 1000 students, 10 learning materials of type 1, 15 learning materials of type 2, $C$ latent concepts, and a maximum sequence length of 20 per student.
Canvas Network [CanvasNetwork2016]. This is a publicly available dataset collected from various courses on the Canvas Network platform (http://canvas.net). The open online course data comes from various fields of study, such as computer science, business and management, and the humanities. For each course, its general field of study is given in the data. The rest of the dataset is anonymized, such that course names, discussion contents, student IDs, submission contents, and course contents are not available. Each course can have different learning material types, including assignments, discussions, and quizzes. We experiment on the data from one course in this system, from the humanities field (the Canvas_H dataset). We use quizzes as the graded learning material type and discussions as the non-graded one.
MORF [Andres2016]. This is a dataset of the “educational data mining” course [ryanbigdata] at Coursera (https://www.coursera.org/), available via the MOOC Replication Framework (MORF). The course includes various learning material types, including video lectures, assignments, and discussion forums. Students’ history, in terms of their watched video lectures, submitted assignments, and discussion participation, along with the scores they received on assignments, is available in the data. From this course, we build two datasets, each pairing two learning material types: one with assignments as the graded type and discussions as the non-graded type (MORF_QD), and another with assignments as the graded type and video lecture views as the non-graded type (MORF_QL).
4.2 Experiment Setup
We use 5-fold student-stratified cross-validation to separate our datasets into train and test sets. At each fold, we use the interaction records of 80% of the students as training data. For the rest (20%) of the students (target students), we split their attempt sequences on the graded learning material type into two parts: an earlier and a later part. For performance prediction experiments, we predict performance on the graded learning material type in the later part, given the earlier part. To see how the proposed model captures knowledge growth, we do online testing, in which we predict the test data attempt by attempt (next-attempt prediction). Eventually, we report the average performance over all five folds. For selecting the best hyperparameters, we use a separate validation dataset. Our code and synthetic data are available at GitHub: https://github.com/sz612866/MVKMMultiviewTensor.
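The student-stratified split can be sketched as follows (function names and the seed are illustrative; the point is that whole students, not individual interactions, are held out):

```python
import numpy as np

def student_folds(student_ids, n_folds=5, seed=0):
    """Assign each student to exactly one fold, so that a test student's
    entire interaction sequence stays out of the training data."""
    rng = np.random.default_rng(seed)
    ids = np.array(sorted(set(student_ids)))
    rng.shuffle(ids)
    return np.array_split(ids, n_folds)

folds = student_folds(range(23), n_folds=5)
```

Each fold then serves once as the 20% test set while the remaining folds form the 80% training set.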
4.3 Student Performance Prediction
In this set of experiments, we test our model on predicting student scores on their future unobserved graded learning material attempts. More specifically, we estimate student scores on their future attempts, and compare them with their actual scores in the test data.
4.3.1 Baselines
We compare our model with state-of-the-art student performance prediction baselines:
Individualized Bayesian Knowledge Tracing (IBKT) [johnsonscaling, Yudelson2013]: This is a variant of the standard BKT model that assumes binary observations and provides individualization of student priors, learning rate, guess, and slip parameters (the code is from https://github.com/CAHLR/pyBKT).
Deep Knowledge Tracing (DKT) [Piech2015]: DKT is a pioneering algorithm that uses recurrent neural networks to model student learning on binary (success/failure) student scores.
Feedback-Driven Tensor Factorization (FDTF) [Sahebi2016]: This tensor factorization model decomposes the student interaction tensor into a learning material latent matrix and a knowledge tensor. However, it only models one type of learning material, does not capture student latent features, and does not allow students to forget learned concepts: it assumes that students’ knowledge strictly increases as they interact with learning materials.
Tensor Factorization Without Learning (TFWL): This model is similar to FDTF; the only difference is that TFWL does not constrain student knowledge to be increasing.
Rank-Based Tensor Factorization (RBTF) [Doan2019a]: This model has assumptions similar to FDTF’s, except that it allows for occasional forgetting of concepts and has extra bias terms. Compared to MVKM, it does not differentiate between student groups, it only uses students’ previous scores on graded learning materials to predict their future scores, and it uses a different tensor factorization strategy.
Bayesian Probabilistic Tensor Factorization (BPTF) [Xiong2010]: This recommender-systems model has a smoothing assumption over student scores in consecutive attempts.
AVG: This baseline uses the average of all students’ scores for all predictions.
As mentioned before, one major issue in real-world datasets is their skewness: on average, student grades are skewed towards a full (complete) score on quizzes/assignments. This skewness adds to the difficulty of predicting an accurate score for unobserved quizzes, since simply using an overall average score already provides a relatively good estimate of the real score. As a result, outperforming a simple average baseline is a challenging task.
The mentioned baselines all work on one type of learning material. Since our proposed MVKM model works with more than one learning material type, to be fair in our evaluations, we also run the baseline algorithms in a multi-view setup. To do this, we aggregate the data from all learning material types and use that as the input to these baselines. In those cases, we add “MV” to the end of their names. For example, FDTF-MV represents running FDTF on the aggregation of student interactions with multiple learning material types. In addition, for the knowledge tracing algorithms (IBKT and DKT), which are designed for binary student responses (correct or incorrect), we modify their settings to make them predict numerical scores as described below. First, we binarize students’ historical scores based on the median score: if a score is greater than the median, it is set to 1, and to 0 otherwise. Then, we use the probability of success generated by IBKT and DKT as the probability of the student receiving a score higher than the median. Eventually, the numerical predicted scores are obtained by viewing the probability output as the percentile of the student’s score on that specific question. Moreover, since these models require predefined knowledge components (KCs), we assume that each learning material is mapped to one KC.
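The score conversion described above can be sketched as follows; the use of a quantile lookup for the percentile step, and the function names, are our assumptions:

```python
import numpy as np

def binarize(scores):
    """Binarize historical scores at the median: 1 above the median, else 0."""
    scores = np.asarray(scores, dtype=float)
    return (scores > np.median(scores)).astype(int)

def prob_to_score(p_success, train_scores):
    """Map a knowledge tracing model's success probability to a numeric score
    by reading it as a percentile of the observed score distribution."""
    return float(np.quantile(np.asarray(train_scores, dtype=float), p_success))
```

For instance, a predicted success probability of 0.5 maps back to the median of the observed scores.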
In addition to the above, we compare our multi-view model with its basic variation (MVKM-Base), which uses the data from graded materials only, and with its multi-view variation without the learning and forgetting constraint (MVKM-W/OP).
Methods | Synthetic_NG | Synthetic_NG2 | Synthetic_G
 | RMSE | MAE | RMSE | MAE | RMSE | MAE
AVG | 0.3084±0.0072 | 0.2820±0.0093 | 0.5059±0.0115 | 0.4005±0.0115 | 0.3070±0.0039 | 0.2811±0.0050
RBTF | 0.2515±0.0126 | 0.2027±0.0081 | 0.3374±0.0234 | 0.2681±0.0146 | 0.2628±0.0113 | 0.2103±0.0080
FDTF | 0.4906±0.0172 | 0.4410±0.0207 | 0.6588±0.0215 | 0.5529±0.0226 | 0.5041±0.0184 | 0.4537±0.0213
TFWL | 0.5283±0.0168 | 0.4632±0.0178 | 0.6919±0.0132 | 0.5883±0.0156 | 0.5490±0.0053 | 0.5130±0.0076
BPTF | 0.1675±0.0048 | 0.1256±0.0061 | 0.3454±0.0140 | 0.2589±0.0072 | 0.1825±0.0064 | 0.1381±0.0050
IBKT | 0.4744±0.0118 | 0.4197±0.0140 | 0.6630±0.0122 | 0.5494±0.0152 | 0.4748±0.0076 | 0.4233±0.0098
DKT | 0.2694±0.0275 | 0.1911±0.0241 | 0.4536±0.0404 | 0.3569±0.0413 | 0.2716±0.0209 | 0.2047±0.0178
RBTF-MV | 0.2920±0.0069 | 0.2305±0.0078 | 0.4064±0.0213 | 0.3227±0.0147 | 0.2618±0.0155 | 0.2126±0.0130
FDTF-MV | 0.4078±0.0168 | 0.3402±0.0167 | 0.5861±0.0211 | 0.4688±0.0135 | 0.4888±0.0112 | 0.4538±0.0131
TFWL-MV | 0.4337±0.0139 | 0.3896±0.0133 | 0.6386±0.0161 | 0.5450±0.0194 | 0.5312±0.0137 | 0.4626±0.0145
BPTF-MV | 0.1718±0.0037 | 0.1457±0.0055 | 0.3438±0.0158 | 0.2603±0.0120 | 0.1533±0.0055 | 0.1184±0.0044
IBKT-MV | 0.4257±0.0142 | 0.3585±0.0155 | 0.6019±0.0124 | 0.4892±0.0165 | 0.4844±0.0068 | 0.4275±0.0089
DKT-MV | 0.4278±0.0313 | 0.3613±0.0318 | 0.6399±0.0515 | 0.5320±0.0526 | 0.3390±0.0252 | 0.2892±0.0245
MVKM-Base | 0.2007±0.1069 | 0.1498±0.0809 | 0.3026±0.0697 | 0.2273±0.0356 | 0.2097±0.0485 | 0.1565±0.0348
MVKM-W/OP | 0.1714±0.0089 | 0.1306±0.0089 | 0.2817±0.0316 | 0.2213±0.0245 | 0.1796±0.0345 | 0.1357±0.0190
Our Method (MVKM) | 0.1388±0.0048 | 0.1049±0.0056 | 0.2221±0.0074 | 0.1739±0.0048 | 0.1532±0.0128 | 0.1171±0.0097

Performance prediction results on synthetic datasets, measured by RMSE and MAE, shown with variance over 5-fold cross-validation.
Methods             MORF_QD                          MORF_QL                          CANVAS_H
                    RMSE           MAE               RMSE           MAE               RMSE           MAE
AVG                 0.2410±0.0227  0.1913±0.0161     0.2420±0.0108  0.1957±0.0067     0.0767±0.0121  0.0555±0.0040
RBTF                0.2711±0.0229  0.2132±0.0147     0.2572±0.0114  0.1980±0.0074     0.1571±0.0172  0.1235±0.0103
FDTF                0.3081±0.0437  0.2401±0.0329     0.3006±0.0194  0.2324±0.0151     0.1395±0.0259  0.0929±0.0119
TFWL                0.2750±0.0529  0.2003±0.0249     0.3090±0.3090  0.2237±0.0099     0.2377±0.0803  0.1186±0.0513
BPTF                0.2172±0.0128  0.1776±0.0082     0.2302±0.0068  0.1953±0.0048     0.1114±0.0120  0.0946±0.0082
IBKT                0.2756±0.0070  0.2281±0.0053     0.2646±0.0147  0.2174±0.0096     0.0856±0.0105  0.0692±0.0042
DKT                 0.3169±0.0374  0.2498±0.0313     0.2859±0.0061  0.2158±0.0075     0.0911±0.0322  0.0616±0.0173
RBTF_MV             0.2814±0.0282  0.2177±0.0222     0.2624±0.0193  0.1977±0.0136     0.1484±0.0098  0.1171±0.0054
FDTF_MV             0.3138±0.0441  0.2453±0.0387     0.2398±0.0137  0.1866±0.0091     0.1149±0.0085  0.0907±0.0068
TFWL_MV             0.2919±0.0275  0.1975±0.0160     0.3222±0.0208  0.2178±0.0165     0.1748±0.0600  0.0784±0.0269
BPTF_MV             0.2615±0.0129  0.2286±0.0114     0.2313±0.0070  0.1960±0.0041     0.1452±0.0100  0.1343±0.0081
IBKT_MV             0.2774±0.0204  0.2177±0.0099     0.2904±0.0098  0.2137±0.0062     0.0834±0.0125  0.0425±0.0049
DKT_MV              0.2938±0.0310  0.2352±0.0236     0.2540±0.0065  0.2185±0.0047     0.079±0.0247   0.0496±0.0065
MVKMBase            0.2242±0.0328  0.1669±0.0207     0.2277±0.0119  0.1724±0.0081     0.0666±0.0159  0.0411±0.0040
MVKMW/OP            0.2385±0.0196  0.1771±0.0104     0.2450±0.0145  0.1814±0.009      0.0649±0.0111  0.0388±0.0027
Our Method (MVKM)   0.2088±0.0229  0.1603±0.0142     0.2150±0.0127  0.1654±0.0104     0.0613±0.0112  0.0362±0.0028

Table 3: Performance prediction results on real-world datasets, measured by RMSE and MAE.
4.3.2 Performance Metrics and Comparison
In this task, our target is to accurately estimate actual student scores. To evaluate how close the predicted values are to the actual ones, we use the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) between the predicted and actual student scores. Tables 2 and 3 show the performance of the different methods on the synthetic and real-world data, respectively. Our proposed model outperforms the other baselines on the synthetic data and, in general, has the best performance on the real-world datasets.
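For reference, the two metrics can be computed directly (a straightforward NumPy sketch):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between actual and predicted scores."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error between actual and predicted scores."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))
```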
MVKMBase vs. Single Material Type Baselines. Comparing MVKMBase with the other algorithms that use student scores only shows that MVKMBase has consistently lower error than most baselines, on both synthetic and real-world datasets. This result demonstrates MVKMBase's ability to capture the data variance and the validity of its assumptions for real-world graded data. Compared to AVG, MVKMBase can represent more variability; compared to RBTF, the student latent features in MVKMBase lead to improved results; compared to FDTF, the forgetting factor results in lower error; and compared to BKT and DKT, modeling the learning material concepts and enforcing learning through a rank-based constraint improves performance. The only baseline that outperforms MVKMBase in some setups is BPTF. In particular, BPTF has lower RMSE and MAE on the Synthetic_NG and Synthetic_G datasets, which are skewed. On the real-world datasets, it performs better than MVKMBase on MORF_QD, which is sparser and has a slightly higher average score than the other two. This suggests that BPTF handles skewed data better than MVKMBase. One potential reason is BPTF's smoothing assumption which, in contrast with MVKMBase's rank-based knowledge increase, results in more homogeneous score predictions for each student.
MVKM: Multiple vs. Single Material Types. Comparing MVKM's results with the MVKMBase model, we can see that using data from multiple learning material types improves performance prediction. This verifies our assumptions regarding knowledge transfer between learning material types through the knowledge gain in the shared concept latent space. It is especially notable given that in the other models, e.g., all models except DKT in MORF_QD, adding different learning material types increases the prediction error. This error increase is most pronounced for the BPTF model on the real-world datasets and the DKT model on the synthetic ones, showing that merely aggregating data from various resources, without appropriate modeling, can even harm prediction results. What differentiates MVKM from these baselines is its specific setup, in which each learning material type is modeled separately while a shared knowledge space, shared student latent features, and shared knowledge gain are maintained.
Learning and Forgetting Effect. To further test the effect of our knowledge gain and forgetting constraint, we compare MVKM with MVKMW/OP, a variation of our proposed model without the rank-based constraint in Equation 2. MVKM outperforms MVKMW/OP on all datasets, showing that the soft knowledge increase and forgetting assumption is essential to correctly capturing the variability in students' learning. In particular, comparing MVKMW/OP's results with MVKMBase, the single-view version that includes the rank-based learning constraints, we can measure the effect of adding multiple learning material types vs. the effect of adding the learning and forgetting constraints to the MVKM model. On the CANVAS_H dataset, adding multiple learning material types is more effective than the learning constraint, while on the MORF datasets, the learning constraint is more important than modeling multiple types of learning materials. Nevertheless, the two are not mutually exclusive, and both are important parts of the model.
Hyperparameter Tuning. Using a separate validation set, we experiment with various values (grid search) for the model hyperparameters to select the most representative ones for our data. Specifically, we first vary the student latent feature dimension in , the question latent feature dimension in , the penalty weight in , the Markovian step in , and the learning resource importance parameter in . Once we find a good set of hyperparameters from the coarse-grained grid search, we search values close to the optimal ones to find the best fine-grained values for these hyperparameters. The best resulting hyperparameter values for each dataset are listed in Table 4. We use  as the trade-off parameter for the graded learning material and  for the other learning material. As we can see, in both the synthetic and the real-world data, the learning and forgetting constraint is more important (larger ) when a non-graded learning material type is present. This shows that binary interaction data, unlike student grades (or scores), is not precise enough to represent students' gradual knowledge gain in the absence of a learning and forgetting constraint. Also, comparing  in MORF_QD vs. MORF_QL, we can see that video lectures are more important than discussions in predicting students' performance.
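The coarse-then-fine grid search described above can be sketched generically (an illustrative sketch; `evaluate` stands in for training the model and measuring validation error, and is not part of the paper's code):

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustively evaluate every combination in param_grid and return
    the combination with the lowest validation error.

    param_grid: dict mapping hyperparameter name -> list of candidate values
    evaluate:   callable taking a dict of hyperparameters, returning an error
    """
    names = list(param_grid)
    best_params, best_err = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        err = evaluate(params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err
```

A second, finer grid is then built around the returned `best_params` and searched the same way.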
Dataset         K   C   m

Synthetic_NG    3   3   0.2   1   0.1    0.1    1   0.01    0.001
Synthetic_NG2   3   3   0.2   1   0.1    0.1    1   0.001   0.001
Synthetic_G     3   3   0.1   1   0.4    0.1    1   0.001   0.001
MORF_QD         39  5   1     1   0.05   0.1    1   0       0
MORF_QL         35  9   0.6   1   0.5    0.1    1   0       0
CANVAS_H        28  7   2.0   1   0.5    0.01   1   0       0

Table 4: The best hyperparameter values for each dataset.
4.4 Student Knowledge Modeling
In this set of experiments, we answer two main research questions: 1) Can our model's learning and forgetting constraint capture meaningful knowledge trends across concepts for students as a whole? and 2) Is each individual student's knowledge growth representative of their learning? To answer these questions, we look at the estimated student knowledge tensor ().
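If the estimated knowledge tensor is stored as a (students × concepts × attempts) array, the per-concept average curves examined below reduce to a mean over the student axis (a trivial sketch; the shape convention is our assumption):

```python
import numpy as np

def average_knowledge_curves(T_hat):
    """Average estimated knowledge over all students.

    T_hat:   array of shape (students, concepts, attempts)
    returns: array of shape (concepts, attempts), one curve per concept
    """
    return np.asarray(T_hat).mean(axis=0)
```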
To answer the first question, we examine the average student knowledge growth on different concepts. Figure 3 shows the average knowledge of all students in different concepts (represented by different colors) during the whole course period (x-axis) for the MORF_QL and CANVAS_H datasets (MORF_QD shows similar patterns to MORF_QL and is omitted due to space limitations). Note that, for a clearer visualization, we only show  out of  concepts from the MORF_QL dataset in the figure. We can see that, on average, students' knowledge in the different concepts increases. In particular, in MORF_QL, the initial average knowledge on concept 3 is lower than on concepts 5 and 7. However, students learn this concept rapidly, as shown by the increase in knowledge level around the tenth attempt. As the knowledge growth is less smooth in this concept than in the other two (e.g., the drop around the  attempt), students are more likely to forget it rapidly. Eventually, the average student knowledge in all concepts ends up close together. On the other hand, in CANVAS_H, the average initial knowledge in the different concepts is relatively close, but students end up with different knowledge levels in different concepts at the end of the course, especially in concepts 0 and 4. Also, all six concepts show large fluctuations across the attempts. Overall, students have a significant knowledge gain in the first few attempts, after which the knowledge gain slows down. This is aligned with our expectation of students' knowledge acquisition throughout the course.
To show the effect of the learning and forgetting constraint in MVKM, we look at student knowledge acquisition under the MVKMW/OP model, which removes this constraint. MVKMW/OP's average student knowledge in different concepts throughout all attempts is shown in Figure 4. We can see that, despite its acceptable performance prediction error, MVKMW/OP's estimated knowledge trends are hard to interpret and counterintuitive. For example, many concepts (such as concept 3 in MORF_QL) show a U-shaped curve, which would mean that students have high prior knowledge in these concepts, forget them in the middle of the course, and then relearn them at the end of the course. In some cases, such as concept 1 in CANVAS_H, students lose knowledge and forget what they already knew by the end of the course. This demonstrates the necessity of the learning and forgetting penalty term in MVKM.
For the second question, we check whether there are meaningful differences between the knowledge gain trends of different students. To do this, we apply spectral clustering to the students' latent feature matrix to discover different groups of students, and then compare the learning curves of students from different clusters. The number of clusters is determined by the significance of the difference in average performance between clusters. Based on students' latent features from our model, we obtained 3 clusters of students for the  course and 2 clusters for the  and  courses. To see the differences between these groups, we sample one student from each cluster in each real-world dataset. Figure 5 shows these sample students' knowledge gain, averaged over all concepts, for the MORF_QD and MORF_QL datasets (CANVAS_H shows similar patterns to MORF_QD and is omitted due to space limitations). The figures show that different students start with different initial prior knowledge. For example, in MORF_QL, student  starts with lower prior knowledge than student  and ends up with lower final knowledge. The figure also shows different knowledge gain trends across students, particularly in MORF_QD. For example, student  starts with lower prior knowledge than the other two students, but has a faster knowledge growth and catches up with them around attempt . However, this student's knowledge growth slows down after a while, ending up lower than the other two by the end of the course. To see whether the quantified knowledge is meaningful, we compare students' knowledge growth with their scores. Students , , and  in MORF_QD have average grades , , and ; in MORF_QL,  and  have average grades  and . These align with the knowledge levels shown in the figure. These observations show that MVKM can meaningfully differentiate between different students' knowledge growth.
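The grouping step above (spectral clustering on the students' latent features) can be sketched in a simplified, NumPy-only form: a spectral embedding of an RBF similarity graph followed by plain k-means with a deterministic farthest-point initialization. This is an illustrative sketch; in practice an off-the-shelf implementation such as scikit-learn's SpectralClustering would typically be used.

```python
import numpy as np

def spectral_clusters(U, n_clusters, n_iter=50):
    """Cluster rows of a student latent-feature matrix U (students x features)."""
    # RBF similarity between students
    d2 = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2)
    # Symmetrically normalized graph Laplacian
    deg = W.sum(1)
    L = np.eye(len(U)) - W / np.sqrt(deg[:, None] * deg[None, :])
    # Embed students into the eigenvectors of the smallest eigenvalues
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, :n_clusters]
    # Plain k-means on the embedding, farthest-point initialization
    centers = [X[0]]
    for _ in range(1, n_clusters):
        dists = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[dists.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(0)
    return labels
```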
4.5 Learning Resource Modeling
In this section, we evaluate how well our model can represent the variability and similarity of different learning materials. We mainly focus on two questions: 1) Are the learning materials' biases consistent with their difficulty levels? 2) Are the discovered latent concepts for learning materials (matrix ) representative of the actual conceptual groupings of learning materials in the real datasets?
Bias Evaluation. For the first question, since we do not have access to the learning materials' difficulty levels, we use the average student scores on them as a proxy for difficulty. As a result, we only use graded learning materials for this analysis. We calculate the Spearman correlation between the question bias captured by our model and the average score of each question. The Spearman correlation is 0.779 on MORF_QD, 0.597 on MORF_QL, and 0.960 on CANVAS_H. We find that the question bias derived from MVKM is highly correlated with the average question score: the lower the actual average grades, the lower the learned bias values.
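The reported correlations can be reproduced with a plain rank-correlation computation (a minimal sketch that assumes no tied values; in practice `scipy.stats.spearmanr` handles ties):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation between two sequences (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))
```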
Within-Type Concept Evaluation. For the second question, we would like to know how closely the learning materials' discovered latent concepts resemble their real-world similarities. To evaluate the real-world similarities, we rely on two scenarios: 1) learning materials that are arranged closely to each other in the course structure, either in the same module or in consecutive modules, are similar to each other (course structure similarity); 2) learning materials that are similar to each other have similar concepts and contents (content similarity). Since only one of our real-world datasets, MORF_QL, includes the information required for these scenarios, we use this dataset for the rest of this paper. Regarding the first scenario, the course includes an ordered list of modules, each of which includes an ordered list of videos, in addition to the assignments associated with each module.
For the second scenario, because the learning materials in our datasets are not labeled with their concepts, we use their textual contents (which are not used in MVKM) as a representation of their concepts. In particular, we have transcripts for 40 video lectures and the question text for 8 quizzes. We note that if two learning materials present the same concepts, their textual contents should also be similar to each other. Accordingly, we build content-based clusters of learning materials, each containing learning materials that are conceptually similar to each other. Specifically, to cluster the learning materials according to their contents, we use spectral clustering on the latent topics discovered by Latent Dirichlet Allocation (LDA) [blei2003latent] on the learning materials' textual contents. In the same way, we can cluster the learning materials according to the latent concepts discovered by MVKM: as in the textual analysis, we use spectral clustering on the discovered  matrices to form clusters of learning materials. To do this, we first consider only one learning material type (the video lectures) and then move on to the similarities between two types of learning materials (both video lectures and assignments).
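The content-based clustering pipeline (LDA topic distributions fed into spectral clustering) could be assembled as follows. This is a sketch assuming scikit-learn; the paper does not specify its implementation, and the vectorizer settings and parameter choices here are our own assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import SpectralClustering

def content_clusters(texts, n_topics, n_clusters, seed=0):
    """Cluster learning materials by textual content: LDA topic
    distributions followed by spectral clustering."""
    counts = CountVectorizer(stop_words="english").fit_transform(texts)
    topics = LatentDirichletAllocation(
        n_components=n_topics, random_state=seed).fit_transform(counts)
    return SpectralClustering(
        n_clusters=n_clusters, random_state=seed).fit_predict(topics)
```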
Figure 6 shows the results for within-type learning material similarity in video lectures: the 8 clusters discovered using MVKM and the 8 clusters discovered using video lecture transcripts. Each cluster is shown within a box with an associated number. Each video lecture is identified by its module (or week in the course), its order in the module sequence, and its name. For ease of comparison, we color the video names according to their LDA content clusters. Looking at the LDA content clusters, we can see that although some lectures in the same module fall in the same cluster (e.g., videos 1, 2, 3, and 4 from week 7 are all in cluster 7), some lectures do not cluster with the other videos in their module. For example, video 5 in week 7 is in cluster 2, with the pioneering knowledge tracing methods. This shows that, in addition to structural similarities, content similarities also exist among learning materials. Looking at the MVKM clusters, we can see that they mostly represent course structure similarity: learning materials from the same module are grouped together. For example, all videos of week 3 are grouped in cluster 2. However, in many cases where the structure similarity in the clusters is disrupted, it is because of content similarity among video lectures. For example, video 5 in week 7, which was clustered with the pioneering knowledge tracing methods in the LDA content clusters, is also clustered with them in the MVKM clusters.
Between-Type Concept Evaluation. To evaluate MVKM's discovered similarities between different types of learning materials, we examine the assignments and video lectures in MORF_QL. To do this, we build LDA-based clusters using the assignment texts and video lecture transcripts; these clusters are shown in Figure 7. We also cluster the learning materials using spectral clustering on the concatenation of their  matrices (Figure 7). Because the assignments bring more information to the clustering algorithms, the clustering results differ from the clusters of video lectures only. Similar to the within-type concept evaluation results, we can still see the effect of both content and structure similarities in the video lectures clustered together by MVKM. For example, videos 1 and 3 of week 2 are clustered with later weeks' videos because of content similarity (cluster 1 in Figure 7), while video 2 of week 2 is also clustered with them because it comes between these two videos in the course sequence.
Additionally, between video lectures and assignments, the clusters closely follow the course structure. The assignments in this course come at the end of their module and right before the next module starts; for example, "Assignment 3" appears after video 5 of week 3 and before video 1 of week 4. We can see that all assignments, except "Assignment 1" (the first one), are clustered with their immediately following video lecture. Moreover, we can see the effect of content similarity between assignments and video lectures in the differences between Figures 6 and 7. For example, without including assignments, "Week 1 Introduction" and "W1 V1: Big Data in Education" were clustered together in cluster 7 of Figure 6. However, after adding assignments, because of the content similarity between "Assignment 3" and "Week 1 Introduction" (cluster 2 in Figure 7), "Week 1 Introduction" and "W1 V1: Big Data in Education" are clustered with the video lectures that are structurally close to "Assignment 3".
Altogether, we demonstrated that the learning materials' bias parameters in MVKM are aligned with their difficulties; that the learning materials' latent concepts discovered by our model represent the materials' real-world similarities well, both in structure and in content; and that MVKM can successfully unveil these similarities between different types of learning materials without observing their content or structure.
5 Conclusions
In this paper, we proposed a novel Multi-View Knowledge Model (MVKM) that models students' knowledge gain from different types of learning materials while simultaneously discovering the materials' latent concepts. Our proposed tensor factorization model explicitly represents students' knowledge growth and allows for occasional forgetting of learned concepts. Our extensive evaluations on synthetic and real-world datasets show that MVKM outperforms other baselines in the task of student performance prediction, effectively captures students' knowledge growth, and represents similarities between different types of learning materials.
6 Acknowledgments
This paper is based upon work supported by the National Science Foundation under Grant No. 1755910.