Assessing the Knowledge State of Online Students – New Data, New Approaches, Improved Accuracy

09/04/2021 ∙ by Robin Schmucker, et al. ∙ Carnegie Mellon University

We consider the problem of assessing the changing knowledge state of individual students as they go through online courses. This student performance (SP) modeling problem, also known as knowledge tracing, is a critical step for building adaptive online teaching systems. Specifically, we conduct a study of how to utilize various types and large amounts of student log data to train accurate machine learning models that predict the knowledge state of future students. This study is the first to use four very large datasets made available recently from four distinct intelligent tutoring systems. Our results include a new machine learning approach that defines a new state of the art for SP modeling, improving over earlier methods in several ways: First, we achieve improved accuracy by introducing new features that can be easily computed from conventional question-response logs (e.g., the pattern in the student's most recent answers). Second, we take advantage of features of the student history that go beyond question-response pairs (e.g., which video segments the student watched, or skipped) as well as information about prerequisite structure in the curriculum. Third, we train multiple specialized models for different aspects of the curriculum (e.g., specializing in early versus later segments of the student history), then combine these specialized models to create a group prediction of student knowledge. Taken together, these innovations yield an average AUC score across these four datasets of 0.807, compared to the previous best logistic regression approach score of 0.766, and also outperform state-of-the-art deep neural net approaches. Importantly, we observe consistent improvements from each of our three methodological innovations, in each dataset, suggesting that our methods are of general utility and likely to produce improvements for other online tutoring systems as well.


1 Introduction

Intelligent online tutoring systems (ITS’s) are now used by millions of students worldwide, enabling access to quality teaching materials and to personally customized instruction. These systems depend critically on their ability to track the evolving knowledge state of the student, in order to deliver the most effective instructional material at each point in time. Because the problem of assessing the student’s evolving knowledge state (also referred to as student performance modeling, or knowledge tracing (KT)) is so central to successfully customizing instruction to the individual student, it has received significant attention in recent years.

The state of the art that has emerged for this student performance modeling problem involves applying machine learning algorithms to historical student log data. The result produced by the machine learning algorithm is a model, or computer program, that outputs the estimated knowledge state of any future student at any point in the lesson, given the sequence of steps they have taken up to this point in the lesson (e.g., Corbett1994:Knowledge, Pavlik2009:Performance, Piech2015:Deep, Pandey2019:Self, Gervet2020:Deep, Shin2021:Saint+). These systems typically represent the student knowledge state as a list of probabilities that the student will correctly answer a particular list of questions that cover the key concepts (also known as "knowledge components" (KC's)) in the curriculum. Most current approaches estimate the student's knowledge state by considering only the log of questions asked and the student's answers, though recent datasets provide considerably more information such as the length of time taken by the student to provide answers, specific videos the student watched and whether they watched the entire video, and what hints they were given as they worked through specific practice problems.

This paper seeks to answer the question of which machine learning approach produces the most accurate estimates of students’ knowledge states. To answer this question we perform an empirical study using data from over 750,000 students taking a variety of courses, to study several aspects of the question including (1) which types of machine learning algorithms work best? (2) which features of a student’s previous and current interactions with the ITS are most useful for predicting their current knowledge state? (3) how valuable is background information about curriculum prerequisites for improving accuracy? and (4) can accuracy be improved by training specialized models for different portions of the curriculum? We measure the quality of alternative approaches by how accurately they predict which future questions the student answers correctly.

More specifically, we present here the first comparative analysis of recent state-of-the-art algorithms for student performance modeling across four very large student log datasets that have recently become available, which are each approximately 10 times larger than earlier publicly available datasets, which cover a variety of courses in elementary mathematics, as well as teaching English as a second language, and which range across different teaching objectives such as initial assessment of student knowledge state, test preparation, and extra-curricular tutoring complementing K-12 schooling. We show that accuracy of student performance modeling can be improved beyond the current state of the art through a combination of techniques including incorporating new features from student logs (e.g., time spent on previously answered questions), incorporating background information about prerequisite/postrequisite topics in the curriculum, and training multiple specialized models for different parts of the student experience (e.g., training distinct models to assess new students during the first 10 steps of their lesson, versus students taking the post-lesson quiz). The fact that we see consistent improvements in accuracy across all four datasets suggests the lessons gained from our experiments are fairly general, and not tied to a specific type of course or specific tutoring system.

To summarize, the key contributions of this paper include:

  • Cross-ITS study on modern datasets. We present the first comparative analysis of state-of-the-art approaches to student performance modeling across four recently published, large and diverse student log datasets taken from four distinct intelligent tutoring systems (ITS’s), resulting in the largest empirical study to date of student performance modeling. These four systems teach various topics in elementary mathematics or English as a second language, and the combined data covers approximately 200,000,000 observed actions taken by approximately 750,000 students. Three of these datasets have been made publicly available over the past few years, and we make the fourth available with publication of this paper.

  • Improved student performance modeling by incorporating prerequisite and hierarchical structure across knowledge components. All four of the datasets we consider provide a vocabulary of knowledge components (KC's) taught by the system. Two datasets provide meta-information about which knowledge components are prerequisites for which others (e.g., Add and subtract negative numbers is a prerequisite for Multiply and divide negative numbers), and one provides a hierarchical structure (e.g., the knowledge component Add and subtract vectors falls hierarchically under Basic vectors). We found that incorporating the pre- and postrequisite information into the model resulted in substantial improvements in accuracy when predicting which future questions the student would answer correctly (e.g., by including features such as the counts of correctly answered questions related to prerequisite and postrequisite topics). Features derived from the hierarchical structure lead to improvements as well, but to a lesser degree than the ones extracted from the prerequisite structures.

  • Improved student performance modeling by incorporating log data features beyond which questions were answered correctly. Although earlier work has focused on student performance models that consider only the sequence of questions asked and which were answered correctly, modern log data includes much more information. We found that incorporating this information yields accuracy improvements over the current state of the art. For example, we found that features such as the length of time the student took to answer previous questions, the number of videos watched on the KC, and whether the student is currently answering a question in a pre-test, post-test, or a practice session were all useful features for predicting whether a student would correctly answer the next question.

  • Improved student performance modeling by training multiple models for specialized contexts, then combining them. We introduce a new approach of training multiple distinct student assessment models for distinct learning contexts. For example, training a distinct model for the "cold start" problem of assessing students who have just begun the course and have little log data at this point yields significant improvements in the accuracy of student knowledge state assessment. Furthermore, combining the predictions of multiple specialized models (e.g., one trained for the current study module type, and one trained for students who have already answered a given minimum number of questions in the course) leads to even further improvements.

  • The above improvements can be combined. The above results show improvements in student performance modeling due to multiple innovations. By combining these we achieve overall improvements over state-of-the-art logistic regression methods that reduce AUC error by 17.3% on average over these four datasets, as summarized in Table 1.

                                    ElemMath2021      EdNet KT3         Eedi              Junyi15           Average
                                    ACC     AUC       ACC     AUC       ACC     AUC       ACC     AUC       ACC     AUC
    Prev. SOTA: Best-LR             0.7569  0.7844    0.7069  0.7294    0.7343  0.7901    0.8425  0.7620    0.7602  0.7665
    Add Q-R features: Best-LR+      0.7616  0.7924    0.7163  0.7450    0.7451  0.8035    0.8497  0.7887    0.7682  0.7824
    Add novel features: AugmLR      0.7653  0.7975    0.7184  0.7489    0.7492  0.8091    0.8626  0.8582    0.7739  0.8034
    Multimodel: Comb-AugmLR         0.7673  0.8011    0.7210  0.7543    0.7501  0.8107    0.8637  0.8616    0.7755  0.8069
    Percent error reduction         4.28%   7.75%     4.81%   9.20%     5.95%   9.81%     13.46%  41.85%    6.41%   17.32%
    Table 1: Improvements to the state of the art (SOTA) in student performance modeling, due to innovations introduced in this paper. The previous logistic regression state-of-the-art approach is Best-LR. Performance across each of the four diverse datasets improves with each of our three suggested extensions to the Best-LR algorithm. Best-LR+ extends Best-LR by adding new features calculated from the question-response (Q-R) data available in most student logs. AugmentedLR further adds a variety of novel features that go beyond question-response data (e.g., which videos the student watched or skipped). Combined-AugmentedLR further extends the approach by training multiple logistic regression models (Multimodel) on different subsets of the training data (e.g., based on how far into the course the student is currently). Together, these extensions to the previous state of the art produce significant improvements across each of these four datasets, improving the AUC score from 0.766 to 0.807 on average across the four datasets – a reduction of 17.3% in average AUC error (i.e., in the difference between the observed AUC and the ideal perfect AUC of 1.0).
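For clarity, the AUC error reduction reported in the last row of Table 1 is measured relative to a perfect AUC of 1.0, as stated in the caption. Using the cross-dataset average AUC values from the table:

```latex
\text{AUC error reduction}
  \;=\; \frac{\mathrm{AUC}_{\text{new}} - \mathrm{AUC}_{\text{old}}}{1 - \mathrm{AUC}_{\text{old}}}
  \;=\; \frac{0.8069 - 0.7665}{1 - 0.7665}
  \;\approx\; 0.173 \;=\; 17.3\%.
```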

2 Related Work

Student performance modeling techniques estimate a student’s knowledge state based on their interactions with the ITS. These performance models and the estimates of student proficiency they produce are a key component of current ITS’s which allow the tutoring system to adapt to each student’s personal knowledge state at each point in the curriculum. In the literature, performance modeling is also sometimes referred to as knowledge tracing, proficiency modeling, or student assessment. There are three main categories of performance modeling techniques: (i) Markov process based probabilistic modeling, (ii) logistic regression and (iii) deep learning based approaches.

Markov process based techniques, such as Bayesian Knowledge Tracing (BKT) [Corbett and Anderson (1994)] and its various extensions (Baker2008:More, Pardos2010:Modeling, Pardos2011:KT, Qiu2011:Does, Yudelson2013:Individualized, Sao2013:Incorporating, Khajah2016:Deep, Kaser2017:Dynamic), have a long history in the educational data mining (EDM) community. Most approaches in this family determine a student's proficiency by performing probabilistic inference using a two-state Hidden Markov Model containing one state representing that the student has mastered a particular concept, and one state representing non-mastery. A recent study comparing various performance modeling algorithms across nine real-world datasets [Gervet et al. (2020)] found that when applied to large-scale datasets, BKT and its extension BKT+ [Khajah et al. (2016)] are very slow to train and their predictive performance is not competitive with more recent logistic regression and deep learning based approaches. The recently released Python package pyBKT promises faster training times for Bayesian Knowledge Tracing models [Badrinath et al. (2021)].

Logistic regression models take as input a vector of manually specified features calculated from a student's interaction history, then output a predicted probability that this student has mastered a particular concept or knowledge component (KC) (often implemented as the probability that they will correctly answer a specified question). Common approaches include IRT [van der Linden and Hambleton (2013)], LFA [Cen et al. (2006)], PFA [Pavlik Jr et al. (2009)], DASH (Lindsey2014:Improving, Gonzalez2014:General, Mozer2016:Predicting) and its extension DAS3H [Choffin et al. (2019)], as well as Best-LR [Gervet et al. (2020)]. While there exists a variety of logistic regression models, they mainly rely on two types of features: (i) one-hot encodings of question and KC identifiers (a one-hot encoding of the question ID, for example, is a vector whose length equals the number of possible questions, containing all zeros except for a single 1 indicating which question ID is being encoded); and (ii) count features capturing the student's number of prior attempts to answer questions, and the number of correct and incorrect responses. DAS3H incorporates a temporal aspect into its predictions by computing count features for different time windows. LKT [Pavlik Jr et al. (2020)] is a flexible framework for logistic regression based student performance modeling which offers a variety of features based on question-answering behaviour. Among others, it offers decay functions to capture recency effects as well as features based on the ratio of correct and incorrect responses. Section 4 discusses multiple regression models in more detail and proposes alternative features which are able to incorporate rich information from various types of log data.

Deep learning based models, like logistic regression models, take as input the student log data, and output a predicted probability that the student will answer a specific question correctly. However, unlike logistic regression, deep learning models have the ability to automatically define useful features computable from the sequence of log data, without relying on human feature engineering. A wide range of neural architectures have been proposed for student performance modeling. DKT [Piech et al. (2015)] is an early work that uses Long Short-Term Memory (LSTM) networks [Hochreiter and Schmidhuber (1997)] to process student interactions step by step. DKVMN [Zhang et al. (2017)] is a memory-augmented architecture that can capture multi-KC dependencies. CKT [Shen et al. (2020)] uses a convolutional neural network [LeCun et al. (1999)] to model individualized learning rates. Inspired by recent advances in the natural language processing (NLP) community, multiple transformer [Vaswani et al. (2017)] based approaches have been proposed (SAKT, [Pandey and Karypis (2019)]; AKT, [Ghosh et al. (2020)]; SAINT, [Choi et al. (2020)]; SAINT+, [Shin et al. (2021)]). Graph-based modeling approaches infer knowledge states based on the structure induced by question-KC relations (GKT, [Nakagawa et al. (2019)]; HGKT, [Tong et al. (2020)]; GIKT, [Yang et al. (2020)]). For a detailed survey of recent student performance modeling approaches we refer to Liu2021:Survey.

Most student performance modeling approaches focus exclusively on question-answering behaviour. A student's sequence of past interactions with the tutoring system is modelled as $x_{1:t} = (x_1, \dots, x_t)$. The $i$-th response is represented by a tuple $x_i = (q_i, a_i)$, where $q_i$ is the question (item) identifier and $a_i \in \{0, 1\}$ is binary response correctness. While this formalism has yielded many effective performance modeling techniques, it is limiting in that it assumes the student log data contains only a single type of user interaction: answering questions posed by the ITS. Many aspects of student behaviour such as interactions with learning materials (videos, instructional text), hint usage and information about the current learning context can provide a useful signal, but fall outside the scope of this formalism.

More recently, the EDM community has started exploring alternative types of log data, in an attempt to improve the accuracy of student performance modeling. Zhang2017:Incorporating augment DKT with information on response time, attempt number and type of first interaction with the ITS. Later, Yang2018:Implicit enhanced DKT predictions by incorporating more than 10 additional features related to learning context, interaction times and hint usage provided by the ASSISTment 2009 [Feng et al. (2009)] and Junyi15 [Chang et al. (2015)] datasets. While that work employs most information contained in the Junyi15 dataset, it fails to utilize the prerequisite structure among topics in the curriculum, and does not evaluate the potential benefit of those features for logistic regression models. EKT [Liu et al. (2019)] uses the question text to learn exercise embeddings which are then used for downstream performance predictions. MVKM [Zhao et al. (2020)] uses a multi-view tensor factorization to model knowledge acquisition from different types of learning materials (quizzes, videos, …). SAINT+ [Shin et al. (2021)] and MUSE [Zhang et al. (2021)] augment transformer models with interaction time features to capture short-term memorization and forgetting. Closest to the spirit of this manuscript is an early work by Feng2009:Addressing. That work integrates features related to accuracy, response time, attempt usage and help-seeking behaviour into a logistic regression model to enhance exam score predictions. However, their exam score prediction problem is inherently different from our problem of continuous performance modeling because it focuses only on the final learning outcome and does not capture user proficiency on the question and KC level. Unlike our work, Feng2009:Addressing do not evaluate features related to lecture video and reading material consumption, prerequisite structure and learning context. Their study is also limited to two small-scale datasets collected by the ASSISTment system.

This paper offers the first systematic study of a wide variety of features extracted from alternative types of log data. It analyzes four recent large-scale datasets which capture the learning process at different levels of granularity and focus on various aspects of student behaviour. Our study identifies a set of features which can improve the accuracy of knowledge tracing techniques, and we give recommendations regarding which types of user interactions should be captured in the log data of future tutoring systems. Further, our feature evaluation leads us to novel logistic regression models achieving new state-of-the-art performance on all four datasets.

3 Datasets

ElemMath2021 EdNet KT3 Eedi Junyi15
# of students 125,246 297,915 118,971 247,606
# of unique questions 59,892 13,169 27,613 835
# of KCs 4191 293 388 41
# of logged actions 62,570,009 89,270,654 19,834,813 25,925,992
# of student responses 23,447,961 17,954,718 19,834,813 25,925,992
average correctness 68.52% 66.19% 64.30% 82.99%
subject Mathematics English Mathematics Mathematics
Attributes logged by some or all of the systems (see caption for definitions): timestamp; question ID; KC ID; age/gender; social support; platform; teacher/school; study module; prerequisite graph; KC hierarchy; question bundle; elapsed/lag time; question difficulty; videos; reading; hints
Table 2: Summary of datasets. Here KC refers to a knowledge component in the curriculum of the respective system, average correctness refers to the fraction of questions that were correctly answered across all students, question ID refers to whether each distinct question (item) had an ID, and KC ID indicates that each item is linked to at least one KC, platform indicates if the system logs how students access materials (e.g. mobile app or web browser), social support indicates if there is information about a student’s socioeconomic status, question bundle indicates if the system groups multiple items into sets which are asked together, elapsed/lag time indicates whether the data includes detailed timing information allowing the calculation of how long the student took to respond to each question, and the time between presentations of successive questions. Question difficulty indicates whether the dataset includes ITS-defined difficulties for each question. Videos, reading, and hints indicate whether the dataset provides information about which videos and explanations the student watched or read, and which hints were delivered to them as they worked on practice questions.

In recent years multiple education companies have released large-scale ITS datasets to promote research on novel knowledge assessment techniques. Compared to earlier sizable datasets such as Bridge to Algebra 2006 [Stamper et al. (2010)] and ASSISTment 2012 [Feng et al. (2009)], these new datasets capture an order of magnitude more student responses, making them attractive for data-intensive modeling approaches. Table 2 provides an overview of the four recent large-scale student log datasets, from different ITS systems, that we use in this paper. Taken together, these datasets capture over 197 million lines of log data from over 789,000 students, including correct/incorrect responses to over 87 million questions. While each ITS collects the conventional timestamp for each question (item) presented to the student, the correctness of their answer, and knowledge component (KC) attributes associated with the question, the datasets also contain additional interaction data. Going beyond pure question-solving activities, multiple datasets provide information about how and when students utilize reading materials, lecture videos and hints. In addition there are various types of meta-information about the individual students (e.g. age, gender, …) as well as the current learning context (e.g. school, topic number, …). All four tutoring systems exhibit a modular structure in which learning activities are assigned to distinct categories (e.g. pre-test, effective learning, review, …). Further, three of the datasets provide meta-information indicating which questions, videos, etc. are related to which KCs.

Figure 2 visualizes the distribution of the number of responses per student for each dataset (i.e. the number of answered questions per user). All datasets follow a power-law distribution. The EdNet KT3 and Junyi15 datasets contain many users with fewer than 50 completed questions, and only a small proportion of users answer more than 100. The Eedi dataset filters out students and questions with fewer than 50 responses. The ElemMath2021 dataset exhibits the largest median response number, and many of its users answer several hundred questions.

Another interesting ITS property is the degree to which all students progress through the study material in a fixed sequence, versus how adaptive and varied the presented sequence is across different students. We can get insight into this for each of our datasets by asking how predictable the next question item (or the next KC) is from the previous one, across all students in the dataset. Figure 1 shows how accurately the current question/KC predicts the following question/KC. The Junyi15 data is very static with respect to the KC sequence and tends to present questions in the same order across all students. Eedi logs show more variability in the KC sequence, but the question order is still rather static and therefore predictable. ElemMath2021 exhibits moderate KC sequence variations and a highly variable question order. EdNet KT3 exhibits the most variation in question and KC order.

Some of the differences between the individual datasets can be traced back to the distinct structures and objectives of the underlying tutoring systems. They teach different subjects, assist students in different ways (K-12 tutoring, standardized test preparation, knowledge diagnosis) and provide students with varying levels of autonomy. We next discuss the individual datasets and corresponding ITS in more detail.

Figure 1: Accuracy when predicting the next question or KC based on the current one. Lower accuracy reflects greater sequence variability across students.
Figure 2: Distribution of the number of responses per student. All four datasets follow a power-law distribution. The EdNet KT3 and Junyi15 datasets both contain many users with few responses. The users of the Eedi and ElemMath2021 tutoring systems have more responses on average.

EdNet [Choi et al. (2020)]: EdNet was introduced as a large-scale benchmarking dataset for knowledge tracing algorithms. The data was collected over 2 years by Riiid's Santa tutoring system, which prepares students in South Korea for the Test of English for International Communication (TOEIC©) Listening & Reading. Each test is split into 7 distinct parts, four assessing listening and three assessing reading proficiency. Following this structure, Santa categorizes its questions into 7 parts (additional fine-grained KCs are provided). Santa is a multi-platform system available on Android, iOS and the web, and users have autonomy regarding which test parts they want to focus on. There exist four versions of this dataset (KT1, …, KT4) capturing student behaviour at increasing levels of detail, ranging from pure question-answering activity to comprehensive UI interactions. In this paper we analyze the EdNet KT3 dataset, which contains logs of 297,915 students and provides access to reading and video consumption behaviour. We omit the use of the KT4 dataset, which augments KT3 with purchasing behaviour.

Junyi15 [Chang et al. (2015)]: The Junyi Academy Foundation is a philanthropic organization located in Taiwan. It runs the Junyi Academy online learning platform, which offers various educational resources designed for K-12 students. In 2015 the foundation released log data from their mathematics curriculum capturing activities of 247,606 students collected over 2 years. Users of Junyi Academy can choose from a large variety of topics, are able to submit multiple answers to a single question and can request hints. The system registers an answer as correct only if no hints are used and the correct answer is submitted on the first attempt. The dataset provides a prerequisite graph which captures semantic dependencies between the individual questions. In addition to a single KC, each question is annotated with an area identifier (e.g. algebra, geometry, …). In 2020 Junyi Academy shared a newer mathematics dataset on Kaggle [Pojen et al. (2020)]. Unfortunately, in that dataset each timestamp is rounded to the closest quarter hour, which prevents exact reconstruction of the response sequence and makes it difficult to evaluate student performance modeling algorithms. Because of this we perform our analysis using the Junyi15 dataset.

Eedi [Wang et al. (2020)]: Eedi is a UK-based online education company which offers a knowledge assessment and misconception diagnosis service. Unlike other ITSs, which are designed to teach new skills, the Eedi system confronts each student with a series of diagnostic questions – i.e. multiple choice questions in which each incorrect answer is indicative of a common misconception. This process results in a report that helps school teachers adapt to student-specific needs. In 2020 Eedi released a dataset for the NeurIPS Education Challenge [Wang et al. (2021)]. It contains mathematics question logs of 118,971 students (primary to high school) and was collected over a 2-year period. Student age and gender, as well as information on whether a student qualifies for England's pupil premium grant (a social support program for disadvantaged students), are provided. In contrast to the Junyi15 and ElemMath2021 datasets, which have a prerequisite graph, the Eedi dataset organizes its KCs via a 4-level topic ontology tree. For example, the KC Add and Subtract Vectors falls under the umbrella of Basic Vectors, which itself is assigned to Geometry and Measure, which is connected to the subject Mathematics. While analyzing this dataset we noticed that the timestamp information is rounded to the closest minute, which prevents exact reconstruction of the interaction sequences. Upon request, the authors provided us with an updated version of the dataset that allows exact recovery of the interaction sequence.

Squirrel Ai ElemMath2021: Squirrel Ai Learning (SQ-Ai) is a K-12 education company located in China which offers individualized after-school tutoring services. SQ-Ai provides its ITS as mobile and web applications, but also deploys it in over 3000 physical tutoring centers. The centers provide a unique setting in which students can study under the supervision of human teachers and can ask for additional advice and support, which augments the ITS's capabilities. Students also have the social experience of working alongside their peers. In this paper we introduce and analyze the Squirrel Ai ElemMath2021 dataset. It provides 3 months of behavioral data from 125,246 K-12 students completing various mathematics courses and captures observational data at fine granularity. ElemMath2021 gives insight into reading material and lecture video consumption. It also provides meta-information on which learning center and teacher a student is associated with. Each question has a manually assigned difficulty rating ranging from 10 to 90. A prerequisite graph captures dependencies between the individual KCs. Most student learning sessions have a duration of about one hour, and the ITS selects a session topic. Each learning session is assigned one of six categories, and sessions usually follow a pre-test, learning, post-test structure. This makes the dataset particularly interesting because it allows one to quantify the learning success of each individual session as the difference between pre-test and post-test performance. The ElemMath2021 data as well as a detailed dataset description is publicly available via Carnegie Mellon University's DataShop (https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=4888) [Koedinger et al. (2013)].

We conclude this section with a few summarizing remarks. Recent years have yielded large-scale educational datasets which have the potential to fuel future EDM research. The individual datasets exhibit large heterogeneity with regard to the structure and objectives of the underlying tutoring systems as well as the captured aspects of the learning process. Motivated by these observations, we perform the first comparative evaluation of various student performance modeling techniques across these four large-scale datasets. Further, we use the rich log data to evaluate the potential benefits of a variety of alternative features and provide recommendations on which types of observational data are informative for future performance modeling techniques. Our feature evaluation leads us to multiple novel logistic regression models achieving state-of-the-art performance. Given the size of the available training data and the structure of the learning processes, we also address the question of whether it is beneficial to train a set of different assessment modules specialized for different parts of the ITS.

4 Approach

In this paper we examine alternative student performance modeling algorithms using alternative sets of features, across four recent large-scale ITS datasets. Our goals are (1) to discover the features of rich student log data that lead to the most accurate performance models of students, (2) to discover the degree to which the most useful features are task-independent versus dependent on the particular tutoring system, and (3) to discover which types of machine learning algorithms produce the most accurate student models using this log data. We start this section with a formal definition of the student performance modeling problem. We then discuss prior work on logistic regression based modeling approaches and analyze which types of features they employ. From there, we introduce a set of alternative features leveraging alternative types of log data to offer a foundation for novel knowledge assessment algorithms. We conclude this section with the proposal of two additional features which capture long-term and short-term student performance and only rely on response correctness. Appendix A provides precise definitions of all features used in our experiments, including implementation details and ITS-specific considerations.

4.1 The Student Performance Modeling Problem

Student performance modeling is a supervised sequence learning task which traces a student's latent knowledge state over time. Reliable knowledge assessments are crucial to enabling ITSs to provide effective individualized feedback to users. More formally, we denote a student's sequence of past interactions with the tutoring system as $x_{1:t} = (x_1, \dots, x_t)$. The $i$-th interaction with the system is represented by the tuple $x_i = (e_i, D_i)$, where $e_i$ indicates the interaction type and $D_i$ is a dataset-dependent aggregation of information related to the interaction. In this paper we consider interaction types connected to question answering, video and reading material consumption as well as hint usage. Examples of attributes contained in $D_i$ are the timestamp, learning material identifiers, information about the current learning context and student-specific features. Question answering is the most basic interaction type and is monitored by all ITSs. If the $i$-th interaction is a question response, $D_i$ provides the question (item) identifier $q_i$ and the binary response correctness $a_i \in \{0, 1\}$. Given a user's history of past interactions with the ITS, the student performance modeling problem is to predict $p(a_{t+1} = 1 \mid q_{t+1}, x_{1:t})$ – the probability that the student's response will be correct if they are next asked question $q_{t+1}$, given their history $x_{1:t}$. In addition to interaction logs, all four datasets provide a knowledge component model which associates each question $q$ with a set $\mathrm{KC}(q)$ containing one or more knowledge components (KCs). Each KC represents a concrete skill which can be targeted by questions and other learning materials. User interactions are discrete and observed at irregular time intervals. To capture short-term memorization and forgetting it is necessary to utilize additional temporal features. We denote the dependence of variables on the individual student $s$ with the subscript $s$.

4.2 Logistic Regression for Student Performance Modeling

Logistic regression models enjoy great popularity in the EDM community. At their core, each trained regression model takes as input a real-valued feature vector that describes the student and their log data up to this point in the course, along with a question ID. Note that different logistic regression approaches can summarize the student and their history in terms of different features. Each approach calculates its features using its own feature calculation function $\Phi$, i.e., the feature vector for student $s$ and question $q$ is $\Phi(q, x_{s,1:t})$. The trained logistic regression model then uses this feature vector as input, and outputs the probability that the student will correctly answer question $q$ if they are asked it at this point in time. The full logistic regression model is therefore of the form

$$p(a_{s,t+1} = 1 \mid q, x_{s,1:t}) = \sigma\left(\mathbf{w}^\top \Phi(q, x_{s,1:t})\right).$$

Here $\mathbf{w}$ represents the vector of learned regression weights and $\sigma(z) = 1 / (1 + e^{-z})$ is the sigmoid function which outputs a value between 0 and 1, interpreted as the probability that student $s$ will answer question $q$ correctly. A suitable set of weights can be determined by maximizing the likelihood function on the training set. The corresponding maximization problem is usually solved using gradient based optimization algorithms.
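As a minimal illustration of this model class (a sketch, not the implementation used in this paper), the following code fits a logistic regression on a pre-computed feature matrix with scikit-learn; the matrix `X` stands in for the output of the feature function Φ applied to every logged response:

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

# Toy stand-ins: in practice each row of X is the feature vector Phi(q, history)
# for one logged response, and y is the binary correctness of that response.
rng = np.random.default_rng(0)
X = sparse.random(1000, 50, density=0.1, format="csr", random_state=0)
y = rng.integers(0, 2, size=1000)

model = LogisticRegression(max_iter=1000)  # maximizes the training-set likelihood
model.fit(X, y)

# sigmoid(w^T x): predicted probability that each response is correct
p_correct = model.predict_proba(X)[:, 1]
```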

Over the years various logistic regression models have been proposed, each employing a distinct feature mapping $\Phi$ to extract a suitable feature set. In this paper we consider Item Response Theory (IRT; Van2013:Handbook), in particular a one-parameter version of IRT known as the Rasch model [Rasch (1993)], Performance Factor Analysis (PFA; Pavlik2009:Performance), DAS3H [Choffin et al. (2019)] and Best-LR [Gervet et al. (2020)]. We now discuss the individual models, focusing on how the available interaction data is utilized for their predictions. First, IRT is the simplest of these regression models. It employs a parameter $\alpha_s$ which represents the ability of student $s$ as well as a separate difficulty parameter $\delta_q$ for each question (item) $q$. The IRT prediction is defined as

$$p(a_{s,q} = 1 \mid s, q) = \sigma(\alpha_s - \delta_q). \tag{1}$$

Unlike IRT, PFA extracts features based on a student's history of past interactions. It computes the number of correct responses ($c_{s,k}$) and incorrect responses ($f_{s,k}$) of student $s$ on KC $k$ prior to the current attempt and introduces a difficulty parameter $\beta_k$ for each individual KC. The PFA prediction is defined as

$$p(a_{s,q} = 1 \mid s, q) = \sigma\Bigl(\sum_{k \in \mathrm{KC}(q)} \beta_k + \gamma_k\, c_{s,k} + \rho_k\, f_{s,k}\Bigr). \tag{2}$$

DAS3H is a more recent model which combines aspects of IRT and PFA and extends them with time-window based count features. It defines a set $W$ of time windows measured in days. For each window $w \in W$, DAS3H determines the number of prior correct responses ($c_{s,k,w}$) and overall attempts ($a_{s,k,w}$) of student $s$ on KC $k$ which fall into the window. A scaling function $\phi(x) = \ln(1 + x)$ is applied to avoid features of large magnitude. The DAS3H prediction is defined as

$$p(a_{s,q} = 1 \mid s, q) = \sigma\Bigl(\alpha_s - \delta_q + \sum_{k \in \mathrm{KC}(q)} \beta_k + \sum_{k \in \mathrm{KC}(q)} \sum_{w \in W} \theta_{k,w,1}\, \phi(c_{s,k,w}) + \theta_{k,w,2}\, \phi(a_{s,k,w})\Bigr).$$

Gervet2020:Deep performed a comparative evaluation of student performance modeling algorithms across 9 real-world datasets. They also evaluated the effects of question, KC and total count as well as time window based count features, leading them to a new logistic regression model referred to as Best-LR. Best-LR is similar to DAS3H, but does not use time-window features and uses the total number of prior correct responses ($c_s$) and incorrect responses ($f_s$) as additional features. The Best-LR prediction is defined as

$$p(a_{s,q} = 1 \mid s, q) = \sigma\Bigl(\alpha_s - \delta_q + \theta_1\, \phi(c_s) + \theta_2\, \phi(f_s) + \sum_{k \in \mathrm{KC}(q)} \beta_k + \gamma_k\, \phi(c_{s,k}) + \rho_k\, \phi(f_{s,k})\Bigr). \tag{3}$$

Overall, we can see that these four logistic regression models are mainly based on two types of features: (i) one-hot encodings that allow the models to infer question and KC specific difficulty; (ii) count based features that summarize a student's past interaction history with the system, computed at various levels of granularity (total/KC/question-level) and potentially augmented with time windows to introduce a temporal dimension into the predictions. Looking back at the dataset discussion provided in Section 3, we note that the current feature sets use only a small fraction of the information collected by the tutoring systems. In the following we aim at increasing the amount of utilized information by exploring alternative types of features which can be extracted from the log data and can serve as a foundation for future student performance modeling algorithms.
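To make the two feature families concrete, here is a hedged sketch of how one-hot identifiers and scaled count features could be assembled for a single prediction. The function name, the feature layout and the log scaling are illustrative choices, not the exact encoding used by Best-LR or by our implementation:

```python
import numpy as np

def encode_onehot_and_counts(question_id, kcs, history, n_questions, n_kcs):
    """Illustrative feature vector: one-hot IDs plus scaled count features.

    history: list of (question_id, kc_ids, correct) tuples observed so far.
    """
    q_onehot = np.zeros(n_questions)
    q_onehot[question_id] = 1.0
    kc_onehot = np.zeros(n_kcs)
    kc_onehot[list(kcs)] = 1.0

    total_correct = sum(c for _, _, c in history)
    total_incorrect = len(history) - total_correct

    kc_correct = np.zeros(n_kcs)
    kc_attempts = np.zeros(n_kcs)
    for _, ks, c in history:
        for k in ks:
            kc_attempts[k] += 1
            kc_correct[k] += c

    scale = np.log1p  # damp large counts before feeding them to the linear model
    return np.concatenate([
        q_onehot,
        kc_onehot,
        [scale(total_correct), scale(total_incorrect)],
        scale(kc_correct),
        scale(kc_attempts),
    ])
```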

4.3 Features Based on Rich Observational Data

Tutoring systems collect a wide variety of observational data during the learning process. Here we discuss a range of alternative features leveraging alternative types of log data. The individual features can be used to augment the logistic regression models discussed in Subsection 4.2, but might also be combined with deep learning based modeling techniques. As shown by Table 2, each dataset captures different aspects of the learning process and supports a different subset of the discussed features.

Temporal features: Many performance modeling techniques treat student interactions as discrete tokens and omit timestamp information, which can be an indicator of cognitive processes such as short-term memorization and forgetting. DAS3H uses time window based count features to summarize the user history. Here we discuss two additional types of temporal features: (i) one-hot encoded datetime features and (ii) the interaction time based features introduced by Shin2021:Saint+. By providing the model with information about the specific week and month of interaction we try to capture effects of school work outside the ITS on knowledge development. The hour and day encodings aim at tracing temporal effects on a smaller timescale. For example, students might produce more incorrect responses when studying late at night. Recently, Shin2021:Saint+ introduced a deep learning approach which employs elapsed time and lag time features to capture temporal aspects of student behaviour during question solving activities. Elapsed time measures the time span from question display to response submission. The idea is that a faster response is correlated with student proficiency. Lag time measures the time passed between the completion of the previous exercise and when the next question is received. Lag time can be indicative of short-term memorization and forgetting. For our experiments we convert elapsed time and lag time to scaled values and also use the categorical one-hot encodings from Shin2021:Saint+. There, elapsed time is capped off at 300 seconds and categorized based on the integer second. Lag time is rounded to integer minutes and assigned to one of 150 categories. We evaluate two variations of these features. In the first version we compute elapsed time and lag time values based on interactions with the current question. In the second version we compute them based on interactions with the prior question. Because it is unknown how long a student will take to answer a question before the question is asked, we cannot realistically use the elapsed time feature for the current question when predicting correctness on a new question. Therefore, we omit the elapsed time feature for the current question in our experiments described in Subsection 5.3.
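The sketch below illustrates one way to turn raw elapsed- and lag-time values into the scaled and categorical encodings described above. The 300-second cap, per-second elapsed-time categories and the 150 lag-time categories follow the description of Shin2021:Saint+ given here; the log scaling and the clipping of long lag times into the last bucket are illustrative assumptions:

```python
import numpy as np

def elapsed_time_features(elapsed_sec):
    capped = min(float(elapsed_sec), 300.0)   # cap elapsed time at 300 seconds
    onehot = np.zeros(301)                    # one category per integer second
    onehot[int(capped)] = 1.0
    return np.log1p(capped), onehot           # scaled value + categorical encoding

def lag_time_features(lag_sec):
    minutes = int(round(lag_sec / 60.0))      # round lag time to integer minutes
    category = min(minutes, 149)              # assumption: clip into 150 buckets
    onehot = np.zeros(150)
    onehot[category] = 1.0
    return np.log1p(lag_sec), onehot
```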

Learning context: Knowledge about the context a learning activity is placed in can be informative for performance predictions. For example, all four datasets considered in this work group learning activities into distinct context categories we call study modules (e.g. pre-test, effective learning, review, …). Here, effective learning study modules try to teach novel KCs, whereas review study modules aim at deepening proficiency related to KCs the student is already familiar with. Providing a student performance model with information about the corresponding study module can help it adapt its predictions to these different contexts. Additional context information is provided by the exercise structure. For example, the ElemMath2021 dataset marks each interaction with a course and topic identifier and offers a manually curated question difficulty score. The EdNet dataset labels questions based on which part of the TOEIC exam they address and groups them into bundles – a bundle is a set of questions which are asked together. Large parts of the ElemMath2021 dataset were collected in physical learning centers where students can study under the supervision of human teachers, and each interaction contains school and teacher identifiers. This information can be useful because students visiting the same learning center might exhibit similar strengths and weaknesses. In Section 5 we evaluate the potential benefits of learning context information for student performance modeling by encoding the categorical information into one-hot vectors and passing it into logistic regression models. Additionally, we evaluate count features defined for the individual study modules and EdNet parts.

Personal context: By collecting a student's personal attributes a tutoring system can offer a curriculum adapted to their individual needs. For example, information about a student's age and grade can be used to select suitable learning materials. Further, knowledge about personal attributes can enable an ITS to serve as a vehicle for research on educational outcomes. For example, it is a well-studied phenomenon that socioeconomic background correlates with educational attainment (see for example White1982:Relation, Aikens2008:Socioeconomic, Stumm2017:Socioeconomic). A related research question is how ITSs can be used to narrow the achievement gap [Huang et al. (2016)]. While both the ElemMath2021 and EdNet datasets indicate which modality students use to access the ITS, the Eedi dataset is the only one that provides more detailed personal information in the form of age, gender and socioeconomic status. Features extracted from these sensitive attributes need to be handled with care to avoid discrimination against individual groups of students. Outside of student performance modeling, these features might be used to detect existing biases in the system. We evaluate one-hot encodings related to age, gender and socioeconomic status.

KC graph features: In addition to a KC model, three of the datasets provide a graph structure capturing semantic dependencies between the individual KCs and questions. We can leverage these graph structures by defining prerequisite and postrequisite count features. These features compute a student's number of prior attempts and correct responses on KCs which are prerequisites or postrequisites of the KCs of the current question. The motivation for these features is that mastery of related KCs can carry over to the current question. For example, it is very likely that a user who is proficient in multi-digit multiplication can solve single-digit multiplication exercises as well. The Eedi dataset provides a 4-level KC ontology tree which associates each question with a leaf. Here we compute a prerequisite feature by counting the number of interactions related to the direct parent node. We also investigate the use of sparse vectors taken from the KC graph adjacency matrix as an alternative way to incorporate the semantic information into our models.
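A hedged sketch of the prerequisite/postrequisite count features, assuming the KC graph is given as dictionaries mapping each KC to its prerequisite and postrequisite KCs (the data layout and log scaling are illustrative):

```python
import numpy as np
from collections import defaultdict

def kc_graph_count_features(current_kcs, history, prereqs_of, postreqs_of):
    """history: list of (kc_ids, correct) tuples for prior question responses."""
    attempts, correct = defaultdict(int), defaultdict(int)
    for ks, c in history:
        for k in ks:
            attempts[k] += 1
            correct[k] += c

    def counts(related_kcs):
        a = sum(attempts[k] for k in related_kcs)
        c = sum(correct[k] for k in related_kcs)
        return np.log1p(a), np.log1p(c)

    pre = {p for k in current_kcs for p in prereqs_of.get(k, [])}
    post = {p for k in current_kcs for p in postreqs_of.get(k, [])}
    return np.array([*counts(pre), *counts(post)])
```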

Learning material consumption: Modern tutoring systems offer a rich variety of learning materials in the form of video lectures, audio recordings and written explanations. Information on how students interact with the available resources can benefit performance predictions. For example, watching a video lecture introducing a new KC can foster a student's understanding before they enter the next exercise session. The ElemMath2021 and EdNet KT3 datasets record interactions with lecture videos and written explanations. Count features capture the number of videos and texts a user has interacted with prior to the current question and can be computed on a per-KC and an overall level. Video and reading time in minutes might also provide a useful signal. Another feature which can be indicative of proficiency is the number of videos a student skips. The Junyi15 dataset provides us with information on hint usage. We experiment with features capturing the number of used hints as well as minutes spent on reading hints, aggregated on a per-KC and an overall level.

4.4 Features Based on Response Correctness

4.4.1 Smoothed Average Correctness

The performance of logistic regression models relies on a feature set which offers a suitable representation of the problem domain. Unlike neural network based approaches, linear and logistic regression models are unable to recover higher-order dependencies between individual features on their own. For example, given two separate features describing a student's number of prior correct responses, $c_s$, and overall attempts, $n_s$, the linear logistic model is unable to infer the ratio of average correctness, which can be a valuable indicator of a student's long-term performance. To mitigate this particular issue we introduce an additional feature capturing the average correctness of student $s$ over time as

$$\bar{r}_s = \frac{c_s + \eta\, \bar{r}}{n_s + \eta},$$

where $n_s$ is the number of questions attempted by student $s$, and $c_s$ is the number of questions the student answered correctly. Here, $\bar{r}$ is the average correctness rate over all other students in the dataset, and $\eta$ is a smoothing parameter which biases the estimated average correctness rate $\bar{r}_s$ of student $s$ towards this all-students average. The use of smoothing reduces the feature variance during a student's initial interactions with the ITS. A prior work by Pavlik2020:Logistic proposed, but did not evaluate, an average correctness feature without the smoothing parameter for student performance modeling. While calibrating this parameter for our experiments, we observed benefits from smoothing and settled on a fixed value of $\eta$ for all experiments.
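A minimal sketch of this feature; the value of the smoothing parameter below is illustrative and not the calibrated value used in our experiments:

```python
def smoothed_avg_correctness(n_correct, n_attempts, global_avg, eta=5.0):
    """Average correctness of a student, biased towards the all-students average
    while the student still has few logged responses (eta is illustrative)."""
    return (n_correct + eta * global_avg) / (n_attempts + eta)

# With only one (correct) response the estimate stays near the global average:
# smoothed_avg_correctness(1, 1, 0.68) -> ~0.73 instead of 1.0
```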

4.4.2 Response Patterns

Figure 3: A response pattern is a one-hot encoding vector which represents the binary sequence formed by a student’s most recent correct and incorrect responses. Logistic regression models can use response patterns to infer aspects of a student’s short-term behaviour.

Student performance modeling approaches that employ recurrent or transformer based neural networks take interaction sequences describing multiple student responses as input (e.g. the most recent 100). Provided this time-series data, deep learning algorithms are able to discover patterns and temporal dependencies in student behaviour without requiring additional human feature engineering. Among the logistic regression models discussed in Subsection 4.2, only DAS3H incorporates temporal information, in the form of time window based count features that summarize student interactions over time scales from one hour to one month. While DAS3H can capture effects of cognitive processes such as long-term memorization and forgetting, it is still at a disadvantage compared to deep learning based approaches that can also infer aspects of student behaviour occurring on smaller time-scales. Indeed, it has been shown that a student's most recent responses have a large effect on DKT performance predictions (Ding2019:Deep, Ding2021:Interpretability).

Here, inspired by the use of n-gram models in the NLP community (e.g. Manning1999:Foundations), we propose response patterns as a feature which allows logistic regression models to infer aspects impacting short-term student performance. At time $t$, a response pattern $r_t$ is defined as a one-hot encoded vector that represents a student's sequence of $n$ most recent responses, formed by the binary correctness indicators $(a_{t-n}, \dots, a_{t-1})$. The encoding process is visualized in Figure 3. Response patterns are designed to allow logistic regression models to capture momentum in student performance. They allow the model to infer how challenging the current exercise session is for the user and can also be indicative of question-skipping behaviour. In Subsection 5.3 we combine response patterns, smoothed average correctness and DAS3H time window features to propose Best-LR+, a regression model which offers performance competitive with deep learning based techniques while only relying on information related to response correctness.
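A hedged sketch of the response-pattern encoding from Figure 3: the most recent n binary correctness values are read as a binary number that indexes into a one-hot vector of length 2^n. The sequence length n = 4, the index layout and the zero-padding of short histories are illustrative choices, not the exact settings used in our experiments:

```python
import numpy as np

def response_pattern(recent_correctness, n=4):
    """One-hot encode the pattern formed by the student's last n responses."""
    recent = list(recent_correctness[-n:])
    # simplification: pad missing responses with 0 when fewer than n are available
    recent = [0] * (n - len(recent)) + recent
    index = int("".join(str(int(c)) for c in recent), 2)  # binary string -> integer
    onehot = np.zeros(2 ** n)
    onehot[index] = 1.0
    return onehot

# Example: responses (correct, correct, incorrect, correct) -> pattern 1101 -> index 13
# response_pattern([1, 1, 0, 1])
```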

5 Experiments

We evaluate the benefit of alternative features extracted from different types of log data for student performance modeling using the four large-scale datasets discussed in Section 3. After an initial feature evaluation, we combine helpful features to form novel state-of-the-art logistic regression models. Motivated by the size of the recent datasets, we also investigate two ways of partitioning the individual datasets to train multiple assessment models targeting different parts of the learning process. First, we show how splitting the interaction sequences by response number can be used to train time-specialized models which can mitigate the cold start problem (Gervet2020:Deep, Zhang2021:Knowledge). We then analyze how features describing the learning context (e.g. study module, topic, course, …) can be used to train multiple context-specialized models whose individual predictions can be combined to improve overall prediction quality even further.

5.1 Evaluation Methodology

Figure 4: Accuracy, true positive rate and false positive rate of a Best-LR model trained on the ElemMath2021 dataset for different classification thresholds. The classification threshold converts the probability output by the model into a binary correct/incorrect prediction.
Figure 5: The receiver operating characteristic (ROC) curve captures the relationship between true and false positive rates as the classification threshold is swept from 0 to 1. The AUC score stands for the Area Under the Curve. A perfect system will have AUC of 1.0.

To be in line with prior work, we start data preparation by filtering out students with fewer than ten answered questions (Pandey2019:Self, Choffin2019:DAS3H, Piech2015:Deep, Gervet2020:Deep). The EdNet KT3 and Eedi datasets both contain questions annotated with multiple KCs, which causes difficulties for some of the deep learning baselines (DKT and SAKT). In those cases we introduce new artificial KCs to represent each unique combination of original KCs. In all our experiments we perform a 5-fold cross-validation on the student level. In each fold 80% of the students are used for training and parameter selection and 20% are used for testing. Thus, all test results are obtained only from predictions for students who were not observed during model training. We report performance in terms of prediction accuracy (ACC) and area under curve (AUC). The AUC score captures the area under the receiver operating characteristic (ROC) curve and is a popular evaluation metric for knowledge tracing algorithms. The ROC curve plots the true-positive rate against the false-positive rate at all decision thresholds. One way of interpreting the AUC score is as the probability of assigning a random correct student response a higher probability of correctness than a random incorrect response. As concrete examples we visualize the predictive performance and ROC curve for a Best-LR model trained on the ElemMath2021 dataset in Figures 4 and 5.
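The following sketch illustrates student-level cross-validation and the two reported metrics; the `fit_predict` callback is a placeholder for the actual training pipeline, and scikit-learn's `GroupKFold` only approximates the 80%/20% student split described above:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score, accuracy_score

def evaluate_student_level(X, y, student_ids, fit_predict, n_splits=5):
    """Cross-validate so that no test student ever appears in the training folds."""
    accs, aucs = [], []
    for train_idx, test_idx in GroupKFold(n_splits).split(X, y, groups=student_ids):
        p = fit_predict(X[train_idx], y[train_idx], X[test_idx])  # predicted P(correct)
        aucs.append(roc_auc_score(y[test_idx], p))
        accs.append(accuracy_score(y[test_idx], (p >= 0.5).astype(int)))
    return float(np.mean(accs)), float(np.mean(aucs))
```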

For the computation of the average correctness and response pattern features we use fixed values of the smoothing parameter and the sequence length, respectively. Additional implementation details of the different features as well as ITS-specific considerations are provided in Appendix A. For the evaluation of the deep learning models, we relied on the hyperparameters reported in the cited references. A detailed list of the used model hyperparameters is provided in Appendix B. All models were trained for 100 epochs without learning rate decay. For our logistic regression experiments we rely on Scikit-learn [Pedregosa et al. (2011)], and all deep learning architectures were implemented using PyTorch [Paszke et al. (2019)]. Our implementation of the regression models uses combinations of attempt and correct count features, which is a slight deviation from the original PFA and Best-LR formulations that count the number of prior correct and incorrect responses. While the individual features have a different interpretation, the two feature pairs are collinear to each other and provide identical information to the model. The code to reproduce our experiments as well as links to all used datasets are shared on GitHub (https://github.com/rschmucker/Large-Scale-Knowledge-Tracing).

5.2 Utility of Individual Features with Logistic Regression

To evaluate the effectiveness of the different features discussed in Section 4 we perform two experiments in the context of logistic regression modeling. In the first experiment we analyze each feature by training a logistic regression model using only that feature and a bias (constant) term. This allows us to quantify the predictive power of each feature in isolation. In the second experiment we evaluate the marginal utility gained from each feature by training an augmented Best-LR model (Eq. 3) based on the Best-LR feature set plus the additional feature. This is particularly interesting because the Best-LR feature set was created by combining features proposed by various prior regression models and it achieves state-of-the-art performance on multiple educational datasets [Gervet et al. (2020)]. If a feature can provide marginal utility on top of the Best-LR feature set it is likely that it can contribute to other knowledge tracing techniques as well and should be captured by the ITS.

ElemMath2021 EdNet KT3 Eedi Junyi15
Feature ACC AUC ACC AUC ACC AUC ACC AUC
always correct (baseline)
question ID
KC ID
total counts
KC counts
question counts
combined counts
total TW counts
KC TW counts
question TW counts
combined TW counts
current elapsed time - -
current lag time - -
prior elapsed time - -
prior lag time - -
month
week
day
hour
study module ID
study module counts
teacher/group - - - -
school - - - - - -
course - - - - - -
topic - - - - - -
difficulty - - - - - -
bundle/quiz - - - -
part/area ID - - - -
part/area counts - - - -
age - - - - - -
gender - - - - - -
social support - - - - - -
platform - - - -
prereq IDs - -
prereq counts - -
postreq IDs - - - -
postreq counts - - - -
videos watched counts - - - -
videos skipped counts - - - -
videos watched time - - - -
reading counts - - - -
reading time - - - -
hint counts - - - - - -
hint time - - - - - -
smoothed avg correct
response pattern
Table 3: Individual feature performance. Each entry reports ACC and AUC scores achieved by a logistic regression model trained using only the corresponding feature and a bias term. The first row provides a baseline for comparison, where no features are considered and the model simply predicts the student will answer every question correctly. Dashed lines indicate this feature is not available in this dataset. Maximum ACC and AUC variances over the five-fold test data are 0.0013 and 0.003 respectively.

ACC and AUC scores achieved by the logistic regression models trained on the individual features in isolation are shown in Table 3. Somewhat unsurprisingly, the question (item) identifier – representing the main feature used by IRT (EQ 1) – is effective for all datasets and is the single most predictive feature for ElemMath2021 and EdNet KT3. KC identifiers are very informative as well, but to a lesser extent than the question identifiers. This indicates varying difficulty among questions targeting the same KC. The PFA model (EQ 2), which only estimates KC difficulties, is unable to capture these variations on the question level. The attempt and correct count features are informative for all datasets and there is benefit in using total-, KC- and question-level counts in combination. Question-level count features are informative for Junyi15, which is the only dataset that frequently presents users with previously encountered questions. DAS3H introduced count features computed for different time windows (TWs). TW based count features are an extension of the standard count features and yield additional benefits for all four settings. The elapsed time and lag time features are based on user interaction times and are both informative. The datetime features describing month, week, day and hour of interaction yield little signal on their own. The features describing the learning context vary in utility. The study module feature is informative for all datasets, as are the counts of student correct responses and attempts associated with different study modules. Predictions for the Eedi dataset benefit greatly from information about the specific group a student belongs to. Reasons for this could be varying group proficiency levels and differences in diagnostic test difficulty. Bundle and part identifiers both provide a useful signal for EdNet KT3 and Eedi. Features describing personal information have little predictive power on their own. Among this family of features, social support yields the largest AUC score improvement over the always correct baseline. Features extracted from the prerequisite graph are effective for the three datasets that support them. The logistic regression models trained on prerequisite and postrequisite count features exhibit the best AUC scores on Junyi15. The features describing study material consumption all provide a predictive signal, but do not yield enough information for accurate knowledge assessments on their own. The smoothed average correctness and response pattern features lead to good performance on all four datasets.

ElemMath2021 EdNet KT3 Eedi Junyi15
Feature ACC AUC ACC AUC ACC AUC ACC AUC
Best-LR (baseline)
question counts
total TW counts
KC TW counts
question TW counts
combined TW counts
current elapsed time - -
current lag time - -
prior elapsed time - -
prior lag time - -
month
week
day
hour
study module ID
study module counts
teacher/group - - - -
school - - - - - -
course - - - - - -
topic - - - - - -
difficulty - - - - - -
bundle/quiz ID - - - -
part/area ID - - - -
part/area counts - - - -
age - - - - - -
gender - - - - - -
social support - - - - - -
platform - - - -
prereq IDs - -
prereq counts - -
postreq IDs - - - -
postreq counts - - - -
videos watched counts - - - -
videos skipped counts - - - -
videos watched time - - - -
reading counts - - - -
reading time - - - -
hint counts - - - - - -
hint time - - - - - -
smoothed avg correct
response pattern
Table 4: Augmented Best-LR performance. Each entry reports average ACC and AUC scores achieved by a logistic regression model trained using the Best-LR feature set augmented with a single feature. Maximum ACC and AUC variances over the five-fold test data are 0.0007 and 0.0022 respectively. The marker ✗ is used to indicate features that are used for the AugmentedLR assessment models described later in the paper. Dashed lines indicate this feature is not available in this dataset.

Moving forward we evaluate the marginal utility of each feature by measuring the performance of logistic regression models trained using the Best-LR feature set augmented with that particular feature. Table 4 shows the results of this experiment. Time window based count features lead to improved knowledge assessments for all datasets; in combination, they yield notable AUC improvements for EdNet KT3 and Junyi15. The elapsed time and lag time features offer a useful signal for the three datasets which capture temporal interaction data. The datetime features provide little utility in all cases (the best observed improvement is for Junyi15). Information about the study module a question is placed in is beneficial in all settings and leads to the largest AUC score improvement for Junyi15. Indication of the group a student belongs to improves model performance for Eedi. Knowing which topic a student is currently visiting is a useful signal for ElemMath2021. Information related to bundle and part structure improves EdNet KT3 and Eedi predictions. ElemMath2021’s manually assigned difficulty ratings lead to no substantial improvements, likely because the Best-LR feature set already allows the model to learn question difficulty on its own. The features describing a student’s personal information provide little marginal utility. Count features derived from the question/KC prerequisite graphs yield a sizeable improvement in assessment quality. Features targeting study material consumption yield some marginal utility when available; discrete learning material consumption counts lead to larger improvements than the continuous time features. The smoothed average correctness feature leads to noticeable improvements for all four datasets. The response pattern feature enhances assessments in all cases and yields the largest improvements for the ElemMath2021 and Eedi datasets. This shows that a student’s most recent responses yield a valuable signal for performance prediction.

Overall, we observe that a variety of alternative features derived from different types of log data can enhance student performance modeling. Tutoring systems that track temporal aspects of student behaviour in detail can employ elapsed time and lag time features. Additional information about the current learning context is valuable and should be captured. While the study module features improve predictions for all datasets, other ITS-specific context features such as group, topic and bundle identifiers vary in utility. The count features derived from the provided prerequisite and hierarchical graph structures increased prediction quality in all cases. Log data related to study material consumption also provides a useful signal. Beyond student performance modeling, this type of log data might also prove useful for evaluating the effects of individual video lectures and written explanations in future work. Lastly, the smoothed average correctness and response pattern features, which only require answer correctness, improve predictions for all four tutoring systems substantially.

5.3 Integrating Features into Logistic Regression

We have seen how individual features can be integrated into the Best-LR model to increase prediction quality. We now experiment with augmented logistic regression models that incorporate combinations of multiple beneficial features. The number of parameters learned by the different approaches is provided in Table 6. Because each dataset provides different types of log data it only supports a subset of the explored features. The first model we propose and evaluate is Best-LR+.

Best-LR+ method. This method augments the Best-LR method (the current state-of-the-art logistic regression method (EQ 3)) by adding the DAS3H time window-based count features on total-, KC- and question-level as well as smoothed average correctness and response pattern features. All of these features can be calculated using only the raw question-response information from the student log data.

Note that because the features used in Best-LR+ rely only on question-response log data, this method can be applied to all four datasets, as well as many earlier datasets that lack new features available in the four we study here. We define Best-LR+ in this way, to explore whether tutoring systems that log only this question-response data can improve the accuracy of the student performance modeling by calculating and adding in these count features.

We further propose dataset specific augmented logistic regression models (AugmentedLR) by combining the helpful features marked in Table 4.

AugmentedLR method. This logistic regression method uses all of the features employed in the Best-LR model, and adds in all of the marked features from Table 4, which were found to individually provide improvements over the state-of-the-art results of Best-LR.

Note that the AugmentedLR method employs a superset of the features used by Best-LR+. Because the features used in AugmentedLR go beyond simple logs of questions and responses, this method can only be applied to datasets that capture this additional information. We define AugmentedLR in this way, to explore whether ITS systems that do not yet log these augmented features can improve the accuracy of their student performance modeling by capturing and utilizing these additional features in their student log data. Note that some of the features marked in Table 4 and used by AugmentedLR are available in only a subset of our four datasets.

Experimental results for these two proposed logistic regression models are presented in Table 5, along with results for previously published logistic regression approaches IRT, PFA, DAS3H, Best-LR and previously published deep learning approaches DKT, SAKT, SAINT and SAINT+.

Examining the results, first compare Best-LR+ to the previous state of the art in logistic regression methods, Best-LR. Notice that Best-LR+ benefits from its additional question-response features and outperforms Best-LR on all four datasets; on EdNet KT3 and Junyi15 in particular, Best-LR+ improves markedly over the AUC scores of the previous best logistic regression model (Best-LR). These results suggest that the additional features used by Best-LR+ should be incorporated into student performance modeling in any ITS, even those for which the student log data contains only the sequence of question-response pairs.

Examining the results for AugmentedLR, it is apparent that this is the most accurate method for student modeling across these four diverse datasets, among the logistic regression methods considered here. Even when considering recent deep learning based models, the AugmentedLR models outperform all logistic regression and deep neural network methods on all datasets except Eedi, where the deep network method DKT yields the most accurate predictions. The gains are especially pronounced on the Junyi15 dataset, where AugmentedLR increases both ACC and AUC substantially compared to Best-LR. These results strongly suggest that all ITSs might benefit from logging the alternative features of AugmentedLR to improve the accuracy of their student performance modeling.

ElemMath2021 EdNet KT3 Eedi Junyi15
ACC AUC ACC AUC ACC AUC ACC AUC
IRT
PFA
DAS3H
Best-LR
DKT
SAKT
SAINT
SAINT+ - -
Best-LR+
AugmentedLR
Table 5: Comparative evaluation of student performance modeling algorithms across 4 large-scale datasets. The first four table rows correspond to previously studied logistic regression methods, the next four to previously studied deep neural network approaches, and the final two rows correspond to the two new logistic regression methods introduced in this paper. Maximum ACC and AUC variances over the five-fold test data are 0.0014 and 0.0017 respectively. Because the Eedi dataset does not provide response time information it does not accommodate SAINT+ and there is no corresponding entry.
ElemMath2021 EdNet KT3 Eedi Junyi15
IRT 59,571 11,556 27,614 723
PFA 12,574 904 1,165 124
DAS3H 105,672 14,867 31,882 1,174
Best-LR 72,146 12,461 28,780 848
DKT 6,970,301 2,526,101 3,430,701 649,601
SAKT 6,610,301 2,166,101 3,070,701 289,601
SAINT 81,023,001 27,692,601 38,547,801 5,174,601
SAINT+ 13,433,401 4,544,801 - 791,601
Best-LR+ 106,718 15,913 32,928 2,220
AugmentedLR 125,029 16,392 62,912 6,045
Combined-AugmentedLR 9,627,236 213,099 4,089,283 120,904
Table 6: Number of parameters learned by different student performance models. The number of parameters is heavily dependent on the number of questions and KCs used by the underlying ITS.

5.4 The Cold Start Problem

Figure 6: Comparison in AUC performance between a single Best-LR model trained using the entire training set (blue), and a collection of time-specialized Best-LR models trained on partitions of the response sequences (red). The horizontal axis shows the subset of question-response pairs used to train the specialized model. For example, the leftmost red point in each plot shows the AUC performance of a specialized model trained using only the first 10 question-response pairs of each student, then tested on the first 10 responses of other students. In contrast, the leftmost blue point shows the performance of the Best-LR model trained using all available data, and still tested on the first 10 responses of other students. The time-specialized models are able to mitigate the cold start problem for all four tutoring systems. On the EdNet KT3 and Junyi15 datasets the combined predictions of the specialized models increase overall performance substantially.

Student performance modeling techniques use a student’s interaction history to trace their knowledge state over time. When a new student starts using the ITS, little or no interaction history is available yet, resulting in a new student "cold start" problem for estimating the knowledge state of new students. Most student performance models therefore require a burn-in period of student use of the ITS before they can accurately estimate student performance (Gervet et al. (2020), Zhang et al. (2021)). Here, to ensure a gratifying user experience and to improve early retention rates, we show how one can mitigate the cold start problem by training multiple time-specialized assessment models. We start by splitting the question-response sequence of each student into multiple distinct partitions based on their ordinal position in the student’s learning process (i.e., partition 50-100 contains the 50th to 100th response of each student). We then train a separate time-specialized model for each individual partition. The motivation for this is that the way observational data needs to be interpreted can change over time. For example, during the beginning of the learning process one might put more focus on question and KC identifiers, while later on count features provide a richer signal. With this approach a student’s proficiency is evaluated by different specialized models depending on the length of their prior interaction history.

We evaluate this approach by training time-specialized logistic regression models using the Best-LR feature set for each of the four educational datasets. In particular, we induce the partitions using a fixed set of splitting points on the ordinal position of each response. Figure 6 visualizes the results of a 5-fold cross-validation and shows the performance of a single generalist Best-LR model trained using all available data. The use of time-specialized assessment models substantially improves prediction accuracy early on (i.e., during their first 10 question-response pairs) for all four datasets. For EdNet KT3 and Junyi15 we observe clear AUC improvements in early predictions compared to the generalist model. Also, for the same two datasets, combining the predictions of the individual time-specialized models yields a marked increase in overall AUC scores (shown in Table 7), and the time-specialized models consistently outperform the generalist model in each individual partition. For the ElemMath2021 and Eedi datasets the time-specialized models do not outperform the baseline consistently in each partition, but we still observe minor gains in overall predictive performance.
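A minimal sketch of this time-specialized training scheme is given below: one logistic regression is fit per partition of the ordinal interaction index, and each test interaction is routed to the model responsible for its partition. The boundary values in the example are illustrative only, not the splitting points used in our experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_time_specialized(X, y, interaction_index, boundaries):
    """Train one model per partition of the ordinal interaction index, e.g.
    boundaries=[10, 50, 100] creates partitions [0,10), [10,50), [50,100), [100,inf)."""
    edges = [0] + list(boundaries) + [np.inf]
    models = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (interaction_index >= lo) & (interaction_index < hi)
        models.append((lo, hi, LogisticRegression(max_iter=1000).fit(X[mask], y[mask])))
    return models

def predict_time_specialized(models, X, interaction_index):
    """Route each interaction to the specialized model for its partition."""
    prob = np.zeros(len(interaction_index))
    for lo, hi, model in models:
        mask = (interaction_index >= lo) & (interaction_index < hi)
        if mask.any():
            prob[mask] = model.predict_proba(X[mask])[:, 1]
    return prob
```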

5.5 Combining Multiple Specialized Models

ElemMath2021 EdNet KT3 Eedi Junyi15
Feature ACC AUC ACC AUC ACC AUC ACC AUC
Best-LR
Best-LR (time-spec.)
question ID specific
KC ID specific
study module specific
teacher/group specific - - - -
school specific - - - - - -
course specific - - - - - -
topic specific - - - - - -
bundle/quiz specific - - - -
part/area specific - - - -
platform specific - - - -
Combined-Best-LR
Table 7: Composite Best-LR performance for different partitioning schemes. The first two rows of the table give as baseline scores the ACC and AUC for the previously discussed Best-LR model and time-specialized Best-LR models (Subsection 5.4). The next 10 rows show results when training specialized models that partition the data based on single features such as question ID, KC ID, etc.. The final row shows the result of combining several of these models by taking a weighted vote of their predictions as described in the text. The marker ✗ indicates the inputs used for this combination model for each dataset. Maximum ACC and AUC variances over the five-fold test data are 0.0014 and 0.0012 respectively.
ElemMath2021 EdNet KT3 Eedi Junyi15
Feature ACC AUC ACC AUC ACC AUC ACC AUC
AugmentedLR
AugmentedLR (time-spec.)
question ID specific
KC ID specific
study module specific
teacher/group specific - - - -
school specific - - - - - -
course specific - - - - - -
topic specific - - - - - -
bundle/quiz specific - - - -
part/area specific - - - -
platform specific - - - -
Combined-AugmentedLR
Table 8: Composite AugmentedLR performance for different partitioning schemes. The first two rows of the table give as baseline scores the ACC and AUC for the previously discussed AugmentedLR model and time-specialized AugmentedLR models (Subsection 5.4). The next 10 rows show results when training specialized models that partition the data based on single features such as question ID, KC ID, etc.. The final row shows the result of combining several of these models by taking a weighted vote of their predictions as described in the text. The marker ✗ indicates the inputs used for this combination model for each dataset. Maximum ACC and AUC variances over the five-fold test data are 0.0013 and 0.0011 respectively.

The large-scale datasets discussed in this paper are aggregations of learning trajectories collected from users of varying ages and grades who complete a range of different courses. The internal heterogeneity of the datasets paired with the large number of recorded interactions naturally leads to the question of whether it is more advantageous to pursue a monolithic or a composite modeling approach. Conventional student performance modeling techniques follow a monolithic approach in which they train a single model using all available training data, but the results in the previous section show that with sufficiently large datasets it can be useful to partition the training data to train multiple time-specialized models.

In this section we explore the potential benefits of (1) training specialized models for specific questions, KCs, study modules, etc., and (2) combining the predictions from several of these models to obtain a final prediction of student performance. The motivation for considering partitioning the data to train models specialized to these different contexts is that it has the potential to recover finer nuances in the training data, just as the time-specialized models in the previous section did. For example, two algebra courses might teach the same topics, but employ different analogies to promote conceptual understanding, allowing students to solve certain types of questions more easily than others. Training a separate specialized model for each course can allow the trained models to capture these differences and improve overall assessment quality.
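The sketch below shows one way such feature-specialized models could be trained: the training data is grouped by the value of a partitioning feature (for example the study module or course identifier), one logistic regression is fit per value, and a globally trained model serves as a fallback for values not seen during training. The fallback choice is a simplification made for this sketch, not a detail prescribed by our method description.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_feature_specialized(X, y, partition_values):
    """Train one model per value of a partitioning feature, plus a global
    fallback model for partition values unseen during training."""
    fallback = LogisticRegression(max_iter=1000).fit(X, y)
    specialized = {}
    for value in np.unique(partition_values):
        mask = partition_values == value
        specialized[value] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    return specialized, fallback

def predict_feature_specialized(specialized, fallback, X, partition_values):
    """Predict each interaction with the model of its partition value."""
    prob = np.zeros(len(partition_values))
    for i, value in enumerate(partition_values):
        model = specialized.get(value, fallback)
        prob[i] = model.predict_proba(X[i:i + 1])[:, 1]
    return prob
```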

Table 7 shows performance metrics for sets of logistic regression models trained using the Best-LR features. As baselines we use a single Best-LR model trained on the entire dataset as well as a set of time-specialized Best-LR models (described in Subsection 5.4). Note that even though question and KC identifiers are already part of the Best-LR feature set they can still be effective splitting criteria. As shown in the table, training question specialized models improves predictive performance for all datasets except ElemMath2021 and KC specialized models are beneficial for all four datasets. Partitioning the data on the value of the study module feature, and training specialized models for each study module yields the greatest improvements for ElemMath2021, EdNet KT3 and Junyi15, and also leads to substantial improvements on Eedi. It is interesting that training multiple specialized models for different study modules is more effective than augmenting the Best-LR feature set directly with the study module feature (Table 4). Topic and course induced data partitions improve knowledge assessments for the ElemMath2021 dataset. On the other hand, school and teacher specific splits are detrimental and we observe large overfitting to the training data. Splits on the bundle/quiz level are effective for EdNet KT3 and Eedi. While Eedi benefits from incorporating the group identifiers into the Best-LR model, group specialized models harm overall performance.

While fitting the question specific models on ElemMath2021, we observed severe overfitting behaviour, where accuracy on the training data is much higher than on the test set. This is likely caused by the fact that ElemMath2021 contains the smallest number of responses per question among the four datasets. Table 9 provides information about the average and median number of responses per model for the different data partitioning schemes.

We repeat the same experiment, but this time train logistic regression models using the ITS-specific AugmentedLR feature sets. Performance metrics are provided in Table 8. Unlike when using the Best-LR features, the question- and KC-specific AugmentedLR models have lower overall prediction quality than the original AugmentedLR model. These specialized AugmentedLR models contain many more parameters than their Best-LR counterparts because they use more features to describe the student history (Table 6). This larger number of parameters requires a larger training dataset, which makes them more prone to overfitting. Still, splits on the study module, course and part features yield benefits on multiple datasets. Even though the AugmentedLR models benefit less from training specialized models, these specialized models still exhibit higher performance than the ones based on the Best-LR feature set.

ElemMath2021 EdNet KT3 Eedi Junyi15
Feature Avg Med Avg Med Avg Med Avg Med
question ID specific 393 118 1,445 930 718 351 35,207 8,380
KC ID specific 5,801 2,031 11,694 2,956 18,485 613 635,487 200,284
study module specific 3,901,401 2,487,865 2,385,616 810,120 336,183 29,589 5,083,895 3,421,938
teacher/group specific 1,370 297 - - 1,674 554 - -
school specific 11,992 5,620 - - - - - -
course specific 329,695 69,125 - - - - - -
topic specific 21,614 4,809 - - - - - -
bundle/quiz specific - - 1,970 1,265 1,146 120 - -
part/area specific - - 2,385,616 1,344,293 - - 2,824,386 605,666
platform specific 11,704,204 11,704,204 8,349,656 8,349,656 - - - -
Table 9: Average and median number of training examples (i.e. question responses) available for different partitioning schemes. For example, when a specialized model is trained for each different question ID in the ElemMath2021 dataset, the average number of training examples available per question-specific model is 393.

Finally, we discuss combining the strengths of the different models described in this section by combining their predictions into a final group prediction. Our approach here is to train a higher-order logistic regression model that takes as input the predictions of the different models and outputs the final predicted probability that the student will answer the question correctly. The learned weight parameters in this higher-order logistic regression model essentially create a weighted voting scheme that combines the predictions of the different models. We determined which models to include in this group prediction by evaluating all different combinations using the first data split of the 5-fold validation scheme and selecting the one yielding the highest AUC. The results of the best-performing combination models are shown in the final rows of Table 7 (Combined-Best-LR) and Table 8 (Combined-AugmentedLR). Both tables also mark which models were used for this combination model, for each of the four datasets.
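A sketch of this combination step, assuming the out-of-sample predictions of each specialized model set have already been computed, might look as follows; the exhaustive subset search mirrors the selection on the first data split described above.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def fit_combiner(base_preds, y):
    """Higher-order logistic regression over the predictions of several
    specialized model sets; its learned weights act as a weighted vote."""
    return LogisticRegression(max_iter=1000).fit(np.column_stack(base_preds), y)

def select_best_combination(preds_train, y_train, preds_val, y_val):
    """Evaluate every subset of base predictors (dicts name -> prediction array)
    and keep the combination whose combiner achieves the highest validation AUC."""
    best_auc, best_subset, best_model = -1.0, None, None
    names = list(preds_train)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            combiner = fit_combiner([preds_train[n] for n in subset], y_train)
            stacked_val = np.column_stack([preds_val[n] for n in subset])
            auc = roc_auc_score(y_val, combiner.predict_proba(stacked_val)[:, 1])
            if auc > best_auc:
                best_auc, best_subset, best_model = auc, subset, combiner
    return best_subset, best_model, best_auc
```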

The combination models outperform the baselines as well as each individual set of specialized models, for both the Best-LR and AugmentedLR feature sets, on all datasets. These combination models are the most accurate of all the logistic regression models discussed in this paper, and they also outperform all the deep learning baselines (Table 5) on all datasets except Eedi, where DKT is the only model that produces more accurate predictions. Looking at the Best-LR based combination models (Combined-Best-LR) we observe large AUC improvements over the baseline models, with the biggest gains for EdNet KT3 and Junyi15. The minimum number of individual predictions used by a combination model is 3 (for Eedi) and the maximum number is 6 (for EdNet KT3). For the AugmentedLR based combination models (Combined-AugmentedLR) we observe AUC score improvements over the baseline models, smallest for Junyi15 and largest for EdNet KT3. The AugmentedLR models already contain many of the features used for partitioning and are more data intensive than the Best-LR based models, which is likely the reason for the smaller performance increment. The predictions of the combined AugmentedLR models rely mainly on the predictions of time- and study module-specialized models. Only the combination model for Junyi15 uses the area-specialized models as a third signal.

We note that while we were able to enhance prediction quality using only simple single feature partitioning schemes, future work on better strategies to partition the training data to train specialized models is likely to yield even larger benefits.

6 Results Summary and Discussion

The main results of this paper show that the state of the art of student performance modeling can be advanced through new machine learning approaches, yielding more accurate student assessments and in particular more accurate predictions of which questions a student will be able to answer correctly at any given point in time as they move through the online course. In particular we show:

  • State-of-the-art logistic regression approaches to student performance modeling can be further improved by incorporating a set of new features that can be easily calculated from the question-response pairs that appear in student log data from nearly all ITSs. For example, these include counts and smoothed ratios of correctly answered questions and overall attempts, partitioned by question, by knowledge component, and by time window, as well as specific sequences of correct/incorrect responses over the most recent student responses. We refer to the previous state-of-the-art logistic regression model as Best-LR [Gervet et al. (2020)] and to the logistic regression model that incorporates these additional features as Best-LR+. Our experiments show that Best-LR+ yields more accurate student modeling than Best-LR across all four diverse ITS logs considered in this paper. We conclude that most tutoring systems that perform student modeling should benefit by incorporating these features.

  • A second way of improving over the state of the art in student modeling is to incorporate new types of features that go beyond the traditional question-response data typically logged in all ITSs. For example, accuracy is improved by incorporating features such as the time students took to answer the previous questions, student performance on earlier questions associated with prerequisites to the knowledge component of the current question, and information about the study module (e.g., does the question appear in a pre-test, post-test, or as a practice problem). We conclude that future tutoring systems should log the information needed to provide these features to their student performance modeling algorithms.

  • A third way to improve on the state of the art is to train multiple, specialized student performance models and then combine their predictions to form a final group prediction. For example, we found that training distinct logistic regression models on different partitions of the data (e.g., partitioning the data by its position in the sequential log, or by the knowledge component being tested) leads to improved accuracy. Furthermore, combining the predictions of different specialized models leads to additional accuracy improvements (e.g., combining the predictions of specialized models trained on different question bundles, with predictions of specialized models trained on different periods of time in the sequential log). We conclude that time-specialized models can help ameliorate the problem of assessing new students who have not yet created a long sequence of log data. Furthermore, we feel that as future tutoring systems are adopted by more and more students, the increasing size of student log datasets will make this approach of training and combining specialized models increasingly effective.

  • Although our primary focus here is on logistic regression models, we also considered top-performing neural network approaches including DKT [Piech et al. (2015)], SAKT [Pandey and Karypis (2019)], SAINT [Choi et al. (2020)] and SAINT+ [Shin et al. (2021)] as additional state-of-the-art systems against which we compare. Our experiments show that among these neural network approaches, DKT consistently outperforms the others across our four diverse datasets. However, we also find that our logistic regression model Combined-AugmentedLR, which combines the three above points, outperforms all of these neural network models on average across the four datasets, and outperforms them all on three of the four individual datasets (DKT outperforms Combined-AugmentedLR on the Eedi dataset). We do find that neural network approaches are promising, however, especially due to their ability in principle to automatically discover additional features that logistic regression cannot discover on its own. Furthermore, we believe this ability of neural networks will be increasingly important as available student log datasets continue to increase both in size and in diversity of logged features. We conclude that a promising direction for future research is to explore the integration of our above three approaches into DKT and other neural network approaches.

It is useful to consider how our results relate to previously published results. We found that the time window features proposed by DAS3H [Choffin et al. (2019)] enhanced Best-LR assessments for all four datasets, providing strong support for their proposed features. Note that when Gervet et al. (2020) introduced the Best-LR model they also experimented with time window features, but unlike us did not observe consistent benefits. This might be due to the number of additional parameters that must be trained when incorporating these time window features, and the corresponding need for larger training datasets. Recall that the datasets we use in this manuscript are about one order of magnitude larger than the ones used by Gervet et al. (2020). The elapsed and lag time features introduced by SAINT+ [Shin et al. (2021)] also improve Best-LR predictions substantially in our experiments. Interestingly, the performance increment for the Best-LR model produced by these features is comparable to, and sometimes even greater than, the performance difference between SAINT and SAINT+ (SAINT+ is an improved version of SAINT that uses the two interaction time based features). This suggests that a deep learning-based approach might not be required to leverage these elapsed time and lag time features effectively.

Considering features that require augmented student logs containing more than question-response pairs, we found that these augmented features vary in utility. We found the feature "current study module" to be a particularly useful signal for all datasets. During preliminary analysis we observed differences in student performance between different parts of the learning sessions (i.e., pre-test, learning, post-test, ...), which are captured by the study module feature. Even though post-tests tend to contain more difficult questions on average, the highest level of student performance is observed during post-test sessions in which the users' overall performance is evaluated.

Importantly, we also found that introducing background knowledge about prerequisites and postrequisites among the knowledge components (KCs) in the curriculum is very useful. As summarized in Table 3, counts of correctly answered questions and attempts associated with pre- and post-requisite KCs are among the most informative features. Importantly, these features can be easily incorporated even into pre-existing ITS’s and datasets that log only question-response pairs, because calculating these features requires no new log data – only annotating it using background knowledge about their prerequisite structure.

We compared different types of machine learning algorithms for student performance modeling in Subsection 5.3. One limitation is that we did not refit IRT's student ability parameter after each user response. Wilson et al. (2016) showed that refitting the ability parameter after each interaction makes IRT more competitive on multiple smaller datasets.

Recently, multiple intricate deep learning based techniques have been proposed that yield state-of-the-art performance for specific datasets (e.g., Shin et al. (2021), Zhou et al. (2021), Zhang et al. (2021)). Unfortunately, many of these works employ only one or two datasets, which raises the question of how suitable they are for other tutoring systems. The code and new data released alongside this manuscript increase the usability of multiple large-scale educational datasets. We hope that future works will leverage these available datasets to test whether novel student performance modeling algorithms are effective across different tutoring systems.

7 Conclusion

In this paper we present several approaches to extending the state of the art in student performance modeling. We show that approaches based on logistic regression can be improved by adding specific new features to student log data, including features that can be calculated from simple question-response student logs, and features not traditionally captured in student logs such as which videos the student watched or skipped. Furthermore, we introduce a new approach of training multiple logistic regression models on different partitions of the data, and show further improvements in student performance modeling by automatically combining the predictions of these multiple specialized logistic regression models.

Taken together, our proposed improvements lead to our Combined-AugmentedLR method which achieves a new state of the art for student performance modeling. Whereas the previous state-of-the-art logistic regression approach, Best-LR [Gervet et al. (2020)], achieves an AUC score of 0.766 on average over our four datasets, our Combined-AugmentedLR approach achieves an improved AUC of 0.807 – a reduction of 17.3% in the AUC error (i.e., in the difference between the achieved AUC score and the ideal AUC score of 1.0). Furthermore, we observed that Combined-AugmentedLR achieves improvements consistently across all four of the diverse datasets we considered, suggesting that our methods should be useful across a broad range of intelligent tutoring systems.

To encourage researchers to compare against our approach, three of the four datasets we chose are publicly available. We make the fourth, Squirrel Ai ElemMath2021, available via Carnegie Mellon University's DataShop: https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=4888. In addition, we make the entire code base for the algorithms and experiments reported here available on GitHub: https://github.com/rschmucker/Large-Scale-Knowledge-Tracing. Our implementation converts the four large-scale datasets into a standardized format and uses parallelization for efficient processing. It increases the usability of the large-scale log data, and we hope that it will benefit future research on novel student performance modeling techniques.

Appendix A Appendix: Extended Feature Description

This Appendix provides a reference with additional implementation details for the features described in Section 4. The individual ITSs capture similar information in different ways, and we discuss the necessary system-specific adaptations. While we already share the complete code base that was used to generate the experimental results in this paper on GitHub, this Appendix is intended to guide independent re-implementations.

Question (Item) ID

All four of our datasets assign each question (i.e., item) a unique identifier. To make a performance prediction for a question, our implementation converts the corresponding question identifier into a sparse one-hot vector which is then passed to the machine learning algorithm. Note that a one-hot vector refers to a vector where all values are zero except for a single value of 1. For example, given a dataset with $n$ different questions, our one-hot vector representation contains $n$ values, of which $n-1$ are zeros, and just a single value of 1 to indicate the current question. Knowledge about question identifiers allows a model to learn question-specific difficulty parameters.

Knowledge Component (KC) ID

Knowledge about KC identifiers allows a model to learn skill-specific difficulty parameters. While ElemMath2021 and Junyi15 only assign a single KC identifier per question, EdNet KT3 and Eedi can assign multiple KC identifiers to the same question. To make a performance prediction for a question, our implementation uses a sparse vector which is 0 everywhere except in the entries which mark the corresponding KC identifiers with a value of 1.
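The sketch below shows how these question and KC identifier encodings could be built as sparse matrices; n_questions and n_kcs denote the number of distinct questions and KCs in the dataset, and the input formats are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

def one_hot_questions(question_ids, n_questions):
    """One row per interaction; a single 1 in the column of the answered question."""
    rows = np.arange(len(question_ids))
    data = np.ones(len(question_ids))
    return csr_matrix((data, (rows, question_ids)),
                      shape=(len(question_ids), n_questions))

def multi_hot_kcs(kc_id_lists, n_kcs):
    """One row per interaction; 1s in the columns of every KC tagged on the
    question (EdNet KT3 and Eedi can tag several KCs per question)."""
    rows, cols = [], []
    for i, kcs in enumerate(kc_id_lists):
        rows.extend([i] * len(kcs))
        cols.extend(kcs)
    data = np.ones(len(rows))
    return csr_matrix((data, (rows, cols)), shape=(len(kc_id_lists), n_kcs))
```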

Total/KC/Question Counts

Count features summarize a student’s history of past interaction with the ITS and are an important component of PFA [Pavlik Jr et al. (2009)] and Best-LR [Gervet et al. (2020)]. In our experiments we evaluate three ways of counting the number of prior correct responses and overall attempts:

  1. Total counts: Here we compute two features capturing the total number of prior correct responses and overall attempts.

  2. KC counts: For each individual KC we compute two features capturing the number of prior correct responses and attempts related to questions that target the respective KC. A vector containing the counts for all KCs related to the current question is then passed to the machine learning algorithm.

  3. Question counts: Here we compute two features capturing the total number of prior correct responses and attempts on the current question.

All count features are subjected to a scaling function before being passed to the machine learning algorithm; this avoids features of large magnitude.
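A sketch of the count feature computation for a single student's chronologically ordered interactions is given below. The scaling function itself is not reproduced here; the log transform used in the sketch is only a stand-in, and the interaction dictionary fields are illustrative assumptions.

```python
import math
from collections import defaultdict

def scale(x):
    # Stand-in for the scaling function mentioned in the text; a log transform
    # is one common way to keep count features of modest magnitude.
    return math.log(1.0 + x)

def count_features(interactions):
    """For each response, emit total, per-KC and per-question (correct, attempt)
    counts computed from the interactions *before* that response. Each
    interaction is assumed to be a dict with 'question_id', 'kc_ids', 'correct'."""
    total_c = total_a = 0
    kc_c, kc_a = defaultdict(int), defaultdict(int)
    q_c, q_a = defaultdict(int), defaultdict(int)
    rows = []
    for it in interactions:
        q, kcs = it["question_id"], it["kc_ids"]
        rows.append({
            "total": (scale(total_c), scale(total_a)),
            "kc": {k: (scale(kc_c[k]), scale(kc_a[k])) for k in kcs},
            "question": (scale(q_c[q]), scale(q_a[q])),
        })
        total_a += 1
        q_a[q] += 1
        for k in kcs:
            kc_a[k] += 1
        if it["correct"]:
            total_c += 1
            q_c[q] += 1
            for k in kcs:
                kc_c[k] += 1
    return rows
```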

Total/KC/Question Time-Window (TW) Counts

Time-window based count features summarize student history over different periods of time and provide the model with temporal information [Choffin et al. (2019)]. Following the original DAS3H implementation, we define a set of time windows measured in days. For each window, we count the number of prior correct responses and overall attempts of the student which fall into that window. We evaluate three ways of counting the number of prior correct responses and overall attempts:

  1. Total TW counts: For each time-window, we compute two features capturing the total number of prior correct responses and overall attempts.

  2. KC TW counts: For each time-window and each individual KC we compute two features capturing the number of prior correct responses and attempts related to questions that target the respective KC. A vector containing the counts for all time-windows and KCs related to the current question is then passed to the machine learning algorithm.

  3. Question TW counts: For each time-window, we compute two features capturing the total number of prior correct responses and attempts on the current question.

All count features are subjected to a scaling function before being passed to the machine learning algorithm; this avoids features of large magnitude.
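The sketch below illustrates the total time-window counts for one student; the KC- and question-level variants additionally condition the history on the KCs or the question of the current interaction. The window set and timestamp unit are placeholders, not the exact values used in our experiments.

```python
def tw_count_features(interactions, windows_in_days):
    """For each response of one student, count prior correct answers and overall
    attempts that fall into each backward-looking time window (in days).
    Timestamps are assumed to be Unix seconds."""
    day = 24 * 60 * 60
    history = []  # (timestamp, correct_flag) of all prior interactions
    rows = []
    for it in interactions:
        t = it["timestamp"]
        feats = {}
        for w in windows_in_days:
            in_window = [c for (ts, c) in history if ts >= t - w * day]
            feats[w] = (sum(in_window), len(in_window))  # (correct, attempts)
        rows.append(feats)
        history.append((t, int(it["correct"])))
    return rows
```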

Current/Prior Elapsed Time

Elapsed time measures the time span from question display to response submission [Shin et al. (2021)]. The idea is that a faster response is correlated with student proficiency. For our experiments we subject elapsed time values to a scaling function and also use the categorical one-hot encodings from [Shin et al. (2021)], where elapsed time is capped at 300 seconds and categorized based on the integer second. We evaluate two variations of this feature. In the first version we compute elapsed time based on interactions with the current question. In the second version we compute elapsed time based on interactions with the prior question. Because it is unknown ahead of time how long a student will take to answer a question, the elapsed time value of the current question is not available for question scheduling purposes and is excluded from the model comparison in Subsection 5.3.

Current/Prior Lag Time

Lag time measures the time passed between the completion of the previous exercise and the moment the next question is received [Shin et al. (2021)]. Lag time can be indicative of short-term memorization and forgetting. For our experiments we subject lag time values to a scaling function and also use the categorical one-hot encodings from [Shin et al. (2021)], where lag time is rounded to integer minutes and assigned to one of 150 categories. Because we cannot compute a lag time value for the very first question a student encounters, we use an additional indicator flag. We evaluate two variations of this feature. In the first version we compute lag time based on interactions with the current question. In the second version we compute lag time based on interactions with the prior question.

Date-Time: Month, Week, Day, Hour

Date-time features provide information about the temporal context a learning activity is placed in. Here we consider the month, week, day and hour of interaction. We pass each of these four attributes to the algorithm using one-hot encodings.

Study Module: One-Hot/Counts

All four datasets group study activities into distinct categories or modules (e.g., pre-test, effective learning, review, ...). Providing a machine learning model with information about the corresponding study module can help adapt the predictions to the different learning contexts. ElemMath2021 indicates different modules with the s_module attribute. EdNet KT3 indicates different modules with the source attribute. Eedi indicates different modules with the SchemeOfWorkId attribute. For Junyi15 we can derive 8 study module identifiers corresponding to the unique combinations of the topic_mode, review_mode and suggested flags. We encode these study module attribute values into one-hot vectors before passing them to the algorithm.

Teacher/Group ID

The ElemMath2021 dataset annotates each student response with the identifier of the supervising teacher. Similarly, the Eedi dataset annotates each student response with the student's group identifier. Both attributes provide information about the current learning context. We encode the identifiers into one-hot vectors to allow the machine learning algorithm to learn teacher/group-specific parameters.

School ID

The ElemMath2021 dataset associates each student with a physical or virtual tutoring center. Each tutoring center is assigned a unique school identifier. To capture potential differences between the various schools, we allow the machine learning algorithm to learn school-specific parameters by encoding the school identifiers into one-hot vectors.

Course

The ElemMath2021 dataset contains logs from a variety of different mathematics courses. While each course is assigned a unique course identifier, the KCs and questions treated in the individual courses can overlap. To capture differences in the context set by the individual courses, we allow the machine learning algorithm to learn course-specific parameters by encoding the course identifiers into one-hot vectors.

Topic

The ElemMath2021 dataset organizes learning activities into courses which themselves are split into multiple topics. Each topic is assigned a unique topic identifier. The ITS can deploy the same question in multiple topics, and the differences in learning context might affect student performance. We allow the machine learning algorithm to learn topic-specific parameters by encoding the topic identifiers into one-hot vectors.

Difficulty

The ElemMath2021 dataset associates each question with a manually assigned difficulty score from a fixed set of values. We learn a model parameter for each distinct difficulty value by using a one-hot encoding.

Bundle/Quiz ID

The EdNet KT3 dataset annotates each response with a bundle identifier and the Eedi dataset annotates each response with a quiz identifier. Both bundles and quizzes mark sets of multiple questions which are asked together. The ITS can decide to assign a bundle/quiz to the student, who then needs to respond to all associated questions. To capture the learning context provided by the current bundle/quiz, we encode the corresponding identifiers into one-hot vectors.

Part/Area one-hot/counts

The EdNet KT3 dataset assigns each question one label based on which of the 7 TOEIC exam parts it addresses. Similarly, the Junyi15 dataset assigns each question one of 9 area identifiers which marks the area of mathematics the question addresses. We allow the machine learning algorithm to learn part/area-specific parameters by encoding the part/area identifiers into one-hot vectors. In addition, we experiment with two count features capturing the total number of prior correct responses and attempts on questions related to the current part/area. Before passing the count features to the algorithm we subject them to a scaling function.

Age

The Eedi dataset provides an attribute which captures students’ birth date. We learn a model parameter for each distinct age by using a one-hot encoding. Students without a specified age are assigned a separate parameter.

Gender

The Eedi dataset categorizes student gender into female, male, other and unspecified. We learn a model parameter for each attribute value by using one-hot vectors of dimension four.

Social Support

The Eedi dataset provides information on whether students qualify for England’s pupil premium grant (a social support program for disadvantaged students). The attribute categorizes students into qualified, unqualified and unspecified. We learn a model parameter for each attribute value by using one-hot vectors of dimension three.

Platform

The ElemMath2021 dataset contains an attribute which indicates whether a question was answered at a physical tutoring center or through the online system. Similarly, the EdNet KT3 dataset indicates whether a question was answered using the mobile app or a web browser. To pass this information to the model we use a two-dimensional one-hot encoding.

Prerequisite: one-hot/counts

In addition to a KC model, three of the datasets provide a graph structure that captures semantic dependencies between individual KCs and questions. ElemMath2021 offers a prerequisite graph that marks relationships between KCs. Junyi15 provides a prerequisite graph that describes dependencies between questions. In contrast, Eedi organizes its KCs via a 4-level topic ontology tree. For example, the KC Add and Subtract Vectors falls under the umbrella of Basic Vectors, which itself is assigned to Geometry and Measure, which is connected to the tree root Mathematics. To extract prerequisite features from Eedi's KC ontology we derive a pseudo prerequisite graph by taking the two lower layers of the ontology tree and using the parent nodes as prerequisites to the leaf nodes. We evaluate two ways of utilizing prerequisite information for student performance modeling:

  1. Prerequisite IDs: For each question we employ a sparse vector that is zero everywhere except in the entries that mark the relevant prerequisite KCs (for ElemMath2021 and Eedi) or questions (for Junyi15).

  2. Prerequisite counts: For each question we look at its prerequisite KCs (for ElemMath2021 and Eedi) or prerequisite questions (for Junyi15). For each prerequisite we then compute two features capturing the number of prior correct responses and attempts related to the respective prerequisite KC or question. After being subjected to a scaling function, a vector containing the counts for all relevant prerequisite KCs/questions is passed to the machine learning algorithm, as illustrated in the sketch below.
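A sketch of the prerequisite count computation for one student follows; here `prereq_graph` maps a question to its prerequisite KCs or questions, and the `node_ids` field and the identity scaling default are illustrative assumptions rather than names from our released code.

```python
from collections import defaultdict

def prereq_count_features(interactions, prereq_graph, scale=lambda x: x):
    """For each response of one student, emit (correct, attempt) counts on the
    prerequisites of the current question. `prereq_graph` maps a question ID to
    its prerequisite KCs (ElemMath2021, Eedi) or questions (Junyi15)."""
    node_correct, node_attempts = defaultdict(int), defaultdict(int)
    rows = []
    for it in interactions:
        prereqs = prereq_graph.get(it["question_id"], [])
        rows.append({p: (scale(node_correct[p]), scale(node_attempts[p]))
                     for p in prereqs})
        # afterwards, update the counts of the nodes the current question maps to
        for node in it["node_ids"]:  # KC IDs or the question ID itself
            node_attempts[node] += 1
            if it["correct"]:
                node_correct[node] += 1
    return rows
```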

Postrequisite: one-hot/counts

By inverting the directed edges of the prerequisite graph we derive a postrequisite graph. Analogous to the prerequisite case we encode postrequisite information in two ways: (i) As sparse vectors that are zero everywhere except in the entries that mark the postrequisite KCs/questions with a 1; (ii) As correct and attempt count features computed for each KC/question that is postrequisite to the current question. For further details refer to the above prerequisite feature description.

Video: count/skipped/time

The ElemMath2021 and EdNet KT3 datasets both provide information on how students interact with lecture videos. We evaluate three ways of utilizing video consumption behaviour for performance modeling:

  1. Videos watched count: Here we compute two features: (i) The total number of videos a student has interacted with before; (ii) The number of videos a student has interacted with before related to the KCs of the current question.

  2. Videos skipped count: ElemMath2021 captures video skipping events directly. For EdNet KT3 we count a video as skipped if the student watches less than 90%. Again we compute two features: (i) The total number of videos a student has skipped before; (ii) The number of videos a student has skipped before related to the KCs of the current question.

  3. Videos watched time: Here we compute two features: (i) The total time a student has spent watching videos in minutes; (ii) The time a student has spent watching videos related to the KCs of the current question in minutes.

All count and time features are subjected to a scaling function before being passed to the machine learning algorithm; this avoids features of large magnitude.

Reading: count/time

The ElemMath2021 and EdNet KT3 datasets both provide information on how users interact with reading materials. ElemMath2021 captures when a student goes through a question analysis and EdNet KT3 creates a log whenever a student enters a written explanation. We evaluate two ways of utilizing reading behaviour for student performance modeling:

  1. Reading count: Here we compute two features: (i) The total number of reading materials a student has interacted with before; (ii) The number of reading materials a student has interacted with before related to the KCs of the current question.

  2. Reading time: Here we compute two features: (i) The total time a student has spent on reading materials in minutes; (ii) The time a student has spent on reading materials related to the KCs of the current question in minutes.

The count and time features are subjected to a scaling function before being passed to the machine learning algorithm; this avoids features of large magnitude.

Hint: count/time

The Junyi15 dataset captures how students make use of hints. Whenever a student answers a question the system logs how many hints were used and how much time was spent on each individual hint. Students are allowed to submit multiple answers to the same question, though a correct response is only registered if it is the first attempt and no hints are used. We evaluate two ways of utilizing hint usage for student performance modeling:

  1. Hint count: Here we compute two features: (i) The total number of hints a student has used before; (ii) The number of hints a student has used before related to the KCs of the current question.

  2. Hint time: Here we compute two features: (i) The total time a student has spent on hints in minutes; (ii) The time a student has spent on hints related to the KCs of the current question in minutes.

The count and time features are subjected to a scaling function before being passed to the machine learning algorithm; this avoids features of large magnitude.

Smoothed Average Correctness

Let $c_s$ and $a_s$ denote the number of prior correct responses and overall attempts of student $s$, respectively, and let $\bar{r}$ be the average correctness rate over all other students in the dataset. The linear logistic model is unable to infer the ratio of average student correctness on its own. Because of this we introduce the smoothed average correctness feature $\bar{r}_s$, which captures the average correctness of student $s$ over time as

$$\bar{r}_s = \frac{c_s + \eta \, \bar{r}}{a_s + \eta}.$$

Here, $\eta$ is a smoothing parameter which biases the estimated average correctness rate of student $s$ towards the all-students average $\bar{r}$. The use of smoothing reduces the feature variance during a student's initial interactions with the ITS. A prior work by Pavlik et al. (2020) proposed, but did not evaluate, an average correctness feature without the smoothing parameter for student performance modeling. While calibrating this parameter for our experiments, we observed benefits from smoothing and settled on a fixed value of $\eta$.
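A minimal sketch of this feature, assuming the shrinkage form written above, is given below; the value of eta in the example is illustrative only.

```python
def smoothed_avg_correctness(prior_correct, prior_attempts, cohort_rate, eta):
    """Student correctness rate shrunk towards the all-students rate
    `cohort_rate` by the smoothing parameter `eta` (calibrated value not
    reproduced here)."""
    return (prior_correct + eta * cohort_rate) / (prior_attempts + eta)

# A brand-new student (0 attempts) starts at the cohort average; the estimate
# then moves towards the student's own rate as attempts accumulate.
print(smoothed_avg_correctness(0, 0, 0.72, eta=5))    # 0.72
print(smoothed_avg_correctness(40, 50, 0.72, eta=5))  # ~0.793
```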

Response Pattern

Inspired by the use of n-gram models in the NLP community (e.g., Manning and Schütze (1999)), we propose response patterns as a feature which allows logistic regression models to infer factors impacting short-term student performance. At time $t$, a response pattern is defined as a one-hot encoded vector that represents the student's sequence of $n$ most recent responses, formed by the corresponding binary correctness indicators. The encoding process is visualized by Figure 3 in the paper.
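A sketch of the response pattern encoding for a window of n most recent responses is given below; the zero-padding for students with fewer than n prior responses is a simplification made for this sketch.

```python
import numpy as np

def response_pattern(recent_correct, n):
    """One-hot encode the student's n most recent binary correctness indicators.
    There are 2**n possible patterns; students with fewer than n prior
    responses are zero-padded on the left."""
    window = ([0] * n + [int(c) for c in recent_correct])[-n:]
    index = int("".join(str(b) for b in window), 2)  # pattern -> integer index
    vec = np.zeros(2 ** n)
    vec[index] = 1.0
    return vec

# Example with n = 3: last three responses were incorrect, correct, correct.
print(response_pattern([0, 1, 1], n=3))  # single 1 at index 0b011 = 3
```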

Appendix B Appendix: Hyperparameters and Model Specifications

Here we provide information about the hyperparameters used for the deep learning based student performance modeling approaches evaluated in Subsection 5.3. For all models we relied on the hyperparameters reported in the cited references (DKT [Piech et al. (2015)], SAKT [Pandey and Karypis (2019)], SAINT [Choi et al. (2020)], SAINT+ [Shin et al. (2021)]). A detailed list of the hyperparameters used is provided in Table 10. All models were trained for 100 epochs without learning rate decay.

Model DKT SAKT SAINT SAINT+
Hidden Size 200 200 200 200
Embedding Size 200 200 200 200
Number of Layers 1 1 6 6
Dropout Rate 0.5 0.2 - -
Truncated Sequence Length - 200 100 100
Number of Heads - 5 8 8
Learning Rate
Batch Size 100 200 128 128
Absolute Gradient Clip - 10 - -
Table 10: Hyperparameters of deep learning based approaches.

Acknowledgments

We would like to thank Squirrel Ai Learning for providing the Squirrel Ai ElemMath2021 dataset. In particular, Richard Tong and Dan Bindman were instrumental in providing this dataset and in helping us understand it. We thank Ken Koedinger and Cindy Tipper for their assistance in incorporating the ElemMath2021 data into CMU's DataShop to make it widely accessible to other researchers. We thank Jack Wang and Angus Lamb for personal communication that enabled us to reconstruct the exact sequence of student interactions in the Eedi data logs. We thank Theophile Gervet for making his earlier implementations of several algorithms available in his code repository and for helping us understand their details. We are grateful to the teams at the CK-12 Foundation and at Squirrel Ai Learning for suggestions on how to improve the presentation in this paper. This work was supported in part through the CMU - Squirrel Ai Research Lab on Personalized Education at Scale, and in part by AFOSR under award FA95501710218.

References

  • Aikens and Barbarin (2008) Aikens, N. L. and Barbarin, O. 2008. Socioeconomic differences in reading trajectories: The contribution of family, neighborhood, and school contexts. Journal of educational psychology 100, 2, 235.
  • Badrinath et al. (2021) Badrinath, A., Wang, F., and Pardos, Z. 2021. pybkt: An accessible python library of bayesian knowledge tracing models. arXiv preprint arXiv:2105.00385.
  • Cen et al. (2006) Cen, H., Koedinger, K., and Junker, B. 2006. Learning factors analysis–a general method for cognitive model evaluation and improvement. In International Conference on Intelligent Tutoring Systems. Springer, 164–175.
  • Chang et al. (2015) Chang, H.-S., Hsu, H.-J., and Chen, K.-T. 2015. Modeling exercise relationships in e-learning: A unified approach. In EDM. 532–535.
  • Choffin et al. (2019) Choffin, B., Popineau, F., Bourda, Y., and Vie, J.-J. 2019. Das3h: modeling student learning and forgetting for optimally scheduling distributed practice of skills. In Proceedings of the 12th International Conference on Educational Data Mining. 29–38.
  • Choi et al. (2020) Choi, Y., Lee, Y., Cho, J., Baek, J., Kim, B., Cha, Y., Shin, D., Bae, C., and Heo, J. 2020. Towards an appropriate query, key, and value computation for knowledge tracing. In Proceedings of the Seventh ACM Conference on Learning@ Scale. 341–344.
  • Choi et al. (2020) Choi, Y., Lee, Y., Shin, D., Cho, J., Park, S., Lee, S., Baek, J., Bae, C., Kim, B., and Heo, J. 2020. Ednet: A large-scale hierarchical dataset in education. In International Conference on Artificial Intelligence in Education. Springer, 69–73.
  • Corbett and Anderson (1994) Corbett, A. T. and Anderson, J. R. 1994. Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction 4, 4, 253–278.
  • d Baker et al. (2008) d Baker, R. S., Corbett, A. T., and Aleven, V. 2008. More accurate student modeling through contextual estimation of slip and guess probabilities in bayesian knowledge tracing. In International conference on intelligent tutoring systems. Springer, 406–415.
  • Ding and Larson (2019) Ding, X. and Larson, E. C. 2019. Why deep knowledge tracing has less depth than anticipated. International Educational Data Mining Society.
  • Ding and Larson (2021) Ding, X. and Larson, E. C. 2021. On the interpretability of deep learning based models for knowledge tracing. arXiv preprint arXiv:2101.11335.
  • Feng et al. (2009) Feng, M., Heffernan, N., and Koedinger, K. 2009. Addressing the assessment challenge with an online system that tutors as it assesses. User modeling and user-adapted interaction 19, 3, 243–266.
  • Gervet et al. (2020) Gervet, T., Koedinger, K., Schneider, J., Mitchell, T., et al. 2020. When is deep learning the best approach to knowledge tracing? Journal of Educational Data Mining (JEDM) 12, 3, 31–54.
  • Ghosh et al. (2020) Ghosh, A., Heffernan, N., and Lan, A. S. 2020. Context-aware attentive knowledge tracing. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2330–2339.
  • González-Brenes et al. (2014) González-Brenes, J., Huang, Y., and Brusilovsky, P. 2014. General features in knowledge tracing to model multiple subskills, temporal item response theory, and expert knowledge. In The 7th international conference on educational data mining. University of Pittsburgh, 84–91.
  • Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9, 8, 1735–1780.
  • Huang et al. (2016) Huang, X., Craig, S. D., Xie, J., Graesser, A., and Hu, X. 2016. Intelligent tutoring systems work as a math gap reducer in 6th grade after-school program. Learning and Individual Differences 47, 258–265.
  • Khajah et al. (2016) Khajah, M., Lindsey, R. V., and Mozer, M. C. 2016. How deep is knowledge tracing? In Proceedings of the 9th International Conference on Educational Data Mining. 94–101.
  • Koedinger et al. (2013) Koedinger, K. R., Stamper, J. C., Leber, B., and Skogsholm, A. 2013. Learnlab’s datashop: A data repository and analytics tool set for cognitive science.
  • Käser et al. (2017) Käser, T., Klingler, S., Schwing, A. G., and Gross, M. 2017. Dynamic bayesian networks for student modeling. IEEE Transactions on Learning Technologies 10, 4, 450–462.
  • LeCun et al. (1999) LeCun, Y., Haffner, P., Bottou, L., and Bengio, Y. 1999. Object recognition with gradient-based learning. In Shape, contour and grouping in computer vision. Springer, 319–345.
  • Lindsey et al. (2014) Lindsey, R. V., Shroyer, J. D., Pashler, H., and Mozer, M. C. 2014. Improving students’ long-term knowledge retention through personalized review. Psychological science 25, 3, 639–647.
  • Liu et al. (2019) Liu, Q., Huang, Z., Yin, Y., Chen, E., Xiong, H., Su, Y., and Hu, G. 2019. Ekt: Exercise-aware knowledge tracing for student performance prediction. IEEE Transactions on Knowledge and Data Engineering 33, 1, 100–115.
  • Liu et al. (2021) Liu, Q., Shen, S., Huang, Z., Chen, E., and Zheng, Y. 2021. A survey of knowledge tracing. arXiv preprint arXiv:2105.15106.
  • Manning and Schutze (1999) Manning, C. and Schutze, H. 1999. Foundations of statistical natural language processing. MIT press.
  • Mozer and Lindsey (2016) Mozer, M. C. and Lindsey, R. V. 2016. Predicting and improving memory retention. Big data in cognitive science, 34.
  • Nakagawa et al. (2019) Nakagawa, H., Iwasawa, Y., and Matsuo, Y. 2019. Graph-based knowledge tracing: modeling student proficiency using graph neural network. In 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI). IEEE, 156–163.
  • Pandey and Karypis (2019) Pandey, S. and Karypis, G. 2019. A self-attentive model for knowledge tracing. In Proceedings of the 12th International Conference on Educational Data Mining. 384–389.
  • Pardos and Heffernan (2010) Pardos, Z. A. and Heffernan, N. T. 2010. Modeling individualization in a bayesian networks implementation of knowledge tracing. In International Conference on User Modeling, Adaptation, and Personalization. Springer, 255–266.
  • Pardos and Heffernan (2011) Pardos, Z. A. and Heffernan, N. T. 2011. Kt-idem: Introducing item difficulty to the knowledge tracing model. In International conference on user modeling, adaptation, and personalization. Springer, 243–254.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. 2019. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703.
  • Pavlik Jr et al. (2009) Pavlik Jr, P., Cen, H., and Koedinger, K. 2009. Performance factors analysis - a new alternative to knowledge tracing. In Frontiers in Artificial Intelligence and Applications. Vol. 200. 531–538.
  • Pavlik Jr et al. (2020) Pavlik Jr, P. I., Eglington, L. G., and Harrell-Williams, L. M. 2020. Logistic knowledge tracing: A constrained framework for learner modeling. arXiv preprint arXiv:2005.00869.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. 2011. Scikit-learn: Machine learning in python. the Journal of machine Learning research 12, 2825–2830.
  • Piech et al. (2015) Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L. J., and Sohl-Dickstein, J. 2015. Deep knowledge tracing. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, Eds. Vol. 28. Curran Associates, Inc.
  • Pojen et al. (2020) Pojen, C., Mingen, H., and Tzuyang, T. 2020. Junyi academy online learning activity dataset: A large-scale public online learning activity dataset from elementary to senior high school students. Kaggle.
  • Qiu et al. (2011) Qiu, Y., Qi, Y., Lu, H., Pardos, Z. A., and Heffernan, N. T. 2011. Does time matter? modeling the effect of time with bayesian knowledge tracing. In EDM. 139–148.
  • Rasch (1993) Rasch, G. 1993. Probabilistic models for some intelligence and attainment tests. ERIC.
  • Sao Pedro et al. (2013) Sao Pedro, M., Baker, R., and Gobert, J. 2013. Incorporating scaffolding and tutor context into bayesian knowledge tracing to predict inquiry skill acquisition. In Educational Data Mining 2013.
  • Shen et al. (2020) Shen, S., Liu, Q., Chen, E., Wu, H., Huang, Z., Zhao, W., Su, Y., Ma, H., and Wang, S. 2020. Convolutional knowledge tracing: Modeling individualization in student learning process. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1857–1860.
  • Shin et al. (2021) Shin, D., Shim, Y., Yu, H., Lee, S., Kim, B., and Choi, Y. 2021. Saint+: Integrating temporal features for ednet correctness prediction. In LAK21: 11th International Learning Analytics and Knowledge Conference. 490–496.
  • Stamper et al. (2010) Stamper, J., Niculescu-Mizil, A., Ritter, S., Gordon, G., and Koedinger, K. 2010. Bridge to algebra 2006-2007. development data set from kdd cup 2010 educational data mining challenge.
  • Tong et al. (2020) Tong, H., Wang, Z., Liu, Q., Zhou, Y., and Han, W. 2020. Hgkt: Introducing hierarchical exercise graph for knowledge tracing. arXiv preprint arXiv:2006.16915.
  • van der Linden and Hambleton (2013) van der Linden, W. J. and Hambleton, R. K. 2013. Handbook of modern item response theory. Springer Science & Business Media.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Vol. 30. Curran Associates, Inc.
  • von Stumm (2017) von Stumm, S. 2017. Socioeconomic status amplifies the achievement gap throughout compulsory education independent of intelligence. Intelligence 60, 57–62.
  • Wang et al. (2020) Wang, Z., Lamb, A., Saveliev, E., Cameron, P., Zaykov, Y., Hernández-Lobato, J. M., Turner, R. E., Baraniuk, R. G., Barton, C., Jones, S. P., et al. 2020. Diagnostic questions: The neurips 2020 education challenge. arXiv preprint arXiv:2007.12061.
  • Wang et al. (2021) Wang, Z., Lamb, A., Saveliev, E., Cameron, P., Zaykov, Y., Hernandez-Lobato, J. M., Turner, R. E., Baraniuk, R. G., Barton, C., Jones, S. P., et al. 2021. Results and insights from diagnostic questions: The NeurIPS 2020 education challenge. In NeurIPS 2020 Competition and Demonstration Track. PMLR.