With the growing need for people to keep learning throughout their careers, massive open online course (MOOC) companies, such as Udacity and Coursera, not only aggressively design new courses that are relevant (e.g., self-driving cars and flying cars) but also refresh existing courses’ content frequently to keep it up-to-date. This effort results in a significant increase in student numbers, which makes it impractical for even experienced human instructors to assess an individual student’s level of engagement and anticipate their learning outcomes. Moreover, students in each MOOC classroom are heterogeneous in their backgrounds and intentions, which is very different from a classic classroom [2, 3]. Even subsequent offerings of a course within a year will have a different population of students, mentors, and, in some cases, instructors. Also, due to the nascent nature of online learning platforms, many other aspects of a course evolve quickly, so students are frequently exposed to experimental content modalities or workflow refinements. In this world of MOOCs, an automated system which reliably forecasts students’ performance in real time (or at early stages) would be a valuable tool for making smart decisions about when (and with whom) to make live educational interventions as students interact with online coursework, with the aim of increasing engagement, providing motivation, and empowering students to succeed.
In the last three years, a few deep learning based neural network models for predicting students’ future performance have been explored within the data mining and learning analytics communities [4, 5, 6, 1]. In particular, the recently developed GritNet [1] was the first deep learning architecture to successfully advance the state of the art, demonstrating substantial prediction accuracy improvements (particularly pronounced in the first few weeks, when predictions are extremely challenging). In contrast to other works, the GritNet does not need any feature engineering (it can learn from raw input) and it can operate on any (raw) student event data associated with a time stamp (even when highly imbalanced). Encouraged by this result, we would like to develop methods to use (or operationalize) GritNet for real-time or live predictions on ongoing courses. This raises a fundamental question: would a GritNet trained on a previous course perform (or transfer) reasonably well on a new course (or a subsequent offering of the same course)? In other words, to what extent can a model trained on one course generalize to predict for other courses, or for later iterations of the same course? Performance in a real-time deployment scenario would be significantly impacted by such generalization, yet the impact on models like GritNet of the natural domain shift in a course’s input data over time is not well studied. The broader question of whether there exists a sufficiently abstract input representation to generalize this model across entirely different courses of study (especially to new ones where no student outcome data exists) is also ripe for research.
Since the majority of prior research on student performance prediction has focused on settings where both training (source) and testing (target) data are sampled from the same course, the problem of real-time prediction during a live or ongoing course (not after it has finished) has received very little attention in the literature. While [7] reports preliminary transfer learning performance, the authors also emphasize that MOOCs would be an ideal (and paramount) use case for transfer learning: learning from a previous course to make predictions and design interventions in the next course. In [5, 8], the authors point out that accuracy estimates obtained by training on the same course as the target course for deployment (generally not possible in real-world intervention scenarios) could be overly optimistic. Both works explore an idea of (in-situ) proxy labels as an approximate quantity of interest which can be collected before a course completes, by looking at whether each student interacted with the coursework at least once in the last week.
In this paper, we first introduce a (simple) ordinal input encoding procedure as a basis for GritNet to provide transferable sequence-level embeddings across courses. Building on that, a novel unsupervised domain adaptation method combined with GritNet is proposed. Specifically, in order to leverage the unlabeled data in the target course, we use a GritNet trained on the source course to evaluate and assign pseudo outcome labels to the target course, and then continue training the GritNet on the target course. Unlike the (crude) proxy labels in [7, 5, 8], these pseudo target labels are produced in a principled way using the source GritNet and are used to fine-tune the last fully-connected layer of the GritNet so that it can be applied to other, different courses. As far as we know, this is the first work to effectively address the source and target course distribution mismatch encountered in real-time student performance prediction, which had been considered a challenging open problem in prior works [7, 9]. This paper is structured as follows. After briefly reviewing the original GritNet model in Section 2, Section 3 is devoted to developing an unsupervised domain adaptation method with the GritNet. Finally, we provide, in Section 4, generalization performance results across (real) Udacity Nanodegree programs’ data, and conclude in Section 5.
2 Background: GritNet
The task of predicting student performance can be expressed as a sequential event prediction problem [10]: given a past event sequence $u_1, \ldots, u_t$ taken by a student, estimate the likelihood of the future event sequence $u_{t+1}, \ldots, u_T$, where $T > t$. In the context of online classes, each event represents a student’s action (or activity) associated with a time stamp. In other words, each event $u_i$ is defined as a paired tuple $(a_i, \tau_i)$. Each action $a_i$ represents, for example, “a lecture video viewed”, “a quiz answered correctly/incorrectly”, or “a project submitted and passed/failed”, and $\tau_i$ states the corresponding (logged) time stamp. Then, the log-likelihood of the next event can be written as $\log p(u_{t+1} \mid \mathbf{e})$, given a fixed-dimensional embedding representation $\mathbf{e}$ of the past sequence $u_1, \ldots, u_t$ (Equation 1).
The goal of each GritNet is, therefore, to compute an individual log-likelihood $\log p(u_{t+1} \mid \mathbf{e})$, and those estimated scores can simply be added up to estimate long-term student outcomes.
In order to feed students’ raw event records into the GritNet, it is necessary to encode the time-stamped logs (ordered sequentially) into a sequence of fixed-length input vectors by one-hot encoding. A one-hot vector $\mathbf{v}_{a_i} \in \{0, 1\}^{N_a}$, where $N_a$ is the number of unique actions and whose $k$-th element is defined as
$$[\mathbf{v}_{a_i}]_k = \begin{cases} 1 & \text{if } k \text{ is the index of action } a_i, \\ 0 & \text{otherwise,} \end{cases}$$
is used to distinguish each activity from every other. Then, the one-hot vectors of the same student are concatenated into a long vector sequence to represent the student’s whole sequential activity. To further capture students’ varying learning speeds, the original GritNet defines the discretized time difference between adjacent events as
$$\Delta_i = \operatorname{discretize}(\tau_i - \tau_{i-1}),$$
then one-hot encodes $\Delta_i$ into $\mathbf{v}_{\Delta_i}$ and concatenates it with the corresponding $\mathbf{v}_{a_i}$ to represent event $u_i$ as
$$\mathbf{v}_{u_i} = [\mathbf{v}_{a_i}; \mathbf{v}_{\Delta_i}].$$
Lastly, output sequences shorter than the maximum event sequence length (of a given training set) are pre-padded with all-zero vectors.
The complete GritNet architecture is illustrated in Figure 1. The first embedding layer learns an embedding matrix $\mathbf{W} \in \mathbb{R}^{d \times N}$, where $d$ and $N$ are the embedding dimension and the number of unique events (i.e., the input vector size), to convert an input vector $\mathbf{v}_{u_i}$ into a low-dimensional embedding defined as $\mathbf{e}_i = \mathbf{W}\mathbf{v}_{u_i}$. This (dense) event embedding is then passed into the bidirectional long short-term memory (BLSTM), and the output vectors are formed by concatenating the forward- and backward-direction outputs at each time step. Next, a global max pooling (GMP) layer is added to form a fixed-dimension vector (independent of the input event sequence length) by taking the maximum over time (over the input event sequence). This GMP layer captures the most relevant features over the sequence and is also able to deal with the imbalanced nature of the data. One can view this GMP layer as a hard self-attention layer, or consider its output a sequence-level embedding of the whole input event sequence. The GMP layer output is, ultimately, fed into a fully-connected (FC) layer, and a softmax (i.e., sigmoid) layer subsequently, to calculate the log-likelihood $\log p(u_{t+1} \mid \mathbf{e})$.
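The global max pooling step is simple enough to state concretely. The sketch below uses plain Python lists and toy 4-dimensional vectors in place of real BLSTM outputs; in the actual model the inputs would be concatenated forward/backward LSTM states.

```python
def global_max_pool(outputs):
    """outputs: list of equal-length vectors, one per time step.
    Returns one fixed-dimension vector: the element-wise max over time,
    so the result's length does not depend on the sequence length."""
    return [max(step[d] for step in outputs) for d in range(len(outputs[0]))]

# Three time steps of toy 4-dimensional "BLSTM" outputs:
outputs = [[0.1, 0.9, -0.3, 0.0],
           [0.5, 0.2, 0.7, -0.1],
           [-0.2, 0.4, 0.1, 0.6]]
pooled = global_max_pool(outputs)  # a length-4 vector, whatever the sequence length
```

Because each output dimension keeps only its strongest activation across the sequence, the pooled vector behaves like a hard attention over time steps, which is what makes it usable as a sequence-level embedding.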
3 Domain Adaptation with GritNet
Though GritNet’s superior prediction accuracy was reported in [1], the GritNet was trained and tested on data from a specific Nanodegree program via cross-validation. As such, the test accuracy is an (overly optimistic) proxy for how it may perform on unseen courses. Therefore, it was not fully addressed whether GritNet models transfer well to different courses, or whether they could be deployed for real-time prediction on ongoing courses. With an increase in new courses and the fast pace of content revisions to meet students’ educational needs, MOOC courses are no longer static. This requires the GritNet model to generalize well to unseen courses and to predict performance in real time, for students who have not yet finished the course (i.e., on new data without prior knowledge of the students’ outcomes). In this section, we propose methods for overcoming the biases introduced by specific courses into the GritNet model by leveraging domain adaptation techniques.
3.1 Ordinal Input Encoding
Student activity records collected from different courses often have various lengths, formats, and content (for example, when the contents of a course are revised, new concepts are added or removed, or project requirements are changed for newly introduced courses), so hand-designed aggregate features that are effective in one course might not be so in another. Moreover, carefully engineered feature dimensions are usually constrained to be small in prior works [5, 4, 6]. Both of these deficiencies have produced inputs that are too restricted to tap the full benefits of sequential deep learning models and have limited the successful development of larger deep learning models. To avoid the limitations of prior works, GritNet instead takes students’ learning activities across time as a raw input sequence, as shown in Section 2.
For any MOOC coursework, content exhibits a natural order, following a content tree, so we can use a simple encoding mapping that exploits this ordering information; this encoding procedure does not require any modification of the GritNet architecture. We wish to generate a common mapping in which all the ordered but numeric (raw) content IDs are converted into ordinal values for all online coursework. To ensure this, we could create a sophisticated encoding (e.g., a table of contents), but this implies additional complexity and maintenance of a common lookup source. Therefore, for the sake of simplicity, we began our investigation by grouping content IDs into subcategories and normalizing them using the natural ordering implicit in the content paths of the MOOC within each category. This has performed well enough that we consider it a viable encoding for the moment. Each encoded dataset then contains the same number of unique actions as the original. For example, the Udacity data in Table I contains three subcategories: content pages, quizzes, and projects. So one Nanodegree program dataset has three sequences of ordinal IDs: content-$1$ to content-$N_1$, quiz-$1$ to quiz-$N_2$, and project-$1$ to project-$N_3$. Note that, with this encoding, the number of unique actions turns out to be $N_1 + 2N_2 + 2N_3$, since quizzes and projects each allow two potential actions (i.e., a quiz answered correctly/incorrectly or a project submitted and passed/failed).
Algorithm 3.1: Unsupervised domain adaptation with GritNet.
Input: source course data $D_S$ with source labels $Y_S$; unlabeled target course data $D_T$.
1. Set the source training set as $(D_S, Y_S)$ and train a GritNet $M$ with it.
2. Evaluate $M$ on $D_T$ and assign pseudo-labels $\hat{Y}_T := M(D_T)$ to confidently predicted examples.
3. Update the target training set as $(D_T, \hat{Y}_T)$.
4. Freeze all of $M$ but the last FC layer.
5. Continue training $M$ with the target training set.
3.2 Pseudo-Labels and Transfer Learning
Note that no labels from the target course are available when training GritNet on the source data, and naive transfer of a source GritNet to the target course, in general, yields inferior performance (see the results in Section 4). So, to deal with unlabeled target course data in unsupervised domain adaptation from a source to a target course, we harness the GritNet trained on source data (a common real-world scenario when releasing new, never-before consumed educational content) to assign pseudo-labels to the target data.
Precisely, as described in Algorithm 3.1, we use a trained GritNet to evaluate unannotated target data and assign hard one-hot labels whenever its prediction confidence exceeds a pre-determined threshold (Step 2). We use the softmax probability as the confidence measure. Given the pseudo-labels, we continue training the GritNet on the target course while freezing all of the GritNet but the last FC layer (Steps 4 and 5). This retrains only the top of the GritNet by continuing the backpropagation of errors on the target data, which effectively fine-tunes the GritNet to the target course.
This method was motivated by the observation that the GMP layer output captures a generalizable sequential representation of the input event sequence, whereas the high-level FC layer learns course-specific features. It is worth mentioning that retraining only the last FC layer limits the number of parameters to learn. This makes Algorithm 3.1 a very practical solution, with fast adaptation time (and hence a fast deployment cycle) and applicability to even (relatively) small target courses.
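The pseudo-labeling step can be sketched as follows. The `predict` function, the threshold value, and the symmetric treatment of confident negatives are stand-in assumptions for illustration; the paper itself only specifies thresholding on the softmax confidence.

```python
def assign_pseudo_labels(predict, target_data, threshold=0.9):
    """Keep only target examples the source model is confident about.
    predict: maps an example to a graduation probability in [0, 1].
    Returns (kept_examples, hard 0/1 pseudo-labels)."""
    kept, labels = [], []
    for x in target_data:
        p = predict(x)
        if p >= threshold:            # confident positive
            kept.append(x)
            labels.append(1)
        elif p <= 1.0 - threshold:    # confident negative (assumed symmetric rule)
            kept.append(x)
            labels.append(0)
        # otherwise: low confidence, example is left out of fine-tuning
    return kept, labels

# Toy stand-in for a trained source GritNet: probability is the last feature.
predict = lambda x: x[-1]
target = [[0.2, 0.97], [0.5, 0.55], [0.1, 0.03]]
kept, labels = assign_pseudo_labels(predict, target, threshold=0.9)
# The 0.55 prediction is dropped; only confident examples are pseudo-labeled.
```

In the full algorithm, the `(kept, labels)` pairs would then be used to continue training with every layer frozen except the last FC layer.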
4 Generalization Performance
To validate Algorithm 3.1, we benchmarked our proposed methods on the student datasets of multiple Udacity Nanodegree (ND) programs.
4.1 Udacity Data
Udacity has many different curriculum paths, so it is important to check any generalized modeling approach against many different ND programs. Each program has different prerequisites, different levels of technical proficiency required, different mixtures of content modalities, and ultimately different graduation rates. All of these factors could cause the predictive performance of any model to be unusually high if it is biased towards the distribution of just one ND program (or a few carefully picked and very similar programs).
In all programs, graduation is defined as completing each of the required projects in a ND program curriculum with a passing grade. When a user officially graduates, their enrollment record is annotated with a time stamp, so it was possible to use the presence of this time stamp as the target label. Each program has a designated calendar start and end date for each cohort of students that passes through, and users have to graduate before the official end of their cohort’s term to be considered successfully graduated. Each ND program’s curriculum contains a mixture of video content, written content, quizzes, and projects. Note that it is not required to interact with every piece of content or complete every quiz to graduate. See below and Table I for detailed characteristics of each dataset used for this study.
ND-A v1 Dataset: 5,626 students who enrolled in Version 1 of the ND-A program from 2015-02-03.
ND-A v2 Dataset: 2,230 students who enrolled in Version 2 of the ND-A program from 2015-04-01.
ND-B Dataset: 13,639 students who enrolled in the ND-B program from 2016-06-20.
ND-C Dataset: 4,377 students who enrolled in ND-C program from 2017-03-22.
ND-D Dataset: 4,761 students who enrolled in the ND-D program from 2017-01-14.
For all datasets, an event represents a user taking a specific action (e.g., watching a video, reading a text page, attempting a quiz, or receiving a grade on a project) at a certain time stamp. Some irrelevant data is filtered out during preprocessing; for example, events that occur before a user’s official enrollment as a result of a free-trial period. It should be noted that no personally identifiable information is included in this data, and student identity is determined via opaque unique IDs.
4.2 Experimental Setup
To demonstrate the benefits of unsupervised domain adaptation with the GritNet, we focus on the student graduation prediction task and compare various setups below:
GritNet Baseline: We train a GritNet on the source course and evaluate its performance on the target course. Note that no training data or labels from the target course are used for training the GritNet. In this experiment, since the focus of the paper is to determine the benefits of domain adaptation, a single GritNet architecture, a BLSTM with forward and backward LSTM layers of 256 cells per direction and an embedding layer dimension of 512, is used across weeks. Further hyper-parameter optimization could be done for optimal accuracy at each week.
For all setups, we trained a separate model for each week, based on students’ week-by-week event records, to predict whether each student was likely to graduate.
4.3 Evaluation Measure
Since the true binary target label (1: graduate, 0: not graduate) is imbalanced (i.e., the number of 0s outweighs the number of 1s), accuracy is not an appropriate metric. Instead, we used the Receiver Operating Characteristic (ROC) for evaluating the quality of predictions. An ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR). In this task, the TPR is the percentage of students who graduate whom the GritNet labels positive, and the FPR is the percentage of students who do not graduate whom the GritNet incorrectly labels positive. The accuracy of each system’s predictions was measured by the area under the ROC curve (AUC), which scores between 0 and 100% (the higher, the better), with random guessing yielding 50%. We used $k$-fold student-level cross-validation, while ensuring each fold contained roughly the same proportions of the two groups (graduates and non-graduates) of students.
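AUC on an imbalanced binary task can be computed directly from prediction ranks, without plotting the curve, via the standard Mann-Whitney formulation; the self-contained sketch below shows this (the label/score values are illustrative only).

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.
    labels: 0/1 outcomes (1: graduate); scores: predicted probabilities.
    Equals the probability that a random positive outranks a random negative,
    counting ties as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0, 0]          # imbalanced: 2 graduates, 3 non-graduates
scores = [0.9, 0.4, 0.6, 0.7, 0.2]
```

This rank-based view also makes clear why AUC is insensitive to the class imbalance that breaks plain accuracy: only the relative ordering of positives and negatives matters.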
To compare domain adaptation performance against the upper bound assessed with an oracle setup, we consider another performance measure, the mean AUC recovery rate (ARR), as a metric to evaluate AUC improvements from unsupervised domain adaptation. This measure is defined as the absolute AUC improvement that unsupervised domain adaptation produces over the unadapted baseline, divided by the absolute improvement that oracle (supervised) training produces:
$$\mathrm{ARR} = \frac{\mathrm{AUC}_{\mathrm{adapted}} - \mathrm{AUC}_{\mathrm{baseline}}}{\mathrm{AUC}_{\mathrm{oracle}} - \mathrm{AUC}_{\mathrm{baseline}}}.$$
We assume that the upper bound for unsupervised domain adaptation is the performance of oracle training on the same data, which has an AUC recovery rate of 100%.
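ARR thus measures the fraction of the baseline-to-oracle AUC gap that domain adaptation closes. A one-line sketch (the specific AUC values below are illustrative, not results from the paper):

```python
def auc_recovery_rate(auc_baseline, auc_adapted, auc_oracle):
    """Fraction of the baseline-to-oracle AUC gap closed by adaptation.
    1.0 means the adapted model matches the supervised oracle."""
    return (auc_adapted - auc_baseline) / (auc_oracle - auc_baseline)

# Illustrative numbers: baseline 80% AUC, adapted 85%, oracle 87%.
arr = auc_recovery_rate(0.80, 0.85, 0.87)  # roughly 71% of the gap recovered
```

Note that ARR can exceed 100% in a given week if adaptation happens to outperform the oracle on that week's data, which matches the 100.00% figure reported later for ND-C.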
4.4 Performance Results
In order to evaluate the transferability of the GritNet across different courses, we consider the following various scenarios from one course to another:
ND-A v1 to ND-A v2: There are minor content changes between v1 and v2 of the ND-A program. In this setting, the accuracy gains of the GritNet baseline (over the vanilla baseline) shown in Figure 4 imply that the sequence-level embedding trained with GritNet provides more transferable features than those learned by the vanilla baseline model.
ND-A v1 to ND-C and ND-D: We further analyze the transferability of GritNet from an earlier ND-A v1 program to the later ND-C and ND-D programs. The ND-C curriculum has a lower expectation of prior technical knowledge than ND-D. Consistent with the Figure 4 results, the clear wins of the GritNet baseline over the vanilla baseline in Figure 3 strongly confirm that the sequence-level embedding trained with GritNet is more robust to source and target course distribution mismatch than the features learned by the vanilla baseline model. Moreover, domain adaptation provides 70.60% ARR on average during the first four weeks (up to 84.88% at week 3) for the ND-C dataset and 58.06% ARR during the same four weeks (up to 75.07% at week 3) for the ND-D dataset, relative to the GritNet baseline performances. Altogether, this suggests that the benefits of domain adaptation in fact came from retraining the course-specific top FC layer while using the frozen lower layers as a fixed sequence-level embedding.
ND-B to ND-C and ND-D: We consider another similar scenario, from an earlier ND-B program to the later ND-C and ND-D programs. Unlike ND-A v1, however, the ND-B dataset shows a much shorter average event sequence length (285 vs. 421), so there is far less overlap with typical target event sequences from ND-C and ND-D (285 vs. 675 and 430). In general, as a student progresses, the accumulated event sequence gets longer, so for later weeks there is less overlap, which limits the potential performance gains achievable from domain adaptation. This is a primary reason why the benefits of domain adaptation shown in Figure 3 are less pronounced in this scenario. Even with these limitations in the source dataset ND-B, domain adaptation nevertheless yields 100.00% ARR at week 2 for the ND-C dataset, and 36.41% and 35.58% ARR at weeks 2 and 5, respectively, for the ND-D dataset.
In this paper, we have demonstrated a strong candidate approach for predicting real-time student performance, which has been considered of the utmost importance in MOOCs, as it enables performance prediction while a course is ongoing. In contrast to prior works, we introduced a new domain adaptation algorithm with GritNet and showed that GritNet can be transferred to new courses without labels, with a substantial AUC recovery rate. For future work, we will look into superior input encoding schemes which allow us to effectively handle variable event sequence lengths.
-  B.-H. Kim, E. Vizitei, and V. Ganapathi, “GritNet: Student performance prediction with deep learning,” in 11th International Conference on Educational Data Mining (EDM 2018), 2018, pp. 625–629.
-  I. Chuang and A. Ho, “HarvardX and MITx: Four years of open online courses, Fall 2012-Summer 2016,” SSRN Electronic Journal, 2016.
-  Economist, “Equipping people to stay ahead of technological change - Lifelong learning,” A Special Report On Lifelong Learning: How to Survive in the Age of Automation, 2017.
-  F. Mi and D.-Y. Yeung, “Temporal models for predicting student dropout in massive open online courses,” in Proceedings of 15th IEEE International Conference on Data Mining Workshop (ICDMW 2015), Atlantic City, New Jersey, 2015, pp. 256–263.
-  J. Whitehill, K. Mohan, D. Seaton, Y. Rosen, and D. Tingley, “Delving deeper into mooc student dropout prediction,” arXiv preprint arXiv:1702.06404, 2017.
-  W. Wang, H. Yu, and C. Miao, “Deep model for dropout prediction in moocs,” in Proceedings of the 2nd International Conference on Crowd Science and Engineering (ICCSE 2017), Beijing, China, 2017, pp. 26–32.
-  S. Boyer and K. Veeramachaneni, “Transfer learning for predictive models in massive open online courses,” in 16th International Conference on Artificial Intelligence in Education (AIED 2015). Springer, 2015, pp. 54–63.
-  J. Whitehill, K. Mohan, D. Seaton, Y. Rosen, and D. Tingley, “Mooc dropout prediction: How to measure accuracy?” in Proceedings of the Fourth ACM Conference on Learning @ Scale (L@S 2017), Cambridge, Massachusetts, USA, 2017, pp. 161–164.
-  F. Dalipi, A. S. Imran, and Z. Kastrati, “Mooc dropout prediction using machine learning techniques: Review and research challenges,” in 9th IEEE Global Engineering Education Conference (EDUCON 2018), 2018.
-  C. Rudin, B. Letham, A. Salleb-Aouissi, E. Kogan, and D. Madigan, “Sequential event prediction with association rules,” in 24th Annual Conference on Learning Theory (COLT 2011), 2011, pp. 615–634.
-  Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic language model,” in Advances in Neural Information Processing Systems 13 (NIPS 2000), 2001, pp. 932–938.
-  A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM networks,” in 2005 International Joint Conference on Neural Networks (IJCNN’05), 2005, pp. 23–43.
-  R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in 25th International Conference on Machine Learning (ICML’08). ACM, 2008, pp. 160–167.
-  S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.