1. Introduction
Massive open online courses (MOOCs) have become popular and provide an inexpensive, learner-directed learning environment. In MOOCs, predictive models of student behavior can help an instructor improve student learning, e.g., by providing appropriate feedback and timely intervention. Multiple studies have proposed predictive models to, for example, analyze learning progress, support a better understanding of learning abilities, identify at-risk students, and indicate where proactive interventions may be needed (Chaplot et al., 2015; Kloft et al., 2014).
Quantitative, behaviorally driven predictive models have certain limitations. In particular, while they can be accurate, they may not be interpretable. For example, they do not integrate latent contextual information that is important for explaining the observed behavior, such as a student's fatigue or motivation. Regarding dropout, they do not distinguish students who are not interested in attaining a certificate from autodidacts who ignore learning design patterns. Additionally, such models do not integrate a teacher's goals and intended learning patterns, which are complex factors that influence a learner. Nonetheless, while a course is ongoing and many enrolled students are out of personal contact with the instructor, automated prediction can be helpful (Baker and Inventado, 2014). An accurate model can identify learners who could benefit from suggestive interventions that may prevent them from dropping out. It can assist in guiding learners with adaptive instructional materials and pathways to personally appropriate learning resources. Thus, herein, we focus on technical innovations that improve the automation and predictive accuracy of models useful in transfer learning settings.
There are also technical and practical modeling challenges. Some models predict an outcome that occurs later in the course than the time a student's learning behavior is observed. One example is a model that predicts in week 3 whether a student will drop out in week 4. This type of model is trained, with machine learning, on data retrospectively collected from completed courses. This, however, makes it difficult to use in an ongoing course that differs from the completed course. Different aspects evolve between courses, including the platform affordances and course design, while the learner cohort also differs. Despite best efforts, hand-crafted features developed for the earlier offering may not express correlations in the ongoing course, i.e., they are brittle. Or, they may not exist in a subsequent offering, i.e., they are infeasible. For example, if a feature is defined on specific course exercises, it cannot be used in a subsequent offering if the exercises are removed. Previous work has relied on hand-crafted features for transfer learning and had operational limitations regarding ongoing courses
(Boyer and Veeramachaneni, 2015).

In this paper, our goal is to create operational predictive models for online use. We fundamentally ask whether it is possible, in general, to train a model on a source course's data and reliably transfer it to perform well for an ongoing course. One possible approach is to eliminate customized, human-engineered features that depend on domain knowledge and instead learn a latent representation amenable to model transfer. Therefore, we propose to investigate transductive transfer learning methods (Pan and Yang, 2010); see steps 2 and 3 in Figure 1. These methods assume that no label is available for the target task and instead learn a model from the target domain plus a source domain where there are labels and similarity to the target distribution. In a transductive approach known as representation-based transfer learning, a latent space representation is learned automatically using source and target data (the latter without labels).
Our particular research questions are:

Does representation learning improve model transfer? We evaluate transferability within offerings for two courses and across two courses.

Can representation-based learning work from a universal, basic set of MOOC activity features as input? We test a time series per student where the frequencies of a set of specific MOOC activity types are expressed per time unit.

Can transfer learning improve recognition of minority groups? If we group similar students and apply transfer learning to each group independently, does predictive performance improve?

What are the embedded features that increase transferability?
We employ a class of neural networks called autoencoders (AEs) to compress the input into a latent space representation from which the input is reconstructed as output. To perform well, the AE has to learn, through the training of its weights, to extract the most relevant features into the representation between the encoder and decoder. For transductive transfer, when an AE is trained on both source and target features, its embedding is a set of lower-dimensional features that compactly capture mutual properties of the source and target courses, which can then be used by subsequent modeling. For events with strong correlation, e.g., play_video and pause_video events, the autoencoder (AE) learns a compressed representation that reduces the noise and mutual correlation between them. This leads to improved transferability. We investigate two variations of representation-based transfer learning: (1) post-processing the embedding by transductive principal component analysis, a passive approach that learns the compact representation before it trains the predictive model; and (2) training with a correlation alignment loss, an active approach that trains the autoencoder and predictive model simultaneously.
With the learned representation of each method and the predictive model for the target, we can evaluate transferability and compare it with existing methods (Boyer and Veeramachaneni, 2015). See Figure 1
for the workflow of our approach. For demonstration, we choose dropout prediction in six edX courses: three offerings of 6.00.1x "Introduction to Computer Science and Programming Using Python" and three offerings of 6.00.2x "Introduction to Computational Thinking and Data Science".
The paper is organized as follows. In Section 2, we present related work. In Section 3, we describe the courses, the input data, and the dropout problem we use for demonstration. This sets up Section 4, where we present the two transfer learning methods. We evaluate the methods on the courses and the dropout prediction problem in Section 5. In Section 6, we summarize and mention future work.
2. Related Work
This section covers related work on transfer learning, representation learning with autoencoders, feature learning and dropout prediction.
A summary and investigation of transductive transfer learning are provided in (Pan and Yang, 2010; Arnold et al., 2007). There are two approaches to transductive feature learning: instance-based methods motivated by importance sampling, and representation-based methods using feature learning. In this paper, we explore representation-based transfer methods and use an instance-based algorithm as a baseline. A previous study of transfer learning for predictive models in MOOCs uses hand-crafted features and an importance sampling method to shift the source distribution towards the target one (Boyer and Veeramachaneni, 2015). The study also proposes some inductive transfer methods by forcing models to learn from the features within a sliding window, but the resulting transfer methods use the previous week's dropout as the target label for the transfer setups and cannot predict next-week dropout well in an online course. A non-MOOC transfer learning study that investigates graduation rates in degree programs (Hunt et al., 2017) uses AdaBoost and manual features.
Feature identification is a critical precursor to prediction (Gardner and Brooks, 2018). Some human-selected and engineered features are page views, video interactions, forum posts, and content interactions. By human and engineered we imply, respectively, that the choice of the combination is made by learning design experts or researchers, and that counts of the clickstream elements have to be combined to derive the feature. An extensive predictive modeling (and coincidentally dropout-focused) investigation that relied upon feature engineering at scale on MOOC courses is (Veeramachaneni et al., 2014). The same study made use of crowdsourcing to engineer features. Herein we forgo feature engineering for feature learning, in the context of transfer learning. We are preceded by a study for MOOCs that used predictive engagement analytics with long short-term memory (LSTM) networks for feature learning
(Le et al., 2018).

Transfer learning is applicable to predicting any outcome variable, and we demonstrate it with dropout prediction. There are many types of predictive models for MOOCs, e.g., for dropout, certification, and grades (Gardner and Brooks, 2018; Gardner et al., 2018). A variety of related papers investigate the dropout prediction problem (Chaplot et al., 2015; Kloft et al., 2014; Burgos et al., 2018; Brooks et al., 2015). Most existing work uses machine learning approaches such as logistic regression (LR), support vector machines (SVM), decision trees (DT), and neural networks (NN).
3. Problem Description
In this section, we compare and contrast the courses we use to evaluate the representation-based transfer learning methods we propose. We describe how we organize student activity data collected from them for input to the methods. Finally, we define the predictive modeling problem, dropout, that we use to demonstrate the transfer learning methods.
3.1. Input Data Organization
In practical online prediction, some student attributes are unavailable (e.g., certificate and registration information) and cannot be used to filter the set of students. Thus, the data used for prediction exhibits high variance. We use the clickstream log events, which were engineered by the learning platform developers without a prediction problem in mind, as input to the transfer learning methods. The list of events can be found in the legend of Figure 6. All events have a timestamp and an event type. Therefore it is straightforward to aggregate the events by week and event type, per student. This results in a multivariate time series X = (x_1, ..., x_T) for each student, where each x_t is a vector of the normalized frequencies of the event types in week t, and T is the number of weeks. The sets of event types of different courses do not need to be identical. However, the most frequent and important event types often overlap.

3.2. Courses
We experiment with two courses offered on the edX MOOC platform: Introduction to Computer Science and Programming Using Python (6.00.1x), and Introduction to Computational Thinking and Data Science (6.00.2x). We have 3 offerings of each course.
In terms of structure, all offerings have 9 weeks. Both courses have multiple units, where each unit has an associated graded problem set. Students are expected to watch lecture videos narrated by instructors and complete "finger exercises", optional problems interspersed in lecture videos that teach the content discussed in the video. The topics of the two courses differ because one course is the continuation of the other; see Table 1 for details. The quantities of videos and finger exercises are much higher in 6.00.1x, see Table 2. Enrollment and activity volume measured in clickstream events are shown in Table 3.
Course     Assignment                    Due Week
6.00.1x    Python Basics                 Week 4
           Simple Programs               Week 5
           Structured Types              Week 7
           Good Programming Practices    Week 8
           Object Oriented Programming   Week 9
6.00.2x    Optimization                  Week 5
           Randomness                    Week 6
           Midterm Exam                  Week 7
           Statistics                    Week 8
           Modeling and Fit              Week 9
Course     N. Videos   N. Finger Exercises
6.00.1x    81          555
6.00.2x    43          177
Course        6.00.1x                                                  6.00.2x
Offering      Summer 2016A (1A)   Summer 2016B (1B)   Spring 2017 (1C)   Spring 2016 (2A)   Fall 2016 (2B)   Spring 2017 (2C)
N. students   37,363              15,199              26,011             6,774              6,945            5,893
N. events     17,333,974          7,900,908           13,176,220         2,642,528          2,501,276        2,034,539
We can statistically compare the courses to gain a mathematical estimate of their similarity. We use the Proxy A-distance (an approximation of the H-divergence (Ben-David et al., 2010)) as an indicator of the similarity between two samples, where each sample distribution consists of the frequency of events of an arbitrary type that occurred in an arbitrary week of a specific course. The pairwise Proxy A-distances (PADs) are difficult to visualize, so we first compute a 2D embedding where the distances between the distributions of different courses are the average PADs. We find the 2D embedding which best preserves the distances by multidimensional scaling. We visualize the pairwise distances in Figure 2. This identifies 6.00.2x Spring 2016 (2A) and 6.00.2x Fall 2016 (2B) as most similar, while 6.00.1x Summer 2016A (1A) and 6.00.2x Fall 2016 (2B) are most dissimilar. This is mainly because 2A and 2B are two offerings of the same course, while 1A is a different course.

3.3. Dropout Prediction
We adopt a widely used dropout definition: dropout occurs when the student no longer interacts with the MOOC platform. When considering all clickstream event types, the dropout labels become noisy. Therefore we define dropout based on video events (e.g., play_video). For a time granularity based on weeks, the dropout week of a student is defined as the week after the student's last video interaction event. By this definition, students cannot drop out in the first week, and we train one predictive model for each week after the first. The percentages of students dropping out in each week are shown in Figure 3.
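The dropout-week definition above can be sketched in a few lines. This is an illustrative example, not the paper's released code; the event names and the fixed seven-day week are assumptions.

```python
# Illustrative sketch: compute a student's dropout week from raw clickstream
# events, following the definition "the week after the last video interaction".
SECONDS_PER_WEEK = 7 * 24 * 3600

def dropout_week(events, course_start, video_types=("play_video", "pause_video")):
    """Return the dropout week for one student.

    `events` is a list of (timestamp_in_seconds, event_type) pairs.
    By the paper's definition, a student cannot drop out in week 1.
    """
    video_times = [t for t, etype in events if etype in video_types]
    if not video_times:
        return 2  # no video activity at all: treated as dropping out in week 2
    last_week = int((max(video_times) - course_start) // SECONDS_PER_WEEK) + 1
    return last_week + 1  # dropout is the week after the last video event

# Example: last play_video falls in week 3 -> dropout week 4
events = [(0, "problem_check"), (2 * SECONDS_PER_WEEK + 100, "play_video")]
print(dropout_week(events, course_start=0))  # -> 4
```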
Prediction can be based on data from the first week up to the week before the prediction. A student is characterized by a pair of time series (X, Y). The label of a student is a univariate time series Y = (y_2, ..., y_T), where y_t indicates whether the student has dropped out in week t and T is the total number of weeks. The features of a student can be represented as a multivariate time series X = (x_1, ..., x_T), where x_t is a set of features in week t. To predict student dropout during the course, we train models m_2, ..., m_T, one for each week, where m_t gives the prediction for week t. The model m_t for week t uses y_t as label and x_1, ..., x_{t-1} as features. Each model solves a binary classification problem, i.e., did the student stay in the course or drop out.
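The per-week setup can be illustrated by how a training set for the week-t model is assembled: features from weeks 1..t-1, label for week t. The data layout and helper names below are hypothetical, for illustration only.

```python
def week_t_dataset(students, t):
    """Build the training set for the week-t model: features from weeks
    1..t-1, and a binary label indicating dropout by week t.

    `students` maps a student id to (X, dropout_week), where X is a list of
    weekly feature vectors. This layout is an assumption for the example.
    """
    features, labels = [], []
    for X, dw in students.values():
        features.append(X[: t - 1])           # weeks 1 .. t-1 only
        labels.append(1 if dw <= t else 0)    # dropped out by week t?
    return features, labels

students = {
    "a": ([[1, 0], [2, 1], [0, 0]], 3),  # dropped out in week 3
    "b": ([[3, 2], [2, 2], [1, 1]], 9),  # stayed until the end
}
X3, y3 = week_t_dataset(students, t=3)
print(y3)          # -> [1, 0]
print(len(X3[0]))  # two weeks of features per student -> 2
```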
4. Transfer Learning
In this section, we formulate the problem of transferring models between MOOC courses and then introduce our transfer methods.
4.1. Transfer Learning Definition
A domain D = {X, P(X)} consists of two components: a feature space X and a marginal probability distribution P(X), where X = {x_1, ..., x_n} ∈ X. Given a specific domain, a task T = {Y, P(Y|X)} consists of two components: a label space Y and a conditional probability distribution P(Y|X), where Y = {y_1, ..., y_n} ∈ Y. Considering a source course S and a target course T, the feature spaces of the source and target domains X_S and X_T are the same, but the feature distributions are different, P(X_S) ≠ P(X_T). However, the prediction tasks T_S and T_T for the two domains are the same, as the conditional distributions coincide, P(Y_S|X_S) = P(Y_T|X_T).

Definition 1 (Transductive Transfer Learning Problem in MOOCs). In transfer learning, the training of the target predictive function f_T(·) in the target domain D_T is supplemented using the knowledge in the source domain D_S, where D_S ≠ D_T and T_S = T_T. At training time, the source data {X_S, Y_S} and the unlabeled target domain features X_T are available.
4.2. Transfer Learning Methods
We use autoencoders to learn a representation space that is common to both the source and target domains. By training an autoencoder (AE) on both the source and target features (and, in some variants, the source labels) and using a learning signal that measures the output's distance from the input, the AE's embedding layer between the encoder and decoder is forced to capture the common characteristics of the two distributions. There is a trade-off between the prediction model's capability and the dimensionality of the representation space. When the reduction is small, the autoencoder can learn multiple modes in the distribution, but the predictor is prone to overfit to the source task. When the reduction is large, the autoencoder can only learn a single mode, presenting the risk that the embedding is not predictive enough. Thus the dimensionality should be carefully chosen for a specific combination of model and data set.
We introduce two different transfer learning methods: (1) Passive Transfer with Transductive PCA (Passive-AE Transfer), and (2) Active Transfer with CORAL loss (Active-AE Transfer).
4.2.1. Passive Transfer with Transductive PCA (TPCA)
The workflow is depicted in Figure 4. The target embedding is obtained in a "passive" way, since no transfer-specific objective function is defined or used. First, an AE is trained on both the source and target features. Next (Step 2 in Figure 4), using target features only, a PCA (Jolliffe, 1986) transform is fit on the learned target embedding; it is then used to transform the source embedding for predictive model training (Step 3) (Mesnil et al., 2012). This avoids the prediction model being trained on an embedding that has learned irrelevant features from the source domain. The number of outputs of TPCA is set to be larger than the number of predicted labels.
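The essence of the TPCA step is that the PCA transform is fit on the target embedding only and then reused on the source embedding. A minimal pure-Python sketch with a single principal component (found by power iteration) illustrates this; the paper's pipeline uses a full PCA with more components, and the toy data here is made up.

```python
# Sketch of transductive PCA: fit the top principal direction on the TARGET
# embedding, then project the SOURCE embedding with that same transform.
def fit_pca_direction(rows, n_iter=200):
    """Fit the mean and top principal direction of `rows` (list of vectors)."""
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / len(rows) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):  # power iteration converges to the top eigenvector
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v

def transform(rows, means, v):
    """Project each row onto the fitted direction."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]

# Toy embeddings: target variance lies almost entirely along the first axis.
target_emb = [[1.0, 0.1], [2.0, 0.0], [3.0, -0.1], [4.0, 0.0]]
source_emb = [[1.5, 5.0], [3.5, 5.0]]
means, v = fit_pca_direction(target_emb)            # fit on target only
projected_source = transform(source_emb, means, v)  # reuse on the source
print(len(projected_source))  # -> 2
```

The design point mirrors the method: variation that exists only in the source (here, the second coordinate) is discarded because the projection was chosen from the target alone.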
4.2.2. Active Transfer with CORAL loss
CORAL is an unsupervised transfer learning method that performs a transformation to align the second-order statistics of the source and target domains (Sun et al., 2016). This removes variations in the learned embedding that are only present in the source domain. A general transform is achieved by a deep neural network (Sun and Saenko, 2016). A CORAL loss is introduced as a term in the objective function because minimizing the prediction loss by itself can lead to overfitting to the source domain, causing reduced performance on the target domain. Let C_S and C_T denote the covariance matrices of the source and target embeddings; the CORAL loss (Sun and Saenko, 2016) is defined as l_CORAL = (1 / (4 d^2)) ||C_S - C_T||_F^2, where ||·||_F denotes the Frobenius norm and d is the dimension of the common embedding space. While minimizing the CORAL loss alone can lead to degenerate features, jointly training the autoencoder and the embedding predictor with both losses, together with the target reconstruction loss, strikes a balance (Figure 5).
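The CORAL loss above is easy to compute directly. The sketch below evaluates it on toy 2-D embeddings in pure Python; in the actual method it is a differentiable term in the network's objective, not a post-hoc metric.

```python
# Sketch of the CORAL loss: squared Frobenius distance between the source and
# target covariance matrices, scaled by 1 / (4 * d^2).
def covariance(rows):
    """Sample covariance matrix of a list of d-dimensional vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def coral_loss(source, target):
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    frob_sq = sum((cs[i][j] - ct[i][j]) ** 2 for i in range(d) for j in range(d))
    return frob_sq / (4 * d * d)

src = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # wider spread
tgt = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # narrower spread
print(coral_loss(src, tgt))  # positive: second-order statistics differ
print(coral_loss(src, src))  # -> 0.0 when the covariances match
```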
4.3. AutoEncoders and Predictive Models
Both transfer learning methods use an AE and a predictive model, and for both autoencoding and prediction, neural networks can be used. In preliminary AE studies, we used a neural network implementation of PCA as an AE baseline. We then explored deep networks: a multilayer perceptron, a convolutional neural network (CNN) (the order of course materials in each week preserves the implicit dependence in the knowledge graph, so we can apply 1D convolution in the time domain to find the implicit learning patterns), and a Long Short-Term Memory (LSTM) network adapted for autoencoding. The LSTM AE has two 1D convolutional layers with kernel size 1 and a bidirectional LSTM layer in the encoder, and one 1D convolutional layer with kernel size 1 and a bidirectional LSTM layer in the decoder (see Appendix A for details). In a bidirectional LSTM, two hidden layers of opposite directions are connected to the same output to make future input information reachable from the current state. We report our best results, which use the LSTM AE.

Training models for prediction relies on the embedded features from the AE (or TPCA) representations. We explored predictive modeling both with and without embedding features. We used logistic regression (LR) as a baseline predictive model and then compared using a CNN and an LSTM with architectures similar to the encoder parts of the AEs. We report our best results, which use a CNN on top of embeddings and an LSTM without embeddings (see Appendix A for details).
5. Experiments
We performed a large number of experiments, made possible by an efficient and scalable software framework that allows model tuning. Preliminary experiments allowed us to settle on parameters, e.g., the architectures of the networks and the training hyperparameters to use throughout the evaluations. We ran our entire pipeline multiple times with different combinations of models and architectures. The pipeline is released in an open-source Python MOOC learner data science analysis (mldsa) toolkit (https://github.com/MOOCLearnerProject/MOOCLearnerDataScienceAnalytics).
We use the area under the receiver operating characteristic curve (AUC) to measure the performance of all predictive models. The AUC scores are chosen from the best-performing architecture and are averaged over all possible pairs of source and target courses and over all weeks. See Appendix B for detailed experimental configurations.

Our input features are based on the thirteen common event types; see Figure 6. These events all tend to decrease in frequency, most likely due to dropout. We also note that some events are highly correlated (e.g., problem_check and problem_graded). This is one of the motivations for representation learning on the raw features.
5.1. Transfer Learning Baselines
We set up four baseline models. All of them use an LSTM, or an LSTM AE and a CNN, for prediction (see Appendix A), the same as Passive-AE Transfer and Active-AE Transfer.
LabelTruth and LabelTruth-AE:

These models do not use transfer learning. Instead, they are trained with target features and labels (retrieved retrospectively) and provide a label-truth baseline. LabelTruth-AE uses an LSTM AE embedding for prediction. The target domain data is split into train and test sets, and test accuracy is reported.
 Naive Transfer:

This model is learned on the source course (without using target labels) and subsequently applied to the target course.
In-Situ Learning:

We learn a predictive model from the data of the ongoing course itself with a sliding window (Boyer and Veeramachaneni, 2015). For week t, we use a sliding window over the preceding weeks and apply the model trained for the previous week for prediction. It transfers between weeks of the same target course and does not use the source course.
Instance-Based Transfer:

Here it is assumed that the features of the source and target domains are drawn from a common distribution and the difference comes from a sample selection bias. We correct this bias by giving more weight to the students in the source course who are similar to the students in the target course. The learning objective in the target domain is approximated by reweighting the source instances with the density ratio P_T(x) / P_S(x), i.e., E_{(x,y)~P_T}[l(x, y)] ≈ E_{(x,y)~P_S}[(P_T(x) / P_S(x)) l(x, y)]. Thus, by applying different loss weights to each instance in the source domain, we can train a precise model for the target domain. We use the kernel-mean matching algorithm (Huang et al., 2006) to compute the weights.
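The density-ratio idea behind this baseline can be illustrated with a much simpler stand-in for kernel-mean matching (which solves a quadratic program): estimate P_T(x) and P_S(x) with Gaussian kernel density estimates and take their ratio. One-dimensional toy data; the bandwidth is an arbitrary assumption.

```python
# Simplified illustration of instance weighting via a kernel-density ratio.
# This is NOT kernel-mean matching itself, only the intuition behind it.
import math

def kde(x, sample, bandwidth=1.0):
    """Gaussian kernel density estimate of `sample` evaluated at x."""
    return sum(math.exp(-((x - s) / bandwidth) ** 2 / 2) for s in sample) / (
        len(sample) * bandwidth * math.sqrt(2 * math.pi))

def importance_weights(source, target, bandwidth=1.0):
    """Weight each source point by estimated p_target(x) / p_source(x)."""
    return [kde(x, target, bandwidth) / kde(x, source, bandwidth) for x in source]

source = [0.0, 0.5, 1.0, 4.0]   # one source point lies far from the target
target = [0.2, 0.4, 0.8, 1.1]
w = importance_weights(source, target)
print(min(w) == w[-1])  # -> True: the outlier at 4.0 gets the smallest weight
```

Source students unlike any target student are down-weighted, which is exactly the correction the baseline applies to the training loss.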
5.2. Model Selection
First, without transfer, we compared the prediction performance of the LSTM neural network with the standard logistic regression (LR), see Figure 7. The LSTM model consistently outperforms LR by a large margin. Thus we chose LSTMs.
Next, we compared the performance of predictive models using the embeddings from the AEs as input; more specifically, a CNN model trained on an embedding learned by an LSTM AE or PCA, see Figure 7. With a bottleneck size of eight, the AUC scores of the CNN on the LSTM AE embedding are similar to the LSTM's. This implies that it is possible to learn a compact and effective embedding, and suggests the effective use of dimensionality reduction for transfer learning. All predictive models perform better in the middle part of the course, and their AUCs are negatively correlated with the dropout rates (see Figure 3).
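All comparisons in this section use AUC, which can be computed directly from labels and scores via the rank-sum (Mann-Whitney) formulation; a minimal sketch:

```python
def auc(labels, scores):
    """Area under the ROC curve: fraction of (positive, negative) pairs the
    scores rank correctly, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 3 of the 4 positive/negative pairs are ranked correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
```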
5.3. Transfer Learning Results
Source→Target             1B→1A   1C→1A   1A→1B   1C→1B   1A→1C   1B→1C   Avg.
Passive-AE Transfer       .7581   .7771   .7492   .7572   .8011   .7842   .7712
Active-AE Transfer        .7591   .7881   .7692   .7692   .8121   .7922   .7821
In-Situ Learning*         .6812   .6592   .7242                           .6882
Instance-Based Transfer   .6131   .6831   .6081   .6631   .6531   .6491   .6451
Naive Transfer            .7161   .7561   .7003   .6993   .7432   .7362   .7252
LabelTruth*               .8001   .7731   .8191                           .7971
(* independent of the source course; values given per target course 1A, 1B, 1C, and averaged)
Source→Target             2A→1A   2B→1A   2C→1A   2A→1B   2B→1B   2C→1B   2A→1C   2B→1C   2C→1C   Avg.
Passive-AE Transfer       .7531   .7521   .7601   .7352   .7332   .7452   .7861   .7771   .7821   .7581
Active-AE Transfer        .7131   .7201   .7171   .7342   .7402   .7432   .7551   .7512   .7572   .7372
In-Situ Learning*         .6812   .6592   .7242                                           .6881
Instance-Based Transfer   .5902   .5692   .6061   .6373   .6402   .6001   .5792   .5281   .5983   .5942
Naive Transfer            .6681   .7002   .7062   .6753   .6843   .6763   .7352   .7162   .7392   .7022
LabelTruth*               .8001   .7731   .8191                                           .7971
(* independent of the source course; values given per target course 1A, 1B, 1C, and averaged)
We now compare the proposed transfer learning methods with the baselines and the LabelTruth case. The transfer methods that perform best are Passive-AE Transfer and Active-AE Transfer, which even come close to the LabelTruth baseline. The dropout prediction performance of Passive-AE Transfer and Active-AE Transfer together with the baselines (LabelTruth, LabelTruth-AE, Naive Transfer, In-Situ Learning, and Instance-Based Transfer) is shown for the similar pairs of source and target, 6.00.1x→6.00.1x (within offerings of one course), in Table 4 and Figure 7(a), and for the dissimilar pairs, 6.00.2x→6.00.1x (across two courses), in Table 5 and Figure 7(b). Note that the performance of In-Situ Learning and LabelTruth does not depend on the choice of source course. All transfer baselines have significantly lower average AUCs than the Passive-AE Transfer and Active-AE Transfer methods.
Now we analyze the average AUC per predicted week, see Figure 8. It shows that Naive Transfer overfits to the source domain from week 5 on. In-Situ Learning does not work well until the final weeks of the course since it applies a sliding window, and a considerable proportion of information is lost when the week number t is small. Instance-Based Transfer has abysmal performance in all weeks, since the raw features we used are high-dimensional and sparsely distributed, and hence obtaining a good approximation of the sampling weights is very challenging. We note that In-Situ Learning and Instance-Based Transfer struggle to outperform Naive Transfer in most cases. An important reason for this is that we used a very low-level input representation (time series of clickstream events). These methods need complex features, whereas representation learning is capable of learning from simpler ones. We anticipate improved performance for the baseline transfer methods with hand-crafted features, but that is not certain and also demands a human to solve the challenge of engineering good features.
The sensitivity to some of the important parameters of the Passive-AE Transfer and Active-AE Transfer methods is also investigated in Figure 9. We analyze how the performance of passive and active transfer is correlated with the PAD and the sample size ratio between the source and target courses. We see that Passive-AE Transfer performs better than Active-AE Transfer when transferring from a course with fewer students (6.00.2x) to one with more students (6.00.1x) (the upper right corner of Figure 9); in the other cases, Active-AE Transfer often slightly outperforms Passive-AE Transfer. We find that the best transfer performance is close to the LabelTruth baseline when both the target-to-source sample size ratio and the PAD are small. Active-AE Transfer and Passive-AE Transfer have on average an 8% improvement over the Naive Transfer baseline in terms of AUC scores (calculated from the last columns of Table 4 and Table 5).
5.4. ExperienceBased Transfer Models
The composition of a MOOC class can be very diverse, and this is one of the differences compared to traditional education. Since 6.00.2x is a more in-depth course than 6.00.1x, the proportion of high school students in 6.00.2x is smaller, but there are more postgraduate students (Figure 9(b)). Training a predictive model for high school students on 6.00.2x can be difficult in the sense that there are not enough samples to learn from. Transferring the knowledge from 6.00.1x with the proposed methods can partially solve this problem. Equipped with the student background labels, there are two possible ways of transfer learning: transfer to the entire target class as we did in the other experiments, or consider each group of students as a target and transfer specifically to that group. With simpler target distributions and thus easier transfer objectives, the latter approach has the potential to achieve better prediction performance. We evaluate their performance when transferring from 6.00.1x to 6.00.2x, as shown in Figure 9(a). We see that the average AUC on high school students is improved by active transfer, and is improved even further if we specifically transfer to the high school group. Passive-AE Transfer does not perform as well as Active-AE Transfer, since the source course has many more students than the target. The results showed that Active-AE Transfer to specific groups performed best. This approach can also be applied to minority groups based on other demographic variables including income level, gender, age, location, ethnicity, and race.
5.5. Examining the Embeddings
To further examine the transfer embeddings, we replaced the LSTM AE used by Active-AE Transfer with a neural network implementation of PCA to access the embedding as a linear transformation of the raw feature space. Now we can calculate the "weight" (the scaling factor in the direction of a feature under the PCA transformation) of a raw feature in the learned embedding, which is an indicator of how vital the raw feature is for the transfer learning task. We calculate the average weights of the thirteen features on different groups of sources and targets, as shown in Figure 11. The average weights on all possible transfers (the last bar) show the relative importance of each raw feature, and we order the legend labels accordingly. We can see that video-related events (bold) occupy the first six places, which implies they are more transferable and predictive than the others.

6. Conclusions
In this paper, we explored the possibilities of using deep autoencoders for transfer learning among courses in MOOCs. We proposed two methods to improve the transferability of the learned embedding: Passive-AE Transfer, which uses transductive PCA to remove the variations present in the source domain but irrelevant for the target domain, and Active-AE Transfer, which uses the CORAL loss to force the alignment of the second-order statistics of the source and target embeddings. Moreover, the deep transfer models solve the domain-specific feature engineering problem and can learn compact and effective representations from the raw features, a straightforward representation of the clickstream. We found that our transferred models consistently outperform the other transfer baselines and achieve performance similar to the label-truth (no transfer) models, which are trained on the target course. In this sense, we believe we have made significant progress toward solving the transfer learning problem in MOOCs, with the acknowledged limitation that the models, while accurate and automated, are not transparent and do not integrate contextual human knowledge such as the learning design.
We answered the following research questions: (1) Does representation learning improve model transfer? We evaluated transferability within offerings of two courses and across two courses. We found that when transferring from a course with fewer students to one with more students, Passive-AE Transfer was best versus Active-AE Transfer and the baselines. In the opposite situation, Active-AE Transfer was best versus Passive-AE Transfer and the baselines. Both passive and active methods approached the AUC level of the LabelTruth (no transfer) baseline, which learned from target labels directly. (2) Can representation-based learning work from a universal, basic set of MOOC activity features as input? We successfully used the same time series per student, where the frequencies of a set of specific MOOC activity types are expressed per time unit, as input to every experiment. It supported our transfer learning results. This suggests that, for transfer learning problems in MOOCs in general, it is possible to eliminate costly feature engineering. (3) Can transfer learning improve recognition of minority groups? If we group similar students and apply transfer learning to each group independently, does predictive performance improve? We grouped students by their highest level of education and found that transfer learning with the proposed methods can help improve the prediction performance on minority groups, and transferring specifically to a target group might achieve even higher performance. This is an example where contextual knowledge, if available, can be used to improve this methodology despite that knowledge not being explicitly integrated into the model. (4) What are the embedded features that increase transferability? We calculated the weights of event features in the embedding learned by Active-AE Transfer using PCA instead of the LSTM AE and showed that video-related events are more transferable and predictive than the others.
The contributions of this paper are:

We introduced two online transfer learning methods based on representation learning that improve prediction for the target course and eliminate manual feature engineering. Either method improves the dropout prediction AUC score by 8% compared with the naive transfer baseline.

We introduced a data organization for input to the prediction and transfer methods that requires no feature extraction. It is a time series per student where the frequencies of a set of specific MOOC activity types are expressed per time unit.

Through visualization and metric analysis, we characterized the embeddings produced by representation-based learning.

We found that applying transfer learning to specific groups of students independently improved predictions, facilitating more targeted learning support.
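The feature-free input representation in the second contribution can be assembled directly from platform logs. The sketch below is illustrative only: the event vocabulary and the function are hypothetical stand-ins for the platform's actual activity types and preprocessing code.

```python
import numpy as np

# Hypothetical event vocabulary; the actual set of MOOC activity
# types (video, forum, quiz events, ...) comes from the platform logs.
EVENT_TYPES = ["video_play", "video_pause", "forum_view", "quiz_submit"]

def activity_time_series(events, n_units):
    """Turn one student's clickstream into a (n_units, n_event_types)
    count matrix. `events` is a list of (time_unit_index, event_type)
    pairs; each cell holds the frequency of one activity type in one
    time unit, with no hand-crafted features involved."""
    idx = {e: i for i, e in enumerate(EVENT_TYPES)}
    ts = np.zeros((n_units, len(EVENT_TYPES)))
    for unit, etype in events:
        ts[unit, idx[etype]] += 1
    return ts
```

Because the matrix depends only on event types and time units, not on course-specific exercises or pages, the same representation exists in every offering, which is what makes it usable for transfer.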
For future work, a direct extension of our work is to apply the transfer algorithms to different types of MOOCs and across MOOC platforms. It will also be interesting to investigate how to learn effectively from multiple source courses, simultaneously characterizing and utilizing each source's relationship to the target course based on course content and student population. Finally, raw-data representations that take structural course context into account, not only temporal activity, merit further investigation.
References
 Arnold et al. (2007) Andrew Arnold, Ramesh Nallapati, and William W Cohen. 2007. A Comparative Study of Methods for Transductive Transfer Learning. In ICDM Workshops. ICDM, Pittsburgh, PA, USA, 77–82.
 Baker and Inventado (2014) Ryan Shaun Baker and Paul Salvador Inventado. 2014. Educational data mining and learning analytics. In Learning analytics. Springer, New York, NY, USA, 61–75.
 Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Machine Learning 79, 1–2 (2010), 151–175.
 Boyer and Veeramachaneni (2015) Sebastien Boyer and Kalyan Veeramachaneni. 2015. Transfer Learning for Predictive Models in Massive Open Online Courses. In Artificial Intelligence in Education, Cristina Conati, Neil Heffernan, Antonija Mitrovic, and M. Felisa Verdejo (Eds.). Springer International Publishing, Cambridge, MA, USA, 54–63.
 Brooks et al. (2015) Christopher Brooks, Craig Thompson, and Stephanie Teasley. 2015. A time series interaction analysis method for building predictive models of learners using log data. In Proceedings of the Fifth International Conference on Learning Analytics and Knowledge. ACM, 126–135.
 Burgos et al. (2018) Concepción Burgos, María L. Campanario, David de la Peña, Juan A. Lara, David Lizcano, and María A. Martínez. 2018. Data mining for modeling students’ performance: A tutoring action plan to prevent academic dropout. Computers & Electrical Engineering 66 (2018), 541–556.

 Chaplot et al. (2015) Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim. 2015. Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks. In AIED Workshops (CEUR Workshop Proceedings), Jesus Boticario and Kasia Muldner (Eds.), Vol. 1432. CEUR-WS.org, Seoul, South Korea.
 Gardner and Brooks (2018) Josh Gardner and Christopher Brooks. 2018. Student success prediction in MOOCs. User Modeling and User-Adapted Interaction 28, 2 (2018), 127–203.
 Gardner et al. (2018) Josh Gardner, Christopher Brooks, Juan Miguel Andres, and Ryan Baker. 2018. Replicating MOOC Predictive Models at Scale. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale (L@S ’18). ACM, New York, NY, USA, 1:1–1:10. https://doi.org/10.1145/3231644.3231656
 Huang et al. (2006) Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Scholkopf. 2006. Correcting Sample Selection Bias by Unlabeled Data. In Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS’06). MIT Press, Cambridge, MA, USA, 601–608.
 Hunt et al. (2017) Xin J. Hunt, Ilknur Kaynar Kabul, and Jorge Silva. 2017. Transfer Learning for Education Data. KDD Workshop (2017).
 Jolliffe (1986) Ian T Jolliffe. 1986. Principal component analysis and factor analysis. In Principal component analysis. Springer, Kent, England, UK, 115–128.
 Kloft et al. (2014) Marius Kloft, Felix Stiehler, Zhilin Zheng, and Niels Pinkwart. 2014. Predicting MOOC dropout over weeks using machine learning methods. In Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs. Association for Computational Linguistics, Doha, Qatar, 60–65.
 Le et al. (2018) Christopher Vu Le, Zachary A. Pardos, Samuel D. Meyer, and Rachel Thorp. 2018. Communication at Scale in a MOOC Using Predictive Engagement Analytics. In Artificial Intelligence in Education – 19th International Conference, AIED 2018, London, UK, June 27–30, 2018, Proceedings, Part I. IJAIED, Berkeley, CA, USA, 239–252. https://doi.org/10.1007/9783319938431_18
 Mesnil et al. (2012) Grégoire Mesnil, Yann Dauphin, Xavier Glorot, Salah Rifai, Yoshua Bengio, Ian Goodfellow, Erick Lavoie, Xavier Muller, Guillaume Desjardins, David Warde-Farley, Pascal Vincent, Aaron Courville, and James Bergstra. 2012. Unsupervised and Transfer Learning Challenge: a Deep Learning Approach. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning (Proceedings of Machine Learning Research), Isabelle Guyon, Gideon Dror, Vincent Lemaire, Graham Taylor, and Daniel Silver (Eds.), Vol. 27. PMLR, Bellevue, Washington, USA, 97–110.
 Pan and Yang (2010) Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
 Sun et al. (2016) Baochen Sun, Jiashi Feng, and Kate Saenko. 2016. Return of frustratingly easy domain adaptation. In AAAI, Vol. 6. Palo Alto, CA, USA, 8 pages.
 Sun and Saenko (2016) Baochen Sun and Kate Saenko. 2016. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision. Springer International Publishing, Cambridge, MA, USA, 443–450.
 Veeramachaneni et al. (2014) Kalyan Veeramachaneni, Una-May O’Reilly, and Colin Taylor. 2014. Towards Feature Engineering at Scale for Data from Massive Open Online Courses. CoRR abs/1407.5238 (2014).
Appendix
Appendix A Model Architectures
The detailed architectures of the three neural network models used by all the transfer methods and the label-truth baseline are:
 LSTM Predictive Model:
 LSTM AE:
Input–Conv1D(12, 1)–LeakyReLU(0.2)–BLSTM(8)–LeakyReLU(0.2)–Conv1D(8, 1)–Flatten–Embedding–Reshape–BLSTM(6)–LeakyReLU(0.2)–Conv1D(13, 1)–Sigmoid–Output
 CNN Predictive Model on AE embeddings:
Input–Conv1D(8, 3)–ReLU–Conv1D(8, 3)–ReLU–Flatten–FC(1)–Sigmoid–Output
Where FC(n) is a fully connected layer with n neurons, Conv1D(n, k) is a 1D-convolutional layer with n output channels and kernel size k, LSTM(n) is an LSTM layer with n cells, LeakyReLU(α) is a leaky-ReLU activation with negative slope α, and BLSTM(n) is a bidirectional LSTM layer with n cells.
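As a worked illustration of this notation, the CNN predictive model on AE embeddings can be sketched in PyTorch. The 8-dimensional embedding per time unit follows Appendix B; the number of time units (10 here) is an assumption for the example, and the class name is ours.

```python
import torch
import torch.nn as nn

class CNNPredictor(nn.Module):
    """Sketch of Input–Conv1D(8, 3)–ReLU–Conv1D(8, 3)–ReLU–Flatten
    –FC(1)–Sigmoid, operating on AE embeddings of shape
    (batch, emb_dim, n_units)."""

    def __init__(self, emb_dim=8, n_units=10):
        super().__init__()
        # Two valid (no-padding) convolutions each shrink the time
        # axis by kernel_size - 1 = 2, hence n_units - 4 below.
        self.net = nn.Sequential(
            nn.Conv1d(emb_dim, 8, kernel_size=3),
            nn.ReLU(),
            nn.Conv1d(8, 8, kernel_size=3),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(8 * (n_units - 4), 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, emb_dim, n_units) -> (batch, 1) dropout probability
        return self.net(x)
```

The sigmoid output is the predicted dropout probability for the student represented by the embedded time series.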
Appendix B Experimental Configurations
We implemented the models using PyTorch and Keras in Python. For all training, the batch size is , the Adam optimizer is used with learning rate , and the number of epochs is between and . For all AEs, the bottleneck size is eight dimensions per time unit. For Label-Truth and Label-Truth-AE, the train-test split ratio is . For Passive-AE Transfer, the number of output components of TPCA is set to six per time unit. For Active-AE Transfer, the loss weights of the source prediction cross-entropy loss, the target autoencoding mean-squared error, and the CORAL loss are , , and , respectively. For each combination of training method, model architecture, source, target, and week, we train only once, since all of our results are averages with low variance, and the variance from stochastic optimization is very small (less than 1% according to our experiments).
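The three-term Active-AE Transfer objective combines these losses as a weighted sum. A minimal sketch, with placeholder weights (W_CE, W_MSE, W_CORAL are illustrative values, not the ones used in our experiments):

```python
# Placeholder loss weights for illustration only; they are NOT the
# values used in the experiments reported above.
W_CE, W_MSE, W_CORAL = 1.0, 1.0, 1.0

def active_ae_loss(ce_source, mse_target, coral):
    """Weighted sum of the source prediction cross-entropy, the target
    auto-encoding mean-squared error, and the CORAL alignment term."""
    return W_CE * ce_source + W_MSE * mse_target + W_CORAL * coral
```

In training, the three arguments are the per-batch values of each term, so a single backward pass through this sum jointly fits the source predictor, reconstructs the target clickstream, and aligns the two embedding distributions.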