Online courses have become popular over the years. In most courses, clickstream logs are the only data available from all students that gives information about their learning behaviors. While we have abundant learning theories that define educationally meaningful behaviors and learning strategies, we have limited understanding of how their behaviors are represented within click logs. This understanding is important because it allows for the application of traditional educational theories to online settings and helps us identify important study behaviors that lead to positive learning outcomes in online courses. In this project, we model study behaviors from click logs and predict students’ final grades based on clicks and behaviors.
Our first task is to model student behaviors using clustering algorithms. Specifically, we focus on how students distribute their time among learning materials in a study session. We cluster click segments using a static clustering method (multinomial mixture model) and a dynamic clustering method (hidden Markov model).
We then predict students’ grades using a time series analysis of clicks. There have been studies on predicting learning outcomes in online courses, but to the best of our knowledge, this work is the first to use a time series analysis. Moreover, many of the previous studies focused on information that is not available for all students, e.g., forum activity, whereas this work uses general clickstreams, which are available for all students. We leveraged Long Short Term Memory (LSTM) to learn sequential characteristics relevant to student performance.
Comparing sequence-aware and non-sequence-aware approaches, we find that approaches that incorporate sequential information outperform those that do not at classifying student performance, and generalize better to other courses. Additionally, we use these sequence modeling approaches to identify differences between students of different achievement levels, helping to demonstrate how sequence information better distinguishes low graders from high graders.
2 Related Work
Existing studies on modeling student behaviors via click logs can be categorized into top-down and bottom-up approaches. Top-down approaches predefine a set of behaviors of interest, such as disengagement and sequential navigation, and their corresponding click patterns [8, 13, 4]. These approaches provide interpretability, but analyses are limited to the predefined behaviors and patterns. In contrast, bottom-up approaches aim to find meaningful click patterns in clickstream data and then interpret the behaviors those patterns represent. For instance, topic modeling approaches [16, 1] treat clicks as words and learn click categories as topics. Using n-grams may give insights into click sequences that represent the topics. These studies also showed a correlation between the learned patterns and success in the courses. Bottom-up approaches require no predefined set of patterns and learn behaviors that occur frequently, but it may be hard to connect the learned click patterns to meaningful behaviors.
There are Bayesian network models that can be used for finding meaningful click patterns. Static clustering algorithms, such as Gaussian mixture models or multinomial mixture models, may be used to categorize similar click segments. These models assume that all clusters are independent of each other. Since our data are sequences, dynamic clustering algorithms that consider transitions between clusters may be more appealing. For instance, a hidden Markov model (HMM) simultaneously clusters similar click segments into states and models transitions between the states. A state transition topic model extends the HMM such that click segments are clustered into topics, and a state is represented as a mixture of the topics. A model for sequential pattern mining can be used to find frequent click segments.
There have been many studies that predict students’ learning outcomes in online courses. Forum activity, linguistic features in discussions, sentiment, quiz participation, and video interaction have been used as features for dropout prediction [13, 17, 4], certification prediction, and performance on quizzes. However, to the best of our knowledge, no prior work has used time series analysis, that is, changes of feature values over time, for prediction.
LSTMs [5, 7] have become increasingly popular for sequence modeling and time series analysis. They are based on the recurrent neural network (RNN) architecture and have been shown to outperform traditional RNNs on numerous temporal processing tasks [2, 3]. In a sequence modeling task with a traditional RNN, gradients can vanish or explode during back-propagation when the sequences are very long. We make use of the LSTM model, which is designed to prevent such situations. To the best of our knowledge, this is the first work to use LSTM to model students’ clickstream sequences in an online course.
Our goals are twofold. We first want to investigate the general behaviors of students in terms of how they engage with individual learning materials. We then want to examine whether we can predict students’ final grades based on the first portion of their clickstream, and whether we can automatically learn student behaviors that inform the prediction task. In the following sections, we explain the models we used for these two goals.
3.1 Behavior Modeling
Different students spend their time differently while interacting with course learning materials, e.g., some students may spend more time on lectures, whereas other students actively engage in forum activity. We hypothesize that the way students distribute their time, which we call behavior, throughout a course is closely related to their learning outcomes. More formally we define session and behavior as follows.
Session: a sequence of clicks from one student separated from that student’s other click sequences by more than one hour of inactivity.
Behavior: a distribution over clicks within a single session.
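The two definitions above can be illustrated with a minimal sketch. It assumes a student's clicks are available as (timestamp in seconds, click category) pairs; the function names and the example data are hypothetical and not taken from the original implementation.

```python
from collections import Counter

SESSION_GAP_SECONDS = 60 * 60  # one hour of inactivity, per the definition above


def split_sessions(clicks, gap=SESSION_GAP_SECONDS):
    """Split one student's (timestamp, click_category) pairs into sessions.

    A new session starts whenever the gap to the previous click exceeds `gap`.
    """
    sessions = []
    prev_time = None
    for timestamp, category in sorted(clicks):
        if prev_time is None or timestamp - prev_time > gap:
            sessions.append([])
        sessions[-1].append(category)
        prev_time = timestamp
    return sessions


def behavior(session):
    """Return the empirical distribution over click categories in a session."""
    counts = Counter(session)
    total = len(session)
    return {category: n / total for category, n in counts.items()}


# Hypothetical clickstream: timestamps in seconds, category names are made up.
clicks = [(0, "lecture"), (600, "lecture"), (1200, "quiz"),
          (9000, "forum"), (9300, "forum")]
sessions = split_sessions(clicks)
behaviors = [behavior(s) for s in sessions]
```

Here the 7,800-second gap between the third and fourth clicks exceeds one hour, so the stream splits into a lecture-and-quiz session and a forum session.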
Since behaviors lie in a continuous space, we group them into a fixed number of categories using two clustering algorithms: a multinomial mixture model (MMM) and a hidden Markov model (HMM).
An MMM assumes a fixed set of clusters, each of which has its own multinomial distribution from which observations are generated. In our task, the clusters are students’ states that correspond to individual sessions. Observations are clicks, and the state parameters are the parameters of the multinomial distributions that generate clicks. An MMM does not model transitions between successive sessions.
An HMM is a dynamic version of an MMM that considers transitions between successive data points. In this work, an HMM has the same definitions for states and observations, but each state has two parameters: an emission parameter and a transition parameter. Emission parameters are equivalent to the state parameters in an MMM, which generate clicks based on multinomial distributions. The transition parameter of each state is a probability distribution over transitions from that state to the other states.
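For concreteness, the two models can be written down as follows. The notation is ours and is an assumption rather than the paper's: $x_t$ is the vector of click-category counts in session $t$, $K$ is the number of states, $\pi$ holds the mixing weights (MMM) or initial state probabilities (HMM), $\theta_k$ is the multinomial parameter of state $k$, and $A$ is the transition matrix.

```latex
% Multinomial emission shared by both models (V click categories)
\mathrm{Mult}(x_t \mid \theta_k)
  = \frac{\left(\sum_{v=1}^{V} x_{tv}\right)!}{\prod_{v=1}^{V} x_{tv}!}
    \prod_{v=1}^{V} \theta_{kv}^{\,x_{tv}}

% MMM: sessions are independent draws from a mixture
p(x_t) = \sum_{k=1}^{K} \pi_k \, \mathrm{Mult}(x_t \mid \theta_k)

% HMM: the state z_t of each session depends on the previous session's state
p(x_{1:T}, z_{1:T})
  = \pi_{z_1} \prod_{t=2}^{T} A_{z_{t-1} z_t}
    \prod_{t=1}^{T} \mathrm{Mult}(x_t \mid \theta_{z_t})
```

The only difference between the two is the transition term $A_{z_{t-1} z_t}$: removing it, so that every $z_t$ is drawn independently from $\pi$, recovers the MMM.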
In Section 4, we will show that students’ behaviors learned by these models are highly correlated with their final grades. By comparing prediction performance with MMM-based behaviors against that with HMM-based behaviors, we will see that accounting for temporal structure during clustering helps predict students’ final grades.
3.2 Performance Prediction
Instructors could benefit from knowing whether a student will eventually perform well in the course, so that they can provide personalized support to different students in the form of adaptive interventions. For our work, we used a student’s grade as the final outcome of the course. This section describes the method that uses student click sequence features in a binary classification task of predicting whether a student achieved a course grade above a threshold. Students who made graded progress towards course completion are assigned an output label of 1; all others are assigned label 0. It should be noted that the data is almost balanced for this classification task, as seen in Table 3. If we instead define student success with a threshold larger than zero, say 40%, the data become heavily skewed (80% majority class; 20% minority class).
We use a baseline Support Vector Machine (SVM) to contextualize the results of our prototype RNN model. For this, we use Python’s scikit-learn module. We ran SVM experiments using two feature sets: clickstream length as the only feature (SVML), and a vector of counts of each click type as the features (SVMC).
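The two baseline feature sets can be sketched in a few lines. The click type names and the example sequence below are hypothetical, and the helper names are ours.

```python
from collections import Counter


def length_feature(click_sequence):
    """SVML input: the clickstream length as a one-element feature vector."""
    return [len(click_sequence)]


def count_features(click_sequence, click_types):
    """SVMC input: one count per click type, in the fixed order of `click_types`."""
    counts = Counter(click_sequence)
    return [counts[t] for t in click_types]


# Hypothetical click types and sequence, for illustration only.
click_types = ["lecture_view", "quiz_attempt", "quiz_submit", "forum_view"]
sequence = ["lecture_view", "quiz_attempt", "lecture_view", "forum_view"]

x_len = length_feature(sequence)
x_counts = count_features(sequence, click_types)
```

Vectors of this form, one per student, could then be passed to an SVM classifier such as scikit-learn's `sklearn.svm.SVC`. The kernel and hyperparameters are not stated in the text, so they are left out here.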
Predicting a student’s grade from his or her click sequence is a sequence modeling and time series analysis task, for which RNNs are widely used. To investigate the use of recurrent neural networks in modeling student click sequences, we implemented an LSTM prototype using Keras (https://github.com/fchollet/keras/), a Theano-based deep learning library. The implemented model is composed of a single LSTM layer followed by a mean pooling layer and a logistic regression layer, as can be seen in Figure 1. Hinton’s dropout is used to prevent over-fitting, and a sigmoid activation function is applied to the output. The Adam optimizer, proposed by Kingma and Ba, is used for stochastic optimization, with binary cross-entropy as the cost function.
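To make the architecture concrete, the following is a minimal pure-Python sketch of the forward pass only: a single LSTM layer, mean pooling of the hidden states, and a logistic (sigmoid) output. It is not the Keras implementation used in the experiments. It assumes one-hot click inputs and mean pooling over time steps, uses untrained random weights with illustrative sizes, and omits dropout and training (Adam, binary cross-entropy).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, U, b, x, h):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_i
        for W_row, U_row, b_i in zip(W, U, b)
    ]


def lstm_step(params, x, h, c):
    """One LSTM step: input (i), forget (f), output (o) gates and candidate (g)."""
    i = [sigmoid(v) for v in affine(*params["i"], x, h)]
    f = [sigmoid(v) for v in affine(*params["f"], x, h)]
    o = [sigmoid(v) for v in affine(*params["o"], x, h)]
    g = [math.tanh(v) for v in affine(*params["g"], x, h)]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def predict(params, w_out, b_out, inputs, hidden_size):
    """LSTM layer -> mean pooling over time steps -> logistic regression output."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    hidden_states = []
    for x in inputs:
        h, c = lstm_step(params, x, h, c)
        hidden_states.append(h)
    pooled = [sum(col) / len(hidden_states) for col in zip(*hidden_states)]
    return sigmoid(sum(w * p for w, p in zip(w_out, pooled)) + b_out)


def one_hot(index, size):
    return [1.0 if j == index else 0.0 for j in range(size)]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Untrained, randomly initialised weights; all sizes are illustrative only.
rng = random.Random(0)
n_click_types, hidden_size = 5, 4
params = {
    gate: (random_matrix(rng, hidden_size, n_click_types),
           random_matrix(rng, hidden_size, hidden_size),
           [0.0] * hidden_size)
    for gate in "ifog"
}
w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
clicks = [0, 3, 3, 1, 4]  # hypothetical click-type indices for one student
inputs = [one_hot(k, n_click_types) for k in clicks]
probability = predict(params, w_out, 0.0, inputs, hidden_size)
```

The value returned by `predict` plays the role of the predicted probability that the student receives label 1. Mean pooling lets every time step contribute to the prediction instead of relying on the final hidden state alone.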
For our experiments, we used clickstream data from two Coursera (https://www.coursera.org/) courses.
Algebra course which ran from January through April 2013 with 43,361 students. The data from this course was used as training (first 80%) and validation data (remaining 20%).
Pre-Calculus which ran from January through April 2013 with 51,069 students. The data from this course was used as testing data (100% data).
We preprocessed the data to extract a click sequence for each student in the course and then performed our experiments on the following feature sets.
Raw clicks: These are the original clicks students made, represented as URLs. There are 2648 types of click events in the training/validation datasets, so it is a high-dimensional feature set. For instance, this feature set treats the “view lecture 1” event as different from the “view lecture 2” event. Note that these features are course-dependent, as different courses may have different lectures or assignments, and even different types of learning activities. Thus, these features are not generalizable and cannot be applied to other courses.
Click categories: We further grouped the raw click event types into 46 categories, resulting in a low-dimensional feature set. For instance, this feature set merges the “viewed lecture 1” event and the “viewed lecture 2” event into a single “viewed a lecture” event. These click categories were chosen to be consistent across courses, so they are generalizable and can be applied to other courses as well.
Session states: The click categories were further reduced by dividing each student’s click sequence into sessions (defined earlier) and associating each session with a graphical model state. We set the number of such states to 10. This feature set is almost course-independent and thus the most generalizable among the three feature sets described here.
Table 4 summarizes statistics for the features and labels in the dataset.
4.2 Behavior Modeling
This section describes the behaviors learned by MMM and HMM and interprets them in the context of learning in online courses. We first split each student’s click sequence into sessions, and fit MMM and HMM to all the sequences. We use the click categories instead of the raw clicks because the raw clicks, when fitted to the models, make each state represent learning materials from a similar time period. We empirically set the number of states to 10. Increasing this number produces redundant behaviors, and decreasing it reduces the diversity of behaviors.
The behaviors learned by the MMM and the HMM were very similar. Hence, we present only the HMM results here. Figure 2(a) shows the learned behaviors. For clear visualization, we combined the click categories into five bigger categories: lecture-, quiz-, forum-, class-, and wiki-related clicks. According to this result, a session is either focused on lectures (states 0-3), composed of lectures and quizzes (states 4 and 5), divided among lectures, quizzes, and forums (state 6), focused on browsing the course (state 7), focused on quizzes (state 8), or focused on forum activity (state 9).
Figure 2(b) shows the learned transitions between the behaviors. The initial probabilities show that many students start their first session by browsing the website (state 8) and additionally watching lectures (state 2). However, students tend not to take quizzes in their first session. As for forum activity, forum-related sessions (states 3, 6, and 9) have high transition probabilities among themselves. For quizzes, a quiz session is very likely to transition to another quiz session (state 8).
4.3 Feature Analysis
For qualitative validation of our ability to predict students’ grades based on their clicks and behaviors, we examined the features indicative of their final grades.
Individual clicks that are frequent among high graders and low graders reveal what learning materials are important to engage in to get a high/low grade in the course. We selected the five most frequent click categories for high graders and low graders, respectively, in each state, and ranked them by popularity. Table 3(a) shows that students who engage in certificate quizzes and submit quizzes are likely to be high graders. In addition, high graders actively visit other people’s profiles in forums, which may indicate their interests in social activity in the course. In contrast, low graders are likely to only attempt quizzes and use late days for quizzes. Interestingly, low graders are more likely to download lecture videos than high graders, which might indicate that they prefer to study whenever they have time, instead of setting aside a time for study.
Trigrams of clicks (a click segment of length 3) that are frequent among high graders and low graders give insights into the sequences of actions that are related to high grades and low grades. Table 2 shows that high graders engage more in quizzes and forums, whereas low graders engage in lectures and browsing. Some interesting behaviors of high graders are that they start a quiz and consult lectures or they finish a quiz and then go to forums possibly to see other people’s opinions on the quiz or to search for relevant information. In contrast, low graders are hesitant to submit a quiz, and they often go back to other pages of the course.
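Extracting and counting click trigrams can be sketched as below. The text does not spell out how trigrams were ranked, so this assumes a simple frequency count within one grade group; the click category names and helper names are hypothetical.

```python
from collections import Counter


def click_trigrams(click_sequence):
    """Return all contiguous click segments of length 3."""
    return list(zip(click_sequence, click_sequence[1:], click_sequence[2:]))


def top_trigrams(sequences, n=5):
    """Count trigrams over a group of students and return the n most frequent."""
    counts = Counter()
    for sequence in sequences:
        counts.update(click_trigrams(sequence))
    return counts.most_common(n)


# Hypothetical click categories for two students in the same grade group.
group = [
    ["quiz_start", "lecture_view", "quiz_submit", "forum_view"],
    ["quiz_start", "lecture_view", "quiz_submit"],
]
top = top_trigrams(group, n=1)
```

Running the same count separately for high graders and low graders and comparing the two rankings gives the kind of contrast discussed above.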
States that are frequent among high graders and low graders may inform us of how the distribution of time within a session distinguishes students by grade. Table 3(b) shows that high graders spend more time on quizzes and forums, whereas low graders primarily view lectures and browse the course. It is not surprising that quizzes are closely related to high grades because grades are based on quiz scores, but it is interesting that forum activity is correlated with higher grades in the course.
4.4 Performance Prediction
Because many students sign up for MOOCs out of curiosity and do not actively participate in the learning community or even interact substantially with the course material, we consider only students whose clickstreams contain a minimum of 100 clicks. Another motivation for excluding students with fewer than 100 clicks was to exclude students for whom predicting a zero score is trivial. Also, grades are not available for some students, so we exclude these students as well.
We partition the students randomly into training (80%) and test (20%) partitions, which we use across all experiments. To understand how well each model can identify struggling students, we learn to predict using only the first part of each clickstream, selected based on the following dimensions:
Number of course days: Days since the course started. This dimension would let an instructor know the status of all the students in the course at a particular point in time.
Number of student days: Days since a student joined the course. This dimension would be useful for providing personalized interventions to the student.
Number of clicks: Clicks made by a student so far in the course. This dimension gives more accurate information about a student’s participation.
Number of states: The number of states through which a student has transitioned so far. These states are obtained from the HMM.
Each of the four experiments is conducted on the three feature sets: Raw clicks, Click categories, and Session states.
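The four ways of selecting the first part of a clickstream can be sketched as follows. The sketch assumes clicks are time-sorted (timestamp in seconds, click) pairs and, as a further assumption, treats a student's first click as the moment the student joined the course; all names are hypothetical.

```python
DAY = 24 * 60 * 60  # seconds in a day


def prefix_by_course_days(clicks, course_start, days):
    """Keep clicks made within `days` days of the course start."""
    cutoff = course_start + days * DAY
    return [(t, c) for t, c in clicks if t < cutoff]


def prefix_by_student_days(clicks, days):
    """Keep clicks made within `days` days of the student's first click."""
    if not clicks:
        return []
    cutoff = clicks[0][0] + days * DAY
    return [(t, c) for t, c in clicks if t < cutoff]


def prefix_by_clicks(clicks, n_clicks):
    """Keep the student's first `n_clicks` clicks."""
    return clicks[:n_clicks]


def prefix_by_states(session_states, n_states):
    """Keep the states of the student's first `n_states` sessions."""
    return session_states[:n_states]


# Hypothetical student who first clicks on day 2 of a course starting at t = 0.
clicks = [(2 * DAY, "lecture"), (3 * DAY, "quiz"), (10 * DAY, "forum")]
by_course = prefix_by_course_days(clicks, course_start=0, days=3)
by_student = prefix_by_student_days(clicks, days=3)
by_clicks = prefix_by_clicks(clicks, 1)
by_states = prefix_by_states([2, 2, 8, 9], 3)
```

The example shows why course days and student days differ for a late joiner: three course days cover only the first click, while three student days cover the first two.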
We also examined the transferability of the classifiers trained on the Algebra course to the Precalculus course. To infer the states of the clickstreams in the new course, we used maximum likelihood estimation for MMM states and the Viterbi algorithm for HMM states.
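A minimal sketch of both inference steps is given below, assuming each session is summarized as a vector of click-category counts and that the log-parameters of already trained models are available. The parameter values in the example are made up, and the function names are ours.

```python
import math


def log_emission(counts, log_theta):
    """Multinomial log-likelihood of click counts, dropping the state-independent constant."""
    return sum(n * lt for n, lt in zip(counts, log_theta))


def mmm_assign(counts, log_thetas):
    """Maximum likelihood MMM state for one session."""
    scores = [log_emission(counts, lt) for lt in log_thetas]
    return max(range(len(scores)), key=scores.__getitem__)


def viterbi(sessions, log_init, log_trans, log_thetas):
    """Most likely HMM state path for a non-empty sequence of session count vectors."""
    n_states = len(log_init)
    score = [log_init[k] + log_emission(sessions[0], log_thetas[k])
             for k in range(n_states)]
    backpointers = []
    for counts in sessions[1:]:
        prev = score
        pointers, score = [], []
        for k in range(n_states):
            best = max(range(n_states), key=lambda j: prev[j] + log_trans[j][k])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][k]
                         + log_emission(counts, log_thetas[k]))
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def log_all(values):
    return [math.log(v) for v in values]


# Made-up parameters: state 0 favours lecture clicks, state 1 favours quiz clicks.
log_thetas = [log_all([0.9, 0.1]), log_all([0.1, 0.9])]
log_init = log_all([0.5, 0.5])
log_trans = [log_all([0.9, 0.1]), log_all([0.1, 0.9])]
sessions = [[5, 0], [4, 1], [0, 6]]  # counts of (lecture, quiz) clicks per session

mmm_states = [mmm_assign(s, log_thetas) for s in sessions]
hmm_states = viterbi(sessions, log_init, log_trans, log_thetas)
```

Adding the log mixing weights to the scores in `mmm_assign` would turn the maximum likelihood assignment into a maximum a posteriori one. The MMM labels each session independently, whereas the Viterbi path also accounts for the transition probabilities between successive sessions.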
| Click Features | Student Grade Labels |
| # Click Types | 2,378 | 2,021 | 2,623 | 1 | 57.65% | 57.23% | 55.10% |
Note: Distribution follows Zipf: linear log-log regression explains 85% of variance.

| | Raw clicks | Session states | Click categories |
| Range | (101, 56310) | (5, 277) | (101, 56310) |
Test set accuracies for the performance prediction task are shown in Table 5 (Algebra test set) and Table 6 (Precalculus test set); training set accuracies may be found in the appendix (Table 7). LSTM using raw click features is the clear winner within the Algebra course, significantly outperforming all other baselines for most configurations. We note, though, that these features are course-dependent and cannot reliably be used for prediction on other courses (initial experiments using SVM validated this assumption, with a raw click feature baseline performing 5-10 percentage points lower than the MMM state features). Furthermore, given the poor performance of click category features even within the Algebra course, it is unlikely that these features will perform well on other courses, even though they may be able to generalize. However, both MMM and HMM state features show the potential for capturing aspects which distinguish student performance, with LSTM outperforming SVM where sequence information is represented in the features (HMM state features, by a first-order Markov assumption), and failing to beat the baseline when the features assume sessions to be fully independent of one another (MMM state features).
Table 6 explores the use of MMM and HMM state features on the Precalculus course, to examine whether sequence information improves the generalizability of SVM or LSTM to other courses. LSTM with HMM features performs surprisingly well, significantly outperforming SVM for a majority of configurations. The Multilayer Perceptron (MLP) is a non-RNN option, which we trained using the Python scikit-learn package (http://scikit-learn.org/dev/modules/generated/sklearn.neural_network.MLPClassifier.html). This is a much more difficult baseline, and more investigation is needed to determine whether LSTM is definitively better for generalizability of our approach. For the Precalculus course data, however, LSTM with HMM features outperforms even this baseline, although the difference is not significant (as per a 2-tailed Student’s t-test, p < 0.05). Also interesting is the fact that LSTM significantly outperforms SVM using MMM features in half of the configurations, suggesting that the sequential information encoded in the order of sessions is not completely lost when encoded into MMM features, even though the MMM features themselves make no assumption about ordering.
| Click Features | State Features |
| Raw Clicks | Click Categories | MMM | HMM |
| 35 | 92.80* | 89.46 | 74.68 | 74.56 | 85.20 | 87.58 | 86.73 | 84.82 |
| #states | 10 | 80.72 | 83.63* | 84.27* | 83.37 | 85.85* | 85.68* |
* Statistical significance as determined by a 2-tailed Student’s t-test (p < 0.05); indicates significant improvement over the lowest performing classifier.
5 Discussion & Future Work
The work presented compares LSTM to simple baselines to demonstrate the strength of sequential feature information in modeling and capturing characteristics of sequential inputs such as clickstream data. However, we acknowledge the limitations of this work with respect to the comparisons which were made.
First, the two Coursera courses used came from the same institution (UC-Irvine) and overlap substantially in terms of content, even sharing some of the same video lectures. Additionally, these courses were held at the same time, which controls for many platform-dependent aspects specific to the version of Coursera that was live at that time. Future work should extend the comparisons we present here to other courses that are not held at the same time, on the same platform, by the same instructors, or even on similar topics. A truly generalized theory of student learning would apply across all these domains, and it would be interesting to see the extent to which LSTMs or other methods could generalize in this way.
We conducted some initial experiments which showed that LSTM outperformed a simple neural network baseline in generalizing to other course data, but the results were not conclusive enough to claim statistical significance. Future work should investigate whether the success of LSTMs is due to the general qualities of neural networks or to the sequential nature of the LSTM framework.
The behavior modeling techniques explored so far consider only states which give equal weight to each click emitted by a state. Future work could consider state emissions which take into consideration the length of time a student spends on each click, acknowledging the intuition that students who take more or less time to complete the same activities may have differing course performance.
The MMM features give a good sequence-agnostic comparison for HMM, but initial experiments using GMM features suggest that this or another sequence-agnostic model may fit the data better than MMM features. Future work should explore the use of GMM features or Gaussian HMM emissions to examine whether these features could surpass those explored here.
In this work, we modeled students’ behaviors in online courses via click logs, and predicted students’ final grades on the basis of their clicks and behaviors. The experiments revealed that students with a high grade actively engage in forums and quizzes, whereas low-grade students tend to watch or download lectures and attempt quizzes without submission. These course-independent behaviors achieved high accuracy in predicting final grades and generalized well to another course. Although raw clicks were the most informative of final grades, they cannot be used in other courses because they encode course-specific information. On the other hand, click categories are course-independent, but they predict final grades poorly. Another contribution of this work is that we increased prediction accuracy by using the temporal information of clicks. This suggests that the temporal dynamics of clicks convey information about final grades that cannot be obtained from individual clicks.
-  Coleman, C. A., Seaton, D. T., & Chuang, I. (2015). Probabilistic Use Cases: Discovering Behavioral Patterns for Predicting Certification. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale - L@S ’15 (pp. 141–148). New York, New York, USA: ACM Press. http://doi.org/10.1145/2724660.2724662
-  Gers, F. A., & Schmidhuber, J. (2000). Recurrent nets that time and count. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2000), Como, Italy.
-  Gers, F. A., & Schmidhuber, J. (2001). LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks.
-  Halawa, S., Greene, D., & Mitchell, J. (2014). Dropout Prediction in MOOCs using Learner Activity Features. In Proceedings of the European MOOC Stakeholder Summit 2014 (pp. 58–65).
-  Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
-  Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(2), 107–116.
-  Graves, A. (2012). Supervised sequence labelling with recurrent neural networks. Vol. 385. Springer. http://www.cs.toronto.edu/~graves/preprint.pdf
-  Jeske, D., Backhaus, J., & Stamov Roßnagel, C. (2014). Self-regulation during e-learning: using behavioural evidence from navigation log files. Journal of Computer Assisted Learning, 30(3), 272–284. http://doi.org/10.1111/jcal.12045
-  Jo, Y., & Rosé, C. P. (2015). Time Series Analysis of Nursing Notes for Mortality Prediction via a State Transition Topic Model. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management.
-  Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
-  Narciss, S., Proske, A., & Koerndle, H. (2007). Promoting self-regulated learning in web-based learning environments. Computers in Human Behavior, 23, 1126–1144.
-  Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
-  Ramesh, A., Goldwasser, D., Huang, B., Daumé III, H., & Getoor, L. (2013). Modeling Learner Engagement in MOOCs using Probabilistic Soft Logic. NIPS Workshop on Data Driven Education, 1–7.
-  Srivastava, N., et al. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958. https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
-  Schuster, M., & Paliwal, K. (1997). Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing, 45(11). http://www.di.ufpe.br/~fnj/RNA/bibliografia/BRNN.pdf
-  Wen, M., & Rosé, C. P. (2014). Identifying Latent Study Habits by Mining Learner Behavior Patterns in Massive Open Online Courses. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management - CIKM ’14 (pp. 1983–1986). http://doi.org/10.1145/2661829.2662033
-  Wen, M., Yang, D., & Rosé, C. P. (2014). Sentiment Analysis in MOOC Discussion Forums: What does it tell us. In Proceedings of Educational Data Mining.
-  Yang, J., McAuley, J., Leskovec, J., LePendu, P., & Shah, N. (2014). Finding Progression Stages in Time-evolving Event Sequences. In WWW 2014 (pp. 783–793). New York, New York, USA: ACM Press. http://doi.org/10.1145/2566486.2568044
-  Brinton, C. G., & Chiang, M. (2015). MOOC Performance Prediction via Clickstream Data and Social Learning Networks. In IEEE INFOCOM.