1 Introduction
The average six-year graduation rate across four-year higher education institutions has been around 59% over the past 15 years [Kena et al. (2016), Braxton et al. (2011)], while less than half of college graduates finish within four years [Braxton et al. (2011)]. These statistics pose challenges in terms of workforce development, economic activity and national productivity. This has resulted in a critical need for analyzing the available data about past students in order to provide actionable insights to improve college student graduation and retention rates. Some examples of the problems that have been investigated are: course recommendation [Elbadrawy and Karypis (2016), Bendakir and Aïmeur (2006), Lee and Cho (2011), Parameswaran and Garcia-Molina (2009), Parameswaran et al. (2010), Parameswaran et al. (2010), Parameswaran et al. (2011)], next-term course grade prediction [Polyzou and Karypis (2016), Sweeney et al. (2016), Elbadrawy and Karypis (2016), Morsy and Karypis (2017), Hu and Rangwala (2018)], predicting the final grade of a course based on the student’s ongoing performance during the term [Meier et al. (2015)], in-class activities grade prediction [Elbadrawy et al. (2015)], predicting students’ performance in tutoring systems [Thai-Nghe et al. (2011), Hershkovitz et al. (2013), Hwang and Su (2015), Romero et al. (2008), Thai-Nghe et al. (2012)], and knowledge tracing and student modeling [Reddy et al. (2016), Lan et al. (2014), González-Brenes and Mostow (2012)].
Both course recommendation [Bendakir and Aïmeur (2006), Parameswaran et al. (2011), Elbadrawy and Karypis (2016), Bhumichitr et al. (2017), Hagemann et al. (2018)] and grade prediction [Sweeney et al. (2016), Elbadrawy and Karypis (2016), Polyzou and Karypis (2016), Morsy and Karypis (2017), Hu and Rangwala (2018)] methods aim to help students during the process of course registration in each semester. By learning from historical registration data, course recommendation focuses on recommending courses to students that will help them in completing their degrees. Grade prediction focuses on estimating the students’ expected grades in future courses. Based on what courses they previously took and how well they performed in them, the predicted grades give an estimation of how well students are prepared for future courses. Nearly all of the previous studies have focused on solving each problem separately, even though the two problems are interrelated in the sense that they both aim to help students graduate in a timely and successful manner.
In this paper, we propose a grade-aware course recommendation framework that focuses on recommending a set of courses that will help students: (i) complete their degrees in a timely fashion, and (ii) maintain or improve their GPA. To this end, we propose two different approaches for recommendation. The first approach ranks the courses by using an objective function that differentiates between courses that are expected to increase or decrease a student’s GPA. The second approach uses the grades that students are expected to obtain in future courses to improve the ranking of the courses produced by course recommendation methods.
To obtain course rankings in the first approach, we adapt two widely-known representation learning techniques, which have proven successful in many fields, to solve the grade-aware course recommendation problem. The first is based on Singular Value Decomposition (SVD), which is a linear model that learns a low-rank approximation of a given matrix. The second, which we refer to as Course2vec, is based on Word2vec [Mikolov et al. (2013)], which uses a log-linear model to formulate the problem as a maximum likelihood estimation problem. In both approaches, the courses taken by each student are treated as temporally-ordered sets of courses, and each approach is trained to learn these orderings.
1.1 Contributions
The main contributions of this work are the following:

We propose a Grade-aware Course Recommendation framework in higher education that recommends courses to students that the students are most likely to register for in their following terms and that will help maintain or improve their overall GPA. The proposed framework combines the benefits of both course recommendation and grade prediction approaches to better help students graduate in a timely and successful manner.

We investigate two different approaches for solving grade-aware course recommendation. The first approach uses an objective function that explicitly differentiates between good and bad courses, while the other approach combines grade prediction methods with course recommendation methods in a non-linear way.

We adapt two widely-used representation learning techniques to solve the grade-aware course recommendation problem, by modeling historical course ordering data and differentiating between courses that increase or decrease the student’s GPA.

We perform an extensive set of experiments on a dataset spanning 16 years obtained from the University of Minnesota, which includes students who belong to 23 different majors. The results show that: (i) the proposed grade-aware course recommendation approaches outperform grade-unaware course recommendation methods in recommending more courses that increase the students’ GPA and fewer courses that decrease it; and (ii) the proposed representation learning approaches outperform competing approaches for grade-aware course recommendation in terms of recommending courses which students are expected to perform well in, as well as differentiating between courses which students are expected to perform well in and those which they are expected not to perform well in.

We provide an in-depth analysis of the recommendation accuracy across different majors and different student groups. We show the effectiveness of our proposed approaches on different majors and student groups over the best competing method. In addition, we analyze two important characteristics of the recommendations: the course difficulty and the course popularity. We show that our proposed approaches are not prone to recommending easy courses. Furthermore, they are able to recommend courses with different popularity in a similar manner.
2 Related Work
2.1 Course Recommendation
Different machine learning methods have recently been developed for course recommendation. For example, Bendakir and Aïmeur (2006) used association rule mining to discover significant rules that associate academic courses from previous students’ data. Lee and Cho (2011) ranked the courses for each student based on each course’s importance within his/her major, its satisfied prerequisites, and the extent to which the course adds to the student’s knowledge state.
Another set of recommendation methods proposed in [Parameswaran and Garcia-Molina (2009), Parameswaran et al. (2010), Parameswaran et al. (2010), Parameswaran et al. (2011)] focused on satisfying the degree plan’s requirements that include various complex constraints. The problem was shown to be NP-hard and different heuristic approaches were proposed in order to solve the problem.
Elbadrawy and Karypis (2016) proposed using both student- and course-based academic features in order to improve the performance of three popular recommendation methods in the education domain, namely: popularity-based ranking, user-based collaborative filtering and matrix factorization. These features are used to define finer groups of students and courses, and were shown to improve the recommendation performance of the three aforementioned methods relative to using coarser groups of students.
The group popularity ranking method proposed in [Elbadrawy and Karypis (2016)] and referred to as grppop, ranks the courses based on how frequently they were taken by students of the same major and academic level as the target student. Though this is a simple ranking method, it was shown to be among the best performing methods proposed by the authors. This is due to the domain restrictions, where each degree program offers a specific set of required and elective courses for the students to choose a subset from, and a prerequisite structure exists among most of these courses.
Pardos et al. (2019) proposed a similar course2vec model that was developed independently and in parallel to our proposed work (an earlier version of our paper was published as a technical report at: https://goo.gl/HrxVdr). They used a skip-gram neural network architecture that takes as input one course, and outputs multiple probability distributions over the courses. The approaches that are presented here differ from that work because they use a Continuous Bag-of-Words (CBOW) neural network architecture that takes as input multiple courses and outputs one probability distribution over the courses for recommendation. Another difference is that their model is grade-unaware, while ours is grade-aware, which is a main contribution of our work.
Another model [Backenköhler et al. (2018)] that is also parallel and most relevant to our work also proposed to combine grade prediction with course recommendation. Our work is different in three aspects.
First, Backenköhler et al. (2018) use a course dependency graph constructed using the Mann-Whitney U-test as the course recommendation method. This graph consists of nodes that represent courses and directed edges between them. A directed edge going from course A to course B means that the chance of getting a better grade in B is higher when A is taken before B than when A is not taken before B. One limitation of this approach is that, for pairs (A, B) of courses that do not have sufficient data about A not being taken before B, no directed edge will exist from A to B, despite the fact that there may be sufficient data about A followed by B, which may imply that A is a prerequisite for B. Our proposed representation learning approaches for course recommendation, described in Section 3.1, on the other hand, are able to learn all possible orderings for pairs of courses that have sufficient data. In addition, the course embeddings are learned in a way such that courses that are taken after a common set of courses are located close in the latent space, which enables discovering new relationships between previous and subsequent courses that do not necessarily exist in the data.
Second, we propose a new additional approach for grade-aware course recommendation, which modifies the course recommendation objective function to differentiate between good and bad sequences of courses and does not require a grade prediction method.
2.2 Course Sequence Discovery and Recommendation
Though our focus in this paper is to recommend courses for students in their following term, and not to recommend the whole sequence of courses for all terms, our proposed models try to learn the sequencing of courses such that they predict the next term’s good courses based on the previously-taken set of courses.
Cucuringu et al. (2017) utilized several ranking algorithms, e.g., PageRank, to extract a global ranking of the courses, where the rank here denotes the order in which the courses are taken by students. The discovered course sequences were used to infer the hidden dependencies, i.e., informal prerequisites, between the courses, and to understand whether and how course sequences learned from high- and low-performing students differ from each other. This technique learns only one global ranking of courses from all students, which cannot be used for personalized recommendation.
Xu et al. (2016) proposed a course sequence recommendation framework that aimed to minimize the time-to-graduate, which is based on satisfying the prerequisite requirements, course availability during the term, the maximum number of courses that can be taken during each term, and the degree requirements. They also proposed to jointly optimize both graduation time and GPA by clustering students based on some contextual information, e.g., their high school rank and SAT scores, and keeping track of each student’s sequence of taken courses as well as his/her GPA. A new student is then assigned to a specific cluster based on his/her contextual information, and the sequence of courses from that cluster that has the highest GPA estimate is recommended to him/her. This framework can work well on the more restricted degree programs that have little variability between the degree plans taken by students, given that there is enough support for the different degree plans from past students. However, the more flexible degree programs have much variability in the degree plans taken by their students, as shown in [Morsy and Karypis (2019)]. This makes an exact extraction system like the one above inapplicable for their students, unless there exists a huge dataset that covers the many different possible sequences with high support.
2.3 Representation Learning
Representation learning has been an invaluable approach in machine learning and artificial intelligence for learning from different types of data such as text and graphs. Objects can be represented in a vector space via local or distributed representations. Under local (or one-hot) representations, each object is represented by a binary vector, of size equal to the total number of objects, where only one of the values in the vector is one and all the others are set to zero. Under distributed representations, each object is represented by a dense or sparse vector. This vector can come from hand-engineered features, which are usually sparse and high-dimensional, or from a learned representation, called an “embedding”, in a latent space that preserves the relationships between the objects. The latter is usually low-dimensional and more practical than the former.
A widely used approach for learning object embeddings is Singular Value Decomposition (SVD) [Golub and Reinsch (1970)]. SVD is a traditional low-rank approximation method that has been used in many fields. In recommendation systems, a user-item rating matrix is typically decomposed into the user and item latent factors that uncover the observed ratings in the matrix [e.g., Sarwar et al. (2000), Bell et al. (2007), Paterek (2007), Koren (2008)].
Recently, neural networks have gained a lot of interest for learning object embeddings in different fields, for their ability to handle more complex relationships than SVD. Some of the early well-known architectures include Word2vec [Mikolov et al. (2013)] and GloVe [Pennington et al. (2014)], which were proposed for learning distributed representations for words [Mikolov et al. (2013)]. For instance, neural language models for words, phrases and documents in Natural Language Processing [e.g., Huang et al. (2012), Mikolov et al. (2013), Le and Mikolov (2014), Pennington et al. (2014), Mikolov et al. (2013)] are now widely used for different tasks, such as machine translation and sentiment analysis. Similarly, methods for learning embeddings for graphs, such as DeepWalk [Perozzi et al. (2014)], LINE [Tang et al. (2015)] and node2vec [Grover and Leskovec (2016)], were shown to perform well on different applications, such as multi-label classification and link prediction. Moreover, learning embeddings for products in e-commerce and music playlists in cloud-based music services has been recently proposed for next basket recommendation [Chen et al. (2012), Grbovic et al. (2015), Wang et al. (2015)].
3 Grade-aware Course Recommendation
Undergraduate students often achieve inconsistent grades in the various courses they take, which may increase or decrease their overall GPA. This is illustrated in Figure 1, which shows the histogram of differences between each grade obtained by a student and his/her prior average grade, for the dataset used in our experiments (Table 1). As we can see, more than of the grades are a full-letter grade lower than the corresponding students’ previous average grades. (The letter grading system in this dataset has 11 letter grades (A, A-, B+, B, B-, C+, C, C-, D+, D, F) that correspond to the numerical grades (4, 3.67, 3.33, 3, 2.67, 2.33, 2, 1.67, 1.33, 1, 0), with A being the highest grade and F the lowest one.) The poor performance in some of these courses can result in students having to retake the same courses at a later time, or increase the number of courses that they will have to take in order to graduate with a desired GPA. As a result, this will increase the financial cost associated with obtaining a degree and can incur an opportunity cost by delaying the students’ graduation.
For the cases in which a student’s performance in a course is a result of him/her not being well-prepared for it (i.e., is taking the course at the wrong time in his/her studies), course recommendation methods can be used to recommend a set of courses for that student that will help him/her: (i) complete his/her degree in a timely fashion, and (ii) maintain or improve his/her GPA. We will refer to the methods that do both simultaneously as grade-aware course recommendation approaches. Note that the majority of the existing approaches cannot be used to solve this problem as they ignore the performance the student is expected to achieve in the courses that they recommend.
In this work, we propose two different approaches for grade-aware course recommendation. The first approach (Section 3.1) uses two representation learning approaches that explicitly differentiate between courses in which the student is expected to perform well and courses in which the student is expected not to perform well. The second approach (Section 3.2) combines grade prediction methods with course recommendation methods to improve the final course rankings. The goal of both approaches is to rank the courses in which the student is expected to perform well higher than those in which he/she is expected not to perform well.
3.1 Grade-aware Representation Learning Approaches
Our first approach for solving the grade-aware course recommendation problem relies on modifying the way we use the previous students’ data to differentiate between courses in which the student is expected to perform well and courses in which the student is expected not to perform well. As such, for every student, we define a course taken by him/her to be a good (subsequent) course if the student’s grade in it is equal to or higher than his/her average previous grade; otherwise, we define that course to be a bad (subsequent) course. The goal of our method is to recommend to each student a set of good courses.
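The good/bad labeling rule above can be illustrated with a minimal sketch. It assumes that the "average previous grade" is the plain mean of all grades from earlier terms and that first-term courses stay unlabeled because no prior average exists; the function name, the data layout and the course codes are hypothetical and not taken from the paper.

```python
def label_subsequent_courses(terms):
    """Label each course as 'good' or 'bad' relative to the prior average grade.

    terms: chronologically ordered list of terms; each term is a list of
    (course, numeric_grade) pairs. Courses in the first term are skipped
    because there is no previous average to compare against (an assumption).
    """
    labels = []
    prior_grades = []
    for term in terms:
        if prior_grades:
            average = sum(prior_grades) / len(prior_grades)
            for course, grade in term:
                # Good: grade equal to or higher than the average previous grade.
                labels.append((course, "good" if grade >= average else "bad"))
        prior_grades.extend(grade for _, grade in term)
    return labels


# Hypothetical student history (course codes are made up).
terms = [
    [("MATH1", 3.0), ("CS1", 4.0)],
    [("CS2", 3.667), ("PHYS1", 2.0)],
    [("CS3", 3.0)],
]
labels = label_subsequent_courses(terms)
```

In this toy history the prior average before the second term is 3.5, so CS2 is labeled good and PHYS1 bad; CS3 falls below the updated average and is labeled bad.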
Motivated by the success of representation learning approaches in recommendation systems [Koren (2008), Chen et al. (2012), Grbovic et al. (2015), Wang et al. (2015)], we adapt two widely-used approaches to solve the grade-aware course recommendation problem. The first approach applies the Singular Value Decomposition linear factorization model to a co-occurrence frequency matrix that differentiates between good and bad courses (Section 3.1.1), while the second one optimizes an objective function of a neural network log-linear model that differentiates between good and bad courses (Section 3.1.2).
In both approaches, the courses taken by each student are treated as temporally-ordered sets of courses, and each approach is trained on this data in order to learn the proper ordering of courses as taken by students. The course representations learned by these models are then used to create personalized rankings of courses for students that are designed to include courses that are relevant to the students’ degree programs and will help them maintain or increase their GPAs.
3.1.1 Singular Value Decomposition (SVD)
SVD [Golub and Reinsch (1970)] is a traditional low-rank linear model that has been used in many fields. It factorizes a given matrix $X$ by finding a solution to $X = U \Sigma V^{T}$, where the columns of $U$ and $V$ are the left and right singular vectors, respectively, and $\Sigma$ is a diagonal matrix containing the singular values of $X$. The $d$ largest singular values, and the corresponding singular vectors from $U$ and $V$, give the rank-$d$ approximation of $X$ ($X_d = U_d \Sigma_d V_d^{T}$). This technique is called truncated SVD.
Since we are interested in learning course ordering as taken by past students, we apply SVD on a previous-subsequent co-occurrence frequency matrix $F$, where $F_{ij}$ is the number of students in the training data that have taken course $i$ before they took course $j$.
We form two different previous-subsequent co-occurrence frequency matrices, as follows. Let $n^{+}_{ij}$ and $n^{-}_{ij}$ be the number of students who have taken course $i$ before course $j$, where course $j$ is considered a good course for the first group and a bad course for the second one, respectively. The two matrices are:

$F^{+}$: where $F^{+}_{ij} = n^{+}_{ij}$.

$F^{+-}$: where $F^{+-}_{ij} = n^{+}_{ij} - n^{-}_{ij}$.
We scaled the rows of each matrix to unit norm and then applied truncated SVD on them. The course embeddings for the previous and subsequent courses are then obtained from the truncated left and right singular vectors, respectively.
Note that we append a (+) or (+-) as a superscript to the matrix and as a suffix to the corresponding method’s name based on what course information it utilizes during learning and how it utilizes it. A (+)-based method utilizes the good course information only and ignores the bad ones, while a (+-)-based method utilizes both the good and bad course information and differentiates between them.
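As an illustration of this construction, the following NumPy sketch builds the good-only and good-minus-bad co-occurrence matrices from toy data and factorizes one of them with truncated SVD. Several choices are assumptions made only for the sketch: the good-minus-bad matrix is taken to be the difference of the good and bad counts, rows are scaled by their L1 norm, and the singular values are split evenly (as square roots) between the previous and subsequent embeddings. All function and variable names are hypothetical.

```python
import numpy as np


def cooccurrence_matrices(students, num_courses):
    """Count previous-subsequent pairs, split by good/bad subsequent course.

    students: list of course histories; each history is a list of terms and
    each term is a list of (course_index, is_good) pairs.
    Returns (good-only matrix, good-minus-bad matrix).
    """
    n_good = np.zeros((num_courses, num_courses))
    n_bad = np.zeros((num_courses, num_courses))
    for terms in students:
        previous = set()
        for term in terms:
            for j, is_good in term:
                target = n_good if is_good else n_bad
                for i in previous:
                    target[i, j] += 1  # course i was taken before course j
            previous.update(course for course, _ in term)
    return n_good, n_good - n_bad


def row_normalize(F):
    """Scale each row to unit L1 norm (assumed norm); all-zero rows are kept."""
    norms = np.abs(F).sum(axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return F / norms


def truncated_svd_embeddings(F, d):
    """Rank-d embeddings; splitting sqrt(singular values) is an assumption."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    root = np.sqrt(s[:d])
    return U[:, :d] * root, Vt[:d, :].T * root


# Toy data: two students, three courses (indices 0..2).
students = [
    [[(0, True)], [(1, True), (2, False)]],
    [[(0, True)], [(1, False)], [(2, True)]],
]
F_plus, F_plus_minus = cooccurrence_matrices(students, num_courses=3)
prev_emb, sub_emb = truncated_svd_embeddings(row_normalize(F_plus), d=2)
```

With this split of the singular values, the product of the previous and subsequent embeddings reproduces the rank-d approximation of the normalized matrix, which is why dot products between the two sets of embeddings can serve as ranking scores.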
Recommendation.
Given the previous and subsequent course embeddings estimated by SVD, course recommendation is done as follows. Given a student with his/her previously-taken set of courses, who would like to register for his/her following term, we compute his/her implicit profile by averaging over the embeddings of the courses taken by him/her in all previous terms. (We tried different window sizes for the number of previous terms; using all previous terms achieved better results than using only one, two or three previous terms.) We then compute the dot product between the student’s profile and the embedding of each candidate course. Then, we rank the courses in non-increasing order according to these dot products, and select the top-ranked courses as the final recommendations for the student.
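The ranking procedure just described can be sketched in a few lines of plain Python. The embeddings, course codes and the `recommend` helper below are hypothetical and serve only to show the averaging and dot-product steps.

```python
def recommend(prev_emb, sub_emb, taken, candidates, top_n):
    """Rank candidate courses by the dot product with the student's profile.

    prev_emb / sub_emb: dicts mapping a course to its previous-course and
    subsequent-course embedding (lists of floats of equal length).
    taken: courses from all previous terms; candidates: courses to rank.
    """
    dim = len(next(iter(prev_emb.values())))
    # Implicit profile: average of the previous-course embeddings.
    profile = [sum(prev_emb[c][k] for c in taken) / len(taken) for k in range(dim)]
    scores = {
        c: sum(p * q for p, q in zip(profile, sub_emb[c])) for c in candidates
    }
    # Non-increasing order of score; keep the top_n courses.
    ranked = sorted(candidates, key=lambda c: scores[c], reverse=True)
    return ranked[:top_n]


# Made-up two-dimensional embeddings.
prev_emb = {"CS1": [1.0, 0.0], "MATH1": [0.0, 1.0]}
sub_emb = {"CS2": [0.9, 0.3], "MATH2": [0.2, 0.8], "ART1": [-0.5, -0.5]}
top = recommend(prev_emb, sub_emb, ["CS1", "MATH1"], ["ART1", "MATH2", "CS2"], top_n=2)
```

Here the profile is [0.5, 0.5], so CS2 scores highest, MATH2 second and ART1 last.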
3.1.2 Course2vec
The above SVD model works on pairwise, one-to-one relationships between previous and subsequent courses. We also model course ordering using a many-to-one, log-linear model, which is motivated by the recent word2vec Continuous Bag-of-Words (CBOW) model [Mikolov et al. (2013)]. Word2vec works on sequences of individual words in a given text, where a set of nearby (context) words (i.e., words within a predefined window size) are used to predict the target word. In our case, the sequences would be the ordered terms taken by each student, where each term contains a set of courses, and the previous set of courses would be used to predict future courses for each student.
Model Architecture.
We formulate the problem as a maximum likelihood estimation problem. Let $T_t$ be a set of courses taken in some term $t$. A sequence $Q_s = \langle T_1, T_2, \ldots \rangle$ is an ordered list of terms as taken by some student $s$, where each term can contain one or more courses. Let $W \in \mathbb{R}^{m \times d}$ be the courses’ representations when they are treated as “previous” courses, and let $W' \in \mathbb{R}^{m \times d}$ be their representations when they are treated as “subsequent” courses, where $m$ is the number of courses and $d$ is the number of dimensions in the embedding space. We define the probability of observing a future course $c$ given a set of previously-taken courses $P$ using the softmax function, i.e.,
$$\Pr(c \mid P) = \frac{\exp\left({w'_{c}}^{T} h\right)}{\sum_{c'=1}^{m} \exp\left({w'_{c'}}^{T} h\right)}, \qquad (1)$$
where $w'_{c}$ is the row of $W'$ that corresponds to course $c$, and $h$ denotes the aggregated vector of the representations of the previous courses, where we use average pooling for aggregation, i.e.,
$$h = \frac{1}{|P|} W^{T} \sum_{c \in P} \mathbf{1}_{c},$$
where $\mathbf{1}_{c}$ is a one-hot encoded vector of size $m$ that has 1 in the $c$’s position and 0 otherwise. The architecture for Course2vec is shown in Figure 2. Note that one may consider more complex neural network architectures, which is left for future work.
We propose the two following models:

Course2vec(+). This model maximizes the log-likelihood of observing only the good subsequent courses that are taken by student $s$ in some term given his/her previously-taken set of courses. The objective function of Course2vec(+) is thus:
$$\max_{W, W'} \; \sum_{s \in S} \sum_{t \ge 2} \sum_{c \in G_{s,t}} \log \Pr(c \mid P_{s,t}), \qquad (2)$$
where: $\Pr(\cdot)$ is given by Eq. 1, $S$ is the set of students, $G_{s,t}$ is the set of good courses taken by student $s$ at term $t$, and $P_{s,t}$ is the set of courses taken by student $s$ prior to term $t$. Note that $t$ starts from 2, since the previous set of courses would be empty for $t = 1$.

Course2vec(+-). This model maximizes the log-likelihood of observing good courses and minimizes the log-likelihood of observing bad courses given the set of previously-taken courses. The objective function of Course2vec(+-) is thus:
$$\max_{W, W'} \; \sum_{s \in S} \sum_{t \ge 2} \left( \sum_{c \in G_{s,t}} \log \Pr(c \mid P_{s,t}) - \sum_{c \in B_{s,t}} \log \Pr(c \mid P_{s,t}) \right), \qquad (3)$$
where: $B_{s,t}$ is the set of bad courses taken by student $s$ at term $t$, and the rest of the terms are as defined in Eq. 2.
Note that Course2vec(+) is analogous to SVD(+) and Course2vec(+-) is analogous to SVD(+-) in terms of how they utilize the good and bad courses in the training set.
Model Optimization.
The objective functions in Eqs. 2 and 3 can be solved using Stochastic Gradient Descent (SGD), by solving for one subsequent course at a time. The computation of gradients in the two equations requires computing Eq. 1 for all courses for the denominator, which requires knowing whether a course is to be considered a good or a bad subsequent course for a given context. However, not all the relationships between every context (previous set of courses) and every subsequent course are known from the data. Hence, for each context, we only update the subsequent course vector when the course is known to be a good or bad subsequent course associated with that context. If some context does not have a sufficient predefined number of subsequent courses with known relationships, then we randomly sample a few other courses and treat them as bad courses, similar to the negative sampling approach used in word2vec [Le and Mikolov (2014)].
Note that in Course2vec(+-), since a course can be seen as both a good and a bad subsequent course for the same context in the data (for different students), we randomly choose whether to treat that course as good or bad each time according to a uniform distribution that is based on its good and bad frequency in the dataset. In addition, for both Course2vec(+) and Course2vec(+-), if the frequency between a context and a subsequent course is less than a predefined threshold, e.g., 20, then we randomly choose whether to update that subsequent course’s vector in the denominator each time it is visited. The code for Course2vec can be found at: https://goo.gl/uCCqie, which is built on the original word2vec code that was implemented for the CBOW model (the original code is at: https://goo.gl/UvUuMQ).
Recommendation.
Given the previous and subsequent course embeddings estimated by Course2vec, course recommendation is done as follows. Given a student with his/her previously-taken set of courses, who would like to register for his/her following term, we compute the probability for each candidate course according to Eq. 1. We then rank the courses in non-increasing order according to their probabilities, and select the top-ranked courses as the final recommendations for the student. Note that since the denominator in Eq. 1 is the same for all candidate courses, the ranking score for a candidate course can be simplified to the dot product between its subsequent-course embedding and the aggregated vector of the previous courses’ representations, where the latter represents the student’s implicit profile.
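The simplification noted above can be checked numerically with a small sketch. It assumes average pooling over the previous-course embeddings followed by a softmax over the subsequent-course embeddings; the embedding values are made up and the helper names are hypothetical.

```python
import math


def average_pool(W_prev, previous):
    """Average the previous-course embeddings of the courses already taken."""
    dim = len(W_prev[0])
    return [sum(W_prev[c][k] for c in previous) / len(previous) for k in range(dim)]


def softmax_scores(W_sub, h):
    """Return (probabilities, raw dot products) over all candidate courses."""
    dots = [sum(a * b for a, b in zip(row, h)) for row in W_sub]
    shift = max(dots)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in dots]
    total = sum(exps)
    return [e / total for e in exps], dots


# Made-up embeddings for three courses in a two-dimensional space.
W_prev = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_sub = [[0.2, 0.1], [0.4, 0.9], [1.0, 0.5]]

h = average_pool(W_prev, previous=[0, 1])
probs, dots = softmax_scores(W_sub, h)

rank_by_prob = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
rank_by_dot = sorted(range(len(dots)), key=lambda c: dots[c], reverse=True)
```

Because the softmax is a monotone transformation of the dot products with a shared denominator, both rankings coincide, so the normalization can be skipped at recommendation time.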
3.2 Combining Course Recommendation with Grade Prediction
The second approach that we developed for solving the grade-aware course recommendation problem relies on using the grades that students are expected to obtain in future courses to improve the ranking of the courses produced by course recommendation methods. Our underlying hypothesis is that a course that is both ranked high by a course recommendation method and has a high predicted grade should be ranked higher than one that either has a lower ranking by the recommendation method or has a lower predicted grade. This in turn will help improve the final course rankings for students by taking both scores into account simultaneously.
For each candidate course, let its predicted grade be generated from some grade prediction model, and let its ranking score be generated from some course recommendation method. We combine both scores to compute the final ranking score for the course as follows:
(4) 
where the hyperparameter controls the relative contribution of the predicted grade and the ranking score to the overall ranking score, and sign(·) denotes the sign of its argument, i.e., 1 if the argument is positive and -1 otherwise. Note that both the predicted grades and the ranking scores are standardized to have zero mean and unit variance.
In this work, we will use the representation learning approaches described in Section 3.1 as the course recommendation method. We will also use the grade-unaware variations of each of them (see Section 4.2) to compare combining the grade prediction methods with both recommendation approaches.
To obtain the grade prediction score, we will use Cumulative Knowledge-based Regression Models [Morsy and Karypis (2017)], or CKRM for short. CKRM is a set of grade-prediction methods that learn low-dimensional as well as textual-based representations for courses that denote the required and provided knowledge components for each course. It represents a student’s knowledge state as the sum of the provided knowledge component vectors of the courses taken by them, weighted by their grades in them. CKRM then predicts the student’s grade in a future course as the dot product between their knowledge state vector and the course’s required knowledge component vector. We will denote the recommendation methods that combine CKRM with SVD and Course2vec as CKRM+SVD and CKRM+Course2vec, respectively.
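The prediction step described for CKRM can be sketched as follows. This is a simplified illustration of the two operations named above (a grade-weighted sum and a dot product) and not the full CKRM model, which learns the vectors from data; the vectors, course codes and function names are hypothetical.

```python
def knowledge_state(provided, taken):
    """Sum of provided knowledge component vectors, weighted by grades.

    provided: dict mapping a course to its provided knowledge vector.
    taken: list of (course, numeric_grade) pairs for the student.
    """
    dim = len(next(iter(provided.values())))
    state = [0.0] * dim
    for course, grade in taken:
        for k in range(dim):
            state[k] += grade * provided[course][k]
    return state


def predict_grade(state, required_vector):
    """Predicted grade: dot product of knowledge state and required vector."""
    return sum(a * b for a, b in zip(state, required_vector))


# Made-up knowledge component vectors (two components).
provided = {"CS1": [1.0, 0.0], "MATH1": [0.0, 0.5]}
required = {"CS2": [0.5, 1.0]}

state = knowledge_state(provided, [("CS1", 4.0), ("MATH1", 3.0)])
predicted = predict_grade(state, required["CS2"])
```

In this toy case the knowledge state is [4.0, 1.5] and the predicted grade for CS2 is 3.5.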
4 Experimental Evaluation
4.1 Dataset Description and Preprocessing
The data used in our experiments was obtained from the University of Minnesota, where it spans a period of 16 years (Fall 2002 to Summer 2017). From that dataset, we extracted the degree programs that have at least 500 graduated students until Fall 2012, which accounted for 23 different majors from different colleges. For each of these degree programs, we extracted all the students who graduated from this program and extracted the 50 most frequent courses taken by the students as well as the courses that belonged to frequent subjects, e.g., CSCI is a subject that belongs to the Computer Science department at the University. A subject is considered to be frequent if the average number of courses that belong to that subject over all students is at least three. This filtering was made to remove the courses we believe are not relevant to the degree program of students. We also removed any courses that were taken as pass/fail.
We split the above dataset into train, validation and test sets as follows. All courses taken before Spring 2013 were used for training, courses taken between Spring 2013 and Summer 2014 inclusive were used for validation, and courses taken afterwards (Fall 2014 to Summer 2017 inclusive) were used for test purposes.
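The temporal split can be expressed as a small helper. The sketch assumes three terms per calendar year ordered Spring, Summer, Fall; the constant and function names are hypothetical.

```python
# Assumed ordering of terms within a calendar year.
SEASON_ORDER = {"Spring": 0, "Summer": 1, "Fall": 2}


def term_key(year, season):
    """Sortable key so that terms compare chronologically."""
    return (year, SEASON_ORDER[season])


def assign_split(year, season):
    """Map a term to 'train', 'validation' or 'test' using the stated cut-offs."""
    key = term_key(year, season)
    if key < term_key(2013, "Spring"):
        return "train"
    if key <= term_key(2014, "Summer"):
        return "validation"
    return "test"


splits = [assign_split(y, s) for y, s in [(2012, "Fall"), (2013, "Spring"), (2014, "Fall")]]
```

Splitting by term rather than by student keeps every evaluation course strictly later than the data used for training.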
At the University of Minnesota, the letter grading system has 11 letter grades (A, A-, B+, B, B-, C+, C, C-, D+, D, F) that correspond to the numerical grades (4, 3.667, 3.333, 3, 2.667, 2.333, 2, 1.667, 1.333, 1, 0). For each (context, subsequent) pair in the training, validation, and test set, where the context represents the previously-taken set of courses by a student, the context contained only the courses taken by the student with grades higher than the D+ letter grade. The statistics of the 23 degree programs are shown in Table 1.
Major  # Students  # Courses  # Grades 

Accounting (ACCT)  661  55  7,614 
Aerospace Engr. (AEM)  866  72  13,280 
Biology (BIOL)  1,927  113  15,590 
Biology, Soc. & Envir. (BSE)  1,231  56  9,389 
Biomedical Engr. (BME)  1,002  64  13,808 
Chemical Engr. (CHEN)  1,045  82  10,219 
Chemistry (CHEM)  765  78  7,814 
Civil Engr. (CIVE)  1,160  74  15,992 
Communication Studies (COMM)  2,547  90  17,135 
Computer Science & Engr. (CSE)  1,790  98  13,520 
Electrical Engr. (ECE)  1,197  84  12,781 
Elementary Education (ELEM)  1,283  60  15,303 
English (ENGL)  1,790  113  12,451 
Finance (FIN)  1,326  55  12,150 
Genetics, Cell Biol. & Devel. (GCD)  843  92  9,726 
Journalism (JOUR)  2,043  91  23,549 
Kinesiology (KIN)  1,499  161  23,451 
Marketing (MKTG)  2,077  51  13,084 
Mechanical Engr. (MECH)  1,501  79  25,608 
Nursing (NURS)  1,501  88  18,239 
Nutrition (NUTR)  940  71  12,400 
Political Science (POL)  1,855  111  13,904 
Psychology (PSY)  3,047  100  25,299 
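The letter-grade mapping and the context filtering described before Table 1 can be sketched as follows. The course identifiers and the function name `build_context` are hypothetical; only the grade scale and the "higher than D+" rule come from the text.

```python
from typing import List, Tuple

# Letter grades and their numerical equivalents, as listed in the text.
GRADE_POINTS = {
    "A": 4.0, "A-": 3.667, "B+": 3.333, "B": 3.0, "B-": 2.667,
    "C+": 2.333, "C": 2.0, "C-": 1.667, "D+": 1.333, "D": 1.0, "F": 0.0,
}


def build_context(prior_courses: List[Tuple[str, str]]) -> List[str]:
    """Keep only previously-taken courses with a grade strictly higher than D+."""
    threshold = GRADE_POINTS["D+"]
    return [course for course, letter in prior_courses
            if GRADE_POINTS[letter] > threshold]


# Hypothetical student history: (course id, letter grade).
history = [("c1", "A-"), ("c2", "D+"), ("c3", "C-"), ("c4", "F")]
context = build_context(history)
```

Here only c1 and c3 remain in the context, since D+ and F do not exceed the D+ threshold.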
4.2 Baseline and Competing Methods
We compare the performance of the proposed representation learning approaches against competing approaches for grade-aware course recommendation, which are defined as follows:

Grppop(+-): We modify the group popularity ranking method developed in [Elbadrawy and Karypis (2016)] and explained in Section 2 to solve grade-aware course recommendation. For each course, we consider two counts over the students that have the same major and academic level as the target student: the number of students for whom the course was a good subsequent course, and the number for whom it was a bad one. We can differentiate between good and bad subsequent courses using the following ranking score (which is similar to the (+-)-based approaches):
(5) 
Grppop(+): Here, the group popularity ranking method considers only the good subsequent courses, similar to SVD(+) and Course2vec(+). Specifically, the ranking score of a course is the number of students in the target student's group for whom the course was a good subsequent course, as defined for Eq. 5.

Course dependency graph: This is the course recommendation method utilized in [Backenköhler et al. (2018)] (see Section 2.1).
We also compare the performance of the representation learning approaches for both grade-aware and grade-unaware course recommendation. The grade-unaware representation learning approaches are defined as follows:

SVD(++): Here, SVD is applied on the previous-subsequent co-occurrence frequency matrix, which counts all subsequent courses without differentiating between good and bad ones.

Course2vec(++): This model maximizes the log-likelihood of observing all courses taken by a student in some term given the set of previously-taken courses, regardless of whether each subsequent course is a good or a bad one. The objective ranges over the set of courses taken by each student in each term, and the rest of the terms are as defined in Eq. 2.
Note that we append a (++) suffix to the name of the grade-unaware variation of each method, since it utilizes all the course information without differentiating between good and bad courses.
4.3 Evaluation Methodology and Metrics
Previous course recommendation methods used the recall metric to evaluate the performance of their methods. The goal of the proposed grade-aware course recommendation methods is to recommend to the student courses which he/she is expected to perform well in and not recommend courses which he/she is expected not to perform well in. As a result, we cannot use the recall metric as is, and instead, we use three variations of it that differentiate between good and bad courses. The first, Recall(good), measures the fraction of the actual good courses that are retrieved. The second, Recall(bad), measures the fraction of the actual bad courses that are retrieved. The third, Recall(diff), measures the overall performance of the recommendation method in ranking the good courses higher than the bad ones.
The first two metrics are computed as the average of the corresponding student-term-specific recalls. In particular, for a student and a target term, the two recall metrics for that (student, term) tuple are computed from the following quantities: the sets of good and bad courses, respectively, that were taken by the student in that term and exist in his/her list of recommended courses; the actual number of courses taken by the student in that term; and the actual numbers of good and bad courses taken by the student in that term. Since our goal is to recommend good courses only, we consider a method to perform well when it achieves a high Recall(good) and a low Recall(bad).
Recall(diff) is computed as the difference between Recall(good) and Recall(bad), i.e.,

Recall(diff) = Recall(good) - Recall(bad).
Recall(diff) is thus a signed measure that assesses both the degree and direction to which a recommendation method is able to rank the actual good courses higher than the bad ones in its recommended list of courses for each student, so the higher the Recall(diff) value, the better the recommendation method is.
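The three recall variants can be illustrated with a minimal sketch. It assumes that, for each (student, term) tuple, Recall(good) is the fraction of the actual good courses that appear in the recommended list and Recall(bad) is the analogous fraction for the bad courses, following the verbal definitions above; tuples with no good (or no bad) courses are skipped when averaging, which is an assumption of this sketch rather than a detail confirmed by the text. All function names are hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple


def tuple_recalls(recommended: Iterable[str],
                  good_taken: Set[str],
                  bad_taken: Set[str]) -> Tuple[Optional[float], Optional[float]]:
    """Recall(good) and Recall(bad) for one (student, term) tuple.

    Returns None for a metric whose denominator would be zero.
    """
    rec = set(recommended)
    r_good = len(rec & good_taken) / len(good_taken) if good_taken else None
    r_bad = len(rec & bad_taken) / len(bad_taken) if bad_taken else None
    return r_good, r_bad


def aggregate(per_tuple: List[Tuple[Optional[float], Optional[float]]]
              ) -> Tuple[float, float, float]:
    """Average the per-tuple recalls and return (good, bad, diff)."""
    goods = [g for g, _ in per_tuple if g is not None]
    bads = [b for _, b in per_tuple if b is not None]
    recall_good = sum(goods) / len(goods) if goods else 0.0
    recall_bad = sum(bads) / len(bads) if bads else 0.0
    return recall_good, recall_bad, recall_good - recall_bad


# Hypothetical example with two (student, term) tuples.
t1 = tuple_recalls(["c1", "c2", "c3"], good_taken={"c1", "c4"}, bad_taken={"c3"})
t2 = tuple_recalls(["c5", "c6"], good_taken={"c5", "c6"}, bad_taken=set())
recall_good, recall_bad, recall_diff = aggregate([t1, t2])
```

In this toy case the method retrieves the only bad course of the first tuple, so Recall(diff) comes out negative even though Recall(good) is fairly high.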
To further analyze the differences in the ranking results of the proposed approaches, we also computed the following two metrics:

Percentage GPA increase/decrease: For each student, consider the current GPA that he/she achieved on the good courses and on the bad courses recommended by some recommendation method, respectively, as well as his/her GPA prior to that term. The percentage GPA increase and decrease are computed by comparing these current GPAs with the prior GPA.

Coverage for good/bad terms: The number of terms where some recommendation method recommends good (or bad) subsequent courses to a student will be referred to as its coverage for good (or bad) terms. The higher the coverage for good terms by some method, the more students will get good recommendations that maintain or improve their overall GPA. On the other hand, the lower the coverage for bad terms, the fewer students will get bad recommendations that decrease their overall GPA.
We compute the above two metrics for the terms on which the recommendation method recommends at least one of the actual courses taken in that term. For each method, the percentage GPA increase and decrease as well as the coverage for good and bad terms are computed as the average of the individual scores. Since we would like to recommend courses that optimize the student’s GPA, the higher the GPA percentage increase and the coverage for good terms and the lower the GPA percentage decrease and the coverage for bad terms that a method achieves, the better the method is.
Note that a recommendation is only made for students who have taken at least three previous courses. For each (student, term) tuple, the list of courses recommended by any method is selected only from the courses that are offered in that term, excluding the courses that the student has already taken, unless the associated grade was either (i) so low that the course's credits do not add to the earned credits, or (ii) bad relative to the average grade that the student achieved in previous terms. In other words, we only allow recommending a repeated course in these two cases. This filtering technique significantly improved the performance of all the baseline and proposed methods.
4.4 Model Selection
We did an extensive search in the parameter space for model selection. The only parameter in the SVD-based models is the number of latent dimensions. The parameters in the Course2vec-based models are the number of latent dimensions and the minimum number of subsequent courses, in the denominator of Eq. 1, that are used during the SGD process of learning the model. We experimented with a range of values for the number of latent dimensions and for the minimum number of subsequent courses, and varied the parameter in Eq. 4 with a step of 0.2.
The training set was used for learning the distributed representations of the courses, whereas the validation set was used to select the best performing parameters in terms of the highest Recall(diff).
5 Results
We evaluate the effectiveness of the proposed grade-aware course recommendation methods in order to answer the following questions:

RQ1. How do the SVD- and Course2vec-based approaches for course recommendation compare to each other?

RQ2. How do the combinations of grade prediction with representation learning approaches compare to each other?

RQ3. How do the two proposed approaches for solving grade-aware course recommendation compare to each other?

RQ4. How do the proposed approaches compare to competing approaches for grade-aware course recommendation?

RQ5. What are the benefits of grade-aware course recommendation over grade-unaware course recommendation?

RQ6. How does the recommendation accuracy vary across different majors and student subgroups?

RQ7. What are the characteristics of the recommended courses, in terms of course difficulty and popularity?
5.1 Comparison of the Representation Learning Approaches for Grade-aware Course Recommendation
Table 2 shows the prediction performance of the two proposed representation learning approaches for grade-aware course recommendation. The results show that SVD(+) achieves the best Recall(good), while SVD(+-) achieves the best Recall(diff). Course2vec(+-) achieves the best Recall(bad), which is comparable to that of SVD(+-).
By comparing the corresponding SVD and Course2vec approaches, we see that SVD outperforms Course2vec in almost all cases. We believe this is caused by the limited amount of positive training data for Course2vec, since only the good courses are used as positive examples for learning the models. This is supported by the comparable prediction performance of the (++)-based approaches, which use all the available training data as positive examples and are shown in Table 5.
By comparing the (+) and (+-)-based methods, we see that the (+-)-based model achieves a worse Recall(good), but a much better Recall(bad). For instance, SVD(+-) achieves a lower Recall(good) (0.396 vs. 0.468) and a much lower Recall(bad) (0.206 vs. 0.372) than SVD(+). This is expected, since utilizing the bad course information gives the models more power to learn to rank these courses low, but it also adds some noise, since different students with the same or similar previous set of courses can achieve different outcomes on the same courses.
Metric  SVD(+)  SVD(+-)  Course2vec(+)  Course2vec(+-)
Recall(good)  0.468  0.396  0.448  0.351 
Recall(bad)  0.372  0.206  0.404  0.202 
Recall(diff)  0.096  0.190  0.044  0.149 
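As a quick arithmetic check of the values in Table 2, the following sketch recomputes Recall(diff) from Recall(good) and Recall(bad) and derives the relative change between the two SVD variants. It takes the second column of each method to be the variant that also uses the bad courses, written here as (+-). The relative changes are computed from the rounded table entries, so they may differ slightly from figures computed on unrounded results.

```python
# (Recall(good), Recall(bad)) pairs copied from Table 2.
TABLE_2 = {
    "SVD(+)": (0.468, 0.372),
    "SVD(+-)": (0.396, 0.206),
    "Course2vec(+)": (0.448, 0.404),
    "Course2vec(+-)": (0.351, 0.202),
}


def recall_diff(good: float, bad: float) -> float:
    """Recall(diff) = Recall(good) - Recall(bad), rounded to three decimals."""
    return round(good - bad, 3)


def relative_decrease(before: float, after: float) -> float:
    """Fractional decrease when going from `before` to `after`."""
    return (before - after) / before


diffs = {name: recall_diff(g, b) for name, (g, b) in TABLE_2.items()}

# Relative change of SVD(+-) with respect to SVD(+).
good_drop = relative_decrease(TABLE_2["SVD(+)"][0], TABLE_2["SVD(+-)"][0])
bad_drop = relative_decrease(TABLE_2["SVD(+)"][1], TABLE_2["SVD(+-)"][1])
```

With these rounded entries, the recomputed differences match the Recall(diff) row of Table 2, and SVD(+-) gives up roughly 15% of Recall(good) in exchange for a reduction of roughly 45% in Recall(bad).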
5.2 Comparison of the Grade-aware Recommendation Approaches Combining Grade Prediction with Course Recommendation
Table 3 shows the prediction performance of the grade-aware recommendation approaches that combine CKRM with the grade-aware and grade-unaware representation learning methods. The results show that CKRM+SVD(++) achieves the best Recall(good), while CKRM+Course2vec(+-) achieves the best Recall(bad). Overall, CKRM+SVD(+-) achieves the best Recall(diff). Combining CKRM with the grade-unaware, i.e., (++)-based, approaches helped in differentiating between good and bad courses, achieving a high Recall(diff) of 0.158 and 0.142 for SVD and Course2vec, respectively. However, despite these performance improvements, the combinations that use the grade-aware recommendation methods do better. For instance, CKRM+SVD(+) outperforms CKRM+SVD(++) in terms of Recall(diff) (0.187 vs. 0.158).
The results also show that the SVD-based (+) and (+-)-based approaches outperform their Course2vec counterparts in terms of Recall(diff), similar to the results of SVD and Course2vec alone (Section 5.1). Unlike the difference in the performance of SVD(+) vs. SVD(+-), CKRM+SVD(+) achieves a similar Recall(diff) to that achieved by CKRM+SVD(+-) (and the same holds for the Course2vec-based approaches). The difference is that CKRM+SVD(+) achieves higher Recall(good) and Recall(bad) than CKRM+SVD(+-).
5.3 Comparison of the Proposed Approaches for Grade-aware Course Recommendation
Comparing each of the SVD- and Course2vec-based approaches with and without CKRM (shown in Tables 2 and 3), we see that combining CKRM with the (+)-based approaches significantly improved their performance, raising Recall(diff) from 0.096 to 0.187 for SVD and from 0.044 to 0.152 for Course2vec. On the other hand, combining CKRM with the (+-)-based approaches achieves comparable performance to using the corresponding (+-)-based approach alone.
By further analyzing these ranking results, Figure 3 shows the percentage GPA increase and decrease as well as the coverage for good and bad terms for each SVD-based method with and without CKRM (the results of the Course2vec-based methods are similar, and are thus omitted). CKRM+SVD(+) outperforms SVD(+) in all but one metric, which is coverage for good terms, where it achieves slightly worse performance than SVD(+). On the other hand, CKRM+SVD(+-) has comparable performance to SVD(+-), which is analogous to their recall metrics results.
Metric  CKRM+SVD(++)  CKRM+SVD(+)  CKRM+SVD(+-)  CKRM+Course2vec(++)  CKRM+Course2vec(+)  CKRM+Course2vec(+-)
Recall(good)  0.438  0.417  0.385  0.411  0.417  0.338 
Recall(bad)  0.279  0.230  0.189  0.269  0.264  0.183 
Recall(diff)  0.158  0.187  0.197  0.142  0.152  0.155 
5.4 Representation Learning vs Competing Approaches for Grade-aware Course Recommendation
Table 4 shows the prediction performance of the representation learning and competing approaches for grade-aware course recommendation. Grppop(+-) achieves the best Recall(diff) among the three competing (baseline) approaches. The results also show that SVD(+) achieves the best Recall(good), while grppop(+-) achieves the best Recall(bad). Overall, SVD(+-) achieves the best Recall(diff).
5.5 Grade-aware vs Grade-unaware Representation Learning Approaches
Table 5 shows the prediction performance of the representation learning approaches for grade-aware, i.e., (+) and (+-)-based approaches, and grade-unaware, i.e., (++)-based approach, course recommendation. Each of SVD(+) and Course2vec(+) achieves a Recall(good) that is comparable to or better than that achieved by its corresponding (++)-based approach. In addition, both the (+) and (+-)-based methods achieve much better (lower) Recall(bad). For instance, SVD(+) and SVD(+-) achieve a Recall(bad) of 0.372 and 0.206, respectively, both of which improve over the 0.502 achieved by SVD(++).
By comparing the (++), (+), and (+-)-based approaches in terms of Recall(diff), we can see that the (++)-based approaches achieve negative Recall(diff) values, which indicates that they recommend more bad courses than good ones. The (+)-based approaches do slightly better, while the (+-)-based approaches achieve the highest Recall(diff). This is expected, since the (++)-based methods treat both types of subsequent courses equally during their learning, and so they recommend both types in an equal manner. This shows that differentiating between good and bad courses in any course recommendation method is very helpful for ranking the good courses higher than the bad ones, which will help the student maintain or improve their overall GPA.
In terms of percentage GPA increase and decrease (shown in Figure 3), SVD(+) outperforms SVD(++) by in percentage GPA increase and in percentage GPA decrease. Moreover, SVD(+) achieves less coverage for the bad terms than SVD(++), while it achieves less coverage for the good terms.
Metric  Dependency Graph  Grppop(+)  Grppop(+-)  SVD(+)  SVD(+-)  Course2vec(+)  Course2vec(+-)
Recall(good)  0.382  0.425  0.367  0.468  0.396  0.448  0.351 
Recall(bad)  0.260  0.343  0.188  0.372  0.206  0.404  0.202 
Recall(diff)  0.122  0.082  0.179  0.096  0.190  0.044  0.149 
Metric  SVD(++)  Course2vec(++)  SVD(+)  Course2vec(+)  SVD(+-)  Course2vec(+-)
Recall(good)  0.453  0.455  0.468  0.448  0.396  0.351
Recall(bad)  0.502  0.493  0.372  0.404  0.206  0.202
Recall(diff)  -0.048  -0.038  0.096  0.044  0.190  0.149
5.6 Analysis of Recommendation Accuracy
Our discussion so far focused on analyzing the performance of the different methods by looking at metrics that are aggregated across the different majors. However, given that the structure of the degree programs of different majors is sometimes quite different, and that different student groups can exhibit different characteristics, an important question that arises is how the different methods perform across the individual degree programs and different student groups, and whether there are methods that consistently perform well across majors as well as across student groups. In this section, we analyze the recommendations made by one of our best performing models, CKRM+SVD(+), against the best performing baseline, i.e., grppop(+-), in terms of Recall(diff), across these degree programs and student groups (RQ6).
5.6.1 Analysis on Different Majors
Figure 4 shows the recommendation accuracy, in terms of Recall(diff), across the 23 majors, by both grppop(+-) and CKRM+SVD(+) (Figure 4a). First, we can see that there is a huge variation in the recall values across the majors, ranging from 0.05 to 0.5. Second, we see that CKRM+SVD(+) consistently outperforms grppop(+-), except for the nursing major. To further look into why this happens, we investigated some of the characteristics of the students' degree sequences. For each major, we computed the pairwise percentage of common courses among students who belong to that major, which is shown in Figure 4b. In addition, we computed the similarity in the sequencing, i.e., ordering, of the common courses between each pair of students, which is shown in Figure 4c. For computing the pairwise degree similarity, we utilized the formula proposed in [Morsy and Karypis (2019)], which computes the degree similarity between a pair of degree plans as:
(6) 
where is the set of courses taken in degree , and is the time, i.e., term number, that course was taken in , e.g., the first term is numbered 1, the second is numbered 2 and so forth. Function is defined as:
(7) 
where is an exponential decay constant. Function assigns a value of for pairs of courses taken concurrently, i.e., during the same term, in both plans, and assigns a value of for pairs of courses that are either: (i) taken in reversed order in both plans, or (ii) taken concurrently in one plan and sequentially in the other. For pairs of courses taken in the same order, it assigns a positive value that decays exponentially with .
We found that there is a positive correlation between the Recall(diff) values and both the average pairwise percentage of common courses and the average pairwise degree similarity among students of these majors (correlation values of 0.47 and 0.5 for grppop(+-), and 0.47 and 0.38 for CKRM+SVD(+), respectively). This implies that, as the percentage of common courses and degree similarity between pairs of students decrease, accurate course recommendation becomes more difficult, since there is more variability in the set of courses taken as well as their sequencing. The nursing major, where grppop(+-) significantly outperforms CKRM+SVD(+), has the highest average pairwise percentage of common courses, 76%, as well as the highest average pairwise degree similarity, 0.86, compared to all other majors. This implies that the nursing major is the most restricted major and that students tend to follow highly similar degree plans and take very similar courses at each academic level. The group popularity ranking in this case can easily outperform other recommendation methods.
Student Pair  Degree Similarity

A-B  0.597
A-C  0.535
B-C  0.534
5.6.2 Analysis on Different Student Groups
Figure 5 shows the recommendation accuracy, in terms of Recall(diff), for grppop(+-) and CKRM+SVD(+) across different student subgroups.
Figure 5a shows the recommendation accuracy among different GPA-based student types, A vs. B vs. C. First, CKRM+SVD(+) outperforms grppop(+-) for all student groups. Second, CKRM+SVD(+) achieves the highest Recall(diff) for the type-B students, followed by type-A, and then by type-C. This could be due to the following reasons. After analyzing the training data, we found that the type-A and type-B students constitute 96% of the student population. After analyzing the average pairwise percentage of common courses and degree similarity within each GPA-based group of students, as well as among pairs of different GPA-based groups, we found that type-C students follow more diverse sequencing for their degree plans than type-A or type-B students, as illustrated in Table 6, while there was no difference among the different groups in the average pairwise percentage of common courses. As discussed in Section 5.6.1, there is a positive correlation between the pairwise degree similarity and the recommendation accuracy. Since there is not enough training data for the type-C students to learn their sequencing of the courses, this can explain why the recommendation accuracy for them was the lowest.
Figure 5b shows the recommendation accuracy among different student subgroups based on their academic level. At the University of Minnesota, there are four academic levels, based on the number of both earned and transferred credits by the beginning of the semester: (1) freshman (fewer than 30 credits), (2) sophomore, (3) junior, and (4) senior. First, we can notice that CKRM+SVD(+) significantly outperforms grppop(+-) across all student groups. Second, we see that, as the student's academic level increases, and hence he/she has spent more years at the university and has taken more courses, both methods tend to achieve more accurate recommendations. This can be due to the following reasons. First, since we filter out the courses that have been previously taken by the student before making recommendations (see Section 4.3), as the student's academic level increases, there is a smaller number of candidate courses from which the recommendations are to be made. Second, for CKRM+SVD(+), as the student takes more courses, his/her implicit profile, which is computed by aggregating the embeddings of the previously-taken courses, becomes more accurate.
6 Characteristics of Recommended Courses
An important question for any recommendation method is what the characteristics of its recommendations are. In this section, we study two important characteristics of the recommended courses: (i) the difficulty of courses (Section 6.1), and (ii) their popularity (Section 6.2) (RQ7).
6.1 Course Difficulty
As our proposed grade-aware recommendation methods are trained to recommend courses that help students maintain or improve their GPA, these methods can be prone to recommending easier courses in which students usually achieve high grades. Here, we investigate whether this happens in our recommendations or not. Table 7 shows the grade statistics of all courses, as well as of the courses recommended by the grade-unaware and grade-aware SVD variations. The mean grade is 3.5 for all courses, while for the recommended courses, it is 3.24, 3.4, and 3.56, for SVD(++), SVD(+) and SVD(+-), respectively. These statistics show that the grade-aware SVD approaches tend to only slightly favor easier courses in their recommendations compared to the grade-unaware SVD approach.
Course Set  Mean  Median  Std. Dev. 

All  3.50  3.61  0.51 
SVD(++)  3.24  3.24  0.27 
SVD(+)  3.40  3.40  0.24 
SVD(+-)  3.56  3.55  0.20
6.2 Course Popularity
Since the university administrators need to make sure that students are enrolled in courses with different popularity, as there is a capacity for each course and classroom, course popularity is an important factor for course recommendations.
We also analyze the results of our models in terms of the popularity of the courses they recommend. Figure 6 shows the frequency of the actual good courses in the test set, as well as the frequency of the good courses recommended by both grppop(+-) and CKRM+SVD(+). Recall that, for each student and term, we recommend as many courses as the total number of (good and bad) courses taken by the student in that term (see Section 4.3), so the number of recommendations can be higher than the number of actual good courses.
The figure shows that both grppop(+-) and CKRM+SVD(+) recommend courses with different popularity, similar to the actual good courses taken by students. Since we use a filtering technique before making recommendations (see Section 4.3), grppop(+-) can also recommend courses with little popularity. Comparing CKRM+SVD(+) to grppop(+-), we can notice that grppop(+-) tends to recommend a higher number of the more popular courses, while CKRM+SVD(+) recommends more of the less popular ones, which can be considered a major benefit for the latter method.
7 Discussions and Conclusions
In this paper, we proposed grade-aware course recommendation approaches for solving the course recommendation problem. The proposed approaches aim to recommend to students good courses on which the student's expected grades will maintain or improve their overall GPA. We proposed two different approaches for solving the grade-aware course recommendation problem. The first approach ranks the courses by using an objective function that differentiates between sequences of courses that are expected to increase or decrease a student's GPA. The second approach uses the grades predicted by grade prediction methods to improve the rankings produced by course recommendation methods. To obtain course rankings in the first approach, we adapted two widely-known representation learning techniques: one that uses the linear Singular Value Decomposition model, and another that uses log-linear neural network based models.
We conducted an extensive set of experiments on a large dataset obtained from 23 different majors at the University of Minnesota. The results showed that: (i) the proposed grade-aware course recommendation approaches outperform grade-unaware recommendation methods in recommending more courses that increase the students' GPA and fewer courses that decrease it; (ii) the proposed representation learning based approaches outperform competing approaches for grade-aware course recommendation; and (iii) the approaches that utilize both the good and bad courses and differentiate between them achieve comparable performance to combining grade prediction with the approaches that either utilize the good courses only, or those that differentiate between good and bad courses.
We also provided an in-depth analysis of the recommendation accuracy across different majors and student groups. We found that our proposed approaches consistently outperformed the best baseline method across these majors and groups. We also analyzed the characteristics of the recommendations in terms of course difficulty and popularity. We found that our proposed grade-aware course recommendation approaches are not prone to recommending easy courses, and that they recommend courses with high and low popularity in a similar manner. This shows the effectiveness of our proposed grade-aware approaches for course recommendation.
Time-to-degree, which is the number of years or terms that the student enrolls in to finish his/her degree, is another important factor for academic success. An interesting research direction would be to investigate the effect of our recommendations on the time-to-degree, and accordingly, develop recommendation approaches that consider both the student's GPA and time-to-degree.
Acknowledgement
We would like to thank the anonymous reviewers for their valuable feedback on the original manuscript. This work was supported in part by NSF (1447788, 1704074, 1757916, 1834251), Army Research Office (W911NF1810344), Intel Corp, and the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute, http://www.msi.umn.edu.
References
 Backenköhler et al. (2018) Backenköhler, M., Scherzinger, F., Singla, A., and Wolf, V. 2018. Data-driven approach towards a personalized curriculum. In Proceedings of the 11th International Conference on Educational Data Mining. 246–251.
 Bell et al. (2007) Bell, R., Koren, Y., and Volinsky, C. 2007. Modeling relationships at multiple scales to improve accuracy of large recommender systems. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’07. ACM, New York, NY, USA, 95–104.
 Bendakir and Aïmeur (2006) Bendakir, N. and Aïmeur, E. 2006. Using association rules for course recommendation. In Proceedings of the AAAI Workshop on Educational Data Mining. Vol. 3.
 Bhumichitr et al. (2017) Bhumichitr, K., Channarukul, S., Saejiem, N., Jiamthapthaksin, R., and Nongpong, K. 2017. Recommender systems for university elective course recommendation. In Computer Science and Software Engineering (JCSSE), 2017 14th International Joint Conference on. IEEE, 1–5.
 Braxton et al. (2011) Braxton, J. M., Hirschy, A. S., and McClendon, S. A. 2011. Understanding and Reducing College Student Departure: ASHE-ERIC Higher Education Report, Volume 30, Number 3. Vol. 16. John Wiley & Sons.
 Chen et al. (2012) Chen, S., Moore, J. L., Turnbull, D., and Joachims, T. 2012. Playlist prediction via metric embedding. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 714–722.

 Cucuringu et al. (2017) Cucuringu, M., Marshak, C. Z., Montag, D., and Rombach, P. 2017. Rank aggregation for course sequence discovery. In International Workshop on Complex Networks and their Applications. Springer, 139–150.
 Elbadrawy and Karypis (2016) Elbadrawy, A. and Karypis, G. 2016. Domain-aware grade prediction and top-n course recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 183–190.
 Elbadrawy et al. (2015) Elbadrawy, A., Studham, R. S., and Karypis, G. 2015. Collaborative multi-regression models for predicting students' performance in course activities. In Proceedings of the 5th International Learning Analytics and Knowledge Conference.
 Golub and Reinsch (1970) Golub, G. H. and Reinsch, C. 1970. Singular value decomposition and least squares solutions. Numerische mathematik 14, 5, 403–420.
 González-Brenes and Mostow (2012) González-Brenes, J. P. and Mostow, J. 2012. Dynamic cognitive tracing: Towards unified discovery of student and cognitive models. EDM.
 Grbovic et al. (2015) Grbovic, M., Radosavljevic, V., Djuric, N., Bhamidipati, N., Savla, J., Bhagwan, V., and Sharp, D. 2015. E-commerce in your inbox: Product recommendations at scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1809–1818.
 Grover and Leskovec (2016) Grover, A. and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 855–864.
 Hagemann et al. (2018) Hagemann, N., O'Mahony, M. P., and Smyth, B. 2018. Module advisor: Guiding students with recommendations. In International Conference on Intelligent Tutoring Systems. Springer, 319–325.

 Hershkovitz et al. (2013) Hershkovitz, A., Gowda, S. M., and Corbett, A. T. 2013. Predicting future learning better using quantitative analysis of moment-by-moment learning. In EDM.
 Hu and Rangwala (2018) Hu, Q. and Rangwala, H. 2018. Course-specific markovian models for grade prediction. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 29–41.
 Huang et al. (2012) Huang, E. H., Socher, R., Manning, C. D., and Ng, A. Y. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 873–882.
 Hwang and Su (2015) Hwang, C.-S. and Su, Y.-C. 2015. Unified clustering locality preserving matrix factorization for student performance prediction. IAENG Int. J. Comput. Sci.
 Kena et al. (2016) Kena, G., Hussar, W., McFarland, J., de Brey, C., Musu-Gillette, L., Wang, X., Zhang, J., Rathbun, A., Wilkinson-Flicker, S., Diliberti, M., et al. 2016. The condition of education 2016. NCES 2016-144. National Center for Education Statistics.
 Koren (2008) Koren, Y. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 426–434.
 Lan et al. (2014) Lan, A. S., Waters, A. E., Studer, C., and Baraniuk, R. G. 2014. Sparse factor analysis for learning and content analytics. The Journal of Machine Learning Research.
 Le and Mikolov (2014) Le, Q. and Mikolov, T. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). 1188–1196.
 Lee and Cho (2011) Lee, Y. and Cho, J. 2011. An intelligent course recommendation system. SmartCR 1, 1, 69–84.
 Meier et al. (2015) Meier, Y., Xu, J., Atan, O., and Schaar, M. v. d. 2015. Personalized grade prediction: A data mining approach. In ICDM.
 Mikolov et al. (2013) Mikolov, T., Chen, K., Corrado, G., and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
 Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
 Morsy and Karypis (2017) Morsy, S. and Karypis, G. 2017. Cumulative knowledge-based regression models for next-term grade prediction. In Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 552–560.
 Morsy and Karypis (2019) Morsy, S. and Karypis, G. 2019. A study on curriculum planning and its relationship with graduation gpa and time to degree. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge. ACM, 26–35.
 Parameswaran et al. (2011) Parameswaran, A., Venetis, P., and Garcia-Molina, H. 2011. Recommendation systems with complex constraints: A course recommendation perspective. ACM TOIS 29, 4, 20.
 Parameswaran and Garcia-Molina (2009) Parameswaran, A. G. and Garcia-Molina, H. 2009. Recommendations with prerequisites. In Proceedings of the third ACM conference on Recommender systems. ACM, 353–356.
 Parameswaran et al. (2010) Parameswaran, A. G., Garcia-Molina, H., and Ullman, J. D. 2010. Evaluating, combining and generalizing recommendations with prerequisites. In Proceedings of the 19th ACM international conference on Information and knowledge management. ACM, 919–928.
 Parameswaran et al. (2010) Parameswaran, A. G., Koutrika, G., Bercovitz, B., and Garcia-Molina, H. 2010. Recsplorer: recommendation algorithms based on precedence mining. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 87–98.
 Pardos et al. (2019) Pardos, Z. A., Fan, Z., and Jiang, W. 2019. Connectionist recommendation in the wild: on the utility and scrutability of neural networks for personalized course guidance. User Modeling and User-Adapted Interaction, 1–39.
 Paterek (2007) Paterek, A. 2007. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD cup and workshop. Vol. 2007. 5–8.
 Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
 Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
 Polyzou and Karypis (2016) Polyzou, A. and Karypis, G. 2016. Grade prediction with course and student specific models. In PAKDD. Springer.
 Reddy et al. (2016) Reddy, S., Labutov, I., and Joachims, T. 2016. Latent skill embedding for personalized lesson sequence recommendation. arXiv preprint.

 Romero et al. (2008) Romero, C., Ventura, S., Espejo, P. G., and Hervás, C. 2008. Data mining algorithms to classify students. In EDM.
 Sarwar et al. (2000) Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. 2000. Application of dimensionality reduction in recommender system - a case study. In Proceeding of WebKDD2000 Workshop.
 Sweeney et al. (2016) Sweeney, M., Lester, J., Rangwala, H., and Johri, A. 2016. Next-term student performance prediction: A recommender systems approach. Journal of Educational Data Mining 8, 1, 22–51.
 Tang et al. (2015) Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. 2015. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1067–1077.
 Thai-Nghe et al. (2012) Thai-Nghe, N., Drumond, L., Horváth, T., and Schmidt-Thieme, L. 2012. Using factorization machines for student modeling. In UMAP Workshops.
 Thai-Nghe et al. (2011) Thai-Nghe, N., Horváth, T., and Schmidt-Thieme, L. 2011. Factorization models for forecasting student performance. In EDM.
 Wang et al. (2015) Wang, P., Guo, J., Lan, Y., Xu, J., Wan, S., and Cheng, X. 2015. Learning hierarchical representation model for next-basket recommendation. In Proceedings of the 38th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM, 403–412.
 Xu et al. (2016) Xu, J., Xing, T., and Van Der Schaar, M. 2016. Personalized course sequence recommendations. IEEE Transactions on Signal Processing 64, 20, 5340–5352.