1 Introduction
Predictive models that identify at-risk students in higher education can support the efficient allocation of resources to students. Teaching staff may direct support to struggling students early in a course, and advising staff may guide students' course planning based on model predictions. However, there is growing concern that predictive models of this kind may inadvertently introduce bias [10.1145/3303772.3303838, 10.1145/3303772.3303791, DBLP:conf/edm/HuttGDD19, loukinaetal2019many, ranger2020]. For example, an unfair model may fail to identify a successful student more often because of their membership in a particular demographic group.
In this work, we build a course success prediction model using administrative academic records from a U.S. research university and evaluate its fairness using three statistical fairness measures: demographic parity [10.1145/2783258.2783311], equality of opportunity [NIPS2016_6374], and positive predictive parity [Chouldechova2016FairPW]. Demographic parity requires an equal rate of positive predictions across subgroups. Equality of opportunity requires that the model correctly identify successful students at equal rates across subgroups. Positive predictive parity requires that the proportion of actually successful students among those who receive positive predictions be the same across subgroups. However, according to the impossibility results for these statistical fairness measures [Chouldechova2016FairPW, kleinberg2016inherent], they cannot in general be satisfied simultaneously when subgroup base rates differ. We therefore investigate how correcting the student success prediction model for one fairness measure, equality of opportunity, affects model accuracy and performance on the other two measures.
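The three measures defined above compare a different per-group rate each: the positive prediction rate (demographic parity), the true positive rate (equality of opportunity), and the precision (positive predictive parity). A minimal sketch of how these rates could be computed from labels and predictions follows; the function and variable names are illustrative, not from the paper.

```python
# Sketch: the three per-group rates compared by the fairness measures.
# All names here are illustrative.

def rates(y_true, y_pred):
    """Return (positive prediction rate, true positive rate, precision)."""
    n = len(y_true)
    pred_pos = [i for i in range(n) if y_pred[i] == 1]
    actual_pos = [i for i in range(n) if y_true[i] == 1]
    tp = [i for i in pred_pos if y_true[i] == 1]
    ppr = len(pred_pos) / n            # demographic parity compares this
    tpr = len(tp) / len(actual_pos)    # equality of opportunity compares this
    prec = len(tp) / len(pred_pos)     # positive predictive parity compares this
    return ppr, tpr, prec

# Toy example with two subgroups A and B.
y_true_a, y_pred_a = [1, 1, 0, 0], [1, 0, 0, 0]
y_true_b, y_pred_b = [1, 1, 1, 0], [1, 1, 1, 1]
ppr_a, tpr_a, prec_a = rates(y_true_a, y_pred_a)
ppr_b, tpr_b, prec_b = rates(y_true_b, y_pred_b)
# Each fairness measure then compares the corresponding pair of rates,
# e.g. the equality of opportunity gap is |tpr_a - tpr_b|.
```

A model is judged fair under a given measure when the corresponding gap between subgroups is (statistically) indistinguishable from zero.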
2 Methods
We build a model to predict whether a student will receive a median grade or above in one of six required courses for a given major at a U.S. research university, and evaluate its accuracy and fairness. We then alter the predictions in a post-processing step to improve fairness and re-evaluate the model using the same criteria.
Table 1: Categorization of features used in the analysis (columns: Category, Features; table body lost in extraction).
Table 2: Accuracy and fairness measures for the original (orig) and fairness-corrected (fair) predictions, with absolute subgroup differences (Diff.).

Measure                       Model  URM    non-URM  Diff.  Male   Female  Diff.
Accuracy                      orig   0.695  0.735    0.040  0.694  0.755   0.061
                              fair   0.736  0.729    0.007  0.704  0.740   0.036
True positive rate            orig   0.649  0.851    0.202  0.755  0.860   0.105
(equality of opportunity)     fair   0.761  0.762    0.001  0.781  0.780   0.001
Positive prediction rate      orig   0.473  0.754    0.281  0.624  0.753   0.129
(demographic parity)          fair   0.556  0.636    0.080  0.645  0.655   0.010
Precision                     orig   0.770  0.787    0.017  0.753  0.805   0.052
(positive predictive parity)  fair   0.767  0.835    0.068  0.753  0.840   0.087
2.1 Data
The data span Fall 2014 through Spring 2019. The student-level administrative data used to create features include course-taking history, demographic information, and standardized test scores, along with other academic information. We remove duplicates, records with missing course grades, courses taken multiple times by the same student, and any grades other than letter grades (A–F) or pass/fail grades. We impute missing standardized test scores and course grades with a placeholder value of 999, along with an indicator variable. Table 1 shows the categorization of features considered in our analysis. Feature values with fewer than 30 instances are merged into an "Other" category. The final processed data has 5,443 rows and 56 columns.
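The sentinel-plus-indicator imputation described above can be sketched as follows; the field names and values are illustrative, not from the dataset.

```python
# Sketch: imputing missing scores with a sentinel value (999) plus a
# parallel missingness indicator, as described in the text.
SENTINEL = 999

def impute_with_indicator(scores):
    """Replace None with the sentinel and emit a 0/1 missing flag per entry."""
    imputed = [SENTINEL if s is None else s for s in scores]
    missing = [1 if s is None else 0 for s in scores]
    return imputed, missing

# Hypothetical standardized test scores with one missing value.
sat_scores = [1380, None, 1210]
imputed, flag = impute_with_indicator(sat_scores)
# imputed -> [1380, 999, 1210]; flag -> [0, 1, 0]
```

The indicator column lets a tree-based model distinguish genuinely missing values from the sentinel itself.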
We focus our analysis on two binary protected attributes of students, defined by their race/ethnicity and gender. For ethnicity, we group American Indian, Black, Hawaiian or Pacific Islander, Hispanic, and Multicultural students as underrepresented minority (URM) students, and Asian and White students as non-URM. For gender, we consider male students and female students.
2.2 Model Building
We use the most recent semester (i.e., Spring 2019) for testing and train the model on the remaining semesters, which comprise approximately 78.4% of the original dataset. We fit a random forest model with the randomForest function in R using default settings. The training set is skewed toward the positive label, which comprises 60.6% of its instances, so we weight each instance with the inverse of its label proportion to achieve label balance. The resulting model has an out-of-bag error of 29.36%.
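The inverse-label-proportion weighting used to balance the skewed training set can be sketched as follows (a minimal Python sketch with toy labels; the paper's actual fitting is done in R).

```python
# Sketch: instance weights equal to the inverse of each label's proportion,
# so each class carries equal total weight in a skewed training set.
from collections import Counter

def inverse_proportion_weights(labels):
    """Weight each instance by 1 / (proportion of its label)."""
    counts = Counter(labels)
    n = len(labels)
    return [n / counts[y] for y in labels]

# Toy skewed label vector: 3 positives, 2 negatives.
weights = inverse_proportion_weights([1, 1, 1, 0, 0])
# Positives get weight 5/3 and negatives 5/2, so each class sums to 5.
```

With this weighting, the majority (positive) class no longer dominates the loss simply by being more frequent.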
2.3 Improving Fairness
The original model uses a threshold of 0.5 on the estimated label probabilities to determine each instance's predicted label, as illustrated in Figure 1. To improve fairness, we pick a different threshold for each subgroup such that equality of opportunity is achieved on the testing set. The resulting group-specific thresholds are 0.48 and 0.58 for the male and female groups, and 0.46 and 0.59 for the URM and non-URM groups, respectively. We then re-evaluate the resulting predictions in terms of accuracy and fairness, using a test of equal proportions to assess the statistical significance of group differences for each measure.
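The per-group threshold search can be sketched as a scan over candidate thresholds, lowering each group's threshold until its true positive rate reaches a common target. This is a simplified illustration of the post-processing step; the scores, labels, and target below are hypothetical.

```python
# Sketch: choosing a per-group classification threshold so each group's
# true positive rate (TPR) reaches a common target. Data are illustrative.

def tpr_at(threshold, scores, labels):
    """TPR when predicting positive for score >= threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in pos) / len(pos) if pos else 0.0

def threshold_for_tpr(scores, labels, target):
    """Largest observed-score threshold whose TPR meets the target."""
    for t in sorted(set(scores), reverse=True):
        if tpr_at(t, scores, labels) >= target:
            return t
    return 0.0

# Two hypothetical subgroups with different score distributions.
scores_a, labels_a = [0.9, 0.6, 0.4, 0.3], [1, 1, 0, 0]
scores_b, labels_b = [0.8, 0.7, 0.5, 0.2], [1, 1, 1, 0]
target = 2 / 3  # common TPR both groups should reach
t_a = threshold_for_tpr(scores_a, labels_a, target)
t_b = threshold_for_tpr(scores_b, labels_b, target)
```

With finite data the achieved TPRs only approximately match, which is why the paper's group-specific thresholds (0.46 vs. 0.59, 0.48 vs. 0.58) differ substantially between groups.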
3 Results
We find that the overall accuracy of the resulting model on the test data is 0.73, with an F-score of 0.80. The positive label comprises 66.6% of the test data. Table 2 shows the accuracy and fairness of the model using ethnicity and gender as protected attributes. We observe that the resulting model is unfair to male and URM students in terms of demographic parity and equality of opportunity, while fair in terms of positive predictive parity.
After correcting for equality of opportunity by adjusting the classification threshold for each group, we find that subgroup accuracy remains similar for both groups; for URM students, the correction even slightly increases accuracy from 0.695 to 0.736. In terms of fairness, the correction eliminates the differences in equality of opportunity for both protected attributes. It also yields predictions that are less biased in terms of demographic parity, but more biased in terms of positive predictive parity.
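The statistical significance of the group differences reported above is judged with a test of equal proportions (Section 2.3). A minimal sketch of the underlying two-proportion z statistic (pooled, without continuity correction) follows; the counts are illustrative, not the paper's.

```python
# Sketch: two-sample test of equal proportions via the pooled z statistic
# (no continuity correction). Counts below are illustrative.
import math

def prop_z(x1, n1, x2, n2):
    """z statistic for H0: p1 == p2, with pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical counts: 130/200 vs. 170/200 positive outcomes per group.
z = prop_z(130, 200, 170, 200)
# |z| > 1.96 corresponds to p < 0.05 for a two-sided test.
```

R's prop.test reports the equivalent chi-square statistic (z squared, one degree of freedom), optionally with a continuity correction.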
4 Discussion
Random forest models are commonly used in educational data mining and learning analytics. Here we find that an out-of-the-box random forest model violates both equality of opportunity and demographic parity for male and URM students. We note that another notion of fairness, positive predictive parity, is already satisfied by the original model without any fairness-related interventions. This is consistent with the findings of [pmlr-v97-liu19f], which posits predictive parity as "the implicit fairness criterion of unconstrained learning". Given the impossibility results, improving the original model on any other fairness metric (i.e., equality of opportunity or demographic parity) therefore implies that the altered predictions will have different interpretations for each student subgroup; for example, a predicted probability of student success of 60% may be interpreted as positive for one group but negative for another.
We find that optimizing the model to satisfy equality of opportunity leaves residual unfairness in demographic parity and worsens positive predictive parity for both gender and racial-ethnic groups, consistent with the impossibility results. The gaps in per-group proportions of positive predictions do narrow, but demographic parity may matter less here, since the main goal of this student success prediction model is to correctly assign positive predictions to successful students, not merely to any students. In addition, we observe that positive predictive parity is violated; this is due to an increase in precision for non-URM and female students while precision for URM and male students remains essentially unchanged. Since the correction does not directly lower precision for URM and male students, this can be considered a reasonable trade-off for achieving an alternative notion of fairness, equality of opportunity.
We conclude that setting group-specific thresholds to achieve a given fairness criterion may itself be considered unfair, since it holds some students to a more stringent standard simply because of their group membership. Our findings demonstrate that different notions of fairness are in tension with each other in a standard application of predictive modeling in higher education. This calls for more open discourse and careful evaluation of the potential trade-offs and desiderata around fairness in the use of predictive modeling in educational applications.