## 1 Introduction

Predictive analytics has become an increasingly hot topic in higher education. In particular, predictive-analytics tools have been used to predict various measures of student success (e.g., course completion, retention, and degree attainment) by mapping a set of input attributes of individuals (e.g., the student’s high school GPA and demographic features) to their outcomes (e.g., college credits accumulated) ekowo2016promise. Campus officials have used these predictions to guide decisions surrounding college admissions and student-support interventions, such as providing more intensive advising to certain students ekowo2016promise.

Despite the potential of predictive analytics, there is a critical disconnect between predictive analytics in higher education research and its accessibility in practice. Two major barriers cause this disconnect: the lack of democratization in deployment and the potential to exacerbate inequalities.

First, education researchers and policy makers face many challenges in deploying predictive and statistical techniques in practice. These challenges arise at different steps of modeling, including data cleaning (e.g., imputation), identifying the most important attributes associated with success, selecting the correct predictive modeling technique, and calibrating the hyperparameters of the selected model. Each of these steps can introduce additional bias to the system if not performed appropriately barocas2016big. Missing values are a frequent latent cause behind many data analysis challenges. Most large-scale and nationally representative education datasets suffer from a significant number of incomplete responses from the research participants. While many education-related studies have addressed the challenges of missing data missing-review1; Missing3-MI; dataprep3, little is known about the impact of handling missing values on the fairness of predictive outcomes in practice. To date, few works have studied the impact of data preparation on the unfairness of the predictive outcome, and those that have did so either in a limited setting valentim2019impact or using only a single fairness metric missing2021.

Second, predictive models rely on historical data and have the potential to exacerbate social inequalities ekowo2016promise; kizilcec2020algorithmic. Over the last decade, researchers have realized that disregarding the consequences, and especially the societal impact, of algorithmic decision making might negatively impact individuals' lives. COMPAS, a criminal justice support tool, was found to be decidedly biased against Black people propublica. Colleges and universities have been using risk algorithms to evaluate their students. Recently, The Markup investigated four major public universities and found that EAB’s Navigate software is racially biased markup. Ensuring that such algorithms are fair, however, is complex, and it requires education researchers and practitioners to undertake a comprehensive algorithm audit to ensure the technical correctness and social accountability of their algorithms.

It is imperative that predictive models are designed with careful attention to their potential social consequences. A wave of fair decision-making algorithms, and more particularly fair machine learning models for prediction, has been proposed in recent years Fair-accurate-education; AIunfairness-education. Nevertheless, most of the proposed research either deals with inequality in the pre-processing or post-processing steps or considers a model-based in-processing approach. To take any of these routes for bias mitigation, it is critical to first audit the unfairness of the predictive algorithm's outcome and identify the most severe unfairness issues to address.

Following these concerns, fairness audits of algorithmic decision systems have been pioneered in a variety of fields kondmann2021under; kearns2018preventing. Auditing a model for unfairness provides a comprehensive guideline for education researchers and officials to evaluate the inequalities of predictive modeling algorithms from different perspectives before deploying them in practice. To the best of our knowledge, no work in ML for higher education has transparently audited ML performance and unfairness using a real dataset.

In this paper, we first study whether predictive modeling techniques for student success show inequalities for or against a sample of marginalized communities. We use a real national-level education dataset to analyze the case of discrimination. We consider a wide range of machine learning models for student-success prediction. Then, we audit whether prediction outcomes discriminate against certain subgroups under different notions of fairness to identify potential bias in predictions. Furthermore, we investigate the impact of imputing the missing values with various techniques on model performance and fairness, to provide key insights for educational practitioners building a responsible ML pipeline.

This study has the potential to significantly impact the practice of data-driven decision-making in higher education by investigating the impact of a critical pre-processing step on predictive inequalities. In particular, we examine how different imputation techniques compare to one another and to what extent they affect the performance and the fairness of a student-success prediction outcome.

We predict the most common proxy attribute, *graduation completion*, and examine equal treatment of different demographic groups through different notions of fairness. The comprehensive study of the real, large-scale ELS:2002 and IPEDS datasets allows us to validate the performance of different ML techniques for predictive analytics in higher education in a real situation.
To the best of our knowledge, none of the existing fair machine learning (ML) models have studied existing large-scale datasets for student success modeling. Most of the extant applications of fair ML demonstrate results using small datasets considering a limited number of attributes (e.g., kleinberg2018algorithmic; yu2020towards) or in a specific context, such as law school admission kusner2017counterfactual; cole2019avoiding.

## 2 Bias in Education

*“Bias in, bias out”.*
The first step toward auditing and addressing disparity in student success prediction is to understand and identify the different sources of bias in the dataset. Social data, including education data, is almost always biased, as it inherently reflects historical biases and stereotypes olteanu2019social. Data collection and representation methods often introduce additional bias. Disregarding the societal impact of modeling with biased data further exacerbates discrimination in the predictive modeling outcome. The term bias refers to demographic disparities in the sampled data that compromise its representativeness olteanu2019social; fairmlbook. *Population bias* in the data prevents a model from being accurate for minorities asudeh2019assessing. Figure 2 shows the racial population bias in the ELS dataset based on our preliminary analysis.

For example, Figure 2 demonstrates the population bias in the ELS dataset, with White students representing the majority at about 69% of the observations. Biased values in the data directly yield bias in the algorithmic outcomes.

Moreover, the distribution of attribute values across different demographic groups can indicate another source of bias, commonly referred to as *behavioral bias*. This is due to the high correlation of sensitive attributes with other attributes in the dataset. We can observe behavioral bias in the highest degree earned by students across the four most represented racial groups, as shown in Figure 2 (a). It indicates that final degree attainment below the bachelor’s degree is highly frequent in the Black and Hispanic communities, respectively. Figure 2 (b) reveals that attainment below the bachelor’s degree is most frequent among students from middle-class and low-income families (excluding social class and degree-attained groups with a frequency of less than 1%). Using degree attainment as a student-success indicator therefore requires careful consideration and efforts to mitigate the effect of both population and behavioral bias.

A more comprehensive and specific model of disparities among population groups yields a greater understanding of their key drivers. The key causes of such disparities identified in previous education research include, but are not limited to, social class disparity-racial; disparity-social, race and ethnicity disparity-racial, gender disparity-structure; disparity-gender2; voyer2014gender, household characteristics (e.g., education of adults) disparity-structure, community characteristics (e.g., presence of schools) disparity-structure, and socioeconomic status disparity-structure; disparity-socioeconomic.

In this paper, our goal is to investigate and identify the factors that cause and sustain these biases, using data science techniques to better specify relationships within and between demographic, socioeconomic, pre-college academic preparation, grades, expenditures, extra activities, and school climate attributes, and their impact on disparities in educational outcomes and success. Bias detection sheds light on the choice of imputation technique and helps control the adverse impact of imputation on the performance and fairness of the model later. To this end, we audit the unfairness of the predictive outcome before and after imputation using different techniques, and demonstrate how the correlations of the variables that are highly correlated with the sensitive attributes (potential sources of behavioral bias) vary and affect the unfairness.

By examining bias in existing data and identifying the key characteristics of vulnerable populations, this paper illuminates how predictive models can produce discriminatory results if the bias is not addressed, and how we need to resolve the predictive outcome disparity.

To identify potential sources of behavioral bias, Figure 24 (a) and (b) illustrate racial disparities with respect to total credits and math/reading test scores, respectively. More specifically, Figure 24 (a) shows that the Black, Hispanic, and American Indian groups have lower median earned credits, with their first and second quartiles (50% of observations) plotted at lower values than those of the other groups. Similarly, Figure 24 (b) indicates that the student standardized combined math/reading test score has a lower median for the Black, Hispanic, and American Indian groups. In addition, the size of each box plot (from lower quartile to upper quartile) provides insight into the distribution of each group. For example, in Figure 24 (a), the Hispanic subgroup has a large box, meaning that these students have very different outcomes in terms of total earned credits, ranging from very low to high values. In contrast, the box plot for the White group of students indicates more similar credit outcomes, mainly distributed around the median value. More disparity detection plots are provided in the Appendix.

## 3 Fairness In Predictive Modeling

*Fairness-aware learning* (fairness in ML) has received considerable attention in the machine learning literature fairML3; fairML4-zafar. More specifically, fairness in ML seeks to develop methodologies such that the predicted outcome is fair or non-discriminatory for individuals with respect to their protected attributes, such as race and sex.
The goal of improving fairness in learning problems can be achieved by intervening at the pre-processing, in-processing (algorithm), or post-processing stage.
Pre-processing strategies incorporate the fairness measure in the data preparation step to mitigate the potential bias in the input data and produce fair outcomes feldman2015certifying; kamiran2012data; calmon2017optimized.
In-processing approaches zafar2015fairness; zhang2021omnifair; anahideh2020fair incorporate fairness in the design of the algorithm to generate a fair outcome. Post-processing methods pleiss2017fairness; feldman2015certifying; zehlike2017fa manipulate the outcome of the algorithm to mitigate its unfairness in the decision-making process.

There are various definitions of fairness in the literature vzliobaite2017measuring; fairmlbook; narayanan2018translation; barocas2016big; dwork2012fairness. The fairness definitions fall into different categories, including Statistical Parity hardt2016equality, Equalized Odds hardt2016equality, Predictive Parity chouldechova2017fair, Predictive Equality corbett2017algorithmic, Equal Opportunity madras2019fairness, and Accuracy Equality berk2021fairness. Table 1 gives the mathematical definitions of each of these common metrics makhlouf2021applicability. Let $S \in \{0,1\}$ be a binary sensitive attribute and, in a binary classification setting, let $Y \in \{0,1\}$ be the true label and $\hat{Y} \in \{0,1\}$ be the predicted class label. Most fairness notions are derived from conditional probabilities over these variables to reveal the inequalities of the predictive model. Evaluating the fairness of algorithmic predictions requires a notion of fairness, which can be difficult to choose in practice. Different metrics have been leveraged depending on context, business necessities, and regulations. A predictive modeling outcome might exhibit inequalities under one notion of fairness and none under others.
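For orientation, the notions named above are commonly stated as equalities of conditional probabilities between two groups. The block below is a sketch of those standard textbook formulations, not a reproduction of Table 1; the symbols $S$ (binary sensitive attribute), $Y$ (true label), and $\hat{Y}$ (predicted label) are assumed notation.

```latex
\begin{align*}
\text{Statistical Parity:} \quad & P(\hat{Y}=1 \mid S=0) = P(\hat{Y}=1 \mid S=1) \\
\text{Equalized Odds:} \quad & P(\hat{Y}=1 \mid Y=y, S=0) = P(\hat{Y}=1 \mid Y=y, S=1), \quad y \in \{0,1\} \\
\text{Equal Opportunity:} \quad & P(\hat{Y}=1 \mid Y=1, S=0) = P(\hat{Y}=1 \mid Y=1, S=1) \\
\text{Predictive Equality:} \quad & P(\hat{Y}=1 \mid Y=0, S=0) = P(\hat{Y}=1 \mid Y=0, S=1) \\
\text{Predictive Parity:} \quad & P(Y=1 \mid \hat{Y}=1, S=0) = P(Y=1 \mid \hat{Y}=1, S=1) \\
\text{Accuracy Equality:} \quad & P(\hat{Y}=Y \mid S=0) = P(\hat{Y}=Y \mid S=1)
\end{align*}
```

Under these formulations, an unfairness gap for a metric is the difference between its left-hand and right-hand sides; a gap of zero means the notion is satisfied exactly.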

In the education context, to give some examples, a) demographic (statistical) parity refers to the discrepancy in the predicted highest level of degree (success) across different demographic groups of students, and b) equal opportunity refers to the discrepancy in the predicted highest level of degree across different demographic groups of students, given that their true success label is 1. In this paper, we use a binary classification setting but across multilevel racial and gender population subgroups (the sensitive attribute $S$ is not necessarily binary).
We extend the fairness metrics described in Table 1 to non-binary sensitive attributes by using a *one-versus-rest* approach for the unfairness calculation. More specifically, to calculate the unfairness gaps, we treat each subgroup in turn as the group of interest and compare it against the rest (i.e., all other subgroups), one at a time. In this paper, we mainly focus on racial and gender disparities; however, our proposed approach for auditing fairness and investigating the impact of imputation can be extended to other sensitive attributes. For example, the decision maker can use marital status as a sensitive attribute.
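The one-versus-rest calculation can be illustrated with a minimal sketch in plain Python. The function names, the signed "group minus rest" convention for a gap, and the use of standard confusion-matrix rates for each metric are assumptions made for illustration; they are not taken from Table 1 or from the authors' implementation.

```python
def _rate(numerator, denominator):
    """Return numerator / denominator, or None when the conditioning set is empty."""
    return numerator / denominator if denominator else None


def group_rates(y_true, y_pred, mask):
    """Compute per-group rates for the records selected by the boolean mask."""
    tp = fp = tn = fn = 0
    for y, p, selected in zip(y_true, y_pred, mask):
        if not selected:
            continue
        if p == 1 and y == 1:
            tp += 1
        elif p == 1 and y == 0:
            fp += 1
        elif p == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    n = tp + fp + tn + fn
    return {
        "SP": _rate(tp + fp, n),    # share predicted positive
        "EoP": _rate(tp, tp + fn),  # true positive rate
        "PE": _rate(fp, fp + tn),   # false positive rate
        "PP": _rate(tp, tp + fp),   # positive predictive value
        "AE": _rate(tp + tn, n),    # accuracy
    }


def one_vs_rest_gaps(y_true, y_pred, groups):
    """For each subgroup, return (rate in subgroup) - (rate in all other subgroups)."""
    gaps = {}
    for g in sorted(set(groups)):
        in_group = [s == g for s in groups]
        in_rest = [not m for m in in_group]
        r_group = group_rates(y_true, y_pred, in_group)
        r_rest = group_rates(y_true, y_pred, in_rest)
        gaps[g] = {
            name: None
            if r_group[name] is None or r_rest[name] is None
            else r_group[name] - r_rest[name]
            for name in r_group
        }
    return gaps


# Toy example with three hypothetical subgroups.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 1, 0, 1, 0]
subgroups = ["A", "A", "A", "B", "B", "B", "C", "C"]
print(one_vs_rest_gaps(labels, preds, subgroups))
```

In this sketch, Equalized Odds is not reported as a separate number; it can be read from the EoP and PE gaps together, since it constrains both the true positive and the false positive rates.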

## 4 Fairness Audits

Notwithstanding the awareness of biases and unfairness in machine learning, the actual challenges faced by ML practitioners have been discussed in only a few previous studies veale2018fairness; holstein2019improving, with a focus on specific contexts such as predictive policing propublica and child maltreatment detection chouldechova2018case. ML practitioners often struggle to apply existing auditing and de-biasing methods in their contexts holstein2019improving. The concept of auditing algorithms, and ethics-based auditing in various contexts, has lately gained prominence mokander2021ethics; raji2020closing; wilson2021building; kondmann2021under. The final goal of the fairness auditing process is to determine whether the ML model’s results are fair. As a result, the auditing process aids in determining the appropriate actions to take, the best bias mitigation method to employ, and the most suitable technique to use throughout the ML development pipeline raji2020closing.

A designated user of predictive modeling in higher education needs support to audit ML model performance and inequalities before adopting and deploying a model in practice. To support education practitioners and policymakers in assessing the inequalities of the predictive outcome, in this paper we audit the unfairness of ML models for student-success prediction using the major notions of fairness listed in Table 1 to identify bias in the algorithms. We also audit unfairness to ensure an ethical pre-processing approach. We audit a wide range of fairness metrics and conduct a comprehensive analysis of the performance of different ML models and their inequalities across different racial and gender subgroups throughout the data preparation (imputation) and model training steps using the ELS dataset.

## 5 Success Prediction

Before moving to the ML pipeline, we first discuss the prediction problem of interest. In this paper, we specifically focus on predicting the academic success of students in higher education. Student-success prediction is critical for institution performance evaluation, college admission, intervention policy design, and many more use cases in higher education yu2020towards; stephan2015will.

Quantifying student success is a very complex topic, since the true quality of any candidate is hidden and only limited information is available. Proxy attributes such as first-year GPA or graduation completion are typically used as the measure of success. In this work, we are primarily interested in predicting the highest level of degree (a classification problem) using the ELS:2002 dataset.

Numerous factors can affect student success voyer2014gender. Thus, identifying the most informative and significant subset of potential variables is a critical task in predictive modeling FSstudy. To select a proper subset of attributes, we conducted a thorough literature search and combined it with domain expert knowledge. These factors include, but are not limited to, academic performance (SAT scores, GPA) chamorro2008personality, student demographic attributes (e.g., race, gender) voyer2014gender; fairethic-education, socio-economic status disparity-structure; disparity-socioeconomic2, environmental factors, and extra (out-of-school) activities.

Incorporating protected attributes in the modeling procedure has raised concerns in the fair-ML domain barocas2016big. Machine learning models are based on correlation, and any feature associated with an outcome can be used as a decision basis. However, the predictive outcome depends on the information available to the model and the specific algorithm used. A model may leverage any feature associated with the outcome, and common measures of model performance and fairness will be essentially unaffected. In contrast, in some cases the inclusion of unprotected attributes may adversely affect the performance and fairness of a predictive model due to a latent correlation with other protected attributes. In this paper, we audit the unfairness of the model and the impact of imputation when the sensitive attributes are incorporated as determinants.

Decision Tree DT-corr-perform, Random Forest RF-risk, K-Nearest Neighbor dudani1976distance; tanner2010predicting, LDA riffenburgh1957linear; alyahyan2020predicting; thompson2018predicting, and SVM SVM-DT-NN are among the well-known ML models in higher education. Table 2 presents the list of variables in this study and their corresponding missing value percentages.

### 5.1 Missing Values and Imputation

*Missing values* are a frequent latent cause behind many data analysis challenges, from modeling to prediction, and from accuracy to fairness for protected (sensitive) subgroups. Therefore, *handling missing values* is a complicated problem that requires careful consideration in education research missing-review1; Missing3-MI. Different imputation techniques have been proposed in the literature, and the effectiveness of each methodology in various applications has been studied. *Mean Imputation*, *Multiple Imputation*, and clustering-based imputation strategies such as *KNN Imputation* are among the well-known techniques for handling missing values, which we briefly describe here.

Most large-scale and nationally representative education datasets (e.g., ELS) suffer from a significant number of incomplete responses from the research participants. While features with more than 75% missing or unknown values are usually not informative, most features have less than 25% missing values and are worth keeping. Removing all observations with missing values induces significant information loss in success prediction.

Simple Imputation is one of the most basic imputation strategies. The process involves replacing the missing value of a variable in an observation with the mean (or median) of the observations with available values for the same variable. The mean imputation method is known to decrease the standard error of the mean, which exacerbates the risk of failing to capture the reality through statistical tests missing-book; missing-review1.

Multiple Imputation (MI) rubin1996multiple is a more advanced imputation strategy that aims to estimate the natural variation in the data by performing several missing-data imputations. MI produces a set of estimates from the various imputed datasets and combines them into a single set of estimates by averaging across the different values. The standard errors of parameter estimates produced with this method have been shown to be unbiased rubin1996multiple.

KNN Imputation is a non-parametric imputation strategy that has been shown to be successful in different contexts missing-knn. The KNN imputer replaces each sample’s missing values with the mean value from the nearest neighbors found in the dataset. Two samples are considered close neighbors if the features that neither is missing are close. KNN imputation is able to capture structure in the dataset even when the underlying data distribution is unknown somasundaram2011evaluation. To the best of our knowledge, KNN imputation has not been used in the education context.

Overall, ignoring missing data is not an effective way to handle missing values and, more importantly, can result in predictive disparity for minorities. While many education-related studies have addressed the challenges of missing data, as discussed, little is known about the impact of applying different imputation techniques on the fairness of the final model. In this project, we aim to address this gap by considering the three imputation strategies above. To the best of our knowledge, no prior work in the education domain has addressed fairness when imputing missing values in the pre-processing steps.
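To make the difference between mean imputation and KNN imputation concrete, the following is a minimal plain-Python sketch. It is an illustration under stated assumptions rather than the pipeline used in this study: missing entries are encoded as `None`, every column is assumed to have at least one observed value, the distance is a Euclidean distance over jointly observed features rescaled by the share of such features, and the function names and the default `k` are hypothetical. Multiple Imputation is not covered by the sketch.

```python
import math


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def _distance(a, b):
    """Euclidean distance over features observed in both rows.

    The squared distance is rescaled by (total features / shared features);
    rows with no shared observed feature are infinitely far apart.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=2):
    """Replace each None with the mean of that feature over the k nearest donors."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows that observe feature j and are comparable to row i.
            donors = [
                (_distance(row, other), other[j])
                for t, other in enumerate(rows)
                if t != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the entry missing if no donor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy data: the last row is an outlier that pulls the column mean upward.
data = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [10.0, 100.0]]
print(mean_impute(data)[1])
print(knn_impute(data, k=2)[1])
```

On the toy data, the mean-imputed value is pulled toward the outlying row, while the KNN-imputed value depends only on the two most similar rows. This is the sense in which KNN imputation can follow local structure in the data.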

## 6 Experiments

Data Preparation. As previously stated, we use the ELS dataset in this study to audit the fairness of ML models in the development pipeline for predicting student success. The ELS dataset includes many categorical variables. Therefore, we begin by creating appropriate labels and converting categorical attributes to numeric ones (dummy variables) following the NCES dataset documentation (https://nces.ed.gov/surveys/els2002/avail_data.asp). Next, we transform the response variable *highest level of degree* to construct a binary classification problem. That is, we label students with a college degree (BS degree and higher) as the favorable outcome (label=1) and all others as the unfavorable outcome (label=0). Data cleaning is then performed to identify and rename the missing values (based on the documentation) and to remove the observations that have many missing attributes (those for which a large share of the attributes are missing). The final and significantly important task is to handle the remaining missing values in the dataset. We consider different imputation techniques: Simple Imputation (SI), Multiple Imputation (MI), and KNN Imputation (KNN-I). We also consider a baseline in which we remove observations with missing attributes, referred to as Remove NA. For KNN-I, we consider an additional scenario in which we do not impute the response variable: we remove observations with a missing response and apply KNN-I to the remaining set of attributes.
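The preparation steps described above (dummy coding, binarizing the response, and dropping observations with too many missing attributes) can be sketched as follows. This is an illustrative outline only: the field names, the set of favorable degree levels, and the `max_missing_share` threshold are hypothetical placeholders, and missing values are assumed to have already been recoded to `None`.

```python
def one_hot(records, key):
    """Replace a categorical field with 0/1 dummy columns (None stays None)."""
    levels = sorted({r[key] for r in records if r[key] is not None})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != key}
        for level in levels:
            row[f"{key}_{level}"] = None if r[key] is None else int(r[key] == level)
        encoded.append(row)
    return encoded


def prepare(records, degree_key, favorable_levels, max_missing_share):
    """Binarize the response and drop records with too many missing attributes."""
    prepared = []
    for rec in records:
        features = {k: v for k, v in rec.items() if k != degree_key}
        missing = sum(1 for v in features.values() if v is None)
        # Drop the record when its share of missing attributes exceeds the threshold.
        if features and missing / len(features) > max_missing_share:
            continue
        degree = rec[degree_key]
        # label=1 for the favorable outcome, 0 otherwise; a missing response stays None.
        label = None if degree is None else int(degree in favorable_levels)
        prepared.append({**features, "label": label})
    return prepared


# Hypothetical records; the attribute names are placeholders, not ELS variable names.
students = [
    {"gpa": 3.4, "credits": 120, "ses": 2, "degree": "bachelor"},
    {"gpa": None, "credits": None, "ses": None, "degree": "associate"},
    {"gpa": 2.9, "credits": None, "ses": 1, "degree": "none"},
    {"gpa": 3.8, "credits": 130, "ses": 3, "degree": None},
]
cleaned = prepare(students, "degree", {"bachelor", "master", "doctorate"}, 0.5)
print(cleaned)
```

Records whose response is still `None` after this step correspond to the choice discussed above: they can either be imputed together with the attributes or removed before imputation.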

Model Training. The training procedure follows the data preparation step, once we obtain cleaned datasets. We aim to analyze the performance of different machine learning (ML) models under each imputation technique to audit the inequalities in the prediction outcome. We consider Decision Tree (DT), Random Forest (RF), Support Vector Classifier (SVC), Linear Discriminant Analysis (LDA), Logistic Regression (Log), and K-Nearest Neighbor (KNN) models. For each model, we perform hyperparameter tuning to find the best model under each imputed dataset. For example, Table 3 presents the best hyperparameters obtained for each model when fitted on the KNN-imputed dataset.

| Model | Hyperparameters | | |
|---|---|---|---|
| Decision Tree Classifier | maxdepth=6 | minsplit=30 | maxfeatures=30 |
| Random Forest Classifier | maxdepth=10 | minsplit=10 | maxfeatures=10 |
| Support Vector Classifier | kernel=Linear | | |
| LDA Classifier | Shrinkage=None | Solver=svd | |
| Logit Classifier | penalty= | Solver=liblinear | |
| K-nearest neighbor Classifier | K=34 | weight=distance | |

Results. In this section, we summarize our findings in three key discussions. First, we compare the performance of the considered ML models on different imputed datasets. Then, we analyze the impact of imputation on unfairness gaps among protected racial groups. Lastly, we compare the correlation results before and after imputation to identify the source of difference in fairness performance.

One noticeable fact is that both testing and training accuracy increase after imputation for all models except KNN (a detailed report on the accuracy is provided in the appendix). In fact, the models fitted on the imputed datasets have higher generalization power. KNN, however, shows the largest performance gap before and after imputation (about an 8% decrease in testing accuracy and a 20% decrease in training accuracy), whereas for the other models the training and testing accuracy increase by 2%-6% on average.

Figures 18-23 show the unfairness of different ML models under different imputation techniques. Each group of bars corresponds to one racial subgroup. The results for *gender* as the sensitive attribute are provided in the appendix.

As shown in Table 1, Statistical Parity (SP) is a fairness metric that compares the positive (success) prediction outcome (predicted label $\hat{Y} = 1$) among students from different demographic groups without considering their true label (real outcome). Based on Figure 18, we can observe that statistical parity is model independent, and observable changes only occur under different imputation techniques. The other fairness metrics, however, are both model and imputation dependent, as the unfairness gaps differ considerably from one model to another across imputation scenarios. Note that SP increases with imputation for the majority of subgroups; the *More than one race* and *White* subgroups are exceptions.

Predictive Parity (PP) considers the students who are predicted to be successful given their racial subgroup and measures whether they are correctly classified. The PP unfairness gaps increase after imputation in the majority of cases; however, the effect also depends on the model type. For example, the KNN classifier tends to do worst across all racial groups after imputation. The PP gaps also increase for Black students after imputation with most of the models.

Predictive Equality (PE) focuses mainly on unsuccessful students who are incorrectly predicted as successful given their racial subgroup. This type of unfairness could lead to considerably unfavorable results in higher education, where policymakers fail to identify those in need. Figure 20 shows that imputation mainly decreases PE gaps across racial subgroups, although some models are exceptions. For instance, the Log and SVC classifiers increase the unfairness for Asian students after imputation, which leads to unfavorable prediction outcomes for Asian students who are less likely to succeed.

Equal Opportunity (EoP) also emphasizes the positive prediction outcome and measures how well the model correctly classifies successful students given their racial subgroup. Based on Figure 21, we can observe that imputation decreases the unfairness gaps for all racial groups except the Black subgroup. In fact, following identification as a minority group, Black students may encounter even more discrimination. In other words, successful Black students are less likely to be predicted as such after imputation.

Equalized Odds (EO) measures the true positive and false positive rates among different racial subgroups. It emphasizes the positive prediction outcome (predicted label $\hat{Y} = 1$), which is *BS degree attainment*. Figure 22 shows that imputation drastically decreases the unfairness gaps across different racial groups. That is, the models tend to predict equal positive outcomes across different racial subgroups.

Accuracy Equality (AE) measures the overall accuracy of the prediction outcome for students given their racial subgroup. Figure 23 shows that imputation drastically decreases the AE gaps across different racial groups. That is, the models tend to be more equally accurate across different racial subgroups after imputation. In contrast, the models behave in a discriminatory manner toward each racial group when we remove all observations with missing values.

Note that fairness metrics are model dependent. For example, comparing KNN with DT and RF under the notion of predictive parity (PP), we can observe that the gaps for KNN mostly increase after imputation, unlike those of the others, which mostly achieve lower unfairness. A comprehensive analysis and plots are provided in the appendix.

Figures 26(a) to 26(d) show heatmaps of the correlations between attributes. We considered two subsets of unprotected attributes: school-related and grade/credit-based attributes. The plots indicate the impact of imputation on the correlation between sensitive and unprotected attributes. Correlations enable us to identify the changes in the distribution of the unprotected attributes, and how their correlations with sensitive attributes (which indicate behavioral bias) change and affect the unfairness of the model. For example, comparing Figure 26(a) with 26(b), the correlation between college entrance exam and White increases after KNN-I, which can amplify the unfairness of the outcome accordingly. Moreover, for the second subset of attributes, comparing Figure 26(c) with 26(d), we can observe that the correlation between all credits taken and parent education (which is highly correlated with White) increases after KNN-I. Similarly, the correlation of most of the other score- and GPA-related attributes with White increases after KNN-I. The reason is that White is the majority group in the data and imputation is biased toward it. This is expected to cause inequalities in the predictive outcome.

While increasing the correlation further exacerbates the bias in prediction, decreasing the correlation can reduce unfairness. In fact, this is the main reason behind the change in unfairness across different imputation techniques and ML models. As a result, we observe that for some fairness metrics (e.g., PP) the gaps are enlarged after imputation, while others (e.g., EO) benefit from imputation, as it enables the models to decrease the inequalities in prediction. The correlation heatmaps for other imputation techniques and more discussions are provided in the appendix.
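The before-and-after correlation comparison discussed above can be sketched in a few lines of plain Python. The sketch assumes a 0/1 indicator for membership in one subgroup, an attribute with missing entries encoded as `None`, and an already imputed copy of that attribute; the function names and the toy numbers are hypothetical and are not results from the ELS data. Correlations are computed pairwise on observed entries, and inputs are assumed to be non-constant.

```python
import math


def pearson(xs, ys):
    """Pearson correlation over positions where both values are observed.

    Assumes neither variable is constant on the observed pairs.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


def correlation_shift(indicator, before, after):
    """Return (r_before, r_after, r_after - r_before) for a subgroup indicator."""
    r_before = pearson(indicator, before)
    r_after = pearson(indicator, after)
    return r_before, r_after, r_after - r_before


# Toy example: 1 marks membership in a hypothetical subgroup.
membership = [1, 1, 1, 0, 0, 0]
score_raw = [80, 90, None, 60, None, 70]
score_imputed = [80, 90, 85, 60, 85, 70]  # hypothetical imputed values
print(correlation_shift(membership, score_raw, score_imputed))
```

A positive shift means the attribute became more strongly associated with subgroup membership after imputation, which is the situation described above as amplifying behavioral bias; a negative shift indicates the opposite.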