Synthetic Embedding-based Data Generation Methods for Student Performance

01/03/2021 ∙ by Dom Huh, et al. ∙ George Mason University

Given the inherent class imbalance issue within student performance datasets, samples belonging to the edges of the target class distribution pose a challenge for predictive machine learning algorithms to learn. In this paper, we introduce a general framework for synthetic embedding-based data generation (SEDG), a search-based approach that generates new synthetic samples using embeddings to optimally correct the detrimental effects of class imbalances. We compare the SEDG framework to past synthetic data generation methods, including deep generative models and traditional sampling methods. In our results, we find that SEDG outperforms the traditional re-sampling methods for deep neural networks and performs competitively for common machine learning classifiers on the student performance task across several standard performance metrics.


1 Introduction

In the educational domain, student performance tasks involve designing a classifier to predict the performance of a student given a set of features. These datasets, however, are often skewed centrally toward the mean, because a relative grading scale, more commonly known as a bell curve, is often employed. Consequently, the limited number of examples belonging to outlier students, specifically those demonstrating extreme success or failure, do not sufficiently represent the minority classes, as opposed to the overwhelming number of examples of average-performing students that represent the majority classes. This disproportion results in machine learning algorithms failing to learn how to classify the minority classes, as supported in Provost and Weiss (2011). This issue is known in the machine learning community as class imbalance. Class imbalances arise when working with an imbalanced dataset, one that exhibits a notable disparity in the number of examples among different classes, and can disproportionately harm the classifier’s performance on the minority classes. As Sun et al. (2011) discuss, such datasets reflect the nature of the problem, which correlates heavily with rare events, small sample size, low class separability, and/or the existence of within-class subconcepts. For student performance datasets, we find all of these to be the case; even so, gaining insight into the outlier students is significant for educators, as it may point to indicators of failure and success.

Commonly for class imbalance tasks, the nature of the dataset results in a relatively high predictive accuracy for the majority classes and a significantly lower one for the minority classes. If classification accuracy were used to evaluate the model’s efficiency, we would obtain an overly optimistic belief in the classifier. Thus, for such datasets, as in Provost and Fawcett (1997), we often consider the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve, a graph used to show the trade-off between true positives and false positives, as the performance metric for the classifier. The ROC space requires the true positive rate (TPR), more commonly known as sensitivity, and the false positive rate (FPR), more commonly known as fall-out. We can calculate the values of TPR and FPR, seen in Equation 1, from the components of the confusion matrix, a table containing the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), seen in Figure 1.

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} \tag{1}$$

Then, we obtain the ROC curve to calculate the AUC score, seen in Figure 2. For multi-class settings, we can use a one-vs-rest paradigm, and either obtain the AUC scores at the class level, or aggregate them with or without class weighting to obtain a micro-AUC or macro-AUC score, respectively. When we evaluate at the class level, we call this the class-AUC score.
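As a rough illustration of these metrics, the sketch below computes TPR and FPR from confusion-matrix counts, and a one-vs-rest class-AUC and macro-AUC. It is a minimal sketch, not the authors' evaluation code: the function names are hypothetical, and the AUC is computed with the standard rank interpretation (the probability that a positive sample is scored above a negative one), which is an implementation choice made here.

```python
def tpr_fpr(tp, fn, fp, tn):
    """Return (TPR, FPR) from binary confusion-matrix counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return float("nan")  # undefined when one side is empty
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def class_auc(y_true, y_score, n_classes):
    """One-vs-rest AUC per class; y_score[i][c] is the score of class c."""
    return [
        binary_auc([1 if y == c else 0 for y in y_true],
                   [row[c] for row in y_score])
        for c in range(n_classes)
    ]


def macro_auc(y_true, y_score, n_classes):
    """Unweighted mean of the class-AUC scores (undefined classes skipped)."""
    aucs = [a for a in class_auc(y_true, y_score, n_classes) if a == a]
    return sum(aucs) / len(aucs)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a small dataset but would normally be replaced by a sort-based computation.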

Figure 1: Confusion Matrix: A contingency table between true class and predicted class in a binary class setting.

Figure 2: ROC curve: A graph in ROC space. The dotted line denotes the ROC curve of a classifier that estimates randomly. The solid line denotes the ROC curve of a classifier that estimates more accurately. The AUC is the area under the ROC curve.

Generally, as defined in Sun et al. (2011), there exist three common approaches to alleviate the effects of class imbalances: data modification, algorithm modification, and learning modification. As the names suggest, each approach modifies a different aspect of the learning system to counter class imbalances. In this paper, we focus on data modification approaches, or more specifically, sampling methods.

We consider two classes of sampling methods: oversampling and under-sampling. Oversampling methods append samples to the pre-existing training set, whereas under-sampling methods remove samples from the training set. Both approaches aim to balance the number of samples in the training set, given some criterion. More generally, sampling methods must consider two important components. Firstly, the imbalance criterion that is targeted must be defined. For an imbalanced dataset, for example, the criterion may be the number of samples in each class. Secondly, the method of creating the balance must be defined, such as randomly re-sampling or randomly removing samples. Variations of these two components have been explored in past works Susan and Kumar. In particular, a promising oversampling approach that has improved the handling of imbalanced datasets is to generate new synthetic samples from pre-existing samples in the training dataset; this method is known as Synthetic Data Generation (SDG). In Bowyer et al. (2011), the authors introduced the widely used Synthetic Minority Over-sampling Technique (SMOTE), which generates a new sample by taking a pre-existing sample and adding the difference between its feature vector and that of one of its nearest neighbors, scaled by a factor between 0 and 1, with the addition of noise. Furthermore, Guo and Viktor (2004) and Haibo He et al. (2008) proposed variants of SMOTE, called DataBoost-IM and ADASYN respectively, which change the sampling criterion to be based on predictive error.
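The interpolation step described for SMOTE can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation: it handles only continuous features, uses squared Euclidean distance to find neighbors, omits the noise term mentioned above, and the function name `smote_like` and its parameters are hypothetical.

```python
import random


def smote_like(minority, k=3, n_new=5, seed=0):
    """Generate n_new synthetic points by interpolating toward neighbors.

    minority: list of feature vectors (lists of floats) from one class;
    it must contain at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbor = rng.choice(others[:k])
        lam = rng.random()  # scale factor in [0, 1)
        # New point lies on the segment between x and the chosen neighbor.
        synthetic.append([a + lam * (b - a) for a, b in zip(x, neighbor)])
    return synthetic
```

Because every synthetic point lies on a segment between two existing minority samples, the generated data stays inside the region already covered by that class.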

Recently, deep learning methods for SDG have exhibited advances in robustness Mullick et al. (2019); Shamsolmoali et al. (2020) and efficacy, specifically with end-to-end training approaches Turhan and Bilge (2018). In particular, deep generative models (DGMs), such as generative adversarial models, have been introduced to tackle class imbalances Mullick et al. (2019); Shamsolmoali et al. (2020).

In this paper, we introduce a framework for embedding-based SDG methods, called SEDG, to handle the effects of class imbalances on student performance tasks. The student performance dataset and the set of classifiers used are described in Section 2. Further details of the SEDG framework and deep generative models are discussed in Section 3. In Section 4, the results are presented and evaluated.

2 Preliminaries

In this section, we will discuss the details of the Student Performance Dataset from Cortez and Silva (2008) and the classifiers used to test the SEDG framework introduced in this paper. The training methods and implementation decisions will also be described.

Figure 3: Target Value Distribution: From the student performance dataset, the imbalance in the distribution of target values skews centrally, indicating a disproportionate number of examples between the central classes and the outer classes.

2.1 Student Performance Dataset

The dataset provided by Cortez and Silva (2008) can be used to predict student performance, or the final grade, on a scale from to . The data was collected from two Portuguese secondary-level schools, and has samples of features. In Figure 3, we show the class imbalance in the dataset, skewed centrally as mentioned in Section 1. In Figure 4, we show the distribution of features in the dataset, where some features, such as parent’s cohabitation status, extra paid classes, and extra educational school support, are also imbalanced. Further information on the meaning of these features can be found in Cortez and Silva (2008). We do not use the binary or five-level classification found in other works on this dataset, but classify all 20 classes. For future work, we find a three-level classification, or a binary classification between the outlier samples, which lie on the edges of the target values, and the average samples, to be a more appropriate simplification.

Figure 4: Feature Distribution: From the student performance dataset, we can see an imbalance in some of the distribution of feature values.

2.2 Classifiers

We will use three traditional machine learning classifiers (support vector machines, random forest, gradient-boosted decision trees/XGBoost) and a neural network architecture, which we will denote as NNModel.

Support vector machines (SVM) Hearst (1998) are optimal-margin classifiers that seek the hyperplane that maximizes the geometric margin between two classes. In a multi-class setting, we can use a one-vs-rest approach, where one SVM is trained per class by masking all other classes into a single class to form a binary setting, and the class whose SVM gives the maximum score is selected. Alternatively, we can use a one-vs-one approach, where, similar to one-vs-rest, we form binary settings, but between pairs of classes instead.

Decision trees Quinlan (1986) are tree-based classifiers that create binary splits based on a condition on the features that maximizes information gain, a metric of weighted entropy. Random forest Breiman (2001) is a bootstrap aggregation, or bagging, algorithm using decision trees, where each learner is trained on a bootstrapped subset of the dataset and is also limited to a subset of features it can split on. Inference with a random forest is then a majority vote of the learners. XGBoost Chen and Guestrin (2016) is a gradient boosting algorithm using decision trees, where learners are added to correct the errors of previously trained learners. Gradient boosting refers to correcting these errors through gradient learning when adding new learners Friedman (2001).

For our experiments, the neural network architecture we call NNModel is a multi-layer perceptron that is sequentially made up of LRND blocks, seen in Figure 5. The LRND block is composed of a linear layer, a rectified linear activation, batch normalization, and a dropout layer. This model is used for classification, and its parameters are optimized using gradient learning on the cross entropy loss, seen in Equation 2, where is the true, target function and is the hypothesis function, with an Adam optimizer scheduled to reduce the learning rate by an order of magnitude on plateaus.

(2)
Figure 5: NNModel Block Diagram: In the diagram, the input is composed of features. The input data is first mapped into embeddings, then passed into the processing component, which has LRND blocks, denoted as the dotted block. To obtain the final prediction for a multi-class classification, the output is transformed using a softmax activation.

3 Synthetic Embedding-based Data Generation

In this section, we will discuss the embedding-based method for SDG that will be evaluated in Section 4. We will formalize the generalized SDG method step by step, dividing the discussion into four sequential parts: sample selection, feature selection, feature modification, and synthetic sample usage. In each step, we offer various design considerations. We will then discuss deep generative models in the context of SDG.

3.1 Sample Selection

We must first consider how to select the samples from the original dataset on which to base the synthetic samples, and also how many samples we want to select. The latter consideration, the number of samples, can be fixed or stochastically chosen within a defined range. We will call the former consideration sample selection henceforth.

Naively, we can randomly select samples from the entire dataset, all with equal probability. Once the samples are selected, we place them into a sample pool that will be used to generate the new synthetic samples. This approach will be referred to as random sample selection (RSS).

A more systematic approach is to partition the dataset in a targeted manner and select a member of the partition to sample from. The member selection can occur at a probability , or deterministically, and the selected member can then be stochastically sampled from, with or without replacement. More concretely, suppose we want samples and, given that we are working with imbalanced datasets, we choose the partition to be divided based on the classes. We associate each member of the partition with a probability proportional to its cardinality , which represents how likely the member is to be selected. Thus, there exists a surjective mapping from the partition to the probability distribution . A special case would be to deterministically select the member based on the cardinality, where for all . Once a member is selected, we sample examples from the set, where . The value of , similar to , can be fixed or randomly chosen within a defined range. The samples are stochastically selected from by performing RSS on the subset of , and we add them to our sample pool . We repeat our selection of a new member in the partition, with or without replacement, where at the end of the process, we get . This approach will be referred to as partition-based sample selection (PaSS). We note that this approach depends solely on the dataset, and remains entirely independent of the classifier.
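A minimal sketch of RSS and PaSS, under assumptions of our own, may help make the procedure concrete. The names `rss` and `pass_select`, the parameters `n` (total pool size) and `m` (samples drawn per selected member), and sampling with replacement are illustrative choices; the default weighting follows the text (proportional to cardinality), and a different weighting can be passed in through the hypothetical `weight` argument.

```python
import random
from collections import defaultdict


def rss(samples, n, rng):
    """Random sample selection: n uniform draws with replacement."""
    return [rng.choice(samples) for _ in range(n)]


def pass_select(X, y, n, m=2, weight=len, seed=0):
    """Partition-based sample selection over class partitions.

    weight maps a partition member (list of samples) to an unnormalized
    selection weight; len gives weights proportional to cardinality.
    """
    rng = random.Random(seed)
    partition = defaultdict(list)
    for x, label in zip(X, y):
        partition[label].append(x)
    members = list(partition)
    weights = [weight(partition[c]) for c in members]
    pool = []
    while len(pool) < n:
        # Select a member of the partition according to its weight.
        c = rng.choices(members, weights=weights, k=1)[0]
        take = min(m, n - len(pool))
        # Draw samples from the selected member with RSS.
        pool.extend((x, c) for x in rss(partition[c], take, rng))
    return pool
```

For example, passing `weight=lambda s: 1.0 / len(s)` would instead favor the smaller classes; that variant is an assumption of this sketch and is not taken from the text.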

Another approach for sample selection consists of selecting samples based on the classifier’s performance, similar to the concept of boosting. Specifically, we can assign higher selection likelihoods to samples for which the learner has the largest margin for error, or uncertainty. In this context, we can consider classification accuracy to quantify error and uncertainty. Thus, we can place the highest probability of selection on the samples that were misclassified, and rank the correctly classified samples based on the classifier’s certainty in the prediction. So, given a classifier and a loss function , where is a sample belonging to the training set , we let the selection probability for be greater than the selection probability for if and only if . The loss function returns a scalar value that represents the correlation between the target and prediction values. This approach will be referred to as performance-based sample selection (PeSS). We note that this approach, in contrast to PaSS, depends on the classifier, and thus requires a trained model for its operation. Consequently, PeSS is more computationally costly to use.
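One way to realize such a loss-ordered selection distribution is sketched below. It assumes, as our own reading of the condition above, that a sample with a larger loss should receive a larger selection probability, and it uses the per-sample cross entropy as the loss; normalizing the losses is only one of many monotone mappings that would satisfy this. The function names are hypothetical.

```python
import math
import random


def pess_probabilities(probs, y):
    """Selection probabilities that increase with per-sample loss.

    probs[i][c] is the classifier's predicted probability of class c for
    sample i; y[i] is the true class index. The loss is cross entropy.
    """
    losses = [-math.log(max(p[t], 1e-12)) for p, t in zip(probs, y)]
    total = sum(losses)
    if total == 0:
        # Every sample is predicted perfectly: fall back to uniform.
        return [1.0 / len(losses)] * len(losses)
    return [l / total for l in losses]


def pess_select(X, probs, y, n, seed=0):
    """Draw n samples (with replacement) using the loss-based weights."""
    rng = random.Random(seed)
    weights = pess_probabilities(probs, y)
    return rng.choices(X, weights=weights, k=n)
```

Misclassified samples have low predicted probability on their true class, so they receive the largest weights, while confidently correct samples are rarely selected.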

We can now formulate a selection method that incorporates elements of both the PaSS and PeSS approaches by using the class-AUC score. We first use the class partitions described above and the class-AUC score as our uncertainty metric to develop a probability distribution for selecting the member to sample from. Thus, we incorporate elements from the PaSS approach by using a class partition to first select a member to sample from, and we incorporate the AUC score from the classifier to develop the probability associated with member selection. We call this the partition-performance sample selection approach, which we will refer to as PPSS.
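The text does not specify how the class-AUC scores are turned into member selection probabilities, so the sketch below makes an explicit assumption: each class is weighted by `1 - AUC` (plus a small constant), so that classes the classifier separates poorly are selected more often. The function names and the `eps` parameter are hypothetical.

```python
import random


def ppss_member_probs(class_auc, eps=1e-6):
    """Map per-class AUC scores to member selection probabilities.

    Assumption of this sketch: uncertainty is measured as (1 - AUC);
    eps keeps every class selectable even when its AUC is 1.
    """
    uncertainty = {c: (1.0 - a) + eps for c, a in class_auc.items()}
    total = sum(uncertainty.values())
    return {c: u / total for c, u in uncertainty.items()}


def ppss_select(partition, class_auc, n, seed=0):
    """Select n samples: pick a class by AUC-based weight, then use RSS.

    partition maps each class label to its list of samples.
    """
    rng = random.Random(seed)
    probs = ppss_member_probs(class_auc)
    classes = list(probs)
    weights = [probs[c] for c in classes]
    pool = []
    for _ in range(n):
        c = rng.choices(classes, weights=weights, k=1)[0]
        pool.append((rng.choice(partition[c]), c))
    return pool
```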

3.2 Feature Selection

We can think of every sample from the dataset , where , at a more granular level, to be made up of features. In other words, a sample can be defined as a set of features. After sample selection, we can modify the selected samples to generate new samples. Thus, given some mapping from to where would provide us new samples , where . To modify the samples, we must first select the features of each sample that we wish to adjust, and do so efficiently, so as to appropriately increase variance in the dataset without creating class overlaps. We will discuss three approaches to select features: random selection, feature imbalance, and feature importance. We can group the feature imbalance and feature importance approaches together under the label weighted feature selection. In all, we will call this procedure feature selection henceforth.

The random selection approach, similar to RSS, stochastically selects a subset of features , where , with all features having equal probability. The number of features selected can be fixed or randomly chosen per sample.

The feature imbalance approach selects features based on feature value imbalances, as seen in Figure 4. We can quantify the imbalances using the following method, which differs from past works Sun et al. (2011). We first create , where is the number of features, by counting the number of examples that contain each unique value, resulting in a set of counts , for each feature. Then, for each , we find the clusters for using k-means, with , to obtain the ratio between the difference between the two centroids and the maximum count in . In the end, we obtain a vector of ratios , where , which we can normalize and treat as a probability distribution to select subsets of features. This approach is motivated by placing higher emphasis on more imbalanced features, thus attempting to balance all features’ value representation.
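The imbalance score can be sketched as follows. Since the text refers to two centroids, this sketch assumes k-means with two clusters on the one-dimensional value counts, initialized at the minimum and maximum count; the function names and the uniform fallback when no feature is imbalanced are our own choices.

```python
from collections import Counter


def two_means_1d(values, iters=50):
    """1-D k-means with two clusters, initialized at min and max."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iters):
        near_lo = [v for v in values if abs(v - lo) <= abs(v - hi)]
        near_hi = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo = sum(near_lo) / len(near_lo)
        # If the second cluster is empty, both centroids coincide.
        new_hi = sum(near_hi) / len(near_hi) if near_hi else new_lo
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return lo, hi


def imbalance_distribution(X):
    """Probability of selecting each feature, based on value imbalance.

    X is a list of samples, each a list of (hashable) feature values.
    """
    n_features = len(X[0])
    ratios = []
    for j in range(n_features):
        # Count how many examples contain each unique value of feature j.
        counts = list(Counter(row[j] for row in X).values())
        lo, hi = two_means_1d(counts)
        ratios.append(abs(hi - lo) / max(counts))
    total = sum(ratios)
    if total == 0:
        return [1.0 / n_features] * n_features  # no feature is imbalanced
    return [r / total for r in ratios]
```

A feature whose values occur equally often yields coinciding centroids and a ratio of zero, while a feature dominated by one value yields a ratio close to one.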

The feature importance approach creates a probability distribution based on feature importance, which will be used to select the features. Feature importance provides insight into the relationship between the predictive system and the data, and how much significance each feature in the data may have on the overall system’s performance at the given task at hand. We will consider three approaches to calculate and obtain the feature importance: mean decrease impurity importance, permutation importance, drop-column importance.

Figure 6: Selection distribution from weighted feature selection methods for classifiers.

3.2.1 Feature Importance

Mean decrease impurity, or gini, importance methods rely on the use of tree-based classifiers; the importance is calculated by aggregating, over every tree, the gini decreases attributed to each feature, where the gini decrease is the criterion used to determine the splits. However, Strobl et al. (2006) have shown that the gini impurity importance approach is biased by the scale of measurement or number of categories of the features. This approach can only be used for tree-based classifiers, and thus will only be considered for such models.

Permutation importance methods, on the other hand, calculate the feature importance by evaluating the decrease in performance of a trained model on a test set in which the values of the single feature of interest are shuffled. The shuffling process can be repeated to test multiple permutations of the feature values. Thus, permutation importance only requires the model to be trained once, but also needs the testing dataset. This approach can be used for any model.
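A model-agnostic sketch of permutation importance is given below. It assumes the trained model is exposed as a hypothetical `predict` callable that maps a list of samples to a list of labels, and it uses accuracy as the performance measure; any other metric could be substituted.

```python
import random


def accuracy(predict, X, y):
    """Fraction of samples for which the prediction matches the target."""
    return sum(p == t for p, t in zip(predict(X), y)) / len(y)


def permutation_importance(predict, X_test, y_test, n_repeats=5, seed=0):
    """Mean drop in accuracy when one feature column is shuffled.

    X_test is a list of samples, each a list of feature values.
    Returns one importance value per feature.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X_test, y_test)
    importances = []
    for j in range(len(X_test[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X_test]
            rng.shuffle(column)
            # Rebuild the test set with only feature j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:]
                      for row, v in zip(X_test, column)]
            drops.append(baseline - accuracy(predict, X_perm, y_test))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature that the model ignores produces an importance of exactly zero, since shuffling it cannot change any prediction.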

Similarly, drop-column importance methods obtain the feature importance using the difference between the baseline performance of a model trained on the entire dataset and the performance of a model trained on a limited dataset with a single feature dropped. Thus, drop-column importance can also be used for any model. However, drop-column importance is computationally costly, with a cost proportional to the number of features in the dataset, since the model must be retrained once for each feature.

3.3 Feature Modification

We must now consider how to modify the selected features optimally, a procedure we will call feature modification henceforth. We define optimality as maximizing the variance between the synthetic samples and the original samples while minimizing the class overlap in the new dataset. Our aim is to maximize the improvements in the minority classes, since that is the deficient area. To modify the features, we can naively inject noise into the continuous features and replace discrete features by sampling, but we wish to formulate a more targeted approach. In this paper, we focus on embedding-based modifications, a search-based method in which we leverage learned embeddings to offer insight into how to optimally modify the features.

Embeddings are mappings from the feature domain to a domain that can be more useful and understandable for the classifier to perform the task at hand. In other words, we can think of embeddings as optimized data pre-processing transforms. We consider two methods of embedding generation: classification transfer learning and auto-encoding. However, there are many other embedding generation approaches Bengio et al. (2012) that have been successfully demonstrated in the field of representation learning.

3.3.1 Embedding Generation

Classification transfer learning refers to learning the embedding mapping through training a classifier on the task at hand. We note that both and are parameterized by their own independent weights and . Thus, inference on the model can be seen in Equation 3, where is the prediction, and is the input data.

(3)

Thus, the classifier and the embedding mapping are both optimized for the targeted tasks, given some loss function . Thus, if gradient learning is used, the weight updates can be seen in Equation 4, where is the loss function for classification, such as cross entropy, which is seen in Equation 2.

(4)

Once the classifier and embedding mapping are trained, we can export the embedding mapping to be used for feature modification.

Auto-encoders are a type of neural network that learns to compress and decompress the input data, thus having a bottleneck structure. In other words, let the set of layers be the layers that make up the autoencoder, and for be the number of parameters in layer , then, in our definition, given and , there must exist such that and , where we will refer to as the bottleneck layer. The aim of the autoencoder is to accurately reconstruct the input after this compression and decompression. Inference on auto-encoders can be seen in Equation 5, where is the prediction, and is the input data. We note that the dimensions of and are equal.

(5)

Auto-encoders are optimized given some loss function , often referred to as the reconstruction loss, and the target is the input data . Typically, the loss function will compare the input to the predicted value directly, using a function like mean-square error or cross entropy, again seen in Equation 2. Thus, if gradient learning is used, the weight updates can be seen in Equation 6.

(6)

Once the auto-encoder is optimized, we can export a proper subset of layers , where is the bottleneck layer, as the embedding mapping.

We will also consider variational auto-encoders (VAE), which follow the same inference and optimization methods; however, we represent the bottleneck layer as a probability distribution, often a Gaussian . Thus, the input of layer will be samples from the distribution from . The loss function often used is the evidence lower bound (ELBO), which combines the reconstruction loss normally used in auto-encoders and the Kullback–Leibler divergence loss term , where is a mapping that uses the layers , is a mapping that uses the layers and is the latent space representation, or the output of the . Thus, will be considered to be the embedding of the sample.

When we use the auto-encoders as the embedding function, we remark that the models are trained on the training set independently of the training of the classifier. Thus, for computational efficiency, we can store and reuse embedding models.

For each of the embedding generation approaches, we can use principal component analysis for dimensionality reduction and further compress the embedding.

Returning to our discussion on feature modification, as mentioned before, a sample can be thought of as a set of features . These features can be either discrete or continuous. We will consider both cases, and discuss how we can handle each case separately.

Assuming a set of features that was selected for modification is homogeneously discrete, for each element in , there exists a set of possible values it can take on . We can randomly select a subset of , where and if then . If the search space of unique feature values is small enough, we can try all possible values for ; however, if it is large, we can sample a subset as mentioned. Given the embedding mapping , we calculate the similarity score between and for all , and replace the feature value with the one with the highest similarity score. We repeat this process for every feature in . However, given the embedding mapping , where , we again follow the same procedure, but we must compare and where and where . When we repeat this process for each feature in , we can choose to replace with the updated for the subsequent feature modifications, or choose not to replace .

Assuming a set of features that was selected for modification is homogeneously continuous, for each element in , we define a range with a fixed step where and . Then, we proceed similarly to the discrete feature case for feature modification.
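The search described in this section can be sketched as below, under several assumptions that the text leaves open: the embedding is a hypothetical callable `embed` acting on the whole sample, cosine similarity is used as the similarity score, and the current value of a feature is excluded from its own candidates. For continuous features, the candidate values come from a fixed-step grid built by the hypothetical helper `grid`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def grid(lo, hi, step):
    """Candidate values for a continuous feature: lo, lo+step, ..., hi."""
    n = int(round((hi - lo) / step))
    return [lo + i * step for i in range(n + 1)]


def modify_features(x, feature_idx, candidates, embed, carry_forward=True):
    """Replace each selected feature with its most similar alternative.

    candidates[j] lists the values to try for feature j. If carry_forward
    is True, each later feature is searched from the already-updated
    sample; otherwise every search starts from the original sample.
    """
    original = list(x)
    current = list(x)
    for j in feature_idx:
        base = current if carry_forward else original
        reference = embed(base)
        best_value, best_sim = None, -float("inf")
        for v in candidates[j]:
            if v == base[j]:
                continue  # skip the unchanged value
            trial = base[:j] + [v] + base[j + 1:]
            sim = cosine(reference, embed(trial))
            if sim > best_sim:
                best_value, best_sim = v, sim
        if best_value is not None:
            current[j] = best_value
    return current
```

Choosing the most similar alternative keeps the synthetic sample close to the original in embedding space, which is one way to add variance while limiting the risk of class overlap.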

3.4 Synthetic Sample Usage

Once we obtain the synthetic samples, we must decide how the samples will be utilized. We propose two considerations: cold or warm start, and iterative or non-iterative.

Initially, the model learns on the training set . Then, the set of synthetic samples is created and added to the training set . We can choose to either re-initialize the weights of the model, which we will call cold-start, or keep the weights from the previous training cycle, which we will call warm-start. We must now decide whether to repeat the process of creating a new from the newly trained model, making the process iterative, or end the training cycle entirely, making the process non-iterative. If the process is iterative, we can also choose to drop and replace the previous partially or wholly with , or append to the current at each step. The latter option can quickly become memory intensive.
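The design choices above can be summarized in a short training loop. This is a schematic sketch only: `make_model`, `fit`, and `generate` are hypothetical callables standing in for model construction, training, and the synthetic sample generation of Sections 3.1 to 3.3, and only whole replacement or appending of the synthetic set is shown.

```python
def train_with_synthetic(train, make_model, fit, generate,
                         warm_start=True, n_iters=1, append=False):
    """Train, generate synthetic samples, and retrain.

    warm_start=False re-initializes the model before each retraining
    (cold start). n_iters > 1 makes the process iterative. append=True
    keeps earlier synthetic samples instead of replacing them.
    """
    model = make_model()
    fit(model, train)  # initial training on the original training set
    synthetic = []
    for _ in range(n_iters):
        new = generate(model, train)
        synthetic = synthetic + new if append else new
        if not warm_start:
            model = make_model()  # cold start: discard previous weights
        fit(model, train + synthetic)
    return model, synthetic
```

With `append=True` the synthetic set grows at every iteration, which reflects the memory concern noted above.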

3.5 Deep Generative Models

In this section, we will formalize and discuss deep generative models (DGM) in more depth. Generative models can be defined as modeling the conditional probability , the joint probability , or the prior . A DGM is expressed using a deep neural network, parameterized by learnable weights . Equivalently, DGMs are mappings , , or . In the context of SDG, DGMs can be viewed as an end-to-end approach to generating synthetic samples, bypassing the need for Sections 3.2 and 3.3. For our purposes, the objective is to learn how to generate new samples following the definition of optimality discussed in Section 3.3. Thus, we can set the input of the DGM to be either random noise or samples selected using methods from Section 3.1. A potentially serious issue is the possibility of the generative model learning to map the input to itself without any variance, especially for traditional auto-encoders, given the nature of the optimization. We suggest that future work add a regularization term to the loss function to counter this over-fitting problem more directly; this paper does not propose a proven solution to this issue. Instead, we address this issue by simply preventing over-fitting through early stopping with heuristics, discussed in further detail in Section 4. Now, we will discuss two main approaches to training DGMs: unsupervised and adversarial training.

3.5.1 Deep Generative Model Training

Unsupervised training refers to learning the prior , thus in more practical terms, optimizing the model using only the input . An example of unsupervised training is training auto-encoders, where we treat the input as the output as well. Instead of exporting the embedding function, as seen in Section 3.3.1, we treat the auto-encoder as the generator model , where is the synthetic sample. In this paper, we will solely focus on auto-encoders and VAEs as traditional unsupervised learning methods for deep generative modeling.

Adversarial training adds to unsupervised learning a discriminator model, which learns to distinguish real from fake data. At a high level, the generator model attempts to trick the discriminator model by learning how to generate samples that are realistic. We will consider the traditional approach and the conditional approach.

The traditional approach is to train the generator and discriminator models separately using unsupervised and supervised learning respectively. The training for the generator model will follow the procedure discussed in Section 3.3.1; however, we aggregate the discriminator’s inaccuracy and the reconstruction loss, scaled by and respectively, where , to obtain our generative model’s loss. The discriminator is trained as a standard classifier between real and fake data, using the dataset and the generator model. Alternatively, we can have the discriminator predict on the latent representation, or the embedding, as seen in Turhan and Bilge (2018). For this approach, we will focus on adversarial auto-encoders and adversarial VAEs.

The conditional approach follows most of what is stated in the traditional approach; however, we aim to have both the generative and discriminative models conditioned on the class label, . Similar to practices discussed in Turhan and Bilge (2018), we can simply treat as another input layer, paired with an embedding layer, which is trained accordingly. For this approach, we will focus on conditional generative adversarial networks using auto-encoders and VAEs as the generative model.

4 Results

In this section, we will evaluate the methods discussed in Section 3 using the student performance dataset and classifiers described in Section 2. We will compare our results with other balancing methods from past works, such as random oversampling, random under-sampling, SMOTE, Tomek Links, edited nearest neighbors (ENN), and various combinations of these approaches.

We see that if the classifiers are trained on the dataset normally, optimizing the cross entropy loss, we obtain the performance seen in Figure 7, which follows our hypothesis of disproportionate performance between minority and majority classes. We seek to mitigate these effects of class imbalances with the SDG methods we have proposed.

Figure 7: Percent error distribution on the testing set over all target classes averaged over classifiers from Section 2.2
Figure 8: Max % performance improvement with traditional sampling methods for imbalanced student performance dataset on the testing set.

We used the balancing methods mentioned in Section 1, and we obtained the max % improvements over 50 trials. The results can be seen in Figure 8. As we can see, they all provide a performance boost on all performance metrics, except for SMOTE+ENN and random under-sampling, which degrade the classifier’s performance. There exist minority classes with samples, so under-sampling an already limited number of samples would lose immense information; to keep under-sampling methods from being completely useless, we limited the under-sampling to majority classes.

We now will discuss our results from our experiments using the SEDG and DGM methods mentioned in Section 3 and using the NNModel and the traditional models mentioned in Section 2.2 as the classifier. For all of the experiments below, we set the number of synthetic samples that will be generated to samples. All recorded improvements are on the testing set, which encompasses of the dataset. We define improvement in Equation 7, where is the score from some performance metric from the dataset with the synthetic samples, and is the score from the performance metric from the original dataset.

(7)
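As a rough illustration of the improvement measure, the sketch below assumes it is the relative change of a metric expressed as a percentage. This form, and the names `percent_improvement` and `max_improvement`, are assumptions made for the example rather than a restatement of Equation 7.

```python
def percent_improvement(score_synthetic, score_original):
    """Relative change of a metric in percent (assumed form of the measure)."""
    if score_original == 0:
        raise ValueError("original score must be non-zero")
    return 100.0 * (score_synthetic - score_original) / score_original

def max_improvement(trial_scores, score_original):
    """Best improvement over a list of per-trial scores."""
    return max(percent_improvement(s, score_original) for s in trial_scores)

# Hypothetical scores from three trials against a baseline of 0.60.
trials = [0.61, 0.66, 0.63]
best = max_improvement(trials, 0.60)
print(round(best, 2))
```

Reporting the maximum over trials, as done throughout this section, describes the best case; it says nothing about the spread across trials.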

4.1 NNModel as Classifier

We first use the NNModel defined in Section 2.2 as our classifier and test our embedding-based SDG methods and DGMs on the student performance dataset.

Figure 9 shows the max improvements of the different sample selection methods from Section 3.1 over 50 trials. The PPSS method, or the AUC approach, yields the highest improvement in predictive accuracy. Based on the macro-AUC and class AUC scores, however, no method clearly outperforms the others; all of them perform comparably and target the minority classes more effectively than the traditional sampling methods seen in Figure 8.

Figure 9: Performance improvement on different metrics over different sample selection methods discussed in Section 3.1 using NNModel. The graphs above use the following notation: score refers to predictive accuracy, auc_macro refers to macro-AUC score, and auc_class refers to AUC score per class.
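To make the auc_class and auc_macro notation concrete, the sketch below computes a one-vs-rest AUC per class and takes their unweighted mean. We assume a one-vs-rest formulation here; the function names and the toy probabilities are illustrative only.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def class_and_macro_auc(probabilities, labels, n_classes):
    """One-vs-rest AUC per class (auc_class) and their unweighted mean (auc_macro)."""
    per_class = []
    for c in range(n_classes):
        scores = [row[c] for row in probabilities]
        per_class.append(binary_auc(scores, [y == c for y in labels]))
    return per_class, sum(per_class) / n_classes

# Toy predicted class probabilities for four samples and three classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.5, 0.15, 0.35]]
labels = [0, 1, 2, 1]
auc_class, auc_macro = class_and_macro_auc(probs, labels, 3)
print(auc_class, auc_macro)
```

Because the macro average weights every class equally, a poorly handled minority class lowers auc_macro as much as a poorly handled majority class would.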

We now compare the max improvements of the different generative methods over 50 trials, setting the models from Section 3.5 against the traditional embedding-based SDG method from Section 3. Figure 10 shows that the DGMs yield only minimal improvements in predictive accuracy and macro-AUC score and do not perform comparably to the embedding-based SDG methods. We suspect this is related to over-fitting, which we addressed only through early stopping with a basic heuristic: if the majority of the training data share a set proportion of their features with the target sample, we end training. We suggest that future work handle over-fitting in the context of SDG more effectively.
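One possible reading of this early-stopping heuristic is sketched below. The thresholds `feature_share` and `data_share`, the function names, and the use of exact value matches to decide whether a feature is shared are all hypothetical choices made for the example.

```python
def shared_fraction(sample, other):
    """Fraction of feature positions where two samples hold the same value."""
    return sum(a == b for a, b in zip(sample, other)) / len(sample)

def should_stop(target_sample, training_rows, feature_share=0.8, data_share=0.5):
    """Stop when more than `data_share` of the rows match the target closely.

    Both thresholds are hypothetical placeholders, not values from the paper.
    """
    close = sum(shared_fraction(target_sample, row) >= feature_share
                for row in training_rows)
    return close / len(training_rows) > data_share

# Toy categorical data: two of the three rows share at least 80% of features.
target = [1, 0, 2, 1, 0]
rows = [[1, 0, 2, 1, 0], [1, 0, 2, 1, 1], [0, 1, 0, 0, 1]]
print(should_stop(target, rows))
print(should_stop(target, rows, feature_share=1.0))
```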

Figure 10: Performance improvement on different metrics over different generation methods discussed in Section 3.5 using NNModel. The graphs above use the same notation as Figure 9. In this graph, gen_ae represents generative auto-encoder, gen_vae represents generative VAE, gen_aae represents generative adversarial auto-encoder, gen_avae represents generative adversarial VAE, gen_caae represents generative conditional adversarial auto-encoder, and gen_cavae represents generative conditional adversarial VAE.

Similarly, Figure 11 shows the max improvements of the different feature selection methods used in embedding-based SDG, discussed in Section 3.2, over 50 trials. We confirm that weighted feature selection is more effective than random feature selection in all performance metrics. For both accuracy and macro-AUC the difference is clear, and the class AUC scores show that most of the minority classes, as well as the majority classes, are handled better with weighted feature selection.

Figure 11: Performance improvement on different metrics over different feature selection methods discussed in Section 3.2 using NNModel. The graphs above use the same notation as Figure 9. The letter i represents feature selection with some form of feature weighting, such as feature importance or feature imbalance, and the letter r represents random feature selection.
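The difference between the two selection strategies can be sketched as follows. This is a minimal illustration assuming that weighted selection draws features without replacement in proportion to a weight vector; the function name and the weights are made up for the example.

```python
import random

def select_features(n_features, k, weights=None, seed=0):
    """Pick k distinct feature indices, uniformly or in proportion to weights."""
    rng = random.Random(seed)
    if weights is None:
        return rng.sample(range(n_features), k)  # random selection ("r")
    remaining = list(range(n_features))
    w = list(weights)
    chosen = []
    for _ in range(k):
        # Draw one index in proportion to its weight, then remove it.
        pick = rng.choices(range(len(remaining)), weights=w, k=1)[0]
        chosen.append(remaining.pop(pick))
        w.pop(pick)
    return chosen

# Hypothetical importance weights for five features (assumed positive).
importance = [0.1, 0.4, 0.2, 0.2, 0.1]
print(select_features(5, 3, weights=importance))  # weighted selection ("i")
print(select_features(5, 3))                      # random selection ("r")
```

The weights could come from either feature importance or feature imbalance; the sampling step is the same in both cases.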

Figure 12 shows the max improvements over different feature weighing methods used in feature selection, discussed in Section 3.2, over 50 trials. The feature imbalance approach performs better in all scoring metrics, with permutation importance and drop-column importance, the two feature importance methods, performing similarly in predictive accuracy. However, permutation importance performs noticeably better in macro-AUC and class AUC than drop-column importance.

Figure 12: Performance improvement on different metrics over different feature weighing approaches discussed in Section 3.2 using NNModel. The graphs above use the same notation as Figure 9.
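For readers unfamiliar with permutation importance, the sketch below shows the general idea: shuffle one feature column at a time and measure the resulting drop in accuracy. The toy model and data are assumptions made for the example. Drop-column importance differs in that the classifier is retrained without the column instead of being evaluated on shuffled values.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which the model predicts the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    base = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(base - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy model that only looks at feature 0, so feature 1 should score zero.
rows = [[0, 1], [1, 0], [0, 0], [1, 1], [0, 1], [1, 0]]
labels = [r[0] for r in rows]
imp = permutation_importance(lambda r: r[0], rows, labels)
print(imp)
```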

The max improvements of the different feature modification methods, discussed in Section 3.3, over 50 trials are shown in Figure 13. In terms of macro-AUC score, nearest neighbor search on features reduced by PCA and cosine similarity on embedded features both outperformed the other methods. Random feature modification does better in terms of predictive accuracy. For class AUC score, cosine similarity on features reduced by PCA helps the minority classes the most; however, this approach performs worse than random modification in terms of macro-AUC score.

Figure 13: Performance improvement on different metrics over different feature modification methods discussed in Section 3.3 using NNModel. The graphs above use the same notation as Figure 9. In this graph, r represents random modification, pn represents nearest neighbor search on features reduced by principal component analysis (PCA), pc represents cosine similarity on features reduced by PCA, en represents nearest neighbor search on embedded features, and ec represents cosine similarity on embedded features.
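The en and ec variants can be illustrated with a small sketch in which a feature value is replaced by its closest neighbor in embedding space. The 2-D embeddings, the value names, and the function `modify_value` are hypothetical; the sketch only shows how the two similarity measures can disagree.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def modify_value(current, embeddings, method="ec"):
    """Return the candidate value closest to `current` in embedding space."""
    others = [v for v in embeddings if v != current]
    if method == "ec":  # cosine similarity on embedded features
        return max(others, key=lambda v: cosine_similarity(embeddings[current], embeddings[v]))
    if method == "en":  # nearest neighbor search on embedded features
        return min(others, key=lambda v: euclidean(embeddings[current], embeddings[v]))
    raise ValueError(f"unknown method: {method}")

# Hypothetical 2-D embeddings for the values of one categorical feature.
emb = {"low": [1.0, 0.1], "medium": [0.9, 0.4], "high": [0.1, 1.0], "very_low": [3.0, 0.3]}
print(modify_value("low", emb, "ec"))
print(modify_value("low", emb, "en"))
```

Cosine similarity ignores vector length, so it prefers the value whose embedding points in the same direction, while nearest neighbor search prefers the value that is closest in absolute terms.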

We show the max improvements of the different embedding generation methods for feature modification, discussed in Section 3.3.1, over 50 trials in Figure 14. For both accuracy and macro-AUC score, the embedding matrix from the NNModel outperforms the other methods, with the VAE a close second in terms of macro-AUC score. For class AUC scores, the VAE and the embedding matrix from the NNModel improve the scores of the minority classes the most.

Figure 14: Performance improvement on different metrics over different embedding methods discussed in Section 3.3.1 using NNModel. The graphs above use the same notation as Figure 9. In this graph, model represents using the embedding learned by the NNModel for feature modification, ae represents using the embedding learned by an independent auto-encoder for feature modification, and vae represents using the embedding learned by an independent VAE for feature modification.

In Figure 15, we show the max improvements over different synthetic sample usage approaches discussed in Section 3.4 over 50 trials. We find that for all performance metrics, cold start performs better than warm start approaches.

Figure 15: Performance improvement on different metrics over different synthetic sample usage approaches discussed in Section 3.4 using NNModel. The graphs above use the same notation as Figure 9. In this graph, scores with the prefix f indicate that cold start training was used, and scores with the prefix c indicate that warm start training was used.
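The distinction between the two usage approaches can be sketched with a toy model. We assume here that cold start means re-initializing the classifier and training on the original plus synthetic data, and that warm start means continuing from already-trained weights; the 1-D logistic regression and the data are illustrative stand-ins for the NNModel.

```python
import math

def train(data, weights=(0.0, 0.0), epochs=200, lr=0.5):
    """Gradient descent for 1-D logistic regression, starting from `weights`."""
    w, b = weights
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return w, b

# Hypothetical original and synthetic samples as (feature, label) pairs.
original = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
synthetic = [(-1.5, 0), (1.5, 1)]

pretrained = train(original)
cold = train(original + synthetic)                      # re-initialized weights
warm = train(original + synthetic, weights=pretrained)  # continue from trained weights
print(cold, warm)
```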
Figure 16: Performance improvement comparing traditional sampling methods against DGMs and SEDG using NNModel as classifier. The graphs above use the same notation as Figure 9.

In Figure 16, we find that the SEDG method outperforms all classic re-sampling methods on all performance metrics when using NNModel as the classifier. In fact, many of the classic sampling methods perform poorly and harm performance, even when given 50 trials. Most notably, the class-AUC scores demonstrate a significant positive difference in targeting the minority classes when using SEDG methods.

4.2 Traditional Classifiers

We now consider the traditional classifiers from Section 2.2 and how SDG methods affect their performance. First, we show that when the embeddings from the deep models are transferred to these learners as data pre-processing mappings, their performance increases noticeably, as seen in Table 1. This result supports the use of embedding-based SDG methods with these classifiers, without concern that the embeddings are incompatible with them.

Model % improvement
OvR SVM 8.4615
OvO SVM 3.0769
Random Forest 13.846
XGBoost 4.2307
Table 1: Performance improvement when using the trained NNModel's embedding as data pre-processing for the following classifiers. In the table above, OvR SVM represents one-vs-rest SVM and OvO SVM represents one-vs-one SVM.
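Using a learned embedding as a pre-processing mapping can be sketched as a simple lookup. We assume here that each feature is a categorical index with its own embedding table; the tables and the function `embed_rows` are hypothetical and stand in for the embedding learned by the NNModel.

```python
def embed_rows(rows, embedding_tables):
    """Replace each categorical index with its embedding vector and concatenate."""
    embedded = []
    for row in rows:
        vector = []
        for feature_idx, value in enumerate(row):
            vector.extend(embedding_tables[feature_idx][value])
        embedded.append(vector)
    return embedded

# Hypothetical embedding tables: one list of vectors per categorical feature.
tables = [
    [[0.1, 0.9], [0.8, 0.2]],  # feature 0 has 2 values, embedding dimension 2
    [[0.5], [0.3], [0.7]],     # feature 1 has 3 values, embedding dimension 1
]
rows = [[0, 2], [1, 0]]
features = embed_rows(rows, tables)
print(features)
```

The resulting dense vectors would then replace the raw features as input to the SVM, random forest, or XGBoost classifiers.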

We therefore test the embedding-based SDG methods and the deep generative models on the traditional classifiers, following the same procedure as in Section 4.1.

Figure 17: Performance improvement on different metrics over different sample selection methods discussed in Section 3.1 using traditional classifiers. The graphs above use the same notation as Figure 9.

Figure 17 shows the max improvements of the different sample selection methods from Section 3.1 over 50 trials. Unlike the results in Section 4.1, the PaSS method yields the highest improvement in macro-AUC and is a close second in accuracy, while the PPSS method performs best in class AUC score and accuracy.

Figure 18: Performance improvement on different metrics over different generation methods discussed in Section 3.5 using traditional classifiers. The graphs above use the same notation as Figure 10.

The max improvements of the different generative methods, discussed in Section 3.5, compared to the SEDG methods from Section 3 over 50 trials are shown in Figure 18. SEDG methods perform best in accuracy, but the generative adversarial autoencoder outperforms them in terms of macro-AUC score. For class AUC score, the generative autoencoder seems to perform best at improving the minority classes.

Figure 19: Performance improvement on different feature selection methods discussed in Section 3.2 using traditional classifiers. The graphs above use the same notation as Figure 11.

Figure 19 shows the max improvements of the different feature selection methods from Section 3.2, used in the embedding-based SDG method from Section 3, over 50 trials. The results again confirm that weighted feature selection is more effective than random feature selection.

Figure 20: Performance improvement on different feature weighing methods discussed in Section 3.2.1 using traditional classifiers. The graphs above use the same notation as Figure 12

Figure 20 shows the max improvements of the different feature weighting methods used in feature selection, discussed in Section 3.2, over 50 trials. Feature imbalance again performs best in terms of accuracy but falters in the other performance metrics. For macro-AUC and class AUC scores, mean decrease importance for tree-based classifiers seems to improve the minority classes more than the other approaches.

Figure 21: Performance improvement on different metrics over different feature modification methods discussed in Section 3.3 using traditional classifiers. The graphs above use the same notation as Figure 13.

The max improvements of the different feature modification methods, discussed in Section 3.3, over 50 trials are shown in Figure 21. All modification methods performed similarly in terms of class AUC scores, with random modification taking a slight lead. In terms of accuracy and macro-AUC, nearest neighbor search on embedded features was the most successful.

Figure 22: Performance improvement on different embedding methods discussed in Section 3.3.1 using traditional classifiers. The graphs above use the same notation as Figure 9.

We show the max improvements of the different embedding generation methods for feature modification, discussed in Section 3.3.1, over 50 trials in Figure 22. For accuracy, the autoencoder outperforms the other methods, with the VAE and the embedding matrix from the NNModel performing similarly. For both macro-AUC and class AUC score, the embedding matrix from the NNModel proves to be the best.

Figure 23: Performance improvement comparing classic re-sampling methods against DGMs and SEDG with traditional machine learning classifiers. The graphs above use the same notation as Figure 9.

In Figure 23, we find that although the SEDG method does not excel in % improvement of accuracy and macro-AUC score for traditional machine learning classifiers, it targets the minority classes more effectively in class-AUC score than all other methods. Additionally, when all performance metrics are considered together, SEDG methods, like DGMs, perform best overall, because the classic re-sampling approaches that excel in one metric often fail to replicate that success in the others.

Figure 24: Likelihood of feature modification and feature value candidates using generative VAE from 100 synthetic samples per class.

4.3 Understanding Student Performance

In this section, we examine the synthetic samples of each class in more depth to improve our understanding of student performance. Similar to feature importance, we show how likely each feature is to change without significantly altering the data distribution. With synthetic samples, however, we can also see how these features change, and thus which feature values are possible candidates.

In Figure 24, we use a VAE as the DGM to create synthetic samples for each class, showing how likely each feature in each class is to change and which feature values are reasonable. For example, for low-performing students, the likelihood of changing absences is low, with a greater number of absences being a possible candidate. We also see that some features, such as travel time, school supplies, and famrel, are highly susceptible to change, suggesting little distinction between their possible candidate values.
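The kind of statistic discussed here can be sketched as follows. We assume that each synthetic sample can be paired with the original sample it was derived from; the function `modification_stats`, the feature names, and the toy values are illustrative and do not reproduce the data behind Figure 24.

```python
from collections import Counter

def modification_stats(pairs, feature_names):
    """Per feature: share of pairs where it changed, and counts of new values."""
    changed = Counter()
    candidates = {name: Counter() for name in feature_names}
    for original, synthetic in pairs:
        for name, old, new in zip(feature_names, original, synthetic):
            if old != new:
                changed[name] += 1
                candidates[name][new] += 1
    likelihood = {name: changed[name] / len(pairs) for name in feature_names}
    return likelihood, candidates

# Hypothetical (original, synthetic) pairs for one class.
names = ["absences", "traveltime"]
pairs = [((2, 1), (2, 3)), ((4, 1), (6, 2)), ((0, 2), (0, 1)), ((3, 1), (3, 1))]
likelihood, candidates = modification_stats(pairs, names)
print(likelihood)
print(candidates["traveltime"])
```

A feature with a low modification likelihood is one the generator tends to preserve for that class, while the value counts indicate which replacements it considers plausible.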

5 Conclusion

In this paper, we proposed methods for embedding-based SDG and investigated DGMs for class-imbalanced tasks, specifically student performance tasks. We tested the SEDG approach and DGMs against standard balancing methods. With our NNModel as the classifier, the proposed approaches achieved greater % improvements in accuracy, macro-AUC score, and class-AUC score; with traditional machine learning classifiers on the student performance task, they achieved a more comprehensive improvement across metrics. We also introduced a technique for greater interpretability of SDG methods, which examines the generated synthetic samples to see how likely each feature is to be modified and which values it can take on for each class.

6 Acknowledgement

We thank Samuel Schmidgall (George Mason University), Ji Kuo (George Mason University), and Jay Deorukhkar (George Mason University) for helping create and build up the core ideas of the SEDG framework, and for their efforts in paper revisions. We thank Dr. Huzefa Rangwala (George Mason University) for useful suggestions and advice that motivated key components such as weighted feature selection. We thank Yuanqi Du (George Mason University) for assistance in selecting the educational domain as the area of focus and in finding the appropriate dataset.

References

  • Y. Bengio, A. C. Courville, and P. Vincent (2012) Unsupervised feature learning and deep learning: A review and new perspectives. CoRR abs/1206.5538. External Links: Link, 1206.5538 Cited by: §3.3.
  • K. W. Bowyer, N. V. Chawla, L. O. Hall, and W. P. Kegelmeyer (2011) SMOTE: synthetic minority over-sampling technique. CoRR abs/1106.1813. External Links: Link, 1106.1813 Cited by: §1.
  • L. Breiman (2001) Random forests. Mach. Learn. 45 (1), pp. 5–32. External Links: ISSN 0885-6125, Link, Document Cited by: §2.2.
  • T. Chen and C. Guestrin (2016) XGBoost: A scalable tree boosting system. CoRR abs/1603.02754. External Links: Link, 1603.02754 Cited by: §2.2.
  • P. Cortez and A. Silva (2008) Using data mining to predict secondary school student performance. EUROSIS. Cited by: §2.1, §2.
  • J. H. Friedman (2001) Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29 (5), pp. 1189–1232. External Links: ISSN 00905364, Link Cited by: §2.2.
  • H. Guo and H. L. Viktor (2004) Learning from imbalanced data sets with boosting and data generation: the databoost-im approach. SIGKDD Explor. Newsl. 6 (1), pp. 30–39. External Links: ISSN 1931-0145, Link, Document Cited by: §1.
  • H. He, Y. Bai, E. A. Garcia, and S. Li (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328. External Links: Document Cited by: §1.
  • M. A. Hearst (1998) Support vector machines. IEEE Intelligent Systems 13 (4), pp. 18–28. External Links: ISSN 1541-1672, Link, Document Cited by: §2.2.
  • S. S. Mullick, S. Datta, and S. Das (2019) Generative adversarial minority oversampling. CoRR abs/1903.09730. External Links: Link, 1903.09730 Cited by: §1.
  • F. Provost and T. Fawcett (1997) Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, KDD’97, pp. 43–48. Cited by: §1.
  • F. J. Provost and G. M. Weiss (2011) Learning when training data are costly: the effect of class distribution on tree induction. CoRR abs/1106.4557. External Links: Link, 1106.4557 Cited by: §1.
  • J. R. Quinlan (1986) Induction of decision trees. Mach. Learn. 1 (1), pp. 81–106. External Links: ISSN 0885-6125, Link, Document Cited by: §2.2.
  • P. Shamsolmoali, M. Zareapoor, L. Shen, A. H. Sadka, and J. Yang (2020) Imbalanced data learning by minority class augmentation using capsule adversarial networks. External Links: 2004.02182 Cited by: §1.
  • C. Strobl, A. Boulesteix, A. Zeileis, and T. Hothorn (2006) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8, pp. 25. Cited by: §3.2.1.
  • Y. Sun, A. Wong, and M. S. Kamel (2011) Classification of imbalanced data: a review. International Journal of Pattern Recognition and Artificial Intelligence 23. External Links: Document Cited by: §1, §1, §3.2.
  • [17] S. Susan and A. Kumar The balancing trick: optimized sampling of imbalanced datasets—a brief survey of the recent state of the art. Engineering Reports n/a (n/a), pp. e12298. External Links: Document, Link, https://onlinelibrary.wiley.com/doi/pdf/10.1002/eng2.12298 Cited by: §1.
  • C. G. Turhan and H. S. Bilge (2018) Recent trends in deep generative models: a review. In 2018 3rd International Conference on Computer Science and Engineering (UBMK), pp. 574–579. External Links: Document Cited by: §1, §3.5.1, §3.5.1.