Generalized Inverse Classification

10/05/2016 ∙ by Michael T. Lash, et al. ∙ The University of Iowa

Inverse classification is the process of perturbing an instance in a meaningful way such that it is more likely to conform to a specific class. Historical methods that address such a problem are often framed to leverage only a single classifier, or specific set of classifiers. These works are often accompanied by naive assumptions. In this work we propose generalized inverse classification (GIC), which avoids restricting the classification model that can be used. We incorporate this formulation into a refined framework in which GIC takes place. Under this framework, GIC operates on features that are immediately actionable. Each change incurs an individual cost, either linear or non-linear. Such changes are subjected to occur within a specified level of cumulative change (budget). Furthermore, our framework incorporates the estimation of features that change as a consequence of direct actions taken (indirectly changeable features). To solve such a problem, we propose three real-valued heuristic-based methods and two sensitivity analysis-based comparison methods, each of which is evaluated on two freely available real-world datasets. Our results demonstrate the validity and benefits of our formulation, framework, and methods.




1 Introduction

In typical classification settings, a model is trained and used to make predictions about some event of interest. Depending upon the predictive task, some action may then be taken. In a medical domain, a patient may be monitored more carefully if a prediction yields a high likelihood of some negative outcome. However, in the same setting, we may want to know what actions can be taken to minimize the patient’s chances of said adverse event occurring. The process of finding the optimal set of actions, or changes, that can be taken in order to minimize the probability of such events occurring is what we term inverse classification.

This example domain further highlights the nature and importance of the problem. Consider, specifically, the problem of mitigating the long-term risk of cardiovascular disease (CVD) of Patient 29 taken from our experiments below. Initially, we use a constructed model to estimate this patient’s risk, or probability, of developing CVD, which is found to be 55%. This estimate is based on pertinent factors such as medications, lab measurements (e.g., blood glucose), lifestyle (e.g., diet), and demographics (e.g., age).

Following an initial assessment of risk, we would like to work ‘backwards’ through our learned model to obtain recommendations that reduce Patient 29’s probability of CVD. Past methods, however, restrict the set of classifiers that can be used to obtain such recommendations, often affording the use of only a single algorithm. Such restrictions are prohibitive because the classifiers they exclude may have useful properties. These might include high predictive accuracy, such as that of the random forest used to obtain Patient 29’s initial level of risk, or a high degree of explanatory power, which may help the patient better understand why certain recommendations were made. Therefore we propose generalized inverse classification (GIC), which permits the use of virtually any classification function, requiring only a simple non-prohibitive assumption (further discussed in Section 3). This is the first contribution of this work.

Our second contribution is to show that the problem can be solved using heuristics. Specifically, we propose three real-valued heuristic-based methods that solve this problem, which we compare to two sensitivity analysis-based baseline methods. We demonstrate the efficacy of these methods on two freely available datasets, one of which includes Patient 29, whose risk we lower from 55% to less than 30% (Section 4). Thirdly, we refine an existing inverse classification framework to include non-linear cost-to-change functions, which we then incorporate into our experiments. Section 3 outlines the framework, the generalized inverse classification problem, the three heuristic-based methods, and two sensitivity analysis-based methods, while Section 5 concludes the paper.

2 Related Work

Inverse classification is akin to the sub-discipline of sensitivity analysis, which examines the impact of predictive algorithm input on the output. While there are many forms of sensitivity analysis [1, 2], local and variable perturbation methods are most similar. Based on this we develop two sensitivity analysis-based methods, described in Section 3, for comparison purposes.

Past works on inverse classification differ along three dimensions: operational data types, algorithmic mechanism, and framework. The operational data types, which encode the data on which inverse classification is performed, are either discrete [3, 4, 5], continuous [6, 7, 8, 9], or both [10]. The latter two allow for more fine-grained results, leading to greater precision in the recommendations made.

The algorithmic mechanism operates on these data types by finding the feasible recommendations that optimize the predicted probability. Such optimization strategies are constructed to be greedy [3, 4, 5, 6] or non-greedy [7, 8, 9, 10].

The framework ensures that recommended changes are feasible and implementable. It covers: (1) identifying features that can be changed (e.g., age cannot), (2) the difficulty in implementing changes (feature-specific costs), and (3) a restriction on the cumulative change (budget). In [3, 8] there are no constraints imposed. Of those that do impose constraints:

  • In [7] constraints are imposed that lead to non-extreme recommendations, but none of (1), (2), or (3) is considered.

  • In [6] (2) is imposed, but not (1) or (3).

  • In [4, 5] only (1) and (2) are considered.

  • Different notions of (1), (2), and (3) are explored in [10] by matching discrete entities to compute features.

  • In [9] (1), (2), and (3) are all considered, but nondifferentiable classifiers are not permitted.

Real-valued heuristic-based methods are also relevant to this work. These methods include variable neighborhood search (VNS), genetic algorithms [11], and hill-climbing [4, 6]. In this work we elect to focus on genetic algorithms, hill-climbing, and local search, which can be viewed as a simpler form of VNS. As will be shown, by using heuristic-based methods, we can be as general as possible in solving the inverse classification problem.

3 Generalized Inverse Classification

In this section we first briefly discuss GIC. Subsequently, we outline our inverse classification framework. Next, we relate three heuristic-based methods that can be used to solve GIC. Finally, we introduce two sensitivity analysis-based methods that will be compared to our heuristic-based methods.

Under the GIC formulation no assumptions are made about the classification function $f$ other than $f(x) \in [0, 1]$. Such a level of generality allows us to obtain optimal solutions for nondifferentiable functions. These functions include popular ensemble techniques such as bagging [12] and boosting [13], as well as C4.5 decision trees. Classifiers such as these are often found to have high predictive power (ensembles) or are more readily interpretable and explainable (e.g., C4.5 decision trees), which is why it is so important that methods be developed that incorporate such classifiers.

3.1 Framework

Suppose $\{(x^i, y^i)\}_{i=1}^{n}$ is a dataset of $n$ instances where $x^i \in \mathbb{R}^p$ is a column feature vector of length $p$ and $y^i$ is the binary label associated with $x^i$ for $i = 1, \dots, n$. Let $f: \mathbb{R}^p \to [0, 1]$ be a function that computes the probability of $x$ being in the positive class (with $y = 1$). Typically, $f$ is based on a certain classification model built on the dataset. Given a new instance, with feature vector $\bar{x}$, we want to modify some components of $\bar{x}$, subject to some budget constraints, so that the predicted probability of $\bar{x}$ being positive is minimized.

We further partition the features into three subsets, $U$, $D$, and $I$, which represent the sets of unchangeable, directly changeable and indirectly changeable features, respectively. When we optimize the features, we can only determine the value for $x_D$, and the values of $x_I$ will depend on $x_U$ and $x_D$. Therefore, we model the dependency of $x_I$ on $x_U$ and $x_D$ as $x_I = H(x_U, x_D)$, where the mapping $H$ is assumed to be differentiable. Note that the mapping $H$ can be any predictive model constructed using the same training instances. Therefore, we represent $x$ as $(x_U, x_I, x_D)$ to distinguish these three blocks so that the feature optimization problem can be formulated as

$$\min_{x_D} \; f\big(\bar{x}_U, H(\bar{x}_U, x_D), x_D\big) \quad \text{s.t.} \quad c(x_D - \bar{x}_D) \le B, \quad l_j \le x_j \le u_j \ \text{ for } j \in D. \tag{1}$$

Here, we assume the reasonable value of each directly changeable feature in $x_D$ must be within an interval, denoted by $[l_j, u_j]$ for $j \in D$. If $x_j$ can only be increased (decreased), we can set $l_j = \bar{x}_j$ ($u_j = \bar{x}_j$). In addition, $c$ is a convex cost function that measures the cost for changing $\bar{x}_D$ to $x_D$ and $B$ is the total budget we have to support this change. We require .

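Here, we provide two examples of $c$. The first assumes the cost increases linearly as $x_D$ deviates from $\bar{x}_D$, which is

$$c(x_D - \bar{x}_D) = \sum_{j \in D} \Big( c_j^+ (x_j - \bar{x}_j)_+ + c_j^- (x_j - \bar{x}_j)_- \Big), \tag{2}$$

where $(a)_+ = \max\{a, 0\}$ and $(a)_- = \max\{-a, 0\}$, and $c_j^+$ and $c_j^-$ denote the costs for increasing and decreasing feature $j$ by one unit for $j \in D$. If one assumes the costs increase quadratically as $x_D$ deviates from $\bar{x}_D$, then

$$c(x_D - \bar{x}_D) = \sum_{j \in D} \Big( c_j^+ (x_j - \bar{x}_j)_+^2 + c_j^- (x_j - \bar{x}_j)_-^2 \Big). \tag{3}$$

Note that the constants $c_j^+$ and $c_j^-$ in (2) and (3) can be different. In both cost functions, if decreasing (increasing) $x_j$ is cost-free, we can set $c_j^- = 0$ ($c_j^+ = 0$). In the rest of this paper, we will only focus on the quadratic cost in (3).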
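As a small illustration of the two cost functions, the following Python sketch evaluates a linear and a quadratic cost for a change vector `z` (the new values of the directly changeable features minus their current values). The function names, the list-based representation, and the example weights are hypothetical; the sketch assumes that increases are charged with weight `c_plus[j]` and decreases with weight `c_minus[j]`.

```python
def linear_cost(z, c_plus, c_minus):
    """Linear cost: c_plus[j] per unit of increase, c_minus[j] per unit of decrease."""
    return sum(cp * max(v, 0.0) + cm * max(-v, 0.0)
               for v, cp, cm in zip(z, c_plus, c_minus))


def quadratic_cost(z, c_plus, c_minus):
    """Quadratic cost: the same weights applied to the squared change."""
    return sum(cp * max(v, 0.0) ** 2 + cm * max(-v, 0.0) ** 2
               for v, cp, cm in zip(z, c_plus, c_minus))


# Hypothetical example: feature 0 increases by 2, feature 1 decreases by 1.
z = [2.0, -1.0]
c_plus = [1.0, 3.0]
c_minus = [2.0, 0.5]
print(linear_cost(z, c_plus, c_minus))     # 1.0 * 2 + 0.5 * 1 = 2.5
print(quadratic_cost(z, c_plus, c_minus))  # 1.0 * 4 + 0.5 * 1 = 4.5
```

Setting a weight to zero makes movement in that direction free, which matches the cost-free case described above.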

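We define $z = x_D - \bar{x}_D$ in (1) and, by changing variables, (1) can be equivalently written as

$$\min_{z} \; g(z) \quad \text{s.t.} \quad c(z) \le B, \quad \bar{l}_j \le z_j \le \bar{u}_j \ \text{ for } j \in D, \tag{4}$$

where

$$g(z) = f\big(\bar{x}_U, H(\bar{x}_U, \bar{x}_D + z), \bar{x}_D + z\big), \tag{5}$$

and $\bar{l}_j = l_j - \bar{x}_j$, $\bar{u}_j = u_j - \bar{x}_j$ for $j \in D$. The projection mapping onto the feasible set is defined as

$$\mathrm{Proj}(z) = \arg\min_{z'} \; \tfrac{1}{2} \| z' - z \|_2^2 \quad \text{s.t.} \quad c(z') \le B, \quad \bar{l}_j \le z'_j \le \bar{u}_j \ \text{ for } j \in D. \tag{6}$$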

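We then define a subroutine for solving (6). We first define

$$z_j(\lambda) = \begin{cases} \min\{\bar{u}_j, \; z_j / (1 + 2 \lambda c_j^+)\} & \text{if } z_j \ge 0, \\ \max\{\bar{l}_j, \; z_j / (1 + 2 \lambda c_j^-)\} & \text{if } z_j < 0, \end{cases} \tag{7}$$

for each $j \in D$ and $\lambda \ge 0$. The subroutine is given in Algorithm 1, whose validity can be easily verified by the KKT conditions of (4). Note that the bisection search in Algorithm 1 can always succeed because $c(z(\lambda))$ monotonically decreases to zero as $\lambda$ increases to infinity.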

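0:  $z$, $c$, $B$, and $[\bar{l}_j, \bar{u}_j]$ for $j \in D$
1:  if $c(z(0)) \le B$ then
2:     $\lambda^* = 0$
3:  else
4:     Apply bisection search to find $\lambda^*$ such that $c(z(\lambda^*)) = B$
5:  end if
6:  $\mathrm{Proj}(z)_j = z_j(\lambda^*)$ for $j \in D$.
Algorithm 1 Projection Mapping Proj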
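The projection step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the quadratic cost with per-feature weights `c_plus` and `c_minus`, bounds on the change vector satisfying `lo <= 0 <= hi`, and a strictly positive budget. The function names, the bracket-doubling step, and the fixed iteration counts are hypothetical choices.

```python
def quadratic_cost(z, c_plus, c_minus):
    """Quadratic cost of a change vector z (assumed form of the cost)."""
    return sum(cp * max(v, 0.0) ** 2 + cm * max(-v, 0.0) ** 2
               for v, cp, cm in zip(z, c_plus, c_minus))


def shrink(z, lam, c_plus, c_minus, lo, hi):
    """Shrink each coordinate toward zero by the multiplier lam, then clip to [lo, hi]."""
    out = []
    for v, cp, cm, l, u in zip(z, c_plus, c_minus, lo, hi):
        scaled = v / (1.0 + 2.0 * lam * (cp if v >= 0 else cm))
        out.append(min(max(scaled, l), u))
    return out


def project(z, budget, c_plus, c_minus, lo, hi):
    """Project z onto {cost <= budget, lo <= z <= hi} by bisection on lam."""
    def cost_at(lam):
        return quadratic_cost(shrink(z, lam, c_plus, c_minus, lo, hi), c_plus, c_minus)

    if cost_at(0.0) <= budget:          # clipping alone already fits the budget
        return shrink(z, 0.0, c_plus, c_minus, lo, hi)
    lam_lo, lam_hi = 0.0, 1.0
    for _ in range(200):                # grow the bracket until the budget is met
        if cost_at(lam_hi) <= budget:
            break
        lam_lo, lam_hi = lam_hi, 2.0 * lam_hi
    for _ in range(100):                # bisection: cost_at is non-increasing in lam
        mid = 0.5 * (lam_lo + lam_hi)
        if cost_at(mid) > budget:
            lam_lo = mid
        else:
            lam_hi = mid
    return shrink(z, lam_hi, c_plus, c_minus, lo, hi)


z_proj = project([3.0, 4.0], 1.0, [1.0, 1.0], [1.0, 1.0], [-10.0, -10.0], [10.0, 10.0])
print(z_proj)  # approximately [0.6, 0.8]
```

With equal unit weights the projection simply rescales the vector until its squared length equals the budget, which is why the example lands near `[0.6, 0.8]`.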

3.2 Heuristic-based methods

We propose three real-valued heuristic-based algorithms to solve the generalized inverse classification problem: hill-climbing + local search (HC+LS), a genetic algorithm (GA), and a genetic algorithm + local search (GA+LS).

There are several processes shared among the three algorithms. For simplicity of notation, we assume the features of $x$ indexed by $D$ are the first $|D|$ features, i.e., $D = \{1, \dots, |D|\}$. Let $j$ represent a uniformly distributed random variable over $D$ indicating the indexical position of feature vector $z$ that will be perturbed. Perturbations to feature $j$ occur according to a zero-mean normal distribution

$$\epsilon_j \sim \mathcal{N}(0, \sigma_j^2), \tag{8}$$

where $\epsilon_j$ is a random variable representing the perturbation that occurs at indexical position $j$ and $\sigma_j$ is the standard deviation of feature $j$ obtained from the training data. Let $e_j$ be a vector that equals one in the $j$th coordinate and zero in other places so that the perturbed version of $z$ is denoted by $z + \epsilon_j e_j$. Let $M_i$ represent the $i$th row of a matrix $M$. Two additional shared parameters are the total population size and the number of iterations until an algorithm terminates.

3.2.1 Hill-climbing + local search

Our hill-climbing + local search (HC+LS) algorithm is based on that outlined by Mannino and Koushik [6] and is given in Algorithm 3, which calls a local search procedure, outlined in Algorithm 2. In this algorithm, the best current solution is perturbed a single feature at a time in order to find a better solution. Several single-feature perturbations occur at each iteration, each leading to a perturbed version of the current solution. We use Algorithm 1 to convert each perturbed direction into a feasible state and update the current solution along the direction that yields the smallest objective value $g$, where $g$ is defined in (4).

0:  , , , , , , and
1:  for  to  do
2:     Generate and as (8);
3:     ;
4:  end for
5:  if  then
7:  end if
Algorithm 2 LS

We note here that the difference between regular HC and HC+LS is that HC operates on a first improvement basis, whereas HC+LS operates on a best improvement basis.

0:  , , , , , , , and
1:  Initialize ;
2:  for  to  do
4:  end for
Algorithm 3 HC+LS
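The interplay between the local search and the outer hill-climbing loop can be sketched in Python as below. This is a minimal illustration, not the authors' code: `objective` and `project` are hypothetical callables standing in for the objective and the projection of Algorithm 1, and `n_perturb`, `n_iter`, and `seed` are assumed parameter names.

```python
import random


def local_search(z, objective, project, sigma, n_perturb, rng):
    """One LS pass: try n_perturb single-feature perturbations, keep the best one."""
    best, best_val = z, objective(z)
    for _ in range(n_perturb):
        j = rng.randrange(len(z))        # uniformly chosen feature index
        eps = rng.gauss(0.0, sigma[j])   # zero-mean normal step scaled by feature spread
        cand = list(z)
        cand[j] += eps
        cand = project(cand)             # restore feasibility
        val = objective(cand)
        if val < best_val:               # best-improvement rule
            best, best_val = cand, val
    return best


def hill_climb(z0, objective, project, sigma, n_perturb, n_iter, seed=0):
    """Repeat the local search from the best solution found so far."""
    rng = random.Random(seed)
    z = project(list(z0))
    for _ in range(n_iter):
        z = local_search(z, objective, project, sigma, n_perturb, rng)
    return z


# Toy usage: a box constraint stands in for the budget-aware projection.
clip = lambda z: [min(max(v, -0.5), 0.5) for v in z]
toy_objective = lambda z: sum((v - 1.0) ** 2 for v in z)
print(hill_climb([0.0, 0.0], toy_objective, clip, [1.0, 1.0], 20, 10))
```

Because a candidate is accepted only when it strictly improves the objective, the returned solution is never worse than the starting point.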

3.2.2 Genetic algorithm

Genetic algorithms are composed of four primary processes: initial population generation, crossover, carryover, and mutation. Our real-valued genetic algorithm (GA) is outlined by Algorithm 4. Prior to outlining such a method, we first relate the four aforementioned components.

At the first iteration of our genetic algorithm, an initial population is generated. For , let be a discrete uniform random variable. We then generate and as in (8) for and define


We use for as the initial population and store them as the rows of a matrix , i.e,


We note that is updated times, resulting in unique entries in . Here, we apply (6) to ensure that all population chromosomes are feasible.

Following this, a simple procedure is called. This orders the rows by objective function value from smallest to largest. Let be a user specified parameter that denotes the proportion of the population that will be bred to produce the offspring for the next generation. We make a copy of the first rows of and store them as a matrix with for . We then randomly shuffle the rows of using a procedure .

Let be the proportion of the population that should be composed of children ( being the proportion of the population that will be carried over, discussed shortly). We construct a vector of indices as


where is a uniformly distributed random index for .

Selected chromosomes are bred using single-point crossover outlined in Michalewicz, 2013 [11], adapted to maintain feasibility via our projection operator. Without loss of generality, we assume is an even number and for some integer . For , we use the vector defined in (11) to create children from the matrix of parent chromosomes by doing


where represent the entry in the th row and the th column of a matrix , is a selected crossover point, generated for a pair of parents, i.e., the rows and of for . Mut is the mutation operator defined as


where , is generated as (8), is a binary random variable which equals zero and one with a probability of and respectively ( is a user-specified parameter representing the probability of mutation occurring to allele by amount defined in (8)). Subsequently, the children are ensured feasible by


These feasible children are then stored as rows of a matrix .
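A minimal Python sketch of single-point crossover followed by mutation is given below. It is an assumption-laden illustration: the function names are hypothetical, mutation is applied independently to each allele with probability `p_mut`, and the feasibility projection that would follow is omitted.

```python
import random


def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two parents after position `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(child, sigma, p_mut, rng):
    """Perturb each allele with probability p_mut by a zero-mean normal draw."""
    return [v + rng.gauss(0.0, s) if rng.random() < p_mut else v
            for v, s in zip(child, sigma)]


rng = random.Random(0)
a, b = single_point_crossover([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 1)
print(a, b)  # [1.0, 5.0, 6.0] [4.0, 2.0, 3.0]
print(mutate(a, [0.1, 0.1, 0.1], 0.5, rng))
```

In a full implementation each mutated child would then be passed through the projection so that it respects the bounds and the budget.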

The carryover procedure uses roulette wheel selection [11] to select chromosomes from the current generation that will survive to the next. Chromosomes that have larger (i.e., better) fitness values (where fitness denotes solution quality) have a higher likelihood of surviving to the next generation.

First, we create an inverted solution vector . These inverted solutions are transformed from by function defined as


where is the worst-case solution possible and is assumed to be positive. Using we construct a vector of selection probability


Intuitively, higher quality solutions have larger probability to be selected, since we have already ordered by so that .

Using (16) we select chromosomes from the population matrix to be carried over to the next generation by


for . Those selected children are stored as the carryover matrix .
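The carryover step can be sketched in Python as follows. The exact inversion used by the authors is not assumed here; this sketch takes the inverted value to be `worst - value`, which is one simple way to turn a minimised objective into a fitness. The function names are hypothetical, and at least one objective value is assumed to be strictly below `worst`.

```python
import random


def selection_probabilities(objective_values, worst):
    """Invert minimised objective values so better solutions get more weight."""
    inverted = [worst - v for v in objective_values]
    total = sum(inverted)
    return [w / total for w in inverted]


def roulette_select(population, probs, k, rng):
    """Draw k chromosomes (with replacement) in proportion to probs."""
    return rng.choices(population, weights=probs, k=k)


rng = random.Random(0)
population = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
objective_values = [0.2, 0.5, 0.9]   # sorted from best to worst
probs = selection_probabilities(objective_values, worst=1.0)
print(probs)
print(roulette_select(population, probs, 2, rng))
```

Because the population is ordered from the smallest objective value to the largest, the resulting probabilities are non-increasing along the rows.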

Using (10),(12), (13) and (17) we construct our GA as outlined by Algorithm 4. The procedure begins by initializing the best solution to the unperturbed chromosome . The algorithm then begins iteration, executing times. If it is the first iteration, the initial population is generated. The current population is then evaluated and if a better solution is found, it is updated. Following this, a simple procedure is called. This orders by objective function value from smallest to largest. Crossover points are then selected in an elitist fashion from this ordered matrix of chromosomes. Selected chromosomes are then randomly shuffled, using procedure , before crossover is applied to create the offspring chromosomes. Next, the carryover chromosomes are selected. Finally, the children and carryover chromosome matrices are concatenated to form the population for the next generation.

0:  , , , , , , , , , , , and
1:  Initialize
2:  for  to  do
3:     if  then
4:        Generate an initial population using (10)
5:     end if
6:     if  then
8:     end if
10:     ,
12:     Obtain from according to (12)
13:     Obtain from by (17)
14:     Create the new population
15:  end for
Algorithm 4 GA

3.2.3 Genetic algorithm + local search

The third method is a genetic algorithm + local search (GA+LS). It is related by Algorithm 5. There are a few important distinctions between the original GA and that with local search applied. First, we reformulate the crossover procedure outlined by (12) to be


for . The reader will note that here the mutation procedure is not applied.

Second, we incorporate the use of the local search (LS) procedure previously outlined in Algorithm 2. Here, we set parameter equal to , which dictates the extent of the search.

GA+LS is outlined by Algorithm 5. The differences between this method and the original GA are outlined in blue. At line 14 the LS procedure is applied to each of the non-mutated children. The best solution obtained from LS is the child chromosome that is kept for the next generation.

0:  , , , , , , , , , , , and
1:  Initialize
2:  for  to  do
3:     if  then
4:        Generate an initial population using (10)
5:     end if
6:     if  then
8:     end if
10:     ,
12:     Obtain from by (18)
14:     Obtain from by (17)
15:     Create the new population
16:  end for
Algorithm 5 GA+LS

3.3 Sensitivity analysis-based methods

As discussed in Section 2, sensitivity analysis is closely related to inverse classification. Therefore, we propose two sensitivity analysis-based algorithms that serve as baselines against which the heuristic-based methods can be compared. To our knowledge, no past methods addressing this problem have been proposed. Therefore, we craft these ourselves, and believe that they represent a reasonable initial attempt at a solution. Such methods can be viewed as a combination of local and variable perturbation methods of sensitivity analysis.

We refer to the first sensitivity analysis-based method as Local Variable Perturbation–Best Improvement (LVP-BI). This method calls for perturbing a single feature to the extent of feasibility given by . The single feature perturbation having the greatest objective function improvement is the one that is accepted. If some budget remains following this perturbation, subsequent perturbations are performed (e.g., double feature, triple feature, etc. perturbations).

Our second method, which we refer to as Local Variable Perturbation–First Improvement (LVP-FI), is very similar to LVP-BI. Instead of accepting the best perturbation over all features, it accepts the first perturbation that leads to a better objective function value, where the feature to perturb is selected at random.

4 Experiments

In this section we first outline our choices regarding the parameters of the inverse classification framework and then apply our methods to two freely available datasets. Our experiments will evaluate the five methods by examining the average likelihood of test instances conforming to a non-ideal class over varying budget constraints. First, we will explore the capability of each algorithm in reducing the likelihood of test instances conforming to a non-ideal class. Additionally, we will examine the perturbations made to an individual test instance, selected at random, by the top performing algorithm. We wish to emphasize that practical and real-world use of these methods should be undertaken with experts in the domain of use. We further emphasize that inverse classification puts the individual at the center of the process and optimizes over his/her current values. Therefore, if an individual so chooses, he/she can adjust expert-specified costs according to his/her own outlook on what may be more or less difficult to change.

4.1 Experiment Parameters and Evaluation

There are three choices that need to be made regarding the established inverse classification framework: the learning algorithm, the indirectly changeable feature estimator, and the method we will use to set the lower- and upper-bounds that directly changeable features can take.

4.1.1 Objective Function

We selected the random forest classifier [14] to evaluate each of the five methods. We chose this as it is (a) an ensemble classifier and (b) composed of weak-learner decision trees. Both (a) and (b) are separately non-differentiable, and together help highlight the need for the GIC formulation we have proposed. The returned objective function value will be the proportion of decision trees in the ensemble voting in favor of the class to be minimized. As such, the objective value lies in $[0, 1]$. We can therefore also parameterize the worst-case objective function value in (15).
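To make the objective concrete, the following minimal Python sketch computes the proportion of ensemble members voting for the class to be minimized. The `trees` list of callables and the stump thresholds are hypothetical stand-ins, not the trained forest from the experiments. The returned value is a step function of the input, so no gradient is available.

```python
def vote_proportion(trees, x, target_class=1):
    """Fraction of ensemble members that predict target_class for x."""
    votes = sum(1 for tree in trees if tree(x) == target_class)
    return votes / len(trees)


# Hypothetical decision stumps, each thresholding a single feature.
trees = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 2.0 else 0,
    lambda x: 1 if x[0] > 1.5 else 0,
]

print(vote_proportion(trees, [1.0, 3.0]))  # two of the three stumps vote 1
print(vote_proportion(trees, [0.0, 0.0]))  # no stump votes 1
```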

4.1.2 Indirectly Changeable Feature Estimation

The inverse classification framework allows for any smooth model to be selected to estimate the indirectly changeable features. We elect to use a kernel regression method [15, 16]


where is a training instance and is the Gaussian kernel. We elect to use this function and corresponding Gaussian kernel for its similarity-based estimation properties. We cross-validate this model on each of the indirectly changeable features in order to learn the best for each.
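A kernel regression estimator of this kind can be sketched in Python as a kernel-weighted average of training targets (the Nadaraya-Watson form). The sketch below is illustrative only: the function names are hypothetical, and the Gaussian kernel is parameterized by a `bandwidth`, which is an assumption about how the cross-validated kernel parameter enters.

```python
import math


def gaussian_kernel(u, v, bandwidth):
    """Gaussian similarity between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_regression(query, train_inputs, train_targets, bandwidth):
    """Kernel-weighted average of the training targets around the query point."""
    weights = [gaussian_kernel(query, x, bandwidth) for x in train_inputs]
    total = sum(weights)
    return sum(w * t for w, t in zip(weights, train_targets)) / total


# Toy usage: a query midway between two training points averages their targets.
train_inputs = [[0.0], [2.0]]
train_targets = [1.0, 3.0]
print(kernel_regression([1.0], train_inputs, train_targets, bandwidth=1.0))  # 2.0
```

In the framework, the inputs would be the unchangeable and directly changeable features and the target one indirectly changeable feature, with a separate estimator fitted per target.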

4.1.3 Bound-setting method and cost function

Lash et al. [9] outline two methods of specifying lower- and upper-bounds for the directly changeable features. Each results in different algorithmic behavior. In our experiments we use the Hard-line bound-setting method. Under this method we specify, for feature $j$, the upper- and lower-bounds such that $x_j$ can only either increase or decrease. If feature $j$ should increase from its current value of $\bar{x}_j$ we set $l_j = \bar{x}_j$. If feature $j$ should decrease from its current value of $\bar{x}_j$ we set $u_j = \bar{x}_j$. This allows us to maintain more control over what we know and believe to be the beneficial direction of feature movement. We do note, however, that under different circumstances (e.g., uncertainty) it may be beneficial to allow the optimization to learn the most beneficial direction of feature movement.

In this set of experiments we elect to explore the effects of non-linear costs, given by (3). We elect to do so as non-linear costs, to the best of our knowledge, have not been explored in past works.

4.1.4 Evaluating Recommendations

To evaluate the success of the inverse classification we use an established procedure originally outlined in [4] and refined in [9]. This process entails initially splitting a dataset randomly into two equal parts , where the first is used for training the random forest model upon which inverse classification will take place. The second set , is the held-out set of data to which inverse classification will be applied.

is further partitioned into distinct subsets which we can denote , ( in our experiments). The process of evaluation entails that we perform inverse classification on and use to train a separate model to evaluate the success of the inverse classification. Such a process ensures that no information used to perform the inverse classification and obtain recommendations is used in evaluating how successful the process actually was. Additionally, this helps ensure that the classifier used to make the recommendations has not overfit the data.

4.2 Student Performance: Grade-improving recommendations

Our first set of experiments is conducted on a UCI Machine Learning Repository dataset called Student Performance [17]. This dataset consists of Portuguese students enrolled in two different classes: a math class and a Portuguese language class. The data are represented as two separate but overlapping datasets; we elect to use the Portuguese language set as it has the larger number of instances.

4.2.1 Data Description

Each individual in Student Performance is initially represented by 45 features, including a unique identifier (discarded) and the class variable, which we define to be whether a student’s final grade was above a C or, conversely, less than or equal to a C. Our GIC methods will attempt to reduce the likelihood of earning a grade of C or worse. We discard the two intermediary grade reports to reflect a long-term goal of earning a higher grade overall and make the problem more realistic. The full set of features and corresponding parameters can be viewed in the Supplemental Material.

The parameters set for the three heuristic-based methods in these experiments are related by Table 1, as is the computational complexity. We arrived at these after a brief exploration of the parameter space, selecting values that were comparable so that performance could be equivalently compared. For GA+LS, we kept (abbreviated ) lower because of the added complexity of the parameter.

Param HC + LS GA GA + LS
300 300 150
15 15 15
Table 1: Benchmark experiment parameters

4.2.2 Results

We first examine the success of reducing the average predicted probability for each of the five methods. These results are reported in Figure 1. We report each over 15 increasing budgetary constraints. Additionally, we include the best result on a randomly selected positively classified instance – Student 57 – obtained using GA.

Figure 1: The average predicted probability vs budget for three heuristic- and two sensitivity analysis-based methods. GA result for Student 57.

As we can observe in Figure 1 the two sensitivity analysis-based methods were unsuccessful. The result also shows that the three heuristic-based methods are comparable, with GA and GA+LS declining slightly faster than HC+LS. We include more detailed information about the performance of each method in the Supplemental Materials.

Figure 2: GA recommended changes to Student 57.

We report the changes made to “Student 57” in Figure 2 for the method most successful in reducing their predicted probability: GA. We report this so that the reader may have a better idea of what such recommendations look like. GA recommends that the student increase study time and curb weekday alcohol consumption, as well as decrease time out with friends.

Cumulatively, the three heuristic methods were, on average, able to reduce the probability from approximately 70% to 62% at a budget level of three. Individually, the best performing method was able to reduce Student 57’s probability from 70% to 50% at a budget level of five.

4.3 Cardiovascular disease mitigating lifestyle recommendations

Our second set of experiments is conducted on a real-world patient dataset, derived from the ARIC study. These data are freely available upon request from BioLINCC.

4.3.1 Data Description

These data represent patients for whom we have known cardiovascular disease (CVD) outcomes over a 10-year period. There are 110 defined features for each patient. Patients who, during the course of the 10-year period, have probable myocardial infarction (MI), definite MI, suspect MI, definite fatal coronary heart disease (CHD), possible fatal CHD, or stroke are labeled positive, and all others negative. Patients who had a pre-existing CVD event are excluded from our dataset. This set of experiments is meant to more closely reflect a real-world scenario and, as such, is guided by a CVD specialist. The full list of features, their feature designation (e.g., changeable) and parameters (e.g., cost) can be viewed in the Supplemental Materials.

After a brief exploration of the parameter space, we arrived at the same set of parameters as in the previous experiment (Student Performance). We omit the duplicate table and refer to Table 1. Additionally, because of the size of the testing dataset, and the computational complexity associated with the heuristic-based methods, we elected to test on a subset of data. We used all 587 positive test instances and another 587 randomly selected negative test instances, giving us a final evaluative test set size of 1164. Evaluation models were constructed using the full set of data by the procedure outlined in Section 4.1.4.

4.3.2 Results

We first examine the success of reducing the average predicted probability using the five outlined methods. These results are reported in Figure 3. We report each over 15 increasing budgetary constraints. Additionally, we include the best result on a randomly selected positively classified instance – Patient 29 – obtained using GA+LS.

Figure 3: The average predicted probability vs budget for the three heuristic- and two sensitivity analysis-based methods. GA+LS result for Patient 29.

The results obtained for the heuristic-based methods are similar to those of Student Performance. There is a striking difference, however, between those and the sensitivity analysis-based method results here. We observe that LVP-FI outperforms all other methods, while LVP-BI is comparable to GA and GA+LS. HC+LS performs the worst. The stark difference in performance of LVP-FI and LVP-BI on this dataset vs. that of Student Performance may suggest that there are instances in which it is advantageous to use sensitivity analysis-based methods over those that are heuristic-based, and vice-versa. We leave such an analysis for future work.

Figure 4: GA+LS recommended changes to Patient 29.

We report the changes made to “Patient 29” in Figure 4 for the method most successful in reducing the patient’s predicted probability: GA+LS. Here we observe that the recommended feature changes are numerous: there are 22 of them. This suggests that it may be beneficial to include sparsity constraints.

Cumulatively, these results show that, on average, risk can be taken from approximately 50% to 30-35%, depending upon the method, at a budgetary level of two. At the individual level, using the best method, Patient 29’s risk can be lowered from 55% to less than 30%, also at a budgetary level of two.

5 Conclusions

In this work we propose and solve generalized inverse classification by working backward through the previously un-navigable random forest classifier. We do so using five proposed algorithms, incorporated into a framework that we update to account for non-linear costs and that leads to realistic recommendations. Future work is needed to analyze the instances in which one method may outperform another, the performance of other classifiers, and constraints limiting the number of features that are changed.


  • [1] S. S. Isukapalli, Uncertainty Analysis of Transport-transformation Models. PhD thesis, Citeseer, 1999.
  • [2] J. Yao, “Sensitivity analysis for data mining,” in Fuzzy Information Processing Society, 2003. NAFIPS 2003. 22nd International Conference of the North American, pp. 272–277, July 2003.
  • [3] C. C. Aggarwal, C. Chen, and J. Han, “The inverse classification problem,” Journal of Computer Science and Technology, vol. 25, no. 3, pp. 458–468, 2010.
  • [4] C. L. Chi, W. N. Street, J. G. Robinson, and M. A. Crawford, “Individualized patient-centered lifestyle recommendations: An expert system for communicating patient specific cardiovascular risk information and prioritizing lifestyle options,” Journal of Biomedical Informatics, vol. 45, no. 6, pp. 1164–1174, 2012.
  • [5] C. Yang, W. N. Street, and J. G. Robinson, “10-year CVD risk prediction and minimization via inverse classification,” in Proceedings of the 2nd ACM SIGHIT symposium on International health informatics - IHI ’12, pp. 603–610, 2012.
  • [6] M. V. Mannino and M. V. Koushik, “The cost-minimizing inverse classification problem: A genetic algorithm approach,” Decision Support Systems, vol. 29, no. 3, pp. 283–300, 2000.
  • [7] D. Barbella, S. Benzaid, J. Christensen, B. Jackson, X. V. Qin, and D. Musicant, “Understanding support vector machine classifications via a recommender system-like approach,” in Proceedings of the International Conference on Data Mining, pp. 305–11, 2009.
  • [8] P. C. Pendharkar, “A potential use of data envelopment analysis for the inverse classification problem,” Omega, vol. 30, no. 3, pp. 243–248, 2002.
  • [9] M. T. Lash, Q. Lin, W. N. Street, and J. G. Robinson, “A budget-constrained inverse classification framework for smooth classifiers,” arXiv preprint arXiv:1605.09068, 2016.
  • [10] M. T. Lash and K. Zhao, “Early predictions of movie success: The who, what, and when of profitability,” Journal of Management Information Systems, vol. 33, no. 3, pp. 874–903, 2016.
  • [11] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs. Springer Science & Business Media, 2013.
  • [12] L. Breiman, “Bagging Predictors,” Machine Learning, vol. 24, pp. 123–140, 1996.
  • [13] Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm,” Thirteenth International Conference on Machine Learning, pp. 148–156, 1996.
  • [14] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
  • [15] E. A. Nadaraya, “On estimating regression,” Theory of Probability & Its Applications, vol. 9, no. 1, pp. 141–142, 1964.
  • [16] G. S. Watson, “Smooth regression analysis,” The Indian Journal of Statistics, Series A, vol. 26, no. 4, pp. 359–372, 1964.
  • [17] P. Cortez and A. M. G. Silva, “Using data mining to predict secondary school student performance,” in Proceedings of 5th Annual Future Business Technology Conference, EUROSIS, 2008.

Supplemental Material – Generalized Inverse Classification

Supplementary Tables

These tables show the unchangeable, indirectly changeable, and directly changeable features for each of our two freely available datasets. For each of the indirectly changeable features, the kernel regression parameter is also included.

Feature Name
School Attended, Sex, Age, Address, Size of family, Parent’s cohabitation status, Mother’s education, Father’s education, Mother’s job=“At Home”, Mother’s job=“Health”, Mother’s job=“Other”, Mother’s job=“Services”, Mother’s job=“Teacher”, Father’s job=“Teacher”, Father’s job=“Other”, Father’s job=“Services”, Father’s job=“Health”, Father’s job=“At Home”, Reason for school=“Course”, Reason for school=“Other”, Reason for school=“Home”, Reason for school=“Reputation”, Guardian=“Mother”, Guardian=“Father”, Guardian=“Other”, Time spent traveling to school
ST. 1: Unchangeable features for the Student Performance dataset.
Feature Name:
Extra-curricular activities: 1.5, Higher education aspirations: 1.0, In a romantic relationship: 1.5, Free time after school: 1.0
ST. 2: Indirectly changeable features and learned kernel regression parameters for the Student Performance dataset.
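As an illustration of how a kernel regression parameter such as those listed above might be used, the following is a minimal sketch of a Nadaraya-Watson estimator with a Gaussian kernel. It assumes that the listed parameter plays the role of a kernel bandwidth and that an indirectly changeable feature is estimated from the other features; the function name `nadaraya_watson`, the toy data, and the choice of kernel are hypothetical and are not taken from the paper.

```python
import math

def nadaraya_watson(x_query, X_train, y_train, bandwidth):
    """Estimate y at x_query as a Gaussian-kernel-weighted average of y_train.

    Assumption: `bandwidth` stands in for the per-feature kernel regression
    parameter; the paper's exact kernel and parameterization may differ.
    """
    weights = []
    for x in X_train:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x_query, x))
        weights.append(math.exp(-sq_dist / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Every kernel weight underflowed; fall back to the unweighted mean.
        return sum(y_train) / len(y_train)
    return sum(w * y for w, y in zip(weights, y_train)) / total

# Hypothetical toy data: two predictor features -> one indirectly changeable feature.
X_train = [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [4.0, 0.0]]
y_train = [2.0, 3.0, 4.0, 5.0]
estimate = nadaraya_watson([2.5, 1.0], X_train, y_train, bandwidth=1.0)
```

Under this reading, a smaller parameter makes the estimate depend more heavily on the nearest training instances, while a larger one averages over more of the training set.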
Study time: 7, Paid tutoring: 8
Time out with friends: 6, Weekday alcohol: 3, Weekend alcohol: 6, Absences from class: 5
ST. 3: Directly changeable variables for the Student Performance dataset.


Feature Name
Insulin (uu-ml), Height (cm), Age, Peripheral Artery Disease, Peripheral Artery Disease (definition 2), Plaque/shadowing in either internal, Plaque in either internal carotid, Cholesterol lowering med (last 2 weeks), Hypertension (definition 5), Education level, Diabetes, Age when menopause began, Menopause status, Ever smoked cigarettes, High blood pressure med (past 2 weeks), Angina-chest pain med (past 2 weeks), Heart rhythm control med (past 2 weeks), Heart failure med (past 2 weeks), Blood thinning med (past 2 weeks), Blood sugar med (past 2 weeks), Stroke med (past 2 weeks), Walking leg pain med (past 2 weeks), Headache or cold med (past 2 weeks), Pain meds (past 2 weeks), Gender, Race, Years smoked cigarettes
ST. 4: Unchangeable features for the ARIC CVD dataset.
Feature Name:
BMI (Body Mass Index): .5, Recalibrated HDL cholesterol (mg/dl): .5, Recalibrated LDL cholesterol (mg/dl): .5, Total cholesterol (mmol/L): .5, Total triglycerides (mmol/L): .5, 2nd and 3rd systolic blood pressure (avg.): .5, 2nd and 3rd systolic blood pressure (avg.) Num 2: .5, Waist girth (cm): .5, Hip girth (cm): .5, Heart rate: .5, White blood count: .5, Apolipoprotein AI (mg-dl): .5, Apolipoprotein B (mg-dl): .5, Apolp(A) Data (ug-ml): .5, Ankle-brachial index (Def 4): .5, FV(1)/FVC Predicted (%): .25, FEV(1) (L): .5, FVC (L): .5, Hematocrit: .5, Hemoglobin: .5, Platelet count: .5, Neutrophils: .5, Neutrophil bands: .5, Lymphocytes: .5, Monocytes: .5, Eosinophils: .5, Basophils: .5, APTT Value: .5, VIII: C Value: .5, Fibrinogen Value: .5, VII Value: .5, ATIII Value: .5, Protein: C Value: .5, VWF Value: .5
Feature Name:
Cornell voltage (uV): .5, Waist-hip ratio: .5, Vegetable fat (% kcal): .5, Carbs (% kcal): .5, Alcohol (% kcal): .5, Omega fatty acid (g): .5, Calf girth (cm): .5, Subcaps measure 2 (mm): .5, Triceps measure 2 (mm): .5, Uric acid (mg-dl): .5, Total protein (gm-dl): .5, Albumin (gm-dl): .5, Phosphorus (mg-dl): .5, Magnesium (meq-l): .5, Calcium (mg-dl): .5, Urea nitrogen (mg-dl): .5, Potassium (mmol-l): .5, Sodium (mmol-l): .5, Creatinine (mg-dl): .5, Weight (lb): .5, Total fat (% kcal): .5, Saturated fatty acid (% kcal): .5, Protein (% kcal): .5, Polyunsaturated fatty acid (% kcal): .5, Monounsaturated fatty acid (% kcal): .5, Total fat (g): .25
ST. 5: Indirectly changeable features and learned kernel regression parameters for the ARIC CVD dataset.
Dark or grain breads: 3, Peanut butter: 4, Nuts: 5, Other(prunes,avocado): 5, Vegetables: 6, Fruit: 6, Fiber: 7, Vegetable fat: 5, Polyunsaturated fat: 5
Liver: 8, White carbs: 6, Fish: 9, Cereal: 4, Cigarettes: 9, Caffeine: 7, Carbs: 7, Cholesterol: 6, Sodium: 7, Animal fat: 7, Saturated fat: 6
Exercise hours: 10, Alcohol: 9
ST. 6: Directly changeable variables for the ARIC CVD dataset.


Supplementary Figures

These figures show additional algorithm-specific results that supplement and support certain conclusions made in the main content of the paper. Here, red shows the average probability, yellow shows the probability for a randomly selected instance, and blue shows the 5th and 95th percentiles of the probabilities.

SF. 1: Probability vs budget for the five methods. Panels: (a) LVP-FI, (b) LVP-BI, (c) HC+LS, (d) GA, (e) GA+LS.
SF. 2: Probability vs budget for the five methods. Panels: (a) LVP-FI, (b) LVP-BI, (c) HC+LS, (d) GA, (e) GA+LS.
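
The summary curves in these figures (an average, plus lower and upper bands) could be computed along the following lines. This is a minimal sketch, assuming the blue bands are the 5th and 95th percentiles of the per-instance probabilities at each budget; the helper names `percentile` and `summarize_by_budget` and the probabilities below are hypothetical and are not results from the paper.

```python
def percentile(values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def summarize_by_budget(probs_by_budget):
    """Map each budget to (mean, 5th percentile, 95th percentile) of probabilities."""
    summary = {}
    for budget, probs in sorted(probs_by_budget.items()):
        mean = sum(probs) / len(probs)
        summary[budget] = (mean, percentile(probs, 5), percentile(probs, 95))
    return summary

# Hypothetical predicted probabilities for five instances at budgets 0, 1 and 2.
probs_by_budget = {
    0: [0.40, 0.45, 0.50, 0.55, 0.60],
    1: [0.30, 0.38, 0.42, 0.47, 0.53],
    2: [0.22, 0.28, 0.33, 0.37, 0.45],
}
summary = summarize_by_budget(probs_by_budget)
```

Plotting the first element of each tuple against the budget would give the average curve, and the remaining two elements would give the band.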