Binary classification is concerned with learning a binary classifier that can be used to categorize objects into one of two predefined statuses. It arises in a wide range of applications and has been extensively studied in the statistics and machine learning literature. For comprehensive surveys and discussions of binary classification methods, see, e.g., Devroye, Györfi, and Lugosi (1996), Vapnik (2000), Lugosi (2002), Boucheron, Bousquet, and Lugosi (2005) and Hastie, Tibshirani, and Friedman (2009). Solving for the optimal binary classifier by minimizing the empirical misclassification risk is known as an empirical risk minimization (ERM) problem. There has been considerable research interest in high dimensional classification problems where the dimension of the feature vector used to classify the object’s label can be comparable with or even larger than the available training sample size. It is known (see, e.g., Bickel and Levina (2004) and Fan and Fan (2008)) that working directly with a high dimensional feature space can result in poor classification performance. To overcome this, it is often assumed that only a small subset of features is important for classification, and feature selection is performed to mitigate the high dimensionality problem. See Fan, Fan, and Wu (2011) for an overview of the issues and methods for high dimensional classification.
In this paper, we study ERM based binary classification in the setting with high dimensional vectors of features. We propose an $\ell_0$-penalized ERM procedure for classification by minimizing, over a class of linear classifiers, the empirical misclassification risk with a penalty on the number of selected features. Here, the $\ell_0$-norm of a real vector refers to the number of non-zero components of the vector. When the Bayes classifier, which is the optimal classifier that minimizes the population misclassification risk, is also of the linear classifier form and respects a sparsity condition, we show that this penalized ERM classification approach can yield a sparse solution for feature selection with high probability. Moreover, we derive a non-asymptotic bound on the excess misclassification risk and establish its rate of convergence.
There are alternative ERM based approaches to the high dimensional binary classification problem. Greenshtein (2006), Jiang and Tanner (2010) and Chen and Lee (2018) studied the best subset variable selection approach where the ERM problem is solved subject to a constraint on a pre-specified maximal number of selected features. Jiang and Tanner (2010) further showed that the best subset ERM problem can be approximated by the $\ell_0$-penalized ERM problem. However, they did not establish theoretical results characterizing the size of the subset of feature variables selected under the $\ell_0$-penalized ERM approach. Neither did they provide numerical algorithms for solving the $\ell_0$-penalized estimation problem.
In the present paper, we take the $\ell_0$-penalized ERM approach and develop a computational method for implementation. We show that, with high probability in large samples, the number of features selected under this penalized estimation approach is bounded above by a quantity which can be made arbitrarily close to the unknown smallest number of features that are relevant for classification. Our penalized ERM approach is also closely related to the method of structural risk minimization (see, e.g., Devroye, Györfi, and Lugosi (1996, Chapter 18)) where the best classifier is selected by solving a sequence of penalized ERM problems over an increasing sequence of spaces of classifiers with the penalty depending on the complexity of the classifier space measured in terms of the Vapnik-Chervonenkis (VC) dimension. As will be discussed later, our approach can also be interpreted in a similar fashion yet with a different type of complexity penalty.
For implementation, we show that the $\ell_0$-penalized ERM problem of this paper can be equivalently reformulated as a mixed integer linear programming (MILP) problem. This reformulation enables us to employ modern, efficient mixed integer optimization (MIO) solvers to solve our penalized ERM problem. Well-known numerical solvers such as CPLEX and Gurobi can be used to effectively solve large-scale MILP problems. See Nemhauser and Wolsey (1999) and Bertsimas and Weismantel (2005) for classic texts on MIO theory and applications. See also Jünger, Liebling, Naddef, Nemhauser, Pulleyblank, Reinelt, Rinaldi, and Wolsey (2009), Achterberg and Wunderling (2013) and Bertsimas, King, and Mazumder (2016, Section 2.1) for discussions of computational advances in solving MIO problems.
The present paper is organized as follows. In Section 2, we describe the binary classification problem and set forth the $\ell_0$-penalized ERM approach. In Section 3, we establish theoretical properties of the proposed classification approach. In Section 4, we provide a computational method using the MIO approach. In Section 5, we conduct a simulation study on the performance of the $\ell_0$-penalized ERM approach in high dimensional binary classification problems. We then conclude the paper in Section 6. Proofs of all theoretical results of the paper are collated in Appendix A.
2 An $\ell_0$-Penalized ERM Approach
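Let $Y \in \{0, 1\}$ be the binary label or outcome of an object and $X$ a $(p+1)$ dimensional feature vector of that object. Write $X = (X_0, \widetilde{X})$, where $X_0$ is a scalar random variable that is always included and has a positive effect and $\widetilde{X}$ is the $p$ dimensional subvector of $X$ subject to feature selection. For $x = (x_0, \widetilde{x}) \in \mathcal{X}$, let
$$b_\theta(x) = 1\{x_0 + \widetilde{x}^\top \theta \geq 0\}, \tag{2.1}$$
where $\mathcal{X}$ is the support of $X$, $\theta$ is a vector of parameters, and $1\{\cdot\}$ is an indicator function that takes value 1 if its argument is true and 0 otherwise.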
We consider binary classification using linear classifiers of the form (2.1). Since the condition $x_0 + \widetilde{x}^\top \theta \geq 0$ is invariant with respect to any positive scalar that multiplies both sides of this inequality, working with the classifier (2.1) amounts to normalizing the scale by setting the coefficient of $X_0$ to be unity. For any $p$ dimensional real vector $\theta$, let $\|\theta\|_0 = \sum_{j=1}^{p} 1\{\theta_j \neq 0\}$ be the $\ell_0$-norm of $\theta$. Assume that the researcher has a training sample of $n$ independent identically distributed (i.i.d.) observations $(Y_i, X_i)_{i=1}^{n}$ of $(Y, X)$. We allow the dimension $p$ to be potentially much larger than the sample size $n$. We estimate the coefficient vector $\theta$ by solving the following $\ell_0$-penalized minimization problem
$$\min_{\theta \in \Theta} \; R_n(b_\theta) + \lambda \|\theta\|_0, \tag{2.2}$$
where $\Theta$ denotes the parameter space, and, for any indicator function $b$,
$$R_n(b) = \frac{1}{n} \sum_{i=1}^{n} 1\{Y_i \neq b(X_i)\},$$
and $\lambda$ is a given non-negative tuning parameter of the penalized minimization problem.
The function $R_n(b)$ is known as the empirical misclassification risk for the binary classifier $b$. Minimization of $R_n(b)$ over the class of binary classifiers given by (2.1) is known as an empirical risk minimization (ERM) problem. The penalized ERM approach (2.2) enforces dimension reduction by attaching a higher penalty to a classifier which uses more object features for classification. Let $\widehat{\theta}$ be a solution to the minimization problem (2.2). We shall refer to the resulting classifier (2.1) evaluated at $\widehat{\theta}$ as an $\ell_0$-penalized ERM classifier.
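The objective in (2.2) can be illustrated with a minimal Python sketch. It assumes the linear threshold form described for (2.1), with the always-included feature entering with unit coefficient; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def classify(x0: float, x_tilde: Sequence[float], theta: Sequence[float]) -> int:
    """Linear classifier: predict 1 when x0 + x_tilde' theta >= 0, else 0."""
    score = x0 + sum(xj * tj for xj, tj in zip(x_tilde, theta))
    return 1 if score >= 0 else 0


def l0_norm(theta: Sequence[float]) -> int:
    """Number of non-zero components of theta."""
    return sum(1 for t in theta if t != 0)


def empirical_risk(y, x0, x_tilde, theta) -> float:
    """Share of training observations that the classifier mislabels."""
    errors = sum(
        1 for yi, x0i, xi in zip(y, x0, x_tilde) if classify(x0i, xi, theta) != yi
    )
    return errors / len(y)


def penalized_objective(y, x0, x_tilde, theta, lam: float) -> float:
    """Empirical risk plus lam times the number of selected features."""
    return empirical_risk(y, x0, x_tilde, theta) + lam * l0_norm(theta)


# Hypothetical toy sample: the label is fully explained by the sign of x0.
y = [1, 0, 1, 0]
x0 = [1.0, -1.0, 0.5, -0.5]
x_tilde = [[0.2, 0.0], [0.1, 1.0], [-1.0, 0.3], [2.0, -0.4]]

sparse_value = penalized_objective(y, x0, x_tilde, [0.0, 0.0], lam=0.1)
dense_value = penalized_objective(y, x0, x_tilde, [1.0, 1.0], lam=0.1)
```

On this toy sample the sparse coefficient vector attains a lower penalized objective than the dense one, both because it fits better and because it pays no penalty.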
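For any $m \in \{0, 1, \ldots, p\}$, let
$$\mathcal{B}_m = \{b_\theta : \theta \in \Theta, \; \|\theta\|_0 \leq m\}.$$
That is, $\mathcal{B}_m$ is the class of all linear classifiers in (2.1) whose $\theta$ vector has no more than $m$ non-zero components. For $m \in \{0, 1, \ldots, p\}$, let
$$\widehat{b}_m \in \arg\min_{b \in \mathcal{B}_m} R_n(b).$$
Then it is straightforward to see that the minimized objective value of the penalized ERM problem (2.2) is equivalent to that of the problem
$$\min_{m \in \{0, 1, \ldots, p\}} \; R_n(\widehat{b}_m) + \lambda m.$$
In other words, our approach is akin to the method of structural risk minimization as it amounts to solving ERM problems over an increasing sequence of classifier spaces $\mathcal{B}_0 \subseteq \mathcal{B}_1 \subseteq \cdots \subseteq \mathcal{B}_p$, each of which carries a complexity penalty $\lambda m$. In the next section, we will set forth regularity conditions on the penalty tuning parameter $\lambda$ and establish theoretical properties for our classification approach.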
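The equivalence between the penalized problem and the sequence of cardinality-constrained ERM problems can be checked numerically on a tiny example. The sketch below is only an illustration: it searches a small, hypothetical coefficient grid by brute force rather than using the MIO method of Section 4, and the data and function names are made up for the purpose.

```python
from itertools import product
from math import inf


def risk(theta, data):
    """Empirical misclassification risk of the classifier 1{x0 + x_tilde' theta >= 0}."""
    errors = 0
    for y, x0, x_tilde in data:
        score = x0 + sum(a * b for a, b in zip(x_tilde, theta))
        errors += (1 if score >= 0 else 0) != y
    return errors / len(data)


def l0(theta):
    """Number of non-zero coefficients."""
    return sum(t != 0 for t in theta)


def penalized_min(data, grid, p, lam):
    """Minimize risk + lam * l0 directly over the whole grid."""
    return min(risk(th, data) + lam * l0(th) for th in product(grid, repeat=p))


def structural_min(data, grid, p, lam):
    """Minimize over m the constrained ERM value (at most m features) plus lam * m."""
    best = inf
    for m in range(p + 1):
        erm_m = min(risk(th, data) for th in product(grid, repeat=p) if l0(th) <= m)
        best = min(best, erm_m + lam * m)
    return best


# Hypothetical toy data: (label, always-included feature, candidate features).
data = [
    (1, 0.5, (1.0, 0.0)),
    (0, 0.5, (-1.0, 0.2)),
    (1, -0.2, (0.8, -0.5)),
    (0, -0.4, (-0.3, 0.9)),
    (1, 0.1, (0.4, 0.4)),
    (0, 0.3, (-0.9, -0.6)),
]
grid = (-1.0, 0.0, 1.0)  # must contain 0 so that sparse vectors are available
values = [
    (penalized_min(data, grid, 2, lam), structural_min(data, grid, 2, lam))
    for lam in (0.0, 0.05, 0.2, 1.0)
]
```

Each pair in `values` should coincide, which reflects the argument in the text: for any coefficient vector, the penalty at its own cardinality is never larger than the penalty at a looser cardinality bound.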
3 Theoretical Properties
In this section, we study theoretical properties of the $\ell_0$-penalized ERM classification approach. Let $F$ denote the joint distribution of $(Y, X)$. For any indicator function , let
For , let
For any measurable function , let denote the -norm of . The functions and as well as the -norm depend on the data generating distribution . It is straightforward to see that, for any binary classifier ,
so that is minimized at . The optimal classifier is known as the Bayes classifier in the classification literature.
We assess the predictive performance of the $\ell_0$-penalized ERM approach by bounding the excess risk
The difference is non-negative by (3.4). Hence, a good classifier will result in a small value of with a high probability and also on average.
We impose the following assumption.
Condition 1. For every data generating distribution , there is a non-negative integer , which may depend on , such that and .
Let denote the smallest value of non-negative integers satisfying . By Condition 1, such value is finite and always exists. Condition 1 implies that the Bayes classifier admits a linear threshold crossing structure in the sense that the equivalence
holds almost surely for some , where can be interpreted as the sparsity parameter associated with , which is unknown in this binary classification problem. Moreover, the assumption that implies that the feature vector is rich enough to embody those relevant ones for constructing the Bayes classifier.
For any two real numbers and , let and . For any , let and respectively denote the integer ceiling and floor of . We impose the following condition on the growth rates of relative to the sample size.
for some constant .
The estimate corresponds to the number of features selected under the $\ell_0$-penalized ERM approach. We now provide a result on the statistical behavior of , which sheds light on the dimension reduction performance of our penalized estimation approach.
For any fixed , we can deduce from Theorem 1 that as . Moreover, this theorem implies that our approach is effective in reducing the feature dimension in the sense that, with high probability in large samples, the number of selected features is capped above by the quantity (3.9), which can be made arbitrarily close to the true sparsity in the classification problem. Specifically, if turns out to be smaller than , the result (3.8) implies that tends to zero exponentially in .
The next theorem characterizes the predictive performance of the $\ell_0$-penalized ERM approach.
Theorem 2. Under the setup and assumptions stated in Theorem 1, the following result holds:
which converges to zero whenever
The rate condition (3.16) allows the case that
In other words, the $\ell_0$-penalized ERM classification approach is risk-consistent even when the dimension of the input feature space grows exponentially in sample size, provided that the number of truly effective features grows at most at a polynomial rate.
We shall provide some further remarks on the convergence rate result (3.15). Condition 1 implies that the space contains the Bayes classifier . Thus, if the value of were known, one could perform classification via the $\ell_0$-constrained ERM approach where the empirical risk is minimized with respect to . The lower bound on the VC dimension of the classifier space grows at rate (Abramovich and Grinshtein (2017, Lemma 1)). Hence, the rate is the minimax optimal rate at which the excess risk converges to zero under this constrained estimation approach (Devroye, Györfi, and Lugosi (1996, Theorem 14.5)). In view of this, suppose grows at a polynomial or exponential rate in . Then our rate result (3.15) is nearly oracle in the sense that, when grows at rate , the rate (3.15) remains close within some factor to the optimal rate attained under the case of known . Moreover, both rates coincide and reduce to when the value of does not increase with the sample size.
4 Computational Algorithms
While the ERM approach to binary classification is theoretically sound, its implementation is computationally challenging and is known to be an NP-hard problem (Johnson and Preparata (1978)). Florios and Skouras (2008) developed a mixed integer optimization (MIO) based computational method and provided numerical evidence demonstrating the effectiveness of the MIO approach to solving ERM type optimization problems. Kitagawa and Tetenov (2018) and Mbakop and Tabord-Meehan (2018) adopted the MIO solution approach to solve the optimal treatment assignment problem, which is closely related to the ERM based classification problem. The MIO approach is also useful for solving problems of variable selection through $\ell_0$-norm constraints. See Bertsimas, King, and Mazumder (2016) and Chen and Lee (2018), who proposed MIO based computational algorithms for solving the $\ell_0$-constrained regression and classification problems respectively.
Motivated by these previous works, we now present an MIO based computational method for solving the $\ell_0$-penalized ERM problem. Given that , solving the problem (2.2) amounts to solving
We assume that the parameter space takes the form
where and are lower and upper parameter bounds such that for .
Our implementation builds on the method of mixed integer optimization. Specifically, we note that the minimization problem (4.1) can be equivalently reformulated as the following mixed integer linear programming problem:
where is a given small and positive real scalar (e.g. as in our numerical study), and
We now explain the equivalence between (4.1) and (4.2). Given , the inequality constraints (4.3) and the dichotomization constraints (4.5) enforce that for . Moreover, the on-off constraints (4.4) and (4.6) ensure that, whenever , the value must also be zero so that the corresponding component of the feature vector is excluded in the resulting $\ell_0$-penalized ERM classifier. The sum thus captures the number of non-zero components of the vector . As a result, both minimization problems (4.1) and (4.2) are equivalent. This equivalence enables us to employ modern MIO solvers to solve for $\ell_0$-penalized ERM classifiers. For implementation, note that the values in the inequality constraints (4.3) can be computed by formulating the maximization problem in (4.7) as linear programming problems, which can be efficiently solved by modern numerical solvers. Hence these values can be easily computed and stored as inputs to the MILP problem (4.2).
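The logic of the dichotomization and on-off constraints can be illustrated with a small Python check. This is a sketch under stated assumptions, not the paper's exact formulation: it assumes a box parameter space, a big-M constraint of the form `(d - 1) * M <= score < d * (M + delta)` for each observation, and an on-off constraint of the form `e * lower_j <= theta_j <= e * upper_j` for each coefficient. All names (`big_m`, `feasible_d`, `feasible_e`) and the numbers are hypothetical. For a box, the maximal absolute score has a closed form, so no linear programming solver is needed here.

```python
from typing import List, Sequence


def big_m(x0: float, x_tilde: Sequence[float], lower, upper) -> float:
    """Largest |x0 + x_tilde' theta| over the box lower <= theta <= upper."""
    hi = x0 + sum(max(a * l, a * u) for a, l, u in zip(x_tilde, lower, upper))
    lo = x0 + sum(min(a * l, a * u) for a, l, u in zip(x_tilde, lower, upper))
    return max(abs(hi), abs(lo))


def feasible_d(score: float, m: float, delta: float) -> List[int]:
    """Binary values d allowed by the assumed big-M constraint on one observation."""
    return [d for d in (0, 1) if (d - 1) * m <= score < d * (m + delta)]


def feasible_e(theta_j: float, lower_j: float, upper_j: float) -> List[int]:
    """Binary values e allowed by the assumed on-off constraint on one coefficient."""
    return [e for e in (0, 1) if e * lower_j <= theta_j <= e * upper_j]


# Hypothetical box, coefficient vector, and observations (x0, x_tilde).
lower, upper = [-2.0, -2.0], [2.0, 2.0]
theta = [1.5, 0.0]
rows = [(1.0, [0.5, -1.0]), (-2.0, [1.0, 2.0]), (0.0, [0.0, 3.0])]
delta = 1e-6  # small positive slack; the value here is an assumption

checks = []
for x0, x_tilde in rows:
    score = x0 + sum(a * t for a, t in zip(x_tilde, theta))
    m = big_m(x0, x_tilde, lower, upper)
    checks.append((feasible_d(score, m, delta), 1 if score >= 0 else 0))

# A minimizing solver picks the smallest feasible e for each coefficient,
# so the sum of these values counts the non-zero coefficients.
selected = [min(feasible_e(t, l, u)) for t, l, u in zip(theta, lower, upper)]
```

For every observation exactly one binary value is feasible and it equals the classifier's prediction, while the smallest feasible on-off values sum to the number of non-zero coefficients. This is the mechanism that lets a linear objective in the binary variables reproduce the penalized misclassification risk.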
5 Simulation Study
In this section, we conduct simulation experiments to study the performance of our approach. We consider a simulation setup similar to that of Chen and Lee (2018, Section 5) and use the following data generating design. Let be a multivariate normal random vector with mean zero and covariance matrix with its element . The binary outcome is generated according to the following specification:
where denotes the true data generating parameter value, is a dimensional feature vector with and , and is a random variate that is independent of and follows the standard logistic distribution. The constant term in is included to capture the regression intercept. We set and for . The coefficient is chosen to be non-zero such that, among all the feature variables in , only the variable is relevant in the data generating process (DGP). We consider the following two specifications for and :
We used simulation repetitions in each Monte Carlo experiment. For each simulation repetition, we generated a training sample of observations for estimating the coefficients and a validation sample of observations for evaluating the out-of-sample classification performance. We considered simulation configurations with to assess the classifier’s performance in both the low and high dimensional binary classification problems.
We specified the parameter space to be for the MIO computation of the $\ell_0$-penalized ERM classifiers. Throughout this paper, we used the MATLAB implementation of the Gurobi Optimizer to solve the MIO problems (4.2). Moreover, all numerical computations were done on a desktop PC (Windows 7) equipped with 32 GB RAM and a 3.5 GHz Intel i7-5930K CPU. To reduce the computation cost of solving the $\ell_0$-penalized ERM problems, we set the MIO solver time limit to one hour, beyond which we forced the solver to stop early and used the best discovered feasible solution to construct the resulting $\ell_0$-penalized ERM classifier. For implementation, it remains to specify an exact form of the penalty parameter in (2.2). We set
where is a tuning constant which remains to be calibrated. The form (5.1) implies that the value in Condition 2 is taken to be , which will satisfy inequality (3.11) and hence validate the probability bound (3.8) when is sufficiently large. Moreover, by the risk upper bound (3.14), the convergence rate result (3.15) continues to hold up to a factor of .
For practical applications, we recommend calibrating the tuning scalar via the method of cross validation. Yet, for this simulation study, we used a simple heuristic rule and set
The value in (5.3) can also be computed via the MIO approach by simply removing from the MIO problem (4.2) the constraints (4.4) and (4.6) as well as the binary controls and the penalty part in the objective function. This computation is much faster as it is concerned with one-dimensional optimization. The rationale behind the choice (5.2) is as follows. Note that (5.3) corresponds to an ERM classification using classifier (2.1) where only consists of the intercept term, and (5.2) corresponds to an estimate of the variance of the misclassification loss under such a simple classification rule. Intuitively speaking, the value captures the variability of the empirical risk under a parsimonious feature space specification. From the bias and variance tradeoff perspective, when this variability is small, we may as well increase the classifier flexibility by attaching a small penalty in the penalized ERM procedure so as to induce a richer set of selected features for classification.
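The heuristic described above can be sketched in Python under explicit assumptions. The exact formula (5.2) is not reproduced here; the sketch assumes that the variance estimate is the Bernoulli variance `r * (1 - r)`, where `r` is the minimized empirical risk of a classifier that uses only the always-included feature and an intercept. The one-dimensional minimization is done by enumerating the breakpoints of the piecewise constant risk instead of calling an MIO solver. The intercept bounds, names, and data are hypothetical.

```python
from typing import Sequence


def intercept_only_risk(
    y: Sequence[int], x0: Sequence[float], lower: float, upper: float
) -> float:
    """Minimum empirical risk of the rule 1{x0 + a >= 0} over a in [lower, upper].

    The risk is piecewise constant in a and can only change at a = -x0_i,
    so it suffices to evaluate the lower bound and every breakpoint in range.
    """
    candidates = [lower] + [-v for v in x0 if lower <= -v <= upper]
    best = 1.0
    for a in candidates:
        errors = sum(1 for yi, v in zip(y, x0) if (1 if v + a >= 0 else 0) != yi)
        best = min(best, errors / len(y))
    return best


def heuristic_constant(
    y: Sequence[int], x0: Sequence[float], lower: float = -10.0, upper: float = 10.0
) -> float:
    """Assumed variance estimate r * (1 - r) of the misclassification loss."""
    r = intercept_only_risk(y, x0, lower, upper)
    return r * (1.0 - r)


# Hypothetical sample in which no threshold on x0 separates the labels perfectly.
y = [1, 0, 0, 0, 1]
x0 = [2.0, 1.0, -1.0, -2.0, -0.5]
c_hat = heuristic_constant(y, x0)
```

In this toy sample the best intercept-only rule misclassifies one observation out of five, so the assumed variance estimate is 0.2 × 0.8 = 0.16. A smaller value would translate into a lighter penalty and hence a richer set of selected features, in line with the rationale given in the text.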
Let logit_lasso denote the $\ell_1$-penalized logistic regression approach (see, e.g., Friedman, Hastie, and Tibshirani, 2010). The logit_lasso estimation approach is a computationally attractive approach that can be used to estimate high dimensional binary response models. We compared in simulations the performance of our method to that of the logit_lasso approach. We used the MATLAB implementation of the well known glmnet computational algorithms (Qian, Hastie, Friedman, Tibshirani, and Simon, 2013) for solving the logit_lasso estimation problems. We did not penalize the coefficient of the feature variable so that, as in the simulations of the $\ell_0$-penalized ERM approach, this variable would always be included in the resulting classifier constructed under the logit_lasso approach. We calibrated the lasso penalty parameter value over a sequence of 100 values via the 10-fold cross validation procedure. We used the default setup of glmnet for constructing this tuning sequence, among which we reported results based on the following two choices, $\lambda_{\min}$ and $\lambda_{1se}$, of the penalty parameter value. The value $\lambda_{\min}$ refers to the lasso penalty parameter value that minimized the cross validated misclassification risk, whereas $\lambda_{1se}$ denotes the largest penalty parameter value whose corresponding cross validated misclassification risk still falls within one standard error of the cross validated misclassification risk evaluated at $\lambda_{\min}$. The choice $\lambda_{1se}$ induces a more parsimonious estimating model and is known as the “one-standard-error” rule, which is also commonly employed in the statistical learning literature (Hastie, Tibshirani, and Friedman, 2009).
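The two penalty choices described above can be made concrete with a short Python sketch. It takes a cross-validation curve as given and does not call glmnet; the function name and the numbers are hypothetical.

```python
from typing import Sequence, Tuple


def select_penalties(
    penalties: Sequence[float],
    cv_risk: Sequence[float],
    cv_se: Sequence[float],
) -> Tuple[float, float]:
    """Return (risk-minimizing penalty, one-standard-error penalty).

    The first value minimizes the cross validated risk. The second is the
    largest penalty whose cross validated risk is within one standard error
    of that minimum.
    """
    i_min = min(range(len(penalties)), key=lambda i: cv_risk[i])
    threshold = cv_risk[i_min] + cv_se[i_min]
    within = [pen for pen, r in zip(penalties, cv_risk) if r <= threshold]
    return penalties[i_min], max(within)


# Hypothetical cross-validation curve over four penalty values.
penalties = [1.0, 0.5, 0.25, 0.1]
cv_risk = [0.30, 0.22, 0.20, 0.21]
cv_se = [0.02, 0.02, 0.03, 0.02]
penalty_min, penalty_1se = select_penalties(penalties, cv_risk, cv_se)
```

Here the risk is minimized at 0.25, while the one-standard-error rule moves to the larger penalty 0.5, whose risk of 0.22 is still below the threshold 0.20 + 0.03. The larger penalty typically yields a sparser model.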
We considered the following performance measures. Let denote the estimated coefficients under a given classification approach. For the logit_lasso approach, we derived by dividing the lasso-penalized logistic regression coefficients of the variables by the magnitude of that of the variable . We can easily deduce that is the Bayes classifier in this simulation design. To assess the classification performance, we report the relative risk, which is the ratio of the misclassification risk evaluated at the classifier over that evaluated at the Bayes classifier. In each simulation repetition, we approximated the out-of-sample misclassification risk using the generated validation sample. Let and respectively denote the average of in-sample and that of out-of-sample relative risks over all the simulation repetitions.
We also examine the feature selection performance of the classification method. We say that a feature variable is effectively selected if and only if the magnitude of is larger than a small tolerance level (e.g., as used in our numerical study) which is distinct from zero in numerical computation. Let be the proportion of the variable being effectively selected. Let be the proportion of obtaining an oracle feature selection outcome where the variable was the only one that was effectively selected among all the variables in . Let denote the average number of effectively selected features whose true DGP coefficients are zero.
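The three feature selection measures defined above can be computed as in the following Python sketch. The tolerance default, the function names, and the toy estimates are hypothetical; features are indexed from zero and the set `relevant` holds the indices of the truly relevant variables.

```python
from typing import Dict, Sequence, Set


def selected_set(theta_hat: Sequence[float], tol: float) -> Set[int]:
    """Indices of coefficients whose magnitude exceeds the tolerance."""
    return {j for j, t in enumerate(theta_hat) if abs(t) > tol}


def selection_summary(
    estimates: Sequence[Sequence[float]], relevant: Set[int], tol: float = 1e-6
) -> Dict[str, float]:
    """Average the selection outcomes over simulation repetitions."""
    n_rep = len(estimates)
    hits = oracle = irrelevant = 0
    for theta_hat in estimates:
        chosen = selected_set(theta_hat, tol)
        hits += relevant <= chosen          # every relevant feature was selected
        oracle += chosen == relevant        # exactly the relevant features
        irrelevant += len(chosen - relevant)
    return {
        "correct_selection": hits / n_rep,
        "oracle_selection": oracle / n_rep,
        "avg_irrelevant": irrelevant / n_rep,
    }


# Hypothetical estimates from four repetitions; only feature 0 is relevant.
estimates = [
    [0.9, 0.0, 0.0],
    [1.1, 0.3, 0.0],
    [0.0, 0.2, 0.0],
    [1.0, 1e-9, 0.0],
]
summary = selection_summary(estimates, relevant={0})
```

The last repetition shows why a tolerance is used: a coefficient of 1e-9 is treated as numerically zero, so that repetition still counts as an oracle selection outcome.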
5.1 Simulation Results
We now present in Tables 1 and 2 the simulation results under the setups of DGP(i) and DGP(ii) respectively. From these two tables, we find that, regarding the in-sample classification performance, our method outperformed the two logit_lasso based approaches across almost all the DGP configurations in the simulation. For the out-of-sample classification performance, we see that the $\ell_0$-penalized ERM classifier dominated the logit_lasso classifiers across all simulation scenarios and this dominance was more evident in the high dimensional setup with .
Concerning the feature selection results, both Tables 1 and 2 indicate that all three classification approaches had high rates and hence were effective for selecting the relevant variable . However, the good performance in the criterion might just be a consequence of overfitting, which may result in excessive selection of irrelevant variables and thus adversely affect the out-of-sample classification performance. From the results on the performance measure, we note that the numbers of irrelevant variables selected under the two logit_lasso based approaches remained quite large relative to those under the $\ell_0$-penalized ERM approach even though all these approaches exhibited the effect of shrinking the feature space dimension. In fact, we observe non-zero and high values of for the $\ell_0$-penalized ERM classifier across all the simulation setups whereas the two logit_lasso classifiers could not induce any oracle variable selection outcome in the simulation. These feature selection results help to explain why the risk performance dominance of the $\ell_0$-penalized ERM approach could be observed even in the DGP(i) simulations where the logistic regression model was correctly specified.
6 Conclusions
In this paper, we study the binary classification problem in a setting with high dimensional vectors of features. We construct a binary classification procedure by minimizing the empirical misclassification risk with a penalty on the number of selected features. We establish a finite-sample probability bound showing that this classification approach can yield a sparse solution for feature selection with high probability. We also conduct non-asymptotic analysis on the excess misclassification risk and establish its rate of convergence. For implementation, we show that the penalized empirical risk minimization problem can be solved via the method of mixed integer linear programming.
There are a few topics one may consider as possible extensions. First, it might be fruitful to explore an $\ell_0$-penalized approach for regression and other estimation problems. Second, our proposed method is suitable for training samples of small or moderate size. It would be a natural step to develop a divide-and-conquer algorithm for large-scale problems (see, e.g., Shi, Lu, and Song, 2018). Third, our approach might be applicable for developing sparse policy learning rules (see, e.g., Athey and Wager, 2018). These are topics for further research.
Appendix A Proofs of Theoretical Results
Lemma 1. For all , there is a universal constant , which depends only on , such that
for any integer such that
A.1 Proof of Theorem 1
Proof of Theorem 1.
We first prove the probability bound (3.8). Let . Because , it is straightforward to see that
Since , it follows from (A.3) that
where, for any ,
By construction, for any indicator function . We thus have that
For each positive integer , let
Given that , by (A.10), we have that
Therefore, by (3.10), we have that, for all ,
where the second inequality follows from (A.9).
By (A.4), we have that
where (A.14) follows from (A.11) and (A.13), (A.15) follows from the fact that and for all , and, because for , (A.16) follows from an application of Lemma 1, where the value of in this lemma is taken over the range .
A.2 Proof of Theorem 2
References
- Abramovich and Grinshtein (2017) Abramovich, F., and V. Grinshtein (2017): “High-dimensional classification by sparse logistic regression,” arXiv preprint arXiv:1706.08344.
- Achterberg and Wunderling (2013) Achterberg, T., and R. Wunderling (2013): “Mixed integer programming: Analyzing 12 years of progress,” in Facets of combinatorial optimization, pp. 449–481. Springer.
- Athey and Wager (2018) Athey, S., and S. Wager (2018): “Efficient policy learning,” arXiv preprint arXiv:1702.02896.
- Bertsimas, King, and Mazumder (2016) Bertsimas, D., A. King, and R. Mazumder (2016): “Best subset selection via a modern optimization lens,” Annals of Statistics, 44(2), 813–852.
- Bertsimas and Weismantel (2005) Bertsimas, D., and R. Weismantel (2005): Optimization over integers, vol. 13. Dynamic Ideas Belmont.
- Bickel and Levina (2004) Bickel, P. J., and E. Levina (2004): “Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations,” Bernoulli, 10(6), 989–1010.
- Boucheron, Bousquet, and Lugosi (2005) Boucheron, S., O. Bousquet, and G. Lugosi (2005): “Theory of classification: A survey of some recent advances,” ESAIM: Probability and Statistics, 9, 323–375.
- Chen and Lee (2018) Chen, L.-Y., and S. Lee (2018): “Best subset binary prediction,” Journal of Econometrics, 206(1), 39–56.
- Devroye, Györfi, and Lugosi (1996) Devroye, L., L. Györfi, and G. Lugosi (1996): Probabilistic Theory of Pattern Recognition. Springer.
- Fan and Fan (2008) Fan, J., and Y. Fan (2008): “High dimensional classification using features annealed independence rules,” Annals of Statistics, 36(6), 2605–2637.
- Fan, Fan, and Wu (2011) Fan, J., Y. Fan, and Y. Wu (2011): “High-dimensional classification,” in High-dimensional data analysis, pp. 3–37. World Scientific.
- Florios and Skouras (2008) Florios, K., and S. Skouras (2008): “Exact computation of max weighted score estimators,” Journal of Econometrics, 146(1), 86–91.
- Friedman, Hastie, and Tibshirani (2010) Friedman, J., T. Hastie, and R. Tibshirani (2010): “Regularization paths for generalized linear models via coordinate descent,” Journal of Statistical Software, 33(1), 1–22.
- Greenshtein (2006) Greenshtein, E. (2006): “Best subset selection, persistence in high-dimensional statistical learning and optimization under $\ell_1$ constraint,” Annals of Statistics, 34(5), 2367–2386.
- Hastie, Tibshirani, and Friedman (2009) Hastie, T., R. Tibshirani, and J. Friedman (2009): The Elements of Statistical Learning: Prediction, Inference and Data Mining, vol. 2. Springer series in statistics New York.
- Jiang and Tanner (2010) Jiang, W., and M. A. Tanner (2010): “Risk Minimization for Time Series Binary Choice with Variable Selection,” Econometric Theory, 26(5), 1437–1452.
- Johnson and Preparata (1978) Johnson, D., and F. Preparata (1978): “The densest hemisphere problem,” Theoretical Computer Science, 6(1), 93–107.
- Jünger, Liebling, Naddef, Nemhauser, Pulleyblank, Reinelt, Rinaldi, and Wolsey (2009) Jünger, M., T. M. Liebling, D. Naddef, G. L. Nemhauser, W. R. Pulleyblank, G. Reinelt, G. Rinaldi, and L. A. Wolsey (2009): 50 years of integer programming 1958-2008: From the early years to the state-of-the-art. Springer Science & Business Media.
- Kitagawa and Tetenov (2018) Kitagawa, T., and A. Tetenov (2018): “Who should be treated? empirical welfare maximization methods for treatment choice,” Econometrica, 86(2), 591–616.
- Lugosi (2002) Lugosi, G. (2002): “Pattern Classification and Learning Theory,” in Principles of Nonparametric Learning, ed. by L. Györfi, pp. 1–56. Springer.
- Mbakop and Tabord-Meehan (2018) Mbakop, E., and M. Tabord-Meehan (2018): “Model Selection for Treatment Choice: Penalized Welfare Maximization,” arXiv preprint arXiv:1609.03167.
- Nemhauser and Wolsey (1999) Nemhauser, G. L., and L. A. Wolsey (1999): Integer and combinatorial optimization. Wiley-Interscience.
- Qian, Hastie, Friedman, Tibshirani, and Simon (2013) Qian, J., T. Hastie, J. Friedman, R. Tibshirani, and N. Simon (2013): Glmnet for Matlab http://www.stanford.edu/~hastie/glmnet_matlab/.
- Shi, Lu, and Song (2018) Shi, C., W. Lu, and R. Song (2018): “A Massive Data Framework for M-Estimators with Cubic-Rate,” Journal of the American Statistical Association, forthcoming.
- Vapnik (2000) Vapnik, V. (2000): The nature of statistical learning theory, vol. 2. Springer-Verlag New York.