The regression analysis is a statistical methodology for predicting values of response (dependent) variables from a set of explanatory (independent) variables by investigating the relationships among the variables. The regression analysis is used for forecasting and prediction in a variety of areas, from economics to biology. When the relationship among the variables is expressed as a linear equation and the set of explanatory variables has more than one variable, it is termedmultiple linear regression. The multiple linear regression model is the most popular model among the various variants of regression analyses.
Given a fixed set of explanatory variables, the goal of the multiple linear regression is to find the coefficients for the explanatory variables that minimize the fitting error. When the fitting error is measured as the sum of squared errors (), optimal coefficients can be calculated by a formula. However, when the fitting error is measured as the sum of absolute errors (), often referred as the least absolute deviation (LAD) regression, there is no explicit formula available. Charnes et al. (1955)
first pointed out that LAD regression is essentially a linear program (LP), andWagner (1959) formulated the problem as an LP. Schlossmacher (1973) proposed an alternative approach by iteratively reweighting the square error to build LAD regression. For a detailed review of algorithms for LAD regression, the reader is referred to Narula and Wellington (1982) and Dielman (2005).
The subset selection problem, also referred as variable selection or model selection, for multiple linear regression is to choose a subset of the set of explanatory variables to build an efficient linear regression model. In detail, given a data set with observations and explanatory variables, we want to use only explanatory variables to build a linear regression model. The goal is to decrease as much as possible while maintaining error loss relative small. Miller (1984) stated that regression subset selection consists of two steps: (i) determining an objective to measure the efficiency of the model and (ii) developing an algorithm to solve the problem and optimize the determined objective function.
Depending on the goal of the regression analysis, an objective function is defined to measure the efficiency of the model. Given a subset of the explanatory variables and the corresponding coefficients, the objective function is typically defined based on the number of explanatory variables used and the errors that the regression model produces. Criteria such as the mean square error (), mean absolute error (), adjusted , Mallow’s
, etc, are in this category. On the other hand, there also exist objective functions that additionally take the regression coefficients into account by penalizing large regression coefficients. Among many variants in this category, ridge and LASSO regressions are the most popular one proposed byHoerl and Kennard (1970) and Tibshirani (1996), respectively. Note that the objective functions in the second category do not explicitly take the number of explanatory variables used into account and require penalty parameters. In this paper, we focus on and .
After an objective function is chosen, an algorithm is needed to optimize the objective function value over all possible regression models. Algorithms for optimizing have already been studied. Among them, stepwise-type algorithms are frequently used in practice due to their computational simplicity and efficiency. An exact algorithm is to enumerate all possible regression models, but the computational cost is excessive. To overcome this computational difficulty, Furnival and Wilson (1974) proposed a branch-and-bound algorithm, called leaps-and-bound, to find the best subset for without enumerating all possible subsets. For subset selection of LAD regression, Konno and Yamamoto (2009) presented a mixed integer program (MIP) to optimize given fixed . Recently, Bertsimas et al. (2015) proposed an MIP based algorithm for optimizing SSE and SAE given fixed . A discrete first order method is proposed and used to warmstart the MIP formulation, which is formulated based on specially ordered sets, Bertsimas and Weismantel (2005), to avoid use of big M. However, to the best of authors’ knowledge, there is no direct mathematical formulation to optimize or and directly. For a detailed review of algorithms for subset selection, the reader is referred to Miller (1984) and Miller (2002).
The MIP model presented in Konno and Yamamoto (2009) can be compared to the models studied in Bienstock (1996), Bertsimas and Shioda (2009), and Bertsimas et al. (2015), as all of them include cardinality constraints. In Table 1, we summarize the properties of the mathematical programming models for regression subset selection in the literature, including the models in this paper. The models in Konno and Yamamoto (2009), Bertsimas and Shioda (2009), and Bertsimas et al. (2015) assume fixed and the cardinality constraint is explicit in the models. Konno and Yamamoto (2009) optimizes
by introducing binary variables, whereasBertsimas and Shioda (2009) optimizes without introducing binary variables. de Farias and Nemhauser (2003) provides a polyhedral study for the cardinality constrained models studied in Bienstock (1996) and Bertsimas and Shioda (2009). In contrast to the models in Konno and Yamamoto (2009), Bertsimas and Shioda (2009), and Bertsimas et al. (2015), we optimize and without fixing . The models in Sections 2.1 and 2.2 do not have explicit cardinality constraint since the objective functions implicitly penalize large cardinality. With fixed , the model in Section 2.1 reduces to the model in Konno and Yamamoto (2009). The difference between the models in Section 2.2 with fixed and Bertsimas and Shioda (2009) is the existence of binary variables. The difference between the models in Section 2.2 with fixed and Bertsimas et al. (2015) is the use of big M or specially ordered sets.
|Model reference||Objective||Cardinality constraint||Binary variables|
|Konno and Yamamoto (2009)||Explicit and fixed||Exist|
|Bertsimas and Shioda (2009)||Explicit and fixed||Not exist|
|Bertsimas et al. (2015)||Explicit and fixed||Exist|
|Section 2.1||Implicit and is decision variable||Exist|
|Section 2.2||Implicit and is decision variable||Exist|
Observe that when there are more explanatory variables than observations (), we can build a regression model with zero fitting error. Precisely, if , then linearly independent explanatory variables and one intercept variable yield a regression model with zero error. In this case, and always yield an optimal value of 0. Hence, prior work on model selection assumes . However, in practice, there are also data sets with . One example is gene data. Gene information has many attributes (explanatory variables) whereas only few observations are usually available. For more examples, the reader is referred to Stodden (2006), where a study how model selection algorithms behave for this case with a different but fixed ratio of is provided. Candes and Tao (2007) proposed an -regularized problem based approach, called Dantzig selector, for the case when . They minimize the summation of the absolute value of the regression coefficients subject to a bounding constraint that is a function of errors. Their approach does not explicitly take into account the number of selected variables.
In the statistics community, subset selection when is called high dimensional variable selection. Note that, if each row of the data matrix is an observation, the length of the data matrix is greater than the width when , and the width of the data matrix is greater when . Based on the shape of the data matrix, we hereafter refer to the cases and as thin and fat cases, respectively.
In this paper, we present mathematical programs for optimization of and for the thin case. Our model is different from the branch and bound algorithm of Furnival and Wilson (1974), as their model explores branches based on subsets whereas ours is based on branching on binary variables. Our model for the thin case does not fix and directly optimizes , whereas the model in Konno and Yamamoto (2009), Bertsimas and Shioda (2009), and Bertsimas et al. (2015) assume fixed and minimizes or . For the fat case, we define a new objective function and extend the formulations from the thin case. Our work is distinguished from LASSO since our objective is a modification of the traditional criteria of and , whereas LASSO penalizes the regression coefficients in the objective function. We also present algorithms that optimize the new objective function.
Our contribution can be summarized as follows.
For the thin case, we present mathematical programs for the subset selection problem that directly minimize and . To the best knowledge of the authors of this article, the proposed models are the first mathematical programming formulations that directly optimize and . We found that our models quickly return a good candidate solution when solved by commercial mixed integer programming solver. Further, our models can easily handle conditional inclusion and exclusion of explanatory variables in the regression model by adding constraints.
For the fat case, we introduce new objective functions by modifying and to further discourage large . The new objective functions prevent selecting all variables when the variables are highly correlated. We also present mathematical programs that directly minimize the new objective functions.
For the fat case, we present an algorithm that gives a quality solution in a relatively short time. We analyze the algorithm and show that it yields a local optimum. We also present a randomized version of the algorithm and show its convergence to global optimum.
In Section 2, the mathematical models for the thin case with and objectives are derived. In Section 3, we present the new objective functions and the modified mathematical models for the fat case. We also present algorithms to tackle large size instances for the fat case. Finally, we present computational experiments in Section 4.
2 Mathematical Models for Thin Case
In this section, we consider the thin case, in which there are more observations than explanatory variables. We derive mathematical programs to optimize and directly. We use the following notation:
: number of observations
: number of explanatory variables
: number of selected explanatory variables
: index set of observations
: index set of explanatory variables
: data matrix corresponding to the independent variables
: data vector corresponding to the dependent variable.
Throughout this paper, we use the following decision variables:
: coefficient of the explanatory variable,
: intercept of the regression model
: error term of the observation,
Note that the multiple linear regression model takes the form
Let us consider a regression model with fixed subset of . For the minimization of given , the following LP gives optimal regression coefficients:
We later use this LP as a subroutine when we need to construct a regression model that minimizes given a fixed subset. In this paper we consider the following two objective functions, where and are taken with respect to a subset of cardinality .
Mean Square Error (): minimize
is one of the most popular criteria (Tamhane and Dunlop, 1999), defined as divided by . Since is decreasing in , the goal is to balance and by minimizing . Another popular criteria is adjusted , usually denoted by . Adjusted is a modification of to penalize having too many explanatory variables in the model. In detail, it is defined as , where is the total sum of squares. Observe that maximizing is equivalent to minimizing , because is a constant. This implies that a solution that optimizes also optimizes .
Mean Absolute Error (): minimize
It is known that squared errors often overemphasize outliers. An alternative approach is to use absolute error to build a regression model that is more robust to outliers.
This section is organized as follows. In Section 2.1, we derive a mathematical program for . Then, in Section 2.2, we present a mathematical program for based on the arguments used for . In Section 2.3, we present a derivation of big M, which is an upper bound for the regression coefficients. Finally in Section 2.4 we provide some valid inequalities that help the LP-based branch-and-bound algorithm.
2.1 Minimization of
Observe that has two terms ( and ) that can be written as and in terms of the decision variables. Using these expressions, we can write a mathematical model
to minimize . Observe that, if we add constraint to (3) given fixed , we obtain an easier problem, which is equivalent to the model presented by Konno and Yamamoto (2009) since the denominator of the objective becomes constant. By adding cardinality constraint with fixed and by replacing (3c) with specially order sets based constraints, we obtain the model presented in Bertsimas et al. (2015). The remaining development is completely different from the work in Konno and Yamamoto (2009) or Bertsimas et al. (2015) and thus new. This is due to the fact that they assume fixed which implies that model (3) is already linear. In our case we have to linearize this model which is not a trivial task.
Note that in (3c) is a constant, which is an upper bound for ’s, that we have not yet specified. Konno and Yamamoto (2009) set an arbitrary large value for in their study. For now, let us assume that a proper value of is given (we derive a valid value for in a later section). To linearize nonlinear objective (3a), we introduce
In order to linearize nonlinear constraint (5b), we introduce , . Using a linearization technique, we obtain
The proof is given in Appendix A and is based on the fact that feasible solutions to (5) and (6) map to each other. Observe that the signs of , and in (6) are not restricted. In order to make all variables non-negative, we introduce , , and , in which and . We also use and , where , to replace the absolute value function in (6b). Finally, we obtain mixed integer program (7) for regression subset selection with the objective.
It is known that either or is equal to 0 if is minimized in the objective function. However, since is not directly minimized and binary variables are present in (6), we give the following proposition in order to make sure that (6) is equivalent to (7), where the proof is given in Appendix A.
An optimal solution to (7) must have either or for every .
By Proposition 2, it is easy to see that (7b) is equivalent to (6b). Therefore, (7) correctly solves (3). A final remark regarding the model is with regard to the dimension of the formulation. For a data set with candidate explanatory variables and observations, formulation (7) has variables (including binary variables) and constraints (excluding non-negativity constraints).
2.2 Minimization of
Observe that the only difference between and is that has , whereas has . Hence, the left hand side of (7b) is replaced by . Also, in order to make the constraint convex, we use inequality instead of equality. Hence, we use
instead of (7b). Finally, the mixed integer quadratically constrained program with the convex relaxation reads
2.3 Big M for ’s and ’s
Deriving a tight and valid value of in (7) and (9) is crucial for two reasons. For optimality, too small values cannot guarantee optimality even when the optimization model is solved optimally. For computation, a large value of causes numerical instability and slows down the branch-and-bound algorithm. Recall that we assume that a valid value of is given for the formulations (7) and (9) and that the same notation is used for both ’s and ’s. However, ’s and ’s are often in different magnitudes. Hence, it is necessary to derive distinct and valid values of for ’s and ’s. In this section, we derive valid values of for ’s and ’s in (7). The result also holds for (9) with trivial modifications. We also provide a computationally faster procedure for for ’s in (9) as an alternative. Both of the values do not cause any numerical errors in our experiments.
First, let us consider for ’s. Observe that a valid for must be greater than all possible values for . However, it is generally better to have tight upper bounds. Hence, we use , the mean absolute error of an optimal regression model with all explanatory variables, as upper bounds. We set
for every in (7). Note that (10) can be calculated by LP formulation (2) in polynomial time. By using the value in (10), we treat regression models that have worse objective function values than as infeasible.
Next, let us consider for ’s in (7) for . We start with the following assumption.
Data set is linearly independent.
This assumption implies that there is no regression model with total error equal to 0. If there exists a regression model with total error equal to 0, then we have , which is the minimum value that we can find for . In this case, there is no reason to reduce the number of regression variables. Also, in practice, we typically have a data set with structural and random noises, and thus Assumption 1 holds.
In order to find a valid value of for ’s in (7), we formulate an LP. Let be the decision variable having the role of . Let and be the average of ’s and the maximum total error bound allowed, respectively. Any attractive regression model should have the total error less than in order to justify the effort, because with gives an automatically worse objective function value than the model with no explanatory variable. This requirement is written as
For notational convenience, let
be a vector in (11).
Next, let us try to increase to its maximum value. For a fixed , we define the objective as
With the second term, we force to be the maximum value we need, yet not preventing a further increment of . From the linear program
we obtain , a candidate for , from the value of of an optimal solution solution to (12). Similarly, is obtained from . Then the maximum value for explanatory variable can be obtained by setting . Finally, considering all explanatory variables, we define as
Before we proceed, we first need to make sure that (12) is not unbounded so that the values are well defined.
Linear program (12) is bounded.
The proofs are given in Appendix A. Note that Lemma 1 implies that (11) covers all possible values of and of (7) with the maximum total error bound . Note also that in (13) is the maximum value out of all possible values of and that (11) covers.
For a contradiction, suppose that is not a valid upper bound for ’s in (7). That is, there exists a regression model with total error less than but , in which is the coefficient for explanatory variable . However, by Lemma 1, we must have a corresponding feasible solution for (11) with . Note that must satisfy from (11d). Then, implies . This contradicts definition (13). A similar argument holds if . Hence, is a valid upper bound. ∎
Observe that a similar approach can be used to derive a valid value of for ’s in (9) for . Calculating a valid value for ’s in (7) and (9) consists of solving LPs and quadratically constrained convex quadratic programs (QCP). Hence, we conclude that it can be obtained in polynomial time.
To reduce the computational time for the big calculation, we present an alternative approach that works for models from a different prospective.
Note that we can obtain coefficients of an optimal regression model that minimizes over all explanatory variables as , where and . This is equivalent to solving , with and . For a rational number (, , and relative prime numbers), a rational vector , and a rational matrix , let us define
size() := size()
size() := size().
Note that it is known that the size of solutions to are bounded. Here, we extend this over the various submatrices of and subvectors of encountered in our subset selection procedure. The following proposition provides a valid value of .
Value is a valid upper bound for ’s and ’s in (9).
The proof of Proposition 6 and the omitted detailed derivations are available in Appendix B. Observe that and can be calculated in polynomial time. In detail, it takes in which is the number of digits of the largest absolute number among all elements of and to compute . Recall that the previous approach requires to solve QCPs. Hence, we have an alternative polynomial time big calculation procedure which is computationally more efficient than the one provided by Proposition 5. However, this procedure yields a larger value of .
2.4 Valid Inequalities
To accelerate the computation, we apply several valid inequalities at the root node of the branch and bound algorithm. Let and be the objective function values of a heuristic and the LP relaxation, respectively. Let () be the objective function value of the LP relaxation of (7) after fixing (). Then, the following inequalities are valid for (7):
We do not provide proofs as it is trivial to establish their validity. In Figure 1, we illustrate the valid inequalities. In both figures, the dark and light-shaded areas represent the feasible and infeasible region, respectively, after applying the valid inequalities, whereas the combined area represents the original feasible region of the formulation. In Figure (0(a)), valid inequalities (14) and (15) are presented. Value is the optimal objective function value. In Figure (0(b)), is the objective function value of the LP relaxation with non-integer before applying (16). The black circles represent and that give valid lower bounds for any integer solution. Observe that integer feasible solutions (empty rectangles in the figure) are in the feasible region after applying the valid inequality.
3 Mathematical Models and Algorithms for Fat Case
Let us consider the fat case, in which there are more explanatory variables than observations. A natural extension of (7) or (9) for the fat case is to add cardinality constraint . This works in many cases, however, we found that the objectives and for the fat case could be problematic in some cases.
Minimizing can be thought as approximating the right hand side (dependent values ) using a combination of columns (explanatory variables). If we have more linearly independent explanatory variables than observations, we can always build a regression model with . Hence, if we allow , then the objective is not useful. Further, due to the definition of , we must have in order to make the numerator positive.
Suppose we can select explanatory variables out of candidate explanatory variables. Since converges to zero as we add more linearly independent explanatory variables and since and are close to each other, can be near zero. In this case, having explanatory variables might not be penalized enough by the definition of . This could make optimal and it actually happens in many instances studied in Section 4, which is not a desired solution in most cases. Hence, even with the restriction , may not be a useful criteria. In order to fix this issue, we propose a new objective function by additionally penalizing having too many explanatory variables in the regression model.
In this section, we first define new objective functions, which we call and , as alternatives to and . Then we modify the formulations in Section 2 based on the new objective functions. We also propose algorithms to solve the problems more efficiently. Throughout the section, we only present the result for , because the result for follows with trivial modifications, see Appendix C.
3.1 New Objective Function and Modified Formulations
Before we derive the objective function, let us temporarily assume so that any subset of automatically satisfies . We will relax this assumption later to consider . Suppose that we want to penalize large in a way that the best model with explanatory variables is as bad as a regression model with no explanatory variables. Hence, we want the objective function to give the same value for models with and . With this in mind, we propose the adjusted as
where is the mean absolute error of the optimal regression model with . Observe that (17) is equivalent to when . The penalty term increases as increases. Observe also that the numerator in (17) can be interpreted as LASSO-type objective with penalty to the cardinality.
Let us now assume that is near zero when , which happens often. Then we have
Hence, instead of near-zero , the new objective has almost the same value as when .
Recall that and is the objective function in the previous thin case model. Hence, we need to modify the definitions and constraints. First we rewrite constraint (7b) as
Finally, we remove the assumption we made () at the beginning of this section by adding cardinality constraint
and obtain the following final formulations:
In fact, without (20), cannot be well-defined since it becomes negative for and the denominator becomes 0 for . Observe that (21) is an MIP with variables (including binary variables) and constraints. Observe also that (7) with the additional constraint (20) can be used for the fat case. However, using explanatory variables out of candidate explanatory variables can lead to an extremely small as we explained at the beginning of this section.
for ’s to consider regression models that are better than having no regression variables. Given a heuristic solution with objective function value , we can strengthen by making solutions worse than the heuristic solution infeasible. Hence, we set for ’s in (19).
However, obtaining a valid value of for ’s in (21) is not trivial. Note that (13), which we used for the thin case, is not applicable for the fat case because LP (11) can easily be unbounded for the fat case. One valid procedure is to (i) generate all possible combinations of explanatory variables and all observations, (ii) compute for each combination using the procedure in Section 2.3, and (iii) pick the maximum value out of all possible combinations. However, this is a combinatorial problem. Actually, the computational complexity of this procedure is as much as that of solving (2) for all possible subsets. Hence, enumerating all possible subsets just to get a valid big M is not tractable.
Instead we can use a heuristic approach to obtain a good estimation of valid value of. In Appendix B, we propose a statistic-based procedure that ensures a valid value of with a certain confidence level. This procedure can give an value that is valid with confidence. However, for the instances considered in this paper, this procedure gives values of that are too large because many columns can be strongly correlated to each other. Note that a large value of can cause numerical errors when solving the MIP’s.
Hence, for computational experiment, we use a simple heuristic approach instead. Let us assume that we are given a feasible solution to (21) from a heuristic and ’s are the coefficient of the regression model. Then, we set
3.2 Core Set Algorithm
Observe that (21) might be difficult to solve optimally if the data is large. Given the same number of values, measured by , (21) is more difficult to solve than (7) because more explanatory variables require more binary decision variables and related constraints. To overcome this computational difficulty and get a quality solution quickly, we develop an iterative algorithm based on (21).
Let be a subset of such that , with the cardinality of defined by
where is a fraction that defines the target cardinality of . We refer to as the core set and iteratively solve
We present the algorithmic framework in Algorithm 1 based on the core set concept. Let be the current best subset in Algorithm 1 with corresponding objective function value . In Steps 1 - 3, we initialize core set with cardinality not exceeding . We solve (25) with in Step 5 and then update in Step 6. We iterate these steps until there is no improvement of the objective function value from a previous iteration.
We next explain how the core set is updated. The updating algorithm is presented in Algorithm 2. In Steps 13 and 14, the idea is to keep the explanatory variables of the current best subset in the core set and additionally selecting explanatory variables not in based on scores . The score is defined based on how much of the error could be reduced if we add explanatory variable to the current best subset . In Steps 1 - 6, we calculate ’s and ’s by checking neighboring subsets. Note that ’s can be calculated by LP formulation (2). In Steps 7 - 12, we update the current best subset if we found a better solution in Steps 1 - 6. If is updated, we go to Step 1 and restart the algorithm with new and . Observe that ’s in Steps 1 - 3 are only for updating in Step 8, whereas ’s and ’s in Steps 4 - 6 are also used to define in Step 13.
Let us define the neighborhood of set as
where defines the symmetric difference of and . Through the following propositions, we show that Algorithm 1 does not cycle and terminates with a local optimal solution based on the neighborhood definition given in (26).
Algorithm 1 does not cycle.
Algorithm 1 gives a local optimum.
3.3 Randomized Core Set Algorithm
infinitely many times, we show that we can find a global optimal solution with probability 1 when. The randomized version of Update-Core-Set is presented in Algorithm 3. Update-Core-Set-Random is similar to Update-Core-Set, with one difference. Instead of the greedy approach in Steps 13-14 of Algorithm 2, we randomly choose
explanatory variables one-by-one without replacement based on a probability distribution.
Let us next describe the initial probability distribution used in Step 2 of Algorithm 3. Let be the current best objective function value whenever explanatory variable is included in the regression model. We update ’s at each iteration throughout the entire algorithm. In detail, we set for whenever current best objective function value and subset are updated. In order to enhance the local optimal search, we give a bonus to the columns currently in by setting weight if and if . Observe that giving the same weight for all is equivalent to a random search. On the other hand, if the weight for is much smaller (hence much greater selection probability) than the weight for , then we are likely to choose all variables in , which is similar to Algorithm 2. By means of a computational experiment, we found out that giving twice more weights for compared to balances exploration and exploitation.
We normalize ’s and generate ’s so that and . In detail,