Log In Sign Up

Best subset selection in linear regression via bi-objective mixed integer linear programming

We study the problem of choosing the best subset of p features in linear regression given n observations. This problem naturally contains two objective functions including minimizing the amount of bias and minimizing the number of predictors. The existing approaches transform the problem into a single-objective optimization problem. We explain the main weaknesses of existing approaches, and to overcome their drawbacks, we propose a bi-objective mixed integer linear programming approach. A computational study shows the efficacy of the proposed approach.


page 1

page 2

page 3

page 4


COMBSS: Best Subset Selection via Continuous Optimization

We consider the problem of best subset selection in linear regression, w...

Novel bi-objective optimization algorithms minimizing the max and sum of vectors of functions

We study a bi-objective optimization problem, which for a given positive...

A Visual Web Tool to Perform What-If Analysis of Optimization Approaches

In Operation Research, practical evaluation is essential to validate the...

Blackout Resilient Optical Core Network

A disaster may not necessarily demolish the telecommunications infrastru...

Kidney Exchange with Inhomogeneous Edge Existence Uncertainty

Motivated by kidney exchange, we study a stochastic cycle and chain pack...

SWITSS: Computing Small Witnessing Subsystems

Witnessing subsystems for probabilistic reachability thresholds in discr...

First-Order Mixed Integer Linear Programming

Mixed integer linear programming (MILP) is a powerful representation oft...

1 Introduction

Linear regression models should have two important characteristics in practice including prediction accuracy and interpretability (Tibshirani, 1996)

. The traditional approach of constructing regression models is to minimize the sum of squared residuals. It is evident that models obtained in this approach have low biases. However, their prediction accuracy can be low due to their large variances. Furthermore, models constructed by this approach may contain a large number of predictors and so data analysts struggle in interpreting them.

In general, reducing the number of predictors in a regression model can improve not only the interpretability but also, sometimes, the prediction accuracy by reducing the variance (Tibshirani, 1996). Hence, there is often a trade-off between the amount of bias and the practical characteristics of a regression model. In other words, finding a desirable regression model is naturally a bi-objective optimization problem that minimizes the amount of bias and the number of predictors, simultaneously. This classical Statistics problem, that is

finding the best subset of

potential predictors and estimating their coefficients in a linear regression model given


is called the Best Subset Selection Problem (BSSP). Due to the intensive computation of solving bi-objective optimization problems, BSSP has always been transformed into a single-objective optimization problem (e.g., see Ren and Zhang (2010) and Bertsimas et al. (2016)). Such transformation may cause practical issues that will be discussed in this paper.

Albeit, recent algorithmic and theoretical advances have made bi-objective optimization problems computationally tractable in practice. More precisely, under some mild conditions, we are now able to solve them reasonably fast. To the best of our knowledge, this paper is the first work to tackle BSSP utilizing a bi-objective optimization approach.

The structure of the paper is organized as follows: In Section 2, two main single-objective approaches for BSSP are presented and their practical drawbacks are discussed. In Section 3, a new bi-objective optimization formulation for BSSP is developed. In Section 4, the computational results are reported. Finally, in Section 5, some concluding remarks are provided.

2 Single-objective models for Bssp

As discussed in Section 1, BSSP is naturally a bi-objective optimization problem which can be stated as where

is the feasible set of parameter estimator vector

, is the total bias and is the number of predictors. In the literature, the following two approaches have widely been used to convert BSSP to a single-objective optimization problem:

  1. The weighted sum approach: Given some , BSSP has been reformulated as

  2. The goal programming approach: Given some , BSSP has been reformulated as

For further details, interested readers are referred to Chen et al. (1998), Meinshausen and Bühlmann (2006), Zhang and Huang (2008), Bickel et al. (2009), Candés and Plan (2009), Ren and Zhang (2010), Ciuperca (2014) and Wang and Tian (2017) for the weighted sum approach, and Miller (2002) and Bertsimas et al. (2016) for the goal programming approach. Although, those two optimization programs (i) and (ii) can be solved significantly faster than a bi-objective optimization problem, their drawbacks are explained and illustrated here. However, we first need to present some notation and definitions from bi-objective optimization.

2.1 Preliminaries

A Bi-Objective Mixed Integer Linear Program (BOMILP) can be stated as follows:


where represents the feasible set in the decision space, , , , , and . It is assumed that is bounded and where and for represents a linear objective function. The image of under vector-valued function represents the feasible set in the objective/criterion space, that is for all . Note that BOMILP is called Bi-objective Linear Program (BOLP) and Bi-Objective Integer Linear Program (BOILP) for the special cases of and , respectively.

Definition 1.

A feasible solution is called efficient or Pareto optimal, if there is no other such that and or and . If is efficient, then is called a nondominated point. The set of all efficient solutions is denoted by . The set of all nondominated points for is denoted by and referred to as the nondominated frontier.

Definition 2.

If there exists a vector such that , then is called a supported efficient solution and is called a supported nondominated point.

Definition 3.

Let be the set of extreme points of the convex hull of , that is the smallest convex set containing the set . A point is called an extreme supported nondominated point, if is a supported nondominated point and .

The convex hull of

Non-extreme supported nondominated point

Extreme supported nondominated point

Unsupported nondominated point

Dominated point

Figure 1: An illustration of different types of (feasible) points in the criterion space

In summary, based on Definition 1, the elements of can be divided into two groups including dominated and nondominated points. Furthermore, based on Definitions 2 and 3, the nondominated points can be divided into unsupported nondominated points, non-extreme supported nondominated points and extreme supported nondominated points. Overall, bi-objective optimization problems are concerned with finding all elements of , that is all nondominated points, including supported and unsupported nondominated points. An illustration of the set and its corresponding categories are shown in Figure 1.

(a) BOLP


Figure 2: An illustration of the nondominated frontier

It is well-known that in a BOLP, both the set of efficient solutions and the set of nondominated points are supported and connected (Sayin, 1996). Consequently, to describe all nondominated points in a BOLP, it suffices to find all extreme supported nondominated points. A typical illustration of the nondominated frontier of a BOLP is displayed in Figure 2.

Since we assume that is bounded, the set of nondominated points of a BOILP is finite. However, due to the existence of unsupported nondominated points in a BOILP, finding all nondominated points is more challenging than in a BOLP. A typical nondominated frontier of a BOILP is shown in Figure 2 where the rectangles are unsupported nondominated points.

Finding all nondominated points of a BOMILP is even more challenging. Nonetheless, if at most one of the objective functions of a BOMILP contains continuous decision variables, then the set of nondominated points is finite and BOILP solution approaches can be utilized to solve it (Stidsen et al., 2014). However, in all other cases that more than one objective function contain continuous decision variables, the nondominated frontier of a BOMILP may contain connected parts as well as supported and unsupported nondominated points. Therefore, in these cases, the set of nondominated points may not be finite and BOILP algorithms cannot be applied to solve them anymore. A typical nondominated frontier of a BOMILP is illustrated in Figure 2, where even half-open (or open) line segments may exist in the nondominated frontier. Interested readers are referred to Ghosh and Chakraborty (2014), Hamacher et al. (2007) and Boland et al. (2015a, b) for further discussions on the properties of BOILPs and BOMILPs and algorithms to solve them.

(a) The wighted sum approach fails to find rectangles

(b) The goal programming approach may incorrectly find rectangles
Figure 3: The set of feasible points in the criterion space

2.2 Drawbacks of transforming Bssp into a single-objective model

Suppose that for each , the corresponding point is plotted into the criterion space. Figures 3 and 3 show two typical plots of such pairs for all . In these two figures, all filled circles and rectangles are nondominated points of the problem and unfilled rectangles and circles are dominated points. In Figure 3, the region defined by the dashed lines is the convex hull of all feasible points. In this case, it is impossible that the wighted sum approach finds the filled rectangles for any arbitrary weight as all filled rectangles are unsupported nondominated points (i.e., they are interior points of the convex hull). So, this illustrates that there may exist many nondominated points, but the weighted sum approach can fail to find most of them for any arbitrary weight. Figure 3 is helpful for understanding the main drawback of the goal programming approach. It is obvious that depending on the value of , the goal programming approach may find one of the unfilled rectangles which are dominated points. So, the main drawback of the goal programming approach is that it may even fail to find a nondominated point.

The main contribution of our research presented here is to overcome both of these disadvantages by utilizing bi-objective optimization techniques. We note that in the literature of BSSP, is mainly defined as the sum of squared residuals. The reason lies in the fact that the sum of squared residuals is a smooth (convex) function. However, to be able to exploit existing bi-objective mixed integer linear programming solvers, we use the sum of absolute residuals for . We conclude this section by providing two remarks.

Remark 4.

If we incorporate additional linear constraints on the vector of parameter estimators of the regression model, , it will be more likely that the goal programming approach fails to find a nondominated point.

Remark 5.

Unlike the weighted sum and goal programming approaches that new parameters and , respectively, should be employed and tuned by the user, the bi-objective optimization approach does not need any extra parameter.

3 A bi-objective formulation for Bssp

Let be the model matrix (it is assumed that ), be the vector of regression coefficients, and be the response vector. It is assumed that is unknown and should be estimated. Let denote an estimate for . To solve BSSP for this set of data, we construct the following BOMILP and denote it by BSSP-BOMILP:

such that: (3)

where and are, respectively, a lower bound and an upper bound (known) for for , for is a non-negative continuous variable that captures the value of in any efficient solution, and for is a binary decision variable that takes the value of one if , implying that the predictor is active. By these definitions, for any efficient solution, the first objective function, , takes the value of the sum of absolute residuals and the second objective function, , computes the number of predictors. Constraint (3) ensures that if then for . Constraints (4) and (5) guarantee that for . Note that since we minimize the first objective function, we have for in an efficient solution.

Remark 6.

The BSSP-BOMILP can handle additional linear constraints and variables. Furthermore, by choosing tight bounds in Constraint (3), we can speed up the solution time of BSSP-BOMILP. Hence, we should try to choose / as large/small as possible.

Remark 7.

Since only one of the objective functions in BSSP-BOMILP contains continuous variables, based on our discussion in Section 2.1, the set of nondominated points of BSSP-BOMILP is finite. More precisely, the nondominated frontier of BSSP-BOMILP can have at most number of nondominated points as . So we can use BOILP solvers such as the -constraint method or the balanced box method to solve BSSP-BOMILP (Chankong and Haimes, 1983; Boland et al., 2015a).

Remark 8.

The solution is a trivial efficient solution of BSSP-BOMILP which attains the minimum possible value for the second objective function. Accordingly, the point is a trivial nondominated point of BSSP-BOMILP where there is no parameter selected in the estimated regression model. Hence, we exclude this trivial nondominated point by adding the constraint to BSSP-BOMILP.

3.1 Bounds for the regression coefficients

In this section, we develop a data-driven approach to find bounds and for such that , in the lack of any additional information. To achieve this, we firstly present the following proposition.

Proposition 9.

Let be the median of response observations . If is an efficient solution of BSSP-BOMILP, then .


Let consider the feasible solution where , , for , for . So, we have for because for in BSSP-BOMILP. Since by Remark 8, , we must have to keep an efficient solution. ∎ ∎

Remark 10.

It is readily seen that if we replace with any other real number, the inequality given in Proposition 9 still holds. However, as the minimum of is achieved at (Schwertman et al., 1990), Proposition 9 provides the best upper bound for .

Motivated from Proposition 9, we can solve the following optimization problem to find for :


There are several ways to transform (9) into a linear program (e.g., see Dielman (2005)). Here, we propose the following linear programing model:


It should be noted that (10) is a relaxation of (9) since over-calculates for . Analogously, for can be computed by changing ‘’ into ‘’ in (10).

4 Computational results

We conduct a computational study to show the performance of the -constraint method on BSSP-BOMILP, numerically. We use C++ to code the -constraint method. In this computational study, the algorithm uses CPLEX 12.7 as the single-objective integer programming solver. All computational experiments are carried out on a Dell PowerEdge R630 with two Intel Xeon E5-2650 2.2 GHz 12-Core Processors (30MB), 128GB RAM, and the RedHat Enterprise Linux 6.8 operating system. We allow CPLEX to employ at most 10 threads at the same time.

We design six classes of instances, each denoted by where and . Based on this construction, we generate three regression models for each class as follows:

  • Set all and all other for

    are randomly drawn from the discrete uniform distribution on interval


  • Two steps are taken to construct for : (1) A vector is generated such that the two third of its components are zero, and the others are randomly drawn from the uniform distribution on interval ; (2) Set (with at most one decimal place) where

    is randomly generated from the standard normal distribution;

  • Optimal values of and for are computed by solving (10).

Class Instance 1 Instance 2 Instance 3 Average
Time(Sec.) #NDPs Time(Sec.) #NDPs Time(Sec.) #NDPs Time(Sec.) #NDPs
C(20,40) 4.1 21 3.7 21 3.8 21 3.8 21.0
C(20,60) 4.6 21 5.4 21 4.3 21 4.8 21.0
C(20,80) 5.2 21 6.0 21 6.1 21 5.8 21.0
C(40,80) 264.4 41 385.5 41 290.6 41 313.5 41.0
C(40,120) 313.0 41 921.5 41 247.0 41 493.8 41.0
C(40,160) 275.6 41 327.9 41 591.0 41 398.2 41.0
Table 1: Numerical results obtained by running the -constraint method

Table 1 reports the numerical results for all 18 instances. For each instance, there are two columns ‘Time(Sec.)’ and ‘#NDPs’ showing the solution time in seconds and the number of nondominated points, respectively. All nondominated points can be found for instances with and in about 5 seconds and 7 minutes in average, respectively.

















The number of predictors

The total bias

Linear regression model of the desirable point:

(imaginary) ideal point
Figure 4: The nondominated frontier of the Instance 1 from the Class

To highlight the drawbacks of existing approaches including the weighted sum approach and the goal programming approach, the nondominated frontier of the Instance 1 from the Class is illustrated in Figure 4. The filled rectangles and circles are unsupported and supported nondominated points, respectively. As we discussed previously, it is impossible to find any of the unsupported nondominated points using the weighted sum approach. Also, observe that many of the nondominated points lies on an almost vertical line. This implies that all these points are almost optimal for the goal programming approach when .

We note that selecting a desirable nondominated point in the nondominated frontier depends on decision maker(s). Here, we introduce a heuristic algorithm to do so. Let

and be the top and bottom endpoints of the nondominated frontier. One may simply choose the point that has the minimum Euclidean distance from the (imaginary) ideal point, that is . Based on this algorithm, in Figure 4, the (imaginary) ideal point is and the closest nondominated point to it is . The generated instance that we discuss in Figure 4 is and the estimated linear regression model corresponding to the selected nondominated point is , which are very close together.

5 Conclusion

In this paper, we discussed the practical drawbacks of transforming the best subset selection problem in linear regression modeling into a single-objective optimization problem. To resolve those disadvantages, we developed a new bi-objective optimization model for BSSP. The efficacy of this new model was shown through numerical results. Hopefully, the simplicity, versatility and performance of this new bi-objective optimization model encourage practitioners to consider it for constructing linear regression models.


  • Bertsimas et al. (2016) Bertsimas, D., King, A., Mazumder, R., 2016. Best subset selection via a modern optimization lens. The Annals of Statistics 44 (2), 813–852.
  • Bickel et al. (2009) Bickel, P. J., Ritov, Y., Tsybakov, A. B., 2009. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics 37 (4), 1705–1732.
  • Boland et al. (2015a) Boland, N., Charkhgard, H., Savelsbergh, M., 2015a. A criterion space search algorithm for biobjective integer programming: The balanced box method. INFORMS Journal on Computing 27 (4), 735–754.
  • Boland et al. (2015b) Boland, N., Charkhgard, H., Savelsbergh, M., 2015b. A criterion space search algorithm for biobjective mixed integer programming: The triangle splitting method. INFORMS Journal on Computing 27 (4), 597–618.
  • Candés and Plan (2009) Candés, E. J., Plan, Y., 2009. Near-ideal model selection by minimization. The Annals of Statistics 37 (5A), 2145–2177.
  • Chankong and Haimes (1983) Chankong, V., Haimes, Y. Y., 1983. Multiobjective Decision Making: Theory and Methodology. Elsevier Science, New York.
  • Chen et al. (1998) Chen, S. S., Donoho, D. L., Saunders, M. A., 1998. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20 (1), 33–61.
  • Ciuperca (2014) Ciuperca, G., May 2014. Model selection by lasso methods in a change-point model. Statistical Papers 55 (2), 349–374.
  • Dielman (2005) Dielman, T. E., 2005. Least absolute value regression: recent contributions. Journal of Statistical Computation and Simulation 75 (4), 263–286.
  • Ghosh and Chakraborty (2014) Ghosh, D., Chakraborty, D., 2014. A new pareto set generating method for multi-criteria optimization problems. Operations Research Letters 42 (8), 514 – 521.
  • Hamacher et al. (2007) Hamacher, W. H., Pedersen, C. R., Ruzika, S., 2007. Finding representative systems for discrete bicriterion optimization problems. Operations Research Letters 35, 336–344.
  • Meinshausen and Bühlmann (2006) Meinshausen, N., Bühlmann, P., 2006. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34 (3), 1436–1462.
  • Miller (2002)

    Miller, A., 2002. Subset Selection in Regression, 2nd Edition. Chapman & Hall/CRC Monographs on Statistics & Applied Probability.

  • Ren and Zhang (2010) Ren, Y., Zhang, X., 2010. Subset selection for vector autoregressive processes via adaptive lasso. Statistics & Probability Letters 80 (23-24), 1705 – 1712.
  • Sayin (1996) Sayin, S., 1996. An algorithm based on facial decomposition for finding the efficient set in multiple objective linear programming. Operations Research Letters 19 (2), 87 – 94.
  • Schwertman et al. (1990) Schwertman, N. C., Gilks, A. J., Cameron, J., 1990. A simple noncalculus proof that the median minimizes the sum of the absolute deviations. The American Statistician 44 (1), 38–39.
  • Stidsen et al. (2014) Stidsen, T., Andersen, K. A., Dammann, B., 2014. A branch and bound algorithm for a class of biobjective mixed integer programs. Management Science 60 (4), 1009–1032.
  • Tibshirani (1996) Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
  • Wang and Tian (2017) Wang, M., Tian, G.-L., Feb 2017. Adaptive group lasso for high-dimensional generalized linear models. Statistical Papers.
  • Zhang and Huang (2008) Zhang, C.-H., Huang, J., 2008. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics 36 (4), 1567–1594.