1 Introduction
In practice, linear regression models should have two important characteristics: prediction accuracy and interpretability (Tibshirani, 1996). The traditional approach to constructing regression models is to minimize the sum of squared residuals. Models obtained by this approach have low bias, but their prediction accuracy can be poor due to their large variance. Furthermore, models constructed by this approach may contain a large number of predictors, which makes them difficult for data analysts to interpret.
In general, reducing the number of predictors in a regression model can improve not only the interpretability but also, sometimes, the prediction accuracy by reducing the variance (Tibshirani, 1996). Hence, there is often a tradeoff between the amount of bias and the practical characteristics of a regression model. In other words, finding a desirable regression model is naturally a biobjective optimization problem that simultaneously minimizes the amount of bias and the number of predictors. This classical statistics problem, that is, finding the best subset of the $p$ potential predictors and estimating their coefficients in a linear regression model given $n$ observations, is called the Best Subset Selection Problem (BSSP). Due to the intensive computation involved in solving biobjective optimization problems, BSSP has traditionally been transformed into a single-objective optimization problem (e.g., see Ren and Zhang (2010) and Bertsimas et al. (2016)). Such transformations may cause practical issues that will be discussed in this paper.
However, recent algorithmic and theoretical advances have made biobjective optimization problems computationally tractable in practice. More precisely, under some mild conditions, we are now able to solve them reasonably fast. To the best of our knowledge, this paper is the first work to tackle BSSP using a biobjective optimization approach.
The paper is organized as follows: In Section 2, the two main single-objective approaches for BSSP are presented and their practical drawbacks are discussed. In Section 3, a new biobjective optimization formulation for BSSP is developed. In Section 4, computational results are reported. Finally, in Section 5, some concluding remarks are provided.
2 Single-objective models for BSSP
As discussed in Section 1, BSSP is naturally a biobjective optimization problem, which can be stated as

$$\min_{\boldsymbol{\beta} \in \mathcal{B}} \; \{ f_1(\boldsymbol{\beta}),\, f_2(\boldsymbol{\beta}) \},$$

where $\mathcal{B}$ is the feasible set of the parameter estimator vector $\boldsymbol{\beta}$, $f_1(\boldsymbol{\beta})$ is the total bias, and $f_2(\boldsymbol{\beta})$ is the number of predictors. In the literature, the following two approaches have widely been used to convert BSSP into a single-objective optimization problem:
(i) The weighted sum approach: Given some weight $\lambda > 0$, BSSP has been reformulated as

$$\min_{\boldsymbol{\beta} \in \mathcal{B}} \; f_1(\boldsymbol{\beta}) + \lambda f_2(\boldsymbol{\beta}).$$

(ii) The goal programming approach: Given some $k \in \{1, \dots, p\}$, BSSP has been reformulated as

$$\min_{\boldsymbol{\beta} \in \mathcal{B}} \; f_1(\boldsymbol{\beta}) \quad \text{subject to} \quad f_2(\boldsymbol{\beta}) \le k.$$
For further details, interested readers are referred to Chen et al. (1998), Meinshausen and Bühlmann (2006), Zhang and Huang (2008), Bickel et al. (2009), Candès and Plan (2009), Ren and Zhang (2010), Ciuperca (2014) and Wang and Tian (2017) for the weighted sum approach, and to Miller (2002) and Bertsimas et al. (2016) for the goal programming approach. Although the two optimization programs (i) and (ii) can be solved significantly faster than a biobjective optimization problem, they have drawbacks, which we explain and illustrate here. We first need to present some notation and definitions from biobjective optimization.
2.1 Preliminaries
A Biobjective Mixed Integer Linear Program (BOMILP) can be stated as follows:

$$\min_{x \in \mathcal{X}} \; \{ z_1(x),\, z_2(x) \}, \qquad (1)$$

where $\mathcal{X} := \{ x \in \mathbb{R}^{n_c} \times \mathbb{Z}^{n_I} : Ax \le b \}$ represents the feasible set in the decision space, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $n = n_c + n_I$. It is assumed that $\mathcal{X}$ is bounded and that $z_i(x) = c_i^\top x$ for $i = 1, 2$ represents a linear objective function. The image $\mathcal{Y}$ of $\mathcal{X}$ under the vector-valued function $z := (z_1, z_2)$ represents the feasible set in the objective/criterion space, that is, $\mathcal{Y} := \{ z(x) : x \in \mathcal{X} \}$. Note that BOMILP is called a Biobjective Linear Program (BOLP) or a Biobjective Integer Linear Program (BOILP) in the special cases of $n_I = 0$ or $n_c = 0$, respectively.
Definition 1.
A feasible solution $x' \in \mathcal{X}$ is called efficient or Pareto optimal if there is no other $x \in \mathcal{X}$ such that $z_1(x) \le z_1(x')$ and $z_2(x) < z_2(x')$, or $z_1(x) < z_1(x')$ and $z_2(x) \le z_2(x')$. If $x'$ is efficient, then $z(x')$ is called a nondominated point. The set of all efficient solutions is denoted by $\mathcal{X}_E$. The set of all nondominated points $z(x')$ for $x' \in \mathcal{X}_E$ is denoted by $\mathcal{Y}_N$ and referred to as the nondominated frontier.
Definition 2.
If there exists a vector $\lambda \in \mathbb{R}^2_{>}$ such that $x' \in \arg\min_{x \in \mathcal{X}} \lambda_1 z_1(x) + \lambda_2 z_2(x)$, then $x'$ is called a supported efficient solution and $z(x')$ is called a supported nondominated point.
Definition 3.
Let $\mathcal{E}$ be the set of extreme points of the convex hull of $\mathcal{Y}$, that is, of the smallest convex set containing $\mathcal{Y}$. A point $y' \in \mathcal{Y}_N$ is called an extreme supported nondominated point if $y'$ is a supported nondominated point and $y' \in \mathcal{E}$.
In summary, based on Definition 1, the elements of $\mathcal{Y}$ can be divided into two groups: dominated and nondominated points. Furthermore, based on Definitions 2 and 3, the nondominated points can be divided into unsupported nondominated points, nonextreme supported nondominated points, and extreme supported nondominated points. Overall, biobjective optimization problems are concerned with finding all elements of $\mathcal{Y}_N$, that is, all nondominated points, both supported and unsupported. An illustration of the set $\mathcal{Y}$ and its corresponding categories is shown in Figure 1.
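These categories can be made concrete on a small finite point set. The sketch below (a toy example, not taken from the paper) classifies points of a two-objective minimization problem into dominated and nondominated points, and approximates the supportedness check of Definition 2 by sweeping over positive weights; a nondominated point never returned by any weighted sum is unsupported.

```python
# Toy illustration of Definitions 1-3: classify points of a finite
# criterion-space set Y (minimization in both objectives).

def dominates(a, b):
    """True if point a dominates point b (componentwise <=, not equal)."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

Y = [(0, 5), (2, 2), (6, 0), (1, 4), (3, 3)]

nondominated = [y for y in Y if not any(dominates(x, y) for x in Y)]

# A nondominated point is *supported* if it minimizes some positive
# weighted sum of the objectives; here we approximate that check by
# sweeping the weight w over (0, 1).
supported = set()
for k in range(1, 100):
    w = k / 100.0
    supported.add(min(Y, key=lambda y: w * y[0] + (1 - w) * y[1]))

unsupported = [y for y in nondominated if y not in supported]

print("nondominated:", sorted(nondominated))
print("unsupported :", unsupported)  # (1, 4) is nondominated but unsupported
```

Here $(3, 3)$ is dominated by $(2, 2)$, while $(1, 4)$ lies above the lower-left convex hull boundary of the other nondominated points, so no weighted sum ever selects it.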
It is well-known that in a BOLP, all efficient solutions are supported and the set of nondominated points is connected (Sayin, 1996). Consequently, to describe all nondominated points of a BOLP, it suffices to find all extreme supported nondominated points. A typical illustration of the nondominated frontier of a BOLP is displayed in Figure 2.
Since we assume that $\mathcal{X}$ is bounded, the set of nondominated points of a BOILP is finite. However, due to the existence of unsupported nondominated points in a BOILP, finding all nondominated points is more challenging than in a BOLP. A typical nondominated frontier of a BOILP is shown in Figure 2, where the rectangles are unsupported nondominated points.
Finding all nondominated points of a BOMILP is even more challenging. Nonetheless, if at most one of the objective functions of a BOMILP contains continuous decision variables, then the set of nondominated points is finite and BOILP solution approaches can be utilized (Stidsen et al., 2014). In all other cases, where more than one objective function contains continuous decision variables, the nondominated frontier of a BOMILP may contain connected parts as well as supported and unsupported nondominated points. Therefore, in these cases, the set of nondominated points may not be finite and BOILP algorithms can no longer be applied. A typical nondominated frontier of a BOMILP is illustrated in Figure 2, where even half-open (or open) line segments may exist in the nondominated frontier. Interested readers are referred to Ghosh and Chakraborty (2014), Hamacher et al. (2007) and Boland et al. (2015a,b) for further discussion of the properties of BOILPs and BOMILPs and algorithms to solve them.
2.2 Drawbacks of transforming BSSP into a single-objective model
Suppose that for each $\boldsymbol{\beta} \in \mathcal{B}$, the corresponding point $(f_1(\boldsymbol{\beta}), f_2(\boldsymbol{\beta}))$ is plotted in the criterion space. Figure 3 shows two typical plots of such pairs. In both plots, the filled circles and rectangles are nondominated points of the problem, and the unfilled rectangles and circles are dominated points. In the first plot, the region defined by the dashed lines is the convex hull of all feasible points. In this case, it is impossible for the weighted sum approach to find the filled rectangles for any weight, as all filled rectangles are unsupported nondominated points (i.e., they are interior points of the convex hull). This illustrates that there may exist many nondominated points, yet the weighted sum approach can fail to find most of them for any choice of weight. The second plot is helpful for understanding the main drawback of the goal programming approach: depending on the value of $k$, the goal programming approach may return one of the unfilled rectangles, which are dominated points. So, the main drawback of the goal programming approach is that it may fail to find even a nondominated point.
The main contribution of our research is to overcome both of these disadvantages by utilizing biobjective optimization techniques. We note that in the BSSP literature, $f_1$ is usually defined as the sum of squared residuals, because the sum of squared residuals is a smooth (convex) function. However, to be able to exploit existing biobjective mixed integer linear programming solvers, we use the sum of absolute residuals for $f_1$. We conclude this section with two remarks.
Remark 4.
If we incorporate additional linear constraints on the vector of parameter estimators of the regression model, $\boldsymbol{\beta}$, it becomes even more likely that the goal programming approach fails to find a nondominated point.
Remark 5.
Unlike the weighted sum and goal programming approaches, in which the user must supply and tune an extra parameter (the weight $\lambda$ in (i) and the bound $k$ in (ii), respectively), the biobjective optimization approach does not need any extra parameter.
3 A biobjective formulation for BSSP
Let $X \in \mathbb{R}^{n \times p}$ be the model matrix, $\boldsymbol{\beta} \in \mathbb{R}^p$ be the vector of regression coefficients, and $y \in \mathbb{R}^n$ be the response vector. It is assumed that $\boldsymbol{\beta}$ is unknown and should be estimated; let $\hat{\boldsymbol{\beta}}$ denote an estimate of $\boldsymbol{\beta}$. To solve BSSP for this set of data, we construct the following BOMILP, denoted by BSSP-BOMILP:
$$\min \; \Big\{ \sum_{i=1}^{n} t_i \, , \; \sum_{j=1}^{p} z_j \Big\} \qquad (2)$$

such that:

$$l_j z_j \le \hat{\beta}_j \le u_j z_j \qquad j = 1, \dots, p, \qquad (3)$$

$$t_i \ge y_i - \sum_{j=1}^{p} x_{ij} \hat{\beta}_j \qquad i = 1, \dots, n, \qquad (4)$$

$$t_i \ge -\Big( y_i - \sum_{j=1}^{p} x_{ij} \hat{\beta}_j \Big) \qquad i = 1, \dots, n, \qquad (5)$$

$$t_i \ge 0 \qquad i = 1, \dots, n, \qquad (6)$$

$$z_j \in \{0, 1\} \qquad j = 1, \dots, p, \qquad (7)$$

$$\hat{\beta}_j \in \mathbb{R} \qquad j = 1, \dots, p, \qquad (8)$$
where $l_j$ and $u_j$ are, respectively, a known lower bound and upper bound for $\hat{\beta}_j$ for $j = 1, \dots, p$; $t_i$ for $i = 1, \dots, n$ is a nonnegative continuous variable that captures the value of $|y_i - \sum_{j=1}^{p} x_{ij} \hat{\beta}_j|$ in any efficient solution; and $z_j$ for $j = 1, \dots, p$ is a binary decision variable that takes the value one if $\hat{\beta}_j \neq 0$, implying that the $j$th predictor is active. By these definitions, for any efficient solution, the first objective function, $f_1 = \sum_{i=1}^{n} t_i$, takes the value of the sum of absolute residuals, and the second objective function, $f_2 = \sum_{j=1}^{p} z_j$, computes the number of predictors. Constraint (3) ensures that if $z_j = 0$ then $\hat{\beta}_j = 0$ for $j = 1, \dots, p$. Constraints (4) and (5) guarantee that $t_i \ge |y_i - \sum_{j=1}^{p} x_{ij} \hat{\beta}_j|$ for $i = 1, \dots, n$. Note that since we minimize the first objective function, we have $t_i = |y_i - \sum_{j=1}^{p} x_{ij} \hat{\beta}_j|$ for $i = 1, \dots, n$ in an efficient solution.
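The linearization behind Constraints (4) and (5) can be checked numerically. The sketch below (a small Python/SciPy illustration, not the paper's C++/CPLEX implementation) shows that, once the set of active predictors is fixed, minimizing $\sum_i t_i$ subject to $t_i \ge \pm(y_i - (X\hat{\boldsymbol{\beta}})_i)$ is a linear program whose optimum is the least-absolute-deviations (LAD) fit.

```python
# Minimal check of the absolute-value linearization in Constraints (4)-(5):
# for a fixed support, the LP below computes the LAD fit.
import numpy as np
from scipy.optimize import linprog

def lad_fit(X, y):
    """Solve min sum_i t_i s.t. t_i >= |y_i - (X beta)_i| as an LP.

    Decision variables are (beta_1..beta_p, t_1..t_n)."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(n)])     # minimize sum of t_i
    # t_i >= y_i - (X beta)_i   <=>  -X beta - t <= -y   (cf. Constraint (4))
    # t_i >= (X beta)_i - y_i   <=>   X beta - t <=  y   (cf. Constraint (5))
    A_ub = np.block([[-X, -np.eye(n)], [X, -np.eye(n)]])
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * p + [(0, None)] * n     # beta free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:p], res.fun

# Sanity check: with only an intercept column, the LAD fit is the median
# of y, so the optimum equals sum_i |y_i - median(y)| = 3 here.
y = np.array([1.0, 2.0, 4.0])
X = np.ones((3, 1))
beta, obj = lad_fit(X, y)
print(beta, obj)  # intercept 2.0 (the median), objective 3.0
```

The intercept-only sanity check also previews the median-based bound used in Section 3.1.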
Remark 6.
The BSSP-BOMILP can handle additional linear constraints and variables. Furthermore, by choosing tight bounds in Constraint (3), we can speed up the solution time of BSSP-BOMILP. Hence, we should try to choose $l_j$/$u_j$ as large/small as possible.
Remark 7.
Since only one of the objective functions in BSSP-BOMILP contains continuous variables, based on our discussion in Section 2.1, the set of nondominated points of BSSP-BOMILP is finite. More precisely, the nondominated frontier of BSSP-BOMILP can have at most $p + 1$ nondominated points, since $f_2 \in \{0, 1, \dots, p\}$. So we can use BOILP solvers such as the $\varepsilon$-constraint method or the balanced box method to solve BSSP-BOMILP (Chankong and Haimes, 1983; Boland et al., 2015a).
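The finiteness argument of Remark 7 can be illustrated on a tiny synthetic instance. The sketch below (assumed toy data and a brute-force subset enumeration in place of a MILP solver) mimics the $\varepsilon$-constraint idea: for each cardinality $k = 1, \dots, p$, it computes the smallest sum of absolute residuals over all supports of size $k$, then filters dominated pairs, yielding a frontier with at most $p$ points once the trivial $k = 0$ point is excluded as in Remark 8.

```python
# Brute-force sketch of the epsilon-constraint idea on a toy instance:
# one LAD value per cardinality k, then a dominance filter.
import itertools
import numpy as np
from scipy.optimize import linprog

def lad_value(Xs, y):
    """Minimum sum of absolute residuals for the columns in Xs (an LP)."""
    n, p = Xs.shape
    c = np.concatenate([np.zeros(p), np.ones(n)])
    A_ub = np.block([[-Xs, -np.eye(n)], [Xs, -np.eye(n)]])
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * p + [(0, None)] * n
    return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds).fun

rng = np.random.default_rng(0)
n, p = 12, 4
X = rng.normal(size=(n, p))
y = X[:, 0] * 3.0 - X[:, 2] * 2.0 + rng.normal(scale=0.1, size=n)

points = []
for k in range(1, p + 1):
    f1 = min(lad_value(X[:, list(S)], y)
             for S in itertools.combinations(range(p), k))
    points.append((f1, k))

# Keep only nondominated (f1, k) pairs: a point is dominated when a
# smaller cardinality already achieves an equal-or-lower f1.
frontier = [pt for pt in points
            if not any(q[0] <= pt[0] and q[1] < pt[1] for q in points)]
print(frontier)
```

For realistic sizes this enumeration is of course hopeless, which is exactly why the paper resorts to BOILP solvers on BSSP-BOMILP; the sketch only verifies the shape of the frontier.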
Remark 8.
The solution with $\hat{\beta}_j = 0$ and $z_j = 0$ for $j = 1, \dots, p$ (and $t_i = |y_i|$ for $i = 1, \dots, n$) is a trivial efficient solution of BSSP-BOMILP that attains the minimum possible value for the second objective function. Accordingly, the point $(\sum_{i=1}^{n} |y_i|, 0)$ is a trivial nondominated point of BSSP-BOMILP, in which no predictor is selected in the estimated regression model. Hence, we exclude this trivial nondominated point by adding the constraint $\sum_{j=1}^{p} z_j \ge 1$ to BSSP-BOMILP.
3.1 Bounds for the regression coefficients
In this section, we develop a data-driven approach to find bounds $l_j$ and $u_j$ for $j = 1, \dots, p$ such that $l_j \le \hat{\beta}_j \le u_j$, in the absence of any additional information. To achieve this, we first present the following proposition.
Proposition 9.
Let $m$ be the median of the response observations $y_1, \dots, y_n$. If $(\hat{\boldsymbol{\beta}}, t, z)$ is an efficient solution of BSSP-BOMILP, then $\sum_{i=1}^{n} t_i \le \sum_{i=1}^{n} |y_i - m|$.
Proof.
Consider the feasible solution in which the intercept is the only active predictor and is set to $m$; that is, $\hat{\beta}_1 = m$, $\hat{\beta}_j = 0$ for $j = 2, \dots, p$, $z_1 = 1$, $z_j = 0$ for $j = 2, \dots, p$, and $t_i = |y_i - m|$ for $i = 1, \dots, n$ (assuming the model contains an intercept term, taken here as the first predictor). For this solution, $f_1 = \sum_{i=1}^{n} |y_i - m|$ and $f_2 = 1$. Since, by Remark 8, every feasible solution satisfies $\sum_{j=1}^{p} z_j \ge 1$, any efficient solution $(\hat{\boldsymbol{\beta}}, t, z)$ must satisfy $\sum_{i=1}^{n} t_i \le \sum_{i=1}^{n} |y_i - m|$; otherwise it would be dominated by the solution above. ∎
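The fact driving this proof, namely that the median minimizes the sum of absolute deviations (Schwertman et al., 1990), is easy to check numerically. The sketch below (toy data, a grid of alternative centers) confirms that no other center beats the median.

```python
# Numerical check of the fact behind Proposition 9: the median minimizes
# the sum of absolute deviations, so the intercept-equals-median solution
# bounds f1 for any efficient solution with at least one active predictor.
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=101)
m = np.median(y)

bound = np.sum(np.abs(y - m))
# No other center c achieves a smaller sum of absolute deviations.
others = [np.sum(np.abs(y - c)) for c in np.linspace(y.min(), y.max(), 1000)]
print(bound <= min(others))  # True
```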
Remark 10.
4 Computational results
We conduct a computational study to numerically assess the performance of the $\varepsilon$-constraint method on BSSP-BOMILP. We use C++ to code the $\varepsilon$-constraint method. In this computational study, the algorithm uses CPLEX 12.7 as the single-objective integer programming solver. All computational experiments are carried out on a Dell PowerEdge R630 with two Intel Xeon E5-2650 2.2 GHz 12-core processors (30MB cache), 128GB RAM, and the RedHat Enterprise Linux 6.8 operating system. We allow CPLEX to employ at most 10 threads at a time.
We design six classes of instances, each denoted by $C(n, p)$, where $n \in \{20, 40\}$ and $p \in \{2n, 3n, 4n\}$. Based on this construction, we generate three regression models for each class as follows:

Two steps are taken to construct each instance: (1) a coefficient vector $\boldsymbol{\beta}$ is generated such that two thirds of its components are zero and the remaining components are randomly drawn from a uniform distribution; (2) set $y_i = \sum_{j=1}^{p} x_{ij} \beta_j + \epsilon_i$ (reported with at most one decimal place), where $\epsilon_i$ is randomly generated from the standard normal distribution;

Optimal values of $l_j$ and $u_j$ for $j = 1, \dots, p$ are computed by solving (10).
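The generation procedure above can be sketched as follows. Note that several details are not specified in the text, so the standard-normal design matrix, the unit interval for the nonzero coefficients, and rounding the responses (rather than the coefficients) to one decimal place are all assumptions of this sketch.

```python
# Sketch of the instance-generation procedure (design matrix, coefficient
# interval, and rounding target are assumptions, not taken from the paper).
import numpy as np

def generate_instance(n, p, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))                # model matrix (assumed design)
    beta = np.zeros(p)
    # Two thirds of the coefficients are zero; the rest are uniform draws
    # (unit interval assumed).
    nonzero = rng.choice(p, size=p // 3, replace=False)
    beta[nonzero] = rng.uniform(0.0, 1.0, size=nonzero.size)
    eps = rng.normal(size=n)                   # standard normal noise
    y = np.round(X @ beta + eps, 1)            # at most one decimal place
    return X, beta, y

X, beta, y = generate_instance(20, 40)
print(X.shape, np.count_nonzero(beta), y.shape)
```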
Class      Instance 1         Instance 2         Instance 3         Average
           Time(Sec.)  #NDPs  Time(Sec.)  #NDPs  Time(Sec.)  #NDPs  Time(Sec.)  #NDPs
C(20,40)          4.1     21         3.7     21         3.8     21         3.8    21.0
C(20,60)          4.6     21         5.4     21         4.3     21         4.8    21.0
C(20,80)          5.2     21         6.0     21         6.1     21         5.8    21.0
C(40,80)        264.4     41       385.5     41       290.6     41       313.5   41.0
C(40,120)       313.0     41       921.5     41       247.0     41       493.8   41.0
C(40,160)       275.6     41       327.9     41       591.0     41       398.2   41.0
Table 1 reports the numerical results for all 18 instances. For each instance, the columns 'Time(Sec.)' and '#NDPs' show the solution time in seconds and the number of nondominated points, respectively. All nondominated points can be found in about 5 seconds on average for the instances with $n = 20$, and in about 7 minutes on average for the instances with $n = 40$.
To highlight the drawbacks of the existing approaches, i.e., the weighted sum approach and the goal programming approach, the nondominated frontier of Instance 1 of one of the classes is illustrated in Figure 4. The filled rectangles and circles are unsupported and supported nondominated points, respectively. As discussed previously, it is impossible to find any of the unsupported nondominated points using the weighted sum approach. Also, observe that many of the nondominated points lie on an almost vertical line. This implies that all these points are almost optimal for the goal programming approach over a wide range of values of $k$.
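A common way to single out one point from a frontier such as the one in Figure 4 is to pick the nondominated point closest to the (imaginary) ideal point formed from the best value of each objective. The sketch below uses a made-up frontier (the values are not from the paper's instances).

```python
# Sketch of picking one nondominated point from a frontier: take the
# (imaginary) ideal point (componentwise minima) and choose the point
# with the smallest Euclidean distance to it.  Frontier values are toy data.
import math

frontier = [(25.0, 1), (9.0, 2), (4.0, 3), (3.5, 4), (3.2, 5)]
ideal = (min(f1 for f1, _ in frontier), min(f2 for _, f2 in frontier))

best = min(frontier, key=lambda pt: math.dist(pt, ideal))
print(ideal, best)  # (3.2, 1) (4.0, 3)
```

In practice one may want to rescale the two objectives before measuring distance, since the sum of absolute residuals and the number of predictors can have very different magnitudes.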
We note that selecting a desirable nondominated point from the nondominated frontier depends on the decision maker(s). Here, we introduce a simple heuristic to do so. Let
$y^t$ and $y^b$ be the top and bottom endpoints of the nondominated frontier. One may simply choose the nondominated point that has the minimum Euclidean distance from the (imaginary) ideal point, whose coordinates are the componentwise minima of $y^t$ and $y^b$. For the instance shown in Figure 4, the estimated linear regression model corresponding to the nondominated point selected in this way is very close to the true model used to generate the instance.

5 Conclusion
In this paper, we discussed the practical drawbacks of transforming the best subset selection problem in linear regression modeling into a single-objective optimization problem. To resolve those disadvantages, we developed a new biobjective optimization model for BSSP. The efficacy of this new model was shown through numerical results. Hopefully, the simplicity, versatility and performance of this new biobjective optimization model will encourage practitioners to consider it when constructing linear regression models.
References
 Bertsimas et al. (2016) Bertsimas, D., King, A., Mazumder, R., 2016. Best subset selection via a modern optimization lens. The Annals of Statistics 44 (2), 813–852.
 Bickel et al. (2009) Bickel, P. J., Ritov, Y., Tsybakov, A. B., 2009. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics 37 (4), 1705–1732.
 Boland et al. (2015a) Boland, N., Charkhgard, H., Savelsbergh, M., 2015a. A criterion space search algorithm for biobjective integer programming: The balanced box method. INFORMS Journal on Computing 27 (4), 735–754.
 Boland et al. (2015b) Boland, N., Charkhgard, H., Savelsbergh, M., 2015b. A criterion space search algorithm for biobjective mixed integer programming: The triangle splitting method. INFORMS Journal on Computing 27 (4), 597–618.
 Candès and Plan (2009) Candès, E. J., Plan, Y., 2009. Near-ideal model selection by $\ell_1$ minimization. The Annals of Statistics 37 (5A), 2145–2177.
 Chankong and Haimes (1983) Chankong, V., Haimes, Y. Y., 1983. Multiobjective Decision Making: Theory and Methodology. Elsevier Science, New York.
 Chen et al. (1998) Chen, S. S., Donoho, D. L., Saunders, M. A., 1998. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20 (1), 33–61.
 Ciuperca (2014) Ciuperca, G., 2014. Model selection by Lasso methods in a change-point model. Statistical Papers 55 (2), 349–374.
 Dielman (2005) Dielman, T. E., 2005. Least absolute value regression: recent contributions. Journal of Statistical Computation and Simulation 75 (4), 263–286.
 Ghosh and Chakraborty (2014) Ghosh, D., Chakraborty, D., 2014. A new Pareto set generating method for multi-criteria optimization problems. Operations Research Letters 42 (8), 514–521.
 Hamacher et al. (2007) Hamacher, H. W., Pedersen, C. R., Ruzika, S., 2007. Finding representative systems for discrete bicriterion optimization problems. Operations Research Letters 35, 336–344.
 Meinshausen and Bühlmann (2006) Meinshausen, N., Bühlmann, P., 2006. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34 (3), 1436–1462.
 Miller (2002) Miller, A., 2002. Subset Selection in Regression, 2nd Edition. Chapman & Hall/CRC Monographs on Statistics & Applied Probability.
 Ren and Zhang (2010) Ren, Y., Zhang, X., 2010. Subset selection for vector autoregressive processes via adaptive lasso. Statistics & Probability Letters 80 (23–24), 1705–1712.
 Sayin (1996) Sayin, S., 1996. An algorithm based on facial decomposition for finding the efficient set in multiple objective linear programming. Operations Research Letters 19 (2), 87–94.
 Schwertman et al. (1990) Schwertman, N. C., Gilks, A. J., Cameron, J., 1990. A simple noncalculus proof that the median minimizes the sum of the absolute deviations. The American Statistician 44 (1), 38–39.
 Stidsen et al. (2014) Stidsen, T., Andersen, K. A., Dammann, B., 2014. A branch and bound algorithm for a class of biobjective mixed integer programs. Management Science 60 (4), 1009–1032.
 Tibshirani (1996) Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
 Wang and Tian (2017) Wang, M., Tian, G.-L., 2017. Adaptive group lasso for high-dimensional generalized linear models. Statistical Papers.
 Zhang and Huang (2008) Zhang, C.H., Huang, J., 2008. The sparsity and bias of the lasso selection in highdimensional linear regression. The Annals of Statistics 36 (4), 1567–1594.