 # A Mathematical Programming Approach for Integrated Multiple Linear Regression Subset Selection and Validation

Subset selection for multiple linear regression aims to construct a regression model that minimizes errors by selecting a small number of explanatory variables. Once a model is built, various statistical tests and diagnostics are conducted to validate the model and to determine whether regression assumptions are met. Most traditional approaches require human decisions at this step, for example, the user adding or removing a variable until a satisfactory model is obtained. However, this trial-and-error strategy cannot guarantee that a subset that minimizes the errors while satisfying all regression assumptions will be found. In this paper, we propose a fully automated model building procedure for multiple linear regression subset selection that integrates model building and validation based on mathematical programming. The proposed model minimizes mean squared errors while ensuring that the majority of the important regression assumptions are met. When no subset satisfies all of the considered regression assumptions, our model provides an alternative subset that satisfies most of these assumptions. Computational results show that our model yields better solutions (i.e., satisfying more regression assumptions) compared to benchmark models while maintaining similar explanatory power.


## 1 Introduction

Regression analysis is one of the most popular forms of statistical modeling for analyzing the relationship between variables. Due to its intuitive characteristics and simplicity, multiple (multivariate) linear regression has been the most commonly used model for several decades, with a variety of fields employing it to handle prediction tasks. Multiple linear regression seeks to identify the most suitable linear relationship for several explanatory variables and a response variable. A multiple linear regression model for $m$ explanatory variables can be represented by the linear equation

$$b_{n \times 1} = A_{n \times m}\, x_{m \times 1} + e,$$

where $A$ represents the data matrix corresponding to the $m$ explanatory variables with $n$ observations, $b$ denotes the data matrix corresponding to the values of the response variable, $x$ is the matrix of estimated parameters corresponding to the coefficients of the variables, and $e$ refers to the errors between the predictions of the linear model and the response variable. We further assume that data matrices $A$ and $b$ are already standardized with respective means and standard deviations, and thus, without loss of generality, we can remove the intercept from the model above. It then makes sense that $x$ should be chosen to minimize the errors.
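As a concrete illustration of choosing $x$ to minimize the errors, the sketch below fits ordinary least squares on synthetic standardized data with NumPy; the data, names, and tooling are our assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 5

# Synthetic data: b = A @ x_true + noise
A = rng.normal(size=(n, m))
x_true = np.array([1.5, -2.0, 0.0, 0.7, 0.0])
b = A @ x_true + 0.1 * rng.normal(size=n)

# Standardize A and b so the intercept can be dropped, as in the paper
A = (A - A.mean(axis=0)) / A.std(axis=0)
b = (b - b.mean()) / b.std()

# OLS: choose x to minimize ||A x - b||^2
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
e = A @ x_hat - b
print(x_hat.round(2), float(e @ e))
```

Standardizing rescales the coefficients but preserves their signs, so the fitted signs match those of `x_true` on the informative columns.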

Subset selection is a procedure in which a subset of explanatory variables is selected to reduce model complexity while maintaining explanatory power. Guyon and Elisseeff  and James et al.  demonstrate that a reduced subset can provide faster, more cost-effective predictors, along with a better understanding of the underlying process that generated the data. A reduced subset can also prevent over-fitting in the presence of too many irrelevant or redundant features  and can help users to more readily understand the results .

Many studies have been conducted on subset selection in the literature. Stepwise selection methods, including forward selection, backward elimination, and their combination, are the most popular algorithms thanks to their simple implementation and fast computing time. However, the solutions found by these algorithms are often poor in quality due to their greedy characteristics. Therefore, in order to improve subset selection, more sophisticated algorithms have been proposed. Meta-heuristic algorithms, such as genetic algorithm variations and simulated annealing, are presented in Yang and Honavar, Leardi et al., Siedlecki and Sklansky, and Meiri and Zahavi. Subset selection procedures based on statistical or machine learning methods have also been developed. For example, a Bayesian approach to subset selection is taken by Mitchell and Beauchamp, while George and McCulloch suggest a selection procedure based on the superiority of each subset estimated from posterior probabilities given by Gibbs sampling. Genuer et al. point out that random forests can be employed in a strategy that involves a ranking of explanatory variables, which provides insight for the selection of the variable(s). In addition, other research on machine learning algorithms for subset selection has been conducted by Rakotomamonjy, Castellano and Fanelli, and Zou and Hastie. By assigning an $\ell_1$ penalty to the coefficients, the least absolute shrinkage and selection operator (LASSO) was introduced by Tibshirani. Although it has been several years since LASSO was first devised, it is still one of the most commonly used regression subset selection methods. Note that most of the methods above are heuristics, meaning that they do not necessarily give parameters that minimize the mean squared error (MSE).
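To make the greedy character of stepwise selection concrete, the sketch below implements plain forward selection with NumPy: at each step it adds the single variable that most reduces the SSE. The data and function names are illustrative assumptions of ours, not from any of the cited works.

```python
import numpy as np

def forward_selection(A, b, k):
    """Greedily pick k column indices of A, each step minimizing SSE."""
    n, m = A.shape
    selected = []
    for _ in range(k):
        best_j, best_sse = None, np.inf
        for j in range(m):
            if j in selected:
                continue
            cols = selected + [j]
            x, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
            sse = float(np.sum((A[:, cols] @ x - b) ** 2))
            if sse < best_sse:
                best_j, best_sse = j, sse
        selected.append(best_j)
    return selected

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 8))
b = 2.0 * A[:, 2] - 1.0 * A[:, 5] + 0.05 * rng.normal(size=100)
print(sorted(forward_selection(A, b, 2)))  # → [2, 5]
```

Because each step is locally optimal, the method can miss the globally best subset when variables are correlated, which is exactly the weakness noted above.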

To find the best subset, several models based on mathematical programming have been proposed. Although the best subset selection problem can be formulated as a mixed integer quadratic program (MIQP) when ordinary least squares (OLS) is used, the statistical community has not paid it close attention because it requires significant computational time for a practical implementation. Recently, however, Bertsimas and King have pointed out that the computing power of mixed integer programming (MIP) solvers has increased at a sensational rate over the last 25 years, and they also emphasize eye-opening progress in exact algorithms for solving integer programs. These recent developments have made integer programming a key method for finding the best subset. Konno and Yamamoto introduce an MIQP formulation to find the best subset with a fixed number of variables for multiple linear regression with the minimum sum of squared errors (SSE). Konno and Takaya propose a multistep method to generate a nearly optimal solution in a shorter time. Based on the formulation of Konno and Yamamoto, Bertsimas et al. suggest obtaining tight big-$M$ values in the indicator constraint to improve computing speed. They also introduce an algorithm to obtain an initial solution for a warm start, effectively reducing the computing time. The choice of a best subset for logistic regression via MIP using piecewise linear approximation is also presented in Sato et al. Models and algorithms that do not require the number of selected variables to be fixed have also been proposed. Miyashiro and Takano introduce mixed integer second-order cone programming formulations that do not restrict the cardinality of the subsets. Miyashiro and Takano set Mallows' $C_p$ as the goodness of fit for the model and formulate the subset selection as an MIP. In their work, Mallows' $C_p$ includes the number of selected variables, and hence the cardinality of the subset need not be fixed when Mallows' $C_p$ is used as a measure of goodness of fit. Park and Klabjan consider an MIQP formulation for subset selection not only when the shape of the data is ordinary (i.e., $n > m$) but also for high-dimensional variable selection when the number of selected variables is not fixed. They also provide a proof of big-$M$ as an upper bound on the coefficients and an efficient heuristic algorithm for the case when $m > n$. Note that these studies only consider several goodness-of-fit measures to obtain an optimal subset.

While numerous studies have focused on regression subset selection, few have considered regression assumptions and diagnostics in their solution approach. Bertsimas and King suggest an algorithmic approach that iteratively solves MIP problems to obtain a desirable model. They use penalized objective functions and constraints for sparsity, multicollinearity reduction, robustness to outliers, and statistical significance. Tamura et al. propose an MIQP for best subset selection while eliminating multicollinearity from linear regression models. Carrizosaa et al. study a mathematical model for subset selection in order to construct a linear regression model with significant coefficients and small multicollinearity via constraints performing shrinkage of the coefficients.

In this paper, we propose a fully automated model building and validation procedure for multiple linear regression via a mathematical programming-based algorithm that minimizes MSE while guaranteeing both (i) statistical significance of the regression coefficients and (ii) regression assumptions and diagnostics. Note that it is not trivial to incorporate this model validation step into other popular subset selection approaches such as LASSO and stepwise selection. Our model is also capable of providing an alternative solution when there is no subset that satisfies all of the considered regression assumptions and tests. Our model is different from Bertsimas and King  and Carrizosaa et al.  in that our algorithm finds a solution that satisfies the tests and diagnostics with only one call to an MIP solver, whereas they use an iterative method that calls an MIP solver multiple times or heuristic observations for modeling. The contributions of our paper can be summarized as follows:

• We provide a mathematical programming-based algorithm that allows a regression model to be built that satisfies the majority of statistical tests and diagnostics. In particular, residual-based diagnostics are incorporated into the model for the first time.

• Our algorithm smartly uses a commercial MIP solver to which exactly one call is required. To the best of our knowledge, our algorithm is unique in the sense that it does not require multiple iterative calls to the solver. The experimental results demonstrate the efficiency of our algorithm in comparison to existing iterative methods.

• The experimental results show that our model identifies a subset that satisfies all of the considered tests and assumptions within a reasonable time, while at the same time keeping the adjusted $R^2$ values close to those of the benchmarks (without statistical tests and diagnostics).

• We present a logical procedure to find a near-feasible solution when no subset satisfies the tests and diagnostics or our model fails to find a feasible solution within the given time limit. The proposed procedure mimics the typical steps used to build a linear regression model. The experimental results show that our procedure produces higher quality subsets than the benchmarks.

The paper is structured as follows. In Section 2, we discuss several characteristics of a desirable multivariate linear regression model which incorporates transformed variables and meets essential regression assumptions. In Section 3, a mathematical programming approach for the best subset which reflects the features discussed in Section 2 is presented. A logical procedure for obtaining an alternative solution when the model is infeasible is proposed in Section 4. The results of the computational experiments are presented in Section 5, followed by conclusions in Section 6.

## 2 Model Validation for Multiple Linear Regression

As mentioned previously, building a multivariate linear regression model involves finding a linear equation which explains well the relationship between the explanatory variables and the response variable. "Explain well" in this case means that the errors between the predictions and the observations are suitably small. At the same time, the model needs to satisfy the assumptions of linear regression from a statistical point of view. However, when a model is obtained from a single run using one of the popular least squares methods, some of the statistical assumptions are usually violated. Neter et al. suggest building a regression model by repetitively checking the assumptions and diagnostics. Figure 1 summarizes this strategy. After preprocessing the collected data, several explanatory variables (or non-linear transformed variables) are selected to build a statistically significant model. Diagnostics are then conducted on the model. If the model satisfies all assumptions, it is forwarded to the postprocessing step. Otherwise, another subset of explanatory variables is selected and tested. As a result, constructing a useful regression model requires many diagnostic tests. In this section, we discuss three popular techniques and tests that will be included in the mathematical formulation proposed in Section 3.

Figure 1: Strategy for linear regression model construction

### 2.1 Transformation of explanatory variables

When the explanatory and response variables have a non-linear relationship, it is possible to observe a non-linear residual trend versus the fitted values. In these cases, transforming the explanatory variables using a non-linear function (e.g., the logarithm) is one of the most commonly used techniques to resolve this problem. A model can be established with a subset of explanatory variables and their transformations, with the restriction that a variable and its transformation cannot be selected simultaneously, because this leads to multicollinearity. If the explanatory variables are transformed using an appropriate function, then a model constructed with transformed variables will no longer demonstrate a non-linear relationship in the residual plot. Therefore, we generate the transformed data as a fixed candidate set of variables. The MIQP model presented in Section 3.2 has a constraint to select at most one variable from each original-transformed variable pair.
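As an illustrative sketch (the function and variable names are ours), the candidate matrix that concatenates each column with its log-transform, together with the index pairs that must not be selected together, can be built as follows; the shift for non-positive columns anticipates the convention described in Section 3.2.

```python
import numpy as np

def build_candidates(A):
    """Concatenate A with column-wise log transforms; return pairs (j, j+m)
    of an original column and its transform that must not be co-selected."""
    n, m = A.shape
    logs = np.empty_like(A, dtype=float)
    for j in range(m):
        col = A[:, j]
        # Conventional shift so the logarithm is defined for non-positive data
        shift = 1.0 - col.min() if col.min() <= 0 else 0.0
        logs[:, j] = np.log(col + shift)
    A_tilde = np.hstack([A, logs])
    pairs = [(j, j + m) for j in range(m)]
    return A_tilde, pairs

A = np.array([[1.0, -2.0], [2.0, 0.0], [4.0, 3.0]])
A_tilde, pairs = build_candidates(A)
print(A_tilde.shape, pairs)  # → (3, 4) [(0, 2), (1, 3)]
```

The `pairs` list is exactly what a pairing constraint such as (5e) consumes: at most one index from each pair may enter the model.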

### 2.2 Statistical tests for linear regression parameters

Because linear regression models are constructed based on statistical assumptions, any proposed model should be verified to determine whether it satisfies these assumptions. One linear regression assumption is that the data always contains noise generated from a normal distribution. For this reason, the estimated coefficients have a probability distribution. The standard deviations of the coefficients are composed of the diagonal of the matrix

$$s^2\{\hat{x}\} = \mathrm{MSE}_A \, (A'A)^{-1},$$

where $\mathrm{MSE}_A$ denotes the mean squared error of the model developed from data matrix $A$ (refer to Neter et al. for its proof). The $j$-th element of the diagonal of $s^2\{\hat{x}\}$ is the squared standard deviation of the $j$-th estimated coefficient (i.e., of the $j$-th element of $\hat{x}$). We denote the standard deviation and the estimated coefficient as $s(A)_j$ and $\hat{x}_j$, respectively. If we employ a data matrix with $k$ variables, then the coefficients follow Student's $t$-distribution:

$$\frac{\hat{x}_j - x_j}{s(A)_j} \sim t_{n-k-1}, \qquad j = 1, \dots, k. \tag{1}$$

Note that the critical $t$-statistic is constant when the number of selected variables is fixed. Our interest is to test whether coefficient $x_j$ is equal to zero. If $x_j$ is close to zero, then it can be regarded that there is no linear relationship between the response variable and the $j$-th explanatory variable. Formally, the following hypothesis can be tested:

$$H_0: x_j = 0, \qquad H_1: x_j \neq 0.$$

Thus, to reject the null hypothesis, the following inequality must hold:

$$\left| \frac{\hat{x}_j}{s(A)_j} \right| \geq t_{1-\frac{\alpha}{2},\, n-k-1}, \qquad j = 1, \dots, k. \tag{2}$$

We will discuss how this requirement can be handled with a mathematical programming approach in Section 3.3 and Section 3.4.
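A minimal numerical check of inequality (2) can be sketched as follows with NumPy. The synthetic data are our assumption, and the critical value is passed in as a constant (in practice it comes from a $t$-table or a statistics library) to keep the sketch dependency-free.

```python
import numpy as np

def coefficient_t_stats(A, b):
    """Return |x_hat_j / s(A)_j| for each coefficient of the OLS fit."""
    n, k = A.shape
    x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    resid = b - A @ x_hat
    mse = float(resid @ resid) / (n - k - 1)      # degrees of freedom as in (1)
    s = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))  # s(A)_j
    return np.abs(x_hat / s)

rng = np.random.default_rng(2)
A = rng.normal(size=(120, 3))
b = 1.2 * A[:, 0] + 0.0 * A[:, 1] - 0.8 * A[:, 2] + 0.3 * rng.normal(size=120)

t_crit = 1.98  # approx. t_{0.975, 116}
stats = coefficient_t_stats(A, b)
print(stats > t_crit)  # columns 0 and 2 should be flagged significant
```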

### 2.3 Model validation with residual plots

One of the key assumptions of linear regression is that errors have constant variance over fitted values or over independent variables. This can be verified by drawing residual plots. The residuals have constant variance if they form a band-like region with constant width over the entire range of fitted values.

Breusch and Pagan noted that the violation of residual assumptions can lead to inefficiency when using OLS and subsequently to invalid inferences. However, linear models often do not satisfy these assumptions. Thus, once a linear regression model is constructed, it is critical to check the diagnostic plots of the residuals.

Figure 2 shows representative plots of residuals versus fitted values. In Figure 2(a), the variance of the residuals appears constant over the fitted values, and thus the generated model satisfies the assumption that the variance of the residuals is constant. However, this is not the case for the other examples in Figure 2. Figure 2(b) displays a positive correlation between the residuals and fitted values, Figure 2(c) shows heteroscedasticity in which the variance of the residuals increases as the fitted values increase, and Figure 2(d) presents a non-linear relationship between the residuals and fitted values.

In fact, the latter three cases in Figure 2 are frequently observed when implementing linear regression models in many real-world problems. Thus, we need to run several statistical tests to diagnose these cases and thus allow an appropriate combination of explanatory variables to be selected.

To detect the situation depicted in Figure 2(b), a univariate linear regression model, called an auxiliary model, is constructed where the explanatory and response variables are the fitted values and residuals in the plot, respectively. If there is no issue, the estimated slope is close to zero. If linearity exists, the estimated slope is non-zero. A simple hypothesis test can be used to check whether the estimated slope is zero. If the test rejects the null hypothesis (zero slope), then we conclude that the model violates the residual assumption. Otherwise, we conclude that the model does not suffer from linearity in the residuals.

To detect heteroscedasticity (Figure 2(c)), two statistical tests are introduced. The first test, referred to as the absolute residual test in this paper, is identical to the previous test except that the values of the response variable in the auxiliary model are the absolute values of the residuals. Figure 3(b) shows that, if the negative side of Figure 2(c) is flipped to overlap the positive side, an increasing trend in the residuals is observed. On the other hand, the constant variance of the absolute values of the residuals with respect to the fitted values is presented in Figure 3(a), which is associated with the desirable residual plot of Figure 2(a). Thus, as with a linear residual trend, if a statistical test on the coefficient of the auxiliary model rejects the null hypothesis, we can regard the residuals of the linear model as having heteroscedasticity. Another widely used diagnostic tool for heteroscedasticity is the Breusch-Pagan test. To avoid being overly rigorous, we consider the residual assumption to be violated only if both tests indicate heteroscedasticity. In Section 3.4, our mathematical programming-based diagnostic approach will be presented to capture the features of heteroscedasticity.
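The absolute residual test amounts to a slope $t$-test in a univariate auxiliary regression of |residuals| on fitted values. A sketch with NumPy (the synthetic data and critical value are our assumptions):

```python
import numpy as np

def aux_slope_t(fitted, response):
    """t-statistic of the slope in a simple regression of response on fitted."""
    f = fitted - fitted.mean()
    r = response - response.mean()
    slope = float(f @ r) / float(f @ f)
    resid = r - slope * f
    n = len(f)
    se = np.sqrt(float(resid @ resid) / (n - 2) / float(f @ f))
    return slope / se

rng = np.random.default_rng(3)
fitted = rng.uniform(0, 10, size=300)
# Heteroscedastic errors: the spread grows with the fitted values
resid = rng.normal(scale=0.2 + 0.3 * fitted)
t_abs = aux_slope_t(fitted, np.abs(resid))
print(abs(t_abs) > 1.97)  # rejects: heteroscedasticity detected
```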

Finally, for the case illustrated in Figure 2(d), we formulate a model that selects variables from among the explanatory variables and their non-linear transformations.

## 3 Integrated Model Building and Validation Procedure via Mathematical Programming

In this section, we develop an automated procedure based on mathematical programming models that minimize the SSE to select a subset of a fixed number of explanatory variables while incorporating statistical tests and diagnostics for multivariate linear regression. In Section 3.1, we introduce the base model from the literature, which does not consider statistical tests or diagnostics. In Section 3.2, we extend the base model to include log-transformed explanatory variables. In Section 3.3, we propose constraints to remove statistically insignificant coefficients. In Section 3.4, we present the final algorithm, which considers all of the remaining tests and diagnostics.

### 3.1 Base model

We start with a basic model similar to that proposed by Wilson . All the sets, parameters, and decision variables used are as follows.

Sets and Parameters

- $n$: number of observations
- $m$: number of independent variables
- $I$: index set of observations, $I = \{1, \dots, n\}$
- $J$: index set of independent variables, $J = \{1, \dots, m\}$
- $i$: index of observations, $i \in I$
- $j$: index of independent variables, $j \in J$
- $k$: number of selected independent variables
- $A = [a_{ij}]$: standardized data matrix corresponding to the independent variables
- $b = [b_i]$: standardized data matrix corresponding to the dependent variable

Decision Variables

- $x_j$: coefficient of independent variable $j$
- $e_i$: error between the $i$-th observation and its predicted value
- $z_j$: $1$ if the $j$-th independent variable is selected; $0$ otherwise

Using the parameters and decision variables above, the basic mathematical programming model is presented as follows.

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i \in I} e_i^2 && \text{(3a)}\\
\text{subject to} \quad & e_i = \sum_{j \in J} a_{ij} x_j - b_i, \quad \forall i \in I, && \text{(3b)}\\
& -M z_j \le x_j \le M z_j, \quad \forall j \in J, && \text{(3c)}\\
& \sum_{j \in J} z_j = k, && \text{(3d)}\\
& x_j \ \text{unrestricted}, \quad \forall j \in J, && \text{(3e)}\\
& e_i \ \text{unrestricted}, \quad \forall i \in I, && \text{(3f)}\\
& z_j \in \{0, 1\}, \quad \forall j \in J && \text{(3g)}
\end{aligned}
$$

The basic model (3) minimizes the SSE of a multivariate linear regression model with a fixed $k$. Remark that the MSE is the SSE divided by a degrees-of-freedom constant, but since $n$ and $k$ are fixed as given parameters, minimizing the SSE is equivalent to minimizing the MSE. Constraint (3b) defines the residuals, and constraint (3c) indicates that if an explanatory variable is not selected, then the coefficient of the variable must be zero. Lastly, constraint (3d) ensures that the number of selected variables is $k$.

The mathematical model (3) can be converted into an MIQP with a popular linearization technique. Instead of the unrestricted continuous decision variables $x_j$ and $e_i$, non-negative variables $x_j^+$, $x_j^-$, $e_i^+$, and $e_i^-$ are used, where $x_j = x_j^+ - x_j^-$ and $e_i = e_i^+ - e_i^-$. By plugging these in, the following MIQP can be obtained.

$$
\begin{aligned}
\text{minimize} \quad & \sum_{i \in I} \left( e_i^{+2} + e_i^{-2} \right) && \text{(4)}\\
\text{subject to} \quad & e_i^+ - e_i^- = \sum_{j \in J} a_{ij} (x_j^+ - x_j^-) - b_i, \quad \forall i \in I,\\
& x_j^+ + x_j^- \le M z_j, \quad \forall j \in J,\\
& \sum_{j \in J} z_j = k,\\
& x_j^+, x_j^- \ge 0, \quad \forall j \in J,\\
& e_i^+, e_i^- \ge 0, \quad \forall i \in I,\\
& z_j \in \{0, 1\}, \quad \forall j \in J
\end{aligned}
$$
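For intuition about what models (3) and (4) compute, a brute-force reference simply enumerates all size-$k$ subsets and keeps the one with minimum SSE. This is tractable only for tiny $m$ (which is precisely why the MIQP formulation matters); the data and names below are ours.

```python
import numpy as np
from itertools import combinations

def best_subset(A, b, k):
    """Exhaustively find the size-k column subset of A with minimum SSE."""
    best_cols, best_sse = None, np.inf
    for cols in combinations(range(A.shape[1]), k):
        x, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
        sse = float(np.sum((A[:, cols] @ x - b) ** 2))
        if sse < best_sse:
            best_cols, best_sse = cols, sse
    return list(best_cols), best_sse

rng = np.random.default_rng(4)
A = rng.normal(size=(80, 6))
b = 1.0 * A[:, 1] + 1.5 * A[:, 4] + 0.05 * rng.normal(size=80)
cols, sse = best_subset(A, b, 2)
print(cols)  # → [1, 4]
```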

It is to be remembered that the main purpose of our study is to build a multivariate linear regression model that considers diagnostics. To meet this goal, we now extend the base mathematical model (4) above to include several of the important diagnostic tests presented in Section 2.

### 3.2 Inclusion of log-transformed explanatory variables

We first discuss how to include log-transformed explanatory variables in the model. Because an explanatory variable and its log-transformation are highly correlated, we need to prevent both variables from being selected simultaneously, as discussed in Section 2.1. Let us then define new parameters and sets.

Sets and Parameters

- $J_l$: index set of the logarithms of the independent variables
- $A_l$: standardized data matrix corresponding to the logarithms of the independent variables
- $\tilde{A} = [\tilde{a}_{ij}]$: data matrix concatenating $A$ and $A_l$, i.e., $\tilde{A} = [A \mid A_l]$

If a vector corresponding to the $j$-th column of $A$ possesses non-positive elements, we compute the logarithm of the column using a conventional method to deal with the non-positives. Since all original explanatory variables have transformed variables, we can define $j_l = j + m$ for each $j \in J$. The model incorporating the log-transformed explanatory variables, $MP_{base}(k)$, can be formulated as follows:

$$
\begin{aligned}
MP_{base}(k): \quad \text{minimize} \quad & \sum_{i \in I} \left( e_i^{+2} + e_i^{-2} \right) && \text{(5a)}\\
\text{subject to} \quad & e_i^+ - e_i^- = \sum_{j \in J \cup J_l} \tilde{a}_{ij} (x_j^+ - x_j^-) - b_i, \quad \forall i \in I, && \text{(5b)}\\
& x_j^+ + x_j^- \le M z_j, \quad \forall j \in J \cup J_l, && \text{(5c)}\\
& \sum_{j \in J \cup J_l} z_j = k, && \text{(5d)}\\
& z_j + z_{j_l} \le 1, \quad \forall j \in J, \ j_l = j + m, && \text{(5e)}\\
& x_j^+, x_j^- \ge 0, \quad \forall j \in J \cup J_l, && \text{(5f)}\\
& e_i^+, e_i^- \ge 0, \quad \forall i \in I, && \text{(5g)}\\
& z_j \in \{0, 1\}, \quad \forall j \in J \cup J_l && \text{(5h)}
\end{aligned}
$$

The extended model is similar to formulation (4). The difference is that (5) has additional derived explanatory variables in $J_l$, while preventing the simultaneous selection of original and log-transformed explanatory variables through (5e). We note that, in addition to the log-transformation, other types of transformations of the explanatory variables can be considered in a similar manner. Finally, we remark that an appropriate value of $M$ in (5c) is required to solve the problem. When $M$ is too small, we cannot guarantee optimality. When $M$ is too large, the optimization solver can struggle due to numerical issues. Park and Klabjan proposed a sampling-based approach, where $M$ is estimated by iteratively sampling subsets of explanatory variables. We use this method with a slight modification to make sure that the sampled explanatory variables do not simultaneously include an original explanatory variable and its transformation.
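A hedged sketch of such a sampling-based big-$M$ estimate follows. This is our simplification of the idea, not the authors' exact procedure: fit OLS on randomly sampled valid subsets, track the largest coefficient magnitude, and inflate it by a safety factor; subsets containing both members of an (original, transform) pair are skipped, per the modification described above.

```python
import numpy as np
from itertools import combinations

def estimate_big_m(A_tilde, b, pairs, k, trials=200, inflate=2.0, seed=0):
    """Estimate big-M by fitting OLS on randomly sampled size-k subsets
    that never contain both members of an (original, transform) pair."""
    rng = np.random.default_rng(seed)
    forbidden = {frozenset(p) for p in pairs}
    m2 = A_tilde.shape[1]
    max_coef = 0.0
    for _ in range(trials):
        cols = sorted(rng.choice(m2, size=k, replace=False))
        if any(frozenset(c) in forbidden for c in combinations(cols, 2)):
            continue  # violates the at-most-one-per-pair rule
        x, *_ = np.linalg.lstsq(A_tilde[:, cols], b, rcond=None)
        max_coef = max(max_coef, float(np.max(np.abs(x))))
    return inflate * max_coef

rng = np.random.default_rng(5)
A = np.abs(rng.normal(size=(60, 4))) + 0.1
A_tilde = np.hstack([A, np.log(A)])
pairs = [(j, j + 4) for j in range(4)]
b = A[:, 0] - 0.5 * np.log(A[:, 2]) + 0.1 * rng.normal(size=60)
M = estimate_big_m(A_tilde, b, pairs, 2)
print(M > 0)
```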

### 3.3 Inclusion of constraints corresponding to t-tests for the significance of the regression coefficients

In this section, we present constraints to check the statistical significance of the regression coefficients. Formulating such constraints is not a trivial task because the statistical significance of an estimated coefficient depends on the selected subset. In detail, $s(A)_j$ in (1) can only be calculated given a subset, so an inequality including it cannot be trivially formulated as a convex constraint. In order to address this issue, we convert inequality (2) into a constraint that checks the statistical significance of the $j$-th variable only if it is selected. We derive

$$
\begin{aligned}
& \left| \frac{x_j}{s(A_{s(k)})_j} \right| \ge t_{1-\frac{\alpha}{2},\, n-k-1} && \text{(6a)}\\
\Longrightarrow\ & |x_j| \ge t_{1-\frac{\alpha}{2},\, n-k-1} \left| s(A_{s(k)})_j \right| - M (1 - z_j) && \text{(6b)}\\
\Longrightarrow\ & x_j^+ + x_j^- \ge t_{1-\frac{\alpha}{2},\, n-k-1}\, s(A_{s(k)})_j - M (1 - z_j) && \text{(6c)}
\end{aligned}
$$

where $s(k)$ is a subset with $k$ explanatory variables, $A_{s(k)}$ is the submatrix of $\tilde{A}$ derived from $s(k)$, and $s(A_{s(k)})_j$ is the standard deviation of the estimated coefficient for explanatory variable $j$, which is equivalent to $s(A)_j$ and is only defined for explanatory variables in $s(k)$. The resulting constraint above is only activated when the $j$-th variable is selected (i.e., $z_j = 1$). Note that $A_{s(k)}$ varies according to the selected variables, which implies that the values $s(A_{s(k)})_j$ in the constraint vary accordingly. Hence, it is still difficult to add constraint (6c) in its current form to the MIQP model.

To handle the values of $s(A_{s(k)})_j$ changing with the selected subset, we instead use a lower bound of $s(A_{s(k)})_j$, which gives a relaxed version of the exact constraint. A tight lower bound for $s(A_{s(k)})_j$ can cut some of the subsets with insignificant regression coefficients.

To obtain a lower bound for $s(A_{s(k)})_j$, we consider its definition. Since $s(A_{s(k)})_j^2 = \mathrm{MSE}_{A_{s(k)}} \left( (A_{s(k)}' A_{s(k)})^{-1} \right)_{jj}$, it is straightforward to multiply a lower bound of $\mathrm{MSE}_{A_{s(k)}}$ by a lower bound of the diagonal element to provide a lower bound of $s(A_{s(k)})_j^2$.

We first consider the lower bound of $\mathrm{MSE}_{A_{s(k)}}$. The exact minimum can be calculated by solving $MP_{base}(k)$. However, solving $MP_{base}(k)$, which is an MIQP, can take a long time. Thus, we instead use its LP relaxation,

$$
\begin{aligned}
MSE_{LB}(k): \quad \text{minimize} \quad & \text{(5a)}\\
\text{subject to} \quad & \text{(5b), (5c), (5d), (5e), (5f), (5g)},\\
& 0 \le z_j \le 1, \quad \forall j \in J \cup J_l,
\end{aligned}
$$

which gives us the lower bound of $\mathrm{MSE}_{A_{s(k)}}$, denoted as $MSE_{LB}(k)$.

We next consider the lower bound of $\left( (A_{s(k)}' A_{s(k)})^{-1} \right)_{jj}$. In a multivariate linear regression model with a standardized data matrix $A$, the elements of the diagonal of $(A'A)^{-1}$ are called variance inflation factors (VIF). The VIF for the $j$-th coefficient is calculated by the following equation:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the coefficient of determination for a linear regression model in which variable $j$ is the response variable and all variables except $j$ are explanatory variables. Note that $\mathrm{VIF}_j \ge 1$ since $0 \le R_j^2 \le 1$ for a linear model with an arbitrary data matrix. Therefore, we can set the lower bound of the $j$-th diagonal element of $(A'A)^{-1}$ to 1.

Finally, the lower bound of $s(A_{s(k)})_j$, denoted as $s(k)_{LB}$, can be derived based on the following inequality:

$$s(A_{s(k)})_j = \sqrt{\mathrm{MSE}_{A_{s(k)}} \left( (A_{s(k)}' A_{s(k)})^{-1} \right)_{jj}} \ \ge\ \sqrt{MSE_{LB}(k)} \ =\ s(k)_{LB} \ \ge\ 0.$$

Thus, we can formulate the relaxed constraint of (6) as

$$x_j^+ + x_j^- \ \ge\ t_{1-\frac{\alpha}{2},\, n-k-1}\; s(k)_{LB}\; z_j, \qquad \forall j \in J. \tag{7}$$

Since $s(k)_{LB}$ is a calculated constant and $\alpha$ is determined by the user, constraint (7) is a linear constraint and can be added to the MIQP model.

We remark that (7) is a relaxed version of the exact $t$-test constraint and can be useful in reducing the search space of the branch-and-bound algorithm of the solver.
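The bound can be assembled numerically. As a simplification of ours, the sketch below lower-bounds $MSE_{LB}(k)$ by the full-model SSE (any subset's SSE is at least the SSE of OLS on all candidate columns), uses 1 for the VIF-based diagonal bound, and forms the right-hand-side threshold of constraint (7) with a user-supplied critical value.

```python
import numpy as np

def s_lb_threshold(A_tilde, b, k, t_crit):
    """Lower bound s(k)_LB on the coefficient standard deviation, and the
    threshold t * s(k)_LB appearing in constraint (7).
    Uses the full-model SSE as a cheap lower bound on any subset's SSE."""
    n = A_tilde.shape[0]
    x, *_ = np.linalg.lstsq(A_tilde, b, rcond=None)
    sse_full = float(np.sum((A_tilde @ x - b) ** 2))
    mse_lb = sse_full / (n - k - 1)   # lower bound on MSE_{A_s(k)}
    s_lb = np.sqrt(mse_lb * 1.0)      # diagonal of (A'A)^{-1} bounded below by 1
    return s_lb, t_crit * s_lb

rng = np.random.default_rng(6)
A = rng.normal(size=(100, 6))
b = A[:, 0] + 0.2 * rng.normal(size=100)
s_lb, thresh = s_lb_threshold(A, b, 3, t_crit=1.99)
print(s_lb > 0 and thresh > s_lb)
```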

### 3.4 Final algorithm

Recall that constraint (7) is a relaxed constraint derived from the lower bound of $s(A_{s(k)})_j$, and a feasible solution for (8) can therefore have statistically insignificant coefficients, while we want all coefficients to be statistically significant. Furthermore, it is challenging to formulate the tests that check the residual assumptions in Section 2.3 as linear or convex constraints. To overcome these limitations, we propose a lazy constraint-based approach to solve the following problem.

$$
\begin{aligned}
\text{minimize} \quad & \text{(5a)}\\
\text{subject to} \quad & \text{constraints of } MP_{base}(k) \text{ and (7)},\\
& \text{coefficient } t\text{-test constraints from Section 3.3},\\
& \text{residual diagnostics constraints from Section 2.3} && \text{(8)}
\end{aligned}
$$

Lazy callbacks are mainly employed in practical implementations using an optimization solver when the number of constraints in an MIP model is extremely large. For example, we may use a lazy callback when solving a traveling salesman problem (TSP) because it has a large number of subtour elimination constraints. Instead of adding all subtour elimination constraints at the root node of the branch-and-bound algorithm, we can iteratively and selectively add some of the subtour elimination constraints. In detail, when the solver arrives at a branch-and-bound node and the solution contains a subtour, the lazy callback allows us to add the corresponding subtour elimination constraint at the current node. Although our MIQP model does not have an extremely large number of constraints, we can use a lazy callback to stop at a branch-and-bound node and check whether the solution completely satisfies all of the tests and diagnostics. The lazy constraint-based procedure is presented in Algorithm 1.

In the algorithm, $S$ is the index set of selected variables at the current node in the branch-and-bound tree. The lazy constraint is added to the model when (a) a linear model associated with $S$ fails to pass at least one of the statistical tests for the coefficients or (b) the model fails to pass the residual diagnostics tests. The lazy constraint eliminates the solution selecting all explanatory variables in $S$ from the feasible region. For example, let us consider a problem with 10 explanatory variables with the associated binary variables $z_1, \dots, z_{10}$. We assume that the solver is currently at a node whose solution has $z_j = 1$ for the variables in some set $S$ with $|S| = k$ and $z_j = 0$ for all other explanatory variables. The procedure generates a linear model with the selected explanatory variable set $S$ and performs statistical tests and diagnostics for the model. When the regression model fails to pass at least one of the tests, the lazy constraint $\sum_{j \in S} z_j \le k - 1$ is added to the MIQP model. As a result, the solution with $z_j = 1$ for all $j \in S$ (the others are zero by constraint (5d)) becomes infeasible in all of the subsequent branches.

## 4 Alternative Solution Procedure for Infeasible or Difficult Problems

When all possible subsets violate at least one of the tests and assumptions, model (8) in Section 3.4 becomes infeasible and the algorithm will not return a solution. However, even for this case, a near-feasible solution is desired (i.e., satisfying most of the tests and assumptions while mildly violating a few of the tests). In this section, we propose a procedure, referred to as the alternative solution procedure, to find the most attractive near-feasible solution when the original problem is infeasible or no feasible solution is found in a suitable period of time.

The alternative solution procedure is invoked at each branch-and-bound node while no feasible solution satisfying all tests and diagnostics has been found by the algorithm. Once a feasible solution is found, the alternative solution procedure will no longer be invoked, since at least one feasible solution satisfying all of the tests and diagnostics exists. When the alternative solution procedure is invoked, it compares a new subset with the incumbent best subset (not feasible, but the best alternative near-feasible subset) and decides which subset is better. Our model updates and keeps only one alternative solution until it finds a feasible solution. The new subset, denoted as $S_1$, is the subset at the current branch-and-bound node, and the incumbent best subset, denoted as $S_2$, is the best alternative subset found so far. Remark that $S_1$ and $S_2$ must be infeasible solutions with some insignificant coefficients or residual assumption violations, as the alternative solution procedure keeps $S_2$ only until a feasible solution is found. To simplify the discussion, we will use $S_1$ and $S_2$ to refer to both the selected subsets and the corresponding linear regression models.

Note that one of the simplest approaches to comparing the candidate subsets is to add penalty terms to the objective function based on diagnostic violations, such as average p-values or the number of insignificant variables. Before presenting the details of the alternative solution procedure, we first explain why this simple penalty approach may fail and why the proposed procedure is needed. First, the penalty approach cannot effectively reflect the overall development procedure of a linear regression model. In typical model development, explained in Section 2, we first establish a linear regression model with statistically significant coefficients. We then check the residual assumptions via diagnostics. That is, we should consider the residual assumptions only after establishing a statistically significant model. This cannot be achieved with the simple penalty approach. Second, there exist non-trivial cases where the penalty approach may fail to select the better subset. The problem is discussed with the illustrative examples in Table 1.

Table 1 includes three cases determining which solution is better between and , where and are set to 5 and 0.05, respectively. The set of p-values from the statistical test for the coefficients is shown for each case. In Case 1, it is not surprising that we choose because the number of insignificant coefficients of is less than that of (i.e., ), and the average of the violating p-values for is also less than that of (i.e., ). On the other hand, must be better in Case 2 because the average of the violating p-values for is much larger than that of (), although the number of insignificant coefficients of is greater than that of (). In Case 3, although the average of the violating p-values of is less than that of (), selecting as the better solution seems reasonable because the difference is negligible while the number of insignificant coefficients of is significantly greater than that of (i.e., ). These illustrative examples indicate that we should consider both the average of the violating p-values and the number of insignificant coefficients. However, due to the different magnitudes of these measures, it is not trivial to set appropriate weights. Consequently, it is necessary to develop an intuitive procedure that reflects the framework of linear regression model construction and further alleviates the scale difference between measures of diagnostic violations.

Now, we discuss the alternative solution procedure. To explain it precisely, we first define the following functions related to the significance test for the coefficients.

- π(S): the number of insignificant coefficients in the model with subset S
- E(S): the average p-value of the insignificant coefficients in the model with subset S; 0 if all coefficients in subset S are statistically significant
- fq(S1,S2): decision by quality, where S1 and S2 are subsets
- ft(S1,S2;τ): decision with tolerance, where S1 and S2 are subsets

The last two functions, fq and ft, decide whether S1 or S2 is better based on different principles, which will be discussed in detail later in this section.

The overall alternative solution procedure is presented in Algorithm 2.

Step 1 is for when is the statistically significant model while is not. Hence, it is natural to return in Step 2. Steps 3 and 4 consider the opposite case to that of Steps 1 and 2. In Steps 5-9, the non-trivial cases are considered: both models are statistically significant in Step 5, and both models are statistically insignificant in Step 7. For these two cases, fq, referred to as decision by quality, and ft, referred to as decision with tolerance, are employed to make a decision.
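As a concrete illustration, the dispatch logic of Algorithm 2 can be sketched as follows. The names (S_new for the subset at the current node, S_inc for the incumbent alternative), the dictionary representation, and the helper signatures are our own assumptions for this sketch, not the paper's notation:

```python
def n_insig(pvals, alpha=0.05):
    """pi(S): number of insignificant coefficients (p-value above alpha)."""
    return sum(1 for p in pvals if p > alpha)

def choose_alternative(S_new, S_inc, decide_by_quality, decide_with_tolerance):
    """Return the better near-feasible subset, mimicking Algorithm 2.

    S_new, S_inc: dicts with a 'pvals' list of coefficient p-values.
    """
    pi_new, pi_inc = n_insig(S_new["pvals"]), n_insig(S_inc["pvals"])
    if pi_new == 0 and pi_inc > 0:    # Steps 1-2: only S_new is significant
        return S_new
    if pi_inc == 0 and pi_new > 0:    # Steps 3-4: only S_inc is significant
        return S_inc
    if pi_new == 0 and pi_inc == 0:   # Step 5: both significant -> decision by quality
        return decide_by_quality(S_new, S_inc)
    # Step 7: both insignificant -> decision with tolerance
    return decide_with_tolerance(S_new, S_inc)
```

The trivial cases are resolved by the coefficient tests alone; the two helper functions are only consulted when both models fall on the same side of the significance tests.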

Function fq returns the better solution between and , as determined by quantifying the quality of the solutions. It quantifies the solutions using the penalty function q introduced below.

- MSE(S): the MSE of solution S
- rl(S): the p-value from the residual linearity test for solution S
- rh(S): the p-value for the residual heteroscedasticity tests, which is the maximum of the p-values of the absolute residual test and the Breusch-Pagan test
- λ: penalty parameter for MSE(S)
- λπ: penalty parameter for π(S)
- λE: penalty parameter for E(S)
- λl: penalty parameter for rl(S)
- λh: penalty parameter for rh(S)
- w1(p,α): percentage transform function for E(S); w1(p,α) = max(p − (1−α), 0) / α
- w2(p,α): percentage transform function for rl(S) and rh(S); w2(p,α) = max((1−α) − p, 0) / (1−α)

q(S; λ, λπ, λE, λl, λh) = λ·MSE(S) + λπ·π(S) + λE·w1(E(S), αE) + λl·w2(rl(S), αl) + λh·w2(rh(S), αh)

where αE, αl, and αh are the significance levels for the t-tests for coefficients, the diagnostics for residual linearity, and residual heteroscedasticity, respectively. Note that function q includes the percentage transform functions w1 and w2 to indicate the percentage gap from the significance levels. We introduce these functions for the following two reasons. First, we want to accurately measure the insignificance of the p-values when the significance levels are at different scales. Second, we do not want to apply penalties if the p-value is within the significance level.
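A minimal sketch of the penalty function q and the percentage transforms follows, assuming (our reading of the flattened formulas) that each transform is the violation gap normalized by the width of the possible violation range, with significance levels expressed as in the paper (e.g., 0.95). The dictionary keys are illustrative, not the paper's notation:

```python
def w1(p, alpha):
    """Scaled t-test violation: nonzero only when p exceeds 1 - alpha."""
    return max(p - (1 - alpha), 0.0) / alpha

def w2(p, alpha):
    """Scaled residual-test violation: nonzero only when p falls below 1 - alpha."""
    return max((1 - alpha) - p, 0.0) / (1 - alpha)

def q(S, lam, lam_pi, lam_E, lam_l, lam_h, a_E=0.95, a_l=0.95, a_h=0.95):
    """Penalty q(S) used by the decision-by-quality function fq.

    S: dict with the MSE, the number of insignificant coefficients,
    their average p-value, and the two residual-test p-values.
    """
    return (lam * S["mse"]
            + lam_pi * S["n_insig"]
            + lam_E * w1(S["avg_insig_p"], a_E)
            + lam_l * w2(S["rl"], a_l)
            + lam_h * w2(S["rh"], a_h))
```

Because both transforms clamp at zero, a p-value on the right side of its significance level contributes no penalty at all, which realizes the penalty-nullification property discussed below.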

Two examples are provided for and to support our claims and demonstrate their necessity. The first example concerns penalty nullification. Suppose and , which implies the residuals of are homoscedastic while those of exhibit heteroscedasticity. Given , we can calculate , and . The penalties make sense because does not violate the test. Hence, the evaluation of is based only on the MSE and the significance of the coefficients. The second example demonstrates the role of scaling between p-values from different statistical tests. Suppose that we get , . Further, and are both set to . Then, violates the t-tests of the coefficients by , which is equal to the value of the heteroscedasticity test violation . However, the violation of is relatively small for the possible t-test violation range compared to the possible residual test violation range. Thus, despite the same violation value, the significance of the violation can differ depending on the tests and diagnostics. We can scale the magnitudes of the violations through and : , and .

We now discuss the decision with tolerance presented in Algorithm 3. Steps 1-2 are key to our procedure. If exceeds the tolerance τ, a positive parameter predetermined by the user, then the alternative solution procedure concludes that is better. When the average of the violating p-values of is smaller than, or not significantly worse (by less than τ) than, that of , we conclude that is competitive and investigate further in Steps 3-11 by comparing the two solutions based on two factors, and . Steps 4-5 conclude that, if is smaller than in terms of both factors, then is better. The opposite case is handled in Steps 6-7. If the two factors conflict, settles the decision. Remark that Algorithms 2 and 3 imitate the linear regression model building procedure described in Section 2 by first using statistical significance tests to make the decision, followed by the other diagnostics.
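The decision-with-tolerance logic of Algorithm 3 can be sketched in the same style; again the dictionary keys, names, and the choice of returning the new subset on ties are illustrative assumptions:

```python
def decide_with_tolerance(S_new, S_inc, f_q, tau):
    """Sketch of Algorithm 3 (our naming; the paper's symbols are elided).

    S_new, S_inc: dicts with the average violating p-value and the
    number of insignificant coefficients. f_q breaks conflicts. tau > 0.
    """
    E_new, E_inc = S_new["avg_insig_p"], S_inc["avg_insig_p"]
    # Steps 1-2: S_new is worse than the incumbent by more than the tolerance.
    if E_new - E_inc > tau:
        return S_inc
    pi_new, pi_inc = S_new["n_insig"], S_inc["n_insig"]
    # Steps 4-5 / 6-7: one subset is at least as good on both factors.
    if pi_new <= pi_inc and E_new <= E_inc:
        return S_new
    if pi_inc <= pi_new and E_inc <= E_new:
        return S_inc
    # Conflicting factors: let decision by quality settle it.
    return f_q(S_new, S_inc)
```

The tolerance check runs first, so a grossly worse average violating p-value disqualifies the new subset before the factor comparison is ever reached.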

We illustrate the entire alternative solution procedure using the three toy examples presented in Table 1. Suppose that our model is searching for a subset whose cardinality is set to 5 and that a feasible solution has not yet been found. A user sets the significance levels and to 95 percent and , respectively.

• Case 1. Our model cannot determine which solution is better at the first step because and also . Hence, the decision with tolerance is invoked (Algorithm 3). In Step 1, the algorithm computes , , and . Thus, the next steps compare with and with . Since and , is determined to be the better solution.

• Case 2. Since and are both greater than , our model cannot determine which solution is better in Step 7 of Algorithm 2, and is called. In Algorithm 3, since in Step 2, is returned.

• Case 3. Similar to the previous cases, is called in Step 8 of Algorithm 2. In Algorithm 3, although and is greater than , Steps 4-10 are considered because . However, since is greater than and is less than , the winning solution is determined by .

Overall, we found that the alternative solution procedure picks the desired solution discussed in Table 1. Throughout the alternative solution procedure, we first compare with using the significance of the estimated coefficients, in particular with and . If this step cannot make a decision, the residual diagnostics are considered. This multistage procedure reflects the linear regression model building framework discussed in Section 2. The procedure also deals with the possible issues illustrated in Cases 2 and 3 in Table 1 by introducing . A user can intuitively set a value of based on the average p-value difference he or she is willing to permit.

In the next sections, we will refer to our final algorithm, which solves (8) with Algorithm 1 and the alternative solution procedure in Algorithms 2 and 3, as .

## 5 Computational Results

In this section, the results of numerical experiments using the proposed model and benchmarks are presented. The experiments are designed to demonstrate the performance of the proposed model and the necessity of the alternative solution procedure presented in Section 4.

### 5.1 Benchmark dataset description and preprocessing

Twelve publicly available datasets are used in the experiments. We collect five datasets for regression from the UCI machine learning data repository and the remaining seven from several other sources. Table 2 summarizes the features of the experimental datasets.

We preprocess the datasets as follows. First, we remove records which include at least one missing value. Second, dummy variables are introduced for the categorical variables. Third, nominal variables are changed into numerical variables. Next, log-transformed variables are created for the numerical variables in the original data. Finally, all of the datasets are standardized.
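Under the assumption that the pipeline is implemented in Python with pandas (the paper's experiments use Python), the five steps could look like the following sketch; all column names and arguments are placeholders, not names from the paper's datasets:

```python
import numpy as np
import pandas as pd

def preprocess(df, categorical_cols, numeric_cols):
    """Sketch of the five preprocessing steps on a raw DataFrame."""
    df = df.dropna()                                   # 1. drop records with missing values
    df = pd.get_dummies(df, columns=categorical_cols)  # 2./3. encode categorical/nominal variables
    for c in numeric_cols:                             # 4. add log-transformed copies
        if (df[c] > 0).all():                          # log is only defined for positive values
            df["log_" + c] = np.log(df[c])
    num = df.select_dtypes(include=[np.number])        # 5. standardize all numeric columns
    return (num - num.mean()) / num.std(ddof=0)
```

The positivity guard before the log transform is our own assumption; the paper does not state how non-positive values are handled.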

### 5.2 Experimental design

In this section, we present the design of the experiments. To demonstrate the practical viability of our model, we set a time limit of 600 seconds for every experiment. The algorithm time includes the computation time for big-M in (5c) and in (7). For all experiments, we use the values in Table 3 for the significance levels of the tests and the parameters for the alternative solution procedure.

#### Experiment 1: proposed model vs. simple benchmark models

We compare with two simple benchmark models: and a forward selection algorithm. With this experiment, the performance of the proposed model in terms of the aforementioned statistical significance and regression assumptions can be verified. The forward selection algorithm is presented in Algorithm 4.

The algorithm is constructed by modifying a standard forward selection algorithm to select from among variables and their log-transforms. In Step 5 in , once a variable is selected, both the original and transformed variables are excluded from the candidate set of variables. If , then the log-transformed variable is excluded as well. If , then the original variable is excluded as well. Thus, always provides a solution, with at most one variable from an original and transformed variable pair.
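A sketch of this modified forward selection follows, with a generic score function standing in for the fitting criterion and a pairs map linking each variable to its original/log-transformed partner (both our own abstractions of Algorithm 4):

```python
def forward_select(candidates, pairs, score, k):
    """Greedy forward selection over originals and their log transforms.

    candidates: variable names; pairs: maps a variable to its partner
    (or None); score(subset): value to maximize; k: subset size.
    """
    selected, pool = [], set(candidates)
    while len(selected) < k and pool:
        # Greedily pick the variable that most improves the score.
        best = max(pool, key=lambda v: score(selected + [v]))
        selected.append(best)
        pool.discard(best)
        partner = pairs.get(best)
        if partner is not None:
            pool.discard(partner)  # exclude the paired variable as well
    return selected
```

Because a selected variable and its partner leave the pool together, the result can never contain both a variable and its log transform, matching the property stated above.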

For each dataset in Table 2, all models and algorithms are tested over . For each case, we check the goodness-of-fit with the adjusted , denoted as . Although the objective of all models and algorithms is to minimize the SSE, this is equivalent to maximizing because is fixed. Also, since ranges from to , we can easily compare the goodness-of-fit over different values and datasets.

#### Experiment 2: proposed model vs. iterative model

We compare our model with an iterative algorithm used in Bertsimas and King for cases where the solution for violates some of the tests and diagnostics. The algorithm iteratively adds constraints to avoid subsets with insignificant coefficients. We will refer to the iterative model as in this paper. Recall that a key feature of our algorithm is its ability to avoid iterative calls of the MIQP solver, whereas iterative approaches solve multiple MIQPs by adding constraints. In this comparison, we demonstrate the effectiveness of our lazy constraint-based algorithm. As the two models have different constraints and parameters, we compare them using their common constraints: the number of variables selected and the t-tests. In detail, we exclude the residual test constraints and set from the default parameters in Table 3.

The algorithm is initialized with the empty set of constraints CutSet in Step 1. It continues while the obtained subset has at least one statistically insignificant coefficient. In Step 3, the current model ( with the constraints in CutSet) is solved to obtain . In Steps 4-6, if includes a statistically insignificant coefficient, then a constraint that cuts from the feasible region is added to CutSet.
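The loop can be sketched abstractly as follows, with solve standing in for the MIQP call and has_insignificant for the t-tests (both placeholders, since the actual model is solved with Gurobi):

```python
def iterative_subset_selection(solve, has_insignificant):
    """Sketch of the iterative benchmark: re-solve the MIQP, adding a
    cut that excludes each statistically insignificant subset.

    solve(cuts): returns the best subset not excluded by cuts.
    has_insignificant(S): True if S has an insignificant coefficient.
    """
    cuts = set()                        # CutSet, initially empty (Step 1)
    while True:
        S = solve(cuts)                 # Step 3: solve the current model
        if not has_insignificant(S):    # all coefficients significant
            return S
        cuts.add(frozenset(S))          # Steps 4-6: cut S from the region
```

Each iteration here is a full solver call, which is exactly the cost that the lazy constraint-based algorithm is designed to avoid.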

#### Experiment 3: alternative solution procedure vs. simple penalty function

We demonstrate the effectiveness of the alternative solution procedure by comparing it with a simple penalty function. The benchmark is obtained by replacing the alternative solution procedure with the simple penalty function q. The benchmark model will be referred to as .

### 5.3 Experimental Results

For all experiments, we utilize an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (8 CPUs) and 8GB RAM. All models and algorithms are implemented with Python, in which mathematical models are solved by Gurobi 7.0.0. For the construction of the linear model in , we employ the Python package statsmodels .

In every experiment, none of the solutions from the compared models exhibit linearity between their residuals and fitted values, so the corresponding results are not provided.

#### Result for Experiment 1: proposed model vs. simple benchmark models

To measure the explanatory power of each model, we introduce a measure for relative explanatory power , where is a subset obtained by solving and is the solution of the corresponding model. We use the of as a denominator of since provides the greatest among all of the compared models because does not have any diagnostic constraints.
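Assuming the standard definition of adjusted R-squared, the relative explanatory power measure can be computed as in this sketch (the function names are ours; the paper's symbols are elided):

```python
def adjusted_r2(r2, n, k):
    """Standard adjusted R^2 for n observations and k selected variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def relative_power(adj_r2_model, adj_r2_unconstrained):
    """Relative explanatory power: the model's adjusted R^2 divided by
    that of the diagnostics-free MIQP solution (the denominator above)."""
    return adj_r2_model / adj_r2_unconstrained
```

Since the diagnostics-free solution attains the largest adjusted R-squared among the compared models, the ratio is bounded above by 1 and directly measures how much explanatory power is sacrificed to satisfy the diagnostics.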

To provide better insights, we present the summarized results over the datasets and over the number of variables selected () in Tables 4 and 5, respectively. In both tables, is the average over the datasets or values. To check performance on the feasible datasets, we also present , which considers only the cases with feasible solutions available when calculating the average. Additionally, we count the number of cases satisfying (i) both the t-test and the residual tests, (ii) the t-test only, and (iii) the residual tests only.