In many practical applications, especially in the case of high-dimensional problems, model selection — selecting a subset of features highly related with response variable — is crucially important. A popular approach to achieve this task is the LassoTibshirani (1996). Although various properties of Lasso have been extensively studied in the past decades (see, e.g., Hastie et al. (2015)), exact statistical inference such as computing
-values or confidence intervals foradaptively selected features by Lasso has only recently begun to be actively studied Lee et al. (2016); Fithian et al. (2014).
A recent seminal work is the selective inference (SI also a.k.a. post-selection inference) for the Lasso proposed by Lee et al Lee et al. (2016). The main idea is to make inference for the selected features conditional on the selection event of the Lasso. By conditioning on the selection event, exact valid inference on adaptively selected features by Lasso is possible in the sense that -values for proper false positive rate control or confidence intervals with proper coverage guarantees can be obtained. After the seminal work, conditional inference-based SI has been actively studied and applied to various problems Bachoc et al. (2014); Fithian et al. (2014, 2015); Choi et al. (2017); Tian et al. (2018); Chen and Bien (2019); Hyun et al. (2018); Bachoc et al. (2018); Charkhi and Claeskens (2018); Loftus and Taylor (2014); Loftus (2015); Panigrahi et al. (2016); Tibshirani et al. (2016); Yang et al. (2016); Suzumura et al. (2017); Tanizaki et al. (2020); Duy et al. (2020).
The main challenge when developing a conditional inference-based SI method is to fully characterize the selection event. In the case of Lasso, the authors in Lee et al. (2016) showed that the selection event can be characterized as a finite set of affine constraints, which leads to the sampling distribution of relevant test-statistic in the form of a truncated Normal distribution
when the error is normally distributed.
Let be a subset of the selected features by applying Lasso on any random data sample and be their signs. Then, it has been shown that the selection event is represented by a polytope in the data space, where and are the corresponding observations (see §2 for detailed notations and setup). However, it is well-known that additionally conditioning on the signs leads to low statistical power because of over-conditioning, which is widely recognized as a major drawback of the current Lasso SI approach and almost all the following studies Fithian et al. (2014); Lee and Taylor (2014); Fithian et al. (2015); Taylor and Tibshirani (2018).
The authors in Lee et al. (2016) discussed the solution to overcome the drawback by conducting inferences conditional on the selection event without signs , which can be characterized by polytopes. If the number of selected features is moderate (e.g., up to 15), it is feasible to consider the whole affine constraints of all these polytopes. However, if is large, it becomes impossible to enumerate all the affine constraints for exponentially increasing number of polytopes.
In the other direction, Liu et al. Liu et al. (2018) proposed an approach for improving the power in the case where the inference target is full model coefficients. Unfortunately, this approach can be applied only to full-model target case which is often not the main interest in post-selection inference literature, and it is not even applicable when the number of features is greater than the number of instances . As other recent approaches to improve the power, Tian et al. Tian et al. (2018) and Terada et al. Terada and Shimodaira (2019) proposed methods using randomization. A drawback of these randomization-based approaches including simple data-splitting approach is that further randomness is added in both feature selection and inference stages.
We present a general deterministic method for resolving the low statistical power issue of current Lasso SI by using parametric programming (a.k.a. homotopy methods) Ritter (1984); Allgower and George (1993); Gal (1995); Best (1996), which is motivated by Liu et al. Liu et al. (2018)
and the discussion therein. Our main idea is to compute the continuum path of Lasso solutions in the direction of interest, and compute the tail probability of the truncated Normal distribution by following the solution path. We show that the continuum path of Lasso solution along the direction of the test-statistic can be exactly and efficiently computed by piecewise-linear homotopy computation. One might wonder how we can circumvent the computational bottleneck of exponentially increasing number of polytopes. Our experience suggests that, by focusing on the the line along the test-statistic in data space, we can skip majority of the polytopes that do not affect the truncated Normal sampling distribution because they do not intersect with this line. We demonstrate the efficiency of the proposed method through experiments in which we show that Lasso SI without conditioning on signs can be done even when there are hundreds of selected features. Parametric programming has been used in various statistical and machine learning problemsOsborne et al. (2000); Efron and Tibshirani (2004); Hastie et al. (2004); Rosset (2005); Bach et al. (2006); Rosset and Zhu (2007); Tsuda (2007); Lee and Scott (2007); Takeuchi et al. (2009); Karasuyama and Takeuchi (2010); Hocking et al. (2011); Karasuyama et al. (2012). However, to the best of our knowledge, this is the first work showing that piecewise-linear parametric programming or homotopy method can be effectively used for characterizing the selection events in SI. Figure 1 shows the schematic illustration and the efficiency of the method we propose in this paper.
2 Problem Statement
To formulate the problem, we consider a random response vector
where is the number of instances, is modeled as a linear function of features , and
is a covariance matrix which is known or estimable from independent data. The goal is to statistically quantify the significance of the relation between the features and response while properly controlling the false positive rate. To achieve the goal, the authors inLee et al. (2016) have proposed a practical SI framework, in which a subset of features is first “selected” by the Lasso, and the inferences are then conducted for each selected feature.
Feature selection and its selection event.
Given an observed response vector sampled from the model (1), the Lasso optimization problem is given by
where is a feature matrix, and is a regularization parameter. Since the Lasso produces sparse solutions, the active set selected by applying the Lasso to is defined as
Then, the event that the Lasso active set for a random vector is the same as is written as
The authors in Lee et al. (2016) showed that the selection event can be characterized by a set of linear inequalities.
Statistical inference for the selected feature.
For the inference on the selected feature in , we consider the following statistical test
A natural choice of the test statistic is defined as , where in which is a unit vector whose element is 1 and 0 otherwise. Since the hypothesis is generated from the data, selection bias exists. In order to correct the selection bias, we have to remove the information that has been used for initial hypothesis generating process. This is achieved by considering the sampling distribution of the test statistic conditional on the selection event, i.e.,
where with . The second condition is additionally added for technical tractability Fithian et al. (2014); Lee et al. (2016), which indicates the component that is independent of the test statistic for a random vector is the same as the one for .
Once the selection event is identified, we can easily compute the pivotal quantity
which is the c.d.f. of the truncated Normal distribution with mean
, variance, and the truncation region which is calculated based on the selection event. The pivotal quantity is crucial for calculating -value or obtaining confidence interval. Based on the pivotal quantity, we can consider
selective type I erroror selective -value Fithian et al. (2014) in the form of
which is valid in the sense that
Furthermore, to obtain confidence interval for any , by inverting the pivotal quantity in Equation (5), we can find the smallest and largest values of such that the value of pivotal quantity remains in the interval Lee et al. (2016).
However, the main challenge is that characterizing in Equation (4) is computationally intractable because we have to consider possible sign vectors. To overcome this issue, the authors in Lee et al. (2016) consider inference conditional not only on the selected features but also on their signs. Unfortunately, additionally considering the signs leads to low statistical power because of over-conditioning. In the next section, we will provide an efficient method for identifying , which enables us to easily obtain the minimum amount of conditioning, leading to high statistical power.
3 Proposed Method
In this section, we propose to use parametric programming for efficiently identifying the conditioning event . The schematic illustration of our idea is shown in Figure 1.
3.1 Characterization of Conditional Data Space
Let us define the conditional data space in Equation (4) as
where , , and
Now, let us consider a random variableand its observation , which satisfy and . The conditional inference in (4) is re-written as the problem of characterizing the sampling distribution of
under the null hypothesis, the law offollows a truncated Normal distribution. Once the truncation region is identified, the pivotal quantity in Equation (5) is equal to , and can be easily obtained. Thus, the remaining task is to characterize .
Characterization of truncation region .
Let us introduce the optimization problem (2) with parametrized response vector for as
The subdifferential of the -norm at is defined as follows:
where we denote . Then, for any in , the optimality condition is given by
To construct the truncation region in Equation (9), we have to 1) compute the entire path of , and 2) identify the set of intervals of on which . However, it seems intractable to compute for infinitely many values of . Our main idea to overcome this difficulty is to propose a parametric programming method for efficiently computing a finite number of “transition points” at which the active set changes.
3.2 A Piecewise Linear Homotopy
We now derive the main technique of the proposed method. We show that is a piecewise linear function of . To make the notation lighter, we write , and we denote the set of inactive features as .
Consider two real values and . Suppose for all , for all , and is invertible. If and have the same active set and the same signs, then we have
where , and .
From the optimality conditions of the Lasso, we have
Thus, we achieve Equation (13). Next, from the optimality conditions of the Lasso, we also have
In this paper, we assume the uniqueness of the Lasso solution for all as well as for all and the invertibility of . These assumptions are justified by assuming the columns of are in general position Tibshirani and others (2013). Parametric programming methods for handling the rare cases where these assumptions are not satisfied have been studied, e.g., in Best (1996), and can be applied to our problem setup. In practice, when the design matrix is not in general position, it is also common to introduce an additional ridge penalty term, resulting in the elastic net Zou and Hastie (2005). Our proposed method can be extended for the elastic net case (see Appendix for the details).
Computation of the transition point.
From Lemma 1, the solution is a linear function of until reaches a transition point at which either an element of becomes zero or a component of becomes one in absolute value. We now introduce how the transition point is identified.
Let be a real value such that . Then, , , and for any real value in the interval , where is the value of transition point,
Here, we use the convention that for any real number , if , and otherwise.
From Equation (13), we can see that is a function of . For a real value , there exists such that for any real value in , all elements of remain the same signs with . Similarly, from Equation (14), we can see that is a function of . Then, for a real value , there exists such that for any real value in , all elements of are smaller than 1 in absolute value. Finally, by taking , we obtain the interval in which the active set and signs of Lasso solution remain the same. The remaining task is how to compute and . We defer the detailed derivations of and to the Appendix. ∎
In this section, we show the detailed algorithm of our proposed parametric programming method. In Algorithm 1, for feature selection step, we just simply apply Lasso to the data , and obtain the active set . Then, we conduct SI for each selected feature. For testing we first obtain the direction of interest , which can be easily computed as in §2. Second, the main task is to compute the solution path of in Equation (11) for the parametrized response vector , where, note that, the parametrized solution are different among different since the direction of interest depends on . This task can be done by Algorithm 2. Finally, after having the path, we can easily obtain truncation region which is used to compute selective -value or selective confidence interval.
In Algorithm 2, a sequence of transition points are computed one by one. The algorithm is initialized at . At each , the task is to find the next transition point , where the active set changes. This task can be done by computing the step size in Algorithm 3. This step is repeated until . The algorithm returns the sequences of Lasso solutions and transition points.
Choice of .
4 Numerical Experiments
False positive rate (FPR) and True positive rate (TPR).
We show the FPRs and TPRs of our proposed method for the following two cases of conditional inferences:
TN-As: , where is the sign vector of Lasso solutions on , and is the sign vector of the Lasso solutions on .
We also additionally show the FPRs and TPRs of data splitting (DS) method, which is the commonly used procedure for the purpose of selection bias correction. In this approach, the data is randomly divided in two halves — first half used for model selection and the other for inference.
We generated outcomes as , , where in which , and . We set the regularization parameter , significance level . For the FPR experiments, all elements of were set to 0. For the TPR experiments, the first two elements of were set to 0.25. We ran 100 trials for each , and we repeated this experiments 10 times. The results are shown in Figure 2. The results are consistent with the discussion in Lee et al. (2016). The TN-A obviously has higher power than TN-As because we conduct inference conditional only on the set of selected features, i.e., minimum amount of conditioning. However, the method proposed in Lee et al. (2016) requires a huge amount of computing time while our proposed parametric programming approach can easily complete the task. We will demonstrate this advantage of our method in the latter part of this section.
We generated outcomes as , , where in which , and . The first 5 elements of were set to 0.25, and was set to 1. In the cases of TN-A and TN-As, 9 features were selected by the Lasso while only 8 features were selected in the case of DS. Therefore, we only show the 95% confidence interval of the features that are selected in both cases on the left side of Figure 3. We repeated this experiment 100 times and showed the boxplot of the lengths of the confidence intervals on the right side of Figure 3.
Efficiency of the proposed method.
We demonstrate the efficiency of the proposed method by comparing the computing time with the existing method proposed in Lee et al. (2016) when the number of active features is small. We then show the computing time of our method for the case when the number of active features is large, in which existing method requires a huge burden of calculating cost or can not complete the task in realistic time.
We considered two cases: and . The outcome was generated as , , where , and . We set the regularization parameter for the first case, and for the second case. For the comparison experiments with the existing method when the number of active features is small, the first components of are set to for each . For the experiments showing the computing time of the proposed method, when the number of active features is large, the first components of were set to for each , and we ran 10 trials for each case. The results are shown in Figure 4.
Uniformity verification of the pivotal quantity.
We generated outcomes as , , where , and . We set the first two elements of to 2, and set . We ran 1,200 trials for each case of conditioning: TN-A and TN-As, and verified the uniform QQ-plot of the pivotal quantity when performing our proposed method. The detailed results are deferred to Appendix.
Overall, the results indicate that the proposed method is valid in the sense that the false finding probability is properly controlled under the pre-defined significance level (e.g, 0.05), and significantly more efficient compared to the existing method in terms of computational cost. Since our method overcomes the computational challenge of conducting inference conditional only on the set of selected features, we now can effectively preserve the high statistical power while successfully controlling the false finding probability.
In this paper, we have introduced an efficient method for characterizing the selection event of Lasso SI by using piecewise-linear parametric programing. With the proposed method, we can maintain the high statistical power by conditioning only on the features without the need of enumerating all possible sign vectors. We conducted several experiments to demonstrate the good performance of the proposed method in terms of FPR control, power and computational efficiency.
- Continuation and path following. Acta Numerica 2, pp. 1–63. Cited by: §1.
Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research 7, pp. 1713–41. Cited by: §1.
- On the post selection inference constant under restricted isometry properties. Electronic Journal of Statistics 12 (2), pp. 3736–3757. Cited by: §1.
- Valid confidence intervals for post-model-selection predictors. arXiv preprint arXiv:1412.4605. Cited by: §1.
- An algorithm for the solution of the parametric quadratic programming problem. Applied Mathemetics and Parallel Computing, pp. 57–76. Cited by: §1, Remark 1.
- Asymptotic post-selection inference for the akaike information criterion. Biometrika 105 (3), pp. 645–664. Cited by: §1.
Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, pp. 1–12. Cited by: §1.
- Selecting the number of principal components: estimation of the true rank of a noisy matrix. The Annals of Statistics 45 (6), pp. 2590–2617. Cited by: §1.
- Computing valid p-value for optimal changepoint by selective inference using dynamic programming. arXiv preprint arXiv:2002.09132. Cited by: §1.
- Least angle regression. Annals of Statistics 32 (2), pp. 407–499. Cited by: §1.
- Optimal inference after model selection. arXiv preprint arXiv:1410.2597. Cited by: §1, §1, §1, §2, §2, §3.1.
- Selective sequential model selection. arXiv preprint arXiv:1512.02565. Cited by: §1, §1.
- Postoptimal analysis, parametric programming, and related topics. Walter de Gruyter. Cited by: §1.
The entire regularization path for the support vector machine. Journal of Machine Learning Research 5, pp. 1391–415. Cited by: §1.
- Statistical learning with sparsity: the lasso and generalizations. CRC press. Cited by: §1.
- Clusterpath: an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, pp. 745–752. Cited by: §1.
- Post-selection inference for changepoint detection algorithms with application to copy number variation data. arXiv preprint arXiv:1812.03644. Cited by: §1.
- Multi-parametric solution-path algorithm for instance-weighted support vector machines. Machine Learning 88 (3), pp. 297–330. Cited by: §1.
Nonlinear regularization path for quadratic loss support vector machines.
IEEE Transactions on Neural Networks22 (10), pp. 1613–1625. Cited by: §1.
- The one class support vector machine solution path. In Proc. of ICASSP 2007, pp. II521–II524. Cited by: §1.
- Exact post-selection inference, with application to the lasso. The Annals of Statistics 44 (3), pp. 907–927. Cited by: Parametric Programming Approach for Powerful Lasso Selective Inference without Conditioning on Signs, §1, §1, §1, §1, §2, §2, §2, §2, §2, §4, §4.
- Exact post model selection inference for marginal screening. In Advances in neural information processing systems, pp. 136–144. Cited by: §1.
- More powerful post-selection inference, with application to the lasso. arXiv preprint arXiv:1801.09037. Cited by: §1, §1, §3.1, §3.3.
- A significance test for forward stepwise model selection. arXiv preprint arXiv:1405.3920. Cited by: §1.
- Selective inference after cross-validation. arXiv preprint arXiv:1511.08866. Cited by: §1.
- A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis 20 (20), pp. 389–404. Cited by: §1.
- Bayesian post-selection inference in the linear model. arXiv preprint arXiv:1605.08824 28. Cited by: §1.
- On parametric linear and quadratic programming problems. mathematical Programming: Proceedings of the International Congress on Mathematical Programming, pp. 307–335. Cited by: §1.
- Piecewise linear regularized solution paths. Annals of Statistics 35, pp. 1012–1030. Cited by: §1.
- Following curved regularized optimization solution paths. In Advances in Neural Information Processing Systems 17, pp. 1153–1160. Cited by: §1.
- Selective inference for sparse high-order interaction models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3338–3347. Cited by: §1.
Nonparametric conditional density estimation using piecewise-linear solution path of kernel quantile regression. Neural Computation 21 (2), pp. 539–559. Cited by: §1.
- Computing valid p-values for image segmentation by selective inference. Cited by: §1.
- Post-selection inference for-penalized likelihood models. Canadian Journal of Statistics 46 (1), pp. 41–61. Cited by: §1.
- Selective inference after variable selection via multiscale bootstrap. arXiv preprint arXiv:1905.10573. Cited by: §1.
- Selective inference with a randomized response. The Annals of Statistics 46 (2), pp. 679–710. Cited by: §1, §1.
- Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: §1.
- The lasso problem and uniqueness. Electronic Journal of statistics 7, pp. 1456–1490. Cited by: Remark 1.
- Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association 111 (514), pp. 600–620. Cited by: §1.
- Entire regularization paths for graph data. In In Proc. of ICML 2007, pp. 919–925. Cited by: §1.
- Selective inference for group-sparse linear models. In Advances in Neural Information Processing Systems, pp. 2469–2477. Cited by: §1.
- Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology) 67 (2), pp. 301–320. Cited by: §A.2, Remark 1.
Appendix A Appendix
a.1 Detailed Proof for Lemma 2
From Equation (13), we can see that is a function of . For a real value , there exists such that for any real value in , all elements of remain the same signs with . Similarly, from Equation (14), we can see that is a function of . Then, for a real value , there exists such that for any real value in , all elements of are smaller than 1 in absolute value. Finally, by taking , we obtain the interval in which the active set and signs of lasso solution remain the same. The remaining task is to compute and .
We first show how to derive . From Equation (13), we have
To guarantee and have the same signs,
For a specific , we consider the following cases:
If , then .
If , then (This inequality always holds since the left hand side is positive while the right hand side is negative).
If , then .
If , then .
If , then .
Finally, for satisfying the condition in Equation (21),
We next show how to derive . From Equation (14), we have
To guarantee ,
For a specific , we have the following cases:
If , then .
If , then .
Note that the first inequalities of the above two cases always hold since the left hand side is negative while the right hand side is positive). Then, for satisfying the condition in Equation (22),
Finally, we can compute by taking .
a.2 Extension to Elastic Net
In some cases, the lasso solutions are unstable. One way to stabilize them is to add an penalty to the objective function, resulting in the elastic net Zou and Hastie . Therefore, we extend our proposed method and provide detailed derivation for testing the selected features in elastic net case. We now consider the optimization problem with parametrized response vector for as follows
For any in , the optimality condition is given by
Similar to lasso case, to construct the truncation region , we have to 1) compute the entire path of in Equation (23), and 2) identify a set of intervals of on which .
Let us consider two real values and . If and have the same active set and the same signs, then we have