As machine learning (ML) is being applied to a greater variety of practical problems, ensuring the reliability of ML is recognized as becoming increasingly important. Among several potential approaches to reliable ML,conditional selective inference (SI) is recognized as a promising approach for evaluating the statistical reliability of data-driven hypotheses selected by ML methods. The basic idea of conditional SI is to make inference on a data-driven hypothesis conditional on the selection event that the hypothesis is selected by analyzing the data with the ML algorithm. Conditional SI has been actively studied especially in the context of feature selection. Notably, Lee et al.  and Tibshirani et al.  proposed conditional SI methods for exact conditional inference on selected features by using Lasso and stepwise feature selection (SFS), respectively. The basic idea of these conditional SI methods is to characterize the hypothesis selection event by a polytope, i.e., a set of linear inequalities, in the sample space. When a hypothesis selection event is characterized by a polytope, the authors in these studies developed practical methods for making inference on the selected hypotheses by deriving the exact (non-asymptotic) sampling distribution conditional on the polytope. In this paper, we call conditional SI based on a polytope polytope-based SI. These studies are regarded as significant advance in the field of statistical inference since traditional statistical inference cannot cope with hypotheses selected after observing the data.
Unfortunately, however, polytope-based SI has several limitations because it can be used only when the characterization of all the relevant selection events is represented by a polytope. In fact, in most of the existing polytope-based SI studies, extra-conditioning is required in order for the selection event to be characterized as a polytope. For example, in the SI for SFS method of Tibshirani et al. , the authors needed to consider conditioning not only on the selected features but also on extra event regarding the signs and the orders in the feature selection process. Such extra-conditioning leads to loss of power in the inference .
In this paper, we go beyond polytope-based SI and propose a novel SI method by using homotopy continuation for conditional inference on the features selected by SFS. We call the proposed method homotopy-based SI. Compared to the polytope-based SI for SFS method of 
, the proposed method is more powerful and more general. The basic idea of homotopy-based SI is to use the homotopy continuation approach to keep track of the hypothesis selection event when the data changes in the direction of the selected test statistic, which enables efficient identification of the subspace of the sample space in which the same hypothesis is selected. We demonstrate the effectiveness of the proposed homotopy-based SI method through intensive simulations and real data analyses.
A fundamental issue in current data analysis with more increasingly complicated datasets and algorithms is post-selection inference. Traditional statistical inference presumes that, before observing the data, we have decided on a statistical model and a statistical target for which we seek to conduct inference. Therefore, if we apply traditional statistical inference methods to hypotheses selected after observing the data, a selection bias is introduced and the inferential results are no longer valid. This issue has been extensively discussed in the context of post-feature-selection inference. In fact, even in commonly-used feature selection methods such as SFS, it has been difficult to correctly assess the statistical reliability of the selected features.
There have been several approaches suggested in the literature toward addressing this problem [3, 21, 20, 2, 30, 4, 23, 37]. A particularly notable approach that has received considerable attention in the past few years is conditional selective inference introduced in the seminal paper by Lee et al. . In their work, the authors showed that, for a feature selected by Lasso, the hypothesis selection event can be characterized as a polytope by conditioning on the selected set of features as well as extra events on their signs, and exact conditional inference can be conducted by exploiting the polyhedral selection event (polytope-based SI). Furthermore, Tibshirani et al.  showed that polytope-based SI is also applicable to SFS by conditioning on the set of selected features and by imposing some extra conditions regarding the signs and the orders in the feature selection process. Conditional SI has been actively studied in the past few years and exntended to various directions [12, 6, 39, 5, 15, 24, 25, 29, 40, 42, 32, 36, 9, 7].
Note that in conditional SIs, we typically prefer to condition on as little as possible , so that the resulting inference can be more powerful. Often, there are statistical reasons that make it necessary to condition , but other times the reasons are purely computational (e.g. conditioning on the signs and the orders in the SFS process). Namely, the main limitations of current polytope-based SI methods are that excessive conditioning is required to represent the selection event with a single polytope, and that selection events that cannot be represented with a polytope cannot be handled properly. In the seminal paper by Lee et al. , the authors already discussed the issue of over-conditioning, and explained how conditioning on signs can be omitted by an exhaustive enumeration of all possible signs and taking the union over the resulting polyhedra. However, such an exhaustive enumeration of exponentially increasing number of sign combination is feasible only when the number of selected features is fairly small. Several other approaches were proposed to circumvent the drawbacks and the restrictions of polytope-based SI. Loftus et al.  extended the polytope-based SI such that their method could handle selection events characterized by quadratic inequalities, but it inherits the over-conditioning issue in polytope-based SI. To improve the power, Tian et al.  proposed an approach to randomize the algorithm in order to condition on less. Terada et al.  proposed to use bootstrap re-sampling to characterize the selection event more generally. A drawback of these approaches is that additional randomness is introduced into the algorithm and/or the inference.
This study is motivated by Liu et al.  and Duy et al. . The former studied Lasso SI for full-model parameters, while the latter extended the basic idea of the former so that it can be also applied to Lasso SI in more general settings including inference on partial-model parameters (exactly same problem setup as ). These two studies go beyond the polytope-based SI for more powerful Lasso SI without conditioning on the signs by carefully analyzing the convex optimality conditions of the Lasso solutions. Unfortunately, in SI for SFS, we cannot use the optimality conditions of the solutions because SFS is not formulated as a convex optimization problem. Furthermore, in SI for SFS, we need to consider extra-conditioning not only on the signs but also on the order in the SFS process.
2 Problem Statement
We consider forward stepwise feature selection (SFS) method for regression problem. Let be the number of instances and be the number of original features. We denote the observed dataset as where is the design matrix and
be the response vector. Following the problem setup in existing conditional SI literature such as and , we assume that the observed response is a realization of the following random response vector
where is the unknown mean vector, and
is the covariance matrix which is known or estimable from independent data, and the design matrixis assumed to be non-random. For notational simplicity, we assume that each column vector of is normalized to have unit length.
Stepwise feature selection
We consider the standard forward SFS method as studied in  — at each step of the SFS method, the feature which most improves the fit is newly added. When each feature has unit length, it is equivalent to the feature which is most correlated with the residual of the least-square regression model fitted with previously selected features. For a response vector and a set of features , let be the residual vector obtained by regressing onto for a set of features , i.e.,
where is the -by-identity matrix, and . Let be the number of selected features by the SFS method111We discuss a situation where the number of selected features is selected by cross-validation in §3.4. In other parts of the paper, we assume that is determined before looking at the data.. When we need to clarify the fact that features are selected by applying -step SFS method to a response vector , we denote the set of selected features as . Similarly, we denote the sequence of selected feature at each step as for , In the SFS method, the selected feature at step is defined as
where and .
In order to quantify the statistical significance of the relation between the selected features and the response, we consider the statistical test for each of the coefficient of the selected model parameters
where is the matrix with the set of columns corresponding to . Note that the coefficient is written as by defining
where is a unit vector whose element is 1 and 0 otherwise. Note that depends on and , but we omit the dependence for notational simplicity. We consider the following statistical test for each of the coefficient
where is the element of the population least squares , i.e., the projection of onto the column space of .
Conditional Selective Inference (SI)
Since the target of the inference is selected by observing the data
, if we naively apply a traditional statistical test to the problem in eq:hypotheses as if the inference target is pre-determined, the result is not valid (type I error cannot be controlled at the desired significance level) due toselection bias. To address the selection bias issue, we consider conditional selective inference (SI) framework introduced in  and . In conditional SI framework, the inference is conducted based on the following conditional sampling distribution of the test-statistic:
is the nuisance parameters which is independent of the test statistic. The first condition in eq:condition_model indicates the event that the set of selected features obtained by applying the -step SFS method to a random response vector is the same as the ones for . The second condition indicates the nuisance parameters for a random response vector is the same as the one for the observed vector 222The corresponds to the component in the seminal paper (see , Sec 5, Eq 5.2 and Theorem 5.2)..
To conduct the conditional inference for eq:condition_model, the main task is to identify the conditional data space
Once is identified, we can easily compute the pivotal quantity
is the c.d.f. of the truncated Normal distribution with mean
, variance, and the truncation region . Later, we will explain how is defined in eq:pivotal_quantity is defined. The pivotal quantity is crucial for calculating
-value or obtaining confidence interval. Based on the pivotal quantity, we can obtainselective type I error or selective -value  in the form of
where . This -value is valid in the sense that
Furthermore, to obtain confidence interval for any , by inverting the pivotal quantity in (7), we can find the smallest and largest values of such that the value of pivotal quantity remains in the interval .
Characterization of the conditional data space
where , , and
Here, like , , and depend on and
, but we omit the subscripts for notational simplicity. Now, let us consider a random variableand its observation such that they respectively satisfy and . The conditional inference in (5) is re-written as the problem of characterizing the sampling distribution of
Since , follows a truncated Normal distribution. Once the truncation region is identified, the pivotal quantity in (7) is obtained as . Thus, the remaining task is reduced to the characterization of .
Extra-conditioning in existing conditional SI methods
Unfortunately, it has been considered computationally infeasible to fully identify the truncation region in conditional SI for SFS method. Therefore, in the existing conditional SI studies such as , the authors circumvent the computational difficulty by overly conditioning with extra-conditions. Note that over-conditioning is not harmful for selective type-I error control, but it has been known that over-conditioning leads to the loss of power in conditional SI . In fact, the decrease in the power due to over-conditioning is not unique issue for SFS in , but is a common major issue in many existing conditional SIs . In the next section, we propose a method to overcome the difficulty by removing the extra-conditions for minimumly-conditioned SI for SFS.
3 Proposed Method
As we discussed in §2, to conduct the conditional SI, the truncation region in eq:truncation_region_z must be identified. To construct , our idea is 1) computing for all , and 2) identifying the set of intervals of on which . However, it seems intractable to obtain for infinitely many values of . To overcome the difficulty, we combine two approaches called extra-conditioning and homotopy continuation.
Figure 1 shows the schematic illustration of the proposed method. Our idea is motivated by regularization path of Lasso [28, 10], SVM  and other similar methods [31, 1, 31, 41, 18, 33, 35, 17, 14, 16, 27, 34] in which the solution path along the regularization parameter can be computed by analyzing the KKT optimality conditions of the parametrized convex optimization problems. Although SFS cannot be formulated as a convex optimization problem, by introducing the notion of extra-conditioning, we note that conceptually similar approach as homotopy continuation can be used to keep track all possible changes of the selected features when the response vector changes along the direction of the test-statistic.
First, we consider not only the feature selection event but also an extra event . In conditional SI for SFS method, the extra event consists of two types of information regarding orders and signs which are necessary for characterizing the process of SFS method. We use a symbol for the former and for the latter. Like the notation , when we need to clarify the fact that they are selected by applying -step SFS method to a response vector , we denote them as and , respectively. Thus, the extra event is denoted as .
The former contains the information on the order in which the features are selected. Concretely, we define to be the selected permutation of the set of selected features . By combining and , the history of how features have been selected so far, i.e., and , can be fully characterized.
The latter contains the information on the sign when each feature is entered to the model. Concretely, we define to be a binary vector with length whose element is defined as
which indicates the correlation between the selected feature at step and the residual vector after steps.
The following lemma tells that, by conditioning not only on the feature selection event but also on the extra events on sign and order, the over-conditioned truncation region can be simply represented as an interval in the line .
For a response vector , the over-conditioned truncation region defined as
is an interval in .
By conditioning on the feature selection event and extra event, the triplet is fixed as
From the first two conditions,
Therefore, we have
By further conditioning on the sign information, the condition is written as
By restricting on a line , the range of is written as
for . ∎
Note that the original conditional SI for SFS in  exactly considered only the over-conditioning case in the above lemma. The polytope-based SI is applicable to SFS method only with the extra conditioning . Although it is known that the over-conditioning leads to loss of power , it has been recognized to be computationally infeasible to remove the extra conditioning . In the following subsection, we overcome this computational difficulty by homotopy continuation approach.
3.2 Homotopy Continuation
Consider all possible triplets for all possible response vectors in a line . Clearly, the number of such triplets is finite. With a slight abuse of notation, we index each of the triplets by and denote it such as where is the number of the triplets. Lemma 1 indicates that each triplet corresponds to an interval in the line. Without loss of generality, we assume that the left most interval corresponds to , the second left most interval corresponds to , and so on. Then, using an increasing sequence of real numbers , we can write these intervals as . In practice, we do not have to consider the entire line , but it suffices to consider, e.g., whereand in our implementation and consider only the set of the intervals within the range.
Our simple idea is to compute all possible triplets by keeping track of the intervals one by one, and then to compute the truncation region by collecting the intervals in which the set of selected features is the same as the set of actually selected features from the observed data , i.e.,
We call breakpoints. We start from applying the SFS method to the response vector and obtain the first triplet . Then, the next breakpoint is obtained by eq:lemm1_c with . Then, the second triplet is obtained by applying the SFS method to the response vector where is a small value such that for all . This process is repeated until the next breakpoint becomes greater than .
In this section, we present the detail of the proposed method. In Algorithm 1, we apply -step SFS method to the observed dataset , and obtain the set of selected features . Then, for each feature , we first compute the direction of interest by (3). Next, for each feature , we compute the truncation region by the homotopy continuation. Note that are different among different since the direction of interest depends on . This task can be done by Algorithm 2. Finally, after having the truncation region , we can compute selective -values and selective confidence intervals. In Algorithm 2, multiple breakpoints are computed one by one. The algorithm is initialized at . At each , the task is to find the next breakpoint , at which their is a change either in the set of selected features or orders or signs . This step is repeated until .
3.4 Selecting the number of steps by cross-validation
In this section, we introduce a method for SI conditional also on the selection of the number of selected features via cross-validation. Consider selecting the number of steps in the SFS method from a given set of candidates where is the number of candidates. Based on the cross-validation on the observed dataset , suppose that is selected as the best one. The test-statistic for the selected feature when applying the SFS method with steps to is then defined as
The conditional data space in (9) with the event of selecting is then written as
where . The truncation region can be obtained by the intersection of the following two sets:
Since the former can be obtained by using the method described above, the remaining task is to identify the latter .
For notational simplicity, we consider the case where the dataset is divided into training and validation sets, and the latter is used for selecting . The following discussion can be easily extended to cross-validation scenario. Let us re-write
With a slight abuse of notation, for , let be the set of selected features by applying -step SFS method to . The validation error is then defined as
where . Then, we can write
Since the validation error in eq:cv_error is a picecewise-quadratic function of , we have a corresponding picecewise-quadratic function of for each . The truncation region can be identified by the intersection of the intervals of in which the validation error corresponding to is minimum among a set of picecewise-quadratic functions for all the other .
Loftus  already discussed that it is possible to consider cross-validation event into conditional SI framework. However, his method is highly over-conditioned in the sense that in each single run of a SFS method in the entire process of cross-validation, extra-coniditioning on orders and signs are required. Our method described above is minimumly-conditioned SI in the sense that our inference is conducted based exactly on the conditional sampling distribution of the test-statistic in eq:cv_error without any extra conditions.
In the section, we demonstrate the performance of the proposed homotopy-based SI. We executed the experiment on Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz. We show the false positive rates (FPRs), true positive rates (TPRs) and confidence intervals (CIs) for the following cases of conditional SI:
Active: only conditioning on the active set.
ActiveSign: conditioning on the active set and signs.
ActiveOrder: conditioning on the active set and the sequential order that the feature enters the active set.
ActiveSignOrder: conditioning on the active set, signs, and sequential order, which is exactly same as the polytope-based SI in .
We also show the FPRs, TPRs, and CIs of data splitting (DS) method, which is the commonly used procedure for the purpose of selection bias correction. In this approach, the data is randomly divided in two halves — first half used for model selection and the other for inference.
We set the significance level for all the experiments. We generated outcomes as , where in which . We set for FPR and TPR experiments, and was respectively set to and . We ran 100 trials and we repeated this experiments 20 times. For CI experiments, we set and .
The results of FPR and TPR are shown in Figure 2 and 2. In all five cases, the FPRs are properly controlled under the significance level . Regarding the TPR comparison, it is obvious that Active has the highest power since we only condition on the selected features. Figure 2 shows the demonstration of CI for each selected feature. Note that we only show commonly selected features among all five cases of conditional inference. The lengths of CI obtained by Active are almost the shortest. We repeated this experiment 100 times and showed the dot plot of the lengths of the confidence intervals in Figure 2. In summary, the CI results are consistent with the TPR results. In other words, the shortest length of CI in the case of Active indicates that Active has the highest power.
Especially, we also demonstrate the TPRs and the CIs between the cases when is fixed and is selected from the set , or using -fold cross-validation. We set , only the first elements of was set to , and all the rest were set to . We show that the TPR tends to decrease when increasing the size of in Figure 3. This is due to the fact that when we increase the size of , we have to condition on more information which leads to shorter truncation interval and results low TPR. The TPR results are consistent with the CI results shown in Figure 4 in which the length of CI is longer when increasing the size of .
Next, we demonstrate the computational efficiency of the proposed method. We show the result of comparing the computational time between the proposed method and the existing method using artificial dataset in Figure 5. For the existing studies, if we want to keep high statistical power, we have to consider a huge number of all the signs and order patterns which is unrealistic. With the proposed method, we are able to significantly reduce the computational cost while keeping high power.
We also show the violin plot of the actual number of interval of the test statistic that involves in the construction of truncated sampling distribution using artificial dataset in Figure 6. In this case, the number of polytopes intersecting the line that we need to consider is larger than in the Lasso case  because we have conditioned not only the signs but also the order to derive each polytope, but it is much smaller than . When is large, because the ordering constraint is too strict, so the number of intervals of satisfying the condition including the order (ActiveOrder) becomes very small even after removing the sign constraint.
We also show the computational time for using the proposed method on artificial dataset and the high dimensional real-world bioinformatics related datasets in Figure 7 and Table 1. The real-world datasets are available at http://www.coepra.org/CoEPrA_regr.html.
In this paper, we proposed a more powerful and general conditional selective inference method for stepwise feature selection method. We resolve the over-conditioning issue in existing approach by introducing homotopy continuation approach. The experimental results indicate that the proposed homotopy-based approach is more powerful and computationally efficient.
This work was partially supported by MEXT KAKENHI (20H00601, 16H06538), JST CREST (JPMJCR1502), RIKEN Center for Advanced Intelligence Project, and RIKEN Junior Research Associate Program.
Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research 7, pp. 1713–41. Cited by: §3.
-  (2009) Selective inference in complex research. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367 (1906), pp. 4255–4271. Cited by: §1.
-  (2005) False discovery rate–adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association 100 (469), pp. 71–81. Cited by: §1.
-  (2013) Valid post-selection inference. The Annals of Statistics 41 (2), pp. 802–837. Cited by: §1.
Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, pp. 1–12. Cited by: §1.
-  (2017) Selecting the number of principal components: estimation of the true rank of a noisy matrix. The Annals of Statistics 45 (6), pp. 2590–2617. Cited by: §1.
Quantifying statistical significance of neural network representation-driven hypotheses by selective inference. arXiv preprint arXiv:2010.01823. Cited by: §1.
-  (2020) Parametric programming approach for powerful lasso selective inference without conditioning on signs. arXiv preprint arXiv:2004.09749. Cited by: §1, §4.
-  (2020) Computing valid p-value for optimal changepoint by selective inference using dynamic programming. arXiv preprint arXiv:2002.09132. Cited by: §1.
-  (2004) Least angle regression. Annals of Statistics 32 (2), pp. 407–499. Cited by: §3.
-  (2014) Optimal inference after model selection. arXiv preprint arXiv:1410.2597. Cited by: §1, §1, §2, §2, §2, §3.1.
-  (2015) Selective sequential model selection. arXiv preprint arXiv:1512.02565. Cited by: §1.
The entire regularization path for the support vector machine. Journal of Machine Learning Research 5, pp. 1391–415. Cited by: §3.
-  (2011) Clusterpath: an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, pp. 745–752. Cited by: §3.
-  (2018) Post-selection inference for changepoint detection algorithms with application to copy number variation data. arXiv preprint arXiv:1812.03644. Cited by: §1.
-  (2012) Multi-parametric solution-path algorithm for instance-weighted support vector machines. Machine Learning 88 (3), pp. 297–330. Cited by: §3.
-  (2010) Nonlinear regularization path for quadratic loss support vector machines. IEEE Transactions on Neural Networks 22 (10), pp. 1613–1625. Cited by: §3.
-  (2007) The one class support vector machine solution path. In Proc. of ICASSP 2007, pp. II521–II524. Cited by: §3.
-  (2016) Exact post-selection inference, with application to the lasso. The Annals of Statistics 44 (3), pp. 907–927. Cited by: §1, §1, §1, §1, §2, §2, §2, §2, footnote 2.
-  (2006) Can one estimate the conditional distribution of post-model-selection estimators?. The Annals of Statistics 34 (5), pp. 2554–2591. Cited by: §1.
-  (2005) Model selection and inference: facts and fiction. Econometric Theory, pp. 21–59. Cited by: §1.
-  (2018) More powerful post-selection inference, with application to the lasso. arXiv preprint arXiv:1801.09037. Cited by: §1, §2.
-  (2014) A significance test for the lasso. Annals of statistics 42 (2), pp. 413. Cited by: §1.
-  (2014) A significance test for forward stepwise model selection. arXiv preprint arXiv:1405.3920. Cited by: §1.
-  (2015) Selective inference in regression models with groups of variables. arXiv preprint arXiv:1511.01478. Cited by: §1, §1.
-  (2015) Selective inference after cross-validation. arXiv preprint arXiv:1511.08866. Cited by: §3.4.
-  (2013) Infinitesimal annealing for training semi-supervised support vector machines. In International Conference on Machine Learning, pp. 897–905. Cited by: §3.
-  (2000) A new approach to variable selection in least squares problems. IMA journal of numerical analysis 20 (3), pp. 389–403. Cited by: §3.
-  (2016) Bayesian post-selection inference in the linear model. arXiv preprint arXiv:1605.08824 28. Cited by: §1.
-  (2010) Confidence sets based on penalized maximum likelihood estimators in gaussian regression. Electronic Journal of Statistics 4, pp. 334–360. Cited by: §1.
-  (2007) Piecewise linear regularized solution paths. Annals of Statistics 35, pp. 1012–1030. Cited by: §3.
-  (2017) Selective inference for sparse high-order interaction models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3338–3347. Cited by: §1.
Nonparametric conditional density estimation using piecewise-linear solution path of kernel quantile regression. Neural Computation 21 (2), pp. 539–559. Cited by: §3.
-  (2013) Parametric task learning. Advances in Neural Information Processing Systems 26, pp. 1358–1366. Cited by: §3.
-  (2011) Target neighbor consistent feature weighting for nearest neighbor classification. In Advances in neural information processing systems, pp. 576–584. Cited by: §3.
-  (2020) Computing valid p-values for image segmentation by selective inference. Cited by: §1.
-  (2014) Post-selection adaptive inference for least angle regression and the lasso. arXiv preprint arXiv:1401.3889 354. Cited by: §1.
-  (2019) Selective inference after variable selection via multiscale bootstrap. arXiv preprint arXiv:1905.10573. Cited by: §1.
-  (2018) Selective inference with a randomized response. The Annals of Statistics 46 (2), pp. 679–710. Cited by: §1, §1.
-  (2016) Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association 111 (514), pp. 600–620. Cited by: §1, §1, §1, §1, §2, §2, §2, §2, §3.1, §4.
-  (2007) Entire regularization paths for graph data. In In Proc. of ICML 2007, pp. 919–925. Cited by: §3.
-  (2016) Selective inference for group-sparse linear models. In Advances in Neural Information Processing Systems, pp. 2469–2477. Cited by: §1.