1 Introduction
We consider post-selection inference in high-dimensional (generalized) linear models. Statistical inference in high-dimensional models is challenging: the main methods in a frequentist setting use bias-corrected estimators of the Lasso (Zhang and Zhang, 2014; van de Geer et al., 2014; Javanmard and Montanari, 2014) or of Ridge regression (Bühlmann, 2013), and Cai and Guo (2017) provide refined optimality results for such techniques. On the other hand, post-selection inference provides a very different approach for constructing confidence statements in high-dimensional models. Post-selection inference is attractive as it is closer, in some vague sense, to what practitioners do, namely to first apply some model selection in order to restrict the set of covariates and make the problem feasible. Post-selection inference had long been viewed as rather ill-posed (Leeb and Pötscher, 2003) until Berk et al. (2013) provided a conservative approach that changed the negative picture. More recent work by Fithian et al. (2014), Tian and Taylor (2018), Taylor and Tibshirani (2018) and others has led to interesting new inferential tools. The current work builds on those contributions.

The instability of post-selection inference.
Post-selection inference deals with the problem of inference statements after having selected a set of covariates using a data-driven algorithm or method. For post-selection inference in high-dimensional (generalized) linear models, a very popular model selection method is the Lasso (Tibshirani, 1996); and in fact, in this work, we only focus on the Lasso as model selector. Among the main concerns when using the Lasso or any other variable selection method, which massively reduces the number of original covariates to a much smaller number of selected ones, is its instability. The selected model, e.g. by the Lasso, has a low degree of replicability due to its instability arising from correlated covariates and/or high-noise scenarios. Thus, the inference after model selection might be very non-replicable if the model selector leads to different results for small perturbations of the data, e.g. simply being new realizations from the same data generating process. Our new multicarving proposal is a possible remedy to make post-selection inference more replicable.
A variety of approaches to get valid tests and confidence intervals after model selection have been developed. In order to put our proposal in some context, we discuss briefly the ones most relevant to our work in the following.
A simple approach for valid inference is to split the data into two parts and use the first half for selection and the second half for inference (Wasserman and Roeder, 2009). Thus, the idea is very similar to any validation scheme using data splitting.
This simple single data splitting method has certain drawbacks. Since splitting the data is a random process, the inference statements change if a different split is chosen. If we repeat this process multiple times, we observe that the obtained p-values per predictor change a lot: Meinshausen et al. (2009) call this phenomenon the "p-value lottery". For the Lasso selector, this is especially accentuated as it is highly non-stable and potentially selects quite different models depending on the split. Therefore, results obtained through this method are not replicable at all unless one fixes the split. In order to receive more stable and replicable p-values, Meinshausen et al. (2009) suggest splitting the data multiple times, say, B times, leading to B p-values for each predictor, one per split. The p-values per predictor are aggregated using quantile functions and adequate correction terms. Although there is still randomness involved, the results should become more stable with increasing B in the spirit of the law of large numbers. This technique is referred to as multisplitting.
To avoid confusion, we save the term post-selection inference for techniques that perform inference on the same data as used for selection and refer to the methods from Wasserman and Roeder (2009) and Meinshausen et al. (2009) as (multi)splitting. Post-selection inference for a (generalized) linear model can be achieved by calculating or simulating a constrained null distribution, where the constraints reflect the selected model.
Lee et al. (2016) analyse the case of Lasso selection in a linear model. They show that the Karush-Kuhn-Tucker (KKT) conditions, which are necessary conditions for the Lasso solution, lead to a polyhedral constraint on the observed response vector. Using this constraint, they derive a truncated normal distribution which allows for valid inference. A drawback of this method is a loss in power introduced by those polyhedral constraints. Similar constraints have been derived in Tibshirani et al. (2016) for sequential regression problems: compared to Lasso selection for a fixed value of λ, those constraints increase in dimensionality rather quickly, since every step of the procedure results in additional constraints.

Somewhere in between data splitting and post-selection inference is a technique called data carving (Fithian et al., 2014). In order to distinguish data carving from methods as in Lee et al. (2016), we will refer to the latter as pure post-selection inference in the following. Due to the loss in power introduced by pure post-selection inference, Fithian et al. (2014) prefer not to use all observations for the selection process. Further, they prove that completely discarding the fraction of data used for model selection in the inference stage leads to inadmissible tests. Instead, one should use as much information of the selection data as is still usable and should only discard the information that was actually needed to obtain the given selection. This means that one "carves" the data. One can reuse the selection constraints introduced for pure post-selection inference but impose them on the selection data only. This method outperforms pure post-selection inference and simple sample splitting with respect to power. However, it is computationally much more involved under certain model assumptions. Naturally, pure post-selection inference can be seen as a special case of data carving, and Fithian et al. (2014) refer to it as Carve.
Berk et al. (2013) introduce an inference technique that is valid given any preceding model selection procedure. This is possible by using the so-called PoSI (post-selection inference) constant K. This constant is defined as the minimal value such that the maximal absolute t-statistic, maximized over all possible predictor variables and submodels, is at most equal to K with probability at least 1 − α. The advantage of this method is that it leaves all freedom to the selection process without losing validity. For example, visual inspection of the data by a human, which in practice often happens, is allowed. On the other hand, this method is quite conservative by construction. Furthermore, calculating the constant gets computationally involved, such that the authors only suggest using their method for small numbers of covariates. Despite the nice theoretical framework, the method is not suited for high-dimensional statistics, which is our focus.
1.1 Relation to other work and contribution
There exists a broad literature which builds on the simple idea of data splitting in high-dimensional statistics. Especially, Meinshausen et al. (2009) have argued that multiple splits are to be preferred over a single split, and Fithian et al. (2014) have introduced the idea of data carving, which leads to more powerful tests after using a subset of the data for model selection. Both emphasize certain drawbacks of single-splitting and show how to overcome them. Therefore, in our work, we focus on how to optimally combine those improvements. Since we focus on selection using the Lasso, many of our results also depend on Lee et al. (2016), who derived the polyhedral constraints for the case of the Lasso as selector.
We further elaborate two more extensions of data carving in a linear model that can be combined with multisplitting in the same fashion. The first one concerns group testing. There are many developments in high-dimensional statistics for testing groups of covariates for significance instead of single covariates, see for example van de Geer and Stucky (2016), Mitra and Zhang (2016), and Guo et al. (2019). Group tests are of particular use as with many (highly correlated) covariates, it might be overly ambitious to correctly detect the individual active variables, whereas groups of variables might be more realistic to detect. Hierarchical testing schemes are particularly attractive for this task; see for example Mandozzi and Bühlmann (2016) and Renaux et al. (2020).
Secondly, we provide extensions of multicarving to generalized linear models. Pure post-selection inference in logistic linear regression is discussed in Taylor and Tibshirani (2018), who rely on asymptotic Gaussianity. As for the linear model, pure post-selection inference is suboptimal regarding power; thus, we extend their argument to the data carving approach. We only provide a detailed discussion for the case of logistic linear regression. However, similar adjustments could be done for other generalized linear models.

2 Methodology for high-dimensional post-selection inference
We first consider the methodological framework for linear models and summarize multisplitting (Section 2.2.1) as well as data carving (Section 2.2.2). This serves as a basis to develop our novel multicarving procedure for single covariates (Section 2.3) and an extension to group inference (Section 2.5) and logistic regression or other generalized linear models (Section 2.6). While those developments focus on hypothesis testing, we discuss confidence intervals in Section 2.4.

2.1 High-dimensional linear model and inference for single variables
We assume to have a response vector y ∈ R^n and a (fixed) design matrix X ∈ R^{n×p}, where possibly p ≫ n. This yields a linear model of the form
(1) y = Xβ^0 + ε,
where ε consists of i.i.d. N(0, σ^2) entries with known or unknown variance σ^2 and β^0 ∈ R^p is the unknown parameter of interest. We represent vectors in boldface, whereas scalars and matrices are written in usual letters. We write y for a given realization of the random vector Y. We use index A (X_A, y_A and n_A) and index B (X_B, y_B and n_B) to denote selection data and data used for inference only, respectively. Further, we assume that the active set S = {j : β^0_j ≠ 0} is sparse, i.e. s_0 = |S| is small enough that inference using ordinary least squares would be possible on the B data if the true active set were known.
After data-driven model selection, we deal with a subset Ŝ ⊆ {1, …, p} of size |Ŝ| < n. We aim to perform inference based on this subset Ŝ. We write X_Ŝ for the matrix X restricted to the selected columns. Likewise, X_{A,Ŝ} and X_{B,Ŝ} denote selection and inference data restricted to the selected columns. Generally, a distinction has to be made whether we test
(2) H_{0,j}: β^0_j = 0
for the entries of the full parameter vector β^0, or if the test is made with respect to
(3) H_{0,j}^Ŝ: β_j^(Ŝ) = 0.
Here, β^(Ŝ) corresponds to the selected submodel and is defined as
(4) β^(Ŝ) = X_Ŝ^+ E[y],
the best linear predictor in the given model. We write X_Ŝ^+ for (X_Ŝ^T X_Ŝ)^{-1} X_Ŝ^T, i.e. the generalized inverse of X_Ŝ. We introduce corresponding null hypotheses for groups of variables in Section 2.5.
Typically, an inference statement for (2) would be more favourable, since we are interested in the true underlying model. However, tests for (3) are valid under weaker assumptions.
Of particular interest is the screening property. Screening is defined as S ⊆ Ŝ, or in words, screening asks for all active variables being part of the selected model. If this holds, we have β^(Ŝ) = β^0_Ŝ. Thus, tests valid for (3) are also unbiased for (2) assuming screening. Importantly, screening is a requirement on the initial model selection process and not on the following inference calculation.
We focus on model selection using the Lasso. The screening property for the Lasso is rather delicate to achieve in the finite-sample case. However, it can be guaranteed with probability converging to 1 as n → ∞ under adequate conditions. Such conditions are discussed in Meinshausen and Bühlmann (2006), Meinshausen and Yu (2009) and Bickel et al. (2009); see also the book by Bühlmann and van de Geer (2011).
2.2 Previously proposed methods
We first review some earlier work which serves as a basis for our new proposal in Section 2.3.
2.2.1 Multisplitting for inference
In this section, we briefly summarize the multisplitting method introduced in Meinshausen et al. (2009). Multisplitting works as follows.
For each b ∈ {1, …, B}:
1. Randomly split the data into two disjoint groups A and B of sizes n_A and n_B.
2. Find the selected set Ŝ_b using X_A and y_A.
3. For j ∈ Ŝ_b, calculate the p-value p_j^(b) using X_B and y_B with ordinary least squares; for j ∉ Ŝ_b, set p_j^(b) = 1.
4. Adjust the p-values to P_j^(b) = min(|Ŝ_b| · p_j^(b), 1) to correct for multiplicity using a Bonferroni adjustment.
The fourth step is designed to control the family-wise error rate (FWER). Throughout this work, we use lower-case letters (p) for raw p-values that result from a test and upper-case letters (P) for p-values resulting from any correction or aggregation. The default is to split the data into two halves of equal size. It remains to aggregate the B p-values per covariate j. Valid aggregation is possible by using a quantile of fixed fraction γ ∈ (0, 1) as
(5) Q_j(γ) = min{ 1, q_γ({ P_j^(b) / γ ; b = 1, …, B }) },
with q_γ(·) being the empirical γ-quantile function. Since a good choice of γ might not be known a priori, one can also optimize over a range γ ∈ (γ_min, 1), where γ_min ∈ (0, 1). This yields a new p-value
(6) P_j = min{ 1, (1 − log γ_min) · inf_{γ ∈ (γ_min, 1)} Q_j(γ) }.
The additional factor (1 − log γ_min) corrects for optimizing over all possible quantiles. A typical choice is γ_min = 0.05, yielding a correction factor of approximately 4.
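The two aggregation rules above can be written down compactly. The following is a minimal stdlib-only sketch (the grid over γ and the index convention for the empirical quantile are implementation choices, not prescribed by the paper):

```python
import math

def quantile_aggregate(pvals, gamma):
    """Fixed-quantile aggregation as in (5):
    Q(gamma) = min(1, q_gamma({p_b / gamma; b = 1..B}))."""
    scaled = sorted(p / gamma for p in pvals)
    k = max(1, math.ceil(gamma * len(scaled)))  # index of the empirical gamma-quantile
    return min(1.0, scaled[k - 1])

def aggregate_over_gammas(pvals, gamma_min=0.05, grid=100):
    """Adaptive aggregation as in (6): optimize over gamma in (gamma_min, 1)
    and pay the (1 - log(gamma_min)) correction factor."""
    gammas = [gamma_min + (1.0 - gamma_min) * i / grid for i in range(1, grid + 1)]
    best = min(quantile_aggregate(pvals, g) for g in gammas)
    return min(1.0, (1.0 - math.log(gamma_min)) * best)
```

With γ_min = 0.05, the correction factor is 1 − log 0.05 ≈ 4, so an adaptively aggregated p-value can never be smaller than 4 times the best fixed-quantile value found on the grid.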
Without any screening assumption, those p-values actually test the following null hypothesis for some given covariate j:
(7) H_{0,j}: for all b ∈ {1, …, B}: j ∉ Ŝ_b or β_j^(Ŝ_b) = 0.
Given two conditions, Meinshausen et al. (2009) derive asymptotic (for n → ∞) FWER control with respect to null hypothesis (2). The conditions are:
(A1) Screening: P[S ⊆ Ŝ_b for all b] → 1 as n → ∞.
(A2) Sparsity: P[|Ŝ_b| < n_B for all b] → 1 as n → ∞.
The screening condition, as argued before, leads to β^(Ŝ_b) = β^0_{Ŝ_b} and makes the inference statement valid for the true underlying parameter vector. The sparsity condition enables us to do least-squares inference, implicitly assuming that X_{B,Ŝ_b} has full column rank for all b.
If screening held in the finite-sample case as well, the error control could be formulated in a non-asymptotic sense. Although this is usually not the case, the simulations in Meinshausen et al. (2009) as well as ours show that multisplitting controls the type-I error with respect to (2) clearly better than single-splitting when screening cannot be guaranteed. This can be explained by the "p-value lottery": every split results in different p-values for the selected variables. There are chances that some truly non-active variables are significant for some splits. After aggregation, only variables that are significant in a decent number of splits remain significant overall. Due to the variability of these p-values over different splits, chances are that fewer non-active variables get rejected after aggregation than in the average single split. Thus, multisplitting leads to better error control.

2.2.2 Data carving
In this section, we discuss the idea of data carving introduced in Fithian et al. (2014). We focus on the special case of the linear model (1) with Lasso selection, which we will later extend to logistic regression and other generalized linear models. We emphasize that they provide a theoretical framework that could be applied to a much broader spectrum of problems.
The main conceptual idea of data carving is summarized in the following statement (Fithian et al., 2014): "The answer must be valid, given that the question was asked." Thus, one should control the selective type-I error rate
(8) P_{H_0}[ reject H_0 | H_0 is tested ] ≤ α.
The hypothesis H_0 is a general notation for a hypothesis as e.g. in (3). Define the event A_Ŝ = { Ŝ(y_A) = Ŝ }, the selection event using data A. Then, the requirement (8) can be equivalently stated as
(9) P_{H_0}[ reject H_0 | A_Ŝ ] ≤ α.
Simple data splitting, on the other hand, controls the error
P_{H_0}[ reject H_0 | y_A ]
at level α. Thus, more conditioning is done than would theoretically be needed, since A_Ŝ does not contain all information about y_A but only guarantees that it resulted in the given selection event.
To perform inference controlling the error in (9), one needs to understand the distribution of y conditional on A_Ŝ. The first step is to understand the selection event A_Ŝ. We focus on our case of interest, inference in the linear model (1) using Lasso selection. More precisely, let Lasso selection be defined as
(10) β̂ = argmin_{β ∈ R^p} (1/2) ||y_A − X_A β||_2^2 + λ ||β||_1.
There exist different definitions of the Lasso that are equivalent after rescaling. We use definition (10) following Lee et al. (2016), where this selection event is fully characterised. The set of y_A that would lead to the same Ŝ forms a union of polyhedra in R^{n_A}. If we additionally condition on the signs ŝ of the selected parameters' Lasso estimates, this union is shrunk to a single polyhedron. Dealing with a single polyhedron is easier both computationally as well as from a theoretical perspective. Hereafter, we additionally condition on the signs at the price of a small loss in power. This single polyhedron can easily be described by linear inequality constraints, e.g. { A y_A ≤ b }. Those constraints can be split into "active" and "inactive" constraints, which define statistically independent events. Further, X_Ŝ^T y is independent of the inactive constraints, such that it is also independent of them while conditioning on the active constraints. Therefore, we can ignore the inactive constraints for inference purposes, which are based on X_Ŝ^T y. For simplicity, we refer to { A y_A ≤ b } as being the active constraints only.
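The active constraints can be made concrete. The sketch below builds the active polyhedron { A y ≤ b } from the Lasso KKT sign conditions following Lee et al. (2016); the small coordinate-descent solver is only a stand-in to produce a Lasso fit for the demonstration (the constants n, p, λ and the simulated data are illustrative choices):

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, iters=500):
    """Plain coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            beta[j] = soft(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])
    return beta

def active_constraints(X, lam, M, s):
    """Active polyhedral constraints {A y <= b} encoding sign consistency
    of the Lasso KKT conditions, as in Lee et al. (2016)."""
    XM = X[:, M]
    G = np.linalg.inv(XM.T @ XM)
    A = -np.diag(s) @ G @ XM.T
    b = -lam * np.diag(s) @ G @ s
    return A, b

# tiny demonstration on simulated data
rng = np.random.default_rng(0)
n, p, lam = 50, 8, 5.0
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + rng.standard_normal(n)
beta = lasso_cd(X, y, lam)
M = np.flatnonzero(beta != 0)          # selected set
s = np.sign(beta[M])                   # observed signs
A, b = active_constraints(X, lam, M, s)
ok = np.all(A @ y <= b + 1e-6)         # observed y lies in its own selection polyhedron
```

By the stationarity condition, β̂_M = (X_M^T X_M)^{-1}(X_M^T y − λ ŝ), so A y − b = −diag(ŝ) β̂_M, which is negative exactly when the estimated signs match ŝ.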
Fithian et al. (2014) elaborate how to handle the conditional distribution of y in a given model. As the distribution of y depends on unknown parameters, this conditional distribution is not tractable yet. To deal with this problem, one can treat the unknown parameters as nuisance parameters in an exponential family, which one can get rid of by conditioning accordingly. Generally, one has to decide between the "saturated model" and the "selected model":

Saturated model: the mean vector E[y] has n degrees of freedom and X_Ŝ β^(Ŝ) is the best linear predictor based on the selected model (cf. (4)).

Selected model: the mean vector E[y] = X_Ŝ β^(Ŝ) has |Ŝ| degrees of freedom and completely defines the distribution.

If we consider the saturated model, which includes more parameters than the selected model, more conditioning has to be done. This leads to a drop in power but with the advantage that tests are valid for (3) without any screening assumption. The selected-model view is generally more powerful, since less conditioning is done, but it needs stronger assumptions to hold. The existence of β^(Ŝ) with E[y] = X_Ŝ β^(Ŝ) is exactly the screening condition. If screening holds, either approach is valid to test (2). Since we are mainly interested in this null hypothesis, we focus on the selected model leading to more powerful tests under screening. In Section 2.4, we elaborate further on the saturated method and its advantages.
Consider the selected model. To perform inference for covariate j, one has to condition on X_{Ŝ∖{j}}^T y. After applying this conditioning, the random quantity of interest is independent of the unknown nuisance parameters β_k^(Ŝ), k ≠ j. This leads to a degenerate truncated multivariate Gaussian distribution with no more unknown nuisance parameters. The truncation is defined by the selection event. To test the null hypothesis, one further assumes β_j^(Ŝ) = 0. Thus, one is interested in
(11) p_j = P[ β̂_j ≥ β̂_j^obs | A y_A ≤ b, X_{Ŝ∖{j}}^T y ],
where β̂ denotes the least-squares estimator in the selected model. Note that we can use one-sided tests, since we implicitly condition on the sign of β̂_j by restricting ourselves to the single polyhedron. If we have selected a correct model such that the selected-model view is applicable, σ^2 is known, and j is not a truly active variable, then p_j is uniformly distributed on (0, 1). This null distribution is not easily tractable and thus the probability is hard to calculate. However, it can be sampled from using MCMC. This means that data carving achieves higher power compared to sample splitting at the price of a substantially higher computational cost. We present an applicable MCMC sampling scheme in Appendix B.
In the saturated viewpoint, only one degree of freedom remains after conditioning (cf. Section 2.4). Therefore, one can deal with a univariate truncated normal such that the exact probability, the analogue of (11), can be calculated efficiently using the CDF of a Gaussian. Thus, the trade-off between the selected and the saturated model also involves a computational component.
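The univariate computation in the saturated view reduces to a ratio of Gaussian CDFs. A minimal stdlib-only sketch, assuming the truncation interval [lo, hi] has already been derived from the selection event:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def truncnorm_sf(t, mu, sigma, lo, hi):
    """Upper-tail probability P(T >= t) for T ~ N(mu, sigma^2) truncated
    to [lo, hi]. In the saturated view, the one-sided p-value is this
    quantity with mu = 0 under the null and [lo, hi] from the
    polyhedral selection event."""
    z = lambda x: norm_cdf((x - mu) / sigma)
    denom = z(hi) - z(lo)
    return (z(hi) - z(t)) / denom
```

For a wide truncation interval the value approaches the usual Gaussian tail probability; a tight interval near the observed statistic is what causes the well-known loss of power of pure post-selection inference.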
So far, we assumed σ^2 to be known. If this is not the case, σ^2 could be handled as a further nuisance parameter, which is resolved by additionally conditioning on ||y||_2^2. However, this non-linear constraint disables some of the computational shortcuts which purely linear constraints allow for. In our simulations, we use some estimate σ̂^2 wherever the variance is assumed to be unknown and proceed as if it was known initially. For completeness, we mention that the distribution when additionally conditioning on ||y||_2^2 is not Gaussian anymore. The corresponding null distribution can still be approximated using a different MCMC sampling technique. Note that this is only possible for the selected model. In the saturated model, one would end up imposing one quadratic and n − 1 linear equality constraints on an n-dimensional vector. This would only leave two points to sample from, such that no inference is possible.
2.3 Novel multicarving for valid inference
Meinshausen et al. (2009) have theoretically argued and empirically shown that splitting several times and aggregating is to be preferred over a single-split approach. On the other hand, Fithian et al. (2014) have shown that discarding all selection data in a splitting setup is mathematically inadmissible and typically less efficient. To overcome this problem, they introduce the idea of data carving. Nevertheless, their approach potentially suffers from a similar p-value lottery as discussed in Meinshausen et al. (2009), since it is initiated by randomly splitting the data into two disjoint groups of given sizes, one for selection and inference, the other for inference only; therefore, it is often difficult to replicate.
Thus, we advocate the idea of applying data carving multiple times in order to a) overcome the p-value lottery and b) use tests that are theoretically admissible.
We use the following procedure:
For b ∈ {1, …, B}:
1. Randomly split the data into two disjoint groups A and B of sizes n_A and n_B.
2. Find the selected set Ŝ_b using X_A and y_A with Lasso selection.
3. For j ∈ Ŝ_b, calculate the p-value p_j^(b) for the given split and selected model according to (11); for j ∉ Ŝ_b, set p_j^(b) = 1.
4. Adjust the p-values to P_j^(b) = min(|Ŝ_b| · p_j^(b), 1) to correct for multiplicity using a Bonferroni adjustment.
As in multisplitting, we include the fourth step in order to control the FWER. Different corrections could be applied to obtain some less restrictive error control such as the false discovery rate (FDR), as discussed in Meinshausen et al. (2009). There is a trade-off involved in choosing n_A and n_B. The higher we set n_A, the higher the probability of screening gets, which is needed for valid tests. On the other hand, more power remains for the second stage, the inference calculation, for higher values of n_B. We empirically analyse this trade-off in our simulations in Section 4. To get one p-value per predictor, we use the same aggregation techniques as presented in Section 2.2.1, resulting in a single p-value Q_j(γ) or P_j. In our simulations, we focus on optimizing over the quantiles as described in (6) instead of using a fixed predefined quantile γ. To distinguish the different methods, we call this procedure multicarving and the method described in Section 2.2.2 singlecarving.
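The loop structure of the procedure can be sketched as follows. This is a sketch with two stand-ins, both labeled as such: `select` picks the k columns most correlated with the response (a real implementation would run the Lasso here), and the per-split inference is a plain OLS z-test on the held-out half, i.e. the splitting special case; true carving p-values would instead come from the constrained distribution in (11).

```python
import math
import numpy as np

def norm_sf(z):
    """Standard normal upper-tail probability."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def select(XA, yA, k=3):
    """Stand-in selector: the k columns most correlated with the response.
    A real implementation would run the Lasso on (XA, yA) here."""
    cors = np.abs(XA.T @ (yA - yA.mean()))
    return np.sort(np.argsort(cors)[-k:])

def per_split_pvalues(X, y, B=25, seed=0):
    """Multicarving skeleton: B random splits, selection on half A,
    Bonferroni-corrected per-split p-values from half B (normal
    approximation to the OLS t-test, standing in for carving)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    P = np.ones((B, p))
    for b in range(B):
        perm = rng.permutation(n)
        idx_A, idx_B = perm[: n // 2], perm[n // 2:]
        S = select(X[idx_A], y[idx_A])
        XB, yB = X[idx_B][:, S], y[idx_B]
        beta = np.linalg.lstsq(XB, yB, rcond=None)[0]
        resid = yB - XB @ beta
        sigma2 = resid @ resid / (len(idx_B) - len(S))
        se = np.sqrt(sigma2 * np.diag(np.linalg.inv(XB.T @ XB)))
        praw = [2.0 * norm_sf(abs(bj) / sj) for bj, sj in zip(beta, se)]
        P[b, S] = np.minimum(len(S) * np.array(praw), 1.0)  # Bonferroni within split
    return P

# illustrative simulated data: only column 0 is active
rng = np.random.default_rng(42)
n, p = 120, 10
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + rng.standard_normal(n)
P = per_split_pvalues(X, y)
```

The B×p matrix P is then aggregated per column via the quantile rules (5) and (6) to obtain one p-value per predictor.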
2.4 Saturated view and confidence intervals
Naturally, one wants to perform inference without the screening assumption. As mentioned in Section 2.2.2, we can use the saturated model from Fithian et al. (2014) for this. In the saturated view, we do not assume the selected submodel to completely define the mean parameter but only to approximate it as in (4). In order to get rid of the unknown parameters and create a tractable distribution, we have to condition on (I_n − P_{η_j}) y. Here, we define η_j = (X_Ŝ^+)^T e_j, leading to η_j^T y = β̂_j, and P_{η_j} = η_j η_j^T / ||η_j||_2^2. As I_n − P_{η_j} has rank n − 1, there remains only one degree of freedom after conditioning, namely in the direction of η_j. Therefore, one deals with a univariate truncated Gaussian where the truncation comes from invoking the selection event { A y_A ≤ b }. Inference statements can be calculated efficiently using the CDF of a univariate Gaussian. A detailed explanation of this procedure can be found in Lee et al. (2016).
This can be done regardless of the quality of the selected submodel. Therefore, the saturated viewpoint leads to valid tests for null hypotheses (3) (singlecarving) and (7) (multicarving) without any screening assumption. However, if screening fails, there is generally no j ∈ Ŝ with β_j^(Ŝ) = 0, and there cannot be any false positives with respect to those null hypotheses. Therefore, such tests for null hypotheses without any screening assumption are not of particular interest. Nevertheless, those tests can be used to determine confidence intervals. As for any test, confidence intervals for multicarving can be found by inverting it. Dezeure et al. (2015) give a detailed explanation of how to compute confidence intervals for multisplitting. This can be directly adopted to multicarving by calculating carving p-values but following the same scheme otherwise. For covariate j, this leads to a confidence interval CI_j such that
(12) P[ β_j^(Ŝ_b) ∈ CI_j for all b with j ∈ Ŝ_b ] ≥ 1 − α,
where the β^(Ŝ_b) are defined through (4). This is of particular interest as the β_j^(Ŝ_b) may differ over the splits b. Therefore, it appears natural to omit the screening assumption and to adopt the saturated model for our confidence intervals. Further, the use of the saturated model leads to more efficient computation.
We focus on two-sided confidence intervals for two reasons. First, having both a lower and an upper bound might be more informative for a practitioner. Second, β_j^(Ŝ_b) is not necessarily the same for all splits b in which covariate j is selected, such that combining different splits to a one-sided confidence interval is not appropriate. Thus, the confidence intervals are not the exact inversion of the hypothesis tests.
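The inversion step for a single split can be sketched concretely in the saturated view, where the test statistic is a univariate truncated Gaussian: one searches for the means at which the observed statistic sits exactly at the α/2 and 1 − α/2 quantiles. A stdlib-only sketch, assuming the truncation interval [lo, hi] is given, the observed statistic lies strictly inside it, and a simple ±6σ bisection bracket suffices:

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def trunc_cdf(t, mu, lo, hi, sigma=1.0):
    """CDF at t of N(mu, sigma^2) truncated to [lo, hi]."""
    z = lambda x: norm_cdf((x - mu) / sigma)
    return (z(t) - z(lo)) / (z(hi) - z(lo))

def invert_test(t_obs, lo, hi, alpha=0.05, sigma=1.0):
    """Two-sided CI for mu by test inversion: find mu with
    trunc_cdf(t_obs; mu) = 1 - alpha/2 (lower end) and = alpha/2
    (upper end). trunc_cdf is decreasing in mu, so bisection works."""
    def solve(target):
        a, b = t_obs - 6.0 * sigma, t_obs + 6.0 * sigma  # assumed bracket
        for _ in range(100):
            m = 0.5 * (a + b)
            if trunc_cdf(t_obs, m, lo, hi, sigma) > target:
                a = m
            else:
                b = m
        return 0.5 * (a + b)
    return solve(1.0 - alpha / 2.0), solve(alpha / 2.0)
```

With an essentially unbinding truncation, the interval reproduces the familiar ±1.96σ bounds; a binding truncation stretches the interval on the truncated side.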
Notably, if one were to apply simultaneous tests for different null hypotheses in the selected model, this could be done by just calculating a single MCMC chain and relying on the idea of importance sampling afterwards. However, to get a precise enough statement for such simultaneous tests, more MCMC samples might be required than for just calculating a pvalue such that this extra statement is not for free.
2.5 Extension to group testing
In a high-dimensional setup with potentially correlated predictors, finding individual active variables is often too ambitious. Especially, the Lasso selector struggles with distinguishing between two or more highly correlated variables. Therefore, one might prefer to test several variables as a group. We define the null hypothesis for a given group G ⊆ {1, …, p} as
(13) H_{0,G}: β^0_j = 0 for all j ∈ G
for the full model coefficients. Let G̃ = G ∩ Ŝ be the variables in our group that have been selected; then we define the null hypothesis in the selected model as
(14) H_{0,G̃}^Ŝ: β_j^(Ŝ) = 0 for all j ∈ G̃.
The practitioner often wants to test multiple groups or test groups in a hierarchical fashion, say, in a data-driven way. Of course, a multiplicity correction has to be applied, which is possible for any valid group test which controls the type-I error. We refer to Meinshausen (2008) for a detailed explanation of a hierarchical testing procedure and corresponding multiple-testing correction; see also Mandozzi and Bühlmann (2016) and Renaux et al. (2020).
2.5.1 (Multi)splitting for group inference
Groups of variables can be tested for significance by splitting the data in the same way as single variables. The extension to groups follows naturally as in the low-dimensional case by applying partial F-tests instead of t-tests. This can be done either with a single split or multiple splits using the previously mentioned aggregation techniques (5) and (6).

2.5.2 (Multi)carving for group inference
The above-mentioned (multi)splitting techniques for group inference suffer from the same inadmissibility issue as in the single-variable case, as more conditioning than necessary is applied. Therefore, we suggest a slight transformation of the data carving idea which makes it applicable to testing for group significance. We focus on the selected viewpoint, meaning that our derivation will actually only be valid if a correct model has been found. We emphasize that the saturated model could be extended to inference for groups with very similar adjustments.
Inference for a group follows the single-variable case closely. Firstly, note that the selection event is completely unchanged by the idea of testing group significance afterwards, as we still apply the Lasso for model selection. Thus, we can still invoke the selection event by conditioning on { A y_A ≤ b }. Based on Fithian et al. (2014), one can see that, after conditioning on X_{Ŝ∖G̃}^T y, the distribution does not depend on the remaining nuisance parameters, such that there are no more unknown parameters in our model under the null hypothesis (14). Due to this independence from the nuisance parameters, we can base the inference on X_{G̃}^T y or functions thereof. We advocate the use of the following test statistic:
t_G̃ = Σ_{j ∈ G̃} ŝ_j β̂_j.
In words, it is a directed sum of projections of y into all directions corresponding to the group variables. Including the signs ŝ in our test statistic is valid, since we additionally condition on having observed the parameters' signs for the sampling procedure. This additional conditioning is not mandatory for valid inference but simplifies the computation (cf. Section 2.2.2). The success of the sum can be intuitively justified as potentially no variable has a significant effect by itself, but the group as a whole could have.
We need to sample from the (approximate) null distribution to perform tests. As in the single variable case, the carving procedure leads to a Gaussian subject to linear equality and inequality constraints, which can be sampled from as presented in Appendix B with few adjustments.
In contrast to testing of single variables, the group problem remains multi-dimensional in the saturated view (for groups of size larger than one), as one conditions on all but the group directions. To sample from this saturated model, some more changes would be needed; especially, the conditioning in B.1 has to be adjusted, while B.2 has to be omitted.
With this group test at hand, further procedures could be derived such as applying the group test on several splits and aggregating them in the spirit of Section 2.3.
2.6 Extension to logistic regression
Not all data can be described and approximated well by the linear model given in (1). We extend the inference method to be applicable to generalized linear models and focus on logistic regression only in the following. Many of the ideas could carry over to other generalized linear models too, after applying the right transformations.
In logistic regression, we have a binary response vector y ∈ {0, 1}^n and some matrix of predictor variables X ∈ R^{n×p}. For every entry y_i of y, the probability of being 1 is modelled as
(15) π_i = P[y_i = 1] = exp(x_i^T β^0) / (1 + exp(x_i^T β^0))
for some unknown parameter vector β^0, the target of our inference. We denote the i-th row of X by x_i^T.
In a classical, low-dimensional setting with p < n, this would be fitted using the MLE or, equivalently, by minimizing the negative of the log-likelihood for an observation (y, X). The log-likelihood is defined as
ℓ(β; y, X) = Σ_{i=1}^n [ y_i x_i^T β − log(1 + exp(x_i^T β)) ].
The negative of the above formula can be minimized e.g. by using a Newton algorithm, which leads to solving an iteratively reweighted least squares (IRLS) problem as derived in Hastie et al. (2009). Starting with some initial estimate β̂^0, one iterates
β̂^{t+1} = β̂^t + (X^T W X)^{-1} X^T (y − π),
where we defined
π_i = exp(x_i^T β̂^t) / (1 + exp(x_i^T β̂^t)) and W = diag(π_1(1 − π_1), …, π_n(1 − π_n)).
Thus, in every step a weighted least-squares problem with weight matrix W, which iteratively changes, is solved. This explains the name of the procedure.
By further defining the adjusted design and response
X̃ = W^{1/2} X and ỹ = W^{1/2} (X β̂^t + W^{-1} (y − π)),
this can be reformulated as a usual least-squares problem (Dezeure et al., 2015):
β̂^{t+1} = argmin_β ||ỹ − X̃ β||_2^2.
In the low-dimensional case, Dezeure et al. (2015) suggest to perform the inference as if the final iterate fulfilled ỹ = X̃ β^0 + ε̃ with ε̃ ~ N(0, I_n). This approach is asymptotically valid because if this was the case, one would have
β̂ ~ N(β^0, (X^T W X)^{-1}),
which is the limiting distribution of the MLE. This can be seen by noting that the covariance matrix is the plug-in estimate of the inverse Fisher information.
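The iteration above can be implemented in a few lines. A minimal sketch on simulated data (the sample size, true coefficients and convergence check are illustrative choices); at the MLE, the score X^T(y − π) vanishes:

```python
import numpy as np

def irls_logistic(X, y, iters=25):
    """Newton / IRLS for the logistic MLE: each step solves a
    weighted least-squares problem in the adjusted response."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))          # fitted probabilities
        W = pi * (1.0 - pi)                      # diagonal of the weight matrix
        y_adj = eta + (y - pi) / W               # adjusted response
        # weighted least squares: beta = (X^T W X)^{-1} X^T W y_adj
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y_adj))
    return beta

# simulated example with an intercept and one covariate
rng = np.random.default_rng(1)
n = 400
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([0.3, 1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_hat = irls_logistic(X, y)
# at the MLE the score X^T (y - pi) is (numerically) zero
score = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta_hat)))
```

Each pass through the loop is exactly one weighted least-squares solve in the adjusted response, which is the reformulation used above.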
As for the linear model (1), the MLE cannot be uniquely found for since is not invertible anymore. Therefore, one also depends on some sort of shrinkage. One can use the Lasso, i.e. an penalty, in the same fashion as for the linear model and solve the following minimization
This minimizer can be found similarly as in the nonpenalized case by adding the penalty term in every update (Friedman et al., 2010)
Thus, the final Lasso estimate will (approximately) fulfil
where and are functions of the estimate itself. As this is exactly a Lasso fit as in (10), the estimate will also fulfil the KKT criteria defined by and . Therefore, we can formulate the constraint , which the observed adjusted response is required to fulfil.
In the high-dimensional case with Lasso selection, an obvious approach is to calculate inference statements as if , inspired by the inference techniques in the low-dimensional setting. In other words, we proceed as in the usual Gaussian case using our new transformed data and . This can be done likewise for either pure post-selection inference or data carving. Taylor and Tibshirani (2018) provide an argument for the first case. We empirically test the adaptation to data carving in our simulations without giving a full theoretical argument. Presumably, such an argument could follow using concepts similar to those in Taylor and Tibshirani (2018).
Other types of generalized linear models are often fit in the same fashion using (penalized) IRLS. Whenever this is the case, one can apply our carving method to the transformed data, i.e. and , which behave asymptotically Gaussian.
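To make the transformation concrete, the following sketch forms the transformed data from a given logistic estimate; `adjusted_response` and its interface are hypothetical names of ours, not the paper's code. A fixed point of repeated least squares on the transformed data is exactly the logistic fit, mirroring the IRLS derivation:

```python
import numpy as np

def adjusted_response(X, y, beta_hat):
    """Map a logistic-regression state to least-squares form.

    Returns (X_tilde, y_tilde) = (W^{1/2} X, W^{1/2} z) with working
    response z = X beta + W^{-1}(y - p); least squares on these data
    reproduces the IRLS update, and the transformed observations behave
    asymptotically Gaussian.
    """
    eta = X @ beta_hat
    prob = 1.0 / (1.0 + np.exp(-eta))
    w = prob * (1.0 - prob)          # IRLS weights
    z = eta + (y - prob) / w         # working (adjusted) response
    sw = np.sqrt(w)
    return sw[:, None] * X, sw * z
```

Iterating ordinary least squares on `(X_tilde, y_tilde)` is IRLS itself, so a converged estimate solves the least-squares problem on its own transformed data.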
Multicarving and aggregation.
As in Section 2.3, we apply this method of calculating p-values to various splits and aggregate as described in Section 2.2.1. Those aggregation techniques are proven to be unbiased given screening, provided, of course, that aggregation is performed over p-values that are themselves valid given screening.
Here, the p-values are only asymptotically valid even under screening. The asymptotic validity of aggregating asymptotically valid p-values has not yet been studied theoretically in depth. Therefore, we cannot restate the same theoretical results for logistic regression as were derived in Meinshausen et al. (2009) for multisplitting and which we adapt in Section 3.1 for multicarving in a linear model. Nevertheless, in our simulations, applying multicarving to logistic regression does not result in any problems with type-I error control, so that we can advocate its use.
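For concreteness, the quantile aggregation rule of Meinshausen et al. (2009) referred to in Section 2.2.1 can be sketched as follows; this is a minimal Python illustration, and the function name and the grid search over quantiles are our own choices:

```python
import numpy as np

def aggregate_pvalues(pvals, gamma=None, gamma_min=0.05):
    """Quantile aggregation of per-split p-values (Meinshausen et al., 2009).

    Fixed gamma:  Q(gamma) = min(1, gamma-quantile of {p_b / gamma}).
    gamma=None:   search gamma over [gamma_min, 1] and multiply by the
                  (1 - log gamma_min) correction factor for the search.
    """
    pvals = np.asarray(pvals, dtype=float)

    def q(g):
        # empirical gamma-quantile of the scaled p-values, capped at 1
        return min(1.0, float(np.quantile(pvals / g, g)))

    if gamma is not None:
        return q(gamma)
    grid = np.linspace(gamma_min, 1.0, 200)
    return min(1.0, (1.0 - np.log(gamma_min)) * min(q(g) for g in grid))
```

With gamma = 1 this reduces to taking the largest p-value; with gamma = 1/B (the smallest sensible quantile for B splits) it behaves like a Bonferroni-corrected minimum.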
3 Theoretical properties
We elaborate here the theoretical properties of multicarving and the extension to group testing for (multi)carving in the selected view, requiring the screening assumption in (A1). Without the screening assumption, (multi)carving still controls the type-I error in great generality when taking the saturated view. Then, at the price of often being overly conservative, confidence intervals with guaranteed coverage should be preferred over tests; see also Sections 2.2.2 and 2.4. Throughout this section, we assume that the data follow the linear model (1) with Gaussian errors.
3.1 Multicarving for the linear model
Validity of our multicarve method follows naturally from the validity of single-carving and multisplitting. Assuming screening in split and known variance, we know from the theory of data carving that as defined in (11) follows for but . Essentially, this uniformity of the p-value is the only ingredient needed to construct the proofs of Theorems 3.1 and 3.2 in Meinshausen et al. (2009). Therefore, we can restate their theoretical result for the aggregation methods. However, we slightly alter the assumptions on the model selection procedure. We assume
(A1)  
(Ã2) 
The difference in the second condition stems from the fact that one has to invert to perform inference using splitting, while has to be inverted for data carving. Strictly speaking, the condition is , and we implicitly assume this to follow from the sparsity condition. Our simulations suggest using , so this altered sparsity assumption is less restrictive. Using those two conditions, we establish FWER control for our multicarve procedure.
Theorem 1.
Let be generated by the linear model (1) with Gaussian errors. Assume that (A1) and (Ã2) apply. Let . Let be calculated as in Section 2.3 with known and let be the aggregated value according to (5) with finite . Then, it holds
where the probability is with respect to the data sample. The statement holds regardless of the random sample splits.
The analogous result holds when aggregation is done not with a fixed quantile but with the optimized quantile and the corresponding correction term.
Theorem 2.
For proofs, we refer to the appendix of Meinshausen et al. (2009) invoking the fact that is stochastically larger than under our assumptions.
Two more technicalities have to be addressed in a practical setup. First, in order for the uniformity assumption to hold, we depend on good convergence of the MCMC approximation. Second, since we refrain from conditioning on , we need to know the variance, which is often rather unrealistic. However, we emphasize that the same theoretical result would hold in the unknown-variance case when actually using the conditioning trick. Further, when using an overestimate of , tests likely become more conservative, so that type-I error control is at least as good as with the true variance parameter. However, this cannot be guaranteed in all cases. A discussion of this issue can be found, for example, in the supplemental materials of Tibshirani et al. (2018).
3.2 Data carving for group testing
In this section, we focus on the theoretical properties of our group test applied to a single group using a single split. Using Theorem 3, results for multicarving then follow from standard arguments.
At the base of our group test is the following lemma, which is proven in Appendix A.
Lemma 1.
Using this lemma, we can base our inference statement on the conditional distribution of . Let be some observation, then we define our selected group pvalue as
(16) 
This probability can be calculated since we additionally condition on the only remaining unknowns in the model. Notably, this exactly defines the “probability of observing a value at least as extreme as the observed statistic” under the null hypothesis (14). Thus, it fulfils the desired property of a p-value, which leads to the following theorem.
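In general this conditional probability has no closed form and is estimated by MCMC in practice. The following is a minimal hit-and-run sketch for a Gaussian restricted to a generic polytope {y : Ay <= b}; the interface, the concrete constraint set, and the test statistic are illustrative placeholders, not the paper's actual selection event:

```python
import numpy as np
from scipy.stats import norm

def hit_and_run_pvalue(mu, sigma, A, b, y_obs, stat,
                       n_steps=4000, burn=500, seed=0):
    """Estimate P(stat(Y) >= stat(y_obs)) for Y ~ N(mu, sigma^2 I)
    truncated to {y : A y <= b}, via hit-and-run MCMC started at y_obs
    (which must satisfy the constraints)."""
    rng = np.random.default_rng(seed)
    y = np.array(y_obs, dtype=float)
    t_obs = stat(y)
    hits, kept = 0, 0
    for step in range(n_steps):
        d = rng.normal(size=y.shape)
        d /= np.linalg.norm(d)                 # random direction
        # feasible interval for y + alpha d:  (A d) alpha <= b - A y
        ad, slack = A @ d, b - A @ y
        lo, hi = -np.inf, np.inf
        for a_i, s_i in zip(ad, slack):
            if a_i > 1e-12:
                hi = min(hi, s_i / a_i)
            elif a_i < -1e-12:
                lo = max(lo, s_i / a_i)
        # 1-d conditional along d is N(m, sigma^2) truncated to [lo, hi]
        m = d @ (mu - y)
        u = rng.uniform(norm.cdf(lo, m, sigma), norm.cdf(hi, m, sigma))
        u = np.clip(u, 1e-10, 1.0 - 1e-10)     # guard the inverse CDF
        y = y + norm.ppf(u, m, sigma) * d
        if step >= burn:
            kept += 1
            hits += stat(y) >= t_obs
    return hits / kept
```

Each step moves along a random line and resamples the exact one-dimensional truncated Gaussian conditional, so the chain leaves the constrained distribution invariant; the p-value is the post-burn-in fraction of samples with a statistic at least as extreme as the observed one.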
Theorem 3.
Now further define a general group pvalue for group as
(17) 
Then, we can establish error control of our procedure.
Theorem 4.
4 Numerical results
In this section, we provide detailed results on the performance of our proposed methods in simulation studies. All results were obtained using the programming language R (R Core Team, 2019). As an overall summary, we find that multicarving often exhibits an advantage, sometimes a substantial one, over multisplitting or single-carving methods.
4.1 Multicarving for the linear model
We tested our multicarve method for testing single variables in the linear model on several simulation setups, and we present here the results for two of them. We do not restrict ourselves to successful screening, we assume the variance to be unknown and estimate it, and lastly, we select our model through cross-validated Lasso with regularization parameter . All these choices deviate (in part only slightly) from our theoretical assumptions. In particular, by choosing through cross-validation, more information of is used than invoked in the selection event, making the inference biased. There are first approaches to correct for this additional bias, for example, in Tian and Taylor (2018). However, we refrain from applying any of these, since they are computationally more involved and because our empirical results do not show any significant violation of the selective type-I error rate (8) using cross-validation. Still, this should perhaps be done with some caution, as e.g. Taylor and Tibshirani (2018) report bad error control using a cross-validated for post-selection inference in a Cox model.
We vary the number of splits in and the fraction of data used for selection in . In order to keep this section well-arranged, we restrict ourselves to reporting results for and . Generally, results for different values of are qualitatively similar, with a tendency to get slightly better with increasing . Naturally, only makes sense for a single split.
For aggregation over the different splits, we optimize over quantiles as in (6). Starting with the default value in the multisplitting literature, , we noticed that this sometimes makes the procedure overly optimistic. Therefore, we additionally consider for comparison. Using a larger is also favourable for computational reasons, since fewer MCMC samples are required to find a significant aggregated p-value for the smallest possible quantile, namely the quantile; see Appendix C.3 for more details.
4.1.1 Toeplitz design
In a first scenario, we sample once from a multivariate Gaussian distribution with mean zero and a Toeplitz covariance matrix with , and we then treat it as fixed design. The dimensionality is and . The coefficient vector is 5-sparse, and the active predictors are , each of which has a coefficient equal to . The standard deviation is set to , leading to a signal-to-noise ratio (SNR) of approximately . For each simulation run, the variance estimate is calculated through cross-validated Lasso on the entire data set and is used globally for all splits and inference methods.
In Figure 1, we present the outcome of the simulations for the Toeplitz design. Each performance measure represents simulation runs. Although screening cannot always be guaranteed, FWER and power are calculated with respect to (2) with rejection level . Carving using the entire data for selection, i.e. , is performed using a different algorithm, namely, the exact calculation from Lee et al. (2016). We emphasize this using a cross in the figures.
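A design of this type can be generated along the following lines. This is a Python sketch with placeholder values for the dimensions, the correlation, and the coefficients (the paper's exact constants apply in the text above); the helper names are ours:

```python
import numpy as np

def toeplitz_design(n, p, rho, seed=0):
    """Fixed design drawn once from N(0, Sigma) with Sigma_jk = rho^|j-k|."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])  # Toeplitz covariance
    L = np.linalg.cholesky(cov)                       # Sigma = L L^T
    return rng.normal(size=(n, p)) @ L.T

def simulate_response(X, active, coef, sigma, seed=1):
    """Sparse linear model y = X beta + eps with eps ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    beta[active] = coef                               # sparse coefficient vector
    return X @ beta + sigma * rng.normal(size=X.shape[0])
```

The design is drawn once and then held fixed across simulation runs, matching the fixed-design setup described above.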
The left-hand side of Figure 1 illustrates that neither single-splitting nor single-carving controls the error at for and . However, this is not a violation of our theoretical result; error control would hold when looking only at successful screening. For carving, the power initially increases in and decreases for the larger values of . This can be explained by the trade-off between more successful screening of the true active set and losing power for the inference stage as more constraints are imposed. The same holds for splitting and multicarving when looking at lower values of , as eventually too few active variables are selected in the first stage such that no decent power remains. As indicated by the inadmissibility statement in Fithian et al. (2014), carving outperforms splitting with respect to power. The important question is now whether multicarving introduces some improvement over single-carving. The single-carve method has the highest power starting from , where can basically be ignored as error control is not given at all. The multicarve method with performs best among all carving methods regarding FWER for all values of . Multicarving with seems to be inferior in this scenario. Thus, there is a trade-off between higher power and better error control. The highest power with is obtained at for all carving methods, with a value of (single-carving), () and (). So, the single-carve method is favourable in this situation.
However, this comparison is not quite fair since the methods have different FWERs. Therefore, we additionally look at an adjusted power, i.e. the rejection level is adjusted such that each method has an FWER of exactly ; see Figure 2. Carving is still superior to splitting, although the multisplit method with is now competitive for lower values of . All three carving methods reach their optimum at , with an adjusted power of (single-carving), () and ().
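Since the adjusted power is defined by calibrating the rejection level to an exact empirical FWER, it can be computed from simulation output along these lines (a hypothetical helper of ours; the array layout is our own convention):

```python
import numpy as np

def adjusted_power(null_pvals, active_pvals, target_fwer=0.05):
    """Power at the rejection level calibrated to the target empirical FWER.

    null_pvals:   (runs,) minimal p-value among inactive variables per run
    active_pvals: (runs, s) p-values of the active variables per run
    """
    null_pvals = np.asarray(null_pvals, dtype=float)
    # Empirical FWER at cutoff c is mean(min-null-p < c).  Taking the
    # (k+1)-th order statistic of the null p-values as cutoff keeps the
    # empirical FWER at k / runs <= target.
    k = int(np.floor(target_fwer * null_pvals.size))
    cutoff = np.sort(null_pvals)[k]
    return float(np.mean(np.asarray(active_pvals) < cutoff))
```

This treats each method symmetrically: every method is evaluated at its own cutoff, so the comparison is no longer distorted by unequal FWERs.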
Saturated viewpoint.
As discussed in Section 2.4, testing the null hypotheses (3) (single-carving) and (7) (multicarving) while omitting the screening assumption is not particularly meaningful, as is fully dense. Therefore, the saturated viewpoint without the screening assumption has no advantage for testing null hypotheses. However, in order to assess the power drop mentioned in Section 2.2.2, we test the null hypothesis (2) using inference in the saturated model. For the setup discussed above, this leads to the following performance measures. The highest power with is for single-carving (), for multicarving with () and for multicarving with (). The corresponding highest adjusted power is (single-carving), () and (), all of which are obtained at . Thus, the saturated approach leads to lower power and adjusted power, as anticipated. However, this drop is less distinct for the adjusted power, as the additional conservatism also leads to better type-I error control. Furthermore, we see that for multicarving the differences are less pronounced than for single-carving. For computational reasons, the saturated viewpoint might therefore be an interesting alternative for our multicarve procedure.
4.1.2 Semisynthetic Riboflavin data
Since simulated data sometimes behave somewhat more nicely than real data, we also test the methods on “semi-synthetic” setups, meaning that the matrix comes from a real data set. We simulate the response from (1) with known .
We use the Riboflavin data set with and , which was made publicly available by Bühlmann et al. (2014). The original response measures the Riboflavin production rate for samples of strains of Bacillus subtilis and gives the data its name. The matrix contains the log-expression level of genes for each of these strains.
For our simulations, we set to be sparse and use an SNR of . The active variables are chosen at random for every simulation run and their respective coefficients are set to 1. Since this can result in very different signal strengths depending on the correlation between the 2 variables, we fix the SNR on a per-run basis by always adjusting such that . Here, denotes the empirical variance of the true underlying signal. We choose this rather sparse setup with high SNR since otherwise Lasso selection works very poorly in this high-dimensional setup and none of the inference methods performs well. To illustrate this, we repeat the same simulation with active predictors; compare with Appendix C.1.
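Fixing the SNR on a per-run basis amounts to choosing the noise level from the empirical variance of the signal; a one-line sketch (the helper name is hypothetical):

```python
import numpy as np

def calibrate_sigma(X, beta, snr):
    """Choose the noise standard deviation so that the empirical variance
    of the signal X beta divided by sigma^2 equals the target SNR."""
    signal = X @ beta
    return float(np.sqrt(np.var(signal) / snr))
```

Recomputing sigma for every run keeps the effective signal strength comparable even when the randomly chosen active columns happen to be strongly correlated.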
For the selection, we again perform cross-validation on the given split. To be more realistic, we stick to the assumption of unknown . With the estimation technique described before, we noticed that is empirically quite low in this scenario. Therefore, we choose the more conservative approach of calculating a new for every split as
(18) 
where is calculated on the selection data only but and are the full data.
The results obtained for the Riboflavin data with a sparsity of are shown in Figures 4 (FWER and power) and 4 (adjusted power). This setup is now highly in favour of our multicarve method. In particular, the highest power obtained for is (single-carving), () and (); see Figure 4. The multicarve methods reach this maximum at , while single-carving only obtains error control starting from and higher. There is a power-versus-FWER trade-off between the two different values of .
The adjusted power slightly favours the lower value, as illustrated in Figure 4. More precisely, the highest adjusted power is () and () for the multicarve method. Both of these values are obtained for . Single-carving reaches its maximum of at . Thus, the adjusted power clearly favours multicarving as well.
We note that although we increase both SNR and sparsity, the adjusted power is not (much) better than in the previous setup. This can be explained intuitively by two reasons: First, in the Riboflavin design is much larger than in our Toeplitz design. Second, there are variables with a very high empirical correlation of up to around , making them hardly distinguishable in the selection stage.
4.2 Confidence intervals
We apply our method for confidence intervals to the same setup with simulated from a multivariate normal distribution with Toeplitz covariance matrix as in Section 4.1.1. As we explicitly omit the screening assumption, we use a different estimate for every split as in (18). The parameters in (12) are calculated including an intercept. Naturally, whenever screening works, this intercept term vanishes.
We use splits and aggregate according to (6) with . The obtained intervals are targeted to be confidence intervals (CI).
Table 1: Median interval length of the confidence intervals, by number of active variables (1, 5, 10, 15, 20), and average false coverage (among all variables and among tested variables), for splitting and carving at the various selection fractions.
In Tables 1 and 2, we compare the performance of our carving confidence intervals to those obtained using multisplitting as implemented in Dezeure et al. (2015). These results are based on simulation runs. The obtained intervals are generally rather conservative, as the false coverage rate is always far below the theoretical bound of . Notably, for , the intervals obtained through carving are not actually shorter than those from splitting. The advantage of carving is that the intervals initially get shorter when increasing . By increasing , the selected models become more stable and, likewise, differs less over the different splits . Due to a more stable , shorter intervals are theoretically possible with higher . However, multisplitting cannot profit from this, as too little information remains for the inference stage after increasing . The same holds for carving when becomes too large. The best-performing method is carving with a selection fraction of , which outperforms every other configuration with respect to at least three interval lengths. Further, it also performs comparably well with respect to the false coverage rate, as every configuration with a lower false coverage rate suffers from substantially longer intervals.
In a further analysis, we look at the lengths of the confidence intervals of all covariates that were selected at least once within the splits. For the other variables, there is no real interpretation of the coverage in Equation (12). Further, not selecting a covariate at all in splits is a rather strong indication that the variable is generally inactive, such that treating it as if it had an infinite confidence-interval length does not seem appropriate. However, there are still many variables with an infinite interval length, namely, those that are selected at least once but fewer than times, i.e. once or twice in this setup.
In Table 2, we report the median over the simulation runs over several quantiles of the interval lengths among the selected variables. Due to the possibility of infinite interval lengths, we focus on quantiles instead of averages.
Table 2: Median over the simulation runs of several quantiles of the confidence-interval lengths among the selected variables, for splitting and carving at the various selection fractions.
Again, we note that for , the intervals obtained through multicarving are longer than those from multisplitting. However, the power of multicarving comes from the ability to raise the selection fraction without losing all information for the inference stage. The selected models become more stable for larger values of , and fewer covariates are selected in total over the splits. The total number of distinct variables selected over all the splits is 96 and 20 on average for and , respectively. With fewer features under consideration, a higher fraction of them is selected sufficiently often such that powerful inference is possible. These effects are visible in Table 2, as the quantiles of the interval lengths using multicarving mostly become shorter when increasing . For carving, there is also a natural counter-effect, as information for the inference stage is lost; thus, the quantiles of the interval lengths are not strictly decreasing.
In summary, our confidence intervals attain the desired coverage stated in Equation (12). Further, multicarving brings an advantage over multisplitting because it can still perform well with a higher selection fraction .
4.3 Data carving for group testing
In order to see how well our group test performs, we compare it with the results presented in Guo et al. (2019). The authors consider two scenarios, testing either a small or a large group, based on data simulated using different covariance structures. Testing a large group in a dense scenario is described below. Results of group testing for a small group in a sparse and highly correlated scenario are illustrated in Appendix C.2.
The dense alternative with many small nonzero coefficients is a setup where testing single variables is difficult. More precisely, is and is varied in . The feature matrix is generated from normally distributed features having a Toeplitz covariance matrix with . The parameter vector is defined as for and otherwise. We vary over , where corresponds to the global null. The response vector follows our linear model (1) with . This leads to an SNR in . We are interested in testing the null hypothesis (13) for the group .
4.3.1 Singlecarving for group testing
We perform inference in either setup using our group test introduced in Section 2.5. As in Section 4.1, we vary the fraction of data used for selection in . We start with just a single split, i.e. , for inference. Notably, for the group test, inference using is obtained with MCMC sampling as well. Since we condition on all but the covariates of interest, we generally have more than one degree of freedom, such that an easy calculation as in Lee et al. (2016) is not possible. The only exception is , which is algorithmically equivalent to single-variable testing.
For the selection, we perform crossvalidation. Based on the assumption that Lasso might eliminate many of the covariates with weak signal, we use instead of . To assess the variance parameter , we use a global estimate obtained with crossvalidation and on all data.
In Table 3, we show the results for the dense alternative. For each combination of , , and , we report the empirical rejection rate (ERR), i.e. the fraction out of 200 simulation runs in which the null hypothesis is rejected at level . For , this measures the typeI error, for , this measures the power.
For fixed and , the power increases in the number of observations , and for fixed and , it increases in the signal strength . This conclusion is to be expected.
The FWER is controlled for all combinations of and , for most combinations even conservatively. The fraction always has the lowest power because selection does not work overly well there. In many settings, is also suboptimal with respect to power, as too little power remains for the inference stage. Fractions to are all competitive and perform similarly. This is in good accordance with our results for testing single variables in Section 4.1.
Table 3 can now be compared to Guo et al. (2019, Table 1), where six different methods are evaluated in this scenario. Ignoring the two for which error control is not given at all, our method with fractions between and is amongst the best with respect to power in each setup. In particular, it has clearly higher power than their method for , whereas the power is similar for higher . However, their method controls the error more conservatively, such that a clear statement in favour of either method is not possible.
If we summarize the results from the dense scenario in this section and the sparse scenario in Appendix C.2, we can state that our method does not have the best performance in all possible setups. However, it is competitive in all of them, while all competitors have some setups where they do not work well at all. Thus, our group test, which results from a very simple adjustment of the data carving idea, brings some valuable results.
4.3.2 Multicarving for group testing
In Section 4.1, we see that the multicarve method usually has better error control than single-carving. Based on this observation, it is to be expected that multicarving could further improve group inference in scenarios where the error is not controlled conservatively (cf. Table 3). Therefore, we test multicarving for group testing as well. Indeed, with multicarving, no ERR above the target level occurs for in either alternative. However, the ERR for , i.e. the power, is sensitive to the choice of the tuning parameters , / and . Especially in those two scenarios, aggregation using a fixed quantile clearly outperforms the use of an optimized quantile according to Equation (6).
In the following, we present results obtained using splits and a fixed quantile for aggregation of , in order to show the possibilities of multicarving. We emphasize that these choices happen to work comparably well, such that in general, when no such comparison is possible, one could expect slightly lower power using multicarving for group testing. The results are shown in Table 4.
We consider the dense alternative. For multicarving, the highest ERR for is , whereas it is for single-carving. Naturally, there is some fluctuation in these empirical values. Nevertheless, this difference indicates an improvement of multicarving over single-carving. For most scenarios with , a selection fraction of is favourable. The intuitive explanation is that although the group is not tested as often as with higher fractions , it is still tested in a decent number of splits. In these splits, the lower allows for a more powerful inference statement, making the method more powerful overall after aggregation. Notably, using and (fixed quantile for aggregation) is equivalent to a Bonferroni-corrected minimum p-value (cf. Equation (5)). Thus, only the most significant split is of importance. We now compare the power in Table 4 to that of single-carving in Table 3. Using a selection fraction of , multicarving outperforms every single-carving configuration in all scenarios except for and . Thus, using multiple splits and aggregating can bring a clear improvement. However, this is rather sensitive to the choice of the tuning parameters, as mentioned above.
In summary, the natural extension of our group test to multiple splits leads to a performance boost. In particular, the error can be controlled at a more conservative level using multiple splits. A drawback of the method is its sensitivity to tuning parameters. If these happen to be chosen poorly, power might be lower than with single-carving.
4.4 Multicarving for logistic regression
We conduct a simulation study similar to that in Section 4.1 for the logistic model (15). We reuse the matrix coming from a Toeplitz covariance design from Section 4.1 with dimensions and . The active variables are , each with a coefficient of . After noticing that in logistic regression, at least in this setup, cross-validated Lasso tends to select overly sparse models, we alter the selection technique. Namely, we select a Lasso model with a given number of selected variables or, if there is no such model, the largest model with fewer variables. Inspired by Meinshausen et al. (2009), we choose this number to be . Just as for cross-validated Lasso, this introduces a slight bias into our test, as is determined in a data-dependent fashion and is not predefined. We stick to our usual tuning parameters, i.e. is varied in and in . The target level for the FWER remains at .
Figures 6 (FWER and power) and 6 (adjusted power) illustrate the same performance statistics as for the simulation examples in Section 4.1. Every performance measure corresponds to simulation runs.
All methods are rather conservative in this setup. In particular, no value of the FWER above the level occurs. Furthermore, there are no p-values below the significance level at all for splitting. There probably exist better algorithms for calculating low-dimensional p-values in logistic regression than the ones used for splitting here. However, as this is not of primary interest to our work, we did not investigate it further. Single-carving has clearly higher power than multicarving, whereas the latter controls the error at a more conservative level. The highest power obtained is (single-carving), () and (). All these maxima are reached at . Pure post-selection inference has a power of . Thus, the conjecture that the constraints might be too restrictive is confirmed.
For the trade-off between power and error control, we consider the adjusted power as defined in Section 4.1. Interestingly, multisplitting is now quite competitive. The interpretation is that although the p-values are generally larger than , there is still a distinction between active and non-active variables. The best adjusted power of the multisplit method is . As the curve seems to increase towards lower values of , we further tested and . Neither leads to an increase in the adjusted power for multisplitting, so we can assume that the optimum is reached around . Multicarving clearly outperforms single-carving, with the respective maxima being at (), () and (single-carving). Pure post-selection inference obtains an adjusted power of .
In summary, we can state for these data that either carving method improves on pure post-selection inference. The choice between multicarving and single-carving is a trade-off between power and FWER. Our definition of adjusted power, which gives the different methods equal FWER, favours multicarving.
4.5 Runtime considerations
Our method is computationally quite involved while performing well empirically. Details are discussed in Appendix C.4. The computational bottleneck is the MCMC sampling required to calculate p-values, and we therefore ignore the other steps in our considerations. An approximate bound is for multicarving, where the expectation is due to the fact that is non-constant over splits.
Another popular inference technique for high-dimensional statistics is the debiased Lasso (van de Geer et al., 2014). A total of Lasso fits have to be calculated on the entire data. Thus, it scales as . For high-dimensional data with and the standard assumption on the Lasso, we have accordingly . Then, our multicarve method is more efficient than the debiased Lasso for if .
5 Discussion and conclusions
We provide new developments based on the idea of data carving (Fithian et al., 2014). Particularly for high-dimensional scenarios, we improve upon standard data carving.
First, we introduce multicarving in the spirit of multisplitting. Our simulation study shows that multicarving generally leads to better error control, and its adjusted power is better than that of the single-carve method. Furthermore, multisplitting and multicarving aim not only to reduce the FWER but also to make results more replicable. It is very plausible that our multicarve method clearly increases replicability compared to single-carving, due to the instability of the Lasso model selector.
Second, we present group inference, a natural extension of single variable testing. Such a group test can be applied using singlecarving or using the advocated multicarving. In simulation examples, either variant appears to be competitive to several methods discussed in Guo et al. (2019).
Last, we adapt data carving to make it applicable to logistic regression and other generalized linear models. These adjustments are based on the central limit theorem and follow from ideas similar to those already introduced for low-dimensional data and for pure post-selection inference. Our simulation study leads to the same conclusions as for the linear model. In particular, data (multi)carving in the logistic case also leads to a performance increase compared to pure post-selection inference.
User-friendly software for multicarving will soon be available within the R package hdi (Dezeure et al., 2015).
Acknowledgment
The research of P. Bühlmann was supported in part by the European Research Council under the Grant Agreement No 786461 (CausalStats  ERC2017ADG).
References
 Berk et al. (2013) Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013). Valid post-selection inference. The Annals of Statistics, 41(2):802–837.
 Bickel et al. (2009) Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.
 Bühlmann (2013) Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242.
 Bühlmann et al. (2014) Bühlmann, P., Kalisch, M., and Meier, L. (2014). High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1:255–278.
 Bühlmann and van de Geer (2011) Bühlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
 Cai and Guo (2017) Cai, T. T. and Guo, Z. (2017). Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of Statistics, 45(2):615–646.
 Dezeure et al. (2015) Dezeure, R., Bühlmann, P., Meier, L., and Meinshausen, N. (2015). High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statistical Science, 30(4):533–558.
 Fithian et al. (2014) Fithian, W., Sun, D., and Taylor, J. (2014). Optimal inference after model selection. arXiv preprint arXiv:1410.2597.
 Friedman et al. (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.
 Guo et al. (2019) Guo, Z., Renaux, C., Bühlmann, P., and Cai, T. T. (2019). Group inference in high dimensions with applications to hierarchical testing. arXiv preprint arXiv:1909.01503.
 Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.

Javanmard and Montanari (2014)
Javanmard, A. and Montanari, A. (2014).
Confidence intervals and hypothesis testing for highdimensional
regression.
The Journal of Machine Learning Research
, 15(Oct):2869–2909.  Lee et al. (2016) Lee, J. D., Sun, D. L., Sun, Y., and Taylor, J. E. (2016). Exact postselection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927.
 Leeb and Pötscher (2003) Leeb, H. and Pötscher, B. M. (2003). The finitesample distribution of postmodelselection estimators and uniform versus nonuniform approximations. Econometric Theory, 19(1):100–142.
 Mandozzi and Bühlmann (2016) Mandozzi, J. and Bühlmann, P. (2016). Hierarchical testing in the highdimensional setting with correlated variables. Journal of the American Statistical Association, 111:331–343.
 Meinshausen (2008) Meinshausen, N. (2008). Hierarchical testing of variable importance. Biometrika, 95(2):265–278.
 Meinshausen and Bühlmann (2006) Meinshausen, N. and Bühlmann, P. (2006). Highdimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
 Meinshausen et al. (2009) Meinshausen, N., Meier, L., and Bühlmann, P. (2009). Pvalues for highdimensional regression. Journal of the American Statistical Association, 104(488):1671–1681.
 Meinshausen and Yu (2009) Meinshausen, N. and Yu, B. (2009). Lassotype recovery of sparse representations for highdimensional data. The Annals of Statistics, 37(1):246–270.
 Mitra and Zhang (2016) Mitra, R. and Zhang, C.H. (2016). The benefit of group sparsity in group inference with debiased scaled group lasso. Electronic Journal of Statistics, 10(2):1829–1873.
 Pakman (2015) Pakman, A. (2015). tmg: Truncated Multivariate Gaussian Sampling. R package version 0.3.
 Pakman and Paninski (2014) Pakman, A. and Paninski, L. (2014). Exact hamiltonian monte carlo for truncated multivariate gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542.
 R Core Team (2019) R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
 Renaux et al. (2020) Renaux, C., Buzdugan, L., Kalisch, M., and Bühlmann, P. (2020). Hierarchical inference for genomewide association studies: a view on methodology with software (with discussion). Computational Statistics, 35(1):1–40.
 Taylor and Tibshirani (2018) Taylor, J. and Tibshirani, R. (2018). Postselection inference for penalized likelihood models. Canadian Journal of Statistics, 46(1):41–61.
 Tian and Taylor (2018) Tian, X. and Taylor, J. (2018). Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710.
 Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
 Tibshirani et al. (2018) Tibshirani, R. J., Rinaldo, A., Tibshirani, R., and Wasserman, L. (2018). Uniform asymptotic inference and the bootstrap after model selection. The Annals of Statistics, 46(3):1255–1287.
 Tibshirani et al. (2016) Tibshirani, R. J., Taylor, J., Lockhart, R., and Tibshirani, R. (2016). Exact postselection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620.
 van de Geer et al. (2014) van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for highdimensional models. The Annals of Statistics, 42(3):1166–1202.
 van de Geer and Stucky (2016) van de Geer, S. and Stucky, B. (2016). 2confidence sets in highdimensional regression. In Statistical analysis for highdimensional data, pages 279–306. Springer.
 Wasserman and Roeder (2009) Wasserman, L. and Roeder, K. (2009). High dimensional variable selection. The Annals of Statistics, 37(5A):2178–2201.
 Zhang and Zhang (2014) Zhang, C.H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242.
Appendix A Proofs
Proof of Lemma 1
We require Assumption (Ã2) so that is well defined. As in Section 3.1, we implicitly assume that follows from the sparsity condition. This inverse is implicitly included in and . Using the screening assumption, we know that . Thus, we can write the unconditional distribution of as follows
where denotes the normalizing constant of the Gaussian distribution. We see that is the sufficient statistic, while is the natural parameter, as is assumed to be known. Conditioning on the selection event leads to a different exponential family with the same sufficient statistic and natural parameter but a different normalizing constant, say ; compare Fithian et al. (2014, Section 3).
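In generic exponential-family notation (density f, sufficient statistic T, natural parameter θ, selection event E; these symbols are generic placeholders, not the paper's specific quantities), the fact being used can be written as:

```latex
f_\theta(y) \;=\; h(y)\,\exp\!\bigl\{\theta^\top T(y) - \psi(\theta)\bigr\},
\qquad
f_\theta\bigl(y \mid y \in E\bigr)
  \;=\; h(y)\,\mathbf{1}\{y \in E\}\,
        \exp\!\bigl\{\theta^\top T(y) - \psi_E(\theta)\bigr\},
```

where $\psi_E(\theta) = \psi(\theta) + \log \mathbb{P}_\theta(y \in E)$. Conditioning only changes the base measure and the log-normalizer; sufficient statistic and natural parameter are untouched, which is exactly the property from Fithian et al. (2014, Section 3) invoked above.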
Here, we split into the parameter for which we want to perform inference and the nuisance parameter in the model . From the theory of exponential families, we know that the conditional law does not depend on . We now want to establish the same result for . For simplicity, we assume such that can be separated into the variables belonging to the group and the others. The result holds w.l.o.g., since permutations of the matrix's columns do not change our inference statement. Then, we get
Thus, is a fixed affine transform of , making it independent of as well. Naturally, the subset is conditionally independent too. Based on our two assumptions, the only parameters in the model are and . Thus, after establishing independence from the former, the only parameter left in the model is the latter, which is exactly the lemma's statement.
Proof of Theorem 4
For a group , we either have or . Assume the former case first. Due to screening, we know , which leads to as . Null hypothesis (13) then directly implies
which corresponds to null hypothesis (14). Therefore, all assumptions of Theorem 3 are fulfilled, leading to the uniform distribution of the p-value. Error control can thus be stated as
In the other case () we have
Thus, we obtain error control in either case, which completes the proof.
Appendix B Sampling from a linearly constrained Gaussian
The algorithm presented in this section is strongly based on the GitHub repository cited in Fithian et al. (2014) for their simulations. However, since there seems to be no written documentation of the algorithm itself and the theory behind it, we provide both for the interested reader.
For simplicity, we suppress the index , since we implicitly work in a selected submodel throughout this section.
In order to do inference for variable , the goal is to sample from subject to , and . The first condition leads to boundaries on the sampling region, the second one changes both the mean parameter and the covariance matrix, and the last one further changes the mean and creates a null distribution.
B.1 Change of mean and covariance
Let be a Gaussian random vector with mean and covariance . We are interested in and . To find those, split into
One can see (e.g. by calculating the covariance) that the second term is independent of , thus unchanged by the conditioning, while the first part is completely defined by the conditioning. Thus, we have
and similarly
In our problem of interest, we have (after setting ), , and . This yields
and
Most importantly, the mean term does not depend on , so that we can compute an inference statement without knowing the other coefficients.
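The conditioning step above follows the standard formulas for a Gaussian conditioned on a linear equality. A minimal numpy sketch, with generic names `mu`, `Sigma`, `A`, `b` (ours, not the paper's notation), for y ~ N(mu, Sigma) conditioned on A y = b:

```python
import numpy as np

def condition_on_linear(mu, Sigma, A, b):
    """Mean and covariance of y | A y = b, for y ~ N(mu, Sigma)."""
    S = A @ Sigma @ A.T                   # Cov(A y)
    K = Sigma @ A.T @ np.linalg.inv(S)    # "regression" matrix of y on A y
    mu_c = mu + K @ (b - A @ mu)          # conditional mean
    Sigma_c = Sigma - K @ A @ Sigma       # conditional covariance (rank-deficient)
    return mu_c, Sigma_c
```

The two terms of the split in the text correspond to `K @ (A @ y)` (fixed by the conditioning) and its orthogonal remainder (unchanged by it); note that `Sigma_c` loses rank, which is why the later whitening step has to handle rank deficiency.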
B.2 Computational shortcuts: linear transformations
Since all constraints are linear, they can also be guaranteed for linear transformations of , provided not too much dimensionality reduction is applied. Define the least-squares solution on all data as
and the one on the selection data only as
Then, two vectors which are well suited to fulfil all constraints after transformation are
Since those are linear transformations, they will still be Gaussian with mean and covariance that can be easily derived from those of .
Further, the constraints transform to
We use the bracket notation for the indices to indicate that row of the resulting matrix has to be omitted. By using the active constraints from Lee et al. (2016), we have
where denotes the signs of the parameters’ Lasso estimates. This can be transformed to
Thus, we have transformed the linear equality and inequality constraints and can proceed as if we were to sample from by firstly adjusting the mean and the covariance matrix as described in Section B.1.
The choice of whether to sample from or is simple: use whichever has lower dimensionality, in order to increase efficiency. As stated in Section 2.2.2, one would further condition on in the unknown-variance case. However, this constraint cannot be transformed to or , so the dimensionality could not be reduced. Therefore, we use an estimate of the variance instead of the (theoretically beautiful) conditioning idea in our simulations.
B.3 Whitening
In order to keep the MCMC algorithm simple, we would like to always sample from zero-mean, unit-variance, independent Gaussians (i.e., white Gaussians). This can be achieved by applying a further linear transformation. We need a forward map transforming the initial point and an inverse map transforming back the MCMC samples.
Assume that we sample from , which is achieved by applying the transformations from the previous two sections. Here, has rank , i.e., it does not have full rank whenever . This is because we lose some degrees of freedom through conditioning (cf. Section B.1). Further, define matrices and such that
These can be found, e.g., by using the eigenvalue decomposition of . Then, our forward map is and, accordingly, the inverse map is
Note that , and further for all fulfilling the equality constraints, which covers all points we are interested in.
Importantly, the boundary constraint has to be transformed as well. This is possible by
which leads to
i.e. the constraints in the whitened space. With these whitened constraints at hand, the only thing left is to sample from a white Gaussian subject to linear inequality constraints.
Notably, since is a wide matrix ( unless ), we transform into a lower-dimensional space. Therefore, the transformation into the whitened space leads to a further dimensionality reduction, which makes the sampling more efficient.
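The whitening via eigenvalue decomposition described above can be sketched as follows, with generic names `mu`, `Sigma` (the conditional mean and rank-deficient covariance from Section B.1; the function and variable names are ours):

```python
import numpy as np

def whiten_maps(mu, Sigma, tol=1e-10):
    """Forward map y -> z with z ~ N(0, I_r) and its inverse z -> y,
    where r = rank(Sigma); handles the rank deficiency left by conditioning."""
    w, U = np.linalg.eigh(Sigma)
    keep = w > tol * w.max()              # drop (numerically) zero eigenvalues
    Ur, wr = U[:, keep], w[keep]
    forward = lambda y: (Ur.T @ (y - mu)) / np.sqrt(wr)
    inverse = lambda z: mu + Ur @ (np.sqrt(wr) * z)
    return forward, inverse, int(keep.sum())
```

Because only the `r` eigenvectors with nonzero eigenvalues are kept, the forward map simultaneously whitens and reduces the dimension, matching the efficiency gain noted in the text; linear constraints on y transform to constraints on z through the inverse map.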
B.4 Sampling from a linearly constrained white Gaussian
The MCMC algorithm presented in this section is likewise based on the aforementioned GitHub repository. However, we emphasize that any algorithm approximating a white Gaussian subject to linear inequality constraints could be used in its place, with the same preprocessing steps (cf. Sections B.1–B.3).
For simplicity, we reuse the initial names: we want to sample from subject to , and we let be a point fulfilling the constraints. More precisely, is the preprocessed version of the observed vector.
The idea is to move in every step along a given random direction , while keeping the projection onto its orthogonal complement fixed, i.e.
In other words, we want to sample from
This is in exact analogy to the setup in Lee et al. (2016) for pure post-selection inference, using as the direction of interest and as the observation to base the inference on. Thus, the boundary derived for pure post-selection inference can be reused, making a univariate truncated Gaussian with known mean and variance. One can easily sample from this, leading to a new point . For every , this is repeated with a new random direction, so that the whole constrained space is explored. After enough steps, the samples should approximate the null distribution sufficiently well.
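The hit-and-run scheme described above can be sketched in a few lines of numpy/scipy; this is a generic implementation under the stated setup (white Gaussian, constraints A z <= c, feasible start z0), not the repository's exact code:

```python
import numpy as np
from scipy.special import ndtr, ndtri  # standard normal CDF and its inverse

def hit_and_run(A, c, z0, n_steps, rng=None):
    """Hit-and-run sampler for z ~ N(0, I) subject to A z <= c,
    started at a feasible point z0."""
    rng = np.random.default_rng(rng)
    z = np.asarray(z0, dtype=float).copy()
    out = []
    for _ in range(n_steps):
        eta = rng.normal(size=z.size)
        eta /= np.linalg.norm(eta)        # random unit direction
        s = eta @ z
        z_perp = z - s * eta              # projection kept fixed
        # A (z_perp + t * eta) <= c  <=>  (A @ eta) * t <= c - A @ z_perp
        a = A @ eta
        r = c - A @ z_perp
        lo = np.max(r[a < 0] / a[a < 0], initial=-np.inf)
        hi = np.min(r[a > 0] / a[a > 0], initial=np.inf)
        # along eta, the coordinate is standard normal truncated to [lo, hi]
        t = ndtri(rng.uniform(ndtr(lo), ndtr(hi)))
        z = z_perp + t * eta
        out.append(z.copy())
    return np.array(out)
```

Each step samples the univariate truncated Gaussian along `eta` by inverse-CDF sampling, exactly the one-dimensional law with the Lee et al. (2016)-style boundaries mentioned in the text.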
Appendix C Additional numerical results
This section contains additional numerical results and details about runtime considerations.
C.1 Semi-synthetic Riboflavin data for sparsity 4
We redo the simulation from Section 4.1.2, setting the sparsity to without changing anything else. The respective results are presented in Figures 8 (FWER and power) and 8 (adjusted power). Note that we restrict the plotting area of the y-axis to a maximum of , so that some values of the FWER are not visible.
At first glance, one sees that the power is generally quite low for all methods. Importantly, this is not a problem of the inference methods themselves but of the combination of selection and inference. Screening, which is theoretically needed for the validity of these approaches, only worked in of the simulation runs when using all data for selection, and naturally even less often for any subset. For comparison, screening worked in of the instances in the sparser alternative, which is a much easier problem.
In this setup, multicarving with has the highest power for all , while has the lowest FWER among the three carving methods. The highest power obtained while controlling the FWER at is achieved using , with a value of . The other two methods attain respective maxima of () and (singlecarving). Notably, singlecarving only reaches error control at , which corresponds to pure post-selection inference. The adjusted power is slightly higher for than for , with maximal values of and . For singlecarving, the maximal value is . In summary, multicarving is to be preferred over singlecarving in this harder setup as well.
As mentioned above, one of the main difficulties in this scenario is the bad screening property. A natural adaptation is to use instead of for selection with cross-validation. This leads to larger selected models and could potentially increase the probability of screening. Our simulation confirms that this yields a performance boost, with the highest adjusted power for multicarving now being . For simplicity, we refrain from showing the results in detail. It must be mentioned, though, that the use of leads to a substantial increase in runtime, as more variables are selected. We elaborate on this effect further in Section C.4.
C.2 Data carving for group testing: sparse scenario
We refer to Section 4.3 for more details about the implementation and further discussion.
For the sparse scenario, we choose to be sparse, and the active covariates are strongly correlated with other covariates. The number of covariates is again , and is simulated using the following covariance structure
Thus, is the same Toeplitz matrix as in the dense alternative described in Section 4.3, except in the first five variables. The parameter vector is defined as and otherwise, meaning that the active variables are within the highly correlated set. This time, is varied over and over . The response is generated as before, leading to an SNR in . In this scenario, we are interested in the null hypothesis (13) for the group .
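A minimal sketch of simulating such a design matrix with Toeplitz covariance is given below. The correlation decay parameter `rho = 0.6` is a placeholder, since the paper's actual parameter values are not reproduced here, and the block modification of the first five variables is omitted:

```python
import numpy as np
from scipy.linalg import toeplitz

def simulate_design(n, p, rho=0.6, rng=None):
    """Draw n rows of X ~ N(0, Sigma) with Toeplitz covariance
    Sigma_ij = rho^|i-j|. rho = 0.6 is an assumed placeholder value."""
    rng = np.random.default_rng(rng)
    Sigma = toeplitz(rho ** np.arange(p))
    L = np.linalg.cholesky(Sigma)         # Sigma = L @ L.T
    return rng.normal(size=(n, p)) @ L.T  # each row has covariance Sigma
```

The response would then be generated from the rows of X with the sparse coefficient vector described above, plus Gaussian noise scaled to the desired SNR.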
C.2.1 Singlecarving for group testing: sparse scenario
In Table 5, we report the empirical rejection rate for the scenario with very sparse and highly correlated features. This scenario seems to be easier to handle than the dense one. In particular, the error is controlled at a more conservative level, with the highest error being . For the power, the tendencies are similar to before. For , and generally have the lowest power, while the highest power is obtained with . Starting from , leads to the lowest power, while the other ERR are mostly exactly .