Multicarving for high-dimensional post-selection inference

06/08/2020
by Christoph Schultheiss, et al.

We consider post-selection inference for high-dimensional (generalized) linear models. Data carving (Fithian et al., 2014) is a promising technique to perform this task. However, it suffers from the instability of the model selector and hence may lead to poor replicability, especially in high-dimensional settings. We propose the multicarve method, inspired by multisplitting, to improve stability and replicability. Furthermore, we extend existing concepts to group inference and illustrate the applicability of the methodology to generalized linear models as well.


1 Introduction

We consider post-selection inference in high-dimensional (generalized) linear models. Statistical inference in high-dimensional models is challenging: the main methods in a frequentist setting use some bias-corrected estimators of the Lasso (Zhang and Zhang, 2014; van de Geer et al., 2014; Javanmard and Montanari, 2014) or of Ridge regression (Bühlmann, 2013), and Cai and Guo (2017) provide refined optimality results for such techniques. On the other hand, post-selection inference provides a very different approach for constructing confidence statements in high-dimensional models. Post-selection inference is attractive as it is closer, in some vague sense, to what practitioners do, namely to apply first some model selection in order to restrict the set of covariates and make the problem feasible. Post-selection inference had long been viewed as rather ill-posed (Leeb and Pötscher, 2003) until Berk et al. (2013) provided a conservative approach to change the negative picture. More recent work by Fithian et al. (2014), Tian and Taylor (2018), Taylor and Tibshirani (2018) and others leads to interesting new inferential tools. The current work builds on those contributions.

The instability of post-selection inference.

Post-selection inference deals with the problem of making inference statements after having selected a set of covariates using a data-driven algorithm or method. For post-selection inference in high-dimensional (generalized) linear models, a very popular model selection method is the Lasso (Tibshirani, 1996); in fact, in this work, we focus exclusively on the Lasso as model selector. Among the main concerns when using the Lasso, or any other variable selection method that massively reduces the number of original covariates to a much smaller number of selected ones, is its instability. The model selected, e.g., by the Lasso has a low degree of replicability due to its instability arising from correlated covariates and/or high-noise scenarios. Thus, the inference after model selection might be very non-replicable if the model selector leads to different results for small perturbations of the data, e.g. simply new realizations from the same data generating process. Our new multicarving proposal is a possible remedy to make post-selection inference more replicable.

A variety of approaches to obtain valid tests and confidence intervals after model selection have been developed. In order to put our proposal in context, we briefly discuss those most relevant to our work in the following.

A simple approach for valid inference is to split the data into two parts and use the first half for selection and the second half for inference (Wasserman and Roeder, 2009). Thus, the idea is very similar to any validation scheme using data splitting.

This simple single data splitting method has certain drawbacks. Since splitting the data is a random process, the inference statements change if a different split is chosen. If we repeat this process multiple times, we observe that the obtained p-values per predictor change a lot: Meinshausen et al. (2009) call this phenomenon the “p-value lottery”. For the Lasso selector, this is especially accentuated as it is highly non-stable and potentially selects quite different models depending on the split. Therefore, results obtained through this method are not replicable at all unless one fixes the split. In order to obtain more stable and replicable p-values, Meinshausen et al. (2009) suggest splitting the data multiple times, say, B times, leading to one p-value per split and per predictor. The p-values per predictor are aggregated using quantile functions and adequate correction terms. Although there is still randomness involved, the results become more stable with increasing B, in the spirit of the law of large numbers. This technique is referred to as multisplitting.

To avoid confusion, we save the term post-selection inference for techniques that perform inference on the same data as used for selection and refer to the methods from Wasserman and Roeder (2009) and Meinshausen et al. (2009) as (multi)splitting. Post-selection inference for a (generalized) linear model can be achieved by calculating or simulating a constrained null distribution, where the constraints reflect the selected model.

Lee et al. (2016) analyse the case of Lasso selection in a linear model. They show that the Karush-Kuhn-Tucker (KKT) criteria, which are necessary conditions for the Lasso solution, lead to a polyhedral constraint on the observed response vector. Using this constraint, they derive a truncated normal distribution which allows for valid inference. A drawback of this method is a loss in power introduced by those polyhedral constraints. Similar constraints have been derived in Tibshirani et al. (2016) for sequential regression problems: compared to Lasso selection for a fixed value of the regularization parameter λ, those constraints increase in dimensionality rather quickly, since every step of the procedure results in additional constraints.

Somewhere in between data splitting and post-selection inference is a technique called data carving (Fithian et al., 2014). In order to distinguish data carving from methods as in Lee et al. (2016), we will refer to the latter as pure post-selection inference in the following. Due to the loss in power introduced by pure post-selection inference, Fithian et al. (2014) prefer not to use all observations for the selection process. Further, they prove that completely discarding the fraction of data used for model selection in the inference stage leads to inadmissible tests. Instead, one should use as much information from the selection data as remains usable and should only discard the information that was actually needed to obtain the given selection. This means that one “carves” the data. One can reuse the selection constraints introduced for pure post-selection inference but impose them on the selection data only. This method outperforms pure post-selection inference and simple sample splitting with respect to power. However, it is computationally much more involved under certain model assumptions. Naturally, pure post-selection inference can be seen as the special case of data carving in which all data is used for selection, and Fithian et al. (2014) refer to it as such.

Berk et al. (2013) introduce an inference technique that is valid given any preceding model selection procedure. This is possible by using the so-called PoSI (post-selection inference) constant. This constant is defined as the minimal value such that the maximal absolute t-statistic, maximized over all possible predictor variables and submodels, is at most equal to this constant with probability at least 1 − α. The advantage of this method is that it leaves all freedom to the selection process without losing validity. For example, visual inspection of the data by a human, which often happens in practice, is allowed. On the other hand, this method is quite conservative by construction. Furthermore, calculating the constant is computationally involved, such that the authors only suggest using their method for problems with a modest number of covariates. Despite the nice theoretical framework, the method is not suited for high-dimensional statistics, which is our focus.

1.1 Relation to other work and contribution

There exists a broad literature building on the simple idea of data splitting in high-dimensional statistics. In particular, Meinshausen et al. (2009) have argued that multiple splits are to be preferred over a single split, and Fithian et al. (2014) have introduced the idea of data carving, which leads to more powerful tests after using a subset of the data for model selection. Both works emphasize certain drawbacks of single-splitting and show how to overcome them. Therefore, in our work, we focus on how to optimally combine those improvements. Since we focus on selection using the Lasso, many of our results also depend on Lee et al. (2016), who derived the polyhedral constraints for the case of the Lasso as selector.

We further elaborate two more extensions of data carving in a linear model that can be combined with multisplitting in the same fashion. The first one concerns group testing. There are many developments in high-dimensional statistics for testing groups of covariates for significance instead of single covariates, see for example van de Geer and Stucky (2016), Mitra and Zhang (2016), and Guo et al. (2019). Group tests are of particular use as, with many (highly correlated) covariates, it might be overly ambitious to correctly detect the individual active variables, whereas groups of variables might be more realistic to detect. Hierarchical testing schemes are particularly attractive for this task; see for example Mandozzi and Bühlmann (2016) and Renaux et al. (2020). Secondly, we provide extensions of multicarving to generalized linear models. Pure post-selection inference in logistic linear regression is discussed in Taylor and Tibshirani (2018), who rely on asymptotic Gaussianity. As for the linear model, pure post-selection inference is suboptimal regarding power; thus we extend their argument to the data carving approach. We only provide a detailed discussion for the case of logistic linear regression, though similar adjustments could be made for other generalized linear models.

2 Methodology for high-dimensional post-selection inference

We first consider the methodological framework for linear models and summarize multisplitting (Section 2.2.1) as well as data carving (Section 2.2.2). This serves as a basis to develop our novel multicarving procedure for single covariates (Section 2.3) and extensions to group inference (Section 2.5) and to logistic regression or other generalized linear models (Section 2.6). While those developments focus on hypothesis testing, we discuss confidence intervals in Section 2.4.

2.1 High-dimensional linear model and inference for single variables

We assume to have a response vector y of length n and a (fixed) design matrix X with p columns, where p may exceed n. This yields a linear model of the form

y = X β^0 + ε,     (1)

where ε consists of i.i.d. Gaussian entries with known or unknown variance σ^2 and β^0 is the unknown parameter of interest. We represent vectors in boldface, whereas scalars and matrices are written in usual letters. We write y for a given realization of the random response vector. We use one subscript for the selection data (response, design matrix and sample size) and another for the data used for inference only. Further, we assume that the active set S of truly non-zero coefficients is sparse, such that inference using ordinary least squares would be possible on the data if the true active set were known.

After data-driven model selection, we deal with a subset M of the covariates of size |M|. We aim to perform inference based on this subset M. We write X_M for the matrix X restricted to the selected columns. Likewise, the selection and inference data restricted to the selected columns are denoted with the corresponding subscripts. Generally, a distinction has to be made whether we test

H_{0,j}: β^0_j = 0     (2)

for the entries of the full parameter vector β^0, or if the test is made with respect to

H_{0,j}^M: b^M_j = 0.     (3)

Here, b^M corresponds to the selected submodel and is defined as

b^M = X_M^+ E[y],     (4)

the best linear predictor in the given model. We write X_M^+ for (X_M^T X_M)^{-1} X_M^T, i.e. the generalized inverse of X_M. We introduce corresponding null hypotheses for groups of variables in Section 2.5.

Typically, an inference statement for (2) would be more favourable, since we are interested in the true underlying model. However, tests for (3) are valid under weaker assumptions.

Of particular interest is the screening property. Screening is defined as S ⊆ M, or in words, screening asks for all active variables being part of the selected model. If this holds, we have b^M = β^0_M. Thus, tests valid for (3) are also unbiased for (2) assuming screening. Importantly, screening is a requirement on the initial model selection process and not on the subsequent inference calculation.

We focus on model selection using the Lasso. The screening property for the Lasso is rather delicate to achieve in the finite sample case. However, it can be guaranteed with probability tending to one as the sample size grows, under adequate conditions. Such conditions are discussed in Meinshausen and Bühlmann (2006), Meinshausen and Yu (2009) and Bickel et al. (2009); see also the book by Bühlmann and van de Geer (2011).

2.2 Previously proposed methods

We first review some earlier work which serves as a basis for our new proposal in Section 2.3.

2.2.1 Multisplitting for inference

In this section, we briefly summarize the multisplitting method introduced in Meinshausen et al. (2009). Multisplitting works as follows:
For each split b = 1, …, B:

  1. Randomly split the data into two disjoint groups of prescribed sizes.

  2. Find the selected model using the first group.

  3. For the selected variables, calculate p-values using the second group with ordinary least squares; for all other variables, set the p-value to 1.

  4. Adjust the p-values by multiplying with the size of the selected model (Bonferroni adjustment) to correct for multiplicity.

The fourth step is designed to control the family-wise error rate (FWER). Throughout this work, we use lower case letters (p) for raw p-values that result from a test and upper case letters (P) for p-values resulting from any correction or aggregation. The default for splitting is to use half of the data for selection and half for inference. It remains to aggregate the B p-values per covariate j. Valid aggregation is possible by using a quantile of fixed fraction γ as

Q_j(γ) = min{ 1, q_γ( { P_j^[b] / γ : b = 1, …, B } ) },     (5)

with q_γ(·) being the empirical γ-quantile function. Since a good choice of γ might not be known a priori, one can also optimize over a range (γ_min, 1), where γ_min ∈ (0, 1). This yields a new p-value

P_j = min{ 1, (1 − log γ_min) · inf_{γ ∈ (γ_min, 1)} Q_j(γ) }.     (6)

The additional factor (1 − log γ_min) corrects for optimizing over all possible quantiles. A typical choice in the literature is γ_min = 0.05, yielding a correction factor of 1 − log(0.05) ≈ 4.
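To make the aggregation concrete, below is a minimal R sketch of (5) and (6). The grid over γ and the default γ_min = 0.05 are our illustrative choices, not values prescribed by this paper.

```r
## Quantile aggregation of per-split p-values, following Meinshausen et al. (2009).
## `pvals` is the vector of B (already Bonferroni-adjusted) p-values for one covariate.
aggregate_pvals <- function(pvals, gamma_min = 0.05) {
  gammas <- seq(gamma_min, 1, by = 0.001)
  ## Q_j(gamma) = min(1, gamma-quantile of {p / gamma}), cf. (5)
  Q <- sapply(gammas, function(g) min(1, quantile(pvals / g, probs = g)))
  ## Optimize over gamma and apply the (1 - log gamma_min) correction, cf. (6)
  min(1, (1 - log(gamma_min)) * min(Q))
}

## Toy usage: aggregate B = 50 p-values from hypothetical splits
set.seed(1)
aggregate_pvals(runif(50))
```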

Without any screening assumption, those p-values actually test the following null hypothesis for a given covariate j:

(7)

Given two conditions, Meinshausen et al. (2009) derive asymptotic (for n → ∞) FWER control with respect to null hypothesis (2). The conditions are:

(A1) Screening: for each split, the selected model contains the true active set with probability tending to one.
(A2) Sparsity: for each split, the size of the selected model is smaller than the number of observations used for inference, with probability tending to one.

The screening condition, as argued before, implies b^M = β^0_M and makes the inference statement valid for the true underlying parameter vector. The sparsity condition enables us to do least-squares inference, implicitly assuming that the design matrix restricted to the selected columns has full column rank on the inference data for all splits.

If screening held in the finite sample case as well, the error control could be formulated in a non-asymptotic sense. Although this is usually not the case, the simulations in Meinshausen et al. (2009) as well as ours show that multisplitting controls the type-I error with respect to (2) clearly better than single-splitting when screening cannot be guaranteed. This can be explained by the “p-value lottery”: every split results in different p-values for the selected variables. There are chances that some truly non-active variables appear significant for some splits. After aggregation, only variables that are significant in a decent number of splits remain significant overall. Due to the variability of these p-values over different splits, chances are that fewer non-active variables get rejected after aggregation than in the average single split. Thus, multisplitting leads to better error control.

2.2.2 Data carving

In this section, we discuss the idea of data carving introduced in Fithian et al. (2014). We focus on the special case of the linear model (1) with Lasso selection, which we will later extend to logistic regression and other generalized linear models. We emphasize that they provide a theoretical framework that could be applied to a much broader spectrum of problems.

The main conceptual idea of data carving is summarized in the following statement (Fithian et al., 2014): “The answer must be valid, given that the question was asked.” Thus, one should control the selective type-I error rate

(8)

i.e. the probability of falsely rejecting a null hypothesis, conditional on that hypothesis having been selected for testing, should be at most the nominal level. The hypothesis here is a generic notation for a hypothesis as e.g. in (3). Define the selection event as the event that the selection data lead to the given selected model. Then, the requirement (8) can be equivalently stated as

(9)

a conditional level guarantee given the selection event only. Simple data splitting, on the other hand, controls the corresponding error conditional on the entire selection data at the nominal level. Thus, more conditioning is done than would theoretically be needed, since the selection event does not contain all the information in the selection data but only guarantees that they resulted in the given selection.

To perform inference controlling the error in (9), one needs to understand the distribution of the data conditional on the selection. The first step is to understand the selection event itself. We focus on our case of interest, inference in the linear model (1) using Lasso selection. More precisely, let Lasso selection be defined as follows

β̂ = argmin_β (1/2) ||y − X β||_2^2 + λ ||β||_1.     (10)

There exist different definitions of the Lasso that are equivalent after rescaling. We use definition (10) following Lee et al. (2016), where this selection event is fully characterised. The set of response vectors that would lead to the same selected model forms a union of polyhedra. If we additionally condition on the signs of the parameters' Lasso estimates, this union shrinks to a single polyhedron. Dealing with a single polyhedron is easier both computationally and from a theoretical perspective. Hereafter, we additionally condition on the signs at the price of a small loss in power. This single polyhedron can easily be described by linear inequality constraints, e.g. {y : A y ≤ b}. Those constraints can be split into “active” and “inactive” constraints, which define statistically independent events. Further, the statistic used for inference, which is a function of X_M^T y, is independent of the inactive constraints, such that these can be ignored while conditioning on the active constraints. Therefore, we can ignore the inactive constraints for inference purposes. For simplicity, we subsequently refer to {y : A y ≤ b} as comprising the active constraints only.
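For reference, these constraints can be written down explicitly; the display below is our reading of the result in Lee et al. (2016) for the Lasso definition (10), with M the selected set, s the observed sign vector and P_M the projection onto the column space of X_M:

{ selected set = M, signs = s }
  = { y : −diag(s) (X_M^T X_M)^{-1} X_M^T y < −λ diag(s) (X_M^T X_M)^{-1} s }      (active constraints)
  ∩ { y : || (1/λ) X_{−M}^T (I − P_M) y + X_{−M}^T X_M (X_M^T X_M)^{-1} s ||_∞ < 1 }.   (inactive constraints)

Only the first set is kept for inference, as argued above.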

Fithian et al. (2014) elaborate how to handle the unknown parameters in a given model. As the mean of y is unknown, the conditional distribution is not tractable yet. To deal with this problem, one can treat the unknown parameters as nuisance parameters in an exponential family, which one can get rid of by conditioning accordingly. Generally, one has to decide between the “saturated model” and the “selected model”:

  • Saturated model: the mean of y has n degrees of freedom and the target of inference is the best linear predictor based on the selected model (cf. (4)).

  • Selected model: the mean of y has |M| degrees of freedom, i.e. the selected submodel completely defines the distribution.

If we consider the saturated model, which includes more parameters than the selected model, more conditioning has to be done. This leads to a drop in power but with the advantage that tests are valid for (3) without any screening assumption. The selected model view is generally more powerful since less conditioning is done, but it needs stronger assumptions to hold. The existence of a parameter vector supported on M that generates the mean of y is exactly the screening condition. If screening holds, either approach is valid to test (2). Since we are mainly interested in this null hypothesis, we focus on the selected model, leading to more powerful tests under screening. In Section 2.4, we elaborate further on the saturated method and its advantages.

Consider the selected model. To perform inference for a covariate j ∈ M, one has to condition on the statistics corresponding to all other selected covariates. After applying this conditioning, the random vector of interest is independent of the unknown nuisance parameters. This leads to a degenerate truncated multivariate Gaussian distribution with no more unknown nuisance parameters. The truncation is defined by the selection event. To test the null hypothesis, one further assumes that the coefficient of covariate j is zero. Thus, one is interested in

(11)

the probability of observing a value at least as extreme as the observed test statistic under this conditional null distribution. Note that we can use one-sided tests, since we implicitly condition on the sign of the estimate by restricting ourselves to the single polyhedron. If we have selected a correct model such that the selected model view is applicable, σ^2 is known, and j is not a truly active variable, then the resulting p-value is uniformly distributed. This null distribution is not easily tractable and thus the probability is hard to calculate. However, it can be sampled from using MCMC. This means that data carving achieves higher power compared to sample splitting at the price of a substantially higher computational cost. We present an applicable MCMC sampling scheme in Appendix B.
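Sampling a Gaussian vector restricted to such a polyhedron can be done, for example, with a hit-and-run scheme. The following R sketch illustrates one standard possibility; it is not necessarily the sampler described in Appendix B and assumes a non-degenerate covariance and a feasible starting point.

```r
## Hit-and-run sketch for sampling N(mu, Sigma) restricted to {x : A x <= b}.
sample_constrained_gaussian <- function(mu, Sigma, A, b, x0, n_iter = 1000) {
  R <- chol(Sigma)                                  # used to draw scaled random directions
  x <- x0                                           # x0 must satisfy A x0 <= b
  out <- matrix(NA, n_iter, length(mu))
  for (it in seq_len(n_iter)) {
    d <- as.numeric(t(R) %*% rnorm(length(mu)))     # random direction
    Ad <- as.numeric(A %*% d)
    slack <- as.numeric(b - A %*% x)
    t_hi <- min(c((slack / Ad)[Ad > 0], Inf))       # feasible step sizes with A(x + t d) <= b
    t_lo <- max(c((slack / Ad)[Ad < 0], -Inf))
    ## Along the line, t follows a univariate Gaussian truncated to [t_lo, t_hi];
    ## sample it by inverse-CDF.
    v <- as.numeric(t(d) %*% solve(Sigma, d))
    m <- as.numeric(t(d) %*% solve(Sigma, mu - x)) / v
    s <- 1 / sqrt(v)
    u <- runif(1, pnorm(t_lo, m, s), pnorm(t_hi, m, s))
    x <- x + qnorm(u, m, s) * d
    out[it, ] <- x
  }
  out
}
```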

In the saturated viewpoint, only one degree of freedom remains after conditioning (cf. Section 2.4). Therefore, one can deal with a univariate truncated normal such that the exact probability, the analogue of (11), can be calculated efficiently using the CDF of a Gaussian. Thus, the trade-off between the selected and the saturated model also involves a computational component.
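In the saturated case, the computation reduces to evaluating a univariate truncated Gaussian. A minimal R sketch, assuming the truncation limits implied by the selection polyhedron have already been computed:

```r
## One-sided p-value from a Gaussian truncated to [lower, upper] (saturated view,
## cf. Lee et al., 2016). `stat` is the observed statistic; `mean` and `sd` are its
## untruncated null mean and standard deviation.
truncnorm_pval <- function(stat, mean, sd, lower, upper) {
  denom <- pnorm(upper, mean, sd) - pnorm(lower, mean, sd)
  (pnorm(upper, mean, sd) - pnorm(stat, mean, sd)) / denom
}
```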


So far, we assumed σ^2 to be known. If this is not the case, σ^2 could be handled as a further nuisance parameter, which is resolved by additionally conditioning on ||y||_2^2. However, this nonlinear constraint disables some of the computational shortcuts which purely linear constraints allow for. In our simulations, we use some estimate of σ^2 wherever the variance is assumed to be unknown and proceed as if it were known initially. For completeness, we mention that the distribution when additionally conditioning on ||y||_2^2 is not Gaussian anymore. The corresponding null distribution can still be approximated using a different MCMC sampling technique. Note that this is only possible for the selected model. In the saturated model, one would end up imposing one quadratic and n − 1 linear equality constraints onto an n-dimensional vector. This would only leave two points to sample from such that no inference is possible.

2.3 Novel multicarving for valid inference

Meinshausen et al. (2009) have theoretically argued and empirically shown that splitting several times and aggregating is to be preferred over a single-split approach. On the other hand, Fithian et al. (2014) have shown that discarding all selection data in a splitting set-up is mathematically inadmissible and typically less efficient. To overcome this problem, they introduce the idea of data carving. Nevertheless, their approach potentially suffers from a similar p-value lottery as discussed in Meinshausen et al. (2009) since it is initiated by randomly splitting the data into two disjoint groups of given sizes; one for selection and inference, the other for inference only. Therefore, it is often difficult to replicate. Thus, we advocate the idea of applying data carving multiple times in order to a) overcome the p-value lottery and b) use tests that are theoretically admissible. We use the following procedure:
For each split b = 1, …, B:

  1. Randomly split the data into two disjoint groups, one of which is used for selection.

  2. Find the selected model using the selection group with Lasso selection.

  3. For the selected variables, calculate carving p-values for the given split and selected model according to (11); for all other variables, set the p-value to 1.

  4. Adjust the p-values by multiplying with the size of the selected model (Bonferroni adjustment) to correct for multiplicity.

As in multisplitting, we include the fourth step in order to control the FWER. Different corrections could be applied to obtain some less restrictive error control such as the false discovery rate (FDR), as discussed in Meinshausen et al. (2009). There is a trade-off involved in choosing the sizes of the two groups. The more observations we use for selection, the higher the probability of screening gets, which is needed for valid tests. On the other hand, more power remains for the second stage, the inference calculation, when more observations are reserved for inference. We empirically analyse this trade-off in our simulations in Section 4. To get one p-value per predictor, we use the same aggregation techniques as presented in Section 2.2.1, resulting in a single aggregated p-value per covariate. In our simulations, we focus on optimizing over the quantiles as described in (6) instead of using a fixed predefined quantile γ. To distinguish the different methods, we call this procedure multicarving and the method described in Section 2.2.2 single-carving.
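A schematic R version of this loop is sketched below. Here carve_pvalues() stands for the carving-inference computation of step 3 and is a hypothetical placeholder rather than an existing function, the selection fraction 0.9 is an arbitrary illustrative choice, and aggregate_pvals() refers to the aggregation helper sketched in Section 2.2.1.

```r
library(glmnet)

multicarve <- function(x, y, B = 50, frac = 0.9, gamma_min = 0.05) {
  n <- nrow(x); p <- ncol(x)
  pmat <- matrix(1, nrow = B, ncol = p)        # p-value 1 for unselected variables
  for (b in seq_len(B)) {
    idxA <- sample(n, floor(frac * n))         # selection group for this split
    fit  <- cv.glmnet(x[idxA, ], y[idxA])      # Lasso selection on the selection group
    sel  <- which(coef(fit, s = "lambda.min")[-1] != 0)
    if (length(sel) == 0) next
    pv <- carve_pvalues(x, y, idxA, sel)       # hypothetical carving step, cf. (11)
    pmat[b, sel] <- pmin(pv * length(sel), 1)  # Bonferroni adjustment (step 4)
  }
  apply(pmat, 2, aggregate_pvals, gamma_min = gamma_min)  # aggregation as in (6)
}
```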

2.4 Saturated view and confidence intervals

Naturally, one wants to perform inference without the screening assumption. As mentioned in Section 2.2.2, we can use the saturated model from Fithian et al. (2014) for this. In the saturated view, we do not assume the selected submodel to completely define the mean parameter but only to approximate it as in (4). In order to get rid of the unknown parameters and create a tractable distribution, we have to condition on the part of the data orthogonal to the direction corresponding to the tested covariate. As this conditioning has rank n − 1, there remains only one degree of freedom after conditioning, namely, in the direction corresponding to the tested covariate. Therefore, one deals with a univariate truncated Gaussian where the truncation comes from invoking the selection event. Inference statements can be calculated efficiently using the CDF of a univariate Gaussian. A detailed explanation of this procedure can be found in Lee et al. (2016).

This can be done regardless of the quality of the selected submodel. Therefore, the saturated viewpoint leads to valid tests for null hypotheses (3) (single-carving) and (7) (multicarving) without any screening assumption. However, if screening fails, there is generally no variable whose projected coefficient is exactly zero, and hence there cannot be any false positives with respect to those null hypotheses. Therefore, such tests for null hypotheses without any screening assumption are not of particular interest. Nevertheless, those tests can be used to determine confidence intervals. As for any test, confidence intervals for multicarving can be found by inverting it. Dezeure et al. (2015) give a detailed explanation of how to compute confidence intervals for multisplitting. This can be directly adopted to multicarving by calculating carving p-values but following the same scheme otherwise. For covariate j, this leads to a confidence interval (CI) such that

(12)

where the target parameters are defined through (4). This is of particular interest when the selected models, and hence these targets, differ over the different splits. Therefore, it appears natural to omit the screening assumption and to adopt the saturated model for our confidence intervals. Further, the use of the saturated model leads to more efficient computation.
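As an illustration of such an inversion, consider the R sketch below, in the spirit of Dezeure et al. (2015). Here pval_at() is a hypothetical function returning the aggregated multicarving p-value for the shifted null hypothesis that the j-th target coefficient equals a given value, and the search brackets are assumed to contain a sign change.

```r
## Confidence interval by inverting an aggregated p-value function.
ci_by_inversion <- function(pval_at, j, level = 0.95, search = c(-10, 10)) {
  alpha <- 1 - level
  f <- function(b) pval_at(j, beta0 = b) - alpha
  ## End points are where the p-value crosses alpha; uniroot() needs a sign
  ## change in each bracket (i.e. the p-value at 0 exceeds alpha).
  lower <- uniroot(f, c(search[1], 0))$root
  upper <- uniroot(f, c(0, search[2]))$root
  c(lower, upper)
}
```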

We focus on two-sided confidence intervals for two reasons. First, having both a lower and an upper bound might be more informative for a practitioner. Second, the projected parameter is not necessarily the same for all splits in which covariate j is selected, such that combining different splits into a one-sided confidence interval is not appropriate. Thus, the confidence intervals are not the exact inversion of the hypothesis tests.

Notably, if one were to apply simultaneous tests for different null hypotheses in the selected model, this could be done by calculating just a single MCMC chain and relying on the idea of importance sampling afterwards. However, to get a precise enough statement for such simultaneous tests, more MCMC samples might be required than for just calculating a single p-value, such that this extra statement does not come for free.

2.5 Extension to group testing

In a high-dimensional set-up with potentially correlated predictors, finding individual active variables is often too ambitious. Especially, the Lasso selector struggles with distinguishing between two or more highly correlated variables. Therefore, one might prefer to test several variables as a group. We define the null hypothesis for a given group G as

H_{0,G}: β^0_j = 0 for all j ∈ G,     (13)

for the full model coefficients. Let G_S = G ∩ M be the variables in our group that have been selected; then we define the null hypothesis in the selected model as

H_{0,G_S}^M: b^M_j = 0 for all j ∈ G_S.     (14)

The practitioner often wants to test multiple groups or test groups in a hierarchical fashion, say, in a data-driven way. Of course, a multiplicity correction has to be applied which is possible for any valid group test which controls the type I error. We refer to Meinshausen (2008) for a detailed explanation of a hierarchical testing procedure and corresponding multiple-testing correction; see also Mandozzi and Bühlmann (2016) and Renaux et al. (2020).

2.5.1 (Multi)splitting for group inference

Groups of variables can be tested for significance by splitting the data in the same way as single variables. The extension to groups follows naturally as in the low-dimensional case by applying partial F-tests instead of t-tests. This can be done either with a single split or multiple splits using the previously mentioned aggregation techniques (5) and (6).

2.5.2 (Multi)carving for group inference

The above-mentioned (multi)splitting techniques for group inference suffer from the same inadmissibility issue as in the single variable case, as more conditioning than necessary is applied. Therefore, we suggest a slight transformation of the data carving idea which makes it applicable to testing for group significance. We focus on the selected viewpoint, meaning that our derivation will only be valid if a correct model has been found. We emphasize that the saturated model could be extended to inference for groups with very similar adjustments.

Inference for a group follows the single variable case closely. Firstly, note that the selection event is completely unchanged by the idea of testing group significance afterwards, as we still apply the Lasso for model selection. Thus, we can still invoke the selection event by conditioning on the active constraints. Based on Fithian et al. (2014), one can see that, after the appropriate conditioning, the remaining distribution does not depend on the nuisance parameters, such that there are no more unknown parameters in our model under the null hypothesis (14). Due to this independence from the nuisance parameters, we can base the inference on the statistics corresponding to the group variables, or functions thereof. We advocate the use of a test statistic which, in words, is a directed sum of the projections onto all directions corresponding to the group variables, the directions of summation being given by the observed signs. Including the signs in our test statistic is valid, since we additionally condition on having observed the parameters' signs for the sampling procedure. This additional conditioning is not mandatory for valid inference but simplifies the computation (cf. Section 2.2.2). The use of the sum can be intuitively justified: potentially no variable has a significant effect by itself, but the group as a whole could have one.

We need to sample from the (approximate) null distribution to perform tests. As in the single variable case, the carving procedure leads to a Gaussian subject to linear equality and inequality constraints, which can be sampled from as presented in Appendix B with a few adjustments.

In contrast to testing single variables, the group problem remains multidimensional in the saturated view (for groups of size larger than one), as one conditions on all but the group variables. To sample from this saturated model, some more changes would be needed; especially, the conditioning in B.1 has to be adjusted, while B.2 has to be omitted.

With this group test at hand, further procedures could be derived such as applying the group test on several splits and aggregating them in the spirit of Section 2.3.

2.6 Extension to logistic regression

Not all data can be described and approximated well by the linear model given in (1). We extend the inference method to be applicable to generalized linear models and focus on logistic regression only in the following. Many of the ideas could carry over to other generalized linear models too, after applying the right transformations.

In logistic regression, we have a binary response vector y ∈ {0, 1}^n and some matrix of predictor variables X. For every entry y_i of y, the probability of being 1 is modelled as

P(y_i = 1) = exp(x_i^T β^0) / (1 + exp(x_i^T β^0)),     (15)

for some unknown parameter vector β^0, the target of our inference. We denote the i-th row of X by x_i.

In a classical, low-dimensional setting with p < n, this would be fitted using the MLE or, equivalently, by minimizing the negative of the log-likelihood. For an observation (x_i, y_i), the log-likelihood is defined as

ℓ(β; x_i, y_i) = y_i x_i^T β − log(1 + exp(x_i^T β)).

The negative of the above formula, summed over the observations, can be minimized e.g. by using a Newton algorithm, which leads to solving an iteratively reweighted least squares (IRLS) problem as derived in Hastie et al. (2009). Starting with some initial estimate β^(0), one iterates

β^(t+1) = argmin_β (z^(t) − X β)^T W^(t) (z^(t) − X β),

where we defined

p_i^(t) = exp(x_i^T β^(t)) / (1 + exp(x_i^T β^(t))),   W^(t) = diag(p_i^(t) (1 − p_i^(t))),   z^(t) = X β^(t) + (W^(t))^{-1} (y − p^(t)).

Thus, in every step a weighted least-squares problem with weight matrix W^(t), which iteratively changes, is solved. This explains the name of the procedure.

By further defining

X_w = (W^(t))^{1/2} X   and   y_w = (W^(t))^{1/2} z^(t),

this can be reformulated as a usual least-squares problem (Dezeure et al., 2015).

In the low-dimensional case, Dezeure et al. (2015) suggest to perform the inference as if the final iterate satisfied a Gaussian linear model in X_w and y_w. This approach is asymptotically valid because, if this was the case, one would have

β̂ ≈ N(β^0, (X^T W X)^{-1}),

which is the limiting distribution of the MLE. This can be seen by noting that the covariance matrix is the plug-in estimate of the inverse Fisher information.

As for the linear model (1), the MLE cannot be uniquely found for p > n since X^T W X is not invertible anymore. Therefore, one also depends on some sort of shrinkage. One can use the Lasso, i.e. an ℓ1-penalty, in the same fashion as for the linear model and solve the corresponding penalized likelihood minimization.

This minimizer can be found similarly as in the non-penalized case by adding the penalty term in every IRLS update (Friedman et al., 2010).

Thus, the final Lasso estimate will (approximately) fulfil

β̂ = argmin_β (1/2) ||y_w − X_w β||_2^2 + λ ||β||_1,

where X_w and y_w are functions of the estimate itself. As this is exactly a Lasso fit as in (10), the estimate will also fulfil the KKT criteria defined by X_w and y_w. Therefore, we can formulate a polyhedral constraint which the observed adjusted response y_w is required to fulfil.

In the high-dimensional case with Lasso selection, an obvious approach is to calculate inference statements as if the transformed data followed a Gaussian linear model, inspired by the inference techniques in the low-dimensional setting. In other words, one proceeds as in the usual Gaussian case using the transformed data X_w and y_w. This can be done likewise for either pure post-selection inference or data carving. Taylor and Tibshirani (2018) provide an argument for the first case. We empirically test the adaptation to data carving in our simulations without giving a full theoretical argument. Presumably, such an argument could follow using similar concepts as in Taylor and Tibshirani (2018).

Other types of generalized linear models are often fit in the same fashion using (penalized) IRLS. Whenever this is the case, one can apply our carving method to the transformed data X_w and y_w, which behave asymptotically Gaussian.
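A minimal R sketch of this transformation for the logistic case is given below. It uses glmnet for the logistic Lasso fit and the standard IRLS plug-ins; it is only meant to illustrate the construction, not to reproduce the exact implementation used in the paper.

```r
library(glmnet)

## Build the weighted data (X_w, y_w) at the logistic Lasso solution.
carving_transform_logistic <- function(x, y, lambda) {
  fit  <- glmnet(x, y, family = "binomial", lambda = lambda)
  eta  <- as.numeric(predict(fit, newx = x))   # linear predictor at the Lasso fit
  prob <- 1 / (1 + exp(-eta))                  # fitted probabilities
  w    <- prob * (1 - prob)                    # IRLS weights
  z    <- eta + (y - prob) / w                 # adjusted response
  list(xw = sqrt(w) * x, yw = sqrt(w) * z)     # treat these as approximately Gaussian data
}
```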

Multicarving and aggregation.

As in Section 2.3, we apply this method of calculating p-values to various splits and aggregate as described in Section 2.2.1. Those aggregation techniques are proven to be unbiased given screening, assuming of course that the aggregation is performed over p-values that are themselves valid given screening.

Here, the p-values are only asymptotically valid even under screening. Asymptotic validity of the aggregation over asymptotically valid p-values has not yet been theoretically studied in depth. Therefore, we cannot restate the same theoretical results for logistic regression as were derived in Meinshausen et al. (2009) for multisplitting and which we adapt in Section 3.1 for multicarving in a linear model. Nevertheless, in our simulations, applying multicarving to logistic regression does not result in any problem with type-I error control so that we can advocate its use.

3 Theoretical properties

We elaborate here the theoretical properties of multicarving and of the extension to group testing for (multi-)carving in the selected view, requiring the screening assumption in (A1). Without the screening assumption, (multi-)carving still controls the type-I error in great generality when taking the saturated view. Then, at the price of often being overly conservative, confidence intervals with guaranteed coverage should be preferred over tests; see also Sections 2.2.2 and 2.4. Throughout this section, we assume that the data follow the linear model (1) with Gaussian errors.

3.1 Multicarving for the linear model

Validity of our multicarve method follows naturally from validity of single-carving and multisplitting. Assuming screening in a given split and known variance, we know from the theory of data carving that the p-value defined in (11) is uniformly distributed for selected but truly inactive variables. Basically, this uniformity of the p-value is the only thing needed to construct the proofs of Theorems 3.1 and 3.2 in Meinshausen et al. (2009). Therefore, we can restate their theoretical results for the aggregation methods. However, we slightly alter the assumptions on the model selection procedure. We assume

(A1) Screening: for each split, the selected model contains the true active set with probability tending to one.
(Ã2) Sparsity: the size of the selected model stays below the total sample size, with probability tending to one.

The difference in the second condition stems from the fact that for splitting one has to invert the Gram matrix of the selected columns on the inference data only, while for data carving the Gram matrix of the selected columns on the full data has to be inverted. Actually, the condition needed is that this matrix has full rank, and we implicitly assume this to follow from the sparsity condition. Our simulations suggest using a large fraction of the data for selection, thus this altered sparsity assumption is less restrictive. Using those two conditions, we establish FWER control for our multicarve procedure.

Theorem 1.

Let the data be generated by the linear model (1) with Gaussian errors. Assume that (A1) and (Ã2) apply. Let γ ∈ (0, 1) be fixed. Let the per-split p-values be calculated as in Section 2.3 with known variance and let the aggregated p-values be computed according to (5) with a finite number of splits B. Then, the family-wise error rate with respect to the null hypotheses (2) is asymptotically controlled at the nominal level,

where the probability is with respect to the data sample. The statement holds regardless of the random sample splits.

The analogous result holds when aggregation is done not with a fixed quantile but with the optimized quantile and the adequate correction term.

Theorem 2.

Let the data be generated by the linear model (1) with Gaussian errors. Assume that (A1) and (Ã2) apply. Let the per-split p-values be calculated as in Section 2.3 with known variance and let the aggregated p-values be computed according to (6) with a finite number of splits B. Then, the family-wise error rate with respect to the null hypotheses (2) is asymptotically controlled at the nominal level,

where the probability is as in Theorem 1.

For the proofs, we refer to the appendix of Meinshausen et al. (2009), invoking the fact that, under our assumptions and under the null hypothesis, the per-split p-value is stochastically larger than or equal to a uniform random variable.

Two more technicalities have to be added in a practical set-up. First, in order for the uniformity assumption to hold, we depend on a good convergence of the MCMC approximation. Second, since we refrain from conditioning on ||y||_2^2, we need to know the variance, which is often rather unrealistic. However, we emphasize that the same theoretical result would hold in the unknown-variance case when actually using the conditioning trick. Further, when using an overestimate of σ^2, tests likely become more conservative such that type-I error control is at least as good as with the true variance parameter. However, this cannot be guaranteed in all cases. A discussion of this issue can for example be found in the supplemental materials of Tibshirani et al. (2018).

3.2 Data carving for group testing

In this section, we focus on the theoretical properties of our group test applied to a single group using a single split. Using Theorem 3, results for multicarving then follow from standard arguments.

At the base of our group test is the following lemma, which is proven in Appendix A.

Lemma 1.

Let the data be generated by the linear model (1) with Gaussian errors. Let G be some group with selected part G_S = G ∩ M, where G_S is non-empty. Assume that the screening property (A1) and (Ã2) hold, and that σ^2 is known. Then, the probability law of the test statistic, under the conditioning described in Section 2.5.2, is completely defined by our parameter of interest, the coefficients of the group variables.

Using this lemma, we can base our inference statement on the conditional distribution of the test statistic. For a given observation, we define our selected group p-value as

(16)

This probability can be calculated since we additionally condition on the only remaining unknowns in the model. Notably, this exactly defines the “probability of observing a value at least as extreme as the observed statistic” under null hypothesis (14). Thus, it fulfils the desired property of a p-value, which leads to the following theorem.

Theorem 3.

Let the data be generated by the linear model (1) with Gaussian errors. Assume that the screening property (A1) and (Ã2) hold, and that σ^2 is known. Let the selected group p-value for some group with non-empty selected part be calculated as in (16). Then, under null hypothesis (14), this p-value is valid, i.e. for any level, the probability that it falls below that level is at most the level itself.

Now further define a general group p-value for group G as

(17)

Then, we can establish error control of our procedure.

Theorem 4.

Let the data be generated by the linear model (1) with Gaussian errors. Assume that the screening property and (Ã2) hold, and that σ^2 is known. Let the group p-value for some group G be calculated as in (17). Then, under null hypothesis (13) and for any fixed level, the probability that this p-value falls below that level is at most the level itself.

The proof is likewise deferred to Appendix A. The technicalities mentioned at the end of Section 3.1 apply in the same fashion to our group test.

4 Numerical results

In this section, we provide detailed results on the performance of our proposed methods in simulation studies. All results were obtained using the programming language R (R Core Team, 2019). As an overall summary, we find that multicarving often exhibits an advantage, sometimes a substantial one, over multisplitting or single-carving methods.

4.1 Multicarving for the linear model

We tested our multicarve method for single variables in the linear model on several simulation set-ups, and we present here the results for two of them. We do not restrict ourselves to successful screening, we assume the variance to be unknown and estimate it, and lastly, we select our model through cross-validated Lasso, i.e. with a data-driven regularization parameter. All these choices deviate (in part only slightly) from our theoretical assumptions. In particular, by choosing λ through cross-validation, more information from the data is used than is invoked in the selection event, making the inference biased. There are first approaches to correct for this additional bias, for example, in Tian and Taylor (2018). However, we refrain from applying any of these, since they are computationally more involved and because our empirical results do not show any significant violation of the selective type-I error rate (8) when using cross-validation. Perhaps though, this should be done with a certain precaution, as e.g. Taylor and Tibshirani (2018) report bad error control using a cross-validated λ for post-selection inference in a Cox model.

We vary the number of splits B and the fraction of data used for selection. In order to keep this section well-arranged, we restrict ourselves to reporting results for a subset of these configurations. Generally, results for different numbers of splits are qualitatively similar, with a tendency to get slightly better with increasing B. Naturally, using the entire data for selection only makes sense for a single split.

For aggregation over the different splits, we optimize over quantiles as in (6). Starting with the default value of γ_min from the multisplitting literature, we noticed that this sometimes makes the procedure overly optimistic. Therefore, we additionally consider a larger value of γ_min for comparison. Using a larger γ_min is also favourable for computational reasons since fewer MCMC samples are required to be able to find a significant aggregated p-value at the smallest possible quantile, namely the γ_min-quantile; see Appendix C.3 for more details.

4.1.1 Toeplitz design

In a first scenario, we sample once from a multivariate Gaussian distribution with mean zero and a Toeplitz covariance matrix, and we then treat the result as fixed design. The dimensionality of the design is high relative to the sample size. The coefficient vector is 5-sparse, and all active predictors have the same coefficient value. The standard deviation of the noise is fixed, leading to a fixed signal-to-noise ratio (SNR). For each simulation run, the variance estimate is calculated through cross-validated Lasso on the entire data set and is used globally for all splits and inference methods.
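A sketch of how such a design can be generated in R is given below; all numerical values (dimensions, correlation, active positions, noise level) are illustrative placeholders, since the exact values are specified in the paper's simulation section.

```r
library(MASS)

set.seed(42)
n <- 100; p <- 500; rho <- 0.9                        # placeholder dimensions and correlation
Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))  # Toeplitz covariance Sigma_jk = rho^|j-k|
x <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)        # drawn once, then treated as fixed design
beta <- rep(0, p); beta[c(1, 5, 10, 15, 20)] <- 1     # 5-sparse signal (placeholder positions)
y <- x %*% beta + rnorm(n, sd = 2)                    # linear model (1), placeholder noise level
```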

In Figure 1, we present the outcome of the simulations for the Toeplitz design. Each performance measure is computed over the simulation runs. Although screening cannot always be guaranteed, FWER and power are calculated with respect to (2) at the given rejection level. Carving that uses the entire data for selection is performed using a different algorithm, namely the exact calculation from Lee et al. (2016). We emphasize this using a cross in the figures.

Figure 1: Results for the Toeplitz design. Results using a single split on the left, results using multiple splits on the right. On the x-axis: fraction of data used for the selection. On the y-axis: FWER depicted by symbols and power depicted by lines. For , the power is represented by a cross and the FWER is represented by a circle including a cross. Symbols for the FWER are slightly horizontally offset for better visibility. The horizontal line indicates the target level of the FWER at . The parameter for aggregation is defined in (6).
Figure 2: Results for the Toeplitz design for the adjusted power. Results using a single split on the left, results using multiple splits on the right. On the x-axis: fraction of data used for the selection. On the y-axis: adjusted power such that all methods have FWER of exactly ). For , the power is represented by a cross. The parameter for aggregation is defined in (6).

The left-hand side of Figure 1 illustrates that neither single-splitting nor single-carving controls the error at the nominal level for the lower selection fractions. However, this is not a violation of our theoretical result; error control would hold when only looking at runs with successful screening. For carving, the power initially increases in the selection fraction and decreases again for the larger values. This can be explained by the trade-off between more successful screening of the true active set and losing power for the inference stage as more constraints are imposed. The same holds for splitting and multicarving when looking at lower selection fractions, as eventually too few active variables are selected in the first stage such that no decent power remains. As indicated by the inadmissibility statement in Fithian et al. (2014), carving outperforms splitting with respect to power. The important question is now whether multicarving introduces some improvement over single-carving. The single-carve method has the highest power starting from a moderate selection fraction, where the lowest fractions can be basically ignored as error control is not given at all. The multicarve method with the larger γ_min performs best among all carving methods regarding FWER for all selection fractions. Multicarving with the smaller γ_min seems to be inferior in this scenario. Thus, there is a trade-off between higher power and better error control. The highest power is obtained at the same selection fraction for all carving methods, with single-carving attaining the largest value, followed by the two multicarving variants. So, the single-carve method is favourable in this situation.

However, this comparison is not quite fair since the methods have different FWER. Therefore, we additionally look at an adjusted power, i.e. the rejection level is adjusted such that each method has an FWER of exactly ; see Figure 2. Carving is still superior to splitting although the multisplit method with is now competitive for lower values of . All three carving methods reach their optimum at , with an adjusted power of (single-carving), () and ().

Saturated viewpoint.

As discussed in Section 2.4, testing for null hypotheses (3) (single-carving) and (7) (multicarving) while omitting the screening assumption is not particularly meaningful, as the projected parameter vector is then fully dense. Therefore, the saturated viewpoint without the screening assumption has no advantage for testing null hypotheses. However, in order to assess the power drop mentioned in Section 2.2.2, we test for null hypothesis (2) using inference in the saturated model. For the set-up discussed above, this leads to the following performance measures. The highest power is for single-carving (), for multicarving with () and for multicarving with (). The corresponding highest adjusted power is (single-carving), () and (), all of which obtained at . Thus, the saturated approach leads to lower power and adjusted power, as anticipated. However, this drop is less distinct for the adjusted power, as the additional conservatism also leads to better type-I error control. Furthermore, we see that for multicarving the differences are less pronounced than for single-carving. For computational reasons, the saturated viewpoint might, therefore, be an interesting alternative to our multicarve procedure.

In Section 2.4, we further introduced the idea of multicarving confidence intervals, where omitting the screening assumption and using the saturated method appears to be more natural. We present a corresponding analysis in Section 4.2.

4.1.2 Semi-synthetic Riboflavin data

Since simulated data sometimes behave somewhat more nicely than real data, we also test the methods on “semi-synthetic” set-ups, meaning that the design matrix comes from a real data set. We simulate the response from (1) with a known coefficient vector.

We use the Riboflavin data set with n = 71 and p = 4088, which was made publicly available by Bühlmann et al. (2014). The original response measures the riboflavin production rate for the 71 samples of strains of Bacillus subtilis and gives the data set its name. The matrix X contains the log-expression levels of the 4088 genes for each of these strains.

For our simulations, we set the coefficient vector to be 2-sparse and use a high SNR. The two active variables are chosen at random for every simulation run and their respective coefficient is set to 1. Since this can result in a very different signal strength depending on the correlation between the 2 variables, we fix the SNR on a per-run basis by always adjusting the noise level accordingly. Here, the SNR is defined through the empirical variance of the true underlying signal. We choose this rather sparse set-up with high SNR since otherwise Lasso selection works very poorly in this high-dimensional set-up and none of the inference methods has good performance. To illustrate this, we repeat the same simulation with more active predictors; compare with Appendix C.1.

For the selection, we again perform cross-validation on the given split. To be more realistic, we stick to the assumption of unknown variance. With the estimation technique described before, we realized that the variance estimate is empirically quite low in this scenario. Therefore, we choose the more conservative approach of calculating a new estimate for every split as

(18)

where the coefficient estimate is calculated on the selection data only, but the residuals are taken with respect to the full data.
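A possible R sketch of such a per-split estimate, under our reading of (18), is given below; the exact normalization used in the paper may differ.

```r
## Per-split variance estimate: Lasso fit on the selection half, residuals over all data.
sigma_hat_split <- function(x, y, idxA) {
  fit  <- glmnet::cv.glmnet(x[idxA, ], y[idxA])
  beta <- as.numeric(coef(fit, s = "lambda.min"))    # intercept plus p coefficients
  res  <- y - cbind(1, x) %*% beta                   # residuals on the full data
  df   <- sum(beta[-1] != 0)                         # size of the selected model
  sqrt(sum(res^2) / (length(y) - df - 1))
}
```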

Figure 3: Results for the Riboflavin with sparsity 2. See caption of Figure 1.
Figure 4: Results for the Riboflavin with sparsity 2 for the adjusted power. See caption of Figure 2.

The results obtained for the Riboflavin data with a sparsity of 2 are shown in Figures 3 (FWER and power) and 4 (adjusted power). This set-up is now highly in favour of our multicarve method. In particular, the highest power obtained is (single-carving), () and (); see Figure 3. The multicarve methods reach this maximum at , while single-carving only obtains error control starting from and higher. There is a power versus FWER trade-off between the two different values of γ_min.

The adjusted power is slightly in favour of the lower value as illustrated in Figure 4. More precisely, the highest adjusted power is () and () for the multicarve method. Both these values are obtained for . Single-carving reaches its maximum of at . Thus, the adjusted power clearly prefers multicarving as well.

We note that although we increase both the SNR and the sparsity, the adjusted power is not (much) better than in the previous set-up. This can be intuitively explained by the following two reasons: First, the dimensionality p in the Riboflavin design is much larger than in our Toeplitz design. Second, there are variables with a very high empirical correlation, making them hardly distinguishable in the selection stage.

4.2 Confidence intervals

We apply our method for confidence intervals to the same set-up with the design simulated from a multivariate normal distribution with Toeplitz covariance matrix as in Section 4.1.1. As we explicitly omit the screening assumption, we use a different variance estimate for every split, as in (18). The parameters in (12) are calculated including an intercept. Naturally, whenever screening works, this intercept term vanishes.

We use multiple splits and aggregate according to (6). The obtained intervals are targeted to be confidence intervals at a fixed nominal coverage level.

Median interval length for -CI Average false coverage
number of active variables among
Method 1 5 10 15 20 all tested
Splitting
Carving
Splitting
Carving
Splitting
Carving
Splitting - - - - - - -
Carving
Splitting - - - - - - -
Carving
Table 1: Median length for active variables and average false coverage rate of the confidence intervals. The left-hand side displays the median interval length obtained for the true active predictors, i.e. . The average false coverage rate of the obtained confidence intervals is shown on the right-hand side. This rate is calculated either with respect to all variables or only with respect to variables that are actually tested for, i.e. variables that are at least selected once within the splits. In this analysis, variables not selected at all are assigned an infinite confidence interval such that no false coverage can occur.

In Tables 1 and 2, we compare the performance of our carving confidence intervals to the ones obtained using multisplitting as implemented in Dezeure et al. (2015). Those results are based on the simulation runs. The obtained intervals are generally rather conservative as the false coverage rate is always far below the theoretical bound. Notably, for the lower selection fractions, the intervals obtained through carving are not actually shorter than those from splitting. The advantage of carving is that the intervals first get shorter when increasing the selection fraction. By increasing it, the selected models become more stable and, likewise, the selected model differs less over the different splits. Due to this more stable selection, shorter intervals are theoretically possible with a higher selection fraction. However, multisplitting cannot profit from this as too little information for the inference stage remains after increasing the selection fraction. The same holds for carving when the fraction becomes too large. The best performing method is carving at a relatively high selection fraction, which outperforms every other configuration with respect to at least three interval lengths. Further, it also performs comparably well with respect to the false coverage rate, as every configuration with a lower false coverage rate suffers from substantially longer intervals.

In a further analysis, we look at the lengths of the confidence intervals of all covariates that were selected at least once within the splits. For the other variables, there is no real interpretation of the coverage in Equation (12). Further, not selecting a covariate at all over the splits is a rather strong indication that the variable is generally inactive, such that treating it as if it had an infinite confidence interval length does not seem appropriate. However, there are still many variables obtaining an infinite interval length, namely, those that are selected at least once but less often than required, i.e. only once or twice in this set-up.

In Table 2, we report, over the simulation runs, the median of several quantiles of the interval lengths among the selected variables. Due to the possibility of infinite interval lengths, we focus on quantiles instead of averages.

Quantile
Method
Splitting
Carving
Splitting
Carving
Splitting
Carving
Splitting - - - - -
Carving
Splitting - - - - -
Carving
Table 2: Results for the lengths of the confidence intervals. The median over the simulation runs is taken of several quantiles of the lengths of the CIs of variables that were selected at least once among the splits.

Again, we note that for the lower selection fractions, the intervals obtained through multicarving are longer than those from multisplitting. However, the power of multicarving comes from the ability to raise the selection fraction without losing all information for the inference stage. The selected models become more stable for larger selection fractions and fewer covariates are selected in total over the splits. The total number of distinct variables selected over all the splits is 96 and 20 on average for the two respective selection fractions. With fewer features under consideration, a higher fraction of those is selected sufficiently often such that powerful inference is possible. Those effects are visible in Table 2 as the quantiles of the intervals using multicarving mostly become shorter when increasing the selection fraction. For carving, there is also a natural countereffect as information for the inference stage is lost, thus the quantiles of interval lengths are not strictly decreasing.

In summary, our confidence intervals attain the desired coverage stated in Equation (12). Further, multicarving brings an advantage over multisplitting due to the possibility of performing well with a higher selection fraction .

4.3 Data carving for group testing

In order to see how well our group test performs, we compare it with the results presented in Guo et al. (2019). The authors consider two scenarios, testing either a small or a large group, based on data simulated using different covariance structures. Testing a large group in a dense scenario is described below. Results of group testing for a small group in a sparse, highly correlated scenario are presented in Appendix C.2.

The dense alternative with many small non-zero coefficients is a set-up where testing single variables is difficult. More precisely, is and is varied in . The feature matrix is generated from normally distributed features having a Toeplitz covariance matrix with . The parameter vector is defined as for and otherwise. We vary over where corresponds to the global null. The response vector follows our linear model (1) with . This leads to SNR in . We are interested in testing null hypothesis (13) for the group .
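As a minimal sketch of how such a design can be generated, the snippet below draws Gaussian features with a Toeplitz covariance and a dense parameter vector. All numerical values (n, p, the correlation rho, the coefficient size delta and the tested group) are illustrative placeholders rather than the exact settings described above.

```r
# Sketch of a dense-alternative design (all numbers are illustrative placeholders).
library(MASS)   # for mvrnorm

set.seed(1)
n <- 100; p <- 500                     # placeholder dimensions
rho <- 0.5                             # placeholder Toeplitz correlation
Sigma <- toeplitz(rho^(0:(p - 1)))
x <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

group <- 1:10                          # placeholder group G to be tested
delta <- 0.1                           # placeholder (small) coefficient size
beta <- rep(0, p)
beta[group] <- delta                   # dense alternative: many small non-zero coefficients

y <- drop(x %*% beta) + rnorm(n)       # linear model with unit error variance
```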

4.3.1 Single-carving for group testing

We perform inference in either set-up using our group test introduced in Section 2.5. As in Section 4.1, we vary the fraction of data used for selection in . We start with a single split, i.e. , for inference. Notably, for the group test, inference using is obtained with MCMC sampling as well. Since we condition on all but the covariates of interest, we generally have more than one degree of freedom, such that a simple calculation as in Lee et al. (2016) is not possible. The only exception is , which is algorithmically equivalent to single-variable testing.

For the selection, we perform cross-validation. Based on the assumption that the Lasso might eliminate many of the covariates with weak signal, we use instead of . To estimate the variance parameter , we use a global estimate obtained with cross-validation and on all data.
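A minimal sketch of such a selection and variance-estimation step is shown below. The specific cross-validation rule referred to above is not reproduced here; the snippet simply shows the two standard glmnet rules (lambda.min and lambda.1se), and the residual-based variance estimate from a cross-validated fit on all data is our reading of the description, not necessarily the exact estimator used. The data-generating lines are placeholders.

```r
# Sketch: cross-validated Lasso selection on the selection part of a split,
# plus a global variance estimate from a cross-validated fit on all data.
library(glmnet)

set.seed(2)
n <- 100; p <- 500                                 # placeholder data
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:10] %*% rep(0.1, 10)) + rnorm(n)

idx_sel <- sample(n, floor(0.75 * n))              # selection fraction (placeholder 0.75)

cv_sel <- cv.glmnet(x[idx_sel, ], y[idx_sel])
# Two common rules; the main text uses one of them:
sel_min <- which(as.matrix(coef(cv_sel, s = "lambda.min"))[-1, 1] != 0)
sel_1se <- which(as.matrix(coef(cv_sel, s = "lambda.1se"))[-1, 1] != 0)

# Global variance estimate via cross-validated Lasso on all data
# (assumption: simple degrees-of-freedom-corrected residual variance).
cv_all <- cv.glmnet(x, y)
res    <- y - drop(predict(cv_all, newx = x, s = "lambda.min"))
df     <- sum(as.matrix(coef(cv_all, s = "lambda.min"))[-1, 1] != 0)
sigma2_hat <- sum(res^2) / (n - df - 1)
```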

In Table 3, we show the results for the dense alternative. For each combination of , , and , we report the empirical rejection rate (ERR), i.e. the fraction of the 200 simulation runs in which the null hypothesis is rejected at level . For , this measures the type-I error; for , it measures the power.

Table 3: Empirical rejection rate at level for the dense alternative using single-carving.

For fixed and , the power increases in the number of observations , and for fixed and , it increases in the signal strength . This conclusion is to be expected.

The FWER is controlled for all combinations of and , for most combinations even conservatively. The fraction always has the lowest power because the selection does not work particularly well. In many settings, is also suboptimal with respect to power, as too little data remains for the inference stage. Fractions to are all competitive and perform similarly. This is in good accordance with our results for testing single variables in Section 4.1.

Table 3 can now be compared to Guo et al. (2019, Table 1), where six different methods are evaluated in this scenario. Ignoring the two for which error control is not given at all, our method with fractions between and is amongst the best with respect to power in each set-up. In particular, it has clearly higher power than their method for , whereas the power is similar for higher . However, their method controls the error more conservatively, such that a clear statement in favour of either method is not possible.

Summarizing the results from the dense scenario in this section and the sparse scenario in Appendix C.2, we can state that our method does not have the best performance in all possible set-ups. However, it is competitive in all of them, whereas every competitor has some set-ups in which it does not work well at all. Thus, our group test, which results from a very simple adjustment of the data carving idea, is a valuable addition.

4.3.2 Multicarving for group testing

In Section 4.1, we see that the multicarve method usually has better error control than single-carving. Based on this observation, it is to be expected that multicarving could further improve on group inference in scenarios where the error is not controlled conservatively (cf. Table 3). Therefore, we test multicarving for group testing as well. Indeed, with multicarving, no ERR above the target level occurs for in either alternative. However, the ERR for , i.e. the power, is sensitive to the choice of the tuning parameters , /, and . In particular, in these two scenarios, aggregation using a fixed quantile clearly outperforms an optimized quantile according to Equation (6); both rules are sketched below.
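As a minimal sketch, the fixed-quantile and optimized-quantile aggregation of per-split p-values can be written as follows, following Meinshausen et al. (2009); the notation and default values here are generic and need not match Equations (5) and (6) exactly.

```r
# Sketch of p-value aggregation over B splits (cf. Meinshausen et al., 2009).
# p: vector of per-split p-values for one variable or group.

aggregate_fixed <- function(p, gamma) {
  # fixed-quantile rule; gamma = 1/B reduces to a Bonferroni-corrected minimum
  min(1, quantile(p / gamma, probs = gamma, type = 1))
}

aggregate_adaptive <- function(p, gamma_min = 0.05) {
  # optimize over quantiles and correct for the optimization
  gammas <- seq(gamma_min, 1, by = 0.01)
  min(1, (1 - log(gamma_min)) * min(sapply(gammas, function(g) aggregate_fixed(p, g))))
}

# Example with hypothetical p-values from B = 10 splits:
p_splits <- c(0.001, 0.02, 0.5, 1, 1, 0.03, 0.8, 1, 0.004, 0.6)
aggregate_fixed(p_splits, gamma = 1 / length(p_splits))
aggregate_adaptive(p_splits)
```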

In the following, in order to show the possibilities of multicarving, we present results obtained using splits and a fixed quantile of for the aggregation. We emphasize that these choices were found to work comparably well here; in general, when no such comparison is possible, one could expect slightly lower power when using multicarving for group testing. The results are shown in Table 4.

Table 4: Empirical rejection rate at level for the dense alternative using multicarving.

We consider the dense alternative. For multicarving, the highest ERR for is , whereas it is for single-carving. Naturally, there is some fluctuation involved in these empirical values. Nevertheless, the difference indicates an improvement of multicarving over single-carving. For most scenarios with , a selection fraction of is favourable. The intuitive explanation is that although the group is not tested as often as with higher fractions , it is still tested in a decent number of splits. In those splits, the lower allows for a more powerful inference statement, making the method more powerful overall after aggregation. Notably, using and (fixed quantile for aggregation) is equivalent to a Bonferroni-corrected minimum p-value (cf. Equation (5)), so only the most significant split matters. We now compare the power in Table 4 to that of single-carving in Table 3. Using a selection fraction of , multicarving outperforms every single-carving configuration in all scenarios except and . Thus, using multiple splits and aggregating can bring a clear improvement. However, this is rather sensitive to the choice of the tuning parameters, as mentioned above.

In summary, the natural extension of our group test to multiple splits leads to a performance boost. In particular, the error is controlled at a more conservative level when using multiple splits. A drawback of the method is its sensitivity to the tuning parameters: if these happen to be chosen poorly, the power might be lower than for single-carving.

4.4 Multicarving for logistic regression

We conduct a similar simulation study as in Section 4.1 for the logistic model (15). We reuse the matrix coming from a Toeplitz covariance design from Section 4.1 with dimensions and . The active variables are , each of which has a coefficient of . After noticing that in logistic regression, at least in this set-up, cross-validated Lasso tends to select overly sparse models, we alter the selection technique. Namely, we select the Lasso model with a given number of selected variables or, if there is no such model, the largest model with fewer variables; a sketch of this rule is given below. Inspired by Meinshausen et al. (2009), we choose this number to be . Just as for cross-validated Lasso, this introduces a slight bias to our test, as is determined in a data-dependent fashion and is not predefined. We stick to our usual tuning parameters, i.e. is varied in and in . The target level for the FWER remains at .
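A minimal sketch of this model-size-based selection rule, with a hypothetical target size k and generic function names (not the exact implementation used here), could look as follows:

```r
# Sketch: pick the Lasso model on the glmnet path with exactly k selected
# variables or, failing that, the largest model with fewer than k variables.
library(glmnet)

select_k_variables <- function(x, y, k, family = "binomial") {
  fit <- glmnet(x, y, family = family)
  sizes <- fit$df                        # number of non-zero coefficients per lambda
  ok <- which(sizes == k)
  if (length(ok) == 0) {
    cand <- which(sizes < k)
    ok <- cand[which.max(sizes[cand])]   # largest model with fewer than k variables
  }
  lambda_star <- fit$lambda[max(ok)]     # among ties, take the least penalized fit
  which(as.matrix(coef(fit, s = lambda_star))[-1, 1] != 0)
}

# Hypothetical usage on the selection part of a split:
# sel <- select_k_variables(x[idx_sel, ], y[idx_sel], k = 15)
```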

Figure 5: Results for the Toeplitz design in logistic regression. See caption of Figure 1. Note that the range of values on the y-axis is different compared to all the other figures.
Figure 6: Results for the Toeplitz design in logistic regression. See caption of Figure 2.

Figures 5 (FWER and power) and 6 (adjusted power) illustrate the same performance statistics as for the simulation examples in Section 4.1. Every performance measure corresponds to simulation runs.

All methods are rather conservative in this set-up. In particular, no value of the FWER above the level occurs. Furthermore, for splitting, there are no p-values below the significance level at all. There probably exist better algorithms for calculating low-dimensional p-values in logistic regression than the one used for splitting here. However, as this is not of primary interest to our work, we did not investigate it further. Single-carving has clearly higher power than multicarving, whereas the latter controls the error at a more conservative level. The highest power obtained is (single-carving), () and (). All these maxima are reached at . Pure post-selection inference has a power of . Thus, the conjecture that the constraints might be too restrictive is confirmed.

For the trade-off between power and error control, we consider the adjusted power as defined in Section 4.1. Interestingly, multisplitting is now quite competitive. The interpretation is that although the p-values are generally larger than , there is still a distinction between active and non-active variables. The best adjusted power of the multisplit method is . As the curve seems to increase towards lower values of , we further tested and . Neither leads to an increase in the adjusted power for multisplitting, such that we can assume the optimum is reached around . Multicarving clearly outperforms single-carving, with the respective maxima being at (), () and (single-carving). Pure post-selection inference obtains an adjusted power of .

In summary, we can state for these data that either carving method improves on pure post-selection inference. The choice between multicarving and single-carving is a trade-off between power and FWER. Our definition of adjusted power, which puts the different methods on an equal FWER footing, favours multicarving.

4.5 Runtime considerations

Our method performs empirically well but is computationally quite involved. Details are discussed in Appendix C.4. The computational bottleneck is the MCMC sampling required to calculate the p-values; therefore, we ignore the other steps in our considerations. An approximate bound for multicarving is , where the expectation is due to the fact that is non-constant over the splits.

Another popular inference technique for high-dimensional statistics is the de-biased Lasso (van de Geer et al., 2014). A total of Lasso fits have to be calculated on the entire data. Thus, it scales as . For high-dimensional data with and the standard assumption on the Lasso, we have accordingly . Then, our multicarve method is more efficient than the de-biased Lasso for if .

5 Discussion and conclusions

We provide new developments based on the idea of data carving (Fithian et al., 2014). Particularly for high-dimensional scenarios, we improve upon standard data carving.

First, we introduce multicarving in the spirit of multisplitting. Our simulation study shows that multicarving generally leads to better error control and that its adjusted power is higher than that of the single-carve method. Furthermore, multisplitting and multicarving not only aim to reduce the FWER but also to make results more replicable. It is very plausible that our multicarve method clearly increases replicability compared to single-carving, due to the instability of the Lasso model selector.

Second, we present group inference, a natural extension of single variable testing. Such a group test can be applied using single-carving or using the advocated multicarving. In simulation examples, either variant appears to be competitive to several methods discussed in Guo et al. (2019).

Last, we adapt data carving to make it applicable to logistic regression and other generalized linear models. These adjustments are based on the central limit theorem and follow from similar ideas as already introduced for low-dimensional data and for pure post-selection inference. Our simulation study leads to the same conclusions as for the linear model. In particular, in the logistic case, data (multi)carving also leads to a performance increase compared to pure post-selection inference.

Some user-friendly software for multicarving will be available soon within the R-package hdi (Dezeure et al., 2015).

Acknowledgment

The research of P. Bühlmann was supported in part by the European Research Council under the Grant Agreement No 786461 (CausalStats - ERC-2017-ADG).

References

  • Berk et al. (2013) Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013). Valid post-selection inference. The Annals of Statistics, 41(2):802–837.
  • Bickel et al. (2009) Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.
  • Bühlmann (2013) Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242.
  • Bühlmann et al. (2014) Bühlmann, P., Kalisch, M., and Meier, L. (2014). High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1:255–278.
  • Bühlmann and van de Geer (2011) Bühlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
  • Cai and Guo (2017) Cai, T. T. and Guo, Z. (2017). Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of Statistics, 45(2):615–646.
  • Dezeure et al. (2015) Dezeure, R., Bühlmann, P., Meier, L., and Meinshausen, N. (2015). High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statistical Science, 30(4):533–558.
  • Fithian et al. (2014) Fithian, W., Sun, D., and Taylor, J. (2014). Optimal inference after model selection. arXiv preprint arXiv:1410.2597.
  • Friedman et al. (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.
  • Guo et al. (2019) Guo, Z., Renaux, C., Bühlmann, P., and Cai, T. T. (2019). Group inference in high dimensions with applications to hierarchical testing. arXiv preprint arXiv:1909.01503.
  • Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
  • Javanmard and Montanari (2014) Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(Oct):2869–2909.
  • Lee et al. (2016) Lee, J. D., Sun, D. L., Sun, Y., and Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927.
  • Leeb and Pötscher (2003) Leeb, H. and Pötscher, B. M. (2003). The finite-sample distribution of post-model-selection estimators and uniform versus nonuniform approximations. Econometric Theory, 19(1):100–142.
  • Mandozzi and Bühlmann (2016) Mandozzi, J. and Bühlmann, P. (2016). Hierarchical testing in the high-dimensional setting with correlated variables. Journal of the American Statistical Association, 111:331–343.
  • Meinshausen (2008) Meinshausen, N. (2008). Hierarchical testing of variable importance. Biometrika, 95(2):265–278.
  • Meinshausen and Bühlmann (2006) Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
  • Meinshausen et al. (2009) Meinshausen, N., Meier, L., and Bühlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association, 104(488):1671–1681.
  • Meinshausen and Yu (2009) Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246–270.
  • Mitra and Zhang (2016) Mitra, R. and Zhang, C.-H. (2016). The benefit of group sparsity in group inference with de-biased scaled group lasso. Electronic Journal of Statistics, 10(2):1829–1873.
  • Pakman (2015) Pakman, A. (2015). tmg: Truncated Multivariate Gaussian Sampling. R package version 0.3.
  • Pakman and Paninski (2014) Pakman, A. and Paninski, L. (2014). Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542.
  • R Core Team (2019) R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Renaux et al. (2020) Renaux, C., Buzdugan, L., Kalisch, M., and Bühlmann, P. (2020). Hierarchical inference for genome-wide association studies: a view on methodology with software (with discussion). Computational Statistics, 35(1):1–40.
  • Taylor and Tibshirani (2018) Taylor, J. and Tibshirani, R. (2018). Post-selection inference for ℓ1-penalized likelihood models. Canadian Journal of Statistics, 46(1):41–61.
  • Tian and Taylor (2018) Tian, X. and Taylor, J. (2018). Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710.
  • Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
  • Tibshirani et al. (2018) Tibshirani, R. J., Rinaldo, A., Tibshirani, R., and Wasserman, L. (2018). Uniform asymptotic inference and the bootstrap after model selection. The Annals of Statistics, 46(3):1255–1287.
  • Tibshirani et al. (2016) Tibshirani, R. J., Taylor, J., Lockhart, R., and Tibshirani, R. (2016). Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620.
  • van de Geer et al. (2014) van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.
  • van de Geer and Stucky (2016) van de Geer, S. and Stucky, B. (2016). χ²-confidence sets in high-dimensional regression. In Statistical analysis for high-dimensional data, pages 279–306. Springer.
  • Wasserman and Roeder (2009) Wasserman, L. and Roeder, K. (2009). High dimensional variable selection. The Annals of Statistics, 37(5A):2178–2201.
  • Zhang and Zhang (2014) Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242.

Appendix A Proofs

Proof of Lemma 1

We require Assumption (Ã2) such that is defined. As in Section 3.1, we implicitly assume to follow from the sparsity condition. This inverse is implicitly included in and . Using the screening assumption, we know . Thus, we can write the unconditional distribution of as follows

where denotes the normalizing constant of the Gaussian distribution. We see that is the sufficient statistic, while is the natural parameter, as is assumed to be known. Conditioning on the selection event leads to a different exponential family with the same sufficient statistic and natural parameter but a different normalizing constant, say ; compare with Fithian et al. (2014, Section 3).
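For concreteness, the exponential-family structure used here can be made explicit as follows; the display below is written in generic notation for a selected design $X_M$, coefficient vector $\beta$ and known variance $\sigma^2$ (our notation for this illustration, not necessarily that of the main text):

```latex
% Gaussian density in exponential-family form
f_{\beta}(y) \;=\; (2\pi\sigma^2)^{-n/2}
  \exp\!\Big(-\tfrac{\|y\|_2^2}{2\sigma^2}\Big)\,
  \exp\!\Big(\tfrac{\beta^\top X_M^\top y}{\sigma^2}
             \;-\; \tfrac{\|X_M\beta\|_2^2}{2\sigma^2}\Big),
```

so that $X_M^\top y$ acts as the sufficient statistic and $\beta/\sigma^2$ as the natural parameter; conditioning on the selection event only changes the normalizing constant.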

Here, we split into the parameter that we want to perform inference for and the nuisance parameter in the model . From the theory of exponential families, we know that the conditional law does not depend on . We now want to establish the same result for . For simplicity, we assume such that it can be separated into the variables that are part of the group and the others. The result holds w.l.o.g., since permutations of the columns of the matrix do not change our inference statement. Then, we get

Thus, is a fixed affine transform of , making it independent of as well. Naturally, the subset is conditionally independent too. Based on our two assumptions, the only parameters in the model are and . Thus, after establishing independence of the former, the only parameter left in the model is the latter, which is exactly the statement of the lemma.

Proof of Theorem 4

For a group , we either have or . Assume the former case first. Due to screening, we know , which leads to as . Null hypothesis (13) then directly implies

which corresponds to null hypothesis (14). Therefore, all assumptions of Theorem 3 are fulfilled, leading to the uniform distribution of the p-value. Error control can thus be stated as

In the other case () we have

Thus, we obtain error control in either case, which completes the proof.

Appendix B Sampling from a linearly constrained Gaussian

The algorithm presented in this section is strongly based on the GitHub repository cited in Fithian et al. (2014) for their simulations. However, since there seems to be no written documentation of the algorithm itself and the theory behind it, we provide it here for the interested reader.

For simplicity, we will suppress index , since we implicitly assume to work in a selected submodel throughout this section.

In order to do inference for variable , the goal is to sample from subject to , and . The first condition leads to boundaries on the sampling region, the second one changes both the mean parameter and the covariance matrix, and the last one further changes the mean and creates a null distribution.

B.1 Change of mean and covariance

Let be a Gaussian random vector with mean and covariance . We are interested in and . To find those, split into

One can see (e.g. by calculating the covariance) that the second term is independent of , thus unchanged by the conditioning, while the first part is completely defined by the conditioning. Thus, we have

and similarly

In our problem of interest, we have (after setting ), , and . This yields

and

Most importantly, the mean term does not have any dependence on such that we can calculate an inference statement without knowing the other coefficients.
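For reference, the computation above is an instance of the standard identity for conditioning a Gaussian vector on a linear statistic; in generic notation (with $A$ and $a$ standing for the quantities fixed by the conditioning), it reads:

```latex
% Conditioning y ~ N(mu, Sigma) on the linear statistic A y = a
y \mid \{Ay = a\} \;\sim\; \mathcal{N}\!\Big(
  \mu + \Sigma A^{\top}\,(A\Sigma A^{\top})^{-1}(a - A\mu),\;
  \Sigma - \Sigma A^{\top}\,(A\Sigma A^{\top})^{-1} A\Sigma
\Big),
```

provided $A\Sigma A^{\top}$ is invertible; the decomposition above corresponds to splitting $y$ into the part determined by $Ay$ and an independent remainder.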

B.2 Computational shortcuts: linear transformations

Since all constraints are linear, they can also be guaranteed for linear transformations of , provided not too much dimensionality reduction is applied.

Define the least squares solution on all data as

and the one on the selection data only as

Then, two vectors which are well suited to fulfil all constraints after transformation are

Since those are linear transformations, they will still be Gaussian with mean and covariance that can be easily derived from those of .

Further, the constraints transform to

We use the bracket notation for the indices to indicate that row of the resulting matrix has to be omitted. Using the active constraints from Lee et al. (2016), we have

where denotes the signs of the parameters’ Lasso estimates. This can be transformed to

Thus, we have transformed the linear equality and inequality constraints and can proceed as if we were to sample from by firstly adjusting the mean and the covariance matrix as described in Section B.1.

The choice of whether to sample from or is rather simple: we just use whichever has the lower dimensionality in order to increase efficiency. As stated in Section 2.2.2, one would further condition on in the unknown variance case. However, this constraint cannot be transformed to or , so the dimensionality could not be reduced. Therefore, for our simulations, we use an estimate of the variance instead of the (theoretically elegant) conditioning idea.

B.3 Whitening

In order to make the MCMC algorithm simpler, we would like to always sample from zero mean unit variance independent Gaussians (i.e. white Gaussians). This can be achieved by applying a further linear transformation. We need a forward map transforming the initial point and an inverse map transforming back the MCMC sample.

Assume that we sample from , which is achieved by applying the transformations from the previous two sections. Here, has rank , i.e. does not have full rank whenever . This is because we lose some degrees of freedom through the conditioning (cf. Section B.1). Further, define matrices and such that

These can be found, e.g., by using the eigenvalue decomposition of . Then, our forward map is

and accordingly, the inverse map is

Note that and, further, for all fulfilling the equality constraints, i.e. for all points we are interested in.

Importantly, the boundary constraint has to be transformed as well. This is possible by

which leads to

i.e. the constraints in the whitened space. With these whitened constraints at hand, the only thing left is to sample from a white Gaussian subject to linear inequality constraints.

Notably, since is a wide matrix ( unless ), we transform into a lower-dimensional space. Therefore, the transformation into the whitened space leads to a further dimensionality reduction, which makes the sampling more efficient.
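A minimal sketch of this whitening step is given below; mu, Sigma, A and b are generic names for the mean, covariance and inequality constraints obtained after the preprocessing of Sections B.1 and B.2, and the function is an illustration rather than the implementation used in our simulations.

```r
# Sketch: whitening a possibly rank-deficient Gaussian N(mu, Sigma)
# and transforming the inequality constraints A x <= b accordingly.
whiten <- function(mu, Sigma, A, b, tol = 1e-10) {
  ev   <- eigen(Sigma, symmetric = TRUE)
  keep <- ev$values > tol                     # drop directions lost by the conditioning
  V <- ev$vectors[, keep, drop = FALSE]
  d <- ev$values[keep]

  # forward map: x -> z with z ~ N(0, I); inverse map: z -> x
  forward <- function(x) drop(diag(1 / sqrt(d), nrow = length(d)) %*% t(V) %*% (x - mu))
  inverse <- function(z) drop(mu + V %*% (sqrt(d) * z))

  # A x <= b becomes A_w z <= b_w in the whitened (lower-dimensional) space
  A_w <- A %*% V %*% diag(sqrt(d), nrow = length(d))
  b_w <- b - drop(A %*% mu)

  list(forward = forward, inverse = inverse, A = A_w, b = b_w)
}
```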

B.4 Sampling from a linearly constrained white Gaussian

The MCMC algorithm presented in this section is also based on the aforementioned GitHub repository. However, we emphasize that any algorithm approximating a white Gaussian subject to linear inequality constraints could be invoked here, using the same preprocessing steps (cf. Sections B.1–B.3).

For simplicity, we reuse all initial names; thus, we want to sample from subject to , and we let be a point fulfilling the constraints. More precisely, is the preprocessed version of the observed vector.

The idea is, in every step, to move in a given random direction , while keeping the projection onto its orthogonal complement fixed, i.e.

Or in other words, we want to sample from

This is in exact analogy to the set-up in Lee et al. (2016) for pure post-selection inference, using as the direction of interest and as the observation to base the inference on. Thus, the boundary derived for pure post-selection inference can be reused, making a univariate truncated Gaussian with known mean and variance. One can easily sample from this distribution, leading to a new point . This can be repeated with a new random direction in every step such that the whole constrained space is explored. After enough steps, the samples should approximate the null distribution sufficiently well.
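A minimal sketch of one such sampler is shown below (standard-normal random directions, with the feasible interval handled via the Gaussian CDF; generic names, not the repository's implementation):

```r
# Sketch: MCMC for a white Gaussian z ~ N(0, I) subject to A z <= b,
# moving along one random direction per step as described above.
constrained_gaussian_sampler <- function(A, b, z0, n_steps = 2000) {
  z <- z0                                           # feasible starting point, A z0 <= b
  d <- length(z0)
  samples <- matrix(NA_real_, n_steps, d)
  for (s in seq_len(n_steps)) {
    dir <- rnorm(d); dir <- dir / sqrt(sum(dir^2))  # random unit direction
    Ad    <- drop(A %*% dir)
    slack <- b - drop(A %*% z)                      # non-negative at a feasible point
    # feasible range for the step size delta in z + delta * dir: Ad * delta <= slack
    hi <- min(c(Inf,  (slack / Ad)[Ad > 0]))
    lo <- max(c(-Inf, (slack / Ad)[Ad < 0]))
    # along dir, the white Gaussian has a N(0, 1) coordinate, truncated to an interval
    t_old <- sum(dir * z)
    u     <- runif(1, pnorm(t_old + lo), pnorm(t_old + hi))
    t_new <- qnorm(u)
    z <- z + (t_new - t_old) * dir
    samples[s, ] <- z
  }
  samples
}
```

In practice, one would add a burn-in, and each sample would be mapped back to the original space with the inverse whitening map from Section B.3.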

An alternative algorithm that could be used for the actual MCMC sampling is the Hamiltonian Monte Carlo algorithm described in Pakman and Paninski (2014). An implementation thereof is available in the R-package tmg (Pakman, 2015).

Appendix C Additional numerical results

This section contains additional numerical results and details about runtime considerations.

C.1 Semi-synthetic Riboflavin data for sparsity 4

We redo the simulation as in Section 4.1.2, setting the sparsity to without changing anything else. The respective results are presented in Figures 7 (FWER and power) and 8 (adjusted power). Note that we restrict the plotting area of the y-axis to a maximum of , such that some values of the FWER are not visible.

At first glance, one sees that the power is generally quite low for all methods. Importantly, this is not a problem of the inference methods themselves but of the combination of selection and inference. Screening, which is theoretically needed for the validity of these approaches, only worked in of the simulation runs when using all data for selection, and naturally even less often for any subset. For comparison, screening worked in of the instances in the sparser alternative, which makes that problem much easier.

In this set-up, multicarving with has the highest power for all , while has the lowest FWER amongst the three carving methods. The highest power obtained while controlling the FWER at is in favour of using , with a value of . The other two methods obtain respective maxima of () and (single-carving). In particular, single-carving only reaches error control at , which is pure post-selection inference. The adjusted power is slightly higher for than for , with maximal values of and . For single-carving, the maximal value is . In summary, multicarving is to be preferred over single-carving in this harder set-up too.

As mentioned above, one of the main difficulties in this scenario is the poor screening property. A natural adaptation is the use of instead of for the cross-validated selection. This leads to larger selected models and could potentially increase the probability of screening. Our simulation confirms that this leads to a performance boost, with the highest adjusted power for multicarving now being . For simplicity, we refrain from showing the results in detail. It has to be mentioned, though, that this choice leads to a substantial increase in runtime as more variables are selected. We elaborate on this effect further in Appendix C.4.

Figure 7: Results for the Riboflavin data with sparsity 4. See caption of Figure 1.
Figure 8: Results for the Riboflavin data with sparsity 4 for the adjusted power. See caption of Figure 2.

C.2 Data carving for group testing: sparse scenario

We refer to Section 4.3 for more details about the implementation and further discussion.

For the sparse scenario, we choose to be sparse, and the active covariates are strongly correlated with other covariates. The number of covariates is again , and is simulated using the following covariance structure

Thus, is the same Toeplitz matrix as in the dense alternative described in Section 4.3, except for the first five variables. The parameter vector is defined as and otherwise, meaning that the active variables lie within the highly correlated set. This time, is varied over and over . The response is generated as before, leading to an SNR in . In this scenario, we are interested in null hypothesis (13) for the group .

C.2.1 Single-carving for group testing: sparse scenario

In Table 5, we report the empirical rejection rate for the scenario with very sparse and highly correlated features. This scenario seems to be easier to handle than the dense one. In particular, the error is controlled at a more conservative level, with the highest error being . For the power, the tendencies are similar to before. For , and generally have the lowest power, while the highest power is obtained with . Starting from , leads to the lowest power, while the other ERRs are mostly exactly .