Aggregating knowledge across studies is a common practice for pooling multiple findings, increasing precision of common results and planning new studies. In this paper, we evaluate treatment effects after adjusting for potential confounding factors from various independent, existing studies. In doing so, we inform an efficient design of a follow-up validation analysis from these existing studies, and estimate the common parameters by aggregating summary information from each existing study with the validation data. The existing studies for instance can be pilot studies, preliminary analyses, or previously reported results, all studying the impact of the same treatment effect. Moreover, we work under a challenging framework in which the number of potential confounders might be large. Estimation of treatment effects in the presence of many possible confounders usually calls for a model selection procedure such as the LASSO  in any individual study.
Two practical considerations gain prominence in an estimation of the treatment effect from multiple studies. First, full data from the existing studies such as raw data for the confounders might be unavailable, due to either practical limitations in data-sharing, or due to legitimate privacy and confidentiality concerns [21, 1]
. Second, if feature selection is performed on each existing study, the post-hoc estimation of treatment effects based on the selected model tends to be biased. We thus ask: (i) whether and how data from two or more studies can be aggregated without the risk of selection bias; (ii) whether the former goal can be achieved without much loss of information and using only summary statistics sufficient for data aggregation.
To address the above challenges, we consider an
-regularized framework assuming a parsimonious linear model for all the studies. In each existing study when the LASSO is used to select a set of covariates as confounding variables, the usual summary statistics, the first two moments of the data in the least squares analysis refitted for the selected model, are no longer sufficient for the next stage of data aggregation. It however turns out that the summary statistics required by our aggregation protocol assume a fairly simple and compressed form. An immediate by-product of this finding is a new guideline for what needs to be reported in publications of statistical analyses in order to facilitate an efficient aggregation of models selected by the LASSO.
While data aggregation from existing studies can be performed on their own, a further validation study is often necessitated by a regulatory requirement, or pursued due to the need for higher statistical power. More generally, researchers in science and public health often find it more cost-efficient to design and carry out new studies by relying on a synthesis of prior research, in order to focus on a smaller set of confounding factors and to reduce variability of estimators from any single study. Such new studies play the role of validation studies in our framework. Informed by the selection from the existing studies, the validation study collects data on all the potentially important confounding factors that are identified as active in any of these existing studies.
We showcase how a recently developed technique, namely data carving
, can be utilized to efficiently pool the information from each existing study with the validation data for an unbiased estimation of the treatment effect in the selected models. Data carving borrows information that has not been been fully exploited by the LASSO, and subsequently leads to higher statistical efficiency for estimating treatment effects than off-the-shelf alternatives like splitting which only uses the (independent) data from the follow-up study during estimation. In contrast to the debiased LASSO estimator based on a one-step bias correction to the LASSO solution, the appeal of our estimator lies in the fact that it does not require estimating a high dimensional matrix (or its inverse) based on the existing data. Combining summary information from each existing study with the validation data through data carving, we call the new estimator acarved estimator. A simple averaging of the carved estimators then aggregates the findings from the existing studies. We schematically depict our aggregation protocol in Figure 1. Our new estimator, after combining each existing study with the validation data through data carving, is denoted by , and the aggregated estimator is given by .
Our present work relates with multiple frontiers in the literature. In the context of high dimensional meta-analysis, , , and  propose distributed inference procedures based on aggregating debiased LASSO estimators [22, 19] from multiple studies. In a similar spirit as , 
propose a more efficient inversely variance weighted debiased LASSO estimator to accommodate cross-study heterogeneity. The debiased approaches in the aforementioned papers offer a valid way out to avoid model selection bias, but lose statistical efficiency when the treatment assignment is correlated with the noise variables within the data. Such correlations are commonly expected in large observational studies and a demonstration of this phenomenon is provided in.
A simple estimator in our framework is a split-and-aggregate estimator, obtained by: (i) refitting the validation sample with the selected model from each existing study and followed by (ii) averaging the split-based estimators. Because the data used for model selection is not used for estimation, the resulting estimator does not suffer from the issue of the over-fitting bias. However, the estimator suffers from a loss of efficiency by discarding the data from existing studies, which might be very significant when the more carefully designed validation study is based on fewer samples relative to the existing studies. Drawing inspiration from recent efforts to reuse data from selection [3, 16, 2, 5, 13], utilizing residual information from the existing studies for an estimation of the treatment effect from the full data steers our proposal. The idea of conditioning upon the observed selection, previously explored in [7, 17, 11]
, among others, discards bias from model selection through p-values, confidence intervals, credible intervals based on conditional probabilities. The estimator we propose here implements the principles of data carving[4, 12, 10, 14] within the conditional framework. Data carving resembles data-splitting in the selection stage, because model selection operates on only a subset of the full data. However, estimators based on data carving, unlike those based on splitting, use the entire data instead of the held-out portion alone.
begins with an illustrative case analysis to present the heuristics behind the construct of our new estimator. Section4 provides details of the carved estimator in the general framework and proves asymptotic unbiasedness of our estimator. Numerical experiments that back up our asymptotic results are presented in Section 5. Section 6 provides proofs for our main theory. Section 7 concludes our work with some summarizing remarks.
We denote the data from the independent studies by:
for , where each study measures independently and identically distributed triples; (i) are the observed outcomes, (ii) represents the high-dimensional covariates with potentially greater than , and (iii) denotes the -valued treatment variable. Fixing notations, we use to denote an index set of covariates with cardinality less than for any
. For any vector or matrixand an index set , let be the sub-vector or sub-matrix containing the columns of indexed by the set . Lastly, we use the symbol and to represent a -valued vector of all ones and all zeroes, respectively.
We assume that the data collected in our studies are governed by the population model
where indicates the active covariates in the model, measures the average treatment effect, and are the unknown parameters that capture the effects of the covariates in the model, and are random errors. We assume to be the same across the studies for the sake of simplicity, but note that our work generalizes easily to cases where the parameter vector varies with , i.e., the true confounding variables are allowed to be different across the studies. Later in the section, we discuss the extension of our framework to accommodate heterogeneity across studies.
We can think of the first column of as a vector of all 1’s to represent the intercept in the population model in (1); in practice, we center all the data vectors to proceed by simply assuming that sums to zero and so does every column of and . Each existing study in our framework includes an extensive collection of covariates and the treatment assignment is randomized conditional on these covariates. We further suppose that the observed covariates, , in each study include the ones indexed by , that is, the active covariates of interest in the population model are measured in each study. This is a rather strong assumption, requiring any unobserved confounding variables to be balanced in all the studies. Unobserved confounding effects in more general settings certainly deserve further attention.
Because each study includes a large number of covariates, we assume that the treatment effect evaluation will rely on the LASSO, a common tool of choice in high dimensional data analyses. Fixingto be the active set of covariates selected by the LASSO, each study will report and a set of summary statistics which we specify explicitly in the next section. Under appropriate conditions, we have with probability tending to one as the sample size increases. A validation study, , collects information on the response , treatment variables , and the covariates from the same population model in (1) and independent of , , where with cardinality equal to . In this protocol, we note that the selection from the existing studies informs a careful design for the downstream validation study. In the rest of the paper, let (i) , (ii) , and (iii) . Lastly, we use to represent the norm of a vector unless specified otherwise.
We conclude the section by showing that our framework easily generalizes to accommodate some heterogeneity across studies. Let contain the active covariates in the existing study , and contain the active covariates in the validation study. Suppose, the data in our existing studies is drawn from the following model:
and the data in the validation study are generated from:
where for . Now, let
represent a dummy variable that assumes the value 1 if the observationis measured in study and assumes the value otherwise, for , i.e., . For an observation , we have
Given the above generating models, we can describe observation in the data from the validation and existing studies through the following unified model:
Based on the above unified model, our proposal easily extends to two scenarios. First, when , the summary statistics reported in the existing studies are exactly the same as the case under the model in (1), wherein the support sets coincide in all the studies. Second, when , the construct of our estimator follows the same recipe as proposed in the paper, except now the unbiasedness of the estimator holds under the additional assumption that the union of the selected variables is collected in all existing studies, and we proceed by fitting the union of the selected models to the combined data for an existing study and the validation study. In the second situation, we need to retrieve the summary statistics from the existing studies based on the union of the selected model instead. Because the unified model is cast into the same form as the simpler model in (1), we work with the simpler model in the rest of the paper for ease of presentation. Later in Section 5, we verify the validity of such an extension in a simulation study.
2.2 Efficient and privacy-preserving aggregation with data carving
Suppose, for each existing study, we use the LASSO to estimate:
, for , where is a diagonal matrix with the tuning parameters in the LASSO penalty in the diagonal entries. Recall, is the support set of . We first specify the summary statistics involved in the construct of our estimator:
Summary from model selections: the support set of the LASSO estimator , the penalty weights , and the signs of the (nonzero) LASSO estimator for the selected covariates .
Summary based on first two moments: the sample covariance matrices from the existing study based on the selected model and the observed sample of size , that is,
Note that has two blocks, one -dimensional and the other -dimensional. In a similar fashion, is partitioned along the same dimensional structure.
Using these summary statistics from the existing study together with the validation study data , we introduce our carved estimator below. We begin by computing the least squares estimator
For a fixed vector and for , let be a -valued vector whose -th component equals
we then solve from a convex optimization problem:
Then, our carved estimator for from the the existing study and the validation study is given by
Finally, we propose the aggregated estimator by taking a simple average of the carved estimators:
If the sample size for each existing study, , varies with , we can naturally replace the simple average in (7) by the weighted average, where the weights are proportional to . As indicated in the above discussion, the summary statistics from each existing study includes the first two moments needed for the least squares estimator, alongside the penalty weights and the signs of the Lasso solution for the selected covariates indexed by .
We note some natural comparisons with existing approaches to estimate the common treatment effects in the following remarks.
Remark 2.1 (A comparison with “split-and-aggregate” strategy).
An off-the-shelf option available for estimation in our context is a simple split-and-aggregate estimator, obtained by refitting the validation sample with the selected model (call this estimator , followed by averaging these estimators. Because the data used for model selection is not used for estimation, the resulting estimator does not suffer from the issue of the over-fitting bias, but, can prove to be significantly less efficient than our proposed carved-and-aggregate strategy.
Remark 2.2 (A comparison with the debiased LASSO strategy).
Another legitimate estimator is the aggregated debiased LASSO estimator that can be viewed as the proposal in  customized to our setup. To be specific, we aggregate by averaging the debiased LASSO estimators from each site and the validation dataset , denoted as and respectively. The debiased LASSO estimator reduces the penalization bias introduced by the high dimensional covariates. However, it tends to be statistically inefficient whenever there exist high correlations between the treatment variable and some of the covariates as noted previously by . In a simulated setting under Section 5 with highly correlated treatment and confounding variables, we observe a significant loss in efficiency for the debiased estimator.
Remark 2.3 (Overfitting bias correction in the carved estimator).
We note that the refitted pooled estimate is a biased estimate of defined in (3), because it double uses the data for model selection and parameter estimation. To see this, whenever , the refitted pooled estimate satisfies that . Due to the correlation between and the selected set of covariates in , we typically have , meaning that is biased. We refer to the resulting bias in as the over-fitting bias. The proposed carved estimator defined in (6) applies a novel bias correction term to remove the over-fitting bias. In this key step, carving allows us to free our statistical estimate of model-selection bias, and the Rao-Blackwellization deployed on a carved likelihood improves the efficiency upon this unbiased estimate. We shall see more detailed discussion of this appealing property of our carved estimator through a simple example in the next section.
3 Debiasing through data carving: an illustrative case analysis
We present in this section an illustrative case analysis to introduce the debiasing approach of our carved estimator in (6). Assuming access to the summary information (1) and (2) for an existing study (i.e., ), we let contain a real valued covariate and a treatment variable (i.e., and ). Both and in this example are standardized to have mean zero and variance one with correlation . Consider observing and observations in and respectively, from the population model (1) with (i.e. ) and with noise variance . Letting , recall that the augmented sample size is , and the ratio of the samples from the existing study is which we set to be equal to in the current illustrative example.
Suppose the model we select after solving the LASSO from our existing study is the full model , i.e., the model includes the covariate . Using the validation study, we proceed to evaluate the treatment effect in the overfitted model:
is the identity matrix. Define
Based on model (8), the likelihood is denoted as
where is the least squares estimator for using the combined data. Following the notation in the previous sections, denote
where is the sign of the LASSO estimator for the coefficient of and is the corresponding tuning parameter.
To outline the construct of our estimator, we begin by defining the variable as
where . We observe a simple equivalence of the Karush–Kuhn–Tucker (K.K.T.) mapping for (2) with that from:
when both mappings are evaluated at the LASSO estimator . We term
as a randomization variable, so named because this variable is generated by using a random subsample of our i.i.d. observations (of size) for selection. Under the selected model (8), we have as :
jointly follow an Gaussian distribution such thatis centered at with the covariance
is independent of .
Using (9), we characterize the event that the LASSO selects the active set of covariates with signs in terms of the randomization variable and the least squares estimator as follows:
Our carved estimator can be constructed in two steps. In the first step, we recognize an unbiased initial estimator that is independent of the selection event:
This leads us to observe that is unbiased for conditional upon the event in (10). In the next step, we improve upon the initial estimator through Rao-Blackwellization by conditioning further upon the complete sufficient statistic for in the conditional likelihood. In our case, this statistic is , because of a basic fact that conditioning preserves the complete sufficient statistic in the unconditional law . Our estimator is thus given by:
where we have conditioned upon alongside the event of selection in (10).
Simplifying the expression on the right-hand side of (3), we have:
Consistent with the estimator in (6), takes the form of an additive debiasing correction applied to the simple least squares estimator . Completing the details, we give the expression for in Proposition 6.1 and note:
To provide some numerical evidence for the effectiveness of our new estimator in debiasing the treatment effect from data re-use in model selection and parameter estimation, we depict in Figure 2 the result of a simulation for , and and Monte Carlo samples. Our carved estimator takes the expression in Proposition 6.1, which we compare against the two popular alternatives in Remark 2.1 and 2.2. The generative scheme for the simulation and the implementation details for the other two alternatives under comparison are elaborated in Section 5. Noteworthy, all the three estimators are centered around the true value , but the carved estimator has smallest variability.
4 Carved estimator
In the present section, we generalize the debiasing approach for the illustrative analysis to our framework in the paper. Our carved estimator is obtained by conditioning upon the observed event:
which in turn allows us to construct a debiasing term for the treatment effect based on only summary statistics from the existing studies. We then prove the asymptotic unbiasedness of our estimator and identify the relative gains in variance for our estimator over splitting.
4.1 Our debiasing term
In line with the illustrative analysis in the preceding discussion, we define our randomization variable as follows:
Note, we need not observe the covariates not selected by the LASSO in the validation study. The variables defined above only serve to assist a theoretical investigation of the debiasing term.
In Lemma 4.1, we state an asymptotic linear representation for the variables , and , which implies that the variables are distributed as Gaussian variables in the limit as the sample size . Conditioning the limiting Gaussian law in this lemma upon the observed event of selection in (11) gives us an asymptotic conditional distribution, which we simply call our conditional law. Theorem 4.2 then provides the expression of an asymptotic debiasing term for the refitted least squares estimate with respect to this conditional law. The results in the section assume that we fit the selected model to the combined data from the existing and validation study:
Since we work under a fixed setting, we let in the remaining analysis.
Fix . Then, the following assertion holds:
where (i) is the average of i.i.d. variables with mean equal to , (ii) , (iii) , and are asymptotically independent, and (iv) is asymptotically centered at with the covariance , where
Before stating Theorem 4.2, we let
denote the first moment of a Gaussian variable with mean and covariance truncated to the region . Further, let be the associated log-partition function for the truncated Gaussian density at the natural parameter
Finally, for defined in (4), let be the population covariance.
Define the half-space
Then, the estimate
is unbiased for with respect to the conditional law derived from Lemma 4.1.
The estimate in Theorem 4.2 however does not admit direct computations, due to lack of readily available expressions for the moments of a truncated Gaussian variable. To this end, we use the plug-in estimate and the inverse for and its inverse, respectively, and apply a Laplace-type approximation to facilitate a manageable calculation for our proposal (6) through an easy-to-solve optimization. Deferring the rigorous justification of asymptotic unbiasedness to the next discussion, we outline the approximate construct of our estimator. For
a Laplace approximation grants us the following:
We replace the constrained optimization with the unconstrained version through a logarithmic barrier penalty to obtain:
Applying the outlined approximation to the asymptotic debiasing term in Theorem 4.2 allows us to write
4.2 Asymptotic properties
Our main result in the section, Theorem 4.6, establishes the asymptotic unbiasedness of our carved estimator. We first present the regularity and moment conditions on the data generating mechanism, required for the theoretical justification of our debiasing approach.
Consider i.i.d. realizations of the response , the covariates collected across the existing studies and the treatment by , such that is drawn from the model in (1) with
Each of our existing studies in particular observes samples of the outcome and the treatment along with the samples for a subset of covariates seen in all the studies. For the remaining section, we consider the following parameters
in our generation scheme, such that and are constants and .
The following condition allows us to control the bias that results from the use of a Laplace-type approximation towards a feasible expression for our debiasing factor. The first part of the condition assumes the existence of an exponential raw moment, which in turn in necessary to justify a Laplace-type approximation for a mean of i.i.d. variables, which in our case is in Lemma 4.1 by setting . The second part of the condition is required to extend this approximation to asymptotically linear variables with an added error term, as is seen in the left-hand side of the representation in Lemma 4.1.
for a fixed subcollection of our covariates containing . We assume that
for some . For the observed set of covariates in the existing study k, consider the linearizable representation in Lemma 4.1. We assume that the remainder term satisfies:
The probability of selection admits the following characterization in terms :
where and are fixed matrices. This follows from the asymptotic polyhedral characterization of the selection event associated with the LASSO in previous work; we direct interested readers to [Proposition 4.2 12]. The next condition allows us to ignore the remainder term in the polyhedral characterization which converges to in probability with increasing sample size.
We assume that the probability of selection satisfies
The final condition bounds the deviation of the sample estimate from the population parameter .
For the selected set , we assume: , where denotes the operator norm of the matrix .
For the remaining section, consider the parameters
in our generation scheme such that and are constants and .
The asymptotic variance of our estimator based on the Fisher information matrix, we have:
which we detail out in Proposition 6.4 in Section 6. Lemma 4.7 then quantifies the relative gain in variance of our estimator over splitting within each study . Here, denotes the estimated variance of the least squares estimate refitted on the validation data alone and let