Statistics aims to glean insights about a population by aggregating samples of individual observations, so data integration is at the core of the subject. In recent years, a keen interest in combining data—or statistical inferences—from multiple studies has emerged in both statistical and domain science research (Chatterjee et al., 2016; Jordan et al., 2018; Yang and Ding, 2019; Michael et al., 2019; Lin and Lu, 2019; Chen et al., 2019; Tang et al., 2020; Cahoon and Martin, 2020). While each sample is typically believed to be a collection of observations from one population, there is no reason to believe another sample would also come from the same population. This difficulty has given rise to a large body of literature on performing data integration in the presence of heterogeneity; see Claggett et al. (2014); Wang et al. (2018); Duan et al. (2019); Cai et al. (2019); Hector and Song (2020) and references therein for examples. These methods take an all-or-nothing approach to data integration: either the two sources of data can be integrated or not. This binary view of what can, or even should, be integrated is at best impractical and at worst useless when confronted with the messy realities of real data. Indeed, two samples are often similar enough that inferences can be improved with a joint analysis despite differences in the underlying populations. It does us statistical harm to think in these binary terms when datasets come from similar but not identical populations: assuming data integration is wholly invalid leads to reduced efficiency, whereas assuming it is wholly valid can unknowingly produce biased estimates (Higgins and Thompson, 2002). It is thus not practical to think of two datasets as being either entirely from the same population or not. This raises an obvious but non-trivial, multi-part question: what and how much can be learned from a sample given an additional sample from a similar population, and how should this learning process be carried out?
Towards answering this question, we consider a simple setup with two datasets: one for which a generalized linear model has already been fit, and another for which we wish to fit the same generalized linear model. The natural question is whether inference based on the second dataset can be improved in some sense by incorporating results from the analysis of the first dataset. This problem has been known under many different names, of which “transfer learning” is the most recently popular (Pan and Yang, 2010; Zhuang et al., 2021). The predominant transfer learning approach uses freezing and unfreezing of layers in a neural network to improve prediction in a target dataset given a source dataset (Tan et al., 2018; Weiss et al., 2016). The prevailing insight into why deep transfer learning performs well in practice is that neural networks learn general dataset features in the first few layers that are not specific to any task, while deeper layers are task specific (Yosinski et al., 2014; Dube et al., 2020; Raghu et al., 2019). This insight, however, does not give any intuition into or quantification of the similarity of the source and target datasets, primarily because of a lack of model interpretability. More importantly, this approach fails to provide the uncertainty quantification required for statistical inference. Similarly, transfer learning for high-dimensional regression in generalized linear models aims to improve predictions and generally does not provide valid inference (Li et al., 2020, 2022; Tian and Feng, 2022). Another viewpoint casts this problem in an online inference framework (Schifano et al., 2016; Toulis and Airoldi, 2017; Luo and Song, 2021) that assumes the true value of the parameter of interest in two sequentially considered datasets is the same.
In contrast, we aim to tailor the inference in one dataset according to the amount of relevant information in the other dataset. Our goal is to provide a nuanced approach to data integration, one in which the integration tunes itself according to the amount of information available. In this sense, we do not view datasets as being from the same population or not, nor data integration as being valid or invalid, and we especially do not aim to provide a final inference that binarizes data integration into a valid or invalid approach. Rather, our perspective is that there is a continuum of more to less useful integration of information from which to choose, where we use the term “useful” to mean that we are minimizing bias and variance of model parameter estimates.
This new and unusual perspective motivates our definition of information-driven regularization based on a certain Kullback–Leibler divergence penalty. This penalty term allows us to precisely quantify what and how much information is borrowed from the first dataset when inferring parameters in the second dataset. We introduce a so-called “dial” parameter that controls how much information is borrowed from the first dataset to improve inference in the second dataset. We prove that there exists a range of values of the dial parameter such that our proposed estimator has reduced mean squared error over the maximum likelihood estimator that only considers the second dataset. This striking result indicates that there is always a benefit to integrating the two datasets, but that the amount of information integrated depends entirely on the similarity between the two. Based on this result, we propose a choice of the dial parameter that calibrates the bias-variance trade-off to minimize the mean squared error, and show how to construct confidence intervals for our biased parameter estimator. Finally, we demonstrate empirically the superiority of our approach over alternative approaches, and show empirically that our estimator is robust to differences between the two datasets. Because the relevant literature is disjointed, we discuss it throughout the paper rather than in a single section.
Section 3 gives the specific form of our estimator in the linear and logistic regression models. We establish the theoretical properties of our estimator in Section 4, demonstrating its improved efficiency over the maximum likelihood estimator. Finally, we investigate the empirical performance of our estimator through simulations in Section 5.
2 A continuum of information integration
2.1 Problem setup
Consider two populations, which we will denote as Populations 1 and 2. Units are randomly sampled from the two populations and the features and are measured on unit from Population , with and . Note that, while the values of these measurements will vary across and , the features being measured are the same across the two populations. Write for the dataset sampled from Population , for . Note that we have not assumed any relationship between the two populations, only that we have access to independent samples consisting of measurements of the same features in the two populations. The reader can keep in mind the prototypical example where a well-designed observational study or clinical trial has collected a set of high-quality data and has at its disposal a set of data of unknown source and quality to use for improved inference.
Since the features measured in the two datasets are the same, it makes sense to consider fitting identical generalized linear models to the two datasets. These models assume the conditional probability density/mass function for, given , is of the form
where , , and are known functions, is the quantity of primary interest, and is a nuisance parameter that will not receive much attention in what follows. Since inference of is not of interest and we can appropriately estimate based on , we do not concern ourselves with how integration affects estimates of and we assume . The conditional distribution’s dependence on is implicit in the “” subscript on . We will not be concerned with the marginal distribution of so, as is customary, we assume throughout that this marginal distribution does not depend on , and there is no need for notation to express this marginal. We assume that is full rank.
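The display defining this density did not survive extraction. As a sketch under assumed notation (the symbols $a$, $b$, $c$, $\theta_k$, and $\phi$ are ours, not necessarily the paper's), the standard natural-exponential-family GLM form is:

```latex
f_{\theta_k}(y \mid x)
  \;=\;
  \exp\!\left\{
    \frac{y \, x^\top \theta_k - b\!\left(x^\top \theta_k\right)}{a(\phi)}
    + c(y, \phi)
  \right\},
```

where $b$, $a$, and $c$ are the known functions referred to above, $\theta_k$ is the regression parameter for Population $k$, and $\phi$ is the nuisance (dispersion) parameter.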
Of course, the two datasets can be treated separately and, for example, the standard likelihood-based inference can be carried out. In particular, the maximum likelihood estimator (MLE) can be obtained as
where is the joint density/likelihood function based on dataset , . The MLEs, together with the corresponding observed Fisher information matrices, can be used to construct approximately valid tests and confidence regions for the respective parameters. This would be an appropriate course of action if the two populations/datasets in question had nothing in common.
Here, however, with lots in common between the two datasets, we have a different situation in mind. We imagine that the analysis of dataset has been carried out first, resulting in the MLE , among other things. Then the question of interest is whether the analysis of dataset ought to depend in some way on the results of ’s analysis and, if so, how. While the problem setup is similar to the domain adaptation problem frequently seen in binary classification (Cai and Wei, 2021; Reeve et al., 2021), we emphasize that our interest lies in improving efficiency of inference on . More specifically, can the inference on based on be improved through the incorporation of the results from ’s analysis?
2.2 Information-driven regularization
Our stated goal is to improve inference of in given the analysis of . Further inspection of equation (1) reveals that the primary difference between and is in the difference between and . Thus, at first glance, the similarity between and is primarily driven by the closeness of and . Intuitively, if and are close, say is small, then some gain in the inference of can be expected if the inference in is taken into account. This rationale motivates a potential objective to maximize under the constraint for some constant . The choice of then reflects one’s belief in the similarity or dissimilarity between and , and data-driven selection of relies on the distance .
We must ask ourselves, however, if closeness between and is sufficient for us to integrate information from into estimation of . For example, would we choose to constrain to be close to if were small? How about if is small, reflecting uninformative features and resulting in a large variance of ? These intuitive notions of what we consider to be informative highlight a gap in our argument so far: elements that are “close” in the parameter space may not be close in an information-theoretic sense, and vice-versa. This is best visualized by Figure 1, which plots two negative log-likelihood functions for , one being based on a more informative dataset than the other. The true is also displayed there and, as expected, is different from the MLE based on ; for visualization purposes, we have arranged for to be the same for both the more and less informative datasets. The plot reveals that the more informative data can better distinguish the two points and , in terms of quality of fit, than can the less informative data—so it is not just the distance between and that matters! Intuitively, estimation of should account not only for its distance to but also the “sharpness” of the likelihood, or the amount of information in .
This motivates our proposal of a distance-driven regularization that takes both the ambient geometry of the parameter space and aspects of the statistical model into consideration. That is, we propose a regularization based on a sort of “statistical distance.” The most common such distance, closely related to the Fisher information and the associated information geometry (Amari and Nagaoka, 2000; Nielsen, 2020), is the Kullback–Leibler (KL) divergence which, in our case, is given by
where , a standard property satisfied by all natural exponential families. Our proposal, then, is to use the aforementioned KL divergence to tie the estimates based on the two datasets together, i.e., to produce an estimator that solves the following constrained optimization problem:
where is a constant to be specified by the user. The KL divergence measures the entropy of relative to . As the Hessian of the KL divergence, the Fisher information describes the local shape of the KL divergence and quantifies its ability to discriminate between and .
A key difference between what is being proposed here and other regularization strategies in the literature is that our penalty term is “data-dependent” or, perhaps more precisely, relative to . That is, it takes the information in into account when estimating , hence the name information-driven regularization. Of course, the extent of the information-driven regularization, i.e., how much information is shared between and , is determined by , so this choice is crucial. We will address this, and the desired properties of the proposed estimator, in Section 4 below.
To solve the optimization problem in equation (2), we propose the estimator
where the objective function is
for a dial parameter . Thanks to the general niceties of natural exponential families, this objective function is convex and, therefore, is straightforward to optimize. The addition of the KL penalty term in equation (3) essentially introduces -discounted units of information from into the inference based on . Note that we specifically avoid use of the term “tuning parameter” to make a clear distinction between our approach and the classical penalized regression approach. The addition of the KL divergence penalty intentionally introduces a bias in the estimator which we hope—and later show—will be offset by a reduction in variance. This is the same basic bias–variance trade-off that motivates other shrinkage estimation strategies, such as ridge regression and the lasso, but with one key difference: our motivation is information-sharing, so we propose to shrink toward a -dependent target, whereas existing methods are motivated by structural constraints (e.g., sparsity), so they shrink toward a fixed target. With this alternative perspective comes a new question: does adding a small amount of -dependent bias lead to efficiency gains? Below we will identify a range of the dial parameter such that the use of the external information in reduces the variance of enough that we gain efficiency even with the small amount of added bias. The parameter thus acts as a dial to calibrate the trade-off between bias and variance in .
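Since the display defining the objective was lost in extraction, a plausible reconstruction (in our own notation, an assumption) is:

```latex
\ell_{\lambda}(\theta)
  \;=\;
  \log f_{\theta}(D_2)
  \;-\;
  \lambda \sum_{i=1}^{n_1}
  \mathrm{KL}\!\left(
    f_{\hat\theta_1}(\cdot \mid x_{i1})
    \,\middle\|\,
    f_{\theta}(\cdot \mid x_{i1})
  \right),
```

with $\lambda \ge 0$ the dial parameter: $\lambda = 0$ recovers the MLE based on $D_2$ alone, while larger $\lambda$ borrows more information from the analysis of $D_1$.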
The maximum of coincides with the root of the estimating function
where is the
-vector of (conditional) mean responses, given, for . The root is the value of that satisfies
From this equation, we see that a solution must be such that a certain linear combination of and agrees with the same linear combination of and , two natural estimators of the mean responses in and , respectively. In Section 3, we give the specific form of the estimating function for two common generalized linear models: the linear and logistic regression models.
2.3.1 Comparison to Wasserstein distance
The Wasserstein distance has enjoyed recent popularity in the transfer learning literature; see for example Shen et al. (2018); Yoon et al. (2020); Cheng et al. (2020); Zhu et al. (2021). An example illustrates why we prefer the KL divergence to the Wasserstein distance. Suppose is the Normal density with mean and variance . The KL divergence defined above is
Clearly, this takes into consideration not only the distance between and , but also other relevant information-related aspects of the dataset . By contrast, following Olkin and Pukelsheim (1982), the -Wasserstein distance between the two Gaussian joint densities and is
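To make the contrast concrete (in assumed notation): for two Gaussians with a common covariance, the 2-Wasserstein distance of Olkin and Pukelsheim (1982) reduces to the Euclidean distance between the means, while the KL divergence rescales the mean difference by the information-bearing variance:

```latex
W_2\!\left( \mathcal{N}(\mu_1, \Sigma),\; \mathcal{N}(\mu_2, \Sigma) \right)
  \;=\; \lVert \mu_1 - \mu_2 \rVert_2,
\qquad
\mathrm{KL}\!\left(
  \mathcal{N}(\mu_1, \sigma^2) \,\middle\|\, \mathcal{N}(\mu_2, \sigma^2)
\right)
  \;=\; \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}.
```

In the general unequal-covariance case, the squared 2-Wasserstein distance adds the trace term $\operatorname{tr}\{\Sigma_1 + \Sigma_2 - 2(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}\}$, which vanishes when the covariances agree.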
In this example, the Wasserstein distance is only a function of the distance between the means and fails to take into account any other aspects of ; in particular, it does not depend on the observed Fisher information in . Replacing the KL divergence in our proposal above with the 2-Wasserstein distance would, therefore, reduce to a simple ridge regression formulated with -dependent shrinkage target . As discussed above, such an approach would not satisfactorily achieve our objectives.
2.3.2 Connection to data-dependent penalties
The dependence of the constraint in equation (2) on the data is unusual in contrast to more common constraints on parameters. Our formulation leads to a penalty term in that depends on the data. This approach admits an appealing connection to the empirical Bayes estimation framework, in which information gained from analyzing is treated as prior knowledge when analyzing . Through this lens, acts like a prior density and maximizing is equivalent to maximizing the corresponding posterior density for , i.e., is the maximum a posteriori estimator of . The dial parameter adjusts the impact of the prior and reflects the experimenter’s belief in the similarity of and .
The prior term is itself a measure of the excess surprise (Baldi and Itti, 2010) between the prior information and the likelihood in . This observation leads to an intuitive understanding of our KL prior: if the prior and the likelihood are close for all values of , then the prior probabilities are small for all values of and the prior is weakly informative.
2.3.3 Distinction from data fusion
Apart from the literature that considers (a subset of) effects to be a priori heterogeneous (e.g. Liu et al., 2015; Tang and Song, 2020), homogeneous (e.g. Xie et al., 2011) or somewhere in between (e.g. Shen et al., 2020), some methods consider fusion of individual feature effects. Broadly, fusion approaches jointly estimate and by shrinking them towards each other. At first blush, these methods may appear similar to our approach, so we take the time to highlight two key differences.
Primarily, these methods differ in that they do not consider estimation in to be fixed at , and can jointly re-estimate both parameters for improved efficiency. This is quite different from our goal, which is to quantify the utility of a first, already analyzed dataset in improving inference in a second dataset . Our approach has the advantage that we do not require the model to be correctly specified in , which endows our method with substantial robustness properties that fusion approaches lack.
Moreover, data fusion’s underlying premise is to exploit feature clustering structure across data sources, thereby determining which features should be combined (Bondell and Reich, 2009; Shen and Huang, 2010; Tibshirani and Taylor, 2011; Ke et al., 2015; Tang and Song, 2016). In particular, this leads to the integration of some feature effects but not others. Different sets of feature effects are estimated from different sets of data, which does not provide the desired quantification of the similarity between datasets.
3.1 Gaussian linear regression
Optimizing corresponds to finding the root of the estimating function
Let , , denote the scaled Gram matrices. Then the solution to the estimating equation is
where is the MLE based on only. The estimator in equation (5) progressively grows closer to as more weight (through ) is given to . This is desirable behavior: acts as a “dial” allowing us to tune our preference towards or . When , we recognize as a generalization of the best linear unbiased estimator. The estimator does not rely on individual-level data from the first dataset, ; all that is needed is the sample size, the estimate, and the observed Fisher information. Thus, our proposed procedure is privacy preserving and can be implemented in a meta-analytic setting where only summaries of and are available.
This estimator also takes the familiar form of a generalized ridge estimator (Hoerl and Kennard, 1970; Hemmerle, 1975) and benefits from many well-known properties; see the excellent review of van Wieringen (2021). The estimator is a convex combination of the MLEs from and , with weights determined by the corresponding observed Fisher information matrices. This formulation allows us to identify the crucial role that plays in balancing not only the bias but the variance of , a role reminiscent of that played by the tuning parameter for ridge regression (Theobald, 1974; Farebrother, 1976).
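A minimal numerical sketch of the closed-form Gaussian estimator described above, assuming it takes the matrix-weighted form with Gram matrices as weights (the notation and function name are ours, not the paper's):

```python
import numpy as np

def kl_shrunk_estimator(X1, beta1_hat, X2, y2, lam):
    """Closed-form KL-penalized estimator in the Gaussian linear model (sketch).

    Solves (S2 + lam * S1) beta = S2 @ beta2_hat + lam * S1 @ beta1_hat,
    with Gram matrices Sj = Xj' Xj: a matrix-weighted combination of the
    two MLEs. Only summary data from dataset 1 (its Gram matrix and MLE)
    is needed, so the procedure is privacy preserving.
    """
    S1 = X1.T @ X1
    S2 = X2.T @ X2
    beta2_hat = np.linalg.solve(S2, X2.T @ y2)  # MLE (OLS) from dataset 2 alone
    return np.linalg.solve(S2 + lam * S1, S2 @ beta2_hat + lam * S1 @ beta1_hat)
```

Setting `lam=0` returns the dataset-2-only MLE, while letting `lam` grow pulls the estimate toward `beta1_hat`: the "dial" behavior described in the text.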
Our estimator can be rewritten as with
In contrast, Chen et al. (2015) consider pooling and to jointly estimate and with some penalty . Their estimator is given by , where
When , if and only if . This is clearly always true when . When , if is informative relative to , and if is uninformative relative to . Therefore, our estimator assigns more weight to than Chen et al. (2015)’s estimator when is more informative, and less when it is uninformative, as desired.
3.2 Bernoulli logistic regression
For the standard logistic regression model, the KL divergence is given by
Then the objective function in (3) can be written as
To optimize , we find the root of the estimating function given by
where is the vector obtained by applying to each component of the vector . The estimator is the solution to . Of course, there is no closed-form solution, but the solution can be obtained numerically.
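Since the estimating equation has no closed form, here is a minimal Newton-iteration sketch under our assumed form of the penalized score; all names are hypothetical and the fixed iteration count is a simplification:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def kl_penalized_logistic(X1, beta1_hat, X2, y2, lam, n_iter=50):
    """Newton iteration for the KL-penalized logistic score (sketch).

    Finds the root of
        X2' (y2 - p(X2 b)) + lam * X1' (p(X1 beta1_hat) - p(X1 b)) = 0,
    our assumed form of the penalized estimating function, where p is the
    logistic function applied componentwise.
    """
    p1_hat = sigmoid(X1 @ beta1_hat)   # dataset-1 fitted means, held fixed
    beta = np.zeros(X2.shape[1])
    for _ in range(n_iter):
        p2 = sigmoid(X2 @ beta)
        p1 = sigmoid(X1 @ beta)
        score = X2.T @ (y2 - p2) + lam * X1.T @ (p1_hat - p1)
        # Negative Jacobian of the score: weighted Gram matrices of X2 and X1.
        H = X2.T @ (X2 * (p2 * (1 - p2))[:, None]) \
            + lam * X1.T @ (X1 * (p1 * (1 - p1))[:, None])
        beta = beta + np.linalg.solve(H, score)
    return beta
```

With `lam=0` the iteration reduces to ordinary Fisher scoring for the dataset-2 logistic MLE; large `lam` drives the fitted dataset-1 means toward those implied by `beta1_hat`.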
4 Theoretical support
A distinguishing feature of our perspective here is that we treat as fixed. In particular, we treat as a known constant. A relevant feature in the analysis below is the difference , which is just one measure of the similarity/dissimilarity between and . Of course, is unknown because is unknown, but can be estimated if necessary. Note also that we do not assume to be small.
Thanks to well-known properties of KL divergence, the expectation of the objective function is concave in , so it has a unique maximum, denoted by . If the true value of in is , then clearly when and . On the other hand, it is easy to verify that, as , the objective function satisfies uniformly on compact subsets of the parameter space of , so, as expected, . As stated in Section 2.2, our objective is to find a range of the dial parameter —depending on , etc.—such that the use of the external information in sufficiently reduces the variance of so as to overcome the addition of a small amount of bias to .
We first consider the Gaussian linear model for its simplicity, which allows for exact (non-asymptotic) results. The ideas extend to generalized linear models, but there we will need asymptotic approximations as . Proofs of all the results are given in Appendix A. Throughout, we use the notational convention for a matrix .
4.2 Exact results in the Gaussian linear model
In the Gaussian case, the expectation of the estimating function evaluated at is
Denote . Rearranging, we obtain
Thus, when or . In general, however, our proposed estimator is estimating , so a practically important question is: if we already have an unbiased estimator, , of , why would we introduce a biased one? Theorem 1 below establishes that there exists a range of for which the mean squared error (MSE) of is strictly less than that of .
Theorem 1. There exists a range of on which the mean squared error of is strictly less than the mean squared error of .
This result is striking when considered in the context of data integration: we have shown that it is always “better” to integrate two sources of data than to use only one, even when the two are substantially different. We also claim that our estimator’s gain in efficiency is robust to heterogeneity between the two datasets; we will return to this point in Section 4.3 to make general remarks in the context of transfer learning.
Of course, the reader familiar with Stein shrinkage (Stein, 1956; James and Stein, 1960) may not be surprised by our Theorem 1. Our result has a similar flavor to Stein’s paradox (Efron and Morris, 1977), i.e., that some amount of shrinkage always leads to a more efficient estimator compared to a (weighted) sample average, here . In their presciently titled paper, “Combining possibly related estimation problems,” Efron and Morris (1973) anticipated the ubiquity of results like our Theorem 1 that show estimation efficiency gains when combining different but related datasets.
The proof of Theorem 1 shows that for all when ; when , it proves that is monotonically decreasing over the range and monotonically increasing over the range for some that does not have a closed-form. The proof does, however, provide a bound for :
where the eigenvalues of are listed in increasing order and the eigenvalues of are listed in decreasing order. That is, the MSE of will be less than that of if is no more than the right-hand side of (6). From the above expression, we see that if elements of are large in absolute value, then the improvement in MSE only occurs for a small range of . Moreover, if , or are large then the range of such that is small: in other words, if is highly informative then very little weight should be given to , regardless of how informative is. This is intuitively appealing and practically useful because it provides a loose guideline, based on the informativeness of , for how much improvement can be obtained using .
In practice, we propose to find a data-driven version of the minimizer, , of the MSE. For this, we minimize an empirical version of the MSE based on plug-in estimators, i.e.,
the minimizer of the estimated mean squared error of , where
is the usual estimate of the error variance , and is a bias-adjusted estimate of computed as follows. If we define , then the equality implies that is a positively biased estimator of . Similar to Vinod (1987) and Chen et al. (2015), we estimate with
where for a symmetric matrix with eigendecomposition . If all eigenvalues are negative, we use . This bias-adjusted approach to estimating MSE is related to Stein’s unbiased risk estimator (SURE, Stein, 1981); see also Vinod (1987) for a discussion in the ridge regression setting. The minimizer always exists since is non-zero with probability 1.
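The eigenvalue-truncation operator used in the bias adjustment above can be sketched as follows; the function name is ours, and the fallback when every eigenvalue is negative is an assumption (the paper's choice was lost in extraction), here taken to be the zero matrix:

```python
import numpy as np

def positive_part(A):
    """Positive-part truncation of a symmetric matrix (sketch).

    Given the eigendecomposition A = V diag(w) V', zero out the negative
    eigenvalues and reassemble. When every eigenvalue is negative this
    returns the zero matrix; that fallback is an assumption, not
    necessarily the paper's choice.
    """
    w, V = np.linalg.eigh(A)          # eigendecomposition of symmetric A
    w_plus = np.clip(w, 0.0, None)    # keep only nonnegative eigenvalues
    return V @ np.diag(w_plus) @ V.T
```

The result is always positive semidefinite, which is what makes it a sensible plug-in for the nonnegative quantity being estimated.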
Finally, we show that constructing confidence intervals for using reduces to the traditional inference using the MLE. It follows from the proof of Theorem 1 that,
and therefore, using the definition of in equation (5),
Some rearranging shows that , which finally yields the familiar result: . Therefore, confidence intervals based on equation (7) reduce to confidence intervals obtained from the MLE . As remarked by Obenchain (1977) in ridge regression, our shrinkage does not produce “shifted” confidence regions, and the MLE is the most suitable choice if one desires valid confidence intervals.
4.3 Asymptotic results in generalized linear models
Next we investigate the asymptotic properties of in generalized linear models as . For , let
Define and . Then
We assume the following conditions.
Condition 1. exists and is finite.
Condition 2. For any between and inclusive, the two matrices defined below exist and are positive definite:
Denote by . Recall that the asymptotic variance of the MLE is .
Lemma 1 is a standard result stating that the minimizer of our objective function converges to the minimizer of its expectation.
Lemma 1. If Condition 1 holds, then , for each .
Since we already have that as , an immediate consequence of Lemma 1 is that as and . More can be said, however, about the local behavior of depending on how quickly vanishes with