In many applications of statistical inference, the aim is to compare data from different populations. Specifically, given $n_0$ and $n_1$ samples from two groups, collected in vectors $\mathbf{y}_0$ and $\mathbf{y}_1$, the target quantity is often the difference between their means, denoted $\delta$, which we call the effect. For instance, in randomized trials and A/B testing, the data are outcomes from two populations and $\delta$ is the average causal effect of assigning subjects to a test group '1' as compared to a control group '0' [1, 2]. The standard approach is to use the difference between sample averages in each group, viz. $\hat{\delta} = \bar{y}_1 - \bar{y}_0$, where $\bar{y}_g = \frac{1}{n_g} \mathbf{1}^\top \mathbf{y}_g$ for $g \in \{0, 1\}$. Confidence intervals for $\delta$ can be obtained using Welch's method, which employs an approximating t-distribution [3, 4, 5].
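As a minimal sketch of the standard approach described above, the following computes the difference between sample averages and a Welch interval with Welch-Satterthwaite degrees of freedom. The function names and the numerical t-quantile routine (Simpson integration plus bisection, used only to keep the example free of external libraries) are illustrative assumptions and are not taken from the paper's code repository.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF for x >= 0 via Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(p, df):
    """Quantile for p > 0.5, found by bracketing and bisection."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:      # expand the bracket until it contains the quantile
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def welch_interval(y0, y1, alpha=0.05):
    """Return (estimate, (lower, upper), df) for mean(y1) - mean(y0)."""
    n0, n1 = len(y0), len(y1)
    v0, v1 = variance(y0) / n0, variance(y1) / n1   # squared standard errors
    estimate = mean(y1) - mean(y0)
    se = math.sqrt(v0 + v1)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v0 + v1) ** 2 / (v0 ** 2 / (n0 - 1) + v1 ** 2 / (n1 - 1))
    half_width = t_quantile(1 - alpha / 2, df) * se
    return estimate, (estimate - half_width, estimate + half_width), df
```

For instance, `welch_interval([1, 2, 3, 4, 5], [2, 4, 6, 8, 10, 12])` returns an estimate of 4 with roughly 7 degrees of freedom. In practice a statistics library would normally supply the t-quantile.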
Ideally, the samples from both groups are representative of their target populations. Then the bias of the estimator, $b = \mathrm{E}[\hat{\delta}] - \delta$, is zero. However, in nonideal conditions with finite samples this is not the case, e.g., when some units of the intended populations are less likely to be included than others. Under such conditions, $|b|$ decreases with the sample sizes $n_0$ and $n_1$ but will nevertheless be nonzero. Sampling biases increase the risk of inferring spurious effects when using standard inference methods.
In this paper, we develop an inference method that is resilient to sampling biases. In contrast to the standard approach, the proposed method reduces the risk of reporting spurious effect estimates and is capable of controlling the false positive errors under moderate biases. The method relies on an effect estimator using a fully automatic and data-adaptive regularization. We demonstrate its performance on both synthetic and real data.
Code for the method can be found at https://github.com/dzachariah/two-groups-data
II Problem formulation
We model the dataset as
A model based on the Gaussian distribution is the least favourable one for estimating the unknown effect $\delta$. We model the effect as a random variable, where different ranges of values of $\delta$ have different probabilities. To achieve resilience to sampling biases, we adopt a conservative approach in which nonexistent or negligible effects are considered to be more probable. Specifically, we employ the following model:
where is an unknown parameter.
Our aim is to derive a confidence interval that contains the unknown with a coverage probability of at least . That is,
The confidence interval is to be centered on an estimator and should be resilient to sampling biases. That is, even if the interval must not indicate nonzero effects with a probability greater than . Fig. 1 illustrates the ability of the method proposed below to ensure (3) under a range of biases, provided does not greatly exceed the dispersion of sample averages, i.e., .
III Proposed method
Let be the conditional mean of the effect given the data. Using an estimate of the nuisance parameters, we propose the following effect estimator
where we introduce the variable that can be interpreted as a signal-to-noise ratio .
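The exact form of estimator (4) is not reproduced here. To illustrate how a signal-to-noise ratio can regularize an effect estimate, the sketch below assumes a simplified scalar model that is not the paper's: an observed difference equals the effect plus zero-mean Gaussian noise of known variance, and the effect has a zero-mean Gaussian prior. Under these assumptions the conditional mean is the raw difference scaled by snr / (1 + snr). All names and arguments are hypothetical.

```python
def shrunk_effect(raw_diff, noise_var, prior_var):
    """Conditional mean of the effect under the assumed scalar Gaussian model.

    raw_diff  : observed difference between sample averages (hypothetical input)
    noise_var : variance of raw_diff around the true effect (must be > 0)
    prior_var : variance of the zero-mean Gaussian prior on the effect
    """
    snr = prior_var / noise_var      # signal-to-noise ratio
    weight = snr / (1.0 + snr)       # shrinkage factor in [0, 1)
    return weight * raw_diff
```

With a zero signal-to-noise ratio the estimate is exactly zero, while for a large ratio it approaches the raw difference, which is the qualitative behaviour expected of a regularized effect estimator.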
Result 1 (Cramér-Rao bound).
When the systematic error of is invariant with respect to , then the mean-squared error over all possible effects and data is bounded as follows where
See Appendix -A. ∎
Result 2 (Confidence interval).
See Appendix -B. ∎
Evaluating and requires estimates of the nuisance parameters . Here we adopt the maximum likelihood approach and estimate using the marginalized data distribution,
It can be shown that (7) is a Gaussian distribution with mean and covariance
where and . The estimated parameters are given by
Interestingly, the problem (9) can be solved by a one-dimensional numerical search. Begin by defining the variables
Note that . Then the following result holds.
Result 3 (Nuisance parameter estimates).
See Appendix -C. ∎
By plugging in , , and into (4) and (6), we obtain estimates and , respectively. We note that the overall mean is fitted to the data in a nonstandard manner using (13), which yields a fully automatic and data-adaptive regularization of the effect estimator (4). If the minimizing is such that , then the estimated signal-to-noise ratio is . In this case, the method indicates that the data is not sufficiently informative to discriminate any systematic difference from noise. Consequently, collapses to zero and , indicating a case in which the effect cannot be reliably inferred.
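The idea of fitting a nuisance parameter by a one-dimensional search, and of the estimate collapsing to zero when the data are uninformative, can be sketched on a toy problem. The example assumes a simplified marginal model that is not the paper's concentrated cost: a scalar difference distributed as a zero-mean Gaussian with variance equal to an unknown prior variance plus a known noise variance. The golden-section routine and all names are illustrative choices.

```python
import math


def golden_section_min(cost, lo, hi, tol=1e-10):
    """Minimize a unimodal scalar function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = cost(c), cost(d)
    while b - a > tol:
        if fc < fd:                      # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = cost(c)
        else:                            # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = cost(d)
    return (a + b) / 2


def fit_prior_variance(raw_diff, noise_var):
    """ML estimate of prior_var when raw_diff ~ N(0, prior_var + noise_var)."""
    def neg_log_lik(prior_var):
        total = prior_var + noise_var
        return math.log(total) + raw_diff ** 2 / total
    # The minimizer is max(0, raw_diff**2 - noise_var), so this bracket suffices.
    return golden_section_min(neg_log_lik, 0.0, raw_diff ** 2 + 1.0)
```

In this toy model the search returns a value close to `raw_diff**2 - noise_var` when that quantity is positive and a value close to zero otherwise. The second case corresponds to a vanishing signal-to-noise ratio, in which the data do not discriminate a systematic difference from noise.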
IV Experimental results
We demonstrate the proposed inference method using both synthetic and real data.
IV-A Synthetic data
We generate two-group data using the model (1) and add a negative bias to the test group, using the setup parameters described in Fig. 1. The adaptive regularization of is illustrated in Fig. 2: when the unknown effect is nonexistent, , the estimates are concentrated at zero, despite the bias . As exceeds the dispersion of the sample averages, however, the regularized and standard estimators become nearly identical.
We report a significant effect estimate when a nonempty interval excludes the zero effect. Fig. 3 illustrates the ability of the proposed method to control the false positive error probability as increases, in contrast to the standard method. This is achieved while incurring a loss of statistical power that vanishes as the number of samples increases.
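The false positive behaviour of the standard approach under a sampling bias can be reproduced qualitatively with a small simulation. The sketch below does not use the setup of Fig. 1, whose parameters are not restated here. The sample sizes, unit variances, bias value, and the use of a normal quantile in place of Welch's t-quantile are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, variance


def false_positive_rate(bias, n0=50, n1=50, trials=2000, alpha=0.05, seed=0):
    """Fraction of trials in which the standard interval excludes zero
    although the true effect is zero."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # normal approximation (assumption)
    hits = 0
    for _ in range(trials):
        y0 = [rng.gauss(0.0, 1.0) for _ in range(n0)]
        y1 = [rng.gauss(bias, 1.0) for _ in range(n1)]   # biased test group
        estimate = mean(y1) - mean(y0)
        se = math.sqrt(variance(y0) / n0 + variance(y1) / n1)
        if abs(estimate) > z * se:            # interval excludes the zero effect
            hits += 1
    return hits / trials
```

Calling `false_positive_rate(0.0)` gives a rate near the nominal level, whereas `false_positive_rate(-0.5)` gives a much larger rate, since a bias of 2.5 standard errors is mistaken for an effect in most trials.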
IV-B Prostate cancer data
We now consider real data from healthy individuals and individuals with prostate cancer [8, 9]. The data contain 6033 different biomarker responses. The inferred effects are shown in Fig. 4. For 6 markers, the effects were found to be significant at the level. By contrast, the standard approach using Welch's t-intervals flags 478 genes as significant, but these inferences are less reliable under sampling biases.
V Conclusions
We developed a method for inferring effects in two-group data that, unlike the standard approach, is resilient to sampling biases. The method is able to control the false positive errors under moderate bias levels and its performance was demonstrated using both synthetic and real biomarker data.
-A The derivation of the Cramér-Rao bound
The mean-square error can be decomposed as
where is the conditional mean. Next, define the score function and the information matrix,
Since the marginal pdf is Gaussian, we can compute the information matrix using the Slepian-Bangs formula [10]. It has a block diagonal form
Let denote the correlation between the score function and estimation error. Then we have the general bound
In our case, we obtain
where the fourth line follows under the constant bias assumption. Inserting this expression for in (18) yields
This completes the proof.
-B The derivation of the confidence interval
We have that
Let , then
Thus when the estimator is efficient.
-C The derivation of the concentrated cost
Problem (9) can be formulated equivalently as the minimization of:
is inserted back to yield a concentrated cost function
Next, using the Sherman-Morrison and matrix determinant lemmas we can reparametrize as
where we dropped the subindices for notational convenience.
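The two lemmas can be checked numerically on a covariance matrix of the form $\sigma^2 \mathbf{I} + \lambda \mathbf{1}\mathbf{1}^\top$. This structure is an assumption made for illustration, since the covariance expression in (8) is not reproduced in this text. For such a matrix the Sherman-Morrison formula gives the inverse $\sigma^{-2}(\mathbf{I} - \frac{\lambda}{\sigma^2 + n\lambda}\mathbf{1}\mathbf{1}^\top)$ and the matrix determinant lemma gives the determinant $\sigma^{2n}(1 + n\lambda/\sigma^2)$. Function names are hypothetical.

```python
import math


def rank_one_covariance(n, sigma2, lam):
    """Return C = sigma2 * I + lam * 1 1^T as a nested list."""
    return [[lam + (sigma2 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def inverse_via_sherman_morrison(n, sigma2, lam):
    """Closed-form inverse: (I - lam / (sigma2 + n * lam) * 1 1^T) / sigma2."""
    shrink = lam / (sigma2 + n * lam)
    return [[((1.0 if i == j else 0.0) - shrink) / sigma2 for j in range(n)]
            for i in range(n)]


def log_det_via_lemma(n, sigma2, lam):
    """Matrix determinant lemma: det C = sigma2**n * (1 + n * lam / sigma2)."""
    return n * math.log(sigma2) + math.log1p(n * lam / sigma2)


def matmul(a, b):
    """Plain nested-list matrix product, used only to verify the inverse."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]
```

Multiplying `rank_one_covariance(n, sigma2, lam)` by `inverse_via_sherman_morrison(n, sigma2, lam)` with `matmul` returns the identity up to rounding. Both closed forms depend on the data dimension only through the scalar `n`, which is what makes a reparametrized cost cheap to evaluate.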
To find the minimizing , we first consider the stationary point of
Taking the derivative with respect to , yields the following condition for a stationary point:
or equivalently . Solving for , we obtain the estimate (12).
References
[1] G. Imbens and D. Rubin, Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
[2] J. Pearl, M. Glymour, and N. Jewell, Causal Inference in Statistics: A Primer. Wiley, 2016.
[3] C. Rao, Linear Statistical Inference and its Applications. Wiley Series in Probability and Statistics, Wiley, 1973.
[4] B. L. Welch, “The significance of the difference between two means when the population variances are unequal,” Biometrika, vol. 29, no. 3/4, pp. 350–362, 1938.
[5] S.-H. Kim and A. S. Cohen, “On the Behrens-Fisher problem: a review,” Journal of Educational and Behavioral Statistics, vol. 23, no. 4, pp. 356–377, 1998.
[6] P. Stoica and P. Babu, “The Gaussian data assumption leads to the largest Cramér-Rao bound [lecture notes],” IEEE Signal Processing Magazine, vol. 28, no. 3, pp. 132–133, 2011.
[7] T. Kailath, A. Sayed, and B. Hassibi, Linear Estimation. Prentice Hall Information and System Sciences Series, Prentice Hall, 2000.
[8] B. Efron, Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, vol. 1. Cambridge University Press, 2012.
[9] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D’Amico, J. P. Richie, et al., “Gene expression correlates of clinical prostate cancer behavior,” Cancer Cell, vol. 1, no. 2, pp. 203–209, 2002.
[10] P. Stoica and R. L. Moses, Spectral Analysis of Signals. Upper Saddle River, NJ: Pearson/Prentice Hall, 2005.