 # Effect Inference from Two-Group Data with Sampling Bias

In many applications, different populations are compared using data that are sampled in a biased manner. Under sampling biases, standard methods that estimate the difference between the population means yield unreliable inferences. Here we develop an inference method that is resilient to sampling biases and, in contrast to the standard approach, is able to control the false positive errors under moderate bias levels. We demonstrate the method using synthetic and real biomarker data.


## I Introduction

In many applications of statistical inference, the aim is to compare data from different populations. Specifically, given $n_0$ and $n_1$ samples from two groups, collected in vectors $y_0$ and $y_1$, the target quantity is often the difference between their means, denoted $\delta$, which we call the effect. For instance, in randomized trials and A/B testing, the data are outcomes from two populations and $\delta$ is the average causal effect of assigning subjects to a test group '1' as compared to a control group '0' [1, 2]. The standard approach is to use the difference between sample averages in each group, viz. $\hat{\delta} = \bar{y}_1 - \bar{y}_0$, where a confidence interval for $\delta$ can be obtained using Welch's method, which employs an approximating t-distribution [3, 4, 5].
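As a point of reference, the standard approach can be sketched as follows. This is an illustrative stdlib-only version, not an exact Welch implementation: the t-quantile with Welch–Satterthwaite degrees of freedom is replaced here by a normal quantile, which is adequate for moderate sample sizes.

```python
from statistics import NormalDist, mean, variance

def welch_interval(y0, y1, alpha=0.05):
    """Standard effect estimate with an approximate Welch interval.

    Illustrative sketch: the exact Welch method uses a t-quantile with
    estimated degrees of freedom; a normal quantile stands in for it here.
    """
    n0, n1 = len(y0), len(y1)
    delta_hat = mean(y1) - mean(y0)                   # difference of sample averages
    se = (variance(y0) / n0 + variance(y1) / n1) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)           # ~1.96 for alpha = 0.05
    return delta_hat, (delta_hat - z * se, delta_hat + z * se)
```

Under representative sampling this interval has approximately the nominal coverage; under sampling biases, as discussed below, it does not.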

Ideally, the samples from both groups are representative of their target populations. Then the bias of the estimator,

$$b = \mathbb{E}[\hat{\delta}] - \delta,$$

is zero. However, in nonideal conditions with finite samples this is not the case, e.g., when some units of the intended populations are less likely to be included than others. Under such conditions, $b$ decreases with the sample sizes $n_0$ and $n_1$ but will nevertheless be nonzero. Sampling biases increase the risk of inferring spurious effects when using standard inference methods.

In this paper, we develop an inference method that is resilient to sampling biases. In contrast to the standard approach, the proposed method reduces the risk of reporting spurious effect estimates and is capable of controlling the false positive errors under moderate biases. The method relies on an effect estimator using a fully automatic and data-adaptive regularization. We demonstrate its performance on both synthetic and real data.

###### Remark 1.

Code for the method can be found at https://github.com/dzachariah/two-groups-data.

## II Problem formulation

We model the dataset as

$$y = \begin{bmatrix} y_0 \\ y_1 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu \mathbf{1} \\ (\mu+\delta)\mathbf{1} \end{bmatrix},\; \begin{bmatrix} v_0 I & 0 \\ 0 & v_1 I \end{bmatrix} \right). \tag{1}$$

The model based on the Gaussian distribution yields the least favourable distribution for estimating the unknown effect $\delta$ [6]. We model the effect as a random variable, where different ranges of values of $\delta$ have different probabilities. To achieve resilience to sampling biases, we adopt a conservative approach in which nonexistent or negligible effects are considered to be more probable. Specifically, we employ the following model:

$$\delta \sim \mathcal{N}(0, \lambda), \tag{2}$$

where $\lambda \geq 0$ is an unknown parameter.

Our aim is to derive a confidence interval $C_\alpha(y)$ that contains the unknown $\delta$ with a coverage probability of at least $1-\alpha$. That is,

$$\Pr\{\delta \in C_\alpha(y)\} \geq 1 - \alpha. \tag{3}$$

The confidence interval is to be centered on an estimator $\hat{\delta}(y)$ and should be resilient to sampling biases. That is, even if $b \neq 0$, the interval must not indicate nonzero effects with a probability greater than $\alpha$. Fig. 1 illustrates the ability of the method proposed below to ensure (3) under a range of biases, provided $b$ does not greatly exceed the dispersion of the sample averages.

Fig. 1: Probability of false positive error versus bias $b$, when $\delta = 0$. Significant effects are inferred when the confidence interval excludes the zero effect, using Welch's method (black dashed line) and the proposed method (solid line). Setting $\alpha = 0.05$, the error rate must not exceed 5% (red dashed line). The bias is varied in units of the standard deviation of $\bar{y}_0$ and added to the data from the test group. Data were generated using (1) with $n_0 = 40$, $n_1 = 20$, unknown variances $v_0 = 0.3^2$, $v_1 = 0.15^2$, and mean $\mu = 1$.

We will derive a confidence interval using models (1) and (2), with nuisance parameters

$$\theta = \mathrm{col}\{\mu, \lambda, v_0, v_1\}.$$

## III Proposed method

Let $\bar{\delta} = \mathbb{E}_\theta[\delta \mid y]$ be the conditional mean of the effect given the data. Using an estimate $\hat{\theta}$ of the nuisance parameters, we propose the following effect estimator:

$$\hat{\delta}(y) = \mathbb{E}_\theta[\delta \mid y]\Big|_{\theta=\hat{\theta}} = \frac{\rho n_1}{\rho n_1 + 1}(\bar{y}_1 - \mu)\Big|_{\theta=\hat{\theta}}, \tag{4}$$

where we introduce the variable $\rho = \lambda / v_1$, which can be interpreted as a signal-to-noise ratio.

###### Result 1 (Cramér-Rao bound).

When the systematic error of $\hat{\delta}$ is invariant with respect to $\theta$, then the mean-squared error over all possible effects and data is bounded as $\mathbb{E}[|\delta - \hat{\delta}|^2] \geq c_\theta^2$, where

$$c_\theta^2 = \frac{\rho v_1}{\rho n_1 + 1} + \frac{\rho^2 n_1^2}{(\rho n_1 + 1)^2}\left(\frac{n_0}{v_0} + \frac{n_1}{v_1}\frac{1}{\rho n_1 + 1}\right)^{-1}. \tag{5}$$
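A limiting case, spelled out here for clarity, explains the degenerate intervals encountered later: as the signal-to-noise ratio vanishes, so does the bound,

```latex
\lim_{\rho \to 0} c_\theta^2
  = \frac{0 \cdot v_1}{0 + 1}
  + \frac{0}{(0 + 1)^2}\left(\frac{n_0}{v_0} + \frac{n_1}{v_1}\right)^{-1}
  = 0,
```

so when the estimated signal-to-noise ratio is zero the interval of Result 2 below, which has half-width proportional to $c_\theta$, degenerates to the empty set.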
###### Proof.

See Appendix -A. ∎

###### Result 2 (Confidence interval).

Let

$$C_\alpha(y) = \left\{\delta' : |\delta' - \hat{\delta}(y)| < \alpha^{-1/2} c_\theta \right\}. \tag{6}$$

When using an efficient estimator that attains the bound (5), the interval in (6) satisfies the specified coverage probability (3).

###### Proof.

See Appendix -B. ∎

Evaluating $\hat{\delta}(y)$ and $c_\theta$ requires estimates of the nuisance parameters $\theta$. Here we adopt the maximum likelihood approach and estimate $\theta$ using the marginalized data distribution,

$$p_\theta(y) = \int p_\theta(y \mid \delta)\, p_\theta(\delta)\, d\delta. \tag{7}$$

It can be shown that (7) is a Gaussian distribution with mean and covariance

$$\mathbb{E}_\theta[y] = \mathbf{1}\mu \quad \text{and} \quad \mathrm{Cov}_\theta[y] = \mathrm{diag}(V_0, V_1), \tag{8}$$

where $V_0 = v_0 I$ and $V_1 = v_1 I + \lambda \mathbf{1}\mathbf{1}^\top$. The estimated parameters are given by

$$\hat{\theta} = \arg\max_\theta\, p_\theta(y). \tag{9}$$

Interestingly, the problem (9) can be solved by a one-dimensional numerical search. Begin by defining the variables

$$\alpha = y_1^\top y_1 - \mu n_1 (2\bar{y}_1 - \mu), \qquad \beta = n_1^2(\bar{y}_1 - \mu)^2, \qquad \gamma = \alpha n_1 - \beta.$$

Note that $\gamma \geq 0$ by the Cauchy–Schwarz inequality. Then the following result holds.
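The quantity $\gamma = \alpha n_1 - \beta$ is always nonnegative, since expanding the squares gives the identity $\gamma = n_1 \sum_i (y_{1,i} - \bar{y}_1)^2$. The snippet below (our illustration; the function name is ours) checks that the two expressions agree numerically:

```python
def gamma_two_ways(y1, mu):
    """gamma = alpha * n1 - beta, computed directly and via the
    equivalent variance identity gamma = n1 * sum_i (y1_i - ybar1)^2."""
    n1 = len(y1)
    a = sum((v - mu) ** 2 for v in y1)      # alpha = ||y1 - mu 1||^2
    b = sum(v - mu for v in y1) ** 2        # beta  = |1^T (y1 - mu 1)|^2
    ybar1 = sum(y1) / n1
    return a * n1 - b, n1 * sum((v - ybar1) ** 2 for v in y1)
```

Because $\gamma$ does not depend on $\mu$ through anything but the centered data, it is the same for every trial mean, which is what makes the one-dimensional search below well behaved.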

###### Result 3 (Nuisance parameter estimates).

The estimated variances are given by

$$\hat{v}_0 = \frac{1}{n_0} y_0^\top y_0 - \mu(2\bar{y}_0 - \mu), \tag{10}$$
$$\hat{v}_1 = \frac{1}{n_1}\frac{\alpha + \rho\gamma}{1 + \rho n_1}, \tag{11}$$

which are ensured to be nonnegative, and $\hat{\lambda} = \hat{\rho}\hat{v}_1$, where

$$\hat{\rho} = \begin{cases} \dfrac{\beta - \alpha}{\gamma}, & \beta - \alpha \geq 0, \\[4pt] 0, & \text{otherwise}. \end{cases} \tag{12}$$

All variables in (10)-(12) are functions of the mean $\mu$, whose estimate $\hat{\mu}$ is obtained by minimizing the one-dimensional function

$$f(\mu) = n_0 \ln \hat{v}_0 + n_1 \ln(\alpha + \hat{\rho}\gamma) - (n_1 - 1)\ln(1 + \hat{\rho} n_1). \tag{13}$$
###### Proof.

See Appendix -C. ∎

By plugging $\hat{\mu}$, $\hat{v}_0$, $\hat{v}_1$, and $\hat{\rho}$ into (4) and (6), we obtain the estimates $\hat{\delta}(y)$ and $C_\alpha(y)$, respectively. We note that the overall mean $\mu$ is fitted to the data in a nonstandard manner using (13), which yields a fully automatic and data-adaptive regularization of the effect estimator (4). If the minimizing $\hat{\mu}$ is such that $\beta - \alpha < 0$, then the estimated signal-to-noise ratio is $\hat{\rho} = 0$. In this case, the method indicates that the data is not sufficiently informative to discriminate any systematic difference from noise. Consequently, $\hat{\delta}(y)$ collapses to zero and $C_\alpha(y)$ becomes empty, indicating a case in which the effect cannot be reliably inferred.
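The whole procedure can be sketched in a few lines. The following is our illustrative stdlib-only reimplementation of equations (4)-(13), not the authors' released code; in particular, the one-dimensional search over $\mu$ is done by a plain grid search over the data range rather than a dedicated optimizer.

```python
import math

def fit_two_group(y0, y1, alpha_level=0.05):
    """Effect estimate (4) and interval (6), with nuisance parameters
    estimated via the concentrated cost (13). Illustrative sketch only."""
    n0, n1 = len(y0), len(y1)
    s0, s1 = sum(y0), sum(y1)
    q0, q1 = sum(v * v for v in y0), sum(v * v for v in y1)

    def pieces(mu):
        ybar0, ybar1 = s0 / n0, s1 / n1
        v0 = q0 / n0 - mu * (2 * ybar0 - mu)          # (10): ||y0 - mu 1||^2 / n0
        a = q1 - mu * n1 * (2 * ybar1 - mu)           # alpha = ||y1 - mu 1||^2
        b = (n1 * (ybar1 - mu)) ** 2                  # beta
        g = a * n1 - b                                # gamma >= 0
        rho = (b - a) / g if (b - a >= 0 and g > 0) else 0.0   # (12)
        return v0, a, g, rho

    def cost(mu):                                     # concentrated cost (13)
        v0, a, g, rho = pieces(mu)
        if v0 <= 0 or a + rho * g <= 0:
            return math.inf
        return (n0 * math.log(v0) + n1 * math.log(a + rho * g)
                - (n1 - 1) * math.log(1 + rho * n1))

    # one-dimensional grid search for mu over the range of the pooled data
    lo, hi = min(min(y0), min(y1)), max(max(y0), max(y1))
    mu = min((lo + (hi - lo) * k / 2000 for k in range(2001)), key=cost)

    v0, a, g, rho = pieces(mu)
    v1 = (a + rho * g) / (n1 * (1 + rho * n1))        # (11)
    delta = rho * n1 / (rho * n1 + 1) * (s1 / n1 - mu)  # (4)
    c2 = (rho * v1 / (rho * n1 + 1)                   # bound (5)
          + (rho * n1 / (rho * n1 + 1)) ** 2
          / (n0 / v0 + (n1 / v1) / (rho * n1 + 1)))
    half = math.sqrt(c2 / alpha_level)                # alpha^{-1/2} c_theta, (6)
    return delta, (delta - half, delta + half)
```

When $\hat{\rho} = 0$ the returned half-width is zero and the open interval (6) is empty, reproducing the collapse described above.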

## IV Experimental results

We demonstrate the proposed inference method using both synthetic and real data.

### IV-A Synthetic data

We generate two-group data using the model (1) and add a negative bias to the test group, using the setup parameters described in Fig. 1. The adaptive regularization of $\hat{\delta}(y)$ is illustrated in Fig. 2: when the unknown effect is nonexistent, $\delta = 0$, the estimates are concentrated at zero, despite the bias $b$. As $|\delta|$ exceeds the dispersion of the sample averages, however, the regularized and standard estimators become nearly identical.

We report a significant effect estimate when a nonempty interval $C_\alpha(y)$ excludes the zero effect. Fig. 3 illustrates the ability of the proposed method to control the false positive error probability as $|b|$ increases, in contrast to the standard method. This is achieved while incurring a loss of statistical power that vanishes as the number of samples increases.
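The inflation of the standard method's false positive rate under bias (the black dashed curve in Fig. 1) can be reproduced in spirit with a short Monte Carlo sketch. This uses the Fig. 1 setup ($n_0 = 40$, $n_1 = 20$, $v_0 = 0.3^2$, $v_1 = 0.15^2$, $\mu = 1$) but a normal-quantile approximation to the Welch test rather than the exact t-interval, so the numbers are indicative only.

```python
import random
from statistics import NormalDist, mean, variance

def welch_false_positive_rate(bias, n_trials=2000, alpha=0.05, seed=0):
    """Fraction of trials in which a zero effect is (wrongly) declared
    significant when `bias` is added to the test group."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_trials):
        y0 = [rng.gauss(1.0, 0.3) for _ in range(40)]
        y1 = [rng.gauss(1.0 + bias, 0.15) for _ in range(20)]  # delta = 0, biased sampling
        d = mean(y1) - mean(y0)
        se = (variance(y0) / 40 + variance(y1) / 20) ** 0.5
        hits += abs(d) > z * se                    # interval excludes zero
    return hits / n_trials
```

With no bias the rate sits near the nominal level, while a bias of about two standard deviations of $\bar{y}_0$ (here $0.3/\sqrt{40} \approx 0.047$ per unit) already inflates it severalfold, which is the failure mode the proposed regularization is designed to damp.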

### IV-B Prostate cancer data

We now consider real data from healthy individuals and individuals with prostate cancer [8, 9]. The data contain 6033 different biomarker responses. The inferred effects are shown in Fig. 4. For 6 markers, the effects were found to be significant at the $\alpha = 0.05$ level. By contrast, the standard approach using Welch's t-intervals yields 478 genes, but the inferences are less reliable under sampling biases.

Fig. 4: Confidence intervals $C_{\alpha=0.05}(y)$ for 6033 different experiments using two-group biomarker data ($n_0 = 50$ and $n_1 = 52$). In six cases, highlighted in red, the effects were found to be significant as the intervals did not contain the zero effect. Note that several intervals are empty, indicating cases in which the data is not informative enough for the fitted model to discern any systematic effect from the noise.

## V Conclusions

We developed a method for inferring effects in two-group data that, unlike the standard approach, is resilient to sampling biases. The method is able to control the false positive errors under moderate bias levels and its performance was demonstrated using both synthetic and real biomarker data.

### -A The derivation of the Cramér-Rao bound

The mean-square error can be decomposed as

$$\mathbb{E}[|\delta - \hat{\delta}|^2] = \mathbb{E}_y\Big[\mathbb{E}_{\delta|y}\big[|\delta - \bar{\delta} + \bar{\delta} - \hat{\delta}|^2\big]\Big] = \mathbb{E}_y\Big[\mathrm{Var}[\delta \mid y] + |\bar{\delta} - \hat{\delta}|^2\Big] = \frac{\lambda v_1}{\lambda n_1 + v_1} + \mathbb{E}_y\big[|\bar{\delta} - \hat{\delta}|^2\big], \tag{14}$$

where $\bar{\delta} = \mathbb{E}[\delta \mid y]$ is the conditional mean. Next, define the score function and the information matrix,

$$\phi \triangleq \partial_\theta \ln p_\theta(y) \quad \text{and} \quad J = \mathbb{E}_y[\phi\phi^\top]. \tag{15}$$

Since the marginal pdf is Gaussian, we can compute $J$ using the Slepian-Bangs formula [10]. It has a block-diagonal form

$$J = \begin{bmatrix} J_{1,1} & 0 \\ 0 & * \end{bmatrix}, \tag{16}$$

where

$$J_{1,1} = \frac{n_0}{v_0} + \frac{n_1}{\lambda n_1 + v_1}. \tag{17}$$

Let $g = \mathbb{E}_y[\phi(\bar{\delta} - \hat{\delta})]$ denote the correlation between the score function and the estimation error. Then we have the general bound

$$0 \leq \mathbb{E}_y\Big[\big|(\bar{\delta} - \hat{\delta}) - g^\top J^{-1}\phi\big|^2\Big] = \mathbb{E}_y\big[|\bar{\delta} - \hat{\delta}|^2\big] - g^\top J^{-1} g. \tag{18}$$

In our case, we obtain

$$\begin{aligned} g &= \int [\partial_\theta \ln p_\theta](\bar{\delta} - \hat{\delta})\, p_\theta\, dy \\ &= \int \partial_\theta\big[p_\theta(\bar{\delta} - \hat{\delta})\big] - p_\theta\big[\partial_\theta(\bar{\delta} - \hat{\delta})\big]\, dy \\ &= \partial_\theta[\mathrm{bias}(\theta)] - \mathbb{E}_y\big[\partial_\theta(\bar{\delta} - \hat{\delta})\big] \\ &= -\mathbb{E}_y\left[\partial_\theta\left(\frac{\lambda}{\lambda n_1 + v_1}\,\mathbf{1}^\top(y_1 - \mu\mathbf{1})\right)\right] \\ &= \mathrm{col}\left\{\frac{\lambda n_1}{\lambda n_1 + v_1},\, 0,\, 0,\, 0\right\}, \end{aligned}$$

where the fourth line follows under the constant bias assumption, and the last line uses $\mathbb{E}_y[y_1 - \mu\mathbf{1}] = 0$ under the marginal distribution. Inserting this expression for $g$ in (18) yields

$$\mathbb{E}[|\delta - \hat{\delta}|^2] \geq \frac{\lambda v_1}{\lambda n_1 + v_1} + \left(\frac{\lambda n_1}{\lambda n_1 + v_1}\right)^2 J_{1,1}^{-1}. \tag{19}$$
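For completeness, substituting $\rho = \lambda/v_1$ makes the correspondence between (19) and the bound $c_\theta^2$ in (5) explicit:

```latex
\frac{\lambda v_1}{\lambda n_1 + v_1} = \frac{\rho v_1}{\rho n_1 + 1},
\qquad
\frac{\lambda n_1}{\lambda n_1 + v_1} = \frac{\rho n_1}{\rho n_1 + 1},
```

obtained by dividing each numerator and denominator by $v_1$, so the right-hand side of (19) equals $c_\theta^2$ term by term.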

This completes the proof.

### -B The derivation of the confidence interval

We have that

$$\delta \notin C_\alpha(y) \iff \alpha c_\theta^{-2}\,|\delta - \hat{\delta}(y)|^2 \geq 1. \tag{20}$$

Let $z = \alpha c_\theta^{-2}\,|\delta - \hat{\delta}(y)|^2 \geq 0$; then, by Markov's inequality,

$$\Pr\{\delta \notin C_\alpha(y)\} = \Pr\{z \geq 1\} \leq \mathbb{E}[z] = \alpha c_\theta^{-2}\, \mathbb{E}\big[|\delta - \hat{\delta}(y)|^2\big].$$

Thus $\Pr\{\delta \notin C_\alpha(y)\} \leq \alpha$ when the estimator is efficient.

### -C The derivation of the concentrated cost

Problem (9) can be formulated equivalently as the minimization of

$$f(\theta) = \underbrace{n_0 \ln v_0 + \frac{1}{v_0}\|y_0 - \mu\mathbf{1}\|^2}_{f_0(\mu,\, v_0)} + \underbrace{\ln|V_1| + \|y_1 - \mu\mathbf{1}\|^2_{V_1^{-1}}}_{f_1(\mu,\, \lambda,\, v_1)}. \tag{21}$$

The minimizer

$$\hat{v}_0 = \|y_0 - \mu\mathbf{1}\|^2 / n_0 \tag{22}$$

is inserted back to yield a concentrated cost function

$$f_0(\mu, \hat{v}_0) = n_0 \ln \hat{v}_0 + n_0. \tag{23}$$

Next, using the Sherman-Morrison and matrix determinant lemmas, we can reparametrize $f_1$ in terms of $\rho = \lambda/v_1$ as

$$f_1(\mu, \rho, v) = \ln(1 + \rho n) + n \ln v + \frac{1}{v}\left(\|y_1 - \mu\mathbf{1}\|^2 - \frac{\rho\,|\mathbf{1}^\top(y_1 - \mu\mathbf{1})|^2}{1 + \rho n}\right), \tag{24}$$

where we dropped the subindices for notational convenience.

Using the identities $\alpha = \|y_1 - \mu\mathbf{1}\|^2$, $\beta = |\mathbf{1}^\top(y_1 - \mu\mathbf{1})|^2$, and $\gamma = \alpha n - \beta$, the minimizing $v$ of (24) is found as (11). Inserting the variance estimate back yields a concentrated cost function

$$f_1(\mu, \rho, \hat{v}) = \ln\frac{(\alpha + \rho\gamma)^n}{(1 + \rho n)^{n-1}} + n. \tag{25}$$

To find the minimizing $\rho$, we first consider the stationary point of

$$\tilde{f}_1(\mu, \rho) = (\alpha + \rho\gamma)^n (1 + \rho n)^{-(n-1)}.$$

Taking the derivative with respect to $\rho$ yields the following condition for a stationary point:

$$n\gamma(\alpha + \rho\gamma)^{n-1}(1 + \rho n)^{-n+1} - (n-1)\,n\,(1 + \rho n)^{-n}(\alpha + \rho\gamma)^n = 0,$$

or equivalently $\gamma(1 + \rho n) = (n-1)(\alpha + \rho\gamma)$, which, using $\gamma = \alpha n - \beta$, reduces to $\rho\gamma = \beta - \alpha$. Solving for $\rho$, we obtain the estimate (12).

By evaluating the second derivative at this point, we verify that it is a minimum. Inserting (12) back into (25) and combining with (23), we can write (21) in the concentrated form (13) after omitting irrelevant constants.

## References

1. G. Imbens and D. Rubin, *Causal Inference in Statistics, Social, and Biomedical Sciences*. Cambridge University Press, 2015.
2. J. Pearl, M. Glymour, and N. Jewell, *Causal Inference in Statistics: A Primer*. Wiley, 2016.
3. C. Rao, *Linear Statistical Inference and its Applications*. Wiley Series in Probability and Statistics, Wiley, 1973.
4. B. L. Welch, "The significance of the difference between two means when the population variances are unequal," *Biometrika*, vol. 29, no. 3/4, pp. 350–362, 1938.
5. S.-H. Kim and A. S. Cohen, "On the Behrens-Fisher problem: a review," *Journal of Educational and Behavioral Statistics*, vol. 23, no. 4, pp. 356–377, 1998.
6. P. Stoica and P. Babu, "The Gaussian data assumption leads to the largest Cramér-Rao bound [lecture notes]," *IEEE Signal Processing Magazine*, vol. 28, no. 3, pp. 132–133, 2011.
7. T. Kailath, A. Sayed, and B. Hassibi, *Linear Estimation*. Prentice Hall, 2000.
8. B. Efron, *Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction*, vol. 1. Cambridge University Press, 2012.
9. D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D'Amico, J. P. Richie, et al., "Gene expression correlates of clinical prostate cancer behavior," *Cancer Cell*, vol. 1, no. 2, pp. 203–209, 2002.
10. P. Stoica and R. L. Moses, *Spectral Analysis of Signals*. Pearson/Prentice Hall, 2005.