Identifiability and estimation under the test-negative design with population controls with the goal of identifying risk and preventive factors for SARS-CoV-2 infection

Because of the rapidly evolving COVID-19 pandemic caused by the SARS-CoV-2 virus, rapid public health investigations of the relationships between behaviours and infection risk are essential. Recently, the test-negative design was proposed as a way to recruit and survey participants who are being tested for SARS-CoV-2 infection in order to evaluate associations between their characteristics and testing positive. It was also proposed to recruit additional untested controls from the general public in order to have a baseline comparison group. This study design involves two major challenges for statistical risk factor analysis: 1) the selection bias induced by restricting to people being tested and 2) imperfect sensitivity and specificity of the SARS-CoV-2 test. In this study, we investigate the nonparametric identifiability of potential statistical parameters of interest under a hypothetical data structure, expressed through missing data directed acyclic graphs. We clarify the types of data that must be collected in order to correctly estimate the parameter of interest. We then propose a novel inverse probability weighting estimator that can consistently estimate the parameter of interest under correctly specified nuisance models.








1 Introduction

Under the current pandemic caused by the SARS-CoV-2 virus, where the resulting illness is referred to as COVID-19, it is challenging to implement fast epidemiological inquiries to map and understand the disease. Highly infectious [12] in a completely non-immune population and targeting primarily the respiratory system with clinical symptoms that include fever, cough, and fatigue [5, 32], this illness continues to cause substantial morbidity and mortality, straining the healthcare systems of many countries. With the aim of reducing infection, global campaigns encourage individuals to modify their daily behaviour by measures which include physical distancing, the use of masks, and intensified hygienic practices. Much interest lies in establishing whether these interventions are effective at reducing infection probabilities at an individual or population level.

Given the challenges involved in testing large portions of the population for active infection with SARS-CoV-2, cases in the general public are typically ascertained through testing sites. In Canada, though regulations have varied by epidemic stage and jurisdiction [20], in order to obtain a test, individuals may be required to be experiencing symptoms of COVID-19 and/or have other reason to believe that they are infected, such as being a healthcare worker. Thus, if potential study participants are recruited at test centers, the resulting study cohort will not be representative of the general population at risk for the disease. Further, due to the nature of testing self-selection, associations measured between participant covariates (e.g. demographics, characteristics, and behaviours) and outcomes will not necessarily be representative of true causes or even predictors of infection [13].

Recently, Vandenbroucke et al. [30] proposed a modified case-control design that combines a test-negative design, best known from its use in vaccine effectiveness research [7, 9], with the recruitment of additional population controls in order to identify risk/preventive factors of SARS-CoV-2 infection. This study design involves recruiting both patients who are seeking testing for SARS-CoV-2 infection and untested individuals in the general population. They propose a matching approach to compare differences in study participant covariates between people who test positive, people who test negative, and people who are not seeking testing in order to triangulate factors that likely increase or decrease the odds of infection. Karmakar and Small [15] proposed a more efficient test to compare factors between the three groups. However, neither of these studies addresses the problem from a missing data perspective while dealing with the potential selection bias of having two groups of participants who received testing based on outcome-related symptoms.

In this methodological study, we give a specific definition of the parameters of interest in a “risk/preventive factor” analysis, corresponding to modeling the covariates that are predictive of SARS-CoV-2 infection in a regression model. We then provide identifiability conditions under the test-negative design and an assumed structural setting in the context of the COVID-19 pandemic. A parameter is identifiable if its exact value could be determined from an infinite sample of the observed data; our study thus gives some conditions under which identifiability is achieved. Identifiability is of great importance because, without this guarantee, we cannot construct a consistent estimator under the given assumptions. Finally, we propose an inverse probability weighting (IPW) estimator [14, 21, 3] of the parameters of interest that may be feasible in this setting. We evaluate this approach through a simulation study and compare it with a naive approach to estimation that does not incorporate untested population controls.

2 The Test-Negative Study Design with Population Controls

Given access to the recruitment of people seeking SARS-CoV-2 tests at a given testing site, we consider a study design that involves the recruitment of two groups of people: (1) people who are being tested for SARS-CoV-2 at the test site and (2) members of the general population, possibly selected as matched pairs for those being tested, who are not seeking testing but who are under the jurisdiction of the test site. It must be the case that members of group (2) would be able to access the test site were they to have symptoms of COVID-19 or otherwise qualify for and seek testing. Participants recruited from group (1) are denoted and those from group (2) are denoted . This is essentially a case-control study design except that cases are those who are tested and controls are those who are untested. We assume random sampling of independent members of both groups, with a total sample size of . The need for independence implies that, for example, only one member per household should be recruited.

All participants are given a questionnaire to collect information about the potential risk/preventive factors under study, , which we will now refer to simply as risk factors, and related confounders, . The questionnaire may also capture information about current symptoms related to COVID-19, . It is also necessary that we receive the result of the SARS-CoV-2 test from those being tested. The test result is denoted (1=positive; 0=negative), which is only observable for those who are being tested ().

Recruitment from groups (1) and (2) will give us three categories of participants to contrast: those being tested and who test positive for COVID-19, those being tested and who test negative for COVID-19, and those not being tested. In principle, the comparison of the covariates () between test-positives and negatives will allow us to see which risk factors differ between people who have become infected with SARS-CoV-2 versus those who haven’t and we can then compare to the controls from the general population [30]. But simple contrasts of tested participants may not estimate an interpretable parameter. In fact, the interpretation of any measured associations relies on certain structural assumptions and the collection of necessary data that would allow for the credibility of these assumptions. We expand on these concepts in the remainder of the paper.

3 Parameters of interest

Our scientific objective is to identify risk factors, i.e. behaviours or characteristics of individuals that are associated with COVID-19 in a chosen outcome regression model, possibly after adjustment for other suspected confounders of these risk factors [24, 23]. The population of interest is members of the general public, who were not previously infected with SARS-CoV-2 (i.e. who are at risk of infection), and who are under the jurisdiction of the COVID-19 testing site under study, but who may or may not be seeking testing. The regression model thus represents the associations between the risk factors and prospective short-term risk of infection with SARS-CoV-2 in this population.

The binary outcome of interest is infection with SARS-CoV-2, denoted . Due to imperfect test sensitivity and specificity, this outcome may not correspond with , the result of the test for COVID-19. Sensitivity is defined as the probability of testing positive when truly infected. Specificity is defined as the probability of testing negative when not infected. The observed data are thus of the form . The complete data under knowledge of the true outcomes are . Under a perfect test for COVID-19 (i.e. sensitivity and specificity equal to one), when in which case the observed data with outcome censoring can be written as

. We will use lower case letters to represent realizations of these random variables. In particular,

for where is the total sample size.

We then define a logistic regression model


where , (and similarly for the realizations denoted by lower case letters), , and

, under a typical log-likelihood loss function. Our interest lies in the vector parameter

where corresponds to the conditional odds ratio related to the covariate .
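The symbols of the model are not legible in this copy, so the following is only a sketch of the kind of logistic model the text describes. The notation is assumed for illustration and is not confirmed by the source: $D$ for true infection, $X$ for the risk factors, $C$ for the confounders, and $\beta$ for the coefficient vector defined as the maximizer of the expected log-likelihood.

```latex
\begin{equation*}
\operatorname{logit} P(D = 1 \mid X = x, C = c)
  = \beta_0 + \beta_X^\top x + \beta_C^\top c,
\end{equation*}
\begin{equation*}
\beta = \arg\max_{b} \;
  E\left[ D \log m_b(X, C) + (1 - D) \log\{1 - m_b(X, C)\} \right],
\qquad
m_b(x, c) = \operatorname{expit}(b_0 + b_X^\top x + b_C^\top c).
\end{equation*}
% Under this assumed notation, exp(beta_{X,j}) is the conditional
% odds ratio associated with the j-th risk factor.
```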

Importantly, the parameters in this regression model may not represent causal effects, i.e. even if a coefficient is negative, it does not necessarily mean that decreases the risk of infection [24, 23]. In order to establish such a relationship, all confounders of the effects of risk factors on the outcome must be adjusted for in the model, the model must correctly represent the mechanisms of infection, and all risk factors must be independently manipulable. While we could extend this work to consider these aspects, for simplicity we retain as the statistical parameter of interest in this study; in practice its estimation may provide important insights into the variables related to SARS-CoV-2 infection or ways that high-risk individuals may be identified. The remainder of the article addresses the estimation of the statistical parameter within the regression model.

4 Potential for Collider Bias Resulting from Selecting Patients at Test Sites

The challenge in comparing patients who tested positive versus negative arises from selecting on patients who seek testing.

Figure 1 is a missing data directed acyclic graph (mDAG) [6, 16] representing assumed relationships between covariates in this analysis. In particular, we allow for the baseline covariates to potentially cause (i.e. influence the risk for) SARS-CoV-2 infection, . If a patient seeks testing () then we observe a test result . If the test can perfectly detect infection, then if and if . Testing is typically obtained if the individual has suspected symptoms of COVID-19, (which may include fever, respiratory symptoms, etc) but the act of seeking and then receiving testing may also be affected by the risk factors () and other baseline covariates (). For example, an alert individual who frequently hand washes may be more inclined to seek testing, possibly also depending on whether they are experiencing real or perceived symptoms of COVID-19 (included in ). Any variable in , such as recent travel, that places a person at higher risk for infection may also prompt that person to seek testing, even with absent or mild symptoms. We assume that true infection only affects test-seeking behaviour through symptoms. Thus, in we include all symptoms known to the participant. may also represent symptoms of other (respiratory or other) infections, and these may be caused by pathogens other than the SARS-CoV-2 virus.

Figure 1: mDAG representing hypothetical relationship between baseline covariates and , symptoms , seeking testing , true infection , and observed test outcome . Note that is observed with error for tested subjects (). Drawn using DAGitty. [26]

Other unmeasured factors may modify the risk of COVID-19, . We allow for unmeasured causes (omitted from graph for simplicity) of both COVID-19 and other infections. In constructing this mDAG, we also assume that no unmeasured factor simultaneously affects any pair of nodes. We will discuss and relax some of these assumptions in Section 5.

The objective of our analysis is to estimate the model parameters representing the relationship between and while adjusting for , i.e. the parameters of the model for . But because we only have outcome data from those who are seeking testing, we may consider directly modeling the observed outcomes among those who were tested . Even under a perfect test for COVID-19, so that when , such modeling of the selected population may produce misleading associations between and . This is due to collider bias [13, 4], which is caused by subsetting or adjusting for a variable that is caused by the two variables whose association is of interest. In our case, we would be conditioning on , which is caused by both (through ) and by . Thus, there is a possibility for erroneous conclusions resulting from the measured associations between and among those seeking testing.

For example, having access to a private vehicle allows you to avoid public transit, which may be a preventive factor for viral infection. Thus we may be interested in measuring the association between access to a private car and COVID-19. However, access to a COVID-19 test site is also facilitated by access to a vehicle, especially for those living further away from the test sites. Because (we are supposing) those with a car are more likely to be able to seek testing if they have symptoms, there is a disproportionate number of people without cars with COVID-19 who are not tested. Thus, we may measure a negative association between access to a private vehicle and COVID-19 even if there is no causal relationship between car ownership and COVID-19. We also demonstrate this by example in the simulation study.
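The car example can be illustrated with a small simulation. This is a minimal sketch and not the simulation reported later in the paper: all probabilities below are assumed values chosen for illustration, and the direction and size of the bias depend on them. Infection is generated independently of car ownership, so the population odds ratio should be close to one, while testing depends on both symptoms and car ownership.

```python
import random


def odds_ratio(rows):
    """Odds ratio of infection comparing car owners with non-owners."""
    counts = [[0, 0], [0, 0]]  # counts[car][infected]
    for car, infected in rows:
        counts[car][infected] += 1
    return (counts[1][1] * counts[0][0]) / (counts[1][0] * counts[0][1])


def simulate(n=200_000, seed=1):
    rng = random.Random(seed)
    population, tested = [], []
    for _ in range(n):
        car = int(rng.random() < 0.5)
        infected = int(rng.random() < 0.10)  # independent of car ownership
        symptoms = int(rng.random() < (0.8 if infected else 0.1))
        # Assumed testing probabilities: car owners reach the site more easily,
        # especially when they have no symptoms.
        if symptoms:
            p_test = 0.6 if car else 0.3
        else:
            p_test = 0.2 if car else 0.02
        population.append((car, infected))
        if rng.random() < p_test:
            tested.append((car, infected))
    return odds_ratio(population), odds_ratio(tested)


or_population, or_tested = simulate()
print(f"OR in full population: {or_population:.2f}")
print(f"OR among tested only:  {or_tested:.2f}")
```

Under these assumed probabilities, the odds ratio computed among tested individuals falls well below one even though car ownership has no effect on infection, which is the pattern described above.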

5 Identifiability of Risk Factors in the General Population

In this section, we discuss the identifiability of the parameters of the model in equation (1) under the mDAG in Figure 1. We then discuss identifiability under some less restrictive assumptions. When identifiability holds, a maximum likelihood substitution estimator [22] and an IPW estimator can be constructed.

5.1 Identifiability under the graph in Figure 1

The mDAG in Figure 1 makes several important assumptions. In particular, it assumes that is fully measured, meaning that we measured all symptoms caused by SARS-CoV-2 infection that may lead to seeking testing. This can be achieved by thorough harmonized data collection on individuals being recruited from both populations. We assume that is equal to when up to random error. As mentioned, we allow for unmeasured common causes of and other infections. Otherwise, we assume that there are no common causes of any pair of nodes in the graph.

As a consequence of this structure, we have the independence condition


We then use the law of total probability to write our association of interest as


by the independence assumption (2). The quantity is the possibly multivariate distribution of conditional on and and the multiple integrals are taken over the domain of this distribution.
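For reference, the independence condition (2) and the expression (3) can be sketched as follows. The symbol names are assumptions made for illustration, since the originals are not legible here: $D$ is true infection, $T$ the testing indicator, $S$ the symptoms, $X$ the risk factors, $C$ the confounders, and $f(s \mid x, c)$ the conditional distribution of symptoms.

```latex
\begin{align}
& D \perp T \mid (S, X, C), \tag{2} \\
& P(D = 1 \mid X = x, C = c)
  = \int P(D = 1 \mid S = s, X = x, C = c)\, f(s \mid x, c)\, ds \notag \\
& \qquad = \int P(D = 1 \mid T = 1, S = s, X = x, C = c)\, f(s \mid x, c)\, ds. \tag{3}
\end{align}
```

The first equality is the law of total probability over symptoms. The second replaces the inner probability with its counterpart among tested subjects, which is permitted only because of (2).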

If the SARS-CoV-2 test is perfect then we can replace the by directly. If not, assuming uniform test sensitivity and specificity independent of patient characteristics, under the condition that , the law of total probability expanding on and rearranging terms gives us


which is estimable from the data of the tested subjects [11].
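The rearrangement leading to equation (4) can be sketched as follows, again under assumed notation: $Y$ is the observed test result, $D$ true infection, $Se$ and $Sp$ the test sensitivity and specificity, and $p_D$, $p_Y$ shorthand for the conditional probabilities of infection and of a positive test among tested subjects. The condition $Se + Sp \neq 1$ (in practice $Se + Sp > 1$) is our reading of the requirement whose symbols are missing above.

```latex
\begin{align*}
p_Y &:= P(Y = 1 \mid T = 1, s, x, c), \qquad
p_D := P(D = 1 \mid T = 1, s, x, c), \\
p_Y &= Se \cdot p_D + (1 - Sp)(1 - p_D)
  \quad \text{(law of total probability over } D\text{)}, \\
p_D &= \frac{p_Y + Sp - 1}{Se + Sp - 1}.
\end{align*}
```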

For the distribution in equation (3), we may write


where is the conditional distribution of in the population corresponding to for . Components (a) and (b) are estimable from the subjects from the tested and untested populations, respectively. For component (c), the case-control design of sampling tested and untested patients allows us to identify associations between and . However, case-control data do not give us a baseline prevalence of the outcome ( in this case). Therefore, it is only under external knowledge of the marginal , the prevalence of testing, that component (c) is identified [29].
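The decomposition described here appears to be the usual mixture over testing status. A sketch under assumed notation ($f_t$ for the conditional distribution of symptoms $S$ in the population with $T = t$, $t = 0, 1$) is:

```latex
\begin{equation*}
f(s \mid x, c)
  = \underbrace{f_1(s \mid x, c)}_{(a)} \,
    \underbrace{P(T = 1 \mid x, c)}_{(c)}
  + \underbrace{f_0(s \mid x, c)}_{(b)} \,
    \underbrace{\{1 - P(T = 1 \mid x, c)\}}_{(c)} .
\end{equation*}
```

The labels (a), (b), and (c) are placed according to the description in the text: (a) and (b) are learned within the tested and untested samples, respectively, while (c) requires the marginal prevalence of testing.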

Because we are able to relate the target probability

to quantities that are known if given infinite samples of the observed data structure, under the given assumptions, we have established identifiability. This result does not rely on the specification of parametric models. Thus, for identifiability we need complete data on the symptoms

(specifically, all variables caused by leading to testing) in addition to the covariates of interest . These covariates must include all common causes of testing and and/or (see next section). We also require knowledge of the parameters , , and .

5.2 Identifiability in settings with fewer restrictions

The assumed relationships in the mDAG of Figure 1 may be too restrictive in certain studies. We thus describe several specific generalizations of the above graph and the consequences on identifiability and interpretability of the parameter.

5.2.1 Unmeasured variable affecting test seeking which also affects and/or

In Figure 2 we add an unmeasured variable that is a factor influencing the act of getting tested which acts independently of all other variables in the graph. Suppose that also affects being infected with SARS-CoV-2 or having symptoms of COVID-19 or both. For instance, mold exposure is associated with living in lower income neighborhoods [1]. Mold exposure may lead to respiratory symptoms (e.g. asthma exacerbation) that could be confused for COVID-19 symptoms. People living in lower income neighborhoods may be more at risk of COVID-19 due to greater population density and greater proportions of people who work in “essential services” [27]. Thus, socio-economic status may be such a variable if it is not included in or . A second example is any variable that makes an individual high-risk (arrow into ) that also leads to testing even if the individual is symptom free.

The consequence of such a variable is that independence condition (2) no longer holds. This is because directly creates dependence between and and/or adjusting for , which is a collider of and , creates the dependence between and . Thus, if such a variable exists we cannot use the described maximum likelihood procedure. We should therefore attempt to measure all such factors and include them in or .

Figure 2: Presence of an unmeasured variable that affects and symptoms and/or infection with COVID-19, . Drawn using DAGitty. [26]

5.2.2 Unmeasured symptoms of COVID-19 leading to testing

Another potential scenario involves symptoms of COVID-19 that were not included in but can also lead to testing. This scenario is portrayed in Figure 3. Such a variable may exist if, for instance, a study does not ask about the less common symptoms of COVID-19, such as headache or skin rashes.

In this scenario, is a mediator of the effect of on so the independence condition (2) does not hold. This is still the case if is related to the baseline covariates or partially correlated with other symptoms.

Figure 3: Presence of unmeasured symptoms of COVID-19. Drawn using DAGitty. [26]

5.2.3 Unmeasured variables correlated with baseline covariates and cause of testing

Consider the presence of a variable affecting and correlated with baseline covariates (either by causal relationship or other). Such a variable does not affect the independence condition (2) and thus identifiability is preserved. Such variables include demographic information and participant characteristics that affect test-seeking behaviour but are otherwise not related to infection or symptoms.

5.2.4 Unmeasured variables affecting risk factors and COVID-19 infection

In the presence of an unmeasured variable that affects both the risk factors and SARS-CoV-2 infection, the association between the risk factors and the outcome of interest will be confounded. Such a variable does not violate the independence condition (2), so the parameters in model (1) remain identifiable. Their values may, however, be less meaningful and will not represent causal relationships between risk factors and infection because of the unmeasured confounding.

6 Estimation with IPW

The g-formula relates the observed data to and thus the model of interest. Estimation is available in principle through modeling the components of the g-formula, producing a substitution estimator [25]. We find that when is high-dimensional, as is likely the case in this setting, the g-formula estimator may not be feasible. Alternatively, one may model the probability of selection directly using the case-control data of tested and untested individuals to construct an estimator using IPW [3, 29]. We describe the latter, which requires knowledge of the test sensitivity, , and specificity, , and the value of testing prevalence . If these parameters are uncertain, then one can undertake a sensitivity analysis by varying the assigned values.

The IPW estimator for the parameters of interest in model (1) is given through the score equations of a weighted logistic regression


where values refer to the data realizations of subject .
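To make the idea of solving weighted logistic score equations concrete, here is a minimal sketch for a single covariate. It assumes that equation (6) has the form sum over tested subjects of w_i * (q_i - expit(b0 + b1 * x_i)) * (1, x_i) = 0, where w_i is the inverse of the estimated probability of being tested and q_i is the corrected outcome probability. That form, the function name, and the Newton-Raphson solver are our assumptions, not the authors' implementation.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve_weighted_logistic(x, y, w, n_iter=50, tol=1e-10):
    """Solve sum_i w_i * (y_i - expit(b0 + b1 * x_i)) * (1, x_i) = 0.

    y may be fractional (or fall outside [0, 1]) because only the score
    equations are used, not the log-likelihood itself.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = expit(b0 + b1 * xi)
            r = wi * (yi - p)        # weighted residual
            v = wi * p * (1.0 - p)   # weighted variance term
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det  # Newton step: inverse information times score
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Toy check: outcomes set exactly to expit(-1 + 0.5 * x), arbitrary positive weights.
x = [0, 1, 0, 1]
y = [expit(-1.0 + 0.5 * xi) for xi in x]
w = [1.5, 4.0, 2.0, 1.2]
b0_hat, b1_hat = solve_weighted_logistic(x, y, w)
```

In the toy check the pseudo-outcomes lie exactly on the model, so the solver should recover an intercept of -1 and a slope of 0.5 regardless of the weights.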

In order to estimate the numerator of the IPW estimator, we must first define a model for . This model is fit on subjects who received a test. Predictions from this model fit are denoted . By the relationship in equation (4), we set

for all subjects with . Note that approximates .
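The correction step can be sketched as a small helper. It assumes the standard rearrangement of P(test positive) = sens * P(infected) + (1 - spec) * (1 - P(infected)); the function and argument names are hypothetical.

```python
def corrected_outcome(p_test_positive, sens, spec):
    """Back out P(infected) from P(test positive) for a test with known accuracy."""
    denom = sens + spec - 1.0
    if denom <= 0:
        raise ValueError("requires sens + spec > 1")
    return (p_test_positive + spec - 1.0) / denom


# Round trip: a true infection probability of 0.4 with sens = 0.9, spec = 0.95
p_observed = 0.9 * 0.4 + (1 - 0.95) * (1 - 0.4)
p_recovered = corrected_outcome(p_observed, sens=0.9, spec=0.95)

# A small predicted positivity can map to a value below zero
p_negative = corrected_outcome(0.02, sens=0.9, spec=0.95)
```

The second call shows why the transformed values are only approximate probabilities: when the predicted positivity is below 1 - spec, the corrected value falls below zero.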

In order to estimate the denominator of (6), we note that the associations between covariates, symptoms, and the probability of testing must be estimated from the data resulting from the case-control design, where sampling is carried out in both the tested and untested groups. If we know the baseline testing prevalence , we may use a simple weighting method for case-control studies [29]. Specifically, we assign all cases the weight and all controls the weight where is the ratio of the number of controls to cases in the sample. We use these weights in any chosen binomial regression model for conditional on , , and . Finally, we use predictions from this model fit to estimate for all tested subjects.
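The weighting step can be sketched as follows. The exact weights are not legible in this copy, so the sketch assumes the usual weights from the case-control weighting literature: each case (tested subject) receives the known testing prevalence q0, and each control receives (1 - q0) / J, with J the number of controls per case. The names `case_control_weights`, `q0`, and `j_ratio` are hypothetical.

```python
def case_control_weights(tested, q0):
    """Weights for case-control sampling with known prevalence q0 of being tested.

    tested: list of 0/1 indicators (1 = tested "case", 0 = untested "control").
    """
    n_cases = sum(tested)
    n_controls = len(tested) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    j_ratio = n_controls / n_cases  # controls per case in the sample
    return [q0 if t == 1 else (1.0 - q0) / j_ratio for t in tested]


tested = [1, 1, 0, 0, 0, 0]
weights = case_control_weights(tested, q0=0.002)
weighted_prevalence = sum(w for w, t in zip(weights, tested) if t == 1) / sum(weights)
```

With these weights, the weighted share of cases in the sample equals q0, which is what allows a weighted regression for testing to be interpreted on the population scale.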

A simple proof of the consistency of this estimator under the independence assumption (2) is given in the Appendix. It is required that the models for and are both correctly specified. We expect that some values of the denominator may be close to one for some tested subjects who are experiencing several symptoms of COVID-19. However, we would not necessarily expect denominator values close to zero because the IPW equation (6) only uses subjects who did, in fact, get tested. We thus expect our IPW estimator to be fairly stable in this setting.

7 Simulation Study

In order to evaluate the proposed IPW method under the mDAG in Figure 1, and compare it to a naive approach, we perform a simulation study. We evaluate the method under ideal circumstances, where the assumed parametric models are close to well-specified, where sensitivity and specificity of the test are known, and where the baseline prevalence of testing is known. We then evaluate the sensitivity to departures from these assumptions.

We first simulate ordered data , where each variable is unidimensional, for a population of 1,000,000. Baseline confounder

is generated from a standard Gaussian distribution. The risk factor of interest,

, is generated as a Bernoulli random variable conditional on such that its prevalence in the population is approximately 10%. True COVID-19 status is generated as Bernoulli conditional on and such that the true incidence of acute infection in this previously untested population is 10% overall (for this example, though this is likely lower in the general population in practice). The true conditional association between and is . Symptoms are Bernoulli conditional on , , and , where the dependence on is strong so that infected individuals have a high probability of experiencing symptoms (roughly between 0.5 and 0.9). Testing status is then generated given , , and with a fairly strong dependence on such that symptoms lead to a higher probability of being tested. Given the true test infection state , true sensitivity and specificity , test outcome is drawn for all tested subjects. Then, we randomly sample 2000 tested participants (roughly all available) and 2000 untested participants (a small subsample of the total available) from the population, which gives us our study sample. The data generation is given in Appendix Table 2.

In order to demonstrate the selection bias from using only tested subjects to evaluate risk factors, we fit a logistic regression for conditional on and using the data from tested subjects in the sample. We then apply our method using logistic regressions for and , where the latter regression is weighted using the case-control weights. The score equations (6) are then solved using a standard optimization procedure for logistic regressions with a log-likelihood loss function, though our implementation allows for values of that are outside of (0,1) which occurs due to the transformation with and .

We implement our method with correctly specified logistic regression models under the following settings: assumed values set to (i.e. truth), , and ; assumed testing prevalence set to truth (roughly 0.2% of the full population), truth , and truth . We then misspecify our testing model by omitting an interaction term between and . In the last IPW implementation, we do not adjust for symptoms by removing from all models.

We use a case-control nonparametric bootstrap method, where resampling with replacement is done separately in the tested and untested groups, to estimate the standard error and 95% confidence intervals for the IPW method [31]. The usual logistic regression standard errors are used for the naive method. All simulations were run with R statistical software v. 3.6.1 [19].
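The stratified resampling can be sketched as below. This is an illustration in Python rather than the authors' R code. The `estimator` argument is a hypothetical callable standing in for the full IPW procedure, and the percentile interval is an assumption, since the text does not state which type of bootstrap interval was used.

```python
import random


def case_control_bootstrap(tested_rows, untested_rows, estimator, n_boot=200, seed=0):
    """Resample with replacement separately within the tested and untested groups.

    estimator: callable taking (tested_sample, untested_sample), returning a float.
    Returns (standard error, (2.5th percentile, 97.5th percentile)).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        t_star = rng.choices(tested_rows, k=len(tested_rows))
        u_star = rng.choices(untested_rows, k=len(untested_rows))
        estimates.append(estimator(t_star, u_star))
    mean = sum(estimates) / n_boot
    se = (sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)) ** 0.5
    ordered = sorted(estimates)
    lower = ordered[int(0.025 * (n_boot - 1))]
    upper = ordered[int(0.975 * (n_boot - 1))]
    return se, (lower, upper)


def diff_in_means(tested_sample, untested_sample):
    """Toy estimator used only to exercise the resampling code."""
    return sum(tested_sample) / len(tested_sample) - sum(untested_sample) / len(untested_sample)


toy_tested = [float(i) for i in range(100)]
toy_untested = [float(i) for i in range(50)]
boot_se, (boot_lower, boot_upper) = case_control_bootstrap(toy_tested, toy_untested, diff_in_means)
```

Resampling within each group keeps the numbers of tested and untested subjects fixed in every replicate, mirroring the case-control sampling of the study.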

The results of all implementations, in addition to the analysis conducted only on tested subjects, are given in Table 1. Mean parameter estimates, mean standard error estimates, Monte Carlo standard errors, and coverage of the 95% confidence intervals are given. We first note that the logistic regression analysis run with only tested subjects is highly biased, suggesting on average that the risk factor leads to a lower risk of the outcome while the opposite is true. IPW implemented with correct parameter values and models had almost no error on average (mean estimated OR of 1.49 against a true value of 1.5), with a bootstrap standard error estimate that corresponded to the Monte Carlo value and only slight undercoverage from the 95% bootstrap confidence interval. Ignoring the imperfect sensitivity and specificity (i.e. setting both to one) in the IPW method led to some attenuation in the average estimate, though coverage remained similar. Incorrectly specifying the sensitivity and specificity resulted in substantial overestimation and inflated standard error estimates. Misspecifying the testing prevalence by a factor of 10 did not lead to important bias, but misspecifying it by a factor of 100 led to underestimation. Missing an interaction in the model for testing led to an inverted odds ratio, suggesting that the risk factor is protective against infection. This result points to the importance of correct modeling of the nuisance functions in the IPW estimator. Finally, when not adjusting for symptoms, IPW gives the same biased results as the subsetted logistic regression. Bias occurs because, if we omit symptoms, the independence assumption (2) is not satisfied with the measured variables.

True OR = 1.5

Analysis (assumed sens, spec; assumed testing prevalence)   Mean est   MC SE   Mean est SE   % Cov
Analysis of tested subjects only                             0.87       0.18    0.17          10
IPW, correct models
  sens, spec = truth; prevalence = truth                     1.49       0.24    0.24          92
  sens = spec = 1; prevalence = truth                        1.43       0.22    0.22          91
  sens, spec misspecified; prevalence = truth                2.19       0.39                  86
  sens, spec = truth; prevalence off by a factor of 10       1.46       0.24    0.24          91
  sens, spec = truth; prevalence off by a factor of 100      1.26       0.21    0.20          80
IPW, missing interaction in the testing model                0.79       0.29    0.29          27
IPW, no adjustment for symptoms                              0.86       0.19    0.19          11

Table 1: Aggregate results of the application of each method and implementation on 1000 simulated datasets of 2000 controls and approximately 2000 cases. Mean est: exponential of the mean estimate of the risk-factor coefficient (i.e. transformed to the OR scale); MC SE: Monte Carlo standard error of the coefficient estimate; Mean est SE: square root of the mean estimated variance of the coefficient estimate; % Cov: percentage of 95% confidence intervals that contain the true value (optimal is 95%). Median used due to the large number of inflated bootstrap estimates.

8 Discussion

In this paper, we have contributed to the investigation of statistical analysis under the test-negative design in the context of evaluating risk or preventive factors of COVID-19 when participants may be conveniently recruited at disease testing sites. We defined a potential parameter of interest in such a study as the coefficients in a regression model for the true infection outcome. We explained and demonstrated the importance of sampling additional population controls [30] in order to avoid selection bias from comparing only tested individuals. We then investigated the identifiability of the target parameter under several settings. Finally, we proposed a novel IPW estimator that accounts for both imperfect test sensitivity and specificity and the study design. We then evaluated this estimator through simulation study.

There is a growing literature on identifiability conditions for statistical parameters under missingness [2] and a large literature of identifiability of causal parameters [17, 10]. Our setting is somewhat different from a typical missing data setting in that, because observed outcomes are obtained through imperfect tests, true infection status is not observed for any subject. In addition, the case-control component of the study design must be considered when estimating all probabilities and distributions in the general population of interest. These results are important as they shed light on the data collection needed to correctly estimate the parameter of interest. In particular, we must measure all variables on the pathway between SARS-CoV-2 infection and testing. This means that incomplete ascertainment of the symptoms leading some individuals to be tested would result in a biased estimator. We must also measure and adjust for all causes of testing if they are also causes of SARS-CoV-2 infection and/or symptoms.

The estimator proposed assumes knowledge of the test properties and the prevalence of testing in the population. Given the potential sensitivity to errors made when specifying these quantities, one could undertake a sensitivity analysis. Specifically, confidence intervals could be constructed using all combinations of credible values for , , and . By taking the minimum confidence interval lower bound and the maximum upper bound, we can place bounds on the set of parameter values that are supported by the data and assumed model. Other approaches may involve Bayesian estimation [8] where informed priors are placed on these values, but we do not explore such approaches here. We also noted the sensitivity of the results to misspecification of the model for testing. It is thus important to understand the mechanisms driving people to seek and receive testing and to use a flexible modeling approach [18].
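The bounding procedure described above can be sketched as a loop over a grid of credible values. The callable `fit_ci` is hypothetical: it stands in for whatever routine fits the IPW estimator at given assumed values and returns a confidence interval. The grids in the usage example are placeholders, not recommended values.

```python
from itertools import product


def sensitivity_bounds(fit_ci, sens_grid, spec_grid, prev_grid):
    """Widest interval over all combinations of assumed sens, spec, and prevalence.

    fit_ci: hypothetical callable (sens, spec, prev) -> (lower, upper).
    """
    intervals = [
        fit_ci(sens, spec, prev)
        for sens, spec, prev in product(sens_grid, spec_grid, prev_grid)
    ]
    lower = min(lo for lo, _ in intervals)
    upper = max(hi for _, hi in intervals)
    return lower, upper


def toy_fit_ci(sens, spec, prev):
    """Stand-in for a real fit: an interval of half-width 0.1 around sens + spec."""
    centre = sens + spec
    return centre - 0.1, centre + 0.1


bounds = sensitivity_bounds(toy_fit_ci, [0.8, 0.9], [0.95, 0.99], [0.001])
```

The returned pair is the minimum lower bound and maximum upper bound across the grid, so it is conservative by construction and widens as the grid of credible values grows.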

This work can be directly adapted to an investigation of causality by replacing the target parameter with a causal parameter under additional assumptions, including “no unmeasured confounders” for an exposure of interest. If the additional assumptions hold, then this approach could investigate potential epidemiological causes of SARS-CoV-2 infection. Future work could also improve the efficiency of the IPW estimator through such approaches as targeted maximum likelihood estimation [28]. Though improvements are likely possible, a practical fully efficient estimator is probably infeasible due to the difficulties in applying the g-formula.

Due to the rapidly evolving nature of the COVID-19 pandemic, studies with short timelines are necessary to monitor public health. The accessibility of the test-negative design with untested controls allows for much shorter timelines compared to a cohort study of uninfected individuals. We must however overcome the inherent selection bias arising from this design. Novel study designs must be followed by clear definitions of parameters of interest, investigations of identifiability of these parameters, and potentially tailored estimators. These steps allow for a principled approach that does not solely rely on intuition and may help avoid substantial sources of bias when tracking risk and preventive factors of COVID-19.


  • [1] G. Adamkiewicz, J. Spengler, A. Harley, A. Stoddard, M. Yang, M. Alvarez-Reeves, and G. Sorensen (2014) Environmental conditions in low-income urban housing: clustering and associations with self-reported health. Am J Public Health 104(9), pp. 1650–1656.
  • [2] R. Bhattacharya, R. Nabi, I. Shpitser, and J. Robins (2019) Identification in missing data models represented by directed acyclic graphs. Uncertain Artif Intel. In press.
  • [3] S. Cole and M. Hernán (2008) Constructing inverse probability weights for marginal structural models. Am J Epidemiol 168(6), pp. 656–664.
  • [4] S. Cole, R. Platt, E. Schisterman, H. Chu, D. Westreich, D. Richardson, and C. Poole (2009) Illustrating bias due to conditioning on a collider. Int J Epidemiol 39(2), pp. 417–420.
  • [5] M. Cummings, M. Baldwin, D. Abrams, S. Jacobson, B. Meyer, E. Balough, J. Aaron, J. Claassen, L. Rabbani, J. Hastie, and B. Hochman (2020) Epidemiology, clinical course, and outcomes of critically ill adults with COVID-19 in New York City: a prospective cohort study. Lancet. In press.
  • [6] R. Daniel, M. Kenward, S. Cousens, and B. De Stavola (2012) Using causal diagrams to guide analysis in missing data problems. Stat Methods Med Res 21(3), pp. 243–256.
  • [7] G. De Serres, D. Skowronski, X. Wu, and C. Ambrose (2013) The test-negative design: validity, accuracy and precision of vaccine efficacy estimates compared to the gold standard of randomised placebo-controlled clinical trials. Euro Surveill 18(37), pp. 20585.
  • [8] P. Diggle (2011) Estimating prevalence using an imperfect test. Epidemiol Res Int 2011, Article ID 608719.
  • [9] W. Fukushima and Y. Hirota (2017) Basic principles of test-negative design in evaluating influenza vaccine effectiveness. Vaccine 35(36), pp. 4796–4800.
  • [10] S. Greenland, J. Pearl, and J. Robins (1999) Causal diagrams for epidemiologic research. Epidemiology 10(1), pp. 37–48.
  • [11] S. Greenland (1996) Basic methods for sensitivity analysis of biases. Int J Epidemiol 25(6), pp. 1107–1116.
  • [12] X. He, E. Lau, P. Wu, et al. (2020) Temporal dynamics in viral shedding and transmissibility of COVID-19. Nat Med 26, pp. 672–675.
  • [13] M. Hernán, S. Hernández-Díaz, and J. Robins (2004) A structural approach to selection bias. Epidemiology 15(5), pp. 615–625.
  • [14] D. Horvitz and D. Thompson (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47(260), pp. 663–685.
  • [15] B. Karmakar and D. Small (2020) Inference for a test-negative case-control study with added controls. arXiv preprint.
  • [16] K. Mohan, J. Pearl, and J. Tian (2013) Graphical models for inference with missing data. In Adv Neural Inf Process Syst, pp. 1277–1285.
  • [17] J. Pearl (2009) Causality: models, reasoning and inference. 2nd edition, Cambridge University Press, Cambridge, MA, USA.
  • [18] R. Pirracchio, M. Petersen, and M. van der Laan (2014) Improving propensity score estimators’ robustness to model misspecification using super learner. Am J Epidemiol 181(2), pp. 108–119.
  • [19] R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
  • [20] Respiratory Virus Infections Working Group (2020) Canadian Public Health Laboratory Network best practices for COVID-19. Can Commun Dis Rep 46(5).
  • [21] J. Robins, A. Rotnitzky, and L. Zhao (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc 90(429), pp. 106–121.
  • [22] J. Robins (1986) A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Mathematical Modeling 7, pp. 1393–1512.
  • [23] C. Schooling and H. Jones (2018) Clarifying questions about “risk factors”: predictors versus explanation. Emerg Themes Epidemiol 15, Article 10.
  • [24] G. Shmueli (2010) To explain or to predict?. Stat Sci 25(3), pp. 289–310.
  • [25] J. Snowden, S. Rose, and K. Mortimer (2011) Implementation of g-computation on a simulated data set: demonstration of a causal inference technique. Am J Epidemiol 173(7), pp. 731–738.
  • [26] J. Textor, B. van der Zander, M. Gilthorpe, M. Liskiewicz, and G. Ellison (2016) Robust causal inference using directed acyclic graphs: the R package ‘dagitty’. Int J Epidemiol 45(6), pp. 1887–1894.
  • [27] (2020) The plight of essential workers during the COVID-19 pandemic [editorial]. Lancet 395(10237), p. 1587.
  • [28] M. van der Laan and S. Rose (2011) Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media.
  • [29] M. van der Laan (2008) Estimation based on case-control designs with known prevalence probability. Int J Biostat 4(1), Article 17.
  • [30] J. Vandenbroucke, E. Brickley, C. Vandenbroucke-Grauls, and N. Pearce (2020) The test-negative design with additional population controls: a practical approach to rapidly obtain information on the causes of the SARS-CoV-2 epidemic. arXiv preprint.
  • [31] C. Wang, S. Wang, and R. Carroll (1997) Estimation in choice-based sampling with measurement error and bootstrap analysis. J Econom 77(1), pp. 65–86.
  • [32] D. Wang, B. Hu, C. Hu, F. Zhu, X. Liu, J. Zhang, B. Wang, H. Xiang, Z. Cheng, Y. Xiong, Y. Zhao, Y. Li, X. Wang, and Z. Peng (2020) Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus–infected pneumonia in Wuhan, China. JAMA 323(11), pp. 1061–1069.

Appendix A Consistency of IPW

By the definition of the parameters in equation (1) and the typical log-likelihood loss function, the true values are defined through the equations


We first assume that the model for is consistent such that the values for given converge to the truth. Then, the estimates will converge to the true as long as the parameters and are correct. Then, as goes to infinity, and assuming consistent nuisance function estimation in the denominator, the IPW score equations (6) imply

By iterative expectations, we can rewrite the above as
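As a reading aid, the following is a minimal sketch of the standard argument behind this kind of consistency result. The notation is assumed for illustration and may differ from that of equations (1) and (6): $Y^{1}$ is the true infection status, $Y$ the test result, $R$ the testing indicator, $W$ the symptoms, $X$ the exposure, and $C$ the covariates. The sketch further assumes that test errors are nondifferential with known sensitivity $Se$ and specificity $Sp$, that $R$ is independent of $Y^{1}$ given $(W, X, C)$, and that $\pi(W,X,C) > 0$.

```latex
% Assumed notation (not taken from the paper):
%   \pi(W,X,C) = \Pr(R = 1 \mid W, X, C), \quad h(X,C) = (1, X, C)^\top,
%   m_\beta(X,C) = \operatorname{expit}(\beta_0 + \beta_1 X + \beta_2 C),
%   Se + Sp > 1.
\begin{align*}
Y^{*} &= \frac{Y - (1 - Sp)}{Se + Sp - 1},
\qquad
E\left[Y^{*} \mid Y^{1}, R = 1, W, X, C\right]
  = \frac{Se\,Y^{1} + (1 - Sp)(1 - Y^{1}) - (1 - Sp)}{Se + Sp - 1} = Y^{1}, \\
E\left[\frac{R}{\pi(W,X,C)}\, h(X,C)\,\{Y^{*} - m_\beta(X,C)\}\right]
 &= E\left[\frac{R}{\pi(W,X,C)}\, h(X,C)\,\{Y^{1} - m_\beta(X,C)\}\right] \\
 &= E\left[\frac{\Pr(R = 1 \mid Y^{1}, W, X, C)}{\pi(W,X,C)}\, h(X,C)\,\{Y^{1} - m_\beta(X,C)\}\right] \\
 &= E\left[h(X,C)\,\{Y^{1} - m_\beta(X,C)\}\right].
\end{align*}
```

The first equality in the second display uses the unbiasedness of the corrected outcome $Y^{*}$, the second uses iterated expectations, and the third uses the conditional independence of $R$ and $Y^{1}$. Under these assumptions the weighted score therefore has mean zero at the same $\beta$ that solves the population score equations for the true infection outcome.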

Appendix B Simulation study data generation

The data-generating mechanism used in the simulation study is given in Table 2. We also present the R code below.


X<-rbinom(prob=plogis(-2.3+0.3*C),size=1,n=popsize) #prevalence is around 0.1

#Y will be censored, Y1 is latent for everyone
#check desired prevalence of true outcome

#generate test results

#symptoms based on infection
W<-rbinom(n=popsize, prob=plogis(-2+0.2*C+0.5*X+3*Y1),size=1)

#selection on outcome for testing
R<-rbinom(n=popsize, size=1, prob=plogis(-7+2*W+0.6*X+0.2*C-W*X))
#about 0.002 of pop tested -> determines sample size
q0<-mean(R) #Pr(R=1) in population


Variable Generating Mechanism (i.i.d)

set for all .

Subsample 2000 subjects with and at most 2000 subjects with .

Table 2: The data generating mechanism used in the simulation study.

Appendix C R code to run the IPW estimator

In this section, we present the R code to run the estimator for observed data in which the variables W, X, and C (as named in the code) are each univariate. Note that the simulation study data has such a structure. This code can be easily extended for multivariate versions of those variables.

The IPW function uses the following two helper functions.

#Weighted logistic log-likelihood that can take Y values outside of (0,1)
#Returns the negative log-likelihood because optim() minimizes by default
LogLikelihood<- function(beta, Y, X, w){
  pi<- plogis( X%*%beta ) # P(Y=1|X)= expit(beta0 + beta1*X1+beta2*X2...)
  pi[pi==0] <- .Machine$double.neg.eps # avoid taking the log of 0
  pi[pi==1] <- 1-.Machine$double.neg.eps
  logLike<- sum( w*( Y*log(pi) + (1-Y)*log(1-pi) ) )
  return(-logLike)
}

#Gradient of the negative weighted log-likelihood
grad<- function(beta, Y, X, w){
  pi<- plogis( X%*%beta ) # P(Y=1|X)= expit(beta0 + beta1*X1+beta2*X2...)
  pi[pi==0] <- .Machine$double.neg.eps # for consistency with above
  pi[pi==1] <- 1-.Machine$double.neg.eps
  gr<- crossprod(w*X, Y-pi) # gradient of the log-likelihood
  return(as.vector(-gr))
}
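As a sanity check on helpers of this form, the sketch below compares the `optim()` solution with a weighted logistic regression fitted by `glm()` on an ordinary binary outcome, where the two should agree. It assumes that both helpers return negated values (so that `optim()`, which minimizes, maximizes the weighted log-likelihood); the simulated data and the names `x`, `c1`, `Xmat`, and `w` are illustrative only.

```r
# Self-contained copies of the two helpers, written under the assumption that
# they return the NEGATIVE log-likelihood and its gradient.
LogLikelihood <- function(beta, Y, X, w) {
  pi <- plogis(X %*% beta)
  pi[pi == 0] <- .Machine$double.neg.eps
  pi[pi == 1] <- 1 - .Machine$double.neg.eps
  -sum(w * (Y * log(pi) + (1 - Y) * log(1 - pi)))
}
grad <- function(beta, Y, X, w) {
  pi <- plogis(X %*% beta)
  pi[pi == 0] <- .Machine$double.neg.eps
  pi[pi == 1] <- 1 - .Machine$double.neg.eps
  as.vector(-crossprod(w * X, Y - pi))
}

# Illustrative data (not the simulation design of the paper)
set.seed(1)
n  <- 2000
x  <- rbinom(n, 1, 0.4)
c1 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + 0.7 * x + 0.3 * c1))
w  <- runif(n, 0.5, 2)          # arbitrary positive weights
Xmat <- cbind(1, x, c1)

fit_optim <- optim(par = c(0, 0, 0), fn = LogLikelihood, gr = grad,
                   Y = y, X = Xmat, w = w, method = "BFGS")
# quasibinomial() avoids the non-integer weight warning of binomial()
fit_glm <- glm(y ~ x + c1, family = quasibinomial(), weights = w)

round(cbind(optim = fit_optim$par, glm = coef(fit_glm)), 3)
```

Unlike `glm()`, the `optim()` route also accepts outcome values outside of (0,1), which is why it is used for the corrected outcome in the IPW function.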

The function to run IPW depends on the data (dat) and values for sensitivity (sens), specificity (spec), and the baseline prevalence (q0hat). The function follows.


#Use IPW estimator (with true sens and spec) to estimate

#case-control probability
#Specify some model for R, fit with weights w. We use a logistic regression as an example:


#This solves the IPW score equations by optimizing the log-likelihood
optim.out <- optim(par=c(-3,0.5,0.5), fn=LogLikelihood, gr=grad, Y=Ystar,
X=cbind(1,dat$X,dat$C)[dat$R==1,], w=1/PRwxc[dat$R==1], method="BFGS")
beta<- optim.out$par[2]

#The score equations can also be solved with geeglm in geepack
#But we must truncate the outcome to (0,1)
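The pieces above can be combined into an end-to-end sketch on simulated data. This is not the authors' implementation: the function name `ipw_sketch()`, the case-control weights (chosen in the spirit of [29]), the corrected outcome `Ystar`, the models for `C`, `Y1`, and `Y`, and the values of `sens`, `spec`, and `popsize` are all assumptions made for illustration. Only the models for `X`, `W`, and `R` and the starting values passed to `optim()` follow the code shown in the appendices.

```r
set.seed(2020)
popsize <- 1e6
sens <- 0.8; spec <- 0.99   # ASSUMED test properties, for illustration only

# --- Simulated population (the C, Y1, and Y models are ASSUMPTIONS) ---
C  <- rnorm(popsize)                                        # assumed covariate
X  <- rbinom(popsize, 1, plogis(-2.3 + 0.3 * C))
Y1 <- rbinom(popsize, 1, plogis(-3 + 0.5 * X + 0.5 * C))    # assumed latent infection
W  <- rbinom(popsize, 1, plogis(-2 + 0.2 * C + 0.5 * X + 3 * Y1))
R  <- rbinom(popsize, 1, plogis(-7 + 2 * W + 0.6 * X + 0.2 * C - W * X))
Y  <- rbinom(popsize, 1, ifelse(Y1 == 1, sens, 1 - spec))   # assumed imperfect test
q0hat <- mean(R)                                            # Pr(R=1) in population

# --- Observed sample: all tested subjects plus 2000 untested controls ---
idx <- c(which(R == 1), sample(which(R == 0), 2000))
dat <- data.frame(C = C[idx], X = X[idx], W = W[idx], R = R[idx],
                  Y = ifelse(R[idx] == 1, Y[idx], NA))      # test result seen only if tested

ipw_sketch <- function(dat, sens, spec, q0hat) {
  n1 <- sum(dat$R == 1); n0 <- sum(dat$R == 0)
  # Assumed case-control weights: reweight so tested : untested is q0 : (1 - q0)
  cc_w <- ifelse(dat$R == 1, q0hat / n1, (1 - q0hat) / n0) * nrow(dat)
  # Model for testing, fit with the case-control weights (logistic example)
  fitR <- glm(R ~ W * X + C, family = quasibinomial(), data = dat, weights = cc_w)
  pR <- predict(fitR, type = "response")
  tested <- dat$R == 1
  # Assumed misclassification-corrected outcome; may fall outside (0,1)
  Ystar <- (dat$Y[tested] - (1 - spec)) / (sens + spec - 1)
  Xmat <- cbind(1, dat$X, dat$C)[tested, ]
  w <- 1 / pR[tested]
  w <- w / mean(w)            # rescaling does not change the solution
  expit_clamped <- function(b) {
    p <- as.vector(plogis(Xmat %*% b))
    pmin(pmax(p, .Machine$double.neg.eps), 1 - .Machine$double.neg.eps)
  }
  # Negative weighted log-likelihood and its gradient (optim() minimizes)
  negll <- function(b) {
    p <- expit_clamped(b)
    -sum(w * (Ystar * log(p) + (1 - Ystar) * log(1 - p)))
  }
  neggr <- function(b) as.vector(-crossprod(w * Xmat, Ystar - expit_clamped(b)))
  optim(c(-3, 0.5, 0.5), negll, neggr, method = "BFGS")$par
}

beta_hat <- ipw_sketch(dat, sens, spec, q0hat)
names(beta_hat) <- c("(Intercept)", "X", "C")
beta_hat   # compare with the assumed generating values -3, 0.5, 0.5
```

Because the inverse probability weights are large and the test is imperfect, a single run can be noisy; repeating the simulation over many seeds gives a better picture of bias and variability under these assumed settings.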