1 Introduction
Under the current pandemic caused by the SARS-CoV-2 virus, where the resulting illness is referred to as COVID-19, it is challenging to implement fast epidemiological inquiries to map and understand the disease. Highly infectious [12] in a completely non-immune population and targeting primarily the respiratory system, with clinical symptoms that include fever, cough, and fatigue [5, 32], this illness continues to cause substantial morbidity and mortality, straining the healthcare systems of many countries. With the aim of reducing infection, global campaigns encourage individuals to modify their daily behaviour through measures that include physical distancing, the use of masks, and intensified hygienic practices. Much interest lies in establishing whether these interventions are effective at reducing infection probabilities at an individual or population level.
Given the challenges involved in testing large portions of the population for active infection with SARS-CoV-2, cases in the general public are typically ascertained through testing sites. In Canada, though regulations have varied by epidemic stage and jurisdiction [20], in order to obtain a test, individuals may be required to be experiencing symptoms of COVID-19 and/or have other reason to believe that they are infected, such as being a healthcare worker. Thus, if potential study participants are recruited at test centers, the resulting study cohort will not be representative of the general population at risk for the disease. Further, due to the self-selection inherent in seeking testing, associations measured between participant covariates (e.g. demographics, characteristics, and behaviours) and outcomes will not necessarily be representative of true causes or even predictors of infection [13].
Recently, Vandenbroucke et al. [30] proposed a modified case-control design that combines a test-negative design, best known from its use in vaccine effectiveness research [7, 9], with the recruitment of additional population controls in order to identify risk/preventive factors of SARS-CoV-2 infection. This study design involves the recruitment both of patients who are seeking testing for SARS-CoV-2 infection and of untested individuals in the general population. They propose a matching approach to compare differences in study participant covariates between people who test positive, people who test negative, and people who are not seeking testing in order to triangulate factors that likely increase or decrease the odds of infection. Karmakar and Small [15] proposed a more efficient test to compare factors between the three groups. However, neither of these studies addresses the problem from a missing data perspective while dealing with the potential selection bias of having two groups of participants who received testing based on outcome-related symptoms.

In this methodological study, we give a specific definition of the parameters of interest in a "risk/preventive factor" analysis, corresponding to modeling the covariates that are predictive of SARS-CoV-2 infection in a regression model. We then provide identifiability conditions under the test-negative design and an assumed structural setting in the context of the COVID-19 pandemic. Identifiable means that we would know the exact value of the parameters of interest if we had an infinite sample size; our study thus gives some conditions under which identifiability is achieved. Identifiability is of great importance because without this guarantee, we cannot construct a consistent estimator under the given assumptions. Finally, we propose an inverse probability weighting (IPW) estimator [14, 21, 3] of the parameters of interest that may be feasible in this setting. We evaluate this approach through a simulation study and compare it with a naive approach to estimation that does not incorporate untested population controls.
2 The Test-Negative Study Design with Population Controls
Given access to the recruitment of people seeking SARS-CoV-2 tests at a given testing site, we consider a study design that involves the recruitment of two groups of people: (1) people who are being tested for SARS-CoV-2 at the test site and (2) members of the general population, possibly selected as matched pairs for those being tested, who are not seeking testing but who are under the jurisdiction of the test site. It must be the case that members of group (2) would be able to access the test site were they to have symptoms of COVID-19 or otherwise qualify for and seek testing. Participants recruited from group (1) are denoted T = 1 and those from group (2) are denoted T = 0. This is essentially a case-control study design except that cases are those who are tested and controls are those who are untested. We assume random sampling of independent members of both groups, with a total sample size of n. The need for independence implies that, for example, only one member per household should be recruited.
All participants are given a questionnaire to collect information about the potential risk/preventive factors under study, X, which we will now refer to simply as risk factors, and related confounders, C. The questionnaire may also capture information about current symptoms related to COVID-19, S. It is also necessary that we receive the result of the SARS-CoV-2 test from those being tested. The test result is denoted W (1 = positive; 0 = negative), which is only observable for those who are being tested (T = 1).
Recruitment from groups (1) and (2) will give us three categories of participants to contrast: those being tested who test positive for COVID-19, those being tested who test negative for COVID-19, and those not being tested. In principle, the comparison of the covariates (X, C) between test-positives and test-negatives will allow us to see which risk factors differ between people who have become infected with SARS-CoV-2 and those who have not, and we can then compare with the controls from the general population [30]. But simple contrasts of tested participants may not estimate an interpretable parameter. In fact, the interpretation of any measured associations relies on certain structural assumptions and the collection of the data necessary to make these assumptions credible. We expand on these concepts in the remainder of the paper.
3 Parameters of interest
Our scientific objective is to identify risk factors, i.e. behaviours or characteristics of individuals that are associated with COVID-19 in a chosen outcome regression model, possibly after adjustment for other suspected confounders of these risk factors [24, 23]. The population of interest is members of the general public who were not previously infected with SARS-CoV-2 (i.e. who are at risk of infection) and who are under the jurisdiction of the COVID-19 testing site under study, but who may or may not be seeking testing. The regression model thus represents the associations between the risk factors and the prospective short-term risk of infection with SARS-CoV-2 in this population.
The binary outcome of interest is infection with SARS-CoV-2, denoted Y. Due to imperfect test sensitivity and specificity, this outcome may not correspond with W, the result of the test for COVID-19. Sensitivity, s_e, is defined as the probability of testing positive when truly infected. Specificity, s_p, is defined as the probability of testing negative when not infected. The observed data are thus of the form O = (C, X, S, T, TW). The complete data under knowledge of the true outcomes are (C, X, S, T, W, Y). Under a perfect test for COVID-19 (i.e. sensitivity and specificity equal to one), W = Y when T = 1, in which case the observed data with outcome censoring can be written as O = (C, X, S, T, TY). We will use lower case letters to represent realizations of these random variables. In particular, o_i = (c_i, x_i, s_i, t_i, t_i w_i) for i = 1, ..., n, where n is the total sample size.

We then define a logistic regression model

(1)  P(Y = 1 | X, C; θ) = expit(θ_0 + θ_X' X + θ_C' C),

where expit(u) = 1/{1 + exp(-u)}, X and C may be vector-valued (and similarly for the realizations denoted by lower case letters), and θ = (θ_0, θ_X', θ_C')', under a typical log-likelihood loss function. Our interest lies in the vector parameter θ_X, where exp(θ_X,j) corresponds to the conditional odds ratio related to the covariate X_j.

Importantly, the parameters in this regression model may not represent causal effects; i.e. even if a coefficient is negative, it does not necessarily mean that the corresponding covariate decreases the risk of infection [24, 23]. In order to establish such a relationship, all confounders of the effects of risk factors on the outcome must be adjusted for in the model, the model must correctly represent the mechanisms of infection, and all risk factors must be independently manipulable. While we could extend this work to consider these aspects, for simplicity we retain θ_X as the statistical parameter of interest in this study; in practice its estimation may provide important insights into the variables related to SARS-CoV-2 infection or ways that high-risk individuals may be identified. The remainder of the article addresses the estimation of the statistical parameter θ_X within the regression model.
4 Potential for Collider Bias Resulting from Selecting Patients at Test Sites
The challenge in comparing patients who tested positive versus negative arises from selecting on patients who seek testing.
Figure 1 is a missing data directed acyclic graph (mDAG) [6, 16] representing the assumed relationships between covariates in this analysis. In particular, we allow for the baseline covariates (X, C) to potentially cause (i.e. influence the risk for) SARS-CoV-2 infection, Y. If a patient seeks testing (T = 1) then we observe a test result W. If the test can perfectly detect infection, then W = 1 if Y = 1 and W = 0 if Y = 0. Testing is typically obtained if the individual has suspected symptoms of COVID-19, S (which may include fever, respiratory symptoms, etc.), but the act of seeking and then receiving testing may also be affected by the risk factors (X) and other baseline covariates (C). For example, an alert individual who frequently washes their hands may be more inclined to seek testing, possibly also depending on whether they are experiencing real or perceived symptoms of COVID-19 (included in S). Any variable in X, such as recent travel, that places a person at higher risk for infection may also prompt that person to seek testing, even with absent or mild symptoms. We assume that true infection only affects test-seeking behaviour through symptoms. Thus, in S we include all symptoms known to the participant. S may also represent symptoms of other (respiratory or other) infections, and these may be caused by pathogens other than the SARS-CoV-2 virus.
Other unmeasured factors may modify the risk of COVID-19, Y. We allow for unmeasured causes (omitted from the graph for simplicity) of both COVID-19 and other infections. In constructing this mDAG, we also assume that no unmeasured factor simultaneously affects any pair of nodes. We will discuss and relax some of these assumptions in Section 5.
The objective of our analysis is to estimate the model parameters representing the relationship between X and Y while adjusting for C, i.e. the parameters of the model for P(Y = 1 | X, C). But because we only have outcome data from those who are seeking testing, we may consider directly modeling the observed outcomes among those who were tested, P(W = 1 | X, C, T = 1). Even under a perfect test for COVID-19, so that W = Y when T = 1, such modeling of the selected population may produce misleading associations between X and Y. This is due to collider bias [13, 4], which is caused by subsetting or adjusting on a variable that is caused by the two variables whose association is of interest. In our case, we would be conditioning on T, which is caused both by Y (through S) and by X. Thus, there is a possibility for erroneous conclusions resulting from the measured associations between X and Y among those seeking testing.
For example, having access to a private vehicle allows one to avoid public transit, which may be a preventive factor for viral infection. Thus we may be interested in measuring the association between access to a private car and COVID-19. However, access to a COVID-19 test site is also facilitated by access to a vehicle, especially for those living farther away from the test sites. Because (we are supposing) those with a car are more likely to be able to seek testing if they have symptoms, there is a disproportionate number of people without cars and with COVID-19 who are not tested. Thus, we may measure a negative association between access to a private vehicle and COVID-19 even if there is no causal relationship between car ownership and COVID-19. We also demonstrate this by example in the simulation study.
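This collider mechanism can be demonstrated with a short simulation, shown here as a hypothetical Python sketch (all variable names and parameter values are illustrative and unrelated to the study of Section 7; the paper's own simulations were run in R): car access x has no effect on infection y, yet restricting to tested subjects induces a negative association.

```python
import numpy as np

# Illustrative collider-bias simulation (parameter values hypothetical):
# car access (x) does not affect infection (y), but both symptoms (s) and
# car access affect who gets tested (t).
rng = np.random.default_rng(0)
n = 200_000

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

x = rng.binomial(1, 0.5, n)                           # access to a private vehicle
y = rng.binomial(1, 0.1, n)                           # infection, independent of x
s = rng.binomial(1, np.where(y == 1, 0.8, 0.1))       # symptoms, driven by infection
t = rng.binomial(1, expit(-2.0 + 2.5 * s + 1.5 * x))  # testing, driven by s and x

def odds_ratio(a, b):
    n11 = np.sum((a == 1) & (b == 1)); n10 = np.sum((a == 1) & (b == 0))
    n01 = np.sum((a == 0) & (b == 1)); n00 = np.sum((a == 0) & (b == 0))
    return (n11 * n00) / (n10 * n01)

or_pop = odds_ratio(x, y)                     # close to 1: no association overall
or_tested = odds_ratio(x[t == 1], y[t == 1])  # well below 1: spurious protection
print(f"population OR: {or_pop:.2f}; tested-only OR: {or_tested:.2f}")
```

Restricting to t == 1 conditions on a variable caused by both x and s, opening a spurious path between x and y even though the two are simulated independently.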
5 Identifiability of Risk Factors in the General Population
In this section, we discuss the identifiability of the parameters of the model in equation (1) under the mDAG in Figure 1. We then discuss identifiability under some less restrictive assumptions. When identifiability holds, a maximum likelihood substitution estimator [22] and an IPW estimator can be constructed.
5.1 Identifiability under the graph in Figure 1
The mDAG in Figure 1 makes several important assumptions. In particular, it assumes that S is fully measured, meaning that we measured all symptoms caused by SARS-CoV-2 infection that may lead to seeking testing. This can be achieved by thorough harmonized data collection on individuals being recruited from both populations. We assume that W is equal to Y when T = 1, up to random error. As mentioned, we allow for unmeasured common causes of COVID-19 and other infections. Otherwise, we assume that there are no common causes of any pair of nodes in the graph.
As a consequence of this structure, we have the independence condition

(2)  Y ⊥ T | (X, C, S).

We then use the law of total probability to write our association of interest as

(3)  P(Y = 1 | X, C) = ∫ P(Y = 1 | X, C, S = s) dP(s | X, C)
                     = ∫ P(Y = 1 | X, C, S = s, T = 1) dP(s | X, C),

by the independence assumption (2). The quantity dP(s | X, C) is the possibly multivariate distribution of S conditional on X and C, and the multiple integrals are taken over the domain of this distribution.
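The substitution in equation (3) can be checked numerically on a toy example. The following hypothetical Python sketch drops (X, C) for brevity and uses illustrative probabilities; symptoms are drawn first and infection given symptoms, which induces the same kind of joint dependence for the purpose of this check.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

s = rng.binomial(1, 0.3, n)                       # symptoms
y = rng.binomial(1, np.where(s == 1, 0.6, 0.05))  # infection given symptoms
t = rng.binomial(1, np.where(s == 1, 0.8, 0.1))   # testing depends only on s, so Y is independent of T given S

# Conditional risks P(Y = 1 | S = s, T = 1), estimable from tested subjects only
risk_tested = {v: y[(t == 1) & (s == v)].mean() for v in (0, 1)}

# Averaging over the *population* distribution of S, as in equation (3),
# recovers the population risk P(Y = 1).
p_y_hat = sum(risk_tested[v] * np.mean(s == v) for v in (0, 1))
print(p_y_hat, y.mean())  # the two values nearly coincide
```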
If the SARS-CoV-2 test is perfect then we can replace P(Y = 1 | X, C, S, T = 1) by P(W = 1 | X, C, S, T = 1) directly. If not, assuming uniform test sensitivity s_e and specificity s_p independent of patient characteristics, under the condition that s_e + s_p ≠ 1, the law of total probability expanding on W and rearranging terms gives us

(4)  P(Y = 1 | X, C, S, T = 1) = {P(W = 1 | X, C, S, T = 1) + s_p - 1} / (s_e + s_p - 1),

which is estimable from the data of the tested subjects [11].
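The correction in equation (4) can be encoded as a one-line function; the sketch below (with hypothetical values for s_e and s_p) verifies that it inverts the forward misclassification map.

```python
def correct_for_misclassification(p_obs, sens, spec):
    """Map P(W = 1 | ...) to P(Y = 1 | ...) via equation (4);
    requires sens + spec != 1."""
    return (p_obs + spec - 1.0) / (sens + spec - 1.0)

# Sanity check with hypothetical test characteristics: construct the observed
# positive-test probability from a true infection probability, then invert it.
p_true = 0.12
sens, spec = 0.85, 0.98
p_obs = sens * p_true + (1.0 - spec) * (1.0 - p_true)  # law of total probability on W
print(correct_for_misclassification(p_obs, sens, spec))  # recovers p_true = 0.12
```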
For the distribution dP(s | X, C) in equation (3), we may write

(5)  dP(s | X, C) = dP(s | X, C, T = 1) P(T = 1 | X, C) + dP(s | X, C, T = 0) {1 - P(T = 1 | X, C)},

where dP(s | X, C, T = t) is the conditional distribution of S in the population corresponding to T = t, for t ∈ {0, 1}. Components (a) dP(s | X, C, T = 1) and (b) dP(s | X, C, T = 0) are estimable from the subjects from the tested and untested populations, respectively. For component (c), P(T = 1 | X, C), the case-control design of sampling tested and untested patients allows us to identify associations between (X, C) and T. However, case-control data do not give us a baseline prevalence of the outcome (T in this case). Therefore, it is only under external knowledge of the marginal q = P(T = 1), the prevalence of testing, that component (c) is identified [29].
Because we are able to relate the target probability P(Y = 1 | X, C) to quantities that are known if given infinite samples of the observed data structure, under the given assumptions, we have established identifiability. This result does not rely on the specification of parametric models. Thus, for identifiability we need complete data on the symptoms S (specifically, all variables caused by Y leading to testing) in addition to the covariates of interest (X, C). These covariates must include all common causes of testing and Y and/or S (see next section). We also require knowledge of the parameters s_e, s_p, and q.

5.2 Identifiability in settings with fewer restrictions
The assumed relationships in the mDAG of Figure 1 may be too restrictive in certain studies. We thus describe several specific generalizations of the above graph and the consequences on identifiability and interpretability of the parameter.
5.2.1 Unmeasured variable affecting test seeking that also affects Y and/or S
In Figure 2 we add an unmeasured variable, U, influencing the act of getting tested. Suppose that U also affects being infected with SARS-CoV-2, or having symptoms of COVID-19, or both. For instance, mold exposure is associated with living in lower-income neighborhoods [1]. Mold exposure may lead to respiratory symptoms (e.g. asthma exacerbation) that could be confused for COVID-19 symptoms. People living in lower-income neighborhoods may be more at risk of COVID-19 due to greater population density and greater proportions of people who work in "essential services" [27]. Thus, socioeconomic status may be such a variable if it is not included in X or C. A second example is any variable that makes an individual high-risk (an arrow into Y) and that also leads to testing even if the individual is symptom-free.
The consequence of such a variable is that independence condition (2) no longer holds. This is because U directly creates dependence between Y and T, and/or because adjusting for S, which is a collider of Y and U, creates dependence between Y and T. Thus, if such a variable exists we cannot use the described maximum likelihood procedure. We should therefore attempt to measure all such factors and include them in X or C.
5.2.2 Unmeasured symptoms of COVID-19 leading to testing
Another potential scenario involves symptoms of COVID-19 that were not included in S but that can also lead to testing. This scenario is portrayed in Figure 3. Such a variable may exist if, for instance, a study does not ask about the less common symptoms of COVID-19, such as headache or skin rashes.
In this scenario, the unmeasured symptom is a mediator of the effect of Y on T, so the independence condition (2) does not hold. This is still the case if the unmeasured symptom is related to the baseline covariates or partially correlated with other symptoms.
5.2.3 Unmeasured variables correlated with baseline covariates that affect testing
Consider the presence of a variable affecting T that is correlated with the baseline covariates (either through a causal relationship or otherwise). Such a variable does not affect the independence condition (2), and thus identifiability is preserved. Such variables include demographic information and participant characteristics that affect test-seeking behaviour but are otherwise not related to infection or symptoms.
5.2.4 Unmeasured variables affecting risk factors and COVID-19 infection
In the presence of a variable that affects both X and Y, the association between the risk factors and the outcome of interest will be confounded. The presence of such a variable will not affect the independence condition (2), and thus the parameters in model (1) will still be identifiable. However, their values may be less meaningful and may not represent causal relationships between risk factors and infection due to the unmeasured confounding.
6 Estimation with IPW
The g-formula (3) relates the observed data to P(Y = 1 | X, C) and thus the model of interest. Estimation is available in principle through modeling the components of the g-formula, producing a substitution estimator [25]. We find that when S is high-dimensional, as is likely the case in this setting, the g-formula estimator may not be feasible. Alternatively, one may model the probability of selection directly, using the case-control data of tested and untested individuals, to construct an estimator using IPW [3, 29]. We describe the latter, which requires knowledge of the test sensitivity, s_e, and specificity, s_p, and the value of the testing prevalence, q. If these parameters are uncertain, then one can undertake a sensitivity analysis by varying the assigned values.
The IPW estimator for the parameters of interest in model (1) is given through the score equations of a weighted logistic regression,

(6)  Σ_{i=1}^{n} [t_i / P̂(T = 1 | x_i, c_i, s_i)] {ŷ_i - expit(θ_0 + θ_X' x_i + θ_C' c_i)} (1, x_i', c_i')' = 0,

where the subscript i refers to the data realizations of subject i.

In order to estimate the numerator of the IPW estimator, we must first define a model for P(W = 1 | X, C, S, T = 1). This model is fit on subjects who received a test. Predictions from this model fit are denoted p̂_i. By the relationship in equation (4), we set ŷ_i = (p̂_i + s_p - 1)/(s_e + s_p - 1) for all subjects with t_i = 1. Note that ŷ_i approximates P(Y = 1 | x_i, c_i, s_i, T = 1).
In order to estimate the denominator of (6), we note that the associations between covariates, symptoms, and the probability of testing must be estimated from the data resulting from the case-control design, where sampling is carried out in both the tested and untested groups. If we know the baseline testing prevalence q, we may use a simple weighting method for case-control studies [29]. Specifically, we assign all cases the weight q and all controls the weight (1 - q)/J, where J is the ratio of the number of controls to cases in the sample. We use these weights in any chosen binomial regression model for T conditional on X, C, and S. Finally, we use predictions from this model fit to estimate P(T = 1 | x_i, c_i, s_i) for all tested subjects.
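The case-control weights described above can be computed as follows (a hypothetical Python sketch; `case_control_weights` is an illustrative helper, not from the paper). With these weights, the weighted share of tested subjects matches the assumed population testing prevalence q.

```python
import numpy as np

def case_control_weights(t, q):
    """Assign weight q to cases (tested, t == 1) and (1 - q) / J to controls,
    where J is the ratio of controls to cases in the sample [29]."""
    t = np.asarray(t)
    J = np.sum(t == 0) / np.sum(t == 1)
    return np.where(t == 1, q, (1.0 - q) / J)

# 2000 tested and 2000 untested participants, as in the simulation study
t = np.repeat([1, 0], [2000, 2000])
w = case_control_weights(t, q=0.002)
print(np.sum(w * t) / np.sum(w))  # weighted proportion tested: 0.002
```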
A simple proof of the consistency of this estimator under the independence assumption (2) is given in the Appendix. It is required that the models for P(W = 1 | X, C, S, T = 1) and P(T = 1 | X, C, S) are both correctly specified. We expect that the values of the denominator may be close to one for some tested subjects who are experiencing several symptoms of COVID-19. However, we would not necessarily expect denominator values close to zero because the IPW equation (6) only uses subjects who did, in fact, get tested. We thus expect our IPW estimator to be fairly stable in this setting.
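As a sketch of the estimation step, the score equations (6) can be solved by a few Newton-Raphson iterations. The Python code below is a hypothetical implementation (the paper's analyses used R); it places no constraint on the working outcomes ŷ_i, which may fall outside (0, 1) after the sensitivity/specificity correction.

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def solve_weighted_score(yhat, design, weights, n_iter=50):
    """Newton-Raphson for sum_i w_i * (yhat_i - expit(d_i' theta)) * d_i = 0,
    the weighted logistic score equations of (6)."""
    theta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        mu = expit(design @ theta)
        score = design.T @ (weights * (yhat - mu))
        hess = -(design.T * (weights * mu * (1.0 - mu))) @ design
        theta -= np.linalg.solve(hess, score)
    return theta

# Sanity check on synthetic data (weights of one, well-specified outcomes):
rng = np.random.default_rng(2)
n = 5000
x = rng.binomial(1, 0.1, n)            # binary risk factor
c = rng.normal(size=n)                 # continuous confounder
design = np.column_stack([np.ones(n), x, c])
theta_true = np.array([-2.0, np.log(1.5), 0.5])
yhat = expit(design @ theta_true)      # exact conditional probabilities as outcomes
theta = solve_weighted_score(yhat, design, np.ones(n))
print(np.exp(theta[1]))                # conditional OR for x: 1.5
```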
7 Simulation Study
In order to evaluate the proposed IPW method under the mDAG in Figure 1, and to compare it to a naive approach, we perform a simulation study. We first evaluate the method under ideal circumstances, where the assumed parametric models are close to well-specified, where the sensitivity and specificity of the test are known, and where the baseline prevalence of testing is known. We then evaluate the sensitivity to departures from these assumptions.
We first simulate ordered data (C, X, Y, S, T, W), where each variable is unidimensional, for a population of 1,000,000. Baseline confounder C is generated from a standard Gaussian distribution. The risk factor of interest, X, is generated as a Bernoulli random variable conditional on C such that its prevalence in the population is approximately 10%. True COVID-19 status Y is generated as Bernoulli conditional on X and C such that the true incidence of acute infection in this previously untested population is 10% overall (for this example, though this is likely lower in the general population in practice). The true conditional odds ratio between X and Y is 1.5. Symptoms S are Bernoulli conditional on X, C, and Y, where the dependence on Y is strong so that infected individuals have a high probability of experiencing symptoms (roughly between 0.5 and 0.9). Testing status T is then generated given X, C, and S with a fairly strong dependence on S such that symptoms lead to a higher probability of being tested. Given the true infection state Y, true sensitivity s_e, and true specificity s_p, the test outcome W is drawn for all tested subjects. Then, we randomly sample 2,000 tested participants (roughly all available) and 2,000 untested participants (a small subsample of the total available) from the population, which gives us our study sample. The data generation is given in Appendix Table 2.

In order to demonstrate the selection bias from using only tested subjects to evaluate risk factors, we fit a logistic regression for W conditional on X and C using the data from tested subjects in the sample. We then apply our method using logistic regressions for P(W = 1 | X, C, S, T = 1) and P(T = 1 | X, C, S), where the latter regression is weighted using the case-control weights. The score equations (6) are then solved using a standard optimization procedure for logistic regression with a log-likelihood loss function, though our implementation allows for values of ŷ_i outside of (0, 1), which occurs due to the transformation with s_e and s_p.
We implement our method with correctly specified logistic regression models under the following settings: assumed (s_e, s_p) values set to the truth, to (1, 1) (i.e. ignoring test imperfection), and to a misspecified pair; assumed testing prevalence q set to the truth (roughly 0.2% of the full population), to the truth misspecified by a factor of 10, and by a factor of 100. We then misspecify our testing model by omitting an interaction term between X and S. In the last IPW implementation, we do not adjust for symptoms by removing S from all models.
We use a case-control non-parametric bootstrap method, where resampling with replacement is done separately in the tested and untested groups, to estimate the standard error and 95% confidence intervals for the IPW method [31]. The usual logistic regression standard errors are used for the naive method. All simulations were run with R statistical software v. 3.6.1 [19].

The results of all implementations, in addition to the analysis conducted only on tested subjects, are given in Table 1. Mean parameter estimates, Monte Carlo standard errors, mean standard error estimates, and coverage of the 95% confidence intervals are given. We first note that the logistic regression analysis run with only tested subjects is highly biased, suggesting on average that X leads to a lower risk of the outcome while the opposite is true. IPW implemented with correct parameter values and models had no error on average, with a bootstrap variance estimate that corresponded to the Monte Carlo value and only slight undercoverage from the 95% bootstrap confidence interval. Ignoring the sensitivity and specificity (i.e. setting s_e = s_p = 1) in the IPW method led to some attenuation in the average estimate, though coverage remained similar. Incorrectly specifying s_e and s_p resulted in substantial overestimation and inflated standard error estimates. Misspecifying q by an order of 10 did not lead to important bias, but misspecifying q by an order of 100 led to underestimation. Missing an interaction in the model for testing led to an inverted odds ratio suggesting that X is protective for Y. This result points to the importance of correct modeling of the nuisance functions in the IPW estimator. Finally, when not adjusting for S, IPW gives the same biased results as the subsetted logistic regression. Bias occurs because if we omit S then we do not satisfy the independence assumption (2) with the measured variables.

Table 1: Results of the simulation study (true conditional OR = 1.5; the true values of s_e and s_p were used to generate the data). "Mean est" is the mean odds ratio estimate, "MC SE" the Monte Carlo standard error, "Mean est SE" the mean estimated standard error, and "% Cov" the coverage of the 95% confidence intervals.

Method | Assumed (s_e, s_p) | Assumed prevalence of testing q | Mean est | MC SE | Mean est SE | % Cov
Analysis of tested subjects (naive) | – | – | 0.87 | 0.18 | 0.17 | 10
IPW, correct models | truth | truth | 1.49 | 0.24 | 0.24 | 92
IPW, correct models | (1, 1) | truth | 1.43 | 0.22 | 0.22 | 91
IPW, correct models | misspecified | truth | 2.19 | 0.39 | – | 86
IPW, correct models | truth | off by factor of 10 | 1.46 | 0.24 | 0.24 | 91
IPW, correct models | truth | off by factor of 100 | 1.26 | 0.21 | 0.20 | 80
IPW, missing interaction | truth | truth | 0.79 | 0.29 | 0.29 | 27
IPW, omitted S | truth | truth | 0.86 | 0.19 | 0.19 | 11
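The case-control bootstrap used for the interval estimates above resamples with replacement separately within the tested and untested groups. The following is a hypothetical Python sketch (the helper and the toy statistic are illustrative; the paper's implementation was in R):

```python
import numpy as np

def case_control_bootstrap(tested, untested, estimator, B=500, seed=3):
    """Percentile bootstrap preserving the case-control design: resample
    within each group separately, then recompute the estimator."""
    rng = np.random.default_rng(seed)
    n1, n0 = len(tested), len(untested)
    stats = np.empty(B)
    for b in range(B):
        s1 = tested[rng.integers(0, n1, n1)]
        s0 = untested[rng.integers(0, n0, n0)]
        stats[b] = estimator(s1, s0)
    se = stats.std(ddof=1)
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return se, (lo, hi)

# Toy usage with a simple between-group mean difference as the statistic
rng = np.random.default_rng(4)
tested = rng.normal(1.0, 1.0, 400)
untested = rng.normal(0.0, 1.0, 400)
se, ci = case_control_bootstrap(tested, untested, lambda a, b: a.mean() - b.mean())
print(se, ci)
```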
8 Discussion
In this paper, we have contributed to the investigation of statistical analysis under the test-negative design in the context of evaluating risk or preventive factors of COVID-19 when participants may be conveniently recruited at disease testing sites. We defined a potential parameter of interest in such a study as the coefficients in a regression model for the true infection outcome. We explained and demonstrated the importance of sampling additional population controls [30] in order to avoid the selection bias arising from comparing only tested individuals. We then investigated the identifiability of the target parameter under several settings. Finally, we proposed a novel IPW estimator that accounts for both imperfect test sensitivity and specificity and the study design, and evaluated this estimator through a simulation study.
There is a growing literature on identifiability conditions for statistical parameters under missingness [2] and a large literature on the identifiability of causal parameters [17, 10]. Our setting is somewhat different from a typical missing data setting in that, because observed outcomes are obtained through imperfect tests, true infection status is not observed for any subject. In addition, the case-control component of the study design must be considered when estimating all probabilities and distributions in the general population of interest. These results are important as they shed light on the data collection needed to correctly estimate the parameter of interest. In particular, we must measure all variables on the pathway between SARS-CoV-2 infection and testing. This means that incomplete ascertainment of the symptoms leading some individuals to be tested would result in a biased estimator. We must also measure and adjust for all causes of testing if they are also causes of SARS-CoV-2 infection and/or symptoms.
The proposed estimator assumes knowledge of the test properties and the prevalence of testing in the population. Given the potential sensitivity to errors made when specifying these quantities, one could undertake a sensitivity analysis. Specifically, confidence intervals could be constructed using all combinations of credible values for s_e, s_p, and q. By taking the minimum confidence interval lower bound and the maximum upper bound, we can place bounds on the set of parameter values that are supported by the data and the assumed model. Other approaches may involve Bayesian estimation [8], where informed priors are placed on these values, but we do not explore such approaches here. We also noted the sensitivity of the results to misspecification of the model for testing. It is thus important to understand the mechanisms driving people to seek and receive testing and to use a flexible modeling approach [18].
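The grid-based sensitivity analysis described above can be sketched as follows. Here `fit_ipw_ci` is a hypothetical stand-in for the full IPW fitting procedure with bootstrap intervals; in practice each call would re-run the estimator with the given (s_e, s_p, q), and the grids would reflect subject-matter judgment about credible values.

```python
from itertools import product

def fit_ipw_ci(se, sp, q):
    # Placeholder returning a 95% CI for the odds ratio; illustrative only.
    # A real implementation would refit the IPW estimator for each setting.
    width = 0.3 + 2.0 * abs(0.90 - se) + 1.0 * abs(0.98 - sp) + 5.0 * abs(0.002 - q)
    return (1.5 - width, 1.5 + width)

# Credible grids for the unknown design parameters (hypothetical values)
se_grid = [0.80, 0.85, 0.90]
sp_grid = [0.95, 0.98, 1.00]
q_grid = [0.001, 0.002, 0.004]

cis = [fit_ipw_ci(se, sp, q) for se, sp, q in product(se_grid, sp_grid, q_grid)]
bound_lo = min(lo for lo, hi in cis)  # minimum lower bound over the grid
bound_hi = max(hi for lo, hi in cis)  # maximum upper bound over the grid
print(bound_lo, bound_hi)
```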
This work can be directly adapted to an investigation of causality by modifying the target parameter of interest to a causal parameter under additional assumptions, including "no unmeasured confounders" for an exposure of interest in X. If the additional assumptions hold, then this approach could investigate potential epidemiological causes of SARS-CoV-2 infection. Future work could also improve the efficiency of the IPW estimator through approaches such as targeted maximum likelihood estimation [28]. Though improvements are likely possible, a practical fully efficient estimator is probably infeasible due to the difficulties in applying the g-formula.
Due to the rapidly evolving nature of the COVID-19 pandemic, studies with short timelines are necessary to monitor public health. The accessibility of the test-negative design with untested controls allows for much shorter timelines compared to a cohort study of uninfected individuals. We must, however, overcome the inherent selection bias arising from this design. Novel study designs must be followed by clear definitions of the parameters of interest, investigations of the identifiability of these parameters, and potentially tailored estimators. These steps allow for a principled approach that does not rely solely on intuition and may help avoid substantial sources of bias when tracking risk and preventive factors of COVID-19.
References
 [1] (2014) Environmental conditions in low-income urban housing: clustering and associations with self-reported health. Am J Public Health 104(9), pp. 1650–1656.
 [2] (2019) Identification in missing data models represented by directed acyclic graphs. Uncertain Artif Intell, in press.
 [3] (2008) Constructing inverse probability weights for marginal structural models. Am J Epidemiol 168(6), pp. 656–664.
 [4] (2009) Illustrating bias due to conditioning on a collider. Int J Epidemiol 39(2), pp. 417–420.
 [5] (2020) Epidemiology, clinical course, and outcomes of critically ill adults with COVID-19 in New York City: a prospective cohort study. Lancet, in press.
 [6] (2012) Using causal diagrams to guide analysis in missing data problems. Stat Methods Med Res 21(3), pp. 243–256.
 [7] (2013) The test-negative design: validity, accuracy and precision of vaccine efficacy estimates compared to the gold standard of randomised placebo-controlled clinical trials. Euro Surveill 18(37), pp. 20585.
 [8] (2011) Estimating prevalence using an imperfect test. Epidemiol Res Int 2011, Article ID 608719.
 [9] (2017) Basic principles of test-negative design in evaluating influenza vaccine effectiveness. Vaccine 35(36), pp. 4796–4800.
 [10] (1999) Causal diagrams for epidemiologic research. Epidemiology 10(1), pp. 37–48.
 [11] (1996) Basic methods for sensitivity analysis of biases. Int J Epidemiol 25(6), pp. 1107–1116.
 [12] (2020) Temporal dynamics in viral shedding and transmissibility of COVID-19. Nat Med 26, pp. 672–675.
 [13] (2004) A structural approach to selection bias. Epidemiology 15(5), pp. 615–625.
 [14] (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47(260), pp. 663–685.
 [15] (2020) Inference for a test-negative case-control study with added controls. arXiv preprint.
 [16] (2013) Graphical models for inference with missing data. In Adv Neural Inf Process Syst, pp. 1277–1285.
 [17] (2009) Causality: models, reasoning and inference. 2nd edition, Cambridge University Press, Cambridge, MA, USA.
 [18] (2014) Improving propensity score estimators' robustness to model misspecification using Super Learner. Am J Epidemiol 181(2), pp. 108–119.
 [19] (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
 [20] (2020) Canadian Public Health Laboratory Network best practices for COVID-19. Can Commun Dis Rep 46(5).
 [21] (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc 90(429), pp. 106–121.
 [22] (1986) A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Mathematical Modeling 7, pp. 1393–1512.
 [23] (2018) Clarifying questions about "risk factors": predictors versus explanation. Emerg Themes Epidemiol 15, Article 10.
 [24] (2010) To explain or to predict? Stat Sci 25(3), pp. 289–310.
 [25] (2011) Implementation of g-computation on a simulated data set: demonstration of a causal inference technique. Am J Epidemiol 173(7), pp. 731–738.
 [26] (2016) Robust causal inference using directed acyclic graphs: the R package 'dagitty'. Int J Epidemiol 45(6), pp. 1887–1894.
 [27] (2020) The plight of essential workers during the COVID-19 pandemic [editorial]. Lancet 395(10237), pp. P1587.
 [28] (2011) Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media.
 [29] (2008) Estimation based on case-control designs with known prevalence probability. Int J Biostat 4(1), Article 17.
 [30] (2020) The test-negative design with additional population controls: a practical approach to rapidly obtain information on the causes of the SARS-CoV-2 epidemic. arXiv preprint.
 [31] (1997) Estimation in choice-based sampling with measurement error and bootstrap analysis. J Econom 77(1), pp. 65–86.
 [32] (2020) Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China. JAMA 323(11), pp. 1061–1069.
Appendix A Consistency of IPW
By the definition of the parameters in equation (1) and the typical log-likelihood loss function, the true values $\beta^* = (\beta_0^*, \beta_X^*, \beta_C^*)$ are defined through the equations

(7) $E\left[\left\{Y^{(1)} - \operatorname{expit}(\beta_0^* + \beta_X^* X + \beta_C^* C)\right\}(1, X, C)^{\top}\right] = 0.$

We first assume that the model for $\Pr(Y=1 \mid C, W, X, R=1)$ is consistent, such that the fitted values for this probability given $(C, W, X)$ converge to the truth. Then, the estimates of $\Pr(Y^{(1)}=1 \mid C, W, X, R=1)$, obtained by applying the misclassification correction $\{\Pr(Y=1 \mid C, W, X, R=1) + \text{spec} - 1\}/(\text{sens} + \text{spec} - 1)$, will converge to the true $\Pr(Y^{(1)}=1 \mid C, W, X, R=1)$ as long as the parameters sens and spec are correct. Then, as $n$ goes to infinity, and assuming consistent nuisance function estimation in the denominator, the IPW score equations (6) imply

$E\left[\frac{R}{\Pr(R=1 \mid W, X, C)}\left\{\Pr(Y^{(1)}=1 \mid C, W, X, R=1) - \operatorname{expit}(\beta_0 + \beta_X X + \beta_C C)\right\}(1, X, C)^{\top}\right] = 0.$

By iterative expectations, first over $R$ given $(W, X, C)$ and then over $Y^{(1)}$ given $(C, W, X)$, and using the conditional independence $Y^{(1)} \perp R \mid (C, W, X)$, we can rewrite the above as

$E\left[\left\{Y^{(1)} - \operatorname{expit}(\beta_0 + \beta_X X + \beta_C C)\right\}(1, X, C)^{\top}\right] = 0,$

which coincides with equation (7), so the IPW estimator is consistent for $\beta^*$.
Appendix B Simulation study data generation
The data-generating mechanism used in the simulation study is given in Table 2. We also present the R code below.
# OR, sens, and spec must be set before running (the odds ratio of interest,
# test sensitivity, and test specificity, respectively)
popsize <- 1000000
C <- rnorm(n=popsize)
X <- rbinom(prob=plogis(-2.3+0.3*C), size=1, n=popsize) # prevalence is around 0.1
# Y will be censored, Y1 is latent for everyone
Y1 <- rbinom(prob=plogis(log(OR)*X+0.5*C-2.7), size=1, n=popsize)
# check desired prevalence of true outcome
# generate test results
Y <- rbinom(prob=(sens*Y1+(1-spec)*(1-Y1)), size=1, n=popsize)
# symptoms based on infection
W <- rbinom(n=popsize, prob=plogis(-2+0.2*C+0.5*X+3*Y1), size=1)
# selection on outcome for testing
R <- rbinom(n=popsize, size=1, prob=plogis(-7+2*W+0.6*X+0.2*C-W*X))
# about 0.002 of pop tested -> determines sample size
q0 <- mean(R) # Pr(R=1) in population
Y[R==0] <- NA
indcontrols <- sample(1:sum(R==0), size=2000, replace=FALSE)
indcases <- sample(1:sum(R==1), size=min(sum(R==1), 2000), replace=FALSE)
dat <- as.data.frame(rbind((cbind(C,X,Y,W,R,Y1)[R==1,])[indcases,],
                           (cbind(C,X,Y,W,R,Y1)[R==0,])[indcontrols,]))
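As a usage sketch, the free quantities OR, sens, and spec can be assigned before sourcing the snippet above; the values below are illustrative choices, not settings taken from the paper. The first line of the generating mechanism can also be checked directly against the stated prevalence:

OR <- 1.5      # hypothetical odds ratio of X on the latent infection Y1
sens <- 0.9    # hypothetical test sensitivity
spec <- 0.99   # hypothetical test specificity
set.seed(1)
# Quick check of the intended prevalence of X before running the full snippet:
mean(plogis(-2.3 + 0.3*rnorm(1e6)))  # should be around 0.1, as noted in the comments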
Variable   Generating Mechanism (i.i.d.)
C          Normal(0, 1)
X          Bernoulli{expit(-2.3 + 0.3C)}
Y^(1)      Bernoulli{expit(log(OR)X + 0.5C - 2.7)}
Y          Bernoulli{sens * Y^(1) + (1 - spec)(1 - Y^(1))}
W          Bernoulli{expit(-2 + 0.2C + 0.5X + 3Y^(1))}
R          Bernoulli{expit(-7 + 2W + 0.6X + 0.2C - WX)}
Set Y = NA for all R = 0.
Subsample 2000 subjects with R = 0 and at most 2000 subjects with R = 1.
Appendix C R code to run the IPW estimator
In this section, we present the R code to run the estimator for observed data with structure (C, W, X, R, Y) where C, W, and X are univariate. Note that the simulation study data has such a structure. This code can be easily extended for multivariate versions of those variables.
The IPW function uses the following two helper functions.
# Log-likelihood function that can take Y values outside of (0,1)
LogLikelihood <- function(beta, Y, X, w){
  pi <- plogis( X%*%beta ) # P(Y|A,W) = expit(beta0 + beta1*X1 + beta2*X2 ...)
  pi[pi==0] <- .Machine$double.neg.eps   # avoid taking the log of 0
  pi[pi==1] <- 1-.Machine$double.neg.eps
  # negative weighted log-likelihood (optim minimizes by default)
  logLike <- -sum( w*( Y*log(pi) + (1-Y)*log(1-pi) ) )
  return(logLike)
}
grad <- function(beta, Y, X, w){
  pi <- plogis( X%*%beta ) # P(Y|A,W) = expit(beta0 + beta1*X1 + beta2*X2 ...)
  pi[pi==0] <- .Machine$double.neg.eps   # for consistency with above
  pi[pi==1] <- 1-.Machine$double.neg.eps
  gr <- -crossprod(w*X, Y-pi) # gradient of the negative log-likelihood
  return(gr)
}
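As a quick sanity check (not part of the paper), when Y is truly binary and all weights equal one, minimizing LogLikelihood with gradient grad should reproduce ordinary logistic regression. This assumes the helper functions above, with the negative log-likelihood convention required by optim's default minimization:

set.seed(123)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1*x))
Xmat <- cbind(1, x)          # design matrix with intercept
fit <- optim(par=c(0,0), fn=LogLikelihood, gr=grad,
             Y=y, X=Xmat, w=rep(1,n), method="BFGS")
fit$par                                   # should be close to the glm estimates
coef(glm(y~x, family=binomial()))

Agreement between the two sets of coefficients confirms that the custom score equations are solved correctly before they are used with fractional outcomes and non-unit weights.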
The function to run IPW depends on the data (dat) and values for sensitivity (sens), specificity (spec), and the baseline prevalence (q0hat). The function follows.
IPWest <- function(dat, sens, spec, q0hat){
  # Use IPW estimator (with true sens and spec) to estimate beta
  modYR1 <- glm(Y~C+W*X, subset=(R==1), family=binomial(), data=dat)
  # misclassification (Rogan-Gladen) correction for imperfect test
  QY1R1 <- (predict(modYR1, type="response", newdata=dat)+spec-1)/(sens+spec-1)
  # cases
  nC <- sum(dat$R==1)
  # controls
  nCo <- sum(dat$R==0)
  J <- nCo/nC
  # weights accounting for the case-control sampling probability
  w <- (1-q0hat)/J*((dat$R==0)+0) + q0hat*((dat$R==1)+0)
  # Specify some model for R, fit with weights w.
  # We use a logistic regression as an example:
  modRwxc <- glm(R~W*X+C, weights=w, family=binomial(), data=dat)
  PRwxc <- predict(modRwxc, type="response")
  Ystar <- QY1R1[dat$R==1]
  # This solves the IPW score equations by optimizing the log-likelihood
  optim.out <- optim(par=c(-3,0.5,0.5), fn=LogLikelihood, gr=grad, Y=Ystar,
                     X=cbind(1,dat$X,dat$C)[dat$R==1,],
                     w=1/PRwxc[dat$R==1], method="BFGS")
  beta <- optim.out$par[2]
  # The score equations can also be solved with geeglm in geepack,
  # but we must truncate the outcome to (0,1):
  # library(geepack)
  # Ystar[Ystar<0] <- 0
  # modMSM <- geeglm(Ystar~X+C, data=dat[dat$R==1,], id=1:sum(dat$R),
  #                  weights=1/PRwxc[dat$R==1], family=binomial)
  # est <- coef(modMSM)[2]
  return(beta)
}
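A usage sketch, assuming dat and q0 were produced by the Appendix B code with the same illustrative values of sens and spec supplied to the estimator:

# After running the Appendix B code with, e.g., OR <- 1.5; sens <- 0.9; spec <- 0.99
beta_hat <- IPWest(dat=dat, sens=0.9, spec=0.99, q0hat=q0)
exp(beta_hat)  # estimated conditional odds ratio for X; compare with the value of OR used

Because Y^(1) is generated from X and C only, the fitted marginal structural model matches the generating mechanism, so the estimate should approach the chosen OR in large samples.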