Randomization Bias in Field Trials to Evaluate Targeting Methods

11/21/2017 ∙ by Eric Potash, et al. ∙ The University of Chicago

This paper studies the evaluation of methods for targeting the allocation of limited resources to a high-risk subpopulation. We consider a randomized controlled trial to measure the difference in efficiency between two targeting methods and show that it is biased. An alternative, survey-based design is shown to be unbiased. Both designs are simulated for the evaluation of a policy to target food safety inspections using a predictive model. Our work anticipates further developments in economics that will be important as predictive modeling becomes an increasingly common policy tool.

1 Introduction

Policymakers may choose to target the allocation of scarce resources to a subpopulation according to risk or need. Rapid advances in predictive modeling in recent decades have the potential to make significant contributions to this age-old economic problem (Kleinberg et al., 2015). Some of the programs where predictive targeting is employed or has been proposed include: residential lead hazard investigations (Potash et al., 2015), restaurant hygiene inspections (Kang et al., 2013), and violence education (Chandler et al., 2011).

Of course, the impact of any targeting method should be evaluated. However, as we shall see, care must be taken in applying the existing economic field trial framework when different treatments (targeting methods) operate on different subsets of the population. We develop a framework for this analysis by drawing on the machine learning (Baeza-Yates et al., 1999) and targeted therapies (Mandrekar and Sargent, 2009) literatures.

Concretely, suppose we have a population $X$ of $N$ units (e.g. homes) and the resources to perform $k$ observations (e.g. investigations) of some binary outcome $y$ (e.g. lead hazards). (We consider interventions in §4. Continuous outcomes may be accommodated but binary outcomes are more common.) Next suppose we have a targeting method $S$ which selects a subset $S_k$ of $k$ units for observation.

We define the precision of $S$ at $k$ to be the proportion of positive outcomes among the targets $S_k$. When the goal of targeting is to observe positive outcomes, precision is a measure of efficiency (e.g. the proportion of home investigations finding lead hazards).

In this paper our task is to compare the precision at $k$ of two different targeting methods $S$ and $T$ using $k$ observations. Denoting the precisions of $S$ and $T$ by $\mu_{S_k}$ and $\mu_{T_k}$, respectively, we wish to measure their difference

$$\delta = \mu_{S_k} - \mu_{T_k} \tag{1}$$

When $\delta$ is positive, $S$ is more efficient than $T$ as a targeting method.

With $k$ observations we can measure the precision of $S$ or of $T$. But we would need up to $2k$ observations (depending on the size of the intersection $S_k \cap T_k$) to measure them both and so measure $\delta$. Thus we estimate $\delta$ statistically.
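As a small illustration of precision at a resource level and of the difference being estimated, the sketch below uses a hypothetical eight-unit population. The names `precision_at_k`, `rank_S`, `rank_T` and the outcome values are assumptions made for this example only.

```python
def precision_at_k(ranking, y, k):
    """Proportion of positive outcomes among the first k units of a ranking."""
    return sum(y[u] for u in ranking[:k]) / k

# Hypothetical toy population: 8 units with binary outcomes.
y = {"a": 1, "b": 1, "c": 0, "d": 1, "e": 0, "f": 0, "g": 1, "h": 0}
rank_S = ["a", "b", "d", "c", "g", "e", "f", "h"]  # ranking under method S
rank_T = ["c", "a", "e", "b", "f", "d", "g", "h"]  # ranking under method T

k = 4
mu_S = precision_at_k(rank_S, y, k)
mu_T = precision_at_k(rank_T, y, k)
delta = mu_S - mu_T  # positive when S is the more efficient method

# Measuring both precisions exactly needs every unit in the union of the targets.
needed = len(set(rank_S[:k]) | set(rank_T[:k]))
print(mu_S, mu_T, delta, needed)
```

In this toy case the two target sets share three units, so an exact comparison would need five observations, one more than the four available.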

Figure 1: (a) RCT; (b) Survey. In the RCT (figure 1a), the population is randomly split in halves and each targeting method is applied to one half. In the survey (figure 1b), the targeting methods are applied to the population and randomly sampled after excluding their intersection.

A natural design for a field trial to estimate $\delta$ is an RCT in which the population is randomly split in half and each targeting method is applied to one half. Then we observe the top $k/2$ units in each half, resulting in $k$ total observations.

There is an alternative design: consider $S_k$ and $T_k$ as (after discarding their intersection) disjoint subpopulations and observe $k/2$ random units from each. We think of this design as a survey because it randomly samples the two target sets in the population as opposed to applying the targeting methods to random halves of the population. See figure 1 for a graphical comparison of the two designs.

The remainder of the paper is organized as follows. After defining a framework in §2, we show in §3 that the RCT provides unrepresentative observations and we derive a formula for the bias. In §4, we show that the survey gives an unbiased estimate of $\delta$ and discuss implementation details. In §5, we apply the above to a field trial to evaluate targeting of residential lead hazard investigations using a predictive model and simulate sampling distributions for both designs.

The issue in the RCT stems from the interaction between finite populations, partitions, and order statistics. It is of particular interest as an example of the failure of random assignment to solve an estimation problem. In this sense it is an example of randomization bias (Heckman and Smith, 1995; Sianesi, 2017) and adds to the collection of pitfalls that researchers should consider before selecting an RCT design (Deaton and Cartwright, 2016). Our work anticipates further developments in economics that will be important as predictive modeling becomes an increasingly common policy tool.

2 Framework

A targeting method $S$ is a function which, given a set of units $A \subseteq X$ and a number $j \le |A|$, selects a subset $S(A, j) \subseteq A$ of size $j$. When $A = X$, the full population, we use the shorthand $S_j$. Let $y_{S(A,j)}$ denote the outcome restricted to the set $S(A, j)$ and $\bar{y}_{S(A,j)}$ its mean. This is the proportion of units in $S(A, j)$ with positive outcome, i.e. the precision of $S$ at resource level $j$. The population precision at $k$ is then $\bar{y}_{S_k}$ but we denote it by $\mu_{S_k}$ to reflect that it is a population (albeit a finite population) object.

Any model of $y$ is also a targeting method. That is, suppose we have such a model which estimates for any unit $x$ the probability $P(y_x = 1)$ (or a score which is not necessarily a probability). The corresponding targeting method would select, from any subset $A$, the $j$ units in $A$ with the highest model probabilities. (Ties may be broken randomly. For simplicity, we do not explicitly consider stochastic targeting methods.)

An expert may not practically be able to rank all units. Instead, they may only be able to produce a list $S(A, j)$. However, we assume that the expert is rational in the sense that there is an underlying ranking of all units that is consistently applied to any subset $A$. This implies that any $A$ is ordered and we write $s^A_1, s^A_2, \ldots, s^A_{|A|}$ to reference units by their rank. When $A = X$, we use the shorthand $s_j$.

Following the machine learning literature (Baeza-Yates et al., 1999), we define the precision curve of a targeting method $S$ to be $\mu_{S_j}$ as a function of $j$. See figure 2. Note that when $j = N$ the entire population is selected, so precision at $N$ of any targeting method is the proportion of positive outcomes in the population.
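The sketch below shows one way a scoring model can act as a targeting method and how a precision curve can be computed from the resulting ranking. The helper names `top_by_score` and `precision_curve`, the scores, and the outcomes are hypothetical. Ties are broken by unit id here, which is an assumption of the example rather than the random tie-breaking mentioned in the text.

```python
from itertools import accumulate

def top_by_score(scores, subset, j):
    """Select the j units of `subset` with the highest scores.
    Ties are broken by unit id (an assumption of this sketch)."""
    return sorted(subset, key=lambda u: (-scores[u], u))[:j]

def precision_curve(ranking, y):
    """Precision at j = 1, ..., N along a ranking of the whole population."""
    hits = accumulate(y[u] for u in ranking)
    return [h / j for j, h in enumerate(hits, start=1)]

# Hypothetical scores and outcomes.
scores = {"a": 0.9, "b": 0.8, "d": 0.7, "c": 0.6, "g": 0.5, "e": 0.4, "f": 0.3, "h": 0.2}
y = {"a": 1, "b": 1, "c": 0, "d": 1, "e": 0, "f": 0, "g": 1, "h": 0}

ranking = top_by_score(scores, scores.keys(), len(scores))
curve = precision_curve(ranking, y)
print(curve)
```

Because the same scores are applied to any subset, calling `top_by_score` on a subset returns that subset's units in the same relative order as in the full ranking. The last entry of `curve` is the share of positive outcomes in the whole population.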

Figure 2: Precision curves for targeting methods in §5.

3 Randomized Controlled Trial Design

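A natural RCT to estimate $\delta$ using $k$ observations is as follows (see figure 1a):

  1. Randomly partition the population into disjoint halves: $X = X_1 \cup X_2$ with $|X_1| = |X_2| = N/2$.

  2. Use $S$ to select and observe the top $k/2$ units from $X_1$: $S(X_1, k/2)$.

  3. Use $T$ to select and observe the top $k/2$ units from $X_2$: $T(X_2, k/2)$.

  4. Calculate $\hat{\delta}_{RCT} = \bar{y}_{S(X_1, k/2)} - \bar{y}_{T(X_2, k/2)}$.

Note we’ve assumed $N$ and $k$ are even so $N/2$ and $k/2$ are integers.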
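The RCT steps above can be sketched in a few lines of simulation code. This is a minimal sketch under stated assumptions: each method is represented by a full ranking of the population, applying a method to a half means keeping that half's units in their original relative order, and the function name `rct_estimate` and the synthetic data are hypothetical.

```python
import random

def rct_estimate(rank_S, rank_T, y, k, rng):
    """One draw of the RCT estimate: split the population into random halves,
    apply S to one half and T to the other, observe the top k/2 of each."""
    units = list(rank_S)
    rng.shuffle(units)
    half = len(units) // 2
    X1, X2 = set(units[:half]), set(units[half:])
    top_S = [u for u in rank_S if u in X1][: k // 2]  # relative top of S in X1
    top_T = [u for u in rank_T if u in X2][: k // 2]  # relative top of T in X2
    prec_S = sum(y[u] for u in top_S) / len(top_S)
    prec_T = sum(y[u] for u in top_T) / len(top_T)
    return prec_S - prec_T

# Hypothetical population of 20 units, six of them positive.
rng = random.Random(0)
units = list(range(20))
y = {u: int(u < 6) for u in units}
rank_S = units                          # S happens to rank the positives first
rank_T = rng.sample(units, len(units))  # T is a random ordering

draws = [rct_estimate(rank_S, rank_T, y, k=4, rng=rng) for _ in range(2000)]
mean_estimate = sum(draws) / len(draws)
print(mean_estimate)
```

Averaging many draws approximates the expected value of the RCT estimate, which can then be compared with the population difference in precision.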

A hint of the problem with this design arises when carefully defining its terms. Since a traditional RCT applies the same treatment to all units in a treatment group, we must have that: there are just two “units”, the subpopulations $X_1$ and $X_2$; the “treatments” are selections and observations from each subpopulation; the “outcome” is the precision in the subpopulation, e.g. $\bar{y}_{S(X_1, k/2)}$. The “population” to which $X_1$ and $X_2$ belong might be the set of all population halves. One quantity, then, that is estimated in the RCT is

$$\delta_{RCT} = E_A\left[\bar{y}_{S(A, k/2)} - \bar{y}_{T(A, k/2)}\right]$$

where the expectation is taken over halves $A \subset X$.

Besides the fact that the RCT only samples $\delta_{RCT}$ on a single partition, we also argue that $\delta$ rather than $\delta_{RCT}$ is the quantity of interest. This is because $\delta$ measures the difference in the effect of actually implementing either targeting method on the population of interest ($X$) at the scale of interest ($k$). It is not, however, a priori clear what the relationship is between $\delta$ and $\delta_{RCT}$. It might be the case that they are equal.

We now show that this is not the case, i.e. $\delta_{RCT}$ is not in general equal to $\delta$. First consider a single targeting method $S$. Note that the relative top $S(X_1, k/2)$ observed in the RCT is not necessarily a subset of the absolute top $S_k$. That is, it may contain units ranked beyond the absolute top. In fact, when $k \le N/2$ some halves will have relative tops containing none of the absolute top.

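When $S$ induces a ranking $s_1, \ldots, s_N$, we quantify this by defining $K$ to be the maximum (absolute) rank in the relative top:

$$K = \max\{\, j : s_j \in S(X_1, k/2) \,\} \tag{2}$$

Note that $k/2 \le K \le (N + k)/2$. In appendix A.1 we show that the distribution on partitions $(X_1, X_2)$ induces the following probability distribution on $K$:

$$P(K = j) = \binom{j-1}{k/2-1} \binom{N-j}{N/2-k/2} \Big/ \binom{N}{N/2} \tag{3}$$

We marginalize over $K$ in appendix A.2 to compute the expected RCT estimate of the precision of $S$

$$E\left[\bar{y}_{S(X_1, k/2)}\right] = \sum_{j=k/2}^{(N+k)/2} P(K = j)\, \tilde{\mu}_{S_j} \tag{4}$$

where $\tilde{\mu}_{S_j}$ is a reweighted precision of $S$ at $j$ with increased weight on the last unit

$$\tilde{\mu}_{S_j} = \left(1 - \frac{2}{k}\right) \mu_{S_{j-1}} + \frac{2}{k}\, y_{s_j} \tag{5}$$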

Figure 3: The distribution of $K$ for the values of $N$ and $k$ used in §5. The mode is $k - 1$.

We show in A.1 that $P(K = j)$ is unimodal with mode at $j = k - 1$. See figure 3. Thus the expected RCT estimate of the precision of $S$ is a weighted average over the precision curve. The greatest weight is placed on $\tilde{\mu}_{S_{k-1}}$ with the weight of $\tilde{\mu}_{S_j}$ rapidly decaying in the distance from $j$ to $k - 1$.
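The distribution of the maximum absolute rank in the relative top can be tabulated directly. The sketch below assumes it equals the distribution of the $k/2$-th smallest rank in a uniformly random half of $N$ ranked units, written with binomial coefficients. The function name `pmf_K` and the example sizes are hypothetical.

```python
from math import comb

def pmf_K(N, k):
    """P(K = j) for j = k/2, ..., (N + k)/2, where K is the largest absolute
    rank among the top k/2 units of a uniformly random half of N ranked units."""
    n, m = N // 2, k // 2
    total = comb(N, n)
    return {j: comb(j - 1, m - 1) * comb(N - j, n - m) / total
            for j in range(m, n + m + 1)}

p_small = pmf_K(N=10, k=4)
p_large = pmf_K(N=1000, k=100)

mode_small = max(p_small, key=p_small.get)
mode_large = max(p_large, key=p_large.get)
print(mode_small, mode_large)
```

In both examples the probabilities sum to one and the most likely value is one less than the number of observations, in line with the unimodality statement.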

From this we draw several conclusions. First, when the precision is flat (as in the case of random targeting) the RCT estimate is unbiased. Second, (disregarding the difference between $\mu$ and $\tilde{\mu}$) bias stems from the difference between the precision curve at $k$ and its value near $k$. However, differences in opposite directions cancel out. Thus bias is especially large near a local extremum. See figure 4. Third, increasing the sample size in the RCT, i.e. going farther down the list in each population half and using $\bar{y}_{S(X_1, j)}$ with $j > k/2$ to estimate $\mu_{S_k}$, does not in general decrease bias.

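Finally, combining equation 4 for both $S$ and $T$, we derive a formula for $\delta_{RCT}$, the expected value of the RCT estimate, that is not in general equal to $\delta$. By linearity of expectation we have

$$\delta_{RCT} = \sum_{j=k/2}^{(N+k)/2} P(K = j) \left(\tilde{\mu}_{S_j} - \tilde{\mu}_{T_j}\right) \tag{6}$$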
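The weighted-average formula can be checked numerically on a population small enough to enumerate every half. The sketch below is a hedged illustration: it assumes the expected relative precision is a sum over the possible maximum ranks $j$ of $P(K = j)$ times a reweighted precision $(1 - 2/k)\,\mu_{S_{j-1}} + (2/k)\, y_{s_j}$, and it compares that sum with a brute-force average. The function names and outcome vectors are hypothetical.

```python
from itertools import combinations
from math import comb

def expected_rct_precision(y_ranked, k):
    """Expected precision observed in a random half, as a weighted average
    over the precision curve. y_ranked[j-1] is the outcome at absolute rank j."""
    N, n, m = len(y_ranked), len(y_ranked) // 2, k // 2
    total, expected, hits = comb(N, n), 0.0, 0  # hits = positives at ranks < j
    for j in range(1, n + m + 1):
        if j >= m:
            p_j = comb(j - 1, m - 1) * comb(N - j, n - m) / total
            prec_before = hits / (j - 1) if j > 1 else 0.0
            reweighted = (1 - 1 / m) * prec_before + y_ranked[j - 1] / m
            expected += p_j * reweighted
        hits += y_ranked[j - 1]
    return expected

def brute_force_rct_precision(y_ranked, k):
    """Average the relative-top precision over every possible half."""
    N, m = len(y_ranked), k // 2
    halves = list(combinations(range(N), N // 2))  # each tuple is sorted by rank
    return sum(sum(y_ranked[i] for i in h[:m]) / m for h in halves) / len(halves)

# Hypothetical outcomes of the same 10 units, listed in each method's rank order.
y_S = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
y_T = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
k = 4

true_delta = sum(y_S[:k]) / k - sum(y_T[:k]) / k
expected_rct_delta = expected_rct_precision(y_S, k) - expected_rct_precision(y_T, k)
print(true_delta, expected_rct_delta)
```

For these made-up outcomes the formula and the enumeration agree, while the expected RCT precision of the first method differs from its population precision at $k$.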

4 Survey Design

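In the previous section we showed that the RCT observations are not representative of the subpopulations $S_k$ and $T_k$ of interest. We emphasize thinking of $S_k$ and $T_k$ as subpopulations rather than of $S$ and $T$ as treatments. Their intersection is not necessarily empty. But the units in the intersection are irrelevant to the difference $\delta$:

$$\delta = \mu_{S_k} - \mu_{T_k} = \rho \left(\mu_{S_k \setminus T_k} - \mu_{T_k \setminus S_k}\right) \tag{7}$$

where $\rho = |S_k \setminus T_k| / k$. (The two set differences have equal size because $|S_k| = |T_k| = k$.) We use this to design a survey (see figure 1b):

  1. Use $S$ to select the top $k$ units from the population: $S_k$.

  2. Use $T$ to select the top $k$ units from the population: $T_k$.

  3. Observe outcomes for a random sample $\hat{S}$ of size $k/2$ from $S_k \setminus T_k$.

  4. Observe outcomes for a random sample $\hat{T}$ of size $k/2$ from $T_k \setminus S_k$.

  5. Estimate $\hat{\delta} = \rho \left(\bar{y}_{\hat{S}} - \bar{y}_{\hat{T}}\right)$.

The above discussion and the fact that $\hat{S}$ and $\hat{T}$ are random samples of $S_k \setminus T_k$ and $T_k \setminus S_k$, respectively, implies that $\hat{\delta}$ is an unbiased estimator of $\delta$:

$$E[\hat{\delta}] = \rho \left(E[\bar{y}_{\hat{S}}] - E[\bar{y}_{\hat{T}}]\right) = \rho \left(\mu_{S_k \setminus T_k} - \mu_{T_k \setminus S_k}\right) = \delta$$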

Note that if the goal of the trial is only to estimate $\delta$, statistical power is maximized by allocating no observations to the intersection as specified above. On the other hand, to estimate absolute quantities $\mu_{S_k}$ and $\mu_{T_k}$, the intersection should be sampled as well. Efficiency could further be increased by stratifying the survey (e.g. across neighborhoods).
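The survey design can be sketched as follows. The code assumes the estimate is the difference in sample means between the two set differences, scaled by the share of each target set that lies outside the intersection. When a set difference has fewer than $k/2$ units the sketch simply observes all of them, which is an assumption of the example. The name `survey_estimate` and the data are hypothetical.

```python
import random

def survey_estimate(S_k, T_k, y, rng):
    """One draw of the survey estimate of the difference in precision."""
    k = len(S_k)
    only_S = sorted(set(S_k) - set(T_k))
    only_T = sorted(set(T_k) - set(S_k))
    if not only_S:                    # identical target sets: difference is 0
        return 0.0
    n = min(k // 2, len(only_S))      # sample size per arm
    samp_S = rng.sample(only_S, n)
    samp_T = rng.sample(only_T, n)
    rho = len(only_S) / k             # share of targets outside the intersection
    return rho * (sum(y[u] for u in samp_S) - sum(y[u] for u in samp_T)) / n

# Hypothetical population of 12 units; multiples of 3 are positive.
y = {u: int(u % 3 == 0) for u in range(12)}
S_k = [0, 3, 6, 9, 1, 2]  # hypothetical top 6 under S
T_k = [0, 3, 4, 5, 7, 8]  # hypothetical top 6 under T

true_delta = sum(y[u] for u in S_k) / 6 - sum(y[u] for u in T_k) / 6
rng = random.Random(0)
draws = [survey_estimate(S_k, T_k, y, rng) for _ in range(5000)]
print(true_delta, sum(draws) / len(draws))
```

The average of the simulated estimates should sit close to the true difference, in contrast with the RCT design.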

Above we have focused exclusively on observation outcomes. When the targets receive an intervention, we may simply use an RCT (under the usual Stable Unit Treatment Value Assumption, SUTVA; Rubin, 1980) within each of the subpopulations $S_k \setminus T_k$ and $T_k \setminus S_k$. Then we would estimate the difference in treatment effects $\delta$, where $y$ is now the unit-level treatment effect $y^1 - y^0$ in a potential outcomes framework.

Policymakers may be uncertain about the resources $k$ for the targeted policy. It is possible for $S$ to outperform $T$ in the top $k$ but not in the top $k'$ for some $k' \ne k$. Thus, it may be useful to select $k$ as an upper bound and sample $S_k$ and $T_k$ with some stratification to compare the precision of $S$ and $T$ on each risk stratum.

Of course, the usual concerns about generalizability (to other time periods, other populations, etc.) apply as well to this field trial design.

5 Application

Figure 4: True precision and expected RCT estimate for $k$ up to 250 in §5.

In Potash et al. (2015) we developed a machine learning model to predict which children are at highest risk of lead poisoning using historical blood lead levels, building characteristics, and other data. In this section, we compare the RCT and survey for a field trial to estimate the improvement in precision (i.e. proportion of investigations finding hazards) of targeting investigations using the predictive model $S$ over random selection $T$.

In this application the population $X$ is a Chicago birth cohort, restricted to children residing in homes built before 1978, the year in which lead-based residential paint was banned (U.S. CPSC, 1977). To simulate field trial results we need rankings of children and outcomes of investigations. We take these from Potash et al. (2015, §7), which evaluated out-of-sample predictions on the 2011 birth cohort. Since proactive investigations were not performed, lead hazard outcomes were not available for the population. Instead, blood lead level outcomes were used as a proxy. The resulting precision curve is reproduced in figure 2.

Using equation 4 we calculated $E[\bar{y}_{S(X_1, k/2)}]$, the expected RCT estimate of the precision of the predictive model, for $k$ up to 250. These are plotted together with the true precision $\mu_{S_k}$ in figure 4. Their difference is the bias of the estimate and is a function of the shape of the precision curve near $k$. This bias, as a percentage of the true value, varies in this range between -11% and 9% with an average magnitude of 2%.

             RCT      Survey
mean         0.291    0.282
std. dev.    0.064    0.063
bias         0.009    0

Figure 5: Sampling distribution and summary for estimates from the two designs in §5.

Next we estimated full sampling distributions for the RCT and survey designs at a fixed $k$. For the RCT, we used Monte Carlo simulation: we calculated $\hat{\delta}_{RCT}$ over random partitions. The mean of this empirical distribution agreed to five significant figures with the value of $\delta_{RCT}$ that we derived in equation 6. To estimate the survey results, we computed the distribution of $\hat{\delta}$ exactly using hypergeometric formulas.
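An exact survey distribution of this kind can be sketched with two independent hypergeometric counts, one per arm. The code below is an illustration under assumed inputs: `M` is the size of each set difference, `pos_S` and `pos_T` are the numbers of positives in them, `n` is the per-arm sample size, and `k` is the size of each target set. All numeric values are hypothetical and are not taken from the application.

```python
from math import comb, sqrt

def hypergeom_pmf(a, M, A, n):
    """P(a positives in a sample of n drawn without replacement
    from M units of which A are positive)."""
    return comb(A, a) * comb(M - A, n - a) / comb(M, n)

def survey_distribution(M, pos_S, pos_T, n, k):
    """Exact distribution of the scaled difference in sample means."""
    rho = M / k  # share of each target set outside the intersection
    dist = {}
    for a in range(n + 1):
        for b in range(n + 1):
            value = rho * (a - b) / n
            prob = hypergeom_pmf(a, M, pos_S, n) * hypergeom_pmf(b, M, pos_T, n)
            dist[value] = dist.get(value, 0.0) + prob
    return dist

# Hypothetical inputs: k = 100 targets per method, 40 of them shared.
dist = survey_distribution(M=60, pos_S=24, pos_T=12, n=50, k=100)
mean = sum(v * p for v, p in dist.items())
std = sqrt(sum(p * (v - mean) ** 2 for v, p in dist.items()))
true_delta = (24 - 12) / 100
print(mean, std, true_delta)
```

With these inputs the mean of the exact distribution matches the true difference, which is what unbiasedness of the survey estimate requires.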

The resulting distributions are displayed and summarized in figure 5. We find that the RCT is biased to overestimate $\delta$ by 3%. It also has higher variance than the survey, which is unbiased as expected. As discussed in §3 and illustrated in figure 4, the direction and magnitude of the RCT bias stem from the shape of the precision curve near $k$. In light of these results, we advised the Chicago Department of Public Health to use the survey design for a field trial of a targeted lead investigations policy.

6 Acknowledgements

We thank Dan Black, Chris Blattman, Matt Gee, Jesse Naidoo, and the anonymous reviewer for feedback on the manuscript. Thanks for useful discussions to members of the Center for Data Science and Public Policy (DSaPP) and the Energy Policy Institute at Chicago (EPIC).

References

  • Baeza-Yates et al. (1999) Baeza-Yates, R., Ribeiro-Neto, B., et al. (1999). Modern information retrieval, volume 463. ACM press New York.
  • Chandler et al. (2011) Chandler, D., Levitt, S. D., and List, J. A. (2011). Predicting and preventing shootings among at-risk youth. The American Economic Review, 101(3):288–292.
  • Deaton and Cartwright (2016) Deaton, A. and Cartwright, N. (2016). Understanding and misunderstanding randomized controlled trials. Technical report, National Bureau of Economic Research.
  • Heckman and Smith (1995) Heckman, J. J. and Smith, J. A. (1995). Assessing the case for social experiments. The Journal of Economic Perspectives, 9(2):85–110.
  • Kang et al. (2013) Kang, J. S., Kuznetsova, P., Luca, M., and Choi, Y. (2013). Where not to eat? Improving public policy by predicting hygiene inspections using online reviews. In Empirical Methods in Natural Language Processing, pages 1443–1448.
  • Kleinberg et al. (2015) Kleinberg, J., Ludwig, J., Mullainathan, S., and Obermeyer, Z. (2015). Prediction policy problems. The American Economic Review, 105(5):491–495.
  • Mandrekar and Sargent (2009) Mandrekar, S. J. and Sargent, D. J. (2009). Clinical trial designs for predictive biomarker validation: theoretical considerations and practical challenges. Journal of Clinical Oncology, 27(24):4027–4034.
  • Potash et al. (2015) Potash, E., Brew, J., Loewi, A., Majumdar, S., Reece, A., Walsh, J., Rozier, E., Jorgenson, E., Mansour, R., and Ghani, R. (2015). Predictive modeling for public health: Preventing childhood lead poisoning. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2039–2047. ACM.
  • Rubin (1980) Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593.
  • Sianesi (2017) Sianesi, B. (2017). Evidence of randomisation bias in a large-scale social experiment: The case of ERA. Journal of Econometrics, 198(1):41–64.
  • U.S. CPSC (1977) U.S. CPSC (1977). Ban of lead-containing paint and certain consumer products bearing lead-containing paint. 16 CFR 1303. Fed Reg 42:44199.

Appendix A

A.1 Relative Top

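Recall that $K$ is the maximum absolute rank in $S(X_1, k/2)$. We can write the event $K = j$ as the intersection of two simpler events: of the absolute top $j - 1$ exactly $k/2 - 1$ are in $X_1$, i.e. $|S_{j-1} \cap X_1| = k/2 - 1$; and $s_j$ is in $X_1$. Denoting these events by $E_1$ and $E_2$, respectively, we derive the distribution of $K$ induced by the distribution on partitions:

$$P(K = j) = P(E_1)\, P(E_2 \mid E_1) = \mathrm{HG}(k/2 - 1;\, N,\, j - 1,\, N/2) \cdot \frac{N/2 - k/2 + 1}{N - j + 1} = \binom{j-1}{k/2-1} \binom{N-j}{N/2-k/2} \Big/ \binom{N}{N/2} \tag{8}$$

where $\mathrm{HG}(a; N, M, n)$ is the hypergeometric distribution, i.e. the probability that a uniformly random subset of $n$ out of $N$ units contains exactly $a$ of $M$ marked units. The second factor arises because, given $E_1$, the half $X_1$ still draws $N/2 - k/2 + 1$ units uniformly from the $N - j + 1$ units ranked $j$ or lower.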

To find the mode of this distribution we calculate the change between consecutive probabilities:

(9)

It follows that

(10)

Since is an integer, is even, and , equality in A.1 implies . We conclude that is unimodal with mode at and probability increasing until that point. If then the probability is decreasing after it.
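A sketch of how the mode can be located, in notation assumed here: $K$ is the maximum absolute rank in the relative top, $N$ is the population size, $k$ is the number of observations, and $k \le N/2$. The binomial form of $P(K = j)$ below is the standard distribution of the $k/2$-th smallest rank in a uniformly random half. Treating it as the distribution in question is an assumption of this sketch.

```latex
P(K = j) = \binom{j-1}{k/2-1} \binom{N-j}{N/2-k/2} \Big/ \binom{N}{N/2}

\frac{P(K = j+1)}{P(K = j)}
  = \frac{j}{j - k/2 + 1} \cdot \frac{N/2 + k/2 - j}{N - j}

\frac{P(K = j+1)}{P(K = j)} \ge 1
  \iff j\,(N - 2) \le N\,(k - 2)
  \iff j \le k - 2 + \frac{2(k-2)}{N-2}
```

When $k \le N/2$ the fractional term is smaller than one, so under these assumptions the probabilities increase up to $j = k - 1$ and decrease afterwards.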

A.2 Bias

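We can write the relative precision at $k/2$ as a weighted average of the precision of all but the last unit with the outcome of the last unit, which by definition has rank $K$ in the population:

$$\bar{y}_{S(X_1, k/2)} = \left(1 - \frac{2}{k}\right) \bar{y}_{S(X_1, k/2 - 1)} + \frac{2}{k}\, y_{s_K} \tag{11}$$

Then the expectation (over partitions) of the relative precision at $k/2$ can be computed conditional on $K$. Given $K = j$, the first $k/2 - 1$ units of the relative top are a uniformly random subset of the absolute top $S_{j-1}$, so their expected precision is $\mu_{S_{j-1}}$:

$$E\left[\bar{y}_{S(X_1, k/2)} \mid K = j\right] = \left(1 - \frac{2}{k}\right) \mu_{S_{j-1}} + \frac{2}{k}\, y_{s_j} \tag{12}$$

We define $\tilde{\mu}_{S_j}$ to be this quantity, which is the precision at $j$ reweighted. Marginalizing over $K$, the unconditional expected precision of $S$ in the RCT is

$$E\left[\bar{y}_{S(X_1, k/2)}\right] = \sum_{j=k/2}^{(N+k)/2} P(K = j)\, \tilde{\mu}_{S_j} \tag{13}$$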