A Robust, Differentially Private Randomized Experiment for Evaluating Online Educational Programs With Sensitive Student Data

by Manjusha Kancharla, et al.

Randomized control trials (RCTs) have been the gold standard to evaluate the effectiveness of a program, policy, or treatment on an outcome of interest. However, many RCTs assume that study participants are willing to share their (potentially sensitive) data, specifically their response to treatment. This assumption, while trivial at first, is becoming difficult to satisfy in the modern era, especially in online settings where there are more regulations to protect individuals' data. The paper presents a new, simple experimental design that is differentially private, one of the strongest notions of data privacy. Also, using works on noncompliance in experimental psychology, we show that our design is robust against "adversarial" participants who may distrust investigators with their personal data and provide contaminated responses to intentionally bias the results of the experiment. Under our new design, we propose unbiased and asymptotically Normal estimators for the average treatment effect. We also present a doubly robust, covariate-adjusted estimator that uses pre-treatment covariates (if available) to improve efficiency. We conclude by using the proposed experimental design to evaluate the effectiveness of online statistics courses at the University of Wisconsin-Madison during the Spring 2021 semester, where many classes were online due to COVID-19.




1. Introduction

1.1 Motivation: Balancing Privacy and Scientific Inquiry in Program Evaluation Using Randomized Experiments

The gold standard to evaluate the effectiveness of a program, policy, or treatment on an outcome of interest is a randomized control trial (RCT). In its purest form, an investigator randomly assigns each individual i, i = 1, ..., n, to treatment, denoted as Z_i = 1, or control, denoted as Z_i = 0, and observes his/her outcome Y_i. Because the treatment was randomized, the treated and control groups are similar in both their unobserved and observed characteristics and thus, taking the difference of the average outcomes from the two groups yields an unbiased estimate of the average treatment effect (ATE). A bit more formally, a RCT satisfies strong ignorability [Rosenbaum and Rubin, 1983] and the ATE can be identified from observed data; see Imbens and Rubin [2015] and Hernan and Robins [2020] for textbook discussions.

In addition to strong ignorability, one subtle, yet important assumption underlying RCTs is that after treatment is randomly assigned, individuals in the study share their responses/outcomes with the investigator. This assumption, almost axiomatic in a RCT, is becoming less plausible in the modern era, especially in online settings where there are increasing regulations to protect users’ privacy. For example, Bond et al. [2012] ran an online randomized experiment among 61 million Facebook users and collected their voting behaviors as the primary outcome; this experiment attracted controversy due, in part, to concerns over data privacy [Benbunan-Fich, 2017]. In RCT-based evaluations of educational programs, investigators often collect sensitive data on students’ performance, say test scores, probation status, and class rank, as their primary outcomes [Aaronson et al., 2007, Wilson et al., 2018].

Also, from a regulatory perspective, the European Union enacted the General Data Protection Regulation (GDPR), which limits the information that websites can collect from users [GDPR, 2018]. In 2020, California started enforcing the California Consumer Privacy Act, which, like GDPR, guarantees certain privacy rights for online user data. The main theme of this work is to explore how to guarantee individual data privacy while still being able to use RCT-based evaluation of programs, especially those in online educational settings.

1.2 Review of Existing Approaches to Data Privacy in RCTs

A popular approach to data privacy in clinical trials and A/B testing in online settings is to lock up the user data after completing the experiment and only report summary statistics for scientific dissemination (e.g., Wilson et al. [2018]). While this approach leans toward guaranteeing privacy, a major downside is that independent replication of the original analysis is difficult, if not impossible. For instance, it is generally difficult for a future, external investigator to plot the data, run diagnostics, or fit competing or new models based only on summary statistics from the original experiment. Relatedly, it would be difficult for future investigators to merge the original experimental data with new experimental data to build more complex models, or to boost statistical power, especially for heterogeneous treatment effects. Even if these difficulties are deemed tolerable, this approach rests on the assumption that subjects are willing to give up their personal information to the investigators in the first place, and that investigators will keep the data safe, in perpetuity.

Another popular approach is based on data anonymization, where identifiable information is removed, aggregated, or anonymized before sharing the “de-identified” data with the public. Some examples include removing any protected health information (PHI) to be compliant with the Health Insurance Portability and Accountability Act (HIPAA) [Annas, 2003], and using k-anonymity [Samarati, 2001, Sweeney, 2002], l-diversity [Machanavajjhala et al., 2006], or t-closeness [Li et al., 2007]. While these methods are an improvement over the previous approach in terms of replicability and data sharing, it has been shown that many popular de-identification methods are not sufficient to guarantee privacy. For example, Sweeney [2000] linked de-identified patient-specific health data to voter registration records using variables such as ZIP code, birth date, and gender, and observed that 87% of the U.S. population can be uniquely identified from such variables. Also, Narayanan and Shmatikov [2008] linked the Netflix Prize dataset containing anonymized movie ratings of 500,000 Netflix subscribers to the Internet Movie Database (IMDb), allowing re-identification of users on Netflix; this led to the discontinuation of the Netflix Prize in 2010 [Hunt, 2010].

The approach to data privacy we use in this work is differential privacy [Evfimievski et al., 2003, Dwork, 2006, Dwork and Smith, 2010], specifically local differential privacy; see Duchi et al. [2018] and references therein. Broadly speaking, differential privacy is a mathematical definition of privacy under which nearly identical statistics, say the sample mean or the p-value from a hypothesis test, are computed from a dataset regardless of whether any one individual is present or absent in the dataset; see Section 2.2 for details. Differential privacy is considered to be the strongest form of data privacy in that if an adversary were to obtain differentially private data, it is, up to a privacy loss value ε, impossible to re-identify any individual in the data. Due to these strong privacy guarantees, differential privacy is used by Google’s Chrome browser [Erlingsson et al., 2014] and Apple’s mobile iOS platform [Apple, 2019] to protect their users’ privacy while enabling the development of novel machine learning methods and statistical analysis.

1.3 Our Contributions

Our main contribution is to propose a simple, robust RCT that guarantees local differential privacy while allowing investigators to estimate treatment effects. Specifically, similar to a typical RCT, we assume that the investigator randomly assigns treatment to individual i and therefore, the treatment value Z_i is known to the investigator. But, unlike a typical RCT, we use randomized response techniques originally from Warner [1965] to collect differentially private outcome data from individual i, denoted as R_i, instead of the sensitive, “true” outcome, denoted as Y_i. That is, the investigator only sees a privatized response R_i along with the treatment assignment Z_i to estimate the treatment effect on the sensitive/true outcome Y_i; in contrast, a typical RCT allows investigators to see both the sensitive/true outcome Y_i and the treatment assignment Z_i to estimate the treatment effect on Y_i.

A key innovation that distinguishes our proposed experimental design from a straightforward application of existing differential privacy techniques to RCTs is that we allow the privatized response to be “adversarial.” More concretely, unbeknownst to the investigator, some participants may provide “adversarial” data to further mask their identity, say by providing a completely random value as R_i that deviates from the experimental protocol. Our proposed design allows responses from such participants, which we broadly call “cheaters,” and even if their identity is unknown to the investigator, their responses will not harm estimation and inference of treatment effects. In relation to works in differential privacy, a cheater represents, in a loose sense, an “imperfect” implementation of a differentially private algorithm, where a database/central entity holding the private data may not faithfully execute the privacy-preserving algorithm, say because the entity added the wrong noise to the private data or forgot to add any noise at all. Our work shows how to still obtain relevant statistics of interest even if the differentially private algorithm is imperfectly implemented. We achieve this by using a simple idea based on sample splitting and noncompliance in psychometric testing [Clark and Desharnais, 1998], where we apply two slightly different differentially private algorithms to two random subgroups of participants and re-weigh the outputs from the two algorithms via inverse probability weights to remove bias arising from cheaters; see Section 3.3 for details. Also, in relation to works in psychometric testing, our work extends Clark and Desharnais [1998] to allow for arbitrary types of cheaters and differential privacy; see Section 2.3 for details.

Once we have data from the proposed design, we propose two consistent estimators. The first estimator is essentially a difference-in-means estimator weighted by the proportion of non-cheaters and is similar, in form, to estimators of the local average treatment effect in the noncompliance literature [Angrist et al., 1996]. The second estimator is a doubly robust, covariate-adjusted estimator that uses pre-treatment covariates, if available, to improve efficiency. We also compare our design to a typical RCT that collects the true, sensitive outcome Y_i and assess the trade-off between statistical efficiency and data privacy.

Finally, the proposed experimental design is used to evaluate online statistics courses at the University of Wisconsin-Madison. Specifically, during the Spring of 2021, when most classroom instruction went online due to COVID-19, students participated in an evaluation of the impact of instructors being present in online lecture videos on learning outcomes. Similar to prior works in this area [Kizilcec et al., 2014, Pi and Hong, 2016, Wilson et al., 2018, Wang et al., 2020], we find that instructor-present video lectures improved students’ attention among non-cheaters. Critically, unlike these prior works, the sensitive learning outcomes from the students are guaranteed to be differentially private to any investigator (including those who actually conducted the evaluation). In fact, the proposed design received approval from the Education and Social/Behavioral Science Institutional Review Board (IRB) of the University of Wisconsin-Madison to release this data to the public for future replication; the protocol met the criteria for exempt human subjects in accordance with categories 1 and 3 as defined under 45 CFR 46 [Office of the Federal Register and Administration, 2005].

2. Setup, Review, and Definition of Cheaters

2.1 Review: Notation, Potential Outcomes, and RCTs

We review the potential outcomes notation used to define treatment effects [Neyman, 1923, Rubin, 1974]. For each individual i = 1, ..., n, let Z_i denote the binary treatment assignment, with Z_i = 1 denoting treatment and Z_i = 0 denoting control. Let Y_i(z) denote the potential outcome of individual i under treatment assignment value z and let Y denote the support of the outcome. For simplicity, we consider Y_i(z) to take on binary values, Y = {0, 1}. In Section 3.2, we show that as long as Y is restricted so that differential privacy is well-defined (see Chapter 2.3 of Dwork and Roth [2014]), our proposed experimental design remains differentially private. Finally, let Y_i = Y_i(Z_i) and X_i be the observed outcome and pre-treatment covariates, respectively, for individual i.

The estimand of interest is the average treatment effect (ATE), defined as τ = E[Y_i(1) − Y_i(0)]. To identify τ, the following assumptions are usually made; here, we write the assumptions under a RCT, but the interested reader is referred to Imbens and Rubin [2015] and Hernan and Robins [2020] for identification strategies outside of RCTs.

  • (A1) Stable unit treatment value assumption [Rubin, 1980]: Y_i = Y_i(Z_i), there is only one version of the treatment, and there is no interference between individuals.

  • (A2) Complete randomization of treatment: (Y_i(1), Y_i(0), X_i) is independent of Z_i.

  • (A3) Overlap: There exists 0 < π < 1, where P(Z_i = 1) = π.

Briefly, assumption (A1) states that there are no different versions of the treatment and the treatment of one individual does not impact the potential outcome of another individual. Assumption (A2) states that the treatment is completely randomized and assumption (A3) states that there is a non-zero probability that each individual is assigned to either treatment or control. Assumptions (A1)-(A3) are usually satisfied by a RCT and consequently, τ can be identified by the well-known difference-in-means formula, i.e., τ = E[Y_i | Z_i = 1] − E[Y_i | Z_i = 0]. In particular, we review two consistent estimators of the ATE in RCTs, the difference-in-means estimator and the covariate-adjusted, doubly robust estimator.


The difference-in-means estimator and the doubly robust estimator are, respectively,

τ̂_DM = (1/n_1) Σ_{i: Z_i = 1} Y_i − (1/n_0) Σ_{i: Z_i = 0} Y_i,

τ̂_DR = (1/n) Σ_i [ Z_i (Y_i − μ̂_1(X_i)) / π̂ − (1 − Z_i)(Y_i − μ̂_0(X_i)) / (1 − π̂) + μ̂_1(X_i) − μ̂_0(X_i) ],

where n_1 and n_0 are the numbers of treated and control individuals, π̂ = n_1/n, and μ̂_1(X_i) and μ̂_0(X_i) are outcome regression models for the treatment and control groups, respectively. The difference-in-means estimator τ̂_DM does not require an outcome model and does not adjust for covariates X_i. The doubly robust estimator τ̂_DR is an augmented version of the difference-in-means estimator that adjusts for covariates and, in a randomized experiment, is consistent even if μ̂_1 or μ̂_0 is mis-specified. But, if μ̂_1 and μ̂_0 are correctly specified, τ̂_DR is more efficient than τ̂_DM. Finally, both estimators are asymptotically Normal with mean τ and their standard errors can usually be estimated by a sandwich formula or the bootstrap [Efron and Tibshirani, 1994]. For additional discussions, see Lunceford and Davidian [2004], Zhang et al. [2008], Tsiatis et al. [2008], and Bang and Robins [2005].
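As an illustration of these two estimators, the following sketch (not from the paper; all parameter values and the linear working models are illustrative assumptions) simulates a simple RCT with a binary outcome and one covariate, then computes the difference-in-means estimator and a doubly robust (AIPW-style) estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated RCT with a binary outcome and one covariate (all values illustrative).
x = rng.normal(size=n)
z = rng.binomial(1, 0.5, size=n)                       # completely randomized treatment
prob_treat = 1 / (1 + np.exp(-(0.2 + 0.5 * x)))        # P(Y(1) = 1 | x)
prob_ctrl = 1 / (1 + np.exp(-(-0.6 + 0.5 * x)))        # P(Y(0) = 1 | x)
y = np.where(z == 1, rng.binomial(1, prob_treat), rng.binomial(1, prob_ctrl))

# Difference-in-means estimator: no outcome model, no covariate adjustment.
tau_dm = y[z == 1].mean() - y[z == 0].mean()

# Doubly robust (AIPW) estimator with simple linear working models. In a
# randomized experiment it stays consistent even if the models are mis-specified.
def fit_linear(xs, ys):
    X = np.column_stack([np.ones_like(xs), xs])
    beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return lambda xnew: np.column_stack([np.ones_like(xnew), xnew]) @ beta

mu1 = fit_linear(x[z == 1], y[z == 1])                 # outcome model, treated arm
mu0 = fit_linear(x[z == 0], y[z == 0])                 # outcome model, control arm
pz = z.mean()                                          # estimated assignment probability
tau_dr = np.mean(z * (y - mu1(x)) / pz
                 - (1 - z) * (y - mu0(x)) / (1 - pz)
                 + mu1(x) - mu0(x))
```

Both estimates should land near the true ATE of this simulation (roughly 0.19); when the covariate is predictive of the outcome, the doubly robust version typically has a smaller standard error.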

While these estimators have desirable statistical properties, from a data privacy perspective they require individuals’ responses Y_i. If individuals are unwilling or apprehensive about sharing their responses to treatment, or if they share a dishonest response due to reservations about their data privacy, the estimators may no longer be consistent. The next few sections present a formal way to “privatize” Y_i and show how to use this privatized response to identify and estimate treatment effects on Y_i. Also, while we could privatize the pre-treatment covariates X_i in a similar fashion as the response, we leave them to take on arbitrary values since we can identify the causal effect without X_i. Instead, we primarily use X_i to gain efficiency; see Sections 3.3 and 3.6 for details.

2.2 Privatizing Responses with Differential Privacy and Forced Randomized Response (FRR)

As before, let Y_i be the original response that individual i wishes to keep private and let R_i be the “privatized” version of the original response that the investigator actually collects from an experiment. The privatized response is generated by a “privatizing” function f, where f takes the original response Y_i as the input and returns R_i as the output. Some trivial examples of f include: (1) a constant function where f(y) = c for some constant c and any y, and (2) an identity function where f(y) = y for any y. Intuitively, the first example is more privacy-preserving than the second in that the constant function always produces the same value for every individual, making it impossible for an investigator to recover the original, sensitive response from R_i. But the first example’s f makes the ATE unidentifiable since everyone in the study has the same outcome, irrespective of the treatment. In contrast, in the second example with the identity function, identification of the ATE is possible, but f is not privacy-preserving since the investigator sees the original, sensitive response of individual i, i.e., R_i = Y_i.

More broadly, Dwork and Roth [2014] argued that most non-randomized functions are inadequate to simultaneously preserve privacy and allow estimation; consequently, Dwork [2006] proposed differential privacy, a family of non-deterministic f’s that take an input and stochastically generate an output. For example, if the input is individual i’s original response Y_i, f may return Y_i plus some stochastic noise, say a random value from a Laplace distribution. The investigator specifies f, carefully choosing how “random” f should be in order to balance the need for data privacy against the need to estimate scientifically meaningful quantities; too much randomness would mean that the ATE becomes “less identifiable” while too little randomness would mean that the individual’s data is less private. The amount of this randomness is measured by the privacy loss parameter ε, ε ≥ 0, and we say a random map f is ε-differentially private if changing the input to f only changes its output distribution by a factor based on ε; see Definition 2.1.

Definition 2.1 (Differential privacy [Dwork and Roth, 2014]).

A randomized map f is ε-differentially private for ε ≥ 0 if, for all pairs y, y' in Y with y ≠ y' and for any measurable set A in the range of f, we have P(f(y) ∈ A) ≤ exp(ε) P(f(y') ∈ A).

Lower values of ε imply that f becomes more privacy-preserving; in the extreme case where ε = 0 and there is no loss in privacy, the output from f is statistically indistinguishable for any pair of inputs y and y', and estimation of treatment effects would be impossible. An example of such an f would be a fair coin toss where, regardless of the original input y, P(f(y) = 1) = P(f(y) = 0) = 1/2, i.e., individual i’s privatized response would be the result of a coin toss. On the other hand, by setting ε > 0, f allows some “signal” from the private outcome to be passed onto the privatized outcome so that treatment effects can be estimated; for values of ε used in practice, see Apple [2019], Qin et al. [2016], and Fanti et al. [2015]. We also remark that unlike the usual definition of differential privacy, which considers privacy of databases containing records from multiple individuals, Definition 2.1 is specific to a single entry in a database and as such, is a local definition of differential privacy [Dwork and Roth, 2014, Duchi et al., 2018].
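To make this definition concrete for binary responses, the following sketch (illustrative, not from the paper) computes the privacy loss ε of a binary randomized map directly from its conditional report probabilities. The fair coin achieves ε = 0, the identity map has unbounded privacy loss, and a classic Warner-style randomized response that reports the truth with probability 3/4 and flips it with probability 1/4 achieves ε = ln 3:

```python
import math

def privacy_loss(cond):
    """Privacy loss epsilon of a binary randomized map f, where
    cond[y][r] = P(f(y) = r). By the definition, epsilon is the smallest
    value with P(f(y) = r) <= exp(epsilon) * P(f(y') = r) for all y, y', r;
    it is infinite when some ratio is unbounded."""
    eps = 0.0
    for r in (0, 1):
        for y in (0, 1):
            for y2 in (0, 1):
                p, q = cond[y][r], cond[y2][r]
                if p == 0:
                    continue            # a zero numerator never binds
                if q == 0:
                    return math.inf     # unbounded ratio: no finite epsilon
                eps = max(eps, math.log(p / q))
    return eps

fair_coin = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}   # ignores the input
identity  = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}   # reveals the input
warner_rr = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.25, 1: 0.75}}  # truth w.p. 3/4

print(privacy_loss(fair_coin))   # 0.0: perfectly private, but no signal
print(privacy_loss(identity))    # inf: no privacy at all
print(privacy_loss(warner_rr))   # ln 3, approximately 1.0986
```

The three maps trace out the trade-off described above: as ε grows from 0 to infinity, the map passes more signal about the true response and preserves less privacy.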

The privatizing function f that we will use in our proposed design is based on the forced randomized response (FRR) of Fox and Tracy [1984]. Broadly speaking, in an FRR, individuals use a randomization device, typically a six-sided die, and based on the result of the randomization, some are instructed or “forced” to give a specific type of response, say 0 or 1, regardless of their original response Y_i, while others are instructed to give their original response. For example, if individual i’s die roll is a 1, individual i is instructed to report a 0 to the investigator, i.e., R_i = 0, and if the die roll is a 6, individual i is instructed to report a 1, i.e., R_i = 1. If the die roll is anything but 1 or 6, individual i is instructed to provide the original, true response, i.e., R_i = Y_i. A bit more formally, if D_i ∈ {0, 1, 2} represents the different instructions, where D_i = 0 represents ‘report 0’, D_i = 1 represents ‘report 1’, and D_i = 2 represents ‘report original response’, with probabilities p_0, p_1, and p_2 = 1 − p_0 − p_1, respectively, an FRR map is defined as f(Y_i) = 0 if D_i = 0, f(Y_i) = 1 if D_i = 1, and f(Y_i) = Y_i if D_i = 2.

A critical part of a FRR is that investigators are unaware of the result of the die roll; only the participant knows it. Hence, investigators have no idea if R_i from individual i is the true Y_i or one of the forced responses 0 or 1. To put it differently, a FRR protects the privacy of individual i’s response by plausible deniability of any particular response. Nevertheless, investigators can choose the die probabilities p_0 and p_1, and these values affect both the privacy loss and the efficiency of the resulting estimators; in Section 3.2, we present a formula relating the FRR parameters to the privacy loss parameter ε in differential privacy. For simplicity, going forward we collapse these probabilities into a single parameter p = (p_0, p_1) parametrizing the privacy-preserving f.
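The die-based FRR above can be sketched as a simulation (illustrative; the fair six-sided die from the example in the text, so the forced-0 and forced-1 probabilities are each 1/6). The empirical report frequencies show why any single report is plausibly deniable, since both reports occur with substantial probability under either true answer:

```python
import numpy as np

rng = np.random.default_rng(1)

def frr(y, rng, p_force0=1/6, p_force1=1/6):
    """One forced-randomized-response report for a true answer y in {0, 1}."""
    u = rng.random()
    if u < p_force0:
        return 0                     # die shows 1: forced to report 0
    if u < p_force0 + p_force1:
        return 1                     # die shows 6: forced to report 1
    return y                         # any other roll: report the true response

# Empirically check the report distribution given each true answer.
n = 200_000
r_given_y0 = np.array([frr(0, rng) for _ in range(n)])
r_given_y1 = np.array([frr(1, rng) for _ in range(n)])

# P(report 1 | truth 0) = 1/6 and P(report 1 | truth 1) = 1/6 + 4/6 = 5/6.
print(r_given_y0.mean())   # ~0.167
print(r_given_y1.mean())   # ~0.833
```

A reported ‘1’ is thus consistent with both true answers, which is exactly the plausible deniability the FRR provides.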

2.3 Noncompliance to Experimental Protocol and Cheaters

Suppose an individual is generally wary of sharing their response to treatment and does not trust the privacy-preserving nature of f. For example, a participant may feel that the die in a FRR is rigged and produce a response that deviates from the original FRR protocol. Or, a participant, after being instructed by the die roll to report the true response, may feel uncomfortable doing so and instead report the opposite, say R_i = 1 − Y_i, to the investigator. These are instances of noncompliance to the experimental protocol, and our goal is to still estimate causal effects in their presence.

To achieve this goal, we use the concept of “cheaters” from Clark and Desharnais [1998] in psychometric testing. Broadly speaking, cheaters are those who deviate from the experimental protocol laid out by investigators, whereas non-cheaters/honest individuals are those who follow the experimental protocol. For example, an individual who reports ‘0’ (i.e., R_i = 0) even though the FRR prompt was to report ‘1’ is a cheater; see Table 1 in the supplementary materials for more examples. Clark and Desharnais [1998] showed that if all cheaters are assumed to generate the same response to the investigators, say all cheaters only report 0, we can detect the presence of cheaters by (a) randomly splitting the study sample into two pieces, say n individuals split into samples of size n_1 and n_2 with n = n_1 + n_2, and (b) comparing appropriate statistics between the two subsamples. Our work extends this idea of using sample splitting to detect cheaters by relaxing the requirement that all cheaters generate the same response. We achieve this by using a different statistic to compare between the two subsamples, namely a variant of the compliance rate in instrumental variables [Angrist et al., 1996]; see Section 3.3 for details.

3. Robust, Private Randomized Controlled Trial

3.1 Protocol

Combining the aforementioned ideas on differential privacy and cheaters, we propose a robust and differentially private RCT, which we call a RP-RCT and which is laid out in Figure 1. In a RP-RCT, the investigator specifies (i) a treatment assignment probability π that is away from 0 and 1 and (ii) two slightly different FRR maps parametrized by p^(1) and p^(2), p^(1) ≠ p^(2). The output of a RP-RCT is, for each individual i, (a) the treatment assignment Z_i, (b) the privatized response R_i, which may be contaminated due to cheaters, and (c) an indicator S_i of which of the two sub-samples individual i belonged to or, equivalently, which FRR they were assigned to. The investigator does not observe the original response Y_i, the cheating status of individual i, denoted as C_i where C_i = 1 if i is a cheater and C_i = 0 otherwise, or the result D_i of the randomization device behind the FRR; these are denoted as unobserved variables in Figure 1. In particular, the variable C_i can be thought of as a latent characteristic of individual i that is never observed by the investigator, and the investigator only sees R_i without knowing whether it came from a cheater (i.e., C_i = 1) or not (i.e., C_i = 0).

Figure 1: Protocol for a Robust, Differentially Private Randomized Controlled Trial (RP-RCT) (left) Versus a Traditional RCT (right). Compared to a RCT, a RP-RCT adds two additional steps, sample splitting and FRR. Sample splitting is used to make our design robust to responses from cheaters and FRR is used to privatize responses.

We make some remarks about the experimental protocol. First, compared to a traditional RCT, a RP-RCT adds two additional steps: the sample splitting step to be robust against cheaters and the FRR step to privatize the study participant’s response. Here, for simplicity, the sample splitting step creates two equal, random sub-samples of individuals, but this can be relaxed at the expense of additional notation. Second, a RP-RCT satisfies assumptions (A2) and (A3) because the treatment is still randomly assigned to individuals with probability π. Also, so long as the treatment is well-defined and does not cause interference (i.e., does not violate SUTVA), all assumptions (A1)-(A3) are satisfied by a RP-RCT. Third, while we only consider the FRR as our f, we can replace f with another differentially private algorithm. Fourth, because the treatment is randomized, the pre-treatment covariates X_i can take on any value (e.g., missing, censored, etc.) and this won’t impact the identification strategy.
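Putting the pieces together, a minimal simulation of the protocol might look as follows (all parameter values, including the 20% cheater rate, the outcome probabilities, and the two FRR settings, are illustrative assumptions, not values from the paper). Only the treatment assignment, the subsample label, and the privatized report would ever reach the investigator:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# FRR parameters for the two subsamples: (P(force 0), P(force 1)).
frr_params = {1: (1/6, 1/6), 2: (1/4, 1/4)}

s = rng.integers(1, 3, size=n)          # random split into subsamples 1 and 2
z = rng.binomial(1, 0.5, size=n)        # randomized treatment assignment
c = rng.binomial(1, 0.2, size=n)        # latent cheater status (20% cheaters)
y = np.where(z == 1, rng.binomial(1, 0.6, n), rng.binomial(1, 0.4, n))  # true outcome

r = np.empty(n, dtype=int)
for i in range(n):
    if c[i] == 1:
        r[i] = rng.integers(0, 2)       # cheater: a coin flip that ignores z
    else:
        p0, p1 = frr_params[s[i]]       # honest participant follows the FRR
        u = rng.random()
        r[i] = 0 if u < p0 else 1 if u < p0 + p1 else y[i]

# The investigator observes only (z, s, r); y, c, and the die rolls stay hidden.
observed = np.column_stack([z, s, r])
```

The cheaters here happen to flip a coin; under the design, any reporting rule that does not depend on the treatment assignment would be handled the same way.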

3.2 Differential Privacy Guarantee

The following theorem shows that among non-cheaters, the privatized response R_i generated from a RP-RCT is ε-differentially private.

Theorem 1 (Differential Privacy of RP-RCT).

Consider a non-cheater’s true response Y_i. Then, a RP-RCT which generates his/her privatized response R_i is ε-differentially private, with ε determined by the FRR parameters.

In words, Theorem 1 states that regardless of the non-cheater’s true response Y_i, the privatized outcome R_i generated from a RP-RCT is private up to some privacy loss parameter ε. The exact privacy loss depends on the FRR parameters, and investigators can choose them to achieve the desired level of privacy loss. Also, Theorem 1 does not make any claims about differential privacy for cheaters. This is because cheaters can choose to report any response that may or may not be differentially private. For example, if a cheater provides the result of a random coin flip as R_i irrespective of his/her true response Y_i, then his/her response achieves perfect differential privacy, i.e., ε = 0. However, if a cheater always provides the opposite of his/her true response in an attempt to hide it, i.e., R_i = 1 − Y_i, then his/her data, despite their best intentions, is never differentially private. Without making an assumption about cheaters’ behaviors, we cannot make any guarantees about their responses’ differential privacy.
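For a single FRR, the privacy loss can be computed in closed form from the forced-response probabilities. The sketch below is a direct calculation from the definition of differential privacy (writing p0 for the forced-‘0’ probability, p1 for the forced-‘1’ probability, and p2 = 1 − p0 − p1 for a truthful report), not necessarily the exact expression in the paper’s Theorem 1:

```python
import math

def frr_epsilon(p0, p1):
    """Privacy loss of an FRR with forced-0 prob p0 and forced-1 prob p1.

    For a binary true response, P(report 1 | truth 1) = p1 + p2 while
    P(report 1 | truth 0) = p1, and symmetrically for reports of 0, so the
    worst-case log ratio over inputs and outputs is:
    """
    p2 = 1 - p0 - p1
    return max(math.log((p1 + p2) / p1), math.log((p0 + p2) / p0))

# Fair six-sided die (forced on a roll of 1 or 6): epsilon = ln(5) ~ 1.609.
print(frr_epsilon(1/6, 1/6))
# Forcing more often shrinks epsilon: more privacy, less signal.
print(frr_epsilon(1/3, 1/3))   # ln(2) ~ 0.693
```

This makes the design trade-off explicit: raising the forcing probabilities lowers ε (stronger privacy) but, as discussed later, inflates the variance of the treatment effect estimator.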

3.3 Identification of the ATE

To lay out the identification strategy with a RP-RCT, we assume that the RP-RCT satisfies the following assumptions.

  • (A4) Random sample splitting: S_i ∈ {1, 2} is randomized with P(S_i = 1) = 1/2, independently of all other variables.

  • (A5) Randomization in FRR: D_i is independent of (Y_i(1), Y_i(0), Z_i, C_i, X_i), with P(D_i = d | S_i = s) = p_d^(s) for d ∈ {0, 1, 2} and p^(1) ≠ p^(2).

  • (A6) Extended randomization of treatment: Z_i is independent of (Y_i(1), Y_i(0), C_i, D_i, S_i, X_i).

Assumption (A4) states that the two subsamples in a RP-RCT were split randomly. Assumption (A5) states that the randomization device in the FRR (i.e., the die roll) is random and the two FRRs used in the two subsamples are different. Assumption (A6) is a re-iteration of the treatment randomization assumption (A2), except we now include the new variables introduced as part of a RP-RCT: S_i and D_i. Note that assumptions (A4)-(A6) are satisfied by the design of a RP-RCT.

Let π_c be the proportion of cheaters in the population, i.e., π_c = P(C_i = 1). We now state assumptions about the cheater status C_i. While plausible in some settings, these assumptions may not always be satisfied by the design of a RP-RCT.

  • (A7) Cheater’s response: If C_i = 1, individual i’s reported response does not depend on the treatment assignment, i.e., R_i(1) = R_i(0).

  • (A8) Proportion of non-cheaters: π_c < 1.

Assumption (A7) states that a cheater gives the same response to the investigator regardless of whether he/she was randomized to treatment or control; note that assumption (A7) does not say that all cheaters produce the same response (i.e., the assumption underlying Clark and Desharnais [1998]). Assumption (A7) is plausible if the treatment assignment is blinded so that the participant does not know which treatment he/she is receiving and thus, cannot use this information to change his/her final response to the investigator. Also, assumption (A7) still allows a cheater to use his/her original response Y_i, potential outcomes, or pre-treatment characteristics to tailor his/her reported response R_i. For example, if a cheater reports a constant value, say R_i = 1, or the opposite of his true response, say R_i = 1 − Y_i, assumption (A7) will still hold. Or, if some cheaters’ reported responses depend on unmeasured, pre-treatment characteristics, say, cheaters who are more privacy-conscious report one fixed value while less privacy-conscious cheaters report a mixture of 0s and 1s, assumption (A7) will hold. However, if a cheater uses the treatment assignment Z_i to change his/her response R_i, say the cheater would report R_i = 1 if they receive treatment and R_i = 0 if they receive control, assumption (A7) is violated. Assumption (A8) states that not all participants in the study are cheaters. If assumption (A8) does not hold and every participant is a cheater, we cannot identify any treatment effect. Also, in Section 3.4, we present a way to assess assumption (A8) with the observed data by estimating the proportion of cheaters. Overall, so long as the treatment is blinded and there is at least one non-cheater, a RP-RCT plausibly satisfies assumptions (A1)-(A8) by design.

We now show that the data from a RP-RCT can identify the ATE among non-cheaters.

Theorem 2 (Identification of the ATE Under RP-RCT).

Consider a RP-RCT with FRR parameters p^(1) and p^(2) set by the investigator, where the observed data is (Z_i, S_i, R_i) for i = 1, ..., n. Under assumptions (A1)-(A8), the ATE among non-cheaters, denoted as τ_nc = E[Y_i(1) − Y_i(0) | C_i = 0], is identified from the data via

τ_nc = ( E[R_i | Z_i = 1, S_i = s] − E[R_i | Z_i = 0, S_i = s] ) / ( p_2^(s) (1 − π_c) ), for s = 1, 2.

Additionally, the proportion of cheaters π_c is identified from the observed data by combining moment conditions across the two subsamples; the exact expression is given in Section B of the supplementary materials.
Theorem 2 shows that our new design can identify the ATE among non-cheaters by taking the difference in the averages of the privatized outcomes among treated and control units, re-weighed by the FRR parameters and the proportion of cheaters π_c. If there are no cheaters in the population, Theorem 2 implies π_c = 0 and we can identify the ATE for the entire population. In contrast, if everyone is a cheater, we cannot identify the treatment effect; intuitively, if everyone is a cheater, disregards the FRR, and reports an arbitrary R_i, it would be impossible to learn the effect of the treatment on the response. Generalizing this intuition, we can only identify the ATE for the subpopulation of individuals who are non-cheaters, even if the investigator does not know who the cheaters and non-cheaters are. We remark that this result is similar in spirit to the local average treatment effect in the noncompliance literature [Angrist et al., 1996] where, under noncompliance, only the treatment effect for the subpopulation of compliers can be identified from data.
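The re-weighting logic can be checked with a quick Monte Carlo sketch (illustrative, with oracle knowledge of the cheater proportion, which in practice must itself be estimated from the two subsamples; a single FRR is used for simplicity). Honest participants follow the FRR, cheaters report a coin flip regardless of treatment, and the re-weighted difference in mean reports recovers the ATE among non-cheaters:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000
p0 = p1 = 1/6
p2 = 1 - p0 - p1                     # probability of a truthful report
pi_c = 0.2                           # true cheater proportion (oracle here)

z = rng.binomial(1, 0.5, n)          # randomized treatment
c = rng.binomial(1, pi_c, n)         # latent cheater status
# True outcomes: P(Y=1) is 0.6 under treatment, 0.4 under control (ATE = 0.2).
y = np.where(z == 1, rng.binomial(1, 0.6, n), rng.binomial(1, 0.4, n))

u = rng.random(n)
r_honest = np.where(u < p0, 0, np.where(u < p0 + p1, 1, y))   # FRR reports
r = np.where(c == 1, rng.integers(0, 2, n), r_honest)         # cheaters: coin flip

# Difference in mean privatized reports, re-weighted by p2 and (1 - pi_c):
diff = r[z == 1].mean() - r[z == 0].mean()
tau_nc_hat = diff / ((1 - pi_c) * p2)
print(tau_nc_hat)   # ~0.2, the ATE among non-cheaters
```

The raw difference in mean reports is attenuated by the factor (1 − π_c)·p_2, since only honest, truthfully-reporting participants carry signal about the treatment; dividing by that factor undoes the attenuation.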

Theorem 2 also shows that we can identify the proportion of cheaters. While the exact formula is complex, roughly, we measure the excess proportion of privatized outcomes relative to what we would expect had everyone followed the FRR, and use additional moment conditions generated from sample splitting to identify π_c; see Section B of the supplementary materials for details.

Overall, Theorems 1 and 2 show that we can identify the treatment effect with privatized outcomes R_i, some of which may be contaminated by cheaters. The proposed design has some key parameters, p^(1) and p^(2), that govern the privacy loss parameter ε. They also affect estimation and testing of τ_nc, which we discuss below.

3.4 Difference-in-Means Estimator of the ATE

Using the identification result in Section 3.3, we can construct an estimator of by replacing the expectations in Theorem 2 with their sample counterparts, i.e.,


Note that the estimator can be thought of as an extension of the simple difference-in-means estimator in a RCT, re-weighted by the estimated proportion of non-cheaters and the privacy-preserving map, i.e., . An estimate of the proportion of cheaters can similarly be obtained by replacing the expectations in Theorem 2 with their sample counterparts, i.e.,


Also, is the maximum likelihood estimate of ; see Section B.3.1 of the supplementary materials for details. In small samples, may exceed the bounds and , especially if and are close to each other. In this case, we follow the advice of Clark and Desharnais [1998] and evaluate the likelihood at the estimated and at the boundary points, picking the value of corresponding to the highest likelihood.
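As a toy illustration of how two subgroups with different FRR parameters yield enough moment conditions to identify the proportion of cheaters, the sketch below assumes, purely for illustration, that cheaters always report "no" and that the two devices share the same truthful-answer probability; all parameter values are hypothetical, and the paper's actual maximum likelihood estimator is in Section B.3.1 of its supplementary materials.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# Hypothetical FRR parameters for the two subgroups (illustrative only):
# group A forces "yes" w.p. 0.20 and "no" w.p. 0.10; group B swaps them.
pA_yes, pA_no = 0.20, 0.10
pB_yes, pB_no = 0.10, 0.20
c = 1 - pA_yes - pA_no          # truthful-answer probability (same in both groups)

mu, pi_c = 0.30, 0.10           # true outcome mean and cheater proportion
grp = rng.random(n) < 0.5       # subgroup A (True) vs B (False)
cheat = rng.random(n) < pi_c    # cheaters report "no" regardless of the FRR
y = rng.binomial(1, mu, n)

u = rng.random(n)
p_yes = np.where(grp, pA_yes, pB_yes)
p_no = np.where(grp, pA_no, pB_no)
z = np.where(u < p_yes, 1, np.where(u < p_yes + p_no, 0, y))
z = np.where(cheat, 0, z)

# Moment conditions from the two subgroups:
#   E[Z | A] = (1 - pi_c) * (pA_yes + c * mu)
#   E[Z | B] = (1 - pi_c) * (pB_yes + c * mu)
zA, zB = z[grp].mean(), z[~grp].mean()
pi_hat = 1 - (zA - zB) / (pA_yes - pB_yes)   # subtract to cancel the c * mu term
mu_hat = (zA / (1 - pi_hat) - pA_yes) / c
print(round(pi_hat, 2), round(mu_hat, 2))
```

Subtracting the two subgroup means cancels the unknown outcome mean, leaving an equation in the cheater proportion alone; the outcome mean is then recovered by plugging the estimate back into either moment.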

Theorem 3 shows that is a consistent and asymptotically Normal estimator of .

Theorem 3 (Asymptotic Properties of ).

Suppose the observed data is i.i.d. and generated from a RP-RCT. Then, we have with

Also, can be consistently estimated by,

where, for ,

and, for .

Theorem 3 can be used as a basis to construct confidence intervals and p-values for testing the null hypothesis . For example, a Wald-style test statistic for would be , and one would reject the null in favor of a two-sided alternative at level if exceeds , where is the quantile of the standard Normal distribution. We can also construct a Wald-based, level , two-sided confidence interval for as . Note that, similar to the usual difference-in-means estimator in Section 2.1, practitioners could also use the bootstrap to estimate the standard error of and its associated confidence interval.
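The Wald test and interval described above can be written in a few lines using only the standard library; the numbers passed in are illustrative, not estimates from the study.

```python
from statistics import NormalDist

def wald_inference(tau_hat, se, alpha=0.05):
    """Two-sided Wald test of H0: tau = 0 and a (1 - alpha) confidence
    interval, given a point estimate and its standard error."""
    nd = NormalDist()
    z = tau_hat / se                      # Wald statistic
    p_value = 2 * (1 - nd.cdf(abs(z)))    # two-sided p-value
    q = nd.inv_cdf(1 - alpha / 2)         # standard Normal quantile
    return z, p_value, (tau_hat - q * se, tau_hat + q * se)

# Illustrative numbers only (not estimates from the study):
z, p, ci = wald_inference(0.12, 0.05)
print(round(z, 2), round(p, 3), tuple(round(c, 3) for c in ci))
# → 2.4 0.016 (0.022, 0.218)
```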

3.5 Cost of Data Privacy in RP-RCT

In this section, we compare an RP-RCT to a traditional RCT to study the cost that guaranteeing differential privacy imposes on statistical efficiency. To begin, suppose so that both and estimate the same parameter . The following theorem shows that is never as asymptotically efficient as ; in short, there is a statistical cost to using an RP-RCT to guarantee differential privacy of an individual's response.

Theorem 4.

[Relative Asymptotic Efficiency of and ] For any and , we have,

where, , and . Also, is monotonically increasing in .

The relative efficiency between the two estimators, measured by , is determined by , which in turn is governed by the FRR parameters . In particular, as individuals' responses become less private through an increase in the privacy loss , the relative efficiency approaches . In fact, only when the privacy loss approaches infinity, i.e., , does the relative efficiency equal ; thus, will never be as efficient as so long as we want to guarantee some amount of data privacy. Investigators can use the formula in Theorem 4 to assess what value of the privacy loss works best in their own studies. In particular, we recommend investigators choose based on (a) their tolerance for a loss in efficiency in exchange for more privacy, (b) and , which may be informed by subject-matter experts during the planning stage of the experiment, and/or (c) recommended data privacy standards from relevant regulatory bodies.
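Theorem 4's exact formula is not reproduced here, but the qualitative trade-off can be illustrated with a symmetric forced-response device (force "yes" or "no" each with probability p, answer truthfully otherwise), for which both the variance inflation of the debiased mean and the privacy loss have simple closed forms. The function below is a sketch under that simplified device, not the paper's Theorem 4.

```python
import math

def var_inflation(p, mu):
    """Variance of the FRR-debiased sample mean relative to the direct
    sample mean, for a symmetric device that forces each answer w.p. p."""
    ez = p + (1 - 2 * p) * mu                 # mean of the privatized report
    var_direct = mu * (1 - mu)                # variance of a truthful report
    var_frr = ez * (1 - ez) / (1 - 2 * p) ** 2
    eps = math.log((1 - p) / p)               # privacy loss of the device
    return eps, var_frr / var_direct

for p in (0.25, 0.10, 0.01):
    eps, infl = var_inflation(p, mu=0.30)
    print(f"eps={eps:.2f}  variance inflation={infl:.2f}")
```

As p shrinks, the privacy loss grows and the variance inflation decreases monotonically toward 1, mirroring the theorem's message that full efficiency is reached only as the privacy loss tends to infinity.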

3.6 Covariate Adjustment and Doubly Robust Estimation

Similar to a RCT, suppose a RP-RCT collected pre-treatment covariates , which may be missing, contaminated, and/or corrupted. We propose to use these pre-treatment covariates to develop a more efficient estimator, without incurring additional bias, via a doubly robust estimator [Bang and Robins, 2005]. Formally, let and be the postulated models for the true outcome regressions and , respectively. For simplicity, we assume that these functions are fixed, but our results below hold if these functions are estimated at fast enough rates, say those based on parametric models. Consider the following estimator for :



The following theorem shows that is a consistent, asymptotically Normal estimator of even if and are mis-specified.

Theorem 5 (Consistency and Asymptotic Normality of ).

Suppose the same assumptions in Theorem 3 hold. Then, for any fixed working models of the privatized outcomes and , we have where

Also, can be consistently estimated by,

Similar to Theorem 3, Theorem 5 can be used to construct confidence intervals and p-values for testing . In Section A.5 of the supplementary materials, we also characterize the relative efficiency between the doubly robust estimators under an RP-RCT and a RCT.
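To illustrate the double robustness property on privatized outcomes, the following sketch runs an AIPW-style estimator with deliberately mis-specified, fixed working models. It simplifies the paper's estimator by assuming no cheaters and a known randomization probability of 1/2; all parameter values and model forms are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
t = rng.binomial(1, 0.5, n)                       # randomized treatment, e = 0.5 known
y = rng.binomial(1, 1 / (1 + np.exp(-(x + 0.8 * t - 0.5))))  # true binary outcomes

p_yes, p_no = 0.15, 0.15                          # hypothetical FRR parameters
u = rng.random(n)
z = np.where(u < p_yes, 1, np.where(u < p_yes + p_no, 0, y))  # privatized reports

# Deliberately mis-specified, fixed working models for the privatized
# outcome regressions; double robustness keeps the estimator consistent
# because the randomization probability e = 0.5 is known exactly.
m1 = 0.5 + 0.1 * x
m0 = 0.4 - 0.1 * x
e = 0.5
aipw = (t * (z - m1) / e + m1) - ((1 - t) * (z - m0) / (1 - e) + m0)
tau_hat = aipw.mean() / (1 - p_yes - p_no)        # undo the FRR attenuation
```

Because the treatment probability is known by design, the augmentation terms have mean zero for any fixed working models, so the estimate remains consistent even though both models are wrong.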

4. Evaluation of Online Statistics Courses at the University of Wisconsin-Madison During COVID-19

4.1 Background and Motivation

In light of many courses shifting online during COVID-19, the Department of Statistics at the University of Wisconsin-Madison was interested in evaluating which type of online video lecture best aided conceptual understanding, information retention, and problem-solving among students taking the Department's courses. Specifically, one question of interest was whether instructor-present lecture videos (i.e., treatment) led to a better learning experience for students compared to instructor-absent online lectures (i.e., control); see Figure 2 for an example. Some prior works [Wilson et al., 2018, Kizilcec et al., 2014] found no evidence that instructor-present lecture videos had a significant impact on student learning in terms of attention and comprehension, while other works [Pi and Hong, 2016, Wang et al., 2020] showed the opposite: instructor-present lecture videos enhanced student learning. Regardless, the participant data from these works are not publicly available due to the sensitive nature of students' educational data. Additionally, there was concern that students may be less willing to report their true attention and retention in the courses when asked by the Department. For example, some students might not be comfortable admitting to not paying attention to online lectures and may simply lie to the investigator, leading to potentially biased results.

Compared to a RCT, an RP-RCT remedies the aforementioned issues in several ways. First, students' data is guaranteed to be differentially private, which may encourage more honest participation. Second, even if some students remain dishonest and are cheaters, an RP-RCT still provides a robust estimate of the treatment effect. Third, the data can be shared publicly while students' data privacy is still preserved via differential privacy; as mentioned earlier, the experimental protocol, including sharing students' response data, was approved by the Education and Social/Behavioral Science IRB of the University of Wisconsin-Madison.

4.2 Study Participants, Treatment Arms, and Outcomes

The study population consisted of students enrolled in introductory statistics classes at the University of Wisconsin-Madison during the Spring 2021 semester. Electronic, informed consent was obtained from all participants before enrollment. Once a student gave consent, the study collected the following pre-treatment covariates: gender, race/ethnicity, year in college, major or field of study, prior subject-matter knowledge, previous exposure to and grades in statistics/mathematics/computer science classes, self-rated interest in statistics, proficiency in English, amount of experience with video lectures in past or current semesters, and preference on online lecture format. Students had the option to not answer any of the pre-treatment covariate questions. Afterwards, students were randomly placed into one of two subgroups ( or ) and, within each subgroup, randomly assigned to treatment or control. The control arm used a narrated 'Instructor Absent' (IA) video format, where students see the lecture slides and hear an audio narration of the lecture from the instructor. The treatment arm was identical to the control arm except that it used an 'Instructor Present' (IP) video format, where the instructor's face was embedded in the upper-right corner of the lecture video. Both lecture videos were 13 minutes long and introduced identical statistical concepts, specifically about RCTs; we remark that RCTs were not covered in the classes in which the study was conducted. Also, the study used an FRR with and another FRR with , resulting in a privacy loss of . This value was based on our own preference for data privacy at the expense of efficiency, where we were willing to tolerate roughly a increase in standard error under an RP-RCT relative to a RCT, and on consultation with the University's IRB, which, among other things, gave approval to release the student-level data to the public for future replication and analysis.
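The privacy loss of a forced-response device can be computed directly from its parameters as the largest log-likelihood ratio of a report under the two possible true answers. The die-face mapping below is hypothetical (the study's actual mapping is in panel C of Figure 2), and the numbers are not the study's FRR parameters.

```python
import math

def frr_privacy_loss(p_yes, p_no):
    """Privacy loss (epsilon) of a forced-response device that forces a
    'yes' report with probability p_yes, forces a 'no' report with
    probability p_no, and passes the truthful answer through otherwise."""
    p_truth = 1 - p_yes - p_no
    r_yes = (p_yes + p_truth) / p_yes   # likelihood ratio of reporting "yes"
    r_no = (p_no + p_truth) / p_no      # likelihood ratio of reporting "no"
    return math.log(max(r_yes, r_no))

# Hypothetical die mapping: faces 1-2 force "yes", face 3 forces "no",
# faces 4-6 mean "answer truthfully".
eps = frr_privacy_loss(2 / 6, 1 / 6)
print(round(eps, 2))  # ln(4) ≈ 1.39
```

Since each participant passes through only one of the two FRR devices, the per-participant privacy loss of the overall design is at most the larger of the two devices' epsilons.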

After students watched either the IA or IP lecture video, they were asked a series of questions, notably on four areas of student learning (i.e., Attention, Retention, Judgement of Learning and Comprehension) used by previous works [Kizilcec et al., 2014, Pi and Hong, 2016, Wang and Antonenko, 2017, Wilson et al., 2018]. All four outcomes were ‘yes/no’ questions; the exact wording of the questions is in Table 1.

For our working models for the privatized outcomes and , we used logistic regression models that minimized the Akaike information criterion [Bozdogan, 1987]; see Section C of the supplementary materials for additional details.

Primary Outcome Questions
Attention “I found it hard to pay attention to the video.”
Retention “I was unable to recall the concepts while attempting the followup quiz.”
Judgement of Learning “I don’t feel that I learnt a great deal by watching the video.”
Comprehension “I found the topic covered in the video to be hard.”
Table 1: Questions Concerning Four Areas of Student Learning. Students were asked to say “yes” or “no” if they agreed or disagreed with the statements above.

We remark that, as part of the RP-RCT protocol, students were prompted to 'roll a die' using an online die roller (i.e., the FRR device). Students were allowed to roll the die only once, and the resulting roll was visible only to the student. Based on the outcome of the roll, students answered the four questions in Table 1 following the FRR prompt presented in panel C of Figure 2. Also, students had to roll the die before being presented with the four questions.

Figure 2: A Sample Slide of the Instructor-Absent (Panel A) and Instructor-Present (Panel B) Video Lectures on RCTs. Panel C shows the instructions for the FRR.

4.3 Results

Table 2 presents the estimates of the average treatment effect among honest participants () for the four outcomes. Honest students indicated that it was easier to pay attention to instructor-present (IP) video lectures than to instructor-absent (IA) video lectures (). In contrast, the ability to retain concepts was higher among students randomized to the instructor-absent (IA) video lectures (). There was no significant difference between students who watched IP and IA video lectures in terms of Judgement of Learning and Comprehension. Section C of the supplementary materials presents covariate balance between the IP and IA subgroups and, as expected, we found no differences between the two groups in terms of their baseline pre-treatment covariates. Our results also suggest that cheating varies with the question. For example, the estimated proportions of cheaters for the Attention and Judgement of Learning questions were smaller ( and , respectively), whereas they were larger for the Retention and Comprehension questions ( and , respectively). These differences may suggest that students are more apprehensive about sharing outcomes related to their own learning abilities (i.e., Retention, Comprehension) than outcomes related to instruction (e.g., the instructor's ability to engage students and transfer knowledge).

Table 2: Estimates of Treatment Effects on the Four Outcomes. Standard errors (s.e.) of are estimated using 5000 bootstrapped samples.
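The bootstrap standard errors reported in Table 2 can be sketched as follows. This simplified version resamples units with replacement and recomputes an FRR-debiased difference in means, ignoring the cheater-proportion adjustment of the full estimator; the data and parameter values are simulated placeholders, not the study's.

```python
import numpy as np

def bootstrap_se(z, t, p_yes, p_no, n_boot=5000, seed=0):
    """Bootstrap standard error of the FRR-debiased difference in means
    (a no-cheater simplification of the paper's estimator)."""
    rng = np.random.default_rng(seed)
    n = len(z)
    c = 1 - p_yes - p_no
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)          # resample units with replacement
        zb, tb = z[idx], t[idx]
        stats[b] = (zb[tb == 1].mean() - zb[tb == 0].mean()) / c
    return stats.std(ddof=1)

# Simulated placeholder data (not the study's):
rng = np.random.default_rng(1)
t = rng.binomial(1, 0.5, 2000)
z = rng.binomial(1, 0.4, 2000)
se = bootstrap_se(z, t, 0.15, 0.15)
```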

Our results largely agree with previous work on online video lectures. For example, our results and those of Kizilcec et al. [2014], Pi and Hong [2016], and Wang et al. [2020] agree that IP lectures receive considerably more attention than IA lectures. Our findings on retention, judgement of learning, and comprehension match those in Wang and Antonenko [2017], Wang et al. [2020], and Wilson et al. [2018], although we remark that one study [Kizilcec et al., 2014] suggests the opposite of what we find on retention. Finally, unlike these works, all of our student-level data and code are publicly available for replication and future analysis, especially if investigators want to combine this data with future datasets to boost power in related evaluations of online video lectures.

5. Conclusion and Discussion

We propose a new experimental design to evaluate the effectiveness of a program, policy, or treatment on an outcome that may be sensitive, with a particular focus on online education programs, where students' response data are often sensitive. Our design, an RP-RCT, has differential privacy guarantees while also allowing estimation of treatment effects. An RP-RCT also accommodates cheaters who may not trust the privacy-preserving nature of our design and provide arbitrary responses to further protect their privacy. We provide two consistent, asymptotically Normal estimators, one of which allows for covariate adjustment, and we assess the trade-off between differential privacy and statistical efficiency. We conclude by using an RP-RCT to evaluate different types of online video lectures in the Department of Statistics at the University of Wisconsin-Madison, finding that our results largely agree with existing results on online video lectures while preserving students' data privacy and allowing the data to be shared for future replication.


  • D. Aaronson, L. Barrow, and W. Sander (2007) Teachers and student achievement in the Chicago public high schools. Journal of Labor Economics 25 (1), pp. 95–135.
  • J. D. Angrist, G. W. Imbens, and D. B. Rubin (1996) Identification of causal effects using instrumental variables. Journal of the American Statistical Association 91 (434), pp. 444–455.
  • G. J. Annas (2003) HIPAA regulations — a new era of medical-record privacy? New England Journal of Medicine 348 (15), pp. 1486–1490.
  • Apple (2019) Apple differential privacy technical overview.
  • H. Bang and J. M. Robins (2005) Doubly robust estimation in missing data and causal inference models. Biometrics 61 (4), pp. 962–973.
  • R. Benbunan-Fich (2017) The ethics of online research with unsuspecting users: from A/B testing to C/D experimentation. Research Ethics 13 (3–4), pp. 200–218.
  • R. M. Bond, C. J. Fariss, J. J. Jones, A. D. Kramer, C. Marlow, J. E. Settle, and J. H. Fowler (2012) A 61-million-person experiment in social influence and political mobilization. Nature 489 (7415), pp. 295–298.
  • H. Bozdogan (1987) Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52 (3), pp. 345–370.
  • S. J. Clark and R. A. Desharnais (1998) Honest answers to embarrassing questions: detecting cheating in the randomized response model. Psychological Methods 3 (2), pp. 160–168.
  • J. C. Duchi, M. I. Jordan, and M. J. Wainwright (2018) Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association 113 (521), pp. 182–201.
  • C. Dwork and A. Roth (2014) The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (3–4), pp. 211–407.
  • C. Dwork and A. Smith (2010) Differential privacy for statistics: what we know and what we want to learn. Journal of Privacy and Confidentiality 1 (2).
  • C. Dwork (2006) Differential privacy. In 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), Lecture Notes in Computer Science, Vol. 4052, pp. 1–12.
  • B. Efron and R. J. Tibshirani (1994) An Introduction to the Bootstrap. CRC Press.
  • Ú. Erlingsson, V. Pihur, and A. Korolova (2014) RAPPOR: randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14).
  • A. Evfimievski, J. Gehrke, and R. Srikant (2003) Limiting privacy breaches in privacy preserving data mining. In Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '03), New York, NY, USA, pp. 211–222.
  • G. Fanti, V. Pihur, and Ú. Erlingsson (2015) Building a RAPPOR with the unknown: privacy-preserving learning of associations and data dictionaries. arXiv:1503.01214.
  • J. A. Fox and P. E. Tracy (1984) Measuring associations with randomized response. Social Science Research 13, pp. 188–197.
  • GDPR (2018) 2018 reform of EU data protection rules.
  • M. A. Hernan and J. M. Robins (2020) Causal Inference: What If. Chapman and Hall/CRC.
  • N. Hunt (2010) Netflix Prize update. The Netflix Blog.
  • G. W. Imbens and D. B. Rubin (2015) Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
  • R. F. Kizilcec, K. Papadopoulos, and L. Sritanyaratana (2014) Showing face in video instruction: effects on information retention, visual attention, and affect. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '14), New York, NY, USA, pp. 2095–2102.
  • N. Li, T. Li, and S. Venkatasubramanian (2007) t-Closeness: privacy beyond k-anonymity and l-diversity. In IEEE 23rd International Conference on Data Engineering, pp. 106–115.
  • J. K. Lunceford and M. Davidian (2004) Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23, pp. 2937–2960.
  • A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam (2006) ℓ-Diversity: privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), Washington, DC, USA, pp. 24.
  • A. Narayanan and V. Shmatikov (2008) Robust de-anonymization of large sparse datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy.
  • J. Neyman (1923) Sur les applications de la théorie des probabilités aux expériences agricoles: essai des principes [On the application of probability theory to agricultural experiments: essay on principles]. Master's thesis; excerpts reprinted in English in Statistical Science 5, pp. 463–472.
  • Office of the Federal Register, National Archives and Records Administration (2005) 45 CFR 46 — Protection of human subjects.
  • Z. Pi and J. Hong (2016) Learning process and learning outcomes of video podcasts including the instructor and PPT slides: a Chinese case. Innovations in Education and Teaching International 53 (2), pp. 135–144.
  • Z. Qin, Y. Yang, T. Yu, I. Khalil, X. Xiao, and K. Ren (2016) Heavy hitter estimation over set-valued data with local differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16), New York, NY, USA, pp. 192–203.
  • P. R. Rosenbaum and D. B. Rubin (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70 (1), pp. 41–55.
  • D. B. Rubin (1980) Randomization analysis of experimental data: the Fisher randomization test comment. Journal of the American Statistical Association 75, pp. 591–593.
  • D. B. Rubin (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66 (5), pp. 688–701.
  • P. Samarati (2001) Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering 13 (6), pp. 1010–1027.
  • L. Sweeney (2000) Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy Working Paper 3.
  • L. Sweeney (2002) k-Anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10 (5), pp. 557–570.
  • A. A. Tsiatis, M. Davidian, M. Zhang, and X. Lu (2008) Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Statistics in Medicine 27 (23), pp. 4658–4677.
  • J. Wang and P. D. Antonenko (2017) Instructor presence in instructional video: effects on visual attention, recall, and perceived learning. Computers in Human Behavior 71, pp. 79–89.
  • J. Wang, P. Antonenko, and K. Dawson (2020) Does visual attention to the instructor in online video affect learning and learner perceptions? An eye-tracking analysis. Computers and Education 146, pp. 103779.
  • S. L. Warner (1965) Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60, pp. 63–69.
  • K. E. Wilson, M. Martinez, C. Mills, S. D'Mello, D. Smilek, and E. F. Risko (2018) Instructor presence effect: liking does not always lead to learning. Computers and Education 122, pp. 205–220.
  • M. Zhang, A. A. Tsiatis, and M. Davidian (2008) Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics 64 (3), pp. 707–715.