1. Introduction
1.1 Motivation: Balancing Privacy and Scientific Inquiry in Program Evaluation Using Randomized Experiments
The gold standard for evaluating the effectiveness of a program, policy, or treatment on an outcome of interest is a randomized controlled trial (RCT). In its purest form, an investigator randomly assigns each individual $i$, $i = 1, \ldots, n$, to treatment, denoted as $Z_i = 1$, or control, denoted as $Z_i = 0$, and observes his/her outcome $Y_i$. Because the treatment was randomized, the treated and control groups are similar in their observed and unobserved characteristics and thus, taking the difference of the average outcomes from the two groups yields an unbiased estimate of the average treatment effect (ATE). A bit more formally, a RCT satisfies strong ignorability
[Rosenbaum and Rubin, 1983] and the ATE can be identified from observed data; see Imbens and Rubin [2015] and Hernan and Robins [2020] for textbook discussions. In addition to strong ignorability, one subtle, yet important assumption underlying RCTs is that after treatment is randomly assigned, individuals in the study share their responses/outcomes with the investigator. This assumption, almost axiomatic in a RCT, is becoming less plausible in the modern era, especially in online settings where there are increasing regulations to protect users' privacy. For example, Bond et al. [2012] ran an online randomized experiment among 61 million Facebook users and collected their voting behaviors as the primary outcome; this experiment attracted controversy due, in part, to concerns over data privacy [Benbunan-Fich, 2017]. In RCT-based evaluations of educational programs, investigators often collect sensitive data on students' performance, say test scores, probation status, and class rank, as their primary outcomes [Aaronson et al., 2007, Wilson et al., 2018].
Also, from a regulatory perspective, the European Union enacted the General Data Protection Regulation (GDPR), which limits the information that websites can collect from users [GDPR, 2018]. In 2020, California started enforcing the California Consumer Privacy Act, which, like the GDPR, guarantees certain privacy rights over online user data. The main theme of this work is to explore how to guarantee individual data privacy while still being able to use RCT-based evaluation of programs, especially those from online educational settings.
1.2 Review of Existing Approaches to Data Privacy in RCTs
A popular approach to data privacy in clinical trials and A/B testing in online settings is to lock up the user data after completing the experiment and only report summary statistics for scientific dissemination (e.g., Wilson et al. [2018]). While this approach tilts more closely toward guaranteeing privacy, a major downside is that independent replication of the original analysis is difficult, if not impossible. For instance, it is generally difficult for a future, external investigator to plot, diagnose, or fit competing or new models based only on summary statistics from the original experiment. Relatedly, it would be difficult for future investigators to merge the original experimental data with new experimental data to build more complex models or to boost statistical power, especially for heterogeneous treatment effects. Even if these difficulties are deemed tolerable, this approach rests on the assumptions that subjects are willing to give up their personal information to the investigators in the first place and that investigators will keep the data safe, in perpetuity.
Another popular approach is based on data anonymization, where identifiable information is removed, aggregated, or anonymized before sharing the "de-identified" data with the public. Some examples include removing any protected health information (PHI) to be compliant with the Health Insurance Portability and Accountability Act (HIPAA) [Annas, 2003] and using $k$-anonymity [Samarati, 2001, Sweeney, 2002], $\ell$-diversity, or $t$-closeness [Machanavajjhala et al., 2006, Li et al., 2007]. While these approaches are an improvement over the previous approach in terms of replicability and data sharing, it has been shown that many popular de-identification methods are not sufficient to guarantee privacy. For example, Sweeney [2000] linked de-identified patient-specific health data to voter registration records using variables such as ZIP code, birth date, and gender and observed that 87% of the U.S. population can be uniquely identified with these variables. Also, Narayanan and Shmatikov [2008]
linked the Netflix Prize dataset containing anonymized movie ratings of 500,000 Netflix subscribers to the Internet Movie Database (IMDb), allowing re-identification of users on Netflix; this led to the discontinuation of the Netflix Prize in 2010
[Hunt, 2010]. The approach to data privacy we use in this work is differential privacy [Evfimievski et al., 2003, Dwork, 2006, Dwork and Smith, 2010], specifically local differential privacy; see Duchi et al. [2018] and references therein. Broadly speaking, differential privacy is a mathematical definition of privacy under which nearly identical statistics are computed from a dataset, say the sample mean or the p-value from a hypothesis test, regardless of whether any one individual is present or absent in the dataset; see Section 2.2 for details. Differential privacy is considered to be the strongest form of data privacy in that if an adversary were to obtain differentially private data, it is, up to a privacy loss value $\epsilon$, impossible to re-identify the individual in the data. Due to these strong privacy guarantees, differential privacy is used by Google's Chrome browser [Erlingsson et al., 2014] and Apple's mobile iOS platform [Apple, 2019]
to protect their users’ privacy while enabling the development of novel machine learning methods and statistical analysis.
1.3 Our Contributions
Our main contribution is to propose a simple, robust RCT that guarantees local differential privacy while allowing investigators to estimate treatment effects. Specifically, similar to a typical RCT, we assume that the investigator randomly assigns treatment $Z_i$ to individual $i$ and therefore, the treatment value is known to the investigator. But, unlike a typical RCT, we use randomized response techniques originally from Warner [1965] to collect differentially private outcome data from individual $i$, denoted as $R_i$, instead of the sensitive, "true" outcome, denoted as $Y_i$. That is, the investigator only sees a privatized response $R_i$ along with the treatment assignment $Z_i$ to estimate the treatment effect on the sensitive/true outcome $Y_i$; in contrast, a typical RCT allows investigators to see both the sensitive/true outcome $Y_i$ and $Z_i$ to estimate the treatment effect on $Y_i$.
A key innovation in our proposed experimental design that distinguishes it from a straightforward application of existing differential privacy techniques to RCTs is that we allow the privatized response to be "adversarial." More concretely, unbeknownst to the investigator, some participants may provide "adversarial" data to further mask their identity, say by providing a completely random value as $R_i$ that deviates from the experimental protocol. Our proposed design allows responses from such participants, which we broadly call "cheaters," and even if their identity is unknown to the investigator, their responses will not harm estimation and inference of treatment effects. In relation to works in differential privacy, a cheater represents, in a loose sense, an "imperfect" implementation of a differentially private algorithm where a database/central entity holding the private data may not faithfully execute the privacy-preserving algorithm, say the entity added the wrong noise to the private data or forgot to add any noise at all. Our work shows how to still obtain relevant statistics of interest even if the differentially private algorithm is imperfectly implemented. We achieve this by using a simple idea based on sample splitting and non-compliance in psychometric testing [Clark and Desharnais, 1998]
where we apply two slightly different differentially private algorithms to two random subgroups of participants and reweigh the outputs from the two algorithms via inverse probability weights to remove the bias arising from cheaters; see Section
3.3 for details. Also, in relation to works in psychometric testing, our work extends Clark and Desharnais [1998] to allow for arbitrary types of cheaters and differential privacy; see Section 2.3 for details. Once we have data from the proposed design, we propose two consistent estimators. The first estimator is essentially a difference-in-means estimator weighted by the proportion of non-cheaters and is similar, in form, to the local average treatment effect in the non-compliance literature [Angrist et al., 1996]. The second estimator is a doubly robust, covariate-adjusted estimator that uses pre-treatment covariates, if available, to improve efficiency. We also compare our design to a typical RCT that collects the true, sensitive/private outcome and assess the trade-off between statistical efficiency and data privacy.
Finally, the proposed experimental design is used to evaluate online statistics courses at the University of Wisconsin-Madison. Specifically, during the Spring of 2021, when most classroom instruction went online due to COVID-19, students participated in an evaluation of the impact of instructors being present in online lecture videos on learning outcomes. Similar to prior works in this area [Kizilcec et al., 2014, Pi and Hong, 2016, Wilson et al., 2018, Wang et al., 2020], we find that instructor-present video lectures improved students' attention among non-cheaters. Critically, unlike these prior works, the sensitive learning outcomes from the students are guaranteed to be differentially private to any investigator (including those who actually conducted the evaluation). In fact, the proposed design received approval from the Education and Social/Behavioral Science Institutional Review Board (IRB) of the University of Wisconsin-Madison to release this data to the public for future replication; the protocol met the criteria for exempt human subjects research in accordance with categories 1 and 3 as defined under 45 CFR 46 [Office of the Federal Register and Administration, 2005].
2. Setup, Review, and Definition of Cheaters
2.1 Review: Notation, Potential Outcomes, and RCTs
We review the potential outcomes notation used to define treatment effects [Neyman, 1923, Rubin, 1974]. For each individual $i = 1, \ldots, n$, let $Z_i$ denote the binary treatment assignment, with $Z_i = 1$ denoting treatment and $Z_i = 0$ denoting control. Let $Y_i(z)$ denote the potential outcome of individual $i$ under treatment assignment value $z \in \{0, 1\}$ and let $\mathcal{Y}$ denote the support of the outcome. For simplicity, we consider $Y_i(z)$ to take on binary values, $\mathcal{Y} = \{0, 1\}$. In Section 3.2, we show that as long as $\mathcal{Y}$ is restricted so that differential privacy is well-defined (see Chapter 2.3 of Dwork and Roth [2014]), our proposed experimental design remains differentially private. Finally, let $Y_i = Y_i(Z_i)$ and $X_i$ be the observed outcome and pre-treatment covariates, respectively, for individual $i$.
The estimand of interest is the average treatment effect (ATE), defined as $\tau = E[Y_i(1) - Y_i(0)]$. To identify $\tau$, the following assumptions are usually made; here, we write the assumptions under a RCT, but the interested reader is referred to Imbens and Rubin [2015] and Hernan and Robins [2020] for identification strategies outside of RCTs.
(A1) Stable unit treatment value assumption [Rubin, 1980]: $Y_i = Z_i Y_i(1) + (1 - Z_i) Y_i(0)$.

(A2) Complete randomization of treatment: $\{Y_i(1), Y_i(0), X_i\} \perp Z_i$.

(A3) Overlap: There exists $0 < \pi_z < 1$, where $P(Z_i = 1) = \pi_z$.
Briefly, assumption (A1) states that there are no different versions of the treatment and that the treatment of individual $i$ does not impact the potential outcomes of individual $j \neq i$. Assumption (A2) states that the treatment is completely randomized and assumption (A3) states that there is a non-zero probability that individual $i$ is assigned to either treatment or control. Assumptions (A1)-(A3) are usually satisfied by a RCT and consequently, $\tau$ can be identified by the well-known difference-in-means formula, i.e., $\tau = E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]$. In particular, we review two consistent estimators of the ATE in RCTs, the difference-in-means estimator $\hat{\tau}_{\text{DM}}$ and the covariate-adjusted, doubly robust estimator $\hat{\tau}_{\text{DR}}$.
(1) $\hat{\tau}_{\text{DM}} = \frac{\sum_{i=1}^{n} Z_i Y_i}{\sum_{i=1}^{n} Z_i} - \frac{\sum_{i=1}^{n} (1 - Z_i) Y_i}{\sum_{i=1}^{n} (1 - Z_i)}$

(2) $\hat{\tau}_{\text{DR}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{Z_i \{Y_i - \hat{\mu}_1(X_i)\}}{\pi_z} + \hat{\mu}_1(X_i) \right] - \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{(1 - Z_i) \{Y_i - \hat{\mu}_0(X_i)\}}{1 - \pi_z} + \hat{\mu}_0(X_i) \right]$
Here, $\hat{\mu}_1(x)$ and $\hat{\mu}_0(x)$ are outcome regression models for the treatment and control groups, respectively. The difference-in-means estimator $\hat{\tau}_{\text{DM}}$ does not require an outcome model and does not adjust for covariates $X_i$. The doubly robust estimator $\hat{\tau}_{\text{DR}}$ is an augmented version of the difference-in-means estimator that adjusts for covariates and, in a randomized experiment, is consistent even if $\hat{\mu}_1$ or $\hat{\mu}_0$ is misspecified. But, if $\hat{\mu}_1$ and $\hat{\mu}_0$ are correctly specified, $\hat{\tau}_{\text{DR}}$ is more efficient than $\hat{\tau}_{\text{DM}}$. Finally, both estimators are asymptotically Normal with mean $\tau$ and their standard errors can usually be estimated by a sandwich formula or the bootstrap [Efron and Tibshirani, 1994]. For additional discussions, see Lunceford and Davidian [2004], Zhang et al. [2008], Tsiatis et al. [2008], and Bang and Robins [2005]. While these estimators have desirable statistical properties, from a data privacy perspective, they require individuals' responses $Y_i$. If individuals are unwilling or apprehensive about sharing their responses to treatment, or if they share dishonest responses due to reservations about their data privacy, the estimators may no longer be consistent. The next few sections present a formal way to "privatize" $Y_i$ and show how to use this privatized response to identify and estimate treatment effects on $Y_i$. Also, while we could privatize the pre-treatment covariates $X_i$ in a similar fashion as the response, we leave them to take on arbitrary values since we can identify the causal effect without $X_i$. Instead, we will primarily use $X_i$ to gain efficiency; see Sections 3.3 and 3.6 for details.
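To make these two estimators concrete, here is a minimal Python sketch; the function names are ours, and the fitted outcome-model values `mu1`, `mu0` and the known assignment probability `pi_z` are hypothetical inputs supplied by the analyst.

```python
def difference_in_means(Z, Y):
    """Equation-(1)-style estimator: mean outcome difference between arms."""
    y1 = [y for z, y in zip(Z, Y) if z == 1]
    y0 = [y for z, y in zip(Z, Y) if z == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

def doubly_robust(Z, Y, mu1, mu0, pi_z):
    """AIPW-style estimator: augments the outcome models mu1, mu0 (fitted
    values of E[Y|Z=1,X] and E[Y|Z=0,X]) with inverse-probability-weighted
    residuals; pi_z is the known treatment assignment probability."""
    n = len(Y)
    aug1 = sum(z * (y - m) / pi_z + m for z, y, m in zip(Z, Y, mu1)) / n
    aug0 = sum((1 - z) * (y - m) / (1 - pi_z) + m
               for z, y, m in zip(Z, Y, mu0)) / n
    return aug1 - aug0
```

With outcome models identically zero, the AIPW form collapses to an inverse-probability-weighted difference-in-means, which illustrates why misspecified models do not bias the estimator in a randomized experiment.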
2.2 Privatizing Responses with Differential Privacy and Forced Randomized Response (FRR)
As before, let $Y_i$ be the original response that individual $i$ wishes to keep private and let $R_i$ be the "privatized" version of the original response that the investigator actually collects from an experiment. The privatized response is generated by a "privatizing" function $f$, where $f$ takes the original response $Y_i$ as the input and returns $R_i = f(Y_i)$ as the output. Some trivial examples of $f$ include: (1) a constant function where $f(y) = c$ for some fixed $c$ and any $y$, and (2) an identity function where $f(y) = y$ for any $y$. Intuitively, the first example is more privacy-preserving than the second example in that the constant function always produces the same value for every individual, making it impossible for an investigator to recover the original, sensitive response $Y_i$ from $R_i$. But, the first example's $f$ makes the ATE unidentifiable since everyone in the study has the same outcome, irrespective of the treatment. In contrast, in the second example with the identity function, identification of the ATE is possible, but it is not privacy-preserving since the investigator sees the original, sensitive response of individual $i$, i.e., $R_i = Y_i$. More broadly, Dwork and Roth [2014] argued that most non-randomized functions are inadequate to simultaneously preserve privacy and allow estimation and consequently, Dwork [2006] proposed differential privacy, a family of non-deterministic functions $f$ that take an input and stochastically generate an output. For example, if the input is individual $i$'s original response $Y_i$, $f$ may return $Y_i$ plus some stochastic noise, say a random value from a Laplace distribution. The investigator specifies $f$, carefully choosing how "random" $f$ should be in order to balance the need for data privacy and the need to estimate scientifically meaningful quantities; too much randomness would mean that the ATE becomes "less identifiable" while too little randomness would mean that the individual's data is less private.
The amount of this randomness is measured by the privacy loss parameter $\epsilon$, $\epsilon \geq 0$, and we say a random map $f$ is $\epsilon$-differentially private if changing the input $y$ to $y'$ only changes its output by a factor based on $\epsilon$; see Definition 2.1.
Definition 2.1 (Differential privacy [Dwork and Roth, 2014]).
A randomized map $f$ is $\epsilon$-differentially private for $\epsilon \geq 0$ if, for all pairs $y, y' \in \mathcal{Y}$ with $y \neq y'$ and for any measurable set $B$ in the range of $f$, we have $P\{f(y) \in B\} \leq e^{\epsilon} P\{f(y') \in B\}$.
Lower values of $\epsilon$ imply that $f$ becomes more privacy-preserving; in the extreme case where $\epsilon = 0$ and there is no loss in privacy, the output from $f$ is statistically indistinguishable for any pair of inputs $y$ and $y'$ and estimation of treatment effects would be impossible. An example of such an $f$ would be a fair coin toss where, regardless of the original input $y$, $P\{f(y) = 1\} = P\{f(y) = 0\} = 1/2$, i.e., individual $i$'s privatized response would be the result of a coin toss. On the other hand, by setting $\epsilon > 0$, $f$ allows some "signal" from the private outcome to be passed onto the privatized outcome so that treatment effects can be estimated. Various values of $\epsilon$ are used in practice [Apple, 2019, Qin et al., 2016, Fanti et al., 2015]. We also remark that unlike the usual definition of differential privacy, which considers privacy of databases containing records from multiple individuals, Definition 2.1 is specific to a single entry in a database and as such, is a local definition of differential privacy [Dwork and Roth, 2014, Duchi et al., 2018].
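When the input and output spaces are finite, the privacy loss of a randomized map can be computed mechanically as the largest log-likelihood ratio across pairs of inputs. A small sketch (the function name is ours) illustrating the two extremes discussed above, the fair coin ($\epsilon = 0$) and the identity map (no privacy):

```python
import math

def privacy_loss(cond):
    """Privacy loss of a randomized map with finite input/output spaces.

    cond[y][r] = P(f(y) = r).  Returns the smallest eps such that
    P(f(y) = r) <= exp(eps) * P(f(y') = r) for all inputs y, y' and
    outputs r, matching the differential privacy definition."""
    inputs = list(cond)
    outputs = list(next(iter(cond.values())))
    eps = 0.0
    for y in inputs:
        for y2 in inputs:
            for r in outputs:
                p, q = cond[y][r], cond[y2][r]
                if p > 0 and q == 0:
                    return math.inf  # output reveals the input: no privacy
                if p > 0 and q > 0:
                    eps = max(eps, math.log(p / q))
    return eps

# A fair coin ignores the input entirely: eps = 0 (perfect privacy, no signal).
coin = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
# The identity map reveals the input exactly: eps = infinity.
ident = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}
```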
The function $f$ that we will use in our proposed design is based on the forced randomized response (FRR) of Fox and Tracy [1984]. Broadly speaking, in an FRR, individuals use a randomization device, typically a six-sided die, and based on the result of the randomization, some are instructed or "forced" to give a specific response, say 0 or 1, regardless of their original response $Y_i$, while others are instructed to give their original response $Y_i$. For example, if individual $i$'s die roll lands on 1, individual $i$ is instructed to report a 0 to the investigator, i.e., $R_i = 0$, and if the die roll lands on 6, individual $i$ is instructed to report a 1 to the investigator, i.e., $R_i = 1$. If the die roll lands on anything but 1 or 6, individual $i$ is instructed to provide the original, true response, i.e., $R_i = Y_i$. A bit more formally, if $D_i \in \{0, 1, 2\}$ represents the different instructions, where $D_i = 0$ represents 'report 0', $D_i = 1$ represents 'report 1', and $D_i = 2$ represents 'report original response', with probabilities $p_0$, $p_1$, and $1 - p_0 - p_1$, respectively, an FRR map is defined as $f(Y_i) = 0$ if $D_i = 0$, $f(Y_i) = 1$ if $D_i = 1$, and $f(Y_i) = Y_i$ if $D_i = 2$.
A critical part of an FRR is that investigators are unaware of the result of the die roll; only the participant knows it. Hence, investigators have no idea whether $R_i$ from individual $i$ is the true response $Y_i$ or one of the forced responses 0 or 1. To put it differently, an FRR protects the privacy of individual $i$'s response by plausible deniability of any particular response. Nevertheless, investigators can choose the die probabilities, represented by $p_0$ and $p_1$, and these values affect both the privacy loss and efficiency of $f$; in Section 3.2, we present a formula relating the FRR parameters with the privacy loss parameter $\epsilon$ in differential privacy. For simplicity, we will collapse $p_0 = p_1 = p$ going forward so that there is one parameter $p$ parametrizing the privacy-preserving $f$.
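A sketch of the FRR map as code may help; the function names are ours, and `p` is the collapsed forced-response probability (a six-sided die corresponds to $p = 1/6$). The helper `unbias` inverts the relation $E[R] = p + (1 - 2p)E[Y]$, which follows directly from the FRR instructions.

```python
import random

def frr(y, p, rng=random):
    """Forced randomized response: report 0 w.p. p, report 1 w.p. p,
    and report the true response y w.p. 1 - 2p."""
    u = rng.random()
    if u < p:
        return 0   # instruction: report 0 (e.g., die roll of 1 when p = 1/6)
    if u < 2 * p:
        return 1   # instruction: report 1 (e.g., die roll of 6)
    return y       # instruction: report the original response

def unbias(rbar, p):
    """Invert E[R] = p + (1 - 2p) E[Y] to recover the mean of Y."""
    return (rbar - p) / (1 - 2 * p)
```

Only the participant observes which branch fired, which is precisely the plausible deniability the FRR provides.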
2.3 Noncompliance to Experimental Protocol and Cheaters
Suppose an individual is generally wary of sharing their response to treatment and does not trust the privacy-preserving nature of $f$. For example, a participant may feel that the die in an FRR is rigged and produce a response that deviates from the FRR protocol. Or, a participant, after being instructed by the die roll to report the true response, may feel uncomfortable doing so and instead report the opposite, say $R_i = 1 - Y_i$, to the investigator. These are instances of non-compliance to the experimental protocol and our goal is to still estimate causal effects in their presence.
To achieve this goal, we use the concept of "cheaters" from Clark and Desharnais [1998] in psychometric testing. Broadly speaking, cheaters are those who deviate from the experimental protocol laid out by investigators, whereas non-cheaters/honest individuals are those who follow the experimental protocol. For example, an individual who reports '0' (i.e., $R_i = 0$) even though the FRR prompt was to report '1' is a cheater; see Table 1 in the supplementary materials for more examples. Clark and Desharnais [1998] showed that if all cheaters are assumed to generate the same response to the investigators, say all cheaters only report $R_i = 0$, we can detect the presence of cheaters by (a) randomly splitting the study sample into two pieces, say $n$ individuals are split into subsamples of size $n_1$ and $n_2$, $n_1 + n_2 = n$, and (b) comparing appropriate statistics between the two subsamples. Our work extends this idea of using sample splitting to detect cheaters by relaxing the requirement that all cheaters generate the same response. We achieve this by using a different statistic to compare between the two subsamples, namely a variant of the compliance rate in instrumental variables [Angrist et al., 1996]; see Section 3.3 for details.
3. Robust, Private Randomized Controlled Trial
3.1 Protocol
Combining the aforementioned ideas on differential privacy and cheaters, we propose a robust and differentially private RCT, which we call a RP-RCT and which is laid out in Figure 1. In a RP-RCT, the investigator specifies (i) a treatment assignment probability $\pi_z$ that is away from 0 and 1 and (ii) two slightly different FRR maps parametrized by $p_1 \neq p_2$. The output of a RP-RCT is individual $i$'s (a) treatment assignment $Z_i$, (b) privatized response $R_i$, which may be contaminated due to cheaters, and (c) subsample label $S_i \in \{1, 2\}$ indicating which of the two subsamples individual $i$ belonged to or, equivalently, which FRR they were assigned. The investigator does not observe the original response $Y_i$, the cheating status of individual $i$, denoted as $C_i$ where $C_i = 1$ if $i$ is a cheater and $C_i = 0$ otherwise, or the result of the randomization device behind the FRR, $D_i$; these are denoted as unobserved variables in Figure 1. In particular, the variable $C_i$ can be thought of as a latent characteristic of individual $i$ that is never observed by the investigator, and the investigator only sees $R_i$ without knowing whether it came from a cheater (i.e., $C_i = 1$) or not (i.e., $C_i = 0$).
We make some remarks about the experimental protocol. First, compared to a traditional RCT, a RP-RCT adds two additional steps: the sample splitting step to be robust against cheaters and the FRR step to privatize the study participant's response. Here, for simplicity, the sample splitting step creates two equal-sized, random subsamples of individuals, but this can be relaxed at the expense of additional notation. Second, a RP-RCT satisfies assumptions (A2) and (A3) because the treatment is still randomly assigned to individuals with probability $\pi_z$. Also, so long as the treatment is well-defined and does not cause interference (i.e., does not violate SUTVA), all of assumptions (A1)-(A3) are satisfied by a RP-RCT. Third, while we only consider the FRR as our $f$, we can replace $f$ with another differentially private algorithm. Fourth, because the treatment is randomized, the pre-treatment covariates $X_i$ can take on any value (e.g., missing, censored, etc.) and this will not impact the identification strategy.
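The protocol can be summarized as a small simulation sketch; the outcome distributions, cheater behavior, and parameter values below are hypothetical stand-ins, and only the triple $(Z_i, R_i, S_i)$ is returned to the investigator.

```python
import random

def rp_rct(n, p1, p2, pi_z, cheat_rate, rng=random):
    """Simulate one RP-RCT (hypothetical data-generating process).

    Returns per-individual (Z, R, S): treatment, privatized response,
    and subsample label.  The true outcomes, cheater status, and die
    result are generated but never revealed to the investigator."""
    data = []
    for _ in range(n):
        s = 1 if rng.random() < 0.5 else 2        # random sample split
        z = int(rng.random() < pi_z)              # randomized treatment
        # latent true (sensitive) outcome; distributions are illustrative
        y = int(rng.random() < (0.5 if z else 0.3))
        p = p1 if s == 1 else p2                  # subsample-specific FRR
        if rng.random() < cheat_rate:             # cheater: ignores the FRR
            r = int(rng.random() < 0.5)           # e.g., reports a coin flip
        else:                                     # non-cheater: follows FRR
            u = rng.random()
            r = 0 if u < p else 1 if u < 2 * p else y
        data.append((z, r, s))
    return data
```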
3.2 Differential Privacy Guarantee
The following theorem shows that among non-cheaters, the privatized response $R_i$ generated from a RP-RCT is differentially private.
Theorem 1 (Differential Privacy of RP-RCT).
Consider a non-cheater's true response $Y_i \in \{0, 1\}$. Then, for FRR parameters $p_1, p_2 < 1/2$, a RP-RCT which generates his/her privatized response $R_i$ is $\epsilon$-differentially private with $\epsilon = \max_{s \in \{1, 2\}} \ln\{(1 - p_s)/p_s\}$.
In words, Theorem 1 states that regardless of the non-cheater's true response $Y_i$, the privatized outcome $R_i$ generated from a RP-RCT is private up to some privacy loss parameter. The exact privacy loss depends on the FRR parameters $p_1$ and $p_2$, and investigators can choose them to achieve the desired level of privacy loss. Also, Theorem 1 does not make any claims about differential privacy for cheaters. This is because cheaters can choose to report any response, which may or may not be differentially private. For example, if a cheater provides the result of a random coin flip as $R_i$ irrespective of his/her true response $Y_i$, then his/her response achieves perfect differential privacy, i.e., $\epsilon = 0$. However, if a cheater always provides the opposite of his/her true response to potentially hide it, i.e., $R_i = 1 - Y_i$, then his/her data, despite their best intentions, is never differentially private. Without making an assumption about cheaters' behaviors, we cannot make any guarantees about their responses' differential privacy.
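For the collapsed FRR with forced-response probability $p < 1/2$ for each of '0' and '1', a direct calculation gives $P(R = 1 \mid Y = 1) = 1 - p$ and $P(R = 1 \mid Y = 0) = p$, so the worst-case likelihood ratio is $(1 - p)/p$. The resulting link between $p$ and the privacy loss $\epsilon$ can be computed and inverted (function names are ours):

```python
import math

def frr_privacy_loss(p):
    """Privacy loss of an FRR with forced-response probability p each
    for '0' and '1' (0 < p < 1/2): the worst-case likelihood ratio
    between the two inputs is (1 - p) / p."""
    return math.log((1 - p) / p)

def frr_parameter(eps):
    """Invert eps = log((1 - p) / p) to find the p achieving a target
    privacy loss eps; useful when a regulator fixes eps first."""
    return 1 / (1 + math.exp(eps))
```

For example, the six-sided-die FRR ($p = 1/6$) yields $\epsilon = \ln 5 \approx 1.61$; smaller $p$ forces responses less often and so leaks more.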
3.3 Identification of the ATE
To lay out the identification strategy with a RP-RCT, we assume the following conditions hold.
(A4) Random sample splitting: $S_i \perp \{Y_i(1), Y_i(0), C_i, X_i\}$ with $P(S_i = 1) = 1/2$.

(A5) Randomization in FRR: $D_i \perp \{Y_i(1), Y_i(0), C_i, X_i\} \mid S_i$ with $P(D_i = 0 \mid S_i = s) = P(D_i = 1 \mid S_i = s) = p_s$ and $p_1 \neq p_2$.

(A6) Extended randomization of treatment: $Z_i \perp \{Y_i(1), Y_i(0), C_i, S_i, D_i, X_i\}$.
Assumption (A4) states that the two subsamples in a RP-RCT were split randomly. Assumption (A5) states that the randomization device in the FRR (i.e., the die roll) is random and that the two FRRs used in the two subsamples are different. Assumption (A6) is a reiteration of the treatment randomization assumption (A2), except we now include the new variables introduced as part of a RP-RCT: $S_i$, $D_i$, and $C_i$. Note that assumptions (A4)-(A6) are satisfied by the design of a RP-RCT.
Let $\pi_c$ be the proportion of cheaters in the population, i.e., $\pi_c = P(C_i = 1)$. We now state assumptions about the cheater status $C_i$. While plausible in many settings, these assumptions may not always be satisfied by the design of a RP-RCT.
(A7) Cheater's response: $R_i \perp Z_i \mid C_i = 1$.

(A8) Proportion of non-cheaters: $\pi_c < 1$.
Assumption (A7) states that a cheater gives the same response to the investigator regardless of whether he/she was randomized to treatment or control; note that assumption (A7) does not say that all cheaters produce the same response (i.e., the assumption underlying Clark and Desharnais [1998]). Assumption (A7) is plausible if the treatment assignment is blinded so that the participant does not know which treatment he/she is receiving and thus cannot use this information to change his/her final response to the investigator. Also, assumption (A7) still allows a cheater to use his/her original response $Y_i$, potential outcomes $Y_i(1), Y_i(0)$, or pre-treatment characteristics to tailor his/her privatized response $R_i$. For example, if a cheater reports a constant value, say $R_i = 0$, or the opposite of his/her true response, say $R_i = 1 - Y_i$, assumption (A7) will still hold. Or if some cheaters' privatized responses depend on unmeasured, pre-treatment characteristics, say cheaters who are more privacy-conscious report $R_i = 0$ while less privacy-conscious cheaters report a mixture of 0 or 1, assumption (A7) will hold. However, if a cheater uses the treatment assignment $Z_i$ to change his/her response $R_i$, say the cheater reports $R_i = 1$ if they receive treatment and $R_i = 0$ if they receive control, assumption (A7) is violated. Assumption (A8) states that not all participants in the study are cheaters. If assumption (A8) does not hold and every participant is a cheater, we cannot identify any treatment effect. Also, in Section 3.4, we present a way to assess assumption (A8) with the observed data by estimating the proportion of cheaters. Overall, so long as the treatment is blinded and there is at least one non-cheater, a RP-RCT plausibly satisfies assumptions (A1)-(A8) by design.
We now show that the data from a RP-RCT can identify the ATE among non-cheaters.
Theorem 2 (Identification of the ATE Under RP-RCT).
Consider a RP-RCT with FRR parameters $p_1 \neq p_2$ set by the investigator and observed data $(Z_i, R_i, S_i)$, $i = 1, \ldots, n$. Under assumptions (A1)-(A8), the ATE among non-cheaters, denoted as $\tau_{C=0} = E[Y_i(1) - Y_i(0) \mid C_i = 0]$, is identified from data via
(3) $\tau_{C=0} = \dfrac{E[R_i \mid Z_i = 1, S_i = s] - E[R_i \mid Z_i = 0, S_i = s]}{(1 - 2 p_s)(1 - \pi_c)}$ for each $s \in \{1, 2\}$.
Additionally, the proportion of cheaters $\pi_c$ is identified via
(4) 
Theorem 2 shows that our new design can identify the ATE among non-cheaters by taking the difference in the averages of the privatized outcomes among treated and control units, reweighted by the FRR parameters and the proportion of cheaters $\pi_c$. If there are no cheaters in the population, Theorem 2 implies $\tau_{C=0} = \tau$ and we can identify the ATE for the entire population. In contrast, if everyone is a cheater, we cannot identify the treatment effect; intuitively, if everyone is a cheater, disregards the FRR, and reports, say, $R_i = 0$, it would be impossible to learn the effect of the treatment on the response. Generalizing this intuition, we can only identify the effect for the subpopulation of individuals who are non-cheaters, even if the investigator does not know who the cheaters and non-cheaters are. We remark that this result is similar in spirit to the local average treatment effect in the non-compliance literature [Angrist et al., 1996], where, under non-compliance, only the effect for the subpopulation of compliers can be identified from data.
Theorem 2 also shows that we can identify the proportion of cheaters $\pi_c$. While the exact formula is complex, roughly, we measure the excess proportion of privatized outcomes relative to what we would expect had everyone followed the FRR and use additional moment conditions generated from sample splitting to identify $\pi_c$; see Section B of the supplementary materials for details. Overall, Theorems 1 and 2 show that we can identify the treatment effect with privatized outcomes $R_i$, some of which may be contaminated by cheaters. The proposed design has some key parameters, $p_1$ and $p_2$, that govern the privacy loss parameter $\epsilon$. They also affect estimation and testing of $\tau_{C=0}$, which we discuss below.
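The identification logic can be illustrated by a small Monte Carlo: among non-cheaters, the FRR gives $E[R \mid Z = z] = p + (1 - 2p)\,E[Y(z) \mid C = 0]$, and under assumption (A7) cheaters contribute the same mean to both arms, so rescaling the difference-in-means of the privatized responses by $(1 - 2p)(1 - \pi_c)$ recovers the ATE among non-cheaters. A sketch with hypothetical parameter values, a single FRR, and $\pi_c$ treated as known for illustration:

```python
import random

def simulate_arm_means(n, p, pi_c, rng):
    """Simulate privatized responses; the true non-cheater ATE is 0.3."""
    s1 = s0 = n1 = n0 = 0
    for _ in range(n):
        z = int(rng.random() < 0.5)               # randomized treatment
        if rng.random() < pi_c:                   # cheater: ignores the FRR
            r = int(rng.random() < 0.5)           # e.g., reports a coin flip
        else:                                     # non-cheater: follows FRR
            y = int(rng.random() < (0.6 if z else 0.3))  # Y(1) vs Y(0)
            u = rng.random()
            r = 0 if u < p else 1 if u < 2 * p else y
        if z:
            s1, n1 = s1 + r, n1 + 1
        else:
            s0, n0 = s0 + r, n0 + 1
    return s1 / n1, s0 / n0

p, pi_c = 1 / 6, 0.2
m1, m0 = simulate_arm_means(200_000, p, pi_c, random.Random(7))
tau_hat = (m1 - m0) / ((1 - 2 * p) * (1 - pi_c))  # rescaled difference
```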
3.4 Difference-in-Means Estimator of the ATE
Using the identification result in Section 3.3, we can construct an estimator $\hat{\tau}$ of $\tau_{C=0}$ by replacing the expectations in Theorem 2 with their sample counterparts, i.e.,
(5) $\hat{\tau} = \dfrac{1}{2} \sum_{s=1}^{2} \dfrac{\bar{R}_{1s} - \bar{R}_{0s}}{(1 - 2 p_s)(1 - \hat{\pi}_c)}$, where $\bar{R}_{zs}$ denotes the sample average of $R_i$ among individuals with $Z_i = z$ and $S_i = s$.
Note that the estimator $\hat{\tau}$ can be thought of as an extension of the simple difference-in-means estimator in a RCT, reweighted by the estimated proportion of non-cheaters and the privacy-preserving map. Also, an estimate $\hat{\pi}_c$ of the proportion of cheaters can be obtained by replacing the expectations in Theorem 2 with their sample counterparts, i.e.,
(6) 
The estimate $\hat{\pi}_c$ is the maximum likelihood estimate of $\pi_c$; see Section B.3.1 of the supplementary materials for details. In small samples, $\hat{\pi}_c$ may exceed the bounds of 0 and 1, especially if $p_1$ and $p_2$ are close to each other. In this case, we follow Clark and Desharnais [1998]'s advice, where we evaluate the likelihood at the estimated $\hat{\pi}_c$ and at the boundary points and pick the value of $\pi_c$ corresponding to the highest likelihood.
Theorem 3 shows that $\hat{\tau}$ is a consistent and asymptotically Normal estimator of $\tau_{C=0}$.
Theorem 3 (Asymptotic Properties of $\hat{\tau}$).
Suppose the observed data $(Z_i, R_i, S_i)$, $i = 1, \ldots, n$, are i.i.d. and generated from a RP-RCT. Then, we have $\sqrt{n}(\hat{\tau} - \tau_{C=0}) \xrightarrow{d} N(0, \sigma^2)$. Moreover, the asymptotic variance $\sigma^2$ can be consistently estimated by the plug-in estimator $\hat{\sigma}^2$ obtained by replacing the population moments in $\sigma^2$ with their sample counterparts.
Theorem 3 can be used as a basis to construct confidence intervals and p-values for testing the null hypothesis $H_0: \tau_{C=0} = 0$. For example, a Wald-style test statistic for $H_0$ would be $\hat{\tau}/\sqrt{\hat{\sigma}^2/n}$, where $\hat{\sigma}^2$ is the estimated asymptotic variance, and one would reject the null in favor of a two-sided alternative at level $\alpha$ if the absolute value of the test statistic exceeds $z_{1-\alpha/2}$, the $1 - \alpha/2$ quantile of the standard Normal distribution. We can also construct a Wald-based, level $1 - \alpha$, two-sided confidence interval for $\tau_{C=0}$ as $\hat{\tau} \pm z_{1-\alpha/2}\sqrt{\hat{\sigma}^2/n}$. Note that similar to the usual difference-in-means estimator in Section 2.1, practitioners could also use the bootstrap to estimate the standard error of $\hat{\tau}$ and its associated confidence interval.
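The Wald machinery is generic; given a point estimate and its estimated standard error, the test statistic, two-sided p-value, and confidence interval can be computed as follows (a sketch, not specific to any one estimator):

```python
from statistics import NormalDist

def wald_inference(tau_hat, se, alpha=0.05):
    """Wald test of H0: tau = 0 and a (1 - alpha) confidence interval,
    given a point estimate tau_hat and its estimated standard error se."""
    nd = NormalDist()
    z = tau_hat / se                       # Wald test statistic
    pval = 2 * (1 - nd.cdf(abs(z)))        # two-sided p-value
    zq = nd.inv_cdf(1 - alpha / 2)         # standard Normal quantile
    return z, pval, (tau_hat - zq * se, tau_hat + zq * se)
```

For instance, an estimate of 0.2 with standard error 0.1 gives $z = 2$ and a p-value of about 0.046, rejecting $H_0$ at the 5% level.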
3.5 Cost of Data Privacy in RPRCT
In this section, we compare a RP-RCT to a traditional RCT as a way to study the cost of guaranteeing differential privacy in terms of statistical efficiency. To begin, suppose there are no cheaters, i.e., $\pi_c = 0$, so that both $\hat{\tau}$ and $\hat{\tau}_{\text{DM}}$ estimate the same parameter $\tau$. The following theorem shows that $\hat{\tau}$ is never as asymptotically efficient as $\hat{\tau}_{\text{DM}}$; in short, there is a statistical cost of using a RP-RCT to guarantee differential privacy of an individual's response.
Theorem 4 (Relative Asymptotic Efficiency of $\hat{\tau}$ and $\hat{\tau}_{\text{DM}}$).
For any FRR parameters $p_1, p_2 \in (0, 1/2)$ and $\pi_c = 0$, the asymptotic relative efficiency of $\hat{\tau}$ with respect to $\hat{\tau}_{\text{DM}}$, denoted as $\text{RE}$, satisfies $\text{RE} \leq 1$. Also, $\text{RE}$ is monotonically increasing in the privacy loss $\epsilon$.
The relative efficiency between the two estimators, measured by $\text{RE}$, is determined by the privacy loss $\epsilon$, which in turn is governed by the FRR parameters $p_1$ and $p_2$. In particular, as individuals' responses become less private through an increase in the privacy loss $\epsilon$, the relative efficiency approaches 1. In fact, only when the privacy loss approaches infinity, i.e., $\epsilon \to \infty$, does the relative efficiency equal 1, and thus $\hat{\tau}$ will never be as efficient as $\hat{\tau}_{\text{DM}}$ as long as we want to guarantee some amount of data privacy. Investigators can use the formula in Theorem 4 to assess what value of the privacy loss $\epsilon$ works best in their own studies. In particular, we recommend investigators specify $\epsilon$ based on (a) their tolerance for loss in efficiency in exchange for more privacy, (b) quantities such as the outcome means under treatment and control, which may be informed by subject-matter experts during the planning stage of the experiment, and/or (c) recommended data privacy standards from relevant regulatory bodies.
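The efficiency cost of privatization can also be seen in a small Monte Carlo with no cheaters and hypothetical outcome distributions: each simulated experiment is analyzed once with the true outcomes and once with FRR-privatized outcomes, and the privatized estimator exhibits a larger sampling variance.

```python
import random, statistics

def dm(pairs):
    """Difference-in-means over (z, value) pairs."""
    t = [v for z, v in pairs if z == 1]
    c = [v for z, v in pairs if z == 0]
    return sum(t) / len(t) - sum(c) / len(c)

def one_rep(n, p, rng):
    """One simulated experiment, no cheaters: (RCT estimate, private estimate)."""
    rct, priv = [], []
    for _ in range(n):
        z = int(rng.random() < 0.5)
        y = int(rng.random() < (0.6 if z else 0.3))   # true ATE is 0.3
        u = rng.random()
        r = 0 if u < p else 1 if u < 2 * p else y     # FRR-privatized outcome
        rct.append((z, y))
        priv.append((z, r))
    return dm(rct), dm(priv) / (1 - 2 * p)            # rescale private estimate

rng = random.Random(0)
reps = [one_rep(2000, 1 / 6, rng) for _ in range(200)]
var_rct = statistics.variance([a for a, _ in reps])
var_priv = statistics.variance([b for _, b in reps])
# Privacy costs efficiency: var_priv exceeds var_rct across replications.
```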
3.6 Covariate Adjustment and Doubly Robust Estimation
Similar to a RCT, suppose a RP-RCT collected pre-treatment covariates, which may be missing, contaminated, and/or corrupted. We propose to use the pre-treatment covariates to develop a more efficient estimator, without incurring additional bias, by using a doubly robust estimator [Bang and Robins, 2005]. Formally, consider postulated working models for the true outcome regressions under treatment and control, respectively. For simplicity, we assume that these functions are fixed, but our result below holds if they were estimated at fast enough rates, say those based on parametric models. The resulting doubly robust estimator is given in equation (7).
The following theorem shows that the estimator in (7) is a consistent, asymptotically Normal estimate of the target parameter even if the working models are misspecified.
Theorem 5 (Consistency and Asymptotic Normality of ).
Suppose the same assumptions as in Theorem 3 hold. Then, for any fixed working models of the privatized outcomes, the estimator in (7) is consistent and asymptotically Normal. Also, its asymptotic variance can be consistently estimated.
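A generic doubly robust (AIPW-style) computation of this flavor can be sketched as follows, assuming a known, constant randomization probability pi and fixed, already-evaluated working-model values; the paper's estimator in (7) additionally handles the privatized (FRR) outcomes, which this sketch does not.

```python
import math

def aipw_ate(y, a, m1, m0, pi=0.5):
    """Generic AIPW (doubly robust) ATE estimate with fixed working models.

    y  : outcomes (treated generically here; the paper's estimator works on
         privatized outcomes and includes additional FRR de-biasing)
    a  : treatment indicators in {0, 1}
    m1 : working-model fitted values for E[Y | A = 1, X], one per subject
    m0 : working-model fitted values for E[Y | A = 0, X]
    pi : known, constant randomization probability of treatment (assumed)
    """
    n = len(y)
    psi = [m1[i] - m0[i]
           + a[i] * (y[i] - m1[i]) / pi
           - (1 - a[i]) * (y[i] - m0[i]) / (1 - pi)
           for i in range(n)]                 # per-subject influence terms
    tau_hat = sum(psi) / n
    var = sum((v - tau_hat) ** 2 for v in psi) / (n - 1)
    se_hat = math.sqrt(var / n)               # plug-in standard error
    return tau_hat, se_hat
```

Because treatment is randomized with known pi, the estimate remains consistent even when m1 and m0 are badly misspecified (e.g., constant), which is the double robustness that Theorem 5 formalizes for the privatized setting.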
4. Evaluation of Online Statistics Courses at the University of Wisconsin-Madison During COVID-19
4.1 Background and Motivation
In light of many courses shifting online during COVID-19, the Department of Statistics at the University of Wisconsin-Madison was interested in evaluating which type of online video lecture best aided conceptual understanding, information retention, and problem-solving among students taking the Department's courses. Specifically, one of the questions of interest was whether instructor-present lecture videos (i.e., treatment) led to a better learning experience for students compared to instructor-absent online lectures (i.e., control); see Figure 2 for an example. Some prior works [Wilson et al., 2018, Kizilcec et al., 2014] found no evidence that instructor-present lecture videos had a significant impact on student learning in terms of attention and comprehension. However, other works [Pi and Hong, 2016, Wang et al., 2020] found the opposite: instructor-present lecture videos enhanced student learning. Regardless, the participant data from these works are not publicly available due to the sensitive nature of students' educational data. Additionally, there was concern that students may be less willing to provide their true attention and retention rates when asked by the Department. For example, some students might not be comfortable admitting to not paying attention to online lectures and may simply lie to the investigator, leading to potentially biased results.
Compared to using a RCT, using a RP-RCT provides numerous remedies for the aforementioned issues. First, students' data are guaranteed to be differentially private, which may encourage more honest participation. Second, even if some students remain dishonest and are cheaters, a RP-RCT still provides a robust estimate of the treatment effect. Third, the data can be shared publicly while students' data privacy is still preserved via differential privacy; as mentioned earlier, the experimental protocol, including sharing students' response data, was approved by the Education and Social/Behavioral Science IRB of the University of Wisconsin-Madison.
4.2 Study Participants, Treatment Arms, and Outcomes
The study population consisted of students enrolled in introductory statistics classes at the University of Wisconsin-Madison during the Spring 2021 semester. Electronic, informed consent was obtained from all participants before enrollment. Once a student gave consent, the study collected the following pre-treatment covariates: gender, race/ethnicity, year in college, major or field of study, prior subject-matter knowledge, previous exposure to and grades in statistics/mathematics/computer science classes, self-rated interest in statistics, proficiency in English, amount of experience with video lectures in past or current semesters, and preference on online lecture format. Students had the option to not provide answers to any of the pre-treatment covariates. Afterwards, students were randomly placed into one of two subgroups and, within each subgroup, they were randomly assigned to treatment or control. The control arm had a narrated 'Instructor Absent' (IA) video format where students see the lecture slides and hear an audio narration of the lecture from the instructor. The treatment arm was identical to the control arm except it used an 'Instructor Present' (IP) video format where the instructor's face was embedded in the upper-right corner of the lecture video. Both lecture videos were 13 minutes long and introduced identical statistical concepts, specifically about RCTs. We remark that RCTs were not covered in the classes in which the study was conducted. Also, the study used two FRRs, one per subgroup, whose parameters jointly determined the study's privacy loss. This value was based on our own preference for data privacy at the expense of efficiency, where we were willing to tolerate a modest increase in standard error under a RP-RCT relative to a RCT, and on consultation with the University's IRB, which, among other things, gave approval to release the student-level data to the public for future replication and analysis.
After students watched either the IA or IP lecture video, they were asked a series of questions, notably on four areas of student learning (i.e., Attention, Retention, Judgement of Learning and Comprehension) used by previous works [Kizilcec et al., 2014, Pi and Hong, 2016, Wang and Antonenko, 2017, Wilson et al., 2018]. All four outcomes were ‘yes/no’ questions; the exact wording of the questions is in Table 1.
For our working models for the privatized outcomes, we used logistic regression models that minimized the Akaike information criterion [Bozdogan, 1987]; see Section C of the supplementary materials for additional details.

Primary Outcome  Question
Attention  “I found it hard to pay attention to the video.”
Retention  “I was unable to recall the concepts while attempting the follow-up quiz.”
Judgement of Learning  “I don’t feel that I learnt a great deal by watching the video.”
Comprehension  “I found the topic covered in the video to be hard.”
We remark that, as part of the RP-RCT protocol, students were prompted to ‘roll a die’ using an online die-rolling tool (i.e., the FRR device). Students were allowed to roll the die only once, and the resulting roll was visible to the student only. Based on the outcome of the roll, students were asked to answer the four questions in Table 1 using the FRR prompt presented in panel C of Figure 2. Also, students had to roll the die before being presented with the four questions.
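A minimal simulation of one such die-based FRR answer might look as follows; the die-to-response mapping here (a 1 forces ‘yes’, a 6 forces ‘no’, anything else is truthful) is purely illustrative, as the study's actual mapping is the one given in its FRR prompt (Figure 2, panel C).

```python
import random

def frr_response(true_answer, rng=random):
    """Simulate one die-based forced randomized response.

    Hypothetical mapping for illustration: a roll of 1 forces a 'yes', a roll
    of 6 forces a 'no', and any other roll yields the truthful answer.
    """
    roll = rng.randint(1, 6)   # one roll of a fair six-sided die
    if roll == 1:
        return True            # forced 'yes'
    if roll == 6:
        return False           # forced 'no'
    return true_answer         # truthful report
```

Crucially, only the student sees the roll, so any individual ‘yes’ is plausibly a forced answer; this plausible deniability is what yields the differential privacy guarantee.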
4.3 Results
Table 2 presents the estimates of the average treatment effect amongst honest participants for the four outcomes. Honest students indicated that it was easier to pay attention to instructor-present (IP) video lectures compared to instructor-absent (IA) video lectures. In contrast, the ability to retain concepts was higher amongst students randomized to the instructor-absent (IA) video lectures. There was no significant difference between students who watched IP and IA video lectures in terms of Judgement of Learning and Comprehension. Also, Section C of the supplementary materials presents covariate balance between the IP and IA subgroups and, as expected, we found no differences between the two groups in terms of their baseline pre-treatment covariates. Our results also suggest that cheating varies with the question: the estimated proportions of cheaters for the Attention and Judgement of Learning questions were smaller than those for the Retention and Comprehension questions. These differences may suggest that students might be more apprehensive about sharing outcomes related to their learning abilities (i.e., Retention, Comprehension) than outcomes related to instruction (e.g., the instructor's ability to engage students and transfer knowledge).
Primary Outcome  

Attention  
Retention  
Judgement of Learning  
Comprehension 
Our results largely agree with previous works on online video lectures. For example, our results and those of Kizilcec et al. [2014], Pi and Hong [2016], and Wang et al. [2020] agree that IP lectures receive considerably more attention than IA lectures. Also, our findings on retention, judgement of learning, and comprehension match those in Wang and Antonenko [2017], Wang et al. [2020], and Wilson et al. [2018]. However, we remark that there is work [Kizilcec et al., 2014] suggesting the opposite of what we find on retention. Finally, unlike these works, all of the student-level data and code are publicly available for replication and future analysis, especially if investigators want to combine these data with future datasets to boost power in related evaluations of online video lectures.
5. Conclusion and Discussion
We propose a new experimental design to evaluate the effectiveness of a program, policy, or treatment on an outcome that may be sensitive, with a particular focus on online education programs where students' response data are often sensitive. Our design, a RP-RCT, has differential privacy guarantees while also allowing estimation of treatment effects. A RP-RCT also accommodates cheaters who may not trust the privacy-preserving nature of our design and provide arbitrary responses that may further protect their privacy. We provide two consistent, asymptotically Normal estimators, one of which allows for covariate adjustment. We also assess the trade-off between differential privacy and statistical efficiency. We conclude by using a RP-RCT to evaluate different types of online video lectures in the Department of Statistics at the University of Wisconsin-Madison; our results largely agree with existing findings on online video lectures, while preserving students' data privacy and allowing the data to be shared for future replication.
References
 Teachers and student achievement in the Chicago public high schools. Journal of Labor Economics 25(1), pp. 95–135.
 Identification of causal effects using instrumental variables. Journal of the American Statistical Association 91(434), pp. 444–455.
 HIPAA regulations — a new era of medical-record privacy? New England Journal of Medicine 348(15), pp. 1486–1490.
 Apple differential privacy technical overview.
 Doubly robust estimation in missing data and causal inference models. Biometrics 61(4), pp. 962–973.
 The ethics of online research with unsuspecting users: from A/B testing to C/D experimentation. Research Ethics 13(3–4), pp. 200–218.
 A 61-million-person experiment in social influence and political mobilization. Nature 489(7415), pp. 295–298.
 Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52(3), pp. 345–370.
 Honest answers to embarrassing questions: detecting cheating in the randomized response model. Psychological Methods 3(2), pp. 160–168.
 Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association 113(521), pp. 182–201.
 The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3–4), pp. 211–407.
 Differential privacy for statistics: what we know and what we want to learn. Journal of Privacy and Confidentiality 1(2).
 Differential privacy. In 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), Lecture Notes in Computer Science, Vol. 4052, pp. 1–12.
 An introduction to the bootstrap. CRC Press.
 RAPPOR: randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14).
 Limiting privacy breaches in privacy preserving data mining. In Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '03), pp. 211–222.
 Building a RAPPOR with the unknown: privacy-preserving learning of associations and data dictionaries. arXiv:1503.01214.
 Measuring associations with randomized response. Social Science Research 13, pp. 188–197.
 2018 reform of EU data protection rules.
 Causal inference: what if. Chapman and Hall/CRC.
 Netflix Prize update. The Netflix Blog.
 Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge University Press.
 Showing face in video instruction: effects on information retention, visual attention, and affect. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '14), pp. 2095–2102.
 t-Closeness: privacy beyond k-anonymity and l-diversity. In IEEE 23rd International Conference on Data Engineering, pp. 106–115.
 Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23, pp. 2937–2960.
 l-Diversity: privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), p. 24.
 Robust de-anonymization of large sparse datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy.
 Sur les applications de la théorie des probabilités aux expériences agricoles: essai des principes. Master's thesis; excerpts reprinted in English in Statistical Science 5, pp. 463–472.
 45 CFR 46: Protection of human subjects.
 Learning process and learning outcomes of video podcasts including the instructor and PPT slides: a Chinese case. Innovations in Education and Teaching International 53(2), pp. 135–144.
 Heavy hitter estimation over set-valued data with local differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16), pp. 192–203.
 The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), pp. 41–55.
 Randomization analysis of experimental data: the Fisher randomization test comment. Journal of the American Statistical Association 75, pp. 591–593.
 Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), pp. 688–701.
 Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering 13(6), pp. 1010–1027.
 Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy Working Paper 3.
 k-Anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5), pp. 557–570.
 Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Statistics in Medicine 27(23), pp. 4658–4677.
 Instructor presence in instructional video: effects on visual attention, recall, and perceived learning. Computers in Human Behavior 71, pp. 79–89.
 Does visual attention to the instructor in online video affect learning and learner perceptions? An eye-tracking analysis. Computers and Education 146, 103779.
 Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60, pp. 63–69.
 Instructor presence effect: liking does not always lead to learning. Computers and Education 122, pp. 205–220.
 Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics 64(3), pp. 707–715.