The stratified micro-randomized trial design: sample size considerations for testing nested causal effects of time-varying treatments

11/09/2017 ∙ by Walter Dempsey, et al. ∙ University of Michigan ∙ University of Memphis

Technological advancements in the field of mobile devices and wearable sensors have helped overcome obstacles in the delivery of care, making it possible to deliver behavioral treatments anytime and anywhere. Increasingly the delivery of these treatments is triggered by predictions of risk or engagement which may have been impacted by prior treatments. Furthermore the treatments are often designed to have an impact on individuals over a span of time during which subsequent treatments may be provided. Here we discuss our work on the design of a mobile health smoking cessation experimental study in which two challenges arose. First the randomizations to treatment should occur at times of stress and second the outcome of interest accrues over a period that may include subsequent treatment. To address these challenges we develop the "stratified micro-randomized trial," in which each individual is randomized among treatments at times determined by predictions constructed from outcomes to prior treatment and with randomization probabilities depending on these outcomes. We define both conditional and marginal proximal treatment effects. Depending on the scientific goal these effects may be defined over a period of time during which subsequent treatments may be provided. We develop a primary analysis method and associated sample size formulae for testing these effects.




1 Introduction

The rise of wearable technologies has generated increased scientific interest in the use and development of mobile interventions. Such mobile technology holds promise in providing accessible support to individuals in need. Mobile interventions to maintain adherence to HIV medication and to support smoking cessation, for example, have shown sufficient effectiveness to be recommended for inclusion in health services (Free et al., 2013). Increasingly scientists aim to trigger delivery of treatments based on predictions, such as of risk or engagement, which are themselves outcomes of prior treatments. In these settings scientists are interested in assessing nested treatment effects. For example, a scientist may want to understand whether providing a treatment at a time of high risk (Hovsepian et al., 2015) is effective. Often times of high risk occur infrequently. In these cases randomization to treatment might be triggered by a risk prediction so as to avoid providing treatment at the wrong time and potentially providing too much treatment. Furthermore the scientist may want to detect these treatment effects over the next hour, during which subsequent treatments may be delivered.

In this paper, we propose the stratified micro-randomized trial design because it is critical to stratify randomization to ensure sufficient occasions where the variable of interest (denoted $X_t$), such as risk, takes a particular value $x$ and treatment is provided, and sufficient occasions where $X_t = x$ and treatment is not provided. In these settings, the outcome of interest may require a period of time over which to develop; during this time period further treatment might be provided. To address this we provide a careful definition of the desired treatment contrast and introduce the notion of a reference distribution. We proceed by developing an appropriate test statistic for the desired treatment contrast. The associated sample size calculation is non-trivial due to the unknown form of the non-centrality parameter. Moreover, the distribution of $X_t$ over time is unknown. Therefore we develop an approach to formulating a simulation based sample size calculator to accommodate the unknown longitudinal distribution of $X_t$. The calculator requires the scientist to specify a generative model for the history $H_t$ which achieves the specified alternative treatment effect. However existing data sets that include the use of the required sensor suites, and thus can be used to guide the form of the generative model, are often small and do not include treatment. To address this we provide a protocol for the use of such noisy, small datasets to inform the selection of the generative model, leading to a data-driven, simulation-based sample size calculator. We also illustrate how exploratory data analysis and over-fitting of the same data can be used in constructing a feasible set of deviations to which the sample size calculator should be robust.

This work is motivated by our participation in a mobile health smoking cessation study, in which an average of 3 stress-reduction treatments should be delivered per day, half at times the participant is classified as stressed and half at times the participant is not classified as stressed. We use data from an observational, no treatment, study of individuals (Sarker et al., 2017; Saleheen et al., 2015) who are attempting to quit smoking to construct the generative model underlying the simulation based sample size calculator. The data directly informs the generative model under no treatment. We then build a generative model under treatment by combining the generative model under no treatment with the targeted alternative treatment effect. We next over-fit the noisy, small data to suggest potential deviations to which we assess robustness of the sample size calculator.

1.1 Related work

We build upon prior work in experimental design and on data analysis methods for time-varying causal effects. We outline this related work below, highlighting key differences to our current setting.

1.1.1 Micro-randomized trials

Recently micro-randomized trial designs (Liao et al., 2016; Dempsey et al., 2015) were developed for testing proximal and delayed effects of treatment (Klasnja et al., 2015). While in these trials treatment is sequentially randomized per participant, this approach does not permit the randomization probabilities to depend on features of the participant’s observation history. This restriction is quite problematic. Indeed due to the rapid increase in sensor technology and the ability of various machine learning methods to provide real-time predictions, it is now feasible for scientists to trigger treatments based on these predictions or other features of the participant’s observation history. A critical question is whether triggering a treatment based on such features is effective. Often these features may be impacted by prior treatment. Furthermore the responses of greatest interest may be defined over a span of time during which subsequent treatments may be delivered, yet the approach developed in Liao et al. (2016) does not accommodate this. We designed the stratified micro-randomized trial specifically for this more complex setting.

1.1.2 N-of-1 trials

At first glance, the micro-randomized trial design appears similar to the N-of-1 trial design frequently used in the behavioral sciences. However the estimand is quite different. We will, as is typical in statistical causal inference, consider average causal effects, possibly conditional on covariates. In the behavioral field N-of-1 trials are used most often to ascertain individual level causal effects (McDonald et al., 2017). A variety of nuanced assumptions about individual behavior using behavioral science theory is brought to bear as scientists attempt to triangulate on individual level effects; see the section on “Measuring behavior over time” in McDonald et al. (2017) for a discussion. In the clinical field, N-of-1 trials were developed for settings in which scientists wish to compare the effect of one treatment versus another (treatment A versus treatment B) on an outcome but it is very expensive to recruit many participants. In both settings a common assumption underlying the analysis of N-of-1 trials is that there are no carry-over effects. Additionally one often assumes that the treatment effect is constant over time. An excellent overview of N-of-1 designs and their use for evaluating technology based interventions is Dallery et al. (2013). See Kravitz et al. (2014) for a review of this design in pharmacotherapy trials.

1.2 Outline

This paper is organized as follows. In Section 2 we discuss the stratified micro-randomized trial and describe in greater detail the motivating smoking cessation study. In Section 3 we define two types of treatment effects: a conditional treatment effect, conditional on a stratification variable, and a treatment effect that is marginal over the stratification variable. Section 4 provides primary analysis methods and associated theory for the proposed trial design. We then provide a simulation-based method for determining the sample size for a stratified micro-randomized trial in Section 5. This simulation-based sample size calculator requires a generative model for the trial data. We develop a generative model for the smoking cessation example in Section 6 and develop the simulation based sample size calculator for this example. In this example the development of the generative model begins with the development of a model under no treatment. This latter model is constructed using summary statistics on data collected in an observational, no treatment, smoking cessation study of cigarette smokers (Saleheen et al., 2015). Section 6.1.1 describes the dataset and how it is used to inform the generative model. We also conduct a variety of robustness checks and subsequently revise the generative model. Here too, the observational, no treatment, smoking cessation study is used to indicate where robustness is required. Section 7 provides a discussion.

2 Stratified Micro-Randomized Trial

2.1 Motivating example – Smoking cessation study

Here we provide a simplified description of the smoking cessation study in which we are involved through the Mobile Data to Knowledge Center. This is a 10 day study focused on developing a mobile health intervention aimed at aiding individuals who are attempting to quit smoking. Participants wear both an AutoSense chest band (Ertin et al., 2011) and bands on each wrist for 10 hours per day. Sensors in the chest band and wrist bands measure various physiological responses and body movements to robustly assess physiological stress. In particular a pattern-mining algorithm uses the sensor data to construct a binary time-varying stress classification (see Section 6 and Sarker et al. (2016) for further details) at each minute of sensor wearing throughout the entire day.

Each participant’s smartphone contains a number of “mindfulness apps” that can be accessed 24/7 to engage in guided stress-reduction exercises. In this study the treatment is a smartphone notification to remind the participant to access the app and practice the stress-reduction exercises. Theoretically, a treatment can be delivered at any minute during the 10 hour day. However in practice, treatment will only be delivered when the participant is available. That is, at some time points it is inappropriate for scientific, ethical or burden reasons to provide treatment. In this example, one of the reasons why a participant would not be available at decision time $t$ is if the participant received a treatment in the past hour (see Section 6 for further details on availability specific to this trial).

At each minute availability is ascertained and, if the participant is available, then the participant is randomized to receive or not receive a treatment. In this study the repeated randomizations are stratified so that each participant receives an average of 1.5 treatments per day while classified as stressed and an average of 1.5 treatments per day while not classified as stressed.
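The minute-level randomization described above can be sketched in code. This is a minimal illustration under strong simplifying assumptions, not the study's actual protocol: the function name `simulate_day`, the constant per-stratum probabilities, and the 60-minute lockout rule are all hypothetical, whereas the study's randomization probabilities depend on the participant's history (see Appendix A).

```python
import random


def simulate_day(stress, p_stressed, p_not_stressed, lockout=60, seed=0):
    """Randomize treatment minute by minute for one participant-day.

    stress: list of 0/1 stress classifications, one entry per minute.
    p_stressed / p_not_stressed: hypothetical constant randomization
        probabilities used when the participant is available.
    lockout: minutes of unavailability after a treatment (assumed rule).
    Returns a list of 0/1 treatment indicators, one entry per minute.
    """
    rng = random.Random(seed)
    treatments = []
    last_treated = None
    for minute, stressed in enumerate(stress):
        # Availability is determined before randomization.
        available = last_treated is None or minute - last_treated > lockout
        if available:
            prob = p_stressed if stressed else p_not_stressed
        else:
            prob = 0.0  # no randomization when unavailable
        treated = 1 if rng.random() < prob else 0
        if treated:
            last_treated = minute
        treatments.append(treated)
    return treatments


# Toy 600-minute day: 60 minutes classified as stressed, 540 not.
day = [1] * 60 + [0] * 540
# Rough targets of 1.5 treatments per stratum, ignoring the lockout.
assignments = simulate_day(day, p_stressed=1.5 / 60, p_not_stressed=1.5 / 540)
print(sum(assignments), "treatments delivered")
```

Because the lockout removes some minutes from randomization, constant probabilities such as these will generally deliver fewer than the targeted 1.5 treatments per stratum, which is one reason a history-dependent formula is needed in practice.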

We consider primary analyses and sample size formulae when the primary aim of this type of study is to address scientific questions such as:

Is there an effect of the treatment on the proximal response? And is there an effect of the treatment if the individual is currently experiencing stress?

The stratified micro-randomized trial is an experimental design intended to provide data to address such questions.

2.2 A Stratified Micro-Randomized Trial

A micro-randomized trial (Liao et al., 2016; Dempsey et al., 2015) consists of a sequence of within-person decision times $t = 1, \ldots, T$, e.g. occasions, at which treatment may be randomized. For example, in the smoking cessation study the decision times are at minute intervals during a 10 hour day over a period of 10 days (i.e., $T = 6000$ decision times) for each participant. As discussed in the introduction we are interested in treatment effects at particular values of a variable $X_t$ that is likely impacted by prior treatment (in the smoking cessation study, $X_t$ is an indicator of stress and treatment is intended to impact the occurrence of stress); often in these settings some values of $X_t$ occur more rarely (e.g., participants experience many fewer minutes of stress than non-stress minutes in a day) and thus, to ensure sufficient treatment exposure at these values, we stratify the randomization. We call such trials stratified micro-randomized trials. We assume the sample space for the covariate $X_t$ is finite and small. That is, $X_t$ is a time-varying categorical (or ordinal) variable with support $\{0, 1, \ldots, k\}$ where $k$ is small. In the case of the smoking cessation example, $X_t = 1$ if the participant is classified as stressed at decision time $t$ and $X_t = 0$ otherwise; thus $k = 1$.

$O_t$ denotes observations collected after time $t-1$ and up to and including time $t$ (including the time-varying stratification variable, $X_t$); $O_0$ contains baseline covariates. $O_t$ also contains the availability indicator: $I_t = 1$ if available for treatment and $I_t = 0$ otherwise. Availability at time $t$ is determined before treatment randomization. In this paper, we consider binary treatment (e.g., on or off); $A_t$ denotes the indicator for the randomized treatment at time $t$. A randomization only occurs if $I_t = 1$. In the smoking cessation example $A_t = 1$ if, at minute $t$, the participant is notified to practice stress-reduction exercises and $A_t = 0$ otherwise. In particular if the participant is unavailable (i.e., $I_t = 0$) there can be no notification to practice stress-reduction exercises (i.e., $A_t = 0$). The ordering of the data at a decision time is $O_t, A_t$. Let $H_t = (O_0, O_1, A_1, \ldots, O_{t-1}, A_{t-1}, O_t)$ denote the observation history up to and including time $t$, as well as the treatment history at all decision times up to, but not including, time $t$.

In general the randomization probability for $A_t$ will depend not only on the stratification variable, $X_t$, but also on other variables in $H_t$. The randomization probability is a known function of $H_t$, denoted by $p_t(H_t)$. We define $p_t(H_t) = 0$ when the participant is currently unavailable (i.e., $I_t = 0$). Appendix A provides an example, suitable in the smoking cessation example, of a formula for $p_t(H_t)$ for any possible value of the history $H_t$. From here on, we assume the investigator has access to a formula for these randomization probabilities. Let $P$ denote the distribution of the data if collected using randomization probabilities determined by this formula.

The proximal response, denoted by $Y_{t,\Delta}$, is a known function of the participant’s data within a subsequent window of length $\Delta$. In the smoking cessation study, for example, the length of the window might be $\Delta = 60$ minutes, with the proximal response defined as the fraction of the subsequent $\Delta$ minutes in which the participant is classified as stressed:

$$Y_{t,\Delta} = \frac{1}{\Delta} \sum_{s=1}^{\Delta} 1[X_{t+s} = 1].$$

In this smoking cessation example, the response is a deterministic function of only the stratification covariate, $X_t$; this need not be the case. For example in a physical activity study in which the treatments are activity messages, $X_t$ may be a binary variable indicating currently sedentary or not, yet the response might be the number of steps over a subsequent window of time.
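To make the smoking cessation response concrete, the following sketch computes the fraction of time classified as stressed in the window following a decision time. The function name `proximal_response` and the zero-based indexing of minutes are assumptions made for illustration only.

```python
def proximal_response(stress, t, delta=60):
    """Fraction of the delta minutes after minute t classified as stressed.

    stress: list of 0/1 stress classifications, one entry per minute
        (zero-based indexing is an assumption of this sketch).
    t: index of the decision time.
    delta: window length in minutes.
    """
    window = stress[t + 1 : t + 1 + delta]
    if len(window) < delta:
        raise ValueError("not enough follow-up minutes after time t")
    return sum(window) / delta


# Toy sequence: stressed in minutes 1 and 2 only.
stress = [0, 1, 1, 0, 0]
print(proximal_response(stress, t=0, delta=4))  # 2 stressed minutes out of 4
```

The response at time t uses only minutes after t, so treatments delivered inside the window can influence it, which is what motivates the reference distribution discussed in Section 3.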


3 Proximal effect of treatment

The primary question of interest is whether the treatment has a proximal effect; that is, whether there is an effect of treatment at decision time $t$ on the proximal response $Y_{t,\Delta}$. In particular we aim to test if the proximal effect is zero. Note we are only interested in treatment effects conditional on availability ($I_t = 1$). We consider two types of proximal effects: an effect that is defined conditionally on the value of the stratification variable, $X_t = x$, and on $I_t = 1$; or an effect that is conditional only on $I_t = 1$, so marginal with respect to the distribution of $X_t$.

3.1 Proximal effect of treatment, Potential outcomes & Reference distribution

We use potential outcomes (Robins, 1986; Rubin, 1978) to define both the conditional and marginal proximal effect. At time 2, the potential observations are . The potential observations and availability at decision time  are . Recall that the proximal response is a known function of the participant’s data within a subsequent window of length . Thus the potential outcomes for the response at time  are ; each individual has potential responses at time .

Definition 3.1 (Proximal treatment effects).

At the individual level, the effect of providing treatment versus not providing treatment at time is a difference in potential outcomes for the proximal response and is given by


There are many of these treatment differences for each individual, one for each possible value of the treatments at the other decision times. The “fundamental problem of causal inference” (Imbens and Rubin, 2015; Pearl, 2009) is that we cannot observe any one of these individual differences. Thus we provide a definition of the treatment effect that is an average across individuals. Furthermore to define the effect of treatment we must specify a reference distribution, that is the distribution of the treatments prior to time $t$, and if $\Delta > 1$ then we must also define the distribution of the treatments after time $t$, $A_{t+1}, \ldots, A_{t+\Delta-1}$. If the reference distribution is not a point mass then, in the definition of the treatment effect, here too, the treatment effect will be an average; the average is over the above differences (1) with respect to the reference distribution. So in summary the treatment effect at time $t$ will be an average of the differences in (1) both over the distribution across individuals in potential outcomes as well as over the reference distribution for the treatments.

The question is, “Which reference distribution should be used for the treatments?” The choice of which distribution to use might differ by the type of inference desired. For example in the smoking cessation study, it makes sense to consider setting the subsequent treatments $A_{t+1}, \ldots, A_{t+\Delta-1}$ to 0. In this case we can interpret the treatment effect as the effect of providing a notification at time $t$ to practice stress-reduction exercises and no more notifications within the next hour versus no notification at time $t$ nor over the next hour on the fraction of time stressed in the next hour (i.e., the proximal response).

In this paper, we set treatment at the subsequent times equal to 0 as described above. In order to select the reference distribution for the past treatments we follow common practice in observational mobile health studies; here longitudinal methods such as GEEs and random effects models (Liang and Zeger, 1986) might be used to model how a time-varying variable, such as physical activity, varies with current mood. In this case the mean model in these analyses is marginal over the past distribution of mood. A similar strategy in the randomized setting is to use the past treatment randomization probabilities as the reference distribution.

With the reference distribution set to the randomization probabilities for past treatment and set to no treatment for the subsequent times, the average causal effect at time $t$ can be viewed as an “excursion.” That is, participants get to time $t$ under treatment according to the randomization probabilities, then at time $t$ (if available) the effect is the contrast between two opposing excursions into the future. In one excursion, we treat at time $t$ and then do not treat for further times; in the opposing excursion, we do not treat at time $t$ nor do we treat for subsequent times.

Using the above reference distribution, the marginal, proximal treatment effect at time , , is:

where the expectation, is over the distribution of the potential outcomes and

is a row vector of length

. Define the conditional, proximal effect, , as follows:

The proximal effects can be defined for other reference distributions over the subsequent treatments. Careful consideration is required in selecting the reference distribution. For example, a natural alternative to setting the subsequent treatments to 0 in the above definition would be to use a definition which averaged over the randomization distribution of the subsequent treatments. Consider the smoking cessation example. Here if at time $t$ treatment is delivered then according to the randomization protocol the participant cannot be provided further treatment in the subsequent hour. On the other hand, if treatment is not provided at time $t$ then the participant may be provided treatment in the subsequent hour. Thus defining the proximal treatment effect with respect to the randomization distribution means that the treatment contrast is between providing treatment at time $t$ versus the combination of delaying treatment to later time points in the next hour or not providing treatment in the next hour.

A further consideration in selecting a reference distribution is that if the reference distribution is far from the randomization distribution then treatment effects may be very difficult to estimate. That is, the sample size necessary to achieve the requisite power to detect treatment effects will be practically infeasible (i.e., astronomical). Consider again the smoking cessation study example. Using data from other studies on smokers who are trying to quit we know that there are only a few times per day at which the smoker is classified as stressed. In the subset of the observational, no treatment, study used to inform our generative models, the mean (standard deviation) of the number of episodes classified as stressed per day per person was

(). The mean (standard deviation) of the number of episodes not classified as stressed per day per person was (). These statistics support the conclusion that most of the day the smoker is not stressed. Recall the randomization distribution must satisfy the restriction that on average 1.5 treatments are provided while a smoker is classified as stressed and on average 1.5 treatments are provided while a smoker is classified as non-stressed. This is over a 10 hour day. This means that at any given minute, the participant is likely classified as not stressed and the probability of treatment at this minute is very low. As a result the product of randomization probabilities  is close to and thus close to a reference distribution that provides no treatment at times . This means that there will be much data from the study that is consistent with the reference distribution. If, however the randomization probabilities had to satisfy a restriction specifying a much larger number of treatments, then there would be very little data consistent with the reference distribution.
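The closeness of the randomization distribution to the no-treatment reference distribution can be illustrated with a small calculation. The numbers below are hypothetical: a constant per-minute treatment probability of 1.5/600 (roughly 1.5 treatments spread over a 600-minute day, ignoring availability and stratification) and a 59-minute window of subsequent decision times. The function name `prob_no_treatment` is likewise an assumption of this sketch.

```python
import math


def prob_no_treatment(probs):
    """Probability that no treatment is delivered at any of the given
    decision times, treating the per-minute probabilities as fixed."""
    return math.prod(1.0 - p for p in probs)


window = 59  # decision times strictly after t within a 60-minute window

# Hypothetical low-intensity design: about 1.5 treatments per 600 minutes.
low = prob_no_treatment([1.5 / 600] * window)

# Hypothetical high-intensity design: about 15 treatments per 600 minutes.
high = prob_no_treatment([15 / 600] * window)

print(f"low intensity:  {low:.2f}")   # roughly 0.86
print(f"high intensity: {high:.2f}")  # roughly 0.22
```

Under the low-intensity assumption most windows contain no further treatment, so much of the trial data is consistent with the reference distribution; under the high-intensity assumption only a minority of windows would be.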

For the remainder of this paper, the proximal effects are defined using the randomization distribution for past treatments ($A_1, \ldots, A_{t-1}$), and the subsequent treatments $A_{t+1}, \ldots, A_{t+\Delta-1}$ are set to 0 (no treatment).

3.2 Proximal effect of treatment & Observable Data

To express the causal treatment effects, and in terms of the observable data,
e.g. , we use the following three assumptions.

Assumption 3.2.

We assume consistency, positivity, and sequential ignorability (Robins, 1986):

  • Consistency: For each , . That is, the observed values are equal to the corresponding potential outcomes.

  • Positivity: if the joint density  is greater than zero, then .

  • Sequential ignorability: for each , the potential outcomes,
    , are independent of  conditional on the history .

Sequential ignorability and, assuming all of the randomization probabilities are bounded away from 0 and 1, positivity, are guaranteed for a stratified micro-randomized trial by design. Consistency is a necessary assumption for linking the potential outcomes as defined here to the data. When an individual’s outcomes may be influenced by the treatments provided to other individuals, consistency may not hold. In such instances, a group-based conceptualization of potential outcomes is used (Hong and Raudenbush, 2006; Vanderweele et al., 2013). In particular if the mobile intervention includes treatments that aim to produce social ties between participants, then consistency as stated above will not hold. For simplicity we do not consider such mobile interventions here.

Lemma 3.3.

Under assumption 3.2, the marginal treatment effect satisfies


and the conditional treatment effect satisfies


for all where denotes the expectation with respect to distribution of the data generated via a stratified micro-randomized trial with randomization distribution, .

Note that the above products are set to 1 if $\Delta = 1$. Proof of Lemma 3.3 can be found in Appendix B. In the following we focus on designing a stratified micro-randomized trial for the primary purpose of testing whether the treatment effect at any time point differs from 0.
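The products referred to above can be sketched as an inverse-probability weight that re-weights observed data toward the no-treatment reference distribution for the subsequent decision times. This is a sketch under an assumption: it takes the weight to be the standard product of indicators of no treatment divided by the probabilities of no treatment, and the function name `excursion_weight` and its arguments are hypothetical.

```python
def excursion_weight(later_treatments, later_probs):
    """Weight for the decision times strictly after t within the window.

    later_treatments: observed 0/1 treatment indicators at those times.
    later_probs: randomization probabilities of treatment at those times,
        each assumed to be strictly less than 1.
    Returns 0 if any later treatment was delivered; otherwise the inverse
    of the probability of observing no treatment at every later time.
    An empty window (window length 1) gives the empty product, 1.
    """
    if len(later_treatments) != len(later_probs):
        raise ValueError("treatments and probabilities must align")
    weight = 1.0
    for treated, prob in zip(later_treatments, later_probs):
        if treated == 1:
            return 0.0  # inconsistent with the no-treatment reference
        weight /= 1.0 - prob
    return weight


print(excursion_weight([], []))                       # empty product
print(excursion_weight([0, 0], [0.5, 0.5]))           # 1 / (0.5 * 0.5)
print(excursion_weight([0, 1, 0], [0.1, 0.1, 0.1]))   # later treatment given
```

When the randomization probabilities are small, these weights stay close to 1 for most windows, which matches the earlier observation that the randomization distribution is close to the reference distribution.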

4 Test statistic

Our main objective is the development of a sample size formula that will ensure sufficient power to detect alternatives to the null hypothesis of no proximal treatment effect. For the conditional proximal effect the null hypothesis is that the effect is zero at all decision times and for all values of the stratification variable. For the marginal proximal effect the null hypothesis is that the effect is zero at all decision times. The proposed sample size formulas are simulation based and will follow from consideration of the distribution of test statistics under alternatives to the above null hypotheses. The sample size will be denoted by $N$. Our test statistic will be based on a generalization of the test statistics developed by Boruvka et al. (2017) to accommodate the fact that the response covers a time interval during which subsequent treatment may be delivered (in Boruvka et al. (2017), $\Delta = 1$ throughout) and the conceptual insight that these estimators can be interpreted as projections. These test statistics are quadratic forms based on estimators of the coefficients involved in projections.

In the following we describe projections, and provide the test statistics. First in the conditional setting the test statistic is based on an empirical projection of on the space spanned by a by vector of features involving and , denoted by . We denote the projection by . The weights in this projection are given by

where are pre-specified probabilities used to define the weighting across time and stratification distribution in the projection. Note that if desired, one can set for all . See Section 5.1 for further comments on the choice of the pre-specified probabilities and on the choice of .

Second, in the marginal setting, the test statistic is based on estimators of the coefficients involved in an projection of on the space spanned by a by vector of features involving , denoted by . We denote the projection by . The weights in this projection are given by

for pre-specified probabilities, . Again these probabilities are used to specify the weighting across time and stratification distribution in the projection.

Here we discuss the estimators of the coefficients in the projections. The estimators will form the basis for the test statistics. Note that neither treatment effect, in (4) nor in (2), are conditional expectations of an observable variable (rather the effects are defined by differences in repeated conditional expectations). Thus instead of minimizing a standard least squares criterion, we minimize a generalization of the criterion in Boruvka et al. (2017) (see (6), (7) below).

In some settings there will be sufficient a priori information (e.g. using data on individuals from a similar population) that will permit the simulation based sample size formula to depend on “control variables.” These variables are used to help reduce the variance of the estimators with the goal that the resulting test statistic is more powerful in detecting particular alternatives to the null hypothesis. See Section 5.1 for further discussion concerning the choice of the control variables. For example in the smoking cessation study a natural control variable would be the fraction of time stressed in the hour prior to time $t$, as this pre-time-$t$ variable may be expected to be highly correlated with the fraction of time stressed in the hour subsequent to time $t$, $Y_{t,\Delta}$.

Given a by vector of “control variables” , define as an projection; in particular

where . Also define as an projection; in particular

where . Note one can choose equal to the scalar, . Again see Section 5.1 for further discussion. See appendix C for a discussion of the trade-off between the approximation error of the  projection of  onto the control variables , sample size , and statistical power .

Recall the proposed test statistic is based on an estimator of or . Here we consider an estimator of which is the minimizer of the following weighted, centered least-squares criteria, minimized over :


where is defined as the average of a function, , over the sample. The centering refers to the centering of the treatment indicator in the above weighted least squares criteria. This criterion is similar to Boruvka et al. (2017); however Boruvka et al. (2017) restrict to and thus the weight does not contain the ratio, . Also Boruvka et al. (2017) assume a model for the treatment effect

(as opposed to estimating the projection of this effect as is the case here). Under finite moment and invertibility assumptions, the minimizers

, are consistent, asymptotically normal estimators of . The limiting variance of is given by where

See Appendix B.2 for technical details.

The estimators of the coefficients in the projection of the marginal treatment effect, minimize the following least-squares criteria over :


where the probability defines the projection (see above and Section 5.1). Similarly under finite moment and invertibility assumptions, the minimizers , are consistent, asymptotically normal estimators of . See Appendix B.2 for technical details. For expositional simplicity we focus on the test for the conditional treatment effect in the remainder of this paper. See Appendix D for a parallel discussion in the case of the marginal treatment effect.

The proposed sample size formula in the conditional setting is based on the test statistic


where is the sample size and is given by

with , and is given by

Here we have implicitly assumed that is invertible. The following lemma provides the distribution of :

Lemma 4.1 (Asymptotic Distribution of ).

Under finite moment and invertibility assumptions,

From a technical perspective the above test statistic is very similar to the quadratic form test statistics based on weighted regression used in the Generalized Estimating Equations method (Liang and Zeger, 1986; Diggle et al., 2002). In this field much work has been done on how to best adjust these test statistics and their distribution when the sample size might be small (Liao et al., 2016; Mancl and DeRouen, 2001). The adjustments are based on the intuition that the quadratic form is akin to the multivariate T-test statistic used to test whether a vector of means is equal to 0, and thus Hotelling’s T-squared distribution is used to approximate the distribution when $N$ may be small. Here we follow the lead of this well developed area and use a non-central Hotelling’s T-squared distribution to approximate the distribution of the test statistic. Recall that if a random variable 

has non-central Hotelling’s T-squared distribution with degrees of freedom 

and non-centrality parameter  then 

has non-central F-distribution with the same degrees of freedom and non-centrality parameter

(Hotelling, 1931). In our setting and and  with


Recall that is the dimension of and is the dimension of . See Appendix B for a discussion of how for large 

, we recover the Chi-Squared distribution given in Lemma 


Thus the rejection region for the test and is:


with a specified significance level. For details regarding further small sample size adjustments, used when analyzing the data, see Appendix E.

5 Sample size formulae

To plan the stratified micro-randomized study, we need to determine the sample size needed, , to detect a specific alternative with a given power () at a given significance level (). The sample size is the smallest value  such that


and denote the cumulative and inverse distribution functions respectively for the non-central F-distribution with degrees of freedom  and non-centrality parameter . Calculation of the sample size is non-trivial due to the unknown form of the noncentrality parameter, (where is defined in (9)). This is in contrast to micro-randomized trials where, under certain working assumptions, Liao et al. (2016) were able to find an analytic form for the noncentrality parameter .

We outline a simulation-based sample size calculation, starting with a general overview and comments in Section 5.1, and employ this calculator to design the smoking cessation study in Section 6.

5.1 Simulation based sample size calculation

As discussed above, calculation of the sample size  is non-trivial due to the unknown form of the non-centrality parameter. Here, we propose a three-step procedure for sample size calculations.

In the first step, equation (9) and information elicited from the scientist are used to calculate, via Monte Carlo integration, the unknown quantity in the non-centrality parameter. The resulting value is plugged into equation (11) to solve for an initial sample size. In the second step we use a binary search algorithm to search over a neighborhood of this initial sample size; in our simulations we found that the binary search quickly reached a solution. For each candidate sample size required by the binary search algorithm, we run a set of simulated trials, each with that many simulated participants. Within each simulation, the rejection region for the test is given by equation (10) at the specified significance level. The proportion of simulations in which the null hypothesis is rejected is the estimated power for that sample size. The final sample size is the smallest candidate with estimated power above the pre-specified threshold .
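
The second step can be sketched in a few lines of Python. This is a minimal illustration, not the calculator used in the paper: the function names, the `reject_once` callback (which stands in for simulating one trial and applying the rejection region), and the toy power curve are all hypothetical, and the binary search assumes that power is non-decreasing in the sample size, which holds only approximately when power is itself estimated by simulation.

```python
import random


def estimate_power(n, reject_once, n_sims=1000, seed=0):
    """Estimate power at sample size n as the proportion of simulated
    trials in which the null hypothesis is rejected.

    reject_once(n, rng) is a hypothetical user-supplied function that
    simulates one trial with n participants and returns True on rejection.
    """
    rng = random.Random(seed)
    rejections = sum(bool(reject_once(n, rng)) for _ in range(n_sims))
    return rejections / n_sims


def smallest_adequate_n(power_at, lo, hi, target=0.8):
    """Binary search for the smallest n in [lo, hi] with power_at(n) >= target.

    Assumes power_at is non-decreasing in n (an assumption of this sketch).
    """
    if power_at(hi) < target:
        raise ValueError("upper bound of the search range is too small")
    while lo < hi:
        mid = (lo + hi) // 2
        if power_at(mid) >= target:
            hi = mid        # mid is adequate; look for something smaller
        else:
            lo = mid + 1    # mid is inadequate; move up
    return lo


def toy_power(n):
    """Hypothetical smooth power curve, used only to exercise the search."""
    return 1.0 - 0.97 ** n


if __name__ == "__main__":
    print(smallest_adequate_n(toy_power, lo=1, hi=200, target=0.8))
```

In practice `power_at` would wrap `estimate_power` with a trial simulator, and the search range would be a neighborhood of the initial sample size obtained in the first step.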

In the third and final step we conduct a variety of simulations to assess the robustness of the sample size calculator to its assumptions and to make adjustments to ensure robustness. See our use of these simulations to test robustness in the case of the smoking cessation study in Section 6.

Our sample size formula requires the following information for :

  1. desired type 1 and type 2 error rates,

  2. targeted alternative ,

  3. selected probabilities ,

  4. selected “control variables” ,

  5. the randomization formula used to determine given a history and

  6. a generative model for .

We provide general comments concerning the choice of the above items and then build the sample size calculator for the smoking cessation study of Section 6. First we elicit information from the scientist to construct a specific alternative form for  . A simple approach is to consider linear alternatives,  so that the projection and the alternative coincide. Stratification variables are often categorical ( is categorical); as a result we model the alternative separately for each value of . Furthermore, if we suspect that the effect will be generally decreasing (with study time) due to habituation, then we might consider a vector feature that represents a linear-in-time trend. Alternatively, we might believe that the effect of the treatments will be low at the beginning of the study, increase as participants learn how to use the treatment, and then decrease due to habituation; here we might consider a vector feature that results in a quadratic trend.
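
The linear and quadratic trend features described above might be coded as in the sketch below. The names, the 1-based indexing of decision points, the choice of 600 decision points per day (taken from the smoking cessation study), and the example coefficients are all assumptions made for illustration; they are not the paper's specification.

```python
DECISION_POINTS_PER_DAY = 600  # assumption: 10 hours/day x 60 minutes/hour


def day_in_study(t):
    """Zero-based day index for a 1-based decision point t."""
    return (t - 1) // DECISION_POINTS_PER_DAY


def linear_feature(t):
    """Feature vector giving an effect that is linear in day of study."""
    d = float(day_in_study(t))
    return (1.0, d)


def quadratic_feature(t):
    """Feature vector giving an effect that is quadratic in day of study
    and piecewise constant within each day."""
    d = float(day_in_study(t))
    return (1.0, d, d * d)


def effect(feature, beta):
    """Treatment effect under a linear-in-features alternative."""
    return sum(f * b for f, b in zip(feature, beta))


if __name__ == "__main__":
    # Hypothetical coefficients: zero effect on day 0, peak on day 4.
    beta = (0.0, 0.08, -0.01)
    for d in range(10):
        t = d * DECISION_POINTS_PER_DAY + 1  # first decision point of day d
        print(d, round(effect(quadratic_feature(t), beta), 3))
```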

The less complex the projection (smaller ) of the alternative , the smaller the required sample size, , becomes. On the other hand, a simple projection may not reflect the true alternative very well (see Appendix C for a discussion of this tradeoff). We suggest sizing a study for primary hypothesis tests using the least complex alternative possible. For example, while there may be within-day variation in treatment effect, the study might still be sized to detect treatment effects averaged across such variation; i.e., a constant alternative within a day can result in a hypothesis test with sufficient power against a wide range of alternatives. In the smoking cessation study, for instance, the feature, might be with equal to the number of days following the “quit smoking” date. The linear trend in days would be used to detect an approximately decreasing treatment effect, , with increasing .

An objection to the above approach might be as follows. Suppose that the scientific team believes that there will be an effect only at a very few decision points within a day, so that a test statistic based on a projection that averages over all decision points within the day would result in a test with low power. If investigators suspect this might be the case, then more care should be taken in selecting the decision points. Consider the example of HeartSteps (Klasnja et al., 2015), a mobile health intervention focused on promoting physical activity and reducing sedentary behavior among sedentary office workers. HeartSteps uses an activity tracker to monitor steps taken on a per-minute basis. Originally decision points were set to match the frequency of data collection (i.e., each minute). Upon reviewing activity data, it was discovered that the highest within-person variability in step count occurred at five timepoints throughout the day, with much less within-person variability at other times. (These times were pre-morning commute, mid-day, mid-afternoon, evening commute and after dinner; the data were collected on individuals with “regular” daytime jobs.) This information, combined with the types of treatments being considered, indicates that the treatment might be most effective at these five timepoints and potentially less effective at other times. Therefore, decision times were selected to align with the five discovered timepoints.

To select the probabilities , recall that these probabilities define the weighting across time and across the stratification distribution of the alternative when operationalized as a projection. To see this, suppose we decide to target a constant-across-time alternative and select , then  where

for . If we set the reference probabilities to be constant in  then

In this case is an average treatment effect across time weighted by the fraction of time the participant is available and in stratification level . In our work we usually set to be constant in so as to more easily discuss the targeted alternative with collaborators.

Next a decision should be made about which control variables  should be included in the construction of the test statistic. A natural control variable is the pre-decision time version of the proximal response, as this variable is likely highly correlated with the proximal response and thus can be used to reduce variance in the estimation of the coefficients for the projection. For example, in the smoking cessation study a natural control variable is the fraction of time stressed in the hour prior to time . One might want to include in the by vector, , many variables so as to maximally reduce variance and thus increase the size of the noncentrality parameter in (9); indeed for fixed , the larger the noncentrality parameter, the smaller the sample size . However, from equation (11) we see that, fixing all other quantities, the sample size increases with increasing . So intuitively there is a tradeoff between increasing the size of the noncentrality parameter by including more variables in  and the resulting reduction in degrees of freedom in the denominator of the F test caused by increasing , the number of variables in . See Appendix C for further discussion.

In the smoking cessation example below, we calculate the sample size with the vector of control variables  set equal to ; this maintains a hierarchical regression yet keeps as small as possible. Incidentally this simplifies the development of the generative model as additional time-varying variables are not included.

Generally the randomization formula has been determined by considerations of treatment burden, availability and whether it is critical for the scientific question that the randomization depend on a time-varying variable such as a prediction of risk. Treatment burden considerations might impose a constraint such as, on average around treatments should occur over a specified time period (e.g. an average of treatments per day); also the randomization formula might be developed so as to limit the variance in the number of treatments in the specified time period. In the smoking cessation study, the randomization probability, at decision point depends on at most (as opposed to the entire history, ).

The sample size formula requires the specification of a generative model for the history  which achieves the specified alternative treatment effect. However existing data sets that include the use of the required sensor suites and thus can be used to guide the form of the generative model are often small and do not include treatment. In the smoking cessation study, for example, we require a generative model for the multivariate distribution of of which only the distribution of given is known (e.g. ). We have access to a small, observational, no-treatment data set that included the required sensor suites and thus can be used to guide the form of the generative model. Because the data set is small, in Section 6 we construct a low dimensional Markovian generative model. Here and in general, the prior data does not include treatments. Thus we use the prior data to develop a generative model under no treatment.

The relatively simple generative model allows us to use only a few summary statistics from this small noisy data set. This, of course, may lead to bias, which would be problematic if it results in sample sizes for which the power to detect the desired effect is below the specified power. Thus we also use the small data set to guide our assessment of the robustness of the sample size calculator. In particular, more complex generative models can be proposed by exploratory data analysis. Of course such complex alternatives may be due to noise and may not reflect the behavior of trial participants. In Section 6.4.3, we present results of an exploratory data analysis in which we over-fit the noisy, small data to suggest a particular complex deviation from the simple Markovian generative model.

We follow the three steps outlined at the beginning of this subsection to provide a sample size . Our calculator also provides standardized effect sizes. That is, given the alternative effect and a generative model we calculate the average conditional variance given by . Table 14 in Appendix F provides standardized treatment effect sizes, defined as, .

6 Smoking Cessation Study

In the following, we use the above three-step procedure to form a sample size calculator for the smoking cessation study. Recall that the last step involves a variety of simulations to assess robustness to the assumptions underlying the generative model; this step is provided in Section 6.4.

As noted previously, the smoking cessation study is a 10 day study; the first day is the “quit day”, the day the participant quits smoking. Recall that participants wear the AutoSense sensor suite (Ertin et al., 2011), which provides a variety of physiological data streams that are used by the stress classification algorithm. A high-level view of the stress classification algorithm is as follows. First, every minute a support vector machine (SVM) algorithm is applied to a number of ECG and respiration features constructed from the prior one-minute stream of sensor data. The output of the SVM, e.g. the distance of the features from the separating hyperplane, is then transformed via a sigmoid function to obtain a stress likelihood in [0, 1]; see Hovsepian et al. (2015) for details. This output across the minute intervals is further smoothed to obtain a smoother “stress likelihood time series.” Next, a Moving Average Convergence Divergence approach is used to identify minutes at which the trend in the stress likelihood is going up and when it is going down; see Sarker et al. (2016) for details. The beginning of an episode is marked by the start of a positive-trend interval; the peak of an episode is the end of a positive-trend interval followed by the start of a negative-trend interval. If the area under the curve from the beginning of the episode to the peak of the episode exceeds a threshold then the episode is declared to be a stress episode. The threshold is based on prior data from lab experiments and was evaluated on independent test data sets (from both lab and field) in terms of the F1 score (the harmonic mean of precision and recall (Wikipedia, 2017)) for use in detecting physiological stress.
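
The episode-marking logic described above can be illustrated with a simplified sketch. This is not the algorithm of Sarker et al. (2016): the exponential moving average spans, the area threshold, the rectangle-rule area, and all function names are hypothetical choices made only to show the structure (positive-trend interval, peak, area under the curve compared against a threshold).

```python
def ema(xs, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out, prev = [], xs[0]
    for x in xs:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


def stress_episodes(likelihood, fast=3, slow=8, threshold=2.0):
    """Return (start, peak, is_stress) for each completed episode.

    The trend is taken to be positive where the fast moving average exceeds
    the slow one. An episode starts when a positive-trend interval starts;
    its peak is the last minute of that interval. The area under the curve
    is approximated by summing the per-minute likelihoods from start to peak.
    """
    macd = [f - s for f, s in zip(ema(likelihood, fast), ema(likelihood, slow))]
    episodes, start = [], None
    for t, m in enumerate(macd):
        if m > 0 and start is None:
            start = t                       # positive-trend interval begins
        elif m <= 0 and start is not None:
            peak = t - 1                    # last minute with positive trend
            area = sum(likelihood[start:peak + 1])
            episodes.append((start, peak, area > threshold))
            start = None
    # An interval still open at the end of the series has no observed peak
    # and is therefore not reported.
    return episodes


if __name__ == "__main__":
    series = [0.0] * 5 + [0.2, 0.4, 0.6, 0.8, 1.0, 0.8, 0.6, 0.4, 0.2, 0.0] + [0.0] * 5
    print(stress_episodes(series))
```

Note that with a moving-average trend the detected peak lags the true maximum of the series; any real implementation would need thresholds calibrated on labeled data, as described above.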

A participant is available, , for a treatment at minute if the participant has not received a treatment in the prior hour, if this minute corresponds to a peak of an episode and if the minute falls within the first 10 hours after putting on AutoSense. The stratification variable at every available minute (decision point) is whether the criterion for stress is met () or not met (). There are 600 decision times per day (i.e., 10 hours/day × 60 minutes/hour) at which, assuming the participant is available, the participant may receive a treatment notification. We plan the trial with 11 hour days in which participants cannot receive treatment during the final hour. The final hour of data collection ensures we can calculate the proximal response for the final decision time each day. Each participant should receive a daily average (over the 10 hours) of 1.5 treatment notifications (notifications to practice the stress-reduction exercise on the app) when and a daily average of 1.5 treatment notifications when .
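
The availability rule above can be expressed as a small predicate. This is a sketch under stated assumptions: minutes are indexed 1 to 600 from the moment the sensor suite is put on, "in the prior hour" is read as any of the 60 preceding minutes, and the function and argument names are hypothetical.

```python
DECISION_MINUTES_PER_DAY = 10 * 60  # 10 hours/day x 60 minutes/hour = 600
LOCKOUT_MINUTES = 60                # no treatment in the prior hour


def is_available(minute, is_episode_peak, last_treatment_minute=None):
    """Return True if a participant can be randomized at this minute.

    minute: 1-based minute since putting on the sensor suite (assumption).
    is_episode_peak: whether this minute is the peak of an episode.
    last_treatment_minute: minute of the most recent treatment today, or None.
    """
    in_window = 1 <= minute <= DECISION_MINUTES_PER_DAY
    no_recent_treatment = (
        last_treatment_minute is None
        or minute - last_treatment_minute > LOCKOUT_MINUTES
    )
    return in_window and is_episode_peak and no_recent_treatment


if __name__ == "__main__":
    print(is_available(100, True))        # peak, no prior treatment
    print(is_available(100, True, 50))    # treated 50 minutes ago
    print(is_available(601, True))        # final hour, outside the window
```

Over a 10 day study this indexing gives 10 × 600 = 6000 potential decision points per participant, of which only the available ones lead to randomization.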

Next, we build the simulation-based calculator assuming the primary hypothesis is and the test statistic is as given in (8). Small sample corrections are used in constructing the test statistic as discussed in Section 4; see Appendix E for additional details.

6.1 Simulation-based calculator

We start by choosing inputs for the sample size formula as outlined in Section 5.1. We set the desired type 1 and type 2 error rates to be % and % respectively. We next specify the targeted alternative  for . Suppose the scientific team suspects that if there is an effect of the mindfulness reminders, then this effect might be negligible at the beginning of the study, increase as participants begin to practice the mindfulness exercises, and then decrease due to habituation. Thus, we select where . This leads to a non-parametric treatment effect model in the stratification variable , and a piecewise constant treatment effect model in time given  that is quadratic as a function of “day in study.” In this case, the dimension of the  projection is , and the targeted alternative is  for . Next, to elicit enough information from the scientist to specify , we ask scientists to specify for each level of , (1) an initial conditional effect, (2) the day of maximal effect () and (3) the average conditional treatment effect