Technological advancements have made it possible to collect, store, manipulate, and access large amounts of data on complex systems in real-time. Consequently, there is enormous potential to use accumulating data to construct adaptive decision support systems that map up-to-date information to a recommended decision. For example, in the context of mobile-health, data collected both passively and actively through a mobile device can be used to monitor a patient’s health status and to construct an individualized treatment strategy that applies interventions if, when, and in the amount they are needed (riley2011health; litvin2013computer; kumar2013mobile; spruijt2014dynamic; nahum2014just). Other examples include data-driven management of infectious diseases wherein accruing information about the spread of the disease can be used to inform how best to allocate treatment resources (chades2011general; meyer2016), and adaptive management of natural resources wherein management decisions are adjusted over time according to current and forecasted resource availability (mccarthy2010resource; mcdonald2011allocating; marescot2013complex; fackler2014addressing).
To estimate a decision support system that maximizes mean cumulative utility, we apply a variant of Thompson sampling (thompson1933likelihood) that avoids directly computing a posterior distribution over the optimal decision at each time point. This estimator is computationally efficient and can be applied in settings in which: (i) data accumulate rapidly over an indefinite time horizon; (ii) data are high-dimensional; (iii) data comprise a single stream, i.e., there is no independent replication; and (iv) the number of possible decisions is too large to enumerate. We derive rates of convergence on the difference in cumulative utility under the proposed estimator and an optimal decision support system. The proposed estimator relies on a model for the underlying system dynamics and therefore is ideally suited to settings where existing domain knowledge or historical data can be used to inform a class of models. In our motivating applications, such domain knowledge is abundant. In settings where domain knowledge is scarce, the proposed methodology can be extended to accommodate more flexible models that grow in complexity as data accumulate.
The estimation problem we consider here is related to estimation of an optimal dynamic treatment regime (murphy2003optimal; robins2004optimal; chakraborty2013statistical; kosorok2015adaptive). Like the decision support systems we consider, a dynamic treatment regime is a sequence of functions, one per decision stage, that map up-to-date information to a recommended decision, and the goal is to estimate a regime that maximizes expected cumulative utility. However, existing methodology for estimation of optimal dynamic treatment regimes is designed for application to data collected in observational or randomized studies involving a cohort of patients. Thus, almost all methodology for dynamic treatment regimes is designed for offline estimation using data composed of independent, identically distributed replicates of the decision process observed over a finite time horizon. In contrast, the problems we consider here involve online estimation using a single stream of data and an indefinite time horizon. Some methodology for dynamic treatment regimes touches on at least one of these features: ertefaie2014constructing proposed a variant of the Q-learning algorithm (murphy2005generalization; schulte2014q) that applies to problems with an indefinite time horizon, but this methodology is designed for offline estimation using a batch of independent, identically distributed replicates; actorCritic proposed a policy-search algorithm (zhang2012robust; zhao2012estimating; zhao2015new) that applies to indefinite time horizons but requires independent, identically distributed replicates; and minsker2015active proposed to use an online estimator of an optimal treatment regime to adaptively change patient recruitment probabilities, but this too requires replicates and applies only to a single decision point.
The proposed estimator is an example of a model-based planning algorithm in reinforcement learning (sutton1998reinforcement; powell2007approximate). Model-based planners estimate a system dynamics model and then apply (approximate) dynamic programming algorithms to the estimated system as if it were known. A key feature of model-based planning is the need to balance making decisions that improve the model estimates against those that yield high utility under the current estimated model; in the computer science literature this is known as the exploration-exploitation trade-off (kaelbling1996reinforcement; sutton1998reinforcement). Thompson sampling has been studied extensively as a means of balancing exploration and exploitation in the context of multi-armed bandit problems (agrawal2012analysis; kaufmann2012thompson; korda2013thompson; agrawal2013thompson; gopalan2014thompson; russo2014information). However, Thompson sampling for more complex decision problems, in which decisions affect not only immediate utility but also the state of the system and thereby the potential for future utility, has received considerably less attention. gopalan2015thompson applied Thompson sampling to Markov decision processes and derived convergence rates similar to those presented here. However, the variant of Thompson sampling proposed by gopalan2015thompson has several features that prevent direct application to our setting; their algorithm requires that: (i) the set of system states be finite and the underlying decision process return infinitely often to a fixed reference state, whereas in the settings we consider, e.g., control of an infectious disease, the state is continuous and there is no guarantee of return to a reference state; (ii) a fixed policy be applied for prolonged periods while the estimated system dynamics model is improved, which may not be feasible or ethical in settings with human subjects or limited natural resources; and (iii) one be able to compute draws from the posterior efficiently, which may not be possible without conjugate priors. Analyses of the operating characteristics of Thompson sampling from a Bayesian perspective are given in osband2013more and osband2014model.
In Section 2, we introduce an approximate Thompson sampling algorithm for parametric models. In Section 3, we provide rates of convergence for this variant of the Thompson sampling algorithm. In Section 4, we illustrate the use of Thompson sampling using a simple agent-based model of influenza and a model for management of mallard populations in the United States. Section 5 contains a discussion of open problems and concluding remarks.
2 Thompson sampling with parametric models
We consider a decision problem evolving in discrete time . At each time , the decision maker: (i) observes the current state of the process ; (ii) selects an action ; and (iii) observes the next state and utility . We assume (A0) that the state process is Markovian, so that is conditionally independent of given and ; in some settings, for this definition to hold, the state might contain features constructed from observations and actions collected over multiple time points, not just information collected between the (t-1)st and tth decisions. A decision strategy, , is a map from states to actions, so that under a decision maker presented with at time will select action . An optimal decision strategy maximizes mean discounted utility if applied to select actions in the population of interest; a formal definition is given below. Our goal is to construct an estimator of an optimal strategy that can be applied when the dimension of the state space, , and the number of possible actions, , are large. For example, in spatio-temporal applications, like the influenza example presented in Section 4, may be on the order of tens of thousands and is exponential in .
We use potential outcomes (rubin1974estimating) to define an optimal decision strategy. We use an overline to denote history, i.e., . The potential state at time under action sequence is denoted ; thus, the potential utility at time is . For any strategy, , the potential state at time under is and subsequently the potential utility is . Define the total discounted mean utility of a strategy $\pi$ as
$$ V(\pi) = \mathbb{E}^{\pi}\Big\{ \sum_{t=1}^{\infty} \gamma^{t-1} U_t \Big\}, \qquad (1) $$
where $\gamma \in [0, 1)$ is a discount factor that balances proximal and distal utility (sutton1998reinforcement; puterman2014markov). Given a class of strategies, , an optimal strategy, , satisfies for all . Thus, an optimal regime is defined in terms of the class , which may be chosen to enforce parsimony, logistical or cost constraints, or other structure. Hereafter, we consider as fixed and known; in practice, the choice of an appropriate class of strategies will depend on the domain of application; see Section 4 for examples.
To ensure that is identifiable in terms of the underlying generative model, we make a series of standard assumptions (robins2004optimal; schulte2014q). Let denote the set of all potential states and utilities. We assume: (A1) sequential ignorability, ; (A2) positivity, for all , such that ; and (A3) consistency, . Under (A0)-(A3) it can be seen that the mean utility at time under is
where is the conditional density of given and , is the marginal density of , is a point mass at , and is a dominating measure. The right-hand side of (2) is a functional of the underlying generative model and given estimators of the densities for one could construct a plug-in estimator of . However, non-parametric estimation of these densities is not possible in general as there is only a single observation per time point. Thus, to facilitate estimation, we impose further structure on these densities.
We assume that the densities for are stationary and indexed by a low-dimensional parameter , i.e., , where is not indexed by and denotes the true parameter value. The likelihood for $\theta$ at time $T$ is
$$ \mathcal{L}_T(\theta) = \mu(S_1) \prod_{t=1}^{T-1} p(S_{t+1} \mid S_t, A_t; \theta)\, \pi_t(A_t \mid S_t), $$
where $\pi_t$ denotes the conditional distribution over actions used by the decision maker at time $t$ and $\mu$ denotes the distribution of the initial state; it is assumed that the distributions over actions are known and contribute no information about $\theta$ to the likelihood (recall that this is an online estimation problem, so action selection is under the control of the decision maker). We also assume that $\mu$ is known; in practice, one might choose to set $\mu$ to be a point mass at the observed first state.
Let denote the maximum likelihood estimator of based on data accumulated during the first time points. We assume that
is asymptotically normal with mean zero and asymptotic variance-covariance matrix ; conditions under which this holds for data that are not independent and identically distributed have been studied extensively (silvey1961note; billingsley1960statistical; bar1971asymptotic; crowder1976maximum; heijmans1986consistent). Furthermore, assume that there exists which converges in probability to . Under these assumptions, an approximate Thompson sampling algorithm can be constructed as follows.
For any let denote expectation under parameter value and for any and define . Using (2), can be computed to arbitrary precision using Monte Carlo methods. Let denote a starting value which might be drawn from a prior distribution, estimated from historical data, or elicited from domain experts. Let denote a non-decreasing sequence of positive integers. Let denote the orthogonal projection onto . Approximate Thompson sampling consists of the following steps for each time : (i) compute ; (ii) set ; (iii) observe ; (iv) draw ; and (v) set .
The foregoing algorithm is called ‘approximate’ Thompson sampling because at each : (i) calculation of the discounted mean utility is truncated at ; and (ii) the estimated sampling distribution of the maximum likelihood estimator is used as an approximation of the posterior distribution of . A fully Bayesian implementation of Thompson sampling, with truncation, would draw a strategy from the posterior distribution of at each time ; in settings where evaluation of the likelihood is expensive, these approximations can result in considerable computational savings. In the next section, we show that approximate Thompson sampling is consistent and characterize how the rate of convergence depends on the sequence .
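As a concrete illustration, the five steps above can be sketched on a toy two-state, two-action problem in which each transition probability is an unknown Bernoulli parameter. All names, tuning constants, and the smoothing scheme below are illustrative, and the normal approximation to the sampling distribution of the maximum likelihood estimator plays the role of the posterior draw.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown) dynamics: p_true[s, a] = P(next state is 1 | state s, action a).
# Utility is the indicator that the next state is 1, so good actions steer the
# chain toward state 1; action 1 is better in both states here.
p_true = np.array([[0.2, 0.7],
                   [0.4, 0.9]])

def mc_value(policy, p, s0, T, gamma=0.9, n_rollouts=20):
    """Monte Carlo estimate of the T-truncated discounted utility of a
    deterministic policy (length-2 array mapping state -> action) under the
    dynamics model p, started from state s0."""
    total = 0.0
    for _ in range(n_rollouts):
        s, disc = s0, 1.0
        for _ in range(T):
            s = rng.binomial(1, p[s, policy[s]])
            total += disc * s        # utility = indicator of being in state 1
            disc *= gamma
    return total / n_rollouts

policies = [np.array([a0, a1]) for a0 in (0, 1) for a1 in (0, 1)]

# Sufficient statistics for the maximum likelihood estimator of each
# transition probability; the +1/+2 smoothing keeps early variances finite.
succ = np.ones((2, 2))
tries = np.full((2, 2), 2.0)

s = 0
for t in range(1, 301):
    # Steps (i)-(ii): draw a parameter from the normal approximation to the
    # sampling distribution of the MLE, project it back onto [0, 1], and
    # select the policy maximizing the truncated value under that draw.
    p_hat = succ / tries
    se = np.sqrt(p_hat * (1.0 - p_hat) / tries)
    p_tilde = np.clip(p_hat + rng.standard_normal((2, 2)) * se, 0.0, 1.0)
    T_t = min(5 + t // 20, 15)       # non-decreasing truncation sequence
    pi = max(policies, key=lambda pol: mc_value(pol, p_tilde, s, T_t))
    # Steps (iii)-(v): act, observe the transition, update the MLE statistics.
    a = pi[s]
    s_next = rng.binomial(1, p_true[s, a])
    succ[s, a] += s_next
    tries[s, a] += 1.0
    s = s_next

p_hat = succ / tries
print(np.round(p_hat, 2))
```

On frequently visited state-action pairs the estimated transition probabilities concentrate near the truth, while rarely chosen actions retain the smoothed prior value; either way the learned estimates favor the better action in each state.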
While we focus on maximum likelihood estimation, the results in the next section apply directly to any estimator that is asymptotically normal and satisfies the assumed regularity conditions. Even more generality is possible by assuming that there exists a sequence of positive constants such that and modifying the approximate Thompson sampling algorithm to draw samples where are independent and identically distributed sub-Gaussian random variables with mean zero. However, for simplicity, we focus our developments on the case of estimators that are consistent at the square-root rate and asymptotically normal.
3 Asymptotic properties
In this section, we derive rates of convergence for the regret . Define to be the information accumulated at time . In addition to (A0)-(A3), we assume that for all : (A4) with probability one; (A5) the parameter space ; (A6) ; and (A7) for , the log-likelihood satisfies , where . These assumptions are mild; (A6) holds if is regular and asymptotically normal, so that the approximate Thompson sampling algorithm draws samples from a -neighborhood of ; and (A7) holds under smoothness and moment conditions on the likelihood (heijmans1986first). Assumption (A5) simplifies our arguments but is not necessary; e.g., one could take to be a compact subset of and have be an interior point of . The following result is proved in the Supplemental Material.
Assume (A0)-(A7) and let be a sequence of non-decreasing positive integers. Then,
Thus, if then the right-hand side is .
The preceding result illustrates the bias-variance trade-off associated with choosing ; as increases, the Monte Carlo error incurred by approximating with decreases, however, as increases, the error in approximating with also increases. Setting optimally balances this trade-off in that it leads to the fastest rate of convergence.
The preceding result can be sharpened under additional assumptions on the behavior of . For any , define so that . In addition, for any define
so that measures the worst-case regret in an -neighborhood of . For any , define the radius of regret as . Thus, the radius measures how close must be to to ensure that is no more than , i.e., problems with a smaller radius are harder in the sense that a more accurate estimator of is required to ensure the same performance. This notion is formalized in the results that follow.
We strengthen (A6) to (A6’) , where is bounded with probability one, with probability one, and there exists such that , for all and . We modify Assumption (A7) to (A7’) where is bounded with probability one and for some , and all . Proofs of the following results are in the Supplemental Material.
Assume (A0)-(A5), (A6’) and (A7’) then for any
provided , where are constants that depend on the dimension and the discount factor but not on or .
Assume (A0)-(A5), (A6’), and (A7’). Furthermore, suppose that and define . Then , i.e., approximate Thompson sampling eventually always selects an optimal decision strategy with probability one.
The preceding theorem provides a probability bound on the regret of approximate Thompson sampling; the dependence on is intuitive in that the bound becomes looser as the radius decreases. The forms of the constants are given in the Supplementary Material.
In some settings it may be of interest to consider classes of models for the conditional distribution of given and in which the model complexity increases with . Theorem 1 can be extended to handle this case provided the likelihood is sufficiently smooth and a rate of convergence is available for the parameter estimators. Suppose that where and that for each one postulates a class of conditional densities indexed by . Let denote the projection of onto , let denote the maximum likelihood estimator of , and assume that for some (e.g., newey1997convergence). Furthermore, let denote a random perturbation of so that , i.e., where is the orthogonal projection onto , is a standard normal vector, and is an estimator of the square root of the asymptotic variance-covariance of . Furthermore, define to be the log-likelihood and write . Suppose that for all , , where ; and for some . Then, under mild regularity conditions
4 Illustrative simulation experiments
We illustrate the finite sample performance of the proposed approximate Thompson sampling algorithm using two simulated examples: (i) resource allocation for control of the spread of influenza using an agent-based compartmental model; and (ii) adaptive management of mallard populations in the U.S. The first example was chosen to illustrate the use of approximate Thompson sampling in a setting where the set of possible decisions is too large to apply existing methodologies. The second example was chosen to examine the performance of approximate Thompson sampling in a setting where classic approximate dynamic programming algorithms can be applied as a gold standard.
4.1 Control of influenza
In our first illustrative example, we consider a simple agent-based compartmental model for the daily spread of influenza within a closed population. We assume that the population is constant, i.e., there are no births or deaths. Each day, every member of the population is in one of three states: uninfected and susceptible; infected and contagious; or recovered and neither susceptible nor contagious (see keeling2008modeling; hens2012modeling, and references therein). We assume that disease transmission can occur only when an infected and a susceptible individual come into contact. To model the contact process, we assume that individuals can make contact only through direct links in a social network and that they move through different social networks over time according to the day of the week and their current health status. We assume a large population-level social network, which we term the public network. Each member of the population is also a member of a fully connected ‘family’ social network. The set of families is obtained by randomly partitioning the entire population into groups of size 1-15; details for computing such a random partition are in the Supplemental Material. Each individual in the population is classified as a student, employed, or retired (in this utopia there is no unemployment and students do not work). Students are randomly assigned to one of the schools and transition among their family network, school network, and the public network according to the decision rule in the left panel of Figure 1. Employees are randomly assigned to one of the employers and transition among their family network, work network, and public network according to the decision rule in the right panel of Figure 1. Retired persons attend the public space each day unless they are infected, in which case they attend the public space with probability and stay home with probability . Individuals are in their family social network whenever they are at home. We generate the public, work, and school networks according to a Barabasi-Albert, Erdos-Renyi, or Watts-Strogatz model (newman2010networks). While our contact model is quite simple, it serves to illustrate how the proposed method can be applied to more complex agent-based systems.
Let denote the state of individual at time . In our influenza model, , where: denotes the infection status of individual at time , so that 0 codes susceptible, 1 codes infected, and 2 codes recovered; denotes the age of subject , which is assumed to remain constant for all ; and denotes a measure of susceptibility. We draw the ages of individuals labeled as being in school independently from a uniform distribution on , ages of individuals labeled as employed are drawn independently from a uniform distribution on , and the ages of those labeled as retired are drawn independently from a uniform distribution on . Initial susceptibility for subject , , is drawn according to the linear model , where the errors are standard normal random variables that are independent across subjects and the coefficients and are chosen so that the initial susceptibility has (unconditional) mean zero and variance . Evolution of susceptibility follows a first-order autoregressive model, , where is an indicator that individual received treatment at time and the are independent normal random variables with mean zero and variance .
Let denote the set of infected individuals with whom individual could potentially make contact at time , i.e., those in the same social network at time with an edge between them. We assume that the probability that individual becomes infected at time is
where denotes the probability of contact and , , and are unknown parameters. Thus, the system dynamics model is indexed by . Let denote the true parameter values; in our simulation settings we set ; simulations with other parameter settings were qualitatively similar and are therefore omitted. We implemented approximate Thompson sampling using maximum likelihood to estimate and the observed Fisher information to approximate the variance-covariance matrix of this estimator.
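Since the display above is the object being fitted, a schematic implementation may help fix ideas. The functional form below is a hypothetical stand-in, not the paper's exact expression: it assumes each infected contact transmits independently, with a per-contact probability that is logistic in the individual's susceptibility; the parameter values in `theta` are illustrative.

```python
import math

def infection_probability(susceptibility, n_infected_contacts, contact_prob,
                          theta=(-2.0, 1.0)):
    """Hypothetical transmission model: independent per-contact transmission,
    with a per-contact probability that is logistic in susceptibility."""
    t0, t1 = theta
    per_contact = contact_prob / (1.0 + math.exp(-(t0 + t1 * susceptibility)))
    # probability of infection = 1 - probability of escaping every contact
    return 1.0 - (1.0 - per_contact) ** n_infected_contacts
```

With no infected contacts the probability is zero, and it increases monotonically in both the number of infected contacts and the individual's susceptibility, which is the qualitative behavior any such transmission model should exhibit.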
Let ; we optimized over a parametric class of policies of the form
where is a, possibly data-dependent, feature vector for individual constructed from and . Thus, the preceding policy ranks the individuals according to the score and then assigns treatment to the individuals with the largest scores; ties are broken at random. In this application, the feature vector for individual is , where: is an indicator of infection; is an indicator of susceptibility; is age in years; is susceptibility; and is the estimated probability of a contact between individuals and at the next time point.
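The rank-and-treat rule described above can be sketched as follows; the names `features` and `eta` and the permutation-based tie-breaking scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def topk_treatment_policy(features, eta, budget, rng=None):
    """Treat the `budget` individuals with the largest linear scores
    features @ eta; ties are broken uniformly at random. `features` is an
    n x d array of per-individual feature vectors and `eta` a d-vector of
    policy parameters."""
    rng = rng or np.random.default_rng()
    scores = features @ eta
    # permute first so that a stable descending sort breaks ties at random
    perm = rng.permutation(len(scores))
    order = perm[np.argsort(-scores[perm], kind="stable")]
    treat = np.zeros(len(scores), dtype=bool)
    treat[order[:budget]] = True
    return treat
```

In the simulation's setting, `budget` would be set to 20% of the population size to respect the treatment cap.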
We measure the performance of Thompson sampling in terms of the proportion of individuals who are infected and days after the disease outbreak. We select 10% of the population at random to start as infected at the time disease management begins. We assume that at most 20% of the population can be treated at each time point, e.g., in the class of policies given in (3). To form a baseline for evaluating the proposed algorithm, we also evaluate the performance of: (i) no treatment; and (ii) a myopic policy wherein treatment is applied to the 20% of the population having the highest estimated probability of becoming infected at the next time point. Table 1 shows the results based on 1000 Monte Carlo replications. Thompson sampling resulted in significantly fewer infections at the end of the observation period than no treatment or myopic treatment; the advantage of Thompson sampling appears to increase with population size.
| Network | Popn. size | No Treatment | Myopic | Thompson sampling |
| BA | 100 | 0.402 (0.000451) | 0.272 (0.000279) | 0.175 (0.000913) |
| BA | 1000 | 0.452 (0.000413) | 0.311 (0.000395) | 0.199 (0.000553) |
| BA | 10000 | 0.463 (0.005632) | 0.334 (0.004502) | 0.216 (0.007015) |
| BA | 100000 | 0.470 (0.000336) | 0.343 (0.000517) | 0.231 (0.004364) |
| WS | 100 | 0.409 (0.002713) | 0.289 (0.000308) | 0.222 (0.000449) |
| WS | 1000 | 0.471 (0.007324) | 0.346 (0.004761) | 0.261 (0.008027) |
| WS | 10000 | 0.479 (0.000531) | 0.377 (0.006054) | 0.283 (0.003962) |
| WS | 100000 | 0.492 (0.005904) | 0.387 (0.006382) | 0.291 (0.005043) |
| ER | 100 | 0.401 (0.000928) | 0.302 (0.000969) | 0.194 (0.006406) |
| ER | 1000 | 0.446 (0.000826) | 0.353 (0.005216) | 0.247 (0.005913) |
| ER | 10000 | 0.454 (0.000574) | 0.365 (0.003182) | 0.258 (0.004216) |
| ER | 100000 | 0.465 (0.000776) | 0.382 (0.004277) | 0.273 (0.005328) |
| Network | Popn. size | No Treatment | Myopic | Thompson sampling |
| BA | 100 | 0.340 (0.000104) | 0.187 (0.003379) | 0.0895 (0.000739) |
| BA | 1000 | 0.364 (0.000375) | 0.201 (0.000603) | 0.113 (0.000706) |
| BA | 10000 | 0.375 (0.000425) | 0.224 (0.000954) | 0.121 (0.001472) |
| BA | 100000 | 0.382 (0.000316) | 0.232 (0.000541) | 0.140 (0.003049) |
| WS | 100 | 0.326 (0.000523) | 0.191 (0.000237) | 0.109 (0.000837) |
| WS | 1000 | 0.368 (0.000416) | 0.203 (0.001824) | 0.116 (0.000635) |
| WS | 10000 | 0.372 (0.000865) | 0.220 (0.002083) | 0.132 (0.003346) |
| WS | 100000 | 0.385 (0.001206) | 0.228 (0.001724) | 0.143 (0.002501) |
| ER | 100 | 0.317 (0.000178) | 0.177 (0.000354) | 0.127 (0.000413) |
| ER | 1000 | 0.341 (0.000773) | 0.239 (0.000536) | 0.152 (0.004086) |
| ER | 10000 | 0.357 (0.000692) | 0.253 (0.000891) | 0.164 (0.003641) |
| ER | 100000 | 0.366 (0.000565) | 0.265 (0.000672) | 0.181 (0.005103) |
4.2 Management of mallard populations in the U.S.
In our second illustrative example, we consider adaptive management of mallards in the United States. The United States Fish and Wildlife Service began an adaptive harvest program in 1995 wherein measurements of current species abundance and ecological conditions are used to inform the allocation of waterfowl hunting licenses (johnson2015multilevel). The goal is to maximize the long-term cumulative harvest. During the past two decades, data collected as part of this program have been used to create and validate high-quality system dynamics models for mallard populations. Current practice is to use these dynamics models with approximate dynamic programming to select from among four types of harvest practices: (i) liberal; (ii) moderate; (iii) restricted; and (iv) closed. For each of these harvest practices, a different set of guidelines is passed to individual agencies, which set specific harvest limits based on these guidelines (fish2016ahm). Here we examine the performance of using approximate Thompson sampling to pick among these harvest practices each season. Our intent is to demonstrate that approximate Thompson sampling is competitive even in small domains where approximate dynamic programming can be applied directly.
The mallard population dynamics model we use here is based on the 2016 United States Fish and Wildlife Service model (it is simplified in that we consider a single flyway; see fish2016ahm for additional details). For each , let denote the numbers of adult males, adult females, young males, and young females in the population. Furthermore, for each , define to be the number of ponds, to be the reproductive rate, and to be the harvest practice at time . Under harvest practice , let , , , and be the survival rates for adult males, adult females, young males, and young females, respectively; these survival rates are assumed to be known and are provided in the Supplemental Material. The total population is assumed to evolve according to the difference equations
where are unknown parameters.
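To make the shape of such a cohort-projection update concrete, the following sketch shows a generic four-cohort annual transition in the spirit of, but not identical to, the USFWS model; the survival-weighted maturation step and the density-dependent recruitment term are assumptions for illustration, not the Service's published equations.

```python
def mallard_step(pop, ponds, repro_rate, survival):
    """One annual transition of a generic four-cohort projection model.
    `pop` holds counts (in millions) of adult males (am), adult females (af),
    young males (ym), and young females (yf); `survival` gives the
    harvest-practice-specific survival rate for each cohort."""
    # surviving young mature into the adult cohorts
    am = survival["am"] * pop["am"] + survival["ym"] * pop["ym"]
    af = survival["af"] * pop["af"] + survival["yf"] * pop["yf"]
    # hypothetical density-dependent recruitment: increases with the
    # reproductive rate, the number of females, and pond abundance
    recruits = repro_rate * af * ponds / (ponds + af)
    return {"am": am, "af": af, "ym": 0.5 * recruits, "yf": 0.5 * recruits}
```

Under this structure the unknown parameters would enter through the reproductive rate and recruitment term, which is where the model is fit by maximum likelihood in the simulation.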
We compare Thompson sampling fit with maximum likelihood against the following strategies: (i) always apply a liberal harvesting practice (Liberal); (ii) always apply a moderate harvesting practice (Moderate); (iii) always apply a restricted harvesting practice (Restricted); and (iv) approximate dynamic programming fit using radial basis functions (Approximate DP). The details of the approximate dynamic programming procedure are provided in the Supplemental Material. In each simulation setting, we generate historical data from 1995-2016 under the assumed model and then simulate management of the mallard population under each of the foregoing strategies over the next fifteen years. We varied the population size in 1995 from 6.0 million to 13.0 million and assumed that the management style from 1995-2016 followed the actual harvest decisions applied by the United States Fish and Wildlife Service.
The estimated average cumulative harvests over the fifteen-year management period are displayed in Table 2. These results are based on 1000 Monte Carlo replications. For each initial population size, approximate Thompson sampling is competitive with approximate dynamic programming, which is the current gold standard in this domain; approximate dynamic programming does not scale well, however, and thus cannot be applied to larger problems like the influenza model. Both approximate Thompson sampling and approximate dynamic programming perform similarly to the liberal harvest strategy, as both recommend this harvest practice often; frequent recommendation of the liberal harvest practice is consistent with the actual recommendations of the United States Fish and Wildlife Service over the past 20 years (johnson2015multilevel).
| Initial popn. size | Liberal | Moderate | Restricted | Approximate DP | Thompson sampling |
| 6.0 | 11.47 (0.314) | 10.67 (0.214) | 7.31 (0.085) | 11.48 (0.279) | 11.58 (0.293) |
| 7.0 | 12.35 (0.335) | 11.44 (0.226) | 7.76 (0.089) | 12.36 (0.297) | 12.47 (0.313) |
| 8.0 | 13.11 (0.352) | 12.11 (0.237) | 8.14 (0.091) | 13.12 (0.313) | 13.23 (0.328) |
| 9.0 | 13.80 (0.367) | 12.71 (0.246) | 8.49 (0.094) | 13.81 (0.325) | 13.92 (0.342) |
| 10.0 | 14.41 (0.378) | 13.24 (0.253) | 8.79 (0.096) | 14.42 (0.336) | 14.54 (0.354) |
| 11.0 | 14.97 (0.388) | 13.72 (0.259) | 9.07 (0.097) | 14.98 (0.345) | 15.10 (0.363) |
| 12.0 | 15.47 (0.397) | 14.16 (0.265) | 9.32 (0.099) | 15.49 (0.352) | 15.60 (0.372) |
| 13.0 | 15.94 (0.404) | 14.56 (0.270) | 9.54 (0.100) | 15.95 (0.359) | 16.07 (0.379) |
5 Discussion
We proposed a variant of Thompson sampling that can be applied to inform decision making in online contexts with potentially high-dimensional state or decision spaces. The proposed algorithm is simple to implement and can be applied to essentially any sequential decision problem with a posited class of dynamics models and an estimator of the parameters indexing the underlying dynamics model. Thus, we believe that Thompson sampling might prove useful as a general-purpose tool for a wide range of decision problems.
There are a number of ways in which this work might be extended. We briefly discuss one such extension that we feel is a pressing open problem. Thompson sampling maintains positive support across all feasible decisions; therefore, it may, by random chance, select a decision that would be viewed as unacceptable by domain experts, e.g., selecting a closed harvest practice when mallard populations are at a record high. In such settings, experts may choose to ‘override’ the decision selected by Thompson sampling, thereby disrupting the exploration-exploitation trade-off the algorithm uses to learn. Thus, an important extension of Thompson sampling would be one that could accommodate evolving, possibly data-dependent, constraints on the available decisions. Such an extension may have to redefine its notion of optimality, as the policy that leads to the maximal marginal mean outcome may no longer be identifiable under such constraints. We are currently pursuing such an extension.
Eric Laber acknowledges support from the National Science Foundation (DMS-1555141, DMS-1557733, DMS-1513579) and the National Institutes of Health (1R01 AA023187-01A1, P01 CA142538).
Supplementary material available at Biometrika online includes proofs for Theorems 1 and 2 as well as R code to replicate simulation results presented in Section 4.