On State Variables, Bandit Problems and POMDPs

02/14/2020 ∙ by Warren B. Powell, et al. ∙ 0

State variables are easily the most subtle dimension of sequential decision problems. This is especially true in the context of active learning problems (“bandit problems”) where decisions affect what we observe and learn. We describe our canonical framework that models any sequential decision problem, and present our definition of state variables that allows us to claim: Any properly modeled sequential decision problem is Markovian. We then present a novel two-agent perspective of partially observable Markov decision problems (POMDPs) that allows us to then claim: Any model of a real decision problem is (possibly) non-Markovian. We illustrate these perspectives using the context of observing and treating flu in a population, and provide examples of all four classes of policies in this setting. We close with an indication of how to extend this thinking to multiagent problems.




1 Introduction

Sequential decision problems span a genuinely vast range of applications including engineering, business, economics, finance, health, transportation, and energy. They encompass active learning problems that arise in the experimental sciences, medical decision making, e-commerce, and sports. They also include iterative algorithms for stochastic search, as well as two-agent games and multiagent systems. In fact, we might claim that virtually any human enterprise will include instances of sequential decision problems.

Sequential decision problems consist of sequences: decision, information, decision, information, ..., where decisions are determined according to a rule or function that we call a policy, which is a mapping from a state to a decision. This is arguably the richest problem class in modern data analytics. Yet, unlike fields such as machine learning or deterministic optimization, sequential decision problems lack a canonical modeling framework that is broadly used.

The core challenge in modeling sequential decision problems is modeling the state variable. A consistent theme across what I have been calling the “jungle of stochastic optimization” (see jungle.princeton.edu) is the lack of a standard definition (with the notable exception of the field of optimal control).

I am using this document to a) offer my own definition of a state variable (taken from Powell (2020)), b) suggest a new perspective on the modeling of learning problems (spanning bandit problems to partially observable Markov decision problems) and c) extend these ideas to multi-agent systems. This discussion draws from several sources I have written recently: Powell (2020), Powell (2019b), and Powell (2019a). These are all available at jungle.princeton.edu.

In this article, I am going to argue:

  • All properly modeled problems are Markovian.

  • All models of real applications are (possibly) non-Markovian.

These seem to be contradictory claims, but I am going to show that they reflect the different perspectives that lead to each claim.

We are going to begin by presenting our universal framework for modeling sequential decision problems in section 2. The framework presents sequential decision problems in terms of optimizing over policies. Section 3 provides a streamlined presentation of how to design policies for any sequential decision problem.

Section 4 then provides an in-depth discussion of state variables, including a brief history of state variables, our attempt at a proper definition, followed by illustrations in a variety of settings. Then, section 5 discusses partially observable Markov decision problems, and presents a two-agent model that offers a fresh perspective of all learning problems. We illustrate these ideas in section 6 using a problem setting of learning how to mitigate the spread of flu in a population. We then extend this thinking in section 9 to the field of multiagent systems.

This article is being published on arXiv only, so it is not subject to peer review. Instead, readers are invited to comment on this discussion at http://tinyurl.com/statevariablediscussion.

2 Modeling sequential decision problems

Any sequential decision problem can be written as a sequence of state, decision, information, state, decision, information. Written over time this would be given by

$$(S_0, x_0, W_1, S_1, x_1, W_2, \ldots, S_t, x_t, W_{t+1}, \ldots, S_T),$$

where $S_t$ is the “state” (to be defined below) at time $t$, $x_t$ is the decision made at time $t$ (using the information in $S_t$), and then $W_{t+1}$ is the information that arrives between $t$ and $t+1$ (which is not known when we make the decision $x_t$). Note that we start at time $t = 0$, and assume a finite horizon $T$ (standard in many communities, but not Markov decision processes).

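There are many problems where it is more natural to use a counter $n$ (as in the $n$th event or iteration). We write this as

$$(S^0, x^0, W^1, S^1, x^1, W^2, \ldots, S^n, x^n, W^{n+1}, \ldots, S^N).$$

Finally, there are times where we are iterating over simulations (e.g. the $n$th pass over a week-long simulation of ad-clicks), which we would write using

$$(S^n_0, x^n_0, W^n_1, S^n_1, x^n_1, W^n_2, \ldots, S^n_t, x^n_t, W^n_{t+1}, \ldots).$$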

The classical modeling framework used for Markov decision processes is to specify the tuple $(\mathcal{S}, \mathcal{A}, P, r)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space (the MDP community uses $a$ for action), $P$ is the one-step transition matrix with element $p(s' \mid s, a)$ which is the probability we transition from state $s$ to $s'$ when we take action $a$, and $r(s, a)$ is the reward if we are in state $s$ and take action $a$ (see Puterman (2005) [Chapter 3]). This modeling framework has been adopted by the reinforcement learning community, but in Powell (2019b) we argue that it does not provide a useful model, and ignores important elements of a real model.

Below we describe the five elements of any sequential decision problem. We first presented this modeling style in Powell (2011), but as noted in Powell (2019b), this framework is very close to the style used in the optimal control community (see, for example, Lewis & Vrabie (2012), Kirk (2004) and Sontag (1998)). After this, we illustrate the framework with a classical inventory problem (motivated by energy storage) and as a pure learning problem. The framework involves optimizing over policies, so we close with a discussion of designing policies.

2.1 Elements of a sequential decision problem

There are five dimensions of any sequential decision problem: state variables, decision variables, exogenous information processes, the transition function and the objective function.

State variables

- The state $S_t$ of the system at time $t$ contains all the information that is necessary and sufficient to compute costs/rewards, constraints, and the transition function (we return to state variables in section 4).

Decision variables

- Standard notation for decisions might be $a$ for action, $u$ for control, or $x$, which is the notation we use since it is standard in math programming. $x_t$ may be binary, one of a finite discrete set, or a continuous or discrete vector. We let $\mathcal{X}_t$ be the feasible region for $x_t$, where $\mathcal{X}_t$ may depend on $S_t$. We defer the problem of designing the policy until later.

Decisions are made with a decision function or policy, which we denote by $X^\pi(S_t)$ where “$\pi$” carries the information about the type of function $f \in \mathcal{F}$, and any tunable parameters $\theta \in \Theta^f$. We require that the policy satisfy $X^\pi(S_t) \in \mathcal{X}_t$.

Exogenous information

- We let $W_{t+1}$ be any new information that first becomes known at time $t+1$ (we can think of this as information arriving between $t$ and $t+1$). $W_{t+1}$ may depend on the state $S_t$ and/or the decision $x_t$, so it is useful to think of it as the function $W_{t+1}(S_t, x_t)$, but we write $W_{t+1}$ for compactness. This indexing style means any variable indexed by $t$ is known at time $t$.

Transition function

- We denote the transition function by

$$S_{t+1} = S^M(S_t, x_t, W_{t+1}), \qquad (1)$$

where $S^M(\cdot)$ is also known by names such as system model, state equation, plant model, plant equation and transfer function. $S^M(\cdot)$ contains the equations for updating each element of $S_t$.

Objective functions

- There are a number of ways to write objective functions. We begin by making the distinction between state-independent problems, and state-dependent problems. We let $F(x, W)$ denote a state-independent problem, where we assume that neither the objective function $F(x, W)$, nor any constraints, depends on dynamic information captured in the state variable. We let $C(S, x)$ capture state-dependent problems, where the objective function (and/or constraints) may depend on dynamic information.

We next make the distinction between optimizing the cumulative reward versus the final reward. Optimizing cumulative rewards typically arises when we are solving problems in the field where we are actually experiencing the effect of a decision, whereas final reward problems typically arise in laboratory environments.

Below we list the objectives that are most relevant to our discussion:

State-independent, final reward

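This is the classical stochastic search problem. Here we go through a learning/training process to find a final design/decision $x^{\pi,N}$, where $\pi$ is our search policy (or algorithm), and $N$ is the budget. We then have to test the performance of the policy by simulating $\widehat{W}$ using

$$\max_\pi \; \mathbb{E}_{S^0}\, \mathbb{E}_{W^1,\ldots,W^N \mid S^0}\, \mathbb{E}_{\widehat{W} \mid S^0}\, F(x^{\pi,N}, \widehat{W}), \qquad (2)$$

where $x^{\pi,N}$ depends on $S^0$ and the experiments $W^1, \ldots, W^N$, and where $\widehat{W}$ represents the process of testing the design $x^{\pi,N}$.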

State-independent, cumulative reward

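This is the standard representation of multi-armed bandit problems, where $S^n$ is what we believe about $\mathbb{E} F(x, W)$ after $n$ experiments. The objective function is written

$$\max_\pi \; \mathbb{E}\left\{ \sum_{n=0}^{N-1} F(X^\pi(S^n), W^{n+1}) \mid S^0 \right\}, \qquad (3)$$

where $F(X^\pi(S^n), W^{n+1})$ is our performance for the $(n+1)$st experiment using experimental settings $x^n = X^\pi(S^n)$, chosen using our belief $S^n$ based on what we know from experiments $1, \ldots, n$.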

State-dependent, cumulative reward

This is the version of the objective function that is most widely used in stochastic optimal control (as well as Markov decision processes). We switch back to time-indexing here since these problems are often evolving over time (but not always). We write the contribution in the form $C(S_t, x_t, W_{t+1})$ to help with the comparison to $F(x, W)$, which gives us

$$\max_\pi \; \mathbb{E}\left\{ \sum_{t=0}^{T} C(S_t, X^\pi(S_t), W_{t+1}) \mid S_0 \right\}. \qquad (4)$$

This is not an exhaustive list of objectives (for example, we did not list state-dependent, final reward). Other popular choices model regret or posterior-optimal solutions to compare against a benchmark. Risk is also an important issue.

Note that the objectives in (2) - (4) all involve searching over policies. Writing the model, and then designing an algorithm to solve the model, is absolutely standard in deterministic optimization. Oddly, the communities that address sequential decision problems tend to first choose a solution approach (what we call the policy), and then model the problem around the class of policy.

Our framework applies to any sequential decision problem. However, it is critical to create the model before we choose a policy. We refer to this style as “model first, then solve.” We address the problem of designing policies in section 3.

2.2 Energy storage illustration

We are going to use a simple energy storage problem to provide a basic illustration of the core framework. Our problem involves a storage device (such as a large battery) that can be used to buy/sell energy from/to the grid at a price that varies over time.

State variables

State $S_t = (R_t, p_t)$ where

$R_t$ = energy in the battery at time $t$,
$p_t$ = price of energy on the grid at time $t$.

Decision variables

$x_t$ is the amount of energy to purchase from the grid ($x_t > 0$) or sell to the grid ($x_t < 0$). We introduce the policy (function) $X^\pi(S_t)$ that will return a feasible vector $x_t$. We defer to later the challenge of designing a good policy.

Exogenous information variables

$W_{t+1} = \hat{p}_{t+1}$, where $\hat{p}_{t+1}$ is the price charged at time $t+1$ as reported by the grid. The price data could be from historical data, or field observations (for an online application), or a mathematical model.

Transition function

$S_{t+1} = S^M(S_t, x_t, W_{t+1})$, which consists of the equations:

$$R_{t+1} = R_t + x_t, \qquad (5)$$
$$p_{t+1} = \hat{p}_{t+1}. \qquad (6)$$


The transition function needs to include an equation for each element of the state variable. In real applications, the transition function can become quite complex (“500 lines of Matlab code” was how one professional described it).

Objective function

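Let $C(S_t, x_t)$ be the one-period contribution function given by

$$C(S_t, x_t) = -p_t x_t,$$

so that purchases ($x_t > 0$) cost money and sales ($x_t < 0$) earn revenue. We wish to find a policy that maximizes profits over time, so we use the cumulative reward objective, giving us

$$\max_\pi \; \mathbb{E}\left\{ \sum_{t=0}^{T} C(S_t, X^\pi(S_t)) \mid S_0 \right\}, \qquad (7)$$

where $S_{t+1} = S^M(S_t, X^\pi(S_t), W_{t+1})$ and where we are given an information process $(S_0, W_1, W_2, \ldots, W_T)$.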

Of course, this is a very simple problem. We are going to return to this problem in section 4 where we will use a series of modifications to illustrate how to model state variables with increasing complexity.
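To make the five elements concrete, the following is a minimal Python sketch of the storage model, not a definitive implementation. It assumes a scalar decision, the sign convention that buying is positive, a contribution equal to minus price times quantity, and a hypothetical buy-low, sell-high rule with thresholds `low` and `high` and a battery `capacity`; these names and numbers are illustrative assumptions that do not come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class State:
    """State of the storage problem: energy in the battery and the grid price."""
    energy: float
    price: float


def transition(state: State, x: float, new_price: float) -> State:
    """Transition function: one updating equation per element of the state."""
    return State(energy=state.energy + x, price=new_price)


def contribution(state: State, x: float) -> float:
    """One-period profit: buying (x > 0) costs money, selling (x < 0) earns it."""
    return -state.price * x


def buy_low_sell_high(state: State, low: float, high: float, capacity: float) -> float:
    """Hypothetical policy with tunable thresholds (an assumption for illustration)."""
    if state.price <= low:
        return capacity - state.energy  # fill the battery
    if state.price >= high:
        return -state.energy            # empty the battery
    return 0.0


def simulate(policy: Callable[[State], float], initial: State,
             prices: Sequence[float]) -> float:
    """Accumulate contributions along one sample path of exogenous prices."""
    state, total = initial, 0.0
    for new_price in prices:            # the new price arrives after x is chosen
        x = policy(state)
        total += contribution(state, x)
        state = transition(state, x, new_price)
    return total


if __name__ == "__main__":
    path = [50.0, 20.0, 35.0, 60.0, 25.0]
    rule = lambda s: buy_low_sell_high(s, low=25.0, high=55.0, capacity=10.0)
    print(simulate(rule, State(energy=0.0, price=30.0), path))  # prints 400.0
```

The simulator never looks inside the policy: any function from a state to a decision can be passed in, which is the sense in which the model is written first and the policy is chosen afterwards.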

2.3 Pure learning problem

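An important class of problems is pure learning problems, which are widely studied in the literature under the umbrella of multiarmed bandit problems. Assume that $x \in \mathcal{X} = \{x_1, \ldots, x_M\}$ represents different configurations for manufacturing a new model of electric vehicle which we are going to evaluate using a simulator. Let $\mu_x$ be the expected performance if we could run an infinitely long simulation. We assume that a single simulation (of reasonable duration) produces the performance

$$\hat{F}_x = \mu_x + \varepsilon,$$

where $\varepsilon \sim N(0, \sigma^2_W)$ is the noise from running a single simulation.

Assume we use a Bayesian model (we could do the entire exercise with a frequentist model), where our prior on the truth $\mu_x$ is given by $\mu_x \sim N(\bar{\mu}^0_x, \bar{\sigma}^{2,0}_x)$. Assume that we have performed $n$ simulations, and that $\mu_x \sim N(\bar{\mu}^n_x, \bar{\sigma}^{2,n}_x)$. Our belief $B^n$ about $\mu_x$ after $n$ simulations is then given by

$$B^n = (\bar{\mu}^n_x, \bar{\sigma}^{2,n}_x)_{x \in \mathcal{X}}. \qquad (8)$$

For convenience, we are going to define the precision of an experiment as $\beta^W = 1/\sigma^2_W$, and the precision of our belief about the performance of configuration $x$ as $\beta^n_x = 1/\bar{\sigma}^{2,n}_x$.

If we choose to try configuration $x^n$ and then run the experiment and observe $\hat{F}^{n+1}_{x^n}$, we update our beliefs using

$$\bar{\mu}^{n+1}_x = \frac{\beta^n_x \bar{\mu}^n_x + \beta^W \hat{F}^{n+1}_x}{\beta^n_x + \beta^W}, \qquad (9)$$
$$\beta^{n+1}_x = \beta^n_x + \beta^W, \qquad (10)$$

if $x = x^n$; otherwise, $\bar{\mu}^{n+1}_x = \bar{\mu}^n_x$ and $\beta^{n+1}_x = \beta^n_x$. These updating equations assume that beliefs are independent; it is a minor extension to allow for correlated beliefs.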

Also, these equations are for a Bayesian belief model. In section 4.3.2 we are going to illustrate learning with a frequentist belief model.
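The belief update for independent normal beliefs can be sketched in a few lines of Python. This is a minimal illustration under the stated assumptions (normal prior, known observation precision, independent beliefs); the function names and the dictionary layout mapping each alternative to a `(mean, precision)` pair are hypothetical choices made here.

```python
def update_belief(mean, precision, observation, obs_precision):
    """Precision-weighted update of a normal belief after one noisy observation."""
    new_precision = precision + obs_precision
    new_mean = (precision * mean + obs_precision * observation) / new_precision
    return new_mean, new_precision


def update_beliefs(beliefs, x, observation, obs_precision):
    """Return a new belief state; only the alternative x that was tested changes."""
    updated = dict(beliefs)
    updated[x] = update_belief(*beliefs[x], observation, obs_precision)
    return updated


if __name__ == "__main__":
    prior = {"A": (10.0, 0.25), "B": (12.0, 0.25)}  # (mean, precision) per alternative
    print(update_beliefs(prior, "A", observation=20.0, obs_precision=0.75))
```

Because the function returns a new dictionary rather than modifying its input, it has the form of a transition function: the old belief, the decision and the new observation go in, and the next belief comes out.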

We are now ready to state our model using the canonical framework:

State variables

The state variable is the belief $S^n = B^n$ given by equation (8).

Decision variables

The decision variable is the configuration $x^n \in \mathcal{X}$ that we wish to test next, which will be determined by a policy $X^\pi(S^n)$.

Exogenous information

This is the simulated performance given by $W^{n+1} = \hat{F}^{n+1}_{x^n}$.

Transition function

These are given by equations (9)-(10) for updating the beliefs.

Objective function

This is a state-independent problem (the only state variable is our belief about the performance). We have a budget to run $N$ simulations of different configurations. When the budget is exhausted, we choose the best design according to

$$x^{\pi,N} = \arg\max_{x \in \mathcal{X}} \bar{\mu}^N_x,$$

where we introduce the policy $\pi$ because $\bar{\mu}^N_x$ has been estimated by running experiments using experimentation policy $X^\pi(S^n)$. The performance of a policy is given by

$$F^\pi(S^0) = \mathbb{E}\{\mu_{x^{\pi,N}} \mid S^0\}.$$

Our goal is to then solve

$$\max_\pi F^\pi(S^0).$$

Note that when we made the transition from an energy storage problem to a learning problem, the modeling framework remained the same. The biggest change is the state variable, which is now a belief state.

The modeling of learning problems is somewhat ragged in the academic literature. In a tutorial on reinforcement learning, Lazaric (2019) states that bandit problems do not have a state variable (!!). In contrast, there is a substantial literature on bandit problems in the applied probability community that studies “Gittins indices” that is based on solving Bellman’s equation exactly where the state is the belief (see Gittins et al. (2011) for a nice overview of this field).

Our position is that a belief state is simply part of the state variable, which may include elements that we can observe perfectly, as well as beliefs about parameters that can only be estimated. This leaves us with the challenge of designing policies.

3 Designing policies

There are two fundamental strategies for designing policies, each of which can be further divided into two classes, producing four classes of policies:

Policy search

- Here we use any of the objective functions (2) - (4) to search within a family of functions to find the policy that works best. Policies in the policy-search class can be further divided into two classes:

Policy function approximations (PFAs)

PFAs are analytical functions that map states to actions. They can be lookup tables (if the chessboard is in this state, then make this move), or linear models which might be of the form

$$X^{PFA}(S_t \mid \theta) = \theta_0 + \theta_1 \phi_1(S_t) + \theta_2 \phi_2(S_t) + \ldots$$

PFAs can also be nonlinear models (buy low, sell high is a form of nonlinear model), or even a neural network.

Cost function approximations (CFAs)

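CFAs are parameterized optimization models. A simple one that is widely used in pure learning problems, called interval estimation, is given by

$$X^{IE}(S^n \mid \theta^{IE}) = \arg\max_{x \in \mathcal{X}} \left( \bar{\mu}^n_x + \theta^{IE} \bar{\sigma}^n_x \right).$$

The CFA might be a large linear program, such as that used to schedule aircraft where the amount of slack for weather delays is set at the $\theta$-percentile of the distribution of travel times. We can write this generally as

$$X^{CFA}(S^n \mid \theta) = \arg\max_{x \in \mathcal{X}^\pi(\theta)} \bar{C}^\pi(S^n, x \mid \theta),$$

where $\bar{C}^\pi(S^n, x \mid \theta)$ might be a parametrically modified objective function (e.g. with penalties for being late), while $\mathcal{X}^\pi(\theta)$ might be parametrically modified constraints (think of buffer stocks and schedule slack).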

Lookahead approximations

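- We can create an optimal policy if we could solve

$$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb{E}\left\{ \max_\pi \mathbb{E}\left\{ \sum_{t'=t+1}^{T} C(S_{t'}, X^\pi_{t'}(S_{t'})) \mid S_{t+1} \right\} \mid S_t, x_t \right\} \right). \qquad (11)$$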

In practice, equation (11) cannot be computed, so we have to resort to approximations. There are two approaches for creating these approximations:

Value function approximations (VFAs)

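The ideal VFA policy involves solving Bellman’s equation

$$V_t(S_t) = \max_{x_t} \left( C(S_t, x_t) + \mathbb{E}\{ V_{t+1}(S_{t+1}) \mid S_t, x_t \} \right). \qquad (12)$$

We can build a series of policies around Bellman’s equation:

$$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb{E}\{ V_{t+1}(S_{t+1}) \mid S_t, x_t \} \right), \qquad (13)$$
$$X^{VFA}_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb{E}\{ \bar{V}_{t+1}(S_{t+1}) \mid S_t, x_t \} \right) \qquad (14)$$
$$= \arg\max_{x_t} \left( C(S_t, x_t) + \bar{V}^x_t(S^x_t) \right) \qquad (15)$$
$$= \arg\max_{x_t} \left( C(S_t, x_t) + \sum_{f \in \mathcal{F}} \bar{\theta}_f \phi_f(S_t, x_t) \right) \qquad (16)$$
$$= \arg\max_{x_t} \bar{Q}_t(S_t, x_t). \qquad (17)$$

The policy given in equation (13) would be optimal if we could compute $V_{t+1}(S_{t+1})$ from (12) exactly. Equation (14) replaces the value function with an approximation $\bar{V}_{t+1}(S_{t+1})$, which assumes that a) we can come up with a reasonable approximation and b) we can compute the expectation. Equation (15) eliminates the expectation by using the post-decision state $S^x_t$ (see Powell (2011) for a discussion of post-decision states). Equation (16) introduces a linear model for the value function approximation. Finally, equation (17) writes the equation in the form of $Q$-learning used in the reinforcement learning community.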

Direct lookaheads (DLAs)

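The second approach is to create an approximate lookahead model. If we are making a decision at time $t$, we represent our lookahead model using the same notation as the base model, but replace the state $S_t$ with $\tilde{S}_{tt'}$, the decision $x_t$ with $\tilde{x}_{tt'}$ which is determined with policy $\tilde{X}^{\tilde{\pi}}(\tilde{S}_{tt'})$, and the exogenous information $W_t$ with $\tilde{W}_{tt'}$. This gives us an approximate lookahead policy

$$X^{DLA}_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \tilde{\mathbb{E}}\left\{ \max_{\tilde{\pi}} \tilde{\mathbb{E}}\left\{ \sum_{t'=t+1}^{T} C(\tilde{S}_{tt'}, \tilde{X}^{\tilde{\pi}}(\tilde{S}_{tt'})) \mid \tilde{S}_{t,t+1} \right\} \mid S_t, x_t \right\} \right). \qquad (18)$$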

We claim that these four classes are universal, which means that any policy designed for any sequential decision problem will fall in one of these four classes, or a hybrid of two or more. We further insist that all four classes are important. Powell & Meisel (2016) demonstrates that each of the four classes of policies may work best, depending on the characteristics of the datasets, for the energy storage problem described in section 2.2. Further, all four classes of policies have been used (by different communities) for pure learning problems.
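As an illustration of the policy-search classes, here is a minimal Python sketch of the interval estimation policy mentioned under CFAs, applied to the belief state of the pure learning problem. It assumes beliefs are stored as a mean and a precision per alternative; the function name, the dictionary layout and the tunable parameter name `theta` are hypothetical choices, not notation fixed by the text.

```python
import math


def interval_estimation(means, precisions, theta):
    """Pick the alternative maximizing estimated mean + theta * standard deviation."""
    def index(x):
        return means[x] + theta * math.sqrt(1.0 / precisions[x])
    return max(means, key=index)


if __name__ == "__main__":
    means = {"a": 10.0, "b": 9.0}
    precisions = {"a": 4.0, "b": 0.25}  # standard deviations 0.5 and 2.0
    print(interval_estimation(means, precisions, theta=0.0))  # prints a
    print(interval_estimation(means, precisions, theta=1.0))  # prints b
```

With `theta = 0` the policy is purely greedy. Larger values favor alternatives we are uncertain about, and tuning `theta` against one of the objective functions is what makes this a search over policies.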

We emphasize that these are four meta-classes. Choosing one of the meta-classes does not mean that you are done, but it does help guide the process. Most (almost all) of the literature on decisions under uncertainty is written with one of the four classes in mind. We think all four classes are important. Most important is that at least one of the four classes will work, which is why we insist on “model first, then solve.”
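For the lookahead classes, Bellman's equation can be solved exactly only for very small problems. The sketch below does this by backward induction for a toy version of the storage problem. It is a simplification made here for illustration: prices are assumed to be known in advance (so the expectation disappears and the state reduces to the storage level), storage is an integer between 0 and `capacity`, and at most `max_rate` units can be bought or sold per period. None of these assumptions come from the text.

```python
def backward_induction(prices, capacity, max_rate=1):
    """Solve V_t(R) = max_x ( -p_t * x + V_{t+1}(R + x) ) backward in time."""
    horizon = len(prices)
    # value[t][r] is the best profit from time t onward with r units stored.
    value = [[0.0] * (capacity + 1) for _ in range(horizon + 1)]  # terminal value 0
    decision = [[0] * (capacity + 1) for _ in range(horizon)]
    for t in reversed(range(horizon)):
        for r in range(capacity + 1):
            best_x, best_v = 0, float("-inf")
            for x in range(-max_rate, max_rate + 1):
                if 0 <= r + x <= capacity:  # feasible region depends on the state
                    v = -prices[t] * x + value[t + 1][r + x]
                    if v > best_v:
                        best_x, best_v = x, v
            value[t][r], decision[t][r] = best_v, best_x
    return value, decision


if __name__ == "__main__":
    value, decision = backward_induction([20.0, 60.0], capacity=2)
    print(value[0][0], decision[0][0])  # prints 40.0 1
```

The table `decision` is a lookup-table policy derived from the value function. Once prices are random and the state grows beyond a handful of values, this table cannot be computed, which is what motivates the approximations listed above.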

It has been our experience that the most consistent error made in the modeling of sequential decision problems arises with state variables, so we address this next. It is through the state variable that we can model problems with physical states, belief states or both. Regardless of the makeup of the state variable, we will still turn to the four classes of policies for making decisions.

4 State variables

The definition of a state variable is central to the proper modeling of any sequential decision problem, because it captures the information available to make a decision, along with the information needed to compute the objective function and the transition function. The policy is the function that derives information from the state variable to make decisions.

We are going to start in section 4.1 with a brief history of state variables. In section 4.2 we offer our own definition of a state variable (this is taken from Powell (2020) which in turn is based on the definition offered in Powell (2011)[Chapter 5, available at http://adp.princeton.edu]). Section 4.3 then provides a series of extensions of our energy storage problem to illustrate history-dependent problems, passive and active learning, and the widely overlooked issue of modeling rolling forecasts. We close by giving a probabilist’s measure-theoretic perspective of information and state variables in section 4.4.

4.1 A brief history of state variables

Our experience is that there is an almost universal misunderstanding of what is meant by a “state variable.” Not surprisingly, interpretations of the term “state variable” vary between communities. An indication of the confusion can be traced to attempts to define state variables. For example, Bellman introduces state variables with “we have a physical system characterized at any stage by a small set of parameters, the state variables” (Bellman, 1957). Puterman’s now classic text introduces state variables with “At each decision epoch, the system occupies a state.” (Puterman, 2005)[p. 18] (in both cases, the italicized text was included in the original text). As of this writing, Wikipedia offers “A state variable is one of the set of variables that are used to describe the mathematical ‘state’ of a dynamical system.” Note that all three references use the word “state” in the definition of state variable (which means it is not a proper definition).

In fact, the vast majority of books that deal with sequential decision problems in some form do not offer a definition of a state variable, with one notable exception: the optimal control community. There, we have found that books in optimal control routinely offer an explicit definition of a state variable. For example, Kirk (2004) offers:

  • A state variable is a set of quantities $x_1(t), x_2(t), \ldots, x_n(t)$ [WBP: the controls community uses $x(t)$ for the state variable] which if known at time $t = t_0$ are determined for $t \ge t_0$ by specifying the inputs for $t \ge t_0$.

Cassandras & Lafortune (2008) has the definition:

  • The state of a system at time $t_0$ is the information required at $t_0$ such that the output [cost] $y(t)$ for all $t \ge t_0$ is uniquely determined from this information and from [the control] $u(t)$, $t \ge t_0$.

We have observed that the pattern of designing state variables is consistent across books in deterministic control, but not stochastic control. We feel this is because deterministic control books are written by engineers who need to model real problems, while stochastic control books are written by mathematicians.

There is a surprisingly widespread belief that a system can be “non-Markovian” but can be made “Markovian” by adding to the state variable. This is nicely illustrated in Cinlar (2011):

  • The definitions of “time” and “state” depend on the application at hand and the demands of mathematical tractability. Otherwise, if such practical considerations are ignored, every stochastic process can be made Markovian by enhancing its state space sufficiently.

We agree with the basic principle expressed in the controls books, which can all be re-stated as saying “A state variable is all the information you need (along with exogenous inputs) to model the system from time $t$ onward.” Our only complaint is that this is a bit vague.

On the other hand, we disagree with the widely held belief that stochastic systems “can be made Markovian” which runs against the core principle in the definitions in the optimal control books that the state variable is all the information needed to model the system from time $t$ onward. If it is all the information, then it is Markovian by construction.

There are two key areas of misunderstanding that we are going to address with our discussion. The first is a surprisingly widespread misunderstanding about “Markov” vs. “history-dependent” systems. The second, and far more subtle, arises when there are hidden or unobservable variables.

4.2 A modern definition

We offer two definitions depending on whether we have a system where the structure of the policy has been specified, and when it has not (this is taken from Powell (2020)).

  • A state variable is:

    a) Policy-dependent version

    A function of history that, combined with the exogenous information (and a policy), is necessary and sufficient to compute the decision function (the policy), the cost/contribution function, and the transition function.

    b) Optimization version

    A function of history that, combined with the exogenous information, is necessary and sufficient to compute the cost or contribution function, the constraints, and the transition function.

Both of these definitions lead us to our first claim:

  • Claim 1: All properly modeled systems are Markovian.

But stay tuned; later, we are going to argue the opposite, but there will be a slight change in the wording that explains the apparent contradiction.

Note that both of these definitions are consistent with those used in the controls community, with the only difference that we have specified that we can identify state variables by looking at the requirements of three functions: the cost/contribution function, the constraints (which is a form of function), and the transition function.

One issue that we are going to address arises when we ask “What is a transition function?”

  • The transition function, which we write $S_{t+1} = S^M(S_t, x_t, W_{t+1})$, is the set of equations that describes how each element of the state variable evolves over time.

We quickly see that we have circular reasoning: a state variable includes the information we need to model the transition function, and the transition function is the equations that describe the evolution of the state variables. It turns out that this circular logic is unavoidable, as we illustrate later (in section 4.3.4).

We have found it useful to identify three types of state variables:

Physical state

- The physical state $R_t$ captures inventories, the location of a device on a graph, the demand for a product, the amount of energy available from a wind farm, or the status of a machine. Physical states typically appear in the right hand sides of constraints.

Other information

- The “other information” variable $I_t$ is literally any other information about observable parameters not included in $R_t$.

Belief state

- The belief state $B_t$ captures the parameters of a probability distribution describing unobservable parameters. This could be the mean and variance of a normal distribution, or a set of probabilities.

We present the three types of state variables as being in distinct classes, but it is more accurate to describe them as a series of nested sets as depicted in figure 1. The physical state variables $R_t$ describe quantities that constrain the system (inventories, location of a truck, demands) that are known perfectly. We then describe $I_t$ as “other information” but it might help to think of $I_t$ as any parameter that we observe perfectly, which could include $R_t$. Finally, $B_t$ is the parameters of probability distributions for any quantity that we do not know perfectly, but a special case of a probability distribution is a point estimate with zero variance, which could include $I_t$ (and then $R_t$). Our choice of the three variables is designed purely to help with the modeling process.

Figure 1: Physical state variables $R_t$, as a subset of other information $I_t$, as a subset of belief state variables $B_t$.

We explicitly model the “resource state” $R_t$ because we have found in some communities (and this is certainly true of operations research) that people tend to equate “state” and “physical state.” We do not offer an explicit definition of $R_t$, although we note that it typically includes dynamic information in right-hand side constraints. $R_{ti}$ might be the number of units of blood of type $i$ at time $t$; $R_{ta}$ might be the number of resources with attribute vector $a$. For example, $R_{ta}$ could be how much we have invested in an asset, where $a$ captures the type of asset, how long it has been invested, and other information (such as the current price of the asset).

Some authors find it convenient to distinguish between two types of states:

Exogenous states

These are dynamically varying parameters that evolve purely from an exogenous process.

Controllable states

These are the state variables that are directly or indirectly affected by decisions.

In the massive class of problems known as “dynamic resource allocation,” $R_t$ would be the physical state, and this would also be the controllable state. However, there may be variables in $I_t$ that are also controllable (at least indirectly). There will also be states (such as water in a reservoir, or the state of disease in a patient) that evolve due to a mixture of exogenous and controllable processes.

Later we are going to illustrate the widespread confusion in the handling of “states” (physical states in our language) and “belief states” in the literature on partially observable Markov decision processes (POMDPs).

4.3 More illustrations

We are going to use our energy storage problem to illustrate the handling of so-called “history-dependent” problems (in section 4.3.1), followed by examples of passive and active learning (in sections 4.3.2 and 4.3.3), closing with an illustration of the circular logic for defining state variables and transition functions using rolling forecasts (in section 4.3.4). The material in this section is taken from Powell (2020).

4.3.1 With a time-series price model

Our basic model assumed that prices evolved according to a purely exogenous process (see equation (6)). Now assume that it is governed by the time series model

$$p_{t+1} = \theta_0 p_t + \theta_1 p_{t-1} + \theta_2 p_{t-2} + \varepsilon_{t+1}. \qquad (19)$$

A common mistake is to say that $p_t$ is the “state” of the price process, and then observe that it is no longer Markovian (it would be called “history dependent”), but “it can be made Markovian by expanding the state variable,” which would be done by including $p_{t-1}$ and $p_{t-2}$. According to our definition of a state variable, the state is all the information needed to model the process from time $t$ onward, which means that the state of our price process is $(p_t, p_{t-1}, p_{t-2})$. This means our system state variable is now

$$S_t = \left( R_t, (p_t, p_{t-1}, p_{t-2}) \right).$$

We then have to modify our transition function so that the “price state variable” at time $t+1$ becomes $(p_{t+1}, p_t, p_{t-1})$.
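The modified transition for the price state can be sketched as follows. This is a minimal illustration assuming a third-order autoregressive price model with known coefficients; the function name and the tuple layout (most recent price first) are hypothetical choices made here.

```python
def price_transition(price_state, theta, noise):
    """Map (p_t, p_{t-1}, p_{t-2}) to (p_{t+1}, p_t, p_{t-1}) under a lag-3 model."""
    p_t, p_tm1, p_tm2 = price_state
    p_next = theta[0] * p_t + theta[1] * p_tm1 + theta[2] * p_tm2 + noise
    return (p_next, p_t, p_tm1)  # the oldest price drops out of the state


if __name__ == "__main__":
    print(price_transition((30.0, 20.0, 10.0), theta=(0.5, 0.25, 0.25), noise=1.0))
```

The function needs exactly the three most recent prices and nothing older, which is why those three prices, and only those, belong in the state.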

4.3.2 With passive learning

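We implicitly assumed that our price process in equation (19) was governed by a model where the coefficients $\theta = (\theta_0, \theta_1, \theta_2)$ were known. Now assume that the vector $\theta$ is unknown, which means we have to use estimates $\bar{\theta}_t = (\bar{\theta}_{t0}, \bar{\theta}_{t1}, \bar{\theta}_{t2})$, which gives us the price model

$$p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \varepsilon_{t+1}. \qquad (20)$$

We have to adaptively update our estimate $\bar{\theta}_t$ which we can do using recursive least squares. To do this, let

$$\bar{p}_t = (p_t, p_{t-1}, p_{t-2})^T, \qquad \bar{F}_t(\bar{p}_t \mid \bar{\theta}_t) = (\bar{p}_t)^T \bar{\theta}_t.$$

We perform the updating using a standard set of updating equations given by

$$\bar{\theta}_{t+1} = \bar{\theta}_t + \frac{1}{\gamma_t} M_t \bar{p}_t \varepsilon_{t+1}, \qquad (21)$$
$$\varepsilon_{t+1} = p_{t+1} - \bar{F}_t(\bar{p}_t \mid \bar{\theta}_t), \qquad (22)$$
$$M_{t+1} = M_t - \frac{1}{\gamma_t} M_t \bar{p}_t (\bar{p}_t)^T M_t, \qquad (23)$$
$$\gamma_t = 1 + (\bar{p}_t)^T M_t \bar{p}_t. \qquad (24)$$

To compute these equations, we need the three-element vector $\bar{\theta}_t$ and the $3 \times 3$ matrix $M_t$. These then need to be added to our state variable, giving us

$$S_t = \left( R_t, (p_t, p_{t-1}, p_{t-2}), (\bar{\theta}_t, M_t) \right).$$

We then have to include equations (21) - (24) in our transition function.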
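A recursive least squares step can be sketched in plain Python as below. This is a minimal illustration, not the author's code: it uses the standard form in which the error is the observation minus the prediction, it assumes the matrix `M` is symmetric (true when it is initialized as a multiple of the identity), and the function and argument names are hypothetical.

```python
def rls_update(theta, M, features, observed):
    """One recursive least squares step for the linear model observed ~ features . theta.

    theta: current coefficient estimates (list of floats)
    M: current symmetric matrix (list of lists), e.g. a large multiple of the identity
    features: regressor vector, e.g. the three most recent prices
    observed: the new observation, e.g. the next price
    """
    k = len(theta)
    # M times the feature vector; reused in every update equation.
    Mf = [sum(M[i][j] * features[j] for j in range(k)) for i in range(k)]
    gamma = 1.0 + sum(features[i] * Mf[i] for i in range(k))
    error = observed - sum(features[i] * theta[i] for i in range(k))
    new_theta = [theta[i] + Mf[i] * error / gamma for i in range(k)]
    # For symmetric M, M f f^T M equals the outer product of Mf with itself.
    new_M = [[M[i][j] - Mf[i] * Mf[j] / gamma for j in range(k)] for i in range(k)]
    return new_theta, new_M


if __name__ == "__main__":
    theta, M = [0.0], [[1.0]]
    print(rls_update(theta, M, features=[1.0], observed=2.0))  # prints ([1.0], [[0.5]])
```

The update reads only the current estimates, the matrix, the regressors and the new observation, so the estimates and the matrix are exactly the extra quantities that must be carried in the state.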

4.3.3 With active learning

We can further generalize our model by assuming that our decision $x_t$ to buy or sell energy from or to the grid can have an impact on prices. We might propose a modified price model given by

$$p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \bar{\theta}_{t3} x_t + \varepsilon_{t+1}. \qquad (25)$$

All we have done is introduce a single term $\bar{\theta}_{t3} x_t$ (which specifies how much we buy/sell from/to the grid) to our price model. Assuming that $\theta_3 > 0$, this model implies that purchasing power from the grid ($x_t > 0$) will increase grid prices, while selling power back to the grid ($x_t < 0$) decreases prices. This means that purchasing a lot of power from the grid (for example) means we are more likely to observe higher prices, which may assist the process of learning $\theta$. When decisions control or influence what we observe, then this is an example of active learning, which we saw in section 2.3 when we described a pure learning problem.

This change in our price model does not affect the state variable from the previous model, aside from adding one more element to $\bar{\theta}_t$, with the required changes to the matrix $M_t$. The change will, however, have an impact on the policy. It is easier to learn $\theta_3$ if there is a nice spread in the prices, which is enhanced by varying $x_t$ over a wide range. This means trying values of $x_t$ that do not appear to be optimal given our current estimate of the vector $\bar{\theta}_t$. Making decisions partly just to learn (to make better decisions in the future) is the essence of active learning, best known in the field of multiarmed bandit problems.

4.3.4 With rolling forecasts

We are going to assume that we are given a rolling forecast from an outside source. This is quite common, and yet is surprisingly overlooked in the modeling of dynamic systems (including inventory/storage systems, for which there is an extensive literature). We are going to use rolling forecasts to illustrate the interaction between the modeling of state variables and the creation of the transition function.

Imagine that we are modeling the energy $E_t$ from wind, which means we would have to add $E_t$ to our state variable. We need to model how $E_t$ evolves over time. Assume we have a forecast $f^E_{t,t+1}$ of the energy $E_{t+1}$ from wind, which means

$$E_{t+1} = f^E_{t,t+1} + \varepsilon_{t+1,1}, \tag{26}$$

where $\varepsilon_{t+1,1}$ is the random variable capturing the one-period-ahead error in the forecast.

Equation (26) needs to be added to the transition equations for our model. However, it introduces a new variable, the forecast $f^E_{t,t+1}$, which must now be added to the state variable. This means we now need a transition equation to describe how $f^E_{t,t+1}$ evolves over time. We do this by using a two-period-ahead forecast, $f^E_{t,t+2}$, which is basically a forecast of $f^E_{t+1,t+2}$, plus an error, giving us

$$f^E_{t+1,t+2} = f^E_{t,t+2} + \varepsilon_{t+1,2}, \tag{27}$$

where $\varepsilon_{t+1,2}$ is the two-period-ahead error (we are assuming that the variance in a forecast increases linearly with time). Now we have to put $f^E_{t,t+2}$ in the state variable, which generates a new transition equation. This generalizes to

$$f^E_{t+1,t'} = f^E_{t,t'} + \varepsilon_{t+1,t'-t}, \tag{28}$$

where $\varepsilon_{t+1,t'-t} \sim N(0, (t'-t)\sigma^2_\varepsilon)$. This process illustrates the back and forth between defining the state variable and creating the transition function that we hinted at earlier.

This stops, of course, when we hit the planning horizon $H$. This means that we now have to add

$$f^E_t = (f^E_{t,t'})_{t'=t+1}^{t+H}$$

to the state variable, with the transition equations (28) for $t' = t+1, \ldots, t+H$. Combined with the learning statistics, our state variable is now

It is useful to note that we have a nice illustration of the three elements of our state variable:

The physical state variables,
other information,
the belief state, since these parameters determine the distribution of belief about variables that are not known perfectly.
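The rolling-forecast transition described in this section can be sketched as a single function that advances the whole forecast vector by one period. This is a minimal sketch under stated assumptions: forecast errors are taken to be independent Gaussians with variance growing linearly in the lead time, the name `step_forecasts` is hypothetical, and the rule used to seed the period that enters the horizon is a placeholder for whatever the outside forecast source would supply.

```python
import math
import random


def step_forecasts(forecasts, sigma, rng):
    """Advance a rolling wind forecast by one period.

    forecasts[k] holds the time-t forecast of energy k+1 periods ahead,
    so forecasts[0] plays the role of the one-period-ahead forecast.
    Returns the realized energy at t+1 and the forecast vector held at t+1.
    """
    horizon = len(forecasts)
    # Realized energy: one-period-ahead forecast plus its error.
    energy_next = forecasts[0] + rng.gauss(0.0, sigma)
    # Each remaining forecast is revised by an error whose variance grows
    # linearly in the lead time tau (standard deviation sigma * sqrt(tau)).
    updated = [
        forecasts[tau - 1] + rng.gauss(0.0, sigma * math.sqrt(tau))
        for tau in range(2, horizon + 1)
    ]
    # Assumption: the period entering the horizon is seeded with the latest
    # value we hold; a real model would read it from the forecast provider.
    updated.append(updated[-1] if updated else energy_next)
    return energy_next, updated
```

The function needs the entire forecast vector as input, which is the practical reason every element of the forecast has to sit in the state variable.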

4.4 A probabilist’s perspective of information

We would be remiss in a discussion of state variables if we did not cover how the mathematical probability community thinks of “information” and “state variables.” We note that this section is completely optional, as will be seen by the end of the section.

We begin by introducing what is widely known as boilerplate language when describing stochastic processes.

  • Let $(S_0, W_1, W_2, \ldots, W_T)$ be the sequence of exogenous information variables, beginning with the initial state $S_0$ (that may contain a Bayesian prior), followed by the exogenous information contained in $W_1, \ldots, W_T$. Let $\omega \in \Omega$ be a sample sequence of a truth (contained in $S_0$), and a realization of $W_1, \ldots, W_T$. Let $\mathcal{F}$ be the $\sigma$-algebra (also written “sigma-algebra”) on $\Omega$, which captures all the events that might be defined on $\Omega$. The set $\mathcal{F}$ is the set of all countable unions and complements of the elements of $\Omega$, which is to say every possible event. Let $\mathcal{P}$ be a probability measure on $(\Omega, \mathcal{F})$ (if $\Omega$ is discrete, $\mathcal{P}$ would be a probability mass function). Now let $\mathcal{F}_t = \sigma(S_0, W_1, \ldots, W_t)$ be the $\sigma$-algebra generated by the process $(S_0, W_1, \ldots, W_t)$, which means it reflects the subsets of $\Omega$ that we can identify using the information that has been revealed up through time $t$. The sequence $\mathcal{F}_0, \mathcal{F}_1, \ldots, \mathcal{F}_T$ is referred to as a filtration, which means that $\mathcal{F}_t \subseteq \mathcal{F}_{t+1}$ (as more information is revealed, we are able to see more fine-grained events on $\Omega$, which acts like a sequence of filters with increasingly finer openings).

This terminology is known as boilerplate because it can be copied and pasted into any model with a stochastic process $W_1, \ldots, W_T$, and does not change with applications (readers are given permission to copy this paragraph word for word, but as we show below, it is not necessary).

We need this language to handle the following issue. In a deterministic problem, the decisions $x_0, x_1, \ldots, x_T$ represent a sequence of numbers (these can be scalars or vectors). In a stochastic problem, there is a decision $x_t(\omega)$ for each sample path $\omega$, which represents a realization of the entire sequence $W_1, \ldots, W_T$. This means that if we write $x_t(\omega)$, it is as if we are telling $x_t$ the entire sample path, which means it gets to see what is going to happen in the future.

We fix this by insisting that the function $x_t(\omega)$ be “$\mathcal{F}_t$-measurable,” which means that $x_t$ is not allowed to depend on the outcomes of $W_{t+1}, W_{t+2}, \ldots, W_T$. We get the same behavior if we write $x_t$ explicitly as a function $X^\pi(S_t)$ (that we call a policy) that depends on just the information in the state $S_t$. Note that the state $S_t$ is purely a function of the history $W_1, \ldots, W_t$, in addition to the decisions $x_0, \ldots, x_{t-1}$.

Theoreticians will use $\mathcal{F}_t$ or $S_t$ to represent “information,” but they are not equivalent. $\mathcal{F}_t$ contains all the information in the exogenous sequence $W_1, \ldots, W_t$. The state $S_t$, on the other hand, is constructed from the sequence $W_1, \ldots, W_t$, but only includes the information we need to compute the objective function, constraints and transition function. Also, $S_t$ can always be represented as a vector of real-valued numbers, while $\mathcal{F}_t$ is a set of events which contain the information needed to compute $S_t$. The set $\mathcal{F}_t$ is more general, hence its appeal to mathematicians, while $S_t$ contains the information we actually need to model our problem.

The state can be viewed from three different perspectives, depending on what time it is:

  • We are at time $t$ - In a sequential decision problem, if we are talking about $S_t$ then it usually means we are at time $t$, in which case $S_t$ is a particular realization of a set of numbers that capture everything we need from history to model our system moving forward.

  • We are at time $t = 0$ - We might be trying to choose the best policy (or some other fixed parameter), in which case we are at time 0, and $S_t$ would be a random variable since we do not know what state we will be in at time $t$ when we are at time $0$.

  • We are at time $T$ - Finally, a probabilist sits at time $T$ and sees all the outcomes $\omega \in \Omega$ (and therefore all the events in $\mathcal{F}$), but from his perspective at time $T$, if you ask him a question about $S_t$ at time $t$, he will recognize only events in $\mathcal{F}_t$ (remember that each event in $\mathcal{F}_t$ is a subset of sample paths $\omega \in \Omega$). For example, if we are running simulations using historical data, and we cheat and use information from time $t' > t$ to make a decision at time $t$, that implies we are seeing an event that is in $\mathcal{F}_{t'}$, but which is not in $\mathcal{F}_t$. In such a case, our decision would not be “$\mathcal{F}_t$-measurable.”

Readers without training in measure-theoretic probability will find this language unfamiliar, even threatening. We will just note that the following statements are all completely equivalent.

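  1) The policy $X^\pi(S_t)$ (or decision $x_t$) is $\mathcal{F}_t$-measurable.

  2) The policy $X^\pi(S_t)$ (or decision $x_t$) is nonanticipative.

  3) The policy $X^\pi(S_t)$ (or decision $x_t$) is “adapted.”

  4) The policy $X^\pi(S_t)$ (or decision $x_t$) is a function of the state $S_t$.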

Readers without formal training in measure-theoretic probability will likely find statement (4) to be straightforward and easy to understand. We are here to say that you only need to understand statement (4), which means you can write models (and even publish papers) without any of the other formalism in this section.
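The practical meaning of "a function of the state" can be sketched with a small simulator in which the policy is handed only the current state and never the sample path. This is an illustrative sketch, not the paper's model: the running-average state, the price threshold of 5, and the names `transition`, `policy` and `simulate` are all assumptions made for the example.

```python
def transition(state, x, w):
    """Update a (count, running mean) state with the new observation w.

    The decision x does not affect this simple state.
    """
    count, mean = state
    count += 1
    return count, mean + (w - mean) / count


def policy(state):
    """Hypothetical rule: buy (1) while the average observed price is below 5."""
    _, mean = state
    return 1 if mean < 5.0 else 0


def simulate(policy, path, initial_state=(0, 0.0)):
    """Run the policy along one sample path, exposing only the state S_t."""
    state, decisions = initial_state, []
    for w_next in path:
        decisions.append(policy(state))  # x_t depends on S_t only
        state = transition(state, decisions[-1], w_next)  # W_{t+1} arrives afterwards
    return decisions
```

Two sample paths that agree on their first two observations, such as `[3.0, 8.0, 9.0, 1.0]` and `[3.0, 8.0, 0.0, 0.0]`, must produce the same first three decisions, because those decisions are made before the paths diverge. A simulator that passed the whole `path` to the policy would lose this property.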

5 Partially observable Markov decision processes

Partially observable Markov decision processes (POMDPs) broadly describe any sequential decision problem that involves learning an environment that cannot be precisely observed. However, they are most often associated with problems where decisions can affect the environment, which was not the case in our pure learning problems in section 2.3.

We are going to describe our environment in terms of a set of parameters that we are trying to learn. It is helpful to identify three classes of problems:

1) Static unobservable parameters - These are problems where we are trying to learn the values of a set of static parameters, which might be the response of a function given different inputs, or the parameters characterizing an unknown function. These experiments could be run in a simulator or laboratory, or in the field. Examples are:

  • The strength of material resulting from the use of different catalysts.

  • Designing a business system using a simulator. We might be designing the layout of an assembly line, evaluating the number of aircraft in a fleet, or finding the best locations of warehouses in a logistics network.

  • Evaluating the parameters of a policy for stocking inventory or buying stock.

  • Controlling robots moving in a static but unknown environment.

2) Dynamic unobservable parameters - Now we are trying to learn the values of parameters that are evolving over time. These come in two flavors:

2a) Exogenous, uncontrollable process - These are problems where environmental parameters will evolve over time due to a purely exogenous source:

  • Demand for hotel rooms as a function of price in changing market conditions (where our decisions do not affect the market).

  • Robots moving in an uncertain environment that is changing due to weather.

  • Finding the best path through a congested network after a major road has been closed due to construction forcing people to explore new routes.

  • Finding the best price for a ridesharing fleet to balance drivers and riders. The best price evolves as the number of drivers and riders changes over the course of the day.

2b) Controllable process - These are problems where the controlling agent makes decisions that directly or indirectly affect environmental parameters:

  • Equipment maintenance - We perform inspections to determine the state of the machine, and then perform repairs, which change the state of the equipment.

  • Medical treatments - We test for a disease, and then treat using drugs, which change the progression of the disease.

  • Managing a utility truck after a storm, where the truck is both observing and repairing damaged lines, while we control the truck.

  • Invasive plant or animal species management - We perform inspections, and implement steps to mitigate further spread of the invasive species.

  • Spread of the flu - We can take samples from the population, and then administer flu vaccinations to reduce incidence of the disease.

Class (1) represents our pure learning problems (which we first touched on in section 2.3), often referred to as multiarmed bandits, although the arms (choices) may be continuous and/or vector valued. The important characteristic of class (1) is that our underlying problem (the environment) is assumed to be static. Also, we may be learning in the field, where we are interested in optimizing cumulative rewards (this is the classic bandit setting) or in a laboratory, where we are only interested in the final performance.

Class (2) covers problems where the environment is changing over time. Class (2a) covers problems where the environment is evolving exogenously. This has been widely studied under the umbrella of “restless bandits.” The modeling of these systems is similar to class (1). Class (2b) arises when our decisions directly or indirectly affect the parameters that cannot be observed, which is the domain of POMDPs. We are going to illustrate this with a problem to treat flu in a population.

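Earlier we noted that Markov decision problems are often represented by the tuple $(\mathcal{S}, \mathcal{X}, P, r)$ (we use $\mathcal{X}$ for action space, but standard notation in this community is to use $\mathcal{A}$ for action). The POMDP community extends this representation by representing the POMDP as the tuple $(\mathcal{S}, \mathcal{X}, P, r, \mathcal{W}^{obs}, P^{obs})$ where $\mathcal{S}$, $\mathcal{X}$, $P$ and $r$ are as they were with the basic MDP, $\mathcal{W}^{obs}$ is the space of possible observations that can be made of the environment, and $P^{obs}$ is the “observation function” where

$$P^{obs}(w^{obs} \mid s) = \text{the probability we observe outcome } w^{obs} \text{ when the unobservable state } S_t = s.$$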

The notation for modeling POMDPs is not standard. Different authors may use or for the observation, and may use , or for the space of outcomes. Some will use for outcome and for the “observation function.” Our choice of (which is not standard) helps us to avoid the use of “” for notation, and makes it clear that it is the probability of making an observation, which parallels our one-step transition matrix which describes the evolution of the “state” .

Remark: There is a bit of confusion in the modeling of uncertainty in POMDPs. The tuple $(\mathcal{S}, \mathcal{X}, P, r, \mathcal{W}^{obs}, P^{obs})$ represents uncertainty in both the transition matrix $P$, and then through the pair $(\mathcal{W}^{obs}, P^{obs})$ which captures both observations and the probability of making an observation. Recall that we represent the transition function in equation (1) using $S_{t+1} = S^M(S_t, x_t, W_{t+1})$ where $W_{t+1}$ is our “exogenous information.” This is the exogenous information (that is, the random inputs) that drives the evolution of our “physical system” with state $S_t$. We use this random variable to compute our one-step transition matrix from the transition function using

$$P(s' \mid s, x) = \mathbb{E}_W \left\{ \mathbb{1}_{\{s' = S^M(s, x, W)\}} \right\}.$$

The random variables $W$ and $W^{obs}$ may be the same. For example, we may have a queueing system where $W$ is the random number of customers arriving, which we are allowed to observe. The unknown parameter might be the arrival rate of customers, which we can estimate using $W^{obs}$. In other settings $W$ and $W^{obs}$ may be completely different. For example, $W$ might be the random transmission of disease in a population, while $W^{obs}$ is the outcome of random samples.
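Computing a one-step transition matrix from a transition function amounts to summing the probability of every exogenous outcome that leads to a given next state. The sketch below does this exactly for a small queue, assuming a finite, known distribution for the arrivals. The capacity, the service rule, the arrival probabilities and the names `one_step_matrix` and `queue_transition` are all hypothetical.

```python
from collections import defaultdict


def one_step_matrix(states, decisions, w_dist, transition):
    """Build P(s' | s, x) by summing Prob(W = w) over outcomes w that lead to s'."""
    P = {}
    for s in states:
        for x in decisions:
            row = defaultdict(float)
            for w, prob in w_dist.items():
                row[transition(s, x, w)] += prob
            P[(s, x)] = dict(row)
    return P


CAPACITY = 3  # hypothetical queue capacity


def queue_transition(s, x, w):
    """Serve x customers (if present), then admit w arrivals up to capacity."""
    return min(CAPACITY, max(0, s - x) + w)


arrival_dist = {0: 0.5, 1: 0.3, 2: 0.2}  # assumed distribution of arrivals W
P = one_step_matrix(range(CAPACITY + 1), (0, 1), arrival_dist, queue_transition)
```

When the distribution of the exogenous information is not available in closed form, the same expectation could be approximated by sampling, at the cost of a noisy matrix.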

What is important is that the standard modeling representation for POMDPs is as a single, extended problem. The POMDP framework ensures that the choice of action (decision in our notation) is not allowed to see the state of the system, but it does assume that the transition function and the observation function are both known. Later we are going to offer a different approach for modeling POMDPs. Before we do this, it is going to help to have an actual example in mind.
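When the transition function and observation function are both known, the belief about the hidden state can be maintained with the standard discrete Bayes filter. The source does not spell this recursion out, so the sketch below is a generic illustration: the nested-dictionary layout `P[x][s][s_next]` and `P_obs[s][w_obs]` and the name `belief_update` are assumptions made for the example.

```python
def belief_update(belief, x, w_obs, P, P_obs):
    """Update a discrete belief after taking decision x and observing w_obs.

    belief : dict mapping state -> probability
    P      : P[x][s][s_next], the one-step transition probabilities
    P_obs  : P_obs[s][w_obs], probability of observing w_obs in state s
    """
    states = list(belief)
    # Prediction step: push the belief through the known transition matrix.
    predicted = {
        s_next: sum(P[x][s][s_next] * belief[s] for s in states)
        for s_next in states
    }
    # Correction step: weight each state by the likelihood of the observation.
    unnormalized = {s: P_obs[s][w_obs] * predicted[s] for s in states}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: value / total for s, value in unnormalized.items()}
```

The decision never receives the hidden state itself. It is computed from `belief`, which is why the belief plays the role of the state variable for the controller.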

6 A learning problem: protecting against the flu

We are going to use the problem of protecting a population against the flu as an illustrative example. It will start as a learning problem with an unknown but controllable parameter, which is the prevalence of the flu in the population. We will use this to illustrate different classes of policies, after which we will propose several extensions.

6.1 A static model

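Let $\mu$ be the prevalence of the flu in the population (that is, the fraction of the population that has come down with the flu). In a static problem where we have an unknown parameter $\mu$, we make observations using

$$W_{t+1} = \mu + \varepsilon_{t+1}, \tag{30}$$

where the noise $\varepsilon_{t+1}$ is what keeps us from observing $\mu$ perfectly.

We express our belief about $\mu$ by assuming that $\mu \sim N(\bar\mu_t, \bar\sigma^2_t)$. Since we fix the assumption of normality, we express our belief about $\mu$ as $B_t = (\bar\mu_t, \bar\sigma^2_t)$. We are again going to express uncertainty using $\beta_t = 1/\bar\sigma^2_t$, which is the precision of our estimate of $\mu$, and $\beta^W$, which is the precision of our observation noise $\varepsilon_{t+1}$.

We need to estimate the number of people with the disease by running tests, which produces the noisy estimate $W_{t+1}$. We represent the decision to run a test by the decision variable $x_t$, where $x_t = 1$ if we run a test and $x_t = 0$ otherwise.

If $x_t = 1$, then we observe $W_{t+1}$, which we can use to update our belief about $\mu$ using

$$\bar\mu_{t+1} = \frac{\beta_t \bar\mu_t + \beta^W W_{t+1}}{\beta_t + \beta^W}, \tag{31}$$
$$\beta_{t+1} = \beta_t + \beta^W. \tag{32}$$

If $x_t = 0$, then $\bar\mu_{t+1} = \bar\mu_t$ and $\beta_{t+1} = \beta_t$.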
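The belief update for the static model is the standard precision-weighted average for a normal prior with normal observation noise. A minimal sketch follows, with hypothetical argument names (`mu_bar` for the mean of the belief, `beta` for its precision, `beta_w` for the precision of the observation noise):

```python
def update_belief(mu_bar, beta, x, w=None, beta_w=1.0):
    """Return the updated (mean, precision) of the belief about the prevalence.

    x = 1 means a test was run and produced the noisy observation w.
    x = 0 means no test was run, so the belief is unchanged.
    """
    if x == 0:
        return mu_bar, beta
    if w is None:
        raise ValueError("an observation w is required when x == 1")
    beta_new = beta + beta_w  # precisions add
    mu_new = (beta * mu_bar + beta_w * w) / beta_new  # precision-weighted mean
    return mu_new, beta_new
```

For example, a belief with mean 0.10 and precision 4 combined with an observation of 0.20 at precision 1 moves one fifth of the way toward the observation, giving a mean of 0.12 and a precision of 5.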

For this problem our state variable is our belief about $\mu$, which we write $S_t = B_t = (\bar\mu_t, \beta_t)$.

If this were our entire problem, it would be an instance of a one-armed bandit. We might assess a cost for making an observation, along with a cost of uncertainty. For example, assume we have the following costs:

The cost of sampling the population to estimate the number of people infected with the flu,
the cost of uncertainty,

Using this information, we can put this model in our canonical framework as follows:

State variables

$S_t = (\bar\mu_t, \beta_t)$.

Decision variables

$x_t \in \{0, 1\}$, determined by our policy $X^\pi(S_t)$ (to be determined later).

Exogenous information

$W_{t+1}$, which is our noisy estimate of how many people have the flu from equation (30) (and we only obtain this if $x_t = 1$).

Transition function

Equations (31) and (32).

Objective function

We would write our objective as


We now need a policy $X^\pi(S_t)$ to determine $x_t$. We can use any of the four classes of policies described in section 3. We sketch examples of policies in section 8 below.
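To make the five elements of the canonical model concrete, the static flu model can be simulated end to end with a placeholder policy. This is a sketch under loudly stated assumptions: the threshold rule is a hypothetical stand-in for the policies of section 8, the cost (a fixed charge `c_obs` per test plus `c_unc` times the standard deviation of the belief) is an assumed form, and all numerical values are illustrative.

```python
import math
import random


def threshold_policy(state, max_std=0.5):
    """Hypothetical rule: test while the standard deviation of our belief is too high."""
    _, beta = state
    return 1 if 1.0 / math.sqrt(beta) > max_std else 0


def simulate_flu(policy, mu_true, state, beta_w, horizon, c_obs, c_unc, rng):
    """Simulate the static flu model and accumulate an assumed cost."""
    total_cost, decisions = 0.0, []
    for _ in range(horizon):
        mu_bar, beta = state
        x = policy(state)  # decision from the state only
        # Assumed cost: c_obs per test plus c_unc times the belief std.
        total_cost += c_obs * x + c_unc / math.sqrt(beta)
        if x == 1:
            # Exogenous information: noisy observation of the true prevalence.
            w = mu_true + rng.gauss(0.0, 1.0 / math.sqrt(beta_w))
            # Transition function: precision-weighted belief update.
            state = ((beta * mu_bar + beta_w * w) / (beta + beta_w), beta + beta_w)
        decisions.append(x)
    return state, decisions, total_cost
```

Comparing policies then reduces to averaging `total_cost` over many simulated sample paths for each candidate.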

6.2 Variations of our flu model

We are going to present a series of variations of our flu model to bring out different modeling issues:

  • A time-varying model

  • A time-varying model with drift

  • A dynamic model with a controllable truth

  • A flu model with a resource constraint and exogenous state

  • A spatial model

These variations are designed to bring out the modeling issues that arise when we have an evolving truth (with known dynamics), an evolving truth with unknown dynamics (the drift), an unknown truth that we can control (or influence), followed by problems that introduce the dimension of having a known and controllable physical state.

6.2.1 A time-varying model

If the true prevalence of the flu is evolving exogenously (as we would expect in this application), then we would write the true parameter as depending on time, $\mu_t$, which might evolve according to

$$\mu_{t+1} = \mu_t + \varepsilon^\mu_{t+1},$$

where $\varepsilon^\mu_{t+1}$ describes how our truth is evolving. If the truth evolves with zero mean and known variance $\sigma^{2,\mu}$, our belief state is the same as it was with a static truth (that is, $S_t = B_t = (\bar\mu_t, \beta_t)$). What does change is the transition function, which now has to reflect both the noise of an observation and the uncertainty in the evolution of the truth, captured by $\varepsilon^\mu_{t+1}$.
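One way to see how the transition function changes is a scalar Kalman-style update, in which the variance of the belief is first inflated by the variance of the truth's random step and then reduced by the observation. The source does not give these equations here, so this is a generic sketch: the name `update_time_varying` and the variance-based parameterization are assumptions.

```python
def update_time_varying(mu_bar, sigma2, x, w, sigma2_w, sigma2_eps):
    """Belief update when the truth takes a zero-mean random step each period.

    mu_bar, sigma2 : mean and variance of the current belief
    x              : 1 if a test was run (observation w available), else 0
    sigma2_w       : variance of the observation noise
    sigma2_eps     : known variance of the step in the truth
    """
    # The truth moves with zero mean, so the mean of the belief is unchanged,
    # but we become less certain about where the truth now is.
    sigma2_pred = sigma2 + sigma2_eps
    if x == 0:
        return mu_bar, sigma2_pred
    gain = sigma2_pred / (sigma2_pred + sigma2_w)
    return mu_bar + gain * (w - mu_bar), (1.0 - gain) * sigma2_pred
```

Unlike the static model, uncertainty now grows whenever we do not test, so a policy can no longer stop observing for good once the belief is sharp.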

Remark: When $\mu$ was a constant, we did not have a problem referring to it as a parameter, whereas the state of our system is the belief $B_t$ which evolves over time (state variables should only include information that changes over time). When $\mu$ is changing over time, in which case we write it as $\mu_t$, then it is more natural to think of the value of $\mu_t$ as the state of the system, but not observable to the controller. For this reason, many authors would refer to $\mu_t$ as a hidden state. However, we still have the belief about $\mu_t$, which creates some confusion: What is the state variable? We are going to resolve this confusion below.

6.2.2 A time-varying model with drift

Now assume that

If , then it means that

is drifting higher or lower (for the moment, we are going to assume that

is a constant). We do not know , so we would assign a belief such as

Again let the precision be given by .

We might update our estimate of our belief about using

Now we can update our estimate of the mean and variance of our belief about using