1 Introduction
There is a vast range of problems that consist of the sequence: decisions, information, decisions, information, ... . Application areas span engineering, business, economics, finance, health, transportation, and energy. They encompass active learning problems that arise in the experimental sciences, medical decision making, e-commerce, and sports. They also include iterative algorithms for stochastic search, as well as two-agent games and multiagent systems. In fact, we might claim that virtually any human enterprise will include instances of sequential decision problems.
Given the diversity of problem domains, it should not be a surprise that a number of communities have emerged to address the problem of making decisions over time to optimize some metric. That so many communities exist is a testament to the variety of problems, but it also hints at the many methods that are needed to solve them. As of this writing, no single method has emerged that solves all problems. In fact, it is fair to say that all the methods that have been proposed are fragile: relatively modest changes can invalidate a theoretical result, or increase run times by orders of magnitude.
In Powell (2019), we present a unified framework for all sequential decision problems. This framework consists of a mathematical model (that draws heavily from the framework used widely in stochastic control), which requires optimizing over policies which are functions for making decisions given what we know at a point in time (captured by the state variable).
The significant advance of the unified framework is the identification of four (meta)classes of policies that encompass all the communities. In fact, whereas the solution approach offered by each community is fragile, we claim that the four classes are universal: any policy proposed for any sequential decision problem will consist of one of these four classes, and possibly a hybrid.
The contribution of the framework is to raise the visibility of all of the communities. Instead of focusing on a specific solution approach (for example, the use of Hamilton-Jacobi-Bellman (HJB) equations, which is one of the four classes), the framework encourages people to consider all four classes, and then to design policies that are best suited to the characteristics of a problem.
This chapter is going to focus attention on two specific communities: stochastic optimal control, and reinforcement learning. Stochastic optimal control emerged in the 1950’s, building on what was already a mature community for deterministic optimal control that emerged in the early 1900’s and has been adopted around the world. Reinforcement learning, on the other hand, emerged from computer science in the 1980’s and 1990’s, building on the foundation of Markov decision processes which was introduced in the 1950’s (in fact, the first use of the term “stochastic optimal control” is attributed to Bellman, who invented Markov decision processes). It grew to prominence in 2016 when it was credited with solving the Chinese game of Go using AlphaGo.
We are going to make the following points:

Both communities have evolved from a core theoretical/algorithmic result based on Hamilton-Jacobi-Bellman equations, transitioning from exact results (that were quite limited), to the use of algorithms based on approximating value functions/cost-to-go functions/Q-factors, to other strategies that do not depend on HJB equations. We will argue that each of the fields is in the process of recognizing all four classes of policies.

We will present and contrast the canonical modeling frameworks for stochastic control and reinforcement learning (adopted from Markov decision processes). We will show that the framework for stochastic control is very flexible and scalable to real applications, while that used by reinforcement learning is limited to a small problem class.

We will present a universal modeling framework for sequential decision analytics (given in Powell (2019)) that covers any sequential decision problem. The framework draws heavily from that used by stochastic control, with some minor adjustments. While not used by the reinforcement learning community, we will argue that it is used implicitly. In the process, we will dramatically expand the range of problems that can be viewed as either stochastic control problems, or reinforcement learning problems.
We begin our presentation in section 2 with an overview of the different communities that work on sequential decisions under uncertainty, along with a list of major problem classes. Section 3 presents a side-by-side comparison of the modeling frameworks of stochastic optimal control and reinforcement learning.
Section 4 next presents our universal framework (taken from Powell (2019)), and argues that a) it covers all 15+ fields (presented in section 2) dealing with sequential decisions and uncertainty, b) it draws heavily from the standard model of stochastic optimal control, and c) the framework of reinforcement learning, inherited from discrete Markov decision processes, has fundamental weaknesses that limit its applicability to a very narrow class of problems. We then illustrate the framework using an energy storage problem in section 5; this application offers tremendous richness, and allows us to illustrate the flexibility of the framework.
The central challenge of our modeling framework involves optimizing over policies, which represents our point of departure with the rest of the literature, since it is standard to pick a class of policy in advance. However, this leaves open the problem of how to search over policies. In section 6 we present four (meta)classes of policies which, we claim, are universal, in that any approach suggested in the literature (or in practice) is drawn from one of these four classes, or a hybrid of two or more. Section 7 illustrates all four classes, along with a hybrid, using the context of our energy storage application. These examples will include hybrid resource allocation/active learning problems, along with the overlooked challenge of dealing with rolling forecasts.
Section 8 briefly discusses how to use the framework to model multiagent systems, and notes that this vocabulary provides a fresh perspective on partially observable Markov decision processes. Then, section 9 concludes the chapter with a series of observations about reinforcement learning, stochastic optimal control, and our universal framework.
2 The communities of sequential decisions
The list of potential applications of sequential decision problems is virtually limitless. Below we list a number of major application domains. Ultimately we are going to model all of these using the same framework.
 Discrete problems

These are problems with discrete states and discrete decisions (actions), such as stochastic shortest path problems.
 Control problems

These span controlling robots, drones, rockets and submersibles, where states are continuous (location and velocity) as are controls (forces). Other examples include determining optimal dosages of medications, or continuous inventory (or storage) problems that arise in finance and energy.
 Dynamic resource allocation problems

Here we are typically managing inventories (retail products, food, blood, energy, money, drugs), typically over space and time. It also covers discrete problems such as dynamically routing vehicles, or managing people or machines. It would also cover planning the movements of robots and drones (but not how to do it). The scope of “dynamic resource allocation problems” is almost limitless.
 Active learning problems

This includes any problem that involves learning, and where decisions affect what information is collected (laboratory experiments, field experiments, test marketing, computer simulations, medical testing). It spans multi-armed bandit problems, e-commerce (bidding, recommender systems), black-box simulations, and simulation-optimization.
 Hybrid learning/resource allocation problems

This would arise if we are managing a drone that is collecting information, which means we have to manage a physical resource while running experiments which are then used to update beliefs. Other problems are laboratory science experiments with setups (this is the physical resource), collecting public health information from field technicians, and any experimental learning setting with a budget constraint.
 Stochastic search

This includes both derivative-based and derivative-free stochastic optimization.
 Adversarial games

This includes any two-player (or multiplayer) adversarial games. It includes pricing in markets where price affects market behavior, and military applications.
 Multiagent problems

This covers problems with multiple decision-makers who might be competing or cooperating. They might be making the same decisions (but spatially distributed), or making different decisions that interact (as arises in supply chains, or where different agents play different roles but have to work together).
Given the diversity of problems, it should not be surprising that a number of different research communities have evolved to address them, each with their own vocabulary and solution methods, creating what we have called the “jungle of stochastic optimization” (Powell (2014), see also jungle.princeton.edu). A list of the different communities that address the problem of solving sequential decision-information problems might be:

Stochastic search (derivative-based)

Ranking and selection (derivative-free)

(Stochastic) optimal control

Markov decision processes/dynamic programming

Simulation-optimization

Optimal stopping

Model predictive control

Stochastic programming

Chance-constrained programming

Approximate/adaptive/neuro-dynamic programming

Reinforcement learning

Robust optimization

Online computation

Multi-armed bandits

Active learning

Partially observable Markov decision processes
Each of these communities is supported by at least one book and over a thousand papers.
Some of these fields include problem classes that can be described as static: make decision, see information (possibly make one more decision), and then the problem stops (stochastic programming and robust optimization are obvious examples). However, all of them include problems that are fully sequential, consisting of sequences of decision, information, decision, information, ..., over a finite or infinite horizon. The focus of this chapter is on fully sequential problems.
Several of the communities offer elegant theoretical frameworks that lead to optimal solutions for specialized problems (Markov decision processes and optimal control are two prominent examples). Others offer asymptotically optimal algorithms: derivative-based and certain derivative-free stochastic optimization problems, simulation-optimization, and certain instances of approximate dynamic programming and reinforcement learning. Still others offer theoretical guarantees, often in the form of regret bounds (that is, bounds on how far the solution is from optimal).
We now turn our attention to focus on the fields of stochastic optimal control and reinforcement learning.
3 Stochastic optimal control vs. reinforcement learning
There are numerous communities that have contributed to the broad area of modeling and solving sequential decision problems, but there are two that stand out: optimal control (which laid the foundation for stochastic optimal control), and Markov decision processes, which provided the analytical foundation for reinforcement learning. Although these fields have intersected at different times in their history, today they offer contrasting frameworks which, nonetheless, are steadily converging to common solution strategies.
We present the modeling frameworks of (stochastic) optimal control and reinforcement learning (drawn from Markov decision processes), which are polar opposites. Given the growing popularity of reinforcement learning, we think it is worthwhile to compare and contrast these frameworks. We then present our own universal framework which spans all the fields that deal with any form of sequential decision problems. The reader will quickly see that our framework is quite close to that used by the (stochastic) optimal control community, with a few adjustments.
3.1 Stochastic control
The field of optimal control enjoys a long and rich history, as evidenced by the number of popular books that have been written focusing on deterministic control, including Vrabie & Lewis (2009), Kirk (2004), and Stengel (1994). There are also a number of books on stochastic control (see Sethi (2019), Nisio (2014), Sontag (1998), Stengel (1986), Bertsekas & Shreve (1978), Kushner (1971)) but these tend to be mathematically more advanced.
Deterministic optimal control problems are typically written
$$\min_{u_0,\ldots,u_{T-1}} \sum_{t=0}^{T-1} L_t(x_t,u_t) + L_T(x_T), \tag{1}$$
where $x_t$ is the state at time $t$, $u_t$ is the control (that is, the decision) and $L_t(x_t,u_t)$ is a loss function with terminal loss $L_T(x_T)$. The state $x_t$ evolves according to
$$x_{t+1} = f(x_t,u_t), \tag{2}$$
where $f(x_t,u_t)$ is variously known as the transition function, system model, plant model (as in chemical or power plant), plant equation, and transition law. We write the control problem in discrete time, but there is an extensive literature where this is written in continuous time, and the transition function is written $\dot{x}_t = f(x_t,u_t)$.
The most common form of a stochastic control problem simply introduces additive noise to the transition function given by
$$x_{t+1} = f(x_t,u_t) + w_t, \tag{3}$$
where $w_t$ is random at time $t$. This odd notation arose because of the continuous time formulation, where $w_t$ would be disturbances (such as wind pushing against an aircraft) between $t$ and $t+dt$. The introduction of the uncertainty means that the state variable $x_t$ is a random variable when we are sitting at time 0. Since the control $u_t$ is also a function of the state, this means that $u_t$ is also a random variable. Common practice is to then take an expectation of the objective function in equation (1), which produces
$$\min_{u_0,\ldots,u_{T-1}} \mathbb{E}\left\{ \sum_{t=0}^{T-1} L_t(x_t,u_t) + L_T(x_T) \right\}, \tag{4}$$
which has to be solved subject to the constraint in equation (3). This is problematic, because we have to interpret (3) recognizing that $x_t$ and $u_t$ are random since they depend on the sequence $w_0,\ldots,w_{t-1}$.
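The expectation in (4) can rarely be computed exactly, but for any fixed rule that maps states to controls it can be estimated by simulation. The sketch below is illustrative only: the scalar dynamics `f`, the quadratic losses, the Gaussian noise, and the linear rule `policy` are all assumptions chosen for the example and are not part of the model above.

```python
import random

def f(x, u):
    # Hypothetical deterministic part of the transition function.
    return 0.9 * x + u

def loss(x, u):
    # Hypothetical per-period loss L_t(x_t, u_t).
    return x * x + 0.1 * u * u

def terminal_loss(x):
    # Hypothetical terminal loss L_T(x_T).
    return x * x

def policy(x):
    # The control is a function of the state, so it uses only
    # information that is available at time t.
    return -0.5 * x

def simulate_cost(x0, T, rng, noise_std=0.1):
    """One sample of sum_t L_t(x_t, u_t) + L_T(x_T) under x_{t+1} = f(x_t, u_t) + w_t."""
    x, total = x0, 0.0
    for _ in range(T):
        u = policy(x)
        total += loss(x, u)
        w = rng.gauss(0.0, noise_std)  # additive noise w_t (assumed Gaussian)
        x = f(x, u) + w
    return total + terminal_loss(x)

def estimate_objective(x0, T, n_samples=1000, seed=0):
    """Monte Carlo estimate of the expectation in (4) for the fixed policy."""
    rng = random.Random(seed)
    return sum(simulate_cost(x0, T, rng) for _ in range(n_samples)) / n_samples
```

Comparing `estimate_objective` across different choices of `policy` is one simple way to see that the object being optimized in (4) is a function rather than a vector of numbers.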
In the deterministic formulation of the problem, we are looking for an optimal control vector $u^*_0,\ldots,u^*_{T-1}$. When we introduce the random variable $w_t$, then the controls need to be interpreted as functions that depend on the information available at time $t$. Mathematicians handle this by saying that “$u_t$ must be $\mathcal{F}_t$-measurable” which means, in plain English, that the control $u_t$ is a function (not a variable) which can only depend on information up through time $t$. This leaves us the challenge of finding this function.
We start by relaxing the constraint in (3) and add it to the objective function, giving us
$$\min_{u_0,\ldots,u_{T-1}} \mathbb{E}\left\{ \sum_{t=0}^{T-1} \Big( L_t(x_t,u_t) + \lambda_t \big( f(x_t,u_t) + w_t - x_{t+1} \big) \Big) + L_T(x_T) \right\}, \tag{5}$$
where $(\lambda_t)_{t=0}^{T-1}$ is a vector of dual variables (known as costate variables in the controls community). Assuming that $\mathbb{E}\, w_t = 0$, then $w_t$ drops out of the objective function.
The next step is that we restrict our attention to quadratic loss functions given by
$$L_t(x_t,u_t) = x_t^\top Q_t x_t + u_t^\top R_t u_t, \tag{6}$$
where $Q_t$ and $R_t$ are a set of known matrices. This special case is known as linear-quadratic regulation (or LQR).
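With this special structure, we turn to the Hamilton-Jacobi equations (often called the Hamilton-Jacobi-Bellman equations) where we solve for the “cost-to-go” function $J_t(x_t)$ using
$$J_t(x_t) = \min_{u} \Big( L_t(x_t,u) + \mathbb{E}_{w} \big\{ J_{t+1}\big( f(x_t,u) + w \big) \big\} \Big). \tag{7}$$
$J_t(x_t)$ is the value of being in state $x_t$ at time $t$ and following an optimal policy from time $t$ onward. In the language of reinforcement learning, $J_t(x_t)$ is known as the value function, and is written $V_t(S_t)$.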
For the special case of the quadratic objective function in (6), it is possible to solve the Hamilton-Jacobi-Bellman equations analytically and show that the optimal control as a function of the state is given by
$$u_t = K_t x_t, \tag{8}$$
where $K_t$ is a matrix that depends on $Q_t$ and $R_t$.
Here, $u_t$ is a function that we refer to as a policy $\pi$, but is known as a control law in the controls community. Some would write (8) as $\pi_t(x_t) = K_t x_t$. Later we are going to adopt the notation $U^\pi(x_t)$ for writing a policy, where $\pi$ carries information about the structure of the policy. We note that when the policy depends on the state $x_t$, then the function is, by construction, “$\mathcal{F}_t$-measurable,” so we can avoid this terminology entirely.
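To make the linear control law in (8) concrete, the sketch below computes the gains for a scalar problem by stepping the cost-to-go recursion backward in time. It rests on assumptions that the text does not spell out: linear dynamics $x_{t+1} = a x_t + b u_t + w_t$ with zero-mean noise, time-invariant scalar weights `q` and `r` (with `r > 0`) in the quadratic loss (6), and a terminal loss `q_terminal * x**2`. The function name `lqr_gains` is hypothetical.

```python
def lqr_gains(a, b, q, r, q_terminal, T):
    """Backward recursion for scalar dynamics x_{t+1} = a*x_t + b*u_t + w_t.

    Returns the gains K_0..K_{T-1} (so that u_t = K_t * x_t) and the
    cost-to-go coefficients P_0..P_T, where J_t(x) = P_t * x**2 + constant.
    """
    P = [0.0] * (T + 1)
    K = [0.0] * T
    P[T] = q_terminal
    for t in range(T - 1, -1, -1):
        denom = r + b * b * P[t + 1]
        # Minimizing q*x^2 + r*u^2 + P[t+1]*(a*x + b*u)^2 over u gives u = K[t]*x.
        K[t] = -(a * b * P[t + 1]) / denom
        P[t] = q + a * a * P[t + 1] - (a * b * P[t + 1]) ** 2 / denom
    return K, P

# Example usage with illustrative numbers.
gains, cost_to_go = lqr_gains(a=1.0, b=1.0, q=1.0, r=1.0, q_terminal=1.0, T=2)
```

Under these assumptions the zero-mean additive noise only adds a constant to the cost-to-go, which is why it does not appear in the gains. In this sketch the gains also depend on the dynamics coefficients `a` and `b`, not only on the loss weights.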
The linear control law (policy) in equation (8) is very elegant, but it is a byproduct of the special structure, which includes the quadratic form of the objective function (equation (6)), the additive noise (equation (3)), and the fact that there are no constraints on the controls. For example, a much more general way of writing the transition function is
$$x_{t+1} = f(x_t,u_t,w_t),$$
which allows the noise to enter the dynamics in any form. Consider, for instance, an inventory problem where the state (the inventory level) is governed by
$$x_{t+1} = \max\{0,\; x_t + u_t - w_t\},$$
where $w_t$ is the random demand for our product.
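The inventory dynamics translate almost line for line into software. The following is a minimal sketch: the order-up-to rule, its target level, and the uniform integer demand model are assumptions made only for illustration.

```python
import random

def transition(x, u, w):
    """Inventory dynamics: next inventory = max(0, inventory + order - demand)."""
    return max(0, x + u - w)

def order_up_to(x, target=5):
    # Hypothetical ordering rule: raise the inventory to a target level.
    return max(0, target - x)

def simulate_inventory(x0, demands, policy=order_up_to):
    """Return the state trajectory x_0, ..., x_T for a given demand sequence."""
    xs = [x0]
    for w in demands:
        u = policy(xs[-1])           # decision made before demand is seen
        xs.append(transition(xs[-1], u, w))
    return xs

rng = random.Random(1)
demands = [rng.randint(0, 8) for _ in range(6)]  # assumed demand model
trajectory = simulate_inventory(2, demands)
```

Note that the noise enters inside the `max`, so it cannot be written as an additive term in the form of equation (3).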
We are also interested in general state-dependent reward functions which are often written as $g(x_t,u_t)$ (where $g(\cdot)$ stands for gain), as well as the constraints, where we might write
$$A_t u_t = b_t, \qquad u_t \geq 0,$$
where $b_t$ (and $A_t$) may contain information from the state variable.
For these more general problems, we cannot compute (7) exactly, so the research community has developed a variety of methods for approximating $J_t(x_t)$. Methods for solving (7) approximately have been widely studied in the controls community under names such as heuristic dynamic programming, approximate dynamic programming, neuro-dynamic programming, and adaptive dynamic programming. However, even this approach is limited to special classes of problems within our universe of sequential decision problems.
Optimal control enjoys a rich history. Deterministic control dates to the early 1900’s, while stochastic control appears to have been first introduced in the 1950’s by Bellman (known as the father of dynamic programming). Some of the more recent books in optimal control are Kirk (2004), Stengel (1986), Sontag (1998), Sethi (2019), and Lewis et al. (2012). The most common optimal control problems are continuous, low-dimensional and unconstrained. Stochastic problems are most typically formulated with additive noise.
The field of stochastic control has tended to evolve using the more sophisticated mathematics that has characterized the field. Some of the most prominent books include Astrom (1970), Kushner (1971), Bertsekas & Shreve (1978), Yong & Zhou (1999), Nisio (2014) (note that some of the books on deterministic controls touch on the stochastic case).
We are going to see below that this framework for writing sequential decision problems is quite powerful, even if the classical results (such as the linear control policy) are very limited. It will form the foundation for our unified framework, with some slight adjustments.
3.2 Reinforcement learning
The field known as reinforcement learning evolved from early work done by Rich Sutton and his adviser Andy Barto in the early 1980’s. They addressed the problem of modeling the search process of a mouse exploring a maze, developing methods that would eventually help solve the Chinese game of Go, outperforming world masters (figure 1).
Sutton and Barto eventually made the link to the field of Markov decision processes and adopted the vocabulary and notation of this field. The field is nicely summarized in Puterman (2005), which can be viewed as the capstone volume on 50 years of research into Markov decision processes, starting with the seminal work of Bellman (1957). Puterman (2005, Chapter 3) summarizes the modeling framework as consisting of the following elements:
 Decision epochs

$T = 1, 2, \ldots, N$.
 State space

$\mathcal{S}$ = set of (discrete) states.
 Action space

$\mathcal{A}_s$ = action space (set of actions when we are in state $s$).
 Transition matrix

$p(s'|s,a)$ = probability of transitioning to state $s'$ given that we are in state $s$ and take action $a$.
 Reward

$r(s,a)$ = the reward received when we are in state $s$ and take action $a$.
This notation (which we refer to below as the “MDP formal model”) became widely adopted in the computer science community where reinforcement learning evolved. It became standard for authors to define a reinforcement learning problem as consisting of the tuple $(\mathcal{S}, \mathcal{A}, P, r)$ where $P$ is the transition matrix, and $r$ is the reward function.
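Using this notation, Sutton and Barto (this work is best summarized in their original volume Sutton & Barto (1998)) proposed estimating the value of being in a state $s^n$ and taking an action $a^n$ (at the $n^{th}$ iteration of the algorithm) using
$$\hat{q}^n(s^n,a^n) = r(s^n,a^n) + \gamma \max_{a' \in \mathcal{A}_{s'}} \bar{Q}^{n-1}(s',a'), \tag{9}$$
$$\bar{Q}^n(s^n,a^n) = (1-\alpha_{n-1}) \bar{Q}^{n-1}(s^n,a^n) + \alpha_{n-1} \hat{q}^n(s^n,a^n), \tag{10}$$
where $\gamma$ is a discount factor and $\alpha_n$ is a smoothing factor that might be called a stepsize (the equation has roots in stochastic optimization) or learning rate. Equation (10) is the core of “reinforcement learning.”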
We assume that when we are in state $s^n$ and take action $a^n$ that we have some way of simulating the transition to a state $s'$. There are two ways of doing this:

Model-based - We assume we have the transition matrix $P$ and then sample $s'$ from the probability distribution $p(s'|s,a)$.

Model-free - We assume we are observing a physical setting where we can simply observe the transition to state $s'$ without a transition matrix.
Sutton and Barto named this algorithmic strategy “$Q$-learning” (after the notation). The appeal of the method is its sheer simplicity. In fact, they retained this style in their wildly popular book (Sutton & Barto, 1998) which can be read by a high-school student.
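The simplicity of equations (9)-(10) is easiest to appreciate in code. The sketch below uses a lookup table and a constant stepsize `alpha` (the equations allow an iteration-dependent stepsize). The two-state problem, the `step` simulator, and the purely random choice of actions are hypothetical and serve only to exercise the update.

```python
import random

def q_update(Q, s, a, reward, s_next, actions, gamma, alpha):
    """Apply equations (9)-(10) to the lookup table Q for one observed transition."""
    q_hat = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)  # equation (9)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * q_hat              # equation (10)
    return Q[(s, a)]

# Hypothetical two-state problem used only for illustration.
STATES, ACTIONS = (0, 1), (0, 1)

def step(s, a):
    """Model-free stand-in: we only observe the reward and the next state."""
    reward = 1.0 if (s == 1 and a == 1) else 0.0
    return reward, a  # in this toy problem the next state equals the action

def q_learning(n_iters=5000, gamma=0.9, alpha=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = 0
    for _ in range(n_iters):
        a = rng.choice(ACTIONS)  # pure exploration, for simplicity
        reward, s_next = step(s, a)
        q_update(Q, s, a, reward, s_next, ACTIONS, gamma, alpha)
        s = s_next
    return Q
```

Nothing in `q_update` refers to a transition matrix: the algorithm needs only the observed reward and next state, which is what makes the model-free setting possible.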
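Just as appealing is the wide applicability of both the model and algorithmic strategy. Contrast the core algorithmic step described by equations (9)-(10) to Bellman’s equation which is the foundation of Markov decision processes, which requires solving
$$V_t(s) = \max_{a} \Big( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a) V_{t+1}(s') \Big), \tag{11}$$
for all states $s \in \mathcal{S}$. Equation (11) is executed by setting $V_{T+1}(s) = 0$ for all $s \in \mathcal{S}$, and then stepping backward $t = T, T-1, \ldots, 1$ (hence the reason that this is often called “backward dynamic programming”). In fact, this version was so trivial that the field focused on the stationary version which is written
$$V(s) = \max_{a} \Big( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a) V(s') \Big). \tag{12}$$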
[Side note: The steady state version of Bellman’s equation in (12) became the default version of Bellman’s equation, which explains why the default notation for reinforcement learning does not index variables by time. By contrast, the default formulation for optimal control is finite time, and variables are indexed by time in the canonical model.]
Equation (12) requires solving a system of nonlinear equations to find $V(s)$, which proved to be the foundation for an array of papers with elegant algorithms, where one of the most important is
$$V^{n+1}(s) = \max_{a} \Big( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a) V^{n}(s') \Big). \tag{13}$$
Equation (13) is known as value iteration (note the similarity with equation (11)) and is the basis of $Q$-learning (compare to equations (9)-(10)).
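Value iteration is short to implement when the transition probabilities are available. The sketch below is one possible rendering of equation (13). The stopping tolerance, the nested-dictionary layout of `p` and `r`, and the two-state example are assumptions made for illustration.

```python
def value_iteration(states, actions, p, r, gamma, tol=1e-8, max_iters=10_000):
    """Iterate equation (13) until successive value estimates differ by less than tol.

    p[s][a] is a dict {s_next: probability}; r[s][a] is the one-period reward.
    """
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        V_new = {
            s: max(
                r[s][a] + gamma * sum(prob * V[s2] for s2, prob in p[s][a].items())
                for a in actions
            )
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical two-state example: the action chosen becomes the next state,
# and a reward of 1 is earned only by staying in state 1.
states, actions = (0, 1), (0, 1)
p = {s: {a: {a: 1.0} for a in actions} for s in states}
r = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}
V = value_iteration(states, actions, p, r, gamma=0.9)
```

For this toy problem the fixed point of (12) can be checked by hand: staying in state 1 forever is worth $1/(1-0.9) = 10$, and state 0 is worth $0.9 \times 10 = 9$. Every sweep loops over all states and all successor states, which is the step that becomes impractical as the state space grows.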
The problem with (11) is that it is far from trivial. In fact, it is quite rare that it can be computed due to the widely cited “curse of dimensionality.” This typically refers to the fact that for most problems, the state $s$ is a vector $s = (s_1, s_2, \ldots, s_K)$. Assuming that all the states are discrete, the number of states grows exponentially in $K$. It is for this reason that dynamic programming is widely criticized for suffering from the “curse of dimensionality.” In fact, the curse of dimensionality is due purely to the use of lookup table representations of the value function (note that the canonical optimal control model does not do this).
In practice, this typically means that one-dimensional problems can be solved in under a minute; two-dimensional problems might take several minutes (but possibly up to an hour, depending on the dimensionality and the planning horizon); three-dimensional problems easily take a week or a month; and four-dimensional problems can take up to a year (or more).
In fact, there are actually three curses of dimensionality: the state space, the action space, and the outcome space. It is typically assumed that there is a discrete set of actions (think of the roads emanating from an intersection), but there are many problems where decisions are vectors (think of all the ways of assigning different taxis to passengers). Finally, there are random variables (call them $w$ for now) which might also be a vector. For example, $w$ might be the set of riders calling our taxi company for rides in a 15 minute period. Or, it might represent all the attributes of a customer clicking on ads (age, gender, location).
The most difficult computational challenge in equation (11) is finding the one-step transition matrix $P$ with element $p(s'|s,a)$. This matrix measures $|\mathcal{S}| \times |\mathcal{S}| \times |\mathcal{A}|$ which may already be quite large. However, consider what it takes to compute just one element. To show this, we need to steal a bit of notation from the optimal control model, which is the transition function $f(x,u,w)$. Using the notation of dynamic programming, we would write this as $f(s,a,w)$. Assuming the transition function is known, the one-step transition matrix is computed using
$$p(s'|s,a) = \mathbb{E}_w \big\{ \mathbb{1}_{\{s' = f(s,a,w)\}} \big\}. \tag{14}$$
This is not too hard if the random variable $w$ is scalar, but there are many problems where $w$ is a vector, in which case we encounter the third curse of dimensionality. There are many other problems where we do not even know the distribution of $w$ (but have a way of observing outcomes).
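When the random variable is discrete with a known distribution, the expectation in (14) is a finite sum over outcomes. The sketch below builds the rows of the one-step transition matrix for a small inventory problem. The transition function, the demand distribution `demand_pmf`, and the inventory cap are hypothetical choices for illustration.

```python
from collections import defaultdict

def transition_function(s, a, w):
    """Hypothetical inventory dynamics: s' = max(0, s + a - w)."""
    return max(0, s + a - w)

def one_step_probabilities(s, a, w_pmf):
    """Equation (14) for a discrete w: p(s'|s,a) = sum_w P(w) * 1{s' = f(s,a,w)}."""
    row = defaultdict(float)
    for w, prob in w_pmf.items():
        row[transition_function(s, a, w)] += prob
    return dict(row)

# Assumed demand distribution, for illustration only.
demand_pmf = {0: 0.5, 1: 0.3, 2: 0.2}

# Filling the full matrix means repeating this for every (s, a) pair.
max_inventory = 2
P = {
    (s, a): one_step_probabilities(s, a, demand_pmf)
    for s in range(max_inventory + 1)
    for a in range(max_inventory - s + 1)
}
```

The inner loop runs over every outcome of the random variable. With a scalar demand that is three terms; with a vector of demands the number of outcomes multiplies across components, which is the third curse of dimensionality described above.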
Now return to the $Q$-learning equations (9)-(10). At no time are we enumerating all the states, although we do have to enumerate all the actions (and the states these actions lead to), which is perhaps a reason why reinforcement learning is always illustrated in the context of relatively small, discrete action spaces (think of the Chinese game of Go). Finally, we do not need to take an expectation over the random variable $w$; rather, we just simulate our way from state $s$ to state $s'$ using the transition function $f(s,a,w)$.
We are not out of the woods. We still have to estimate the value of being in state $s$ and taking action $a$, captured by our $Q$-factors $\bar{Q}(s,a)$. If we use lookup tables, this means we need to estimate $\bar{Q}(s,a)$ for each state that we might visit, and each action that we might take, which means we are back to the curse of dimensionality. However, we can use other approximation strategies:

Lookup tables with hierarchical beliefs - Here we use a family of lookup table models at different levels of aggregation.

Parametric models, which might be linear (in the parameters) or nonlinear. We include shallow neural networks here. Parametric models transform the dimensionality of problems down to the dimensionality of the parameter vector, but we have to know the parametric form.

Nonparametric models. Here we include kernel regression, locally parametric, and flexible architectures such as support vector machines and deep neural networks.
Not surprisingly, considerable attention has been devoted to different methods for approximating $\bar{Q}(s,a)$, with recent attention focusing on using deep neural networks. We will just note that the price of higher-dimensional architectures is increased training (in fact, dramatically increased training). A deep neural network might easily require tens of millions of iterations, and yet still may not guarantee high quality solutions.
The challenge is illustrated in figure 2, where we have a set of observations (red dots). We try to fit the observations using a nonparametric model (the blue line) which overfits the data, which we believe is a smooth, concave (approximately quadratic) surface. We would get a reasonable fit of a quadratic function with no more than 10 data points, but any nonparametric model (such as a deep neural network) might require hundreds to thousands of data points (depending on the noise) to get a good fit.
3.3 A critique of the MDP modeling framework
For many years, the modeling framework of Markov decision processes lived within the MDP community which consisted primarily of applied probabilists, reflecting the limited applicability of the solution methods. Reinforcement learning, however, is a field that is exploding in popularity, while still clinging to the classical MDP modeling framework (see Lazaric (2019) for a typical example of this). What is happening, however, is that people doing computational work are adopting styles that overcome the limitations of the discrete MDP framework. For example, researchers will overcome the problem of computing the one-step transition matrix by saying that they will “simulate” the process. In practice, this means that they are using the transition function $x_{t+1} = f(x_t,u_t,w_t)$, which means that they have to simulate the random information $w_t$, without explicitly writing out $f(\cdot)$ or the model of $w_t$. This introduces a confusing gap between the statement of the model and the software that captures and solves the model.
We offer the following criticisms of the classical MDP framework that the reinforcement learning community has adopted:

The MDP/RL modeling framework models state spaces. The optimal control framework models state variables. We argue that the latter is much more useful, since it more clearly describes the actual variables of the problem. Consider a problem with discrete states (perhaps with $K$ dimensions). The state space could then be written $\mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_K$ which produces a set of discrete states that we can write $\{1, 2, \ldots, |\mathcal{S}|\}$. If we had a magical tool that could solve discrete Markov decision problems (remember we need to compute the one-step transition matrix), then we do not need to know anything about the state space $\mathcal{S}$, but this is rarely the case. Further, we make the case that just knowing that we have $|\mathcal{S}|$ states provides no information about the problem itself, while a list of the variables that make up the state variable (as is done in optimal control) will map directly to the software implementing the model.

Similarly, the MDP/RL community talks about action spaces, while the controls community uses control variables. There is a wide range of problems that are described by discrete actions, where the action space is not too large. However, there are also many problems where actions are continuous, and often are vector-valued. The notation of an “action space” is simply not useful for vector-valued decisions/controls (the issue is the same as with state spaces).

The MDP modeling framework does not explicitly model the exogenous information process $w_t$. Rather, it is buried in the one-step transition function $p(s'|s,a)$, as is seen in equation (14). In practical algorithms, we need to simulate the $w_t$ process, so it helps to model the process explicitly. We would also argue that the model of $w_t$ is a critical and challenging dimension of any sequential decision problem which is overlooked in the canonical MDP modeling framework.

Transition functions, if known, are always computable, since we just have to compute them for a single state, a single action, and a single observation of any exogenous information. We suspect that this is why the optimal control community adopted this notation. One-step transition matrices (or one-step transition kernels if the state variable is continuous) are almost never computable.

There is no proper statement of the objective function, beyond the specification of the reward function $r(s,a)$. There is an implied objective similar to equation (4), but as we are going to see below, objective functions for sequential decision problems come in a variety of styles. Most important, in our view, is the need to state the objective in terms of optimizing over policies.
We suspect that the reason behind the sharp difference in styles between optimal control and Markov decision processes (adopted by reinforcement learning) is that the field of optimal control evolved from engineering, while Markov decision processes evolved out of mathematics. The adoption of the MDP framework by reinforcement learning (which grew out of computer science and is particularly popular with less-mathematical communities) is purely historical: it was easier to connect the discrete mouse-in-a-maze problem to the language of discrete Markov decision processes than to stochastic optimal control.
In section 4 below we are going to offer a framework that overcomes all of these limitations. This framework, however, closely parallels the framework widely used in optimal control, with a few relatively minor modifications (and one major one).
3.4 Bridging optimal control and reinforcement learning
We open our discussion by noting the remarkable difference between the canonical modeling framework for optimal control, which explicitly models state variables $x_t$, controls $u_t$, information $w_t$, and transition functions, and the canonical modeling framework for reinforcement learning (inherited from Markov decision processes) which uses constructs such as state spaces, action spaces, and one-step transition matrices. We will argue that the framework used in optimal control can be translated directly to software, whereas that used by reinforcement learning cannot.
To illustrate this assertion, we note that optimal control and reinforcement learning are both addressing a sequential decision problem. In the notation of optimal control, we would write (focusing on discrete time settings):
$$(x_0, u_0, w_0, x_1, u_1, w_1, \ldots, x_T).$$
The reinforcement learning framework, on the other hand, never models anything comparable to the information variable $w_t$. In fact, the default setting is that we just observe the downstream state rather than modeling how we get there, but this is not universally true.
We also note that while the controls literature typically indexes variables by time, the RL community adopted the standard steady state model (see equation (12)), which means their variables are not indexed by time (or anything else). Instead, they view the system as evolving in steps (or iterations). For this reason, we are going to index variables by $n$ (as in $S^n$).
In addition, the RL community does not explicitly model an exogenous information variable. Instead, they tend to assume that when you are in a state $s$ and take action $a$, you then “observe” the next state $s'$. However, any simulation in a reinforcement learning model requires creating a transition function which may (but not always) involve some random information that we call “$w$” (adopting, for the moment, the notation in the controls literature). This allows us to write the sequential decision problem as
$$(S^0, a^0, (w^1), S^1, a^1, (w^2), S^2, \ldots, S^N).$$
We put the $(w^n)$ in parentheses because the RL community does not explicitly model the $w^n$ process. However, when running an RL simulation, the software will have to model this process, even if we are just observing the next state. We use $w^{n+1}$ after taking action $a^n$ simply because it is often the case that $S^{n+1} = w^{n+1}$.
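The observation that an RL simulator still has to model the exogenous information, even when the caller only sees the next state, can be sketched in Python. This is a minimal illustration under stated assumptions: the inventory-style dynamics, the uniform demand distribution, and the names `transition` and `step` are hypothetical and do not come from either literature.

```python
import random


def transition(state, action, w):
    """Next inventory level given the current level, the order, and the demand w."""
    return max(0, state + action - w)


def step(state, action, rng):
    """RL-style interface: the caller only observes the next state.

    The exogenous information w is still sampled, just hidden inside the simulator.
    """
    w = rng.randint(0, 3)  # hypothetical demand, uniform on {0, 1, 2, 3}
    return transition(state, action, w)


rng = random.Random(0)
state = 2
trajectory = [state]
for n in range(5):
    action = 1  # placeholder policy: always order one unit
    state = step(state, action, rng)
    trajectory.append(state)
print(trajectory)
```

Separating `transition` from `step` makes the hidden information process explicit: replacing the sampling line changes the uncertainty model without touching the system dynamics.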
We have found that the reinforcement learning community likes to start by stating a model in terms of the MDP formal model, but then revert to the framework of stochastic control. A good example is the presentation by Lazaric (2019); slide 22 presents the MDP formal model, but when the presentation turns to present an illustration (using a simple inventory problem), it turns to the style used in the optimal control community (see slide 29). Note that the presentation insists that the demand be stationary, which seems to be an effort to force it into the standard stationary model (see equation (12)). We use a much more complex inventory problem in this article, where we do not require stationarity (and which would not be required by the canonical optimal control framework).
So, we see that both optimal control and reinforcement learning are solving sequential decision problems, also known as Markov decision problems. Sequential decision problems (decision, information, decision, information, $\ldots$) span a truly vast range of applications, as noted in section 2. We suspect that this space is much broader than has been traditionally viewed within either of these two communities. This is not to say that all these problems can be solved with Q-learning or even any Bellman-based method, but below we will identify four classes of policies that span any approach that might be used for any sequential decision problem.
The optimal control literature has its origins in problems with continuous states and actions, although the mathematical model does not impose any restrictions beyond the basic structure of sequential decisions and information (for stochastic control problems). While optimal control is best known for the theory surrounding the structure of linear-quadratic regulation, which produces the linear policy in (8), it should not be surprising that the controls community branched into more general problems, requiring different solution strategies. These include:

Approximating the cost-to-go function $J_t(x_t)$.

Determining a decision now by optimizing over a horizon using a presumably known model of the system (which is not always available). This approach became known as model predictive control.

Specifying a parametric control law, which is typically linear in the parameters (following the style of (8)).
At the same time, the reinforcement learning community found that the performance of Q-learning (that is, equations (9)-(10)), despite the hype, did not match early hopes and expectations. In fact, just as the optimal controls community evolved different solution methods, the reinforcement learning community followed a similar path (the same statement can be made of a number of fields in stochastic optimization). This evolution is nicely documented by comparing the first edition of Sutton and Barto’s Reinforcement Learning (Sutton & Barto, 1998), which focuses exclusively on Q-learning, with the second edition (Sutton & Barto, 2018), which covers methods such as Monte Carlo tree search, upper confidence bounding, and the policy gradient method.
We are going to next present (in section 4) a universal framework which is illustrated in section 5 on a series of problems in energy storage. Section 6 will then present four classes of policies that cover every method that has been proposed in the literature, which span all the variations currently in use in both the controls literature as well as the growing literature on reinforcement learning. We then return to the energy storage problems in section 7 and illustrate all four classes of policies (including a hybrid).
4 The universal modeling framework
We are going to present a universal modeling framework that covers all of the disciplines and application domains listed in section 2. The framework will end up posing an optimization problem that involves searching over policies, which are functions for making decisions. We will illustrate the framework on a simple inventory problem using the setting of controlling battery storage (a classical stochastic control problem).
In section 5 we will illustrate some key concepts by extending our energy storage application, focusing primarily on modeling state variables. Then, section 6 describes a general strategy for designing policies, which we are going to claim covers every solution approach proposed in the research literature (or used in practice). Thus, we will have a path to finding solutions to any problem (but these are rarely optimal).
Before starting, we make a few notes on notation:

The controls community uses $x_t$ for state, while the reinforcement learning community adopted the widely used notation $S_t$ for state. We have used $S_t$ partly because of the mnemonics (making it easier to remember), but largely because $x_t$ conflicts with the notation for decisions adopted by the field of math programming, which is widely used.

There are three standard notational systems for decisions: $a$ for action (typically discrete), $u$ for control (typically a low-dimensional, continuous vector), and $x$, which is the notation used by the entire math programming community, where $x$ can be continuous or discrete, scalar or vector. We adopt $x$ because of how widely it is used in math programming, and because it has been used in virtually every setting (binary, discrete, continuous, scalar or vector). It has also been adopted in the multi-armed bandit community in computer science.

The controls community uses $w_t$, which is (sadly) random at time $t$, whereas all other variables are known at time $t$. We prefer the style that every variable indexed by time $t$ (or iteration $n$) is known at time $t$ (or iteration $n$). For this reason, we use $W_t$ for the exogenous information that first becomes known between $t-1$ and $t$, which means it is known at time $t$. (Similarly, $W^n$ would be information that becomes known between iterations/observations $n-1$ and $n$.)
4.1 Dimensions of a sequential decision model
There are five elements to any sequential decision problem: state variables, decision variables, exogenous information variables, transition function, and objective function. We briefly describe each below, returning in section 4.2 to discuss state variables in more depth. The description below is adapted from Powell (2019).
 State variables

 The state $S_t$ of the system at time $t$ (we might say $S^n$ after $n$ iterations) is a function of history that contains all the information that is necessary and sufficient to compute costs/rewards, constraints, and any information needed by the transition function. The state typically consists of a number of dimensions which we might write as $S_t = (S_{t1}, \ldots, S_{tK})$. This will be more meaningful when we illustrate it with an example below.
We distinguish between the initial state $S_0$ and the dynamic state $S_t$ for $t > 0$. The initial state contains all deterministic parameters, initial values of any dynamic parameters, and initial beliefs about unknown parameters in the form of the parameters of probability distributions. The dynamic state $S_t$ contains only information that is evolving over time.
In section 4.2, we will distinguish different classes of state variables, including physical state variables $R_t$ (which might describe inventories or the location of a vehicle), other information $I_t$ (which might capture prices, weather, or the humidity in a laboratory), and beliefs $B_t$ (which include the parameters of probability distributions describing unobservable parameters). It is sometimes helpful to recognize that $(R_t, I_t)$ capture everything that can be observed perfectly, while $B_t$ represents distributions of anything that is uncertain.
 Decision variables

 We use $x_t$ for decisions, where $x_t$ may be binary (e.g. for a stopping problem), discrete (e.g. an element of a finite set), continuous (scalar or vector), integer vectors, and categorical (e.g. the attributes of a patient). In some applications $x_t$ might have hundreds of thousands, or even millions, of dimensions, which makes the concept of “action spaces” fairly meaningless. We note that entire fields of research are sometimes distinguished by the nature of the decision variable.
We assume that decisions are made with a policy, which we might denote $X^\pi(S_t)$. We also assume that a decision $x_t = X^\pi(S_t)$ is feasible at time $t$. We let “$\pi$” carry the information about the type of function $f \in \mathcal{F}$ (for example, a linear model with specific explanatory variables, or a particular nonlinear model), and any tunable parameters $\theta \in \Theta^f$.
 Exogenous information

 We let $W_t$ be any new information that first becomes known at time $t$ (that is, between $t-1$ and $t$). This means any variable indexed by $t$ is known at time $t$. When modeling specific variables, we use “hats” to indicate exogenous information. Thus, $\hat{D}_t$ could be the demand that arose between $t-1$ and $t$, or we could let $\hat{p}_t$ be the change in the price between $t-1$ and $t$. The exogenous information process may be stationary or nonstationary, purely exogenous or state (and possibly action) dependent.
As with decisions, the exogenous information $W_t$ might be scalar, or it could have thousands to millions of dimensions (imagine the number of new customer requests for trips from zone $i$ to zone $j$ in an area that has 20,000 zones).
The distribution of $W_{t+1}$ (given we are at time $t$) may be described by a known mathematical model, or we may depend on observations from an exogenous source (this is known as “data driven”). The exogenous information may depend on the current state and/or action, so we might write it as $W_{t+1}(S_t, x_t)$. We will suppress this notation moving forward, but with the understanding that we allow this behavior.
 Transition function

 We denote the transition function by
$$S_{t+1} = S^M(S_t, x_t, W_{t+1}), \tag{15}$$
where $S^M(\cdot)$ is also known by names such as system model, state equation, plant model, plant equation and transfer function. We have chosen not to use the standard notation $f(s, x, w)$ used universally by the controls community simply because the letter $f$ is also widely used for “functions” in many settings. The alphabet is very limited and the letter $f$ occupies a valuable piece of real estate.
An important problem class in both optimal control and reinforcement learning arises when the transition function is unknown. This is sometimes referred to as “model-free dynamic programming.” There are some classes of policies that do not need a transition function, but others do, introducing the dimension of trying to learn the transition function.
 Objective functions

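 There are a number of ways to write objective functions in sequential decision problems. Our default notation is to let $C_t(S_t, x_t)$ be the contribution of taking action $x_t$ given the information in state $S_t$.
For now we are going to use the most common form of an objective function used in both dynamic programming (which includes reinforcement learning) and stochastic control, which is to maximize the expected sum of contributions:
$$\max_\pi \mathbb{E}_{S_0} \mathbb{E}_{W_1,\ldots,W_T \mid S_0} \left\{ \sum_{t=0}^{T} C_t(S_t, X^\pi(S_t)) \,\Big|\, S_0 \right\}, \tag{16}$$
where
$$S_{t+1} = S^M(S_t, X^\pi(S_t), W_{t+1}), \tag{17}$$
and where we are given a source of the exogenous information process
$$(S_0, W_1, W_2, \ldots, W_T). \tag{18}$$
We refer to equation (16) along with the state transition function (17) and exogenous information (18) as the base model. We revisit objective functions in section 4.3.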
An important feature of our modeling framework is that we introduce the concept of a policy when we describe decisions, and we search over policies in the objective function in equation (16), but we do not at this point specify what the policies might look like. Searching over policies is precisely what is meant by insisting that the control in equation (4) be “measurable.” In section 6 we are going to make this much more concrete, in a way that does not require mastering subtle concepts such as “measurability.” All that is needed is the understanding that a policy depends on the state variable (measurability is guaranteed when this is the case).
In other words (and as promised), we have modeled the problem without specifying how we would solve it (that is, we have not specified how we are computing the policy). This follows our “Model first, then solve” approach. Contrast this with the Q-learning equations (9)-(10), which are basically an algorithm without a model, although the RL community would insist that the model is the canonical MDP framework given in section 3.2.
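To make the “model first, then solve” idea concrete, the base model can be sketched in Python as a simulator that evaluates any policy, whatever its form. This is only an illustration under stated assumptions: the toy inventory dynamics, the cost numbers, the order-up-to policy, and every function name are hypothetical, and the expectation is replaced by a Monte Carlo average.

```python
import random


def simulate_policy(policy, transition, contribution, sample_w, s0, T,
                    n_samples=200, seed=0):
    """Monte Carlo estimate of the expected sum of contributions under `policy`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        s = s0
        for t in range(T + 1):
            x = policy(s)                # decision from the policy, given the state
            total += contribution(s, x)  # contribution of the decision in this state
            w = sample_w(rng)            # exogenous information arriving after the decision
            s = transition(s, x, w)      # transition function gives the next state
    return total / n_samples


# Toy inventory model (all numbers are illustrative assumptions).
def transition(s, x, w):
    return max(0, s + x - w)


def contribution(s, x):
    # order cost, holding cost, and a penalty when nothing is in stock
    return -(0.5 * x + 0.1 * s + 2.0 * (s + x == 0))


def sample_w(rng):
    return rng.randint(0, 2)


def order_up_to(theta):
    """A simple parametric policy: order enough to raise the stock to theta."""
    return lambda s: max(0, theta - s)


values = {theta: simulate_policy(order_up_to(theta), transition, contribution,
                                 sample_w, s0=0, T=10)
          for theta in range(4)}
best_theta = max(values, key=values.get)
print(values, best_theta)
```

Searching over `theta` here is a small instance of optimizing over policies: the model (`transition`, `contribution`, `sample_w`) is fixed before any decision about how the policy is designed.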
4.2 State variables
Our experience is that there is an almost universal misunderstanding of what is meant by a “state variable.” Not surprisingly, interpretations of the term “state variable” vary between communities. An indication of the confusion can be traced to attempts to define state variables. For example, Bellman introduces state variables with “we have a physical system characterized at any stage by a small set of parameters, the state variables” (Bellman, 1957). Puterman’s now classic text introduces state variables with “At each decision epoch, the system occupies a state.” (Puterman, 2005, p. 18) (in both cases, the italicized text was included in the original text). As of this writing, Wikipedia offers “A state variable is one of the set of variables that are used to describe the mathematical ‘state’ of a dynamical system.” Note that all three references use the word “state” in the definition of state variable.
It has also been our finding that most books in optimal control do, in fact, include proper definitions of a state variable (our experience is that this is the only field that does this). They all tend to say the same thing: a state variable is all the information needed to model the system from time $t$ onward.
Our only complaint about the standard definition used in optimal control books is that it is vague. The definition proposed in Powell (2020) (building on the definition in Powell (2011)) refines the basic definition with the following:

A state variable is:
 a) Policy-dependent version

A function of history that, combined with the exogenous information (and a policy), is necessary and sufficient to compute the decision function (the policy), the cost/contribution function, and the transition function.
 b) Optimization version

A function of history that, combined with the exogenous information, is necessary and sufficient to compute the cost or contribution function, the constraints, and the transition function.
There are three types of information in $S_t$:

The physical state, $R_t$, which in most (but not all) applications is the state variables that are being controlled. $R_t$ may be a scalar, or a vector with element $R_{ti}$, where $R_{ti}$ could be the amount of a type of resource $i$ (e.g. a blood type) or the amount of inventory at location $i$. Physical state variables typically appear in the constraints. We make a point of singling out physical states because of their importance in modeling resource allocation problems, where the “state of the system” is often (and mistakenly) equated with the physical state.

Other information, $I_t$, which is any information that is known deterministically but is not included in $R_t$. The information state often evolves exogenously, but may be controlled or at least influenced by decisions (e.g. selling a large number of shares may depress prices). Other information may appear in the objective function (such as prices), and the coefficients in the constraints.

The belief state $B_t$, which contains distributional information about unknown parameters, where we can use frequentist or Bayesian belief models. These may come in the following styles:

Lookup tables - Here we have a set of discrete values $x \in \mathcal{X} = \{x_1, \ldots, x_M\}$, and we have a belief about a function (such as $f(x) = \mathbb{E}F(x, W)$) for each $x \in \mathcal{X}$.

Parametric belief models - We might assume that $\mathbb{E}F(x, W) = f(x \mid \theta)$ where the function $f(x \mid \theta)$ is known but where $\theta$ is unknown. We would then describe $\theta$ by a probability distribution.

Nonparametric belief models - These approximate a function at $x$ by smoothing local information near $x$.
It is important to recognize that the belief state includes the parameters of a probability distribution describing unobservable parameters of the model. For example, $B_t$ might be the mean and covariance matrix of a multivariate normal distribution, or a vector of probabilities $p^n = (p^n_k)_{k=1}^K$ where $p^n_k = \mathrm{Prob}[\theta = \theta_k \mid S^n]$.

Footnote 1: It is not unusual for people to overlook the need to include beliefs in the state variable. The RL tutorial Lazaric (2019) does this when it presents the multi-armed bandit problem, insisting that it does not have a state variable (see slide 49). In fact, any bandit problem is a sequential decision problem where the state variable is the belief (which can be Bayesian or frequentist). This has long been recognized by the probability community that has worked on bandit problems since the 1950s (see the seminal text DeGroot (1970)). Bellman’s equation (using belief states) was fundamental to the development of Gittins indices in Gittins & Jones (1974) (see Gittins et al. (2011) for a nice introduction to this rich area of research). It was the concept of Gittins indices that laid the foundation for upper confidence bounding, which is just a different form of index policy.
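A lookup-table belief state can be sketched in Python as one set of distribution parameters per alternative, updated as observations arrive. This is a minimal sketch assuming independent normal beliefs with known observation precision (the standard normal-normal conjugate update); the alternative labels, the prior values, and the function name are hypothetical.

```python
def update_belief(mean, precision, observation, obs_precision):
    """Bayesian update of a normal belief (mean, precision) after one noisy observation."""
    new_precision = precision + obs_precision
    new_mean = (precision * mean + obs_precision * observation) / new_precision
    return new_mean, new_precision


# Belief state: one (mean, precision) pair per alternative.
belief = {"x1": (0.0, 1.0), "x2": (0.0, 1.0)}

# Observing alternative x1 changes only the belief about x1.
belief["x1"] = update_belief(*belief["x1"], observation=4.0, obs_precision=1.0)
print(belief)
```

In a bandit problem the dictionary `belief` is the entire state variable: it is all the information needed to compute a decision and the next update.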
We feel that a proper understanding of state variables opens up the use of the optimal control framework to span the entire set of communities and applications discussed in section 2.
4.3 Objective functions
Sequential decision problems are diverse, and this is reflected in the different types of objective functions that may be used. Our framework is insensitive to the choice of objective function, but they all still require optimizing over policies.
We begin by making the distinction between state-independent problems and state-dependent problems. We let $F(x, W)$ denote a state-independent problem, where we assume that neither the objective function $F(x, W)$, nor any constraints, depends on dynamic information captured in the state variable. We let $C(S, x)$ capture state-dependent problems, where the objective function (and/or constraints) may depend on dynamic information.
Throughout we assume problems are formulated over finite time horizons. This is the most standard approach in optimal control, whereas the reinforcement learning community adopted the style of Markov decision processes to model problems over infinite time horizons. We suspect that the difference reflects the history of optimal control, which is based on solving real engineering problems, and Markov decision processes, with its roots in mathematics and stylized problems.
In addition to the issue of state-dependency, we make the distinction between optimizing the cumulative reward versus the final reward. When we combine state dependency and the issue of final vs. cumulative reward, we obtain four objective functions. We present these in the order: 1) state-independent, final reward, 2) state-independent, cumulative reward, 3) state-dependent, cumulative reward, and 4) state-dependent, final reward (the last class is the most subtle).
 State-independent functions

These are pure learning problems, where the problem does not depend on information in the state variable. The only state variable is the belief about an unknown function $\mathbb{E}F(x, W)$.
 1) Final reward

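This is the classical stochastic search problem. Here we go through a learning/training process to find a final design/decision $x^{\pi,N}$, where $\pi$ is our search policy (or algorithm), and $N$ is the budget. We then have to test the performance of the policy by simulating $\widehat{W}$ using
$$\max_\pi \mathbb{E}_{S^0} \mathbb{E}_{W^1,\ldots,W^N \mid S^0} \mathbb{E}_{\widehat{W} \mid S^0} F(x^{\pi,N}, \widehat{W}), \tag{19}$$
where $x^{\pi,N}$ depends on $S^0$ and the experiments $W^1, \ldots, W^N$, and where $\widehat{W}$ represents the process of testing the design $x^{\pi,N}$.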
 2) Cumulative reward

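This describes problems where we have to learn in the field, which means that we have to optimize the sum of the rewards we earn, eliminating the need for a final testing pass. This objective is written
$$\max_\pi \mathbb{E}_{S^0} \mathbb{E}_{W^1,\ldots,W^N \mid S^0} \sum_{n=0}^{N-1} F(X^\pi(S^n), W^{n+1}). \tag{20}$$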
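The difference between the final-reward and cumulative-reward objectives for a pure learning problem can be sketched in Python. The sketch assumes a simple explore-then-commit learning policy and Gaussian noise; the policy, the true means, and the function name are hypothetical, and the true mean of the chosen alternative stands in for the expectation over the testing sample in the final-reward case.

```python
import random


def run_learning_policy(n_explore, N, mu, seed=0):
    """Explore-then-commit policy on alternatives with unknown means `mu`.

    Returns the cumulative reward earned while learning and the final design.
    """
    rng = random.Random(seed)
    K = len(mu)
    counts = [0] * K
    means = [0.0] * K                      # belief state: estimated mean per alternative
    cumulative = 0.0
    for n in range(N):
        if n < n_explore:
            x = n % K                      # explore in round-robin order
        else:
            x = max(range(K), key=lambda k: means[k])  # exploit current estimates
        reward = mu[x] + rng.gauss(0.0, 1.0)           # noisy observation of the function
        cumulative += reward
        counts[x] += 1
        means[x] += (reward - means[x]) / counts[x]    # running average
    x_final = max(range(K), key=lambda k: means[k])
    return cumulative, x_final


mu = [1.0, 1.5, 3.0]                       # hypothetical true means, unknown to the policy
for n_explore in (3, 30):
    cumulative, x_final = run_learning_policy(n_explore, N=60, mu=mu)
    final_reward = mu[x_final]             # expected performance of the final design
    print(n_explore, round(cumulative, 1), final_reward)
```

Longer exploration tends to help the final-reward objective, since only the quality of the final design matters, while it is paid for directly under the cumulative-reward objective, where every experiment counts.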
 State-dependent functions

This describes the massive universe of problems where the objective and/or the constraints depend on the state variable which may or may not be controllable.
 3) Cumulative reward

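This is the version of the objective function that is most widely used in stochastic optimal control (as well as Markov decision processes). We switch back to time-indexing here since these problems are often evolving over time (but not always). We write the contribution in the form $C(S_t, x_t, W_{t+1})$ to help with the comparison to $F(x, W)$.
$$\max_\pi \mathbb{E}_{S_0} \mathbb{E}_{W_1,\ldots,W_T \mid S_0} \left\{ \sum_{t=0}^{T} C(S_t, X^\pi(S_t), W_{t+1}) \,\Big|\, S_0 \right\}. \tag{21}$$
 4) Final reward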

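This is the objective function that describes optimization algorithms (represented as $\pi^{lrn}$) optimizing a time-staged, state-dependent objective. This is the objective that should be used when finding the best algorithm for a dynamic program/stochastic control problem, yet has been almost universally overlooked as a sequential decision problem. The objective is given by
$$\max_{\pi^{lrn}} \mathbb{E}_{S^0} \mathbb{E}^{\pi^{lrn}}_{W^1,\ldots,W^N \mid S^0} \mathbb{E}^{\pi^{imp}}_{S \mid S^0} \mathbb{E}_{\widehat{W} \mid S^0} C\big(S, X^{\pi^{imp}}(S \mid \theta^{\pi^{imp}}), \widehat{W}\big), \tag{22}$$
where $\pi^{lrn}$ is the learning policy (or algorithm), while $\pi^{imp}$ is the implementation policy that we are learning through $\pi^{lrn}$. We note that we use the learning policy $\pi^{lrn}$ to learn the parameters $\theta^{\pi^{imp}}$ that govern the behavior of the implementation policy.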
There are many problems that require more complex objective functions such as the best (or worst) performance in a time period, across all time periods. In these settings we cannot simply sum the contributions across time periods (or iterations). For this purpose, we introduce the operator $\rho$ which takes as input the entire sequence of contributions. We would write our objective function as
$$\max_\pi \rho\big(C_0(S_0, X^\pi(S_0), W_1), C_1(S_1, X^\pi(S_1), W_2), \ldots, C_t(S_t, X^\pi(S_t), W_{t+1}), \ldots\big). \tag{23}$$
The objective in (23), through creative use of the operator $\rho$, subsumes all four objectives (19)-(22). However, we feel that generality comes at a cost of clarity.
The controls community, while also sharing an interest in risk, is also interested in stability, an issue that is important in settings such as controlling aircraft and rockets. While we do not address the specific issue of designing policies to handle stability, we make the case that the problem of searching over policies remains the same; all that has changed is the metric.
All of these objectives can be written in the form of regret which measures the difference between the solution we obtain and the best possible. Regret is popular in the learning community where we compare against the solution that assumes perfect information. A comparable strategy compares the performance of a policy against what can be achieved with perfect information about the future (widely known as a posterior bound).
4.4 Notes
It is useful to list some similarities (and differences) between our modeling framework and that used in stochastic optimal control:

The optimal control framework includes all five elements, although we lay these out more explicitly.

We use a richer understanding of state variables, which means that we can apply our framework to a much wider range of problems than has traditionally been considered in the optimal control literature. In particular, all the fields and problem areas in section 2 fit this framework, which means we would say that all of these are “optimal control problems.”

The stochastic control modelling framework uses $w_t$ as the information that will arrive between $t$ and $t+1$, which means it is random at time $t$. We let $W_t$ be the information that arrives between $t-1$ and $t$, which means it is known at time $t$. This means we write our transition as
$$S_{t+1} = S^M(S_t, x_t, W_{t+1}).$$
This notation makes it explicitly clear that $W_{t+1}$ is not known when we determine decision $x_t$.

We recognize a wider range of objective functions, which expands the problem classes to offline and online applications, active learning (bandit) problems, and hybrids.

We formulate the optimization problem in terms of optimizing over policies, without prejudging the classes of policies. We describe four classes of policies in section 6 that we claim are universal: they cover all the strategies that have been proposed or used in practice. This also opens the door to creating hybrids that combine two or more classes of policies.
The first four items are relatively minor, highlighting our belief that stochastic control is fundamentally the most sound of all the modeling frameworks used by any of the communities listed in section 2. However, the fifth item is a significant transition from how sequential decision problems are approached today.
Many have found that Q-learning often does not work well. In fact, Q-learning, as with all approximate dynamic programming algorithms, tends to work well only on a fairly small set of problems. Our experience is that approximate dynamic programming (Q-learning is a form of approximate dynamic programming) tends to work well when we can exploit the structure of the value function. For example, ADP has been very successful with some very complex, high-dimensional problems in fleet management (see Simão et al. (2009) and Bouzaiene-Ayari et al. (2016)) where the value functions were convex. However, vanilla approximation strategies (e.g. using simple linear models for value function approximations) can work very poorly even on small inventory problems (see Jiang et al. (2014) for a summary of experiments which compare results against a rigorous benchmark). Furthermore, as we will see in section 6 below, there is a range of policies that do not depend on value functions and that are natural choices for many applications.
5 Energy storage illustration
We are going to illustrate our modeling framework using the energy system depicted in figure 3, which consists of a wind farm (where energy is free but with high variability in supply), the grid (which has unlimited supply but highly stochastic prices), a market (which exhibits very time-dependent, although relatively predictable, demands), and an energy storage device (we will assume it is a battery). While small, this rich system introduces a variety of modeling and algorithmic challenges.
We are going to demonstrate how to model this problem, starting with a simple model and then expanding to illustrate some modeling devices. We will translate each variation into the five core components: state variables, decision variables, exogenous information variables, transition function, and objective function.
5.1 A basic energy storage problem
 State variables

State $S_t = (R_t, D_t, E_t, p_t)$ where

$R_t$ = Energy in the battery at time $t$,
$E_t$ = Power being produced by the wind farms at time $t$,
$D_t$ = Demand for power at time $t$,
$p_t$ = Price of energy on the grid at time $t$.

Note that it is necessary to go through the rest of the model to determine which variables are needed to compute the objective function, constraints, and transition function.
 Decision variables

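$x_t = (x^{GB}_t, x^{GD}_t, x^{EB}_t, x^{ED}_t, x^{BD}_t)$ where

$x^{GB}_t$ = Flow of energy from grid to battery ($x^{GB}_t > 0$) or back ($x^{GB}_t < 0$),
$x^{GD}_t$ = Flow of energy from grid to demand,
$x^{EB}_t$ = Flow of energy from wind farm to battery,
$x^{ED}_t$ = Flow of energy from wind farm to demand,
$x^{BD}_t$ = Flow of energy from battery to demand.

These decision variables have to obey the constraints:
$$x^{EB}_t + x^{ED}_t \le E_t, \tag{24}$$
$$x^{GD}_t + x^{BD}_t + x^{ED}_t = D_t, \tag{25}$$
$$x^{BD}_t \le R_t, \tag{26}$$
$$x^{GD}_t, x^{EB}_t, x^{ED}_t, x^{BD}_t \ge 0. \tag{27}$$
Finally, we introduce the policy (function) $X^\pi(S_t)$ that will return a feasible vector $x_t$. We defer to later the challenge of designing a good policy.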
 Exogenous information variables

$W_{t+1} = (\hat{E}_{t+1}, \hat{D}_{t+1}, p_{t+1})$, where

$\hat{E}_{t+1}$ = The change in the power from the wind farm between $t$ and $t+1$,
$\hat{D}_{t+1}$ = The change in the demand between $t$ and $t+1$,
$p_{t+1}$ = The price charged at time $t+1$ as reported by the grid.

We note that the first two exogenous information variables are defined as changes in values, while the last (price) is reported directly from an exogenous source.
 Transition function

$S_{t+1} = S^M(S_t, x_t, W_{t+1})$:
$$R_{t+1} = R_t + x^{GB}_t + x^{EB}_t - x^{BD}_t, \tag{28}$$
$$E_{t+1} = E_t + \hat{E}_{t+1}, \tag{29}$$
$$D_{t+1} = D_t + \hat{D}_{t+1}, \tag{30}$$
$$p_{t+1} = p_{t+1} \quad \text{(observed directly)}. \tag{31}$$
Note that we have illustrated here a controllable transition (28), two where the exogenous information is represented as the change in a process (equations (29) and (30)), and one where we directly observe the updated price (31). This means that the processes $E_t$ and $D_t$ are first-order Markov chains (assuming that $\hat{E}_t$ and $\hat{D}_t$ are independent across time), while the price process would be described as “model free” or “data driven” since we are not assuming that we have a mathematical model of the price process.
 Objective function

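We wish to find a policy $X^\pi(S_t)$ that solves
$$\max_\pi \mathbb{E}_{S_0} \mathbb{E}_{W_1,\ldots,W_T \mid S_0} \left\{ \sum_{t=0}^{T} C(S_t, X^\pi(S_t)) \,\Big|\, S_0 \right\},$$
where $S_{t+1} = S^M(S_t, x_t = X^\pi(S_t), W_{t+1})$ and where we are given an information process
$$(S_0, W_1, W_2, \ldots, W_T).$$
Normally, we would transition at this point to describe how we are modeling the uncertainty in the information process $W_1, \ldots, W_T$, and then describe how to design policies. For compactness, we are going to skip these steps now, and instead illustrate how to model a few problem variations that can often cause confusion.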
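The basic storage model translates almost line by line into software. The Python sketch below is one possible rendering, not a definitive implementation: the class and field names are hypothetical, the feasibility check encodes an assumed form of constraints (24)-(27) (wind limit, demand balance, battery limit, nonnegativity of all flows except grid-to-battery), and the battery update assumes lossless storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    battery: float  # energy in the battery
    wind: float     # power produced by the wind farm
    demand: float   # demand for power
    price: float    # grid price


@dataclass(frozen=True)
class Decision:
    grid_to_battery: float   # negative values send energy back to the grid
    grid_to_demand: float
    wind_to_battery: float
    wind_to_demand: float
    battery_to_demand: float


def is_feasible(s, x, tol=1e-9):
    """Assumed constraint set: wind limit, demand balance, battery limit, nonnegativity."""
    return (
        x.wind_to_battery + x.wind_to_demand <= s.wind + tol
        and abs(x.grid_to_demand + x.battery_to_demand + x.wind_to_demand - s.demand) <= tol
        and x.battery_to_demand <= s.battery + tol
        and min(x.grid_to_demand, x.wind_to_battery,
                x.wind_to_demand, x.battery_to_demand) >= -tol
    )


def transition(s, x, w):
    """w = (change in wind, change in demand, new price reported by the grid)."""
    d_wind, d_demand, new_price = w
    battery = s.battery + x.grid_to_battery + x.wind_to_battery - x.battery_to_demand
    return State(battery, s.wind + d_wind, s.demand + d_demand, new_price)


s = State(battery=5.0, wind=3.0, demand=4.0, price=30.0)
x = Decision(grid_to_battery=1.0, grid_to_demand=1.0, wind_to_battery=1.0,
             wind_to_demand=2.0, battery_to_demand=1.0)
s_next = transition(s, x, (1.0, -1.0, 35.0))
print(is_feasible(s, x), s_next)
```

Note that the price in the next state is copied straight from the exogenous information, which is what makes the price process “data driven” in this version of the model.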
5.2 With a timeseries price model
We are now going to make a single change to the model above. Instead of assuming that prices are provided exogenously, we are going to assume we can model them using a time series model given by
$$p_{t+1} = \theta_0 p_t + \theta_1 p_{t-1} + \theta_2 p_{t-2} + \varepsilon_{t+1}. \tag{32}$$
A common mistake is to say that $p_t$ is the “state” of the price process, and then observe that it is no longer Markovian (it would be called “history dependent”), but “it can be made Markovian by expanding the state variable,” which would be done by including $p_{t-1}$ and $p_{t-2}$ (see Cinlar (2011) for an example of this). According to our definition of a state variable, the state is all the information needed to model the process from time $t$ onward, which means that the state of our price process is $(p_t, p_{t-1}, p_{t-2})$. This means our system state variable is now
$$S_t = \big((R_t, D_t, E_t), (p_t, p_{t-1}, p_{t-2})\big).$$
We then have to modify our transition function so that the “price state variable” at time $t+1$ becomes $(p_{t+1}, p_t, p_{t-1})$.
5.3 With passive learning
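We implicitly assumed that our price process in equation (32) was governed by a model where the coefficients were known. Now assume that the price depends on prices over the last three time periods through coefficients that have to be estimated, which means we would write
$$p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \varepsilon_{t+1}. \tag{33}$$
Here, we have to adaptively update our estimate $\bar{\theta}_t$, which we can do using recursive least squares. To do this, let
$$\bar{p}_t = (p_t, p_{t-1}, p_{t-2})^T, \qquad \bar{F}_t(\bar{p}_t \mid \bar{\theta}_t) = (\bar{p}_t)^T \bar{\theta}_t.$$
We perform the updating using a standard set of updating equations given by
$$\bar{\theta}_{t+1} = \bar{\theta}_t + \frac{1}{\gamma_t} M_t \bar{p}_t \varepsilon_{t+1}, \tag{34}$$
$$\varepsilon_{t+1} = p_{t+1} - \bar{F}_t(\bar{p}_t \mid \bar{\theta}_t), \tag{35}$$
$$M_{t+1} = M_t - \frac{1}{\gamma_t} M_t \bar{p}_t (\bar{p}_t)^T M_t, \tag{36}$$
$$\gamma_t = 1 + (\bar{p}_t)^T M_t \bar{p}_t. \tag{37}$$
To compute these equations, we need the three-element vector $\bar{\theta}_t$ and the $3 \times 3$ matrix $M_t$. These then need to be added to our state variable, giving us
$$S_t = \big((R_t, D_t, E_t), (p_t, p_{t-1}, p_{t-2}), (\bar{\theta}_t, M_t)\big).$$
We then have to include equations (34)-(37) in our transition function.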
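The recursive least squares updates referenced in (34)-(37) can be sketched in plain Python. The sketch uses the standard RLS recursion with the error defined as observation minus prediction and assumes the matrix is kept symmetric; the function name, the initial matrix, the noise level, and the coefficients used to simulate prices are all hypothetical.

```python
import random


def rls_update(theta, M, p_bar, p_next):
    """One recursive least squares step for the linear model p_next ~ p_bar . theta.

    M must be symmetric (for example a multiple of the identity at the start).
    """
    n = len(theta)
    Mp = [sum(M[i][j] * p_bar[j] for j in range(n)) for i in range(n)]  # M times p_bar
    gamma = 1.0 + sum(p_bar[i] * Mp[i] for i in range(n))               # scaling term
    error = p_next - sum(p_bar[i] * theta[i] for i in range(n))         # observation minus prediction
    theta_new = [theta[i] + Mp[i] * error / gamma for i in range(n)]
    M_new = [[M[i][j] - Mp[i] * Mp[j] / gamma for j in range(n)] for i in range(n)]
    return theta_new, M_new


rng = random.Random(1)
true_theta = [0.6, 0.2, 0.1]               # hypothetical coefficients used to simulate prices
prices = [30.0, 31.0, 29.0]
theta = [0.0, 0.0, 0.0]
M = [[100.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
for t in range(2, 300):
    p_bar = [prices[t], prices[t - 1], prices[t - 2]]   # the three most recent prices
    p_next = sum(a * b for a, b in zip(true_theta, p_bar)) + rng.gauss(0.0, 1.0)
    theta, M = rls_update(theta, M, p_bar, p_next)      # part of the transition function
    prices.append(p_next)
print([round(v, 2) for v in theta])
```

The pair `(theta, M)` is exactly what has to be carried from one period to the next, which is why both belong in the state variable.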
5.4 With active learning
We can further generalize our model by assuming that our decision $x^{GB}_t$ to buy or sell energy from or to the grid can have an impact on prices. We might propose a modified price model given by
$$p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \bar{\theta}_{t3} x^{GB}_t + \varepsilon_{t+1}. \tag{38}$$
All we have done is introduce a single term $\bar{\theta}_{t3} x^{GB}_t$ to our price model. Assuming that $\theta_3 > 0$, this model implies that purchasing power from the grid ($x^{GB}_t > 0$) will increase grid prices, while selling power back to the grid ($x^{GB}_t < 0$) decreases prices. This means that purchasing a lot of power from the grid (for example) means we are more likely to observe higher prices, which may assist the process of learning $\theta$. When decisions control or influence what we observe, then this is an example of active learning.
This change in our price model does not affect the state variable from the previous model, aside from adding one more element to $\bar{\theta}_t$, with the required changes to the matrix $M_t$. The change will, however, have an impact on the policy. It is easier to learn $\theta_{t3}$ by varying $x^{GB}_t$ over a wide range, which means trying values of $x^{GB}_t$ that do not appear to be optimal given our current estimate of the vector $\bar{\theta}_t$. Making decisions partly just to learn (to make better decisions in the future) is the essence of active learning, best known in the field of multi-armed bandit problems.
5.5 With rolling forecasts
Forecasting is such a routine activity in operational problems, it may come as a surprise that we have been modelling these problems incorrectly.
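Assume we have a forecast $f^E_{t,t+1}$ of the energy $E_{t+1}$ from wind, which means
$$E_{t+1} = f^E_{t,t+1} + \varepsilon_{t+1,1}, \tag{39}$$
where $\varepsilon_{t+1,1} \sim N(0, \sigma^2_\varepsilon)$ is the random variable capturing the one-period-ahead error in the forecast.
Equation (39) effectively replaces equation (29) in the transition function for the base model. However, it introduces a new variable, the forecast $f^E_{t,t+1}$, which must now be added to the state variable. This means we now need a transition equation to describe how $f^E_{t,t+1}$ evolves over time. We do this by using a two-period-ahead forecast, $f^E_{t,t+2}$, which is basically a forecast of $f^E_{t+1,t+2}$, plus an error, giving us
$$f^E_{t+1,t+2} = f^E_{t,t+2} + \varepsilon_{t+1,2}, \tag{40}$$
where $\varepsilon_{t+1,2} \sim N(0, 2\sigma^2_\varepsilon)$ is the two-period-ahead error (we are assuming that the variance in a forecast increases linearly with time). Now we have to put $f^E_{t,t+2}$ in the state variable, which generates a new transition equation. This generalizes to
$$f^E_{t+1,t'} = f^E_{t,t'} + \varepsilon_{t+1,t'-t}, \tag{41}$$
where $\varepsilon_{t+1,t'-t} \sim N(0, (t'-t)\sigma^2_\varepsilon)$.
This stops, of course, when we hit the planning horizon $H$. This means that we now have to add
$$f^E_t = (f^E_{tt'})_{t'=t+1}^{t+H}$$
to the state variable, with the transition equations (41) for $t' = t+1, \ldots, t+H$. Combined with the learning statistics, our state variable is now
$$S_t = \big((R_t, D_t, E_t), (p_t, p_{t-1}, p_{t-2}), (\bar{\theta}_t, M_t), f^E_t\big).$$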
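It is useful to note that we have a nice illustration of the three elements of our state variable:
$(R_t, D_t, E_t)$: the physical state variables.
$(p_t, p_{t-1}, p_{t-2})$: other information.
$((\bar{\theta}_t, M_t), f^E_t)$: the belief state, since these parameters determine the distribution of our belief about quantities that are not known perfectly.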
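The rolling forecast transition can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the function name `step_forecasts` is hypothetical, errors are drawn as independent normals whose variance grows linearly with the forecast horizon, and the newest forecast at the end of the horizon is passed in as an argument (`new_far_forecast`) because the text does not specify where it comes from.

```python
import numpy as np

def step_forecasts(forecasts, new_far_forecast, sigma, rng):
    """Advance the forecast vector by one time period.

    forecasts[k-1] holds the forecast made at time t for time t+k,
    for k = 1..H. Returns the realized energy at t+1 and the forecast
    vector as it stands at time t+1.
    """
    H = len(forecasts)
    horizons = np.arange(1, H + 1)
    # Error standard deviation grows as sqrt(k), so variance is k * sigma^2.
    errors = rng.normal(0.0, sigma * np.sqrt(horizons))
    energy_next = forecasts[0] + errors[0]      # one-period-ahead forecast plus error
    updated = forecasts[1:] + errors[1:]        # remaining forecasts shift one step closer
    return energy_next, np.append(updated, new_far_forecast)

rng = np.random.default_rng(0)
f_t = np.array([5.0, 6.0, 7.0, 6.5])            # hypothetical forecasts, H = 4
E_next, f_next = step_forecasts(f_t, new_far_forecast=6.0, sigma=0.3, rng=rng)
```

Because the whole vector `forecasts` is needed to compute the next state, it has to be part of the state variable, which is the point made above.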
5.6 Remarks
We note that all the models illustrated in this section are sequential decision problems, which means that all of them can be described as either stochastic control problems or reinforcement learning problems. This is true whether state variables or decision/control variables are scalar or vector, discrete or continuous (or mixed). We have, however, assumed that time is either discrete or discretized.
Energy storage is a form of inventory problem, which is the original stochastic control problem used by Bellman to motivate his work on dynamic programming (Bellman et al., 1955), and is even used today by the reinforcement learning community (Lazaric, 2019). However, we have never seen the variations that we illustrated here solved by any of these communities.
In section 6 we are going to present four classes of policies, and then illustrate, in section 7, that each of the four classes (including a hybrid) can be applied to the full range of these energy storage problems. We are then going to show that both communities (optimal control and reinforcement learning) use methods that are drawn from each of the four classes, but apparently without an awareness that these are instances of broader classes that can be used to solve complex problems.
6 Designing policies
There are two fundamental strategies for creating policies:
Policy search: Here we use any of the objective functions (19)-(23) to search within a family of functions to find the policy that works best. This means we have to a) find a class of functions and b) tune any parameters. The challenge is finding the right family, and then performing the tuning (which can be hard).
Lookahead approximations: Alternatively, we can construct policies by approximating the impact of a decision now on the future. The challenge here is designing and computing the approximation of the future (this is also hard).
Either of these approaches can yield optimal policies, although in practice this is rare. Below we show that each of these strategies can be further divided into two classes, creating four (meta)classes of policies for making decisions. We make the claim that these are universal, which is to say that any solution approach to any sequential decision problem will use a policy drawn from one of these four classes, or a hybrid of two or more classes.
6.1 Policy search
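Policy search involves tuning and comparing policies using the objective functions (19)-(23) so that they behave well when averaged over a set of sample paths. Assume that we have a class of functions $\mathcal{F}$, where for each function $f \in \mathcal{F}$, there is a parameter vector $\theta \in \Theta^f$ that controls its behavior. Let $X^\pi(S_t \mid \theta)$ be a function in class $f \in \mathcal{F}$ parameterized by $\theta \in \Theta^f$, where $\pi = (f, \theta)$. Policy search involves finding the best policy using
$$\max_{\pi \in (\mathcal{F}, \Theta^f)} \mathbb{E}_{S_0} \mathbb{E}_{W_1, \ldots, W_T \mid S_0} \left\{ \sum_{t=0}^{T} C\big(S_t, X^\pi(S_t \mid \theta)\big) \,\Big|\, S_0 \right\}. \tag{42}$$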
In special cases, this can produce an optimal policy, as we saw for the case of linear-quadratic regulation (see equation (8)).
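To make the idea of tuning a policy over sample paths concrete, here is a minimal sketch in Python. It assumes a simple parametric rule for a storage device (buy one unit when the price is at or below one threshold, sell one unit when it is at or above another) and picks the thresholds by averaging simulated profit over a set of sample paths. The rule, the function names `simulate_policy` and `policy_search`, the capacity, and the price distribution are all hypothetical choices for this example, and a plain grid search stands in for whatever search method would be used in practice.

```python
import itertools
import numpy as np

def simulate_policy(theta, prices, capacity=10.0):
    """Total profit of a buy-low/sell-high rule on one price path."""
    buy_below, sell_above = theta
    stored, profit = 0.0, 0.0
    for p in prices:
        if p <= buy_below and stored < capacity:
            stored += 1.0          # buy one unit
            profit -= p
        elif p >= sell_above and stored > 0.0:
            stored -= 1.0          # sell one unit
            profit += p
    return profit

def policy_search(candidates, sample_paths):
    """Return the candidate theta with the best average simulated profit."""
    def average_profit(theta):
        return np.mean([simulate_policy(theta, path) for path in sample_paths])
    return max(candidates, key=average_profit)

rng = np.random.default_rng(0)
sample_paths = [30.0 + rng.normal(0.0, 5.0, size=100) for _ in range(20)]
grid = np.arange(20.0, 41.0, 2.0)
candidates = [(b, s) for b, s in itertools.product(grid, grid) if b < s]
best_theta = policy_search(candidates, sample_paths)
```

The sample average inside `policy_search` plays the role of the expectation in the objective, and the set `candidates` plays the role of the parameter space for one fixed class of functions.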
Since we can rarely find optimal policies using (42), we have identified two subclasses within the policy search class:
Policy function approximations (PFAs): Policy function approximations can be lookup tables, parametric or nonparametric functions, but the most common are parametric functions. This could be a linear function such as
$$X^\pi(S_t \mid \theta) = \theta_0 + \theta_1 \phi_1(S_t) + \theta_2 \phi_2(S_t) + \ldots,$$
which parallels the linear control law in equation (8) (these are also known as “affine policies”). We might also use a nonlinear function such as an order-up-to inventory policy, a logistic curve, or a neural network. Typically there is no guarantee that a PFA is in the optimal class of policies. Instead, we search for the best performance within a class.
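Cost function approximations (CFAs): A CFA is
$$X^\pi(S_t \mid \theta) = \arg\max_{x \in \mathcal{X}^\pi_t(\theta)} \bar{C}^\pi_t(S_t, x \mid \theta),$$
where $\bar{C}^\pi_t(S_t, x \mid \theta)$ is a parametrically modified cost function, subject to a parametrically modified set of constraints. A popular example known to the computer science community is interval estimation where a discrete alternative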