1 A Brief Review of MDPs
This section reviews background material on MDPs using several popular textbooks on the subject. Two modern textbooks I recommend for MDPs are [bertsekasdynamic1] and [bertsekasdynamic2]. Reading selected parts of [bertsekasdynamic1], updated in 2017, is sufficient to continue towards an introductory treatment of POMDPs. One other textbook I recommend, which adds perspective to this text and has additional material on MDPs as well as excellent coverage of POMDP theory, is [krishnamurthy2016partially]. Published in 2016, it is a modern treatment of POMDPs with many examples and an easy-to-read development of POMDP theory. Only the essential parts of MDPs and HMMs are covered in this introductory chapter. Moreover, I provide a quick survey of references to some other areas of interest for MDPs, mostly extensions of the basic theory that may have an impact on future research on POMDPs.
1.1 The Dynamic Programming (DP) Algorithm
DP-to-MDP reading road map in [bertsekasdynamic1]:
First read Appendix D, On Finite-State Markov Chains, if you have no background in Markov chains. Then proceed to Chapter 1.
Read Chapter 1 The DP Algorithm.
Other mathematical background required and assumed is basic familiarity with calculus, linear algebra, probability and Markov chains.
Within this first chapter of [bertsekasdynamic1], particular emphasis should be placed on 1.1.2 Discrete-State and Finite-State Problems, p. 7, as this is the setting that leads towards a development of HMMs. Then one can approach a suitable introductory development of POMDPs for finite-state systems. Note that Bertsekas's coverage includes continuous-valued state-spaces, and allows for easy addition of disturbances to the state from a countable set in his coverage of the dynamic programming algorithm.
Next, read 1.2 The basic problem, 1.3 The Dynamic Programming Algorithm. Skip section 1.4 State Augmentation and Other Reformulations. Read section 1.5 Some Mathematical issues. Optional: read 1.6 Dynamic Programming and Minimax Control.
Summary: this chapter helps one understand the general structure of the DP algorithm for a finite horizon optimal control problem, i.e., over a finite number of stages. The two main features of the problem are a discrete-time dynamic system that evolves the state, and an additive cost function. Notation for common variables is introduced: $x_k$ is the state at time $k$, and $f_k$ is the system function at time $k$ that takes in the parameters $(x_k, u_k, w_k)$, so that $x_{k+1} = f_k(x_k, u_k, w_k)$. The disturbance (noise) $w_k$ comes from a finite set and is viewed as a random parameter. The discrete time index is $k$. The length of the horizon is $N$, which is the number of times control is applied by an agent. The control or decision an agent selects at time $k$ is denoted $u_k$. The total cost for this problem is

$$g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k).$$

Here the terminal cost is $g_N(x_N)$. The basic problem is formulated in terms of minimizing the expected sum of these costs through an optimal policy (how an optimal controller decides on a control when in some state at some time).
In [bertsekasdynamic1], an important distinction between open-loop and closed-loop minimization is made. In open-loop minimization, one selects all controls $u_0, \dots, u_{N-1}$ at the initial time without considering what the future states and disturbances actually are. On the other hand, in closed-loop control an agent waits until the current state $x_k$ is known at stage $k$ before reaching a control decision. The goal of DP, and MDPs specifically, is to find an optimal rule for selecting $u_k$ for any state $x_k$ in the state space. One wants the sequence of functions $\mu_k$ mapping $x_k$ to $u_k = \mu_k(x_k)$ to minimize the expected cumulative cost. For the finite horizon there are most commonly $N$ distinct policies $\mu_0, \dots, \mu_{N-1}$, whereas for the infinite horizon the policy is stationary and denoted $\mu$, without the time index. The sequence $\pi = \{\mu_0, \dots, \mu_{N-1}\}$ is called the policy or control law for the finite horizon. For each policy $\pi$, the expected cost of starting from state $x_0$ is

$$J_\pi(x_0) = \mathbb{E}\left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \right\}.$$
1.2 MDP Basics: The Finite Horizon Case
Also, consider reading [krishnamurthy2016partially] Ch. 6, Fully Observed Markov Decision Processes. In particular, read the sections on the finite-state finite-horizon MDP and on Bellman's stochastic dynamic programming algorithm. An MDP is a Markov chain with a control applied at every stage. At some stage $k$, a state $x_k$ of the world is occupied by an agent or decision maker (controller), and the agent chooses a control $u_k$; this choice selects a transition matrix $P(u_k)$, which is of size $X \times X$ for an $X$-state Markov chain. The choice of control input influences the probability that the next state is some value $j$ at time $k+1$. The goal is to choose controls over the horizon, infinite or finite in stage number, that minimize some notion of cost termed a cost-to-go function. Most of the time, these cost-to-go functions are cumulative, i.e., additive in the costs received over the stages of the decision making.
First, it should be pointed out that the finite horizon MDP, which is $N$ stages long, is fully specified by a tuple $(\mathcal{X}, \mathcal{U}, P(u), c(i,u), c_N)$ and possibly a discount factor $\rho$. For every stage in the horizon, an important point for MDPs is that the agent knows the model parameters specified by the tuple and that there is no corruption or delay in the state information available to the agent or controller. The finite state space is denoted by $\mathcal{X} = \{1, \dots, X\}$, and the finite control space is denoted by $\mathcal{U}$. Additionally, the probability of transitioning from state $i$ to state $j$ at time $k$ when control $u$ is applied is denoted by

$$P_{ij}(u) = \mathbb{P}(x_{k+1} = j \mid x_k = i, u_k = u).$$
Furthermore, $c(i, u)$ denotes the instantaneous cost of being in state $i$ and executing control $u$. Lastly, $c_N(i)$ denotes the terminal cost of ending in state $i$. An important fact about MDPs in terms of the policy is that only the current state needs to be considered for making an optimal decision, not the full information set $I_k = \{x_0, u_0, \dots, u_{k-1}, x_k\}$. In general, a policy maps from the information set to the control space; a policy, also called a decision rule, for MDPs is a mapping from $\mathcal{X}$ to $\mathcal{U}$, i.e., $u_k = \mu_k(x_k)$ denotes a deterministic mapping from $\mathcal{X}$ to $\mathcal{U}$.
Another change in notation from [bertsekasdynamic1] to [krishnamurthy2016partially] is that $\mu$ is a policy whereas $\pi$ denotes a belief state when encountered in the POMDP chapters. The notations $\pi$ and $b$ are both common for belief-states. The preceding type of policy is called a deterministic Markovian policy, and it is sufficient to achieve the minimum of

$$J_\mu(x_0) = \mathbb{E}\left\{ \sum_{k=0}^{N-1} c(x_k, \mu_k(x_k)) + c_N(x_N) \right\},$$
which is the expected cumulative cost of the policy $\mu = \{\mu_0, \dots, \mu_{N-1}\}$ over the finite length horizon. The goal of the agent or decision maker is to determine $\mu^* = \arg\min_\mu J_\mu(x_0)$, i.e., find a policy that minimizes the expected cumulative cost for all initial states $x_0$. When $\mathcal{X}$ and $\mathcal{U}$ are finite spaces there always exists an optimal policy; however, uniqueness is not guaranteed. The solution to this problem of finding the optimal policy comes from Bellman's stochastic dynamic programming algorithm. The optimal policy is obtained from the backward recursion

$$J_N(i) = c_N(i), \qquad J_k(i) = \min_{u \in \mathcal{U}} \Big[ c(i, u) + \sum_{j \in \mathcal{X}} P_{ij}(u) \, J_{k+1}(j) \Big];$$

see [krishnamurthy2016partially] for the details. After running this algorithm for $N$ steps, the optimal expected cumulative cost-to-go $J_{\mu^*}(x_0) = J_0(x_0)$ is obtained for all starting states $x_0$. Furthermore, at each stage $k$, Bellman's algorithm constructs the optimal deterministic policy $\mu_k^*(i)$, the minimizing $u$ above, for all possible states $i$. The proof of this algorithm relies on the principle of optimality; see [krishnamurthy2016partially] and [bertsekasdynamic1]. In short, given any optimal policy for an arbitrary horizon, if the problem is started at an arbitrary stage in an arbitrary state, the original optimal policy for the full horizon is still optimal for the tail or sub-problem, whenever and wherever the agent starts executing the optimal policy.
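To make Bellman's backward recursion concrete, here is a minimal NumPy sketch. It is not code from either referenced textbook; the array shapes and the tiny two-state example below are my own assumptions.

```python
import numpy as np

def finite_horizon_mdp(P, c, c_N, N):
    """Bellman backward recursion for a finite-horizon MDP.

    P   : array (U, X, X); P[u, i, j] = P(next state j | state i, control u)
    c   : array (U, X); instantaneous cost c(i, u), stage-invariant here
    c_N : array (X,); terminal cost
    N   : horizon length
    Returns the cost-to-go vectors J[0..N] and policies mu[0..N-1].
    """
    J = [None] * (N + 1)
    mu = [None] * N
    J[N] = c_N.copy()                 # J_N(i) = c_N(i)
    for k in range(N - 1, -1, -1):
        # Q[u, i] = c(i, u) + sum_j P_ij(u) J_{k+1}(j)
        Q = c + P @ J[k + 1]
        J[k] = Q.min(axis=0)          # minimize over controls
        mu[k] = Q.argmin(axis=0)      # optimal deterministic decision rule
    return J, mu
```

For example, with two states, a "stay" control and a "swap" control, zero running cost, and terminal cost $(0, 10)$, the stage-0 rule is to stay in state 1 and swap out of state 2.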
1.3 MDP Basics: The Infinite Horizon Case
The only other material I would like to cover in this MDP section is the extension to the infinite horizon case, including the discounted total cost with its several cases and a discussion of the average cost-per-stage formulation. The famous computational methods are given as assigned reading, as part of the reading road-map.
Ch. 5 of [bertsekasdynamic1], Introduction to Infinite Horizon Problems, should be read in its entirety, except that the Semi-Markov Problems section can be safely skipped. All other sections should be read multiple times to let the details sink in, because some of the facts are quite subtle. From an MDP perspective, the infinite time horizon is a natural extension of the finite time horizon. There are several nice features of this chapter worth reading, particularly the connection between stochastic shortest path problems and DP, the practical computational algorithms, and the kinds of cost-to-go functions that can be optimized. Read 5.1 An Overview, 5.2 Stochastic Shortest Path Problems, 5.3 Computational Methods, 5.4 Discounted Problems, and 5.5 Average Cost per Stage Problems.
Infinite horizon summary: extra care must be taken in the analysis of these DP and MDP equations as the horizon length goes to infinity in any limiting arguments made for proofs. The model differs from the previous section in the following ways:
The number of stages is now countably infinite.
The system and cost functions are stationary.
The policy is also stationary.
There are four general types of infinite horizon problems discussed. 1) Minimization of total discounted cost:

$$J_\pi(x_0) = \lim_{N \to \infty} \mathbb{E}_{w_k} \left\{ \sum_{k=0}^{N-1} \rho^k \, g(x_k, \mu_k(x_k), w_k) \right\},$$
where the expectation is taken with respect to the disturbances $w_k$. Here $0 < \rho \le 1$, and $\rho$ is called the discount factor for the problem. The meaning of $\rho < 1$ is that future costs of the same magnitude matter less than they do at the present time. Something that is often a concern is whether the limiting cost-to-go, $J_\pi(x_0)$, is finite. This can be ensured through sufficient discounting.
There are a few different cases to consider for the infinite horizon definition from [bertsekasdynamic1]:
Case a) Stochastic shortest path problems. Here $\rho = 1$, but there must be at least one cost-free termination state; once the system reaches one of these states, it remains there at no further cost. I have also heard such a state referred to as a graveyard state. In effect, this is a finite horizon problem with a random horizon length, which is affected by the policy being used. For instance, a poor policy could keep an agent from reaching a termination state quickly, thus incurring more cost than necessary.

Case b) Discounted problems with bounded cost per stage. Here $\rho < 1$ and the absolute cost per stage is bounded from above by some constant $M$; this makes the limit in the definition of the cost well defined because it is the infinite sum of a sequence of numbers that are bounded in absolute value by the decreasing geometric progression $\rho^k M$.

Case c) Discounted and undiscounted problems with unbounded cost per stage. This analysis is not done in [bertsekasdynamic1]; see Chapter 4 of [bertsekasdynamic2] for further details. The main issue is the handling of the infinite costs that some policies achieve.

Case d) Average cost per stage problems, dealt with in Ch. 5 of [bertsekasdynamic1].
The average cost-to-go formulation makes sense when $J_\pi(x_0)$ is infinite for every policy $\pi$ and initial state $x_0$. This explosion in the cost-to-go could happen, for instance, if $\rho = 1$ and the cost per stage is positive. Then it makes sense to turn to a notion of average cost per stage to properly study the infinite horizon. The average cost-per-stage is defined as

$$J_\pi(x_0) = \lim_{N \to \infty} \frac{1}{N} \, \mathbb{E}_{w_k} \left\{ \sum_{k=0}^{N-1} g(x_k, \mu_k(x_k), w_k) \right\}$$

and is often well defined, i.e., finite.
Now that these types of the infinite horizon problem are presented, I want to show an interesting fact that is valuable both analytically and computationally. Taking the limit as $N \to \infty$ of the finite horizon cost-to-go function yields the infinite horizon cost-to-go, i.e., $J^*(x) = \lim_{N \to \infty} J_N(x)$. This relationship typically holds; it holds for Cases a-b), but exceptions exist for Case c). The following equation is Bellman's equation, or Bellman's optimality equation, for the stochastic control setting:

$$J^*(x) = \min_{u \in \mathcal{U}(x)} \mathbb{E}_{w} \left\{ g(x, u, w) + \rho \, J^*\big(f(x, u, w)\big) \right\}.$$
This is a system of equations, one for each state $x$. The solution to this set of equations is the optimal cost-to-go function. It is typically hard to compute the optimal cost-to-go exactly, but algorithms exist that converge to the correct solution. Most algorithms rely on a mathematical property called contraction. The notion of a contraction mapping from one cost-to-go function into another is more mathematically advanced, and that viewpoint is not needed for the purposes of this survey paper. More pointers are contained in [bertsekasdynamic2]. There one can read about all of the computational methods, especially value iteration and policy iteration, as they have parallels to the algorithms to be developed for POMDPs.
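As an illustration of these computational methods, here is a minimal value-iteration sketch for the discounted finite-state case: repeated application of the DP operator, which is a contraction when the discount factor is below one. The toy problem in the usage note is my own assumption, not an example from the referenced books.

```python
import numpy as np

def value_iteration(P, c, rho, tol=1e-8, max_iter=10_000):
    """Value iteration for a discounted infinite-horizon MDP.

    Iterates J <- min_u [ c(.,u) + rho * P(u) J ] until the sup-norm
    change falls below tol, then extracts a greedy stationary policy.
    P : (U, X, X) transition matrices, c : (U, X) stage costs, 0 < rho < 1.
    """
    _, X, _ = P.shape
    J = np.zeros(X)
    for _ in range(max_iter):
        Q = c + rho * (P @ J)            # Q[u, i]
        J_new = Q.min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    policy = (c + rho * (P @ J)).argmin(axis=0)
    return J, policy
```

For a two-state chain where control 0 stays put (costs 0 in state 1, 1 in state 2) and control 1 swaps states at cost 0.5, with $\rho = 0.9$, the fixed point is $J = (0, 0.5)$: stay in state 1, swap out of state 2.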
Additional reading for this section can be found in Ch. 6 of [krishnamurthy2016partially] to deepen understanding of MDPs and gain more perspective. Specifically, compare the development of the infinite horizon discounted cost in section 6.4 with the form of the above DP Bellman equation. Remember, the infinite-horizon model consists of the 5-tuple $(\mathcal{X}, \mathcal{U}, P(u), c(i,u), \rho)$. Because the horizon is infinite, note that the terminal cost has been removed. The limiting form of the finite horizon MDP Bellman equation is:

$$J(i) = \min_{u \in \mathcal{U}} \Big[ c(i, u) + \rho \sum_{j \in \mathcal{X}} P_{ij}(u) \, J(j) \Big].$$
Note that the explicit expectation has been removed in this equation: the dynamics that were in the system equation have moved into the matrix $P(u)$, which acts like a conditional expectation operator, since $\sum_j P_{ij}(u) J(j) = \mathbb{E}[J(x_{k+1}) \mid x_k = i, u_k = u]$. Rather than include an explicit random variable $w_k$ in the system of equations, both the randomness and the dynamics come from $P(u)$.
2 The Hidden Markov Model (HMM)
2.1 HMM Basics and the Viterbi Algorithm
Throughout this survey paper I use unified notation when I review the papers, updating each original paper's notation to match mine where appropriate so the reader is not stuck deciphering different notations and conventions. After following the above reading recommendations for MDPs, this section continues with the introduction of HMMs. One could continue reading [bertsekasdynamic1] for HMM material, or go to [krishnamurthy2016partially] and begin reading on stochastic models and Bayesian filtering. The Bayesian filtering perspective is crucial for appreciating the HMM in the eventual POMDP context. The filter is used to update the agent or controller's belief-state as new information, specifically observations of the Markov chain, is witnessed. This belief-state formulation is important because most of the time the agent has inaccurate or corrupted state information with which to determine its control decisions. In this context, it is important to determine the true state of the system as well as possible. In this section I review only the most useful information related to HMMs from [bertsekasdynamic1] and [krishnamurthy2016partially] to provide an understanding of HMMs before moving on, in the next section, to POMDPs, the main focus of the survey.
In [bertsekasdynamic1] Chapter 2, Deterministic Systems and the Shortest Path Problem:
Read 2.1 Finite-State Systems and Shortest Paths. This section provides an interesting perspective.
Then, jump to the section 2.2.2 on Hidden Markov Models and the Viterbi Algorithm.
Later, one will see how the HMM is a critical part of POMDPs. Section 2.2.1 can be skipped, as can the rest of the chapter, which mainly focuses on shortest path algorithms for graphs.
Summary of HMMs from [bertsekasdynamic1]: a Markov chain is considered, which has a finite state space with given transition probabilities

$$p_{ij} = \mathbb{P}(x_{k+1} = j \mid x_k = i).$$

The point of the HMM is to optimally estimate the unknown states as the Markov chain evolves over time. Assume an observer can only see observations that are probabilistically related to the true state. The observer can optimally estimate the state sequence from the observation sequence. Denote the probability of an observation $z$, after a state transition from $i$ to $j$, by $r(z; i, j)$. The observer is also given the prior on the initial state, i.e., the probability $\pi_0(i)$ that the state initially takes on value $i$. The probabilistic structure just described constitutes an HMM, which is a noisily observed Markov chain. Note for future reference that when control is added to the model, a POMDP is formed. The Viterbi algorithm uses the most-likely-state-sequence estimation criterion: it takes the observation sequence $z_1, \dots, z_N$ and computes the state sequence $\hat{x}_0, \dots, \hat{x}_N$ that maximizes the posterior probability of the state sequence given the observations.
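The Viterbi recursion can be sketched compactly in NumPy. One simplification relative to [bertsekasdynamic1]: observation probabilities here depend only on the current state (the $B_{jz}$ convention used later in this survey) rather than on the transition pair as in $r(z; i, j)$; the transition-dependent version follows the same pattern. The small example in the test is my own assumption.

```python
import numpy as np

def viterbi(p_prior, P, B, obs):
    """Most likely state sequence of an HMM (Viterbi algorithm).

    p_prior : (X,) initial state probabilities
    P       : (X, X) transition matrix, P[i, j] = p_ij
    B       : (X, Z) observation likelihoods, B[j, z] = P(obs z | state j)
    obs     : observed symbols, one per time step (first one at time 0)
    Works in log-space to avoid underflow on long sequences.
    """
    with np.errstate(divide="ignore"):       # log(0) -> -inf is fine here
        logP, logB, logpi = np.log(P), np.log(B), np.log(p_prior)
    delta = logpi + logB[:, obs[0]]          # best log-prob ending in each state
    back = []                                # backpointers per step
    for z in obs[1:]:
        scores = delta[:, None] + logP       # scores[i, j]: come from i, go to j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + logB[:, z]
    path = [int(delta.argmax())]             # backtrack from the best end state
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

With nearly noiseless observations and sticky transitions, the decoded path simply tracks the observations, as expected.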
In [krishnamurthy2016partially], read Ch. 2 Stochastic state space models; specifically read 2.1 Stochastic state space model, 2.2 Optimal prediction: Chapman-Kolmogorov equation, and 2.4 Example 2: Finite-state hidden Markov model (HMM). Then proceed to Ch. 3 Optimal filtering. In this chapter, read 3.1 Optimal state estimation, 3.2 Conditional expectation and Bregman loss functions, and 3.3 Optimal filtering, prediction and smoothing formulas. Section 3.3 is important for understanding optimal filtering in the most general state-space setting. Then immediately read 3.5 Hidden Markov model (HMM), specializing the previous results to finite-state Markov chains. Lastly, read 3.7 on geometric ergodicity of the HMM filter, an important concept which says that, under certain conditions, the HMM filter forgets its initial condition geometrically fast, as a Markov chain does.
2.2 Optimal Prediction: Chapman-Kolmogorov (C-K) equation
Summary from [krishnamurthy2016partially]: an interesting question for HMMs is, what is the optimal state predictor? For a finite-state Markov chain with state-space $\mathcal{X} = \{1, \dots, X\}$ and state transition matrix $P$, define the $X$-dimensional state probability vector at time $k$ as

$$\pi_k(i) = \mathbb{P}(x_k = i), \quad i \in \mathcal{X}.$$

Then the Chapman-Kolmogorov (C-K) equation reads

$$\pi_{k+1} = P^\top \pi_k,$$

initialised by some prior belief $\pi_0$. The process $\{x_k\}$ denotes an $X$-state Markov chain. The dynamics of $x_k$ are described by the matrix $P$ with elements $P_{ij} = \mathbb{P}(x_{k+1} = j \mid x_k = i)$. Associated with the states $1, \dots, X$, one can define a state level vector $g = (g_1, \dots, g_X)^\top$. Using $\pi_k$, we can predict the state level at time $k$ by $\hat{g}_k = g^\top \pi_k$.
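The C-K predictor is a one-line matrix-vector product; here is a minimal sketch, with the two-state numbers in the test being my own illustrative assumption.

```python
import numpy as np

def ck_predict(pi0, P, k):
    """k-step-ahead state prediction via the Chapman-Kolmogorov equation:
    pi_{k+1} = P' pi_k, iterated k times from the prior pi0."""
    pi = np.asarray(pi0, dtype=float)
    for _ in range(k):
        pi = P.T @ pi
    return pi

def predict_level(pi0, P, g, k):
    """Predicted state level g' pi_k, i.e., the expected level at time k."""
    return float(np.asarray(g) @ ck_predict(pi0, P, k))
```

Starting from state 1 with certainty, one step of the chain below yields the first row of $P$ as the predicted distribution.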
A few more definitions for Markov chains may prove useful for proofs or analysis, namely the limiting distribution and the stationary distribution. The limiting distribution is $\pi_\infty = \lim_{k \to \infty} (P^\top)^k \pi_0$, which may not exist as a limit, but follows from repeated application of the C-K equation. Another important definition for Markov chains is the stationary distribution, the $X$-dimensional vector $\pi_\infty$ that satisfies $\pi_\infty = P^\top \pi_\infty$ with $\mathbf{1}^\top \pi_\infty = 1$. This vector is interpreted as a normalized right eigenvector of $P^\top$ corresponding to the unit eigenvalue. The limiting distribution and the stationary distribution coincide if $P$ is regular. The definition of regular is: a transition matrix $P$ is regular (primitive) if there exists an integer $k \ge 1$ such that every element of $P^k$ is strictly positive.
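Both quantities are easy to compute numerically. This is a minimal sketch using the eigenvector characterization above; the regularity check bounds the search by Wielandt's bound on the index of primitivity, $(X-1)^2 + 1$.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution: the eigenvector of P' for eigenvalue 1,
    normalized so its entries sum to one (a valid pmf)."""
    w, V = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(w - 1.0))     # eigenvalue closest to 1
    pi = np.real(V[:, idx])
    return pi / pi.sum()

def is_regular(P):
    """P is regular (primitive) iff P^k > 0 elementwise for some k >= 1;
    for an X-state chain, k <= (X-1)^2 + 1 suffices (Wielandt's bound)."""
    X = P.shape[0]
    Q = np.eye(X)
    for _ in range((X - 1) ** 2 + 1):
        Q = Q @ P
        if np.all(Q > 0):
            return True
    return False
```

For example, a strictly positive two-state chain is regular and its stationary pmf follows directly from the balance equations, while the deterministic two-state swap matrix is periodic, so its limiting distribution does not exist and it is not regular.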
2.3 Optimal Filtering for the HMM
Now, I present the main results from Ch. 3 on optimal filtering from [krishnamurthy2016partially] necessary for our purpose of building toward the POMDP theory. The recursive state estimation problem goes by several names: optimal filtering, Bayesian filtering, stochastic filtering, recursive Bayesian estimation, and so on. The name is not so important; what is important is that optimal filtering involves the use of Bayes' rule.
Next, I present the HMM filter. As before, one has the (uncontrolled) state transition probabilities $P_{ij}$, but additionally one has another finite space $\mathcal{Z} = \{1, \dots, Z\}$ for the observations. It is also possible to have continuous-valued noisy observations of the HMM; see section 2.4 of [krishnamurthy2016partially] for more on this. Effectively, one then uses probability density functions to model the observation process. For the finite HMM setting, the conditional probability mass function (pmf) of the observations is defined as

$$B_{iz} = \mathbb{P}(z_k = z \mid x_k = i).$$

These are also called state-to-observation transitions. Now we have state-to-state transitions and state-to-observation transitions as part of our modeling procedure. Most applications of HMMs, such as speech processing or telecommunications, use a finite observation space. $B$ can be thought of as an $X \times Z$ matrix whose rows sum to one: $\sum_{z \in \mathcal{Z}} B_{iz} = 1$.
Let's recap the HMM model before discussing the optimal filtering update to the belief state. An HMM is a noisily observed Markov chain specified by a 4-tuple $(P, B, g, \pi_0)$, i.e., the state and observation transition probabilities, the state level vector, and an initial belief pmf. When dealing with continuous random variables, the update formula is given by integrals, but for finite state spaces, the integration reduces to summation:

$$\pi_{k+1}(j) = \frac{B_{j z_{k+1}} \sum_{i \in \mathcal{X}} P_{ij} \, \pi_k(i)}{\sum_{j' \in \mathcal{X}} B_{j' z_{k+1}} \sum_{i \in \mathcal{X}} P_{ij'} \, \pi_k(i)}.$$
This equation can be expressed compactly using matrix-vector notation. Having observed the value $z$, define the diagonal matrix of likelihoods

$$B_z = \mathrm{diag}(B_{1z}, \dots, B_{Xz}).$$

Define the $X$-dimensional posterior vector, also called the belief-state, as $\pi_k(i) = \mathbb{P}(x_k = i \mid z_1, \dots, z_k)$. The HMM filter is summarized now. Given the known parameters $(P, B, \pi_0)$, at each time $k$, upon receiving observation $z_{k+1}$, update the belief-state by

$$\pi_{k+1} = \frac{B_{z_{k+1}} P^\top \pi_k}{\sigma(\pi_k, z_{k+1})},$$

where $\sigma(\pi_k, z_{k+1}) = \mathbf{1}^\top B_{z_{k+1}} P^\top \pi_k$ is the normalizing constant.
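One step of this filter in matrix-vector form takes only a few lines; this is a minimal sketch, with the two-state numbers in the test being my own illustrative assumption.

```python
import numpy as np

def hmm_filter(pi, P, B, z):
    """One step of the HMM filter: predict with P' then correct with the
    observation likelihoods B[:, z]; returns the new belief and the
    normalizing constant sigma (the likelihood of observing z)."""
    unnorm = B[:, z] * (P.T @ pi)    # elementwise B_z applied to P' pi
    sigma = unnorm.sum()
    return unnorm / sigma, sigma
```

Starting from a uniform belief and seeing an observation that strongly favors state 1, the belief mass shifts sharply toward state 1.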
Next, I present a few different types of state estimators for the HMM that could be useful for a number of reasons once we reach the POMDP theory in the next chapter. First, define the type I estimator at time $k$ as the maximum a posteriori (MAP) estimate

$$\hat{x}_k^{\mathrm{I}} = \arg\max_{i \in \mathcal{X}} \pi_k(i).$$

This type of estimation could be useful for application in POMDPs, as discussed shortly. Secondly, I introduce another type of state estimator that could be useful when the number of states is very large, though still finite. Denote by $\hat{x}_k^{\mathrm{II}}$ the type II state estimator, the conditional-mean level $g^\top \pi_k$ mapped back to an admissible state; when the state levels take on the state values, $g_i = i$, it reduces to the following form:

$$\hat{x}_k^{\mathrm{II}} = \mathrm{NN}\Big( \sum_{i \in \mathcal{X}} i \, \pi_k(i) \Big) = \arg\min_{i \in \mathcal{X}} \big| i - g^\top \pi_k \big|,$$

where NN stands for nearest neighbor. Certainly other forms of state estimators exist and should be experimented with and compared to these; those presented are just a few possibilities.
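Both estimators reduce to one-liners on a belief vector. This sketch implements one plausible concrete reading of the two estimators, the MAP and nearest-neighbor forms, applied to the filter's output belief.

```python
import numpy as np

def map_estimate(pi):
    """Type I (MAP) estimator: the state index maximizing the posterior."""
    return int(np.argmax(pi))

def nn_estimate(pi, g):
    """Type II (nearest-neighbor) estimator: compute the conditional-mean
    level g' pi, then snap to the closest admissible state level."""
    mean_level = float(np.asarray(g) @ pi)
    return int(np.argmin(np.abs(np.asarray(g) - mean_level)))
```

For a belief $(0.2, 0.5, 0.3)$ with levels $g = (1, 2, 3)$, both estimators pick the middle state: the MAP by posterior mass, the NN because the mean level 2.1 is closest to level 2.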
Having developed some HMM theory, I have presented two simple state estimators, called type I and type II, that could be combined with the previously developed DP and MDP material, and later with POMDPs, by using the state estimate to turn the partially observed problem into an MDP. More importantly, this provides approximations, or efficient computations of the kind discussed above, for solving the HMM, although certainly other variations on these ideas are possible. Later, more types of state estimators for approximations to the POMDP problem appear in the survey of the classical papers. Before reviewing the classical papers in detail, the POMDP is motivated and introduced in the next section.
3 POMDP Fundamentals
POMDPs generalize MDPs when state uncertainty exists or partial information is available to an agent to make decisions. The POMDP captures both stochastic state transitions and observation uncertainty. The canonical model includes finite state, finite control, and finite observation spaces. This model is very general and has been applied to solve research problems in a number of fields. A survey on POMDP applications is found in [cassandra1998survey].
In MDPs, the state $x_k$ is observable to the agent at all stages. Meanwhile, in a POMDP, the agent can only observe an output or observation $z_k$, which can in many cases be treated as a noisy version of $x_k$.
The POMDP Canonical Model consists of:
A finite set of states $\mathcal{X} = \{1, \dots, X\}$.
A finite set of controls $\mathcal{U}$.
A finite set of observations $\mathcal{Z}$.
A state-to-state transition function $P_{ij}(u) = \mathbb{P}(x_{k+1} = j \mid x_k = i, u_k = u)$.
A state-to-observation transition function $B_{jz}(u) = \mathbb{P}(z_{k+1} = z \mid x_{k+1} = j, u_k = u)$.
A cost function, most generically defined on $\mathcal{X} \times \mathcal{U} \times \mathcal{Z}$ as $c(i, u, z)$.
The immediate cost $c(i, u, z)$ is the most generic kind of cost function for this model because it considers not only the current state $i$ and the control $u$ executed from it, but also any observation $z$ that can be witnessed. If observations are not relevant for the cost model, then the immediate cost can be written as $c(i, u)$. This model can be used to learn a control policy that yields the lowest total cost over all stages of decision making. Also note that the above sets could be modeled as continuous with some more research effort. In fact, in my most recent unpublished research [bowyer2021pomdp], the observation space was modelled as continuous, but the state and control sets were still modelled as finite. More research is needed to allow all of the spaces to be continuous while still being able to determine optimal or near-optimal controllers for such models.
A dissertation by Alvin Drake at MIT [drake1962observation] was the first work akin to POMDPs: he studied what we now call POMDPs by looking at decoding Markov sources observed through a noisy channel. In his work, the states are the symbols of the Markov source and the observations are the symbols received in noise. He considered a binary symmetric channel model of the noise with several different decoding rules. The extension to continuous output alphabets, the inclusion of error control coding, and the potential application of RL to such problems remain open research problems.
In an earlier paper [bowyer2019reinforcement], I investigated RL for solving the spectrum sharing problem, using the simple SARSA algorithm as the main learning technique. Better modelling of all of the agents would be possible if a POMDP could be formulated; the paper [wong2020dynamic] explains the radio architecture and more modelling considerations for the RL radio problem. Additionally, a better multi-agent formulation of the multi-radio spectrum sharing problem would bring about more performance gains in terms of optimally dividing the spectrum between all teams' radios. Note that the general multi-radio problem is partially observed because, at any given time instant, what another radio tells you (from your radio's perspective) is most likely not the truth, i.e., it is not using exactly that frequency or bandwidth. Hence, no single agent knows the full state of the system, which necessitates a POMDP model of some kind. These issues are currently being investigated by researchers in our lab.
The first major result for single-agent POMDPs is in [smallwoo1973optimal], for a finite horizon problem with a finite state MDP. A paper by Denardo [denardo1967contraction] lays down some important mathematics to prepare for the extension to the infinite horizon. Particularly, he develops contraction and monotonicity properties, as well as a convergence criterion based on these properties, that would later be applied to the cost-to-go functions. The properties developed in [smallwoo1973optimal] do not carry over to the infinite horizon paper on POMDPs [sondik1978optimal]. These classical papers are reviewed more thoroughly in the next section.
A brief review of POMDP terminology and properties follows. Assume the internal dynamics of the system are modeled by an $X$-state MDP. At each stage $k$, the goal is to pick a control $u_k$ to optimize the performance of the system over its remaining life. To solve this control problem, we encode the current state information in a vector, called the belief state:

$$\pi_k(i) = \mathbb{P}(x_k = i \mid z_1, \dots, z_k, u_0, \dots, u_{k-1}), \quad i \in \mathcal{X}.$$

The belief state at stage $k$ is a vector of state probabilities, where $\pi_k(i)$ is the probability the current state is $i$ at stage $k$.
This encoding transforms the discrete-state MDP into a continuous-state MDP. After using control $u_k$ and observing output $z_{k+1}$, the belief state is updated, for each state $j$, according to

$$\pi_{k+1}(j) = \frac{B_{j z_{k+1}}(u_k) \sum_{i} P_{ij}(u_k) \, \pi_k(i)}{\sum_{j'} B_{j' z_{k+1}}(u_k) \sum_{i} P_{ij'}(u_k) \, \pi_k(i)}.$$
Now, let's denote this updating of the information or belief-state by the operator $\pi_{k+1} = T(\pi_k, u_k, z_{k+1})$. Next, I present a DP solution to this problem. Define $V_k(\pi)$ as the reward-to-go function, which is the maximum expected reward the system can accrue during the remaining stages when starting from stage $k$ in belief state $\pi$. Let $r(i, u, z)$ be the generic state, input, and output instantaneous reward, which does not vary across stages. The DP recursion is

$$V_k(\pi) = \max_{u \in \mathcal{U}} \sum_{i} \pi(i) \sum_{j} \sum_{z} P_{ij}(u) \, B_{jz}(u) \Big[ r(i, u, z) + V_{k+1}\big(T(\pi, u, z)\big) \Big].$$

The expression above can be further simplified as

$$V_k(\pi) = \max_{u \in \mathcal{U}} \Big[ \sum_{i} \pi(i) \, \bar{r}(i, u) + \sum_{z} \sum_{i} \pi(i) \sum_{j} P_{ij}(u) \, B_{jz}(u) \, V_{k+1}\big(T(\pi, u, z)\big) \Big],$$

where the expected immediate reward for state $i$ at stage $k$, if alternative $u$ is used, is given by

$$\bar{r}(i, u) = \sum_{j} \sum_{z} P_{ij}(u) \, B_{jz}(u) \, r(i, u, z).$$
The terminal stage is at $k = N$. If $r_N(i)$ is the expected value of terminating the process in state $i$, then $V_N(\pi) = \sum_i \pi(i) \, r_N(i)$. Note also that the reward-to-go function can be rewritten in matrix-vector notation for an arbitrary stage. Define $\sigma(\pi, u, z) = \sum_i \pi(i) \sum_j P_{ij}(u) \, B_{jz}(u)$ as the probability of observing output $z$ if the current belief-state is $\pi$ and the control $u$ is selected. Then the recursion becomes

$$V_k(\pi) = \max_{u \in \mathcal{U}} \Big[ \pi^\top \bar{r}_u + \sum_{z \in \mathcal{Z}} \sigma(\pi, u, z) \, V_{k+1}\big(T(\pi, u, z)\big) \Big],$$

valid for $k = 0, \dots, N-1$, where $\bar{r}_u = (\bar{r}(1, u), \dots, \bar{r}(X, u))^\top$. This is a DP equation with a continuous state-space, i.e., the beliefs $\pi$, a finite control space, and a finite observation space.
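The belief update operator $T$ and the observation probability $\sigma$ that drive this recursion can be sketched together; the controlled transition and observation arrays, and the toy numbers in the test, are my own assumptions.

```python
import numpy as np

def belief_update(pi, u, z, P, B):
    """POMDP belief update T(pi, u, z) and observation probability
    sigma(pi, u, z).

    P : (U, X, X) controlled transitions P[u, i, j]
    B : (U, X, Z) observation probabilities B[u, j, z] for the next state j
    Returns (new belief, sigma), where sigma = P(z | pi, u).
    """
    unnorm = B[u][:, z] * (P[u].T @ pi)   # likelihood-weighted prediction
    sigma = unnorm.sum()
    return unnorm / sigma, sigma
```

With a single control this reduces exactly to the HMM filter step from the previous section, which provides a quick consistency check.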
Although the DP equation looks quite complex, it turns out the solution is piece-wise linear and convex [smallwoo1973optimal]:

Piece-wise linear and convex cost-to-go function:

$$V_k(\pi) = \max_{\alpha \in \Gamma_k} \alpha^\top \pi,$$

for some finite set of alpha vectors $\Gamma_k$ at stage $k$.
The proof is done by induction. Note that at time $N$,

$$V_N(\pi) = \sum_i \pi(i) \, r_N(i) = r_N^\top \pi$$

has the desired form of being (piece-wise) linear in $\pi$. The proof is important because it shows how to recursively compute the optimal reward function and hence the optimal policy. See [smallwoo1973optimal] for the completion of the proof.
In the beginning, $\Gamma_N = \{r_N\}$. For the earlier stages, the alpha vector set can be constructed as follows: for each control $u$, and for each assignment of a vector $\alpha^z \in \Gamma_{k+1}$ to every observation $z$, add to $\Gamma_k$ the vector with components

$$\alpha(i) = \bar{r}(i, u) + \sum_{z} \sum_{j} P_{ij}(u) \, B_{jz}(u) \, \alpha^z(j), \quad i \in \mathcal{X}.$$
Once $\Gamma_k$ is constructed from the stage $k+1$ computations, the optimal policy can be computed as follows: given belief $\pi$ at stage $k$,

$$\mu_k^*(\pi) = u(\alpha^*), \quad \text{where } \alpha^* = \arg\max_{\alpha \in \Gamma_k} \alpha^\top \pi$$

and $u(\alpha)$ denotes the control that generated the alpha vector $\alpha$ in the construction above.
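A minimal enumeration-style backup, in the spirit of the alpha-vector construction described above (no pruning), can be sketched as follows; the tiny single-observation example in the test is my own assumption.

```python
import numpy as np
from itertools import product

def backup_alpha_set(Gamma_next, P, B, r_bar):
    """One stage of exact finite-horizon POMDP value iteration by
    enumeration: for each control u and each choice of a stage-(k+1)
    alpha vector per observation z, build a stage-k alpha vector.

    Gamma_next : list of (X,) alpha vectors for stage k+1
    P : (U, X, X), B : (U, X, Z), r_bar : (U, X) expected immediate rewards
    Returns (Gamma_k, controls) with the generating control of each vector.
    """
    U, _, Z = B.shape
    Gamma, controls = [], []
    for u in range(U):
        # M[z] @ alpha has components sum_j P[u,i,j] B[u,j,z] alpha(j)
        M = [P[u] * B[u][:, z] for z in range(Z)]
        for choice in product(range(len(Gamma_next)), repeat=Z):
            alpha = r_bar[u].copy()
            for z, a_idx in enumerate(choice):
                alpha = alpha + M[z] @ Gamma_next[a_idx]
            Gamma.append(alpha)
            controls.append(u)
    return Gamma, controls

def optimal_control(pi, Gamma, controls):
    """Policy extraction: the control attached to the maximizing vector."""
    values = [alpha @ pi for alpha in Gamma]
    return controls[int(np.argmax(values))]
```

Note that one stage produces $|\mathcal{U}| \, |\Gamma_{k+1}|^{|\mathcal{Z}|}$ vectors, which is exactly the exponential growth that motivates the pruning strategies discussed next.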
All exact methods stem from the classical results studied in this section and the classical papers studied in the next section, particularly the work of Sondik. At each stage, one iteratively constructs a set of alpha vectors from the future stage. In [krishnamurthy2016partially], a general incremental pruning algorithm is presented that reduces complexity while maintaining optimality by eliminating vectors that never contribute to the cost-to-go function. The paper by Monahan [monahan1982state] computes the set of vectors at each stage, which has $|\mathcal{U}| \, |\Gamma_{k+1}|^{|\mathcal{Z}|}$ alpha vectors in it, where $|\mathcal{Z}|$ is the number of possible observations and the other factors are cardinalities of finite sets. These are pruned according to the linear program in [krishnamurthy2016partially]. Monahan also goes over an enumeration strategy for value iteration for POMDPs, but in general, as we have already seen, the number of alpha vectors used to represent the value function grows exponentially. This necessitates the use of pruning strategies to reduce complexity, and hence approximate computation, for problems with large state, control, or observation spaces. Approximation methods for the POMDP are presented after a review of classical papers in the next section. Hence, if one only wants to learn about the main approximation methods for POMDPs, the next section can be safely skipped.
The reading road-map for this section: start with [bertsekasdynamic1] Ch. 4, Problems with Imperfect State Information. There, Bertsekas illustrates well the ever-increasing computational complexity associated with an increasing horizon length for POMDPs, though this chapter is not ideal for introductory POMDP material. Skip 4.1-4.2, and read 4.3 Sufficient Statistics, paying particular attention to 4.3.2 Finite-State Systems. This chapter begins to illustrate where the assumptions of MDPs and dynamic programs break down when perfect state information is not available for decision making. After reading this chapter, an appreciation for POMDPs can be developed further by reading [krishnamurthy2016partially]. That textbook serves as a better introduction to POMDPs and to recent research directions in the later chapters, e.g., social sensing. After this section is completed, I move to focus entirely on approximation methods for the POMDP.
4 Classical Papers for POMDPs
4.1 Paper 1: Astrom, “Optimal Control of Markov Processes with Incomplete State Information”
4.1.1 Overview of the paper
I start my survey of classical papers with Astrom's paper because it closely aligns with the goal of making optimal decisions on problems with countable state spaces when state information is incompletely known. Astrom solves the problem of optimal control with incomplete state information using the mature field of DP. The state spaces and control spaces in this paper are more general than what has been considered so far in this survey paper. As well as presenting unified notation, I convert theorems, when relevant, into our finite-space setting, possibly countably infinite, but never uncountable. A preliminary paper that may be read before this one is [stratonovich1965conditional], because a result from its section III, equation 3.19, is used. Another paper that may be of interest is [dynkin1965controlled], as it deals with control of a partially observed random sequence; however, it is more relevant as a setting for bandit-type problems. It is suggested, but not required, reading for this POMDP survey. At the end of each section in the rest of the survey, I provide a few paragraphs with pointers to further literature related to the papers considered in that section.
Summary: the beauty of [aastrom1965optimal] is that it introduces control into Markov processes and finds an optimal controller that uses all of the observations available up to the current time or stage. Instead of postulating a form of a control law, as is done in some areas of control theory, he minimizes a specific cost functional. The design of his problem as a stochastic variational problem (SVP) leads to a classical two-step procedure familiar from the optimal control of deterministic linear systems with quadratic cost. These steps are 1) estimate the state based on observations, and 2) calculate the control variable from the estimated state. This paper presented five novel theorems at the time of publication. I summarize the results and offer new ideas where gaps are found.
The cost-to-go or performance measure of the system is a functional of the state trajectories and the controls selected. We seek to minimize this functional in expectation. Note that the selected controls may depend on the history of the measurements up to the current time. To solve his more general problem of dynamics governed by a stochastic differential equation, he discretizes the state and time variables so that Markov chain theory applies. The theory for the quantized problem and its solutions is presented, and the problem is solved using DP. The last two theorems form provable lower and upper bounds for the solution. This leads to a notion of the value of complete and of partial or incomplete state information relative to no state information, which is interesting because it connects information theory with stochastic optimal control.
4.1.2 Set up of the POMDP Theory and Statement of the Problems
Denote the countable-state Markov chain as in the earlier sections; for simplicity, assume a finite state space and discrete time, with a prior pmf on the initial state. The possibly time-varying controlled state-transition probabilities satisfy the same constraints given earlier in the survey. Here, however, the control is a member of a compact set referred to as the set of admissible controls; keep in mind that for computer-based systems this set is usually finite. In his model, the transition probabilities are continuous functions of the control.
Also, the output of the system, or the observable process, is a discrete-time random process; most of the time the observation space is finite. The observation transition probabilities are collected in a matrix whose entries are nonnegative and whose rows sum to one. This matrix reflects the probabilities of measurement errors: perfect state information results when it is the identity matrix, and in general it need not be square. Denote a realization of this process, i.e., the measurements up to the current time, as the observed data. Define a set of control functions, one per stage, and refer to it as a strategy or control law; the control law is admissible iff each control function maps the observed data into the set of admissible controls. This introduces feedback into the problem, as it directly relates the measurements or observations with the control to be applied. Define the instantaneous cost to be a scalar function of the state and control that depends continuously on the control. The total cost in expectation is
Problem 1 in [aastrom1965optimal]:
Find an admissible control signal whose value at each time is a function of the outputs observed up to that time, and such that the expected value of the total cost is minimal.
Solution to Problem 1: We seek to minimize this expectation over all admissible control laws:
Each control is a function of the observations up to its stage. The solution is given by DP. At the final time, the last observation is obtained and we must determine the terminal control as a function of the observations; that is, we must minimize the expected terminal cost. We can rewrite this expression using the smoothing property of conditional expectation to make the dependence on the terminal control explicit:
Define the optimal cost-to-go function by:
and note the dependence of the cost-to-go on all of the observations, which enters through the conditional distribution of the state given the observed data. To emphasize this point, define
for a finite state space, or
for a countably infinite state space. Thus, the minimal cost at the final stage is a function of the conditional state distribution alone.
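For a finite state space, the final-stage minimization is just a dot product of the belief vector with each control's cost vector. A minimal numerical sketch (the names c and b, and all numbers, are my own illustrations, not the paper's):

```python
import numpy as np

# Hypothetical terminal costs c[u][x] for 2 controls over 3 states
c = np.array([[1.0, 4.0, 6.0],    # control 0
              [2.0, 2.0, 5.0]])   # control 1
b = np.array([0.6, 0.3, 0.1])     # conditional state distribution (belief)

expected = c @ b                  # expected terminal cost of each control
u_star = int(np.argmin(expected)) # control minimizing the expected cost
```

Note that the minimal cost depends on the observations only through b, which is the point of the sufficient-statistic argument.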
It is shown recursively in [aastrom1965optimal] that the minimal cost of the remaining stages can be written as:
Using this hypothesis, we have
Hence, the optimal control has to be a function of the conditional state distribution to minimize the above expression. One more step makes this explicit using the smoothing property:
which can be expressed further using two conditional densities: that of the state given the observed data, and that of the next observation given the observed data.
An application I thought of is to specialize this result to a finite-state, finite-observation system. The dynamic program at each stage would then minimize the following expression using two conditional pmfs:
Some interesting Markov assumptions could be made about these two conditional pmfs that would make this a much more useful formula for online computation than the original continuous integrals involving densities. However, the continuous integrals may be useful for developing soft-decision decoders in error-control coding research.
Next, he derives a recursive filter for the state estimator that updates once a new observation is available; this is the optimal Bayesian filter from before. Refer to the derivation in the paper for more details. Using the earlier definitions, we can build the following recursive update:
Introduce a vector of unnormalized conditional state probabilities, and define its norm as the sum of its components. This norm is interpreted as the conditional probability of the latest observation given the past ones.
Now, we can revisit the DP equation and simplify its expression:
The induction extends to an arbitrary stage. Now that the intuition is set and the basic building blocks are assembled, I present the theorems in order.
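Before turning to the theorems, the recursive filter can be sketched numerically. This is a minimal sketch assuming a finite-state chain; the names P (transition matrix), Q (observation matrix), and y (observation index) are mine, not Astrom's notation:

```python
import numpy as np

def belief_update(b, P, Q, y):
    # Predict: propagate the belief through the transition matrix,
    # then correct: reweight by the likelihood of observation y and normalize.
    pred = b @ P
    unnorm = pred * Q[:, y]
    return unnorm / unnorm.sum()

# Toy two-state example with made-up numbers
P = np.array([[0.9, 0.1], [0.2, 0.8]])
Q = np.array([[0.8, 0.2], [0.3, 0.7]])
b1 = belief_update(np.array([0.5, 0.5]), P, Q, 0)
```

The sum of the unnormalized vector before dividing plays the role of the norm just introduced: it is the conditional probability of the new observation given the past ones.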
4.1.3 The Theorems
Theorem 1: Let the control law minimize the functional
Then, for any stage k:
Theorem 2: Let the functional equation for the optimal cost have a solution. Then Problem 1 has a solution, and the associated control law minimizes the functional, achieving that minimal value. Another interesting fact from the proof of this theorem is that continuity of the problem data in the control implies that any solution of the cost functional equation is itself continuous.
Some more discussion of the above two theorems yields more insight. The cost functional being minimized can be solved a priori, knowing only the problem data, without seeing any observations. The controller takes the form of a function of the conditional state distribution. Note that the dependence on the observations enters only through the conditional distributions. The minimizing function can be obtained offline from the cost functional, while the conditional distribution of the state as a function of the observed data is given by the recursion already presented.
The one lemma given in [aastrom1965optimal] is also quite useful: he proves that, under an admissible control law of this form, the sequence of conditional probability distributions is itself a Markov chain. The proof rests on the optimal control law being a function of the current conditional distribution, combined with the recursive filter equation.
Next, Theorem 3 is presented, which is based on an adjusted statement of Problem 1. He considers a variational problem relative to the belief process. The new functional is written as an inner product between the belief vector and the cost vector, with expectation taken with respect to the belief process. Now consider the Problem 2 statement:
Find a sequence of admissible controls such that the functional is minimal. The value of the control at any time may depend on all previous and current beliefs.
This gives rise to Theorem 3:
Problems 1 and 2 are equivalent in the sense that if one of the problems has a solution, then the other also has a solution. Furthermore, the optimal control law is the same in both cases and is given by Theorem 1. In Problem 2 there is complete state information, so it reduces to a kind of classical DP recursion. Lastly, Theorem 3 implies that the problem of optimal control of a Markov chain with incomplete state information can be transformed into a problem of optimal control of a process with complete state information. However, the state space of the problem with complete information is the space of probability distributions over the states of the original problem.
I believe this is the first instance in the research literature of what is today known as belief space; note the observation space is finite, with at most m points. However, in this formulation, the control is not assumed to affect the observation. Controlling a Markov chain with incomplete state information can thus be divided into two parts. Part 1: solve either Problem 1 or Problem 2; this gives the optimal controller as a function of the belief. Part 2: calculate the conditional probability distribution of the states of the associated Markov chain from the measured signal.
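Theorem 3's reduction can be illustrated by running DP directly on the belief vector, treating it as a completely observed state. A toy sketch under assumed names (P[u] per-control transition matrices, Q observation matrix, c[u] cost vectors; all numbers illustrative):

```python
import numpy as np

def belief_update(b, P_u, Q, y):
    # Bayes filter step: predict through P_u, correct by observation y
    unnorm = (b @ P_u) * Q[:, y]
    return unnorm / unnorm.sum()

def V(b, k, N, P, Q, c):
    """Optimal expected cost-to-go from belief b at stage k, computed as a
    complete-information DP on the belief space (as in Problem 2)."""
    if k == N:
        return 0.0
    best = float("inf")
    for u in range(len(P)):
        stage = b @ c[u]                      # expected immediate cost
        pred = b @ P[u]                       # predicted state distribution
        future = 0.0
        for y in range(Q.shape[1]):
            p_y = pred @ Q[:, y]              # P(next observation = y | b, u)
            if p_y > 1e-12:
                future += p_y * V(belief_update(b, P[u], Q, y), k + 1, N, P, Q, c)
        best = min(best, stage + future)
    return best

P = [np.array([[0.9, 0.1], [0.3, 0.7]]), np.array([[0.5, 0.5], [0.1, 0.9]])]
Q = np.array([[0.8, 0.2], [0.3, 0.7]])
c = [np.array([0.0, 2.0]), np.array([1.0, 1.0])]
v0 = V(np.array([0.5, 0.5]), 0, 2, P, Q, c)
```

Part 1 corresponds to the minimization inside V, and Part 2 corresponds to belief_update, which is exactly the filter recursion.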
The last two theorems of the paper provide lower and upper bounds on the optimal cost-to-go functions. Two cases are presented: the complete-state-information case provides a lower bound on the optimal cost-to-go function, and the open-loop system using only a priori information provides an upper bound. Case 1 (complete state information): here the state is observed exactly at each stage; define the cost-to-go with complete state information from each stage. Then,
Case 2 (open-loop system, i.e., no state information): the other extreme is when no a posteriori state information is available. In this case the conditional state probabilities are independent of the measurements, which therefore contain no information for calculating the control. Rewriting the cost functional for this case gives us:
which appears to use the Chapman-Kolmogorov (C-K) equation presented earlier in the survey to run the open-loop minimization.
Let the solution of the imperfect-state cost problem be given, and that of Case 1 likewise. Then it can be shown that the complete-information cost is a lower bound on the imperfect-information cost.
Let the solution of the functional equation in Theorem 4 be given, and that of the Case 2 open-loop system likewise. Then it can be shown that the open-loop cost is an upper bound on the imperfect-information cost.
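Both bounds are cheap to compute for a toy problem: the lower bound is the complete-information DP value averaged under the prior, and the upper bound is the best open-loop control sequence with the belief propagated by the C-K equation. A sketch with illustrative numbers and my own names:

```python
import numpy as np
from itertools import product

# Toy 2-state, 2-control problem (all numbers illustrative)
P = [np.array([[0.9, 0.1], [0.3, 0.7]]),
     np.array([[0.5, 0.5], [0.1, 0.9]])]
c = [np.array([0.0, 2.0]), np.array([1.0, 1.0])]
b0 = np.array([0.5, 0.5])
N = 3

# Lower bound: DP with complete state information, averaged under the prior
V = np.zeros(2)
for _ in range(N):
    V = np.min([c[u] + P[u] @ V for u in range(2)], axis=0)
lower = b0 @ V

# Upper bound: best open-loop sequence, belief propagated by the C-K equation
upper = float("inf")
for seq in product(range(2), repeat=N):
    b, total = b0.copy(), 0.0
    for u in seq:
        total += b @ c[u]
        b = b @ P[u]
    upper = min(upper, total)
```

The optimal imperfect-information cost is sandwiched between lower and upper.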
Some closing remarks and ideas: the above two theorems yield a final beautiful insight. One gap is the value of having perfect state information relative to none at all, and the other is the value of having incomplete state information relative to none at all. It would be interesting to follow this point up and investigate other upper and lower bounds than the ones presented. Is it possible to prove tighter bounds? Are there better approximations for them? In particular, finding a lower bound whose domain is the space of beliefs would be a theoretical advance; the present lower bound is a linear function of the belief. If we found such a lower bound in belief space, we would have a better algorithm than the current DP solution at minimizing the total costs over the finite horizon. A tighter upper bound in belief space would also be interesting, because it would mean we are very close to optimal, which may let us save a lot of unnecessary computation. Additionally, as a major research hole, the number of times observations or measurements are made could be incorporated into the instantaneous cost. Then one could assign a cost to running a measurement and determine when it is optimal to make one; I would call this an optimal measuring policy. Finally, is it possible to define a notion of entropy on the transition matrix or the belief-state sequence? This could lead to a new type of maximum-exploration strategy, and possibly help create new upper and lower bounds for the optimal cost-to-go functions. The research area of exploration appears largely unexplored for POMDPs. Information theory may prove a most valuable tool here, treating the belief states as the distributions of interest and using Shannon's information measures, whether entropy or relative entropy, to investigate exploration strategies for POMDPs.
4.2 Paper 2: Smallwood and Sondik, “The Optimal Control of Partially Observable Markov Processes Over a Finite Horizon”
4.2.1 Overview of the paper
The main results of [smallwoo1973optimal] were reviewed in section 3; here I provide more of an overview of the paper without repeating what was said there. This paper formulates the optimal control problem specifically for a finite-state discrete-time Markov chain (MC). The motivating example that illustrates the theory is a machine-maintenance example, overviewed in this section of the survey, which helps put the theorems, their proofs, and the other mathematical results in context. Typically in DP or operations research cost minimization is pursued, but this paper pursues reward maximization, as in RL: the optimal reward-to-go function from stage k plays the role of what is usually called the optimal cost-to-go function.
Moreover, the machine example serves as an analogy for POMDPs in general: while a machine operates, one cannot generally know its internal states exactly by inspecting only the outputs. In this example, all one can observe is the quality of the finished product, and only a finite number of observations are considered. The machine produces exactly one finished product per hour, at the end of the hour. The machine is run by two internal components that may fail independently at each stage. If a component has failed, it increases the chance that a defective product will be produced.
Assume that the finished product is either defective or not defective, with a different reward incurred in each case. For this problem there are three states, corresponding to zero, one, or both of the machine's internal components being broken, respectively. The two observations correspond to seeing a defective or a non-defective product at the end of any given hour on the factory floor. Refer to Figure 1 in [smallwoo1973optimal], which illustrates the relation between the internal Markov process states and the possible observations via the transition diagram.
Furthermore, Sondik uses different terminology from the controls language we have been using. He calls a control or decision a control alternative, as in another option to choose from a finite set of alternatives. For this problem there are four such control alternatives, which are the distinct ways to run the manufacturing process from the start of each hour. These alternatives are just admissible controls in our framework.
The control options are:
Continue operation, i.e., do not examine the finished product,
Examine the product at the end of the hour,
Stop the machine for the hour, disassemble it, inspect the two internal components, and replace any that have failed,
Replace both components, using the whole hour, with no inspection.
He discusses the costs for this problem at the end of the paper, but it is not hard to see that each control will cost a different amount at each stage; hence we must determine an optimal policy that maximizes the total reward. The paper's most generic notation for the instantaneous reward term includes the possibility of rewarding based on observations too; in our framework and notation we could likewise write an observation-dependent instantaneous cost that also incorporates the stage number. The goal is to find the optimal policy among these four options, each having different rewards associated with it, knowing the complete history of the machine's operation.
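To fix ideas, the machine-maintenance model can be written down as data. All numbers below are placeholders for illustration, not values from [smallwoo1973optimal]:

```python
# Sketch of the machine-maintenance model: 3 states, 2 observations, 4 controls.
states = [0, 1, 2]                      # number of failed internal components
observations = ["defective", "not defective"]
controls = ["continue", "examine", "inspect and replace failed", "replace both"]

# P(defective product | number of failed components): more failures mean a
# higher chance of a defect (placeholder probabilities)
p_defect = {0: 0.1, 1: 0.5, 2: 0.9}

# Hourly product reward (placeholder values)
reward = {"defective": -1.0, "not defective": 1.0}

# Expected product reward in each internal state
expected_reward = {s: p_defect[s] * reward["defective"]
                      + (1 - p_defect[s]) * reward["not defective"]
                   for s in states}
```

The expected reward decreases as more components fail, which is what makes inspection and replacement worth paying for.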
In Table I of [smallwoo1973optimal], other POMDP application examples of interest are presented along with their modeling assumptions. I mention only two here, related to controls and communications, listing their states, observations, and controls. The first shows the need for considering more control options, and even for using modern encoding and decoding techniques, while the second considers a target-tracking application.
Decoding of Markov Sources
State: status of the source,
Observation: output of the noisy channel,
Control Alternative: not considered.
Search for a moving object
State: status of target object,
Observation: result of search,
Control Alternative: level of expenditure of search resources.
Another application area mentioned in [smallwoo1973optimal] is target tracking. I have a few ideas for furthering the work of [pollock1970simple] and [eagle1984optimal], which studied tracking a Markovian moving target. The seminal paper on the optimal detection question is [pollock1970simple]. Its main working questions are 1) what is the minimum expected number of looks required to detect the Markovian moving target, and 2) given a finite number n of looks, what is the maximum probability of detection that can be achieved? In [pollock1970simple], only a two-state POMDP model was developed, and the goal was to maximize the probability of detection over a finite number of stages using DP methodology. The paper [eagle1984optimal] extended that approach to an arbitrary finite number of states for the same goal, addressing the seminal paper's call to action by explicitly formulating the search problem as a POMDP and using the finite-horizon POMDP solution techniques of [smallwoo1973optimal] and [monahan1982state]. Lastly, [singh2003optimal] extended [eagle1984optimal]'s work on the optimal detection problem to the infinite-horizon case.
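A sketch of the two-cell version of the search problem in [pollock1970simple]: maximize the probability of detecting a Markovian moving target within a fixed number of looks, by DP on the belief. The motion matrix M and detection probability d below are illustrative numbers of my own:

```python
import numpy as np

M = np.array([[0.8, 0.2], [0.3, 0.7]])  # target motion between the two cells
d = 0.9                                  # P(detect | look in target's cell)

def max_detect_prob(b, looks):
    """Max probability of detecting the target within `looks` looks,
    starting from belief b over the target's cell."""
    if looks == 0:
        return 0.0
    best = 0.0
    for cell in range(2):
        p_now = b[cell] * d                  # detect on this look
        post = b.copy()
        post[cell] *= (1 - d)                # condition the belief on a miss
        post /= post.sum()
        p_later = max_detect_prob(post @ M, looks - 1)
        best = max(best, p_now + (1 - p_now) * p_later)
    return best

p = max_detect_prob(np.array([0.5, 0.5]), 3)
```

Conditioning on a miss and then propagating through M is exactly the belief update for this problem; question 2) of the paper is the quantity max_detect_prob computes.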
For future research, a Markovian moving target with momentum can be considered. The system could have state vectors that mix completely observable and partially observed states. Effectively, I would like to develop a POMDP model for tracking this moving target with communication models of the received state information. In [pollock1970simple], a two-state moving-target model was formulated, which can be converted to a three-state process in which the third state signifies detected or not detected.
4.3 Paper 3: Denardo, “Contraction Mappings in the Theory Underlying Dynamic Programming”
4.3.1 Overview of the paper
This paper ties together a variety of disparate DP problems under a general framework by noticing the many similarities between the different models of this era. The math developed in this paper helps in studying any infinite-horizon problem, POMDP or not. The main properties most of these models possess are: 1) a contraction assumption, 2) a monotonicity assumption, and 3) an N-stage contraction assumption.
Property 3 is actually a weakened statement of the first property. These properties surfaced in the works of famous authors such as [howard1971dynamic], [blackwell1965discounted], and [bellman1957dynamic]. In [denardo1967contraction], the term policy is introduced, along with the return, or reward-to-go, function of a policy. Next, he introduces a maximization operator and shows it has the contraction property. He notes that the fixed-point theorem for contraction mappings assures us that the optimality equation has a unique solution. Define the optimal return, or optimal reward-to-go, function as the supremum of the return over all policies. If both the contraction and monotonicity assumptions are satisfied, the unique fixed point of the maximization operator coincides with the optimal return.
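A numerical illustration of the contraction property for a discounted finite MDP, which is one instance of Denardo's framework (toy numbers and names of my own):

```python
import numpy as np

beta = 0.9                                   # discount factor
P = [np.array([[0.5, 0.5], [0.2, 0.8]]),     # per-control transition matrices
     np.array([[0.9, 0.1], [0.6, 0.4]])]
r = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]  # per-control rewards

def T(v):
    """Maximization operator: (Tv)(x) = max_u [r(x,u) + beta * E v(x')]."""
    return np.max([r[u] + beta * P[u] @ v for u in range(2)], axis=0)

# Contraction: the sup-norm gap between two value functions shrinks by beta
v, w = np.zeros(2), np.array([5.0, -5.0])
gap0 = np.max(np.abs(v - w))
gap1 = np.max(np.abs(T(v) - T(w)))

# Iterating T converges to the unique fixed point, the optimal return
for _ in range(500):
    v = T(v)
```

The sup-norm gap shrinks by at least the discount factor at each application, so repeated application converges to the unique fixed point guaranteed by the contraction-mapping theorem.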