Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) (Sutton and Barto, 1998; Kaelbling et al., 1998) are useful and common tools in machine learning, with artificial agents evolving in these environments, generally seeking to maximise a reward.
But though there has been a lot of work on POMDPs from the practical perspective, there has been relatively little from the theoretical perspective. This paper aims to partially fill that hole. It first looks at notions of equivalence in POMDPs: two such structures are equivalent when an agent cannot distinguish which is which from any actions and observations it takes and makes.
A stronger notion is that of counterfactual equivalence; here, multiple agents sharing the same structure cannot distinguish it from another through any combination of actions and observations.
Given these notions, this paper demonstrates that any POMDP will be counterfactually equivalent to a deterministic POMDP for any number of interaction turns. A deterministic POMDP is one whose transition and observation functions are deterministic, and hence all the uncertainty is concentrated in the initial state.
Having uncertainty expressed in this way allows one to clarify POMDPs from an information perspective: what can the agent be said to learn as it evolves in the POMDP, what it can change and what it can’t. Since the rest of the POMDP is deterministic, an agent that knows the environment can only gain information about the initial state.
This construction has a universality property, in that all such deterministic POMDPs define the same pure learning processes, where a pure learning process is one that can be decomposed as a sum of terms, each expressing knowledge about the initial state.
This allows better analysis of the causality in the POMDPs, using concepts that were initially designed for environments with the causal structure more naturally encoded, such as causal graphs (Pearl, 2009).
2 Setup and notation
The reward function in a POMDP is not important, as the focus of this paper is on its causal structure, with the reward just a component of the observation.
Thus define a partially observable Markov decision process without reward function (POMDP\R) (Choi and Kim, 2011), which consists of
a finite set of states $\mathcal{S}$,
a finite set of actions $\mathcal{A}$,
a finite set of observations $\mathcal{O}$,
a transition probability distribution $T: \mathcal{S} \times \mathcal{A} \to \Delta\mathcal{S}$ (where $\Delta\mathcal{S}$ is the set of probability distributions on $\mathcal{S}$),
a probability distribution $T_0 \in \Delta\mathcal{S}$ over the initial state,
an observation probability distribution $O: \mathcal{S} \to \Delta\mathcal{O}$.
This POMDP\R will often be referred to as an environment (though Hadfield-Menell et al. (2017) refer to similar structures as world models).
The agent interacts with the environment in cycles: initially, the environment is in state $s_0$ (given by $T_0$), and the agent receives observation $o_0$. At time step $t$, the environment is in state $s_{t-1}$ and the agent chooses an action $a_t$. Subsequently the environment transitions to a new state $s_t$ drawn from the distribution $T(\cdot \mid s_{t-1}, a_t)$ and the agent then receives an observation $o_t$ drawn from the distribution $O(\cdot \mid s_t)$. The underlying states $s_{t-1}$ and $s_t$ are not directly observed by the agent.
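The interaction cycle just described can be sketched in Python. This is only an illustrative sketch: the class name `POMDPR`, the helper names `draw` and `rollout`, the dictionary encoding of the distributions, and the small example environment are all assumptions made for the illustration, not notation from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class POMDPR:
    """A POMDP without reward; distributions are {outcome: probability} dicts."""
    T0: dict  # initial state distribution: state -> probability
    T: dict   # (state, action) -> {next_state: probability}
    O: dict   # state -> {observation: probability}


def draw(dist, rng):
    """Sample one outcome from a {outcome: probability} dict."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[x] for x in outcomes])[0]


def rollout(env, policy, steps, rng):
    """Run the cycle and return the history [o0, a1, o1, ..., a_t, o_t]."""
    state = draw(env.T0, rng)                      # hidden initial state
    history = [draw(env.O[state], rng)]            # first observation
    for _ in range(steps):
        action = policy(tuple(history))            # agent sees only the history
        state = draw(env.T[(state, action)], rng)  # hidden transition
        history += [action, draw(env.O[state], rng)]
    return history


# Hypothetical example: one action, a fair coin flip on the first turn,
# after which the state (and hence the observation) no longer changes.
env = POMDPR(
    T0={"s0": 1.0},
    T={("s0", "a"): {"s1": 0.5, "s2": 0.5},
       ("s1", "a"): {"s1": 1.0},
       ("s2", "a"): {"s2": 1.0}},
    O={"s0": {"o0": 1.0}, "s1": {"o1": 1.0}, "s2": {"o2": 1.0}},
)
history = rollout(env, lambda h: "a", steps=2, rng=random.Random(0))
```

The policy receives only the history of actions and observations, never the state, which mirrors the fact that the underlying states are hidden from the agent.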
A history is a sequence of actions and observations. We denote the set of all observed histories of length with , and by the set of all histories.
For , let be the sequence of actions , let be the sequence of observations , and let be the sequence of states . Write if or if .
The set is the set of policies, functions mapping histories to probability distributions over actions. Given a policy and environment , we get a probability distribution over histories:
Since gives the probabilities of everything except actions, and gives the probabilities of actions, all conditional probabilities between histories, actions, states, and so on, can be computed using , , and Bayes’ rule. For instance, let , , and be such that . Then by Bayes’ rule:
Then note that is obviously independent of , so this can be rewritten as
which can be computed from . In the case where there exists no with , set to .
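As an illustration of how the probability of a history under a policy can be computed from the environment alone, here is a minimal sketch using the standard forward recursion over hidden states. The recursion is a textbook technique rather than necessarily the presentation intended here, and the function name `history_probability`, the dictionary encoding, and the example environment are assumptions made for the illustration.

```python
def history_probability(T0, T, O, policy, history):
    """P(history | policy), where history = [o0, a1, o1, ..., a_t, o_t].

    T0: {state: prob}; T: {(state, action): {next_state: prob}};
    O: {state: {observation: prob}}; policy: history -> {action: prob}.
    """
    # alpha[s] = probability of the history so far, jointly with current state s
    alpha = {s: p * O[s].get(history[0], 0.0) for s, p in T0.items()}
    for i in range(1, len(history), 2):
        action, obs = history[i], history[i + 1]
        # probability that the policy picks this action after the prefix
        weight = policy(tuple(history[:i])).get(action, 0.0)
        new_alpha = {}
        for s, p in alpha.items():
            for s2, q in T[(s, action)].items():
                step = weight * p * q * O[s2].get(obs, 0.0)
                new_alpha[s2] = new_alpha.get(s2, 0.0) + step
        alpha = new_alpha
    return sum(alpha.values())


# Hypothetical one-turn environment with two actions and a uniform policy.
T0 = {"s0": 1.0}
T = {("s0", "a"): {"s1": 0.5, "s2": 0.5},
     ("s0", "b"): {"s1": 0.5, "s2": 0.5}}
O = {"s0": {"o0": 1.0}, "s1": {"o1": 1.0}, "s2": {"o2": 1.0}}


def uniform(history):
    return {"a": 0.5, "b": 0.5}


p = history_probability(T0, T, O, uniform, ["o0", "a", "o1"])  # 0.5 * 0.5
```

The policy only contributes the factors for the actions, while the environment contributes everything else, which is the separation used in the Bayes' rule argument above.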
3 Equivalence and counterfactual equivalence
[Similarity] The environments and are (observationally) similar if they have the same sets , and . Consequently, they have the same sets of histories , and hence the same sets of policies .
We’ll say that two environments and are -equivalent if an agent in one cannot figure out which one it is in during the first turns.
To formalise this: [Equivalence] The environments and are -equivalent if they are similar (and hence have the same sets of histories), and, for all with and all policies ,
If they are -equivalent for all , they are equivalent.
3.2 Counterfactual equivalence
We’ll say that two environments and are -counterfactually equivalent if multiple agents sharing the same environment cannot figure out which one they are in during the first turns.
This is a bit more tricky to define; in what sense can multiple agents be said to share the same environment? One idea is that if two agents are in the same state and choose the same action, they will then move together to the same next state (and make the same next observation).
To formalise this, define: [Environment policy] The is a deterministic environment policy of length , if it is a triplet , where , , and . Let be the set of all environment policies of length .
The idea is that contains all the information as to how the stochasticity in , , and is resolved in the environment. The gives a single initial state, means that an agent in state on turn , taking action , will move to state , and means that an agent arriving in state on turn will make observation .
The environment gives a distribution over elements of :
For the first turns of interaction with the environment, the agent can either see itself as updating using , , and , or it can see itself as following a deterministic environment policy , chosen according to the above probability.
Given an environment policy and an actual policy, the probability of a certain history can be computed. If is deterministic, will always be either 0 or 1, since and deterministically determine all the states, observations, and actions.
Using and Bayes’ rule, this conditional probability can be inverted to compute , which is since and are independent of each other.
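To make environment policies concrete, the sketch below enumerates every environment policy of non-zero probability for a small environment and replays one deterministically. It assumes that each entry of the policy (the initial state, the successor of each state-action pair on each turn, and the observation of each state on each turn) is resolved independently, so that the probability of a policy is the product of the corresponding entries. The function names, the dictionary encoding, and the example environment are hypothetical.

```python
from fractions import Fraction
from itertools import product


def environment_policies(T0, T, O, m):
    """Yield (probability, (s0, pi_T, pi_O)) for each environment policy of
    length m with non-zero probability."""
    t_keys = [(s, a, i) for (s, a) in T for i in range(1, m + 1)]
    o_keys = [(s, i) for s in O for i in range(m + 1)]
    support = lambda dist: [(x, p) for x, p in dist.items() if p > 0]
    t_choices = [support(T[(s, a)]) for (s, a, _) in t_keys]
    o_choices = [support(O[s]) for (s, _) in o_keys]
    for s0, p0 in support(T0):
        for t_pick in product(*t_choices):
            for o_pick in product(*o_choices):
                prob = p0
                for _, p in t_pick + o_pick:   # independent resolution
                    prob *= p
                pi_T = {k: s2 for k, (s2, _) in zip(t_keys, t_pick)}
                pi_O = {k: o for k, (o, _) in zip(o_keys, o_pick)}
                yield prob, (s0, pi_T, pi_O)


def observations(env_policy, actions):
    """Observations seen by an agent taking `actions` under a fixed policy."""
    s0, pi_T, pi_O = env_policy
    state, seen = s0, [pi_O[(s0, 0)]]
    for i, a in enumerate(actions, start=1):
        state = pi_T[(state, a, i)]
        seen.append(pi_O[(state, i)])
    return seen


HALF, ONE = Fraction(1, 2), Fraction(1)
T0 = {"s0": ONE}
T = {("s0", "a"): {"s1": HALF, "s2": HALF},
     ("s0", "b"): {"s1": HALF, "s2": HALF},
     ("s1", "a"): {"s1": ONE}, ("s1", "b"): {"s1": ONE},
     ("s2", "a"): {"s2": ONE}, ("s2", "b"): {"s2": ONE}}
O = {"s0": {"o0": ONE}, "s1": {"o1": ONE}, "s2": {"o2": ONE}}

policies = list(environment_policies(T0, T, O, m=1))
```

Once an environment policy is fixed, `observations` involves no randomness at all: two agents sharing the policy and choosing the same actions necessarily see the same observations.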
So this gives a formalisation of what it means to have several agents sharing the same environment: they share an environment policy. [Counterfactual equivalence] The environments and are -counterfactually equivalent if they are similar, and if for any collection of pairs of histories and policies with ,
If they are -counterfactually equivalent for all , they are counterfactually equivalent. The terms in Equation 3 are the joint probabilities of agents, using policies and sharing the same environment policy, each seeing the histories .
And finally: If and are -equivalent (or -counterfactually equivalent) for all , they are equivalent (or counterfactually equivalent).
A useful result is: If the environments and are -counterfactually equivalent, then they are -equivalent.
If , then
since is impossible, given .
If , then
For the counterfactually equivalent and , the case of , , demonstrates that . The same argument shows , demonstrating and establishing Equation 1.
4 Examples
Consider the environment of Figure 1. This has , , and . Since the observations and states are the same, with trivial , this is actually a Markov decision process (Sutton and Barto, 1998). The agent starts in , chooses between two actions, and each action leads separately to one of two outcomes, with equal probability.
Compare with of Figure 2. The actions and observations are the same (hence the two environments are similar), but the state set is larger. Instead of one initial state, there are two, and , leading to the same observation . These two states are equally likely under , and lead deterministically to different states if the agent chooses .
It’s not hard to see that and are counterfactually equivalent. The environment has just shifted the uncertainty about the result of , out of and into the initial distribution .
Contrast both of these with the environment of Figure 3, which has the same , , , , and as , but different behaviour under action (hence a different ).
It’s not hard to see that all three environments are equivalent: given history , the agent is equally likely to end up in state or , and that’s the end of the process.
They are not, however, counterfactually equivalent. There are four environment policies in (and in ) of non-zero probability. They can be labeled , which sends to and to . Each one has probability .
There are two environment policies in of non-zero probability; they can be labeled , which simply chooses the starting state . Each one has probability .
Since there are only two actions and they are only used once, the policies of these environments can be labeled by that action.
Then consider the two pairs of policies and histories and . Under the environment policy , both these pairs are certainly possible, so they have a non-zero probability under (and ). However, is impossible under , while is impossible under . So there are no environment policies in that make both those histories possible. Thus is not counterfactually equivalent to and .
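The argument of this section can be checked numerically with a small sketch. Each environment is summarised by its environment policies of non-zero probability, and each environment policy by the outcome it assigns to each action. The labels `"a"`, `"b"`, `"s1"` and `"s2"`, the function names, and the reading that the third environment resolves both actions through its initial state are assumptions made for the illustration, since the figures are not reproduced here.

```python
from fractions import Fraction
from itertools import product


def independent_env():
    """First two environments: the outcomes of a and b are resolved
    independently, giving four environment policies of probability 1/4."""
    return [(Fraction(1, 4), {"a": i, "b": j})
            for i, j in product(("s1", "s2"), repeat=2)]


def correlated_env():
    """Third environment (assumed reading): the initial state fixes the
    outcome of both actions, giving two policies of probability 1/2."""
    return [(Fraction(1, 2), {"a": "s1", "b": "s1"}),
            (Fraction(1, 2), {"a": "s2", "b": "s2"})]


def joint_probability(env, pairs):
    """Probability that agents sharing one environment policy, and taking the
    listed actions, each see the listed outcome. pairs = [(action, outcome)]."""
    return sum(p for p, outcome in env
               if all(outcome[a] == o for a, o in pairs))


single = [("a", "s1")]               # one agent: same probability everywhere
shared = [("a", "s1"), ("b", "s2")]  # two agents sharing the environment

p_single_ind = joint_probability(independent_env(), single)
p_single_cor = joint_probability(correlated_env(), single)
p_shared_ind = joint_probability(independent_env(), shared)
p_shared_cor = joint_probability(correlated_env(), shared)
```

A single agent sees identical statistics in both cases (`p_single_ind` and `p_single_cor` agree), whereas the pair of agents does not (`p_shared_ind` is non-zero while `p_shared_cor` is zero), which is the gap between equivalence and counterfactual equivalence.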
5 Underlying deterministic environment
In this section, the environment is assumed to have all its special features indicated by a – so it will have state space , transition function , and so on.
The main result is: For and all environments , there exists an environment that is -counterfactually equivalent to , and on which the transition function , and the observation function , are both deterministic.
Let and , so and are similar.
Recall that any decomposes as . The deterministic is defined as sending the state to . The deterministic is defined as mapping and the action to . For the rest of the proof, we’ll see and as functions, mapping into and .
The initial distribution is if and , and is otherwise. This defines .
We now need to show that and are -counterfactually equivalent. The proof is not conceptually difficult; one just has to pay careful attention to the notation.
Let be defined as the elements of the form (ignoring the extra variable: for all , and ), for given by . Let be the (bijective) map taking such to the corresponding .
Since and are deterministic, Equation 2 and the definition of imply that if . Again by the definition of :
Then note that preserves the middle component of . Given a state and an action , the next state and observation in will be given by and . Similarly, given the state , environment policy , and action , the next state and observation in will be given by and .
So an agent in , starting in , and an agent in , having environment policy (and hence starting in ), would, if they chose the same actions, see the same observations. Now, ‘starting in ’ can be rephrased as ‘having environment policy ’. Since the policies of the agent are dependent on actions and observations only, this means that for all and :
since is a surjection onto .
In the above construction, all the uncertainty and stochasticity of the initial has been concentrated into the distribution over the initial state in .
Note that though the construction will work for every , the size of increases with , so the limit of this as has a countably infinite number of states, rather than a finite number.
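The construction of this section can be sketched in code. The sketch assumes that a state of the deterministic environment is a triple (original state, turn, index of an environment policy), that the initial distribution is the distribution over environment policies, and that policy entries are resolved independently. The exact state space used in the proof may differ, and all function names and the example environment are hypothetical.

```python
from fractions import Fraction
from itertools import product


def environment_policies(T0, T, O, m):
    """List (probability, s0, pi_T, pi_O) for all policies of non-zero
    probability, assuming every entry is resolved independently."""
    t_keys = [(s, a, i) for (s, a) in T for i in range(1, m + 1)]
    o_keys = [(s, i) for s in O for i in range(m + 1)]
    support = lambda dist: [(x, p) for x, p in dist.items() if p > 0]
    result = []
    for s0, p0 in support(T0):
        for t_pick in product(*[support(T[(s, a)]) for s, a, _ in t_keys]):
            for o_pick in product(*[support(O[s]) for s, _ in o_keys]):
                prob = p0
                for _, p in t_pick + o_pick:
                    prob *= p
                pi_T = {k: x for k, (x, _) in zip(t_keys, t_pick)}
                pi_O = {k: x for k, (x, _) in zip(o_keys, o_pick)}
                result.append((prob, s0, pi_T, pi_O))
    return result


def determinise(T0, T, O, m):
    """Deterministic environment with states (state, turn, policy index)."""
    policies = environment_policies(T0, T, O, m)
    # All uncertainty moves into the initial distribution.
    T0_star = {(s0, 0, k): prob for k, (prob, s0, _, _) in enumerate(policies)}

    def T_star(state, action):      # deterministic transition
        s, i, k = state
        return (policies[k][2][(s, action, i + 1)], i + 1, k)

    def O_star(state):              # deterministic observation
        s, i, k = state
        return policies[k][3][(s, i)]

    return T0_star, T_star, O_star


def observation_distribution(T0_star, T_star, O_star, actions):
    """Distribution over observation sequences for a fixed action sequence."""
    dist = {}
    for state, prob in T0_star.items():
        seen = [O_star(state)]
        for a in actions:
            state = T_star(state, a)
            seen.append(O_star(state))
        dist[tuple(seen)] = dist.get(tuple(seen), 0) + prob
    return dist


HALF, ONE = Fraction(1, 2), Fraction(1)
T0 = {"s0": ONE}
T = {("s0", "a"): {"s1": HALF, "s2": HALF},
     ("s0", "b"): {"s1": HALF, "s2": HALF},
     ("s1", "a"): {"s1": ONE}, ("s1", "b"): {"s1": ONE},
     ("s2", "a"): {"s2": ONE}, ("s2", "b"): {"s2": ONE}}
O = {"s0": {"o0": ONE}, "s1": {"o1": ONE}, "s2": {"o2": ONE}}

T0_star, T_star, O_star = determinise(T0, T, O, m=1)
dist_a = observation_distribution(T0_star, T_star, O_star, ["a"])
```

In this example the deterministic environment has four initial states of probability 1/4 each, and taking the first action yields each of the two outcomes with total probability 1/2, matching the original stochastic transition.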
5.1 ‘Universality’ of the underlying deterministic environment
For many , much simpler constructions are possible. See for instance the environment of Figure 4. It is deterministic in and , and counterfactually equivalent to and in Section 4. But has states and state-action pairs, so there are different environment policies (though only of non-zero probability), meaning that is of magnitude , much larger than the states of .
This raises the question of which deterministic POMDP is preferable for modelling the initial POMDP. Fortunately, there is a level at which all counterfactually equivalent deterministic POMDPs are the same.
[Pure learning process] On , let be a map from histories of length or less, to the unit interval. Then is a pure learning process if there exists a deterministic , -counterfactually equivalent to , such that can be expressed as
for constants .
These pure learning processes are seen to compute a probability over the stochastic elements of the environment. Then the universality result is: Let be a pure learning process on , and let be deterministic and -counterfactually equivalent to . Then there exist constants for , such that can be defined as in Equation 6.
Since is a pure learning process, we already know that there exists a deterministic environment, -counterfactually equivalent to , where decomposes as Equation 6. Since being -counterfactually equivalent is a transitive property, we may as well assume that itself is this environment.
We now need to define the on , and show they generate the same .
Let be the set of deterministic policies. Since is deterministic itself, apart from , a choice of and a choice of determines a unique history of length . Therefore each defines a map . Define the subset as the set of all such that ; these subsets form a partition of .
Since is also deterministic, its state space has a similar partition.
Given an , define the collection of pairs . For a deterministic environment, an environment policy of non-zero probability is just a choice of initial state. So, writing for , Equation 3 with that collection of pairs becomes:
Since everything is deterministic, the expression must be either 0 or 1, and it is 1 only if . Thus Equation 7 can be further rewritten as
This demonstrates that the probability under of any , is the same as the probability under of ; so, writing for ,
So for all , with , define
Thus for is equal to the weighted average of in . For the with , set to any value. This defines the , and hence a on via Equation 6.
We now need to show that . Note first that
Now let be a history with , and any deterministic policy that, upon being given an initial segment , will generate the action . Thus is a policy that could allow to happen.
Let be the set of all such that . This means that, if the agent started in and followed , it would generate a history containing – hence that it would generate .
That set can be written as a union . The observation of is thus equivalent to . Consequently
Thus any deterministic -counterfactually equivalent environment can be used to define any pure learning process: they are all interchangeable for this purpose.
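A pure learning process on a deterministic environment can be illustrated with a short sketch. It assumes that Equation 6 expresses the process as a weighted sum, over initial states, of a constant times the posterior probability of that initial state given the history; this reading is an assumption, as are the function names, the constants `c`, and the example environment.

```python
from fractions import Fraction


def initial_state_posterior(T0, T, O, history):
    """P(s0 = s | history) in a deterministic environment.

    T0: {state: prob}; T(state, action) -> state; O(state) -> observation;
    history = [o0, a1, o1, ..., a_t, o_t].
    """
    weights = {}
    for s0, prior in T0.items():
        state = s0
        consistent = O(state) == history[0]
        for i in range(1, len(history), 2):
            state = T(state, history[i])
            consistent = consistent and O(state) == history[i + 1]
        if consistent:              # this initial state explains the history
            weights[s0] = prior
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()} if total else {}


def pure_learning_process(T0, T, O, c):
    """Assumed form: f(h) = sum over s of c[s] * P(s0 = s | h)."""
    def f(history):
        posterior = initial_state_posterior(T0, T, O, history)
        return sum(c[s] * p for s, p in posterior.items())
    return f


# Hypothetical deterministic environment with two equally likely initial
# states that look the same at first and are told apart by any action.
T0 = {"u1": Fraction(1, 2), "u2": Fraction(1, 2)}
step = {("u1", "a"): "s1", ("u1", "b"): "s1",
        ("u2", "a"): "s2", ("u2", "b"): "s2"}
obs = {"u1": "o0", "u2": "o0", "s1": "o1", "s2": "o2"}
T = lambda state, action: step[(state, action)]
O = lambda state: obs[state]

f = pure_learning_process(T0, T, O, c={"u1": 1, "u2": 0})
before = f(["o0"])              # nothing learned yet
after = f(["o0", "a", "o1"])    # the initial state is now known
```

Because the environment is deterministic, the only thing the history can shift is the posterior over the initial state, so the value of `f` changes from `before` to `after` purely through what has been learned about that state.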
- Choi and Kim (2011) Jaedeug Choi and Kee-Eung Kim. Inverse reinforcement learning in partially observable environments. Journal of Machine Learning Research, 12:691–730, 2011.
- Hadfield-Menell et al. (2017) Dylan Hadfield-Menell, Smitha Milli, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pages 6749–6758, 2017.
- Kaelbling et al. (1998) Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1):99–134, 1998.
- Pearl (2009) Judea Pearl. Causality. Cambridge University Press, 2009.
- Sutton and Barto (1998) R. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. A Bradford Book.