# Successor Feature Sets: Generalizing Successor Representations Across Policies

Successor-style representations have many advantages for reinforcement learning: for example, they can help an agent generalize from past experience to new goals, and they have been proposed as explanations of behavioral and neural data from human and animal learners. They also form a natural bridge between model-based and model-free RL methods: like the former they make predictions about future experiences, and like the latter they allow efficient prediction of total discounted rewards. However, successor-style representations are not optimized to generalize across policies: typically, we maintain a limited-length list of policies, and share information among them by representation learning or GPI. Successor-style representations also typically make no provision for gathering information or reasoning about latent variables. To address these limitations, we bring together ideas from predictive state representations, belief space value iteration, successor features, and convex analysis: we develop a new, general successor-style representation, together with a Bellman equation that connects multiple sources of information within this representation, including different latent states, policies, and reward functions. The new representation is highly expressive: for example, it lets us efficiently read off an optimal policy for a new reward function, or a policy that imitates a new demonstration. For this paper, we focus on exact computation of the new representation in small, known environments, since even this restricted setting offers plenty of interesting questions. Our implementation does not scale to large, unknown environments – nor would we expect it to, since it generalizes POMDP value iteration, which is difficult to scale. However, we believe that future work will allow us to extend our ideas to approximate reasoning in large, unknown environments.


## 1 Background and notation

Our environment is a controlled dynamical system. We interact with it in a sequence of time steps; at each step, all relevant information is encoded in a state vector. Given this state vector, we choose an action. Based on the action and the current state, the environment changes to a new state, emits an observation, and moves to the next time step. We can describe such a system using one of a few related models: a Markov decision process (MDP), a partially-observable Markov decision process (POMDP), or a (transformed) predictive state representation (PSR). We describe these models below, and summarize our notation in Table 1.

### 1.1 MDPs

An MDP is the simplest model: there are finitely many discrete states, numbered $1, \ldots, k$. The environment starts in one of these states, $s_1$. For each possible action $a$, the transition matrix $T^a$ tells us how our state changes if we execute action $a$: $[T^a]_{ij}$ is the probability that the next state is $i$ if the current state is $j$.

More compactly, we can associate each state $s$ with a corresponding standard basis vector $e_s$, and write $q_t = e_{s_t}$ for the vector at time $t$. (So, if $s_t = i$ then $[q_t]_i = 1$.) Then, $T^a q_t$ is the probability distribution over next states:

$$P(s_{t+1} \mid q_t, \text{do } a) = E(q_{t+1} \mid q_t, \text{do } a) = T^a q_t$$

Here we have written $\text{do } a$ to indicate that choosing an action is an intervention.

### 1.2 POMDPs

In an MDP, we get to know the exact state at each time step: $q_t$ is always a standard basis vector. By contrast, in a POMDP, we only receive partial information about the underlying state: at each time step, after choosing our action $a_t$, we see an observation $o_t$ according to a distribution that depends on the next state $s_{t+1}$. The observation matrix $D$ tells us the probabilities: $D_{oi}$ is the probability of receiving observation $o$ if the next state is $i$.

To represent this partial information about state, we can let the state vector $q_t$ range over the probability simplex instead of just the standard basis vectors: $[q_t]_i$ tells us the probability that the state is $i$, given all actions and observations so far, up to and including $a_{t-1}$ and $o_{t-1}$. The vector $q_t$ is called our belief state; we start in a known belief state $q_1$.

Just as in an MDP, we have $P(s_{t+1} \mid q_t, \text{do } a) = T^a q_t$. But now, instead of immediately resolving to one of the corners of the simplex, we can only take into account partial state information: if $o_t = o$ then by Bayes rule

$$[q_{t+1}]_i = P(s_{t+1}=i \mid q_t, \text{do } a, o) = \frac{P(o \mid s_{t+1}=i)\, P(s_{t+1}=i \mid q_t, \text{do } a)}{P(o \mid q_t, \text{do } a)} = \frac{D_{oi}\, [T^a q_t]_i}{\sum_j D_{oj}\, [T^a q_t]_j}$$

More compactly, if $u$ is the vector of all $1$s, and

$$T^{ao} = \mathrm{diag}(D_{o,\cdot})\, T^a$$

where $\mathrm{diag}(\cdot)$ constructs a diagonal matrix from a vector, then our next belief state is

$$q_{t+1} = T^{ao} q_t \,/\, u^\top T^{ao} q_t$$

A POMDP is strictly more general than an MDP: if our observation gives us complete information about our next state, then our belief state $q_{t+1}$ will be a standard basis vector. This happens precisely when each observation can be emitted by at most one state, i.e., when each row of $D$ has at most one nonzero entry.
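To make the update concrete, here is a minimal numpy sketch of one belief update on a hypothetical two-state, two-observation POMDP; all numbers are illustrative, not from the paper.

```python
import numpy as np

# Belief update q_{t+1} = T^{ao} q_t / (u^T T^{ao} q_t), where
# T^{ao} = diag(D_{o,.}) T^a.

def belief_update(q, T_a, D, o):
    """One Bayes-rule belief update after an action (transition T_a) and observation o."""
    T_ao = np.diag(D[o]) @ T_a        # fold the observation likelihood into the transition
    unnormalized = T_ao @ q
    return unnormalized / unnormalized.sum()  # dividing by u^T T^{ao} q_t is just a sum

# A tiny two-state example: the action mildly mixes states, observations are 80% accurate.
T_a = np.array([[0.9, 0.1],
                [0.1, 0.9]])          # column j: distribution over next states from state j
D   = np.array([[0.8, 0.2],
                [0.2, 0.8]])          # D[o, i] = P(o | next state i)

q = np.array([0.5, 0.5])              # uniform prior belief
q = belief_update(q, T_a, D, o=0)     # observation 0 is evidence for state 0
```

After seeing observation 0, the belief shifts toward state 0, as Bayes rule dictates.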

### 1.3 PSRs

A PSR further generalizes a POMDP: we can think of a PSR as dropping the interpretation of $q_t$ as a belief state, and keeping only the mathematical form of the state update. That is, we no longer require our model parameters to have any interpretation in terms of probabilities of partially observable states; we only require them to produce valid observation probability estimates. (It is possible to interpret PSR states and parameters in terms of experiments called tests; for completeness we describe this interpretation in the supplementary material, available online.)

In more detail, we are given a starting state vector $q_1$, matrices $T^{ao}$ for each action $a$ and observation $o$, and a normalization vector $u$. We define our state vector by the recursion

$$q_{t+1} = T^{a_t o_t} q_t \,/\, u^\top T^{a_t o_t} q_t$$

and our observation probabilities as

$$P(o_t = o \mid q_t, \text{do } a) = u^\top T^{ao} q_t$$

The only requirement on the parameters is that the observation probabilities should always be nonnegative and sum to 1: under any sequence of actions and observations, if $q_1, q_2, \ldots$ is the resulting sequence of states,

$$(\forall a, o, t)\;\; u^\top T^{ao} q_t \ge 0 \qquad\qquad (\forall a, t)\;\; \sum_o u^\top T^{ao} q_t = 1$$

It is clear that a PSR generalizes a POMDP, and therefore also an MDP: we can always take $u$ to be the vector of all $1$s, and set $T^{ao}$ according to the POMDP transition and observation probabilities, so that

$$[T^{ao}]_{ij} = P(o_t = o,\, s_{t+1} = i \mid s_t = j,\, a_t = a)$$

It turns out that PSRs are a strict generalization of POMDPs: there exist PSRs whose dynamical systems cannot be described by any finite POMDP. An example is the so-called probability clock (Jaeger, 2000).

### 1.4 Policy trees

We will need to work with policies for MDPs, POMDPs, and PSRs, handling different horizons as well as partial observability. For this reason, we will use a general policy representation: we will view a policy as a mixture of trees, with each tree representing a deterministic, nonstationary policy. A policy tree’s nodes are labeled with actions, and its edges are labeled with observations (Fig. 1). To execute a policy tree $\pi$, we execute $\pi$’s root action; then, based on the resulting observation $o$, we follow the edge labeled $o$ from the root, leading to a subtree that we will call $\pi(o)$. To execute a mixture, we randomize over its elements. If desired we can randomize lazily, committing to each decision just before it affects our actions. We will work with finite, balanced trees, with depth equal to a horizon $H$; we can reason about infinite-horizon policies by taking a limit as $H \to \infty$.
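As an illustration, a policy tree can be represented as a nested structure, with one subtree per observation. The sketch below is our own; the helper names and the stub environment are hypothetical, not from the paper.

```python
# A depth-H policy tree: an action at the root, and one subtree per observation.

def make_tree(action, children=None):
    return {"action": action, "children": children or {}}

def execute(tree, env_step, horizon):
    """Run a policy tree: take the root action, branch on the observation, recurse."""
    history = []
    for _ in range(horizon):
        a = tree["action"]
        o = env_step(a)                    # the environment returns an observation
        history.append((a, o))
        if o not in tree["children"]:
            break
        tree = tree["children"][o]         # follow the edge labeled o

    return history

# A horizon-2 tree over actions {"L", "R"} and observations {0, 1}:
# act L, then act L if we see 0 and R if we see 1.
pi = make_tree("L", {0: make_tree("L"), 1: make_tree("R")})
history = execute(pi, env_step=lambda a: 1, horizon=2)  # stub env that always emits o=1
```

A stochastic policy would be a mixture of such trees, sampled lazily as the text describes.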

## 2 Imitation by feature matching

Successor feature sets have many uses, but we will start by motivating them with the goal of imitation. Often we are given demonstrations of some desired behavior in a dynamical system, and we would like to imitate that behavior. There are lots of ways to specify this problem, but one reasonable one is apprenticeship learning (Abbeel and Ng, 2004) or feature matching. In this method, we define features of states and actions, and ask our learner to match some statistics of the observed features of our demonstrations.

In more detail, given an MDP, define a vector of features of the current state and action, $f(s, a)$; we call this the one-step or immediate feature vector. We can calculate the observed discounted features of a demonstration: if we visit states $s_1, s_2, s_3, \ldots$ and actions $a_1, a_2, a_3, \ldots$, then the empirical discounted feature vector is

$$f(s_1, a_1) + \gamma f(s_2, a_2) + \gamma^2 f(s_3, a_3) + \ldots$$

where $\gamma \in [0, 1)$ is our discount factor. We can average the feature vectors for all of our demonstrations to get a demonstration or target feature vector $\phi^d$.

Analogously, for a policy $\pi$, we can define the expected discounted feature vector:

$$\phi^\pi = E_\pi\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} f(s_t, a_t)\right]$$

We can use a finite horizon $H$ by replacing $\infty$ with $H$ in the definitions of $\phi^d$ and $\phi^\pi$; in this case we have the option of setting $\gamma = 1$.

Given a target feature vector $\phi^d$ in any of these models, we can ask our learner to design a policy that matches the target feature vector in expectation. That is, we ask the learner to find a policy $\pi$ with

$$\phi^\pi = \phi^d$$

For example, suppose our world is a simple maze MDP like Fig. 1(a), and suppose that our one-step feature vector is the RGB color of the current state in this figure. If our demonstrations spend most of their time toward the left-hand side of the state space, then in our target vector $\phi^d$ the green feature will have the highest expected discounted value. On the other hand, if our demonstrations spend most of their time toward the bottom-right corner, the blue feature will be highest.
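The empirical discounted feature vector above is straightforward to compute from a demonstration. A small sketch, with a made-up three-state RGB-style feature table (the numbers are purely illustrative):

```python
import numpy as np

# Empirical discounted feature vector of one demonstration:
#   f(s1,a1) + gamma f(s2,a2) + gamma^2 f(s3,a3) + ...

def discounted_features(f, trajectory, gamma):
    return sum(gamma**t * f[s, a] for t, (s, a) in enumerate(trajectory))

f = np.zeros((3, 2, 3))          # f[s, a] is an RGB-like feature vector
f[0, :, :] = [1.0, 0.0, 0.0]     # state 0 is "red" regardless of action
f[1, :, :] = [0.0, 1.0, 0.0]     # state 1 is "green"
f[2, :, :] = [0.0, 0.0, 1.0]     # state 2 is "blue"

demo = [(1, 0), (1, 1), (2, 0)]  # (state, action) pairs from one demonstration
phi_d = discounted_features(f, demo, gamma=0.5)
```

Averaging such vectors over many demonstrations gives the target $\phi^d$.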

## 3 Successor features

To reason about feature matching, it will be important to predict how the features we see in the future depend on our current state. To this end, we define an analog of $\phi^\pi$ where we vary our start state, called the successor feature representation (Dayan, 1993; Barreto et al., 2017):

$$\phi^\pi(s) = E_\pi\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} f(s_t, a_t) \,\middle|\, \text{do } s_1 = s\right]$$

This function associates a vector of expected discounted features to each possible start state. We can think of $\phi^\pi$ as a generalization of a value function: instead of predicting total discounted rewards, it predicts total discounted feature vectors. In fact, the generalization is strict: $\phi^\pi$ contains enough information to compute the value function for any one-step reward function of the form $r(s, a) = r^\top f(s, a)$, via $V^\pi(s) = r^\top \phi^\pi(s)$.

For example, in Fig. 1(b), our policy is to always move left. The corresponding successor feature function looks similar to the immediate feature function, except that colors will be smeared rightward. The smearing will stop at walls, since an agent attempting to move through a wall will stop.
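For a stationary policy in a small known MDP, $\phi^\pi$ can be computed exactly by solving the linear fixed-point equations $\phi^\pi(s) = f(s, \pi(s)) + \gamma \sum_i P(i \mid s, \pi(s))\, \phi^\pi(i)$. A sketch on an illustrative two-state chain (the model below is ours, not from the paper):

```python
import numpy as np

# Exact successor features of a fixed stationary policy pi on a tiny MDP.

k, gamma = 2, 0.9
T = {0: np.array([[1.0, 0.5],        # T[a][i, j] = P(next = i | cur = j, action a)
                  [0.0, 0.5]]),
     1: np.array([[0.5, 0.0],
                  [0.5, 1.0]])}
f = np.array([[1.0, 0.0],            # columns: one-step features of each state
              [0.0, 1.0]])           # (features here don't depend on the action)
pi = [0, 1]                          # pi(s): action taken in state s

# Column s of P is the next-state distribution under pi(s).
P = np.column_stack([T[pi[s]][:, s] for s in range(k)])
F_pi = f                             # columns: f(s, pi(s))

# Fixed point Phi = F_pi + gamma * Phi @ P  =>  Phi = F_pi (I - gamma P)^{-1}.
Phi = F_pi @ np.linalg.inv(np.eye(k) - gamma * P)   # columns are phi^pi(s)
```

With this chain, each state is absorbing under its own policy action, so each column is just the geometric series $f(s)/(1-\gamma)$.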

## 4 Extension to POMDPs and PSRs

We can generalize the above definitions to models with partial observability as well. This is not a typical use of successor features: reasoning about partial observability requires a model, while successor-style representations are often used in model-free RL. However, as Lehnert and Littman (2019) point out, the state of a PSR is already a prediction about the future, so incorporating successor features into these models makes sense.

In a POMDP, we have a belief state $q$ instead of a fully-observed state. We define the immediate features of $q$ to be the expected features of the latent state:

$$f(q, a) = \sum_{s=1}^{k} q(s)\, f(s, a)$$

In a PSR, we similarly allow any feature function that is linear in the predictive state vector $q$:

$$f(q, a) = F^a q$$

with one matrix $F^a$ for each action $a$. In either case, define the successor features to be

$$\phi^\pi(q) = E_\pi\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} f(q_t, a_t) \,\middle|\, \text{do } q_1 = q\right]$$

Interestingly, the function $\phi^\pi(q)$ is linear in $q$. That is, for each $\pi$, there exists a matrix $A^\pi$ such that $\phi^\pi(q) = A^\pi q$. We call $A^\pi$ the successor feature matrix for $\pi$; it is related to the parameters of the Linear Successor Feature Model of Lehnert and Littman (2019).

We can compute $A^\pi$ recursively by working backward in time (upward from the leaves of a policy tree): for a tree $\pi$ with root action $a$, the recursion is

$$A^\pi = F^a + \gamma \sum_o A^{\pi(o)}\, T^{ao}$$

This recursion works by splitting $\phi^\pi$ into contributions from the first step (the term $F^a$) and from later steps (the rest of the RHS). We give a more detailed derivation, as well as a proof of linearity, in the supplementary material online. All the above works for MDPs as well, by treating the observation as the fully-revealed next state, which lets us keep a uniform notation across MDPs, POMDPs, and PSRs.
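The recursion can be implemented directly over a tree data structure. The sketch below uses an MDP viewed as a POMDP whose observation is the next state, so $T^{ao}$ keeps only row $o$ of $T^a$; the two-state environment and the tree encoding are our own illustrations.

```python
import numpy as np

# Successor feature matrix of a policy tree via A^pi = F^a + gamma sum_o A^{pi(o)} T^{ao}.

k, gamma = 2, 0.9
T = {0: np.array([[1.0, 1.0], [0.0, 0.0]]),   # action 0: always move to state 0
     1: np.array([[0.0, 0.0], [1.0, 1.0]])}   # action 1: always move to state 1
F = {0: np.eye(k), 1: np.eye(k)}              # F^a q = expected one-step features

def T_ao(a, o):
    m = np.zeros((k, k)); m[o] = T[a][o]      # diag(e_o) T^a: keep row o only
    return m

def A(tree):
    """tree = (action, {obs: subtree}); leaves have no children (horizon exhausted)."""
    a, children = tree
    out = F[a].copy()
    for o, sub in children.items():
        out += gamma * A(sub) @ T_ao(a, o)
    return out

# Horizon-2 tree: take action 0, then take action 0 again whatever we observe.
pi = (0, {0: (0, {}), 1: (0, {})})
A_pi = A(pi)
```

Starting from state 0, this policy collects state-0 features on both steps, so $A^\pi e_0 = (1 + \gamma)\, e_0$.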

It is worth noting the multiple feature representations that contribute to the function $\phi^\pi$. First are the immediate features $f$. Second is the PSR state $q$, which can often be thought of as a feature representation for an underlying “uncompressed” model (Hefny, Downey, and Gordon, 2015). Finally, both of the above feature representations help define the exact value of $\phi^\pi$; we can also approximate $\phi^\pi$ using a third feature representation. Any of these feature representations could be related, or we could use separate features for all three purposes. We believe that an exploration of the roles of these different representations would be important and interesting, but we leave it for future work.

## 5 Successor feature sets

To reason about multiple policies, we can collect together multiple $A^\pi$ matrices: the successor feature set at horizon $H$ is defined as the set of all possible successor feature matrices at horizon $H$,

$$\Phi^{(H)} = \{A^\pi \mid \pi \text{ a policy with horizon } H\}$$

As we will detail below, we can also define an infinite-horizon successor feature set $\Phi$, which is the limit of $\Phi^{(H)}$ as $H \to \infty$.

The successor feature set tells us how the future depends on our state and our choice of policy. It tells us the range of outcomes that are possible: for a state $q$, each point in $\Phi q$ tells us about one policy, and gives us moments of the distribution of future states under that policy. The extreme points of $\Phi q$ therefore tell us the limits of what we can achieve. (Here we use the shorthand of broadcasting: set arguments mean that we perform an operation all possible ways, substituting one element from each set. E.g., if $\Phi, \Psi$ are sets, $\Phi + \Psi$ means the Minkowski sum $\{\phi + \psi \mid \phi \in \Phi,\ \psi \in \Psi\}$.)

Note that $\Phi^{(H)}$ is a convex, compact set: by linearity of expectation, the feature matrix for a stochastic policy will be a convex combination of the matrices for its component deterministic policies. Therefore, $\Phi^{(H)}$ will be the convex hull of a finite set of matrices, one for each possible deterministic policy at horizon $H$.

Working with multiple policies at once provides a number of benefits: perhaps most importantly, it lets us define a Bellman backup that builds new policies combinatorially by combining existing policies at each iteration (Sec. 7). That way, we can reason about all possible policies instead of just a fixed list. Another benefit of $\Phi$ is that, as we will see below, it can help us compute optimal policies and feature-matching policies efficiently. On the other hand, because it contains so much information, the set $\Phi$ is a complicated object; it can easily become impractical to work with. We return to this problem in Sec. 9.

## 6 Special cases

In some useful special cases, successor feature matrices and successor feature sets have a simpler structure that can make them easier to reason about and work with. E.g., in an MDP, we can split the successor feature matrix $A^\pi$ into its columns, resulting in one vector per state — this is the ordinary successor feature vector $\phi^\pi(s)$. Similarly, we can split $\Phi$ into sets of successor feature vectors, one at each state, representing the range of achievable futures:

$$\phi(s) = \{\phi^\pi(s) \mid \pi \text{ a policy}\} = \Phi e_s$$

Fig. 3 visualizes these projections, along with the Bellman backups described below. Each projection tells us the discounted total feature vectors that are achievable from the corresponding state. For example, the top-left plot shows a set with five corners, each corresponding to a policy that is optimal in this state under a different reward function; the bottom-left corner corresponds to “always go down,” the policy that is optimal under one such reward function.

On the other hand, if we only have a single one-step feature, then we can only represent a 1d family of reward functions: all positive multiples of the feature are equivalent to one another, as are all negative multiples. In this case, our recursion effectively reduces to classic POMDP or PSR value iteration: each element of $\Phi$ is now a vector $\alpha^\pi$ instead of a matrix $A^\pi$. This $\alpha$-vector represents the (linear) value function of policy $\pi$; the pointwise maximum of all these functions is the (piecewise linear and convex) optimal value function of the POMDP or PSR.

## 7 Bellman equations

Each element of the successor feature set is a successor feature matrix for some policy, and as such, it satisfies the recursion given above. For efficiency, though, we would like to avoid running Bellman backups separately for too many possible policies. To this end, we can write a backup operator and Bellman equations that apply to all policies at once, and hence describe the entire successor feature set.

The joint backup works by relating horizon-$H$ policies to horizon-$(H{-}1)$ policies. Every horizon-$H$ policy tree can be constructed recursively, by choosing an action to perform at the root node and a horizon-$(H{-}1)$ tree to execute after each possible observation. So, we can break down any horizon-$H$ policy (including stochastic ones) into a distribution over the initial action, followed by conditional distributions over horizon-$(H{-}1)$ policy trees for each possible initial observation.

Therefore, if we have the successor feature set at horizon $H-1$, we can construct the successor feature set at horizon $H$ in two steps: first, for each possible initial action $a$, we construct

$$\Phi^{(H)}_a = F^a + \gamma \sum_o \Phi^{(H-1)}\, T^{ao}$$

This set tells us the successor feature matrices for all horizon-$H$ policies that begin with action $a$. Note that only the first action is deterministic: the sum over observations lets us assign any conditional distribution over horizon-$(H{-}1)$ policy trees after each possible observation.

Second, since a general horizon-$H$ policy is a distribution over horizon-$H$ policies that start with different actions, each element of $\Phi^{(H)}$ is a convex combination of elements of $\Phi^{(H)}_a$ for different values of $a$. That is,

$$\Phi^{(H)} = \mathrm{conv} \bigcup_a \Phi^{(H)}_a$$

The recursion bottoms out at horizon $0$, where we have

$$\Phi^{(0)} = \{0\}$$

since the discounted sum of a length-$0$ trajectory is always the zero vector.

Fig. 3 shows a simple example of the Bellman backup. Since this is an MDP, $\Phi$ is determined by its projections onto the individual states. The action “up” takes us from the bottom-left state to the middle-left state. So, we construct the projection of $\Phi_{\text{up}}$ at the bottom-left state by shifting and scaling the projection of $\Phi$ at the middle-left state (red sets). The full projected set is the convex hull of four such sets, one per action; the other three are not shown, but each is similarly a shifted and scaled copy of the projection at the state its action leads to, such as the set from the bottom-center plot.

The update from $\Phi^{(H-1)}$ to $\Phi^{(H)}$ is a contraction: see the supplementary material online for a proof. So, as $H \to \infty$, $\Phi^{(H)}$ will approach a limit $\Phi$; this set represents the achievable successor feature matrices in the infinite-horizon discounted setting. $\Phi$ is a fixed point of the Bellman backup, and therefore satisfies the stationary Bellman equations

$$\Phi = \mathrm{conv} \bigcup_a \left[ F^a + \gamma \sum_o \Phi\, T^{ao} \right]$$
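On a tiny example, the backup can be run exactly by representing each set by a finite list of candidate extreme matrices, one per deterministic policy tree, leaving the convex hull implicit. Everything below (the two-state MDP-as-POMDP, with observation equal to the next state) is an illustration of ours, not the paper's implementation.

```python
import numpy as np
from itertools import product

# One exact Bellman backup of the successor feature set:
#   Phi_H = conv  U_a  [ F^a + gamma * sum_o Phi_{H-1} T^{ao} ]

k, gamma = 2, 0.9
T = {0: np.array([[1.0, 1.0], [0.0, 0.0]]),   # action 0: always move to state 0
     1: np.array([[0.0, 0.0], [1.0, 1.0]])}   # action 1: always move to state 1
F = {0: np.eye(k), 1: np.eye(k)}

def T_ao(a, o):
    m = np.zeros((k, k)); m[o] = T[a][o]      # observation o = next state
    return m

def backup(Phi_prev):
    """All candidate extreme points of the backed-up set (convex hull left implicit)."""
    out = []
    for a in T:
        # choose one matrix from Phi_prev independently for each observation o
        for choice in product(Phi_prev, repeat=k):
            out.append(F[a] + gamma * sum(choice[o] @ T_ao(a, o) for o in range(k)))
    return out

Phi = [np.zeros((k, k))]          # Phi^(0) = {0}
for _ in range(2):                # two backups: all horizon-2 deterministic trees
    Phi = backup(Phi)
```

The candidate count grows combinatorially with the horizon, which is exactly why the pruned representation of Sec. 9 is needed.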

## 8 Feature matching and optimal planning

Once we have computed the successor feature set, we can return to the feature matching task described in Section 2. Knowing $\Phi$ makes feature matching much easier: for any target vector of discounted feature expectations $\phi^d$, we can efficiently either compute a policy that matches $\phi^d$ or verify that matching is impossible. We detail an algorithm for doing so in Alg. 1; more detail is in the supplementary material online.

Fig. 3 shows the first steps of our feature-matching policy in a simple MDP. At the bottom-left state, the two arrows show the initial target feature vector (root of the arrows) and the computed policy (randomize between “up” and “right” according to the size of the arrows). The target feature vector at the next step depends on the outcome of randomization: each destination state shows the corresponding target and the second step of the computed policy.

We can also use the successor feature set to make optimal planning easier. In particular, if we are given a new reward function expressed in terms of our features, say $r(s, a) = r^\top f(s, a)$ for some coefficient vector $r$, then we can efficiently compute the optimal value function under $r$:

$$V^*(q) = \max_\pi r^\top A^\pi q = \max\{\, r^\top \psi\, q \mid \psi \in \Phi \,\}$$

As a by-product we get an optimal policy: there will always be a matrix $\psi$ that achieves the above maximum and satisfies $\psi \in \Phi_a$ for some action $a$. Any such $a$ is an optimal action.
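Given a finite extreme-point representation of $\Phi$ with each matrix tagged by its initial action, reading off $V^*(q)$ and a greedy action is a maximization over finitely many matrices. A sketch; the matrices, reward, and tags below are made up for illustration.

```python
import numpy as np

# V*(q) = max { r^T psi q : psi in Phi }, with a greedy action read off the maximizer.

r = np.array([1.0, -1.0])                    # reward coefficients on the features
q = np.array([0.7, 0.3])                     # current (belief) state

# (initial action, successor feature matrix) pairs standing in for the sets Phi_a:
Phi = [(0, np.array([[1.9, 0.9], [0.0, 1.0]])),
       (1, np.array([[1.0, 0.0], [0.9, 1.9]]))]

values = [r @ psi @ q for _, psi in Phi]
V_star = max(values)
a_star = Phi[int(np.argmax(values))][0]      # the tag of a maximizing psi is optimal
```

Evaluating a new reward function thus costs one pass over the stored matrices, with no further dynamic programming.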

## 9 Implementation

An exact representation of $\Phi^{(H)}$ can grow faster than exponentially with the horizon. So, in our experiments below, we work with a straightforward approximate representation. We use two tools: first, we store the sets $\Phi_{ao} = \Phi\, T^{ao}$ for all $a, o$ instead of storing $\Phi$ itself, since the former sets tend to be effectively lower-dimensional due to sparsity. Second, analogous to PBVI (Pineau, Gordon, and Thrun, 2003a; Shani, Pineau, and Kaplow, 2013), we fix a set of directions $m_1, \ldots, m_N$, and retain only the most extreme point of each set in each direction. Our approximate backed-up set is then the convex hull of these retained points. Just as in PBVI, we can efficiently compute backups by passing the max through the Minkowski sum in the Bellman equation. That is, for each $a, o$ and each direction $m_i$, we solve

$$\operatorname{argmax}\ \langle m_i, \phi \rangle \quad \text{for}\ \ \phi \in \bigcup_{a'} \Big[ F^{a'} + \gamma \sum_{o'} \Phi_{a'o'} \Big] T^{ao}$$

by solving, for each $a'$ and $o'$,

$$\operatorname{argmax}\ \langle m_i, \phi \rangle \quad \text{for}\ \ \phi \in \Phi_{a'o'}\, T^{ao}$$

and combining the solutions.
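The key computational trick is that, because $\langle m, \cdot \rangle$ is linear, the maximum over the Minkowski sum decomposes into independent per-observation maxima. The sketch below shows just that decomposition (for brevity we omit the final multiplication by $T^{ao}$; all matrices are made up):

```python
import numpy as np

# PBVI-style pruning: keep only the most extreme backed-up point in each direction.

n_obs, gamma = 2, 0.9

def extreme(m, candidates):
    """Most extreme matrix of a finite set in direction m (Frobenius inner product)."""
    return max(candidates, key=lambda X: np.sum(m * X))

def pruned_backup(m, F, Phi_sets):
    """argmax of <m, phi> over U_a [F^a + gamma sum_o Phi_ao], without enumerating sums."""
    best = None
    for a in F:
        # linearity of <m, .>: maximize over each Phi_ao separately, then add
        cand = F[a] + gamma * sum(extreme(m, Phi_sets[a, o]) for o in range(n_obs))
        if best is None or np.sum(m * cand) > np.sum(m * best):
            best = cand
    return best

F = {0: np.eye(2), 1: -np.eye(2)}
Phi_sets = {(a, o): [np.zeros((2, 2)), np.eye(2)] for a in F for o in range(n_obs)}

m = np.eye(2)                     # direction: maximize the trace
phi_star = pruned_backup(m, F, Phi_sets)
```

Repeating this for each stored direction $m_i$ yields the retained extreme points whose convex hull approximates the backed-up set.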

There are a couple of useful variants of this implementation that we can use in stoppable problems (i.e., problems where we have an emergency-stop or a safety policy; see the supplemental material for more detail). First, we can update monotonically, i.e., keep the better of the horizon-$H$ or horizon-$(H{-}1)$ successor feature matrices in each direction. Second, we can update incrementally: we can update any subset of our directions while leaving the others fixed.

## 10 More on special cases

With the above pruning strategy, our dynamic programming iteration generalizes PBVI (Pineau, Gordon, and Thrun, 2003b). PBVI was defined originally for POMDPs, but it extends readily to PSRs as well: we just sample predictive states instead of belief states. To relate PBVI to our method, we look at a single task, with reward coefficient vector $r$. We sample a set of belief states or predictive states $q_1, \ldots, q_N$; these are the directions that PBVI will use to decide which value functions ($\alpha$-vectors) to retain. Based on these, we set the successor feature matrix directions to be $m_i = r\, q_i^\top$ for all $i$.

Now, when we search within our backed-up set for the maximal element in direction $m_i$, we get some successor feature matrix $\psi$. Because $\langle m_i, \psi \rangle = r^\top \psi\, q_i$ is maximal, we know that the vector $r^\top \psi$ is as far as possible in the direction $q_i$. But $r^\top \psi$ is a backed-up value function under the reward $r$; so, $r^\top \psi$ is exactly the value function that PBVI would retain when maximizing in the direction $q_i$.

## 11 Experiments: dynamic programming

We tried our dynamic programming method on several small domains: the classic mountain-car domain and a random gridworld with full and partial observability. We evaluated both planning and feature matching; results for the former are discussed in this section, and an example of the latter is in Fig. 3. We give further details of our experimental setup in the supplementary material online. At a high level, our experiments show that the algorithms behave as expected, and that they are practical for small domains. They also tell us about limits on scaling: the tightest of these limits is our ability to represent $\Phi$ accurately, governed by the number of boundary points that we retain for each set.

In mountain-car, the agent has two actions: accelerate left and accelerate right. The state is (position, velocity). We discretize the state space to a regular mesh with piecewise-constant approximation. Our one-step features are radial basis functions of the state; we use 9 RBF centers evenly spaced on a grid. In the MDP gridworld, the agent has four deterministic actions: up, down, left, and right. The one-step features are the agent's coordinates, suitably rescaled, similar to Fig. 3. In the POMDP gridworld, the actions are stochastic, and the agent only sees a noisy indicator of state. In all domains, we use the same fixed discount $\gamma$.

Fig. 4 shows how Bellman error evolves across iterations of dynamic programming. Since $\Phi$ is a set, we evaluate error by looking at random projections: how far do $\Phi$ and its backup extend in a given direction? We evaluate directions that we optimized for during backups, as well as new random directions.

Note that the asymptotes for the new-directions lines are above zero; this persistent error is due to our limited-size representation of $\Phi$. The error decreases as we increase the number of boundary points that we store. It is larger in the domains with more features and more uncertainty (center and right panels), due to the higher-dimensional matrices and the need to sample mixed (uncertain) belief states.

## 12 Related work

Successor features, a version of which was first introduced by Dayan (1993), provide a middle ground between model-free and model-based RL (Russek et al., 2017). They have been proposed as neurally plausible explanations of learning (Gershman et al., 2012; Gershman, 2018; Momennejad et al., 2017; Stachenfeld, Botvinick, and Gershman, 2017; Gardner, Schoenbaum, and Gershman, 2018; Vértes and Sahani, 2019).

Recently, numerous extensions have been proposed. Most similar to the current work are methods that generalize to a set of policies or tasks. Barreto et al. (2017) achieve transfer learning by generalizing across tasks with successor features; Barreto et al. (2018) use generalized policy improvement (GPI) over a set of policies. A few methods (Borsa et al., 2018; Ma, Wen, and Bengio, 2018) recently combined universal value function approximators (Schaul et al., 2015) with GPI to perform multi-task learning, generalizing to a set of goals by conditioning on a goal representation. Barreto et al. (2020) extend policy improvement and policy evaluation from single tasks and policies to a list of them, but do not attempt to back up across policies.

Many authors have trained nonlinear models such as neural networks to predict successor-style representations, e.g., Kulkarni et al. (2016); Zhu et al. (2017); Zhang et al. (2017); Machado et al. (2017); Hansen et al. (2019). These works are complementary to our goal here, which is to design and analyze new, more general successor-style representations. We hope our generalizations eventually inform training methods for large-scale nonlinear models.

At the intersection of successor features and imitation learning, Zhu et al. (2017) address visual semantic planning; Lee, Srinivasan, and Doshi-Velez (2019) address off-policy model-free RL in a batch setting; and Hsu (2019) addresses active imitation learning.

As mentioned above, the individual elements of $\Phi$ are related to the work of Lehnert and Littman (2019). And, we rely on point-based methods (Pineau, Gordon, and Thrun, 2003a; Shani, Pineau, and Kaplow, 2013) to compute $\Phi$ approximately.

## 13 Conclusion

This work introduces successor feature sets, a new representation that generalizes successor features. Successor feature sets represent and reason about successor feature predictions for all policies at once, and respect the compositional structure of policies, in contrast to other approaches that treat each policy separately. A successor feature set represents the boundaries of what is achievable in the future, and how these boundaries depend on our initial state. This information lets us efficiently read off optimal policies or imitate a demonstrated behavior.

We give algorithms for working with successor feature sets, including a dynamic programming algorithm to compute them, as well as algorithms to read off policies from them. The dynamic programming update is a contraction mapping, and therefore convergent. We give both exact and approximate versions of the update. The exact version can be intractable, due to the so-called “curse of dimensionality” and “curse of history.” The approximate version mitigates these curses using point-based sampling.

Finally, we present computational experiments. These are limited to relatively small, known environments; but in these environments, we demonstrate that we can compute successor feature sets accurately, and that they aid generalization. We also explore how our approximations scale with environment complexity.

Overall we believe that our new representation can provide insight on how to reason about policies in a dynamical system. We know, though, that we have only scratched the surface of possible strategies for working with this representation, and we hope that our analysis can inform future work on larger-scale environments.

## Appendix A Feature matching

In this section we give the algorithm for imitation by feature matching, summarized as Alg. 1.

Our policy will be nonstationary: that is, its actions will depend on an internal policy state (defined below) as well as the environment’s current predictive state $q_t$.

Our algorithm updates its target feature vector over time in order to compensate for random outcomes (the action sampled from the policy and the next state sampled from the transition distribution). We write $\phi_t$ for the target at time step $t$, and initialize $\phi_1 = \phi^d$. Updates of this sort are necessary: we might by chance visit a state where it is impossible to achieve the original target $\phi^d$, but that does not mean that our policy has failed. Instead, the policy guarantees always to pick a target $\phi_t$ that is achievable given the state at step $t$, in a way that guarantees that on average we achieve the original target $\phi^d$ from the initial state $q_1$.

To guarantee that the target is always achievable, our policy maintains the invariant that $\phi_t \in \Phi q_t$. By the definition of $\Phi$, the discounted feature vectors in $\Phi q_t$ are exactly the ones that are achievable starting from state $q_t$, so this invariant is necessary and sufficient to ensure that $\phi_t$ is achievable. At the first time step, we test whether $\phi_1 \in \Phi q_1$. If yes, our invariant is satisfied and we can proceed; if no, then we know that we have been given an impossible task. In the latter case we could raise an error, or we could raise a warning and look for the closest achievable vector to $\phi_1$.

Our actions at step $t$ and our targets at step $t+1$ will be functions of the current environment state $q_t$ and our current target $\phi_t$. As such, $\phi_t$ is the internal policy state mentioned above.

We pick our actions and targets as follows. According to the Bellman equations, the successor feature set $\Phi$ is equal to the convex hull of the union of $\Phi_a$ over all actions $a$. Each matrix in $\Phi$ can therefore be written as a convex combination of action-specific matrices, each one chosen from one of the sets $\Phi_a$. That means that each vector in $\Phi q_t$ can be written as a convex combination of vectors in the sets $\Phi_a q_t$.

Write our target in this way, say $\phi_t = \sum_i w_i \phi^i_t$, by choosing actions $a_i$, vectors $\phi^i_t \in \Phi_{a_i} q_t$, and weights $w_i \ge 0$ with $\sum_i w_i = 1$. Then, at the current time step, our algorithm chooses an index $i$ according to the probabilities $w_i$, and executes the corresponding action $a_i$.

Now let be the chosen index, and write for the chosen action. Again according to the Bellman equations, the point is of the form . In particular, we can choose vectors for each such that

 ϕ_{i_t} = F_a q_t + γ ∑_o ϕ_{ot}

Writing p_{ot} = u^⊤ T_{ao} q_t for all o, we can multiply and divide by p_{ot} within the sum, and conclude

 ϕ_{i_t} = E_o[ F_a q_t + γ ϕ_{ot}/p_{ot} ]

That is, we can select our target for time step t+1 as ϕ_{t+1} = ϕ_{o_t t}/p_{o_t t}, where o_t is our next observation. To see why, note that our expected discounted feature vector at time t will remain the same: the LHS (the current target) is equal to the RHS (the expected one-step contribution plus discounted future target). And, note that the target at the next time step will always be feasible, maintaining our invariant: our state at the next time step will be

 q_{t+1} = T_{a o_t} q_t / p_{o_t t}

and we have selected each ϕ_{ot} to satisfy ϕ_{ot} ∈ Φ T_{ao} q_t, so

 ϕ_{o_t t}/p_{o_t t} ∈ Φ T_{a o_t} q_t / p_{o_t t} = Φ q_{t+1}

So, based on the observation that we receive, we can update our predictive state and target feature vector according to the equations above, and recurse. (In practice, numerical errors or incomplete convergence of Φ could lead to an infeasible target; in this case we can project the target back onto the feasible set, which will result in some error in feature matching.)

Note that there may be more than one way to decompose ϕ_t, or more than one way to decompose ϕ_{i_t}. If so, we can choose any valid decomposition arbitrarily.
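The state and target updates above can be exercised numerically. Below is a minimal sketch, assuming a small hypothetical POMDP written in PSR form (T_{ao} = diag(O[:, o]) P_a with column-stochastic P_a, and normalization vector u = 1) and assumed per-observation targets standing in for the ϕ_{ot}; it checks that the expected one-step contribution plus discounted rescaled target recovers the current target, and that the rescaled state is again a valid state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical POMDP in PSR form: T_ao = diag(O[:, o]) @ P_a with P_a
# column-stochastic, and u = 1 as the normalization vector, so that
# sum_o u^T T_ao q = 1 for any belief q on the simplex.
n, n_obs = 3, 2
P_a = rng.random((n, n)); P_a /= P_a.sum(axis=0)
O = rng.random((n, n_obs)); O /= O.sum(axis=1, keepdims=True)
T = [np.diag(O[:, o]) @ P_a for o in range(n_obs)]
u = np.ones(n)

gamma = 0.9
F_a = rng.random((2, n))                  # assumed one-step feature matrix

def policy_step(q, phi_obs):
    """One step of the target-updating policy: sample o with probability
    p_ot = u^T T_ao q, then rescale the state and the per-observation
    target (phi_obs[o] plays the role of phi_ot in the text)."""
    p = np.array([u @ T[o] @ q for o in range(n_obs)])
    o = rng.choice(n_obs, p=p)
    return o, p, T[o] @ q / p[o], phi_obs[o] / p[o]

q = np.full(n, 1.0 / n)
phi_obs = [rng.random(2) for _ in range(n_obs)]
phi_t = F_a @ q + gamma * sum(phi_obs)    # current target, in Bellman form

o_t, p, q_next, phi_next = policy_step(q, phi_obs)

# The expectation over observations of (one-step contribution plus
# discounted future target) recovers the current target exactly:
expected = sum(p[o] * (F_a @ q + gamma * phi_obs[o] / p[o])
               for o in range(n_obs))
assert np.allclose(expected, phi_t)
assert np.isclose(u @ q_next, 1.0)        # q_{t+1} is again a valid state
```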

## Appendix B Convergence of dynamic programming

We will show that the dynamic programming update for Φ given in Section 7 is a contraction, which implies bounds on the convergence rate of dynamic programming. We will need a few definitions and facts about norms and metrics.

### b.1 Norms

Given any symmetric, convex, compact set S with nonempty interior, we can construct a norm ∥·∥_S by treating S as the unit ball. The norm of a vector x is then the smallest multiple of S that contains x:

 ∥x∥_S = inf{ c ≥ 0 ∣ x ∈ cS }

This is a fully general way to define a norm: any norm can be constructed this way by using its own unit ball as S. That is, if B = { x ∣ ∥x∥ ≤ 1 }, then

 ∥x∥ = ∥x∥_B

We will use the shorthand ∥·∥_p for an ℓ_p-norm: e.g., ∥·∥_1, ∥·∥_2 (Euclidean norm), or ∥·∥_∞ (sup norm). If we start from an asymmetric set S, we can symmetrize it to get

 S̄ = { αs + (1−α)s′ ∣ s ∈ S, s′ ∈ −S, α ∈ [0,1] }

(This is the convex hull of S ∪ −S.) Given any norm ∥·∥, we can construct a dual norm ∥·∥_*:

 ∥y∥_* = sup_{∥x∥≤1} x·y

This definition guarantees that dual norms satisfy Hölder’s inequality:

 x·y ≤ ∥x∥ ∥y∥_*

We will write S^* for the unit ball of the dual norm of ∥·∥_S. Taking the dual twice returns the original norm: ∥·∥_{**} = ∥·∥ and S^{**} = S.
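As a quick sanity check of these definitions, the sketch below computes the dual of the ℓ1 norm by maximizing x·y over the vertices of the ℓ1 unit ball (a linear objective attains its supremum at an extreme point), recovering the sup norm, and spot-checks Hölder's inequality. The helper name `dual_norm_l1` is ours, for illustration only.

```python
import numpy as np

# The l1 unit ball is the cross-polytope with vertices ±e_i; maximizing
# the linear objective x.y over these vertices gives the dual norm,
# which should equal max_i |y_i| (the sup norm).
d = 4
vertices = np.vstack([np.eye(d), -np.eye(d)])

def dual_norm_l1(y):
    return np.max(vertices @ y)

rng = np.random.default_rng(1)
y = rng.standard_normal(d)
assert np.isclose(dual_norm_l1(y), np.linalg.norm(y, np.inf))

# Hoelder's inequality x.y <= ||x||_1 ||y||_inf for a random pair:
x = rng.standard_normal(d)
assert x @ y <= np.linalg.norm(x, 1) * np.linalg.norm(y, np.inf) + 1e-12
```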

Given any two norms ∥·∥_P and ∥·∥_Q with corresponding unit balls P and Q, the operator norm of a matrix A is

 ∥A∥_{P,Q} = sup_{x∈P} ∥Ax∥_Q = sup_{x∈P, y∈Q^*} y^⊤Ax

This definition ensures that Hölder’s inequality extends to operator norms:

 ∥Ax∥_Q ≤ ∥A∥_{P,Q} ∥x∥_P

The norm of the transpose of a matrix can be expressed in terms of the duals of P and Q:

 ∥A^⊤∥_{Q^*,P^*} = ∥A∥_{P,Q}

If P and Q are the same, we will shorten ∥A∥_{P,P} to ∥A∥_P:

 ∥A^⊤∥_{P^*} = ∥A∥_P

Given a norm, we can define the Hausdorff metric between sets:

 d(X, Y) = max( d̄(X, Y), d̄(Y, X) )
 d̄(X, Y) = sup_{x∈X} inf_{y∈Y} ∥x − y∥

If V is any real vector space (such as ℝ^d), the Hausdorff metric makes the set of non-empty compact subsets of V into a complete metric space. Given a metric, a contraction is a function f that reduces the metric by a constant factor:

 d(f(X), f(Y)) ≤ β d(X, Y)

The factor β is called the modulus. If β = 1 then f is called a nonexpansion. For a linear operator f(x) = Ax, with the metric d(x, y) = ∥x − y∥_P, the modulus is the same as the operator norm ∥A∥_P. The Banach fixed-point theorem guarantees the existence of a unique fixed point of any contraction with modulus β < 1 on a complete metric space.
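These definitions are easy to exercise on finite point sets, where the suprema and infima become maxima and minima. The sketch below implements the Hausdorff metric and checks two facts used later in the proof: scaling both sets by γ < 1 is a contraction of modulus γ, and translating both sets by the same vector leaves the distance unchanged.

```python
import numpy as np

def directed(X, Y):
    """Directed Hausdorff distance max_x min_y ||x - y|| for finite
    point sets stored as rows of X and Y."""
    diffs = X[:, None, :] - Y[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1).max()

def hausdorff(X, Y):
    return max(directed(X, Y), directed(Y, X))

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))
Y = rng.standard_normal((7, 3))

# Scaling both sets by gamma is a contraction of modulus gamma:
gamma = 0.9
assert np.isclose(hausdorff(gamma * X, gamma * Y), gamma * hausdorff(X, Y))

# Translating both sets by the same vector is an isometry:
b = rng.standard_normal(3)
assert np.isclose(hausdorff(X + b, Y + b), hausdorff(X, Y))
```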

### b.2 Norms for POMDPs and PSRs

We can bound the transition operators for POMDPs and PSRs using operator norms that correspond to the set of valid states. In POMDPs, valid belief states are probability distributions, and therefore satisfy ∥q∥_1 ≤ 1. For PSRs, there is no single norm that works for all models. Instead, for each PSR, we only know that there exists a norm such that all valid states are in the unit ball S̄. (We can get S̄ by symmetrizing the PSR’s set of valid states S.) We will write ∥·∥_S̄ in both cases, by taking S to be the probability simplex if our model is a POMDP. Given these definitions, we are guaranteed that, for each action a,

 ∥T_a∥_S̄ ≤ 1, where T_a = ∑_o T_{ao}

We also know that each transition operator T_{ao} maps states to unnormalized states: it maps S to the cone generated by S, i.e., cone(S) = { cs ∣ c ≥ 0, s ∈ S }.
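For a concrete check, the sketch below builds POMDP-style transition operators T_{ao} = diag(O[:, o]) P_a (an assumed construction, for illustration) and verifies the two facts above: T_a maps the simplex to itself, so its operator norm is 1, and each T_{ao} maps states to nonnegative, sub-normalized vectors in cone(S).

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_obs = 4, 3
P_a = rng.random((n, n)); P_a /= P_a.sum(axis=0)          # column-stochastic
O = rng.random((n, n_obs)); O /= O.sum(axis=1, keepdims=True)
T = [np.diag(O[:, o]) @ P_a for o in range(n_obs)]
T_a = sum(T)

q = rng.random(n); q /= q.sum()          # a valid belief state in S

# T_a maps the simplex to itself, so ||T_a q||_1 = 1 for valid q:
assert np.isclose(np.linalg.norm(T_a @ q, 1), 1.0)

# Each T_ao maps states into the cone of unnormalized states:
for o in range(n_obs):
    v = T[o] @ q
    assert (v >= 0).all() and 0 <= v.sum() <= 1
```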

### b.3 Convergence: key step

The key step in the proof of convergence is to analyze

 ∑_o Φ T_{ao}

for a fixed action a. We will show that this operation is a nonexpansion in the Hausdorff metric based on a particular norm. To build the appropriate norm, we can start from norms for our states and our features. For states we will use the norm that corresponds to our state space: ∥·∥_S̄. For features we can use any norm ∥·∥_F, with unit ball F. For elements of Φ we can then use the operator norm for S̄ and F: ∥·∥_{F,S̄}. For sets like Φ we can use the Hausdorff metric based on ∥·∥_{F,S̄}, which we will write as just d.

For simplicity we will first analyze distance to a point: start by assuming d(Φ, {0}) = k for some k ≥ 0. Now, for each a,

 d(∑_o Φ T_{ao}, {0}) = sup_{ψ_o ∈ Φ T_{ao}} ∥∑_o ψ_o∥_{F,S̄}
  = sup_{ϕ_o ∈ Φ} ∥∑_o ϕ_o T_{ao}∥_{F,S̄}
  = sup_{ϕ_o ∈ Φ} sup_{f ∈ F^*, q ∈ S̄} f^⊤ ∑_o ϕ_o T_{ao} q

where we have written sup_{ϕ_o ∈ Φ} as shorthand for sup_{ϕ_{o_1} ∈ Φ} sup_{ϕ_{o_2} ∈ Φ} ⋯, i.e., one supremum per observation.

Since q is the solution to a linear optimization problem, we can assume it is an extreme point of the feasible region S̄, which means either q ∈ S or q ∈ −S. Assume q ∈ S; the other case is symmetric. This lets us replace the supremum over q ∈ S̄ with a supremum over q ∈ S.

We next want to simplify the supremum over f. We can do this in two steps: first, the supremum can only increase if we let the choice of f depend on o (which we write as f_o). Second, Hölder’s inequality tells us that ϕ_o^⊤ f_o ∈ kS̄^*, since ∥ϕ_o∥_{F,S̄} ≤ k and f_o ∈ F^*. So, optimizing over r_o ∈ kS̄^* instead of just over vectors of the form ϕ_o^⊤ f_o can again only increase the supremum. We therefore have

 d(∑_o Φ T_{ao}, {0}) ≤ sup_{ϕ_o ∈ Φ} sup_{q ∈ S, f_o ∈ F^*} ∑_o f_o^⊤ ϕ_o T_{ao} q
  ≤ sup_{q ∈ S} sup_{r_o ∈ kS̄^*} ∑_o r_o^⊤ T_{ao} q

We can now solve the optimizations over r_o. Note that the normalization vector u is in S̄^*: u^⊤s = 1 for every s ∈ S, so |u^⊤s̄| ≤ 1 for every s̄ ∈ S̄. And, for any valid state q, no vector in kS̄^* can have dot product larger than k with q, by definition of S̄^*. T_{ao} q is a nonnegative multiple of a valid state for each o; therefore, r_o = ku is an optimal solution for each o, and we have

 d(∑_o Φ T_{ao}, {0}) ≤ sup_{q ∈ S} ∑_o k u^⊤ T_{ao} q = k sup_{q ∈ S} u^⊤ T_a q = k = d(Φ, {0})

To handle distances to a general set Ψ, we need to track a difference ϕ_o − ψ_o instead of just a single matrix ϕ_o. Assume wlog that

 d(∑_o Φ T_{ao}, ∑_o Ψ T_{ao}) = d̄(∑_o Φ T_{ao}, ∑_o Ψ T_{ao})

(the other ordering is symmetric). Then

 d̄(∑_o Φ T_{ao}, ∑_o Ψ T_{ao}) = sup_{ϕ_o ∈ Φ} inf_{ψ_o ∈ Ψ} ∥∑_o ϕ_o T_{ao} − ∑_o ψ_o T_{ao}∥_{F,S̄}
  = sup_{ϕ_o ∈ Φ} inf_{ψ_o ∈ Ψ} ∥∑_o (ϕ_o − ψ_o) T_{ao}∥_{F,S̄}

The argument proceeds from here exactly as above, since we know that ∥ϕ_o − ψ_o∥_{F,S̄} is bounded by d(Φ, Ψ) for each o.

### b.4 Convergence: rest of the proof

The remaining steps in our dynamic programming update are multiplying by γ, adding F_a, and taking the convex hull of the union over actions a. Multiplying the sets by γ changes the modulus from 1 to γ. Adding the same matrix to both sets does not change the modulus. Finally, taking the convex hull of a union also leaves the modulus unchanged: more specifically, if f_1, …, f_n are all contractions of modulus β, then the mapping

 Φ → conv ⋃_i f_i(Φ)

is also a contraction of modulus β. To see why, consider two sets Φ and Ψ with d(Φ, Ψ) = k. Consider a point in conv ⋃_i f_i(Φ): it can be written as ∑_j w_j ϕ_j, with each ϕ_j in one of the sets f_i(Φ) and the weights w_j forming a convex combination. For each ϕ_j, we can find a point ψ_j in the corresponding set f_i(Ψ) at distance at most βk, since f_i is a contraction. Using the triangle inequality on the convex combination ∑_j w_j ψ_j, the final distance is therefore at most βk.

Putting everything together, we have that the dynamic programming update is a contraction of modulus γ. From here, the Banach fixed-point theorem guarantees that there exists a unique fixed point of the update, and that each iteration of dynamic programming brings us closer to this fixed point by a factor of γ, as long as we initialize with a nonempty compact subset of the set of matrices.
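The set-valued update itself is expensive to simulate, but the same Banach argument can be illustrated with the familiar vector-valued analogue (our substitution, not the update from the text): policy evaluation v ↦ r + γPv is a γ-contraction in the sup norm, so the error to the fixed point shrinks by at least a factor of γ per iteration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, gamma = 5, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # row-stochastic
r = rng.random(n)

# Exact fixed point of v -> r + gamma P v:
v_star = np.linalg.solve(np.eye(n) - gamma * P, r)

v = np.zeros(n)
errs = []
for _ in range(30):
    v = r + gamma * P @ v
    errs.append(np.linalg.norm(v - v_star, np.inf))

# Each iteration shrinks the sup-norm error by at least a factor gamma:
assert all(e2 <= gamma * e1 + 1e-12 for e1, e2 in zip(errs, errs[1:]))
```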

## Appendix C Background on PSRs

Here we describe a mechanical way to define a valid PSR, given some information about a controlled dynamical system. This method is fully general: if it is possible to express a dynamical system as a PSR, we can use this method to do so. And, PSRs constructed this way allow a nice interpretation of the otherwise-opaque PSR state vector. To describe this method, it will help to define a kind of experiment called a test.

### c.1 Tests

A test τ consists of a sequence of actions A_τ = a_1 a_2 … a_ℓ and a statistic F_τ, a function of ℓ observations. We execute τ by executing A_τ starting from some state q. We record the resulting observations, and feed them as inputs to F_τ; the output is called the test outcome. The test value τ(q) is the expected outcome:

 τ(q) = E( F_τ(o_t, o_{t+1}, …, o_{t+ℓ−1}) ∣ q_t = q, do A_τ )

A simple test is one where the function F_τ is the indicator of a given sequence of observations; in this case the test value is also called the test success probability. Tests that are not simple are compound. Below, we will use tests to construct PSRs. If we use exclusively simple tests, we will call the result a simple PSR; else it will be a transformed PSR.

We can express compound tests as linear combinations of simple tests: we can break the expectation into a sum over all possible sequences of observations to get

 τ(q) = ∑_{o_1 … o_ℓ} P(o_1 ⋯ o_ℓ ∣ q, do A_τ) F_τ(o_1, …, o_ℓ)

and each term in the summation is a fixed multiple of a simple test probability.

In a PSR, for any test τ, it turns out that the function τ(q) is linear: for a simple test with actions a_1 … a_ℓ and observations o_1 … o_ℓ,

 τ(q) = P(o_1, …, o_ℓ ∣ q, do A_τ) = u^⊤ T_{a_ℓ o_ℓ} T_{a_{ℓ−1} o_{ℓ−1}} ⋯ T_{a_2 o_2} T_{a_1 o_1} q

which is linear in q. For a compound test, the value is linear because it is a linear combination of the values of simple tests.
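This linearity can be checked directly on a small example. The sketch below, using assumed POMDP-style parameters (T_{ao} = diag(O[:, o]) P_a, u = 1, single action a), evaluates a two-step simple test via the chained transition operators and compares it to a brute-force sum over hidden state sequences.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_obs = 3, 2
P_a = rng.random((n, n)); P_a /= P_a.sum(axis=0)          # column-stochastic
O = rng.random((n, n_obs)); O /= O.sum(axis=1, keepdims=True)
T = [np.diag(O[:, o]) @ P_a for o in range(n_obs)]
u = np.ones(n)
b = rng.random(n); b /= b.sum()          # initial belief

# Value of the simple two-step test (actions a, a; outcomes o1, o2),
# computed by chaining transition operators, as in the text:
o1, o2 = 0, 1
tau = u @ T[o2] @ T[o1] @ b

# Brute-force sum over hidden state sequences s0 -> s1 -> s2:
brute = sum(b[s0] * P_a[s1, s0] * O[s1, o1] * P_a[s2, s1] * O[s2, o2]
            for s0 in range(n) for s1 in range(n) for s2 in range(n))
assert np.isclose(tau, brute)
```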

In fact, this linearity property is the defining feature of PSRs: a dynamical system can be described as a PSR exactly when we can define a state vector that makes all test values into linear functions. That is, we can write down a PSR iff there exist state extraction functions f_t (computing q_t from the history so far) such that, for all tests τ, there exist prediction vectors m_τ such that the value of τ is τ(q) = m_τ^⊤ q. There may be many ways to define a state vector for a given dynamical system; we are interested particularly in minimal state vectors, i.e., those with the smallest possible dimension.

Above, we saw one direction of the equivalence between PSRs and dynamical systems satisfying the linearity property: given a PSR, the state update equations define the state extraction functions f_t, and the expression above gives the prediction vectors m_τ. We will demonstrate the other direction in the next section, by constructing a PSR given the f_t and m_τ.

Given a test τ, an action a, and an observation o, define the one-step extension τ_{ao} as follows: let A_τ be the sequence of actions for τ, and let F_τ be the statistic for τ. Then the action sequence for τ_{ao} is aA_τ, and the statistic for τ_{ao} is F_{ao}, defined as

 F_{ao}(o_1, …, o_{ℓ+1}) = I(o_1 = o) F_τ(o_2, …, o_{ℓ+1})

In words, the one-step extension tacks a onto the beginning of the action sequence. It then applies F_τ to the observation sequence starting at the second time step in the future, but it either keeps the result or zeros it out, depending on the value of the first observation.

We can relate the value of a one-step extension test τ_{ao} to the value of the original test τ:

 τ_{ao}(q) = P(o ∣ q, do a) τ(q′)

where q′ is the state we reach from q after executing a and observing o. (We can derive this expression by conditioning on whether we receive o or not: with probability P(o ∣ q, do a) the outcome of τ_{ao} is as if we executed τ from q′; else the outcome of τ_{ao} is zero.)

For example, in any PSR, we can define the constant test 1, which has an empty action sequence and always has outcome equal to 1. The one-step extensions of this test give the probabilities of different observations at the current time step:

 1_{ao}(q) = P(o ∣ q, do a)
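A quick numeric check of this identity, again with assumed POMDP-style parameters (T_{ao} = diag(O[:, o]) P_a, u = 1): the one-step extensions of the constant test, evaluated as u^⊤ T_{ao} q, form a probability distribution over observations.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_obs = 3, 2
P_a = rng.random((n, n)); P_a /= P_a.sum(axis=0)          # column-stochastic
O = rng.random((n, n_obs)); O /= O.sum(axis=1, keepdims=True)
T = [np.diag(O[:, o]) @ P_a for o in range(n_obs)]
u, q = np.ones(n), np.full(n, 1.0 / n)

# 1_ao(q) = u^T T_ao q: nonnegative, and sums to 1 over observations
# since sum_o T_ao = T_a preserves u^T q = 1.
probs = np.array([u @ T[o] @ q for o in range(n_obs)])
assert (probs >= 0).all() and np.isclose(probs.sum(), 1.0)
```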

### c.2 PSRs and tests

We can use tests to construct a PSR from a dynamical system, and to interpret the resulting state vector. This interpretation explains the terminology predictive state: our state is equivalent to a vector of predictions about the future. Crucially, these predictions are for observable outcomes of experiments that we could actually conduct. This is in contrast to a POMDP’s state, which may be only partially observable.

In more detail, suppose we have a dynamical system with a minimal state that satisfies the linearity property defined above. That is, suppose we have functions f_t that compute minimal states q_t ∈ ℝ^d, and vectors m_τ that predict test values τ(q) = m_τ^⊤ q. We will show that each coordinate of q_t is a linear combination of test values, and we will define PSR parameters that let us update q_t recursively, instead of having to compute it from scratch at each time step using the state extraction functions f_t.

Pick tests τ_1, …, τ_d, and define v(q) to have coordinates v_i(q) = τ_i(q). Equivalently, let M be the matrix with rows m_{τ_1}^⊤, …, m_{τ_d}^⊤, and write