## 1 Introduction

Planning with learned models in RL is important for sample efficiency. Planning provides a mechanism for the agent to simulate data, in the background during interaction, to improve value estimates. Dyna (sutton1990integrated) is a classic example of background planning. On each step, the agent simulates several transitions according to its model, and updates with those transitions as if they were real experience. Learning and using such a model is worthwhile in vast or ever-changing environments, where the agent learns over a long time period and can benefit from re-using knowledge about the environment. The promise of Dyna is that we can exploit the Markov structure in the RL formalism to learn and adapt value estimates efficiently, but many open problems remain before it is more widely useful. These include that (1) one-step models learned in Dyna can be difficult to use for long-horizon planning, (2) learning probabilities over outcome states can be complex, especially for high-dimensional states, and (3) planning itself can be computationally expensive for large state spaces.
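The Dyna loop described above can be sketched in a few lines. This is a hedged illustration on an invented 5-state chain with a uniform-random behavior policy, not the paper's implementation; all names and constants are made up.

```python
import random
from collections import defaultdict

random.seed(0)
n_states, n_actions = 5, 2          # a made-up chain; actions: 0 = left, 1 = right
alpha, gamma, n_planning = 0.5, 0.9, 10

def step(s, a):
    # deterministic chain dynamics: reward 1 only at the rightmost state
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

Q = defaultdict(float)
model = {}                          # one-step model: (s, a) -> (r, s')
s = 0
for t in range(500):
    a = random.randrange(n_actions)          # uniform exploration, for simplicity
    s2, r = step(s, a)
    # direct RL: Q-learning update from the real transition
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in range(n_actions)) - Q[(s, a)])
    model[(s, a)] = (r, s2)                  # learn the (deterministic) model
    # background planning: update with transitions simulated from the model
    for _ in range(n_planning):
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in range(n_actions)) - Q[(ps, pa)])
    s = 0 if s2 == n_states - 1 else s2      # reset the episode at the goal
```

Each real step adds one model entry and triggers `n_planning` simulated updates, which is what speeds up value propagation relative to model-free learning.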

A variety of strategies have been proposed to improve long-horizon planning. Incorporating options as additional (macro) actions in planning is one approach. An *option* is a policy coupled with a termination condition and initiation set (sutton1999between). Options provide temporally-extended ways of behaving, allowing the agent to reason about outcomes further into the future. Incorporating options into planning is a central motivation of this paper, particularly how to do so under function approximation. Planning with options has largely been tested only in tabular settings (sutton1999between; singh2004intrinsically; wan2021averagereward). Recent work has considered mechanisms for identifying and learning option policies for planning under function approximation (sutton2022rewardrespecting), but did not yet consider issues with learning the models.

A variety of other approaches have been developed to handle issues with learning and iterating one-step models. Several papers have shown that forward model simulations can produce simulated states that result in catastrophically misleading values (jafferjee2020hallucinating; vanhasselt2019when; lambert2022investigating). This problem has been tackled by using reverse models (pan2018organizing; jafferjee2020hallucinating; vanhasselt2019when); primarily using the model for decision-time planning (vanhasselt2019when; silver2008samplebased; chelu2020forethought); and improving training strategies to account for accumulated errors in rollouts (talvitie2014model; venkatraman2015improving; talvitie2017selfcorrecting). An emerging trend is to avoid approximating the true transition dynamics, and instead learn dynamics tailored to predicting values on the next step correctly (farahmand2017valueaware; farahmand2018iterative; ayoub2020modelbased). This trend is also implicit in the variety of techniques that encode the planning procedure into neural network architectures that can then be trained end-to-end (tamar2016value; silver2017predictron; oh2017value; weber2017imaginationaugmented; farquhar2018treeqn; schrittwieser2020mastering). We similarly attempt to avoid issues with iterating models, but do so by considering a different type of model.

Much less work has been done on the third problem in Dyna: the expense of planning. There is, however, a large literature on approximate dynamic programming—where the model is given—that is focused on efficient planning (see (powell2009what)). Particularly relevant to this work is restricting value iteration to a small subset of landmark states (mann2015approximate). (A similar idea to landmark states has been considered in more classical AI approaches, under the term bi-level planning (wolfe2010combined; hogg2010learning; chitnis2021learning). These techniques are quite different from Dyna-style planning—updating values with (stochastic) dynamic programming updates—and so we do not consider them further here.) The resulting policy is suboptimal, restricted to going between these landmark states, but planning is provably much more efficient.

Beyond this planning setting where the model is given, the use of landmark states has also been explored in *goal-conditioned RL*, where the agent is given a desired goal state or states. The first work to exploit this idea in reinforcement learning with function approximation, when learning online, was for learning universal value function approximators (UVFAs) (huang2019mapping). The UVFA conditions action-values on both state-action pairs as well as landmark states. A search is done on a learned graph between landmark states, to identify which landmark to move towards. A flurry of work followed, still in the goal-conditioned setting (nasiriany2019planning; emmons2020sparse; zhang2020generating; zhang2021world; aubret2021distop; hoang2021successor; gieselmann2021planningaugmented; kim2021landmarkguided; dubey2021snap).

In this paper, we exploit the idea behind landmark states for efficient background planning in general online reinforcement learning problems. The key novelty is a framework to use *subgoal-conditioned models*: temporally-extended models that condition on subgoals. The models are designed to be simpler to learn, as they are only learned for states local to subgoals and they avoid generating entire next-state vectors. We use background planning on subgoals, to quickly propagate (suboptimal) value estimates for subgoals. We propose subgoal-value bootstrapping, which leverages these quickly computed subgoal values, but mitigates suboptimality by incorporating an update on real experience. We show in the PinBall environment that our Goal-Space Planning (GSP) algorithm can learn significantly faster than Double DQN, and still reaches nearly the same level of performance. A major insight of this paper is that many of the pieces to the model-based RL puzzle already exist and just needed to be assembled into a unified architecture: the combination of temporal abstraction, subgoal planning, avoiding learning transitions, and using UVFAs is far greater than the sum of the parts.

## 2 Problem Formulation

We consider the standard reinforcement learning setting, where an agent learns to make decisions through interaction with an environment, formulated as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, r, P, \gamma)$. $\mathcal{S}$ is the state space and $\mathcal{A}$ the action space. The reward function $r$ and the transition probability $P$ describe the expected reward and the probability of transitioning to a state, for a given state and action. On each discrete timestep $t$ the agent selects an action $A_t$ in state $S_t$, the environment transitions to a new state $S_{t+1}$ and emits a scalar reward $R_{t+1}$. The agent's objective is to find a policy $\pi$ that maximizes expected *return*, the future discounted reward $G_t \doteq R_{t+1} + \gamma(S_{t+1}) G_{t+1}$. The state-based discount $\gamma(S_{t+1})$ depends on $S_{t+1}$ (sutton2011horde), which allows us to specify termination. If $S_{t+1}$ is a terminal state, then $\gamma(S_{t+1}) = 0$; else, $\gamma(S_{t+1}) = \gamma_c$ for some constant $\gamma_c$.
The policy can be learned using algorithms like Q-learning (sutton2018reinforcement), which approximate the action-values: the expected return from a given state and action.

We can incorporate models and planning to improve sample efficiency beyond these basic model-free algorithms. In this work, we focus on background planning algorithms: those that learn a model during online interaction and asynchronously update value estimates using dynamic programming updates. (Another class of planning algorithms is *model predictive control* (MPC): these algorithms learn a model and use decision-time planning by simulating many rollouts from the current state. Other recent algorithms using this idea are those doing Monte Carlo tree search (MCTS) online, such as MuZero (schrittwieser2020mastering).) The classic example of background planning is Dyna (sutton1990integrated), which performs planning steps by selecting previously observed states, generating transitions—outcome rewards and next states—for every action and performing a Q-learning update with those simulated transitions.

Planning with learned models, however, has several issues. First, even with perfect models, it can be computationally expensive. Running dynamic programming can require multiple sweeps, which is infeasible over a large number of states. A small number of updates, on the other hand, may be insufficient. Computation can be focused by carefully selecting which states to sample transitions from—called search control—but the question of how to do so effectively remains largely unanswered, with only a handful of works (moore1993prioritized; wingate2005prioritization; pan2019hill).

The second difficulty arises due to errors in the learned models. In reinforcement learning, the transition dynamics are typically represented with an expectation model $\hat{P}(s,a) \approx \mathbb{E}[S_{t+1} \mid S_t = s, A_t = a]$ or a probabilistic model $\hat{P}(s' \mid s, a)$. If the state space or feature space is large, then the expected next state or distribution over it can be difficult to estimate, as has been repeatedly shown (talvitie2017selfcorrecting). Further, these errors can compound when iterating the model forward or backward (jafferjee2020hallucinating; vanhasselt2019when). It is common to use an expectation model, but unless the environment is deterministic or we are only learning state values rather than action-values, this model can result in invalid states and detrimental updates (wan2019planning).

In this work, we take steps towards the ambitious question: how can we leverage a separate computational procedure (planning with a model) to improve learning in complex environments? More specifically, we consider background planning for value-based methods. We address the two difficulties with classic background planning strategies discussed above, by focusing planning on a set of subgoals (abstract states) and changing the form of the model.

## 3 Starting Simpler: Goal-Space Planning for Policy Evaluation

To highlight the key idea for efficient planning, we start in a simpler setting: policy evaluation—learning $v^\pi$ for a fixed deterministic policy $\pi$ in a deterministic environment—assuming access to the true models.
The key idea is to propagate values quickly across the space by updating between a subset of states that we call *subgoals*, $\mathcal{G} \subset \mathcal{S}$. (Later we extend to abstract subgoal vectors that need not correspond to any state.) To do so, we need temporally extended models between pairs $g, g' \in \mathcal{G}$ that may be further than one transition apart. For policy evaluation, these models are the accumulated rewards $r_\gamma(g, g')$ and discounted probabilities $\Gamma_\gamma(g, g')$ under $\pi$:

$$r_\gamma(g, g') \doteq \mathbb{E}_\pi\Big[ R_1 + \tilde\gamma(S_1) R_2 + \tilde\gamma(S_1)\tilde\gamma(S_2) R_3 + \ldots \,\Big|\, S_0 = g \Big], \qquad \Gamma_\gamma(g, g') \doteq \mathbb{E}_\pi\Big[ \prod\nolimits_{j=1}^{T_{g'}} \tilde\gamma(S_j) \,\Big|\, S_0 = g \Big],$$

where $\tilde\gamma(s) = 0$ if $s = g'$ and otherwise equals $\gamma(s)$, the environment discount, and $T_{g'}$ is the time at which $g'$ is reached. If we cannot reach $g'$ from $g$ under $\pi$, then $\Gamma_\gamma(g, g')$ will simply accumulate many zeros and be zero. We can treat $\mathcal{G}$ as our new state space and plan in this space, to get value estimates $v(g)$ for all $g \in \mathcal{G}$:

$$v(g) \leftarrow r_\gamma(g, g') + \Gamma_\gamma(g, g')\, v(g'), \quad \text{for $g' \in \bar{\mathcal{G}}$ the next subgoal reached from $g$ under $\pi$,}$$

where $\bar{\mathcal{G}} \doteq \mathcal{G} \cup \{\bot\}$ for terminal state $\bot$ with $v(\bot) \doteq 0$ if there is a terminal state (episodic problems), and $\bar{\mathcal{G}} \doteq \mathcal{G}$ otherwise. It is straightforward to show this converges, because $\Gamma_\gamma$ is a substochastic matrix (see Appendix A).
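The convergence argument can be illustrated numerically: when every row of the subgoal-to-subgoal discount matrix sums to at most a constant below 1, the evaluation update is a contraction and reaches its fixed point from any initialization. A hedged sketch with made-up numbers, assuming NumPy:

```python
import numpy as np

# Sketch: the subgoal-space evaluation update in matrix form, v <- r + Gamma v.
# Gamma is substochastic (rows sum to gamma_c < 1 here), so the update is a
# contraction and converges from any initialization. All numbers are invented.
rng = np.random.default_rng(0)
gamma_c, n = 0.9, 6
Gamma = rng.random((n, n))
Gamma *= gamma_c / Gamma.sum(axis=1, keepdims=True)   # rows sum to gamma_c
r = rng.standard_normal(n)

v = rng.standard_normal(n) * 100.0                     # arbitrary initialization
for _ in range(500):
    v = r + Gamma @ v                                  # contraction: error shrinks by 0.9 per step

v_fixed = np.linalg.solve(np.eye(n) - Gamma, r)        # closed-form fixed point
```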

Once we have these values, we can propagate them to other states, locally, again using the subgoals $g$ closest to $s$. We can do so by noticing that the above definitions can be easily extended to $r_\gamma(s, g)$ and $\Gamma_\gamma(s, g)$ for $s \in \mathcal{S}$, since for a pair $(s, g)$ they are about starting in the state $s$ and reaching $g$ under $\pi$:

$$v(s) = r_\gamma(s, g) + \Gamma_\gamma(s, g)\, v(g). \tag{1}$$

Because the right-hand side of this equation is fixed, we only cycle through these states once to get their values.

All of this might seem like a lot of work for policy evaluation; indeed, it will be more useful to have this formalism for control. But even here goal-space planning can be beneficial. Assume a chain of 1000 states, with the subgoals $\mathcal{G}$ consisting of every 100th state, so $|\mathcal{G}| = 10$. Planning over $\mathcal{G}$ only requires sweeping over 10 states, rather than 1000. Further, we have taken a 1000-horizon problem and converted it into a 10-step one. (In this simplified example, we could plan efficiently by updating the value at the end of the chain, and then updating states backwards from the end. But, without knowing this structure, that is not a general-purpose strategy. For general MDPs, we would need smart ways to do search control: the approach for picking states for one-step updates. In fact, we can leverage search control strategies to improve the goal-space planning step. Then we get the benefit of these approaches, as well as the benefit of planning over a much smaller state space.)
As a result, changes in the environment also propagate faster. If the reward at a state $s$ changes, locally the reward model around $s$ can be updated quickly, changing $r_\gamma(g, g')$ for pairs where $s$ is along the way from $g$ to $g'$. This local change quickly updates the values back to earlier subgoals $g$.
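The chain example can be made concrete with a small sketch: planning sweeps over 10 subgoal values instead of 1000 state values, and a local change to one reward-model entry propagates back through all earlier subgoals in a handful of sweeps. All numbers below are invented for illustration, assuming NumPy.

```python
import numpy as np

# Hedged sketch of the 1000-state chain: 10 subgoals (every 100th state).
# r_g[i] is the (made-up) accumulated reward from subgoal i to its successor;
# Gam[i] is the accumulated discount over those ~100 steps.
gamma_c = 0.99
n_sub = 10
r_g = np.full(n_sub, -100.0)
Gam = np.full(n_sub, gamma_c ** 100)
r_g[-1], Gam[-1] = 0.0, 0.0        # last subgoal transitions to termination

def evaluate(r_g, Gam, sweeps=20):
    v = np.zeros(n_sub)
    for _ in range(sweeps):        # each sweep touches 10 values, not 1000
        v = r_g + Gam * np.append(v[1:], 0.0)   # bootstrap off successor subgoal
    return v

v_before = evaluate(r_g, Gam)
# the reward near the second-to-last subgoal changes: only that model entry changes
r_g_new = r_g.copy()
r_g_new[-2] = 50.0
v_after = evaluate(r_g_new, Gam)   # the local change propagates back to all subgoals
```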

## 4 Goal-Space Planning with Subgoal-Conditioned Models

Our objective is to allow planning to operate in a subgoal space while the policy operates in the original MDP. To do so, we need to first specify how subgoals relate to states and define the models required for planning and updating the policy. We then discuss how to use these for planning, and finally summarize the overall goal-space planning framework, with a diagram in Figure 3.

### 4.1 Defining Subgoals

Assume we have a finite subset of subgoal vectors $\mathcal{G}$, that need not be a subset of the (possibly continuous) space of state vectors. For example, a subgoal could correspond to a situation where both the front and side distance sensors of a robot report low readings—what a person would call being in a corner. Subgoal vectors could be a one-hot encoding or a vector of features describing the subgoal.

To fully specify a subgoal, we need a *membership function* $m : \mathcal{S} \times \mathcal{G} \to \{0, 1\}$ that indicates if a state is a member of subgoal $g$: $m(s, g) = 1$ if $s$ is a member of $g$, and zero otherwise.
Many states can be mapped to the same subgoal $g$. For the above example, if the first two elements of the state vector consist of the front and side distance sensors, then $m(s, g) = 1$ for any states where these two elements are less than some threshold. For a concrete example, we visualize subgoals for the environment in our experiments in Figure 4.
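A membership function for the corner example might look as follows; the function name, feature layout, and threshold are all illustrative assumptions, not part of the framework.

```python
# Hedged sketch of a membership function m(s, g): the state's first two features
# are front and side distance sensors; a state belongs to the "corner" subgoal
# when both readings fall below a threshold. Names and numbers are invented.
CORNER_THRESHOLD = 0.1

def membership(state, subgoal_id):
    if subgoal_id == "corner":
        front, side = state[0], state[1]
        return 1 if front < CORNER_THRESHOLD and side < CORNER_THRESHOLD else 0
    return 0

in_corner  = membership([0.05, 0.02, 1.3, -0.4], "corner")   # both sensors close
not_corner = membership([0.80, 0.02, 1.3, -0.4], "corner")   # front sensor far
```

Note that many distinct states (here, any velocity values in the remaining features) map to the same subgoal, exactly as the text describes.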

Finally, we only reason about reaching subgoals from a subset of states, called *initiation sets* for options (sutton1999between). This constraint is key for locality: we learn and reason about only a subset of states for each subgoal.
We assume the existence of a (learned) *initiation function* $d : \mathcal{S} \times \mathcal{G} \to \{0, 1\}$ that is 1 if $s$ is in the initiation set for $g$ (e.g., sufficiently close in terms of reachability) and zero otherwise.
We discuss some approaches to learn this initiation function in Appendix C. Here, however, we assume it is part of the discovery procedure for the subgoals and first focus on how to use it.

### 4.2 Defining Subgoal-Conditioned Models

For planning and acting to operate in two different spaces, we define four models: two used in planning over subgoals (subgoal-to-subgoal) and two used to project these subgoal values back into the underlying state space (state-to-subgoal). Figure 1 visualizes these two spaces.

The state-to-subgoal models are $r_\gamma(s, g)$ and $\Gamma_\gamma(s, g)$. An option policy $\pi_g$ for subgoal $g$ starts from any $s$ in the initiation set, and terminates in $g$—in states $s'$ where $m(s', g) = 1$. The reward-model $r_\gamma(s, g)$ is the discounted rewards under option policy $\pi_g$:

$$r_\gamma(s, g) \doteq \mathbb{E}_{\pi_g}\Big[ R_1 + \tilde\gamma(S_1) R_2 + \tilde\gamma(S_1)\tilde\gamma(S_2) R_3 + \ldots \,\Big|\, S_0 = s \Big],$$

where the discount $\tilde\gamma$ is zero upon reaching subgoal $g$.
The discount-model $\Gamma_\gamma(s, g)$ reflects the discounted number of steps until reaching subgoal $g$ starting from $s$, in expectation under option policy $\pi_g$:

$$\Gamma_\gamma(s, g) \doteq \mathbb{E}_{\pi_g}\Big[ \prod\nolimits_{j=1}^{T_g} \gamma(S_j) \,\Big|\, S_0 = s \Big], \quad \text{with $T_g$ the time at which $g$ is reached.}$$

These state-to-subgoal models will only be queried for $s, g$ where $d(s, g) = 1$: they are local models.

To define the subgoal-to-subgoal models $\tilde{r}_\gamma(g, g')$ and $\tilde{\Gamma}_\gamma(g, g')$, we use the state-to-subgoal models. (The first input is any $g \in \mathcal{G}$; the second is $g' \in \bar{\mathcal{G}}$, which includes the terminal state. We need to reason about reaching any subgoal or termination. But the terminal state is not a real state: we do not reason about starting from it to reach subgoals.)
For each subgoal $g$, we aggregate over all $s$ where $m(s, g) = 1$:

$$\tilde{r}_\gamma(g, g') \doteq \frac{1}{c(g)} \sum_{s :\, m(s, g) = 1} r_\gamma(s, g'), \qquad \tilde{\Gamma}_\gamma(g, g') \doteq \frac{1}{c(g)} \sum_{s :\, m(s, g) = 1} \Gamma_\gamma(s, g'), \tag{2}$$

for normalizer $c(g) \doteq |\{s : m(s, g) = 1\}|$. This definition assumes a uniform weighting over the states where $m(s, g) = 1$. We could allow a non-uniform weighting, potentially based on visitation frequency in the environment. For this work, however, we assume that $m(s, g) = 1$ for a smaller number of states with relatively similar $r_\gamma(s, g')$ and $\Gamma_\gamma(s, g')$, making a uniform weighting reasonable.

These models are also local models, as we can similarly extract $\tilde{d}(g, g')$ from $d$ and only reason about the $g'$ nearby or relevant to $g$. We set $\tilde{d}(g, g') = 1$ if there exists a state $s$ with $m(s, g) = 1$ and $d(s, g') = 1$, indicating that if there is a state that is in the initiation set for $g'$ and has membership in $g$, then $g'$ is also relevant to $g$.
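The uniform aggregation in Equation (2) can be sketched directly; the member states and model values below are made up for illustration.

```python
# Hedged sketch of aggregating state-to-subgoal models into subgoal-to-subgoal
# models with uniform weighting over member states. For a fixed subgoal pair
# (g, g'), we average r_gamma(s, g') and Gamma_gamma(s, g') over the states s
# with m(s, g) = 1. All values are invented.
members = ["s1", "s2", "s3"]                    # states with m(s, g) = 1
r_sg   = {"s1": -3.0, "s2": -5.0, "s3": -4.0}   # r_gamma(s, g')
Gam_sg = {"s1": 0.9,  "s2": 0.8,  "s3": 0.85}   # Gamma_gamma(s, g')

c = len(members)                                # normalizer: number of member states
r_gg   = sum(r_sg[s] for s in members) / c      # tilde r_gamma(g, g')
Gam_gg = sum(Gam_sg[s] for s in members) / c    # tilde Gamma_gamma(g, g')
```

A non-uniform version would simply replace `1 / c` with state-dependent weights, e.g. from visitation frequencies.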

Let us consider an example, in Figure 1. The red states are members of $g_1$ ($m(s, g_1) = 1$) and the blue states are members of $g_2$ and $g_3$. For all $s$ in the diagram, $d(s, g_2) = 1$ (all are in the initiation set): the policy $\pi_{g_2}$ can be queried from any $s$ to get to $g_2$. The green path on the left indicates the trajectory under $\pi_{g_2}$ from $s$, stochastically reaching either $g_2$ or $g_3$, with accumulated reward $r_\gamma(s, g_2)$ and discount $\Gamma_\gamma(s, g_2)$ (averaged over reaching $g_2$ and $g_3$). The subgoal-to-subgoal models, on the right, indicate $g_2$ can be reached from $g_1$, with $\tilde{r}_\gamma(g_1, g_2)$ and $\tilde{\Gamma}_\gamma(g_1, g_2)$ averaged over the member states of $g_1$ and over reaching $g_2$ and $g_3$, as described in Equation (2).

### 4.3 Goal-Space Planning with Subgoal-Conditioned Models

We can now consider how to plan with these models. Planning involves learning $\tilde{v}(g)$: the value for different subgoals. This can be achieved using an update similar to value iteration, for all $g \in \mathcal{G}$:

$$\tilde{v}(g) \leftarrow \max_{g' \in \bar{\mathcal{G}} :\, \tilde{d}(g, g') = 1} \Big[ \tilde{r}_\gamma(g, g') + \tilde{\Gamma}_\gamma(g, g')\, \tilde{v}(g') \Big]. \tag{3}$$

The value of reaching $g'$ from $g$ is the discounted rewards along the way, $\tilde{r}_\gamma(g, g')$, plus the discounted value in $g'$. If $\tilde{\Gamma}_\gamma(g, g')$ is very small, it is difficult to reach $g'$ from $g$—or takes many steps—and so the value in $g'$ is discounted more. With a relatively small number of subgoals, we can sweep through them all to quickly compute $\tilde{v}$. With a larger set of subgoals, we can instead do as many updates as possible, in the background on each step, by stochastically sampling $g$.

We can interpret this update as a standard value iteration update in a new MDP, where 1) the set of states is $\bar{\mathcal{G}}$, 2) the actions from $g$ are state-dependent, corresponding to choosing which $g'$ to go to in the set where $\tilde{d}(g, g') = 1$, and 3) the rewards are $\tilde{r}_\gamma(g, g')$ and the discounted transition probabilities are $\tilde{\Gamma}_\gamma(g, g')$. Under this correspondence, it is straightforward to show that the above converges to the optimal values in this new Goal-Space MDP, shown in Proposition 2 in Appendix B.
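Value iteration over this Goal-Space MDP is cheap because each sweep touches only subgoals. A hedged sketch on an invented three-subgoal graph, with `"T"` denoting termination:

```python
# Hedged sketch of goal-space value iteration (Equation (3)): sweep over
# subgoals, maximizing over reachable successor subgoals. The graph, rewards,
# and discounts are invented for illustration.
r = {("g0", "g1"): -2.0, ("g0", "g2"): -6.0,
     ("g1", "g2"): -2.0, ("g2", "T"): -1.0}      # tilde r_gamma(g, g')
Gam = {("g0", "g1"): 0.9, ("g0", "g2"): 0.7,
       ("g1", "g2"): 0.9, ("g2", "T"): 0.95}     # tilde Gamma_gamma(g, g')
subgoals = ["g0", "g1", "g2"]

v = {g: 0.0 for g in subgoals}
v["T"] = 0.0                                     # terminal value fixed at zero
for _ in range(50):
    for g in subgoals:
        # v(g) <- max over reachable g' of [r(g,g') + Gamma(g,g') v(g')]
        v[g] = max(r[(a, b)] + Gam[(a, b)] * v[b]
                   for (a, b) in r if a == g)
```

Here the sweep finds that going `g0 -> g1 -> g2 -> T` is better than the direct but poorly-discounted `g0 -> g2` edge.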

This goal-space planning approach does not suffer from typical issues with model-based RL. First, the model is not iterated, but we still obtain temporal abstraction because the model itself incorporates it. Second, we do not need to predict entire state vectors—or distributions over them—because we instead input the outcome subgoal into the function approximator. This may feel like a false success, as it potentially requires restricting ourselves to a smaller number of subgoals. If we want to use a larger number of subgoals, then we may need a function to generate these subgoal vectors anyway—bringing us back to the problem of generating vectors. However, this is likely easier because 1) the subgoals themselves can be much smaller and more abstract, making it more feasible to procedurally generate them, and 2) it may be more feasible to maintain a large set of subgoal vectors, or generate individual subgoal vectors, than to produce relevant subgoal vectors from a given subgoal.

Now let us examine how to use $\tilde{v}$ to update our main policy. The simplest way to decide how to behave from a state $s$ is to cycle through the subgoals, and pick the one with the highest value:

$$\tilde{v}(s) \doteq \max_{g \in \mathcal{G} :\, d(s, g) = 1} \Big[ r_\gamma(s, g) + \Gamma_\gamma(s, g)\, \tilde{v}(g) \Big], \tag{4}$$

and take the action given by $\pi_g$ for this maximizing $g$. However, this approach has two issues. First, restricting the agent to go through subgoals might result in suboptimal policies. From a given state $s$, the set of relevant subgoals may not be on the optimal path. Second, the learned models themselves may have inaccuracies, or planning may not have been completed in the background, resulting in subgoal values that are not yet fully accurate.

We instead propose to use $\tilde{v}$ within the bootstrap target for the action-values of the main policy. For a given transition $(S_t, A_t, R_{t+1}, S_{t+1})$, either as the most recent experience or from a replay buffer, the proposed *subgoal-value bootstrapping* update to the parameterized action-values $q_\theta$ uses TD error

$$\delta_t \doteq R_{t+1} + \gamma(S_{t+1}) \Big[ (1 - \beta)\, \max_{a'} q_\theta(S_{t+1}, a') + \beta\, \tilde{v}(S_{t+1}) \Big] - q_\theta(S_t, A_t), \tag{5}$$

for some $\beta \in [0, 1]$. For $\beta = 0$, we get a standard Q-learning update. For $\beta = 1$, we fully bootstrap off the value provided by $\tilde{v}$. This may result in suboptimal values, but should learn faster because a reasonable estimate of value has been propagated back quickly using goal-space planning. On the other hand, $\beta = 0$ is not biased by a potentially suboptimal $\tilde{v}$, but does not take advantage of this fast propagation. An interim $\beta$ can allow for fast propagation, but also help overcome suboptimality in the subgoal values.

We can show that the above update improves the convergence rate. This result is intuitive: subgoal-value bootstrapping changes the effective discount rate to $(1 - \beta)\gamma$. In the extreme case of $\beta = 1$, we are moving our estimate towards a target that does not depend on $q_\theta$, without any bootstrapping: it is effectively a regression problem. We prove this intuitive result in Appendix B. One other benefit of this approach is that the initiation sets need not cover the whole space: we can have a state $s$ with $d(s, g) = 0$ for all $g$. If this occurs, we simply do not use $\tilde{v}(s)$ and bootstrap as usual.
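The mixed bootstrap target in Equation (5) is a one-line computation. A hedged sketch with illustrative scalar inputs, not the paper's implementation:

```python
# Hedged sketch of the subgoal-value bootstrapping TD error (Equation (5)):
# the bootstrap mixes the usual max-Q value with the subgoal-based estimate
# v_tilde, controlled by beta. All quantities are invented scalars.
def td_error(q_sa, q_next_max, v_tilde_next, reward, gamma, beta):
    bootstrap = (1 - beta) * q_next_max + beta * v_tilde_next
    return reward + gamma * bootstrap - q_sa

# beta = 0 recovers the standard Q-learning error ...
delta_q = td_error(q_sa=1.0, q_next_max=2.0, v_tilde_next=5.0,
                   reward=0.5, gamma=0.9, beta=0.0)
# ... and beta = 1 bootstraps fully off the subgoal value
delta_g = td_error(q_sa=1.0, q_next_max=2.0, v_tilde_next=5.0,
                   reward=0.5, gamma=0.9, beta=1.0)
```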

### 4.4 Putting it All Together: The Full Goal-Space Planning Algorithm

The remaining piece is to learn the models and put it all together. Learning the models is straightforward, as we can leverage the large literature on general value functions (sutton2011horde) and UVFAs (schaul2015universal). There are nuances involved in 1) restricting updating to relevant states according to the initiation function, 2) learning option policies that reach subgoals but also maximize rewards along the way, and 3) considering ways to jointly learn the reward and discount models. For space, we include these details in Appendix C.

The algorithm is visualized in Figure 3 (pseudocode in Appendix C.3). The steps of agent-environment interaction include:

1) take action $A_t$ in state $S_t$, to get $S_{t+1}$ and $R_{t+1}$;

2) query the model for $r_\gamma(S_{t+1}, g)$ and $\Gamma_\gamma(S_{t+1}, g)$ for all $g$ where $d(S_{t+1}, g) = 1$;

3) compute the projection $\tilde{v}(S_{t+1})$ using Eq. (4) and step 2;

4) update the main policy with the transition and $\tilde{v}(S_{t+1})$, using Eq. (5).

All background computation is used for model learning using a replay buffer and for planning to obtain $\tilde{v}(g)$, so that these quantities can be queried at any time on step 2.
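The four interaction steps above can be sketched as a small loop around stub models; every function and value below is a made-up placeholder, not the paper's API.

```python
# Hedged sketch of the per-step GSP computation (steps 2-4), with stub models.
SUBGOALS = [0, 1, 2]
v_goal = {0: -4.0, 1: -2.0, 2: -1.0}               # from background planning
def d(s, g):      return True                       # stub initiation function
def r_sg(s, g):   return -0.5 * (g + 1)             # stub r_gamma(s, g)
def Gam_sg(s, g): return 0.9                        # stub Gamma_gamma(s, g)

def project(s):
    # Eq. (4): best value achievable by first going through a relevant subgoal
    return max(r_sg(s, g) + Gam_sg(s, g) * v_goal[g]
               for g in SUBGOALS if d(s, g))

def gsp_td_error(q_sa, q_next_max, s_next, reward, gamma, beta):
    # Eq. (5): mix max-Q and the projected subgoal value in the bootstrap
    target = reward + gamma * ((1 - beta) * q_next_max + beta * project(s_next))
    return target - q_sa
```

In the real agent, `project` would fall back to the standard bootstrap when no subgoal has `d(s, g) = 1`, and the models would be neural networks trained in the background from the replay buffer.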

## 5 Experiments with Goal-Space Planning

We investigate the utility of GSP for 1) improving sample efficiency and 2) re-learning under non-stationarity. We compare to Double DQN (DDQN) (van2016deep), which uses replay and target networks. We layer GSP on top of this agent: the action-value update is modified to incorporate subgoal-value bootstrapping. By selecting $\beta = 0$, we perfectly recover DDQN, allowing us to test different values of $\beta$ to investigate the impact of incorporating subgoal values computed using background planning.

### 5.1 Experiment Specification

We test the agents in the PinBall environment (konidaris2009skill), which allows for a variety of easier and harder instances to test different aspects. The agent has to navigate a small ball to a destination in a maze-like environment with fully elastic and irregularly shaped obstacles. The state is described by 4 features: $(x, y, \dot{x}, \dot{y})$. The agent has 5 discrete actions: increase/decrease $\dot{x}$, increase/decrease $\dot{y}$, and do nothing. The agent receives a reward of -5 per step and a reward of 10,000 upon termination at the goal location. PinBall has a continuous state space with complex and sharp dynamics that make learning and control difficult. We used a harder version of PinBall in our first experiment, shown in Figure 4, and a simpler one for the non-stationary experiment, shown in Figure 8, to allow DDQN a better chance to adapt under non-stationarity.

The hyperparameters are chosen based on sweeping for DDQN performance. We then fixed these hyperparameters and used them for GSP. This approach helps ensure the two agents have similar settings, with the primary difference due to incorporating subgoal-value bootstrapping. We used neural networks with ReLU activations; details about the hyperparameters are in Appendix F.

The set of subgoals for GSP is chosen to cover the environment in terms of $(x, y)$ locations. For each subgoal with location $(x, y)$, we set $m(s, g) = 1$ for states $s$ whose Euclidean distance to that location is below a threshold. Using a region, rather than requiring exact equality, is necessary for a continuous state space. The agent's velocity is not taken into account for subgoal termination. The width of the region for the initiation function is 0.4. More details about the layout of the environment, the positions of these subgoals, and the initiation functions are shown in Figure 4.

*(Figure: learning curves for GSP with different values of $\beta$, compared to DDQN, with standard errors shown. Even a small increase in $\beta$ allows GSP to leverage the longer-horizon estimates given by the subgoal values, making it learn much faster than DDQN. Once $\beta$ is at 1, where it fully bootstraps off of potentially suboptimal subgoal values, GSP still learns quickly but levels off at a suboptimal value, as expected.)*

### 5.2 Experiment 1: Comparing DDQN and GSP with Pre-learned Models

We first investigate the utility of the models after they have been learned in a pre-training phase. The models use the same updates as they would when being learned online, and are not perfectly accurate. Pre-training the model allows us to ask: if the GSP agent had previously learned a model in the environment—or had offline data to train its model—can it leverage it to learn faster now? One of the primary goals of model-based RL is precisely this re-use, and so it is natural to start in a setting mimicking this use-case. We assume the GSP agent can do many steps of background planning, so that $\tilde{v}$ is effectively computed in early learning; this is reasonable as we only need to do value iteration for 9 subgoals, which is fast. We test GSP with several values of $\beta$.

We see in Figure 4 that GSP learns much faster than DDQN, and reaches the same level of performance. This is the result we should expect—GSP gets to leverage a pre-trained model, after all—but it is an important sanity check that using models in this new way is effective. Of particular note is that even a small increase of $\beta$ above 0 (which is DDQN) provides the learning-speed boost without resulting in suboptimal performance. Likely, in early learning, the suboptimal subgoal values provide a coarse direction to follow, to more quickly update the action-values, which are then refined with more learning. For larger $\beta$, we similarly get fast initial learning, but it plateaus at a more suboptimal point. For $\beta$ very close to zero, performance is more like DDQN, but even for such a small $\beta$ we get improvements.

To further investigate the hypothesis that GSP more quickly changes its value function early in learning, we visualize the value functions for both GSP and DDQN over time in Figure 5. After 2000 steps, they are not yet that different, because there are only four replay updates on each step and it takes time to visit the state-space and update values by bootstrapping off of subgoal values. By step 6000, though, GSP already has some of the structure of the problem, whereas DDQN has simply pushed down many of its values (darker blue).

### 5.3 Accuracy of the Learned Models

One potential benefit of GSP is that the models themselves may be easier to learn, because we can leverage standard value function learning algorithms. We visualize the models learned for the previous experiment, as well as the resulting $\tilde{v}$, with details about model learning in Appendix E.

In Figure 6 we see that the learned state-to-subgoal models accurately capture the structure. Each plot shows the learned state-to-subgoal model for one subgoal, visualized only over its initiation set. We can see larger discount and reward values predicted based on reachability. However, the models are not perfect. We measured model error and found it reasonable but not very near zero (see Appendix E). This result is actually encouraging: inaccuracies in the model do not prevent useful planning.

It is informative to visualize $\tilde{v}$. We can see in Figure 5 that the general structure is correct, matching the optimal path, but that it indeed looks suboptimal compared to the final values computed by DDQN in Figure 5. This inaccuracy is likely due both to some inaccuracy in the models, as well as the fact that subgoal placement is not optimal. This explains why GSP has lower values particularly in states near the bottom, likely skewed downwards by the values of nearby subgoals.

Finally, we test the impact on learning of using less accurate models. After all, the agent will want to start using its model as soon as possible, rather than waiting for it to become more accurate. We ran GSP using models learned online, using only 50k, 75k and 100k time steps to learn the models. We then froze the models and allowed GSP to learn with them. We can see in Figure 7 that learning with too inaccurate a model—with 50k—fails, but already with 75k performance improves considerably, and with 100k we are nearly at the same level of performance as with the pre-learned models. This result highlights that it should be feasible to learn and use these models in GSP, all online.

### 5.4 Experiment 2: Adapting in Nonstationary PinBall

Now we consider another typical use-case for model-based RL: quickly adapting to changes in the environment. We let the agent learn in PinBall for 50k steps, and then switch the goal to a new location for another 50k steps. Goal information is never given to the agent, so it has to visit the old goal, realize it is no longer rewarding, and re-explore to find the new goal. This non-stationary setting is harder for DDQN, so we use a simpler configuration for PinBall, shown in Figure 8.

We can leverage the idea of exploration bonuses, introduced in Dyna-Q+ (sutton2018reinforcement). Exploration bonuses grow with how long ago a state-action pair was last visited. This encourages the agent to revisit parts of the state-space that it has not seen recently, in case that part of the world has changed. For us, this corresponds to including a reward bonus in the planning and projection steps, added to $\tilde{r}_\gamma(g, g')$ and $r_\gamma(s, g)$ respectively. Because we have a small, finite set of subgoals, it is straightforward to leverage this idea that was designed for the tabular setting. We use a constant bonus if the visit count for a subgoal is zero, and 0 otherwise. When the world changes, the agent recognizes that it has changed, and resets all counts for the subgoals. Similarly, both agents (GSP and DDQN) clear their replay buffers.
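The subgoal-level exploration bonus can be sketched with a simple count table; the bonus size, names, and reset logic are illustrative assumptions.

```python
# Hedged sketch of Dyna-Q+-style exploration bonuses over subgoals: a bonus is
# added to the model reward during planning for subgoal pairs whose visit count
# is zero since the last detected change. All names and numbers are invented.
BONUS = 10.0
counts = {}                                   # (g, g') -> visits since change

def bonus(g, g2):
    return BONUS if counts.get((g, g2), 0) == 0 else 0.0

def planning_reward(r_model, g, g2):
    # reward used in the planning update: model reward plus exploration bonus
    return r_model + bonus(g, g2)

def on_world_change():
    counts.clear()                            # reset all counts, as in the text

counts[("g0", "g1")] = 3
visited_reward   = planning_reward(-2.0, "g0", "g1")   # visited: no bonus
unvisited_reward = planning_reward(-2.0, "g0", "g2")   # unvisited: bonus added
on_world_change()                                      # change detected: all bonuses return
```

Because the bonus lives at the subgoal level, a handful of counts suffices even though the underlying state space is continuous.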

The GSP agent can recognize the world has changed, but not how it has changed. It has to update its models with experience. The state-to-subgoal models and subgoal-to-subgoal models local to the previous terminal state location and the new one need to change, but the rest of the models are actually already accurate. The agent can leverage this existing accuracy.

In Figure 8, we can see both GSP and DDQN drop in performance when the environment changes, with GSP recovering much more quickly. It is always possible that an inaccurate model might actually make re-learning slower, by reinforcing incorrect values from the model. Here, though, updating these local models is fast, allowing the subgoal values to also be updated quickly. Though not shown in the plot, GSP without exploration bonuses performs poorly: its model causes it to avoid visiting the new goal region, because the value in that bottom corner is low, preventing the model from updating.

## 6 Conclusion

In this paper we introduced a new planning framework, called Goal-Space Planning (GSP). The key idea is to plan in a much smaller space of subgoals, and use these (high-level) subgoal values to update state values using subgoal-conditioned models. We show, in the PinBall environment, that 1) the subgoal-conditioned models can be accurately learned using standard value estimation algorithms and 2) GSP can significantly improve speed of learning, over Double DQN. The formalism avoids learning transition dynamics and iterating models, two of the sources of failure in previous model-based RL algorithms. GSP provides a new approach to incorporate background planning to improve action-value estimates, with minimalist, local models and computationally efficient planning.

This work introduces a new formalism, and many new technical questions along with it. We have only tested GSP with pre-learned models and assumed a given set of subgoals. Our initial experiments learning the models online, from scratch, indicate that GSP can get similar learning-speed boosts. Using a simple recency buffer, however, accumulates transitions only along the optimal trajectory, sometimes making the models highly inaccurate part-way through learning and causing GSP to fail. An important next step is to incorporate smarter strategies, such as curating the replay buffer, to learn these models online. The other critical open question is subgoal discovery. We somewhat randomly selected subgoals across the PinBall environment, with a successful outcome; such an approach is unlikely to work in many environments. In general, option discovery and subgoal discovery remain open questions. One utility of this work is that it could help narrow the scope of the discovery question, to that of finding abstract subgoals that help the agent plan more efficiently.

## References

## Appendix A Proofs for the Deterministic Policy Evaluation Setting

We provide proofs here for Section 3. We assume throughout that the environment discount is a constant $\gamma < 1$ for every step in an episode, until termination when it is zero. The below results can be extended to the case where $\gamma = 1$, using the standard strategy for the stochastic shortest path problem setting.

First, we want to show that, given the subgoal models $\tilde{r}_\gamma$ and $\tilde{\Gamma}$, we can guarantee that the update for the subgoal values $\tilde{v}$ will converge. Recall that $\bar{\mathcal{G}}$ is the augmented goal space that includes the terminal state $\bot$. This terminal state is not a subgoal, since it is not a real state, but it is key for appropriate planning.

###### Lemma 1.

Assume that we have a deterministic MDP, a deterministic policy $\pi$, $\gamma < 1$, a discrete set of subgoals $\mathcal{G}$, and that we iteratively update $\tilde{v}$ with the dynamic programming update

$$\tilde{v}_{t+1}(g) = \tilde{r}_\gamma(g, g') + \tilde{\Gamma}(g, g')\, \tilde{v}_t(g') \quad \text{for } g' \text{ the first element of } \bar{\mathcal{G}} \text{ reached from } g \text{ under } \pi \tag{6}$$

for all $g \in \mathcal{G}$, starting from an arbitrary (finite) initialization $\tilde{v}_0$, with $\tilde{v}(\bot)$ fixed at zero. Then $\tilde{v}_t$ converges to a fixed point.

###### Proof.

To analyze this as a matrix update, we need to extend $\tilde{\Gamma}$ to include an additional row transitioning from $\bot$. This row is all zeros, because the value in the terminal state is always fixed at zero. Note that there are ways to avoid introducing terminal states, using transition-based discounting (white2017unifying), but for this work it is actually simpler to explicitly reason about them and reaching them from subgoals.

To show this we simply need to ensure that $\tilde{\Gamma}$ is a substochastic matrix. Recall that

$$\tilde{\Gamma}(g, g') = \gamma_1 \gamma_2 \cdots \gamma_k$$

where $k$ is the number of steps from $g$ to $g'$ under $\pi$, and $\gamma_i = 0$ if step $i$ terminates the episode and otherwise equals $\gamma$, the environment discount. If it is substochastic, then $\|\tilde{\Gamma}\|_\infty \le \gamma < 1$. Consequently, the Bellman operator

$$(T\tilde{v})(g) = \tilde{r}_\gamma(g, g') + \tilde{\Gamma}(g, g')\, \tilde{v}(g')$$

is a contraction, because $\|T\tilde{v} - T\tilde{u}\|_\infty \le \|\tilde{\Gamma}\|_\infty \|\tilde{v} - \tilde{u}\|_\infty$.

Because $\gamma < 1$, then either the trajectory from $g$ immediately terminates in $\bot$, giving $\tilde{\Gamma}(g, \cdot) = 0$. Or, it does not immediately terminate, and $\tilde{\Gamma}(g, g') \le \gamma$ because $k \ge 1$. Therefore, if $\gamma < 1$, then $\|\tilde{\Gamma}\|_\infty \le \gamma < 1$.

∎
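To make the update concrete, here is a minimal numeric sketch of the dynamic programming update in Lemma 1 on a hypothetical three-subgoal chain. The chain, rewards, and step counts below are illustrative, not from the paper:

```python
import numpy as np

# Hypothetical deterministic chain of subgoals: g0 -> g1 -> g2 -> terminal.
# r_gamma[g] is the discounted reward accumulated from subgoal g to the next
# subgoal under the policy; Gamma[g] is the accumulated discount.
gamma = 0.9
r_gamma = np.array([1.0, 0.5, 2.0])
Gamma = np.array([gamma**2, gamma, 0.0])  # 2 steps, 1 step, termination (zero)
next_goal = [1, 2, 3]          # index 3 is the terminal state

v = np.zeros(4)                # v[3] (terminal value) stays fixed at zero
for _ in range(100):           # repeated sweeps of the update in Equation 6
    for g in range(3):
        v[g] = r_gamma[g] + Gamma[g] * v[next_goal[g]]

# The fixed point matches backward substitution along the chain.
v2 = 2.0                       # terminates after g2, so no bootstrap term
v1 = 0.5 + gamma * v2
v0 = 1.0 + gamma**2 * v1
assert np.allclose(v[:3], [v0, v1, v2])
```

Because the per-subgoal discounts are all at most $\gamma < 1$, the sweeps contract to this unique fixed point regardless of the initialization of `v`.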

###### Proposition 1.

For a deterministic MDP, a deterministic policy $\pi$, and a discrete set of subgoals $\mathcal{G}$ that are all reached by $\pi$ in the MDP, given the $\tilde{v}$ obtained from Equation 6, if we set

$$v(s) = \tilde{r}_\gamma(s, g) + \tilde{\Gamma}(s, g)\, \tilde{v}(g) \quad \text{for } g \text{ the first subgoal in } \bar{\mathcal{G}} \text{ reached from } s \text{ under } \pi \tag{7}$$

for all states $s$, then we get that $v = v_\pi$.

###### Proof.

For a deterministic environment and deterministic policy this result is straightforward. The term $\tilde{\Gamma}(s, g)$ is nonzero only if $g$ is on the trajectory from $s$ when the policy is executed. The term $\tilde{r}_\gamma(s, g)$ consists of deterministic (discounted) rewards, and $\tilde{v}(g)$ is the true value from $g$, as shown in Lemma 1 (namely $\tilde{v}(g) = v_\pi(g)$). The subgoal $g$ is the closest state on the trajectory from $s$, and $\tilde{\Gamma}(s, g) = \gamma^k$ where $k$ is the number of steps from $s$ to $g$. ∎
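A quick numeric instance of this projection, with illustrative values: if a state reaches its first subgoal in $k$ steps, its value is the model reward plus the discounted subgoal value.

```python
import math

gamma = 0.9
v_g = 2.0          # converged subgoal value for the first subgoal g reached
k = 2              # steps from s to g under the policy (illustrative)
r_gamma_sg = 0.3   # discounted reward accumulated from s to g
Gamma_sg = gamma ** k

v_s = r_gamma_sg + Gamma_sg * v_g   # the projection in Equation 7
assert math.isclose(v_s, 0.3 + 0.81 * 2.0)
```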

## Appendix B Proofs for the General Control Setting

In this section we assume that $\gamma < 1$, to avoid some of the additional issues for handling proper policies. The same strategies apply to the stochastic shortest path setting with $\gamma = 1$, with additional assumptions.

###### Proposition 2.

[Convergence of Value Iteration in Goal-Space] Assuming that $\tilde{\Gamma}$ is a substochastic matrix, with $\tilde{v}$ initialized to an arbitrary value and fixing $\tilde{v}(\bot) = 0$ for all iterations, then iteratively sweeping through all $g \in \mathcal{G}$ with update

$$\tilde{v}(g) \leftarrow \max_{g' \in \bar{\mathcal{G}}} \left[ \tilde{r}_\gamma(g, g') + \tilde{\Gamma}(g, g')\, \tilde{v}(g') \right] \tag{8}$$

converges to a fixed point.

###### Proof.

We can use the same approach typically used for value iteration. For any $\tilde{v} \in \mathbb{R}^{|\bar{\mathcal{G}}|}$, we can define the operator

$$(T\tilde{v})(g) = \max_{g' \in \bar{\mathcal{G}}} \left[ \tilde{r}_\gamma(g, g') + \tilde{\Gamma}(g, g')\, \tilde{v}(g') \right]$$

First we can show that $T$ is a $\gamma$-contraction. Assume we are given any two vectors $\tilde{v}, \tilde{u} \in \mathbb{R}^{|\bar{\mathcal{G}}|}$. Notice that $\tilde{\Gamma}(g, g') \le \gamma$, because for our problem setting the discount is either equal to $\gamma$ or equal to zero at termination. Then we have that for any $g$

$$|(T\tilde{v})(g) - (T\tilde{u})(g)| \le \max_{g' \in \bar{\mathcal{G}}} \tilde{\Gamma}(g, g')\, |\tilde{v}(g') - \tilde{u}(g')| \le \gamma \|\tilde{v} - \tilde{u}\|_\infty$$

Since this is true for any $g$, it is true for the max over $g$, giving

$$\|T\tilde{v} - T\tilde{u}\|_\infty \le \gamma \|\tilde{v} - \tilde{u}\|_\infty$$

Because the operator is a contraction, since $\gamma < 1$, we know by the Banach Fixed-Point Theorem that the fixed-point exists and is unique. ∎
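The sweep in Proposition 2 can be sketched as synchronous value iteration over a small goal-space model. The model below is randomly generated purely for illustration; every entry of $\tilde{\Gamma}$ bounded by $\gamma$ and a zero terminal row is all the contraction argument uses:

```python
import numpy as np

gamma = 0.9
n = 4                                   # 3 subgoals + terminal (index 3)
rng = np.random.default_rng(0)
r_gamma = rng.uniform(0.0, 1.0, size=(n, n))
Gamma = np.full((n, n), gamma)          # every entry bounded by gamma
Gamma[3, :] = 0.0                       # terminal row is all zeros
r_gamma[3, :] = 0.0

v = np.zeros(n)
for _ in range(300):                    # Equation 8, applied synchronously
    v = np.max(r_gamma + Gamma * v[None, :], axis=1)
    v[3] = 0.0                          # terminal value fixed at zero

# v is (numerically) a fixed point of the max-operator.
v_check = np.max(r_gamma + Gamma * v[None, :], axis=1)
v_check[3] = 0.0
assert np.allclose(v, v_check)
```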

Now we analyze the update to the main policy, which incorporates the subgoal value estimates into the bootstrap target. We assume we have a finite number of state-action pairs $n = |\mathcal{S}||\mathcal{A}|$, with parameterized action-values $q_\theta$ represented as a vector in $\mathbb{R}^n$ with one entry per state-action pair. Value iteration to find $q_*$ corresponds to updating with the Bellman optimality operator

$$(Tq)(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} q(s', a') \tag{9}$$

On each step, for the current $\theta_t$, if we assume the parameterized function class can represent $Tq_{\theta_t}$, then we can reason about the iterates $\theta_{t+1}$ obtained when minimizing the distance between $q_{\theta_{t+1}}$ and $Tq_{\theta_t}$, with

$$q_{\theta_{t+1}} = Tq_{\theta_t}$$

Under function approximation, we do not simply update a table of values, but we can get this equality by minimizing $\|q_{\theta_{t+1}} - Tq_{\theta_t}\|_\infty$ until we have zero Bellman error. Note that $Tq_* = q_*$, by definition.

In this *realizability* regime, we can reason about the iterates produced by value iteration. The convergence rate is dictated by $\gamma$, as is well known, because

$$\|q_{\theta_{t+1}} - q_*\|_\infty = \|Tq_{\theta_t} - Tq_*\|_\infty \le \gamma \|q_{\theta_t} - q_*\|_\infty$$

Specifically, if we assume $|r(s, a)| \le r_{\max}$, then we can use the fact that 1) the maximal return is no greater than $\frac{r_{\max}}{1 - \gamma}$, and 2) for any initialization no larger in magnitude than this maximal return we have that $\|q_{\theta_0} - q_*\|_\infty \le \frac{2 r_{\max}}{1 - \gamma}$. Therefore, we get that

$$\|q_{\theta_t} - q_*\|_\infty \le \gamma^t \|q_{\theta_0} - q_*\|_\infty$$

and so after $t$ iterations we have

$$\|q_{\theta_t} - q_*\|_\infty \le \gamma^t \frac{2 r_{\max}}{1 - \gamma}$$
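As a sanity check on this rate, one can run tabular value iteration on a small random MDP and verify that the error shrinks at least as fast as $\gamma^t$ times the initial error. The MDP below is randomly generated, purely for illustration:

```python
import numpy as np

gamma = 0.9
n_s, n_a = 4, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a] over next states
r = rng.uniform(-1.0, 1.0, size=(n_s, n_a))

def bellman_opt(q):
    # (Tq)(s, a) = r(s, a) + gamma * E[max_a' q(S', a')]
    return r + gamma * P @ q.max(axis=1)

q_star = np.zeros((n_s, n_a))
for _ in range(2000):                   # effectively the fixed point
    q_star = bellman_opt(q_star)

q = np.zeros((n_s, n_a))
err0 = np.abs(q - q_star).max()
for t in range(1, 40):
    q = bellman_opt(q)
    # contraction: error after t steps is at most gamma^t * initial error
    assert np.abs(q - q_star).max() <= gamma**t * err0 + 1e-9
```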

We can use the exact same strategy to show convergence of value iteration under our subgoal-value bootstrapping update. Let $\tilde{v}(s) = \max_{g \in \mathcal{G}} \tilde{r}_\gamma(s, g) + \tilde{\Gamma}(s, g)\, \tilde{v}(g)$, assuming the subgoal values $\tilde{v}$ are a given, fixed function. Then the modified Bellman optimality operator is

$$(T'q)(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \left[ (1 - \beta) \max_{a'} q(s', a') + \beta\, \tilde{v}(s') \right] \tag{10}$$

for a weighting $\beta \in (0, 1]$.

###### Proposition 3 (Convergence rate of tabular value iteration under subgoal bootstrapping).

The fixed point $q'_*$ of $T'$ exists and is unique. Further, for $\beta \in (0, 1]$, the corresponding operator $T'$, and $q_0$ initialized such that $\|q_0\|_\infty \le \frac{r_{\max}}{(1 - \gamma)(1 - \gamma(1 - \beta))}$, the value iteration update with subgoal bootstrapping, $q_{t+1} = T' q_t$, satisfies

$$\|q_t - q'_*\|_\infty \le (\gamma(1 - \beta))^t \frac{2 r_{\max}}{(1 - \gamma)(1 - \gamma(1 - \beta))}$$

###### Proof.

First we can show that $T'$ is a $\gamma(1 - \beta)$-contraction. Assume we are given any two vectors $q, u \in \mathbb{R}^n$. Notice that the discount is bounded by $\gamma$, because for our problem setting it is either equal to $\gamma$ or equal to zero at termination. Then we have that for any $(s, a)$

$$|(T'q)(s, a) - (T'u)(s, a)| \le \gamma(1 - \beta) \sum_{s'} P(s' \mid s, a) \left| \max_{a'} q(s', a') - \max_{a'} u(s', a') \right| \le \gamma(1 - \beta) \|q - u\|_\infty$$

Since this is true for any $(s, a)$, it is true for the max, giving

$$\|T'q - T'u\|_\infty \le \gamma(1 - \beta) \|q - u\|_\infty$$

Because the operator is a contraction, since $\gamma(1 - \beta) < 1$, we know by the Banach Fixed-Point Theorem that the fixed-point exists and is unique.

Now we can also use the contraction property for the convergence rate. Notice first that we can consider $r(s, a) + \gamma \beta \sum_{s'} P(s' \mid s, a)\, \tilde{v}(s')$ as the new reward, with maximum value $r_{\max} + \gamma \beta \frac{r_{\max}}{1 - \gamma} \le \frac{r_{\max}}{1 - \gamma}$. Further, the new discount is $\gamma(1 - \beta)$. Consequently, the maximal return is no greater than $\frac{r_{\max}}{(1 - \gamma)(1 - \gamma(1 - \beta))}$.

∎
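The contraction factor in this proof can be checked numerically. The sketch below assumes a $\beta$-weighted bootstrap form for the modified operator, in which the next-state bootstrap mixes the learner's own max-value with fixed subgoal-derived values; the random MDP and the fixed $\tilde{v}$ are illustrative:

```python
import numpy as np

gamma, beta = 0.99, 0.5
n_s, n_a = 5, 2
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
r = rng.uniform(-1.0, 1.0, size=(n_s, n_a))
v_tilde = rng.uniform(-1.0, 1.0, size=n_s)   # fixed subgoal-derived values

def modified_bellman(q):
    # Only the (1 - beta) portion of the bootstrap depends on q, so the
    # effective contraction factor is gamma * (1 - beta).
    boot = (1 - beta) * q.max(axis=1) + beta * v_tilde
    return r + gamma * P @ boot

q = rng.normal(size=(n_s, n_a))
u = rng.normal(size=(n_s, n_a))
lhs = np.abs(modified_bellman(q) - modified_bellman(u)).max()
rhs = gamma * (1 - beta) * np.abs(q - u).max()
assert lhs <= rhs + 1e-12
```

Here $\gamma(1 - \beta) = 0.495$, so the modified operator contracts roughly twice as fast per iteration as the standard operator with $\gamma = 0.99$.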

This rate is dominated by $(\gamma(1 - \beta))^t$, and for $\gamma$ near 1 gives a much faster convergence rate than $\gamma^t$. We can determine after how many iterations this term overcomes the increase in the upper bound on the return. In other words, we want to know how big $t$ needs to be to get