Near-Optimal Representation Learning for Hierarchical Reinforcement Learning

by   Ofir Nachum, et al.

We study the problem of representation learning in goal-conditioned hierarchical reinforcement learning. In such hierarchical structures, a higher-level controller solves tasks by iteratively communicating goals which a lower-level policy is trained to reach. Accordingly, the choice of representation -- the mapping of observation space to goal space -- is crucial. To study this problem, we develop a notion of sub-optimality of a representation, defined in terms of expected reward of the optimal hierarchical policy using this representation. We derive expressions which bound the sub-optimality and show how these expressions can be translated to representation learning objectives which may be optimized in practice. Results on a number of difficult continuous-control tasks show that our approach to representation learning yields qualitatively better representations as well as quantitatively better hierarchical policies, compared to existing methods (see videos at






1 Introduction

Hierarchical reinforcement learning has long held the promise of extending the successes of existing reinforcement learning (RL) methods (Gu et al., 2017; Schulman et al., 2015; Lillicrap et al., 2015) to more complex, difficult, and temporally extended tasks (Parr & Russell, 1998; Sutton et al., 1999; Barto & Mahadevan, 2003). Recently, goal-conditioned hierarchical designs, in which higher-level policies communicate goals to lower-levels and lower-level policies are rewarded for reaching states (i.e. observations) which are close to these desired goals, have emerged as an effective paradigm for hierarchical RL (Nachum et al., 2018; Levy et al., 2017; Vezhnevets et al., 2017). In this hierarchical design, representation learning – the mapping between observation space and goal space – determines the types of sub-tasks the lower-level can be instructed to perform, and is therefore a critical component determining the success or failure of a hierarchical agent.

Previous works have largely studied two ways to choose the representation: learning the representation end-to-end together with the higher- and lower-level policies (Vezhnevets et al., 2017), or using the state space as-is for the goal space (i.e., the goal space is a subspace of the state space) (Nachum et al., 2018; Levy et al., 2017). The former approach is appealing, but in practice often produces poor results (see Nachum et al. (2018) and our own experiments), since the resulting representation is under-defined; i.e., not all possible sub-tasks are expressible as goals in the space. On the other hand, fixing the representation to be the full state means that no information is lost, but this choice is difficult to scale to higher dimensions. For example, if the state observations are entire images, the higher-level must output target images for the lower-level, which can be very difficult.

We instead study how unsupervised objectives can be used to train a representation that is more concise than the full state, but also not as under-determined as in the end-to-end approach. In order to do so in a principled manner, we propose a measure of sub-optimality of a given representation. This measure aims to answer the question: How much does using the learned representation in place of the full representation cause us to lose, in terms of expected reward, against the optimal policy? This question is important, because a useful representation will compress the state, hopefully making the learning problem easier. At the same time, the compression might cause the representation to lose information, making the optimal policy impossible to express. It is therefore critical to understand how lossy a learned representation is, not in terms of reconstruction, but in terms of the ability to represent near-optimal policies on top of this representation.

Our main theoretical result shows that, for a particular choice of representation learning objective, we can learn representations for which the return of the hierarchical policy approaches the return of the optimal policy within a bounded error. This suggests that, if the representation is learned with a principled objective, the ‘lossy-ness’ in the resulting representation should not cause a decrease in overall task performance. We then formulate a representation learning approach that optimizes this bound. We further extend our result to the case of temporal abstraction, where the higher-level controller only chooses new goals at fixed time intervals. To our knowledge, this is the first result showing that hierarchical goal-setting policies with learned representations and temporal abstraction can achieve bounded sub-optimality against the optimal policy. We further observe that the representation learning objective suggested by our theoretical result closely resembles several other recently proposed objectives based on mutual information (van den Oord et al., 2018; Ishmael Belghazi et al., 2018; Hjelm et al., 2018), suggesting an intriguing connection between mutual information and goal representations for hierarchical RL. Results on a number of difficult continuous-control navigation tasks show that our principled representation learning objective yields good qualitative and quantitative performance compared to existing methods.

2 Framework

Following previous work (Nachum et al., 2018), we consider a two-level hierarchical policy on an MDP $\mathcal{M} = (S, A, R, T)$, in which the higher-level policy modulates the behavior of a lower-level policy by choosing a desired goal state and rewarding the lower-level policy for reaching this state. While prior work has used a sub-space of the state space as goals (Nachum et al., 2018), in more general settings, some type of state representation is necessary. That is, consider a state representation function $f: S \to \mathbb{R}^d$. A two-level hierarchical policy on $\mathcal{M}$ is composed of a higher-level policy $\pi_{\mathrm{hi}}(g \mid s)$, where $g \in G = \mathbb{R}^d$ and $G$ is the goal space, that samples a high-level action (or goal) $g_t \sim \pi_{\mathrm{hi}}(g \mid s_t)$ every $c$ steps, for fixed $c$. A non-stationary, goal-conditioned, lower-level policy $\pi_{\mathrm{lo}}(a \mid s_t, g_t, s_{t+k}, k)$ then translates these high-level actions into low-level actions $a_{t+k} \in A$ for $k \in [0, c-1]$. The process is then repeated, beginning with the higher-level policy selecting another goal according to $s_{t+c}$. The policy $\pi_{\mathrm{lo}}$ is trained using a goal-conditioned reward; e.g. the reward of a transition $g, s, s'$ is $-D(f(s'), g)$, where $D$ is a distance function.

Figure 1: The hierarchical design we consider.

In this work we adopt a slightly different interpretation of the lower-level policy and its relation to $\pi_{\mathrm{hi}}$. Every $c$ steps, the higher-level policy chooses a goal $g_t$ based on a state $s_t$. We interpret this state-goal pair as being mapped to a non-stationary policy $\pi(a \mid s_{t+k}, k)$, $\pi \in \Pi$, where $\Pi$ denotes the set of all possible $c$-step policies acting on $\mathcal{M}$. We use $\Psi$ to denote this mapping from $S \times G$ to $\Pi$. In other words, on every $c$-th step, we encounter some state $s_t \in S$. We use the higher-level policy to sample a goal $g_t \sim \pi_{\mathrm{hi}}(g \mid s_t)$ and translate this to a policy $\pi_t = \Psi(s_t, g_t)$. We then use $\pi_t$ to sample actions $a_{t+k} \sim \pi_t(a \mid s_{t+k}, k)$ for $k \in [0, c-1]$. The process is then repeated from $s_{t+c}$.

Although the difference in this interpretation is subtle, the introduction of $\Psi$ is crucial for our subsequent analysis. The communication of $g_t$ is no longer as a goal which $\pi_{\mathrm{hi}}$ desires to reach, but rather more precisely, as an identifier to a low-level behavior which $\pi_{\mathrm{hi}}$ desires to induce or activate.
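The control flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1-D chain environment, the rule used by `pi_hi`, and the greedy `psi` are all hypothetical stand-ins (none of them come from the paper), and the representation is taken to be the identity so that goals live directly in state space.

```python
def rollout(env_step, s0, pi_hi, psi, c, horizon):
    """Run a two-level policy: a new goal every c steps, actions from psi(s_t, g_t)."""
    s, trajectory = s0, []
    for _ in range(0, horizon, c):
        g = pi_hi(s)          # higher level picks a goal g_t from the current state s_t
        pi_t = psi(s, g)      # Psi maps the (state, goal) pair to a c-step policy
        for k in range(c):
            a = pi_t(s, k)    # non-stationary: the policy may depend on the step index k
            s_next = env_step(s, a)
            trajectory.append((s, g, a, s_next))
            s = s_next
    return trajectory


# Hypothetical toy problem: integer states on a line, actions -1 or +1.
def env_step(s, a):
    return s + a


def pi_hi(s):
    return s + 3              # assumed rule: ask to end up 3 cells to the right


def psi(s_t, g):
    # Assumed Psi: a greedy behavior that walks toward the goal.
    def pi(s, k):
        return 1 if s < g else -1
    return pi


traj = rollout(env_step, s0=0, pi_hi=pi_hi, psi=psi, c=3, horizon=6)
final_state = traj[-1][3]
```

In this sketch the goal is only an identifier that selects a low-level behavior: `rollout` never checks whether the goal was reached before asking the higher level for the next one.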

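The mapping $\Psi$ is usually expressed as the result of an RL optimization over $\Pi$; e.g.,

$$\Psi(s_t, g) = \arg\max_{\pi \in \Pi} \sum_{k=1}^{c} \gamma^{k-1} \, \mathbb{E}_{P_\pi(s_{t+k} \mid s_t)}\left[-D(f(s_{t+k}), g)\right], \qquad (1)$$

where we use $P_\pi(s_{t+k} \mid s_t)$ to denote the probability of being in state $s_{t+k}$ after following $\pi$ for $k$ steps starting from $s_t$. We will consider variations on this low-level objective in later sections. From Equation 1 it is clear how the choice of representation $f$ affects $\Psi$ (albeit indirectly).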
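As a small worked example of the goal-conditioned low-level objective, the sketch below scores one observed $c$-step segment against a goal. It assumes a discounted sum of negative distances in representation space; the representation `f` (keep the first two coordinates) and the squared Euclidean `D` are illustrative choices, not the ones used in the paper.

```python
def goal_reward(f, D, s_next, g):
    """Goal-conditioned reward for reaching s_next under goal g: -D(f(s'), g)."""
    return -D(f(s_next), g)


def low_level_return(f, D, future_states, g, gamma):
    """Discounted sum of goal rewards over future_states = [s_{t+1}, ..., s_{t+c}]."""
    return sum(
        gamma ** k * goal_reward(f, D, s, g)   # k = 0 corresponds to s_{t+1}
        for k, s in enumerate(future_states)
    )


# Hypothetical choices for illustration only.
def f(s):
    return s[:2]                               # representation: first two coordinates


def D(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


segment = [(1, 0, 9), (2, 0, 9)]               # two observed next states (c = 2)
ret_reached = low_level_return(f, D, segment, g=(2, 0), gamma=0.5)
ret_missed = low_level_return(f, D, segment, g=(0, 0), gamma=0.5)
```

The third coordinate of each state never affects the return, which is the sense in which the choice of representation decides what the lower level can be asked to do.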

We will restrict the environment reward function $R$ to be defined only on states. We use $R_{\max}$ to denote the maximal absolute reward: $R_{\max} = \sup_{s \in S} |R(s)|$.

3 Hierarchical Policy Sub-Optimality

In the previous section, we introduced two-level policies where a higher-level policy $\pi_{\mathrm{hi}}$ chooses goals $g$, which are translated to lower-level behaviors via $\Psi$. The introduction of this hierarchy leads to a natural question: How much do we lose by learning $\pi_{\mathrm{hi}}$ which is only able to act on $\mathcal{M}$ via $\Psi$? The choice of $\Psi$ restricts the type and number of lower-level behaviors that the higher-level policy can induce. Thus, the optimal policy on $\mathcal{M}$ is potentially not expressible by $\pi_{\mathrm{hi}}$. Despite the potential lossy-ness of $\Psi$, can one still learn a hierarchical policy which is near-optimal?

To approach this question, we introduce a notion of sub-optimality with respect to the form of $\Psi$: Let $\pi^*_{\mathrm{hi}}(g \mid s, \Psi)$ be the optimal higher-level policy acting on $G$ and using $\Psi$ as the mapping from $G$ to low-level behaviors. Let $\pi^*_{\mathrm{hier}}$ be the corresponding full hierarchical policy on $\mathcal{M}$. We will compare $\pi^*_{\mathrm{hier}}$ to an optimal hierarchical policy $\pi^*$ agnostic to $\Psi$. To define $\pi^*$ we begin by introducing an optimal higher-level policy $\pi^{**}_{\mathrm{hi}}(\pi \mid s)$ agnostic to $\Psi$; i.e. every $c$ steps, $\pi^{**}_{\mathrm{hi}}$ samples a low-level behavior $\pi \in \Pi$ which is applied to $\mathcal{M}$ for the following $c$ steps. In this way, $\pi^{**}_{\mathrm{hi}}$ may express all possible low-level behaviors. We then denote $\pi^*$ as the full hierarchical policy resulting from $\pi^{**}_{\mathrm{hi}}$.

We would like to compare $\pi^*_{\mathrm{hier}}$ to $\pi^*$, and we do so in terms of state values. Let $V^{\pi}(s)$ be the future value achieved by a policy $\pi$ starting at state $s$. We define the sub-optimality of $\Psi$ as

$$\mathrm{SubOpt}(\Psi) = \sup_{s \in S} V^{\pi^*}(s) - V^{\pi^*_{\mathrm{hier}}}(s). \qquad (2)$$

The state values $V^{\pi^*_{\mathrm{hier}}}(s)$ are determined by the form of $\Psi$, which is in turn determined by the choice of representation $f$. However, none of these relationships are direct. It is unclear how a change in $f$ will result in a change to the sub-optimality. In the following section, we derive a series of bounds which establish a more direct relationship between $\mathrm{SubOpt}(\Psi)$ and $f$. Our main result will show that if one defines $\Psi$ as a slight modification of the traditional objective given in Equation 1, then one may translate sub-optimality of $\Psi$ to a practical representation learning objective for $f$.

4 Good Representations Lead to Bounded Sub-Optimality

In this section, we provide proxy expressions that bound the sub-optimality induced by a specific choice of $\Psi$. Our main result is Claim 4, which connects the sub-optimality of $\Psi$ to both goal-conditioned policy objectives (i.e., the objective in Equation 1) and representation learning (i.e., an objective for the function $f$).

4.1 Single-Step ($c = 1$) and Deterministic Policies

For ease of presentation, we begin by presenting our results in the restricted case of $c = 1$ and deterministic lower-level policies. In this setting, the class of low-level policies $\Pi$ may be taken to be simply $A$, where $a \in \Pi$ corresponds to a policy which always chooses action $a$. There is no temporal abstraction: The higher-level policy chooses a high-level action $g \in G$ at every step, which is translated via $\Psi$ to a low-level action $a \in A$. Our claims are based on quantifying how many of the possible low-level behaviors (i.e., all possible state to state transitions) can be produced by $\Psi$ for different choices of $g$. To quantify this, we make use of an auxiliary inverse goal model $\varphi(s, a)$, which aims to predict which goal $g$ will cause $\Psi$ to yield an action $\tilde{a} = \Psi(s, g)$ that induces a next state distribution $P(s' \mid s, \tilde{a})$ similar to $P(s' \mid s, a)$. (In a deterministic, $c = 1$ setting, $\varphi$ may be seen as a state-conditioned action abstraction mapping $A \to G$.) We have the following theorem, which bounds the sub-optimality in terms of total variation divergences between $P(s' \mid s, a)$ and $P(s' \mid s, \tilde{a})$:

Theorem 1.

If there exists $\varphi: S \times A \to G$ such that,

$$\sup_{s \in S, a \in A} D_{\mathrm{TV}}\left(P(s' \mid s, a) \,\|\, P(s' \mid s, \Psi(s, \varphi(s, a)))\right) \le \epsilon, \qquad (3)$$

then $\mathrm{SubOpt}(\Psi) \le C\epsilon$, where $C = \frac{2\gamma}{(1-\gamma)^2} R_{\max}$.

Proof. See Appendices A and B for all proofs.

Theorem 1 allows us to bound the sub-optimality of $\Psi$ in terms of how recoverable the effect of any action in $A$ is, in terms of transition to the next state. One way to ensure that effects of actions in $A$ are recoverable is to have an invertible $\Psi$. That is, if there exists $\varphi: S \times A \to G$ such that $\Psi(s, \varphi(s, a)) = a$ for all $s, a$, then the sub-optimality of $\Psi$ is 0.
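To make the single-step statement concrete, the following sketch computes the sub-optimality exactly on a tiny tabular MDP and compares it with a bound of the form $C\epsilon$. Everything about the example is an assumption made for illustration: the 3-state cycle, the two-goal `psi` that cannot select the third action, and the constant $C = 2\gamma R_{\max}/(1-\gamma)^2$ as read from Theorem 1. The inverse goal model is chosen optimally by brute force rather than learned.

```python
def value_iteration(P, R, gamma, allowed, iters=2000):
    """State values when only the actions in allowed[s] may be taken in state s."""
    V = [0.0] * len(R)
    for _ in range(iters):
        V = [
            R[s] + gamma * max(
                sum(p * V[s2] for s2, p in enumerate(P[s][a])) for a in allowed[s]
            )
            for s in range(len(R))
        ]
    return V


def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


n, gamma = 3, 0.9
R = [0.0, 0.0, 1.0]                     # reward defined on states only
R_max = max(abs(r) for r in R)


def one_hot(i):
    return [1.0 if j == i else 0.0 for j in range(n)]


# P[s][a] is the next-state distribution.
# a=0: stay; a=1: advance with probability 0.9; a=2: advance with probability 1.
P = []
for s in range(n):
    nxt = (s + 1) % n
    noisy = [0.9 * x + 0.1 * y for x, y in zip(one_hot(nxt), one_hot(s))]
    P.append([one_hot(s), noisy, one_hot(nxt)])

goals = [0, 1]


def psi(s, g):
    return g    # hypothetical Psi: goal index g selects action g, so a=2 is unreachable


V_star = value_iteration(P, R, gamma, [[0, 1, 2]] * n)
V_hier = value_iteration(P, R, gamma, [[psi(s, g) for g in goals] for s in range(n)])
sub_opt = max(v - w for v, w in zip(V_star, V_hier))

# Best possible inverse goal model: for each (s, a), the goal whose action matches best.
epsilon = max(
    min(tv(P[s][a], P[s][psi(s, g)]) for g in goals)
    for s in range(n) for a in range(3)
)
C = 2 * gamma / (1 - gamma) ** 2 * R_max    # assumed form of the constant
```

Here the unreachable action is imitated by the noisy one at a total variation cost of 0.1, and the resulting loss in value (about 0.18) sits far below $C\epsilon = 18$, so the bound holds but is loose on this example.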

However, in many cases it may not be desirable or feasible to have an invertible $\Psi$. Looking back at Theorem 1, we emphasize that its statement requires only the effect of any action to be recoverable. That is, for any $s \in S, a \in A$, we require only that there exist some $g \in G$ (given by $\varphi(s, a)$) which yields a similar next-state distribution. To this end, we have the following claim, which connects the sub-optimality of $\Psi$ to both representation learning and the form of the low-level objective.

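Claim 2.

Let $\rho(s)$ be a prior and $f, \varphi$ be so that, for $K(s' \mid s, a) \propto \rho(s') \exp(-D(f(s'), \varphi(s, a)))$,

$$\sup_{s \in S, a \in A} D_{\mathrm{KL}}\left(P(s' \mid s, a) \,\|\, K(s' \mid s, a)\right) \le \epsilon^2 / 8. \qquad (4)$$

(Here $K$ may be interpreted as the conditional $P(\mathrm{state} = s' \mid \mathrm{repr} = \varphi(s, a))$ of the joint distribution $P(\mathrm{state} = s') \, P(\mathrm{repr} = z \mid \mathrm{state} = s') = \frac{1}{Z} \rho(s') \exp(-D(f(s'), z))$ for normalization constant $Z$.)

If the low-level objective is defined as

$$\Psi(s, g) = \arg\max_{a \in A} \mathbb{E}_{P(s' \mid s, a)}\left[-D(f(s'), g) + \log \rho(s') - \log P(s' \mid s, a)\right], \qquad (5)$$

then the sub-optimality of $\Psi$ is bounded by $C\epsilon$.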

We provide an intuitive explanation of the statement of Claim 2. First, consider that the distribution $K(s' \mid s, a)$ appearing in Equation 4 may be interpreted as a dynamics model determined by $f$ and $\varphi$. By bounding the difference between the true dynamics $P(s' \mid s, a)$ and the dynamics $K(s' \mid s, a)$ implied by $f$ and $\varphi$, Equation 4 states that the representation $f$ should be chosen in such a way that dynamics in representation space are roughly given by $\varphi(s, a)$. This is essentially a representation learning objective for choosing $f$, and in Section 5 we describe how to optimize it in practice.

Moving on to Equation 5, we note that the form of $\Psi$ here is only slightly different than the one-step form of the standard goal-conditioned objective in Equation 1. Therefore, all together Claim 2 establishes a deep connection between representation learning (Equation 4), goal-conditioned policy learning (Equation 5), and sub-optimality. Specifically, if the low-level RL objective is expressed as in Equation 5, then to minimize the sub-optimality we need only optimize a representation learning objective based on Equation 4.
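The reading of Equation 4 as a dynamics model can be illustrated on a small discrete state set. The sketch below assumes the implied model has the form "prior times exponentiated negative distance", normalized over states; the four-state space, uniform prior, scalar representation, and scaled squared distance are hypothetical choices. It also checks Pinsker's inequality, which is the standard link between a KL bound of this kind and a total variation bound.

```python
import math


def implied_dynamics(states, rho, f, D, z):
    """K(s') proportional to rho(s') * exp(-D(f(s'), z)), with z playing the role of phi(s, a)."""
    weights = [r * math.exp(-D(f(s), z)) for s, r in zip(states, rho)]
    total = sum(weights)
    return [w / total for w in weights]


def kl(p, q):
    """KL divergence D_KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def tv(p, q):
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


# Hypothetical setup for illustration.
states = [0, 1, 2, 3]
rho = [0.25, 0.25, 0.25, 0.25]          # uniform prior over states


def f(s):
    return float(s)                      # scalar representation


def D(u, v):
    return 2.0 * (u - v) ** 2            # scaled squared distance


P_true = [0.0, 0.05, 0.9, 0.05]          # true next-state distribution for some (s, a)
K_good = implied_dynamics(states, rho, f, D, z=2.0)   # phi points at the likely next state
K_bad = implied_dynamics(states, rho, f, D, z=0.0)    # phi points elsewhere

kl_good = kl(P_true, K_good)
kl_bad = kl(P_true, K_bad)
```

A representation and inverse model that place $\varphi(s, a)$ near the representation of the likely next state give a small divergence, while a poor choice is penalized heavily, which is what makes the condition usable as a training signal.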

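4.2 Temporal Abstraction ($c \ge 1$) and General Policies

We now move on to presenting the same results in the fully general, temporally abstracted setting, in which the higher-level policy chooses a high-level action $g \in G$ every $c$ steps, which is transformed via $\Psi$ to a $c$-step lower-level behavior policy $\pi \in \Pi$. In this setting, the auxiliary inverse goal model $\varphi(s, \pi)$ is a mapping from $S \times \Pi$ to $G$ and aims to predict which goal $g$ will cause $\Psi$ to yield a policy $\tilde{\pi} = \Psi(s, g)$ that induces future state distributions $P_{\tilde{\pi}}(s_{t+k} \mid s_t)$ similar to $P_{\pi}(s_{t+k} \mid s_t)$, for $k \in [1, c]$. We weight the divergences between the distributions by weights $w_k = 1$ for $k < c$ and $w_k = (1-\gamma)^{-1}$ for $k = c$. We denote $\bar{w} = \sum_{k=1}^{c} \gamma^{k-1} w_k$. The analogue to Theorem 1 is as follows:

Theorem 3.

Consider a mapping $\varphi: S \times \Pi \to G$ and define $\epsilon_k: S \times \Pi \to \mathbb{R}_{\ge 0}$ for $k \in [1, c]$ as,

$$\epsilon_k(s_t, \pi) = D_{\mathrm{TV}}\left(P_{\pi}(s_{t+k} \mid s_t) \,\|\, P_{\Psi(s_t, \varphi(s_t, \pi))}(s_{t+k} \mid s_t)\right). \qquad (6)$$

If

$$\sup_{s_t \in S, \pi \in \Pi} \frac{1}{\bar{w}} \sum_{k=1}^{c} \gamma^{k-1} w_k \, \epsilon_k(s_t, \pi) \le \epsilon, \qquad (7)$$

then $\mathrm{SubOpt}(\Psi) \le C\epsilon$, where $C = \frac{2\gamma}{1-\gamma^c} R_{\max} \bar{w}$.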

For the analogue to Claim 2, we simply replace the single-step KL divergences and low-level rewards with a discounted weighted sum thereof:

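Claim 4.

Let $\rho(s)$ be a prior over $S$. Let $f, \varphi$ be such that,

$$\sup_{s_t \in S, \pi \in \Pi} \frac{1}{\bar{w}} \sum_{k=1}^{c} \gamma^{k-1} w_k \, D_{\mathrm{KL}}\left(P_{\pi}(s_{t+k} \mid s_t) \,\|\, K(s_{t+k} \mid s_t, \pi)\right) \le \epsilon^2 / 8, \qquad (8)$$

where $K(s_{t+k} \mid s_t, \pi) \propto \rho(s_{t+k}) \exp(-D(f(s_{t+k}), \varphi(s_t, \pi)))$.

If the low-level objective is defined as

$$\Psi(s_t, g) = \arg\max_{\pi \in \Pi} \sum_{k=1}^{c} \gamma^{k-1} w_k \, \mathbb{E}_{P_{\pi}(s_{t+k} \mid s_t)}\left[-D(f(s_{t+k}), g) + \log \rho(s_{t+k}) - \log P_{\pi}(s_{t+k} \mid s_t)\right], \qquad (9)$$

then the sub-optimality of $\Psi$ is bounded by $C\epsilon$.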

Claim 4 is the main theoretical contribution of our work. As in the previous claim, we have a strong statement, saying that if the low-level objective is defined as in Equation 9, then minimizing the sub-optimality may be done by optimizing a representation learning objective based on Equation 8.
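The weighting used in the temporally abstracted statements can be made concrete with a short sketch. It assumes the weights are 1 for every step before the last and $(1-\gamma)^{-1}$ for the final step $k = c$, with the normalizer defined as the discounted sum of the weights; these values are an assumption wherever this copy of the text does not show them. The function names are hypothetical.

```python
def weights(c, gamma):
    """Per-step weights: 1 for k < c, and 1 / (1 - gamma) for the final step k = c."""
    return [1.0] * (c - 1) + [1.0 / (1.0 - gamma)]


def w_bar(c, gamma):
    """Normalizer: sum over k = 1..c of gamma^(k-1) * w_k."""
    return sum(gamma ** k * w for k, w in enumerate(weights(c, gamma)))


def weighted_divergence(per_step_divergences, gamma):
    """(1 / w_bar) * sum_k gamma^(k-1) * w_k * eps_k for eps_1, ..., eps_c."""
    c = len(per_step_divergences)
    total = sum(
        gamma ** k * w * eps
        for k, (w, eps) in enumerate(zip(weights(c, gamma), per_step_divergences))
    )
    return total / w_bar(c, gamma)


example = weighted_divergence([0.1, 0.1, 0.1], gamma=0.5)
```

Under these assumed weights the normalizer telescopes to $1/(1-\gamma)$ for every $c$, so the extra weight on the last step accounts for the discounted tail that follows the $c$-step segment.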

5 Learning

We now have the mathematical foundations necessary to learn representations that are provably good for use in hierarchical RL. We begin by elaborating on how we translate Equation 8 into a practical training objective for $f$ and auxiliary $\varphi$ (as well as a practical parameterization of policies $\pi$ as input to $\varphi$). We then continue to describe how one may train a lower-level policy to match the objective presented in Equation 9. In this way, we may learn $f$ and the lower-level policy to directly optimize a bound on the sub-optimality of $\Psi$. Pseudocode for the full algorithm is presented in the Appendix (see Algorithm 1).

5.1 Learning Good Representations

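Consider a representation function $f_\theta: S \to \mathbb{R}^d$ and an auxiliary function $\varphi_\theta: S \times \Pi \to \mathbb{R}^d$, parameterized by vector $\theta$. In practice, these are separate neural networks $f_{\theta_1}$ and $\varphi_{\theta_2}$, with $\theta = [\theta_1, \theta_2]$.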


While the form of Equation 8 suggests to optimize a supremum over all $s_t$ and $\pi$, in practice we only have access to a replay buffer which stores experience $s_0, a_0, s_1, a_1, \dots$ sampled from our hierarchical behavior policy. Therefore, we propose to choose $s_t$ sampled uniformly from the replay buffer and use the subsequent $c$ actions $a_{t:t+c-1}$ as a representation of the policy $\pi$, where we use $a_{t:t+c-1}$ to denote the sequence $a_t, \dots, a_{t+c-1}$. Note that this is equivalent to setting the set of candidate policies $\Pi$ to $A^c$ (i.e., $\Pi$ is the set of $c$-step, deterministic, open-loop policies). This choice additionally simplifies the possible structure of the function approximator used for $\varphi_\theta$ (a standard neural net which takes in $s_t$ and $a_{t:t+c-1}$). Our proposed representation learning objective is thus,

$$J(\theta) = \mathbb{E}_{s_t, a_{t:t+c-1} \sim \mathrm{replay}}\left[J(\theta, s_t, a_{t:t+c-1})\right],$$

where $J(\theta, s_t, a_{t:t+c-1})$ will correspond to the inner part of the supremum in Equation 8.

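We now define the inner objective $J(\theta, s_t, a_{t:t+c-1})$. To simplify notation, we use $E_\theta(s', s, \pi) = \exp(-D(f_\theta(s'), \varphi_\theta(s, \pi)))$ and use $K_\theta(s' \mid s, \pi)$ as the distribution over $S$ such that $K_\theta(s' \mid s, \pi) \propto \rho(s') E_\theta(s', s, \pi)$. Equation 8 suggests the following learning objective on each $s_t, \pi \equiv a_{t:t+c-1}$:

$$J(\theta, s_t, \pi) = \sum_{k=1}^{c} \gamma^{k-1} w_k \, D_{\mathrm{KL}}\left(P_{\pi}(s_{t+k} \mid s_t) \,\|\, K_\theta(s_{t+k} \mid s_t, \pi)\right)$$

$$= B + \sum_{k=1}^{c} \gamma^{k-1} w_k \left( -\mathbb{E}_{P_{\pi}(s_{t+k} \mid s_t)}\left[\log E_\theta(s_{t+k}, s_t, \pi)\right] + \log \mathbb{E}_{\tilde{s} \sim \rho}\left[E_\theta(\tilde{s}, s_t, \pi)\right] \right), \qquad (13)$$

where $B$ is a constant that does not depend on $\theta$. The gradient with respect to $\theta$ is then,

$$\sum_{k=1}^{c} \gamma^{k-1} w_k \left( -\mathbb{E}_{P_{\pi}(s_{t+k} \mid s_t)}\left[\nabla_\theta \log E_\theta(s_{t+k}, s_t, \pi)\right] + \frac{\mathbb{E}_{\tilde{s} \sim \rho}\left[\nabla_\theta E_\theta(\tilde{s}, s_t, \pi)\right]}{\mathbb{E}_{\tilde{s} \sim \rho}\left[E_\theta(\tilde{s}, s_t, \pi)\right]} \right). \qquad (14)$$

The first term of Equation 14 is straightforward to estimate using the experienced states $s_{t+1:t+c}$. We set $\rho$ to be the replay buffer distribution, so that the numerator of the second term is also straightforward. We approximate the denominator of the second term using a mini-batch $\tilde{S}$ of states independently sampled from the replay buffer:

$$\mathbb{E}_{\tilde{s} \sim \rho}\left[E_\theta(\tilde{s}, s_t, \pi)\right] \approx |\tilde{S}|^{-1} \sum_{\tilde{s} \in \tilde{S}} E_\theta(\tilde{s}, s_t, \pi). \qquad (15)$$

This completes the description of our representation learning algorithm.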
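A minimal sketch of the per-sample representation loss follows, written in plain Python so that each term is visible. It assumes the loss takes the form "distance between the future state's representation and the inverse model's output, plus the log of a mini-batch estimate of the normalizer", summed over the $c$ future states with discounted weights. The squared Euclidean distance, the step weights, and the toy `f` and `phi` functions are illustrative assumptions; a real implementation would use neural networks, automatic differentiation, and a numerically stable log-sum-exp.

```python
import math


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def representation_loss(f, phi, s_t, actions, future_states, batch_states, gamma, D=sq_dist):
    """Loss on one replay sample: s_t, its next c actions, and the c states that followed."""
    c = len(future_states)
    z = phi(s_t, actions)
    # Log of the mini-batch estimate of E_{s ~ rho}[exp(-D(f(s), z))].
    log_norm = math.log(
        sum(math.exp(-D(f(s), z)) for s in batch_states) / len(batch_states)
    )
    loss = 0.0
    for k, s_k in enumerate(future_states, start=1):
        w_k = 1.0 if k < c else 1.0 / (1.0 - gamma)     # assumed step weights
        # -log E(s_k) equals D(f(s_k), z); log_norm is the normalizer term.
        loss += gamma ** (k - 1) * w_k * (D(f(s_k), z) + log_norm)
    return loss


# Hypothetical 2-D example: states are positions, actions are displacements.
def f(s):
    return (float(s[0]), float(s[1]))


def phi_good(s, actions):
    # Predicts where the action sequence leads.
    return (s[0] + sum(a[0] for a in actions), s[1] + sum(a[1] for a in actions))


def phi_bad(s, actions):
    return (-3.0, 1.0)                                  # ignores its inputs


s_t = (0, 0)
actions = [(1, 0), (1, 0)]
future_states = [(1, 0), (2, 0)]
batch_states = [(0, 0), (5, 5), (-3, 1), (2, 0)]

loss_good = representation_loss(f, phi_good, s_t, actions, future_states, batch_states, 0.9)
loss_bad = representation_loss(f, phi_bad, s_t, actions, future_states, batch_states, 0.9)
```

The loss rewards an inverse model whose output lands near the representations of the states actually visited while staying far from unrelated states in the mini-batch, which is the contrastive structure the text goes on to compare with mutual information estimators.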

Connection to Mutual Information Estimators.

The form of the objective we optimize (i.e. Equation 13) is very similar to mutual information estimators, most notably CPC (van den Oord et al., 2018). Indeed, one may interpret our objective as maximizing a mutual information $MI(s_{t+k}; s_t, \pi)$ via an energy function given by $E_\theta(s_{t+k}, s_t, \pi)$. The main differences between our approach and these previous proposals are as follows: (1) Previous approaches maximize a mutual information $MI(s_{t+k}; s_t)$ agnostic to actions or policy. (2) Previous approaches suggest defining the energy function as the exponential of a bilinear form $f(s_{t+k})^\top M_k f(s_t)$ for some matrix $M_k$, whereas our energy function is based on the distance $D$ used for low-level reward. (3) Our approach is provably good for use in hierarchical RL, and hence our theoretical results may justify some of the good performance observed by others using mutual information estimators for representation learning. Different approaches to translating our theoretical findings to practical implementations may yield objectives more or less similar to CPC, some of which perform better than others (see Appendix D).

5.2 Learning a Lower-Level Policy

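Equation 9 suggests optimizing a policy $\pi_{s_t, g}(a \mid s_{t+k}, k)$ for every $s_t, g$. This is equivalent to the parameterization $\pi_{\mathrm{lo}}(a \mid s_t, g, s_{t+k}, k)$, which is standard in goal-conditioned hierarchical designs. Standard RL algorithms may be employed to maximize the low-level reward implied by Equation 9:

$$-D(f(s_{t+k}), g) + \log \rho(s_{t+k}) - \log P_{\pi}(s_{t+k} \mid s_t), \qquad (16)$$

weighted by $w_k$ and where $\pi$ corresponds to $\pi_{\mathrm{lo}}$ when the state $s_t$ and goal $g$ are fixed. While the first term of Equation 16 is straightforward to compute, the log probabilities $\log \rho(s_{t+k})$ and $\log P_{\pi}(s_{t+k} \mid s_t)$ are in general unknown. To approach this issue, we take advantage of the representation learning objective for $f, \varphi$. When $f, \varphi$ are optimized as dictated by Equation 8, we have

$$\log P_{\pi}(s_{t+k} \mid s_t) \approx \log \rho(s_{t+k}) - D(f(s_{t+k}), \varphi(s_t, \pi)) - \log \mathbb{E}_{\tilde{s} \sim \rho}\left[E(\tilde{s}, s_t, \pi)\right]. \qquad (17)$$

We may therefore approximate the low-level reward as

$$-D(f(s_{t+k}), g) + D(f(s_{t+k}), \varphi(s_t, \pi)) + \log \mathbb{E}_{\tilde{s} \sim \rho}\left[E(\tilde{s}, s_t, \pi)\right]. \qquad (18)$$

As in Section 5.1, we use the sampled actions $a_{t:t+c-1}$ to represent $\pi$ as input to $\varphi$. We approximate the third term of Equation 18 analogously to Equation 15. Note that this is a slight difference from standard low-level rewards, which use only the first term of Equation 18 and are unweighted.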
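The three-term low-level reward can be sketched as below. The form used here is an assumption reconstructed from the description above: the usual negative distance to the goal, plus the distance to the inverse model's output, plus a log-normalizer estimated from a mini-batch of replay states. The distance, `f`, and `phi` are hypothetical toy choices.

```python
import math


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def low_level_reward(f, phi, s_t, actions, s_next, g, batch_states, D=sq_dist):
    """Approximate low-level reward for reaching s_next under goal g."""
    z = phi(s_t, actions)
    # Third term: log of the mini-batch estimate of E_{s ~ rho}[exp(-D(f(s), z))].
    log_norm = math.log(
        sum(math.exp(-D(f(s), z)) for s in batch_states) / len(batch_states)
    )
    return -D(f(s_next), g) + D(f(s_next), z) + log_norm


# Hypothetical 2-D example: states are positions, actions are displacements.
def f(s):
    return (float(s[0]), float(s[1]))


def phi(s, actions):
    return (s[0] + sum(a[0] for a in actions), s[1] + sum(a[1] for a in actions))


s_t = (0.0, 0.0)
actions = [(1.0, 0.0)]
s_next = (1.0, 0.0)
batch_states = [(0.0, 0.0), (1.0, 0.0)]

reward = low_level_reward(f, phi, s_t, actions, s_next, (3.0, 0.0), batch_states)
standard_reward = -sq_dist(f(s_next), (3.0, 0.0))    # first term only
```

In this example the second term vanishes because the inverse model predicts the observed next state exactly, so the only difference from the standard reward is the log-normalizer.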

6 Related Work

Representation learning for RL has a rich and diverse existing literature, often interpreted as an abstraction of the original MDP. Previous works have interpreted the hierarchy introduced in hierarchical RL as an MDP abstraction of state, action, and temporal spaces (Sutton et al., 1999; Dietterich, 2000; Bacon et al., 2017). In goal-conditioned hierarchical designs, although the representation is learned on states, it is in fact a form of action abstraction (since goals are high-level actions). While previous successful applications of goal-conditioned hierarchical designs have either learned representations naively end-to-end (Vezhnevets et al., 2017), or not learned them at all (Levy et al., 2017; Nachum et al., 2018), we take a principled approach to representation learning in hierarchical RL, translating a bound on sub-optimality to a practical learning objective.

Bounding sub-optimality in abstracted MDPs has a long history, from early work in theoretical analysis on approximations to dynamic programming models (Whitt, 1978; Bertsekas & Castanon, 1989). Extensive theoretical work on state abstraction, also known as state aggregation or model minimization, has been done in both operations research (Rogers et al., 1991; Van Roy, 2006) and RL (Dean et al., 1997; Ravindran & Barto, 2002; Abel et al., 2017). Notably, Li et al. (2006) introduce a formalism for categorizing classic work on state abstractions such as bisimulation (Dean et al., 1997) and homomorphism (Ravindran & Barto, 2002) based on what information is preserved, which is similar in spirit to our approach. Exact state abstractions (Li et al., 2006) incur no performance loss (Dean et al., 1997; Ravindran & Barto, 2002), while their approximate variants generally have bounded sub-optimality (Bertsekas & Castanon, 1989; Dean et al., 1997; Sorg & Singh, 2009; Abel et al., 2017). While some of the prior work also focuses on learning state abstractions (Li et al., 2006; Sorg & Singh, 2009; Abel et al., 2017), these methods often apply only to simple MDP domains, as they rely on techniques such as state partitioning or Q-value based aggregation, which are difficult to scale to the domains in our experiments. Thus, the key differentiation of our work from these prior works is that we derive bounds which may be translated to practical representation learning objectives. Our results on difficult, high-dimensional continuous-control domains are a testament to the potential impact of our theoretical findings.

Lastly, we note the similarity of our representation learning algorithm to recently introduced scalable mutual information maximization objectives such as CPC (van den Oord et al., 2018) and MINE (Ishmael Belghazi et al., 2018). This is not a surprise, since maximizing mutual information relates closely with maximum likelihood learning of energy-based models, and our bounds effectively correspond to bounds based on model-based predictive errors, a basic family of bounds in representation learning in MDPs (Sorg & Singh, 2009; Brunskill & Li, 2014; Abel et al., 2017). To our knowledge, no prior work has connected these mutual information estimators to representation learning in hierarchical RL, and ours is the first to formulate theoretical guarantees on sub-optimality of the resulting representations in such a framework.

7 Experiments

Figure 2 (panels: Ant Maze Env; XY; Ours; Ours (Images); VAE; VAE (Images); E2C; E2C (Images)): Learned representations (2D embeddings) of our method and a number of variants on a MuJoCo Ant Maze environment, with color gradient based on episode time-step (black for beginning of episode, yellow for end). The ant travels from beginning to end of a ⊃-shaped corridor along an $x, y$ trajectory shown under XY. Without any supervision, our method is able to deduce this near-ideal representation, even when the raw observation is given as a top-down image. Other approaches are unable to properly recover a good representation.

We evaluate our proposed representation learning objective compared to a number of baselines:

  • XY: The oracle baseline which uses the position of the agent as the representation.

  • VAE: A variational autoencoder (Kingma & Welling, 2013) on raw observations.

  • E2C: Embed to control (Watter et al., 2015). A method which uses variational objectives to train a representation of states and actions which have locally linear dynamics.

  • E2E: End-to-end learning of the representation. The representation is fed as input to the higher-level policy and learned using gradients from the RL objective.

  • Whole obs: The raw observation is used as the representation. No representation learning. This is distinct from Nachum et al. (2018), in which a subset of the observation space was pre-determined for use as the goal space.

We evaluate on the following continuous-control MuJoCo (Todorov et al., 2012) tasks (see Appendix C for details):

  • Ant (or Point) Maze: An ant (or point mass) must navigate a ⊃-shaped corridor.

  • Ant Push: An ant must push a large block to the side to reach a point behind it.

  • Ant Fall: An ant must push a large block into a chasm so that it may walk over it to the other side without falling.

  • Ant Block: An ant must push a small block to various locations in a square room.

  • Ant Block Maze: An ant must push a small block through a ⊃-shaped corridor.

In these tasks, the raw observation is the agent’s coordinates and orientation as well as local coordinates and orientations of its limbs. In the Ant Block and Ant Block Maze environments we also include the coordinates and orientation of the block. We also experiment with more difficult raw representations by replacing the coordinates of the agent with a low-resolution top-down image of the agent and its surroundings. These experiments are labeled ‘Images’.

Figure 3 (panels: Point Maze; Ant Maze; Ant Push; Ant Fall; Ant Block; Point Maze (Images); Ant Maze (Images); Ant Push (Images); Ant Fall (Images); Ant Block Maze): Results of our method and a number of variants on a suite of tasks in 10M steps of training, plotted according to median over 10 trials with a percentile band. We find that outside of simple point environments, our method is the only one which can approach the performance of oracle representations. These results show that our method can be successful, even when the representation is learned online concurrently while learning a hierarchical policy.

Figure 4 (panels: Ant and block; Ant pushing small block through corridor; Representations): We investigate the importance of various observation coordinates in learned representations on a difficult block-moving task. In this task, a simulated robotic ant must move a small red block from beginning to end of a ⊃-shaped corridor. Observations include both ant and block coordinates. We show the trajectory of the learned representations on the right (cyan). At four time steps, we also plot the resulting representations after perturbing the observation’s ant coordinates (green) or the observation’s block coordinates (magenta). The learned representations put a greater emphasis (i.e., higher sensitivity) on the block coordinates, which makes sense for this task as the external reward is primarily determined by the position of the block.

For the baseline representation learning methods which are agnostic to the RL training (VAE and E2C), we provide comparative qualitative results in Figure 2. These representations are the result of taking a trained policy, fixing it, and using its sampled experience to learn 2D representations of the raw observations. We find that our method can successfully deduce the underlying near-optimal representation, even when the raw observation is given as an image.

We provide quantitative results in Figure 3. In these experiments, the representation is learned concurrently while learning a full hierarchical policy (according to the procedure in Nachum et al. (2018)). Therefore, this setting is especially difficult since the representation learning must learn good representations even when the behavior policy is very far from optimal. Accordingly, we find that most baseline methods completely fail to make any progress. Only our proposed method is able to approach the performance of the XY oracle.

For the ‘Block’ environments, we were curious what our representation learning objective would learn, since the coordinate of the agent is not the only near-optimal representation. For example, another suitable representation is the coordinates of the small block. To investigate this, we plotted (Figure 4) the trajectory of the learned representations of a successful policy (cyan), along with the representations of the same observations with agent perturbed (green) or with block perturbed (magenta). We find that the learned representations greatly emphasize the block coordinates over the agent coordinates, although in the beginning of the episode, there is a healthy mix of the two.

8 Conclusion

We have presented a principled approach to representation learning in hierarchical RL. Our approach is motivated by the desire to achieve maximum possible return, hence our notion of sub-optimality is in terms of optimal state values. Although this notion of sub-optimality is intractable to optimize directly, we are able to derive a mathematical relationship between it and a specific form of representation learning. Our resulting representation learning objective is practical and achieves impressive results on a suite of high-dimensional, continuous-control tasks.


We thank Bo Dai, Luke Metz, and others on the Google Brain team for insightful comments and discussions.


Appendix A Proof of Theorem 3 (Generalization of Theorem 1)

Consider the sub-optimality with respect to a specific state , . Recall that is the hierarchical result of a policy , and note that may be assumed to be deterministic due to the Markovian nature of . We may use the mapping to transform to a high-level policy on and using the mapping :


Let be the corresponding hierarchical policy. We will bound the quantity , which will bound . We follow logic similar to Achiam et al. (2017) and begin by bounding the total variation divergence between the -discounted state visitation frequencies of the two policies.

Denote the -step state transition distributions using either or as,


for . Considering as linear operators, we may express the state visitation frequencies of , respectively, as


where is a Dirac distribution centered at and


We will use to denote the every--steps -discounted state frequencies of ; i.e.,


By the triangle inequality, we have the following bound on the total variation divergence :


We begin by attacking the first term of Equation 28. We note that


Thus the first term of Equation 28 is bounded by


By expressing as a geometric series and employing the triangle inequality, we have , and we thus bound the whole quantity (30) by


We now move to attack the second term of Equation 28. We may express this term as


Furthermore, by the triangle inequality we have


Therefore, recalling for and for , we may bound the total variation of the state visitation frequencies as


By condition 7 of Theorem 3 we have,


We now move to considering the difference in values. We have


Therefore, we have


as desired.

Appendix B Proof of Claim 4 (Generalization of Claim 2)

Consider a specific . Let . Note that the definition of may be expressed in terms of a KL:


Therefore we have,


By condition 8 we have,


Jensen’s inequality on the sqrt function then implies


Pinsker’s inequality now yields,


Similarly Jensen’s and Pinsker’s inequality on the LHS of Equation 43 yields


The triangle inequality and Equations 46 and 47 then give us,