Discovering hierarchies using Imitation Learning from hierarchy-aware policies

Learning options that allow agents to exhibit temporally extended behavior has proven useful for improving exploration, reducing sample complexity, and enabling various transfer scenarios. Deep Discovery of Options (DDO) is a generative algorithm that learns a hierarchical policy along with options directly from expert trajectories. We perform a qualitative and quantitative analysis of options inferred by DDO in different domains. To this end, we suggest value metrics such as the option termination condition, a hinge value function error, and a KL-divergence based distance metric to compare different methods. Analyzing the termination conditions of the options and the number of timesteps each option ran revealed that the options were terminating prematurely. We suggest modifications which can be incorporated easily and which alleviate the problems of short options and of options collapsing to the same mode.





1 Introduction

Different frameworks of hierarchical reinforcement learning have been useful in solving complex tasks. The options framework (Dietterich (2000)) in particular provides sufficient temporal abstraction to allow actions that operate at different time scales. Options can improve exploration, make learning more sample efficient, aid transfer, or simply help converge to optimal behavior faster, especially when the task has sparse rewards and a long horizon.

The process of learning options has been widely studied (Barto & Mahadevan (2003)) with some success. Most methods identify salient states (Kulkarni et al. (2016)) or put constraints on the levels of hierarchy. Moreover, learning these options itself takes a lot of data.

Hierarchical Behavioral Cloning (HBC) extends the framework of imitation learning to learn options from an expert whose trajectories contain information about the option (usually as a sub-goal). However, such an expert is not always available in complex domains where options cannot be specified by a human expert.

Deep Discovery of Options (DDO) instead views option learning as a probabilistic inference problem where the input is only the flat trajectories of an expert. Using DDO, we examine the inferred options and apply various metrics to determine their validity and usefulness in learning. We use similarity metrics to determine how close the generated trajectories are to expert trajectories, and we compare the value functions of the expert and the DDO agent to compare policies directly. We also examine whether the inferred options imitate the expert's options when the expert itself uses a hierarchical policy.

We also show simple methods to tackle the problem of the termination probability being too high for some inferred options. By adding a constant factor to decrease the termination probability, we can increase the fraction of timesteps taken by options while maintaining similar performance.

Also, by introducing a regularizer that increases the KL-divergence between option policies, we can prevent options from collapsing to a single mode without reducing performance.

2 Related Work

The problem of imitation learning is to learn a policy from samples of expert trajectories. Using a supervised learning setup to learn action distributions can lead to compounding errors due to insufficient samples at unsafe or rarely visited states. Additional feedback from an expert during training can be incorporated to obtain more stable policies, as in DAGGER (Ross et al. (2011)).

Hierarchical Reinforcement Learning has proven to be a useful paradigm for tackling problems like state and action abstraction, learning low-level skills (Dietterich (2000); Chentanez et al. (2005)), identifying salient events (Bacon et al. (2017)), exploration, etc. (Barto & Mahadevan (2003)). One useful hierarchical framework augments primitive actions with options that perform specific lower-level tasks, as described in Sutton et al. (1999, 1998).

The problem of learning hierarchical policies from expert feedback has gained popularity recently. Hierarchical Behaviour Cloning by Nejati et al. (2006) is an extension of passive imitation learning, with options defined by the sub-goals they attempt to reach. The expert needs to annotate the primitive actions in trajectories with options. The work by Le et al. (2018) extends DAGGER to a hierarchical setup where the expert gives additional feedback, such as suggesting sub-goals as options after inspecting trajectories.

Deep Discovery of Options by Fox et al. (2017) and the work by Smith et al. (2018) solve the problem of option discovery using a probabilistic inference approach, assuming a hidden Markov model and using EM algorithms (Welch (2003)) to infer option policies. A follow-up work (Krishnan et al. (2017)) extends this framework to continuous-action tasks.

3 Preliminaries

We use the usual notation for a Markov Decision Process (MDP), described by the tuple $(\mathcal{S}, \mathcal{A}, R, T, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $R(s, a)$ the reward on taking action $a$ at state $s$, and $T(s' \mid s, a)$ the transition function. $\gamma \in [0, 1)$ is the discount factor for rewards.

A policy $\pi(a \mid s)$ gives the next-action distribution. The state-action value function is the expected total return

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_0 = s, a_0 = a\Big].$$

Thus, the optimal policy is $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$.

In DQN (Mnih et al. (2013)), a neural network approximates $Q(s, a; \theta)$ and uses the Bellman equation

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

to update the $Q$ function for every transition.
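The Bellman update above can be sketched as a single tabular Q-learning step (a minimal illustration only, not the DQN implementation; the state/action sizes and learning rate are placeholders):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q[s, a] toward r + gamma * max_a' Q[s', a']."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Tiny 2-state, 2-action example: one transition with reward 1.
Q = np.zeros((2, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)
```

DQN replaces the table with a network and applies the same target in a squared-error loss over minibatches of transitions.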

3.1 Hierarchical Behavioural Cloning

In usual behavioral cloning the task is to learn a policy $\pi$ from expert trajectories $(s_0, a_0, s_1, a_1, \ldots)$. Hierarchical Behavioural Cloning (Nejati et al. (2006)) extends this to the options setup: options are mapped to sub-goals, and at each decision point the expert is assumed to select an option and from then on follow that option's policy up to its termination. Thus the expert trajectory takes the form $(s_0, o_0, a_0, s_1, a_1, \ldots)$, with a new option annotated whenever the previous one terminates, and the agent learns a hierarchical policy that maximizes the likelihood of the expert trajectories.

3.2 Deep Discovery of Options

Deep Discovery of Options (DDO) uses the behavioral cloning framework to automatically infer useful policies from the flat trajectories of an expert. Treating a flat trajectory $\xi = (s_0, a_0, s_1, a_1, \ldots)$ as the observed variables, it assumes a hidden Markov model with latent variables $h_t$ and $b_t$: the option currently being executed and an indicator of whether that option terminates at the current state.

The latent variables are thus $\zeta = (h_0, b_0, h_1, b_1, \ldots)$. Using an EM algorithm similar to Baum-Welch, DDO computes the option policies $\pi_h$, the termination probabilities $\psi_h$, and the meta-policy $\eta$, jointly parameterized by $\theta$.

It uses the Expectation-Gradient trick:

$$\nabla_{\theta} \log P_{\theta}(\xi) = \mathbb{E}_{\zeta \sim P_{\theta}(\cdot \mid \xi)}\big[\nabla_{\theta} \log P_{\theta}(\zeta, \xi)\big].$$

A forward-backward algorithm reminiscent of Baum-Welch is then used to compute the posterior over the latent variables, and the gradient computed above can be used in any stochastic gradient descent algorithm. The posterior $P_{\theta}(h_t, b_t \mid \xi)$ can be written in terms of the forward and backward probabilities. The dynamics term $T(s_{t+1} \mid s_t, a_t)$ cancels in these normalizations, which allows us to ignore the dynamics.
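The forward-backward computation can be sketched on a generic discrete HMM over latent options (a minimal sketch, not the full DDO Expectation-Gradient; the toy transition and emission tables below are invented for illustration):

```python
import numpy as np

def forward_backward(init, trans, emit, obs):
    """Posterior P(h_t | observations) for a discrete HMM.

    init:  (K,)   initial distribution over latent options
    trans: (K, K) trans[i, j] = P(h_{t+1}=j | h_t=i)
    emit:  (K, M) emit[h, a] = P(a_t=a | h_t=h)
    obs:   length-T sequence of observed action indices
    """
    T, K = len(obs), len(init)
    alpha = np.zeros((T, K))  # forward messages (normalized per step)
    beta = np.zeros((T, K))   # backward messages
    alpha[0] = init * emit[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        alpha[t] /= alpha[t].sum()  # shared factors cancel in this normalization
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

# Two latent options, two actions: option 0 prefers action 0, option 1 prefers action 1.
init = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit = np.array([[0.9, 0.1], [0.1, 0.9]])
post = forward_backward(init, trans, emit, obs=[0, 0, 1, 1])
```

The posterior correctly attributes the early actions to option 0 and the later ones to option 1; DDO uses analogous posteriors over $(h_t, b_t)$ to weight the policy-gradient terms.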

3.3 Hierarchical Deep Q-Network

Hierarchical Deep Q-Network (h-DQN) (Kulkarni et al. (2016)) is an extension of DQN (Mnih et al. (2013)) that learns option policies and a meta-policy simultaneously. The options are defined, as in HBC, with respect to sub-goal states $g$, and an intrinsic reward is used to learn option-specific value functions $Q_1(s, a; g)$. The meta-policy is learned via action values over options, $Q_2(s, g)$. The two value functions are approximated using separate networks. The options network is updated using the normal TD error

$$\delta = r^{in} + \gamma \max_{a'} Q_1(s', a'; g) - Q_1(s, a; g),$$

where $r^{in}$ is the intrinsic reward, while the meta-policy network uses SMDP Q-learning (Dietterich (2000)) updates:

$$Q_2(s, g) \leftarrow Q_2(s, g) + \alpha \big(R + \gamma^{\tau} \max_{g'} Q_2(s', g') - Q_2(s, g)\big),$$

where $R$ is the discounted reward accumulated from the environment during the $\tau$-step execution of option $g$ (extrinsic reward only).
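The SMDP update for the meta-policy can be sketched as follows (a minimal tabular illustration; the function and variable names are placeholders, not the h-DQN code):

```python
import numpy as np

def smdp_q_update(Q2, s, g, ext_rewards, s_next, alpha=0.5, gamma=0.9):
    """One SMDP Q-learning step for the meta-policy.

    ext_rewards: extrinsic rewards collected while option g ran (tau steps).
    The option's discounted return and gamma**tau replace the usual
    one-step reward and discount.
    """
    tau = len(ext_rewards)
    R = sum(gamma ** k * r for k, r in enumerate(ext_rewards))
    target = R + gamma ** tau * np.max(Q2[s_next])
    Q2[s, g] += alpha * (target - Q2[s, g])
    return Q2

# Option g=1 ran for 3 steps from meta-state 0 and earned a reward on the last step.
Q2 = np.zeros((2, 2))
Q2 = smdp_q_update(Q2, s=0, g=1, ext_rewards=[0.0, 0.0, 1.0], s_next=1)
```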

4 Learning options in Grid-worlds

We first train a DDO agent in grid worlds and analyze the inferred options.

4.1 Method

The training method is summarized in Algorithm 1. Note that the meta-policy uses both options and primitive actions.

for $i = 1, \ldots, N$ do
       Initialize a random start and goal state;
       Learn expert policy $\pi_i$ using value iteration;
       Sample trajectories using $\pi_i$ into buffer $D$;
end for
Use trajectories in $D$ to infer options;
Use SMDP Q-learning to learn a meta-policy over options and primitive actions;
Algorithm 1: Training in grid-world tasks

4.2 Analysis of Inferred Options

We used 4 to 6 options depending on the size of the grid world; when the number of rooms was more than 2, as in the four-room grid, we used 6 options. Some of the inferred options are visualized in Appendix A. We can see quite a bit of variation in the option policies. However, the termination probability was very high, as seen in Table 1; most options were not executed for more than one or two timesteps.

Option | Mean | Variance
1 | 0.64 | 0.020
2 | 0.69 | 0.011
3 | 0.81 | 0.006
4 | 0.70 | 0.010
5 | 0.24 | 0.009
6 | 0.27 | 0.009
Table 1: Termination probabilities of options in the 4-room world

We also noticed that the Q-learner learned faster without options (see Figure 1). This might indicate that the options were not aiding exploration: due to their high termination probability, options behaved much like primitive actions, and updating Q-values for options alongside primitives may have increased the time to reach optimal behavior.

Figure 1: Learning curve on 4-room environment

Similarity of trajectories

To assess how similar the agent's and expert's trajectories are, we used the KL-divergence over action probabilities as a similarity metric. The results are shown in Table 2.

Environment | KL-divergence
Roundabout | 0.270
4room | 0.331
Hallway | 1.040
experiment2 | 0.276
Table 2: KL-divergence between policies of expert and DDO agent

A random policy for this task would give a KL-divergence value of 0.69. In many environments, though the error is high, the agent does better than a random policy. But the error shows that the trajectories may not be very similar.
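This metric can be sketched as the KL-divergence between the two action distributions, averaged over states (a minimal sketch; the toy distributions below are invented):

```python
import numpy as np

def mean_kl(expert_probs, agent_probs, eps=1e-12):
    """Mean KL(expert || agent) over states.

    expert_probs, agent_probs: (num_states, num_actions) action distributions.
    Probabilities are clipped away from zero for numerical stability.
    """
    p = np.clip(expert_probs, eps, 1.0)
    q = np.clip(agent_probs, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))

# Deterministic expert in state 0, indifferent in state 1, vs a uniform agent.
expert = np.array([[1.0, 0.0], [0.5, 0.5]])
uniform = np.full((2, 2), 0.5)
kl = mean_kl(expert, uniform)
```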

Hinge loss of value function

Next, we check whether the value function learned by the expert on a given task matches that of the agent. We use the hinge loss of the agent's value function with respect to that of the expert: the hinge loss measures the difference only at states where the agent's value function is lower than that of the expert policy. Thus, it measures the maximal marginal difference. The lower the loss, the closer the agent's value function is to the expert's (which is learned via value iteration).
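This hinge loss can be sketched in a few lines (a minimal sketch; the array names and toy values are placeholders):

```python
import numpy as np

def hinge_value_loss(v_expert, v_agent):
    """Mean over states of max(0, V_expert(s) - V_agent(s)).

    Penalizes only the states where the agent's value falls below the
    expert's; states where the agent matches or exceeds it contribute zero.
    """
    return float(np.mean(np.maximum(0.0, v_expert - v_agent)))

v_exp = np.array([1.0, 0.5, 0.2])
v_agt = np.array([0.8, 0.6, 0.2])  # short of the expert only in the first state
loss = hinge_value_loss(v_exp, v_agt)
```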

The results are summarized in Table 3.

Environment | Hinge error
Roundabout | 0.364
4room | 0.468
Hallway | 0.198
experiment2 | 0.372
Table 3: Hinge loss between value functions

4.3 Learning from a hierarchical expert

So far the expert was trained using value iteration and did not exhibit hierarchical behavior. We now train with a hierarchical expert by using the previously trained DDO agent as the expert. We notice that there is not much similarity between the expert's options and the inferred options, as seen in Figure 2.

(a) Expert option
(b) Inferred option
(c) Expert option
(d) Inferred option
Figure 2: Comparing expert option and inferred option

We also tried hand-coding options for the expert policy, first training it using SMDP Q-learning and then training a DDO agent on the expert's trajectories. Again we did not notice many similarities, as seen in Figure 3.

(a) Expert
(b) Inferred
(c) Expert
(d) Inferred
Figure 3: Comparing expert option and inferred option in case of hand-coded options for expert

However, the hinge loss decreased to 0.014 and the KL-divergence between policies was 0.19. Hence, the DDO agent emulated the hierarchical expert better.

4.4 Overcoming large termination probability

A major obstacle to using the inferred options for longer stretches was that the termination probabilities were simply too high. Hence, we multiply the inferred termination probabilities by a factor $\alpha \le 1$ before learning the meta-policy. This allows options to execute for a larger number of timesteps before termination.
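The adjustment amounts to one multiplicative rescaling of the inferred termination probabilities (a minimal sketch; the symbol $\alpha$ and the array layout are our placeholders, not names from the DDO code):

```python
import numpy as np

def scale_termination(psi, alpha):
    """Scale inferred termination probabilities psi[h, s] by a factor alpha <= 1.

    A lower termination probability lets options run longer: under a constant
    per-step termination probability p, the expected option length is 1 / p.
    """
    return np.clip(alpha * psi, 0.0, 1.0)

psi = np.array([[0.8, 0.6], [0.9, 0.3]])  # termination prob per (option, state)
scaled = scale_termination(psi, alpha=0.5)
expected_len_before = 1.0 / psi[0, 0]
expected_len_after = 1.0 / scaled[0, 0]
```

Halving the termination probability doubles the expected option length, which matches the trend in Table 4.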

As expected, the fraction of time options were used increased with decreasing $\alpha$. However, the hinge loss also increased slightly (see Table 4).

$\alpha$ | Median | Mean | Fraction of option time | Hinge loss
0.1 | 5.3 | 6.4 | 0.66 | 0.41
0.2 | 4 | 4 | 0.58 | 0.37
0.4 | 2.98 | 5.57 | 0.68 | 0.40
0.7 | 1.4 | 1.76 | 0.60 | 0.37
1.0 | 0.5 | 4.34 | 0.42 | 0.369
Table 4: Effect of $\alpha$ on option length (median and mean timesteps), fraction of time spent in options, and the learned value function (hinge loss)

4.5 Avoiding mode collapse of options

We noticed that multiple inferred options were very similar. Hence, to increase the diversity of options, we added a regularizer that increases the KL-divergence between the option policies.

Hence, our new loss function becomes

$$\mathcal{L}(\theta) = -\log P_{\theta}(\xi) - \lambda \sum_{h \neq h'} D_{\mathrm{KL}}\big(\pi_h \,\|\, \pi_{h'}\big),$$

where $\log P_{\theta}(\xi)$ is the log-likelihood of the DDO parameters given the expert trajectories and $\lambda$ determines the importance of the KL-divergence term. The effect of $\lambda$ on the inferred policy is summarized in Table 5.

$\lambda$ | Median steps for options | Error
0.001 | 1.18 | 0.368
0.01 | 1.0 | 0.378
0.1 | 4.1 | 0.42
0.2 | 4.5 | 0.40
0.3 | 5.0 | 0.41
0.4 | 4.8 | 0.40
0.5 | 6.0 | 0.44
0.6 | 2.85 | 0.37
0.7 | 1.0 | 0.37
Table 5: Effect of the KL-divergence term on the DDO policy

We see that setting $\lambda$ too high or too low makes the median steps per option low. With $\lambda$ too high, the options learned may not be useful enough, so only primitive actions are chosen. With $\lambda$ too low, little attention is paid to the KL-divergence term and the learned options may all be similar because of mode collapse.
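The regularized objective can be sketched as follows (a minimal sketch assuming a pairwise form of the KL term; the toy policies are invented and `nll` stands in for the DDO negative log-likelihood, which is not computed here):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL-divergence between two discrete action distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def regularized_loss(nll, option_policies, lam):
    """nll minus lam times the summed pairwise KL between option policies.

    option_policies: list of (num_states, num_actions) arrays. Subtracting the
    pairwise KL rewards options whose action distributions are distinct,
    discouraging mode collapse.
    """
    diversity = 0.0
    for i, pi_i in enumerate(option_policies):
        for j, pi_j in enumerate(option_policies):
            if i != j:
                diversity += np.mean([kl(p, q) for p, q in zip(pi_i, pi_j)])
    return nll - lam * diversity

identical = [np.array([[0.5, 0.5]]), np.array([[0.5, 0.5]])]
distinct = [np.array([[0.9, 0.1]]), np.array([[0.1, 0.9]])]
loss_same = regularized_loss(1.0, identical, lam=0.1)
loss_diff = regularized_loss(1.0, distinct, lam=0.1)
```

Distinct option policies achieve a lower loss at the same likelihood, so the optimizer is pushed away from collapsed solutions.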

5 Learning options in Atari Domain

We also ran DDO on the Atari games PONG and KRULL to analyze the options inferred in domains with a large state space. We used a trained A3C agent as the expert policy.

5.1 Training

We used the h-DQN framework to train the meta-policy after learning options. Unlike the actual h-DQN setup described in Kulkarni et al. (2016), we do not train a network to learn option policies; we only train the meta-controller network to learn a policy over options and primitive actions, as described in Algorithm 2.

input : Network parameters $\theta$, option policies $\pi_h$, termination probabilities $\psi_h$, other DQN hyperparameters

Initialize target and current networks with weights $\theta$;
for each episode do
       Reset environment;
       for each step $t$ do
             Select $a$ using $\epsilon$-greedy on $Q(s_t, \cdot; \theta)$;
             if $a$ is an option $h$ then
                   repeat
                         Get action $a_t \sim \pi_h(\cdot \mid s_t)$ and execute it;
                         Sample termination $b_t \sim \psi_h(s_t)$;
                   until $b_t = 1$, the episode terminates, or the option length exceeds a limit;
             else
                   Execute primitive action $a$ and get the reward and next state from the environment;
             end if
             Store the transition in the replay buffer;
             Sample a minibatch from the replay buffer;
             Batch update the DQN using the squared TD-error loss;
       end for
end for
Algorithm 2: TrainHDQN

The simple training procedure used in the grid-world domains, where DDO options are learned first and the meta-policy afterwards, did not converge quickly here. So we used an iterated training procedure, described in Algorithm 3: in each iteration, we first refine the DDO parameters on trajectories sampled from a trajectory buffer, then refine the HDQN meta-policy, and finally add rollouts from the agent back into the trajectory buffer.

input : Number of options $k$, global steps $N$, sample size $m$, other hyperparameters
Learn expert policy $\pi_E$ using A3C;
Initialize DQN $Q(\cdot; \theta)$;
Sample trajectories $D$ from $\pi_E$;
for $i = 1, \ldots, N$ do
       Sample $m$ trajectories from $D$;
       Train the DDO parameters for option policies and termination conditions on the sampled trajectories;
       Refine the meta-policy with TrainHDQN, roll out trajectories from the HDQN agent, and add them to $D$;
end for
Algorithm 3: DDO + HDQN

5.2 Analysis of inferred options

We used 10 options for PONG and 20 options for KRULL.

Qualitative analysis found only up to 3 options per game that looked useful. The rest of the options usually involved high-frequency periodic oscillations in state transitions.

We provide the average number of timesteps for each option in the trained DDO agent, along with the standard deviation, in Figure 4.


(a) Pong options
(b) Krull options
Figure 4: Analysis of options in PONG and KRULL
The values at the top of the bars depict the standard deviation.

6 Discussion

We have analyzed the options inferred by the DDO algorithm in grid-world and Atari domains, using metrics like the KL-divergence between policies and the hinge loss between value functions to assess the similarity between the agent's and the expert's behavior.

We found that the inferred termination probabilities were too high.

We used ad-hoc methods to alleviate the problems of high termination probabilities and mode collapse of options: multiplying the termination probabilities by a constant factor, and introducing a KL-divergence regularizer to increase the variety of options. These fixes make the learned policies use the inferred options more frequently and for a larger fraction of each trajectory.

When training the DDO agent in the Atari domain, alternating between training the DDO parameters for options and training the HDQN meta-policy helped learning proceed faster than learning the meta-policy only after DDO option inference had finished.

The importance of options could also be validated by the salient states or events they discover, their usefulness in transfer to slightly different tasks in the same environment, the regions of state space where they operate, etc. Creating metrics for continuous-action domains would also be a good next step.


  • Bacon et al. (2017) Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pp. 1726–1734, 2017.
  • Barto & Mahadevan (2003) Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems, 13(1-2):41–77, 2003.
  • Chentanez et al. (2005) Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 1281–1288, 2005.
  • Dietterich (2000) Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
  • Fox et al. (2017) Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
  • Krishnan et al. (2017) Sanjay Krishnan, Roy Fox, Ion Stoica, and Ken Goldberg. Ddco: Discovery of deep continuous options for robot learning from demonstrations. In Conference on Robot Learning, pp. 418–437, 2017.
  • Kulkarni et al. (2016) Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pp. 3675–3683, 2016.
  • Le et al. (2018) Hoang M Le, Nan Jiang, Alekh Agarwal, Miroslav Dudík, Yisong Yue, and Hal Daumé III. Hierarchical imitation and reinforcement learning. arXiv preprint arXiv:1803.00590, 2018.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Nejati et al. (2006) Negin Nejati, Pat Langley, and Tolga Konik. Learning hierarchical task networks by observation. In Proceedings of the 23rd international conference on Machine learning, pp. 665–672. ACM, 2006.
  • Ross et al. (2011) Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.
  • Smith et al. (2018) Matthew Smith, Herke Hoof, and Joelle Pineau. An inference-based policy gradient method for learning options. In International Conference on Machine Learning, pp. 4710–4719, 2018.
  • Sutton et al. (1998) Richard S Sutton, Doina Precup, and Satinder P Singh. Intra-option learning about temporally abstract actions. In ICML, volume 98, pp. 556–564, 1998.
  • Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
  • Welch (2003) Lloyd R Welch. Hidden markov models and the baum-welch algorithm. IEEE Information Theory Society Newsletter, 53(4):10–13, 2003.

Appendix A Visualization of inferred policies

We show some of the inferred policies in grid-world tasks using expert trained via value iteration.

Appendix B Proposed Evaluation Metrics

B.1 State-Visitation Count

State visitation counts give an estimate of the probability of visiting a particular state. Option policies with diverse state visitation counts indicate a multi-modal set of options. This metric can also be used in continuous and high-dimensional domains by hashing states.
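Visitation counting with hashed states can be sketched as follows (a minimal sketch; the `discretize` hook is our placeholder for whatever binning or feature hashing a given domain needs):

```python
from collections import Counter

def visit_counts(trajectories, discretize=lambda s: s):
    """Count visits to (hashed) states across trajectories.

    discretize maps a raw state to a hashable key; for continuous or
    high-dimensional states it could round, bin, or hash features.
    """
    counts = Counter()
    for traj in trajectories:
        for state in traj:
            counts[hash(discretize(state))] += 1
    return counts

# Two short grid-world trajectories of (x, y) states.
trajs = [[(0, 0), (0, 1), (1, 1)], [(0, 0), (1, 0)]]
counts = visit_counts(trajs)
```

Comparing these histograms across option policies gives a cheap measure of how differently the options cover the state space.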

B.2 KL-Divergence

The KL-divergence between the action distributions of the agent and expert policies is a good measure of how closely the agent imitates the expert.

We also use the KL-divergence between two option policies as a measure of how different the options are. This can be used as a regularizer to prevent mode collapse of options, as described in Section 4.5.

B.3 Hinge Value Function Loss

We use the hinge loss of the agent's value function with respect to that of the expert to measure how close to optimal the agent's behavior is. The hinge loss measures the difference only at states where the agent's value function is lower than that of the expert policy. Thus, it measures the maximal marginal difference.

B.4 Diffusion Time

This is the expected number of timesteps required to go from one state to another by a random walk over the inferred options and primitive actions, averaged over all pairs of states.

A small diffusion time implies the agent can cover a larger proportion of the state space during initial exploration, and that the options help the agent get past bottleneck states to reach different regions of the state space.
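Diffusion time can be estimated empirically by averaging random-walk hitting times over all ordered state pairs (a minimal sketch on a toy 3-state chain; the walk count, step cap, and adjacency structure are placeholders):

```python
import random

def diffusion_time(neighbors, walks=200, cap=1000, seed=0):
    """Average random-walk steps to reach state t from state s, over all pairs.

    neighbors[s] lists the states reachable from s in one step (via a
    primitive action or an option endpoint).
    """
    rng = random.Random(seed)
    states = sorted(neighbors)
    total, pairs = 0.0, 0
    for s in states:
        for t in states:
            if s == t:
                continue
            steps_sum = 0
            for _ in range(walks):
                cur, steps = s, 0
                while cur != t and steps < cap:
                    cur = rng.choice(neighbors[cur])
                    steps += 1
                steps_sum += steps
            total += steps_sum / walks
            pairs += 1
    return total / pairs

# 3-state chain 0 - 1 - 2; the exact pairwise-average hitting time is 16/6.
chain = {0: [1], 1: [0, 2], 2: [1]}
dt = diffusion_time(chain)
```

Adding options that jump across bottlenecks would add edges to `neighbors` and shrink this average.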

B.5 t-SNE Embeddings

The representation of states can be projected into two-dimensional space using methods like t-SNE. Visualizing the mapping between each option and the states over which it is active is a good visual cue for the spatial variance of options.

Figure 5: T-SNE Visualization