Several frameworks of hierarchical reinforcement learning have proven useful for solving complex tasks. The options framework (Sutton et al. (1999)) in particular provides temporal abstraction, allowing actions that operate at different time scales. Options can improve exploration, make learning more sample-efficient, aid transfer, or simply speed convergence to optimal behavior, especially when the task has sparse rewards and a long horizon.
The process of learning options has been widely studied (Barto & Mahadevan (2003)) with some success. Most methods identify salient states (Kulkarni et al. (2016)) or place constraints on the levels of the hierarchy. Moreover, learning the options themselves requires a lot of data.
Hierarchical Behavioral Cloning (HBC) extends the imitation learning framework to learn options from an expert whose trajectories contain information about the option (usually as a sub-goal). However, such an expert is not always available in complex domains where options cannot be specified by a human expert.
Deep Discovery of Options (DDO) instead treats option learning as a probabilistic inference problem whose input is only the flat trajectories of an expert. Using DDO, we examine the inferred options and apply various metrics to determine their validity and usefulness for learning. We use similarity metrics to measure how close the generated trajectories are to the expert trajectories, and we compare the value functions of the expert and the DDO agent to compare policies directly. We also examine whether the learned options imitate the expert's options when the expert itself uses a hierarchical policy.
We also show simple methods to tackle the problem of inferred termination probabilities being too high for some options. By applying a constant factor to decrease the termination probability, we can increase the fraction of timesteps spent in options while maintaining similar performance.
Further, by introducing a regularizer that increases the KL-divergence between option policies, we can prevent options from collapsing to a single mode without reducing performance.
2 Related Work
The problem of imitation learning is to learn a policy from samples of expert trajectories. Using a supervised learning setup to learn action distributions can lead to compounding errors due to insufficient samples at unsafe or rarely visited states. Additional feedback from an expert during training can be incorporated to yield more stable policies, as in DAGGER (Ross et al. (2011)).
Hierarchical Reinforcement Learning has proven to be a useful paradigm for tackling problems such as state and action abstraction, learning low-level skills (Dietterich (2000); Chentanez et al. (2005)), identifying salient events (Bacon et al. (2017)), and exploration (Barto & Mahadevan (2003)). One useful hierarchical framework augments primitive actions with options that perform specific lower-level tasks, as described in Sutton et al. (1999, 1998).
The problem of learning hierarchical policies from expert feedback has recently gained popularity. Hierarchical Behaviour Cloning (Nejati et al. (2006)) is an extension of passive imitation learning in which options are defined by the sub-goals they attempt to reach; the expert must augment the primitive actions in its trajectories with options. Le et al. (2018) extend DAGGER to a hierarchical setup in which the expert gives additional feedback, such as suggesting sub-goals as options after inspecting trajectories.
Fox et al. (2017) solve the problem of option discovery with a probabilistic inference approach, assuming a hidden Markov model and using EM algorithms (Welch (2003)) to infer option policies. A follow-up work (Krishnan et al. (2017)) extends this framework to continuous-action tasks.
3 Background
We use the usual notation for a Markov Decision Process (MDP), described by the tuple $\langle \mathcal{S}, \mathcal{A}, R, T, \gamma \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $R(s, a)$ the reward on taking action $a$ in state $s$, and $T(s' \mid s, a)$ the transition function; $\gamma$ is the discount factor for rewards.
A policy $\pi(a \mid s)$ gives the next-action distribution. The state-action value function is the expected total return $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t \geq 0} \gamma^{t} r_{t} \mid s_0 = s, a_0 = a\right]$.
Thus, the optimal policy is $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$, where $Q^{*}$ satisfies the Bellman equation $Q^{*}(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s'}\left[\max_{a'} Q^{*}(s', a')\right]$.
In DQN (Mnih et al. (2013)), a neural network approximates $Q(s, a; \theta)$ and uses the above Bellman equation to update the function for every transition.
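For concreteness, the Bellman backup can be sketched in tabular form; the `q_update` helper below is our illustration, not the paper's implementation, which uses a neural approximator in place of the table.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One Bellman backup: Q(s,a) += alpha * (target - Q(s,a))."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy transition: from state 0, action 1 yields reward 1 and ends the episode.
Q = np.zeros((2, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1, done=True)
```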
3.1 Hierarchical Behavioural Cloning
In standard behavioral cloning the task is to learn a policy $\pi$ from expert trajectories $\{(s_0, a_0, s_1, a_1, \ldots)\}$. Hierarchical Behavioural Cloning (Nejati et al. (2006)) extends this to the options setup, where options are mapped to sub-goals: at each state the expert is assumed to select an option and, from then on, follow the option's policy until the option terminates. The expert trajectory thus interleaves the selected options $h$ with states and primitive actions, and the agent learns a hierarchical policy to maximize the likelihood of the expert trajectories.
3.2 Deep Discovery of Options
Deep Discovery of Options (DDO) (Fox et al. (2017)) uses the behavioral cloning framework to automatically infer useful policies from the flat trajectories of an expert. Taking the flat trajectories $\xi = (s_0, a_0, \ldots, s_T)$ as the observable variables, it assumes a hidden Markov model with latent variables $h_t$, the option currently being executed, and $b_t$, indicating whether the current option terminates at state $s_t$.
The latent variables are then $\zeta = (b_0, h_0, b_1, h_1, \ldots)$. Using an EM algorithm similar to Baum-Welch, DDO computes the option policies $\pi_h(a \mid s)$, the termination probabilities $\psi_h(s)$ for each option, and the meta-policy $\eta(h \mid s)$, jointly parameterized by $\theta$.
It uses the expectation-gradient trick,
$$\nabla_{\theta} \log P_{\theta}(\xi) = \mathbb{E}_{\zeta \sim P_{\theta}(\cdot \mid \xi)}\left[\nabla_{\theta} \log P_{\theta}(\xi, \zeta)\right].$$
Then a forward-backward algorithm is used, reminiscent of Baum-Welch, to compute forward probabilities $\alpha_t(h)$ and backward probabilities $\beta_t(h)$ for each option $h$ and timestep $t$. The gradient computed above can then be used in any stochastic gradient descent algorithm. The posterior over the latent variables can be written in terms of the forward and backward probabilities, e.g. $P_{\theta}(h_t = h \mid \xi) \propto \alpha_t(h)\,\beta_t(h)$. The dynamics term $T(s_{t+1} \mid s_t, a_t)$ cancels in these normalizations, which allows us to ignore the dynamics.
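A simplified sketch of the forward recursion is below. It assumes the per-timestep probabilities have been pre-evaluated along one trajectory into `[K, T]` arrays (an assumption for illustration); the actual DDO E-step also runs the backward pass and keeps the termination indicators $b_t$ explicit.

```python
import numpy as np

def ddo_forward(pi, psi, eta):
    """Forward pass of a Baum-Welch-style E-step over options (sketch).

    pi[h, t]  - prob. that option h takes the observed action a_t at s_t
    psi[h, t] - termination prob. of option h at s_t
    eta[h, t] - meta-policy prob. of selecting option h at s_t
    Returns alpha[h, t], the joint prob. of option h with actions a_0..a_t.
    """
    K, T = pi.shape
    alpha = np.zeros((K, T))
    alpha[:, 0] = eta[:, 0] * pi[:, 0]
    for t in range(1, T):
        # Either the previous option continues (no termination), or it
        # terminates and the meta-policy draws a fresh option.
        cont = (1.0 - psi[:, t]) * alpha[:, t - 1]
        switch = eta[:, t] * np.sum(psi[:, t] * alpha[:, t - 1])
        alpha[:, t] = (cont + switch) * pi[:, t]
    return alpha
```

Note that no transition model appears anywhere in the recursion, which mirrors the cancellation of the dynamics term above.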
3.3 Hierarchical Deep Q-Network
Hierarchical Deep Q-Network (h-DQN) (Kulkarni et al. (2016)) is an extension of DQN (Mnih et al. (2013)) that learns option policies and a meta-policy simultaneously. The options are defined, as in HBC, with respect to sub-goal states, and an intrinsic reward is used to learn option-specific value functions $Q(s, a; g)$ for each sub-goal $g$. The meta-policy is learned via action values over options, $Q_{meta}(s, g)$. The two value functions are approximated using separate networks. The options network is updated using the normal TD error
$$\delta = r_{int} + \gamma \max_{a'} Q(s', a'; g) - Q(s, a; g),$$
where $r_{int}$ is the intrinsic reward, while the meta-policy network uses SMDP Q-learning (Dietterich (2000)) updates:
$$Q_{meta}(s, g) \leftarrow Q_{meta}(s, g) + \alpha \left( R + \gamma^{k} \max_{g'} Q_{meta}(s', g') - Q_{meta}(s, g) \right),$$
where $R$ is the discounted reward (extrinsic only) received from the environment over the $k$ timesteps of the option's execution.
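The SMDP backup for the meta-policy can be sketched in tabular form (our illustration; h-DQN uses a network in place of the table):

```python
import numpy as np

def smdp_q_update(Q, s, o, R_disc, k, s_next, alpha=0.1, gamma=0.99):
    """SMDP Q-learning backup for the meta-policy.

    R_disc - discounted extrinsic reward accumulated while option o ran
    k      - number of primitive timesteps the option executed
    """
    target = R_disc + (gamma ** k) * np.max(Q[s_next])
    Q[s, o] += alpha * (target - Q[s, o])
    return Q

# Option 0 ran for 3 steps from state 0, collecting discounted reward 2.0.
Q = np.zeros((2, 2))
Q = smdp_q_update(Q, s=0, o=0, R_disc=2.0, k=3, s_next=1)
```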
4 Learning options in Grid-worlds
4.1 Training Setup
We first train a DDO agent in grid worlds and analyze the inferred options. The training method is summarized in Algorithm 1. Note that the meta-policy uses both options and primitive actions.
4.2 Analysis of Inferred Options
We used 4 to 6 options depending on the size of the grid world; when there were more than two rooms, as in the four-room grid, we used 6 options. Some of the inferred options are visualized in Appendix A. There is quite a bit of variation in the option policies. However, the termination probabilities were very high, as seen in Table 1; most options were not executed for more than one or two timesteps.
We also noticed that a Q-learner learned faster without using options (see Figure 1). This suggests the options were not necessarily improving exploration: because of the high termination probabilities, options behaved similarly to primitive actions, and updating Q-values for options alongside primitives may have increased the time needed to reach optimal behavior.
Similarity of trajectories
To assess how similar the agent's and expert's trajectories are, we used the KL-divergence between their action distributions as a similarity metric. The results are shown in Table 2.
A random policy for this task would give a KL-divergence of 0.69. In many environments the divergence is high, though still better than a random policy; this indicates that the generated trajectories may not be very similar to the expert's.
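A minimal sketch of the metric, assuming the per-state action distributions of both policies are available as arrays (the function name and interface are ours):

```python
import numpy as np

def mean_action_kl(expert_probs, agent_probs, eps=1e-8):
    """Mean KL(expert || agent) over visited states, given per-state
    action distributions of shape [n_states, n_actions]."""
    p = np.clip(np.asarray(expert_probs), eps, 1.0)
    q = np.clip(np.asarray(agent_probs), eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))
```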
Hinge loss of value function
Next, we check whether the value function learned by the expert on a given task matches that of the agent. We use the hinge loss of the agent's value function with respect to the expert's: the loss counts the difference only at states where the agent's value function is lower than the expert's, and thus measures the maximal marginal difference. The lower the loss, the closer the agent's value function is to the expert's (which is learned via value iteration).
The results are summarized in Table 3.
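The hinge loss described above can be sketched as follows, assuming both value functions are available as per-state arrays (the helper name is ours):

```python
import numpy as np

def hinge_value_loss(v_expert, v_agent):
    """One-sided value gap: average over states of
    max(0, V_expert(s) - V_agent(s)); states where the agent is at
    least as good as the expert contribute nothing."""
    gap = np.asarray(v_expert) - np.asarray(v_agent)
    return float(np.mean(np.maximum(0.0, gap)))
```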
4.3 Learning from a hierarchical expert
So far the expert was trained using value iteration and did not exhibit hierarchical behavior. We now train from a hierarchical expert by using the previously trained DDO agent as the expert. We notice little similarity between the expert's options and the inferred options, as seen in Figure 2.
We also tried hand-coding options into the expert policy, training it with SMDP Q-learning, and then training a DDO agent on the expert's trajectories. Again we did not notice many similarities, as seen in Figure 3.
However, the hinge loss decreased to 0.014 and the KL-divergence between policies was 0.19, so the DDO agent emulated the hierarchical expert more closely.
4.4 Overcoming large termination probability
A major obstacle to executing the inferred options for longer in the DDO agent was that the termination probabilities were simply too high. We therefore multiplied the inferred termination probabilities by a constant factor, decreasing them before learning the meta-policy and allowing options to execute for more timesteps before terminating. As expected, the fraction of time spent in options increased as the factor decreased; however, the hinge loss also increased slightly (see Table 4).
|Median steps||Mean steps||Fraction of Option Time||Hinge loss|
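The adjustment itself is simple; a sketch of our reconstruction, with the scaling factor written as `c` and clipping added for safety:

```python
import numpy as np

def scale_termination(psi, c):
    """Scale inferred termination probabilities by a constant c in (0, 1].
    With a constant termination probability p, an option's expected
    duration is 1/p, so scaling by c stretches it to 1/(c*p)."""
    return np.clip(c * np.asarray(psi), 0.0, 1.0)
```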
4.5 Avoiding mode collapse of options
We noticed that many of the inferred options were very similar. To increase the diversity of options, we added a regularizer that increases the KL-divergence between option policies.
Hence, our new loss function becomes
$$\mathcal{L}' = -\mathcal{L} - \lambda \sum_{i \neq j} \mathrm{KL}\left(\pi_i \,\|\, \pi_j\right),$$
where $\mathcal{L}$ is the log-likelihood of the DDO parameters given the expert trajectories and $\lambda$ determines the importance of the KL-divergence term. The effect of $\lambda$ on the inferred policy is summarized in Table 5.
|$\lambda$||Median steps for options||Error|
We see that setting the regularization weight $\lambda$ too high or too low makes the median steps per option small. With $\lambda$ too high, the options learned may not be useful enough and only primitive actions may be chosen; with $\lambda$ too low, the KL-divergence term has no effect and the options learned may all be similar because of mode collapse.
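The regularized objective can be sketched as follows; the function names and the exact averaging over states are our assumptions, and in practice the KL bonus would be computed on the parameterized option policies during training rather than on fixed arrays.

```python
import numpy as np

def pairwise_kl_bonus(option_probs, eps=1e-8):
    """Mean pairwise KL between option action distributions, averaged
    over states; option_probs has shape [n_options, n_states, n_actions]."""
    p = np.clip(np.asarray(option_probs), eps, 1.0)
    K = p.shape[0]
    total, pairs = 0.0, 0
    for i in range(K):
        for j in range(K):
            if i != j:
                total += float(np.mean(np.sum(p[i] * np.log(p[i] / p[j]), axis=-1)))
                pairs += 1
    return total / max(pairs, 1)

def regularized_loss(neg_log_lik, option_probs, lam):
    """Minimize the negative log-likelihood while rewarding
    diverse (high mutual KL) option policies."""
    return neg_log_lik - lam * pairwise_kl_bonus(option_probs)
```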
5 Learning options in Atari Domain
5.1 Training Setup
We also ran DDO on the Atari games PONG and KRULL to analyze the options inferred in domains with large state spaces. We used a trained A3C agent as the expert policy.
We used the h-DQN framework to train the meta-policy after learning the options. Unlike the actual h-DQN setup described in Kulkarni et al. (2016), we do not train a network for the option policies, only the meta-controller network that learns a policy over options and primitive actions, as described in Algorithm 2.
The simple training procedure used in the grid-world domains, where DDO options are learned first and the meta-policy afterwards, did not converge quickly here. We therefore used the iterated training procedure described in Algorithm 3: in each iteration we sample trajectories from a trajectory buffer, refine the DDO parameters on the sampled trajectories, then refine the h-DQN meta-policy, and finally add samples generated by the agent back into the trajectory buffer.
5.2 Analysis of inferred options
We used 10 options for PONG and 20 options for KRULL.
Qualitative analysis found at most 3 options in each game that looked useful. The rest of the options usually involved high-frequency periodic oscillations in state transitions.
Figure 4 shows the average number of timesteps per option for the trained DDO agent; the values at the top of the bars denote the standard deviation.
6 Conclusion
We have analyzed the options inferred by the DDO algorithm in grid-world and Atari domains, using metrics such as the KL-divergence between policies and the hinge loss between value functions to assess the similarity in behavior between agent and expert.
We found that the inferred termination probabilities are too high, and we used ad-hoc methods to alleviate this and the mode collapse of options: multiplying the termination probabilities by a constant factor, and introducing a KL-divergence regularizer to increase the variety of options. These solutions make the learned policies use the inferred options more frequently and for a longer fraction of the trajectories.
When training the DDO agent in the Atari domain, alternating between training the DDO option parameters and training the h-DQN meta-policy for several timesteps led to faster learning than learning the meta-policy only after DDO option inference had finished.
The importance of options could also be validated by the salient states or events they discover, their usefulness in transferring to slightly different tasks in the same environment, the regions of state space in which they operate, and so on. Creating metrics for continuous-action domains would also be a good next step.
- Bacon et al. (2017) Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pp. 1726–1734, 2017.
- Barto & Mahadevan (2003) Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems, 13(1-2):41–77, 2003.
- Chentanez et al. (2005) Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 1281–1288, 2005.
- Dietterich (2000) Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
- Fox et al. (2017) Roy Fox, Sanjay Krishnan, Ion Stoica, and Ken Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
- Krishnan et al. (2017) Sanjay Krishnan, Roy Fox, Ion Stoica, and Ken Goldberg. Ddco: Discovery of deep continuous options for robot learning from demonstrations. In Conference on Robot Learning, pp. 418–437, 2017.
- Kulkarni et al. (2016) Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pp. 3675–3683, 2016.
- Le et al. (2018) Hoang M Le, Nan Jiang, Alekh Agarwal, Miroslav Dudík, Yisong Yue, and Hal Daumé III. Hierarchical imitation and reinforcement learning. arXiv preprint arXiv:1803.00590, 2018.
- Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Nejati et al. (2006) Negin Nejati, Pat Langley, and Tolga Konik. Learning hierarchical task networks by observation. In Proceedings of the 23rd international conference on Machine learning, pp. 665–672. ACM, 2006.
- Ross et al. (2011) Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.
- Smith et al. (2018) Matthew Smith, Herke Hoof, and Joelle Pineau. An inference-based policy gradient method for learning options. In International Conference on Machine Learning, pp. 4710–4719, 2018.
- Sutton et al. (1998) Richard S Sutton, Doina Precup, and Satinder P Singh. Intra-option learning about temporally abstract actions. In ICML, volume 98, pp. 556–564, 1998.
- Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
- Welch (2003) Lloyd R Welch. Hidden markov models and the baum-welch algorithm. IEEE Information Theory Society Newsletter, 53(4):10–13, 2003.
Appendix A Visualization of inferred policies
We show some of the policies inferred on grid-world tasks, using an expert trained via value iteration.
Appendix B Proposed Evaluation Metrics
B.1 State-Visitation Count
State visitation counts give an estimate of the probability of visiting a particular state. Option policies with diverse state-visitation counts indicate a multi-modal set of options. This metric can also be used in continuous and high-dimensional domains by means of hash functions.
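A minimal sketch of the counting step; the `hash_fn` parameter is our stand-in for whatever state hashing a high-dimensional domain would require.

```python
from collections import Counter

def visitation_counts(trajectories, hash_fn=tuple):
    """Count state visits across trajectories. hash_fn maps a possibly
    high-dimensional state to a hashable key (e.g. a discretizing hash)."""
    counts = Counter()
    for traj in trajectories:
        for state in traj:
            counts[hash_fn(state)] += 1
    return counts

# Two short grid-world trajectories over (row, col) states.
counts = visitation_counts([[(0, 0), (0, 1)], [(0, 0)]])
```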
B.2 KL-Divergence
The KL-divergence between the action distributions of the agent and expert policies is a good measure of how closely the agent imitates the expert.
We also use the KL-divergence between two option policies as a measure of how different the options are. This can be used as a regularizer to prevent mode collapse of options, as described in Section 4.5.
B.3 Hinge Value Function Loss
We use the hinge loss of the agent's value function with respect to the expert's to measure how close to optimal the agent's behavior is. The loss counts the difference only at states where the agent's value function is lower than the expert's, and thus measures the maximal marginal difference.
B.4 Diffusion Time
This is the average, over all pairs of states, of the expected number of timesteps required to go from one state to the other by a random walk over the inferred options and primitive actions. A small diffusion time implies the agent can cover a larger proportion of the state space during initial exploration, and that the options allow the agent to get past bottleneck states into different regions of the state space.
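The metric can be estimated by Monte-Carlo simulation; this sketch assumes the environment is summarized as an adjacency map of one-step (primitive or option) transitions, which is our simplification.

```python
import random

def diffusion_time(neighbors, n_walks=50, max_steps=1000, seed=0):
    """Monte-Carlo estimate of diffusion time: the mean first-passage
    time of a uniform random walk, averaged over all ordered pairs of
    distinct states. neighbors[s] lists the states reachable from s
    in one step."""
    rng = random.Random(seed)
    states = list(neighbors)
    total, pairs = 0.0, 0
    for src in states:
        for dst in states:
            if src == dst:
                continue
            steps = 0
            for _ in range(n_walks):
                s, t = src, 0
                while s != dst and t < max_steps:
                    s = rng.choice(neighbors[s])
                    t += 1
                steps += t
            total += steps / n_walks
            pairs += 1
    return total / pairs
```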
B.5 t-SNE Embeddings
The representations of states can be projected into a two-dimensional space using methods like t-SNE. Visualizing the mapping between an option and the states where it is active can be a good visual cue for determining the spatial variance of options.