PODNet: A Neural Network for Discovery of Plannable Options

11/01/2019 · Ritwik Bera, et al.

Learning from demonstration has been widely studied in machine learning but becomes challenging when the demonstrated trajectories are unstructured and follow different objectives. Our proposed work, Plannable Option Discovery Network (PODNet), addresses how to segment an unstructured set of demonstrated trajectories for option discovery. This enables learning from demonstration to perform multiple tasks and plan high-level trajectories based on the discovered option labels. PODNet combines a custom categorical variational autoencoder, a recurrent option inference network, an option-conditioned policy network, and an option dynamics model in an end-to-end learning architecture. Due to the concurrently trained option-conditioned policy network and option dynamics model, the proposed architecture has implications in multi-task and hierarchical learning, explainable and interpretable artificial intelligence, and applications where the agent is required to learn only from observations.

Introduction

Learning from demonstrations to perform a single task has been widely studied in the machine learning literature [argall2009survey, ross2011reduction, ross2013learning, bojarski2016end, goecks2018efficiently]. In these approaches, demonstrations are carefully curated to exemplify a specific task to be carried out by the learning agent. The challenge arises when the demonstrator is performing more than one task, or multiple hierarchical sub-tasks of a complex objective, also called options, where the same set of observations can be mapped to a different set of actions depending on the option being performed [sutton1999between, stolle2002learning]. This is a challenge for traditional behavior cloning techniques, which focus on learning a single mapping between observations and actions in a single-option scenario.

This paper presents the Plannable Option Discovery Network (PODNet), which attempts to enable agents to learn the semantic structure behind complex demonstrated tasks by using a meta-controller operating in the option space instead of directly operating in the action space. The main hypothesis is that a meta-controller operating in the option space can achieve much faster convergence on imitation learning and reinforcement learning benchmarks than an action-space policy network, due to the significantly smaller size of the option space. Our contribution, PODNet, is a custom categorical variational autoencoder (CatVAE) composed of several constituent networks that not only segment demonstrated trajectories into options, but also concurrently train an option dynamics model that can be used for downstream planning tasks and for training on simulated rollouts to minimize interaction with the environment while the policy is maturing. Unlike previous imitation-learning-based approaches to option discovery, our approach does not require the agent to interact with the environment during option discovery, as it trains offline on behavior cloning data alone. Moreover, being able to infer the option label for the behavior the learning agent is currently executing, essentially allowing the agent to broadcast the option it is pursuing, has implications in explainable and interpretable artificial intelligence.

Related Work

This work addresses how to segment an unstructured set of demonstrated trajectories for option discovery. The one-shot imitation architecture developed by Wang2017 [Wang2017] using conditional GAIL (cGAIL) maps trajectories into a set of latent codes that captures the semantic relationship in an interpretable and meaningful manner. This is analogous to word2vec [mikolov2013efficient] in natural language processing (NLP), where words are embedded into a vector space that preserves linguistic relationships.

Figure 1:

Proposed encoder-decoder architecture. Note that the Policy Network decoder could also be a recurrent neural network (RNN) if we wish to make the behaviour label dependent on all preceding states and labels instead of just the previous state and corresponding behaviour label.

In InfoGAN [chen2016infogan], a generative adversarial network (GAN) maximizes the mutual information between the latent variables and the observation, learning a discriminator that confidently predicts the observation labels. InfoRL [Hayat2019] and InfoGAIL [Li2017] utilized the concept of mutual information maximization to map latent variables to solution trajectories (generated by RL) and expert demonstrations, respectively. Directed-InfoGAIL [Sharma2018] introduced the concept of directed information, maximizing the mutual information between the trajectory observed so far and the consequent option label. This modification to the InfoGAIL architecture allowed it to segment demonstrations and reproduce options; however, it assumed prior knowledge of the number of options to be discovered. Diversity Is All You Need (DIAYN) [eysenbach2018diversity] recovers distinctive sub-behaviours from random exploration by generating random trajectories and maximising the mutual information between the states and the behavior label.

Variational Autoencoding Learning of Options by Reinforcement (VALOR) [achiam2018variational] used VAEs [higgins2017beta] to encode labels into trajectories, thus also implicitly maximising the mutual information between behavior labels and the corresponding trajectories. DIAYN's mutual information maximisation objective is also implicitly solved in a VAE setting (see Appendix). Both VAEs and InfoGANs maximize mutual information between latent states and the input data; the difference is that VAEs have access to the true data distribution, while InfoGANs must also learn to model the true data distribution. More recently, CompILE [kipf2019compile] employed a VAE-based approach to infer not only option labels at every trajectory step but also option start and termination points in the given trajectory. However, once an option is inferred to be complete, it is masked out; when inferring options later in the trajectory, the agent therefore loses track of critical options that occurred in the past.

Most of the related works mentioned so far do not learn a dynamics model, and as a result, the discovered options cannot be used for downstream planning via model-based RL techniques. In our work, we utilize the fact that the demonstration data has state-transition information embedded within the demonstration trajectories and thus can be used to learn a dynamics model while simultaneously learning options. We also present a technique to identify the number of distinguishable options to be discovered from the demonstration data.

Plannable Option Discovery Network

Our proposed approach, Plannable Option Discovery Network (PODNet), is a custom categorical variational autoencoder (CatVAE) which consists of several constituent networks: a recurrent option inference network, an option-conditioned policy network, and an option dynamics model, as seen in Figure 1. The categorical VAE allows the network to map each trajectory segment into a latent code and intrinsically perform soft k-means clustering on the inferred option labels. The following subsections explain the constituent components of PODNet.

Constituent Neural Networks

Recurrent option inference network

In a complex task, the choice of an option at any time depends on both the current state and a history of the current and previous options that have been executed. For example, in a door-opening task, an agent would decide to open a door only if it had already fetched the key earlier. We utilize a recurrent encoder using long short-term memory (LSTM) [hochreiter1997long] to ensure that the current option's dependence on both the current state and the preceding options is captured. This helps overcome the problem where different options that contain similar or overlapping states are mapped to the same option label, as was observed in DIAYN [eysenbach2018diversity]. Our option inference network P is an LSTM that takes as input the current state s_t as well as the previous option label c_{t-1} and predicts the option label c_t for the current time step t.
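A minimal sketch of such a recurrent option inference network is given below. This is an illustration only, not the authors' implementation; the class name, dimensions, and layer sizes are assumptions.

```python
# Sketch of a recurrent option inference network: an LSTM that consumes the
# current state and the previous option label and outputs logits over K
# candidate option labels. All dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

STATE_DIM, NUM_OPTIONS, HIDDEN_DIM = 8, 4, 64

class OptionInferenceNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Input at each step: current state concatenated with previous option label.
        self.lstm = nn.LSTM(STATE_DIM + NUM_OPTIONS, HIDDEN_DIM, batch_first=True)
        self.head = nn.Linear(HIDDEN_DIM, NUM_OPTIONS)

    def forward(self, states, prev_options, hidden=None):
        # states: (batch, T, STATE_DIM); prev_options: (batch, T, NUM_OPTIONS) one-hot or soft labels
        x = torch.cat([states, prev_options], dim=-1)
        out, hidden = self.lstm(x, hidden)
        return self.head(out), hidden     # option logits of shape (batch, T, NUM_OPTIONS)

# Example: option logits for a batch of 2 trajectories of length 5.
net = OptionInferenceNet()
logits, _ = net(torch.randn(2, 5, STATE_DIM), torch.zeros(2, 5, NUM_OPTIONS))
```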

Option-conditioned policy network

Approaches such as InfoGAIL [Li2017] achieve the disentanglement into latent variables by imitating the demonstrated trajectories while having access only to the inferred latent variable and not the demonstrator actions. We achieve this goal by concurrently training an option-conditioned policy network that takes in the current predicted option c_t as well as the current state s_t and predicts the action a_t, minimizing the behaviour cloning loss on the demonstration trajectories.
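A hedged sketch of this component is shown below; the multilayer perceptron architecture, dimensions, and mean-squared-error behaviour cloning loss are assumptions for continuous actions, not the paper's exact choices.

```python
# Sketch of an option-conditioned policy network: an MLP mapping (state, option
# label) to an action, trained with a behavior cloning (regression) loss.
import torch
import torch.nn as nn

STATE_DIM, NUM_OPTIONS, ACTION_DIM = 8, 4, 2

policy = nn.Sequential(
    nn.Linear(STATE_DIM + NUM_OPTIONS, 64),
    nn.ReLU(),
    nn.Linear(64, ACTION_DIM),
)

def behavior_cloning_loss(states, option_labels, demo_actions):
    # Predict actions from state + inferred option, regress onto demonstrator actions.
    pred_actions = policy(torch.cat([states, option_labels], dim=-1))
    return nn.functional.mse_loss(pred_actions, demo_actions)

# Example with a random batch of 16 transitions.
loss = behavior_cloning_loss(torch.randn(16, STATE_DIM),
                             torch.eye(NUM_OPTIONS)[torch.randint(NUM_OPTIONS, (16,))],
                             torch.randn(16, ACTION_DIM))
```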

Figure 2: Complete PODNet diagram illustrating how the option dynamics model is integrated with meta-controllers to plan trajectories. Given a goal state, the meta-controller simulates trajectories using the option dynamics model and outputs the best estimated sequence of options to achieve the goal state. This sequence is then passed to the option-conditioned policy network, which outputs the sequence of estimated actions required to follow the planned option sequence.

Option dynamics model

The main novelty of PODNet is the inclusion of an option dynamics model. The option dynamics model Q takes as input the current state s_t and option label c_t and predicts the next state s_{t+1}. In other words, the option dynamics model is an option-conditioned state-transition function that is conditioned on the current option being executed instead of the current action, as traditional state-transition models are. The option dynamics model is trained simultaneously with the policy and option inference networks by adding the option dynamics consistency loss to the overall training objective. Training an option dynamics model in this way accomplishes two things: first, it ensures that the system dynamics can be completely defined by the option label, potentially allowing for easier recovery of option labels. Second, it ensures that the recovered option labels allow the environment dynamics to be modeled in terms of the options themselves. This not only provides the ability to incorporate planning, but allows planning to be performed at the option level instead of the action level, enabling more efficient planning over longer time-scales.
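A minimal sketch of the option dynamics model and its consistency loss follows; the architecture and loss form are assumptions made for illustration.

```python
# Sketch of an option dynamics model: given the current state and option label,
# predict the next state; trained with an option dynamics consistency loss.
import torch
import torch.nn as nn

STATE_DIM, NUM_OPTIONS = 8, 4

option_dynamics = nn.Sequential(
    nn.Linear(STATE_DIM + NUM_OPTIONS, 64),
    nn.ReLU(),
    nn.Linear(64, STATE_DIM),
)

def dynamics_consistency_loss(states, option_labels, next_states):
    # Option-conditioned state-transition prediction, regressed onto observed next states.
    pred_next = option_dynamics(torch.cat([states, option_labels], dim=-1))
    return nn.functional.mse_loss(pred_next, next_states)
```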

Training

The training process occurs offline and starts by collecting a dataset of unstructured demonstrated trajectories, which can be generated from any source, for example, human experts, optimal controllers, or pre-trained reinforcement learning agents. The overall training loss combines the behavior cloning loss of the option-conditioned policy network, the option dynamics consistency loss, the KL divergence regularization toward a uniform categorical prior, and the temporal smoothing term described in the following subsections.
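A hedged sketch of how these terms could be combined is given below; the weighting hyperparameters (beta, lambda_dyn, lambda_ts) and the simple weighted sum are assumptions, since the paper's exact equation is not reproduced here.

```python
# Assumed composition of the PODNet training objective from the losses
# described in this section; weights are illustrative hyperparameters.
def podnet_loss(bc_loss, dyn_loss, kl_to_uniform, temporal_smoothing,
                beta=1.0, lambda_dyn=1.0, lambda_ts=0.1):
    return (bc_loss                           # behavior cloning of demonstrator actions
            + lambda_dyn * dyn_loss           # option dynamics consistency
            + beta * kl_to_uniform            # KL toward a uniform categorical prior
            + lambda_ts * temporal_smoothing  # temporal smoothing of option labels
            )
```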

Ensuring smooth backpropagation

To ensure that gradients flow only through differentiable operations during backpropagation, the inferred option label is represented by a Gumbel-Softmax distribution, as illustrated in the literature on categorical VAEs [Jang2016]. Using argmax to select the option with the highest conditional probability would introduce a discrete operation into the neural network and prohibit backpropagation through PODNet. Instead, the softmax relaxation alone is used during the backward pass to allow backpropagation; for the forward pass, the softmax output is further subject to the argmax operator to obtain a one-hot encoded label vector.
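This corresponds to a straight-through Gumbel-Softmax estimate. A minimal sketch is shown below; the use of PyTorch's built-in gumbel_softmax is an implementation choice, not something specified in the paper.

```python
# Straight-through Gumbel-Softmax sampling of the option label: the forward
# pass emits a one-hot (argmax) label, while gradients flow through the
# underlying softmax relaxation in the backward pass.
import torch
import torch.nn.functional as F

def sample_option(logits, temperature=1.0):
    # hard=True applies the straight-through trick described above.
    return F.gumbel_softmax(logits, tau=temperature, hard=True)

# Equivalent manual form of the straight-through estimator:
def sample_option_manual(logits, temperature=1.0):
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    soft = F.softmax((logits + gumbel) / temperature, dim=-1)      # used in the backward pass
    hard = F.one_hot(soft.argmax(dim=-1), soft.shape[-1]).float()  # used in the forward pass
    return hard + soft - soft.detach()   # forward value: hard; gradient: through soft
```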

Entropy Regularisation

The categorical distribution produced by the encoder network is forced to have minimal KL divergence with a uniform categorical distribution. This ensures that inputs are not all encoded into the same sub-behaviour cluster and are instead meaningfully separated into distinct clusters. This entropy-driven regularization encourages exploration of the label space, and the degree of exploration can be modulated by tuning the associated hyperparameter.
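A short sketch of this regularizer is given below; the exact weighting and naming are assumptions.

```python
# KL divergence between the encoder's categorical distribution over option
# labels and a uniform categorical prior, averaged over the batch.
import math
import torch
import torch.nn.functional as F

def kl_to_uniform(option_logits):
    # option_logits: (..., K) unnormalized scores over K option labels.
    log_q = F.log_softmax(option_logits, dim=-1)
    q = log_q.exp()
    k = option_logits.shape[-1]
    # KL(q || Uniform(K)) = sum_i q_i * (log q_i - log(1/K)) = sum_i q_i log q_i + log K
    return (q * log_q).sum(dim=-1).mean() + math.log(k)

reg = kl_to_uniform(torch.randn(16, 4))
```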

Temporal smoothing of option labels

In order to prevent the model from switching between option labels too frequently, a temporal smoothing regularization term is added to the training objective. It penalizes differences between the components of consecutively inferred option labels, weighted by a hyperparameter that regulates the temporal consistency of the inferred options.
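Since the exact functional form is not reproduced here, the sketch below assumes a squared-difference penalty between successive (soft or one-hot) option labels over a trajectory.

```python
# Assumed temporal smoothing penalty on consecutive inferred option labels.
import torch

def temporal_smoothing(option_labels):
    # option_labels: (batch, T, K) option label vectors over a trajectory.
    diffs = option_labels[:, 1:] - option_labels[:, :-1]
    return (diffs ** 2).sum(dim=-1).mean()
```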

Discovery of number of options

The number of options can be obtained by holding out part of the demonstration data and evaluating the behaviour cloning loss on it, analogous to a validation loss. We start with an initial number of options to be discovered, K, and increment or decrement it in the direction that decreases this held-out loss.
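A hedged sketch of this model-selection procedure follows; train_podnet and heldout_bc_loss are hypothetical helpers, and the simple sweep over candidate values of K stands in for the increment/decrement search described above.

```python
# Select the number of options by training with candidate values of K and
# picking the one with the lowest behavior cloning loss on held-out demonstrations.
def select_num_options(initial_k, train_data, heldout_data, max_k=16):
    best_k, best_loss = initial_k, float("inf")
    for candidate in range(max(2, initial_k - 2), max_k + 1):
        model = train_podnet(train_data, num_options=candidate)   # hypothetical helper
        loss = heldout_bc_loss(model, heldout_data)               # hypothetical helper
        if loss < best_loss:
            best_k, best_loss = candidate, loss
    return best_k
```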

Planning Option Sequences

Although the main motivation for PODNet is to segment unstructured trajectories, the learned option dynamics model combined with the option-conditioned policy network can be used for planning option sequences. As shown in Figure 2, the option dynamics model learned with PODNet can be integrated with meta-controllers to plan trajectories. Given a goal state, the meta-controller simulates trajectories using the option dynamics model and outputs the best estimated sequence of options to achieve the goal state. This sequence is then passed to the option-conditioned policy network, which outputs the sequence of estimated actions required to follow the planned option sequence.
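One simple way a meta-controller could use the learned option dynamics model is random-shooting search in option space, sketched below; the search strategy and helper signatures are assumptions rather than the paper's algorithm.

```python
# Plan in option space: roll out candidate option sequences with the learned
# option dynamics model, score them by distance to the goal state, and return
# the best sequence for the option-conditioned policy to execute.
import torch
import torch.nn.functional as F

def plan_option_sequence(option_dynamics, start_state, goal_state,
                         num_options=4, horizon=10, num_candidates=256):
    best_seq, best_dist = None, float("inf")
    for _ in range(num_candidates):
        seq = torch.randint(num_options, (horizon,))
        state = start_state.clone()
        for c in seq:
            one_hot = F.one_hot(c, num_options).float()
            state = option_dynamics(torch.cat([state, one_hot]))  # predicted next state
        dist = torch.norm(state - goal_state).item()
        if dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq
```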

Conclusion

In this paper we presented PODNet, a neural network architecture for the discovery of plannable options. Our approach combines a custom categorical variational autoencoder, a recurrent option inference network, an option-conditioned policy network, and an option dynamics model for end-to-end training and segmentation of an unstructured set of demonstrated trajectories for option discovery. PODNet's architecture implicitly utilizes prior knowledge about options being dynamically consistent (plannable and representable by a skill dynamics model), temporally extended (as enforced by the temporal smoothing regularization), and definitive of the agent's actions at a particular state (as enforced by an option-conditioned policy network). This leads to the discovery of plannable options that enable predictable behavior in AI agents when they adapt to new tasks in a transfer learning setting. The proposed architecture has implications in multi-task and hierarchical learning, and in explainable and interpretable artificial intelligence.

References

Appendices

VAEs and mutual information

Disregarding the regularization terms, it can be shown that the objective function solved by training variational autoencoders (VAEs) is equivalent to maximising the mutual information term I(x; z), where x is the input data and z is the bottleneck latent state. The encoder network is q(z|x) and the decoder network is p(x|z), which can also be viewed as an approximation to the posterior over inputs given the latent code.
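A hedged reconstruction of the standard argument is given below (the paper's own derivation is not reproduced in this extraction; the notation q_phi for the encoder, p_theta for the decoder, and the data entropy H(x) are standard assumptions). With the joint distribution q(x, z) = p_data(x) q_phi(z|x):

```latex
\begin{align*}
I_q(x;z) &= H(x) - H(x \mid z) \\
         &= H(x) + \mathbb{E}_{p_\mathrm{data}(x)\, q_\phi(z \mid x)}\!\left[\log q(x \mid z)\right] \\
         &\geq H(x) + \mathbb{E}_{p_\mathrm{data}(x)\, q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right],
\end{align*}
```

where the inequality uses the non-negativity of KL(q(x|z) || p_theta(x|z)) and H(x) is a constant of the data.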

The final term is the evidence lower bound (ELBO) (minus the KL divergence regularisation term) that is maximised in VAEs as a part of the objective function. The KL divergence regularisation term ensures exploration and packing in the latent variable space.

Equivalence between VAE and Diversity Is All You Need’s training objective

Let us consider the objective function defined in Diversity Is All You Need [Eysenbach2018],

Note that,

Since the system dynamics and the distribution of initial states are assumed fixed, it can be said that

Now let us analyze the Kullback-Leibler divergence term,

Hence, we can write the objective function as,

(1)

Recomputing the expectation over trajectories as an expectation over states,

Hence, we can re-write the objective function (1) as,

Hence, we can see the equivalence between the ELBO objective in variational autoencoders and the objective defined in DIAYN. Note that the regularization term ensures that the policy network explores the action space and generates a diverse distribution of possible output trajectories.