Learning from demonstrations to perform a single task has been widely studied in the machine learning literature [argall2009survey, ross2011reduction, ross2013learning, bojarski2016end, goecks2018efficiently]. In these approaches, demonstrations are carefully curated to exemplify a specific task to be carried out by the learning agent. The challenge arises when the demonstrator performs more than one task, or multiple hierarchical sub-tasks of a complex objective, also called options, where the same set of observations can be mapped to different sets of actions depending on the option being performed [sutton1999between, stolle2002learning]. This is a challenge for traditional behavior cloning techniques, which focus on learning a single mapping between observations and actions in a single-option scenario.
This paper presents the Plannable Option Discovery Network (PODNet), which aims to enable agents to learn the semantic structure behind complex demonstrated tasks by using a meta-controller that operates in the option-space instead of directly in the action-space. The main hypothesis is that a meta-controller operating in the option-space can converge much faster on imitation learning and reinforcement learning benchmarks than an action-space policy network, due to the significantly smaller size of the option-space. Our contribution, PODNet, is a custom categorical variational autoencoder (CatVAE) composed of several constituent networks that not only segment demonstrated trajectories into options, but concurrently train an option dynamics model that can be used for downstream planning tasks and for training on simulated rollouts, minimizing interaction with the environment while the policy is maturing. Unlike previous imitation-learning-based approaches to option discovery, our approach does not require the agent to interact with the environment during option discovery, as it trains offline on behavior cloning data alone. Moreover, inferring the option label for the behavior currently executed by the learning agent, essentially allowing the agent to broadcast the option it is pursuing, has implications for explainable and interpretable artificial intelligence.
This work addresses how to segment an unstructured set of demonstrated trajectories for option discovery. The one-shot imitation architecture developed by Wang2017 [Wang2017] using conditional GAIL (cGAIL) maps trajectories into a set of latent codes that capture semantic relationships in an interpretable and meaningful manner, analogous to word2vec embeddings [mikolov2013efficient].
In InfoGAN [chen2016infogan], a generative adversarial network (GAN) maximizes the mutual information between the latent variables and the observation, learning a discriminator that confidently predicts the observation labels. InfoRL [Hayat2019] and InfoGAIL [Li2017] utilized mutual information maximization to map latent variables to solution trajectories (generated by RL) and expert demonstrations, respectively. Directed-InfoGAIL [Sharma2018] introduced the concept of directed information, maximizing the mutual information between the trajectory observed so far and the consequent option label. This modification to the InfoGAIL architecture allowed it to segment demonstrations and reproduce options; however, it assumed prior knowledge of the number of options to be discovered. Diversity Is All You Need (DIAYN) [eysenbach2018diversity] recovers distinctive sub-behaviors from random exploration by generating random trajectories and maximizing mutual information between the states and the behavior label.
Variational Autoencoding Learning of Options by Reinforcement (VALOR) [achiam2018variational] used VAEs [higgins2017beta] to encode labels into trajectories, thus also implicitly maximizing mutual information between behavior labels and their corresponding trajectories. DIAYN's mutual information maximization objective is also implicitly solved in a VAE setting (see Appendix). Both VAEs and InfoGANs maximize mutual information between latent states and the input data; the difference is that VAEs have access to the true data distribution, while InfoGANs also have to learn to model it. More recently, CompILE [kipf2019compile] employed a VAE-based approach to infer not only option labels at every trajectory step but also the option start and termination points in the given trajectory. However, options inferred to be completed are masked out, so while inferring options later in the trajectory, the agent loses track of critical options that may have occurred in the past.
Most of the related works mentioned so far do not learn a dynamics model, so the discovered options cannot be used for downstream planning via model-based RL techniques. In our work, we exploit the fact that state-transition information is embedded within the demonstration trajectories, and thus a dynamics model can be learned while simultaneously learning options. We also present a technique to identify the number of distinguishable options to be discovered from the demonstration data.
Plannable Option Discovery Network
Our proposed approach, Plannable Option Discovery Network (PODNet), is a custom categorical variational autoencoder (CatVAE) consisting of several constituent networks: a recurrent option inference network, an option-conditioned policy network, and an option dynamics model, as seen in Figure 1. The categorical VAE allows the network to map each trajectory segment into a latent code and intrinsically perform soft k-means clustering on the inferred option labels. The following subsections explain the constituent components of PODNet.
Constituent Neural Networks
Recurrent option inference network
In a complex task, the choice of an option at any time depends on both the current state and the history of current and previous options that have been executed. For example, in a door-opening task, an agent would decide to open a door only if it had already fetched the key. We use a recurrent encoder based on long short-term memory (LSTM) [hochreiter1997long] to ensure that the current option's dependence on both the current state and the preceding options is captured. This helps overcome the problem where different options containing similar or overlapping states are mapped to the same option label, as was observed in DIAYN [eysenbach2018diversity]. Our option inference network is an LSTM that takes as input the current state $s_t$ and the previous option label $c_{t-1}$ and predicts the option label $c_t$ for time step $t$.
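As a minimal illustration of this recurrence, the sketch below uses a single tanh cell with random weights in place of the trained LSTM; all dimensions and weight names are hypothetical. It shows only the structural point: the option logits at each step depend on the current state, the previous option label, and a recurrent hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, NUM_OPTIONS, HIDDEN_DIM = 4, 3, 8  # illustrative sizes

# Random weights of a single tanh recurrence standing in for a trained LSTM.
W = rng.normal(0.0, 0.1, (HIDDEN_DIM, STATE_DIM + NUM_OPTIONS + HIDDEN_DIM))
V = rng.normal(0.0, 0.1, (NUM_OPTIONS, HIDDEN_DIM))

def infer_option(state, prev_option_onehot, h):
    """One inference step: option logits depend on the current state,
    the previous option label, and the recurrent hidden state."""
    x = np.concatenate([state, prev_option_onehot, h])
    h_next = np.tanh(W @ x)
    logits = V @ h_next
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()            # softmax over option labels
    return probs, h_next

state = rng.normal(size=STATE_DIM)
prev_c = np.eye(NUM_OPTIONS)[0]     # previous option label as a one-hot vector
probs, h = infer_option(state, prev_c, np.zeros(HIDDEN_DIM))
```

Because the hidden state carries forward a summary of past options, two visits to the same state can yield different option predictions, which is the property the door-opening example requires.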
Option-conditioned policy network
Approaches such as InfoGAIL [Li2017] achieve the disentanglement into latent variables by imitating the demonstration trajectories while having access only to the inferred latent variable and not the demonstrator's actions. We achieve this goal by concurrently training an option-conditioned policy network that takes in the current predicted option and the current state and predicts the action that minimizes the behavior cloning loss on the demonstration trajectories.
Option dynamics model
The main novelty of PODNet is the inclusion of an option dynamics model. The option dynamics model takes as input the current state $s_t$ and option label $c_t$ and predicts the next state $s_{t+1}$. In other words, the option dynamics model is an option-conditioned state-transition function that depends on the current option being executed instead of the current action, as traditional state-transition models do. The option dynamics model is trained simultaneously with the policy and option inference networks by adding the option dynamics consistency loss to the overall training objective. Training an option dynamics model in this way accomplishes two things: first, it ensures that the system dynamics can be completely defined by the option label, potentially allowing for easier recovery of option labels. Second, it ensures that the recovered option labels allow for modeling the environment dynamics in terms of the options themselves. This not only provides the ability to incorporate planning, but allows planning to be performed at the option level instead of the action level, enabling more efficient planning over longer time-scales.
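A toy sketch of the option dynamics consistency loss is given below, with a per-option linear transition standing in for the learned neural dynamics model; all names, dimensions, and the mean-squared-error form are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, NUM_OPTIONS = 4, 3  # illustrative sizes

# Hypothetical option-conditioned dynamics: one linear map per option.
# PODNet would learn a single network over (state, option label) instead.
A = rng.normal(0.0, 0.1, (NUM_OPTIONS, STATE_DIM, STATE_DIM))
b = rng.normal(0.0, 0.1, (NUM_OPTIONS, STATE_DIM))

def predict_next_state(state, option):
    """Option-conditioned transition: s_{t+1} = f(s_t, c_t)."""
    return A[option] @ state + b[option]

def dynamics_consistency_loss(states, next_states, options):
    """Mean squared error between predicted and observed next states,
    conditioned on the inferred option label at each step."""
    preds = np.stack([predict_next_state(s, c) for s, c in zip(states, options)])
    return float(np.mean((preds - next_states) ** 2))

states = rng.normal(size=(5, STATE_DIM))
next_states = rng.normal(size=(5, STATE_DIM))
options = rng.integers(0, NUM_OPTIONS, size=5)
loss = dynamics_consistency_loss(states, next_states, options)
```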
The training process occurs offline and starts by collecting a dataset of unstructured demonstrated trajectories, which can be generated from any source, for example human experts, optimal controllers, or pre-trained reinforcement learning agents. The overall training objective combines the behavior cloning loss, the option dynamics consistency loss, the KL regularization term, and the temporal smoothing term described in the following subsections.
Ensuring smooth backpropagation
To ensure that gradients flow only through differentiable functions during backpropagation, the option label is represented by a Gumbel-Softmax distribution, as illustrated in the literature on categorical VAEs [Jang2016]. Using argmax to select the option with the highest conditional probability would introduce a discrete operation into the neural network and prohibit backpropagation in PODNet. The softmax alone is therefore used only during the backward pass to allow backpropagation; for the forward pass, the softmax output is further subjected to the argmax operator to obtain a one-hot encoded label vector.
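This straight-through trick can be sketched as follows. The NumPy version below shows only the forward computation; the comments mark which quantity an autodiff framework would differentiate through in the backward pass.

```python
import numpy as np

rng = np.random.default_rng(2)

def gumbel_softmax_st(logits, temperature=1.0):
    """Straight-through Gumbel-Softmax sample (forward pass only).

    Returns both the hard one-hot vector used in the forward pass and
    the soft relaxed sample that gradients would flow through during
    the backward pass in an autodiff framework.
    """
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    gumbel = -np.log(-np.log(rng.uniform(1e-10, 1.0, logits.shape)))
    z = (logits + gumbel) / temperature
    y = np.exp(z - z.max())
    y /= y.sum()                          # soft (differentiable) sample
    y_hard = np.eye(len(y))[y.argmax()]   # hard one-hot label, forward pass
    return y_hard, y

hard, soft = gumbel_softmax_st(np.array([2.0, 0.5, -1.0]))
```

Lowering the temperature makes the soft sample approach the one-hot vector, which is the usual annealing schedule for categorical VAEs.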
The categorical distribution arising from the encoder network is forced to have minimal KL divergence from a uniform categorical distribution. This ensures that all inputs are not encoded into the same sub-behavior cluster, but are meaningfully separated into distinct clusters. This entropy-driven regularization encourages exploration of the label space, and the degree of exploration can be modulated by tuning the associated hyperparameter.
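Concretely, for a categorical posterior over $K$ option labels, the KL term against the uniform prior reduces to a negative-entropy penalty, which is why it acts as entropy-driven regularization:

```latex
D_{\mathrm{KL}}\big( q(c \mid \tau) \,\|\, \mathcal{U}(K) \big)
  = \sum_{k=1}^{K} q(c = k \mid \tau) \, \log \frac{q(c = k \mid \tau)}{1/K}
  = \log K - H\big[ q(c \mid \tau) \big]
```

Minimizing this KL term thus maximizes the entropy of the inferred label distribution, spreading inputs across the available clusters.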
Temporal smoothing of option labels
In order to prevent the model from frequently switching between option labels, a temporal smoothing regularization term, penalizing differences between consecutive inferred option labels, is added to the training objective. Here, $c_t^{(i)}$ is the $i$-th component of the inferred option label at time $t$, and the associated hyperparameter regulates the temporal consistency of the inferred options.
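One plausible form of such a penalty (the paper's exact formulation may differ) penalizes squared differences between consecutive inferred option distributions:

```python
import numpy as np

def temporal_smoothing_penalty(option_probs, weight=0.1):
    """Penalize changes in the inferred option distribution between
    consecutive time steps. `weight` plays the role of the temporal
    consistency hyperparameter; the squared-difference form is one
    plausible choice, not necessarily the paper's exact term.
    """
    diffs = option_probs[1:] - option_probs[:-1]
    return weight * float(np.sum(diffs ** 2))

# A trajectory that switches options once incurs a single penalty spike.
probs = np.array([[1.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 1.0]])
penalty = temporal_smoothing_penalty(probs)
```

A trajectory with a constant option label incurs zero penalty, so the term only discourages rapid label switching, not long option segments.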
Discovery of number of options
The number of options can be determined using a held-out portion of the demonstrations on which the behavior cloning loss is evaluated, analogous to a validation loss. We start with an initial number of options, K, to be discovered and increment or decrement it to move toward a decreasing held-out loss.
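This procedure can be sketched as a simple local search over K. The mock held-out loss below stands in for the expensive step of retraining PODNet with K options and evaluating on held-out demonstrations; its values are purely illustrative.

```python
def select_num_options(k_init, heldout_loss, k_min=1, k_max=5):
    """Local search over K: starting from an initial guess, move to a
    neighboring value of K whenever it lowers the held-out behavior
    cloning loss, and stop at a local minimum."""
    k = k_init
    while True:
        neighbors = [n for n in (k - 1, k + 1) if k_min <= n <= k_max]
        # Current K is listed first so ties keep the current value.
        best = min([k] + neighbors, key=heldout_loss)
        if best == k:
            return k
        k = best

# Hypothetical loss curve: too few or too many options both hurt.
mock_loss = {1: 1.2, 2: 0.9, 3: 0.4, 4: 0.5, 5: 0.7}
best_k = select_num_options(2, lambda k: mock_loss.get(k, 10.0))
```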
Planning Option Sequences
Although the main motivation for PODNet is to segment unstructured trajectories, the learned option dynamics model combined with the option-conditioned policy network can be used for planning option sequences. As shown in Figure 2, the option dynamics model learned with PODNet can be integrated into meta-controllers to plan trajectories. Given a goal state, the meta-controller simulates trajectories using the option dynamics model and outputs the best estimated sequence of options to achieve the goal state. This sequence is then passed to the option-conditioned policy network, which outputs the sequence of estimated actions required to follow the planned option sequence.
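A minimal sketch of such a meta-controller is random shooting in option space; here a toy deterministic dynamics model (each option shifts the state by a fixed offset) stands in for the learned option dynamics model, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
NUM_OPTIONS, HORIZON, N_CANDIDATES = 3, 6, 200

# Toy stand-in for the learned option dynamics model: each option moves
# the 2-D state by a fixed offset. A trained PODNet supplies a neural model.
OPTION_EFFECTS = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

def rollout(state, option_seq):
    """Simulate a sequence of options with the option dynamics model."""
    for c in option_seq:
        state = state + OPTION_EFFECTS[c]
    return state

def plan_options(start, goal):
    """Random-shooting planner in option space: sample candidate option
    sequences, simulate each, and keep the sequence whose terminal state
    lands closest to the goal."""
    best_seq, best_dist = None, np.inf
    for _ in range(N_CANDIDATES):
        seq = rng.integers(0, NUM_OPTIONS, size=HORIZON)
        dist = np.linalg.norm(rollout(start, seq) - goal)
        if dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq, best_dist

seq, dist = plan_options(np.zeros(2), np.array([2.0, 2.0]))
```

Because the search happens over option labels rather than primitive actions, the candidate space is exponentially smaller for the same horizon, which is the efficiency argument made above.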
Conclusion
In this paper we presented PODNet, a neural network architecture for the discovery of plannable options. Our approach combines a custom categorical variational autoencoder, a recurrent option inference network, an option-conditioned policy network, and an option dynamics model for end-to-end training and segmentation of an unstructured set of demonstrated trajectories for option discovery. PODNet's architecture implicitly utilizes prior knowledge that options are dynamically consistent (plannable and representable by a skill dynamics model), temporally extended (as enforced by the temporal smoothing regularization), and definitive of the agent's actions at a particular state (as enforced by the option-conditioned policy network). This leads to the discovery of plannable options that enable predictable behavior in AI agents as they adapt to newer tasks in a transfer learning setting. The proposed architecture has implications for multi-task and hierarchical learning, as well as explainable and interpretable artificial intelligence.
VAEs and mutual information
Disregarding the regularization terms, it can be shown that the objective function solved by training variational autoencoders (VAEs) is equivalent to maximizing the mutual information $I(x; z)$, where $x$ is the input data and $z$ is the bottleneck latent state.
The encoder network is $q_\phi(z \mid x)$, which also represents the approximate posterior, and the decoder network is $p_\theta(x \mid z)$.
The final term is the evidence lower bound (ELBO), minus the KL divergence regularization term, that is maximized in VAEs as part of the objective function. The KL divergence regularization term ensures exploration and packing in the latent variable space.
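One standard way to make this connection precise is the Barber–Agakov variational bound: since the data entropy $H(x)$ is a constant of the dataset, maximizing the ELBO's reconstruction term maximizes a lower bound on the mutual information under the encoder-induced joint distribution:

```latex
I(x; z) \;=\; H(x) - H(x \mid z)
        \;\ge\; H(x) + \mathbb{E}_{x \sim p(x),\; z \sim q_\phi(z \mid x)}
                \big[ \log p_\theta(x \mid z) \big]
```

The inequality holds because the decoder $p_\theta(x \mid z)$ is a variational approximation to the true posterior over $x$ given $z$; the gap is exactly the KL divergence between the two.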
Equivalence between VAE and Diversity Is All You Need’s training objective
Let us consider the objective function defined in Diversity Is All You Need (DIAYN) [Eysenbach2018], which maximizes the mutual information between states and the skill label while maximizing the policy's entropy,
Since the system dynamics and the distribution of initial states are assumed fixed, it follows that
Now let us analyze the Kullback-Leibler divergence term,
Hence, we can write the objective function as,
Recomputing the expectation over trajectories as expectations over states,
Hence, we can re-write the objective function (1) as,
Hence, we can see the equivalence between the ELBO objective in variational autoencoders and the objective defined in DIAYN. Note that the regularization term ensures that the policy network explores the action space and generates a diverse distribution of possible output trajectories.