## 1 Introduction

Humans have the power to reason about complex tasks as combinations of simpler, interpretable subtasks. There are many hierarchical reinforcement learning approaches designed to handle tasks comprised of sequential subtasks [Sutton et al., 1999, Konidaris and Barto, 2007], but what if a task is made up of concurrent subtasks? For example, someone who wants to learn to do a backflip may consider it to be a combination of sequential and concurrent subtasks: jumping, tucking knees, rolling backwards, and thrusting arms downwards. Little focus has been given to designing algorithms that decompose complex tasks into distinct concurrent subtasks. Even less effort has been put into finding decompositions that are made up of independent yet interpretable concurrent subtasks, even though analogous approaches are effective on a number of challenging artificial intelligence problems [Chen et al., 2016, Burgess et al., 2018].

We believe that endowing intelligent agents with the ability to disentangle complex tasks into simpler, distinct, and interpretable subtasks could help them learn complex tasks more efficiently. Furthermore, by explicitly encouraging agents to learn distinct, interpretable subtasks, we would incline the agents towards learning distinct and interpretable representations, which can be leveraged in powerful ways. For example, in the context of algorithmic human-robot interaction, an agent, embodied or virtual, may be able to use these representations to learn complex tasks that a human cannot perform well but that are comprised of simpler concurrent subtasks at which the human is proficient (e.g., backflip decomposition). Another example is that the agent could continuously interpolate between subtasks to adjust its behavior in semantically meaningful ways [Chen et al., 2016, Burgess et al., 2018]. This is particularly motivating because such fine-scale interpolations are characteristic of humans.

We propose a low-cost modification to the VAE objective used by Wang et al. [2017] that aims to induce latent space structure that captures the relationship between a behavior and the subskills that comprise this behavior in a disentangled and interpretable way. We evaluate both the original and modified objectives on a moderately complex imitation learning problem, in which agents are trained to perform a behavior after being trained on subskills that qualitatively comprise that behavior.

## 2 Preliminaries

We consider a standard reinforcement learning framework in which we have a Markov decision process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, T, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s_{t+1} \mid s_t, a_t)$ is the distribution representing the transition probabilities of the agent arriving in state $s_{t+1}$ after taking action $a_t$ while in state $s_t$, $r(s_t, a_t)$ is the reward function representing the reward the agent obtains by taking action $a_t$ while in state $s_t$, $T$ is the time horizon and can be either finite or infinite, and $\gamma$ is the multiplicative rate at which future rewards are discounted per time step. Throughout this paper we will use the terms *trajectory*, *behavior*, and *demonstration* interchangeably, all of which refer to a finite or infinite sequence of states and actions $\tau_i = (s_0^i, a_0^i, s_1^i, a_1^i, \ldots)$ for $i = 1, \ldots, N$, where $N$ is the number of trajectories.
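As a small illustration of how the discount rate and the time horizon interact, the sketch below computes the discounted return of one finite trajectory. The function name `discounted_return` and the reward values are illustrative assumptions and do not come from the experiments in this paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite list of per-step rewards."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Made-up rewards for a three-step trajectory: 1 + 0.5 + 0.25.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```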

## 3 Embedding and Reconstructing Trajectories

We use a conditional variational autoencoder (CVAE) [Sohn et al., 2015, Kingma and Welling, 2013] to learn a semantically meaningful low-dimensional embedding space that can (1) help an agent learn new behaviors more quickly, (2) be sampled from to generate behaviors, and (3) shed light on high-level factors of variation (e.g. subskills) that comprise complex behaviors.

The CVAE we use is made up of a bidirectional LSTM (BiLSTM) [Hochreiter and Schmidhuber, 1997, Schuster and Paliwal, 1997] state-sequence encoder $q_\phi(z \mid s_{1:T})$, an attention module [Bahdanau et al., 2014, Zhou et al., 2016] that maps the BiLSTM output to the latent (i.e. trajectory) embedding $z$, a conditional WaveNet [Oord et al., 2016] state decoder $P_\psi(s_{t+1} \mid s_t, z)$, which serves as a *dynamics model*, and a multi-layer perceptron (MLP) action decoder $\pi_\theta(a_t \mid s_t, z)$, which serves as a *policy* and whose outputs parametrize the normal distribution from which $a_t$ is sampled. The BiLSTM captures sequential information over the states of the trajectories, and the conditional WaveNet allows for exact density modeling of the possibly multi-modal dynamics. This CVAE is similar to that of Wang et al. [2017], but with key improvements, such as the addition of an attention module.

To train this CVAE, the following objective is maximized:

$$\mathcal{L}(\phi, \psi, \theta) = \mathbb{E}_{z \sim q_\phi(z \mid s_{1:T})}\left[\sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t, z) + \log P_\psi(s_{t+1} \mid s_t, z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid s_{1:T}) \,\|\, p(z)\right) \tag{1}$$
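The training objective in Equation (1) can be sketched numerically as a single-sample evidence lower bound. The sketch assumes a diagonal Gaussian encoder output and a standard normal prior over the embedding, and it takes the per-step log-probabilities from the action decoder and state decoder as precomputed inputs. The function names `kl_to_standard_normal` and `elbo` are hypothetical and are not taken from the paper's implementation.

```python
import math


def kl_to_standard_normal(means, stds):
    """KL divergence from a diagonal Gaussian N(means, stds**2) to N(0, I)."""
    return sum(
        0.5 * (std ** 2 + mean ** 2 - 1.0) - math.log(std)
        for mean, std in zip(means, stds)
    )


def elbo(action_log_probs, state_log_probs, means, stds):
    """Single-sample estimate: reconstruction log-likelihood minus KL term.

    action_log_probs: per-step log-probabilities from the policy (action decoder).
    state_log_probs:  per-step log-probabilities from the dynamics model (state decoder).
    means, stds:      parameters of the encoder's Gaussian over the embedding.
    """
    reconstruction = sum(action_log_probs) + sum(state_log_probs)
    return reconstruction - kl_to_standard_normal(means, stds)
```

With an encoder output equal to the prior (zero means, unit standard deviations) the KL term vanishes and the estimate reduces to the summed decoder log-probabilities.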

In Section 4 we will modify this objective to encourage the latent space to capture relationships between the embeddings of behaviors and the embeddings of the subskills that comprise them.

## 4 Shaping the Trajectory Embedding (i.e. Latent) Space

Some skills can be considered as approximate combinations of certain subskills. For example, a backflip could be considered a sequential and concurrent combination of four subskills: jumping, tucking knees, rolling backwards, and thrusting arms downwards. Training the VAE to embed and reconstruct demonstrations of these five behaviors using Equation (1) would generally result in an embedding space with no clear relationship between the backflip embedding and the four subskill embeddings, especially if the dimension of the latent space is large or the number of demonstrated behaviors is small.

Motivated by semantically meaningful latent representations that have been found in other work [Mikolov et al., 2013], we aim to induce a latent space structure so that relationships between embeddings of behaviors and embeddings of subskills that comprise these behaviors are captured in a meaningful way. In particular, we propose inclining the latent space structure so that a behavior embedding is the sum of the embeddings of the subskills that comprise it. Concretely, if $z_b$ is a backflip embedding and $z_1, z_2, z_3, z_4$ are the jumping, tucking knees, rolling backwards, and thrusting arms downwards embeddings, we want to have $z_b = z_1 + z_2 + z_3 + z_4$. An example of such latent space restructuring is shown in Figure 1.
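The desired additive structure can be checked with a few lines of code. The sketch below sums subskill embeddings and measures how far a behavior embedding is from that sum. The helper names `sum_embeddings` and `additivity_gap` are hypothetical, and embeddings are represented as plain lists of floats purely for illustration.

```python
import math


def sum_embeddings(embeddings):
    """Component-wise sum of equally sized embedding vectors."""
    return [sum(components) for components in zip(*embeddings)]


def additivity_gap(behavior_embedding, subskill_embeddings):
    """Euclidean distance between a behavior embedding and the summed subskill embeddings."""
    composed = sum_embeddings(subskill_embeddings)
    return math.sqrt(
        sum((b - c) ** 2 for b, c in zip(behavior_embedding, composed))
    )
```

A gap of zero corresponds to the ideal case in which the behavior embedding equals the sum of its subskill embeddings.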

However, the VAE models probability distributions, so enforcing equality between one instance of a behavior and one instance of its subskills is insufficient. Instead, we want the random variables (RVs) representing the embeddings of the subskills to relate to the RV representing the embedding of the behavior comprised of those subskills. Another way to do this is to relate the subskill embedding RVs with the RV representing the trajectory generated by the decoder networks $P_\psi$ and $\pi_\theta$ when conditioned on an embedding of the behavior.

Suppose $b$ is a behavior comprised of subskills $c_1, \ldots, c_M$. Let $\tau_b$ represent the trajectory generated from an embedding corresponding to $b$. Define $z' = \sum_{i=1}^{M} z_i$, where $z_i$ is the embedding produced by the encoder $q_\phi$ for subskill $c_i$. To train the encoder $q_\phi$, state decoder $P_\psi$, and action decoder $\pi_\theta$ simultaneously, we aim to maximize the mutual information between $z'$ and $\tau_b$, which can be expressed as

$$I(z'; \tau_b) = H(z') - H(z' \mid \tau_b).$$

However, we don't have access to the true posterior distribution $P(z' \mid \tau_b)$. We can instead introduce a distribution $Q(z' \mid \tau_b)$ as a variational approximation to $P(z' \mid \tau_b)$ to get the variational lower bound

$$L_I = H(z') + \mathbb{E}_{z', \tau_b}\left[\log Q(z' \mid \tau_b)\right] \le I(z'; \tau_b),$$

similarly to the approach in Section 5 of Chen et al. [2016].
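For completeness, the standard argument behind this kind of bound, following Chen et al. [2016], can be written out as below. The symbols are assumptions made for this sketch: $z'$ denotes the summed subskill embedding, $\tau_b$ the generated trajectory, $P$ the true posterior, and $Q$ its variational approximation. The inequality holds because the KL divergence is non-negative.

```latex
\begin{aligned}
I(z'; \tau_b)
  &= H(z') - H(z' \mid \tau_b) \\
  &= H(z') + \mathbb{E}_{\tau_b}\Big[\mathbb{E}_{z' \sim P(\cdot \mid \tau_b)}\big[\log P(z' \mid \tau_b)\big]\Big] \\
  &= H(z') + \mathbb{E}_{\tau_b}\Big[D_{\mathrm{KL}}\big(P(\cdot \mid \tau_b) \,\|\, Q(\cdot \mid \tau_b)\big)
     + \mathbb{E}_{z' \sim P(\cdot \mid \tau_b)}\big[\log Q(z' \mid \tau_b)\big]\Big] \\
  &\ge H(z') + \mathbb{E}_{\tau_b}\Big[\mathbb{E}_{z' \sim P(\cdot \mid \tau_b)}\big[\log Q(z' \mid \tau_b)\big]\Big].
\end{aligned}
```

The bound is tight when $Q$ matches the true posterior, since the dropped KL term is then zero.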

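Computing the entropy $H(z')$ for an arbitrary distribution may be difficult, but by setting $z$ to be a Gaussian RV, the entropy has the simple, closed-form expression

$$H(z) = \frac{1}{2} \log\left(2 \pi e \sigma^2\right),$$

where $\sigma$ is the standard deviation of $z$. We choose the encoder $q_\phi$ to parametrize a Gaussian distribution, so that each subskill embedding satisfies $z_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$, and we assume that state sequences from different subskills are sufficiently unrelated that they can be considered statistically independent. Then $z' = \sum_{i=1}^{M} z_i$ is the sum of independent Gaussian RVs and has the nice form

$$z' \sim \mathcal{N}\left(\sum_{i=1}^{M} \mu_i, \; \sum_{i=1}^{M} \sigma_i^2\right),$$

and the entropy of $z'$ is

$$H(z') = \frac{1}{2} \log\left(2 \pi e \sum_{i=1}^{M} \sigma_i^2\right).$$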
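The closed-form entropy of a sum of independent Gaussian embeddings can be sketched as follows. The sketch treats a single scalar latent dimension, which is an assumption made here for brevity, and the function names `gaussian_entropy` and `sum_of_independent_gaussians` are hypothetical.

```python
import math


def gaussian_entropy(std):
    """Differential entropy (in nats) of a univariate Gaussian with the given std."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def sum_of_independent_gaussians(means, stds):
    """Mean and std of a sum of independent Gaussians: means add, variances add."""
    total_mean = sum(means)
    total_variance = sum(std ** 2 for std in stds)
    return total_mean, math.sqrt(total_variance)


# Two made-up subskill embeddings with stds 3 and 4 give a summed std of 5.
summed_mean, summed_std = sum_of_independent_gaussians([1.0, 2.0], [3.0, 4.0])
summed_entropy = gaussian_entropy(summed_std)
```

Because variances add under independence, the entropy of the summed embedding depends only on the subskill standard deviations and not on their means.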

To encourage a semantically meaningful relationship between the embeddings of a behavior and the embeddings of the subskills that comprise it, we regularize the objective $\mathcal{L}$ in Equation (1) with the mutual information lower bound $L_I$ to get

$$\mathcal{L}_{\mathrm{reg}} = \mathcal{L} + \lambda L_I, \tag{2}$$

where $\lambda$ is a hyperparameter that controls the trade-off between the original objective and the degree of shaping of the latent space.
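A minimal sketch of how the regularized objective could be assembled is given below. It assumes a scalar Gaussian latent, a Monte Carlo average of log-probabilities under the variational approximation, and a precomputed value of the original objective. The names `mi_lower_bound`, `regularized_objective`, and `lam` are hypothetical and are not taken from the paper's implementation.

```python
import math


def gaussian_entropy(std):
    """Differential entropy (in nats) of a univariate Gaussian with the given std."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def mi_lower_bound(log_q_values, subskill_stds):
    """Entropy of the summed subskill embedding plus the average log Q(z' | trajectory).

    log_q_values:  sampled log-probabilities under the variational approximation Q.
    subskill_stds: standard deviations of the independent Gaussian subskill embeddings.
    """
    summed_std = math.sqrt(sum(std ** 2 for std in subskill_stds))
    expected_log_q = sum(log_q_values) / len(log_q_values)
    return gaussian_entropy(summed_std) + expected_log_q


def regularized_objective(elbo_value, mi_bound, lam):
    """Original objective plus the weighted mutual information lower bound."""
    return elbo_value + lam * mi_bound
```

Setting `lam` to zero recovers the original objective, which makes the baseline a special case of the regularized one.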

## 5 Experiments and Results

We evaluate our approach on a humanoid with a 197-dimensional state space and a 34-dimensional action space, simulated in Bullet [Coumans, 2015]. We use policies that were pre-trained by Peng et al. [2018] to perform *kick*, *spin*, and *jump* as subskills that qualitatively comprise the behavior *spin kick*. We train two sets of three VAEs on the subskills: one set that optimizes the original VAE objective (1) and another set that optimizes the regularized objective (2). To compare the proposed approach with the original, we evaluate the training process of each set of VAEs by considering the similarity between the generated trajectories and the pre-trained *spin kick* policy demonstrations. The mean squared error (MSE) between the generated and demonstration states, averaged over 5 different random seeds, is shown in Figure 2.
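The evaluation metric described above can be sketched as follows. The sketch assumes that a trajectory is a list of equally sized state vectors and that generated and demonstration trajectories are aligned step by step. The function names `state_mse` and `mean_over_seeds` are hypothetical.

```python
def state_mse(generated_states, demonstration_states):
    """Mean squared error over all state dimensions and time steps."""
    squared_error = 0.0
    count = 0
    for generated, demonstration in zip(generated_states, demonstration_states):
        for g, d in zip(generated, demonstration):
            squared_error += (g - d) ** 2
            count += 1
    return squared_error / count


def mean_over_seeds(per_seed_values):
    """Average a metric over runs with different random seeds."""
    return sum(per_seed_values) / len(per_seed_values)
```

For example, `mean_over_seeds` would be applied to the five per-seed values of `state_mse` to obtain one point on a training curve.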

We see that the two approaches perform similarly on the proposed setup, but the regularized one tends to attain better overall performance and trains faster than the baseline. One possible reason for the similar performance, mentioned earlier, is that the dimensionality of the latent space used for the results in Figure 2 may have been too large for the number of subskills and behaviors. As a result, the latent space may not have been constrained in a way that significantly affected performance. Regardless, we believe that the consistent performance gain of our regularized approach over the non-regularized baseline indicates that this line of research is worth pursuing.

## 6 Discussion and Future Work

We explored the idea of inducing certain latent structure through the maximization of mutual information between generated behaviors and embeddings of the subskills that qualitatively comprise those behaviors, which, to the best of our knowledge, has not yet been investigated. Though our algorithm slightly outperformed the state-of-the-art baseline, there is much room for future work. The CVAE could be replaced with a $\beta$-CVAE [Higgins et al., 2017] to control disentanglement of the latent embedding $z$. The proposed approach could be evaluated on behaviors and subskills that more strictly adhere to the desired concurrent relationship. A larger number of behaviors and subskills could be trained at once, both to constrain the latent space and to enrich the pool of subskills on which to train and whose relationships can be inspected.

### Acknowledgments

The author thanks John Chong for this and countless other opportunities.

## References

- Bahdanau et al. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Burgess et al. (2018). Understanding disentangling in $\beta$-VAE. arXiv preprint arXiv:1804.03599.
- Chen et al. (2016). InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
- Coumans (2015). Bullet.
- Higgins et al. (2017). Beta-VAE: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Konidaris and Barto (2007). Building portable options: skill transfer in reinforcement learning. In IJCAI, Vol. 7.
- Mikolov et al. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- Oord et al. (2016). WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
- Peng et al. (2018). DeepMimic: example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–14.
- Schuster and Paliwal (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681.
- Sohn et al. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
- Sutton et al. (1999). Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1-2), pp. 181–211.
- Wang et al. (2017). Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems, pp. 5320–5329.
- Zhou et al. (2016). Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 207–212.
