1 Introduction
Reinforcement Learning (RL) aims to take sequential actions, by interacting with an environment, so as to maximize a prespecified reward function designed for solving a task. RL using Deep Neural Networks (DNNs) has shown tremendous success in several tasks such as playing games
Mnih et al. (2015); Silver et al. (2016), solving complex robotics tasks Levine et al. (2016); Duan et al. (2016), etc. However, with sparse rewards, these algorithms often require a huge number of interactions with the environment, which is costly in real-world applications such as self-driving cars Bojarski et al. (2016) and manipulation using real robots Levine et al. (2016). Manually designed dense reward functions could mitigate such issues; however, it is in general difficult to design detailed reward functions for complex real-world tasks. Imitation Learning (IL) using trajectories generated by an expert can potentially be used to learn policies faster Argall et al. (2009). However, the performance of IL algorithms Ross et al. (2011) depends not only on the performance of the expert providing the trajectories, but also on the state-space distribution represented by the trajectories, especially in the case of high-dimensional states. To avoid such dependencies on the expert, some methods proposed in the literature Sun et al. (2017); Cheng et al. (2018) take the path of combining RL and IL. However, these methods assume access to the expert value function, which may become impractical in real-world scenarios.
In this paper, we follow a strategy which starts with IL and then switches to RL. In the IL step, our framework performs supervised pretraining, which aims at learning a policy that best describes the expert trajectories. However, due to the limited availability of expert trajectories, the policy trained with IL will have errors, which can then be alleviated using RL. Similar approaches are taken in Cheng et al. (2018) and Nagabandi et al. (2018), where the authors show that supervised pretraining does help to speed up learning. However, note that the reward function in RL is still sparse, making it difficult to learn. With this in mind, we pose the following question: can we make more efficient use of the expert trajectories, instead of just supervised pretraining?
Given a set of trajectories, humans can quickly identify waypoints that need to be completed in order to achieve the goal. We tend to break down the entire complex task into subgoals and try to achieve them in the best order possible. Prior knowledge helps humans achieve tasks much faster Andreas et al. (2017); Dubey et al. (2018) than using only the trajectories for learning. This human strategy of divide-and-conquer has been crucial in several applications, and it serves as a motivation behind our algorithm, which learns to partition the state-space into subgoals using expert trajectories. The learned subgoals provide a discrete reward signal, unlike value-based continuous rewards Ng et al. (1999); Sun et al. (2018), which can be erroneous, especially with a limited number of trajectories in long time-horizon tasks. As the set of expert trajectories may not contain all the states the agent may visit during exploration in the RL step, we augment the subgoal predictor via one-class classification to deal with such under-represented states. We perform experiments on three goal-oriented tasks on MuJoCo Todorov (2014) with sparse terminal-only rewards, which state-of-the-art RL, IL, or their combinations are not able to solve.
2 Related Works
Our work is closely related to learning from demonstrations or expert trajectories, as well as to discovering subgoals in complex tasks. We first discuss works on imitation learning using expert trajectories or reward-to-go. We also discuss methods that aim to discover subgoals in an online manner, during the RL stage, from the agent's past experience.
Imitation Learning. Imitation Learning Schaal (1999); Silver et al. (2008); Chernova and Veloso (2009); Rajeswaran et al. (2017); Hester et al. (2018)
uses a set of expert trajectories or demonstrations to guide the policy learning process. A naive approach to using such trajectories is to train a policy in a supervised learning manner. However, the errors of such a policy can grow quadratically with the number of steps. This can be alleviated using interactive imitation learning algorithms
Ross et al. (2011); Ross and Bagnell (2014); Torabi et al. (2018), which query expert actions at the states visited by the agent after the initial supervised learning phase. However, such query actions may be costly or difficult to obtain in many applications. Trajectories are also used by Levine and Koltun (2013) to guide the policy search, with the main goal of optimizing the return of the policy rather than mimicking the expert. Recently, some works Sun et al. (2017); Chang et al. (2015); Sun et al. (2018) aim to combine IL with RL by assuming access to the expert's reward-to-go at the states visited by the RL agent. Cheng et al. (2018) take a moderately different approach in which they switch from IL to RL, and show that randomizing the switch point can help to learn faster. The authors in Ranchod et al. (2015) use demonstration trajectories to perform skill segmentation in an Inverse Reinforcement Learning (IRL) framework. The authors in Murali et al. (2016) also perform expert trajectory segmentation, but do not show results on learning the task, which is our main goal. SWIRL Krishnan et al. (2019) makes certain assumptions on the expert trajectories to learn the reward function, and the method depends on the discriminability of the state features, which we, on the other hand, learn end-to-end.

Learning with Options. Discovering and learning options has been studied in the literature Sutton et al. (1999); Precup (2000); Stolle and Precup (2002) and can be used to speed up the policy learning process. Silver and Ciosek (2012) developed a framework for planning based on options in a hierarchical manner, such that low-level options can be used to build higher-level options. Florensa et al. (2017)
propose to learn a set of options, or skills, by augmenting the state space with a latent categorical skill vector. A separate network is then trained to learn a policy over the options. The Option-Critic architecture
Bacon et al. (2017) developed a gradient-based framework to learn the options along with learning the policy. This framework is extended in Riemer et al. (2018) to handle a hierarchy of options. Held et al. (2017) proposed a framework where goals are generated using Generative Adversarial Networks (GANs) in a curriculum learning manner, with increasingly difficult goals. Researchers have shown that an important way of identifying subgoals in several tasks is identifying bottleneck regions. Diverse Density McGovern and Barto (2001), Relative Novelty Şimşek and Barto (2004), Graph Partitioning Şimşek et al. (2005), and clustering Mannor et al. (2004) can be used to identify such subgoals. However, unlike our method, these algorithms do not use a set of expert trajectories, and it would thus still be difficult for them to identify useful subgoals for complex tasks.

3 Methodology
We first provide a formal definition of the problem we are addressing in this paper, followed by a brief overall methodology, and then present a detailed description of our framework.
Problem Definition.
Consider a standard RL setting where an agent interacts with an environment which can be modeled by a Markov Decision Process (MDP)
$\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, \gamma, \rho_0)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $r$ is a scalar reward function, $\gamma \in [0, 1)$ is the discount factor and $\rho_0$ is the initial state distribution. Our goal is to learn a policy $\pi_\theta(a \mid s)$, with parameters $\theta$, which optimizes the expected discounted reward $J(\pi_\theta) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r(s_t, a_t)\right]$, where $s_0 \sim \rho_0$, $a_t \sim \pi_\theta(\cdot \mid s_t)$, and $s_{t+1}$ is given by the environment dynamics.

With sparse rewards, optimizing the expected discounted reward using RL may be difficult. In such cases, it may be beneficial to use a set of state-action trajectories $\mathcal{D} = \{(s_1^j, a_1^j, \dots, s_{T_j}^j, a_{T_j}^j)\}_{j=1}^{N}$ generated by an expert to guide the learning process. $N$ is the number of trajectories in the dataset and $T_j$ is the length of the $j$-th trajectory. We propose a methodology to efficiently use $\mathcal{D}$ by discovering subgoals from these trajectories and using them to develop an extrinsic reward function.
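For concreteness, the discounted return being optimized can be computed for a single trajectory as follows. This is a minimal sketch; the reward sequence and discount factor are illustrative, not taken from the paper.

```python
# Hedged sketch: computing the discounted return sum_t gamma^t * r_t of one
# trajectory, as in the objective above. Values below are illustrative only.

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A sparse, terminal-only reward signal: zero everywhere except the last step.
sparse = [0.0] * 99 + [1.0]
print(discounted_return(sparse))  # gamma^99, heavily discounted
```

With sparse terminal-only rewards, the only nonzero term is the final one, heavily discounted, which is why dense intermediate rewards (here, from subgoals) make the optimization easier.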
Overall Methodology. Several complex, goal-oriented, real-world tasks can often be broken down into subgoals with some natural ordering. Providing positive rewards after completing these subgoals can help learning proceed much faster compared to sparse, terminal-only rewards. In this paper, we advocate that such subgoals can be learned directly from a set of expert demonstration trajectories, rather than manually designed.
A pictorial description of our method is presented in Fig. 0(a). We use the set $\mathcal{D}$ to first train a policy by applying supervised learning. This serves as a good initial point for policy search using RL. However, with sparse rewards, the search can still be difficult, and the network may forget the parameters learned in the first step if it does not receive sufficiently useful rewards. To avoid this, we use $\mathcal{D}$ to learn a function $g$ which, given a state, predicts its subgoal. We use this function to obtain a new reward function, which intuitively informs the RL agent whenever it moves from one subgoal to another. We also learn a utility function $u$ to modulate the subgoal predictions over states which are not well-represented in the set $\mathcal{D}$. We approximate the functions $\pi_\theta$, $g$, and $u$ using neural networks. We next describe our meaning of subgoals, followed by an algorithm to learn them.
3.1 Subgoal Definition
Definition 1. Consider that the state-space $\mathcal{S}$ is partitioned into $K$ sets of states as $\mathcal{G}_1, \dots, \mathcal{G}_K$, s.t. $\cup_{k=1}^{K} \mathcal{G}_k = \mathcal{S}$ and $\mathcal{G}_i \cap \mathcal{G}_j = \emptyset$ for $i \neq j$, where $K$ is the number of subgoals specified by the user. For each $s \in \mathcal{G}_i$, we say that a particular action $a$ takes the agent from one subgoal to another iff the resulting state $s' \in \mathcal{G}_j$, for some $j$ with $j \neq i$.
We assume that there is an ordering in which groups of states appear in the trajectories, as shown in Fig. 0(b). However, the states within these groups may appear in any random order in the trajectories. These groups of states are not defined a priori, and our algorithm aims at estimating these partitions. Note that such orderings are natural in several real-world applications where a certain subgoal can only be reached after completing one or more previous subgoals. We show (empirically in the supplementary) that our assumption is soft rather than strict, i.e., the degree by which the trajectories deviate from the assumption determines the granularity of the discovered subgoals. We may consider that states in the trajectories of $\mathcal{D}$ appear in increasing order of subgoal indices, i.e., achieving subgoal $k+1$ is harder than achieving subgoal $k$. This gives us a natural way of defining an extrinsic reward function, which helps towards faster policy search. Also, all the trajectories in $\mathcal{D}$ should start from the initial state distribution and end at the terminal states.

3.2 Learning Sub-Goal Prediction
We use $\mathcal{D}$ to partition the state-space into $K$ subgoals, with $K$
being a hyperparameter. We learn a neural network to approximate
$g$, which, given a state, predicts a probability mass function (p.m.f.) over the $K$ possible subgoal partitions. The order in which the subgoals occur in the trajectories acts as a supervisory signal, which can be derived from our assumption mentioned above. We propose an iterative framework to learn $g$ using these ordering constraints. In the first step, we learn a mapping from states to subgoals using equipartition labels among the subgoals. Then we infer the labels of the states in the trajectories and correct them by imposing the ordering constraints. We use the new labels to again train the network, and follow the same procedure until convergence. These two steps are as follows.
Learning Step. In this step we consider that we have a set of state-subgoal tuples $\{(s_i, y_i)\}_{i=1}^{M}$, which we use to learn the function $g$; this can be posed as a multi-class classification problem with $K$ categories. We optimize the following cross-entropy loss function,

$$\mathcal{L}_g = -\frac{1}{M} \sum_{i=1}^{M} \sum_{k=1}^{K} \mathbb{1}[y_i = k] \log g_k(s_i) \quad (1)$$

where $\mathbb{1}[\cdot]$ is the indicator function and $M$ is the number of states in the dataset $\mathcal{D}$. To begin with, we do not have any labels $y_i$, and thus we consider an equipartition of all the subgoals along each trajectory. That is, given a trajectory of $T$ states, the initial subgoals are,

$$y_i = \left\lceil \frac{iK}{T} \right\rceil, \quad i = 1, \dots, T \quad (2)$$
Using this initial labeling scheme, similar states across trajectories may have different labels, but the network is expected to converge to the Maximum Likelihood Estimate (MLE) of the entire dataset. We also optimize the Co-Activity Similarity Loss (CASL) Paul et al. (2018) for stable learning, as the initial labels can be erroneous. In the next iteration of the learning step, we use the inferred subgoal labels, which we obtain as follows.
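The equipartition initialization of Eqn. 2 can be sketched as follows. This is our rendering of the equal-split scheme; the ceiling-based indexing is an assumption consistent with the equation, and the function name is ours.

```python
# Hedged sketch: initial equipartition subgoal labels for a trajectory of T
# states and K subgoals (Eqn. 2). The trajectory is split into K contiguous,
# near-equal chunks, the i-th state getting label ceil(i*K/T).
import math

def equipartition_labels(T, K):
    """Initial subgoal label (1..K) for each of the T states of a trajectory."""
    return [math.ceil(i * K / T) for i in range(1, T + 1)]

print(equipartition_labels(10, 5))  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

Similar states in different trajectories may receive different initial labels under this scheme; the iterative learning-inference procedure below is what reconciles them.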
Inference Step. Although the equipartition labels in Eqn. 2 may map similar states across different trajectories to dissimilar subgoals, the learned network modeling $g$ maps similar states to the same subgoal. But Eqn. 1, and thus the predictions of $g$, do not account for the natural temporal ordering of the subgoals. Even with architectures such as Recurrent Neural Networks (RNNs), it may be better to impose such temporal order constraints explicitly rather than relying on the network to learn them. We inject such order constraints using Dynamic Time Warping (DTW).

Formally, for the $j$-th trajectory in $\mathcal{D}$, we obtain the set $\{g(s_1^j), \dots, g(s_{T_j}^j)\}$, where $g(s)$ is a vector representing the p.m.f. over the $K$ subgoals. However, as these predictions do not consider temporal ordering, the constraint that subgoal $k+1$ occurs after subgoal $k$, for $k = 1, \dots, K-1$, is not preserved. To impose such constraints, we use DTW between the two sequences $(e_1, \dots, e_K)$, which are the standard basis vectors in the $K$-dimensional Euclidean space, and $(g(s_1^j), \dots, g(s_{T_j}^j))$. We use the $\ell_2$ norm of the difference between two vectors as the similarity measure in DTW. In this process, we obtain a subgoal assignment for each state in the trajectories, which becomes the new label for training in the learning step.
We then invoke the learning step using the new labels (instead of Eqn. 2), followed by the inference step to obtain the next set of subgoal labels. We continue this process until the number of subgoal labels changed between iterations falls below a certain threshold. This method is presented in Algorithm 1, where the superscript represents the iteration number in the learning-inference alternation.
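The inference step above, aligning each trajectory's sequence of predicted p.m.f.s to the ordered basis vectors $e_1, \dots, e_K$ with DTW, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and backtracking details are ours, while the $\ell_2$ cost follows the description above.

```python
# Hedged sketch: DTW between the K standard basis vectors and the T predicted
# p.m.f.s, so the recovered labels are non-decreasing along the trajectory.
import numpy as np

def dtw_monotone_labels(pmfs):
    """Return a monotone subgoal label (1..K) for each of the T states."""
    P = np.asarray(pmfs, dtype=float)          # T x K predicted p.m.f.s
    T, K = P.shape
    E = np.eye(K)                              # basis vectors e_1..e_K
    cost = np.linalg.norm(P[:, None, :] - E[None, :, :], axis=2)  # T x K
    # Standard DTW table with insertion/deletion/match moves.
    D = np.full((T + 1, K + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for k in range(1, K + 1):
            D[i, k] = cost[i - 1, k - 1] + min(D[i - 1, k], D[i - 1, k - 1], D[i, k - 1])
    # Backtrack the optimal warping path to recover one label per state.
    labels = [0] * T
    i, k = T, K
    while i > 0:
        labels[i - 1] = k
        _, i, k = min((D[i - 1, k], i - 1, k),
                      (D[i - 1, k - 1], i - 1, k - 1),
                      (D[i, k - 1], i, k - 1))
    return labels
```

Because the warping path is monotone, the returned labels respect the constraint that subgoal $k+1$ can only occur after subgoal $k$, even when the raw network predictions are temporally inconsistent.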
Reward Using Sub-Goals. The ordering of the subgoals, as discussed before, provides a natural way of designing a reward function as follows:

$$F(s, a, s') = \gamma \phi(s') - \phi(s), \quad \text{where } \phi(s) = \arg\max_{k \in \{1, \dots, K\}} g_k(s) \quad (3)$$

where the agent in state $s$ takes action $a$ and reaches state $s'$. The augmented reward function would become $r'(s, a, s') = r(s, a, s') + F(s, a, s')$. Considering that $F$ is a potential-based function of the form $F(s, a, s') = \gamma \phi(s') - \phi(s)$, and assuming without loss of generality that $\phi(s_0) = 0$ for the initial state $s_0$, it follows from Ng et al. (1999) that every optimal policy in the augmented MDP $\mathcal{M}'$ will also be optimal in $\mathcal{M}$. However, the new reward function may help to learn the task faster.
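A minimal sketch of the shaping reward described above, under the assumption that the potential of a state is its predicted subgoal index; the function name and the toy p.m.f.s are ours, not from the paper.

```python
# Hedged sketch of a potential-based shaping reward (Eqn. 3):
# F(s, a, s') = gamma * phi(s') - phi(s), with phi(s) = argmax_k g_k(s).
import numpy as np

def shaping_reward(pmf_s, pmf_s_next, gamma=0.99):
    """Shaping bonus for moving from state s to s', given subgoal p.m.f.s."""
    phi_s = int(np.argmax(pmf_s)) + 1        # subgoals indexed 1..K
    phi_s_next = int(np.argmax(pmf_s_next)) + 1
    return gamma * phi_s_next - phi_s

# Crossing from subgoal 1 into subgoal 2 yields a positive bonus:
print(shaping_reward([0.9, 0.1], [0.2, 0.8]))  # 0.99 * 2 - 1 = 0.98
```

Because the bonus is a potential difference, it rewards progress to a later subgoal and mildly penalizes lingering, without changing which policies are optimal (Ng et al., 1999).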
Out-of-Set Augmentation. In several applications, it might be the case that the trajectories only cover a small subset of the state-space, while the agent, during the RL step, may visit states outside of the states in $\mathcal{D}$. The subgoals estimated at these out-of-set states may be erroneous. To alleviate this problem, we impose the condition on the potential function that the subgoal predictor is trusted only for states which are well-represented in $\mathcal{D}$, and not elsewhere. We learn a neural network to model a utility function $u$, which, given a state, predicts the degree to which it is represented in the dataset $\mathcal{D}$. To do this, we build upon Deep One-Class Classification Ruff et al. (2018), which performs well on the task of anomaly detection. The idea is derived from Support Vector Data Description (SVDD)
Tax and Duin (2004), which aims to find the smallest hypersphere enclosing the given data points with minimum error. Data points outside the sphere are then deemed anomalous. We learn the parameters of the network by optimizing the following function:

$$\min_{\mathcal{W}} \ \frac{1}{M} \sum_{i=1}^{M} \| f_{\mathcal{W}}(s_i) - \mathbf{c} \|^2 + \frac{\lambda}{2} \sum_{l} \| \mathbf{W}^l \|_F^2$$

where $\mathbf{c}$ is a vector determined a priori Ruff et al. (2018), and $f_{\mathcal{W}}$ is modeled by a neural network with layer-wise weights $\mathbf{W}^l$. The second part is the regularization loss, with all the parameters of the network lumped into $\mathcal{W}$. The utility function can be expressed as follows:

$$u(s) = \| f_{\mathcal{W}}(s) - \mathbf{c} \|^2 \quad (4)$$
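The utility score of Eqn. 4 can be sketched as follows. This is a toy illustration of the Deep SVDD idea, not the trained network from the paper: a random linear map stands in for $f_{\mathcal{W}}$, and the center $\mathbf{c}$ is fixed a priori.

```python
# Hedged sketch of the utility score u(s) = ||f_W(s) - c||^2 (Eqn. 4):
# the squared distance of an embedded state from a fixed center c.
# `embed` is an illustrative stand-in for the trained network f_W.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))        # toy "network": a single linear layer
center = np.array([1.0, 1.0, 1.0])     # c, chosen a priori as in Deep SVDD

def embed(s):
    return np.asarray(s, dtype=float) @ W

def utility(s):
    """Small for states resembling the expert set, large for out-of-set states."""
    return float(np.sum((embed(s) - center) ** 2))
```

After training $f_{\mathcal{W}}$ on the expert states, well-represented states embed near $\mathbf{c}$ (low $u$), so the potential can be suppressed wherever $u$ is large.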
A lower value of $u(s)$ indicates that the state $s$ has been seen in $\mathcal{D}$. We modify the potential function, and thus the extrinsic reward function, to incorporate the utility score as follows:

$$\tilde{\phi}(s) = \phi(s) \cdot \mathbb{1}[u(s) \leq \tau] \quad (5)$$
where $\tilde{\phi}$ denotes the modified potential function. It may be noted that, as the extrinsic reward function is still a potential-based function Ng et al. (1999), the optimality conditions between the MDPs $\mathcal{M}$ and $\mathcal{M}'$ still hold, as discussed previously.
Supervised Pre-Training. We first pretrain the policy network using the trajectories in $\mathcal{D}$ (details in the supplementary). The performance of the pretrained policy network is generally quite poor and is upper bounded by the performance of the expert from which the trajectories are drawn. We then employ RL, starting from the pretrained policy, to learn from the subgoal-based reward function. Unlike standard imitation learning algorithms, e.g., DAgger, which fine-tune the pretrained policy with the expert in the loop, our algorithm only uses the initial set of expert trajectories and does not invoke the expert otherwise.
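The supervised pre-training step amounts to fitting the policy to the expert state-action pairs by cross-entropy. A minimal sketch, assuming discrete actions and using plain logistic regression as a stand-in for the policy network; all names here are illustrative, not the paper's architecture.

```python
# Hedged sketch of supervised pre-training (behavior cloning) on expert
# (state, action) pairs with a softmax policy and cross-entropy gradient.
import numpy as np

def pretrain_policy(states, actions, n_actions, lr=0.1, epochs=200):
    X = np.asarray(states, dtype=float)
    W = np.zeros((X.shape[1], n_actions))
    Y = np.eye(n_actions)[np.asarray(actions)]     # one-hot expert actions
    for _ in range(epochs):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)          # softmax policy pi(a|s)
        W -= lr * X.T @ (p - Y) / len(X)           # cross-entropy gradient step
    return W

# Tiny example: two states, each mapped to a distinct expert action.
W = pretrain_policy([[1.0, 0.0], [0.0, 1.0]], [0, 1], n_actions=2)
print(np.argmax(np.array([1.0, 0.0]) @ W))  # recovers expert action 0
```

The resulting policy is only as good as the expert data covering those states, which is exactly why the subsequent RL phase, guided by the subgoal reward, is needed.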
4 Experiments
In this section, we perform an experimental evaluation of the proposed method of learning from trajectories and compare it with other state-of-the-art methods. We also perform ablations of the different modules of our framework.
Tasks. We perform experiments on three challenging environments, as shown in Fig. 2. The first is the Ball-in-Maze Game (BiMGame) introduced in van Baar et al. (2018), where the task is to move a ball from the outermost to the innermost ring using a set of five discrete actions: clockwise and anti-clockwise rotation along the two principal dimensions of the board, and a "no-op" where the current orientation of the board is maintained. The states are images. The second environment is AntTarget, which involves the Ant Schulman et al. (2015). The task is to reach the center of a circle, with the Ant being initialized on an arc of the circle. The state and action spaces are continuous. The third environment, AntMaze, uses the same Ant, but in a U-shaped maze used in Held et al. (2017). The Ant is initialized at one end of the maze, with the goal being the other end, indicated in red in Fig. 1(c). Details about the network architectures we use for $\pi_\theta$, $g$ and $u$ can be found in the supplementary material.
Reward. For all tasks, we use a sparse terminal-only reward, i.e., the agent receives a reward only after reaching the goal state, and nothing otherwise. Standard RL methods such as A3C Mnih et al. (2016) are not able to solve these tasks with such sparse rewards.
Trajectory Generation. We generate trajectories from A3C Mnih et al. (2016) policies trained with dense rewards, which we do not use in any other experiments. We also generate sub-optimal trajectories for BiMGame and AntMaze. For BiMGame, we use the simulator via Model Predictive Control (MPC), as in Paul and van Baar (2018) (details in the supplementary). For AntMaze, we generate sub-optimal trajectories from an A3C policy stopped much before convergence. We generate around trajectories for BiMGame and AntMaze, and for AntTarget. As we generate two separate sets of trajectories for BiMGame and AntMaze, we use the sub-optimal set for all experiments, unless otherwise mentioned.



Baselines. We primarily compare our method with RL methods which utilize trajectory or expert information: AggreVaTeD Sun et al. (2017) and value-based reward shaping Ng et al. (1999), equivalent to the reward shaping used in THOR Sun et al. (2018). For these methods, we use $\mathcal{D}$ to fit a value function to the sparse terminal-only reward of the original MDP and use it as the expert value function. We also compare with standard A3C, pretrained using $\mathcal{D}$. It may be noted that we pretrain all the methods using the trajectory set, to have a fair comparison. We report the mean cumulative reward over independent runs.



Comparison with Baselines. First, we compare our method with the other baselines in Fig. 3. Note that, as out-of-set augmentation using $u$ can be applied to other methods which learn from trajectories, such as value-based reward shaping, we present the comparison with baselines without using $u$, i.e., using Eqn. 3. Later, we perform an ablation study with and without $u$. As may be observed, none of the baselines shows any sign of learning on these tasks, except for Value-Reward, which performs comparably with the proposed method on AntTarget only. Our method, on the other hand, is able to learn and solve the tasks consistently over multiple runs. The expert cumulative rewards are also drawn as horizontal lines in the plots, and imitation learning methods like DAgger Ross et al. (2011) can only reach that mark. Our method is able to surpass the expert on all the tasks. In fact, for AntMaze, even with a rather sub-optimal expert, our algorithm achieves a much higher cumulative reward.
The poor performance of Value-Reward and AggreVaTeD can be attributed to the imperfect value function learned from a limited number of trajectories. Specifically, as the trajectory length increases, the variation in cumulative reward over the initial set of states becomes quite high. This introduces a considerable amount of error in the estimated value function at the initial states, which in turn traps the agent in some local optimum when such value functions are used to guide the learning process.
Variations in Sub-Goals. The number of subgoals $K$ is specified by the user based on domain knowledge. For example, BiMGame has four bottlenecks, which are states that must be visited to complete the task, and they can be considered as subgoals. We perform experiments with different numbers of subgoals and present the plots in Fig. 4. It may be observed that for BiMGame and AntTarget, our method performs well over a wide range of subgoal counts. On the other hand, for AntMaze, as the task is much longer than AntTarget (12m vs 5m), a higher number of subgoals learns much faster, since it provides more frequent rewards. Note that the variation in learning speed with the number of subgoals also depends on the number of expert trajectories. If the pretraining is good, then less frequent subgoals might work fine, whereas if we have a small number of expert trajectories, the RL agent may need more frequent rewards (see the supplementary material for more experiments).
Effect of Out-of-Set Augmentation. The set $\mathcal{D}$ may not cover the entire state-space. To deal with this situation, we developed the extrinsic reward function in Eqn. 5 using $u$. To evaluate its effectiveness, we execute our algorithm using Eqn. 3 and Eqn. 5, and show the results in Fig. 5, with legends indicating the variants without and with $u$, respectively. For BiMGame, we used the optimal A3C trajectories for this evaluation. This is because using MPC trajectories with Eqn. 3 can still solve the task with similar reward plots, since the MPC trajectories visit many more states due to short-term planning. The (optimal) A3C trajectories, on the other hand, rarely visit some states, due to long-term planning. In this case, using Eqn. 3 actually traps the agent in a local optimum (in the outermost ring), whereas using $u$ as in Eqn. 5 learns to solve the task consistently (Fig. 4(a)).
For AntTarget in Fig. 4(b), using $u$ performs better than not using it (and also surpasses value-based reward shaping). This is because the trajectories only span a small sector of the circle (Fig. 6(b)), while the Ant is allowed to visit states outside of it in the RL step. Thus, $u$ avoids incorrect subgoal assignments to states not well-represented in $\mathcal{D}$ and helps the overall learning.
Effect of Sub-Optimal Experts. In general, the optimality of the expert may have an effect on performance. The comparison of our algorithm with optimal vs. sub-optimal expert trajectories is shown in Fig. 6. As may be observed, the learning curves for both tasks are better with the optimal expert trajectories. However, in spite of using sub-optimal experts, our method is able to surpass and perform much better than the experts. We also see that our method performs better than even the optimal expert (as it is only optimal w.r.t. some cost function) used in AntMaze.
Visualization. We visualize the subgoals discovered by our algorithm by plotting them on the xy plane in Fig. 7. As can be seen in BiMGame, our method is able to discover the bottleneck regions of the board as different subgoals. For AntTarget and AntMaze, the path to the goal is more or less equally divided into subgoals. This shows that our method of subgoal discovery can work both for environments with and without bottleneck regions. The supplementary material has more visualizations and discussion.
5 Discussions
The experimental analysis we presented in the previous section contains the following key observations:


Our method for subgoal discovery works both for tasks with inherent bottlenecks (e.g. BiMGame) and for tasks without any bottlenecks (e.g. AntTarget and AntMaze), but with temporal orderings between groups of states in the expert trajectories, which is the case for many applications.

Experiments show that our assumption on the temporal ordering of groups of states in expert trajectories is soft, and determines the granularity of the discovered subgoals (see supplementary).

Discrete rewards using subgoals perform much better than value-function-based continuous rewards. Moreover, value functions learned from a limited number of long trajectories may be erroneous, whereas segmenting the trajectories based on temporal ordering may still work well.

As the expert trajectories may not cover all the state-space regions the agent visits during exploration in the RL step, augmenting the subgoal-based reward function with out-of-set augmentation performs better than not using it.
6 Conclusion
In this paper, we presented a framework to utilize demonstration trajectories in an efficient manner by discovering subgoals, which are waypoints that need to be completed in order to achieve a certain complex goal-oriented task. We use these subgoals to augment the reward function of the task, without affecting the optimality of the learned policy. Experiments on three complex tasks show that, unlike state-of-the-art RL, IL, or methods which combine them, our method is able to solve the tasks consistently. We also show that our method is able to perform much better than the sub-optimal experts used to obtain the expert trajectories, and at least as well as the optimal experts. Our future work will concentrate on extending our method to repetitive, non-goal-oriented tasks.
References
 Modular multitask reinforcement learning with policy sketches. In ICML, pp. 166–175.
 A survey of robot learning from demonstration. Robotics and Autonomous Systems 57 (5), pp. 469–483.
 The option-critic architecture. In AAAI, pp. 1726–1734.
 End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
 Learning to search better than your teacher. In ICML.
 Fast policy learning through imitation and reinforcement. In UAI.
 Interactive policy learning through confidence-based autonomy. JAIR 34, pp. 1–25.
 Benchmarking deep reinforcement learning for continuous control. In ICML, pp. 1329–1338.
 Investigating human priors for playing video games. arXiv preprint arXiv:1802.10217.
 Stochastic neural networks for hierarchical reinforcement learning. In ICLR.
 Automatic goal generation for reinforcement learning agents. In ICML.

 Deep Q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence.
 SWIRL: a sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards. IJRR 38 (2-3), pp. 126–145.
 End-to-end training of deep visuomotor policies. JMLR 17 (1), pp. 1334–1373.
 Guided policy search. In ICML, pp. 1–9.
 Dynamic abstraction in reinforcement learning via clustering. In ICML, pp. 71.
 Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML.
 Asynchronous methods for deep reinforcement learning. In ICML, pp. 1928–1937.
 Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.

 TSC-DL: unsupervised trajectory segmentation of multi-modal surgical demonstrations with deep learning. In ICRA, pp. 4150–4157.
 Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, pp. 7559–7566.
 Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287.

 W-TALC: weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 563–579.
 Trajectory-based learning for ball-in-maze games. arXiv preprint arXiv:1811.11441.
 Temporal abstraction in reinforcement learning. PhD thesis, University of Massachusetts Amherst.
 Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In RSS.
 Nonparametric Bayesian reward segmentation for skill discovery using inverse reinforcement learning. In IROS, pp. 471–477.
 Learning abstract options. In NIPS.
 Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979.
 A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, pp. 627–635.
 Deep one-class classification. In ICML, pp. 4390–4399.
 Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3 (6), pp. 233–242.
 High-dimensional continuous control using generalized advantage estimation. In ICLR.
 High performance outdoor navigation from overhead data using imitation learning. In RSS.
 Compositional planning using optimal option models. arXiv preprint arXiv:1206.6473.
 Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
 Using relative novelty to identify useful temporal abstractions in reinforcement learning. In ICML, pp. 95.
 Identifying useful subgoals in reinforcement learning by local graph partitioning. In ICML, pp. 816–823.
 Learning options in reinforcement learning. In SARA, pp. 212–223.
 Truncated horizon policy search: combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240.
 Deeply AggreVaTeD: differentiable imitation learning for sequential prediction. In ICML, pp. 3309–3318.
 Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1-2), pp. 181–211.
 Support vector data description. Machine Learning 54 (1), pp. 45–66.
 Convex and analytically-invertible dynamics with contacts and constraints: theory and implementation in MuJoCo. In ICRA, pp. 6054–6061.
 Behavioral cloning from observation. In IJCAI.

 Sim-to-real transfer learning using robustified controllers in robotic tasks involving complex dynamics. In ICRA.