Consider a reinforcement learning (rl) agent placed in an environment without external rewards, but with a third party agent, called Bob, acting independently of the agent, and in particular not providing any explicit guidance.
In the absence of external rewards, we define maximal control of the environment state as an intrinsic motivation for the agent. If the latter includes multiple features, e.g. corresponding to different objects as in Figure 1, the agent should seek to control them independently. Besides learning from intrinsic motivation, the agent should also take advantage of Bob’s presence. Even if Bob does not provide clear guidance to the RL agent, reproducing his actions may help gain control, and the agent should do so.
Learning simultaneously to control all the aspects of a realistic environment state is likely to be sub-optimal: some aspects may not be controllable at all, or become so only after having mastered others, while training too much on already mastered ones may slow down progress on new ones. Likewise, Bob may act on aspects that the agent cannot act on, or produce behaviors the agent already masters. In both cases, imitating his actions would be a waste of time. As a consequence, the agent can benefit from using curriculum learning to guide its acquisition of environment control, both to choose what to practice and what to imitate from Bob.
From these observations, this work addresses the problem of combining feature-based control in a non-rewarding discrete environment, and imitation learning applied to an ambiguous and unconstrained third party agent, with the help of curriculum learning.
The intrinsic motivation of our agent is feature control. Here we set aside the problem of discovering independently controllable features (Thomas et al., 2017) in the environment. We rather assume that some environment features, controllable or not, are readily available, and focus on how to act on them, for example opening the box in Figure 1.
Using the vocabulary of Unicorn (Mankowitz et al., 2018), we define the act of controlling one feature independently as a task, and the value to set this feature to as a goal. We use Universal Value Function Approximators (uvfas) (Schaul et al., 2015) to learn in a discrete environment and in this multitask, multigoal and unsupervised setting.
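Using this vocabulary, a (task, goal) pair indexes queries into a single universal value function. The sketch below is purely illustrative (a tabular stand-in for the neural approximator used later); all names are hypothetical:

```python
from collections import defaultdict

N_ACTIONS = 5  # illustrative discrete action count

class TabularUVFA:
    """Toy UVFA: Q-values conditioned on state AND (feature, goal).

    A "task" is a feature index f to control; a "goal" g is the value
    that feature should take. One table covers every (f, g) combination
    instead of one value function per task.
    """

    def __init__(self, n_actions=N_ACTIONS):
        # Unseen (state, feature, goal) triples default to zero values.
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def values(self, state, feature, goal):
        return self.q[(state, feature, goal)]

    def greedy_action(self, state, feature, goal):
        vals = self.values(state, feature, goal)
        return max(range(len(vals)), key=vals.__getitem__)

uvfa = TabularUVFA()
# The same learner answers queries for any feature/goal pair.
uvfa.q[((0, 0), 2, 1)] = [0.0, 0.9, 0.1, 0.0, 0.0]
```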
In Learning from Demonstrations (LfD) (Schaal, 1997) applied to rl, two hypotheses are frequently made: 1) the expert provides demonstrations to guide the agent, 2) it does so for one task only, that the agent does not have to identify. In this work these requirements are not met. Instead, we assume that: 1) Bob acts independently of the agent in the environment, without the role of explicitly demonstrating useful behaviors, and 2) the agent does not know which feature Bob wants to control when acting. For example, if Bob moves an object among several, the agent should detect that Bob demonstrated a behavior that can be imitated to gain control on this object specifically and not on the others.
To incorporate the imitation of Bob’s behaviors into autonomous learning, we combine uvfas with an adapted version of the Deep Q-learning from Demonstrations (dqnfd) algorithm (Hester et al., 2017).
To decide what to learn autonomously and what to imitate in Bob’s behavior, the agent integrates curriculum learning by maximizing its absolute learning progress (LP) (Oudeyer et al., 2007; Baranes & Oudeyer, 2010): when a feature is too hard to control or already mastered, the agent does not make progress on controlling it and focuses on others. Using absolute values ensures the agent will refocus on a task for which its competence drops (Colas et al., 2018).
In this work, we introduce a new reinforcement learning setting, where an agent is placed in a non-rewarding discrete environment, and an unconstrained demonstrator Bob acts ambiguously, performing several tasks without the agent knowing what he is achieving. Combining feature control, imitation learning and curriculum learning, we propose an agent clic which learns in this setting by:
- using uvfas in the discrete action case for independently controlling multiple environment features, without external rewards;
- observing Bob’s ambiguous actions, determining the features that they help control, and imitating them;
- selecting which features to train on, and what to imitate from Bob, through absolute learning progress maximization.
Our results show that clic can:
- use its observations of Bob to learn broader and faster control of its environment;
- be taught control of features in a certain order by being shown demonstrations in this order;
- identify, then ignore, non-reproducible and already mastered behaviors when imitating Bob.
2 Related work
Our work falls at the intersection of three domains: unsupervised rl, imitation learning and curriculum learning.
Unsupervised Reinforcement Learning
There is a growing body of literature about learning a set of skills in an environment without any external reward signal (Machado & Bowling, 2016; Gregor et al., 2016; Eysenbach et al., 2018; Warde-Farley et al., 2018). The corresponding domain has recently been termed “unsupervised rl”.
A first concern with these approaches is what to learn in the absence of reward. Recent works like Andrychowicz et al. (2017) and Plappert et al. (2018) defined controlling the environment state as the objective of the agent: they represent goal states as some additional input, and also use uvfas and goal-conditioned policies to learn how to reach them. In this case, the agent rewards itself when it reaches a goal state. This work follows a similar approach but at a finer level, by learning to control state features.
A second concern is how to profit from the learned skills to address tasks of interest for an external user. In Eysenbach et al. (2018), three mechanisms are proposed for doing so: (i) considering the acquisition of skills as a pre-training process which will accelerate a subsequent rl stage, (ii) using hierarchical rl based on the skills to build a higher-level policy choosing the adequate skill at all times to maximize some external reward, or (iii) imitating an expert. Quite interestingly, in the latter case, the authors ask whether learning various skills can help imitate an expert, whereas in this paper we ask whether imitating another agent can help acquire useful skills.
The potential of learning from demonstrations to accelerate skill acquisition is well-known in the context of both rl for robotics (Ijspeert et al., 2013) and agents in simulation (Levine & Koltun, 2013; Hester et al., 2017; Večerík et al., 2017). In this work, we build upon mechanisms proposed in Hester et al. (2017). From this perspective, our combination of intrinsically motivated learning and imitation learning can be seen as a kind of multitask, multigoal version of dqnfd, where tasks are features to control, and curriculum learning is applied to the choice of these tasks. Some multitask imitation learning agents have been proposed (Parisotto et al., 2015; Peng et al., 2018), but they are not endowed with curriculum learning capabilities.
Curriculum Learning
Curriculum learning is a long-standing research topic in developmental robotics (Baranes & Oudeyer, 2010; Forestier et al., 2017) and has recently become the focus of intensive research in rl (Graves et al., 2017; Blaes et al., 2018; Narvekar & Stone, 2018; Weinshall & Cohen, 2018). Our curriculum learning strategy is especially close to that of the curious algorithm of Colas et al. (2018), where a multitask, multigoal agent called e-uvfa is also combined with absolute LP maximization to choose which task to train on.
Combining Imitation and Curriculum Learning
The interplay of curriculum and imitation learning in an autonomous agent is present in several works like Nguyen et al. (2011) and Duminy et al. (2019), which also demonstrate that imitation learning can drive the skill acquisition trajectory of an autonomous agent learning from intrinsic motivations. However, these works do not consider a unique, ambiguous and potentially non-reproducible teacher as we do, and above all they address distinct technical issues, as they do not use deep rl.
To summarize, with regard to existing work, we investigate unexplored questions at the intersection of unsupervised rl, imitation learning and curriculum learning.
Our framework is based on four components: (1) the intrinsic motivation of the agent consists in controlling environment features so that they take desired values (Section 3.1); (2) the agent learns how to satisfy its intrinsic motivations through a version of double dqn augmented to tackle individual feature control (Section 3.2); (3) when another agent acts in the environment (Section 3.3), our agent imitates the way it controls these features (Section 3.4); (4) to choose what to learn and imitate, the agent builds and follows a curriculum over features, based on learning progress maximization (Section 3.5).
3.1 Feature control as intrinsic motivation
In the absence of an external reward, our agent is driven by intrinsic motivations based on feature control, where tasks are features, and goals the values to set them to.
Following Abbeel & Ng (2004), we consider a Markov Decision Process without a reward function. States write $s = (s_1, \ldots, s_n)$, and in traditional goal-conditioned control, the agent learns an optimal policy $\pi^*(\cdot \mid s, g)$ for reaching goal $g$ starting from $s$. If the goal is $g$ and the state is discrete, a state $s$ is terminal if and only if $s_i = g_i$ for all $i$.
However, in many cases, an agent can be interested in controlling only a subset of the features. To this end, we consider a subset $F$ of features of interest, such that $s_f$ is the $f$-th feature of $s$ for $f \in F$. For each feature of interest $f$ we define an intrinsic reward function $R_f(s, g_f)$, with $R_f(s, g_f) = 1$ if $s_f = g_f$ and $0$ otherwise, for $f \in F$.
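As a concrete reading of this definition, here is a minimal sketch of the per-feature intrinsic reward; the state layout and feature indices below are illustrative, not the paper's exact encoding:

```python
def feature_reward(state, f, goal_value):
    """R_f(s, g_f) = 1 if s_f == g_f else 0: the agent rewards itself
    exactly when feature f of the state equals the desired value."""
    return 1.0 if state[f] == goal_value else 0.0

# Hypothetical state: agent position (first two entries) plus six
# binary object features.
state = (2, 3, 0, 1, 0, 0, 0, 0)
```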
3.2 Learning to control features
Thus we define a general feature-related action-value function:

$$Q(s, a, f, g_f) = \mathbb{E}\Big[\textstyle\sum_{t \geq 0} \gamma^t R_f(s_t, g_f) \,\Big|\, s_0 = s, a_0 = a\Big].$$

The agent learns a policy $\pi(\cdot \mid s, f, g_f)$ that maximizes $Q(s, a, f, g_f)$ for all $f \in F$ and all $g_f$. The value function $Q$ is approximated with a neural network taking $s$ and $g_f$ as input and providing $Q(s, a, f, g_f)$ for all $a \in A$ and $f \in F$. At the start of an episode, the agent samples a desired value $g_f$ and a feature $f$ following Section 3.5. clic then acts following $\pi$ until it reaches $s_f = g_f$, or a timeout is reached. The trajectory obtained is a list of transition tuples $(s, a, s')$ that are stored in a standard replay buffer. At each step, clic minimizes the double dqn loss on mini-batches taken from the replay buffer:

$$\mathcal{L}_{dqn} = \mathbb{E}\Big[\big(R_f(s', g_f) + \gamma\, Q\big(s', \operatorname{argmax}_{a'} Q(s', a', f, g_f; \theta), f, g_f; \theta^-\big) - Q(s, a, f, g_f; \theta)\big)^2\Big],$$

where $\theta^-$ parameterizes the target network (Van Hasselt et al., 2015).
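The double dqn target in this loss can be sketched as below. The function only computes the scalar target for one transition, with plain lists standing in for the outputs of the online and target networks; all names are hypothetical:

```python
def double_dqn_target(q_online_next, q_target_next, reward, done, gamma=0.99):
    """y = r + gamma * Q_target(s', argmax_a Q_online(s', a)).

    The online network SELECTS the next action; the target network
    EVALUATES it, which reduces the overestimation of plain dqn.
    """
    a_star = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    bootstrap = q_target_next[a_star]
    return reward + (0.0 if done else gamma * bootstrap)
```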
A difference between clic and double dqn is that, once $f$ and $g_f$ are chosen, clic uses softmax exploration rather than $\epsilon$-greedy, as the $\epsilon$-decreasing schedule of the latter is unfit to cope with cases where different features need different amounts of exploration. Also, contrary to prior work (Mankowitz et al., 2018; Colas et al., 2018), we do not input the task as a one-hot encoded vector, because we empirically observed that the network generalized too much between tasks, outputting high Q-values for uncontrolled features. We attribute this Q-value overestimation to the fact that masking only the first layer of weights with a one-hot encoded input vector is not sufficient to keep the outputs of the subsequent unmasked layers low for uncontrolled features.
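Softmax (Boltzmann) exploration can be sketched as follows; the temperature value shown is illustrative, not the one used in the experiments:

```python
import math
import random

def softmax_action(q_values, tau=1.0, rng=None):
    """Sample an action index with probability proportional to exp(Q/tau).

    Low tau concentrates on the greedy action; high tau approaches
    uniform sampling, so exploration adapts to the Q-value spread.
    """
    rng = rng or random.Random()
    z = [q / tau for q in q_values]
    m = max(z)                                # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng.random(), 0.0                # inverse-CDF sampling
    for a, p in enumerate(probs):
        acc += p
        if r <= acc:
            return a, probs
    return len(probs) - 1, probs
```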
All the above helped clic learn autonomous control of different features of its environment. We now consider an agent Bob that clic may imitate.
3.3 Observing Bob
At regular intervals, Bob chooses a feature to control and a value to set it to, and performs trajectories achieving this (Algorithm 1, line 12). Section 4.1 describes how Bob makes these choices but, in all cases, clic does not have access to this information.
Trajectories from Bob may be helpful for learning to control several features: if Bob’s actions changed feature $f$ and set it to value $g_f$ at some point, then Bob’s actions up to this point constitute a demonstration for setting $f$ to $g_f$, in the spirit of Hindsight Experience Replay (her) (Andrychowicz et al., 2017). Transitions from Bob’s trajectories are augmented with all such $(f, g_f)$ pairs, and clic can compute their rewards with $R_f$. These augmented transitions are stored both in the same replay buffer as the agent’s own experience (Algorithm 1, line 15), and in a separate demonstration set for imitation learning.
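This hindsight relabeling of Bob's trajectories can be sketched as follows, assuming states are plain tuples of features: whenever a feature changes, the prefix of the trajectory up to that point is kept as a demonstration for setting that feature to its new value. Names and the transition format are hypothetical:

```python
def relabel_trajectory(states, actions):
    """Return (feature, goal, demo_prefix) triples inferred from Bob's
    trajectory: each time feature f takes a new value v at step t, the
    prefix of (state, action) pairs up to t is a demonstration for
    "set feature f to v", even though Bob never announced that goal."""
    demos = []
    for t in range(1, len(states)):
        prev, cur = states[t - 1], states[t]
        for f, (a, b) in enumerate(zip(prev, cur)):
            if a != b:  # feature f reached value b at step t
                demos.append((f, b, list(zip(states[:t], actions[:t]))))
    return demos
```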
3.4 Imitation learning
After a set of trajectories from Bob is observed, the agent performs a fixed number of imitation steps on the demonstration set (Algorithm 1, lines 18-20).
Extending Hester et al. (2017), we define a large margin classification loss for our general Q-values. Given a feature $f$ and a goal $g_f$, it writes:

$$\mathcal{L}_{margin} = \max_{a \in A}\big[Q(s, a, f, g_f) + l(a_B, a)\big] - Q(s, a_B, f, g_f),$$

where $a_B$ is Bob’s action and $l(a_B, a)$ is a margin function that is $0$ when $a = a_B$ and positive otherwise. This loss ensures that the Q-value of Bob’s action will be at least a margin above the Q-values of the other actions.
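A minimal sketch of this margin loss for a single state, with a plain list standing in for Q-values (the margin value is illustrative):

```python
def large_margin_loss(q_values, demo_action, margin=0.8):
    """max_a [Q(a) + l(demo, a)] - Q(demo), where l is 0 for the
    demonstrated action and `margin` for all others. The loss is zero
    once the demonstrated action dominates every other by the margin."""
    shifted = [q + (0.0 if a == demo_action else margin)
               for a, q in enumerate(q_values)]
    return max(shifted) - q_values[demo_action]
```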
At each imitation step, clic selects a feature $f$ following Section 3.5, and minimizes the margin loss on batches of tuples from the demonstration set.
3.5 Curriculum learning
clic uses its own learning progress to select which feature to practice (Section 3.2 and Algorithm 1, lines 2, 9), and which feature to imitate Bob on (Section 3.4 and Algorithm 1, line 18). To that end, it tracks its learning progress on each feature of interest, and samples preferably features for which this learning progress is maximal.
Similar to Colas et al. (2018), we define the agent’s competence $C_f(t)$ at step $t$ for feature $f$ as the average success over a window of its last attempts at controlling $f$ — success meaning $s_f = g_f$ at the end of an episode with desired value $g_f$. The learning progress then writes

$$LP_f(t) = \big|C_f(t) - C_f(t - \Delta)\big|,$$

where $\Delta$ is the window size. It is used to derive, for each feature $f$, a sampling probability at step $t$:

$$p_f(t) = \frac{\epsilon}{|F|} + (1 - \epsilon)\,\frac{LP_f(t)}{\sum_{f' \in F} LP_{f'}(t)},$$

with $\epsilon$ controlling the sampling randomness: $\epsilon = 1$ means pure random sampling, while $\epsilon = 0$ means pure proportional sampling. To ensure a minimum amount of exploration on features while learning a curriculum, clic uses an intermediate value of $\epsilon$. We call clic-rnd the version of clic that samples features randomly without maximizing LP, using $\epsilon = 1$.

The use of absolute values in $LP_f$ ensures that a competence drop on a feature leads to training again on this feature (Baranes & Oudeyer, 2013; Colas et al., 2018). A threshold parameter is added so that only a sufficient competence drop is visible, and makes sure the agent favors controlling new features over maintaining those already mastered at the highest level of competence.
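The competence, learning progress and sampling distribution of this section can be sketched as follows; the window size and epsilon values are illustrative, and the epsilon-mixing form is one standard way to interpolate between random and proportional sampling:

```python
def competence(successes, window=10):
    """Average success over the last `window` attempts (0/1 outcomes)."""
    recent = successes[-window:]
    return sum(recent) / len(recent) if recent else 0.0

def learning_progress(successes, window=10):
    """Absolute competence change between two consecutive windows."""
    if len(successes) < 2 * window:
        return 0.0
    older = successes[-2 * window:-window]
    return abs(competence(successes, window) - sum(older) / window)

def sampling_probs(lps, eps=0.2):
    """Mix uniform sampling (weight eps) with LP-proportional sampling,
    so every feature keeps a minimum probability of being selected."""
    n, total = len(lps), sum(lps)
    if total == 0:
        return [1.0 / n] * n
    return [eps / n + (1 - eps) * lp / total for lp in lps]
```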
4 Experimental setup
Experiments are conducted in several discrete state, discrete action environments containing objects, in which imitating Bob’s actions is more or less challenging: a base environment with six independent, fully controllable objects; two partially controllable variations of it, with respectively one and three controllable objects; and a distinct environment with fully controllable but hierarchically related objects.
Independent objects, full control:
The base environment (Figure 1, left) is a grid-world with six objects to interact with. The actions of an agent are up, down, right, left and touch. The environment state perceived is defined as the agent position in the grid plus the states of all objects. The agent always starts in the center of the grid, and we use 25% of sticky actions (Machado et al., 2017) for stochasticity.
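Sticky actions can be sketched as a thin wrapper around the agent's chosen action: with probability p, the environment repeats the previous action instead (p = 0.25 in the text; the wrapper interface is hypothetical):

```python
import random

class StickyActions:
    """With probability p, replay the previous action instead of the
    chosen one (Machado et al., 2017), injecting stochasticity without
    changing the action space."""

    def __init__(self, p=0.25, rng=None):
        self.p = p
        self.prev = None
        self.rng = rng or random.Random()

    def apply(self, action):
        if self.prev is not None and self.rng.random() < self.p:
            action = self.prev      # the previous action "sticks"
        self.prev = action
        return action
```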
When Bob is present and acts in the environment, clic perceives and stores the states visited by Bob. We assume that the only source of non-optimality of Bob’s actions is the sticky actions. Bob’s choice of objects to act on depends on the experiment and is detailed for each result.
To control objects and set the corresponding feature values to 1, clic has to go through and touch an unknown, predefined, ordered list of intermediate positions for each object. This setup abstractly reproduces the idea that several steps may be necessary to properly manipulate an object: a longer list means an object that is more difficult to manipulate.
Controlling this environment, or one of the three others introduced next, means controlling its objects: the features of interest of clic are the object states. Thus, from now on, controlling feature $f$ always means being consistently able to set the corresponding object state to any desired value $g_f$. In all environments, states are normalized so that each feature takes its values in $[0, 1]$.
Independent objects, partial control
In the base environment, the six objects are fully controllable by the clic agent. In realistic environments, some aspects are likely to be controllable by Bob only. The two partially controllable variants are copies of the base environment where respectively one and three of the six objects can be controlled by clic, though Bob can control all of them. Technically, choosing action touch on the intermediate positions associated with the uncontrollable objects simply has no effect (Figure 1, bottom right).
Hierarchically related objects, full control
In the base environment, the six objects are independent of each other: there is no overlap between their lists of intermediate positions. On the contrary, in the hierarchical environment, we modify these lists so that controlling Object $i$ becomes a prerequisite to controlling Object $i+1$. Simply put, setting Object $i+1$’s state to 1 means setting Object $i$’s state to 1 plus going through an additional list of intermediate positions (Figure 1, top right).
For the sake of simplicity, in all experiments, the desired value for all features of interest is 1, even though features can take any value in $[0, 1]$. All clic hyperparameters are fixed across experiments: an episode timeout is reached after 200 steps, the batch size is 64, and the learning-progress window, sampling parameter $\epsilon$, softmax temperature and replay buffer size are kept constant. Q-values are approximated with a neural network with two 32-neuron hidden layers and ReLU activations. The code of the experiments is available at https://www.dropbox.com/s/zs1ukyo0la7hze2/clic.zip?dl=0.
Features of interest being object states, we use the words feature and object interchangeably in this section. In all figures, competence and learning progress curves refer to clic (or clic-rnd) for each object at a given step; the global performance is the average competence normalized by the number of controllable objects; we also report the number of steps clic spends imitating Bob on each object. Unless stated otherwise, when Bob is said to act on a set of objects, he samples randomly one of these objects at the start of each of his trajectories and acts to set its state to 1. Plain curves correspond to the median of 10 runs and shaded areas to their interquartile ranges.
The experiments aim at determining: 1) if clic can identify useful demonstrations from Bob’s actions with unknown goals and imitate them; 2) to what extent clic’s learning trajectory can be influenced by Bob’s behaviors; and 3) if clic can identify and ignore behaviors that should not be reproduced, because it cannot reproduce them or already masters them. As explained in Section 3.5, the curriculum learning agent clic uses proportional sampling with a low $\epsilon$, whereas the basic agent clic-rnd uses random sampling with $\epsilon = 1$.
5.1 Imitating Bob
First we analyze the impact of observing and imitating Bob, both in the base environment (Figure 2 A) and in the hierarchical environment (Figure 2 B). In the former, Bob either does nothing, acts on Objects 1, 2 and 3, or acts on all objects. It is clear that clic learns faster when it sees Bob acting on more objects.
In the hierarchical environment, Bob either does nothing, controls the intermediate Object 4, or controls the two hardest objects of the hierarchy, 5 and 6. This environment is harder to control autonomously as it requires more exploration to discover the hard objects. Results show that clic uses Bob’s behavior to gain control over the features Bob seeks to control, but also over those Bob controls as intermediate steps.
For example, when Bob controls Objects 5 and 6, the hierarchy implies that he has to also act on all other objects before. In this case, results show that clic can imitate Bob’s actions to control more than only what Bob intended to achieve. When Bob controls only Object 4, clic gains control over features corresponding to objects up to 4 as well, but discovering autonomously harder ones is too difficult.
5.2 Following Bob’s teaching
In Figure 2 A, when Bob controls only a subset of features, clic first imitates him to gain control over this subset, and then explores the rest of the environment autonomously. From another point of view, these results show that Bob can influence clic’s learning trajectory simply by showing it how to control some selected features. We now show that, acting as a mentor, Bob can completely control this learning trajectory.
In Figure 3, instead of choosing an object to control at random, Bob shows how to control Object 1 until the agent’s competence for this feature is above 0.9. Then Bob demonstrates control over Object 2, and so on. If the agent’s competence on a previously mastered object falls below 0.9, Bob provides demonstrations for it again. The figure compares the performances of clic (Figure 3 B) and clic-rnd (Figure 3 A), in the base environment (where all objects are of the exact same difficulty) and with Bob acting as specified.
Without curriculum learning, clic-rnd’s learning trajectory starts by following the order of Bob’s demonstrations but does not do so up to the end: as clic-rnd learns autonomously in parallel with imitating Bob, and randomly chooses what to learn autonomously, nothing prevents it from learning to control features 4, 5 and 6 before Bob shows how to do so.
By contrast, clic strictly follows the curriculum taught by Bob, learning almost always one feature at a time. This time, during its autonomous learning phase, LP maximization pushes clic to stick to the features Bob demonstrated, until it does not make progress on them anymore. In other words, clic can be taught control over the environment in a desired order by simply showing it demonstrations in this order.
We observe that following Bob’s order does not speed up global control of the environment. In the base environment, all objects are independent, so there is no transfer between learning the different features of interest. The ability to focus on learning features one by one is balanced by the fact that the network overfits when it trains on only one feature at a time.
5.3 Ignoring non-reproducible behavior from Bob
When Bob controls features that the agent cannot, imitating Bob can be a waste of time. Yet, if the agent knew in advance which subset of features it cannot control or imitate Bob on, learning would be simpler, as the agent could focus on fewer aspects of the environment. But in this work, clic does not have this knowledge and relies on curriculum learning to discover what to learn and imitate.
In Figure 4, we perform experiments (A) in the base environment with all objects controllable by the agent, (B) in the variant with half of them controllable, and (C) in the variant with only one controllable. In each environment, Bob acts on all objects, including those that the agent has no control on, and we study the impact of LP maximization on the agent’s performance.
Comparing the three blue curves shows that, without curriculum learning, having fewer features to master only slightly boosts clic-rnd’s performance on global control. As expected, without a mechanism to tell apart what can and cannot be learned, much of the advantage of having fewer features to learn is lost. Also, we note that the variability of clic-rnd increases with the number of uncontrollable features: achieving global control becomes dependent on whether it discovers the few controllable features more or less early.
In the base environment, when all objects are controllable, LP maximization does not bring any benefit and clic does not outperform clic-rnd. Indeed, objects are independent and of equal difficulty, so there can only be a poor form of transfer between them, and a low gain from practicing them and imitating Bob on them in any specific order, as LP maximization pushes the agent to do.
In the partially controllable variants, on the contrary, the red curves show that LP maximization helps achieve maximum control faster in the presence of uncontrollable features: when three or five objects are uncontrollable, clic learns faster than clic-rnd. Besides, the impact of curriculum learning is greater when there are more uncontrollable features: the gap between clic-rnd and clic is wider with five uncontrollable objects than with three.
Reasons for this gain are found in Figure 5, which shows the impact of LP maximization when only one object is controllable. Figure 5 displays the evolution of the agent’s learning progress (A) and competence (B) on the controllable object, the number of steps spent imitating Bob controlling this object (C), and the number of steps uselessly spent imitating Bob controlling uncontrollable objects (D).
When it does not maximize LP (dashed blue), the agent ignores the LP bump (Figure 5 A) resulting from gaining some control over the controllable feature (Figure 5 C). So it randomly selects what to train on, and above all what to imitate: the agent tries to reproduce Bob’s behavior when it can (Figure 5 B) as often as when it cannot (Figure 5 D), as shown by the blue curves being at the same level. This is sub-optimal, as performance on uncontrollable objects will never increase.
Instead, when the agent maximizes LP (plain red), the bump in LP at 50k steps (Figure 5 A) results in more imitation of this feature (Figure 5 B) and more episodes spent trying to control it. Simultaneously, the agent stops focusing on the other features for which its learning progress is too small, among which the uncontrollable ones (Figure 5 D). This focus results in a faster learning pace on controllable features (Figure 5 C).
5.4 Ignoring Bob’s demonstrations for mastered features
By contrast with the partially controllable environments, in the hierarchical environment all features are controllable, but some are easier than others. In particular, controlling Object $i$ is easier than controlling Object $i+1$, and any demonstration for Object $i+1$ contains a demonstration for Object $i$.
As a consequence, in such an environment, when Bob acts on random objects, he ends up providing more demonstrations for the easiest features. If Bob chooses to guide the agent as in Section 5.2, he initially focuses on these easy features, and the bias towards them is even stronger.
Figure 6 shows the global performances of clic-rnd and clic in these two scenarios where Bob’s actions are biased towards easy features. In both scenarios, curriculum learning helps ignore Bob’s behaviors affecting already mastered features. The effect of the curriculum is greater when Bob guides the agent, as the bias is stronger in this case. But the same effect holds when Bob chooses randomly what to do, at least early in learning.
The fact that curriculum learning enables clic to stop imitating Bob on easy features can be clearly seen in Figure 7. Here Bob guides the agent, and the bias towards easy features is strong. clic-rnd chooses randomly what to imitate among what it is shown, so it imitates Bob mainly on quickly mastered features.
For instance, at step 400K, Object 3 is not yet mastered by clic-rnd, so Bob demonstrates it, along with its necessary intermediate steps, Objects 1 and 2; clic-rnd, observing these three objects being controlled, imitates Bob on all of them (Figure 7 B), whereas it already knows how to control Objects 1 and 2. It thus loses precious imitation steps that could serve to make progress on Object 3.
In this work, we proposed a new learning setting, with a non-rewarding environment where a third party agent called Bob acts without communicating the intent of its actions, and in ways that can be non-reproducible for the agent. This setting, although discrete, is a first step towards real-life environments, where artificial agents will face a very large number of potential tasks and goals, alongside other agents with different capabilities. We combined feature control, curriculum and imitation learning to build an agent called clic that addresses the issues raised by this challenging learning context.
In particular, we showed that clic predictably makes faster progress when it observes more behaviors from Bob, but also that it can leverage Bob’s actions to make progress for more tasks than only those demonstrated. We demonstrated that Bob could mentor clic and control its developmental trajectory by simply providing ordered demonstrations. Eventually, we showed that clic can effectively use learning progress maximization to tell between what is and is not useful to learn and imitate, and thus learns faster both when the environment is partially controllable and when it contains a natural hierarchy.
A specificity of this work was that, rather than considering a human expert teaching the agent, as is usually the case in interactive learning, we considered an external software agent providing non-intentional demonstrations. As a consequence, we did not focus on limiting the amount of demonstrated behaviors, as is often the case in the domain (Kang et al., 2018). Another topic of interest that emerges from our work is the importance of the order of the demonstrations performed by a teaching agent. We leave these two topics for future work.
Additionally, our focus being on the combination of curriculum learning and imitation learning rather than on representation learning, experiments were performed in discrete state, discrete action environments with independent features. But the main ideas presented here could easily be extended to the continuous case, replacing dqnfd with ddpgfd (Večerík et al., 2017) and trying to learn independently controllable features (Thomas et al., 2017).
References

- Abbeel & Ng (2004) Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, pp. 1. ACM, 2004.
- Andrychowicz et al. (2017) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight experience replay. arXiv preprint arXiv:1707.01495, 2017.
- Baranes & Oudeyer (2010) Baranes, A. and Oudeyer, P.-Y. Intrinsically motivated goal exploration for active motor learning in robots: A case study. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2010), Taipei, Taiwan, Province Of China, 2010. IEEE.
- Baranes & Oudeyer (2013) Baranes, A. and Oudeyer, P.-Y. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61(1):49–73, 2013.
- Blaes et al. (2018) Blaes, S., Vlastelica, M., Zhu, J.-J., and Martius, G. Control what you can: Intrinsically motivated hierarchical reinforcement learner. In Deep RL Workshop, NeurIPS, 2018.
- Colas et al. (2018) Colas, C., Fournier, P., Sigaud, O., and Oudeyer, P.-Y. CURIOUS: Intrinsically motivated multi-task, multi-goal reinforcement learning. arXiv preprint arXiv:1810.06284, 2018.
- Duminy et al. (2019) Duminy, N., Nguyen, S. M., and Duhaut, D. Learning a set of interrelated tasks by using sequences of motor policies for a socially guided intrinsically motivated learner. Frontiers in Neurorobotics, 2019.
- Eysenbach et al. (2018) Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
- Forestier et al. (2017) Forestier, S., Mollard, Y., and Oudeyer, P.-Y. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190, 2017.
- Graves et al. (2017) Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. Automated curriculum learning for neural networks. arXiv preprint arXiv:1704.03003, 2017.
- Gregor et al. (2016) Gregor, K., Rezende, D. J., and Wierstra, D. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
- Hester et al. (2017) Hester, T., Večerík, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Dulac-Arnold, G., et al. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.
- Ijspeert et al. (2013) Ijspeert, A. J., Nakanishi, J., Hoffmann, H., Pastor, P., and Schaal, S. Dynamical movement primitives: learning attractor models for motor behaviors. Neural computation, 25(2):328–373, 2013.
- Kang et al. (2018) Kang, B., Jie, Z., and Feng, J. Policy optimization with demonstrations. In International Conference on Machine Learning, pp. 2474–2483, 2018.
- Levine & Koltun (2013) Levine, S. and Koltun, V. Guided policy search. In Proceedings of the 30th International Conference on Machine Learning, pp. 1–9, 2013.
- Machado & Bowling (2016) Machado, M. C. and Bowling, M. Learning purposeful behaviour in the absence of rewards. arXiv preprint arXiv:1605.07700, 2016.
- Machado et al. (2017) Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. arXiv preprint arXiv:1709.06009, 2017.
- Mankowitz et al. (2018) Mankowitz, D. J., Žídek, A., Barreto, A., Horgan, D., Hessel, M., Quan, J., Oh, J., van Hasselt, H., Silver, D., and Schaul, T. Unicorn: Continual learning with a universal, off-policy agent. arXiv preprint arXiv:1802.08294, 2018.
- Narvekar & Stone (2018) Narvekar, S. and Stone, P. Learning curriculum policies for reinforcement learning. arXiv preprint arXiv:1812.00285, 2018.
- Nguyen et al. (2011) Nguyen, S. M., Baranes, A., and Oudeyer, P.-Y. Bootstrapping intrinsically motivated learning with human demonstrations. arXiv preprint arXiv:1112.1937, 2011.
- Oudeyer et al. (2007) Oudeyer, P.-Y., Kaplan, F., and Hafner, V. V. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
- Parisotto et al. (2015) Parisotto, E., Ba, J. L., and Salakhutdinov, R. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
- Peng et al. (2018) Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. arXiv preprint arXiv:1804.02717, 2018.
- Plappert et al. (2018) Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.
- Schaal (1997) Schaal, S. Learning from demonstration. In Advances in Neural Information Processing Systems 9, pp. 1040–1046, Cambridge, MA, 1997. MIT Press.
- Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320, 2015.
- Thomas et al. (2017) Thomas, V., Pondard, J., Bengio, E., Sarfati, M., Beaudoin, P., Meurs, M.-J., Pineau, J., Precup, D., and Bengio, Y. Independently controllable features. arXiv preprint arXiv:1708.01289, 2017.
- Van Hasselt et al. (2015) Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. CoRR, abs/1509.06461, 2015.
- Večerík et al. (2017) Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
- Warde-Farley et al. (2018) Warde-Farley, D., Van de Wiele, T., Kulkarni, T., Ionescu, C., Hansen, S., and Mnih, V. Unsupervised control through non-parametric discriminative rewards. arXiv preprint arXiv:1811.11359, 2018.
- Weinshall & Cohen (2018) Weinshall, D. and Cohen, G. Curriculum learning by transfer learning: Theory and experiments with deep networks. arXiv preprint arXiv:1802.03796, 2018.