Show me the Way: Intrinsic Motivation from Demonstrations

by Léonard Hussenot, et al.

The study of exploration in Reinforcement Learning (RL) has a long history, but it remains an unsolved problem. Recent approaches applied to Deep RL are based on the concept of intrinsic motivation and are implemented as an exploration bonus, added to the environment reward, that encourages visiting the whole state-action space exhaustively and as fast as possible. This approach is supported by the vast theory of RL, in which convergence to optimality assumes exhaustive exploration. Yet humans and other mammals do not exhaustively explore the world, and their motivation is based not only on novelty but also on diverse other factors (e.g., curiosity, fun, style, pleasure, safety, competition). They optimize for life-long learning and train to learn transferable skills in playgrounds without obvious goals. They also apply innate or learned priors to save time and stay safe. For these reasons, we propose a method for learning an exploration bonus from demonstrations that could transfer these motivations to an artificial agent without explicitly modeling them. Using an inverse RL approach, we show that different exploration behaviors can be learnt and efficiently used by RL agents to solve tasks for which exhaustive exploration is prohibitive.




1 Introduction

Reinforcement Learning has addressed a variety of sequential decision-making problems, whether in games [40, 21, 34] or robotics [1, 2]. Nevertheless, some simple problems remain unsolved. Current state-of-the-art methods struggle to find good policies in environments (1) where constant negative rewards may discourage the agent from exploring (e.g., the Pitfall! game from Atari), (2) where the reward is so sparse that an agent does not find any (e.g., the Montezuma’s Revenge Atari game), or (3) where the state and action spaces are large (e.g., text worlds). In order to tackle these specific problems, the use of reward bonuses, inspired by animal curiosity, was proposed to steer the agent’s exploration [35, 36]. Even though different intrinsic bonuses have been proposed, they all rely on the same principle: novelty should be rewarded. These methods mostly differ in how they compute this notion of newness. Count-based methods do it by counting how often the agent has encountered a given state [36]. Pseudo-count methods [4, 24] make it possible to approximate counts in large state spaces. Prediction error is also used to measure novelty, either by computing the agent’s ability to predict the future [27] or random statistics about the current state [7]. Some restrict novelty to state-action pairs that have an impact on the agent [30] or derive empowerment metrics [22] using mutual information. All these methods naturally encourage the discovery of new states through exhaustive exploration. Yet, in most realistic environments, exhaustive exploration is (1) not feasible due to the size of the state-action space, and (2) not desirable, as most behaviors are unlikely to be relevant for the task at hand.

By contrast, the exploration behaviors of humans, and of mammals more generally, are governed by various motivations and constraints. Intelligent beings do not have unlimited resources of time and energy.
They optimize these resources to survive and reproduce but also to have fun [14], to help others [8] or to satisfy their curiosity. Oudeyer and Kaplan [25] distinguish between homeostatic motivations, which encourage organisms to stay in their “comfort zone” and generally correspond to desires that can be satiated, and heterostatic motivations, which push organisms out of equilibrium and cannot be satiated. These many desires shape the way organisms interact with their environment, encouraging them to discover new things but also to protect themselves, avoiding overly surprising events through mechanisms like fear [19]. Berseth et al. [5] exemplified how to exploit such priors by implementing a “homeostasis” objective for RL, thereby showing how different from “novelty seeking” these priors can be. Ultimately, resource constraints stop organisms from exploring their environment exhaustively and push them to transfer knowledge from past experience. In an arbitrary environment, exhaustive exploration is desirable and leads to convergence with theoretical guarantees [37]. But when the exploration problem presents some structure, one can transfer skills and priors from similar environments. Dubey et al. [9] exemplified, in the case of simple video games, how human priors help us solve new problems. The authors highlight how humans struggle to play the same video game when the semantics of the objects are changed, when the physics of the environment (e.g., the gravity) is rotated, when the visual similarities are modified or when the natural way of interacting with objects is transformed. Overall, they show how much of humans’ ability to solve a new game in a zero-shot manner is due to their priors on the environment.

In this paper, we propose to learn a bonus that captures these priors and sources of motivation from demonstrations of goal-driven exploratory behaviors.
By adopting this approach, we expect to learn a bonus that implicitly helps reproduce a structured exploration behavior in lieu of an exhaustive one. We also argue that, to a certain extent at least, this can happen without the need for extra modelling inspired by cognitive or behavioral research. To do so, we cast this problem as an inverse RL problem, with the difference that only a fraction of the reward optimized by the observed agent is hidden: the intrinsic motivation bonus. The task-related reward remains provided by the environment. We therefore propose the following contributions:

  1. a modelling that allows for disentangling the reward optimized by a demonstrator from its intrinsic motivation bonus;

  2. an architecture, that we call “Show me the Way” (SmtW), based on a cascade of supervised learning methods that extracts that exploration bonus from demonstrations;

  3. a method for assessing the quality of the bonus in terms of its ability to encourage similar exploration behaviors as the demonstrator.

To evaluate SmtW, we validate a set of hypotheses on a controlled environment. We notably find that our method can learn structures and styles, transfer useful priors and encourage long-term planning.

2 Background

Markov Decision Processes.

In Reinforcement Learning (RL), an agent learns to behave optimally through interactions with an environment. This is usually formalized as a Markov Decision Process (MDP) [38, 29], a tuple $\{\mathcal{S}, \mathcal{A}, P, r, \gamma\}$ with $\mathcal{S}$ the set of states, $\mathcal{A}$ the set of actions (assumed discrete here), $P$ the Markovian transition kernel defining the dynamics of the environment, $r$ a bounded reward function and $\gamma \in (0, 1)$ a discount factor. The agent interacts with the environment through a (here deterministic) policy $\pi : \mathcal{S} \to \mathcal{A}$. The quality of a given policy is quantified by the associated state-action value function, or $Q$-function. It is the expected discounted cumulative reward for starting from $s$, taking action $a$, and following $\pi$ afterward: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$, with $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and $a_t = \pi(s_t)$ for $t \ge 1$. By construction, it satisfies the Bellman equation: for any $(s, a)$, $Q^\pi(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[Q^\pi(s', \pi(s'))\right]$. An optimal policy $\pi^*$ satisfies component-wise $Q^{\pi^*} \ge Q^\pi$, for any policy $\pi$. Let $Q^*$ be the associated (unique) optimal $Q$-function; any deterministic optimal policy is greedy with respect to it: $\pi^*(s) \in \operatorname{argmax}_a Q^*(s, a)$.

Exploration Bonus. A common strategy to encourage exploration is to augment the reward function with a bonus. This bonus generally depends on past history. For example, a bonus rewarding novelty requires remembering what has been experienced so far. Write $h_t = (s_0, a_0, \dots, s_t)$ the history up to time $t$, and $\mathcal{H}$ the set of all histories. Generally speaking, we abstract a bonus as a function $b : \mathcal{H} \times \mathcal{A} \to \mathbb{R}$, and use it for addressing the dilemma between exploration and exploitation, which thus amounts for the agent to optimize for $r + b$ instead of simply $r$.
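To make the Bellman equation and the bonus-augmented reward concrete, here is a minimal sketch on a toy three-state chain. The chain, its rewards, the function names and the particular `novelty_bonus` are all assumptions made up for illustration; they are not part of the method, and the history is simplified to a list of visited states.

```python
GAMMA = 0.9
STATES = [0, 1, 2]            # state 2 is terminal (absorbing, zero reward)
ACTIONS = ["stay", "right"]


def step(s, a):
    """Deterministic toy dynamics: entering state 2 pays a reward of 1."""
    if s == 2:
        return 2, 0.0
    s_next = s + 1 if a == "right" else s
    return s_next, (1.0 if s_next == 2 else 0.0)


def q_value_iteration(n_iters=50):
    """Iterate the Bellman optimality backup Q(s,a) = r + gamma * max_a' Q(s',a')."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(n_iters):
        # The comprehension reads the previous table while building the new one.
        q = {
            (s, a): step(s, a)[1] + GAMMA * max(q[(step(s, a)[0], b)] for b in ACTIONS)
            for s in STATES
            for a in ACTIONS
        }
    return q


def greedy(q, s):
    """A deterministic optimal policy is greedy with respect to Q*."""
    return max(ACTIONS, key=lambda a: q[(s, a)])


def novelty_bonus(history, action):
    """Hypothetical bonus b(h, a): pays more when the current state is rare in h."""
    current = history[-1]
    return 1.0 / history.count(current) ** 0.5


def augmented_reward(r, history, action):
    """The agent optimizes r + b instead of r alone."""
    return r + novelty_bonus(history, action)


q_star = q_value_iteration()
print(greedy(q_star, 0), round(q_star[(0, "right")], 3))
```

Note that the bonus takes the whole history as input, which is why the augmented problem is no longer Markovian in the state alone.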

3 Show me the Way

Our main contribution is Show me the Way (SmtW), a new exploration bonus extracted from demonstrations. The proposed method learns the demonstrator’s intrinsic bonus and encourages the agent to imitate its way of exploring the environment.

Figure 1: Trajectories are generated by a demonstrator exploring its environment. In order to recover a bonus that can explain its behavior, a BC policy parameterized with an LSTM is trained to predict the actions of the demonstrator from its trajectories of states, by minimizing a cross-entropy loss $\mathcal{L}_{\mathrm{BC}}(\theta)$. The policy’s logits $Q_\theta$ are interpreted as optimal $Q$-values and used to compute a regression target. A bonus function $b_\phi$, parameterized with an LSTM, is then trained to predict it, by minimizing a regression loss $\mathcal{L}_b(\phi)$.

Formalization. We assume that we have access to demonstrations, and that these demonstrations are optimal according to the (known) reward of the environment plus the (unknown) intrinsic bonus of the demonstrator. The environment being assumed Markovian, knowing the current state is enough to act optimally according to the task (optimizing for the environment’s reward). Yet the demonstrator also optimizes its exploration bonus, which depends on the past. To formalize things, we consider that the demonstrations are provided by a policy $\pi_E$, and that this policy is optimal for the augmented MDP $\{\mathcal{H}, \mathcal{A}, P, r + b, \gamma\}$, where $\mathcal{H}$ replaces $\mathcal{S}$ and $r + b$ replaces $r$. We frame our problem as learning the bonus $b$ from trajectories sampled from $\pi_E$. As such, it seems very close to an Inverse Reinforcement Learning (IRL) problem, which aims at recovering a reward function from an expert’s trajectories. Yet we have two additional difficulties here. First, we have to take the whole history into account. Second, we aim at learning the bonus, which requires disentangling the known reward from the bonus to be learnt.

Our approach. Although we cannot naively apply an existing IRL algorithm to our problem, such algorithms can be a source of inspiration. In particular, following Klein et al. [18], let us assume that a behavior cloning (BC) policy can be learned using a standard classifier, like a neural network, that learns to map states $s$ to demonstrator actions $a$. Most classifiers compute scores $q(s, a)$ (for example, the logits in a neural network), and a policy $\pi_{\mathrm{BC}}$ mimicking the demonstrator can be obtained by selecting actions as $\pi_{\mathrm{BC}}(s) = \operatorname{argmax}_a q(s, a)$. As we expect an expert to be greedy according to its internal $Q$-function ($Q_E$), its policy should follow $\pi_E(s) = \operatorname{argmax}_a Q_E(s, a)$. If we assume $\pi_{\mathrm{BC}}$ to be close to $\pi_E$, we can identify $q$ with $Q_E$ and act as if the classifier learned the expert’s $Q$-function. From this, one can use $q$ in the Bellman equation to recover a reward (the reward is roughly the difference of consecutive $Q$-values).

We extend this general idea to our own problem. Assume that we have access to trajectories sampled by the unknown demonstrator $\pi_E$. We rewrite this data as a set of transitions $\mathcal{D} = \{(h_t, a_t, r_t, h_{t+1})\}$, with $h_t$ as defined before, $a_t = \pi_E(h_t)$, and $r_t = r(s_t, a_t)$ (recall that we assume the reward to be known). Let $Q_\theta$ be a neural network classifier with LSTM [13] units, $\theta$ being the set of parameters and $Q_\theta(h, a)$ being the logits. We propose to train $Q_\theta$ to do behavioral cloning, that is to predict the demonstrator actions $a_t$ based on its past interactions $h_t$, by minimizing a cross-entropy loss:

$$\mathcal{L}_{\mathrm{BC}}(\theta) = -\sum_{(h_t, a_t) \in \mathcal{D}} \log \frac{\exp Q_\theta(h_t, a_t)}{\sum_{a \in \mathcal{A}} \exp Q_\theta(h_t, a)},$$


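with $Q_\theta(h, a)$ the logit of action $a$ for input $h$. If the classifier learns correctly, the logits of the resulting network should satisfy $Q_\theta(h_t, a_t) > Q_\theta(h_t, a)$ for $a \neq a_t$, and the class predicted by the classifier will be $\pi_\theta(h_t) = \operatorname{argmax}_a Q_\theta(h_t, a) = a_t$. As such, one can interpret $Q_\theta$ as an optimal $Q$-function (hence the notation), and $\pi_\theta$ as the associated optimal policy. Both these quantities can be related to the bonus-augmented reward through the Bellman equation, which holds for the augmented MDP:

$$Q_\theta(h_t, a_t) = r(s_t, a_t) + b(h_t, a_t) + \gamma \, \mathbb{E}_{h_{t+1} \mid h_t, a_t}\left[Q_\theta(h_{t+1}, \pi_\theta(h_{t+1}))\right].$$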


This suggests a simple solution for learning a bonus function: learn a network $b_\phi$ (parameterized by $\phi$, with LSTM units) by minimizing a squared loss, the regression target being $\hat{b}_t = Q_\theta(h_t, a_t) - r_t - \gamma \, Q_\theta(h_{t+1}, \pi_\theta(h_{t+1}))$, an unbiased sample of what the true Bellman equation would give. However, we only observe optimal actions (according to $\pi_E$), so this alone would hardly generalize to suboptimal ones. Therefore, we propose a heuristic that consists in regressing the bonus for suboptimal actions towards $b_{\min}$, a hyperparameter of the algorithm. For example, it could be set to $\min_t \hat{b}_t$, the minimum being over transitions in the dataset. This gives the following loss, for a transition $(h_t, a_t, r_t, h_{t+1})$, and for $a$ being sampled randomly in $\mathcal{A} \setminus \{a_t\}$:

$$\mathcal{L}_b(\phi) = \left(b_\phi(h_t, a_t) - \hat{b}_t\right)^2 + \left(b_\phi(h_t, a) - b_{\min}\right)^2.$$


To sum up, we train a BC policy by minimizing the cross-entropy loss $\mathcal{L}_{\mathrm{BC}}$. The resulting logits are treated as optimal $Q$-values, which are in turn used to learn the bonus by minimizing the regression loss $\mathcal{L}_b$ (Figure 1).
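The cascade described above can be sketched numerically without any neural network, by taking the behavioral-cloning logits as given. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, logits are plain lists indexed by action, and the bootstrap term is set to zero at the end of an episode (a convention chosen here, not stated in the text).

```python
import math


def cross_entropy(logits, action):
    """Behavioral-cloning loss for one step: -log softmax(logits)[action]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[action]


def bonus_targets(transitions, gamma=0.99):
    """Turn BC logits into regression targets for the bonus network.

    Each transition is (logits_t, action_t, reward_t, logits_next), where
    logits_next is None at the end of an episode (assumed convention).
    The target is the Bellman residual: Q(h_t, a_t) - r_t - gamma * max_a Q(h_{t+1}, a).
    """
    targets = []
    for logits_t, a_t, r_t, logits_next in transitions:
        bootstrap = 0.0 if logits_next is None else max(logits_next)
        targets.append(logits_t[a_t] - r_t - gamma * bootstrap)
    return targets


def bonus_loss(pred_taken, pred_other, target, b_min):
    """Squared loss: fit the demonstrated action to its target and a
    randomly sampled other action to the floor value b_min."""
    return (pred_taken - target) ** 2 + (pred_other - b_min) ** 2


# Two made-up transitions with two actions each.
transitions = [
    ([2.0, 0.0], 0, -1.0, [1.0, 3.0]),
    ([1.0, 3.0], 1, 100.0, None),
]
targets = bonus_targets(transitions, gamma=0.5)
b_min = min(targets)  # one possible choice for the floor hyperparameter
print(targets, b_min)
```

In a full implementation, `logits_t` would come from the LSTM classifier and `pred_taken`, `pred_other` from the bonus network; both are replaced by plain numbers here to keep the arithmetic visible.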

4 Experiments

We aim at providing insights on what priors SmtW is able to extract from the demonstrations. Specifically, we wish to verify that SmtW is able to make use of memory to encourage a structured exploration of the environment. In order to thoroughly study the method, we test it on a grid-world where we are able to design specific behaviors. As in IRL, studying the return of an agent trained with our bonus is only a proxy to evaluate SmtW’s quality and is not informative on the priors the bonus conveys. We thus focus our experiments on analyzing the priors that were extracted from the demonstrations by the method. More specifically, we wish to answer the following questions: (1) Is SmtW encouraging the demonstrator’s behavior more than a random one? (2) Is SmtW capturing the demonstrator’s style, its way of exploring the environment? (3) Is SmtW capturing the skills required to solve the task? (4) Does SmtW encourage novelty seeking? (5) Does SmtW capture the constraints the demonstrator may be subject to? After answering these questions, we finally check that a simple agent can benefit from SmtW to solve a task efficiently.

Figure 2: KeysDoors(N=5).

The environment. We introduce an environment that will allow us to provide insights on these various questions. We require this environment to be procedurally generated in order to test SmtW’s ability to generalize to unseen environments. We want it to require combinatorial exploration to be solved, so that a demonstrator would naturally use a structured exploration. To achieve this, we introduce the KeysDoors grid-world of size $N \times N$. It contains keys and doors, modeled by two different colors. The agent has a third color. The goal is to find the correct key and to open the correct door with it. As doors (resp. keys) are indistinguishable, an explorer has to try the different keys on the different doors. The available actions are {go left, go right, go up, go down, take, open, wait}. When an agent takes the action “take” on a key, it is then able to move with it. The actions “open” and “take” make the agent lose the key it is holding. To solve the task, the agent has to go to the correct key, take it, go to the right door without taking the action “take” or “open” on the way, and then “open” the door. We need the environment to require perseverance, so we set the reward to $-1$ for any action other than wait, which is rewarded 0. Opening the correct door with the correct key gives a reward of 100 and terminates the episode. This requires perseverance, as a “lazy” policy would get a return of 0, while trying to find the 100 reward costs $-1$ at each step. It is a well-known issue in RL that simple exploration leads to such lazy solutions. The demonstrations are generated to introduce a visible bias in how the environment is explored. For a given instance of the environment, the demonstrator navigates between keys and doors and tries key/door pairs in a precise order. It takes the first key on the left and tries it on the first door on the left, then it tries the same key on the second door, and so on.
Once it has tried the first key on every door, it repeats the operation with the second key and proceeds further this way. The episode ends when the demonstrator finds the right key/door pair and obtains the reward. Then it “exploits”, taking the correct key and directly opening the correct door five consecutive times. Note that this also simulates the non-stationarity that occurs in most goal-directed task-solving processes: one first mainly explores and then exploits more and more.

Train vs. Test. The bonus is always used in new test environments, unseen in the demonstrations. SmtW’s ability to generalize to new environments is thus tested in all the following experiments. Given the possible positions of the keys, of the doors and then of the correct key and the correct door, there are possible instances of the environment.
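The demonstrator's ordering of key/door trials and the reward structure described above can be sketched as follows. This is only an illustration: the function names are hypothetical, keys and doors are assumed to be indexed from left to right, and navigation on the grid is not simulated.

```python
from itertools import product

STEP_REWARD = -1       # any action other than "wait"
WAIT_REWARD = 0        # the "wait" action
SUCCESS_REWARD = 100   # opening the correct door with the correct key


def demonstrator_trials(n_keys, n_doors):
    """Key/door pairs in the demonstrator's order: the first key is tried
    on every door from left to right, then the second key, and so on."""
    return list(product(range(n_keys), range(n_doors)))


def trials_until_success(n_keys, n_doors, correct_key, correct_door):
    """Number of key/door pairs tried before the correct one is found."""
    order = demonstrator_trials(n_keys, n_doors)
    return order.index((correct_key, correct_door)) + 1


def lazy_return(horizon):
    """Undiscounted return of a policy that only waits."""
    return WAIT_REWARD * horizon


def fruitless_exploration_return(n_steps):
    """Undiscounted return of moving around for n_steps without success."""
    return STEP_REWARD * n_steps


print(demonstrator_trials(2, 3))
print(trials_until_success(2, 3, correct_key=1, correct_door=0))
print(lazy_return(50), fruitless_exploration_return(50))
```

The last line shows why a lazy policy is a local optimum: until the 100 reward is found, every step of exploration looks strictly worse than waiting.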

4.1 Bonus analysis

We train the SmtW bonus on 200 KeysDoors(N=5) training environments with 10 demonstrations for each of them. The implementation choices are detailed in Appendix A.3. In order to study the priors extracted from the demonstrations, we study the distribution of the bonus given by SmtW along various trajectories following a given behavior. This allows us to study which behaviors are encouraged, which ones are discouraged, and whether two equally good behaviors are rewarded identically. We therefore plot the distribution of the bonus on various user-defined behaviors and average the results over 20 test environments, unseen during the training of SmtW. This way, we are able to show how different behaviors are rewarded by the different bonuses. We compare the bonus given by SmtW along these trajectories to the one that would be given by a count-based bonus [36] and by a random network distillation (RND) bonus [7]. This lets us verify that we do not just learn a proxy for a novelty-based bonus.

Is SmtW encouraging a structured exploration more than a random one? We compare in Figure 3 the distribution of the bonus received along random trajectories to the one obtained by the demonstrator’s behavior. Recall that SmtW has been trained on similar environments but is here tested on different ones. It thus has not been trained with the instances of the “demonstrator’s behavior” it is shown here. As shown in Figure 3, the demonstrator’s behavior (top) is more rewarded by SmtW than the random behavior (bottom). The count-based bonus also rewards the demonstrator’s behavior more than a random one, as a random behavior explores the environment very locally. Surprisingly, RND rewards the random behavior more than the demonstrator’s one. This might be explained by the fact that the demonstrator visits the same state several times in order to explore correctly. Indeed, the demonstrator has to go to the same key several times to take it and try it on the different doors.

Figure 3: Bonus distribution received by a demonstrator’s behavior (top) and a random behavior (bottom), averaged over 20 test environments. The dashed vertical line is the mean of the distribution.
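The kind of analysis used in this section, collecting the bonus received at each step of a trajectory and comparing behaviors, can be sketched with a count-based baseline. The sketch assumes a bonus of the form `beta / sqrt(n(s))`, a common choice for count-based exploration; the exact form and scale used in the experiments are not given here, and the two toy trajectories are made up for illustration.

```python
from collections import Counter
from statistics import mean


def count_based_bonuses(trajectory, beta=1.0):
    """Per-step bonus beta / sqrt(n(s)), where n(s) counts the visits to
    state s so far in the trajectory (current visit included)."""
    counts = Counter()
    bonuses = []
    for state in trajectory:
        counts[state] += 1
        bonuses.append(beta / counts[state] ** 0.5)
    return bonuses


# A behavior sweeping through new states versus one oscillating locally.
sweep = [0, 1, 2, 3, 4, 5]
local = [0, 1, 0, 1, 0, 1]

print(mean(count_based_bonuses(sweep)))
print(mean(count_based_bonuses(local)))
```

Because such a bonus depends only on visit counts, two behaviors that visit the same states the same number of times receive the same bonus distribution regardless of the order of the visits.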

Is SmtW capturing the demonstrator’s style, its way of exploring the environment? We show in Figure 4 the distribution of the bonus received along different behaviors. We compare the bonus obtained by the demonstrator’s behavior to the one obtained by a behavior that tries the key/door pairs in the reverse order (from right to left instead of left to right) and to the one obtained by a behavior that tries them in a random order. These three behaviors lead to the same outcome, but we hope to capture the demonstrator’s exploration bias and see whether the bonus encourages the behavior that tries the key/door pairs in the same order as the demonstrations.

Figure 4: Bonus distribution received by a demonstrator’s behavior (top), a behavior trying key/door pairs in inverse order (middle), and in random order (bottom).

As shown in Figure 4, the count-based bonus and RND reward the three behaviors similarly, as they lead to the same amount of novelty. Only the order in which the key/door pairs are tried changes. SmtW, on the contrary, encourages reproducing the demonstrator’s bias. It rewards more the behavior trying the key/door pairs in the same order as in the demonstrations.

Is SmtW capturing the priors useful to solve the task?

Figure 5 shows the distribution of the bonus received by the demonstrator’s behavior and compares it to the one received by a behavior losing the key on the way to the door (by taking action “open” before being on the door). As shown in Figure 5, the count-based bonus and RND reward these two behaviors equivalently, as they bring the same amount of novelty (both in terms of ground-truth state and of observations). SmtW does not reward the “dummy demonstrator” behavior as much as the expert one, and we can interpret the lower distribution mode (SmtW, bottom) as the bonus obtained after losing the key. We can argue that SmtW has somehow captured the prior that it is useful to navigate from the key to the door without losing the key.

Figure 5: Bonus distribution received by a demonstrator’s behavior (top) and by a “dummy demonstrator” behavior, acting (almost) like the demonstrator but releasing the key on the way to the door (bottom).

Does SmtW encourage long-term exploration? As the environment gives a reward of $-1$ for taking any action but the wait, an agent not exploring sufficiently would quickly converge to the policy only taking action wait to avoid negative rewards (verified in Figure 9). This same problem is visible in the Pitfall! game, where the best agents learn a policy obtaining 0 reward, whilst persevering humans get much higher scores. We show in Figure 6 the distribution of the bonus obtained by a behavior constantly taking the wait action and compare it to the bonus distribution obtained by the demonstrator’s behavior. As shown in Figure 6, SmtW negatively rewards a behavior that does not seek novelty. As expected, the count-based method gives a bonus very close to 0 for such a behavior. Perhaps surprisingly, RND also rewards this behavior negatively, but not with an average bonus lower than that of the demonstrator’s behavior. This might also be due to the bonus normalization that RND uses (zero-mean, unit-variance).

Figure 6: Bonus distribution received by a demonstrator’s behavior (top) and by a behavior not moving at all to avoid negative rewards (bottom).

Does SmtW capture the constraints the demonstrator may be subject to? A demonstrator can be subject to time or energy constraints. In the demonstrations, the demonstrator tries to explore the environment as fast as possible and does not take action wait on its way to keys and doors. We compare the bonus distribution obtained by the demonstrator’s behavior to the one obtained by the same behavior with a probability 0.1 of waiting at each step. The overall behavior is thus almost the demonstrator’s one, except that it sometimes takes action “wait”. As shown in Figure 7, RND and the count-based bonus reward these two behaviors equivalently. On the other hand, SmtW rewards the “waiting demonstrator” behavior less. We argue it has somehow captured the prior resulting from the resource constraint that leads the demonstrator to try the key/door pairs as fast as possible.

Figure 7: Bonus distribution received by a demonstrator’s behavior (top) and by a “waiting demonstrator” behavior, similar but taking action wait with probability 0.1 at each step (bottom).

What is more, a demonstrator might be subject to safety constraints. For example, it might be dangerous for a robot to try an action in an inappropriate place. The demonstrations minimize the number of times the action “take” is used and only use it when on keys. We can consider that the demonstrator’s behavior complies with safety constraints. We show in Figure 8 the bonus distribution obtained by the demonstrator’s behavior and compare it with the one obtained by a behavior trying the action take on each state on its way to the key. This behavior also solves the task but could be seen as unsafe if the action is not supposed to be taken elsewhere than on keys. As shown in Figure 8, the RND and the count-based bonuses reward these two behaviors equivalently. This is expected, as they bring the same amount of novelty. In contrast, SmtW rewards the “unsafe demonstrator” behavior less, capturing the safety prior the demonstrator has been subject to.

Figure 8: Bonus distribution received by a demonstrator’s behavior (top) and by an “unsafe demonstrator” behavior, navigating like the demonstrator but trying the action “take” on each state until it has the key (bottom).

Overall, we argue that SmtW is able to recover some important biases and constraints inherent to the demonstrations. Hand-crafting a reward expressing these motivations could be extremely complicated, and we demonstrated that SmtW is able to generalize these motivations to unseen environments.

4.2 Training an agent on the bonus

Figure 9: Median and min/max values of the return per episode (left) and of the total bonus per episode (right).

We now wish to check that an agent can benefit from SmtW. We thus train a $Q$-learning agent with SmtW and compare the results with those of a simple $\epsilon$-greedy ($\epsilon = 0.1$) exploration strategy and of a count-based bonus. The results are averaged over 10 newly generated environments, unseen during SmtW training. For each of these environments, the experiment is repeated twice. We present, for each algorithm, the best result after a hyper-parameter search, given explicitly in Appx. A.4. The bonus given by our method is computed to capture the exploratory behavior of the demonstrator. In order for the agent not to keep exploring forever, our bonus is here scaled down as training proceeds, by a factor that depends on $k$, the number of training steps. As Figure 9 shows, $Q$-learning with an $\epsilon$-greedy exploration strategy quickly gets stuck in “waiting” at each timestep. SmtW encourages the agent to visit its environment and solves the 10 new environments much faster than the count-based method, which pushes for exhaustive exploration.
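As a rough illustration of how a learned bonus can enter a $Q$-learning update, here is a minimal tabular sketch. It is not the agent used in the experiments: the function names, the tabular representation, the default hyperparameters and the caller-supplied decay `schedule` are all assumptions, and the divisor applied to the bonus is deliberately left as a parameter rather than guessed.

```python
def decayed_bonus(raw_bonus, k, schedule):
    """Scale the bonus down as training proceeds; `schedule` is a
    hypothetical caller-supplied function of the training step k."""
    return raw_bonus / schedule(k)


def q_learning_update(q, s, a, r, bonus, s_next, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning step on the bonus-augmented reward r + bonus.

    `q` is a dict mapping (state, action) to a value; missing entries are 0.
    """
    target = r + bonus
    if not done:
        target += gamma * max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (target - old)
    return q[(s, a)]


# One update on made-up numbers: a step costing -1 that earns a bonus of 0.5.
q_table = {}
actions = ["wait", "right"]
bonus = decayed_bonus(2.0, k=4, schedule=lambda k: k)  # example schedule only
value = q_learning_update(q_table, "s0", "right", -1.0, bonus, "s1", actions,
                          alpha=0.5, gamma=0.9)
print(bonus, value)
```

With a positive enough bonus, the augmented reward of a moving action can exceed the 0 reward of waiting, which is what lets the agent escape the lazy policy early in training.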

5 Related Work

Intrinsic Motivation. Intrinsic motivation is essential to mental development [26], and we can argue that it may, in consequence, be an essential component of computational learning. Oudeyer and Kaplan [25] argue that all humans respond to intrinsic motivations. Young infants’ motivations can be described as more chaotic, as they push children to bite, throw, grasp or shout in order to learn. Adults, in contrast, have more structured intrinsic motivations, activated, for instance, when they play games, read novels or watch movies. Correctly using these numerous intrinsic motivations can be key to training agents that solve more and more difficult tasks. Instead of modeling such intrinsic motivations to mimic cognitive processes, we learn them from demonstrations.

Exploration. In order to provide an exploration signal to the agent, [36] proposed the very intuitive count-based method to measure novelty. Counting how many times the agent has been in a given state, it rewards less-visited states. Several methods extended this idea to large state-space problems [24, 4, 39, 20], where it is not possible to count state occupancy. Intrinsic curiosity is also commonly computed as a prediction error, either trying to predict the environment’s dynamics [27, 30] or random statistics about the current state [7]. Other methods try to measure surprise as a prediction gain [33, 15]. Instead of designing such a bonus, we aim at learning one from demonstrations.

Learning from demonstrations. Imitation learning, the problem of learning from demonstrations, is typically divided into two different paradigms. (1) Behavioral cloning [28, 3, 31] tries to directly match the demonstrator’s behavior, generally using supervised learning techniques. (2) Inverse Reinforcement Learning [32, 23] first tries to recover a reward explaining the demonstrator’s behavior, before optimizing the reward for imitating the demonstrator.
Some methods output an explicit reward [18, 1, 23, 41], while adversarial imitation learning can be seen as IRL with implicit reward recovery [12, 11, 10]. Overall, these methods all assume the near-optimality of the demonstrations. Some works try to relax this assumption and to learn from sub-optimal demonstrations [16, 6]. IRL methods typically control the quality of their algorithm through the proxy of the return obtained by an agent trained on the inferred reward. Our method differs from these methods in that it does not assume that demonstrations are optimal but rather tries to answer the question: “In what way is the demonstrator’s behavior deviating from an optimal policy?”. Moreover, we do not seek to recover a reward as in IRL but rather to recover a bonus which, added to the environment reward, explains the demonstrator’s behavior. Facing the same problem, namely that the usual proxy to control the algorithm’s quality (training an agent on the inferred bonus) is not informative, we decided to study our method through its response to various behaviors.

6 Conclusion

In this work, we present a novel method for extracting an intrinsic bonus from demonstrations. The method we introduce is offline and does not require environment interactions to recover the bonus, unlike recent adversarial imitation methods, which need numerous interactions in order to recover a reward function. In any case, those methods could not be readily applied to our problem, as they do not explicitly compute a reward function. Moreover, to the best of our knowledge, this is one of the very first methods to recover some kind of reward that is history-dependent. We show how this bonus generalizes to unseen environments and is able to convey long-term priors. We exemplified the approach on a simple yet didactic and challenging example. Testing the method on a larger-scale environment would, however, require human exploratory demonstrations. Gathering such a dataset is costly, and very few are already available. Even though the given example is simple, this novel approach of capturing the demonstrator’s bias could potentially lead to new lines of work in RL. For instance, one could use our method to implement behavioral style-transfer in RL and show an agent a specific way to solve the task thanks to demonstrations. Combining a reward and biases extracted from demonstrations may also help for robotic tasks, where some aspects of the task are easily programmable with a reward but some expectations on how to solve the task may be easier to transmit thanks to demonstrations. This could also lead to some advances in tackling misspecified rewards. Using both a reward, which would contain information on the task to solve but not fully describe the constraints of the problem, and demonstrations to correct the reward can be key to training sequential controllers in complex dynamics.


References

  • Abbeel and Ng [2004] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning, 2004.
  • Andrychowicz et al. [2020] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 2020.
  • Bagnell et al. [2007] J. Bagnell, J. Chestnutt, D. M. Bradley, and N. D. Ratliff. Boosting structured prediction for imitation learning. In Advances in Neural Information Processing Systems, 2007.
  • Bellemare et al. [2016] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in neural information processing systems, pages 1471–1479, 2016.
  • Berseth et al. [2019] G. Berseth, D. Geng, C. Devin, C. Finn, D. Jayaraman, and S. Levine. Smirl: Surprise minimizing rl in dynamic environments. arXiv preprint arXiv:1912.05510, 2019.
  • Brown et al. [2019] D. Brown, W. Goo, P. Nagarajan, and S. Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, 2019.
  • Burda et al. [2018] Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. International Conference on Learning Representations (ICLR), 2018.
  • Byrne and Whiten [1989] R. W. Byrne and A. Whiten. Machiavellian intelligence: social expertise and the evolution of intellect in monkeys, apes, and humans. Clarendon Press, 1989.
  • Dubey et al. [2018] R. Dubey, P. Agrawal, D. Pathak, T. L. Griffiths, and A. A. Efros. Investigating human priors for playing video games. arXiv preprint arXiv:1802.10217, 2018.
  • Finn et al. [2016] C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, 2016.
  • Fu et al. [2018] J. Fu, K. Luo, and S. Levine. Learning robust rewards with adversarial inverse reinforcement learning. International Conference on Learning Representations, 2018.
  • Ho and Ermon [2016] J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, 2016.
  • Hochreiter and Schmidhuber [1997] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Holloway and Valentine [2004] S. L. Holloway and G. Valentine. Children’s geographies: Playing, living, learning. Routledge, 2004.
  • Houthooft et al. [2016] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. Variational information maximizing exploration. Advances in Neural Information Processing Systems (NIPS), 2016.
  • Jacq et al. [2019] A. Jacq, M. Geist, A. Paiva, and O. Pietquin. Learning from a learner. In International Conference on Machine Learning, 2019.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Klein et al. [2013] E. Klein, B. Piot, M. Geist, and O. Pietquin. A cascaded supervised learning approach to inverse reinforcement learning. In Joint European conference on machine learning and knowledge discovery in databases, pages 1–16. Springer, 2013.
  • Lang et al. [2000] P. J. Lang, M. Davis, and A. Öhman. Fear and anxiety: animal models and human cognitive psychophysiology. Journal of affective disorders, 61(3):137–159, 2000.
  • Machado et al. [2018] M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
  • Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
  • Mohamed and Rezende [2015] S. Mohamed and D. J. Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 2125–2133, 2015.
  • Ng et al. [2000] A. Y. Ng, S. J. Russell, et al. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, 2000.
  • Ostrovski et al. [2017] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2721–2730. JMLR.org, 2017.
  • Oudeyer and Kaplan [2009] P.-Y. Oudeyer and F. Kaplan. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:6, 2009.
  • Oudeyer et al. [2007] P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner. Intrinsic motivation systems for autonomous mental development. IEEE transactions on evolutionary computation, 11(2):265–286, 2007.
  • Pathak et al. [2017] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
  • Pomerleau [1991] D. A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural computation, 1991.
  • Puterman [2014] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Raileanu and Rocktäschel [2020] R. Raileanu and T. Rocktäschel. Ride: Rewarding impact-driven exploration for procedurally-generated environments. arXiv preprint arXiv:2002.12292, 2020.
  • Ross and Bagnell [2010] S. Ross and D. Bagnell. Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics, 2010.
  • Russell [1998] S. Russell. Learning agents for uncertain environments. In Conference on Computational learning theory, 1998.
  • Schmidhuber [1991] J. Schmidhuber. Curious model-building control systems. In Proc. international joint conference on neural networks, pages 1458–1463, 1991.
  • Silver et al. [2016] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.
  • Şimşek and Barto [2006] Ö. Şimşek and A. G. Barto. An intrinsic reward mechanism for efficient exploration. In Proceedings of the 23rd international conference on Machine learning, pages 833–840, 2006.
  • Strehl and Littman [2008] A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • Strehl et al. [2009] A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite mdps: Pac analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.
  • Sutton and Barto [2018] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Tang et al. [2017] H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. DeTurck, and P. Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pages 2753–2762, 2017.
  • Tesauro [1995] G. Tesauro. Temporal difference learning and td-gammon. Communications of the ACM, 1995.
  • Ziebart et al. [2008] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2008.

Appendix A Implementation Details

We detail the implementation choices below. The experiments were run on a P100 GPU.

A.1 The environment

The KeysDoors environment is generated procedurally. For each column, the locations of a door and a key are sampled uniformly without replacement. Thus, there is exactly one key and one door in each column, and they cannot be at the same location. The "correct" key is then sampled uniformly among the keys and the "correct" door is sampled uniformly among the doors. The initial position of the agent is sampled uniformly on the grid. The environment provides both a ground-truth state (an integer representing the current state), used only by the tabular Q-learning agent, and an RGB observation (as shown in Fig. 2), used by SmtW. Figure 10 shows a trajectory in one possible instance of the KeysDoors environment with N=5. Every observation is normalized between 0 and 1 by dividing by 255.
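The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout, the `(column, row)` coordinate convention and the `n_rows` parameter are all assumptions, since the text does not specify the grid height or the data structures used.

```python
import random

def generate_keysdoors_layout(n_cols, n_rows, rng):
    """Sample one layout (hypothetical helper; grid height is an assumption)."""
    keys, doors = [], []
    for col in range(n_cols):
        # Two distinct rows per column, so the key and the door never overlap.
        key_row, door_row = rng.sample(range(n_rows), 2)
        keys.append((col, key_row))
        doors.append((col, door_row))
    return {
        "keys": keys,
        "doors": doors,
        # The "correct" key and door are drawn independently and uniformly.
        "correct_key": rng.randrange(n_cols),
        "correct_door": rng.randrange(n_cols),
        # The agent starts at a uniformly sampled grid cell.
        "agent_start": (rng.randrange(n_cols), rng.randrange(n_rows)),
    }

def normalize_observation(pixels):
    """Scale raw 8-bit pixel values to [0, 1] by dividing by 255."""
    return [p / 255.0 for p in pixels]

layout = generate_keysdoors_layout(n_cols=5, n_rows=5, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the generated layouts reproducible, which is convenient when a fixed set of test environments is needed.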

Figure 10: A trajectory of length 9 in an instance of the KeysDoors(N=5) environment.

A.2 The behaviors

Different scripted behaviors are presented in Sec. 4.1; more details are given here.

  • The demonstrator tries the first key on the left on every door, from left to right, and repeats the operation for the second key, the third key, and so on. It thus makes use of its memory to remember which pairs have already been tried. The demonstrations are not deterministic: if going from a key to a door requires, for instance, 3 steps "up" and 4 steps "right", the demonstrator may execute them in any order.

  • The random behavior takes random actions. Trajectories are limited to 1000 steps.

  • The demonstrator inverse behavior is similar to the demonstrator as it navigates to a key, takes it, navigates to a door and opens it. However, the key/door pairs are tried in the reverse order to the demonstrations.

  • The demonstrator random behavior is also similar but tries the key/door pairs in a random order.

  • The dummy demonstrator behavior navigates exactly like the demonstrator but drops the key at a random point on the way to the door (uniformly sampled on the path to the door) by taking the open action. Trajectories are limited to 1000 steps.

  • The standing still behavior remains in its original position by only taking the wait action.

  • The waiting demonstrator behavior acts like the demonstrator but has a probability 0.5 of waiting at each step.

  • The unsafe demonstrator acts like the demonstrator but takes the take action each time it moves, until it has a key.

A trajectory of an agent moving to a key, taking it, moving to a door and trying to open it with the key is shown in Fig. 10.
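As an illustration of how the demonstrator variants differ, the sketch below enumerates the key/door pairs in the order each behavior would try them and shuffles the elementary moves of a path. The names `pair_order` and `shuffled_moves` are hypothetical, and treating the inverse behavior as the exact reversal of the demonstrator's order is an assumption about what "reverse order" means.

```python
import random
from itertools import product

def pair_order(n, behavior, rng=None):
    """Order in which (key, door) index pairs are tried (illustrative only)."""
    # Demonstrator: first key on every door from left to right, then second key, ...
    pairs = list(product(range(n), repeat=2))
    if behavior == "demonstrator":
        return pairs
    if behavior == "inverse":
        # Assumption: the inverse behavior tries the same pairs, last one first.
        return pairs[::-1]
    if behavior == "random":
        rng = rng or random.Random()
        rng.shuffle(pairs)
        return pairs
    raise ValueError(f"unknown behavior: {behavior}")

def shuffled_moves(n_up, n_right, rng):
    """Elementary moves of a path, executed in a random order."""
    moves = ["up"] * n_up + ["right"] * n_right
    rng.shuffle(moves)
    return moves

print(pair_order(2, "demonstrator"))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Whatever order `shuffled_moves` returns, the path always contains the same number of "up" and "right" steps, which is why the demonstrations can be non-deterministic while still reaching the same door.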

A.3 Architectures

Our method works directly with visual inputs, as shown in Fig. 10. The network used for the behavioral cloning policy has the following architecture: an LSTM with 64 units, a fully-connected layer with 512 units and ReLU activation, and an output layer with as many units as there are actions available in the environment (7 for KeysDoors). It is trained with the Adam optimizer [17] with a learning rate of and a batch size of . It uses the visual input from the environment and not the ground-truth state. The network used for the regression of the bonus has the same architecture but an output layer with a single unit. It is trained with the Adam optimizer, a learning rate of and a batch size of . The discount factor used in SmtW is set to .
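To make the layer sizes concrete, the following sketch counts the trainable parameters of the two networks. It is only a sanity-check aid under stated assumptions: `input_dim` (the size of the feature vector fed to the LSTM at each step) is hypothetical, because the text does not say how the image is presented to the LSTM, and the LSTM count assumes the standard formulation with one bias vector per gate (some frameworks use two).

```python
def lstm_params(input_dim, units):
    """Standard LSTM: 4 gates, each with input weights, recurrent weights, one bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(in_dim, out_dim):
    """Fully-connected layer: weight matrix plus bias vector."""
    return in_dim * out_dim + out_dim

def network_params(input_dim, n_outputs, lstm_units=64, hidden_units=512):
    """LSTM(64) -> Dense(512, ReLU) -> Dense(n_outputs), as described in the text."""
    return (
        lstm_params(input_dim, lstm_units)
        + dense_params(lstm_units, hidden_units)
        + dense_params(hidden_units, n_outputs)
    )

input_dim = 10  # hypothetical per-step feature size, not taken from the paper
policy_size = network_params(input_dim, n_outputs=7)  # behavioral cloning policy
bonus_size = network_params(input_dim, n_outputs=1)   # bonus regression network
print(policy_size - bonus_size)  # 3078, i.e. 6 extra output units of 513 parameters
```

The two networks differ only in their output layer, so the difference in size does not depend on the assumed `input_dim`.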

A.4 Hyperparameter sweep

For the experiment shown in Figure 9, the tabular Q-learning agent is trained twice on the 10 test environments, and the figure shows the median and the min/max values. For each of the compared algorithms, we sweep the agent learning rate over the following values: . Only the result for the learning rate with the highest median return over the runs is shown for each algorithm. The ε-greedy strategy is used for all methods with . Even though the agent is tabular, we recall that SmtW itself does not access the ground-truth state of the environment: it works from observations. The count-based bonus, on the contrary, counts ground-truth states. We used the discount factor .
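The selection rule described above (keep, for each algorithm, the learning rate with the highest median return, then report median and min/max) can be sketched as follows. The function name and the numbers in the example are purely illustrative; they are not the learning rates or returns from the paper.

```python
from statistics import median

def best_learning_rate(returns_by_lr):
    """Pick the learning rate with the highest median return over runs.

    returns_by_lr maps each swept learning rate to the list of returns
    obtained over the runs. Returns (lr, median, min, max) for the winner.
    """
    best_lr = max(returns_by_lr, key=lambda lr: median(returns_by_lr[lr]))
    runs = returns_by_lr[best_lr]
    return best_lr, median(runs), min(runs), max(runs)

# Made-up returns for three hypothetical learning rates.
example = {0.1: [1, 2, 3], 0.5: [0, 5, 6], 0.9: [4, 4, 4]}
print(best_learning_rate(example))  # (0.5, 5, 0, 6)
```

Selecting on the median rather than the mean makes the choice less sensitive to a single unusually good or bad run, which matters when only a few runs are available per setting.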