Improving Robot Dual-System Motor Learning with Intrinsically Motivated Meta-Control and Latent-Space Experience Imagination

04/19/2020 ∙ by Muhammad Burhan Hafez, et al. ∙ University of Hamburg

Combining model-based and model-free learning systems has been shown to improve the sample efficiency of learning to perform complex robotic tasks. However, dual-system approaches fail to consider the reliability of the learned model when it is applied to make multiple-step predictions, resulting in a compounding of prediction errors and performance degradation. In this paper, we present a novel dual-system motor learning approach where a meta-controller arbitrates online between model-based and model-free decisions based on an estimate of the local reliability of the learned model. The reliability estimate is used in computing an intrinsic feedback signal, encouraging actions that lead to data that improves the model. Our approach also integrates arbitration with imagination where a learned latent-space model generates imagined experiences, based on its local reliability, to be used as additional training data. We evaluate our approach against baseline and state-of-the-art methods on learning vision-based robotic grasping in simulation and the real world. The results show that our approach outperforms the compared methods and learns near-optimal grasping policies in dense- and sparse-reward environments.




1 Introduction

Reinforcement learning (RL) enables artificial agents to learn how to perform sequential decision-making tasks from experience, without manually programming the desired behavior. This involves online learning of a control policy – a mapping from a raw and often high-dimensional sensory input to a raw motor output that optimizes the task performance. In recent years, deep RL has been used to learn this mapping from self-collected experience data by utilizing deep neural networks as function approximators, achieving superhuman performance in a variety of game domains [37, 49] and facilitating the acquisition of complex robotic manipulation skills [32, 16].

The current success of deep RL, particularly in robotic control, has come at the expense of notoriously high sample complexity that fundamentally limits how quickly a robot can learn successful control policies. A wide range of approaches have been proposed to alleviate this problem. Some focused on improving experience replay, in which data are drawn from a memory of recent experiences and used for training the controller. Sampling experiences with a probability proportional to reward prediction error instead of random sampling [48] and counting unsuccessful policy rollouts as successful ones by replaying experiences with a different goal than the one the agent was trying to achieve [3] are two prominent examples. Others proposed extending count-based exploration methods, previously only applicable to tabular representations, to problems with continuous high-dimensional state spaces [13, 51]. Learning a separate exploration policy was also found to increase the sample efficiency of learning the target policy by updating the exploration policy based on the amount of improvement in the target policy as a result of the experience data collected with the exploration policy [53]. Similarly, the learning speed on novel tasks was found to be improved by using a task-independent exploration policy updated between learning trials of different tasks [14]. In a different class of approaches, intrinsic rewards were used to efficiently guide the exploration in environments where extrinsic rewards are sparse or absent. Proposed intrinsic reward functions include novelty estimation of perceived states [41, 5], learning progress in predicting future states [15], competence progress in achieving self-generated goals [42, 36], and information-theoretic measures of uncertainty [38, 23].

Despite the efficiency gained by these approaches, learning the desired control behavior was predominantly model-free. While a few approaches (e.g., [41, 15]) learn transition models, the models were only used for computing auxiliary feedback signals to improve exploration. The policy function itself was learned with model-free RL. This places a limit on the achievable degree of sample efficiency and is inconsistent with a large body of behavioral and neural evidence showing that model-free and model-based learning systems both have an active role in human motor learning [21, 31].

In the following, we will discuss in more detail dual-system motor learning and how it is applied in robotics. We will then introduce a new approach that integrates the two learning systems in an adaptive, reliable and sample-efficient manner and perform grasp-learning experiments where we implement and evaluate our approach on a humanoid robot.

1.1 Dual-System Motor Learning

Motor behavior can be divided into habitual behavior obtained by model-free learning and goal-directed behavior obtained by model-based learning [8]. Several hypotheses were proposed to explain how the human brain arbitrates between model-based and model-free learning systems. For instance, Cushman and Morris [7] argue that when performing a sequential decision-making task, humans use the model-free system to habitually select goals and then the model-based system to generate a plan to achieve a selected goal. Another study proposes a contrasting hypothesis called “plan-until-habit”, in which planning is first performed by simulating the world up to a certain depth that decreases with increased time pressure and then model-free action values are exploited [27]. While this study attributes the change in behavior between model-based and model-free control to the availability of cognitive resources, particularly time, other studies have found that the behavior instead changes according to the expected reward regardless of resource availability [30, 4]. Kool et al. [30] support the latter by providing behavioral evidence that people with a perfect transition model of the task and an extended response deadline exerted less model-based control when its expected reward advantage was lower than the cognitive cost involved. This finding was interpreted by suggesting that the brain estimates the value of using each control system but reduces that of the model-based system in proportion to its increased cognitive cost. Similarly, Boureau et al. [4] state that meta-decisions including arbitration between model-based and model-free control are governed by a cost-benefit trade-off in which the brain constantly generates rough estimates of the costs and benefits of allocating cognitive resources for model-based control. 
The average reward rate, which is the reward expected for temporally allocating a particular resource, and the controllability, which measures how advantageous a carefully considered decision is over a fast habitual one in terms of rewards collected, are proposed as estimates for the opportunity cost and benefit respectively. The willingness to exert model-based control thus increases proportionally to how much larger the reward obtained by controllability is compared to the average reward rate. The authors note, however, that modifying these estimates by including other decision variables like the uncertainty about action outcomes might account for meta-decisions in specific behavioral contexts like exploration.

Haith and Krakauer [21] review the behavioral evidence for the existence of each of the model-based and model-free mechanisms of motor learning in humans and argue that both are employed in parallel by the motor system for movement control. They point out that while the two learning systems generate their own estimate of the value of a given action at a given state, the decision which action to take is made primarily based on the reliability of each of these two estimates. Imperfect predictions of an internal forward model limit the reliability of model-based learning and, hence, in the later stages of learning, after extensive experience, model-free learning becomes more reliable, as the authors indicate. Another study provides neural evidence that the human brain encodes the reliability of model-based and model-free learning systems based on their prediction errors in the lateral prefrontal and frontopolar cortex and uses the reliability signals to dynamically arbitrate behavioral control between the two systems [31]. The arbitration model the study proposes combines model-based and model-free value signals, weighted by the degree of reliability of each system, and uses this integrated value signal for guiding behavior. To account for the cognitive complexity involved, the arbitrator incorporates a bias toward the less cognitively demanding model-free control. The arbitration between different learning systems was also found to drive human strategy selection where the goal is to learn when to use which strategy [33]. The proposed context-sensitive strategy selection approach, which assumes a mental model predicting each strategy’s accuracy and execution time from features of the current situation, was found to better explain how people adaptively choose strategies than previous accounts. However, it is based on choosing the strategy with the best predicted speed-accuracy trade-off rather than choosing the most reliable strategy.

In contrast to previous works suggesting strict neural and behavioral division between model-based and model-free learning systems, Russek et al. [47] propose a computational framework where the two systems are tightly coupled, motivated by evidence supporting a role for dopamine in model-based learning besides its well-established role in model-free learning. In the framework, action values are estimated by applying model-free temporal difference (TD) learning to successor representations (SR), which are the expected future state occupancies. This was found to explain the involvement of dopamine in model-based learning, since the TD error is a reward prediction error thought to be mediated by phasic dopamine and SR is a predictive representation capturing knowledge of the transition model. Although the presented framework gives a neurally plausible computational account of the interaction between the two learning systems, it does not answer the question of how the brain prioritizes computations and arbitrates control between learning systems, as concluded by the authors.

Dual-System Deep RL for Robot Control.

Deep RL approaches are broadly classified into model-based and model-free ones. Model-based approaches facilitate transfer of learning across tasks, since a model learned in the context of a task can be directly used to compute an appropriate control policy for a new task. They are also typically sample-efficient in that they allow for generating synthetic experiences by making predictions about the future. On the other hand, model-free approaches do not have representational limitations that would prevent the convergence to a desired behavior if the model representation is insufficient to perfectly capture the environment dynamics. However, they require a lot of experience and hence have high sample complexity. This has motivated several works to address the problem of how to combine the benefits of model-based and model-free methods.

Initializing the neural network policy of a model-free learner using rollouts of a model-based controller followed by model-free fine-tuning of the policy was found to lead to a higher sample efficiency compared to pure model-free learning with random policy initialization [40]. The model-based controller used is based on random sampling, where several randomly generated action sequences are fed to the model and then the sequence with the highest expected reward is chosen. This limits the effectiveness of the approach to low-dimensional action spaces and short horizons. In a different work, Feinberg et al. [11] decompose action-value estimation into two parts: one contains the sum of future rewards predicted by a learned model over a limited horizon and one contains the cached model-free estimate of the long-term reward computed at the end of the horizon. While the method is shown to boost the sample efficiency, it assumes perfect model predictions for a fixed horizon, which is a strong assumption, because, in practice, the model generates noisy data early in learning, and a measure of model reliability is therefore needed.

Other works used information about the future provided by a trained world model as input to the model-free learner to improve its decisions [45, 18]. In [45], imagined trajectories generated by a model are processed by a recurrent neural network that outputs a rollout encoding for each trajectory. The encodings are concatenated and used as additional context for the model-free learner’s value and policy networks. Rather than training a feedforward model, [18] train a recurrent world model on random environment rollouts and use the hidden state of the trained model along with a learned abstract state representation as input to a model-free controller. The proposed approach achieves state-of-the-art performance on an image-based car racing task. However, these works employ pretrained world models and abstract representations with the risk of encoding task-irrelevant features. Instead, Francois-Lavet et al. [12] propose training the world model and an abstract state representation that minimizes both model-free and model-based losses during task learning. The abstract state is the input to both the model-free Q-network predicting action values and the world model predicting next states and rewards. Planning is done by performing one fixed-depth rollout of the model for each possible action at the current state and then taking the first action of the rollout with the highest overall estimated value. The approach has two major drawbacks: first, the complexity of planning, which is performed at each time step, grows exponentially with the number of possible actions; second, if the model is inaccurate, as is the case in complex domains, a large fixed planning depth leads to a compounding of prediction errors that eventually impair task performance.

Another line of research focused on modeling an arbitrator that explicitly switches control between model-based and model-free systems at decision time [10, 19]. One study proposes a control architecture where the arbitrator chooses between an action suggested by an inverse dynamics model and an action suggested by an actor network of a model-free actor-critic system [10]. The arbitration is guided by reward prediction error. If the error at the previous time step is below a predefined threshold, the actor network’s predicted action is performed. Otherwise, the inverse model’s predicted action is performed. The approach does not address model imperfection and is evaluated on a robotic reaching task with a very low-dimensional state space. In contrast, an estimate of the reliability of a learned model’s predictions was recently used to guide the arbitrator’s decision of which of the two systems to query for an action during robot grasp learning [19]. The reliability is measured by the model learning progress, which is the time derivative of the average prediction error of the model. If the reliability is positive, the arbitrator queries the model-based system for an action, which performs gradient-based model predictive control, and if not, the arbitrator queries the model-free system instead. The approach, however, relies only on the temporal information when computing the learning progress and ignores the spatial context. It also uses a fixed planning horizon.

1.2 Experience Imagination in Dual-System Motor Learning

One complex cognitive process that dual-system motor learning implements is experience imagination, which uses the capacity of the world model to make predictions about future states and rewards. Imagination typically refers to the mental simulation of motor behavior. It requires a combination of different cognitive functions, including episodic memory, abstract sensory and motor representations, and manipulation of representations [39]. This cognitive synergy induced by imagination was distinguished neurally by identifying the different brain regions activated during imagination, including both cognitive and motor areas [6, 44], and is strongly associated with cognitive development in children, where complex behaviors develop from the recombination of simple ones. Experience imagination is also essential to mental practice and cognitive rehearsal of physical skills, facilitating skill acquisition [9]. Moreover, it is estimated that automating imagination has the potential to advance deep learning beyond finding correlations in data, as well as to broaden the focus of research from problem solving to problem creation through the imagination-supported ability to self-generate goals and pursue them with intrinsic motivation [35, 22].

Based on how experience imagination is implemented, dual-system deep RL approaches can be divided into two groups: (i) online imagination [40, 11, 45, 18, 12, 10, 19] and (ii) offline imagination [17, 25, 20]. In online imagination approaches, generating imagined trajectories for planning with the world model is done at decision time, as discussed in Section 1.1. Offline imagination approaches, on the other hand, augment the memory of real experiences with model-generated imagined experiences, increasing the amount of training data available to the RL agent when updating the control policy offline with experience replay. Gu et al. [17] apply offline imagination by generating on-policy imagined rollouts under the model, starting at states sampled from transitions the model has recently been trained on, and adding them to the replay memory. While this results in a fast convergence to an optimal policy with model-free RL and shows some robustness to imperfectly learned models, the linear model used is insufficient to perfectly capture complex environment dynamics and generate correct imagined rollouts in tasks involving learning from high-dimensional observations, as indicated by the authors. Instead of always using imaginary data in training, a different work suggests updating the policy and value networks from imagined rollouts only when there is high uncertainty in the estimated action values [25]. The approach is shown to achieve high sample efficiency on continuous control tasks but does not consider model prediction errors. In visuomotor control tasks, generating imagined data requires learning perfect world models at the pixel level, which is impossible in practice. This has recently been addressed by learning the model in latent space and dividing the experience replay buffer into pixel-space and latent-space buffers for storing real and imagined experiences respectively [20].
In the approach, the learned latent space is self-organized into local regions with local world models, and a running average of the model prediction error is independently computed for each region. Unlike previous works, the imagined rollout is reliably generated with a probability inversely proportional to the average error of the current region, and the imagination depth is adaptively determined by the average error of the traversed regions.

1.3 Issues with Model-based Planning and Proposed Changes

As discussed in Section 1.1, recent works on dual-system deep RL have demonstrated the potential of using model predictive control (MPC) for planning [40, 19, 12]. MPC is an iterative optimization-based control method that collects a multi-step rollout from an initial state given a dynamics model, infers an optimal action plan, performs the first action of the plan, and then repeats the process in a receding horizon. The time and space complexity of planning with MPC is very high when performing backpropagation through time to optimize the MPC planning objective, particularly over long planning horizons. Amos et al. [2] attempt to address this issue by implicit differentiation of the Karush-Kuhn-Tucker (KKT) optimality conditions at a fixed point of the employed convex optimization solver. In the approach we present here, a different solution is proposed by arguing that the length of the planning horizon should always be adaptively determined according to the current reliability of the learned model of environment dynamics. Another issue is the inevitable model errors that quickly compound during multi-step planning. Forcing latent variables to predict the long-term future using an auxiliary cost during model training was found to make planning in the latent space involve fewer prediction errors [26]. In our approach described in Section 2, we instead focus on developing a directed exploration strategy that gradually improves the model accuracy.

In light of the previous research, we make the following contributions:

  • First, we present a novel dual-system motor learning approach where an intrinsically motivated meta-controller arbitrates online between model-based and model-free decisions based on the local reliability of a learned world model.

  • Second, we describe a new learning framework that integrates online meta-control with offline learning-adaptive experience imagination.

  • Finally, we show that our proposed framework improves the sample efficiency of learning vision-based motor skills on a developmental humanoid robot, compared to baseline and recent dual-system methods.

2 Intrinsically Motivated Meta-Control

We describe here our approach to dual-system motor learning. The approach consists of model-free and model-based control systems and a meta-controller deciding which of the two systems to query for an action at each time step. We first present the two systems and then discuss how the local reliability in model predictions is used to adaptively guide meta-decisions and provide intrinsic feedback to improve the learned model. Our objective is to train a policy neural network representing the desired control behavior more efficiently than when following a pure model-based or model-free approach.

2.1 Model-free Control System

To train a model-free controller from experience, we consider a standard model-free reinforcement learning (RL) problem where the goal is to learn a policy $\pi$, mapping from states $s \in S$ to probability distributions over actions $a \in A$, that maximizes the expected return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ under $\pi$, where $r$ is the reward function and $\gamma \in [0, 1)$ is the discount factor. In RL, actor-critic methods are well suited for continuous control by learning simultaneously a value function and a policy function. We are particularly interested in off-policy actor-critic methods, since they allow for learning from actions coming from different systems, such as a model-based controller. Deep Deterministic Policy Gradient (DDPG) [34] is a state-of-the-art off-policy actor-critic method that we use in our approach along with an off-policy variant of Continuous Actor-Critic Learning Automaton (CACLA) [52].

The action-value ($Q$-)function is defined as the expected return of taking a particular action at a particular state and following a policy $\pi$ thereafter: $Q^\pi(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$. Accordingly, the optimal policy $\pi^*$ satisfies $\pi^*(s) = \arg\max_a Q^{\pi^*}(s, a)$. In infinite or large state-action spaces and when the transition model is unknown, actor-critic methods approximate the $Q$-function and the policy function using critic $Q(s, a \mid \theta^Q)$ and actor $\pi(s \mid \theta^\pi)$ neural networks parametrized by $\theta^Q$ and $\theta^\pi$ respectively. The critic is trained to minimize the loss between the current value estimate $Q(s_t, a_t \mid \theta^Q)$ and the target value $y_t = r_t + \gamma Q'(s_{t+1}, \pi'(s_{t+1} \mid \theta^{\pi'}) \mid \theta^{Q'})$, where $Q'$ and $\pi'$ are the target critic and actor networks parameterized by $\theta^{Q'}$ and $\theta^{\pi'}$ respectively and updated slowly towards their corresponding $Q$ and $\pi$ networks:

$$L(\theta^Q) = \mathbb{E}\big[(y_t - Q(s_t, a_t \mid \theta^Q))^2\big] \quad (1)$$
The actor, however, is trained differently by each method. DDPG updates the actor’s parameters $\theta^\pi$ by minibatch gradient ascent on the $Q$-function:

$$\nabla_{\theta^\pi} J \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_i,\, a=\pi(s_i \mid \theta^\pi)}\, \nabla_{\theta^\pi} \pi(s \mid \theta^\pi)\big|_{s=s_i} \quad (2)$$
where $N$ is the minibatch size, whereas off-policy CACLA updates the actor only when the advantage of taking the current action is positive by minimizing the loss:

$$L(\theta^\pi) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[A_i > 0]\, \big(a_i - \pi(s_i \mid \theta^\pi)\big)^2 \quad (3)$$

where $A_t = y_t - Q(s_t, \pi(s_t \mid \theta^\pi) \mid \theta^Q)$ is the action advantage at time step $t$, representing how much better the observed value of taking $a_t$ is than the expected on-policy value $Q(s_t, \pi(s_t \mid \theta^\pi) \mid \theta^Q)$. This moves the actor’s output towards an action $a_t$ that has a positive advantage.
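As a rough sketch of these two update rules, the snippet below uses a linear stand-in critic, scalar actions, and an arbitrary learning rate (all illustrative assumptions, not the networks or hyperparameters used in this paper):

```python
import numpy as np

gamma = 0.99  # discount factor (assumed value)

def critic(s, a, w):
    # Stand-in linear critic Q(s, a | theta_Q); the paper uses a
    # convolutional critic-autoencoder instead.
    return w[0] * s + w[1] * a

def target_value(r, s_next, a_next, w_target):
    # Bootstrapped target y_t = r_t + gamma * Q'(s_{t+1}, pi'(s_{t+1}))
    return r + gamma * critic(s_next, a_next, w_target)

def cacla_actor_update(actor_out, executed_a, advantage, lr=0.1):
    # Off-policy CACLA: move the actor's output toward the executed
    # action only when that action's advantage is positive.
    if advantage > 0:
        actor_out = actor_out + lr * (executed_a - actor_out)
    return actor_out

w = np.array([0.5, 0.2])
y = target_value(r=1.0, s_next=2.0, a_next=0.0, w_target=w)
a_new = cacla_actor_update(actor_out=0.0, executed_a=1.0, advantage=0.5)
```

Note that the critic update (minimizing the squared distance to the target value) is shared by both methods; only the actor update differs.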

Figure 1: Model-free control system: (a) Critic-autoencoder network consisting of a fully convolutional encoder that takes in a raw image $o_t$, a fully convolutional decoder that computes a reconstruction $\hat{o}_t$, and a critic that estimates the $Q$-value given the latent state $z_t$ and the action $a_t$; (b) Actor network taking in the latent state representation $z_t$, which is jointly trained to minimize the reconstruction and value prediction losses, and generating a control action with a dimensionality of $\dim(A)$, where $A$ is the action space.

Our actor-critic architecture is shown in Figure 1. In the architecture, a latent state representation $z_t$, the output of the encoder, is learned to be a state discriminator and value predictor by jointly optimizing the combined reconstruction and value prediction loss:

$$L_{joint} = \beta_1 L_{rec} + \beta_2 L(\theta^Q) \quad (4)$$

where $L_{rec}$ is the reconstruction loss between the decoder’s output $\hat{o}_t$ and the original input $o_t$, $L(\theta^Q)$ is the value prediction loss (Equation 1), and $\beta_1$ and $\beta_2$ are weighting coefficients of the individual loss components. Since it captures task-relevant information sufficient to reconstruct the original input and identify rewarding states, this jointly learned latent representation can then be used as a direct input to an actor network, as shown in Figure 1(b). Any off-policy actor-critic method can be used together with our architecture, such as DDPG and off-policy CACLA.
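A minimal sketch of this combined objective, with toy arrays standing in for the image, its reconstruction, and the value estimates, and with assumed weighting coefficients:

```python
import numpy as np

def combined_loss(o, o_rec, q_pred, q_target, beta1=1.0, beta2=1.0):
    # Weighted sum of reconstruction and value-prediction losses: the
    # encoder receives gradients from both terms during joint training.
    l_rec = np.mean((o - o_rec) ** 2)        # pixel reconstruction loss
    l_q = np.mean((q_target - q_pred) ** 2)  # critic value-prediction loss
    return beta1 * l_rec + beta2 * l_q

o = np.array([1.0, 0.0])
o_rec = np.array([0.5, 0.0])
loss = combined_loss(o, o_rec, q_pred=np.array([1.0]), q_target=np.array([2.0]))
```

In a full implementation, minimizing this loss end-to-end is what shapes the latent representation to be both a state discriminator and a value predictor.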

2.2 Model-based Control System

In our proposed approach, a predictive model of the world dynamics is learned simultaneously with the task. Instead of learning the model at the pixel level, which is noise-sensitive and infeasible in practice, the model is learned in the jointly trained latent space (Figure 1(a)). This also ensures that the model is learned on task-relevant latent representations, as opposed to representations learned only to minimize the pixel-level reconstruction error of an autoencoder, which includes no information on what features are useful for the task. The latent-space world model predicts the next latent state representation and environment reward given the current representation and control action and is trained to minimize the loss:

$$L_{model} = \big\| f_z(z_t, a_t) - z_{t+1} \big\|^2 + \big(f_r(z_t, a_t) - r_t\big)^2 \quad (5)$$

where $r_t$ is the extrinsic reward from the environment, and $f_z$ and $f_r$ are two feedforward neural networks for predicting the next latent state representation and environment reward respectively.
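A minimal sketch of the model loss, assuming it is the sum of the squared next-latent-state prediction error and the squared reward prediction error, with precomputed predictions standing in for the outputs of the two networks:

```python
import numpy as np

def world_model_loss(z_next_pred, z_next, r_pred, r):
    # Squared error on the predicted next latent state plus squared
    # error on the predicted extrinsic reward.
    return np.sum((z_next_pred - z_next) ** 2) + (r_pred - r) ** 2

z_next = np.array([0.2, 0.4])
loss = world_model_loss(np.array([0.0, 0.4]), z_next, r_pred=0.5, r=1.0)
```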

To perform motor control with the latent-space world model, we use model predictive control (MPC). In MPC, the world model is rolled out multiple time steps into the future starting from an initial world state and action plan. An objective function is measured at each time step, and then by backpropagation through time and gradient descent, an action plan that optimizes the objective is computed. Only the first action of the optimal plan is taken before repeating the process again at the next time step with the updated state information in closed loop. In our approach, the initial sequence of actions is provided by the model-free RL actor (Section 2.1) at the initial and subsequent model-generated latent states and is optimized with MPC over a time horizon $T$ by minimizing the loss:

$$L_{plan} = \Big( R_{des} - \sum_{i=0}^{T-1} \gamma^i \hat{r}_{t+i} \Big)^2 \quad (6)$$

where $\hat{r}_{t+i}$ is the predicted reward at time step $t+i$, $\hat{z}_{t+i}$ is the latent state predicted by $f_z$, $a_{t+i} = \pi(\hat{z}_{t+i} \mid \theta^\pi)$ is the actor’s output, and $R_{des}$ is the desired return. We perform $n$ gradient descent steps on $L_{plan}$ (Equation 6) with respect to each individual action in the initial plan:

$$a_{t+i} \leftarrow a_{t+i} - \alpha \frac{\partial L_{plan}}{\partial a_{t+i}} \quad (7)$$

where $\alpha$ is the learning rate for plan optimization. This results in an optimal plan whose first action $a_t^*$ is executed in the environment. Figure 2 shows one iteration of this optimization process in which an action plan that optimizes the objective given the model is inferred.
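The plan-optimization loop can be sketched as follows. Toy reward and transition functions stand in for the two heads of the world model, finite-difference gradients stand in for backpropagation through time, and the horizon, step count, learning rate, and desired return are all assumed values:

```python
import numpy as np

gamma, T, n_steps, alpha = 0.99, 5, 100, 0.05
R_des = 2.0  # desired return (assumed target value)

def reward_model(z, a):
    # Toy stand-in for the reward head: peak reward at a = 0.5
    return 1.0 - (a - 0.5) ** 2

def dynamics(z, a):
    # Toy stand-in for the transition head
    return z + 0.1 * a

def plan_loss(z0, plan):
    # Squared gap between desired return and predicted discounted return
    z, ret = z0, 0.0
    for i, a in enumerate(plan):
        ret += (gamma ** i) * reward_model(z, a)
        z = dynamics(z, a)
    return (R_des - ret) ** 2

def optimize_plan(z0, plan, eps=1e-5):
    # Gradient descent on each action in the plan; finite differences
    # replace the backpropagation-through-time used in the paper.
    plan = plan.copy()
    for _ in range(n_steps):
        base = plan_loss(z0, plan)
        grad = np.empty_like(plan)
        for i in range(T):
            p = plan.copy()
            p[i] += eps
            grad[i] = (plan_loss(z0, p) - base) / eps
        plan -= alpha * grad
    return plan

plan0 = np.zeros(T)           # initial plan (the paper uses actor proposals)
plan_opt = optimize_plan(0.0, plan0)
```

Only the first action of `plan_opt` would be executed before replanning at the next time step.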

Figure 2: Model-based control system: After observing the latent state $z_t$, the world is simulated $T$ time steps into the future using the learned world model and an action sequence $(a_t, \dots, a_{t+T-1})$ proposed by the RL actor, resulting in a sequence of model-generated latent states $(\hat{z}_{t+1}, \dots, \hat{z}_{t+T})$ and rewards $(\hat{r}_t, \dots, \hat{r}_{t+T-1})$. The objective $L_{plan}$ is then measured and optimized by performing backpropagation and $n$ steps of gradient descent. The first action $a_t^*$ of the optimal plan is applied in the environment and the optimization process is repeated at the next time step.

2.3 Intrinsically Motivated Meta-Controller

Our approach to arbitrating between model-free and model-based control systems is based on the spatially and temporally local reliability of predictions of the latent-space world model. We define the reliability in model predictions according to the average prediction error of the model. To improve model predictions, we use the change in average prediction error as an intrinsic reward.

2.3.1 Latent-Space Self-Organization

We incrementally self-organize the latent space into local regions with local world models using the Instantaneous Topological Map (ITM) [24] during motor exploration. ITM was originally designed for strongly correlated stimuli, which is the case here since the stimuli are the latent states visited along continuous trajectories, and has only a few hyperparameters. However, any other growing Self-Organizing Map (SOM) may also be used in our approach. The ITM network is defined by a set of nodes, each node $n$ having a weight vector $w_n$, and a set of edges connecting each node to its neighbors. The network starts with two connected nodes, and when a new stimulus $z$ is observed, the following adaptation steps are performed:

  1. Matching: Find the nearest node $n$ and the second-nearest node $s$ to $z$: $n = \arg\min_j \|z - w_j\|$, $s = \arg\min_{j \neq n} \|z - w_j\|$.

  2. Edge adaptation: If $n$ and $s$ are not connected, add an edge between them. For all nodes $m$ connected to $n$, if $w_m$ lies inside the Thales sphere through $w_n$ and $w_s$ (the sphere with diameter $\overline{w_n w_s}$), i.e. $(w_n - w_m) \cdot (w_s - w_m) < 0$, remove the edge between $n$ and $m$, and if $m$ has no remaining edges, remove $m$.

  3. Node adaptation: If $z$ lies outside the Thales sphere through $w_n$ and $w_s$, i.e. $(w_n - z) \cdot (w_s - z) > 0$, and if $\|z - w_n\| > e_{max}$, where $e_{max}$ is the desired mapping resolution, create a new node $y$ with $w_y = z$ and an edge between $y$ and $n$.
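The three adaptation steps can be sketched as follows. This is a minimal ITM sketch: removal of isolated nodes after edge deletion is omitted for brevity, and the resolution parameter is an assumed value:

```python
import numpy as np

class ITM:
    """Minimal Instantaneous Topological Map sketch (node removal omitted)."""

    def __init__(self, w0, w1, e_max=0.5):
        self.w = [np.asarray(w0, float), np.asarray(w1, float)]
        self.edges = {(0, 1)}       # start with two connected nodes
        self.e_max = e_max          # desired mapping resolution (assumed)

    def _neighbors(self, i):
        return [b if a == i else a for a, b in self.edges if i in (a, b)]

    def update(self, z):
        z = np.asarray(z, float)
        d = [np.linalg.norm(z - w) for w in self.w]
        # Matching: nearest and second-nearest nodes
        n = int(np.argmin(d))
        s = min((j for j in range(len(self.w)) if j != n), key=lambda j: d[j])
        self.edges.add((min(n, s), max(n, s)))
        # Edge adaptation: remove edges to neighbors of n lying inside
        # the Thales sphere through w_n and w_s
        for m in self._neighbors(n):
            if m != s and np.dot(self.w[n] - self.w[m],
                                 self.w[s] - self.w[m]) < 0:
                self.edges.discard((min(n, m), max(n, m)))
        # Node adaptation: create a node when z is outside the Thales
        # sphere and far enough from the nearest node
        if np.dot(self.w[n] - z, self.w[s] - z) > 0 and d[n] > self.e_max:
            self.w.append(z.copy())
            j = len(self.w) - 1
            self.edges.add((min(n, j), max(n, j)))
        return n

itm = ITM([0.0, 0.0], [1.0, 0.0], e_max=0.3)
nearest = itm.update([2.0, 0.0])  # far stimulus triggers node creation
```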

A moving window average of the model prediction error is computed and updated separately for each latent-space region (node $n$ in ITM):

$$\bar{e}_n = \frac{1}{m} \sum_{j=1}^{m} \Big( \big\| f_z^n(z_j, a_j) - z_{j+1} \big\|^2 + \big( f_r^n(z_j, a_j) - r_j \big)^2 \Big) \quad (8)$$

where $m$ specifies the length of the window of recent predictions in region $n$, and $f_z^n$ and $f_r^n$ are the model’s neural networks associated with $n$ for predicting the next latent state and extrinsic reward respectively. The improvement in model predictions, the change in $\bar{e}_n$ over time, is then estimated by computing the learning progress (LP) locally in each region using a time window $\tau$:

$$LP_n(t) = \bar{e}_n(t - \tau) - \bar{e}_n(t) \quad (9)$$

The learning progress is used to derive an intrinsic reward $r^{in}_t$, encouraging actions that yield data that improves the model. It is also used as an unbiased, spatially and temporally local reliability estimator that underlies meta-decisions, as detailed in the following section.
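A minimal sketch of the per-region moving-average error and the learning-progress estimate, with assumed window lengths and precomputed scalar prediction errors standing in for the per-transition model errors:

```python
from collections import defaultdict, deque

import numpy as np

m, tau = 10, 5  # error-window length and LP time window (assumed values)

errors = defaultdict(lambda: deque(maxlen=m))  # recent errors per region
avg_history = defaultdict(list)                # history of window averages

def record_error(region, pred_err):
    errors[region].append(pred_err)
    avg_history[region].append(float(np.mean(errors[region])))

def learning_progress(region):
    # Average error tau updates ago minus the current average error:
    # positive while the local model is improving.
    h = avg_history[region]
    if len(h) <= tau:
        return 0.0
    return h[-1 - tau] - h[-1]

for e in np.linspace(1.0, 0.1, 12):  # steadily shrinking errors
    record_error(0, e)
lp = learning_progress(0)            # positive: region 0 is improving
```

The same positive quantity would be emitted as the intrinsic reward for visiting region 0.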

Figure 3: Adaptive-length model rollout for model-based control: Given an initial latent state $z_t$, the world is unrolled using the local models $f_z^n$ and $f_r^n$, where $n$ is the nearest node to the initial real (and, later, model-generated) latent states, until the local prediction reliability estimated by the learning progress $LP_n$ associated with the current latent region is low or a maximum depth $T_{max}$ is reached. At each rollout step $i$, the nearest node $n$ to the model-generated latent state is identified and the corresponding $LP_n$ determines whether to complete ($LP_n \geq 0$) or terminate ($LP_n < 0$) the rollout. The actions chosen in the rollout are the output of the actor network $\pi$. The predicted latent states and rewards are the outputs of the model’s networks $f_z^n$ and $f_r^n$ respectively. When the rollout is terminated, plan optimization is performed over the computed horizon as explained in Section 2.2.

2.3.2 Reliability-based Arbitration

When a new latent state z is observed, the ITM network is updated and the nearest node b to z is identified. If the corresponding learning progress LP_b in the latent region covered by b is negative, which indicates low prediction reliability of the local world model, the meta-controller queries the model-free control system for a motor action. The model-free system in turn sends the output of the actor network, with exploration noise, to the environment. If, on the other hand, the learning progress is greater than or equal to zero, the meta-controller queries the model-based control system instead. This initiates the plan optimization process (see Figure 2). However, rather than using a predetermined planning horizon, the learning progress defined over the traversed latent regions to which the model-generated states belong adaptively sets the depth of planning, as illustrated in Figure 3. This is done by terminating the model-generated rollout when the local learning progress is negative or a maximum depth is reached (see Algorithm 1). Rolling out the model only until the estimated reliability becomes low ensures that no imperfect model predictions are used in computing the optimal plan and reduces the computational cost. The first action of the optimal plan is then sent to the environment with exploration noise. In either case, after performing an action a_t, the newly collected experience (s_t, a_t, r_t, s_{t+1}), where r_t is the sum of the extrinsic and intrinsic rewards, is added to the replay memory of recent experiences used to update the actor, critic-autoencoder, and world-model networks. Figure 4 illustrates the arbitration process of the intrinsically motivated meta-controller.
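The meta-decision itself reduces to a sign test on the local learning progress. A sketch with hypothetical `itm`, `model_based`, and `model_free` interfaces:

```python
def select_action(z, itm, model_based, model_free, noise):
    """Meta-controller arbitration (illustrative sketch): query the model-based
    system when the local learning progress is non-negative, otherwise the
    model-free actor. All interfaces here are assumptions, not the authors' API."""
    b = itm.best_match(z)                # nearest ITM node to the latent state
    if itm.learning_progress(b) >= 0.0:  # model locally reliable
        action = model_based.plan(z)     # first action of the optimized plan
    else:                                # model locally unreliable
        action = model_free.act(z)       # actor network output
    return action + noise()              # exploration noise in either case
```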

Figure 4: Intrinsically motivated meta-controller: At each time step, the learning progress in the latent-space region the current latent state belongs to is checked. If it is greater than or equal to zero, indicating high reliability of the model predictions, the meta-controller queries the model-based control system for an action, which in turn performs plan optimization and returns the first action of the optimal plan. Otherwise, negative learning progress indicates low prediction reliability and the meta-controller queries the model-free system for an action, which returns the output of the actor network π. The selected action is then sent to the environment with exploration noise, and the environment returns the next state and extrinsic reward.
1: Input: initial latent state z_0, max. depth d_max, best-matching node b
2: d ← 0
3: while LP_b ≥ 0 and d < d_max do
4:     a ← π(z_d)
5:     z_{d+1} ← f_b^z(z_d, a)
6:     b ← best-matching node to z_{d+1}
7:     d ← d + 1
8: end while
9: return d
Algorithm 1 Planning Depth (z_0, d_max, b)
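The depth-selection loop of Algorithm 1 can be sketched as follows (the `itm`, `models`, and `actor` interfaces are hypothetical stand-ins for the ITM network, the local model networks, and the actor network):

```python
def planning_depth(z0, d_max, itm, models, actor):
    """Adaptive planning depth (illustrative sketch of Algorithm 1): unroll the
    local models until the learning progress of the region reached turns
    negative or the maximum depth d_max is hit."""
    z, d = z0, 0
    b = itm.best_match(z)
    while itm.learning_progress(b) >= 0.0 and d < d_max:
        a = actor(z)                    # action proposed by the model-free actor
        z = models[b].next_state(z, a)  # local model predicts the next latent state
        b = itm.best_match(z)           # region the predicted state falls into
        d += 1
    return d
```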

In our approach, the model-free control system provides the model-based one with a good initial action sequence. Likewise, the model-based control system provides the model-free one with a better-informed exploratory action when the model is locally reliable. Thus, the two control systems are mutually beneficial. The complete algorithm for learning visuomotor control policies with our intrinsically motivated meta-controller is given in Algorithm 2.

1: Input: max. planning depth d_max, no. of plan optimization iterations, desired mapping resolution e_max, episode length T, no. of episodes E
2: Given: an off-policy actor-critic method A
3: Initialize learning parameters {θ^π, θ^Q, θ^enc}
4: Initialize ITM network with two nodes and corresponding model parameters {θ^z, θ^r}
5: Initialize replay buffer R
6: for episode = 1, E do
7:     Sample initial state s_1
8:     for t = 1, T do
9:         Compute latent state encoding z_t
10:        Update ITM network
11:        Identify best-matching node b
12:        if LP_b ≥ 0 then
13:            d ← Planning Depth(z_t, d_max, b) // Algorithm 1
14:            Query model-based control system with time horizon d // Section 2.2
15:            a_t ← first action of the optimal plan
16:        else
17:            Query model-free control system // Section 2.1
18:            a_t ← π(z_t), where π is A’s actor
19:        end if
20:        Add exploration noise to a_t
21:        Execute a_t and observe s_{t+1} and r^e_t
22:        Update ē_b and LP_b, following Equation 9, and compute intrinsic reward r^i_t
23:        Store (s_t, a_t, r_t, s_{t+1}) in R, with r_t = r^e_t + r^i_t
24:        Update the model parameters {θ^z_b, θ^r_b} using (z_t, a_t, r^e_t, z_{t+1}) to minimize the model loss // Equation 5
25:        Update {θ^Q, θ^enc} on a minibatch from R to minimize the combined loss // Equation 4
26:        Update θ^π on a minibatch from R based on the chosen A // Equations 2 & 3
27:        Update target network parameters with a soft update: θ′ ← βθ + (1 − β)θ′, with update rate β
28:    end for
29: end for
Algorithm 2 Intrinsically Motivated Meta-Controller (IM2C)

3 Integrating Arbitration and Imagination

In addition to planning, predictive world models can be leveraged by generating imagined experience samples to augment real-world samples and improve data efficiency of learning control policies. In a previous work, we demonstrated that performing imagined rollouts in a learned latent space and adapting the imagination depth to the improvement in learning a world model accelerate robotic visuomotor skill learning [20]. Here, we propose to integrate our learning-adaptive imagination (LA-Imagination) with the presented reliability-based arbitration using the same underlying self-organized latent space.

In LA-Imagination, an on-policy imagined rollout is performed at every time step with a probability proportional to the local model’s prediction accuracy. We modify the algorithm and instead use the adaptive-length model rollout, the input to plan optimization in our model-based control system, to provide a set of imagined transitions. To allow for learning from imagined latent-space transitions, we split the replay memory into a pixel-space replay buffer and a latent-space replay buffer. Real-world pixel-space transitions are stored in the pixel-space buffer and used to learn the jointly optimized latent representation, while imagined latent-space transitions are stored in the latent-space buffer and used to learn the Q-function, policy, and model parameters. This is performed by updating the critic-autoencoder parameters with gradient descent on a minibatch from the pixel-space buffer to minimize the combined loss (Equation 4), followed by updating the Q-function and model parameters with gradient descent on a minibatch from the latent-space buffer to minimize the critic loss (Equation 1), taking the latent representation as input to the Q-function, and updating the policy parameters by following Equation 2 or 3 according to the chosen actor-critic method. In our proposed framework, offline learning from imagined transitions with experience replay is thus coupled with the online meta-control discussed in Section 2, based on the spatially and temporally local model reliability estimated by the learning progress. Figure 5 shows the overall learning framework.
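The split replay memory can be sketched as follows (the class and method names are illustrative; the 60k/200k capacities follow the experimental setup reported in Section 4.1.1):

```python
import random
from collections import deque

class DualReplay:
    """Split replay memory (sketch): real pixel-space transitions train the
    encoder jointly with the value loss, while imagined latent-space transitions
    train the Q-function, policy, and model parameters."""

    def __init__(self, pixel_capacity=60_000, latent_capacity=200_000):
        self.pixel = deque(maxlen=pixel_capacity)    # (s, a, r, s') from the real world
        self.latent = deque(maxlen=latent_capacity)  # (z, a, r, z') from model rollouts

    def store_real(self, transition):
        self.pixel.append(transition)

    def store_imagined(self, rollout):
        self.latent.extend(rollout)  # whole imagined rollout at once

    def sample(self, buffer, batch_size):
        """Uniformly sample a minibatch from the chosen buffer."""
        data = self.pixel if buffer == "pixel" else self.latent
        return random.sample(data, min(batch_size, len(data)))
```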

Figure 5: Integrated Imagination-Arbitration (I2A) framework: At each time step t, the Intrinsically Motivated Meta-Controller uses the learning progress LP_b associated with the best-matching node b to arbitrate between the model-based and model-free control systems (Figure 4). If LP_b is greater than or equal to zero, the model-based system is called and the model is unrolled in latent space until the local learning progress is negative or a maximum depth is reached. The resulting model rollout is used to provide a sequence of imagined transitions, as shown by the red arrow, which are added to the latent-space buffer and used in training the actor, critic, and local model networks. It is also used as input to the plan optimization process of the model-based system. After arbitration, the action of the chosen control system is sent to the environment and the collected real-world transition is stored in the pixel-space buffer and used in training the autoencoder network to jointly optimize the reconstruction and value-prediction losses.

4 Experimental Evaluation

In Sections 2 and 3, we have described the Intrinsically Motivated Meta-Controller (IM2C) and the Integrated Imagination-Arbitration (I2A) framework for improving data efficiency of learning robotic vision-based control policies. Here, we will evaluate their performance compared to baseline and state-of-the-art methods on robot grasp learning in simulation as well as on a real robot.

4.1 Evaluation in Simulation

Here, we describe the experimental setup, including the learning parameters and robotic environment, and the results of applying our proposed and the compared algorithms to our simulated robot grasp-learning task.

Figure 6: V-REP-simulated grasp-learning scenario: NICO robot facing a table and attempting to grasp a glass randomly placed on the table.

4.1.1 Robot Grasping Setup

Parameter and Implementation details.

We use the neural architectures shown in Figure 1, with the number and size of convolutional filters placed above the corresponding layers, for representing the actor and critic in the considered algorithms. No pooling layers are used. All convolutional layers are zero-padded and have stride 1. ReLU activations are used in all layers except for the output layers of the actor and critic networks, which use tanh and linear activations respectively. For representing the world model, we use a fully connected neural network with one hidden layer of 20 tanh units and two output layers of 32 and 1 linear units for predicting the next latent state and extrinsic reward respectively. The weighting coefficients of the combined loss function defined in Equation 4 are set to 0.1 and 1 respectively. We set the learning rate, the number of gradient descent steps, and the maximum depth of the plan optimization of the model-based control system to 13, 10, and 6 respectively. A single replay buffer with a capacity of 100k transitions is used in all experiments except for the experiment with our proposed I2A method, where we use pixel-space and latent-space replay buffers with capacities of 60k and 200k respectively. All networks are trained from scratch using batch size 256 and the Adam optimizer [29] with a learning rate of 1e-3 for the critic-autoencoder and model networks and 1e-4 for the actor network. The discount factor and the update rate of the target networks are set to 0.99 and 1e-6 respectively. The desired mapping resolution is set to 6, and the two time windows used in computing the learning progress are set to 40 and 20 time units respectively.

We train the networks using Tensorflow [1] on a desktop with an Intel i5-6500 CPU, 16 GB of RAM, and a single NVIDIA GeForce GTX 1050 Ti GPU.

Figure 7: (a) Motor output: The joints controlled by the grasping policy are depicted as yellow cylinders, with one in the shoulder and three in each finger. (b) Sensory input: 64 × 32 RGB image used as input to the learning algorithm.
Simulation environment.

All experiments are conducted on our Neuro-Inspired COmpanion (NICO) robot [28] using the V-REP robot simulator [46]. NICO is a child-sized humanoid developed by the Knowledge Technology group of the University of Hamburg. NICO is a flexible platform for research on embodied neurocognitive models based on human-like sensory and motor capabilities. It stands about one meter tall; its body proportions and degrees of freedom resemble those of a three- to four-year-old child. Figure 6 shows the configuration of the environment, including the simulated NICO robot sitting in front of a table on top of which a glass is placed and used as the grasping target. In order to prevent self-collisions while still allowing for a large workspace, we consider learning a grasping policy that controls the shoulder joint and the finger joints of the right hand, as shown in Figure 7(a). The shoulder joint has an angular range of movement of 100 degrees. The multi-fingered hand is tendon-operated and consists of one thumb and two index fingers, with finger joints having an angular range of movement of 160 degrees. All algorithms take as input a 64 × 32 RGB image obtained from the vision sensor, as shown in Figure 7(b).

4.1.2 Results

We run the algorithms for 10k episodes. Each episode terminates when the target is grasped, when it is toppled, or when a maximum of 50 time steps is reached. The target position is randomly set to a new graspable position at the start of each episode. The extrinsic reward function is defined as follows:

Figure 8: Learning curves of off-policy CACLA and DDPG with and without IM2C on robot grasp learning from pixel input in two reward settings: (a) dense reward and (b) sparse reward. The curves are smoothed using a sliding window of 250 episodes. Shaded regions correspond to one standard deviation.

where p_t and p_h are the center points of the target and the hand respectively. We compare the performance of off-policy CACLA and DDPG, with and without our proposed IM2C, on learning robotic vision-based grasping in dense- and sparse-reward settings. Figure 8 shows the episodic reward averaged over 5 random seeds (we use the term episodic reward to refer to the sum of extrinsic rewards collected over one complete episode). It can be observed that both CACLA+IM2C and DDPG+IM2C achieved a higher average episodic reward and a better convergence rate than their baseline counterparts at the end of training in both reward settings. The effect of IM2C is most evident in the results of learning from sparse rewards, where CACLA+IM2C and DDPG+IM2C significantly outperformed their baseline counterparts in learning speed and final performance, as shown in Figure 8(b) and Table 1. We compute the following scoring metrics: (i) Area under Curve (AuC), the area under the learning curve normalized by the total area, which gives a quantitative measure of learning speed; and (ii) Final Performance (Final Perf.), the average episodic reward over the last 500 training episodes. The two metrics are reported in Table 1.

                 CACLA       CACLA+IM2C   DDPG        DDPG+IM2C
Dense Reward
AuC              0.379       0.815        0.214       0.554
Final Perf.      -0.5±0.5    0.8±0.1      -1.0±0.3    0.6±0.4
Sparse Reward
AuC              0.109       0.440        0.019       0.406
Final Perf.      -0.3±0.2    0.9±0.1      -0.9±0.0    0.6±0.3
Table 1: Summary statistics of the simulation results for different experimental settings.
Figure 9: Learning curves of DDPG+IM2C, DDPG+CMC, and MVE-DDPG on robot grasp learning from pixel input in two reward settings: (a) dense reward and (b) sparse reward. The curves are smoothed using a sliding window of 250 episodes. Shaded regions correspond to one standard deviation.

We also compare IM2C to previous methods for improving model-free value estimation with model-based predictions, in particular the state-of-the-art Model-based Value Expansion (MVE) method [11] and the more recent Curious Meta-Controller (CMC) method [19]. We implement MVE-DDPG from [11] and DDPG+CMC from [19] with prediction horizons of 2 and 3 steps respectively, which we found to produce the best results. Figure 9 shows the average episodic reward over 5 random seeds. The three methods have comparable learning performance over the first 3k episodes in the dense-reward setting (Figure 9(a)). The episodic reward of DDPG+IM2C and DDPG+CMC, however, continues to increase faster than that of MVE-DDPG, reaching 0.59 and 0.45 respectively. In the sparse-reward setting (Figure 9(b)), MVE-DDPG shows no clear improvement in performance, while DDPG+IM2C and DDPG+CMC are able to improve their performance, converging to policies with episodic rewards of 0.62 and 0.24 respectively. We believe the poor performance of MVE is primarily due to its incorporating imperfect predictions in learning value estimates, as opposed to the reliability-driven model use of IM2C. Moreover, CMC and MVE use a fixed planning horizon, increasing the risk of compounding prediction errors, while IM2C automatically selects a horizon that is fully adaptive to the local reliability of the model.

Figure 10: Learning curves of DDPG+I2A, DDPG+IM2C, and DDPG+LA-Imagination on robot grasp learning from pixel input in two reward settings: (a) dense reward and (b) sparse reward. The curves are smoothed using a sliding window of 250 episodes. Shaded regions correspond to one standard deviation.

Last but not least, we evaluate our proposed I2A framework, which combines experience imagination with reliability-based arbitration, by conducting an ablation study to analyze the influence of its individual components, namely arbitration and imagination. This is performed by comparing I2A to IM2C, which represents the arbitration component, and to LA-Imagination (see Section 3), which represents the imagination component. The average episodic reward of running the three algorithms over 5 random seeds in the two reward settings is shown in Figure 10. It is clear that augmenting the replay memory of DDPG with latent-space imagined transitions using LA-Imagination significantly improves the data efficiency of DDPG, which otherwise completely fails to show any progress (see Figure 8). Compared to DDPG+LA-Imagination, DDPG+IM2C leads to a higher episodic reward, which again confirms the effectiveness of the meta-controller in adaptively arbitrating between model-based and model-free control systems and choosing more informed exploratory actions, progressing faster to a good grasping policy. DDPG+I2A, in turn, yields the best results by combining the advantages of the two approaches using the same underlying self-organized latent space.

Figure 11: NICO experimental setup during a grasping test trial. From left to right: the exocentric and the egocentric (inset) views of the initial, intermediate, and full-grasp configurations.

4.2 Evaluation on a Real Robot

For the experiments with the physical NICO, the simulation environment was recreated as faithfully as possible: the simulation is based on a URDF model of NICO, so there is no difference between the simulated and the real robot. Both the table and NICO’s seat have the same height as in the simulation, allowing for a direct transfer of the arm pose and, more importantly, of the trained neural model. Furthermore, the color of the table is identical to the color in the simulation. The grasping object is slightly different in geometry, to allow for more stable grasps, but has the same color. To achieve the same perspective for the visual input, an external camera was mounted on the table with a view similar to that of the simulated camera. Figure 11 shows NICO in the experimental setup.

While the grasping object’s position in the simulation environment can be manipulated directly, so that the virtual NICO is only used for grasp learning and execution, NICO in the real environment is also used to place the object at an exact, known position on the table. Each grasping trial consists of the following steps: starting from the initial position (shoulder at zero degrees), the grasping object is put into NICO’s hand (if it is not already in the hand), the hand closes, NICO places the object at a predetermined position on the table, the position is memorized, and NICO moves the hand back to the starting position. The actual grasping trial then starts by taking an image with the external camera and feeding it into the actor network, which outputs a motor command to NICO. After the movement is executed, either the object is grasped or, if the hand is too far from the object, another image is recorded and the process is repeated. Up to eight consecutive grasping steps are performed before the attempt is categorized as failed and the object is retrieved using its initially stored position.

To compare the performance of the algorithms on the real NICO robot, we take the best-performing policy network of each algorithm, trained in simulation, and deploy it on the real robot. We perform 25 test episodes, each with a random graspable position. To achieve a seamless simulation-to-real transfer of the trained policy networks and to compensate for the slightly different alignment of the simulated and the real camera, we force the encoder part of the critic-autoencoder network to map an image from the simulation environment and an image from the real world, taken with the same joint configuration and environmental setup, to similar latent representations. This is done by minimizing the Euclidean distance between the latent representations at the output of the encoder corresponding to images from the simulated and real-world environments, with supervised learning over a training set of 2k simulated-real image pairs. The encoder then computes the latent state to be used as input to the policy network during testing. No fine-tuning of the trained policy networks is performed. We report the success rate (the proportion of successful test episodes) for each algorithm in Table 2.
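The alignment objective described above, minimizing the Euclidean distance between paired simulated and real latent codes, can be sketched as a loss function (the `encoder` callable is a hypothetical stand-in for the encoder part of the critic-autoencoder):

```python
import numpy as np

def alignment_loss(encoder, sim_batch, real_batch):
    """Sim-to-real alignment loss (sketch): mean squared Euclidean distance
    between the latent codes of paired simulated and real images taken under
    the same joint configuration and environmental setup."""
    z_sim = encoder(sim_batch)    # latent codes of simulated images
    z_real = encoder(real_batch)  # latent codes of paired real images
    return float(np.mean(np.sum((z_sim - z_real) ** 2, axis=1)))
```

Minimizing this loss over the paired training set pulls the two domains to a shared latent representation, so the policy trained in simulation receives familiar inputs on the real robot.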


Dense reward 16% 68% 48% 80% 88%
Sparse reward 12% 44% 12% 76% 76%
Table 2: Success rate of the trained policy networks on the real robot.

5 Conclusion

We presented a novel robot dual-system motor learning approach that is behaviorally and neurally plausible, data efficient, and competitive with the state of the art. Our approach adaptively arbitrates between model-based and model-free decisions based on the spatially and temporally local reliability of a learned world model. The reliability estimate, computed locally for every region of a learned latent space, is used to make the meta-decision as well as to enable an adaptive-length model rollout for plan optimization during model-based control. We derive an intrinsic reward from the reliability estimate to encourage collecting experience data that improves the model. To further improve the data efficiency, we leverage the reliable multi-step model predictions by combining arbitration with experience imagination, where imagined experiences collected from model rollouts are used as additional training data for the control policy.

We show that our approach learns better vision-based control policies than baseline and state-of-the-art methods in dense- and sparse-reward environments. Policy networks trained in simulation with our approach are shown to perform well on the physical robot without fine-tuning of the policy parameters. Our results suggest that model reliability is essential for dual-system approaches, both for the online meta-decision of which of the model-based and model-free systems to query for an action and for generating imagined experience data that contains less overall prediction error. Our approach can be used with any off-policy reinforcement learning algorithm, which we demonstrated with off-policy CACLA and DDPG. We believe that our approach can be extended to the case of a multi-step model, instead of the single-step model used in the present work, by incorporating temporal abstractions such as options [50, 43]. Another promising direction for future work is to generalize our approach to environments with stochastic dynamics.

This work was supported by the German Academic Exchange Service (DAAD) funding programme (No. 57214224) with partial support from the German Research Foundation DFG under project CML (TRR 169).


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §4.1.1.
  • [2] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter (2018) Differentiable mpc for end-to-end planning and control. In Advances in Neural Information Processing Systems, pp. 8289–8300. Cited by: §1.3.
  • [3] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058. Cited by: §1.
  • [4] Y. Boureau, P. Sokol-Hessner, and N. D. Daw (2015) Deciding how to decide: self-control and meta-decision making. Trends in Cognitive Sciences 19 (11), pp. 700–710. Cited by: §1.1.
  • [5] Y. Burda, H. Edwards, A. Storkey, and O. Klimov (2019) Exploration by random network distillation. In 7th International Conference on Learning Representations, Cited by: §1.
  • [6] L. K. Case, J. Pineda, and V. S. Ramachandran (2015) Common coding and dynamic interactions between observed, imagined, and experienced motor and somatosensory activity. Neuropsychologia 79, pp. 233–245. Cited by: §1.2.
  • [7] F. Cushman and A. Morris (2015) Habitual control of goal selection in humans. Proceedings of the National Academy of Sciences 112 (45), pp. 13817–13822. Cited by: §1.1.
  • [8] N. D. Daw, Y. Niv, and P. Dayan (2005) Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience 8 (12), pp. 1704–1711. Cited by: §1.1.
  • [9] J. E. Driskell, C. Copper, and A. Moran (1994) Does mental practice enhance performance?. Journal of Applied Psychology 79 (4), pp. 481. Cited by: §1.2.
  • [10] F. S. Fard and T. P. Trappenberg (2018) Mixing habits and planning for multi-step target reaching using arbitrated predictive actor-critic. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §1.1, §1.2.
  • [11] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine (2018) Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101. Cited by: §1.1, §1.2, §4.1.2.
  • [12] V. François-Lavet, Y. Bengio, D. Precup, and J. Pineau (2019) Combined reinforcement learning via abstract representations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3582–3589. Cited by: §1.1, §1.2, §1.3.
  • [13] J. Fu, J. Co-Reyes, and S. Levine (2017) Ex2: exploration with exemplar models for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2577–2587. Cited by: §1.
  • [14] F. Garcia and P. S. Thomas (2019) A meta-mdp approach to exploration for lifelong reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5692–5701. Cited by: §1.
  • [15] J. Gottlieb, P. Oudeyer, M. Lopes, and A. Baranes (2013) Information-seeking, curiosity, and attention: computational and neural mechanisms. Trends in Cognitive Sciences 17 (11), pp. 585–593. Cited by: §1, §1.
  • [16] S. Gu, E. Holly, T. Lillicrap, and S. Levine (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3389–3396. Cited by: §1.
  • [17] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine (2016) Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838. Cited by: §1.2.
  • [18] D. Ha and J. Schmidhuber (2018) World models. arXiv preprint arXiv:1803.10122. Cited by: §1.1, §1.2.
  • [19] M. B. Hafez, C. Weber, M. Kerzel, and S. Wermter (2019) Curious meta-controller: adaptive alternation between model-based and model-free control in deep reinforcement learning. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §1.1, §1.2, §1.3, §4.1.2.
  • [20] M. B. Hafez, C. Weber, M. Kerzel, and S. Wermter (2019) Efficient intrinsically motivated robotic grasping with learning-adaptive imagination in latent space. In 2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 240–246. Cited by: §1.2, §3.
  • [21] A. M. Haith and J. W. Krakauer (2013) Model-based and model-free mechanisms of human motor learning. In Progress in Motor Control, pp. 1–21. Cited by: §1.1, §1.
  • [22] J. B. Hamrick (2019) Analogues of mental simulation and imagination in deep learning. Current Opinion in Behavioral Sciences 29, pp. 8–16. Cited by: §1.2.
  • [23] E. Hazan, S. Kakade, K. Singh, and A. Van Soest (2019) Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pp. 2681–2691. Cited by: §1.
  • [24] J. Jockusch and H. Ritter (1999) An instantaneous topological mapping model for correlated stimuli. In International Joint Conference on Neural Networks (IJCNN), Vol. 1, pp. 529–534. Cited by: §2.3.1.
  • [25] G. Kalweit and J. Boedecker (2017) Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pp. 195–206. Cited by: §1.2.
  • [26] N. R. Ke, A. Singh, A. Touati, A. Goyal, Y. Bengio, D. Parikh, and D. Batra (2019) Learning dynamics model in reinforcement learning by incorporating the long term future. In 7th International Conference on Learning Representations, Cited by: §1.3.
  • [27] M. Keramati, P. Smittenaar, R. J. Dolan, and P. Dayan (2016) Adaptive integration of habits into depth-limited planning defines a habitual-goal–directed spectrum. Proceedings of the National Academy of Sciences 113 (45), pp. 12868–12873. Cited by: §1.1.
  • [28] M. Kerzel, E. Strahl, S. Magg, N. Navarro-Guerrero, S. Heinrich, and S. Wermter (2017) NICO—neuro-inspired companion: a developmental humanoid robot platform for multimodal interaction. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 113–120. Cited by: §4.1.1.
  • [29] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations, Cited by: §4.1.1.
  • [30] W. Kool, S. J. Gershman, and F. A. Cushman (2018) Planning complexity registers as a cost in metacontrol. Journal of Cognitive Neuroscience 30 (10), pp. 1391–1404. Cited by: §1.1.
  • [31] S. W. Lee, S. Shimojo, and J. P. O’Doherty (2014) Neural computations underlying arbitration between model-based and model-free learning. Neuron 81 (3), pp. 687–699. Cited by: §1.1, §1.
  • [32] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §1.
  • [33] F. Lieder and T. L. Griffiths (2015) When to use which heuristic: a rational solution to the strategy selection problem. In Proceedings of the 37th Annual Conference of the Cognitive Science Society. Cited by: §1.1.
  • [34] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, Cited by: §2.1.
  • [35] S. Mahadevan (2018) Imagination machines: a new challenge for artificial intelligence. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.2.
  • [36] F. Mannella, V. G. Santucci, E. Somogyi, L. Jacquey, K. J. O’Regan, and G. Baldassarre (2018) Know your body through intrinsic goals. Frontiers in Neurorobotics 12, pp. 30. Cited by: §1.
  • [37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1.
  • [38] S. Mohamed and D. J. Rezende (2015) Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2125–2133. Cited by: §1.
  • [39] S. T. Moulton and S. M. Kosslyn (2009) Imagining predictions: mental imagery as mental emulation. Philosophical Transactions of the Royal Society B: Biological Sciences 364 (1521), pp. 1273–1280. Cited by: §1.2.
  • [40] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. Cited by: §1.1, §1.2, §1.3.
  • [41] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2778–2787. Cited by: §1, §1.
  • [42] A. Péré, S. Forestier, O. Sigaud, and P. Oudeyer (2018) Unsupervised learning of goal spaces for intrinsically motivated goal exploration. In 6th International Conference on Learning Representations, Cited by: §1.
  • [43] D. Precup (2000) Temporal abstraction in reinforcement learning. Ph.D. Thesis, University of Massachusetts. Cited by: §5.
  • [44] R. Ptak, A. Schnider, and J. Fellrath (2017) The dorsal frontoparietal network: a core system for emulated action. Trends in Cognitive Sciences 21 (8), pp. 589–599. Cited by: §1.2.
  • [45] S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. (2017) Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5690–5701. Cited by: §1.1, §1.2.
  • [46] E. Rohmer, S. P. Singh, and M. Freese (2013) V-rep: a versatile and scalable robot simulation framework. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1321–1326. Cited by: §4.1.1.
  • [47] E. M. Russek, I. Momennejad, M. M. Botvinick, S. J. Gershman, and N. D. Daw (2017) Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLoS Computational Biology 13 (9), pp. e1005768. Cited by: §1.1.
  • [48] T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2016) Prioritized experience replay. In 4th International Conference on Learning Representations, Cited by: §1.
  • [49] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. Nature 529 (7587), pp. 484. Cited by: §1.
  • [50] R. S. Sutton, D. Precup, and S. Singh (1999) Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1-2), pp. 181–211. Cited by: §5.
  • [51] H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel (2017) #Exploration: a study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2753–2762. Cited by: §1.
  • [52] H. Van Hasselt (2012) Reinforcement learning in continuous state and action spaces. In Reinforcement Learning, pp. 207–251. Cited by: §2.1.
  • [53] T. Xu, Q. Liu, L. Zhao, and J. Peng (2018) Learning to explore via meta-policy gradient. In International Conference on Machine Learning, pp. 5463–5472. Cited by: §1.