Multi-task Reinforcement Learning with a Planning Quasi-Metric

02/08/2020 ∙ by Vincent Micheli, et al. ∙ Idiap Research Institute

We introduce a new reinforcement learning approach combining a planning quasi-metric (PQM) that estimates the number of actions required to go from a state to another, with task-specific planners that compute a target state to reach a given goal. The main advantage of this decomposition is to allow the sharing across tasks of a task-agnostic model of the quasi-metric that captures the environment's dynamics and can be learned in a dense and unsupervised manner. We demonstrate the usefulness of this approach on the standard bit-flip problem and in the MuJoCo robotic arm simulator.




1 Introduction

We are interested in devising a new approach to reinforcement learning to solve multiple tasks in a single environment, and to learn separately the dynamics of the environment and the definitions of goals in it.

A simple example would be a 2d maze, where there could be two different sets of tasks: reach a given horizontal coordinate, or reach a given vertical coordinate. Learning the spatial configuration of the maze would be useful for both sets of tasks.

Another example would be the robotic arm tasks proposed by Andrychowicz et al. (2017) in the MuJoCo simulator (Todorov et al., 2012). While tasks in that simulator can be different, they all rely on the underlying physical dynamics of the robotic arm.

Our approach consists in segmenting the model into:

  1. a task-agnostic planning quasi-metric that estimates the minimum expected number of steps necessary to go from any state $s$ to any state $s'$ (see § 2.2), and

  2. a series of task-specific planners that estimate, given the current state $s$ and a task-specific goal $g$, what target state $s'$ the agent should aim for (see § 2.3).

These models are trained concurrently. The key value of this approach is to share the quasi-metric across tasks, for instance to enable transfer learning as demonstrated in the experimental section (§ 3).


The idea of a quasi-metric between states is the natural extension of recent works, starting with the Universal Value Function Approximators (Schaul et al., 2015), which introduced the notion that learning the reward function can be done without a single privileged goal, and then extended with the Hindsight Experience Replay (Andrychowicz et al., 2017), which introduced the idea that goals do not have to be pre-defined but can be picked arbitrarily. Combined with a constant negative reward, this leads naturally to a metric where states and goals play a more symmetric role, which departs from the historical and classical idea of accumulated reward.

The long-term motivation of our approach is twofold: First, we want to segment the policy of an agent into a life-long learned quasi-metric, and a collection of task-specific easy-to-learn planners. These planners would be related to high-level imperatives for a biological system, triggered by low-level physiological necessities (“eat”, “get warmer”, “reproduce”, “sleep”), and high-level operations for a robot (“recharge battery”, “pick up boxes”, “patrol”, etc.). Second, the central role of a metric where the heavy lifting takes place provides a powerful framework to develop hierarchical planning, curiosity strategies, estimators of performance, etc.

Figure 1: Given a starting state $s$ and a goal $g$, the planner computes a target state $s'$ which is the closest state in $g$, that is, the one that minimizes the length of the dashed path to go from $s$ to $s'$ under the environment dynamics.

2 Method

Let $\mathcal{S}$ be the state space and $\mathcal{A}$ the action space. We call goal a subset $g \subset \mathcal{S}$ of the state space, and a task a set of goals $\mathcal{G}$. Many tasks can be defined in a given environment with the same state and action spaces. Note that in the environments we consider, the concrete definition of a task is a subset of the state vector coordinates, and a goal is defined by the target values for these coordinates.

Consider the robotic arm of the MuJoCo simulator (Todorov et al., 2012), which we use for experiments in § 3.2: the state space concatenates, among others, the position and velocity of the arm and the location of the object to manipulate. Examples of tasks could be “reach a certain position”, in which case a goal is a set of states parameterized by a 3d position, where the position of the arm handle is fixed but all other degrees of freedom are left free, “reach a certain speed”, where everything is left unconstrained but the handle’s speed, “put the object at the left side of the table”, where everything is free but one coordinate of the object location, and so on.

For what follows, we also let

$$(s_t, a_t, r_t)_{t \geq 0} \tag{1}$$

be a state / action / reward sequence.

2.1 Q-learning and Hindsight Experience Replay

Given a discount factor $\gamma$, a standard Q-learning algorithm (Watkins, 1989) aims at learning a value function of the form

$$Q : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R} \tag{2}$$

that should provide an estimate of the maximum [expected] reward that can be accumulated when starting from state $s$ and doing action $a$. This is achieved by iteratively computing $Q$ through updates minimizing

$$\left( Q(s_t, a_t) - \left( r_t + \gamma \max_{a} \bar{Q}(s_{t+1}, a) \right) \right)^2 \tag{3}$$

where $\bar{Q}$ is a “target model”, usually updated less frequently or through a stabilizing moving average (Hasselt, 2010).

The Universal Value Function Approximators (Schaul et al., 2015) parameterize the Q-function with a goal, and the Hindsight Experience Replay (Andrychowicz et al., 2017) combines this approach with additional goals sampled along visited trajectories, and a constant negative reward.

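The goal-dependent Q-value

$$Q : \mathcal{S} \times \mathcal{A} \times \mathcal{G} \rightarrow \mathbb{R} \tag{4}$$

is updated, given a goal $g$, to minimize

$$\left( Q(s_t, a_t, g) - \left( r_t + \gamma \max_{a} \bar{Q}(s_{t+1}, a, g) \right) \right)^2 \tag{5}$$

with

$$r_t = \begin{cases} -1 & \text{if } s_{t+1} \notin g \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$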
As noted by Eysenbach et al. (2019), setting an undiscounted ($\gamma = 1$) constant negative reward, except when the goal is reached, as in equation 6, makes the resulting accumulated reward the [opposite of the] distance to the goal in number of steps.

And by considering arbitrary goals, the model actually embodies a distance between any pair state/goals, which is very similar conceptually to a state/state metric.
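The link between a constant negative reward and a distance can be checked on a toy problem. The following is a minimal sketch, assuming a hypothetical 1-D chain of states with two actions (move left or right) and synchronous tabular updates; the function name and the environment are illustrative and not taken from the experiments. With the reward convention "-1 unless the next state is the goal", the optimal undiscounted value of a state at $d$ steps from the goal comes out as $-(d - 1)$, that is the opposite of the distance up to a constant offset of one.

```python
def negative_distance_q(n_states, goal, n_sweeps=None):
    """Tabular goal-conditioned Q-values on a 1-D chain with actions -1 / +1.

    Reward is -1 on every transition whose next state is not the goal and 0
    when the goal is reached; there is no discounting (gamma = 1).
    """
    actions = (-1, +1)
    n_sweeps = n_sweeps or 2 * n_states
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}

    def step(s, a):
        # Hypothetical dynamics: move along the chain, blocked at both ends.
        return min(max(s + a, 0), n_states - 1)

    for _ in range(n_sweeps):
        new_q = {}
        for s in range(n_states):
            for a in actions:
                nxt = step(s, a)
                if nxt == goal:
                    new_q[(s, a)] = 0.0  # goal reached: reward 0, nothing left to pay
                else:
                    new_q[(s, a)] = -1.0 + max(q[(nxt, b)] for b in actions)
        q = new_q
    return q


q = negative_distance_q(n_states=6, goal=4)
values = {s: max(q[(s, -1)], q[(s, +1)]) for s in range(6) if s != 4}
```

In this sketch `values[s]` equals `-(abs(4 - s) - 1)` for every non-goal state, which is the sense in which the accumulated reward encodes a step count.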

2.2 Planning Quasi-Metric

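Similarly to the distance between states proposed by Eysenbach et al. (2019), we explicitly introduce an action-parameterized quasi-metric

$$M : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}_+ \tag{7}$$

such that $M(s, a, s')$ is “the minimum [expected] number of steps to go from $s$ to $s'$ when starting with action $a$”.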

We stress that it is a quasi-metric since it is not symmetric in most of the actual planning setups. Consider for instance one-way streets for an autonomous urban vehicle, irreversible physical transformations, or inertia for a robotic task, which may make going from $s$ to $s'$ easy and the reciprocal transition difficult.

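Given an arbitrary target state $s'$, the update of $M$ should minimize

$$\left( M(s_t, a_t, s_{t+1}) - 1 \right)^2 + \left( M(s_t, a_t, s') - \left( 1 + \min_{a} \bar{M}(s_{t+1}, a, s') \right) \right)^2 \tag{8}$$

where the first term makes the quasi-metric equal to 1 between successive states, and the second makes it globally consistent with the best policy, following Bellman’s equation. As in equation 3, $\bar{M}$ is a “target model”, usually updated less frequently or through a stabilizing moving average.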
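To make the two terms of this update concrete, here is a minimal tabular sketch, assuming a hypothetical deterministic environment (a one-way ring of four states with a "stay" and a "move forward" action). It replaces the gradient updates by exact fixed-point sweeps, so it illustrates the target values rather than the training procedure; all names are illustrative.

```python
def tabular_pqm(n_states, actions, step, n_sweeps):
    """Fixed-point sweeps for a tabular quasi-metric M[(s, a, target)].

    M is 1 when action a leads directly to the target, and otherwise
    1 + min_b M(next_state, b, target), mirroring the two terms of the update.
    """
    m = {(s, a, t): 0.0
         for s in range(n_states) for a in actions for t in range(n_states)}
    for _ in range(n_sweeps):
        new_m = {}
        for (s, a, t) in m:
            nxt = step(s, a)
            if nxt == t:
                new_m[(s, a, t)] = 1.0
            else:
                new_m[(s, a, t)] = 1.0 + min(m[(nxt, b, t)] for b in actions)
        m = new_m
    return m


N_STATES = 4
ACTIONS = (0, 1)  # 0: stay in place, 1: move forward (the ring is one-way)


def ring_step(s, a):
    return (s + a) % N_STATES


M = tabular_pqm(N_STATES, ACTIONS, ring_step, n_sweeps=2 * N_STATES)


def steps_between(s, t):
    """Number of steps from s to t under the best first action."""
    return min(M[(s, a, t)] for a in ACTIONS)
```

On this ring `steps_between(0, 1)` is 1 while `steps_between(1, 0)` is 3, a small instance of the asymmetry that makes the object a quasi-metric rather than a metric.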

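We implement the learning of the PQM with a standard actor/critic structure. First the PQM itself, which plays the role of the critic

$$M_\theta : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}_+ \tag{9}$$

and an actor, which is either an explicit $\arg\min_a M_\theta(s, a, s')$ in the case of a finite set of actions, or a model

$$A_\phi : \mathcal{S} \times \mathcal{S} \rightarrow \mathcal{A} \tag{10}$$

to approximate $\arg\min_a M_\theta(s, a, s')$ when dealing with a continuous action space.

For training, given a tuple $(s_t, a_t, s_{t+1}, s')$ we update $\theta$ to reduce

$$\left( M_\theta(s_t, a_t, s_{t+1}) - 1 \right)^2 + \left( M_\theta(s_t, a_t, s') - \left( 1 + \bar{M}(s_{t+1}, \bar{A}(s_{t+1}, s'), s') \right) \right)^2 \tag{11}$$

and we update $\phi$ to reduce

$$M_\theta(s_t, A_\phi(s_t, s'), s') \tag{12}$$

so that $A_\phi(s_t, s')$ gets closer to the choice of action at $s_t$ that minimizes the remaining distance to $s'$.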

2.3 Planner

Note that while the quasi-metric makes it possible to reach a certain state by choosing at any moment the action that decreases the distance to it the most, it does not by itself make it possible to reach a more abstract “goal”, defined as a set of states. This objective is not trivial: the two objects are defined at completely different scales, the latter possibly ignoring virtually all the degrees of freedom of the former.

Hence, to use the PQM to actually reach goals, a key element is missing to pick the “ideal state” that (1) is in the goal but also (2) is the easiest to reach from the state currently occupied. For this purpose we introduce the idea of planner (see figure 1)

$$P : \mathcal{S} \times \mathcal{G} \rightarrow \mathcal{S} \tag{13}$$

such that $P(s, g)$ is the “best” target state, that is the state in $g$ closest to $s$:

$$P(s, g) = \arg\min_{s' \in g} \, \min_{a} M(s, a, s') \tag{14}$$
The key notion in this formulation is that we can have multiple planners dedicated to as many goal spaces, that utilize the same quasi-metric, which is in charge of the heavy lifting of “understanding” the underlying dynamics of the environment.

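We follow the idea of the actor for the action choice, and do not implement the planner by explicitly solving the system of equation 14 but introduce a parameterized model

$$P_\psi : \mathcal{S} \times \mathcal{G} \rightarrow \mathcal{S} \tag{15}$$

For training, given a pair $(s, g)$ we update $\psi$ to reduce

$$M_\theta\left(s, A_\phi(s, P_\psi(s, g)), P_\psi(s, g)\right) + \lambda \, d\left(P_\psi(s, g), g\right) + \mu \, v\left(P_\psi(s, g)\right) \tag{16}$$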
The first term is an estimate of the objective of the problem (14), that is the distance between the current state and the planner's target state, where the actor’s prediction plays the role of the minimization over actions in the original problem.

The second term is a penalty replacing the hard constraints of (14) with a distance to a set. That latter distance is in practice a norm over a subset of the state’s coordinates. We come back to this with more details in § 3.

The third term is a penalty for imposing the validity of the state, for instance ensuring that speed or angles remain in valid ranges.

The resulting policy combines the actor $A_\phi$ and the planner $P_\psi$. Given the current state $s$ and the goal $g$, the chosen action is $A_\phi(s, P_\psi(s, g))$.
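The way a planner and an actor combine into a policy can be illustrated without any learning. The sketch below assumes a small bit-flip environment in which the quasi-metric is known exactly, and implements the planner as a brute-force search over the states of the goal rather than as a parameterized model. The function names, the representation of a goal as a mapping from bit index to required value, and the convention that a state is at distance 0 from itself are assumptions made for the illustration.

```python
from itertools import product


def flip(state, bit):
    """Bit-flip dynamics: action `bit` toggles that coordinate."""
    return tuple(b ^ 1 if i == bit else b for i, b in enumerate(state))


def metric(state, action, target):
    """Exact M(s, a, s') for bit-flip: one step, then the remaining Hamming distance."""
    nxt = flip(state, action)
    return 1 + sum(x != y for x, y in zip(nxt, target))


def distance(state, target):
    """min_a M(s, a, s'), assuming a state is at distance 0 from itself."""
    if state == target:
        return 0
    return min(metric(state, a, target) for a in range(len(state)))


def planner(state, goal):
    """Brute-force planner: the state of the goal that is closest to `state`."""
    candidates = (
        s for s in product((0, 1), repeat=len(state))
        if all(s[i] == v for i, v in goal.items())
    )
    return min(candidates, key=lambda s: distance(state, s))


def policy(state, goal):
    """Chosen action: the one minimizing the distance to the planned target."""
    target = planner(state, goal)
    return min(range(len(state)), key=lambda a: metric(state, a, target))


state = (0, 1, 1, 0, 1)
goal = {0: 1, 1: 0}            # only the first two bits matter for this goal
target = planner(state, goal)  # (1, 0, 1, 0, 1): free bits are left unchanged
```

The planned target keeps every bit that the goal does not constrain, since changing them would only lengthen the path; a learned planner is expected to approximate this behavior without enumerating the goal.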

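  • a PQM critic $M_\theta$ and an actor $A_\phi$,

  • a goal space $\mathcal{G}$ and its associated planner $P_\psi$,

  • a goal sampling strategy and a replay buffer.

  Initialize $\theta$, $\phi$ and $\psi$
  for each episode do
     Sample a goal $g \in \mathcal{G}$ and an initial state $s_0$
     for each time step $t$ of the episode do
        Compute the target state: $s' = P_\psi(s_t, g)$
        Sample an action $a_t$ from the actor, given $s_t$ and $s'$
        Execute the action $a_t$ and observe a new state $s_{t+1}$
        Store the transition in the replay buffer
        Sample a set of additional target states for replay (current episode)
        for each additional target state do
           Store the corresponding transition in the replay buffer
        end for
     end for
     for each optimization step do
        Sample a minibatch from the replay buffer
        Perform a SGD step on $\theta$, $\phi$ and $\psi$ using the critic, actor and planner losses of § 2.2 and § 2.3
     end for
  end for
Algorithm 1 Training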
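The storage step of the algorithm, where each transition is duplicated with additional target states taken from the current episode, can be sketched as follows. This assumes that additional targets are drawn among the states visited later in the same episode, in the spirit of the “future” strategy of HER; the function name, the tuple layout `(state, action, next_state, target)` and the parameter `k` are hypothetical choices for the illustration.

```python
import random


def relabel_episode(states, actions, planned_targets, k, rng):
    """Build replay transitions (s_t, a_t, s_{t+1}, target) for one episode.

    `states` holds len(actions) + 1 visited states. Each step is stored once
    with the target computed by the planner, then k more times with targets
    sampled among the states reached later in the same episode.
    """
    transitions = []
    for t, action in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        transitions.append((s, action, s_next, planned_targets[t]))
        # Hindsight targets: states that were actually reached after step t.
        future = states[t + 1:]
        for target in rng.choices(future, k=k):
            transitions.append((s, action, s_next, target))
    return transitions


rng = random.Random(0)
episode_states = [0, 1, 2, 3, 4, 5]   # toy states, indexed by time
episode_actions = ["a"] * 5
buffer = relabel_episode(episode_states, episode_actions, [99] * 5, k=2, rng=rng)
```

Because every hindsight target was reached from the stored state, these transitions give the quasi-metric a training signal even when the original goal is never attained.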

3 Experiments

We have validated our approach experimentally in PyTorch (Paszke et al., 2019), on two standard environments: the bit-flipping problem (see § 3.1), known to be particularly challenging to traditional RL approaches relying on sparse rewards, and the MuJoCo simulator (see § 3.2), which exhibits some key difficulties of real robotic tasks. Our software to reproduce the experiments will be available under an open-source license at the time of publication.

As observed by many other practitioners, deep learning in general, and reinforcement learning in particular, require heavy optimization of meta-parameters, both related to the regressors’ architectures (number of layers, layer size, non-linearities, etc.) and the optimization itself (batch size, sampling strategy, SGD parameterization, etc.).

This translates to large computational budgets to obtain an optimal setup and performance. The experimental results presented in this section were obtained with roughly 250 vCPU cores for one month, which is far less than the requirement for some state-of-the-art results. It forced us to only coarsely adapt configurations optimized in previous works for more classical and consequently quite different approaches.

Figure 2: Empirical mean and one standard deviation confidence interval of the success rate (left) and time to goal (right) on the bit-flip task (see § 3.1.1) of three different algorithms: a standard deep Q-learning (DQN), our approach that combines a planning quasi-metric with a planner (PQM), and the same with transfer of the quasi-metric trained on a task where the bits to match to reach the goals were different (PQM w/ transfer).
Figure 3: Accuracy of the quasi-metric estimate on the bit-flip task (see § 3.1.1). We plot here the empirical mean and one standard deviation confidence interval of the PQM estimate vs. the true distance, which here is the number of differing bits, hence the Hamming distance. The left figure is obtained when the starting state and the goal state differ only on the bits that are relevant to the task, hence are consistent with the planner, and the right figure when both are taken at random.

3.1 Bit-flip

3.1.1 Environment and tasks

The state space for this first environment is a Boolean vector of $n$ bits, and there are $n$ actions, each switching one particular bit of the state.

We fix and define two tasks, corresponding to reaching a target configuration respectively for the first bits and the last bits. A goal in these tasks is defined by the target configuration of bits.

The difficulty in this environment is the cardinality of the state space, the lack of geometrical structure, and the average time it takes to go from any state to any other state under a random policy.
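For readers unfamiliar with this environment, a minimal sketch is given below. It is not the implementation used for the experiments; the class name, its methods, and the representation of a task as a tuple of bit indices are assumptions made for the illustration.

```python
import random


class BitFlipEnv:
    """Bit-flip environment: the state is a tuple of bits, action i toggles bit i."""

    def __init__(self, n_bits, task_bits, seed=0):
        self.n_bits = n_bits
        self.task_bits = tuple(task_bits)  # coordinates that define the task
        self.rng = random.Random(seed)
        self.state = None

    def sample_goal(self):
        """A goal is a target configuration for the task bits only."""
        return tuple(self.rng.randint(0, 1) for _ in self.task_bits)

    def reset(self):
        self.state = tuple(self.rng.randint(0, 1) for _ in range(self.n_bits))
        return self.state

    def step(self, action):
        self.state = tuple(
            b ^ 1 if i == action else b for i, b in enumerate(self.state)
        )
        return self.state

    def reached(self, state, goal):
        """True when the state belongs to the goal, i.e. matches it on the task bits."""
        return all(state[i] == v for i, v in zip(self.task_bits, goal))


env = BitFlipEnv(n_bits=8, task_bits=range(4))
start = env.reset()
goal = env.sample_goal()
done = env.reached(start, goal)
```

Two tasks on the same environment, for example one on the first bits and one on the last bits, differ only in `task_bits`, while `step` and therefore the dynamics are shared.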

To demonstrate the transferability of the quasi-metric, we also consider a task defined by the first 15 bits, with transfer from a task defined by the last 15 bits. After training an agent on one, we train a new agent on the other but keep the parameters of the quasi-metric and the actor.

3.1.2 Network architectures and training

The critic is implemented as , where is a ReLU MLP with input units, one hidden layer with neurons, and outputs. As indicated in § 2.2, the actor for this environment is an explicit $\arg\min$ over the actions. The planner is implemented also as a ReLU MLP with input units corresponding to the concatenation of a state and a goal definition, one hidden layer with neurons, and output neurons, with a final sigmoid non-linearity.

The length of an episode is equal to the number of bits in the goal, which is twice the median of the optimal number of actions.

We kept the meta-parameters as selected by Plappert et al. (2018, appendix B), and chose . There is no term imposing the validity of the planner output, hence no parameter .

Following algorithm 1, we train for epochs, each consisting of running the policy for episodes and then performing optimization steps on minibatches of size sampled uniformly from a replay buffer consisting of transitions. We use the “future” strategy of HER for the selection of goals (Andrychowicz et al., 2017). We update the target networks after every optimization step using the decay coefficient of .
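The target-network update mentioned above is commonly implemented as an exponential moving average of the online parameters. The following sketch assumes that convention, with the decay coefficient weighting the previous target value; parameters are represented as plain lists of floats, and the function name is illustrative.

```python
def soft_update(target_params, online_params, decay):
    """Return updated target parameters.

    Assumed convention: target <- decay * target + (1 - decay) * online,
    so a decay close to 1 makes the target network change slowly.
    """
    return [
        decay * t + (1.0 - decay) * o
        for t, o in zip(target_params, online_params)
    ]


target = [0.0, 2.0]
online = [1.0, 2.0]
target = soft_update(target, online, decay=0.5)  # decay value chosen for the example
```

With a decay of 1 the target never moves, and with a decay of 0 it is simply a copy of the online network.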

3.1.3 Results

The experiments in the bit-flip environment show the advantage of using a planning quasi-metric. As shown on figure 2, the training is successful and the combination of the quasi-metric and the planner results in a policy similar to that of the standard DQN, both in terms of success rate and in terms of time to goal. It also appears that, while this model is slightly harder to train on a single task compared to DQN, it provides a great performance boost when transferring the PQM between tasks: since the quasi-metric is pre-trained, the training process only needs to learn a planner, which is a simpler object.

It is noteworthy that due to limited computational means, we kept essentially the meta-parameters of the DQN setup of Andrychowicz et al. (2017), which were heavily optimized for a different architecture, and as such the comparison is biased to DQN’s advantage.

Figure 3 gives a clearer view of the accuracy of the metric alone. We have computed after training the value of the PQM estimate for pairs of starting states / target states taken at random, and compared it to the “true” distance, which in that environment happens to be the Hamming distance, that is the number of bits that differ between the two.

For the tasks in this environment, the planner predicts a target state whose bits that matter for the task are the goal configuration, and the others are unchanged from the starting state, since this corresponds to the shortest path. Hence we considered two groups of state pairs: Either “in task”, which means that the two states are consistent with the planner prediction in the task, and differ only on the bits that matter for the task, or “random” in which case they are arbitrary, and hence may be inconsistent with the biased statistic observed during training.

The results show that the estimate of the quasi-metric is very accurate on the first group, less so on the second, but still strongly monotonic. This is consistent with the transfer providing a substantial boost to the training on a new task.
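The comparison summarized in figure 3 amounts to grouping state pairs by their true distance and computing the mean and standard deviation of the estimate in each group. A minimal sketch of that bookkeeping is given below; the `estimate` callable stands for a trained quasi-metric and is replaced here by a hypothetical biased estimator, and all names are illustrative.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hamming(s, t):
    """True distance in the bit-flip environment: number of differing bits."""
    return sum(x != y for x, y in zip(s, t))


def estimate_vs_true(pairs, estimate, true_distance):
    """Map each true distance to the (mean, std) of the estimates for pairs at that distance."""
    buckets = defaultdict(list)
    for s, t in pairs:
        buckets[true_distance(s, t)].append(estimate(s, t))
    return {d: (mean(v), pstdev(v)) for d, v in sorted(buckets.items())}


def biased_estimate(s, t):
    # Hypothetical stand-in for a learned quasi-metric that over-estimates by 0.5.
    return hamming(s, t) + 0.5


pairs = [((0, 0), (0, 1)), ((1, 1), (0, 0)), ((0, 1), (1, 0))]
stats = estimate_vs_true(pairs, biased_estimate, hamming)
```

An estimate that is monotonic in the true distance, even if biased as in this toy example, still ranks candidate actions and target states correctly.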

3.2 MuJoCo

Figure 4: Trajectory of the robotic arm for the “push” task when transferring the quasi-metric from the “pick and place” task. The quasi-metric has properly modeled how to move the arm at a desired location in contact with the black box, and needs additional training to manipulate the box without control of the gripper (see § 3.2.1).
Figure 5: Empirical mean and one standard deviation confidence interval of the success rate (left) and time to goal (right) on the MuJoCo “pick and place” and “push” tasks (see § 3.2.1). We compare the performance of the Deep Deterministic Policy Gradient (DDPG), with our approach that combines a planning quasi-metric with a planner (PQM). These curves show that although the metric is harder to learn than the policy alone, the joint learning of the two models is successful.
Figure 6: Empirical mean and one standard deviation confidence interval of the success rate (left) and time to goal (right) on the MuJoCo “push” task (see § 3.2.1). We compare the performance of the Deep Deterministic Policy Gradient (DDPG), with our approach that combines a planning quasi-metric with a planner (PQM), and the same with the transfer of the quasi-metric trained on the “pick and place” task. The curves show a boost in early training thanks to the pre-trained quasi-metric.
Figure 7: Accuracy of the quasi-metric on the MuJoCo “push” and “pick and place” tasks (see § 3.2.1). We plot here the empirical mean and one standard deviation confidence interval of the PQM estimate of the distance to $s'$ at the beginning of a successful episode, where $s'$ is the target state estimated by the planner, vs. the actual number of steps it took to reach $s'$.

3.2.1 Environment and tasks

For our second set of experiments, we use the “Fetch” environments of OpenAI gym (Brockman et al., 2016), which use the MuJoCo physics engine (Todorov et al., 2012), are pictured in figure 4, and are described in detail by Plappert et al. (2018, section 1.1). If left unspecified, the details of our experiments are the same as indicated by Plappert et al. (2018, section 1.4), and Andrychowicz et al. (2017, appendix A).

We consider two tasks: “push”, where a box is placed at random on the table and the robot’s objective is to move it to a desired location also on the table, without using the gripper, and “pick and place”, in which the robot can control its gripper, and the desired location for the box may be located above the table surface.

To demonstrate the transferability of the quasi-metric in this environment, we also consider the “push” task with transfer from “pick and place”: after training an agent on the latter, we train a new agent on the former, but initialize the parameters of the quasi-metric and of the actor with the values obtained with the previous agent.

3.2.2 Network architectures and training

In what follows let be the dimension of the state space , the dimension of the action space , the dimension of the goal parameter, which corresponds to the desired spatial location of the manipulated box.

The critic is implemented with a ReLU MLP, with input units, corresponding to the concatenation of two states and an action, three hidden layers with units each, and a single output unit. The actor is a ReLU MLP with input units, three hidden layers with units each, and output units with non-linearity. As Plappert et al. (2018), we also add a penalty to the actor’s loss equal to the square of the output layer pre-activations. Finally the planner is a ReLU MLP with input units, three hidden layers of units, and output units.

Following algorithm 1, for “pick and place” and “push with transfer”, we train for epochs. For “push”, we train for epochs. As in Plappert et al. (2018), each epoch consists of cycles, and each cycle consists of running the policy for episodes and then performing optimization steps on minibatches of size sampled uniformly from a replay buffer consisting of transitions. We use the “future” strategy of HER for the selection of goals (Andrychowicz et al., 2017), and update the target networks after every cycle using the decay coefficient of .

We kept the meta-parameters as selected in Plappert et al. (2018, appendix B), with an additional grid search over the number of hidden neurons in the actor and critic models in , in , and in .

For each combination, we trained a policy on the “push” task, and eventually selected the combination with the highest rolling median success rate over epochs, resulting in hidden neurons, , and .

All hyperparameters are described in greater detail by Andrychowicz et al. (2017).

3.2.3 Results

The results obtained in this environment confirm the observations from the bit-flip environment. Figure 5 shows that the joint training of the PQM and planners works properly and results in a policy similar to that obtained with the DDPG approach. Performance is slightly lower, in part due to the limited meta-optimization we could afford, that favors DDPG, and in part due to the difficulty of learning the metric, which is a more complicated functional.

Figures 4 and 6 show the advantage of using the PQM to transfer knowledge from a task to another. Even though the two tasks are quite different, one using the gripper and moving the object in space while holding it, and the other moving only by contact in the plane, the quasi-metric provides an initial boost in the training by providing the ability to position the arm.

Finally, figure 7 shows that the estimate of the quasi-metric accurately reflects the actual distance to the goal state.

4 Related works

The standard Q-learning approach to reinforcement learning consists of learning a policy implicitly through an estimator of the value of state-action pairs, defined as the expected accumulated reward when starting by doing the said action in the said state, and following an optimal policy thereafter (Watkins, 1989). Such an approach has proven extremely effective, in particular when combined with modern deep architectures as value approximators (Mnih et al., 2015). This has been improved by duplicating the model during optimization to avoid over-estimating state values (van Hasselt et al., 2015).

The main weakness of such classical methods is the necessity for large training sets, due to the complexity of the model to learn, and the sparsity of the reward. This latter point has been tackled recently by learning goal-conditioned policies (Schaul et al., 2015; Andrychowicz et al., 2017), where arbitrary observed states may be considered as “synthetic” goals. Doing so makes it possible to leverage a richer structure, where regularities of the environment observed along trajectories to synthetic goals can be transferred to reach the actual ones of interest.

Interestingly, this idea of goal-conditioned policy is combined with a constant negative reward at any time step, which results in an accumulated reward structure having the form of the [opposite of] the distance to the goal. Eysenbach et al. (2019) explicitly consider a distance between states, and not from state to goal, and from there leverage the set of observed states as vertices in a graph that approximates the geodesic distance in the state manifold.

A natural approach to cope with the sparsity of reward and limited number of examples is to leverage structures learned on other tasks. This idea of multi-task learning is not specific to reinforcement learning, and aims at solving the same problem of data scarcity. A very straightforward approach is to train a single model with multiple “heads”, each providing a task-specific prediction (Caruana, 1998).

The same idea of transferring models has been applied to reinforcement learning (Taylor and Stone, 2009), with recent successes using a single model that mimics specialized expert actors on individual tasks (Parisotto et al., 2016). The key issue of bringing several models to a common representation is tackled by normalizing contributions of the different tasks (Hessel et al., 2018). This is in contrast with our proposal, which explicitly leverages being in a similar environment. This corresponds in our view to a more realistic robotic setup for which the embodiment is fixed, and makes it possible to render the shared knowledge explicit.

5 Conclusion

We have proposed to address the action selection problem by modeling separately a quasi-metric between states, and the estimation of a target state given a goal. Experiments show that this approach is valid, and that these two models can be trained jointly to get an efficient policy. As also illustrated in the experiments, the core advantage is that this decomposition moves the bulk of the modeling to the quasi-metric, which can be trained across tasks, with a dense feedback from the environment.

This model is fundamentally multi-task. The quasi-metric captures more than is necessary for any single task and requires both a more complex model and a more expensive learning, as seen in the experiments. However, since it can be shared across tasks it provides a substantial boost to learning a new task, for which only the planner that estimates the target state for a given goal has to be trained, with a minor fine-tuning of the quasi-metric.

By disentangling two very different aspects of the planning, this decomposition is very promising for future extensions. The quasi-metric handles the difficulty of learning a global structure known only through local interactions but is potentially amenable to the triangular inequality, clustering methods, and dimension reduction. The planners are easier to learn and may be improved with a specific class of regressors taking advantage of a coarse-to-fine structure: your final destination can be initially coarsely defined and refined along your way.

The metric structure over the state space opens the way to more complex planning approaches, where intermediate states to reach are “imagined” with an explicit path-search algorithm, combining tree-search and deep-learning estimates, in a way similar to MCTS for Go (Silver et al., 2016; Schrittwieser et al., 2019).

Finally, it is also promising for deriving quantities related to the remaining metric uncertainty over state predictions, which could possibly be leveraged as a curiosity measure (Pathak et al., 2017).


  • M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. CoRR abs/1707.01495. Cited by: §1, §1, §2.1, §3.1.2, §3.1.3, §3.2.1, §3.2.2, §3.2.2, §4.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. CoRR abs/1606.01540. Cited by: §3.2.1.
  • R. Caruana (1998) Multitask learning. In Learning to Learn, pp. 95–133. Cited by: §4.
  • B. Eysenbach, R. Salakhutdinov, and S. Levine (2019) Search on the replay buffer: bridging planning and reinforcement learning. CoRR abs/1906.05253. Cited by: §2.1, §2.2, §4.
  • H. V. Hasselt (2010) Double q-learning. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.), pp. 2613–2621. Cited by: §2.1.
  • M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt (2018) Multi-task deep reinforcement learning with popart. CoRR abs/1809.04474. Cited by: §4.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §4.
  • E. Parisotto, J. L. Ba, and R. Salakhutdinov (2016) Actor-mimic: deep multitask and transfer reinforcement learning. In International Conference on Learning Representations (ICLR), Cited by: §4.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Neural Information Processing Systems (NeurIPS), pp. 8024–8035. Cited by: §3.
  • D. Pathak, P. Agrawal, A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. CoRR abs/1705.05363. Cited by: §5.
  • M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba (2018) Multi-goal reinforcement learning: challenging robotics environments and request for research. CoRR abs/1802.09464. External Links: 1802.09464 Cited by: §3.1.2, §3.2.1, §3.2.2, §3.2.2, §3.2.2.
  • T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal value function approximators. In International Conference on Machine Learning (ICML), Vol. 37, pp. 1312–1320. Cited by: §1, §2.1, §4.
  • J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver (2019) Mastering atari, go, chess and shogi by planning with a learned model. CoRR abs/1911.08265. Cited by: §5.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of go with deep neural networks and tree search. Nature 529, pp. 484–503. Cited by: §5.
  • M. E. Taylor and P. Stone (2009) Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research (JMLR) 10, pp. 1633–1685. Cited by: §4.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §1, §2, §3.2.1.
  • H. van Hasselt, A. Guez, and D. Silver (2015) Deep reinforcement learning with double q-learning. CoRR abs/1509.06461v3. Cited by: §4.
  • C. J. C. H. Watkins (1989) Learning from delayed rewards. Ph.D. Thesis, King’s College, Cambridge, UK. Cited by: §2.1, §4.