Diversity-based Trajectory and Goal Selection with Hindsight Experience Replay

08/17/2021 ∙ by Tianhong Dai, et al. ∙ Imperial College London, ARAYA

Hindsight experience replay (HER) is a goal relabelling technique typically used with off-policy deep reinforcement learning algorithms to solve goal-oriented tasks; it is well suited to robotic manipulation tasks that deliver only sparse rewards. In HER, both trajectories and transitions are sampled uniformly for training. However, not all of the agent's experiences contribute equally to training, and so naive uniform sampling may lead to inefficient learning. In this paper, we propose diversity-based trajectory and goal selection with HER (DTGSH). Firstly, trajectories are sampled according to the diversity of the goal states as modelled by determinantal point processes (DPPs). Secondly, transitions with diverse goal states are selected from the trajectories by using k-DPPs. We evaluate DTGSH on five challenging robotic manipulation tasks in simulated robot environments, where we show that our method can learn more quickly and reach higher performance than other state-of-the-art approaches on all tasks.


1 Introduction

Deep reinforcement learning (DRL) [3], in which neural networks are used as function approximators for reinforcement learning (RL), has been shown to be capable of solving complex control problems in several environments, including board games [26, 27], video games [4, 19, 29], simulated and real robotic manipulation [2, 9, 15] and simulated autonomous driving [12].

However, learning from a sparse reward signal, where the only reward is provided upon completion of a task, remains difficult. In a sparse-reward environment, an agent may rarely or never encounter positive examples from which to learn. Many domains therefore provide dense reward signals [5], or practitioners may turn to reward shaping [20]. However, designing dense reward functions typically requires prior domain knowledge, making this approach difficult to generalise across different environments.

Fortunately, a common scenario is goal-oriented RL, where the RL agent is tasked with solving different goals within the same environment [11, 24]. Even if each task has a sparse reward, the agent ideally generalises across goals, making the learning process easier. For example, in a robotic manipulation task, the goal during a single episode would be to achieve a specific position of a target object.

Hindsight experience replay (HER) [1] was proposed to improve the learning efficiency of goal-oriented RL agents in sparse reward settings: when past experience is replayed to train the agent, the desired goal is replaced (in “hindsight”) with the achieved goal, generating many positive experiences. In the above example, the desired target position would be overwritten with the achieved target position, with the achieved reward also being overwritten correspondingly.

We note that HER, whilst it enabled solutions to previously unsolved tasks, can be inefficient in its uniform sampling of transitions during training. In the same way that prioritised experience replay [25] significantly improved over standard experience replay in RL, several approaches have improved upon HER by using data-dependent sampling [8, 31]. HER with energy-based prioritisation (HEBP) [31] assumes semantic knowledge about the goal space and uses the energy of the target objects to sample trajectories with high energies, and then samples transitions uniformly. Curriculum-guided HER (CHER) [8] samples trajectories uniformly, and then samples transitions based on a mixture of proximity to the desired goal and the diversity of the samples; CHER adapts the weighting of these factors over time. In this work, we introduce diversity-based trajectory and goal selection with HER (DTGSH; see Fig. 1), which samples trajectories based on the diversity of the goals achieved within the trajectory, and then samples transitions based on the diversity of the set of samples. DTGSH is evaluated on five challenging robotic manipulation tasks. In extensive experiments, our proposed method converges faster and reaches higher rewards than prior work, without requiring domain knowledge [31] or tuning a curriculum [8], and is based on a single concept: determinantal point processes (DPPs) [13].

Figure 1: Overview of DTGSH. Every time a new episode is completed, its diversity is calculated, and it is stored in the episodic replay buffer. During training, episodes are sampled according to their diversity-based priority, and then diverse, hindsight-relabelled transitions are sampled using a $k$-DPP [14].

2 Background

2.1 Reinforcement Learning

RL is the study of agents interacting with their environment in order to maximise their reward, formalised using the framework of Markov decision processes (MDPs) [28]. At each timestep $t$, an agent receives a state $s_t$ from the environment, and then samples an action $a_t$ from its policy $\pi(a_t \mid s_t)$. Next, the action $a_t$ is executed in the environment to get the next state $s_{t+1}$, and a reward $r_t$. In the episodic RL setting, the objective of the agent is to maximise its expected return over a finite trajectory with length $T$:

$$\mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right], \tag{1}$$

where $\gamma \in [0, 1]$ is a discount factor that exponentially downplays the influence of future rewards, reducing the variance of the return.
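As a minimal illustration of Eq. (1), the discounted return of a finite trajectory can be computed by folding backwards over the rewards (the function name is ours, not from the paper):

```python
def discounted_return(rewards, gamma=0.98):
    """Compute sum_{t=1}^{T} gamma^(t-1) * r_t for a finite trajectory (Eq. 1)."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret  # Horner-style accumulation of the discounted sum
    return ret

# e.g. a sparse-reward episode that only succeeds at the final step:
discounted_return([0.0, 0.0, 1.0], gamma=0.5)  # -> 0.25 (= 0.5^2 * 1)
```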

2.2 Goal-oriented Reinforcement Learning

RL can be expanded to the multi-goal setting, where the agent’s policy and the environment’s reward function are also conditioned on a goal $g$ [11, 24]. In this work, we focus on the goal-oriented setting and environments proposed by OpenAI [22].

In this setting, every episode comes with a desired goal $g$, which specifies the desired configuration of a target object in the environment (which could include the agent itself). At every timestep $t$, the agent is also provided with the currently achieved goal $g^{ac}_t$. A transition in the environment is thus denoted as $(s_t, a_t, r_t, s_{t+1}, g, g^{ac}_t)$. The environment provides a sparse reward function, where a negative reward is given unless the achieved goal is within a small distance $\epsilon$ of the desired goal:

$$r(g^{ac}_t, g) = \begin{cases} 0 & \text{if } \|g^{ac}_t - g\| \le \epsilon \\ -1 & \text{otherwise.} \end{cases} \tag{2}$$

However, in this setting, the agent is unlikely to achieve a non-negative reward through random exploration. To overcome this, HER provides successful experiences for the agent to learn from by relabelling transitions during training: the agent trains on a hindsight desired goal $g'$, which is set to an achieved goal $g^{ac}$, with the reward recomputed using the environment reward function (Eq. (2)).
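The sparse reward of Eq. (2) and hindsight relabelling can be sketched as follows. This is an illustrative sketch, not the authors' code: the helper names, the transition layout, the threshold value, and the use of HER's "future" relabelling strategy (taking the achieved goal of a later timestep as the hindsight goal) are our assumptions.

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, eps=0.05):
    # Eq. (2): 0 when within a small distance eps of the desired goal, -1 otherwise.
    dist = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    return 0.0 if dist <= eps else -1.0

def her_relabel(transitions, t, rng):
    """Relabel transition t of an episode with an achieved goal from a later
    timestep (HER 'future' strategy), recomputing the reward via Eq. (2).
    Transitions are (s, a, r, s_next, g, g_ac) tuples (our layout)."""
    s, a, r, s_next, g, g_ac = transitions[t]
    future = rng.integers(t, len(transitions))  # pick a future timestep
    g_hindsight = transitions[future][5]        # its achieved goal becomes the new desired goal
    r_new = sparse_reward(g_ac, g_hindsight)    # recompute the reward under the new goal
    return (s, a, r_new, s_next, g_hindsight, g_ac)
```

Relabelling the final transition with its own achieved goal always yields a reward of 0, which is exactly how HER manufactures positive examples in an otherwise sparse-reward episode.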

2.3 Deep Deterministic Policy Gradient

Deep deterministic policy gradient (DDPG) [16] is an off-policy actor-critic DRL algorithm for continuous control tasks, and is used as the baseline algorithm for HER [1, 8, 31]. The actor is a policy network $\pi(s; \theta)$, parameterised by $\theta$, which outputs the agent’s actions. The critic $Q(s, a; \phi)$ is a state-action-value function approximator, parameterised by $\phi$, which estimates the expected return following a given state-action pair. The critic is trained by minimising $\mathcal{L}_c = \mathbb{E}\left[(y_t - Q(s_t, a_t; \phi))^2\right]$, where $y_t = r_t + \gamma Q(s_{t+1}, \pi(s_{t+1}; \theta); \phi)$. The actor is trained by maximising $\mathcal{L}_a = \mathbb{E}\left[Q(s_t, \pi(s_t; \theta); \phi)\right]$, backpropagating through the critic. Further implementation details can be found in prior work [1, 16].

2.4 Determinantal Point Processes

A DPP [13] is a stochastic process that characterises a probability distribution over sets of points using the determinant of some function. In machine learning it is often used to quantify the diversity of a subset, with applications such as video [18] and document summarisation [10].

Formally, for a discrete set of points $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$, a point process $\mathcal{P}$ is a probability measure over all $2^{|\mathcal{X}|}$ subsets. $\mathcal{P}$ is a DPP if a random subset $\mathbf{Y}$ is sampled with probability:

$$\mathcal{P}_L(\mathbf{Y} = Y) = \frac{\det(L_Y)}{\det(L + I)}, \tag{3}$$

where $Y \subseteq \mathcal{X}$, $I$ is the identity matrix, $L \in \mathbb{R}^{N \times N}$ is the positive semi-definite DPP kernel matrix, and $L_Y$ is the sub-matrix of $L$ with rows and columns indexed by the elements of the subset $Y$.

The kernel matrix $L$ can be represented as the Gram matrix $L = M^\top M$, where each column of $M$ is the feature vector of an item in $\mathcal{X}$. The determinant, $\det(L_Y)$, represents the (squared) volume spanned by the feature vectors of the items in $Y$. From a geometric perspective, feature vectors that are closer to being orthogonal to each other span a larger volume, have a larger determinant, and are hence more likely to be sampled together. Using orthogonality as a measure of diversity, we leverage DPPs to sample diverse trajectories and goals.

3 Related Work

The proposed work is built on HER [1] as a way to effectively augment goal-oriented transitions from a replay buffer: to address the problem of sparse rewards, transitions from unsuccessful trajectories are turned into successful ones. HER uses an episodic replay buffer, with uniform sampling over trajectories, and uniform sampling over transitions. However, these samples may be redundant, and many may contribute little to the successful training of the agent.

In the literature, some efforts have been made to increase the efficiency of HER by prioritising more valuable episodes/transitions. Motivated by the work-energy principle in physics, HEBP [31] assigns higher probability to trajectories in which the target object has higher energy; once the episodes are sampled, the transitions are then sampled uniformly. However, HEBP requires knowing the semantics of the goal space in order to calculate the probability, which is proportional to the sum of the target’s potential, kinetic and rotational energies.

CHER [8] dynamically controls the sampling of transitions during training based on a mixture of goal proximity and diversity. Firstly, episodes are uniformly sampled from the episodic replay buffer, and then a minibatch of transitions is sampled according to the current state of the curriculum. The curriculum initially biases sampling towards achieved goals that are close to the desired goal (requiring a distance function), and later biases sampling towards diverse goals, using a $k$-nearest neighbour graph and a submodular function to more efficiently sample a diverse subset of goals (using the same distance function).

Other work has expanded HER in orthogonal directions. Hindsight policy gradients [23] and episodic self-imitation learning [6] apply HER to improve the efficiency of goal-based on-policy algorithms. Dynamic HER [7] and competitive ER [17] expand HER to the dynamic goal and multi-agent settings, respectively.

The use of DPPs in RL has been more limited, with applications towards modelling value functions of sets of agents in multiagent RL [21, 30].

4 Methodology

We now formally describe the two main components of our method, DTGSH: 1) a diversity-based trajectory selection module, which samples valuable trajectories for subsequent goal selection; and 2) a diversity-based goal selection module, which selects transitions with diverse goal states from the previously selected trajectories. Together, these modules select informative transitions covering a large area of the goal space, improving the agent’s ability to learn and generalise.

4.1 Diversity-based Trajectory Selection

We propose a diversity-based prioritisation method to select valuable trajectories for efficient training. Related to HEBP’s prioritisation of high-energy trajectories [31], we hypothesise that trajectories that achieve diverse goal states are more valuable for training; however, unlike HEBP, we do not require knowledge of the goal space semantics.

In a robotic manipulation task, the agent needs to move a target object from its initial position to the target position. If the agent never moves the object, then despite hindsight relabelling it will not learn information that directly helps with task completion. On the other hand, if the object moves a lot, hindsight relabelling will help the agent learn about meaningful interactions.

In our approach, DPPs are used to model the diversity of achieved goal states in an episode, or subsets thereof. For a single trajectory $\tau$ of length $T$, with achieved goal states $\{g^{ac}_1, g^{ac}_2, \ldots, g^{ac}_T\}$, we divide it into several partial trajectories $p_i$ of length $b$. That is, with a sliding window of size $b$, a trajectory $\tau$ can be divided into $T - b + 1$ partial trajectories:

$$p_i = \{g^{ac}_i, g^{ac}_{i+1}, \ldots, g^{ac}_{i+b-1}\}, \quad i = 1, 2, \ldots, T - b + 1. \tag{4}$$

The diversity $d_{p_i}$ of each partial trajectory $p_i$ can be computed as:

$$d_{p_i} = \det(L_{p_i}), \tag{5}$$

where $L_{p_i}$ is the kernel matrix of partial trajectory $p_i$:

$$L_{p_i} = M_i^\top M_i, \tag{6}$$

and $M_i = [\hat{g}^{ac}_i, \hat{g}^{ac}_{i+1}, \ldots, \hat{g}^{ac}_{i+b-1}]$, where each $\hat{g}^{ac}_t$ is the $\ell_2$-normalised version of the achieved goal $g^{ac}_t$ [14]. Finally, the diversity $d_\tau$ of trajectory $\tau$ is the sum of the diversities of its constituent partial trajectories:

$$d_\tau = \sum_{i=1}^{T-b+1} d_{p_i}. \tag{7}$$

Similarly to HEBP [31], we use a non-uniform episode sampling strategy. During training, we prioritise sampling episodes proportionally to their diversity; the probability $p(\tau_j)$ of sampling trajectory $\tau_j$ from a replay buffer of size $N_e$ is:

$$p(\tau_j) = \frac{d_{\tau_j}}{\sum_{n=1}^{N_e} d_{\tau_n}}. \tag{8}$$
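The trajectory diversity and sampling priority described above can be sketched directly in NumPy. This is our reading of the method, with a small constant guarding against zero-norm goals:

```python
import numpy as np

def trajectory_diversity(achieved_goals, b=2):
    """Eqs. (4)-(7): sum of det(M_i^T M_i) over sliding windows of length b."""
    G = np.asarray(achieved_goals, dtype=float)
    G = G / np.clip(np.linalg.norm(G, axis=1, keepdims=True), 1e-8, None)  # l2-normalise goals
    T = len(G)
    total = 0.0
    for i in range(T - b + 1):
        M = G[i:i + b].T                 # columns are the b normalised goals of p_i
        total += np.linalg.det(M.T @ M)  # Eqs. (5)-(6): squared spanned volume
    return total                         # Eq. (7)

def sampling_priorities(diversities):
    """Eq. (8): probability of sampling each trajectory from the buffer."""
    d = np.asarray(diversities, dtype=float)
    return d / d.sum()

static = [[1.0, 0.0]] * 5                     # object never moves -> diversity ~0
moving = [[1.0, float(t)] for t in range(5)]  # object moves every step -> diversity > 0
```

An episode in which the object never moves contributes near-zero diversity and is sampled with correspondingly low priority, matching the motivation above.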

4.2 Diversity-based Goal Selection

In prior work [1, 31], after selecting the trajectories from the replay buffer, one transition from each selected trajectory is sampled uniformly to construct a minibatch for training. However, the modified goals in the minibatch might be similar, resulting in redundant information. In order to form a minibatch with diverse goals for more efficient learning, we use $k$-DPPs [14] for sampling goals. Compared to the standard DPP, a $k$-DPP is a conditional DPP where the subset $Y$ has a fixed size $k$, with the probability distribution function:

$$\mathcal{P}^k_L(\mathbf{Y} = Y) = \frac{\det(L_Y)}{\sum_{|Y'|=k} \det(L_{Y'})}. \tag{9}$$

$k$-DPPs are more appropriate for goal selection with a minibatch of fixed size $k$. Given $m$ trajectories sampled from the replay buffer, we first uniformly sample a transition from each of the $m$ trajectories. Finally, a $k$-DPP is used to sample a diverse set of $k$ transitions based on the relabelled goals (which, in this context, we denote as “candidate goals”). Fig. 2(a) gives an example of uniform vs. $k$-DPP sampling, demonstrating the increased coverage of the latter. Fig. 2(b) provides corresponding kernel density estimates; note that the density of the $k$-DPP samples is more uniform over the support of the candidate goal distribution.

(a) Plot of candidate goals and selected goals. $k$-DPP sampling is more likely to sample points from the full span of the goal space.
(b) Kernel density estimation of the distributions of goals. The $k$-DPP leads to a more uniform selection of goals over the support of the goal space.
Figure 2: Visualisation of goals selected from candidate goals of the Push task using either uniform sampling or $k$-DPP sampling, respectively. The candidate goals are distributed over a 2D space. Note that $k$-DPP sampling (right-hand plots) results in a broader span of selected goals compared to uniform sampling (left-hand plots).
0:  Input: set of candidate goal states $G = \{g_1, g_2, \ldots, g_m\}$, minibatch size $k$
1:  $J \leftarrow \varnothing$, $l \leftarrow k$
2:  Calculate the DPP kernel matrix $L_G$
3:  $\{(v_n, \lambda_n)\}_{n=1}^{m} \leftarrow$ eigendecomposition of $L_G$
4:  $e_l^n \leftarrow$ the $l$-th elementary symmetric polynomial of $\lambda_1, \ldots, \lambda_n$
5:  for $n = m, \ldots, 1$ do
6:     if $u \sim U[0, 1] < \lambda_n \, e_{l-1}^{n-1} / e_l^n$ then
7:        $J \leftarrow J \cup \{n\}$, $l \leftarrow l - 1$
8:        if $l = 0$ then
9:           break
10:        end if
11:     end if
12:  end for
13:  $V \leftarrow \{v_n\}_{n \in J}$, $Y \leftarrow \varnothing$
14:  while $|V| > 0$ do
15:     Select $g_i$ from $G$ with probability $P(i) = \frac{1}{|V|} \sum_{v \in V} (v^\top e_i)^2$, where $e_i$ is the standard basis
16:     $Y \leftarrow Y \cup \{g_i\}$
17:     $V \leftarrow V_\perp$, an orthonormal basis for the subspace of $V$ orthogonal to $e_i$
18:  end while
19:  return minibatch $Y$ with size $k$
Algorithm 1 Diversity-based Goal Selection using $k$-DPP
0:  Input: RL environment with episodes of length $T$, number of episodes $N$, off-policy RL algorithm $\mathbb{A}$, episodic replay buffer $\mathcal{B}$, number of algorithm updates $U$, candidate transition set size $m$, minibatch size $k$
1:  Initialise the parameters of all models in $\mathbb{A}$
2:  $\mathcal{B} \leftarrow \varnothing$
3:  for $i = 1, \ldots, N$ do
4:     Sample a desired goal $g$ and an initial state $s_0$; start a new episode
5:     for $t = 0, \ldots, T-1$ do
6:        Sample an action $a_t$ using the policy $\pi(s_t, g)$
7:        Execute action $a_t$ and get the next state $s_{t+1}$ and achieved goal state $g^{ac}_{t+1}$
8:        Calculate $r_t$ according to Eq. (2)
9:        Store transition $(s_t, a_t, r_t, s_{t+1}, g, g^{ac}_t)$ in $\mathcal{B}$
10:     end for
11:     Calculate the diversity score $d_\tau$ of the current episode using Eq. (5) and Eq. (7)
12:     Calculate the diversity-based priority $p(\tau_j)$ of each episode in $\mathcal{B}$ using Eq. (8)
13:     for $u = 1, \ldots, U$ do
14:        Sample $m$ trajectories from $\mathcal{B}$ according to priority $p(\tau_j)$
15:        Uniformly sample one transition from each of the $m$ trajectories
16:        Relabel goals in each transition and recompute the reward to get $m$ candidate transitions
17:        Sample a minibatch with size $k$ from the candidates using Alg. 1
18:        Optimise $\mathbb{A}$ with the minibatch
19:     end for
20:  end for
Algorithm 2 Diversity-based Trajectory and Goal Selection with HER

Alg. 1 shows the details of the goal selection subroutine, and Alg. 2 gives the overall algorithm for our method, DTGSH.
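The goal selection subroutine of Alg. 1, which follows the $k$-DPP sampling scheme of Kulesza and Taskar [14], can be implemented roughly as below. This is our sketch rather than the authors' code; it assumes a Gram kernel built from $\ell_2$-normalised candidate goal vectors:

```python
import numpy as np

def elem_sym_polys(lams, k):
    """E[l, n] = e_l(lambda_1, ..., lambda_n), the elementary symmetric polynomials."""
    N = len(lams)
    E = np.zeros((k + 1, N + 1))
    E[0, :] = 1.0
    for l in range(1, k + 1):
        for n in range(1, N + 1):
            E[l, n] = E[l, n - 1] + lams[n - 1] * E[l - 1, n - 1]
    return E

def sample_k_dpp(goals, k, rng):
    """Sample k diverse indices from m candidate goals via a k-DPP (Alg. 1)."""
    M = np.asarray(goals, dtype=float).T
    M = M / np.clip(np.linalg.norm(M, axis=0, keepdims=True), 1e-8, None)
    L = M.T @ M                          # Gram kernel over the candidates
    lams, vecs = np.linalg.eigh(L)
    lams = np.clip(lams, 0.0, None)      # guard tiny negative eigenvalues
    E = elem_sym_polys(lams, k)
    # Phase 1: select k eigenvectors (lines 1-12 of Alg. 1)
    J, l = [], k
    for n in range(len(lams), 0, -1):
        if l == 0:
            break
        if rng.random() < lams[n - 1] * E[l - 1, n - 1] / E[l, n]:
            J.append(n - 1)
            l -= 1
    V = vecs[:, J]
    # Phase 2: sample one item at a time, projecting V orthogonal to it (lines 13-18)
    Y = []
    while V.shape[1] > 0:
        probs = (V ** 2).sum(axis=1)     # P(i) proportional to sum_v (v^T e_i)^2
        probs /= probs.sum()
        i = rng.choice(len(probs), p=probs)
        Y.append(int(i))
        j = int(np.argmax(np.abs(V[i, :])))          # a column with nonzero entry at i
        V = V - np.outer(V[:, j], V[i, :] / V[i, j]) # zero out coordinate i in all columns
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)                   # re-orthonormalise remaining columns
    return Y
```

Because the projection step zeroes out each selected coordinate, no index is ever drawn twice, so the routine returns exactly $k$ distinct transitions for the minibatch.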

5 Experiments

We evaluate our proposed method, and compare it with current state-of-the-art HER-based algorithms [1, 8, 31] on challenging robotic manipulation tasks [22], pictured in Fig. 3. Furthermore, we perform ablation studies on our diversity-based trajectory and goal selection modules. Our code is based on OpenAI Baselines (https://github.com/openai/baselines), and is available at: https://github.com/TianhongDai/div-hindsight.

(a) Push
(b) PickPlace
(c) EggFull
(d) BlockRotate
(e) PenRotate
Figure 3: Robotic manipulation environments. (a-b) use the Fetch robot, and (c-e) use the Shadow Dexterous Hand.

5.1 Environments

The robotic manipulation environments used for training and evaluation comprise five different tasks. Two tasks use the 7-DoF Fetch robotic arm with a two-fingered parallel gripper: Push and Pick&Place, both of which require the agent to move a cube to the target position. The remaining three tasks use a 24-DoF Shadow Dexterous Hand to manipulate an egg, a block and a pen (EggFull, BlockRotate and PenRotate, respectively). The sparse reward function is given by Eq. (2).

In the Fetch environments, the state contains the position and velocity of the joints, and the position and rotation of the cube. Each action is a 4-dimensional vector, with three dimensions specifying the relative position of the gripper, and the final dimension specifying the state of the gripper (i.e., open or closed). The desired goal is the target position, and the achieved goal is the position of the cube. Each episode has a fixed length $T$.

In the Shadow Dexterous Hand environments, the state contains the position and velocity of the joints. Each action is a 20-dimensional vector which specifies the absolute position of the 20 non-coupled joints in the hand. The desired goal and achieved goal specify the rotation of the object for the block and pen tasks, and the position and rotation of the object for the egg task. Each episode has a fixed length $T$.

5.2 Training Settings

We base our training setup on CHER [8]. We train all agents on minibatches of size $k$ for 50 epochs, using MPI for parallelisation over 16 CPU cores; each epoch consists of 1600 episodes, with evaluation over 160 episodes at the end of each epoch. Remaining hyperparameters for the baselines are taken from the original work [1, 8, 31]. Our method, DTGSH, uses partial trajectories of length $b = 2$ and $m$ candidate goals per minibatch.

5.3 Benchmark Results

We compare DTGSH to DDPG [16], DDPG+HER [1], DDPG+HEBP [31] and DDPG+CHER [8]. Evaluation results are based on 5 repeated runs with different seeds; we plot the median success rate, with upper and lower bounds given by the 75th and 25th percentiles, respectively.

(a) Push
(b) Pick&Place
(c) EggFull
(d) BlockRotate
(e) PenRotate
Figure 4: Success rate of DTGSH and baseline approaches.

Fig. 4 and Tab. 1 show the performance of DDPG+DTGSH and baseline approaches on all five tasks. In the Fetch tasks, DDPG+DTGSH and DDPG+HEBP both learn significantly faster than the other methods, while in the Shadow Dexterous Hand tasks DDPG+DTGSH learns the fastest and achieves higher success rates than all other methods. In particular, DDPG cannot solve any tasks without using HER, and CHER performs worse in the Fetch tasks. We believe the results highlight the importance of sampling both diverse trajectories and goals, as in our proposed method, DTGSH.

Method            Push         Pick&Place   EggFull      BlockRotate  PenRotate
DDPG [16]         0.09 ± 0.01  0.04 ± 0.00  0.00 ± 0.00  0.01 ± 0.00  0.00 ± 0.00
DDPG+HER [1]      1.00 ± 0.00  0.89 ± 0.03  0.11 ± 0.01  0.55 ± 0.04  0.15 ± 0.02
DDPG+HEBP [31]    1.00 ± 0.00  0.91 ± 0.03  0.14 ± 0.02  0.59 ± 0.02  0.20 ± 0.03
DDPG+CHER [8]     1.00 ± 0.00  0.91 ± 0.04  0.15 ± 0.01  0.54 ± 0.04  0.17 ± 0.03
DDPG+DTGSH        1.00 ± 0.00  0.94 ± 0.01  0.17 ± 0.03  0.62 ± 0.02  0.21 ± 0.02
Table 1: Final mean success rate ± standard deviation, with best results in bold.

5.4 Ablation Studies

In this section, we perform the following experiments to investigate the effectiveness of each component in DTGSH: 1) diversity-based trajectory selection with HER (DTSH) and diversity-based goal selection with HER (DGSH) are evaluated independently to assess the contribution of each stage; 2) the performance of DTGSH with different partial trajectory lengths $b$; 3) the performance of DTGSH with different candidate goal set sizes $m$.

(a) Push
(b) Pick&Place
(c) EggFull
(d) BlockRotate
(e) PenRotate
Figure 5: Success rate of HER, DTGSH, and ablations DTSH and DGSH.

Fig. 5 shows the performance of using DTSH and DGSH independently. DDPG+DTSH outperforms DDPG+HER substantially in all tasks, which supports the view that sampling trajectories with diverse achieved goals can substantially improve performance. Furthermore, unlike DDPG+HEBP, DTSH does not require knowing the structure of the goal space in order to calculate the energy of the target object; DDPG+DGSH achieves better performance than DDPG+HER in three environments, and is only worse in one environment. DGSH performs better in environments where it is easier to solve the task (e.g., Fetch tasks), and hence the trajectories selected are more likely to contain useful transitions. However, DTGSH, which is the combination of both modules, performs the best overall.

(a) Push
(b) Pick&Place
(c) EggFull
(d) BlockRotate
(e) PenRotate
Figure 6: Success rate of DTGSH with different partial trajectory lengths $b$ and different candidate goal set sizes $m$.

Fig. 6 shows the performance of DDPG+DTGSH with different partial trajectory lengths $b$ and different candidate goal set sizes $m$. We use $b = 2$ as the default, as performance degrades for $b > 2$, indicating that pairwise diversity is best for learning in our method. Varying $m$ does not affect performance in the Fetch environments, but can degrade performance in the Shadow Dexterous Hand environments.

5.5 Time Complexity

Table 2 gives example training times of all of the HER-based algorithms. DTGSH requires an additional calculation of the diversity score $d_\tau$ at the end of every training episode, and $k$-DPP sampling of the minibatch for each update.

DDPG+HER [1] DDPG+HEBP [31] DDPG+CHER [8] DDPG+DTGSH
Time 00:55:08 00:56:32 03:02:18 01:52:30
Table 2: Training time (hours:minutes:seconds) of DTGSH and baseline approaches on the Push task for 50 epochs.

6 Conclusion

In this paper, we introduced diversity-based trajectory and goal selection with hindsight experience replay (DTGSH) to improve the learning efficiency of goal-oriented RL agents in the sparse reward setting. Our method can be divided into two stages: 1) valuable trajectories are selected according to a diversity-based priority, as modelled by determinantal point processes (DPPs) [13]; 2) $k$-DPPs [14] are leveraged to sample transitions with diverse goal states from the previously selected trajectories for training. Our experiments show empirically that DTGSH achieves faster learning and higher final performance on five challenging robotic manipulation tasks, compared to previous state-of-the-art approaches [1, 8, 31]. Furthermore, unlike prior extensions of hindsight experience replay, DTGSH does not require semantic knowledge of the goal space [31], and does not require tuning a curriculum [8].

Acknowledgements

This work was supported by JST, Moonshot R&D Grant Number JPMJMS2012.

References

  • [1] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. Pieter Abbeel, and W. Zaremba (2017) Hindsight experience replay. In Neural Information Processing Systems, Cited by: §1, §2.3, §3, §4.2, §5.2, §5.3, Table 1, Table 2, §5, §6.
  • [2] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. (2020) Learning dexterous in-hand manipulation. International Journal of Robotics Research. Cited by: §1.
  • [3] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath (2017) Deep reinforcement learning: a brief survey. IEEE Signal Processing Magazine. Cited by: §1.
  • [4] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv:1912.06680. Cited by: §1.
  • [5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §1.
  • [6] T. Dai, H. Liu, and A. A. Bharath (2020) Episodic self-imitation learning with hindsight. Electronics. Cited by: §3.
  • [7] M. Fang, C. Zhou, B. Shi, B. Gong, J. Xu, and T. Zhang (2018) DHER: hindsight experience replay for dynamic goals. In International Conference on Learning Representations, Cited by: §3.
  • [8] M. Fang, T. Zhou, Y. Du, L. Han, and Z. Zhang (2019) Curriculum-guided hindsight experience replay. In Neural Information Processing Systems, Cited by: §1, §2.3, §3, §5.2, §5.3, Table 1, Table 2, §5, §6.
  • [9] S. Gu, E. Holly, T. Lillicrap, and S. Levine (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In International Conference on Robotics and Automation, Cited by: §1.
  • [10] K. Hong and A. Nenkova (2014) Improving the estimation of word importance for news multi-document summarization. In Conference of the European Chapter of the Association for Computational Linguistics, Cited by: §2.4.
  • [11] L. P. Kaelbling (1993) Learning to achieve goals. In International Joint Conference on Artificial Intelligence, Cited by: §1, §2.2.
  • [12] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez (2021) Deep reinforcement learning for autonomous driving: a survey. IEEE Transactions on Intelligent Transportation Systems. Cited by: §1.
  • [13] A. Kulesza, B. Taskar, et al. (2012) Determinantal point processes for machine learning. Foundations and Trends in Machine Learning. Cited by: §1, §2.4, §6.
  • [14] A. Kulesza and B. Taskar (2011) K-dpps: fixed-size determinantal point processes. In International Conference on Machine Learning, Cited by: Figure 1, §4.1, §4.2, §6.
  • [15] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. Journal of Machine Learning Research. Cited by: §1.
  • [16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning.. In International Conference on Learning Representations, Cited by: §2.3, §5.3, Table 1.
  • [17] H. Liu, A. Trott, R. Socher, and C. Xiong (2019) Competitive experience replay. In International Conference on Learning Representations, Cited by: §3.
  • [18] B. Mahasseni, M. Lam, and S. Todorovic (2017) Unsupervised video summarization with adversarial lstm networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.4.
  • [19] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature. Cited by: §1.
  • [20] A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In International Conference on Machine Learning, Cited by: §1.
  • [21] T. Osogami and R. Raymond (2019) Determinantal reinforcement learning. In AAAI Conference on Artificial Intelligence, Cited by: §3.
  • [22] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, et al. (2018) Multi-goal reinforcement learning: challenging robotics environments and request for research. arXiv:1802.09464. Cited by: §2.2, §5.
  • [23] P. Rauber, A. Ummadisingu, F. Mutz, and J. Schmidhuber (2019) Hindsight policy gradients. In International Conference on Learning Representations, Cited by: §3.
  • [24] T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal value function approximators. In International Conference on Machine Learning, Cited by: §1, §2.2.
  • [25] T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2016) Prioritized experience replay. In International Conference on Learning Representations, Cited by: §1.
  • [26] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. (2020) Mastering Atari, Go, chess and shogi by planning with a learned model. Nature. Cited by: §1.
  • [27] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of Go without human knowledge. Nature. Cited by: §1.
  • [28] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. The MIT Press. Cited by: §2.1.
  • [29] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature. Cited by: §1.
  • [30] Y. Yang, Y. Wen, J. Wang, L. Chen, K. Shao, D. Mguni, and W. Zhang (2020) Multi-Agent Determinantal Q-Learning. In International Conference on Machine Learning, Cited by: §3.
  • [31] R. Zhao and V. Tresp (2018) Energy-based hindsight experience prioritization. In Conference on Robot Learning, Cited by: §1, §2.3, §3, §4.1, §4.1, §4.2, §5.2, §5.3, Table 1, Table 2, §5, §6.