Maximum Entropy-Regularized Multi-Goal Reinforcement Learning

05/21/2019 ∙ by Rui Zhao, et al. ∙ 0

In Multi-Goal Reinforcement Learning, an agent learns to achieve multiple goals with a goal-conditioned policy. During learning, the agent first collects the trajectories into a replay buffer, and later these trajectories are selected randomly for replay. However, the achieved goals in the replay buffer are often biased towards the behavior policies. From a Bayesian perspective, when there is no prior knowledge about the target goal distribution, the agent should learn uniformly from diverse achieved goals. Therefore, we first propose a novel multi-goal RL objective based on weighted entropy. This objective encourages the agent to maximize the expected return, as well as to achieve more diverse goals. Second, we develop a maximum entropy-based prioritization framework to optimize the proposed objective. To evaluate this framework, we combine it with Deep Deterministic Policy Gradient, both with and without Hindsight Experience Replay. On a set of multi-goal robotic tasks from OpenAI Gym, we compare our method with other baselines and show promising improvements in both performance and sample-efficiency.

1 Introduction

Reinforcement Learning (RL) (Sutton & Barto, 1998) combined with Deep Learning (DL) (Goodfellow et al., 2016) has led to great successes in various tasks, such as playing video games (Mnih et al., 2015), challenging the World Go Champion (Silver et al., 2016), and learning autonomously to accomplish different robotic tasks (Ng et al., 2006; Peters & Schaal, 2008; Levine et al., 2016; Chebotar et al., 2017; Andrychowicz et al., 2017).

One of the biggest challenges in RL is to make the agent learn efficiently in applications with sparse rewards. To tackle this challenge, Lillicrap et al. (2015) developed the Deep Deterministic Policy Gradient (DDPG), which enables the agent to learn continuous control, such as manipulation and locomotion. Schaul et al. (2015a) proposed Universal Value Function Approximators (UVFAs), which generalize not just over states, but also over goals, and extend value functions to multiple goals. Furthermore, to make the agent learn faster in sparse reward settings, Andrychowicz et al. (2017) introduced Hindsight Experience Replay (HER), which encourages the agent to learn from the goal-states it has achieved. The combined use of DDPG and HER allows the agent to learn to accomplish more complex robot manipulation tasks. However, there is still a huge gap between the learning efficiency of humans and RL agents. In most cases, an RL agent needs millions of samples before it is able to solve the tasks, while humans only need a few samples (Mnih et al., 2015). This paper is based on our 2018 NeurIPS Deep RL workshop paper (Zhao & Tresp, 2019).

Figure 1: Robot arm Fetch and Shadow Dexterous hand environment: FetchPush, FetchPickAndPlace, FetchSlide, HandManipulateEgg, HandManipulateBlock, and HandManipulatePen.

In previous works, the concept of maximum entropy has been used to encourage exploration during training (Williams & Peng, 1991; Mnih et al., 2015; Wu & Tian, 2016). Recently, Haarnoja et al. (2017) introduced Soft-Q Learning, which learns a deep energy-based policy by evaluating the maximum entropy of actions for each state. Soft-Q Learning encourages the agent to learn all the policies that lead to the optimum (Levine, 2018). Furthermore, Soft Actor-Critic (Haarnoja et al., 2018c) demonstrated a better performance while showing compositional ability and robustness of the maximum entropy policy in locomotion (Haarnoja et al., 2018a) and robot manipulation tasks (Haarnoja et al., 2018b). The agent aims to maximize the expected reward while also maximizing the entropy to succeed at the task while acting as randomly as possible. Based on maximum entropy policies, Eysenbach et al. (2018) showed that the agent is able to develop diverse skills solely by maximizing an information theoretic objective without any reward function. For multi-goal and multi-task learning (Caruana, 1997), the diversity of training sets helps the agent transfer skills to unseen goals and tasks (Pan et al., 2010). The variability of training samples mitigates overfitting and helps the model to better generalize (Goodfellow et al., 2016). In our approach, we combine maximum entropy with multi-goal RL to help the agent to achieve unseen goals by learning uniformly from diverse achieved goals during training.

We observe that during experience replay the uniformly sampled trajectories are biased towards the behavior policies, with respect to the achieved goal-states. Consider training a robot arm to reach a certain point in a space. At the beginning, the agent samples trajectories using a random policy. The sampled trajectories are centered around the initial position of the robot arm. Therefore, the distribution of achieved goals, i.e., positions of the robot arm, is similar to a Gaussian distribution around the initial position, which is non-uniform. Sampling from such a distribution is biased towards the current policies. From a Bayesian point of view (Murphy, 2012), the agent should learn uniformly from these achieved goals, when there is no prior knowledge of the target goal distribution.

To correct this bias, we propose a new objective which combines maximum entropy and the multi-goal RL objective. This new objective uses entropy as a regularizer to encourage the agent to traverse diverse goal-states. Furthermore, we derive a safe lower bound for optimization. To optimize this surrogate objective, we implement maximum entropy-based prioritization as a simple yet effective solution.

2 Preliminary

2.1 Settings

Environments: We consider multi-goal reinforcement learning tasks, like the robotic simulation scenarios provided by OpenAI Gym (Plappert et al., 2018), where six challenging tasks are used for evaluation, including push, slide, pick & place with the robot arm, as well as hand manipulation of the block, egg, and pen, as shown in Figure 1. Accordingly, we define the following terminology for this specific kind of multi-goal scenario.

Goals: The goals are the desired positions and the orientations of the object. Specifically, we use $g^e$, with $e$ standing for environment, to denote the real goal which serves as the input from the environment, in order to distinguish it from the achieved goal used in Hindsight settings (Andrychowicz et al., 2017). Note that in this paper we consider the case where the goals can be represented by states, which leads us to the concept of achieved goal-state $s^g$, with details explained below.

States, Goal-States and Achieved Goals: The state $s$ consists of two sub-vectors, the achieved goal-state $s^g$, which represents the position and orientation of the object being manipulated, and the context state $s^c$, i.e., $s = (s^g \| s^c)$, where $\|$ denotes concatenation.

In our case, we define $g^s = s^g$ to represent an achieved goal that has the same dimension as the real goal $g^e$ from the environment. The context state $s^c$ contains the remaining information about the state, including the linear and angular velocities of all robot joints and of the object. The real goals $g^e$ can be substituted by the achieved goals $g^s$ to facilitate learning. This goal relabeling technique was proposed by Andrychowicz et al. (2017) as Hindsight Experience Replay.

Achieved Goal Trajectory: A trajectory consisting solely of goal-states is represented as $\tau^g$. We use $\tau^g$ to denote all the achieved goals in the trajectory $\tau$, i.e., $\tau^g = (g^s_1, \dots, g^s_T)$.

Rewards: We consider sparse rewards $r$. There is a tolerated range between the desired goal-states and the achieved goal-states. If the object is not in the tolerated range of the real goal, the agent receives a reward signal $-1$ for each transition; otherwise, the agent receives a reward signal $0$.
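The sparse reward described above can be sketched as follows. This is a minimal illustration, not the environments' actual code: the function name `sparse_reward`, the Euclidean distance test, and the default `tolerance` value are assumptions, and the -1/0 reward values follow the common convention of the Gym robotics tasks.

```python
import math


def sparse_reward(achieved_goal, desired_goal, tolerance=0.05):
    """Return 0.0 inside the tolerated range around the desired goal, else -1.0.

    `tolerance` is a hypothetical threshold chosen for illustration only.
    """
    distance = math.dist(achieved_goal, desired_goal)
    return 0.0 if distance <= tolerance else -1.0
```

With this shape of reward, almost every transition of an untrained agent yields the same signal, which is why goal relabeling and careful replay matter in these tasks.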

Goal-Conditioned Policy: In multi-goal settings, the agent receives the environmental goal $g^e$ and the state input $s = (s^g \| s^c)$. We want to train a goal-conditioned policy $\pi_\theta(a_t \mid s_t, g^e)$ to effectively generalize its behavior to different environmental goals $g^e$.

2.2 Reinforcement Learning

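We consider an agent interacting with an environment. We assume the environment is fully observable, including a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a distribution of initial states $p(s_1)$, transition probabilities $p(s_{t+1} \mid s_t, a_t)$, a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and a discount factor $\gamma \in [0, 1]$.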

Deep Deterministic Policy Gradient: For continuous control tasks, the Deep Deterministic Policy Gradient (DDPG), an off-policy actor-critic method, shows promising performance (Lillicrap et al., 2015).

Universal Value Function Approximators: For multi-goal continuous control tasks, DDPG can be extended by Universal Value Function Approximators (UVFA) (Schaul et al., 2015a). UVFA essentially generalizes the Q-function to multiple goal-states, where the Q-value depends not only on the state-action pairs, but also on the goals.

Hindsight Experience Replay: For robotic tasks, if the goal is challenging and the reward is sparse, the agent could perform badly for a long time before learning anything. Hindsight Experience Replay (HER) encourages the agent to learn from whatever goal-states it has achieved. Andrychowicz et al. (2017) show that HER makes training possible in challenging robotic tasks via goal relabeling, i.e., randomly substituting real goals with achieved goals.
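Goal relabeling can be sketched as below. This is an illustrative sketch rather than the HER implementation: the function name, the choice of substituting a goal achieved at a random later timestep of the same trajectory, the `tolerance` value, and the -1/0 rewards are all assumptions made for the example.

```python
import math
import random


def relabel_with_achieved_goals(achieved_goals, tolerance=0.05, rng=random):
    """Substitute the real goal of each transition with a later achieved goal.

    achieved_goals[t] is the goal-state reached at timestep t. For every
    transition t -> t+1, a goal achieved at a random later timestep replaces
    the real goal and the sparse reward is recomputed against it.
    Returns a list of (t, substituted_goal, reward) tuples.
    """
    relabeled = []
    horizon = len(achieved_goals)
    for t in range(horizon - 1):
        future = rng.randrange(t + 1, horizon)
        new_goal = achieved_goals[future]
        reached = math.dist(achieved_goals[t + 1], new_goal) <= tolerance
        relabeled.append((t, new_goal, 0.0 if reached else -1.0))
    return relabeled
```

Because the substituted goal was actually reached later in the same trajectory, at least some relabeled transitions carry the success reward even when the real goal was never achieved.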

2.3 Weighted Entropy

Guiaşu (1971) proposed weighted entropy, which is an extension of Shannon entropy. The definition of weighted entropy is given as

$$H_p^w = -\sum_{k=1}^{K} w_k\, p_k \log p_k \tag{1}$$

where $w_k$ is the weight of the elementary event and $p_k$ is the probability of the elementary event.
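As a small worked example of the weighted entropy, the sketch below evaluates the sum of weight times probability times negative log-probability over a finite set of events. The function name `weighted_entropy` is an assumption for illustration; with all weights equal to one it reduces to Shannon entropy.

```python
import math


def weighted_entropy(probs, weights):
    """Weighted entropy: -sum_k w_k * p_k * log(p_k).

    Events with zero probability contribute nothing to the sum.
    """
    return -sum(w * p * math.log(p) for p, w in zip(probs, weights) if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]
shannon = weighted_entropy(uniform, [1.0] * 4)   # equals log(4)
doubled = weighted_entropy(uniform, [2.0] * 4)   # equals 2 * log(4)
```

Scaling every weight by a constant scales the weighted entropy by the same constant, which is the sense in which a larger accumulated reward gives a trajectory more influence on the objective.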

3 Method

In this section, we formally describe our method, including the mathematical derivation of the Maximum Entropy-Regularized Multi-Goal RL objective and the Maximum Entropy-based Prioritization framework.

3.1 Multi-Goal RL

In this paper, we consider multi-goal RL as goal-conditioned policy learning (Schaul et al., 2015a; Andrychowicz et al., 2017; Rauber et al., 2017; Plappert et al., 2018). We denote random variables by upper case letters and the values of random variables by corresponding lower case letters. For example, let $\mathrm{Val}(X)$ denote the set of valid values to a random variable $X$, and let $p(x)$ denote the probability function of random variable $X$.

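Consider that an agent receives a goal $g^e \in \mathrm{Val}(G^e)$ at the beginning of the episode. The agent interacts with the environment for $T$ timesteps. At each timestep $t$, the agent observes a state $s_t \in \mathrm{Val}(S_t)$ and performs an action $a_t \in \mathrm{Val}(A_t)$. The agent also receives a reward conditioned on the input goal $r(s_t, g^e) \in \mathbb{R}$.

We use $\tau = s_1, a_1, s_2, a_2, \dots, s_{T-1}, a_{T-1}, s_T$ to denote a trajectory, where $\tau \in \mathrm{Val}(\mathcal{T})$. We assume that the probability $p(\tau \mid g^e, \theta)$ of trajectory $\tau$, given goal $g^e$ and a policy parameterized by $\theta \in \mathrm{Val}(\Theta)$, is given as

$$p(\tau \mid g^e, \theta) = p(s_1) \prod_{t=1}^{T-1} p(a_t \mid s_t, g^e, \theta)\, p(s_{t+1} \mid s_t, a_t).$$

The transition probability $p(s_{t+1} \mid s_t, a_t)$ states that the probability of a state transition given an action is independent of the goal, and we denote it with $S_{t+1} \perp G^e \mid S_t, A_t$. For every $\tau$, $g^e$, and $\theta$, we also assume that $p(\tau \mid g^e, \theta)$ is non-zero. The expected return of a policy parameterized by $\theta$ is given as

$$\eta(\theta) = \mathbb{E}\left[\sum_{t=1}^{T} r(S_t, G^e) \,\Big|\, \theta\right] = \sum_{g^e} p(g^e) \sum_{\tau} p(\tau \mid g^e, \theta) \sum_{t=1}^{T} r(s_t, g^e) \tag{2}$$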

Off-policy RL methods use experience replay (Lin, 1992; Mnih et al., 2015) to leverage bias over variance and potentially improve sample-efficiency. In the off-policy case, the objective, Equation (2), is given as

$$\eta^{\mathcal{R}}(\theta) = \sum_{\tau, g^e} p_{\mathcal{R}}(\tau, g^e \mid \theta) \sum_{t=1}^{T} r(s_t, g^e) \tag{3}$$

where $\mathcal{R}$ denotes the replay buffer. Normally, the trajectories $\tau$ are randomly sampled from the buffer. However, we observe that the trajectories in the replay buffer are often imbalanced with respect to the achieved goals $\tau^g$. Thus, we propose Maximum Entropy-Regularized Multi-Goal RL to improve performance.

3.2 Maximum Entropy-Regularized Multi-Goal RL

In multi-goal RL, we want to encourage the agent to traverse diverse goal-state trajectories, and at the same time, maximize the expected return. This is like maximizing the empowerment (Mohamed & Rezende, 2015) of an agent attempting to achieve multiple goals. We propose the reward-weighted entropy objective for multi-goal RL, which is given as

$$\eta^{\mathcal{H}}(\theta) = H_p^w(T^g) = \mathbb{E}_p\left[\log\frac{1}{p(\tau^g)} \sum_{t=1}^{T} r(S_t, G^e) \,\Big|\, \theta\right] \tag{4}$$

For simplicity, we use $p(\tau^g)$ to represent $\sum_{g^e} p_{\mathcal{R}}(\tau^g, g^e \mid \theta)$, which is the occurrence probability of the goal-state trajectory $\tau^g$. The expectation is calculated based on $p(\tau^g)$ as well, so the proposed objective is the weighted entropy (Guiaşu, 1971; Kelbert et al., 2017) of $\tau^g$, which we denote as $H_p^w(T^g)$, where the weight $w$ is the accumulated reward $\sum_{t=1}^{T} r(S_t, G^e)$ in our case.

The objective function, Equation (4), has two interpretations. The first interpretation is to maximize the weighted expected return, where the rare trajectories have larger weights. Note that when all trajectories occur uniformly, this weighting mechanism has no effect. The second interpretation is to maximize a reward-weighted entropy, where the more rewarded trajectories have higher weights. This objective encourages the agent to learn how to achieve diverse goal-states, as well as to maximize the expected return.

In Equation (4), the weight, $\log(1/p(\tau^g))$, is unbounded, which makes the training of the universal function approximator unstable. Therefore, we propose a safe surrogate objective, $\eta^{\mathcal{L}}$, which is essentially a lower bound of the original objective.

3.3 Surrogate Objective

To construct the safe surrogate objective, we sample the trajectories from the replay buffer with a proposal distribution, $q(\tau^g) = \frac{1}{Z}\, p(\tau^g)\,(1 - p(\tau^g))$. $p(\tau^g)$ represents the distribution of the goal trajectories in the replay buffer. The surrogate objective is given in Theorem 1, which is proved to be a lower bound of the original objective, Equation (4).

Theorem 1.

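The surrogate $\eta^{\mathcal{L}}(\theta)$ is a lower bound of the objective function $\eta^{\mathcal{H}}(\theta)$, i.e., $\eta^{\mathcal{L}}(\theta) < \eta^{\mathcal{H}}(\theta)$, where

$$\eta^{\mathcal{H}}(\theta) = H_p^w(T^g) = \mathbb{E}_p\left[\log\frac{1}{p(\tau^g)} \sum_{t=1}^{T} r(S_t, G^e) \,\Big|\, \theta\right] \tag{5}$$
$$\eta^{\mathcal{L}}(\theta) = Z \cdot \mathbb{E}_q\left[\sum_{t=1}^{T} r(S_t, G^e) \,\Big|\, \theta\right] \tag{6}$$
$$q(\tau^g) = \frac{1}{Z}\, p(\tau^g)\,(1 - p(\tau^g)) \tag{7}$$

$Z$ is the normalization factor for $q(\tau^g)$. $H_p^w(T^g)$ is the weighted entropy (Guiaşu, 1971; Kelbert et al., 2017), where the weight is the accumulated reward $\sum_{t=1}^{T} r(S_t, G^e)$, in our case.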

Proof.

See Appendix. ∎

3.4 Prioritized Sampling

To optimize the surrogate objective, Equation (6), we cast the optimization process into a prioritized sampling framework. At each iteration, we first construct the proposal distribution $q(\tau^g)$, which has a higher entropy than $p(\tau^g)$. This ensures that the agent learns from a more diverse goal-state distribution. In Theorem 2, we prove that the entropy with respect to $q(\tau^g)$ is higher than the entropy with respect to $p(\tau^g)$.

Theorem 2.

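Let the probability density function of goals in the replay buffer be

$$p(\tau^g), \quad \text{where } p(\tau^g_i) \in (0, 1) \text{ and } \sum_{i=1}^{N} p(\tau^g_i) = 1. \tag{8}$$

Let the proposal probability density function be defined as

$$q(\tau^g_i) = \frac{1}{Z}\, p(\tau^g_i)\,(1 - p(\tau^g_i)), \quad \text{where } \sum_{i=1}^{N} q(\tau^g_i) = 1. \tag{9}$$

Then, the proposal goal distribution has an equal or higher entropy

$$H_q(T^g) - H_p(T^g) \geq 0. \tag{10}$$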
Proof.

See Appendix. ∎
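The entropy claim of Theorem 2 can be checked numerically. The sketch below assumes the proposal takes the form of the buffer probability multiplied by its complement and then normalized, which matches the complementary-density construction used for prioritized sampling; the helper names are hypothetical and the check is an illustration, not a substitute for the proof.

```python
import math
import random


def entropy(probs):
    """Shannon entropy of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def proposal(probs):
    """Proposal q_i = p_i * (1 - p_i) / Z, with Z the normalization factor."""
    unnormalized = [p * (1.0 - p) for p in probs]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


rng = random.Random(0)
for _ in range(1000):
    # Random distribution over 2 to 8 goal trajectories, all probabilities > 0.
    raw = [rng.random() + 1e-9 for _ in range(rng.randint(2, 8))]
    total = sum(raw)
    p = [x / total for x in raw]
    q = proposal(p)
    assert entropy(q) >= entropy(p) - 1e-12
```

For a two-element distribution such as (0.9, 0.1), the proposal becomes exactly uniform, which makes the flattening effect easy to see by hand.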

3.5 Estimation of Distribution

To optimize the surrogate objective with prioritized sampling, we need to know the probability distribution of a goal-state trajectory $p(\tau^g)$. We use a Latent Variable Model (LVM) (Murphy, 2012) to model the underlying distribution of $p(\tau^g)$, since LVM is suitable for modeling complex distributions.

while not converged do
        Sample goal and initial state
        for  do
               for  do
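                      Sample action $a_t \sim \pi_\theta(a_t \mid s_t, g^e)$ from behavior policy.
                      Step environment: $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$.
                      Update replay buffer $\mathcal{R}$.
                      Construct prioritized sampling distribution:
                      $q(\tau^g) \propto (1 - p(\tau^g \mid \phi))\, p(\tau^g)$ with higher $H_q(T^g)$.
                      Sample trajectories $\tau \sim q(\tau^g \mid \phi)$.
                      Update policy ($\theta$) to max. $\mathbb{E}_q[r(S, G)]$ via DDPG, HER.
              Update density model ($\phi$).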
Algorithm 1 Maximum Entropy-based Prioritization (MEP)
Figure 2: MEP Algorithm: We update the density model to construct a higher entropy distribution of achieved goals and update the agent with the more diversified training distribution.

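Specifically, we use $p(\tau^g \mid z_k)$ to denote the latent-variable-conditioned goal-state trajectory distribution, which we assume to be Gaussian. $z_k$ is the $k$-th latent variable, where $k \in \{1, \dots, K\}$ and $K$ is the number of the latent variables. The resulting model is a Mixture of Gaussians (MoG), mathematically,

$$p(\tau^g \mid \phi) = \frac{1}{Z} \sum_{k=1}^{K} c_k\, \mathcal{N}(\tau^g \mid \mu_k, \Sigma_k) \tag{11}$$

where each Gaussian, $\mathcal{N}(\tau^g \mid \mu_k, \Sigma_k)$, has its own mean $\mu_k$ and covariance $\Sigma_k$, $c_k$ represents the mixing coefficients, and $Z$ is the partition function. The model parameter $\phi$ includes all means $\mu_k$, covariances $\Sigma_k$, and mixing coefficients $c_k$.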

In prioritized sampling, we use the complementary predictive density of a goal-state trajectory $\tau^g$ as the priority, which is given as

$$\bar{p}(\tau^g \mid \phi) \propto 1 - p(\tau^g \mid \phi). \tag{12}$$

The complementary density is high when a goal-state trajectory $\tau^g$ is unlikely to occur in the replay buffer, so a high complementary density corresponds to a rare occurrence of the goal trajectory. We want to over-sample these rare goal-state trajectories during replay to increase the entropy of the training distribution. Therefore, we use the complementary density to construct the proposal distribution as a joint distribution

$$q(\tau^g) \propto \bar{p}(\tau^g \mid \phi)\, p(\tau^g). \tag{13}$$
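The construction in Equations (12) and (13) can be sketched as follows. This is a simplified illustration under stated assumptions: goals are one-dimensional, the mixture parameters are given rather than fitted, and the model densities are normalized over the buffer so that each value lies between zero and one before taking the complement. All function names are hypothetical.

```python
import math


def mixture_density(x, weights, means, stds):
    """Density of a 1-D Gaussian mixture at x (weights assumed to sum to one)."""
    return sum(
        c * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for c, m, s in zip(weights, means, stds)
    )


def proposal_from_densities(densities):
    """Return (complementary weights, proposal probabilities) for a buffer.

    Assumption: densities are normalized over the buffer to p_i in (0, 1);
    the complementary weight is 1 - p_i and q_i is proportional to (1 - p_i) * p_i.
    """
    total = sum(densities)
    p = [d / total for d in densities]
    complementary = [1.0 - p_i for p_i in p]
    unnormalized = [c * p_i for c, p_i in zip(complementary, p)]
    z = sum(unnormalized)
    return complementary, [u / z for u in unnormalized]


# Hypothetical 1-D goal summaries: two common outcomes and one rare outcome.
goals = [0.0, 0.1, 3.0]
densities = [mixture_density(g, [1.0], [0.0], [1.0]) for g in goals]
complementary, q = proposal_from_densities(densities)
```

In this toy buffer the rare goal receives the largest complementary weight, and its probability under the proposal is larger than its normalized probability in the buffer.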

3.6 Maximum Entropy-Based Prioritization

With prioritized sampling, the agent learns to maximize the return of a more diverse goal distribution. When the agent replays the samples, it first ranks all the trajectories with respect to their proposal distribution $q(\tau^g)$, and then uses the normalized ranking number as the probability for sampling. This means that rare goals have high ranking numbers and, equivalently, have higher priorities to be replayed. Here, we use the ranking instead of the density. The reason is that the rank-based variant is more robust since it is neither affected by outliers nor by density magnitudes. Furthermore, its heavy-tail property also guarantees that samples will be diverse (Schaul et al., 2015b). Mathematically, the probability of a trajectory to be replayed after the prioritization is:

$$q(\tau^g_i) = \frac{\mathrm{rank}(q(\tau^g_i))}{\sum_{n=1}^{N} \mathrm{rank}(q(\tau^g_n))} \tag{14}$$

where $N$ is the total number of trajectories in the replay buffer and $\mathrm{rank}(\cdot)$ is the ranking function.
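Rank-based prioritization as in Equation (14) can be sketched as below. The function names are hypothetical, rank 1 is assigned to the lowest priority, and ties are broken by buffer order, which is an assumption of this sketch rather than something the text specifies.

```python
import random


def rank_based_probabilities(priorities):
    """Map priorities to sampling probabilities: rank_i / sum of all ranks.

    Rank 1 goes to the lowest priority and rank N to the highest.
    """
    order = sorted(range(len(priorities)), key=lambda i: priorities[i])
    ranks = [0] * len(priorities)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    total = sum(ranks)
    return [r / total for r in ranks]


def sample_indices(priorities, batch_size, rng=random):
    """Draw trajectory indices for replay according to the rank-based probabilities."""
    probs = rank_based_probabilities(priorities)
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)
```

Because only the ordering matters, replacing the largest priority by an extreme outlier leaves the sampling probabilities unchanged, which is the robustness property mentioned above.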

Figure 3: Mean success rate with standard deviation in all six robot environments

We summarize the complete training algorithm in Algorithm 1 and in Figure 2. In short, we propose Maximum Entropy-Regularized Multi-Goal RL (Section 3.2) to enable RL agents to learn more efficiently in multi-goal tasks (Section 3.1). We integrate a goal entropy term into the normal expected return objective. To maximize the objective, Equation (4), we derive a surrogate objective in Theorem 1, i.e., a lower bound of the original objective. We use prioritized sampling based on a higher entropy proposal distribution at each iteration and utilize off-policy RL methods to maximize the expected return. This framework is implemented as Maximum Entropy-based Prioritization (MEP).

Robot arm environments
Method | Push success | Push time | Pick & Place success | Pick & Place time | Slide success | Slide time
DDPG | 99.90% | 5.52h | 39.34% | 5.61h | 75.67% | 5.47h
DDPG+PER | 99.94% | 30.66h | 67.19% | 25.73h | 66.33% | 25.85h
DDPG+MEP | 99.96% | 6.76h | 76.02% | 6.92h | 76.77% | 6.66h

Robot hand environments
Method | Egg success | Egg time | Block success | Block time | Pen success | Pen time
DDPG+HER | 76.19% | 7.33h | 20.32% | 8.47h | 27.28% | 7.55h
DDPG+HER+PER | 75.46% | 79.86h | 18.95% | 80.72h | 27.74% | 81.17h
DDPG+HER+MEP | 81.30% | 17.00h | 25.00% | 19.88h | 31.88% | 25.36h

Table 1: Mean success rate (%) and training time (hours) for all six environments

4 Experiments

We test the proposed method on a variety of simulated robotic tasks, see Section 2.1, and compare it to strong baselines, including DDPG and HER. To the best of our knowledge, the most similar method to MEP is Prioritized Experience Replay (PER) (Schaul et al., 2015b). In the experiments, we first compare the performance improvement of MEP and PER. Afterwards, we compare the time-complexity of the two methods. We show that MEP improves performance with much less computational time than PER. Furthermore, the motivations of PER and MEP are different. The former uses TD-errors, while the latter is based on an entropy-regularized objective function.

In this section, we investigate the following questions:

  1. Does incorporating goal entropy via MEP bring benefits to off-policy RL algorithms, such as DDPG or DDPG+HER?

  2. Does MEP improve sample-efficiency of state-of-the-art RL approaches in robotic manipulation tasks?

  3. How does MEP influence the entropy of the achieved goal distribution during training?

Our code is available online at https://github.com/ruizhaogit/mep.git. The implementation uses OpenAI Baselines (Dhariwal et al., 2017) with a backend of TensorFlow (Abadi et al., 2016).

4.1 Performance

To test the performance difference among methods including DDPG, DDPG+PER, and DDPG+MEP, we run the experiment in the three robot arm environments. We use the DDPG as the baseline here because the robot arm environment is relatively simple. In the more challenging robot hand environments, we use DDPG+HER as the baseline method and test the performance among DDPG+HER, DDPG+HER+PER, and DDPG+HER+MEP. To combine PER with HER, we calculate the TD-error of each transition based on the randomly selected achieved goals. Then we prioritize the transitions with higher TD-errors for replay.

Now, we compare the mean success rates. Each experiment is carried out with 5 random seeds and the shaded area represents the standard deviation. The learning curve with respect to training epochs is shown in Figure 3. For all experiments, we use 19 CPUs and train the agent for 200 epochs. After training, we use the best-learned policy for evaluation and test it in the environment. The testing results are the mean success rates. A comparison of the performances along with the training time is shown in Table 1.

Figure 4: Number of training samples needed with respect to mean success rate for all six environments (the lower the better)

From Figure 3, we can see that MEP converges faster in all six tasks than both the baseline and PER. The agent trained with MEP also shows a better performance at the end of the training, as shown in Table 1. In Table 1, we can also see that the training time of MEP lies in between the baseline and PER. It is known that PER can become very time-consuming (Schaul et al., 2015b), especially when the memory size $N$ is very large. The reason is that PER uses TD-errors for prioritization. After each update of the model, the agent needs to update the priorities of the transitions in the replay buffer, which is $O(\log N)$. In our experiments, we use the efficient implementation based on the “sum-tree” data structure, which can be relatively efficiently updated and sampled from (Schaul et al., 2015b). To be more specific, MEP consumes much less computational time than PER. For example, in the robot arm environments, on average, DDPG+MEP consumes about 1.2 times the training time of DDPG. In comparison, DDPG+PER consumes about 5 times the training time of DDPG. In this case, MEP is about 4 times faster than PER. MEP is faster because it only updates the trajectory density once per epoch, and it can easily be combined with any multi-goal RL method, such as DDPG and HER.
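The training-time ratios quoted above can be reproduced from the robot arm rows of Table 1. The short computation below only restates that arithmetic; the variable names are chosen for illustration.

```python
# Training time in hours for Push, Pick & Place, and Slide (Table 1).
hours = {
    "DDPG": [5.52, 5.61, 5.47],
    "DDPG+PER": [30.66, 25.73, 25.85],
    "DDPG+MEP": [6.76, 6.92, 6.66],
}

# Average training time per method over the three robot arm environments.
mean_hours = {name: sum(values) / len(values) for name, values in hours.items()}

mep_vs_ddpg = mean_hours["DDPG+MEP"] / mean_hours["DDPG"]
per_vs_ddpg = mean_hours["DDPG+PER"] / mean_hours["DDPG"]
per_vs_mep = mean_hours["DDPG+PER"] / mean_hours["DDPG+MEP"]
```

The three ratios come out at roughly 1.2, 5, and 4, matching the figures given in the text.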

Table 1 shows that baseline methods with MEP result in better performance in all six tasks. The improvement is up to 36.68 percentage points over the baseline methods (Pick & Place, from 39.34% to 76.02%). The average improvement over the six tasks, computed from Table 1, is 8.71 percentage points. MEP is thus a simple yet effective method that improves on state-of-the-art methods.

4.2 Sample-Efficiency

To compare sample-efficiency of the baseline and MEP, we compare the number of training samples needed for a certain mean success rate. The comparison is shown in Figure 4. From Figure 4, in the FetchPush-v0 environment, we can see that for the same 99% mean success rate, the baseline DDPG needs 273,600 samples for training, while DDPG+MEP only needs 112,100 samples. In this case, DDPG+MEP is more than twice (2.44) as sample-efficient as DDPG. Similarly, in the other five environments, MEP improves sample-efficiency by factors around one to three. In conclusion, for all six environments, MEP is able to improve sample-efficiency by an average factor of two (1.95) over the baseline’s sample-efficiency.

4.3 Goal Entropy

Figure 5: Entropy values of the achieved goal distribution during training

To verify that the overall MEP procedure works as expected, we calculated the entropy value of the achieved goal distribution with respect to the epoch of training. The experimental results are averaged over 5 different random seeds. Figure 5 shows the mean entropy values with its standard deviation in three different environments. From Figure 5, we can see that the implemented MEP algorithm indeed increases the entropy of the goal distribution. This affirms the consistency of the stated theory with the implemented MEP framework.

5 Related Work

Maximum entropy was used in RL by Williams & Peng (1991) as an additional term in the loss function to encourage exploration and avoid local minima (Mnih et al., 2016; Wu & Tian, 2016; Nachum et al., 2016; Asadi & Littman, 2016). A similar idea has also been utilized in the deep learning community, where entropy loss was used as a regularization technique to penalize over-confident output distributions (Pereyra et al., 2017). In RL, the entropy loss adds more cost to actions that dominate quickly. A higher entropy loss favors more exploration (Mnih et al., 2016). Neu et al. (2017) gave a unified view on entropy-regularized Markov Decision Processes (MDP) and discussed the convergence properties of entropy-regularized RL, including TRPO (Schulman et al., 2015) and A3C (Mnih et al., 2016).

More recently, Haarnoja et al. (2017) and Levine (2018) proposed deep energy-based policies with state conditioned entropy-based regularization, which is known as Soft-Q Learning. They showed that maximum entropy policies emerge as the solution when optimal control is cast as probabilistic inference. Concurrently, Schulman et al. (2017) showed the connection and the equivalence between Soft-Q Learning and policy gradients. Maximum entropy policies are shown to be robust and lead to better initializations for RL agents (Haarnoja et al., 2018a, b). Based on maximum entropy policies, Eysenbach et al. (2018) developed an information theoretic objective, which enables the agent to automatically discover different sets of skills.

Unlike aforementioned works (Williams & Peng, 1991; Mnih et al., 2016; Haarnoja et al., 2017), the information theoretic objective (Eysenbach et al., 2018) uses state, not actions, to calculate the entropy for distinguishing different skills. Our work is similar to this previous work (Eysenbach et al., 2018) in the sense that we also use the states, instead of actions, to calculate the entropy term and encourage the trained agent to cover a variety of goal-states. Our method generalizes to multi-goal and multi-task RL (Kaelbling, 1993; Sutton et al., 1999; Bakker & Schmidhuber, 2004; Sutton et al., 2011; Szepesvari et al., 2014; Schaul et al., 2015a; Pinto & Gupta, 2017; Plappert et al., 2018).

The entropy term that we used in the multi-goal RL objective is maximized over goal-states. We use maximum goal entropy as a regularization for multi-goal RL, which encourages the agent to learn uniformly with respect to goals instead of experienced transitions. This corrects the bias introduced by the agent’s behavior policies. For example, the more easily achievable goals are generally dominant in the replay buffer. The goal entropy-regularized objective allows the agent to learn to achieve the unknown real goals, as well as various virtual goals.

We implemented the maximum entropy regularization via prioritized sampling based on achieved goal-states. We believe that the most similar framework is prioritized experience replay (Schaul et al., 2015b). Prioritized experience replay was introduced by Schaul et al. (2015b) as an improvement to the experience replay in DQN (Mnih et al., 2015). It prioritizes the transitions with higher TD-error in the replay buffer to speed up training. Prioritized experience replay is motivated by TD-errors, whereas the motivation of our method comes from information theory, specifically the maximum entropy principle. Compared to prioritized experience replay, our method performs better empirically and consumes much less computational time.

The intuition behind our method is to assign priority to those under-represented goals, which are relatively more valuable to learn from (see Appendix). Essentially, our method samples goals from an entropy-regularized distribution, rather than from a true replay buffer distribution, which is biased towards the behavior policies. Similar to recent work on goal sampling methods (Forestier et al., 2017; Péré et al., 2018; Florensa et al., 2018; Zhao & Tresp, 2018; Nair et al., 2018; Warde-Farley et al., 2018), our aim is to model a goal-conditioned MDP. In the future, we want to further explore the role of goal entropy in multi-goal RL.

6 Conclusion

This paper makes three contributions. First, we propose the idea of Maximum Entropy-Regularized Multi-Goal RL, which is essentially a reward-weighted entropy objective. Secondly, we derive a safe surrogate objective, i.e., a lower bound of the original objective, to achieve stable optimization. Thirdly, we implement a novel Maximum Entropy-based Prioritization framework for optimizing the surrogate objective. Overall, our approach encourages the agent to achieve a diverse set of goals while maximizing the expected return.

We evaluated our approach in multi-goal robotic simulations. The experimental results showed that our approach improves performance and sample-efficiency of the agent while keeping computational time under control. More precisely, the results showed that our method improves performance by 9 percentage points and sample-efficiency by a factor of two compared to state-of-the-art methods.

References

Appendix A Proof of Theorem 1

Theorem 3.

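The surrogate $\eta^{\mathcal{L}}(\theta)$ is a lower bound of the objective function $\eta^{\mathcal{H}}(\theta)$, i.e., $\eta^{\mathcal{L}}(\theta) < \eta^{\mathcal{H}}(\theta)$, where

$$\eta^{\mathcal{H}}(\theta) = H_p^w(T^g) = \mathbb{E}_p\left[\log\frac{1}{p(\tau^g)} \sum_{t=1}^{T} r(S_t, G^e) \,\Big|\, \theta\right] \tag{15}$$
$$\eta^{\mathcal{L}}(\theta) = Z \cdot \mathbb{E}_q\left[\sum_{t=1}^{T} r(S_t, G^e) \,\Big|\, \theta\right] \tag{16}$$
$$q(\tau^g) = \frac{1}{Z}\, p(\tau^g)\,(1 - p(\tau^g)) \tag{17}$$

$Z$ is the normalization factor for $q(\tau^g)$. $H_p^w(T^g)$ is the weighted entropy (Guiaşu, 1971; Kelbert et al., 2017), where the weight is the accumulated reward $\sum_{t=1}^{T} r(S_t, G^e)$ in our case.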

Proof.
(18)
(19)
(20)
(21)
(22)
(23)
(24)

In the inequality, we use the property . ∎

Appendix B Proof of Theorem 2

Theorem 4.

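Let the probability density function of goals in the replay buffer be

$$p(\tau^g), \quad \text{where } p(\tau^g_i) \in (0, 1) \text{ and } \sum_{i=1}^{N} p(\tau^g_i) = 1. \tag{25}$$

Let the proposal probability density function be defined as

$$q(\tau^g_i) = \frac{1}{Z}\, p(\tau^g_i)\,(1 - p(\tau^g_i)), \quad \text{where } \sum_{i=1}^{N} q(\tau^g_i) = 1. \tag{26}$$

Then, the proposal goal distribution has an equal or higher entropy

$$H_q(T^g) - H_p(T^g) \geq 0. \tag{27}$$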
Proof.

For clarity, we define the notations in this proof as and .

Note that the definition of Entropy is

(28)

where the th summand is , which is a concave function. Since the goal distribution has a finite support , we have the real-valued vector and .

We use Karamata’s inequality (Kadelburg et al., 2005), which states that if the vector majorizes then the summation of the concave transformation of the first vector is smaller than the concave transformation of the second vector.

In our case, the concave transformation is the weighted information at the th position -, where the weight is the probability (entropy is the expectation of information). Therefore, the proof of the theorem is also a proof of the majorizing property of over (Petrov, ).

We denote the proposal goal distribution as

(29)

Note that in our case, the partition function is a constant.

Majorizing has three requirements (Marshall et al., 1979).

The first requirement is that both vectors must sum up to one. This requirement is already met because

(30)

The second requirement is that monotonicity exists. Without loss of generality, we assume the probabilities are sorted:

(31)

Thus, if then

(32)
(33)
(34)
(35)

which means that if the original goal probabilities are sorted, the transformed goal probabilities are also sorted,

(36)

The third requirement is that for an arbitrary cutoff index , there is

(37)

To prove this, we have

(38)
(39)
(40)
(41)
(42)

Note that, we multiply to each side of

(43)

Then we have

(44)

Now, we substitute the expression of and then have

(45)

We express as a series of terms , we have

(46)

We use the distributive law to the right side and have

(47)

We move the first term on the right side to the left and use the distributive law then have

(48)

We use the distributive law again on the right side and move the first term to the left and use the distributive law then have

(49)

We remove the minus sign then have

(50)

To prove the inequality above, it suffices to show that the inequality holds true for each associated term of the multiplication on each side of the inequality.

Suppose that

(51)

then we have

(52)

As mentioned above, the probabilities are sorted in descending order. We have

(53)

then

(54)

Therefore, we have proved that the inequality holds true for an arbitrary associated term, which also applies when they are added up. ∎

Appendix C Insights

Figure 6: Pearson correlation between the complementary density and TD-errors in the middle of training

To further understand why maximum entropy in goal space facilitates learning, we look into the TD-errors during training. We investigate the correlation between the complementary predictive density and the TD-errors of the trajectory. The Pearson correlation coefficients, i.e., Pearson’s r (Benesty et al., 2009), between the density and the TD-errors of the trajectory are 0.63, 0.76, and 0.73, for the hand manipulation of egg, block, and pen tasks, respectively. The plot of the Pearson correlation is shown in Figure 6. The value of Pearson’s r is between 1 and -1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation. We can see that the complementary predictive density is correlated with the TD-errors of the trajectory with an average Pearson’s r of 0.7. This suggests that the agent learns faster from a more diverse goal distribution. Under-represented goals often have higher TD-errors, and thus are relatively more valuable to learn from. Therefore, it is helpful to maximize the goal entropy and prioritize the under-represented goals during training.
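For reference, Pearson's r between two equally long sequences, such as complementary densities and TD-errors of trajectories, can be computed as in the sketch below. The function name and the toy inputs are assumptions for illustration; no experimental data is reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Toy inputs: a perfectly increasing and a perfectly decreasing linear relation.
r_positive = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_negative = pearson_r([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])
```

The two toy cases land at the extremes of the range, +1 and -1, while the values of about 0.7 reported above indicate a strong but not perfect positive linear relation.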