It is in the nature of living organisms to harness the knowledge of more experienced individuals in order to develop behaviors and skills that are crucial throughout their lives (Nehaniv and Dautenhahn, 2007), and humans are no exception. Much of this skill acquisition is observational: we observe the behaviors of other agents and imitate them. In scenarios that arise in Safe Reinforcement Learning (Safe RL), one wishes to avoid exploring risky behaviors while pursuing a goal. For example, it is desirable for an autonomous vehicle to avoid veering off the road or colliding with other vehicles or pedestrians. We wish to address Avoidance Learning, a problem reported and extensively studied in human behavior (Turnwald et al., 2016; Norbury et al., 2018), and mathematically formulate the corresponding learning problem for an artificial agent.
Learning from an expert is a well-studied concept in RL and robotics (Argall et al., 2009). It can be categorized into two main approaches: Behavior Cloning (Sammut, 2010) and Inverse Reinforcement Learning (Abbeel and Ng, 2004). In the former case the agent tries to mimic the policy of an expert in a supervised fashion, whereas in the latter case, it recovers a reward function from the expert to optimize its policy. A recent inverse reinforcement learning algorithm is Generative Adversarial Imitation Learning (GAIL) (Ho and Ermon, 2016), where the reward of the expert is estimated by the agent in a two-player zero-sum game setting: a generator tries to approximate the expert policy while a discriminator distinguishes between expert and novice behaviors.
A natural question that arises when learning from observation is how to leverage the information obtained from observing a dangerous demonstrator while pursuing the agent’s own goal (Observational Learning) (Borsa et al., 2017). One can pose such a situation as an anti-imitation learning problem, where the underlying constrained optimization problem is to maximize the agent’s reward while staying as far as possible from the states visited by the demonstrator.
In the real world, circumstances may arise where an agent trying to learn from a demonstrator lacks precise knowledge of the demonstrator’s actions and can only observe its states (i.e., the consequences of its actions). In addition, the objective of the demonstrator is often unknown to the agent (as in the case of inverse RL). All the agent can observe is how the state distribution of the demonstrator changes throughout the learning process. Many safe RL methods either require an explicit constraint on the policy (Achiam et al., 2017) or manage risk by reducing measures of variability in cost such as the conditional value-at-risk (CVaR) (Tamar et al., 2014).
In this work, we present an avoidance learning (AvL) method that requires only state-only trajectories from the demonstrator, with no explicit measure of danger and no policy constraints. The learning problem is a constrained optimization problem where the agent maximizes the discounted sum of rewards while also maximizing the divergence of its own estimated state occupancy distribution from that of the demonstrator. This allows the agent to optimize its own policy while avoiding a bad demonstrator’s policy. Unlike many state-of-the-art methods in safe RL, our method does not require explicit engineering of negative rewards or constraints.
Estimating the state occupancy distribution of the demonstrator’s policy is not trivial when the state space becomes large and a simple count-based estimate does not suffice. Thus, we first present a formulation of this problem using a Variational Auto-Encoder (VAE) (Kingma and Welling, 2013) to estimate the stationary state occupancy distributions of the agent and the demonstrator. We propose a novel objective function for the constrained optimization problem inherent to AvL, which is the sum of the agent’s reward and the Kullback-Leibler (KL) divergence between these two stationary distributions. This is essentially the Lagrangian relaxation (also known as the Lagrangian dual) of the original constrained optimization problem, and the agent’s optimal policy is the maximizer of the Lagrangian relaxation objective.
We apply our proposed method for AvL to both 2D and 3D partially observable environments. We use a convolutional neural density model trained on demonstrator state sequences to find an avoidance reward bonus for agent training. Our experimental results corroborate our theory that this method successfully learns a policy that avoids the dangerous demonstrator trajectories while still finding the optimal reward. Furthermore, this method results in faster learning for the agent trained with the novel objective by avoiding exploration of dangerous states.
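A Markov Decision Process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is the set of actions available to the agent, $P$ is the transition kernel giving a probability over next states given the current state and action, $r$ is a reward function and $\gamma$ is a discount factor. $s_t$ and $a_t$ are respectively the state and action of the expert at time instant $t$. We define a policy $\pi$ as the probability distribution over actions conditioned on the current state, $\pi(a \mid s)$. The value of a policy is defined as $V_\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$, where $\mathbb{E}$ denotes the expectation. The entropy of a policy is $H(\pi) = \mathbb{E}_\pi\left[-\log \pi(a \mid s)\right]$. An agent follows a policy $\pi$ and receives reward $r(s_t, a_t)$ from the environment. A state-action value function is $Q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. The advantage is $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$. We define the un-discounted occupancy of a state $s$ under policy $\pi$ as

$$\rho_\pi(s) = \sum_{t=0}^{T} P(s_t = s \mid \pi).$$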
Partially observable Markov Decision Processes (POMDPs) are a generalization of Markov Decision Processes (MDPs), where the agent does not have the complete knowledge of the state. It takes an action at a state based on its observation, which is an encoding of the underlying state of the environment. One of the earliest works that introduced POMDPs is (Åström, 1965), which established the optimal control (equivalently the action of the agent) with incomplete information about the state.
A POMDP can be modeled by a tuple $(\mathcal{S}, \mathcal{A}, T, R, \Omega, O)$, where $\Omega$ is a set of observations that the agent can experience in its world. $T$ is the state-transition function giving, for each world state and action, a probability distribution, $T(s, a, s')$, over world states. $O(s', a, o)$ is the probability of observing $o$ if the agent took action $a$ and transitioned to state $s'$.
3 Related Work
In this section, we briefly address the works closely related to observational RL.
Inverse Reinforcement Learning (IRL) is an imitation learning method where the agent first recovers the expert’s reward function and then learns its own optimal policy using the estimated expert’s reward (Ng and Russell, 2000).
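In IRL, it is assumed that the expert acts optimally with respect to its reward, i.e.,

$$\mathbb{E}_{\pi_E}\left[\sum_{t=0}^{\infty} \gamma^t \hat{r}(s_t, a_t)\right] \geq \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \hat{r}(s_t, a_t)\right] \quad \forall \pi,$$

where $\pi_E$ is the optimal expert policy, $\pi$ is any policy, and $\hat{r}$ is the expert’s reward estimated by the agent.
The agent learns an optimal policy to maximize $\mathbb{E}_{\pi_A}\left[\sum_{t=0}^{\infty} \gamma^t \hat{r}(s_t, a_t)\right]$, where $\pi_A$ is the agent’s policy. IRL is closely related to observational RL in situations where the agent sees the state-only demonstrations of the expert.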
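Variational Autoencoders (VAE) (Kingma and Welling, 2013) consist of two networks. A generative network (decoder) $p_\theta(x \mid z)$, used for the reconstruction, samples visible reconstructed variables $x$ given latent variables $z$. A variational inference network (encoding network) $q_\phi(z \mid x)$ maps known visible variables $x$ to latent variables $z$. It approximates the posterior over $z$ and is regularized towards a prior distribution $p(z)$. The objective of VAEs is to maximize the evidence lower bound, $\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$.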
PixelCNN for Count Based Exploration. Bellemare et al. (2016) give the definition of a pseudocount, derived from a density model, to be used in count-based exploration. It is computed from a density model $\rho$ over a finite space $\mathcal{X}$, with $\rho_n(x)$ the probability assigned by $\rho$ to a state $x$ after training on a sequence of length $n$ of observed states. Using PixelCNN (van den Oord et al., 2016) as a density model (Ostrovski et al., 2017), a pseudocount is computed and used as an exploration bonus directly on the observed reward in a DQN (Mnih et al., 2013). It is shown to improve speed of learning in numerous Atari 2600 game environments.
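As a small worked illustration of the pseudocount idea, the sketch below uses the closed form given by Bellemare et al. (2016), which derives a pseudocount from the density a model assigns to a state before and after one more update on that state. The function and argument names are illustrative assumptions, not part of the original work.

```python
def pseudocount(rho: float, rho_prime: float) -> float:
    """Pseudocount N = rho * (1 - rho') / (rho' - rho).

    rho       : density assigned to state x before observing it once more
    rho_prime : density assigned to x after one more update on x
    """
    if not 0.0 < rho < rho_prime < 1.0:
        raise ValueError("requires 0 < rho < rho_prime < 1")
    return rho * (1.0 - rho_prime) / (rho_prime - rho)


# Sanity check with an empirical (count-based) density model:
# a state seen 3 times in 10 observations has rho = 3/10 and rho' = 4/11.
n_seen, n_total = 3, 10
rho = n_seen / n_total
rho_prime = (n_seen + 1) / (n_total + 1)
print(pseudocount(rho, rho_prime))  # approximately 3.0, the true visit count
```

With a count-based density model the pseudocount recovers the actual visit count, which is the property that motivates using it with learned density models.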
Safe Reinforcement Learning is the problem of learning a policy that maximizes expected return while ensuring that some safety constraints are met. The exploration process in many of these problems is unique in that it incorporates external knowledge of risky areas of the state and action space, which can also decrease training time (García and Fernández, 2015). A common algorithm is constrained policy optimization given a constrained MDP (Achiam et al., 2017; Altman, 1999). Other methods have formulated a policy gradient algorithm where the CVaR of the rewards is minimized in a constrained optimization problem (Chow and Ghavamzadeh, 2014).
4 Learning to Avoid Demonstrator State Trajectories
The agent observes a set of state trajectories from a bad demonstrator $E$. Let $\tau_E = \{\tau_1, \dots, \tau_N\}$ be the state trajectories of the demonstrator, where $\tau_i = (s^i_0, s^i_1, \dots, s^i_T)$. Denote the distribution of demonstrator trajectories $p_E(\tau)$. Given the demonstrator trajectories, we estimate the state occupancy distribution of the demonstrator policy either by training a VAE on the state occupancy of each demonstrator trajectory (explained below) or, more simply, by averaging the state occupancy measures of all demonstrator trajectories.
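The simpler averaging option can be sketched as follows for a small discrete state space. This is a minimal illustration under stated assumptions: states are integer indices, every trajectory is non-empty, each occupancy is normalized to a distribution, and the helper names and example trajectories are hypothetical.

```python
import numpy as np


def state_occupancy(trajectory, n_states):
    """Normalized visit frequency of each discrete state in one trajectory."""
    counts = np.bincount(np.asarray(trajectory, dtype=np.int64), minlength=n_states)
    return counts / counts.sum()


def mean_occupancy(trajectories, n_states):
    """Average the per-trajectory occupancy measures."""
    return np.mean([state_occupancy(t, n_states) for t in trajectories], axis=0)


# Hypothetical demonstrator trajectories given as state indices.
demo_trajectories = [[0, 1, 2, 2], [0, 1, 1, 3]]
rho_demo = mean_occupancy(demo_trajectories, n_states=5)
print(rho_demo)
```

The same function can be applied to trajectories sampled from the agent’s current policy to obtain the agent-side estimate.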
To avoid the demonstrator trajectories, we can, as an example, derive an optimization problem based on the discounted sum of rewards. We formulate a scenario where the agent wishes to find a policy that maximizes an arbitrary probability distance term $D$ (e.g., the KL divergence) capturing the distance between the state occupancy distribution $\rho_\pi$ induced by its policy and the one induced by the demonstrator policies, $\rho_E$ (the probability of a state being in a demonstrator trajectory). The dual of the constrained optimization problem is given as follows, with Lagrange multiplier $\lambda \geq 0$:

$$\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] + \lambda \, D\left(\rho_\pi, \rho_E\right).$$
We are given demonstrator trajectories and need to compute a stationary distribution, $\rho_E(s)$, which may correspond to an optimal policy or to a policy we wish to avoid, depending on the problem. We can estimate this distribution using a VAE, building a generative model $p(s \mid z)$ conditioned on a latent variable $z$. We can then marginalize out the latent variables to give $p(s) = \int p(s \mid z)\, p(z)\, dz$: our demonstrator stationary distribution. Similarly, we can train a VAE for the agent learner.
This method computes a stationary distribution approximation $\hat{\rho}_{\pi}$ for the agent and $\hat{\rho}_E$ for the demonstrator. We can consider metrics for probability distributions such as the KL divergence, where we would have a general goal of

$$\max_{\pi} \; D_{KL}\left(\hat{\rho}_{\pi} \,\|\, \hat{\rho}_E\right).$$

In the case of avoidance, the agent considers the KL divergence between the stationary distributions of the demonstrator and the agent. A policy optimization objective function to maximize, over the discounted sum of rewards, is therefore

$$J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] + \lambda \, D_{KL}\left(\hat{\rho}_{\pi_\theta} \,\|\, \hat{\rho}_E\right),$$

where $\hat{\rho}_{\pi_\theta}$ and $\hat{\rho}_E$ are the estimates of the state occupancy measures of the agent (under policy $\pi_\theta$) and the expert (demonstrator) respectively. We compute them by taking a finite number of uniformly distributed samples from the two VAEs (used as generative models) and then finding the mean state occupancy measure for each. If we have multiple demonstrators, we can compute for each one a distance term with respect to the current policy. We also select a parameter $\lambda$ as a coefficient for the distance term in $J(\pi_\theta)$. An example of an avoidance learning algorithm that maximizes the KL divergence between state occupancy distributions with Proximal Policy Optimization (PPO) (Schulman et al., 2017) and an arbitrary advantage estimation method (e.g., Generalized Advantage Estimation (GAE)) (Schulman et al., 2016) is shown in Algorithm 1. If we wish to perform this algorithm without using a VAE, we simply average the state occupancy measures from the demonstrator trajectories and the corresponding state occupancy measures calculated from trajectories sampled from the policy in Line 6 of Algorithm 1.
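The KL-augmented objective described in this section can be sketched for discrete occupancy estimates as below. This is an illustration, not the authors’ implementation: the names `kl_divergence`, `avoidance_objective` and `kl_weight`, the direction of the KL term (agent first, demonstrator second), the smoothing constant, and the default values are all assumptions.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for discrete distributions, smoothed to avoid log(0)."""
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def avoidance_objective(rewards, rho_agent, rho_demo, gamma=0.99, kl_weight=0.1):
    """Discounted return plus a weighted KL term that rewards divergence."""
    discounts = gamma ** np.arange(len(rewards))
    discounted_return = float(np.dot(discounts, rewards))
    return discounted_return + kl_weight * kl_divergence(rho_agent, rho_demo)


# Hypothetical occupancy estimates over four states.
rho_agent = [0.5, 0.5, 0.0, 0.0]   # agent stays in states 0 and 1
rho_demo = [0.0, 0.0, 0.5, 0.5]    # demonstrator stays in states 2 and 3
print(avoidance_objective([0.0, 0.0, 1.0], rho_agent, rho_demo))
```

An agent whose occupancy matches the demonstrator’s receives no KL bonus, so the objective reduces to the plain discounted return in that case.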
5 Using Neural Density Models For an Avoidance Bonus
In this section, we propose a solution for AvL in partially observable environments where the agent receives raw images as observations. The previous approach of estimating the state occupancy distribution is no longer tractable. Because the environment is partially observable, two scenarios are possible, depending on whether or not we have access to the agent’s coordinates. If we do, the agent can experience multiple points of view for the same 3D Cartesian coordinates, depending on the angle of observation, so the number of possible states grows. If we cannot access the exact positions, as in many real-world scenarios, our previous approach of computing the KL divergence between two estimated state distributions no longer applies.
Let $\mathcal{X}$ be the space of observable states induced by any agent and its training environment. The empirical distribution $\mu_E$ is induced by the observations of the demonstrator $E$. For $x_1, x_2 \in \mathcal{X}$ with $\mu_E(x_1) > \mu_E(x_2)$, the corresponding neural density model estimates would show $\rho(x_1) > \rho(x_2)$. We wish our agent to avoid observing $x_1$ more than $x_2$ during training, so that it avoids the demonstrator trajectories.
We now define the pseudocount and exploration bonus. $\rho_n(x)$ is the probability the density model assigns to $x$ after training on $n$ observations, and $\rho'_n(x)$ is the probability that $\rho$ would assign to $x$ if it was trained on $x$ again. Given the density model $\rho$, we can compute a prediction gain of the model, $PG_n(x) = \log \rho'_n(x) - \log \rho_n(x)$. The prediction gain is also thresholded at zero so that the model is learning positive: $\rho$ is learning positive if $PG_n(x) \geq 0$ for all $x_1, \dots, x_n, x \in \mathcal{X}$.
We use a Gated PixelCNN based density model to generate a pseudocount defined by:

$$\hat{N}_n(x) = \left( e^{\, c \, n^{-1/2} \, (PG_n(x))_+} - 1 \right)^{-1},$$

with $c$ corresponding to a prediction gain decay. We obtain $\rho$ by training the density model on trajectories sampled from the demonstrator. We can say this is a density estimate of the demonstrator observations in a POMDP. We then have an avoidance reward bonus that we define similarly to an exploration bonus at step $n$,

$$r^+_n(x) = \beta \, \hat{N}_n(x)^{-1/2},$$

with $\beta$ being the pseudocount weight coefficient parameter. This bonus is then added to the observed agent rewards to encourage avoidance of the states that occur with high frequency in demonstrator trajectories. The bonus is low for demonstrator observations and higher for observations not commonly occupied by the demonstrator. We therefore use the pseudocount as a reward bonus for the agent during training, since it gives positive feedback for exploring parts of the state space not seen by the demonstrator, thereby steering the agent away from the demonstrator’s trajectories. We expect this bonus to be useful during training for environments with sparse rewards, because it provides feedback to the agent at every step.
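The bonus computation described in this section can be sketched in a few lines, assuming the prediction-gain pseudocount of Ostrovski et al. (2017) and a bonus proportional to the inverse square root of the pseudocount. The function names are hypothetical, the density model itself is not shown (only its log-probabilities before and after an update are taken as inputs), and the defaults `beta=0.1` and `c=0.1` are assumptions loosely based on the values reported in the experiments and in Appendix C.2.

```python
import math


def prediction_gain(log_rho: float, log_rho_prime: float) -> float:
    """PG = log rho'(x) - log rho(x), clipped at zero (learning-positive)."""
    return max(log_rho_prime - log_rho, 0.0)


def pg_pseudocount(pg: float, n: int, c: float = 0.1) -> float:
    """Pseudocount (exp(c * n^(-1/2) * PG) - 1)^(-1); infinite when PG == 0."""
    scaled = c * pg / math.sqrt(n)  # n >= 1 is the number of training observations
    return math.inf if scaled == 0.0 else 1.0 / math.expm1(scaled)


def avoidance_bonus(pg: float, n: int, beta: float = 0.1, c: float = 0.1) -> float:
    """Bonus beta * N^(-1/2): small for demonstrator-like observations."""
    count = pg_pseudocount(pg, n, c)
    return 0.0 if math.isinf(count) else beta / math.sqrt(count)


# A familiar observation yields a small prediction gain and a small bonus;
# a novel observation yields a large prediction gain and a larger bonus.
familiar = avoidance_bonus(prediction_gain(-5.00, -4.99), n=100)
novel = avoidance_bonus(prediction_gain(-9.0, -7.0), n=100)
print(familiar, novel)
```

Because the density model is trained on demonstrator observations, "familiar" here means familiar to the demonstrator, which is what turns an exploration bonus into an avoidance bonus.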
We investigate the following questions with our experiments:
Does using a KL penalty or pseudocount avoidance bonus actually enforce safer trajectories during training? How does it compare with baseline Safe RL methods?
Does sample efficiency improve in environments with sparse rewards while avoiding unsafe regions?
We use the task settings below to explore these questions. Our experiments are designed to emulate realistic control situations, such as avoiding dangerous regions in 2D fully observable grid-world environments and 3D partially observable worlds with sparse rewards, selecting the right path or room to reach the goal, and selecting the right objects. All are implemented with Gym-supported environments, and the hyperparameters used in our experiments are described in Appendix C.2.
6.1 2D Grid-world Environments
We build our 2D grid-world environments using the Gym "MiniGrid" package (Chevalier-Boisvert et al., 2018). The environments are fully observable and each observation is a (w, h, 3) tensor. At each timestep, the agent can change its direction or move; the actions are turn-left, turn-right and move-forward. If the agent moves forward while facing a wall, it stays in the same state. Rewards are sparse: we gave a non-zero reward to the agent only when it fully completed the mission, and the magnitude of the reward was $1 - 0.9\,(n / n_{max})$, where $n$ is the length of the successful episode and $n_{max}$ is the maximum number of steps that we allowed for completing the episode, different for each mission. If the agent goes into lava or reaches the maximum number of steps allowed for the episode, the episode ends with 0 reward. Some examples are shown in Figure 1 (see Appendix A for more details about MiniGrid).
We train the demonstrator to go to lava, which is the dangerous behavior the agent should avoid. When there are multiple lava cells to avoid, we can train several demonstrators, each trained to go to a different lava cell.
For experiments, we used the PPO algorithm with parallelized data collection and GAE. Each environment is run with 20 random network initializations.
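For readers unfamiliar with GAE, the advantage estimator used alongside PPO can be sketched as follows. This is a minimal single-episode version without termination masking; the function name and the default values of `gamma` and `lam` are assumptions rather than the settings used in the experiments.

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout.

    rewards : rewards r_0 .. r_{T-1}
    values  : value estimates V(s_0) .. V(s_T); the last entry is the bootstrap value
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    advantages = np.zeros_like(rewards)
    running = 0.0
    # Accumulate discounted TD residuals backwards through time.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


print(gae_advantages([0.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0))
```

With an avoidance bonus or KL term in use, the `rewards` passed in would be the augmented per-step rewards rather than the raw environment rewards.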
When executing Algorithm 1, PPO with avoidance KL penalty, we sample 100 trajectories from the demonstrators’ policies. We then use these trajectories to estimate the demonstrator state occupancy distribution with 10,000 samples from the VAE trained on the demonstrator state occupancies.
6.2 3D Partially observable environments
We use the Gym "MiniWorld" package (Chevalier-Boisvert, 2018) to create 3D environments with an egocentric point of view for the agent. Examples of these environments are presented in Figure 2 (see Appendix B for more information on MiniWorld). We can solve these environments using PPO with a convolutional actor-critic architecture (see Appendix C.1 for more details).
As explained previously for MiniGrid, we train the demonstrators by modifying the reward function so that each demonstrator learns the behavior we later want to avoid for safety reasons. For example, on Sidewalk the demonstrator receives a reward if it goes onto the street or walks along it, or even if it gets stuck in front of a wall.
On the other hand, when training our true agent to go to the goal (the red box), we don’t want it to go to the street where there is a potential danger. Therefore, we aim to give this agent an avoidance bonus at each step, using the neural density model described in Section 5. The agent receives a bonus when seeing observations far from the demonstrator distribution. This incentivizes the agent to remain far from danger, and would additionally improve sample efficiency because the rewards would not be sparse anymore.
6.3 Results for 2D Grid-world environments
The avoidance learning method described in Section 4 yields a modest increase in the sample efficiency of PPO and also generates policies that show more pronounced avoidance of dangerous regions. In relatively simple Grid environments, we did not expect large increases in sample efficiency, as the solutions are relatively easy to find in a small state space compared to complex 3D environments. In Figure 3, the state occupancy distributions show that introducing the KL term during training leads to convergence to a policy that stays one cell further away from the lava cells. Panels (a, b) show faster convergence of the policy towards the goal, in terms of sample efficiency, and avoidance of dangerous states during training.
In addition, we compared against a trajectory-based policy gradient CVaR optimization method that is a variation of PPO (PPO-CVaR, described in Appendix D.1) and the original version presented in Chow and Ghavamzadeh (2014) (PG-CVaR), which uses the discounted sum of rewards as the policy gradient score function. For the CVaR experiments, we assigned a reward to each lava cell instead of ending the episode.
In Table 1, we see that introducing the KL term leads to faster policy convergence (for the optimal hyperparameters) in all Grid environments tested. We compare using the VAE to estimate the stationary distributions of the state occupancy against averaging the state occupancies computed from sampled agent and demonstrator trajectories. The value of the KL term depends on the extent of divergence from the demonstrator trajectories and therefore has a large effect on convergence. For some environments, the agent temporarily follows the trajectory of the demonstrator and eventually diverges from it to achieve the goal. By varying the KL weight, we find a suitable KL term that achieves sufficient divergence while still accomplishing the task. The method performs best in environments where the lava lies in a separated area (e.g., a room with one entrance). The fact that AvL with a VAE performs either worse or only marginally better than simple averaging to estimate the state occupancy distributions indicates that this computationally expensive step could be unnecessary given sufficient demonstrator trajectories.
|Environment||AvL (VAE)||AvL (Avg)||PPO||PPO-CVaR||PG-CVaR|
6.4 Results for 3D environments
For 3D MiniWorld environments we see substantial improvements in both the achieved success rate and the number of frames needed to reach convergence. Variance is also reduced (Figure 4).
Some caution is warranted in interpreting this improvement, and we wish to point out the importance of the choice of seeds used to run the experiments. We randomly selected 20 seeds. In addition, the road is a very large region (the entire left side of the world) and entering it ends the episode. By learning to avoid this region, the agent is confined to a relatively small sidewalk and can reach the goal more easily.
Hyperparameters are the same for experiments using PPO with and without the pseudocount bonus; we optimized them to obtain the best possible training curve, in terms of number of frames until convergence, for PPO without the bonus. For some seeds, the policy did not converge towards succeeding in the missions without the bonus, while it did with the pseudocount bonus. We tuned the pseudocount weight parameter, as we previously did with the KL weight parameter, and achieved the best results with 0.1.
Regarding safety during training, the agent indeed avoids going on the road on the Sidewalk environment and travels less frequently to the incorrect box on FourRooms.
We propose a novel algorithm for Avoidance Learning through observational reinforcement learning as well as a novel method for observational reinforcement learning in partially observable continuous environments. We demonstrate that learning from observation has the ability to learn safer policies, provide a safer learning process and learn these policies more efficiently. The approach does not require any explicit knowledge of demonstrator actions, any engineering of negative rewards, or known policy constraints. State-only demonstrations are sufficient. The results of this method are of interest to autonomous vehicles, as they can be trained on cars without an action recording apparatus. Simple observations of poor driving behavior can be used via simple video recordings.
Some limitations can arise from our approach. A fundamental assumption for this to work is that one can sample state trajectories from the demonstrator. In addition, estimating stationary state distributions or using GatedPixelCNN is also computationally expensive and in the case of the VAE, it requires training at every update: we trade computation time for sample efficiency and safety.
We thank Maxime Chevalier-Boisvert, Riashat Islam, David Yu-Tung Hui, Dzmitry Bahdanau and Charles Guille-Escuret for helpful discussions. We thank Jordan Hoffman, Vincent Luczkow, Guillaume Alain for their help in reviewing the paper.
- Abbeel and Ng (2004). Apprenticeship learning via inverse reinforcement learning. International Conference on Machine Learning, pp. 1.
- Achiam et al. (2017). Constrained policy optimization. In International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, pp. 22–31.
- Altman (1999). Constrained Markov decision processes.
- Argall et al. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems 57 (5), pp. 469–483.
- Åström (1965). Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications 10 (1), pp. 174–205.
- Bellemare et al. (2016). Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29, pp. 1471–1479.
- Borsa et al. (2017). Observational learning by reinforcement learning.
- Chevalier-Boisvert et al. (2018). Minimalistic gridworld environment for OpenAI Gym. GitHub. https://github.com/maximecb/gym-minigrid
- Chevalier-Boisvert (2018). Gym-miniworld environment for OpenAI Gym. GitHub. https://github.com/maximecb/gym-miniworld
- Chow and Ghavamzadeh (2014). Algorithms for CVaR optimization in MDPs. In International Conference on Neural Information Processing Systems, NIPS’14.
- García and Fernández (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16, pp. 1437–1480.
- Ho and Ermon (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573.
- Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Mnih et al. (2013). Playing Atari with deep reinforcement learning.
- Nehaniv and Dautenhahn (2007). Imitation and social learning in robots, humans and animals: behavioural, social and communicative dimensions. Cambridge University Press.
- Ng and Russell (2000). Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, ICML ’00.
- Norbury et al. (2018). Value generalization in human avoidance learning. eLife.
- Ostrovski et al. (2017). Count-based exploration with neural density models. In International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, pp. 2721–2730.
- Sammut (2010). Behavioral cloning. In Encyclopedia of Machine Learning, pp. 93–97.
- Schulman et al. (2016). High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR).
- Schulman et al. (2017). Proximal policy optimization algorithms.
- Tamar et al. (2014). Policy gradients beyond expectations: conditional value-at-risk. ArXiv abs/1404.3862.
- Turnwald et al. (2016). Understanding human avoidance behavior: interaction-aware decision making based on game theory. International Journal of Social Robotics.
- van den Oord et al. (2016). Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems 29, pp. 4790–4798.
Appendix A MiniGrid Environments for OpenAI Gym
MiniGrid is an open source gridworld package (https://github.com/maximecb/gym-minigrid) that includes a family of reinforcement learning environments compatible with the OpenAI Gym framework. Many of these environments are customizable so that the task difficulty can be adjusted (e.g., the size of rooms, the number and type of objects, the topology).
A.1 The World
In MiniGrid, the world is a rectangular grid of tiles. Each tile in the grid contains at most one object. We use the objects: wall, lava and goal. Each object has an associated discrete color, which can be one of red, green, blue, purple, yellow and grey. By default, walls are always grey and goal squares are always green.
A.2 Reward Function
Rewards are sparse for all MiniGrid environments. Episodes are terminated with a positive reward when the agent reaches the specified goal (generally the green goal square). Otherwise, episodes are terminated with zero reward when a time step limit is reached or the agent goes into lava.
The formula for calculating positive sparse rewards is $1 - 0.9\,(n / n_{max})$. That is, rewards are always between zero and one; the quicker the agent successfully completes an episode, the closer the reward is to $1$. The parameter $n_{max}$ is different for each environment, and varies depending on the size of each environment, with larger environments having a higher time step limit.
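As a worked illustration of this sparse reward, the sketch below assumes the formula $1 - 0.9\,(n / n_{max})$ stated in Section 6.1 for successful episodes and zero reward otherwise. The function and argument names are hypothetical and are not MiniGrid’s API.

```python
def sparse_reward(step_count: int, max_steps: int, success: bool) -> float:
    """Return 1 - 0.9 * (n / n_max) on success and 0 otherwise."""
    if not success:
        return 0.0
    return 1.0 - 0.9 * (step_count / max_steps)


# Finishing halfway through the step budget gives roughly 0.55,
# while finishing on the very last allowed step gives roughly 0.1.
print(sparse_reward(50, 100, success=True))
print(sparse_reward(100, 100, success=True))
print(sparse_reward(30, 100, success=False))
```

The reward of a successful episode therefore never drops below about 0.1, which keeps success distinguishable from failure even for slow episodes.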
A.3 Action Space
There are seven actions in MiniGrid: turn left, turn right, move forward, pick up an object, drop an object, toggle and done. For the purpose of this paper, the pick up, drop, toggle and done actions are irrelevant. The agent can use the turn-left and turn-right actions to rotate and face one of 4 possible directions (north, south, east, west). The move forward action makes the agent move from its current tile onto the tile in the direction it is currently facing, provided there is nothing on that tile, or that the tile contains an open door. The agent can open doors if they are right in front of it by using the toggle action.
A.4 Observation Space
We are using fully observable grids of size $w \times h$. The observations are provided as a tensor of shape (w, h, 3). However, note that these are not RGB images. Each tile is encoded using 3 integer values: one describing the type of object contained in the cell, one describing its color, and a flag indicating whether doors are open or closed. This compact encoding was chosen for space efficiency and to enable faster training. The fully observable RGB image view of the environments shown in this paper is provided for visualization.
Appendix B MiniWorld Environments for OpenAI Gym
MiniWorld (https://github.com/maximecb/gym-miniworld) (Chevalier-Boisvert, 2018) is a minimalistic 3D interior environment simulator for reinforcement learning and robotics research. It can be used to simulate environments with rooms, doors, hallways and various objects (e.g., office and home environments, mazes).
B.1 The World
In MiniWorld, the world is made of static elements (rooms and hallways), as well as objects which may be dynamic, which we call entities. Environments are essentially 2D floorplans made of connected rooms. Rooms can have any convex outline defined by at least 3 points. Portals (openings) can be created in walls to create doors or windows into other rooms.
B.2 Coordinate System
MiniWorld uses OpenGL’s right-handed coordinate system. The ground plane lies along the X and Z axes, and the Y axis points up. When direction angles are specified, a positive angle corresponds to a counter-clockwise (leftward) rotation. Angles are in degrees for ease of hand-editing. By convention, angle zero points towards the positive X axis.
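To make the angle convention concrete, the sketch below converts a heading angle in degrees into a unit direction vector on the ground plane, under the conventions stated above (right-handed axes, Y up, angle zero along +X, positive angles counter-clockwise when seen from above). The function name is hypothetical and is not taken from the MiniWorld code base.

```python
import math


def direction_vector(angle_deg: float):
    """Unit vector on the X-Z ground plane for a heading angle in degrees."""
    theta = math.radians(angle_deg)
    # With Y pointing up in a right-handed system, a counter-clockwise
    # (leftward) turn seen from above rotates +X towards -Z.
    return (math.cos(theta), 0.0, -math.sin(theta))


print(direction_vector(0.0))   # points along +X
print(direction_vector(90.0))  # points along -Z, up to floating-point error
```

The negative sign on the Z component is what makes a positive angle a leftward turn in this coordinate system.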
The observations are single camera images, as numpy arrays of size (80, 60, 3). These arrays contain unsigned 8-bit integer values in the [0, 255] range.
For simplicity, actions are discrete. The actions we use are: turn-left, turn-right and move-forward. The turn and move actions will rotate or move the agent by a small fixed interval. The simulator assumes that the agent is a differential drive robot.
B.5 Reward function
Each environment has an associated max-episode-steps variable which specifies the maximum number of time steps allowed to complete an episode. By default, rewards are sparse and in the [0, 1] range, with a small penalty being given based on the number of time steps needed to complete the task: $1 - 0.2\,(n / n_{max})$. If the task is not completed within the maximum number of steps allowed, a reward of 0 is produced.
Appendix C Hyperparameters and Models Architectures
C.1 Architecture of the policies
We rely on the actor-critic architecture on both MiniGrid and MiniWorld.
C.2 Hyperparameters used in MiniGrid and MiniWorld
|value loss coefficient||0.5|
maximum norm of gradient in PPO
number of PPO epochs
|batch size for PPO||256|
|value loss coefficient in PPO||0.5|
maximum norm of gradient in PPO
|number of PPO epochs||4|
|batch size for PPO||128|
|clip parameter in PPO||0.2|
|parameter in GatedPixelCNN||0.1|
Appendix D Methodology
D.1 CVaR Constrained Policy Gradient Method
We propose a CVaR constraint for the PPO algorithm. $L^{CLIP}(\theta)$ is the standard PPO clipped loss. The constraint is placed on $\mathrm{CVaR}_\alpha(Z)$, with $Z$ being an arbitrary probability distribution; here $Z$ refers to the distribution of the discounted sum of rewards starting from state $s$ under policy $\pi$, which is estimated from sampled trajectories.
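A sample-based CVaR estimate of the kind a trajectory-based method relies on can be sketched as follows. This is only an illustration under stated assumptions: CVaR is taken over the lower tail of sampled discounted returns, the confidence level is called `alpha`, and the function name is hypothetical; it does not reproduce the exact PPO-CVaR loss.

```python
import numpy as np


def empirical_cvar(returns, alpha=0.1):
    """Mean of the worst alpha-fraction of sampled returns (lower tail)."""
    returns = np.sort(np.asarray(returns, dtype=np.float64))
    # Keep at least one sample so the estimate is always defined.
    k = max(1, int(np.ceil(alpha * len(returns))))
    return float(returns[:k].mean())


# Hypothetical discounted returns from ten sampled trajectories.
sampled_returns = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(empirical_cvar(sampled_returns, alpha=0.5))
```

A risk-sensitive update would then penalize policies whose estimated CVaR falls below a chosen threshold, in addition to optimizing the usual clipped loss.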
Appendix E Additional environments and results
We also show the effect of selecting varying coefficients for the neural avoidance bonus.