Adversarial Policies: Attacking Deep Reinforcement Learning

05/25/2019 ∙ by Adam Gleave, et al. ∙ UC Berkeley

Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent's observations. This might lead one to wonder: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial? We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. We find that these policies are more successful in high-dimensional environments, and induce substantially different activations in the victim policy network than when the victim plays against a normal opponent. Videos are available online.







1 Introduction

The discovery of adversarial examples for image classifiers prompted a new field of research into adversarial attacks and defenses [Szegedy et al., 2014]. Recent work has shown that deep RL policies are also vulnerable to adversarial perturbations of image observations [Huang et al., 2017, Kos and Song, 2017]. However, real-world RL agents inhabit natural environments populated by other agents, including humans, who can only modify observations through their actions. We explore whether it’s possible to attack a victim policy by building an adversarial policy that takes actions in a shared environment, inducing natural observations which have adversarial effects on the victim.

RL has been applied in settings as varied as autonomous driving [Dosovitskiy et al., 2017], negotiation [Lewis et al., 2017] and automated trading [Noonan, 2017]. In domains such as these, an attacker cannot usually directly modify the victim policy’s input. For example, in autonomous driving pedestrians and other drivers can take actions in the world that affect the camera image, but only in a physically realistic fashion. They cannot add noise to arbitrary pixels, or make a building disappear. Similarly, in financial trading an attacker can send orders to an exchange which will appear in the victim’s market data feed, but the attacker cannot modify observations of a third party’s orders.



Figure 1: Illustrative snapshots of a victim (in blue) against normal and adversarial opponents (in red). The victim wins if it crosses the finish line; otherwise, the opponent wins. Despite never standing up, the adversarial opponent wins 86% of episodes, far above the normal opponent’s 47% win rate.

As a proof of concept, we show the existence of adversarial policies in zero-sum simulated robotics games with proprioceptive observations [Bansal et al., 2018a]. The state-of-the-art victim policies were trained via self-play to be robust to opponents. We train each adversarial policy using model-free RL against a fixed black-box victim. We find the adversarial policies reliably beat their victim, despite training for less than 3% of the time steps initially used to train the victim policies.

Critically, we find the adversaries win by creating natural observations that are adversarial, and not by becoming generally strong opponents. Qualitatively, the adversaries fall to the ground in contorted positions, as illustrated in Figure 1, rather than learning to run, kick or block like normal opponents. This strategy does not work when the victim is ‘masked’ and cannot see the adversary’s position, suggesting that the adversary succeeds by manipulating a victim’s observations through its actions.

Having observed these results, we wanted to understand the sensitivity of the attack to the number of dimensions of the victim’s observations the attacker can influence. We test this by varying the robotic body (Humanoid, with 24 dimensions influenced by the attacker, and Ant, with 15 dimensions), while keeping the high-level task the same. We find victim policies in the higher-dimensional Humanoid environments are substantially more vulnerable to adversarial policies than in Ant.

To gain insight into why adversarial policies succeed, we analyze the activations of the victim’s policy network using a Gaussian Mixture Model and t-SNE [Maaten and Hinton, 2008]. We find adversarial policies induce significantly different activations than normal opponents. Furthermore, the adversarial activations are typically more widely dispersed across time steps than normal activations.

Our paper makes three contributions. First, we propose a novel, physically realistic threat model for adversarial examples in RL. Second, we demonstrate the existence of adversarial policies in this threat model, in several simulated robotics games. Our adversarial policies reliably beat the victim, despite training with less than 3% as many timesteps and generating seemingly random behavior. Third, we conduct a detailed analysis of why the adversarial policies work. We show they create natural observations that are adversarial to the victim and push the activations of the victim’s policy network off-distribution. Additionally, we find policies are easier to attack in high-dimensional environments.

As deep RL is increasingly deployed in environments with potential adversaries, we believe it is important that practitioners are aware of this previously unrecognized threat model. Moreover, even in benign settings, we believe adversarial policies can be a useful tool for uncovering unexpected policy failure modes. Finally, we are excited by the potential of adversarial training using adversarial policies, which could improve robustness relative to conventional self-play by training against adversaries that exploit weaknesses undiscovered by the distribution of similar opponents present during self-play.

2 Related Work

Most study of adversarial examples has focused on small norm perturbations to images, which Szegedy et al. [2014] discovered cause a variety of models to confidently mispredict the class, even though the changes are visually imperceptible to a human. Gilmer et al. [2018a] argued that attackers are not limited to small perturbations, and can instead construct new images or search for naturally misclassified images. Similarly, Uesato et al. [2018] argue that the near-ubiquitous ℓp-norm model is merely a convenient local approximation for the true worst-case risk. We follow Goodfellow et al. [2017] in viewing adversarial examples more broadly, as “inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake.”

The little prior work studying adversarial examples in RL has assumed an ℓp-norm threat model. Huang et al. [2017] and Kos and Song [2017] showed that deep RL policies are vulnerable to small perturbations in image observations. Recent work by Lin et al. [2017] generates a sequence of perturbations guiding the victim to a target state. Our work differs from these previous approaches by using a physically realistic threat model that disallows direct modification of the victim’s observations.
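For contrast with our threat model, the ℓ∞-norm attacks in this prior work can be sketched in a few lines. This is an illustrative FGSM-style sketch, not code from any of the cited papers: `grad` stands in for the gradient of some attacker loss with respect to the observation, and `eps` is the perturbation budget.

```python
import numpy as np

def fgsm_perturb(obs, grad, eps=0.01):
    """One-step ℓ∞-bounded perturbation (FGSM-style): shift each
    observation dimension by at most eps in the direction of the
    attacker-loss gradient."""
    return obs + eps * np.sign(grad)

obs = np.array([0.5, -0.2, 1.0])
grad = np.array([0.3, -0.7, 0.0])   # stand-in gradient of an attacker loss
adv_obs = fgsm_perturb(obs, grad, eps=0.01)
```

Our threat model disallows exactly this kind of direct sensor-level edit: the adversary can only influence observations through its own actions in the shared environment.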

Specifically, we model the adversary and victim as agents in a Markov game, drawing on a long tradition in multi-agent reinforcement learning [Littman, 1994]. Competitive multi-agent environments are useful as a source of concrete threat models [Lowe et al., 2017, Bansal et al., 2018a]. However, finding an adversarial policy is a single-agent RL problem since the victim policy is fixed.

Adversarial training is a common defense to adversarial examples, achieving state-of-the-art robustness in image classification [Xie et al., 2019]. Prior work has also applied adversarial training to improve the robustness of deep RL policies, where the adversary exerts a force vector on the victim or varies dynamics parameters such as friction [Pinto et al., 2017, Mandlekar et al., 2017, Pattanaik et al., 2018]. We hope to explore adversarial training with adversarial policies in future work. We expect this to produce policies robust to opponents unlike those they were trained with, in contrast to conventional self-play which only trains for robustness in a small region of policy space.

3 Framework

We model the victim as playing against an opponent in a two-player Markov game [Shapley, 1953]. Our threat model assumes the attacker can control the opponent, in which case we call the opponent an adversary. We denote the adversary and victim by subscripts α and ν respectively. The game M = (S, (A_α, A_ν), T, (R_α, R_ν)) consists of a state set S, action sets A_α and A_ν, and a joint state transition function T : S × A_α × A_ν → Δ(S), where Δ(S) is a probability distribution on S. The reward function R_i : S × A_α × A_ν × S → ℝ for player i ∈ {α, ν} depends on the current state, next state and both players’ actions. Each player wishes to maximize their (discounted) sum of rewards.

The adversary is allowed unlimited black-box access to actions sampled from π_ν, but is not given any white-box information such as weights or activations. We further assume the victim agent follows a fixed stochastic policy π_ν, corresponding to the common case of a pre-trained model deployed with static weights. Safety-critical systems are particularly likely to use a fixed or infrequently updated model due to the considerable expense of real-world testing.

Since the victim policy is held fixed, the two-player Markov game M reduces to a single-player MDP M_α = (S, A_α, T_α, R_α) that the attacker must solve. The state and action space of the adversary are the same as in M, while the transition and reward function have the victim policy embedded:

T_α(s, a_α) = T(s, a_α, a_ν),   R_α(s, a_α, s′) = R_α(s, a_α, a_ν, s′),

where the victim’s action is sampled from the stochastic policy a_ν ∼ π_ν(· | s). The goal of the attacker is to find an adversarial policy π_α maximizing the sum of discounted rewards:

π_α ∈ argmax_π E[ Σ_{t≥0} γ^t R_α(s_t, a_t, s_{t+1}) ],  where s_{t+1} ∼ T_α(s_t, a_t) and a_t ∼ π(· | s_t).   (1)
Note the MDP’s dynamics will be unknown even if the Markov game’s dynamics are known since the victim policy is a black-box. Consequently, the attacker must solve an RL problem.
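This reduction can be sketched directly: the victim policy is sampled inside the transition, so from the adversary's perspective the environment is an ordinary single-agent one. The `ToyGame` and the `game`/`victim` interfaces below are illustrative stand-ins, not the environments or code used in the paper.

```python
import random

class EmbeddedVictimEnv:
    """Single-player MDP obtained by embedding a fixed victim policy
    inside a two-player Markov game (hypothetical interfaces)."""
    def __init__(self, game, victim_policy):
        self.game = game            # exposes reset() and step(s, a_adv, a_vic)
        self.victim = victim_policy # black box: state -> action, weights hidden
        self.state = None

    def reset(self):
        self.state = self.game.reset()
        return self.state

    def step(self, a_adv):
        a_vic = self.victim(self.state)   # sampled inside the transition
        s_next, r_adv, done = self.game.step(self.state, a_adv, a_vic)
        self.state = s_next
        return s_next, r_adv, done

class ToyGame:
    """Tiny zero-sum toy game: push a scalar state past +3 to win."""
    def reset(self):
        return 0
    def step(self, s, a_adv, a_vic):
        s_next = s + a_adv - a_vic
        done = abs(s_next) >= 3
        r = 1.0 if s_next >= 3 else (-1.0 if s_next <= -3 else 0.0)
        return s_next, r, done

env = EmbeddedVictimEnv(ToyGame(), victim_policy=lambda s: 0)
s = env.reset()
s, r, done = env.step(1)
```

Because the victim's weights are hidden, the adversary observes only the resulting dynamics, which is why model-free RL applies directly.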

(a) Kick and Defend
(b) You Shall Not Pass
(c) Sumo Humans
(d) Sumo Ants
Figure 2: Illustrations of the zero-sum simulated robotics games from Bansal et al. [2018a] we use for evaluation. Environments are further described in Section 4.1.

4 Finding Adversarial Policies

We demonstrate the existence of adversarial policies in zero-sum simulated robotics games. First, we describe how the victim policies were trained and the environments they operate in. Subsequently, we provide details of our attack method in these environments, and describe several baselines. Finally, we present a quantitative and qualitative evaluation of the adversarial policies and baseline opponents.

4.1 Environments and Victim Policies

We attack victim policies for the zero-sum simulated robotics games created by Bansal et al. [2018a], illustrated in Figure 2. The victims were trained in pairs via self-play against random old versions of their opponent, for between 680 and 1360 million time steps. We use the pre-trained policy weights released in the “agent zoo” of Bansal et al. [2018b]. In symmetric environments, the zoo agents are labeled ZooN, where N is a random seed. In asymmetric environments, they are labeled ZooVN and ZooON, representing the Victim and Opponent agents respectively.

All environments are two-player games in the MuJoCo robotics simulator. Both agents observe the position, velocity and contact forces of joints in their body, and the position of their opponent’s joints. The episodes end when a win condition is triggered, or after a time limit, in which case the agents draw. We evaluate in all environments from Bansal et al. [2018a] except for Run to Goal, which we omit as the setup is identical to You Shall Not Pass except for the win condition. We describe the environments below, and specify the number of zoo agents and their type (MLP or LSTM):

Kick and Defend (3, LSTM). A soccer penalty shootout between two Humanoid robots. The positions of the kicker, goalie and ball are randomly initialized. The kicker wins if the ball goes between the goalposts; otherwise, the goalie wins, provided it remains within 3 units of the goal.

You Shall Not Pass (1, MLP). Two Humanoid agents are initialized facing each other. The runner wins if it reaches the finish line; the blocker wins if it does not.

Sumo Humans (3, LSTM). Two Humanoid agents compete on a round arena. The players’ positions are randomly initialized. A player wins by remaining standing after their opponent has fallen. (Bansal et al. [2018a] consider the episode to end in a tie if a player falls before it is touched by an opponent. Our win condition allows for attacks that indirectly modify observations without physical contact.)

Sumo Ants (4, LSTM). The same task as Sumo Humans, but with ‘Ant’ quadrupedal robot bodies. We use this task in Section 5.2 to investigate the importance of dimensionality to this attack method.

4.2 Methods Evaluated

Following the RL formulation in Section 3, we train an adversarial policy to maximize Equation 1 using Proximal Policy Optimization (PPO) [Schulman et al., 2017]. We give a sparse reward at the end of the episode: positive when the adversary wins the game and negative when it loses or ties. Bansal et al. [2018a] trained the victim policies using a similar reward, with an additional dense component at the start of training. We train for 20 million time steps using Stable Baselines’s PPO implementation [Hill et al., 2019]. The hyperparameters were selected through a combination of manual tuning and a random search of 100 samples; see the training section of the supplementary material for details. We compare our method to three baselines: a policy Rand taking random actions; a lifeless policy Zero that exerts zero control; and all pre-trained policies Zoo* from Bansal et al. [2018a].
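The sparse reward scheme can be sketched as a wrapper that discards any dense per-step reward and emits a single terminal signal. This is our own minimal sketch; the `ToyMatch` environment and its `info["winner"]` field are hypothetical stand-ins for the real MuJoCo games.

```python
class SparseRewardWrapper:
    """Replace per-step rewards with one terminal reward: +1 if the
    adversary won, -1 on a loss or tie (sketch of the sparse scheme)."""
    def __init__(self, env):
        self.env = env
    def reset(self):
        return self.env.reset()
    def step(self, action):
        obs, _, done, info = self.env.step(action)   # dense reward discarded
        if not done:
            return obs, 0.0, done, info
        r = 1.0 if info.get("winner") == "adversary" else -1.0
        return obs, r, done, info

class ToyMatch:
    """Toy episodic env: ends after 3 steps; info reports the winner."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        done = self.t >= 3
        info = {"winner": "adversary"} if done else {}
        return self.t, 0.5, done, info   # dense reward the wrapper drops

env = SparseRewardWrapper(ToyMatch())
env.reset()
rewards, done = [], False
while not done:
    obs, r, done, info = env.step(0)
    rewards.append(r)
```

Any standard PPO implementation can then be trained against the wrapped environment without modification.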

Figure 3: Win rate over training against the median victim in each environment (median selected based on the difference between the win rate for Adv and Zoo). The adversary outperforms the baseline against the median victim in Kick and Defend and You Shall Not Pass, and is competitive on Sumo Humans. For full results, see Figure 4 below or the full win-rate figure in the supplementary material.

4.3 Results

Quantitative Evaluation  We find the adversarial policies reliably win against most victim policies, and outperform the pre-trained Zoo baseline for a majority of environments and victims. We report the win rate over time against the median victim in each environment in Figure 3, with full results in the supplementary material. Win rates against all victims are summarized in Figure 4.

Qualitative Evaluation  The adversarial policies beat the victim not by performing the intended task (e.g. blocking a goal), but rather by exploiting weaknesses in the victim’s policy. This effect is best seen by watching the videos. In Kick and Defend and You Shall Not Pass, the adversarial policy never stands up. The adversary instead wins by taking actions that induce adversarial observations causing the victim’s policy to take poor actions. A robust victim could easily win, a result we demonstrate in Section 5.1.

This style of attack is impossible in Sumo Humans, since the adversarial policy immediately loses if it falls over. Faced with this control constraint, the adversarial policy learns a higher-level strategy: it kneels in the center in a stable position. Surprisingly, this is very effective against victim 1, which falls over attempting to tackle the adversary in 88% of cases. However, it proves less effective against victims 2 and 3, achieving only a 62% and 45% win rate respectively, below the Zoo baselines. We further explore the importance of the number of dimensions the adversary can safely manipulate in Section 5.2.

Distribution Shift  One might wonder if the adversarial policies are winning simply because they are outside the training distribution of the victim. To test this, we evaluate victims against two simple off-distribution baselines: a random policy Rand (green) and a lifeless policy Zero (red). These baselines win as often as 30% to 50% in Kick and Defend, but less than 1% of the time in Sumo and You Shall Not Pass. This is well below the performance of our adversarial policies. We conclude that most victim policies are robust to typical off-distribution observations. Although our adversarial policies do produce off-distribution observations, this is insufficient to explain their performance.
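The two off-distribution baselines are simple to sketch. The functions below are our own illustrative rendering of Rand and Zero, and the action dimension of 17 is only an example value, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_policy(obs, act_dim, rng):
    """Rand baseline: ignore the observation, sample uniform actions."""
    return rng.uniform(-1.0, 1.0, size=act_dim)

def zero_policy(obs, act_dim):
    """Zero baseline: a 'lifeless' policy that exerts no control."""
    return np.zeros(act_dim)

a_rand = rand_policy(None, 17, rng)
a_zero = zero_policy(None, 17)
```

Both baselines produce observations far from the victim's self-play training distribution, which is what makes them a useful control for the distribution-shift hypothesis.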

(a) Kick and Defend
(b) You Shall Not Pass
(c) Sumo Humans
(d) Sumo Ants
Figure 4: Percentage of episodes won by the opponent of the victim (out of 1000). The maximal cell in each row has a red border. Each panel also lists the dimensions of the victim’s observations and actions, and of the opponent position (the part of the victim’s observation the adversary controls). The adversary outperforms the baselines in Kick and Defend and You Shall Not Pass, is comparable in Sumo Humans, but performs poorly in Sumo Ants (see Section 5.2). Importantly, ‘masking’ the victim so it cannot see the adversary improves the victim’s win rate (see Section 5.1). Victim win rates and ties are reported in the full score heatmap in the supplementary material.

5 Understanding Adversarial Policies

In the previous section we demonstrated adversarial policies exist for victims in a range of competitive simulated robotics environments. In this section, we focus on understanding why these policies exist. In Section 5.1 we establish that adversarial policies rely on manipulating the victim through their own body position. We show in Section 5.2 that victims are more vulnerable to adversarial policies in high-dimensional environments. Finally, in Section 5.3 we analyze the activations of the victim’s policy network, showing they differ substantially when playing an adversarial opponent.

5.1 Masked Policies

We have previously shown that adversarial policies are able to reliably win against victims. In this section, we demonstrate that they win by taking actions to induce natural observations that are adversarial to the victim, and not by physically interfering with the victim. To test this, we introduce a ‘masked’ victim (labeled ZooMN or ZooMVN) that is the same as the normal victim ZooN or ZooVN, except that the observation of the adversary’s position is set to a static value corresponding to a typical initial position. We use the same adversarial policy against the normal and masked victim.
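A minimal sketch of this masking, assuming the opponent-position dimensions occupy a known slice of the observation vector. The slice indices and static value below are illustrative, not the actual layout of the MuJoCo observations.

```python
import numpy as np

class MaskedVictimObs:
    """Masked victim: overwrite the slice of the observation holding
    the opponent's position with a fixed, typical initial value."""
    def __init__(self, opp_slice, static_value):
        self.opp_slice = opp_slice
        self.static = np.asarray(static_value)

    def __call__(self, obs):
        obs = np.array(obs, copy=True)       # do not mutate the caller's array
        obs[self.opp_slice] = self.static
        return obs

# Illustrative: dimensions 3-5 hold the opponent position.
mask = MaskedVictimObs(slice(3, 6), static_value=[0.0, 1.0, 0.0])
masked = mask(np.arange(8, dtype=float))
```

The masked victim's policy network is untouched; only its input is filtered, so any performance change isolates the effect of seeing the adversary.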

One would expect it to be beneficial to be able to see your opponent. Indeed, the masked victims do worse than a normal victim when playing normal opponents. For example, Figure 4(b) shows that in You Shall Not Pass the normal opponent ZooO1 wins 78% of the time against the masked victim ZooMV1 but only 47% of the time against the normal victim ZooV1. However, the relationship is reversed when playing an adversary. The normal victim ZooV1 loses 86% of the time to adversary Adv1 whereas the masked victim ZooMV1 wins 99% of the time. This pattern is particularly clear in You Shall Not Pass, but the trend is similar in other environments, confirming that the adversary wins by taking actions that indirectly cause natural observations that are adversarial for the victim.

This result is surprising as it implies highly non-transitive relationships may exist between policies even in games that seem to be transitive. A game is said to be transitive if policies can be ranked such that higher-ranked policies beat lower-ranked policies. Prima facie, the games in this paper seem transitive: professional human soccer players and sumo wrestlers can reliably beat amateurs. Despite this, there is a non-transitive relationship between adversarial policies, victims and masked victims. Consequently, we urge caution when using methods such as self-play that assume transitivity, and would recommend more general methods where practical [Balduzzi et al., 2019, Brown et al., 2019].

Our findings also suggest a trade-off in the size of the observation space. In benign environments, allowing more observation of the environment increases performance. However, this also makes the agent more vulnerable to adversaries. This is in contrast to an idealized Bayesian agent, where the value of information is always non-negative [Good, 1967]. In the following section, we investigate further the connection between vulnerability to attack and the size of the observation space.

5.2 Dimensionality

It is well-established that classifiers are more vulnerable to adversarial examples on high-dimensional inputs [Gilmer et al., 2018b, Khoury and Hadfield-Menell, 2018, Shafahi et al., 2019]. We hypothesize that a similar result is true for adversarial policies: the greater the dimensionality of the component of the observation space under control of the adversary, the more vulnerable the victim is to attack. In the environments by Bansal et al. [2018a], the component is the position of the adversary’s joints.

We test our hypothesis in the Sumo environment, keeping the task the same but varying whether the agents are Ants (quadrupedal robots) or Humanoids. The results in Figures 4(c) and 4(d) support the hypothesis: the win rate in the lower dimensional Sumo Ants environment (15 dimensions under the adversary’s influence) is much lower than in the higher dimensional Sumo Humans environment (24 dimensions). Specifically, in Sumo Humans we obtain a win rate of 87% against victim 1, 63% against victim 2 and 44% against victim 3. By contrast, in Sumo Ants we obtain a win rate of at most 12%.

5.3 Victim Activations

(a) Gaussian Mixture Model (GMM): likelihood that the activations of a victim’s policy network are “normal”. The victim is Zoo1 or ZooV1. We collect activations over many time steps against each opponent, and fit a GMM to the activations induced by Zoo1 or ZooO1. Error bars are a 95% confidence interval.

(b) t-SNE activations of the Kick and Defend victim ZooV2 playing against different opponents. See the supplementary results for visualizations of other environments and victims.

Figure 5: Analysis of activations of the victim’s policy network. Both the density model and the t-SNE visualization show that the adversary Adv induces off-distribution activations. Key: legends specify the opponent the victim was pitted against. Adv is the best adversary trained against the victim, and Rand is a policy taking random actions. Zoo*N corresponds to ZooN (Sumo) or ZooON (otherwise). Zoo*1T and Zoo*1V are the train and validation datasets, drawn from Zoo1 (Sumo) or ZooO1 (otherwise).

In Section 5.1 we showed that adversarial policies win by creating natural observations that are adversarial to the victim. In this section, we seek to better understand why these observations are adversarial. We record activations from each victim’s policy network playing a range of opponents, and analyse these using a Gaussian Mixture Model (GMM) and a t-SNE representation. See the supplementary material for details of training and hyperparameters.

We fit a GMM on activations Zoo*1T collected playing against a normal opponent, Zoo1 or ZooO1, holding out Zoo*1V as a validation set. Figure 5(a) shows that the adversarial policy Adv induces activations with the lowest log-likelihood of any opponent. The random baseline Rand is slightly more probable. The normal opponents Zoo*2 and Zoo*3 induce activations with almost as high likelihood as the validation set Zoo*1V, except in Sumo Humans where they are as unlikely as Rand.
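This density-model comparison can be sketched with scikit-learn. The Gaussian blobs below are synthetic stand-ins for recorded activations (real victim activations would replace them), and the component count is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-ins for policy-network activations (8-dim for illustration):
normal_acts = rng.normal(0.0, 1.0, size=(500, 8))  # vs a normal opponent
adv_acts = rng.normal(4.0, 1.0, size=(200, 8))     # vs an adversary (shifted)
held_out = rng.normal(0.0, 1.0, size=(200, 8))     # validation split

# Fit the density model only on activations from normal play.
gmm = GaussianMixture(n_components=2, random_state=0).fit(normal_acts)

# score() returns mean log-likelihood per sample.
ll_normal = gmm.score(held_out)
ll_adv = gmm.score(adv_acts)
```

Under this setup, activations induced by the adversary score markedly lower than held-out normal activations, mirroring the off-distribution effect reported above.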

We plot a t-SNE visualization of the activations of Kick and Defend victim ZooV2 in Figure 5(b). As expected from the density model results, there is a clear separation between Adv, Rand and the normal opponent ZooO2. Intriguingly, Adv induces activations more widely dispersed than the random policy Rand, which in turn are more widely dispersed than ZooO2. We report on the full set of victim policies in the supplementary material.

6 Discussion

We have proposed a novel threat model for reinforcement learning where the attacker controls an agent acting in the same environment as the victim. The attacker cannot directly modify the victim’s observations, but can choose an adversarial policy that takes actions creating natural observations that are adversarial. We have shown that adversarial policies exist in a range of zero-sum simulated robotics games against state-of-the-art victims trained via self-play to be robust to adversaries.

Moreover, we find that the adversarial policies win not by becoming generally strong players, but rather by taking actions that confuse the victim. We verify this through qualitative observations of the adversary’s behavior, and from showing that the performance of the victim improves when it is blind to the position of the adversary. Furthermore, our evaluation suggests victims in high-dimensional environments are more vulnerable to adversarial policies, and show adversarial policies induce highly off-distribution activations in the victim.

While it may at first appear unsurprising that a policy trained as an adversary against another RL policy would be able to exploit it, we believe that this observation is highly significant. First, the policies we have attacked were explicitly trained via self-play to minimize exploitability. The same type of technique has been used in a number of works focused on playing adversarial games directly against humans, where minimizing exploitability is paramount [Silver et al., 2017, OpenAI, 2018].

Second, the use of fixed victim policies reflects what is likely to be a common use case. In safety critical systems, where attacks like these would be most concerning, it is standard practice to validate a model and then freeze it, so as to ensure that the deployed model does not develop any new issues due to retraining. Therefore, our attack profile is a realistic reflection of what we might see with RL-trained policies in real-world settings, such as with autonomous vehicles.

Furthermore, even if the target victim uses continual learning, it may be possible to train against a fixed proxy victim. The attacker could use imitation learning on the target victim to produce a proxy. Alternatively, in consumer applications such as self-driving vehicles, the attacker can buy a copy of the system and periodically factory reset it. Once an adversarial policy has been trained against the proxy, the attacker may be able to transfer this policy to the target, exploiting it until it adapts.

Our results suggest a number of directions for future work. The ease with which policies can be attacked highlights the need for effective defenses. It may be possible to detect adversarial attacks using the density model on activations, in which case one could fallback to a conservative policy.

We are also excited at the potential of adversarial training with adversarial policies to improve robustness. Concretely, we envisage population-based training where new randomly initialized agents are introduced over time, and allowed to train against a fixed victim for some period of time. This would expose victims to a much broader range of opponents than conventional self-play or population-based training. However, it will considerably increase computational requirements, unless more efficient methods for finding adversarial policies than model-free RL are discovered.

Overall, we are excited about the implications the adversarial policy model has for the robustness, security and understanding of deep RL policies. Our results show the existence of a previously unrecognized problem in deep RL, but there remain many open questions. We hope this work encourages other researchers to investigate this area further. Videos and other supplementary material, along with our source code, are available online.


We thank Jakob Foerster, Matthew Rahtz, Dylan Hadfield-Menell, Catherine Olsson, Jan Leike, Rohin Shah, Victoria Krakovna, Daniel Filan, Steven Wang, Dawn Song, Sam Toyer and Dan Hendrycks for their suggestions and helpful feedback on earlier drafts of this paper. We thank Chris Northwood for assistance developing the website accompanying this paper.