Targeted Attacks on Deep Reinforcement Learning Agents through Adversarial Observations

05/29/2019 ∙ by Léonard Hussenot, et al.

This paper deals with adversarial attacks on perceptions of neural network policies in the Reinforcement Learning (RL) context. While previous approaches perform untargeted attacks on the state of the agent, we propose a method to perform targeted attacks to lure an agent into consistently following a desired policy. We place ourselves in a realistic setting, where attacks are performed on observations of the environment rather than the internal state of the agent and develop constant attacks instead of per-observation ones. We illustrate our method by attacking deep RL agents playing Atari games and show that universal additive masks can be applied not only to degrade performance but to take control of an agent.




1 Introduction

Neural network classifiers have been shown to be sensitive to adversarial examples (Goodfellow et al., 2015; Carlini and Wagner, 2017). These attacks, whose existence was highlighted by Szegedy et al. (2013), lure classifiers into predicting wrong labels for images initially correctly classified, by slightly modifying the input image. Such adversarial inputs can reasonably be applied to real-world situations (Athalye et al., 2018; Brown et al., 2017), by adding either an imperceptible noise or a reasonably-sized patch to an image. In Reinforcement Learning (RL), end-to-end neural architectures are trained to map complex inputs (such as images) to actions. They can successfully play simple video games like Atari (Mnih et al., 2015), achieve super-human performance at the game of Go (Silver et al., 2016), control complex robotic systems (Levine et al., 2016) and show promise for training self-driving car policies (Bojarski et al., 2016). However, the use of deep networks for processing visual observations may give these RL agents a similarly high sensitivity to adversarial inputs, resulting in catastrophic behaviours of the controlled system. Previous works (Huang et al., 2017; Kos and Song, 2017) introduce adversarial examples in RL, yet leave aside its intrinsically dynamic nature. They indeed stick to the classical meaning of adversarial attacks and study the drop in performance of an agent subjected to imperceptible attacks on its inputs. Moreover, while the supervised learning literature divides attacks into two categories, white-box and black-box (i.e., having access or not to the learning algorithm and its parameters), such a dichotomy is not as clear in the RL context. In particular, previous works focused on attacking the agent's state, meaning that even pre-processing operations on the observations are known and that the memory of the agent can be modified. Finally, most of these works, except Lin et al. (2017), who proposed a heuristic to reduce the number of attacked states, assume that attacks can be generated per state, which is, in general, unfeasible.

Pattanaik et al. (2018) study adversarial examples in RL but mostly focus on low-dimensional inputs and on robustness to hyper-parameters. Other works (Zhang et al., 2018; Ruderman et al., 2018) propose an adversarial maze framework to study generalization and transfer between similar environments and to uncover worst-case scenarios. In contrast to that literature, we aim at placing ourselves in a more realistic setting. First, we argue that in the context of RL, attacks can have different objectives: not only can a drop in performance be the goal but, for instance, an adversary may want to spur the agent into acting as told by another policy, using targeted adversarial examples. In addition, we suggest using pre-computed attacks to be applied on-the-fly, with little or no computation during the actual online control process. To limit the required intervention on the RL agent, we do not attack internal representations (agent states) but only observations provided by the environment, limiting the white-box assumption. We exemplify this approach on four representative Atari games and show that taking control of a trained deep-RL agent, so that its behaviour matches a desired policy, can be done with very few different attacks. Videos from the experiments can be found at

2 Background

In Reinforcement Learning, an agent interacts sequentially with a dynamic environment so as to learn an optimal control. To do so, the environment is modeled as a Markov Decision Process (MDP), that is a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is a set of Markovian transition probabilities $P(s'|s,a)$ defining the dynamics of the environment, $r$ is a reward function (we denote $r_t = r(s_t, a_t)$) and $\gamma \in [0, 1)$ is a discount factor. The (possibly stochastic) policy $\pi$, mapping states to distributions over actions, is trained to maximize the agent's expected cumulative discounted reward over time, $V^\pi(s) = \mathbb{E}_\pi[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s]$, also called the value function of policy $\pi$ (where $\mathbb{E}_\pi$ denotes the expectation over all possible trajectories generated by policy $\pi$). Value-based algorithms (Mnih et al., 2015; Hessel et al., 2018) use the value function, or more frequently the quality function $Q^\pi(s,a)$, to compute $\pi$. To handle large state spaces, Deep RL (DRL) uses deep neural networks for function approximation. In value-based DRL, the quality function is parameterized with a neural network $Q_\theta$ of parameters $\theta$, mapping states to action values. Adversarial examples were introduced in the context of supervised classification. Given a classifier $f$, an input $x$, a bound $\epsilon$ on a norm $\|\cdot\|$, an adversarial example is an input $\tilde{x} = x + \delta$ such that $f(\tilde{x}) \neq f(x)$ while $\|\delta\| \leq \epsilon$. The Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) is the most widespread method for generating adversarial examples for the $\ell_\infty$-norm. From a linear approximation of $f$, it computes the attack as

$$\delta = \epsilon \,\mathrm{sign}\big(\nabla_x L(f(x), y)\big), \qquad (1)$$

with $L$ the loss of the classifier and $y$ the true label. As an adversary, one wishes to maximize the loss w.r.t. $x$. Presented this way, it is an untargeted attack: it pushes $f$ towards misclassifying $\tilde{x}$ in any label other than $y$. It can easily be turned into a targeted attack by, instead of maximizing $L(f(x), y)$, minimizing $L(f(x), y_{\mathrm{adv}})$, with $y_{\mathrm{adv}}$ the label the adversary wants $f$ to predict for $\tilde{x}$. It can also be adapted to an $\ell_2$ bound by normalizing the gradient to norm $\epsilon$ instead of taking its sign in Eq. (1). Both methods will thus be referred to as FGSM. It can also be transformed into iterative methods, by taking several steps in the direction of the gradient, and into momentum-based iterative methods by furthermore adjusting the gradient step dynamically. All these methods will be referred to as gradient-based attacks. When using deep networks to compute its policy, an RL agent can be fooled the same way as a supervised classifier. For algorithms computing a stochastic policy $\pi$, we take $y$, the true label, as the action predicted by the network: $y = \operatorname{arg\,max}_a \pi(a \mid s)$. The step(s) in the direction of the gradient of Eq. (1) will thus encourage the network to change its output from its original decision. In this case, $L$ is the cross-entropy between $\pi(\cdot \mid s)$ and the one-hot encoding of $y$. In the case where the output policy is deterministic, e.g. $\pi(s) = \operatorname{arg\,max}_a Q_\theta(s,a)$, the same calculus would lead to the gradient in Eq. (1) being zero almost everywhere. We will thus consider stochastic policies and use $\pi(a \mid s) \propto \exp Q_\theta(s,a)$ with value-based agents. In RL, no true labels are available. So, adversarial examples have mainly been used in the untargeted case, to change the decision of the agent in order to decrease its performance. We argue that applying untargeted adversarial examples copying the supervised paradigm is a restricted setting, and that a realistic adversary could aim to take control of the agent using targeted attacks. The latter may be used to encourage the agent to take a specific action chosen by the adversary.
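To make the untargeted/targeted distinction concrete, here is a minimal numpy sketch of both FGSM variants on a toy linear softmax policy; the linear model, its random weights, and the analytic gradient are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fgsm(x, W, eps, target=None, norm="linf"):
    """One-step gradient attack on a toy linear softmax policy pi = softmax(Wx).

    target=None: untargeted, ascend the cross-entropy of the current decision.
    target=a:    targeted, descend the cross-entropy of the adversary's action.
    """
    p = softmax(W @ x)
    a = int(np.argmax(p)) if target is None else target
    onehot = np.zeros_like(p)
    onehot[a] = 1.0
    g = W.T @ (p - onehot)              # gradient of -log pi(a|x) w.r.t. x
    d = g if target is None else -g     # ascend vs. descend the loss
    if norm == "linf":
        return x + eps * np.sign(d)     # sign step, as in Eq. (1)
    return x + eps * d / (np.linalg.norm(d) + 1e-12)  # l2-normalized variant
```

Setting `target` turns the untargeted attack into a targeted one with the same one-step recipe, as described above.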

3 Adversarial attacks on deep RL agents

In (Huang et al., 2017), the FGSM attack is applied to the network inducing the policy $\pi$, typically the $Q$-network in value-based DRL. By doing so, the attack has to be built on the agent's state. This might appear as a reasonable hypothesis in the white-box setting. Nonetheless, the state is a representation internal to the agent. It might result from a complex, possibly learned, preprocessing of the raw perceptions provided by the environment. For example, when playing Atari games, classic RL algorithms use a stack of the $k$ last observations as input state to ensure the Markov property. Therefore, deriving the loss from Eq. (1) with respect to the input state gives an attack on $k$ observations rather than one. As a consequence, deploying this attack actually means manipulating the memory of the agent. Attacking the internal representation of the agent is thus a technical obstacle in the black-box setting, where the adversary is conceptually located between the environment and the agent; it also goes beyond the white-box assumption. Moreover, depending on the implementation, this breaks the assumption made by most authors that the norm of the attack is bounded by $\epsilon$: for Atari games, because of the overlap happening during the frame-stacking procedure used to build states, observations are attacked several times. We thus wish to build attacks on raw observations rather than complete states and prove their efficiency. We denote $o_t$ the observation provided by the environment and $s_t$ the (unattacked) state. It results from a preprocessing of the past observations: $s_t = g(o_{t-k+1}, \dots, o_t)$. We denote $\delta_t$ the attack and $\tilde{o}_t = o_t + \delta_t$ the attacked observation. As we wish to attack observations rather than states, we do not compute the gradient of $L$ w.r.t. $s_t$ but only w.r.t. the last observation $o_t$. These attacks are more easily applicable as they consist in adding an imperceptible noise to observations, and they can consistently respect the norm constraint $\|\delta_t\| \leq \epsilon$, if required.
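The overlap argument can be checked directly: with Atari-style frame stacking, consecutive states share all but one frame, so a perturbation written into a state is re-applied to the same observation several times. A small sketch, where the stack size and zero-padding are illustrative assumptions:

```python
import numpy as np
from collections import deque

def stacked_states(observations, k=4):
    """Build the sequence of k-frame stacked states an Atari-style agent sees.

    Consecutive states share k-1 frames: attacking a state therefore touches
    observations that also live inside the following states, whereas attacking
    each incoming observation once, before stacking, does not.
    """
    frames = deque([np.zeros_like(observations[0])] * k, maxlen=k)
    states = []
    for o in observations:
        frames.append(o)  # newest observation enters, oldest drops out
        states.append(np.stack(list(frames), axis=0))
    return states
```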
Our goal is to design a realistic targeted controlling attack (see Sec. 3.2); however, we first introduce the methodology and some results in the context of untargeted, performance-damaging attacks.

3.1 Untargeted attacks

We first wish to demonstrate that one can make a DRL agent’s performance plunge by attacking only observations rather than states. We design two types of attack. First, a per-observation one where we allow ourselves to compute a new attack for each observation provided by the environment. Second, in order to build a more realistic attack, we design one pre-computed constant attack to be applied identically on every observation.

Per-observation attack

We design the following attack. At each time step, either FGSM or an iterative method is applied to the observation in order to change the decision that the agent would have taken. Denoting $\hat{s}_t = g(\tilde{o}_{t-k+1}, \dots, \tilde{o}_{t-1}, o_t)$ the state the agent would build without the current attack, and $\tilde{s}_t = g(\tilde{o}_{t-k+1}, \dots, \tilde{o}_{t-1}, o_t + \delta_t)$ its attacked counterpart, the objective of FGSM can thus be formulated in this case as maximizing over $\delta_t$:

$$L\Big(\pi(\cdot \mid \tilde{s}_t),\ \operatorname{arg\,max}_a \pi(a \mid \hat{s}_t)\Big). \qquad (2)$$

As we work in an online framework, we consider that the previous observations the agent received were also attacked, hence the use of $\tilde{o}_{t-k+1}, \dots, \tilde{o}_{t-1}$. FGSM, as iterative methods, can thus be seen as gradient step(s) maximizing the KL-divergence between the attacked and unattacked policies. This attack, though reasonable considering that it is applied only on observations, may still be unenforceable, as computing a different attack online might be computationally prohibitive for the adversary. We thus design a constant imperceptible attack to be applied to every frame as a constant additive mask.
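As a minimal sketch of this online scheme (again on a hypothetical linear softmax policy over the flattened frame stack, not the paper's network), each incoming observation is perturbed once, and the agent's memory thereafter contains only attacked frames:

```python
import numpy as np
from collections import deque

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def per_observation_attack(observations, W, k, eps):
    """Untargeted online attack on a toy policy pi = softmax(W @ state).

    The state stacks the k last *attacked* observations; the gradient is
    taken only w.r.t. the newest observation, never the agent's memory.
    """
    frames = deque([np.zeros_like(observations[0])] * k, maxlen=k)
    attacked = []
    for o in observations:
        frames.append(o)
        s = np.concatenate(list(frames))        # state before attacking o
        p = softmax(W @ s)
        onehot = np.zeros_like(p)
        onehot[int(np.argmax(p))] = 1.0
        g = (W.T @ (p - onehot))[-o.size:]      # slice: grad w.r.t. newest obs
        o_adv = o + eps * np.sign(g)            # FGSM step on one frame only
        frames[-1] = o_adv                      # memory keeps the attacked frame
        attacked.append(o_adv)
    return attacked
```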

Constant attack

We study the effect of this new type of attack: given an observed unattacked trajectory $(s_1, \dots, s_T)$, the adversary looks for a single mask $\delta$ minimizing:

$$\sum_{t=1}^{T} \log \pi\Big(\operatorname{arg\,max}_a \pi(a \mid s_t) \,\Big|\, \tilde{s}_t\Big), \qquad (3)$$

where $\tilde{s}_t$ is the state built from the observations attacked with $\delta$. By replacing the gradient in Eq. (1) by the mean gradient over the episode, we minimize the likelihood of the agent's choices over a trajectory. The single constant additive mask is then applied over all successive observations. If, as described here and shown in the experiments of Sec. 4, untargeted adversarial examples can be found by attacking only observations rather than the state of the agent, we may now try to make the most of this method to take full control of the agent.
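On the same toy linear-policy assumption, the constant attack amounts to one sign step on the gradient averaged over a recorded trajectory, applied unchanged everywhere afterwards:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def constant_mask(states, W, eps):
    """Pre-compute one additive mask from the mean gradient over a trajectory.

    Averages the gradient of -log pi(a_t | s_t), with a_t the agent's own
    greedy choice, over all recorded states, then takes an FGSM-style sign
    step: a single mask minimizing the likelihood of the agent's decisions.
    """
    g = np.zeros_like(states[0])
    for s in states:
        p = softmax(W @ s)
        onehot = np.zeros_like(p)
        onehot[int(np.argmax(p))] = 1.0
        g += W.T @ (p - onehot)          # gradient of -log pi(argmax | s)
    return eps * np.sign(g / len(states))
```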

3.2 Targeted attacks

In RL, a malicious adversary might not solely wish to make the agent's performance drop. In a more general scheme, it can wish to control the played policy. We thus consider an adversary's policy $\pi_{\mathrm{adv}}$ that the agent's behaviour should match when attacked. We will keep referring to the protagonist playing the game as the agent and to the manipulating opponent as the adversary. Again, we design two attacks.

Per-observation attack

We apply a targeted gradient-based attack on each observation to encourage the agent to take the adversary's preferred action. Denoting $a_t^{\mathrm{adv}}$ the action selected by $\pi_{\mathrm{adv}}$ in the current state, the loss from Eq. (1) is now the opposite and the optimized objective is to minimize over $\delta_t$:

$$L\big(\pi(\cdot \mid \tilde{s}_t),\ a_t^{\mathrm{adv}}\big).$$
Universal-masks attack

For each action $a$, we design a universal mask $m_a$ to lure the agent into taking action $a$. We use the qualifier “universal” as in the supervised learning literature, where a universal attack lures a classifier into predicting the same class whatever the input image is. Similarly, we want our mask to consistently lure the agent into taking action $a$, whatever the observation is. We collect trajectories by observing the agent interacting with the environment, building a set $D_{\mathrm{agent}}$ of observations. We collect trajectories by having the adversary interact with the environment, building a set $D_{\mathrm{adv}}$ of observations. We define $D$, the attack training set, as $D = D_{\mathrm{agent}} \cup D_{\mathrm{adv}}$. Attacks are still computed to be applied on observations (not states). Here, we do not restrict ourselves to imperceptible attacks constrained by $\epsilon$ but try to find a reasonable additive mask that will effectively lure the agent: we wish that a human could still play the game and reach the same performance with the attacked observations. We here consider the policy computed with a temperature $\tau$ applied on the $Q$-values: $\pi_\tau(a \mid s) \propto \exp(Q_\theta(s,a)/\tau)$. We compute the adversarial mask $m_a$ by minimizing over $m_a$:

$$-\frac{1}{|D|} \sum_{t \in D} \mathbb{E}_{\eta \sim U_\alpha}\Big[\log \pi_\tau\big(a \,\big|\, g(o_{t-k+1} + \eta, \dots, o_{t-1} + \eta,\ o_t + m_a)\big)\Big] \;+\; \lambda \|m_a\|_2^2. \qquad (4)$$

The term $\lambda$ is a regularization parameter on the $\ell_2$-norm of $m_a$, and $U_\alpha$ is the uniform distribution over matrices of values in $[-\alpha, \alpha]$ and of the same shape as $m_a$. For untargeted constant attacks, we optimized Eq. (3) neglecting the previous attacks. We will see in Sec. 4 that this was enough to damage the agent's performance. Nevertheless, targeted attacks are harder to compute; we thus wish to take these previous attacks into account when computing a targeted mask. As we do not know a priori the attacks that will be applied to previous observations, we add the noise $\eta$ for attacks on the current observation to be efficient regardless of previous ones. Ideally, we would want $\pi_\tau$ to be the greedy policy in order to optimize over the final choice of the agent; we instead use the softmax with temperature $\tau$ for the loss to remain differentiable. We use both $D_{\mathrm{agent}}$ and $D_{\mathrm{adv}}$ because, if we successfully attack the agent, its trajectories will change and it may encounter observations very different from those of $D_{\mathrm{agent}}$. As we hope that it follows paths similar to the adversary's, we add trajectories of the adversary for our masks to be trained on a meaningful training set. This is not an $\epsilon$-constrained attack anymore, but we will show in Sec. 4 that the visual impact of the computed attacks would not change a human player's choices while deceiving the agent.
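The mask training of Eq. (4) can be sketched end-to-end on the same toy linear softmax policy; all hyper-parameters, the plain gradient descent in place of Adam, and the input-level noise are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def universal_mask(dataset, W, a, tau=1.0, lam=1e-3, alpha=0.05,
                   steps=500, lr=0.05, seed=0):
    """Train one additive mask m_a pushing pi_tau = softmax(W s / tau) to a.

    Minimizes the expected cross-entropy of the target action over the
    dataset, with uniform noise in [-alpha, alpha] on the inputs (so the
    mask stays efficient under other perturbations) and an l2 penalty
    lam * ||m_a||^2, in the spirit of Eq. (4).
    """
    rng = np.random.default_rng(seed)
    m = np.zeros_like(dataset[0])
    onehot = np.eye(W.shape[0])[a]
    for _ in range(steps):
        g = np.zeros_like(m)
        for s in dataset:
            noisy = s + rng.uniform(-alpha, alpha, size=s.shape)
            p = softmax(W @ (noisy + m) / tau)
            g += W.T @ (p - onehot) / tau    # d(-log pi_tau(a)) / dm
        m -= lr * (g / len(dataset) + 2.0 * lam * m)
    return m
```

Since the mask shifts every logit by the same amount for all inputs, once $W m_a$ favors the target action the lure transfers across the whole dataset, which is the "universal" behaviour sought above.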

4 Experiments

We use the Dopamine library (Castro et al., 2018) to build RL agents playing the Atari environment (Bellemare et al., 2013). We use the same preprocessing as in (Mnih et al., 2015), with sticky actions (Machado et al., 2018). We use trained policies (Such et al., 2018) for both DQN (Mnih et al., 2015) and Rainbow (Hessel et al., 2018) and use them in evaluation mode: parameters of the agents are left unchanged. Results are averaged over five agent seeds. Performance is evaluated in terms of undiscounted cumulative return over an episode of at most 27k steps. FGSM and iterative attacks are computed thanks to the Cleverhans library (Papernot et al., 2016a). We attack observations that are on the unnormalized 0-255 gray-scale; an attack bounded by $\epsilon$ on the normalized scale is thus rescaled by 255. Moreover, attacked observations are always clipped to 0-255 to keep them in the valid range of images. We adjust the bound with the used norm: when considering an $\ell_p$-norm and a given $\epsilon$, we in fact bound the attack as $\|\delta\|_p \leq 255\,\epsilon\, d^{1/p}$, with $d$ the dimension of the observation. We run our experiments on four Atari games: Space Invaders, Pong, Air Raid and HERO. We chose these to prove that our manipulating attacks work consistently on games with very different aspects, from a shooting game with no significant changes in the background (e.g., Space Invaders) to a platform game with a changing environment (e.g., HERO). Fig. 1 shows the visual rendering of an $\ell_\infty$-norm attack. The attack on the right is rescaled to be visible but is otherwise imperceptible (see the difference between the left and middle images).

Figure 1: Left: raw observation. Middle: attacked observation. Right: re-scaled attack.
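The rescaling and clipping conventions can be made explicit in a short helper; the $\sqrt{d}$ scaling of the $\ell_2$ budget is our reading of the norm adjustment, so treat it as an assumption:

```python
import numpy as np

def apply_bounded_attack(obs, delta, eps, norm="linf"):
    """Apply an attack to a raw 0-255 grayscale observation.

    eps is given on the normalized [0, 1] scale and rescaled by 255; the
    l2 budget additionally scales with sqrt(d) (an assumed convention so
    per-pixel magnitudes stay comparable across norms). The attacked
    observation is clipped back to the valid image range.
    """
    if norm == "linf":
        delta = np.clip(delta, -255.0 * eps, 255.0 * eps)
    else:  # l2
        budget = 255.0 * eps * np.sqrt(delta.size)
        n = np.linalg.norm(delta)
        if n > budget:
            delta = delta * (budget / n)
    return np.clip(obs + delta, 0.0, 255.0)
```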

4.1 Untargeted Attacks

Figure 2: Per-observation (left) and constant (right) attacks on Rainbow playing Space Invaders.


Per-observation attack

We observe on Fig. 2-left that Rainbow is very sensitive to adversarial examples, even if we constrain ourselves to attacking observations and not complete states. With a small $\epsilon$, attacks are able to decrease the performance of the agent substantially and, with a slightly larger one, to reduce its performance to that of a randomly-acting agent. Simple FGSM performs well in this untargeted case, even though the momentum-based attack outperforms it. We also observe that a Gaussian-noise attack with the same norm bound is not able to bring the performance of the agent down as effectively.

Constant attack

We observe on Fig. 2-right that Rainbow shows a high sensitivity to constant attacks bounded in $\ell_2$-norm, while $\ell_\infty$-bounded constant attacks are ineffective. With $\ell_2$-bounded attacks, a small $\epsilon$ suffices to reduce its performance to that of a randomly acting agent. More results concerning untargeted attacks are presented in the Appendix.

4.2 Targeted Attacks


We test our attack in the following scheme: a trained DQN agent plays while under a targeted attack, with $\pi_{\mathrm{adv}}$ a trained Rainbow policy. DQN's behavior will thus be encouraged by the attacks to follow Rainbow's policy (which, in general, performs better).

Figure 3: Targeted “helpful” per-observation attacks on DQN playing Space Invaders.
Figure 4: Targeted “helpful” per-observation attacks on DQN playing HERO.

We observe on Fig. 3-left and Fig. 4-left that we are able to improve the performance of DQN significantly, with still very small perturbations, and to almost make it reach Rainbow's mean performance on the game. We also observe that, in this targeted case, FGSM is not enough: it suffices to prevent a policy from taking its preferred action but is not accurate enough to make it choose a particular action. Best returns are achieved when most of the taken actions match $\pi_{\mathrm{adv}}$, given by the Rainbow algorithm (see Fig. 3-right and Fig. 4-right, where each point represents one attacked trajectory for one particular value of $\epsilon$ and gives its return on the y-axis and the rate of successful attacks during the episode on the x-axis). This experiment can be pushed even further by “helpfully” attacking an untrained version of DQN (random weights) with a Rainbow adversary, to see how well it can perform with only small perturbations of its observations. As observed on Fig. 5-left and 6-left, the task seems harder; however, with a bigger $\epsilon$ we reach DQN's performance and, increasing $\epsilon$ further, that of Rainbow. As can be seen on Fig. 5-right and Fig. 6-right, the score increases as the attack success rate increases. With a sufficient $\epsilon$, 90% of the actions can be matched with $\pi_{\mathrm{adv}}$.

Figure 5: Targeted “helpful” per-observation attacks on untrained-DQN playing Space Invaders.
Figure 6: Targeted “helpful” per-observation attacks on untrained-DQN playing HERO.


For computing the constant masks, we collect 25 trajectories from the agent and 25 trajectories from the adversary. We then optimize Eq. (4) with the Adam optimizer (Kingma and Ba, 2014) for 1000 iterations on minibatches of size 128. We use fixed values for the learning rate, for the regularization parameter $\lambda$ and for the magnitude $\alpha$ of the uniform distribution from which the noise added to past observations is drawn. The temperature $\tau$ should be small for $\pi_\tau$ to resemble the greedy policy, but not too small, to avoid numerical errors. All these parameters are left unchanged for the different games. We test this attack with the target action $a$ chosen by Rainbow given the current state. The corresponding mask $m_a$ is then applied to the observation before it is fed to DQN. We study the rate of actions transformed into the adversary's actions as well as the total return.

Figure 7: Targeted “helpful” universal-masks attack on DQN playing Space Invaders. Left: x- rate of action matching the adversary’s choice; y- return. Right: example of a constant mask applied.
Figure 8: Targeted universal-masks attack on DQN playing HERO (left) and Air Raid (right).

As shown on Fig. 7-left and 8, we are able to consistently force DQN to take the same actions as Rainbow with one precomputed universal mask per action. More than 95% of the actions match, and the attacked DQN clearly outperforms vanilla DQN, reaching Rainbow's performance. As seen on Fig. 7-right, the masks, though not imperceptible, are close to zero almost everywhere and would not change human performance. Their norm is greater than that of the imperceptible attacks, but they remain very localized. It can also be seen on Fig. 9, where DQN playing HERO is attacked with the universal masks, that the visual impact of the masks is still very limited.

Figure 9: DQN playing Hero under universal-masks attack. Left: the unattacked observation. Middle: the attacked observation fed to DQN. Right: the attack.

5 Related Work & Discussion

In order to make machine learning safe, adversarial examples (Szegedy et al., 2013) were introduced to highlight weaknesses of deep learning. Several algorithms were developed to produce adversarial examples rapidly (Goodfellow et al., 2015) or as close to the original example as possible (Carlini and Wagner, 2017). Defensive techniques such as distillation (Papernot et al., 2016b) were proposed, but other techniques (Carlini and Wagner, 2016) have proven effective against them. Adversarial examples on DRL agents are less studied. Previous work from Huang et al. (2017) first addressed the issue, but the method focused only on decreasing the agent's performance and attacked the whole agent state, adding strong assumptions to the white-box setting. We designed realistic attacks by instead attacking observations provided by the environment, and considered the novel objective of manipulating the played policy in order to take control of it. Lin et al. (2017) were the first to consider targeted attacks on agents. However, they defined a unique objective, which was to bring the agent into a particular state, and for this purpose their algorithm includes a computationally expensive video-prediction model. Moreover, their attack is only tested with the adversarial attack introduced in Carlini and Wagner (2017), which is known to be slower to compute than fast-gradient methods, as it requires solving a heavy optimization problem. In contrast, in the targeted case, we address the more general objective of matching the attacked policy with a desired one. Their work was also the first to raise the question of a realistic attack by reducing the number of attacked states: the proposed method reached the same performance as Huang et al. (2017) while attacking only 25% of the frames. We argued that a realistic attack is one that can be applied online and thus requires very little computation. We therefore proposed a framework to precompute universal attacks that are applied directly in the untargeted case and that only require a forward pass in the targeted case. Pinto et al. (2017) proposed an adversarial method for robust training of agents but only considered attacks on the dynamics of the environment, not on the visual perception of the agent. Zhang et al. (2018) and Ruderman et al. (2018) developed adversarial environment generation to study agents' generalization and worst-case scenarios. These differ from the present work, where we show how an adversary might take control of the agent's policy. As this work shows that one can manipulate the policy played by the agent, by either adding imperceptible noise to the observations on the fly or applying pre-computed masks, a necessary direction of work is to develop algorithms robust to these attacks. One important direction of future work is to test these attacks in the black-box setting, by training our manipulating attack on several seeds before testing it on a different seed, in order to see whether the attack can be performed without knowing the parameters of the attacked policy.


6 Appendix

6.1 Videos

Videos of the experiments can be found at

6.2 Untargeted attacks: more results

As can be seen below, on every tested game, both DQN and Rainbow show a high sensitivity to untargeted attacks. For the per-observation attack, a small $\epsilon$ is generally enough to make the agent's performance drop to that of a randomly acting agent. For the constant attack, a higher value is needed; however, it still remains in the domain of "imperceptible attacks".

Figure 10: Per-observation (left) and constant (right) attacks on DQN playing Space Invaders.
Figure 11: Per-observation (left) and constant (right) attacks on DQN playing Pong.
Figure 12: Per-observation (left) and constant (right) attacks on Rainbow playing Pong.
Figure 13: Per-observation (left) and constant (right) attacks on DQN playing Air Raid.
Figure 14: Per-observation (left) and constant (right) attacks on Rainbow playing Air Raid.
Figure 15: Per-observation (left) and constant (right) attacks on DQN playing Hero.
Figure 16: Per-observation (left) and constant (right) attacks on Rainbow playing Hero.

6.3 Targeted attacks: more results

For Air Raid, as for the other games, the targeted per-observation attack is able to make DQN reach Rainbow's performance with only a small $\epsilon$; the untrained-DQN also reaches it, with a higher value.

Figure 17: Targeted “helpful” per-observation attacks on DQN playing Air Raid.
Figure 18: Targeted “helpful” per-observation attacks on untrained-DQN playing Air Raid.