1. Introduction
In recent years, deep neural networks (DNNs) have become pervasive and have seen rapid adoption in commercial systems. DNNs have also driven success in deep reinforcement learning (DRL), where the goal is to train an agent to interact with its environment so as to maximize an expected return. DRL systems have been evaluated on games
(Ghory, 2004; Mnih et al., 2013, 2016), autonomous navigation (Dai et al., 2005; Pan et al., 2019a), and robotics control (Levine et al., 2016), among others. To take advantage of this, industry is integrating DRL into production systems (RE WORK, 2017). However, it is well-known that DNNs are vulnerable to adversarial perturbations (Goodfellow et al., 2015; Li and Vorobeychik, 2014, 2015; Xiao et al., 2019, 2018c, 2018b; Qiu et al., 2019; Xiao et al., 2018a). DRL systems that use DNNs for perception and policy making have similar vulnerabilities. For example, one of the main weaknesses of DRL models in adversarial environments is their heavy dependence on input observations, since they use DNNs to process those observations. Moreover, since DRL models are trained to solve sequential decision-making problems, an attacker can perturb multiple observations. In fact, the distributions of training and testing data can differ due to random noise or adversarial manipulation (Laskov and Lippmann, 2010). The learned policy can therefore be vulnerable in adversarial environments.

In this paper, we first present an extensive study of the taxonomy of adversarial attacks on DRL systems. Second, we propose and evaluate 10 adversarial attacks in order to explore points in the taxonomy that have not previously been examined in the literature. We organize adversarial attacks on DRL into a taxonomy based on details of the victim model and other properties of the attacker. First, we categorize these attacks based on what component of the system the attacker is capable of perturbing. This categorization mirrors the components of a Markov decision process (MDP): we recognize attacks that perturb an agent's observations, its actions, or the system's environment dynamics. We summarize these categories in Figure 1. Second, we categorize these attacks based on what knowledge the attacker needs to perform the attack.
Broadly, this divides attacks into the already recognized white-box attacks, where the attacker has full knowledge of the target DRL system, and black-box attacks, where the attacker has limited or no knowledge. We discuss this taxonomy further in Section 3.
Meanwhile, existing attacks that perturb the observation operate independently on each frame, which is too computationally intensive to run in real time. We propose two novel strategies for quickly creating adversarial perturbations for real-time attacks. The first strategy, N-attack, trains a neural network to generate a perturbation, reducing the computation to a single forward pass over this network. The second strategy exploits the fact that, in RL environments, states are not independent: later states depend on previous states and actions. We therefore propose online sequential attacks, which, in contrast to attacks that operate independently on each frame, generate a perturbation using information from a few frames and then apply it to later frames. We include our experiments with these strategies as part of our exploration of attacks that perturb the observation. We describe our attacks in detail in Section 4 and present our evaluation in Section 5.
To summarize, our contributions are: (1) we systematically organize adversarial attacks on DRL systems into a taxonomy and devise and evaluate 10 new attacks on several DRL environments; (2) we propose two practical strategies for carrying out an adversarial attack under limited computation power, N-attack and the online sequential attack; (3) we propose two methods for efficiently querying a model in black-box attacks: an adaptive-dimension-sampling-based finite difference (SFD) method and an optimal frame selection method; (4) we provide a theoretical analysis of our proposed gradient estimation method and prove its efficiency and an estimation error bound; and (5) we propose the first targeted attack that adversarially perturbs a DRL system's environment dynamics in order to cause an agent to fail in a specific way, which is more practical in real-world applications.
Table 1. Summary of the attacks in our taxonomy. The Arch., Param., and Query columns indicate the attacker knowledge each attack requires.

| Attack | MDP Component | White/Black-Box | Arch. | Param. | Query | Real-time | Physical | Temporal Dependency |
|---|---|---|---|---|---|---|---|---|
| obs-fgsm-wb | Observation | White-box | Yes | Yes | Yes | Yes | No | Independent |
| obs-cw-wb | Observation | White-box | Yes | Yes | Yes | Too slow | No | Independent |
| obs-nn-wb | Observation | White-box | Yes | Yes | Yes | Yes | No | Independent |
| obs-fgsm-bb | Observation | Black-box | No | No | No | Yes | No | Independent |
| obs-imi-bb | Observation | Black-box | No | No | Yes | Yes | No | Independent |
| obs-fd-bb | Observation | Black-box | No | No | Yes | Too slow | No | Independent |
| obs-sfd-bb | Observation | Black-box | No | No | Yes | Too slow | No | Independent |
| obs-seq-fgsm-wb | Observation | White-box | Yes | Yes | Yes | Yes | No | Sequential |
| obs-seq-fd-bb | Observation | Black-box | No | No | Yes | Yes | No | Sequential |
| obs-seq-sfd-bb | Observation | Black-box | No | No | Yes | Yes | No | Sequential |
| act-nn-wb | Action | White-box | Yes | Yes | Yes | Yes | No | Independent |
| env-search-bb | Dynamics | Black-box | No | No | Yes | N/A | Yes | N/A |
2. Related work
This paper builds on concepts introduced in prior work on adversarial examples, and our taxonomy and new attacks expand on previous attacks on DRL systems.
Adversarial attacks on machine learning models. Our attacks draw some of their techniques from previously proposed attacks. Goodfellow et al. describe the fast gradient sign method (FGSM) of generating adversarial perturbations in a white-box setting (Goodfellow et al., 2015). Carlini and Wagner describe additional optimization-based methods, which result in smaller perturbations (Carlini and Wagner, 2017). Moosavi-Dezfooli et al. demonstrate a way to generate a "universal" perturbation that is effective on a set of multiple inputs (Moosavi-Dezfooli et al., 2017). Evtimov et al. show that adversarial examples can be robust to natural lighting conditions and viewing angles (Evtimov et al., 2018). Considering a more convenient case for real-world adversaries, black-box attacks that do not require access to the training algorithm have also been proposed for general machine learning models (Papernot et al., 2017; Chen et al., 2017). In our black-box attacks, we apply these previously proposed techniques for adapting white-box methods to black-box scenarios.
Adversarial attacks on DRL models. Recently, Huang et al. demonstrated an attack that uses FGSM to perturb observation frames in a DRL setting (Huang et al., 2017). However, this white-box setting requires knowing the exact trained model and the preferred action, and the adversary's malicious goal is unclear. Their work also proposes a black-box attack method based on transferability. In this work, we propose novel black-box attacks, including attacks that do not rely on transferability, as well as several ways to reduce the computational complexity of attacks. Lin et al. design an algorithm to achieve targeted attacks on DRL models (Lin et al., 2017). However, their work only considers targeted attacks and requires training a generative model to predict future states, which is itself computationally intensive. Behzadan and Munir propose a black-box attack method that trains another DQN network to minimize the expected return while still using FGSM as the attack method (Behzadan and Munir, 2017). In terms of adversarial attacks on environment dynamics, Pan et al. propose a candidate inference attack to infer the possible dynamics used for training a candidate policy, posing a potential privacy-related risk to deep RL models (Pan et al., 2019c).
Robust RL via adversarial training. Safety and generalization in robotics and autonomous driving applications have drawn considerable attention for training robust models (Packer et al., 2018; Pinto et al., 2017; Pan et al., 2019b). Knowing how RL models can be attacked is beneficial for training robust RL agents. Pinto et al. propose to train an adversarial RL agent that perturbs the environment during training so that the protagonist agent becomes robust against dynamics variations (Pinto et al., 2017). However, since they manually select the perturbations on environment dynamics, their attack may not generalize to broader RL systems. Additionally, their method relies on accurate modeling of the environment dynamics, which may not be available for real-world tasks such as robotics and autonomous driving.
3. Taxonomy of attacks in DRL
Existing work on attacking DRL systems with adversarial perturbations focuses on perturbing an agent's observations. This is the most appealing place to start, with seminal results already suggesting that recognition systems are vulnerable to adversarial examples (Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2017). It naturally follows that we should ask whether perturbations introduced elsewhere in an RL system can cause the agent to misbehave, and in what scenarios, taking into account (i) whether the attacker can perform the attack with limited knowledge about the agent, (ii) whether the attacker can perform the attack in real time, and (iii) whether the attacker can introduce the perturbation physically. To systematically explore these questions, we propose a taxonomy of possible adversarial attacks.
Attack components. In the first layer of our proposed taxonomy, we divide attacks based on which component of an MDP the attacker chooses to perturb: the agent's observations, actions, or environment dynamics. We discuss some scenarios in which attacks on these components are practical. For attacks on the observation space, the pixel values of images can be changed by installing malware in the software that processes captured photos from the sensor, or in the simulator that renders the environment. When images are transmitted between robots and computers, the communication can be altered wirelessly by an attacker (Lonzetta et al., 2018). Physical observation-based attacks have been analyzed in autonomous driving (Evtimov et al., 2018). For attacks on the action space, the action outputs can be modified by compromising the hardware of the actuator executing the action. This is realistic in robotic control tasks where the control center sends control signals to the actuator: a vulnerability in the implementation, for example in the Bluetooth signal transmission, may allow an attacker to modify those signals (Lonzetta et al., 2018). For attacks on the environment dynamics, in the autonomous driving case we can change the surface characteristics of the road such that a policy trained in one environment will fail in the perturbed environment; in the robotic control case, the robot's mass distribution can be changed such that the robot loses balance when executing its original policy, because it has not been trained under this condition.
Attacker’s knowledge. In the second layer of our proposed taxonomy, we categorize attacks based on what information the attacker needs to perform the attack. This divides attacks into white-box attacks and black-box attacks. We make a further categorization based on the attacker’s knowledge of the policy network’s architecture and weight parameters, and whether the attacker can query the network. In white-box attacks, the attacker has access to the architecture and weight parameters of the policy network, and can of course query the network. In black-box attacks, the attacker does not have access to the weight parameters of the policy network, may or may not have access to the policy network’s architecture, and may or may not be able to query the policy network.
Further categorization. We consider the following additional properties of attacks. Real-time: some attacks require more computation than can be performed in real time, while others are fast enough; still others perform some precomputation and are then able to generate perturbations quickly at each step. We identify this pragmatic property as part of our taxonomy. Physical: for RL tasks that take place in the real world, this property concerns the feasibility of physically applying the perturbation to the environment. Temporal dependency: we distinguish between attacks that generate a perturbation in each frame independently of other frames and online sequential attacks that use information from previous frames to generate perturbations on later frames.
4. Adversarial Attacks on Reinforcement Learning Policies
In order to study the unexplored parts of our proposed taxonomy from Section 3, in this section we develop several concrete attacks. Table 1 summarizes these attacks.
4.1. Attacks on State Observations
We now describe attacks that perturb an agent’s state observations. In this category of attacks, the attacker changes the input state observation $s$ to $\hat{s} = s + \delta$, where the attacker generates the perturbation $\delta$ from the original observation $s$ and some learned parameters $\theta$. In order to ensure that perturbations are small, we require that $\|\delta\|_\infty \le \epsilon$, which we can enforce by choosing $\delta$ to be of the form $\delta = \epsilon \tanh(g_\theta(s))$, where $\epsilon$ is a small positive value. We present both white-box attacks and black-box attacks.
White-box attacks. In this setting, we assume that the attacker can access the agent’s policy network $\pi(a \mid s)$, where $a$ refers to the action and $s$ refers to the state. Huang et al. have previously introduced one attack in this category that applies the FGSM method to generate white-box perturbations purely on observations. We reproduce this experiment with our obs-fgsm-wb attack, whose application scenario is when we know the policy network’s architecture and parameters. We also include a variant of Huang et al.’s attack that replaces FGSM with an optimization-based method (Carlini and Wagner, 2017) in obs-cw-wb. In addition, we propose an attack strategy, N-attack, in which the perturbation is computed by a deep neural network $g_\theta$ in a white-box setting. We call this attack obs-nn-wb; it works in the same setting, where we know the policy network’s architecture and parameters. We train the parameters $\theta$ of the attacker network based on the given policy to minimize the victim policy’s expected return when the perturbations are applied:
$$\max_{\theta}\; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} r'_t\Big].$$
Here $r' = -r$ refers to the opposite of the environment reward $r$. With a fixed victim policy $\pi$, this attack is similar to training a policy. For example, in DQN, our goal is to perform gradient updates on $\theta$ based on the following loss function:
$$L(\theta) = \mathbb{E}\Big[\big(r' + \gamma \max_{a'} Q(s', a') - Q(s, a)\big)^2\Big],$$
where $Q$ is the model under attack and $s'$ is the next state relative to the current state $s$. In continuous control using DDPG, our goal is to perform gradient updates on $\theta$ based on the following loss function:
$$L(\theta) = \mathbb{E}\big[Q\big(s, \mu(s + \delta_\theta(s))\big)\big],$$
where $Q$ is the value function and $\mu$ is the actor function.
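The bounded-perturbation constraint above can be made concrete with a small sketch. The single-layer generator and its weights below are hypothetical stand-ins for a learned attacker network; the point is only that the $\epsilon \tanh(\cdot)$ squashing guarantees the infinity-norm bound regardless of what the network has learned:

```python
import math
import random

def perturbation(state, weights, eps=0.01):
    """Hypothetical one-layer generator: delta_i = eps * tanh(w_i . state).

    The tanh squashing keeps every component of delta in [-eps, eps],
    so ||delta||_inf <= eps holds for any learned weights."""
    return [eps * math.tanh(sum(w * x for w, x in zip(row, state)))
            for row in weights]

random.seed(0)
state = [random.uniform(-1, 1) for _ in range(4)]
# Deliberately large weights to show the bound still holds.
weights = [[random.uniform(-5, 5) for _ in range(4)] for _ in range(4)]
delta = perturbation(state, weights, eps=0.01)
assert max(abs(d) for d in delta) <= 0.01  # perturbation stays within the bound
```

In a real attack the weights would be trained with the losses above; the bound itself is architectural and needs no clipping step.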
Black-box attacks. In general, trained RL models are kept private to avoid easy attacks. Given such “black-box” models, the adversary needs more sophisticated strategies to perform attacks. In the black-box setting, there are different scenarios based on the attacker’s knowledge. In the first, the attacker cannot obtain any information about the model’s architecture or parameters, and cannot even query the model. In this case, the attacker can perform a “transferability”-based attack by attacking a surrogate model and then transferring the perturbation to the victim model. Huang et al. introduced a black-box variant of the FGSM attack using transferability, which we denote as obs-fgsm-bb. This attack requires access to the original training environment. In this section, we introduce several other novel black-box attack methods and propose ways to improve their efficiency.
Imitation learning based black-box attack. This attack, obs-imi-bb, is inspired by Rusu et al.’s work on policy distillation (Rusu et al., 2015). The attacker trains a surrogate policy to imitate the victim policy, then uses a white-box method on the surrogate policy to generate a perturbation and applies that perturbation to the victim policy.
Formally, in the deep Q-learning case, given a black-box policy with access to its policy outputs, we collect a dataset $D = \{(s_i, q_i)\}$, where each data sample consists of a short observation sequence $s_i$ and a vector $q_i = Q(s_i, \cdot)$ of unnormalized Q-values, with one value per action. We then perform imitation learning to learn a new policy $\pi_{\theta'}$ by minimizing the following loss function with gradient updates with respect to the network parameters $\theta'$:
$$L(\theta') = \mathbb{E}_{(s, q) \sim D}\Big[\mathrm{KL}\Big(\mathrm{softmax}\big(\tfrac{q}{\tau}\big) \,\Big\|\, \mathrm{softmax}\big(q^{\theta'}(s)\big)\Big)\Big],$$
where $q$ corresponds to the victim policy, $q^{\theta'}$ corresponds to our surrogate policy, and $\tau$ is a temperature factor. This attack works in the setting where we do not have access to the policy network’s architecture or parameters, but can query the network.
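The distillation loss above can be sketched in a few lines. This is an illustrative stand-alone computation, not the paper's training code; the Q-value vectors are made up, and the surrogate's logits would in practice come from its network:

```python
import math

def softmax(q, temp=1.0):
    # Numerically stable softmax with an optional temperature.
    m = max(q)
    exps = [math.exp((x - m) / temp) for x in q]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(victim_q, surrogate_q, temp=0.01):
    """KL(softmax(victim_q / temp) || softmax(surrogate_q)), as in policy
    distillation; a small temperature sharpens the victim's action choice."""
    p = softmax(victim_q, temp)
    q = softmax(surrogate_q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A surrogate that ranks actions like the victim incurs a lower loss.
victim = [1.0, 3.0, 2.0]
good, bad = [0.9, 3.1, 2.0], [3.0, 1.0, 2.0]
assert distill_loss(victim, good) < distill_loss(victim, bad)
```

Minimizing this loss over collected state/Q-value pairs is what drives the surrogate toward imitating the victim's action preferences.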
Finite difference (FD) based black-box attack. The previous black-box attacks obs-fgsm-bb and obs-imi-bb both require training a surrogate policy. Previous work by Bhagoji et al. (Bhagoji et al., 2017) applies the finite difference (FD) method to attack classification models. We extend the FD method to DRL systems in obs-fd-bb, which does not require training a new policy. This attack works in the setting where we do not know the policy network’s architecture or parameters, but can query the network. The FD-based attack on DRL uses FD to estimate the gradient with respect to the input observations, and then performs gradient descent to generate perturbations on those observations. The key step in FD is gradient estimation. Denote the loss function as $L$ and the state input as $s \in \mathbb{R}^d$. The canonical basis vector $e_i$ is defined as a $d$-dimensional vector with 1 in the $i$-th component and 0 elsewhere. The finite difference method estimates gradients via the following equation:
$$\hat{g}_i = \frac{L(s + \delta e_i) - L(s - \delta e_i)}{2\delta}, \qquad (1)$$
where $\delta$ is a parameter that controls the estimation accuracy. For a $d$-dimensional input, the finite difference method requires $2d$ queries to obtain the estimate, which is computationally intensive for high-dimensional inputs such as images. We propose a sampling technique to mitigate this cost.
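Equation (1) amounts to two loss queries per input dimension. A minimal sketch of the estimator (the quadratic loss here is only a toy stand-in for the attack loss built from policy queries):

```python
def fd_gradient(loss, s, delta=1e-4):
    """Central finite differences along each canonical basis vector e_i.

    Needs 2 * len(s) loss queries, which is what makes plain FD expensive
    for high-dimensional (image) observations."""
    grad = []
    for i in range(len(s)):
        plus = s[:]; plus[i] += delta
        minus = s[:]; minus[i] -= delta
        grad.append((loss(plus) - loss(minus)) / (2 * delta))
    return grad

# Toy check on a quadratic loss with known gradient 2*s.
loss = lambda s: sum(x * x for x in s)
g = fd_gradient(loss, [1.0, -2.0, 0.5])
assert all(abs(gi - 2 * si) < 1e-5 for gi, si in zip(g, [1.0, -2.0, 0.5]))
```

The query count, not the arithmetic, is the bottleneck; each `loss` call stands for a round trip to the black-box policy.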
Adaptive sampling based FD (SFD). Many deep learning models extract features from inputs patch-wise and have sparse activation maps (Bau et al., 2017). In Figure 2, we compare the gradient pattern of an image in CIFAR-10 to a randomly distributed gradient pattern. In the left image, large gradients are concentrated in a certain region rather than spread over the entire image as in the right image. We propose a method for estimating gradients that exploits this spatial structure: we first estimate the gradient with respect to some randomly sampled pixels; then, iteratively, we identify pixels where the gradient has high magnitude and estimate the gradient with respect to the surrounding pixels.
Given a function $L(s; \theta)$, where $\theta$ is the model parameter (omitted for conciseness below), our goal is to estimate the gradient of $L$ with respect to an input $s$: $\nabla_s L(s)$. We define the nontrivial dimensions of the gradient of $L$ at $s$ as $\{i : |\nabla_i L(s)| \ge \lambda\}$, i.e., the dimensions with gradient absolute value greater than or equal to a threshold $\lambda$. To estimate the nontrivial dimensions of the gradient, we first randomly sample $k$ dimensions of $s$, obtaining a set of dimensions $S_0$, and use FD to estimate the gradients for the dimensions in $S_0$. Then we select the subset $S_0' = \{i \in S_0 : |\hat{g}_i| \ge \lambda\}$, and use FD to estimate the gradients of the neighbors $N(S_0')$ of the dimensions in $S_0'$, if these gradients have not already been estimated. Then again we select the dimensions with absolute gradient no less than $\lambda$ from the newly estimated set and find their neighbors to estimate gradients. We repeat this process for $T$ iterations. By exploiting sparse gradients this way, we adaptively sample the dimensions whose gradients we estimate, and can significantly reduce the number of queries. We give the full attack algorithm of obs-sfd-bb (which works in the same scenario as obs-fd-bb) in Algorithm 1.
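The iterative expansion just described can be sketched as follows. This is a simplified 1-D illustration of the sampling loop, not Algorithm 1 itself; the neighborhood structure, the toy loss, and all parameter values are assumptions for the example:

```python
import random

def sfd_gradient(loss, s, k=8, lam=0.5, iters=3, delta=1e-4, neighbors=None):
    """Sketch of adaptive-sampling finite differences.

    Estimates gradients for a random seed set of dimensions, then repeatedly
    expands around dimensions whose gradient magnitude is >= lam, exploiting
    spatially concentrated gradients. Unvisited dimensions are left at 0.
    neighbors(i) should return adjacent dimensions; the default treats the
    input as a 1-D signal (an image would use 2-D adjacency)."""
    if neighbors is None:
        neighbors = lambda i: [j for j in (i - 1, i + 1) if 0 <= j < len(s)]
    grad = [0.0] * len(s)
    estimated = set()

    def estimate(i):
        plus = s[:]; plus[i] += delta
        minus = s[:]; minus[i] -= delta
        grad[i] = (loss(plus) - loss(minus)) / (2 * delta)
        estimated.add(i)

    frontier = random.sample(range(len(s)), k)
    for i in frontier:
        estimate(i)
    for _ in range(iters):
        hot = [i for i in frontier if abs(grad[i]) >= lam]
        frontier = [j for i in hot for j in neighbors(i) if j not in estimated]
        for j in frontier:
            estimate(j)
    return grad, len(estimated)  # queries used = 2 * len(estimated)

random.seed(1)
# Toy loss whose gradient is concentrated on dimensions 10..14.
loss = lambda s: sum(x * x for x in s[10:15])
grad, n = sfd_gradient(loss, [1.0] * 100, k=20)
assert n <= 100
```

Whenever the seed sample hits the concentrated region, the expansion recovers its neighbors while leaving the flat region untouched, which is where the query savings come from.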
Here we provide an analysis of our SFD sampling algorithm and bound the number of nontrivial gradient dimensions that can be estimated using our method in Lemma 4. The idea of this lemma is that, with our sampling method, we sample more nontrivial gradient dimensions than with random sampling. We also provide an error bound for the gradient estimated with SFD in Theorem 5.
Definition 4.1 (Neighbor Dimension’s Gradient).
For dimensions $i$ and $j$ with $j \in N(i)$, we define the neighbor dimension’s gradient as $\nabla_j L(s)$ for $j \in N(i)$. Note that $j \in N(i)$ is equivalent to $i \in N(j)$; to be general, we choose the first to state the definition.
Definition 4.2 (Nontrivial Gradient Dimension).
Given a positive gradient threshold $\lambda$, an input data instance $s$, and a loss function $L$, for any dimension $i$, if $|\nabla_i L(s)| \ge \lambda$, then we call this gradient a nontrivial gradient and the corresponding dimension a nontrivial gradient dimension. On the other hand, if $|\nabla_i L(s)| < \lambda$, then we call this gradient a trivial gradient and the corresponding dimension a trivial gradient dimension.
Definition 4.3 (Gradient Sample Probability).
Given the selected threshold $\lambda$ in Algorithm 1, for any sampling algorithm $A$, define the gradient sample probability as $p_A = \Pr\big(|\nabla_i L(s)| \ge \lambda \mid i \in S_A\big)$, where $S_A$ represents the set of dimensions selected by algorithm $A$. Therefore, the gradient sample probabilities of SFD and random sampling are $p_{\mathrm{SFD}}$ and $p_{\mathrm{rand}}$, respectively. Some further definitions on the neighbor gradient distribution probabilities are as follows:
If $|\nabla_i L(s)| \ge \lambda$, then define
$$p_n = \Pr\big(|\nabla_j L(s)| \ge \lambda \mid j \in N(i)\big). \qquad (2)$$
If $|\nabla_i L(s)| < \lambda$, then define
$$q_n = \Pr\big(|\nabla_j L(s)| \ge \lambda \mid j \in N(i)\big). \qquad (3)$$
These distributions $p_n$ and $q_n$ are assumed to be defined over all possible dimensions of one image (over $i \in \{1, \dots, d\}$) and to hold throughout the entire gradient estimation iteration process. Under this assumption, we have the following lemma.
Lemma 4.
We make the following assumption on $\nabla_s L(s)$: $\exists \lambda > 0$ s.t. $P_1 = \Pr(|\nabla_i L(s)| \ge \lambda) > 0$. For a dimension $i$ whose gradient satisfies $|\nabla_i L(s)| \ge \lambda$, the probability that the gradient magnitude of a neighboring pixel lies in $[\lambda, \infty)$ is $p_n$. We conclude that, as long as $p_n \ge P_1$, we have $p_{\mathrm{SFD}} \ge p_{\mathrm{rand}}$.
The intuitive understanding of this lemma is that, when gradients with absolute value smaller than $\lambda$ are less likely around pixels whose gradient absolute value is at least $\lambda$ than under uniform random sampling from the image, our method is more sample-efficient than random sampling.
Proof.
Following the notation in Lemma 4, define
$$P_1 = \Pr\big(|\nabla_i L(s)| \ge \lambda\big), \qquad P_2 = \Pr\big(|\nabla_i L(s)| < \lambda\big) = 1 - P_1. \qquad (4)$$
Note that the randomness is with respect to the dimensions. These probabilities are, when we randomly choose dimensions of a gradient vector of $d$ dimensions, the chance of sampling a dimension with gradient absolute value no less than $\lambda$ or less than $\lambda$, respectively. For a dimension $i$ with $|\nabla_i L(s)| \ge \lambda$, its neighbors’ gradient absolute values follow the distribution: at least $\lambda$ with probability $p_n$, and less than $\lambda$ with probability $1 - p_n$.
Define $N_0$ as the number of nontrivial gradients estimated in iteration 0, and define
$$r_t = \frac{N_t}{M_t} \qquad (5)$$
as the ratio of the number $N_t$ of nontrivial gradients estimated in iteration $t$ to the total number $M_t$ of gradients estimated in iteration $t$. The ratio $r_t$ characterizes, in every iteration, the fraction of nontrivial gradient estimates among all gradient estimates, while $P_1$ characterizes this fraction under random sampling. For $t = 0$, $\mathbb{E}[r_0] = P_1$. Since $r_t$ is the ratio of the number of nontrivial gradients to the total number of gradients estimated in the $t$-th iteration, if we randomly select the dimensions, $\mathbb{E}[r_t]$ equals $P_1$. Using our SFD method, if for every $t$ we have $\mathbb{E}[r_t] \ge P_1$, then overall the ratio $R$ of the number of nontrivial gradients estimated to the total number of gradients estimated will be at least $P_1$. The definition of $R$ is
$$R = \frac{\sum_{t=0}^{T} N_t}{\sum_{t=0}^{T} M_t}, \qquad (6)$$
where $T$ is the number of iterations.
We now prove that if $p_n \ge P_1$, then for $t \ge 1$,
$$\mathbb{E}[r_t] \ge P_1. \qquad (7)$$
More specifically, for $t = 0$, since we perform uniform random sampling of $k$ dimensions, we have
$$\mathbb{E}[r_0] = P_1. \qquad (8)$$
Then, if we can prove that $\mathbb{E}[r_t] \ge P_1$ for every $t \ge 1$, with strict inequality for some $t$, it in turn proves $R \ge P_1$. The reason is that if at every step of our SFD algorithm the sample efficiency is at least that of random sampling, and at some iterations SFD is strictly more efficient, then overall the SFD-based gradient estimation is more efficient than random-sampling-based estimation.
Now, from the definitions above, every dimension estimated in an iteration $t \ge 1$ is a neighbor of a dimension with gradient magnitude at least $\lambda$, so
$$\mathbb{E}[r_t] = p_n, \qquad t \ge 1. \qquad (9)$$
Therefore, when $p_n \ge P_1$, we have $\mathbb{E}[r_t] \ge P_1$ for all $t$, and our sampling algorithm is more efficient than random sampling.
∎
This lemma suggests that when the gradient distribution is more concentrated ($P_1$ is small while $p_n$ is large, so $p_n > P_1$), our sampling algorithm is more efficient than random sampling. Next we give a theorem with an upper bound on the gradient estimation error, together with its proof.
Theorem 5.
Suppose we sample all nontrivial dimensions of the gradient and estimate the gradient with perturbation strength $\delta$. Then the estimation error of the gradient is upper bounded by the following inequality:
$$\|\hat{g} - \nabla_s L(s)\|_\infty \le \max\{C\delta^2, \lambda\} \qquad (10)$$
for a constant $C$ depending on the third derivatives of $L$, where $\hat{g}$ is the estimated gradient of $L$ with respect to $s$.
Proof.
We now prove Theorem 5. Assume the function $L$ is three times continuously differentiable. By Taylor’s theorem, we have
$$L(s + \delta e_i) = L(s) + \delta\, \nabla_i L(s) + \frac{\delta^2}{2} \nabla_i^2 L(s) + \frac{\delta^3}{6} \nabla_i^3 L(\xi_1),$$
$$L(s - \delta e_i) = L(s) - \delta\, \nabla_i L(s) + \frac{\delta^2}{2} \nabla_i^2 L(s) - \frac{\delta^3}{6} \nabla_i^3 L(\xi_2). \qquad (11)$$
Combining the two equations we get
$$\frac{L(s + \delta e_i) - L(s - \delta e_i)}{2\delta} - \nabla_i L(s) = \frac{\delta^2}{12}\big(\nabla_i^3 L(\xi_1) + \nabla_i^3 L(\xi_2)\big), \qquad (12)$$
which means the truncation error is bounded by $O(\delta^2)$. Moreover, we have
$$\Big|\frac{L(s + \delta e_i) - L(s - \delta e_i)}{2\delta} - \nabla_i L(s)\Big| \le \frac{\delta^2}{6} \max_{\xi} \big|\nabla_i^3 L(\xi)\big|, \qquad (13)$$
where $\xi_1 \in (s, s + \delta e_i)$ and $\xi_2 \in (s - \delta e_i, s)$.
We can regard each dimension as a single-variable function $f_i(t) = L(s + t e_i)$; then we have
$$|\hat{g}_i - \nabla_i L(s)| \le C \delta^2, \qquad C = \frac{1}{6} \max_{i, \xi} \big|f_i'''(\xi)\big|. \qquad (14)$$
Then, for the dimensions not sampled, assuming we are able to sample all nontrivial gradients with absolute value no less than $\lambda$, each unsampled dimension $i$ has $\hat{g}_i = 0$ and therefore
$$|\hat{g}_i - \nabla_i L(s)| = |\nabla_i L(s)| < \lambda. \qquad (15)$$
Therefore, the error of the gradient estimate is upper bounded by the following inequality:
$$\|\hat{g} - \nabla_s L(s)\|_\infty \le \max\{C\delta^2, \lambda\}, \qquad (16)$$
for the constant $C$ above, where $\hat{g}$ is the estimated gradient of $L$ with respect to $s$.
∎
Online sequential attacks. In a DRL setting, consecutive observations are not i.i.d.; rather, they are highly correlated, with each state depending on previous ones. It is therefore possible to perform an attack with less computation than attacking each state independently. In real-world settings, for example an autonomous robot that takes real-time video as input to make decisions, an attacker is motivated to generate perturbations based only on previous states and apply them to future states, which we refer to as an online sequential attack. We hypothesize that a perturbation generated this way remains effective on subsequent states.
Universal attack based approach. We propose the online sequential attacks obs-seq-fgsm-wb, obs-seq-fd-bb, and obs-seq-sfd-bb, which exploit this structure of the observations. obs-seq-fgsm-wb works in the standard white-box setting, where we know the architecture and parameters of the policy network; obs-seq-fd-bb and obs-seq-sfd-bb work in the same setting as obs-fd-bb. In these attacks, we first collect a number of observation frames and generate a single perturbation using the averaged gradient over these frames (or the averaged estimated gradients using FD or SFD, in the case of obs-seq-fd-bb and obs-seq-sfd-bb). Then, we apply that perturbation to all subsequent frames. In obs-seq-sfd-bb, we combine the universal attack approach and the adaptive sampling technique for finite difference estimates. We improve upon the above attack by finding the set of frames that appear to be most important and using only the gradients from those frames, hoping to maintain attack effectiveness while reducing the number of queries needed. Concretely, we select a subset of frames within the first collected frames based on the variance of their Q values; then, in all subsequent frames, the attack applies a perturbation generated from the averaged gradient of this optimal set of important, high-variance frames. We give a proof in Corollary 6 below of why attacking these important frames is more effective at reducing the overall expected return.

Corollary 6.
Let the state and state-action value at time $t$ be $s_t$ and $Q(s_t, a_t)$, respectively, for a policy $\pi$ with time horizon $T$. We conclude that, for time steps $i < j$, if $\mathrm{Var}_a[Q(s_i, a)] > \mathrm{Var}_a[Q(s_j, a)]$, then $\mathbb{E}[R \mid s_i \to \tilde{s}_i] \le \mathbb{E}[R \mid s_j \to \tilde{s}_j]$, where $s_t \to \tilde{s}_t$ means the observation at time $t$ is changed from $s_t$ to $\tilde{s}_t$.
Proof.
Recall the definition of the Q value:
$$Q(s_t, a_t) = \mathbb{E}\Big[\sum_{k=t}^{T} \gamma^{k-t} r_k \,\Big|\, s_t, a_t\Big]. \qquad (17)$$
The variance of the Q value at a state $s$ is defined as
$$\sigma^2(s) = \frac{1}{|\mathcal{A}|}\sum_{a \in \mathcal{A}} \big(Q(s, a) - \bar{Q}(s)\big)^2, \qquad (18)$$
where $\mathcal{A}$ is the action space of the MDP, $|\mathcal{A}|$ denotes the number of actions, and $\bar{Q}(s)$ is the mean Q value at $s$. Suppose we are to attack state $s_i$ or state $s_j$ (with $i < j$), where the Q value variances of these two states are $\sigma^2(s_i)$ and $\sigma^2(s_j)$, and assume $\sigma^2(s_i) > \sigma^2(s_j)$.
Denote the state-action Q values after the attack as $Q(s_i, \tilde{a}_i)$ and $Q(s_j, \tilde{a}_j)$, respectively. During the attack, state $s_i$ is modified to $\tilde{s}_i$ and state $s_j$ is modified to $\tilde{s}_j$, and their actions’ Q values also change, so we use $\tilde{a}_i$ and $\tilde{a}_j$ to denote the actions after the attack. By writing $Q(s_i, \tilde{a}_i)$ and $Q(s_j, \tilde{a}_j)$ instead of $Q(\tilde{s}_i, \tilde{a}_i)$ and $Q(\tilde{s}_j, \tilde{a}_j)$, we mean that although the observed states are modified by the attack algorithm, the true states do not change. By using different action notation, we mean that since the observed states have been modified, the optimal actions at the modified states can differ from the optimal actions at the original observed states. Then the total discounted expected return for the entire episode when attacking at step $i$ can be expressed as (assuming all other actions are optimal)
$$R_i = \mathbb{E}\Big[\sum_{t=0}^{i-1} \gamma^{t} r_t + \gamma^{i}\, Q(s_i, \tilde{a}_i)\Big]. \qquad (19)$$
Since the trajectory is unchanged before the attacked step, $R_j$ can also be expressed as
$$R_j = \mathbb{E}\Big[\sum_{t=0}^{j-1} \gamma^{t} r_t + \gamma^{j}\, Q(s_j, \tilde{a}_j)\Big] = \mathbb{E}\Big[\sum_{t=0}^{i-1} \gamma^{t} r_t + \gamma^{i}\, Q(s_i, a_i) - \gamma^{j}\big(Q(s_j, a_j) - Q(s_j, \tilde{a}_j)\big)\Big]. \qquad (20)$$
Subtracting $R_j$ from $R_i$ we get
$$R_i - R_j = \mathbb{E}\Big[\gamma^{j}\big(Q(s_j, a_j) - Q(s_j, \tilde{a}_j)\big) - \gamma^{i}\big(Q(s_i, a_i) - Q(s_i, \tilde{a}_i)\big)\Big]. \qquad (21)$$
According to our claim that states where the variance of the Q value function is large yield a stronger attack, suppose $\sigma^2(s_i) > \sigma^2(s_j)$ and assume the range of Q values at step $i$ is larger than at step $j$; then
$$Q(s_i, a_i) - Q(s_i, \tilde{a}_i) \ge Q(s_j, a_j) - Q(s_j, \tilde{a}_j) \ge 0, \qquad (22)$$
and since $\gamma^{i} \ge \gamma^{j}$ for $i < j$ and $\gamma \in (0, 1]$, we obtain $R_i - R_j \le 0$. Therefore $R_i \le R_j$, which means that by attacking state $s_i$ the agent gets less return in expectation. If instead $\sigma^2(s_i) < \sigma^2(s_j)$, assume the range of Q values at step $i$ is smaller than at step $j$; then
$$Q(s_i, a_i) - Q(s_i, \tilde{a}_i) \le Q(s_j, a_j) - Q(s_j, \tilde{a}_j). \qquad (23)$$
If $Q(s_i, a_i) - Q(s_i, \tilde{a}_i)$ is very small or $Q(s_j, a_j) - Q(s_j, \tilde{a}_j)$ is large enough such that $\gamma^{j}\big(Q(s_j, a_j) - Q(s_j, \tilde{a}_j)\big) \ge \gamma^{i}\big(Q(s_i, a_i) - Q(s_i, \tilde{a}_i)\big)$, then we have $R_i \ge R_j$, which means that attacking state $s_i$ leaves the agent more reward in expectation than attacking state $s_j$. ∎
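The frame-selection and universal-perturbation steps described above can be sketched concretely. The Q-value vectors and gradients below are made-up toy data, and the FGSM-style sign step is one simple way to turn an averaged gradient into a bounded perturbation:

```python
def select_frames(q_values_per_frame, m):
    """Pick the m frames whose Q-value vectors have the highest variance
    across actions; per Corollary 6, forcing a different action on these
    frames costs the agent the most expected return."""
    def variance(q):
        mean = sum(q) / len(q)
        return sum((x - mean) ** 2 for x in q) / len(q)
    ranked = sorted(range(len(q_values_per_frame)),
                    key=lambda i: variance(q_values_per_frame[i]),
                    reverse=True)
    return sorted(ranked[:m])

def universal_perturbation(grads, frames, eps):
    """Average the (estimated) gradients of the chosen frames and take a
    single sign step; the result is reused on all later frames."""
    n = len(frames)
    avg = [sum(grads[f][i] for f in frames) / n for i in range(len(grads[0]))]
    sign = lambda x: (x > 0) - (x < 0)
    return [eps * sign(g) for g in avg]

qs = [[1.0, 1.1], [0.0, 5.0], [2.0, 2.0], [1.0, 4.0]]
frames = select_frames(qs, 2)
assert frames == [1, 3]  # the two frames with the largest Q-value spread
grads = [[0.5, -0.2], [1.0, -1.0], [0.0, 0.0], [-0.4, 0.6]]
pert = universal_perturbation(grads, frames, eps=0.01)
```

In the black-box variants, `grads` would hold FD or SFD estimates instead of exact gradients, so restricting to high-variance frames directly reduces the number of queries.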
4.2. Attacks on Action Selection
Our second category of attacks directly perturbs the action output in order to minimize the expected return. We experiment with one attack in this category, under a white-box scenario: act-nn-wb. Here we train another network that takes in the state $s$ and outputs a perturbation $\delta_a(s)$ on the action output: $\tilde{a} = a + \delta_a(s)$; the goal again is to minimize the expected return. For example, in DQN, the loss is chosen to be $L = Q(s, \tilde{a})$, the Q value the victim obtains under the perturbed action. For DDPG, the loss is chosen to be $L = -\mathbb{E}\big[\sum_t \gamma^t r'_t\big]$, where $r'$ is a reward that captures the attacker’s goal of minimizing the victim agent’s expected return. The latter approach treats the environment and the original policy together as a new environment, and views the attack perturbations as actions.
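The action-space perturbation can be illustrated with a minimal sketch. The actor and attacker networks here are hypothetical lambdas standing in for trained models; the point is the bounded additive perturbation on the action output:

```python
import math

def perturbed_action(actor, attacker, state, eps=0.05):
    """Add a bounded, state-conditioned perturbation to the actor's output:
    a' = a + eps * tanh(attacker(state)), so ||a' - a||_inf <= eps."""
    a = actor(state)
    return [ai + eps * math.tanh(d) for ai, d in zip(a, attacker(state))]

actor = lambda s: [0.5 * s[0], -0.2 * s[1]]       # stand-in victim actor
attacker = lambda s: [10.0 * s[0], -10.0 * s[1]]  # stand-in learned attacker
a = actor([1.0, 1.0])
a_adv = perturbed_action(actor, attacker, [1.0, 1.0], eps=0.05)
assert all(abs(x - y) <= 0.05 for x, y in zip(a_adv, a))
```

Training the attacker then reduces to standard policy optimization against the negated reward, with the victim policy folded into the environment.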
4.3. Attacks on Environment Dynamics
In this third category, attacks perturb the environment transition model. In our case, we aim to achieve a targeted attack, meaning we want to change the dynamics such that the agent fails in a specific way. Define the environment dynamics as $T$, the agent’s policy as $\pi$, and the agent’s state at step $t$, following the current policy under the current dynamics, as $s_t$, and define a mapping $M(s_0, \pi, T) \mapsto s_t$, which outputs the state at time step $t$ given initial state $s_0$, policy $\pi$, and environment dynamics $T$. The task of attacking the environment dynamics is to find another dynamics $T'$ such that the agent reaches a target state $s_t^{\mathrm{target}}$ at step $t$: $M(s_0, \pi, T') = s_t^{\mathrm{target}}$.
Random dynamics search. A naive way to find the target dynamics, which we demonstrate in env-rand-bb, is random search: we repeatedly propose a random dynamics $T'$ and check whether, under this dynamics, the agent reaches $s_t^{\mathrm{target}}$. This method works in the setting where we do not need access to the policy network’s architecture or parameters, but only need to query the network.
Adversarial dynamics search. We design a more systematic, RL-based algorithm to search for a dynamics to attack, which we call env-search-bb. At each step, the attacker proposes a change to the current environment dynamics $T$ with some perturbation $\delta_T$, where $\delta_T$ is bounded by some constant $c$; we then find the new state at time step $t$ by following the current policy under the dynamics $T + \delta_T$, and the attacker agent receives a reward measuring how close this state is to the target state $s_t^{\mathrm{target}}$. We demonstrate this in env-search-bb using DDPG (Lillicrap et al., 2016) to train the attacker. To show that this method works better than random search, we also compare against the random dynamics search method, keeping the bound on the maximum perturbation the same. This attack works in the same setting as env-rand-bb.
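The random-search baseline just contrasted can be sketched as follows. The rollout function, dynamics parameterization, and all numeric values are toy assumptions; a real instance would roll out the fixed victim policy in the perturbed simulator:

```python
import random

def random_dynamics_search(rollout, nominal, target, c=0.5,
                           trials=1000, tol=0.1):
    """Sketch of random dynamics search: propose dynamics parameters within
    an L_inf ball of radius c around the nominal dynamics, roll out the
    fixed policy under each proposal, and return the first proposal that
    drives the final state within tol of the target state."""
    for _ in range(trials):
        proposal = [p + random.uniform(-c, c) for p in nominal]
        if abs(rollout(proposal) - target) < tol:
            return proposal
    return None

# Toy "environment": the final state is just the sum of the parameters.
random.seed(0)
rollout = lambda params: sum(params)
found = random_dynamics_search(rollout, nominal=[0.0, 0.0], target=0.6)
assert found is not None
```

The RL-based search replaces the blind proposal step with an attacker policy trained on the closeness-to-target reward, which is why it needs far fewer rollouts in practice.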
5. Experiments
We attack several agents trained for five different RL environments: Atari games Pong and Enduro (Bellemare et al., 2013), HalfCheetah and Hopper in MuJoCo (Todorov et al., 2012), and the driving simulation TORCS (Pan et al., 2017). We train DQN (Mnih et al., 2015) on Pong, Enduro and TORCS, and we train DDPG (Lillicrap et al., 2016) on HalfCheetah and Hopper. The reward function for TORCS comes from (Pan et al., 2017). The DQN network architecture comes from (Mnih et al., 2015). The network for continuous control using DDPG comes from (Dhariwal et al., 2017). For each game, we train the above agents with different random seeds and different architectures in order to evaluate different conditions in the transferability and imitation learning based blackbox attack. Details of network structure and the performance for each game are included in Appendix A.
5.1. Experimental Design
We compare the agents’ performance under all attacks with their performance under no attack, denoted as nonadv.
Attacks on observation. We test these attacks under two perturbation bounds on the Atari games and MuJoCo simulations and two on TORCS; all bound values are in the range [0, 1].

First, we test the whitebox attacks obsfgsmwb and obsnnwb on all five environments.
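For reference, the core of an FGSM-style observation attack such as obsfgsmwb is a single signed-gradient step, clipped back to the valid pixel range; the sketch below assumes the loss gradient with respect to the observation is already available (i.e., whitebox access):

```python
import numpy as np

def fgsm_perturb(obs, grad, eps):
    """FGSM-style observation perturbation: take a step of size eps in the
    direction of the sign of the loss gradient, then clip back to the valid
    observation range [0, 1] (pixel values, as in the experiments)."""
    adv = obs + eps * np.sign(grad)
    return np.clip(adv, 0.0, 1.0)
```

In practice `grad` would come from backpropagating the attack loss through the victim (or surrogate) policy network.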

Second, we test the attack obsfgsmbb under two different conditions: (1) in obsfgsmbb(1), the attacker uses the same network structure for the surrogate model as the victim policy, and (2) in obsfgsmbb(2), the attacker uses a different network structure for the surrogate model.

We test the attack obsimibb on all five environments. Similar to the transferability attacks, we test this attack under samearchitecture (obsimibb(1)) and differentarchitecture (obsimibb(2)) conditions. We use FGSM to generate perturbations on the surrogate policy.

We test obssfdbb under different numbers of SFD iterations; we denote an attack that uses k iterations as obss[k]fdbb. obssfdbb requires significantly fewer queries than obsfdbb; we show the actual numbers of queries used in Table 2.
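The query savings come from estimating the gradient only at a sampled subset of coordinates. Below is a generic sketch of a sampled finite-difference estimate (not the paper's exact SFD algorithm, which chooses the sampled pixels adaptively): a full central-difference estimate on an n-pixel image needs 2n queries, while sampling k << n coordinates cuts this to 2k.

```python
import numpy as np

def sampled_finite_difference(f, x, idx, h=1e-3):
    """Estimate the gradient of scalar function f at x only on the sampled
    coordinates `idx` (all other entries stay zero), using central
    differences: two queries per sampled coordinate."""
    grad = np.zeros_like(x, dtype=float)
    queries = 0
    for i in idx:
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
        queries += 2
    return grad, queries
```

The resulting sparse gradient estimate can then be fed into the same signed-step perturbation used by the FGSM-style attacks.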

For the attack obsseqfgsmwb, we test under the condition obsseq[F]fgsmwb (F for "first"), where we use all of the first F frames to compute the gradient used to generate a universal perturbation for the subsequent frames.
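One plausible way to build such a universal perturbation is to average the per-frame gradients collected over the first F frames and then take a single FGSM-style signed step; the aggregation rule here is an illustrative assumption, not necessarily the paper's exact construction.

```python
import numpy as np

def universal_from_first_frames(grads, eps):
    """Combine the loss gradients collected on the first F frames into a
    single universal perturbation: average the gradients, then take an
    FGSM-style signed step of size eps.  The same perturbation is then
    applied to every subsequent frame."""
    mean_grad = np.mean(np.asarray(grads, dtype=float), axis=0)
    return eps * np.sign(mean_grad)
```

By construction the perturbation's per-pixel magnitude never exceeds eps, so the sequential attack respects the same bound as the per-frame attacks.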

For the attacks obsseqfdbb and obsseqsfdbb, we test under three conditions. (i) In obsseq[F]fdbb, we look at the first F frames and use FD to estimate the gradient; (ii) in obsseq[L]fdbb and obsseq[L]s[k]fdbb (L for "largest", with k SFD iterations), we again look at the first F frames, but select only the subset of those frames that have the largest Q value variance to generate the universal perturbation; (iii) obsseq[S]fdbb (S for "smallest") is similar to the previous condition, except that we select the frames among the first F that have the smallest Q value variance to generate the universal perturbation.

We additionally test a random-perturbation-based online sequential attack, obsseqrandbb, where we draw a sample of uniform random noise as the perturbation and apply it to all frames. Although this attack does not use the starting frames, we still test it under different conditions obsseq[F]randbb, where we start adding the random perturbation after the Fth frame; this makes it consistent with the other online sequential attacks, which apply their perturbation after the Fth frame.
Attacks on action selection. We test the action selection attack actnnwb on the Atari games, TORCS, and MuJoCo robotic control tasks.
Attacks on environment dynamics. We test the environment dynamics attacks envrandbb and envsearchbb on the MuJoCo environments and TORCS. In the tests on MuJoCo, we perturb the body mass and body inertia vectors, which lie in R^32 in HalfCheetah and R^20 in Hopper. In the tests on TORCS, we perturb the road friction coefficient and bump size, which lie in R^10. The perturbation strength is within 10% of the original magnitude of the dynamics being perturbed.
Table 2: Number of queries used by obssfdbb under perturbation bounds 0.05 and 0.10 with 10, 20, 40, and 100 SFD iterations (roughly 1,000 to 6,000 queries in all settings, compared with around 14,000 for obsfdbb).
Table 11: Distance to the target state under environment dynamics attacks (smaller is better).

Environment  envrandbb  envsearchbb
HalfCheetah  7.91  5.76
Hopper  1.89  0.0017
TORCS  25.02  22.75
5.2. Experimental Results
Attacks on observation. Figure 2(a) shows the results of the attacks on observations on TORCS, including all methods for attacking observations as well as the nonadv baseline. In addition, Figure 15 shows the decomposition of the TORCS reward into a progress-related reward and a catastrophe-related reward (reward for collisions), and shows that our attack causes significantly more crashes in the autonomous driving environment than obsfgsmwb. On TORCS, our neural network based attack obsnnwb achieves better attack performance than the FGSM attack obsfgsmwb. Under the blackbox setting, our proposed imitation learning based attacks obsimibb(1) and obsimibb(2) and the FD based attack obsfdbb achieve better attack performance than the transferability based attacks obsfgsmbb(1) and obsfgsmbb(2).
Figures 2(b) and 2(c) compare the cumulative rewards of the different blackbox methods on TORCS. These figures show that the policy is vulnerable to all of the blackbox methods. Specifically, they show that obss[k]fdbb (SFD with k iterations) achieves performance similar to FD under each value of the perturbation bound. In Table 2, we report the numbers of queries used by obssfdbb and obsfdbb; obssfdbb uses significantly fewer queries (around 1,000 to 6,000) than obsfdbb (around 14,000) while achieving similar attack performance. The SFD method samples only part of the pixels to estimate the gradient, whereas the vanilla FD method requires gradient estimation at all pixels. Therefore obssfdbb is also more efficient in running time than obsfdbb, which indicates the effectiveness of our adaptive sampling algorithm in reducing gradient computation time while maintaining attack performance.
Figure 5 (left) compares obsseq[F]fgsmwb, obsseq[F]fdbb, and obsseq[F]randbb under perturbations with two different norm bounds; the two plots show the cumulative reward for one episode when the states are under attack. Our proposed obsseq[F]fdbb achieves attack performance close to that of obsseq[F]fgsmwb, while the baseline obsseq[F]randbb is not effective. Figure 5 (right) shows that estimating the gradient on the states with the largest Q value variance (obsseq[L]fdbb) yields a more effective attack than using the states with the smallest Q value variance (obsseq[S]fdbb), which indicates that selecting frames with large Q value variance is more effective. We also see that when F is very small, the estimated universal perturbation may be inaccurate, while for larger F the attack performance is reasonably good.
In Figure 6, we show the results of obsseq[L]s[k]fdbb when varying the number of SFD iterations k, selecting the 20% of frames with the largest Q value variance among the first F frames to estimate the gradient using SFD. With more iterations, we obtain a more accurate estimate of the gradients and thus achieve better attack performance, while the total number of queries is still significantly reduced. We conclude from Table 2 that even at the largest iteration count, the number of queries for SFD is around 6k, which is significantly smaller than for FD, which takes about 14k queries to estimate the gradient on an 84×84 image (14112 = 2 × 84 × 84, two queries per pixel).
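The query arithmetic can be checked directly, assuming the standard 84×84 preprocessed frames (which matches the 14,112 figure) and central differences with two queries per coordinate; the 3,000-pixel sample size below is illustrative, since the paper reports roughly 1,000 to 6,000 queries for SFD.

```python
# Query cost of vanilla FD vs. sampled FD, assuming central differences
# (two queries per estimated coordinate) on a preprocessed 84x84 frame.
n_pixels = 84 * 84                 # 7056 pixels
fd_queries = 2 * n_pixels          # full FD: one pair of queries per pixel
sfd_queries = 2 * 3000             # SFD sampling ~3000 pixels (illustrative)
```

This is where the roughly 2x-14x query reduction of SFD over FD comes from.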
We provide the results of attacks on the observation space in the other environments in Figures 4, 7, 8, and 9. These environments include the Atari games Pong and Enduro and the MuJoCo robotics simulation environments HalfCheetah and Hopper. The results show that, for obsseq[L]fdbb, there exists at least one F such that when we estimate a universal perturbation from the top 20% of the first F frames and apply it to all subsequent frames starting from the Fth frame, we achieve reasonably good attack performance. In some environments, such as Pong, a small F is already enough to induce a strong attack, while in Enduro a larger F achieves better performance. Enduro is also an autonomous driving environment, simpler than TORCS, and we observed consistent results in the two environments; note that different thresholds are applied according to the complexity of the two environments.
Attacks on action selection. We present the results of our attacks on action selection in Figure 10. The results show that the action space attack is also effective, and a larger perturbation bound yields better attack performance.
Attacks on environment dynamics. In Table 11, we show the results of our targeted adversarial environment dynamics attacks, reported as the distance to the target state (smaller is better). Our goal is to attack the environment dynamics so that the victim agent fails in a prespecified way, for example, a Hopper turning over, or a self-driving car driving off the road and hitting obstacles. The results show that the random search method performs worse than the RL based search method at reaching a specific state after a certain number of steps. The quality of an attack can also be evaluated qualitatively by observing the sequence of states while the agent is under attack and checking whether the target state is reached. In Figures 12–14, we show the sequences of states when the agents are under attack with the random search or reinforcement learning based search method. The last image in each sequence is the state at the same step t: in the abnormal dynamics rollout it corresponds to the target state, in the RL search rollout it shows the result of envsearchbb, and in the random search rollout it shows the result of envrandbb. These figures show that envsearchbb is very effective at achieving the targeted attack, whereas with random search it is relatively harder to achieve.
6. Discussion and Conclusions
Though this paper focuses on adversarial attacks on deep reinforcement learning, an important direction is to develop reinforcement learning methods that are robust to, and can defend against, such attacks. We provide some discussion of these perspectives below.
General Attacks on DRL. We have attempted to study a broad scope of possible attacks: perturbing different parts of a reinforcement learning system, under threat models with different attacker knowledge, and using new techniques to reduce attack computation cost. However, our experiments are not an exhaustive set of possible attacks. A general attacker may gain access to perturb multiple parts of an RL system and may utilize still newer techniques to compute effective perturbations efficiently.
Improving Robustness of RL. There has been increasing interest in training RL algorithms that are robust against perturbations in the environment, or even adversarial attacks. Previous methods that aim to improve the robustness of RL typically apply either random perturbations or gradient based noise to the observations in order to induce the agent to choose suboptimal actions. On the one hand, our finite difference and sampling based finite difference methods can compute attacks faster than traditional FGSM based attacks that require backpropagation to calculate gradients, and can therefore be incorporated into the training of RL policies to improve their robustness; the environment dynamics attack can further help find environments in which the current agent is vulnerable. On the other hand, our methods provide tools to evaluate the vulnerability of a trained RL policy. Finally, we hope that our proposed taxonomy helps guide future research in making DRL systems robust, and we offer our experimental results as baselines against which future robust RL techniques can compare.
Priority of Defense Against the Proposed Attacks. From the perspective of training robust RL policies, it is important to know the severity of the risk associated with each of the proposed attacks. Among them, the environment dynamics attack may be a more realistic threat than the attacks on observations or the action space: it does not require access to modify the policy network's software system, only the ability to modify environment dynamics, and our experiments show that by modifying dynamics parameters, such as changing the road condition in autonomous driving, the agent with the original policy tends to fail. The observation and action space attacks, especially the blackbox attacks, are also important to defend against, since an attacker can readily query the network and may gain access to change the observations or action selection.
Potential Defenses. Previous work, with increasing interest in training robust RL algorithms, has tried (i) applying random perturbations to the observations or (ii) applying gradient based noise to the observations in order to exercise the agent under training on possible perturbations. As a first order enhancement, our finite difference and sampling based finite difference attacks fit the same pipeline and can even run faster than traditional FGSM. However, valuable future defenses should also consider which attacks would be more practical to carry out. An environment dynamics attack, for example, can perturb the dynamics without any electronic modification to the system's sensors or controllers. Blackbox attacks with query access may also become increasingly realistic as consumer products that use RL make query oracles widely available.
We hope our exploratory work and the taxonomy of attacks we describe help form a more complete view for what threats should be considered in ongoing research in robust reinforcement learning.
References
Network dissection: quantifying interpretability of deep visual representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3319–3327.
Vulnerability of deep reinforcement learning to policy induction attacks. arXiv preprint arXiv:1701.04143.
The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
Exploring the space of black-box attacks on deep neural networks. arXiv preprint arXiv:1712.09491.
Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017.
ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In ACM Workshop on Artificial Intelligence and Security, pp. 15–26.
An approach to tune fuzzy controllers based on reinforcement learning for autonomous vehicle control. IEEE Transactions on Intelligent Transportation Systems 6(3), pp. 285–293.
OpenAI Baselines.
Robust physical-world attacks on machine learning models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Reinforcement learning in board games. Department of Computer Science, University of Bristol, Tech. Rep.
Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284.
Machine learning in adversarial environments. Machine Learning 81(2), pp. 115–119.
End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17(1), pp. 1334–1373.
Feature cross-substitution in adversarial classification. In Advances in Neural Information Processing Systems, pp. 2087–2095.
Scalable optimization of randomized operational decisions in adversarial classification settings. In AISTATS.
Continuous control with deep reinforcement learning. In International Conference on Learning Representations.
Tactics of adversarial attack on deep reinforcement learning agents. In 26th International Joint Conference on Artificial Intelligence.
Security vulnerabilities in Bluetooth technology as used in IoT. Journal of Sensor and Actuator Networks 7(3), p. 28.
Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Human-level control through deep reinforcement learning. Nature 518(7540), p. 529.
Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1765–1773.
Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282.
Semantic predictive control for explainable and efficient policy learning. In IEEE International Conference on Robotics and Automation (ICRA).
Risk averse robust adversarial reinforcement learning. In IEEE International Conference on Robotics and Automation (ICRA).
How you act tells a lot: privacy-leakage attack on deep reinforcement learning. arXiv preprint arXiv:1904.11082.
Virtual to real reinforcement learning for autonomous driving. In British Machine Vision Conference (BMVC).
Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519.
Robust adversarial reinforcement learning. In ICML, Proceedings of Machine Learning Research, Vol. 70, pp. 2817–2826.
SemanticAdv: generating adversarial examples via attribute-conditional image editing. arXiv preprint arXiv:1906.07927.
Deep learning in production & warehousing with Amazon Robotics. https://link.medium.com/71WXEy3AaS
Policy distillation. arXiv preprint arXiv:1511.06295.
MuJoCo: a physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033.
Characterizing adversarial examples based on spatial consistency information for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 217–234.
Generating adversarial examples with adversarial networks. In IJCAI.
MeshAdv: adversarial meshes for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6898–6907.
Spatially transformed adversarial examples. In ICLR 2019.
Appendix A Experimental Setup
We trained DQN models on Pong, Enduro, and TORCS, and DDPG models on HalfCheetah and Hopper. The DQN models for Pong and Enduro consist of 3 convolutional layers and 2 fully connected layers, and the two network architectures differ in their numbers of filters. We write C(c_in, c_out, k, s) for a convolutional layer with input channel number c_in, output channel number c_out, kernel size k, and stride s, and FC(m, n) for a fully connected layer with input dimension m and output dimension n; the output dimension of the final layer equals the number of actions in the environment. The DQN model for TORCS consists of 3 convolutional layers followed by 2 fully connected layers in one model and 3 in the other. The DDPG models for HalfCheetah and Hopper consist of several fully connected layers; we trained two different policy network structures on all MuJoCo environments, with the second model's actor and critic differing from the first in their layer sizes. For both models, we added ReLU activation layers between the fully connected layers.
The TORCS autonomous driving environment is a discrete action space control environment with 9 actions: turn left; turn right; keep going; turn left and accelerate; turn right and accelerate; accelerate; turn left and decelerate; turn right and decelerate; and decelerate. The other four environments, Pong, Enduro, HalfCheetah, and Hopper, are standard OpenAI Gym environments.
The trained models' performance when tested without any attack is given in Table 3.

Table 3: Performance of the trained agents without attack.

                  TORCS   Enduro  Pong  HalfCheetah  Hopper
Episodic reward   1720.8  1308    21    8257         3061
Episode length    1351    16634   1654  1000         1000
The DDPG network used for envsearchbb is the same as the first model (a 3-layer fully connected network) used for training the HalfCheetah policy, except that its input and output dimensions both equal the dimension of the perturbed dynamics parameters. For HalfCheetah, Hopper, and TORCS, these input and output dimensions are 32, 20, and 10, respectively.