Characterizing Attacks on Deep Reinforcement Learning

07/21/2019 ∙ by Chaowei Xiao, et al. ∙ University of Michigan University of Illinois at Urbana-Champaign, Inc. Tsinghua University berkeley college 4

Deep reinforcement learning (DRL) has achieved great success in various applications. However, recent studies show that machine learning models are vulnerable to adversarial attacks. DRL models have been attacked by adding perturbations to observations. While such observation based attack is only one aspect of potential attacks on DRL, other forms of attacks which are more practical require further analysis, such as manipulating environment dynamics. Therefore, we propose to understand the vulnerabilities of DRL from various perspectives and provide a thorough taxonomy of potential attacks. We conduct the first set of experiments on the unexplored parts within the taxonomy. In addition to current observation based attacks against DRL, we propose the first targeted attacks based on action space and environment dynamics. We also introduce the online sequential attacks based on temporal consistency information among frames. To better estimate gradient in black-box setting, we propose a sampling strategy and theoretically prove its efficiency and estimation error bound. We conduct extensive experiments to compare the effectiveness of different attacks with several baselines in various environments, including game playing, robotics control, and autonomous driving.




1. Introduction

In recent years, deep neural networks (DNNs) have become pervasive and have been rapidly adopted in commercial systems. DNNs have also driven progress in deep reinforcement learning (DRL), where the goal is to train an agent to interact with an environment so as to maximize an expected return. DRL systems have been evaluated on games (Ghory, 2004; Mnih et al., 2013, 2016), autonomous navigation (Dai et al., 2005; Pan et al., 2019a), and robotics control (Levine et al., 2016). To take advantage of this, industries are integrating DRL into production systems (RE WORK, 2017). However, it is well known that DNNs are vulnerable to adversarial perturbations (Goodfellow et al., 2015; Li and Vorobeychik, 2014, 2015; Xiao et al., 2019, 2018c, 2018b; Qiu et al., 2019; Xiao et al., 2018a). DRL systems that use DNNs to perform perception and policy making have similar vulnerabilities. For example, one of the main weaknesses of DRL models in adversarial environments is their heavy dependence on the input observations, since they use DNNs to process those observations. Moreover, since DRL models are trained to solve sequential decision-making problems, an attacker can perturb multiple observations. In fact, the distributions of training and testing data can differ due to random noise and adversarial manipulation (Laskov and Lippmann, 2010). Therefore, the learned policy can be vulnerable in adversarial environments.

Figure 1. Taxonomy of adversarial attacks on deep reinforcement learning (DRL). RL environments are usually modeled as a Markov Decision Process (MDP) that consists of observation space, action space, and environment (transition) dynamics. Potential adversarial attacks could be applied to any of these components.

In this paper, we first present an extensive study of the taxonomy of adversarial attacks on DRL systems. Second, we propose and evaluate 10 adversarial attacks in order to explore points in the taxonomy that have not previously been examined in the literature. We organize adversarial attacks on DRL into a taxonomy based on details of the victim model and other properties of the attacker. First, we categorize these attacks based on what component of the system the attacker is capable of perturbing. This categorization mirrors the components of a Markov decision process (MDP): we recognize attacks that perturb an agent’s observations, actions, or the system’s environment dynamics. We summarize these categories in Figure 1. Second, we categorize these attacks based on what knowledge the attacker needs to perform the attack. Broadly, this divides attacks into the already recognized white-box attacks, where the attacker has full knowledge of the target DRL system, and black-box attacks, where the attacker has less or no knowledge. We discuss this taxonomy further in Section 3.

On the other hand, existing attacks that perturb the observation operate independently on each frame and are too computationally intensive to run in real time. We propose two novel strategies for quickly creating adversarial perturbations to use in real-time attacks. The first strategy, N-attack, trains a neural network to generate a perturbation, reducing the computation to a single forward pass over this network. The second strategy exploits the property that, in RL environments, the states are not independent: later states depend on previous states and actions. Therefore, we propose online sequential attacks, which, in contrast to attacks that operate independently on each frame, generate a perturbation using information from a few frames and then apply the generated perturbation to later frames. We include our experiments with these strategies as part of our exploration of attacks that perturb the observation. We describe our attacks in detail in Section 4, and we present our evaluation in Section 5.

To summarize, our contributions are: (1) We systematically organize adversarial attacks on DRL systems into a taxonomy, and devise and evaluate 10 new attacks on several DRL environments; (2) We propose two practical strategies for carrying out an adversarial attack under limited computation power, N-attack and online sequential attack; (3) We propose two methods for efficiently querying a model in black-box attacks: the adaptive dimension sampling based finite difference (SFD) method and the optimal frame selection method; (4) We provide a theoretical analysis of our proposed gradient estimation method and prove its efficiency and estimation error bound; and (5) We propose the first targeted attack that adversarially perturbs a DRL system’s environment dynamics in order to cause an agent to fail in a specific way, a method that is more practical in real-world applications.

| Attack | MDP Component | Attacker Knowledge (White/Black-Box) | Arch. | Param. | Query | Real-time | Physical | Temporal Dependency |
|---|---|---|---|---|---|---|---|---|
| obs-fgsm-wb | Observation | White-box | Yes | Yes | Yes | Yes | No | Independent |
| obs-cw-wb | Observation | White-box | Yes | Yes | Yes | Too slow | No | Independent |
| obs-nn-wb | Observation | White-box | Yes | Yes | Yes | Yes | No | Independent |
| obs-fgsm-bb | Observation | Black-box | No | No | No | Yes | No | Independent |
| obs-imi-bb | Observation | Black-box | No | No | Yes | Yes | No | Independent |
| obs-fd-bb | Observation | Black-box | No | No | Yes | Too slow | No | Independent |
| obs-sfd-bb | Observation | Black-box | No | No | Yes | Too slow | No | Independent |
| obs-seq-fgsm-wb | Observation | White-box | Yes | Yes | Yes | Yes | No | Sequential |
| obs-seq-fd-bb | Observation | Black-box | No | No | Yes | Yes | No | Sequential |
| obs-seq-sfd-bb | Observation | Black-box | No | No | Yes | Yes | No | Sequential |
| act-nn-wb | Action | White-box | Yes | Yes | Yes | Yes | No | Independent |
| env-search-bb | Dynamics | Black-box | No | No | Yes | N/A | Yes | N/A |

Table 1. Summary of the adversarial attacks on DRL systems, categorized based on our proposed taxonomy. The name reflects the category of the attack method. For example, obs-nn-wb means attack on observation using neural network based white-box attack. The attack methods we proposed are highlighted using bold text. “Arch.,” “Param.,” and “Query” are part of the attacker knowledge and indicate whether the attack requires knowledge of the policy network’s architecture, knowledge of its parameters, and the ability to query the policy network.

2. Related work

This paper builds on the concepts introduced in previous adversarial examples work, and our taxonomy of attacks and new attacks expand around previous attacks on DRL systems.

Adversarial attacks on machine learning models. Our attacks draw some of their techniques from previously proposed attacks. Goodfellow et al. describe the fast gradient sign method (FGSM) of generating adversarial perturbations in a white-box setting (Goodfellow et al., 2015). Carlini and Wagner describe additional methods based on optimization, which result in smaller perturbations (Carlini and Wagner, 2017). Moosavi-Dezfooli et al. demonstrate a way to generate a “universal” perturbation that is effective on a set of multiple inputs (Moosavi-Dezfooli et al., 2017). Evtimov et al. show that adversarial examples can be robust to natural lighting conditions and viewing angles (Evtimov et al., 2018). Considering a more convenient case for real-world adversaries, black-box attacks without providing training algorithms are also proposed for general machine learning models (Papernot et al., 2017; Chen et al., 2017). In our black-box attacks, we apply these techniques that have been proposed for adapting white-box methods to black-box scenarios.

Adversarial attacks on DRL models. Huang et al. demonstrate an attack that uses FGSM to perturb observation frames in a DRL setting (Huang et al., 2017). However, the white-box setting of this method requires knowing the exact trained model and the preferred action, and it is not clear what the malicious goal of the adversary is. Their work also proposes a black-box attack method based on transferability. In this work, we propose novel black-box attacks, including attacks that do not rely on transferability. In addition, we propose several ways to reduce the computational complexity of attacks. Lin et al. design an algorithm to achieve targeted attacks on DRL models (Lin et al., 2017). However, their work only considers targeted attacks and requires training a generative model to predict future states, which is itself a computationally intensive task. Behzadan and Munir propose a black-box attack method that trains another DQN network to minimize the expected return while still using FGSM as the attack method (Behzadan and Munir, 2017). In terms of adversarial attacks on environment dynamics, Pan et al. propose a candidate inference attack that infers the possible dynamics used for training a candidate policy, posing a potential privacy-related risk to deep RL models (Pan et al., 2019c).

Robust RL via adversarial training. Safety and generalization in robotics and autonomous driving applications have drawn considerable attention to training robust models (Packer et al., 2018; Pinto et al., 2017; Pan et al., 2019b). Knowing how RL models can be attacked is beneficial for training robust RL agents. Pinto et al. propose to train an RL agent to provide adversarial attacks during training so that the agent can be robust against dynamics variations (Pinto et al., 2017). However, since they manually selected the perturbations on environment dynamics, the attack provided in their work may not generalize to broader RL systems. Additionally, their method relies on accurate modeling of the environment dynamics, which may not be available for real-world tasks such as robotics and autonomous driving systems.

3. Taxonomy of attacks in DRL

Existing work on attacking DRL systems with adversarial perturbations focuses on perturbing an agent’s observations. This is the most appealing place to start, with seminal results already suggesting that recognition systems are vulnerable to adversarial examples (Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2017). It naturally follows that we should ask whether perturbations introduced in other places in an RL system can cause the agent to misbehave, and in what scenarios, taking into account (i) whether the attacker can perform the attack with limited knowledge about the agent, (ii) whether the attacker can perform the attack in real time, and (iii) whether the attacker can introduce the perturbation physically. To systematically explore this question, we propose a taxonomy of possible adversarial attacks.

Attack components. In the first layer of our proposed taxonomy, we divide attacks based on which component of an MDP the attacker chooses to perturb: the agent’s observations, actions, or environment dynamics. We discuss some scenarios where attacks on these components can be practical. For attacks on the observation space, the pixel values of images can be changed by installing a virus in the software that processes photos captured by the sensor, or in the simulator that renders the environment. When images are transmitted between robots and computers, some communications can be altered wirelessly by an attacker (Lonzetta et al., 2018). Some physical observation-based attacks have been analyzed in autonomous driving (Evtimov et al., 2018). For attacks on the action space, the action outputs can be modified by installing a hardware virus in the actuator that executes the action. This can be realistic in some robotic control tasks where the control center sends control signals to the actuator: a vulnerability in the implementation, for example in the Bluetooth signal transmission, may allow an attacker to modify those signals (Lonzetta et al., 2018). For attacks on the environment dynamics, in the autonomous driving case we can change the surface material of the road such that a policy trained in one environment fails in the perturbed environment; in the robotic control case, the robot’s mass distribution can be changed such that the robot loses balance when executing its original policy, because it was not trained under those dynamics.

Attacker’s knowledge. In the second layer of our proposed taxonomy, we categorize attacks based on what information the attacker needs to perform the attack. This divides attacks into white-box attacks and black-box attacks. We make a further categorization based on the attacker’s knowledge of the policy network’s architecture and weight parameters, and on whether the attacker can query the network. In white-box attacks, the attacker has access to the architecture and weight parameters of the policy network and can of course query the network. In black-box attacks, the attacker doesn’t have access to the weight parameters of the policy network, may or may not have access to the policy network’s architecture, and may or may not be able to query the policy network.

Further categorization. We consider these additional properties of attacks. Real-time: while some attacks require more computation than can be performed in real-time, some are fast enough to run. Still other attacks perform some precomputation and then are able to generate perturbations quickly for each step. We identify this pragmatic property as part of our taxonomy. Physical: for RL tasks that take place in the real world, this property concerns the feasibility of physically applying the perturbation on the environment. Temporal dependency: we distinguish between attacks that generate a perturbation in each frame independently from other frames and online sequential attacks that use information from previous frames to generate perturbations on later frames.

4. Adversarial Attacks on Reinforcement Learning Policies

In order to study the unexplored parts of our proposed taxonomy from Section 3, in this section we develop several concrete attacks. Table 1 summarizes these attacks.

4.1. Attacks on State Observations

We now describe attacks that perturb an agent’s state observations. In this category of attacks, the attacker changes the input state observation $s$ to $\tilde{s} = s + h(s; \theta_w)$, where the attacker generates the perturbation $h(s; \theta_w)$ from the original observation $s$ and some learned parameters $\theta_w$. In order to ensure that perturbations are small, we require that $\|h(s; \theta_w)\|_\infty \le \epsilon$, which we can enforce by choosing $h$ to be of the form $\epsilon \tanh(\cdot)$, where $\epsilon$ is a small positive value. We present both white-box attacks and black-box attacks.

White-box attacks. In this setting, we assume that the attacker can access the agent’s policy network $\pi(a \mid s)$, where $a$ refers to the action and $s$ refers to the state. Huang et al. previously introduced one attack in this category that applies the FGSM method to generate white-box perturbations purely on observations. We reproduce this experiment with our obs-fgsm-wb attack. This attack applies when we know the policy network’s architecture and parameters. We also include a variant of Huang et al.’s attack that replaces FGSM with an optimization-based method (Carlini and Wagner, 2017) in obs-cw-wb. In addition, we propose an attack strategy, N-attack, in which the perturbation $h(s; \theta_w)$ is computed by a deep neural network in a white-box setting. We call this attack obs-nn-wb. This attack also applies when we know the policy network’s architecture and parameters. We train the parameters $\theta_w$ of the attacker network based on the given policy $\pi$ to minimize the victim policy’s expected return when the perturbations are applied:

$$\theta_w = \arg\max_{\theta} \; \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t + h(s_t; \theta))} \Big[ \sum_{t} \gamma^{t} \tilde{r}_t \Big].$$

Here $\tilde{r}$ refers to the opposite of the environment reward and $\gamma$ is the discount factor. With a fixed victim policy $\pi$, this attack is similar to training a policy. For example, in DQN, our goal is to perform gradient updates on $\theta_w$ based on the following loss function:

$$L(\theta_w) = \Big( Q\big(s + h(s; \theta_w), a\big) - \big( \tilde{r} + \gamma \max_{a'} Q\big(s' + h(s'; \theta_w), a'\big) \big) \Big)^2,$$

where $Q$ is the model under attack and $s'$ is the next state relative to the current state $s$. In continuous control using DDPG, our goal is to perform gradient updates on $\theta_w$ based on the following loss function:

$$L(\theta_w) = \Big( Q\big(s + h(s; \theta_w), a\big) - \big( \tilde{r} + \gamma \, Q\big(s' + h(s'; \theta_w), \mu(s' + h(s'; \theta_w))\big) \big) \Big)^2,$$

where $Q$ is the value function and $\mu$ is the actor function.
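To make the white-box setting concrete, here is a minimal sketch of an FGSM-style observation perturbation in the spirit of obs-fgsm-wb. It is not the authors' implementation: the linear Q-function `q_values`, the cross-entropy loss on the preferred action, and all names are assumptions chosen so the gradient can be written in closed form without a deep learning library.

```python
import math

def q_values(weights, obs):
    """Toy linear Q-function, one weight row per action (hypothetical stand-in for a policy network)."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fgsm_perturbation(weights, obs, epsilon):
    """Return an L-infinity bounded perturbation that lowers the preferred action's probability."""
    q = q_values(weights, obs)
    probs = softmax(q)
    best = max(range(len(q)), key=lambda a: q[a])
    # Gradient of the loss -log p(best) with respect to the observation.
    # For a linear model this is sum_a (p_a - 1[a == best]) * W[a].
    grad = [
        sum((probs[a] - (1.0 if a == best else 0.0)) * weights[a][j] for a in range(len(q)))
        for j in range(len(obs))
    ]
    # FGSM step: move epsilon in the sign of the gradient to increase the loss.
    return [epsilon * ((g > 0) - (g < 0)) for g in grad]
```

With a real policy network the gradient would come from backpropagation rather than the closed form above; the sign step and the $\epsilon$ bound are the parts that carry over.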

Black-box attacks. In general, the trained RL models are kept private to avoid easy attacks. Given such “black-box” models, the adversary needs to take more sophisticated strategies to perform the attacks. In the black-box attack, there are different scenarios based on the knowledge of attacker. First, the attacker is not allowed to obtain any information about the model architecture, parameters, or even query information. In this case, the attacker can perform a “transferability” based attack by attacking a surrogate model and then transfer the perturbation to the victim model. Huang et al. introduced a black-box variant of the FGSM attack using transferability, which we denote as obs-fgsm-bb. This attack requires access to the original training environment. In this section, we introduce several other novel black-box attack methods and propose to improve the efficiency of these attacks.

Imitation learning based black-box attack. This attack, obs-imi-bb, is inspired by Rusu et al.’s work on policy distillation (Rusu et al., 2015). The attacker trains a surrogate policy $\hat{\pi}(a \mid s; \theta)$ to imitate the victim policy $\pi$. Then the attacker uses a white-box method on the surrogate policy to generate a perturbation and applies that perturbation to the victim policy.

Formally, in the deep Q-learning case, given a black-box policy $\pi$ with access to its policy outputs, we collect a dataset $D = \{(s_i, \mathbf{q}_i)\}_{i=1}^{N}$, where each data sample consists of a short observation sequence $s_i$ and a vector $\mathbf{q}_i$ of unnormalized Q-values, with one value per action. We perform imitation learning to learn a new policy $\hat{\pi}$ by minimizing the following loss function with gradient updates on the network parameters $\theta$:

$$L(D, \theta) = \sum_{i=1}^{N} \mathrm{softmax}\Big(\frac{\mathbf{q}_i^{T}}{\tau}\Big) \ln \frac{\mathrm{softmax}\big(\mathbf{q}_i^{T}/\tau\big)}{\mathrm{softmax}\big(\mathbf{q}_i^{S}\big)},$$

where the superscript $T$ corresponds to the victim policy (so $\mathbf{q}_i^{T}$ is the recorded vector $\mathbf{q}_i$), $S$ corresponds to our surrogate policy, and $\tau$ is a temperature factor. This attack works in the setting where we don’t have access to the policy network’s architecture or parameters, but can query the network.
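The distillation objective described above can be sketched as a temperature-scaled KL divergence between the victim's and the surrogate's action distributions, following the policy distillation formulation of Rusu et al. The function names and the default temperature below are assumptions for illustration, not values taken from the paper.

```python
import math

def softmax(values, temperature=1.0):
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(victim_q, surrogate_q, tau=1.0):
    """Sum over samples of KL(softmax(q_victim / tau) || softmax(q_surrogate)).

    victim_q and surrogate_q are lists of Q-value vectors, one vector per
    observation sample, with one entry per action.
    """
    loss = 0.0
    for q_t, q_s in zip(victim_q, surrogate_q):
        p_t = softmax(q_t, tau)
        p_s = softmax(q_s)
        # Terms with zero victim probability contribute nothing to the KL sum.
        loss += sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0.0)
    return loss
```

A small `tau` sharpens the victim distribution, so the surrogate is pushed mainly toward reproducing the victim's preferred action; the loss is zero when both distributions coincide.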

Finite difference (FD) based black-box attack. The previous black-box attacks obs-fgsm-bb and obs-imi-bb both require training a surrogate policy. Previous work by Bhagoji et al. (Bhagoji et al., 2017) applies the finite difference (FD) method to attack classification models. We extend the FD method to DRL systems in obs-fd-bb, which doesn’t require training a new policy. This attack works in the setting where we don’t have the policy network’s architecture or parameters, but can query the network. The FD-based attack on DRL uses FD to estimate the gradient with respect to the input observations, and then performs gradient descent to generate perturbations on the input observations. The key step in FD is to estimate the gradient. Denote the loss function as $f$ and the state input as $X \in \mathbb{R}^d$. The canonical basis vector $e_i$ is defined as a $d$-dimensional vector with 1 in the $i$-th component and 0 elsewhere. The finite difference method estimates gradients via the following equation:

$$\mathrm{FD}(f(X), \delta) = \Big[ \frac{f(X + \delta e_1) - f(X - \delta e_1)}{2\delta}, \; \ldots, \; \frac{f(X + \delta e_d) - f(X - \delta e_d)}{2\delta} \Big]^{\top},$$

where $\delta$ is a parameter that controls estimation accuracy. For a $d$-dimensional input, the finite difference method requires $2d$ queries to obtain the estimate, which is computationally intensive for high-dimensional inputs such as images. We propose a sampling technique to mitigate this computational cost.
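The query cost of the finite difference estimate is easiest to see in code. The sketch below is a generic two-sided finite difference over a black-box `loss` callable; the function name and the default `delta` are illustrative assumptions, and `loss` stands in for whatever quantity the attacker can obtain by querying the victim.

```python
def finite_difference_gradient(loss, x, delta=1e-4):
    """Estimate the gradient of a black-box loss at x with two-sided differences.

    Each dimension costs two queries, so the full estimate costs 2 * len(x) queries.
    """
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += delta
        minus[i] -= delta
        grad.append((loss(plus) - loss(minus)) / (2.0 * delta))
    return grad
```

For an image observation with tens of thousands of pixels, this loop issues twice that many queries per gradient, which is the cost the sampling technique aims to reduce.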

Adaptive sampling based FD (SFD). Many deep learning models extract features from inputs patch-wise and have sparse activation maps (Bau et al., 2017). In Figure 2, we compare the gradient pattern of an image in CIFAR-10 to a randomly distributed gradient pattern. In the left image, large gradients are concentrated in a certain region rather than distributed over the entire image as in the right image.

Figure 2. Image gradient distribution. The left image contains the gradient absolute value on an image from CIFAR-10 with a pretrained ResNet-50. The right image represents random gradient distribution used for comparison.

We propose a method for estimating gradients that exploits this spatial structure. In this method, we first estimate the gradient with respect to some randomly sampled pixels, then iteratively, we identify pixels where the gradient has a high magnitude and estimate the gradient with respect to surrounding pixels.

Given a function $f(w; X)$, where $w$ is the model parameter (we omit this for conciseness below), our goal is to estimate the gradient of $f$ with respect to an input $X \in \mathbb{R}^d$: $\nabla_X f(X)$. Writing $\nabla_j f(X)$ for the $j$-th component of this gradient, we define the nontrivial dimensions of the gradient of $f$ at $X$ as $\{ j \in \{1, \ldots, d\} : |\nabla_j f(X)| \ge \theta \}$, i.e., the dimensions with gradient absolute value greater than or equal to a threshold value $\theta$. To estimate the nontrivial dimensions of the gradient, we first randomly sample $k$ dimensions in $\{1, \ldots, d\}$ to get a set of dimensions $S$, and use FD to estimate the gradients for the dimensions in $S$. Then we select the set of dimensions $S' = \{ j \in S : |\nabla_j f(X)| \ge \theta \}$, and we use FD to estimate the gradients of the neighbors (a set $S''$) of the dimensions in $S'$, if these gradients haven’t been estimated. Then again we select the dimensions with absolute gradients no less than $\theta$ from $S''$ and find their neighbors to estimate gradients. We repeat this process for $n$ iterations. By exploring the sparse gradients this way, we can adaptively sample dimensions to estimate gradients and can significantly reduce the number of queries. We give the full attack algorithm of obs-sfd-bb (which works in the same scenario as obs-fd-bb) in Algorithm 1.

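Algorithm 1 Adaptive sampling based finite difference (ASFD)
  Input:
    $X \in \mathbb{R}^d$: state vector
    $f(w)$: loss function with parameters $w$
    $k$: number of items for which to estimate the gradient at each step
    $n$: number of iterations
    $\theta$: the gradient threshold
    $\delta$: finite difference perturbation value
  Output: estimated gradient $\hat{\nabla}_X f(w; X)$
  Initialization: $\hat{\nabla}_X f(w; X) \leftarrow 0$; randomly select $k$ dimensions in $\{1, \ldots, d\}$ to form an index set $P$.
  For $t = 1$ to $n$
    For $j \in P$
      If $\hat{\nabla}_j f(w; X)$ hasn’t been estimated
        Get $v \in \mathbb{R}^d$ such that $v_j = 1$ and $v_i = 0$ for all $i \neq j$
        obtain $\hat{\nabla}_j f(w; X) = \frac{f(w; X + \delta v) - f(w; X - \delta v)}{2\delta}$
      end
    end
    $P' = \{ j \in P : |\hat{\nabla}_j f(w; X)| \ge \theta \}$
    $P$ = indexes of neighbors of indexes in $P'$
  end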
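The loop in Algorithm 1 can be sketched in plain Python as follows. This is an illustrative reading rather than the authors' code: it assumes the input is a flattened `height` x `width` grid, takes "neighbors" to mean the 4-connected pixels, and stops early when no unexplored neighbor remains. All names and defaults are hypothetical.

```python
import random

def sfd_gradient(loss, x, height, width, k, n_iters, theta, delta=1e-3, seed=0):
    """Adaptive-sampling finite differences on a flattened height x width input.

    Returns (grad, estimated): grad[j] stays 0.0 for dimensions never queried,
    and estimated is the set of dimensions whose gradient was actually queried.
    """
    rng = random.Random(seed)
    dim = height * width
    grad = [0.0] * dim
    estimated = set()
    frontier = rng.sample(range(dim), k)  # initial random index set
    for _ in range(n_iters):
        for j in frontier:
            if j in estimated:
                continue
            plus, minus = list(x), list(x)
            plus[j] += delta
            minus[j] -= delta
            grad[j] = (loss(plus) - loss(minus)) / (2.0 * delta)
            estimated.add(j)
        # Keep only dimensions whose estimated gradient is non-trivial.
        strong = [j for j in frontier if abs(grad[j]) >= theta]
        # Next frontier: unexplored 4-connected neighbours of those dimensions.
        neighbours = set()
        for j in strong:
            r, c = divmod(j, width)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < height and 0 <= cc < width:
                    neighbours.add(rr * width + cc)
        frontier = sorted(neighbours - estimated)
        if not frontier:
            break
    return grad, estimated
```

Each estimated dimension costs two queries, so the total query count is `2 * len(estimated)` instead of `2 * height * width`. The saving depends on how concentrated the large gradients are, which is the situation the analysis that follows addresses.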

Here we provide an analysis of our SFD sampling algorithm and estimate the amount of nontrivial dimension of the gradient that can be estimated using our method in Lemma 4. The basic idea of this lemma is to prove that by using our sampling method, we can sample more of the nontrivial dimension of the gradient than by using random sampling. We also provide an error bound for the estimated gradient with SFD in Theorem 5.

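Definition 1 (Neighbor Dimension’s Gradient).

For any dimension $i$ with gradient $\nabla_i f(X)$, we define the neighbor dimension’s gradient as $\nabla_{i+1} f(X)$. Note that choosing $\nabla_{i+1} f(X)$ is equivalent to choosing $\nabla_{i-1} f(X)$, and to be general we use the first one in the definition.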

Definition 2 (Non-trivial Gradient Dimension).

Given a positive gradient threshold $\theta$, an input data instance $X$, and a loss function $f$, for any dimension $i$, if $|\nabla_i f(X)| \ge \theta$, then we define this gradient as a non-trivial gradient and the corresponding dimension as a non-trivial gradient dimension. On the other hand, if $|\nabla_i f(X)| < \theta$, then we define this gradient as a trivial gradient and the corresponding dimension as a trivial gradient dimension.

Definition 3 (Gradient Sample Probability).

Given a selected threshold in Algorithm 1, for any

, define the gradient sample probability as

, where represents the set of dimensions selected by algorithm . Therefore, the gradient sample probability of SFD and random sampling are and respectively. Some further definitions on neighbor gradient distribution probability are as following:

  • If , then define

  • If , then define


Based on the following assumption that these distribution and are defined over all possible dimensions in one image (over ) and these distribution works throughout the entire gradient estimation iteration process, we have the following lemma.

Lemma 4.

We make the following assumptions on : , s.t. . For dimension whose gradient , the probability that the gradient magnitude of its neighborhood pixel is in is . We conclude, as long as , we have .

Intuitively, this lemma says that when a gradient absolute value smaller than $\theta$ is less likely among the neighbors of pixels whose gradient absolute value is greater than or equal to $\theta$ than among pixels sampled at random from the image, our method is more sample-efficient than random sampling.


Following the notation in Lemma 4, define


Note the randomness is with respect to the dimensions. These probabilities are, when we randomly choose dimensions in a gradient vector of dimensions, the chance of sampling some dimension with gradient absolute value no less than or , respectively. For , its neighbor gradient’s absolute value follows the following distribution:

Define as the number of nontrivial gradients estimated in iteration 0, and define


as the number of nontrivial gradients estimated in iteration , then the ratio

characterizes in every iteration, the ratio of nontrivial gradient estimation over the total number of gradient estimation for , while characterizes nontrivial gradient estimation ratio if using random sampling method. For , . Since is the ratio of the number non-trivial gradients over the total number of gradients estimated in the -th iteration, if we randomly select the dimensions, will be the same as . Using our SFD method, if for every , we have , then overall speaking, the ratio of the number of non-trivial gradients estimated over the total number of gradients estimated, , will be greater than . The definition of is


where is the number of iterations.

We now prove that if , then for ,


More specifically, for , since we perform uniform random sampling, and we sample dimensions, we have


Then if we can prove that for , , and for some , we have , it in turn proves . The reason is that if for every step in our SFD algorithm the sample efficiency is at least the same as random sampling, and for some iterations our SFD is more efficient, then overall speaking our SFD based gradient estimation is more efficient than random sampling based gradient estimation.

Now from the above definition, we have


Therefore, when , , our sampling algorithm is more efficient than random sampling.

This lemma suggests that when the gradient distribution is more concentrated ( is small, then ), then our sampling algorithm is more efficient than random sampling. Next we give another theorem about the upper bound for the gradient estimation error and the proof for this theorem.

Theorem 5.

Suppose we sample all nontrivial dimensions of the gradient and estimate the gradient with perturbation strength , the estimation error of the gradients is upper bounded by the following inequality,


for constant , , and is the estimated gradient of with respect to .


Now we prove the Theorem 5: when , assume function is , by Taylor’s series we have


Combine the two equations we get


which means the truncation error is bounded by . Moreover, we have


where , and .

We can regard each dimension as a single variable function , then we have


Then , assume we are able to sample all nontrivial gradients with absolute gradient value no less than , then we have


Therefore, the truncation error of gradients estimation is upper bounded by the following inequality.


for some , and is the estimated gradient of with respect to .
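The truncation-error step in the proof above rests on the standard error expansion of a central difference. As a sketch in assumed notation (not necessarily the authors' symbols), let $g$ be the loss viewed as a function of a single coordinate, assume $g$ is three times continuously differentiable, and let $\xi_+ \in (x, x+\delta)$ and $\xi_- \in (x-\delta, x)$ be the intermediate points from Taylor's theorem:

```latex
\begin{aligned}
g(x+\delta) &= g(x) + \delta\, g'(x) + \tfrac{\delta^{2}}{2}\, g''(x) + \tfrac{\delta^{3}}{6}\, g'''(\xi_{+}), \\
g(x-\delta) &= g(x) - \delta\, g'(x) + \tfrac{\delta^{2}}{2}\, g''(x) - \tfrac{\delta^{3}}{6}\, g'''(\xi_{-}), \\
\frac{g(x+\delta) - g(x-\delta)}{2\delta} - g'(x) &= \tfrac{\delta^{2}}{12}\,\bigl( g'''(\xi_{+}) + g'''(\xi_{-}) \bigr), \\
\left| \frac{g(x+\delta) - g(x-\delta)}{2\delta} - g'(x) \right| &\le \tfrac{\delta^{2}}{6}\, \max_{\xi \in [x-\delta,\, x+\delta]} \bigl| g'''(\xi) \bigr| .
\end{aligned}
```

The even-order terms cancel when the two expansions are subtracted, which is why the per-coordinate error of the two-sided estimate scales with $\delta^2$ rather than $\delta$.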

Online sequential attacks. In a DRL setting, consecutive observations are not i.i.d.; instead, they are highly correlated, with each state depending on previous ones. It is therefore possible to perform an attack with less computation than attacking each state independently. In real-world cases, for example when an autonomous robot takes real-time video as input to help make decisions, an attacker is motivated to generate perturbations based only on previous states and apply them to future states, which we refer to as an online sequential attack. We hypothesize that a perturbation generated this way is effective on subsequent states.

Universal attack based approach. We propose online sequential attacks obs-seq-fgsm-wb, obs-seq-fd-bb, and obs-seq-sfd-bb that exploit this structure of the observations. obs-seq-fgsm-wb works in the standard white-box setting, where we know the architecture and parameters of the policy network; obs-seq-fd-bb and obs-seq-sfd-bb work in the same setting as obs-fd-bb. In these attacks, we first collect a number of observation frames and generate a single perturbation using the averaged gradient on these frames (or estimated gradients using FD or SFD, in the case of obs-seq-fd-bb and obs-seq-sfd-bb). Then, we apply that perturbation to all subsequent frames. In obs-seq-sfd-bb, we combine the universal attack approach and the adaptive sampling technique for finite difference estimates. We improve upon the above attack by finding the set of frames that appear to be most important and using the gradients from those frames. With this, we hope to maintain attack effectiveness while reducing the number of queries needed. We propose to select a subset of frames within the first $k$ frames based on the variance of their $Q$ values. Then, in all subsequent frames, the attack applies a perturbation generated from the averaged gradient over the selected frames. That is, we select a set of important frames with high $Q$-value variance to generate the perturbations. We give a proof in Corollary 6 below for why attacking these important frames is more effective in terms of reducing the overall expected return.
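The frame-selection idea above can be sketched as follows: rank the collected frames by the variance of their Q-values, average the gradients of the top-ranked frames, and turn the average into one sign perturbation that is reused on later frames. The callables `q_fn` and `grad_fn`, the use of the population variance, and the sign step are assumptions made for this illustration; in a black-box setting `grad_fn` would be an FD or SFD estimate.

```python
def q_variance(q_values):
    """Population variance of the Q-values of one frame (an assumed convention)."""
    mean = sum(q_values) / len(q_values)
    return sum((q - mean) ** 2 for q in q_values) / len(q_values)

def universal_perturbation(frames, q_fn, grad_fn, epsilon, num_selected):
    """Build one perturbation from the frames with the highest Q-value variance.

    frames: list of flattened observations collected at the start of an episode.
    q_fn(frame): Q-value vector of the frame; grad_fn(frame): loss gradient.
    Returns (perturbation, chosen_indices).
    """
    ranked = sorted(range(len(frames)), key=lambda i: q_variance(q_fn(frames[i])), reverse=True)
    chosen = ranked[:num_selected]
    dim = len(frames[0])
    avg = [0.0] * dim
    for i in chosen:
        g = grad_fn(frames[i])
        for j in range(dim):
            avg[j] += g[j] / len(chosen)
    # One sign step on the averaged gradient, reused for all later frames.
    return [epsilon * ((a > 0) - (a < 0)) for a in avg], chosen
```

Only the `num_selected` chosen frames require gradient computation or estimation, which is where the query saving comes from.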

Corollary 6.

Let the state and state-action value be and respectively for a policy with time horizon . We conclude that , if , then where means the observation at time is changed from to .


Recall the definition of Q value is


The variance of Q value at a state is defined as


where is the action space of the MDP, and denotes the number of actions. Suppose we are to attack state and state where the Q value variance of this two states are and , and assume .

Denote the state-action pair Q values after attack are and , respectively. During the attack, state is modified to , and state is modified to , and their action’s Q-value also change, so we use and to denote the actions after the attack. By using and instead of and , we mean that though the observed states are modified by the attack algorithm, but the true states do not change. By using a different action notation, we mean that since the observed states have been modified, the optimal actions at the modified states can be different from the optimal actions at the original observed states. Then the total discounted expected return for the entire episode can be expressed as (assume all actions are optimal actions)


Since , can also be expressed as


Subtract by we get


According to our claim that states where the variance of Q value function is small will get better attack effect, suppose , and assume the range of Q value at step is larger than step , then we have


Therefore which means attack state the agent will get less return in expectation. If , assume the range of Q value at step is smaller than step , then we have


If is very small or is large enough such that , then we have which means attacking state the agent will get more reward in expectation than attacking state . ∎

4.2. Attacks on Action Selection

Our second category of attacks directly attacks the action output to minimize the expected return. We experiment with one attack in this category, act-nn-wb, under a white-box scenario. Here we train another policy network that takes in the state and outputs a perturbation on the function: , the goal is also to minimize the expected return. For example, in DQN, the loss is chosen to be . For DDPG, the loss is chosen to be , where is the reward that captures the attacker’s goal of minimizing the victim agent’s expected return. This approach to learning the attack treats the environment and the original policy together as a new environment, and views attacks as actions.
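The idea of treating the environment and the victim policy together as a new environment can be illustrated with a small wrapper. This is only a sketch under stated assumptions: `ToyEnv`, `victim_policy`, and `AttackEnv` are hypothetical names, the perturbation is added directly to the action, and the attacker's reward is taken to be the negative of the victim's reward, which the text does not specify.

```python
class ToyEnv:
    """Hypothetical 1-D environment: the agent is rewarded for staying near 0."""

    def reset(self):
        self.x = 0.0
        return self.x

    def step(self, action):
        self.x += action
        return self.x, -abs(self.x)  # (next state, victim reward)


def victim_policy(state):
    """Hypothetical fixed victim policy: push the state back toward 0."""
    return -0.5 * state


class AttackEnv:
    """Environment seen by the attacker: its action is a bounded perturbation
    added to the victim's action."""

    def __init__(self, env, policy, bound):
        self.env, self.policy, self.bound = env, policy, bound

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, perturbation):
        # Clip the attacker's action to the allowed perturbation bound.
        perturbation = max(-self.bound, min(self.bound, perturbation))
        action = self.policy(self.state) + perturbation
        self.state, victim_reward = self.env.step(action)
        return self.state, -victim_reward  # assumption: attacker reward = -victim reward


def victim_return(attacker, bound, steps=20):
    """Total victim reward over one rollout under the given attacker."""
    env = AttackEnv(ToyEnv(), victim_policy, bound)
    state = env.reset()
    total = 0.0
    for _ in range(steps):
        state, attacker_reward = env.step(attacker(state))
        total -= attacker_reward
    return total


no_attack = victim_return(lambda s: 0.0, bound=0.1)
constant_push = victim_return(lambda s: 0.1, bound=0.1)
print(no_attack, constant_push)
```

Any standard RL algorithm could in principle be trained on `AttackEnv` in place of the hand-written attackers used here.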

Figure 3. Episodic reward under different attack methods and cumulative reward of different black-box attacks on TORCS. Panels: (a) episodic reward; (b) cumulative reward (perturbation bound 0.05); (c) cumulative reward (perturbation bound 0.10).

4.3. Attacks on Environment Dynamics

In this third category, attacks perturb the environment transition model. In our case, we aim to achieve a targeted attack, which means we want to change the dynamics such that the agent will fail in a specific way. Define the environment dynamics as , the agent’s policy as , the agent’s state at step following the current policy under the current dynamics as , and define a mapping from to :, which outputs the state at time step : given initial state , policy , and environment dynamics . The task of attacking environment dynamics is to find another dynamics such that the agent will reach a target state at step : .

Random dynamics search. A naive way to find the target dynamics, which we demonstrate in env-rand-bb, is to use random search. Specifically, we randomly propose a new dynamics and check whether, under this dynamics, the agent reaches . This method works in a black-box setting: it does not require access to the policy network’s architecture or parameters, only the ability to query the network.
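A minimal sketch of such a random search is shown below, assuming a toy simulator in place of a real physics engine. The `simulate` function, its two parameters, the multiplicative 10% perturbation, and the absolute distance to the target state are hypothetical choices made for illustration only.

```python
import random


def simulate(mass, friction, steps=50):
    """Hypothetical simulator: final position of a cart that a fixed
    (victim) policy pushes with constant force."""
    position, velocity = 0.0, 0.0
    for _ in range(steps):
        force = 1.0  # fixed victim policy
        velocity += (force - friction * velocity) / mass * 0.1
        position += velocity * 0.1
    return position


def random_dynamics_search(nominal, target_state, bound=0.1, trials=200, seed=0):
    """Propose random dynamics within +/- bound of the nominal parameters and
    keep the proposal whose final state is closest to the target."""
    rng = random.Random(seed)
    best_params = dict(nominal)
    best_dist = abs(simulate(**nominal) - target_state)
    for _ in range(trials):
        candidate = {k: v * (1.0 + rng.uniform(-bound, bound)) for k, v in nominal.items()}
        dist = abs(simulate(**candidate) - target_state)
        if dist < best_dist:
            best_params, best_dist = candidate, dist
    return best_params, best_dist


nominal = {"mass": 1.0, "friction": 1.0}
# Target state produced by some "abnormal" dynamics unknown to the attacker.
target_state = simulate(mass=0.95, friction=1.05)
best_params, best_dist = random_dynamics_search(nominal, target_state)
print(best_params, best_dist)
```

Each trial only rolls out the policy under the proposed dynamics, so the search needs query access to the policy and nothing else.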

Adversarial dynamics search. We design a more systematic algorithm based on RL to search for a dynamics to attack and call this method env-search-bb. At each time step, an attacker proposes a change to the current environment dynamics with some perturbation , where is bounded by some constant . We then find the new state at time step following the current policy under dynamics , and the attacker agent receives reward . We use DDPG (Lillicrap et al., 2016) to train the attacker. To show that this method works better than random search, we compare it with the random dynamics search method while keeping the bound on the maximum perturbation the same. This attack works in the same setting as env-rand-bb.

5. Experiments

We attack several agents trained for five different RL environments: Atari games Pong and Enduro (Bellemare et al., 2013), HalfCheetah and Hopper in MuJoCo (Todorov et al., 2012), and the driving simulation TORCS (Pan et al., 2017). We train DQN (Mnih et al., 2015) on Pong, Enduro and TORCS, and we train DDPG (Lillicrap et al., 2016) on HalfCheetah and Hopper. The reward function for TORCS comes from (Pan et al., 2017). The DQN network architecture comes from (Mnih et al., 2015). The network for continuous control using DDPG comes from (Dhariwal et al., 2017). For each game, we train the above agents with different random seeds and different architectures in order to evaluate different conditions in the transferability and imitation learning based black-box attack. Details of network structure and the performance for each game are included in Appendix A.

5.1. Experimental Design

We compare the agents’ performance under all attacks with their performance under no attack, denoted as non-adv.

Attacks on observation. We test these attacks under perturbation bounds of 0.005 and 0.01 on the Atari games and MuJoCo simulations and 0.05 and 0.10 on TORCS. (Values are in the range [0, 1].)

  • First, we test the white-box attacks obs-fgsm-wb and obs-nn-wb on all five environments.

  • Second, we test the attack obs-fgsm-bb under two different conditions: (1) In obs-fgsm-bb(1), the attacker uses the same network structure in the surrogate model as the victim policy and (2) In obs-fgsm-bb(2), the attacker uses a different network structure for the surrogate model.

  • We test the attack obs-imi-bb on all five environments. Similar to the transferability attacks, we test this attack under same-architecture (obs-imi-bb(1)) and different-architecture (obs-imi-bb(2)) conditions. We use FGSM to generate perturbations on the surrogate policy.

  • We test obs-sfd-bb under different numbers of SFD iterations; we denote an attack that uses iterations as obs-s[]fd-bb. obs-sfd-bb needs significantly fewer queries than obs-fd-bb; we show the actual numbers of queries used in Table 2.

  • For the attack obs-seq-fgsm-wb, we test under the condition obs-seq[F]-fgsm-wb (F for “first”), where we use all of the first frames to compute the gradient for generating a perturbation for the subsequent frames.

  • For the attacks obs-seq-fd-bb and obs-seq-sfd-bb, we test under three conditions. (i) In obs-seq[F]-fd-bb, we look at the first frames and use FD to estimate the gradient; (ii) in obs-seq[L]-fd-bb and obs-seq[L]-s[]fd-bb (L for “largest”), we again look at the first frames, but we select only the 20% of the frames that have the largest Q value variance to generate the universal perturbation; (iii) obs-seq[S]-fd-bb (S for “smallest”) is similar to the previous condition, except that we select the 20% of the first frames that have the smallest Q value variance to generate the universal perturbation.

  • We additionally test a random perturbation based online sequential attack obs-seq-rand-bb, where we sample uniform random noise to generate a perturbation and apply it to all frames. Although this attack does not consider the starting frames, we still test it under different conditions obs-seq[F]-rand-bb, where we start adding the random perturbation after the -th frame. This makes it consistent with the other online sequential attacks that apply their perturbation after the th frame.
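The random baseline in the last item can be sketched as follows. The helper names, the flat list representation of a frame, the zero-based `start` index, and the clipping to [0, 1] are assumptions made for this illustration.

```python
import random


def random_universal_perturbation(dim, epsilon, seed=0):
    """Sample one perturbation with entries uniform in [-epsilon, epsilon]."""
    rng = random.Random(seed)
    return [rng.uniform(-epsilon, epsilon) for _ in range(dim)]


def apply_after(frames, perturbation, start):
    """Add the same perturbation to every frame with index >= start and clip
    pixel values to [0, 1]. Earlier frames are returned unchanged."""
    attacked = []
    for t, frame in enumerate(frames):
        if t >= start:
            frame = [min(1.0, max(0.0, p + d)) for p, d in zip(frame, perturbation)]
        attacked.append(list(frame))
    return attacked


frames = [[0.5] * 4 for _ in range(5)]  # 5 toy frames with 4 pixels each
delta = random_universal_perturbation(dim=4, epsilon=0.01)
attacked = apply_after(frames, delta, start=2)
print(attacked)
```

The same `apply_after` step would serve the gradient based sequential attacks; only the way `delta` is computed differs.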

Attacks on action selection. We test the action selection attack act-nn-wb on the Atari games, TORCS, and MuJoCo robotic control tasks.

Attacks on environment dynamics. We test the environment dynamics attacks env-rand-bb and env-search-bb on the MuJoCo environments and TORCS. In the tests on MuJoCo, we perturb the body mass and body inertia vector, which have 32 and 20 dimensions in the HalfCheetah and Hopper environments, respectively. In the tests on TORCS, we perturb the road friction coefficient and bump size, which have 10 dimensions. The perturbation strength is within 10% of the original magnitude of the dynamics being perturbed.

Figure 4. Episodic rewards among different attack methods on Atari games. Dotted lines are black-box attacks while dashed lines are white-box attacks.
Bound 10 iter. 20 iter. 40 iter. 100 iter.
Table 2. Number of queries for SFD on each image among different settings. (14112 would be needed for FD.)
Figure 5. Performance of universal attack based approach considering all starting images (seq[F]-, left two graphs) and subsets of frames with largest (seq[L]-) and smallest (seq[S]-) Q value variance (right two images). Results shown for TORCS, under two perturbation bounds .
Figure 6. Performance of universal attack based approach with different numbers of query iterations with obs-seq-sfd-bb
(a) =0.005 — HalfCheetah
(b) =0.01 — HalfCheetah
(c) =0.005 — Hopper
(d) =0.01 — Hopper
Figure 7. Performance among different attack methods on MuJoCo. We use the format “ bound — Environment” to label the settings of each image.
(a) = 0.005—Pong
(b) = 0.01—Pong
(c) =0.005—Enduro
(d) =0.01—Enduro
Figure 8. (a-b): Cumulative reward after adding optimal state based universal perturbation on Pong game. (c-d): Cumulative reward after adding optimal state based universal perturbation on Enduro game. The results for Enduro are different from the TORCS results since the threshold is different from the TORCS case.
(a) =0.005 — HalfCheetah
(b) =0.01 — HalfCheetah
(c) =0.005 — Hopper
(d) =0.01 — Hopper
Figure 9. Cumulative reward after adding optimal state based universal perturbation on MuJoCo. We use the format “ bound — Environment” to label the settings of each image.
(a) Atari and TORCS (b) MuJoCo
Figure 10. Action based attacks. Results of act-nn-wb on Atari games, TORCS, and MuJoCo tasks.
Environment env-rand-bb env-search-bb
HalfCheetah 7.91 5.76
Hopper 1.89 0.0017
TORCS 25.02 22.75
Figure 11. Results of environment dynamics based attacks.
Figure 12. Results for Dynamics Attack on HalfCheetah. Panels show the agent’s behavior under (a) normal dynamics, (b) abnormal dynamics, (c) attacked dynamics using RL, and (d) attacked dynamics using random search.
Figure 13. Results for Dynamics Attack on Hopper. Panels show the agent’s behavior under (a) normal dynamics, (b) abnormal dynamics, (c) attacked dynamics using RL, and (d) attacked dynamics using random search.
Figure 14. Results for Dynamics Attack on TORCS. Panels show the agent’s behavior under (a) normal dynamics, (b) abnormal dynamics, (c) attacked dynamics using RL, and (d) attacked dynamics using random search.

5.2. Experimental Results

Attacks on observation. Figure 2(a) shows the results of the attacks on observations on TORCS, including all methods for attacking observations and the results of non-adv. In addition, Figure 15 shows the decomposition of the TORCS reward into progress related reward and catastrophe related reward (reward for collisions). We show that our attack achieves a significant number of crashes in the autonomous driving environment compared with obs-fgsm-wb. On TORCS, our neural network based attack obs-nn-wb achieves better attack performance than the FGSM attack obs-fgsm-wb. Under a black-box setting, our proposed imitation learning based attacks obs-imi-bb(1) and obs-imi-bb(2) and the FD based attack obs-fd-bb achieve better attack performance than the transferability based attacks obs-fgsm-bb(1) and obs-fgsm-bb(2).

Figures 2(b) and 2(c) compare the cumulative rewards among different black-box methods on TORCS. These figures show that the policy is vulnerable to all of the black-box methods. Specifically, they show that obs-s[]fd-bb can achieve similar performance to FD under each value of the perturbation bound . In Table 2, we provide the number of queries for obs-sfd-bb and obs-fd-bb; the results show that obs-sfd-bb uses significantly fewer queries (around 1000 to 6000) than obs-fd-bb (around 14,000) while achieving similar attack performance. The SFD method samples only part of the pixels to calculate the gradient, while the vanilla FD method requires gradient computation at all pixels. Therefore, obs-sfd-bb is more efficient in running time than obs-fd-bb, which indicates that our adaptive sampling algorithm reduces gradient computation time while preserving attack performance.
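The query savings can be made concrete with a small sketch that counts black-box queries. It uses two-sided finite differences, which cost two queries per coordinate, and samples coordinates uniformly at random as a simple stand-in for the adaptive sampling strategy, which is not reproduced here. The `QueryCounter` wrapper, the linear toy loss, and all names are assumptions for illustration.

```python
import random


class QueryCounter:
    """Wraps a black-box scalar function and counts how often it is queried."""

    def __init__(self, fn):
        self.fn, self.count = fn, 0

    def __call__(self, x):
        self.count += 1
        return self.fn(x)


def fd_gradient(f, x, coords, delta=1e-4):
    """Two-sided finite differences on the given coordinates only; the
    gradient entries of all other coordinates are left at 0."""
    grad = [0.0] * len(x)
    for i in coords:
        plus, minus = list(x), list(x)
        plus[i] += delta
        minus[i] -= delta
        grad[i] = (f(plus) - f(minus)) / (2 * delta)
    return grad


# Toy black-box loss: a linear function, so its true gradient is `weights`.
weights = [0.5, -2.0, 0.0, 3.0, 0.1, -0.3]


def loss(x):
    return sum(w * v for w, v in zip(weights, x))


x = [0.2] * len(weights)

full = QueryCounter(loss)
g_full = fd_gradient(full, x, range(len(x)))  # vanilla FD: every coordinate

sampled = QueryCounter(loss)
coords = random.Random(0).sample(range(len(x)), 3)  # uniform stand-in for adaptive sampling
g_sampled = fd_gradient(sampled, x, coords)

print(full.count, sampled.count)
```

With two queries per coordinate, vanilla FD on an input with 7056 entries would need 14112 queries, which agrees with the count quoted for FD in Table 2.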

Figure 15. Reward decomposition on the TORCS environment under the obs-nn-wb and obs-fgsm-wb attacks. Progress reward is related to driving forward, while catastrophe reward is related to colliding with obstacles.

The results comparing obs-seq[F]-fgsm-wb, obs-seq[F]-fd-bb, and obs-seq[F]-rand-bb are shown in Figure 5 (left) for perturbations of different norm bounds ( and ). The two figures show the cumulative reward for one episode when the states are under attack. Comparing the results, our proposed obs-seq[F]-fd-bb achieves attack performance close to that of our obs-seq[F]-fgsm-wb, while the baseline obs-seq[F]-rand-bb is not effective. Figure 5 (right) shows that when we select a set of states with the largest Q value variance (obs-seq[L]-fd-bb) to estimate the gradient, the attack is more effective than when we select states with the smallest Q value variance (obs-seq[S]-fd-bb), which indicates that selecting frames with large Q value variance is more effective. We see that when is very small (), the estimated universal perturbation may not be accurate, and when , the attack performance is reasonably good.

In Figure 6, we show the results of obs-seq[L]-s[]fd-bb when varying the number of iterations , selecting the 20% of frames with the largest Q value variance within the first frames to estimate the gradient using SFD. With more iterations, we obtain a more accurate estimate of the gradients and thus achieve better attack performance, while the total number of queries is still significantly reduced. We conclude from Table 2 that when , the number of queries for SFD is around 6k, which is significantly smaller than the roughly 14k queries that FD needs to estimate the gradient on an image of size 84 × 84 (14112 = 84 × 84 × 2).

We provide the results of attacks applied on the observation space in other environments in Figure 4, Figure 7, Figure 8, and Figure 9. These environments include the Atari games Pong and Enduro, and the MuJoCo robotics simulation environments HalfCheetah and Hopper. It can be observed from these results that, for obs-seq[L]-fd-bb, there exists at least one such that when we estimate a universal perturbation from the top 20% of the first frames and apply the perturbation to all subsequent frames starting from the -th frame, we achieve reasonably good attack performance. In some environments, such as Pong, is already enough to induce a strong attack, while in Enduro, achieves better performance than or . The Enduro environment is also an autonomous driving environment, simpler than TORCS, and we observed consistent results in the two environments. Note that different thresholds are applied according to the complexity of the two environments.

Attacks on action selection. We present the results of our attacks on action selection in Figure 10. The results show that action space attack is also effective. With the larger perturbation bound, we achieve better attack performance.

Attacks on environment dynamics. In Table 11, we show our results for the targeted adversarial environment dynamics attack. The results are the distance to the target state (the smaller the better). Our goal is to attack the environment dynamics so that the victim agent fails in a pre-specified way, for example, a Hopper turning over or a self-driving car driving off the road and hitting obstacles. The results show that the random search method performs worse than the RL based search method in terms of reaching a specific state after a certain number of steps. The quality of the attack can also be evaluated qualitatively by observing the sequence of states while the agent is being attacked and checking whether the target state has been reached. In Figures 12–14, we show the sequences of states when the agents are under attack with the random search or the reinforcement learning based search method. The last image in each sequence denotes the state at the same step . The last image in each abnormal dynamics rollout sequence corresponds to the target state, the last image in the attacked dynamics using RL search shows the result of env-search-bb, and the last image in the attacked dynamics using random search shows the result of env-rand-bb. These figures show that env-search-bb is very effective at achieving the targeted attack, whereas with random search the target is considerably harder to reach.

6. Discussion and Conclusions

Though this paper is about adversarial attacks on deep reinforcement learning, one important direction is to develop reinforcement learning methods that are robust and to study how to defend against attacks. We provide some discussion from these perspectives.

General Attacks on DRL. We have attempted to study a broad scope of possible attacks: perturbing different parts of a reinforcement learning system, under threat models with different attacker knowledge, and using new techniques to reduce attack computation cost. However, our experiments are not an exhaustive set of possible attacks. A general attacker may gain access to perturb multiple parts of an RL system and may utilize still newer techniques to compute effective perturbations efficiently.

Improving Robustness of RL. There has been increasing interest in training RL algorithms that are robust against perturbations in the environment, or even adversarial attacks. Previous methods that aim to improve the robustness of RL either apply random perturbations to the observation or apply gradient based noise to the observation to induce the agent to choose sub-optimal actions. On the one hand, our finite difference and sampling based finite difference methods can provide faster attacks than the traditional FGSM based attack, which requires back-propagation to calculate the gradient, and can therefore be incorporated into the training of RL policies to improve their robustness. The environment dynamics attack can help find environments in which the current agent is vulnerable. On the other hand, our methods provide tools to evaluate the vulnerability of a trained RL policy. Finally, we hope that our proposed taxonomy helps guide future research in making DRL systems robust, and we offer our experimental results as baselines for future robust RL techniques to compare against.

Priority of Defense Towards the Proposed Attacks. From the perspective of training robust RL policies, it is important to know the severity of the risk associated with the proposed attacks. Among the proposed attacks, the environment dynamics attack can be a more realistic potential risk than the other two attacks based on observations or the action space. The reason is that this attack does not require access to modify the policy network software system and only requires access to modify the environment dynamics; our experiments show that when environment dynamics parameters are modified, for example by changing road conditions in autonomous driving, the agent tends to fail with the original policy. The observation and action space attacks, especially the black-box attacks, are also important to defend against, since an attacker can query the network and may have access to change the observations or action selection.

Potential Defenses. Previous work, with an increasing interest in training robust RL algorithms, has tried (i) applying random perturbation to the observation or (ii) applying gradient based noise to the observation in order to exercise the agent under training on possible perturbations. As a first order enhancement, our finite difference and sampling based finite difference attacks can fit in the same pipeline and can even run faster than traditional FGSM. However, valuable future defenses should also consider which attacks would be more practical to carry out. An environment dynamics attack, for example, can perturb the dynamics without any electronic modification to the system’s sensors or controllers. Black-box attacks with query access may also be increasingly realistic with the availability of consumer products that use RL, which makes oracles widely available.

We hope our exploratory work and the taxonomy of attacks we describe help form a more complete view for what threats should be considered in ongoing research in robust reinforcement learning.

References


  • D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba (2017) Network dissection: quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 3319–3327. Cited by: §4.1.
  • V. Behzadan and A. Munir (2017) Vulnerability of deep reinforcement learning to policy induction attacks. arXiv preprint arXiv:1701.04143. Cited by: §2.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: §5.
  • A. N. Bhagoji, W. He, B. Li, and D. Song (2017) Exploring the space of black-box attacks on deep neural networks. arXiv preprint arXiv:1712.09491. Cited by: §4.1.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017, Cited by: §2, §4.1.
  • P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) Zoo: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In ACM Workshop on Artificial Intelligence and Security, pp. 15–26. Cited by: §2.
  • X. Dai, C. Li, and A. B. Rad (2005) An approach to tune fuzzy controllers based on reinforcement learning for autonomous vehicle control. IEEE Transactions on Intelligent Transportation Systems 6 (3), pp. 285–293. Cited by: §1.
  • P. Dhariwal, C. Hesse, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu (2017) Openai baselines. Cited by: §5.
  • I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, and D. Song (2018) Robust physical-world attacks on machine learning models. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, Cited by: §2, §3.
  • I. Ghory (2004) Reinforcement learning in board games. Department of Computer Science, University of Bristol, Tech. Rep. Cited by: §1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations, Cited by: §1, §2, §3.
  • S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel (2017) Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284. Cited by: §2, §4.1, §4.1.
  • P. Laskov and R. Lippmann (2010) Machine learning in adversarial environments. Machine learning 81 (2), pp. 115–119. Cited by: §1.
  • S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §1.
  • B. Li and Y. Vorobeychik (2014) Feature cross-substitution in adversarial classification. In Advances in Neural Information Processing Systems, pp. 2087–2095. Cited by: §1.
  • B. Li and Y. Vorobeychik (2015) Scalable optimization of randomized operational decisions in adversarial classification settings.. In AISTATS, Cited by: §1.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In International Conference on Learning Representations, Cited by: §4.3, §5.
  • Y. Lin, Z. Hong, Y. Liao, M. Shih, M. Liu, and M. Sun (2017) Tactics of adversarial attack on deep reinforcement learning agents. In 26th International Joint Conference on Artificial Intelligence, Cited by: §2.
  • A. Lonzetta, P. Cope, J. Campbell, B. Mohd, and T. Hayajneh (2018) Security vulnerabilities in bluetooth technology as used in iot. Journal of Sensor and Actuator Networks 7 (3), pp. 28. Cited by: §3.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §5.
  • S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard (2017) Universal adversarial perturbations. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 1765–1773. Cited by: §2, §3.
  • C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun, and D. Song (2018) Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282. Cited by: §2.
  • X. Pan, X. Chen, Q. Cai, J. Canny, and F. Yu (2019a) Semantic predictive control for explainable and efficient policy learning. IEEE International Conference on Robotics and Automation (ICRA). Cited by: §1.
  • X. Pan, D. Seita, Y. Gao, and J. Canny (2019b) Risk averse robust adversarial reinforcement learning. IEEE International Conference on Robotics and Automation (ICRA). Cited by: §2.
  • X. Pan, W. Wang, X. Zhang, B. Li, J. Yi, and D. Song (2019c) How you act tells a lot: privacy-leakage attack on deep reinforcement learning. arXiv preprint arXiv:1904.11082. Cited by: §2.
  • X. Pan, Y. You, Z. Wang, and C. Lu (2017) Virtual to real reinforcement learning for autonomous driving. In British Machine Vision Conference (BMVC), Cited by: §5.
  • N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519. Cited by: §2.
  • L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta (2017) Robust adversarial reinforcement learning. In ICML, Proceedings of Machine Learning Research, Vol. 70, pp. 2817–2826. Cited by: §2.
  • H. Qiu, C. Xiao, L. Yang, X. Yan, H. Lee, and B. Li (2019) SemanticAdv: generating adversarial examples via attribute-conditional image editing. arXiv preprint arXiv:1906.07927. Cited by: §1.
  • RE WORK (2017) Deep learning in production & warehousing with Amazon Robotics. Cited by: §1.
  • A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell (2015) Policy distillation. arXiv preprint arXiv:1511.06295. Cited by: §4.1.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. Cited by: §5.
  • C. Xiao, R. Deng, B. Li, F. Yu, M. Liu, and D. Song (2018a) Characterizing adversarial examples based on spatial consistency information for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 217–234. Cited by: §1.
  • C. Xiao, B. Li, J. Zhu, W. He, M. Liu, and D. Song (2018b) Generating adversarial examples with adversarial networks. In IJCAI, Cited by: §1.
  • C. Xiao, D. Yang, B. Li, J. Deng, and M. Liu (2019) Meshadv: adversarial meshes for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6898–6907. Cited by: §1.
  • C. Xiao, J. Zhu, B. Li, W. He, M. Liu, and D. Song (2018c) Spatially transformed adversarial examples. ICLR 2019. Cited by: §1.

Appendix A Experimental Setup

We trained DQN models on Pong, Enduro, and TORCS, and trained DDPG models on HalfCheetah and Hopper. The DQN model for training Pong and Enduro consists of 3 convolutional layers and 2 fully connected layers. The two network architectures differ in their number of filters. Specifically, the first network structure is , where denotes a convolutional layer of input channel number , output channel number , kernel size , and stride . denotes a fully connected layer with input dimension and output dimension , and is the number of actions in the environment. The DQN model for training TORCS consists of 3 convolutional layers and 2 or 3 fully connected layers. The convolutional layers’ structure is , and the fully connected layer structure is for one model and for the other model.

The DDPG model for training HalfCheetah and Hopper consists of several fully connected layers. We trained two different policy network structures on all MuJoCo environments. The first model’s actor is a network of size and the critic is a network of size . The second model’s actor is a network of size , and the critic is a network of size . For both models, we added ReLU activation layers between these fully connected layers.

The TORCS autonomous driving environment is a discrete action space control environment with 9 actions: turn left, turn right, keep going, turn left and accelerate, turn right and accelerate, accelerate, turn left and decelerate, turn right and decelerate, and decelerate. The other 4 environments, Pong, Enduro, HalfCheetah, and Hopper, are standard OpenAI Gym environments.

The performance of the trained models when tested without any attack is reported in Table 3.

Metric           TORCS    Enduro   Pong   HalfCheetah   Hopper
Episodic reward  1720.8   1308     21     8257          3061
Episode length   1351     16634    1654   1000          1000
Table 3. Model performance among different environments

The DDPG neural network used for env-search-bb is the same as the first model (3-layer fully connected network) used for training the policy for HalfCheetah, except that both the input dimension and the output dimension equal the dimension of the perturbation parameters. For HalfCheetah, Hopper and TORCS, these input and output dimensions are 32, 20, and 10, respectively.