Spatiotemporally Constrained Action Space Attacks on Deep Reinforcement Learning Agents

Robustness of Deep Reinforcement Learning (DRL) algorithms to adversarial attacks in real-world applications, such as those deployed in cyber-physical systems (CPS), is of increasing concern. Numerous studies have investigated the mechanisms of attacks on the RL agent's state space. Nonetheless, attacks on the RL agent's action space (AS) (corresponding to actuators in engineering systems) are equally perverse; such attacks are relatively less studied in the ML literature. In this work, we first frame the problem as an optimization problem of minimizing the cumulative reward of an RL agent with decoupled constraints as the attack budget. We propose a white-box Myopic Action Space (MAS) attack algorithm that distributes the attacks across the action space dimensions. Next, we reformulate the optimization problem with the same objective function, but with a temporally coupled constraint on the attack budget, to take into account the approximated dynamics of the agent. This leads to the white-box Look-ahead Action Space (LAS) attack algorithm, which distributes the attacks across the action and temporal dimensions. Our results show that, using the same amount of resources, the LAS attack deteriorates the agent's performance significantly more than the MAS attack. This reveals the possibility that, with limited resources, an adversary can exploit the agent's dynamics to malevolently craft attacks that cause the agent to fail. Additionally, we leverage these attack strategies as a tool to gain insights into the potential vulnerabilities of DRL agents.





The spectrum of Reinforcement Learning (RL) applications ranges from engineering design and control [Lazic et al.2018, Amos et al.2018] to business [Hu et al.2018] and creative design [Peng et al.2018]. As RL-based frameworks are increasingly deployed in the real world, it is imperative that the safety and reliability of these frameworks are well understood. While any adversarial infiltration of these systems can be costly, DRL systems deployed in cyber-physical systems (CPS), such as industrial robotic applications and self-driving vehicles, are especially safety- and life-critical.

A root cause of these safety concerns is that in certain applications, the inputs to an RL system can be accessed and modified adversarially to cause the RL agent to take sub-optimal (or even harmful) actions. This is especially true when deep neural networks (DNNs) are used as key components (e.g., to represent policies) of RL agents. Recently, a wealth of results in the ML literature has demonstrated that DNNs can be fooled into misclassifying images by perturbing the input by an imperceptible amount [Goodfellow, Shlens, and Szegedy2015, Xie et al.2017]. Such adversarial perturbations have also been used to demonstrate attacks on an RL agent's state space, as shown by [Huang et al.2017] and [Behzadan and Munir2017].

Besides perturbing the RL agent's state space, it is also important to consider adversarial attacks on the agent's action space, which, in engineering systems, represents physically manipulable actuators. We note that (model-based) actuator attacks have been studied in the cyber-physical security community, including the vulnerability of a continuous system to discrete-time attacks [Kim et al.2016]; theoretical characteristics of undetectable actuator attacks [Ayas and Djouadi2016]; and "defense" schemes that re-stabilize a system under actuation attacks [Huang and Dong2018, Jin, Haddad, and Yucelen2017]. However, adversarial attacks on an RL agent's action space (AS) have been relatively neglected in the DRL literature. In this work, we present a suite of novel attack strategies on an RL agent's AS.

Contributions: Our specific contributions are as follows: (i) We formulate a white-box Myopic Action Space (MAS) attack strategy as an optimization problem with decoupled constraints; (ii) We extend this formulation by coupling the constraints to compute a non-myopic attack that is derived using the agent's state-action dynamics, and develop a white-box Look-ahead Action Space (LAS) attack strategy. Empirically, we show that, using the same attack budget, a more powerful attack can be crafted using LAS compared to MAS; (iii) We illustrate how these attack strategies can be used to understand an RL agent's vulnerabilities; (iv) We present analysis showing that our proposed attack algorithms, which apply projected gradient descent on the surrogate reward function (represented by the trained RL agent model), converge to the same effect as applying projected gradient descent on the true reward function.

| Method | Includes Dynamics | Method of Attack | Space of Attack |
|---|---|---|---|
| FGSM on Policies [Huang et al.2017] | ✗ | O | S |
| ATN [Tretschk, Oh, and Fritz2018] | ✗ | M | S |
| Gradient based Adversarial Attack [Pattanaik et al.2018] | ✗ | O | S |
| Policy Induction Attacks [Behzadan and Munir2017] | ✗ | O | S |
| Strategically-Timed and Enchanting Attack [Lin et al.2017] | ✓ | O, M | S |
| NR-MDP [Tessler, Efroni, and Mannor2019] | ✗ | M | A |
| Myopic Action Space (MAS) | ✗ | O | A |
| Look-ahead Action Space (LAS) | ✓ | O | A |

Table 1: Landscape of adversarial attack strategies on RL agents. The first column denotes whether the attack takes into account the dynamics of the agent. The second column shows the method of computing the attacks, where O denotes an optimization-based method and M denotes a model-based method in which the parameters of a model need to be learned. The last column indicates whether the attacks are performed on the agent's state space (S) or action space (A).

Related work

Due to the large amount of recent progress in the area of adversarial machine learning, we focus only on reviewing the most relevant attack and defense mechanisms recently proposed for DRL models. Table 1 presents the primary landscape of this area of research to contextualize our work.

Adversarial attacks on RL agents

Several studies of adversarial attacks on DRL systems have been conducted recently. The authors of [Huang et al.2017] extended the idea of FGSM attacks in deep learning to an RL agent's policy to degrade the performance of a trained RL agent. Furthermore, the authors of [Behzadan and Munir2017] showed that these attacks on the agent's state space are transferable to other agents. Additionally, the authors of [Tretschk, Oh, and Fritz2018] proposed attaching an Adversarial Transformer Network (ATN) to the RL agent to learn perturbations that deceive the RL agent into pursuing an adversarial reward. While the attack strategies mentioned above are effective, they do not consider the dynamics of the agent. One exception is the work by [Lin et al.2017], which proposed two attack strategies. One strategy is to attack the agent when the difference in probability/value between the best and worst actions crosses a threshold, which leverages the Action-Gap Phenomenon studied by [Farahmand2011]. The other strategy combines a video prediction model that predicts future states with a sampling-based action planning scheme to craft adversarial inputs that lead the agent to an adversarial goal, which might not be scalable. Other studies of adversarial attacks on the specific application of DRL for path-finding have also been conducted by [Xiang et al.2018, Bai et al.2018, Chen et al.2018] and [Liu et al.2017], which result in the RL agent failing to find a path to the goal or planning a path that is more costly.

Robustification of RL agents

As successful attack strategies are being developed for RL models, various works on training RL agents to be robust against attacks have also been conducted. The authors of [Pattanaik et al.2018] proposed that a more severe attack can be engineered by increasing the probability of the worst action rather than decreasing the probability of the best action. They showed that the robustness of an RL agent can be improved by training the agent on these adversarial examples. More recently, the authors of [Tessler, Efroni, and Mannor2019] presented a method to robustify an RL agent's policy against AS perturbations by formulating the problem as a zero-sum Markov game. In their formulation, separate nominal and adversary policies are trained simultaneously, with a critic network being updated over the mixture of both policies to improve both the adversarial and nominal policies. Meanwhile, the authors of [Havens, Jiang, and Sarkar2018] proposed a method to detect and mitigate attacks by employing a hierarchical learning framework with multiple sub-policies. The study showed that the framework reduces the bias of the agent, maintaining high nominal rewards in the absence of adversaries. We note that other methods to defend against adversarial attacks exist, such as the studies by [Tramèr et al.2017, Sinha, Namkoong, and Duchi2018] and [Xie et al.2018]. These works are done mainly in the context of DNNs but may be extendable to DRL agents that employ DNNs as policies; however, discussing these works in detail goes beyond the scope of this work.

Mathematical formulation


Our focus will be exclusively on model-free RL approaches. Below, let s_t and a_t denote the (continuous, possibly high-dimensional) vector variables denoting state and action, respectively, at time t. Let r_t denote the reward signal the agent receives for taking the action a_t, given s_t. We will assume a state evolution function, s_{t+1} = T(s_t, a_t), and a reward function r_t = R(s_t, a_t). For simplicity, we do not model stochastic/measurement noise in either the actions, states, or rewards. The goal of the RL agent is to choose a sequence of actions that maximizes the cumulative reward, Σ_t γ^t r_t (with discount factor γ), given access to the trajectory, τ_t = {s_0, a_0, ..., s_t}, comprising all past states and actions.

In value-based methods, the RL agent determines the action at each time step by finding an intermediate quantity called the value function that satisfies the recursive Bellman equations. One example of such a method is Q-learning [Watkins and Dayan1992], where the agent discovers the Q-function, defined recursively as:

    Q_t(s_t, a_t) = r_t + γ max_{a'} Q_{t+1}(s_{t+1}, a').

If the time horizon is long enough, the Q-function is assumed to be stationary, i.e., Q_t = Q for all t. The optimal action (or the "policy") at each time step is to (deterministically) select the action that maximizes this stationary Q-function conditioned on the observed state, i.e.,

    a_t = argmax_a Q(s_t, a).

In DRL, the Q-function in the above formulation is approximated via a parametric neural network Q_θ; methods to train these networks include Deep Q-Networks [Mnih et al.2015].

In policy-based methods such as policy gradients [Sutton et al.2000], the RL agent directly maps trajectories to policies. For technical reasons, in contrast with Q-learning, the selected action is assumed to be stochastic (i.e., it is sampled from a parametric probability distribution π_θ(a_t | s_t), which we will call the policy) such that the expected rewards (with the expectation taken over π_θ) are maximized:

    max_θ E_{a_t ~ π_θ} [ Σ_t γ^t r_t ].

In DRL, the optimal policy is assumed to be the output of a parametric neural network π_θ, and actions at each time step are sampled from it; methods to train this neural network include Proximal Policy Optimization (PPO) [Schulman et al.2017].

Threat model

Our goal is to identify adversarial vulnerabilities in both RL approaches above in a principled manner. We define a formal threat model, where we assume the adversary possesses the following capabilities:

  1. Access to the RL agent's action stream. The attacker has access to the agent's actuators and can perturb the agent's nominal action adversarially (under reasonable bounds, elaborated below). The nominal agent is also assumed to be a closed-loop system with no active defense mechanisms.

  2. Access to RL agent’s training environment. The attacker has access to the agent’s training environment; this is required since the attacker will need to perform forward simulations to design an optimal sequence of perturbations (elaborated below).

  3. Knowledge of trained RL agent’s DNN. This is required to understand how the RL agent acts under nominal conditions, and to compute gradients. In adversarial ML literature, this assumption is commonly made under the umbrella of white-box attacks.

In the context of the above assumptions, the goal of the attacker is to choose a (bounded) AS perturbation that minimizes long-term discounted rewards. Based on how the attacker chooses to perturb actions, we define and construct two types of optimization-based attacks. We note that alternative approaches, such as training another RL agent to learn a sequence of attacks, are also plausible. However, an optimization-based approach is computationally more tractable for generating on-the-fly attacks on a target agent than training another RL agent (especially for the high-dimensional continuous action spaces considered here). Therefore, we restrict our focus to optimization-based approaches in this paper.

Myopic Action-Space (MAS) attack model

We first consider the case where the attacker is myopic, i.e., at each time step, they design perturbations in a greedy manner without regard to future considerations. Formally, let δ_t be the AS perturbation (to be determined) and b be a budget constraint on the magnitude of each δ_t. (Physically, the budget may reflect a real physical constraint, such as the energy required to influence an actuation, or it may reflect the degree of imperceptibility of the attack.) At each time step t, the attacker designs δ_t such that the anticipated future reward is minimized:

    min_{δ_t} R̂(s_t, a_t + δ_t)  subject to  ‖δ_t‖_p ≤ b,    (1)

where ‖·‖_p denotes the ℓ_p-norm for some p ≥ 1 and R̂ is the surrogate reward represented by the trained agent (discussed below). Observe that while the attacker ostensibly solves separate (decoupled) problems at each time step, the states themselves are not independent, since for any trajectory, s_{t+1} = T(s_t, a_t + δ_t), where T is the transition of the environment based on the current state and the perturbed action. Therefore, the problem is implicitly coupled through time, since it depends heavily on the evolution of state trajectories rather than individual state visitations. Hence, the adversarial perturbations solved for above are strictly myopic, and we consider this a static attack on the agent's AS.
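To make the myopic step concrete, the following sketch applies multi-step gradient descent on a toy surrogate Q-function and projects the resulting perturbation onto an ℓ2 budget ball. The function `toy_q`, the learning rate, and the finite-difference gradient are illustrative assumptions for this sketch, not the authors' implementation:

```python
import numpy as np

def toy_q(state, action):
    # Hypothetical surrogate Q-function (assumption): peaked at a fixed
    # "good" action; a real attack would query the trained critic/policy.
    # The state argument is unused in this toy example.
    target = np.array([0.5, -0.2])
    return -np.sum((action - target) ** 2)

def project_l2(delta, budget):
    # Project the perturbation onto the l2 ball of radius `budget`.
    norm = np.linalg.norm(delta)
    return delta if norm <= budget else delta * (budget / norm)

def mas_step(q_fn, state, a_nom, budget, lr=0.5, n_steps=10, eps=1e-4):
    # One Myopic Action Space step: gradient *descent* on the surrogate
    # reward w.r.t. the action, then projection onto the budget ball.
    a_adv = a_nom.astype(float).copy()
    for _ in range(n_steps):
        grad = np.zeros_like(a_adv)
        for i in range(a_adv.size):  # central finite differences (black-box friendly)
            e = np.zeros_like(a_adv)
            e[i] = eps
            grad[i] = (q_fn(state, a_adv + e) - q_fn(state, a_adv - e)) / (2 * eps)
        a_adv -= lr * grad  # descend to *minimize* the anticipated reward
    delta = project_l2(a_adv - a_nom, budget)
    return a_nom + delta
```

On this toy example, the perturbed action stays within the budget ball while strictly lowering the surrogate reward relative to the nominal action.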

Look-ahead Action Space (LAS) attack model

Next, we consider the case where the attacker is able to look ahead and chooses a designed sequence of future perturbations. Using the same notation as above, let R̂_{t:t+H} denote the sum of (surrogate) rewards until a horizon parameter H, and let R̂_{t':t+H} be the future sum of rewards from time t'. Additionally, we consider the (concatenated) matrix Δ = [δ_t, ..., δ_{t+H}] and denote a budget parameter B. The attacker solves the optimization problem:

    min_Δ R̂_{t:t+H}  subject to  ‖Δ‖_{p,q} ≤ B,    (2)

where ‖·‖_{p,q} denotes the mixed norm [Boyd and Vandenberghe2004], i.e., the ℓ_q-norm of the vector of columnwise ℓ_p-norms of Δ. By coupling the objective function and constraints through the temporal dimension, the solution to the optimization problem above is forced to take the dynamics of the agent into account in an explicit manner.

Proposed algorithms

In this section, we present two attack algorithms based on the optimization formulations presented in the previous section.

1 Initialize nominal environment E and nominal agent with weights θ
2 Initialize budget b
3 while episode not done do
4       Compute adversarial action â using gradient descent on the surrogate reward R̂
5       Compute δ = â − a_nom, project δ onto the ℓ_p ball of size b to get δ_proj
6       Compute projected adversarial action a_pert = a_nom + δ_proj
7       Step through E with a_pert to get next state
Algorithm 1: Myopic Action Space (MAS) Attack

Algorithm for mounting MAS attacks

We observe that (1) is a nonlinear constrained optimization problem; therefore, an immediate approach to solve it is via projected gradient descent (PGD). Specifically, let B_p(b) denote the ℓ_p ball of radius b in the AS. We compute the gradient of the adversarial reward R̂ with respect to (w.r.t.) the AS variables and obtain the unconstrained adversarial action â_t using gradient descent with step size η. Next, we calculate the unconstrained perturbation δ_t = â_t − a_t and project it onto B_p(b) to get δ_t^proj:

    δ_t^proj = Π_{B_p(b)}(â_t − a_t),

where a_t represents the nominal action and Π denotes the projection operator. We note that this approach resembles the fast gradient-sign method (FGSM) [Goodfellow, Shlens, and Szegedy2015], although we compute standard gradients here. As a variation, we can compute multiple steps of gradient descent w.r.t. the action variable prior to projection; this is analogous to the basic iterative method (or iterative FGSM) [Kurakin, Goodfellow, and Bengio2016].
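The projection step has simple closed forms for common choices of p; a minimal sketch for p = 2 (radial rescaling) and p = ∞ (elementwise clipping), as an illustrative helper rather than the authors' code:

```python
import numpy as np

def project_lp_ball(delta, budget, p=2):
    # Closed-form projections onto the lp ball of radius `budget`.
    # p = 2: rescale toward the origin; p = inf: clip each coordinate.
    # The l1 case needs a sorting-based routine and is omitted here.
    if p == 2:
        norm = np.linalg.norm(delta)
        return delta if norm <= budget else delta * (budget / norm)
    if p == np.inf:
        return np.clip(delta, -budget, budget)
    raise NotImplementedError("only p = 2 and p = inf are sketched here")
```

A perturbation already inside the ball is returned unchanged, so repeated projection is idempotent.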

Our overall MAS attack algorithm is presented in pseudo-code form in Alg. 1. We note that in DRL approaches, only a noisy proxy of the true reward function is available: in value-based methods, we utilize the learned Q-function (for example, a DQN) as an approximation of the true reward function, while in policy-based methods, we use the probability density function returned by the optimal policy as a proxy for the reward, under the assumption that actions with high probability induce a high expected reward. Since DQN selects the action based on the argmax of Q-values and policy iteration samples the action with the highest probability, the Q-values/action-probability remains a useful proxy for the reward in our attack formulation. Therefore, our proposed MAS attack is technically a version of noisy projected gradient descent on the policy evaluation of the nominal agent. We elaborate on this further below.
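For a policy-based agent with a diagonal Gaussian policy, the gradient of this proxy (the log-density) w.r.t. the action is available in closed form, which is what makes analytical attack gradients cheap. A sketch under the assumption of a Gaussian policy whose mean `mu` and standard deviation `sigma` would come from the trained policy network:

```python
import numpy as np

def gaussian_logpdf(action, mu, sigma):
    # Log-density of a diagonal Gaussian policy evaluated at `action`.
    return float(np.sum(-0.5 * ((action - mu) / sigma) ** 2
                        - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)))

def logpdf_grad(action, mu, sigma):
    # Analytic gradient of the log-density w.r.t. the action:
    # d/da log N(a; mu, sigma^2) = -(a - mu) / sigma^2.
    return -(action - mu) / sigma ** 2
```

A descent step on the log-density pushes the action toward lower-probability regions, which, under the proxy assumption above, corresponds to lower expected reward.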

Algorithm for mounting LAS attacks

1 Initialize nominal and adversary environments E and E_adv with the same random seed
2 Initialize nominal agent weights θ
3 Initialize budget B, adversary action buffer Δ_buf, horizon H
4 while episode not done do
5       Reset Δ_buf
6       if H = 0 then
7             Reset H and B
8       while roll-out step h < H do
9             Compute adversarial action â_h using gradient descent on the surrogate reward R̂
10            Compute δ_h = â_h − a_h
11            Append δ_h to Δ_buf
12            Step through E_adv with â_h to get next state
13      Compute ‖δ_h‖_p for each element in Δ_buf
14      Project the sequence of ‖δ_h‖_p onto the ℓ_q ball of size B to obtain the look-ahead sequence of budgets [b_t, ..., b_{t+H}]
15      Project each δ_h in Δ_buf onto the look-ahead sequence of budgets computed in the previous step to get the sequence [δ_t^proj, ..., δ_{t+H}^proj]
16      Compute projected adversarial action a_pert = a_nom + δ_t^proj
17      Step through E with a_pert
Algorithm 2: Look-ahead Action Space (LAS) Attack

The previous algorithm is myopic and can be interpreted as a purely spatial attack. In this section, we propose a spatiotemporal attack algorithm by solving Eq. (2) over a given time window H. Due to the coupling of constraints in time, this approach is more involved. We initialize a copy of both the nominal agent and the environment, called the adversary and the adversarial environment, respectively. At time t, we sample a virtual roll-out trajectory up until a certain horizon H using the pair of adversarial agent and environment. At each time step of the virtual roll-out, we compute AS perturbations δ_h by taking (possibly multiple) gradient updates. Next, we compute the norms of each δ_h and project the sequence of norms back onto an ℓ_q-ball of radius B. The resulting projected norms at each time point now represent the individual budgets, b_i, of the spatial dimension at each time step. Finally, we project each δ_h obtained in the previous step onto the ℓ_p-ball of radius b_i, respectively, to get the final perturbations. (Intuitively, these steps represent the allocation of the overall budget across different time steps; see the supplementary material for a formal justification.)

In subsequent time steps, the procedure above is repeated with a reduced budget of B − ‖δ_t^proj‖ and a reduced horizon H − 1, until H reaches zero. The horizon is then reset for planning a new spatiotemporal attack. An alternative formulation could shift the window without reducing its length until the adversary decides to stop the attack. However, we adopt the first formulation so that we can compare the performance of LAS with MAS for an equal overall budget. This technique of re-planning the perturbations at every step while shifting the window is similar to the concept of receding horizons regularly used in optimal control [Mayne and Michalska1990, Mattingley, Wang, and Boyd2011]. It is evident that this form of dynamic re-planning mitigates the planning error that occurs when the actual and simulated state trajectories diverge due to error accumulation [Qin and Badgwell2003]. Hence, we perform this re-planning at every step to account for this deviation. The pseudocode is provided in Alg. 2.
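The budget-allocation step (lines 13-15 of Alg. 2) can be sketched as follows: compute the per-step perturbation norms, project that norm sequence onto an ℓ1 ball of radius B (using a sorting-based projection in the spirit of [Condat2016]), and rescale each perturbation to its allocated per-step budget. The function names and the choice of ℓ1 for the temporal norm are illustrative assumptions:

```python
import numpy as np

def project_simplex_l1(v, budget):
    # Project a nonnegative vector v onto {x >= 0, sum(x) <= budget}
    # via the classic sorting-based simplex projection.
    if v.sum() <= budget:
        return v.copy()
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - budget) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (css[rho] - budget) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def allocate_las_budget(deltas, total_budget, p=2):
    # Two-stage LAS projection sketch:
    # (i)  project the sequence of per-step perturbation norms onto the
    #      l1 ball of radius `total_budget` to get per-step budgets b_i;
    # (ii) rescale each delta so its lp norm fits its budget.
    norms = np.array([np.linalg.norm(d, ord=p) for d in deltas])
    budgets = project_simplex_l1(norms, total_budget)
    out = []
    for d, n, b in zip(deltas, norms, budgets):
        out.append(d.copy() if n <= b else d * (b / (n + 1e-12)))
    return out, budgets
```

After allocation, the per-step budgets sum to at most B, so the overall constraint is respected while the attack is concentrated on the most damaging time steps.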

Figure 1: Visual comparison of MAS and LAS. In MAS, each perturbation is computed via multi-step gradient descent w.r.t. the expected reward for the current step. In LAS, each perturbation is computed w.r.t. the dynamics of the agent using a receding horizon. An adversarial agent and environment are used to compute a per-step perturbation at each step of the virtual roll-out. Projection is then applied to the perturbations in the temporal domain. The final perturbed action used to interact with the environment is obtained by adding the first projected perturbation to the nominal action. This is repeated until the end of the attack window, i.e., until H reaches zero.

Theoretical analysis

We can show that projected gradient descent on the surrogate reward function (modeled by the RL agent network) to generate both MAS and LAS attacks provably converges; this can be accomplished since gradient descent on a surrogate function is akin to a noisy gradient descent on the true adversarial reward. We defer our analysis to the supplementary material.

Experimental results & discussion

To demonstrate the effectiveness and versatility of our methods, we implemented them on RL agents in continuous-action environments from OpenAI Gym [Brockman et al.2016], as these reflect the type of AS found in most practical applications. We trained the RL agents using the ChainerRL [Chainer2019] framework on an Intel Xeon processor with 32 logical cores, 128 GB of RAM, and four Nvidia Titan X GPUs. For policy-based methods, we trained a nominal agent using the PPO algorithm, and for value-based methods, a DoubleDQN (DDQN) agent [Van Hasselt, Guez, and Silver2016]. (The only difference in the implementation of policy- vs. value-based methods is that in policy-based methods, we take analytical gradients of a distribution to compute the attacks, e.g., in line 10 of Algorithm 2, while for value-based methods, we randomly sample adversarial actions to compute numerical gradients.) Additionally, we utilize Normalized Advantage Functions [Gu et al.2016] to convert the discrete nature of DDQN's output to a continuous AS. For succinctness, we present the results of the attack strategies only for the PPO agent in the Lunar-Lander (LL) environment. Additional results for the DDQN agent on LL, and for both PPO and DDQN agents on the Bipedal-Walker (BW) environment, are provided in the supplementary material along with video demonstrations. As a baseline, we implemented a random AS attack, where a random perturbation bounded by the same budget is applied to the agent's AS at every step. For MAS attacks, we implemented two different spatial projection schemes: ℓ1 projection based on [Condat2016], which yields a sparser distribution of attacks, and ℓ2 projection, which yields a less sparse distribution. For LAS attacks, all combinations of ℓ1 and ℓ2 spatial and temporal projections were implemented.

Figure 2: Box plots for PPO on LL showing average cumulative reward across 10 episodes for each attack method. The top figures (a-c) have H = 5 with B = 3, 4, and 5, respectively. Similarly, the bottom figures (d-f) have the same B values but with H = 10. An obvious trend is that as B increases, the effectiveness of LAS over MAS becomes more evident, as seen in the decreasing trend of the reward.

Comparison of MAS and LAS attacks

Fig. 2 shows the cumulative rewards obtained by the PPO agent acting in the LL environment, with each subplot representing a different combination of budget, B, and horizon, H. The top three subplots show experiments with an H value of 5 time steps and B values of 3, 4, and 5 from left to right, respectively. The bottom row shows a similar set of experiments but with an H value of 10 time steps. To directly compare MAS and LAS attacks with equivalent budgets across time, the MAS per-step budget is taken so that its total over the horizon matches B.

Holding H constant while increasing B provides both MAS and LAS a higher budget with which to perturb the nominal actions. We observe that with a low budget of B = 3 (Fig. 2(a)), only LAS succeeds in attacking the RL agent. With a higher budget of B = 5 (Fig. 2(c)), MAS manages to affect the RL agent slightly, while LAS reduces the performance of the nominal agent severely.

Holding B constant, an increase in H allows the allocated budget to be distributed along the longer time horizon. In this case, a longer horizon dilutes the severity of each perturbation compared to shorter horizons. Comparing equal budget values across different horizons (i.e., horizons 5 and 10 for budget 3, Fig. 2(a) and Fig. 2(d), respectively), attacks with H = 10 are generally less severe than their H = 5 counterparts. Across all combinations of B and H, we observe that MAS attacks are generally less effective than LAS. Hence, we focus on studying LAS attacks further.

Ablation study of horizon and budget for time projection attacks for MAS and LAS

Figure 3: Ablation study showing the effectiveness of LAS vs. MAS for time projection attacks. The left and right subplots show ℓ1 and ℓ2 spatial projection, respectively. Each subplot contains lines representing different horizons H, where the budget B is incrementally increased along each horizon. The attack effect under both ℓ1 and ℓ2 spatial projection scales monotonically with increasing budget.

We performed an ablation study to compare the effectiveness of LAS and MAS attacks. We take the difference in each attack's reduction in rewards (i.e., attacked minus nominal) to study how much additional reward can be reduced. Fig. 3 is categorized by spatial projection, with ℓ1 spatial projection on the left and ℓ2 on the right; both subplots use the same time projection. Each subplot shows three lines with different H, with each line visualizing the change in mean cumulative reward as the budget increases along the x-axis. As the budget increases, attacks under both ℓ1 and ℓ2 spatial projection show a monotonic decrease in cumulative reward. Attacks under both spatial projections with an H value of 5 show a different trend: ℓ2 decreases linearly with increasing budget, while ℓ1 stagnates after a budget value of 3. This can be attributed to the fact that the attacks are more sparsely distributed under ℓ1 projection, causing most of the perturbation to be allocated to one joint. Thus, as the budget increases, we see diminishing returns for LAS, since actuating a joint beyond a certain limit does not decrease the reward any further.

Figure 4: Attack magnitude over time along each action dimension for LAS attacks in the LL environment with the PPO RL agent. (a) Variation of attack magnitude along the Up-Down (UD) and Left-Right (LR) action dimensions across different episodes. In all episodes except episode 2, the UD action is more heavily attacked than LR. (b) Variation of attack magnitude through time for episode 1 of (a). After 270 steps, the agent is not attacked in the LR dimension but is heavily attacked in UD. (c) Actual rendering of the LL environment for episode 1 of (a), corresponding to (b). Frames 1-5 are strictly increasing time steps showing the trajectory of the RL agent controlling the LL.

Action dimension decomposition of LAS attacks

Fig. 4 shows the action dimension decomposition of LAS attacks. The example shown in Fig. 4 is the result of spatial projection in the AS combined with temporal projection. From Fig. 4(a), we observe that across all episodes of LAS attacks, one of the action dimensions (i.e., the Up-Down (UD) direction of LL) is consistently perturbed more, i.e., accumulates more attack, than the Left-Right (LR) direction.

Fig. 4(b) shows a detailed view of the action dimension attacks for one episode (Episode 1). It is evident from the figure that the UD action of the lunar lander is more prone to attacks throughout the episode than the LR action. Additionally, the LR action attack is restricted after a certain number of time steps, and only the UD action is attacked further. Fig. 4(c) further corroborates this observation in the actual LL environment. As episode 1 proceeds in Fig. 4(c), the LL initially lands on the ground in frame 3, but lifts up and remains airborne until the episode ends in frame 5. From these studies, it can be clearly observed that LAS attacks (with correlated projection across the AS and time) can isolate vulnerable action dimension(s) of the RL agent to mount a successful attack.

Conclusion & future work

In this study, we present two novel attack strategies on an RL agent's AS: a myopic attack (MAS) and a non-myopic attack (LAS). The results show that LAS attacks, crafted with explicit use of the agent's dynamics information, are more powerful than MAS attacks. Additionally, we observed that applying LAS attacks to RL agents reveals the potentially vulnerable actuators of an agent, as seen in the non-uniform distribution of attacks over certain action dimensions. This can be leveraged as a tool to identify vulnerabilities and plan a mitigation strategy against similar attacks. Possible future work includes extending the concept of LAS attacks to state space attacks, where the agent's observations are perturbed instead of the agent's actions, while taking into account the dynamics of the agent. Another avenue of future work is to investigate methods to robustify RL agents against these types of attacks. We speculate that while these AS attacks may be detectable by observing the difference in expected return, it will be hard for the RL agent to mitigate them, because the adversarial perturbations on the action space are independent of the agent's policy. In comparison, an RL agent might be able to learn a counter-policy to maintain a nominal reward if the adversarial perturbations are applied to the state space.


  • [Amos et al.2018] Amos, B.; Jimenez, I.; Sacks, J.; Boots, B.; and Kolter, J. Z. 2018. Differentiable mpc for end-to-end planning and control. In Advances in Neural Information Processing Systems, 8289–8300.
  • [Ayas and Djouadi2016] Ayas, M. S., and Djouadi, S. M. 2016. Undetectable sensor and actuator attacks for observer based controlled cyber-physical systems. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), 1–7.
  • [Bai et al.2018] Bai, X.; Niu, W.; Liu, J.; Gao, X.; Xiang, Y.; and Liu, J. 2018. Adversarial examples construction towards white-box q table variation in dqn pathfinding training. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), 781–787.
  • [Behzadan and Munir2017] Behzadan, V., and Munir, A. 2017. Vulnerability of deep reinforcement learning to policy induction attacks. In Perner, P., ed., Machine Learning and Data Mining in Pattern Recognition, 262–275. Cham: Springer International Publishing.
  • [Boyd and Vandenberghe2004] Boyd, S., and Vandenberghe, L. 2004. Convex optimization. Cambridge university press.
  • [Brockman et al.2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. Openai gym. arXiv preprint arXiv:1606.01540.
  • [Chainer2019] Chainer. 2019. Chainerrl.
  • [Chen et al.2018] Chen, T.; Niu, W.; Xiang, Y.; Bai, X.; Liu, J.; Han, Z.; and Li, G. 2018. Gradient band-based adversarial training for generalized attack immunity of a3c path finding. arXiv preprint arXiv:1807.06752.
  • [Condat2016] Condat, L. 2016. Fast projection onto the simplex and the l1 ball. Mathematical Programming 158(1-2):575–585.
  • [Farahmand2011] Farahmand, A.-m. 2011. Action-gap phenomenon in reinforcement learning. In Shawe-Taylor, J.; Zemel, R. S.; Bartlett, P. L.; Pereira, F.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 24. Curran Associates, Inc. 172–180.
  • [Goodfellow, Shlens, and Szegedy2015] Goodfellow, I.; Shlens, J.; and Szegedy, C. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
  • [Gu et al.2016] Gu, S.; Lillicrap, T.; Sutskever, I.; and Levine, S. 2016. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, 2829–2838.
  • [Havens, Jiang, and Sarkar2018] Havens, A.; Jiang, Z.; and Sarkar, S. 2018. Online robust policy learning in the presence of unknown adversaries. In Advances in Neural Information Processing Systems, 9916–9926.
  • [Hu et al.2018] Hu, Z.; Liang, Y.; Zhang, J.; Li, Z.; and Liu, Y. 2018. Inference aided reinforcement learning for incentive mechanism design in crowdsourcing. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31. Curran Associates, Inc. 5507–5517.
  • [Huang and Dong2018] Huang, X., and Dong, J. 2018. Reliable control policy of cyber-physical systems against a class of frequency-constrained sensor and actuator attacks. IEEE Transactions on Cybernetics 48(12):3432–3439.
  • [Huang et al.2017] Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; and Abbeel, P. 2017. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284.
  • [Jin, Haddad, and Yucelen2017] Jin, X.; Haddad, W. M.; and Yucelen, T. 2017. An adaptive control architecture for mitigating sensor and actuator attacks in cyber-physical systems. IEEE Transactions on Automatic Control 62(11):6058–6064.
  • [Kim et al.2016] Kim, J.; Park, G.; Shim, H.; and Eun, Y. 2016. Zero-stealthy attack for sampled-data control systems: The case of faster actuation than sensing. In 2016 IEEE 55th Conference on Decision and Control (CDC), 5956–5961.
  • [Kurakin, Goodfellow, and Bengio2016] Kurakin, A.; Goodfellow, I.; and Bengio, S. 2016. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236.
  • [Lazic et al.2018] Lazic, N.; Boutilier, C.; Lu, T.; Wong, E.; Roy, B.; Ryu, M.; and Imwalle, G. 2018. Data center cooling using model-predictive control. In Advances in Neural Information Processing Systems, 3814–3823.
  • [Lin et al.2017] Lin, Y.-C.; Hong, Z.-W.; Liao, Y.-H.; Shih, M.-L.; Liu, M.-Y.; and Sun, M. 2017. Tactics of adversarial attack on deep reinforcement learning agents. In

    Proceedings of the 26th International Joint Conference on Artificial Intelligence

    , 3756–3762.
    AAAI Press.
  • [Liu et al.2017] Liu, J.; Niu, W.; Liu, J.; Zhao, J.; Chen, T.; Yang, Y.; Xiang, Y.; and Han, L. 2017. A method to effectively detect vulnerabilities on path planning of vin. In International Conference on Information and Communications Security, 374–384. Springer.
  • [Mattingley, Wang, and Boyd2011] Mattingley, J.; Wang, Y.; and Boyd, S. 2011. Receding horizon control. IEEE Control Systems Magazine 31(3):52–65.
  • [Mayne and Michalska1990] Mayne, D. Q., and Michalska, H. 1990. Receding horizon control of nonlinear systems. IEEE Transactions on Automatic Control 35(7):814–824.
  • [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.
  • [Pattanaik et al.2018] Pattanaik, A.; Tang, Z.; Liu, S.; Bommannan, G.; and Chowdhary, G. 2018. Robust deep reinforcement learning with adversarial attacks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 2040–2042. International Foundation for Autonomous Agents and Multiagent Systems.
  • [Peng et al.2018] Peng, X. B.; Abbeel, P.; Levine, S.; and van de Panne, M. 2018. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37(4):143.
  • [Qin and Badgwell2003] Qin, S. J., and Badgwell, T. A. 2003. A survey of industrial model predictive control technology. Control engineering practice 11(7):733–764.
  • [Schulman et al.2017] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • [Sinha, Namkoong, and Duchi2018] Sinha, A.; Namkoong, H.; and Duchi, J. 2018. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations.
  • [Sutton et al.2000] Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, 1057–1063.
  • [Tessler, Efroni, and Mannor2019] Tessler, C.; Efroni, Y.; and Mannor, S. 2019. Action robust reinforcement learning and applications in continuous control. arXiv preprint arXiv:1901.09184.
  • [Tramèr et al.2017] Tramèr, F.; Kurakin, A.; Papernot, N.; Goodfellow, I.; Boneh, D.; and McDaniel, P. 2017. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204.
  • [Tretschk, Oh, and Fritz2018] Tretschk, E.; Oh, S. J.; and Fritz, M. 2018. Sequential attacks on agents for long-term adversarial goals. In 2. ACM Computer Science in Cars Symposium.
  • [Van Hasselt, Guez, and Silver2016] Van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
  • [Watkins and Dayan1992] Watkins, C. J. C. H., and Dayan, P. 1992. Q-learning. Machine Learning 8(3):279–292.
  • [Xiang et al.2018] Xiang, Y.; Niu, W.; Liu, J.; Chen, T.; and Han, Z. 2018. A pca-based model to predict adversarial examples on q-learning of path finding. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), 773–780. IEEE.
  • [Xie et al.2017] Xie, C.; Wang, J.; Zhang, Z.; Zhou, Y.; Xie, L.; and Yuille, A. 2017. Adversarial examples for semantic segmentation and object detection. In

    The IEEE International Conference on Computer Vision (ICCV)

  • [Xie et al.2018] Xie, C.; Wang, J.; Zhang, Z.; Ren, Z.; and Yuille, A. 2018. Mitigating adversarial effects through randomization. In International Conference on Learning Representations.