The robustness of Deep Reinforcement Learning (DRL) algorithms to adversarial attacks in real-world applications, such as those deployed in cyber-physical systems (CPS), is of increasing concern. Numerous studies have investigated the mechanisms of attacks on the RL agent's state space. Nonetheless, attacks on the RL agent's action space (AS) (corresponding to actuators in engineering systems) are equally perverse, yet they are relatively less studied in the ML literature. In this work, we first frame the problem as an optimization problem of minimizing the cumulative reward of an RL agent, with decoupled constraints as the budget of attack. We propose a white-box Myopic Action Space (MAS) attack algorithm that distributes the attacks across the action space dimensions. Next, we reformulate the optimization problem above with the same objective function, but with a temporally coupled constraint on the attack budget to take into account the approximated dynamics of the agent. This leads to the white-box Look-ahead Action Space (LAS) attack algorithm that distributes the attacks across the action and temporal dimensions. Our results show that, using the same amount of resources, the LAS attack deteriorates the agent's performance significantly more than the MAS attack. This reveals the possibility that, with limited resources, an adversary can utilize the agent's dynamics to malevolently craft attacks that cause the agent to fail. Additionally, we leverage these attack strategies as a possible tool to gain insights into the potential vulnerabilities of DRL agents.
The spectrum of Reinforcement Learning (RL) applications ranges from engineering design and control [Lazic et al.2018, Amos et al.2018] to business [Hu et al.2018] and creative design [Peng et al.2018]. As RL-based frameworks are increasingly deployed in the real world, it is imperative that the safety and reliability of these frameworks are well understood. While any adversarial infiltration of these systems can be costly, DRL systems deployed in cyber-physical systems (CPS), such as industrial robotic applications and self-driving vehicles, are especially safety- and life-critical.
A root cause of these safety concerns is that in certain applications, the inputs to an RL system can be accessed and modified adversarially to cause the RL agent to take sub-optimal (or even harmful) actions. This is especially true when deep neural networks (DNNs) are used as key components (e.g., to represent policies) of RL agents. Recently, a wealth of results in the ML literature demonstrated that DNNs can be fooled into misclassifying images by perturbing the input by an imperceptible amount [Goodfellow, Shlens, and Szegedy2015, Xie et al.2017]. Such adversarial perturbations have also been shown to be effective as attacks on an RL agent’s state space by [Huang et al.2017] and [Behzadan and Munir2017].
Besides perturbing the RL agent’s state space, it is also important to consider adversarial attacks on the agent’s action space, which, in engineering systems, represents physically manipulable actuators. We note that (model-based) actuator attacks have been studied in the cyber-physical security community, including the vulnerability of a continuous system to discrete-time attacks [Kim et al.2016]; theoretical characteristics of undetectable actuator attacks [Ayas and Djouadi2016]; and “defense” schemes that re-stabilize a system under actuation attacks [Huang and Dong2018, Jin, Haddad, and Yucelen2017]. However, the issue of adversarial attacks on an RL agent’s action space (AS) has been relatively ignored in the DRL literature. In this work, we present a suite of novel attack strategies on an RL agent’s AS.
Contributions: Our specific contributions are as follows: (i) We formulate a white-box Myopic Action Space (MAS) attack strategy as an optimization problem with decoupled constraints. (ii) We extend the formulation above by coupling the constraints to compute a non-myopic attack that is derived using the agent’s state-action dynamics, and develop a white-box Look-ahead Action Space (LAS) attack strategy. Empirically, we show that using the same budget of attack, a more powerful attack can be crafted using LAS as compared to MAS. (iii) We illustrate how these attack strategies can be used to understand an RL agent’s vulnerabilities. (iv) We present analysis to show that applying projected gradient descent on the surrogate reward function (represented by the trained RL agent model), as our proposed attack algorithms do, converges to the same effect as applying projected gradient descent on the true reward function.
| Attack | Includes Dynamics | Method (O: optimization-based, M: model-based) | Space of Attack (S: state, A: action) |
|---|---|---|---|
| FGSM on Policies [Huang et al.2017] | X | O | S |
| ATN [Tretschk, Oh, and Fritz2018] | X | M | S |
| Gradient based Adversarial Attack [Pattanaik et al.2018] | X | O | S |
| Policy Induction Attacks [Behzadan and Munir2017] | X | O | S |
| Strategically-Timed and Enchanting Attack [Lin et al.2017] | ✓ | O, M | S |
| NR-MDP [Tessler, Efroni, and Mannor2019] | X | M | A |
| Myopic Action Space (MAS) | X | O | A |
| Look-ahead Action Space (LAS) | ✓ | O | A |
Due to the large amount of recent progress in the area of adversarial machine learning, we focus only on reviewing the most relevant attack and defense mechanisms recently proposed for DRL models. Table 1 presents the primary landscape of this area of research to contextualize our work.
Several studies of adversarial attacks on DRL systems have been conducted recently. The authors of [Huang et al.2017] extended the idea of FGSM attacks in deep learning to RL agents’ policies to degrade the performance of a trained RL agent. Furthermore, the authors of [Behzadan and Munir2017] showed that these attacks on the agent’s state space are transferable to other agents. Additionally, the authors of [Tretschk, Oh, and Fritz2018] proposed attaching an Adversarial Transformer Network (ATN) to the RL agent to learn perturbations that deceive the RL agent into pursuing an adversarial reward. While the attack strategies mentioned above are effective, they do not consider the dynamics of the agent. One exception is the work by [Lin et al.2017], which proposed two attack strategies. One strategy was to attack the agent when the difference in probability/value of the best and worst action crosses a threshold, which leverages the Action-Gap Phenomenon studied by [Farahmand2011]. The other strategy was to combine a video prediction model that predicts future states with a sampling-based action planning scheme to craft adversarial inputs that lead the agent to an adversarial goal, which might not be scalable. Other studies of adversarial attacks on the specific application of DRL for path-finding have also been conducted by [Xiang et al.2018, Bai et al.2018, Chen et al.2018] and [Liu et al.2017]; these attacks result in the RL agent failing to find a path to the goal or planning a path that is more costly.
As successful attack strategies are being developed for RL models, various works on training RL agents to be robust against attacks have also been conducted. The authors of [Pattanaik et al.2018] proposed that a more severe attack can be engineered by increasing the probability of the worst action rather than decreasing the probability of the best action. They showed that the robustness of an RL agent can be improved by training the agent using these adversarial examples. More recently, the authors of [Tessler, Efroni, and Mannor2019] presented a method to robustify an RL agent’s policy against AS perturbations by formulating the problem as a zero-sum Markov game. In their formulation, separate nominal and adversary policies are trained simultaneously, with a critic network being updated over the mixture of both policies to improve both the adversarial and nominal policies. Meanwhile, the authors of [Havens, Jiang, and Sarkar2018] proposed a method to detect and mitigate attacks by employing a hierarchical learning framework with multiple sub-policies. The study proved that the framework reduces the bias of the agent to maintain high nominal rewards in the absence of adversaries. We note that other methods to defend against adversarial attacks exist, such as the studies done by [Tramèr et al.2017, Sinha, Namkoong, and Duchi2018] and [Xie et al.2018]. These works are done mainly in the context of DNNs but may be extendable to DRL agents that employ DNNs as policies; however, discussing these works in detail goes beyond the scope of this work.
Our focus will be exclusively on model-free RL approaches. Below, let $s_t$ and $a_t$ denote the (continuous, possibly high-dimensional) vector variables denoting state and action, respectively, at time $t$. Let $r_t$ denote the reward signal the agent receives for taking the action $a_t$, given $s_t$. We will assume a state evolution function, $s_{t+1} = E(s_t, a_t)$, and a reward function $r_t = R(s_t, a_t)$. For simplicity, we do not model stochastic/measurement noise in either the actions, states, or rewards. The goal of the RL agent is to choose a sequence of actions that maximizes the cumulative reward, $\sum_t r_t$, given access to the trajectory, $\tau_t$, comprising all past states and actions.
In value-based methods, the RL agent determines the action at each time step by finding an intermediate quantity called the value function that satisfies the recursive Bellman equations. One example of such a method is Q-learning [Watkins and Dayan1992], where the agent discovers the Q-function, defined recursively as:

$$Q_t(s_t, a_t) = R(s_t, a_t) + \max_{a'} Q_{t+1}\big(E(s_t, a_t), a'\big).$$

If the time-horizon is long enough, the Q-function is assumed to be stationary, i.e., $Q_t = Q$. The optimal action (or the “policy”) at each time step is to (deterministically) select the action that maximizes this stationary Q-function conditioned on the observed state, i.e.,

$$a_t^* = \arg\max_{a} Q(s_t, a).$$

In DRL, the Q-function in the above formulation is approximated via a parametric neural network $Q_\theta$; methods to train these networks include Deep Q-networks [Mnih et al.2015].
In policy-based methods such as policy gradients [Sutton et al.2000], the RL agent directly maps trajectories to policies. For technical reasons, in contrast with Q-learning, the selected action is assumed to be stochastic (i.e., it is sampled from a parametric probability distribution $\pi$, which we will call the policy) such that the expected rewards (with the expectation taken over $\pi$) are maximized:

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{a_t \sim \pi}\Big[\sum_t R(s_t, a_t)\Big].$$

In DRL, the optimal policy $\pi^*$ is assumed to be the output of a parametric neural network $\pi_\theta$, and actions at each time step are sampled; methods to train this neural network include proximal policy optimization (PPO) [Schulman et al.2017].
Our goal is to identify adversarial vulnerabilities in both RL approaches above in a principled manner. We define a formal threat model, where we assume the adversary possesses the following capabilities:
Access to RL agent’s action stream. The attacker has access to the agent’s actuators and can perturb the agent’s nominal action adversarially (under reasonable bounds, elaborated below). The nominal agent is also assumed to be a closed-loop system and to have no active defense mechanisms.
Access to RL agent’s training environment. The attacker has access to the agent’s training environment; this is required since the attacker will need to perform forward simulations to design an optimal sequence of perturbations (elaborated below).
Knowledge of trained RL agent’s DNN. This is required to understand how the RL agent acts under nominal conditions, and to compute gradients. In the adversarial ML literature, this assumption is commonly made under the umbrella of white-box attacks.
In the context of the above assumptions, the goal of the attacker is to choose a (bounded) AS perturbation that minimizes long-term discounted rewards. Based on how the attacker chooses to perturb actions, we define and construct two types of optimization-based attacks. We note that alternative approaches, such as training another RL agent to learn a sequence of attacks, are also plausible. However, an optimization-based approach is computationally more tractable for generating on-the-fly attacks on a target agent than training another RL agent to generate attacks (especially for the high-dimensional continuous action spaces considered here). Therefore, we restrict our focus to optimization-based approaches in this paper.
We first consider the case where the attacker is myopic, i.e., at each time step, they design perturbations in a greedy manner without regard to future considerations. Formally, let $\delta_t$ be the AS perturbation (to be determined) and $b$ be a budget constraint on the magnitude of each $\delta_t$. (Physically, the budget may reflect a real physical constraint, such as the energy requirements to influence an actuation, or it may be a reflection of the degree of imperceptibility of the attack.) At each time step $t$, the attacker designs $\delta_t$ such that the anticipated future reward is minimized:

$$\min_{\delta_t} \; R_{adv}(\delta_t) = R(s_t, a_t + \delta_t) + \sum_{j=t+1}^{T} R(s_j, a_j) \quad \text{subject to} \quad \|\delta_t\|_p \le b, \tag{1}$$

where $T$ is the final time step and $\|\cdot\|_p$ denotes the $\ell_p$-norm for some $p \ge 1$. Observe that while the attacker ostensibly solves separate (decoupled) problems at each time, the states themselves are not independent since, given any trajectory, $s_{t+1} = E(s_t, a_t + \delta_t)$, where $E$ is the transition of the environment based on $s_t$ and $a_t$. Therefore, the reward is implicitly coupled through time since it depends heavily on the evolution of state trajectories rather than individual state visitations. Hence, the adversarial perturbations solved above are strictly myopic and we consider this a static attack on the agent’s AS.
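Next, we consider the case where the attacker is able to look ahead and choose a designed sequence of future perturbations. Using the same notation as above, let $\sum_{j=t}^{t+H} R(s_j, a_j + \delta_j)$ denote the sum of rewards until a horizon parameter $H$, and let $\sum_{j=t+H+1}^{T} R(s_j, a_j)$ be the future sum of rewards from time $j = t+H+1$. Additionally, we consider the (concatenated) matrix $\Delta = [\delta_t, \delta_{t+1}, \ldots, \delta_{t+H}]$ and let $B$ denote a budget parameter. The attacker solves the optimization problem:

$$\min_{\Delta} \; R_{adv}(\Delta) = \sum_{j=t}^{t+H} R(s_j, a_j + \delta_j) + \sum_{j=t+H+1}^{T} R(s_j, a_j) \quad \text{subject to} \quad \|\Delta\|_{p,q} \le B, \tag{2}$$

where $\|\cdot\|_{p,q}$ denotes the $\ell_{p,q}$-norm [Boyd and Vandenberghe2004], i.e., the $\ell_q$-norm over time of the per-time-step $\ell_p$-norms of the columns of $\Delta$. By coupling the objective function and constraints through the temporal dimension, the solution to the optimization problem above is then forced to take the dynamics of the agent into account in an explicit manner.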
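To make the temporally coupled budget concrete, the following minimal sketch computes a mixed norm of a perturbation sequence: a spatial norm over the action dimensions at each time step, followed by a temporal norm over the resulting per-step magnitudes. The function names, the list-of-lists layout, and the toy numbers are assumptions made for illustration and are not taken from the authors' implementation.

```python
def lp_norm(vec, p):
    """l_p norm of a flat list of floats (p >= 1, or float('inf'))."""
    if p == float("inf"):
        return max(abs(v) for v in vec)
    return sum(abs(v) ** p for v in vec) ** (1.0 / p)


def mixed_norm(delta, p, q):
    """l_{p,q} norm: l_p over each time step's perturbation, then l_q over time.

    `delta` is a list of per-time-step perturbation vectors (assumed layout:
    one inner list per time step, one entry per action dimension).
    """
    per_step = [lp_norm(step, p) for step in delta]
    return lp_norm(per_step, q)


# Toy window: 3 time steps, 2 action dimensions (illustrative values only).
delta = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]
per_step_l2 = [lp_norm(step, 2) for step in delta]  # spatial magnitudes: 5, 0, 10
total = mixed_norm(delta, p=2, q=1)                 # temporal l_1 sum of those: 15
```

Under this reading, a single budget bounds the whole window, so an attacker may spend nothing at one time step and concentrate the perturbation at another, which a per-step budget does not allow.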
In this section, we present two attack algorithms based on the optimization formulations presented in the previous section.
We observe that (1) is a nonlinear constrained optimization problem; therefore, an immediate approach to solve it is via projected gradient descent (PGD). Specifically, let $\mathcal{B}_p(b)$ denote the $\ell_p$ ball of radius $b$ in the AS. We compute the gradient of the adversarial reward, $R_{adv}$, with respect to (w.r.t.) the AS variables and obtain the unconstrained adversarial action $\hat{a}_t$ using gradient descent with step size $\eta$. Next, we calculate the unconstrained perturbation $\hat{\delta}_t$ and project it onto $\mathcal{B}_p(b)$ to get $\delta_t$:

$$\hat{a}_t = a_t - \eta \nabla_{a_t} R_{adv}, \qquad \hat{\delta}_t = \hat{a}_t - a_t, \qquad \delta_t = P_{\mathcal{B}_p(b)}\big(\hat{\delta}_t\big).$$

Here, $a_t$ represents the nominal action and $P_{\mathcal{B}_p(b)}$ denotes projection onto $\mathcal{B}_p(b)$. We note that this approach resembles the fast gradient-sign method (FGSM) [Goodfellow, Shlens, and Szegedy2015], although we compute standard gradients here. As a variation, we can compute multiple steps of gradient descent w.r.t. the action variable prior to projection; this is analogous to the basic iterative method (or iterative FGSM) [Kurakin, Goodfellow, and Bengio2016].
Our overall MAS attack algorithm is presented in pseudo-code form in Alg. 1. We note that in DRL approaches, only a noisy proxy of the true reward function is available: in value-based methods, we utilize the learned Q-function (for example, a DQN) as an approximation of the true reward function, while in policy-iteration methods, we use the probability density function returned by the optimal policy as a proxy for the reward, under the assumption that actions with high probability induce a high expected reward. Since DQN selects the action based on the argmax of Q-values and policy iteration samples the action with highest probability, the Q-values/action-probabilities remain a useful proxy for the reward in our attack formulation. Therefore, our proposed MAS attack is technically a version of noisy projected gradient descent on the policy evaluation of the nominal agent. We elaborate on this further below.
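The myopic attack step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' Alg. 1: it assumes an $\ell_2$ budget, and it replaces the trained agent's surrogate (Q-function or policy density) with a hypothetical quadratic `toy_reward` whose gradient is available in closed form. All function names are invented for this sketch.

```python
import math


def project_l2_ball(vec, radius):
    """Euclidean projection of `vec` onto the l_2 ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= radius or norm == 0.0:
        return list(vec)
    return [v * radius / norm for v in vec]


def mas_perturbation(action, reward_grad, budget, step_size=0.1, n_steps=1):
    """Descend the surrogate reward w.r.t. the action, then project the perturbation.

    `reward_grad(a)` is a hypothetical callable returning the gradient of the
    surrogate reward with respect to the action `a`.
    """
    adv_action = list(action)
    for _ in range(n_steps):  # n_steps > 1 gives the iterative variant
        grad = reward_grad(adv_action)
        adv_action = [a - step_size * g for a, g in zip(adv_action, grad)]
    raw_delta = [adv - a for adv, a in zip(adv_action, action)]
    return project_l2_ball(raw_delta, budget)


# Toy surrogate (assumption for illustration): R(a) = -(a0 - 1)^2 - 2 (a1 + 0.5)^2
def toy_reward(a):
    return -((a[0] - 1.0) ** 2) - 2.0 * (a[1] + 0.5) ** 2


def toy_reward_grad(a):
    return [-2.0 * (a[0] - 1.0), -4.0 * (a[1] + 0.5)]


nominal = [0.5, 0.0]
delta = mas_perturbation(nominal, toy_reward_grad, budget=0.2, step_size=0.5, n_steps=3)
attacked = [a + d for a, d in zip(nominal, delta)]
```

In this toy setting the perturbed action stays within the budget while moving in a direction that lowers the surrogate reward; with a learned surrogate the gradient would instead come from the agent's network.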
The previous algorithm is myopic and can be interpreted as a purely spatial attack. In this section, we propose a spatiotemporal attack algorithm by solving Eq. (2) over a given time window $(t, t+H)$. Due to the coupling of constraints in time, this approach is more involved. We initialize a copy of both the nominal agent and the environment, called the adversary and adversarial environment, respectively. At time $t$, we sample a virtual roll-out trajectory up until a certain horizon $t+H$ using the pair of adversarial agent and environment. At each time step of the virtual roll-out, we compute AS perturbations $\delta_t, \ldots, \delta_{t+H}$ by taking (possibly multiple) gradient updates. Next, we compute the $\ell_p$-norms of each $\delta_j$ and project the sequence of norms back onto an $\ell_q$-ball of radius $B$. The resulting projected norms at each time point now represent the individual budgets, $b_j$, of the spatial dimension at each time step. Finally, we project the original $\delta_j$ obtained in the previous step onto the $\ell_p$-balls of radii $b_j$, respectively, to get the final perturbations. (Intuitively, these steps represent the allocation of the overall budget across different time steps; see the supplementary material for a formal justification.)
In subsequent time steps, the procedure above is repeated with the budget reduced by the amount already spent and with a reduced horizon $H - 1$, until $H$ reaches zero. The horizon is then reset for planning a new spatiotemporal attack. An alternative formulation could be to shift the window without reducing its length until the adversary decides to stop the attack. However, we consider the first formulation so that we can compare the performance of LAS with MAS for an equal overall budget. This technique of re-planning the perturbations at every step while shifting the window of $H$ is similar to the concept of receding horizons regularly used in optimal control [Mayne and Michalska1990, Mattingley, Wang, and Boyd2011]. This form of dynamic re-planning mitigates the planning error that occurs when the actual and simulated state trajectories diverge due to error accumulation [Qin and Badgwell2003]. Hence, we perform this re-planning at every $t$ to account for this deviation. The pseudocode is provided in Alg. 2.
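The two-stage projection at the heart of the look-ahead attack (first across time, then across action dimensions) can be sketched as follows. This is a simplified illustration and not the authors' Alg. 2: it assumes $\ell_2$ norms in both space and time, takes the unconstrained perturbations from the virtual roll-out as given, and omits the roll-out and re-planning loop. The helper names are hypothetical.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def project_l2_ball(vec, radius):
    """Euclidean projection of `vec` onto the l_2 ball of the given radius."""
    norm = l2_norm(vec)
    if norm <= radius or norm == 0.0:
        return list(vec)
    return [v * radius / norm for v in vec]


def allocate_and_project(raw_deltas, total_budget):
    """Split a total budget across time steps, then project each perturbation.

    raw_deltas: unconstrained perturbation vectors, one per time step in the
                look-ahead window (assumed to come from the virtual roll-out).
    Returns (per_step_budgets, projected_deltas).
    """
    # 1) Magnitude requested at each time step.
    norms = [l2_norm(d) for d in raw_deltas]
    # 2) Temporal projection: project the vector of norms onto the ball of
    #    radius `total_budget`; the result gives the per-step budgets.
    budgets = project_l2_ball(norms, total_budget)
    # 3) Spatial projection: project each perturbation onto the ball whose
    #    radius is that time step's budget.
    projected = [project_l2_ball(d, b) for d, b in zip(raw_deltas, budgets)]
    return budgets, projected


# Toy window of 3 steps and 2 action dimensions (illustrative values only).
raw = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
budgets, deltas = allocate_and_project(raw, total_budget=2.0)
```

In this sketch, time steps whose unconstrained perturbation is larger receive a larger share of the budget, which is the intuition behind distributing the attack over the temporal dimension.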
We can show that projected gradient descent on the surrogate reward function (modeled by the RL agent network) to generate both MAS and LAS attacks provably converges; this can be accomplished since gradient descent on a surrogate function is akin to a noisy gradient descent on the true adversarial reward. We defer our analysis to the supplementary material.
To demonstrate the effectiveness and versatility of our methods, we implemented them on RL agents with continuous action environments from OpenAI’s gym [Brockman et al.2016], as they reflect the type of AS in most practical applications. We trained the RL agents using the ChainerRL [Chainer2019] framework on an Intel Xeon processor with 32 logical cores, 128GB of RAM, and four Nvidia Titan X GPUs. We trained a nominal agent using the PPO algorithm for policy-based methods and a DoubleDQN (DDQN) agent [Van Hasselt, Guez, and Silver2016] for value-based methods. (The only difference in implementation between policy-based and value-based methods is that in policy methods, we take analytical gradients of a distribution to compute the attacks (e.g., in line 10 of Algorithm 2), while for value-based methods, we randomly sample adversarial actions to compute numerical gradients.) Additionally, we utilize Normalized Advantage Functions [Gu et al.2016] to convert the discrete nature of DDQN’s output to a continuous AS. For succinctness, we present the results of the attack strategies only on the PPO agent for the Lunar-Lander (LL) environment. Additional results of the DDQN agent on LL, and of both PPO and DDQN agents on the Bipedal-Walker (BW) environment, are provided in the supplementary section along with video demonstrations. As a baseline, we implemented a random AS attack, where a random perturbation bounded by the same budget is applied to the agent’s AS at every step. For MAS attacks, we implemented two different spatial projection schemes: $\ell_1$ projection based on [Condat2016], which represents a sparser distribution of attacks, and $\ell_2$ projection, which represents a less sparse distribution of attacks. For LAS attacks, all combinations of spatial and temporal projection for $\ell_1$ and $\ell_2$ were implemented.
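To illustrate why the choice of projection affects how sparse the attack is, the sketch below compares Euclidean projection onto an $\ell_1$ ball with projection onto an $\ell_2$ ball. Note the assumption: it uses the common sort-based procedure for the $\ell_1$ case, not the faster algorithm of [Condat2016]; both compute the same projection. The function names and test vector are illustrative only.

```python
import math


def project_simplex(vec, radius):
    """Project a non-negative vector onto {w : w_i >= 0, sum(w) = radius}."""
    sorted_desc = sorted(vec, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, value in enumerate(sorted_desc, start=1):
        cumulative += value
        candidate = (cumulative - radius) / k
        if value - candidate > 0:
            theta = candidate  # keep the threshold of the last valid k
    return [max(v - theta, 0.0) for v in vec]


def project_l1_ball(vec, radius):
    """Euclidean projection onto the l_1 ball (soft-thresholds the entries)."""
    if sum(abs(v) for v in vec) <= radius:
        return list(vec)
    magnitudes = project_simplex([abs(v) for v in vec], radius)
    return [math.copysign(m, v) for m, v in zip(magnitudes, vec)]


def project_l2_ball(vec, radius):
    """Euclidean projection onto the l_2 ball (rescales the whole vector)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= radius or norm == 0.0:
        return list(vec)
    return [v * radius / norm for v in vec]


raw_delta = [3.0, -1.0]                      # unconstrained perturbation (toy)
sparse = project_l1_ball(raw_delta, 2.0)     # smaller entry is driven to zero
dense = project_l2_ball(raw_delta, 2.0)      # both entries shrink proportionally
```

For this toy vector, the $\ell_1$ projection places the entire budget on the first action dimension, whereas the $\ell_2$ projection keeps a nonzero perturbation in both.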
Fig. 2 shows the cumulative rewards obtained by the PPO agent acting in the LL environment, with each subplot representing a different combination of budget, $B$, and horizon, $H$. The top three subplots show experiments with an $H$ value of 5 time steps and $B$ values of 3, 4, and 5 from left to right, respectively. The bottom row of figures shows a similar set of experiments but with an $H$ value of 10 time steps instead. To directly compare MAS and LAS attacks with equivalent budgets across time, MAS budget values are taken as $b = B/H$.
Holding $H$ constant while increasing $B$ provides both MAS and LAS with a higher budget to inject perturbations $\delta_t$ into the nominal actions. We observe that with a low budget of $B = 3$ (Fig. 2(a)), only LAS is successful in attacking the RL agent. With a higher budget of 5 (Fig. 2(c)), MAS manages to affect the RL agent slightly, while LAS reduces the performance of the nominal agent severely.
Holding $B$ constant, an increase in $H$ allows the allocated $B$ to be distributed along the increased time horizon. In this case, a longer horizon dilutes the severity of each $\delta_t$ compared to shorter horizons. Comparing the same budget value at different horizons (i.e., horizons 5 and 10 for budget 3, Fig. 2(a) and Fig. 2(d), respectively), attacks for $H = 10$ are generally less severe than their $H = 5$ counterparts. Across all $B$ and $H$ combinations, we observe that MAS attacks are generally less effective than LAS attacks. Hence, we focus on studying LAS attacks further.
We performed an ablation study to compare the effectiveness of LAS and MAS attacks. We take the difference in each attack’s reduction in rewards (i.e., attack - nominal) to study how much additional reward can be removed. Fig. 3 is organized by spatial projection, with one subplot for each of the $\ell_1$ and $\ell_2$ spatial projections; both subplots use the same time projection. Each individual subplot shows three lines with different $H$, with each line visualizing the change in mean cumulative reward as the budget increases along the x-axis. As the budget increases, attacks in both $\ell_1$ and $\ell_2$ spatial projection show a monotonic decrease in cumulative rewards. Attacks with an $H$ value of 5 show a different trend between the two spatial projections: $\ell_2$ decreases linearly with increasing budget, while $\ell_1$ becomes stagnant after a $B$ value of 3. This can be attributed to the fact that the attacks are more sparsely distributed in $\ell_1$ attacks, causing most of the perturbations to be distributed into one joint. Thus, as the budget increases, we see a diminishing return for LAS, since actuating a joint beyond a certain limit does not decrease the reward any further.
Fig. 4 shows the action dimension decomposition of LAS attacks. The example shown in Fig. 4 is the result of one combination of projection in AS and projection in time. From Fig. 4(a), we observe that through all the episodes of LAS attacks, one of the action dimensions (i.e., the Up - Down (UD) direction of LL) is consistently perturbed more, i.e., accumulates more attack, than the Left - Right (LR) direction.
Fig. 4(b) shows a detailed view of action dimension attacks for one episode (Episode 1). It is evident from the figure that the UD action of the lunar lander is more prone to attacks throughout the episode than the LR action. Additionally, the LR action attack is restricted after a certain number of time steps and only the UD action is attacked further. Fig. 4(c) further corroborates the observation in the real LL environment. As episode 1 proceeds in Fig. 4(c), the LL initially lands on the ground in frame 3, but lifts up and remains in that condition until the episode ends in frame 5. From these studies, it can be observed that (correlated projection of AS with time in) LAS attacks can isolate the vulnerable action dimension(s) of the RL agent to mount a successful attack.
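The kind of per-dimension decomposition discussed above can be computed from a log of applied perturbations by accumulating the absolute perturbation in each action dimension. The sketch below is a hypothetical illustration: the function name, the two-dimensional layout labeled UD and LR, and the numbers are made up for demonstration and are not measurements from the experiments.

```python
def attack_share_per_dimension(deltas, names):
    """Fraction of the total absolute perturbation spent on each action dimension.

    deltas: list of perturbation vectors, one per time step.
    names:  label for each action dimension, in the same order as the entries.
    """
    totals = [0.0] * len(names)
    for delta in deltas:
        for i, value in enumerate(delta):
            totals[i] += abs(value)
    grand_total = sum(totals)
    if grand_total == 0.0:
        return {name: 0.0 for name in names}
    return {name: total / grand_total for name, total in zip(names, totals)}


# Made-up perturbation log for one episode (illustrative values only).
episode_deltas = [[0.3, 0.1], [0.4, 0.0], [0.5, -0.1]]
shares = attack_share_per_dimension(episode_deltas, ["UD", "LR"])
most_attacked = max(shares, key=shares.get)
```

A strongly uneven share across dimensions is the signal that one actuator absorbs most of the attack, which is how such a decomposition could point to candidate vulnerable dimensions.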
In this study, we present two novel attack strategies on an RL agent’s AS: a myopic attack (MAS) and a non-myopic attack (LAS). The results show that LAS attacks, which are crafted with explicit use of the agent’s dynamics information, are more powerful than MAS attacks. Additionally, we observed that applying LAS attacks on RL agents reveals the possibly vulnerable actuators of an agent, as seen from the non-uniform distribution of attacks on certain action dimensions. This can be leveraged as a tool to identify vulnerabilities and plan a mitigation strategy under similar attacks. Possible future work includes extending the concept of LAS attacks to state space attacks, where the agent’s observations are perturbed instead of the agent’s actions, while taking into account the dynamics of the agent. Another future avenue of this work will be to investigate different methods to robustify RL agents against these types of attacks. We speculate that while these AS attacks may be detectable by observing the difference in expected return, it will be hard for the RL agent to mitigate these attacks. This is because the adversarial perturbations on the action space are independent of the agent’s policy. In comparison, an RL agent might be able to learn a counter-policy to maintain a nominal reward if the adversarial perturbations are applied on the state space.
2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), 781–787. IEEE.
Machine Learning and Data Mining in Pattern Recognition, 262–275. Cham: Springer International Publishing.
Proceedings of the 26th International Joint Conference on Artificial Intelligence, 3756–3762. AAAI Press.
The IEEE International Conference on Computer Vision (ICCV).