Enhanced Adversarial Strategically-Timed Attacks against Deep Reinforcement Learning

02/20/2020 ∙ by Chao-Han Huck Yang, et al. ∙ 15

Recent techniques based on deep neural networks, especially those capable of system-level self-adaptation such as deep reinforcement learning (DRL), have shown many advantages for optimizing robot learning systems (e.g., autonomous navigation and continuous robot arm control). However, learning-based systems and their associated models may be threatened by intentionally adaptive (e.g., noisy sensor confusion) and adversarial perturbations arising in real-world scenarios. In this paper, we introduce timing-based adversarial strategies against a DRL-based navigation system by jamming physical noise patterns into selected time frames. To study the vulnerability of learning-based navigation systems, we propose two adversarial agent models: one based on online learning and the other on evolutionary learning. In addition, three open-source robot learning and navigation control environments are employed to study the vulnerability under adversarial timing attacks. Our experimental results show that adversarial timing attacks can lead to a significant performance drop, and suggest the necessity of enhancing the robustness of robot learning systems.




1 Introduction

Deep reinforcement learning (DRL) has gained widespread application in digital gaming, robotics, and control. In particular, the main DRL approaches, such as the value-based deep Q-network (DQN) [1], Asynchronous Advantage Actor-Critic (A3C) [2], and the population-based Go-Explore [3], have succeeded in mastering many action-searching environments with unknown dynamics [3]. Because of the similarity between their adaptive and interactive behaviors, DRL-based models are commonly used in navigation and robotics, where they achieve a noticeable improvement over classical methods. However, despite the significant performance gains, DRL-based models raise new challenges for system robustness against adversarial attacks. For example, DRL-based navigation systems are likely to propagate and even amplify risks induced by attackers (e.g., delays, noise, and pixel-wise pulsed signals [4] on the sensor networks of a vehicle [5]). Moreover, unlike image classification tasks, where only a single mission is involved, a navigation learning agent has to deal with multiple dynamic states (e.g., inputs from sensors or raw pixels) and the related rewards. Our work mainly focuses on the robustness analysis of strategically-timed attacks using noise that could arise in real-world scenarios. More specifically, we formulate the adversarial attacks under two DRL security settings:

  • White-box attack: if the attacker has access to the model parameters, a potential function is used to estimate the learning performance and decide when to jam in noise.

  • Black-box attack: without requiring the model parameters, the attacker trains a policy agent with the opposite reward objective by observing the actions of the victim DRL network, the state, and the reward from the environment.

To validate the adversarial robustness of a navigation system, we pursue a new and important research direction based on 3D environments: (1) continuous robot arm control (e.g., Unity Reacher); (2) a sensor-input navigation system (e.g., Unity Banana Collector [6]); and (3) raw images from a self-driving environment (e.g., Donkey Car), as shown in Fig. 1 (1), (2), and (3).

2 Related work

Scheduling Physical Attacks on Sensor Fusion. Sensor networks for navigation systems are susceptible to flooding-based attacks such as Pulsing Denial-of-Service (PDoS) [7] and adversarial selective jamming attacks [8]. Related work covers security and robustness against background noise, spoofing pulses, and jamming signals on autonomous vehicles. For example, Yan et al. [9] show that PDoS attacks can feasibly be conducted on a Tesla Model S automobile equipped with standard millimeter-wave radars, ultrasonic sensors, and forward-looking cameras. In addition, to detect anonymous network attacks, a sensing engine defined by offline algorithms is required within a built-in network system. Furthermore, recent work [10] demonstrates that the LiDAR-based Apollo-Auto system [11] can be fooled by adversarial noise injected as a malicious reconstruction during the 3D point-cloud pre-processing phase.
Adversarial Attacks on Deep Reinforcement Learning. Many works are devoted to adversarial attacks on neural network classifiers in either white-box or black-box settings [12, 13]. Goodfellow et al. [14] proposed adversarial examples for evaluating the robustness of machine learning classifiers. Zeroth order optimization (ZOO) [13] was employed to estimate the gradients of black-box systems for generating adversarial examples. In addition, work on RL-based adversarial attacks aims at addressing policy misconduct [15, 16] or generalization issues [17]. In particular, Lin et al. [16] developed a strategically-timed attack method in which, at each time step, an agent takes an action based on a policy derived from a potential energy function [18]. However, these approaches do not consider the online update of weights associated with the size of the action space. In this work, we further improve the potential estimation model from [16] with weighted-majority online learning, which has a performance guarantee given by the bound in Eq. (4). We also introduce a more realistic black-box timed-attack setting.

Figure 1: The 3D robot learning environments: (1) continuous robot arm control; (2) banana collector; (3) self-driving Donkey Car. Noisy observations under timing attack: (4) zero-out; (5) random sensor fusion; (6) adversarial perturbation.

3 Method

3.1 Noisy Observation from the Real World

We define a noisy DRL framework of a robot learning system under perturbation, where a noisy state observation $\tilde{s}_t$ can be formulated as the addition of a state $s_t$ and a noise pattern $n_t$:

$$\tilde{s}_t = s_t + n_t \qquad (1)$$
We propose three principal types of noise pattern from the real world to impose adversarial timing attacks:

  • Pulsed Zero-out Attacks: Off-the-shelf hardware [9] can affect the entire sensor network through an overshooting noise injected by a timing attack in Eq. (1), as shown in Fig. 1 (4).

  • Gaussian Average on Sensor Fusion: Sensor fusion is an essential part of an autonomous system; it combines sensory data from disparate sources to reduce uncertainty. We define a noisy sensor fusion system by applying a Gaussian filter, as in Eq. (2) and shown in Fig. 1 (5).

  • Adversarial Noise Patterns: Inspired by DQN attacks based on the fast gradient sign method (FGSM) [12, 15], we use FGSM to generate adversarial patterns against the prediction loss of a well-trained DQN. The perturbation is subject to a norm restriction, the input comprises all inputs to the network, and the target is the optimal output action obtained by weighting over the possible actions, as given by Eq. (2).
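The three noise patterns can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default `sigma` and `epsilon` values, the use of additive Gaussian noise to stand in for the noisy sensor fusion, and the assumption that the loss gradient is supplied by the caller are all hypothetical choices made for illustration.

```python
import random


def zero_out(state):
    """Pulsed zero-out: the whole observation is blanked on this time step."""
    return [0.0 for _ in state]


def gaussian_fusion(state, sigma=0.1, rng=random):
    """Noisy sensor fusion (assumed form): add zero-mean Gaussian noise per reading."""
    return [s + rng.gauss(0.0, sigma) for s in state]


def fgsm_perturb(state, loss_grad, epsilon=0.01):
    """FGSM-style step: shift each input by epsilon along the sign of the loss gradient."""
    def sign(g):
        return (g > 0) - (g < 0)

    return [s + epsilon * sign(g) for s, g in zip(state, loss_grad)]


def observe(state, attack_now, noise_fn):
    """Return the clean state, or its perturbed version on an attacked time step."""
    return noise_fn(state) if attack_now else list(state)
```

A timing strategy then only has to decide the boolean `attack_now` at each step; the choice of noise pattern is independent of the choice of timing.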


To evaluate the performance of each timing selection algorithm in the following sections, each model receives all three noise patterns, and the total reward is averaged as reported in Table 1. From a system-level perspective, we take the random pulsed signal as the attacking baseline: we jam in the PDoS signals discussed in Sec. 3.1 at random time steps, subject to a maximum number of attacks per episode (taken from [16] as a baseline), to block the agent from obtaining the actual state observations.
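The random baseline can be sketched as drawing attack time steps uniformly under a budget. The function name and parameters below are hypothetical, and the sketch assumes the attacker always spends the full budget, which the text does not state.

```python
import random


def random_attack_schedule(episode_length, max_attacks, seed=None):
    """Pick distinct time steps to jam, uniformly at random, within the attack budget."""
    rng = random.Random(seed)
    budget = min(max_attacks, episode_length)
    return sorted(rng.sample(range(episode_length), budget))
```

For example, `random_attack_schedule(1000, 50, seed=0)` returns 50 distinct step indices in `[0, 1000)` at which the observation would be replaced by a noise pattern.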

3.2 Enhanced White-Box Strategically-Timed Attack by Online Learning

White-box adversarial setting. Various pre-defined DRL architectures and models (e.g., Google Dopamine [19]) have recently been released for public use and as a key component of Business-to-Business (B2B) solutions, so an adversarial attacker is likely to have access to the open-source models and to design an efficient strategically-timed attack.
Weighted-Majority Potential Energy Function. We first propose an advanced adversarial attack that originates from online learning and is based on the weighted majority algorithm (WMA). The WMA procedure is shown in Eq. (3) and Algorithm 1, where we introduce experts that weight the revenues incurred by taking actions. The expert weights are initialized equally and then updated iteratively as in step (12) of Algorithm 1. At each time step, steps (7) and (8) yield the actions of maximum and minimum cost. The decision to attack a state relies on a threshold function: if its value exceeds a pre-specified constant threshold, we attack the state by adding pulses so that the user receives random observations. The threshold function is based on the difference of two potential energy functions (inspired by [16] and [15]), defined in Eq. (3). (Footnote 1: for potential energy estimation on a policy-based model, e.g., A3C, we use a weighted-majority average.)


We use the strategically-timed attack in [16] as a baseline, with = 0.3, to evaluate our WMA-enhanced algorithm. We then discuss a learning bound for this WMA-based policy estimation.
Proposition 1: Assuming that the total number of rounds , the weighted algorithm enjoys the bound as Eq.(4), where denotes a normalization term at time .


Proposition 1 [18] suggests that, in theory, the weighted revenues are more likely to reach the global optimum, since the regret is upper bounded by a constant value in Algorithm 1.

  1. Input: number of experts, ; number of rounds, , a threshold constant .
  2. Parameter: , expert weight associated with actions.
  3. Initialize: .
  4. For
  5.      Set , where .
  6.      Receive revenues from all experts .
  7.      .
  8.      .
  9.      Compute the threshold function .
  10.     If :
  11.         Attack the state by shuffle
  12.     Update rule , .
Algorithm 1 Adversarial Online Learning Attack based on Weighted Majority Algorithm
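The overall flow of Algorithm 1 can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: treating each action as an expert, taking softmax preferences over Q-values as the revenues, and using an exponential weight update with step size `eta` are all assumptions, since the exact expressions are not recoverable from the text above. The class and parameter names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class WeightedMajorityTimer:
    """Decide when to attack from a weighted preference gap (assumed form)."""

    def __init__(self, num_actions, threshold, eta=0.5):
        # Equal initial weights, one expert per action (assumption).
        self.weights = [1.0 / num_actions] * num_actions
        self.threshold = threshold
        self.eta = eta

    def step(self, q_values):
        """Return (attack, gap) for this time step and update the expert weights."""
        total = sum(self.weights)
        probs = [w / total for w in self.weights]
        prefs = softmax(q_values)  # revenue reported by each expert (assumption)
        scores = [p * r for p, r in zip(probs, prefs)]
        norm = sum(scores)
        scores = [s / norm for s in scores]
        # Gap between the most and least preferred actions.
        gap = max(scores) - min(scores)
        attack = gap > self.threshold
        # Exponential (multiplicative) weight update, then renormalize.
        self.weights = [w * math.exp(self.eta * r) for w, r in zip(self.weights, prefs)]
        z = sum(self.weights)
        self.weights = [w / z for w in self.weights]
        return attack, gap
```

The intuition matches the text: when the victim strongly prefers one action over the others, the gap is large and disturbing the observation at that step is expected to hurt most; when all actions look alike, the attack budget is saved.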
Model                                        Baseline      Random        WMA          PEPG-ASA     Lin et al. [16]
Continuous Robot-Arm Control with DQN [1]    30.2 ± 2.1    22.8 ± 0.4    4.2 ± 1.0    6.4 ± 1.3    5.2 ± 1.2
Continuous Robot-Arm Control with A3C [2]    30.1 ± 3.6    23.2 ± 0.5    3.2 ± 0.7    5.2 ± 1.0    5.6 ± 1.3
3D-Banana Collector Navigation with DQN      12.1 ± 2.1    10.8 ± 2.8    3.2 ± 2.3    7.4 ± 1.9    6.9 ± 1.6
3D-Banana Collector Navigation with A3C      12.1 ± 1.6     9.6 ± 1.7    3.4 ± 1.1    5.3 ± 1.4    5.2 ± 1.3
Donkey Car Navigation with DQN                1.2 ± 0.1     0.8 ± 0.5    0.2 ± 0.1    0.4 ± 0.2    0.4 ± 0.1
Donkey Car Navigation with A3C                1.1 ± 0.4     0.8 ± 0.2    0.3 ± 0.2    0.6 ± 0.3    0.6 ± 0.2
Table 1: Comparison of the performance of the timing attack algorithms on the three environments. We evaluate the robustness of a robot learning system under four types of strategically-timed attack algorithm: random selection; the weighted majority algorithm (WMA); the parameter-exploring policy gradients [20] adversarial strategy agent (PEPG-ASA); and Lin et al. [16]. All experiments are run ten times under three different types of noise pattern (zero-out, Gaussian, and adversarial noise), and the total rewards are averaged.
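To make the size of the degradation easier to read, the mean rewards can be converted to a relative drop against the no-attack baseline. The relative drop is a derived quantity that the paper does not report; the sketch below simply applies (baseline − attacked) / baseline to the robot-arm DQN row and ignores the standard deviations.

```python
def relative_drop(baseline_reward, attacked_reward):
    """Fraction of the no-attack reward that is lost under attack."""
    return (baseline_reward - attacked_reward) / baseline_reward


# Mean rewards from the robot-arm DQN row of Table 1 (standard deviations omitted).
baseline = 30.2
reacher_dqn = {"Random": 22.8, "WMA": 4.2, "PEPG-ASA": 6.4, "Lin et al. [16]": 5.2}

for method, reward in reacher_dqn.items():
    print(f"{method}: {relative_drop(baseline, reward):.1%} reward drop")
```

On this row, the WMA attack removes roughly 86% of the baseline reward, compared with roughly 25% for random jamming.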

3.3 Black-Box Strategically-Timed Attack by Adversarial Evolutionary Strategy

Black-box adversarial setting. Because an insidious attacking agent is hard to recognize, an adversarial agent can drive the equilibrium of a DRL-based system toward an opposite reward objective without any information about the targeted DRL model. We therefore propose an adversarial-strategic agent (ASA) trained with a population-based method, parameter-exploring policy gradients (PEPG) [20], to optimize against a black-box system. The PEPG-ASA algorithm dynamically selects sensitive time frames for jamming in the physical noise patterns of Section 3.1. It aims to minimize the total system reward using only observations of the input-output pairs, without access to the actual parameters of the given DRL framework, as follows:

  • observation: records of state from [, ,…, ] and adversarial reward against victim navigation DRL-agent from [, ,…, ], an adversarial reward as a black-box security setting.

  • adversarial reward : a negative absolute value of the environmental reward .

An obvious way to maximize is to estimate . Differentiating this form of the expected return with respect to and applying sampling methods, where in Eq. (5) are the parameters determining the distribution over , the agent can generate h from and yield the following gradient estimator:


The probabilistic policy, which is parametrized over a single parameter for PEPG, has the advantage of taking deterministic actions such that an entire track of history can be traced by sampling the parameter .
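A parameter-space policy gradient step of this kind can be sketched as below. It is a simplified illustration, not the PEPG-ASA implementation: it samples policy parameters from independent Gaussians, uses a plain score-function estimator with a mean-return baseline, and keeps the standard deviations fixed, whereas PEPG [20] also adapts them. The names `pepg_step`, `evaluate`, `population`, and `lr` are hypothetical.

```python
import random


def adversarial_reward(env_reward):
    """Adversarial reward: the negative absolute value of the environmental reward."""
    return -abs(env_reward)


def pepg_step(mu, sigma, evaluate, population=20, lr=0.1, rng=random):
    """One update of the parameter means from sampled parameter vectors.

    mu, sigma: per-parameter mean and (fixed) standard deviation.
    evaluate:  maps a sampled parameter vector to a scalar return.
    """
    samples = [[rng.gauss(m, s) for m, s in zip(mu, sigma)] for _ in range(population)]
    returns = [evaluate(theta) for theta in samples]
    baseline = sum(returns) / population  # variance-reduction baseline
    new_mu = list(mu)
    for i in range(len(mu)):
        # d/dmu log N(theta; mu, sigma^2) = (theta - mu) / sigma^2
        grad = sum(
            (r - baseline) * (theta[i] - mu[i]) / sigma[i] ** 2
            for theta, r in zip(samples, returns)
        ) / population
        new_mu[i] = mu[i] + lr * grad
    return new_mu
```

Because each sampled parameter vector defines a deterministic policy, the attacker only needs the returns of whole episodes, which fits the black-box setting: no gradients or weights of the victim model are required.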

4 Results

4.1 3D Control and Robot Learning Environment Setup

Our testing platforms were based on the most recently released open-source ‘Unity-3D’ environments [6] for robotic applications.

 Reacher: A double-jointed arm moves to a desired position. A reward of +0.1 is provided for each step in which the agent's hand is in the goal location. The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocities of the arm. Every action is a vector of four numbers, corresponding to the torque applied to the two joints. Each entry in the action vector should be a numerical value between -1 and 1.

 Banana Collector: A first-person-view vehicle aims to collect as many yellow bananas as possible while avoiding blue bananas. A reward of +1 is provided for collecting a yellow banana, and a reward of -1 for collecting a blue banana. The state space has 37 dimensions and contains the agent's velocity along with ray-based perception of objects around the agent's forward direction. Four discrete actions are available, corresponding to four moving directions.

 Donkey Car: Donkey Car is an open-source embedded system for radio-controlled vehicles with an offline RL simulator. The state input is the 80 × 80 pixel image from the front camera, the actions are two steering values ranging from -1 to 1, and the reward is based on the cross-track error (CTE). We use a modified reward from [21], divided by 1k, to balance track-staying against maximizing speed.

4.2 Performance Evaluation

We applied two classical DRL algorithms, DQN and A3C, and evaluated the learning performance under attack relative to the well-trained DRL models in Tab. 1.
Baseline (i.e., no attack): We adapt the DQN and A3C models from the open-source Dopamine 2.0 [19] package, which helps avoid overparameterized models and supports reproducibility.
Adversarial Robustness (i.e., under attack): Assuming the presence of one adversarial attacker, we highlight some important results. Overall, although WMA (white-box setting) outperforms PEPG-ASA (black-box setting), it also requires much more information about the navigation system during online potential-energy estimation and training. Fig. 2 shows a DQN agent evaluated under the four attack methods listed in Tab. 1, namely a random noise injector (Random), WMA, PEPG-ASA, and Lin et al. [16], compared with the baseline performance. WMA poses a stable threat and is a competitive attack method.

Figure 2: Learning performance of a DQN agent tested in the Unity 3D Banana Collector, a 3D navigation task, including the baseline (i.e., no attack), random jamming, WMA, PEPG-ASA, and the method of Lin et al. [16].

5 Conclusion

This work introduces two novel adversarial timing attack algorithms for evaluating the robustness of DRL-based models under white-box and black-box adversarial settings. The experiments suggest that the performance of DRL-based continuous control and robot learning models can be significantly degraded in adversarial settings. In particular, both value-based and policy-based DRL algorithms are easily manipulated by a black-box adversarial attacking agent. Our work also points out the importance of robustness and of adversarial training against adversarial examples in DRL-based navigation systems. Our future work will address the visualization and interpretability of robot learning and control systems in order to secure them. To improve model defense, we could also adapt adversarial training [12] to train DQN and A3C models on noisy states.