1 Introduction
Deep Reinforcement Learning (DRL) has gained widespread application in digital gaming, robotics, and control. In particular, the main DRL approaches, such as the value-based deep Q-network (DQN) [1], Asynchronous Advantage Actor-Critic (A3C) [2], and the population-based Go-Explore [3], have succeeded in mastering many dynamically unknown action-searching environments [3]. Owing to the similarity between their adaptive and interactive behaviors, DRL-based models are commonly used in the domain of navigation and robotics and achieve noticeable improvements over classical methods. However, despite the significant performance gains, DRL-based models may introduce new challenges in terms of system robustness against adversarial attacks. For example, DRL-based navigation systems are likely to propagate and even amplify risks (e.g., delays, noise, and pixel-wise pulsed signals [4] on the sensor networks of a vehicle [5]) induced by attackers. Besides, unlike image classification tasks, where only a single mission is involved, a navigation learning agent has to deal with a sequence of dynamic states (e.g., inputs from sensors or raw pixels) and the related rewards. Our work mainly focuses on the robustness analysis of strategically-timed attacks using potential noises incurred in real-world scenarios. More specifically, we formulate adversarial attacks under two DRL security settings:

White-box attack: the attacker can access the model parameters, and a potential function is used to estimate the learning performance and decide when to jam in noise.

Black-box attack: without access to model parameters, the attacker trains a policy agent with the opposite reward objective by observing the actions of the victim DRL network, the states, and the rewards from the environment.
To validate the adversarial robustness of a navigation system, we pursue a new and important research direction based on 3D environments for (1) continuous robot-arm control (e.g., Unity Reacher); (2) a sensor-input navigation system (e.g., Unity Banana Collector [6]); and (3) raw-image self-driving environments (e.g., Donkey Car), as shown in Fig. 1 (a), (b), and (c).
2 Related work
Scheduling Physical Attacks on Sensor Fusion.
Sensor networks for navigation systems are susceptible to flooding-based attacks such as Pulsing Denial-of-Service (PDoS) [7] and adversarial selective jamming attacks [8]. Related work covers the security and robustness implications of background noise, spoofing pulses, and jamming signals on autonomous vehicles. For example, Yan et al. [9] show that PDoS attacks can feasibly be conducted on a Tesla Model S automobile equipped with standard millimeter-wave radars, ultrasonic sensors, and forward-looking cameras. Besides, to detect anonymous network attacks, a sensing engine defined by offline algorithms is required within a built-in network system. Furthermore, a recent work [10] demonstrates that the LiDAR-based Apollo-Auto system [11] can be fooled by adversarial noises during the 3D point-cloud preprocessing phase via a malicious reconstruction.
Adversarial Attacks on Deep Reinforcement Learning.
Many works are devoted to adversarial attacks on neural network classifiers in either white-box or black-box settings [12, 13]. Goodfellow et al. [14] proposed adversarial examples for evaluating the robustness of machine learning classifiers. Zeroth order optimization (ZOO) [13] was employed to estimate the gradients of black-box systems for generating adversarial examples. Besides, work on RL-based adversarial attacks aims at addressing policy misconduct [15, 16] or generalization issues [17]. In particular, Lin et al. [16] developed a strategically-timed attack method in which, at time $t$, an agent takes an action based on a policy derived from a potential energy function [18]. However, these approaches do not consider the update of online weights associated with the size of the action space. In this work, we further improve the potential estimation model of [16] by weighted-majority online learning, which enjoys a performance guarantee with the regret bound in Eq. (4). Besides, we introduce a more realistic black-box timed-attack setting.
3 Method
3.1 Noisy Observation from the Real World
We define a noisy DRL framework for a robot learning system under perturbation, where a noisy state observation $\tilde{s}_t$ can be formulated as the addition of a state $s_t$ and a noise pattern $n_t$:
$\tilde{s}_t = s_t + n_t. \qquad (1)$
We propose three principal types of noise tests (from $n_1$ to $n_3$) from the real world to impose adversarial timing attacks:
Pulsed Zero-out Attacks ($n_1$): off-the-shelf hardware [9] can affect the entire sensor network with an overshooting noise incurred by a timing attack on Eq. (1), as in Fig. 1 (4).
Gaussian Average on Sensor Fusion ($n_2$):
Sensor fusion is an essential part of an autonomous system, combining sensory data from disparate sources to reduce uncertainty. We define a noisy sensor fusion system by applying a Gaussian filter to the state observation of Eq. (1), as shown in Fig. 1 (5).
Adversarial Noise Patterns ($n_3$): inspired by the fast gradient sign method (FGSM) [12, 15] based DQN attacks, we use FGSM to generate adversarial patterns against the prediction loss $J$ of a well-trained DQN, under a restriction on the $\ell_\infty$ norm, where $x$ is the full input including $s_t$ and $n_t$, and $y^*$ is the optimal output action obtained by weighting over the possible actions in Eq. (2):
$n_3 = \epsilon \cdot \mathrm{sign}\!\left(\nabla_x J(x, y^*)\right), \quad y^* = \arg\max_a Q(x, a). \qquad (2)$
To evaluate the performance of each timing-selection algorithm in the following sections, each model receives the noise patterns (from $n_1$ to $n_3$) and we average the total reward, as reported in Table 1. At the system level, we take the random pulsed signal as an attacking baseline: we randomly jam in the PDoS signals discussed in Sec. 3.1, subject to a maximum number of attacks per episode (we adopt the attack budget of [16] as a baseline), to block the agent from obtaining actual state observations, as sketched below.
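As a concrete illustration, a minimal sketch (not the paper's exact implementation) of the three noise patterns and the random pulsed-signal baseline follows; NumPy/SciPy are assumed, and `grad_of_loss` is a hypothetical handle to the white-box gradient of the victim DQN's loss:

```python
# Minimal sketch of the noise patterns n1-n3 (Sec. 3.1) and the random
# pulsed-signal baseline. Assumes NumPy/SciPy; `grad_of_loss` must be
# supplied by white-box access to the victim DQN.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def pulsed_zero_out(state):
    """n1: a pulse that blocks the sensor reading for this frame (zeroed out)."""
    return np.zeros_like(state)

def gaussian_average(state, sigma=1.0):
    """n2: noisy sensor fusion, sketched as 1-D Gaussian smoothing of the state."""
    return gaussian_filter1d(state.astype(float), sigma=sigma)

def fgsm_pattern(grad_of_loss, eps=0.01):
    """n3: FGSM pattern under an l-infinity budget eps, as in Eq. (2)."""
    return eps * np.sign(grad_of_loss)

def random_pulse_schedule(episode_len, budget, seed=0):
    """Baseline: pick at most `budget` random frames per episode to attack."""
    rng = np.random.default_rng(seed)
    return set(rng.choice(episode_len, size=budget, replace=False).tolist())
```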
3.2 Enhanced White-Box Strategically-Timed Attack by Online Learning
White-box adversarial setting. Since various predefined DRL architectures and models (e.g., Google Dopamine [19]) have recently been released for public use and as keys to Business-to-Business (B2B) solutions, an adversarial attacker is likely to access the open-source code and design an efficient strategically-timed attack.
Weighted-Majority Potential Energy Function.
We first propose an advanced adversarial attack that originates from online learning and builds on the weighted majority algorithm (WMA). The procedure is shown in Eq. (3) and Algorithm 1, where we introduce experts for weighting the revenues incurred by taking actions. The weights of the experts are initialized equally and then iteratively updated, as in step 12 of Algorithm 1. At each time $t$, the algorithm obtains the actions with the maximum and minimum weighted costs. The decision to attack the state relies on a threshold: if the gap $c_t$ is greater than a pre-specified constant threshold $\beta$, we attack the state by adding pulses so that the agent receives random observations. The value of $c_t$ is the difference of two potential energy functions (inspired by [16] and [15]), defined in Eq. (3)¹:
$c_t = \max_a \pi_w(a \mid s_t) \;-\; \min_a \pi_w(a \mid s_t), \qquad (3)$

where $\pi_w$ denotes the weighted-majority action preference (for DQN, a softmax over the Q-values).

¹For potential energy estimation on a policy-based model (e.g., A3C), we use a weighted-majority average of the policy outputs.
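For illustration, a minimal sketch of the WMA timing rule of Algorithm 1 under our reading of Eq. (3) follows; the expert scores, learning rate `eta`, and class interface are assumptions rather than the exact implementation:

```python
# Minimal sketch of the weighted-majority (WMA) timing rule: K experts hold
# weights over action preferences, c_t is the gap between the strongest and
# weakest weighted action, and an attack fires when c_t exceeds beta.
import numpy as np

class WMATimedAttack:
    def __init__(self, n_experts, eta=0.5, beta=0.3):
        self.w = np.ones(n_experts)   # equal initial expert weights
        self.eta, self.beta = eta, beta

    def should_attack(self, expert_scores):
        """expert_scores: (K, A) per-expert preference over the A actions,
        e.g. softmaxed Q-values from K perturbed value estimates."""
        z = self.w.sum()                        # normalization term Z_t
        pref = self.w @ expert_scores / z       # weighted-majority vote
        c_t = pref.max() - pref.min()           # gap c_t of Eq. (3)
        return c_t > self.beta

    def update(self, losses):
        """losses: (K,) loss of each expert this round; the multiplicative
        update (step 12 of Algorithm 1) shrinks poorly performing experts."""
        self.w = self.w * np.exp(-self.eta * losses)
```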
We use the strategically-timed attack of [16] with $\beta = 0.3$ as a baseline to evaluate our WMA-enhanced algorithm. We then discuss a learning bound for this advanced WMA policy estimation.
Proposition 1: For a total number of rounds $T$ and $K$ experts, the weighted algorithm enjoys the bound in Eq. (4), where $Z_t = \sum_{k=1}^{K} w_{t,k}$ denotes the normalization term at time $t$:
$R_T \;\le\; \frac{\log K}{\eta} + \frac{\eta T}{8} \;\le\; \sqrt{\frac{T}{2}\,\log K}, \qquad (4)$

where $\eta$ is the learning rate of the weight update and $R_T$ is the regret with respect to the best expert.
Proposition 1 [18] suggests that the weighted revenues are more likely to reach the global optimum in theory, since the regret of Algorithm 1 after $T$ rounds is upper bounded by a sublinear term, so the average per-round regret vanishes.
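For reference, a sketch of the standard potential-function argument from [18] that yields a bound of this form, written in our notation with $K$ experts and learning rate $\eta$:

```latex
% Potential-function argument for the weighted-majority bound (cf. [18]),
% with Z_t = \sum_{k=1}^{K} w_{t,k} the normalization term of Proposition 1.
\begin{align*}
\log\tfrac{Z_{T+1}}{Z_1} &\le -\eta \textstyle\sum_{t=1}^{T} \hat{\ell}_t + \tfrac{\eta^2 T}{8}
  && \text{(Hoeffding's lemma, applied per round)}\\
\log\tfrac{Z_{T+1}}{Z_1} &\ge -\eta \min_k \textstyle\sum_{t=1}^{T} \ell_{t,k} - \log K
  && \text{(keep only the best expert's weight)}\\
\Rightarrow\; R_T &\le \tfrac{\log K}{\eta} + \tfrac{\eta T}{8}
  = \sqrt{\tfrac{T}{2}\log K}
  && \text{for } \eta = \sqrt{8\log K/T}.
\end{align*}
```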
Table 1: Average total reward (± std) of each model under no attack (Baseline) and under each attack method.

Model | Baseline | Random | WMA | PEPG-ASA | Lin et al. [16]
Continuous Robot-Arm Control with DQN [1] | 30.2 ± 2.1 | 22.8 ± 0.4 | 4.2 ± 1.0 | 6.4 ± 1.3 | 5.2 ± 1.2
Continuous Robot-Arm Control with A3C [2] | 30.1 ± 3.6 | 23.2 ± 0.5 | 3.2 ± 0.7 | 5.2 ± 1.0 | 5.6 ± 1.3
3D Banana Collector Navigation with DQN | 12.1 ± 2.1 | 10.8 ± 2.8 | 3.2 ± 2.3 | 7.4 ± 1.9 | 6.9 ± 1.6
3D Banana Collector Navigation with A3C | 12.1 ± 1.6 | 9.6 ± 1.7 | 3.4 ± 1.1 | 5.3 ± 1.4 | 5.2 ± 1.3
Donkey Car Navigation with DQN | 1.2 ± 0.1 | 0.8 ± 0.5 | 0.2 ± 0.1 | 0.4 ± 0.2 | 0.4 ± 0.1
Donkey Car Navigation with A3C | 1.1 ± 0.4 | 0.8 ± 0.2 | 0.3 ± 0.2 | 0.6 ± 0.3 | 0.6 ± 0.2
3.3 Black-Box Strategically-Timed Attack by Adversarial Evolutionary Strategy
Black-box adversarial setting. Since an insidious adversarial attacking agent is hardly recognizable, it can drive the equilibrium of a DRL-based system toward an opposite reward objective without any information about the targeted DRL model. Thus, we propose an adversarial-strategic agent (ASA) via a population-based training method based on parameter-exploring policy gradients (PEPG) [20] to optimize against a black-box system. The PEPG-ASA algorithm dynamically selects sensitive time frames for injecting the physical noise patterns of Section 3.1, aiming to minimize the total system reward from online observation of the input-output pairs, without accessing the actual parameters of the given DRL framework, as follows:

observation: records of states $[s_1, s_2, \ldots, s_T]$ and adversarial rewards $[r^{\mathrm{adv}}_1, r^{\mathrm{adv}}_2, \ldots, r^{\mathrm{adv}}_T]$ against the victim navigation DRL agent, which constitutes the black-box security setting.

adversarial reward $r^{\mathrm{adv}}_t$: the negative absolute value of the environmental reward, $r^{\mathrm{adv}}_t = -\lvert r_t \rvert$.
An obvious way to maximize the expected adversarial return $J(\rho)$ is to estimate its gradient $\nabla_\rho J(\rho)$. Differentiating this form of the expected return with respect to $\rho$, where $\rho$ in Eq. (5) denotes the parameters determining the distribution over the policy parameters $\theta$, and applying sampling methods, the agent can generate a history $h$ from $p(h \mid \theta)$ and obtain the following gradient estimator:
$\nabla_\rho J(\rho) \;\approx\; \frac{1}{N} \sum_{n=1}^{N} \nabla_\rho \log p(\theta^n \mid \rho)\, r(h^n). \qquad (5)$
The probabilistic policy, which PEPG parametrizes through a distribution over a single parameter vector $\theta$, has the advantage of taking deterministic actions, such that an entire track of history can be traced by sampling the parameter $\theta$.
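A minimal sketch of one PEPG update implementing the estimator in Eq. (5) follows; the `rollout` function, population size, and learning rate are hypothetical placeholders, and the full PEPG of [20] additionally uses symmetric sampling around the mean:

```python
# Minimal sketch of one PEPG update for the black-box ASA (Eq. (5)).
# `rollout` is a hypothetical function that runs one episode with attack
# parameters theta and returns the adversarial return sum_t -|r_t|.
import numpy as np

def pepg_step(mu, sigma, rollout, pop_size=20, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    thetas = rng.normal(mu, sigma, size=(pop_size, mu.size))  # theta^n ~ p(.|rho)
    returns = np.array([rollout(th) for th in thetas])        # r(h^n)
    adv = returns - returns.mean()                            # baseline subtraction
    diff = thetas - mu
    # Gaussian score functions: d log p / d mu = diff / sigma^2,
    #                           d log p / d sigma = (diff^2 - sigma^2) / sigma^3
    grad_mu = (adv[:, None] * diff / sigma**2).mean(axis=0)
    grad_sigma = (adv[:, None] * (diff**2 - sigma**2) / sigma**3).mean(axis=0)
    return mu + lr * grad_mu, sigma + lr * grad_sigma
```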
4 Results
4.1 3D Control and Robot Learning Environment Setup
Our testing platforms are based on the recently released open-source Unity3D environments [6] for robotic applications.
Reacher:
A double-jointed arm moves to a desired position. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocities of the arm. Each action is a vector of four numbers, corresponding to the torques applied to the two joints. Each entry in the action vector should be a numerical value between −1 and 1.
Banana Collector: A reward of +1 is provided for collecting a yellow banana, and a reward of −1 is provided for collecting a blue banana; the first-person-view agent aims to collect as many yellow bananas as possible while avoiding blue bananas. The state space has 37 dimensions and contains the agent's velocity, along with a ray-based perception of objects around the agent's forward direction. Four discrete actions are available, corresponding to four moving directions.
Donkey Car: Donkey Car is an open-source embedded system for radio-controlled vehicles with an offline RL simulator. The state input is the image from the front camera with 80 × 80 pixels, the actions are two steering values ranging from −1 to 1, and the reward is based on the cross-track error (CTE). We use a modified reward from [21], divided by 1,000, to balance track-staying and speed maximization.
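For concreteness, the per-episode evaluation described above can be sketched as follows, assuming a Gym-style wrapper around these environments (`env`, `agent.act`, `attack_rule`, and `perturb` are hypothetical names):

```python
# Sketch of the per-episode evaluation used for Table 1. `attack_rule` is a
# timing policy (random schedule, WMA, or PEPG-ASA) and `perturb` injects one
# of the noise patterns n1-n3 from Sec. 3.1.
def evaluate_under_attack(env, agent, attack_rule, perturb, max_steps=1000):
    state, total_reward = env.reset(), 0.0
    for _ in range(max_steps):
        if attack_rule(state):          # decide per frame whether to attack
            state = perturb(state)      # the agent only sees the noisy state
        action = agent.act(state)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```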
4.2 Performance Evaluation
We applied two classical DRL algorithms, namely DQN and A3C, to evaluate the learning performance of well-trained DRL models under attack, as reported in Table 1.
Baseline (aka no attack):
We modify the DQN and A3C models from the open-source Dopamine 2.0 [19] package to avoid an over-parameterized model and to provide a reproducibility guarantee.
Adversarial Robustness (aka under attack):
Assuming the presence of a single adversarial attacker, we highlight some important results. Overall, although WMA (white-box setting) outperforms PEPG-ASA (black-box setting), it also requires much more information about the navigation system during online potential-energy estimation and training.
In Fig. 2, we show results of DQN evaluated against the four attack methods of Table 1: a random noise injector (Random), WMA, PEPG-ASA, and Lin et al. [16], compared with the baseline performance. WMA consistently poses a stable threat and is a competitive attack method.
5 Conclusion
This work introduces two novel adversarial timing attack algorithms for evaluating the robustness of DRL-based models under white-box and black-box adversarial settings. The experiments suggest that the performance of DRL-based continuous control and robot learning models can be significantly degraded in adversarial settings. In particular, both value-based and policy-based DRL algorithms are easily manipulated by a black-box adversarial attacking agent. Besides, our work points out the importance of robustness and adversarial training against adversarial examples in DRL-based navigation systems. Our future work will address the visualization and interpretability of robot learning and control systems in order to secure such systems. To improve model defense, we could also adapt adversarial training [12] to train the DQN and A3C models with noisy states.
References
 [1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [2] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
 [3] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune, “Go-explore: a new approach for hard-exploration problems,” arXiv preprint arXiv:1901.10995, 2019.
 [4] Chao-Han Huck Yang, Yi-Chieh Liu, Pin-Yu Chen, Xiaoli Ma, and Yi-Chang James Tsai, “When causal intervention meets adversarial examples and image masking for deep neural networks,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3811–3815.
 [5] Tor A Johansen, Andrea Cristofaro, Kim Sørensen, Jakob M Hansen, and Thor I Fossen, “On estimation of wind velocity, angle-of-attack and sideslip angle of small UAVs using standard sensors,” in 2015 International Conference on Unmanned Aircraft Systems (ICUAS). IEEE, 2015, pp. 510–519.
 [6] Arthur Juliani, Vincent-Pierre Berges, Esh Vckay, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange, “Unity: A general platform for intelligent agents,” arXiv preprint arXiv:1809.02627, 2018.
 [7] Xiapu Luo and Rocky KC Chang, “On a new class of pulsing denial-of-service attacks and the defense,” in NDSS, 2005.
 [8] Alejandro Proano and Loukas Lazos, “Selective jamming attacks in wireless networks,” in Communications (ICC), 2010 IEEE International Conference on. IEEE, 2010, pp. 1–6.
 [9] Chen Yan, Wenyuan Xu, and Jianhao Liu, “Can you trust autonomous vehicles: Contactless attacks against sensors of self-driving vehicle,” in DEF CON 24, 2016.
 [10] Yulong Cao, Chaowei Xiao, Dawei Yang, Jing Fang, Ruigang Yang, Mingyan Liu, and Bo Li, “Adversarial objects against LiDAR-based autonomous driving systems,” arXiv preprint arXiv:1907.05418, 2019.
 [11] Haoyang Fan, Fan Zhu, Changchun Liu, Liangliang Zhang, Li Zhuang, Dong Li, Weicheng Zhu, Jiangtao Hu, Hongye Li, and Qi Kong, “Baidu apollo em motion planner,” arXiv preprint arXiv:1807.08048, 2018.
 [12] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and harnessing adversarial examples,” ICLR, 2015.

 [13] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh, “ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017, pp. 15–26.
 [14] Alexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong, Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, Cihang Xie, et al., “Adversarial attacks and defences competition,” arXiv preprint arXiv:1804.00097, 2018.
 [15] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel, “Adversarial attacks on neural network policies,” arXiv preprint arXiv:1702.02284, 2017.
 [16] Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun, “Tactics of adversarial attack on deep reinforcement learning agents,” arXiv preprint arXiv:1703.06748, 2017.
 [17] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta, “Robust adversarial reinforcement learning,” arXiv preprint arXiv:1703.02702, 2017.
 [18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of machine learning, MIT press, 2012.
 [19] Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare, “Dopamine: A research framework for deep reinforcement learning,” arXiv preprint arXiv:1812.06110, 2018.
 [20] Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber, “Parameterexploring policy gradients,” Neural Networks, vol. 23, no. 4, pp. 551–559, 2010.

 [21] Bharat Prakash, Mark Horton, Nicholas R Waytowich, William David Hairston, Tim Oates, and Tinoosh Mohsenin, “On the use of deep autoencoders for efficient embedded reinforcement learning,” in Proceedings of the 2019 Great Lakes Symposium on VLSI. ACM, 2019, pp. 507–512.