Deep Reinforcement Learning (DRL) has gained widespread application in digital gaming, robotics, and control. In particular, the main DRL approaches, such as the value-based deep Q-network (DQN), Asynchronous Advantage Actor-Critic (A3C), and the population-based Go-Explore, have succeeded in mastering many dynamically unknown action-searching environments. Owing to their adaptive and interactive behaviors, DRL-based models are commonly used in the domains of navigation and robotics, and achieve a noticeable improvement over classical methods. However, despite the significant performance enhancement, DRL-based models may face new challenges in terms of system robustness against adversarial attacks. For example, DRL-based navigation systems are likely to propagate and even amplify risks (e.g., delays, noise, and pixel-wise pulsed signals on a vehicle's sensor network) induced by attackers. Besides, unlike image classification tasks, where only a single mission is involved, a navigation learning agent has to deal with a sequence of dynamic states (e.g., inputs from sensors or raw pixels) and the related rewards. Our work mainly focuses on the robustness analysis of strategically-timed attacks using potential noise incurred in real-world scenarios. More specifically, we formulate the adversarial attacks under two DRL security settings:
White-box attack: if the attacker can access the model parameters, a potential function is used to estimate the learning performance and decide when to jam in noise.
Black-box attack: without requiring model parameters, the attacker trains a policy agent with the opposite reward objective by observing the victim DRL network's actions, the state, and the reward from the environment.
To validate the adversarial robustness of a navigation system, we attempt a new and important research direction based on 3D environments of (1) continuous robot-arm control (e.g., Unity Reacher); (2) a sensor-input navigation system (e.g., Unity Banana Collector); and (3) raw images of self-driving environments (e.g., Donkey Car), as shown in Fig. 1 (a), (b), and (c).
2 Related work
Scheduling Physical Attacks on Sensor Fusion.
Sensor networks for navigation systems are susceptible to flooding-based attacks such as Pulsing Denial-of-Service (PDoS) and adversarial selective jamming attacks. Related work covers the security and robustness implications of background noise, spoofing pulses, and jamming signals on autonomous vehicles. For example, Yan et al. show that PDoS attacks can be feasibly conducted on a Tesla Model S automobile equipped with standard millimeter-wave radars, ultrasonic sensors, and forward-looking cameras. Besides, to detect anomalous network attacks, a sensing engine defined by offline algorithms is required within a built-in network system. Furthermore, a recent work also demonstrates that the LiDAR-based Apollo-Auto system could be fooled by adversarial noise during the 3D-point-cloud pre-processing phase via malicious reconstruction.
Adversarial Attacks on Deep Reinforcement Learning.
Many works are devoted to adversarial attacks on neural network classifiers in either white-box or black-box settings [12, 13]. Goodfellow et al. proposed adversarial examples for evaluating the robustness of machine learning classifiers. Zeroth-order optimization (ZOO) was employed to estimate the gradients of black-box systems for generating adversarial examples. Besides, work on RL-based adversarial attacks aims at addressing policy misconduct [15, 16] or generalization issues. In particular, Lin et al. developed a strategically-timed attack in which, at each time step, an agent takes an action based on a policy derived from a potential energy function. However, these approaches do not consider the update of online weights associated with the size of the action space. In this work, we further improve the potential estimation model by weighted-majority online learning, which enjoys a performance guarantee with the bound in Eq. (4). Besides, we introduce a more realistic black-box timed-attack setting.
3.1 Noisy Observation from the Real World
We define a noisy DRL framework of a robot learning system under perturbation, where the noisy state observation $\hat{s}_t$ is formulated as the addition of the clean state $s_t$ and a noise pattern $n_t$: $\hat{s}_t = s_t + n_t$ (1).
We propose three principal types of noise tests, denoted $N_1$ to $N_3$, drawn from the real world to impose adversarial timing attacks:
Pulsed Zero-out Attacks ($N_1$): Off-the-shelf hardware can affect the entire sensor network with an over-shooting noise incurred by a timing attack in Eq. (1), as shown in Fig. 1 (4).
Gaussian Average on Sensor Fusion ($N_2$): Sensor fusion is an essential part of an autonomous system, combining sensory data from disparate sources with less uncertainty. We define a noisy sensor fusion system by applying a Gaussian filter to obtain the perturbed observation in Eq. (2), as shown in Fig. 1 (5).
Adversarial Noise Patterns ($N_3$): Inspired by fast gradient sign method (FGSM) [12, 15] based DQN attacks, we use FGSM to generate adversarial patterns against the prediction loss of a well-trained DQN, under an $\ell_\infty$-norm restriction on the perturbation; the attacked input includes the state observation, and the optimal output action is obtained by weighting over the possible actions in Eq. (2).
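As an illustrative sketch (not the paper's implementation), the three noise patterns might be emulated as follows; `grad_fn` is a hypothetical callback returning the gradient of the DQN prediction loss with respect to the input state, and all names are assumptions:

```python
import numpy as np

def pulsed_zeroout(state):
    # N1: pulsed zero-out attack -- the over-shooting pulse saturates the
    # sensors, so the agent effectively observes an all-zero state.
    return np.zeros_like(state)

def gaussian_average(state, sigma=1.0):
    # N2: Gaussian averaging on sensor fusion -- smooth neighboring sensor
    # channels with a normalized Gaussian kernel, blurring the fused reading.
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    return np.convolve(state, kernel, mode="same")

def fgsm_pattern(state, grad_fn, eps=0.01):
    # N3: FGSM-style adversarial pattern under an l-infinity budget eps;
    # grad_fn(state) is assumed to return the loss gradient w.r.t. the input.
    return state + eps * np.sign(grad_fn(state))
```

Each function maps a clean observation $s_t$ to a perturbed one $\hat{s}_t$, matching the additive form of Eq. (1).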
To evaluate the performance of each timing-selection algorithm in the following sections, each model receives the noise patterns ($N_1$ to $N_3$), and we average the total reward as reported in Table 1. From a system-level perspective, we take the random pulsed signal as the attacking baseline: we jam in the PDoS signals discussed in Sec. 3.1 at random times, under a maximum budget of attack frames per episode (we adopt the baseline budget from prior work), to block the agent from obtaining actual state observations within an episode.
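A minimal sketch of this random pulsed-signal baseline, assuming a per-episode budget of attack frames (function and parameter names are illustrative):

```python
import random

def random_attack_schedule(episode_len, budget, seed=0):
    # Baseline: pick at most `budget` distinct random frames in an episode
    # at which the PDoS pulse is injected, blocking the true observation.
    rng = random.Random(seed)
    frames = rng.sample(range(episode_len), budget)
    return sorted(frames)
```

At each scheduled frame the agent would receive a zeroed-out observation instead of the true state.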
3.2 Enhanced White-Box Strategically-Timed Attack by Online Learning
White-box adversarial setting. Recently, as various pre-defined DRL architectures and models (e.g., Google Dopamine) have been released for public use and as keys to Business-to-Business (B2B) solutions, an adversarial attacker is likely to access the open-source code and design an efficient strategically-timed attack.
Weighted-Majority Potential Energy Function. We first propose an advanced adversarial attack that originates from online learning and builds on the weighted majority algorithm (WMA). The procedure is given in Eq. (3) and Algorithm 1, where we introduce experts for weighting the revenues incurred by taking actions. The expert weights are initialized equally and then iteratively updated as in step (12) of Algorithm 1. At each time step, we obtain the actions of maximum and minimum cost. The decision to attack the current state relies on a threshold value: if the gap exceeds a pre-specified constant threshold, we attack the state by adding pulses so that the agent receives random observations. The choice is based on the difference of two potential energy functions (inspired by prior work) defined in Eq. (3). (For potential energy estimation on a policy-based model, e.g., A3C, we use a weighted-majority average.)
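One step of this weighted-majority decision rule can be sketched as follows; this is a simplified rendering rather than the paper's Algorithm 1, and the per-expert Q-value matrix is an assumed input:

```python
import numpy as np

def wma_attack_decision(q_values, weights, threshold, eta=0.5):
    """One step of a weighted-majority timing attack (illustrative sketch).

    q_values:  (K, A) array -- K experts' estimated values over A actions
    weights:   (K,) expert weights, updated multiplicatively
    threshold: attack when the weighted max-min value gap exceeds it
    Returns (attack?, updated weights).
    """
    probs = weights / weights.sum()
    # Weighted-majority estimate of each action's value.
    fused = probs @ q_values
    # Gap between the best and worst action: a large gap marks a critical
    # time frame, so the attacker strikes now.
    gap = fused.max() - fused.min()
    attack = gap > threshold
    # Multiplicative update: penalize experts whose preferred action
    # disagrees with the weighted-majority best action.
    best = int(np.argmax(fused))
    losses = (np.argmax(q_values, axis=1) != best).astype(float)
    new_weights = weights * np.exp(-eta * losses)
    return attack, new_weights
```

Calling this once per time step yields a binary attack schedule while the expert weights adapt online.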
We use the strategically-timed attack of Lin et al. as a baseline with a threshold of 0.3 to evaluate our WMA-enhanced algorithm. We then derive a learning bound for this advanced WMA policy estimation.
Proposition 1: Assuming the total number of rounds is $T$, the weighted-majority algorithm enjoys the bound in Eq. (4), where $Z_t$ denotes a normalization term at time $t$.
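Eq. (4) itself is not reproduced in this text. As a reference point only, the standard multiplicative-weights analysis (e.g., Mohri et al.) tracks the normalization $Z_t = \sum_{k=1}^{K} w_{t,k}$ and, for losses in $[0,1]$, yields a bound of the form

```latex
L_T \;\le\; \frac{\ln K}{\eta} \;+\; \eta\, T \;+\; \min_{k} L_{T,k},
```

where $K$ is the number of experts, $\eta$ the learning rate, $L_T$ the learner's cumulative loss, and $L_{T,k}$ expert $k$'s cumulative loss; the paper's Eq. (4) is presumably of this family, though its exact constants are not recoverable here.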
Table 1: Average total reward (mean ± standard deviation) of each model under no attack (Baseline) and each attack method.

| Model | Baseline | Random | WMA | PEPG-ASA | Lin et al. |
|---|---|---|---|---|---|
| Continuous Robot-Arm Control with DQN | 30.2±2.1 | 22.8±0.4 | 4.2±1.0 | 6.4±1.3 | 5.2±1.2 |
| Continuous Robot-Arm Control with A3C | 30.1±3.6 | 23.2±0.5 | 3.2±0.7 | 5.2±1.0 | 5.6±1.3 |
| 3D Banana Collector Navigation with DQN | 12.1±2.1 | 10.8±2.8 | 3.2±2.3 | 7.4±1.9 | 6.9±1.6 |
| 3D Banana Collector Navigation with A3C | 12.1±1.6 | 9.6±1.7 | 3.4±1.1 | 5.3±1.4 | 5.2±1.3 |
| Donkey Car Navigation with DQN | 1.2±0.1 | 0.8±0.5 | 0.2±0.1 | 0.4±0.2 | 0.4±0.1 |
| Donkey Car Navigation with A3C | 1.1±0.4 | 0.8±0.2 | 0.3±0.2 | 0.6±0.3 | 0.6±0.2 |
3.3 Black-Box Strategically-Timed Attack by Adversarial Evolutionary Strategy
Black-box adversarial setting. Since an insidious adversarial attacking agent is hardly recognizable, it can drive the equilibrium of a DRL-based system toward an opposite reward objective without any information about the targeted DRL model. Thus, we propose an adversarial-strategic agent (ASA) trained via a population-based method built on parameter-exploring policy gradients (PEPG) to attack a black-box system. The PEPG-ASA algorithm dynamically selects sensitive time frames for jamming in the physical noise patterns of Section 3.1, aiming to minimize the total system reward from offline observations of input-output pairs, without accessing the actual parameters of the given DRL framework, as below:
observation: records of states $[s_1, \dots, s_T]$ and adversarial rewards $[\hat{r}_1, \dots, \hat{r}_T]$ against the victim navigation DRL agent, constituting a black-box security setting.
adversarial reward: the negative absolute value of the environmental reward, $\hat{r}_t = -|r_t|$.
An obvious way to maximize the expected adversarial return is to estimate its gradient. Differentiating this form of the expected return with respect to the hyper-parameters $\rho$, which determine the distribution over the policy parameters $\theta$ in Eq. (5), and applying sampling, the agent can generate histories $h$ from $p(\theta \mid \rho)$ and yield the following gradient estimator: $\nabla_\rho \mathbb{E}[R] = \mathbb{E}_{\theta \sim p(\theta\mid\rho)}\big[\nabla_\rho \log p(\theta \mid \rho)\, R(\theta)\big]$ (5).
The probabilistic policy, which PEPG parametrizes through a single parameter set $\theta$, has the advantage of taking deterministic actions, such that an entire history can be traced by sampling the parameter $\theta$.
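A hedged sketch of one PEPG update with symmetric sampling, following Sehnke et al.'s general scheme rather than the paper's exact attack loop; `fitness` would evaluate the adversarial reward of a sampled timing policy, and all names are illustrative:

```python
import numpy as np

def pepg_step(fitness, mu, sigma, alpha_mu=0.1, alpha_sigma=0.05,
              pop_size=20, rng=None):
    """One PEPG update (sketch): sample policy parameters theta ~ N(mu, sigma^2),
    evaluate fitness(theta), and move the search distribution toward
    high-fitness (i.e., highly damaging) timing policies. Symmetric sampling
    around mu reduces the variance of the gradient estimate."""
    rng = np.random.default_rng(0) if rng is None else rng
    eps = rng.normal(0.0, 1.0, size=(pop_size, mu.size)) * sigma
    r_plus = np.array([fitness(mu + e) for e in eps])
    r_minus = np.array([fitness(mu - e) for e in eps])
    # Likelihood-ratio gradient estimate for the mean.
    grad_mu = ((r_plus - r_minus)[:, None] * eps).mean(axis=0) / 2.0
    # Baseline-subtracted estimate for the standard deviation,
    # using the standard PEPG term (eps^2 - sigma^2) / sigma.
    baseline = np.concatenate([r_plus, r_minus]).mean()
    adv = (r_plus + r_minus) / 2.0 - baseline
    grad_sigma = (adv[:, None] * (eps ** 2 - sigma ** 2) / sigma).mean(axis=0)
    new_sigma = np.maximum(sigma + alpha_sigma * grad_sigma, 1e-3)
    return mu + alpha_mu * grad_mu, new_sigma
```

Iterating this step climbs the adversarial reward without ever querying the victim model's parameters, which is what makes the setting black-box.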
4.1 3D Control and Robot Learning Environment Setup
Our testing platforms were based on the recently released open-source Unity 3D environments for robotic applications.
Reacher: A double-jointed arm moves to a desired position. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocities of the arm. Each action is a vector of four numbers, corresponding to the torques applicable to the two joints. Each entry in the action vector is a numerical value between -1 and 1.
Banana Collector: A reward of +1 is provided for collecting a yellow banana and a reward of -1 for collecting a blue one; a first-person-view vehicle aims to collect as many yellow bananas as possible while avoiding blue bananas. The state space has 37 dimensions and contains the agent's velocity, along with a ray-based perception of objects around the agent's forward direction. Four discrete actions are available, associated with four moving directions.
Donkey Car: Donkey Car is an open-source embedded system for radio-controlled vehicles with an offline RL simulator. The state input is the image from the front camera with 80×80 pixels, the actions are two steering values ranging from -1 to 1, and the reward is based on the cross-track error (CTE). We use a modified reward, divided by 1,000, to balance staying on the track against maximizing speed.
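A hypothetical example of such a scaled reward; the exact shaping used here is not fully specified, so `shaped_reward` and its arguments are assumptions for illustration only:

```python
def shaped_reward(cte, speed, max_cte=2.0):
    # Illustrative Donkey Car reward shaping: penalize leaving the track,
    # otherwise reward speed weighted by how centered the car is, and
    # divide by 1,000 to keep the per-step reward small (as in Table 1).
    if abs(cte) > max_cte:
        return -1.0  # off track
    return (speed * (1.0 - abs(cte) / max_cte)) / 1000.0
```

The 1/1,000 scaling keeps the per-episode returns in the same range as the Donkey Car rows of Table 1.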
4.2 Performance Evaluation
We applied two classical DRL algorithms, namely DQN and A3C, and evaluated the learning performance of the well-trained DRL models under attack, as reported in Tab. 1.
Baseline (aka no attack): We modify the DQN and A3C models from the open-source Dopamine 2.0 package to avoid an over-parameterized model and to guarantee reproducibility.
Adversarial Robustness (aka under attack): Assuming the presence of one adversarial attacker, we highlight some important results. Overall, although WMA (white-box setting) outperforms PEPG-ASA (black-box setting), it also requires much more information about the navigation system during online potential-energy estimation and training. In Fig. 2, we show DQN results evaluated under the four attack methods, compared with the baseline performance: a random noise injector (Random), WMA, PEPG-ASA, and Lin et al., as reported in Tab. 1. WMA yields a stable threat and is a competitive attack method.
This work introduces two novel adversarial timing-attack algorithms for evaluating DRL-based model robustness under white-box and black-box adversarial settings. The experiments suggest that the improved performance of DRL-based continuous control and robot learning models can be significantly degraded in adversarial settings. In particular, both value-based and policy-based DRL algorithms are easily manipulated by a black-box adversarial attacking agent. Besides, our work points out the importance of robustness and adversarial training against adversarial examples in DRL-based navigation systems. Our future work will explore the visualization and interpretability of robot learning and control systems in order to secure them. To improve model defense, we could also adopt adversarial training to train the DQN and A3C models with noisy states.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529, 2015.
-  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
-  Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune, “Go-explore: a new approach for hard-exploration problems,” arXiv preprint arXiv:1901.10995, 2019.
-  Chao-Han Huck Yang, Yi-Chieh Liu, Pin-Yu Chen, Xiaoli Ma, and Yi-Chang James Tsai, “When causal intervention meets adversarial examples and image masking for deep neural networks,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3811–3815.
-  Tor A Johansen, Andrea Cristofaro, Kim Sørensen, Jakob M Hansen, and Thor I Fossen, “On estimation of wind velocity, angle-of-attack and sideslip angle of small uavs using standard sensors,” in 2015 International Conference on Unmanned Aircraft Systems (ICUAS). IEEE, 2015, pp. 510–519.
-  Arthur Juliani, Vincent-Pierre Berges, Esh Vckay, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange, “Unity: A general platform for intelligent agents,” arXiv preprint arXiv:1809.02627, 2018.
-  Xiapu Luo and Rocky KC Chang, “On a new class of pulsing denial-of-service attacks and the defense.,” in NDSS, 2005.
-  Alejandro Proano and Loukas Lazos, “Selective jamming attacks in wireless networks,” in Communications (ICC), 2010 IEEE International Conference on. IEEE, 2010, pp. 1–6.
-  Chen Yan, Wenyuan Xu, and Jianhao Liu, “Can you trust autonomous vehicles: Contactless attacks against sensors of self-driving vehicle.”
-  Yulong Cao, Chaowei Xiao, Dawei Yang, Jing Fang, Ruigang Yang, Mingyan Liu, and Bo Li, “Adversarial objects against lidar-based autonomous driving systems,” arXiv preprint arXiv:1907.05418, 2019.
-  Haoyang Fan, Fan Zhu, Changchun Liu, Liangliang Zhang, Li Zhuang, Dong Li, Weicheng Zhu, Jiangtao Hu, Hongye Li, and Qi Kong, “Baidu apollo em motion planner,” arXiv preprint arXiv:1807.08048, 2018.
-  Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and harnessing adversarial examples,” ICLR, 2015.
-  Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh, “Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017, pp. 15–26.
-  Alexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong, Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, Cihang Xie, et al., “Adversarial attacks and defences competition,” arXiv preprint arXiv:1804.00097, 2018.
-  Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel, “Adversarial attacks on neural network policies,” arXiv preprint arXiv:1702.02284, 2017.
-  Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun, “Tactics of adversarial attack on deep reinforcement learning agents,” arXiv preprint arXiv:1703.06748, 2017.
-  Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta, “Robust adversarial reinforcement learning,” arXiv preprint arXiv:1703.02702, 2017.
-  Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of machine learning, MIT press, 2012.
-  Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare, “Dopamine: A research framework for deep reinforcement learning,” arXiv preprint arXiv:1812.06110, 2018.
-  Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber, “Parameter-exploring policy gradients,” Neural Networks, vol. 23, no. 4, pp. 551–559, 2010.
-  Bharat Prakash, Mark Horton, Nicholas R Waytowich, William David Hairston, Tim Oates, and Tinoosh Mohsenin, “On the use of deep autoencoders for efficient embedded reinforcement learning,” in Proceedings of the 2019 on Great Lakes Symposium on VLSI. ACM, 2019, pp. 507–512.