1. Introduction
The machine learning field has recently witnessed significant progress on image recognition He et al. (2016); Vaswani et al. (2017) and robotic control Lillicrap et al. (2016). However, machine learning algorithms have recently been shown to leak private information about individual training data points Shokri et al. (2017). The private information could be personal health data, transaction history, or personal photos containing sensitive content. For example, in the black-box membership inference attack setting, it is possible to determine whether certain individual data points were used to train a model given only black-box access to it Shokri et al. (2017). This indicates that personal private information may be leaked from trained machine learning models. The goal of our work is to explore whether similar issues occur in reinforcement learning (RL), especially in deep reinforcement learning (DRL). DRL has recently achieved great successes in solving computer games Mnih et al. (2015), robotic control tasks Lillicrap et al. (2016), and autonomous driving Pan et al. (2017); Gao et al. (2018). Since DRL has great potential to be applied in many real-world applications whose trained models may involve personal private data, it is important to study the vulnerability of DRL to potential privacy-stealing attacks. However, existing work on privacy in machine learning mostly concerns supervised learning models such as classification and regression; the privacy issue in reinforcement learning algorithms has not been studied before.
We present a motivating example that demonstrates the possibility of private information leakage from a DRL policy, followed by discussions of attack strategies. In a simple Grid World environment, the agent's task is to navigate from its current location to the goal location while avoiding collisions with obstacles. In Figure 1, the gray boxes represent obstacles, the red box denotes the agent, and the green box shows the agent's goal. A reinforcement learning policy trained with DQN Mnih et al. (2015) on vision input (the aerial view of the Grid World) follows the trajectory indicated by the red line towards the goal. Surprisingly, we observe that the agent follows the same trajectory even when we remove all obstacles from the Grid World. In other words, the trained policy reveals the optimal actions for the original map without ever seeing that map again, and therefore leaks private information about the original Grid World. The explanation for this observation is that the DQN agent "memorizes" the training map instead of acquiring the ability to perform visual navigation. Furthermore, this information can be used to infer the map structure, given that we know the optimal action at every location. This motivating example shows that it is possible to infer private information from a trained deep reinforcement learning policy.
Different from the supervised learning case, we define the problem of stealing private information from DRL policies as inferring certain sensitive characteristics of the training environment's transition dynamics given black-box access to the trained policy. To analyze the worst-case scenario, we assume a powerful attacker with access to certain components of the environment, including the state space, the action space, the initial state distribution, and the reward function. Notably, inferring the transition dynamics is itself an ill-posed problem. Similar to inverse reinforcement learning Abbeel and Ng (2004), in which more than one reward function can explain the observed behaviors, in our setting more than one transition dynamics can explain the given black-box policy. As a result, it is in general not possible to infer the exact transition dynamics from the given policy. Rather, we study it as a private information leakage problem in the following two settings: first, the attacker knows nothing about the training environment except that the environment satisfies some common constraints; second, the attacker has access to a set of candidate environment dynamics.
In both settings, we consider the black-box scenario, which only allows us to query the policy but not to access the policy parameters. In the first setting, the attacker is aware of a set of explicit constraints on the transition dynamics and tries to infer the transition dynamics. For instance, in the Grid World game, the attacker knows that one action can move the agent at most 1 step away from any location. In the second setting, the attacker is provided with a set of candidate robot configurations in robotic control tasks, along with a policy trained in one unknown configuration, and the goal is to infer which configuration was used for training. This setting is motivated by the practical scenario where different robotic systems are available on the market for inspection. By inferring the actual dynamics configuration from these accessible candidates, it may be possible to tell where the robot was made or other information about the robot. If the policy for a specific robot is attacked this way, an attacker can infer the specific configuration of the robot and pose a great threat to the system's privacy. In this setting, our attack trains separate shadow policies under each of the available configurations and uses the performance on all candidate environments to train a classifier that maps policies to configurations.
Our contributions are listed as follows.
- To the best of our knowledge, this is the first work to study the privacy leakage problem in DRL.
- We formulate the problem with the goal of inferring environment dynamics under different scenarios, including when the attacker has limited knowledge (explicit constraints) about the training environment and when the attacker has access to a set of candidate training environment dynamics.
- We propose two algorithms to perform privacy attacks in DRL: transition dynamics search via a genetic algorithm, and candidate inference via shadow policies.
- We perform a comprehensive privacy analysis of DRL in several environments: a navigation task with LiDAR perception as input, robots in continuous control environments, and an autonomous driving simulator. We show that we obtain high recovery rates in different scenarios.
2. Related Work
Sensitivity of RL on Training Environments. It has been shown in previous work that RL policy training is highly sensitive to training hyperparameters, and slight variation of these parameters can lead to significantly different results Henderson et al. (2018). RL is also known to have a hard time generalizing to an unseen domain Tamar et al. (2016) and typically overfits to the original training environment, though adding randomization during training helps improve the robustness of RL algorithms to a limited extent Rajeswaran et al. ([n. d.]). These characteristics indicate that RL models may have implicitly memorized the training environment and are vulnerable to attacks that try to steal private training environment information.
Robust RL and Transfer Learning. Previous work in robust RL seeks to train an RL policy that can work in different environments with varying dynamics or visual scenes Pinto et al. (2017); Pan et al. (2019); Sadeghi and Levine (2017); Rajeswaran et al. ([n. d.]); Justesen et al. (2018). Policies trained in multiple environments generalize better to an unseen target domain than policies trained in a single environment Rajeswaran et al. ([n. d.]). Intuitively, RL models that generalize well to different environments may also be more robust against privacy-leakage attacks.
Privacy-Stealing Attacks Against Machine Learning Models. Membership inference attacks on machine learning models were first studied by Shokri et al. (2017) and further studied by Song et al. (2017); Carlini et al. (2018). The attack is performed on a trained model to tell whether or not a specific data point is in the training set. On the defense side, differential privacy for machine learning algorithms protecting training data has also been framed and studied by Shokri and Shmatikov (2015); Abadi et al. (2016). However, most of the work on both sides focuses on classification models, where the data are collected offline for training. In the reinforcement learning setting, the data are collected online during training, so inferring whether a single data point was used to train the model is not applicable. Instead, in this paper, we propose to infer the environment transition dynamics. One related work is privacy-preserving reinforcement learning by Sakuma et al. (2008). However, that work is in a multi-agent reinforcement learning setting, and the private information refers to privacy between individual agents' knowledge rather than the training data privacy discussed in this work.
Inverse Reinforcement Learning. Inverse reinforcement learning infers the reward function given expert demonstrations Abbeel and Ng (2004); Ziebart et al. (2008). In our case, we are given black-box access to a well-trained reinforcement learning policy, and the task is to infer the most likely dynamics coefficients or the environment transition dynamics. Our work is also related to the work by Herman et al. (2016). However, that work assumes that experience data in the original environment are given, while we make no such assumption.
3. Privacy-Leakage Attack on Deep Reinforcement Learning
In this section, we first formulate our problem under the reinforcement learning setting and then propose algorithms to infer private information about the RL training environment.
3.1. Problem Definition
We follow the formulation of the standard reinforcement learning problem. The environment of reinforcement learning is modeled as a Markov Decision Process (MDP), which consists of the state space S, the action space A, the transition dynamics T, and the reward function R. Usually a discount factor γ is applied to the accumulated reward to discount future rewards. The transition dynamics T is defined as a probability mapping from state-action pairs to states, T: S × A → Δ(S). The goal of a reinforcement learning algorithm is to learn a policy π that maps each state to a probability distribution over actions, π: S → Δ(A), so as to maximize the expected return E[Σ_{t=0}^{H} γ^t R(s_t, a_t)], where H is the length of the horizon.
In this work, we are interested in the problem of inferring the transition dynamics T of an MDP given a well-trained policy π and the other components of that MDP. Recovering the transition probability T is a fundamentally ill-posed problem: multiple T's can explain the same observed policy. Therefore, we further assume the attacker has access to prior knowledge of some structural constraints on the environment. Specifically, we consider the following two scenarios. In the first scenario, we assume the attacker has knowledge of some explicit constraints on T; the attacker tries to find a T that both satisfies the constraints and best explains the observed policy. In the second scenario, we assume the attacker knows a set of candidate T's; the goal of the attacker is to determine which one of the candidates is the original transition dynamics used for training.
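As a small illustration of the return objective above, the discounted return of a single episode can be computed as follows. This is a minimal sketch; the function name and the list-of-rewards interface are ours, not part of any RL library.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute the discounted return sum_t gamma^t * r_t of one episode,
    given the per-step rewards in order."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```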
3.2. Methodology
In this section, we introduce two methods that respectively solve the two problems defined above.
3.2.1. Transition Dynamics Search by Genetic Algorithm
For the first scenario, we formulate the problem as searching for a T that both satisfies certain constraints and best explains the observed policy. We propose to solve this problem with a Genetic Algorithm (GA). GAs are a class of search algorithms inspired by the process of natural selection Davis (1991); Such et al. (2017); different GAs have been proposed based on different bio-inspired mutation and selection operators Deb et al. (2000); Horn et al. (1994). In this paper, we use a basic GA to demonstrate the possibility of the attack; our goal is not to show that GA is the only or the best method for solving the problem.
In each iteration, we maintain a population of transition dynamics T' that satisfy the known constraints. We use the following fitness score (Equation 1) to characterize how similar the optimal policy induced by a candidate T' is to the given policy π:

    F(T') = (1/|S|) Σ_{s ∈ S} sim(π*_{T'}(s), π(s)),    (1)

where S is the state space of the environment, π*_{T'} is the optimal policy under the environment with transition dynamics T', which we obtain by independently training a policy with candidate transition dynamics T', and sim(·, ·) is a similarity metric on the action space. The goal is to find a T' that maximizes this similarity score. The top candidates (elite population) sorted by the fitness (similarity) score are kept for the next generation. Each of the other candidates is generated from two candidates (called parents) of the last generation. Our GA variant selects each parent by randomly sampling two candidates and choosing the one with the higher score. A two-point crossover of the selected parents is then used to generate a child candidate, and random mutation is applied to the child. A detailed description of our GA is in Algorithm 1. By running this GA, we aim to find a T' that is close to the original transition dynamics T that induced the provided policy π.
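The fitness score can be sketched as follows, assuming policies are callables from states to actions and `sim` is the chosen similarity metric; all names here are illustrative, not from the paper's implementation.

```python
def fitness(pi_candidate, pi_target, states, sim):
    """Average action similarity between the optimal policy induced by a
    candidate dynamics and the given target policy, over the state space."""
    return sum(sim(pi_candidate(s), pi_target(s)) for s in states) / len(states)

# For deterministic (DQN-style) policies, sim is exact action agreement:
def exact_agreement(a, b):
    return 1.0 if a == b else 0.0
```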
3.2.2. Candidate Inference with Shadow Policies
For the second scenario, we present an algorithm to perform candidate environment dynamics inference. Since a policy tends to behave differently in environments with perturbed transition dynamics, we use the given policy's performance under all candidate environments to infer which candidate it was trained on. To build a classification model, we first construct the training dataset by training K shadow policies under each of the N candidate environment dynamics, each initialized with a different random seed. The policy π_{i,j} is the j-th policy trained on candidate transition dynamics T_i, where i ∈ {1, ..., N} and j ∈ {1, ..., K}. Then, for each of the N × K policies, we collect its episodic rewards under all N environment dynamics, with M trials each. We construct the feature vector for each policy from the mean and variance of its episodic rewards over the M trials in each environment, so the feature vector for each policy is a vector in R^{2N}. We then fit a classifier on the feature vectors, with labels corresponding to the environment dynamics candidates. At test time, given a target policy π, we build its feature vector in the same way, calculating the episodic rewards over M trials in each of the N dynamics, and predict which environment dynamics the policy was originally trained on. The detailed algorithm is described in Algorithm 2.
3.3. Relation to Inverse RL
Our defined problem is related to inverse RL. For an MDP tuple (S, A, T, R), standard RL assumes that all four items are available and learns an optimal policy π* to maximize the expected return. Inverse RL (IRL) assumes that only (S, A, T, π*) are available and infers R, given the optimal policy π* from expert demonstrations. That is, RL seeks to learn the optimal policy π* given a reward function R, while IRL learns the reward function R that best supports the given policy π*. Different from IRL and RL, our proposed transition dynamics recovery problem assumes that only (S, A, R, π*) are available and infers T: our work studies how to infer the transition dynamics T that best supports the given policy π*.
4. What Map Did You Walk On?
In this section, we present an example showing how an attacker could recover private information (the floor plan of a Grid World environment) with the GA-based search method under the first scenario defined previously.
4.1. Experimental Settings
Performing navigation on a given map is a fundamental problem in robotics, and recent studies address it in an end-to-end manner using DRL-based algorithms Mirowski et al. (2016); Zhu et al. (2017). As the Grid World is a commonly used environment for testing RL algorithms and a reasonable simplification of the navigation task, the experiments in this section are based on Grid World environments.
We are interested in the following problem: given a DRL agent well trained on some specific grid map, can we recover the map (or at least part of it)? This kind of attack poses a real threat to the practical use of RL agents if the attacker can infer the floor plan structures of privacy-sensitive areas just by having access to a navigation robot's learned policy. We introduce the following setup to approximate the case of a navigation robot in the real world; in particular, the grid maps are designed to resemble real floor plans. An example is presented in Figure 2.
We consider the case where the agent takes LiDAR readings as input instead of aerial-view visual data. More specifically, the observation space is composed of positive real-valued distances in 8 directions (the 4 cardinal and 4 intercardinal directions). The action space contains five actions: move left, move right, move up, move down, and stay. The reward is 1 if the agent reaches the goal, -0.1 if the agent stays in place or collides with an obstacle, and 0 otherwise. The goal location is fixed for each grid map. RL agents are trained to full convergence on the environment before testing. The detailed map constraints and policy training details are included below.
Floor Plan Constraints. All grid maps are designed following common sense in real-world architectural design. In particular, a floor plan, together with its boundary walls, should satisfy the following constraints. First, the free-space grids (including the goal) form a connected graph, meaning the shortest navigation distance between any two free-space grids is finite. Second, there is one and only one goal position on the grid map, and the goal grid is considered free space, not an obstacle. Third, the thickness of walls must be equal to 1; in other words, there should not be any 2 × 2 block of obstacles, or any obstacle containing such a block. Note that the boundary walls are not considered part of the map and are treated as a known prior. An example floor plan is shown in Figure 2.
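The constraints can be checked mechanically. The sketch below is our own illustration (with 1 denoting an obstacle cell and 0 free space, and the goal given as coordinates): it validates a candidate floor plan via a breadth-first search for connectivity and a scan for 2 × 2 obstacle blocks.

```python
from collections import deque

def valid_floor_plan(grid, goal):
    """Check the floor plan constraints: the goal cell is free space,
    all free cells are connected, and no 2x2 all-obstacle block exists
    (wall thickness 1). grid[r][c] is 1 for obstacle, 0 for free."""
    n, m = len(grid), len(grid[0])
    if grid[goal[0]][goal[1]] != 0:          # goal must be free space
        return False
    free = [(r, c) for r in range(n) for c in range(m) if grid[r][c] == 0]
    # BFS from one free cell must reach every free cell.
    seen, queue = {free[0]}, deque([free[0]])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < m and grid[nr][nc] == 0 \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    if len(seen) != len(free):               # free space is disconnected
        return False
    # No 2x2 block of obstacles anywhere.
    for r in range(n - 1):
        for c in range(m - 1):
            if grid[r][c] and grid[r + 1][c] and grid[r][c + 1] and grid[r + 1][c + 1]:
                return False
    return True
```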
Policy Training Details. The target policies for a set of specific floor plan designs are trained using DQN Mnih et al. (2015). The policy network structure is shown in Table 1. We train the agents using the aforementioned reward design. For DQN, we use an epsilon-greedy exploration schedule with the exploration rate decreasing from 1 to 0.02 between step 0 and step 100,000. We also train agents with vanilla policy gradient methods to obtain stochastic policies. The policy gradient agents share the same network architecture, environment reward functions, and exploration schedule as DQN.
Type  Input Dim.  Output Dim. 

Linear  8  64 
Linear  64  64 
Linear  64  5 
Table 1. Architecture of the network used for LiDAR-input agents. ReLU activations are applied after each linear layer except the last one.
4.2. Attack Implementation
Under this setting, we implement the GA-based search method to recover the map structure. We randomly generate 20 test floor plans of size 7 × 7 according to the constraints mentioned earlier. The fitness score (similarity score) measures how well a grid map's transition dynamics fits a given policy π. For the deterministic policy (DQN), the similarity metric is 1 if the action selections of the target policy and the current policy agree, and 0 if they disagree. For stochastic policies trained with the policy gradient method (PG), we use the L2 distance between the action probability distributions: we set a threshold ε such that the similarity metric returns 1 if the L2 distance between the two policies' action distributions is smaller than ε, and 0 otherwise. Maximizing the score is therefore equivalent to minimizing the policy difference.
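The thresholded L2 similarity for stochastic policies can be written as follows. This is a sketch; the function name and the default threshold value are illustrative.

```python
def pg_similarity(p, q, eps=0.1):
    """Return 1 if the L2 distance between two action probability
    distributions p and q is below the threshold eps, else 0."""
    dist = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    return 1.0 if dist < eps else 0.0
```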
Genetic Algorithm Details. In our GA variant, a population of size P is evolved iteratively. The population is composed of candidate map solutions m_i, each represented by a 0-1 vector (0 for free space and 1 for wall). At every iteration (called a generation), each m_i is evaluated using Equation 1 to produce a fitness score, and the population is sorted by score. The top candidates (elite population) are kept for the next generation. Each of the other candidates is generated from two candidates (called parents) of the last generation: our GA variant selects each parent by randomly sampling two candidates and choosing the one with the higher score, then applies a two-point crossover of the selected parents to generate a child candidate, and finally applies random mutation to the child with mutation rate p. The operators used in our GA variant are as follows. First, the crossover operator: for a two-point crossover on vectors of length n, two random crossover positions a < b are generated for each parent pair, and the resulting child candidate is the concatenation of the first parent's segment [0, a), the second parent's segment [a, b), and the first parent's segment [b, n). This process is shown in Figure 6. Second, the mutation operator: we preset a fixed mutation rate p (0.05 in our experiments); each bit of a child vector generated by crossover is flipped independently with probability p.
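The two operators can be sketched on 0-1 vectors as follows; this is our own minimal implementation of the operators described above, not the paper's code.

```python
import random

def two_point_crossover(p1, p2):
    """Child takes p1[0:a), then p2[a:b), then p1[b:], for random a < b."""
    a, b = sorted(random.sample(range(len(p1) + 1), 2))
    return p1[:a] + p2[a:b] + p1[b:]

def mutate(child, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if random.random() < rate else bit for bit in child]
```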
Based on the fitness scores defined previously, we implement the introduced GA search method. As previous work has shown that GAs tend to converge to local optima Rocha and Neves (1999), we run the proposed GA search with 8 different random seeds and select the highest-scoring result as our final search result. In addition, we compare GA search with two baseline search methods: random search and RL-based search.
Random Search: The most direct method that can be applied here is random search. In each search iteration, we randomly generate a grid map and calculate its score according to Equation 1, and we take the map with the highest score as the solution. Since there is no guarantee that random search will find a solution, a time limit (5000 s) is set for the search algorithm.
RL-based Search: Reinforcement learning has recently been used to solve combinatorial optimization and search problems Bello et al. (2017). Here we use reinforcement learning as a baseline to search for a Grid World map. The MDP for RL is defined as follows: the state space consists of all possible map configurations; the action space is discrete and consists of 49 actions for a map of shape 7 × 7; the reward function is defined the same as the GA's fitness score; the terminal condition is when the number of episode steps exceeds 100. We use DQN as the reinforcement learning search algorithm with an epsilon-greedy policy for exploration. More specifically, for the test maps of size 7 × 7, the input to the DQN network is a 49-dimensional binary vector representing the current guessed map, and the outputs are the Q-values of the 49 actions, where each action toggles a specific location between obstacle and free space. The goal position does not change and is considered known prior information. The RL algorithm runs for 250,000 steps before terminating, with the exploration rate decaying linearly from 1 to 0.02 between step 0 and step 240,000. The running time limit is also set to 5000 s. This RL-based search uses a three-layer fully connected neural network, with each layer's output size 64 except the last layer, whose output size is 49.
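The transition of this search MDP simply toggles one cell of the guessed map. A minimal sketch (the function name is ours, for illustration):

```python
def toggle_cell(state, action):
    """RL-based search transition: flip one cell of the guessed map.
    `state` is a 49-dimensional 0-1 list and `action` an index in [0, 49)."""
    next_state = list(state)
    next_state[action] = 1 - next_state[action]
    return next_state
```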
Table 2 summarizes the accuracy of floor plan recovery by different methods (the best recovery rate). As we can see, with a similar running time, the GA based search method outperforms the other two methods and achieves high accuracy in terms of map structure recovery. Therefore, our attack method is able to recover the map structure and poses a great threat to the privacy of DRL.
To better understand the GA results, we take the floor plan in Figure 2 as an example. Figure 3 visualizes the final search results from the GA. The curves of the population average score and the average recovery rate are plotted in Figure 4 and Figure 5. As can be seen, the quality of the maps recovered from the policy gradient (PG) agent is significantly better than that from the DQN agent. This is intuitive, because the PG agent reveals much more information than the DQN agent: with the complete action probability distribution, it is easier to tell whether an agent has been trained on a given state. Comparing the curves of the two cases (DQN vs. PG), the population average score from DQN is higher and more concentrated than that from PG, yet all runs of DQN agents show a lower average recovery rate (approximately 70%), while some runs of PG agents reach recovery rates higher than 90%. This is due in part to the score function used for PG: in DQN, the score function carries only binary information (same or different selected action), while in PG the whole action probability distribution is compared, which provides more information about the uncertainty in action selection.
Environment  Agent  Task  Method  Recovery Rate  Run Time (s) 
Grid World  DQN  Transition Dynamics Search  Random Search  61.31%  5000 
Grid World  DQN  Transition Dynamics Search  RLbased Search  81.63%  5000 
Grid World  DQN  Transition Dynamics Search  Genetic Algorithm  89.55%  4511 
Grid World  PG  Transition Dynamics Search  Random Search  68.55%  5000 
Grid World  PG  Transition Dynamics Search  RLbased Search  71.42%  5000 
Grid World  PG  Transition Dynamics Search  Genetic Algorithm  95.83%  4063 
Hopper  PPO  Candidate Inference  SVM  81.30%   
Half Cheetah  PPO  Candidate Inference  SVM  82.30%   
Bipedal Walker  PPO  Candidate Inference  SVM  88.90%   
TORCS  DQN  Candidate Inference  SVM  94.30%   
5. Which Bot Did You Use?
In this section, we present results showing how an attacker could acquire private information using candidate inference with shadow policies under the second scenario defined previously. The hypothetical attack scenario arises when an attacker tries to identify which version of a robot was used, based on the released policy. We conduct the experiments on a suite of continuous control benchmarks including the RoboSchool versions of the MuJoCo tasks Hopper and HalfCheetah, the Box2D environment Bipedal Walker, and TORCS, a car racing simulator.
Hopper. The hopper is a monopod robot with four links, corresponding to the torso, thigh, shin, and foot, and three actuated joints. The goal for the robot is to learn to walk on a plane without falling over by applying continuous-valued forces to its joints. The reward at each time step combines the progress made, the costs of the movements, e.g., electricity, and penalties for collisions. Three environment parameters can be varied: (1) torso density, (2) sliding friction of the joints, and (3) power of the forces applied to each joint. We construct 6 candidate configurations by altering these parameters, namely 'LightTorso', 'HeavyTorso', 'SlipperyJoints', 'RoughJoints', 'Weak', and 'Strong'.
Half Cheetah. Half-cheetah is a bipedal robot with eight links and six actuated joints corresponding to the thighs, shins, and feet. The goal, reward structure, parameters, and the way we construct candidate configurations are the same as those for Hopper.
Bipedal Walker. Bipedal Walker is a bipedal robot whose task is to move to the far end of the terrain; the agent receives a reward if it succeeds. The state space consists of the hull angle speed, angular velocity, horizontal speed, vertical speed, positions of the joints, joint angular speeds, leg contact with the ground, and 10 LiDAR rangefinder measurements. We construct 9 candidate configurations by changing the value of the hull angle speed.
TORCS. TORCS is an open-source car racing simulator containing multiple categories of environments with different road conditions and terrain. We use the Michigan Way environment, a racing road with sunny weather, in the modified TORCS simulator from Pan et al. (2017). Reinforcement learning models trained on TORCS take RGB images from the simulator as input and can execute 9 discrete actions: move forward and keep the current speed, move forward and accelerate, move forward and decelerate, turn left and keep the current speed, turn left and accelerate, turn left and decelerate, turn right and keep the current speed, turn right and accelerate, and turn right and decelerate. The reward function is defined as v cos α when there is no collision and -2.5 when there is a collision, where v is the current speed of the car and α is the angle between the speed direction and the road direction. The game terminates when the number of steps exceeds 1000 or when the car collides with obstacles 3 times. We construct 5 configurations with different surface friction coefficients for the road, sidewalk, and fences, changing the friction of the surfaces from very slippery, to slippery, to normal, to rough, and to very rough.
Note that the parameters involved in constructing candidate environments are important private information. For example, in the TORCS environment, the surface friction coefficients can help to infer the material used to construct the road or indoor environment. Hypothetical attackers can use such information to potentially know more about the background of the autonomous driving environment. The road’s friction coefficient can also reflect the weather conditions where the policy is trained.
For the Hopper and HalfCheetah environments, we use the Proximal Policy Optimization (PPO) algorithm Schulman et al. (2017) to train policies. For each dynamics configuration, we train for 10,000 episodes until convergence. For Bipedal Walker, we use PPO to train a policy on each configuration for 3,000 episodes until convergence. While training the PPO policies, we use a random environment seed for each episode to avoid overfitting to the seed. The policy and value functions are multilayer perceptrons (MLPs) with two hidden layers of 64 units each and hyperbolic tangent activations; there is no parameter sharing.
For the TORCS environment, we use DQN to train the policy to be attacked. For all configurations, we train for one million steps, with the exploration rate decaying from 1 to 0.02 between step 0 and step 500,000 and held at 0.02 thereafter. The DQN network consists of 3 convolutional layers and 2 fully connected layers and takes RGB images as input. The network architecture is shown in Table 3.
Type  Kernel  Stride  Output Channels 
Conv.  8  4  32 
Conv.  4  2  64 
Conv.  3  1  64 
Linear      512 
Linear      9 
For each configuration, we train policies with 32 random seeds, splitting the seeds into training and testing sets: 8 of the 32 policies are used to generate the training samples for the classifier, and the rest are used for evaluation. The accuracy is computed by averaging over all configurations. Each task has N candidate dynamics, and N varies across tasks: 6 candidates for Hopper and HalfCheetah, 9 for Bipedal Walker, and 5 for TORCS. We use a linear support vector machine (SVM) classifier to predict the label of a given policy.
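The feature construction for the classifier can be sketched as follows; `policy_features` is our illustrative name, and in practice the resulting vectors would be fed to a linear SVM (e.g., scikit-learn's `LinearSVC`).

```python
def policy_features(episodic_rewards):
    """Build the 2N-dimensional feature vector of one policy: the
    per-environment mean and variance of its episodic rewards.
    episodic_rewards[i] holds the M trial returns obtained by rolling
    out the policy in candidate environment i."""
    means = [sum(r) / len(r) for r in episodic_rewards]
    variances = [sum((x - m) ** 2 for x in r) / len(r)
                 for r, m in zip(episodic_rewards, means)]
    return means + variances
```

A linear classifier is then fit on these vectors, with each shadow policy labeled by the candidate environment it was trained in.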
In Figure 9, we show the test-time performance of models trained under different dynamics in the TORCS environment. Changing the dynamics does change a model's performance in the different testing environments, which makes it possible for an SVM classifier to identify the candidate label for a given policy. Meanwhile, the policies do have some generalization ability. Take the policy trained on the normal environment as an example: it has the same performance on normal, rough, and very rough roads. This indicates that the environment where a policy performs best may not be exactly the environment it was trained in. On the other hand, Table 2 shows that a trained SVM classifier can identify the pattern of performances for each policy and use it to determine where each policy was trained.
In Figure 10, we present the accuracy of identifying the underlying training candidate in both the Hopper and HalfCheetah environments. Some candidates are easier to identify than others; interestingly, this pattern is shared between Hopper and HalfCheetah.
We report the candidate inference accuracy for all environments in Table 2. Given a policy's rollout performance in all environments with different dynamics, we classify the policy into the environment where it was originally trained. Using candidate inference with shadow policies, we achieve high accuracy across all environments, so an attacker could inversely infer which candidate configuration was used to train a given policy. This poses a potential privacy-leaking challenge to existing reinforcement learning algorithms such as PPO and DQN.
We now discuss designing deep reinforcement learning policies that are robust to such privacy-leaking attacks. As mentioned in the related work section, robust reinforcement learning usually means training policies that generalize to slightly perturbed environments, so as to ensure safety in control. Our work indicates that if trained RL policies generalize to perturbed environments, they may also be robust to privacy-leaking attacks. In the Grid World case, for example, if the policy does not memorize the environment dynamics but instead learns to plan in whatever environment it is deployed, it becomes almost impossible to infer the floor plan structure with our method. Likewise, for the candidate inference method, if trained policies generalize to many different environments, including significantly perturbed ones, it becomes much harder to infer which environment appeared in training. Our work therefore provides some insight into designing robust RL algorithms that can defend against such privacy-leaking attacks.
6. Conclusion
In this work, we have evaluated several approaches to retrieving private information about the training environment from well-trained policies, under two settings. In the first setting, we place constraints on the transition dynamics and use a genetic algorithm to recover the original transition map in a Grid World environment. In the second setting, we assume access to several candidate dynamics and propose an algorithm to infer which candidate environment was used to train a given policy. Through extensive experiments, we show that deep reinforcement learning is vulnerable to privacy-leaking attacks and that specific information about the training environment's dynamics can be recovered with high accuracy. These findings provide insights for designing reinforcement learning algorithms that protect the privacy of the training environment. Our work also offers the reinforcement learning community a new perspective at the intersection of RL, generalization, privacy, and security.
Acknowledgement
This work is partially supported by DARPA grant 00009970. We thank Warren He for providing useful feedback.
References
 Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 308–318.
 Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning. ACM, 1.
 Bello et al. (2017) Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V Le. 2017. Neural optimizer search with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. 459–468.
 Carlini et al. (2018) Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, and Dawn Song. 2018. The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets. arXiv preprint arXiv:1802.08232 (2018).
 Davis (1991) Lawrence Davis. 1991. Handbook of genetic algorithms. (1991).
 Deb et al. (2000) Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and Tanaka Meyarivan. 2000. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. In International conference on parallel problem solving from nature. Springer, 849–858.
 Gao et al. (2018) Yang Gao, Ji Lin, Fisher Yu, Sergey Levine, Trevor Darrell, et al. 2018. Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313 (2018).
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
 Henderson et al. (2018) Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. 2018. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence.
 Herman et al. (2016) Michael Herman, Tobias Gindele, Jörg Wagner, Felix Schmitt, and Wolfram Burgard. 2016. Inverse reinforcement learning with simultaneous estimation of rewards and dynamics. In Artificial Intelligence and Statistics. 102–110.
 Horn et al. (1994) Jeffrey Horn, Nicholas Nafpliotis, and David E Goldberg. 1994. A niched Pareto genetic algorithm for multi-objective optimization. In Proceedings of the first IEEE conference on evolutionary computation, IEEE world congress on computational intelligence, Vol. 1. Citeseer, 82–87.
 Justesen et al. (2018) Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. 2018. Procedural level generation improves generality of deep reinforcement learning. arXiv preprint arXiv:1806.10729 (2018).
 Lillicrap et al. (2016) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR).
 Mirowski et al. (2016) Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. 2016. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673 (2016).
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
 Pan et al. (2019) Xinlei Pan, Daniel Seita, Yang Gao, and John Canny. 2019. Risk Averse Robust Adversarial Reinforcement Learning. arXiv preprint arXiv:1904.00511 (2019).
 Pan et al. (2017) Xinlei Pan, Yurong You, Ziyan Wang, and Cewu Lu. 2017. Virtual to real reinforcement learning for autonomous driving. In British Machine Vision Conference (BMVC), London, UK, 2017.
 Pinto et al. (2017) Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. 2017. Robust Adversarial Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, (ICML), 2017.
 Rajeswaran et al. (2017) Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. 2017. EPOpt: Learning robust neural network policies using model ensembles. In International Conference on Learning Representations (ICLR), 2017.
 Rocha and Neves (1999) Miguel Rocha and José Neves. 1999. Preventing premature convergence to local optima in genetic algorithms via random offspring generation. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, 127–136.
 Sadeghi and Levine (2017) Fereshteh Sadeghi and Sergey Levine. 2017. CAD2RL: Real Single-Image Flight Without a Single Real Image. In Robotics: Science and Systems XIII, 2017.
 Sakuma et al. (2008) Jun Sakuma, Shigenobu Kobayashi, and Rebecca N Wright. 2008. Privacy-preserving reinforcement learning. In Proceedings of the 25th international conference on Machine learning. ACM, 864–871.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
 Shokri and Shmatikov (2015) Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. ACM, 1310–1321.
 Shokri et al. (2017) Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 3–18.
 Song et al. (2017) Congzheng Song, Thomas Ristenpart, and Vitaly Shmatikov. 2017. Machine learning models that remember too much. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 587–601.
 Such et al. (2017) Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune. 2017. Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567 (2017).
 Tamar et al. (2016) Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. 2016. Value iteration networks. In Advances in Neural Information Processing Systems. 2154–2162.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
 Zhu et al. (2017) Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. 2017. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on. 3357–3364.
 Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. 2008. Maximum Entropy Inverse Reinforcement Learning.. In AAAI, Vol. 8. 1433–1438.