How You Act Tells a Lot: Privacy-Leakage Attack on Deep Reinforcement Learning

04/24/2019 ∙ by Xinlei Pan, et al. ∙ Duke University University of Illinois at Urbana-Champaign, Inc. berkeley college 20

Machine learning has been widely applied to various applications, some of which involve training with privacy-sensitive data. A modest number of data breaches have been studied, including credit card information in natural language data and identities from face datasets. However, most of these studies focus on supervised learning models. As deep reinforcement learning (DRL) has been deployed in a number of real-world systems, such as indoor robot navigation, whether trained DRL policies can leak private information requires in-depth study. To explore such privacy breaches in general, we propose two methods: environment dynamics search via genetic algorithm and candidate inference based on shadow policies. We conduct extensive experiments to demonstrate such privacy vulnerabilities in DRL under various settings. We leverage the proposed algorithms to infer floor plans from trained Grid World navigation DRL agents with LiDAR perception. The proposed algorithm can correctly infer most of the floor plans and reaches an average recovery rate of 95.83%. We can also recover the robot configuration in continuous control environments and an autonomous driving simulator with high accuracy. To the best of our knowledge, this is the first work to investigate privacy leakage in DRL settings, and we show that DRL-based agents do potentially leak privacy-sensitive information from the trained policies.




1. Introduction

Recently, the machine learning field has witnessed significant progress in image recognition He et al. (2016), natural language processing Vaswani et al. (2017), and robotic control Lillicrap et al. (2016). However, machine learning algorithms have recently been found to leak private information about individual training data Shokri et al. (2017). The private information could be personal health data, transaction history, or personal photos that contain sensitive information. For example, in the black-box membership inference attack setting, it is possible to determine whether some individual data points were used to train the model with only black-box access to it Shokri et al. (2017). This fact indicates that personal private information may be leaked from trained machine learning models.

Figure 1. Overfitting in Grid World environment. Grid World map (left), agent’s trained policy (middle), agent’s policy when obstacles are removed (right). The agent “memorizes” the original environment and the optimal actions even when the environment has been changed, indicating that the original environment transition dynamics can be potentially inferred by approximating the optimal policy.

The goal of our work is to explore whether similar issues occur in reinforcement learning (RL), especially in deep reinforcement learning (DRL). Deep reinforcement learning has recently achieved great success in solving computer games Mnih et al. (2015), robotic control tasks Lillicrap et al. (2016), and autonomous driving Pan et al. (2017); Gao et al. (2018). Since DRL has great potential to be applied in many real-world applications that may involve personal private data in the trained model, it is important to study the vulnerability of deep reinforcement learning to potential privacy-stealing attacks. However, existing work on privacy in machine learning is mostly about supervised learning models such as classification and regression models. The privacy issue in reinforcement learning algorithms has not been studied before.

We present a motivating example that demonstrates the possibility of private information leakage from a DRL policy, followed by discussions on attack strategies. In a simple Grid World environment, the agent’s task is to navigate from its current location to the goal location and avoid colliding with obstacles. In Figure 1, the gray boxes represent obstacles, the red box denotes the agent, and the green box shows the goal of the agent. A trained reinforcement learning policy using DQN Mnih et al. (2015) with vision input (the aerial view of the Grid World) follows the trajectory indicated by the red line towards the goal. Surprisingly, based on our observation, the agent follows the same trajectory even when we remove all obstacles within the frame of the Grid World. In this example, the trained policy will reveal the optimal actions on the original map without seeing the same map, and therefore leak the private information within the original Grid World. The explanation for this observation is that the DQN agent “memorizes” the training map instead of acquiring the ability to perform visual navigation. Furthermore, this information can be used to infer the map structure given that we know the optimal action at every location. This motivating example shows that it is possible to infer private information from a trained deep reinforcement learning policy.

Different from the case in supervised learning, we define the problem of stealing private information from DRL policies as inferring certain sensitive characteristics of the training environment transition dynamics given black-box access to the trained policy. To analyze the worst-case scenario, we assume a powerful attacker with access to certain components of the environment, including the state space, the action space, the initial state distribution, and the reward function. Notably, inferring transition dynamics is itself an ill-posed problem. Similar to inverse reinforcement learning Abbeel and Ng (2004), in which more than one reward function can explain the observed behaviors, in our setting more than one set of transition dynamics can explain the given black-box policy. As a result, it is in general not possible to infer the exact transition dynamics from the given policy. Rather, we study it as a private information leaking problem in the following two settings: first, the attacker knows nothing about the training environments but the environments can have common constraints; second, the attacker has access to a set of potential candidates of environment dynamics.

In both settings, we consider the black-box scenario, which only allows us to query the policy but not to access the policy parameters. In the first setting, the attackers are aware of a set of explicit constraints on the transition dynamics and try to infer the transition dynamics. For instance, in the Grid World game, the attacker knows that one action can move the agent at most 1 step away from any location. In the second setting, the attacker is provided with a set of candidate robot configurations in robotic control tasks along with the policy trained under one unknown configuration, and the goal is to infer which configuration was used to perform the training. Such a setting is motivated by the practical scenario where different robotic systems are available on the market for further verification. Thus, by inferring the actual dynamic configuration from these accessible candidates, it is possible to tell where the robot was made or other information about the robot. If the policy for a specific robot is hacked this way, an attacker can infer the specific configuration of the robot and pose a great threat to the system's privacy. In this setting, our attack trains separate shadow policies under each of the available configurations and uses their performance on all candidate environments to train a classifier that determines which configuration corresponds to which policy.

Our contributions are listed as follows.

  • To the best of our knowledge, this is the first work to conduct studies on the privacy leakage problem in DRL;

  • We formulate the problem with the goal of inferring environment dynamics under different scenarios, including when the attacker has limited knowledge about the training environments and when the attacker has access to a set of possible candidates of the training environment dynamics;

  • We propose two algorithms to perform the privacy attacks in DRL: approximate the optimal policy via genetic algorithm, and candidate inference via shadow policies;

  • We perform comprehensive privacy analysis in DRL within several environments: navigation task with LiDAR perception as input, robots in continuous control environments, and autonomous driving simulator. We show that we can obtain high recovery rate in different scenarios.

2. Related Work

Sensitivity of RL on Training Environments. It has been shown in previous work that RL policy training is highly sensitive to training hyper-parameters, and slight variation of these parameters can lead to significantly different results Henderson et al. (2018). RL is also known to have a hard time generalizing to an unseen domain Tamar et al. (2016) and typically overfits to the original training environment, though adding randomization during training helps improve the robustness of RL algorithms to a limited extent Rajeswaran et al. ([n. d.]). These characteristics indicate that RL models may have implicitly memorized the training environment and are vulnerable to attacks that try to steal private training environment information.

Robust RL and Transfer Learning. Previous work in robust RL seeks to train an RL policy that can work in different environments with varying dynamics or visual scenes Pinto et al. (2017); Pan et al. (2019); Sadeghi and Levine (2017); Rajeswaran et al. ([n. d.]); Justesen et al. (2018). Policies trained in multiple environments have better generalization to an unseen target domain compared with policies that are only trained in one environment Rajeswaran et al. ([n. d.]). Intuitively, RL models that can generalize well to different environments may also tend to have better robustness against privacy leaking attacks.

Privacy-Stealing Attack Against Machine Learning Models. Membership inference attacks on machine learning models have been previously studied by Shokri et al. (2017) and further studied by Song et al. (2017); Carlini et al. (2018). The attack is performed on the trained models to tell whether or not a specific data point is in the training set. On the defense side, differential privacy for machine learning algorithms protecting training data has also been framed and studied by Shokri and Shmatikov (2015); Abadi et al. (2016). However, most of the work on both sides focuses on classification models, where the data are collected offline for training. In the reinforcement learning setting, the data are collected online during training, and inference about whether a single data point is used to train the model is not applicable. Instead, in this paper, we propose to infer the environment transition dynamics. One related work is privacy-preserving reinforcement learning by Sakuma et al. (2008). However, their work is in a multi-agent reinforcement learning setting, and the private information refers to privacy between individual agents' knowledge instead of the actual training data privacy discussed in this work.

Inverse Reinforcement Learning. Inverse reinforcement learning is about inferring the reward function given expert demonstrations Abbeel and Ng (2004); Ziebart et al. (2008). In our case, we are given access to a black-box, well-trained reinforcement learning policy, and the task is to infer the most likely dynamics coefficients or the environment transition dynamics. Our work is also related to the work by Herman et al. (2016). However, their work assumes that experience data in the original environment are given, while we do not make this assumption.

3. Privacy-Leakage Attack on Deep Reinforcement Learning

In this section, we first formulate our problem under the reinforcement learning setting and then propose algorithms to infer private information about the RL training environment.

3.1. Problem Definition

We follow the formulation of the standard reinforcement learning problem. The environment of reinforcement learning is modeled as a Markov Decision Process (MDP), which consists of the state space $\mathcal{S}$, the action space $\mathcal{A}$, the transition dynamics $\mathcal{P}$, and the reward function $R$. Usually a discount factor $\gamma$ will be applied to the accumulated reward to discount future rewards. The transition dynamics is defined as a probability mapping from state-action pairs to states, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$. The goal of a reinforcement learning algorithm is to learn a policy $\pi$ that maps state-action pairs to a probability distribution, $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$, so as to maximize the expected return $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$, where $r_t$ is the reward at step $t$ and $T$ is the length of the horizon.
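As a concrete reading of the return being maximized, the following minimal sketch computes the discounted return of one finished episode. The function name `discounted_return` and the sample rewards are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode, with t starting at 0."""
    total = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: two neutral steps, then a goal reward of 1.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.9))
```

The expectation in the objective would be approximated by averaging this quantity over many sampled episodes.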

In this work, we are interested in the problem of inferring the transition dynamics $\mathcal{P}$ of an MDP given a well-trained policy and other components of that MDP. Recovering the transition probability is a fundamentally ill-posed problem: there could be multiple $\mathcal{P}$ that explain the same observed policy. Therefore, we further assume the attacker has access to prior knowledge of some structural constraints about the environment. Specifically, we consider the following two scenarios. In the first scenario, we assume the attacker has knowledge of some explicit constraints on $\mathcal{P}$. The attacker tries to find a $\mathcal{P}$ that both satisfies the constraints and best explains the observed policy. In the second scenario, we assume the attacker knows a set of candidate $\mathcal{P}$s. The goal of the attacker is to determine which one of the candidates is the original transition dynamics used for training.

3.2. Methodology

In this section, we introduce two methods that respectively solve the two problems defined above.

3.2.1. Transition Dynamics Search by Genetic Algorithm

For the first scenario, we formulate the problem as searching for a $\mathcal{P}$ that both satisfies certain constraints and best explains the observed policy. We propose to solve this problem with a Genetic Algorithm (GA). A GA is a search algorithm inspired by the process of natural selection Davis (1991); Such et al. (2017). Different GAs have been proposed based on different bio-inspired mutation and selection operators Deb et al. (2000); Horn et al. (1994). In this paper, we use the basic GA to demonstrate the possibility of attack. We do not claim that GA is the only or the best method to solve the problem.

In each iteration, we maintain a population of transition dynamics that satisfy the known constraints. We use the following fitness score (Equation 1) to characterize how similar the optimal policy induced by any candidate $\mathcal{P}'$ is to the given policy $\pi$.

$$F(\mathcal{P}') = \sum_{s \in \mathcal{S}_E} D\left(\pi^*_{\mathcal{P}'}(s), \pi(s)\right) \qquad (1)$$

where $\mathcal{S}_E$ is the state space of the environment $E$, and $\pi^*_{\mathcal{P}'}$ is the optimal policy under environment $E$ with transition dynamics $\mathcal{P}'$. Here $D$ is a similarity metric on the action space and $\pi^*_{\mathcal{P}'}$ refers to the optimal policy we independently obtained by training with candidate transition dynamics $\mathcal{P}'$. The goal is to find a $\mathcal{P}'$ that maximizes this similarity score. Top candidates (elite population) sorted by the fitness (similarity) score are kept to the next generation. Other candidates are generated from two candidates (called parents) of the last generation. Our GA variant selects each parent by randomly selecting two candidates and choosing the one with the higher score. Then a two-point crossover of the selected parents is used to generate their child candidates. Also, random mutation is applied to the child candidates. A detailed description of our GA is in Algorithm 1. By running such a GA, we aim to find a $\mathcal{P}'$ that is close to the original transition dynamics $\mathcal{P}$ that induces the provided policy $\pi$.

Input:
      $\pi$: The provided policy
      $C$: The constraint for transition dynamics
      $F$: The fitness score function
      $N$: Size of the population
      $N_e$: Size of the elite population
      $G$: Generation limit
Output:
      $\mathcal{P}^*$: Best transition dynamics solution found
$Pop$ := $N$ randomly generated candidates satisfying $C$
for i in 0 to $G$ do
      sort $Pop$ according to $F$
      $Pop'$ := first $N_e$ candidates in $Pop$
     while $Pop'$ is not filled up do
          $a$, $b$ := randomly select 2 candidates in $Pop$
          $p_1$ := max($a$, $b$) according to their score
          $c$, $d$ := randomly select 2 candidates in $Pop$
          $p_2$ := max($c$, $d$) according to their score
          $child$ := Crossover($p_1$, $p_2$)
          $child$ := Mutation($child$)
         if $child$ satisfies $C$ then
              add $child$ to $Pop'$
         end if
     end while
      $Pop$ := $Pop'$
end for
return $\mathcal{P}^*$: the candidate with the highest score in $Pop$
Algorithm 1 Genetic Algorithm for Transition Dynamics Recovery
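The loop in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `genetic_search`, its parameter names, and the toy bit-matching problem are hypothetical, and the toy fitness stands in for the policy-similarity score, which would require training a policy per candidate.

```python
import random


def genetic_search(fitness, satisfies, random_candidate, crossover, mutate,
                   pop_size=20, elite_size=4, generations=30, rng=None):
    """Elitist GA with binary tournament selection (sketch of Algorithm 1)."""
    rng = rng or random.Random(0)
    population = [random_candidate(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        next_pop = population[:elite_size]  # elite candidates survive unchanged
        while len(next_pop) < pop_size:
            # Binary tournament: draw two candidates, keep the fitter one.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            child = mutate(crossover(p1, p2, rng), rng)
            if satisfies(child):  # children violating the constraints are dropped
                next_pop.append(child)
        population = next_pop
    return max(population, key=fitness)


# Toy stand-in problem (assumption): recover a hidden bit vector from a match count.
TARGET = (1, 0, 1, 1, 0, 0, 1, 0)


def match_score(candidate):
    return sum(a == b for a, b in zip(candidate, TARGET))


best = genetic_search(
    fitness=match_score,
    satisfies=lambda c: True,
    random_candidate=lambda rng: tuple(rng.randint(0, 1) for _ in TARGET),
    crossover=lambda p, q, rng: tuple(rng.choice(pair) for pair in zip(p, q)),
    mutate=lambda c, rng: tuple(b ^ (rng.random() < 0.05) for b in c),
)
print(best, match_score(best))
```

In the attack setting each fitness evaluation is expensive, so a real implementation would likely cache scores rather than recompute them inside the tournament.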

3.2.2. Candidate Inference with Shadow Policies

For the second scenario, we present an algorithm to perform candidate environment dynamics inference. Since the policy tends to behave differently in environments with perturbed transition dynamics, we use the given policy's performance under all candidate environments to infer which candidate it was trained on. To build a classification model, we first construct the training dataset by training $N \times K$ shadow policies under $N$ environment dynamics, each of which is initialized with $K$ random seeds. The policy $\pi_{i,j}$ is the $j$-th policy trained on the corresponding candidate transition dynamics $\mathcal{P}_i$, where $i \in \{1, \dots, N\}$ and $j \in \{1, \dots, K\}$. Then for each of the policies, we collect their episodic rewards under the $N$ different environment dynamics with $T$ trials for each. We construct the feature vectors for each policy using the mean and variance of the episodic rewards over $T$ trials in every environment dynamics. Thus, the feature vector for each policy will be a vector in $\mathbb{R}^{2N}$. We then fit a classifier based on the feature vectors with labels corresponding to each environment dynamics candidate. During testing time, given a target policy $\pi$, we build the feature vector in a similar way: we calculate the episodic reward in each of the $N$ dynamics over $T$ trials, and then predict which environment dynamics it was originally trained on. The detailed algorithm is described in Algorithm 2.

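Prepare Training Data
Learn shadow policies $\pi_{i,j}$ under $\mathcal{P}_i$, for $i = 1, \dots, N$ and $j = 1, \dots, K$;
for each shadow policy $\pi_{i,j}$ do
     for each candidate dynamics $\mathcal{P}_k$, $k = 1, \dots, N$ do
            $r_k$: obtain $T$ trials' episodic reward of policy $\pi_{i,j}$ in $\mathcal{P}_k$
            $x_k$ = mean and variance of $r_k$
     end for
end for
Got data pairs: rewards $x = (x_1, \dots, x_N)$ and dynamics label $i$.
Learn Classifier
      Learn a classifier $f$ that maps $x$ to the dynamics label
Test: Apply classifier $f$ on target policy $\pi$.
Algorithm 2 Candidate Inference with Shadow Policies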
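The feature construction and classification steps of Algorithm 2 can be sketched as follows. Everything environment-specific here is a stand-in: `episodic_reward` is a hypothetical toy model in which reward drops as the mismatch between training and evaluation dynamics grows, the sizes are arbitrary, and a nearest-centroid rule replaces the SVM reported in the experiments so that the example needs only the standard library.

```python
import random
import statistics

N_ENVS, K_SEEDS, T_TRIALS = 3, 5, 10  # hypothetical sizes


def episodic_reward(trained_on, evaluated_on, rng):
    """Toy stand-in for a rollout: reward falls with dynamics mismatch."""
    return 100.0 - 20.0 * abs(trained_on - evaluated_on) + rng.gauss(0.0, 2.0)


def feature_vector(trained_on, rng):
    """Mean and variance of episodic reward in every candidate (length 2N)."""
    features = []
    for env in range(N_ENVS):
        rewards = [episodic_reward(trained_on, env, rng) for _ in range(T_TRIALS)]
        features += [statistics.mean(rewards), statistics.variance(rewards)]
    return features


def fit_centroids(samples):
    """Nearest-centroid classifier: average the feature vectors per label."""
    centroids = {}
    for label in {lab for _, lab in samples}:
        rows = [vec for vec, lab in samples if lab == label]
        centroids[label] = [statistics.mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared L2 distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


rng = random.Random(0)
# K shadow policies per candidate dynamics, each labeled with its training candidate.
train = [(feature_vector(i, rng), i) for i in range(N_ENVS) for _ in range(K_SEEDS)]
centroids = fit_centroids(train)
target_features = feature_vector(2, rng)  # target policy secretly trained on candidate 2
predicted_env = predict(centroids, target_features)
print(predicted_env)
```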

3.3. Relation to Inverse RL

Our defined problem is related to inverse RL. For an MDP tuple ($\mathcal{S}$, $\mathcal{A}$, $\mathcal{P}$, $R$, $\gamma$), normal RL assumes that all of these components are available and learns an optimal policy $\pi^*$ to maximize the expected return. Inverse RL (IRL) assumes that only ($\mathcal{S}$, $\mathcal{A}$, $\mathcal{P}$, $\gamma$) are available. It infers $R$ given an optimal policy $\pi^*$ from expert demonstrations. RL seeks to learn the optimal policy $\pi^*$ given a reward function $R$, while IRL learns the reward function $R$ that best supports the given policy $\pi^*$. Different from IRL and RL, our proposed transition dynamics recovery problem assumes that only ($\mathcal{S}$, $\mathcal{A}$, $R$, $\gamma$) are available. It infers $\mathcal{P}$ given some optimal policy $\pi^*$. Our work studies how to infer the transition dynamics $\mathcal{P}$ that best supports the given policy $\pi^*$.

4. What Map Did You Walk On?

In this experiment section, we present an example showing how an attacker could inversely acquire private information (the floor plan in a Grid World environment) with the GA under the first scenario we defined previously.

Figure 2. An example of our abstracted grid map. (a) The original real floor plan. (b) The abstracted grid map.

4.1. Experimental Settings

Performing navigation on a certain map is a fundamental problem in robotics. Recent studies try to address this problem in an end-to-end manner by using Deep Reinforcement Learning (DRL) based algorithms Mirowski et al. (2016); Zhu et al. (2017). As the Grid World is a very commonly used environment for testing RL algorithms, and a reasonable simplification of the navigation task, the experiments in this section are based on the Grid World environments.

We are interested in the following question: given a DRL agent that is well trained on some specific grid map, can we recover the map (or at least part of it)? This kind of attack can pose a real threat to the practical use of RL agents if the attacker can easily infer the floor plan structures of privacy-sensitive areas by just having access to the navigation robots' learned policy. We introduce the following setup to approximate the case of a navigation robot in the real world. More specifically, the grid maps are designed similarly to real floor plans. An example is presented in Figure 2.

Figure 3. (a) The ground truth map. (b, c) Two search results recovered from deterministic agent (DQN). (d, e) Two search results recovered from stochastic agent (Policy Gradient, PG). Cells mismatched with the ground truth are marked with red crosses.
Figure 4. Curves of fitness score in the 8 GA search runs with DQN agents. Different colors indicate different seeds. The x-axis indicates the number of GA iterations. (a) The curve of population average score. (b) The curve of population average recovery rate (accuracy of free space or obstacle prediction).
Figure 5. Curves of fitness score in the 8 GA search runs with Policy Gradient (PG) agents. Different colors indicate different seeds. The x-axis indicates the number of GA iterations. (a) The curve of population average score. (b) The curve of population average recovery rate.

We consider the case that the agent takes LiDAR as input instead of aerial-view visual data. More specifically, the observation space is composed of distances in positive real value in 8 directions (4 cardinal directions and 4 intermediate directions). The action space contains five actions: move left, move right, move up, move down and stay. The reward function is defined as 1 if reaching the goal, and -0.1 if the agent stays in place or collides with obstacles, and 0 otherwise. The goal location is fixed for each grid map. RL agents are trained until full convergence on the environment before testing. The detailed map constraints and policy training details are included below.
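A minimal sketch of this observation and reward design is given below. The paper does not spell out how distances are measured, so the convention here is an assumption: each ray reports the Euclidean distance to the first blocked cell, and cells outside the map count as boundary walls. The names `lidar`, `blocked`, and `reward` are illustrative.

```python
import math

# 8 ray directions as (d_row, d_col): 4 cardinal, then 4 intermediate.
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1),
              (-1, -1), (-1, 1), (1, -1), (1, 1)]


def blocked(grid, r, c):
    """A cell is blocked if it is a wall (1) or lies outside the map."""
    inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
    return (not inside) or grid[r][c] == 1


def lidar(grid, row, col):
    """Distance from (row, col) to the first blocked cell along each ray."""
    readings = []
    for dr, dc in DIRECTIONS:
        steps, r, c = 1, row + dr, col + dc
        while not blocked(grid, r, c):
            steps, r, c = steps + 1, r + dr, c + dc
        readings.append(steps * math.hypot(dr, dc))
    return readings


def reward(reached_goal, stayed_or_collided):
    """+1 at the goal, -0.1 for staying in place or colliding, 0 otherwise."""
    if reached_goal:
        return 1.0
    return -0.1 if stayed_or_collided else 0.0


grid = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(lidar(grid, 0, 0))
```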

Floor Plan Constraints. All grid maps are designed following common-sense rules of real-world architecture design. In particular, an $n \times n$ floor plan, together with its boundary walls, should satisfy the following constraints. First, free space (including the goal) grids form a connected graph, which means the smallest navigation distance between any two free space grids is finite. Second, there should be one and only one goal position on the grid map, and the goal grid is considered free space, not an obstacle. Third, the thickness of walls must be equal to 1. In other words, there should not be any obstacles of shape $2 \times 2$, or any obstacles containing this shape. Note that the boundary walls are not considered part of the map, and are considered a known prior. An example floor plan is shown in Figure 2.
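These three constraints can be checked mechanically. The sketch below is one possible checker, assuming a map encoded as a list of rows with 1 for wall and 0 for free space, a single goal passed as a (row, col) pair (so uniqueness of the goal holds by construction), and 4-connected movement matching the agent's move actions. The function name `satisfies_constraints` is hypothetical.

```python
from collections import deque


def satisfies_constraints(grid, goal):
    """Check the floor plan constraints on a 0/1 grid with a (row, col) goal."""
    rows, cols = len(grid), len(grid[0])
    # The goal cell must be free space, not an obstacle.
    if grid[goal[0]][goal[1]] != 0:
        return False
    # Wall thickness 1: no 2x2 block may consist entirely of obstacles.
    for r in range(rows - 1):
        for c in range(cols - 1):
            if grid[r][c] and grid[r][c + 1] and grid[r + 1][c] and grid[r + 1][c + 1]:
                return False
    # Connectivity: breadth-first search from the goal must reach every free cell.
    free = {(r, c) for r in range(rows) for c in range(cols) if grid[r][c] == 0}
    seen, queue = {goal}, deque([goal])
    while queue:
        r, c = queue.popleft()
        for cell in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if cell in free and cell not in seen:
                seen.add(cell)
                queue.append(cell)
    return seen == free


valid_map = [[0, 1, 0],
             [0, 0, 0],
             [0, 1, 0]]
print(satisfies_constraints(valid_map, (0, 0)))
```

A checker of this kind could serve as the constraint test applied to each child candidate during the search.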

Policy Training Details. The target policies under a set of specific floor plan designs are trained using DQN Mnih et al. (2015). The policy network structure is shown in Table 1. We train the agents using the aforementioned reward design. For DQN, we train the agents using an epsilon-greedy exploration schedule with the exploration rate decreasing from 1 to 0.02 between step 0 and step 100,000. We also train agents with vanilla policy gradient methods to get stochastic policies. The policy gradient agents share the same network architecture, the same environment reward functions, and the same exploration schedule as DQN.
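The exploration schedule can be sketched as below. The text gives only the endpoints (1 at step 0, 0.02 at step 100,000), so the linear interpolation between them is an assumption, as are the function names.

```python
import random


def epsilon_at(step, start=1.0, end=0.02, decay_steps=100_000):
    """Interpolate epsilon linearly over decay_steps, then hold it at `end`."""
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
for step in (0, 50_000, 100_000, 200_000):
    print(step, round(epsilon_at(step), 3))
print(epsilon_greedy([0.1, 0.9, 0.3, 0.0, 0.2], epsilon_at(200_000), rng))
```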

Type | Input Dim. | Output Dim.
Linear | 8 | 64
Linear | 64 | 64
Linear | 64 | 5
Table 1. Architecture of the network used for LiDAR input agents. ReLU activation is applied after each linear layer except the last one.
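To make the shapes in Table 1 concrete, here is a dependency-free sketch of the forward pass. The layer sizes and the placement of ReLU come from the table; the weight initialization and the helper names `init_layers` and `forward` are assumptions for illustration, and a real agent would use a deep learning framework.

```python
import random

LAYER_SIZES = [8, 64, 64, 5]  # 8 LiDAR readings in, 5 action values out


def init_layers(rng):
    """One (weights, biases) pair per linear layer, with small random weights."""
    layers = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, observation):
    """Apply each linear layer; ReLU follows every layer except the last."""
    x = list(observation)
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


layers = init_layers(random.Random(0))
q_values = forward(layers, [1.0] * 8)  # one output per action
print(len(q_values))
```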

4.2. Attack Implementation

Under this setting, we implement the GA-based search method to recover the map structure. We randomly generate 20 test floor plans of size $7 \times 7$ according to the constraints mentioned earlier. The fitness score (similarity score) measures how well a grid map's transition dynamics fit a given policy $\pi$. The similarity metric for a deterministic policy (DQN) is 1 if the action selections of the target policy and the current policy agree and 0 if they disagree. For stochastic policies trained with the policy gradient method (PG), we use the L2 distance between their action probability distributions: we set a threshold, and the similarity metric returns 1 if the L2 distance between the two policies' action probability distributions is smaller than the threshold, and 0 otherwise. Therefore, maximizing the score is equivalent to minimizing the policy difference.
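The two similarity metrics and the resulting score can be sketched as follows. The function names, the threshold value of 0.1, and the choice to sum (rather than average) the per-state similarity are assumptions; the policies in the usage example are toy lookup functions.

```python
import math


def similarity_deterministic(action_a, action_b):
    """1 if both policies select the same action, else 0."""
    return 1 if action_a == action_b else 0


def similarity_stochastic(probs_a, probs_b, threshold=0.1):
    """1 if the L2 distance between action distributions is below the threshold."""
    distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(probs_a, probs_b)))
    return 1 if distance < threshold else 0


def fitness(target_policy, candidate_policy, states, similarity):
    """Accumulate the per-state similarity over all states."""
    return sum(similarity(target_policy(s), candidate_policy(s)) for s in states)


# Toy deterministic policies over three states, agreeing on two of them.
target = {0: 1, 1: 3, 2: 0}.get
candidate = {0: 1, 1: 2, 2: 0}.get
print(fitness(target, candidate, [0, 1, 2], similarity_deterministic))
```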

Genetic Algorithm Details. In our GA variant, a population of fixed size is evolved iteratively. The population is composed of candidate map solutions, each represented by a 0-1 vector (0 for empty and 1 for wall). At every iteration (called a generation), each candidate is evaluated using Equation 1 to produce a fitness score, and the population is sorted according to the score. The top candidates (elite population) are kept to the next generation. The other candidates are each generated from two candidates (called parents) of the last generation. Our GA variant selects each parent by randomly selecting two candidates and choosing the one with the higher score. Then a two-point crossover of the selected parents is used to generate their child candidate. Random mutation is applied to the resulting child candidate with a fixed mutation rate. The operators used in our GA variant are as follows. First, the crossover operator. We use two-point crossover in our implementation. For the two-point crossover on vectors of length $n$, two random crossover positions $a$ and $b$ are generated for each parent pair. The resulting child candidate is a combination of the segments [0, a) and [b, END] from one parent and [a, b) from the other. This process is shown in Figure 6. Second, the mutation operator. We preset a fixed mutation rate (in our experiments, set to 0.05). For each child vector generated by the crossover process, every bit in the vector flips with probability equal to the mutation rate.
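The two operators can be sketched as below for maps flattened into 0-1 lists. Which parent contributes the middle segment is an assumption (here the second parent), and the function names are illustrative.

```python
import random


def two_point_crossover(parent_a, parent_b, rng):
    """Child takes [0, a) and [b, END] from parent_a and [a, b) from parent_b."""
    a, b = sorted(rng.sample(range(len(parent_a) + 1), 2))
    return parent_a[:a] + parent_b[a:b] + parent_a[b:]


def mutate(child, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in child]


rng = random.Random(0)
all_free = [0] * 49   # a 7 x 7 map with no walls, flattened
all_wall = [1] * 49   # a 7 x 7 map that is all walls, flattened
child = two_point_crossover(all_free, all_wall, rng)
child = mutate(child, rate=0.05, rng=rng)
print("".join(map(str, child)))
```

Because mutation and crossover ignore the floor plan constraints, children still need to be validated before entering the population, as in Algorithm 1.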

Figure 6. An example of two-point crossover.

Based on the fitness scores defined previously, we implement the introduced GA search method. As previous papers have shown that GA tends to converge to a local optimum Rocha and Neves (1999), we run the proposed GA search 8 times with different random seeds. The highest-scoring result is selected as our final search result. In addition, we compare GA search with two baseline search methods: random search and RL-based search.

Random Search: The most direct method that can be applied here is random search. In each search iteration, we randomly generate a grid map and calculate its score according to Equation 1, and we consider the map with the highest score as our solution. Since there is no guarantee that random search will find a solution, a time limit (5000 s) is set for the search algorithm.

RL-based Search: Reinforcement learning has recently been used to solve combinatorial optimization problems and search problems Bello et al. (2017). Here we use reinforcement learning as a baseline to search for a Grid World map. The MDP for RL is defined as follows: the state space is all the possible map configurations; the action space is discrete and consists of $n^2$ actions in total for a map of shape $n \times n$; the reward function is defined to be the same as the GA's fitness score; the episode terminates when the number of episodic steps exceeds 100. We use DQN as the reinforcement learning search algorithm and use an epsilon-greedy policy for exploration. More specifically, for the testing maps of size $7 \times 7$, the input to the DQN network is a 49-dimensional binary vector representing the current guessed map, and the outputs are the Q-values of 49 different actions, where each action changes a specific location from obstacle to free space or from free space to obstacle. The goal position does not change and is considered to be known as prior information. The RL algorithm runs for 250,000 steps before terminating, and the exploration rate decays linearly from 1 to 0.02 from step 0 to step 240,000. The running time limit is also set to 5000 s. We use a three-layer fully connected deep neural network for this RL-based search method, with each layer's output size being 64 except for the last layer, whose output size is 49.
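The search MDP described above can be sketched as a small environment class. Several details are assumptions because the text does not specify them: the initial guess is an all-free map, an action targeting the goal cell is a no-op, the episode ends once 100 steps have been taken, and `placeholder_score` stands in for the Equation 1 fitness of the guessed map.

```python
class MapSearchEnv:
    """State: flat 0-1 map guess. Action k flips cell k, except the goal cell."""

    def __init__(self, size, goal_index, score_fn, max_steps=100):
        self.size = size
        self.goal_index = goal_index
        self.score_fn = score_fn
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        """Start each episode from an all-free guess (assumption)."""
        self.state = [0] * (self.size * self.size)
        self.steps = 0
        return list(self.state)

    def step(self, action):
        """Flip one cell, then score the new guess."""
        if action != self.goal_index:  # the goal is known and stays free
            self.state[action] ^= 1
        self.steps += 1
        reward = self.score_fn(self.state)
        done = self.steps >= self.max_steps
        return list(self.state), reward, done


def placeholder_score(guess):
    """Hypothetical stand-in for the fitness of the guessed map."""
    return float(sum(guess))


env = MapSearchEnv(size=7, goal_index=24, score_fn=placeholder_score)
state = env.reset()
state, reward, done = env.step(5)
print(len(state), reward, done)
```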

Table 2 summarizes the accuracy of floor plan recovery by different methods (the best recovery rate). As we can see, with a similar running time, the GA based search method outperforms the other two methods and achieves high accuracy in terms of map structure recovery. Therefore, our attack method is able to recover the map structure and poses a great threat to the privacy of DRL.

To better understand the GA results, we take the floor plan in Figure 2 as an example. Figure 3 visualizes the final search results from the GA. The curves of population average score and average recovery rate are plotted in Figure 4 and Figure 5. As can be seen, the quality of the recovered maps from the policy gradient (PG) agent is significantly better than that from the DQN agent. This is intuitive because the former agent gives much more information than the latter one: it is easier to tell whether an agent has been trained on some states when the complete probability information is available. If we compare the curves of the two cases (DQN vs. PG), we can see that the population average score from DQN is higher and more concentrated than that from PG, but all runs with DQN agents show a lower average recovery rate (approximately 70%), while some runs with PG agents reach a recovery rate higher than 90%. This is due in part to the score function used for PG. In DQN, the score function only carries binary information (the same selected action or different actions), while in PG, the whole action probability distribution is compared, which provides more information about the uncertainty in action selection.
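The recovery rate, described in the figure captions as the accuracy of free space or obstacle prediction, can be computed as in this short sketch. The function name and the toy maps are illustrative assumptions.

```python
def recovery_rate(true_map, recovered_map):
    """Fraction of cells whose free/obstacle label matches the ground truth."""
    pairs = [(t, r)
             for true_row, rec_row in zip(true_map, recovered_map)
             for t, r in zip(true_row, rec_row)]
    return sum(t == r for t, r in pairs) / len(pairs)


true_map = [[0, 1],
            [1, 0]]
recovered = [[0, 1],
             [0, 0]]
print(recovery_rate(true_map, recovered))
```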

Environment | Agent | Task | Method | Recovery Rate | Run Time (s)
Grid World | DQN | Transition Dynamics Search | Random Search | 61.31% | 5000
Grid World | DQN | Transition Dynamics Search | RL-based Search | 81.63% | 5000
Grid World | DQN | Transition Dynamics Search | Genetic Algorithm | 89.55% | 4511
Grid World | PG | Transition Dynamics Search | Random Search | 68.55% | 5000
Grid World | PG | Transition Dynamics Search | RL-based Search | 71.42% | 5000
Grid World | PG | Transition Dynamics Search | Genetic Algorithm | 95.83% | 4063
Hopper | PPO | Candidate Inference | SVM | 81.30% | -
Half Cheetah | PPO | Candidate Inference | SVM | 82.30% | -
Bipedal Walker | PPO | Candidate Inference | SVM | 88.90% | -
TORCS | DQN | Candidate Inference | SVM | 94.30% | -
Table 2. Map recovery rate and candidate inference accuracy.

5. Which Bot Did You Use?

In this section, we present results showing how an attacker could inversely acquire private information using candidate inference with shadow policies under the second scenario we defined previously. The hypothetical attack scenario happens when an attacker tries to identify which version of the robot was used based on the released policy. We conduct the experiments on a suite of continuous control benchmarks including the RoboSchool version of MuJoCo tasks Hopper and Half-Cheetah, Box2D environment bipedal walker, and TORCS, a car racing simulator.

Hopper. The hopper is a monopod robot with four links corresponding to the torso, thigh, shin, and foot, and three actuated joints. The goal for the robot is to learn to walk on a plane without falling over by applying continuous-valued forces to its joints. The reward at each time step is a combination of the progress made, the costs of the movements (e.g., electricity), and penalties for collisions. Three environment parameters can be varied: (1) torso density, (2) sliding friction of the joints, and (3) power of the forces applied to each joint. We construct 6 candidate configurations by altering these parameters. Namely, they are ‘LightTorso’, ‘HeavyTorso’, ‘SlipperyJoints’, ‘RoughJoints’, ‘Weak’ and ‘Strong’.

Half Cheetah. Half-cheetah is a bipedal robot with eight links and six actuated joints corresponding to the thighs, shins, and feet. The goal, reward structure, parameters and the way we construct candidate configurations are the same as those in Hopper.

Bipedal Walker. Bipedal Walker is a bipedal robot. The task is to move to the far end, and the agent gets a reward if it succeeds. The state space consists of the hull angle speed, angular velocity, horizontal speed, vertical speed, joint positions and joint angular speeds, leg contact with the ground, and 10 LiDAR rangefinder measurements. We construct 9 candidate configurations by changing the value of the hull angle speed.

TORCS. TORCS is an open-source car racing simulator containing multiple categories of environments with different road conditions and terrain. We use the Michigan Way environment, which is a racing road with sunny weather. We use the modified TORCS simulator from Pan et al. (2017). Reinforcement learning models trained on TORCS take RGB images from the simulator as input and can execute 9 different actions: move forward, turn left, or turn right, each combined with keeping the current speed, accelerating, or decelerating. The reward depends on the current speed of the car and the angle between the speed direction and the road direction when there is no collision, and is -2.5 when there is a collision. The game terminates when the number of steps exceeds 1000 or when the car collides with obstacles 3 times. We construct 5 configurations with different surface friction coefficients for the road, sidewalk, and fences, ranging from very slippery through slippery, normal, and rough to very rough.
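The nine discrete actions and the termination rule described above can be sketched as follows. This is a minimal illustration, not the simulator's actual interface: the labels in `STEERING` and `SPEED`, the index order of `ACTIONS`, and the helper names are all assumptions introduced here.

```python
from itertools import product

# Hypothetical labels; the simulator's real action encoding may differ.
STEERING = ("forward", "left", "right")
SPEED = ("keep", "accelerate", "decelerate")

# Every (steering, speed) pair gives one discrete action; the order is an assumption.
ACTIONS = list(product(STEERING, SPEED))

MAX_STEPS = 1000      # episode ends once the step count exceeds this
MAX_COLLISIONS = 3    # episode ends on the third collision


def describe(action_id):
    """Return a readable description for a discrete action index."""
    steering, speed = ACTIONS[action_id]
    return f"{steering} / {speed}"


def episode_done(steps, collisions):
    """Termination rule as stated in the text: too many steps or 3 collisions."""
    return steps > MAX_STEPS or collisions >= MAX_COLLISIONS
```

Writing the action set as a product makes it explicit that steering and speed are chosen jointly in a single discrete decision.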

Figure 7. Hopper and Half-Cheetah environments.
Figure 8. Bipedal Walker and TORCS environments.

Note that the parameters involved in constructing candidate environments are important private information. For example, in the TORCS environment, the surface friction coefficients can help to infer the material used to construct the road or indoor environment. Hypothetical attackers can use such information to potentially know more about the background of the autonomous driving environment. The road’s friction coefficient can also reflect the weather conditions where the policy is trained.

For the Hopper and Half-Cheetah environments, we use the Proximal Policy Optimization (PPO) algorithm Schulman et al. (2017) to train policies. For each dynamics configuration, we train for 10,000 episodes until convergence. For Bipedal Walker, we use PPO to train a policy on each configuration for 3,000 episodes until convergence. While training the PPO policy, we use a random environment seed for each episode to avoid overfitting to the seed. The policy and value functions are multi-layer perceptrons (MLPs) with two hidden layers of 64 units each and hyperbolic tangent activations; there is no parameter sharing.
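As a rough illustration of the policy and value networks described above, the following sketch builds two independent MLPs with two 64-unit tanh hidden layers. The observation and action dimensions, the weight initialization, and the linear output layer are assumptions made for the example; they are not taken from the paper.

```python
import math
import random


def init_mlp(sizes, rng):
    """Create [(W, b), ...] for consecutive layer sizes with small random weights."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((W, b))
    return layers


def forward(layers, x):
    """Apply tanh after each hidden layer; the output layer stays linear (assumed)."""
    for i, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


rng = random.Random(0)
OBS_DIM, ACT_DIM = 15, 3  # hypothetical dimensions, not taken from the paper

# Two separately initialized networks: no parameters are shared between them.
policy_net = init_mlp([OBS_DIM, 64, 64, ACT_DIM], rng)
value_net = init_mlp([OBS_DIM, 64, 64, 1], rng)
```

Keeping the two networks separate means a gradient step on the value loss cannot change the policy's weights, which is what "no parameter sharing" implies.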

For the TORCS environment, we use DQN to train the policy to be attacked. For all configurations, we train for one million steps, with the exploration rate decaying from 1 to 0.02 between step 0 and step 500,000 and held at 0.02 from step 500,000 onward. The DQN network consists of 3 convolutional layers and 2 fully connected layers, and the input image size is . The network architecture is included in Table 3.
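The exploration schedule can be sketched as below. The text only states the start value, end value, and step range, so the linear shape of the decay is an assumption, as are the function and parameter names.

```python
def exploration_rate(step, start=1.0, end=0.02, decay_steps=500_000):
    """Interpolate linearly from start to end (assumed shape), then hold at end."""
    if step >= decay_steps:
        return end
    fraction = step / decay_steps
    return start + fraction * (end - start)
```

Under this assumption the agent acts randomly half the time around step 250,000 and almost always greedily for the second half of training.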

Type | Kernel | Stride | Output Channels
Conv. | 8 | 4 | 32
Conv. | 4 | 2 | 64
Conv. | 3 | 1 | 64
Linear | - | - | 512
Linear | - | - | 9
Table 3. Architecture of network used for training TORCS. ReLU activations are applied after each linear or convolutional layer except the last linear layer.
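To see how the layers in Table 3 fit together, the sketch below computes the number of features that reach the first linear layer. It assumes square inputs and unpadded convolutions, neither of which is stated here, and the 84×84 input used in the usage note is purely a hypothetical placeholder, not the paper's value.

```python
# (kernel, stride, out_channels) for the three convolutional layers in Table 3.
CONV_LAYERS = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]


def conv_output_size(size, kernel, stride):
    """Spatial size after an unpadded convolution (padding=0 is an assumption)."""
    return (size - kernel) // stride + 1


def flattened_features(input_size):
    """Number of features fed into the first linear layer for a square input."""
    size = input_size
    channels = None
    for kernel, stride, out_channels in CONV_LAYERS:
        size = conv_output_size(size, kernel, stride)
        channels = out_channels
    return size * size * channels
```

For a hypothetical 84×84 input, the spatial size shrinks to 20, then 9, then 7, so `flattened_features(84)` gives 7 × 7 × 64 = 3136 inputs to the 512-unit layer.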

For each configuration, we train policies with 32 random seeds on the training algorithms and split the seeds into training and testing sets. Of the 32 policies, 8 are used to generate the training samples for the classifiers and the remaining 24 are used for evaluation. The accuracy is computed by averaging accuracy over all configurations. The number of candidate dynamics varies across tasks: 6 for Hopper and Half-Cheetah, 9 for Bipedal Walker, and 5 for TORCS. We use a linear support vector machine (SVM) classifier to predict the label for a given policy.
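The evaluation protocol above can be sketched as follows. The classifier itself is left out; the sketch only covers the 8-versus-rest seed split and the averaging of accuracy over configurations. The random shuffle used for the split and both function names are assumptions for illustration.

```python
import random


def split_seeds(seeds, n_train=8, rng=None):
    """Shuffle seeds (assumed) and return (train_seeds, test_seeds)."""
    rng = rng or random.Random(0)
    shuffled = list(seeds)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def mean_accuracy_over_configs(predictions, labels):
    """Average the per-configuration accuracy so each candidate counts equally."""
    per_config = {}
    for pred, label in zip(predictions, labels):
        correct, total = per_config.get(label, (0, 0))
        per_config[label] = (correct + (pred == label), total + 1)
    accuracies = [correct / total for correct, total in per_config.values()]
    return sum(accuracies) / len(accuracies)


# 32 seeds per configuration: 8 for training the classifier, 24 for evaluation.
train_seeds, test_seeds = split_seeds(range(32))
```

Averaging per configuration first, rather than pooling all test policies, keeps one easy candidate from dominating the reported number.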

Figure 9. TORCS dynamics candidate testing performance.

In Figure 9, we show the test-time performance of models trained under different dynamics in the TORCS environment. Changing the training dynamics does change the model’s performance across testing environments, which makes it possible for an SVM classifier to identify the candidate label for a given policy. Meanwhile, the policy does have some generalization ability. Take the policy trained on the normal environment, for example: it has the same performance on normal, rough, and very rough roads. This indicates that the environment in which the policy performs best may not be exactly the environment in which it was trained. On the other hand, Table 2 shows that a trained SVM classifier can identify the pattern of performances for each policy and use it to determine where each policy was trained.

Figure 10. Hopper and Half-Cheetah candidate inference accuracy for different candidates.

In Figure 10, we present the accuracy of identifying the underlying training candidate in both the Hopper and Half Cheetah environments. It shows that some candidates are easier to identify than others. Interestingly, this pattern is shared between Hopper and Half Cheetah.

We report our candidate inference accuracy for all environments in Table 2. Given the policy rollout performance in all environments with different dynamics, we classify the policy into the environment where it was originally trained. The results show that the candidate inference with shadow policies method achieves high accuracy across all environments, from 81.30% on Hopper to 94.30% on TORCS. An attacker could therefore infer which candidate configuration was used to train a given policy. This potentially poses a privacy-leakage challenge to existing reinforcement learning algorithms such as PPO and DQN.
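One way to picture the classifier input is as a vector of the policy's average return in each candidate environment. The sketch below builds such a vector. The feature definition (mean episode return per environment), the candidate names, and the numbers are illustrative assumptions, not values or details from the paper.

```python
from statistics import mean


def performance_features(returns_by_env, env_order):
    """Mean episode return in each candidate environment, in a fixed order."""
    return [mean(returns_by_env[env]) for env in env_order]


# Hypothetical TORCS-style candidate names and made-up returns for illustration.
ENV_ORDER = ["very_slippery", "slippery", "normal", "rough", "very_rough"]
example_returns = {
    "very_slippery": [10.0, 14.0],
    "slippery": [20.0, 22.0],
    "normal": [30.0, 30.0],
    "rough": [30.0, 30.0],
    "very_rough": [30.0, 30.0],
}

# One feature vector per policy; a linear classifier would be trained on these.
features = performance_features(example_returns, ENV_ORDER)
```

Using the whole performance profile, rather than only the best-scoring environment, matters because a policy can perform equally well in several environments.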

We now discuss designing deep reinforcement learning policies that are robust to such privacy-leaking attacks. As mentioned in the related work section, robust reinforcement learning usually means training policies that can generalize to slightly perturbed environments, with the aim of ensuring safety in control. Our work indicates that if trained RL policies can generalize to perturbed environments, the policies may also be robust to privacy-leaking attacks. For example, in the Grid World case, if the policy does not memorize the environment dynamics but instead learns to plan no matter what environment it is deployed in, then it would be almost impossible to infer the floor plan structure using our method. In the candidate inference method, if the trained policies can generalize to many different environments, including significantly perturbed ones, then it will be much harder to infer which environment was used for training. Therefore, our work provides some insight into designing robust RL algorithms that can defend against such privacy-leaking attacks.

6. Conclusion

In this work, we have evaluated several approaches to retrieving private information about the environment from well-trained policies under two settings. In the first setting, we apply some constraints on the transition dynamics and propose a genetic algorithm to recover the original transition map in a Grid World environment. In the second setting, we assume access to several candidate dynamics and propose an algorithm to infer which candidate environment was used to train a given policy. Through extensive experiments, we show that deep reinforcement learning is vulnerable to potential privacy-leaking attacks and that specific information about the training environment dynamics can be recovered with high accuracy. These findings provide insights for designing reinforcement learning algorithms that protect the training environment’s privacy. Our work also provides the reinforcement learning community with a new perspective on the intersection of RL generalizability, privacy, and security.


This work is partially supported by DARPA grant 00009970. We thank Warren He for providing useful feedback.