Locally Private Distributed Reinforcement Learning

by Hajime Ono, et al.
University of Tsukuba

We study locally differentially private algorithms for reinforcement learning to obtain a robust policy that performs well across distributed private environments. Our algorithm protects the information of local agents' models from being exploited by adversarial reverse engineering. Because a local policy is strongly affected by its individual environment, the agent's output may unintentionally leak private information. In our proposed algorithm, local agents update the model in their environments and report noisy gradients designed to satisfy local differential privacy (LDP), which gives a rigorous local privacy guarantee. Using the set of reported noisy gradients, a central aggregator updates its model and delivers it to different local agents. In our empirical evaluation, we demonstrate that our method performs well under LDP. To the best of our knowledge, this is the first work that realizes distributed reinforcement learning under LDP. This work enables us to obtain a robust agent that performs well across distributed private environments.




1 Introduction

Recent advances in reinforcement learning (RL) have shown great success in broad domains ranging from market strategy decisions Abe et al. (2004) and load balancing Cogill et al. (2006) to autonomous driving Shalev-Shwartz et al. (2016). Reinforcement learning is a process to obtain a good policy in a given environment in an unsupervised learning manner. Distributed reinforcement learning (DRL) is known as a practical solution to accelerate reinforcement learning in parallel Mnih et al. (2016); Nair et al. (2015); Palmer et al. (2019); Bacchiani et al. (2019). It also gives us robust policies across different environments. A policy is regarded as robust if it performs well across various environments rather than overfitting a simulated environment. A policy that overfits a simulated environment does not work well in the real-world environment Rajeswaran et al. (2016).

When the local environments are related to private information, such as private rooms and individual properties, privacy issues arise. Because a locally learned policy is strongly affected by its individual environment, the agent's output may unintentionally leak private information. Pan et al. (2019) pointed out that reinforcement learning can cause privacy issues. They proposed an attack that recovers the transition dynamics of agents from their state space, action space, reward function, and trained policy. For example, from the policy of a robot cleaner trained in an individual's room, an adversary who can access the policy can estimate the room layout. In distributed settings, information sent by a local agent poses serious privacy risks if the central aggregator is not trusted.

Local differential privacy (LDP) Kasiviswanathan et al. (2011); Duchi et al. (2013) gives a rigorous privacy guarantee when data providers send information to a data curator. Mechanisms ensuring LDP make their outputs indistinguishable regardless of the input. In this paper, we aim to design locally differentially private algorithms for DRL such that the information reported by local agents is indistinguishable.

Figure 1: Private Gradient Collection (PGC) framework. The framework aims to learn a robust policy based on the noisy gradients, satisfying LDP, that local agents report. The central aggregator updates its model with the noisy gradients and distributes the updated model to different agents to make the policy robust across various environments.

To achieve DRL under LDP constraints, we develop a framework that learns a robust policy from the information reported by the agents while preserving their local privacy (Figure 1). We call the framework Private Gradient Collection (PGC). In the framework, first, the central aggregator distributes a global model to several local agents. Second, the local agents update the model in their local private environments. Third, the agents report noisy gradients that satisfy LDP to the central aggregator. Finally, the central aggregator updates the global parameters using the set of reported noisy gradients. After updating the global model, the central aggregator distributes it to other agents to continue learning. In this way, local agents can report their updates by submitting noisy gradients even if the local nodes do not have any deliverable data. Moreover, the central aggregator can update the global model simply by applying the collected gradients to it, without raising privacy concerns for the local agents.

As a concrete realization of the framework, we introduce an algorithm based on asynchronous advantage actor-critic (A3C). To introduce randomness that satisfies LDP, we present two mechanisms for the gradient submission.

To the best of our knowledge, this is the first work that realizes DRL under local differential privacy. In this paper, we show that our algorithm ensures LDP guarantees by utilizing a series of techniques. In our empirical evaluations, we demonstrate that our method learns a robust policy effectively even when it is required to satisfy local differential privacy. This work enables us to obtain a robust agent that performs well across distributed private environments.

1.1 Related Works

For privacy-preserving distributed reinforcement learning, both cryptographic and differentially private approaches have been studied.

The cryptographic approaches conceal information during the learning process Zhang and Makedon (2005); Sakuma et al. (2008). However, if several agents collude, they may be able to estimate the information of the other agents. Our proposed method under LDP is robust against such colluding adversarial parties. Even if all the remaining agents attack an agent, that agent's dynamics are indistinguishable from other candidate dynamics.

Noisy DQN Fortunato et al. (2018) is a DQN Mnih et al. (2015) variant that aims to improve learning stability by injecting Gaussian noise, but it has no way to preserve privacy. Wang and Hegde (2019) introduced differentially private Q-learning in continuous spaces. Zhu and Philip (2019) introduced a cooperative multi-agent system that chooses advice information from neighboring agents in a differentially private way. Chamikara et al. (2019) proposed a private distributed learning framework that crafts perturbed data satisfying LDP at local data holders and sends it to the untrusted curator. We focus on DRL in which agents do not have such deliverable data but instead submit perturbed gradients.

2 Preliminaries

Before the detailed discussion, we introduce the essential notation, definitions, and background needed to understand our proposals.

2.1 (Local) Differential Privacy

Differential privacy Dwork et al. (2006) is a rigorous privacy definition that quantitatively evaluates the degree of privacy protection when releasing statistical aggregates. Local differential privacy (LDP) Kasiviswanathan et al. (2011); Duchi et al. (2013), in contrast, gives a rigorous privacy guarantee when data providers send information to a data curator. Suppose the data providers send information to a collector via some randomized mechanism $\mathcal{M}$.

Definition 1

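For all possible inputs $x, x' \in \mathcal{X}$ and for any subset of outputs $S \subseteq \mathcal{Y}$, a randomized mechanism $\mathcal{M}$ satisfies $\epsilon$-local differential privacy if it holds that

$$\Pr[\mathcal{M}(x) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(x') \in S], \qquad (1)$$

where $e$ is the Napier number.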

The definition requires the mechanism to output indistinguishable values regardless of the input. An essential property of a mechanism is the (global) sensitivity of its output.

Definition 2

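For all inputs $x, x' \in \mathcal{X}$, the sensitivity of a mechanism with target function $f$ is defined as

$$\Delta_f = \sup_{x, x' \in \mathcal{X}} \| f(x) - f(x') \|_1. \qquad (2)$$

Sequential composition formalizes the intuition that releasing more outputs weakens the privacy guarantee.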

Theorem 1

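Let $\mathcal{M}_1, \dots, \mathcal{M}_T$ be a series of mechanisms. Assume that $\mathcal{M}_t$ satisfies $\epsilon_t$-(L)DP for each $t \in \{1, \dots, T\}$, respectively. Then, the series of mechanisms satisfies $\left(\sum_{t=1}^{T} \epsilon_t\right)$-(L)DP.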

Post-processing invariance is the property that further processing of differentially private information never weakens the privacy guarantee.

Theorem 2

For any deterministic or randomized function $g$ defined over the output of the mechanism $\mathcal{M}$, if $\mathcal{M}$ satisfies $\epsilon$-(L)DP, then $g(\mathcal{M}(x))$ also satisfies $\epsilon$-(L)DP for any input $x$.

Because of this property, in the local privacy setting, the curator is allowed to apply arbitrary processing to the collected data.

Laplace mechanism. The Laplace mechanism is a well-known randomized mechanism that samples randomized values from the Laplace distribution. The Laplace distribution is calibrated to the sensitivity of the target function's outputs. The mechanism samples the randomized output from the Laplace distributions denoted as

$$y_i \sim \mathrm{Lap}\left(f(x)_i, \frac{\Delta_f}{\epsilon}\right), \qquad (3)$$

where $f(x)_i$ is the $i$-th element of $f(x)$.
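A minimal Python sketch of the Laplace mechanism described above may help. The function names and the sampling trick (a Laplace variate as the difference of two exponential variates) are illustrative assumptions, not the authors' implementation; the noise scale is the sensitivity divided by the privacy parameter.

```python
import random


def sample_laplace(scale, rng):
    """Draw one sample from Laplace(0, scale).

    Uses the fact that the difference of two i.i.d. exponential
    variates with rate 1/scale is Laplace-distributed with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(values, sensitivity, epsilon, rng):
    """Add independent Laplace(0, sensitivity / epsilon) noise to each element."""
    scale = sensitivity / epsilon
    return [v + sample_laplace(scale, rng) for v in values]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 3-dimensional output with L1 sensitivity 2.0 and epsilon 1.0.
    print(laplace_mechanism([0.3, -0.1, 0.5], sensitivity=2.0, epsilon=1.0, rng=rng))
```

Smaller values of the privacy parameter give a larger scale and therefore noisier outputs, which is the privacy and accuracy trade-off the later experiments explore.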

Bit flip. Bit flip Ding et al. (2018, 2017) is a randomization technique for satisfying (L)DP. For input ,


By the bit flip, the randomized outputs have sharp directions.
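The exact form of the bit flip equation is not recoverable from this text, so the sketch below uses one common one-bit mechanism in the style of Ding et al. (2017), adapted to inputs in [-1, 1]. The input range, the output alphabet {-1, +1}, and the function names are assumptions made for illustration.

```python
import math
import random


def plus_probability(x, epsilon):
    """Probability of reporting +1 for an input x in [-1, 1]."""
    e = math.exp(epsilon)
    return 1.0 / (e + 1.0) + (x + 1.0) / 2.0 * (e - 1.0) / (e + 1.0)


def bit_flip(x, epsilon, rng):
    """Report +1 or -1; the sign leans toward the sign of x."""
    return 1 if rng.random() < plus_probability(x, epsilon) else -1


if __name__ == "__main__":
    rng = random.Random(0)
    print([bit_flip(0.8, 1.0, rng) for _ in range(10)])
```

Under these assumptions the probability of either output changes by at most a factor of exp(epsilon) between any two inputs, which is what the LDP definition requires, and every report is a pure sign, which matches the remark that the outputs have sharp directions.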

Random projection Johnson (1984); Achlioptas (2001); Bingham and Mannila (2001) is a useful technique that reduces the dimensionality of a vector with a random matrix. We can use the random matrix such that


where is the dimension of the mapped space. The random matrix has the useful property that its column vectors are almost orthogonal to each other Achlioptas (2001); Bingham and Mannila (2001). Thanks to this property, we can approximately recover the original vector from the compressed vector using the transposed matrix.
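The projection and the approximate recovery can be sketched as follows. Since the matrix definition is missing here, the sketch assumes the sparse construction of Achlioptas (2001) with entries in {+s, 0, -s} drawn with probabilities 1/6, 2/3, 1/6 and s = sqrt(3/k); the names `random_matrix`, `project`, and `recover` are hypothetical.

```python
import math
import random


def random_matrix(k, d, rng):
    """Build a k x d sparse random matrix (assumed Achlioptas-style)."""
    scale = math.sqrt(3.0 / k)

    def entry():
        u = rng.random()
        if u < 1.0 / 6.0:
            return scale
        if u < 1.0 / 3.0:
            return -scale
        return 0.0

    return [[entry() for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional vector x to k dimensions: y = R x."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


def recover(matrix, y):
    """Approximately map y back to d dimensions with the transpose: R^T y."""
    d = len(matrix[0])
    return [sum(matrix[i][j] * y[i] for i in range(len(matrix))) for j in range(d)]


if __name__ == "__main__":
    rng = random.Random(1)
    R = random_matrix(k=4, d=10, rng=rng)
    x = [rng.uniform(-1.0, 1.0) for _ in range(10)]
    print(recover(R, project(R, x)))
```

With this scaling the squared norm is preserved in expectation, but the recovery is only approximate and becomes noisier as the reduced dimension shrinks.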

2.2 Distributed Reinforcement Learning

The benefits of distributed reinforcement learning (DRL) are increased learning efficiency and robust policies. We focus on the latter while preserving the privacy of local agents. Beyond the distributed setting, robust RL is a generalization of RL that accommodates uncertainty in the transition dynamics Morimoto and Doya (2005); Rajeswaran et al. (2016); Pinto et al. (2017). Our goal is to obtain a transferable policy that performs well across various environments.

Suppose there are a central aggregator and distributed agents. Agent

moves around on the Markov decision process (MDP) which is characterized with common state space

, common action space , common reward function , common discounting factor and local transition dynamics . contains some terminal states. Each local dynamics decides the next state after an action on a state where is the probability simplex on . Local dynamics is parametrized with where is a parameter set. For each round , agent has state and takes action . After the action, the agent gets reward from , and the state transits . gives to the terminal states. The transition follows local transition dynamics . Agent decides its action by policy , which is shared by all agents. The central aggregator trains the policy with the cooperation of the agents. Defining as the initial state distribution, for each agent , history

is the random variable such that


where is determined with the MDP and the policy. To obtain a robust policy, which works well on some dynamics in the possible dynamics set , we solve an optimization problem. The objective function is


where is the discounting factor.

2.2.1 Asynchronous Advantage Actor-critic

Asynchronous advantage actor-critic (A3C) Mnih et al. (2016) is a DRL framework originally proposed to accelerate policy training in parallel. In the distributed A3C protocol, each agent optimizes both the policy and an approximation of the state-value function. The policy is denoted as . For some and some , represents the confidence for action on . Based on the confidence, each agent decides the next action at each time step. State-value function is the function which represents the value of each state on policy .


Further, denotes the approximated state value function. For simplicity, this paper denotes and as and .

The objective function is defined as follows:




with terminal state and some small positive real values and . is the loss of policy, and is the loss of the estimation of value function. Decreasing (9) increases the true objective (7).

3 Locally Differentially Private Actor-Critic

3.1 Our Algorithm

We here present our algorithm for DRL under local differential privacy. We first introduce an overview and an abstract model of our algorithm. We call the abstract model Private Gradient Collection (PGC). Based on the model, we present our algorithm PGC-A3C, a method based on A3C that satisfies $\epsilon$-LDP for all local agents. We also give the privacy analysis of the proposed method and some extensions.

3.1.1 Private Gradient Collection

Suppose the central aggregator has a model parameterized with where is the dimensionality of , and trains by utilizing reported information from local agents. The central aggregator and all local agents share the parameters

, the structure of the model, and loss function


The abstract model PGC follows the below four steps:

  1. The central aggregator delivers to local agents.

  2. Each local agent initializes her parameters by and updates in her local private environment.

  3. The local agent reports information about her model, injecting noise to satisfy $\epsilon$-LDP.

  4. The central aggregator updates by only utilizing the received noisy information from the local agents.

The primary question in designing the model is: what information should local agents report? Our answer is the stochastic gradient. Hence, the local agent computes the loss and its gradient from local observations and rewards, and then submits a noisy gradient to the central node.

In the local training process, each agent inputs to the network and obtains the next action or some information to decide the action. At the end of an episode, with history and observed rewards , agent evaluates by the loss function L. After the evaluation, the agent computes stochastic gradient of along . Before reporting to the central aggregator, she randomizes the gradient. The randomness (e.g., additive noise) is designed to satisfy -LDP via random mechanism .

Definition 3

(-LDP for gradient submissions) For each , any and any subset , the following inequality must hold:


With the noisy gradients, the central aggregator updates the global parameters. Then, the updated parameters are shared with all distributed agents again. To keep the problem simple, we assume that each agent submits a gradient only once. Because of post-processing invariance, the central aggregator can apply the noisy gradients to the parameters in any way: she can use any gradient method and can use a submitted gradient multiple times.

3.1.2 PGC-A3C

As a concrete realization of the PGC framework, we propose PGC-A3C, which is an LDP variant of A3C. Following the PGC framework, PGC-A3C has local agents submit gradients to the central aggregator through a randomized mechanism. The rest of the procedure follows the original A3C algorithm. Algorithm 1 gives the overall procedure of PGC-A3C.

Empirical Loss Minimization.

As in vanilla A3C, based on the episode history, the local agent evaluates the empirical loss:




The empirical loss replaces random value in Equation 9 with observed . After the evaluation, the agent computes stochastic gradient of the empirical loss along . For the stochastic gradient, each agent crafts the noisy gradient by a randomized mechanism to satisfy -LDP, and submits to the central aggregator.

Crafting Noisy Gradient.

We discuss how to craft a noisy gradient that satisfies $\epsilon$-LDP. The simplest way to craft the gradient is to follow the Laplace mechanism, which is well known in the differential privacy literature. However, the sensitivity of a stochastic gradient is hard to bound directly, so we employ a clipping technique to bound it.


where is a stochastic gradient vector, is clipping size. Each agent clips the gradient by the norm with a positive constant , and then the sensitivity is bounded by . That is, any two clipped gradients satisfies


Based on the clipping (19) and the sensitivity bound (20), each agent generates Laplace noise such that:


where is the -th dimensional value of . With the noise , each agent report noisy gradient to the central aggregator. This procedure is described in Algorithm 2.
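The clip-then-perturb procedure can be sketched in Python as below. Equations (19) to (21) are not legible in this text, so the sketch assumes L1-norm clipping with clipping size `clip_size`, which bounds the L1 distance between any two clipped gradients by twice the clipping size, and Laplace noise of scale `2 * clip_size / epsilon` per coordinate. All names are hypothetical.

```python
import random


def clip_l1(grad, clip_size):
    """Scale grad down so that its L1 norm is at most clip_size."""
    norm = sum(abs(g) for g in grad)
    factor = max(1.0, norm / clip_size)
    return [g / factor for g in grad]


def noisy_gradient(grad, epsilon, clip_size, rng):
    """Clip the gradient, then add Laplace noise calibrated to sensitivity 2 * clip_size."""
    clipped = clip_l1(grad, clip_size)
    rate = epsilon / (2.0 * clip_size)  # reciprocal of the Laplace scale
    # The difference of two exponential variates is Laplace-distributed.
    return [g + rng.expovariate(rate) - rng.expovariate(rate) for g in clipped]


if __name__ == "__main__":
    rng = random.Random(0)
    print(noisy_gradient([3.0, -4.0, 0.5], epsilon=1.0, clip_size=1.0, rng=rng))
```

A small clipping size keeps the noise scale low but discards gradient magnitude, so in practice it would be tuned together with the privacy parameter.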

Updating Global Parameter with Buffer.

The central aggregator updates its global parameters with the gradients received from the local agents. To reduce the variance of the noisy gradients, we introduce a temporary storage buffer. The central aggregator first stores multiple noisy gradients in the buffer and then updates the global parameters using all of them as


The central aggregator does not utilize any information about local agents other than the received noisy gradients. Therefore, the update process, which is a post-processing of all received gradients, does not violate $\epsilon$-LDP for any local agent. After updating the global parameters, the central aggregator flushes the buffer. We expect that buffering improves learning stability, as mini-batch learning does.
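The buffering logic can be illustrated with a small aggregator class. The update rule (22) is not legible here, so this sketch assumes a plain gradient-descent step on the mean of the buffered gradients; the class name, the `max_buf` parameter name, and the averaging are assumptions rather than the authors' exact rule.

```python
class BufferedAggregator:
    """Central aggregator that updates only once its buffer is full."""

    def __init__(self, theta, learning_rate, max_buf):
        self.theta = list(theta)
        self.learning_rate = learning_rate
        self.max_buf = max_buf
        self.buffer = []

    def receive(self, noisy_grad):
        """Store one noisy gradient; return True if an update was performed."""
        self.buffer.append(list(noisy_grad))
        if len(self.buffer) < self.max_buf:
            return False
        n = len(self.buffer)
        for j in range(len(self.theta)):
            mean_j = sum(g[j] for g in self.buffer) / n
            self.theta[j] -= self.learning_rate * mean_j  # assumed averaged SGD step
        self.buffer.clear()  # flush the buffer after the update
        return True


if __name__ == "__main__":
    agg = BufferedAggregator([0.0, 0.0], learning_rate=0.1, max_buf=2)
    agg.receive([1.0, 1.0])
    agg.receive([3.0, 1.0])
    print(agg.theta)
```

Because the class touches only the submitted noisy gradients, any variant of this update (momentum, reuse of gradients) remains a post-processing step.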

  Input: agents, reward function , and randomized mechanism
  Parameters: privacy parameter , reduced dimension , clip parameter , maximum buffer size MAX_BUF, learning rate and scale parameter
  Initialize buffer
  Initialize global parameters
     // agent asynchronously begins a local process
     Copy parameters
     Initialize step counter
     Get initial state
     while  is not a terminal state do
        Receive and following and
     end while
     Send to central controller
     // agent ends the local process
     if  then
        Perform asynchronous update of with by (22)
         // clear buffer
     end if
  until  submissions received
Algorithm 1 PGC-A3C
  Input: gradient vector , privacy parameter , dimensionality of the gradient vector , and clipping size
   clip vector as (19)
  for  do
     generate such that (21).
  end for
Algorithm 2 Laplace mechanism (for gradient submissions)
  Input: gradient vector , privacy parameter , original dimension , reduced dimension and clipping size
  Generate random matrix as (5)
  for  do
      clip as (23)
      randomize as (24)
  end for
Algorithm 3 Projected Random Sign Mechanism

3.1.3 Acceleration of Learning Efficiency

Although the Laplace mechanism is simple, it significantly decreases the accuracy of the gradient, so the learning efficiency of the whole DRL might decrease. We introduce an alternative randomizing technique, projected random sign (PRS), which may increase the learning efficiency. To this end, the PRS mechanism reduces dimensionality and sharpens the gradient direction while injecting randomness for LDP.

First, the PRS applies the random projection to reduce dimensionality. Each agent maps dimensional stochastic gradient vector to dimensional vector with random matrix as . follows (5).

Second, before applying randomization, each agent applies element-wise clipping denoted as follows:


where is the -th dimensional value of . This element-wise clipping bounds the sensitivity by .

Third, we apply bit flipping to the clipped vector produced by (23). The bit flipping, which extends (4), is denoted as follows:


The randomization by (24) consumes privacy parameter for each dimension. Finally, this mechanism inversely transforms as , and reports it to the central aggregator.
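The three PRS steps (project, clip element-wise, flip) followed by the inverse transform can be put together as in the sketch below. Equations (23) and (24) are not legible here, so several choices are assumptions: the privacy budget is split evenly across the reduced dimensions, each coordinate is clipped to [-clip, clip], and the reported sign is scaled so that it is an unbiased estimate of the clipped value. The random matrix is passed in by the caller, and all names are hypothetical.

```python
import math
import random


def randomize_coordinate(v, epsilon, clip, rng):
    """Clip v to [-clip, clip], then emit a scaled random sign."""
    v = max(-clip, min(clip, v))
    e = math.exp(epsilon)
    p_plus = 1.0 / (e + 1.0) + (v / clip + 1.0) / 2.0 * (e - 1.0) / (e + 1.0)
    magnitude = clip * (e + 1.0) / (e - 1.0)  # makes the output unbiased
    return magnitude if rng.random() < p_plus else -magnitude


def projected_random_sign(grad, matrix, epsilon, clip, rng):
    """Project grad with a k x d matrix, randomize each coordinate, map back."""
    k, d = len(matrix), len(grad)
    eps_per_dim = epsilon / k  # assumed even split of the budget
    projected = [sum(r * g for r, g in zip(row, grad)) for row in matrix]
    signs = [randomize_coordinate(v, eps_per_dim, clip, rng) for v in projected]
    return [sum(matrix[i][j] * signs[i] for i in range(k)) for j in range(d)]


if __name__ == "__main__":
    rng = random.Random(0)
    R = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]  # toy 2 x 3 matrix for illustration
    print(projected_random_sign([0.2, -0.3, 0.1], R, epsilon=2.0, clip=1.0, rng=rng))
```

Reducing the dimension means the budget is shared among fewer coordinates, so each sign is less noisy, at the cost of the information lost in the projection.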

3.2 Privacy Analysis

Lemma 1

An algorithm that follows the PGC framework satisfies $\epsilon$-LDP for all local agents.

Proof Sketch

In step 3 of the PGC framework, each agent reports a noisy gradient that ensures $\epsilon$-LDP. In step 4, the central aggregator updates the global parameters utilizing only the noisy gradients received from the local agents. This step is independent of any other information about the local agents. Therefore, step 4 does not violate $\epsilon$-LDP due to post-processing invariance. Now consider the next round. The central aggregator delivers the updated parameters to another agent at step 1. At step 2, that agent copies the parameters and updates them through her local environment. Since the learning process at a local agent is independent of all other agents, the output also does not violate $\epsilon$-LDP for any other agent.

Lemma 2

Gradient submission with the Laplace mechanism satisfies $\epsilon$-LDP.


Each agent is given which contains the information of where . With , agent computes gradient and outputs . With the clipping and Laplace mechanism, for any and , the following inequality holds.

Since the inequality holds regardless , the following also holds. For any ,

Lemma 3

Gradient submission with the PRS mechanism satisfies $\epsilon$-LDP.


For any , any and any such that ,

The rest of the expansion follows the proof of Lemma 2.

Lemma 4

Updating parameters on the central aggregator (22) does not violate the $\epsilon$-LDP that has been satisfied for each local agent.


The parameter update (22) utilizes only the received noisy gradients in the buffer, which means it is independent of any information about the local agents other than the noisy gradients. Due to post-processing invariance, the parameter update does not violate the $\epsilon$-LDP that has been satisfied at each local agent.

Theorem 3

PGC-A3C (Algorithm 1) satisfies $\epsilon$-LDP for all local agents.


From Lemmas 1, 2, 3, and 4, PGC-A3C satisfies $\epsilon$-LDP for all local agents.

3.2.1 Extending to multiple submissions

We can easily extend the algorithm to a locally multi-round algorithm. In the multi-round algorithm, each agent submits a randomized stochastic gradients -times. For each submission, the agent consumes a privacy budget , and the whole consumed budget is because of the sequential composition theorem (Theorem 1).
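As a small worked instance of this budget accounting, suppose an agent makes $T$ submissions and spends $\epsilon_0$ on each. These symbols and the numbers below are illustrative assumptions, not values from the paper. Sequential composition (Theorem 1) then gives:

```latex
\epsilon_{\mathrm{total}} = \sum_{t=1}^{T} \epsilon_0 = T \epsilon_0,
\qquad \text{e.g. } T = 5,\ \epsilon_0 = 0.2 \ \Rightarrow\ \epsilon_{\mathrm{total}} = 1.0 .
```

Read the other way, holding the total budget fixed leaves each submission with only a $1/T$ share, so the per-submission noise grows with the number of rounds.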

4 Experiments

We here demonstrate the effectiveness of our proposals. We evaluate learning efficiency, success ratio, and trade-off between privacy and efficiency. Before showing empirical results, we describe what the evaluation task is and how to implement PGC-A3C.

Evaluation Task. We make some numerical observations on cart pole with different gravity acceleration coefficients . Suppose that each coefficient appears uniformly at random. Cart Pole Barto et al. (1983) is a classical reinforcement learning task in which an agent controls a cart with a pole to keep the pole standing. The number of time steps in which the pole is standing is the cumulative reward. The cumulative reward is called the score in this section, and the maximum score is 200. A state consists of cart position , cart velocity , pole angle , and pole velocity at tip . The cart moves on a one-dimensional line. At each time step, the agent selects an action from .

Stopping Criteria. We iterate the learning process until the central aggregator has received a predefined number of submissions from agents. Since we assume that each agent submits only once, the number of submissions is identical to the number of agents. We assume the scores are not private information. If we need to protect the score with LDP, we can easily add submissions in which agents send a noisy boolean representing whether the score is larger than a threshold.


To implement the proposed algorithms, we use two shallow neural networks corresponding to


, respectively. Each network has two layers, and the activation functions are ReLU.

, , and . Given a state , outputs the confidence in each action. Each agent takes an action having with probability . With probability , the agent takes a randomly selected action. is decreased from to as

. Our implementation uses 9 threads for asynchronous agent processes. The experimental code is written in Python 3.7.4 with TensorFlow 1.14.0 Abadi et al. (2015) and OpenAI Gym 0.14.0 Brockman et al. (2016).

Hyper Parameters. We set discounting factor , learning rate , loss scaling factors and . is set for the Laplace mechanism and is set for the PRS. For the PRS, we set as following Wang et al. (2019).

Score. Each local agent measures a score, which is the number of time steps for which the pole remains standing. We observe how the learning progress reaches the target score (). In particular, to evaluate robustness across various environments, we measure the average score. The average score at the -th submission over the last (=10) submissions is:


where is the score at .
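The windowed average can be computed as in this short sketch. The window of 10 comes from the text; the zero-based index `t`, the function name, and the handling of the first few submissions (averaging over however many scores exist so far) are assumptions, since the formula itself is not legible here.

```python
def average_score(scores, t, window=10):
    """Mean score of the last `window` submissions up to and including index t."""
    start = max(0, t - window + 1)
    recent = scores[start:t + 1]
    return sum(recent) / len(recent)


if __name__ == "__main__":
    # Hypothetical scores from 12 consecutive submissions.
    scores = [20, 35, 50, 80, 120, 150, 180, 200, 200, 190, 200, 200]
    print(average_score(scores, t=11))
```

Averaging over consecutive submissions mixes scores from agents in different environments, which is why the text uses it as a robustness indicator.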

(a) Laplace ()
(b) Laplace ()
(c) PRS ()
(d) PRS ()
Figure 6: The average scores during the training process. Buffering helps the PRS to train the policy, but disturbs the training of Laplace mechanism.

4.1 Observation of Learning Behaviors

First, we observe the learning behaviors of our proposed methods to decide several hyper-parameters’ values. We compare two mechanisms with and without buffering, whose buffer size is . We regard training as a success if the average score meets .

Figure 6 shows the average scores of our proposed method employing two different mechanisms during the training process with and . Without buffering, scores for all settings change drastically, whereas with buffering the learning dynamics of the Laplace mechanism become too small to learn. In the PRS mechanism, however, buffering gives better learning stability than no buffering. Thus, buffering helps the PRS mechanism train the policy well, but it disturbs training with the Laplace mechanism.

In the later part of the evaluations, we employ the Laplace mechanism with , and the PRS with .

4.2 Learning Efficiency

We evaluate how early the algorithms achieve a target score. To measure it, we define a metric first success time (FST):


We regard as if a learning process cannot meet the success in 90,000 updates.
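A minimal sketch of the FST computation follows. The target score and the value substituted for failed runs are not legible in this text, so the target is a parameter and a failed run returns `None` here; the one-based submission count and the function name are assumptions.

```python
def first_success_time(avg_scores, target):
    """Return the 1-based submission count at which the average score first meets target.

    Returns None if the target is never met (the paper substitutes a fixed
    value in that case, which is not recoverable from this text).
    """
    for t, score in enumerate(avg_scores, start=1):
        if score >= target:
            return t
    return None


if __name__ == "__main__":
    print(first_success_time([10.0, 50.0, 120.0, 199.0, 200.0], target=195.0))
```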

Table 1 shows the median of the FST in trials for each setting. A smaller median suggests that a method learns more efficiently. We measure the FST varying , and . With the Laplace mechanism, the median FST decreases as $\epsilon$ increases. PGC-A3C with PRS shows better results within , but it does not show such improvement at . Therefore, PGC-A3C with the PRS may achieve higher learning efficiency than with the Laplace mechanism.

median of (26)
Lap 18377.0 20238.5 5714.5 4055.0 1769.0
PRS 25226.5 7549.0 2656.5 11217.5
Table 1: #submissions that meets target score at first.
success ratio
Lap 0.80 0.90 1.00 1.00 1.0
PRS 0.85 0.95 0.90 0.90
Table 2: Success ratio.
(a) Laplace mechanism
(b) PRS mechanism
Figure 9: Success ratio for each at #submissions. PGC-A3C with the PRS mechanism makes more successes in earlier stage than the method with the Laplace.
relative area under curve
Lap 0.673 0.711 0.909 0.965 1.0
PRS 0.660 0.862 0.835 0.771
Table 3: Relative AUC of Figure 9 against non-private.

4.3 Success Ratio

We evaluate how many times proposed algorithms achieve the target score over trials for each setting. That is


where is the of -th trial. Here we set .
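The success ratio can be computed from per-trial FST values as sketched below. The limit on the number of submissions is a parameter because its value is not legible here, and representing a failed trial as `None` is an assumption of this sketch.

```python
def success_ratio(fsts, limit):
    """Fraction of trials whose first success time is at most `limit`."""
    successes = sum(1 for f in fsts if f is not None and f <= limit)
    return successes / len(fsts)


if __name__ == "__main__":
    # Hypothetical FST values from four trials; None marks a failed trial.
    print(success_ratio([1200, None, 4500, 88000], limit=90000))
```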

Table 2 shows the success ratios for various settings. The non-private A3C () succeeds in all trials. With both of the randomized mechanisms, the algorithms tend to make more successes for larger . The algorithm using PRS gives more successes for , and the algorithm using the Laplace mechanism shows better for .

Figure 9 plots the success ratio at . The horizontal axis shows the number of submissions, and the vertical axis shows the ratio of the trials in which an average score exceeds by the update. With larger $\epsilon$, both algorithms achieve a high success ratio with fewer updates. The algorithm using PRS succeeds more often in the early stage, and the algorithm with Laplace succeeds more often by the end of each training process. This is due to the larger loss of gradient information in the PRS compared with the Laplace mechanism.

Table 3 shows the relative area under curve (AUC) of Figure 9 against the AUC of non-private A3C. An algorithm with a larger AUC is regarded as better. The Laplace mechanism achieves a larger AUC as $\epsilon$ increases, while PRS performs best at .

The PRS mechanism gives us more efficient DRL under LDP at some . The PRS mechanism may be a better choice if we require strong privacy guarantees (). Otherwise, the Laplace mechanism seems more promising.

5 Conclusion

We studied locally differentially private algorithms for distributed reinforcement learning to obtain a robust policy that performs well across distributed private environments. We proposed a general framework, PGC, and its concrete algorithm, PGC-A3C, with two randomized mechanisms for injecting randomness. Our proposed algorithm learns a robust policy based on the noisy gradients, satisfying LDP, that local agents report. Without raising privacy concerns for the local agents, the algorithm can update a global model to make it robust across various environments. We also demonstrated that our method learns a robust policy effectively even when it is required to satisfy local differential privacy. This work enables us to obtain a robust agent that performs well across distributed private environments.


  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org. Cited by: §4.
  • N. Abe, N. Verma, C. Apte, and R. Schroko (2004) Cross channel optimized marketing by reinforcement learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, New York, NY, USA, pp. 767–772. External Links: ISBN 1-58113-888-1, Link, Document Cited by: §1.
  • D. Achlioptas (2001) Database-friendly random projections. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01, New York, NY, USA, pp. 274–281. External Links: ISBN 1-58113-361-8, Link, Document Cited by: §2.1.
  • G. Bacchiani, D. Molinari, and M. Patander (2019) Microscopic traffic simulation by cooperative multi-agent deep reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, Richland, SC, pp. 1547–1555. External Links: ISBN 978-1-4503-6309-9, Link Cited by: §1.
  • A. G. Barto, R. S. Sutton, and C. W. Anderson (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13 (5), pp. 834–846. External Links: Document, ISSN Cited by: §4.
  • E. Bingham and H. Mannila (2001) Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’01, New York, NY, USA, pp. 245–250. External Links: ISBN 1-58113-391-X, Link, Document Cited by: §2.1.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §4.
  • M. Chamikara, P. Bertok, I. Khalil, D. Liu, and S. Camtepe (2019) Local differential privacy for deep learning. arXiv preprint arXiv:1908.02997. Cited by: §1.1.
  • R. Cogill, M. Rotkowitz, B. Van Roy, and S. Lall (2006) An approximate dynamic programming approach to decentralized control of stochastic systems. In Control of Uncertain Systems: Modelling, Approximation, and Design, B. A. Francis, M. C. Smith, and J. C. Willems (Eds.), Berlin, Heidelberg, pp. 243–256. External Links: ISBN 978-3-540-31755-5 Cited by: §1.
  • B. Ding, J. Kulkarni, and S. Yekhanin (2017) Collecting telemetry data privately. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3571–3580. External Links: Link Cited by: §2.1.
  • B. Ding, H. Nori, P. Li, and J. Allen (2018) Comparing population means under local differential privacy: with significance and power. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.1.
  • J. C. Duchi, M. I. Jordan, and M. J. Wainwright (2013) Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 429–438. External Links: Document, ISSN 0272-5428 Cited by: §1, §2.1.
  • C. Dwork, F. McSherry, K. Nissim, and A. Smith (2006) Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, S. Halevi and T. Rabin (Eds.), Berlin, Heidelberg, pp. 265–284. External Links: ISBN 978-3-540-32732-5 Cited by: §2.1.
  • M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hessel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg (2018) Noisy networks for exploration. In International Conference on Learning Representations, External Links: Link Cited by: §1.1.
  • W. B. Johnson (1984) Extensions of Lipschitz mappings into a Hilbert space. Conference in Modern Analysis and Probability, 1984, pp. 189–206. External Links: Document Cited by: §2.1.
  • S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith (2011) What can we learn privately?. SIAM Journal on Computing 40 (3), pp. 793–826. Cited by: §1, §2.1.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pp. 1928–1937. External Links: Link Cited by: §1, §2.2.1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1.1.
  • J. Morimoto and K. Doya (2005) Robust reinforcement learning. Neural Computation 17 (2), pp. 335–359. External Links: Document, Link, https://doi.org/10.1162/0899766053011528 Cited by: §2.2.
  • A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, et al. (2015) Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296. Cited by: §1.
  • G. Palmer, R. Savani, and K. Tuyls (2019) Negative update intervals in deep multi-agent reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, Richland, SC, pp. 43–51. External Links: ISBN 978-1-4503-6309-9, Link Cited by: §1.
  • X. Pan, W. Wang, X. Zhang, B. Li, J. Yi, and D. Song (2019) How you act tells a lot: privacy-leaking attack on deep reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 368–376. Cited by: §1.
  • L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta (2017) Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 2817–2826. External Links: Link Cited by: §2.2.
  • A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine (2016) Epopt: learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283. Cited by: §1, §2.2.
  • J. Sakuma, S. Kobayashi, and R. N. Wright (2008) Privacy-preserving reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, pp. 864–871. External Links: ISBN 978-1-60558-205-4, Link, Document Cited by: §1.1.
  • S. Shalev-Shwartz, S. Shammah, and A. Shashua (2016) Safe, multi-agent, reinforcement learning for autonomous driving. CoRR abs/1610.03295. External Links: Link, 1610.03295 Cited by: §1.
  • B. Wang and N. Hegde (2019) Privacy-preserving q-learning with functional noise in continuous spaces. In Advances in Neural Information Processing Systems, pp. 11323–11333. Cited by: §1.1.
  • N. Wang, X. Xiao, Y. Yang, J. Zhao, S. C. Hui, H. Shin, J. Shin, and G. Yu (2019) Collecting and analyzing multidimensional data with local differential privacy. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 638–649. External Links: Document Cited by: §4.
  • S. Zhang and F. Makedon (2005) Privacy preserving learning in negotiation. In Proceedings of the 2005 ACM Symposium on Applied Computing, SAC ’05, New York, NY, USA, pp. 821–825. External Links: ISBN 1-58113-964-0, Link, Document Cited by: §1.1.
  • T. Zhu and P. S. Yu (2019) Applying differential privacy mechanism in artificial intelligence. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 1601–1609. Cited by: §1.1.