Actor Critic with Differentially Private Critic

10/14/2019 · by Jonathan Lebensold, et al.

Reinforcement learning algorithms are known to be sample inefficient, and often performance on one task can be substantially improved by leveraging information (e.g., via pre-training) on other related tasks. In this work, we propose a technique to achieve such knowledge transfer in cases where agent trajectories contain sensitive or private information, such as in the healthcare domain. Our approach leverages a differentially private policy evaluation algorithm to initialize an actor-critic model and improve the effectiveness of learning in downstream tasks. We empirically show this technique increases sample efficiency in resource-constrained control problems while preserving the privacy of trajectories collected in an upstream task.







1 Introduction

While reinforcement learning (RL) is an attractive framework for modeling decision making under uncertainty, its sample inefficiency is well known Wang et al. (2017). In some cases this can be surmounted by, for example, using a simulator in silico Fox and Wiens (2019) or relying on some form of transfer learning Taylor and Stone (2009); Parisotto et al. (2016). These solutions, however, rarely account for real-world constraints arising in tasks where data privacy must be addressed. Examples include hospitals sharing patient data to improve clinical decision-making, or navigational information collected from agents in the real world Fridman et al. (2018); Zhu et al. (2017). Thus, as more real-world problems are modelled as use cases for RL algorithms, privacy-preserving knowledge transfer between agent environments will become a deployment requirement.

Differential privacy (DP) is a robust privacy-preserving technique for data analysis algorithms Apple (2017); Al-Rubaie and Chang (2019). Previous works on DP for sequential decision-making tasks have focused on the (contextual) bandits setting Shariff and Sheffet (2018); Tossou and Dimitrakakis (2016, 2017); Gajane et al. (2017), or developed RL methods which treat the rewards as sensitive Wang and Hegde (2019). We argue that such approaches do not address situations where a trusted aggregator wishes to use historical, potentially sensitive data to bootstrap an RL algorithm that learns on an untrusted environment. For example, one could aggregate sensitive patient information to train an agent which is then shared and improved (e.g., through personalization) on a smaller, local dataset. In this paper we model such a scenario by assuming a number of untrusted agents (consumers) whose goal is to solve an RL task in a sample-efficient manner by leveraging information obtained from a trusted aggregator (the producer). The setup we consider is described in Figure 1.

Privacy risks are well studied in the supervised learning setting, and DP provides safeguards against an attacker attempting to learn whether an individual record is included in the training set Carlini et al. (2018); Abadi et al. (2016). In the context of RL, previous work has demonstrated how an attacker can infer information about training trajectories from a policy Pan et al. (2019). The clinical example we use to motivate our work is inspired by recent successes of RL in the context of sepsis treatments Komorowski et al. (2018).

How best to transfer knowledge from one RL agent to another, even without privacy considerations, remains an area of active research. Informally, any technique where an algorithm is parametrized based on previous training can be considered a form of transfer learning Bengio et al. (2007). The goal of most transfer learning is to use previously learned knowledge to speed up learning in a related task. Multi-task RL Teh et al. (2017) and multi-agent RL Lowe et al. (2017), model distillation Rusu et al. (2015) and meta-learning Finn et al. (2017) are strategies considered when the task may vary between environments or where a prior is assumed to increase sample efficiency. Transfer learning is also frequently used in supervised learning for vision and speech tasks Oquab et al. (2014); Yosinski et al. (2014); Onu et al. (2019).

In this work, we explore transfer learning in RL under DP constraints. Specifically, we investigate how actor-critic methods perform when initialized using a privatized First-Visit Monte Carlo estimate Sutton and Barto (2018) of the value function. A trusted data aggregator, called the producer in our algorithm, uses DP-LSL Balle et al. (2016), a differentially private policy evaluation algorithm, to learn a value function V. Actor-critic, a commonly used policy gradient method in RL, enables our agent, the consumer, to iteratively improve an action-value function Q and a state-value function V. Our method uses the output of DP-LSL to initialize the actor-critic algorithm, effectively transferring knowledge between the producer and the consumers while preserving the privacy of the trajectories collected by the producer. Such an approach is desirable when the consumer's environment is data-limited.

Figure 1: DP Producer/Consumer preserves privacy with a single transfer from the producer.

2 Differentially Private Policy Evaluation with DP-LSL

Differential privacy is achieved by introducing carefully calibrated noise into an algorithm. The goal of a differentially private algorithm is to bound the effect that any individual user’s contribution might have on the output while maintaining proximity to the original output. By limiting individual contributions, the potential risk of an adversary learning sensitive information about any one user is limited.
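As a concrete illustration of noise calibration (this is the classic Gaussian mechanism with a fixed global sensitivity, not the data-dependent smooth-sensitivity calibration DP-LSL uses; the function name and example data are illustrative):

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng=None):
    """Release `value` with (epsilon, delta)-DP via the Gaussian mechanism.

    Uses the standard calibration sigma = sqrt(2 ln(1.25/delta)) * S / epsilon,
    where S bounds how much one user's data can change `value` in L2 norm.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Example: privatize the mean of n rewards bounded in [0, 1];
# replacing one user's reward changes the mean by at most 1/n.
rewards = np.array([0.2, 0.9, 0.5, 0.7])
private_mean = gaussian_mechanism(rewards.mean(),
                                  l2_sensitivity=1 / len(rewards),
                                  epsilon=1.0, delta=1e-5)
```

Smaller epsilon or delta increases sigma, trading utility for a stronger bound on any individual's influence.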

A user must be defined in order to properly calibrate noise. We assume each user contributes a single trajectory to a database D of records collected by the producer. Each record represents a trajectory of states, actions and rewards in S × A × R. For example, patient treatment decisions can be expressed as actions selected from a set A and health readings as observations from a state-space S, while rewards model the treatment outcome. The number of trajectories in D is denoted by n, and the set of states visited by a single trajectory τ is denoted S(τ). The First-Visit Monte Carlo estimate of the value of a state s is defined as the empirical average, over the trajectories that visit s, of γ-discounted sums of rewards observed from the first visit to s on each trajectory. (Trajectories that never visit s are ignored when computing this estimate.)
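The first-visit estimate can be sketched as follows (the trajectory encoding as (state, action, reward) tuples is an illustrative choice):

```python
from collections import defaultdict

def first_visit_mc(trajectories, gamma=0.99):
    """First-Visit Monte Carlo value estimates.

    Each trajectory is a list of (state, action, reward) tuples. The value of a
    state is the average, over trajectories that visit it, of the gamma-discounted
    return observed from the first visit onward.
    """
    returns = defaultdict(list)
    for traj in trajectories:
        seen = set()
        for t, (s, _, _) in enumerate(traj):
            if s in seen:
                continue  # only the first visit on each trajectory counts
            seen.add(s)
            # Discounted return from time t to the end of the trajectory
            g = sum(gamma ** k * r for k, (_, _, r) in enumerate(traj[t:]))
            returns[s].append(g)
    # States never visited by any trajectory are simply absent from the estimate.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```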

DP-LSL Balle et al. (2016) is one of the few DP algorithms to support policy evaluation and provide theoretical privacy and utility guarantees. By treating the estimation of the value function as a regularized least-squares regression problem based on the Monte Carlo estimates, we can bound the influence of each trajectory on the value function. DP-LSL achieves differential privacy by adding Gaussian noise to the output of this regression problem; the noise is calibrated in a data-dependent manner to achieve (ε, δ)-DP by using the smooth sensitivity framework Nissim et al. (2007).

More formally, to find the parameter vector θ representing the value function, DP-LSL minimizes the objective function below, which includes a ridge penalty with λ > 0:

    θ̂_λ = argmin_θ Σ_s p_s ( V̂(s) − φ(s)^⊤ θ )² + λ ‖θ‖²

The regression weights p_s represent any initial prior/importance that we may ascribe to each state, e.g., depending on how frequently it is visited. This least-squares problem can be solved in closed form to find a parameter vector yielding the value function V(s) = φ(s)^⊤ θ̂_λ, where the rows of a given feature matrix Φ contain the features φ(s) for each state. Gaussian noise is then applied to the result. The utility analysis in Balle et al. (2016) suggests that the regularization parameter λ must be carefully chosen as a function of the number of trajectories n.
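A simplified sketch of this producer-side computation: solve the weighted ridge regression in closed form, then perturb the solution with Gaussian noise. Here `sigma` is a fixed noise scale supplied by the caller, whereas DP-LSL derives it from smooth sensitivity; the function and variable names are illustrative.

```python
import numpy as np

def private_value_fit(Phi, v_hat, weights, lam, sigma, rng=None):
    """Weighted ridge regression on Monte Carlo value estimates, plus Gaussian noise.

    Phi:     (num_states, d) feature matrix, one row phi(s) per state
    v_hat:   (num_states,) first-visit Monte Carlo estimates
    weights: (num_states,) per-state importance weights p_s
    lam:     ridge parameter lambda > 0
    sigma:   Gaussian noise scale (DP-LSL calibrates this via smooth sensitivity)
    """
    rng = np.random.default_rng() if rng is None else rng
    W = np.diag(weights)
    d = Phi.shape[1]
    # Closed-form minimizer of sum_s p_s (v_hat[s] - phi(s)^T theta)^2 + lam ||theta||^2
    theta = np.linalg.solve(Phi.T @ W @ Phi + lam * np.eye(d), Phi.T @ W @ v_hat)
    return theta + rng.normal(0.0, sigma, size=d)

# The consumer's critic can then be initialized as V(s) = phi(s) @ theta_private.
```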

3 Actor Critic with Differentially Private Critic

Our proposed algorithm comprises two phases: a producer phase, which uses historical data (considered confidential) to evaluate a policy, i.e., obtain the associated value function; and a consumer phase, which runs an actor-critic algorithm initialized with the value function provided by the producer. Intuitively, such a prior should help the critic provide informed initial estimates for the actions the actor takes. While we restrict ourselves to actor-critic, any algorithm that incorporates a value function could be used for the consumer phase.

1. Producer: compute the First-Visit Monte Carlo estimate V̂ from the trajectories in D.
2. Producer: run DP-LSL on V̂ to obtain private parameters θ̃; let Ṽ(s) = φ(s)^⊤ θ̃.
3. Consumer: run actor-critic with its critic initialized to Ṽ.
Algorithm 1 Actor-Critic with Differentially Private Critic

The producer is therefore responsible for policy evaluation, i.e., attempting to learn the state-value function for a given policy π. Empirical results come from collecting sample trajectories using a learning algorithm (SARSA) Sutton and Barto (2018). While any learning algorithm could be used in practice, SARSA is appealing due to its relative simplicity and its convergence properties in the limit.
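An illustrative sketch of the producer's trajectory collection via tabular SARSA (the environment interface and hyperparameters here are assumptions, not the paper's exact setup):

```python
import numpy as np

def sarsa(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
    """Tabular SARSA: on-policy TD control with an epsilon-greedy behaviour policy."""
    rng = np.random.default_rng() if rng is None else rng
    Q = np.zeros((env.num_states, env.num_actions))

    def act(s):
        if rng.random() < epsilon:
            return int(rng.integers(env.num_actions))
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s = env.reset()
        a = act(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = act(s2)
            # On-policy update: bootstrap on the action actually taken next
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] * (not done) - Q[s, a])
            s, a = s2, a2
    return Q
```

Trajectories generated while following the epsilon-greedy policy would then be accumulated into the producer's database D.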

4 Empirical Setup

Our empirical results come from two domains: an MDP domain (100 states with two actions) and the OpenAI Gym Brockman et al. (2016) environment Taxi-V2 Dietterich (2000). These experiments compare the benefit of value-function transfer in the consumer phase. The producer phase outputs a least-squares approximation of a differentially private value function; letting V(s) = φ(s)^⊤ θ̃, we can then initialize the consumer's critic. The ridge-regularization parameter λ is set as a function of the number of trajectories, following the utility analysis of DP-LSL. We fix the privacy parameter δ and vary ε.

Figure 2: MDP-100 and Taxi-V2 Task Transfer Results under DP-LSL

Patient Treatment Progression

The Markov Decision Process (MDP) experiments can be regarded as clinical in nature (a patient's data is encoded into a state-vector representation similar to Komorowski et al. (2018)), but the setup could easily be applied to other domains, such as autonomous navigation. We study two approaches to generating samples: taking only optimal actions (the agent selects the action maximizing the optimal action-value function Q*) and using an on-policy temporal-difference method, SARSA Sutton and Barto (2018).

Our MDP consists of a chain of 100 states. In each state the agent has some probability of staying and some probability of advancing (as in supp. material). The environment has two actions, which differ in their probability of transitioning to the right. A reward of 1 is given when the agent reaches the final, absorbing state, and -1 in all other states. States are one-hot encoded in a vector of size 100. This setup illustrates a case of policy evaluation in the medical domain, where patients tend to progress through stages of recovery at different rates, and past states are not usually revisited (because in the medical domain, states contain historic information about past treatments).
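A minimal version of this chain environment can be sketched as follows (the per-action advance probabilities are illustrative placeholders; the paper's exact values are not reproduced here):

```python
import numpy as np

class ChainMDP:
    """Chain of `n` states; the two actions differ in their probability of advancing.

    Reaching the final (absorbing) state yields reward +1; every other step yields -1.
    """
    def __init__(self, n=100, advance_probs=(0.5, 0.9), rng=None):
        self.n = n
        self.advance_probs = advance_probs  # illustrative, not the paper's values
        self.rng = np.random.default_rng() if rng is None else rng
        self.num_states, self.num_actions = n, len(advance_probs)

    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):
        # Advance one state to the right with the action's probability, else stay
        if self.rng.random() < self.advance_probs[action]:
            self.s += 1
        done = self.s == self.n - 1
        reward = 1.0 if done else -1.0
        return self.s, reward, done

    def one_hot(self, s):
        v = np.zeros(self.n)
        v[s] = 1.0
        return v
```

Because the agent can only stay or move right, past states are never revisited, matching the patient-progression structure described above.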


Taxi-V2 Dietterich (2000) is a discrete grid-world environment where the agent must pick up and drop off passengers at the right locations. While this is still a relatively simple environment, it provides a classic scenario where an agent is able to draw on prior experience without leaking information about trajectories performed before initialization.

Figure 3: Least Squares Transfer Performance (10k and 50k producer episodes)


We find in Figure 2 that even with a layer of DP noise, our critic benefits from an initialization of the value function estimate. In an environment where the number of available episodes is finite and regulatory frameworks inhibit data sharing, such a transfer learning approach could provide tangible benefits while minimizing the harm to the users whose data informs the shared parameters. Further experiments in Figure 3 illustrate that having more sample episodes will likely improve the quality of the transfer. We also find that some transfer is better than no transfer: for example, in Taxi-V2 we achieve convergence after 10,000 episodes, whereas without transfer the agent required 15,000 episodes. We also note that changing our privacy budget by two orders of magnitude does not noticeably change the resulting transfer, meaning that we can benefit from initialization with limited risk to individual privacy.

5 Conclusion

We presented a motivated set of use cases for applying a differentially-private critic in the RL setting. The definition of the producer and consumer trust-model is common in real-world deployments and fits with existing transfer learning approaches where data centralization is difficult. Our preliminary results suggest a measurable improvement in sample efficiency through task transfer. We look forward to exploring how this framework could be extended so that the consumer’s critic could then be shared with the producer by leveraging ideas coming from the Federated Learning literature McMahan et al. (2018).


We thank Joey Bose, Mike Rabbat and Maxime Wabartha for discussion and comments. This work was supported in part by Google DeepMind and Mila.


  • [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security - CCS’16, New York, New York, USA, pp. 308–318. External Links: Document, ISBN 9781450341394, Link Cited by: §1.
  • [2] M. Al-Rubaie and J. M. Chang (2019) Privacy-preserving machine learning: threats and solutions. IEEE Security & Privacy 17 (2), pp. 49–58. Cited by: §1.
  • [3] Apple (2017) Learning with Privacy at Scale. Apple Machine Learning Journal 1, pp. 1–25. External Links: Link Cited by: §1.
  • [4] B. Balle, M. Gomrokchi, and D. Precup (2016-06) Differentially Private Policy Evaluation. International Conference on Machine Learning (ICML), pp. 2130–2138. External Links: 1603.02010, ISSN 1938-7228, Link Cited by: §1, §2, §2.
  • [5] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle (2007) Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pp. 153–160. Cited by: §1.
  • [6] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §4.
  • [7] N. Carlini, C. Liu, Ú. Erlingsson, J. Kos, and D. Song (2018) The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks. Technical report, Google. External Links: 1802.08232, Link Cited by: §1.
  • [8] T. G. Dietterich (2000) Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research 13, pp. 227–303. External Links: ISSN 10769757 Cited by: §4, §4.
  • [9] C. Finn, P. Abbeel, and S. Levine (2017-03) Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. International Conference on Machine Learning (ICML). External Links: 1703.03400, Link Cited by: §1.
  • [10] I. Fox and J. Wiens (2019) Reinforcement learning for blood glucose control: challenges and opportunities. ICML 2019 Workshop RL4RealLife. Cited by: §1.
  • [11] L. Fridman, B. Jenik, and J. Terwilliger (2018) Deeptraffic: driving fast through dense traffic with deep reinforcement learning. arXiv preprint arXiv:1801.02805. Cited by: §1.
  • [12] P. Gajane, T. Urvoy, and E. Kaufmann (2017) Corrupt bandits for preserving local privacy. arXiv preprint arXiv:1708.05033. Cited by: §1.
  • [13] M. Komorowski, L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal (2018) The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine 24 (11), pp. 1716–1720. External Links: Document, ISSN 1546170X, Link Cited by: §1, §4.
  • [14] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6379–6390. External Links: Link Cited by: §1.
  • [15] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang (2018) Learning differentially private recurrent language models. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • [16] K. Nissim, S. Raskhodnikova, and A. D. Smith (2007) Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, June 11-13, 2007, pp. 75–84. External Links: Link, Document Cited by: §2.
  • [17] C. C. Onu, J. Lebensold, W. L. Hamilton, and D. Precup (2019-06) Neural Transfer Learning for Cry-based Diagnosis of Perinatal Asphyxia. Interspeech. External Links: 1906.10199, Link Cited by: §1.
  • [18] M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2014) Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724. Cited by: §1.
  • [19] X. Pan, W. Wang, X. Zhang, B. Li, J. Yi, and D. Song (2019) How you act tells a lot: privacy-leakage attack on deep reinforcement learning. CoRR abs/1904.11082. External Links: Link, 1904.11082 Cited by: §1.
  • [20] E. Parisotto, J. L. Ba, and R. Salakhutdinov (2016) Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning. ICLR. External Links: 1511.06342, ISBN 1511.06342v4, Link Cited by: §1.
  • [21] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell (2015) Policy Distillation. ICLR 2016, pp. 13. External Links: 1511.06295, Link Cited by: §1.
  • [22] R. Shariff and O. Sheffet (2018) Differentially private contextual linear bandits. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pp. 4301–4311. External Links: Link Cited by: §1.
  • [23] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. Second edition, The MIT Press. External Links: Link Cited by: §1, §3, §4.
  • [24] M. E. Taylor and P. Stone (2009) Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10 (Jul), pp. 1633–1685. Cited by: §1.
  • [25] Y. W. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu (2017) Distral: robust multitask reinforcement learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 4496–4506. External Links: Link Cited by: §1.
  • [26] A. C. Y. Tossou and C. Dimitrakakis (2017) Achieving privacy in the adversarial multi-armed bandit. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §1.
  • [27] A. C. Tossou and C. Dimitrakakis (2016) Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §1.
  • [28] B. Wang and N. Hegde (2019) Private q-learning with functional noise in continuous spaces. CoRR abs/1901.10634. External Links: Link, 1901.10634 Cited by: §1.
  • [29] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas (2017) Sample efficient actor-critic with experience replay. International Conference on Learning Representations. Cited by: §1.
  • [30] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In Advances in neural information processing systems, pp. 3320–3328. Cited by: §1.
  • [31] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3357–3364. Cited by: §1.