While reinforcement learning (RL) is an attractive framework for modeling decision making under uncertainty, its sample-inefficiency challenges are well known Wang et al. (2017) and can in some cases be surmounted by, for example, using a simulator in silico Fox and Wiens (2019) or relying on some form of transfer learning Taylor and Stone (2009); Parisotto et al. (2016). These solutions, however, rarely account for real-world constraints arising in tasks where data privacy must be addressed. Examples include hospitals sharing patient data to improve clinical decision-making, or navigational information collected from agents in the real world Fridman et al. (2018); Zhu et al. (2017). Thus, as more real-world problems are modelled as use cases for RL algorithms, privacy-preserving knowledge transfer between agent environments will become a deployment requirement.
Differential privacy (DP) is a robust privacy-preserving technique for data analysis algorithms Apple (2017); Al-Rubaie and Chang (2019). Previous works on DP for sequential decision-making tasks have focused on the (contextual) bandits setting Shariff and Sheffet (2018); Tossou and Dimitrakakis (2016, 2017); Gajane et al. (2017), or developed RL methods which treat the rewards as sensitive Wang and Hegde (2019). We argue that such approaches do not address situations where a trusted aggregator wishes to use historical, potentially sensitive data to bootstrap an RL algorithm to learn on an untrusted environment. For example, aggregating sensitive patient information to train an agent which could then be shared and improved (e.g. through personalization) on a smaller, local dataset. In this paper we model such a scenario by assuming a number of untrusted agents (consumers) whose goal is to solve an RL task in a sample-efficient manner by leveraging information obtained from a trusted aggregator (the producer). The setup we consider is described in Figure 1.
Privacy risks are well studied in the supervised learning setting, and DP provides safeguards against an attacker attempting to learn whether an individual record is included in the training set Carlini et al. (2018); Abadi et al. (2016). In the context of RL, previous work has demonstrated how an attacker can infer information about training trajectories from a policy Pan et al. (2019). The clinical example we use to motivate our work is inspired by recent successes of RL in the context of sepsis treatments Komorowski et al. (2018).
How best to transfer knowledge from one RL agent to another, even without privacy considerations, remains an area of active research. Informally, any technique where an algorithm is parametrized based on previous training can be considered a form of transfer learning Bengio et al. (2007). The goal of most transfer learning is to use previously learned knowledge to speed up learning in a related task. Multi-task RL Teh et al. (2017), multi-agent RL Lowe et al. (2017), model distillation Rusu et al. (2015), and meta-learning Finn et al. (2017) are strategies considered when the task may vary between environments or where a prior is assumed to increase sample efficiency. Transfer learning is also frequently used in supervised learning for vision and speech tasks Oquab et al. (2014); Yosinski et al. (2014); Onu et al. (2019).
In this work, we explore transfer learning in RL under DP constraints. Specifically, we investigate how actor-critic methods perform when initialized using a privatized First-Visit Monte Carlo estimate Sutton and Barto (2018) of the value function. A trusted data aggregator, called the producer in our algorithm, uses DP-LSL Balle et al. (2016), a differentially private policy evaluation algorithm, to learn a value function. Actor-critic, a commonly used policy gradient method in RL, enables our agent, the consumer, to iteratively improve an action-value function and a state-value function. Our method uses the output of DP-LSL to initialize the actor-critic algorithm, effectively transferring knowledge between the producer and the consumers while preserving the privacy of the trajectories collected by the producer. Such an approach is desirable when the environment of the consumer may be considered data-limited.
2 Differentially Private Policy Evaluation with DP-LSL
Differential privacy is achieved by introducing carefully calibrated noise into an algorithm. The goal of a differentially private algorithm is to bound the effect that any individual user’s contribution might have on the output while maintaining proximity to the original output. By limiting individual contributions, the potential risk of an adversary learning sensitive information about any one user is limited.
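As a minimal illustration of this idea (not the mechanism used in this paper, which is described in the next paragraphs), the classic Gaussian mechanism releases a statistic after adding noise scaled to the statistic's sensitivity; the dataset, sensitivity value, and privacy parameters below are hypothetical:

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng):
    """Release `value` with Gaussian noise calibrated to its L2 sensitivity.

    Uses the standard calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon,
    which gives (epsilon, delta)-DP for epsilon in (0, 1).
    """
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(0.0, sigma)

rng = np.random.default_rng(0)
# Mean of 100 values in [0, 1]: changing one user's record moves the mean
# by at most 1/100, so the sensitivity of the mean is 1/100.
data = rng.uniform(size=100)
private_mean = gaussian_mechanism(data.mean(), sensitivity=1 / 100,
                                  epsilon=0.5, delta=1e-5, rng=rng)
```

The noise scale grows as the privacy budget epsilon shrinks, which is the trade-off between privacy and proximity to the original output described above.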
A user must be defined in order to properly calibrate noise. We assume each user contributes a single trajectory to a database D of records collected by the producer. Each record represents a trajectory of states, actions and rewards. For example, patient treatment decisions can be expressed as actions selected from a set A and health readings as observations from a state-space S, while rewards model the treatment outcome. The number of trajectories in D is denoted by m. The set of states visited by a single trajectory x is denoted by S(x). The First-Visit Monte Carlo estimate V̂(s) of the value of a state s is defined as the empirical average, over the trajectories that visit s, of the γ-discounted sums of rewards observed from the first visit to s on each trajectory. (Trajectories with s ∉ S(x) are ignored when computing V̂(s).)
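The first-visit estimate above can be sketched as follows; the trajectory encoding (a list of (state, reward) pairs per user) is an illustrative assumption:

```python
from collections import defaultdict

def first_visit_mc(trajectories, gamma=0.99):
    """First-Visit Monte Carlo value estimates.

    `trajectories` is a list of trajectories, each a list of (state, reward)
    pairs. The estimate for a state averages, over the trajectories that
    visit it, the gamma-discounted return from the *first* visit onward;
    trajectories that never visit the state are ignored for that state.
    """
    returns = defaultdict(list)
    for traj in trajectories:
        # Discounted return from each time step, computed backwards.
        tails = [0.0] * len(traj)
        g = 0.0
        for t in range(len(traj) - 1, -1, -1):
            g = traj[t][1] + gamma * g
            tails[t] = g
        seen = set()
        for t, (state, _) in enumerate(traj):
            if state not in seen:      # keep only the first visit
                seen.add(state)
                returns[state].append(tails[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

A quick check: with a single trajectory [('a', 0.0), ('b', 1.0)] and gamma = 1, both states get the estimate 1.0, the undiscounted return from their first visit.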
DP-LSL Balle et al. (2016) is one of the few DP algorithms to support policy evaluation and provide theoretical privacy and utility guarantees. By treating the estimation of the value function as a regularized least-squares regression problem based on the Monte Carlo estimates V̂, we can guarantee a limit on the influence of each trajectory on the value function. DP-LSL achieves differential privacy by adding Gaussian noise to the output of this regression problem; the noise is calibrated in a data-dependent manner to achieve (ε, δ)-DP by using the smooth sensitivity framework Nissim et al. (2007).
More formally, to find the parameter vector θ representing the value function, DP-LSL minimizes the objective function below, which includes a ridge penalty with regularization parameter λ > 0:

J(θ) = Σ_{s ∈ S} w_s (φ(s)ᵀθ − V̂(s))² + λ‖θ‖²
The regression weights w_s represent any initial prior/importance that we may ascribe to each state, e.g. depending on how frequently they are visited. This least-squares problem can be solved in closed form to find a parameter vector θ̂ yielding the value function V = Φθ̂, where Φ is a given feature matrix containing the features φ(s) for each state. Gaussian noise is then applied to the result. The utility analysis in Balle et al. (2016) suggests that the regularization parameter λ must be carefully chosen as a function of the number of trajectories m.
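A simplified sketch of this computation (weighted ridge regression solved in closed form, followed by output perturbation) is below. Note that DP-LSL calibrates its noise in a data-dependent manner via smooth sensitivity; here that calibration is replaced by a fixed, hypothetical noise scale `sigma` for brevity:

```python
import numpy as np

def dp_lsl_sketch(Phi, v_hat, weights, lam, sigma, rng):
    """Weighted ridge regression on Monte Carlo targets, with Gaussian noise.

    Minimizes sum_s w_s (phi(s)^T theta - v_hat(s))^2 + lam * ||theta||^2
    in closed form, then perturbs the solution. In DP-LSL the noise scale
    is derived from the smooth sensitivity of this solution; `sigma` here
    is just a placeholder.
    """
    W = np.diag(weights)
    A = Phi.T @ W @ Phi + lam * np.eye(Phi.shape[1])
    theta = np.linalg.solve(A, Phi.T @ W @ v_hat)   # exact ridge solution
    return theta + rng.normal(0.0, sigma, size=theta.shape)

rng = np.random.default_rng(0)
Phi = np.eye(4)                          # one-hot (tabular) features
v_hat = np.array([0.1, 0.4, 0.7, 1.0])   # Monte Carlo value estimates
theta = dp_lsl_sketch(Phi, v_hat, weights=np.ones(4),
                      lam=0.1, sigma=0.05, rng=rng)
```

With one-hot features the noiseless solution is simply v_hat / (1 + lam), so the released θ is a slightly shrunk, noisy copy of the Monte Carlo estimates.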
3 Actor-Critic with Differentially Private Critic
Our proposed algorithm comprises two phases: a producer phase, in which a trusted aggregator uses historical data (considered confidential) to evaluate a policy, i.e. obtain the associated value function; and a consumer phase, in which an agent runs an actor-critic algorithm whose critic is initialized with the value function provided by the producer. Intuitively, such a prior should help the critic produce more accurate early estimates of the value of the actions taken. While we restrict ourselves to actor-critic, any algorithm that incorporates a value function could be used for the consumer phase.
The producer is therefore responsible for policy evaluation, i.e., attempting to learn the state-value function for a given policy. Empirical results come from collecting sample trajectories using a learning algorithm (SARSA) Sutton and Barto (2018). While any learning algorithm could be used in practice, SARSA is appealing due to its relative simplicity and its convergence properties in the limit.
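The on-policy SARSA updates used to collect the producer's trajectories can be sketched in tabular form as follows; the environment interface (`env_reset`, `env_step`) and the epsilon-greedy behaviour policy are illustrative assumptions:

```python
import numpy as np

def sarsa_episode(env_step, env_reset, Q, n_actions, alpha=0.1,
                  gamma=0.99, eps=0.1, rng=np.random.default_rng(0)):
    """Run one tabular SARSA episode and return the trajectory.

    Assumed interfaces: `env_reset() -> state` and
    `env_step(state, action) -> (next_state, reward, done)`.
    `Q` is a dict mapping (state, action) pairs to values, updated in place
    with the on-policy TD target r + gamma * Q(s', a').
    """
    def policy(s):  # epsilon-greedy with respect to the current Q
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return max(range(n_actions), key=lambda a: Q.get((s, a), 0.0))

    s, traj, done = env_reset(), [], False
    a = policy(s)
    while not done:
        s2, r, done = env_step(s, a)
        traj.append((s, a, r))
        a2 = policy(s2) if not done else None
        target = r + (0.0 if done else gamma * Q.get((s2, a2), 0.0))
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s, a = s2, a2
    return traj
```

Repeated calls with a shared `Q` both improve the behaviour policy and yield the trajectory database used for the First-Visit Monte Carlo targets.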
4 Empirical Setup
Our empirical results come from two domains: an MDP domain (100 states with two actions) and the OpenAI Gym Brockman et al. (2016) environment Taxi-V2 Dietterich (2000). These experiments compare the benefit of value function transfer in the consumer phase. The producer phase outputs a least-squares approximation of a differentially private value function, which we use to initialize the consumer's critic. The ridge-regularization parameter λ is set as a function of the number of trajectories m, following the utility analysis of Balle et al. (2016). We fix the privacy parameter δ and vary the budget ε.
Patient Treatment Progression
The Markov Decision Process (MDP) experiments can be regarded as clinical in nature, with a patient's data encoded into a state vector representation similar to Komorowski et al. (2018), but the setup could easily be applied to other domains, such as autonomous navigation. We study two approaches to generating samples: taking only optimal actions (the agent selects the action maximizing the optimal action-value function) and using an on-policy temporal-difference method, SARSA Sutton and Barto (2018).
Our MDP consists of a chain of 100 states. In each state the agent has some probability of staying and some probability of advancing to the next state (detailed in the supplementary material). The environment has two actions, which differ in their probability of transitioning to the right. A reward of 1 is given when the agent reaches the final, absorbing state, and -1 for all other states. States are one-hot encoded in a vector of size 100. This setup illustrates a case of policy evaluation in the medical domain, where patients tend to progress through stages of recovery at different rates and past states are not usually revisited (since, in the medical domain, states contain historic information about past treatments).
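A sketch of a chain environment of this kind is below; the per-action advance probabilities are illustrative placeholders, since the exact values are given in the supplementary material:

```python
import numpy as np

def make_chain_step(n_states=100, p_advance=(0.3, 0.7),
                    rng=np.random.default_rng(0)):
    """Chain MDP: from state s, action a advances to s+1 with probability
    p_advance[a] and otherwise stays. Reward is 1 on reaching the final
    absorbing state and -1 everywhere else. The probabilities here are
    hypothetical stand-ins for the paper's settings."""
    def step(s, a):
        s2 = s + 1 if rng.random() < p_advance[a] else s
        done = s2 == n_states - 1
        return s2, (1.0 if done else -1.0), done
    return step

def one_hot(s, n_states=100):
    """One-hot state encoding, as used for the feature matrix."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x
```

Because every transition either stays or moves right, past states are never revisited, matching the recovery-progression structure described above.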
Taxi-V2 Dietterich (2000) is a discrete grid-world environment where the agent must pick up and drop off passengers at the right location. While this is still a relatively simple environment, it provides a classic scenario where an agent is able to draw on prior experience without leaking information about trajectories performed before initialization.
We find in Figure 2 that even with a layer of DP noise, our critic benefits from an initialization of the value function estimate. In an environment where the number of episodes available is finite and regulatory frameworks inhibit sharing data, such a transfer learning approach could provide tangible benefits while minimizing harm to users, since only privatized parameters are shared. Further experiments in Figure 3 illustrate that having more sample episodes will likely improve the quality of the transfer. We also find that some transfer is better than no transfer: for example, in Taxi-V2 we achieve convergence after 10,000 episodes, whereas without transfer the agent needed 15,000 episodes. We also note that changing our privacy budget by two orders of magnitude does not noticeably change the quality of the resulting transfer, meaning that we can benefit from initialization with limited risks to individual privacy.
We presented a motivated set of use cases for applying a differentially-private critic in the RL setting. The definition of the producer and consumer trust-model is common in real-world deployments and fits with existing transfer learning approaches where data centralization is difficult. Our preliminary results suggest a measurable improvement in sample efficiency through task transfer. We look forward to exploring how this framework could be extended so that the consumer’s critic could then be shared with the producer by leveraging ideas coming from the Federated Learning literature McMahan et al. (2018).
We thank Joey Bose, Mike Rabbat and Maxime Wabartha for discussion and comments. This work was supported in part by Google DeepMind and Mila.
- Abadi et al. (2016) Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS'16), New York, NY, USA, pp. 308–318.
- Al-Rubaie and Chang (2019) Privacy-preserving machine learning: threats and solutions. IEEE Security & Privacy 17 (2), pp. 49–58.
- Apple (2017) Learning with Privacy at Scale. Apple Machine Learning Journal 1, pp. 1–25.
- Balle et al. (2016) Differentially Private Policy Evaluation. In International Conference on Machine Learning (ICML), pp. 2130–2138.
- Bengio et al. (2007) Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pp. 153–160.
- Brockman et al. (2016) OpenAI Gym. arXiv preprint.
- Carlini et al. (2018) The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks. Technical report, Google.
- Dietterich (2000) Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research 13, pp. 227–303.
- Finn et al. (2017) Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In International Conference on Machine Learning (ICML).
- Fox and Wiens (2019) Reinforcement learning for blood glucose control: challenges and opportunities. ICML 2019 Workshop RL4RealLife.
- Fridman et al. (2018) DeepTraffic: driving fast through dense traffic with deep reinforcement learning. arXiv preprint arXiv:1801.02805.
- Gajane et al. (2017) Corrupt bandits for preserving local privacy. arXiv preprint arXiv:1708.05033.
- Komorowski et al. (2018) The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine 24 (11), pp. 1716–1720.
- Lowe et al. (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems 30, pp. 6379–6390.
- McMahan et al. (2018) Learning differentially private recurrent language models. In International Conference on Learning Representations.
- Nissim et al. (2007) Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing (STOC), San Diego, CA, USA, pp. 75–84.
- Onu et al. (2019) Neural Transfer Learning for Cry-based Diagnosis of Perinatal Asphyxia. In Interspeech.
- Oquab et al. (2014) Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1717–1724.
- Pan et al. (2019) How you act tells a lot: privacy-leakage attack on deep reinforcement learning. CoRR abs/1904.11082.
- Parisotto et al. (2016) Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning. In International Conference on Learning Representations (ICLR).
- Rusu et al. (2015) Policy Distillation. In International Conference on Learning Representations (ICLR 2016).
- Shariff and Sheffet (2018) Differentially private contextual linear bandits. In Advances in Neural Information Processing Systems 31, pp. 4301–4311.
- Sutton and Barto (2018) Reinforcement Learning: An Introduction. Second edition, The MIT Press.
- Taylor and Stone (2009) Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10 (Jul), pp. 1633–1685.
- Teh et al. (2017) Distral: robust multitask reinforcement learning. In Advances in Neural Information Processing Systems 30, pp. 4496–4506.
- Tossou and Dimitrakakis (2017) Achieving privacy in the adversarial multi-armed bandit. In Thirty-First AAAI Conference on Artificial Intelligence.
- Tossou and Dimitrakakis (2016) Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence.
- Wang and Hegde (2019) Private Q-learning with functional noise in continuous spaces. CoRR abs/1901.10634.
- Wang et al. (2017) Sample efficient actor-critic with experience replay. In International Conference on Learning Representations.
- Yosinski et al. (2014) How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328.
- Zhu et al. (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364.