1 Introduction
Learning how to make decisions under uncertainty is becoming paramount in many practical applications, such as medical treatment design, energy management, adaptive user interfaces, and recommender systems. Reinforcement learning (Sutton and Barto, 1998) provides a variety of algorithms capable of handling such tasks. However, in many practical applications, aside from obtaining good predictive performance, one might also require that the data used to learn the predictor be kept confidential. This is especially true in medical applications, where patient confidentiality is very important, and in other user-centric applications such as recommender systems. Differential privacy (DP) (Dwork, 2006) is a very active research area that originated in cryptography and has now been embraced by the machine learning community. DP is a formal model of privacy used to design mechanisms that reduce the amount of information leaked by the result of queries to a database containing sensitive information about multiple users (Dwork, 2006). Many supervised learning algorithms have differentially private versions, including logistic regression (Chaudhuri and Monteleoni, 2009; Chaudhuri et al., 2011), support vector machines (Chaudhuri et al., 2011; Rubinstein et al., 2012; Jain and Thakurta, 2013), and the lasso (Thakurta and Smith, 2013). However, differential privacy for reinforcement learning tasks has not been tackled yet, except for the simpler case of bandit problems (Smith and Thakurta, 2013; Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016).
In this paper, we tackle differential privacy for reinforcement learning algorithms in the full Markov Decision Process (MDP) setting. We develop differentially private algorithms for the problem of policy evaluation, in which a given way of behaving has to be evaluated quantitatively. We start with the batch, first-visit Monte Carlo approach to policy evaluation, which is well understood and closest to regression algorithms, and provide two differentially private versions, which come with formal privacy proofs as well as guarantees on the quality of the solution obtained. Both algorithms work by injecting Gaussian noise into the parameter vector of the value function, but they differ in how the amount of noise is defined. Our privacy analysis techniques are related to previous work on output perturbation for empirical risk minimization (ERM), but there are some domain-specific challenges that need to be addressed. Our utility analysis identifies parameters of the MDP that control how easy it is to maintain privacy in each case. The theoretical utility analysis, as well as some illustrative experiments, show that the accuracy of the private algorithms does not suffer (compared to usual Monte Carlo) when the data set is large.
The rest of the paper is organized as follows. In Sec. 2 we provide background notation and results on differential privacy and Monte Carlo methods for policy evaluation. Sec. 3 presents our proposed algorithms. The privacy analysis and the utility analysis are outlined in Sec. 4 and Sec. 5 respectively. Detailed proofs for both of these sections are given in the Supplementary Material. In Sec. 6 we provide empirical illustrations of the scaling behaviour of the proposed algorithms, using synthetic MDPs that try to mimic characteristics of real applications. Finally, we conclude in Sec. 7 with a discussion of related work and avenues for future work.
2 Background
In this section we provide background on differential privacy and policy evaluation from Monte Carlo estimates.
2.1 Differential Privacy
DP takes a user-centric approach, by providing privacy guarantees based on the difference of the outputs of a learning algorithm trained on two databases differing in a single user. The central goal is to bound the loss in privacy that a user can suffer when the result of an analysis on a database with her data is made public. This can incentivize users to participate in studies using sensitive data, e.g. mining of medical records. In the context of machine learning, differentially private algorithms are useful because they allow learning models in such a way that their parameters do not reveal information about the training data (McSherry and Talwar, 2007). For example, one can think of using historical medical records to learn prognostic and diagnostic models which can then be shared between multiple health service providers without compromising the privacy of the patients whose data was used to train the model.
To formalize the above discussion, let $\mathcal{X}$ be an input space and $\mathcal{Y}$ an output space. Suppose $A$ is a randomized algorithm that takes as input a tuple $X = (x_1, \ldots, x_m)$ of elements from $\mathcal{X}$ for some $m \geq 1$ and outputs a (random) element $A(X)$ of $\mathcal{Y}$. We interpret $X$ as a dataset containing data from $m$ individuals and define its neighbouring datasets as those that differ from $X$ in their last element: $X' = (x_1, \ldots, x_{m-1}, x_m')$ with $x_m \neq x_m'$. (Formally, we should define neighbouring datasets as those which differ in one element, not necessarily the last. But we are implicitly assuming here that the order of the elements in $X$ does not affect the distribution of $A(X)$, so we can assume without loss of generality that the difference between neighbouring datasets is always in the last element.) We denote this (symmetric) relation by $X \simeq X'$. The algorithm $A$ is $(\varepsilon, \delta)$-differentially private for some $\varepsilon, \delta > 0$ if for every $m \geq 1$, every pair of datasets $X, X' \in \mathcal{X}^m$ with $X \simeq X'$, and every measurable set $\Omega \subseteq \mathcal{Y}$ we have
(1) $\Pr[A(X) \in \Omega] \leq e^{\varepsilon} \Pr[A(X') \in \Omega] + \delta$
This definition means that the distributions over possible outputs of $A$ on inputs $X$ and $X'$ are very similar, so revealing this output leaks almost no information on whether $x_m$ or $x_m'$ was in the dataset.
A simple way to design a DP algorithm for a given function $f : \mathcal{X}^m \to \mathcal{Y}$ is the output perturbation mechanism, which releases $A(X) = f(X) + \eta$, where $\eta$ is noise sampled from a properly calibrated distribution. For real outputs $\mathcal{Y} = \mathbb{R}^d$, the Laplace (resp. Gaussian) mechanism (see e.g. Dwork and Roth (2014)) samples each component of the noise $\eta = (\eta_1, \ldots, \eta_d)$ i.i.d. from a Laplace (resp. Gaussian) distribution with standard deviation $O(\mathrm{GS}_1(f)/\varepsilon)$ (resp. $O(\mathrm{GS}_2(f)\sqrt{\ln(1/\delta)}/\varepsilon)$), where $\mathrm{GS}_p(f)$ is the global sensitivity of $f$ given by $\mathrm{GS}_p(f) = \sup_{X \simeq X'} \|f(X) - f(X')\|_p$.
Calibrating noise to the global sensitivity is a worst-case approach that requires taking the supremum over all possible pairs of neighbouring datasets, and in general does not account for the fact that in some datasets privacy can be achieved with substantially smaller perturbations. In fact, for many applications (like the one we consider in this paper) the global sensitivity is too large to provide useful mechanisms. Ideally one would like to add perturbations proportional to the potential changes around the input dataset $X$, as measured, for example, by the local sensitivity $\mathrm{LS}_p(f, X) = \sup_{X' \simeq X} \|f(X) - f(X')\|_p$. Nissim et al. (2007) showed that approaches based on $\mathrm{LS}_p$ do not lead to differentially private algorithms, and then proposed an alternative framework for DP mechanisms with data-dependent perturbations based on the idea of smoothed sensitivity. This is the approach we use in this paper; see Section 4 for further details.
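As a concrete illustration of output perturbation, the following minimal sketch adds Gaussian noise calibrated to a global L2 sensitivity. It uses the classical calibration from Dwork and Roth (2014), sigma = GS_2 * sqrt(2 ln(1.25/delta)) / epsilon, which is stated there for 0 < epsilon < 1. The function names and the example numbers are hypothetical, and this is not the data-dependent calibration used by the algorithms in this paper.

```python
import math
import random


def gaussian_sigma(l2_sensitivity, eps, delta):
    """Noise scale of the classical Gaussian mechanism (needs 0 < eps < 1)."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("classical calibration needs 0 < eps < 1 and 0 < delta < 1")
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps


def output_perturbation(value, l2_sensitivity, eps, delta, rng):
    """Release value + eta, with each coordinate of eta drawn from N(0, sigma^2)."""
    sigma = gaussian_sigma(l2_sensitivity, eps, delta)
    return [v + rng.gauss(0.0, sigma) for v in value]


rng = random.Random(0)
exact = [0.2, 0.5, 0.9]  # stands in for f(X); values are made up
private = output_perturbation(exact, l2_sensitivity=0.1, eps=0.5, delta=1e-5, rng=rng)
```

The noise scale grows linearly with the sensitivity, which is why a large worst-case sensitivity translates directly into a noisy release.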
2.2 Policy Evaluation
Policy evaluation is the problem of obtaining (an approximation to) the value function of a Markov reward process defined by an MDP $M$ and a policy $\pi$ (Sutton and Barto, 1998; Szepesvári, 2010). In many cases of interest $M$ is unknown but we have access to trajectories containing state transitions and immediate rewards sampled from $\pi$. When the state space of $M$ is relatively small, tabular methods that represent the value of each state individually can be used. However, in problems with large (or even continuous) state spaces, parametric representations for the value function are typically needed in order to defeat the curse of dimensionality and exploit the fact that similar states will have similar values. In this paper we focus on policy evaluation with linear function approximation in the batch case, where we have access to a set of trajectories sampled from the policy of interest.
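Let $M$ be an MDP over a finite state space $\mathcal{S}$ with $N = |\mathcal{S}|$ and $\pi$ a policy on $M$. Given an initial state $s_0 \in \mathcal{S}$, the interaction of $\pi$ with $M$ is described by a sequence $((s_t, a_t, r_t))_{t \geq 0}$ of state–action–reward triplets. Suppose $0 < \gamma < 1$ is the discount factor of $M$. The value function $V^{\pi} : \mathcal{S} \to \mathbb{R}$ of $\pi$ assigns to each state the expected discounted cumulative reward obtained by a trajectory following policy $\pi$ from that state:
(2) $V^{\pi}(s) = \mathbb{E}\left[ \sum_{t \geq 0} \gamma^{t} r_t \,\middle|\, s_0 = s \right]$
The value function can be considered a vector $V^{\pi} \in \mathbb{R}^{\mathcal{S}}$. We make the usual assumption that any reward generated by $M$ is bounded: $0 \leq r_t \leq R_{\max}$, so $0 \leq V^{\pi}(s) \leq R_{\max}/(1-\gamma)$ for all $s \in \mathcal{S}$.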
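Let $\Phi \in \mathbb{R}^{\mathcal{S} \times d}$ be a feature representation that associates each state $s$ to a $d$-dimensional feature vector $\phi_s \in \mathbb{R}^{d}$, given by the row of $\Phi$ indexed by $s$. The goal is to find a parameter vector $\theta \in \mathbb{R}^{d}$ such that $\hat{V}^{\pi} = \Phi\theta$ is a good approximation to $V^{\pi}$. To do so, we assume that we have access to a collection $X = (x_1, \ldots, x_m)$ of finite trajectories sampled from $M$ by $\pi$, where each $x_i$ is a sequence of states, actions and rewards.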
We will use a Monte Carlo approach, in which the returns of the trajectories in $X$ are used as regression targets to fit the parameters in $\hat{V}^{\pi}$ via a least-squares approach (Sutton and Barto, 1998). In particular, we consider first-visit Monte Carlo estimates obtained as follows. Suppose $x = ((s_1, a_1, r_1), \ldots, (s_T, a_T, r_T))$ is a trajectory that visits $s$ and $i_{x,s}$ is the time of the first visit to $s$; that is, $s_{i_{x,s}} = s$ and $s_t \neq s$ for all $t < i_{x,s}$. The return collected from this first visit is given by $F_{x,s} = \sum_{t = i_{x,s}}^{T} r_t \gamma^{t - i_{x,s}}$ and provides an unbiased estimate of $V^{\pi}(s)$. For convenience, when state $s$ is not visited by trajectory $x$ we assume $F_{x,s} = 0$.
Given the returns from all first visits corresponding to a dataset $X$ with $m$ trajectories, we can find a parameter vector for the estimator $\hat{V}^{\pi} = \Phi\theta$ by solving the optimization problem $\operatorname{argmin}_{\theta} J_X(\theta)$, where
(3) $J_X(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{s \in \mathcal{S}_{x_i}} \rho_s \left( F_{x_i,s} - \phi_s^{\top}\theta \right)^2$
and $\mathcal{S}_x$ is the set of states visited by trajectory $x$. The regression weights $\rho_s$ are given as an input to the problem and capture the user's belief that some states are more relevant than others. It is obvious that $J_X(\theta)$ is a convex function of $\theta$. However, in general it is not strongly convex and therefore the optimum of $\operatorname{argmin}_{\theta} J_X(\theta)$ is not necessarily unique. On the other hand, it is known that differential privacy is tightly related to certain notions of stability (Thakurta and Smith, 2013), and optimization problems with non-unique solutions generally pose a problem to stability. In order to avoid this problem, the private policy evaluation algorithms that we propose in Section 3 are based on optimizing slightly modified versions of $J_X(\theta)$ which promote stability in their solutions. Note that the notions of stability related to DP are for worst-case situations: that is, they need to hold for every possible pair of neighbouring input datasets $X \simeq X'$, regardless of any generative model assumed for the trajectories in those datasets. In particular, these stability considerations are not directly related to the variance of the estimates in $\hat{V}^{\pi}$.
We end this section with a discussion of the main obstruction to stability, i.e. the cases where $\operatorname{argmin}_{\theta} J_X(\theta)$ fails to have a unique solution. Given a dataset $X$ with $m$ trajectories we define a vector $F_X$ containing the average first-visit returns from all trajectories in $X$ that visit a particular state. In particular, if $X_s$ represents the multiset of trajectories from $X$ that visit state $s$ at some point, then we have
(4) $F_{X,s} = \frac{1}{|X_s|} \sum_{x \in X_s} F_{x,s}$
If $s$ is not visited by any trajectory in $X$ we set $F_{X,s} = 0$. To simplify notation, let $F_X \in \mathbb{R}^{\mathcal{S}}$ be the vector collecting all these estimates. We also define a diagonal matrix $\Gamma_X \in \mathbb{R}^{\mathcal{S} \times \mathcal{S}}$ with entries given by the product of the regression weight on each state and the fraction of trajectories in $X$ visiting that state: $\Gamma_X(s,s) = \rho_s |X_s| / m$. Solving for $\theta$ in $\nabla_{\theta} J_X(\theta) = 0$, it is easy to see that any optimal $\theta_X$ must satisfy
(5) $\Phi^{\top} \Gamma_X \Phi \, \theta_X = \Phi^{\top} \Gamma_X F_X$
Thus, this optimization has a unique solution if and only if the matrix $\Phi^{\top} \Gamma_X \Phi$ is invertible. Since it is easy to find neighbouring datasets $X \simeq X'$ where at most one of $\Phi^{\top} \Gamma_X \Phi$ and $\Phi^{\top} \Gamma_{X'} \Phi$ is invertible, optimizing $J_X(\theta)$ directly poses a problem for the design of differentially private policy evaluation algorithms with small perturbations. Next we present two DP algorithms based on stable policy evaluation algorithms.
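The quantities defined in this section can be computed directly from a batch of trajectories. The sketch below is a minimal illustration, not the authors' implementation: it computes first-visit returns for each trajectory, the per-state averages over the trajectories that visit each state, and the diagonal weights formed as regression weight times the fraction of trajectories visiting the state. The function names and the toy dataset are hypothetical.

```python
def first_visit_returns(trajectory, gamma):
    """Map each visited state to the discounted return from its first visit.

    `trajectory` is a list of (state, action, reward) triples.
    """
    returns = {}
    ret = 0.0
    # Walk backwards so that `ret` is the discounted return from step t onwards.
    # Later overwrites correspond to earlier time steps, so the value that
    # survives for each state is the one from its first visit.
    for state, _action, reward in reversed(trajectory):
        ret = reward + gamma * ret
        returns[state] = ret
    return returns


def dataset_statistics(dataset, states, rho, gamma):
    """Return (average first-visit returns, diagonal weights) keyed by state."""
    m = len(dataset)
    totals = {s: 0.0 for s in states}
    counts = {s: 0 for s in states}
    for x in dataset:
        for s, ret in first_visit_returns(x, gamma).items():
            totals[s] += ret
            counts[s] += 1
    # Unvisited states get an average return of 0 by convention.
    f_x = {s: (totals[s] / counts[s] if counts[s] else 0.0) for s in states}
    gamma_x = {s: rho[s] * counts[s] / m for s in states}
    return f_x, gamma_x


# Toy dataset: state 2 is never visited.
x1 = [(0, "a", 0.0), (1, "a", 0.0), (1, "a", 1.0)]
x2 = [(1, "a", 1.0)]
f_x, gamma_x = dataset_statistics([x1, x2], states=[0, 1, 2],
                                  rho={0: 1.0, 1: 1.0, 2: 1.0}, gamma=0.5)
```

In this toy dataset the diagonal weight of the unvisited state is zero, so with a tabular representation (one indicator feature per state) the weighted Gram matrix is singular and the least-squares solution is not unique. Replacing one trajectory with another that visits state 2 would change this, which is the instability discussed above.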
3 Private First-Visit Monte Carlo Algorithms
In this section we give the details of two differentially private policy evaluation algorithms based on first-visit Monte Carlo estimates. Each of these algorithms corresponds to a different stable version of the minimization described in the previous section. A formal privacy analysis of these algorithms is given in Section 4. Bounds showing how the privacy requirement affects the utility of the value estimates are presented in Section 5.
3.1 Algorithm DPLSW
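One way to make the optimization more stable to changes in the dataset $X$ is to consider a similar least-squares optimization where the optimization weights do not change with $X$, and guarantee that the optimization problem is always strongly convex. Thus, we consider a new objective function given in terms of a new set of positive regression weights $w_s > 0$. Let $\Gamma \in \mathbb{R}^{\mathcal{S} \times \mathcal{S}}$ be a diagonal matrix with $\Gamma(s,s) = w_s$. We define the objective function as:
(6) $J_X^{w}(\theta) = \sum_{s \in \mathcal{S}} w_s \left( F_{X,s} - \phi_s^{\top}\theta \right)^2 = \| F_X - \Phi\theta \|_{2,w}^{2}$
where $\|v\|_{2,w}^{2} = \sum_{s} w_s v_s^{2}$ is the weighted $L_2$ norm. To see the relation between the optimizations over $J_X$ and $J_X^{w}$, note that equating the gradient of $J_X^{w}(\theta)$ to $0$ we see that a minimum $\theta_X^{w}$ must satisfy
(7) $\Phi^{\top} \Gamma \Phi \, \theta_X^{w} = \Phi^{\top} \Gamma F_X$
Thus, the optimization problem is well-posed whenever $\Phi^{\top} \Gamma \Phi$ is invertible, which henceforth will be our working assumption. Note that this is a mild assumption, since it is satisfied by choosing a feature matrix $\Phi$ with full column rank. Under this assumption we have:
(8) $\theta_X^{w} = (\Phi^{\top} \Gamma \Phi)^{-1} \Phi^{\top} \Gamma F_X = (\Gamma^{1/2} \Phi)^{\dagger} \Gamma^{1/2} F_X$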
where $^{\dagger}$ denotes the Moore–Penrose pseudo-inverse. The difference between optimizing $J_X$ or $J_X^{w}$ is reflected in the differences between (5) and (7). In particular, if the trajectories in $X$ are i.i.d. and $p_s$ denotes the probability that state $s$ is visited by a trajectory in $X$, then taking $w_s = \rho_s p_s$ yields a loss function $J_X^{w}(\theta)$ that captures the effect of each state $s$ in $J_X(\theta)$ in the asymptotic regime $m \to \infty$. However, we note that knowledge of these visit probabilities is not required for running our algorithm or for our analysis.
Our first DP algorithm for policy evaluation applies a carefully calibrated output perturbation mechanism to the solution of $\operatorname{argmin}_{\theta} J_X^{w}(\theta)$. We call this algorithm DPLSW, and its full pseudocode is given in Algorithm 1. It receives as input the dataset $X$, the regression weights $w$, the feature representation $\Phi$, and the MDP parameters $R_{\max}$ and $\gamma$. Additionally, the algorithm is parametrized by the privacy parameters $\varepsilon$ and $\delta$. Its output is the result of adding a random vector $\eta$ drawn from a multivariate Gaussian distribution to the parameter vector $\theta_X^{w}$. In order to compute the variance of $\eta$ the algorithm needs to solve the discrete optimization problem , where , is a parameter computed in the algorithm, and is given by the following expression:
(9) 
Note that can be computed in time .
The variance of the noise in DPLSW is proportional to the upper bound $R_{\max}/(1-\gamma)$ on the return from any state. This bound might be excessively pessimistic in some applications, leading to unnecessarily large perturbations of the solution. Fortunately, it is possible to replace the term $R_{\max}/(1-\gamma)$ with any smaller upper bound on the returns generated by the target MDP on any state. In practice this leads to more useful algorithms, but it is important to keep in mind that for the privacy guarantees to remain unaffected, one needs to assume that this bound is a publicly known quantity (i.e. it is not based on an estimate made from private data). These same considerations apply to the algorithm in the next section.
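The unperturbed step of this algorithm is an ordinary weighted least-squares fit with fixed per-state weights. The following minimal sketch solves the corresponding normal equations in pure Python. It covers only the non-private solution; the calibration of the Gaussian noise is not reproduced here. The helper names and example data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def weighted_least_squares(phi, weights, targets):
    """Minimise sum_s w_s * (targets[s] - phi[s] . theta)^2 via normal equations."""
    d = len(phi[0])
    gram = [[sum(w * row[i] * row[j] for w, row in zip(weights, phi))
             for j in range(d)] for i in range(d)]
    rhs = [sum(w * row[i] * t for w, row, t in zip(weights, phi, targets))
           for i in range(d)]
    return solve_linear(gram, rhs)


# Four states aggregated in pairs: each row is the feature vector of one state.
phi = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
targets = [0.2, 0.4, 0.6, 1.0]  # made-up average first-visit returns
theta = weighted_least_squares(phi, [1.0, 1.0, 1.0, 1.0], targets)
```

Because the weights are fixed in advance and strictly positive, the Gram matrix does not depend on which states happen to be visited, so the system stays solvable for every dataset as long as the feature matrix has full column rank.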
3.2 Algorithm DPLSL
The second DP algorithm for policy evaluation we propose is also an output perturbation mechanism. It differs from DPLSW in the way stability of the unperturbed solutions is promoted. In this case, we choose to optimize a regularized version of $J_X(\theta)$. In particular, we consider the objective function obtained by adding a ridge penalty to the least-squares loss from (3):
(10) 
where $\lambda > 0$ is a regularization parameter. The introduction of the ridge penalty makes the objective function strongly convex, and thus ensures the existence of a unique solution, which can be obtained in closed form as:
(11) 
Here $\Gamma_X$ is defined as in Section 2.2.
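To see why a ridge penalty restores uniqueness, consider the tabular special case with one indicator feature per state. Adding a penalty of the form c times the squared norm of the parameters to the least-squares loss turns the normal equations into a diagonal system with strictly positive diagonal. The sketch below assumes this simplified setting; the coefficient `penalty` is a hypothetical stand-in for whatever scaling of the regularization parameter the objective uses, and the numbers are made up.

```python
def ridge_tabular(avg_returns, diag_weights, penalty):
    """Ridge solution for indicator features: one parameter per state.

    Solves (g_s + penalty) * theta_s = g_s * F_s for each state s, where g_s is
    the data-dependent weight of state s and F_s its average first-visit return.
    """
    if penalty <= 0.0:
        raise ValueError("penalty must be positive for a unique solution")
    return [g * f / (g + penalty) for f, g in zip(avg_returns, diag_weights)]


avg_returns = [0.25, 0.75, 0.0]   # third state was never visited
diag_weights = [0.5, 1.0, 0.0]    # so its data-dependent weight is zero
theta = ridge_tabular(avg_returns, diag_weights, penalty=0.5)
```

The unvisited state no longer makes the system singular; its parameter is simply shrunk to zero. The price is a bias towards zero on the visited states, which grows with the penalty.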
We call DPLSL the algorithm obtained by applying an output perturbation mechanism to the minimizer of ; the full pseudocode is given in Algorithm 2. It receives as input the privacy parameters and , a dataset of trajectories , the regression weights , the feature representation , a regularization parameter , and the MDP parameters and . After computing the solution to , the algorithm outputs , where is a dimensional noise vector drawn from . The variance of is obtained by solving a discrete optimization problem (different from the one in DPLSW). Let and for , define as:
(12) 
Then DPLSL computes , which can be done in time .
4 Privacy Analysis
This section provides a formal privacy analysis for DPLSW and DPLSL and shows that both algorithms are differentially private. We use the smooth sensitivity framework of Nissim et al. (2007, 2011), which provides tools for the design of DP mechanisms with data-dependent output perturbations. We rely on the following lemma, which provides sufficient conditions for calibrating Gaussian output perturbation mechanisms with variance proportional to smooth upper bounds of the local sensitivity.
Lemma 1 (Nissim et al. (2011)).
Let be an algorithm that on input computes a vector deterministically and then outputs , where is a variance that depends on . Let and . Suppose and are such that the following are satisfied for every pair of neighbouring datasets : (a) , and (b) . Then is differentially private.
Condition (a) says we need variance at least proportional to the local sensitivity . Condition (b) asks that the variance does not change too fast between neighbouring datasets, by imposing the constraint . This is precisely the spirit of the smoothed sensitivity principle: calibrate the noise to a smooth upper bound of the local sensitivity. We acknowledge Lemma 1 is only available in preprint form, and thus provide an elementary proof in Appendix A for completeness. The remaining proofs from this section are presented in Appendices B and C.
4.1 Privacy Analysis of DPLSW
We start by providing an upper bound on the norm for any two neighbouring datasets . Using (8) it is immediate that:
(13) 
Thus, we need to bound .
Lemma 2.
Let be two neighbouring datasets of trajectories with and . Let . Let (resp. ) denote the set of states visited by (resp. ). Then we have
Since the condition in Lemma 1 needs to hold for any dataset neighbouring $X$, we take the supremum of the bound above over all neighbours, which yields the following corollary.
Corollary 3.
If is a dataset of trajectories, then the following holds for every neighbouring dataset :
Using this result we see that in order to satisfy item (a) of Lemma 1 we can choose a noise variance satisfying:
(14) 
where only the last multiplicative term depends on the dataset , and the rest can be regarded as a constant that depends on parameters of the problem which are either public or chosen by the user, and will not change for a neighbouring dataset . Thus, we are left with a lower bound expressible as , where only depends on the dataset through its signature given by the number of times each state appears in the trajectories of : . Accordingly, we write , where is the function
(15) 
The signatures of two neighbouring datasets satisfy because replacing a single trajectory can only change by one the number of first visits to any particular state. Thus, assuming we have a function satisfying and for all with , we can take . This variance clearly satisfies the conditions of Lemma 1 since
The function is known as a smooth upper bound of , and the following result provides a tool for constructing such functions.
Lemma 4 (Nissim et al. (2007)).
Let . For any let . Given , the smallest smooth upper bound of is the function
(16) 
For some functions , the upper bound can be hard to compute or even approximate (Nissim et al., 2007). Fortunately, in our case a simple inspection of (15) reveals that is easy to compute. In particular, the following lemma implies that can be obtained in time .
Lemma 5.
The following holds for every :
Furthermore, for every we have .
Combining the last two lemmas, we see that the quantity computed in DPLSW is in fact a smooth upper bound to . Because the variance used in DPLSW can be obtained by plugging this upper bound into (14), the two conditions of Lemma 1 are satisfied. This completes the proof of the main result of this section:
Theorem 6.
Algorithm DPLSW is differentially private.
Before proceeding to the next privacy analysis, note that Corollary 3 is the reason why a mechanism with output perturbations proportional to the global sensitivity is not sufficient in this case. The bound there says that if in the worst case we can find datasets of an arbitrary size where some states are visited few (or zero) times, then the global sensitivity will not vanish as $m \to \infty$. Hence, the utility of such an algorithm would not improve with the size of the dataset. The smoothed sensitivity approach works around this problem by adding large noise to these datasets, but adding much less noise to datasets where each state appears a sufficient number of times. Corollary 3 also provides the basis for efficiently computing smooth upper bounds to the local sensitivity. In principle, condition (b) in Lemma 1 refers to any dataset neighbouring $X$, of which there are uncountably many because we consider real rewards. Bounding the local sensitivity in terms of the signature reduces this to finitely many "classes" of neighbours, and the form of the bound in Corollary 3 makes it possible to apply Lemma 4 efficiently.
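The construction of Nissim et al. (2007) can be illustrated on a toy example. For a smoothness parameter beta, the smallest smooth upper bound of a function takes, for each distance k, the largest value of the function among inputs within distance k, discounts it by exp(-k * beta), and maximizes over k. The sketch below applies this to a hypothetical one-dimensional bound, 1 / max(v, 1) for a visit count v, where neighbouring counts differ by at most one. This toy function is an assumption chosen for illustration and is not the bound analysed in this section.

```python
import math


def phi(v):
    """Toy sensitivity bound for a state visited v times (hypothetical)."""
    return 1.0 / max(v, 1)


def phi_k(v, k):
    """Largest value of phi over non-negative counts within distance k of v."""
    # phi is non-increasing, so the worst case is the smallest reachable count.
    return phi(max(v - k, 0))


def smooth_upper_bound(v, beta):
    """Smallest beta-smooth upper bound of phi at v: max_k exp(-k*beta) * phi_k."""
    # phi_k stops growing once v - k <= 1, so larger k cannot increase the max.
    k_max = max(v - 1, 0)
    return max(math.exp(-k * beta) * phi_k(v, k) for k in range(k_max + 1))


beta = 0.1
bounds = [smooth_upper_bound(v, beta) for v in range(0, 200)]
```

For small counts the bound stays close to the worst case, while for large counts it decays, so datasets in which every state is visited often receive much less noise than the global sensitivity would require.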
4.2 Privacy Analysis of DPLSL
The proof that DPLSL is differentially private follows the same strategy as for DPLSW. We start with a lemma that bounds the local sensitivity of for pairs of neighbouring datasets . We use the notation for an indicator variable that is equal to one when state is visited within trajectory .
Lemma 7.
Let be two neighbouring datasets of trajectories with and . Let (resp. ) be the vector given by (resp. ). Define diagonal matrices given by and . If the regularization parameter satisfies , then:
As before, we need to consider the supremum of the bound over all possible neighbours of . In particular, we would like to get a bound whose only dependence on the dataset is through the signature . This is the purpose of the following corollary:
Corollary 8.
Let be a dataset of trajectories and suppose . Then the following holds for every neighbouring dataset :
where
By the same reasoning as in Section 4.1, as long as the regularization parameter is larger than , a differentially private algorithm can be obtained by adding to a Gaussian perturbation with a variance satisfying
and the second condition of Lemma 1. This second requirement can be achieved by computing a smooth upper bound of the function given by
When going from to we substituted by to reflect the fact that no state can be visited by more than $m$ trajectories in a dataset of size $m$. It turns out that in this case the function arising in Lemma 4 is also easy to compute.
Lemma 9.
For every , is equal to:
Furthermore, for every we have .
Finally, in view of Lemma 4, Corollary 8, and Lemma 9, the variance of the noise perturbation in DPLSL satisfies the conditions of Lemma 1, so we have proved the following.
Theorem 10.
Algorithm DPLSL is differentially private.
5 Utility Analysis
Because the promise of differential privacy has to hold for any possible pair of neighbouring datasets $X \simeq X'$, the analysis in the previous section does not assume any generative model for the input dataset $X$. However, in practical applications we expect $X$ to contain multiple trajectories sampled from the same policy on the same MDP. The purpose of this section is to show that when the trajectories are i.i.d. the utility of our differentially private algorithms increases as $m \to \infty$. In other words, when the input dataset grows, the amount of noise added by our algorithms decreases, thus leading to more accurate estimates of the value function. This matches the intuition that when outputting a fixed number of parameters, using data from more users to estimate these parameters leads to smaller individual contributions from each user, and makes the privacy constraint easier to satisfy.
To measure the utility of our DP algorithms we shall bound the difference in empirical risk between the private and non-private parameters learned from a given dataset. That is, we want to show that the quantity vanishes as $m \to \infty$, for both and . The first theorem bounds the expected empirical excess risk of DPLSW. The bound contains two terms: one vanishes as $m \to \infty$, and the other reflects the fact that states which are never visited pose a problem to stability. The proof is deferred to Appendix D.
Theorem 11.
Let and . Let . Suppose . Then is upper bounded by:
Note the above bound depends on the dimension through and . In terms of the size of the dataset, we can get excess risk bounds that decrease quadratically with $m$ by assuming that either all states are visited with non-zero probability or the user sets the regression weights so that such states do not contribute to .
Corollary 12.
If for all , then .
A similar theorem can be proved for DPLSL. However, in this case the statement of the bound is complicated by the appearance of cooccurrence probabilities of the form and . Here we only state the main corollary of our result; the full statement and the corresponding proofs are presented in Appendix E. This corollary is obtained by assuming the regularization parameter is allowed to grow with , and stresses the tensions in selecting an adequate regularization schedule.
Corollary 13.
Suppose with respect to . Then we have .
Note that taking we get a bound on the excess risk of order . However, if we want the regularization term in to vanish as we need . We shall see the importance of this trade-off in our experiments.
6 Experiments
In this section we illustrate the behaviour of the proposed algorithms on synthetic examples. The domain we use consists of a chain of states, where in each state the agent has some probability of staying and probability of advancing to its right. There is a reward of when the agent reaches the final, absorbing state, and for all other states. While this is a toy example, it illustrates the typical case of policy evaluation in the medical domain, where patients tend to progress through stages of recovery at different speeds, and past states are not typically revisited (partly because in the medical domain, states contain historic information about past treatments). Trajectories are drawn by sampling a starting state from an initial state distribution and generating state-action-reward transitions according to the described probabilities until the absorbing state is reached. Trajectories are harvested in a batch, and the same batches are processed by all algorithms.
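A chain domain of this kind is easy to simulate. The sketch below is a minimal illustration under stated assumptions: the number of states, the staying probability, the reward values (1 on entering the absorbing state, 0 otherwise), the uniform initial state distribution, and the placeholder action label are all hypothetical choices, not the settings used in the experiments.

```python
import random


def sample_chain_trajectory(n_states, p_stay, rng, start=0):
    """Sample one trajectory on a left-to-right chain.

    States are 0 .. n_states - 1 and the last one is absorbing. At each step the
    agent stays with probability p_stay and otherwise moves one state right.
    Returns a list of (state, action, reward) triples.
    """
    if not 0.0 <= p_stay < 1.0:
        raise ValueError("p_stay must be in [0, 1) so the chain terminates")
    absorbing = n_states - 1
    trajectory = []
    state = start
    while state != absorbing:
        next_state = state if rng.random() < p_stay else state + 1
        reward = 1.0 if next_state == absorbing else 0.0
        trajectory.append((state, "advance", reward))
        state = next_state
    return trajectory


rng = random.Random(0)
n_states = 5
batch = [
    sample_chain_trajectory(n_states, 0.5, rng, start=rng.randrange(n_states - 1))
    for _ in range(100)
]
```

Each trajectory ends with the transition into the absorbing state, so every state it visits contributes exactly one first-visit return.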
We experiment with both a tabular representation of the value function, as well as with function approximation. In the latter case, we simply aggregate pairs of adjacent states, which are hence forced to take the same value. We compared the proposed private algorithms DPLSW and DPLSL with their nonprivate equivalents LSW and LSL. The performance measure used is average root mean squared error over the state space. The error is obtained by comparing the state values estimated by the learning algorithms against the exact values obtained by exact, tabular dynamic programming. Standard errors computed over 20 independent runs are included.
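For a chain of this form the exact values used as ground truth can be obtained with a single backward pass, and the error measure is a plain root mean squared error over states. The sketch below assumes the same hypothetical conventions as a simple simulator would: reward 1 on entering the absorbing state, reward 0 otherwise, and value 0 in the absorbing state. The function names are illustrative.

```python
import math


def chain_values(n_states, p_stay, gamma):
    """Exact state values of the chain, computed backwards from the absorbing state.

    Solves V(s) = p * gamma * V(s) + (1 - p) * (r + gamma * V(s + 1)) for V(s),
    where r is 1 if s + 1 is the absorbing state and 0 otherwise.
    """
    values = [0.0] * n_states  # the absorbing state has value 0
    for s in range(n_states - 2, -1, -1):
        reward = 1.0 if s + 1 == n_states - 1 else 0.0
        values[s] = ((1.0 - p_stay) * (reward + gamma * values[s + 1])
                     / (1.0 - p_stay * gamma))
    return values


def rmse(estimate, exact):
    """Root mean squared error between two equally long value vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(estimate, exact)) / len(exact))


exact = chain_values(n_states=3, p_stay=0.5, gamma=0.5)
error_of_zero_estimate = rmse([0.0, 0.0, 0.0], exact)
```

Comparing a learned value vector against `exact` with `rmse` gives one data point of the kind plotted against batch size.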
The main results are summarized in Fig. 1, for an environment with states, , discount , and for the DP algorithms, and . In general, these constants should be chosen depending on the privacy constraints of the domain. Our theoretical results explain the expected effect of these choices on the privacy-utility trade-off so we do not provide extensive experiments with different values.
The left plot in Fig. 1 compares the non-private LSL and LSW versions of Monte Carlo evaluation, in the tabular and function approximation case. As can be seen, both algorithms are very stable and converge to the same solution, but LSW converges faster. The second plot compares the performance of all algorithms in the tabular case, over a range of regularization parameters, for two different batch sizes. The third plot compares the expected RMSE of the algorithms when run with state aggregation, as a function of batch size. As can be seen, the DP algorithms converge to the same solutions as the corresponding non-private versions for large enough batch sizes. Interestingly, the two proposed approaches serve different needs. The LSL algorithms work better with small batches of data, whereas the LSW approach is preferable with large batches. From an empirical point of view, the trade-off between accuracy and privacy in the DPLSL algorithm should be managed by setting a regularization schedule proportional to . While the theory suggests it is not the best schedule in terms of excess empirical risk, it achieves the best overall accuracy.
Finally, the last figure shows excess empirical risk as a function of the batch size. Interestingly, more aggressive function approximation helps both differentially private algorithms converge faster. This is intuitive, since using the same data to estimate fewer parameters means the effect of each individual trajectory is already obscured by the function approximation. Decreasing the number of parameters of the function approximator, , increases , which lowers the smooth sensitivity bounds. In medical applications, one expects to have many attributes measured about patients, and to need aggressive function approximation in order to provide generalization. This result tells us that differentially private algorithms should be favoured in this case as well.
Overall, the empirical results are very promising, showing that especially as batch size increases, the noise introduced by the DP mechanism decreases rapidly, and these algorithms provide the same performance but with the additional privacy guarantees.
7 Conclusion
We present the first differentially private algorithms for policy evaluation in the full MDP setting. Our algorithms are built on top of established Monte Carlo methods, and come with utility guarantees showing that the cost of privacy diminishes as training batches get larger. The smoothed sensitivity framework is a key component of our analyses, which differ from previous works on DP mechanisms for ERM and bandit problems in two substantial ways. First, we consider optimizations with non-Lipschitz loss functions, which prevents us from using most of the established techniques for analyzing privacy and utility in ERM algorithms and complicates some parts of our analysis. In particular, we cannot leverage the tight utility analysis of Jain and Thakurta (2014) to get dimension-independent bounds. Second, and more importantly, the natural model of neighbouring datasets for policy evaluation involves replacing a whole trajectory. This implies that neighbouring datasets can differ in multiple regression targets, which is quite different from the usual supervised learning approach where neighbouring datasets can only change a single regression target. Our approach is also different from the online learning and bandit settings, where there is a single stream of experience and neighbouring datasets differ in one element of the stream. Note that this setting cannot be used naturally in the full MDP setup, because successive observations in a single stream are inherently correlated.
In future work we plan to extend our techniques in two directions. First, we would like to design DP policy evaluation methods based on temporal-difference learning (Sutton, 1988). Second, we will tackle the control case, where policy evaluation is often used as a subroutine, e.g. in actor-critic methods. We also plan to evaluate the current algorithms on patient data from an ongoing clinical study (in which case errors cannot be estimated precisely, because the right answer is not known).
References
 Chaudhuri and Monteleoni (2009) Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems, pages 289–296, 2009.
 Chaudhuri et al. (2011) Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, volume 12, 2011.
 Dwork (2006) Cynthia Dwork. Differential privacy. In Proceedings of the 33rd International Conference on Automata, Languages and Programming, Volume Part II, pages 1–12, 2006.
 Dwork and Roth (2014) Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
 Jain and Thakurta (2013) Prateek Jain and Abhradeep Thakurta. Differentially private learning with kernels. In Proceedings of the 30th International Conference on Machine Learning (ICML13), pages 118–126, 2013.
 Jain and Thakurta (2014) Prateek Jain and Abhradeep Guha Thakurta. (near) dimension independent risk bounds for differentially private learning. In Proceedings of The 31st International Conference on Machine Learning, pages 476–484, 2014.
 Laurent and Massart (2000) Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
 McSherry and Talwar (2007) Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Foundations of Computer Science, 2007. FOCS’07. 48th Annual IEEE Symposium on, pages 94–103. IEEE, 2007.
 Mishra and Thakurta (2015) Nikita Mishra and Abhradeep Thakurta. Nearly optimal differentially private stochastic multi-arm bandits. In UAI, 2015.

 Nissim et al. (2007) Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pages 75–84. ACM, 2007.
 Nissim et al. (2011) Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis, 2011. URL http://www.cse.psu.edu/~ads22/pubs/NRS07/.
 Rubinstein et al. (2012) Benjamin IP Rubinstein, Peter L Bartlett, Ling Huang, and Nina Taft. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. Journal of Privacy and Confidentiality, 4(1):4, 2012.
 Smith and Thakurta (2013) Adam Smith and Abhradeep Thakurta. Nearly optimal algorithms for private online learning in full-information and bandit settings. In NIPS, 2013.
 Sutton (1988) Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
 Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 1998.
 Szepesvári (2010) Csaba Szepesvári. Algorithms for reinforcement learning. Morgan & Claypool Publishers, 2010.

 Thakurta and Smith (2013) Abhradeep Guha Thakurta and Adam Smith. Differentially private feature selection via stability arguments, and the robustness of the lasso. In Conference on Learning Theory, pages 819–850, 2013.
 Tossou and Dimitrakakis (2016) Aristide C. Y. Tossou and Christos Dimitrakakis. Algorithms for differentially private multi-armed bandits. In International Conference on Artificial Intelligence (AAAI 2016), 2016.
Appendix A Smoothed Gaussian Perturbation
A proof of Lemma 1 in the paper can be found in the preprint Nissim et al. [2011]. For the sake of completeness, we provide here an elementary proof (albeit with slightly worse constants). In particular, we are going to prove the following.
Lemma 14.
Let be an algorithm that on input computes a vector deterministically and then outputs , where is a variance that depends on . Let and . Suppose that , and are such , and the following are satisfied for every pair of neighbouring datasets :

,

.
Then is differentially private.
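As a purely illustrative aid, the output step of an algorithm of the kind considered in the lemma (a deterministically computed vector perturbed by spherical Gaussian noise) can be sketched in Python as follows. The function name `spherical_gaussian_output`, the example vector and the value of `sigma` are hypothetical; in particular, the sketch does not show how the data-dependent standard deviation must be calibrated for the privacy guarantee to hold, which is the actual content of the lemma.

```python
import random


def spherical_gaussian_output(mean_vector, sigma, rng):
    """Return mean_vector plus independent N(0, sigma^2) noise per coordinate.

    sigma is taken as given here; choosing it as a function of the dataset
    is NOT shown (hypothetical placeholder value below).
    """
    return [m + rng.gauss(0.0, sigma) for m in mean_vector]


rng = random.Random(0)          # fixed seed, for reproducibility only
mu = [0.5, -1.0, 2.0]           # stands in for the deterministically computed vector
z = spherical_gaussian_output(mu, sigma=0.1, rng=rng)
print(z)
```

Every coordinate receives independent noise with the same standard deviation, which is what makes the output distribution a spherical Gaussian centred at the computed vector.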
We start with a simple characterization of differential privacy that will be useful for our proof.
Lemma 15.
Let be the output of a randomized algorithm on input . Write for the probability density of the output of on input . Suppose that for every pair of neighbouring datasets there exists a measurable set such that the following are satisfied:

;

for all we have .
Then is differentially private.
Proof.
Fix a pair of neighbouring datasets and let be any measurable set. Let be as in the statement and write . Using the assumptions on we see that
Now we proceed with the proof of Lemma 14. Let be two neighbouring datasets and let us write and for simplicity. Thus, for we have that are dimensional independent Gaussian random variables whose means and variances satisfy the assumptions of Lemma 14 for some . The density function of is denoted by . In order to be able to apply Lemma 15 we want to show that the privacy loss between and defined as
(17) 
is bounded by for all , where is an event with probability at least under .
We can start by identifying a candidate . Since has to have high probability w.r.t. , it should contain because a ball around the mean is the event with the highest probability under a spherical Gaussian distribution (among those with the same Lebesgue measure). For technical reasons, instead of a ball we will take a slightly more complicated region, which for now we will parametrize by two quantities . The definition of this region will depend on the difference of means :
(18) 
We need to choose and such that the probability , and for that we shall combine two different tail bounds. On the one hand, note that is a one-dimensional standard Gaussian random variable and recall that for any :
(19) 
On the other hand, follows a chi-squared distribution with degrees of freedom, for which it is known (Laurent and Massart [2000]) that for all :
(20) 
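As a sanity check on tail bounds of the kind combined here, the following Python sketch numerically verifies two standard inequalities: the Gaussian bound P[Z >= t] <= exp(-t^2/2) for t >= 0, and the Laurent and Massart bound P[X >= d + 2 sqrt(d t) + 2 t] <= exp(-t) for a chi-squared variable X with d degrees of freedom. These are the textbook forms and may differ in presentation from (19) and (20). The helper names are hypothetical, and the closed-form chi-squared tail used below is valid only for even d, a restriction imposed solely to keep the sketch within the standard library.

```python
import math


def gaussian_upper_tail(t):
    """Exact P[Z >= t] for a standard Gaussian Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def chi2_upper_tail_even(d, x):
    """Exact P[X >= x] for chi-squared X with an even number d of degrees
    of freedom and x >= 0, via the Poisson-sum closed form."""
    if d <= 0 or d % 2 != 0:
        raise ValueError("this closed form needs an even, positive d")
    half = x / 2.0
    return math.exp(-half) * sum(
        half ** j / math.factorial(j) for j in range(d // 2)
    )


def laurent_massart_threshold(d, t):
    """Deviation level d + 2*sqrt(d*t) + 2*t from the chi-squared bound."""
    return d + 2.0 * math.sqrt(d * t) + 2.0 * t


for t in (0.5, 1.0, 3.0):
    # Gaussian tail bound.
    assert gaussian_upper_tail(t) <= math.exp(-t * t / 2.0)
    for d in (2, 4, 10):
        # Chi-squared tail bound at the Laurent-Massart threshold.
        x = laurent_massart_threshold(d, t)
        assert chi2_upper_tail_even(d, x) <= math.exp(-t)
```

Both checks pass with room to spare on these values, which is consistent with the bounds being convenient rather than tight.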
To make our choices for and we can take them such that