1 Introduction
Many proposed applications of reinforcement learning (RL) involve the use of data that could contain sensitive information. For example, Raghu et al. [2017] proposed an application of RL and off-policy evaluation methods that uses people’s medical records, and Theocharous et al. [2015] applied off-policy evaluation methods to user data collected by a bank in order to improve the targeting of advertisements. In examples like these, the data used by the RL systems is sensitive, and one should ensure that the methods applied to the data do not leak any sensitive information.
Recently, Balle et al. [2016] showed how techniques from differential privacy can be used to ensure that (with high probability) policy evaluation methods for RL do not leak (much) sensitive information. In this paper we extend their work in two ways. First, RL methods are often applied to batches of data collected from the use of a currently deployed policy. The goal of these RL methods is not to evaluate the performance of the current policy, but to improve upon it. Thus, policy evaluation methods must be off-policy: they must use the data from the behavior policy to reason about the performance of newly proposed policies. This is the problem of off-policy evaluation, and both of the previous medical and banking examples require these methods. Whereas Balle et al. [2016] consider the on-policy setting (evaluating the deployed policy), we focus on the off-policy setting.
Second, Balle et al. [2016] achieve their privacy guarantee using output perturbation: they first run an existing (non-private) least-squares policy evaluation method, resulting in a real-valued vector; then they add random noise to each element of the vector. Although this approach was one of the first and simplest methods for ensuring that guarantees of privacy hold [Dwork et al., 2006b], more sophisticated methods for ensuring privacy have since been developed. We show how one of these newer approaches to differential privacy, which adds noise to stochastic gradient descent updates [Song et al., 2013; Bassily et al., 2014] rather than to the least-squares solution, can be combined with GTD2, the dominant off-policy evaluation algorithm [Sutton et al., 2009].
After presenting our new privacy-preserving off-policy evaluation algorithm, which we call gradient perturbed off-policy evaluation (GPOPE) to differentiate it from the previous output-perturbation methods, we provide proofs of privacy and convergence rate. We use the properties of Rényi differential privacy and its amplification via subsampling [Bun and Steinke, 2016; Mironov, 2017; Balle et al., 2018; Wang et al., 2018] together with the moments accountant technique [Abadi et al., 2016] to effectively keep track of differential privacy parameters through all steps of our algorithm. The convergence rate analysis quantifies the tradeoff between the strength of the privacy guarantees that our algorithms provide and the accuracy of their off-policy predictions.
Since the on-policy setting is a special case of the off-policy setting (where the policy being evaluated happens to be the same as the currently deployed policy), we can compare our algorithm directly to the output-perturbation methods of Balle et al. [2016] in the on-policy setting. We show empirically that our algorithm offers greater utility, i.e., using the same data, our algorithm can provide stronger guarantees of privacy for the same degree of prediction error. We also conduct experiments in the off-policy setting, where prior work is not applicable, and the results support the conclusions of our theoretical analysis.
The rest of the paper is organized as follows. We review the relevant background on off-policy evaluation in Section 2 and background on differential privacy in Section 3. We present our algorithm in Section 4. In Section 5 we analyze the privacy-preserving properties of our algorithm, and in Section 6 we provide an analysis of the utility of our algorithm. We provide an empirical case study in Section 7, using a synthetic MDP that mimics characteristics of a medical application, the standard Mountain Car domain, and a more challenging HIV simulator. We conclude in Section 8 with a discussion of future work.
2 Background: Off-Policy Evaluation
This section offers a brief overview of off-policy evaluation, including the definition of Markov decision processes, mean squared projected Bellman error, and the saddle-point formulation of the gradient temporal-difference (GTD2) off-policy evaluation method [Sutton et al., 2009].
A Markov decision process (MDP) [Sutton and Barto, 1998; Puterman, 2014] is a tuple , , , , , where is the finite set of possible states, is the time step, is the state at time (a random variable), is the finite set of possible actions, is the action at time , is the transition function, defined such that , is the scalar reward at time , is defined such that , and is a parameter that characterizes how rewards are discounted over time. A policy, , describes one way that actions can be chosen: .
A key step in many RL algorithms is to estimate the state-value function of a given policy , which is defined as . The process of estimating a state-value function is known as policy evaluation. In this paper we consider the problem of off-policy evaluation, wherein we estimate given data (states, actions, and rewards) sampled from applying a different policy, , called the behavior policy, which may be different from (i.e., the policy being evaluated). Furthermore, we consider the setting where a linear function approximator, , is used. That is, can be written as , where is a set of weights, and is a feature vector associated with state .
Let be a trajectory with length . Often each trajectory contains data pertaining to a single individual over time. In real-world applications, states often describe people: their bank balance when the MDP models automatic selection of online credit card ads [Theocharous et al., 2015], or medical conditions when selecting between treatments [Raghu et al., 2017]. Similarly, actions can include drug prescriptions and rewards can encode medical outcomes.
Recent work has shown that optimizing the weight vector, , can be phrased as a saddle point problem [Liu et al., 2015]: where
(2.1)  
where is introduced by duality [Boyd et al., 2011; Sutton et al., 2009; Liu et al., 2015], the expected values in (2.1), are over states, , actions, , and rewards, , produced by running the behavior policy, , is the (finite) length of the trajectory, is shorthand for , and .
Liu et al. [2015] proposed using a stochastic gradient method to optimize this saddle-point problem. This algorithm uses the following unbiased estimates of and , produced using the states, actions, and rewards from the trajectory (which is of length ):
(2.2)  
Although Sutton et al. [2009] were the first to derive GTD2, they did not derive it as a stochastic gradient algorithm; Liu et al. [2015] were the first to show that GTD2 can be phrased as presented here, as a stochastic gradient algorithm for a saddle-point problem. The resulting stochastic gradient algorithm proposed by Liu et al. [2015] is identical to the GTD2 algorithm, and is given by the following update equations:
(2.3)  
where is a sequence of positive step sizes.
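As a rough illustration of the primal-dual updates described above, the following Python sketch performs one GTD2-style step using all transitions of a single trajectory. The per-transition estimates in the docstring are the standard GTD2 forms and are an assumption here, since they are not copied from this text; the names `primal_dual_gradient`, `gtd2_step`, and the tuple layout `(phi, reward, next_phi, rho)` are likewise hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def primal_dual_gradient(theta, y, trajectory, gamma):
    """Average primal-dual update directions over one trajectory.

    Each transition is (phi, reward, next_phi, rho), where rho is the
    importance weight pi(a|s) / pi_b(a|s). Assumed per-transition estimates:
      A_t = rho * phi (phi - gamma * next_phi)^T
      b_t = rho * reward * phi
      M_t = phi phi^T
    Returns (A^T y, b - A theta - M y), each averaged over the transitions.
    """
    d = len(theta)
    n = len(trajectory)
    g_theta = [0.0] * d
    g_y = [0.0] * d
    for phi, reward, next_phi, rho in trajectory:
        diff = [p - gamma * q for p, q in zip(phi, next_phi)]
        phi_dot_y = dot(phi, y)
        residual = rho * (reward - dot(diff, theta)) - phi_dot_y
        for i in range(d):
            g_theta[i] += rho * diff[i] * phi_dot_y / n  # i-th entry of A_t^T y
            g_y[i] += phi[i] * residual / n              # i-th entry of b_t - A_t theta - M_t y
    return g_theta, g_y


def gtd2_step(theta, y, trajectory, gamma, step_size):
    """One simultaneous update of the primal (theta) and dual (y) variables."""
    g_theta, g_y = primal_dual_gradient(theta, y, trajectory, gamma)
    new_theta = [t + step_size * g for t, g in zip(theta, g_theta)]
    new_y = [v + step_size * g for v, g in zip(y, g_y)]
    return new_theta, new_y
```

Both variables are updated from the same old values of `theta` and `y`, which matches the usual presentation of GTD2; working with inner products avoids forming the matrices explicitly.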
3 Background: Differential Privacy
In this section we define differential privacy (DP) and its application to the data underlying off-policy evaluation. We also describe some tools that aid in analyzing the privacy loss when using gradient methods.
A data set, , consists of a set of points, , where each point is an element of universe (for RL, a point will correspond to a trajectory, , containing data associated with one person). For RL applications to human data, each point typically describes a trajectory consisting of a finite sequence of transitions of a single individual, i.e., , and the length of trajectory may vary across individuals. We assume each trajectory is generated by running a behavior policy, , and that states, actions, and rewards may all be potentially sensitive and therefore worthy of privacy protection. We denote by the set of all possible data sets.
The privacy condition our algorithm provides constrains the treatment of pairs of adjacent datasets:
Definition 1 (Adjacent Data Set).
Two data sets, are adjacent if they differ by exactly one point.
Differential privacy is a formal notion of privacy, which guarantees that the output of a computation on a sensitive data set cannot reveal too much about any one individual. Formally, consider a randomized mechanism, , which takes as input a data set and produces as output an element of some set, .
Definition 2 (Differential Privacy).
Let denote a randomized mechanism that has domain and range . satisfies differential privacy for some , if for every pair of adjacent data sets, , and for every the following holds:
(3.1) 
This definition requires that the difference in output probabilities resulting from changing the database by altering any one individual’s contribution will be small. Note that adjacent databases differ in an individual’s full trajectory, not merely one transition.
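For reference, the standard form of the differential privacy condition from the literature [Dwork et al., 2006b] can be written as follows. The symbols used here (mechanism \mathcal{M}, adjacent data sets d and d', output subset S of the range \mathcal{R}, and privacy parameters \epsilon and \delta) are assumed notation and may differ from the notation of Definition 2.

```latex
\Pr\left[\mathcal{M}(d) \in S\right]
  \;\le\; e^{\epsilon} \, \Pr\left[\mathcal{M}(d') \in S\right] + \delta
  \qquad \text{for all adjacent } d, d' \text{ and all } S \subseteq \mathcal{R}.
```

Smaller values of \epsilon and \delta correspond to a stronger guarantee, since the output distributions on adjacent data sets are then forced to be closer.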
Applied to our reinforcement learning problem, a differentially private training mechanism allows the public release of a parameter vector of the value function with a strong guarantee: by analyzing the output, an adversary is severely limited in what they can learn about any individual, even if they have access to arbitrary public information.
4 Differentially Private Off-Policy Evaluation Algorithms
In this section we provide the details of our differentially private off-policy evaluation algorithms.
We construct our differentially private off-policy evaluation algorithm by using the Gaussian mechanism [Dwork et al., 2006b] and the moments accountant [Abadi et al., 2016] to privatize the stochastic gradient off-policy evaluation algorithm presented in (2.3). This involves three steps. First, a trajectory of data is collected from running the behavior policy, . Second, a primal-dual stochastic gradient estimate is generated from this data, and its norm is clipped to ensure that it is at most a positive constant, . Third, we add normally distributed noise to each term of the gradient before updating the weights using the (clipped and noisy) stochastic gradient estimate. In subsequent sections we show that the amount of noise that we introduce provides the desired privacy-preserving guarantees, regardless of the value chosen for .
Before providing pseudocode for our algorithms, we first define the primal-dual gradient at the th step, , which is obtained by stacking the estimated primal and negative dual gradients:
(4.1)  
where and are defined in (2.2). Let denote the true primal-dual gradient, , where the expected values are over states, actions, and rewards produced by running the behavior policy.
Pseudocode for our new privacy-preserving off-policy evaluation algorithm, which we call gradient perturbed off-policy evaluation (GPOPE), is provided in Algorithm 1.
Notice that in GPOPE we use all of the transitions from trajectory to create the unbiased estimates of , , and . Alternate algorithms could use data from a single trajectory to create multiple estimates of , , and , and thus could perform multiple gradient updates given one trajectory. However, in preliminary experiments we found that the episodic approach taken by GPOPE (where we use all of the data from the trajectory for one update) performed the best. This is supported by our theoretical analysis, which shows that the tradeoffs between the number of updates, the variance of updates, and the amount of noise that must be added to updates favor this episodic approach.
We use to denote the variance of the Gaussian noise in our algorithm. The choice of depends on the desired privacy level of the algorithm, as discussed in the next section.
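The clipping and noising steps described in this section can be sketched in Python as below. This is only an illustration under stated assumptions, not Algorithm 1 itself: the function names are hypothetical, the clipping uses the L2 norm, and `sigma` is treated as the standard deviation of independent Gaussian noise added to every coordinate (how the noise scale relates to the clipping constant in the actual algorithm is not specified here).

```python
import math
import random


def clip_by_norm(gradient, max_norm):
    """Rescale the gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


def perturb_gradient(gradient, max_norm, sigma, rng):
    """Clip the stacked primal-dual gradient, then add Gaussian noise.

    Assumption: sigma is the standard deviation of the per-coordinate noise.
    """
    clipped = clip_by_norm(gradient, max_norm)
    return [g + rng.gauss(0.0, sigma) for g in clipped]
```

A call such as `perturb_gradient(g, max_norm=1.0, sigma=0.5, rng=random.Random(0))` returns the vector that would then be used in place of the exact gradient estimate in the weight update.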
5 Privacy Analysis
In this section we provide a formal privacy analysis for our algorithm. We adapt the moments accountant introduced by Abadi et al. [2016] and the recent privacy amplification properties of subsampling mechanisms [Bun and Steinke, 2016; Balle et al., 2018; Wang et al., 2018] to bound the privacy loss of a sequence of adaptive mechanisms, and we show that our algorithm is differentially private.
Theorem 1.
Given a data set consisting of points and fixing the number of iterations, , there exist constants and , such that for any , Algorithm 1 is differentially private for if
(5.1) 
The detailed proof of Theorem 1 is in the appendix. In the remainder of this section we provide an outline of the proof, which proceeds as follows. We first define the privacy loss and the privacy loss random variable . We use privacy loss to measure the difference in the probability distribution resulting from running on and . Bounds on the tails of the privacy loss random variable then imply the privacy condition.
Definition 3 (Privacy Loss).
Let be a randomized mechanism with domain and range , and be auxiliary input, be a pair of adjacent data sets. For an outcome , the privacy loss at is:
(5.2) 
The auxiliary information, , could be any additional information available to the adversary. We use here to model the composition of adaptive mechanisms, where we have a sequence of mechanisms and the th mechanism, , could use the output of previous mechanisms, , as its input.
We define the privacy loss random variable using the outcome sampled from , as .
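To make the privacy loss concrete, the sketch below evaluates it for a scalar Gaussian mechanism, where the outputs on two adjacent data sets are Gaussian with the same standard deviation but different means. This special case, the function names, and the omission of the auxiliary input are all simplifying assumptions made for illustration.

```python
import math


def gaussian_log_density(x, mean, sigma):
    """Log density of N(mean, sigma^2) evaluated at x."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (x - mean) ** 2 / (2.0 * sigma ** 2))


def privacy_loss(outcome, mean_d, mean_d_prime, sigma):
    """Log ratio of the output densities under data sets d and d'."""
    return (gaussian_log_density(outcome, mean_d, sigma)
            - gaussian_log_density(outcome, mean_d_prime, sigma))
```

For example, with means 0 and 1 and unit standard deviation, the loss at outcome 0 is 0.5, while the loss at the midpoint 0.5 is zero because both data sets make that outcome equally likely.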
In order to analyze the privacy cost of sequences of mechanisms more precisely, we use a recent advance in privacy cost accounting called the moments accountant, which was introduced by Abadi et al. [2016] and builds on prior work [Bun and Steinke, 2016; Dwork and Rothblum, 2016].
Definition 4 (Moments Accountant).
Let be a randomized mechanism and , a pair of adjacent databases. The moment of the privacy loss random variable is:
(5.3) 
The moments accountant is defined as
(5.4) 
which bounds the moment for all possible inputs (i.e., all possible ).
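The following sketch shows how a moments accountant is typically used to obtain a privacy guarantee, following the composition and tail-bound properties stated by Abadi et al. [2016]: log moments add up across steps, and delta is the minimum over orders of exp(total log moment minus order times epsilon). As a simplifying assumption it uses the closed-form log moment of a plain Gaussian mechanism with sensitivity 1 and no subsampling, so it does not reproduce the bound of Lemma 1; the function names and the cap `max_lambda` are hypothetical.

```python
import math


def gaussian_log_moment(lam, sigma, sensitivity=1.0):
    """Log moment of order lam for a Gaussian mechanism without subsampling."""
    return lam * (lam + 1) * sensitivity ** 2 / (2.0 * sigma ** 2)


def delta_for_epsilon(epsilon, sigma, num_steps, max_lambda=64):
    """Smallest delta certified for the given epsilon over integer orders.

    Composition: the log moments of num_steps mechanisms add up.
    Tail bound: delta = min over lam of exp(total_log_moment - lam * epsilon).
    """
    best_exponent = 0.0  # exponent 0 corresponds to the trivial bound delta = 1
    for lam in range(1, max_lambda + 1):
        total = num_steps * gaussian_log_moment(lam, sigma)
        best_exponent = min(best_exponent, total - lam * epsilon)
    return math.exp(best_exponent)
```

Working with the exponent and only exponentiating at the end avoids overflow when the noise scale is small or the number of steps is large.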
In the following lemma we provide an upper bound on the moments accountant for each iteration in our algorithm. This upper bound on the moments accountant is the key for proving Theorem 1.
Lemma 1.
Let sensitive dataset d contain trajectories, and be the sampled trajectory in the iteration. Then the randomized mechanism satisfies
(5.5) 
where denotes defined in step 5 of Algorithm 1, and returns the noised gradient.
We prove Lemma 1 using the amplification properties of Rényi differential privacy via subsampling [Bun and Steinke, 2016; Mironov, 2017; Wang et al., 2018]. We provide a detailed proof in the appendix. The results in Lemma 1 are similar to a result of Abadi et al. [2016] when is large (if , ), but our Lemma 1 also covers the regime of small , which Abadi et al. [2016] do not cover. Also note that our definition of adjacent data sets is different from that of Abadi et al. [2016]. Our approach avoids the need to specify a discrete list of moments ahead of time as required in the moments accountant method of Abadi et al. [2016].
Note that our algorithm guarantees differential privacy when each update uses data from only one trajectory. This restriction is needed because trajectories do not always have the same length, and so using data from multiple trajectories in one update would cause Lemma 1 not to hold. However, our privacy analysis holds with the same privacy guarantee when only a subset of the transitions of the sampled trajectory is used. Intuitively, the best choice is to use all transitions of the sampled trajectory; we justify this in the next section.
6 Utility Analysis
In this section we present the convergence analysis (utility analysis) of our algorithm. For this analysis, we assume that is selected to be sufficiently large so that the norm of the gradient estimate is not clipped, i.e., the gradient estimates are sufficiently small (empirically, we found this assumption held across all of our experiments). Also, without loss of generality, we can avoid gradient clipping by scaling the objective function [Wang et al., 2017], i.e., changing the basis used for approximation.
Let be the privacy parameters, be the total number of iterations of the loop in Algorithm 1, and be the number of trajectories in the data set. The noise added to the gradient of is , where is the identity matrix and is the noise scale chosen according to Theorem 1. Let be a constant defined as , where is the Frobenius norm. Note that we choose according to Theorem 1, i.e., , so that we have
(6.1) 
which does not depend on and .
First, let the optimal solution be expressed as
(6.2)  
In order to analyze the convergence of the algorithm, we examine the difference between the current parameters and the optimal solution. We define a residual vector , at each iteration , and a useful parameter , as:
(6.3) 
Note that the optimal solution can be expressed as (6.2). The first-order optimality condition is obtained by setting the gradient to zero, which is satisfied by , such that
(6.4) 
We have defined to be the stochastic approximate gradient at iteration , which is the stacking of the approximate primal and negative dual gradients using the at iteration , and using the true gradient at iteration . Also let be the perturbed approximate gradient, which is defined in step 7 of Algorithm 1.
We also define to be the approximation error of the primal-dual gradient at iteration , which is Note that , since it is an unbiased stochastic approximation. We introduce an assumption, which ensures that the variance of is bounded:
Assumption 1.
There exists a constant, , such that for any ,
(6.5) 
Remark 1.
Note that bounded variance of the stochastic approximation is a standard assumption in the literature on stochastic gradient methods. In our differentially private case, the variance bound should be in terms of the privacy guarantee (i.e., and ), since Algorithm 1 adds normally distributed noise to each term of the gradient. The term in the assumption above follows , where is the Frobenius norm of the covariance matrix of the added noise. In the non-private case, should be , i.e., .
Thus, we obtain the key properties of each iteration in our algorithm.
Lemma 2.
Let be generated by the non-private algorithm at iteration , if we define as (6.3), and we use to denote the minimum eigenvalue of , and to denote the maximum eigenvalue of . If we choose , we then have
(6.6)  
where .
The detailed proof of Lemma 2 is in the appendix. Note that, in the stochastic programming literature, similar results rely on the assumption of a strongly convex (concave) objective function [Nesterov, 2013]. However, our results show that we do not need both the primal variable, , and dual variable, , to be strongly convex (concave). This is because of the special form of our objective function, i.e., our objective function is a quadratic optimization problem [Bertsekas, 1999].
Next, we provide the utility analysis in terms of different step size approaches. We first provide the utility bound when using a constant step size.
Theorem 2.
Let be generated from Algorithm 1. If step size is constant, i.e., , where and is any positive real number, then
(6.7)  
(6.8) 
where and .
The detailed proof of Theorem 2 is in the appendix. Theorem 2 shows that there is a strong tradeoff between accuracy and privacy when using a constant step size. Increasing privacy requires to become larger, which increases the right side of (6.8). Furthermore, notice that term I has a linear rate of convergence, since we have , so that
(6.9)  
(6.10)  
(6.11) 
as goes to infinity, while term II diverges since it has in the numerator. Thus, initially term I dominates and we would expect rapid convergence. However, for large , term II will eventually dominate, and the algorithm will diverge.
Next, we consider the convergence rate (utility analysis) when using a diminishing step size sequence. Theorem 3 shows that in this setting the divergent term is not present.
Theorem 3.
Let be generated by Algorithm 1. If is a sequence of diminishing step sizes defined as , where , then:
(6.12)  
(6.13)  
(6.14) 
where .
The detailed proof of Theorem 3 is in the appendix. First, notice that Theorem 3 has the same accuracy-privacy tradeoff as Theorem 2 due to its dependence on . However, Theorem 3 shows that using diminishing step sizes results in a sublinear (i.e., worse than term I) convergence rate (term III), up to the information-theoretic limit (term IV). Since constant step sizes provide a better initial convergence rate (before term II dominates) than diminishing step sizes, initially using a constant step size would be preferable. However, after enough iterations, the noise in the gradient prevents the constant step size algorithm from converging to an optimal solution due to term II. Thus, in the long term (when running many iterations), using a diminishing step size will produce a better solution. It should also be noted that Bassily et al. [2014] gave the optimal lower bound of utility for the problem of Empirical Risk Minimization (ERM) for both general convex and strongly convex loss functions. Theorem 4 shows that for large (i.e., ), our method can attain that optimal lower bound, i.e., .
We now consider the influence of the minibatch size (i.e., the number of transitions used in the sampled trajectory). Let be the number of transition samples which are used in iteration . The approximation error according to the definition of primal-dual gradient in (4.1), can be written as , where is the approximation error for only using one transition. Thus, if we replace Assumption 1 with the assumption , then we have that:
(6.15) 
Thus, the variance bound is inversely proportional to the number of transitions used, and tighter variance bounds provide faster rates of convergence (as shown in Theorems 2 and 3). Therefore, the best choice is to use all of the transitions in the th trajectory when computing .
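The inverse dependence on the number of transitions can be checked numerically with a small simulation. The sketch below is a simplification made for illustration: it treats the per-transition approximation errors as independent, zero-mean, scalar Gaussian variables, which is an assumption and not a property established in the text; the function name is hypothetical.

```python
import random
import statistics


def mean_squared_error_of_average(num_transitions, noise_std, num_trials, rng):
    """Estimate E[(average of n i.i.d. zero-mean errors)^2] by simulation."""
    squared = []
    for _ in range(num_trials):
        total = sum(rng.gauss(0.0, noise_std) for _ in range(num_transitions))
        average = total / num_transitions
        squared.append(average * average)
    return statistics.fmean(squared)
```

Under these assumptions the estimate for `num_transitions = 16` should be close to one sixteenth of the estimate for a single transition, mirroring the reasoning that using every transition of the sampled trajectory gives the tightest variance bound.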
7 Experimental Results
In this section we compare the performance of our proposed algorithm, gradient perturbed off-policy evaluation (GPOPE) (called gradient perturbation in this section to emphasize its difference from prior methods), with two prior methods, DP-LSW and DP-LSL [Balle et al., 2016], on two on-policy evaluation tasks. For clarity, we use output perturbation V1 to denote DP-LSW, and output perturbation V2 to denote DP-LSL. We then illustrate the behavior of gradient perturbation on an off-policy task, on a common benchmark control task, and on a more challenging HIV simulator. The results we show in the following figures are all averaged over 100 trials and include standard deviation error bars, and we fix in all our experiments.
Synthetic chain domain
In the first on-policy task, we consider a chain domain that consists of states. The agent begins at a uniformly random state on the chain. In each state the agent has probability to stay and probability of advancing to the right. The agent receives a reward of when reaching the final absorbing state, and for all other states. We use , . We compared our algorithm, gradient perturbation, with output perturbation V1 and output perturbation V2 for on-policy evaluation in the tabular setting. This toy example illustrates one typical case in medical applications [Balle et al., 2016], where patients tend to progress through stages of recovery at different speeds, and past states are not typically revisited (partly because in the medical domain, states contain historical information about past treatments). The main result is shown in Figure 0(a), where MSPBE denotes mean squared projected Bellman error (a common measure of inaccuracy for policy evaluation in reinforcement learning [Sutton et al., 2009]), and where the data size, , is the number of trajectories used.
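A chain domain of this kind can be simulated with a few lines of Python. The sketch below is illustrative only: the number of states, the stay probability, and the reward values are parameters that the caller must choose, and starting uniformly among the non-terminal states is an assumption about how the start state is drawn.

```python
import random


def sample_chain_trajectory(num_states, stay_prob, final_reward, step_reward, rng):
    """Sample one trajectory as a list of (state, reward, next_state) tuples."""
    terminal = num_states - 1
    state = rng.randrange(terminal)  # assumed: uniform over non-terminal states
    trajectory = []
    while state != terminal:
        # Stay in place with probability stay_prob, otherwise move one step right.
        next_state = state if rng.random() < stay_prob else state + 1
        reward = final_reward if next_state == terminal else step_reward
        trajectory.append((state, reward, next_state))
        state = next_state
    return trajectory
```

Because the chain never moves left, trajectories that start closer to the absorbing state tend to be shorter, so trajectory lengths vary across individuals as discussed in Section 3.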
We use different step sizes (a hyperparameter) for different (amounts of data). Since the choice of step size cannot depend on the private data (this choice could leak information not captured by our analysis), we assume that the step size was tuned using similar public data—a common approach in differential privacy [Papernot et al., 2016]. For Figure 0(a), we assume that this method was used to obtain optimal step sizes. Our proposed method outperforms output perturbation V1 and output perturbation V2 in terms of accuracy by an order of magnitude.
In practice there may not always be public data similar to the private data, or the public data may differ slightly from the private data. This means that the optimal step size for the public data may not be the optimal step size for the private data. Therefore, it is necessary to test the robustness of our algorithm to changing hyperparameters. Since step size is the only variable hyperparameter for different , Figure 0(a) also shows the results of using and the optimal step sizes. Even using these imprecise optimal step sizes, our proposed approach usually achieves better accuracy than prior methods. Figure 0(b) shows the accuracy when the step size varies, but the amount of data, , is fixed. This shows that accuracy is stable for a very wide range of step sizes. We provide additional experiments in the appendix to further show the robustness of our algorithm to the step size parameter.
Mountain Car
Next we performed these same experiments using the mountain car domain [Sutton and Barto, 1998] for on-policy evaluation. Mountain car is a popular RL benchmark problem with a two-dimensional continuous state space, three discrete actions, and deterministic dynamics. We first used Q-learning with the fifth-order Fourier basis [Konidaris et al., 2011] to obtain a decent policy to evaluate. We ran this policy to collect the trajectories that comprise the data set, and used our gradient perturbation algorithm and the output perturbation algorithms to estimate the value function for the learned policy.
Figure 1(a) shows the accuracy of our algorithm and compares it with output perturbation V1 and least squares temporal difference [Bradtke and Barto, 1996, LSTD]. LSTD does not provide any privacy guarantees, and is presented here to show how close our algorithm is in accuracy to non-private methods. Note that output perturbation V2 fails to guarantee differential privacy for MDPs with continuous states or actions. While Figure 1(a) shows that our proposed gradient perturbation algorithm improves upon existing methods by orders of magnitude, Figure 2(a) provides a zoomed-in view of the same plot to show the speed with which our algorithm converges when using different privacy settings. Similar to the chain domain, we show the robustness of our algorithm to step sizes in Figure 1(b), and present additional experiments in the appendix.
We also tested our algorithm on the mountain car domain for off-policy evaluation. Since LSTD is an on-policy algorithm, here we compare to a non-private off-policy variant of LSTD, called WIS-LSTD [Mahmood et al., 2014]. Note that output perturbation methods fail to guarantee differential privacy for off-policy evaluation, and so we only evaluate our algorithm in this part. Figure 2(b) shows the result of gradient perturbation for off-policy evaluation for the mountain car domain with different privacy settings. The behavior policy, , is the policy learned by Q-learning, and the evaluated policy is the uniform policy. Despite being off-policy (which usually increases data requirements relative to on-policy problems), our algorithm’s performance in Figures 2(a) and 2(b) is remarkably similar.
HIV simulator
We also evaluate our approach on an HIV treatment simulation domain. This simulator was first introduced by Ernst et al. [2006], and consists of six features describing the state of the patient and four possible actions. Compared with the two domains above, this simulator is much closer to the practical medical treatment design, and its dynamics are more complex.
Figure 4 shows the results on the HIV simulator. We obtain the policy that is evaluated using Q-learning, and use a policy that is softmax w.r.t. the optimal Q function as the behavior policy. We use relative MSPBE in Figure 4, which normalizes MSPBE using the average reward () of the evaluated policy.
8 Discussion and Conclusion
To protect individual privacy when applying reinforcement learning algorithms to sensitive training data, we present the first differentially private algorithm for off-policy evaluation. Our approach extends TD methods and comes with a privacy analysis and a utility (convergence rate) analysis. The utility guarantee shows that the privacy cost can be diminished by increasing the size of training batches, and the privacy/utility tradeoff can be optimized by using a decaying step size sequence. In our experiments, our algorithm, gradient perturbed off-policy evaluation (GPOPE), outperforms the previous methods in the restricted on-policy setting that prior work considers, works well for both discrete and continuous domains, and guarantees differential privacy for both on-policy and off-policy evaluation problems. We also demonstrate the effectiveness of our approach both on common benchmark tasks and on a more challenging HIV simulator. Since our approach is based on gradient computations, it can be extended easily to more advanced first-order optimization methods, such as stochastic variance reduction methods [Du et al., 2017; Palaniappan and Bach, 2016], and momentum methods [Nesterov, 2013].
References
 Abadi et al. [2016] Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM.
 Balle et al. [2018] Balle, B., Barthe, G., and Gaboardi, M. (2018). Privacy amplification by subsampling: Tight analyses via couplings and divergences. In Advances in Neural Information Processing Systems, pages 6280–6290.

 Balle et al. [2016] Balle, B., Gomrokchi, M., and Precup, D. (2016). Differentially private policy evaluation. In International Conference on Machine Learning, pages 2130–2138.
 Bassily et al. [2014] Bassily, R., Smith, A., and Thakurta, A. (2014). Differentially private empirical risk minimization: Efficient algorithms and tight error bounds. arXiv preprint arXiv:1405.7085.
 Bertsekas [1999] Bertsekas, D. P. (1999). Nonlinear programming. Athena scientific Belmont.
 Boyd et al. [2011] Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122.
 Bradtke and Barto [1996] Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1–3):33–57.
 Bun and Steinke [2016] Bun, M. and Steinke, T. (2016). Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer.
 Du et al. [2017] Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. (2017). Stochastic variance reduction methods for policy evaluation. arXiv preprint arXiv:1702.07944.
 Dwork et al. [2006a] Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and Naor, M. (2006a). Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486–503. Springer.
 Dwork et al. [2006b] Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006b). Calibrating noise to sensitivity in private data analysis. In TCC, volume 3876, pages 265–284. Springer.
 Dwork et al. [2014] Dwork, C., Roth, A., et al. (2014). The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407.
 Dwork and Rothblum [2016] Dwork, C. and Rothblum, G. N. (2016). Concentrated differential privacy. arXiv preprint arXiv:1603.01887.
 Ernst et al. [2006] Ernst, D., Stan, G.-B., Goncalves, J., and Wehenkel, L. (2006). Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Decision and Control, 2006 45th IEEE Conference on, pages 667–672. IEEE.
 Konidaris et al. [2011] Konidaris, G., Osentoski, S., and Thomas, P. S. (2011). Value function approximation in reinforcement learning using the fourier basis. In AAAI, volume 6, page 7.
 Liu et al. [2015] Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In UAI, pages 504–513.
 Mahmood et al. [2014] Mahmood, A. R., Hasselt, H., and Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems 27.
 Meyer [2000] Meyer, C. D. (2000). Matrix analysis and applied linear algebra, volume 71. SIAM.
 Mironov [2017] Mironov, I. (2017). Rényi differential privacy. arXiv preprint arXiv:1702.07476.
 Nesterov [2013] Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media.
 Palaniappan and Bach [2016] Palaniappan, B. and Bach, F. (2016). Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1416–1424.
 Papernot et al. [2016] Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I., and Talwar, K. (2016). Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755.
 Puterman [2014] Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
 Raghu et al. [2017] Raghu, A., Komorowski, M., Ahmed, I., Celi, L., Szolovits, P., and Ghassemi, M. (2017). Deep reinforcement learning for sepsis treatment. arXiv preprint arXiv:1711.09602.
 Song et al. [2013] Song, S., Chaudhuri, K., and Sarwate, A. D. (2013). Stochastic gradient descent with differentially private updates. In Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, pages 245–248. IEEE.
 Sutton and Barto [1998] Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge.
 Sutton et al. [2009] Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM.
 Theocharous et al. [2015] Theocharous, G., Thomas, P. S., and Ghavamzadeh, M. (2015). Personalized ad recommendation systems for lifetime value optimization with guarantees. In IJCAI, pages 1806–1812.
 Wang et al. [2017] Wang, D., Ye, M., and Xu, J. (2017). Differentially private empirical risk minimization revisited: Faster and more general. In Advances in Neural Information Processing Systems, pages 2719–2728.
 Wang et al. [2018] Wang, Y.-X., Balle, B., and Kasiviswanathan, S. (2018). Subsampled Rényi differential privacy and analytical moments accountant. arXiv preprint arXiv:1808.00087.
Appendix A Relationships Among Parameters
Table 1 shows the relationships among the privacy parameters , the total number of iterations , and the size of the dataset . Red denotes a negative relationship and green denotes a positive relationship. For example, if is decreased, and only is changed, then must be increased. Similarly, if the size of the dataset is increased, and only is changed, then must be decreased.
Appendix B Proofs in Privacy Analysis
In this section we provide a detailed analysis of the privacy guarantee of our algorithm. We first introduce the following key definitions and properties we will use.
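Definition 5 (Rényi Divergence).
Let $P$ and $Q$ be probability distributions on $\Omega$. For $\alpha \in (1, \infty)$, we define the Rényi divergence of order $\alpha$ between $P$ and $Q$ as
(B.1) $D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \left( \int_\Omega p(x)^{\alpha} q(x)^{1 - \alpha} \, dx \right)$
(B.2) $\qquad = \frac{1}{\alpha - 1} \log \left( \mathbb{E}_{x \sim Q} \left[ \left( \frac{p(x)}{q(x)} \right)^{\alpha} \right] \right)$
(B.3) $\qquad = \frac{1}{\alpha - 1} \log \left( \mathbb{E}_{x \sim P} \left[ \left( \frac{p(x)}{q(x)} \right)^{\alpha - 1} \right] \right)$,
where $p$ and $q$ are the probability density functions of $P$ and $Q$, respectively.
Definition 6 (Rényi Differential Privacy).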
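We say that a mechanism $\mathcal{M}$ is $(\alpha, \epsilon)$-Rényi differentially private (RDP) with order $\alpha \in (1, \infty)$ if, for all neighboring datasets $d$ and $d'$,
(B.4) $D_\alpha(\mathcal{M}(d) \| \mathcal{M}(d')) \le \epsilon$.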
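Lemma 3 (Lemma 2.5 in [Bun and Steinke, 2016]).
Let $\mu, \nu \in \mathbb{R}^d$, $\sigma \in \mathbb{R}$, and $\alpha \in [1, \infty)$. Then,
(B.5) $D_\alpha\left( \mathcal{N}(\mu, \sigma^2 I_d) \,\|\, \mathcal{N}(\nu, \sigma^2 I_d) \right) = \frac{\alpha \|\mu - \nu\|_2^2}{2 \sigma^2}$.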
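As a sanity check on Lemma 3, the following minimal sketch compares the integral form of the Rényi divergence from Definition 5 with the closed form $\alpha (\mu - \nu)^2 / (2\sigma^2)$ for two one-dimensional Gaussians with equal variance. The function names, the grid resolution, and the integration width are illustrative assumptions and are not part of the paper.

```python
import numpy as np


def renyi_divergence_numeric(mu, nu, sigma, alpha, num=200001, width=12.0):
    """Approximate D_alpha(N(mu, sigma^2) || N(nu, sigma^2)) by a Riemann sum.

    Uses the integral form: (1 / (alpha - 1)) * log(int p^alpha q^(1 - alpha) dx).
    """
    # The integrand is a scaled Gaussian bump centered at this point,
    # so the grid is centered there (an implementation convenience).
    center = alpha * mu + (1.0 - alpha) * nu
    x = np.linspace(center - width * sigma, center + width * sigma, num)
    log_norm = -0.5 * np.log(2.0 * np.pi * sigma**2)
    log_p = -0.5 * ((x - mu) / sigma) ** 2 + log_norm
    log_q = -0.5 * ((x - nu) / sigma) ** 2 + log_norm
    integrand = np.exp(alpha * log_p + (1.0 - alpha) * log_q)
    integral = np.sum(integrand) * (x[1] - x[0])
    return np.log(integral) / (alpha - 1.0)


def renyi_divergence_closed_form(mu, nu, sigma, alpha):
    """Closed form for equal-variance Gaussians (one-dimensional case)."""
    return alpha * (mu - nu) ** 2 / (2.0 * sigma**2)


if __name__ == "__main__":
    args = (0.0, 1.0, 1.0, 3.0)  # mu, nu, sigma, alpha
    print(renyi_divergence_numeric(*args), renyi_divergence_closed_form(*args))
```

The two values should agree up to discretization error, which illustrates why the Gaussian mechanism has a simple RDP guarantee that grows linearly in the order $\alpha$.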
B.1 Proof of Lemma 1
Proof.
Let be fixed and let , where denotes a trajectory of length . Without loss of generality, let and . Thus and are distributed identically except for the first coordinate, which reduces the analysis to a one-dimensional problem. Let denote the probability density function of and let denote the probability density function of . Thus,
(B.6)  
(B.7) 
where . To avoid the difficulty of analyzing this complex mixture distribution, we decompose as a composition of two algorithms: (1) subsample: subsample one data point from the dataset without replacement, and (2) a randomized algorithm that takes the subsampled dataset as its input. Next, we use the amplification properties of RDP under subsampling to obtain .
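The two-stage decomposition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: `update_fn` is a hypothetical per-trajectory update with bounded norm, and the randomized stage is assumed to add isotropic Gaussian noise with standard deviation `sigma`.

```python
import numpy as np


def subsample_one(dataset, rng):
    """Stage 1: draw a single data point uniformly at random from the dataset."""
    idx = rng.integers(len(dataset))
    return dataset[idx]


def noisy_update(point, update_fn, sigma, rng):
    """Stage 2: a randomized algorithm applied to the subsampled point.

    Assumption: Gaussian noise is added to a hypothetical update direction.
    """
    g = np.asarray(update_fn(point), dtype=float)
    return g + rng.normal(0.0, sigma, size=g.shape)


def private_step(dataset, update_fn, sigma, rng):
    """Composition of the two stages: subsample, then randomize."""
    return noisy_update(subsample_one(dataset, rng), update_fn, sigma, rng)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    print(private_step(data, lambda x: x, sigma=1.0, rng=rng))
```

Separating the two stages mirrors the structure of the argument: the privacy of the second stage is analyzed on its own, and the subsampling stage then amplifies that guarantee.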
By the amplification properties of RDP under subsampling (Theorem 9 in [Wang et al., 2018]), we obtain that is ()-RDP,
(B.11) 
where is defined in (B.10), and we ignore the higher-order terms since . According to the definition of RDP, we have
(B.12) 
Since the Gaussian mechanism does not have a bounded , the term in the bound (B.11) can be simplified as , where according to (B.10).
By properties of Rényi divergence, we have
(B.13)  
(B.14)  
(B.15)  
(B.16)  
(B.17) 
where the second inequality follows from (B.12) and the third inequality follows from (B.11). This completes the proof.
∎
B.2 Proof of Theorem 1
We first introduce a useful theorem for the calculation of
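Theorem 4 (Theorem 2 in [Abadi et al., 2016]).
Let $\alpha_{\mathcal{M}}(\lambda)$ be the moments accountant of a randomized mechanism $\mathcal{M}$.
[Composability] Suppose that a mechanism $\mathcal{M}$ consists of a sequence of adaptive mechanisms $\mathcal{M}_1, \ldots, \mathcal{M}_k$, where $\mathcal{M}_i : \prod_{j=1}^{i-1} \mathcal{R}_j \times \mathcal{D} \to \mathcal{R}_i$. Then, for any $\lambda$,
(B.18) $\alpha_{\mathcal{M}}(\lambda) \le \sum_{i=1}^{k} \alpha_{\mathcal{M}_i}(\lambda)$.
[Tail bound] For any $\epsilon > 0$, the mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private for
(B.19) $\delta = \min_{\lambda} \exp\left( \alpha_{\mathcal{M}}(\lambda) - \lambda \epsilon \right)$.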
Theorem 4 enables us to compute and bound the moments accountant at each iteration and to sum these bounds to bound the moments of the whole algorithm. This allows us to convert the moments bound to the differential privacy guarantee.
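The composition-then-conversion recipe can be illustrated with a small sketch. It assumes, for simplicity, a plain Gaussian mechanism with unit sensitivity and no subsampling, whose log moment is $\lambda(\lambda + 1) / (2\sigma^2)$; this is looser than the subsampled bound of Lemma 1 and is used here only to show the mechanics. The function names and the range of $\lambda$ searched are illustrative assumptions.

```python
import math


def gaussian_log_moment(lam, sigma):
    """Log moment of a Gaussian mechanism with unit sensitivity (no subsampling)."""
    return lam * (lam + 1) / (2.0 * sigma**2)


def composed_log_moment(lam, sigma, num_steps):
    """Composability: per-step log moments add up over the iterations."""
    return num_steps * gaussian_log_moment(lam, sigma)


def delta_for_epsilon(eps, sigma, num_steps, max_lambda=64):
    """Tail bound: delta = min over lambda of exp(alpha(lambda) - lambda * eps).

    Only integer orders 1..max_lambda are searched (an assumption). The exponent
    is capped at 0 because delta <= 1 always holds, which also avoids overflow.
    """
    best_exponent = min(
        composed_log_moment(lam, sigma, num_steps) - lam * eps
        for lam in range(1, max_lambda + 1)
    )
    return math.exp(min(0.0, best_exponent))


if __name__ == "__main__":
    print(delta_for_epsilon(eps=2.0, sigma=10.0, num_steps=10))
```

With these assumed values the exponent is minimized near $\lambda \approx 19.5$, so the search over integer orders returns a $\delta$ close to $e^{-19}$. Running more iterations at the same noise level can only increase $\delta$.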
Given Lemma 1 and Theorem 4, the proof that Algorithm 1 is differentially private follows directly: Lemma 1 bounds the moments of each iteration, and Theorem 4 lets us calculate the moments accountant of the whole algorithm. The proof of Theorem 1 is as follows.
Proof.
We first analyze the term in Lemma 1. If , we have
(B.20)  
(B.21)  
(B.22)  
(B.23) 
and
(B.24) 
If , we have
(B.25)  
(B.26)  
(B.27) 
and
(B.28) 
By Theorem 4 and Lemma 1, the moment of Algorithm 1 can be bounded as (assuming we set explicitly to satisfy ). In order to use Theorem 4 to guarantee the differential privacy of Algorithm 1, we need to satisfy
(B.29) 
and to satisfy
(B.30) 
Thus, when , all these conditions are satisfied by setting
(B.31) 
for some explicit constants and . ∎
Appendix C Proofs in Utility Analysis
In this section we provide detailed proofs for the utility analysis. First, we derive properties of each iteration of our algorithm. We assume that all transitions in the sampled trajectory are used in this subsection (as in the GPOPE algorithm).
We begin with the proof of Lemma 2.
C.1 Proof of Lemma 2
Proof.
The updates follow the iteration
(C.1)  
(C.2)  
(C.3)  
(C.4) 
Subtracting the optimal solution (defined in (6.2)) from both sides and using the first-order optimality condition, we obtain
(C.5)  
(C.6) 
The analysis of the convergence rate examines the difference between the current parameters and the optimal solution. Note that the residual vector in (6.3) obeys the following iteration: