I Introduction
Reinforcement learning (RL) is a control-theoretic problem in which an agent adaptively learns to make sequential decisions in an unknown environment through trial and error. RL has shown significant success in delivering a wide variety of personalized services, including online news and advertisement recommendation [17], medical treatment design [26, 20], and social robots [12]. In these applications, an RL agent improves its personalization algorithm by interacting with users so as to maximize the reward. In particular, in each round, the RL agent offers an action based on the user's state, and then receives feedback from the user (i.e., state information, state transition, reward, etc.). This feedback is used by the agent to learn the unknown environment and improve its action selection strategy. However, in most practical scenarios, the feedback from the users often encodes their sensitive information. For example, in a personalized healthcare setting, the states of a patient include personal information such as age, gender, height, weight, state of the treatment, etc. Similarly, the states of a virtual keyboard user (e.g., a Google GBoard user) are the words and sentences she typed in, which inevitably contain private information about the user. Another intriguing example is the social robot for second-language education of children, where the states include facial expressions and the rewards encode whether the children have passed a quiz. Users may not want any of this information to be inferred by others, which directly results in an increasing concern about privacy protection in personalized services. To be more specific, although a user might be willing to share her own information with the agent to obtain a better-tailored service, she would not want third parties to infer her private information from the output of the learning algorithm. For example, in the healthcare application, we would like to ensure that an adversary with arbitrary side knowledge cannot infer a particular patient's state from the treatments prescribed to her.
Differential privacy (DP) [10] has become a standard mechanism for designing interactive learning algorithms under a rigorous privacy guarantee for individual data. Most of the previous works on differentially private learning under partial feedback focus on the simpler bandit setting (i.e., no state transition) [21, 22, 4, 18, 27]. For the general RL problem, there are only a few works that consider differential privacy [3, 24, 11]. More importantly, only tabula rasa discrete-state, discrete-action environments are considered in these works. However, in the real-world applications mentioned above, the number of states and actions is often very large and can even be infinite. Over the years, for various non-tabular environments, efficient and provably optimal algorithms for reward maximization or, equivalently, regret minimization have been developed (see, e.g., [2, 19, 8, 25, 14]). This directly motivates the following question: Is it possible to obtain the optimal reward while providing individual privacy guarantees in the non-tabular RL scenario?
In this paper, we take the first step toward answering this question by considering a particular non-tabular RL problem: adaptive control of linear quadratic (LQ) systems, in which the state transition is a linear function and the immediate reward (cost) is a quadratic function of the current state and action. Our main contributions can be summarized as follows.

First, we provide a general framework for adaptive control of LQ systems under changing regularizers using the optimism in the face of uncertainty (OFU) principle, which covers both extreme cases: non-private and fully private LQ control.

We then develop the first private RL algorithm for regret minimization in LQ systems by adapting the binary counting mechanism to ensure differential privacy.

In particular, we show that our algorithm satisfies joint differential privacy (JDP), which, informally, implies that sensitive information about a given user is protected even if an adversary has access to the actions prescribed to all other users.

Finally, we prove that our algorithm achieves a sublinear regret guarantee, in which the regret due to privacy scales inversely with the privacy level, implying that a high amount of privacy (low ε) comes at a high cost and vice versa.
II Preliminaries
II-A Stochastic Linear Quadratic Control
We consider the discrete-time episodic linear quadratic (LQ) control problem with H time steps in every episode. Let x_t ∈ R^n be the state of the system, u_t ∈ R^d be the control and c_t be the cost at time t. An LQ problem is characterized by linear dynamics and a quadratic cost function
(1)  x_{t+1} = A x_t + B u_t + w_t,  c_t = x_t^T Q x_t + u_t^T R u_t,
where A ∈ R^{n×n}, B ∈ R^{n×d} are unknown matrices, and Q, R are known positive definite (p.d.) matrices. The starting state x_1 is fixed (and can possibly be chosen by an adversary) and the system noise w_t is zero-mean. We summarize the unknown parameters in Θ = [A, B].
The goal of the agent is to design a closed-loop control policy π mapping states to controls that minimizes the expected cost
(2)  J_t^π(x) = E[ Σ_{s=t}^{H} c_s | x_t = x ]
for all x and t. Here the expectation is over the random trajectory induced by the policy π starting from state x at time t. From the standard theory for LQ control (e.g., [5]), the optimal policy has the form u_t = K_t x_t, where the gain matrices are given by
(3)  K_t = -(R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A.
Here the symmetric positive semi-definite matrices P_t are defined recursively by the Riccati iteration
(4)  P_t = Q + A^T P_{t+1} A - A^T P_{t+1} B (R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A,
with P_{H+1} = 0. The optimal cost is given by
(5)  J_1^*(x) = x^T P_1 x + Σ_{t=1}^{H-1} E[ w_t^T P_{t+1} w_t ].
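For concreteness, the backward recursion (3)-(4) can be sketched in code. This is an illustrative sketch using the standard LQ notation introduced above; the function and variable names are ours, not the paper's.

```python
import numpy as np

def riccati_gains(A, B, Q, R, H):
    """Backward Riccati iteration, cf. (4): returns the cost-to-go
    matrices P_t and the optimal feedback gains K_t, cf. (3), for the
    finite-horizon LQ problem x_{t+1} = A x_t + B u_t + w_t with
    per-step cost x^T Q x + u^T R u."""
    n, d = B.shape
    P = [None] * (H + 1)
    K = [None] * H
    P[H] = np.zeros((n, n))  # terminal condition P_{H+1} = 0
    for t in range(H - 1, -1, -1):
        BtP = B.T @ P[t + 1]
        # G = (R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
        G = np.linalg.solve(R + BtP @ B, BtP @ A)
        K[t] = -G                 # optimal policy: u_t = K_t x_t
        P[t] = Q + A.T @ P[t + 1] @ A - A.T @ P[t + 1] @ B @ G
    return P, K
```

On a scalar system with A = B = Q = R = 1, the iteration converges to the fixed point of the Riccati equation, the golden ratio (1 + sqrt(5))/2, with the stationary gain -1/φ.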
We let the agent play K episodes and measure the performance by cumulative regret. (In the following, we add subscript k to denote the variables for the k-th episode: state x_{t,k}, control u_{t,k}, noise w_{t,k} and cost c_{t,k}.) In particular, if the true system dynamics are Θ_*, the cumulative regret of the first K episodes is given by
(6)  R(K) = Σ_{k=1}^{K} ( J^{π_k}(x_{1,k}) - J^*(x_{1,k}) ),
where J^{π_k}(x_{1,k}) is the (expected) cost under the policy chosen at the start of episode k (computed via (2)), and J^*(x_{1,k}) is the (expected) cost under an optimal policy for episode k (computed via (5)). We seek to attain a sublinear regret R(K) = o(K), which ensures that the agent finds the optimal policy as K grows. We end this section by presenting our assumptions on the LQ system (1), which are common in the LQ control literature [25].
Assumption 1 (Boundedness).
(a) The true system dynamics Θ_* are a member of a known set S. (b) There exist known constants bounding, uniformly over S, the norms of the matrices A, B, Q and R. (c) For all t, the states x_t and controls u_t remain bounded in norm. (d) The noise w_t at any t and k is (i) independent of all other randomness, (ii) zero-mean, and (iii) bounded in norm with a known parameter. (e) There exists a known constant uniformly bounding the state-control pairs z_t = (x_t^T, u_t^T)^T in norm.
II-B Differential Privacy
We now formally define the notion of differential privacy in the context of episodic LQ control. We write U_K = (U_1, …, U_K) to denote a sequence of K unique users participating in the private RL protocol with an RL agent M, where each U_k belongs to the set U of all users. Each user is identified by the state responses she gives to the controls chosen by the agent. We write M(U_K) to denote the privatized controls chosen by the agent when interacting with the users U_K. Informally, we will be interested in randomized algorithms such that the knowledge of the output M(U_K) and of all but the k-th user does not reveal 'much' about U_k. We formalize this in the following definition, which is adapted from [9].
Definition 1 (Differential Privacy (DP)).
For any ε ≥ 0 and δ ≥ 0, an algorithm M is (ε, δ)-differentially private if for all sequences U_K, U'_K differing on a single user and all subsets C of controls,
P[ M(U_K) ∈ C ] ≤ e^ε P[ M(U'_K) ∈ C ] + δ.
We now relax this definition, motivated by the fact that the controls recommended to a given user are only observed by her. We consider joint differential privacy [15], which requires that, simultaneously for all k, the joint distribution of the controls sent to users other than U_k does not change substantially upon changing the state responses of the user U_k. To this end, we let M_{-k}(U_K) denote all the controls chosen by the agent excluding those recommended to U_k.
Definition 2 (Joint Differential Privacy (JDP)).
For any ε ≥ 0 and δ ≥ 0, an algorithm M is (ε, δ)-jointly differentially private if for all k, all sequences U_K, U'_K differing on the k-th user and all subsets C of controls given to all but the k-th user,
P[ M_{-k}(U_K) ∈ C ] ≤ e^ε P[ M_{-k}(U'_K) ∈ C ] + δ.
This relaxation is necessary in our setting, since knowledge of the controls recommended to the user U_k can reveal a lot of information about her state responses. It weakens the constraint of DP only in that the controls given specifically to U_k may be sensitive in her state responses. However, JDP is still a very strong notion, since it protects U_k from any arbitrary collusion of other users against her, so long as she does not herself make the controls reported to her public.
In this work, we look for algorithms that are JDP, but we will build our algorithm upon standard DP mechanisms. Furthermore, to establish privacy, we will use a different relaxation called concentrated differential privacy (CDP) [6]. Roughly, a mechanism is CDP if its privacy loss has Gaussian tails. To this end, we let M be a mechanism taking as input a data stream S and releasing output from some range R.
Definition 3 (Concentrated Differential Privacy (CDP)).
For any ρ ≥ 0, an algorithm M is ρ-zero-concentrated differentially private (ρ-zCDP) if for all data streams S, S' differing on a single entry and all α ∈ (1, ∞),
D_α( M(S) || M(S') ) ≤ ρ α,
where D_α( M(S) || M(S') ) is the Rényi divergence of order α between the distributions of M(S) and M(S'). (For two probability distributions P and Q on a common space, the Rényi divergence of order α is D_α(P || Q) = (1/(α-1)) log E_{x∼Q}[ (P(x)/Q(x))^α ].)

III OFU-Based Control
Our proposed private RL algorithm implements the optimism in the face of uncertainty (OFU) principle in LQ systems. As in [2], the key to implementing OFU-based control is a high-probability confidence set for the unknown parameter matrix Θ_*.

III-A Adaptive Control with Changing Regularizers
We start with adaptive LQ control under changing regularizers. This not only allows us to generalize previous results for non-private control, but, more importantly, serves as a basis for the analysis of private control in the next section. We first define the following compact notation. For a state and control pair at step t in episode k, i.e., x_{t,k} and u_{t,k}, we write z_{t,k} = (x_{t,k}^T, u_{t,k}^T)^T. For any k, we define the Gram matrix V_k = Σ_{j=1}^{k-1} Σ_{t=1}^{H} z_{t,j} z_{t,j}^T and the cross term S_k = Σ_{j=1}^{k-1} Σ_{t=1}^{H} z_{t,j} x_{t+1,j}^T. For two matrices M and N, we also write ⟨M, N⟩ = tr(M^T N). Now, at every episode k, we consider the ridge regression estimate (V_k + H_k)^{-1} S_k w.r.t. a regularizing p.d. matrix H_k. In contrast to standard online LQ control [2], here the sequence of Gram matrices V_k is perturbed by a sequence of regularizers H_k. In particular, when H_k = λ I, we recover the standard estimate of [2]. In addition, we also allow S_k to be perturbed by a matrix Y_k at every episode k. Setting V̄_k = V_k + H_k and S̄_k = S_k + Y_k, we define the estimate under changing regularizers H_k and Y_k as
(7)  Θ̂_k = V̄_k^{-1} S̄_k = (V_k + H_k)^{-1} (S_k + Y_k).
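The estimate (7) is an ordinary regularized least-squares solve. The following sketch, with illustrative names and the notation above, shows how it is computed from a stack of state-control pairs:

```python
import numpy as np

def regularized_estimate(Z, X_next, H_reg, Y_reg):
    """Estimate under changing regularizers, cf. (7):
    Theta_hat = (V + H)^{-1} (S + Y), where V = Z^T Z is the Gram
    matrix of the state-control pairs (rows of Z) and S = Z^T X_next
    is the cross term with the next states (rows of X_next).
    H_reg (p.d.) and Y_reg are the episode-dependent regularizers."""
    V = Z.T @ Z
    S = Z.T @ X_next
    return np.linalg.solve(V + H_reg, S + Y_reg)
```

With H_reg = λI and Y_reg = 0 this reduces to the standard ridge estimate of [2]; on noiseless data and a negligible regularizer it recovers the true parameters.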
We make the following assumptions on the sequences of regularizers H_k and Y_k.
Assumption 2 (Regularity).
For any δ ∈ (0, 1), there exist constants Λ_min > 0, Λ_max and γ depending on δ such that, with probability at least 1 - δ, for all k: λ_min(H_k) ≥ Λ_min, λ_max(H_k) ≤ Λ_max, and ||Y_k||_F ≤ γ.
Lemma 1 (Concentration under changing regularizers).
Lemma 1 helps us to introduce the following high-probability confidence set
(8)  C_k = { Θ : || V̄_k^{1/2} ( Θ - Θ̂_k ) ||_F ≤ β_k },
where β_k is the confidence width given by Lemma 1. We then search for an optimistic estimate Θ̃_k within this confidence region C_k, such that
(9)  Θ̃_k ∈ argmin_{Θ ∈ C_k ∩ S} J_1^*(Θ),
where J_1^*(Θ) is the optimal cost when the system dynamics are Θ (computed from (5)). With the estimate Θ̃_k, the agent then chooses the policy π_k that is optimal for Θ̃_k and selects the controls recommended by this policy:
(10)  u_{t,k} = K_t(Θ̃_k) x_{t,k},
where the gains K_t(Θ̃_k) can be computed from (3). We call this the general OFU procedure under changing regularizers and bound its regret as follows.
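The optimistic selection (9) can be illustrated over a finite set of candidate dynamics, each scored by its optimal cost. This is a sketch only: the finite grid stands in for optimizing over the full confidence ellipsoid, and all names are ours.

```python
import numpy as np

def optimal_cost(A, B, Q, R, H, x0):
    """J_1^*(x0) = x0^T P_1 x0 (noise terms dropped), via the
    Riccati iteration (4)."""
    P = np.zeros_like(Q)  # P_{H+1} = 0
    for _ in range(H):
        BtP = B.T @ P
        P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(R + BtP @ B, BtP @ A)
    return float(x0 @ P @ x0)

def ofu_select(candidates, Q, R, H, x0):
    """OFU step, cf. (9), over a finite candidate approximation of the
    confidence set C_k: return the dynamics (A, B) with the smallest
    optimal cost (optimism, since we minimize cost)."""
    return min(candidates, key=lambda th: optimal_cost(th[0], th[1], Q, R, H, x0))
```

Among two scalar candidates, the less expansive dynamics yield a lower optimal cost and are therefore the optimistic choice.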
Theorem 1 (Regret under changing regularizers).
Proof sketch. Inspired by [25], we first decompose the regret under a 'good' event, which, by Assumption 1 and Lemma 1, holds with high probability. Then, under the 'good' event, the cumulative regret (6) can be decomposed into three terms, in which P_t is given by (4) and the filtration collects all the randomness present before time t.
Now we bound each term in turn. The first two terms are bounded martingale difference sequences; therefore, by the Azuma–Hoeffding inequality, both concentrate with high probability. We use Lemma 1 and the OFU principle (9) to bound the third term. To put everything together, first note from Assumption 1 that the relevant problem quantities are uniformly bounded.
Plugging this into the confidence width given in Lemma 1 and into the bound on the third term yields the final result. ∎
We end the section with a proof sketch of Lemma 1.
Proof sketch (Lemma 1). Under Assumptions 1 and 2, with some basic algebra, we first decompose the estimation error into a term driven by the regularizers and a term driven by the noise. By Assumption 2, the regularizer term is controlled with probability at least 1 - δ. To bound the noise term, by the boundedness conditions in Assumption 1, we first note that each row of the noise matrix is a sub-Gaussian random vector. We then generalize the self-normalized concentration inequality for vector-valued martingales [1, Theorem 1] to the setting of matrix-valued martingales, which controls the noise term with probability at least 1 - δ. Combining the two bounds using a union bound argument yields the final result. ∎
III-B Private Control
In this section, we introduce our private control algorithm (Alg. 1). At every episode k, we keep track of the history via private versions of the matrices V_k and S_k. To do so, we first initialize two private counter mechanisms, one for each statistic, which take as parameters the privacy levels ε, δ, the number of episodes K, the horizon H and a problem-specific constant (see Assumption 1). The first counter takes as input an event stream of matrices Σ_t z_{t,k} z_{t,k}^T (resp. the second counter, Σ_t z_{t,k} x_{t+1,k}^T), and at the start of each episode k releases a private version of the matrix V_k (resp. S_k), which itself is a matrix of the same dimension. Let Ṽ_k and S̃_k denote the privatized versions of V_k and S_k, respectively. For some λ > 0 (to be determined later), we define V̄_k = Ṽ_k + λ I and S̄_k = S̃_k. We now instantiate the general procedure under changing regularizers (Section III-A) with these private statistics. First, we compute the point estimate Θ̂_k from (7) and build the confidence set C_k from (8). Then, we choose the most optimistic policy w.r.t. the entire set from (9) and (10). Finally, we execute the policy for the entire episode and update the counters with the observed trajectory.
We now describe the private counters, adapting the binary counting mechanism of [7]. First, we write p[i, j] to denote a partial sum involving the state-control pairs in episodes i through j. Next, we consider a binary interval tree, where each leaf node represents an episode (i.e., the tree has k - 1 leaf nodes at the start of episode k), and each interior node represents the range of episodes covered by its children. At the start of episode k, we first release a noisy partial sum p̂[i, j] corresponding to each new node in the tree. Here p̂[i, j] is obtained by perturbing both the (i', j')-th and (j', i')-th entries, i' ≤ j', of p[i, j] with i.i.d. Gaussian noise. (This ensures symmetry of the released matrix even after adding noise.) Then Ṽ_k is computed by summing up the noisy partial sums released by the set of nodes that uniquely cover the range of episodes 1 through k - 1. Observe that, at the end of each episode, the mechanism only needs to store the noisy partial sums required for computing private statistics at future episodes, and can safely discard those that are no longer needed. For the second private counter, we maintain noisy partial sums of the cross terms with i.i.d. Gaussian noise and compute the private statistics S̃_k using a similar procedure. The noise levels of the two counters depend on the problem intrinsics (K, H and the constant from Assumption 1) and the privacy parameters (ε, δ). These, in turn, govern the constants appearing in the confidence set C_k and the regularizer. The details will be specified in the next section as needed.
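The tree-based release above can be sketched for a scalar stream (the paper's streams are symmetric matrices, but the node bookkeeping is identical). Each dyadic node is perturbed once and reused, so every released statistic sums at most a logarithmic number of noisy partial sums. Function and variable names are illustrative.

```python
import numpy as np

def private_prefix_sums(stream, sigma, rng):
    """Binary counting mechanism in the spirit of [7]: releases noisy
    prefix sums of `stream`. Each dyadic interval (tree node) receives
    fresh Gaussian noise exactly once; the prefix sum over episodes
    1..k is assembled from the nodes covering that range."""
    noisy_psum = {}  # (i, j) -> noisy partial sum p_hat[i, j]
    out = []
    for k in range(1, len(stream) + 1):
        total, lo, rem = 0.0, 1, k
        while rem > 0:
            size = 1 << (rem.bit_length() - 1)  # largest power of two <= rem
            key = (lo, lo + size - 1)           # dyadic interval of episodes
            if key not in noisy_psum:           # perturb each node only once
                noisy_psum[key] = sum(stream[key[0] - 1:key[1]]) + rng.normal(0.0, sigma)
            total += noisy_psum[key]
            lo += size
            rem -= size
        out.append(total)
    return out
```

With sigma = 0 the mechanism reduces to exact prefix sums, which is a convenient sanity check; with sigma > 0 each query aggregates noise from only O(log k) nodes rather than k.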
IV Privacy and Regret Guarantees
In this section, we show that our algorithm is JDP and enjoys a sublinear regret guarantee.
IV-A Privacy Guarantee
Theorem 2 (Privacy).
Under Assumption 1, for any ε > 0 and δ ∈ (0, 1), our algorithm is (ε, δ)-JDP.
Proof sketch. We first show that both counters are DP, beginning with the counter for V_k. To this end, we need to determine a global upper bound Δ on the sensitivity of all the partial sums p[i, j]. Informally, Δ encodes the maximum change in the Frobenius norm of each p[i, j] if the trajectory of a single episode is changed. By Assumption 1, the state-control pairs are uniformly bounded, and hence Δ is finite. Since the noisy partial sums p̂[i, j] are obtained via the Gaussian mechanism, each release is zCDP [6, Proposition 1.6]. We now observe that every episode appears in only a logarithmic number of nodes of the tree. Therefore, by the composition property, the whole counter is zCDP, and thus, in turn, (ε', δ')-DP for suitable (ε', δ') [6, Lemma 3.5]. Now, setting the noise level appropriately, we can ensure that the counter for V_k is DP at the desired level; a similar analysis, with an appropriate choice of its noise level, yields that the counter for S_k is DP as well, so that the pair composes to (ε, δ)-DP.
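The privacy accounting in this sketch can be made concrete with the standard zCDP facts it invokes. The conversion constant below is the standard zCDP-to-DP bound from [6] (a relative of the cited Lemma 3.5); the function names and the per-episode node count are ours.

```python
import math

def gaussian_zcdp(sensitivity, sigma):
    """The Gaussian mechanism with l2-sensitivity Delta and noise scale
    sigma is rho-zCDP with rho = Delta^2 / (2 sigma^2) [6, Prop. 1.6]."""
    return sensitivity ** 2 / (2 * sigma ** 2)

def zcdp_to_dp(rho, delta):
    """Standard conversion from [6]: rho-zCDP implies
    (rho + 2 sqrt(rho log(1/delta)), delta)-DP."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

def counter_dp_epsilon(sensitivity, sigma, K, delta):
    """Each episode's data touches at most floor(log2 K) + 1 tree nodes,
    so the whole counter composes to that many Gaussian releases."""
    nodes = math.floor(math.log2(K)) + 1
    rho_total = nodes * gaussian_zcdp(sensitivity, sigma)
    return zcdp_to_dp(rho_total, delta)
```

As expected, increasing the noise scale drives the resulting ε down, which is the knob the algorithm turns to hit a target privacy level.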
To prove Theorem 2, we now use the billboard lemma [13, Lemma 9] which, informally, states that an algorithm is JDP under continual observation if the output sent to each user is a function of the user's private data and a common quantity computed using standard differential privacy. Note that at each episode k, the algorithm computes the private statistics Ṽ_k and S̃_k for all users using the two counters. These statistics are then used to compute the policy π_k. By the composition and post-processing properties of DP, we can argue that the sequence of policies is computed by an (ε, δ)-DP mechanism. Now, the controls during episode k are generated using the policy π_k and the user's private data as u_{t,k} = π_k(x_{t,k}). Then, by the billboard lemma, the composition of the controls sent to all the users is (ε, δ)-JDP. ∎
IV-B Regret Guarantee
Theorem 3 (Private regret).
Under Assumption 1, for any privacy parameters ε > 0 and δ ∈ (0, 1), and for any α ∈ (0, 1), with probability at least 1 - α, our algorithm enjoys a sublinear regret bound.
Theorems 2 and 3 together imply that our algorithm achieves sublinear regret under a JDP guarantee. Furthermore, comparing Theorem 3 with Theorem 1, we see that the first term in the regret bound corresponds to the non-private regret, and the second term is the cost of privacy. More importantly, the cost due to privacy grows only mildly with the number of episodes K and inversely with the privacy level ε.
Proof sketch (Theorem 3). First note that each private statistic is computed by summing at most a logarithmic number of noisy partial sums. We then have that the total noise in each Ṽ_k is a symmetric matrix whose (i, j)-th entries, i ≤ j, are i.i.d. Gaussian. Therefore, by an adaptation of [23, Corollary 4.4.8], its spectral norm is bounded with probability at least 1 - δ.
Similarly, the total noise in each S̃_k is a matrix whose entries are i.i.d. Gaussian. Hence its squared Frobenius norm is a chi-squared statistic with the corresponding degrees of freedom, and therefore, by [16, Lemma 1], it is bounded with probability at least 1 - δ.
By construction, the regularizer is H_k = (Ṽ_k - V_k) + λ I. Setting λ larger than the high-probability bound on the spectral norm of the noise, we ensure that H_k is p.d. Then, by a union bound argument, Assumption 2 holds for suitable constants Λ_min, Λ_max and γ. Appropriating the noise levels from Section IV-A, the regret bound now follows from Theorem 1. ∎
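The two concentration facts used in this sketch can be checked numerically. This is a toy illustration with arbitrary dimensions and noise scale (not the paper's settings): a symmetric Gaussian matrix has spectral norm of order sqrt(d)·sigma, and the squared Frobenius norm of an i.i.d. Gaussian matrix behaves like sigma^2 times a chi-squared variable.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 50, 1.0

# Symmetric noise matrix, as added to V_k: perturb the upper triangle
# i.i.d. and mirror it to keep the release symmetric.
upper = np.triu(rng.normal(0.0, sigma, (d, d)))
noise_V = upper + np.triu(upper, 1).T

# Spectral norm is O(sqrt(d) * sigma), cf. [23, Corollary 4.4.8]; a
# regularizer lambda exceeding it makes noise_V + lambda*I p.d.
spec = np.linalg.norm(noise_V, 2)

# The noise in S_k: squared Frobenius norm is sigma^2 times a
# chi-squared statistic with d*d degrees of freedom, cf. [16, Lemma 1].
noise_S = rng.normal(0.0, sigma, (d, d))
frob_sq = np.sum(noise_S ** 2)
```

Setting λ = 2·spec then guarantees that the resulting regularizer is positive definite, exactly as in the proof sketch.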
V Conclusion
We develop the first differentially private algorithm for episodic LQ control. Through the notion of JDP, we show that it can protect private user information from being inferred by observing the control policy, without losing much in regret performance. We leave private control of nonlinear systems [8] as future work.
References
 [1] (2011) Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320.
 [2] (2011) Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 1–26.
 [3] (2016) Differentially private policy evaluation. In International Conference on Machine Learning, pp. 2130–2138.
 [4] (2019) Differential privacy for multi-armed bandits: what is it and what is its cost? arXiv preprint arXiv:1905.12298.
 [5] (2004) Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 3rd edition.
 [6] (2016) Concentrated differential privacy: simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pp. 635–658.
 [7] (2011) Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC) 14(3), pp. 1–24.
 [8] (2019) Online learning in kernelized Markov decision processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3197–3205.
 [9] (2014) The algorithmic foundations of differential privacy.
 [10] (2008) Differential privacy: a survey of results. In International Conference on Theory and Applications of Models of Computation, pp. 1–19.
 [11] (2020) Local differentially private regret minimization in reinforcement learning. arXiv preprint arXiv:2010.07778.
 [12] (2016) Affective personalization of a social robot tutor for children's second language skills. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30.
 [13] (2016) Private matchings and allocations. SIAM Journal on Computing 45(6), pp. 1953–1984.
 [14] (2020) Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137–2143.
 [15] (2014) Mechanism design in large games: incentives and privacy. In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, pp. 403–410.
 [16] (2000) Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pp. 1302–1338.
 [17] (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670.
 [18] (2015) (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 592–601.
 [19] (2014) Model-based reinforcement learning and the eluder dimension. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 1, pp. 1466–1474.
 [20] (2017) Literature survey of statistical, deep and reinforcement learning in natural language processing. In 2017 International Conference on Computing, Communication and Automation (ICCCA), pp. 350–354.
 [21] (2016) Algorithms for differentially private multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30.
 [22] (2017) Achieving privacy in the adversarial multi-armed bandit. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
 [23] (2018) High-dimensional probability: an introduction with applications in data science. Vol. 47, Cambridge University Press.
 [24] (2020) Private reinforcement learning with PAC and regret guarantees. In International Conference on Machine Learning, pp. 9754–9764.
 [25] (2020) Episodic linear quadratic regulators with low-rank transitions. arXiv preprint arXiv:2011.01568.
 [26] (2009) Reinforcement learning design for cancer clinical trials. Statistics in Medicine 28(26), pp. 3294–3315.
 [27] (2020) Local differential privacy for Bayesian optimization. arXiv preprint arXiv:2010.06709.