# A relaxed technical assumption for posterior sampling-based reinforcement learning for control of unknown linear systems

We revisit the Thompson sampling algorithm to control an unknown linear quadratic (LQ) system recently proposed by Ouyang et al. (arXiv:1709.04047). The regret bound of the algorithm was derived under a technical assumption on the induced norm of the closed loop system. In this technical note, we show that by making a minor modification in the algorithm (in particular, ensuring that an episode does not end too soon), this technical assumption on the induced norm can be replaced by a milder assumption in terms of the spectral radius of the closed loop system. The modified algorithm has the same Bayesian regret of $\tilde{\mathcal{O}}(\sqrt{T})$, where $T$ is the time horizon and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in $T$.


## I Introduction

In a recent paper, Ouyang et al. [1] presented a Thompson sampling (also called posterior sampling) algorithm to control a linear quadratic (LQ) system with unknown parameters. Their algorithm is called Thompson sampling with dynamic episodes (TSDE).¹ The main result of [1] is to show that the Bayesian regret of TSDE accumulated up to time $T$ is bounded by $\tilde{\mathcal{O}}(\sqrt{T})$, where the $\tilde{\mathcal{O}}(\cdot)$ notation hides constants and logarithmic factors. This result was derived under a technical assumption on the induced norm of the closed loop system. In this technical note, we present a minor variation of the TSDE algorithm and obtain a bound on the Bayesian regret by imposing a much milder technical assumption, which is in terms of the spectral radius of the closed loop system (rather than the induced norm, as was assumed in [1]).

¹ In [1], the algorithm is called posterior sampling reinforcement learning (PSRL-LQ). We use the term TSDE as it was used in [2], which was the conference version of [1], and is also used in other variations of the algorithm [3].

## II Model and problem formulation

We consider the same model as [1]. For the sake of completeness, we present the model below.

Consider a linear quadratic system with state $x_t \in \mathbb{R}^n$, control input $u_t \in \mathbb{R}^m$, and disturbance $w_t \in \mathbb{R}^n$. We assume that the system starts from an initial state $x_1$ and evolves over time according to

$$x_{t+1} = A x_t + B u_t + w_t, \quad t \ge 1, \tag{1}$$

where $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times m}$ are the system dynamics matrices. The noise $\{w_t\}_{t \ge 1}$ is an independent and identically distributed Gaussian process with $w_t \sim \mathcal{N}(0, \sigma_w^2 I)$.

###### Remark 1

In [1], it was assumed that $\sigma_w = 1$. Using a general $\sigma_w$ does not fundamentally change any of the results or the proof arguments.

At each time $t$, the system incurs a per-step cost given by

$$c(x_t, u_t) = x_t^\top Q x_t + u_t^\top R u_t, \tag{2}$$

where $Q$ and $R$ are positive definite matrices.

Let $\theta^\top = [A, B]$ denote the parameters of the system. Thus, $\theta \in \mathbb{R}^{(n+m) \times n}$. The performance of any policy $\pi$ is measured by the long-term average cost given by

$$J(\pi; \theta) = \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^\pi\Big[\sum_{t=1}^T c(x_t, u_t)\Big]. \tag{3}$$

Let $J(\theta)$ denote the minimum of $J(\pi; \theta)$ over all policies. It is well known [4] that if the pair $(A, B)$ is stabilizable, then $J(\theta)$ is given by

$$J(\theta) = \sigma_w^2 \operatorname{Tr}(S(\theta)),$$

where $S(\theta)$ is the unique positive semi-definite solution of the following Riccati equation:

$$S(\theta) = Q + A^\top S(\theta) A - A^\top S(\theta) B \big(R + B^\top S(\theta) B\big)^{-1} B^\top S(\theta) A. \tag{4}$$

Furthermore, the optimal control policy is given by

$$u_t = G(\theta)\, x_t, \tag{5}$$

where the gain matrix $G(\theta)$ is given by

$$G(\theta) = -\big(R + B^\top S(\theta) B\big)^{-1} B^\top S(\theta) A. \tag{6}$$
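As a quick numerical illustration of (4)–(6), the sketch below computes $S(\theta)$ by fixed-point (value) iteration on the Riccati equation and then forms the gain $G(\theta)$. The system matrices, iteration count, and function name are ours, chosen for illustration only.

```python
import numpy as np

def riccati_gain(A, B, Q, R, iters=2000):
    """Solve (4) by fixed-point iteration and return S and the gain (6).
    A naive sketch; a dedicated DARE solver is preferable in practice."""
    S = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
        S = Q + A.T @ S @ A - A.T @ S @ B @ K
    G = -np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
    return S, G

# illustrative system (not from the paper): marginally unstable but stabilizable
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
S, G = riccati_gain(A, B, Q, R)
rho = max(abs(np.linalg.eigvals(A + B @ G)))
print(rho)  # closed-loop spectral radius, strictly less than 1
```

Under the stabilizability condition stated above, the iteration converges and the resulting closed loop $A + B G(\theta)$ is stable.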

As in [1], we are interested in the setting where the system parameters are unknown. We denote the unknown parameters by a random variable $\theta_1$ and assume that there is a prior distribution $\mu_1$ on $\theta_1$. The Bayesian regret of a policy $\pi$ operating for horizon $T$ is defined by

$$R(T; \pi) = \mathbb{E}^\pi\Big[\sum_{t=1}^T c(x_t, u_t) - T J(\theta_1)\Big], \tag{7}$$

where the expectation is with respect to the prior on $\theta_1$, the noise processes, the initial conditions, and the potential randomizations done by the policy $\pi$.

## III Thompson sampling based learning algorithm

As in [1], we assume that the unknown model parameters lie in a compact subset $\Omega_1$ of $\mathbb{R}^{(n+m) \times n}$. We assume that there is a prior $\mu_1$ on $\Omega_1$, which satisfies the following.

###### Assumption

There exist $\hat\theta_1(i) \in \mathbb{R}^{n+m}$ for $i \in \{1, \dots, n\}$ and a positive definite matrix $\Sigma_1$ such that for any $\theta \in \Omega_1$, $\mu_1(\theta) \propto \bar\mu_1(\theta)\big|_{\Omega_1}$, where

$$\bar\mu_1(\theta) = \prod_{i=1}^n \bar\mu_1(\theta(i)) \quad\text{and}\quad \bar\mu_1(\theta(i)) = \mathcal{N}\big(\hat\theta_1(i), \Sigma_1\big).$$

We maintain a posterior distribution $\mu_t$ on $\Omega_1$ based on the history of the observations until time $t$. From standard results in linear Gaussian regression [5], we know that the posterior is a truncated Gaussian distribution

$$\mu_t(\theta) = \Big[\prod_{i=1}^n \bar\mu_t(\theta(i))\Big]\bigg|_{\Omega_1}$$

where $\bar\mu_t(\theta(i)) = \mathcal{N}(\hat\theta_t(i), \Sigma_t)$, and $\hat\theta_t(i)$ and $\Sigma_t$ can be updated recursively as follows:

$$\hat\theta_{t+1}(i) = \hat\theta_t(i) + \frac{\Sigma_t z_t \big(x_{t+1}(i) - \hat\theta_t(i)^\top z_t\big)}{\sigma_w^2 + z_t^\top \Sigma_t z_t}, \tag{8}$$

$$\Sigma_{t+1}^{-1} = \Sigma_t^{-1} + \frac{1}{\sigma_w^2}\, z_t z_t^\top, \tag{9}$$

where $z_t = [x_t^\top, u_t^\top]^\top$.
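As a sanity check on the recursions (8)–(9), the sketch below runs them on simulated data and verifies that the recursive mean matches the batch posterior mean of Bayesian linear regression, to which (8)–(9) are algebraically equivalent. The dimensions, noise level, and seed are arbitrary; a scalar-output regression stands in for one coordinate $x_{t+1}(i)$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma_w = 3, 0.5           # d plays the role of n + m
theta_true = rng.normal(size=d)

theta_hat = np.zeros(d)        # prior mean
Sigma = np.eye(d)              # prior covariance Sigma_1
Sigma_inv = np.eye(d)

Z, X = [], []
for t in range(50):
    z = rng.normal(size=d)                     # stand-in for z_t = [x_t; u_t]
    x_next = theta_true @ z + sigma_w * rng.normal()
    # (8): rank-one update of the posterior mean
    gain = Sigma @ z / (sigma_w**2 + z @ Sigma @ z)
    theta_hat = theta_hat + gain * (x_next - theta_hat @ z)
    # (9): update of the information matrix
    Sigma_inv = Sigma_inv + np.outer(z, z) / sigma_w**2
    Sigma = np.linalg.inv(Sigma_inv)
    Z.append(z); X.append(x_next)

# the recursion agrees with the batch posterior mean (zero prior mean)
Z, X = np.array(Z), np.array(X)
theta_batch = Sigma @ (Z.T @ X) / sigma_w**2
print(np.allclose(theta_hat, theta_batch))
```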

### III-A Thompson sampling with dynamic episodes algorithm

We now present a minor variation of the Thompson sampling with dynamic episodes (TSDE) algorithm of [1]. As the name suggests, the algorithm operates in episodes of dynamic length. The key difference from [1] is that we enforce that each episode is of a minimum length $T_{\min}$. The choice of $T_{\min}$ will be explained later.

Let $t_k$ and $T_k$ denote the start time and the length of episode $k$, respectively. Episode $k$ has a minimum length of $T_{\min}$ and ends when the length of the episode is strictly larger than the length of the previous episode (i.e., $t - t_k > T_{k-1}$) or at the first time after $t_k + T_{\min}$ when the determinant of the covariance $\Sigma_t$ falls below half of its value at time $t_k$ (i.e., $\det \Sigma_t < \tfrac{1}{2}\det \Sigma_{t_k}$). Thus,

$$t_{k+1} = \min\big\{ t > t_k + T_{\min} \,:\, t - t_k > T_{k-1} \ \text{or}\ \det \Sigma_t < \tfrac{1}{2}\det \Sigma_{t_k} \big\}. \tag{10}$$

Note that the stopping condition (10) implies that

$$T_{\min} + 1 \le T_k \le T_{k-1} + 1, \quad \forall k. \tag{11}$$

If we select $T_{\min} = 0$ in the above algorithm, we recover the stopping condition of [1].

The TSDE algorithm works as follows. At the beginning of episode $k$, a parameter $\bar\theta_k$ is sampled from the posterior distribution $\mu_{t_k}$. During the episode, the control inputs are generated using the sampled parameters $\bar\theta_k$, i.e.,

$$u_t = G(\bar\theta_k)\, x_t, \quad t_k \le t < t_{k+1}. \tag{12}$$

The complete algorithm is presented in Algorithm 1.
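A minimal sketch of the dynamic-episode schedule in (10), assuming a precomputed sequence standing in for $\det \Sigma_t$; the helper name and constants are ours.

```python
def episode_starts(dets, T, Tmin):
    """Episode start times per (10): episode k ends at the first t > t_k + Tmin
    with t - t_k > T_{k-1}, or with det(Sigma_t) below half its value at t_k."""
    starts, t_k, T_prev = [1], 1, Tmin  # take T_0 = Tmin so (11) holds for k = 1
    for t in range(2, T + 1):
        if t - t_k > Tmin and (t - t_k > T_prev or dets[t] < 0.5 * dets[t_k]):
            T_prev, t_k = t - t_k, t
            starts.append(t)
    return starts

# constant determinants: only the first stopping condition ever fires,
# so episode lengths start at Tmin + 1 and grow by one, consistent with (11)
starts = episode_starts([1.0] * 101, 100, Tmin=5)
lengths = [b - a for a, b in zip(starts, starts[1:])]
print(lengths)
```

With rapidly shrinking determinants, the second condition fires as soon as it is allowed, so every episode has the minimum length $T_{\min} + 1$; this is exactly the behavior the minimum-length modification enforces.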

### III-B A technical assumption and the choice of minimum episode length

We make the following assumption on the support of the prior distribution.

###### Assumption

There exists a positive number $\delta < 1$ such that for any $\theta, \phi \in \Omega_1$, where $\theta^\top = [A_\theta, B_\theta]$,

$$\rho\big(A_\theta + B_\theta G(\phi)\big) \le \delta.$$

Assumption III-B is a weaker form of the following assumption imposed in [1] (since $\rho(M) \le \|M\|$ for any matrix $M$).

###### Assumption

There exists a positive number $\delta < 1$ such that for any $\theta, \phi \in \Omega_1$, where $\theta^\top = [A_\theta, B_\theta]$,

$$\big\|A_\theta + B_\theta G(\phi)\big\| \le \delta.$$

Note that it is much easier to satisfy the spectral radius condition than the induced norm condition. For example, consider a family of upper-triangular matrices $M_a = \begin{bmatrix} \delta & a \\ 0 & \delta \end{bmatrix}$, where $\delta < 1$ and $a > 1$. For each $a$, the spectral radius of $M_a$ is $\delta$ while its induced norm is at least $a$. Thus, each $M_a$ satisfies the spectral radius bound of Assumption III-B but not the induced norm bound of [1].
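The gap between spectral radius and induced norm is easy to verify numerically; the matrix below is an illustrative upper-triangular instance with $\delta = 0.9$ and off-diagonal entry $a = 10$ (the values are ours).

```python
import numpy as np

delta, a = 0.9, 10.0
M = np.array([[delta, a], [0.0, delta]])

rho = max(abs(np.linalg.eigvals(M)))        # spectral radius: exactly delta
nrm = np.linalg.norm(M, 2)                  # induced 2-norm: larger than a
tail = np.linalg.norm(np.linalg.matrix_power(M, 200), 2)  # powers still decay
print(rho, nrm, tail)
```

Even though $\|M\| > 1$, the powers $\|M^t\|$ eventually decay geometrically because $\rho(M) < 1$; this is precisely the behavior Lemma 1 below quantifies.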

###### Lemma 1

Assumption III-B implies that for any $\varepsilon > 0$, there exists an $\alpha \ge 1$ such that for any $\theta, \phi \in \Omega_1$ with $\theta^\top = [A_\theta, B_\theta]$ and for any integer $t \ge 1$,

$$\big\|\big(A_\theta + B_\theta G(\phi)\big)^t\big\| \le \alpha (\varepsilon + \delta)^t.$$

###### Proof

Let $\mathcal{M} = \{A_\theta + B_\theta G(\phi) : \theta, \phi \in \Omega_1\}$. Since $\Omega_1$ is compact, so is $\mathcal{M}$. Now for any $M \in \mathcal{M}$, there exists a norm (call it $\|\cdot\|_M$) such that $\|M\|_M \le \rho(M) + \tfrac{\varepsilon}{2} \le \delta + \tfrac{\varepsilon}{2}$.

Since norms are continuous, there is an open ball centered at $M$ (let's call this $\mathcal{B}_M$) such that for any $M' \in \mathcal{B}_M$, we have $\|M'\|_M \le \delta + \varepsilon$. Consider the collection of open balls $\{\mathcal{B}_M\}_{M \in \mathcal{M}}$. This is an open cover of the compact set $\mathcal{M}$. So, there is a finite sub-cover. Let's denote this sub-cover by $\{\mathcal{B}_{M_i}\}_{i=1}^\ell$. By equivalence of norms, for each $i$ there is a finite constant $\alpha_i$ such that for any matrix $N$, $\|N\| \le \alpha_i \|N\|_{M_i}$. Let $\alpha = \max_i \alpha_i$.

Now consider an arbitrary $M \in \mathcal{M}$. It belongs to $\mathcal{B}_{M_i}$ for some $i$. Therefore, $\|M\|_{M_i} \le \delta + \varepsilon$. Hence, for any integer $t \ge 1$, the above inequalities and the submultiplicativity of norms give that $\|M^t\| \le \alpha_i \|M^t\|_{M_i} \le \alpha_i \|M\|_{M_i}^t \le \alpha (\varepsilon + \delta)^t$.

A key implication of Lemma 1 is the following, which plays a critical role in analyzing the regret of TSDE.

###### Lemma 2

Fix an $\varepsilon > 0$ such that $\varepsilon + \delta < 1$ and let $\alpha$ be as given by Lemma 1. Define $T_{\min}$ as the smallest non-negative integer such that

$$\alpha (\varepsilon + \delta)^{T_{\min}} \le 1. \tag{13}$$

Then, for any $\theta, \phi \in \Omega_1$ with $\theta^\top = [A_\theta, B_\theta]$ and any integer $\tau \ge T_{\min}$, we have

$$\big\|\big(A_\theta + B_\theta G(\phi)\big)^\tau\big\| \le 1. \tag{14}$$

###### Proof

The proof follows immediately from the choice of $T_{\min}$ in (13), Lemma 1, and (11).
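A small numerical check of Lemma 2 and the definition (13), using illustrative values of $\alpha$, $\delta$, and $\varepsilon$ (these constants are ours, not from the paper). The closed form below is the ceiling expression equivalent to choosing the smallest integer satisfying (13).

```python
import math

# illustrative constants: alpha from Lemma 1, contraction rate delta + eps < 1
alpha, delta, eps = 25.0, 0.9, 0.05
rate = delta + eps
# smallest integer Tmin with alpha * rate**Tmin <= 1, as in (13)
Tmin = math.ceil(math.log(alpha) / math.log(1.0 / rate))

assert alpha * rate ** (Tmin - 1) > 1.0   # Tmin is indeed the smallest such integer
assert all(alpha * rate ** tau <= 1.0 for tau in range(Tmin, Tmin + 200))
print(Tmin)
```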

### III-C Regret bounds

The following result provides an upper bound on the regret of the proposed algorithm.

###### Theorem 1

Under Assumptions III and III-B and with $T_{\min}$ given by (13), the regret of TSDE is upper bounded by

$$R(T; \mathrm{TSDE}) \le \tilde{\mathcal{O}}\big(\sigma_w^2 (n+m) \sqrt{nT}\big). \tag{15}$$

The proof is presented in the next section.

## IV Regret analysis

For the ease of notation, we use $R(T)$ instead of $R(T; \mathrm{TSDE})$ in this section. Following the exact same steps as [1], we can show that

$$R(T) = R_0(T) + R_1(T) + R_2(T) \tag{16}$$

where

$$R_0(T) = \mathbb{E}\Big[\sum_{k=1}^{K_T} T_k J(\bar\theta_k)\Big] - T\, \mathbb{E}[J(\theta_1)], \tag{17}$$

$$R_1(T) = \mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big[x_t^\top S(\bar\theta_k) x_t - x_{t+1}^\top S(\bar\theta_k) x_{t+1}\big]\Big], \tag{18}$$

$$R_2(T) = \mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big[(\theta_1^\top z_t)^\top S(\bar\theta_k)\, \theta_1^\top z_t - (\bar\theta_k^\top z_t)^\top S(\bar\theta_k)\, \bar\theta_k^\top z_t\big]\Big]. \tag{19}$$

We establish the bound on $R(T)$ by individually bounding $R_0(T)$, $R_1(T)$, and $R_2(T)$.

###### Lemma 3

The terms in (16) are bounded as follows:

1. $R_0(T) \le \tilde{\mathcal{O}}\big(\sigma_w^2 \sqrt{(n+m)T}\big)$.

2. $R_1(T) \le \tilde{\mathcal{O}}\big(\sigma_w^2 \sqrt{(n+m)T}\big)$.

3. $R_2(T) \le \tilde{\mathcal{O}}\big(\sigma_w^2 (n+m) \sqrt{nT}\big)$.

Before presenting the proof of this lemma, we establish some preliminary results.

### IV-A Preliminary results

Let $X_T = \max_{1 \le t \le T} \|x_t\|$ denote the maximum of the norm of the state and $K_T$ denote the number of episodes until horizon $T$.

###### Lemma 4

For any $q \ge 1$ and any $T$,

$$\mathbb{E}[X_T^q] \le \mathcal{O}(\sigma_w^q \log T).$$

See Appendix A for proof.

###### Lemma 5

For any $q \ge 1$, we have

$$\mathbb{E}\big[X_T^q \log X_T^2\big] \le \sigma_w^q\, \tilde{\mathcal{O}}(1).$$

See Appendix B for proof.

###### Lemma 6

The number of episodes $K_T$ is bounded by

$$K_T \le \mathcal{O}\Big(\sqrt{(n+m)\, T \log(T X_T^2)}\Big).$$

See Appendix C for proof.

###### Remark 2

The statements of Lemmas 4 and 6 are the same as those of the corresponding lemmas in [1]. The proof of Lemma 4 in [1] relied on the induced norm assumption. Since we impose a weaker assumption, our proof is more involved. The proof of Lemma 6 is similar to the proof of [1, Lemma 3]. However, since our TSDE algorithm is different from that in [1], some of the details of the proof are different.

### IV-B Proof of Lemma 3

We now prove each part of Lemma 3 separately.

#### IV-B1 Proof of the bound on $R_0(T)$

Following exactly the same argument as the proof of [1, Lemma 5], we can show that

$$R_0(T) \le \mathcal{O}\big(\sigma_w^2\, \mathbb{E}[K_T]\big). \tag{20}$$

Substituting the result of Lemma 6, we get

$$R_0(T) \le \mathcal{O}\Big(\sigma_w^2\, \mathbb{E}\Big[\sqrt{(n+m)\, T \log(T X_T^2)}\Big]\Big) \overset{(a)}{\le} \mathcal{O}\Big(\sigma_w^2 \sqrt{(n+m)\, T \log\big(T\, \mathbb{E}[X_T^2]\big)}\Big) \overset{(b)}{\le} \tilde{\mathcal{O}}\big(\sigma_w^2 \sqrt{(n+m)T}\big)$$

where $(a)$ follows from Jensen's inequality and $(b)$ follows from Lemma 4.

#### IV-B2 Proof of the bound on $R_1(T)$

Following exactly the same argument as in the proof of [1, Lemma 6], we can show that

$$R_1(T) \le \mathcal{O}\big(\mathbb{E}[K_T X_T^2]\big). \tag{21}$$

Substituting the result of Lemma 6, we get

$$R_1(T) \le \mathcal{O}\Big(\sqrt{(n+m)T}\; \mathbb{E}\Big[X_T^2 \sqrt{\log(T X_T^2)}\Big]\Big). \tag{22}$$

Now, consider the term

$$\mathbb{E}\Big[X_T^2 \sqrt{\log(T X_T^2)}\Big] \overset{(a)}{\le} \sqrt{\mathbb{E}[X_T^4]\; \mathbb{E}\big[\log(T X_T^2)\big]} \overset{(b)}{\le} \sqrt{\mathbb{E}[X_T^4]\, \log\big(T\, \mathbb{E}[X_T^2]\big)} \overset{(c)}{\le} \tilde{\mathcal{O}}(\sigma_w^2) \tag{23}$$

where $(a)$ follows from the Cauchy-Schwarz inequality, $(b)$ follows from Jensen's inequality, and $(c)$ follows from Lemma 4.

Substituting (23) in (22), we get the bound on $R_1(T)$.

#### IV-B3 Proof of the bound on $R_2(T)$

As in [1], we can bound the inner summand in $R_2(T)$ as

$$\big\|S(\bar\theta_k)^{0.5}\, \theta_1^\top z_t\big\|^2 - \big\|S(\bar\theta_k)^{0.5}\, \bar\theta_k^\top z_t\big\|^2 \le \mathcal{O}\big(X_T \big\|(\theta_1 - \bar\theta_k)^\top z_t\big\|\big).$$

Therefore,

$$R_2(T) \le \mathcal{O}\Big(\mathbb{E}\Big[X_T \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big\|(\theta_1 - \bar\theta_k)^\top z_t\big\|\Big]\Big),$$

which is the same as [1, Eq. (45)]. Now, by simplifying the term inside $\mathcal{O}(\cdot)$ using the Cauchy-Schwarz inequality, we get

$$\mathbb{E}\Big[X_T \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big\|(\theta_1 - \bar\theta_k)^\top z_t\big\|\Big] \le \sqrt{\mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big\|\Sigma_{t_k}^{-0.5} (\theta_1 - \bar\theta_k)\big\|^2\Big]} \times \sqrt{\mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} X_T^2\, \big\|\Sigma_{t_k}^{0.5} z_t\big\|^2\Big]} \tag{24}$$

Note that (24) is slightly different from the simplification of [1, Eq. (45)] using the Cauchy-Schwarz inequality presented in [1, Eq. (46)], which used $\Sigma_t$ in each term on the right hand side instead of $\Sigma_{t_k}$.

We bound each term of (24) separately as follows.

###### Lemma 7

We have the following inequality

$$\mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big\|\Sigma_{t_k}^{-0.5} (\theta_1 - \bar\theta_k)\big\|^2\Big] \le \mathcal{O}\big(n(n+m)(T + \mathbb{E}[K_T])\big) \le \mathcal{O}\big(n(n+m)\, T\big).$$

See Appendix D for a proof.

###### Lemma 8

We have the following inequality

$$\mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} X_T^2\, \big\|\Sigma_{t_k}^{0.5} z_t\big\|^2\Big] \le \tilde{\mathcal{O}}\big((n+m)\, \sigma_w^4\big).$$

See Appendix E for a proof.

We get the bound on $R_2(T)$ by substituting the results of Lemmas 7 and 8 in (24).

## V Discussion and Conclusion

In this paper, we present a minor variation of the TSDE algorithm of [1] and show that its Bayesian regret up to time $T$ is bounded by $\tilde{\mathcal{O}}(\sqrt{T})$ under a milder technical assumption than [1]. The result in [1] was derived under the assumption that there exists a $\delta < 1$ such that for any $\theta, \phi \in \Omega_1$, $\|A_\theta + B_\theta G(\phi)\| \le \delta$. We require that $\rho(A_\theta + B_\theta G(\phi)) \le \delta$. Our assumption on the spectral radius of the closed loop system is milder and, in some sense, more natural than the assumption on the induced norm of the closed loop system.

The key technical result in [1] as well as our paper is Lemma 4, which shows that for any $q \ge 1$, $\mathbb{E}[X_T^q] \le \mathcal{O}(\sigma_w^q \log T)$. The proof argument in both [1] as well as our paper is to show that there is some constant $\alpha_0$ such that $X_T \le \alpha_0 W_T$, where $W_T = \max_{1 \le t \le T} \|w_t\|$. Under the stronger assumption in [1], one can show that for all $t$, $\|x_{t+1}\| \le \delta \|x_t\| + \|w_t\|$, which directly implies that $X_T \le W_T / (1 - \delta)$. Under the weaker assumption in this paper, the argument is more subtle. The basic intuition is that in each episode, the system is asymptotically stable and, being a linear system, also exponentially stable (in the sense of Lemma 1). So, if the episode length is sufficiently long, then we can ensure that $\|x_{t_{k+1}}\| \le \beta \|x_{t_k}\| + \bar\alpha W_T$, where $\beta < 1$ and $\bar\alpha$ is a constant. This is sufficient to ensure that $X_T \le \alpha_0 W_T$ for an appropriately defined $\alpha_0$.

The fact that each episode must be of length at least $T_{\min} + 1$ implies that the second triggering condition is not checked for the first $T_{\min}$ steps in an episode. Therefore, in this interval, the determinant of the covariance $\Sigma_t$ can be smaller than half of its value at the beginning of the episode. Consequently, we cannot use the same proof argument as [1] to bound $R_2(T)$ because that proof relied on the fact that for any $t \in [t_k, t_{k+1})$, $\det \Sigma_t \ge \tfrac{1}{2} \det \Sigma_{t_k}$. So, we provide a variation of that proof argument, where we use a coarser bound on $\|\Sigma_{t_k}^{0.5} z_t\|$ given by Lemma 10.

We conclude by observing that the milder technical assumption imposed in this paper may not be necessary. Numerical experiments indicate that the regret of the TSDE algorithm shows $\tilde{\mathcal{O}}(\sqrt{T})$ behavior even when the uncertainty set $\Omega_1$ does not satisfy Assumption III-B (as was also reported in [1]). This suggests that it might be possible to further relax Assumption III-B and still establish an $\tilde{\mathcal{O}}(\sqrt{T})$ regret bound.

## Appendix A Proof of Lemma 4

For the ease of notation, let $\bar\delta = \varepsilon + \delta$, $Y_k = \|x_{t_k}\|$, and $W_T = \max_{1 \le t \le T} \|w_t\|$. In addition, define $H_k = A + B G(\bar\theta_k)$, where $A$ and $B$ are the true parameters.

From the system dynamics under the TSDE algorithm, we know that for any time $t$ in episode $k$ (i.e., $t_k \le t < t_{k+1}$), we have

$$x_t = H_k^{t - t_k} x_{t_k} + \sum_{j=t_k}^{t-1} H_k^{t-1-j} w_j.$$

Thus, from the triangle inequality and Lemma 1, we get

$$\|x_t\| \le \alpha \bar\delta^{\,t - t_k} Y_k + \Big[\sum_{j=t_k}^{t-1} \alpha \bar\delta^{\,t-1-j}\Big] W_T \le \alpha \bar\delta^{\,t - t_k} Y_k + \underbrace{\Big[\frac{\alpha}{1 - \bar\delta}\Big]}_{\eqqcolon\, \bar\alpha} W_T. \tag{25}$$

Now at time $t_{k+1}$, we have

$$Y_{k+1} = \|x_{t_{k+1}}\| \le \alpha \bar\delta^{\,T_k} Y_k + \bar\alpha W_T \le \beta Y_k + \bar\alpha W_T \tag{26}$$

where the second inequality follows from (11), which implies that $T_k \ge T_{\min} + 1$ and hence, by the choice of $T_{\min}$ in (13), $\alpha \bar\delta^{\,T_k} \le \bar\delta \eqqcolon \beta < 1$. Recursively expanding (26), we get

$$Y_k \le \bar\alpha W_T + \beta \bar\alpha W_T + \cdots + \beta^{k-2} \bar\alpha W_T \le \frac{\bar\alpha}{1 - \beta}\, W_T \eqqcolon \bar\beta\, W_T. \tag{27}$$

Substituting (27) in (25), we get that for any $t$ in episode $k$, we have

$$\|x_t\| \le \alpha \bar\delta^{\,t - t_k} \bar\beta\, W_T + \bar\alpha W_T \le \big[\alpha \bar\beta + \bar\alpha\big] W_T \eqqcolon \alpha_0 W_T$$

where in the last inequality, we have used the fact that $\bar\delta < 1$. Thus, for any episode $k$, we have

$$\bar{X}_k \coloneqq \max_{t_k \le t < t_{k+1}} \|x_t\| \le \alpha_0 W_T.$$

Hence,

$$X_T \le \max\{\bar{X}_1, \dots, \bar{X}_{K_T}\} \le \alpha_0 W_T.$$

Therefore, for any $q \ge 1$, we have

$$\mathbb{E}[X_T^q] \le \alpha_0^q\, \mathbb{E}[W_T^q] = \alpha_0^q\, \mathbb{E}\Big[\max_{1 \le t \le T} \|w_t\|^q\Big]. \tag{28}$$

From [1, Eq. (39)], we have that

$$\mathbb{E}\Big[\max_{1 \le t \le T} \|w_t\|^q\Big] \le \sigma_w^q\, \mathcal{O}(\log T).$$

Substituting this in (28), we obtain the result of the lemma.
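The geometric recursion (26)–(27) that drives this proof can be checked numerically: iterating the worst case of (26) never exceeds the limit $\bar\alpha W_T / (1 - \beta)$ from (27). The constants below are illustrative, not from the paper.

```python
# Worst case of recursion (26): Y_{k+1} = beta * Y_k + abar * W, with Y_1 = 0.
# (27) says the iterates stay below abar * W / (1 - beta).
beta, abar, W = 0.8, 2.0, 1.5
bound = abar * W / (1.0 - beta)

Y = 0.0
for _ in range(100):
    Y = beta * Y + abar * W
    assert Y <= bound     # every iterate respects the geometric-series bound (27)
print(Y, bound)
```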

## Appendix B Proof of Lemma 5

Since $\log$ is an increasing function, $\log X_T^2 \le \log \max(e, X_T^2)$. Therefore,

$$\mathbb{E}\big[X_T^q \log X_T^2\big] \le \mathbb{E}\big[X_T^q \log \max(e, X_T^2)\big] \le \sqrt{\mathbb{E}\big[X_T^{2q}\big]\; \mathbb{E}\big[(\log \max(e, X_T^2))^2\big]} \tag{29}$$

where the last inequality follows from the Cauchy-Schwarz inequality. Since $(\log x)^2$ is concave for $x \ge e$, we can use Jensen's inequality to write

$$\mathbb{E}\big[(\log \max(e, X_T^2))^2\big] \le \big(\log \mathbb{E}[\max(e, X_T^2)]\big)^2 \le \big(\log(e + \mathbb{E}[X_T^2])\big)^2 \overset{(a)}{\le} \big(\log(e + \mathcal{O}(\sigma_w^2 \log T))\big)^2 \le \tilde{\mathcal{O}}(1) \tag{30}$$

where $(a)$ uses Lemma 4. Substituting (30) in (29) and using Lemma 4 for bounding $\mathbb{E}[X_T^{2q}]$, we get

$$\mathbb{E}\big[X_T^q \log X_T^2\big] \le \sqrt{\mathbb{E}\big[X_T^{2q}\big]\; \mathbb{E}\big[(\log \max(e, X_T^2))^2\big]} \le \sigma_w^q\, \tilde{\mathcal{O}}(1).$$

## Appendix C Proof of Lemma 6

The high-level idea of the proof is the same as that of [1, Lemma 3]. Define macro episodes with start times $t_{n_i}$, $i \in \{1, 2, \dots\}$, where $n_1 = 1$ and for $i \ge 1$,

$$n_{i+1} = \min\Big\{ k > n_i \,:\, \det \Sigma_{t_k} < \tfrac{1}{2} \det \Sigma_{t_{k-1}} \Big\}.$$

Thus, a new macro-episode starts whenever an episode ends due to the second stopping criterion. Let $M$ denote the number of macro-episodes until time $T$ and define $n_{M+1} = K_T + 1$. Let $\bar{T}_i = \sum_{k=n_i}^{n_{i+1}-1} T_k$ denote the length of the $i$-th macro-episode. Within a macro-episode, all but the last episode must be triggered by the first stopping criterion. Thus, for $n_i \le k < n_{i+1} - 1$,

$$T_k = \max\{T_{k-1} + 1,\, T_{\min} + 1\} = T_{k-1} + 1,$$

where the last equality follows from (11). Hence, by following exactly the same argument as [1], we have

$$n_{i+1} - n_i \le \sqrt{2 \bar{T}_i}$$

and therefore following [1, Eq. (40)], we have

$$K_T \le \sqrt{2 M T} \tag{31}$$

which is the same as [1, Eq. (41)].

Now, observe that

$$\det \Sigma_T^{-1} \overset{(a)}{\ge} \det \Sigma_{t_{n_M}}^{-1} \overset{(b)}{\ge} 2 \det \Sigma_{t_{n_{M-1}}}^{-1} \ge \cdots \ge 2^{M-1} \det \Sigma_1^{-1}, \tag{32}$$

where $(a)$ follows because $\det \Sigma_t^{-1}$ is a non-decreasing sequence (because $\Sigma_{t+1}^{-1} \succeq \Sigma_t^{-1}$ by (9)) and $t_{n_M} \le T$, and the subsequent inequalities follow from the definition of the macro episode and the second triggering condition.

Then following the same idea as the rest of the proof in [1], we get

$$M \le \mathcal{O}\big((n+m) \log(T X_T^2)\big). \tag{33}$$

Substituting (33) in (31), we obtain the result of the lemma.

## Appendix D Proof of Lemma 7

Observe that the summand is constant for each episode. Therefore,

$$\mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big\|\Sigma_{t_k}^{-0.5}(\theta_1 - \bar\theta_k)\big\|^2\Big] = \mathbb{E}\Big[\sum_{k=1}^{K_T} T_k \big\|\Sigma_{t_k}^{-0.5}(\theta_1 - \bar\theta_k)\big\|^2\Big] \overset{(a)}{\le} \mathbb{E}\Big[\sum_{k=1}^{K_T} (T_{k-1} + 1) \big\|\Sigma_{t_k}^{-0.5}(\theta_1 - \bar\theta_k)\big\|^2\Big] = \sum_{k=1}^{\infty} \mathbb{E}\Big[\mathbb{1}\{t_k \le T\}\, (T_{k-1} + 1) \big\|\Sigma_{t_k}^{-0.5}(\theta_1 - \bar\theta_k)\big\|^2\Big] = \sum_{k=1}^{\infty} \mathbb{E}\Big[\mathbb{E}\big[\mathbb{1}\{t_k \le T\}\, (T_{k-1} + 1) \big\|\Sigma_{t_k}^{-0.5}(\theta_1 - \bar\theta_k)\big\|^2 \,\big|\, h_{t_k}\big]\Big]$$