# A relaxed technical assumption for posterior sampling-based reinforcement learning for control of unknown linear systems

We revisit the Thompson sampling algorithm to control an unknown linear quadratic (LQ) system recently proposed by Ouyang et al. (arXiv:1709.04047). The regret bound of the algorithm was derived under a technical assumption on the induced norm of the closed loop system. In this technical note, we show that by making a minor modification in the algorithm (in particular, ensuring that an episode does not end too soon), this technical assumption on the induced norm can be replaced by a milder assumption in terms of the spectral radius of the closed loop system. The modified algorithm has the same Bayesian regret of $\tilde{\mathcal{O}}(\sqrt{T})$, where $T$ is the time horizon and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in $T$.


## I Introduction

In a recent paper, Ouyang et al. [1] presented a Thompson sampling (also called posterior sampling) algorithm to control a linear quadratic (LQ) system with unknown parameters. Their algorithm is called Thompson sampling with dynamic episodes (TSDE).¹ The main result of [1] is to show that the Bayesian regret of TSDE accumulated up to time $T$ is bounded by $\tilde{\mathcal{O}}(\sqrt{T})$, where the $\tilde{\mathcal{O}}(\cdot)$ notation hides constants and logarithmic factors. This result was derived under a technical assumption on the induced norm of the closed loop system. In this technical note, we present a minor variation of the TSDE algorithm and obtain a bound on the Bayesian regret by imposing a much milder technical assumption, which is in terms of the spectral radius of the closed loop system (rather than the induced norm, as was assumed in [1]).

¹ In [1], the algorithm is called posterior sampling reinforcement learning (PSRL-LQ). We use the term TSDE as it was used in [2], which was the conference version of [1], and is also used in other variations of the algorithm [3].

## II Model and problem formulation

We consider the same model as [1]. For the sake of completeness, we present the model below.

Consider a linear quadratic system with state $x_t \in \mathbb{R}^n$, control input $u_t \in \mathbb{R}^m$, and disturbance $w_t \in \mathbb{R}^n$. We assume that the system starts from an initial state $x_1$ and evolves over time according to

$$x_{t+1} = A x_t + B u_t + w_t, \quad t \ge 1, \tag{1}$$

where $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times m}$ are the system dynamics matrices. The noise $\{w_t\}_{t \ge 1}$ is an independent and identically distributed Gaussian process with $w_t \sim \mathcal{N}(0, \sigma_w^2 I)$.

###### Remark 1

In [1], it was assumed that $\sigma_w = 1$. Using a general $\sigma_w$ does not fundamentally change any of the results or the proof arguments.

At each time $t$, the system incurs a per-step cost given by

$$c(x_t, u_t) = x_t^\top Q x_t + u_t^\top R u_t, \tag{2}$$

where $Q$ and $R$ are positive definite matrices.

Let $\theta^\top = [A, B]$ denote the parameters of the system. Thus, $\theta \in \mathbb{R}^{(n+m) \times n}$. The performance of any policy $\pi$ is measured by the long-term average cost given by

$$J(\pi; \theta) = \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^\pi\Big[\sum_{t=1}^T c(x_t, u_t)\Big]. \tag{3}$$

Let $J(\theta)$ denote the minimum of $J(\pi; \theta)$ over all policies. It is well known [4] that if the pair $(A, B)$ is stabilizable, then $J(\theta)$ is given by

$$J(\theta) = \sigma_w^2 \operatorname{Tr}(S(\theta)),$$

where $S(\theta)$ is the unique positive semi-definite solution of the following Riccati equation:

$$S(\theta) = Q + A^\top S(\theta) A - A^\top S(\theta) B \big(R + B^\top S(\theta) B\big)^{-1} B^\top S(\theta) A. \tag{4}$$

Furthermore, the optimal control policy is given by

$$u_t = G(\theta)\, x_t, \tag{5}$$

where the gain matrix $G(\theta)$ is given by

$$G(\theta) = -\big(R + B^\top S(\theta) B\big)^{-1} B^\top S(\theta) A. \tag{6}$$
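As a quick numerical illustration of (4)–(6), the sketch below computes $S(\theta)$ by fixed-point (value) iteration on the Riccati equation and then forms the gain $G(\theta)$. The system matrices, iteration count, and function name are ours, chosen for illustration only.

```python
import numpy as np

def riccati_gain(A, B, Q, R, iters=2000):
    """Solve (4) by fixed-point iteration and return S and the gain (6).
    A naive sketch; a dedicated DARE solver is preferable in practice."""
    S = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
        S = Q + A.T @ S @ A - A.T @ S @ B @ K
    G = -np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
    return S, G

# illustrative system (not from the paper): marginally unstable but stabilizable
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
S, G = riccati_gain(A, B, Q, R)
rho = max(abs(np.linalg.eigvals(A + B @ G)))
print(rho)  # closed-loop spectral radius, strictly less than 1
```

Under the stabilizability condition stated above, the iteration converges and the resulting closed loop $A + B G(\theta)$ is stable.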

As in [1], we are interested in the setting where the system parameters are unknown. We denote the unknown parameters by a random variable $\theta_1$ and assume that there is a prior distribution $\mu_1$ on $\theta_1$. The Bayesian regret of a policy $\pi$ operating for horizon $T$ is defined by

$$R(T; \pi) = \mathbb{E}^\pi\Big[\sum_{t=1}^T c(x_t, u_t) - T J(\theta_1)\Big], \tag{7}$$

where the expectation is with respect to the prior on $\theta_1$, the noise processes, the initial conditions, and the potential randomizations done by the policy $\pi$.

## III Thompson sampling based learning algorithm

As in [1], we assume that the unknown model parameters lie in a compact subset $\Omega_1$ of $\mathbb{R}^{(n+m) \times n}$. We assume that there is a prior $\mu_1$ on $\Omega_1$, which satisfies the following.

###### Assumption

There exist $\hat\theta_1(i) \in \mathbb{R}^{n+m}$ for $i \in \{1, \dots, n\}$ and a positive definite matrix $\Sigma_1$ such that for any $\theta \in \Omega_1$, $\mu_1(\theta) \propto \bar\mu_1(\theta)\big|_{\Omega_1}$, where

$$\bar\mu_1(\theta) = \prod_{i=1}^n \bar\mu_1(\theta(i)) \quad\text{and}\quad \bar\mu_1(\theta(i)) = \mathcal{N}\big(\hat\theta_1(i), \Sigma_1\big).$$

We maintain a posterior distribution $\mu_t$ on $\Omega_1$ based on the history of the observations until time $t$. From standard results in linear Gaussian regression [5], we know that the posterior is a truncated Gaussian distribution

$$\mu_t(\theta) = \Big[\prod_{i=1}^n \bar\mu_t(\theta(i))\Big]\bigg|_{\Omega_1}$$

where $\bar\mu_t(\theta(i)) = \mathcal{N}(\hat\theta_t(i), \Sigma_t)$, and $\hat\theta_t(i)$ and $\Sigma_t$ can be updated recursively as follows:

$$\hat\theta_{t+1}(i) = \hat\theta_t(i) + \frac{\Sigma_t z_t \big(x_{t+1}(i) - \hat\theta_t(i)^\top z_t\big)}{\sigma_w^2 + z_t^\top \Sigma_t z_t}, \tag{8}$$

$$\Sigma_{t+1}^{-1} = \Sigma_t^{-1} + \frac{1}{\sigma_w^2}\, z_t z_t^\top, \tag{9}$$

where $z_t = [x_t^\top, u_t^\top]^\top$.
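As a sanity check on the recursions (8)–(9), the sketch below runs them on simulated data and verifies that the recursive mean matches the batch posterior mean of Bayesian linear regression, to which (8)–(9) are algebraically equivalent. The dimensions, noise level, and seed are arbitrary; a scalar-output regression stands in for one coordinate $x_{t+1}(i)$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma_w = 3, 0.5           # d plays the role of n + m
theta_true = rng.normal(size=d)

theta_hat = np.zeros(d)        # prior mean
Sigma = np.eye(d)              # prior covariance Sigma_1
Sigma_inv = np.eye(d)

Z, X = [], []
for t in range(50):
    z = rng.normal(size=d)                     # stand-in for z_t = [x_t; u_t]
    x_next = theta_true @ z + sigma_w * rng.normal()
    # (8): rank-one update of the posterior mean
    gain = Sigma @ z / (sigma_w**2 + z @ Sigma @ z)
    theta_hat = theta_hat + gain * (x_next - theta_hat @ z)
    # (9): update of the information matrix
    Sigma_inv = Sigma_inv + np.outer(z, z) / sigma_w**2
    Sigma = np.linalg.inv(Sigma_inv)
    Z.append(z); X.append(x_next)

# the recursion agrees with the batch posterior mean (zero prior mean)
Z, X = np.array(Z), np.array(X)
theta_batch = Sigma @ (Z.T @ X) / sigma_w**2
print(np.allclose(theta_hat, theta_batch))
```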

### III-A Thompson sampling with dynamic episodes algorithm

We now present a minor variation of the Thompson sampling with dynamic episodes (TSDE) algorithm of [1]. As the name suggests, the algorithm operates in episodes of dynamic length. The key difference from [1] is that we enforce that each episode is of a minimum length $T_{\min}$. The choice of $T_{\min}$ will be explained later.

Let $t_k$ and $T_k$ denote the start time and the length of episode $k$, respectively. Episode $k$ has a minimum length of $T_{\min}$ and ends when the length of the episode is strictly larger than the length of the previous episode (i.e., $t - t_k > T_{k-1}$) or at the first time after $t_k + T_{\min}$ when the determinant of the covariance $\Sigma_t$ falls below half of its value at time $t_k$ (i.e., $\det \Sigma_t < \tfrac{1}{2}\det \Sigma_{t_k}$). Thus,

$$t_{k+1} = \min\big\{ t > t_k + T_{\min} \,:\, t - t_k > T_{k-1} \ \text{or}\ \det \Sigma_t < \tfrac{1}{2}\det \Sigma_{t_k} \big\}. \tag{10}$$

Note that the stopping condition (10) implies that

$$T_{\min} + 1 \le T_k \le T_{k-1} + 1, \quad \forall k. \tag{11}$$

If we select $T_{\min} = 0$ in the above algorithm, we recover the stopping condition of [1].

The TSDE algorithm works as follows. At the beginning of episode $k$, a parameter $\bar\theta_k$ is sampled from the posterior distribution $\mu_{t_k}$. During the episode, the control inputs are generated using the sampled parameters $\bar\theta_k$, i.e.,

$$u_t = G(\bar\theta_k)\, x_t, \quad t_k \le t < t_{k+1}. \tag{12}$$

The complete algorithm is presented in Algorithm 1.
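A minimal sketch of the dynamic-episode schedule in (10), assuming a precomputed sequence standing in for $\det \Sigma_t$; the helper name and constants are ours.

```python
def episode_starts(dets, T, Tmin):
    """Episode start times per (10): episode k ends at the first t > t_k + Tmin
    with t - t_k > T_{k-1}, or with det(Sigma_t) below half its value at t_k."""
    starts, t_k, T_prev = [1], 1, Tmin  # take T_0 = Tmin so (11) holds for k = 1
    for t in range(2, T + 1):
        if t - t_k > Tmin and (t - t_k > T_prev or dets[t] < 0.5 * dets[t_k]):
            T_prev, t_k = t - t_k, t
            starts.append(t)
    return starts

# constant determinants: only the first stopping condition ever fires,
# so episode lengths start at Tmin + 1 and grow by one, consistent with (11)
starts = episode_starts([1.0] * 101, 100, Tmin=5)
lengths = [b - a for a, b in zip(starts, starts[1:])]
print(lengths)
```

With rapidly shrinking determinants, the second condition fires as soon as it is allowed, so every episode has the minimum length $T_{\min} + 1$; this is exactly the behavior the minimum-length modification enforces.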

### III-B A technical assumption and the choice of minimum episode length

We make the following assumption on the support of the prior distribution.

###### Assumption

There exists a positive number $\delta < 1$ such that for any $\theta, \phi \in \Omega_1$, where $\theta^\top = [A_\theta, B_\theta]$,

$$\rho\big(A_\theta + B_\theta G(\phi)\big) \le \delta.$$

Assumption III-B is a weaker form of the following assumption imposed in [1] (since $\rho(M) \le \|M\|$ for any matrix $M$).

###### Assumption

There exists a positive number $\delta < 1$ such that for any $\theta, \phi \in \Omega_1$, where $\theta^\top = [A_\theta, B_\theta]$,

$$\big\|A_\theta + B_\theta G(\phi)\big\| \le \delta.$$

Note that it is much easier to satisfy the spectral radius condition than the induced norm condition. For example, consider a family of upper-triangular matrices $M_a = \begin{bmatrix} \delta & a \\ 0 & \delta \end{bmatrix}$, where $\delta < 1$ and $a > 1$. For each $a$, the spectral radius of $M_a$ is $\delta$ while its induced norm is at least $a$. Thus, each $M_a$ satisfies the spectral radius bound of Assumption III-B but not the induced norm bound of [1].
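The gap between spectral radius and induced norm is easy to verify numerically; the matrix below is an illustrative upper-triangular instance with $\delta = 0.9$ and off-diagonal entry $a = 10$ (the values are ours).

```python
import numpy as np

delta, a = 0.9, 10.0
M = np.array([[delta, a], [0.0, delta]])

rho = max(abs(np.linalg.eigvals(M)))        # spectral radius: exactly delta
nrm = np.linalg.norm(M, 2)                  # induced 2-norm: larger than a
tail = np.linalg.norm(np.linalg.matrix_power(M, 200), 2)  # powers still decay
print(rho, nrm, tail)
```

Even though $\|M\| > 1$, the powers $\|M^t\|$ eventually decay geometrically because $\rho(M) < 1$; this is precisely the behavior Lemma 1 below quantifies.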

###### Lemma 1

Assumption III-B implies that for any $\varepsilon > 0$, there exists an $\alpha \ge 1$ such that for any $\theta, \phi \in \Omega_1$ with $\theta^\top = [A_\theta, B_\theta]$ and for any integer $t \ge 1$,

$$\big\|\big(A_\theta + B_\theta G(\phi)\big)^t\big\| \le \alpha (\varepsilon + \delta)^t.$$

###### Proof

Let $\mathcal{M} = \{A_\theta + B_\theta G(\phi) : \theta, \phi \in \Omega_1\}$. Since $\Omega_1$ is compact, so is $\mathcal{M}$. Now for any $M \in \mathcal{M}$, there exists a norm (call it $\|\cdot\|_M$) such that $\|M\|_M \le \rho(M) + \tfrac{\varepsilon}{2} \le \delta + \tfrac{\varepsilon}{2}$.

Since norms are continuous, there is an open ball centered at $M$ (let's call this $\mathcal{B}_M$) such that for any $M' \in \mathcal{B}_M$, we have $\|M'\|_M \le \delta + \varepsilon$. Consider the collection of open balls $\{\mathcal{B}_M\}_{M \in \mathcal{M}}$. This is an open cover of the compact set $\mathcal{M}$. So, there is a finite sub-cover. Let's denote this sub-cover by $\{\mathcal{B}_{M_i}\}_{i=1}^\ell$. By equivalence of norms, for each $i$ there is a finite constant $\alpha_i$ such that for any matrix $N$, $\|N\| \le \alpha_i \|N\|_{M_i}$. Let $\alpha = \max_i \alpha_i$.

Now consider an arbitrary $M \in \mathcal{M}$. It belongs to $\mathcal{B}_{M_i}$ for some $i$. Therefore, $\|M\|_{M_i} \le \delta + \varepsilon$. Hence, for any integer $t \ge 1$, the above inequalities and the submultiplicativity of norms give that $\|M^t\| \le \alpha_i \|M^t\|_{M_i} \le \alpha_i \|M\|_{M_i}^t \le \alpha (\varepsilon + \delta)^t$.

A key implication of Lemma 1 is the following, which plays a critical role in analyzing the regret of TSDE.

###### Lemma 2

Fix an $\varepsilon > 0$ such that $\varepsilon + \delta < 1$ and let $\alpha$ be as given by Lemma 1. Define $T_{\min}$ as the smallest non-negative integer such that

$$\alpha (\varepsilon + \delta)^{T_{\min}} \le 1. \tag{13}$$

Then, for any $\theta, \phi \in \Omega_1$ with $\theta^\top = [A_\theta, B_\theta]$ and any integer $\tau \ge T_{\min}$, we have

$$\big\|\big(A_\theta + B_\theta G(\phi)\big)^\tau\big\| \le 1. \tag{14}$$

###### Proof

The proof follows immediately from the choice of $T_{\min}$ in (13), Lemma 1, and (11).
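A small numerical check of Lemma 2 and the definition (13), using illustrative values of $\alpha$, $\delta$, and $\varepsilon$ (these constants are ours, not from the paper). The closed form below is the ceiling expression equivalent to choosing the smallest integer satisfying (13).

```python
import math

# illustrative constants: alpha from Lemma 1, contraction rate delta + eps < 1
alpha, delta, eps = 25.0, 0.9, 0.05
rate = delta + eps
# smallest integer Tmin with alpha * rate**Tmin <= 1, as in (13)
Tmin = math.ceil(math.log(alpha) / math.log(1.0 / rate))

assert alpha * rate ** (Tmin - 1) > 1.0   # Tmin is indeed the smallest such integer
assert all(alpha * rate ** tau <= 1.0 for tau in range(Tmin, Tmin + 200))
print(Tmin)
```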

### III-C Regret bounds

The following result provides an upper bound on the regret of the proposed algorithm.

###### Theorem 1

Under Assumptions III and III-B and with $T_{\min}$ given by (13), the regret of TSDE is upper bounded by

$$R(T; \mathrm{TSDE}) \le \tilde{\mathcal{O}}\big(\sigma_w^2 (n+m) \sqrt{nT}\big). \tag{15}$$

The proof is presented in the next section.

## IV Regret analysis

For the ease of notation, we use $R(T)$ instead of $R(T; \mathrm{TSDE})$ in this section. Following the exact same steps as [1], we can show that

$$R(T) = R_0(T) + R_1(T) + R_2(T) \tag{16}$$

where

$$R_0(T) = \mathbb{E}\Big[\sum_{k=1}^{K_T} T_k J(\bar\theta_k)\Big] - T\, \mathbb{E}[J(\theta_1)], \tag{17}$$

$$R_1(T) = \mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big[x_t^\top S(\bar\theta_k) x_t - x_{t+1}^\top S(\bar\theta_k) x_{t+1}\big]\Big], \tag{18}$$

$$R_2(T) = \mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big[(\theta_1^\top z_t)^\top S(\bar\theta_k)\, \theta_1^\top z_t - (\bar\theta_k^\top z_t)^\top S(\bar\theta_k)\, \bar\theta_k^\top z_t\big]\Big]. \tag{19}$$

We establish the bound on $R(T)$ by individually bounding $R_0(T)$, $R_1(T)$, and $R_2(T)$.

###### Lemma 3

The terms in (16) are bounded as follows:

1. $R_0(T) \le \tilde{\mathcal{O}}\big(\sigma_w^2 \sqrt{(n+m)T}\big)$.

2. $R_1(T) \le \tilde{\mathcal{O}}\big(\sigma_w^2 \sqrt{(n+m)T}\big)$.

3. $R_2(T) \le \tilde{\mathcal{O}}\big(\sigma_w^2 (n+m) \sqrt{nT}\big)$.

Before presenting the proof of this lemma, we establish some preliminary results.

### IV-A Preliminary results

Let $X_T = \max_{1 \le t \le T} \|x_t\|$ denote the maximum of the norm of the state and $K_T$ denote the number of episodes until horizon $T$.

###### Lemma 4

For any $q \ge 1$ and any $T$,

$$\mathbb{E}[X_T^q] \le \mathcal{O}(\sigma_w^q \log T).$$

See Appendix A for proof.

###### Lemma 5

For any $q \ge 1$, we have

$$\mathbb{E}\big[X_T^q \log X_T^2\big] \le \sigma_w^q\, \tilde{\mathcal{O}}(1).$$

See Appendix B for proof.

###### Lemma 6

The number of episodes $K_T$ is bounded by

$$K_T \le \mathcal{O}\Big(\sqrt{(n+m)\, T \log(T X_T^2)}\Big).$$

See Appendix C for proof.

###### Remark 2

The statements of Lemmas 4 and 6 are the same as those of the corresponding lemmas in [1]. The proof of Lemma 4 in [1] relied on the induced norm assumption. Since we impose a weaker assumption, our proof is more involved. The proof of Lemma 6 is similar to the proof of [1, Lemma 3]. However, since our TSDE algorithm is different from that in [1], some of the details of the proof are different.

### IV-B Proof of Lemma 3

We now prove each part of Lemma 3 separately.

#### IV-B1 Proof of the bound on $R_0(T)$

Following exactly the same argument as the proof of [1, Lemma 5], we can show that

$$R_0(T) \le \mathcal{O}\big(\sigma_w^2\, \mathbb{E}[K_T]\big). \tag{20}$$

Substituting the result of Lemma 6, we get

$$R_0(T) \le \mathcal{O}\Big(\sigma_w^2\, \mathbb{E}\Big[\sqrt{(n+m)\, T \log(T X_T^2)}\Big]\Big) \overset{(a)}{\le} \mathcal{O}\Big(\sigma_w^2 \sqrt{(n+m)\, T \log\big(T\, \mathbb{E}[X_T^2]\big)}\Big) \overset{(b)}{\le} \tilde{\mathcal{O}}\big(\sigma_w^2 \sqrt{(n+m)T}\big)$$

where $(a)$ follows from Jensen's inequality and $(b)$ follows from Lemma 4.

#### IV-B2 Proof of the bound on $R_1(T)$

Following exactly the same argument as in the proof of [1, Lemma 6], we can show that

$$R_1(T) \le \mathcal{O}\big(\mathbb{E}[K_T X_T^2]\big). \tag{21}$$

Substituting the result of Lemma 6, we get

$$R_1(T) \le \mathcal{O}\Big(\sqrt{(n+m)T}\; \mathbb{E}\Big[X_T^2 \sqrt{\log(T X_T^2)}\Big]\Big). \tag{22}$$

Now, consider the term

$$\mathbb{E}\Big[X_T^2 \sqrt{\log(T X_T^2)}\Big] \overset{(a)}{\le} \sqrt{\mathbb{E}[X_T^4]\; \mathbb{E}\big[\log(T X_T^2)\big]} \overset{(b)}{\le} \sqrt{\mathbb{E}[X_T^4]\, \log\big(T\, \mathbb{E}[X_T^2]\big)} \overset{(c)}{\le} \tilde{\mathcal{O}}(\sigma_w^2) \tag{23}$$

where $(a)$ follows from the Cauchy-Schwarz inequality, $(b)$ follows from Jensen's inequality, and $(c)$ follows from Lemma 4.

Substituting (23) in (22), we get the bound on $R_1(T)$.

#### IV-B3 Proof of the bound on $R_2(T)$

As in [1], we can bound the inner summand in $R_2(T)$ as

$$\big\|S(\bar\theta_k)^{0.5}\, \theta_1^\top z_t\big\|^2 - \big\|S(\bar\theta_k)^{0.5}\, \bar\theta_k^\top z_t\big\|^2 \le \mathcal{O}\big(X_T \big\|(\theta_1 - \bar\theta_k)^\top z_t\big\|\big).$$

Therefore,

$$R_2(T) \le \mathcal{O}\Big(\mathbb{E}\Big[X_T \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big\|(\theta_1 - \bar\theta_k)^\top z_t\big\|\Big]\Big),$$

which is the same as [1, Eq. (45)]. Now, by simplifying the term inside $\mathcal{O}(\cdot)$ using the Cauchy-Schwarz inequality, we get

$$\mathbb{E}\Big[X_T \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big\|(\theta_1 - \bar\theta_k)^\top z_t\big\|\Big] \le \sqrt{\mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big\|\Sigma_{t_k}^{-0.5} (\theta_1 - \bar\theta_k)\big\|^2\Big]} \times \sqrt{\mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} X_T^2\, \big\|\Sigma_{t_k}^{0.5} z_t\big\|^2\Big]} \tag{24}$$

Note that (24) is slightly different from the simplification of [1, Eq. (45)] using the Cauchy-Schwarz inequality presented in [1, Eq. (46)], which used $\Sigma_t$ in each term on the right hand side instead of $\Sigma_{t_k}$.

We bound each term of (24) separately as follows.

###### Lemma 7

We have the following inequality

$$\mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big\|\Sigma_{t_k}^{-0.5} (\theta_1 - \bar\theta_k)\big\|^2\Big] \le \mathcal{O}\big(n(n+m)(T + \mathbb{E}[K_T])\big) \le \mathcal{O}\big(n(n+m)\, T\big).$$

See Appendix D for a proof.

###### Lemma 8

We have the following inequality

$$\mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} X_T^2\, \big\|\Sigma_{t_k}^{0.5} z_t\big\|^2\Big] \le \tilde{\mathcal{O}}\big((n+m)\, \sigma_w^4\big).$$

See Appendix E for a proof.

We get the bound on $R_2(T)$ by substituting the results of Lemmas 7 and 8 in (24).

## V Discussion and Conclusion

In this paper, we present a minor variation of the TSDE algorithm of [1] and show that its Bayesian regret up to time $T$ is bounded by $\tilde{\mathcal{O}}(\sqrt{T})$ under a milder technical assumption than [1]. The result in [1] was derived under the assumption that there exists a $\delta < 1$ such that for any $\theta, \phi \in \Omega_1$, $\|A_\theta + B_\theta G(\phi)\| \le \delta$. We require that $\rho(A_\theta + B_\theta G(\phi)) \le \delta$. Our assumption on the spectral radius of the closed loop system is milder and, in some sense, more natural than the assumption on the induced norm of the closed loop system.

The key technical result in [1] as well as our paper is Lemma 4, which shows that for any $q \ge 1$, $\mathbb{E}[X_T^q] \le \mathcal{O}(\sigma_w^q \log T)$. The proof argument in both [1] as well as our paper is to show that there is some constant $\alpha_0$ such that $X_T \le \alpha_0 W_T$, where $W_T = \max_{1 \le t \le T} \|w_t\|$. Under the stronger assumption in [1], one can show that for all $t$, $\|x_{t+1}\| \le \delta \|x_t\| + \|w_t\|$, which directly implies that $X_T \le W_T / (1 - \delta)$. Under the weaker assumption in this paper, the argument is more subtle. The basic intuition is that in each episode, the system is asymptotically stable and, being a linear system, also exponentially stable (in the sense of Lemma 1). So, if the episode length is sufficiently long, then we can ensure that $\|x_{t_{k+1}}\| \le \beta \|x_{t_k}\| + \bar\alpha W_T$, where $\beta < 1$ and $\bar\alpha$ is a constant. This is sufficient to ensure that $X_T \le \alpha_0 W_T$ for an appropriately defined $\alpha_0$.

The fact that each episode must be of length at least $T_{\min} + 1$ implies that the second triggering condition is not checked for the first $T_{\min}$ steps in an episode. Therefore, in this interval, the determinant of the covariance $\Sigma_t$ can be smaller than half of its value at the beginning of the episode. Consequently, we cannot use the same proof argument as [1] to bound $R_2(T)$ because that proof relied on the fact that for any $t \in [t_k, t_{k+1})$, $\det \Sigma_t \ge \tfrac{1}{2} \det \Sigma_{t_k}$. So, we provide a variation of that proof argument, where we use a coarser bound on $\|\Sigma_{t_k}^{0.5} z_t\|$ given by Lemma 10.

We conclude by observing that the milder technical assumption imposed in this paper may not be necessary. Numerical experiments indicate that the regret of the TSDE algorithm shows $\tilde{\mathcal{O}}(\sqrt{T})$ behavior even when the uncertainty set $\Omega_1$ does not satisfy Assumption III-B (as was also reported in [1]). This suggests that it might be possible to further relax Assumption III-B and still establish an $\tilde{\mathcal{O}}(\sqrt{T})$ regret bound.

## Appendix A Proof of Lemma 4

For the ease of notation, let $\bar\delta = \varepsilon + \delta$, $Y_k = \|x_{t_k}\|$, and $W_T = \max_{1 \le t \le T} \|w_t\|$. In addition, define $H_k = A + B G(\bar\theta_k)$, where $A$ and $B$ are the true parameters.

From the system dynamics under the TSDE algorithm, we know that for any time $t$ in episode $k$ (i.e., $t_k \le t < t_{k+1}$), we have

$$x_t = H_k^{t - t_k} x_{t_k} + \sum_{j=t_k}^{t-1} H_k^{t-1-j} w_j.$$

Thus, from the triangle inequality and Lemma 1, we get

$$\|x_t\| \le \alpha \bar\delta^{\,t - t_k} Y_k + \Big[\sum_{j=t_k}^{t-1} \alpha \bar\delta^{\,t-1-j}\Big] W_T \le \alpha \bar\delta^{\,t - t_k} Y_k + \underbrace{\Big[\frac{\alpha}{1 - \bar\delta}\Big]}_{\eqqcolon\, \bar\alpha} W_T. \tag{25}$$

Now at time $t_{k+1}$, we have

$$Y_{k+1} = \|x_{t_{k+1}}\| \le \alpha \bar\delta^{\,T_k} Y_k + \bar\alpha W_T \le \beta Y_k + \bar\alpha W_T \tag{26}$$

where the second inequality follows from (11), which implies that $T_k \ge T_{\min} + 1$ and hence, by the choice of $T_{\min}$ in (13), $\alpha \bar\delta^{\,T_k} \le \bar\delta \eqqcolon \beta < 1$. Recursively expanding (26), we get

$$Y_k \le \bar\alpha W_T + \beta \bar\alpha W_T + \cdots + \beta^{k-2} \bar\alpha W_T \le \frac{\bar\alpha}{1 - \beta}\, W_T \eqqcolon \bar\beta\, W_T. \tag{27}$$

Substituting (27) in (25), we get that for any $t$ in episode $k$, we have

$$\|x_t\| \le \alpha \bar\delta^{\,t - t_k} \bar\beta\, W_T + \bar\alpha W_T \le \big[\alpha \bar\beta + \bar\alpha\big] W_T \eqqcolon \alpha_0 W_T$$

where in the last inequality, we have used the fact that $\bar\delta < 1$. Thus, for any episode $k$, we have

$$\bar{X}_k \coloneqq \max_{t_k \le t < t_{k+1}} \|x_t\| \le \alpha_0 W_T.$$

Hence,

$$X_T \le \max\{\bar{X}_1, \dots, \bar{X}_{K_T}\} \le \alpha_0 W_T.$$

Therefore, for any $q \ge 1$, we have

$$\mathbb{E}[X_T^q] \le \alpha_0^q\, \mathbb{E}[W_T^q] = \alpha_0^q\, \mathbb{E}\Big[\max_{1 \le t \le T} \|w_t\|^q\Big]. \tag{28}$$

From [1, Eq. (39)], we have that

$$\mathbb{E}\Big[\max_{1 \le t \le T} \|w_t\|^q\Big] \le \sigma_w^q\, \mathcal{O}(\log T).$$

Substituting this in (28), we obtain the result of the lemma.
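The geometric recursion (26)–(27) that drives this proof can be checked numerically: iterating the worst case of (26) never exceeds the limit $\bar\alpha W_T / (1 - \beta)$ from (27). The constants below are illustrative, not from the paper.

```python
# Worst case of recursion (26): Y_{k+1} = beta * Y_k + abar * W, with Y_1 = 0.
# (27) says the iterates stay below abar * W / (1 - beta).
beta, abar, W = 0.8, 2.0, 1.5
bound = abar * W / (1.0 - beta)

Y = 0.0
for _ in range(100):
    Y = beta * Y + abar * W
    assert Y <= bound     # every iterate respects the geometric-series bound (27)
print(Y, bound)
```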

## Appendix B Proof of Lemma 5

Since $\log$ is an increasing function, $\log X_T^2 \le \log \max(e, X_T^2)$. Therefore,

$$\mathbb{E}\big[X_T^q \log X_T^2\big] \le \mathbb{E}\big[X_T^q \log \max(e, X_T^2)\big] \le \sqrt{\mathbb{E}\big[X_T^{2q}\big]\; \mathbb{E}\big[(\log \max(e, X_T^2))^2\big]} \tag{29}$$

where the last inequality follows from the Cauchy-Schwarz inequality. Since $(\log x)^2$ is concave for $x \ge e$, we can use Jensen's inequality to write

$$\mathbb{E}\big[(\log \max(e, X_T^2))^2\big] \le \big(\log \mathbb{E}[\max(e, X_T^2)]\big)^2 \le \big(\log(e + \mathbb{E}[X_T^2])\big)^2 \overset{(a)}{\le} \big(\log(e + \mathcal{O}(\sigma_w^2 \log T))\big)^2 \le \tilde{\mathcal{O}}(1) \tag{30}$$

where $(a)$ uses Lemma 4. Substituting (30) in (29) and using Lemma 4 for bounding $\mathbb{E}[X_T^{2q}]$, we get

$$\mathbb{E}\big[X_T^q \log X_T^2\big] \le \sqrt{\mathbb{E}\big[X_T^{2q}\big]\; \mathbb{E}\big[(\log \max(e, X_T^2))^2\big]} \le \sigma_w^q\, \tilde{\mathcal{O}}(1).$$

## Appendix C Proof of Lemma 6

The high-level idea of the proof is the same as that of [1, Lemma 3]. Define macro episodes with start times $t_{n_i}$, $i \in \{1, 2, \dots\}$, where $n_1 = 1$ and for $i \ge 1$,

$$n_{i+1} = \min\Big\{ k > n_i \,:\, \det \Sigma_{t_k} < \tfrac{1}{2} \det \Sigma_{t_{k-1}} \Big\}.$$

Thus, a new macro-episode starts whenever an episode ends due to the second stopping criterion. Let $M$ denote the number of macro-episodes until time $T$ and define $n_{M+1} = K_T + 1$. Let $\bar{T}_i = \sum_{k=n_i}^{n_{i+1}-1} T_k$ denote the length of the $i$-th macro-episode. Within a macro-episode, all but the last episode must be triggered by the first stopping criterion. Thus, for $n_i \le k < n_{i+1} - 1$,

$$T_k = \max\{T_{k-1} + 1,\, T_{\min} + 1\} = T_{k-1} + 1,$$

where the last equality follows from (11). Hence, by following exactly the same argument as [1], we have

$$n_{i+1} - n_i \le \sqrt{2 \bar{T}_i}$$

and therefore following [1, Eq. (40)], we have

$$K_T \le \sqrt{2 M T} \tag{31}$$

which is the same as [1, Eq. (41)].

Now, observe that

$$\det \Sigma_T^{-1} \overset{(a)}{\ge} \det \Sigma_{t_{n_M}}^{-1} \overset{(b)}{\ge} 2 \det \Sigma_{t_{n_{M-1}}}^{-1} \ge \cdots \ge 2^{M-1} \det \Sigma_1^{-1}, \tag{32}$$

where $(a)$ follows because $\det \Sigma_t^{-1}$ is a non-decreasing sequence (because $\Sigma_{t+1}^{-1} \succeq \Sigma_t^{-1}$ by (9)) and $t_{n_M} \le T$, and the subsequent inequalities follow from the definition of the macro episode and the second triggering condition.

Then following the same idea as the rest of the proof in [1], we get

$$M \le \mathcal{O}\big((n+m) \log(T X_T^2)\big). \tag{33}$$

Substituting (33) in (31), we obtain the result of the lemma.

## Appendix D Proof of Lemma 7

Observe that the summand is constant for each episode. Therefore,

$$\mathbb{E}\Big[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big\|\Sigma_{t_k}^{-0.5}(\theta_1 - \bar\theta_k)\big\|^2\Big] = \mathbb{E}\Big[\sum_{k=1}^{K_T} T_k \big\|\Sigma_{t_k}^{-0.5}(\theta_1 - \bar\theta_k)\big\|^2\Big] \overset{(a)}{\le} \mathbb{E}\Big[\sum_{k=1}^{K_T} (T_{k-1} + 1) \big\|\Sigma_{t_k}^{-0.5}(\theta_1 - \bar\theta_k)\big\|^2\Big] = \sum_{k=1}^{\infty} \mathbb{E}\Big[\mathbb{1}\{t_k \le T\}\, (T_{k-1} + 1) \big\|\Sigma_{t_k}^{-0.5}(\theta_1 - \bar\theta_k)\big\|^2\Big] = \sum_{k=1}^{\infty} \mathbb{E}\Big[\mathbb{E}\big[\mathbb{1}\{t_k \le T\}\, (T_{k-1} + 1) \big\|\Sigma_{t_k}^{-0.5}(\theta_1 - \bar\theta_k)\big\|^2 \,\big|\, h_{t_k}\big]\Big]$$