We consider the classical problem of control of linear systems with quadratic cost. When the true system dynamics are unknown, an adaptive policy is required for learning the model parameters and planning a control policy simultaneously. Addressing this trade-off between accurate estimation and good control represents the main challenge in the area of adaptive control. Another important issue is to prevent the system from becoming destabilized due to lack of knowledge of its dynamics. Asymptotically optimal approaches have been extensively studied in the literature, but non-asymptotic results are scarce and do not provide a comprehensive treatment of the problem. In this work, we establish finite time high probability regret bounds that are optimal up to logarithmic factors. We also provide high probability guarantees for a stabilization algorithm based on random linear feedbacks. The results are obtained under very mild assumptions, requiring: (i) stabilizability of the matrices encoding the system's dynamics, and (ii) a limit on the degree of heaviness of the noise distribution. To derive our results, we also introduce a number of new concepts and technical tools.

## I Introduction

Adaptive regulation of linear-quadratic (LQ) state space models is the canonical problem studied in this work. That is, starting from the initial state $x(0)$, the dynamics and the cost of the system are defined according to

$$x(t+1) = A_0 x(t) + B_0 u(t) + w(t+1), \tag{1}$$

$$c_t = x(t)' Q x(t) + u(t)' R u(t), \tag{2}$$

for $t = 0, 1, \cdots$. The vector $x(t) \in \mathbb{R}^p$ denotes the state (and output) of the system at time $t$, $u(t) \in \mathbb{R}^r$ represents the control signal, and the stochastic sequence of disturbance (or noise) is denoted by $\{w(t)\}_{t=1}^{\infty}$. Moreover, the quadratic function $c_t$ corresponds to the instantaneous cost of the system (the transpose of the vector $v$ is denoted by $v'$). The transition matrix $A_0$ and the input matrix $B_0$, which constitute the dynamical parameters of the system, are unknown. The positive definite matrices of the cost, $Q$ and $R$, are assumed known.

The broad objective is to adaptively regulate the system in order to minimize the long term average cost. This canonical problem is well studied in the literature and a number of asymptotic results have been established, as discussed next. However, finite time results are scarce and rather incomplete, despite their need in applications (e.g. network systems).

Since the system dynamics are unknown, a popular adaptive procedure for regulation is based on the principle of Certainty Equivalence (CE). Alternating between estimation and regulation, CE applies a control action as if the estimated parameter were the true one the system is evolving according to [3, 4, 5, 6, 7, 8]. However, it has been shown that the CE based strategy can lead to wildly incorrect parameter estimates [9, 10, 11], and thus modifications have been introduced in the literature [12, 13]. A popular approach, known as Optimism in the Face of Uncertainty (OFU), was developed to address the suboptimality of CE. In OFU, after constructing a confidence set for the model parameters, a regulation policy is designed based on the most optimistic parameter in the confidence set.

The above references establish the asymptotic convergence of the average cost to the optimal value. More recently, non-asymptotic results on the growth rate of regret (i.e. the cumulative deviation from the optimal cost, see (5)) have appeared [16, 17]. These papers provide a near-optimal upper bound for the regret of OFU, under the following rather restrictive conditions:

1. The dynamics matrices are assumed to be controllable and observable. This leads to unnecessary complexity in the computation of the adaptive regulator. Further, this assumption restricts the applicability of the analysis since the condition may be violated in many LQ systems.

2. An additional piece of information regarding the true parameters is available a priori; namely, the Frobenius norm or identifiability.

3. The operator norm of the closed-loop matrix is less than one, which excludes a remarkable fraction of systems with stable closed-loop matrices. In fact, a stable matrix can have an arbitrarily large operator norm. Note that condition 1 only implies that the largest eigenvalue (not the operator norm) of the closed-loop matrix is less than one in magnitude.

4. The noise distribution satisfies a tail condition such as sub-Gaussianity or Gaussianity. Moreover, the coordinates of the noise vectors are not correlated.

Resolving the aforementioned shortcomings, this work establishes near-optimal regret bounds for an extensive family of LQ systems. Namely, we remove conditions 1 and 2 above, and replace the strict condition 3 with stabilizability, which is the necessary assumption for the optimal control problem to be well-defined. Regarding condition 4, the high probability near-optimal upper bound for regret presented in this work holds for a class of heavy-tailed noise vectors with arbitrary correlation structures.

There are a number of conceptual and technical difficulties one needs to address in order to obtain the results of optimal regulation. First, the existing methodology for analyzing adaptive policies [13, 16, 17] is not applicable beyond condition 3. One reason is that the operator norm is submultiplicative; i.e. the norm of the product is upper bounded by the product of the norms. However, the product of two stable matrices can have eigenvalues of arbitrarily large magnitude. Further, the sub-Weibull distributions assumed in this work do not need to have moment generating functions. Hence, new tools are required to establish concentration inequalities for random matrices with heavy-tailed probability distributions [20, 21].
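The gap between stability and the operator norm can be checked numerically. The following is a minimal sketch, not taken from the paper: the 2×2 matrices and all helper names are illustrative assumptions. It builds two matrices whose eigenvalues all equal 0.5, yet whose operator norm exceeds 10 and whose product has an eigenvalue above 100.

```python
import cmath
import math

def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix from its characteristic polynomial."""
    (a, b), (c, d) = m
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

def spectral_radius(m):
    """Largest eigenvalue magnitude; the matrix is stable if this is below 1."""
    return max(abs(z) for z in eigenvalues_2x2(m))

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(m):
    return [[m[j][i] for j in range(2)] for i in range(2)]

def operator_norm_2(m):
    """Spectral norm: square root of the largest eigenvalue of m'm."""
    return math.sqrt(spectral_radius(matmul(transpose(m), m)))

big = 10.0                      # illustrative off-diagonal entry
A = [[0.5, big], [0.0, 0.5]]    # stable: both eigenvalues are 0.5
B = transpose(A)                # stable as well
AB = matmul(A, B)

print("spectral radii of A and B:", spectral_radius(A), spectral_radius(B))
print("operator norm of A:", operator_norm_2(A))
print("spectral radius of AB:", spectral_radius(AB))
```

Increasing `big` makes both the operator norm and the largest eigenvalue of the product grow without bound while the factors stay stable, which is why an analysis built on operator norms less than one does not extend to general stable closed-loop matrices.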

The remainder of this paper is organized as follows. Section II formally defines the problem. Section III addresses the problem of accurate estimation of the closed-loop matrix and includes the analysis of the empirical covariance matrix, as well as a high probability prediction bound. Finally, an OFU-based algorithm for adaptive regulation of the system is presented in Section IV. We show that the regret of Algorithm 1 is optimal with high probability, up to a logarithmic factor.

### I-A Notation

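The following notation is used throughout this paper. For matrix $A \in \mathbb{C}^{p \times q}$, $A'$ is its transpose. When $p = q$, the smallest (respectively largest) eigenvalue of $A$ (in magnitude) is denoted by $\lambda_{\min}(A)$ (respectively $\lambda_{\max}(A)$) and the trace of $A$ is denoted by $\mathrm{tr}(A)$. For $\gamma \in [1, \infty)$, the $\gamma$-norm of vector $v \in \mathbb{C}^q$ is $\|v\|_\gamma = \left(\sum_{i=1}^{q} |v_i|^\gamma\right)^{1/\gamma}$. Further, when $\gamma = \infty$, the norm is defined according to $\|v\|_\infty = \max_{1 \le i \le q} |v_i|$. We also use the following notation for the operator norm of matrices. For $\beta, \gamma \in [1, \infty]$, and $A \in \mathbb{C}^{p \times q}$, define

$$|||A|||_{\gamma \to \beta} = \sup_{v \ne 0} \frac{\|Av\|_\beta}{\|v\|_\gamma}.$$

Whenever $\gamma = \beta$, we simply write $|||A|||_\beta$. To denote the dimension of manifold $\mathcal{M}$ over the field $F$, we use $\dim_F(\mathcal{M})$. Finally, the sigma-field generated by random vectors $X_1, \cdots, X_n$ is denoted by $\sigma(X_1, \cdots, X_n)$. The notation for $\theta$, $K(\theta)$, $L(\theta)$, and $\tilde{L}(\theta)$ is provided in Remark 1, equations (3), (4), and Remark 2, respectively.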

## II Problem Formulation

First, we formally discuss the problem of adaptive regulation this work is addressing. Equation (1) depicts the dynamics of the system, where $\{w(t)\}_{t=1}^{\infty}$ are independent mean-zero noise vectors with full rank covariance matrix $C$:

$$\mathbb{E}[w(t)] = 0, \quad \mathbb{E}[w(t) w(t)'] = C, \quad |\lambda_{\min}(C)| > 0.$$

The results established also hold if the noise vectors are martingale difference sequences. The true dynamics are assumed to be stabilizable, as defined below.

###### Definition 1 (Stabilizability).

$[A_0, B_0]$ is stabilizable if there is $L \in \mathbb{R}^{r \times p}$ such that $|\lambda_{\max}(A_0 + B_0 L)| < 1$. The linear feedback matrix $L$ is called a stabilizer.

###### Remark 1.

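Henceforth, we use $\theta_0$ to denote the dynamics parameter $[A_0, B_0]$, where $A_0$ and $B_0$ are $p \times p$ and $p \times r$ real matrices, respectively. Letting $q = p + r$, obviously $\theta_0 \in \mathbb{R}^{p \times q}$.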

Here, we consider perfect observations, i.e. the output of the system corresponds to the state vector itself. Next, an admissible control policy is a mapping $\pi$ which designs the control action according to the dynamics matrix $\theta_0$, the cost matrices $Q, R$, and the history of the system; i.e. for all $t = 0, 1, \cdots$,

$$u(t) = \pi\left(\theta_0, Q, R, \{x(i)\}_{i=0}^{t}, \{u(j)\}_{j=0}^{t-1}\right).$$

An adaptive policy does not have access to the dynamics parameter $\theta_0$; i.e.

$$u(t) = \pi\left(Q, R, \{x(i)\}_{i=0}^{t}, \{u(j)\}_{j=0}^{t-1}\right).$$

When applying the policy $\pi$, the resulting instantaneous quadratic cost at time $t$ defined according to (2) is denoted by $c_t^{(\pi)}$. If there is no superscript, the corresponding policy will be clear from the context. For an arbitrary policy $\pi$, let $\bar{J}_\pi(\theta_0)$ be the expected average cost of the system:

$$\bar{J}_\pi(\theta_0) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[c_t^{(\pi)}\right].$$

Note that the dependence of $\bar{J}_\pi(\theta_0)$ on the known cost matrices $Q, R$ is suppressed. Then, the optimal expected average cost is defined by $J^\star(\theta_0) = \min_\pi \bar{J}_\pi(\theta_0)$, where the minimum is taken over all admissible control policies. Further, $\pi^\star$ is called an optimal policy for system $\theta_0$ if it satisfies $\bar{J}_{\pi^\star}(\theta_0) = J^\star(\theta_0)$. To find $\pi^\star$ for general $\theta_0$, one has to solve a Riccati equation. A solution $K(\theta_0)$ is a positive semidefinite matrix satisfying (3).

The following proposition establishes optimality of the linear feedback provided by Riccati equation (4). Henceforth, let $\pi^\star$ denote the linear feedback $L(\theta_0)$:

$$\pi^\star: \quad u(t) = L(\theta_0) x(t), \quad t = 0, 1, \cdots.$$

###### Proposition 1 (Optimal policy).

If $\theta_0$ is stabilizable, then (3) has a unique solution, $\pi^\star$ is optimal, and $J^\star(\theta_0) = \mathrm{tr}(K(\theta_0) C)$. Conversely, if $K(\theta_0)$ is a solution of (3), $L(\theta_0)$ is a stabilizer.

Note that in the latter case of Proposition 1, the existence of a solution implies that it is unique, $\pi^\star$ is an optimal policy, and $J^\star(\theta_0) = \mathrm{tr}(K(\theta_0) C)$.
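As a concrete illustration of Proposition 1, the sketch below works through a scalar system. Equations (3) and (4) are not reproduced in this text, so the code assumes they take the standard discrete-time Riccati form $K = Q + A'KA - A'KB(B'KB + R)^{-1}B'KA$ and $L = -(B'KB + R)^{-1}B'KA$. The numerical values and function names are hypothetical.

```python
def solve_scalar_riccati(a, b, q, r, iterations=500):
    """Fixed-point iteration for the scalar discrete-time Riccati equation."""
    k = q
    for _ in range(iterations):
        k = q + a * a * k - (a * b * k) ** 2 / (b * b * k + r)
    return k

def optimal_feedback(a, b, k, r):
    """Scalar feedback gain induced by the Riccati solution k."""
    return -(b * k * a) / (b * b * k + r)

# Hypothetical scalar system: unstable open loop (|a0| > 1) but stabilizable.
a0, b0 = 1.5, 1.0
q_cost, r_cost = 1.0, 1.0
noise_var = 1.0                 # plays the role of C

k = solve_scalar_riccati(a0, b0, q_cost, r_cost)
gain = optimal_feedback(a0, b0, k, r_cost)
closed_loop = a0 + b0 * gain    # should be stable by Proposition 1
optimal_cost = k * noise_var    # tr(K C) in the scalar case

print("K =", k, "L =", gain, "closed loop =", closed_loop, "J* =", optimal_cost)
```

The closed-loop coefficient lands strictly inside the unit interval even though the open-loop system is unstable, matching the claim that the Riccati feedback is a stabilizer.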

In order to measure the quality of an (adaptive) policy $\pi$, the resulting cost will be compared to the optimal expected average cost defined above. More precisely, letting $c_t^{(\pi)}$ be the resulting instantaneous cost at time $t$, the regret at time $T$ is defined as

$$R(T) = \sum_{t=1}^{T} \left[c_t^{(\pi)} - J^\star(\theta_0)\right]. \tag{5}$$

The comparison between adaptive control policies is made according to regret. The next result describes the asymptotic distribution of the regret. Lemma 1, which is basically a Central Limit Theorem for $R(T)$, states that even when applying the optimal policy, the regret scales as $T^{1/2}$, multiplied by a normal random variable.

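###### Lemma 1.

Applying the optimal feedback $L(\theta_0)$, $T^{-1/2} R(T)$ converges in distribution to $N(0, \sigma^2)$ as $T$ grows, where

$$\sigma^2 = 4\, \mathrm{tr}\left(K(\theta_0) C K(\theta_0) \sum_{n=1}^{\infty} D^n C D'^n\right) + \lim_{T \to \infty} T^{-1} \sum_{t=1}^{T} \mathrm{Var}\left[w(t)' K(\theta_0) w(t)\right] > 0.$$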

The definition of regret in (5) is the cumulative deviation from the optimal expected average cost, constituted by both the stochastic evolution of the system (randomness of $\{w(t)\}_{t=1}^{\infty}$) and the uncertainty about the dynamics (unknownness of $\theta_0$). Lemma 1 states that, from a pure control point of view, the convergence of the cumulative cost is at rate $T^{1/2}$. So, trying to push the second component of the regret (which is due to learning the unknown dynamics) to have a rate less than $T^{1/2}$ is actually unnecessary. Further, Lemma 1 provides a lower bound for the regret of adaptive policies. The optimal policy for minimizing the (finite horizon) expected cumulative cost converges to $\pi^\star$ as $T$ grows. Hence, the regret of an adaptive policy cannot be less than that of $\pi^\star$ in the long run. The upper bound of a normal distribution which holds with probability at least $1 - \delta$ needs to be in magnitude at least $(-\log \delta)^{1/2}$. Therefore, Lemma 1 implies that a high probability regret bound that holds with probability at least $1 - \delta$ needs to be at least of the order of magnitude of $T^{1/2} (-\log \delta)^{1/2}$.

To proceed, we introduce a notation that simplifies certain expressions throughout the remainder of this work.

###### Remark 2.

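For arbitrary stabilizable $\theta = [A, B]$, define $\tilde{L}(\theta) = \begin{bmatrix} I_p \\ L(\theta) \end{bmatrix}$. So, $\theta \tilde{L}(\theta) = A + B L(\theta)$.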

## III Closed-Loop Identification

When applying linear feedback $L$ to the system, the closed-loop dynamics becomes $x(t+1) = D x(t) + w(t+1)$, where $D = A_0 + B_0 L$. Subsequently, we present bounds for the length of time the user needs to interact with the system in order to have sufficiently many observations for accurate identification of the closed-loop matrix. The next set of results is used later on to construct the confidence sets used to design the adaptive policy.

First, we define least-squares estimation for matrix $D$, as follows. Observing the state vectors $\{x(t)\}_{t=0}^{n}$, for an arbitrary matrix $M \in \mathbb{R}^{p \times p}$ define the sum-of-squares loss function $\mathcal{L}_n(M) = \sum_{t=0}^{n-1} \|x(t+1) - M x(t)\|_2^2$. Then, the true closed-loop transition matrix $D$ is estimated by $\hat{D}_n$, which is a minimizer of the above loss: $\hat{D}_n \in \arg\min_{M \in \mathbb{R}^{p \times p}} \mathcal{L}_n(M)$. Next, the following tail condition is assumed for the noise coordinates.

###### Assumption 1 (Sub-Weibull distribution).

There are positive reals $b_1$, $b_2$, and $\alpha$, such that for all $t \ge 1$, $1 \le i \le p$, $y > 0$,

$$\mathbb{P}\left(|w_i(t)| > y\right) \le b_1 \exp\left(-\frac{y^\alpha}{b_2}\right).$$

Clearly, the smaller the exponent $\alpha$ is, the heavier the tail of $w_i(t)$ will be. Note that assuming a sub-Weibull distribution for the noise coordinates is more general than the sub-Gaussian (or sub-exponential) assumption routinely made in the literature, where $\alpha \ge 2$ ($\alpha \ge 1$). Further, $w_i(t)$ do not need to have a moment generating function if $\alpha < 1$. To obtain analogous results for uniformly bounded noise sequences, it suffices to let $\alpha \to \infty$ in the subsequently presented materials.
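Returning to the least-squares estimator of the closed-loop matrix, the following minimal sketch simulates a stable two-dimensional closed-loop system and recovers its transition matrix from the observed states. The matrix `D_true`, the horizon, and the use of standard Gaussian noise (a special case of the sub-Weibull condition) are illustrative assumptions; the estimator is written in its closed form, the cross term times the inverse empirical covariance.

```python
import random

def simulate(D, n, rng):
    """Simulate x(t+1) = D x(t) + w(t+1) with standard normal noise, x(0) = 0."""
    x = [0.0, 0.0]
    path = [x]
    for _ in range(n):
        w = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        x = [D[i][0] * x[0] + D[i][1] * x[1] + w[i] for i in range(2)]
        path.append(x)
    return path

def least_squares_estimate(path):
    """Return (sum x(t+1)x(t)') (sum x(t)x(t)')^{-1} for a 2-dimensional state."""
    V = [[0.0, 0.0], [0.0, 0.0]]   # empirical covariance: sum of x(t)x(t)'
    S = [[0.0, 0.0], [0.0, 0.0]]   # cross term: sum of x(t+1)x(t)'
    for x, x_next in zip(path[:-1], path[1:]):
        for i in range(2):
            for j in range(2):
                V[i][j] += x[i] * x[j]
                S[i][j] += x_next[i] * x[j]
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    V_inv = [[V[1][1] / det, -V[0][1] / det],
             [-V[1][0] / det, V[0][0] / det]]
    return [[sum(S[i][k] * V_inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

D_true = [[0.5, 0.2], [0.1, 0.3]]          # hypothetical stable closed-loop matrix
path = simulate(D_true, n=2000, rng=random.Random(0))
D_hat = least_squares_estimate(path)
max_error = max(abs(D_hat[i][j] - D_true[i][j])
                for i in range(2) for j in range(2))
print("estimate:", D_hat)
print("largest entrywise error:", max_error)
```

The estimation error is governed by how well-conditioned the empirical covariance `V` is, which is why the smallest eigenvalue of that matrix is the central object of this section.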

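Next, we establish high probability results regarding the length of time one needs to observe the state vectors of the system in order to ensure that identification of the closed-loop transition matrix is accurate enough. First, some straightforward algebra shows that $\hat{D}_n - D = \left(\sum_{t=0}^{n-1} w(t+1) x(t)'\right) V_n^{-1}$, where $V_n = \sum_{t=0}^{n-1} x(t) x(t)'$ denotes the (invertible) empirical covariance matrix of the state process. Therefore, the behavior of $V_n$ needs to be carefully studied and this constitutes a major part of this section. To do so, we define the constant $\eta(D)$ which determines the influence of a stable closed-loop matrix $D$ on the magnitude of the state vectors. Let $D = P^{-1} \Lambda P$ be the Jordan decomposition of $D$; i.e. $\Lambda$ is block diagonal, $\Lambda = \mathrm{diag}(\Lambda_1, \cdots, \Lambda_k)$, where for all $1 \le i \le k$, $\Lambda_i$ is a Jordan matrix of $\lambda_i$:

$$\Lambda_i = \begin{bmatrix} \lambda_i & 1 & 0 & \cdots & 0 \\ 0 & \lambda_i & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & \cdots & 0 & \lambda_i \end{bmatrix} \in \mathbb{C}^{m_i \times m_i}.$$

###### Definition 2 (Constant $\eta(D)$).

For , letting

$$\eta_t(\Lambda_i) = \inf_{\rho \ge |\lambda_i|} t^{m_i - 1} \rho^t \sum_{j=0}^{m_i - 1} \frac{\rho^{-j}}{j!},$$

define . Then, let $\eta_t(\Lambda) = \max_{1 \le i \le k} \eta_t(\Lambda_i)$, and

$$\eta(D) = |||P^{-1}|||_{\infty \to 2} \, |||P|||_\infty \sum_{t=0}^{\infty} \eta_t(\Lambda).$$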

Letting , if is diagonalizable, then clearly . In general, denoting the largest algebraic multiplicity of the eigenvalues of by , we have ; i.e.

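The other quantities determining the magnitude of the state vectors are the following:

$$\nu_n(\delta) = \left(b_2 \log\left(b_1 n p\, \delta^{-1}\right)\right)^{1/\alpha}, \tag{6}$$

$$\xi_n(\delta) = \eta(D) \left(\|x(0)\|_\infty + \nu_n(\delta)\right). \tag{7}$$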
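The form of $\nu_n(\delta)$ in (6) can be read off from a union bound. The short derivation below is a sketch that assumes the tail condition of Assumption 1 is applied to each of the $p$ coordinates at each of the $n$ time points:

```latex
\begin{align*}
\mathbb{P}\Big(\max_{1 \le t \le n} \|w(t)\|_\infty > \nu_n(\delta)\Big)
  &\le \sum_{t=1}^{n} \sum_{i=1}^{p} \mathbb{P}\big(|w_i(t)| > \nu_n(\delta)\big) \\
  &\le n p\, b_1 \exp\Big(-\frac{\nu_n(\delta)^{\alpha}}{b_2}\Big)
   = n p\, b_1 \exp\big(-\log(b_1 n p\, \delta^{-1})\big)
   = \delta .
\end{align*}
```

So $\nu_n(\delta)$ is the smallest threshold of this form for which all noise coordinates up to time $n$ stay below it with probability at least $1 - \delta$.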

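Lemma 2 and Lemma 3 show that $\nu_n(\delta), \xi_n(\delta)$ are the high probability uniform bounds for the size of the noise and the state vectors. As a matter of fact, $\nu_n(\delta), \xi_n(\delta)$ scale as $(\log (n/\delta))^{1/\alpha}$. Hence, for uniformly bounded noise, both of them are fixed constants. Then, recalling that $C$ is the positive definite covariance matrix of the noise vectors, let $N(\epsilon, \delta)$ be large enough, such that the following hold for all $n \ge N(\epsilon, \delta)$:

$$\frac{n}{\nu_n(\delta)^2} \ge \frac{18 |\lambda_{\max}(C)| + 2\epsilon}{\epsilon^2} \, p \log\left(\frac{4p}{\delta}\right), \tag{8}$$

$$\frac{n}{\xi_n(\delta)^2 \nu_n(\delta)^2} \ge \frac{288}{\epsilon^2} \, p \, |||D|||_2^2 \log\left(\frac{4p}{\delta}\right), \tag{9}$$

$$\frac{n}{\xi_n(\delta)^2} \ge \frac{6}{\epsilon} \left(|||D|||_2^2 + 1\right). \tag{10}$$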
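To get a feel for the sample sizes these conditions demand, the sketch below searches for an $n$ that satisfies (8), (9), and (10). It reads the left-hand sides as the ratios $n/\nu_n(\delta)^2$, $n/(\xi_n(\delta)^2\nu_n(\delta)^2)$, and $n/\xi_n(\delta)^2$, and treats $\eta(D)$ and $|||D|||_2$ as given numbers. All numerical values, the `Setup` container, and the doubling search are assumptions made for illustration; the search returns a sufficient power of two rather than the exact threshold.

```python
import math
from dataclasses import dataclass

@dataclass
class Setup:
    p: int            # state dimension
    b1: float         # sub-Weibull constants of Assumption 1
    b2: float
    alpha: float
    eta: float        # eta(D) from Definition 2, taken as given here
    x0_inf: float     # sup-norm of the initial state
    lam_max_c: float  # largest eigenvalue of the noise covariance C
    d_norm: float     # operator 2-norm of the closed-loop matrix D

def nu(s, n, delta):
    """High probability bound (6) on the noise magnitude."""
    return (s.b2 * math.log(s.b1 * n * s.p / delta)) ** (1.0 / s.alpha)

def xi(s, n, delta):
    """High probability bound (7) on the state magnitude."""
    return s.eta * (s.x0_inf + nu(s, n, delta))

def conditions_hold(s, n, eps, delta):
    """Check conditions (8), (9), (10) at a single n."""
    v, x = nu(s, n, delta), xi(s, n, delta)
    log_term = s.p * math.log(4 * s.p / delta)
    c8 = n / v ** 2 >= (18 * s.lam_max_c + 2 * eps) / eps ** 2 * log_term
    c9 = n / (x ** 2 * v ** 2) >= 288 / eps ** 2 * s.d_norm ** 2 * log_term
    c10 = n / x ** 2 >= 6 / eps * (s.d_norm ** 2 + 1)
    return c8 and c9 and c10

def sufficient_sample_size(s, eps, delta):
    """Double n until all three conditions hold."""
    n = 1
    while not conditions_hold(s, n, eps, delta):
        n *= 2
    return n

setup = Setup(p=2, b1=2.0, b2=1.0, alpha=1.0, eta=3.0,
              x0_inf=1.0, lam_max_c=1.0, d_norm=0.9)
n_ok = sufficient_sample_size(setup, eps=0.5, delta=0.05)
print("a sufficient sample size:", n_ok)
```

Because $\nu_n(\delta)$ and $\xi_n(\delta)$ grow only polylogarithmically in $n$, the ratios on the left-hand sides eventually increase with $n$, so the conditions keep holding beyond the returned value in this example.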

The following theorem provides a high probability lower bound for the smallest eigenvalue of .

###### Theorem 1 (Empirical covariance).

If , then

 P(|λmin(Vn+1)|

Moreover, .

###### Proof of Theorem 1.

We use the following lemmas, for which the proofs are provided in the appendix. For , and , define the event: .

###### Lemma 2.

We have .

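###### Lemma 3.

The following holds on :

$$\max_{1 \le t \le n} \|x(t)\|_2 \le \xi_n(\delta).$$

###### Lemma 4.

Letting

$$C_n = \frac{1}{n} \sum_{i=1}^{n} w(i) w(i)',$$

on we have , assuming

$$\frac{n}{\nu_n(\delta)^2} \ge \frac{6 |\lambda_{\max}(C)| + 2\epsilon}{3 \epsilon^2} \, p \log\left(\frac{2p}{\delta}\right). \tag{11}$$

###### Lemma 5.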

Suppose that

$$U_n = \frac{1}{n} \sum_{i=0}^{n-1} \left[D x(i) w(i+1)' + w(i+1) x(i)' D'\right].$$

Then, holds on , as long as

 (12)

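Next, note that $x(t+1) = D x(t) + w(t+1)$ implies

$$V_{n+1} = x(0) x(0)' + D \sum_{i=0}^{n-1} x(i) x(i)' D' + n U_n + n C_n,$$

where $C_n, U_n$ are defined in Lemma 4 and Lemma 5. So, we obtain the Lyapunov equation $V_{n+1} = D V_{n+1} D' + n E_n$, for

$$E_n = U_n + C_n + \frac{1}{n} D \left(x(0) x(0)' - x(n) x(n)'\right) D' + \frac{1}{n} x(0) x(0)',$$

i.e.

$$V_{n+1} = n \sum_{i=0}^{\infty} D^i E_n D'^i. \tag{13}$$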

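Henceforth, suppose that holds. According to Lemma 4, (8) implies that

$$\mathbb{P}\left(|\lambda_{\max}(C_n - C)| > \frac{\epsilon}{3}\right) \le \frac{\delta}{2}. \tag{14}$$

In addition, by Lemma 5, (9) implies that

$$\mathbb{P}\left(|\lambda_{\max}(U_n)| > \frac{\epsilon}{3}\right) \le \frac{\delta}{2}. \tag{15}$$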

Finally, using Lemma 3, by (10) we get

 (16)

Putting (14), (15), and (16) together, on , with probability at least , it holds that . Therefore, since (13) implies that , we get the desired result. Moreover, since , for , with probability at least we have

$$\left|\lambda_{\max}\left(\frac{1}{n} V_{n+1}\right)\right| \le \sum_{i=0}^{\infty} \left|\lambda_{\max}\left(D^i E_n D'^i\right)\right| \le \frac{3}{2} |\lambda_{\max}(C)| \, \eta(D')^2. \tag{17}$$

When , the conditions hold for arbitrary . Thus, we have , which according to (13) implies the desired result. ∎

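The following corollary, which will be used later in Algorithm 1, provides the prediction bound $\beta_n(\delta)$, which is defined according to $\nu_n(\delta), \xi_n(\delta)$ introduced in (6), (7):

$$\beta_n(\delta) = \frac{16 n p}{(n-1) |\lambda_{\min}(C)|} \, \xi_n(\delta)^2 \nu_n(\delta)^2 \log\left(\frac{2p}{\delta}\right). \tag{18}$$

###### Corollary 1 (Prediction bound).

Define $\beta_n(\delta)$ by (18). Then, implies that

$$\mathbb{P}\left(|||V_n^{1/2} (\hat{D}_n - D)'|||_2^2 > \beta_n(\delta)\right) \le 3\delta.$$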

## IV Design of Adaptive Policy

In this section, we present an algorithm for adaptive regulation of LQ systems. When applying the following algorithm, we assume that a stabilizing set $\Omega^{(0)}$ is provided. Construction of such a set with an arbitrarily high probability guarantee is addressed in the literature. It is established that the proposed adaptive stabilization procedure returns a stabilizing set in finite time. Note that the aforementioned analysis is fairly general, so that the restrictive assumptions we discussed in Section I are not required. Nevertheless, if such a set is not available, the operator can apply the proposed method of random linear feedbacks in order to stabilize the system before running the following adaptive policy.

In the episodic algorithm below, estimation will be reinforced at the end of every episode. Indeed, the algorithm is based on a sequence of confidence sets, which are constructed according to Corollary 1. This sequence will be tightened at the end of every episode; i.e. the provided confidence sets become more and more accurate. According to this sequence, the adaptive linear feedback will be updated after every episode. At the end of this section, we present a high probability regret bound.

First, we provide a high level explanation of the algorithm. Starting with the stabilizing set $\Omega^{(0)}$, we select a parameter $\tilde{\theta}^{(1)} \in \Omega^{(0)}$ based on the OFU principle; i.e. $\tilde{\theta}^{(1)}$ is a minimizer of the optimal expected average cost over the corresponding confidence set (see (19)).

Then, assuming $\tilde{\theta}^{(1)}$ is the true parameter the system evolves according to, during the first episode the algorithm applies the optimal linear feedback $L(\tilde{\theta}^{(1)})$. Once the observations during the first episode are collected, they are used to improve the accuracy of the high probability confidence set. Therefore, $\Omega^{(0)}$ is tightened to $\Omega^{(1)}$, and the second episode starts by selecting $\tilde{\theta}^{(2)} \in \Omega^{(1)}$ and iterating the above procedure, and so on. The lengths of the episodes will be increasing, to make every confidence set significantly more accurate than all previous ones.

The intuition behind the effectiveness of the OFU principle is as follows. As shown in Section III, when applying a linear feedback $L$, observations of the state vectors will lead only to identification of the closed-loop matrix. Letting $\tilde{L} = [I_p, L']'$, the closed-loop transition matrix is $\theta_0 \tilde{L}$. Note that in general, an accurate estimation of $\theta_0 \tilde{L}$ does not lead to that of $\theta_0$. Therefore, approximating $\theta_0$ is impossible, regardless of the accuracy in the approximation of $\theta_0 \tilde{L}$.
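The non-identifiability described above is easy to see in a scalar example. In the sketch below, which uses hypothetical numbers and names, two different parameters $\theta = (a, b)$ produce exactly the same closed-loop coefficient under a fixed feedback gain, so closed-loop observations alone cannot tell them apart.

```python
def closed_loop(theta, gain):
    """Closed-loop coefficient a + b * gain, i.e. theta applied to [1, gain]'."""
    a, b = theta
    return a + b * gain

gain = -0.8                      # fixed linear feedback u(t) = gain * x(t)
theta_true = (1.5, 1.0)          # hypothetical true parameter

# Shifting theta along the direction (-gain, 1) leaves a + b * gain unchanged,
# because that direction is orthogonal to [1, gain]'.
shift = 2.0
theta_alt = (theta_true[0] - shift * gain, theta_true[1] + shift)

print("true parameter:", theta_true, "closed loop:", closed_loop(theta_true, gain))
print("alternative:   ", theta_alt, "closed loop:", closed_loop(theta_alt, gain))
```

Every parameter on the line through `theta_true` in that direction is consistent with the same closed-loop data, which is why the confidence sets of this section constrain $\theta \tilde{L}$ rather than $\theta$ itself.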

However, in order to design an adaptive policy to minimize the average cost, an effective approximation of $L(\theta_0)$ is required. Specifically, as long as $\theta$ is available satisfying $L(\theta) = L(\theta_0)$, one can apply an optimal linear feedback $L(\theta)$, no matter how large $|||\theta - \theta_0|||_2$ is. In general, estimation of such a $\theta$ is not possible. Yet, an optimistic approximation in addition to an exact knowledge of the closed-loop dynamics leads to an optimal linear feedback, thanks to the OFU principle.

###### Lemma 6.

If $\theta \tilde{L}(\theta) = \theta_0 \tilde{L}(\theta)$ and $J^\star(\theta) \le J^\star(\theta_0)$, then $L(\theta)$ is an optimal linear feedback for the system evolving according to $\theta_0$.

In other words, when applying the linear feedback $L(\tilde{\theta})$ designed according to the optimistically selected parameter $\tilde{\theta}$, as long as the closed-loop matrix is exactly identified, an optimal linear feedback is automatically provided. Recall that the lengths of the episodes are growing so that the estimation of the closed-loop matrix becomes more precise at the end of every episode. Thus, the approximation $\tilde{\theta} \tilde{L}(\tilde{\theta}) \approx \theta_0 \tilde{L}(\tilde{\theta})$ is becoming more and more accurate. Rigorous analysis of the discussion above leads to the high probability near-optimal regret bounds.

Algorithm 1 takes the inputs $\Omega^{(0)}, \delta, \gamma$, explained below. $\Omega^{(0)}$ is a bounded stabilizing set: for every $\theta \in \Omega^{(0)}$, the system will be stable if the optimal linear feedback of $\theta$ is applied; i.e. $|\lambda_{\max}(\theta_0 \tilde{L}(\theta))| < 1$. As mentioned before, an algorithmic procedure to obtain a bounded stabilizing set in finite time is available in the literature. Furthermore, $\delta$ is the highest probability that Algorithm 1 fails to adaptively regulate the system such that the regret will be nearly optimal (see Theorem 2). The reinforcement rate $\gamma > 1$ determines the growth rate of the lengths of the time intervals (episodes) an adaptive policy is applied until being updated (see (20)).

The algorithm provides an adaptive policy as follows. For $i = 1, 2, \cdots$, at the beginning of the $i$-th episode, we apply the linear feedback $u(t) = L(\tilde{\theta}^{(i)}) x(t)$, where

$$\tilde{\theta}^{(i)} \in \arg\min_{\theta \in \Omega^{(i-1)}} J^\star(\theta). \tag{19}$$

Indeed, based on the OFU principle, at the beginning of every episode, the most optimistic parameter amongst all those we are uncertain about is selected. The length of episode $i$, which is the time period we apply the adaptive control policy $L(\tilde{\theta}^{(i)})$, is designed according to the following equation. Letting $\tau_0 = 0$, we update the control policy at the end of episode $i$, i.e. at the time $\tau_i$, defined according to

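$$\tau_i = \tau_{i-1} + \gamma^{i/q} N\left(\frac{|\lambda_{\min}(C)|}{2}, \frac{\delta}{i^2}\right) + \gamma^{i/q}, \tag{20}$$

where $N(\cdot, \cdot)$ is defined by (8), (9), and (10). After the $i$-th episode, we estimate the closed-loop transition matrix by the following least-squares estimator:

$$\hat{D}^{(i)} = \arg\min_{M \in \mathbb{R}^{p \times p}} \sum_{t=\tau_{i-1}}^{\tau_i - 1} \|x(t+1) - M x(t)\|_2^2. \tag{21}$$

Letting $V^{(i)}$ be the empirical covariance matrix of episode $i$,

$$V^{(i)} = \sum_{t=\lceil \tau_{i-1} \rceil}^{\lceil \tau_i \rceil - 1} x(t) x(t)', \tag{22}$$

define the high probability confidence set

$$\Gamma^{(i)} = \left\{\theta \in \mathbb{R}^{p \times q} : |||{V^{(i)}}^{1/2} \left(\theta \tilde{L}(\tilde{\theta}^{(i)}) - \hat{D}^{(i)}\right)'|||_2^2 \le \beta_{n_i}\left(\frac{\delta}{i^2}\right)\right\}, \tag{23}$$

where $\beta_n(\delta)$ is defined in (18) and $n_i$ denotes the length of episode $i$. Note that according to Corollary 1, $\mathbb{P}(\theta_0 \in \Gamma^{(i)}) \ge 1 - 3\delta / i^2$. Then, at the end of episode $i$, the confidence set will be updated to

$$\Omega^{(i)} = \Omega^{(i-1)} \cap \Gamma^{(i)}, \tag{24}$$

and episode $i+1$ starts, finding $\tilde{\theta}^{(i+1)}$ by (19), and then iterating all steps described above.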
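The episodic loop described above can be sketched for a scalar system. This is a simplified illustration, not Algorithm 1 itself: the stabilizing set is replaced by a small finite grid of candidates, the confidence radius `beta` is a fixed constant instead of (18), episode lengths simply double instead of following (20), the Riccati equation is assumed to have its standard scalar form, and a guard keeps the candidate set from becoming empty. All names and values are hypothetical.

```python
import random

def riccati_gain(a, b, q=1.0, r=1.0, iterations=500):
    """Scalar Riccati fixed point k and the feedback gain it induces."""
    k = q
    for _ in range(iterations):
        k = q + a * a * k - (a * b * k) ** 2 / (b * b * k + r)
    return k, -(a * b * k) / (b * b * k + r)

def run_episode(theta_true, gain, x, length, rng):
    """Apply u = gain * x; return final state, covariance sum, closed-loop estimate."""
    a0, b0 = theta_true
    v = s = 0.0
    for _ in range(length):
        x_next = (a0 + b0 * gain) * x + rng.gauss(0.0, 1.0)
        v += x * x            # empirical covariance of the episode
        s += x_next * x       # cross term for least squares
        x = x_next
    return x, v, s / v

def ofu_control(theta_true, candidates, episodes=5, beta=25.0, seed=0):
    rng = random.Random(seed)
    omega = list(candidates)
    history, x, length = [], 1.0, 50
    for _ in range(episodes):
        # Optimistic choice: with unit noise variance, J*(theta) equals k.
        theta = min(omega, key=lambda th: riccati_gain(*th)[0])
        gain = riccati_gain(*theta)[1]
        x, v, d_hat = run_episode(theta_true, gain, x, length, rng)
        # Keep candidates whose closed loop under this gain matches the data.
        kept = [th for th in omega
                if v * (th[0] + th[1] * gain - d_hat) ** 2 <= beta]
        omega = kept or omega          # guard used only in this sketch
        history.append((theta, len(omega)))
        length *= 2
    return history

grid = [(a, b) for a in (1.2, 1.5, 1.8) for b in (0.8, 1.0, 1.2)]
history = ofu_control((1.5, 1.0), grid)
for theta, remaining in history:
    print("applied feedback of", theta, "candidates left:", remaining)
```

The pruning step mirrors (23) and (24): a candidate survives only if its implied closed-loop coefficient under the applied gain is within the confidence radius of the least-squares estimate, and the surviving set can only shrink from one episode to the next.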

###### Remark 3.

The choice of $\tilde{\theta}^{(i)}$ does not need to be as extreme as (19). In fact, it suffices to satisfy

$$J^\star(\tilde{\theta}^{(i)}) \le (\tau_i - \tau_{i-1})^{-1/2} + \inf_{\theta \in \Omega^{(i-1)}} J^\star(\theta).$$

The following result states that the performance of the above adaptive control algorithm is optimal, apart from a logarithmic factor. Theorem 2 also provides the effect of the degree of heaviness of the noise distribution (denoted by $\alpha$ in Assumption 1) on the regret. Compared to $O(\cdot)$, the notation $\tilde{O}(\cdot)$ used below hides the logarithmic factors.

###### Theorem 2 (Regret bound).

For bounded $\Omega^{(0)}$, with probability at least , the regret of Algorithm 1 satisfies:

$$R(T) \le \tilde{O}\left(T^{1/2} (-\log \delta)^{\frac{1}{2} + \frac{2}{\alpha}}\right).$$

###### Proof of Theorem 2.

The stabilizing set is assumed to be bounded, so let

 (25)

Suppose that for , the parameter is being used to design the adaptive linear feedback . So, during every episode, does not change, and for we have .

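Letting , the infinite horizon dynamic programming equations are

$$J^\star(\theta_t) + x(t)' K(\theta_t) x(t) = x(t)' Q x(t) + u(t)' R u(t) + \mathbb{E}\left[y(t+1)' K(\theta_t) y(t+1) \,\big|\, \mathcal{F}_t\right],$$

where , and

$$y(t+1) = A_t x(t) + B_t u(t) + w(t+1) = \theta_t \tilde{L}(\theta_t) x(t) + w(t+1) \tag{26}$$

is the desired dynamics of the system. Note that since the true evolution of the system is governed by $\theta_0$, the next state is in fact

$$x(t+1) = A_0 x(t) + B_0 u(t) + w(t+1) = \theta_0 \tilde{L}(\theta_t) x(t) + w(t+1). \tag{27}$$

Substituting (26) and (27) in the dynamic programming equation, and using (2) for the instantaneous cost $c_t$, we have

$$\begin{aligned}
J^\star(\theta_t) + x(t)' K(\theta_t) x(t) &= c_t + \mathbb{E}\left[w(t+1)' K(\theta_t) w(t+1) \,\big|\, \mathcal{F}_t\right] + x(t)' \tilde{L}(\theta_t)' \theta_t' K(\theta_t) \theta_t \tilde{L}(\theta_t) x(t) \\
&= c_t + \mathbb{E}\left[x(t+1)' K(\theta_t) x(t+1) \,\big|\, \mathcal{F}_t\right] + x(t)' \tilde{L}(\theta_t)' \left[\theta_t' K(\theta_t) \theta_t - \theta_0' K(\theta_t) \theta_0\right] \tilde{L}(\theta_t) x(t).
\end{aligned}$$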

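Adding up the terms for $t = 1, \cdots, T$, we get

$$R(T) = \sum_{t=1}^{T} \left[c_t - J^\star(\theta_0)\right] = T_1 + T_2 + T_3 + T_4, \tag{28}$$

where

$$\begin{aligned}
T_1 &= \sum_{t=1}^{T} \left[J^\star(\theta_t) - J^\star(\theta_0)\right], \\
T_2 &= \sum_{t=1}^{T} \left(x(t)' K(\theta_t) x(t) - \mathbb{E}\left[x(t+1)' K(\theta_{t+1}) x(t+1) \,\big|\, \mathcal{F}_t\right]\right), \\
T_3 &= \sum_{t=1}^{T} \mathbb{E}\left[x(t+1)' \left(K(\theta_{t+1}) - K(\theta_t)\right) x(t+1) \,\big|\, \mathcal{F}_t\right], \\
T_4 &= \sum_{t=1}^{T} x(t)' \tilde{L}(\theta_t)' \left[\theta_0' K(\theta_t) \theta_0 - \theta_t' K(\theta_t) \theta_t\right] \tilde{L}(\theta_t) x(t).
\end{aligned}$$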

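Let $m(T)$ be the number of episodes considered until time $T$. Thus,

$$\tau_{m(T)} \le T < \tau_{m(T)+1}.$$

Now, letting $n_i$ be the length of episode $i$, define the following events

$$\begin{aligned}
\mathcal{G} &= \bigcap_{i=1}^{\infty} \left\{\max_{\tau_{i-1} \le t < \tau_i} \|w(t)\|_\infty \le \nu_{n_i}\left(\frac{\delta}{i^2}\right)\right\}, \\
\mathcal{H} &= \bigcap_{i=1}^{\infty} \left\{\theta_0 \in \Omega^{(i)}\right\}.
\end{aligned}$$

According to Corollary 1,

$$\mathbb{P}(\mathcal{G} \cap \mathcal{H}) \ge 1 - \sum_{i=1}^{\infty} \frac{3\delta}{i^2} \ge 1 - 5\delta. \tag{29}$$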
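The constant 5 in (29) follows from the value of the Basel sum; as a quick check:

```latex
\sum_{i=1}^{\infty} \frac{3\delta}{i^{2}}
  = 3\delta \cdot \frac{\pi^{2}}{6}
  = \frac{\pi^{2}}{2}\, \delta
  \approx 4.93\, \delta
  \le 5\delta .
```

This is why the per-episode failure probabilities are chosen proportional to $\delta / i^2$: they remain summable over infinitely many episodes.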

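For all $i$, as long as $\theta_0 \in \Omega^{(i-1)}$, according to (19) we have $J^\star(\tilde{\theta}^{(i)}) \le J^\star(\theta_0)$; i.e. $J^\star(\theta_t) \le J^\star(\theta_0)$. Therefore, on $\mathcal{H}$ we have

$$T_1 \le 0. \tag{30}$$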

To conclude the proof of the theorem we use the following lemmas for which the proofs are provided in the appendix.

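###### Lemma 7 (Bounding $T_2$).

On , the following holds with probability at least :

$$T_2 \le \rho_2 + (8T)^{1/2} \rho_3 \left(\log(T m(T))\right)^{2/\alpha} (-\log \delta)^{1/2 + 2/\alpha},$$

where $\rho_2, \rho_3$ are constants.

###### Lemma 8 (Bounding $T_3$).

On , we have

$$T_3 \le \rho_3 \left(\log(T m(T))\right)^{2/\alpha} (-\log \delta)^{2/\alpha} m(T),$$

where $\rho_3$ is the same as in Lemma 7.

###### Lemma 9 (Bounding $T_4$).

On the event , it holds that

$$T_4 \le \rho_4 \, m(T)^{3/2} \, \beta_T\left(\frac{\delta}{m(T)^2}\right)^{1/2} T^{1/2},$$

for some constant $\rho_4$.

###### Lemma 10 (Bounding $m(T)$).

On the event the following holds:

$$m(T) \le \frac{q}{\log \gamma} \log\left(\frac{T (\gamma^{1/q} - 1)}{\tau_1} + 1\right).$$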
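The logarithmic growth of the number of episodes can be checked numerically. The sketch below generates episode end times with geometrically growing lengths, reading (20) as $\tau_i = \tau_{i-1} + \gamma^{i/q}(N + 1)$, and compares the episode count with the bound of Lemma 10. As a simplifying assumption, $N$ is replaced by a constant `base_length`, whereas in the text it depends on $i$ through $\delta/i^2$; the values of `gamma`, `q`, and the horizon are hypothetical.

```python
import math

def episode_ends(gamma, q, base_length, horizon):
    """End times tau_i = tau_{i-1} + gamma**(i/q) * (base_length + 1), up to horizon."""
    taus, tau, i = [], 0.0, 0
    while True:
        i += 1
        tau += gamma ** (i / q) * (base_length + 1)
        if tau > horizon:
            return taus
        taus.append(tau)

def episode_count_bound(gamma, q, tau_1, horizon):
    """Upper bound on the number of completed episodes, as in Lemma 10."""
    return q / math.log(gamma) * math.log(
        horizon * (gamma ** (1 / q) - 1) / tau_1 + 1)

gamma, q, base_length, horizon = 2.0, 3, 20, 10_000
taus = episode_ends(gamma, q, base_length, horizon)
bound = episode_count_bound(gamma, q, taus[0], horizon)
print("episodes completed:", len(taus), "bound:", bound)
```

With a constant `base_length` the end times form a geometric series, so the bound is tight up to rounding; an increasing $N$ only lengthens the episodes and therefore reduces the count further.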

Finally, the definition of $\beta_n(\delta)$ in (18) yields

$$\beta_n(\delta) = O\left((\log n)^{4/\alpha} (-\log \delta)^{1 + 4/\alpha}\right).$$

Therefore, plugging (30), and the results of Lemmas 7, 8, 9, and 10 into (28), we get

$$R(T) \le \tilde{O}\left(T^{1/2} (-\log \delta)^{\frac{1}{2} + \frac{2}{\alpha}}\right),$$

with probability at least on . Hence, according to (29), the failure probability is at most , which completes the proof of Theorem 2. ∎
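To see how the individual bounds combine into the stated rate, one can track the order of each term. The following is a sketch that treats the $\rho$'s as constants and absorbs all factors that are logarithmic in $T$ into $\tilde{O}$:

```latex
\begin{align*}
m(T) &= O(\log T), \qquad
\beta_T\Big(\frac{\delta}{m(T)^2}\Big)^{1/2}
  = \tilde{O}\Big((-\log\delta)^{\frac{1}{2} + \frac{2}{\alpha}}\Big), \\
T_2 &= \tilde{O}\Big(T^{1/2} (-\log\delta)^{\frac{1}{2} + \frac{2}{\alpha}}\Big), \qquad
T_3 = \tilde{O}\Big((-\log\delta)^{\frac{2}{\alpha}}\Big), \qquad
T_4 = \tilde{O}\Big(T^{1/2} (-\log\delta)^{\frac{1}{2} + \frac{2}{\alpha}}\Big).
\end{align*}
```

Since $T_1 \le 0$ and $T_3$ is only polylogarithmic in $T$, the terms $T_2$ and $T_4$ dominate and both carry the factor $T^{1/2} (-\log\delta)^{1/2 + 2/\alpha}$.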

## V Conclusion

We studied adaptive regulation schemes for linear dynamical systems with quadratic costs, focusing on finite time analysis. Using the Optimism in the Face of Uncertainty principle, we established non-asymptotic optimality results under mild assumptions of stabilizability and fairly general heavy-tailed noise distributions.

There are a number of interesting extensions of the current work. First, generalizing the non-asymptotic analysis of optimality to imperfect observations of the state vector is a topic of future investigation. Another interesting direction is to specify the sufficient and necessary conditions on the true dynamics which lead to optimality of Certainty Equivalence. Moreover, approaches addressing learning challenges such as consistency toward the true dynamics parameter, as well as those of interest for network systems (e.g. high-dimensional settings assuming sparsity), can be listed as interesting subjects to be addressed in the future.

Proofs of Auxiliary Results

###### Proof of Lemma 1.

When applying the linear feedback , the closed-loop transition matrix will be . Letting