 Adaptive regulation of linear systems represents a canonical problem in stochastic control. Performance of adaptive control policies is assessed through the regret with respect to the optimal regulator, that reflects the increase in the operating cost due to uncertainty about the parameters that drive the dynamics of the system. However, available results in the literature do not provide a sharp quantitative characterization of the effect of the unknown dynamics parameters on the regret. Further, there are issues on how easy it is to implement the adaptive policies proposed in the literature. Finally, results regarding the accuracy that the system's parameters are identified are scarce and rather incomplete. This study aims to comprehensively address these three issues. First, by introducing a novel decomposition of adaptive policies, we establish a sharp expression for the regret of an arbitrary policy in terms of the deviations from the optimal regulator. Second, we show that adaptive policies based on a slight modification of the widely used Certainty Equivalence scheme are optimal. Specifically, we establish a regret of (nearly) square-root rate for two families of randomized adaptive policies. The presented regret bounds are obtained by using anti-concentration results on the random matrices employed when randomizing the estimates of the unknown dynamics parameters. Moreover, we study the minimal additional information needed on dynamics matrices for which the regret will become of logarithmic order. Finally, the rate at which the unknown parameters of the system are being identified is specified for the proposed adaptive policies.

## 1 Introduction

This work addresses the problem of designing adaptive policies (regulators) for the following standard Linear-Quadratic (LQ) system. Given the initial state x(0), for t = 0, 1, 2, ⋯ the system evolves as

 x(t+1) = A0 x(t) + B0 u(t) + w(t+1), (1)
 ct(^π) = x(t)′ Q x(t) + u(t)′ R u(t), (2)

where the vector x(t) corresponds to the state (and also output) of the system at time t, u(t) is the control action, and {w(t)} denotes a sequence of noise (i.e. random disturbance) vectors. Further, Q and R are the positive definite matrices of the quadratic operating cost. The instantaneous cost function of the adaptive policy ^π is denoted by ct(^π), with ′ denoting the transpose of vectors and matrices. The dynamics of the system, i.e. both the transition matrix A0 and the input matrix B0, are fixed and unknown, while Q, R are assumed known. The broad objective is to adaptively regulate the system in order to minimize the long-term average cost.
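As a concrete illustration (not taken from the paper), the dynamics (1) and cost (2) can be simulated under a fixed linear feedback; all matrices below are illustrative placeholders, with Gaussian noise standing in for the disturbance process.

```python
# Minimal simulation sketch of the LQ system (1)-(2) under a fixed linear
# feedback u(t) = L x(t). All matrices are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A0 = np.array([[0.9, 0.2], [0.0, 0.8]])  # transition matrix (unknown in the paper)
B0 = np.array([[1.0], [0.5]])            # input matrix (unknown in the paper)
Q, R = np.eye(2), np.eye(1)              # known cost matrices
L = np.array([[-0.3, -0.2]])             # some stabilizing feedback

x = np.zeros(2)
costs = []
for t in range(2000):
    u = L @ x                            # control action
    costs.append(x @ Q @ x + u @ R @ u)  # instantaneous cost c_t, as in (2)
    x = A0 @ x + B0 @ u + rng.normal(size=2)  # state update, as in (1)

avg_cost = np.mean(costs)                # finite for a stabilized system
```

Because the chosen feedback stabilizes the closed loop, the average cost settles to a finite value, in line with the long-term average-cost objective described above.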

Although LQ system regulation represents a canonical problem in optimal control, adaptive policies have not been adequately studied in the literature. In fact, a large number of classical papers focus on the simpler setting of adaptive tracking, where the objective is to adaptively steer the system to track a reference trajectory [1, 2, 3, 4, 5, 6, 7, 8, 9]. Because in that setting the operating cost is not directly a function of the control signal, its technical analysis is different and less involved. For general LQ systems, however, both the state and the control signals impact the operating cost. The adaptive Linear-Quadratic Regulator (LQR) problem has been studied in the literature [10, 11, 12, 13, 14, 15, 16, 17], but there are still gaps that the present work aims to fill by addressing cost optimality, parameter estimation (identification), and the trade-off between exploration and exploitation.

Since the system’s dynamics are unknown, learning the key parameters is needed for planning an optimal regulation policy. However, the operator needs to apply some control action at the same time, in order to collect data (observations) for parameter estimation. A popular approach to designing an adaptive regulator is Certainty Equivalence (CE). Intuitively, its prescription is to apply a control policy as if the estimated parameters are the true ones guiding the system’s evolution.

In general, the non-optimality (as well as the inconsistency) of CE [12, 19, 20] has led researchers to consider several modifications of the CE approach. One idea is to use the principle of Optimism in the Face of Uncertainty (OFU) [13, 14, 15], also known as betting on the best, or the cost-biased approach. It applies the optimal regulator as if an optimistic approximation of the unknown matrices were the truth governing the system’s dynamics. Another idea is to replace the point estimate of the system parameters by a posterior distribution obtained through Bayes’ theorem, by integrating a prior distribution and the likelihood of the data collected so far. One then draws a sample from this posterior distribution and applies the optimal policy as if the system evolves according to the sampled dynamics matrices. This approach is known as Thompson Sampling (TS) or posterior sampling [16, 17]. Further modifications to CE adaptive LQRs are based on
(i) either perturbing the control signal with a diminishing randomization [7, 8, 9, 11],
(ii) or imposing an identifiability assumption on the unknown dynamics matrices [1, 4].

Note that from an exploitation viewpoint, most of the existing work in the literature is purely asymptotic in nature and establishes the convergence of the adaptive average cost to the optimal expected average cost. It covers adaptive LQRs based on the OFU principle [10, 12], as well as those based on random perturbations applied to continuous-time Itô processes. However, results on the speed of convergence are rare and rather incomplete. On the other hand, considering the exploration aspect, consistency of parameter estimates is lacking for general dynamics matrices [22, 23]. Moreover, rates for the identification of system parameters are only provided for tracking problems [8, 9]. Indeed, the identification rate for the matrices describing the system’s dynamics is not available for general LQ systems.

In most real applications, the effective time horizon is practically finite, which restricts the relevance of the aforementioned asymptotic analyses. Thus, addressing the optimality of an adaptive regulation procedure under more sensitive criteria is needed. For this purpose, one needs to comprehensively examine the regret (the cumulative deviation from the optimal policy, as defined in (7)). Regret analyses are thus far limited to recent work addressing OFU adaptive policies [13, 14, 15], and results for TS obtained under restricted conditions [16, 17]. One issue with OFU is the computational intractability of finding an optimistic approximation of the true dynamics parameters: one needs to repeatedly solve a non-convex matrix optimization problem. More importantly, we show that the regret bounds established for OFU and TS can be achieved [13, 15], or improved [14, 16, 17], through a considerably simpler class of adaptive regulators.

A key contribution of this work is a remarkably general result to address the performance of different policies. Namely, tailoring a novel method for regret decomposition, we leverage some results from martingale theory to establish Theorem 3.1. It provides a sharp expression for the regret of arbitrary adaptive regulators in terms of the deviations from the optimal feedback. Leveraging the above, we analyze two families of CE-based adaptive policies.

First, we show that the growth rate of the regret is (nearly) square-root in time (of the interaction with the system), assuming that the CE regulator is properly randomized. Performance analyses are presented for both common approaches of additive randomization and posterior sampling. Then, the adaptive LQR problem is discussed when additional information (regarding the unknown dynamics parameters of the system) is available. In this case, a logarithmic rate for the regret of generalizations of CE adaptive policies is established, assuming that the available side information satisfies an identifiability condition. Examples of side information include constraints on the rank or the support of unknown matrices driving the system’s dynamics, that in turn lead to optimality of the linear feedback regulator, if the closed-loop matrix is accurately estimated. Note that to the best of our knowledge, comprehensive regret analyses of CE-based adaptive LQRs appear in this work for the first time. Further, the identification performance of the corresponding adaptive regulators is also addressed.

The remainder of the paper is organized as follows. The problem is formulated in Section 2. Then, we address the optimality performance of general adaptive policies in Subsection 3.1. Subsequently, the consistency of estimating the dynamics parameter is given in Subsection 3.2. In Section 4, we study two randomization schemes based on method (i) above. The growth rate of the regret, as well as the accuracy of parameter estimation are also examined. Finally, in Section 5 we study conditions similar to the aforementioned method (ii). A general identifiability condition leading to significant improvements in both the regret, as well as the accuracy of identification is discussed.

[Stochastic statements] All probabilistic equalities and inequalities throughout this paper hold almost surely, unless otherwise explicitly mentioned.

### 1.1 Notations

The following notation will be used throughout this paper. For a matrix A, A′ denotes its transpose. For a square matrix A, the smallest (respectively largest) eigenvalue of A in magnitude is denoted by λmin(A) (respectively λmax(A)). For a vector v, ||v|| denotes its Euclidean norm, and |||A||| denotes the operator norm of a matrix A. To denote the dimension of a manifold M, we use dim(M). Finally, to show the order of magnitude we use the notations O(·), Ω(·), and ≍. Namely,

 an = O(bn) ⇔ limsup_{n→∞} |an / bn| < ∞,
 an = Ω(bn) ⇔ limsup_{n→∞} |bn / an| < ∞,
 an ≍ bn ⇔ an = O(bn) and an = Ω(bn).

## 2 Problem Formulation

We start by defining the adaptive LQR problem this work addresses. The stochastic evolution of the system is governed by the linear dynamics (1), where for all t, w(t) is the vector of random disturbances, satisfying:

 E[w(t)] = 0, E[w(t) w(t)′] = C, |λmin(C)| > 0.

For the sake of simplicity, the noise vectors are assumed to be independent over time. This assumption is made to simplify the presentation of the results; nevertheless, all results established in this study hold for martingale difference sequences, and their generalization is conceptually straightforward. Further, the following mild and widely used moment condition on the noise process is assumed in this work. [Moment condition] There is a positive real number α > 4, such that moments of order α exist: E[||w(t)||^α] < ∞. In addition, we assume that the true dynamics of the underlying system are stabilizable, a minimal assumption for positing a well-posed optimal control problem. [Stabilizability] The true dynamics A0, B0 satisfy the following: there exists a stabilizing linear feedback L such that

 |λmax(A0+B0L)|<1.
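For given matrices, this stabilizability condition is easy to verify numerically; a minimal sketch (the matrices A0, B0, L below are illustrative, not from the paper):

```python
# Checking the stabilizability condition |λmax(A0 + B0 L)| < 1 numerically.
# The matrices here are illustrative assumptions.
import numpy as np

A0 = np.array([[0.9, 0.2], [0.0, 0.8]])
B0 = np.array([[1.0], [0.5]])
L = np.array([[-0.3, -0.2]])

closed_loop = A0 + B0 @ L                        # closed-loop transition matrix
spectral_radius = max(abs(np.linalg.eigvals(closed_loop)))
is_stabilizing = spectral_radius < 1             # True here: eigenvalues 0.6 and 0.7
```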

Note that the assumption above implies the following asymptotic stabilizability in the average sense:

 limsup_{n→∞} (1/n) Σ_{t=0}^{n−1} ||x(t)||^2 < ∞. (3)

[Notation] Henceforth, for notational convenience, for A ∈ R^{p×p}, B ∈ R^{p×r}, we use θ to denote [A, B] ∈ R^{p×q}, where q = p + r. Clearly, θ ~L = A + BL, where ~L = [Ip, L′]′ ∈ R^{q×p}.

We assume perfect observations; i.e. the output of the system corresponds to the state vector itself. Next, an admissible control policy is a mapping π that designs the control action according to the dynamics matrices A0, B0, the cost matrices Q, R, and the history of the system:

 u(t) = π(A0, B0, Q, R, {x(i)}_{i=0}^{t}, {u(j)}_{j=0}^{t−1}),

for all t = 0, 1, 2, ⋯. An adaptive policy ^π is oblivious to the dynamics parameter θ0; i.e.

 u(t) = ^π(Q, R, {x(i)}_{i=0}^{t}, {u(j)}_{j=0}^{t−1}).

When applying the policy π, the resulting instantaneous quadratic cost at time t, defined according to (2), is denoted by ct(π). For an arbitrary policy π, let ¯Jπ(A0, B0) denote the expected average cost of the system:

 ¯Jπ(A0, B0) = limsup_{n→∞} (1/n) Σ_{t=0}^{n−1} E[ct(π)].

Note that the dependence of ¯Jπ on the known cost matrices Q, R is suppressed. Then, the optimal expected average cost is defined as

 J⋆(A0, B0) = min_π ¯Jπ(A0, B0),

where the minimum is taken over all admissible control policies. The following proposition provides an optimal linear feedback for minimizing the expected average cost, based on the corresponding Riccati equation. To this end, define the linear time-invariant policy π⋆ according to (4), (5):

 π⋆: u(t) = L(θ0) x(t), t = 0, 1, 2, ⋯. (6)

[Optimal policy π⋆] If θ0 is stabilizable, (4) has a unique solution, and the policy π⋆ defined in (6) is an optimal regulator; i.e. ¯Jπ⋆(A0, B0) = J⋆(A0, B0). Conversely, if (4) has a solution, the linear feedback defined by (5) is a stabilizer. In the latter case, the solution is unique and π⋆ is an optimal regulator. Note that although π⋆ is the only optimal policy amongst the time-invariant feedbacks, there are uncountably many time-varying optimal controllers.
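Numerically, the optimal feedback of (5)-(6) can be approximated by iterating the Riccati recursion to a fixed point; the sketch below uses illustrative matrices, and the fixed-point (value) iteration stands in for solving (4) exactly.

```python
# Sketch: approximating the optimal linear feedback by iterating the
# discrete Riccati recursion to convergence. Matrices are illustrative.
import numpy as np

A0 = np.array([[0.9, 0.2], [0.0, 0.8]])
B0 = np.array([[1.0], [0.5]])
Q, R = np.eye(2), np.eye(1)

P = Q.copy()
for _ in range(500):                         # value iteration for the Riccati equation
    K = np.linalg.solve(R + B0.T @ P @ B0, B0.T @ P @ A0)
    P = Q + A0.T @ P @ (A0 - B0 @ K)

L_opt = -K                                   # optimal feedback: u(t) = L_opt x(t)
rho = max(abs(np.linalg.eigvals(A0 + B0 @ L_opt)))  # closed loop is stable
```

Since the pair above is stabilizable and Q is positive definite, the iteration converges to the unique stabilizing Riccati solution, and the resulting closed-loop matrix has spectral radius below one.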

To rigorously set the stage, we write a linear regulator as u(t) = Lt x(t), where Lt is a matrix determined according to the information available at time t. Note that for a time-invariant policy, we use L and Lt interchangeably. For an adaptive operator, the dynamics matrices are not exactly known; hence, an adaptive policy ^π constitutes the linear feedbacks u(t) = Lt x(t), where Lt is required to be determined according to Q, R, and the history {x(i)}_{i=0}^{t}, {u(j)}_{j=0}^{t−1}. In order to measure the quality of an arbitrary regulator π, the resulting instantaneous cost will be compared to that of the optimal policy π⋆ defined in (6). Specifically, the regret of policy π at time n is defined as

 Rn(π) = Σ_{t=0}^{n−1} [ct(π) − ct(π⋆)]. (7)

The comparison between adaptive control policies is made according to regret, which is the cumulative deviation of the instantaneous cost of the corresponding adaptive policy from the linear time-invariant optimal policy.
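In simulation, the regret (7) can be estimated by rolling out the compared policy and the optimal feedback on the same noise path (a common coupling device). Everything below is an illustrative assumption; in particular, L_opt is only a stand-in for L(θ0), not the exact optimizer.

```python
# Sketch: estimating the regret (7) by coupling both policies to one noise path.
import numpy as np

A0 = np.array([[0.9, 0.2], [0.0, 0.8]])
B0 = np.array([[1.0], [0.5]])
Q, R = np.eye(2), np.eye(1)
L_opt = np.array([[-0.55, -0.33]])     # stand-in for L(θ0) (assumed)
L_sub = np.array([[-0.30, -0.20]])     # some other stabilizing feedback

def rollout_costs(L, noise):
    """Per-step costs c_t under feedback u(t) = L x(t) and a fixed noise path."""
    x, costs = np.zeros(2), []
    for w in noise:
        u = L @ x
        costs.append(x @ Q @ x + u @ R @ u)
        x = A0 @ x + B0 @ u + w
    return np.array(costs)

noise = np.random.default_rng(1).normal(size=(1000, 2))
regret = np.cumsum(rollout_costs(L_sub, noise) - rollout_costs(L_opt, noise))
```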

An analogous expression for the regret was previously used for the problem of adaptive tracking [1, 2]. The alternative definition of the regret used in the existing literature [13, 14, 15, 16, 17] is the cumulative deviation from the optimal expected average cost J⋆(θ0):

 Σ_{t=0}^{n−1} [ct(π) − J⋆(θ0)].

The expression above differs from the regret defined in (7) by the term Σ_{t=0}^{n−1} ct(π⋆) − n J⋆(θ0). The following result addresses the magnitude of this difference term. Assuming the fourth moment of the noise is finite, the following holds:

 Σ_{t=0}^{n−1} ct(π⋆) − n J⋆(θ0) = O(n^{1/2} log n).

Therefore, the two definitions of the regret are interchangeable, as long as one can establish an upper bound of magnitude n^{1/2} (modulo a logarithmic factor) for either definition. However, defining the regret by (7) leads to more accurate analyses and tighter results than the aforementioned alternative (e.g. the reciprocal specification of Theorem 3.1, and the logarithmic rate of Theorem 5). Note that the fourth moment condition of Proposition 2 holds for the sub-Weibull distributions considered in the relevant literature [13, 14, 15, 16, 17].

To proceed, we introduce another notation that simplifies certain expressions throughout the rest of the paper. [Notation] For arbitrary stabilizable θ = [A, B], let J⋆(θ) = J⋆(A, B). Therefore, J⋆(θ0) = J⋆(A0, B0).

Next, we study the properties of general adaptive regulators. First, from an exploitation viewpoint, we examine in Subsection 3.1 the regret of arbitrary linear policies. Then, from an exploration viewpoint, consistency of parameter estimation is considered in Subsection 3.2.

### 3.1 Regulation

The main result of this subsection provides an expression for the regret of an arbitrary (i.e. either adaptive or non-adaptive) policy. According to the following theorem, the regret of the regulator is of the same order as the sum of the squared deviations of the linear feedbacks Lt from L(θ0). Note that this is stronger than the previously known result that expressed the regret in terms of the (unsquared) deviations from L(θ0) [13, 14, 15, 16, 17]. As will be shown shortly, this difference completely changes the nature of both the informational lower bound and the operational upper bound for the regret. [Regret specification] Suppose that π is a linear policy u(t) = Lt x(t). Then, we have

 Rn(π) ≍ Σ_{t=0}^{n−1} ||(L(θ0) − Lt) x(t)||^2.

Note that the expression above for the regret is remarkably general, since the policy π does not need to meet any condition. Before examining the intuition behind the result of Theorem 3.1, we discuss its application to the sharp specification of the performance of arbitrary adaptive regulators.

The immediate consequence of Theorem 3.1 is a tight upper bound for the regret of an adaptive policy, in terms of the linear feedbacks. Indeed, since the presented result is a reciprocal one, it also provides a general information theoretic lower bound for the regret of an adaptive regulator. For stabilized dynamics, it is shown that the smallest estimation error achievable with a sample of size t decays as t^{−1/2}. Thus, at time t, the error in the identification of the unknown dynamics parameter is at least of the same order. Therefore, for the minimax growth rate of the regret, Theorem 3.1 implies a lower bound of order Σ_{t=1}^{n} t^{−1} ≍ log n, which is of logarithmic order.

In other words, for an arbitrary adaptive policy ^π, the regret grows at least logarithmically in n. In general, the information theoretic lower bound above is not known to be operationally achievable. Broadly speaking, this is due to the common exploitation-exploration dilemma. We will discuss the reasoning behind the presence of such a gap in Section 4, which leads to an operational lower bound of square-root rate of growth. Nevertheless, in Section 5 we discuss settings where the availability of some side information leads to an achievable regret of logarithmic order.

Next, we briefly explain the nature of the general statement of Theorem 3.1. The expression reveals a feature which is essentially similar to revocability or memorylessness. In fact, since the dynamics of the system follow (1), the influence of a non-optimal linear feedback lasts forever. In other words, if at some point the state vector deviates from the optimal trajectory (the one obtained by applying π⋆), the control action at the next time step cannot undo this deviation. This is similar to the concept of irrevocability in the theory of bandits with dependent state evolutions. However, according to Theorem 3.1, the cumulative growth of the operating costs for control actions that deviate from the optimal one is dominated by the magnitude of the (squared) deviations themselves.

Hypothetically, suppose that at some time, the operator stops using non-optimal linear feedbacks. According to Theorem 3.1, the regret will not grow anymore, since Lt = L(θ0) from that time onward. However, the effect of the non-optimal actions cannot actually be compensated, because the state vectors have deviated from the optimal trajectory. Therefore, the regret has a memoryless flavor: the effect of a non-optimal policy is reflected only through the evolution of the sequence of state vectors up to that time point, and not through the history's effect in the future.

For a stabilized system, the magnitude of the state vector remains finite (and indeed bounded in the average sense, as in (3)). Hence, the influence of the non-optimal history on the cumulative cost is bounded. So, as soon as the operator switches to the optimal feedback, the system forgets the non-optimal actions applied previously. Note that the stabilization of the system is addressed in the existing literature, using a randomization (of sufficient dimension) of the linear regulator [27, 24]. In the next section, we review how adding a relatively small randomization automatically leads to system stabilization.

### 3.2 Identification

Another consideration for an adaptive policy is the exploration problem. Since in general the operator has no knowledge of the dynamics parameter θ0, a natural question is that of learning θ0, in addition to examining cost optimality. In this subsection, we address the asymptotic estimation consistency of general adaptive policies; that is, a rigorous formulation of the relationship between the information learnable through observing the state of the system, and the optimal regulation one needs to plan for.

On one hand, for a linear feedback L, the best one can do by observing the state vectors is “closed-loop identification” [5, 15]; i.e. accurately estimating the closed-loop matrix A0 + B0L. On the other hand, an adaptive policy is at least desired to provide a sub-linear regret;

 limsup_{n→∞} Rn(^π)/n = 0. (8)

The above two aspects of an adaptive policy determine the asymptotic uncertainty about the true dynamics parameter θ0. By the uniqueness of the optimal feedback according to Proposition 2, the linear feedbacks of the adaptive policy are required to converge to L(θ0). Further, the limit θ∞ of the parameter estimates determines the asymptotic closed-loop matrix θ∞ ~L(θ0), which according to (8) is supposed to coincide with the true closed-loop matrix θ0 ~L(θ0). Putting the above together, the asymptotic uncertainty is reduced to the set of parameters θ∞ that satisfy

 L(θ∞) = L(θ0), (9)
 θ∞ ~L(θ0) = θ0 ~L(θ0). (10)

In order to rigorously analyze this uncertainty, we introduce some additional notation. First, for an arbitrary stabilizable θ1, introduce the shifted null-space N(θ1) of the linear transformation θ ↦ θ ~L(θ1) as:

 N(θ1) = {θ ∈ R^{p×q} : θ ~L(θ1) = θ1 ~L(θ1)}. (11)

Hence, N(θ1) is indeed the set of parameters θ such that the closed-loop transition matrices of two systems with dynamics parameters θ, θ1 coincide, if the optimal linear regulator (5) calculated for θ1 is applied. So, if the operator regulates the system by the feedback L(θ1), the parameters in N(θ1) cannot be distinguished from θ1. In other words, N(·) captures the learning capability of adaptive regulators. We then define the planning counterpart of adaptive policies as follows. For an arbitrary stabilizable θ1, define S(θ1) as the level-set of the optimal controller function (5), which maps θ1 to L(θ1):

 S(θ1) = {θ ∈ R^{p×q} : L(θ) = L(θ1)}. (12)

Therefore, S(θ1) is in fact the set of parameters θ such that the calculation of the optimal linear regulator (5) provides the same feedback matrix for both θ and θ1. Intuitively, N(·) (respectively S(·)) reflects the learning (respectively planning) aspect of adaptive regulators: it specifies the adaptivity (respectively regulation) performance, i.e. the accuracy (respectively optimality) of the parameter estimation (respectively cost minimization) procedure. So, according to (9), (10), the asymptotic uncertainty about the true parameter θ0 is limited to the set

 P0 = S(θ0) ∩ N(θ0). (13)

In general, if the sublinearity of the regret in (8) holds for an adaptive policy, consistency is equivalent to P0 = {θ0}. The following result establishes the properties of P0; it is a generalization of some available results in the literature [22, 23]. Further, the result of Theorem 3.2 will be used later on to discuss the operational optimality of adaptive regulators, subject to the common trade-off between regulation and identification. [Consistency] The set P0 defined in (13) is a shifted linear subspace, whose dimension is determined by θ0. Therefore, consistency (of an adaptive policy with a sublinear regret) is automatically guaranteed only if a certain matrix determined by θ0 is of full rank, in which case P0 = {θ0}; in other words, effective exploitation suffices for consistent exploration only under this full-rank condition. For example, the sublinear regret bounds of OFU established in the literature [13, 15] imply consistency, assuming the full-rank condition holds. Note that the converse is always true: consistency of parameter estimation implies the sublinearity of the regret. Clearly, the full-rank condition holds for almost all parameters (with respect to the Lebesgue measure on R^{p×q}).

The classical idea for designing an adaptive policy is a greedy procedure known as Certainty Equivalence (CE). At every time n, its prescription is to apply the optimal regulator provided by (5), as if the estimated parameter ^θn coincides exactly with the true one θ0. According to the linear evolution (1), a natural estimation procedure is to linearly regress x(t+1) on the covariates ~L(^θt) x(t), using all observations collected so far; t = 0, ⋯, n−1.

Formally, the CE policy applies u(t) = L(^θt) x(t), where ^θn is a solution of the least-squares problem using the data observed until time n:

 ^θn ∈ argmin_{θ∈R^{p×q}} Σ_{t=0}^{n−1} ||x(t+1) − θ ~L(^θt) x(t)||^2.

The initial regulator can be chosen arbitrarily in order to just start running the system.
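A least-squares identification step of this kind can be sketched as follows: here x(t+1) is regressed on the stacked covariate [x(t)′, u(t)′]′, which is equivalent to the display above when u(t) = L(^θt) x(t). The small exploratory input perturbation and all matrices are assumptions, added purely so that the regression is well-posed.

```python
# Sketch: least-squares estimation of θ = [A0, B0] from state-input data.
import numpy as np

rng = np.random.default_rng(2)
A0 = np.array([[0.9, 0.2], [0.0, 0.8]])
B0 = np.array([[1.0], [0.5]])
L = np.array([[-0.3, -0.2]])

Z, X_next = [], []
x = np.zeros(2)
for t in range(5000):
    u = L @ x + 0.5 * rng.normal(size=1)   # perturbed input (for identifiability)
    x_new = A0 @ x + B0 @ u + rng.normal(size=2)
    Z.append(np.concatenate([x, u]))       # covariate z(t) = [x(t)', u(t)']'
    X_next.append(x_new)
    x = x_new

# Solve min_θ Σ ||x(t+1) − [A, B] z(t)||², i.e. X_next ≈ Z [A, B]'
theta_hat, *_ = np.linalg.lstsq(np.array(Z), np.array(X_next), rcond=None)
A_hat, B_hat = theta_hat.T[:, :2], theta_hat.T[:, 2:]
```

Without the input perturbation, u(t) is a deterministic function of x(t) and the regression above is degenerate; this is exactly the identifiability issue the surrounding discussion turns to next.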

The issue with CE is that it is capable of settling on a non-optimal regulation. In fact, when there is no additional information on θ0, CE can lead to linear regret. Technically, CE may fail to falsify an incorrect estimate of the true parameter θ0. Suppose that at time n, the hypothetical estimate of the true parameter is θ. When the linear feedback L(θ) is applied to the system, the true closed-loop transition matrix will be θ0 ~L(θ). Then, if this matrix is the same as the (falsely) assumed closed-loop transition matrix θ ~L(θ), the learning procedure fails to falsify θ. So, if θ fails to provide an optimal feedback, i.e. L(θ) ≠ L(θ0), the adaptive policy is not guaranteed to tend toward a better control action. Therefore, a non-optimal regulator will be persistently applied to the system.

Fortunately, if slightly modified, CE can avoid unfalsifiable approximations of the true parameters. More precisely, we show that the set of unfalsifiable parameters U(θ0) defined below has Lebesgue measure zero;

 U(θ0) = {θ ∈ R^{p×q} : θ0 ~L(θ) = θ ~L(θ)}. (14)

Note that according to (11), θ ∈ U(θ0) if and only if θ0 ∈ N(θ). Recalling the discussion in the previous section, N(·) captures the learning ability of adaptive regulators. From this viewpoint, the set U(θ0) contains the parameter matrices for which the hypothetically assumed closed-loop transition matrix is indistinguishable from the true one. The following lemma sets the stage for sublinearity of the regret; we subsequently show that it can be achieved if CE is sufficiently randomized. [Unfalsifiable set] The set U(θ0) defined in (14) has Lebesgue measure zero.

### 4.1 Randomized Certainty Equivalence

Leveraging Lemma 4, the operator can avoid the pathological set U(θ0). In fact, it suffices to randomize the least-squares estimator with a small (diminishing) perturbation. In order to evade U(θ0), such a perturbation needs to be continuously distributed over the parameter space R^{p×q}. Thus, the linear transformation θ ↦ θ ~L(^θn), as well as the null-space N(^θn), will be randomized. Note that, as discussed in the previous section, the estimation of the uncertain dynamics parameter is tied to accurate closed-loop identification through N(·).

In addition, Theorem 3.1 provides the cost optimality condition for the Randomized Certainty Equivalence (RCE) adaptive regulator: the magnitude of the random perturbation needs to diminish sufficiently fast. Indeed, while a larger perturbation improves exploration, exploitation requires it to be sufficiently small. Addressing this trade-off is the common dilemma of adaptive control. At the end of this section, we examine this trade-off based on the properties of least-squares estimation and the tight specification of the regret provided by Theorem 3.1.

Note also that the accuracy of estimating the transition matrix of a stabilized linear system scales at rate n^{−1/2}, where n is the number of observations. Therefore, updating the estimated parameter can be deferred until sufficiently more data are collected. This is related to the class of episodic algorithms, where the design of the adaptive regulator is updated after episodes of exponentially growing lengths. Here, we show that the randomization in RCE can be episodic as well. Thus, the calculation of the hypothetical optimal linear feedback by (5) occurs sparsely (i.e. O(log n) times), which remarkably reduces the computational cost of the adaptive regulator.

To formally define the RCE adaptive regulator, for m = 1, 2, ⋯, let {ϕm} be independent random matrices such that all entries of ϕm are independent standard Gaussian random variables. The sequence {ϕm} is used to randomize the parameter estimates. Further, an arbitrary reinforcement rate γ > 1 determines the lengths of the episodes as follows.

Start the RCE adaptive LQR by arbitrarily initializing ^θ0. Then, for each time n, apply u(n) = L(^θn) x(n). If n satisfies n = ⌊γ^m⌋ for some m = 1, 2, ⋯, we update the estimate ^θn by randomizing the least-squares estimate with the random perturbation ϕm. Otherwise, for ⌊γ^m⌋ < n < ⌊γ^{m+1}⌋ the policy does not update: ^θn = ^θ_{n−1}. Note that the distribution of ^θn is absolutely continuous with respect to the Lebesgue measure on R^{p×q}. Thus, Lemma 4 implies that ^θn ∉ U(θ0). Further, thanks to the randomization ϕm, the parameter ^θn is stabilizable (as well as controllable [28, 29]). Therefore, according to Proposition 2, the adaptive feedback L(^θn) is well defined. [Non-Gaussian Perturbation] The randomization matrices ϕm do not need to be Gaussian in general. In fact, it suffices to draw ϕm from an arbitrary distribution with bounded probability density function (pdf) over R^{p×q}, satisfying the moment condition below for some ϵ > 0:

 sup_{m≥1} E[|||ϕm|||^{4+ϵ}] < ∞. (15)

As mentioned before, the reinforcement rate γ determines the lengths of the episodes during which the algorithm uses the current parameter estimate, before reinforcing the estimation and updating the policy. Smaller values of γ correspond to shorter episodes, and thus to more updates and additional randomization; i.e. the smaller γ is, the better the exploration performance of RCE. Although such an improvement does not provide a better asymptotic rate for the regret, it speeds up convergence, and so is suitable if the actual time horizon is not very large. On the other hand, it increases the number of times the Riccati equation (5) needs to be solved. Therefore, in practice the operator can choose the value of γ according to the actual length of interaction with the system and the desired computational complexity. This is especially important if the evolution of the real-world plant under control requires the feedback policy to be updated quickly (compared to the time the operator needs to calculate the linear feedback).
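The episodic schedule can be sketched as follows; the update times n = ⌊γ^m⌋ are exponentially spaced, so only O(log n) estimate refreshes (and Riccati computations) occur up to horizon n. The perturbation scale used in the helper is purely an illustrative assumption, not the paper's exact prescription.

```python
# Sketch of RCE's episodic update times and Gaussian randomization.
import numpy as np

gamma = 2.0                                  # reinforcement rate γ > 1 (assumed)
horizon = 1000
rng = np.random.default_rng(3)

update_times = sorted({int(gamma ** m) for m in range(1, 64)
                       if int(gamma ** m) <= horizon})
# Exponentially spaced: only O(log horizon) many updates.

def randomized_estimate(theta_ls, m):
    """Randomize a least-squares estimate with the Gaussian matrix φ_m.
    The diminishing scale below is an illustrative assumption."""
    phi_m = rng.standard_normal(theta_ls.shape)
    return theta_ls + gamma ** (-m / 4.0) * phi_m
```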

The following theorem rigorously addresses the behavior of the RCE adaptive regulator. It shows that other adaptive policies such as OFU [13, 14, 15] do not provide a better rate for the regret, while imposing a large computational burden by requiring the solution of a matrix optimization problem at the end of each episode. Note that this insight strongly leverages the novel and tight specification of the regret presented in Theorem 3.1. [RCE rates] Suppose that the adaptive policy ^π is RCE, and let ^θn be the parameter estimate at time n. Then, we have

 Rn(^π) = O(n^{1/2} log n), |||^θn − θ0||| = O(n^{−1/4} log^{1/2} n).

### 4.2 Thompson Sampling

Another approach in the existing literature is Thompson Sampling (TS), which has the following Bayesian interpretation. Applying an initial random linear feedback to the system, TS updates the parameter estimate through posterior sampling. That is, the operator draws a realization from the Gaussian posterior whose mean and covariance matrix are determined by the data observed to date.

Formally, let Σ0 be a fixed positive definite matrix, and choose ^θ0 arbitrarily. Further, similar to RCE, fix the reinforcement rate γ > 1. Then, for each time n, we apply u(n) = L(^θn) x(n), where ^θn is designed as follows. If n satisfies n = ⌊γ^m⌋ for some m = 1, 2, ⋯, the matrix ^θn is drawn from a Gaussian distribution determined by μm, Σm, defined for m = 1, 2, ⋯ according to

 μm = argmin_{μ∈R^{p×q}} Σ_{t=0}^{⌊γ^m⌋−1} ||x(t+1) − μ ~L(^θt) x(t)||^2,
 Σm = Σ0 + Σ_{t=0}^{⌊γ^m⌋−1} ~L(^θt) x(t) x(t)′ ~L(^θt)′.

Namely, for i = 1, ⋯, p, the i-th row of ^θn is drawn independently from a multivariate Gaussian distribution with mean the i-th row of μm and covariance matrix Σm^{−1}. Otherwise, for ⌊γ^m⌋ < n < ⌊γ^{m+1}⌋ the policy does not update: ^θn = ^θ_{n−1}. Clearly, μm is the least-squares estimate, and Σm is the (unnormalized) empirical covariance of the data observed by the end of episode m.
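A single TS draw can be sketched as below: Σm accumulates outer products of the covariates, and each row of the sampled parameter is Gaussian with mean the corresponding row of the least-squares estimate μm and covariance Σm^{−1}. The covariates and μm here are synthetic placeholders, not quantities computed from a real run.

```python
# Sketch of one Thompson-Sampling draw of the dynamics parameter.
import numpy as np

rng = np.random.default_rng(4)
p, q = 2, 3                              # illustrative dimensions
Sigma0 = np.eye(q)                       # fixed positive definite Σ_0

Z = rng.standard_normal((200, q))        # stand-in covariates ~L(θ̂_t) x(t)
Sigma_m = Sigma0 + Z.T @ Z               # (unnormalized) empirical covariance
mu_m = rng.standard_normal((p, q))       # stand-in for the least-squares estimate

cov = np.linalg.inv(Sigma_m)             # posterior covariance Σ_m^{-1}
theta_sample = np.stack(
    [rng.multivariate_normal(mu_m[i], cov) for i in range(p)]
)                                        # rows drawn independently, as above
```

As the data grow, Σm grows and the posterior covariance shrinks, so the draws concentrate around the least-squares estimate, which is the diminishing-randomization behavior the analysis relies on.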

Unlike RCE, the randomization in the TS adaptive regulator is based on (the covariance matrix of) the observations of the state vectors and the control signals. Note that the central idea of randomized adaptive LQRs still holds: ^θn ∉ U(θ0). The following result establishes the rates of both regulation and identification for TS.

[TS rates] Let the adaptive policy ^π correspond to TS. Then, for the parameter estimate ^θn and the regret Rn(^π), we have

 Rn(^π) = O(n^{1/2} log^2 n), |||^θn − θ0||| = O(n^{−1/4} log n).

For TS-based adaptive LQRs, the Bayesian regret, namely the expected value of the regret where the expectation is taken under the assumed prior, has been shown to be of a similar magnitude. Of course, this heavily relies on a Gaussian prior imposed on the true θ0. Note that the worst-case (not Bayesian) regret itself was known to be of a larger magnitude (apart from a logarithmic factor). Therefore, Theorem 4.2 provides an improved regret bound for TS, thanks to Theorem 3.1. [Self-stabilization] Both adaptive regulators RCE and TS stabilize the system, so that no a priori stabilization is needed. This is caused by the feedback closing the loop randomly: through the randomization ϕm in RCE, and through sampling from a Gaussian posterior in TS. First, if the estimation error is not sufficiently small, the regulator designed for the estimate can destabilize the system with the true dynamics matrices; i.e. |λmax(θ0 ~L(^θn))| ≥ 1. Thus, for the first few episodes of RCE and TS, where there are not sufficient observations for accurate estimation, the closed-loop transition matrix can lead to instability. There is a pathological subset of unstable matrices, coined the set of irregular matrices in the literature. If the closed-loop matrix is irregular, it is not feasible to accurately estimate it. However, the set of irregular matrices is known to be of zero Lebesgue measure.

Then, it is rigorously established in the literature that under Assumption 2, a random linear feedback prevents the closed-loop matrix from becoming pathologically irregular . Therefore, RCE and TS provide regular closed-loop transition matrices. This regularity condition allows the adaptive policy to accurately estimate the unknown parameters even during the first few episodes, when the system dynamics are not yet stabilized. Technically, RCE and TS almost surely stabilize the system by some time . Note that the argument above does not hold for previously employed randomizations, where only the control signals are randomly perturbed [4, 8, 11, 16]. As a matter of fact, if the closed-loop matrix is irregular, randomization of the control signals does not help at all in addressing the inaccurate estimation of the dynamics parameters .

### 4.3 Optimality

Next, we discuss the reason for the significant gap between the regret bounds of Theorem 4.1 and Theorem 4.2 and the information-theoretic lower bound mentioned in Subsection 3.1. In fact, the following discussion shows that the logarithmic lower bound is not practically achievable. Nevertheless, in the next section we show how using additional information about the true dynamics parameter yields a regret of logarithmic order. In the sequel, we present an argument that leads to the following conjecture: the regret is operationally of order . For this purpose, we first state the following lemma about the level-set manifold defined in (12). It is a generalization of a previously established result for full-rank matrices [22, 23].

[Optimality manifold] The optimality level-set is a manifold of the following dimension at point :
. By Theorem 3.2, we have , where . The tangent space of the manifold at point shares of its dimensions with , and the other dimensions are apart from . Intuitively, reflects the constraint of estimating the dynamics parameter, while is the information desired to design an optimal policy. Thus, those dimensions of which are not in cannot be learned unless the subspace is sufficiently perturbed. Such a perturbation is available only through applying non-optimal feedbacks, which yields a regret larger than the logarithmic rate.

Next, we analyze the regret precisely, based on the limits on falsifying the parameters not belonging to . First, from a planning viewpoint, the distance between an adaptive regulator and the optimal feedback is determined by the uncertainty one has in exactly specifying the optimality manifold . As an extreme example, suppose that is provided to an operator who does not know . Then, denoting the above adaptive policy by , we have . Theorem 3.1 states that if at time the adaptive regulator approximates with error , the growth in the regret is . Thus, it suffices to examine the estimation accuracy, which in turn depends on both the accuracy of learning the closed-loop transition matrix and the falsification of the dynamics parameters .
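The quadratic effect of feedback deviations on the operating cost, which underlies Theorem 3.1, can be illustrated numerically in a scalar special case. The functions below are illustrative sketches under simplifying assumptions (scalar state and input, known noise variance), not the paper's constructions:

```python
import numpy as np

def avg_cost(a, b, q, r, l, sigma2=1.0):
    """Long-run average cost of the scalar system x(t+1) = a x(t) + b u(t) + w(t)
    under the static feedback u = l x, with stage cost q x^2 + r u^2 and
    noise variance sigma2. Requires a stable closed loop: |a + b l| < 1."""
    cl = a + b * l
    assert abs(cl) < 1.0, "closed loop must be stable"
    v = sigma2 / (1.0 - cl**2)  # stationary state variance
    return (q + r * l**2) * v

def optimal_gain(a, b, q, r, iters=200):
    """Scalar discrete-time Riccati iteration for the optimal feedback gain."""
    p = q
    for _ in range(iters):
        p = q + a**2 * p - (a * b * p) ** 2 / (r + b**2 * p)
    return -a * b * p / (r + b**2 * p)
```

For a stable scalar system, the excess average cost of the feedback l* + δ over the optimal l* scales as δ², which matches the quadratic dependence of the regret growth on the deviation from the optimal feedback invoked throughout this discussion.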

Now, suppose that the objective is to falsify , such that , and is orthogonal to defined in (13). The latter property of dictates . The key point is that in order to falsify , non-optimal linear feedbacks need to be applied sufficiently many times. For instance, when applying , the estimation provides , i.e. can never be falsified. More generally, assume that is a -perturbation of the optimal feedback: . The shifted subspace of uncertainty when applying deviates from by at most (in the sense of inner products of unit vectors). Now, assume that the operator applies (or a similar -perturbed feedback) for a duration of time points. Note that the smallest estimation error of the closed-loop matrix is .

Then, the operator can falsify only if . In other words, the adaptive regulator can avoid applying control actions of distance from the optimal feedback only if actions of distance have previously been applied for a period of length . On the other hand, according to Theorem 3.1, such perturbed feedbacks impose a growth of on the regret. Putting the above together, we get . This leads to the following conjecture, which constitutes an interesting direction for future work.

[Operational lower bound] For an arbitrary adaptive policy , the following holds: . Note that if the above conjecture is true, RCE and TS provide a nearly optimal bound for the regret. Even the logarithmic gap between the lower and upper bounds for the regret is inevitable, due to the existence of an analogous gap in the closed-loop identification of linear systems .

Further, the above discussion explains the intuition behind the design of RCE. Specifically, the magnitude of the perturbation is optimally selected according to the above discussion, since it satisfies , modulo a logarithmic factor. Indeed, if the randomization is (significantly) smaller in magnitude than , the portion of the regret due to the perturbation decreases; however, the accuracy of the parameter estimate decreases as well, so the other portion of the regret, due to the estimation error, increases. A similar discussion holds for larger magnitudes of the perturbation. On the other hand, the standard deviation of the randomization in TS is determined by the collected observations. As one can see in the proof of Theorem 4.2, a similar magnitude of randomization is automatically imposed by the structure of the TS adaptive LQR.
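The exploration-exploitation balance behind this choice can be written schematically. The following display is a heuristic, with $\varepsilon_n$ a symbol introduced here for illustration (not the paper's notation) denoting the magnitude of the perturbation:

```latex
% Heuristic balance: the exploration term reflects the quadratic regret
% growth in the deviation from the optimal feedback (Theorem 3.1); the
% estimation term reflects that an excitation of size \varepsilon_n over
% n steps yields parameter errors of order (n\,\varepsilon_n^2)^{-1/2}.
R_n \;\asymp\;
\underbrace{n\,\varepsilon_n^{2}}_{\text{exploration}}
\;+\;
\underbrace{\varepsilon_n^{-2}}_{\text{estimation}} .
```

Balancing the two terms at $\varepsilon_n \asymp n^{-1/4}$ recovers the (nearly) square-root regret of Theorem 4.1 and Theorem 4.2, modulo logarithmic factors.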

## 5 Generalized Certainty Equivalence

In real applications, the operator may have access to additional information about the dynamics parameter . Examples of such information include the set of non-zero entries of , or the rank of the dynamics matrix . Consider a plant under control where the operator knows which subsystems evolve independently of each other; this provides extra information about the structure of . Another example comes from large network systems, where a substantial portion of the matrix entries are zero .

Another example corresponds to systems (e.g. of state dimension ) whose dynamics exhibit longer memory (e.g. ). In this case, the state evolution, once written in the form of (1), has a transition matrix of the form [7, 27]. In such cases, this additional structural information about can be used by the operator to obtain a better regret for the adaptive regulation of the system. Nevertheless, a comprehensive theory needs to formalize how this side information leads to sharp theoretical bounds for the regret.

In this section, we provide an identifiability condition that ensures that the adaptive LQRs attain the informational lower bound of logarithmic order. In addition to the classical CE adaptive regulator, we also consider the family of CE-based schemes which provide a logarithmic order of magnitude for the regret.

First, we introduce the Generalized Certainty Equivalence (GCE) adaptive regulator. Similar to RCE and TS, it is an episodic algorithm with exponentially growing episode durations. The only difference is that instead of randomizing the least-squares estimate, it perturbs the estimate with an arbitrary matrix . Suppose that the operator knows that , based on side information .

Then, fix the reinforcement rate . At time we apply . If satisfies for some , we update the estimate by

$$\widehat{\theta}_n \;=\; \widetilde{\theta}_n \,+\, \operatorname*{arg\,min}_{\theta \in \Gamma_0} \; \sum_{t=0}^{n-1} \left\| x(t+1) - \theta' \widetilde{L}\big(\widehat{\theta}_t\big)\, x(t) \right\|^2,$$

where is arbitrary, and satisfies . Similar to RCE and TS, for the policy does not update: . Note that if , we get the episodic CE adaptive regulator. To proceed, we define the following identifiability condition.
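The constrained minimization in the GCE update can be sketched for the special case where Γ0 is a support-constrained subspace (see the support condition later in this section). Here `gce_update`, `X`, `Y`, and `support` are illustrative names, not the paper's notation:

```python
import numpy as np

def gce_update(X, Y, support):
    """Constrained least squares over a support-restricted set Gamma_0.

    X       : (n, q) matrix of regressors (stacked state/input vectors).
    Y       : (n, p) matrix of one-step-ahead responses.
    support : (p, q) boolean mask; Gamma_0 = {theta : theta[~support] == 0}.
    Returns the (p, q) minimizer of the squared prediction error over Gamma_0.
    """
    p, q = support.shape
    theta = np.zeros((p, q))
    for i in range(p):  # rows decouple in the squared loss
        idx = np.flatnonzero(support[i])
        if idx.size:
            coef, *_ = np.linalg.lstsq(X[:, idx], Y[:, i], rcond=None)
            theta[i, idx] = coef
    return theta
```

Since the rows of the parameter matrix decouple in the squared loss, the constrained problem reduces to one ordinary least-squares fit per row, restricted to the allowed entries.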

[Identifiability] Suppose that there exist such that . Then, is identifiable if for all stabilizable it holds that

 (16)

for some constant . Intuitively, the definition above describes settings where the side information is sufficient in the sense that an -accurate identification of the closed-loop matrix (the RHS of (16)) provides an -accurate approximation of the optimal linear feedback (the LHS of (16)). Subsequently, we provide concrete examples of , such as the presence of sparsity or low rank in . Essentially, a finite union of manifolds of proper dimension in the space suffices for identifiability. To see this, we use the critical subsets , and defined in (11), (12), and (13), respectively.

First, note that provides the optimal linear feedback . Hence, for it holds that

Then, according to Theorem 3.2 both and are shifted linear subspaces passing through the true . Since , the null-space shares dimensions with , and has

$$\dim\!\big(\mathcal{N}(\theta_0) \ominus \mathcal{P}_0\big) \;=\; \dim\!\big(\mathcal{N}(\theta_0)\big) - \dim\!\big(\mathcal{P}_0\big) \;=\; \operatorname{rank}(A_0)\, r$$

dimensions orthogonal to . The reason for the non-optimal regret (i.e. larger than a logarithmic magnitude) of the adaptive policies is the uncertainty . Additional knowledge about removes such uncertainty and is sufficient for Definition 5. Thus, a manifold (or a finite union of manifolds) of dimension implies the aforementioned identifiability condition. Below, we briefly provide some examples of .

• Optimality manifold: Obviously, a trivial example is . In this case, the LHS of (16) vanishes.

• Support condition: Let be the set of matrices with a priori known support . That is, for some set of indices , entries of all matrices are zero outside of :

$$\Gamma_0 = \left\{ \theta = \left[\theta_{ij}\right] : \theta_{ij} = 0 \ \text{for } (i,j) \notin \mathcal{I} \right\}.$$

Then, is a (basic) subspace of and can satisfy the identifiability condition (16). Note that it is necessary to have .

• Sparsity condition: Let be the set of all matrices with at most non-zero entries. Then, is the union of the matrices with support for different index sets . Hence, the previous case implies that is a finite union of the manifolds of proper dimension.

• Rank condition: Let be the set of matrices such that . Then, is a finite union of manifolds of dimension at most . Hence, if

$$d\,(p+q-d) \;\leq\; pq - \operatorname{rank}(A_0)\, r,$$

and (16) holds, is identifiable.

• Subspace condition: For , let be matrices such that . Suppose that are linearly independent; i.e. for if then . Define

$$\Gamma_0 = \left\{ \theta + \theta_0 : \operatorname{tr}\!\left(\theta' \theta_i\right) = 0 \ \text{for all } 1 \leq i \leq k \right\}.$$

If for all it holds that , then satisfies the identifiability condition of Definition 5.
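As a concrete companion to the sparsity condition above, the Euclidean projection onto the set of matrices with at most `s` non-zero entries can be sketched as follows; `project_sparse` is a hypothetical helper for illustration, not part of the paper's algorithms:

```python
import numpy as np

def project_sparse(theta, s):
    """Euclidean projection onto the set of matrices with at most s
    non-zero entries: keep the s largest-magnitude entries, zero the rest."""
    out = np.zeros_like(theta)
    if s <= 0:
        return out
    idx = np.argsort(np.abs(theta), axis=None)[-s:]  # top-s by magnitude
    out.ravel()[idx] = theta.ravel()[idx]
    return out
```

The projection lands in one of the finitely many support-constrained subspaces, which is exactly the sense in which the sparsity set is a finite union of manifolds of proper dimension.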

The following theorem establishes the optimality of the GCE adaptive regulator under the identifiability assumption. As mentioned in Section 4, a logarithmic gap between the lower and upper bounds for the regret is inevitable, due to similar limitations in system identification . [GCE rates] Suppose that is identifiable and the adaptive policy corresponds to GCE. Letting be the parameter estimate at time , define by (13). Then, we have

$$R_n(\widehat{\pi}) = O\big(\log^2 n\big), \qquad \left|\left|\left|\widehat{\theta}_n - \theta_0\right|\right|\right| = O\big(n^{-1/2}\log^{1/2} n\big).$$

Comparing the above result with Theorem 4.1 and Theorem 4.2, the identifiability assumption leads to significant improvements in the rates of both exploration and exploitation. Moreover, if , then . Thus, the estimation accuracy in Theorem 5 becomes .

An immediate consequence of Theorem 5 is an improvement of the existing result in the literature for the high-dimensional setting. Indeed, Theorem 5 provides both a remarkably smaller regret bound and a significantly simpler adaptive LQR. Under a stronger identifiability condition, Ibrahimi et al.  show that OFU provides the regret bound . By Theorem 5, not only can the adaptive regulator be generalized to the family of GCE policies, but the performance is also remarkably improved, achieving a regret of magnitude . Note that this is entirely based on the regret specification of Theorem 3.1.

## 6 Concluding Remarks

The performance of adaptive policies for LQ systems is addressed in this work, including both aspects of regulation and identification. First, we established a general result which specifies the regret of an arbitrary adaptive regulator in terms of the deviations from the optimal feedback. This tight reciprocal result provides a powerful machinery to analyze the subsequently presented policies. That is, we leveraged the aforementioned result to show that slight modifications of CE provide a regret of (nearly) square-root magnitude. The modifications consist of two basic approaches of randomization:
(i) randomization of the least-squares estimates (as in RCE), and
(ii) Thompson (or posterior) sampling (as in TS).

In addition, we formulated an identifiability condition which leads to logarithmic regret. The rates of identification are also discussed for the corresponding adaptive regulators. All presented results can be shown similarly for dependent noise, i.e. martingale difference sequences (adapted to a filtration). In fact, it suffices to replace the involved stochastic terms with their conditional counterparts (with respect to the corresponding filtration).

Rigorous establishment of the proposed operational lower bound for the regret is an interesting direction for future work. Besides, extending the developed machinery to other settings, such as switching systems or systems with imperfect observations, is a topic of interest. Moreover, extensions to the dynamical models describing network systems (e.g. high-dimensional sparse dynamics matrices) constitute a challenging problem for further investigation.

## References

•  T. Lai and C.-Z. Wei, “Extended least squares and their applications to adaptive control and prediction in linear systems,” IEEE Transactions on Automatic Control, vol. 31, no. 10, pp. 898–906, 1986.
•  T. Lai, “Asymptotically efficient adaptive control in stochastic regression models,” Advances in Applied Mathematics, vol. 7, no. 1, pp. 23–45, 1986.
•  L. Guo and H. Chen, “Convergence rate of ELS based adaptive tracker,” Systems Science and Mathematical Sciences, vol. 1, pp. 131–138, 1988.
•  H.-F. Chen and J.-F. Zhang, “Convergence rates in stochastic adaptive tracking,” International Journal of Control, vol. 49, no. 6, pp. 1915–1935, 1989.
•  P. Kumar, “Convergence of adaptive control schemes using least-squares parameter estimates,” IEEE Transactions on Automatic Control, vol. 35, no. 4, pp. 416–424, 1990.
•  T. L. Lai and Z. Ying, “Parallel recursive algorithms in asymptotically efficient adaptive control of linear stochastic systems,” SIAM Journal on Control and Optimization, vol. 29, no. 5, pp. 1091–1127, 1991.
•  L. Guo and H.-F. Chen, “The Åström-Wittenmark self-tuning regulator revisited and ELS-based adaptive trackers,” IEEE Transactions on Automatic Control, vol. 36, no. 7, pp. 802–812, 1991.
•  B. Bercu, “Weighted estimation and tracking for ARMAX models,” SIAM Journal on Control and Optimization, vol. 33, no. 1, pp. 89–106, 1995.
•  L. Guo, “Convergence and logarithm laws of self-tuning regulators,” Automatica, vol. 31, no. 3, pp. 435–450, 1995.
•  M. C. Campi and P. Kumar, “Adaptive linear quadratic gaussian control: the cost-biased approach revisited,” SIAM Journal on Control and Optimization, vol. 36, no. 6, pp. 1890–1907, 1998.
•  T. E. Duncan, L. Guo, and B. Pasik-Duncan, “Adaptive continuous-time linear quadratic gaussian control,” IEEE Transactions on Automatic Control, vol. 44, no. 9, pp. 1653–1662, 1999.
•  S. Bittanti and M. C. Campi, “Adaptive control of linear time invariant systems: the “bet on the best” principle,” Communications in Information & Systems, vol. 6, no. 4, pp. 299–320, 2006.
•  Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems.” in COLT, 2011, pp. 1–26.
•  M. Ibrahimi, A. Javanmard, and B. V. Roy, “Efficient reinforcement learning for high dimensional linear quadratic systems,” in Advances in Neural Information Processing Systems, 2012, pp. 2636–2644.
•  M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Regret analysis for adaptive linear-quadratic policies,” arXiv preprint arXiv:1711.07230, 2017.
•  M. Abeille and A. Lazaric, “Thompson sampling for linear-quadratic control problems,” in AISTATS 2017 - 20th International Conference on Artificial Intelligence and Statistics, 2017.
•  Y. Ouyang, M. Gagrani, and R. Jain, “Control of unknown linear systems with Thompson sampling,” in 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2017, pp. 1198–1205.
•  Y. Bar-Shalom and E. Tse, “Dual effect, certainty equivalence, and separation in stochastic control,” IEEE Transactions on Automatic Control, vol. 19, no. 5, pp. 494–500, 1974.
•  T. L. Lai and C. Z. Wei, “Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems,” The Annals of Statistics, pp. 154–166, 1982.
•  A. Becker, P. Kumar, and C.-Z. Wei, “Adaptive control with the stochastic approximation algorithm: Geometry and convergence,” IEEE Transactions on Automatic Control, vol. 30, no. 4, pp. 330–338, 1985.
•  T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in applied mathematics, vol. 6, no. 1, pp. 4–22, 1985.
•  J. W. Polderman, “On the necessity of identifying the true parameter in adaptive LQ control,” Systems & Control Letters, vol. 8, no. 2, pp. 87–91, 1986.
•  ——, “A note on the structure of two subsets of the parameter space in adaptive control problems,” Systems & Control Letters, vol. 7, no. 1, pp. 25–34, 1986.
•  M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Finite time adaptive stabilization of LQ systems,” arXiv preprint arXiv:1807.09120, 2018.
•  M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht, “Learning without mixing: Towards a sharp analysis of linear system identification,” arXiv preprint arXiv:1802.08334, 2018.
•  V. F. Farias and R. Madan, “The irrevocable multiarmed bandit problem,” Operations Research, vol. 59, no. 2, pp. 383–399, 2011.
•  M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Finite time identification in unstable linear systems,” Automatica, vol. 96, pp. 342 – 353, 2018.
•  D. P. Bertsekas, Dynamic programming and optimal control.   Athena Scientific Belmont, MA, 1995, vol. 1, no. 2.
•  M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Optimality of fast matching algorithms for random networks with applications to structural controllability,” IEEE Transactions on Control of Network Systems, vol. 4, no. 4, pp. 770–780, 2017.
•  U. Shalit, D. Weinshall, and G. Chechik, “Online learning in the embedded manifold of low-rank matrices,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 429–458, 2012.

## Appendix A Proofs of Main Results

### a.1 Proof of Theorem 3.1

Given , and the linear policy , define the sequence of policies as follows.

$$\begin{aligned} \pi_0 &= \{L(\theta_0), \cdots, L(\theta_0)\}, \\ \pi_1 &= \{L_0, L(\theta_0), \cdots, L(\theta_0)\}, \\ \pi_2 &= \{L_0, L_1, L(\theta_0), \cdots, L(\theta_0)\}, \\ &\ \ \vdots \\ \pi_n &= \{L_0, L_1, \cdots, L_{n-1}\}. \end{aligned}$$

Indeed, for , is a linear policy which applies the same linear feedback as at every time , and then for switches to the linear time-invariant optimal policy . Note that