On Applications of Bootstrap in Continuous Space Reinforcement Learning

In decision making problems for continuous state and action spaces, linear dynamical models are widely employed. Specifically, policies for stochastic linear systems subject to quadratic cost functions capture a large number of applications in reinforcement learning. Selected randomized policies have been studied in the literature recently that address the trade-off between identification and control. However, little is known about policies based on bootstrapping observed states and actions. In this work, we show that bootstrap-based policies achieve a square root scaling of regret with respect to time. We also obtain results on the accuracy of learning the model's dynamics. Corroborative numerical analysis that illustrates the technical results is also provided.


I Introduction

In the theory of reinforcement learning, efficient algorithms with provable theoretical guarantees are established for two canonical settings. The first is finite-state Markov decision processes (MDPs) with state spaces of small cardinality [1]. The second is the continuous space setting of linear quadratic (LQ) systems [2]. In the latter, both the control action and the state are multidimensional real vectors, and the state evolves according to stochastic linear dynamics driven by the control action. Further, the cost (or negative reward) is quadratic in both the state and the control input. Besides being theoretically amenable, LQ models capture a wide range of applications, from air conditioning control [3] to portfolio optimization [4]. LQ models also arise when studying the behavior of nonlinear systems around the working equilibrium [5, 6].

In applications where the true system model is not known, data-driven strategies are required for decision making under uncertainty [7]. Then, the learning algorithm has to select actions among infinitely many options in order to steer the system toward minimizing the costs incurred. Note that, unlike the finite state MDP case, in LQ systems there is a possible danger of the state vector becoming unbounded [8, 9, 10]. Therefore, the design and analysis of reinforcement learning algorithms for LQ systems involve significantly different conceptual and technical issues in balancing exploration (identification) and exploitation (control). For this purpose, one might consider using upper-confidence bound (UCB) approaches [11, 12, 13, 14] that rely on the optimism in the face of uncertainty (OFU) principle. The UCB approach was first developed for finite action bandit problems [15]. While efficient in the finite action setting, UCB-based approaches have been found to be computationally intractable in more general problems [16].

Recently, various methods for reinforcement learning have been proposed that leverage randomization strategies to guide the learning process. Randomized policy search methods have been studied both empirically and theoretically (see, e.g., [17, 18]). For the problem of stabilizing an unknown LQ system, an algorithm leveraging random feedback gains has been proposed [19]. There is also work showing the efficiency of addressing the exploration-exploitation trade-off by randomizing the learned model through both posterior sampling [20] and additive randomization [21]. Finally, finite time analysis of Certainty Equivalent policies utilizing input perturbation has led to performance guarantees for both learning [22] and planning [16].

In this paper, we study randomized algorithms that leverage the statistical bootstrap [23] for reinforcement learning in LQ systems. Bootstrap-based exploration has been analyzed in simpler settings, such as bandit problems [24, 25]. There has been considerable recent interest in bootstrap-based exploration strategies, especially in combination with deep neural networks [26, 27, 28, 29]. However, results on bootstrap-based reinforcement learning algorithms for LQ models have been limited primarily to numerical analyses for learning the model-misspecification error [30], while rigorous performance guarantees are not currently available.

Further, bootstrap methods are also of practical interest because of their robustness to misspecified models. The amount of exploration in bootstrap-based adaptive control policies is endogenously determined by the history of the system to date. Therefore, the policy adapts its decision-making strategy to possible systematic and/or latent “biases” occurring due to lack of accurate information regarding the system’s dynamics (see the discussion at the end of Section IV for more details). Examples of such “biases” include structural breaks [31], system resets [30], and misspecification of the model dimension [32, 33, 34].

The focus of this work is on the performance of reinforcement learning policies that use the residual bootstrap to balance exploration and exploitation. We show that model-based strategies that use linear regression for learning the model and bootstrapping for policy design provide a regret that scales as the square root of the total time of interacting with the system. Further, the accuracy of learning the unknown dynamics parameter is specified. To establish the results, we carefully examine the effect of different converging and diverging quantities involved in the problem, such as the errors in learning the model, the distributions induced by the regression residuals, the correlation within and between the observed input-state signals, and the ongoing learning-planning interactions, which are of independent interest. At the technical level, we leverage results in the literature on the bootstrap [35], martingale central limit [36, 37] and convergence theorems [38].

The remainder of the paper is organized as follows: Section II introduces the mathematical model under consideration and discusses the rigorous formulation of the problem, and also provides some necessary preliminaries. Section III describes the bootstrap procedure and the resulting reinforcement learning algorithm to design the policy. Subsequently, the main result on the performance of the proposed algorithm is presented, together with numerical work showcasing the performance of the algorithm, in Section IV.

Notation

The following notation will be employed throughout the paper. $A'$ is the transpose of the matrix or vector $A$. The largest and the smallest eigenvalues of a square Hermitian matrix $A$ are denoted by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$, respectively. If $A$ is not Hermitian, the ordering of the eigenvalues is determined by their magnitudes. The norm of the vector $v$ is denoted by $||v||$, and $||\cdot||$ is also used for the operator norm of matrices: $||A|| = \sup_{||v||=1} ||Av||$. For atomic measures on Euclidean spaces we use the Dirac function $\delta_{[\cdot]}$; i.e. $\delta_{[v]}$ denotes a unit point mass at $v$. Finally, the letters $\pi$ and $\theta$ are used for generic reinforcement learning policies and model parameters, respectively, and will be rigorously defined later on.

II Setting and Problem Formulation

The model, denoted by , consists of multidimensional state and control vectors, parameters that specify its dynamical evolution over time, and cost matrices, as defined next. The $p$ dimensional state process $\{x(t)\}_{t=0}^{\infty}$ evolves according to an unknown stochastic linear dynamic equation governed by the $r$ dimensional control action $u(t)$, and the random disturbance (or noise) process $\{\xi(t)\}_{t=1}^{\infty}$:

$$x(t+1) = A_0 x(t) + B_0 u(t) + \xi(t+1). \tag{1}$$

That is, the current state and the input determine the next state through the state transition matrix $A_0 \in \mathbb{R}^{p \times p}$, and the input influence matrix $B_0 \in \mathbb{R}^{p \times r}$, respectively.

Definition 1.

Henceforth, we will denote the true parameter tuple by the dynamics matrix $\theta_0 = [A_0, B_0] \in \mathbb{R}^{p \times q}$, with $q = p + r$. Similarly, we use the parameter $\theta = [A, B]$ to denote generic dynamics matrices.

The additive noise in the stochastic dynamics (1) satisfies $\mathbb{E}[\xi(t)] = 0$. For the sake of simplicity, we assume that the noise vectors $\{\xi(t)\}_{t=1}^{\infty}$ are independent, and have a stationary covariance structure: $\mathbb{E}[\xi(t)\xi(t)'] = \Sigma$. Further, $\Sigma$ is assumed to be positive definite, and , for some fixed . As a matter of fact, extensions to more general technical settings such as non-stationary [16] or singular covariance matrices (assuming reachability [9]), as well as conditionally independent processes [13], can be accommodated in a similar manner. Note though that the assumed noise process is not necessarily stationary in the strict sense.

We are interested in finding reinforcement learning policies to minimize the long-term average cost as formally defined next. First, suppose that $Q_x$ and $Q_u$ are the regulation weight matrices reflecting the effect of the state and the input vectors in the cost function, respectively. Specifically, letting $\pi$ be the decision making law (policy) determining the control input $u(t)$ at every time $t$, define the quadratic instantaneous cost of $\pi$ according to

$$c_t(\pi) = x(t)' Q_x x(t) + u(t)' Q_u u(t), \tag{2}$$

where $Q_x \in \mathbb{R}^{p \times p}$, and $Q_u \in \mathbb{R}^{r \times r}$ are symmetric positive definite matrices. Thus, (2) reflects the desire to regulate the state of the system through control actions of small magnitude.
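To make the setting concrete, the following minimal Python sketch simulates the dynamics (1) under a fixed linear feedback and records the quadratic cost (2) at every step. The helper name `simulate_lq`, the example matrices, the gain `G`, and the Gaussian noise are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def simulate_lq(A, B, G, Qx, Qu, n_steps, noise_std=1.0, seed=0):
    """Simulate x(t+1) = A x(t) + B u(t) + xi(t+1) with u(t) = G x(t).

    Returns states (n_steps + 1, p), inputs (n_steps, r), and the
    instantaneous costs c_t = x(t)' Qx x(t) + u(t)' Qu u(t).
    """
    rng = np.random.default_rng(seed)
    p, r = B.shape
    x = np.zeros((n_steps + 1, p))          # x(0) = 0 (assumption)
    u = np.zeros((n_steps, r))
    cost = np.zeros(n_steps)
    for t in range(n_steps):
        u[t] = G @ x[t]
        cost[t] = x[t] @ Qx @ x[t] + u[t] @ Qu @ u[t]
        # Gaussian noise is an assumption; the paper only needs moment conditions.
        x[t + 1] = A @ x[t] + B @ u[t] + noise_std * rng.standard_normal(p)
    return x, u, cost

# Hypothetical example system (not the one used in the paper).
A = np.array([[1.2, 0.5], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
G = np.array([[-0.5, -1.0]])
Qx, Qu = np.eye(2), np.eye(1)

x, u, cost = simulate_lq(A, B, G, Qx, Qu, n_steps=50)
average_cost = cost.mean()   # finite-horizon analogue of the average cost
```

The running mean of `cost` is the finite-horizon counterpart of the long-run average that the policy is meant to minimize.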

When the dynamics follow (1), and the instantaneous cost is given by (2), we denote the model by . Further, the history of the system at time $t$, denoted as $H_t$, consists of the sequence of the control inputs applied so far, and the resulting state vectors:

$$H_t = \left( x(0), \cdots, x(t), u(0), \cdots, u(t-1) \right).$$

A reinforcement learning policy observes the history $H_t$ at time $t$, aiming to control the cost incurred. That is, the policy $\pi$ is a (possibly random) mapping which designs the input sequence $\{u(t)\}_{t=0}^{\infty}$ according to the history available up to that time;

$$u(t) = \pi(H_t, Q_x, Q_u), \tag{3}$$

so that the average cost is minimized. Thus, the objective is summarized in the following regulation problem:

Problem 1.

Find $\pi$ to minimize the average cost below, subject to (1), (2), and (3);

$$\limsup_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} c_t(\pi). \tag{4}$$

Importantly, according to (3) the true dynamics parameter $\theta_0$ in (1) is unknown. Therefore, the policy must also employ an exploration procedure to accurately learn the model parameters, thus addressing the following identification problem:

Problem 2.

Using (1) and (3), design $\pi$ to learn $\theta_0$ as accurately as possible.

Note that in the above formulation, the true system dynamics are unknown, while the cost matrices are known. This gives rise to a realistic setting, since the decision making algorithm does not know the actual evolution of the underlying system (i.e. $\theta_0$), but is aware of the criteria according to which a policy is being assessed (i.e. $Q_x, Q_u$).

Subsequently, we define the regret of a policy, which is the amount of sub-optimality it incurs due to uncertainty about the parameters of the model (1). To do so, we need to introduce an optimal policy $\pi^\star$ that minimizes the average cost, given full knowledge of the system model . Then, $\pi^\star$ will be the baseline for assessing the exploitation performance of the arbitrary reinforcement learning policy $\pi$. It is well known that in order to find $\pi^\star$, an algebraic Riccati equation needs to be solved [39, 40].

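To proceed, we introduce some additional notation. First, for an arbitrary $\theta = [A, B] \in \mathbb{R}^{p \times q}$ define the matrix valued mapping

$$\Phi_\theta(P) = Q_x + A'PA - A'PB \left( B'PB + Q_u \right)^{-1} B'PA.$$

Both the domain and the range of $\Phi_\theta$ are the set of $p \times p$ matrices. Next, if there is a positive semidefinite matrix $P(\theta)$ satisfying the algebraic Riccati equation $\Phi_\theta(P(\theta)) = P(\theta)$, let the feedback gain matrix be

$$G(\theta) = -\left( B'P(\theta)B + Q_u \right)^{-1} B'P(\theta)A. \tag{5}$$

Furthermore, for $\theta_0$ in (1), define the linear time-invariant (LTI) policy

$$\pi^\star: \quad u(t) = G(\theta_0) x(t), \quad t = 0, 1, 2, \cdots. \tag{6}$$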

Finally, using $\pi^\star$, the regret of $\pi$ is naturally defined by:

$$R_n(\pi) = \sum_{t=0}^{n-1} \left[ c_t(\pi) - c_t(\pi^\star) \right]. \tag{7}$$

It remains to specify settings for which $\pi^\star$ is well-defined. To that end, the following closed-loop stabilizability condition for the model (1) is necessary and sufficient [39, 40].

Assumption 1.

There is an LTI feedback gain $G$, such that the closed-loop matrix $A_0 + B_0 G$ satisfies $|\lambda_{\max}(A_0 + B_0 G)| < 1$.

Note that in general, the stabilizing gain mentioned above is only required to exist, and does not need to be known to the decision maker. In other words, to verify that the stabilizability Assumption 1 holds, it suffices to show that a hypothetical omniscient decision maker (who knows the true model ) possessing an omnipotent computational power is able to stabilize the system. However, we briefly outline the available constructive methods to compute (as well as ). It is shown (see for example [39, 40, 19]) that under Assumption 1 the following statements hold;

1. The positive definite matrix $P(\theta_0)$ uniquely exists. So, both the feedback $G(\theta_0)$ and the optimal policy $\pi^\star$ are well defined.

2. Letting $P_0$ be an arbitrary positive semidefinite matrix, the recursive formula $P_{k+1} = \Phi_{\theta_0}(P_k)$ converges exponentially fast to $P(\theta_0)$ as $k$ grows.

3. The feedback matrix $G(\theta_0)$ stabilizes the system:

$$|\lambda_{\max}(A_0 + B_0 G(\theta_0))| < 1.$$

4. The minimum of the average cost (4) is achieved by $\pi^\star$.

5. In the class of LTI policies (i.e. of the form $u(t) = Gx(t)$), the policy $\pi^\star$ is the only optimal one.
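The recursion in item 2 and the gain in (5) can be sketched numerically as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the example matrices, the zero initialization, and the stopping tolerance are all assumptions.

```python
import numpy as np

def riccati_map(P, A, B, Qx, Qu):
    """Phi_theta(P) = Qx + A'PA - A'PB (B'PB + Qu)^{-1} B'PA, for symmetric P."""
    BtPA = B.T @ P @ A
    return Qx + A.T @ P @ A - BtPA.T @ np.linalg.solve(B.T @ P @ B + Qu, BtPA)

def solve_riccati(A, B, Qx, Qu, tol=1e-10, max_iter=10_000):
    """Iterate P <- Phi_theta(P) from P = 0 until successive iterates agree."""
    P = np.zeros(Qx.shape)
    for _ in range(max_iter):
        P_next = riccati_map(P, A, B, Qx, Qu)
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    raise RuntimeError("Riccati iteration did not converge")

def feedback_gain(P, A, B, Qu):
    """G(theta) = -(B'PB + Qu)^{-1} B'PA, as in (5)."""
    return -np.linalg.solve(B.T @ P @ B + Qu, B.T @ P @ A)

# Hypothetical controllable (hence stabilizable) example with an unstable A0.
A0 = np.array([[1.2, 0.5], [0.0, 0.9]])
B0 = np.array([[0.0], [1.0]])
Qx, Qu = np.eye(2), np.eye(1)

P = solve_riccati(A0, B0, Qx, Qu)
G = feedback_gain(P, A0, B0, Qu)
rho = max(abs(np.linalg.eigvals(A0 + B0 @ G)))   # closed-loop spectral radius
```

In this example the open-loop matrix has an eigenvalue at 1.2, so the system is unstable without control, while the computed gain yields a closed-loop spectral radius below one, in line with item 3.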

In the remainder of the paper, we employ reinforcement learning algorithms to address Problem 1, studying the growth rates of $R_n(\pi)$. Similarly, letting $\tilde{\theta}_n$ be the learned/estimated parameter at time $n$ (the sample size is $n$ as well), we consider the exploration performance in Problem 2 through the rates of the learning error $||\tilde{\theta}_n - \theta_0||$. Bootstrap is the cornerstone of the proposed algorithms to efficiently randomize the design of the control inputs, and address the trade-off between the learning accuracy and the regret.

III Algorithms

An algorithm needs to address the common dilemma of decision making under uncertainty, as follows. First, if the algorithm makes decisions naively according to the estimated (learned) dynamics parameter, it will presumably fail to provide a small regret. Intuitively, the state and the action are required to be highly correlated in order to remain close to the optimal strategy $\pi^\star$ in (6). Because of this correlation, the history may fail to identify $\theta_0$ accurately, which can lead to drastically large regret values. Technically, if $u(t) = Gx(t)$ for some feedback gain matrix $G$, then the dimension of the observed history is effectively $p$, while the rows of the matrices in the parameter space belong to $\mathbb{R}^q$. Therefore, learning can be dramatically misleading. This phenomenon of failing to falsify the imprecise approximations of the true model is extensively discussed in the adaptive control literature [41, 42, 12, 21].
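The rank deficiency described above can be checked numerically. In the following sketch, the dimensions, the random gain `G`, and the i.i.d. stand-in for the state sequence are hypothetical choices made only for illustration, and the dither scale of 0.1 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 3, 2
q = p + r

G = rng.standard_normal((r, p))        # hypothetical fixed feedback gain
X = rng.standard_normal((200, p))      # stand-in for observed states x(t)

# Without exploration: u(t) = G x(t), so [x(t); u(t)] lies in a p-dim subspace.
U = X @ G.T
Z = np.hstack([X, U])                  # rows are the regressors [x(t)', u(t)']
rank_no_explore = np.linalg.matrix_rank(Z)

# With a small random perturbation of the input, all q directions are excited.
U_dither = U + 0.1 * rng.standard_normal(U.shape)
Z_dither = np.hstack([X, U_dither])
rank_explore = np.linalg.matrix_rank(Z_dither)
```

With a fixed gain the regressors span only $p$ of the $q$ directions, so a least squares fit cannot separate $A_0$ from $B_0$; any perturbation that breaks the exact linear relation between the state and the input restores full rank.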

In other words, if the policy fails to sufficiently explore the parameter space, an inaccurate approximation can falsely be treated as an accurate one. This necessitates an efficient exploration strategy to decrease the aforementioned correlation between the state and the action. Moreover, the above argument reveals the reasoning leading to UCB approaches [13, 14], or statistically independent dither schemes [16, 22], as useful prescriptions to overcome the exploration-exploitation dilemma.

In order to explore, the decision maker needs to deviate from the learned model prior to using to design the reinforcement learning policy. On the other hand though, the above deviations must be sufficiently small in order to avoid significant deterioration in the exploitation performance (i.e. increase in the regret). The solution we discuss here is to utilize the bootstrap to provide the necessary balance between these two competing objectives.

To this end, the policy applies the supposedly optimal control action treating $\hat{\theta}_n$ as the true model, where $\hat{\theta}_n$ is provided by the bootstrap algorithm. It computes the regression residuals for the learned parameter $\tilde{\theta}_n$, and bootstraps (i.e. resamples) them to reconstruct a surrogate system. Then, the history of the surrogate system is the data used to compute $\hat{\theta}_n$. In the first subsection, we explain the least squares estimator for learning the model parameter, as well as the above residual bootstrap procedure.

Subsequently, in the second subsection we present an episodic algorithm which updates the model-based policy at the end of every episode, while the lengths of the episodes grow exponentially fast. Therefore, as the duration of the interaction with the system grows, the number of policy updates scales logarithmically. As a matter of fact, this leads to a significant reduction in the computation of the reinforcement learning policy, by avoiding unnecessary updates before collecting sufficient data, due to the fact that the solution of the algebraic Riccati equation (5) for a hypothetical model is not instantly available. The latter would impose a substantial computational burden, especially for systems whose dimension is fairly large.

III-A Residual Bootstrap

According to the linear dynamical model in (1), a natural procedure to learn $\theta_0$ through the control inputs and the observed states is based on least squares. In the sequel, we discuss the residual bootstrap method for the least squares learning procedure. Further, we present the corresponding algorithm, which will be used as a subroutine in the reinforcement learning algorithm in the next subsection.

Recall that the LTI policy in (6) is optimal. Thus, a natural form of the adaptive policies that a reinforcement learning algorithm is expected to provide through planning according to the learned model is $u(t) = G_t x(t)$. Assuming so for $t = 0, \cdots, n-1$, now the algorithm needs to decide about the action at time $n$. Thus, plugging $u(t) = G_t x(t)$ in the dynamical model (1), and denoting

$$F_t = \left[ I_p, G_t' \right]' \in \mathbb{R}^{q \times p},$$

we get the so-called closed-loop evolution of the system by the (possibly time-varying) autoregressive dynamics

$$x(t+1) = \theta_0 F_t x(t) + \xi(t+1),$$

for $t = 0, \cdots, n-1$. Then, Algorithm 1 returns the bootstrapped parameter $\hat{\theta}_n$ based on the matrices $\{F_t\}_{t=0}^{n-1}$, as well as the available state observations $\{x(t)\}_{t=0}^{n}$. The details are provided below.

First, based on the collected history $H_n$, define the following least squares estimate of $\theta_0$:

$$\tilde{\theta}_n = \arg\min_{\theta \in \mathbb{R}^{p \times q}} \sum_{t=0}^{n-1} ||x(t+1) - \theta F_t x(t)||^2. \tag{8}$$

The learning procedure (8) treats the noise vectors $\xi(t+1)$ as the errors of a linear regression procedure, based on the dynamical model (1). Therefore, the residuals of the least squares estimate are defined by the difference between the observed response $x(t+1)$, and the fitted response $\tilde{\theta}_n F_t x(t)$. That is,

$$\zeta(t+1) = x(t+1) - \tilde{\theta}_n F_t x(t), \tag{9}$$

for $t = 0, \cdots, n-1$. The residuals $\{\zeta(t)\}_{t=1}^{n}$ can conceptually be considered as approximations of the actual regression errors $\{\xi(t)\}_{t=1}^{n}$. Using the residuals, we define the centered empirical distribution

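$$\hat{P}_n = \frac{1}{n} \sum_{t=1}^{n} \delta_{[\zeta(t) - \bar{\zeta}_n]}, \tag{10}$$

where $\bar{\zeta}_n$, the average of the residuals given by

$$\bar{\zeta}_n = \frac{1}{n} \sum_{t=1}^{n} \zeta(t), \tag{11}$$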

is used for centering the empirical distribution. In fact, $\hat{P}_n$ is the sample probability measure for the population distribution of the noise process $\{\xi(t)\}_{t=1}^{\infty}$. Note that $\hat{P}_n$ is defined on $\mathbb{R}^p$. We then use $\hat{P}_n$ and $\tilde{\theta}_n$ to generate the surrogate state vectors $\{\hat{x}(t)\}$ by the dynamical model

$$\hat{x}(t+1) = \tilde{\theta}_n F_t \hat{x}(t) + \hat{\xi}(t+1),$$

where the bootstrap noise vectors $\{\hat{\xi}(t)\}_{t=1}^{n}$ are drawn independently from $\hat{P}_n$. Hence, letting $\hat{\mathbb{E}}$ be the expectation with respect to $\hat{P}_n$, clearly we have $\hat{\mathbb{E}}[\hat{\xi}(t)] = 0$. Also note that the actual dynamics parameter for the surrogate system is the learned parameter $\tilde{\theta}_n$ defined in (8). Finally, the algorithm applies the least squares estimator to the generated surrogate states to obtain $\hat{\theta}_n$:

$$\hat{\theta}_n = \arg\min_{\theta \in \mathbb{R}^{p \times q}} \sum_{t=0}^{n-1} ||\hat{x}(t+1) - \theta F_t \hat{x}(t)||^2. \tag{12}$$

The pseudo-code for the residual bootstrap explained above is given in Algorithm 1. It will be used later at the heart of Algorithm 2 to design reinforcement learning policies.
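The steps (8) to (12) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's Algorithm 1: the function names are hypothetical, the surrogate initial state is taken to be $x(0)$ (the text does not specify it), and the example data use independently drawn gains at every step purely to obtain well-conditioned regressors.

```python
import numpy as np

def least_squares(X_next, Z):
    """Return argmin_theta sum_t ||x(t+1) - theta z(t)||^2 with z(t) = F_t x(t)."""
    theta_T, *_ = np.linalg.lstsq(Z, X_next, rcond=None)   # solves Z theta' ~ X_next
    return theta_T.T

def residual_bootstrap(x, F, rng):
    """x: states of shape (n + 1, p); F: sequence of n matrices F_t of shape (q, p)."""
    n = len(F)
    Z = np.stack([F[t] @ x[t] for t in range(n)])          # regressors, (n, q)
    theta_tilde = least_squares(x[1:], Z)                  # estimate (8)
    resid = x[1:] - Z @ theta_tilde.T                      # residuals (9)
    centered = resid - resid.mean(axis=0)                  # centering, (10)-(11)
    draws = centered[rng.integers(0, n, size=n)]           # i.i.d. draws from P_n
    x_hat = np.zeros_like(x)
    x_hat[0] = x[0]                                        # assumption: same start
    Z_hat = np.zeros_like(Z)
    for t in range(n):                                     # surrogate system
        Z_hat[t] = F[t] @ x_hat[t]
        x_hat[t + 1] = theta_tilde @ Z_hat[t] + draws[t]
    theta_hat = least_squares(x_hat[1:], Z_hat)            # estimate (12)
    return theta_tilde, theta_hat

# Hypothetical example with p = 2, r = 1, theta0 = [A0, B0].
rng = np.random.default_rng(0)
theta0 = np.array([[0.5, 0.1, 0.0],
                   [0.0, 0.5, 1.0]])
n, p = 5000, 2
F = [np.vstack([np.eye(p), rng.uniform(-0.2, 0.2, size=(1, p))]) for _ in range(n)]
x = np.zeros((n + 1, p))
for t in range(n):
    x[t + 1] = theta0 @ F[t] @ x[t] + rng.standard_normal(p)

theta_tilde, theta_hat = residual_bootstrap(x, F, rng)
```

The returned `theta_hat` is a randomized version of `theta_tilde`, and the size of the randomization is driven by the residuals themselves.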

Remark 1.

If the noise process is parametrized, one can accordingly draw $\hat{\xi}(t)$ from the corresponding parametric sample distribution.

To see that, assume we know that the noise vectors belong to a parametric family of stochastic processes. Then, instead of using the nonparametric empirical distribution in (10), one can use the residuals to estimate the parameter of interest. The bootstrap noise can then be sampled independently from the parametric distribution provided by the obtained estimate. For example, if we know that the noise vectors are i.i.d. Gaussian, we can find the sample covariance matrix of the residuals, and draw $\hat{\xi}(t)$ independently from the centered Gaussian distribution with that covariance matrix.
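For the Gaussian case mentioned in Remark 1, the parametric draw can be sketched as below. The function name `gaussian_bootstrap_noise` and the synthetic residuals are hypothetical; the covariance estimate uses the $1/n$ normalization, which is one of several reasonable choices.

```python
import numpy as np

def gaussian_bootstrap_noise(resid, size, rng):
    """Draw bootstrap noise from N(0, S), with S the sample covariance of residuals.

    resid: array of shape (n, p) holding the regression residuals.
    Returns the draws, of shape (size, p), together with S.
    """
    centered = resid - resid.mean(axis=0)
    S = centered.T @ centered / len(resid)
    draws = rng.multivariate_normal(np.zeros(resid.shape[1]), S, size=size)
    return draws, S

rng = np.random.default_rng(0)
resid = rng.standard_normal((200, 3))     # synthetic residuals for illustration
draws, S = gaussian_bootstrap_noise(resid, 50, rng)
```

These draws would replace the resampled residuals when generating the surrogate states.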

Remark 2.

In the original version of the bootstrap [23], the covariates (i.e. the state vectors) are fixed, and only the residuals are bootstrapped. In time series models such as (1), every state vector comprises the previous noise vectors. Therefore, bootstrapping the residuals automatically leads to a new state sequence for the surrogate system [30].

III-B Policy Design

Next, Algorithm 2 for decision making under uncertainty based on bootstrapping the residuals (Algorithm 1) is discussed. For this purpose, we first define the extended gain matrix based on the optimal feedback $G(\theta)$.

Definition 2.

For parameter $\theta \in \mathbb{R}^{p \times q}$, using the matrix $G(\theta)$ in (5), define the matrix $F(\theta) = \left[ I_p, G(\theta)' \right]' \in \mathbb{R}^{q \times p}$.

The matrix $F(\theta)$ can be interpreted as an extension of the original feedback gain; applying $u(t) = G(\theta) x(t)$, the closed-loop transition matrix takes the form $\theta_0 F(\theta)$.

Recall that the true model is not known, and a reinforcement learning algorithm needs to simultaneously learn the dynamics parameter, and design the control input. To do so, we present an episodic decision making strategy outlined in Algorithm 2. That is, the policy applies control actions during each episode, assuming that the approximation of the model available at the time coincides with the true model. Then, at the end of every episode, the algorithm updates the learned model based on the history collected so far, and continues making decisions as if the new approximate model is the truth. The learning mentioned above is through a linear regression for the dynamics (1), and the approximation consists of bootstrapping (by Algorithm 1) the model estimate obtained by the regression. In the sequel, we explain the details of the above alternating steps of the algorithm.

The reinforcement learning policy is initiated with the history in the first line of Algorithm 2. Then, it chooses an arbitrary stabilizable approximation of , denoted by , and starts the system by applying the action prescribed by the model ; i.e. . Note that selection of is straightforward, since almost all (w.r.t. Lebesgue measure) parameter matrices are stabilizable [20].

The starting time-points of the episodes are determined by the exponents of the reinforcement rate . That is, at every time , the approximation will be updated, while for the algorithm freezes . In other words, whenever , Algorithm 2 calls the residual bootstrap Algorithm 1 to get according to the collected history of the control actions and the states. So, for all , the matrices are exactly the same. The efficiency of the policy relies on the idea that the sequence will provide finer approximations of the truth , as the algorithm proceeds (or more precisely, as grows).
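The episodic update schedule can be illustrated with a short sketch. It assumes, as a guess at the notation lost from the text above, that the reinforcement rate is a number `gamma > 1` and that updates occur at times of the form $\lfloor \gamma^m \rfloor$; the function name `update_times` is hypothetical.

```python
import math

def update_times(gamma, horizon):
    """Times t = floor(gamma**m), m = 0, 1, ..., that fall below the horizon.

    Repeated values (possible when gamma is close to 1) are listed once.
    """
    if gamma <= 1:
        raise ValueError("gamma must exceed 1 so that episode lengths grow")
    times, m = [], 0
    while True:
        t = math.floor(gamma ** m)
        if t >= horizon:
            break
        if not times or t > times[-1]:
            times.append(t)
        m += 1
    return times

schedule = update_times(2.0, 100)          # [1, 2, 4, 8, 16, 32, 64]
n_updates = len(update_times(1.5, 10**6))  # grows like log(horizon)
```

Between consecutive update times the bootstrapped parameter, and hence the feedback gain, is held fixed, so the Riccati equation is solved only a logarithmic number of times.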

IV Theoretical Results and Simulations

We start by establishing performance guarantees on the regret and the learning accuracy for bootstrap-based policies, supplemented by numerical examples that illustrate the behavior of Algorithm 2 for both identification and regulation. The following result specifies the growth rate of the regret for the policy designed by Algorithm 2, as well as the decay rate of the identification error $||\tilde{\theta}_n - \theta_0||$.

Theorem 1.

Letting $\pi$ be the policy given by Algorithm 2, define the learned parameter $\tilde{\theta}_n$ by (8). Then, we have

$$\limsup_{n \to \infty} \left( n^{-1/2} \log^{-2} n \right) R_n(\pi) < \infty, \qquad \limsup_{n \to \infty} \left( n^{1/2} \log^{-2} n \right) ||\tilde{\theta}_n - \theta_0||^2 < \infty.$$

The proof is provided in the appendix. Technically, it relies on the careful examination of the effect of Algorithm 1 on the randomization of the feedback gains $\{G_t\}$. This randomization in turn diversifies the extended gain matrices $\{F_t\}$, so that their superposition efficiently explores the whole parameter space $\mathbb{R}^{p \times q}$, as $n$ grows. To this end, we utilize the state-of-the-art results on the behavior of the algebraic Riccati equation [14], properties of the optimality manifold [43, 44, 21], and results from martingale theory [45, 46, 47], limit distributions of dependent sequences [36, 37], and the bootstrap [35].

The regulation and identification rates of Theorem 1 are, modulo logarithmic factors, similar to the corresponding rates of the reinforcement learning policies utilizing OFU [13, 14], additive randomization [21], posterior sampling [20], and input perturbation [16]. Moreover, the square root scaling of the regret is efficient for adaptive regulation of LQ systems, as discussed next.

Recalling the discussion at the beginning of Section III, an adaptive control policy needs to sufficiently explore the parameter space in order to balance the trade-off of identification and regulation. For falsifying the imprecise approximation through an exploration procedure, the control signals need to deviate from the optimal feedback gain . More precisely, for a policy , let the deviations from the optimal feedback be , for . Then, observing the history , the error of estimating the true dynamics parameter is at least (modulo a constant factor), where  [7, 10]. Hence, if aims to falsify , the difference needs to be in the order of magnitude at least  [12, 21]. Whenever employs for designing control inputs, holds, since needs to be found through .

On the other hand, for the above deviations we have

 liminfn→∞σ2nRn(π)>0, (16)

according to the regret specification recently established [21]. Further, applying the adaptive feedback at time , the increase in the regret is approximately  [16], which is up to a constant factor at least  [14, 21]. Thus, the lower bound for implies that . Putting the latter result and (16) together, we obtain , which provides the lower bound . Note that a rigorous proof of the above lower bound argument is beyond the scope of this paper. For more detailed discussions, we refer the reader to the aforementioned references.

IV-A Numerical Illustration

Next, we present numerical analyses employing Algorithm 2 for decision-making under uncertainty. Henceforth, let be the reinforcement learning policy provided by Algorithm 2, with reinforcement rate . The true dynamical model and cost matrices are provided in (13) and (14), respectively. Solving the algebraic Riccati equation, we get given in (15), which lead to a closed-loop matrix of the spectral radius .

Figure 1 depicts the normalized regret as a function of , for replicates of the stochastic linear system in (1). The corresponding normalized identification errors are plotted in Figure 2. These figures are in a full agreement with the theoretical result of Theorem 1; both normalized rates and are dominated by logarithmic factors of the time index . In Figure 3, we plot the resulting spectral radius of the reinforcement learning policy for both the actual system of the dynamics parameter , as well as that of the surrogate system of . According to Figure 3, Algorithm 2 fully stabilizes the system, even though in the first few episodes the system is unstable.

The ensuing figures indicate the robustness of Algorithm 2 to structural breaks. Figure 4 shows the performance of and for a single break in the model, wherein at time the dynamics matrices suddenly become

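$$A_0 = \begin{bmatrix} 1.07 & 0 & -0.37 \\ 0.48 & -0.89 & 0.85 \\ 0.44 & 0.04 & 0 \end{bmatrix}, \quad B_0 = \begin{bmatrix} -0.48 & 0.44 & -0.30 \\ -0.52 & 0.59 & 0.26 \\ 0.30 & -0.44 & 0 \end{bmatrix}.$$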

A similar performance analysis while the system incurs two breaks is provided in Figure 5. The first break is similar to the one mentioned above, and occurs at time . Then, for the second break at time , the true dynamics matrices change to

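$$A_0 = \begin{bmatrix} 1.07 & 0 & -1.04 \\ 0.48 & -0.89 & 0.85 \\ 0.44 & 0.81 & 0 \end{bmatrix}, \quad B_0 = \begin{bmatrix} -0.48 & 0.44 & -0.30 \\ -0.52 & 0.59 & -0.26 \\ 0.30 & -0.30 & 0 \end{bmatrix}.$$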

According to Figures 4 and 5, Algorithm 2 is robust to remarkably large model misspecifications. Note that the reinforcement learning policy is fully ignorant of the breaks. So, the policy needs to adaptively adjust its decision-making law toward the new optimal policies.

The rationale for the exhibited robustness is as follows: when a break occurs, the parameter estimate $\tilde{\theta}_n$ becomes an inaccurate approximation of the true dynamics matrix $\theta_0$. This in turn leads to the regression residuals becoming large. Therefore, the bootstrapped parameter $\hat{\theta}_n$ computed by Algorithm 1 provides a large randomization, which in turn leads to an increase in exploration. Then, after a few episodes, the resulting enhanced exploration provides more accurate estimates $\tilde{\theta}_n$, and the above negative feedback procedure proceeds. Thus, as time grows, Algorithm 2 self-tunes to the equilibrium of the suitable amount of exploration.

The above argument intuitively indicates that the aforementioned equilibrium is a stable one. Since the endogenous randomization of the bootstrap procedure consistently assesses the accuracy of the fitted model $\tilde{\theta}_n$, the resulting adaptive policy automatically adjusts the old decision-making strategy to the new environment. Hence, the algorithm accordingly addresses the unexpected flaw of the sudden and unknown changes in the true model $\theta_0$, as well as the resulting unpredicted deviations in the trajectory of the state sequence $\{x(t)\}$.

V Concluding Remarks

We proposed a reinforcement learning algorithm for sequential decision-making for an LQ system with unknown temporal dynamics. The presented model-based policy is based on residual bootstrap, and is shown to be efficient in terms of both identification and regulation. Namely, we establish the rates for the worst-case regret, as well as the learning accuracy. Further, we discussed the robustness of the bootstrap method for handling unexpected changes in the dynamical model.

As the first work on bootstrap-based policies for LQ models, it poses a number of interesting questions. For example, theoretical analysis for addressing the performance of bootstrap method under imperfect observation is a natural direction for future work. Further subjects of interest include design and analysis of fully non-parametric randomization methods such as covariate resampling. Finally, extending the presented framework to model-free algorithms can be considered as another fruitful research direction to examine.

Appendix A Auxiliary Results

Next, we present auxiliary results being used in the proof of Theorem 1 in Appendix B. First, we present the regret bounds of Lemma 1, for which the proof can be found in the work of Faradonbeh et al. [21], using martingale convergence analysis of Lai and Wei [48]. Then, Lemma 2 provides the local specification of the optimality manifold which is established for both full-rank [43, 44] and rank-deficient dynamics matrices [21]. Subsequently, Lemma 3 is presented to study convergence rates of linear regression procedures. The asymptotic [45] and non-asymptotic [46, 47] proofs of Lemma 3 are available in the literature.

Then, we state Lemma 4, which addresses the behavior of the empirical covariance matrix of the state sequence of a stabilized system [8]. Further, Lemma 5 provides the local Lipschitz continuity property (i.e. in a neighborhood of $\theta_0$) of the feedback matrix $G(\theta)$ [14]. Finally, we establish Lemma 6 on the population covariance matrix induced by the empirical probability measure $\hat{P}_n$ defined in (10).

Lemma 1.

For the sequence of matrices , suppose that there is a filtration such that for all , are -measurable, and . Then, for the regret of the policy

 π:u(t)=Gtx(t),t=0,1,2,⋯,

the following holds:

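$$\limsup_{n \to \infty} \frac{R_n(\pi)}{\sum_{t=0}^{n-1} ||(G(\theta_0) - G_t) x(t)||^2 + \left|\left| \sum_{t=1}^{n} (A_0 + B_0 G(\theta_0))^{n-t} \xi(t) \right|\right|^2} < \infty.$$
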
Lemma 2.

For the stabilizable parameter , let be the manifold of optimal feedback gains:

 S(θ1)={θ∈Rp×q:G(θ)=G(θ1)}.

Then, the tangent space of at point consists of matrices , such that satisfy

 N′P(θ1)D1+B′1Z+B′1∞∑k=0D′1k(D′1Z+Z′D1)D1k+1=0r×p,

where , .

Lemma 3.

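Consider the dynamical system $x(t+1) = \theta_0 F_t x(t) + \xi(t+1)$. Then, define

$$U_n = \sum_{t=0}^{n-1} F_t x(t) x(t)' F_t' \in \mathbb{R}^{q \times q}, \qquad W_n = \sum_{t=0}^{n-1} \xi(t+1) x(t)' F_t' U_n^{-1/2} \in \mathbb{R}^{p \times q},$$

where the inverse of $U_n$ is understood as the Moore-Penrose inverse. It holds that

$$\limsup_{n \to \infty} \frac{|\lambda_{\max}(W_n W_n')|}{\log |\lambda_{\max}(U_n)|} < \infty.$$

Further, since $\tilde{\theta}_n$ provided by (8) satisfies the normal equation $\tilde{\theta}_n U_n = \sum_{t=0}^{n-1} x(t+1) x(t)' F_t'$, we have

$$(\tilde{\theta}_n - \theta_0) U_n (\tilde{\theta}_n - \theta_0)' = W_n W_n'.$$

Therefore, we get

$$\limsup_{n \to \infty} \frac{|\lambda_{\min}(U_n)| \, ||\tilde{\theta}_n - \theta_0||^2}{\log |\lambda_{\max}(U_n)|} < \infty.$$
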
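The algebraic identity in Lemma 3, and the eigenvalue inequality it implies, can be verified numerically. The sketch below uses i.i.d. Gaussian stand-ins for the regressors $F_t x(t)$ and the noise, which is an assumption made for brevity; the identity itself is purely algebraic and does not depend on how the regressors are generated.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, n = 2, 3, 500
theta0 = rng.standard_normal((p, q))       # hypothetical true parameter
Z = rng.standard_normal((n, q))            # stand-ins for F_t x(t)
Xi = rng.standard_normal((n, p))           # noise vectors xi(t+1)
Y = Z @ theta0.T + Xi                      # responses x(t+1)

U = Z.T @ Z                                # U_n
theta_tilde = np.linalg.solve(U, Z.T @ Y).T    # normal equation: theta U = Y'Z

# U^{-1/2} through the eigen-decomposition of the symmetric matrix U.
w, V = np.linalg.eigh(U)                   # eigenvalues in ascending order
U_inv_half = V @ np.diag(w ** -0.5) @ V.T
W = Xi.T @ Z @ U_inv_half                  # W_n

E = theta_tilde - theta0
lhs = E @ U @ E.T
rhs = W @ W.T
bound_holds = w[0] * np.linalg.norm(E, 2) ** 2 <= np.linalg.eigvalsh(rhs)[-1] + 1e-9
```

Here `lhs` and `rhs` agree up to rounding, and `bound_holds` checks that $\lambda_{\min}(U_n) ||\tilde{\theta}_n - \theta_0||^2 \leq \lambda_{\max}(W_n W_n')$, which is the step connecting the identity to the final rate.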
Lemma 4.

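Suppose that the control feedback matrix $G(\hat{\theta})$ is applied to the dynamical model (1). Hence, plugging $u(t) = G(\hat{\theta}) x(t)$ in (1), the system evolves according to $x(t+1) = \theta_0 F(\hat{\theta}) x(t) + \xi(t+1)$. Assuming $|\lambda_{\max}(\theta_0 F(\hat{\theta}))| < 1$, for the empirical covariance matrix $V_n = \sum_{t=0}^{n-1} x(t) x(t)'$ the following holds:

$$\lim_{n \to \infty} n^{-1} V_n = \sum_{k=0}^{\infty} \left( \theta_0 F(\hat{\theta}) \right)^k \Sigma \left[ \left( \theta_0 F(\hat{\theta}) \right)' \right]^k.$$
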
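The limit in Lemma 4 is the solution of a discrete Lyapunov equation, which the following sketch evaluates by summing the series term by term. The function name, the example closed-loop matrix `D` (standing in for $\theta_0 F(\hat{\theta})$), and the noise covariance are hypothetical.

```python
import numpy as np

def stationary_covariance(D, Sigma, tol=1e-12, max_terms=100_000):
    """Evaluate sum_{k >= 0} D^k Sigma (D')^k for a stable matrix D."""
    total = np.zeros_like(Sigma, dtype=float)
    term = Sigma.astype(float)
    for _ in range(max_terms):
        total += term
        term = D @ term @ D.T              # next term of the series
        if np.max(np.abs(term)) < tol:
            return total
    raise RuntimeError("series did not converge; is D stable?")

D = np.array([[0.5, 0.2], [0.0, 0.3]])     # hypothetical stable closed-loop matrix
Sigma = np.eye(2)                          # hypothetical noise covariance
V = stationary_covariance(D, Sigma)
lyapunov_residual = np.max(np.abs(V - (D @ V @ D.T + Sigma)))
```

The residual is numerically zero because the series satisfies the fixed-point relation $V = D V D' + \Sigma$.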
Lemma 5.

Letting , there is a constant such that

Lemma 6.

Let be the expectation with respect to ; the empirical probability measure of the residuals defined in (10). Then, for the Bootstrap covariance matrix we have

 0
Proof.

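First, the definition of $\zeta(t+1)$ in (9), in addition to the dynamics (1), yields $\zeta(t+1) = (\theta_0 - \tilde{\theta}_n) F_t x(t) + \xi(t+1)$. Defining $U_n, W_n$ similar to Lemma 3, the normal equation implies that

$$\sum_{t=0}^{n-1} \left[ \xi(t+1) x(t)' F_t' (\theta_0 - \tilde{\theta}_n)' + (\theta_0 - \tilde{\theta}_n) F_t x(t) \xi(t+1)' \right] = -2 W_n W_n'.$$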

So, we obtain

 ˆΣn =

since

. Thus, applying the Law of Large Numbers to the matrices

, we get . Further, by the Law of Large Numbers, vanishes as grows. Therefore, the definitions of lead to

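$$\begin{aligned}
\limsup_{n \to \infty} ||\bar{\zeta}_n|| &\leq \limsup_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} ||(\theta_0 - \tilde{\theta}_n) F_t x(t)|| + \left|\left| \frac{1}{n} \sum_{t=0}^{n-1} \xi(t+1) \right|\right| \\
&\leq \limsup_{n \to \infty} \left( \frac{1}{n} \sum_{t=0}^{n-1} ||(\theta_0 - \tilde{\theta}_n) F_t x(t)||^2 \right)^{1/2} \\
&\leq \limsup_{n \to \infty} \mathrm{tr}\left( \frac{1}{n} \sum_{t=0}^{n-1} (\theta_0 - \tilde{\theta}_n) F_t x(t) x(t)' F_t' (\theta_0 - \tilde{\theta}_n)' \right)^{1/2} \\
&\leq \limsup_{n \to \infty} \left( \frac{p}{n} |\lambda_{\max}(W_n W_n')| \right)^{1/2}.
\end{aligned}$$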

Finally, since Lemma 3 implies that , we get the desired result on the smallest eigenvalue: .

Appendix B Proof of Theorem 1

The following analysis rigorously studies the behavior of both and