I Introduction
In the theory of reinforcement learning, efficient algorithms with provable theoretical guarantees are established for two canonical settings. The first is finite state Markov decision processes (MDPs) with state spaces of small cardinality [1]. The second is the continuous space setting of linear quadratic (LQ) systems [2]. In the latter, the control action and the state are both multidimensional real vectors, and the state evolves according to stochastic linear dynamics determined by the control action. Further, the cost (or negative reward) has a quadratic form in both the state and the control input. Besides being theoretically amenable, LQ models capture a wide range of applications, from air conditioning control [3] to portfolio optimization [4]. LQ models also arise when studying the behavior of nonlinear systems around the working equilibrium [5, 6].
In applications where the true system model is not known, data-driven strategies are required for decision making under uncertainty [7]. Then, the learning algorithm has to select actions amongst infinitely many options in order to steer the system toward minimizing the costs incurred. Note that, unlike the finite state MDP case, in LQ systems there is a possible danger of the state vector becoming unbounded [8, 9, 10]. Therefore, the design and analysis of reinforcement learning algorithms for LQ systems involve significantly different conceptual and technical issues to balance exploration (identification) and exploitation (control). For this purpose, one might consider using upper-confidence bound (UCB) approaches [11, 12, 13, 14] that rely on the optimism in the face of uncertainty (OFU) principle. The UCB approach was historically first developed for finite action bandit problems [15]. While efficient in the finite action setting, UCB-based approaches have been found to be computationally intractable in more general problems [16].
Recently, various methods for reinforcement learning have been proposed that leverage randomization strategies to guide the learning process. Randomized policy search methods have been studied both empirically and theoretically (see e.g. [17, 18]). For the problem of stabilizing an unknown LQ system, an algorithm leveraging random feedback gains is proposed [19]. There is also work showing the efficiency of achieving the exploration-exploitation tradeoff by randomizing the learned model through both posterior sampling [20] and additive randomization [21]. Finally, finite time analysis of Certainty Equivalent policies utilizing input perturbation has led to performance guarantees for both learning [22] and planning [16].
In this paper, we study randomized algorithms that leverage the statistical bootstrap [23] for reinforcement learning in LQ systems. Bootstrap-based exploration has been analyzed in simpler settings, such as bandit problems [24, 25]. There has been a lot of interest recently in using bootstrap-based exploration strategies, especially along with deep neural networks [26, 27, 28, 29]. However, results on bootstrap-based reinforcement learning algorithms for LQ models have been limited primarily to numerical analyses for learning the model-misspecification error [30], while rigorous performance guarantees are not currently available.
Further, bootstrap methods are also of practical interest because of their robustness to misspecified models. The amount of exploration in bootstrap-based adaptive control policies is endogenously determined by the history of the system to date. Therefore, the policy adapts its decision-making strategy to possible systematic and/or latent “biases” occurring due to lack of accurate information regarding the system’s dynamics (see the discussion at the end of Section IV for more details). Examples of such “biases” include structural breaks [31], system resets [30], and misspecification of the model dimension [32, 33, 34].
The focus of this work is on the performance of reinforcement learning policies that use the residual bootstrap to balance exploration and exploitation. We show that model-based strategies that use linear regression for learning the model and bootstrapping for policy design provide a regret that scales as the square root of the total time of interacting with the system. Further, the accuracy of learning the unknown dynamics parameter will be specified. To establish the results, we carefully examine the effect of different converging and diverging quantities involved in the problem, such as the errors in learning the model, the distributions induced by the regression residuals, the correlation within and between the observed input-state signals, and the ongoing learning-planning interactions, which are of independent interest. At the technical level, we leverage results in the literature on the bootstrap [35], martingale central limit [36, 37] and convergence theorems [38].
The remainder of the paper is organized as follows: Section II introduces the mathematical model under consideration, discusses the rigorous formulation of the problem, and provides some necessary preliminaries. Section III describes the bootstrap procedure and the resulting reinforcement learning algorithm to design the policy. Subsequently, the main result on the performance of the proposed algorithm is presented in Section IV, together with numerical work showcasing the performance of the algorithm.
Notation
The following notation will be employed throughout the paper. is the transpose of matrix or vector
. The largest and the smallest eigenvalues of square Hermitian matrix
are denoted by and , respectively. If is not Hermitian, the ordering of the eigenvalues is determined by their magnitudes. The norm of the dimensional vector is denoted by , and is used for the operator norm of matrices: . For atomic measures on Euclidean spaces we use Dirac function ; i.e. it denotes a unit point mass at . Finally, the letters and (or ) are being used for generic reinforcement learning policies and model parameters, respectively, and will be rigorously defined later on.
II Setting and Problem Formulation
The model, denoted by , consists of multidimensional state and control vectors, parameters that specify its dynamical evolution over time and cost matrices, as defined next. The dimensional state process evolves according to an unknown stochastic linear dynamic equation governed by the dimensional control action , and the random disturbance (or noise) process :
(1) 
That is, the current state and the input determine the next state through the state transition matrix , and the input influence matrix , respectively.
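As a hedged illustration of the dynamics described above, the following Python sketch rolls out a system of the form x(t+1) = A x(t) + B u(t) + w(t+1). The function name `simulate_linear_system`, the example matrices `A` and `B`, and the Gaussian noise are assumptions made for the illustration and are not taken from the paper.

```python
import numpy as np

def simulate_linear_system(A, B, inputs, noise_cov, x0, rng):
    """Roll out x(t+1) = A x(t) + B u(t) + w(t+1) for a given input sequence."""
    p = A.shape[0]
    states = np.zeros((len(inputs) + 1, p))
    states[0] = x0
    for t, u in enumerate(inputs):
        # Independent, zero-mean noise with a fixed covariance (assumed Gaussian here).
        w = rng.multivariate_normal(np.zeros(p), noise_cov)
        states[t + 1] = A @ states[t] + B @ u + w
    return states

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.2], [0.0, 0.8]])   # hypothetical state transition matrix
B = np.array([[0.0], [1.0]])             # hypothetical input influence matrix
inputs = np.zeros((50, 1))               # zero control, for illustration only
states = simulate_linear_system(A, B, inputs, 0.1 * np.eye(2), np.zeros(2), rng)
```

Row t of `states` holds the state at time t, so the array has one more row than `inputs`.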
Definition 1.
Henceforth, we will denote the true parameter tuple by the dynamics matrix , with . Similarly, we use the parameter to denote generic dynamics matrices.
The additive noise in the stochastic dynamics (1) satisfies . For the sake of simplicity, we assume that the sequence of noise vectors are independent, and have a stationary covariance structure: . Further, is assumed to be positive definite, and , for some fixed . As a matter of fact, extensions to more general technical settings such as nonstationary [16] or singular covariance matrices (assuming reachability [9]), as well as conditionally independent processes [13], can be accommodated in a similar manner. Note though that the assumed noise process is not necessarily stationary in the strict sense.
We are interested in finding reinforcement learning policies to minimize the long-term average cost as formally defined next. First, suppose that and are the regulation weight matrices reflecting the effect of the state and the input vectors in the cost function, respectively. Specifically, letting be the decision making law (policy) determining the control input at every time , define the quadratic instantaneous cost of according to
(2) 
where , and are symmetric positive definite matrices. Thus, (2) reflects the desire to regulate the state of the system through control actions of small magnitude.
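The quadratic cost described above can be sketched in a few lines of Python. This is a minimal illustration assuming the usual form x' Q x + u' R u with symmetric positive definite weights; the names `instantaneous_cost` and `average_cost` and the numerical values are hypothetical.

```python
import numpy as np

def instantaneous_cost(x, u, Q, R):
    """Quadratic cost x' Q x + u' R u for a single time step."""
    return float(x @ Q @ x + u @ R @ u)

def average_cost(states, inputs, Q, R):
    """Empirical average of the instantaneous costs over a trajectory."""
    costs = [instantaneous_cost(x, u, Q, R) for x, u in zip(states, inputs)]
    return sum(costs) / len(costs)

Q = np.eye(2)                 # hypothetical state weight
R = np.array([[2.0]])         # hypothetical input weight
x = np.array([1.0, -2.0])
u = np.array([0.5])
cost = instantaneous_cost(x, u, Q, R)      # 1 + 4 + 2 * 0.25 = 5.5
avg = average_cost([x, x], [u, u], Q, R)   # same step repeated twice
```

Large states and large inputs are both penalized, which is the regulation objective the text refers to.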
When the dynamics follow (1), and the instantaneous cost is given by (2), we denote the model by . Further, the history of the system at time , denoted as , consists of the sequence of the control inputs applied so far, and the resulting state vectors:
A reinforcement learning policy observes the history at time aiming to control the cost incurred. That is, the policy is a (possibly random) mapping which designs the input sequence according to the history available up to that time;
(3) 
so that the average cost is minimized. Thus, the objective is summarized in the following regulation problem:
Importantly, according to (3) the true dynamics parameter in (1) is unknown. Therefore, the policy must also employ an exploration procedure to accurately learn the model parameters, thus addressing the following identification problem:
Note that in the above formulation, the true system dynamics are unknown, while the cost matrices are known. It gives rise to a realistic setting, since the decision making algorithm does not know the actual evolution of the underlying system (i.e. ), but is aware of the criteria according to which a policy to achieve the goal is being assessed (i.e. ).
Subsequently, we define the regret of a policy, which is the amount of suboptimality it incurs due to uncertainty about the parameters of the model (1). To do so, we need to introduce an optimal policy that minimizes the average cost, given full knowledge of the system model . Then, will be the baseline for assessing the exploitation performance of the arbitrary reinforcement learning policy . It is well known that in order to find , an algebraic Riccati equation needs to be solved [39, 40].
To proceed, we introduce some additional notation. First, for an arbitrary define the matrix valued mapping
Both the domain and the range of are the set of matrices. Next, if there is a positive semidefinite matrix satisfying the algebraic Riccati equation , let the feedback gain matrix be
(5) 
Furthermore, for in (1), define the linear time-invariant (LTI) policy
(6) 
Finally, using , the regret of is naturally defined by:
(7) 
It remains to specify settings for which is well-defined. To that end, the following closed-loop stabilizability condition for the model (1) is necessary and sufficient [39, 40].
Assumption 1.
There is an LTI feedback gain , such that satisfies .
Note that in general, the stabilizing gain mentioned above is only required to exist, and does not need to be known to the decision maker. In other words, to verify that the stabilizability Assumption 1 holds, it suffices to show that a hypothetical omniscient decision maker (who knows the true model ) possessing an omnipotent computational power is able to stabilize the system. However, we briefly outline the available constructive methods to compute (as well as ). It is shown (see for example [39, 40, 19]) that under Assumption 1 the following statements hold;

The positive definite matrix uniquely exists. So, both the feedback and the optimal policy are well defined.

Letting be an arbitrary positive semidefinite matrix, the recursive formula converges exponentially fast to as grows.

The feedback matrix stabilizes the system:

The minimum of the average cost (4) is achieved by .

In the class of LTI policies (i.e. of the form ), the policy is the only optimal one.
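The recursive computation mentioned in the list above can be sketched as follows. This is a minimal illustration assuming the standard discrete-time Riccati recursion P ← Q + A'PA − A'PB(B'PB + R)^{-1}B'PA and the gain L = −(B'PB + R)^{-1}B'PA; the function names and the example matrices are hypothetical.

```python
import numpy as np

def riccati_step(P, A, B, Q, R):
    """One step of the standard discrete-time Riccati recursion."""
    BtP = B.T @ P
    gain_term = np.linalg.solve(BtP @ B + R, BtP @ A)
    return Q + A.T @ P @ A - A.T @ P @ B @ gain_term

def solve_riccati(A, B, Q, R, tol=1e-10, max_iter=10_000):
    """Iterate from P = 0 (positive semidefinite) until the update is negligible."""
    P = np.zeros_like(Q, dtype=float)
    for _ in range(max_iter):
        P_next = riccati_step(P, A, B, Q, R)
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    raise RuntimeError("Riccati recursion did not converge")

def feedback_gain(P, A, B, R):
    """Linear feedback gain L such that u = L x."""
    return -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)

A = np.array([[1.2, 0.5], [0.0, 0.9]])   # hypothetical, open-loop unstable
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
P = solve_riccati(A, B, Q, R)
L = feedback_gain(P, A, B, R)
spectral_radius = max(abs(np.linalg.eigvals(A + B @ L)))
```

In this example the pair (A, B) is controllable, so the recursion settles and the resulting closed-loop matrix A + B L has spectral radius below one even though A itself is unstable.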
In the remainder of the paper, we employ reinforcement learning algorithms to address Problem 1, studying the growth rates of . Similarly, letting be the learned/estimated parameter at time (the sample size is as well), we consider the exploration performance in Problem 2 through the rates of the learning error . Bootstrap is the cornerstone of the proposed algorithms to efficiently randomize the design of the control inputs, and address the tradeoff between the learning accuracy and the regret.
III Algorithms
An algorithm needs to address the common dilemma of decision making under uncertainty, as follows. First, if the algorithm makes decisions naively according to the estimated (learned) dynamics parameter, it will presumably fail to provide a small regret. Intuitively, the state and the action are required to be highly correlated in order to remain close to the optimal strategy in (6). Because of this correlation, history may fail to accurately learn , which can lead to drastically large regret values. Technically, if for some feedback gain matrix , then the dimension of the observed history is effectively , while the rows of the matrices in the parameter space belong to . Therefore, learning can be dramatically misleading. This phenomenon of failing to falsify the imprecise approximations of the true model is extensively discussed in the adaptive control literature [41, 42, 12, 21].
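The loss of identifiability under a fixed feedback can be checked numerically. In the hedged sketch below, the input is always u = L x for one hypothetical gain `L`, so the stacked regressors [x; u] span only as many dimensions as the state, and least squares cannot separate the two blocks of the dynamics matrix. All names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
p, r, T = 2, 1, 200
A = np.array([[0.9, 0.2], [0.0, 0.8]])   # hypothetical true dynamics
B = np.array([[0.0], [1.0]])
L = np.array([[-0.1, -0.3]])             # hypothetical fixed feedback gain

x = np.zeros(p)
regressors = []
for _ in range(T):
    u = L @ x                                  # input fully determined by the state
    regressors.append(np.concatenate([x, u]))
    x = A @ x + B @ u + rng.normal(size=p)

Z = np.array(regressors)                       # T x (p + r) regressor matrix
rank = np.linalg.matrix_rank(Z, tol=1e-8)      # only p independent columns remain
```

Although `Z` has p + r columns, its rank equals p, which is the degeneracy the paragraph above describes.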
In other words, if the policy fails to sufficiently explore the parameter space, an inaccurate approximation can falsely be treated as an accurate one. This necessitates an efficient exploration strategy to decrease the aforementioned correlation between the state and the action. Moreover, the above argument reveals the reasoning leading to UCB approaches [13, 14], or statistically independent dither schemes [16, 22], as useful prescriptions to overcome the exploration-exploitation dilemma.
In order to explore, the decision maker needs to deviate from the learned model prior to using to design the reinforcement learning policy. On the other hand though, the above deviations must be sufficiently small in order to avoid significant deterioration in the exploitation performance (i.e. increase in the regret). The solution we discuss here is to utilize the bootstrap to provide the necessary balance between these two competing objectives.
To this end, the policy applies the supposedly optimal control action treating as the true model, where is provided by the bootstrap algorithm. It computes the regression residuals for the learned parameter , and bootstraps (i.e. resamples) them to reconstruct a surrogate system. Then, the history of the surrogate system will be the data being used to compute . In the first subsection, we explain the least squares estimator for learning the model parameter, as well as the above residual bootstrap procedure.
Subsequently, in the second subsection we present an episodic algorithm which updates the model-based policy at the end of every episode, while the lengths of the episodes grow exponentially fast. Therefore, as the duration of the interaction with the system grows, the number of policy updates scales logarithmically. As a matter of fact, this leads to a significant reduction in the computation of the reinforcement learning policy, by avoiding unnecessary updates before collecting sufficient data, due to the fact that the solution of the algebraic Riccati equation (5) for a hypothetical model is not instantly available. The latter would impose a substantial computational burden, especially for systems whose dimension is fairly large.
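To make the logarithmic scaling concrete, the sketch below counts policy updates when they occur at integer times of the form floor(rate^m). The exact schedule is an assumption for the illustration, as are the names `update_times` and `rate`.

```python
import math

def update_times(horizon, rate):
    """Distinct integer times floor(rate**m) <= horizon, for m = 0, 1, 2, ..."""
    times = []
    m = 0
    while True:
        t = math.floor(rate ** m)
        if t > horizon:
            break
        if not times or t != times[-1]:   # skip repeated floors when rate < 2
            times.append(t)
        m += 1
    return times

# Number of updates for horizons spanning four orders of magnitude.
counts = {T: len(update_times(T, 2.0)) for T in (10**2, 10**4, 10**6)}
```

With rate 2, the horizon grows by a factor of ten thousand while the number of updates grows only from 7 to 20.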
III-A Residual Bootstrap
According to the linear dynamical model in (1), a natural procedure to learn through the control input and the observed states is based on least squares. In the sequel, we discuss the residual bootstrap method for the least squares learning procedure. Further, we will present the corresponding algorithm which will be used as a subroutine in the reinforcement learning algorithm in the next subsection.
Recall that the LTI policy in (6) is optimal. Thus, a natural form of the adaptive policies that a reinforcement learning algorithm is expected to provide through planning according to the learned model, is . Assuming so for , now the algorithm needs to decide about the action at time . Thus, plugging in the dynamical model (1), and denoting
we get the so-called closed-loop evolution of the system by the (possibly time-varying) autoregressive dynamics
for . Then, Algorithm 1 returns the bootstrapped parameter based on the matrices , as well as the available state observations . The details are provided below.
First, based on the collected history , define the following least squares estimate of :
(8) 
The learning procedure (8) treats the noise vectors as the errors of a linear regression procedure, based on the dynamical model (1). Therefore, the residuals of the least squares estimate are defined by the difference between the observed response , and the fitted response . That is,
(9) 
for . The residuals can conceptually be considered as approximations of the actual regression errors . Using the residuals , we define the centered empirical distribution
(10) 
where , the average of the residuals given by
(11) 
is being used for centering the empirical distribution. In fact, is the sample probability measure for the population distribution of the noise process . Note that is defined on . We then use and to generate the surrogate state vectors by the dynamical model
where the bootstrap noise vectors are drawn independently from . Hence, letting be the expectation with respect to , clearly we have . Also note that the actual dynamics parameter for the surrogate system is the learned parameter defined in (8). Finally, the algorithm applies the least squares estimator to the generated surrogate states to obtain :
(12) 
The pseudocode for the residual bootstrap explained above is given in Algorithm 1. It will be used later at the heart of Algorithm 2 to design reinforcement learning policies.
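The steps of the residual bootstrap described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1 verbatim: the regressors are taken to be the stacked vectors [x(t); u(t)], the pseudo-inverse is used for least squares, the surrogate system starts from the observed initial state, and the names `least_squares`, `residual_bootstrap`, `theta_hat`, and `theta_tilde` are hypothetical.

```python
import numpy as np

def least_squares(Z, Y):
    """Rows of Z are regressors [x(t); u(t)], rows of Y are x(t+1).
    Returns theta such that Y is approximately Z @ theta.T."""
    return (np.linalg.pinv(Z) @ Y).T

def residual_bootstrap(states, inputs, gains, rng):
    """states: (n+1, p); inputs: (n, r); gains: n feedback matrices of shape (r, p)."""
    Z = np.hstack([states[:-1], inputs])
    Y = states[1:]
    theta_hat = least_squares(Z, Y)                  # learned parameter
    residuals = Y - Z @ theta_hat.T                  # regression residuals
    centered = residuals - residuals.mean(axis=0)    # centered empirical distribution
    n = Y.shape[0]
    draws = centered[rng.integers(0, n, size=n)]     # resample with replacement
    x = states[0].copy()
    Z_s, Y_s = [], []
    for t in range(n):
        z = np.concatenate([x, gains[t] @ x])        # surrogate regressor, same gains
        x = theta_hat @ z + draws[t]                 # surrogate dynamics use theta_hat
        Z_s.append(z)
        Y_s.append(x)
    theta_tilde = least_squares(np.array(Z_s), np.array(Y_s))
    return theta_hat, theta_tilde

rng = np.random.default_rng(0)
A0 = np.array([[0.9, 0.2], [0.0, 0.8]])              # hypothetical true dynamics
B0 = np.array([[0.0], [1.0]])
n = 400
gains = [np.array([[-0.1, -0.3]])] * (n // 2) + [np.array([[0.1, -0.6]])] * (n // 2)
states = np.zeros((n + 1, 2))
inputs = np.zeros((n, 1))
for t in range(n):
    inputs[t] = gains[t] @ states[t]
    states[t + 1] = A0 @ states[t] + B0 @ inputs[t] + rng.normal(size=2)
theta_hat, theta_tilde = residual_bootstrap(states, inputs, gains, rng)
```

The difference between `theta_tilde` and `theta_hat` is the randomization that the policy later exploits for exploration; its size is driven by the residuals rather than by a hand-tuned noise level.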
Remark 1.
If the noise process is parametrized, one can accordingly draw from the corresponding parametric sample distribution.
To see that, assume we know that the noise vectors belong to a parametric family of stochastic processes. Then, instead of using the nonparametric empirical distribution in (10), one can use the residuals to estimate the parameter of interest. So, letting be the parametric distribution provided by the obtained estimate, the bootstrap noise can be sampled independently from . For example, if we know that are i.i.d. Gaussian vectors, we can find the sample covariance matrix , and draw independently from the centered Gaussian distribution with covariance matrix .
Remark 2.
In the original version of bootstrap [23], the covariates (i.e. the state vectors) are fixed, and only the residuals are being bootstrapped. In time series models such as (1), every state vector comprises the previous noise vectors. Therefore, bootstrapping the residuals automatically leads to a new state sequence for the surrogate system [30].
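For the Gaussian case mentioned in Remark 1, the parametric draw can be sketched as below. The function name `gaussian_bootstrap_noise` and the synthetic residuals are assumptions for the illustration; the sample covariance is computed after centering, and the bootstrap noise is drawn from a zero-mean Gaussian with that covariance.

```python
import numpy as np

def gaussian_bootstrap_noise(residuals, size, rng):
    """Draw bootstrap noise from a centered Gaussian fitted to the residuals."""
    p = residuals.shape[1]
    # np.cov subtracts the sample mean; bias=True divides by the sample size.
    cov = np.cov(residuals, rowvar=False, bias=True)
    return rng.multivariate_normal(np.zeros(p), cov, size=size)

rng = np.random.default_rng(0)
# Synthetic residuals with different scales per coordinate (illustration only).
residuals = rng.normal(size=(500, 2)) * np.array([1.0, 2.0])
draws = gaussian_bootstrap_noise(residuals, 1000, rng)
```

These draws would replace the resampled residuals when generating the surrogate state sequence.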
III-B Policy Design
Next, Algorithm 2 for decision making under uncertainty based on bootstrapping the residuals (Algorithm 1) is discussed. For this purpose, we first define the extended gain matrix based on the optimal feedback .
Definition 2.
For parameter , using the matrix in (5), define the matrix .
The matrix can be interpreted as an extension of the original feedback gain; applying , the closed-loop transition matrix takes the form .
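A short numerical check of this interpretation, assuming the extended gain stacks the identity on top of the feedback gain and the parameter concatenates the two dynamics blocks side by side; the names `extended_gain`, `theta`, and the example matrices are hypothetical.

```python
import numpy as np

def extended_gain(L):
    """Stack the identity on top of the feedback gain: [I; L]."""
    p = L.shape[1]
    return np.vstack([np.eye(p), L])

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
L = np.array([[-0.1, -0.3]])
theta = np.hstack([A, B])                   # parameter [A, B]
closed_loop = theta @ extended_gain(L)      # equals A + B L
```

Writing the closed loop as a single product keeps the unknown parameter as one factor, which is convenient when the regression is expressed in terms of the stacked vector [x; u].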
Recall that the true model is not known, and a reinforcement learning algorithm needs to simultaneously learn the dynamics parameter, and design the control input. To do so, we present an episodic decision making strategy outlined in Algorithm 2. That is, the policy applies control actions during each episode, assuming that the approximation of the model available at the time coincides with the true model. Then, at the end of every episode, the algorithm updates the learned model based on the history collected so far, and continues making decisions as if the new approximate model is the truth. The learning mentioned above is through a linear regression for the dynamics (1), and the approximation consists of bootstrapping (by Algorithm 1) the model estimate obtained by the regression. In the sequel, we explain the details of the above alternating steps of the algorithm.
The reinforcement learning policy is initiated with the history in the first line of Algorithm 2. Then, it chooses an arbitrary stabilizable approximation of , denoted by , and starts the system by applying the action prescribed by the model ; i.e. . Note that selection of is straightforward, since almost all (w.r.t. Lebesgue measure) parameter matrices are stabilizable [20].
The starting timepoints of the episodes are determined by the exponents of the reinforcement rate . That is, at every time , the approximation will be updated, while for the algorithm freezes . In other words, whenever , Algorithm 2 calls the residual bootstrap Algorithm 1 to get according to the collected history of the control actions and the states. So, for all , the matrices are exactly the same. The efficiency of the policy relies on the idea that the sequence will provide finer approximations of the truth , as the algorithm proceeds (or more precisely, as grows).
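The episodic structure described above can be sketched as a loop that freezes the feedback gain within an episode and refreshes it only at the update times. This is a structural illustration, not the paper's Algorithm 2: the update times are assumed to be floor(rate^m), and `update_gain` is a caller-supplied placeholder standing in for the residual bootstrap followed by the gain computation. The demonstration uses a deterministic stub so that only the scheduling is exercised.

```python
import math
import numpy as np

def run_episodic_policy(x0, horizon, rate, initial_gain, update_gain, step):
    """Apply u(t) = L x(t); refresh L only when t equals floor(rate**m)."""
    gain = initial_gain
    states = [np.asarray(x0, dtype=float)]
    inputs, update_log = [], []
    m = 0
    for t in range(horizon):
        if t > 0 and t == math.floor(rate ** m):
            # New episode: recompute the gain from the history collected so far.
            gain = update_gain(np.array(states), np.array(inputs))
            update_log.append(t)
        while math.floor(rate ** m) <= t:
            m += 1                      # advance to the next future update time
        u = gain @ states[-1]           # gain stays frozen inside the episode
        inputs.append(u)
        states.append(step(states[-1], u))
    return np.array(states), np.array(inputs), update_log

calls = []
def stub_update(states, inputs):
    """Placeholder for bootstrap plus gain design; records the sample size."""
    calls.append(len(inputs))
    return np.array([[-0.5]])

states, inputs, update_log = run_episodic_policy(
    x0=[1.0], horizon=20, rate=2.0, initial_gain=np.array([[-0.4]]),
    update_gain=stub_update, step=lambda x, u: 0.9 * x + u)
```

In a fuller implementation, `update_gain` would estimate the model from the history, bootstrap it, and return the feedback gain of the bootstrapped model.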
IV Theoretical Results and Simulations
We start by establishing performance guarantees on the regret and the learning accuracy for bootstrap-based policies, supplemented by numerical examples that illustrate the behavior of Algorithm 2 for both identification and regulation. The following result specifies the growth rate of the regret for the policy designed by Algorithm 2, as well as the decay rate of the identification error .
Theorem 1.
The proof is provided in the appendix. Technically, it relies on the careful examination of the effect of Algorithm 1 on the randomization of the feedback gains . This randomization in turn diversifies the extended gain matrices , so that their superposition efficiently explores the whole parameter space , as grows. To this end, we utilize the state-of-the-art results on the behavior of the algebraic Riccati equation [14], properties of the optimality manifold [43, 44, 21], and results from martingale theory [45, 46, 47], limit distributions of dependent sequences [36, 37], and the bootstrap [35].
The regulation and identification rates of Theorem 1 are, modulo logarithmic factors, similar to the corresponding rates of the reinforcement learning policies utilizing OFU [13, 14], additive randomization [21], posterior sampling [20], and input perturbation [16]. Moreover, the square root scaling of the regret is efficient for adaptive regulation of LQ systems as discussed next.
(13)  
(14)  
(15) 
Recalling the discussion at the beginning of Section III, an adaptive control policy needs to sufficiently explore the parameter space in order to balance the tradeoff of identification and regulation. For falsifying the imprecise approximation through an exploration procedure, the control signals need to deviate from the optimal feedback gain . More precisely, for a policy , let the deviations from the optimal feedback be , for . Then, observing the history , the error of estimating the true dynamics parameter is at least (modulo a constant factor), where [7, 10]. Hence, if aims to falsify , the difference needs to be in the order of magnitude at least [12, 21]. Whenever employs for designing control inputs, holds, since needs to be found through .
On the other hand, for the above deviations we have
(16) 
according to the regret specification recently established [21]. Further, applying the adaptive feedback at time , the increase in the regret is approximately [16], which is up to a constant factor at least [14, 21]. Thus, the lower bound for implies that . Putting the latter result and (16) together, we obtain , which provides the lower bound . Note that a rigorous proof of the above lower bound argument is beyond the scope of this paper. For more detailed discussions, we refer the reader to the aforementioned references.
IV-A Numerical Illustration
Next, we present numerical analyses employing Algorithm 2 for decision-making under uncertainty. Henceforth, let be the reinforcement learning policy provided by Algorithm 2, with reinforcement rate . The true dynamical model and cost matrices are provided in (13) and (14), respectively. Solving the algebraic Riccati equation, we get given in (15), which leads to a closed-loop matrix of spectral radius .
Figure 1 depicts the normalized regret as a function of , for replicates of the stochastic linear system in (1). The corresponding normalized identification errors are plotted in Figure 2. These figures are in full agreement with the theoretical result of Theorem 1; both normalized rates and are dominated by logarithmic factors of the time index . In Figure 3, we plot the resulting spectral radius of the reinforcement learning policy for both the actual system of the dynamics parameter , as well as that of the surrogate system of . According to Figure 3, Algorithm 2 fully stabilizes the system, even though in the first few episodes the system is unstable.
The ensuing figures indicate the robustness of Algorithm 2 to structural breaks. Figure 4 shows the performance of and for a single break in the model, wherein at time the dynamics matrices suddenly become
A similar performance analysis while the system incurs two breaks is provided in Figure 5. The first break is similar to the one mentioned above, and occurs at time . Then, for the second break at time , the true dynamics matrices change to
According to Figures 4 and 5, Algorithm 2 is robust to remarkably large model misspecifications. Note that the reinforcement learning policy is fully ignorant of the breaks. So, needs to adaptively adjust its decision-making law toward the new optimal policies.
The rationale for the exhibited robustness is as follows: when a break occurs, the parameter estimate becomes an inaccurate approximation of the true dynamics matrix . This in turn causes the regression residuals to become large. Therefore, the bootstrapped parameter computed by Algorithm 1 provides a large randomization, which in turn leads to an increase in the exploration phase. Then, after a few episodes, the resulting enhanced exploration provides more accurate estimates , and the above negative feedback procedure proceeds. Thus, as time grows, Algorithm 2 self-tunes to the equilibrium of the suitable amount of exploration.
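The first step of this mechanism, namely that a break inflates the least squares residuals, can be illustrated on a deliberately simplified scalar autoregression without control input. The model, the function name `residual_variance`, and the parameter values are assumptions chosen for the illustration and do not reproduce the paper's experiments.

```python
import numpy as np

def residual_variance(a_before, a_after, n, break_time, rng):
    """Fit one scalar AR(1) coefficient to a trajectory whose true coefficient
    switches at break_time, and return the variance of the residuals."""
    x = np.zeros(n + 1)
    noise = rng.normal(size=n)
    for t in range(n):
        a = a_before if t < break_time else a_after
        x[t + 1] = a * x[t] + noise[t]
    a_hat = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])   # least squares over the whole history
    residuals = x[1:] - a_hat * x[:-1]
    return float(np.var(residuals))

n = 2000
# Same noise realization in both runs; only the presence of the break differs.
var_no_break = residual_variance(0.8, 0.8, n, n, np.random.default_rng(0))
var_break = residual_variance(0.8, -0.8, n, n // 2, np.random.default_rng(0))
```

Because a single coefficient cannot fit both regimes, the residuals after the break absorb the model mismatch, and resampling them injects correspondingly larger randomization into the bootstrapped parameter.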
The above argument intuitively indicates that the aforementioned equilibrium is a stable one. Since the endogenous randomization of the bootstrap procedure consistently assesses the accuracy of the fitted model , the resulting adaptive policy automatically adjusts the old decision-making strategy to the new environment. Hence, the algorithm accordingly addresses the unexpected flaw of the sudden and unknown changes in the true model , as well as the resulting unpredicted deviations in the trajectory of the state sequence .
V Concluding Remarks
We proposed a reinforcement learning algorithm for sequential decision-making for an LQ system with unknown temporal dynamics. The presented model-based policy is based on the residual bootstrap, and is shown to be efficient in terms of both identification and regulation. Namely, we established the rates for the worst-case regret, as well as the learning accuracy. Further, we discussed the robustness of the bootstrap method for handling unexpected changes in the dynamical model.
As the first work on bootstrap-based policies for LQ models, it poses a number of interesting questions. For example, theoretical analysis of the performance of the bootstrap method under imperfect observation is a natural direction for future work. Further subjects of interest include design and analysis of fully nonparametric randomization methods such as covariate resampling. Finally, extending the presented framework to model-free algorithms is another fruitful research direction to examine.
Appendix A Auxiliary Results
Next, we present auxiliary results being used in the proof of Theorem 1 in Appendix B. First, we present the regret bounds of Lemma 1, for which the proof can be found in the work of Faradonbeh et al. [21], using martingale convergence analysis of Lai and Wei [48]. Then, Lemma 2 provides the local specification of the optimality manifold which is established for both full-rank [43, 44] and rank-deficient dynamics matrices [21]. Subsequently, Lemma 3 is presented to study convergence rates of linear regression procedures. The asymptotic [45] and nonasymptotic [46, 47] proofs of Lemma 3 are available in the literature.
Then, we state Lemma 4, which addresses the behavior of the empirical covariance matrix of the state sequence of a stabilized system [8]. Further, Lemma 5 provides the local Lipschitz continuity property (i.e. in a neighborhood of ) of the feedback matrix [14]. Finally, we establish Lemma 6 on the population covariance matrix induced by the empirical probability measure defined in (10).
Lemma 1.
For the sequence of matrices , suppose that there is a filtration such that for all , are measurable, and . Then, for the regret of the policy
the following holds:
Lemma 2.
For the stabilizable parameter , let be the manifold of optimal feedback gains:
Then, the tangent space of at point consists of matrices , such that satisfy
where , .
Lemma 3.
Consider the dynamical system . Then, define
where is the Moore-Penrose inverse. It holds that
Further, since provided by (8) satisfies the normal equation , we have
Therefore, we get
Lemma 4.
Suppose that the control feedback matrix is applied to the dynamical model . Hence, plugging in (1), the system evolves according to . Assuming , for the empirical covariance matrix the following holds:
Lemma 5.
Letting , there is a constant such that
Lemma 6.
Let be the expectation with respect to ; the empirical probability measure of the residuals defined in (10). Then, for the Bootstrap covariance matrix we have
(17) 
Proof.
First, the definition of in (9), in addition to the dynamics (1) yield . Defining similar to Lemma 3, the normal equation implies that
So, we obtain
since . Thus, applying the Law of Large Numbers to the matrices , we get . Further, by the Law of Large Numbers, vanishes as grows. Therefore, the definitions of lead to
Finally, since Lemma 3 implies that , we get the desired result on the smallest eigenvalue: .
Appendix B Proof of Theorem 1
The following analysis rigorously studies the behavior of both and