# The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint

The effectiveness of model-based versus model-free methods is a long-standing question in reinforcement learning (RL). Motivated by recent empirical success of RL on continuous control tasks, we study the sample complexity of popular model-based and model-free algorithms on the Linear Quadratic Regulator (LQR). We show that for policy evaluation, a simple model-based plugin method requires asymptotically fewer samples than the classical least-squares temporal difference (LSTD) estimator to reach the same quality of solution; the sample complexity gap between the two methods can be at least a factor of state dimension. For policy optimization, we study a simple family of problem instances and show that nominal (certainty equivalence principle) control also requires a factor of state dimension fewer samples than the policy gradient method to reach the same level of control performance on these instances. Furthermore, the gap persists even when employing baselines commonly used in practice. To the best of our knowledge, this is the first theoretical result that demonstrates a separation in the sample complexity between model-based and model-free methods on a continuous control task.

## 1 Introduction

The relative merits of model-based versus model-free methods in reinforcement learning (RL) have been debated for decades. This debate has been reinvigorated in the last few years by the impressive success of RL techniques in various domains such as game playing, robotic manipulation, and locomotion tasks. A common rule of thumb amongst RL practitioners is that model-free methods have worse sample complexity than model-based methods, but are generally able to achieve better performance asymptotically since they do not suffer from biases in the model that lead to sub-optimal behavior [10, 29, 33]. However, there is currently no general theory which rigorously explains the gap between the performance of model-based and model-free methods. While there has been theoretical work studying both model-based and model-free methods in RL, prior work has primarily shown specific upper bounds [5, 6, 17, 19, 41] which are not directly comparable, or information-theoretic lower bounds [17, 19] which are currently too coarse-grained to delineate between model-based and model-free methods. Furthermore, most of the prior work has focused primarily on the tabular Markov Decision Process (MDP) setting.

We take a first step towards a theoretical understanding of the differences between model-based and model-free methods for continuous control settings. While we are ultimately interested in comparing these methods for general MDPs with non-linear state transition dynamics, in this work we build upon recent progress in understanding the performance guarantees of data-driven methods for the Linear Quadratic Regulator (LQR). We study the asymptotic behavior of both policy evaluation and policy optimization on LQR, comparing the performance of simple model-based methods which use empirical state transition data to fit a dynamics model versus the performance of popular model-free methods from RL: temporal-difference learning for policy evaluation and policy gradient methods for policy optimization.

Our analysis shows that in the policy evaluation setting, a simple model-based plugin estimator is always asymptotically more sample efficient than the classical least-squares temporal difference (LSTD) estimator; the gap between the two methods can be at least a factor of state dimension. For policy optimization, we construct a simple family of instances for which nominal control (also known as the certainty equivalence principle in control theory) is also at least a factor of state dimension more efficient than the widely used policy gradient method. Furthermore, the gap persists even when we employ commonly used baselines to reduce the variance of the policy gradient estimate. In both settings, we also show minimax lower bounds which highlight the near-optimality of model-based methods in certain regimes. To the best of our knowledge, our work is the first to rigorously show a setting where a strict separation between a model-based and model-free method solving the same continuous control task occurs.

## 2 Main Results

In this paper, we study the performance of model-based and model-free algorithms for the Linear Quadratic Regulator (LQR) via two fundamental primitives in reinforcement learning: policy evaluation and policy optimization. In both tasks we fix an unknown dynamical system

$$x_{t+1} = A_\star x_t + B_\star u_t + w_t,$$

starting at $x_0 = 0$ (for simplicity) and driven by Gaussian white noise $w_t \sim \mathcal{N}(0, \sigma_w^2 I_n)$. We let $n$ denote the state dimension and $d$ denote the input dimension, and assume the system is underactuated (i.e. $d \le n$). We also fix two positive semi-definite cost matrices $(Q, R)$.

### 2.1 Policy Evaluation

Given a controller $K \in \mathbb{R}^{d \times n}$ that stabilizes $(A_\star, B_\star)$, the policy evaluation task is to compute the (relative) value function $V^K(x)$:

$$V^K(x) := \lim_{T \to \infty} \mathbb{E}\left[ \sum_{t=0}^{T-1} \left( x_t^\top Q x_t + u_t^\top R u_t - \lambda_K \right) \,\middle|\, x_0 = x \right], \quad u_t = K x_t. \tag{2.1}$$

Above, $\lambda_K$ is the infinite horizon average cost. It is well-known that $V^K(x)$ can be written as:

$$V^K(x) = \sigma_w^2 x^\top P_\star x, \tag{2.2}$$

where $P_\star$ solves the discrete-time Lyapunov equation:

$$(A_\star + B_\star K)^\top P_\star (A_\star + B_\star K) - P_\star + Q + K^\top R K = 0. \tag{2.3}$$

From the Lyapunov equation, it is clear that given $(A_\star, B_\star)$, the solution to the policy evaluation task is readily computable. In this paper, we study algorithms which only have input/output access to $(A_\star, B_\star)$. Specifically, we study on-policy algorithms that operate on a single trajectory, where the input $u_t$ is determined by $u_t = K x_t$. The variable that controls the amount of information available to the algorithm is $T$, the trajectory length. The trajectory will be denoted as $\{x_t\}_{t=0}^{T}$. We are interested in the asymptotic behavior of algorithms as $T \to \infty$.

##### Model-based algorithm.

In light of Equation (2.3), the plugin estimator is a very natural model-based algorithm to use. Let $L_\star := A_\star + B_\star K$ denote the true closed-loop matrix. The plugin estimator uses the trajectory $\{x_t\}_{t=0}^{T}$ to estimate $L_\star$ via least-squares; call this $\hat{L}(T)$. The estimator then returns $\hat{P}_{\mathrm{plug}}(T)$ by using $\hat{L}(T)$ in place of $L_\star$ in (2.3). Algorithm 1 describes this estimator in more detail.
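Algorithm 1 is not reproduced in this section, so the following is only a minimal sketch of the plugin idea under stated assumptions: the thresholding step is omitted, the function names `simulate_closed_loop` and `plugin_estimate` and the default ridge value `lam` are hypothetical, and the simulator assumes $x_0 = 0$ with isotropic Gaussian noise.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov


def simulate_closed_loop(L, T, sigma_w, rng):
    """Roll out x_{t+1} = L x_t + w_t from x_0 = 0 and return the T+1 states."""
    n = L.shape[0]
    xs = np.zeros((T + 1, n))
    for t in range(T):
        xs[t + 1] = L @ xs[t] + sigma_w * rng.standard_normal(n)
    return xs


def plugin_estimate(xs, Q, R, K, lam=1e-6):
    """Ridge least-squares fit of the closed-loop matrix, then a Lyapunov solve.

    Hypothetical helper: no stability or norm thresholds are applied here.
    """
    X, Y = xs[:-1], xs[1:]  # rows are x_t and x_{t+1}
    n = X.shape[1]
    # Ridge solution: L_hat = (sum x_{t+1} x_t^T)(sum x_t x_t^T + lam I)^{-1}.
    L_hat = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ Y).T
    # solve_discrete_lyapunov(a, q) solves a X a^T - X + q = 0, so a = L_hat^T
    # yields L_hat^T P L_hat - P + (Q + K^T R K) = 0, the plugin form of (2.3).
    P_hat = solve_discrete_lyapunov(L_hat.T, Q + K.T @ R @ K)
    return L_hat, P_hat
```

If the fitted matrix is not stable, the Lyapunov solve may fail or return a matrix that is not positive semidefinite, so a practical implementation would check the spectral radius of `L_hat` first.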

##### Model-free algorithm.

By observing that $x^\top P_\star x = \langle \mathrm{svec}(x x^\top), \mathrm{svec}(P_\star) \rangle$, one can apply Least-Squares Temporal Difference Learning (LSTD) [8, 9] with the feature map $\phi(x) := \mathrm{svec}(x x^\top)$ to estimate $P_\star$. Here, $\mathrm{svec}(\cdot)$ vectorizes the upper triangular part of a symmetric matrix, weighting the off-diagonal terms by $\sqrt{2}$ to ensure consistency in the inner product. This is a classical algorithm in RL; the pseudocode is given in Algorithm 2.
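Algorithm 2 is likewise not reproduced here. The sketch below shows one way the quadratic feature map and the LSTD solve could be written, following the estimator formula stated in Lemma 4.3. The helper names (`svec`, `smat`, `features`, `lstd`) are hypothetical, and the average cost `lam_avg` is assumed to be supplied by the caller, as in the idealized setting analyzed in this paper.

```python
import numpy as np


def svec(M):
    """Upper triangle of a symmetric matrix, off-diagonals scaled by sqrt(2)."""
    iu = np.triu_indices(M.shape[0])
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return scale * M[iu]


def smat(v, n):
    """Inverse of svec: rebuild the symmetric n x n matrix."""
    iu = np.triu_indices(n)
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    M = np.zeros((n, n))
    M[iu] = v / scale
    return M + M.T - np.diag(np.diag(M))


def features(xs):
    """phi(x) = svec(x x^T) for every row x of xs."""
    iu = np.triu_indices(xs.shape[1])
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return scale * xs[:, iu[0]] * xs[:, iu[1]]


def lstd(xs, costs, lam_avg):
    """LSTD weights from states xs (T+1 rows) and stage costs (length T)."""
    phi = features(xs)
    A_hat = phi[:-1].T @ (phi[:-1] - phi[1:])   # sum phi_t (phi_t - phi_{t+1})^T
    b_hat = phi[:-1].T @ (costs - lam_avg)      # sum (c_t - lambda) phi_t
    return np.linalg.solve(A_hat, b_hat)
```

Calling `smat(lstd(xs, costs, lam_avg), n)` then gives a matrix estimate that can be compared against the plugin estimate on the same trajectory.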

We now proceed to compare the risk of Algorithm 1 versus Algorithm 2. Our notion of risk will be the expected squared error of the estimator: $\mathbb{E}[\|\hat{P} - P_\star\|_F^2]$. Our first result gives an upper bound on the asymptotic risk of the model-based plugin Algorithm 1.

###### Theorem 2.1.

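Let $K$ stabilize $(A_\star, B_\star)$. Define $L_\star$ to be the closed-loop matrix $A_\star + B_\star K$ and let denote its stability radius. Recall that $P_\star$ is the solution to the discrete-time Lyapunov equation (2.3) that parameterizes the value function $V^K$. We have that Algorithm 1 with thresholds satisfying and and any fixed regularization parameter $\lambda$ has the asymptotic risk upper bound:

$$\lim_{T \to \infty} T \cdot \mathbb{E}\left[\|\hat{P}_{\mathrm{plug}}(T) - P_\star\|_F^2\right] \le 4\,\mathrm{Tr}\left( (I - L_\star^\top \otimes_s L_\star^\top)^{-1} \left( L_\star^\top P_\star^2 L_\star \otimes_s \sigma_w^2 P_\infty^{-1} \right) (I - L_\star^\top \otimes_s L_\star^\top)^{-\top} \right).$$

Here, $P_\infty$ is the stationary covariance matrix of the closed-loop system and $\otimes_s$ denotes the symmetric Kronecker product.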
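To make the right-hand side of Theorem 2.1 concrete, the sketch below evaluates it numerically. It assumes the common convention $(A \otimes_s B)\,\mathrm{svec}(H) = \tfrac{1}{2}\mathrm{svec}(B H A^\top + A H B^\top)$ for the symmetric Kronecker product and isotropic noise of variance $\sigma_w^2$; the function names are hypothetical, and `M` stands for $Q + K^\top R K$.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov


def svec(M):
    iu = np.triu_indices(M.shape[0])
    return np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0)) * M[iu]


def smat(v, n):
    iu = np.triu_indices(n)
    M = np.zeros((n, n))
    M[iu] = v / np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return M + M.T - np.diag(np.diag(M))


def sym_kron(A, B):
    """Matrix of H -> 0.5 * (B H A^T + A H B^T) in svec coordinates."""
    n = A.shape[0]
    m = n * (n + 1) // 2
    out = np.zeros((m, m))
    for j in range(m):
        e = np.zeros(m)
        e[j] = 1.0
        H = smat(e, n)  # j-th basis element of the symmetric matrices
        out[:, j] = 0.5 * svec(B @ H @ A.T + A @ H @ B.T)
    return out


def plugin_risk_bound(L, M, sigma_w):
    """Evaluate 4 Tr(S (L^T P^2 L (x)_s sigma_w^2 P_inf^{-1}) S^T)."""
    n = L.shape[0]
    P = solve_discrete_lyapunov(L.T, M)                          # L^T P L - P + M = 0
    P_inf = solve_discrete_lyapunov(L, sigma_w**2 * np.eye(n))   # stationary covariance
    m = n * (n + 1) // 2
    S = np.linalg.inv(np.eye(m) - sym_kron(L.T, L.T))
    mid = sym_kron(L.T @ P @ P @ L, sigma_w**2 * np.linalg.inv(P_inf))
    return 4.0 * np.trace(S @ mid @ S.T)
```

In the scalar case with closed-loop gain $\ell$ and unit cost, this expression reduces to $4\ell^2 / (1 - \ell^2)^3$, which agrees with a direct delta-method calculation for $\hat{P} = 1/(1 - \hat{\ell}^2)$.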

We make a few quick remarks regarding Theorem 2.1. First, while the risk bound is presented as an upper bound, the exact asymptotic risk can be recovered from the proof. Second, the thresholds and regularization parameter do not affect the final asymptotic bound, but do possibly affect both higher order terms and the rate of convergence to the limiting risk. We include these thresholds as they simplify the proof. In practice, we find that thresholding or regularization is generally not needed, with the caveat that if the estimate $\hat{L}(T)$ is not stable then the solution to the discrete Lyapunov equation is not guaranteed to exist (and when it exists is not guaranteed to be positive semidefinite). Finally, we remark that a non-asymptotic high probability upper bound for the risk of Algorithm 1 can be easily derived by combining the single trajectory learning results of Simchowitz et al. [39] with standard results on perturbation of Lyapunov equations.

We now turn our attention to the model-free LSTD algorithm. Our next result gives a lower bound on the asymptotic risk of Algorithm 2.

###### Theorem 2.2.

Let $K$ stabilize $(A_\star, B_\star)$. Define $L_\star$ to be the closed-loop matrix $A_\star + B_\star K$. Recall that $P_\star$ is the solution to the discrete-time Lyapunov equation (2.3) that parameterizes the value function $V^K$. We have that Algorithm 2 with the cost estimates $\lambda_t$ set to the true cost $\lambda_K$ satisfies the asymptotic risk lower bound:

$$\liminf_{T \to \infty} T \cdot \mathbb{E}\left[\|\hat{P}_{\mathrm{lstd}}(T) - P_\star\|_F^2\right] \ge 4 R_{\mathrm{plug}} + 8 \sigma_w^2 \langle P_\infty, L_\star^\top P_\star^2 L_\star \rangle \, \mathrm{Tr}\left( (I - L_\star^\top \otimes_s L_\star^\top)^{-1} (P_\infty^{-1} \otimes_s P_\infty^{-1}) (I - L_\star^\top \otimes_s L_\star^\top)^{-\top} \right).$$

Here, $R_{\mathrm{plug}}$ is the asymptotic risk of the plugin estimator, $P_\infty$ is the stationary covariance matrix of the closed loop system $L_\star$, and $\otimes_s$ denotes the symmetric Kronecker product.

Theorem 2.2 shows that the asymptotic risk of the model-free method always exceeds that of the model-based plugin method. We remark that we prove the theorem under an idealized setting where the infinite horizon cost estimate $\lambda_t$ is set to the true cost $\lambda_K$. In practice, the true cost is not known and must instead be estimated from the data at hand. However, for the purposes of our comparison this is not an issue because using the true cost $\lambda_K$ rather than an estimate of it only reduces the variance of the risk.

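To get a sense of how much excess risk is incurred by the model-free method over the model-based method, consider the following family of instances, defined for $\rho \in (0, 1)$ and $1 \le d \le n$:

$$\mathcal{F}(\rho, d, K) := \left\{ (A_\star, B_\star) : A_\star + B_\star K = \tau P_E + \gamma I_n,\ (\tau, \gamma) \in (0, 1),\ \tau + \gamma \le \rho,\ \dim(E) \le d \right\}. \tag{2.4}$$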

With this family, one can show with elementary computations that under the simplifying assumptions that and , Theorem 2.1 and Theorem 2.2 state that:

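$$\begin{aligned}
\lim_{T \to \infty} T \cdot \mathbb{E}\left[\|\hat{P}_{\mathrm{plug}}(T) - P_\star\|_F^2\right] &\le O\!\left( \frac{\rho^2 n^2}{(1 - \rho^2)^3} \right), \\
\liminf_{T \to \infty} T \cdot \mathbb{E}\left[\|\hat{P}_{\mathrm{lstd}}(T) - P_\star\|_F^2\right] &\ge \Omega\!\left( \frac{\rho^2 n^3}{(1 - \rho^2)^3} \right).
\end{aligned}$$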

That is, for , the plugin risk is a factor of state-dimension less than the LSTD risk. Moreover, the non-asymptotic result for LSTD from Lemma 4.1 of Abbasi-Yadkori [1] (which extends the non-asymptotic discounted LSTD result from Tu and Recht [44]) gives a bound of w.h.p., which matches the asymptotic bound of Theorem 2.2 in terms of up to logarithmic factors.

Our final result for policy evaluation is a minimax lower bound on the risk of any estimator over .

###### Theorem 2.3.

Fix a and suppose that satisfies . Suppose that is greater than an absolute constant and . We have that:

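$$\inf_{\hat{P}} \sup_{(A_\star, B_\star) \in \mathcal{F}(\rho, n/4, K)} \mathbb{E}\left[\|\hat{P} - P_\star\|_F^2\right] \gtrsim \frac{\rho^2 n^2}{(1 - \rho^2)^3 \, T},$$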

where the infimum is taken over all estimators taking input .

Theorem 2.3 states that the rate achieved by the model-based Algorithm 1 over this family cannot be improved beyond constant factors, at least asymptotically; its dependence on both the state dimension and stability radius is optimal.

### 2.2 Policy Optimization

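Given a finite horizon length $T$, the policy optimization task is to solve the finite horizon optimal control problem:

$$J_\star := \min_{u_t(\cdot)} \mathbb{E}\left[ \sum_{t=0}^{T-1} \left( x_t^\top Q x_t + u_t^\top R u_t \right) + x_T^\top Q x_T \right], \quad x_{t+1} = A_\star x_t + B_\star u_t + w_t. \tag{2.5}$$

We will focus on a special case of this problem when there is no penalty on the input: $Q = I_n$, $R = 0$, and $\mathrm{range}(A_\star) \subseteq \mathrm{range}(B_\star)$. In this situation, the cost function reduces to $\mathbb{E}\left[\sum_{t=0}^{T} \|x_t\|_2^2\right]$ and the optimal solution simply chooses a $u_t$ that cancels out the state $x_t$; that is, $u_t = K_\star x_t$ with $K_\star = -B_\star^\dagger A_\star$. We work with this simple class of instances so that we can ensure that policy gradient converges to the optimal solution; in general this is not guaranteed.

We consider a slightly different input/output oracle model in this setting than we did in Section 2.1. The horizon length $T$ is now considered fixed, and $N$ rounds are played. At each round $i = 1, \dots, N$, the algorithm chooses a feedback matrix $K_i$. The algorithm then observes the trajectory $\{x_t^{(i)}\}_{t=0}^{T}$ by playing the control input $u_t^{(i)} = K_i x_t^{(i)} + \eta_t^{(i)}$, where $\eta_t^{(i)} \sim \mathcal{N}(0, \sigma_u^2 I_d)$ is i.i.d. noise used for the policy. This process then repeats for $N$ total rounds. After the $N$ rounds, the algorithm is asked to output a $\hat{K}(N)$ and is assigned the risk $\mathbb{E}[J(\hat{K}(N)) - J_\star]$, where $J(\hat{K}(N))$ denotes playing the feedback $u_t = \hat{K}(N) x_t$ on the true system $(A_\star, B_\star)$. We will study the behavior of algorithms when $N \to \infty$ (and $T$ is held fixed).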

##### Model-based algorithm.

Under this oracle model, a natural model-based algorithm is to first use random open-loop feedback (i.e. $K_i = 0$) to observe $N$ independent trajectories (each of length $T$), and then use the trajectory data to fit the state transition matrices $(A_\star, B_\star)$; call this estimate $(\hat{A}(N), \hat{B}(N))$. After fitting the dynamics, the algorithm then returns the estimate of $K_\star$ by solving the finite horizon problem (2.5) with $(\hat{A}(N), \hat{B}(N))$ taking the place of $(A_\star, B_\star)$. In general, however, the assumption that $\mathrm{range}(\hat{A}(N)) \subseteq \mathrm{range}(\hat{B}(N))$ will not hold, and hence the optimal solution to (2.5) will not be time-invariant. Moreover, solving for the best time-invariant static feedback for the finite horizon problem is in general not tractable. In light of this, to provide the fairest comparison to the model-free policy gradient method, we use the time-invariant static feedback that arises from the infinite horizon solution given by the discrete algebraic Riccati equation as a proxy. We note that under our range inclusion assumption, the infinite horizon solution is a consistent estimator of the optimal feedback. The pseudo-code for this model-based algorithm is described in Algorithm 3.
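Algorithm 3 is not reproduced in this section. The following is a rough sketch of the nominal-control pipeline just described, under assumptions that are not taken from the paper: the thresholds are omitted, a small input penalty `r_eps` is added so that the Riccati solver is well posed, and the names `collect_rollouts` and `nominal_controller` are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_discrete_are


def collect_rollouts(A, B, N, T, sigma_w, sigma_u, rng):
    """Simulate N open-loop rollouts of length T (u_t is pure noise, x_0 = 0)."""
    n, d = B.shape
    X = np.zeros((N, n))
    Z, Y = [], []
    for _ in range(T):
        U = sigma_u * rng.standard_normal((N, d))
        X_next = X @ A.T + U @ B.T + sigma_w * rng.standard_normal((N, n))
        Z.append(np.hstack([X, U]))   # regressors [x_t, u_t]
        Y.append(X_next)              # targets x_{t+1}
        X = X_next
    return np.vstack(Z), np.vstack(Y)


def nominal_controller(Z, Y, n, r_eps=1e-3, lam=1e-6):
    """Fit [A B] by ridge least squares, then use the infinite-horizon gain."""
    Theta = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ Y).T
    A_hat, B_hat = Theta[:, :n], Theta[:, n:]
    d = B_hat.shape[1]
    R = r_eps * np.eye(d)  # assumed small input penalty, not from the paper
    P = solve_discrete_are(A_hat, B_hat, np.eye(n), R)
    K_hat = -np.linalg.solve(B_hat.T @ P @ B_hat + R, B_hat.T @ P @ A_hat)
    return A_hat, B_hat, K_hat
```

On an instance where the state can be cancelled exactly, the returned gain should approach the cancelling feedback as the number of rollouts grows and `r_eps` shrinks.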

##### Model-free algorithm.

We study a model-free algorithm based on policy gradients (see e.g. [32, 45]). Here, we choose to parameterize the policy as a time-invariant linear feedback. The algorithm is described in Algorithm 4.

In general for problems with a continuous action space, when applying policy gradient one has many degrees of freedom in choosing how to represent the policy $\pi$. Some of these degrees of freedom include whether or not the policy should be time-invariant and how much of the history before time $t$ should be used to compute the action at time $t$. More broadly, the question is what function class should be used to model the policy. Ideally, one chooses a function class which is both capable of expressing the optimal solution and is easy to optimize over.

Another issue that significantly impacts the performance of policy gradient in practice is choosing a baseline which effectively reduces the variance of the policy gradient estimate. What makes computing a baseline challenging is that good baselines (such as value or advantage functions) require knowledge of the unknown MDP transition dynamics in order to compute. Therefore, one has to estimate the baseline from the empirical trajectories, adding another layer of complexity to the policy gradient algorithm.

In general, these issues are still an active area of research in RL and present many hurdles to a general theory for policy optimization. However, by restricting our attention to LQR, we can sidestep these issues, which enables our analysis. In particular, by studying problems with no penalty on the input and where the state can be cancelled at every step, we know that the optimal control is a static time-invariant linear feedback. Therefore, we can restrict our policy representation to static linear feedback controllers without introducing any approximation error. Furthermore, it turns out that we can further parameterize instances so that the optimization landscape satisfies a standard notion of restricted strong convexity. This allows us to study policy gradient by leveraging the existing theory on the asymptotic distribution of stochastic gradient descent for strongly convex objectives. Finally, we can compute many of the standard baselines used in closed form, which further enables our analysis.

We note that in the literature, the model-based method is often called nominal control or the certainty equivalence principle. As noted in Dean et al. [11], one issue with this approach is that on an infinite horizon, there is no guarantee of robust stability with nominal control. However, as we are dealing with only finite horizon problems, the notion of stability is irrelevant.

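We will consider a restricted family of instances $(A_\star, B_\star)$ to obtain a sharp asymptotic analysis. For a $\rho \in (0, 1)$ and $1 \le d \le n$, we define the family $\mathcal{G}(\rho, d)$ over $(A_\star, B_\star)$ as:

$$\mathcal{G}(\rho, d) := \left\{ (\rho U_\star U_\star^\top, \rho U_\star) : U_\star \in \mathbb{R}^{n \times d},\ U_\star^\top U_\star = I_d \right\}.$$

This is a simple family where the matrix $A_\star$ is stable, contractive, and symmetric. Observe that for $(A_\star, B_\star) \in \mathcal{G}(\rho, d)$ we have $\mathrm{range}(A_\star) \subseteq \mathrm{range}(B_\star)$. Furthermore, the optimal feedback is $K_\star = -U_\star^\top$ for each of these instances. Our first result for policy optimization gives the asymptotic risk of the model-based Algorithm 3.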
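As a quick sanity check on this family, the sketch below draws one instance and verifies numerically that the feedback $-U_\star^\top$ cancels the state. The helper names are hypothetical, and the cost routine assumes $Q = I_n$, no input penalty, $x_0 = 0$, and isotropic process noise with variance $\sigma_w^2$.

```python
import numpy as np


def sample_instance(n, d, rho, rng):
    """Draw (A, B) = (rho U U^T, rho U) with U an n x d orthonormal basis."""
    U, _ = np.linalg.qr(rng.standard_normal((n, d)))  # reduced QR: U^T U = I_d
    return rho * U @ U.T, rho * U, U


def expected_cost(A, B, K, T, sigma_w):
    """E[sum_{t=0}^{T} ||x_t||^2] under u_t = K x_t, starting from x_0 = 0."""
    n = A.shape[0]
    L = A + B @ K
    cov = np.zeros((n, n))  # covariance of x_0
    total = 0.0
    for _ in range(T):
        cov = L @ cov @ L.T + sigma_w**2 * np.eye(n)  # covariance of the next state
        total += np.trace(cov)                        # E||x_t||^2 = Tr(cov)
    return total


rng = np.random.default_rng(0)
A, B, U = sample_instance(n=5, d=2, rho=0.5, rng=rng)
K_star = -U.T
print(np.allclose(A + B @ K_star, 0.0))        # closed loop vanishes
print(expected_cost(A, B, K_star, 10, 1.0))    # equals T * n * sigma_w^2
```

With the closed loop equal to zero, every state after the first is pure noise, so under these assumptions the optimal cost is $T n \sigma_w^2$.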

###### Theorem 2.4.

Fix a . For any , we have that the model-based plugin Algorithm 3 with thresholds such that , , and satisfies the asymptotic risk bound:

$$\lim_{N \to \infty} N \cdot \mathbb{E}\left[J(\hat{K}_{\mathrm{plug}}(N)) - J_\star\right] = O\left(d^2 + (n - d) d\right) + o_T(1).$$

Here, hides constants depending only on .

Theorem 2.4 states that when , the RHS of the risk bound for the model-based case is . It will turn out that the dependence on is optimal for the family . Similar to Theorem 2.1, Theorem 2.4 requires the setting of thresholds . These thresholds serve two purposes. First, they ensure the existence of a unique positive definite solution to the discrete algebraic Riccati solution with the input penalty (the details of this are worked out in Section 6.2). Second, they simplify various technical aspects of the proof related to uniform integrability. In practice, such strong thresholds are not needed, and we leave either removing them or relaxing their requirements to future work.

Next, we look at the model-free case. As mentioned previously, baselines are very influential on the behavior of policy gradient. In our analysis, we consider three different baselines:

$$\begin{aligned}
\Psi_t(T; K) &= \sum_{\ell = t+1}^{T} \|x_\ell\|_2^2, && \text{(Simple baseline } b_t(x_t; K) = \|x_t\|_2^2\text{.)} \\
\Psi_t(T; K) &= \sum_{\ell = t}^{T} \|x_\ell\|_2^2 - V_t^K(x_t), && \text{(Value function baseline } b_t(x_t; K) = V_t^K(x_t)\text{.)} \\
\Psi_t(T; K) &= A_t^K(x_t, u_t). && \text{(Advantage baseline } A_t^K(x_t, u_t) = Q_t^K(x_t, u_t) - V_t^K(x_t)\text{.)}
\end{aligned}$$

Above, the simple baseline should be interpreted as having effectively no baseline; it turns out to simplify the variance calculations. On the other hand, the value function baseline is a very popular heuristic used in practice [32]. Typically one has to actually estimate the value function for a given policy, since computing it requires knowledge of the model dynamics. In our analysis, however, we simply assume the true value function is known. While this is an unrealistic assumption in practice, we note that this assumption substantially reduces the variance of policy gradient, and hence only serves to reduce the asymptotic risk. The last baseline we consider is to use the advantage function $A_t^K(x_t, u_t)$. Using advantage functions has been shown to be quite effective in practice [38]. It has the same issue as the value function baseline in that it needs to be estimated from the data; once again in our analysis we simply assume we have access to the true advantage function.
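To illustrate the simple baseline, here is a minimal sketch of a REINFORCE-style gradient estimate for a Gaussian policy $u_t = K x_t + \eta_t$ with $\eta_t \sim \mathcal{N}(0, \sigma_u^2 I_d)$. It is not Algorithm 4: the step sizes and projection step are omitted, the function name `reinforce_gradients` is hypothetical, and the rollouts assume $x_0 = 0$. Each estimate is $\sum_t \Psi_t \nabla_K \log \pi(u_t \mid x_t)$ with $\nabla_K \log \pi(u_t \mid x_t) = \eta_t x_t^\top / \sigma_u^2$.

```python
import numpy as np


def reinforce_gradients(A, B, K, T, sigma_w, sigma_u, N, rng):
    """Return N independent policy gradient estimates, shape (N, d, n).

    Uses the simple baseline Psi_t = sum_{l=t+1}^{T} ||x_l||^2.
    """
    n, d = B.shape
    X = np.zeros((N, n))  # x_0 = 0 in every rollout
    xs, etas, sq = [], [], []
    for _ in range(T):
        eta = sigma_u * rng.standard_normal((N, d))
        U = X @ K.T + eta
        xs.append(X)
        etas.append(eta)
        X = X @ A.T + U @ B.T + sigma_w * rng.standard_normal((N, n))
        sq.append(np.sum(X**2, axis=1))  # ||x_{t+1}||^2 for each rollout
    sq = np.array(sq)                         # row t holds ||x_{t+1}||^2
    psi = np.cumsum(sq[::-1], axis=0)[::-1]   # psi[t] = sum_{l=t+1}^{T} ||x_l||^2
    grads = np.zeros((N, d, n))
    for t in range(T):
        score = etas[t][:, :, None] * xs[t][:, None, :] / sigma_u**2
        grads += psi[t][:, None, None] * score
    return grads
```

Averaging the returned array over its first axis gives a Monte Carlo gradient estimate; inspecting the spread across rollouts shows how noisy individual estimates are without a stronger baseline.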

Our main result for model-free policy optimization is the following asymptotic risk lower bound on Algorithm 4.

###### Theorem 2.5.

Fix a . For any consider Algorithm 4 with , step-sizes , and threshold . We have that:

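$$\liminf_{N \to \infty} N \cdot \mathbb{E}\left[J(\hat{K}_{\mathrm{pg}}(N)) - J_\star\right] \ge \begin{cases} \Omega\left(T^2 \cdot (d^4 + n^3 d)\right) + o_T(T^2) & \text{(Simple baseline)} \\ \Omega\left(T \cdot d n^2\right) + o_T(T) & \text{(Value function baseline)} \\ \Omega\left(d^3 + n d^2\right) & \text{(Advantage baseline).} \end{cases}$$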

Here, hides constants depending only on .

Theorem 2.5 states that when , for the simple baseline the RHS is , for the value function baseline the RHS is , and finally for the advantage baseline the RHS is . In all cases (even with the advantage baseline), the dependence on is at least one factor more than in the model-based case, which is (Theorem 2.4). Furthermore, for the simple and value function baseline, we even see the RHS depending on the horizon length and , respectively. The extra factors of the horizon length appear due to the large variance of the policy gradient estimator without the variance reduction effects of the advantage baseline. Finally, we note that we prove Theorem 2.5 with a specific choice of step size . This step size corresponds to the standard step sizes commonly found in proofs for SGD on strongly convex functions (see e.g. Rakhlin et al. [34]), where is the strong convexity parameter. We leave to future work extending our results to support Polyak-Ruppert averaging, which would yield asymptotic results that are more robust to specific step size choices.

Finally, we turn to our information-theoretic lower bound for any (possibly adaptive) method over the family .

###### Theorem 2.6.

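Fix a and suppose is greater than an absolute constant. Consider the family $\mathcal{G}(\rho, d)$ as described above. Fix a time horizon $T$ and number of rollouts $N$. The risk over any algorithm $\mathcal{A}$ which plays (possibly adaptive) feedbacks of the form with and is lower bounded by:

$$\inf_{\mathcal{A}} \sup_{\rho \in (0, 1/4),\ (A_\star, B_\star) \in \mathcal{G}(\rho, d)} \mathbb{E}\left[J(\mathcal{A}) - J_\star\right] \gtrsim \frac{\sigma_w^4}{\sigma_w^2 + \sigma_u^2} \cdot \frac{d^2 (n - d)}{n N}.$$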

When , the RHS of the risk bound is . Therefore, Theorem 2.6 tells us that asymptotically, the model-based method in Algorithm 3 is nearly optimal in terms of its dependence on the state dimension .

## 3 Related Work

For general Markov Decision Processes (MDPs), the setting which is the best understood theoretically is the finite-horizon episodic case with discrete state and action spaces, often referred to as the “tabular” setting. Jin et al. [19] provide an excellent overview of the known regret bounds in the tabular setting; here we give a brief summary of the highlights. We focus only on regret bounds for simplicity, but note that many results have also been established in the PAC setting (see e.g. [23, 40, 41]). For tabular MDPs, a model-based method is one which stores the entire state-transition matrix, which takes space where is the number of states, is the number of actions, and is the horizon length. The best known regret bound in the model-based case is  [6], which matches the known lower bound of  [17, 19] up to log factors. On the other hand, a model-free method is one which only stores the $Q$-function and hence requires only space. The best known regret bound in the model-free case is , which is worse than the model-based case by a factor of the horizon length . Interestingly, there is no gap in terms of the number of states and actions . It is open whether or not the gap in is fundamental or can be closed. Sun et al. [42] present an information-theoretic definition of model-free algorithms. Under their definition, they construct a family of factored MDPs with horizon length where any model-free algorithm incurs sample complexity , whereas there exists a model-based algorithm that has sample complexity polynomial in and other relevant quantities. We leave proving lower bounds for LQR under their more general definition of model-free algorithms to future work.

For LQR, the story is less complete. Unlike the tabular setting, the storage requirements of a model-based method are comparable to a model-free method. For instance, it takes space to store the state transition model and space to store the $Q$-function. In presenting the known results of LQR, we will delineate between offline (one-shot) methods versus online (adaptive) methods.

In the offline setting, the first non-asymptotic result is from Fiechter [15], who studied the sample complexity of the discounted infinite horizon LQR problem. Later, Dean et al. [11] study the average cost infinite horizon problem, using tools from robust control to quantify how the uncertainty in the model affects control performance in an interpretable way. Both works fall under model-based methods, since they both propose to first estimate the state transition matrices from sample trajectories using least-squares and then use the estimated dynamics in a control synthesis procedure.

For model-free methods for LQR, Tu and Recht [44] study the performance of least-squares temporal difference learning (LSTD) [8, 9], which is a classic policy evaluation algorithm in RL. They focus on the discounted cost LQR setting and provide a non-asymptotic high probability bound on the risk of LSTD. Later, Abbasi-Yadkori et al. [1] extend this result to the average cost LQR setting. Most related to our analysis for policy gradient is Fazel et al. [14], who study the performance of model-free policy gradient related methods on LQR. Unfortunately, their bounds do not give explicit dependence on the problem instance parameters and are therefore difficult to compare against. Furthermore, Fazel et al. study a simplified version of the problem where the problem is an infinite horizon problem (as opposed to finite horizon in this work) and the only noise is in the initial state; all subsequent state transitions have no process noise. Other than our current work, we are unaware of any analysis (asymptotic or non-asymptotic) which explicitly studies the behavior of policy gradient on the finite horizon LQR problem. We also note that Fazel et al. analyze a policy optimization method which is more akin to random search (e.g. [25, 36]) than REINFORCE. Finally, note that all the results mentioned for LQR are only upper bounds; we are unaware of any lower bounds in the literature for LQR which give explicit dependence on the problem instance.

We now discuss known results for the online (adaptive) setting for LQR. For model-based algorithms, both optimism in the face of uncertainty (OFU) [2, 13, 16] and Thompson sampling [3, 4, 30] have been analyzed in the online learning literature. In both cases, the algorithms have been shown to achieve regret, which is known to be nearly optimal in the dependence on . However, in nearly all the bounds the dependence on the problem instance parameters is hidden. Furthermore, it is currently unclear how to solve the OFU subproblem in polynomial time for LQR. In response to the computational issues with OFU, Dean et al. [12] propose a polynomial time adaptive algorithm with sub-linear regret ; their bounds also make the dependence on the problem instance parameters explicit, but are quite conservative in this regard.

For model-free algorithms, Abbasi-Yadkori et al. [1] study the regret of a model-free algorithm similar in spirit to least-squares policy iteration (LSPI) [22]. They prove that their algorithm has regret for any , nearly matching the bound given by Dean et al. in terms of the dependence on . In terms of the dependence on the problem specific parameters, however, their bound is not directly comparable to that of Dean et al. Experimentally, Abbasi-Yadkori et al. observe that their model-free algorithm performs quite sub-optimally compared to model-based methods; these empirical observations are also consistent with similar experiments conducted in [25, 35, 44].

## 4 Asymptotic Toolbox

Our analysis relies heavily on computing limiting distributions for the various estimators we study. A crucial fact we use is that if the matrix $L_\star$ is stable, then the Markov chain $\{x_t\}$ given by $x_{t+1} = L_\star x_t + w_t$ with $w_t \sim \mathcal{N}(0, \sigma_w^2 I_n)$ is geometrically ergodic. This allows us to apply well known limit theorems for ergodic Markov chains.

In what follows, we let $\overset{\mathrm{a.s.}}{\longrightarrow}$ denote almost sure convergence and $\overset{D}{\rightsquigarrow}$ denote convergence in distribution. We also let $\otimes$ denote the standard Kronecker product and $\otimes_s$ denote the symmetric Kronecker product; see e.g. [37] for a review of the basic properties of the Kronecker and symmetric Kronecker product which we will use extensively throughout the sequel. For a matrix $M$, the notation $\mathrm{vec}(M)$ denotes the vectorized version of $M$ by stacking the columns. We will also let $\mathrm{svec}(\cdot)$ denote the operator that satisfies $\langle \mathrm{svec}(M_1), \mathrm{svec}(M_2) \rangle = \langle M_1, M_2 \rangle$ for all symmetric matrices $M_1, M_2$, where the first inner product is with respect to $\mathbb{R}^{n(n+1)/2}$ and the second is with respect to $\mathbb{R}^{n \times n}$. Finally, we let $\mathrm{mat}(\cdot)$ and $\mathrm{smat}(\cdot)$ denote the functional inverses of $\mathrm{vec}(\cdot)$ and $\mathrm{svec}(\cdot)$. The proofs of the results presented in this section are deferred to the appendix.

We first state a well-known result that concerns the least-squares estimator of a stable dynamical system. In the scalar case, this result dates back to Mann and Wald [26].

###### Lemma 4.1.

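Let $x_{t+1} = L_\star x_t + w_t$ be a dynamical system with $L_\star$ stable and $w_t \sim \mathcal{N}(0, \sigma_w^2 I_n)$. Given a trajectory $\{x_t\}_{t=0}^{T}$, let $\hat{L}(T)$ denote the least-squares estimator of $L_\star$ with regularization $\lambda$:

$$\hat{L}(T) = \arg\min_{L \in \mathbb{R}^{n \times n}} \frac{1}{2} \sum_{t=0}^{T-1} \|x_{t+1} - L x_t\|_2^2 + \frac{\lambda}{2} \|L\|_F^2.$$

Let $P_\infty$ denote the stationary covariance matrix of the process $\{x_t\}$, i.e. $L_\star P_\infty L_\star^\top - P_\infty + \sigma_w^2 I_n = 0$. We have that $\hat{L}(T) \overset{\mathrm{a.s.}}{\longrightarrow} L_\star$ and furthermore:

$$\sqrt{T}\, \mathrm{vec}(\hat{L}(T) - L_\star) \overset{D}{\rightsquigarrow} \mathcal{N}\left(0, \sigma_w^2 (P_\infty^{-1} \otimes I_n)\right).$$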
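The limiting covariance in Lemma 4.1 is easy to evaluate numerically. The sketch below, with hypothetical function names, solves for the stationary covariance and forms $\sigma_w^2 (P_\infty^{-1} \otimes I_n)$; dividing its trace by $T$ gives a first-order approximation to the expected squared Frobenius error of the least-squares estimate.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov


def stationary_covariance(L, sigma_w):
    """Solve L P L^T - P + sigma_w^2 I = 0 for the stationary covariance."""
    n = L.shape[0]
    return solve_discrete_lyapunov(L, sigma_w**2 * np.eye(n))


def ls_asymptotic_covariance(L, sigma_w):
    """Covariance sigma_w^2 (P_inf^{-1} kron I_n) of sqrt(T) vec(L_hat - L)."""
    n = L.shape[0]
    P_inf = stationary_covariance(L, sigma_w)
    return sigma_w**2 * np.kron(np.linalg.inv(P_inf), np.eye(n))


L = np.diag([0.5, 0.2])
C = ls_asymptotic_covariance(L, sigma_w=1.0)
print(np.diag(C))  # per-entry asymptotic variances, in column-stacked order
```

Because $P_\infty$ scales with $\sigma_w^2$, the limiting covariance does not depend on the noise level: a noisier system is also a better excited one.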

We now consider a slightly altered process where the system is no longer autonomous, and instead will be driven by white noise.

###### Lemma 4.2.

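Let $x_{t+1} = A_\star x_t + B_\star u_t + w_t$ be a stable dynamical system driven by $u_t \sim \mathcal{N}(0, \sigma_u^2 I_d)$ and $w_t \sim \mathcal{N}(0, \sigma_w^2 I_n)$. Consider a least-squares estimator $\hat{\Theta}(N)$ of $\Theta_\star := [A_\star \ B_\star]$ based off of $N$ independent trajectories of length $T$, i.e. given the states $\{x_t^{(i)}\}_{t=0}^{T}$ and inputs $\{u_t^{(i)}\}_{t=0}^{T-1}$ for $i = 1, \dots, N$,

$$\hat{\Theta}(N) = \arg\min_{(A, B) \in \mathbb{R}^{n \times (n + d)}} \frac{1}{2} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \|x_{t+1}^{(i)} - A x_t^{(i)} - B u_t^{(i)}\|_2^2 + \frac{\lambda}{2} \|[A \ B]\|_F^2.$$

Let $P_\infty$ denote the stationary covariance of the process $\{x_t\}$, i.e. $P_\infty$ solves

$$A_\star P_\infty A_\star^\top - P_\infty + \sigma_u^2 B_\star B_\star^\top + \sigma_w^2 I_n = 0.$$

We have that $\hat{\Theta}(N) \overset{\mathrm{a.s.}}{\longrightarrow} \Theta_\star$ and furthermore:

$$\sqrt{N}\, \mathrm{vec}(\hat{\Theta}(N) - \Theta_\star) \overset{D}{\rightsquigarrow} \mathcal{N}\left(0, \frac{\sigma_w^2}{T} \begin{bmatrix} P_\infty^{-1} & 0 \\ 0 & (1/\sigma_u^2) I_d \end{bmatrix} \otimes I_n + o(1/T)\right).$$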

Next, we consider the asymptotic distribution of Least-Squares Temporal Difference Learning for LQR.

###### Lemma 4.3.

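Let $x_{t+1} = A_\star x_t + B_\star u_t + w_t$ be a linear system driven by $u_t = K x_t$ and $w_t \sim \mathcal{N}(0, \sigma_w^2 I_n)$. Suppose the closed-loop matrix $A_\star + B_\star K$ is stable. Let $\nu_\infty$ denote the stationary distribution of the Markov chain $\{x_t\}$. Define the two matrices $A_\infty, B_\infty$, the mapping $\psi(\cdot)$, and the vector $w_\star$ as

$$\begin{aligned}
A_\infty &:= \mathbb{E}_{x \sim \nu_\infty,\, x' \sim p(\cdot \mid x, \pi(x))}\left[ \phi(x) (\phi(x) - \phi(x'))^\top \right], \\
B_\infty &:= \mathbb{E}_{x \sim \nu_\infty,\, x' \sim p(\cdot \mid x, \pi(x))}\left[ \left( (\phi(x') - \psi(x))^\top w_\star \right)^2 \phi(x) \phi(x)^\top \right], \\
\psi(x) &:= \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[ \phi(x') \right], \\
w_\star &:= \mathrm{svec}(P_\star).
\end{aligned}$$

Let $\hat{w}_{\mathrm{lstd}}(T)$ denote the LSTD estimator given by:

$$\hat{w}_{\mathrm{lstd}}(T) = \left( \sum_{t=0}^{T-1} \phi(x_t) (\phi(x_t) - \phi(x_{t+1}))^\top \right)^{-1} \left( \sum_{t=0}^{T-1} (c_t - \lambda_t) \phi(x_t) \right).$$

Suppose that LSTD is run with the true $\lambda_t = \lambda_K$ and that the matrix $A_\infty$ is invertible. We have that $\hat{w}_{\mathrm{lstd}}(T) \overset{\mathrm{a.s.}}{\longrightarrow} w_\star$ and furthermore:

$$\sqrt{T} (\hat{w}_{\mathrm{lstd}}(T) - w_\star) \overset{D}{\rightsquigarrow} \mathcal{N}\left(0, A_\infty^{-1} B_\infty A_\infty^{-\top}\right).$$

As a corollary to Lemma 4.3, we work out the formulas for $A_\infty$ and $B_\infty$ and a useful lower bound.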

###### Corollary 4.4.

In the setting of Lemma 4.3, with $L_\star = A_\star + B_\star K$, we have that the matrix $A_\infty$ is invertible, and:

 A∞ =(P∞⊗sP∞)−(P∞LT⋆⊗sP∞LT⋆), B∞ =(σ2w⟨P∞,LT⋆P2⋆L⋆⟩+2σ4w∥P⋆∥2F)(2(P∞⊗sP