The relative merits of model-based versus model-free methods in reinforcement learning (RL) are a decades-old question. This debate has been reinvigorated in the last few years by the impressive success of RL techniques in domains such as game playing, robotic manipulation, and locomotion tasks. A common rule of thumb among RL practitioners is that model-free methods have worse sample complexity than model-based methods, but generally achieve better asymptotic performance since they do not suffer from biases in the model that lead to sub-optimal behavior [10, 29, 33]. However, there is currently no general theory which rigorously explains the performance gap between model-based and model-free methods. While there has been theoretical work studying both model-based and model-free methods in RL, prior work has primarily shown specific upper bounds [5, 6, 17, 19, 41] which are not directly comparable, or information-theoretic lower bounds [17, 19] which are currently too coarse-grained to distinguish between model-based and model-free methods. Furthermore, most of the prior work has focused on the tabular Markov Decision Process (MDP) setting.
We take a first step towards a theoretical understanding of the differences between model-based and model-free methods for continuous control settings. While we are ultimately interested in comparing these methods for general MDPs with non-linear state transition dynamics, in this work we build upon recent progress in understanding the performance guarantees of data-driven methods for the Linear Quadratic Regulator (LQR). We study the asymptotic behavior of both policy evaluation and policy optimization on LQR, comparing the performance of simple model-based methods which use empirical state transition data to fit a dynamics model versus the performance of popular model-free methods from RL: temporal-difference learning for policy evaluation and policy gradient methods for policy optimization.
Our analysis shows that in the policy evaluation setting, a simple model-based plugin estimator is always asymptotically more sample efficient than the classical least-squares temporal difference (LSTD) estimator; the gap between the two methods can be at least a factor of state-dimension. For policy optimization, we construct a simple family of instances for which nominal control (also known as the certainty equivalence principle in control theory) is also at least a factor of state-dimension more efficient than the widely used policy gradient method. Furthermore, the gap persists even when we employ commonly used baselines to reduce the variance of the policy gradient estimate. In both settings, we also show minimax lower bounds which highlight the near-optimality of model-based methods in certain regimes. To the best of our knowledge, our work is the first to rigorously show a setting where a strict separation between a model-based and model-free method solving the same continuous control task occurs.
2 Main Results
In this paper, we study the performance of model-based and model-free algorithms for the Linear Quadratic Regulator (LQR) via two fundamental primitives in reinforcement learning: policy evaluation and policy optimization. In both tasks we fix an unknown dynamical system
(for simplicity) and driven by Gaussian white noise. We let denote the state dimension and denote the input dimension, and assume the system is underactuated (i.e. ). We also fix two positive semi-definite cost matrices .
2.1 Policy Evaluation
Given a controller that stabilizes , the policy evaluation task is to compute the (relative) value function :
Above, is the infinite horizon average cost. It is well-known that can be written as:
where solves the discrete-time Lyapunov equation:
From the Lyapunov equation, it is clear that given , the solution to policy evaluation task is readily computable. In this paper, we study algorithms which only have input/output access to . Specifically, we study on-policy algorithms that operate on a single trajectory, where the input is determined by . The variable that controls the amount of information available to the algorithm is , the trajectory length. The trajectory will be denoted as . We are interested in the asymptotic behavior of algorithms as .
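For readers who want the objects above in symbols, the following is one standard way to write them. The notation ($A$, $B$, $K$, $L$, $Q$, $R$, $P$, $\lambda$, $\sigma_w$) is assumed here for illustration and may differ from the numbered equations referenced in the text.

```latex
\begin{align*}
x_{t+1} &= A x_t + B u_t + w_t, \qquad u_t = K x_t, \qquad w_t \sim \mathcal{N}(0, \sigma_w^2 I), \\
L &:= A + BK, \\
V^K(x) &= x^\top P x \quad \text{(up to an additive constant)}, \\
P &= L^\top P L + Q + K^\top R K, \\
\lambda &= \sigma_w^2 \operatorname{tr}(P).
\end{align*}
```

The last two lines follow from substituting the quadratic form into the average-cost Bellman equation $V(x) + \lambda = x^\top (Q + K^\top R K) x + \mathbb{E}[V(Lx + w)]$ and matching the quadratic and constant terms.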
In light of Equation (2.3), the plugin estimator is a very natural model-based algorithm to use. Let denote the true closed-loop matrix. The plugin estimator uses the trajectory to estimate via least-squares; call this . The estimator then returns by using in place of in (2.3). Algorithm 1 describes this estimator in more detail.
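The plugin idea can be sketched in a few lines of NumPy. This is a minimal illustration under assumed notation (closed-loop matrix `L = A + B K`, cost matrices `Q` and `R`, regularization `reg`), not a reproduction of Algorithm 1: in particular, raising an error on an unstable estimate is a stand-in for the thresholds of Algorithm 1, which are not reproduced here, and the helper names are hypothetical.

```python
import numpy as np

def solve_discrete_lyapunov(L, M):
    """Solve P = L^T P L + M using vec(L^T P L) = (L^T kron L^T) vec(P)."""
    n = L.shape[0]
    lhs = np.eye(n * n) - np.kron(L.T, L.T)
    P = np.linalg.solve(lhs, M.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)  # symmetrize against round-off

def simulate_closed_loop(A, B, K, T, sigma_w=1.0, seed=0):
    """One trajectory of x_{t+1} = (A + B K) x_t + w_t started at x_0 = 0."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    L = A + B @ K
    xs = np.zeros((T + 1, n))
    for t in range(T):
        xs[t + 1] = L @ xs[t] + sigma_w * rng.standard_normal(n)
    return xs

def plugin_estimate(xs, K, Q, R, reg=1e-6):
    """Fit the closed-loop matrix by least squares, then solve the Lyapunov equation."""
    X, Y = xs[:-1], xs[1:]
    n = xs.shape[1]
    # Regularized least squares for x_{t+1} ~ L x_t (rows of X and Y are states).
    L_hat = np.linalg.solve(X.T @ X + reg * np.eye(n), X.T @ Y).T
    # Assumption: fail loudly instead of thresholding when the estimate is unstable.
    if np.max(np.abs(np.linalg.eigvals(L_hat))) >= 1.0:
        raise ValueError("estimated closed-loop matrix is not stable")
    return solve_discrete_lyapunov(L_hat, Q + K.T @ R @ K)
```

The Lyapunov solve goes through an explicit Kronecker system, which costs on the order of the sixth power of the state dimension; that is acceptable for a sketch but a dedicated solver would be preferable for larger systems.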
By observing that , one can apply Least-Squares Temporal Difference Learning (LSTD) [8, 9] with the feature map to estimate . Here, $\mathrm{svec}(\cdot)$ vectorizes the upper triangular part of a symmetric matrix, weighting the off-diagonal terms by $\sqrt{2}$ to ensure consistency in the inner product. This is a classical algorithm in RL; the pseudocode is given in Algorithm 2.
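A compact sketch of average-cost LSTD with quadratic features is given below. It assumes the feature map is the symmetric vectorization of the outer product of the state with itself, that the stage costs and the average cost are supplied by the caller, and that a least-squares solve is an acceptable substitute for a pseudoinverse; the function names are hypothetical and the sketch is not a reproduction of Algorithm 2.

```python
import numpy as np

def svec(M):
    """Stack the upper triangle of symmetric M, scaling off-diagonals by sqrt(2)."""
    i, j = np.triu_indices(M.shape[0])
    return M[i, j] * np.where(i == j, 1.0, np.sqrt(2.0))

def smat(v, n):
    """Inverse of svec: rebuild the symmetric n x n matrix."""
    i, j = np.triu_indices(n)
    M = np.zeros((n, n))
    M[i, j] = v / np.where(i == j, 1.0, np.sqrt(2.0))
    return M + M.T - np.diag(np.diag(M))

def lstd_estimate(xs, costs, avg_cost):
    """xs: (T+1, n) states, costs: (T,) stage costs, avg_cost: scalar average cost."""
    n = xs.shape[1]
    i, j = np.triu_indices(n)
    scale = np.where(i == j, 1.0, np.sqrt(2.0))
    phi = xs[:, i] * xs[:, j] * scale          # row t equals svec(x_t x_t^T)
    diff = phi[:-1] - phi[1:]                  # phi(x_t) - phi(x_{t+1})
    A_hat = phi[:-1].T @ diff                  # sum_t phi_t (phi_t - phi_{t+1})^T
    b_hat = phi[:-1].T @ (costs - avg_cost)    # sum_t phi_t (c_t - lambda)
    w_hat = np.linalg.lstsq(A_hat, b_hat, rcond=None)[0]
    return smat(w_hat, n)
```

The scaling by the square root of two is what makes the inner product of two `svec` vectors equal the trace inner product of the underlying symmetric matrices, so a quadratic form in the state becomes a linear function of the features.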
We now proceed to compare the risk of Algorithm 1 versus Algorithm 2. Our notion of risk will be the expected squared error of the estimator: . Our first result gives an upper bound on the asymptotic risk of the model-based plugin Algorithm 1.
Theorem 2.1. Let stabilize . Define to be the closed-loop matrix and let denote its stability radius. Recall that is the solution to the discrete-time Lyapunov equation (2.3) that parameterizes the value function . We have that Algorithm 1 with thresholds satisfying and and any fixed regularization parameter has the asymptotic risk upper bound:
Here, is the stationary covariance matrix of the closed-loop system and denotes the symmetric Kronecker product.
We make a few quick remarks regarding Theorem 2.1. First, while the risk bound is presented as an upper bound, the exact asymptotic risk can be recovered from the proof. Second, the thresholds and regularization parameter do not affect the final asymptotic bound, but may affect both higher order terms and the rate of convergence to the limiting risk. We include these thresholds as they simplify the proof. In practice, we find that thresholding or regularization is generally not needed, with the caveat that if the estimate is not stable then the solution to the discrete Lyapunov equation is not guaranteed to exist (and when it exists is not guaranteed to be positive semidefinite). Finally, we remark that a non-asymptotic high probability upper bound for the risk of Algorithm 1 can be easily derived by combining the single trajectory learning results of Simchowitz et al. with standard results on perturbation of Lyapunov equations.
We now turn our attention to the model-free LSTD algorithm. Our next result gives a lower bound on the asymptotic risk of Algorithm 2.
Theorem 2.2. Let stabilize . Define to be the closed-loop matrix . Recall that is the solution to the discrete-time Lyapunov equation (2.3) that parameterizes the value function . We have that Algorithm 2 with the cost estimates set to the true cost satisfies the asymptotic risk lower bound:
Here, is the asymptotic risk of the plugin estimator, is the stationary covariance matrix of the closed loop system , and denotes the symmetric Kronecker product.
Theorem 2.2 shows that the asymptotic risk of the model-free method always exceeds that of the model-based plugin method. We remark that we prove the theorem under an idealized setting where the infinite horizon cost estimate is set to the true cost . In practice, the true cost is not known and must instead be estimated from the data at hand. However, for the purposes of our comparison this is not an issue because using the true cost over an estimator of only reduces the variance of the risk.
To get a sense of how much excess risk is incurred by the model-free method over the model-based method, consider the following family of instances, defined for and :
That is, for , the plugin risk is a factor of state-dimension less than the LSTD risk. Moreover, the non-asymptotic result for LSTD from Lemma 4.1 of Abbasi-Yadkori et al. (which extends the non-asymptotic discounted LSTD result from Tu and Recht) gives a bound of with high probability, which matches the asymptotic bound of Theorem 2.2 in terms of up to logarithmic factors.
Our final result for policy evaluation is a minimax lower bound on the risk of any estimator over .
Theorem 2.3. Fix a and suppose that satisfies . Suppose that is greater than an absolute constant and . We have that:
where the infimum is taken over all estimators taking input .
2.2 Policy Optimization
Given a finite horizon length , the policy optimization task is to solve the finite horizon optimal control problem:
We will focus on a special case of this problem when there is no penalty on the input: , , and . In this situation, the cost function reduces to and the optimal solution simply chooses a that cancels out the state ; that is . We work with this simple class of instances so that we can ensure that policy gradient converges to the optimal solution; in general this is not guaranteed.
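The cancellation argument can be spelled out in a few lines. The symbols below ($A$, $B$, $K$, $w_t$, $\sigma_w$, state dimension $n$, and the pseudoinverse $B^\dagger$) are assumed notation, and the range condition is written out explicitly as an assumption.

```latex
\begin{align*}
x_{t+1} &= (A + BK)\, x_t + w_t, \\
\operatorname{range}(A) \subseteq \operatorname{range}(B)
  \;&\Longrightarrow\; B B^\dagger A = A
  \;\Longrightarrow\; A + BK = 0 \ \text{ for } K = -B^\dagger A, \\
\text{so that } x_{t+1} &= w_t
  \quad\text{and}\quad \mathbb{E}\,\|x_{t+1}\|^2 = \sigma_w^2\, n .
\end{align*}
```

Because $w_t$ is zero-mean and independent of $x_t$ and $u_t$, any policy satisfies $\mathbb{E}\,\|x_{t+1}\|^2 = \mathbb{E}\,\|A x_t + B u_t\|^2 + \sigma_w^2 n \ge \sigma_w^2 n$, so a feedback that cancels the state attains the smallest possible cost at every step.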
We consider a slightly different input/output oracle model in this setting than we did in Section 2.1. The horizon length is now considered fixed, and rounds are played. At each round , the algorithm chooses a feedback matrix . The algorithm then observes the trajectory by playing the control input , where is i.i.d. noise used for the policy. This process then repeats for total rounds. After the rounds, the algorithm is asked to output a and is assigned the risk , where denotes playing the feedback on the true system . We will study the behavior of algorithms when (and is held fixed).
Under this oracle model, a natural model-based algorithm is to first use random open-loop feedback (i.e. ) to observe independent trajectories (each of length ), and then use the trajectory data to fit the state transition matrices ; call this estimate . After fitting the dynamics, the algorithm then returns the estimate of by solving the finite horizon problem (2.5) with taking the place of . In general, however, the assumption that will not hold, and hence the optimal solution to (2.5) will not be time-invariant. Moreover, solving for the best time-invariant static feedback for the finite horizon problem in general is not tractable. In light of this, to provide the fairest comparison to the model-free policy gradient method, we use the time-invariant static feedback that arises from the infinite horizon solution given by the discrete algebraic Riccati equation as a proxy. We note that under our range inclusion assumption, the infinite horizon solution is a consistent estimator of the optimal feedback. The pseudo-code for this model-based algorithm is described in Algorithm 3.
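The fit-then-solve pipeline described above can be sketched as follows. This is an illustration under assumptions, not a reproduction of Algorithm 3: inputs are i.i.d. Gaussian, every rollout starts at the origin, the Riccati equation is solved by plain fixed-point iteration, and no thresholding is applied. All function names and default parameters are hypothetical.

```python
import numpy as np

def collect_rollouts(A, B, N, T, sigma_w=1.0, sigma_u=1.0, seed=0):
    """Play i.i.d. Gaussian inputs for N rollouts of length T, each from x_0 = 0."""
    rng = np.random.default_rng(seed)
    n, d = B.shape
    X, U, Y = [], [], []
    for _ in range(N):
        x = np.zeros(n)
        for _ in range(T):
            u = sigma_u * rng.standard_normal(d)
            x_next = A @ x + B @ u + sigma_w * rng.standard_normal(n)
            X.append(x)
            U.append(u)
            Y.append(x_next)
            x = x_next
    return np.array(X), np.array(U), np.array(Y)

def fit_dynamics(X, U, Y):
    """Least squares for x_{t+1} ~ A x_t + B u_t; returns (A_hat, B_hat)."""
    n = X.shape[1]
    Z = np.hstack([X, U])
    theta = np.linalg.lstsq(Z, Y, rcond=None)[0].T  # theta = [A_hat, B_hat]
    return theta[:, :n], theta[:, n:]

def riccati_gain(A, B, Q, R, iters=500):
    """Fixed-point iteration on the discrete algebraic Riccati equation."""
    P = Q.copy()
    for _ in range(iters):
        G = R + B.T @ P @ B
        K = -np.linalg.solve(G, B.T @ P @ A)
        # Same as Q + A^T P A - A^T P B G^{-1} B^T P A.
        P = Q + A.T @ P @ A + A.T @ P @ B @ K
    return K, P
```

With a zero input penalty the matrix `G` is invertible only when `B` has full column rank and `P` is positive definite, which holds in this sketch because the iteration starts from a positive definite `Q`.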
We study a model-free algorithm based on policy gradients (see e.g. [32, 45]). Here, we choose to parameterize the policy as a time-invariant linear feedback. The algorithm is described in Algorithm 4.
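A minimal sketch of a score-function (REINFORCE-style) update for a time-invariant linear feedback is shown below. It assumes a Gaussian policy `u_t = K x_t + sigma_eta * eta_t`, a cost equal to the sum of squared state norms, the cost-to-go as the weighting term, a step size proportional to `1 / (i + 1)`, and projection onto a Frobenius-norm ball. These choices and all names are illustrative assumptions; Algorithm 4 is not reproduced here.

```python
import numpy as np

def rollout(A, B, K, T, sigma_w, sigma_eta, rng):
    """Play u_t = K x_t + sigma_eta * eta_t for T steps from x_0 = 0."""
    n, d = B.shape
    xs = np.zeros((T + 1, n))
    etas = rng.standard_normal((T, d))
    for t in range(T):
        u = K @ xs[t] + sigma_eta * etas[t]
        xs[t + 1] = A @ xs[t] + B @ u + sigma_w * rng.standard_normal(n)
    return xs, etas

def reinforce_gradient(xs, etas, sigma_eta):
    """Score-function gradient estimate for the cost sum_{t=1}^T ||x_t||^2."""
    sq = np.sum(xs ** 2, axis=1)                  # ||x_t||^2 for t = 0..T
    # Cost incurred strictly after the action at time t, for t = 0..T-1.
    cost_to_go = np.cumsum(sq[::-1])[::-1][1:]
    # For a Gaussian policy, grad_K log pi(u_t | x_t) = eta_t x_t^T / sigma_eta.
    scores = etas[:, :, None] * xs[:-1, None, :] / sigma_eta
    return np.sum(scores * cost_to_go[:, None, None], axis=0)

def policy_gradient(A, B, K0, N, T, sigma_w=1.0, sigma_eta=1.0,
                    step0=1e-3, radius=10.0, seed=0):
    """Projected stochastic gradient descent on K using one rollout per step."""
    rng = np.random.default_rng(seed)
    K = K0.copy()
    for i in range(N):
        xs, etas = rollout(A, B, K, T, sigma_w, sigma_eta, rng)
        K = K - (step0 / (i + 1)) * reinforce_gradient(xs, etas, sigma_eta)
        norm = np.linalg.norm(K)
        if norm > radius:                          # hypothetical projection radius
            K = K * (radius / norm)
    return K
```

Weighting each score by the cost incurred after the action, rather than by the full trajectory cost, already removes some variance without introducing bias, since costs before time `t` do not depend on the action at time `t`.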
In general for problems with a continuous action space, when applying policy gradient one has many degrees of freedom in choosing how to represent the policy. Some of these degrees of freedom include whether or not the policy should be time-invariant and how much of the history before time should be used to compute the action at time . More broadly, the question is what function class should be used to model the policy. Ideally, one chooses a function class which is both capable of expressing the optimal solution and is easy to optimize over.
Another issue that significantly impacts the performance of policy gradient in practice is choosing a baseline which effectively reduces the variance of the policy gradient estimate. What makes computing a baseline challenging is that good baselines (such as value or advantage functions) require knowledge of the unknown MDP transition dynamics in order to compute. Therefore, one has to estimate the baseline from the empirical trajectories, adding another layer of complexity to the policy gradient algorithm.
In general, these issues are still an active area of research in RL and present many hurdles to a general theory for policy optimization. However, by restricting our attention to LQR, we can sidestep these issues, which enables our analysis. In particular, by studying problems with no penalty on the input and where the state can be cancelled at every step, we know that the optimal control is a static time-invariant linear feedback. Therefore, we can restrict our policy representation to static linear feedback controllers without introducing any approximation error. Furthermore, it turns out that we can further parameterize instances so that the optimization landscape satisfies a standard notion of restricted strong convexity. This allows us to study policy gradient by leveraging the existing theory on the asymptotic distribution of stochastic gradient descent for strongly convex objectives. Finally, we can compute many of the standard baselines used in closed form, which further enables our analysis.
We note that in the literature, the model-based method is often called nominal control or the certainty equivalence principle. As noted in Dean et al. , one issue with this approach is that on an infinite horizon, there is no guarantee of robust stability with nominal control. However, as we are dealing with only finite horizon problems, the notion of stability is irrelevant.
We will consider a restricted family of instances to obtain a sharp asymptotic analysis. For a and , we define the family over as:
This is a simple family where the matrix is stable, contractive, and symmetric. Observe that for we have . Furthermore, the optimal feedback for each of these instances. Our first result for policy optimization gives the asymptotic risk of the model-based Algorithm 3.
Theorem 2.4. Fix a . For any , we have that the model-based plugin Algorithm 3 with thresholds such that , , and satisfies the asymptotic risk bound:
Here, hides constants depending only on .
Theorem 2.4 states that when , the RHS of the risk bound for the model-based case is . It will turn out that the dependence on is optimal for the family . Similar to Theorem 2.1, Theorem 2.4 requires the setting of thresholds . These thresholds serve two purposes. First, they ensure the existence of a unique positive definite solution to the discrete algebraic Riccati equation with the input penalty (the details of this are worked out in Section 6.2). Second, they simplify various technical aspects of the proof related to uniform integrability. In practice, such strong thresholds are not needed, and we leave either removing them or relaxing their requirements to future work.
Next, we look at the model-free case. As mentioned previously, baselines are very influential on the behavior of policy gradient. In our analysis, we consider three different baselines:
(Simple baseline.)
(Value function baseline.)
(Advantage baseline.)
Above, the simple baseline should be interpreted as having effectively no baseline; it turns out to simplify the variance calculations. On the other hand, the value function baseline is a very popular heuristic used in practice. Typically one has to actually estimate the value function for a given policy, since computing it requires knowledge of the model dynamics. In our analysis however, we simply assume the true value function is known. While this is an unrealistic assumption in practice, we note that this assumption substantially reduces the variance of policy gradient, and hence only serves to reduce the asymptotic risk. The last baseline we consider is to use the advantage function . Using advantage functions has been shown to be quite effective in practice. It has the same issue as the value function baseline in that it needs to be estimated from the data; once again in our analysis we simply assume we have access to the true advantage function.
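The reason a state-dependent baseline changes only the variance, and not the mean, of the gradient estimate can be written in one line. The notation below (policy density $\pi_K$, stage costs $c_\ell$, baseline $b_t$, weighting term $\Psi_t$) is assumed for illustration; the exact formulas of the three baselines listed above are not reproduced here.

```latex
\begin{align*}
\widehat{g} &= \sum_{t} \nabla_K \log \pi_K(u_t \mid x_t)\, \Psi_t,
\qquad \Psi_t = \sum_{\ell \ge t} c_\ell - b_t(x_t), \\
\mathbb{E}\big[\nabla_K \log \pi_K(u_t \mid x_t)\, b_t(x_t) \,\big|\, x_t\big]
 &= b_t(x_t)\, \nabla_K \int \pi_K(u \mid x_t)\, du = b_t(x_t)\, \nabla_K 1 = 0 .
\end{align*}
```

Subtracting any function of the current state therefore leaves the expected gradient unchanged. Taking $b_t$ to be the value function turns $\Psi_t$ into a sample of the advantage, and the advantage baseline replaces that sample by the advantage function itself.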
Our main result for model-free policy optimization is the following asymptotic risk lower bound on Algorithm 4.
Theorem 2.5. Fix a . For any consider Algorithm 4 with , step-sizes , and threshold . We have that:
Here, hides constants depending only on .
Theorem 2.5 states that when , for the simple baseline the RHS is , for the value function baseline the RHS is , and finally for the advantage baseline the RHS is . In all cases (even with the advantage baseline), the dependence on is at least one factor more than in the model-based case, which is (Theorem 2.4). Furthermore, for the simple and value function baseline, we even see the RHS depending on the horizon length and , respectively. The extra factors of the horizon length appear due to the large variance of the policy gradient estimator without the variance reduction effects of the advantage baseline. Finally, we note that we prove Theorem 2.5 with a specific choice of step size . This step size corresponds to the standard step sizes commonly found in proofs for SGD on strongly convex functions (see e.g. Rakhlin et al.), where is the strong convexity parameter. We leave to future work extending our results to support Polyak-Ruppert averaging, which would yield asymptotic results that are more robust to specific step size choices.
Finally, we turn to our information-theoretic lower bound for any (possibly adaptive) method over the family .
Theorem 2.6. Fix a and suppose is greater than an absolute constant. Consider the family as described above. Fix a time horizon and number of rollouts . The risk of any algorithm which plays (possibly adaptive) feedbacks of the form with and is lower bounded by:
3 Related Work
For general Markov Decision Processes (MDPs), the setting which is the best understood theoretically is the finite-horizon episodic case with discrete state and action spaces, often referred to as the “tabular” setting. Jin et al. provide an excellent overview of the known regret bounds in the tabular setting; here we give a brief summary of the highlights. We focus only on regret bounds for simplicity, but note that many results have also been established in the PAC setting (see e.g. [23, 40, 41]). For tabular MDPs, a model-based method is one which stores the entire state-transition matrix, which takes space where is the number of states, is the number of actions, and is the horizon length. The best known regret bound in the model-based case is , which matches the known lower bound of [17, 19] up to log factors. On the other hand, a model-free method is one which only stores the -function and hence requires only space. The best known regret bound in the model-free case is , which is worse than the model-based case by a factor of the horizon length . Interestingly, there is no gap in terms of the number of states and actions . It is open whether or not the gap in is fundamental or can be closed. Sun et al. present an information-theoretic definition of model-free algorithms. Under their definition, they construct a family of factored MDPs with horizon length where any model-free algorithm incurs sample complexity , whereas there exists a model-based algorithm that has sample complexity polynomial in and other relevant quantities. We leave proving lower bounds for LQR under their more general definition of model-free algorithms to future work.
For LQR, the story is less complete. Unlike the tabular setting, the storage requirements of a model-based method are comparable to a model-free method. For instance, it takes space to store the state transition model and space to store the -function. In presenting the known results of LQR, we will delineate between offline (one-shot) methods versus online (adaptive) methods.
In the offline setting, the first non-asymptotic result is from Fiechter , who studied the sample complexity of the discounted infinite horizon LQR problem. Later, Dean et al.  study the average cost infinite horizon problem, using tools from robust control to quantify how the uncertainty in the model affects control performance in an interpretable way. Both works fall under model-based methods, since they both propose to first estimate the state transition matrices from sample trajectories using least-squares and then use the estimated dynamics in a control synthesis procedure.
For model-free methods for LQR, Tu and Recht study the performance of least-squares temporal difference learning (LSTD) [8, 9], which is a classic policy evaluation algorithm in RL. They focus on the discounted cost LQR setting and provide a non-asymptotic high probability bound on the risk of LSTD. Later, Abbasi-Yadkori et al. extend this result to the average cost LQR setting. Most related to our analysis for policy gradient is Fazel et al., who study the performance of model-free policy gradient related methods on LQR. Unfortunately, their bounds do not give explicit dependence on the problem instance parameters and are therefore difficult to compare to. Furthermore, Fazel et al. study a simplified version of the problem where the problem is an infinite horizon problem (as opposed to finite horizon in this work) and the only noise is in the initial state; all subsequent state transitions have no process noise. Other than our current work, we are unaware of any analysis (asymptotic or non-asymptotic) which explicitly studies the behavior of policy gradient on the finite horizon LQR problem. We also note that Fazel et al. analyze a policy optimization method which is more akin to random search (e.g. [25, 36]) than REINFORCE. Finally, note that all the results mentioned for LQR are only upper bounds; we are unaware of any lower bounds in the literature for LQR which give explicit dependence on the problem instance.
We now discuss known results for the online (adaptive) setting for LQR. For model-based algorithms, both optimism in the face of uncertainty (OFU) [2, 13, 16] and Thompson sampling [3, 4, 30] have been analyzed in the online learning literature. In both cases, the algorithms have been shown to achieve regret, which is known to be nearly optimal in the dependence on . However, in nearly all the bounds the dependence on the problem instance parameters is hidden. Furthermore, it is currently unclear how to solve the OFU subproblem in polynomial time for LQR. In response to the computational issues with OFU, Dean et al.  propose a polynomial time adaptive algorithm with sub-linear regret ; their bounds also make the dependence on the problem instance parameters explicit, but are quite conservative in this regard.
For model-free algorithms, Abbasi-Yadkori et al.  study the regret of a model-free algorithm similar in spirit to least-squares policy iteration (LSPI) . They prove that their algorithm has regret for any , nearly matching the bound given by Dean et al. in terms of the dependence on . In terms of the dependence on the problem specific parameters, however, their bound is not directly comparable to that of Dean et al. Experimentally, Abbasi-Yadkori et al. observe that their model-free algorithm performs quite sub-optimally compared to model-based methods; these empirical observations are also consistent with similar experiments conducted in [25, 35, 44].
4 Asymptotic Toolbox
Our analysis relies heavily on computing limiting distributions for the various estimators we study. A crucial fact we use is that if the matrix is stable, then the Markov chain given by with is geometrically ergodic. This allows us to apply well known limit theorems for ergodic Markov chains.
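Ergodicity can be checked numerically on a small example: the time average of the outer products of the state along one long trajectory should approach the stationary covariance. The sketch below assumes the chain `x_{t+1} = A x_t + w_t` with isotropic Gaussian noise of scale `sigma_w`; the function names are hypothetical.

```python
import numpy as np

def stationary_covariance(A, sigma_w=1.0):
    """Solve S = A S A^T + sigma_w^2 I for a stable A (spectral radius < 1)."""
    n = A.shape[0]
    lhs = np.eye(n * n) - np.kron(A, A)
    rhs = (sigma_w ** 2) * np.eye(n).reshape(-1)
    S = np.linalg.solve(lhs, rhs).reshape(n, n)
    return 0.5 * (S + S.T)  # symmetrize against round-off

def empirical_covariance(A, T, sigma_w=1.0, seed=0):
    """Time average of x_t x_t^T along one trajectory started at x_0 = 0."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    x = np.zeros(n)
    acc = np.zeros((n, n))
    for _ in range(T):
        x = A @ x + sigma_w * rng.standard_normal(n)
        acc += np.outer(x, x)
    return acc / T
```

When the spectral radius of `A` approaches one, the linear system in `stationary_covariance` becomes ill-conditioned and the time average converges more slowly, which mirrors the role of the stability radius in the risk bounds.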
In what follows, we let denote almost sure convergence and denote convergence in distribution. We also let denote the standard Kronecker product and denote the symmetric Kronecker product; see e.g.  for a review of the basic properties of the Kronecker and symmetric Kronecker product which we will use extensively throughout the sequel. For a matrix , the notation denotes the vectorized version of by stacking the columns. We will also let denote the operator that satisfies for all symmetric matrices , where the first inner product is with respect to and the second is with respect to . Finally, we let and denote the functional inverses of and . The proofs of the results presented in this section are deferred to the appendix.
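For quick reference, the identities about these operators that are most often needed can be stated as follows. The symbols $\mathrm{vec}$, $\mathrm{svec}$, $\otimes$, and $\otimes_s$ are assumed to match the operators described above, and the matrices $A$, $B$, $X$, $Y$ are arbitrary placeholders of compatible dimensions.

```latex
\begin{align*}
\mathrm{vec}(A X B) &= (B^\top \otimes A)\, \mathrm{vec}(X), \\
\langle \mathrm{svec}(X), \mathrm{svec}(Y) \rangle &= \operatorname{tr}(X Y)
   \quad \text{for symmetric } X, Y, \\
(A \otimes_s B)\, \mathrm{svec}(X) &= \tfrac{1}{2}\, \mathrm{svec}\!\left(A X B^\top + B X A^\top\right)
   \quad \text{for symmetric } X .
\end{align*}
```

The first identity is what turns a discrete Lyapunov equation into an ordinary linear system in the vectorized unknown, and the third plays the same role when the unknown is restricted to symmetric matrices.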
We first state a well-known result that concerns the least-squares estimator of a stable dynamical system. In the scalar case, this result dates back to Mann and Wald .
Let be a dynamical system with stable and . Given a trajectory , let denote the least-squares estimator of with regularization :
Let denote the stationary covariance matrix of the process , i.e. . We have that and furthermore:
We now consider a slightly altered process where the system is no longer autonomous, and instead will be driven by white noise.
Let be a stable dynamical system driven by and . Consider a least-squares estimator of based off of independent trajectories of length , i.e. given ,
Let denote the stationary covariance of the process , i.e. solves
We have that and furthermore:
Next, we consider the asymptotic distribution of Least-Squares Temporal Difference Learning for LQR.
Lemma 4.3. Let be a linear system driven by and . Suppose the closed-loop matrix is stable. Let denote the stationary distribution of the Markov chain . Define the two matrices , the mapping , and the vector as
Let denote the LSTD estimator given by:
Suppose that LSTD is run with the true and that the matrix is invertible. We have that and furthermore:
As a corollary to Lemma 4.3, we work out the formulas for and and a useful lower bound.
In the setting of Lemma 4.3, with , we have that the matrix is invertible, and: