One of the most challenging problems in reinforcement learning (RL) is how to effectively trade off exploration and exploitation in an unknown environment. A number of learning methods have been proposed for finite Markov decision processes (MDPs) and analyzed in the PAC-MDP framework (see e.g., [strehl2009reinforcement]) and the regret framework (see e.g., [jaksch2010near-optimal]). The two most popular approaches to the exploration-exploitation trade-off are the optimism-in-face-of-uncertainty (OFU) principle, where optimistic policies are selected according to upper-confidence bounds on the true MDP parameters, and the Thompson sampling (TS) strategy (in the RL literature, TS was introduced by strens2000a-bayesian and is often referred to as posterior-sampling for reinforcement learning, PSRL), where random MDP parameters are drawn from a posterior distribution and the corresponding optimal policy is executed. Despite their success in finite MDPs, extensions of these methods and their analyses to continuous state-action spaces are still rather limited. osband2016generalization study how to randomize the parameters of a linear function approximator to induce exploration and prove regret guarantees in the finite MDP case. osband2015bootstrapped develop a specific TS method for the more complex case of neural architectures, with significant empirical improvements over alternative exploration strategies but no theoretical guarantees. In this paper, we focus on a specific family of continuous state-action MDPs, linear quadratic (LQ) control problems, where the state transition is linear and the cost function is quadratic in the state and the control. Despite their specific structure, LQ models are flexible and widely used in practice (e.g., to track a reference trajectory).
If the parameter defining the dynamics and cost is known, the optimal control can be computed explicitly as a linear function of the state with an appropriate gain. When it is unknown, an exploration-exploitation trade-off needs to be solved. bittanti2006adaptive and campi1998adaptive first proposed an optimistic approach to this problem, showing that the performance of an adaptive control strategy asymptotically converges to that of the optimal control. Building on this approach and the OFU principle, abbasi2011regret proposed a learning algorithm (OFU-LQ) with $\widetilde{O}(\sqrt{T})$ cumulative regret. abbasi2015bayesian further studied how the TS strategy could be adapted to the LQ control problem. Under the assumption that the true parameters of the model are drawn from a known prior, they show that the so-called Bayesian regret matches the bound of OFU-LQ.
In this paper, we analyze the regret of TS in LQ problems in the more challenging frequentist case, where the parameter is fixed and no prior assumption is made on its value. The analysis of OFU-LQ relies on three main ingredients: 1) optimistic parameters, 2) lazy updates (the control policy is updated only a logarithmic number of times), and 3)
concentration inequalities for the regularized least-squares estimator of the unknown parameter. While we build on previous results for the least-squares estimates of the parameters, points 1) and 2) must be adapted for TS. Unfortunately, the Bayesian regret analysis of TS in [abbasi2015bayesian] does not apply in this case, since no prior on the parameter is available. Furthermore, we show that the existing frequentist regret analysis for TS in linear bandits [agrawal2012thompson] cannot be generalized to the LQ case. This requires deriving a novel line of proof in which we first prove that TS
has a constant probability of sampling an optimistic parameter (i.e., an LQ system whose optimal expected average cost is smaller than the true one), and then we exploit the LQ structure to show how optimism allows us to directly link the regret to the controls operated by TS over time and eventually bound them. Nonetheless, this analysis reveals a critical trade-off between the frequency with which new parameters are sampled (and thus the chance of being optimistic) and the regret accumulated every time the control policy changes. In OFU-LQ this trade-off is easily solved by construction: the lazy update guarantees that the control policy changes very rarely, and whenever a new policy is computed, it is guaranteed to be optimistic. On the other hand, TS relies on the random sampling process to obtain optimistic models, and if this is not done frequently enough, the regret can grow unbounded. This forces TS to favor short episodes, and we prove that this leads to an overall regret of order $\widetilde{O}(T^{2/3})$ in the one-dimensional case (i.e., both states and controls are scalars), which is significantly worse than the $\widetilde{O}(\sqrt{T})$ regret of OFU-LQ.
The control problem. We consider the discrete-time infinite-horizon linear quadratic (LQ) control problem. Let $x_t \in \mathbb{R}^n$ be the state of the system and $u_t \in \mathbb{R}^d$ be the control at time $t$; an LQ problem is characterized by the linear dynamics $x_{t+1} = A x_t + B u_t + \epsilon_{t+1}$ and the quadratic cost function $c(x_t, u_t) = x_t^\top Q x_t + u_t^\top R u_t$,
where $A$ and $B$ are unknown matrices and $Q$ and $R$ are known positive definite matrices of appropriate dimension. We summarize the unknown parameters in $\theta^\top = [A, B]$. The noise process $\{\epsilon_t\}_t$ is zero-mean and satisfies the following assumption.
$\{\epsilon_{t+1}\}_t$ is a martingale difference sequence with respect to the filtration $\mathcal{F}_t$, which represents the information available up to time $t$.
In LQ, the objective is to design a closed-loop control policy $\pi$, mapping states to controls, that minimizes the average expected cost
$J_\pi(\theta) = \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=0}^{T} c(x_t, u_t)\Big],$
with $x_0 = 0$ and $u_t = \pi(x_t)$. Standard theory for LQ control guarantees that the optimal policy is linear in the state and that the corresponding average expected cost is the solution of a Riccati equation.
Proposition 1 (Thm.16.6.4 in [lancaster1995algebraic]).
Under Asm. 1 and for any LQ system with parameters $\theta^\top = [A, B]$ such that $(A, B)$ is stabilizable ($(A, B)$ is stabilizable if there exists a control gain matrix $K$ s.t. $A + BK$ is stable, i.e., all its eigenvalues lie inside the unit circle), and p.d. cost matrices $Q$ and $R$, the optimal solution of Eq. 2 is given by
where $\pi(x) = K(\theta)\, x$ is the optimal policy, $J(\theta)$ is the corresponding average expected cost, $K(\theta)$ is the optimal gain, and $P(\theta)$ is the unique solution to the Riccati equation associated with the control problem. Finally, we also have that the closed-loop matrix $A + B K(\theta)$ is asymptotically stable.
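To make Prop. 1 concrete, the following minimal Python sketch computes the Riccati fixed point by simple iteration and checks closed-loop stability; the matrices `A`, `B`, `Q`, `R` are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def riccati_fixed_point(A, B, Q, R, iters=500):
    """Iterate the discrete-time Riccati operator to its fixed point P."""
    P = Q.copy()
    for _ in range(iters):
        BtPA = B.T @ P @ A
        P = Q + A.T @ P @ A - BtPA.T @ np.linalg.solve(R + B.T @ P @ B, BtPA)
    return P

A = np.array([[1.0, 0.1], [0.0, 0.9]])   # stabilizable (A, B) pair
B = np.array([[0.0], [1.0]])
Q = np.eye(2)                            # symmetric p.d. cost matrices
R = np.eye(1)

P = riccati_fixed_point(A, B, Q, R)
# Optimal gain: the optimal policy is linear in the state, u = K x.
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
# The closed-loop matrix A + B K is asymptotically stable.
rho = np.max(np.abs(np.linalg.eigvals(A + B @ K)))
assert rho < 1.0
```

In practice one would call a dedicated Riccati solver, but the bare iteration above suffices to illustrate the fixed-point structure of the equation.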
For notational convenience, we use $z_t = (x_t^\top, u_t^\top)^\top$, so that the closed-loop dynamics can be equivalently written as $x_{t+1} = \theta^\top z_t + \epsilon_{t+1}$. We introduce further assumptions about the LQ systems we consider.
We assume that the LQ problem is characterized by parameters $\theta$ belonging to an admissible set $\mathcal{S}$ such that the cost matrices $Q$ and $R$ are symmetric p.d. (even if $J(\theta)$ is not defined for every $\theta$, we extend its domain of definition by setting $J(\theta) = +\infty$ there).
While Asm. 1 basically guarantees that the linear model in Eq. 1 is correct, Asm. 2 restricts the parameters to the admissible set $\mathcal{S}$. This set is used later in the learning process and compactly replaces Asm. A2-4 in [abbasi2011regret], as shown in the following proposition.
Given an admissible set $\mathcal{S}$ as defined in Asm. 2, we have that 1) any $(A, B)$ in $\mathcal{S}$ is stabilizable, 2) $\mathcal{S}$ is compact, and 3) there exist positive constants uniformly bounding the Frobenius norm of the parameters and the 2-norm of the optimal gains over $\mathcal{S}$. (We use $\|\cdot\|_F$ and $\|\cdot\|_2$ to denote the Frobenius norm and the 2-norm, respectively.)
As an immediate result, any system $\theta \in \mathcal{S}$ is stabilizable, and therefore Asm. 2 implies that Prop. 1 holds. Finally, we derive a result about the regularity of the Riccati solution, which we later use to relate the regret to the controls performed by TS.
Under Asm. 1 and for any LQ with parameters $\theta$ and cost matrices $Q$ and $R$ satisfying Asm. 2, let $J(\theta)$ be the optimal solution of Eq. 2. Then the mapping $\theta \mapsto J(\theta)$ is continuously differentiable. Furthermore, the directional derivative of $J$ at $\theta$ in a direction $\delta\theta$, expressed through the gradient of $J$, is the solution of a Lyapunov equation involving the closed-loop matrix.
The learning problem. At each time $t$, the learner chooses a policy $\pi_t$, executes the induced control $u_t = \pi_t(x_t)$, and suffers the cost $c(x_t, u_t)$. The performance is measured by the cumulative regret up to time $T$, where at each step the difference between the cost of the controller and the expected average cost of the optimal controller is measured. Let $\{u_t\}$ be a sequence of controls and $\{x_t\}$ the corresponding states; then $\theta$ can be estimated by regularized least-squares (RLS). For any regularization parameter $\lambda > 0$, the design matrix and the RLS estimate are defined as
$V_t = \lambda I + \sum_{s=0}^{t-1} z_s z_s^\top, \qquad \hat\theta_t = V_t^{-1} \sum_{s=0}^{t-1} z_s x_{s+1}^\top.$
For notational convenience, we use . We recall a concentration inequality for RLS estimates.
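The RLS estimate can be sketched as follows for a scalar system; the system parameters, noise level, and the exploratory control scheme are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([[0.8], [0.4]])   # scalar system: x' = a*x + b*u + noise
lam = 1.0                               # regularization parameter

V = lam * np.eye(2)                     # design matrix V_t = lam*I + sum z z^T
s = np.zeros((2, 1))                    # running sum of z_s * x_{s+1}
x = 0.0
for _ in range(2000):
    u = rng.normal()                    # purely exploratory control (illustrative)
    z = np.array([[x], [u]])            # z_t stacks state and control
    x = float(theta_true.T @ z) + 0.1 * rng.normal()
    V += z @ z.T
    s += z * x

theta_hat = np.linalg.solve(V, s)       # RLS estimate of (a, b)
```

With enough excitation from the control, `theta_hat` concentrates around `theta_true`, which is exactly what the concentration inequality below quantifies.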
Proposition 3 (Thm. 2 in [abbasi-yadkori2011improved]).
We assume that the noise terms are conditionally and component-wise sub-Gaussian and that their conditional second moment is bounded. Then for any $\delta \in (0, 1)$ and any $\mathcal{F}_t$-adapted sequence, the RLS estimator is such that
with probability at least $1 - \delta$ (w.r.t. the noise and any randomization in the choice of the control), where
Further, when ,
At any step $t$, we define the confidence ellipsoid centered at the RLS estimate, with orientation given by the design matrix $V_t$ and an appropriate radius. Finally, we report a standard result of RLS that, together with Prop. 3, shows that the prediction error on the points used to construct the estimator is cumulatively small.
Proposition 4 (Lem. 10 in [abbasi2011regret]).
For any arbitrary $\mathcal{F}_t$-adapted sequence, let $V_t$ be the corresponding design matrix; then
Moreover, when the norms of the covariates are bounded for all $t$, then
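The cumulative bound in Prop. 4 is in the spirit of the standard elliptical potential lemma. The following sketch numerically checks the clipped form of that lemma under illustrative settings; it is not the paper's exact statement.

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 1.0
V = lam * np.eye(2)                          # design matrix V_0 = lam * I
total = 0.0
for t in range(5000):
    z = rng.normal(size=(2, 1))              # adapted sequence (here i.i.d.)
    w = float(z.T @ np.linalg.solve(V, z))   # ||z_t||^2 in the V_{t-1}^{-1} norm
    total += min(w, 1.0)                     # clipped prediction weight
    V += z @ z.T

# Elliptical potential bound: sum of clipped weights <= 2 log(det V_T / det V_0).
log_det_ratio = float(np.log(np.linalg.det(V) / lam**2))
assert total <= 2.0 * log_det_ratio
```

The key point is that the sum grows only logarithmically with the determinant of the design matrix, which is what makes the prediction errors cumulatively small.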
3 Thompson Sampling for LQR
We introduce a specific instance of TS for learning in LQ problems, obtained as a modification of the algorithm proposed in [abbasi2015bayesian], where we replace the Bayesian structure and the Gaussian prior assumption with a generic randomized process and we modify the update rule. The algorithm is summarized in Alg. 1. At any step $t$, given the RLS estimate $\hat\theta_t$ and the design matrix $V_t$, TS samples a perturbed parameter $\tilde\theta_t$. In order to ensure that the sampled parameter is indeed admissible, we re-sample until a valid $\tilde\theta_t$ is obtained. Denoting by $\mathcal{R}_{\mathcal{S}}$ the rejection sampling operator associated with the admissible set $\mathcal{S}$, we define $\tilde\theta_t$ as
where the perturbation is of the form $\beta_t V_t^{-1/2} \eta_t$ and every coordinate of the matrix $\eta_t$ is a random sample drawn i.i.d. from $\mathcal{N}(0, 1)$. We refer to this distribution as the TS sampling distribution. Notice that such sampling does not need to be associated with an actual posterior over $\theta$; it just needs to randomize parameters coherently with the RLS estimate and the uncertainty captured in $V_t$. The high-probability TS ellipsoid is then defined so that any sampled parameter belongs to it with high probability.
Given the sampled parameter $\tilde\theta_t$, the gain matrix $K(\tilde\theta_t)$ is computed and the corresponding optimal control is applied. As a result, the learner observes the cost and the next state, and the RLS estimate and the design matrix are updated accordingly. As in most RL strategies, the updates are not performed at every step, and the same estimated optimal policy is kept constant throughout an episode. Let $V_{t_k}$ be the design matrix at the beginning of an episode; then the episode is terminated upon either of two conditions: 1) the determinant of the design matrix doubles (i.e., $\det(V_t) > 2 \det(V_{t_k})$), or 2) a maximum length is reached. While the first condition is common to other RL strategies, here we also need to force the algorithm to interrupt an episode as soon as its length exceeds a fixed number of steps. The need for this additional termination condition is intrinsic to the TS nature, and it is discussed in detail in the next section.
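A minimal sketch of the sampling and episode-termination logic just described, for a scalar system; the admissibility test, constants, and function names are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_admissible(theta_hat, V, beta, max_tries=1000):
    """Rejection sampling: perturb the RLS estimate by beta * V^{-1/2} * eta
    with i.i.d. standard Gaussian eta until the candidate is admissible."""
    V_inv_sqrt = np.linalg.cholesky(np.linalg.inv(V))
    for _ in range(max_tries):
        eta = rng.normal(size=theta_hat.shape)
        theta_tilde = theta_hat + beta * V_inv_sqrt @ eta
        a, b = theta_tilde.ravel()
        # Crude admissibility check for a scalar system (a, b): stabilizable,
        # i.e., either controllable (b != 0) or already stable (|a| < 1).
        if abs(b) > 1e-3 or abs(a) < 1.0:
            return theta_tilde
    raise RuntimeError("no admissible sample found")

def episode_over(V, V_start, steps_in_episode, max_len):
    """Terminate when det(V) has doubled since the episode start
    or the maximum episode length is reached."""
    return (np.linalg.det(V) > 2.0 * np.linalg.det(V_start)
            or steps_in_episode >= max_len)

theta_hat = np.array([[0.8], [0.4]])   # illustrative RLS estimate
V = 10.0 * np.eye(2)
theta_tilde = sample_admissible(theta_hat, V, beta=1.0)
```

The two termination tests in `episode_over` mirror the two conditions in the text: the determinant-doubling rule shared with OFU-style algorithms, plus the TS-specific cap on episode length.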
4 Theoretical analysis
We prove the first frequentist regret bound for TS in LQ systems of dimension one (i.e., scalar states and controls). In order to isolate the steps which explicitly rely on this restriction, whenever possible we derive the proof in the general multidimensional case.
This result is in striking contrast with previous results in multi-armed and linear bandits, where the frequentist regret of TS is $\widetilde{O}(\sqrt{T})$, and with the Bayesian analysis of TS in control problems, where the regret is also $\widetilde{O}(\sqrt{T})$. As discussed in the introduction, the frequentist regret analysis in control problems introduces a critical trade-off between the frequency of selecting optimistic models, which guarantees small regret in bandit problems, and the reduction of the number of policy switches, which leads to small regret in control problems. Unfortunately, this trade-off cannot be easily balanced, and this leads to a final regret of $\widetilde{O}(T^{2/3})$. Sect. 4.2 provides a more detailed discussion of the challenges of bounding the frequentist regret of TS in LQ problems.
4.1 Setting the Stage
Concentration events. We introduce the following high-probability events.
We define the event on which the RLS estimate concentrates around the true parameter $\theta$, and the event on which the sampled parameter concentrates around the RLS estimate.
We also introduce a high-probability event on which the states remain bounded.
Given two problem-dependent positive constants, we define the event (bounded states) on which the norm of the state remains below them.
We now show that these events do hold with high probability.
On , . Thus, .
Lem. 2 leverages Prop. 3 and the sampling distribution to ensure that the concentration events hold w.h.p. Furthermore, Corollary 1 ensures that the states remain bounded w.h.p. on those events. (This non-trivial result is directly collected from the bounding-the-state section of [abbasi2011regret].) As a result, the proof can proceed under the condition that both parameters concentrate and that the states are bounded, which we summarize in a sequence of events that holds with high probability for all $t$.
Regret decomposition. Conditioned on the filtration and the high-probability event above, we directly decompose the regret and bound it on this event as in [abbasi2011regret, Sect. 4.2]
where the regret is decomposed into the three components
Before entering into the details of how to bound each of these components, in the next section we discuss the main challenges in bounding the regret.
4.2 Related Work and Challenges
Since the RLS estimator is the same in both TS and OFU, two of the regret terms can be bounded as in [abbasi2011regret]. In fact, the first is a martingale by construction and can be bounded by Azuma's inequality. The second is related to the difference between the true next expected state and the predicted next expected state; a direct application of RLS properties makes this difference small by construction. The remaining terms are directly affected by the changes in the model between two time instants, and by the difference in optimal average expected cost between the true model and the sampled model. In the following, we refer to the elements at time $t$ of these two regret terms as the consistency regret and the optimality regret, respectively.
Optimistic approach. OFU-LQ bounds both regret terms directly by construction. In fact, the lazy update of the control policy sets the consistency regret to zero at all steps except when the policy changes between two episodes. Since in OFU-LQ an episode terminates only when the determinant of the design matrix doubles, it is easy to see that the number of episodes is logarithmic in $T$, which bounds the consistency regret as well (with a constant depending on the bound on the state and other parameters specific to the LQ system). (Notice that the consistency regret is not specific to LQ systems but is common to all regret analyses in RL, see e.g., UCRL [jaksch2010near-optimal], except for episodic MDPs, and it is always bounded by keeping the number of policy switches, i.e., the number of episodes, under control.) At the same time, at the beginning of each episode an optimistic parameter is chosen, i.e., one whose optimal average cost is no larger than the true one, which directly ensures that the optimality regret is upper bounded by 0 at each time step.
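The determinant-doubling argument can be checked numerically: terminating an episode whenever det(V) doubles yields at most log2(det V_T / det V_0) + 1 episodes. A sketch with illustrative constants:

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 1.0
V = lam * np.eye(2)
V0_det = np.linalg.det(V)
episodes = 1
episode_start_det = V0_det
for _ in range(5000):
    z = rng.normal(size=(2, 1))          # covariates fed to the design matrix
    V += z @ z.T
    if np.linalg.det(V) > 2.0 * episode_start_det:
        episodes += 1                    # determinant doubled: new episode
        episode_start_det = np.linalg.det(V)

# Each completed episode at least doubles det(V), so the count is logarithmic.
bound = np.log2(np.linalg.det(V) / V0_det) + 1
assert episodes <= bound
```

This is exactly why the lazy update keeps the consistency regret small: the policy switches only a logarithmic number of times, no matter how fast the design matrix grows.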
Bayesian regret. The lazy PSRL algorithm in [abbasi2015bayesian] has the same lazy update as OFU-LQ, and thus it directly controls the consistency regret through a small number of episodes. On the other hand, the random choice of the sampled parameter no longer guarantees optimism at each step. Nonetheless, the regret is analyzed in the Bayesian setting, where the true parameter is drawn from a known prior and the regret is evaluated in expectation w.r.t. this prior. Since the sampled parameter is drawn from a posterior constructed from the same prior, in expectation its associated optimal average cost is the same as the true one, which makes the expected optimality regret vanish.
Frequentist regret. When moving from Bayesian to frequentist regret, this argument no longer holds, and the (positive) deviations of the sampled optimal cost w.r.t. the true one have to be bounded in high probability. abbasi2011regret exploit the linear structure of LQ problems to reuse arguments originally developed in the linear bandit setting. Similarly, we could leverage the analysis of TS for linear bandits by agrawal2012thompson to derive a frequentist regret bound. agrawal2012thompson partition the (potentially infinite) set of arms into saturated and unsaturated arms depending on their estimated value and their associated uncertainty (i.e., an arm is saturated when the uncertainty of its estimate is smaller than its performance gap w.r.t. the optimal arm). In particular, the uncertainty is measured using confidence intervals derived from a concentration inequality similar to Prop. 3. This suggests using a similar argument and classifying policies as saturated and unsaturated depending on their value. Unfortunately, this proof direction cannot be applied in the case of LQR. In fact, in an LQ system the performance of a policy is evaluated by the function $J$, and the policy uncertainty should be measured by a confidence interval on $J$. Despite the concentration inequality in Prop. 3, we notice that such intervals may not even be finite, since a sampled gain may fail to stabilize the true system (or vice versa) and thus incur an infinite cost. As a result, it is not possible to introduce the notion of saturated and unsaturated policies in this setting, and another line of proof is required. Another key element in the proof of [agrawal2012thompson] for TS in linear bandits is to show that TS has a constant probability of selecting optimistic actions and that this contributes to reducing the regret of any non-optimistic step. In our case, this translates into requiring that TS select a system whose optimal average cost is smaller than the true one. Lem. 3 shows that this happens with a constant probability. Furthermore, we can show that optimistic steps reduce the regret of non-optimistic steps, thus effectively bounding the optimality regret. Nonetheless, this is not compatible with a small consistency regret: we need optimistic parameters to be sampled often enough, while bounding the consistency regret requires reducing the policy switches (i.e., the number of episodes) as much as possible. If we keep the same number of episodes as with the lazy update of OFU-LQ (i.e., logarithmically many), then the number of sampled parameters is equally small. While OFU-LQ guarantees that every policy update is optimistic by construction, with TS only a constant fraction of the sampled parameters is optimistic on average. Unfortunately, such a small number of optimistic steps is no longer enough to derive a bound on the optimality regret.
Summarizing, in order to derive a frequentist regret bound for TS in LQ systems, we need the following ingredients: 1) a constant probability of optimism, 2) a connection between optimism and the optimality regret that avoids the saturated/unsaturated argument, and 3) a suitable trade-off between lazy updates, which bound the consistency regret, and frequent updates, which guarantee a small optimality regret.
4.3 Bounding the Optimality Regret
Decomposition. We define the “extended” filtration that also accounts for the sampling randomization. Let the (random) number of episodes up to time $T$, the steps at which the policy is updated (i.e., when a new parameter is sampled), and the associated length of each episode be defined accordingly; then we can further decompose the optimality regret as
We focus on the second regret term, which we redefine for notational convenience.
Optimism and expectation. Consider the set of optimistic parameters, i.e., LQ systems whose optimal average expected cost is lower than the true one. Then, for any optimistic parameter, the per-step regret is bounded by:
where we first use the definition of the optimistic parameter set, then bound the resulting quantity by its absolute value, and finally switch to the expectation over the optimistic set, since the inequality holds for any optimistic parameter. While this inequality is true for any sampling distribution, it is convenient to select it to coincide with the sampling distribution of TS. Thus, we set the perturbation to be component-wise Gaussian and obtain
At this point, we need to show that the probability of sampling an optimistic parameter is constant at any step. This result is proved in the following lemma.
Let the set of optimistic parameters be as above and let the perturbation be component-wise normal; then, in the one-dimensional case (scalar states and controls),
where the lower bound on the probability of optimism is a strictly positive constant.
Integrating this result into the previous expression gives
The most interesting aspect of this result is that the constant probability of being optimistic allows us to bound the worst-case non-stochastic quantity by an expectation up to a multiplicative constant (we drop the events for notational convenience). The last term is the conditional absolute deviation of the performance w.r.t. the TS distribution. This connection provides a major insight into the functioning of TS: it shows that TS does not need an accurate estimate of the whole parameter, but should rather reduce the estimation errors only along the directions that may translate into larger errors in estimating the objective function $J$. In fact, we show later that at each step TS chooses a sampling distribution that tends to minimize the expected absolute deviations of $J$, thus contributing to reducing the deviations of the regret.
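The constant probability of optimism in the scalar case can be illustrated by Monte Carlo. The sketch below uses a scalar Riccati fixed point and illustrative constants (the estimate is deliberately biased toward optimism so the effect is visible); it is an illustration, not the paper's proof.

```python
import numpy as np

def scalar_riccati(a, b, q=1.0, r=1.0, iters=200):
    """Fixed point of the scalar Riccati equation; the average cost is
    proportional to p for a fixed noise variance."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return p

rng = np.random.default_rng(2)
theta_true = np.array([0.8, 0.4])
theta_hat = np.array([0.75, 0.45])       # slightly biased estimate (illustrative)
j_true = scalar_riccati(*theta_true)

optimistic = 0
n_samples = 2000
for _ in range(n_samples):
    theta_tilde = theta_hat + 0.1 * rng.normal(size=2)   # Gaussian perturbation
    a, b = theta_tilde
    if abs(a) < 1.0 or abs(b) > 1e-3:    # crude stabilizability check
        if scalar_riccati(a, b) <= j_true:
            optimistic += 1

p_opt = optimistic / n_samples            # empirical probability of optimism
```

Under these (favorable) settings `p_opt` stays bounded away from zero, which is the qualitative behavior that Lem. 3 establishes rigorously.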
Variance and gradient. We introduce a mapping from a suitable ball to the reals, defined as
where the restriction to the ball is needed to match the confidence ellipsoid of the sampling. Since the perturbation is independent of the past, we can rewrite Eq. 10 as
We now show that this formulation of the regret is closely related to the policy executed by TS. We prove the following result (proof in the supplement).
Let $\mathcal{D}$ be a convex domain with finite diameter $\mathrm{diam}(\mathcal{D})$. Let $\rho$ be a non-negative log-concave function on $\mathcal{D}$ with continuous derivatives up to the second order. Then, for all functions in the Sobolev space of order 1 on $\mathcal{D}$ whose weighted mean vanishes, one has
Before using the previous result, we relate the gradient of this mapping to the gradient of $J$. For any admissible parameter and any direction, we have
which leads to
We are now ready to use the weighted Poincaré inequality of Lem. 4 to link the expectation of the deviation to the expectation of its gradient. From Lem. 1, the mapping is differentiable, and its expectation is zero by construction. On the other hand, the rejection sampling procedure imposes that we condition the expectation on the admissible set, which is unfortunately not convex. However, we can still apply Lem. 4 by considering the function on a ball of appropriate diameter. As a result, we finally obtain
From gradient to actions. Recalling the definition of the gradient of $J$, we notice that the previous expression bounds the regret with a term involving the gain of the optimal policy for the sampled parameter. This shows that the regret is directly related to the policies chosen by TS. To make this relationship more apparent, we now elaborate the previous expression to reveal the sequence of state-control pairs induced by the policy with the sampled gain. We first plug the bound back into Eq. 9 as
We remove the expectation by adding and subtracting the actual realizations of as
Thus, one obtains
Now we want to relate the cumulative sum of the last regret term to the prediction error of the RLS, which we know from Prop. 4 is bounded w.h.p. We now focus on the one-dimensional case, where the parameter is just a scalar value, and obtain:
Intuitively, this means that over each episode, the more the states are excited, the more the uncertainty reduces in the sampled direction. As a result, to ensure that this regret term is small, it would be sufficient to show that the states provide enough information to learn the system in each chosen direction. More formally, assume that there exists a positive constant lower-bounding the state excitation at every step. Then,
where we use the maximum episode length guaranteed by the termination condition. Unfortunately, the intrinsic randomness of the states (triggered by the noise) is such that the assumption above is violated w.p. 1. However, in the one-dimensional case, the regret over an episode can be conveniently written as