Augmented RBMLE-UCB Approach for Adaptive Control of Linear Quadratic Systems

01/25/2022
by   Akshay Mete, et al.

We consider the problem of controlling a stochastic linear system with quadratic costs, when its system parameters are not known to the agent – called the adaptive LQG control problem. We re-examine an approach called "Reward-Biased Maximum Likelihood Estimate" (RBMLE) that was proposed more than forty years ago, and which predates the "Upper Confidence Bound" (UCB) method as well as the definition of "regret". It simply added a term favoring parameters with larger rewards to the estimation criterion. We propose an augmented approach that combines the penalty of the RBMLE method with the constraint of the UCB method, uniting the two approaches to optimization in the face of uncertainty. We first establish that theoretically this method retains 𝒪(√(T)) regret, the best known so far. We show through a comprehensive simulation study that this augmented RBMLE method considerably outperforms the UCB and Thompson sampling approaches, with a regret that is typically less than 50% of the better of their regrets. The simulation study includes all examples from earlier papers as well as a large collection of randomly generated systems.


1 Introduction

Consider a linear dynamical system (Åström, 2012) in which the system evolution is given as $x_{t+1} = A x_t + B u_t + w_{t+1}$, where $x_t$ is the system state at time $t$, $u_t$ is the control applied at time $t$ by an "agent" (Sutton and Barto, 2018), and $w_{t+1}$ is i.i.d. Gaussian noise. $A$ and $B$ are system matrices that describe the system dynamics. The instantaneous cost incurred at time $t$ is a quadratic function of the current state and control. The goal of the agent is to minimize the expected value of the cumulative costs incurred during $T$ steps. This objective serves the purpose of keeping the system state close to the origin while using minimal control energy. The controller/agent has to attain this goal without knowing the parameter $\theta^\star := [A, B]$. Let $J(\theta^\star)$ denote the optimal average LQG cost when the true parameter is equal to $\theta^\star$ and is known to the agent. The goal is to choose the controls $u_t$ adaptively, on the basis of the information collected during system operation, so as to minimize the expected "regret" (Lattimore and Szepesvári, 2020), i.e., the expected excess of the cumulative cost over $T\,J(\theta^\star)$.

This is called the adaptive LQG control problem.

In this work we use an approach called "Reward-Biased Maximum Likelihood Estimate" (RBMLE) that was first proposed more than four decades ago for the problem of online learning (Kumar and Becker, 1982; Kumar, 1983a; Mete et al., 2021). We begin by giving a brief description of RBMLE. Consider a system with state space $\mathbb{X}$, an action set $\mathbb{U}$, and controlled transition probabilities $p(x' \mid x, u; \theta)$ for $x, x' \in \mathbb{X}$ and $u \in \mathbb{U}$, dependent on an unknown parameter $\theta$. The goal is to maximize a long-term average reward $\liminf_{T\to\infty} \frac{1}{T}\, \mathbb{E} \sum_{t=0}^{T-1} r(x_t, u_t)$, where $x_t$ and $u_t$ are the state and action at time $t$. The parameter $\theta^\star$ is not known; all that is known is that it belongs to a known set $\Theta$. If one knew $\theta^\star$, one could take action $\pi_{\theta^\star}(x_t)$ at time $t$, where for each $\theta$, $\pi_\theta$ denotes a stationary policy that is optimal if the parameter is $\theta$. In the absence of knowledge of $\theta^\star$, one could employ a "certainty equivalent" (CE) approach, where one simply makes an estimate of $\theta^\star$ at time $t$ and uses the action in state $x_t$ that would be optimal if that estimate were the true parameter. Specifically, let $\hat\theta_t$ be the maximum likelihood estimate (MLE) of $\theta^\star$ at time $t$. Then, under the certainty equivalent approach, the action taken at time $t$ is $\pi_{\hat\theta_t}(x_t)$. Mandl (Mandl, 1974) was the first to analyze this CE approach. Recognizing that CE generally suffers from being unable to converge to $\theta^\star$ unless one can distinguish $\theta^\star$ from other $\theta$'s irrespective of what action is taken in what state, he showed that if $p(\cdot \mid x, u; \theta) \neq p(\cdot \mid x, u; \theta')$ for every state-action pair $(x, u)$ whenever $\theta \neq \theta'$, then $\hat\theta_t \to \theta^\star$. Borkar and Varaiya (Borkar and Varaiya, 1979) performed a further analysis of the CE rule, and showed that one only obtains a certain "closed-loop identifiability" property: if $\theta_\infty$ denotes the limiting value of the ML estimates under the CE rule, then the system dynamics associated with the (parameter, controller) tuple $(\theta_\infty, \pi_{\theta_\infty})$ are the same as those of $(\theta^\star, \pi_{\theta_\infty})$, i.e., $p(\cdot \mid x, \pi_{\theta_\infty}(x); \theta_\infty) = p(\cdot \mid x, \pi_{\theta_\infty}(x); \theta^\star)$ for all $x$. The problem is that as the parameter estimates begin to converge, exploration ceases, and one ends up only identifying the behavior of the system under the limited actions being applied to it. One misses out on other potentially valuable policies. Since the limiting policy $\pi_{\theta_\infty}$ need not be an optimal long-term average policy for the true system $\theta^\star$, the CE rule leads to sub-optimal performance. Indeed, this problem goes by various names in different fields: the dual control problem (Feldbaum, 1960; Wittenmark, 1995), the closed-loop identifiability problem (Borkar and Varaiya, 1979; Berry and Fristedt, 1985; Gittins et al., 2011), or the problem of exploration vs. exploitation (Lattimore and Szepesvári, 2020).

This exploration-exploitation problem (or, equivalently, the dual control problem) was resolved, without resort to forced exploration, in (Kumar and Becker, 1982). Unless a suitable indistinguishability condition holds, the CE rule suffers from the problem of "insufficient exploration": there is a positive probability with which the MLE does not converge to $\theta^\star$, and hence asymptotically it applies sub-optimal controls. We describe below how it was discovered in (Kumar and Becker, 1982) that the maximum likelihood estimator (MLE), when used in closed loop, is inherently biased towards parameters for which the optimal reward is less than the true optimal reward, and how that work proposed RBMLE as a natural solution to overcome this bias. Let $J(\theta, \pi)$ denote the average cost when a stationary control policy $\pi$ is applied to the system with parameter $\theta$, and let $J(\theta) := J(\theta, \pi_\theta)$ denote the corresponding optimal average cost.

The approach of (Kumar and Becker, 1982) is summarized as follows. It begins by observing that $J(\theta_\infty) \geq J(\theta^\star)$. To see why this is true, note that since $(\theta_\infty, \pi_{\theta_\infty})$ and $(\theta^\star, \pi_{\theta_\infty})$ induce the same stationary probability distribution on the state space, we have $J(\theta_\infty, \pi_{\theta_\infty}) = J(\theta^\star, \pi_{\theta_\infty})$. However, since $\pi_{\theta_\infty}$ might not be optimal for $\theta^\star$, we have $J(\theta^\star, \pi_{\theta_\infty}) \geq J(\theta^\star)$. Upon combining these, and noting that $J(\theta_\infty) = J(\theta_\infty, \pi_{\theta_\infty})$, we obtain $J(\theta_\infty) \geq J(\theta^\star)$. This relation shows the following fundamental property of the ML estimates when they are also used for generating controls: the estimates are inherently biased towards those parameters that have a higher optimal cost than the true parameter value. After making this fundamental observation, (Kumar and Becker, 1982) proposed to counteract this bias by providing a bias in the likelihood objective function in the reverse direction, i.e. by favoring those $\theta$ which yield a larger optimal reward, or equivalently those which have a smaller value of the optimal average cost $J(\theta)$. More specifically, the RBMLE algorithm they developed generates a reward-biased MLE $\hat\theta_t$ by solving the following problem at each time $t$:

(1)

where $\ell_s(\theta)$ denotes the negative log-likelihood of the system state transitioning to $x_{s+1}$ when control $u_s$ is applied in state $x_s$ and the true parameter is $\theta$, $f(\cdot)$ is any strictly increasing function, and $\alpha_t > 0$ is a biasing parameter that must be chosen so as to satisfy the following two desired properties:

(i) it should be sufficiently large so that asymptotically RBMLE favors parameters that have a larger optimal reward than those to which the unbiased MLE would otherwise gravitate. If this were not the case, then the resulting learning algorithm would suffer from the same drawbacks as the CE rule analyzed in (Mandl, 1974).

(ii) it should also not be so large that the ML estimates become asymptotically inconsistent; the likelihood term must eventually dominate the bias.

They showed that in order for (i) and (ii) to be satisfied, one should let $\alpha_t \to \infty$ with $\alpha_t = o(t)$. The resulting RBMLE algorithm was shown to attain the same average cost performance as in the case when $\theta^\star$ is known, in which case the agent can compute and use the optimal stationary policy $\pi_{\theta^\star}$. In the parlance of the machine learning community, the limiting normalized regret (Lai and Robbins, 1985a; Lattimore and Szepesvári, 2020; Auer et al., 2002), normalized by the operating time $T$, vanishes.
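For concreteness, the biased criterion in (1) can plausibly be written in the following standard form. This display is our reconstruction from the surrounding description (using the notation $\ell_s$, $f$, and $\alpha_t$ introduced above); the exact expression in the original paper may differ.

```latex
% Hedged reconstruction of the RBMLE criterion (1): the cumulative negative
% log-likelihood is penalized by a bias that favors parameters with a small
% optimal average cost J(\theta), with weight \alpha_t and strictly increasing f.
\hat{\theta}_t \in \arg\min_{\theta \in \Theta}
  \Big\{ \sum_{s=0}^{t-1} \ell_s(\theta) \;+\; \alpha_t \, f\big(J(\theta)\big) \Big\}.
```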

RBMLE was the first algorithm that did not resort to "forced exploration" in order to solve the adaptive control/RL problem. It attained this by introducing the idea of favoring parameters that yield better performance, which is now known as the principle of "optimism in the face of uncertainty" (Lai and Robbins, 1985a; Agrawal, 1995; Auer et al., 2002). Since its introduction in (Kumar and Becker, 1982), RBMLE has been applied to a variety of online learning problems. For example, it has been used to learn optimal policies for more general MDPs (Kumar, 1982; Kumar and Lin, 1982; Borkar, 1990), LQG systems (Kumar, 1983a; Campi and Kumar, 1998a, b; Prandini and Campi, 2000), LTI systems (Bittanti et al., 2006), non-linear systems (Kumar, 1983b), ergodic problems (Stettner, 1993), and control of diffusions (Borkar, 1991; Duncan et al., 1994). These works, however, only showed that RBMLE yields the optimal long-term average performance.

After the seminal works (Kumar and Becker, 1982; Kumar, 1982; Kumar and Lin, 1982) in the field of adaptive control, the work (Lai and Robbins, 1985b) studied in detail the multi-armed bandit (MAB) problem and introduced the notion of "regret" in order to characterize the performance of a learning algorithm for MABs. The regret of an algorithm is the difference between the mean reward of an optimal arm multiplied by the operating time horizon and the expected reward earned by the algorithm. Clearly, regret is a much finer metric for judging the performance of a learning algorithm than the long-term average reward. (Lai and Robbins, 1985b) showed that the optimal regret scales asymptotically as the logarithm of the time horizon, and also developed an "Upper Confidence Bound" (UCB) algorithm that assigns an index to each arm and then plays the arm with the highest value of the index. This index is an optimistic estimate, i.e. a high-probability upper bound, of the true (but unknown) mean reward of that arm. Since the work (Lai and Robbins, 1985b), regret has become the standard criterion for characterizing the performance of RL algorithms; examples include UCRL (Auer and Ortner, 2007), UCRL2 (Jaksch et al., 2010), and Thompson Sampling (Brafman and Tennenholtz, 2002; Agrawal and Goyal, 2012; Ouyang et al., 2017b).

Though the seminal works on RBMLE showed that it is optimal with respect to the long-term average reward, a finite-time analysis of its performance on various learning tasks, i.e. of its learning regret, had long been overdue. Recently, several works have addressed this issue. For example, (Liu et al., 2020) analyzes RBMLE for the multi-armed bandit problem and shows that it enjoys a regret comparable to state-of-the-art algorithms such as UCB and Thompson Sampling, while its empirical performance is much better. Similar results were obtained when RBMLE was used to optimize rewards in the case of linear bandits (Hung et al., 2020). The case of RL for discrete MDPs was treated in (Mete et al., 2021). In this work, we analyze the RBMLE algorithm in the context of RL for LQG control systems.

1.1 Previous Works

The seminal work on adaptive LQG control (Kumar, 1983a) showed that the RBMLE algorithm attains the optimal infinite-horizon average cost asymptotically by utilizing reward-biased estimates of the system parameters in order to generate controls. However, it assumed that the unknown parameters can assume values from only a discrete set. This assumption was removed in (Campi and Kumar, 1998a; Prandini and Campi, 2000). Meanwhile, works such as (Lai and Wei, 1982, 1987; Chen and Guo, 1987) used forced exploration techniques in order to ensure that the algorithm does not suffer from the problem of insufficient exploration, i.e. that the asymptotic estimates of the unknown parameters converge to their true value. These are somewhat similar in spirit to the $\epsilon$-greedy learning algorithm (Sutton and Barto, 2018; Auer et al., 2002). More recently, inspired by the RBMLE principle, the work (Abbasi-Yadkori and Szepesvári, 2011) uses the principle of optimism in the face of uncertainty (OFU) for adaptive LQG control, and derives bounds on its learning regret by building upon the techniques introduced in (Campi and Kumar, 1998b). The algorithm proposed therein is "UCB-type," i.e. it optimizes the optimal average cost $J(\theta)$ over a high-probability confidence ball to get an optimistic estimate of $\theta^\star$, and is shown to have a regret that is bounded as $\tilde{\mathcal{O}}(\sqrt{T})$. (Ibrahimi et al., 2012) also uses the OFU principle to address the adaptive LQG control problem. More recently, (Dean et al., 2020b) studies a related "sample complexity" problem associated with LQG control, i.e. it quantifies the number of samples required in order to generate a sufficiently precise estimate of the parameters $(A, B)$. However, it does this under the assumption that the controls are generated in an i.i.d. manner (which is somewhat equivalent to forced exploration). An alternative to the OFU principle for designing learning algorithms is Thompson sampling (Thompson, 1933), or posterior sampling. The works (Abbasi-Yadkori and Szepesvári, 2015; Ouyang et al., 2017a) considered the adaptive LQG control problem in the Bayesian setup, and showed that the expected regret of Thompson sampling is $\tilde{\mathcal{O}}(\sqrt{T})$.

Key Challenges:

Several critical questions need to be answered in order to tailor the RBMLE to LQG control.

  • The average reward optimality proved in prior RBMLE studies for LQG control (Kumar, 1983a; Campi and Kumar, 1998a, b; Prandini and Campi, 2000) is a gross measure which only implies that the regret (defined below) is $o(T)$, while a much stronger finite-time order-optimal regret such as that of OFU (Abbasi-Yadkori and Szepesvári, 2011) or Thompson Sampling (Abbasi-Yadkori and Szepesvári, 2015; Ouyang et al., 2017a) (both of these regrets are $\tilde{\mathcal{O}}(\sqrt{T})$) is desired. What is the regret bound of RBMLE algorithms for adaptive LQG control?

  • The options for the bias function $f$ and the weight $\alpha_t$ described in (1) are very broad, and not all of them lead to order-optimality of regret. How should one choose them in order to attain order-optimality?

  • What are the advantages of RBMLE algorithms as compared with the existing methods?

We make several contributions.

  • We perform a finite-time performance analysis of the RBMLE algorithm for the setup of the adaptive LQR problem. We show that its regret is $\tilde{\mathcal{O}}(\sqrt{T})$ and is competitive with the OFU algorithm of (Abbasi-Yadkori and Szepesvári, 2011).

  • We reveal the general interplay between the choice of the bias term (the function $f$ and the weight $\alpha_t$) and the regret bound.

  • We evaluate the empirical performance of RBMLE algorithms in extensive experiments. They demonstrate competitive performance in regret as well as scalability against the current best policies.


2 Problem Formulation

Consider the following linear system (Kumar and Varaiya, 2015),

$x_{t+1} = A x_t + B u_t + w_{t+1},$ (2)

where $x_t \in \mathbb{R}^n$ and $u_t \in \mathbb{R}^m$ are the state and the control input respectively at time $t$, and $w_{t+1} \in \mathbb{R}^n$ represents the noise in the system. The system parameters, represented by the matrices $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times m}$, are unknown.

The system incurs a cost at time $t$ given by,

$c_t = x_t^\top Q x_t + u_t^\top R u_t,$ (3)

where $Q$ and $R$ are known positive semi-definite matrices. The sample-path average cost incurred by the controller that chooses the inputs $\{u_t\}$ is:

$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} c_t.$ (4)

Denote the history of states and inputs until time $t$ by

$\mathcal{F}_t := \sigma\big(x_0, u_0, x_1, u_1, \ldots, u_{t-1}, x_t\big).$ (5)

Define $z_t := [x_t^\top, u_t^\top]^\top$ and let $\theta^\star := [A, B]^\top$ be the unknown system parameter. Then, the linear system (2) can equivalently be written as

$x_{t+1} = \theta^{\star\top} z_t + w_{t+1}.$ (6)

2.1 Assumptions

We make the following assumptions regarding the system in (2). These have been used in earlier works on adaptive LQG control (Abbasi-Yadkori and Szepesvári, 2011; Campi and Kumar, 1998b).

Assumption 1.

The system $(A, B)$ is controllable and observable (Kailath, 1980), and there exists a known positive constant $S$ such that $\|\theta^\star\| \le S$, where $\theta^\star = [A, B]^\top$.

Assumption 2.

We assume that the system noise $w_t$ is component-wise sub-Gaussian; thus, there is a constant $\sigma > 0$ such that for every $\gamma \in \mathbb{R}$ and every component $i$,

$\mathbb{E}\big[\exp(\gamma\, w_{t+1, i}) \mid \mathcal{F}_t\big] \le \exp\big(\gamma^2 \sigma^2 / 2\big).$ (7)

Moreover, $\{w_t\}$ has the following properties.

  1. $\{w_t\}$ is a martingale difference sequence with respect to the filtration $\{\mathcal{F}_t\}$.

2.2 Controller Design for a Known LQG Control System

Consider the following linear system,

$x_{t+1} = A x_t + B u_t + w_{t+1},$ (8)

where the parameter $\theta = [A, B]^\top$ is known. Then the following facts are true (Kumar and Varaiya, 2015). There is a unique positive definite matrix $P(\theta)$ that satisfies the Riccati equation

$P(\theta) = Q + A^\top P(\theta) A - A^\top P(\theta) B \big(R + B^\top P(\theta) B\big)^{-1} B^\top P(\theta) A.$ (9)

The optimal control law for this system which minimizes the average cost (4) is given by $u_t = K(\theta)\, x_t$, where the "gain matrix" is

$K(\theta) = -\big(R + B^\top P(\theta) B\big)^{-1} B^\top P(\theta) A.$ (10)

The optimal average cost is

$J(\theta) = \mathrm{tr}\big(P(\theta)\, \Sigma_w\big),$ (11)

where $\Sigma_w := \mathbb{E}\big[w_{t+1} w_{t+1}^\top\big]$ denotes the noise covariance.
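For illustration, the facts in (8)–(11) can be computed numerically as in the following sketch (ours, not from the paper), which assumes i.i.d. noise with covariance $\Sigma_w$ and uses SciPy's discrete algebraic Riccati solver.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_average_cost(A, B, Q, R, Sigma_w):
    """Solve the Riccati equation (9), return the gain K in u = K x (10)
    and the optimal average cost tr(P Sigma_w) (11). Illustrative sketch only."""
    P = solve_discrete_are(A, B, Q, R)                  # unique stabilizing solution of the DARE
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # gain matrix, cf. (10)
    J = np.trace(P @ Sigma_w)                           # optimal average cost, cf. (11)
    return P, K, J

# Example: a scalar system x_{t+1} = 1.1 x_t + u_t + w_{t+1} with unit-variance noise.
A, B = np.array([[1.1]]), np.array([[1.0]])
Q, R, Sigma_w = np.eye(1), np.eye(1), np.eye(1)
P, K, J = lqr_average_cost(A, B, Q, R, Sigma_w)
print(P, K, J)
```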

As a consequence of Assumption 1, the parameter set is bounded; hence one can show that $P(\theta)$ is bounded as well. Let $D$ be a constant such that

$\sup_{\theta} \|P(\theta)\| \le D.$ (12)
Assumption 3.

For all , we have

  1. , therefore there exists a positive constant such that,

    (13)
  2. There exists a positive constant such that,

    (14)

2.3 Construction of Confidence Interval

In this section, given the history $\mathcal{F}_t$, we construct a "high-probability confidence ball," i.e., a set of plausible system parameters that contains the true parameter $\theta^\star$ with high probability.

For $t \ge 0$, define

$V_t := \lambda I + \sum_{s=0}^{t-1} z_s z_s^\top,$ (15)

where $I$ is the identity matrix of dimension $(n+m) \times (n+m)$ and $\lambda > 0$ is a regularization parameter. The squared error of a candidate parameter $\theta$ is given as

$e_t(\theta) := \sum_{s=0}^{t-1} \big\| x_{s+1} - \theta^\top z_s \big\|^2.$ (16)

Then the $\lambda$-regularized squared error with regularization parameter $\lambda$ is given by:

$e_t^{\lambda}(\theta) := \lambda\, \mathrm{tr}\big(\theta^\top \theta\big) + \sum_{s=0}^{t-1} \big\| x_{s+1} - \theta^\top z_s \big\|^2.$ (17)

Let $\hat\theta_t$ be the regularized least-squares estimate of $\theta^\star$, i.e.

$\hat\theta_t := \arg\min_{\theta}\, e_t^{\lambda}(\theta) = V_t^{-1} \sum_{s=0}^{t-1} z_s x_{s+1}^\top.$ (18)

Let $\mathcal{C}_t(\delta)$ be the following confidence ball around $\hat\theta_t$,

$\mathcal{C}_t(\delta) := \Big\{ \theta : \mathrm{tr}\big( (\theta - \hat\theta_t)^\top V_t\, (\theta - \hat\theta_t) \big) \le \beta_t(\delta) \Big\},$

where the radius $\beta_t(\delta)$ is given by

(19)
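Numerically, the estimate in (18) is an ordinary ridge regression on the regressors $z_s$. The sketch below (ours) computes $\hat\theta_t$ and tests membership in a ball of the form above for a user-supplied radius `beta`, which is only a placeholder for the actual data-dependent confidence radius $\beta_t(\delta)$.

```python
import numpy as np

def regularized_ls(Z, X_next, lam=1.0):
    """Ridge estimate of theta in x_{t+1} = theta^T z_t + w_{t+1}, cf. (18).
    Z: (t, n+m) array whose rows are z_s; X_next: (t, n) array of next states."""
    d = Z.shape[1]
    V = lam * np.eye(d) + Z.T @ Z                 # Gram matrix V_t, cf. (15)
    theta_hat = np.linalg.solve(V, Z.T @ X_next)  # regularized least-squares estimate
    return theta_hat, V

def in_confidence_ball(theta, theta_hat, V, beta):
    """Check tr((theta - theta_hat)^T V (theta - theta_hat)) <= beta, cf. (19).
    `beta` is a placeholder; the paper's radius beta_t(delta) is data-dependent."""
    diff = theta - theta_hat
    return np.trace(diff.T @ V @ diff) <= beta
```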

Next, we show that $\theta^\star$ lies within the ball $\mathcal{C}_t(\delta)$ with high probability.

Lemma 2.1.

Define

where,

Let and . Then,

(20)
Proof.

See Lemma 4 in (Abbasi-Yadkori and Szepesvári, 2011). ∎

Lemma 2.2.

On the event defined in Lemma 2.1, the following bound holds,

where the constants involved are problem-dependent.

Proof.

See Lemma 5 in (Abbasi-Yadkori and Szepesvári, 2011). ∎

3 RBMLE for LQG Systems

We employ a version of the RBMLE algorithm that proceeds in an episodic manner. Let $t_k$ denote the starting time of the $k$-th episode, which lasts until the algorithm's switching criterion triggers the start of the next episode. At time $t_k$, the algorithm solves the following optimization problem:

(21)

where $\alpha_{t_k}$ is the multiplicative pre-factor associated with the cost-bias term. Thus, $\tilde\theta_k$ is the cost-biased estimate of $\theta^\star$ given the operational history of the system. The algorithm implements the following control policy during episode $k$:

  Initialize:
  for  do
     Set and
     while  do
        
        
         and
     end while
  end for
Algorithm 1 RBMLE

We note that the key difference between RBMLE and OFU lies in the objective function that is optimized at the beginning of an episode. While OFU minimizes the optimal average cost $J(\theta)$ over the confidence ball $\mathcal{C}_{t_k}(\delta)$, RBMLE instead minimizes, over the same ball, the objective in (21), which augments $J(\theta)$ with a (suitably weighted) likelihood term. This second term in the objective function is responsible for the better empirical performance of the RBMLE algorithm, which is shown in detail in Section 5. It is an interesting open question to connect this term to a better regret bound for RBMLE.
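To make the episodic structure concrete, here is a schematic sketch in Python (ours, not the paper's implementation). It assumes a determinant-doubling switching rule like the one used by OFU in (Abbasi-Yadkori and Szepesvári, 2011), a bias schedule $\alpha_t = \sqrt{t}$, and a black-box routine `solve_biased_estimate` that minimizes an objective of the form $\alpha\, J(\theta) + e_t^{\lambda}(\theta)$ over the confidence ball; all three are our assumptions, since the corresponding details are not reproduced above.

```python
import numpy as np

def rbmle_loop(step, lqr_gain, solve_biased_estimate, n, m, horizon,
               lam=1.0, warmup=10, rng=np.random.default_rng(0)):
    """Schematic episodic RBMLE loop (a sketch, not the paper's implementation).
    `step(x, u)` returns the next state of the true (unknown) system;
    `lqr_gain(theta)` returns the optimal gain K(theta);
    `solve_biased_estimate(theta_hat, V, alpha)` is assumed to return the
    minimizer of the biased objective over the confidence ball."""
    x = np.zeros(n)
    Z, Xn = [], []                                  # regression data (z_s, x_{s+1})
    V = lam * np.eye(n + m)                         # Gram matrix, cf. (15)
    V_start = V.copy()
    theta_tilde = None
    for t in range(horizon):
        if t < warmup:
            u = rng.standard_normal(m)              # random inputs to initialize estimates
        else:
            if theta_tilde is None or np.linalg.det(V) > 2 * np.linalg.det(V_start):
                # assumed determinant-doubling rule: start a new episode
                theta_hat = np.linalg.solve(V, np.asarray(Z).T @ np.asarray(Xn))
                theta_tilde = solve_biased_estimate(theta_hat, V, alpha=np.sqrt(t))
                V_start = V.copy()
            u = lqr_gain(theta_tilde) @ x           # play the optimal gain for theta_tilde
        x_next = step(x, u)
        z = np.concatenate([x, u])
        Z.append(z); Xn.append(x_next)
        V += np.outer(z, z)
        x = x_next
    return theta_tilde
```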

We note that this is a slightly modified version of the RBMLE algorithm that was proposed in (Campi and Kumar, 1998b) for LQG systems. The algorithm of (Campi and Kumar, 1998b) does not explicitly maintain a confidence set. Instead, it chooses its estimate as the minimizer of the objective in (21) over the entire parameter set rather than over the confidence ball, i.e.

(22)

In the next couple of sections, we provide guarantees on the performance of RBMLE in terms of an upper bound on its regret, and also demonstrate that its empirical performance is better than that of other popular algorithms.

4 Regret Analysis

In this section, we focus on the theoretical analysis of the RBMLE algorithm that was discussed in Section 3. More specifically, we derive an upper bound on the regret of the RBMLE algorithm. The regret of a learning algorithm for LQG systems is defined as (Abbasi-Yadkori and Szepesvári, 2011):

$\mathcal{R}(T) := \sum_{t=0}^{T} \big( c_t - J(\theta^\star) \big).$ (23)

The notion of regret was first introduced in (Lai and Robbins, 1985a) in the context of the multi-armed bandit problem. It has been widely used for characterizing the finite-time performance of algorithms that make sequential decisions (Lattimore and Szepesvári, 2020; Auer et al., 2002). Note that minimizing the regret is equivalent to minimizing the cumulative cost.

The following result allows us to decompose the regret into easy-to-analyze terms.

Lemma 4.1.

The regret of a learning algorithm which implements the episodic control policy of Algorithm 1 is upper bounded as follows:

(24)

where the four terms on the right-hand side are bounded individually in Sections 4.1–4.4 below.

Proof.

See Appendix A. ∎
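For intuition, the standard OFU-style analysis (cf. Abbasi-Yadkori and Szepesvári, 2011; Ouyang et al., 2017a) yields a decomposition of roughly the following shape, with $\tilde\theta_t$ the estimate in force at time $t$ and boundary terms absorbed into $\lesssim$. We record it only as a guide to the four bounds of Sections 4.1–4.4; the exact grouping used in (24) may differ.

```latex
% Schematic regret decomposition (not verbatim from the paper): apply the
% average-cost Bellman equation at \tilde\theta_t, use x_{t+1} = \theta^{\star\top} z_t + w_{t+1},
% and sum over t.
\mathcal{R}(T) \;\lesssim\;
  \underbrace{\sum_{t=0}^{T} \Big( J(\tilde\theta_t) - J(\theta^\star) \Big)}_{\text{optimality gap of the estimates}}
  \;+\; \underbrace{\sum_{t=1}^{T} x_t^\top \Big( P(\tilde\theta_t) - P(\tilde\theta_{t-1}) \Big) x_t}_{\text{policy-switching term}}
  \;+\; \underbrace{\sum_{t=0}^{T} \Big( x_{t+1}^\top P(\tilde\theta_t) x_{t+1}
        - \mathbb{E}\big[ x_{t+1}^\top P(\tilde\theta_t) x_{t+1} \mid \mathcal{F}_t \big] \Big)}_{\text{martingale term}}
  \;+\; \underbrace{\sum_{t=0}^{T} \Big( (\theta^{\star\top} z_t)^\top P(\tilde\theta_t)\, (\theta^{\star\top} z_t)
        - (\tilde\theta_t^\top z_t)^\top P(\tilde\theta_t)\, (\tilde\theta_t^\top z_t) \Big)}_{\text{prediction-error term}}.
```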

Next, we provide individual bounds on each of these four terms on the event defined in Lemma 2.1.

4.1 Bound on the First Term

Theorem 4.2.

On the event defined in Lemma 2.1, the first term is bounded as follows,

where the constants involved are problem-dependent.

Proof.

The proof is the same as that of Lemma 7 in (Abbasi-Yadkori and Szepesvári, 2011). ∎

4.2 Bound on the Second Term

Theorem 4.3.

On the event defined in Lemma 2.1, the second term is bounded as follows,

where the relevant constant is the one defined in Lemma 2.2.

Proof.

As defined in Lemma 4.1,

Each term in the summation is non-zero only at the end of an episode, when there is a policy change. Since the number of episodes up to time $T$ is small relative to $T$, there are only that many non-zero terms, and each of them is bounded by a constant. ∎

4.3 Bound on the Third Term

Theorem 4.4.

On the event defined in Lemma 2.1, the third term is bounded as follows,

where the confidence radius $\beta_t(\delta)$ is as defined in (19).

Proof.

The proof is the same as that of Lemma 13 in (Abbasi-Yadkori and Szepesvári, 2011). ∎

4.4 Bound on the Fourth Term

Theorem 4.5.

On the event defined in Lemma 2.1, the fourth term is bounded as follows.

Proof.

See Appendix B. ∎

Now that we have established upper bounds on each of the four terms, we can use these in conjunction with the decomposition result of Lemma 4.1 in order to bound the regret of RBMLE as follows.

Theorem 4.6.

For any $0 < \delta < 1$ and horizon $T$, the regret of the RBMLE Algorithm is upper bounded, with probability at least $1 - \delta$, as follows:

$\mathcal{R}(T) = \tilde{\mathcal{O}}\big(\sqrt{T}\big),$ (25)

where problem-dependent constants and terms logarithmic in $T$ are hidden.

Proof.

See Appendix C. ∎

5 Empirical Performance

In this section, we evaluate the empirical performance of RBMLE and compare it with the existing popular learning algorithms such as OFU (Abbasi-Yadkori and Szepesvári, 2011) and non-Bayesian Thompson Sampling (Abeille and Lazaric, 2017).

Our experimental setup is as follows. We initialize the system state and then use the first 10 time steps, with randomly generated inputs, to initialize the estimates.

We ensure that all the algorithms utilize the same realization of initial random inputs, and the same external noise at each time $t$. More details on the implementation of OFU and TS can be found in (Dean et al., 2018). OFU is implemented using a heuristic projected gradient descent approach. Since the RBMLE algorithm differs from OFU only in that it optimizes a slightly different objective function, its implementation is similar.

We plot the averaged regret, along with the standard deviation, for a time horizon of $T$ steps. Each experiment is performed over 200 iterations. We carry out simulation studies for the following systems, starting with a simple example in which the system state and control input are both scalars. In Figures 1 and 2, we compare the performance of RBMLE with OFU and TS. It is evident that RBMLE has much lower average regret compared to OFU and TS.

  1. Unstable Laplacian System: This system represents marginally unstable Laplacian dynamics, and has been previously studied in (Dean et al., 2018; Abbasi-Yadkori et al., 2019; Dean et al., 2020a; Tu and Recht, 2018). The system matrices are given as follows.

  2. Chained Integrator Dynamics: In this example, we consider a chained integrator system with 2-dimensional states and 2-dimensional input.

    (26)
Figure 1: Unstable Laplacian System (Regret Comparison)
Figure 2: Chained Integrator Dynamics (Regret Comparison)
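To illustrate the experimental protocol (a fixed noise realization per run, regret accumulated against $J(\theta^\star)$, results averaged over independent runs), the following sketch (ours) simulates one system and accumulates $\sum_t (c_t - J(\theta^\star))$ for a given control policy. The $3\times 3$ marginally unstable Laplacian matrices below are a stand-in based on (Dean et al., 2018), since the exact matrices and cost weights are not reproduced above.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Stand-in for the marginally unstable Laplacian system of Dean et al. (2018).
A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])
B = np.eye(3)
Q, R = np.eye(3), np.eye(3)            # illustrative cost weights (assumed)

P_star = solve_discrete_are(A, B, Q, R)
J_star = np.trace(P_star)              # optimal average cost for unit noise covariance, cf. (11)

def run_regret(policy, T=1000, seed=0):
    """Simulate the system under `policy(x, t)` and accumulate regret w.r.t. J_star."""
    rng = np.random.default_rng(seed)
    x, regret = np.zeros(3), 0.0
    for t in range(T):
        u = policy(x, t)
        cost = x @ Q @ x + u @ R @ u
        regret += cost - J_star
        x = A @ x + B @ u + rng.standard_normal(3)
    return regret

# Example: the oracle controller that knows (A, B); a learning algorithm would be
# plugged in here in place of the fixed gain.
K_star = -np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A)
print(run_regret(lambda x, t: K_star @ x))
```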

6 Concluding Remarks

We proposed the RBMLE algorithm for solving the adaptive linear quadratic control problem. It is an alternative to the OFU approach of (Abbasi-Yadkori and Szepesvári, 2011), and generates a reward-biased estimate of the unknown parameter by optimizing an objective function which is a modification of the log-likelihood function, one in which a bias term is added to give a higher preference to those parameter values that generate higher average rewards. The key difference between RBMLE and OFU is that they optimize different objective functions while generating an estimate of the underlying system parameter. OFU optimizes the optimal average cost $J(\theta)$ alone, while RBMLE additionally includes a term that is representative of the log-likelihood of the observations given that $\theta$ is the true parameter value. This difference is responsible for the better empirical performance of the RBMLE algorithm, and it remains to be seen in future work whether this additional term yields a better regret bound for RBMLE over OFU or Thompson Sampling.

We establish bounds on its regret, and show that it is competitive with OFU. We show, via extensive simulations performed for a broad range of systems, that the empirical performance of RBMLE is much better than that of OFU and Thompson Sampling. Along with recent works on RBMLE for MDPs (Mete et al., 2021), stochastic multi-armed bandits (Liu et al., 2020) and linear stochastic bandits (Hung et al., 2020), this work on LQG substantiates the case for RBMLE as a promising alternative to the much-celebrated Upper Confidence Bound and Thompson Sampling approaches for model-based RL problems. Hence, an important topic for future research is analyzing RBMLE in important settings such as RL for constrained MDPs (Altman, 1999), RL for continuous spaces (Ortner and Ryabko, 2013), and Bayesian optimization. Developing efficient computational procedures for RBMLE for MDPs is another challenging open problem.

References

  • Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvári (2019) Model-free linear quadratic control via reduction to expert prediction. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3108–3117. Cited by: item 1.
  • Y. Abbasi-Yadkori and C. Szepesvári (2011) Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 1–26. Cited by: Augmented RBMLE-UCB Approach for Adaptive Control of Linear Quadratic Systems, 1st item, 1st item, §1.1, §2.1, §2.3, Lemma 2.2, §4.1, §4.3, §4, §5, §6.
  • Y. Abbasi-Yadkori and C. Szepesvári (2015) Bayesian optimal control of smoothly parameterized systems.. In UAI, pp. 1–11. Cited by: 1st item, §1.1.
  • M. Abeille and A. Lazaric (2017) Thompson sampling for linear-quadratic control problems. In Artificial Intelligence and Statistics, pp. 1246–1254. Cited by: §5.
  • R. Agrawal (1995) Sample mean based index policies with o(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27 (4), pp. 1054–1078. External Links: ISSN 00018678, Link Cited by: §1.
  • S. Agrawal and N. Goyal (2012) Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pp. 39–1. Cited by: §1.
  • E. Altman (1999) Constrained markov decision processes. Vol. 7, CRC Press. Cited by: §6.
  • K. J. Åström (2012) Introduction to stochastic control theory. Courier Corporation. Cited by: §1.
  • P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine learning 47 (2), pp. 235–256. Cited by: §1.1, §1, §1, §4.
  • P. Auer and R. Ortner (2007) Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in neural information processing systems, pp. 49–56. Cited by: §1.
  • D. A. Berry and B. Fristedt (1985) Bandit problems: sequential allocation of experiments (monographs on statistics and applied probability). London: Chapman and Hall 5 (71-87), pp. 7–7. Cited by: §1.
  • S. Bittanti, M. C. Campi, et al. (2006) Adaptive control of linear time invariant systems: the “bet on the best” principle. Communications in Information & Systems 6 (4), pp. 299–320. Cited by: §1.
  • V. Borkar and P. Varaiya (1979) Adaptive control of markov chains, i: finite parameter set. IEEE Transactions on Automatic Control 24 (6), pp. 953–957. Cited by: §1.
  • V. Borkar (1990) The kumar-becker-lin scheme revisited. Journal of optimization theory and applications 66 (2), pp. 289–309. Cited by: §1.
  • V. Borkar (1991) Self-tuning control of diffusions without the identifiability condition. Journal of optimization theory and applications 68 (1), pp. 117–138. Cited by: §1.
  • R. I. Brafman and M. Tennenholtz (2002) R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3 (Oct), pp. 213–231. Cited by: §1.
  • M. C. Campi and P. R. Kumar (1998a) Adaptive linear quadratic gaussian control: the cost-biased approach revisited. SIAM Journal on Control and Optimization 36 (6), pp. 1890–1907. Cited by: 1st item, §1.1, §1.
  • M. C. Campi and P. Kumar (1998b) Adaptive linear quadratic gaussian control: the cost-biased approach revisited. SIAM Journal on Control and Optimization 36 (6), pp. 1890–1907. Cited by: 1st item, §1.1, §1, §2.1, §3.
  • H. Chen and L. Guo (1987) Optimal adaptive control and consistent parameter estimates for armax model with quadratic cost. SIAM Journal on Control and Optimization 25 (4), pp. 845–867. Cited by: §1.1.
  • S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu (2018) Regret bounds for robust adaptive control of the linear quadratic regulator. arXiv preprint arXiv:1805.09388. Cited by: item 1, §5.
  • S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu (2020a) On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics 20 (4), pp. 633–679. Cited by: item 1.
  • S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu (2020b) On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics 20 (4), pp. 633–679. Cited by: §1.1.
  • T. Duncan, B. Pasik-Duncan, and L. Stettner (1994) Almost self-optimizing strategies for the adaptive control of diffusion processes. Journal of optimization theory and applications 81 (3), pp. 479–507. Cited by: §1.
  • A. A. Feldbaum (1960) Dual control theory. i. Avtomatika i Telemekhanika 21 (9), pp. 1240–1249. Cited by: §1.
  • J. Gittins, K. Glazebrook, and R. Weber (2011) Multi-armed bandit allocation indices. John Wiley & Sons. Cited by: §1.
  • Y. Hung, P. Hsieh, X. Liu, and P. Kumar (2020) Reward-biased maximum likelihood estimation for linear stochastic bandits. arXiv preprint arXiv:2010.04091. Cited by: §1, §6.
  • M. Ibrahimi, A. Javanmard, and B. Van Roy (2012) Efficient reinforcement learning for high dimensional linear quadratic systems.. In NIPS, pp. 2645–2653. Cited by: §1.1.
  • T. Jaksch, R. Ortner, and P. Auer (2010) Near-optimal regret bounds for reinforcement learning.. Journal of Machine Learning Research 11 (4). Cited by: §1.
  • T. Kailath (1980) Linear systems. Vol. 156, Prentice-Hall Englewood Cliffs, NJ. Cited by: Assumption 1.
  • P. Kumar and W. Lin (1982) Optimal adaptive controllers for unknown markov chains. IEEE Transactions on Automatic Control 27 (4), pp. 765–774. Cited by: §1, §1.
  • P. R. Kumar and A. Becker (1982) A new family of optimal adaptive controllers for markov chains. IEEE Transactions on Automatic Control 27 (1), pp. 137–146. Cited by: Augmented RBMLE-UCB Approach for Adaptive Control of Linear Quadratic Systems, §1, §1, §1, §1, §1, §1.
  • P. R. Kumar (1983a) Optimal adaptive control of linear-quadratic-gaussian systems. SIAM Journal on Control and Optimization 21 (2), pp. 163–178. Cited by: 1st item, §1.1, §1, §1.
  • P. Kumar (1983b) Simultaneous identification and adaptive control of unknown systems over finite parameter sets. IEEE Transactions on Automatic Control 28 (1), pp. 68–76. Cited by: §1.
  • P. R. Kumar and P. Varaiya (2015) Stochastic systems: estimation, identification, and adaptive control. SIAM. Cited by: §2.2, §2.
  • P. Kumar (1982) Adaptive control with a compact parameter set. SIAM Journal on Control and Optimization 20 (1), pp. 9–13. Cited by: §1, §1.
  • T. L. Lai and H. Robbins (1985a) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 (1), pp. 4–22. Cited by: Augmented RBMLE-UCB Approach for Adaptive Control of Linear Quadratic Systems, §1, §1, §4.
  • T. L. Lai and H. Robbins (1985b) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 (1), pp. 4–22. Cited by: §1.
  • T. L. Lai and C. Z. Wei (1982) Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics 10 (1), pp. 154–166. Cited by: §1.1.
  • T. L. Lai and C. Wei (1987) Asymptotically efficient self-tuning regulators. SIAM Journal on Control and Optimization 25 (2), pp. 466–481. Cited by: §1.1.
  • T. Lattimore and C. Szepesvári (2020) Bandit algorithms. Cambridge University Press. Cited by: §1, §1, §1, §4.
  • X. Liu, P. Hsieh, Y. H. Hung, A. Bhattacharya, and P. Kumar (2020) Exploration through reward biasing: reward-biased maximum likelihood estimation for stochastic multi-armed bandits. In International Conference on Machine Learning, pp. 6248–6258. Cited by: §1, §6.
  • P. Mandl (1974) Estimation and control in markov chains. Advances in Applied Probability 6 (1), pp. 40–60. Cited by: §1, §1.
  • A. Mete, R. Singh, X. Liu, and P. R. Kumar (2021) Reward biased maximum likelihood estimation for reinforcement learning. In Learning for Dynamics and Control, pp. 815–827. Cited by: §1, §1, §6.
  • R. Ortner and D. Ryabko (2013) Online regret bounds for undiscounted continuous reinforcement learning. arXiv preprint arXiv:1302.2550. Cited by: §6.
  • Y. Ouyang, M. Gagrani, and R. Jain (2017a) Control of unknown linear systems with thompson sampling. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1198–1205. Cited by: 1st item, §1.1.
  • Y. Ouyang, M. Gagrani, A. Nayyar, and R. Jain (2017b) Learning unknown markov decision processes: a thompson sampling approach. arXiv preprint arXiv:1709.04570. Cited by: §1.
  • M. Prandini and M. C. Campi (2000) Adaptive lqg control of input-output systems—a cost-biased approach. SIAM Journal on Control and Optimization 39 (5), pp. 1499–1519. Cited by: 1st item, §1.1, §1.
  • Ł. Stettner (1993) On nearly self-optimizing strategies for a discrete-time uniformly ergodic adaptive model. Applied Mathematics and Optimization 27 (2), pp. 161–177. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1.1, §1.
  • W. R. Thompson (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294. Cited by: §1.1.
  • S. Tu and B. Recht (2018) Least-squares temporal difference learning for the linear quadratic regulator. In International Conference on Machine Learning, pp. 5005–5014. Cited by: item 1.
  • B. Wittenmark (1995) Adaptive dual control methods: an overview. Adaptive Systems in Control and Signal Processing 1995, pp. 67–72. Cited by: §1.

Appendix A Proof of Lemma 4.1

Proof.

The average-cost Bellman optimality equation for the LQ problem can be written as follows, where $P(\theta)$ is the solution of the Riccati equation (9) and $J(\theta)$ is the optimal average cost in (11).
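A standard form of this equation (our reconstruction, not copied from the paper; cf. Kumar and Varaiya, 2015) is:

```latex
% Average-cost Bellman optimality equation for the LQ system with parameter
% \theta = [A, B]^\top; the minimum is attained at u = K(\theta) x.
J(\theta) + x^\top P(\theta)\, x
  \;=\; \min_{u} \Big\{ x^\top Q x + u^\top R u
  \;+\; \mathbb{E}\big[ (A x + B u + w)^\top P(\theta)\, (A x + B u + w) \big] \Big\}.
```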

(27)
(28)

Substituting, and then using the martingale property of the noise (Assumption 2), we get

(29)
(30)

Subtracting $J(\theta^\star)$ from both sides and summing over $t$, we obtain the claimed decomposition. ∎

Appendix B Proof of Theorem 4.5

Proof.

Let $t_k$ be the time of the beginning of episode $k$. Then, during this episode, the RBMLE algorithm implements the policy

(31)

Note that, on the event under consideration, $\theta^\star$ lies in the confidence ball, so the minimizing property of the biased estimate in (21) allows $J(\tilde\theta_k)$ to be compared with $J(\theta^\star)$. Therefore, using Lemma 4.1, we have

Appendix C Proof of Theorem 4.6

Proof.

As shown in Lemma 4.1, the regret is upper bounded by the decomposition

(32)

On the event defined in Lemma 2.1, we substitute the individual bounds on the four terms to get the following bound on the regret:

Since this event occurs with probability at least $1 - \delta$ (Lemma 2.1), we get the desired result. ∎