1 Introduction
Consider a linear dynamical system (Åström, 2012) in which the system evolves as $x_{t+1} = A x_t + B u_t + w_{t+1}$, where $x_t$ is the system state at time $t$, $u_t$ is the control applied at time $t$ by an "agent" (Sutton and Barto, 2018), and $w_{t+1}$ is i.i.d. Gaussian noise. The matrices $A, B$ describe the system dynamics. The instantaneous cost incurred at time $t$ is a quadratic function of the current state and control. The goal of the agent is to minimize the expected value of the cumulative cost incurred during $T$ steps. This objective serves the purpose of keeping the system state close to the origin while using minimal control energy. The controller/agent has to attain this goal without knowing the parameter $\Theta^* = [A, B]$. Let $J(\Theta^*)$ denote the optimal average LQG cost when the true parameter is equal to $\Theta^*$ and is known to the agent. The goal is to choose the controls $u_t$ adaptively, on the basis of the information collected during system operation, so as to minimize the expected "regret" (Lattimore and Szepesvári, 2020), i.e., the expected excess of the cumulative cost over $T\, J(\Theta^*)$.
This is called the adaptive LQG control problem.
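To make the objective concrete, the following sketch simulates a scalar linear system under a fixed stabilizing linear feedback and estimates its long-run average quadratic cost. The numerical values of $A$, $B$, $Q$, $R$ and the gain are illustrative assumptions, not values from this paper.

```python
import numpy as np

# Hypothetical scalar example (A, B, Q, R, and the gain k are
# illustrative choices, not taken from the paper).
rng = np.random.default_rng(0)
A, B, Q, R = 1.0, 1.0, 1.0, 1.0
k = -0.5          # linear feedback u_t = k * x_t; A + B*k = 0.5 is stable
T = 10_000

x, total_cost = 0.0, 0.0
for _ in range(T):
    u = k * x
    total_cost += Q * x**2 + R * u**2      # instantaneous quadratic cost
    x = A * x + B * u + rng.normal()       # x_{t+1} = A x_t + B u_t + w_{t+1}

avg_cost = total_cost / T
print(avg_cost)
```

An adaptive controller must approach the best achievable average cost without knowing $A$ and $B$ in advance.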
In this work we use an approach called "Reward-Biased Maximum Likelihood Estimation" (RBMLE) that was first proposed more than four decades ago for the problem of online learning (Kumar and Becker, 1982; Kumar, 1983a; Mete et al., 2021). We begin with a brief description of RBMLE. Consider a system with state space $X$, an action set $U$, and controlled transition probabilities $p(x' \mid x, u; \theta)$ for $x, x' \in X$ and $u \in U$, dependent on an unknown parameter $\theta$. The goal is to maximize a long-term average reward, where the reward at time $t$ is a function of the state and action at time $t$. The parameter $\theta^*$ is not known; all that is known is that it belongs to a known set $\Theta$. If one knew $\theta^*$, one could take the action $\phi_{\theta^*}(x_t)$ at time $t$, where, for each $\theta \in \Theta$, $\phi_\theta$ denotes a stationary policy that is optimal if the parameter is $\theta$. In the absence of knowledge of $\theta^*$, one could employ a "certainty equivalent" (CE) approach: form an estimate of $\theta^*$ at time $t$, and apply the action in the current state that would be optimal if that estimate were the true parameter. Specifically, let $\hat{\theta}_t$ be the maximum likelihood estimate (MLE) of $\theta^*$ at time $t$. Then, under the certainty equivalent approach, the action taken at time $t$ is $\phi_{\hat{\theta}_t}(x_t)$. Mandl (Mandl, 1974) was the first to analyze this CE approach. Recognizing that CE generally cannot converge to $\theta^*$ unless one can distinguish $\theta^*$ from other $\theta$s irrespective of which action is taken in which state, he showed that under such an identifiability condition $\hat{\theta}_t \to \theta^*$. Borkar and Varaiya (Borkar and Varaiya, 1979) performed a further analysis of the CE rule, and showed that in general one only obtains a certain "closed-loop identifiability" property: if $\theta_\infty$ denotes the limiting value of the ML estimates under the CE rule, then the closed-loop dynamics associated with the (parameter, controller) tuple $(\theta_\infty, \phi_{\theta_\infty})$ are the same as those of $(\theta^*, \phi_{\theta_\infty})$, i.e., $p(x' \mid x, \phi_{\theta_\infty}(x); \theta_\infty) = p(x' \mid x, \phi_{\theta_\infty}(x); \theta^*)$ for all $x, x'$. The problem is that as the parameter estimates begin to converge, exploration ceases, and one ends up identifying only the behavior of the system under the limited actions being applied to it. One misses out on other potentially valuable policies. Since the limiting policy $\phi_{\theta_\infty}$ need not be an optimal long-term average policy for the true system $\theta^*$, the CE rule leads to suboptimal performance.
Indeed this problem goes by various names in different fields: the dual control problem (Feldbaum, 1960; Wittenmark, 1995), the closed-loop identifiability problem (Borkar and Varaiya, 1979; Berry and Fristedt, 1985; Gittins et al., 2011), or the problem of exploration vs. exploitation (Lattimore and Szepesvári, 2020). It was resolved, without resort to forced exploration, in (Kumar and Becker, 1982).
Unless a strong indistinguishability condition holds, the CE rule suffers from the problem of "insufficient exploration": there is a positive probability that the MLE does not converge to $\theta^*$, and hence it asymptotically applies suboptimal controls. It was discovered in (Kumar and Becker, 1982) that the MLE, when used in closed loop, is inherently biased towards parameters for which the optimal reward is less than the true optimal reward, and RBMLE was proposed as a natural solution to overcome this bias. Let $J(\phi, \theta)$ denote the average LQG cost when a stationary control policy $\phi$ is applied to the linear system with parameter $\theta$, and let $J(\theta) := J(\phi_\theta, \theta)$ denote the optimal average cost for parameter $\theta$.
This exploration–exploitation problem (or equivalently the dual control problem) was finally resolved in that work. Their approach is summarized as follows. (Kumar and Becker, 1982) begins by observing that $J(\theta_\infty) \geq J(\theta^*)$. To see why this is true, note that since $(\theta_\infty, \phi_{\theta_\infty})$ and $(\theta^*, \phi_{\theta_\infty})$ induce the same stationary probability distribution on the state space, we have $J(\phi_{\theta_\infty}, \theta_\infty) = J(\phi_{\theta_\infty}, \theta^*)$, and the left-hand side equals $J(\theta_\infty)$ because $\phi_{\theta_\infty}$ is optimal for $\theta_\infty$. However, since $\phi_{\theta_\infty}$ might not be optimal for $\theta^*$, we have $J(\phi_{\theta_\infty}, \theta^*) \geq J(\theta^*)$. Upon combining these, we obtain $J(\theta_\infty) \geq J(\theta^*)$. This relation shows the following fundamental property of the ML estimates when they are also used for generating controls: the estimates are inherently biased towards those parameters that have a higher operating cost than the true parameter value. After making this fundamental observation, (Kumar and Becker, 1982) proposed to counteract this bias by biasing the likelihood objective function in the reverse direction, i.e., by favoring those $\theta$ which yield a larger optimal reward, or equivalently those which have a smaller value of the optimal average cost $J(\theta)$. More specifically, the RBMLE algorithm they developed generates a reward-biased MLE $\theta_t$ by solving the following problem at each time $t$:
(1) $\theta_t \in \arg\min_{\theta \in \Theta} \Big\{ \sum_{s=0}^{t-1} L(\theta; x_s, u_s, x_{s+1}) + \alpha(t)\, f(J(\theta)) \Big\}$,
where $L(\theta; x, u, x')$ denotes the negative log-likelihood associated with the system state transitioning to state $x'$ when control $u$ is applied in state $x$ and the true parameter is $\theta$, $f$ is any strictly increasing function, and $\alpha(t) > 0$ is a biasing parameter that must be chosen so as to satisfy the following two desired properties:
(i) it should be sufficiently large so that asymptotically RBMLE does choose parameters that have a larger optimal reward than under $\theta^*$. If this were not the case, then the resulting learning algorithm would suffer from the same drawbacks as the CE rule of (Mandl, 1974).
(ii) it should not be so large that the ML estimates become asymptotically inconsistent.
They showed that in order for (i) and (ii) to be satisfied, one should let $\alpha(t) \to \infty$ with $\alpha(t) = o(t)$. The resulting RBMLE algorithm was shown to attain the same average cost performance as the case when $\theta^*$ is known, in which case the agent can compute and use the optimal stationary policy. In the parlance of the machine learning community, the limiting normalized regret (Lai and Robbins, 1985a; Lattimore and Szepesvári, 2020; Auer et al., 2002) (normalized by the operating time $T$) vanishes. RBMLE was the first algorithm that did not resort to "forced exploration" in order to solve the adaptive control/RL problem. It attained this by introducing the idea of favoring parameters that yield better performance, which is now known as the principle of "optimism in the face of uncertainty" (Lai and Robbins, 1985a; Agrawal, 1995; Auer et al., 2002). Since its introduction in (Kumar and Becker, 1982), RBMLE has been applied to solve a variety of online learning problems. For example, it has been used to learn optimal policies for more general MDPs (Kumar, 1982; Kumar and Lin, 1982; Borkar, 1990), LQG systems (Kumar, 1983a; Campi and Kumar, 1998a, b; Prandini and Campi, 2000), LTI systems (Bittanti et al., 2006), nonlinear systems (Kumar, 1983b), ergodic problems (Stettner, 1993), and control of diffusions (Borkar, 1991; Duncan et al., 1994). These works only showed that RBMLE yields the optimal long-term average performance.
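The biasing idea can be illustrated on a two-armed Bernoulli bandit with a discrete candidate set. In this sketch the candidate grid, the choice $\alpha(t) = \sqrt{t}$ (which satisfies $\alpha(t) \to \infty$ and $\alpha(t) = o(t)$), and taking $f$ to be the identity are all illustrative assumptions, not specifications from the paper.

```python
import numpy as np

# Minimal RBMLE sketch on a two-armed Bernoulli bandit with a discrete
# parameter set.  The score of each candidate is its log-likelihood
# plus a bias alpha(t) * (optimal reward under that candidate).
rng = np.random.default_rng(1)
true_p = np.array([0.3, 0.7])
grid = [np.array([a, b]) for a in (0.3, 0.7) for b in (0.3, 0.7)]

pulls = np.zeros(2)
wins = np.zeros(2)
T = 5000
for t in range(1, T + 1):
    alpha = np.sqrt(t)                     # alpha(t) -> inf, alpha(t) = o(t)
    best_score, theta_hat = -np.inf, grid[0]
    for theta in grid:
        # Bernoulli log-likelihood of the data observed so far ...
        ll = np.sum(wins * np.log(theta) + (pulls - wins) * np.log(1 - theta))
        # ... biased towards parameters with a larger optimal reward.
        score = ll + alpha * theta.max()
        if score > best_score:
            best_score, theta_hat = score, theta
    arm = int(np.argmax(theta_hat))        # act greedily w.r.t. the biased MLE
    pulls[arm] += 1
    wins[arm] += rng.random() < true_p[arm]

print(pulls)
```

The bias term prevents the estimate from settling on a pessimistic candidate that the collected data cannot refute, so most pulls end up on the better arm.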
After the seminal works (Kumar and Becker, 1982; Kumar, 1982; Kumar and Lin, 1982) in the field of adaptive control, the work (Lai and Robbins, 1985b) studied in detail the multi-armed bandit (MAB) problem, and introduced the notion of "regret" in order to characterize the performance of a learning algorithm for MAB. The regret of an algorithm is the difference between the mean reward of an optimal arm times the operating time horizon and the expected rewards earned by the algorithm. Clearly, regret is a much finer metric for judging the performance of a learning algorithm than the long-term average reward. (Lai and Robbins, 1985b) showed that asymptotically the optimal regret scales as $O(\log T)$, and also developed an "Upper Confidence Bounds" (UCB) algorithm that assigns an index to each arm and then plays the arm with the highest value of the index. This index is an optimistic estimate, i.e., a high-probability upper bound, of the true (but unknown) mean reward of that arm. Since (Lai and Robbins, 1985b), regret has become the standard criterion for characterizing the performance of RL algorithms; examples include UCRL (Auer and Ortner, 2007), UCRL2 (Jaksch et al., 2010), and Thompson Sampling (Brafman and Tennenholtz, 2002; Agrawal and Goyal, 2012; Ouyang et al., 2017b).
Though the seminal works on RBMLE showed that it is optimal with respect to long-term average reward, a finite-time performance analysis for various learning tasks, i.e., a bound on its learning regret, has long been overdue. Recently, several works have addressed this issue. For example, (Liu et al., 2020) analyzes RBMLE for the multi-armed bandit problem and shows that it enjoys a regret comparable to state-of-the-art algorithms such as UCB and Thompson Sampling, while its empirical performance is much better. Similar results were obtained when RBMLE was used to optimize rewards in the case of linear bandits (Hung et al., 2020). The case of RL for discrete MDPs was treated in (Mete et al., 2021). In this work we analyze the RBMLE algorithm in the context of RL for LQG control systems.
1.1 Previous Works
The seminal work on adaptive LQG control (Kumar, 1983a) showed that the RBMLE algorithm attains the optimal infinite-horizon average cost asymptotically by utilizing reward-biased estimates of the system parameters to generate controls. However, it assumed that the unknown parameters take values in a discrete set. This assumption was removed in (Campi and Kumar, 1998a; Prandini and Campi, 2000). Meanwhile, works such as (Lai and Wei, 1982, 1987; Chen and Guo, 1987) used forced-exploration techniques to ensure that the algorithm does not suffer from the problem of insufficient exploration, i.e., to guarantee that the asymptotic estimates of the unknown parameters converge to their true values. These are somewhat similar in spirit to the $\epsilon$-greedy learning algorithm (Sutton and Barto, 2018; Auer et al., 2002). More recently, inspired by the RBMLE principle, the work (Abbasi-Yadkori and Szepesvári, 2011) uses the principle of optimism in the face of uncertainty (OFU) for adaptive LQG control, and derives bounds on its learning regret by building upon the techniques introduced in (Campi and Kumar, 1998b). The algorithm proposed therein is of "UCB type," i.e., it optimizes $J(\Theta)$ over a high-probability confidence ball to obtain an optimistic estimate of $\Theta^*$, and is shown to have a regret bounded as $\tilde{O}(\sqrt{T})$. (Ibrahimi et al., 2012) also uses the OFU principle to address the adaptive LQG control problem. More recently, (Dean et al., 2020b) studies a related "sample complexity" problem associated with LQG control, i.e., it quantifies the number of samples required to generate a sufficiently precise estimate of the parameters. However, it does this under the assumption that the controls are generated in an i.i.d. manner (which is somewhat equivalent to forced exploration). An alternative to the OFU principle for designing learning algorithms is Thompson sampling (Thompson, 1933), or posterior sampling.
The works (Abbasi-Yadkori and Szepesvári, 2015; Ouyang et al., 2017a) considered the adaptive LQG control problem in the Bayesian setup, and showed that the expected regret of Thompson sampling is $\tilde{O}(\sqrt{T})$.
Key Challenges:
Several critical questions need to be answered in order to tailor RBMLE to LQG control.

The average-reward optimality proved in prior RBMLE studies for LQG control (Kumar, 1983a; Campi and Kumar, 1998a, b; Prandini and Campi, 2000) is a gross measure which implies only that the regret (defined below) is $o(T)$, while a much stronger finite-time order-optimal regret such as that of OFU (Abbasi-Yadkori and Szepesvári, 2011) or Thompson Sampling (Abbasi-Yadkori and Szepesvári, 2015; Ouyang et al., 2017a) (both of which are $\tilde{O}(\sqrt{T})$) is desired. What is the regret bound of RBMLE algorithms for adaptive LQG?

The above options for the bias term described in (1), namely the choices of $f$ and $\alpha(t)$, are very broad, and not all of them lead to order-optimality of regret. How should one choose them in order to attain order-optimality?

What are the advantages of RBMLE algorithms as compared with the existing methods?
We make several contributions.

We perform a finite-time performance analysis of the RBMLE algorithm for the setup of the adaptive LQR problem. We show that its regret is $\tilde{O}(\sqrt{T})$ and is competitive with the OFU algorithm of (Abbasi-Yadkori and Szepesvári, 2011).

We reveal the general interplay between the choice of the bias parameter $\alpha(t)$ and the regret bound.

We evaluate the empirical performance of RBMLE algorithms in extensive experiments. They demonstrate competitive regret as well as scalability when compared with the current best policies.
[Table: comparison of the regret guarantees of OFU, Thompson Sampling, and RBMLE.]
2 Problem Formulation
Consider the following linear system (Kumar and Varaiya, 2015),
(2) $x_{t+1} = A x_t + B u_t + w_{t+1}$,
where $x_t \in \mathbb{R}^n$ and $u_t \in \mathbb{R}^d$ are the state and the control input respectively at time $t$, and $w_{t+1}$ represents the noise in the system at time $t$. The system parameters, represented by the matrices $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times d}$, are unknown.
The system incurs a cost at time $t$ given by,
(3) $c_t = x_t^\top Q x_t + u_t^\top R u_t$,
where $Q$ and $R$ are known positive semidefinite matrices. The sample-path average cost incurred by the controller that chooses the inputs $\{u_t\}$ is:
(4) $\bar{J} = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} c_t$.
Denote the history of states and inputs until time $t$ by
(5) $\mathcal{F}_t = \sigma\left(x_0, u_0, x_1, u_1, \ldots, u_{t-1}, x_t\right)$.
Define $z_t = [x_t^\top, u_t^\top]^\top$ and let $\Theta^* = [A, B]^\top$ be the unknown system parameter. Then, the linear system (2) can equivalently be written as
(6) $x_{t+1} = \Theta^{*\top} z_t + w_{t+1}$.
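The reparameterization above can be sanity-checked numerically. The sketch below uses assumed illustrative matrices $A$, $B$ (not values from the paper) and verifies that $\Theta^\top z_t$ reproduces $A x_t + B u_t$.

```python
import numpy as np

# Sketch of the regression form x_{t+1} = Theta' z_t + w_{t+1}, where
# Theta = [A, B]' and z_t = [x_t', u_t']'.  A and B are assumed
# illustrative values, not taken from the paper.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Theta = np.vstack([A.T, B.T])          # shape (n + d, n)

x = np.array([1.0, -1.0])
u = np.array([0.5])
z = np.concatenate([x, u])             # z_t = [x_t', u_t']'

# The regression form agrees with the state-space form.
assert np.allclose(Theta.T @ z, A @ x + B @ u)
print(Theta.T @ z)
```

Writing the dynamics this way turns parameter estimation into a (regularized) linear regression of $x_{t+1}$ on $z_t$.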
2.1 Assumptions
We make the following assumptions regarding the system in (2). These have been used in earlier works on adaptive LQG control (AbbasiYadkori and Szepesvári, 2011; Campi and Kumar, 1998b).
Assumption 1.
The system is controllable and observable (Kailath, 1980), and there exists a known positive constant $S$ such that $\|\Theta^*\| \leq S$.
Assumption 2.
We assume that the system noise $w_t$ is element-wise sub-Gaussian, i.e., there exists $\sigma > 0$ such that for all $\lambda \in \mathbb{R}$ and each component $i$,
(7) $\mathbb{E}\left[e^{\lambda w_{t+1,i}} \mid \mathcal{F}_t\right] \leq e^{\lambda^2 \sigma^2 / 2}$.
Moreover, has the following properties.

$\{w_t\}$ is a martingale difference sequence with respect to $\{\mathcal{F}_t\}$.

2.2 Controller Design for a Known LQG Control System
Consider the following linear system,
(8) $x_{t+1} = A x_t + B u_t + w_{t+1}$,
where $\Theta = [A, B]^\top$ is known. Then the following facts are true (Kumar and Varaiya, 2015). There is a unique positive definite matrix $P(\Theta)$ that satisfies the Riccati equation
(9) $P(\Theta) = Q + A^\top P(\Theta) A - A^\top P(\Theta) B \left(B^\top P(\Theta) B + R\right)^{-1} B^\top P(\Theta) A$.
The optimal control law for this system which minimizes the average cost (4) is given by $u_t = K(\Theta) x_t$, where the "gain matrix" $K(\Theta)$ is
(10) $K(\Theta) = -\left(B^\top P(\Theta) B + R\right)^{-1} B^\top P(\Theta) A$.
The optimal average cost is
(11) $J(\Theta) = \operatorname{tr}\left(P(\Theta)\right)$ (when the noise covariance is the identity).
As a consequence of Assumption 1, the parameter set is bounded; hence one can show that $P(\Theta)$ is bounded as well. Let $D$ be a constant such that
(12) $\sup_{\Theta} \|P(\Theta)\| \leq D$.
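The Riccati solution, gain, and optimal cost above can be computed by fixed-point iteration on equation (9). The matrices below are assumed example values, not from the paper.

```python
import numpy as np

# Value-iteration sketch for the discrete-time Riccati equation (9)
# and the optimal gain (10).  A, B, Q, R are illustrative assumptions.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)

P = np.copy(Q)
for _ in range(5000):
    # P <- Q + A'PA - A'PB (B'PB + R)^{-1} B'PA
    P_next = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(
        B.T @ P @ B + R, B.T @ P @ A)
    if np.max(np.abs(P_next - P)) < 1e-12:
        P = P_next
        break
    P = P_next

K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)   # u_t = K x_t
# With unit-covariance noise the optimal average cost is trace(P).
print(np.trace(P), np.max(np.abs(np.linalg.eigvals(A + B @ K))))
```

The closed-loop matrix $A + BK$ is stable (spectral radius below one), which is the defining property of the stabilizing Riccati solution.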
Assumption 3.
For all , we have

, therefore there exists a positive constant such that,
(13) 
There exists a positive constant such that,
(14)
2.3 Construction of Confidence Interval
In this section, given the history $\mathcal{F}_t$, we construct a "high-probability confidence ball," i.e., a set of plausible system parameters that contains the true parameter $\Theta^*$ with high probability.
For $t \geq 0$, define
(15) $V_t = \lambda I + \sum_{s=0}^{t-1} z_s z_s^\top$,
where $\lambda > 0$ is a regularization parameter and $I$ is the identity matrix of dimension $(n+d)$. The squared prediction error of a parameter $\Theta$ is given as
(16) $e_t(\Theta) = \sum_{s=0}^{t-1} \left\| x_{s+1} - \Theta^\top z_s \right\|^2$.
Then the regularized squared error with regularization parameter $\lambda$ is given by:
(17) $e_t^\lambda(\Theta) = e_t(\Theta) + \lambda \|\Theta\|_F^2$.
Let $\hat{\Theta}_t$ be the regularized least-squares estimate of $\Theta^*$, i.e.
(18) $\hat{\Theta}_t \in \arg\min_{\Theta} e_t^\lambda(\Theta)$.
Let $\mathcal{C}_t(\delta)$ be the following confidence ball around $\hat{\Theta}_t$,
$\mathcal{C}_t(\delta) = \left\{ \Theta : \operatorname{tr}\left( (\Theta - \hat{\Theta}_t)^\top V_t (\Theta - \hat{\Theta}_t) \right) \leq \beta_t(\delta) \right\}$,
where the radius $\beta_t(\delta)$ is given by
(19)
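The regularized least-squares estimate has the familiar closed form $\hat{\Theta}_t = V_t^{-1} \sum_{s} z_s x_{s+1}^\top$. The sketch below builds $V_t$ recursively on an assumed two-dimensional system with i.i.d. exploratory inputs (all numeric choices are illustrative).

```python
import numpy as np

# Regularized least-squares sketch for (18): Theta_hat minimizes
# sum_s ||x_{s+1} - Theta' z_s||^2 + lam * ||Theta||_F^2, solved in
# closed form as Theta_hat = V_t^{-1} sum_s z_s x_{s+1}'.
rng = np.random.default_rng(2)
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # assumed example system
B = np.array([[0.0], [1.0]])
Theta_true = np.vstack([A.T, B.T])
lam = 1.0

V = lam * np.eye(3)                 # V_t = lam*I + sum_s z_s z_s'
S = np.zeros((3, 2))                # sum_s z_s x_{s+1}'
x = np.zeros(2)
for _ in range(5000):
    u = rng.normal(size=1)          # exploratory i.i.d. input
    z = np.concatenate([x, u])
    x_next = A @ x + B @ u + 0.1 * rng.normal(size=2)
    V += np.outer(z, z)
    S += np.outer(z, x_next)
    x = x_next

Theta_hat = np.linalg.solve(V, S)
print(np.max(np.abs(Theta_hat - Theta_true)))   # error shrinks with t
```

The same matrix $V_t$ that defines the estimate also shapes the confidence ball: directions in which $V_t$ has grown large are directions in which $\Theta^*$ is pinned down accurately.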
Next, we show that $\Theta^*$ lies within the ball $\mathcal{C}_t(\delta)$ with high probability.
Lemma 2.1.
Define the event
$\mathcal{E}_t = \left\{ \Theta^* \in \mathcal{C}_s(\delta) \ \text{for all } 1 \leq s \leq t \right\}$.
Let $0 < \delta < 1$. Then,
(20) $\mathbb{P}\left(\mathcal{E}_T\right) \geq 1 - \delta$.
Proof.
See Lemma 4 in (AbbasiYadkori and Szepesvári, 2011). ∎
3 RBMLE for LQG Systems
We employ a version of the RBMLE algorithm that proceeds in an episodic manner. Let $t_k$ denote the starting time of the $k$th episode; each episode lasts a prescribed number of time steps. At time $t_k$, it solves the following optimization problem:
(21) $\Theta_k \in \arg\min_{\Theta \in \mathcal{C}_{t_k}(\delta)} \left\{ e_{t_k}^\lambda(\Theta) + \alpha(t_k)\, J(\Theta) \right\}$,
where $\alpha(t_k)$ is the multiplicative pre-factor associated with the cost-bias term. Thus, $\Theta_k$ is the cost-biased estimate of $\Theta^*$ given the operational history of the system. It implements the following control policy during episode $k$: $u_t = K(\Theta_k)\, x_t$.
We note that the key difference between RBMLE and OFU lies in the objective function optimized at the beginning of an episode. While OFU optimizes $J(\Theta)$ over the confidence ball, RBMLE optimizes the sum of the data-fit (negative log-likelihood) term and the cost-bias term $\alpha(t_k) J(\Theta)$, as in (21). The data-fit term is responsible for the better empirical performance of the RBMLE algorithm, which is shown in detail in Section 5. It is an interesting open question to connect this term to a better regret bound for RBMLE.
We note that this is a slightly modified version of the RBMLE algorithm proposed in (Campi and Kumar, 1998b) for LQG systems. The algorithm of (Campi and Kumar, 1998b) does not explicitly maintain a confidence set. Instead, it chooses $\Theta_k$ as the minimizer of (21) over the set of all admissible parameters, i.e.,
(22) $\Theta_k \in \arg\min_{\Theta} \left\{ e_{t_k}^\lambda(\Theta) + \alpha(t_k)\, J(\Theta) \right\}$.
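A minimal sketch of the episodic scheme follows, using a scalar system and a small finite candidate set in place of the confidence ball (so it is closer in spirit to (22) than to (21)). The candidate values, the choice $\alpha(t) = \sqrt{t}$, and the fixed episode length are all illustrative assumptions.

```python
import numpy as np

# Episodic RBMLE sketch: at the start of each episode, pick the
# candidate minimizing [accumulated squared error + alpha(t) * J(theta)]
# and play its certainty-equivalent gain for the whole episode.
rng = np.random.default_rng(3)
a_true, b_true = 1.0, 1.0
Q = R = 1.0

def riccati_scalar(a, b):
    p = Q
    for _ in range(1000):                 # scalar Riccati iteration
        p = Q + a * a * p - (a * p * b) ** 2 / (b * b * p + R)
    k = -a * p * b / (b * b * p + R)
    return p, k                           # J(theta) = p for unit-variance noise

candidates = [(0.5, 1.0), (1.0, 1.0), (1.5, 1.0)]
sq_err = {c: 0.0 for c in candidates}

x, t = 0.0, 0
for episode in range(50):
    alpha = np.sqrt(t + 1)
    # cost-biased criterion: data fit + alpha * optimal average cost
    theta = min(candidates,
                key=lambda c: sq_err[c] + alpha * riccati_scalar(*c)[0])
    _, k = riccati_scalar(*theta)
    for _ in range(20):                   # fixed-length episode
        u = k * x
        x_next = a_true * x + b_true * u + rng.normal()
        for (a, b) in candidates:
            sq_err[(a, b)] += (x_next - a * x - b * u) ** 2
        x, t = x_next, t + 1

print(theta)
```

Because $\alpha(t)$ grows slower than the accumulated squared error of a wrong candidate, the bias eventually stops overriding the data, and the selection settles on the true parameter.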
In the next two sections, we provide guarantees on the performance of RBMLE in terms of an upper bound on its regret, and also demonstrate that its empirical performance is better than that of other popular algorithms.
4 Regret Analysis
In this section, we focus on the theoretical analysis of the RBMLE algorithm of Section 3. More specifically, we derive an upper bound on its regret. The regret of a learning algorithm for LQG systems is defined as (Abbasi-Yadkori and Szepesvári, 2011):
(23) $\mathcal{R}(T) = \sum_{t=0}^{T} c_t - T\, J(\Theta^*)$.
The notion of regret was first introduced in (Lai and Robbins, 1985a) in the context of the multi-armed bandit problem. It has been widely used for characterizing the finite-time performance of algorithms that make sequential decisions (Lattimore and Szepesvári, 2020; Auer et al., 2002). Note that minimizing the regret is equivalent to minimizing the cumulative cost.
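The regret definition can be illustrated numerically. The sketch below compares the accumulated cost of the optimal gain and of a deliberately detuned gain against $T$ times the optimal average cost, on an assumed scalar system (all numbers are illustrative).

```python
import numpy as np

# Regret sketch for (23): accumulated cost minus T * J(Theta*).
# A fixed suboptimal gain incurs regret growing linearly in T, while
# the optimal gain's regret per step fluctuates around zero.
rng = np.random.default_rng(4)
a, b, Q, R = 1.0, 1.0, 1.0, 1.0

p = (1 + np.sqrt(5)) / 2                 # scalar Riccati solution here
k_opt = -a * p * b / (b * b * p + R)
J_opt = p                                # optimal average cost, unit-variance noise

def run(k, T):
    x, cost = 0.0, 0.0
    for _ in range(T):
        u = k * x
        cost += Q * x * x + R * u * u
        x = a * x + b * u + rng.normal()
    return cost

T = 20000
regret_opt = run(k_opt, T) - T * J_opt   # fluctuates around 0
regret_bad = run(-0.2, T) - T * J_opt    # grows linearly in T
print(regret_opt / T, regret_bad / T)
```

A learning algorithm with sublinear regret must therefore have its per-step excess cost vanish as $T$ grows.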
The following result allows us to decompose the regret into easytoanalyze terms.
Lemma 4.1.
The regret of a learning algorithm which implements the controls $\{u_t\}$ is upper bounded as follows:
(24) $\mathcal{R}(T) \leq R_1 + R_2 + R_3 + R_4$,
where $R_1, R_2, R_3$ and $R_4$ are the terms bounded individually in Sections 4.1–4.4 below.
Proof.
See Appendix A. ∎
Next, we provide individual bounds on each of the terms $R_1, R_2, R_3, R_4$ on the event defined in Lemma 2.1.
4.1 Bound on $R_1$
Theorem 4.2.
On the event defined in Lemma 2.1,
where, , and .
Proof.
The proof is the same as that of Lemma 7 in (AbbasiYadkori and Szepesvári, 2011). ∎
4.2 Bound on $R_2$
Proof.
As defined in Lemma 4.1,
Each term in the summation is nonzero only at the end of an episode, when there is a policy change. The number of nonzero terms therefore equals the number of episodes up to time $T$, and each of them is bounded. ∎
4.3 Bound on $R_3$
Theorem 4.4.
Proof.
The proof is the same as that of Lemma 13 in (AbbasiYadkori and Szepesvári, 2011). ∎
4.4 Bound on $R_4$
Theorem 4.5.
On the event defined in Lemma 2.1,
Proof.
See Appendix B. ∎
Now that we have established upper bounds on the terms $R_1, R_2, R_3$ and $R_4$, we can use these in conjunction with the decomposition result of Lemma 4.1 in order to bound the regret of RBMLE as follows.
Theorem 4.6.
For any $0 < \delta < 1$, the regret of the RBMLE algorithm is upper bounded, with probability at least $1-\delta$, as follows:
(25) $\mathcal{R}(T) \leq \tilde{O}\big(\sqrt{T}\big)$,
where problem-dependent constants and terms logarithmic in $T$ are hidden.
Proof.
See Appendix C. ∎
5 Empirical Performance
In this section, we evaluate the empirical performance of RBMLE and compare it with existing popular learning algorithms such as OFU (Abbasi-Yadkori and Szepesvári, 2011) and non-Bayesian Thompson Sampling (Abeille and Lazaric, 2017).
Our experimental setup is as follows. We initialize the system state and use the first 10 time steps, with randomly generated inputs, to initialize the estimates. We ensure that all the algorithms utilize the same realization of initial random inputs, and the same external noise at each time $t$. More details on the implementation of OFU and TS can be found in (Dean et al., 2018). OFU is implemented using a heuristic projected gradient descent approach. Note that the RBMLE algorithm differs from OFU only in that it optimizes a slightly different objective function, so its implementation is similar. We plot the averaged regret, along with the standard deviation, over the time horizon. Each experiment is averaged over 200 runs. We carry out simulation studies for the following systems, starting with a simple example in which the system state and control input are both scalars. In Figures 1 and 2, we compare the performance of RBMLE with OFU and TS for this simple example. It is evident that RBMLE has much lower average regret compared to OFU and TS.
Chained Integrator Dynamics: In this example, we consider a chained integrator system with a 2-dimensional state and a 2-dimensional input.
(26)
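Since equation (26) specifies the paper's own matrices, the sketch below uses assumed double-integrator-style matrices purely to illustrate the structure of such a system and to check controllability.

```python
import numpy as np

# Hypothetical chained-integrator-style system (matrices are assumed
# illustrative values, not the ones in equation (26)).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])    # position-velocity chain
B = np.array([[dt, 0.0], [0.0, dt]])     # 2-dimensional input
# Controllability check: [B, AB] must have full row rank.
ctrb = np.hstack([B, A @ B])
print(np.linalg.matrix_rank(ctrb))
```

Controllability (Assumption 1) guarantees that the Riccati equation (9) admits a stabilizing solution for such systems.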
6 Concluding Remarks
We proposed the RBMLE algorithm for solving the adaptive linear quadratic control problem. It is an alternative to the OFU approach of (Abbasi-Yadkori and Szepesvári, 2011), and generates a reward-biased estimate of the unknown parameter by optimizing an objective function which is a modification of the log-likelihood function: a bias term is added to give higher preference to those parameter values that generate higher average rewards (equivalently, lower average costs). The key difference between RBMLE and OFU is that they optimize different objective functions while generating an estimate of the underlying system parameter: OFU optimizes the objective function $J(\Theta)$ alone, while RBMLE additionally includes a term that is representative of the log-likelihood of making the observations given that $\Theta$ is the true parameter value. This difference is responsible for the better empirical performance of the RBMLE algorithm, and it remains to be seen in future work whether this additional term also yields a better regret bound for RBMLE than for OFU or Thompson Sampling.
We establish bounds on its regret and show that it is competitive with OFU. We show, via extensive simulations performed for a broad range of systems, that the empirical performance of RBMLE is much better than that of OFU and Thompson Sampling. Along with recent works on RBMLE for MDPs (Mete et al., 2021), stochastic multi-armed bandits (Liu et al., 2020) and contextual bandits (Hung et al., 2020), this work on LQG substantiates the case for RBMLE as a promising alternative to the much-celebrated Upper Confidence Bound and Thompson Sampling approaches for model-based RL problems. Hence, an important topic for future research is analyzing RBMLE in important settings such as RL for constrained MDPs (Altman, 1999), RL for continuous spaces (Ortner and Ryabko, 2013), and Bayesian optimization. Developing efficient computational procedures for RBMLE for MDPs is another challenging open problem.
References
Model-free linear quadratic control via reduction to expert prediction. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3108–3117.
Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 1–26.
Bayesian optimal control of smoothly parameterized systems. In UAI, pp. 1–11.
Thompson sampling for linear-quadratic control problems. In Artificial Intelligence and Statistics, pp. 1246–1254.
Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27(4), pp. 1054–1078.
Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39–1.
Constrained Markov Decision Processes. Vol. 7, CRC Press.
Introduction to Stochastic Control Theory. Courier Corporation.
Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2), pp. 235–256.
Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pp. 49–56.
Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall.
Adaptive control of linear time invariant systems: the "bet on the best" principle. Communications in Information & Systems 6(4), pp. 299–320.
Adaptive control of Markov chains, I: finite parameter set. IEEE Transactions on Automatic Control 24(6), pp. 953–957.
The Kumar–Becker–Lin scheme revisited. Journal of Optimization Theory and Applications 66(2), pp. 289–309.
Self-tuning control of diffusions without the identifiability condition. Journal of Optimization Theory and Applications 68(1), pp. 117–138.
R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3, pp. 213–231.
Adaptive linear quadratic Gaussian control: the cost-biased approach revisited. SIAM Journal on Control and Optimization 36(6), pp. 1890–1907.
Optimal adaptive control and consistent parameter estimates for ARMAX model with quadratic cost. SIAM Journal on Control and Optimization 25(4), pp. 845–867.
Regret bounds for robust adaptive control of the linear quadratic regulator. arXiv preprint arXiv:1805.09388.
On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics 20(4), pp. 633–679.
Almost self-optimizing strategies for the adaptive control of diffusion processes. Journal of Optimization Theory and Applications 81(3), pp. 479–507.
Dual control theory, I. Avtomatika i Telemekhanika 21(9), pp. 1240–1249.
Multi-armed Bandit Allocation Indices. John Wiley & Sons.
Reward-biased maximum likelihood estimation for linear stochastic bandits. arXiv preprint arXiv:2010.04091.
Efficient reinforcement learning for high dimensional linear quadratic systems. In NIPS, pp. 2645–2653.
Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11(4).
Linear Systems. Vol. 156, Prentice-Hall, Englewood Cliffs, NJ.
Optimal adaptive controllers for unknown Markov chains. IEEE Transactions on Automatic Control 27(4), pp. 765–774.
A new family of optimal adaptive controllers for Markov chains. IEEE Transactions on Automatic Control 27(1), pp. 137–146.
Optimal adaptive control of linear-quadratic-Gaussian systems. SIAM Journal on Control and Optimization 21(2), pp. 163–178.
Simultaneous identification and adaptive control of unknown systems over finite parameter sets. IEEE Transactions on Automatic Control 28(1), pp. 68–76.
Stochastic Systems: Estimation, Identification, and Adaptive Control. SIAM.
Adaptive control with a compact parameter set. SIAM Journal on Control and Optimization 20(1), pp. 9–13.
Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1), pp. 4–22.
Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics 10(1), pp. 154–166.
Asymptotically efficient self-tuning regulators. SIAM Journal on Control and Optimization 25(2), pp. 466–481.
Bandit Algorithms. Cambridge University Press.
Exploration through reward biasing: reward-biased maximum likelihood estimation for stochastic multi-armed bandits. In International Conference on Machine Learning, pp. 6248–6258.
Estimation and control in Markov chains. Advances in Applied Probability 6(1), pp. 40–60.
Reward biased maximum likelihood estimation for reinforcement learning. In Learning for Dynamics and Control, pp. 815–827.
Online regret bounds for undiscounted continuous reinforcement learning. arXiv preprint arXiv:1302.2550.
Control of unknown linear systems with Thompson sampling. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1198–1205.
Learning unknown Markov decision processes: a Thompson sampling approach. arXiv preprint arXiv:1709.04570.
Adaptive LQG control of input-output systems: a cost-biased approach. SIAM Journal on Control and Optimization 39(5), pp. 1499–1519.
On nearly self-optimizing strategies for a discrete-time uniformly ergodic adaptive model. Applied Mathematics and Optimization 27(2), pp. 161–177.
Reinforcement Learning: An Introduction. MIT Press.
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4), pp. 285–294.
Least-squares temporal difference learning for the linear quadratic regulator. In International Conference on Machine Learning, pp. 5005–5014.
Adaptive dual control methods: an overview. Adaptive Systems in Control and Signal Processing 1995, pp. 67–72.
Appendix A Proof of Lemma 4.1
Proof.
The Bellman optimality equation for the LQ problem can be written as:
where . Also note that .
(27)  
(28) 
Substituting, and then using the martingale difference property of the noise (Assumption 2), we get
(29)  
(30) 
Subtracting from both sides, we get,
∎
Appendix B Proof of Theorem 4.5
Proof.
Let $t_k$ be the time of the beginning of episode $k$. Then the RBMLE algorithm implements the policy
(31) 
Therefore, using Lemma 4.1, we have
∎