1 Introduction
Decision making for dynamical systems in the presence of uncertainty is a problem of great prevalence and importance, as well as considerable difficulty, especially when knowledge of the dynamics is available only via limited observations of system behavior. In machine learning, the data-driven search for a control policy to maximize the expected reward attained by a stochastic dynamical process is known as
reinforcement learning (RL) [41]. Despite remarkable recent success in games [28, 39], a major obstacle to the deployment of RL-based control on physical systems (e.g. robots and self-driving cars) is the issue of robustness, i.e., guaranteed safe and reliable operation. With the necessity of such guarantees widely acknowledged [2], so-called 'safe RL' remains an active area of research [18].
The problem of robust automatic decision making for uncertain dynamical systems has also been the subject of intense study in the area of robust control (RC) [49]. In RC, one works with a set of plausible models and seeks a control policy that is guaranteed to stabilize all models within the set. In addition, there is also a performance objective to optimize, i.e., a reward to be maximized or, equivalently, a cost to be minimized. Such cost functions are usually defined with reference to either a nominal model [17, 22] or the worst-case model [32] in the set. RC has been extremely successful in a number of engineering applications [34]; however, as has been noted, e.g., in [43, 31], robustness may (understandably) come at the expense of performance, particularly for worst-case design.
The problem we address in this paper lies at the intersection of reinforcement learning and robust control, and can be summarized as follows: given observations from an unknown dynamical system, we seek a policy to optimize the expected cost (as in RL), subject to certain robust stability guarantees (as in RC). Specifically, we focus our attention on control of linear time-invariant dynamical systems, subject to Gaussian disturbances, with the goal of minimizing a quadratic function penalizing state deviations and control action. When the system is known, this is the classical linear quadratic regulator (LQR), a.k.a. H2, optimal control problem [8]. We are interested in the setting in which the system is unknown, and knowledge of the dynamics must be inferred from observed data.
Contributions and paper structure
The principal contribution of this paper is an algorithm to optimize the expected value of the linear quadratic regulator reward/cost function, where the expectation is w.r.t. the posterior distribution of the unknown system parameters, given observed data; c.f. Section 3 for a detailed problem formulation. Specifically, we construct a sequence of convex approximations (upper bounds) to the expected cost, each of which can be optimized via semidefinite programming [45]. The algorithm, developed in Section 4, invokes the majorize-minimization (MM) principle [26], and consequently enjoys reliable convergence to local optima. An important part of our contribution lies in guarantees on the robust stability properties of the resulting control policies, c.f. Section 4.3. We demonstrate the proposed method via two experimental case studies: i) the benchmark problem on simulated systems considered in [14, 43], and ii) stabilization of a real-world inverted pendulum. Strong performance and robustness properties are observed in both. Moving forward, from a machine learning perspective this work contributes to the growing body of research concerned with ensuring robustness in RL, c.f. Section 2. From a control perspective, this work appropriates cost functions more commonly found in RL (namely, expected reward) to an RC setting, with the objective of reducing the conservatism of the resulting robust control policies.
2 Related work
Incorporating various notions of 'robustness' into RL has long been an area of active research [18]. In so-called 'safe RL', one seeks to respect certain safety constraints during exploration and/or policy optimization, for example, avoiding undesirable regions of the state-action space [19, 1]. A related problem is addressed in 'risk-sensitive RL', in which the search for a policy takes both the expected value and the variance of the reward into account [27, 16]. Recently, there has been increased interest in notions of robustness more commonly considered in control theory, chiefly stability [31, 3]. Of particular relevance is the work of [4], which employs Lyapunov theory [24] to verify stability of learned policies. Like the present paper, [4] adopts a Bayesian framework; however, [4] makes use of Gaussian processes [35] to model the uncertain nonlinear dynamics, which are assumed to be deterministic. A major difference between [4] and our work is the cost function: in the former the policy is selected by optimizing for worst-case performance, whereas we optimize the expected cost. Robustness of data-driven control has also been the focus of a recently developed family of methods referred to as 'coarse-ID control', c.f., [42, 14, 7, 40], in which finite-data bounds on the accuracy of the least squares estimator are combined with modern robust control tools, such as
system level synthesis [47]. Coarse-ID control builds upon so-called 'H∞ identification' methods for learning models of dynamical systems, along with error bounds that are compatible with robust synthesis methods [23, 11, 10]. H∞ identification assumes an adversarial (i.e. worst-case) disturbance model, whereas coarse-ID control is applicable to probabilistic models, such as those considered in the present paper. Of particular relevance to the present paper is [14], which provides sample complexity bounds on the performance of robust control synthesis for the infinite horizon LQR problem, when the true system is not known. Such bounds necessarily consider the worst-case model, given the observed data, whereas we are concerned with the expected cost over the posterior distribution of models. In closing, we briefly mention the so-called 'Riemann-Stieltjes' class of optimal control problems, for uncertain continuous-time dynamical systems, c.f., e.g., [37, 36]. Such problems often arise in aerospace applications (e.g. satellite control) where the objective is to design an open-loop control signal (e.g. for an orbital maneuver) rather than a feedback policy.
3 Problem formulation
In this section we describe in detail the specific problem that we address in this paper. The following notation is used: S^n denotes the set of n x n symmetric matrices; S^n_+ (S^n_{++}) denotes the cone of positive semidefinite (positive definite) matrices. A ⪰ 0 denotes A ∈ S^n_+, and similarly for ≻ and S^n_{++}. The trace of A is denoted tr(A). The transpose of A is denoted A^⊤. |x|^2_Q is shorthand for x^⊤Qx. The convex hull of a set X is denoted conv(X). The set of Schur stable matrices (spectral radius strictly less than one) is denoted S.
Dynamics, reward function and policies
We are concerned with control of discrete-time linear time-invariant dynamical systems of the form

x_{t+1} = Ax_t + Bu_t + w_t,   w_t ~ N(0, Σ_w),   (1)

where x_t ∈ R^n, u_t ∈ R^m, and w_t ∈ R^n denote the state, input, and unobserved exogenous disturbance at time t, respectively. Let θ := {A, B, Σ_w}. Our objective is to design a feedback control policy u_t = φ(x_t) that minimizes the cost function lim_{T→∞} (1/T) Σ_{t=1}^T E(|x_t|^2_Q + |u_t|^2_R), where x_t evolves according to (1), and Q ⪰ 0 and R ≻ 0 are user defined weight matrices. A number of different parametrizations of the policy φ have been considered in the literature, from neural networks (popular in RL, e.g., [4]) to causal (typically linear) dynamical systems (common in RC, e.g., [32]). In this paper, we will restrict our attention to static-gain policies of the form u_t = Kx_t, where K ∈ R^{m x n} is constant. As noted in [14], controller synthesis and implementation is simpler (and more computationally efficient) for such policies. When the parameters of the true system, denoted θ_tr, are known, this is the infinite horizon LQR problem, the optimal solution of which is well known [5]. We assume that θ_tr is unknown; rather, our knowledge of the dynamics must be inferred from observed sequences of inputs and states.
Observed data
We adopt the data-driven setup used in [14], and assume that D = {x_{1:T+1}^{(i)}, u_{1:T}^{(i)}}_{i=1}^N, where x_{1:T+1} is the observed state sequence attained by evolving the true system for T time steps, starting from an arbitrary initial state x_1 and driven by an arbitrary input sequence u_{1:T}. Each of these N independent experiments is referred to as a rollout. We perform parameter inference in the offline/batch setting; i.e., all data is assumed to be available at the time of controller synthesis.
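To make the data-collection procedure concrete, the following sketch simulates N independent rollouts of (1). It is a minimal illustration, not the authors' code: the ground-truth matrices A_true, B_true, Sigma_w, the zero initial state, and the white Gaussian excitation are all placeholder assumptions.

```python
import numpy as np

def collect_rollouts(A_true, B_true, Sigma_w, T, N, sigma_u=1.0, rng=None):
    """Simulate N rollouts of x[t+1] = A x[t] + B u[t] + w[t], w ~ N(0, Sigma_w),
    driven by arbitrary (here: white Gaussian) inputs, as described above."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = B_true.shape
    L = np.linalg.cholesky(Sigma_w)         # for drawing w[t] ~ N(0, Sigma_w)
    data = []
    for _ in range(N):
        x = np.zeros((T + 1, n))            # arbitrary initial state x_1 = 0
        u = sigma_u * rng.standard_normal((T, m))
        for t in range(T):
            x[t + 1] = A_true @ x[t] + B_true @ u[t] + L @ rng.standard_normal(n)
        data.append((x, u))                 # one rollout: (x_{1:T+1}, u_{1:T})
    return data                             # the data set D, in the notation above
```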
Optimization objective
Given observed data D and, possibly, prior knowledge of the system, we have the posterior distribution over the model parameters, denoted p(θ | D), in place of the true parameters θ_tr. The function that we seek to minimize is the expected cost w.r.t. the posterior distribution, i.e.,

E_{θ ~ p(θ|D)} [ lim_{T→∞} (1/T) Σ_{t=1}^T E(|x_t|^2_Q + |u_t|^2_R) ].   (2)
In practice, the support of p(θ | D) almost surely contains parameters θ that are unstabilizable, which implies that (2) is infinite. Consequently, we shall consider averages over confidence regions w.r.t. p(θ | D). For convenience, let us denote the infinite horizon LQR cost, for given system parameters θ and policy u_t = Kx_t, by

J(θ, K) := lim_{T→∞} (1/T) Σ_{t=1}^T E(|x_t|^2_Q + |u_t|^2_R)   (3a)
         = tr((Q + K^⊤RK)W),  W = (A + BK)W(A + BK)^⊤ + Σ_w,  for A + BK ∈ S,   (3b)

and J(θ, K) := ∞ otherwise, where the second equality follows from standard Gramian calculations, and S denotes the set of Schur stable matrices. As an alternative to (2) we consider the cost function ∫_{Θ_c} J(θ, K) p(θ | D) dθ, where Θ_c denotes the c x 100% confidence region of the parameter space w.r.t. the posterior p(θ | D). Though better suited to optimization than (2), which is almost surely infinite, this integral cannot be evaluated in closed form, due to the complexity of J(θ, K) as a function of θ. Furthermore, there is still no guarantee that Θ_c contains only stabilizable models. To circumvent both of these issues, we propose the following Monte Carlo approximation of ∫_{Θ_c} J(θ, K) p(θ | D) dθ:
Ĵ(K) := (1/M) Σ_{i=1}^M J(θ_i, K),   θ_i ~ p(θ | D),  θ_i ∈ Θ_c ∩ Θ_s,   (4)

where Θ_s denotes the set of stabilizable θ.
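For concreteness, the cost (3) and its Monte Carlo average (4) can be evaluated as follows; this is a sketch assuming each posterior sample is stored as a tuple (A, B, Sigma_w) and that the policy is the static gain u_t = K x_t.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_cost(A, B, Sigma_w, Q, R, K):
    """Infinite horizon LQR cost (3): tr((Q + K'RK) W), where W is the
    stationary state covariance; the cost is infinite if A + BK is unstable."""
    A_cl = A + B @ K
    if np.max(np.abs(np.linalg.eigvals(A_cl))) >= 1.0:
        return np.inf
    W = solve_discrete_lyapunov(A_cl, Sigma_w)   # W = A_cl W A_cl' + Sigma_w
    return np.trace((Q + K.T @ R @ K) @ W)

def mc_expected_cost(thetas, Q, R, K):
    """Monte Carlo approximation (4) over samples thetas = [(A, B, Sigma_w), ...],
    assumed to have been restricted to the stabilizable confidence region."""
    return np.mean([lqr_cost(A, B, Sw, Q, R, K) for (A, B, Sw) in thetas])
```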
Posterior distribution
Given data D, the parameter posterior distribution is given by Bayes' rule:

p(θ | D) = p(D | θ) p(θ) / p(D) ∝ p(D | θ) p(θ) =: π(θ),   (5)

where p(θ) denotes our prior belief on θ, p(D | θ) denotes the likelihood, and π(θ) denotes the unnormalized posterior. To sample from p(θ | D), we can distinguish between two different cases. First, consider the case when Σ_w is known or can be reliably estimated independently of {A, B}. This is the setting in, e.g., [14].
In this case, the likelihood can be equivalently expressed as a Gaussian distribution over vec([A, B]). Then, when the prior is uniform (i.e. noninformative) or Gaussian (self-conjugate), the posterior is also Gaussian, c.f. Appendix A.1.1. Second, consider the general case in which Σ_w, along with {A, B}, is unknown. In this setting, one can select from a number of methods adapted for Bayesian inference in dynamical systems, such as Metropolis-Hastings [29], Hamiltonian Monte Carlo [12], and Gibbs sampling [13, 48]. When one places a noninformative prior on Σ_w (e.g., Jeffreys' prior), each iteration of a Gibbs sampler targeting p(θ | D) requires sampling from either a Gaussian or an inverse Wishart distribution, for which reliable numerical methods exist; c.f. Appendix A.1.2. In both of these cases we can sample from p(θ | D) and evaluate π(θ) pointwise. To draw θ_i ∈ Θ_c ∩ Θ_s, as in (4), we can first draw a large number of samples from p(θ | D), discard the (1 − c) x 100% of samples with the lowest unnormalized posterior values, and then further discard any samples that happen to be unstabilizable. For convenience, we define Θ̂ := {θ_i}_{i=1}^M, which should be interpreted as a set of M realizations of this procedure for sampling θ_i ∈ Θ_c ∩ Θ_s.
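The sample-then-filter construction of Θ̂ described above might look as follows. This sketch assumes a raw posterior sampler and an unnormalized log-posterior function log_post are available (c.f. Appendix A.1); stabilizability is checked with the standard PBH test, a choice not prescribed by the text.

```python
import numpy as np

def is_stabilizable(A, B, tol=1e-9):
    """PBH test: (A, B) is stabilizable iff rank([A - lam I, B]) = n for
    every eigenvalue lam of A with |lam| >= 1."""
    n = A.shape[0]
    for lam in np.linalg.eigvals(A):
        if abs(lam) >= 1.0 - tol:
            if np.linalg.matrix_rank(np.hstack([A - lam * np.eye(n), B])) < n:
                return False
    return True

def confidence_region_samples(raw_samples, log_post, c=0.95):
    """Keep the c-fraction of samples with the highest unnormalized posterior,
    then discard any that are unstabilizable (one realization of Theta-hat)."""
    lp = np.array([log_post(th) for th in raw_samples])
    keep = np.argsort(lp)[int((1.0 - c) * len(raw_samples)):]
    return [raw_samples[i] for i in keep
            if is_stabilizable(raw_samples[i][0], raw_samples[i][1])]
```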
Summary
We seek the solution of the optimization problem min_K Ĵ(K), with Ĵ as in (4) and {θ_i}_{i=1}^M = Θ̂.
4 Solution via semidefinite programming
In this section we present the principal contribution of this paper: a method for solving min_K Ĵ(K) via convex (semidefinite) programming (SDP). It is convenient to consider the equivalent representation

Ĵ(K) = min_{{W_i}_{i=1}^M} (1/M) Σ_{i=1}^M tr((Q + K^⊤RK)W_i)   (6a)
s.t.  W_i ⪰ (A_i + B_iK)W_i(A_i + B_iK)^⊤ + Σ_w^i,  i = 1, …, M,   (6b)

where the Comparison Lemma [30, Lecture 2] has been used to replace the equality in (3b) with the inequality in (6b). We also introduce arbitrarily small and large positive constants ε and M̄, respectively, and carry out all minimizations over the compact set {W ∈ S^n : εI ⪯ W ⪯ M̄I}, which serves as a compact approximation of S^n_{++} suitable for use with SDP solvers.
4.1 Common Lyapunov relaxation
The principal challenge in solving (6) is that the constraint (6b) is not jointly convex in W_i and K. The usual approach to circumventing this nonconvexity is to first apply the Schur complement to (6b), and then conjugate by the matrix diag(I, W_i), which leads to the equivalent constraint

[[W_i − Σ_w^i, (A_i + B_iK)W_i], [W_i(A_i + B_iK)^⊤, W_i]] ⪰ 0.   (7)

With the change of variables Z_i := KW_i, (7) becomes an LMI in W_i and Z_i. This approach is effective when M = 1 (i.e. we have a single nominal system, as in standard LQR). However, when M > 1 we cannot introduce a new Z_i for each θ_i, as we lose uniqueness of the controller K = Z_iW_i^{-1}; i.e., in general Z_iW_i^{-1} ≠ Z_jW_j^{-1} for i ≠ j. One strategy (prevalent in robust control, e.g., [14, §C]) is to employ a 'common Lyapunov function', i.e., W_i = W for all i = 1, …, M. This gives the following convex relaxation (upper bound) of problem (6),
min_{W, Z, X}  tr(QW) + tr(X)   (8a)
s.t.  [[X, R^{1/2}Z], [Z^⊤(R^{1/2})^⊤, W]] ⪰ 0,  [[W − Σ_w^i, A_iW + B_iZ], [(A_iW + B_iZ)^⊤, W]] ⪰ 0,  i = 1, …, M,   (8b)

where R^{1/2} denotes the Cholesky factorization of R, i.e., R = (R^{1/2})^⊤R^{1/2}, and X is a slack variable used to encode the cost (6a) with the change of variables, i.e., tr(X) ≥ tr(R^{1/2}ZW^{-1}Z^⊤(R^{1/2})^⊤) = tr(K^⊤RKW). The policy is recovered as K = ZW^{-1}.
The approximation in (8) is highly conservative, which motivates the iterative local optimization method presented in Section 4.2. Nevertheless, (8) provides a principled way (i.e., a one-shot convex program) to initialize that iterative search.
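As an illustration, the relaxation (8) can be prototyped in a few lines with cvxpy (an assumed dependency; the LMI layout follows the reconstruction above and should be checked against one's own derivation):

```python
import cvxpy as cp
import numpy as np
from scipy.linalg import cholesky

def common_lyapunov_gain(thetas, Q, R, eps=1e-6):
    """Solve the one-shot convex relaxation (8) with a common W for all
    sampled models; returns the static gain K = Z W^{-1}."""
    n, m = thetas[0][1].shape
    W = cp.Variable((n, n), symmetric=True)
    Z = cp.Variable((m, n))
    X = cp.Variable((m, m), symmetric=True)
    R12 = cholesky(R)                                   # R = R12' R12
    cons = [W >> eps * np.eye(n),
            # slack: X >= R12 Z W^{-1} Z' R12', so tr(QW) + tr(X) bounds (6a)
            cp.bmat([[X, R12 @ Z], [(R12 @ Z).T, W]]) >> 0]
    for (A, B, Sw) in thetas:
        AWBZ = A @ W + B @ Z
        cons.append(cp.bmat([[W - Sw, AWBZ], [AWBZ.T, W]]) >> 0)
    cp.Problem(cp.Minimize(cp.trace(Q @ W) + cp.trace(X)), cons).solve()
    return Z.value @ np.linalg.inv(W.value)
```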
4.2 Iterative improvement by sequential semidefinite programming
To develop this iterative search method, first consider an equivalent representation of the LQR cost (3),

J(θ, K) = min_W  tr((Q + K^⊤RK)W)   (9a)
s.t.  W ⪰ (A + BK)W(A + BK)^⊤ + Σ_w.   (9b)
This representation highlights the nonconvexity of (9b), due to the product of the decision variables K and W, which was addressed (in the usual way) by a change of variables in Section 4.1. In this section, we will instead work with the Schur complement form of (9b), in which the nonconvexity appears through the matrix inverse W^{-1}, and replace this inverse with a linear approximation; we then prove that this leads to a tight convex upper bound. Given f(W) := W^{-1}, let f̃₁(W) := W₁^{-1} − W₁^{-1}(W − W₁)W₁^{-1} denote the first order (i.e. linear) Taylor series approximation of f about some nominal W₁ ≻ 0. We now define the function

J̃(θ, K) := min_W  tr((Q + K^⊤RK)W)   (10a)
s.t.  [[W − Σ_w, A + BK], [(A + BK)^⊤, f̃₁(W)]] ⪰ 0,   (10b)
where W₁ is any W that achieves the minimum in (9), with K = K₁ for some nominal policy K₁; i.e., J(θ, K₁) = tr((Q + K₁^⊤RK₁)W₁). Analogously to (4), we define

J̃(K) := (1/M) Σ_{i=1}^M J̃(θ_i, K),   (11)

where, for each θ_i, the linearization point W₁^i is computed as above with K = K₁.
We now show that J̃ is a convex upper bound on Ĵ, which is tight at K₁. The proof is given in A.2.2 and makes use of the following technical lemma (c.f. A.2.1 for the proof):
Lemma 4.1.
f(X) ⪰ f̃₁(X) for all X ≻ 0, where f̃₁ denotes the first-order Taylor series expansion of f(X) := X^{-1} about X₁ ≻ 0.
Theorem 4.1.
Let J̃ be defined as in (11), with linearization points {W₁^i}_{i=1}^M such that Ĵ(K₁) is finite. Then J̃ is a convex upper bound on Ĵ, i.e., J̃(K) ≥ Ĵ(K) for all K. Furthermore, the bound is 'tight' at K₁, i.e., J̃(K₁) = Ĵ(K₁).
Iterative algorithm
To improve upon the common Lyapunov solution given by (8), we can solve a sequence of convex optimization problems, K_{k+1} = arg min_K J̃_k(K), where J̃_k denotes the bound (11) constructed about K_k; c.f. Algorithm 1 for details. This procedure of optimizing tight surrogate functions in lieu of the actual objective function is an example of the 'majorize-minimization (MM) principle', a.k.a. optimization transfer [26]. MM algorithms enjoy good numerical robustness, and (with the exception of some pathological cases) reliable convergence to local minima [44]. Indeed, it is readily verified that Ĵ(K_{k+1}) ≤ J̃_k(K_{k+1}) ≤ J̃_k(K_k) = Ĵ(K_k), where the equality follows from tightness of the bound, the first inequality is due to the fact that J̃_k is an upper bound, and the second follows since K_{k+1} minimizes J̃_k. This implies that {Ĵ(K_k)}_{k≥1} is a non-increasing, and hence converging, sequence.
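A sketch of one MM iteration, and the resulting Algorithm-1-style loop, is given below in cvxpy. The surrogate follows the reconstruction of (10)-(11) above: the W^{-1} blocks are replaced by the linearization f̃₁(W) = 2W₁^{-1} − W₁^{-1}WW₁^{-1}, with a slack variable X encoding the input penalty; this particular encoding is an assumption, and K₁ is assumed to stabilize every sampled model.

```python
import cvxpy as cp
import numpy as np
from scipy.linalg import cholesky, solve_discrete_lyapunov

def mm_step(thetas, Q, R, K1):
    """Build the convex surrogate (11) about the current gain K1 and minimize it."""
    n, m = thetas[0][1].shape
    K = cp.Variable((m, n))
    R12 = cholesky(R)                            # R = R12' R12
    cost, cons = 0, []
    for (A, B, Sw) in thetas:
        W1 = solve_discrete_lyapunov(A + B @ K1, Sw)    # minimizer of (9) at K1
        P1 = np.linalg.inv(W1)
        W = cp.Variable((n, n), symmetric=True)
        X = cp.Variable((m, m), symmetric=True)
        f1 = 2 * P1 - P1 @ W @ P1                # linearization of W^{-1} at W1
        cons += [cp.bmat([[W - Sw, A + B @ K], [(A + B @ K).T, f1]]) >> 0,
                 cp.bmat([[X, R12 @ K], [(R12 @ K).T, f1]]) >> 0]
        cost = cost + cp.trace(Q @ W) + cp.trace(X)
    cp.Problem(cp.Minimize(cost / len(thetas)), cons).solve()
    return K.value

def mm_synthesis(thetas, Q, R, K0, iters=20):
    """Initialize with, e.g., the common Lyapunov gain K0, then iterate;
    by Theorem 4.1 the cost sequence is non-increasing."""
    K = K0
    for _ in range(iters):
        K = mm_step(thetas, Q, R, K)
    return K
```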
4.3 Robustness
Hitherto, we have considered the performance component of the robust control problem, namely minimization of the expected cost; we now address the robust stability requirement. It is desirable for the learned policy K to stabilize every model in the confidence region Θ_c; in fact, this is necessary for the cost ∫_{Θ_c} J(θ, K) p(θ | D) dθ to be finite. Algorithm 1 ensures stability of each of the M sampled systems in Θ̂, which implies that K stabilizes the entire region Θ_c as M → ∞. However, we would like to be able to say something about robustness for finite M. To this end, we make two remarks. First, if closed-loop stability of each sampled model is verified with a common Lyapunov function, then the policy stabilizes the convex hull of the sampled systems:
Theorem 4.2.
Suppose there exists W ≻ 0 such that W ≻ (A_i + B_iK)W(A_i + B_iK)^⊤ for all (A_i, B_i) ∈ Θ̂, i = 1, …, M. Then A + BK ∈ S for all (A, B) ∈ conv(Θ̂), where conv(Θ̂) denotes the convex hull of the sampled systems {(A_i, B_i)}_{i=1}^M.
The proof of Theorem 4.2 is given in A.2.3. The conditions of Theorem 4.2 hold for the common Lyapunov approach in (8), and can be made to hold for Algorithm 1 by introducing an additional Lyapunov stability constraint (with common Lyapunov function) for each sampled system, at the expense of some conservatism. Second, we observe empirically that Algorithm 1 returns policies that very nearly stabilize the entire region Θ_c, despite a very modest number of samples relative to the dimension of the parameter space, c.f. Section 5.1, in particular Figure 2. A number of recent papers have investigated sampling (or grid) based approaches to stability verification of control policies, e.g., [46, 4, 6]. Understanding why policies from Algorithm 1 generalize effectively to the entire region Θ_c is an interesting topic of future research.
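The empirical robustness check described above amounts to a spectral radius test on freshly drawn models; a minimal sketch (the tuple layout of the samples is assumed):

```python
import numpy as np

def fraction_unstable(fresh_thetas, K):
    """Fraction of fresh posterior samples (A, B, ...) for which the
    closed loop A + BK is not Schur stable under the policy u = Kx."""
    bad = sum(np.max(np.abs(np.linalg.eigvals(A + B @ K))) >= 1.0
              for (A, B, *_rest) in fresh_thetas)
    return bad / len(fresh_thetas)
```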
5 Experimental results
5.1 Numerical simulations using synthetic systems
In this section, we study the infinite horizon LQR problem specified by A = toeplitz(c, c), c = [1.01, 0.01, 0, …, 0] ∈ R^n, and B = I,
where toeplitz(c, r) denotes the Toeplitz matrix with first column c and first row r. This is the same problem studied in [14, §6] (for n = 3), where it is noted that such dynamics naturally arise in consensus and distributed averaging problems. To obtain problem data D, each rollout involves simulating (1), with the true parameters, for T time steps, excited by white Gaussian input noise. Note: to facilitate comparison with [14], we too shall assume that Σ_w is known. Furthermore, for all experiments Θ_c will denote the 95% confidence region, as in [14]. We compare the following methods of control synthesis. Existing methods: (i) nominal: standard LQR using the nominal least squares model θ̂; (ii) worst-case: optimize for the worst-case model (95% confidence) subject to robust stability constraints, i.e., the method of [14, §5.2]; (iii) nominal performance with robust stability: enforce the stability constraint from [14, §5.2], but optimize performance for the nominal model θ̂. Proposed method(s): (iv) CL: the common Lyapunov relaxation (8); (v) proposed: the method proposed in this paper, i.e., Algorithm 1. Additional new methods: (vi) alternate-r: initialize with the solution of (iii), and apply the iterative optimization method proposed in Section 4.2, c.f., Remark 4.1; (vii) alternate-s: optimize for the nominal model θ̂, while enforcing (pointwise) stability of the sampled systems in Θ̂. Before proceeding, we wish to emphasize that the different control synthesis methods have different objectives; a lower cost does not mean that the associated method is 'better'. This is particularly true for worst-case, which seeks to optimize performance for the worst possible model so as to bound the cost on the true system.
To evaluate performance, we compare the cost J(θ_tr, K) of applying a learned policy K to the true system θ_tr, to the optimal cost J(θ_tr, K*) achieved by the optimal controller K* (designed with knowledge of θ_tr). We refer to this gap as 'LQR suboptimality.' In Figure 1, LQR suboptimality is shown as a function of the number of rollouts N. We make the following observations. Foremost, the methods that enforce stability 'stochastically' (i.e. pointwise, for each sampled model), namely proposed and alternate-s, attain significantly lower costs than the methods that enforce stability 'robustly'. Furthermore, in situations with very little data, the robust control methods are usually unable to find a stabilizing controller, yet the proposed method finds a stabilizing controller in the majority of trials. Finally, we note that the iterative procedure in proposed (and alternate-s) significantly improves on the common Lyapunov relaxation CL; similarly, alternate-r consistently improves upon method (iii) (as expected).
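The suboptimality metric can be computed by solving a Riccati equation for the true system; below is a sketch in which the ratio form of the metric is our assumption, consistent with the description above, and the learned policy is assumed to be stabilizing.

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

def lqr_suboptimality(A, B, Sigma_w, Q, R, K):
    """Relative cost increase of a learned gain K over the optimal LQR
    gain K* computed with knowledge of the true system (A, B)."""
    P = solve_discrete_are(A, B, Q, R)
    K_star = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # u_t = K* x_t

    def cost(Kc):   # infinite horizon cost (3); assumes A + B Kc is Schur stable
        W = solve_discrete_lyapunov(A + B @ Kc, Sigma_w)
        return np.trace((Q + Kc.T @ R @ Kc) @ W)

    return cost(K) / cost(K_star) - 1.0
```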
It is natural to ask whether the reduction in cost exhibited by proposed (and alternate-s) comes at the expense of robustness, namely, the ability to stabilize a large region of the parameter space. Empirical results suggest that this is not the case. To investigate this, we sample 5000 fresh (i.e. not used for learning) models from Θ_c and check closed-loop stability of each; this is repeated for 50 independent experiments with varying n and N. The median percentage of models that were unstable in closed loop is recorded in Table 1. We make two observations: (i) the proposed method exhibits strong robustness; even for n = 12 (i.e., a 288-dimensional parameter space), it stabilizes nearly all samples from the confidence region, with a modest number of Monte Carlo samples M; (ii) when the robust methods (worst-case, method (iii), alternate-r) are feasible, the resulting policies were found to stabilize 100% of samples; however, for n = 12, these methods were infeasible almost half the time, whereas proposed always returned a policy. Further evidence is provided in Figure 2, which plots robustness and performance as a function of the number of MC samples M: the entire confidence region is stabilized with very high probability even for modest M, suggesting that M → ∞ is not required for robust stability in practice.

Table 1: Median percentage of sampled models that were unstable in closed loop (percentage of trials for which synthesis was infeasible in parentheses).

n   | optimal   | nominal   | worst-case | CL    | proposed | alternate-s
3   | 61.6 (0)  | 28.75 (0) | 0 (0)      | 0 (0) | 0.10 (0) | 1.35 (0)
6   | 95.37 (0) | 58.41 (0) | 0 (0)      | 0 (0) | 0.18 (0) | 1.76 (0)
9   | 99.6 (0)  | 81.9 (0)  | 0 (0)      | 0 (0) | 0.24 (0) | 1.40 (0)
12  | 100 (0)   | 94.28 (0) | 0 (46.0)   | 0 (0) | 0.27 (0) | 1.27 (0)
5.2 Real-world experiments on a rotary inverted pendulum
We now apply the proposed algorithm to the classic problem of stabilizing a (rotary) inverted pendulum, on real (i.e. physical) hardware (Quanser QUBE 2), c.f. A.3 for details. To generate training data, the superposition of a non-stabilizing control signal and a sinusoid of random frequency is applied to the rotary arm motor while the pendulum is inverted. The arm and pendulum angles (along with their velocities) are sampled at 100 Hz until the pendulum angle exceeds a threshold, which takes no more than 5 seconds. This constitutes one rollout. We applied the worst-case, method (iii), and proposed methods to optimize the LQ cost. To generate the error bounds required by worst-case and method (iii), we sample from the 95% confidence region of the posterior, using Gibbs sampling, and take the largest observed deviations from the nominal model. The proposed method used 100 such samples for synthesis. We also applied the least-squares policy iteration method [25], but none of the policies could stabilize the pendulum given the amount of training data. Results are presented in Figure 3, from which we make the following remarks. First, as in Section 5.1, the proposed method achieves high performance (low cost), especially in the low-data regime, where the magnitude of system uncertainty renders the other synthesis methods infeasible. Insight into this performance is offered by Figure 3(b), which indicates that policies from the proposed method stabilize the pendulum with control signals of smaller magnitude. Finally, performance of the proposed method converges after very few rollouts. Data inefficiency is a well-known limitation of RL; understanding and mitigating this inefficiency is the subject of considerable research [14, 43, 15, 38, 20, 21]. Investigating the role that a Bayesian approach to uncertainty quantification plays in the apparent sample efficiency of the proposed method is an interesting topic for further inquiry.
Acknowledgments
This research was financially supported by the Swedish Foundation for Strategic Research (SSF) via the project ASSEMBLE (contract number: RIT15-0012), and via the projects Learning flexible models for nonlinear dynamics (contract number: 2017-03807) and NewLEADS - New Directions in Learning Dynamical Systems (contract number: 621-2016-06079), both funded by the Swedish Research Council.
References
 [1] P. Abbeel and A. Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pages 1–8. ACM, 2005.
 [2] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
[3] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, 2013.
[4] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems (NIPS), 2017.
 [5] D. P. Bertsekas. Dynamic programming and optimal control. Belmont, MA: Athena Scientific, 1995.
[6] R. Bobiti and M. Lazar. A sampling approach to finding Lyapunov functions for nonlinear discrete-time systems. In 2016 European Control Conference (ECC), pages 561–566. IEEE, 2016.
[7] R. Boczar, N. Matni, and B. Recht. Finite-data performance guarantees for the output-feedback control of an unknown system. arXiv preprint arXiv:1803.09186, 2018.
[8] J. B. Burl. Linear optimal control: H2 and H∞ methods. Addison-Wesley Longman Publishing Co., Inc., 1998.
[9] C. K. Carter and R. Kohn. On Gibbs sampling for state space models. Biometrika, 81(3):541–553, 1994.
[10] J. Chen and G. Gu. Control-oriented system identification: an H∞ approach, volume 19. Wiley-Interscience, 2000.
[11] J. Chen and C. N. Nett. The Caratheodory-Fejer problem and H∞ identification: a time domain approach. In Proceedings of the 32nd IEEE Conference on Decision and Control (CDC), pages 68–73, 1993.
 [12] S. H. Cheung and J. L. Beck. Bayesian model updating using hybrid Monte Carlo simulation with application to structural dynamic models with many uncertain parameters. Journal of engineering mechanics, 135(4):243–255, 2009.
 [13] J. Ching, M. Muto, and J. Beck. Bayesian linear structural model updating using Gibbs sampler with modal data. In Proceedings of the 9th International Conference on Structural Safety and Reliability, pages 2609–2616. Millpress, 2005.
 [14] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.
[15] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
[16] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. arXiv preprint arXiv:1710.07283, 2017.
[17] J. Doyle, K. Zhou, K. Glover, and B. Bodenheimer. Mixed H2 and H∞ performance objectives. II. Optimal control. IEEE Transactions on Automatic Control, 39(8):1575–1587, 1994.
[18] J. García and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
[19] P. Geibel and F. Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24:81–108, 2005.
[20] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017.
[21] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
[22] W. M. Haddad, D. S. Bernstein, and D. Mustafa. Mixed-norm H2/H∞ regulation and estimation: The discrete-time case. Systems & Control Letters, 16(4):235–247, 1991.
[23] A. J. Helmicki, C. A. Jacobson, and C. N. Nett. Control oriented system identification: a worst-case/deterministic approach in H∞. IEEE Transactions on Automatic Control, 36(10):1163–1176, 1991.
[24] H. K. Khalil. Nonlinear systems. Prentice-Hall, New Jersey, 1996.
[25] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.
[26] K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1–20, 2000.
[27] O. Mihatsch and R. Neuneier. Risk-sensitive reinforcement learning. Machine Learning, 49(2-3):267–290, 2002.
[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[29] B. Ninness and S. Henriksen. Bayesian system identification via Markov chain Monte Carlo techniques. Automatica, 46(1):40–51, 2010.
[30] M. de Oliveira. MAE 280B: Linear Control Design. Lecture notes, 2009.
[31] C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot. Robust constrained learning-based NMPC enabling reliable mobile robot path tracking. The International Journal of Robotics Research, 35(13):1547–1563, 2016.
 [32] I. R. Petersen, M. R. James, and P. Dupuis. Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control, 45(3):398–412, 2000.
[33] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook. Technical University of Denmark, 2012.
 [34] I. Postlethwaite, M. C. Turner, and G. Herrmann. Robust control applications. Annual Reviews in Control, 31(1):27–39, 2007.
 [35] C. E. Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.
 [36] I. M. Ross, R. J. Proulx, and M. Karpenko. Unscented optimal control for space flight. In 24th International Symposium on Space Flight Dynamics, 2014.
[37] I. M. Ross, R. J. Proulx, M. Karpenko, and Q. Gong. Riemann-Stieltjes optimal control problems for uncertain dynamic systems. Journal of Guidance, Control, and Dynamics, 38(7):1251–1263, 2015.
[38] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[39] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 [40] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht. Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334, 2018.
 [41] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
[42] S. Tu, R. Boczar, A. Packard, and B. Recht. Non-asymptotic analysis of robust control from coarse-grained identification. arXiv preprint arXiv:1707.04791, 2017.
[43] S. Tu and B. Recht. Least-squares temporal difference learning for the linear quadratic regulator. arXiv preprint arXiv:1712.08642, 2017.
 [44] F. Vaida. Parameter convergence for EM and MM algorithms. Statistica Sinica, pages 831–840, 2005.
 [45] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM review, 38(1):49–95, 1996.
[46] J. Vinogradska, B. Bischoff, D. Nguyen-Tuong, A. Romer, H. Schmidt, and J. Peters. Stability of controllers for Gaussian process forward models. In International Conference on Machine Learning, pages 545–554, 2016.
 [47] Y.S. Wang, N. Matni, and J. C. Doyle. A system level approach to controller synthesis. arXiv preprint arXiv:1610.04815, 2016.
 [48] A. Wills, T. B. Schön, F. Lindsten, and B. Ninness. Estimation of linear systems using a Gibbs sampler. IFAC Proceedings Volumes, 45(16):203–208, 2012.
 [49] K. Zhou and J. C. Doyle. Essentials of robust control, volume 104. Prentice Hall Upper Saddle River, NJ, 1998.
Appendix A Supplementary material
A.1 Sampling from the posterior distribution
A.1.1 Case I: Σ_w known
First, consider the case where Σ_w is known; i.e., θ = {A, B}. From Bayes' rule, we have

p(θ | D) ∝ p(D | θ) p(θ),   (12)

where p(D | θ) denotes the likelihood and p(θ) the prior. When Σ_w is known, we can express the likelihood in a form equivalent to an (unnormalized) Gaussian distribution over vec(Θ), Θ := [A, B], i.e.,

p(D | θ) = ∏_t N(x_{t+1}; Ax_t + Bu_t, Σ_w)
         ∝ exp(−(1/2) Σ_t (x_{t+1} − Θφ_t)^⊤ Σ_w^{-1} (x_{t+1} − Θφ_t)),   φ_t := [x_t; u_t],
         ∝ exp(−(1/2) (vec(Θ) − μ)^⊤ Λ (vec(Θ) − μ)),   (13)

where μ := vec(Θ̂) with Θ̂ := (Σ_t x_{t+1}φ_t^⊤)(Σ_t φ_tφ_t^⊤)^{-1} the least squares estimate, Λ := (Σ_t φ_tφ_t^⊤) ⊗ Σ_w^{-1}, and the sums run over all observed transitions in D. This implies that p(D | θ) ∝ N(vec(Θ); μ, Λ^{-1}). Therefore, when the prior is noninformative (p(θ) ∝ 1) or Gaussian (self-conjugate), the posterior is also Gaussian.
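Concretely, sampling from this Gaussian posterior (under a flat prior) is a matrix normal draw around the least squares estimate; a sketch, using the rollout data layout from Section 3 and column-major vectorization:

```python
import numpy as np

def sample_posterior_known_noise(data, Sigma_w, n_draws, rng=None):
    """Draw Theta = [A, B] from the Gaussian posterior (13) under a flat
    prior: vec(Theta) ~ N(vec(Theta_hat), (Phi Phi')^{-1} kron Sigma_w)."""
    rng = np.random.default_rng() if rng is None else rng
    Phi = np.hstack([np.vstack([x[:-1].T, u.T]) for (x, u) in data])  # [x_t; u_t]
    Y = np.hstack([x[1:].T for (x, u) in data])                       # targets x_{t+1}
    G = Phi @ Phi.T
    Theta_hat = Y @ Phi.T @ np.linalg.inv(G)        # least squares estimate
    L_row = np.linalg.cholesky(Sigma_w)             # row covariance factor
    L_col = np.linalg.cholesky(np.linalg.inv(G))    # column covariance factor
    n = Y.shape[0]
    draws = []
    for _ in range(n_draws):
        E = rng.standard_normal(Theta_hat.shape)
        Theta = Theta_hat + L_row @ E @ L_col.T     # matrix normal draw
        draws.append((Theta[:, :n], Theta[:, n:]))  # split into (A, B)
    return draws
```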
A.1.2 Case II: Σ_w unknown
Next, consider the generic case in which all parameters θ = {A, B, Σ_w} are unknown. Then p(θ | D) ∝ p(D | A, B, Σ_w) p(A, B, Σ_w). One approach to sampling from the posterior involves Gibbs sampling [9], i.e., alternating between the following two sampling steps:

{A, B} ~ p(A, B | Σ_w, D),   (14)
Σ_w ~ p(Σ_w | A, B, D),   (15)

to form the Markov chain {A[k], B[k], Σ_w[k]}_{k≥1}. As demonstrated in A.1.1, the distribution p(A, B | Σ_w, D) is Gaussian, so sampling is straightforward. To sample from p(Σ_w | A, B, D), first note

p(Σ_w | A, B, D) ∝ p(D | A, B, Σ_w) p(Σ_w) ∝ |Σ_w|^{−T/2} exp(−(1/2) tr(Σ_w^{-1}S)) p(Σ_w),   (16)

where S := Σ_t e_te_t^⊤ with e_t := x_{t+1} − Ax_t − Bu_t, and T denotes the total number of observed transitions in D.
Observe that, under a flat prior p(Σ_w) ∝ 1, (16) takes the form of an inverse Wishart distribution IW(T − n − 1, S), where IW denotes the inverse Wishart distribution. Note, if T is small then this is not a valid inverse Wishart distribution (the degrees of freedom must exceed n − 1). However, we may consider a prior such as p(Σ_w) ∝ |Σ_w|^{−(n+1)/2} (Jeffreys' noninformative prior), which means

Σ_w | A, B, D ~ IW(T, S),   (17)

where S is as defined in (16). This is a well-defined inverse Wishart distribution, sampling from which is straightforward.
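One Gibbs sweep, combining (14)-(15) with the inverse Wishart conditional (17) under Jeffreys' prior, might then look as follows (a sketch; it reuses sample_posterior_known_noise from A.1.1 and scipy's inverse Wishart sampler):

```python
import numpy as np
from scipy.stats import invwishart

def gibbs_sweep(data, A, B, rng=None):
    """One iteration of the Gibbs sampler: draw Sigma_w | A, B, D from the
    inverse Wishart (17), then (A, B) | Sigma_w, D from the Gaussian (13)."""
    rng = np.random.default_rng() if rng is None else rng
    # residuals e_t = x_{t+1} - A x_t - B u_t, stacked over all rollouts
    E = np.hstack([x[1:].T - A @ x[:-1].T - B @ u.T for (x, u) in data])
    S = E @ E.T                       # residual outer product
    T_tot = E.shape[1]                # total number of observed transitions
    Sigma_w = invwishart.rvs(df=T_tot, scale=S, random_state=rng)
    A_new, B_new = sample_posterior_known_noise(data, Sigma_w, 1, rng)[0]
    return A_new, B_new, Sigma_w
```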
A.2 Proofs
A.2.1 Proof of Lemma 4.1
Lemma.
f(X) ⪰ f̃₁(X) for all X ≻ 0, where f̃₁ denotes the first-order Taylor series expansion of f(X) := X^{-1} about X₁ ≻ 0.
Proof.
Let X = LL^⊤, i.e., L is the Cholesky factorization of X ≻ 0. Then,

f(X) ⪰ f̃₁(X) ⟺ X^{-1} ⪰ 2X₁^{-1} − X₁^{-1}XX₁^{-1}
⟺ I ⪰ 2L^⊤X₁^{-1}L − (L^⊤X₁^{-1}L)^2
⟺ (I − L^⊤X₁^{-1}L)^2 ⪰ 0,

where the penultimate equivalence follows by conjugation with L (c.f. the matrix inversion identities of [33, eq. 159]), and the final inequality holds trivially, as the square of a symmetric matrix is positive semidefinite. ∎
A.2.2 Proof of Theorem 4.1
Theorem.
Let J̃ be defined as in (11), with linearization points {W₁^i}_{i=1}^M such that Ĵ(K₁) is finite. Then J̃ is a convex upper bound on Ĵ, i.e., J̃(K) ≥ Ĵ(K) for all K. Furthermore, the bound is 'tight' at K₁, i.e., J̃(K₁) = Ĵ(K₁).
Proof.
We will prove that J̃(θ, K) is a tight convex upper bound on J(θ, K) for each θ, as this implies that J̃ is a tight convex upper bound on Ĵ. First, we prove convexity. Via Lagrangian duality, J̃(θ, ·) can be expressed as the supremum over an infinite family of convex functions over a compact convex set, and is therefore itself a convex function. Note: the minimum of a linear function subject to convex constraints, as in (10), can be expressed as the supremum of its Lagrange dual. Next, we prove the upper bound. From Lemma 4.1, f̃₁(W) ⪯ W^{-1}, so the constraint (10b) is tighter than the Schur complement form of (9b); the feasible set of (10) is therefore contained in that of (9), whence J̃(θ, K) ≥ J(θ, K).