Learning convex bounds for linear quadratic control policy synthesis

06/01/2018, by Jack Umenberger et al., Uppsala universitet

Learning to make decisions from observed data in dynamic environments remains a problem of fundamental importance in a number of fields, from artificial intelligence and robotics, to medicine and finance. This paper concerns the problem of learning control policies for unknown linear dynamical systems so as to maximize a quadratic reward function. We present a method to optimize the expected value of the reward over the posterior distribution of the unknown system parameters, given data. The algorithm involves sequential convex programming, and enjoys reliable local convergence and robust stability guarantees. Numerical simulations and stabilization of a real-world inverted pendulum are used to demonstrate the approach, with strong performance and robustness properties observed in both.

1 Introduction

Decision making for dynamical systems in the presence of uncertainty is a problem of great prevalence and importance, as well as considerable difficulty, especially when knowledge of the dynamics is available only via limited observations of system behavior. In machine learning, the data-driven search for a control policy to maximize the expected reward attained by a stochastic dynamic process is known as reinforcement learning (RL) [41]. Despite remarkable recent success in games [28, 39], a major obstacle to the deployment of RL-based control on physical systems (e.g. robots and self-driving cars) is the issue of robustness, i.e., guaranteed safe and reliable operation. With the necessity of such guarantees widely acknowledged [2], so-called 'safe RL' remains an active area of research [18].

The problem of robust automatic decision making for uncertain dynamical systems has also been the subject of intense study in the area of robust control (RC) [49]. In RC, one works with a set of plausible models and seeks a control policy that is guaranteed to stabilize all models within the set. In addition, there is a performance objective to optimize, i.e., a reward to be maximized or, equivalently, a cost to be minimized. Such cost functions are usually defined with reference to either a nominal model [17, 22] or the worst-case model [32] in the set. RC has been extremely successful in a number of engineering applications [34]; however, as has been noted in, e.g., [43, 31], robustness may (understandably) come at the expense of performance, particularly for worst-case design.

The problem we address in this paper lies at the intersection of reinforcement learning and robust control, and can be summarized as follows: given observations from an unknown dynamical system, we seek a policy to optimize the expected cost (as in RL), subject to certain robust stability guarantees (as in RC). Specifically, we focus our attention on control of linear time-invariant dynamical systems, subject to Gaussian disturbances, with the goal of minimizing a quadratic function penalizing state deviations and control action. When the system is known, this is the classical linear quadratic regulator (LQR), a.k.a. $\mathcal{H}_2$, optimal control problem [8]. We are interested in the setting in which the system is unknown, and knowledge of the dynamics must be inferred from observed data.

Contributions and paper structure

The principal contribution of this paper is an algorithm to optimize the expected value of the linear quadratic regulator reward/cost function, where the expectation is w.r.t. the posterior distribution of unknown system parameters, given observed data; cf. Section 3 for a detailed problem formulation. Specifically, we construct a sequence of convex approximations (upper bounds) to the expected cost, which can be optimized via semidefinite programming [45]. The algorithm, developed in Section 4, invokes the majorize-minimization (MM) principle [26], and consequently enjoys reliable convergence to local optima. An important part of our contribution lies in guarantees on the robust stability properties of the resulting control policies, cf. Section 4.3. We demonstrate the proposed method via two experimental case studies: i) the benchmark problem on simulated systems considered in [14, 43], and ii) stabilization of a real-world inverted pendulum. Strong performance and robustness properties are observed in both. Moving forward, from a machine learning perspective this work contributes to the growing body of research concerned with ensuring robustness in RL, cf. Section 2. From a control perspective, this work appropriates cost functions more commonly found in RL (namely, expected reward) to an RC setting, with the objective of reducing conservatism of the resulting robust control policies.

2 Related work

Incorporating various notions of 'robustness' into RL has long been an area of active research [18]. In so-called 'safe RL', one seeks to respect certain safety constraints during exploration and/or policy optimization, for example, avoiding undesirable regions of the state-action space [19, 1]. A related problem is addressed in 'risk-sensitive RL', in which the search for a policy takes both the expected value and the variance of the reward into account [27, 16]. Recently, there has been increased interest in notions of robustness more commonly considered in control theory, chiefly stability [31, 3]. Of particular relevance is the work of [4], which employs Lyapunov theory [24] to verify stability of learned policies. Like the present paper, [4] adopts a Bayesian framework; however, [4] makes use of Gaussian processes [35] to model the uncertain nonlinear dynamics, which are assumed to be deterministic. A major difference between [4] and our work is the cost function: in the former the policy is selected by optimizing for worst-case performance, whereas we optimize the expected cost.

Robustness of data-driven control has also been the focus of a recently developed family of methods referred to as 'coarse-ID control', cf. [42, 14, 7, 40], in which finite-data bounds on the accuracy of the least squares estimator are combined with modern robust control tools, such as system level synthesis [47]. Coarse-ID builds upon so-called '$\mathcal{H}_\infty$ identification' methods for learning models of dynamical systems, along with error bounds that are compatible with robust synthesis methods [23, 11, 10]. $\mathcal{H}_\infty$ identification assumes an adversarial (i.e. worst-case) disturbance model, whereas Coarse-ID is applicable to probabilistic models, such as those considered in the present paper. Of particular relevance to the present paper is [14], which provides sample complexity bounds on the performance of robust control synthesis for the infinite horizon LQR problem, when the true system is not known. Such bounds necessarily consider the worst-case model, given the observed data, whereas we are concerned with the expected cost over the posterior distribution of models. In closing, we briefly mention the so-called 'Riemann-Stieltjes' class of optimal control problems, for uncertain continuous-time dynamical systems, cf., e.g., [37, 36]. Such problems often arise in aerospace applications (e.g. satellite control) where the objective is to design an open-loop control signal (e.g. for an orbital maneuver) rather than a feedback policy.

3 Problem formulation

In this section we describe in detail the specific problem that we address in this paper. The following notation is used: $\mathbb{S}^{n}$ denotes the set of $n \times n$ symmetric matrices; $\mathbb{S}^{n}_{+}$ ($\mathbb{S}^{n}_{++}$) denotes the cone of positive semidefinite (positive definite) matrices. $A \succeq B$ denotes $A - B \in \mathbb{S}_{+}$, and similarly for $\succ$, $\preceq$ and $\prec$. The trace of $A$ is denoted $\mathrm{tr}\,A$. The transpose of $A$ is denoted $A^{\top}$. $x_{1:T}$ is shorthand for the sequence $\{x_t\}_{t=1}^{T}$. The convex hull of a set $\mathcal{X}$ is denoted $\mathrm{hull}(\mathcal{X})$. The set of Schur stable matrices is denoted $\mathcal{S}$.

Dynamics, reward function and policies

We are concerned with control of discrete linear time-invariant dynamical systems of the form

$x_{t+1} = A x_t + B u_t + w_t, \qquad w_t \sim \mathcal{N}(0, \Pi), \qquad$ (1)

where $x_t \in \mathbb{R}^{n_x}$, $u_t \in \mathbb{R}^{n_u}$, and $w_t \in \mathbb{R}^{n_x}$ denote the state, input, and unobserved exogenous disturbance at time $t$, respectively. Let $\theta := \{A, B, \Pi\}$. Our objective is to design a feedback control policy $u_t = \phi(x_t)$ that minimizes the cost function $\lim_{T \to \infty} \tfrac{1}{T}\, \mathbb{E}\big[ \sum_{t=1}^{T} x_t^{\top} Q x_t + u_t^{\top} R u_t \big]$, where $x_t$ evolves according to (1), and $Q \succeq 0$ and $R \succ 0$ are user-defined weight matrices. A number of different parametrizations of the policy $\phi$ have been considered in the literature, from neural networks (popular in RL, e.g., [4]) to causal (typically linear) dynamical systems (common in RC, e.g., [32]). In this paper, we will restrict our attention to static-gain policies of the form $u_t = K x_t$, where $K \in \mathbb{R}^{n_u \times n_x}$ is constant. As noted in [14], controller synthesis and implementation are simpler (and more computationally efficient) for such policies. When the parameters of the true system, denoted $\theta_{\mathrm{tr}}$, are known, this is the infinite horizon LQR problem, the optimal solution of which is well known [5]. We assume that $\theta_{\mathrm{tr}}$ is unknown; rather, our knowledge of the dynamics must be inferred from observed sequences of inputs and states.

Observed data

We adopt the data-driven setup used in [14], and assume that $\mathcal{D} = \{ u^{(i)}_{1:T},\, x^{(i)}_{1:T+1} \}_{i=1}^{N}$, where $x^{(i)}_{1:T+1}$ is the observed state sequence attained by evolving the true system for $T$ time steps, starting from an arbitrary initial state and driven by an arbitrary input sequence $u^{(i)}_{1:T}$. Each of these $N$ independent experiments is referred to as a rollout. We perform parameter inference in the offline/batch setting; i.e., all data is assumed to be available at the time of controller synthesis.
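To make this data-generation setup concrete, the following sketch simulates a handful of rollouts of (1) under random excitation. The system matrices, rollout length, and input scale below are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def generate_rollouts(A, B, Pi, T, N, input_std=1.0, rng=None):
    """Simulate N independent rollouts of x_{t+1} = A x_t + B u_t + w_t,
    w_t ~ N(0, Pi), driven by i.i.d. Gaussian inputs (a hypothetical choice)."""
    rng = np.random.default_rng() if rng is None else rng
    n_x, n_u = A.shape[0], B.shape[1]
    L = np.linalg.cholesky(Pi)                      # for sampling w_t ~ N(0, Pi)
    data = []
    for _ in range(N):
        x = np.zeros((T + 1, n_x))                  # arbitrary initial state (here: origin)
        u = input_std * rng.standard_normal((T, n_u))
        for t in range(T):
            w = L @ rng.standard_normal(n_x)
            x[t + 1] = A @ x[t] + B @ u[t] + w
        data.append((u, x))                         # one rollout: (u_{1:T}, x_{1:T+1})
    return data

# Example usage with a small, hypothetical 2-state system.
A_tr = np.array([[1.0, 0.1], [0.0, 1.0]])
B_tr = np.array([[0.0], [0.1]])
Pi_tr = 0.01 * np.eye(2)
D = generate_rollouts(A_tr, B_tr, Pi_tr, T=50, N=5)
```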

Optimization objective

Given observed data $\mathcal{D}$ and, possibly, prior knowledge of the system, we have the posterior distribution over the model parameters, denoted $p(\theta \mid \mathcal{D})$, in place of the true parameters $\theta_{\mathrm{tr}}$. The function that we seek to minimize is the expected cost w.r.t. the posterior distribution, i.e.,

$\mathbb{E}_{p(\theta \mid \mathcal{D})}\!\left[\, \lim_{T \to \infty} \tfrac{1}{T}\, \mathbb{E}\big[ \textstyle\sum_{t=1}^{T} x_t^{\top} Q x_t + u_t^{\top} R u_t \big] \right]. \qquad$ (2)

In practice, the support of $p(\theta \mid \mathcal{D})$ almost surely contains parameters $\theta$ that are unstabilizable, which implies that (2) is infinite. Consequently, we shall consider averages over confidence regions w.r.t. $p(\theta \mid \mathcal{D})$. For convenience, let us denote the infinite horizon LQR cost, for given system parameters $\theta$ and policy $K$, by

$J(\theta, K) := \lim_{T \to \infty} \tfrac{1}{T}\, \mathbb{E}\Big[ \textstyle\sum_{t=1}^{T} x_t^{\top}\big(Q + K^{\top} R K\big) x_t \Big], \qquad u_t = K x_t, \qquad$ (3a)
$\phantom{J(\theta, K)} = \mathrm{tr}(\Pi X) \ \text{ if } A + BK \in \mathcal{S}, \text{ where } X = (A+BK)^{\top} X (A+BK) + Q + K^{\top} R K; \ \ J(\theta, K) = \infty \text{ otherwise}, \qquad$ (3b)

where the second equality follows from standard Gramian calculations, and $\mathcal{S}$ denotes the set of Schur stable matrices. As an alternative to (2), we consider the cost function $\int_{\Theta_c} J(\theta, K)\, p(\theta \mid \mathcal{D})\, d\theta$, where $\Theta_c$ denotes the $100c\%$ confidence region of the parameter space w.r.t. the posterior $p(\theta \mid \mathcal{D})$. Though better suited to optimization than (2), which is almost surely infinite, this integral cannot be evaluated in closed form, due to the complexity of $J(\theta, K)$ w.r.t. $\theta$. Furthermore, there is still no guarantee that $\Theta_c$ contains only stabilizable models. To circumvent both of these issues, we propose the following Monte Carlo approximation:

$\bar{J}(K) := \tfrac{1}{M} \textstyle\sum_{i=1}^{M} J(\theta_i, K), \qquad \theta_i \sim p(\theta \mid \mathcal{D}), \ \ \theta_i \in \Theta_c \cap \Theta_s, \qquad$ (4)

where $\Theta_s$ denotes the set of stabilizable parameters $\theta$.
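To make (3) and (4) concrete, the sketch below evaluates the infinite horizon cost for a given model and gain by solving the discrete Lyapunov equation (returning infinity when the closed loop is not Schur stable), and averages it over sampled models. The function names are illustrative, not from the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_cost(A, B, Pi, Q, R, K):
    """Infinite horizon LQR cost tr(Pi P) of the policy u = K x, where P solves
    P = (A+BK)' P (A+BK) + Q + K'RK; returns inf if A+BK is not Schur stable."""
    A_cl = A + B @ K
    if np.max(np.abs(np.linalg.eigvals(A_cl))) >= 1.0:
        return np.inf
    # solve_discrete_lyapunov(a, q) solves a X a' - X + q = 0; here a = A_cl'.
    P = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
    return float(np.trace(Pi @ P))

def monte_carlo_cost(models, Q, R, K):
    """Monte Carlo objective in the spirit of (4): average cost over sampled
    models, each given as a tuple (A_i, B_i, Pi_i)."""
    return np.mean([lqr_cost(A, B, Pi, Q, R, K) for A, B, Pi in models])
```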

Posterior distribution

Given data $\mathcal{D}$, the parameter posterior distribution is given by Bayes' rule:

$p(\theta \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \propto p(\mathcal{D} \mid \theta)\, p(\theta), \qquad$ (5)

where $p(\theta)$ denotes our prior belief on the parameters $\theta = \{A, B, \Pi\}$, and $\tilde{p}(\theta \mid \mathcal{D}) := p(\mathcal{D} \mid \theta)\, p(\theta)$ denotes the unnormalized posterior. To sample from $p(\theta \mid \mathcal{D})$, we can distinguish between two different cases. First, consider the case when $\Pi$ is known or can be reliably estimated independently of $\{A, B\}$. This is the setting in, e.g., [14]. In this case, the likelihood can be equivalently expressed as a Gaussian distribution over $\{A, B\}$. Then, when the prior is uniform (i.e. non-informative) or Gaussian (self-conjugate), the posterior is also Gaussian, cf. Appendix A.1.1. Second, consider the general case in which $\Pi$, along with $\{A, B\}$, is unknown. In this setting, one can select from a number of methods adapted for Bayesian inference in dynamical systems, such as Metropolis-Hastings [29], Hamiltonian Monte Carlo [12], and Gibbs sampling [13, 48]. When one places a non-informative prior on $\Pi$ (e.g., Jeffreys' prior), each iteration of a Gibbs sampler targeting $p(\theta \mid \mathcal{D})$ requires sampling from either a Gaussian or an inverse Wishart distribution, for which reliable numerical methods exist; cf. Appendix A.1.2. In both of these cases we can sample from $p(\theta \mid \mathcal{D})$ and evaluate $\tilde{p}(\theta \mid \mathcal{D})$ point-wise. To draw $\theta \sim p(\theta \mid \mathcal{D})$, $\theta \in \Theta_c \cap \Theta_s$, as in (4), we can first draw a large number of samples from $p(\theta \mid \mathcal{D})$, discard the $100(1-c)\%$ of samples with the lowest unnormalized posterior values, and then further discard any samples that happen to be unstabilizable. For convenience, we define $\hat\Theta := \{\theta_i\}_{i=1}^{M}$, which should be interpreted as a set of $M$ realizations of this procedure for sampling $\theta$.
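A minimal sketch of this draw-then-filter procedure follows, assuming a posterior sampler and an unnormalized log-posterior evaluator are provided by one of the inference methods above; the stabilizability check uses a PBH-type test, and all names are illustrative.

```python
import numpy as np

def is_stabilizable(A, B, tol=1e-9):
    """PBH test: (A, B) is stabilizable iff rank([A - lam*I, B]) = n_x
    for every eigenvalue lam of A with |lam| >= 1."""
    n_x = A.shape[0]
    for lam in np.linalg.eigvals(A):
        if abs(lam) >= 1.0 - tol:
            M = np.hstack([A - lam * np.eye(n_x), B])
            if np.linalg.matrix_rank(M, tol=1e-9) < n_x:
                return False
    return True

def sample_models(sample_posterior, log_unnorm_posterior, num_samples, confidence=0.95):
    """Draw candidate models theta = (A, B, Pi), keep the `confidence` fraction
    with the highest unnormalized posterior, and discard unstabilizable ones.
    `sample_posterior` / `log_unnorm_posterior` are assumed to come from one of
    the inference methods of Section 3 (e.g. a Gibbs sampler)."""
    thetas = [sample_posterior() for _ in range(num_samples)]
    scores = np.array([log_unnorm_posterior(th) for th in thetas])
    keep = np.argsort(scores)[int((1.0 - confidence) * num_samples):]
    return [thetas[i] for i in keep if is_stabilizable(thetas[i][0], thetas[i][1])]
```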

Summary

We seek the solution of the optimization problem $K^{\star} \in \arg\min_{K} \bar{J}(K)$, for $\bar{J}(K)$ as defined in (4).

4 Solution via semidefinite programming

In this section we present the principal contribution of this paper: a method for solving this problem via convex (semidefinite) programming (SDP). It is convenient to consider an equivalent representation,

$\min_{K,\ \{X_i \succ 0\}_{i=1}^{M}}\ \ \tfrac{1}{M} \textstyle\sum_{i=1}^{M} \mathrm{tr}(\Pi_i X_i)$ (6a)
s.t. $X_i \succeq (A_i + B_i K)^{\top} X_i (A_i + B_i K) + Q + K^{\top} R K, \quad i = 1, \dots, M,$ (6b)

where the Comparison Lemma [30, Lecture 2] has been used to replace the equality in (3b) with the inequality in (6b). We introduce the notation $\mathcal{X}_{\epsilon,\beta} := \{X \in \mathbb{S}^{n_x} : \epsilon I \preceq X \preceq \beta I\}$, where $\epsilon$ and $\beta$ are arbitrarily small and large positive constants, respectively. $\mathcal{X}_{\epsilon,\beta}$ serves as a compact approximation of the cone of positive definite matrices, suitable for use with SDP solvers, i.e., strict matrix inequalities are avoided.

4.1 Common Lyapunov relaxation

The principal challenge in solving (6) is that the constraint (6b) is not jointly convex in $K$ and $X_i$. The usual approach to circumventing this nonconvexity is to first apply the Schur complement to (6b), and then conjugate by the matrix $\mathrm{blkdiag}(X_i^{-1}, X_i^{-1}, I, I)$, which leads to the equivalent constraint

$\begin{bmatrix} X_i^{-1} & X_i^{-1}(A_i + B_i K)^{\top} & X_i^{-1} Q^{1/2} & X_i^{-1} K^{\top} R^{1/2} \\ (A_i + B_i K) X_i^{-1} & X_i^{-1} & 0 & 0 \\ Q^{1/2} X_i^{-1} & 0 & I & 0 \\ R^{1/2} K X_i^{-1} & 0 & 0 & I \end{bmatrix} \succeq 0. \qquad$ (7)

With the change of variables $Y_i = X_i^{-1}$ and $L_i = K X_i^{-1}$, (7) becomes an LMI in $Y_i$ and $L_i$. This approach is effective when $M = 1$ (i.e. we have a single nominal system, as in standard LQR). However, when $M > 1$ we cannot introduce a new $\{Y_i, L_i\}$ for each $\theta_i$, as we lose uniqueness of the controller in $K = L_i Y_i^{-1}$, i.e., in general $L_i Y_i^{-1} \neq L_j Y_j^{-1}$ for $i \neq j$. One strategy (prevalent in robust control, e.g., [14, §C]) is to employ a 'common Lyapunov function', i.e., $Y_i = Y$ and $L_i = L$ for all $i$. This gives the following convex relaxation (upper bound) of problem (6),

$\min_{Y \succ 0,\ L,\ \{W_i\}}\ \ \tfrac{1}{M} \textstyle\sum_{i=1}^{M} \mathrm{tr}(W_i)$ (8a)
s.t. $\begin{bmatrix} W_i & \Sigma_i^{\top} \\ \Sigma_i & Y \end{bmatrix} \succeq 0, \quad \begin{bmatrix} Y & (A_i Y + B_i L)^{\top} & Y Q^{1/2} & L^{\top} R^{1/2} \\ A_i Y + B_i L & Y & 0 & 0 \\ Q^{1/2} Y & 0 & I & 0 \\ R^{1/2} L & 0 & 0 & I \end{bmatrix} \succeq 0, \quad i = 1, \dots, M,$ (8b)

where $\Sigma_i$ denotes the Cholesky factorization of $\Pi_i$, i.e., $\Pi_i = \Sigma_i \Sigma_i^{\top}$, and $\{W_i\}_{i=1}^{M}$ are slack variables used to encode the cost (6a) under the change of variables, i.e., $\mathrm{tr}(W_i) \geq \mathrm{tr}(\Sigma_i^{\top} Y^{-1} \Sigma_i) = \mathrm{tr}(\Pi_i Y^{-1})$. The policy is recovered as $K = L Y^{-1}$. The approximation in (8) is highly conservative, which motivates the iterative local optimization method presented in Section 4.2. Nevertheless, (8) provides a principled way (i.e., a one-shot convex program) to initialize that iterative search.
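Assuming (8) takes the standard common-Lyapunov LQR form written above (shared $Y = X^{-1}$, $L = KY$, slacks $W_i$ for the cost), a cvxpy sketch of the one-shot relaxation might look as follows. This is an illustrative implementation under those assumptions, not the authors' code; function and variable names are placeholders.

```python
import numpy as np
import cvxpy as cp
from scipy.linalg import sqrtm

def common_lyapunov_synthesis(models, Q, R):
    """Common-Lyapunov relaxation: one (Y, L) shared by all sampled models.
    `models` is a list of (A_i, B_i, Pi_i) tuples; returns K = L Y^{-1}."""
    n_x, n_u = models[0][0].shape[0], models[0][1].shape[1]
    Qh, Rh = np.real(sqrtm(Q)), np.real(sqrtm(R))
    Y = cp.Variable((n_x, n_x), symmetric=True)
    L = cp.Variable((n_u, n_x))
    Ws = [cp.Variable((n_x, n_x), symmetric=True) for _ in models]
    cons = [Y >> 1e-6 * np.eye(n_x)]
    for (A, B, Pi), W in zip(models, Ws):
        Sig = np.linalg.cholesky(Pi)                      # Pi = Sig Sig'
        AclY = A @ Y + B @ L
        cons += [cp.bmat([[W, Sig.T], [Sig, Y]]) >> 0]    # tr(W) >= tr(Pi Y^{-1})
        cons += [cp.bmat([
            [Y,        AclY.T,                (Qh @ Y).T,            (Rh @ L).T],
            [AclY,     Y,                     np.zeros((n_x, n_x)),  np.zeros((n_x, n_u))],
            [Qh @ Y,   np.zeros((n_x, n_x)),  np.eye(n_x),           np.zeros((n_x, n_u))],
            [Rh @ L,   np.zeros((n_u, n_x)),  np.zeros((n_u, n_x)),  np.eye(n_u)],
        ]) >> 0]
    objective = cp.Minimize(sum(cp.trace(W) for W in Ws) / len(models))
    cp.Problem(objective, cons).solve(solver=cp.SCS)
    return L.value @ np.linalg.inv(Y.value)
```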

4.2 Iterative improvement by sequential semidefinite programming

To develop this iterative search method, first consider an equivalent representation of $J(\theta, K)$:

$J(\theta, K) = \min_{X \in \mathcal{X}_{\epsilon,\beta}}\ \mathrm{tr}(\Pi X)$ (9a)
s.t. $\begin{bmatrix} X - Q & (A + BK)^{\top} & K^{\top} R^{1/2} \\ A + BK & X^{-1} & 0 \\ R^{1/2} K & 0 & I \end{bmatrix} \succeq 0.$ (9b)

This representation highlights the nonconvexity of (9b) due to the $X^{-1}$ term, which was addressed (in the usual way) by a change of variables in Section 4.1. In this section, we will instead replace $X^{-1}$ with a linear approximation and prove that this leads to a tight convex upper bound. Given $X \succ 0$, let $\mathcal{T}(X; X_0)$ denote the first-order (i.e. linear) Taylor series approximation of $X^{-1}$ about some nominal $X_0 \succ 0$, i.e., $\mathcal{T}(X; X_0) := 2X_0^{-1} - X_0^{-1} X X_0^{-1}$. We now define the function

$\hat{J}(\theta, K; K_0) := \min_{X \in \mathcal{X}_{\epsilon,\beta}}\ \mathrm{tr}(\Pi X)$ (10a)
s.t. $\begin{bmatrix} X - Q & (A + BK)^{\top} & K^{\top} R^{1/2} \\ A + BK & \mathcal{T}(X; X_0) & 0 \\ R^{1/2} K & 0 & I \end{bmatrix} \succeq 0,$ (10b)

where $X_0$ is any $X$ that achieves the minimum in (9), with $K = K_0$ for some nominal policy $K_0$, i.e., $J(\theta, K_0) = \mathrm{tr}(\Pi X_0)$. Analogously to (4), we define

$\hat{\bar{J}}(K; K_0) := \tfrac{1}{M} \textstyle\sum_{i=1}^{M} \hat{J}(\theta_i, K; K_0). \qquad$ (11)

We now show that $\hat{\bar{J}}(K; K_0)$ is a convex upper bound on $\bar{J}(K)$, which is tight at $K = K_0$. The proof is given in A.2.2 and makes use of the following technical lemma (cf. A.2.1 for the proof).

Lemma 4.1.

$X^{-1} \succeq \mathcal{T}(X; X_0)$ for all $X, X_0 \succ 0$, where $\mathcal{T}(X; X_0) = 2X_0^{-1} - X_0^{-1} X X_0^{-1}$ denotes the first-order Taylor series expansion of $X^{-1}$ about $X_0$.

Theorem 4.1.

Let $\hat{\bar{J}}(K; K_0)$ be defined as in (11), with $K_0$ such that $\bar{J}(K_0)$ is finite. Then $\hat{\bar{J}}(K; K_0)$ is a convex upper bound on $\bar{J}(K)$, i.e., $\hat{\bar{J}}(K; K_0) \geq \bar{J}(K)$ for all $K$. Furthermore, the bound is 'tight' at $K_0$, i.e., $\hat{\bar{J}}(K_0; K_0) = \bar{J}(K_0)$.

Iterative algorithm

To improve upon the common Lyapunov solution given by (8), we can solve a sequence of convex optimization problems, $K_{k+1} = \arg\min_{K} \hat{\bar{J}}(K; K_k)$; cf. Algorithm 1 for details. This procedure of optimizing tight surrogate functions in lieu of the actual objective function is an example of the 'majorize-minimization (MM) principle', a.k.a. optimization transfer [26]. MM algorithms enjoy good numerical robustness, and (with the exception of some pathological cases) reliable convergence to local minima [44]. Indeed, it is readily verified that $\bar{J}(K_k) = \hat{\bar{J}}(K_k; K_k) \geq \hat{\bar{J}}(K_{k+1}; K_k) \geq \bar{J}(K_{k+1})$, where the equality follows from tightness of the bound, the first inequality from the minimization in the update, and the second inequality is due to the fact that $\hat{\bar{J}}$ is an upper bound. This implies that $\{\bar{J}(K_k)\}_{k \geq 0}$ is a non-increasing, and hence convergent, sequence.

1: Input: observed data $\mathcal{D}$, confidence $c$, LQR cost matrices $Q$ and $R$, number of particles in the Monte Carlo approximation $M$, convergence tolerance $\epsilon_{\mathrm{tol}}$.
2: Generate $M$ samples from $p(\theta \mid \mathcal{D})$, i.e., $\hat\Theta = \{\theta_i\}_{i=1}^{M}$, using the appropriate Bayesian inference method from Section 3.
3: Solve (8). Let $\{Y^{\star}, L^{\star}\}$ denote the optimal solution of (8). Set $K_0 = L^{\star}(Y^{\star})^{-1}$, $k = 0$, and $\Delta = \infty$.
4: while $\Delta > \epsilon_{\mathrm{tol}}$ do
5:     Solve $K_{k+1} = \arg\min_{K} \hat{\bar{J}}(K; K_k)$. Set $\Delta = \bar{J}(K_k) - \bar{J}(K_{k+1})$ and $k = k + 1$.
6: end while
7: return $K_k$ as the control policy.
Algorithm 1 Optimization of $\bar{J}(K)$ via semidefinite programming
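A sketch of one iteration of the sequential SDP (the solve step in Algorithm 1), assuming the surrogate (10) takes the linearized-LMI form described above and that the previous gain stabilizes all sampled models. Names and tolerances are illustrative; this is not the authors' implementation.

```python
import numpy as np
import cvxpy as cp
from scipy.linalg import sqrtm, solve_discrete_lyapunov

def mm_iteration(models, Q, R, K_prev):
    """One majorize-minimization step: linearize X^{-1} about X0 (the Lyapunov
    solution at K_prev) and solve the resulting SDP for an improved gain K."""
    n_x, n_u = models[0][0].shape[0], models[0][1].shape[1]
    Rh = np.real(sqrtm(R))
    K = cp.Variable((n_u, n_x))
    Xs = [cp.Variable((n_x, n_x), symmetric=True) for _ in models]
    cons, cost = [], 0
    for (A, B, Pi), X in zip(models, Xs):
        # Nominal point: X0 solves the Lyapunov equation at the previous gain.
        A0 = A + B @ K_prev
        X0 = solve_discrete_lyapunov(A0.T, Q + K_prev.T @ R @ K_prev)
        X0inv = np.linalg.inv(X0)
        T_lin = 2 * X0inv - X0inv @ X @ X0inv      # affine lower bound on X^{-1}
        Acl = A + B @ K
        cons += [X >> 1e-6 * np.eye(n_x)]
        cons += [cp.bmat([
            [X - Q,    Acl.T,                 K.T @ Rh],
            [Acl,      T_lin,                 np.zeros((n_x, n_u))],
            [Rh @ K,   np.zeros((n_u, n_x)),  np.eye(n_u)],
        ]) >> 0]
        cost = cost + cp.trace(Pi @ X) / len(models)
    cp.Problem(cp.Minimize(cost), cons).solve(solver=cp.SCS)
    return K.value
```

Because the linearization lower-bounds $X^{-1}$ (Lemma 4.1), any feasible point of this SDP is also feasible for (9), so the resulting objective upper-bounds the true Monte Carlo cost, which is exactly the MM property exploited by Algorithm 1.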
Remark 4.1.

This sequential SDP approach can be applied in other robust control settings, e.g., mixed $\mathcal{H}_2/\mathcal{H}_\infty$ synthesis [17], to improve on the common Lyapunov solution; cf. Section 5.1 for an illustration.

4.3 Robustness

Hitherto, we have considered the performance component of the robust control problem, namely minimization of the expected cost; we now address the robust stability requirement. It is desirable for the learned policy to stabilize every model in the confidence region $\Theta_c$; in fact, this is necessary for the cost over $\Theta_c$ to be finite. Algorithm 1 ensures stability of each of the sampled systems in $\hat\Theta$, which implies that the policy stabilizes the entire region $\Theta_c$ as $M \to \infty$. However, we would like to be able to say something about robustness for finite $M$. To this end, we make two remarks. First, if closed-loop stability of each sampled model is verified with a common Lyapunov function, then the policy stabilizes the convex hull of the sampled systems:

Theorem 4.2.

Suppose there exists $P \succ 0$ such that $(A_i + B_i K)^{\top} P (A_i + B_i K) - P \prec 0$ for the policy $K$ and all $\theta_i \in \hat\Theta$. Then $A + BK \in \mathcal{S}$ for all $\theta \in \mathrm{hull}(\hat\Theta)$, where $\mathrm{hull}(\hat\Theta)$ denotes the convex hull of the sampled models.

The proof of Theorem 4.2 is given in A.2.3. The conditions of Theorem 4.2 hold for the common Lyapunov approach in (8), and can be made to hold for Algorithm 1 by introducing an additional Lyapunov stability constraint (with a common Lyapunov function) for each sampled system, at the expense of some conservatism. Second, we observe empirically that Algorithm 1 returns policies that very nearly stabilize the entire region $\Theta_c$, despite a very modest number of samples relative to the dimension of the parameter space; cf. Section 5.1, in particular Figure 2. A number of recent papers have investigated sampling- (or grid-) based approaches to stability verification of control policies, e.g., [46, 4, 6]. Understanding why policies from Algorithm 1 generalize effectively to the entire region $\Theta_c$ is an interesting topic of future research.
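A quick numerical illustration of Theorem 4.2: if one matrix P certifies the Lyapunov decrease for every sampled closed loop, random convex combinations of the sampled models should also be Schur stable under the same gain. The inputs (models, K, P) are hypothetical placeholders.

```python
import numpy as np

def check_hull_stability(models, K, P, num_trials=1000, rng=None):
    """`models` is a list of (A_i, B_i). First verify the common-Lyapunov
    condition (A_i+B_iK)' P (A_i+B_iK) - P < 0 at every vertex, then check the
    spectral radius of the closed loop at random convex combinations."""
    rng = np.random.default_rng() if rng is None else rng
    for A, B in models:
        Acl = A + B @ K
        if np.max(np.linalg.eigvalsh(Acl.T @ P @ Acl - P)) >= 0:
            return False  # common Lyapunov certificate does not hold
    for _ in range(num_trials):
        lam = rng.dirichlet(np.ones(len(models)))        # random convex weights
        A_bar = sum(l * A for l, (A, B) in zip(lam, models))
        B_bar = sum(l * B for l, (A, B) in zip(lam, models))
        if np.max(np.abs(np.linalg.eigvals(A_bar + B_bar @ K))) >= 1.0:
            return False  # would contradict Theorem 4.2
    return True
```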

5 Experimental results

5.1 Numerical simulations using synthetic systems

In this section, we study an infinite horizon LQR problem in which the true dynamics matrix $A_{\mathrm{tr}}$ is a Toeplitz matrix, i.e., a matrix determined by its first row and first column. This is the same problem studied in [14, §6] (for $n_x = 3$), where it is noted that such dynamics naturally arise in consensus and distributed averaging problems. To obtain problem data $\mathcal{D}$, each rollout involves simulating (1), with the true parameters, for $T$ time steps, excited by an i.i.d. zero-mean Gaussian input. Note: to facilitate comparison with [14], we too shall assume that $\Pi$ is known. Furthermore, for all experiments $\Theta_c$ will denote the 95% confidence region, as in [14]. We compare the following methods of control synthesis. Existing methods: (i) nominal: standard LQR using the nominal (least squares) model $\hat\theta$; (ii) worst-case: optimize for the worst-case model (95% confidence) s.t. robust stability constraints, i.e., the method of [14, §5.2]; (iii) enforce the stability constraint from [14, §5.2], but optimize performance for the nominal model $\hat\theta$. Proposed method(s): (iv) CL: the common Lyapunov relaxation (8); (v) proposed: the method proposed in this paper, i.e., Algorithm 1. Additional new methods: (vi) alternate-r: initialize with the solution from (iii), and apply the iterative optimization method proposed in Section 4.2, cf. Remark 4.1; (vii) alternate-s: optimize for the nominal model $\hat\theta$, enforce stability for the sampled systems in $\hat\Theta$. Before proceeding, we wish to emphasize that the different control synthesis methods have different objectives; a lower cost does not mean that the associated method is 'better'. This is particularly true for worst-case, which seeks to optimize performance for the worst possible model so as to bound the cost on the true system.

To evaluate performance, we compare the cost of applying a learned policy $K$ to the true system $\theta_{\mathrm{tr}}$ against the optimal cost achieved by the optimal LQR controller (designed using $\theta_{\mathrm{tr}}$), i.e., the normalized difference between the two. We refer to this as 'LQR suboptimality.' In Figure 1, LQR suboptimality is shown as a function of the number of rollouts $N$, for $n_x = 3$. We make the following observations. Foremost, the methods that enforce stability 'stochastically' (i.e. point-wise), namely proposed and alternate-s, attain significantly lower costs than the methods that enforce stability 'robustly'. Furthermore, in situations with very little data, e.g. only a handful of rollouts, the robust control methods are usually unable to find a stabilizing controller, yet the proposed method finds one in the majority of trials. Finally, we note that the iterative procedure in proposed (and alternate-s) significantly improves on the common Lyapunov relaxation CL; similarly, alternate-r consistently improves upon method (iii) (as expected).
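As a sketch of how such a comparison could be computed (the paper's exact normalization is not reproduced here, so the ratio below is an assumed form), one can obtain the optimal gain from the discrete algebraic Riccati equation and compare costs via the Lyapunov equation:

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

def cost(A, B, Pi, Q, R, K):
    """Infinite horizon cost of u = K x; inf if the closed loop is unstable."""
    Acl = A + B @ K
    if np.max(np.abs(np.linalg.eigvals(Acl))) >= 1.0:
        return np.inf
    return float(np.trace(Pi @ solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)))

def lqr_suboptimality(A, B, Pi, Q, R, K):
    """Relative gap between the cost of K and the optimal LQR cost on the true
    system (assumed normalization: J(K)/J_opt - 1)."""
    P = solve_discrete_are(A, B, Q, R)
    K_opt = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal gain, u = K_opt x
    return cost(A, B, Pi, Q, R, K) / cost(A, B, Pi, Q, R, K_opt) - 1.0
```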


Figure 1: LQR suboptimality as a function of the number of rollouts (i.e. amount of training data). Infinite suboptimality denotes cases in which the method was unable to find a stabilizing controller for the true system (including infeasibility of the optimization problem for policy synthesis), and the % denotes the frequency with which this occurred over the 50 experimental trials conducted.

It is natural to ask whether the reduction in cost exhibited by proposed (and alternate-s) comes at the expense of robustness, namely, the ability to stabilize a large region of the parameter space. Empirical results suggest that this is not the case. To investigate this, we sample 5000 fresh (i.e. not used for learning) models from the posterior and check closed-loop stability of each; this is repeated for 50 independent experiments with varying $n_x$ and $N$. The median percentage of models that were unstable in closed loop is recorded in Table 1. We make two observations: (i) the proposed method exhibits strong robustness; even for $n_x = 12$ (i.e. a 288-dimensional parameter space), it stabilizes more than 99.7% of samples from the confidence region, despite using only a modest number of MC samples. (ii) when the robust methods (worst-case, method (iii), alternate-r) are feasible, the resulting policies were found to stabilize 100% of samples; however, for $n_x = 12$, these methods were infeasible almost half the time, whereas proposed always returned a policy. Further evidence is provided in Figure 2, which plots robustness and performance as a function of the number of MC samples, $M$. For the settings considered, the entire confidence region is stabilized with very high probability once $M$ is moderately large, suggesting that $M \to \infty$ is not required for robust stability in practice.
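The robustness evaluation described above amounts to drawing fresh models from the posterior and recording the fraction that the learned gain fails to stabilize; a minimal sketch (function names illustrative):

```python
import numpy as np

def percent_unstable(fresh_models, K):
    """Fraction (in %) of sampled open-loop models (A_i, B_i) for which the
    closed loop A_i + B_i K is not Schur stable."""
    unstable = sum(
        np.max(np.abs(np.linalg.eigvals(A + B @ K))) >= 1.0
        for A, B in fresh_models
    )
    return 100.0 * unstable / len(fresh_models)
```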

n_x   optimal     nominal     worst-case   CL      proposed   alternate-s
3     61.6 (0)    28.75 (0)   0 (0)        0 (0)   0.10 (0)   1.35 (0)
6     95.37 (0)   58.41 (0)   0 (0)        0 (0)   0.18 (0)   1.76 (0)
9     99.6 (0)    81.9 (0)    0 (0)        0 (0)   0.24 (0)   1.40 (0)
12    100 (0)     94.28 (0)   0 (46.0)     0 (0)   0.27 (0)   1.27 (0)
Table 1: Median % of unstable closed-loop models, with open-loop models sampled from the 95% confidence region of the posterior, for systems of varying dimension $n_x$; cf. Section 5.1 for details. Parenthesized quantities denote the % of cases for which the policy synthesis optimization problem was infeasible (i.e. no policy was returned). 50 experiments were conducted. Method (iii) and alternate-r have the same robustness guarantees as worst-case, and are omitted.


Figure 2: (left) Median % of unstable closed-loop models, with open-loop models sampled from the 95% confidence region of the posterior, for fixed $n_x$ and $N$, as a function of the number of samples $M$ used in the MC approximation (4). (right) LQR suboptimality as a function of $M$. 50 experiments were conducted; cf. Section 5.1 for details. Shaded regions cover the interquartile range.

5.2 Real-world experiments on a rotary inverted pendulum

We now apply the proposed algorithm to the classic problem of stabilizing a (rotary) inverted pendulum on real (i.e. physical) hardware (Quanser QUBE 2); cf. A.3 for details. To generate training data, the superposition of a non-stabilizing control signal and a sinusoid of random frequency is applied to the rotary arm motor while the pendulum is inverted. The arm and pendulum angles (along with velocities) are sampled at 100Hz until the pendulum angle exceeds a threshold, which takes no more than 5 seconds. This constitutes one rollout. We applied the worst-case, method (iii), and proposed methods to optimize the LQ cost with fixed weight matrices $Q$ and $R$. To generate the error bounds required by worst-case and method (iii), we sample from the 95% confidence region of the posterior, using Gibbs sampling, and take bounds large enough to cover the sampled models. The proposed method used 100 such samples for synthesis. We also applied the least squares policy iteration method [25], but none of the resulting policies could stabilize the pendulum given the amount of training data. Results are presented in Figure 3, from which we make the following remarks. First, as in Section 5.1, the proposed method achieves high performance (low cost), especially in the low data regime where the magnitude of system uncertainty renders the other synthesis methods infeasible. Insight into this performance is offered by Figure 3(b), which indicates that policies from the proposed method stabilize the pendulum with control signals of smaller magnitude. Finally, performance of the proposed method converges after very few rollouts. Data-inefficiency is a well-known limitation of RL; understanding and mitigating this inefficiency is the subject of considerable research [14, 43, 15, 38, 20, 21]. Investigating the role that a Bayesian approach to uncertainty quantification plays in the apparent sample efficiency of the proposed method is an interesting topic for further inquiry.

Figure 3: (a) LQR cost on the real-world pendulum experiment, as a function of the number of rollouts. Infinite cost denotes controllers that resulted in instability during testing; n/a denotes cases in which the synthesis problem was infeasible. (b) Pendulum angle and control signal recorded after 10 rollouts.

Acknowledgments

This research was financially supported by the Swedish Foundation for Strategic Research (SSF) via the project ASSEMBLE (contract number: RIT15-0012) and via the projects Learning flexible models for nonlinear dynamics (contract number: 2017-03807) and NewLEADS - New Directions in Learning Dynamical Systems (contract number: 621-2016-06079), both funded by the Swedish Research Council.

References

  • [1] P. Abbeel and A. Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pages 1–8. ACM, 2005.
  • [2] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
  • [3] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, 2013.
  • [4] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems (NIPS). 2017.
  • [5] D. P. Bertsekas. Dynamic programming and optimal control. Belmont, MA: Athena Scientific, 1995.
  • [6] R. Bobiti and M. Lazar. A sampling approach to finding Lyapunov functions for nonlinear discrete-time systems. In 2016 European Control Conference (ECC), pages 561–566. IEEE, 2016.
  • [7] R. Boczar, N. Matni, and B. Recht. Finite-data performance guarantees for the output-feedback control of an unknown system. arXiv preprint arXiv:1803.09186, 2018.
  • [8] J. B. Burl. Linear optimal control: $\mathcal{H}_2$ and $\mathcal{H}_\infty$ methods. Addison-Wesley Longman Publishing Co., Inc., 1998.
  • [9] C. K. Carter and R. Kohn. On Gibbs sampling for state space models. Biometrika, 81(3):541–553, 1994.
  • [10] J. Chen and G. Gu. Control-oriented system identification: an $\mathcal{H}_\infty$ approach, volume 19. Wiley-Interscience, 2000.
  • [11] J. Chen and C. N. Nett. The Caratheodory-Fejer problem and $\mathcal{H}_\infty$ identification: a time domain approach. In Proceedings of the 32nd IEEE Conference on Decision and Control (CDC), pages 68–73, 1993.
  • [12] S. H. Cheung and J. L. Beck. Bayesian model updating using hybrid Monte Carlo simulation with application to structural dynamic models with many uncertain parameters. Journal of engineering mechanics, 135(4):243–255, 2009.
  • [13] J. Ching, M. Muto, and J. Beck. Bayesian linear structural model updating using Gibbs sampler with modal data. In Proceedings of the 9th International Conference on Structural Safety and Reliability, pages 2609–2616. Millpress, 2005.
  • [14] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.
  • [15] M. Deisenroth and C. E. Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.
  • [16] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning. arXiv:1710.07283, 2017.
  • [17] J. Doyle, K. Zhou, K. Glover, and B. Bodenheimer. Mixed $\mathcal{H}_2$ and $\mathcal{H}_\infty$ performance objectives. II. Optimal control. IEEE Transactions on Automatic Control, 39(8):1575–1587, 1994.
  • [18] J. Garcıa and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
  • [19] P. Geibel and F. Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24:81–108, 2005.
  • [20] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017.
  • [21] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
  • [22] W. M. Haddad, D. S. Bernstein, and D. Mustafa. Mixed-norm $\mathcal{H}_2/\mathcal{H}_\infty$ regulation and estimation: The discrete-time case. Systems & Control Letters, 16(4):235–247, 1991.
  • [23] A. J. Helmicki, C. A. Jacobson, and C. N. Nett. Control oriented system identification: a worst-case/deterministic approach in $\mathcal{H}_\infty$. IEEE Transactions on Automatic Control, 36(10):1163–1176, 1991.
  • [24] H. K. Khalil. Nonlinear systems. Prentice-Hall, New Jersey, 1996.
  • [25] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of machine learning research, 4(Dec):1107–1149, 2003.
  • [26] K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of computational and graphical statistics, 9(1):1–20, 2000.
  • [27] O. Mihatsch and R. Neuneier. Risk-sensitive reinforcement learning. Machine learning, 49(2-3):267–290, 2002.
  • [28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • [29] B. Ninness and S. Henriksen.

    Bayesian system identification via Markov chain Monte Carlo techniques.

    Automatica, 46(1):40–51, 2010.
  • [30] M. de Oliveira. MAE 280B: Linear Control Design. Lecture notes, 2009.
  • [31] C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot. Robust Constrained Learning-based NMPC enabling reliable mobile robot path tracking. The International Journal of Robotics Research, 35(13):1547–1563, 2016.
  • [32] I. R. Petersen, M. R. James, and P. Dupuis. Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control, 45(3):398–412, 2000.
  • [33] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook. Technical University of Denmark, 2012.
  • [34] I. Postlethwaite, M. C. Turner, and G. Herrmann. Robust control applications. Annual Reviews in Control, 31(1):27–39, 2007.
  • [35] C. E. Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.
  • [36] I. M. Ross, R. J. Proulx, and M. Karpenko. Unscented optimal control for space flight. In 24th International Symposium on Space Flight Dynamics, 2014.
  • [37] I. M. Ross, R. J. Proulx, M. Karpenko, and Q. Gong. Riemann–stieltjes optimal control problems for uncertain dynamic systems. Journal of Guidance, Control, and Dynamics, 38(7):1251–1263, 2015.
  • [38] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  • [39] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • [40] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht. Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334, 2018.
  • [41] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
  • [42] S. Tu, R. Boczar, A. Packard, and B. Recht. Non-asymptotic analysis of robust control from coarse-grained identification. arXiv preprint arXiv:1707.04791, 2017.
  • [43] S. Tu and B. Recht. Least-squares temporal difference learning for the linear quadratic regulator. arXiv preprint arXiv:1712.08642, 2017.
  • [44] F. Vaida. Parameter convergence for EM and MM algorithms. Statistica Sinica, pages 831–840, 2005.
  • [45] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM review, 38(1):49–95, 1996.
  • [46] J. Vinogradska, B. Bischoff, D. Nguyen-Tuong, A. Romer, H. Schmidt, and J. Peters. Stability of controllers for gaussian process forward models. In International Conference on Machine Learning, pages 545–554, 2016.
  • [47] Y.-S. Wang, N. Matni, and J. C. Doyle. A system level approach to controller synthesis. arXiv preprint arXiv:1610.04815, 2016.
  • [48] A. Wills, T. B. Schön, F. Lindsten, and B. Ninness. Estimation of linear systems using a Gibbs sampler. IFAC Proceedings Volumes, 45(16):203–208, 2012.
  • [49] K. Zhou and J. C. Doyle. Essentials of robust control, volume 104. Prentice Hall Upper Saddle River, NJ, 1998.

Appendix A Supplementary material

A.1 Sampling from the posterior distribution

A.1.1 Case I: $\Pi$ known

First, consider the case where $\Pi$ is known; i.e., $\theta = \{A, B\}$. From Bayes' rule, we have

$p(A, B \mid \mathcal{D}) \propto p(\mathcal{D} \mid A, B)\, p(A, B), \qquad$ (12)

where $p(A, B)$ denotes the prior and $p(\mathcal{D} \mid A, B) = \prod_{i=1}^{N} \prod_{t=1}^{T} \mathcal{N}\!\big(x^{(i)}_{t+1};\ A x^{(i)}_t + B u^{(i)}_t,\ \Pi\big)$ denotes the likelihood. When $\Pi$ is known, we can express the likelihood in a form equivalent to an (un-normalized) Gaussian distribution over $\{A, B\}$, i.e.,

$p(\mathcal{D} \mid A, B) \propto \exp\!\Big(\! -\tfrac{1}{2}\big(\mathrm{vec}(\Theta) - \mathrm{vec}(\hat\Theta)\big)^{\top} \Sigma_{\Theta}^{-1} \big(\mathrm{vec}(\Theta) - \mathrm{vec}(\hat\Theta)\big)\Big), \qquad \Theta := [A\ \ B], \qquad$ (13)

where $z_t := [x_t^{\top}\ u_t^{\top}]^{\top}$, $\hat\Theta := \big(\sum_{i,t} x^{(i)}_{t+1} z^{(i)\top}_t\big)\big(\sum_{i,t} z^{(i)}_t z^{(i)\top}_t\big)^{-1}$ is the least squares estimate, and $\Sigma_{\Theta} := \big(\sum_{i,t} z^{(i)}_t z^{(i)\top}_t\big)^{-1} \otimes \Pi$ (with column-stacked $\mathrm{vec}$). This implies that $p(\mathcal{D} \mid A, B) \propto \mathcal{N}\big(\mathrm{vec}(\Theta);\ \mathrm{vec}(\hat\Theta),\ \Sigma_{\Theta}\big)$. Therefore, when the prior is non-informative ($p(A, B) \propto 1$) or Gaussian (self-conjugate), the posterior is also Gaussian.
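A sketch of sampling from this Gaussian posterior with a flat prior, under the standard Bayesian linear regression reading given above (regressors $z_t = [x_t; u_t]$, column-stacked vec convention); variable and function names are illustrative.

```python
import numpy as np

def sample_AB_posterior(data, Pi, num_samples, rng=None):
    """Sample (A, B) from the Gaussian posterior with a flat prior, given known
    noise covariance Pi. `data` is a list of rollouts (u_{1:T}, x_{1:T+1})."""
    rng = np.random.default_rng() if rng is None else rng
    n_x = data[0][1].shape[1]
    Z = np.vstack([np.hstack([x[:-1], u]) for u, x in data])   # rows z_t' = [x_t', u_t']
    Xp = np.vstack([x[1:] for u, x in data])                   # rows x_{t+1}'
    G = Z.T @ Z                                                # regressor Gram matrix
    Theta_hat = np.linalg.solve(G, Z.T @ Xp).T                 # least squares [A  B]
    # Posterior over vec(Theta) (column stacking): covariance = kron(inv(G), Pi).
    cov = np.kron(np.linalg.inv(G), Pi)
    mean = Theta_hat.flatten(order='F')                        # column-stacked vec
    samples = rng.multivariate_normal(mean, cov, size=num_samples)
    out = []
    for s in samples:
        Theta = s.reshape(n_x, -1, order='F')
        out.append((Theta[:, :n_x], Theta[:, n_x:]))           # split into (A, B)
    return out
```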

A.1.2 Case II: $\Pi$ unknown

Next, consider the generic case in which all parameters $\theta = \{A, B, \Pi\}$ are unknown. One approach to sampling from the posterior involves Gibbs sampling [9], i.e., alternating between the following two sampling steps:

$\{A, B\} \sim p(A, B \mid \Pi, \mathcal{D}), \qquad$ (14)
$\Pi \sim p(\Pi \mid A, B, \mathcal{D}), \qquad$ (15)

to form the Markov chain $\{A^{(k)}, B^{(k)}, \Pi^{(k)}\}_{k \geq 1}$. As demonstrated in A.1.1, the distribution $p(A, B \mid \Pi, \mathcal{D})$ is Gaussian, so sampling is straightforward. To sample from $p(\Pi \mid A, B, \mathcal{D})$, first note

$p(\Pi \mid A, B, \mathcal{D}) \propto p(\mathcal{D} \mid A, B, \Pi)\, p(\Pi). \qquad$ (16)

Observe that, as a function of $\Pi$,
$p(\mathcal{D} \mid A, B, \Pi) \propto |\Pi|^{-NT/2} \exp\!\big(\!-\tfrac{1}{2}\,\mathrm{tr}(\Pi^{-1} S)\big), \qquad S := \sum_{i=1}^{N}\sum_{t=1}^{T} \big(x^{(i)}_{t+1} - A x^{(i)}_t - B u^{(i)}_t\big)\big(x^{(i)}_{t+1} - A x^{(i)}_t - B u^{(i)}_t\big)^{\top},$
which is proportional to $\mathcal{IW}(\Pi;\, S,\, NT - n_x - 1)$, where $\mathcal{IW}$ denotes the inverse Wishart distribution. Note that for too few observations this does not correspond to a valid (normalizable) distribution. However, we may consider a prior such as $p(\Pi) \propto |\Pi|^{-(n_x+1)/2}$ (Jeffreys' noninformative prior), which means

$p(\Pi \mid A, B, \mathcal{D}) = \mathcal{IW}(\Pi;\, S,\, NT), \qquad$ (17)

where $S$ is the residual scatter matrix defined above. This is a well-defined inverse Wishart distribution, sampling from which is straightforward.
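A sketch of one Gibbs sweep (14)-(15), assuming the conditionals take the Gaussian and inverse Wishart forms derived above (including the assumed degrees of freedom under the Jeffreys prior); `sample_AB_posterior` refers to the hypothetical routine sketched in A.1.1.

```python
import numpy as np
from scipy.stats import invwishart

def gibbs_sweep(data, A, B, rng=None):
    """One Gibbs iteration: sample Pi | (A, B, D) from an inverse Wishart,
    then (A, B) | (Pi, D) from the Gaussian conditional of A.1.1."""
    rng = np.random.default_rng() if rng is None else rng
    # Residual scatter matrix S and total number of transitions N*T.
    S, n_obs = 0.0, 0
    for u, x in data:
        resid = x[1:] - x[:-1] @ A.T - u @ B.T
        S = S + resid.T @ resid
        n_obs += resid.shape[0]
    # Pi | A, B, D ~ inverse Wishart (assumed df under the Jeffreys prior).
    Pi = invwishart(df=n_obs, scale=S).rvs(random_state=rng)
    # (A, B) | Pi, D ~ Gaussian (see the sketch in A.1.1).
    A_new, B_new = sample_AB_posterior(data, Pi, num_samples=1, rng=rng)[0]
    return A_new, B_new, Pi
```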

A.2 Proofs

A.2.1 Proof of Lemma 4.1

Lemma.

$X^{-1} \succeq \mathcal{T}(X; X_0)$ for all $X, X_0 \succ 0$, where $\mathcal{T}(X; X_0) = 2X_0^{-1} - X_0^{-1} X X_0^{-1}$ denotes the first-order Taylor series expansion of $X^{-1}$ about $X_0$.

Proof.

Let $X_0 = LL^{\top}$, i.e., $L$ is the Cholesky factorization of $X_0$. Then the claimed inequality $X^{-1} \succeq \mathcal{T}(X; X_0)$ follows from a chain of equivalent reformulations, where the penultimate implication invokes the Woodbury matrix inversion identity [33, eq. 159]. ∎
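A small numerical sanity check of Lemma 4.1 on random positive definite matrices (verifying that $X^{-1} - (2X_0^{-1} - X_0^{-1} X X_0^{-1})$ has no negative eigenvalues); the matrix sizes and regularization are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
for _ in range(1000):
    # Random positive definite X and X0.
    M1, M2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    X = M1 @ M1.T + 1e-3 * np.eye(n)
    X0 = M2 @ M2.T + 1e-3 * np.eye(n)
    X0inv = np.linalg.inv(X0)
    gap = np.linalg.inv(X) - (2 * X0inv - X0inv @ X @ X0inv)
    assert np.min(np.linalg.eigvalsh(gap)) >= -1e-8   # Lemma 4.1: gap is PSD
```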

A.2.2 Proof of Theorem 4.1

Theorem.

Let $\hat{\bar{J}}(K; K_0)$ be defined as in (11), with $K_0$ such that $\bar{J}(K_0)$ is finite. Then $\hat{\bar{J}}(K; K_0)$ is a convex upper bound on $\bar{J}(K)$, i.e., $\hat{\bar{J}}(K; K_0) \geq \bar{J}(K)$ for all $K$. Furthermore, the bound is 'tight' at $K_0$, i.e., $\hat{\bar{J}}(K_0; K_0) = \bar{J}(K_0)$.

Proof.

We will prove that $\hat{J}(\theta, K; K_0)$ is a tight convex bound on $J(\theta, K)$, as this implies that $\hat{\bar{J}}(K; K_0)$ is a tight convex bound on $\bar{J}(K)$. First, we prove convexity. $\hat{J}(\theta, K; K_0)$ is defined as the supremum over an infinite family of convex functions over a compact convex set, and is therefore itself a convex function. Note: the minimum of a linear function, e.g. that in (10a), can be trivially expressed as the supremum of a convex function, i.e.,