1. Introduction
Reinforcement learning-based control has recently achieved impressive successes in games [31, 32] and simulators [28]. But these successes are significantly more challenging to translate to complex physical systems with continuous state and action spaces, safety constraints, and non-negligible operation and failure costs that demand data efficiency. An intense and growing research effort is creating a large array of models, algorithms, and heuristics for approaching the myriad of challenges arising from these systems. To complement a dominant trend of more computationally focused work, the canonical linear quadratic regulator (LQR) problem in control theory has re-emerged as an important theoretical benchmark for learning-based control [30, 12]. Despite its long history, there remain fundamental open questions for LQR with unknown models, and a foundational understanding of learning in LQR problems can give insight into more challenging problems.
All recent work on learning in LQR problems has utilized either deterministic or additive noise models [30, 12, 14, 8, 15, 1, 23, 35, 2, 37, 26], but here we consider multiplicative noise models. In control theory, multiplicative noise models have been studied almost as long as their deterministic and additive noise counterparts [39, 11], although this area is somewhat less developed and far less widely known. We believe the study of learning in LQR problems with multiplicative noise is important for three reasons. First, this class of models is much richer than deterministic or additive noise while still allowing exact solutions when models are known, which makes it a compelling additional benchmark. Second, they explicitly incorporate model uncertainty and inherent stochasticity, thereby improving robustness properties of the controller. Robustness is a critical and poorly understood issue in reinforcement learning; existing methods which do not account for uncertainty can converge to fragile policies or fail to converge at all. Additionally, intentional injection of multiplicative noise into learning algorithms is known to enhance robustness of policies from ad hoc work on domain randomization [33]. Moreover, stochastic representations of model uncertainty (via multiplicative noise) are perhaps most natural when models are estimated from noisy and incomplete data; these representations can be obtained directly from non-asymptotic statistical concentration bounds and bootstrap methods.
Third, in emerging difficult-to-model complex systems where learning-based control approaches are perhaps most promising, multiplicative noise models are increasingly relevant; examples include networked control systems with noisy communication channels [3, 17], modern power networks with large penetration of intermittent renewables [10, 27], turbulent fluid flow [25], and neuronal brain networks [9].
1.1. Related literature
Multiplicative noise LQR problems have been studied in control theory since the 1960s [39]. Since then a line of research parallel to deterministic and additive noise has developed, including basic stability and stabilizability results [38], semidefinite programming formulations [13, 7, 24], robustness properties [11, 6, 19, 4], and numerical algorithms [5]. This line of research is less widely known perhaps because much of it studies continuous-time systems, where the heavy machinery required to formalize stochastic differential equations is a barrier to entry for a broad audience. Multiplicative noise models are well-poised to offer data-driven model uncertainty representations and enhanced robustness in learning-based control algorithms and complex dynamical systems and processes.
Recent work on learning in LQR problems has focused entirely on deterministic or additive noise models. In contrast to classical work on system identification and adaptive control, which has a strong focus on asymptotic results, more recent work has focused on non-asymptotic analysis using recent tools from statistics and machine learning. There remain fundamental open problems for learning in LQR problems, with several addressed only recently, including non-asymptotic sample complexity [12, 35], regret bounds [1, 2, 26], and algorithmic convergence [14].
1.2. Our contributions
We give several fundamental results for policy gradient algorithms on linear quadratic problems with multiplicative noise. Our main contributions are as follows, which can be viewed as a generalization of the recent results of Fazel et al. [14] for deterministic LQR to multiplicative noise LQR:

In particular, in §3.2 the gradient domination property is exploited to prove global convergence of three policy gradient algorithm variants (namely, exact gradient descent, “natural” gradient descent, and Gauss-Newton/policy iteration) to the globally optimal control policy with a rate that depends polynomially on problem parameters (Theorems 3.4, 3.5, and 3.6).

Furthermore, in §4 we show that a model-free policy gradient algorithm, where the cost gradient is estimated from trajectory data rather than computed from model parameters, also converges globally (with high probability) with an appropriate exploration scheme and sufficiently many samples (also polynomial in problem data) (Theorem 4.1).
Thus, policy gradient algorithms for the multiplicative noise LQR problem enjoy the same global convergence properties as deterministic LQR, while significantly enhancing the resulting controller’s robustness to variations and inherent stochasticity in the system dynamics, as demonstrated by our numerical experiments in §5.
To the best of our knowledge, the present paper is the first work to consider and obtain global convergence results using reinforcement learning algorithms for the multiplicative noise LQR problem. Our approach allows the explicit incorporation of a model uncertainty representation that significantly improves the robustness of the controller compared to deterministic and additive noise approaches.
2. Linear Quadratic Optimal Control with Multiplicative Noise
We consider the linear quadratic regulator problem with multiplicative noise
(1)  
subject to 
where is the system state, is the control input, the initial state is distributed according to distribution , and and . The dynamics are described by a dynamics matrix and input matrix and incorporate multiplicative noise terms modeled by the i.i.d. (across time), zero-mean, mutually independent scalar random variables and , which have variances and , respectively. The matrices and specify how each scalar noise term affects the system dynamics and input matrices. Equivalently, the terms and are zero-mean random matrices with a joint covariance structure over their entries. We define the covariance matrices and ; the variances and and matrices and are simply the eigenvalues and (reshaped) eigenvectors of and , respectively. (We assume that and are independent for simplicity, but it is also straightforward to include correlations between the entries of and into the model.) The goal is to determine an optimal closed-loop state feedback policy with from a set of admissible policies.
We assume that the problem data , , , , , and permit existence and finiteness of the optimal value of the problem, in which case the system is called mean-square stabilizable and requires mean-square stability of the closed-loop system [22, 38]. The system in (1) is called mean-square stable if for any given initial covariance . Mean-square stability is a form of robust stability, requiring stricter and more complicated conditions than stabilizability of the nominal system . This essentially can limit the size of the multiplicative noise covariance, which can be viewed as a representation of uncertainty in the nominal system model or as inherent variation in the system dynamics.
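Mean-square stability can be checked numerically through the second-moment dynamics. The following is a minimal sketch under assumed notation that is not fixed by the text above: nominal matrices `A` and `B`, noise direction matrices `Ai` and `Bj` with variances `alphas` and `betas`, and a linear gain `K` applied as `u = K x` (the sign convention is an assumption). Under these assumptions the second moment X_t = E[x_t x_t^T] evolves linearly, and the closed loop is mean-square stable exactly when the spectral radius of the corresponding operator is below one.

```python
import numpy as np

def second_moment_operator(A, B, K, alphas, Ai, betas, Bj):
    """Matrix F with vec(X_{t+1}) = F vec(X_t), where X_t = E[x_t x_t^T]."""
    A, B, K = (np.asarray(M, dtype=float) for M in (A, B, K))
    AK = A + B @ K                      # nominal closed-loop matrix (u = K x assumed)
    F = np.kron(AK, AK)
    for a, Am in zip(alphas, Ai):       # state-multiplicative noise terms
        Am = np.asarray(Am, dtype=float)
        F = F + a * np.kron(Am, Am)
    for b, Bm in zip(betas, Bj):        # input-multiplicative noise terms
        BmK = np.asarray(Bm, dtype=float) @ K
        F = F + b * np.kron(BmK, BmK)
    return F

def is_mean_square_stable(A, B, K, alphas, Ai, betas, Bj):
    """True if the second moment decays to zero from any initial covariance."""
    F = second_moment_operator(A, B, K, alphas, Ai, betas, Bj)
    return bool(np.max(np.abs(np.linalg.eigvals(F))) < 1.0)

# Scalar illustration: the nominal system (a = 0.9) is stable, but a large
# enough noise variance makes the second moment grow.
A = np.array([[0.9]])
B = np.array([[1.0]])
K = np.zeros((1, 1))
print(is_mean_square_stable(A, B, K, [0.1], [np.eye(1)], [], []))  # 0.81 + 0.1 < 1
print(is_mean_square_stable(A, B, K, [0.3], [np.eye(1)], [], []))  # 0.81 + 0.3 > 1
```

The scalar case shows why mean-square stability is stricter than nominal stability: the noise variance adds directly to the squared closed-loop gain.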
2.1. Control design with known models: Value Iteration
Dynamic programming can be used to show that the optimal policy is linear state feedback , where denotes the optimal gain matrix, and the resulting optimal cost for a fixed initial state is quadratic, i.e., , where is a symmetric positive definite matrix. When the model parameters are known, there are several ways to compute the optimal feedback gains and corresponding optimal cost. The optimal cost is given by the solution of the generalized Riccati equation
This can be solved via the value iteration recursion
with or via semidefinite programming formulations (see, e.g., [7, 13, 24]). The corresponding optimal gain matrix is then
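As an illustration, value iteration on a generalized Riccati equation can be sketched as follows. The symbols are assumptions made for this example, since the displayed equations are not reproduced here: `P` is the cost matrix, `K` the gain with `u = K x`, `alphas`/`Ai` and `betas`/`Bj` the noise variances and directions, and the recursion is initialized at `P = Q`. The Riccati map used below is the standard form for independent zero-mean multiplicative noise.

```python
import numpy as np

def riccati_map(P, A, B, Q, R, alphas, Ai, betas, Bj):
    """Apply the generalized Riccati operator once; return (next P, greedy gain)."""
    G_xx = Q + A.T @ P @ A + sum(a * Am.T @ P @ Am for a, Am in zip(alphas, Ai))
    G_uu = R + B.T @ P @ B + sum(b * Bm.T @ P @ Bm for b, Bm in zip(betas, Bj))
    G_ux = B.T @ P @ A
    K = -np.linalg.solve(G_uu, G_ux)    # minimizer of the quadratic in u (u = K x)
    return G_xx + G_ux.T @ K, K

def value_iteration(A, B, Q, R, alphas, Ai, betas, Bj, tol=1e-12, max_iters=100000):
    """Iterate the Riccati map from P = Q (assumed initialization) to a fixed point."""
    model = (A, B, Q, R, alphas, Ai, betas, Bj)
    P = np.array(Q, dtype=float)
    for _ in range(max_iters):
        P_next, _ = riccati_map(P, *model)
        if np.linalg.norm(P_next - P) <= tol * max(1.0, np.linalg.norm(P)):
            return P_next, riccati_map(P_next, *model)[1]
        P = P_next
    raise RuntimeError("no convergence; the system may not be mean-square stabilizable")
```

In the noise-free scalar case A = B = Q = R = 1 the fixed point satisfies P^2 - P - 1 = 0, which gives a simple sanity check; adding state-multiplicative noise increases the optimal cost.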
2.2. Control design with known models: Policy Gradient and Policy Iteration
Here we consider an alternative approach that facilitates data-driven approaches for learning optimal and robust policies. For a fixed linear state feedback policy , the closed-loop dynamics become
and we define the corresponding value function for
If gives closed-loop mean-square stability then the value function can be written as , where is the unique positive semidefinite solution to the generalized Lyapunov equation
(2) 
Further, we define the state covariance matrices , which satisfy the recursion
Defining the infinite-horizon aggregate state covariance matrix , then provided that gives closed-loop mean-square stability, also satisfies a generalized Lyapunov equation
(3) 
Defining the cost achieved by a gain matrix by , we have
This leads to the idea of performing gradient descent on (i.e., policy gradient) via the update to find the optimal gain matrix. However, two properties of the LQR cost function complicate a convergence analysis of gradient descent. First, is extended-valued since not all gain matrices provide closed-loop mean-square stability, so it does not have (global) Lipschitz gradients. Second, and even more concerning, is generally nonconvex in (even for deterministic LQR problems, as observed by Fazel et al. [14]), so it is unclear if and when gradient descent converges to the global optimum, or if it even converges at all. Fortunately, as in the deterministic case, we show that the multiplicative noise LQR cost possesses further key properties that enable proof of global convergence despite the lack of Lipschitz gradients and nonconvexity.
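The two generalized Lyapunov equations and the resulting cost can be evaluated by vectorization. The sketch below assumes notation of our own choosing: `P` for the value matrix, `Sigma` for the aggregate state covariance, `Sigma0` for the initial second moment E[x_0 x_0^T], and `u = K x`. It returns an infinite cost when the gain is not mean-square stabilizing, which reflects the extended-valued nature of the cost described above.

```python
import numpy as np

def lyapunov_pair(K, A, B, Q, R, Sigma0, alphas, Ai, betas, Bj):
    """Solve both generalized Lyapunov equations for a gain K (u = K x assumed).

    Returns (P, Sigma), or None if K is not mean-square stabilizing.
    """
    n = A.shape[0]
    AK = A + B @ K
    # F maps vec(X) to vec(E[x_{t+1} x_{t+1}^T]) when X = E[x_t x_t^T].
    F = np.kron(AK, AK)
    F = F + sum(a * np.kron(Am, Am) for a, Am in zip(alphas, Ai))
    F = F + sum(b * np.kron(Bm @ K, Bm @ K) for b, Bm in zip(betas, Bj))
    if np.max(np.abs(np.linalg.eigvals(F))) >= 1.0:
        return None
    I = np.eye(n * n)
    QK = Q + K.T @ R @ K
    # The value equation uses the adjoint operator, hence the transpose of F.
    P = np.linalg.solve(I - F.T, QK.reshape(-1)).reshape(n, n)
    Sigma = np.linalg.solve(I - F, Sigma0.reshape(-1)).reshape(n, n)
    return P, Sigma

def cost(K, A, B, Q, R, Sigma0, alphas, Ai, betas, Bj):
    """Cost of gain K: trace(P Sigma0) if stabilizing, infinity otherwise."""
    sol = lyapunov_pair(K, A, B, Q, R, Sigma0, alphas, Ai, betas, Bj)
    if sol is None:
        return np.inf
    return float(np.trace(sol[0] @ Sigma0))
```

The two solutions give the same cost through either trace(P Sigma0) or trace((Q + K^T R K) Sigma), which is a convenient consistency check on an implementation.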
3. Gradient Domination and Global Convergence of Policy Gradient
In this section, we demonstrate that the multiplicative noise LQR cost function is gradient dominated, which facilitates optimization by gradient descent. Gradient dominated functions have been studied for many years in the optimization literature [29] and have recently been discovered in deterministic LQR problems by [14]. We then show that the policy gradient algorithm and two important variants for multiplicative noise LQR converge globally to the optimal policy. In contrast with [14], the policies we obtain are robust to uncertainties and inherent stochastic variations in the system dynamics. The proofs of all technical results can be found in the Appendices.
3.1. Multiplicative Noise LQR Cost is Gradient Dominated
First, we give the expression for the policy gradient for the multiplicative noise LQR cost.
Lemma 3.1 (Policy Gradient Expression).
The policy gradient is given by
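As a numerical illustration of a policy gradient expression of this kind, the sketch below evaluates the cost and the gradient form 2 (R_K K + B^T P A) Sigma, with R_K = R + B^T P B + sum_j beta_j B_j^T P B_j. This form follows from differentiating the generalized Lyapunov equation under the assumed convention `u = K x`; all symbol names are assumptions of this example. A central finite-difference helper is included so that the closed form can be compared against a direct numerical derivative.

```python
import numpy as np

def cost_and_gradient(K, A, B, Q, R, Sigma0, alphas, Ai, betas, Bj):
    """Return (cost, gradient) for gain K, or (inf, None) if K is not stabilizing."""
    n = A.shape[0]
    AK = A + B @ K
    F = np.kron(AK, AK)
    F = F + sum(a * np.kron(Am, Am) for a, Am in zip(alphas, Ai))
    F = F + sum(b * np.kron(Bm @ K, Bm @ K) for b, Bm in zip(betas, Bj))
    if np.max(np.abs(np.linalg.eigvals(F))) >= 1.0:
        return np.inf, None
    I = np.eye(n * n)
    P = np.linalg.solve(I - F.T, (Q + K.T @ R @ K).reshape(-1)).reshape(n, n)
    Sigma = np.linalg.solve(I - F, Sigma0.reshape(-1)).reshape(n, n)
    RK = R + B.T @ P @ B + sum(b * Bm.T @ P @ Bm for b, Bm in zip(betas, Bj))
    EK = RK @ K + B.T @ P @ A
    return float(np.trace(P @ Sigma0)), 2.0 * EK @ Sigma

def finite_difference_gradient(K, h, *model):
    """Central differences of the cost, entry by entry (for checking only)."""
    G = np.zeros_like(K, dtype=float)
    for idx in np.ndindex(*K.shape):
        dK = np.zeros_like(K, dtype=float)
        dK[idx] = h
        c_plus = cost_and_gradient(K + dK, *model)[0]
        c_minus = cost_and_gradient(K - dK, *model)[0]
        G[idx] = (c_plus - c_minus) / (2.0 * h)
    return G
```

Comparing the two on a small stabilizing gain is a quick way to catch sign or factor-of-two mistakes in the gradient.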
Next, we see that the multiplicative noise LQR cost is gradient dominated.
Lemma 3.2 (Gradient domination).
The multiplicative noise LQR cost satisfies the gradient domination condition
The gradient domination property gives the following stationary point characterization.
Corollary 3.3.
If then either or rank.
In other words, so long as is full rank, stationarity is both necessary and sufficient for global optimality, as for convex functions. Note that to ensure that is full rank, it is not sufficient to simply have multiplicative noise in the dynamics with a deterministic initial state . To see this, simply observe that if and then , which is clearly rank deficient. By contrast, additive noise is sufficient to ensure that is full rank with a deterministic initial state . Taking ensures rank and thus implies .
Although the gradient of the multiplicative noise LQR cost is not globally Lipschitz continuous, it is locally Lipschitz continuous over any subset of its domain (i.e., over any set of mean-square stabilizing gain matrices). The gradient domination is then sufficient to show that policy gradient descent will converge to the optimal gains at a linear rate (a short proof of this fact for globally Lipschitz functions is given in [21]). We prove this convergence of policy gradient to the optimal feedback gain by bounding the local Lipschitz constant in terms of the problem data, which bounds the maximum step size and the convergence rate.
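For intuition, the argument in the globally Lipschitz case runs as follows. The notation here is generic and assumed for illustration only: $f$ is differentiable with $L$-Lipschitz gradient and minimum value $f^*$, and gradient domination holds in the form $f(x) - f^* \le \tfrac{1}{2\mu}\|\nabla f(x)\|^2$ for some $\mu > 0$. One gradient step $x^+ = x - \tfrac{1}{L}\nabla f(x)$ then contracts the optimality gap by a fixed factor:

```latex
\begin{align*}
f(x^+) &\le f(x) + \langle \nabla f(x),\, x^+ - x \rangle + \tfrac{L}{2}\,\|x^+ - x\|^2
        = f(x) - \tfrac{1}{2L}\,\|\nabla f(x)\|^2 \\
       &\le f(x) - \tfrac{\mu}{L}\,\bigl(f(x) - f^*\bigr), \\
\text{so}\qquad f(x^+) - f^* &\le \Bigl(1 - \tfrac{\mu}{L}\Bigr)\bigl(f(x) - f^*\bigr).
\end{align*}
```

The first line is the descent lemma for $L$-smooth functions and the second applies gradient domination. In the LQR setting only a local Lipschitz constant is available, which is why the admissible step size has to be bounded in terms of the problem data.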
3.2. Global Convergence of Policy Gradient for Multiplicative Noise LQR
We analyze three policy gradient algorithm variants:

Exact gradient descent:

Natural gradient descent:

Gauss-Newton/policy iteration:
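The three variants can be sketched as follows, under assumptions that are ours rather than taken from the displayed updates: the gradient is taken as 2 E_K Sigma_K with E_K = R_K K + B^T P_K A and R_K = R + B^T P_K B + sum_j beta_j B_j^T P_K B_j, the policy is `u = K x`, the natural step right-multiplies the gradient by the inverse of Sigma_K, and the Gauss-Newton step additionally left-multiplies by the inverse of R_K.

```python
import numpy as np

def policy_quantities(K, A, B, Q, R, Sigma0, alphas, Ai, betas, Bj):
    """Return (cost, E_K, R_K, Sigma_K) for a mean-square stabilizing gain K."""
    n = A.shape[0]
    AK = A + B @ K
    F = np.kron(AK, AK)
    F = F + sum(a * np.kron(Am, Am) for a, Am in zip(alphas, Ai))
    F = F + sum(b * np.kron(Bm @ K, Bm @ K) for b, Bm in zip(betas, Bj))
    if np.max(np.abs(np.linalg.eigvals(F))) >= 1.0:
        raise ValueError("gain is not mean-square stabilizing")
    I = np.eye(n * n)
    P = np.linalg.solve(I - F.T, (Q + K.T @ R @ K).reshape(-1)).reshape(n, n)
    Sigma = np.linalg.solve(I - F, Sigma0.reshape(-1)).reshape(n, n)
    RK = R + B.T @ P @ B + sum(b * Bm.T @ P @ Bm for b, Bm in zip(betas, Bj))
    EK = RK @ K + B.T @ P @ A
    return float(np.trace(P @ Sigma0)), EK, RK, Sigma

def update(K, eta, variant, *model):
    """One step of the chosen variant with step size eta."""
    _, EK, RK, Sigma = policy_quantities(K, *model)
    grad = 2.0 * EK @ Sigma
    if variant == "gradient":
        return K - eta * grad
    if variant == "natural":
        return K - eta * grad @ np.linalg.inv(Sigma)
    if variant == "gauss_newton":
        return K - eta * np.linalg.solve(RK, grad @ np.linalg.inv(Sigma))
    raise ValueError(f"unknown variant: {variant}")
```

With this gradient convention, a Gauss-Newton step of size 1/2 reduces to K_next = -R_K^{-1} B^T P_K A, which is the policy improvement step of policy iteration. The exact gradient step needs only the gradient itself, while the other two also require Sigma_K (and R_K), which matters in a model-free setting.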
The more elaborate natural gradient and Gauss-Newton variants provide superior convergence rates and simpler proofs. A development of the natural policy gradient is given in [14] building on ideas from [20]. The Gauss-Newton step with step size is in fact identical to the policy improvement step in policy iteration (a short derivation is given in Appendix C.1) and was first studied for deterministic LQR by Hewer in 1971 [18]. This was extended to a model-free setting using policy iteration and Q-learning in [8], proving asymptotic convergence of the gain matrix to the optimal gain matrix. For multiplicative noise LQR, we have the following results. (We include a factor of 2 on the gradient expression that was erroneously dropped in [14]. This affects the step size restrictions by a corresponding factor of 2.)
Theorem 3.4 (Gauss-Newton/policy iteration convergence).
Using the Gauss-Newton step
with step size gives global convergence to the optimal gain matrix at a linear rate described by
Theorem 3.5 (Natural policy gradient convergence).
Using the natural policy gradient step
with step size
gives global convergence to the optimal gain matrix at a linear rate described by
Theorem 3.6 (Policy gradient convergence).
Using the policy gradient step
with step size gives global convergence to the optimal gain matrix at a linear rate described by
where
and and are
and
The proofs for these results are provided in the Appendices and explicitly incorporate the effects of the multiplicative noise terms and in the dynamics. For the exact and natural policy gradient algorithms, we show explicitly how the maximum allowable step size depends on problem data and in particular on the multiplicative noise terms. Compared to deterministic LQR, the multiplicative noise terms decrease the allowable step size and thereby decrease the convergence rate; specifically, the state-multiplicative noise increases the initial cost and the norms of the covariance and cost , and the input-multiplicative noise also increases the denominator term . This means that the algorithm parameters for deterministic LQR in [14] may cause failure to converge on problems with multiplicative noise. Moreover, even the optimal policies for deterministic LQR may actually destabilize systems in the presence of small amounts of multiplicative noise uncertainty, indicating the possibility for a catastrophic lack of robustness. The results and proofs also differ from those of [14] because a more complicated form of stochastic stability (namely, mean-square stability) must be accounted for, and because generalized Lyapunov equations must be solved to compute the gradient steps, which requires specialized solvers.
4. Global Convergence of Model-Free Policy Gradient
The results in the previous section are model-based; the policy gradient steps are computed exactly based on knowledge of the model parameters. In a model-free setting, the policy gradient can be estimated to arbitrary accuracy from a sufficient number of sample trajectories of sufficiently long rollout length. We show for multiplicative noise LQR that with a finite number of samples polynomial in the problem data, the model-free policy gradient algorithm still converges to the globally optimal policy in the presence of small perturbations on the gradient.
In the model-free setting, the policy gradient method proceeds as before except that at each iteration Algorithm 1 is called to generate an estimate of the gradient via the zeroth-order optimization procedure described by Fazel et al. [14].
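A zeroth-order estimate of this kind can be sketched as below. Algorithm 1 itself is not reproduced in this section, so the details are assumptions: perturbations are drawn uniformly from a Frobenius-norm sphere of radius `radius`, each perturbed gain is scored by a finite-horizon rollout cost, and the estimate is the average of (d / radius^2) * cost * perturbation with d the number of entries of the gain. The noise scalars are sampled as Gaussians purely for illustration; the text above only fixes their mean and variance.

```python
import numpy as np

def rollout_cost(K, A, B, Q, R, alphas, Ai, betas, Bj, x0, length, rng):
    """Finite-horizon cost of one trajectory under u = K x with sampled noise."""
    x = np.array(x0, dtype=float)
    total = 0.0
    for _ in range(length):
        u = K @ x
        total += x @ Q @ x + u @ R @ u
        # Gaussian noise scalars are an assumption made for this sketch.
        A_t = A + sum(rng.normal(0.0, np.sqrt(a)) * Am for a, Am in zip(alphas, Ai))
        B_t = B + sum(rng.normal(0.0, np.sqrt(b)) * Bm for b, Bm in zip(betas, Bj))
        x = A_t @ x + B_t @ u
    return float(total)

def estimate_gradient(cost_fn, K, n_samples, radius, rng):
    """Average of (d / r^2) * cost_fn(K + U) * U over U uniform on the radius-r sphere."""
    d = K.size
    grad = np.zeros(K.shape, dtype=float)
    for _ in range(n_samples):
        U = rng.standard_normal(K.shape)
        U *= radius / np.linalg.norm(U)     # project onto the Frobenius sphere
        grad += (d / radius**2) * cost_fn(K + U) * U
    return grad / n_samples
```

A typical use would pass `cost_fn = lambda Kp: rollout_cost(Kp, A, B, Q, R, alphas, Ai, betas, Bj, x0, length, rng)` with `x0` drawn from the initial state distribution. For a linear cost the estimator is unbiased, which gives a simple check; for the LQR cost it is biased by the smoothing radius and the finite rollout length, which is why both must be chosen carefully.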
Theorem 4.1 (Model-Free Policy Gradient convergence).
Suppose the step size is chosen according to the restriction in Theorem 3.6 and at every iteration the gradient is estimated using Algorithm 1 where the number of samples , rollout length , and exploration radius are chosen according to fixed quantities ,, which are polynomial in the problem data , , , , , , , , , . Then with high probability of at least performing gradient descent results in convergence to the global optimum at the linear rate
Remark 4.2 (From deterministic to multiplicative noise LQR).
In comparison with the deterministic dynamics studied by [14], the following remarks are in order:

One of the critical effects of multiplicative noise is that the computational burden of performing policy gradient is increased. This is evident from the mathematical expressions which bound the relevant quantities whose exact relationship is developed in the Appendices. In particular, , , and are necessarily higher with either state- or input-dependent multiplicative noise, and is greater than . These increases all act to reduce the step size (and thus convergence rate), and in the model-free setting increase the number of samples and rollout length required.
5. Numerical Experiments
In this section we demonstrate the efficacy of the policy gradient algorithms. We first considered a system with 4 states and 1 input representing an active two-mass suspension converted from continuous to discrete time using a standard bilinear transformation. We considered the system dynamics with and without multiplicative noise. The system was open-loop mean stable, and in the presence of multiplicative noise it was open-loop mean-square unstable. We refer to the cost with multiplicative noise as the LQRm cost and the cost without any noise as the LQR cost. Let and be gains which optimize the LQRm and LQR cost, respectively.
We performed exact policy gradient descent in the model-known setting; at each iteration gradients were calculated by solving generalized Lyapunov equations (2) and (3) using the problem data. We performed the optimization for both settings of noise starting from the same random feasible initial gain. The step size was set to a small constant in accordance with Theorem 3.6. The optimization stopped once the Frobenius norm of the gradient fell below a small threshold. The plots in Fig. 1 show the cost of the gains at each iteration; Figs. 1(a) and 1(b) show gains during minimization of the LQRm cost and LQR cost, respectively.
When there was high multiplicative noise, the noise-aware controller minimized the LQRm cost as desired. However, the noise-ignorant controller actually destabilized the system in the mean-square sense; this can be seen in Fig. 1(b) as the LQRm cost exploded upwards to infinity. Looking at the converse scenario, indeed minimized the LQR cost as expected. However, while did lead to a slightly suboptimal LQR cost, it nevertheless ensured that at least the LQR cost was finite (gains were mean stabilizing) throughout the optimization. In this sense, the multiplicative noise-aware optimization is generally safer and more robust than noise-ignorant optimization, and in examples like this is actually necessary for mean-square stabilization.
We also considered 10-state, 10-input systems with randomly generated problem data. The systems were all open-loop mean-square stable with initial gains set to zero. We ran policy gradient using the exact gradient, natural gradient, and Gauss-Newton step directions on 20 unique problem instances using the largest feasible constant step sizes for a fixed number of iterations so that the final cost was no more than worse than optimal. The plots in Fig. 2 show the cost over the iterations; the bold centerline is the mean of all trials and the shaded region is between the maximum and minimum of all trials. It is evident that in terms of convergence the Gauss-Newton step was extremely fast, the natural gradient was somewhat slow and the exact gradient was the slowest. Nevertheless, all algorithms exhibited convergence to the optimum, empirically confirming the asserted theoretical claims.
Python code which implements the algorithms and generates the figures reported in this work can be found in the GitHub repository at https://github.com/TSummersLab/polgradmultinoise/.
The code was run on a desktop PC with a quad-core Intel i7 6700K 4.0GHz CPU and 16GB RAM. No GPU computing was utilized.
6. Conclusions
We have shown that policy gradient methods in both model-known and model-unknown settings give global convergence to the globally optimal policy for LQR systems with multiplicative noise. These techniques are directly applicable for the design of robust controllers of uncertain systems and serve as a benchmark for data-driven control design. Our ongoing work is exploring ways of mitigating the relative sample inefficiency of model-free policy gradient methods by leveraging the special structure of LQR models and Nesterov-type acceleration, and exploring alternative system identification and adaptive control approaches. We are also investigating other methods of building robustness through and dynamic game approaches.
Technical Proofs
Before proceeding with the proof of the main results of this study, we first review several basic matrix expressions that will be used later throughout the section.
Appendix A Standard matrix expressions
Spectral norm:

We denote the matrix spectral norm as which clearly satisfies
(4)
Frobenius norm:

We denote the matrix Frobenius norm as whose square satisfies
(5)
Frobenius norm ≥ spectral norm:

For any matrix the Frobenius norm is greater than or equal to the spectral norm:
(6)
Inverse of spectral norm inequality:

(7)
Invariance of trace under cyclic permutation:

(8)
Invariance of trace under arbitrary permutation for a product of three matrices:

(9)
Scalar trace equivalence:

(10)
Trace-spectral norm inequalities:

(11)
If
(12)
and if
(13)
Submultiplicativity of spectral norm:

(14)
Positive semidefinite matrix inequality:

Suppose and . Then
(15)
Vector self outer product positive semidefiniteness:

(16)
since .
Singular value inequality for positive semidefinite matrices:

Suppose and and . Then
(17)
Weyl’s Inequality for singular values:

Suppose . Let singular values of , , and be
where . Then we have
(18)
Consequently, we have
(19)
and
(20)
Vector Bernstein inequality:
Appendix B Policy Gradient Expression and Gradient Domination
B.1. Policy gradient expression
We give the expression for the policy gradient for linear state feedback policies applied to the LQR-with-multiplicative-noise problem.
Lemma B.1 (Policy Gradient Expression).
The policy gradient is given by
(22) 
Proof.
Substituting the RHS of the generalized Lyapunov equation into the cost yields
(23)  
(24) 
Taking the gradient with respect to and using the product rule we obtain
(25)  
(26)  
(27)  
(28)  
(29)  
(30)  
(31)  
(32) 
where the overbar on is used to denote the term being differentiated. Applying this gradient formula recursively to the last term in the last line (namely ), we obtain
(33) 
which completes the proof. ∎
B.2. Additional quantities
We define the stochastic system state transition matrices
(34) 
We define
(35) 
and
(36) 
so that
(37) 
We define the (deterministic) nominal closed-loop state transition matrix
(38) 
Similarly we define the stochastic closed-loop state transition matrix
(39) 
We define the closed-loop LQR cost matrix
(40) 
B.3. State value function, state-action value function, and advantage
We have already defined the state value function (or simply the “value function” or “V-function” in reinforcement learning jargon) in the main document. We now define an equivalent notation by moving the functional dependency on to the subscript, giving
(41) 
given that
(42) 
where we take expectation with respect to the and determining . Equivalently,
(43) 
The state-action value function (or simply the “Q-function” in reinforcement learning jargon) is
(44) 
where we take expectation with respect to the and determining and respectively. Notice that the state and action which are the functional inputs do not have to be generated by the gain matrix in the subscript. Indeed we have if , but not in general. Also note that only the rightmost expression (the state value function) is dependent on the gain matrix. These facts will be crucial to proving the value difference lemma. Expanding, we can also write the state-action value function as
(45)  
(46) 
The advantage function is defined as
(47) 
The advantage function can be thought of as the difference in cost (“advantage”) when starting in state of taking an action for one step instead of the action generated by policy .
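These definitions can be made concrete with a short numerical sketch. The notation is assumed for this example: `P` is the value matrix of the gain `K` (so the value of state `x` is x^T P x), the policy is `u = K x`, and the state-action value is written as a quadratic form in the stacked vector (x, u), using the fact that the noise terms are zero-mean and mutually independent.

```python
import numpy as np

def value_matrix(K, A, B, Q, R, alphas, Ai, betas, Bj):
    """Value matrix P of gain K via the vectorized generalized Lyapunov equation."""
    n = A.shape[0]
    AK = A + B @ K
    F = np.kron(AK, AK)
    F = F + sum(a * np.kron(Am, Am) for a, Am in zip(alphas, Ai))
    F = F + sum(b * np.kron(Bm @ K, Bm @ K) for b, Bm in zip(betas, Bj))
    QK = Q + K.T @ R @ K
    return np.linalg.solve(np.eye(n * n) - F.T, QK.reshape(-1)).reshape(n, n)

def q_matrix(P, A, B, Q, R, alphas, Ai, betas, Bj):
    """Matrix G such that the state-action value is [x; u]^T G [x; u]."""
    G_xx = Q + A.T @ P @ A + sum(a * Am.T @ P @ Am for a, Am in zip(alphas, Ai))
    G_uu = R + B.T @ P @ B + sum(b * Bm.T @ P @ Bm for b, Bm in zip(betas, Bj))
    G_xu = A.T @ P @ B
    return np.block([[G_xx, G_xu], [G_xu.T, G_uu]])

def advantage(x, u, P, G):
    """State-action value of (x, u) minus the state value of x."""
    z = np.concatenate([x, u])
    return float(z @ G @ z - x @ P @ x)
```

By construction the advantage vanishes when `u = K x`. For an action generated by a different gain `K2`, expanding the quadratic form gives 2 x^T (K2 - K)^T E x + x^T (K2 - K)^T G_uu (K2 - K) x with E = G_uu K + G_xu^T, which is the kind of closed form that a value-difference argument relies on.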
We also define the state sequence
(48) 
and the action sequence
(49) 
and the cost sequence
(50) 
Note that , , , and are all random variables whose distributions are determined by the multiplicative noise data.
We can now derive the value-difference lemma, which Fazel et al. [14] refer to as the “cost-difference” lemma.
Lemma B.2 (Value difference).
Suppose and generate the (stochastic) state, action, and cost sequences
(51) 
and
(52) 
respectively. Then the value difference is
(53) 
Also, the advantage satisfies
(54) 
where
(55) 
Proof.
By definition we have
(56) 
so we can write the value difference as
(57)  
(58)  
(59) 
We can expand out the following value function difference as