 # Learning Unstable Dynamical Systems with Time-Weighted Logarithmic Loss

When training the parameters of a linear dynamical model, the gradient descent algorithm is likely to fail to converge if the squared-error loss is used as the training loss function. Restricting the parameter space to a smaller subset and running the gradient descent algorithm within this subset can allow learning stable dynamical systems, but this strategy does not work for unstable systems. In this work, we look into the dynamics of the gradient descent algorithm and pinpoint what causes the difficulty of learning unstable systems. We show that observations taken at different times from the system to be learned influence the dynamics of the gradient descent algorithm in substantially different degrees. We introduce a time-weighted logarithmic loss function to fix this imbalance and demonstrate its effectiveness in learning unstable systems.


## 1 Introduction

Systems with memories that evolve over time require the use of a dynamical model for their representation. This model describes how the memory, or the state, of this system changes over time, how its state is affected by inputs to the system, and how it generates observable outputs. System identification corresponds to the task of learning the unknown parameters of this dynamical model from the known inputs and the observed outputs of the system.

Identification of dynamical systems from time-series data is an important problem for various applications, such as model prediction in reinforcement learning [Lambert et al., 2019, Zhang et al., 2016], analysis of medical health records [Rubanova et al., 2019], and prediction with financial time-series data [Tsay, 2014, Ganeshapillai et al., 2013]. However, the identification problems that arise in these applications pose some theoretical challenges:

1. Unless the state of the system is observed with a known noiseless mapping, the identification of the system model is coupled with the state estimation. Consequently, the system identification task is in general a nonconvex problem [Hardt et al., 2018]. To circumvent this nonconvexity, the initial state can be assumed to be zero in control settings, and a known input can be used to drive the state of the system [Sastry, 1984, Sastry and Bodson, 1989]. However, in medical and financial settings, the initial state of the system is typically not known a priori, and the deviations of the initial state from a nominal value cannot be neglected. Therefore, a joint and nonconvex optimization procedure is unavoidable in these settings to estimate the initial state of the system along with the unknown model parameters [Frigola et al., 2014, Duncker et al., 2019].

2. For control of a dynamical system in a reinforcement learning task, it is most critical that the unstable modes of the system be discovered and stabilized properly. (The term stability refers to bounded-input bounded-output stability; for continuous-time linear time-invariant systems, this corresponds to the condition that all eigenvalues of the state transition matrix have strictly negative real parts.) Similarly, financial data and medical health records usually exhibit sudden changes in their pattern, which call for potentially unstable dynamics in their representation and estimation. However, the primary tools for nonconvex optimization, namely the gradient methods, fail to converge and find an accurate model representation for unstable systems [Hardt et al., 2018].

3. Especially in medical and financial data sets, the data are sampled irregularly; that is, the observations are not periodically sampled. The common heuristic for handling this situation is to impute the absent observations by interpolating the observed values of the output [Che et al., 2018]. This approach, however, might fail to capture the correct dynamics of the underlying system. An alternative is to use a model that can account for the evolution of the state of the system during unobserved intervals without requiring periodic observations [Chen et al., 2018].

In this work, we use the gradient descent algorithm to identify the unknown parameters of a linear dynamical system from its observed outputs. We look into the dynamics of this algorithm and try to pinpoint what causes the inability of the gradient methods to converge when they are used to identify an unstable dynamical system. Similar to the work of Chen et al. [2018], our analysis uses a continuous-time model so that it directly applies to irregularly sampled data sets with no need for imputation.

### 1.1 Our contributions

By analyzing the dynamics of the gradient descent algorithm during identification of a linear dynamical system, we achieve the following.

1. We obtain an upper bound on the learning rate of the gradient descent algorithm so that it can converge while learning a dynamical system with the squared-error loss. This upper bound explicitly depends on the eigenvalue of the system with the largest real part, and it shows that identifying a system becomes harder as the system becomes unstable. Furthermore, the upper bound on the learning rate shows that the samples taken at different times affect the convergence of the gradient descent algorithm in substantially different degrees.

2. To enable the convergence of the gradient descent algorithm even when learning unstable systems, we introduce a new loss function which balances the influence of the samples taken at different times on the convergence of the algorithm. Then we demonstrate the effectiveness of this loss function while estimating linear dynamical systems.

Note that the primary question our work addresses concerns the use of the gradient descent algorithm while learning a dynamical system: can this algorithm converge at all while learning the parameters of a dynamical system model? This is a different problem from whether a specific algorithm or a specific model can learn the dynamical system of interest more accurately than the state of the art.

### 1.2 Related works

Hardt et al. [2018] studied the convergence of the gradient descent algorithm while learning linear dynamical systems. They demonstrated the failure of this algorithm to learn even stable systems, and proposed a projected gradient method that fixed the issue for stable linear systems. Learning an unstable system, however, was considered to be infeasible. In contrast, we retain the standard gradient descent algorithm in this work, and we introduce a new loss function that allows learning even unstable systems with no need for projection.

If the state of a linear system is directly accessed, that is, if the output of the system is equal to the state of the system possibly with some additive noise, learning the system parameters can be formulated as an ordinary least squares problem.

Alaeddini et al. [2018] and Sarkar and Rakhlin [2019] make this assumption and arrive at a convex optimization problem. By doing so, they avoid the use of the gradient descent algorithm, and therefore, they do not suffer from the issues pointed out by Hardt et al. [2018]. However, as mentioned earlier, the assumption of having access to the true internal state is unrealistic in many application domains, such as health and finance.
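This least-squares formulation can be sketched in a few lines. The sketch below uses a hypothetical two-state system (not one from the cited works): when the state $z(t)$ is observed directly, stacking the one-step transitions $z(t+1) = A z(t)$ turns identification into a convex least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.2],
                   [0.0, 1.1]])    # one stable and one unstable mode

# simulate a fully observed trajectory z(t+1) = A z(t)
T = 30
Z = np.empty((T + 1, 2))
Z[0] = rng.standard_normal(2)
for t in range(T):
    Z[t + 1] = A_true @ Z[t]

# solve min_A || Z[1:] - Z[:-1] @ A.T ||_F via ordinary least squares
A_hat = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)[0].T
```

Because the problem is convex, the estimate recovers the true matrix from noiseless data without any of the convergence issues discussed below.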

Using variational inference is a common approach to estimate the initial state jointly with the dynamical model parameters in a Bayesian setting [Frigola et al., 2014, Archer et al., 2015, Krishnan et al., 2017, Eleftheriadis et al., 2017, Duncker et al., 2019, Gregor et al., 2019]. In this approach, a separate model is employed to estimate the initial state from the whole observed trajectory. One of the models that we will consider in this work is a simpler, deterministic counterpart of this approach. We show that convergence issues of the gradient descent algorithm are also valid for this deterministic counterpart of variational inference.

Chen et al. [2018] and Rubanova et al. [2019] use a neural network to represent a continuous-time dynamical system. Since these models are also trained with the gradient descent algorithm, they suffer from the same stability issues while learning the parameters of a dynamical model. Indeed, the training data in all the examples outlined in these works involve trajectories that converge to either a stable equilibrium or a stable limit cycle of the system.

## 2 Problem Formulation

For each $k \in K$, let $z_k(t)$ denote a continuous-time process representing the state of a linear time-invariant dynamical system:

$$\frac{\mathrm{d}z_k(t)}{\mathrm{d}t} = A z_k(t) \qquad \forall t \ge 0,\ \forall k \in K,$$

where $A$ denotes the state transition dynamics of the system. Then the evolution of the process is described by $z_k(t) = e^{At} z_k(0)$ for all $t \ge 0$ for each $k \in K$ [Callier and Desoer, 1991]. Let $\{x_k(t)\}_{t \in T_k}$ be the set of samples obtained from $z_k$ at time instants $T_k$ via an observation matrix $C$:

$$x_k(t) = C z_k(t) = C e^{At} z_k(0) \qquad \forall t \in T_k,\ \forall k \in K.$$

Define the initial state of the trajectory of $z_k$ as $s_k$; that is, let $s_k = z_k(0)$ for all $k \in K$. We will look for a linear dynamical system model that fits all the trajectories, and we will use the gradient descent algorithm to reveal the difficulty of its convergence. In particular, our goal is to study whether the gradient descent algorithm is able to converge to a solution while solving the problem

$$\underset{A,\,C}{\text{minimize}}\ \sum_{k \in K} \sum_{t \in T_k} \ell\bigl(x_k(t),\, C e^{At} s_k\bigr) \tag{1}$$

where $\ell$ is a differentiable loss function. We consider two choices for $\ell$ in the following sections: the squared-error loss, as it is used both in classical works [Åström and Eykhoff, 1971] and in recent works [Hardt et al., 2018], and the time-weighted logarithmic loss introduced in Section 4.
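The sampling model above can be sketched numerically. The following is a minimal sketch with a hypothetical two-state system (not one from the paper), evaluating $x_k(t) = C e^{At} s_k$ at irregular sampling instants via the eigendecomposition of $A$:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-2.0, -0.5]])       # hypothetical damped oscillator (stable)
C = np.array([[1.0, 0.0]])         # observe only the first state component

def expm_t(A, t):
    """Matrix exponential e^{A t} for a diagonalizable A via eigendecomposition."""
    lam, V = np.linalg.eig(A)
    return (V @ np.diag(np.exp(lam * t)) @ np.linalg.inv(V)).real

# two trajectories with different initial states and irregular sampling times
initial_states = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
sample_times = [[0.0, 0.3, 1.1], [0.2, 0.7, 2.5]]
outputs = [[C @ expm_t(A, t) @ s_k for t in T_k]
           for s_k, T_k in zip(initial_states, sample_times)]
```

Because the model is continuous-time, nothing in this construction requires the sampling instants to be periodic, which is the point made in item 3 of the introduction.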

The set of initial states $\{s_k\}_{k \in K}$ is left arbitrary in the statement of (1); we consider three possible cases for these initial states, and our analysis in the following sections applies to all three of them.

1. Each $s_k$ is known or has a fixed value. In other words, the set $\{s_k\}_{k \in K}$ is not updated by the gradient descent algorithm.

2. Each $s_k$ is also a variable, and the gradient descent algorithm optimizes over $\{s_k\}_{k \in K}$ as well:

$$\underset{A,\,C,\,\{s_k\}_{k \in K}}{\text{minimize}}\ \sum_{k \in K} \sum_{t \in T_k} \ell\bigl(x_k(t),\, C e^{At} s_k\bigr) \tag{2}$$
3. Each $s_k$ is the output of a state estimator:

$$s_k = g_\phi\bigl(\{t, x_k(t)\}_{t \in T_k}\bigr) \qquad \forall k \in K,$$

where $\phi$ denotes the parameters of this estimator, and the gradient descent algorithm solves the problem

$$\underset{A,\,C,\,\phi}{\text{minimize}}\ \sum_{k \in K} \sum_{t \in T_k} \ell\bigl(x_k(t),\, C e^{At} s_k\bigr) + L(\phi) \tag{3a}$$
$$\text{subject to}\quad s_k = g_\phi\bigl(\{t, x_k(t)\}_{t \in T_k}\bigr) \qquad \forall k \in K, \tag{3b}$$

where $L(\phi)$ is an additional loss term associated with the estimation of the initial state, and it satisfies

$$\frac{\partial L}{\partial A} = 0, \qquad \frac{\partial L}{\partial C} = 0.$$

This case can be considered as the deterministic counterpart of the framework used in variational inference of state space models [Jordan et al., 1999, Archer et al., 2015]. This comparison is discussed further in Section 6.

In the following sections, we will carry out the analysis and state the theorems for problem (2), i.e., the second case above. The statements are identically valid for the other two cases, as explained in Appendix D.

## 3 Learning with Squared-Error Loss

In this section, we consider problem (2) with the squared-error loss:

$$\underset{A,\,C,\,\{s_k\}_{k \in K}}{\text{minimize}}\ \sum_{k \in K} \sum_{t \in T_k} \bigl\| x_k(t) - C e^{At} s_k \bigr\|_2^2.$$

If we use the gradient descent algorithm to solve problem (2), the learning rate of the algorithm needs to be sufficiently small for the algorithm to converge [Bertsekas, 1999]. The next theorem gives an upper bound on the learning rate as a necessary condition for the convergence of the algorithm.

###### Theorem 1.

Let $\{z_k\}_{k \in K}$ be a set of trajectories, and let $s_k$ denote the initial state of trajectory $z_k$ for each $k \in K$. Define the set of sampling instants of $z_k$ as $T_k$, and denote the samples taken from this trajectory by $\{x_k(t)\}_{t \in T_k}$. Assume that the gradient descent algorithm is used to solve the problem

$$\min_{A,\,C,\,\{s_k\}_{k \in K}}\ \sum_{k \in K} \sum_{t \in T_k} \bigl\| x_k(t) - C e^{At} s_k \bigr\|_2^2. \tag{4}$$

Then for almost every initialization, the learning rate of the gradient descent algorithm, $\delta$, must satisfy

$$\delta \le \frac{2}{\lambda_{\min}\Bigl(\rho^2 \sum_{k \in K} \sum_{t \in T_k} t^2 e^{2\operatorname{Re}(\Lambda)t}\, \hat{s}_k \hat{s}_k^\top\Bigr)}$$

so that the algorithm can converge to the solution $(\hat{A}, \hat{C}, \{\hat{s}_k\}_{k \in K})$ achieving zero training error, where $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue of its argument, $\Lambda$ is the eigenvalue of $\hat{A}$ with the largest real part, $\rho = \|\hat{C}u\|_2$, and $u$ is the unit-norm right eigenvector of $\hat{A}$ corresponding to $\Lambda$.

###### Proof.

See Appendix A. ∎

Note that the eigenvalues of a linear dynamical system have a particular meaning in control theory: they describe the stability of the system [Callier and Desoer, 1991]. If any eigenvalue of $A$ has a strictly positive real part, then the state of the system grows unboundedly large from almost all initial points, and the system is called unstable. If, on the other hand, all eigenvalues of $A$ have negative real parts, then the state of the system converges to a fixed point from all initial points, and the system is stable.

The condition about reaching zero training error might be somewhat restrictive, but the main purpose of Theorem 1 is not to prescribe a learning rate for all possible cases; it is to reveal that samples taken at different times affect the convergence of the algorithm very differently. Indeed, Theorem 1 shows that if the gradient descent algorithm is used to learn an unstable system, samples taken at later times impose an exponentially stricter bound on the allowable learning rate, which renders learning an unstable dynamical system infeasible.
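To make this concrete, the right-hand side of the bound in Theorem 1 can be evaluated numerically. The sketch below uses hypothetical sampling times and initial states, with $\rho = 1$ for simplicity, and shows how quickly the admissible learning rate shrinks as $\operatorname{Re}(\Lambda)$ grows:

```python
import numpy as np

# delta <= 2 / lambda_min( rho^2 * sum_{k,t} t^2 e^{2 Re(Lambda) t} s_k s_k^T )
rng = np.random.default_rng(0)
times = np.arange(1.0, 11.0)           # sampling instants t = 1, ..., 10
S = rng.standard_normal((5, 3))        # five hypothetical initial states in R^3

def delta_bound(re_lam, times, S, rho=1.0):
    M = sum(rho**2 * t**2 * np.exp(2.0 * re_lam * t) * np.outer(s, s)
            for s in S for t in times)
    return 2.0 / np.linalg.eigvalsh(M).min()

# stable, marginal, and unstable values of Re(Lambda)
bounds = {re: delta_bound(re, times, S) for re in (-0.5, 0.0, 0.5)}
```

Since every coefficient $t^2 e^{2\operatorname{Re}(\Lambda)t}$ grows with $\operatorname{Re}(\Lambda)$, the matrix inside $\lambda_{\min}(\cdot)$ only gains mass, so the admissible learning rate is monotonically decreasing in $\operatorname{Re}(\Lambda)$.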

Note that if the set of initial states does not span the whole state space, then the bound given in Theorem 1 becomes void. This suggests that it is easier to train a dynamical model if the initial states of the trajectories in the training data do not have a large variance. However, this does not mean the learned model will be accurate. Since no information is available about how the system evolves for initial states in the unspanned subspace, the learned model will also fail to predict the behavior of the system for initial states with a nonzero component in this unlearned subspace.

The appearance of $\rho$ in Theorem 1 reflects the notion of observability [Callier and Desoer, 1991]. Depending on the relationship between the matrices $\hat{A}$ and $\hat{C}$, it may not be possible to observe certain eigenvalues, or modes, of the learned system in its output; these are called unobservable modes. As these modes do not appear in the output of the learned system, they cannot affect the gradient descent algorithm.

###### Remark 1.

The analysis for Theorem 1 shows that, for the Hessian $H$ of the loss function (4) at $(\hat{A}, \hat{C}, \{\hat{s}_k\}_{k \in K})$, the ratio of the largest eigenvalue to the smallest eigenvalue of $H$ satisfies

$$\frac{\lambda_{\max}(H)}{\lambda_{\min}(H)} \ge \frac{\lambda_{\min}\Bigl(\rho_1^2 \sum_{k \in K} \sum_{t \in T_k} t^2 e^{2\operatorname{Re}(\lambda_1)t}\, \hat{s}_k \hat{s}_k^\top\Bigr)}{\lambda_{\max}\Bigl(\rho_2^2 \sum_{k \in K} \sum_{t \in T_k} t^2 e^{2\operatorname{Re}(\lambda_2)t}\, \hat{s}_k \hat{s}_k^\top\Bigr)}$$

for any pair of eigenvalues $\lambda_1, \lambda_2$ of $\hat{A}$, where $\rho_1 = \|\hat{C}u_1\|_2$, $\rho_2 = \|\hat{C}u_2\|_2$, and $u_1$, $u_2$ are the right eigenvectors of $\hat{A}$ corresponding to $\lambda_1$, $\lambda_2$, respectively. This implies that, if the loss function can be represented well by its second-order approximation around $(\hat{A}, \hat{C}, \{\hat{s}_k\}_{k \in K})$, local convergence for estimating the eigenvalue $\lambda_2$ will require

$$O\left(\left[\log\left(\left(1 - \beta\,\frac{\sum_{k \in K}\sum_{t \in T_k} t^2 e^{2\operatorname{Re}(\lambda_2)t}}{\sum_{k \in K}\sum_{t \in T_k} t^2 e^{2\operatorname{Re}(\lambda_1)t}}\right)^{-1}\right)\right]^{-1}\right)$$

iterations of the gradient descent algorithm, where $\beta$ is a constant depending on $\rho_1$ and $\rho_2$. This shows that learning the stable modes of a system can become infeasible when the system is unstable. See Appendix C for more details.

The necessary condition given in Theorem 1 implies that the convergence of the algorithm gives us information about the rightmost eigenvalue of the dynamical system that is being estimated. This is stated in Corollary 1.

###### Corollary 1.

Assume that the observation matrix $C$ is fixed, that the gradient descent algorithm is used to solve the problem (4), and that the algorithm has converged from a random initialization (where the distribution is assumed to assign zero probability to every set with Lebesgue measure zero) to the solution achieving zero training error. Then the eigenvalue of $\hat{A}$ with the largest real part, $\Lambda$, almost surely satisfies

$$\operatorname{Re}(\Lambda) \le \inf_{\tau > 0}\ \frac{1}{2\tau} \log\left[\frac{1}{\delta \tau^2}\, \frac{2}{\lambda_{\min}\bigl(\sum_{k \in K} \sum_{t \in T_k} \hat{s}_k \hat{s}_k^\top\, \mathbb{1}\{t \ge \tau\}\bigr)}\right] \qquad \text{if } \operatorname{Re}(\Lambda) > 0,$$
$$\operatorname{Re}(\Lambda) \le \inf_{\tau_2 > \tau_1 > 0}\ \frac{1}{2\tau_2} \log\left[\frac{1}{\delta \tau_1^2}\, \frac{2}{\lambda_{\min}\bigl(\sum_{k \in K} \sum_{t \in T_k} \hat{s}_k \hat{s}_k^\top\, \mathbb{1}\{\tau_1 \le t \le \tau_2\}\bigr)}\right] \qquad \text{if } \operatorname{Re}(\Lambda) < 0.$$

## 4 Learning with Time-Weighted Logarithmic Loss

Theorem 1 shows that when the gradient descent algorithm is used to learn the parameters of an unstable dynamical system, the samples taken at later times are weighted exponentially more heavily around a global minimum. It is important to note that this is a consequence of choosing the squared-error loss as the training loss function. In this section, we introduce a new loss function that balances the effects of all samples on the dynamics of the algorithm. This new loss function greatly relaxes the necessary condition given in Theorem 1, and it enables training even unstable linear systems with the gradient descent algorithm.

For any $\epsilon > 0$, define $F_\epsilon : \mathbb{R} \to \mathbb{R}$ as

$$F_\epsilon(\xi) = \begin{cases} \log(\epsilon + \xi) - \log(\epsilon) & \xi \ge 0, \\ -\log(\epsilon - \xi) + \log(\epsilon) & \xi < 0. \end{cases} \tag{5}$$

Given two trajectories $x$ and $y$ sampled at time instants $T$, consider the loss function $\ell$ defined as

$$\ell(x, y) = \sum_{t \in T} \sum_{j=1}^{n} \frac{1}{t^2} \Bigl( F_\epsilon\bigl(e_j^\top x(t)\bigr) - F_\epsilon\bigl(e_j^\top y(t)\bigr) \Bigr)^2,$$

where $e_j$ denotes the $j$-th standard basis vector with a 1 in its $j$-th coordinate and 0 in all other coordinates. Note that $\ell(x, y)$ is zero if and only if $x(t) = y(t)$ for all $t \in T$, and it is strictly positive otherwise. Similar to Section 3, we will analyze this loss function for learning linear dynamical systems.

###### Theorem 2.

Let $\{z_k\}_{k \in K}$ be a set of trajectories, and let $s_k$ denote the initial state of trajectory $z_k$ for each $k \in K$. Define the set of sampling instants of $z_k$ as $T_k$, and denote the samples taken from this trajectory by $\{x_k(t)\}_{t \in T_k}$. Assume that the gradient descent algorithm is used to solve

$$\min_{A,\,C,\,\{s_k\}_{k \in K}}\ \sum_{k \in K} \sum_{t \in T_k} \sum_{j=1}^{n} \frac{1}{t^2} \Bigl( F_\epsilon\bigl(e_j^\top x_k(t)\bigr) - F_\epsilon\bigl(e_j^\top C e^{At} s_k\bigr) \Bigr)^2, \tag{6}$$

where $F_\epsilon$ is as defined in (5). Then for almost every initialization, the learning rate of the gradient descent algorithm, $\delta$, must satisfy

$$\delta \le \frac{2}{\lambda_{\min}\Bigl( \sum_{k \in K} \sum_{t \in T_k} \frac{\rho^2\, e^{2\operatorname{Re}(\Lambda)t}}{\bigl(\|\hat{C} e^{\hat{A}t} \hat{s}_k\|_\infty + \epsilon\bigr)^2}\, \hat{s}_k \hat{s}_k^\top \Bigr)}$$

so that the algorithm can converge to the solution $(\hat{A}, \hat{C}, \{\hat{s}_k\}_{k \in K})$ achieving zero training error, where $\Lambda$ is the eigenvalue of $\hat{A}$ with the largest real part, $\rho = \|\hat{C}u\|_2$, and $u$ is the right eigenvector of $\hat{A}$ corresponding to its eigenvalue $\Lambda$.

###### Proof.

See Appendix B. ∎

The necessary conditions on the step size given in Theorem 2 and in Theorem 1 are obtained by following an identical analysis procedure. Theorem 2 shows that the loss function (6) substantially relaxes the necessary condition given in Theorem 1, and it balances the weights of all sampling instants on the dynamics of the gradient descent algorithm. In other words, it makes it easier for the gradient descent algorithm to converge to a global minimum. This is demonstrated in the next section.

## 5 Experiments

To check if the time-weighted logarithmic loss function introduced in Theorem 2 allows learning linear dynamical systems with the gradient descent algorithm, we generated a set of output trajectories from randomly generated linear systems and trained a linear model with this data set by using the logarithmic loss function. We also trained the model with the same data set by using the mean-squared-error loss to compare the two estimates.

For the experiments, we considered the discretized version of the dynamical systems. In other words, we used

$$z_k(t) = A^t z_k(0) \qquad \forall t \in \mathbb{N},\ \forall k \in K.$$

Note that with this discrete-time representation, the stability of the system is described based on the position of the eigenvalues relative to the unit circle. The system is stable if all of its eigenvalues are inside the unit circle.
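This stability criterion amounts to a spectral-radius check, which can be written in one line:

```python
import numpy as np

def is_stable_discrete(A):
    """z(t+1) = A z(t) is stable iff the spectral radius of A is < 1."""
    return bool(np.max(np.abs(np.linalg.eigvals(A))) < 1.0)
```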

We randomly generated $A$ and $C$ to produce a set of observation sequences. In particular, $A$ was generated from a matrix whose entries are independent and uniformly distributed, and the elements of $C$ were drawn from independent standard normal distributions. We obtained 50 trajectories from the generated system by providing different initial states, and each trajectory consisted of 50 observations.

For training a linear model on this data set, we used the stochastic gradient method with momentum. Both for the mean-squared-error loss and for the time-weighted logarithmic loss, the gradients were normalized to unit norm whenever their norm exceeded 1. Figure 1 shows a typical plot of the training error of an unstable system for each of these loss functions. We observe that the gradient descent algorithm is not able to decrease the mean-squared-error loss, whereas the time-weighted logarithmic loss is diminished easily.

Figure 1: Typical plots of training error when the mean-squared-error loss is used [left] and when the time-weighted logarithmic loss function is used [right].
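The gradient normalization used in these experiments can be written as a small helper; this is a sketch of the stated procedure, not code from the paper:

```python
import numpy as np

def normalize_gradient(grad, max_norm=1.0):
    """Rescale grad to unit norm whenever its norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad
```

This keeps the update direction intact while preventing the exponentially large gradients of an unstable model from destabilizing the optimizer.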

To check if this decrease in the loss function corresponds to an effective learning of the actual model, we computed the eigenvalues of the estimated system throughout training and compared them with the eigenvalues of the actual system. Figure 2 demonstrates an example of how the estimates for the eigenvalues evolve during training. The state space of the system in Figure 2 is three dimensional, and the system is unstable as one of its eigenvalues is outside of the unit circle. When the mean-squared-error loss is used, only the unstable mode of the system is estimated correctly. In contrast, the time-weighted logarithmic loss function is able to discover all three modes of the system. Additional plots are provided in Appendix E.

## 6 Discussion

Variational inference. Variational inference is a Bayesian approach to handle the unknown parameters and the unobserved states of a dynamical system simultaneously [Jordan et al., 1999, Archer et al., 2015]. For variational inference, the system is described by a generative model $p_\theta(x, z)$, where $x$ and $z$ are the sequences of observations and hidden states of the system, and $\theta$ denotes the parameters of the model. Given the observations, the posterior is approximated by another model, $g_\phi(z \mid x)$. Then, the objective function to be minimized is described as [Archer et al., 2015]

$$-H\bigl(g_\phi(z \mid x)\bigr) - \mathbb{E}_{g_\phi(z \mid x)}\bigl[\log p_\theta(x, z)\bigr], \tag{7}$$

where $H(\cdot)$ denotes the entropy of its argument. Assume that the stochasticity of the initial state and the state transitions is removed, and that each observation is obtained through an observation mapping $c(\cdot)$ with additive Gaussian noise:

$$x(t) = c(z(t)) + \xi_t,$$

where $\{\xi_t\}$ is an independent and identically distributed sequence. Then the minimization of the loss function (7) reduces to the problem

$$\underset{\theta,\,\phi}{\text{minimize}}\ \sum_{t \in T} \bigl\| x(t) - c(z(t)) \bigr\|_2^2 \qquad \text{subject to}\quad z(0) = \arg\max_{\tilde{z}}\, g_\phi(\tilde{z} \mid x),$$

and the system identification problem becomes equivalent to problem (3). This is the reason why we referred to (3) as the deterministic counterpart of the variational inference formulation.

Random initial states. In our analyses, we treated the initial states as unknown but deterministic values that could be learned during training. With this deterministic viewpoint, Theorem 1 and Theorem 2 assumed that the observations could be matched to the latent state of the system perfectly and the training loss function could be made identically zero. It is not possible in general to satisfy this requirement with an expected loss over a set of random initial states. Therefore, Theorem 1 and Theorem 2 do not apply to the formulations with random initial states verbatim.

Convergence of policy gradient. Even though the focus of this work has been on system identification, the gradient descent algorithm will exhibit similar convergence problems when maximizing an objective over a time horizon while altering the dynamics of a dynamical system. Note that policy gradient methods in reinforcement learning [Sutton and Barto, 2018] fall into this category. This is why our analysis in this work can potentially be used for studying and improving the stability of policy gradient methods.

## 7 Conclusion

To understand the hardness of learning dynamical systems from observed trajectories, we analyzed the dynamics of the gradient descent algorithm while training the parameters of a dynamical model, and we observed that samples taken at different times affect the dynamics of the algorithm in substantially different degrees. To balance the effect of samples taken at different times, we introduced the time-weighted logarithmic loss function and demonstrated its effectiveness.

In this work, we focused on learning linear dynamical systems. Whether a similar loss function improves training of nonlinear models is an important direction for future research. In addition, we considered a deterministic framework for our problem formulation with a dynamical system. An interesting question is whether allowing randomness in the state of the system or the state transitions could trade off the accuracy of the estimated model for the efficiency of the training procedure.

## References

• A. Alaeddini, S. Alemzadeh, A. Mesbahi, and M. Mesbahi (2018) Linear model regression on time-series data: non-asymptotic error bounds and applications. In IEEE Conference on Decision and Control, pp. 2259–2264. Cited by: §1.2.
• E. Archer, I. M. Park, L. Buesing, J. Cunningham, and L. Paninski (2015) Black box variational inference for state space models. arXiv preprint arXiv:1511.07367. Cited by: §1.2, item 3, §6.
• K. J. Åström and P. Eykhoff (1971) System identification—a survey. Automatica 7 (2), pp. 123–162. Cited by: §2.
• D. P. Bertsekas (1999) Nonlinear programming. 2nd edition, Athena Scientific. Cited by: §3.
• F. M. Callier and C. A. Desoer (1991) Linear system theory. Springer-Verlag. Cited by: §2, §3, §3.
• Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu (2018) Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8 (1), pp. 6085. Cited by: item 3.
• T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud (2018) Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583. Cited by: item 3, §1.2, §1.
• L. Duncker, G. Bohner, J. Boussard, and M. Sahani (2019) Learning interpretable continuous-time models of latent stochastic dynamical systems. In International Conference on Machine Learning, Vol. 97, pp. 1726–1734. Cited by: item 1, §1.2.
• S. Eleftheriadis, T. Nicholson, M. Deisenroth, and J. Hensman (2017) Identification of gaussian process state space models. In Advances in Neural Information Processing Systems, pp. 5309–5319. Cited by: §1.2.
• R. Frigola, Y. Chen, and C. E. Rasmussen (2014) Variational gaussian process state-space models. In Advances in Neural Information Processing Systems, pp. 3680–3688. Cited by: item 1, §1.2.
• G. Ganeshapillai, J. Guttag, and A. Lo (2013) Learning connections in financial time series. In International Conference on Machine Learning, pp. 109–117. Cited by: §1.
• K. Gregor, G. Papamakarios, F. Besse, L. Buesing, and T. Weber (2019) Temporal difference variational auto-encoder. In International Conference on Learning Representations, Cited by: §1.2.
• M. Hardt, T. Ma, and B. Recht (2018) Gradient descent learns linear dynamical systems. Journal of Machine Learning Research 19 (29), pp. 1–44. Cited by: item 1, item 2, §1.2, §1.2, §2.
• M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999) An introduction to variational methods for graphical models. Machine Learning 37 (2), pp. 183–233. Cited by: item 3, §6.
• H. K. Khalil (1996) Nonlinear systems. 2nd edition, Prentice Hall. Cited by: Appendix A.
• R. G. Krishnan, U. Shalit, and D. Sontag (2017) Structured inference networks for nonlinear state space models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 2101–2109. Cited by: §1.2.
• N. O. Lambert, D. S. Drew, J. Yaconelli, S. Levine, R. Calandra, and K. S. Pister (2019) Low-level control of a quadrotor with deep model-based reinforcement learning. IEEE Robotics and Automation Letters 4 (4), pp. 4224–4230. Cited by: §1.
• Y. Rubanova, R. T. Q. Chen, and D. Duvenaud (2019) Latent ODEs for irregularly-sampled time series. In Advances in Neural Information Processing Systems, Cited by: §1.2, §1.
• T. Sarkar and A. Rakhlin (2019) Near optimal finite time identification of arbitrary linear dynamical systems. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 5610–5618. Cited by: §1.2.
• S. S. Sastry (1984) Model-reference adaptive control – stability, parameter convergence, and robustness. IMA Journal of Mathematical Control and Information 1 (1), pp. 27–66. Cited by: item 1.
• S. Sastry and M. Bodson (1989) Adaptive control: stability, convergence and robustness. Prentice Hall. Cited by: item 1.
• R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §6.
• R. S. Tsay (2014) Financial time series. Wiley StatsRef: Statistics Reference Online. Cited by: §1.
• T. Zhang, G. Kahn, S. Levine, and P. Abbeel (2016) Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In IEEE International Conference on Robotics and Automation, pp. 528–535. Cited by: §1.

## Appendix A Proof of Theorem 1

To begin with, assume that $C$ is a fixed matrix, and consider only one trajectory with only one sample $x$ taken at time $t$. Then the loss function to be minimized is

$$\ell(A, s) = \frac{1}{2}\bigl\| x - C e^{At} s \bigr\|_2^2,$$

where $s$ denotes the initial state of the trajectory. The update rule for the gradient descent algorithm gives

$$A \leftarrow A - \frac{\delta}{2} \frac{\partial}{\partial A} \bigl\langle C e^{At} s - x,\ C e^{At} s - x \bigr\rangle \tag{8a}$$
$$s \leftarrow s - \frac{\delta}{2} \frac{\partial}{\partial s} \bigl\langle C e^{At} s - x,\ C e^{At} s - x \bigr\rangle \tag{8b}$$

This update rule creates a nonlinear dynamical system whose state is the parameter pair $(A, s)$.

A dynamical system can converge to its equilibrium only if that equilibrium is stable in the sense of Lyapunov. A standard tool to analyze the stability for nonlinear systems is given by Lyapunov’s direct method: an equilibrium of a nonlinear system can be stable only if the linearization of the system around that equilibrium has no unstable mode [Khalil, 1996]. If, on the other hand, the linearized model has an eigenvalue larger than 1 in magnitude, then the nonlinear system is definitely unstable — which rules out the possibility of convergence to this equilibrium from its neighbors, except for a set on a low-dimensional manifold, which has Lebesgue measure zero. This shows that the system (8) can converge to an equilibrium only if all eigenvalues of the linearized model around that equilibrium are less than 1 in magnitude.

We can write the linearization of (8) around an equilibrium $(\hat{A}, \hat{s})$ as

$$\tilde{A} \leftarrow \tilde{A} - \delta f_1(\tilde{A}) - \delta f_2(\tilde{s}), \qquad \tilde{s} \leftarrow \tilde{s} - \delta f_3(\tilde{A}) - \delta f_4(\tilde{s}),$$

where

• $f_1$ is the Jacobian with respect to $A$ of the gradient with respect to $A$ of the loss function at $(\hat{A}, \hat{s})$,

• $f_2$ is the Jacobian with respect to $s$ of the gradient with respect to $A$ of the loss function at $(\hat{A}, \hat{s})$,

and $f_3$ and $f_4$ are defined similarly. Note that $f_2$ and $f_3$ are the Jacobians of the gradients of the same function with respect to the same parameters taken in different orders; therefore, they are Hermitian adjoints of each other:

$$\bigl\langle \tilde{A},\ f_2(\tilde{s}) \bigr\rangle = \bigl\langle f_3(\tilde{A}),\ \tilde{s} \bigr\rangle \qquad \forall \tilde{A},\ \forall \tilde{s}.$$

This shows that the linearized model can be associated with a symmetric matrix; and consequently, all of its eigenvalues are real-valued, and its eigenvalues can be less than 1 only if all of its diagonal blocks have eigenvalues less than 1. In other words, a necessary condition for the solution to be stable is that the mappings

$$\tilde{A} \leftarrow \tilde{A} - \delta f_1(\tilde{A}) \tag{9}$$
$$\tilde{s} \leftarrow \tilde{s} - \delta f_4(\tilde{s}) \tag{10}$$

have eigenvalues less than 1 in magnitude, or equivalently, that the functions $f_1$ and $f_4$ have eigenvalues less than $2/\delta$. Note that this conclusion would be identical if $C$ were also updated via the gradient descent algorithm. In particular, we would then also need the eigenvalues of the corresponding mapping for $\tilde{C}$ to be less than 1 in magnitude around the equilibrium.

Finding a lower bound for the largest eigenvalue of the mapping $f_1$ will be easier with the following lemma.

###### Lemma 1.

Let $f_i : \mathbb{R}^n \to \mathbb{R}$ be a twice-differentiable function for all $i \in I$, and define

$$F(x) = \frac{1}{2} \sum_{i \in I} f_i^2(x).$$

If $F(x_0) = 0$, then the Hessian of $F$ at $x_0$ satisfies

$$\nabla^2 F(x_0) = \sum_{i \in I} \nabla f_i(x_0)\, \nabla f_i(x_0)^\top.$$
###### Proof.

We can write the gradient and the Hessian of $F$, respectively, as

$$\nabla F(x_0) = \sum_{i \in I} \bigl(\nabla f_i(x_0)\bigr) f_i(x_0), \qquad \nabla^2 F(x_0) = \sum_{i \in I} \nabla f_i(x_0)\, \nabla f_i(x_0)^\top + f_i(x_0) \cdot \nabla^2 f_i(x_0).$$

Note that $F(x_0) = 0$ implies that $f_i(x_0) = 0$ for all $i \in I$. Then we have

$$\nabla^2 F(x_0) = \sum_{i \in I} \nabla f_i(x_0)\, \nabla f_i(x_0)^\top. \qquad \blacksquare$$
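Lemma 1 can be sanity-checked numerically on a small hypothetical example: at a zero-residual point, the finite-difference Hessian of $F$ should match the Gauss–Newton matrix $\sum_i \nabla f_i \nabla f_i^\top$.

```python
import numpy as np

x0 = np.array([1.0, 2.0])
fs = [lambda x: x[0] - 1.0,
      lambda x: (x[1] - 2.0) * x[0]]      # both vanish at x0, so F(x0) = 0

def F(x):
    return 0.5 * sum(f(x) ** 2 for f in fs)

def numerical_hessian(F, x, h=1e-5):
    """Central-difference Hessian of a scalar function F at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = h * np.eye(n)[i], h * np.eye(n)[j]
            H[i, j] = (F(x + ei + ej) - F(x + ei - ej)
                       - F(x - ei + ej) + F(x - ei - ej)) / (4.0 * h * h)
    return H

grads_at_x0 = [np.array([1.0, 0.0]),      # grad f_1 at x0
               np.array([0.0, 1.0])]      # grad f_2 at x0 = ((x1 - 2), x0)
gauss_newton = sum(np.outer(g, g) for g in grads_at_x0)
```

The residual term $f_i(x_0) \cdot \nabla^2 f_i(x_0)$ drops out exactly because every $f_i$ vanishes at $x_0$, which is the content of the lemma.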

Remember that $f_1$ is the Jacobian with respect to $A$ of the gradient with respect to $A$ of the loss function

$$\ell(A, C, s) = \frac{1}{2} \bigl\langle C e^{At} s - x,\ C e^{At} s - x \bigr\rangle.$$

Given $x \in \mathbb{R}^n$, we can write

$$\ell(A, C, s) = \frac{1}{2}\sum_{j=1}^n \left(e_j^\top Ce^{At}s - e_j^\top x\right)^2,$$

where $e_j$ is the $j$-th standard basis vector with a 1 in its $j$-th coordinate and 0 in all other coordinates. Then, by using Lemma 1, the largest eigenvalue of the mapping $f_1$ can be lower bounded by

$$\max_{Y:\,\|Y\|_F=1}\ \sum_{j=1}^n \left|\left\langle Y, \nabla_A\!\left(e_j^\top Ce^{At}s - e_j^\top x\right)\right\rangle\right|^2. \qquad (11)$$

To find the gradient, we can expand the matrix exponential:

$$\nabla_A\!\left(e_j^\top C\sum_{k=0}^\infty \frac{t^k}{k!}A^k s\right) = \sum_{k=1}^\infty\sum_{r=0}^{k-1}\frac{t^k}{k!}\,(A^\top)^r C^\top e_j s^\top (A^\top)^{k-1-r}.$$

If we choose $\tilde Y = uv^\top$, where $u$ and $v$ are the unit-norm right and left eigenvectors of $A$ corresponding to its eigenvalue $\Lambda$ with the largest real part, we obtain

$$\left\langle \tilde Y, \nabla_A\!\left(e_j^\top C\sum_{k=0}^\infty \frac{t^k}{k!}A^k s\right)\right\rangle = \sum_{k=1}^\infty\sum_{r=0}^{k-1}\frac{t^k}{k!}\,\Lambda^{k-1}\langle u, C^\top e_j\rangle\langle v, s\rangle = \sum_{k=1}^\infty \frac{t^k}{(k-1)!}\,\Lambda^{k-1}\langle u, C^\top e_j\rangle\langle v, s\rangle = t e^{\Lambda t}\langle u, C^\top e_j\rangle\langle v, s\rangle = t e^{\Lambda t}\langle Cu, e_j\rangle\langle v, s\rangle.$$

Remember that (11) is a lower bound for the largest eigenvalue of $f_1$, and so is

$$\sum_{j=1}^n\left|\left\langle\tilde Y, \nabla_A\!\left(e_j^\top Ce^{At}s - e_j^\top x\right)\right\rangle\right|^2 = \sum_{j=1}^n t^2 e^{2\mathrm{Re}(\Lambda)t}\left|\langle Cu, e_j\rangle\right|^2 |\langle v, s\rangle|^2 = \rho^2 t^2 e^{2\mathrm{Re}(\Lambda)t}\,|\langle v, s\rangle|^2,$$

where $\mathrm{Re}(\Lambda)$ is the largest real part of the eigenvalues of $A$ and $\rho = \|Cu\|_2$. If we have multiple trajectories, this lower bound becomes

$$\sum_{k\in K}\sum_{t\in T_k}\rho^2 t^2 e^{2\mathrm{Re}(\Lambda)t}\,|\langle v, s_k\rangle|^2,$$

where $\{s_k\}_{k\in K}$ is the set of initial states of the trajectories.
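The closed-form projection $t e^{\Lambda t}\langle Cu, e_j\rangle\langle v, s\rangle$ along the eigen-direction $\tilde Y = uv^\top$ can be checked against a finite-difference directional derivative. The following sketch uses a hypothetical system (a random similarity transform of a fixed diagonal matrix, so the dominant eigenvalue is real and known); all matrices and numbers are illustrative, not from the paper:

```python
import numpy as np
from scipy.linalg import expm

# Check: with Y = u v^T built from unit-norm right/left eigenvectors of A
# for the eigenvalue L with the largest real part, the directional
# derivative of e_j^T C e^{At} s along Y equals t e^{Lt} <Cu, e_j> <v, s>.

rng = np.random.default_rng(0)
n = 4
P = rng.standard_normal((n, n))
A = P @ np.diag([1.2, 0.5, -0.3, -1.0]) @ np.linalg.inv(P)  # top eigenvalue 1.2
C = rng.standard_normal((n, n))
s = rng.standard_normal(n)
t, j = 0.7, 2

# Right eigenvector u of A and left eigenvector v (eigenvector of A^T).
w, V = np.linalg.eig(A)
i = int(np.argmax(w.real))
lam = w[i].real                      # real by construction
u = np.real(V[:, i]); u /= np.linalg.norm(u)
wl, U = np.linalg.eig(A.T)
il = int(np.argmin(np.abs(wl - w[i])))
v = np.real(U[:, il]); v /= np.linalg.norm(v)

Y = np.outer(u, v)                   # unit Frobenius norm since ||u|| = ||v|| = 1
h = 1e-6
g = lambda M: (C @ expm(M * t) @ s)[j]
directional = (g(A + h * Y) - g(A - h * Y)) / (2 * h)  # <Y, grad_A g>
closed_form = t * np.exp(lam * t) * (C @ u)[j] * np.dot(v, s)
print(directional, closed_form)      # the two values should agree closely
```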

As a result, for convergence of the gradient descent algorithm to a solution $(\hat A, \hat C, \{\hat s_k\}_{k\in K})$, it is necessary that

$$\sum_{k\in K}\sum_{t\in T_k}\rho^2 t^2 e^{2\mathrm{Re}(\Lambda)t}\,|\langle v, \hat s_k\rangle|^2 \le \frac{2}{\delta}.$$

Without making any assumptions about the eigenvectors of $\hat A$, we can obtain the final necessary condition as

$$\lambda_{\min}\!\left(\sum_{k\in K}\sum_{t\in T_k}\rho^2 t^2 e^{2\mathrm{Re}(\Lambda)t}\,\hat s_k\hat s_k^\top\right) \le \frac{2}{\delta},$$

or equivalently as

$$\delta \le \frac{2}{\lambda_{\min}\!\left(\rho^2\sum_{k\in K}\sum_{t\in T_k}t^2 e^{2\mathrm{Re}(\Lambda)t}\,\hat s_k\hat s_k^\top\right)}.$$

This completes the proof.
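To see what this necessary condition means in practice, the following hypothetical computation (the states, observation times, unstable mode $\mathrm{Re}(\Lambda)=0.3$, and the helper `step_size_bound` are all our own illustrative choices, not from the paper) evaluates the admissible step size $2/\lambda_{\min}(\cdot)$ as the observation horizon grows:

```python
import numpy as np

# Illustration of the step-size bound
#   delta <= 2 / lambda_min(rho^2 * sum_{k,t} t^2 e^{2 Re(L) t} s_k s_k^T):
# for an unstable mode (Re(L) > 0) the bound decays as the observation
# horizon grows, which is what makes squared-error training of unstable
# systems fail.

def step_size_bound(re_lam, times, states, rho=1.0):
    n = states[0].size
    M = np.zeros((n, n))
    for s in states:
        for t in times:
            M += rho**2 * t**2 * np.exp(2 * re_lam * t) * np.outer(s, s)
    lam_min = np.linalg.eigvalsh(M)[0]   # smallest eigenvalue
    return 2.0 / lam_min

states = [np.array([1.0, 0.3]), np.array([-0.2, 1.0])]  # two initial states
for T in (5, 10, 20):
    times = np.arange(1, T + 1)
    print(T, step_size_bound(re_lam=0.3, times=times, states=states))
```

Because of the $e^{2\mathrm{Re}(\Lambda)t}$ factor, the admissible step size shrinks exponentially with the horizon whenever $\mathrm{Re}(\Lambda) > 0$; this is the imbalance the time-weighted logarithmic loss is designed to remove.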

## Appendix B Proof of Theorem 2

Similar to the proof of Theorem 1, we will use Lemma 1 to find a lower bound for the largest eigenvalue of the linearized system around the solution $(\hat A, \hat C, \{\hat s_k\}_{k\in K})$. Then,

$$\nabla_A \log\!\left(e_j^\top Ce^{At}s + \epsilon\right) = \nabla_A \log\!\left(e_j^\top C\sum_{k=0}^\infty\frac{t^k}{k!}A^k s + \epsilon\right) = \frac{1}{e_j^\top Ce^{At}s + \epsilon}\sum_{k=1}^\infty\sum_{r=0}^{k-1}\frac{t^k}{k!}\,(A^\top)^r C^\top e_j s^\top (A^\top)^{k-1-r}.$$

For the matrix $\tilde Y = uv^\top$, where $u$ and $v$ are the unit-norm right and left eigenvectors of $A$ corresponding to its eigenvalue $\Lambda$ with the largest real part, we have

$$\left\langle \tilde Y, \nabla_A\log\!\left(e_j^\top Ce^{At}s+\epsilon\right)\right\rangle = \frac{1}{e_j^\top Ce^{At}s+\epsilon}\sum_{k=1}^\infty\frac{t^k}{(k-1)!}\,\Lambda^{k-1}\langle u, C^\top e_j\rangle\langle v, s\rangle = \frac{te^{\Lambda t}}{e_j^\top Ce^{At}s+\epsilon}\langle u, C^\top e_j\rangle\langle v, s\rangle.$$

By using Lemma 1, we obtain a lower bound for the largest eigenvalue of the linearization of the gradient descent algorithm around the solution as

$$\sum_{k\in K}\sum_{t\in T_k}\sum_{j=1}^n \frac{1}{t^2}\left|\frac{te^{\Lambda t}}{e_j^\top Ce^{At}s_k+\epsilon}\langle Cu, e_j\rangle\langle v, s_k\rangle\right|^2.$$

We can write a further lower bound for this expression as

$$\sum_{k\in K}\sum_{t\in T_k}\sum_{j=1}^n\frac{e^{2\mathrm{Re}(\Lambda)t}}{\left(\|Ce^{At}s_k\|_\infty+\epsilon\right)^2}\left|\langle Cu, e_j\rangle\right|^2|\langle v, s_k\rangle|^2 = \sum_{k\in K}\sum_{t\in T_k}\frac{\rho^2 e^{2\mathrm{Re}(\Lambda)t}}{\left(\|Ce^{At}s_k\|_\infty+\epsilon\right)^2}\,|\langle v, s_k\rangle|^2,$$

and finally,

$$\lambda_{\min}\!\left(\sum_{k\in K}\sum_{t\in T_k}\frac{\rho^2 e^{2\mathrm{Re}(\Lambda)t}}{\left(\|Ce^{At}s_k\|_\infty+\epsilon\right)^2}\,s_k s_k^\top\right),$$

where $\rho = \|Cu\|_2$ and $u$ is the right eigenvector of $A$ corresponding to its eigenvalue $\Lambda$. For stability of the algorithm around the equilibrium point $(\hat A, \hat C, \{\hat s_k\}_{k\in K})$, we need

$$\lambda_{\min}\!\left(\sum_{k\in K}\sum_{t\in T_k}\frac{\rho^2 e^{2\mathrm{Re}(\Lambda)t}}{\left(\|\hat Ce^{\hat At}\hat s_k\|_\infty+\epsilon\right)^2}\,\hat s_k\hat s_k^\top\right) \le \frac{2}{\delta},$$

where $\delta$ is the step size of the algorithm.
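It is instructive to compare the per-sample curvature terms of the two losses on a hypothetical unstable system (the matrix, state, and constants below are illustrative only): for the squared loss the term $\rho^2 t^2 e^{2\mathrm{Re}(\Lambda)t}$ grows without bound, while for the time-weighted logarithmic loss the term $\rho^2 e^{2\mathrm{Re}(\Lambda)t}/(\|Ce^{At}s\|_\infty+\epsilon)^2$ stays bounded, because $\|Ce^{At}s\|_\infty$ itself grows like $e^{\mathrm{Re}(\Lambda)t}$:

```python
import numpy as np
from scipy.linalg import expm

# Compare the curvature lower-bound terms of the squared loss and the
# time-weighted logarithmic loss on a toy unstable system.

A = np.array([[0.4, 1.0], [0.0, 0.1]])   # unstable: Re(Lambda) = 0.4 > 0
C = np.eye(2)
s = np.array([1.0, 0.5])
rho, eps, re_lam = 1.0, 1e-3, 0.4

for t in (1.0, 5.0, 10.0, 20.0):
    sq_term = rho**2 * t**2 * np.exp(2 * re_lam * t)          # squared loss
    norm_inf = np.max(np.abs(C @ expm(A * t) @ s))
    log_term = rho**2 * np.exp(2 * re_lam * t) / (norm_inf + eps)**2  # log loss
    print(t, sq_term, log_term)
```

The squared-loss term explodes with $t$, forcing a vanishing step size, while the logarithmic-loss term approaches a constant, so all observation times constrain the step size comparably.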

## Appendix C Remarks on Convergence Rate

In the proof of Theorem 1, we considered the mapping

$$\tilde A \leftarrow \tilde A - \delta f_1(\tilde A),$$

where $f_1$ is the Jacobian of the gradient of the loss function

$$\ell(A, s) = \frac{1}{2}\left\|x - Ce^{At}s\right\|_2^2$$

with respect to $A$ at the point $(\hat A, \hat s)$. For Theorem 1, we computed the largest learning rate at which the algorithm can still converge to the specified equilibrium. Note that this was equivalent to computing a lower bound for the largest eigenvalue of the mapping $f_1$. Similar to the proof of Theorem 1, we can compute an upper bound for the smallest eigenvalue of $f_1$ around the solution $(\hat A, \hat s)$.

By using Lemma 1, the smallest eigenvalue of the mapping $f_1$ can be upper bounded by

$$\min_{Y:\,\|Y\|_F=1}\ \sum_{j=1}^n\left|\left\langle Y, \nabla_A\!\left(e_j^\top Ce^{At}s - e_j^\top x\right)\right\rangle\right|^2. \qquad (12)$$

Similar to the proof of Theorem 1, we can expand the matrix exponential:

$$\nabla_A\!\left(e_j^\top C\sum_{k=0}^\infty \frac{t^k}{k!}A^k s\right) = \sum_{k=1}^\infty\sum_{r=0}^{k-1}\frac{t^k}{k!}\,(A^\top)^r C^\top e_j s^\top (A^\top)^{k-1-r}.$$

If we choose $\tilde Y = uv^\top$, where $u$ and $v$ are the unit-norm right and left eigenvectors of $A$ corresponding to its eigenvalue $\lambda_2$, we obtain

$$\left\langle\tilde Y, \nabla_A\!\left(e_j^\top C\sum_{k=0}^\infty\frac{t^k}{k!}A^ks\right)\right\rangle = \sum_{k=1}^\infty\sum_{r=0}^{k-1}\frac{t^k}{k!}\,\lambda_2^{k-1}\langle u, C^\top e_j\rangle\langle v, s\rangle = \sum_{k=1}^\infty\frac{t^k}{(k-1)!}\,\lambda_2^{k-1}\langle u, C^\top e_j\rangle\langle v, s\rangle = te^{\lambda_2 t}\langle u, C^\top e_j\rangle\langle v, s\rangle = te^{\lambda_2 t}\langle Cu, e_j\rangle\langle v, s\rangle.$$

Remember that (12) is an upper bound for the smallest eigenvalue of $f_1$, and so is

$$\sum_{j=1}^n\left|\left\langle\tilde Y,\nabla_A\!\left(e_j^\top Ce^{At}s - e_j^\top x\right)\right\rangle\right|^2 = \sum_{j=1}^n t^2e^{2\mathrm{Re}(\lambda_2)t}\left|\langle Cu, e_j\rangle\right|^2|\langle v,s\rangle|^2 = \rho^2 t^2 e^{2\mathrm{Re}(\lambda_2)t}\,|\langle v, s\rangle|^2,$$

where $\rho = \|Cu\|_2$. If we have multiple trajectories, this upper bound becomes

$$\sum_{k\in K}\sum_{t\in T_k}\rho^2t^2e^{2\mathrm{Re}(\lambda_2)t}\,|\langle v, s_k\rangle|^2,$$

where $\{s_k\}_{k\in K}$ is the set of initial states of the trajectories. We can bring this upper bound into a form independent of $v$:

$$\lambda_{\max}\!\left(\sum_{k\in K}\sum_{t\in T_k}\rho^2t^2e^{2\mathrm{Re}(\lambda_2)t}\,s_ks_k^\top\right).$$

This shows that the ratio of the largest eigenvalue to the smallest eigenvalue of $f_1$ satisfies

$$\frac{\lambda_{\max}(f_1)}{\lambda_{\min}(f_1)} \ge \frac{\lambda_{\min}\!\left(\rho_1^2\sum_{k\in K}\sum_{t\in T_k}t^2e^{2\mathrm{Re}(\lambda_1)t}\,\hat s_k\hat s_k^\top\right)}{\lambda_{\max}\!\left(\rho_2^2\sum_{k\in K}\sum_{t\in T_k}t^2e^{2\mathrm{Re}(\lambda_2)t}\,\hat s_k\hat s_k^\top\right)}$$

for any pair of eigenvalues $(\lambda_1, \lambda_2)$ of $\hat A$, where $\rho_1 = \|\hat Cu_1\|_2$, $\rho_2 = \|\hat Cu_2\|_2$, and $u_1$, $u_2$ are the right eigenvectors of $\hat A$ corresponding to $\lambda_1$, $\lambda_2$. If $H$ denotes the Hessian of the loss function at the point $(\hat A, \hat s)$, we have $\lambda_{\max}(H) \ge \lambda_{\max}(f_1)$ and $\lambda_{\min}(H) \le \lambda_{\min}(f_1)$. Therefore, we also have

$$\frac{\lambda_{\max}(H)}{\lambda_{\min}(H)} \ge \frac{\lambda_{\min}\!\left(\rho_1^2\sum_{k\in K}\sum_{t\in T_k}t^2e^{2\mathrm{Re}(\lambda_1)t}\,\hat s_k\hat s_k^\top\right)}{\lambda_{\max}\!\left(\rho_2^2\sum_{k\in K}\sum_{t\in T_k}t^2e^{2\mathrm{Re}(\lambda_2)t}\,\hat s_k\hat s_k^\top\right)}.$$