# Learning Linear Dynamical Systems via Spectral Filtering

We present an efficient and practical algorithm for the online prediction of discrete-time linear dynamical systems with a symmetric transition matrix. We circumvent the non-convex optimization problem using improper learning: carefully overparameterize the class of LDSs by a polylogarithmic factor, in exchange for convexity of the loss functions. From this arises a polynomial-time algorithm with a near-optimal regret guarantee, with an analogous sample complexity bound for agnostic learning. Our algorithm is based on a novel filtering technique, which may be of independent interest: we convolve the time series with the eigenvectors of a certain Hankel matrix.



## Code Repositories

### WaveFiltering

Wave-filtering for learning and controlling dynamical systems


## 1 Introduction

Linear dynamical systems (LDSs) are a class of state space models which accurately model many phenomena in nature and engineering, and are applied ubiquitously in time-series analysis, robotics, econometrics, medicine, and meteorology. In this model, the time evolution of a system is explained by a linear map on a finite-dimensional hidden state, subject to disturbances from input and noise. Recent interest has focused on the effectiveness of recurrent neural networks (RNNs), a nonlinear variant of this idea, for modeling sequences such as audio signals and natural language.

Central to this field of study is the problem of *system identification*: given some sample trajectories, output the parameters for an LDS which generalize to predict unseen future data. Viewed directly, this is a non-convex optimization problem, for which efficient algorithms with theoretical guarantees are very difficult to obtain. A standard heuristic for this problem is expectation-maximization (EM), which can find poor local optima in theory and practice.

We consider a different approach: we formulate system identification as an online learning problem, in which neither the data nor predictions are assumed to arise from an LDS. Furthermore, we slightly overparameterize the class of predictors, yielding an online convex program amenable to efficient regret minimization. This carefully chosen relaxation, which is our main theoretical contribution, expands the dimension of the hypothesis class by only a polylogarithmic factor. This construction relies upon recent work on the spectral theory of Hankel matrices.

The result is a simple and practical algorithm for time-series prediction, which deviates significantly from existing methods. We coin the term wave-filtering for our method, in reference to our relaxation’s use of convolution by wave-shaped eigenvectors. We present experimental evidence on both toy data and a physical simulation, showing our method to be competitive in terms of predictive performance, more stable, and significantly faster than existing algorithms.

### 1.1 Our contributions

Consider a discrete-time linear dynamical system with inputs $x_t$, outputs $y_t$, and a latent state $h_t$, which can all be multi-dimensional. With noise vectors $\eta_t, \xi_t$, the system's time evolution is governed by the following equations:

$$h_{t+1} = A h_t + B x_t + \eta_t$$
$$y_t = C h_t + D x_t + \xi_t.$$
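As a concrete illustration, here is a minimal NumPy sketch of this recursion (the function name, dimensions, and noise model are our own choices, not from the paper):

```python
import numpy as np

def simulate_lds(A, B, C, D, xs, noise_std=0.0, seed=0):
    """Roll out h_{t+1} = A h_t + B x_t + eta_t,  y_t = C h_t + D x_t + xi_t."""
    rng = np.random.default_rng(seed)
    d, m = A.shape[0], C.shape[0]
    h = np.zeros(d)                       # start from h_0 = 0
    ys = []
    for x in xs:
        y = C @ h + D @ x + noise_std * rng.standard_normal(m)
        ys.append(y)
        h = A @ h + B @ x + noise_std * rng.standard_normal(d)
    return np.array(ys)
```

With `noise_std=0`, the rollout is deterministic and can be compared directly against any candidate predictor.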

If the dynamics $(A, B, C, D)$ are known, then the Kalman filter [Kal60] is known to estimate the hidden state optimally under Gaussian noise, thereby producing optimal predictions of the system's response to any given input. However, this is rarely the case: real-world systems are seldom purely linear, and rarely are their evolution matrices known.

We henceforth give a provable, efficient algorithm for the prediction of sequences arising from an unknown dynamical system as above, in which the transition matrix $A$ is symmetric. Our main theoretical contribution is a regret bound for this algorithm, giving nearly-optimal convergence to the lowest mean squared prediction error (MSE) realizable by a symmetric LDS model:

###### Theorem 1 (Main regret bound; informal).

On an arbitrary sequence $\{(x_t, y_t)\}_{t=1}^T$, Algorithm 1 makes predictions $\hat y_t$ which satisfy

$$\mathrm{MSE}(\hat y_1, \ldots, \hat y_T) - \mathrm{MSE}(\hat y^*_1, \ldots, \hat y^*_T) \leq \tilde O\!\left(\frac{\mathrm{poly}(n, m, d, \log T)}{\sqrt{T}}\right),$$

compared to the best predictions $\hat y^*_t$ by a symmetric LDS, while running in polynomial time.

Note that the signal $\{y_t\}$ need not be generated by an LDS, and can even be adversarially chosen. In the less general batch (statistical) setting, we use the same techniques to obtain an analogous sample complexity bound for agnostic learning:

###### Theorem 2 (Batch version; informal).

For any $\varepsilon > 0$, given access to an arbitrary distribution $\mathcal{D}$ over training sequences $(x_{1:T}, y_{1:T})$, Algorithm 2, run on $N$ i.i.d. sample trajectories from $\mathcal{D}$, outputs a predictor $\hat\Theta$ such that

$$\mathbb{E}_{\mathcal{D}}\!\left[\mathrm{MSE}(\hat\Theta) - \mathrm{MSE}(\Theta^*)\right] \leq \varepsilon + \frac{\tilde O\!\left(\mathrm{poly}(n, m, d, \log T, \log(1/\varepsilon))\right)}{\sqrt{N}},$$

compared to the best symmetric LDS predictor $\Theta^*$, while running in polynomial time.

Typical regression-based methods require the LDS to be strictly stable, and degrade on ill-conditioned systems; they depend on a spectral radius parameter $\|A\| < 1$. Our proposed method of wave-filtering provably and empirically works even for the hardest case of $\|A\| = 1$. Our algorithm attains the first condition number-independent polynomial guarantees in terms of regret (equivalently, sample complexity) and running time for the MIMO setting. Interestingly, our algorithms never need to learn the hidden state, and our guarantees can be sharpened to handle the case when the dimensionality of $h_t$ is infinite.

### 1.2 Related work

The modern setting for LDS arose in the seminal work of Kalman [Kal60], who introduced the Kalman filter as a recursive least-squares solution for maximum likelihood estimation (MLE) of Gaussian perturbations to the system. The framework and filtering algorithm have proven to be a mainstay in control theory and time-series analysis; indeed, the term Kalman filter model is often used interchangeably with LDS. We refer the reader to the classic survey [Lju98], and the extensive overview of recent literature in [HMR16].

Roweis and Ghahramani [RG99] suggest using the EM algorithm to learn the parameters of an LDS. This approach, which directly tackles the non-convex problem, is widely used in practice [Mar10a]. However, it remains a long-standing challenge to characterize the theoretical guarantees afforded by EM. We find that it is easy to produce cases where EM fails to identify the correct system.

In a recent result of [HMR16], it is shown for the first time that for a restricted class of systems, gradient descent (also widely used in practice, perhaps better known in this setting as backpropagation) guarantees polynomial convergence rates and sample complexity in the batch setting. Their result applies essentially only to the SISO case (vs. multi-dimensional for us), depends polynomially on the spectral gap (as opposed to no dependence for us), and requires the signal to be created by an LDS (vs. arbitrary for us).

## 2 Preliminaries

### 2.1 Linear dynamical systems

Many different settings have been considered, in which the definition of an LDS takes on many variants. We are interested in discrete time-invariant MIMO (multiple input, multiple output) systems with a finite-dimensional hidden state. (We assume finite dimension for simplicity of presentation; however, it will be evident that the hidden-state dimension plays no direct role in our algorithm, and enters the regret bound only implicitly.) Formally, our model is given as follows:

###### Definition 2.1.

A linear dynamical system (LDS) is a map from a sequence of input vectors $x_1, \ldots, x_T \in \mathbb{R}^n$ to output (response) vectors $y_1, \ldots, y_T \in \mathbb{R}^m$ of the form

$$h_{t+1} = A h_t + B x_t + \eta_t \qquad (1)$$
$$y_t = C h_t + D x_t + \xi_t, \qquad (2)$$

where $h_0, \ldots, h_T \in \mathbb{R}^d$ is a sequence of hidden states, $A, B, C, D$ are matrices of appropriate dimension, and $\eta_t, \xi_t$ are (possibly stochastic) noise vectors.

Unrolling this recursive definition gives the impulse response function, which uniquely determines the LDS. For notational convenience, for invalid indices $t \leq 0$, we define $x_t$, $y_t$, and $h_t$ to be the zero vector of appropriate dimension. Then, we have:

$$y_t = \sum_{i=1}^{T-1} C A^i \left(B x_{t-i} + \eta_{t-i}\right) + C A^t h_0 + D x_t + \xi_t. \qquad (3)$$

We will consider the (discrete) time derivative of the impulse response function, given by expanding $y_t - y_{t-1}$ via Equation (3). For the rest of this paper, we focus our attention on systems subject to the following restrictions:

1. The LDS is Lyapunov stable: $\|A\|_2 \leq 1$, where $\|\cdot\|_2$ denotes the operator (a.k.a. spectral) norm.

2. The transition matrix $A$ is symmetric and positive semidefinite. (The psd constraint on $A$ can be removed by augmenting the inputs with extra coordinates; we omit this for simplicity of presentation.)

The first assumption is standard: when the hidden state is allowed to blow up exponentially, fine-grained prediction is futile. In fact, many algorithms only work when $\|A\|_2$ is bounded away from $1$, so that the effect of any particular $x_t$ on the hidden state (and thus the output) dissipates exponentially. We do not require this stronger assumption.

We take a moment to justify assumption (ii), and why this class of systems is still expressive and useful. First, symmetric LDSs constitute a natural class of linearly-observable, linearly-controllable systems with dissipating hidden states (for example, physical systems with friction or heat diffusion). Second, this constraint has been used successfully for video classification and tactile recognition tasks [HSC16]. Interestingly, though our theorems require symmetric $A$, our algorithms appear to tolerate some non-symmetric (and even nonlinear) transitions in practice.

### 2.2 Sequence prediction as online regret minimization

A natural formulation of system identification is that of online sequence prediction. At each time step $t$, an online learner is given an input $x_t$, and must return a predicted output $\hat y_t$. Then, the true response $y_t$ is observed, and the predictor suffers a squared-norm loss of $\|y_t - \hat y_t\|^2$. Over $T$ rounds, the goal is to predict as accurately as the best LDS in hindsight.

Note that the learner is permitted to access the history of observed responses $y_1, \ldots, y_{t-1}$. Even in the presence of statistical (non-adversarial) noise, the fixed maximum-likelihood sequence produced by a single LDS will accumulate error linearly in $T$. Thus, we measure performance against a more powerful comparator, which fixes LDS parameters $\Theta$, and predicts $y_t$ by the previous response $y_{t-1}$ plus the derivative of the impulse response function of $\Theta$ at time $t$.

We will exhibit an online algorithm that can compete against the best $\Theta$ in this setting. Let $\hat y_1, \ldots, \hat y_T$ be the predictions made by an online learner, and let $y^*_1, \ldots, y^*_T$ be the sequence of predictions, realized by a chosen setting of LDS parameters $\Theta^*$, which minimize total squared error. Then, we define regret by the difference of total squared-error losses:

$$\mathrm{Regret}(T) \stackrel{\mathrm{def}}{=} \sum_{t=1}^T \|y_t - \hat y_t\|^2 - \sum_{t=1}^T \|y_t - y^*_t\|^2.$$

This setup fits into the standard setting of online convex optimization (in which a sublinear regret bound implies convergence towards optimal predictions), save for the fact that the loss functions are non-convex in the system parameters. Also, note that a randomized construction (set all $x_t = 0$, and let $y_t$ be i.i.d. Bernoulli random variables) yields an $\Omega(\sqrt{T})$ lower bound for any online algorithm. (This is a standard construction; see, e.g., Theorem 3.2 in [Haz16].)

To quantify regret bounds, we must state our scaling assumptions on the (otherwise adversarial) input and output sequences. We assume that the inputs are bounded: $\|x_t\| \leq R_x$. Also, we assume that the output signal is Lipschitz in time: $\|y_t - y_{t-1}\| \leq L_y$. The latter assumption exists to preclude pathological inputs where an online learner is forced to incur arbitrarily large regret. For a true noiseless LDS, $L_y$ is not too large; see Lemma F.5 in the appendix.

We note that an optimal regret bound can be trivially achieved in this setting by algorithms such as Hedge [LW94], using an exponential-sized discretization of all possible LDS parameters; this is the online equivalent of brute-force grid search. Strikingly, our algorithms achieve essentially the same regret bound, but run in polynomial time.

### 2.3 The power of convex relaxations

Much work in system identification, including the EM method, is concerned with explicitly finding the LDS parameters which best explain the data. However, it is evident from Equation (3) that the terms $CA^iB$ cause the least-squares (or any other) loss to be non-convex in the parameters. Many methods used in practice, including EM and subspace identification, heuristically estimate each hidden state $h_t$, after which estimating the parameters becomes a convex linear regression problem. However, this first step is far from guaranteed to work in theory or practice.

Instead, we follow the paradigm of improper learning: in order to predict sequences as accurately as the best possible LDS $\Theta^* \in \mathcal{H}$, one need not predict strictly from an LDS. The central driver of our algorithms is the construction of a slightly larger hypothesis class $\hat{\mathcal{H}}$, for which the best predictor is nearly as good as $\Theta^*$. Furthermore, we construct $\hat{\mathcal{H}}$ so that the loss functions are convex under this new parameterization. From this will follow our efficient online algorithm.

As a warmup example, consider the following overparameterization: pick some time window $\tau \ll T$, and let the predictions $\hat y_t$ be linear in the concatenation $[x_{t-\tau}, \ldots, x_t]$. When $\|A\|$ is bounded away from $1$, this is a sound assumption. (This assumption is used in autoregressive models; see Section 6 of [HMR16] for a theoretical treatment.) However, in general, this approximation is doomed to either truncate longer-term input-output dependences (short $\tau$), or suffer from overfitting (long $\tau$). Our main theorem uses an overparameterization whose approximation factor is independent of $\|A\|$, and whose sample complexity scales only polylogarithmically in $T$.
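The warmup overparameterization amounts to regressing on a sliding window of inputs; a minimal sketch (the helper name and zero-padding convention are ours):

```python
import numpy as np

def window_features(X, t, tau):
    """Concatenate [x_{t-tau}, ..., x_t], zero-padding times before the start."""
    n = X.shape[1]
    cols = [X[s] if s >= 0 else np.zeros(n) for s in range(t - tau, t + 1)]
    return np.concatenate(cols)
```

A linear predictor trained on these features can only capture dependences up to lag $\tau$, which is exactly the truncation issue discussed above.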

### 2.4 Low approximate rank of Hankel matrices

Our analysis relies crucially on the spectrum of a certain Hankel matrix, a square matrix whose anti-diagonal stripes have equal entries (i.e., the $(i,j)$ entry is a function of $i+j$). An important example is the Hilbert matrix $H_{n,\theta}$, the $n$-by-$n$ matrix whose $(i,j)$-th entry is $\frac{1}{i+j+\theta}$. For example,

$$H_{3,-1} = \begin{bmatrix} 1 & \frac{1}{2} & \frac{1}{3} \\ \frac{1}{2} & \frac{1}{3} & \frac{1}{4} \\ \frac{1}{3} & \frac{1}{4} & \frac{1}{5} \end{bmatrix}.$$

This and related matrices have been studied under various lenses for more than a century: see, e.g., [Hil94, Cho83]. A basic fact is that $H_{n,\theta}$ is a positive definite matrix for every $n$. The property we are most interested in is that the spectrum of a positive semidefinite Hankel matrix decays exponentially, a difficult result derived in [BT16] via Zolotarev rational approximations. We state these technical bounds in Appendix E.
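This spectral decay is easy to observe numerically; a small sketch (the size 50 and the cutoff index are arbitrary choices of ours):

```python
import numpy as np

def hilbert_matrix(n, theta=-1.0):
    """The n-by-n matrix with (i,j) entry 1/(i+j+theta), 1-indexed."""
    idx = np.arange(1, n + 1, dtype=float)
    return 1.0 / (idx[:, None] + idx[None, :] + theta)

H = hilbert_matrix(50)
eigs = np.sort(np.linalg.eigvalsh(H))[::-1]   # eigenvalues in descending order
# The top eigenvalue is Theta(1), while e.g. the 21st is already negligible.
```

Only polylogarithmically many eigenvalues carry non-negligible mass, which is the phenomenon the algorithm exploits.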

## 3 The wave-filtering algorithm

Our online algorithm (Algorithm 1) runs online projected gradient descent [Zin03] on the squared loss $f_t(M) = \|y_t - M \tilde X_t\|^2$. Here, each $M$ is a matrix specifying a linear map from featurized inputs $\tilde X_t$ to predictions $\hat y_t$. Specifically, after choosing a certain bank of $k$ filters $\{\phi_j\}$, $\tilde X_t$ consists of convolutions of the input time series with each $\phi_j$ (scaled by certain constants), along with $x_{t-1}$, $x_t$, and $y_{t-1}$. The number of filters $k$ will turn out to be polylogarithmic in $T$.
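To make the featurization concrete, here is a minimal sketch of building $\tilde X_t$, assuming a filter matrix `phi` (columns are the filters) and eigenvalue scalings `sigma` as described in the surrounding text; the indexing convention, pairing $\phi_j(i)$ with $x_{t-i}$, is our reading of the construction in Section 4.1:

```python
import numpy as np

def featurize(X, Y, phi, sigma, t):
    """tilde{X}_t = [sigma_j^{1/4} (X * phi_j)_t for each j;  x_{t-1};  x_t;  y_{t-1}].

    X: (T, n) inputs, Y: (T, m) responses, phi: (T', k) filters,
    sigma: (k,) eigenvalues, t: 0-indexed time step."""
    n, m = X.shape[1], Y.shape[1]
    parts = []
    for j in range(phi.shape[1]):
        conv = np.zeros(n)
        for i in range(1, min(t, phi.shape[0]) + 1):   # phi_j(i) pairs with x_{t-i}
            conv += phi[i - 1, j] * X[t - i]
        parts.append(sigma[j] ** 0.25 * conv)
    parts.append(X[t - 1] if t > 0 else np.zeros(n))   # x_{t-1}
    parts.append(X[t])                                  # x_t
    parts.append(Y[t - 1] if t > 0 else np.zeros(m))   # y_{t-1}
    return np.concatenate(parts)
```

The resulting feature vector has dimension $nk + 2n + m$, so the learned matrix $M$ stays small even for long horizons.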

The filters $\{\phi_j\}$ and scaling factors $\{\sigma_j^{1/4}\}$ are given by the top eigenvectors and eigenvalues of the Hankel matrix $Z \in \mathbb{R}^{T \times T}$, whose entries are given by

$$Z_{ij} := \frac{2}{(i+j)^3 - (i+j)}.$$
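Since $Z$ is a fixed matrix depending only on the horizon, computing the filter bank is a single eigendecomposition; a minimal sketch:

```python
import numpy as np

def wave_filters(T, k):
    """Top-k eigenpairs of the T-by-T Hankel matrix Z_ij = 2 / ((i+j)^3 - (i+j))."""
    idx = np.arange(1, T + 1, dtype=float)
    s = idx[:, None] + idx[None, :]
    Z = 2.0 / (s ** 3 - s)
    w, V = np.linalg.eigh(Z)                 # eigenvalues in ascending order
    sigma = w[::-1][:k]                      # top-k eigenvalues, descending
    phi = V[:, ::-1][:, :k]                  # corresponding eigenvectors
    return sigma, phi

sigma, phi = wave_filters(T=100, k=10)
```

Plotting the columns of `phi` reproduces the wave-like shapes of Figure 1, and `sigma` decays exponentially, which is why polylogarithmically many filters suffice.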

In the language of Section 2.3, one should think of each $M$ as arising from a mildly larger hypothesis class $\hat{\mathcal{H}}$, which replaces the original class $\mathcal{H}$ of LDS parameters $(A, B, C, D)$. Theorem 3 gives the key fact that $\hat{\mathcal{H}}$ approximately contains $\mathcal{H}$.

In Section 4, we provide the precise statement and proof of Theorem 1, the main regret bound for Algorithm 1, with some technical details deferred to the appendix. We also obtain analogous sample complexity results for batch learning; however, on account of some definitional subtleties, we defer all discussion of the offline case, including the statement and proof of Theorem 2, to Appendix A.

We make one final interesting note here, from which the name wave-filtering arises: when plotted coordinate-wise, our filters look like the vibrational modes of an inhomogeneous spring (see Figure 1). We provide some insight on this phenomenon (along with some other implementation concerns) in Appendix B. Succinctly: in the scaling limit, $Z$ commutes with a certain second-order Sturm-Liouville differential operator $\mathcal{D}$. This allows us to approximate the filters with eigenfunctions of $\mathcal{D}$, using efficient numerical ODE solvers.

## 4 Analysis

We first state the full form of the regret bound achieved by Algorithm 1. (Actually, for a slightly tighter proof, we analyze a restriction of the algorithm which does not learn the portion $M^{(y)}$, instead always choosing the identity matrix for that block.)

###### Theorem 1 (Main).

On any sequence $\{(x_t, y_t)\}_{t=1}^T$, Algorithm 1, with appropriate choices of the number of filters $k$, the learning rate, and the radius of the decision set, achieves regret

$$\mathrm{Regret}(T) \leq O\!\left(R_\Theta^4 R_x^2 L_y \log^2(R_\Theta R_x L_y n) \cdot n \sqrt{T} \log^6 T\right),$$

competing with LDS predictors $(A, B, C, D)$ with $0 \preceq A \preceq I$ and parameter matrices of norm at most $R_\Theta$.

Note that the hidden-state dimension $d$ does not appear explicitly in this bound, though it typically factors into $R_\Theta$. In Section 4.1, we state and prove Theorem 3, the convex relaxation guarantee for the filters, which may be of independent interest. This allows us to approximate the optimal LDS in hindsight (the regret comparator) by the loss-minimizing matrix $M^* \in \hat{\mathcal{H}}$. In Section 4.2, we complete the regret analysis using Theorem 3, along with bounds on the diameter and gradient, to conclude Theorem 1.

Since the batch analogue is less general (and uses the same ideas), we defer discussion of Algorithm 2 and Theorem 2 to Appendix A.

### 4.1 Approximate convex relaxation via wave filters

Assume for now that $h_0 = 0$; we will remove this assumption at the end, and see that the regret bound is asymptotically the same. Recall (from Section 2.2) that we measure regret compared to predictions obtained by adding the derivative of the impulse response function of an LDS to $y_{t-1}$. Our approximation theorem states that for any LDS $\Theta$, there is some $M_\Theta$ which produces approximately the same predictions. Formally:

###### Theorem 3 (Spectral convex relaxation for symmetric LDSs).

Let $\hat y_1, \ldots, \hat y_T$ be the online predictions made by an LDS $\Theta$ with norm-bounded parameters. Then, for any $\varepsilon > 0$, with a choice of $k$ polylogarithmic in $T$ and $1/\varepsilon$, there exists an $M_\Theta \in \hat{\mathcal{H}}$ such that

$$\sum_{t=1}^T \|M_\Theta \tilde X_t - y_t\|^2 \leq \sum_{t=1}^T \|\hat y_t - y_t\|^2 + \varepsilon.$$

Here, $\tilde X_t$ and $\hat{\mathcal{H}}$ are defined as in Algorithm 1 (noting that $\tilde X_t$ includes the previous ground truth $y_{t-1}$).

###### Proof.

We construct this mapping explicitly. Write $M_\Theta$ as the block matrix

$$M_\Theta = \begin{bmatrix} M^{(1)} & M^{(2)} & \cdots & M^{(k)} & M^{(x')} & M^{(x)} & M^{(y)} \end{bmatrix},$$

where the blocks' dimensions are chosen to align with those of $\tilde X_t$, the concatenated vector

$$\tilde X_t = \begin{bmatrix} \sigma_1^{1/4} (X * \phi_1)_t & \sigma_2^{1/4} (X * \phi_2)_t & \cdots & \sigma_k^{1/4} (X * \phi_k)_t & x_{t-1} & x_t & y_{t-1} \end{bmatrix},$$

so that the prediction is the block matrix-vector product

$$M_\Theta \tilde X_t = \sum_{j=1}^k \sigma_j^{1/4} M^{(j)} (X * \phi_j)_t + M^{(x')} x_{t-1} + M^{(x)} x_t + M^{(y)} y_{t-1}.$$

Without loss of generality, assume that $A$ is diagonal, with entries $\{\alpha_l\}_{l=1}^d$. (To see this, write the eigendecomposition $A = U \Lambda U^\top$; then the LDS with parameters $(\Lambda, U^\top B, C U, D)$ makes the same predictions as the original, with $\Lambda$ diagonal.) Let $b_l$ be the $l$-th row of $B$, and $c_l$ the $l$-th column of $C$. Also, we define a continuous family of vectors $\mu : [0,1] \to \mathbb{R}^T$, with entries $\mu(\alpha)(i) = (\alpha - 1)\alpha^{i-1}$. Then, our construction is as follows:

• $M^{(j)} := \sigma_j^{-1/4} \sum_{l=1}^d \langle \phi_j, \mu(\alpha_l) \rangle \,(c_l \otimes b_l)$, for each $1 \leq j \leq k$.

• $M^{(x')} := -D$, $M^{(x)} := CB + D$, and $M^{(y)} := I$.

Below, we give the main ideas for why this works, leaving the full proof to Appendix C.

Since $M^{(y)}$ is the identity, the online learner's task reduces to predicting the difference $\hat y_t - y_{t-1}$, the derivative of the impulse response function, which we write here:

$$\begin{aligned} \hat y_t - y_{t-1} &= (CB + D)x_t - Dx_{t-1} + \sum_{i=1}^{T-1} C(A^i - A^{i-1})B\, x_{t-i} \\ &= (CB + D)x_t - Dx_{t-1} + \sum_{i=1}^{T-1} C\left(\sum_{l=1}^d (\alpha_l^i - \alpha_l^{i-1})\, e_l \otimes e_l\right) B\, x_{t-i} \\ &= (CB + D)x_t - Dx_{t-1} + \sum_{l=1}^d (c_l \otimes b_l) \sum_{i=1}^{T-1} \mu(\alpha_l)(i)\, x_{t-i}. \end{aligned} \qquad (4)$$
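The middle step of (4) is just a spectral decomposition of $A^i - A^{i-1}$; here is a quick numerical check (the random dimensions are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m = 4, 3, 2
alpha = rng.uniform(0, 1, size=d)              # eigenvalues of a diagonal A
A = np.diag(alpha)
B = rng.standard_normal((d, n))                # b_l is the l-th row of B
C = rng.standard_normal((m, d))                # c_l is the l-th column of C

def mu(a, i):
    """mu(alpha)(i) = (alpha - 1) * alpha^{i-1} = alpha^i - alpha^{i-1}."""
    return (a - 1.0) * a ** (i - 1)

for i in range(1, 8):
    lhs = C @ (np.linalg.matrix_power(A, i) - np.linalg.matrix_power(A, i - 1)) @ B
    rhs = sum(mu(alpha[l], i) * np.outer(C[:, l], B[l]) for l in range(d))
    assert np.allclose(lhs, rhs)
```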

Notice that the inner sum is an inner product between each coordinate of the past inputs and $\mu(\alpha_l)$ (or a convolution, viewed across the entire time horizon). The crux of our proof is that one can approximate each $\mu(\alpha_l)$ using a linear combination of the filters $\{\phi_j\}$. Notice that

$$Z = \int_0^1 \mu(\alpha) \otimes \mu(\alpha)\, d\alpha,$$

since the $(i,j)$ entry of the RHS is

$$\int_0^1 (\alpha - 1)^2 \alpha^{i+j-2}\, d\alpha = \frac{1}{i+j-1} - \frac{2}{i+j} + \frac{1}{i+j+1} = Z_{ij}.$$
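Both the entrywise closed form and the rank-one integral representation can be checked numerically (the horizon and grid size are arbitrary choices of ours):

```python
import numpy as np

T = 8
idx = np.arange(1, T + 1, dtype=float)
s = idx[:, None] + idx[None, :]
Z = 2.0 / (s ** 3 - s)

def mu(a):
    """mu(alpha) in R^T, with mu(alpha)(i) = (alpha - 1) * alpha^{i-1}."""
    return (a - 1.0) * a ** (idx - 1)

# Entrywise closed form: 1/(s-1) - 2/s + 1/(s+1) = 2/(s^3 - s).
closed = 1.0 / (s - 1) - 2.0 / s + 1.0 / (s + 1)
assert np.allclose(closed, Z)

# Riemann-sum approximation of the integral of mu(a) (x) mu(a) over [0, 1].
grid = np.linspace(0.0, 1.0, 4001)
Zq = np.mean([np.outer(mu(a), mu(a)) for a in grid], axis=0)
assert np.max(np.abs(Zq - Z)) < 1e-3
```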

What follows is a spectral bound for reconstruction error, relying on the low approximate rank of :

###### Lemma 4.1.

Choose any $\alpha \in [0,1]$. Let $\tilde\mu(\alpha)$ be the projection of $\mu(\alpha)$ onto the $k$-dimensional subspace of $\mathbb{R}^T$ spanned by $\{\phi_j\}_{j=1}^k$. Then,

$$\|\mu(\alpha) - \tilde\mu(\alpha)\|_2 \leq \sqrt{6 \sum_{j=k+1}^T \sigma_j} \leq O\!\left(c_0^{-k/\log T} \sqrt{\log T}\right),$$

for an absolute constant $c_0 > 1$.

By construction of $M_\Theta$, the prediction $M_\Theta \tilde X_t$ replaces each $\mu(\alpha_l)$ in Equation (4) with its approximation $\tilde\mu(\alpha_l)$. Hence we conclude that

$$\begin{aligned} M_\Theta \tilde X_t &= y_{t-1} + (CB + D)x_t - Dx_{t-1} + \sum_{l=1}^d (c_l \otimes b_l) \sum_{i=1}^{T-1} \tilde\mu(\alpha_l)(i)\, x_{t-i} \\ &= y_{t-1} + (\hat y_t - y_{t-1}) + \zeta_t = \hat y_t + \zeta_t, \end{aligned}$$

letting $\zeta_t$ denote residual vectors arising from discarding the subspace of dimension $T - k$. Theorem 3 follows by showing that these residuals are small, using Lemma 4.1: it turns out that $\|\zeta_t\|$ is exponentially small in $k / \log T$, which implies the theorem. ∎

### 4.2 From approximate relaxation to low regret

Let $\Theta^*$ denote the best LDS predictor, and let $M^* := M_{\Theta^*}$ be its image under the map from Theorem 3, so that the total squared error of the predictions of $M^*$ is within $\varepsilon$ of that of $\Theta^*$. Notice that the loss functions $f_t(M) = \|y_t - M\tilde X_t\|^2$ are quadratic in $M$, and thus convex. Algorithm 1 runs online gradient descent [Zin03] on these loss functions, over a bounded convex decision set $\mathcal{M}$. Let $D_{\max}$ be the diameter of $\mathcal{M}$, and $G_{\max}$ be the largest norm of a gradient. We can invoke the classic regret bound:

###### Lemma 4.2 (e.g. Thm. 3.1 in [Haz16]).

Online gradient descent, using learning rates decaying as $1/\sqrt{t}$, has regret

$$\mathrm{Regret}_{\mathrm{OGD}}(T) \stackrel{\mathrm{def}}{=} \sum_{t=1}^T f_t(M_t) - \min_{M \in \mathcal{M}} \sum_{t=1}^T f_t(M) \leq 2 G_{\max} D_{\max} \sqrt{T}.$$

To finish, it remains to show that $D_{\max}$ and $G_{\max}$ are appropriately small. In particular, since the gradients contain convolutions of the input by $\ell_2$ (not $\ell_1$) unit vectors, special care must be taken to ensure that these do not grow too quickly. These bounds are shown in Section D.2, giving the claimed regret of Algorithm 1 in comparison with the comparator $M^*$. By Theorem 3, $M^*$ competes arbitrarily closely with the best LDS in hindsight, concluding the theorem.
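A single step of the projected OGD update analyzed above can be sketched as follows (the Frobenius-ball projection is one standard choice of decision set, used here for illustration):

```python
import numpy as np

def ogd_step(M, x_feat, y, eta, radius):
    """One projected-gradient step on f_t(M) = ||y - M x_feat||^2."""
    grad = 2.0 * np.outer(M @ x_feat - y, x_feat)   # gradient of the squared loss
    M = M - eta * grad
    nrm = np.linalg.norm(M)                          # Frobenius norm
    if nrm > radius:                                 # project back onto the ball
        M = M * (radius / nrm)
    return M
```

Running this step on the featurized inputs $\tilde X_t$, with the step size shrinking as $1/\sqrt{t}$, is the whole of the online algorithm.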

Finally, we discuss why it is possible to relax the earlier assumption $h_0 = 0$ on the initial hidden state. Intuitively, as more of the ground truth responses $y_t$ are revealed, the largest possible effect of the initial state decays. Concretely, in Section D.4, we prove that a comparator who chooses a nonzero $h_0$ can only increase the regret by an additive lower-order term in the online setting.

## 5 Experiments

In this section, to highlight the appeal of our provable method, we exhibit two minimalistic cases where traditional methods for system identification fail, while ours successfully learns the system. Finally, we note empirically that our method seems not to degrade in practice on certain well-behaved nonlinear systems. In each case, we use a small fixed number of filters, and a regularized follow-the-leader variant of Algorithm 1 (see Appendix B.2).

### 5.1 Synthetic systems: two hard cases for EM and SSID

We construct two difficult systems, on which we run either EM or subspace identification (SSID; specifically, "Deterministic Algorithm 1" from page 52 of [VODM12]), followed by Kalman filtering to obtain predictions. Note that our method runs significantly faster than this traditional pipeline.

In the first example (Figure 2(a), left), we have a low-dimensional SISO system driven by i.i.d. Gaussian inputs and noise. Most importantly, the transition matrix $A$ is ill-conditioned, so that there are long-term dependences between input and output. Observe that although EM and SSID both find reasonable guesses for the system's dynamics, these turn out to be local optima. Our method learns to predict as well as the best possible LDS.

The second example (Figure 2(a), right) is a higher-dimensional MIMO system, also with Gaussian noise. The transition matrix $A$ has a diverse spectrum, and the observation matrix has i.i.d. Gaussian entries. The inputs are random block impulses. This system identification problem is high-dimensional and non-convex; it is thus no surprise that EM and SSID consistently fail to converge.

### 5.2 The forced pendulum: a nonlinear, non-symmetric system

We remark that although our algorithm has provable regret guarantees only for LDSs with symmetric transition matrices, it appears in experiments to succeed in learning some non-symmetric (even nonlinear) systems in practice, much like the unscented Kalman filter [WVDM00]. In Figure 2(b), we provide a typical learning trajectory for a forced pendulum, under Gaussian noise and random block impulses. Physical systems like this are widely considered in control and robotics, suggesting possible real-world applicability for our method.

## 6 Conclusion

We have proposed a novel approach for provably and efficiently learning linear dynamical systems. Our online wave-filtering algorithm attains near-optimal regret in theory, and experimentally outperforms traditional system identification in both prediction quality and running time. Furthermore, we have introduced a "spectral filtering" technique for convex relaxation, which uses convolutions by eigenvectors of a Hankel matrix. We hope that this theoretical tool will be useful in tackling more general cases, as well as other non-convex learning problems.

## Acknowledgments

We thank Holden Lee and Yi Zhang for helpful discussions. We are especially grateful to Holden for a thorough reading of our manuscript, and for pointing out a way to tighten the result in Lemma C.1.

## References

• [Aud14] Koenraad MR Audenaert. A generalisation of Mirsky's singular value inequalities. arXiv preprint arXiv:1410.4941, 2014.
• [BM02] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
• [BT16] Bernhard Beckermann and Alex Townsend. On the singular values of matrices with displacement structure. arXiv preprint arXiv:1609.09494, 2016.
• [Cho83] Man-Duen Choi. Tricks or treats with the Hilbert matrix. The American Mathematical Monthly, 90(5):301–312, 1983.
• [DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
• [GH96] Zoubin Ghahramani and Geoffrey E Hinton. Parameter estimation for linear dynamical systems. Technical report, Technical Report CRG-TR-96-2, University of Toronto, Deptartment of Computer Science, 1996.
• [Grü82] F Alberto Grünbaum. A remark on hilbert’s matrix. Linear Algebra and its Applications, 43:119–124, 1982.
• [Haz16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
• [Hil94] David Hilbert. Ein beitrag zur theorie des legendre’schen polynoms. Acta mathematica, 18(1):155–159, 1894.
• [HMR16] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191, 2016.
• [HSC16] Wenbing Huang, Fuchun Sun, Lele Cao, Deli Zhao, Huaping Liu, and Mehrtash Harandi. Sparse coding and dictionary learning with linear dynamical systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3938–3947, 2016.
• [Kal60] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82.1:35–45, 1960.
• [KV05] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
• [Lju98] Lennart Ljung. System Identification: Theory for the User. Prentice Hall, Upper Saddle River, NJ, 2 edition, 1998.
• [Lju02] Lennart Ljung. Prediction error estimation methods. Circuits, Systems and Signal Processing, 21(1):11–21, 2002.
• [LW94] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
• [Mar10a] James Martens. Learning the linear dynamical system with ASOS. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning, pages 743–750. Omnipress, 2010.
• [RG99] Sam Roweis and Zoubin Ghahramani. A unifying review of linear gaussian models. Neural computation, 11(2):305–345, 1999.
• [Sch11] J Schur. Bemerkungen zur theorie der beschränkten bilinearformen mit unendlich vielen veränderlichen. Journal für die reine und Angewandte Mathematik, 140:1–28, 1911.
• [Sle78] David Slepian. Prolate spheroidal wave functions, fourier analysis, and uncertainty: The discrete case. Bell Labs Technical Journal, 57(5):1371–1430, 1978.
• [SS82] Robert H Shumway and David S Stoffer. An approach to time series smoothing and forecasting using the em algorithm. Journal of Time Series Analysis, 3(4):253–264, 1982.
• [VODM12] Peter Van Overschee and BL De Moor. Subspace Identification for Linear Systems. Springer Science & Business Media, 2012.
• [WVDM00] Eric A Wan and Rudolph Van Der Merwe. The unscented kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000. AS-SPCC. The IEEE 2000, pages 153–158. IEEE, 2000.
• [Zin03] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.

## Guide to the Appendix

• In Appendix A, we present two formulations of the batch learning equivalent of the online algorithm, and derive Theorem 2, a companion sample complexity bound.

• In Appendix B, we discuss some variants of our online algorithm, and offer some tips for implementation. We also provide discussion on the connection of our filters to eigenfunctions of a certain differential operator.

• In Appendix C, we prove the key approximate convex relaxation result (Theorem 3).

• In Appendix D, we complete the details for the proof sketch provided in Section 4.2, concluding the main theorem, the regret bound for the online algorithm. Importantly, we address the subtle issue of deriving upper bounds for the gradient and diameter of the decision set.

• In Appendix E, we derive explicit non-asymptotic bounds for quantities of interest pertaining to the Hankel matrix $Z$, notably spectral decay. Key results are adapted from [BT16].

• In Appendix F, we verify some easy-to-prove properties of the important vector $\mu(\alpha)$, for sake of completeness.

## Appendix A Batch variants of the algorithm

The online prediction setting is sensitive to permutation of the time series: that is, the same LDS does not in general map to . As such, one must take care when defining the batch case: the output time series (and thus, loss functions) are correlated, so it is not meaningful to assume that they are i.i.d. samples from a distribution. Thus, our online regret bound, which concerns a single episode, does not translate directly. However, our convex relaxation technique still allows us to do efficient improper learning with least-squares regression, giving interesting and novel statistical guarantees. In this section, we provide two possible formulations of the batch setting, along with accompanying theorems.

In both cases, it is most natural to fix an episode length $T$, and consider a rollout of the system to be a single example. For short, let $X$ denote the concatenated vector of inputs for a single example, and $Y$ the concatenated responses. The batch formulation is to learn the dynamics of the system using $N$ samples $\{(X^{(i)}, Y^{(i)})\}_{i=1}^{N}$. Recall that each sample satisfies the system equations $h_{t+1} = A h_t + B x_t + \eta_t$ and $y_t = C h_t + D x_t + \xi_t$.

Unlike in the online setting, it will be less confusing in the batch setting to measure the mean squared error of predictions, rather than the total squared error. Thus, in this section, $\ell$ will always refer to mean squared error. As well, to follow statistical learning conventions (for ease of reading), we use $h$ to denote a hypothesis (an LDS) instead of $\Theta$; this is to be distinguished from the hidden state $h_t$.

### a.1 Learning the derivative: the direct analogue

Throughout this subsection, assume that the initial hidden state is $h_0 = 0$.

As noted, the sequential prediction algorithm can be restricted so as to never update the block of $M$ acting on the previous response $y_{t-1}$, keeping it fixed to the identity matrix. Notice that all other features in $\tilde X_t$ consist of inputs and their convolutions. In other words, we can take the view that the matrix $M$ predicts the differences $y_t - y_{t-1}$ between successive responses, as a function of the entire (aligned) input time series.

Thus, we can formulate a direct analogue of the online algorithm: learn the mapping from an input time series $X$ to the concatenation $Y'$ of all differences $y_t - y_{t-1}$. For this, we can use Theorem 3 (the approximation result) directly, and obtain an improper agnostic learning guarantee.

Specifically, let $\mathcal{H}$ be a subset of the hypothesis class of LDSs with parameters $\Theta$ subject to the norm constraints below, and choose any approximation tolerance $\varepsilon > 0$. (The distinction between measuring total vs. mean squared error is hidden in the constant in front of the $\log(1/\varepsilon)$ when choosing the number of filters $k$.) Then, Theorem 3 states that choosing $k$ polylogarithmic in $T$ and $1/\varepsilon$ ensures the $\varepsilon$-approximate relaxation property. In the language of the batch setting: for each $h \in \mathcal{H}$ which predicts on the sample $(X, Y)$ with mean squared error $\ell_{X,Y}(h)$, there is some $\hat h$ in the relaxed class $\hat{\mathcal{H}}$ so that

$$\ell_{X,Y}(\hat h) \;\le\; \ell_{X,Y}(h) + \varepsilon.$$

The choice of batch algorithm, mimicking Algorithm 1, is then clear: run least-squares regression of the differences $Y'$ against the featurized inputs $\tilde X$, where $\tilde X$ is the same featurization of the inputs as used in the online algorithm. We describe this procedure fully in Algorithm 2.
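As a concrete illustration, the regression step above can be sketched as follows. This is a minimal sketch, not the paper's Algorithm 2 itself: the function names are hypothetical, the $\sigma_j^{1/4}$ scaling of the filtered features follows the online algorithm, and the extra raw-input features are omitted for brevity.

```python
import numpy as np

def featurize(X, filters, sigmas):
    """Convolve the input series with each filter (illustrative sketch).

    X: (T, n) inputs; filters: (T, k) columns are eigenvectors of the
    Hankel matrix; sigmas: (k,) the corresponding eigenvalues.
    """
    T, n = X.shape
    k = filters.shape[1]
    feats = np.zeros((T, k * n))
    for j in range(k):
        phi = filters[:, j]
        for t in range(T):
            window = X[t::-1]  # x_t, x_{t-1}, ..., x_1 (newest first)
            feats[t, j * n:(j + 1) * n] = sigmas[j] ** 0.25 * (phi[:t + 1] @ window)
    return feats

def batch_regression(X, Y, filters, sigmas):
    """Least-squares fit of the differences y_t - y_{t-1} on filtered inputs."""
    dY = np.diff(Y, axis=0, prepend=np.zeros_like(Y[:1]))  # treats y_0 = 0
    F = featurize(X, filters, sigmas)
    M, *_ = np.linalg.lstsq(F, dY, rcond=None)
    return M
```

In practice one would vectorize the convolutions with an FFT; the quadratic loop here is only for clarity.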

#### a.1.1 Generalization bound

By definition, Algorithm 2 minimizes the empirical MSE loss on the samples; as such, we can derive a PAC-learning bound for regression. We begin with some definitions and assumptions, so that we can state the theorem.

As in the statement of the online algorithm, as a soft dimensionality restriction, we constrain the comparator class $\mathcal{H}$ to contain LDSs whose parameter norms are at most $R_\Theta$, with inputs bounded by $R_x$ and outputs $L_y$-Lipschitz in time. For an empirical sample set $S$, let $\ell_S(h)$ denote the empirical loss, the average of $\ell_{X,Y}(h)$ over the samples. Similarly, for a distribution $\mathcal{D}$, let $\ell_{\mathcal{D}}(h) = \mathbb{E}_{(X,Y) \sim \mathcal{D}}\left[\ell_{X,Y}(h)\right]$.

Then, we are able to obtain a sample complexity bound:

###### Theorem 2 (Generalization of the batch algorithm).

Choose any $\varepsilon, \delta > 0$. Let $S = \{(X^{(i)}, Y^{(i)})\}_{i=1}^{N}$ be a set of i.i.d. training samples from a distribution $\mathcal{D}$. Let $\hat h$ be the output of Algorithm 2, with the number of filters $k$ chosen as in Theorem 3 for tolerance $\varepsilon$. Let

$$h^* = \operatorname*{argmin}_{h \in \mathcal{H}} \ell_{\mathcal{D}}(h)$$

be the true loss minimizer. Then, with probability at least $1 - \delta$, it holds that

$$\ell_{\mathcal{D}}(\hat h) - \min_{h \in \mathcal{H}} \ell_{\mathcal{D}}(h) \;\le\; \varepsilon + \frac{O\!\left( R_\Theta^4 R_x^2 L_y \log^2\!\left( R_\Theta R_x L_y n / \varepsilon \right) n \log^6 T + \sqrt{\log(1/\delta)} \right)}{\sqrt{N}}.$$
###### Proof.

Lemma D.1 shows that we can restrict $\hat{\mathcal{H}}$ by a Frobenius norm bound:

$$\|M\|_F \le O\!\left( R_\Theta^2 \sqrt{k} \right).$$

Thus, with this restriction, the empirical Rademacher complexity of $\hat{\mathcal{H}}$ on $N$ samples satisfies

$$\mathcal{R}_N(\hat{\mathcal{H}}) \le O\!\left( \frac{R_\Theta^2 R_x \sqrt{k}}{\sqrt{N}} \right).$$

Also, no single prediction error (and thus neither the empirical nor the population loss) will exceed the upper bound

$$\ell_{\max} \stackrel{\mathrm{def}}{=} \Theta\!\left( R_\Theta^4 R_x^2 L_y^2 k \right).$$

Finally, the loss is $G_{\max}$-Lipschitz in the matrix $M$, where $G_{\max}$ is the same upper bound on the gradient as mentioned in Section 4.2; Lemma D.5 gives an explicit bound on this quantity.

With all of these facts in hand, a standard Rademacher-complexity-dependent generalization bound holds in the improper hypothesis class $\hat{\mathcal{H}}$ (see, e.g., [BM02]):

###### Lemma A.1 (Generalization via Rademacher complexity).

With probability at least $1 - \delta$, it holds that

$$\ell_{\mathcal{D}}(\hat h) - \ell_{\mathcal{D}}(\hat h^*) \;\le\; G_{\max}\, \mathcal{R}_N(\hat{\mathcal{H}}) + \ell_{\max} \sqrt{\frac{8 \ln(2/\delta)}{N}}.$$

With the stated choice of $k$, an upper bound for the right-hand side of Lemma A.1 is

$$\frac{O\!\left( R_\Theta^4 R_x^2 L_y \log^2\!\left( R_\Theta R_x L_y n / \varepsilon \right) n \log^6 T + \sqrt{\log(1/\delta)} \right)}{\sqrt{N}}.$$

Combining this with the approximation result (Theorem 3) yields the theorem. ∎

### a.2 The pure batch setting

A natural question is whether there exists a batch learning algorithm that can use $X$ to predict $Y$ directly, as opposed to the differences $y_t - y_{t-1}$. This is possible in the regime of low noise: if one has predictions of the differences that are correct up to a given mean squared error, an easy solution is to integrate them and obtain predictions for $Y$; however, the errors accumulate by a factor on the order of $T$. The same agnostic learning guarantee thus costs a rather dramatic extra factor of $T$ in sample complexity.
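The accumulation phenomenon is easy to see numerically. The following sketch assumes i.i.d. per-step prediction errors, an illustrative model rather than the adversarial worst case: integrating difference predictions whose mean squared error stays at $\varepsilon$ yields predictions of $y_t$ whose mean squared error grows on the order of $T\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, eps = 1000, 1e-2

true_diff = rng.normal(size=T)                       # true y_t - y_{t-1}
pred_diff = true_diff + rng.normal(scale=eps ** 0.5, size=T)

# integrate the differences to recover predictions for y_t itself
true_y = np.cumsum(true_diff)
pred_y = np.cumsum(pred_diff)

mse_diff = np.mean((pred_diff - true_diff) ** 2)     # stays near eps
mse_y = np.mean((pred_y - true_y) ** 2)              # grows like T * eps
```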

In the regime of low noise, an analogue of our approximation theorem (Theorem 3) is powerful enough to guarantee low error. For convenience and concreteness, we record this here:

###### Theorem 3b (Pure-batch approximation).

Let $\Theta$ be an LDS specified by parameters $(A, B, C, D, h_0)$, with $A$ symmetric, $0 \preceq A \preceq I$, and parameter norms bounded by $R_\Theta$. Suppose $\Theta$ takes an input sequence $x_1, \ldots, x_T$ and produces output sequence $\hat y_1, \ldots, \hat y_T$, assuming all noise vectors are 0. Then, for any $\varepsilon > 0$, with a choice of $k$ polylogarithmic in $T$ and $1/\varepsilon$, there exists an $M_\Theta$ such that

$$\sum_{t=1}^{T} \left\| \left( \sum_{u=1}^{t} M_\Theta \tilde X_u \right) - y_t \right\|^2 \;\le\; \sum_{t=1}^{T} \left\| \hat y_t - y_t \right\|^2 + \varepsilon,$$

where $\tilde X_u$ is defined as in Algorithm 1, without the $y_{u-1}$ entries.

This fact follows from Theorem 3, setting the desired precision to $\varepsilon/T$; the cost of this additional precision is only a constant factor in $k$. Furthermore, this holds subject to the same Frobenius norm constraint as in Lemma D.1.

#### a.2.1 Filters from the Hilbert matrix

Alternatively, in the realizable case (when the samples from $\mathcal{D}$ are generated by an LDS, possibly with small noise), one can invoke a similar approximate relaxation theorem as Theorem 3. The filters become the eigenvectors of the Hilbert matrix $H_T$, the matrix whose $(i,j)$-th entry is $\frac{1}{i+j-1}$. This matrix exhibits the same spectral decay as $Z_T$; see [BT16] for precise statements. The proof follows the sketch from Section 4.1, approximating the powers of $\alpha$ by a spectral truncation of a different curve, sometimes called the moment curve. The Hilbert matrix arises from taking the second moment matrix of the uniform distribution on this curve.

However, we find that this approximation guarantee is insufficient to show the strong regret and agnostic learning bounds we exhibit for learning the derivative of the impulse response function. Nonetheless, we find that regression with these filters works well in practice, even interchangeably in the online algorithm; see Section B.1 for some intuition.
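The spectral decay in question is easy to observe numerically. A short sketch (the size 64 and the thresholds below are illustrative choices, not quantities from the text):

```python
import numpy as np

def hilbert_matrix(T):
    # H_ij = 1 / (i + j - 1), with 1-based indices
    i = np.arange(1, T + 1)
    return 1.0 / (i[:, None] + i[None, :] - 1.0)

H = hilbert_matrix(64)
eigs = np.linalg.eigvalsh(H)[::-1]  # eigenvalues, descending order
# The spectrum decays geometrically: only O(log T) eigenvalues are
# non-negligible, mirroring the decay of the Hankel matrix's spectrum.
```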

### a.3 Learning the initial hidden state via hints

In either of the above settings, it is not quite possible to apply the same argument as in the online setting for pretending that the initial hidden state is zero. When this assumption is removed, the quality of the convex relaxation degrades by an additive term depending on the initial state; see Section D.4. This does not matter much for the regret bound, since it is subsumed by the worst-case regret of online gradient descent.

However, in the batch setting, we take the view of fixed episode length $T$ and increasing sample count $N$, so the contribution of the initial state is no longer asymptotically negligible. In other words, this additive approximation error prevents us from driving $\varepsilon$ arbitrarily close to zero, no matter how many filters are selected. In settings where $T$ is large enough, one may find this acceptable.

We present an augmented learning problem in which we can predict as well as an LDS: the initial hidden state is provided in each sample, up to an arbitrary linear transformation. Thus, each sample takes the form $(X, Y, \tilde h_0)$, and it is guaranteed that $\tilde h_0 = W h_0$ for each sample, for a fixed matrix $W$. This matrix must be well-conditioned for the problem to remain well-posed: our knowledge of $h_0$ should be in the same dynamic range as the ground truth. Concretely, we should assume that the condition number of $W$ is bounded.

The construction is as follows: append dummy dimensions to the input, and add an impulse of $\tilde h_0$ in those dimensions at time 0. During the actual episode, these dummy inputs are always zero. Then, replacing $B$ with an augmented block matrix recovers the behavior of the system. Thus, we can handle this formulation of hidden-state learning in the online or batch setting, incurring no additional asymptotic factors.
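The data-augmentation step can be sketched as follows; the function name is hypothetical, and only the input-side transformation is shown (the corresponding block change to $B$ happens implicitly inside the learner).

```python
import numpy as np

def augment_sample(X, h0_hint):
    """Append dummy input dimensions carrying an impulse of the hint at t = 0.

    X: (T, n) inputs for one episode; h0_hint: (q,) the provided encoding
    of the initial hidden state (tilde-h_0 in the text).
    """
    T, n = X.shape
    q = h0_hint.shape[0]
    X_aug = np.zeros((T, n + q))
    X_aug[:, :n] = X
    X_aug[0, n:] = h0_hint  # impulse at time 0; zero for the rest of the episode
    return X_aug
```

In the finitely-many-initializations case discussed below, `h0_hint` is simply a one-hot vector indexing the known configuration.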

#### a.3.1 Initializations with finite support

We highlight an important special case of the formulation discussed above, which is perhaps the motivating rationale for this altered problem.

Consider a batch system identification setting in which there are only finitely many initial states, say $q$ of them, in the training and test data, and the experimenter can distinguish between these states. This can be interpreted as a set of known initial “configurations” of the system. Then, it suffices to augment the data with a one-hot vector in $\mathbb{R}^q$, corresponding to the known initialization in each sample. An important case is $q = 1$: there is only one distinct initial configuration; this occurs frequently in control problems.

In summary, the stated augmentation takes the original LDS with input dimension $n$, and transforms it into one with input dimension $n + q$. The matrix $W$, as defined above, is the matrix whose $q$ columns are the possible initial hidden states, which can be in arbitrary dimension. For convenience, we summarize this observation:

###### Proposition A.2 (Learning an LDS with few, distinguishable hidden states).

When there are $q$ known, distinguishable hidden states, Theorems 2, 3, and 3b apply to the modified LDS learning problem, with samples of the form $(X, Y, \tilde h_0)$. The input dimension becomes $n + q$.

## Appendix B Implementation and variants

We discuss the points mentioned in Section 3 at greater length. Unlike the rest of the appendix, this section contains no technical proofs, and is intended as a user-friendly guide for making the wave-filtering method usable in practice.

### b.1 Computing the filters via Sturm-Liouville ODEs

We begin by expanding upon the observation, noted in Section 3, that the eigenvectors of $Z_T$ resemble inhomogeneously-oscillating waves, providing some justification for the heuristic numerical computation of its top eigenvectors.

Computing the filters directly from $Z_T$ is difficult. In fact, its close cousin the Hilbert matrix is notoriously super-exponentially ill-conditioned; it is perhaps best known as a pathological benchmark for finite-precision numerical linear algebra algorithms. One could ignore efficiency issues and view this as a data-independent preprocessing step: the filters are deterministic. However, this difficult numerical problem poses an issue for using our method in practice.
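For reference, the naive preprocessing step is straightforward to write down; double precision is adequate for the top few filters, which are the only ones the algorithm uses. The entry formula for $Z_T$ follows the definition in Section 2.4; the function name and the sizes below are illustrative.

```python
import numpy as np

def top_filters(T, k):
    """Top-k eigenpairs of the Hankel matrix Z_T, used as spectral filters."""
    i = np.arange(1, T + 1)
    s = i[:, None] + i[None, :]
    Z = 2.0 / (s ** 3 - s)            # Z_ij = 2 / ((i+j)^3 - (i+j))
    vals, vecs = np.linalg.eigh(Z)    # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]
    return vals[order], vecs[:, order]

sigma, phi = top_filters(T=256, k=25)  # phi[:, j] is the j-th filter
```

Beyond the top handful of eigenpairs, the ill-conditioning discussed above makes this direct computation unreliable, which motivates the ODE-based alternative below.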

Fortunately, as briefly noted in Section 3, some recourse is available. In [Grü82], Grünbaum constructs a tridiagonal matrix which commutes with each Hilbert matrix $H_T$, as defined in Section 2.4. In the appropriate scaling limit as $T \to \infty$, this becomes a Sturm–Liouville differential operator which does not depend on $T$, given by

$$D = \frac{d}{dx}\left( (1 - x^2)\, x^2\, \frac{d}{dx} \right).$$