Linear dynamical systems (LDSs) are a cornerstone of signal processing and time series analysis. The problem of predicting the response signal arising from a LDS is a fundamental problem in machine learning, with a history of more than half a century.
An LDS is given by matrices . Given a sequence of inputs , the output of the system is governed by the linear equations
are noise vectors, andis a hidden (latent) state. It has been observed numerous times in the literature that if there is no hidden state, or if the transition matrices are known, then the formulation is essentially convex and amenable to efficient optimization.
In this paper we are concerned with the general and more challenging case, arguably the one which is more applicable as well, in which the hidden state is not observed, and the system dynamics are unknown to the learner. In this setting, despite the vast literature on the subject from various communities, there is a lack of provably efficient methods for learning the LDS without strong generative or other assumptions.
Building on the recent spectral filtering technique from , we develop a more general convex relaxation theorem for LDSs, resulting in an efficient algorithm for the LDS prediction problem in the general setting. Our algorithm makes online predictions which are close (in terms of mean squared error) to those of the optimal LDS in hindsight.
1.1 Our contributions
Let and be the inputs and outputs of a linear dynamical system. Our main result is a polynomial-time algorithm that predicts given all previous input and feedback , and competes with the predictions of the best LDS in terms of mean squared error. Defining regret as the additional least-squares loss incurred compared to the best predictions :
our main algorithm achieves a bound of
Here, denotes the inevitable loss incurred by perturbations to the system which cannot be anticipated by the learner, which are allowed to be adversarial. The constant in the , as well as , depend only polynomially on the dimensionality of the system, the norms of the inputs and outputs, and certain natural quantities related to the transition matrix . Additionally, the running time of our algorithm is polynomial in all natural parameters of the problem.
In comparison to previous approaches, we note:
The main feature is that the regret does not depend on the spectral radius
of the system’s hidden-state transition matrix. If one allows a dependence on the condition number, then simple linear regression-based algorithms are known to obtain the same result, with time and sample complexity polynomial in. For a reference, see section 6 of .
Our algorithm is the first polynomial-time algorithm with this guarantee. However, inefficient algorithms that attain similar regret bounds are also possible, based on the continuous multiplicative-weights algorithm (see , as well as the EWOO algorithm in ). These methods, mentioned briefly in , basically amount to discretizing the entire parameter space of LDSs, and take time exponential in the system dimensions.
The approach of  makes use of the eigendecomposition of the symmetric transition matrix
: they show that impulse response function for a real scalar is close to a fixed low-rank subspace defined by the top eigenvectors of a certain Hankel matrix. For the case of a general (asymmetric) LDS, however, the eigenvalues ofare complex-valued, and the subspace property no longer holds.
The primary difficulty is that the complex-valued Hankel matrix corresponding to complex power does not have the same approximate low-rank property. Instead, we use the fact that the approximation result with real eigenvalues holds if all of the phases of the complex eigenvalues have been identified and taken into account. However, identifying the phases and simultaneously learning the other relevant parameters of the predictor is a non-convex learning problem. We give a novel convex relaxation of this simultaneous learning problem that may be useful in other regret minimization settings.
A second important component is incorporating an autoregressive model, i.e. predicting the current output as a learned linear function of the pastinputs and outputs. Intuitively, for a general LDS, regressing only on the previous output (as was done in  for symmetric LDSs) is problematic, because no single linear map can cancel out the effect of multiple complex rotations. Instead, we use an autoregressive time window of , the dimension of the hidden state. The autoregressive component accomplishes two objectives: first, it makes the system responsive to noise on the hidden state, and second, it ensures that the additional dependence of on the (in general long) history of inputs is bounded, and hence the wave-filtering method will succeed.
Autoregressive models have been studied and used extensively; however, we are not aware of formal analyses on how well they can approximate a general LDS with a hidden state. As part of our analysis, we show that the error or ill-conditionedness of this approximation can be bounded in terms of norms of certain natural polynomials related to the dynamical system.
1.3 Related work
The prediction problems of time series for linear dynamical systems was defined in the seminal work of Kalman 20], and the extensive overview of recent literature in .
For a linear dynamical system with no hidden state, the system is identifiable by a convex program and thus well understood (see [10, 1], who address sample complexity issues and regret for system identification and linear-quadratic control in this setting).
Various approaches have been proposed to learn the system in the case that the system is unknown, however most Ghahramani and Roweis  suggest using the EM algorithm to learn the parameters of an LDS. This approach remains widely used, but is inherently non-convex and can get stuck in local minima. Recently 
show that for a restricted class of systems, gradient descent (also widely used in practice, perhaps better known in this setting as backpropagation) guarantees polynomial convergence rates and sample complexity in the batch setting. Their result applies essentially only to the SISO case, depends polynomially on the spectral gap, and requires the signal to be generated by an LDS.
In recent work,  show how to efficiently learn an LDS in the online prediction setting, without any generative assumptions, and without dependence on the condition number. Their new methodology, however, was restricted to LDSs with symmetric transition matrices. For the structural result, we use the same results from the spectral theory of Hankel matrices; see [3, 16, 9]. As discussed in the previous section, obtaining provably efficient algorithms for the general case is significantly more challenging.
We make use of linear filtering, or linear regression on the past observations as well as inputs, as a subroutine for future prediction. This technique is well-established in the context of autoregressive models for time-series prediction that have been extensively studied in the learning and signal-processing literature, see e.g. [11, 6, 7, 19, 2, 21].
A linear dynamical system , with initial hidden state , specifies a map from inputs to outputs (responses) , given by the recursive equations
where are matrices of appropriate dimension, and are noise vectors.
We make the following assumptions to characterize the “size” of an LDS we are competing against:
Inputs and outputs and bounded: .111Note that no bound on is required for the approximation theorem; only appears in the regret bound.
The system is Lyapunov stable, i.e., the largest singular value ofis at most 1: . Note that we do not need this parameter to be bounded away from .
is diagonalizable by a matrix with small entries: , with . Intuitively, this holds if the eigenvectors corresponding to larger eigenvalues aren’t close to linearly dependent.
have bounded spectral norms: .
Let be the set of phases of all eigenvalues of . There exists a monic polynomial of degree such that for all , the norm of its coefficients is at most , and the norm is at most . We will explain this condition in Section 4.1.
In our regret model, the adversary chooses an LDS , and has a budget . The dynamical system produces outputs given by the above equations, where the noise vectors are chosen adversarially, subject to a budget constraint: .
Then, the online prediction setting is identical to that proposed in . For each iteration , the input is revealed, and the learner must predict a response . Then, the true is revealed, and the learner suffers a least-squares loss of . Of course, if scales with the time horizon , it is information-theoretically impossible for an online algorithm to incur a loss sublinear in , even under non-adversarial (e.g. Gaussian) perturbations. Thus, our end-to-end goal is to track the LDS with loss that scales with the total magnitude of the perturbations, independently of .
This formulation is fundamentally a min-max problem: given a limited budget of perturbations, an adversary tries to maximize the error of the algorithm’s predictions, while the algorithm seeks to be robust against any such adversary. This corresponds to the notion of robustness in the control theory literature; see Section 15.5 of .
2.2 The spectral filtering method
The spectral filtering technique is introduced in , which considers a spectral decomposition of the derivative of the impulse response function of an LDS with a symmetric transition matrix. We derive Theorem 3, an analogue of this result for the asymmetric case, using some of the technical lemmas from this work.
A crucial object of consideration from  is the set of wave-filters , which are the top eigenvectors of the deterministic Hankel matrix , whose entries are given by . Bounds on the -rank of positive semidefinite Hankel matrices can be found in . Like Algorithm 1 from , the algorithm will “compress” the input time series using a time-domain convolution of the input time series with filters derived from these eigenvectors.
2.3 Notation for matrix norms
We will consider a few “mixed”
matrix norms of a 4-tensor, whose elements are indexed by (the roles and bounds of these indices will be introduced later). For conciseness, whenever the norm of such a 4-tensor is taken, we establish the notation for the mixed matrix norm
and the limiting case
These are the straightforward analogues of the matrix norms defined in , and appear in the regularization of the online prediction algorithm.
To define the algorithm, we specify a reparameterization of linear dynamical systems. To this end, we define a pseudo-LDS, which pairs a subspace-restricted linear model of the impulse response with an autoregressive model:
A pseudo-LDS is given by two 4-tensors a vector , and matrices . Let the prediction made by , which depends on the entire history of inputs and past outputs be given by
Here, are the top eigenvectors, with eigenvalues , of . These can be computed using specialized methods . Some of the dimensions of these tensors are parameters to the algorithm, which we list here:
Number of filters .
Phase discretization parameter .
Autoregressive parameter .
Additionally, we define the following:
Regularizer , where , and .
Composite norm .
Composite norm constraint , and the corresponding set of pseudo-LDSs .
Crucially, is linear in each of ; consequently, the least-squares loss is convex, and can be minimized in polynomial time. To this end, our online prediction algorithm is follow-the-regularized-leader (FTRL), which requires the solution of a convex program at each iteration. We choose this regularization to obtain the strongest theoretical guarantee, and provide a brief note in Section 5 on alternatives to address performance issues.
At a high level, our algorithm works by first approximating the response of an LDS by an autoregressive model of order , then refining the approximation using wave-filters with a phase component. Specifically, the blocks of and corresponding to filter index and phase index specify the linear dependence of on a certain convolution of the input time series, whose kernel is the pointwise product of and a sinusoid with period . The structural result which drives the theorem is that the dynamics of any true LDS are approximated by such a pseudo-LDS, with reasonably small parameters and coefficients.
Note that the autoregressive component in our definition of a pseudo-LDS is slightly more restricted than multivariate autoregressive models: the coefficients are scalar, rather than allowed to be arbitrary matrices. These options are interchangeable for our purposes, without affecting the asymptotic regret; we choose to use scalar coefficients for a more streamlined analysis.
The online prediction algorithm is fully specified in Algorithm 1; the parameter choices that give the best asymptotic theoretical guarantees are specified in the appendix, while typical realistic settings are outlined in Section 5.
Theorem 2 (Main; informal).
There are three parts to the analysis, which we outline in the following subsections: proving the approximability of an LDS by a pseudo-LDS, bounding the regret incurred by the algorithm against the best pseudo-LDS, and finally analyzing the effect of noise . The full proofs are in Appendices A, B, and C, respectively.
4.1 Approximation theorem for general LDSs
We develop a more general analogue of the structural result from , which holds for systems with asymmetric transition matrix .
Theorem 3 (Approximation theorem; informal).
There is , and a pseudo-LDS of norm such that approximates to within for :
For the formal statement (with precise bounds) and proof, see Appendix A.2. In this section we give some intuition for the conditions and an outline of the proof.
First, we explain the condition on the polynomial . As we show in Appendix A.1 we can predict using a pure autoregressive model, without wavefilters, if we require to have all eigenvalues of as roots (i.e., it is divisible by the minimal polynomial of ). However, the coefficients of this polynomial could be very large. The size of these coefficients will appear in the bound for the main theorem, as using large coefficients in the predictor will make it sensitive to noise.
Requiring only to have the phases of eigenvalues of as roots can decrease the coefficients significantly. As an example, consider if has many distinct eigenvalues with phase , and similarly for , and , and suppose their absolute values are close to 1. Then the minimal polynomial is approximately which can have coefficients as large as . On the other hand, for the theorem we can take which has degree 3 and coefficients bounded by a constant. Intuitively, the wavefilters help if there are few distinct phases, or they are well-separated (consider that if the phases were exactly the th roots of unity, that can be taken to be ). Note that when the roots are real, we can take and the analysis reduces to that of .
We now sketch a proof of Theorem 3. Motivation is given by the Cayley-Hamilton Theorem, which says that if is the characteristic polynomial of , then . This fact tells us that the satisfies a linear recurrence of order : if , then .
If has only the phases as the roots, then but can be written in terms of the wavefilters. Consider for simplicity the 1-dimensional (complex) LDS , and let with . Suppose and . In general the LDS is a “sum” of LDS’s that are in this form. Summing the past terms with coefficients given by ,
The terms can be taken care of by linear regression. Consider a term , in this sum. The coefficient is . Because , this can be written as
Factoring out from each of these terms show that can be expressed as a function of a convolution of the vector with . The wavefilters were designed precisely to approximate the vector well, hence can be approximated using the wavefilters multiplied by phase and convolved with . Note that the is necessary in order to make the norm of bounded, and hence ensure the wavefilters have bounded coefficients.
4.2 Regret bound for pseudo-LDSs
As an intermediate step toward the main theorem, we show a regret bound on the total least-squares prediction error made by Algorithm 1, compared to the best pseudo-LDS in hindsight.
Theorem 4 (FTRL regret bound; informal).
Let denote the predictions made by the fixed pseudo-LDS minimizing the total squared-norm error. Then, there is a choice of parameters for which the decision set contains all LDSs which obey the assumptions from Section 2.1, for which the predictions made by Algorithm 1 satisfy
where , .
The regret bound follows by applying the standard regret bound of follow-the-regularized-leader (see, e.g. ). However, special care must be taken to ensure that the gradient and diameter factors incur only a factor, noting that the discretization parameter (one of the dimensions of and ) must depend polynomially on in order for the class of pseudo-LDSs to approximate true LDSs up to error . To this end, we use a modification of the strongly convex matrix regularizer found in , resulting in a regret bound with logarithmic dependence on .
Intuitively, this is possible due to the -sparsity (and thus boundedness) of the phases of true LDSs, which transfers to an bound (in the phase dimension only) on pseudo-LDSs that compete with LDSs of the same size. This allows us to formulate a second convex relaxation, on top of that of wave-filtering, for simultaneous identification of eigenvalue phase and magnitude. For the complete theorem statement and proof, see Appendix B.
We note that the regret analysis can be used directly with the approximation result for autoregressive models (Theorem 5), without wave-filtering. This way, one can straightforwardly obtain a sublinear regret bound against autoregressive models with bounded coefficients. However, for the reasons discussed in Section 4.1, the wave-filtering technique affords us a much stronger end-to-end result.
4.3 Pseudo-LDSs compete with true LDSs
Theorem 3 shows that there exists a pseudo-LDS approximating the actual LDS to within in the noiseless case. We next need to analyze the best approximation when there is noise. We show in Appendix C (Lemma 18) that if the noise is bounded (), we incur an additional term equal to the size of the perturbation times a competitive ratio depending on the dynamical system, for a total of . We show this by showing that any noise has a bounded effect on the predictions of the pseudo-LDS.222In other words, the prediction error of the pseudo-LDS is stable to noise, and we bound its norm.
Letting be the predictions of the best pseudo-LDS, we have
For the complete proof, see Appendix C.2.
We exhibit two experiments on synthetic time series, which are generated by randomly-generated ill-conditioned LDSs. In both cases, is a block-diagonal matrix, whose 2-by-2 blocks are rotation matrices for phases drawn uniformly at random. This comprises a hard case for direct system identification: long-term time dependences between input and output, and the optimization landscape is non-convex, with many local minima. Here, and are random matrices of standard i.i.d. Gaussians. In the first experiment, the inputs are i.i.d. spherical Gaussians; in the second, the inputs are Gaussian block impulses.
We make a few straightforward modifications to Algorithm 1, for practicality. First, we replace the scalar autoregressive parameters with matrices . Also, for performance reasons, we use ridge regularization instead of the prescribed pseudo-LDS regularizer with composite norm constraint. We choose an autoregressive parameter of (in accordance with the theory), and .
As shown in Figure 2, our algorithm significantly outperforms the baseline methods of system identification followed by Kalman filtering. The EM and subspace identification (SSID; see ) algorithms finds a local optimum; in the experiment with Gaussian inputs, the latter failed to converge (left).
to compute and process the additional convolutions. To remove this phase discretization bottleneck, many heuristics are available for phase identification; see Chapter 6 of.
We gave the first, to the best of our knowledge, polynomial-time algorithm for prediction in the general LDS setting without dependence on the spectral radius parameter of the underlying system. Our algorithm combines several techniques, namely the recently introduced wave-filtering method, as well as convex relaxation and linear filtering.
One important future direction is to improve the regret in the setting of (non-adversarial) Gaussian noise. In this setting, if the LDS is explicitly identified, the best predictor is the Kalman filter, which, when unrolled, depends on feedback for all previous time steps, and only incurs a cost from noise in (3). It is of great theoretical and practical interest to compete directly with the Kalman filter without system identification.
-  Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
-  Oren Anava, Elad Hazan, Shie Mannor, and Ohad Shamir. Online learning for time series prediction. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA, pages 172–184, 2013.
-  Bernhard Beckermann and Alex Townsend. On the singular values of matrices with displacement structure. SIAM Journal on Matrix Analysis and Applications, 38(4):1227–1248, 2017.
-  David Belanger and Sham Kakade. A linear dynamical system model for text. In International Conference on Machine Learning, pages 833–842, 2015.
-  Daniel L Boley, Franklin T Luk, and David Vandevoorde. A fast method to diagonalize a Hankel matrix. Linear algebra and its applications, 284(1-3):41–52, 1998.
-  G. Box, G. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and Control. Prentice-Hall, 3 edition, 1994.
-  P. Brockwell and R. Davis. Time Series: Theory and Methods. Springer, 2 edition, 2009.
-  Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
-  Man-Duen Choi. Tricks or treats with the hilbert matrix. The American Mathematical Monthly, 90(5):301–312, 1983.
-  Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.
-  J. Hamilton. Time Series Analysis. Princeton Univ. Press, 1994.
-  Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191, 2016.
-  Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
-  Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Mach. Learn., 69(2-3):169–192, December 2007.
-  Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. In Advances in Neural Information Processing Systems, pages 6705–6715, 2017.
-  David Hilbert. Ein beitrag zur theorie des legendre’schen polynoms. Acta mathematica, 18(1):155–159, 1894.
-  Sham M Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13(Jun):1865–1890, 2012.
-  Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82.1:35–45, 1960.
-  Vitaly Kuznetsov and Mehryar Mohri. Time series prediction and online learning. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1190–1213, Columbia University, New York, New York, USA, 23–26 Jun 2016.
-  Lennart Ljung. System identification: Theory for the User. Prentice Hall, Upper Saddle Riiver, NJ, 2 edition, 1998.
-  Taesup Moon and Tsachy Weissman. Competitive on-line linear fir mmse filtering. In IEEE International Symposium on Information Theory - Proceedings, pages 1126 – 1130, 07 2007.
-  Sam Roweis and Zoubin Ghahramani. A unifying review of linear gaussian models. Neural computation, 11(2):305–345, 1999.
-  Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
-  Peter Van Overschee and BL De Moor. Subspace Identification for Linear Systems. Springer Science & Business Media, 2012.
-  Kemin Zhou, John Comstock Doyle, Keith Glover, et al. Robust and optimal control, volume 40. Prentice hall New Jersey, 1996.
Appendix A Proof of approximation theorem
a.1 Warm-up: Simple autoregressive model
As a warm-up, we first establish rigorous regret bounds for an autoregressive model, depending on properties of the the minimal polynomial of the linear dynamical system. Next we will see how introducing wavefilters can improve these bounds.
Consider a LDS , where is diagonalizable with spectral radius .
Let be a monic polynomial of degree such that .333In other words, the minimal polynomial of divides . For a diagonalizable matrix, the minimal polynomial is the characteristic polynomial except without repeated zeros.
. For a diagonalizable matrix, the minimal polynomial is the characteristic polynomial except without repeated zeros.Suppose ,444For a polynomial , let
Suppose is generated from the LDS with inputs . Then there exist and satisfying
Unfolding the LDS,
Let (with ). Then
using the fact that . Writing this in the autoregressive format,
We let for and check that this satisfies the conditions. Note . By the - inequality,
a.2 Autoregressive model with wavefilters: An approximate convex relaxation for asymmetric LDS
If we add wavefiltered inputs to the regression, the bounds depend not on the minimal polynomial of , but rather on the minimal polynomial having all the as zeros, where are the phases of zeros of the charateristic polynomial of . For example, if all the roots are real and distinct, then the polynomial is rather than . This can be an exponential improvement. For example, consider the case where all the are close to 1. Then the minimal polynomial is close to , which has coefficients as large as . Note the case of real roots reduces to .
First, we introduce some notation for convenience. Note that in (1), is a certain convolution. For we define
so that we can write (1) as
We now give a more precise statement of Theorem 3 and prove it.
Consider a LDS , where is diagonalizable with spectral radius . Let the eigenvalues of be for , where and . Let be the set of . Let be a monic polynomial of degree such that for each . Suppose , , and the diagonalization satisfies .
Suppose is generated from the LDS with inputs such that for all . Then there is , , , and satisfying
such that the pseudo-LDS approximates within for :
Our plan is as follows. First we show that choosing and as in Theorem 5,
can be approximated by
for ; this is obtained by a projection to the wavefilters. Next we approximate by discretizing the phase,
for some integers . Finally, we show taking the real part gives something in the form
This matches the form of a pseudo-LDS given in (1) after collecting terms,
Now we carry out the plan. Again we have (14) except that this time the second term is not 0. Let , be the columns of and be the rows of . By assumption on , for each , so the second term of (14) equals
Thus this equals