Estimating the hidden state and outputs of a known linear dynamical system (LDS), given the inputs and observations, is a well-studied problem in control theory [kamen1999introduction]. In the case of Gaussian noise, this problem is completely solved by the Kalman filter [kalman1960new], which recursively propagates the optimal linear estimator for the hidden state. When the recursion for the estimator is unrolled, the Kalman filter is seen to be a linear autoregressive filter: it predicts the system’s next output as a linear combination of the system’s past ground-truth outputs.
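To make the recursion concrete, here is a minimal sketch of a scalar Kalman filter (the parameters a, q, r and the simulated data are made up for illustration; this is not an example from the paper). The gain converges, after which the filter acts as a fixed linear predictor in past observations:

```python
import numpy as np

def kalman_filter_1d(ys, a=0.9, q=0.1, r=0.5):
    """Scalar Kalman filter for x_{t+1} = a*x_t + w_t (Var q), y_t = x_t + v_t (Var r)."""
    xhat, p = 0.0, 1.0               # prior mean and variance of the state
    estimates, gains = [], []
    for y in ys:
        k = p / (p + r)              # Kalman gain
        xhat = xhat + k * (y - xhat)   # measurement update
        p = (1 - k) * p
        estimates.append(xhat)
        gains.append(k)
        xhat, p = a * xhat, a * a * p + q   # time update (predict)
    return np.array(estimates), np.array(gains)

rng = np.random.default_rng(0)
x, ys = 0.0, []
for _ in range(500):
    x = 0.9 * x + rng.normal(scale=0.1 ** 0.5)
    ys.append(x + rng.normal(scale=0.5 ** 0.5))
est, gains = kalman_filter_1d(np.array(ys))
# the gain converges geometrically, so the filter becomes time-invariant
print(abs(gains[-1] - gains[-2]))
```

Once the gain has settled, unrolling the now time-invariant recursion expresses each prediction as a geometrically decaying linear combination of past observations, which is the autoregressive view taken in this paper.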
However, when the LDS is unknown, optimal filtering is a much harder problem. More generally, learning to control (or maximize reward) in an unknown system is a foundational problem in machine learning and control theory. One widely used approach is to learn the dynamical matrices from data, after which one can simply apply the Kalman filter. Unfortunately, this approach runs into computational barriers: the usual formulation of this problem is nonconvex. System identification techniques provide various practical algorithms for this problem [ljung1998system]. However, these algorithms, such as EM [roweis1999unifying], lack rigorous end-to-end guarantees, and are often unstable or find suboptimal solutions in high dimensions.
In this work, we bypass the state-space representation of an LDS, and analyze the statistical guarantees of learning an autoregressive filter directly. This allows us to compete with the predictions of the steady-state Kalman filter, without the computationally intractable task of explicitly identifying the system. We present a polynomial-time algorithm for learning an autoregressive filter for time-series prediction. The predictor has robust ($\mathcal{H}_\infty$) learning guarantees, which do not seem to be attained by naive least-squares.
Our primary motivation is the following question: can we learn the Kalman filter directly, without learning the system? We consider the setting of a linear dynamical system with hidden state, defined by
$$h_{t+1} = A h_t + B x_t + \eta_t \quad (1)$$
$$y_t = C h_t + D x_t + \xi_t \quad (2)$$
where $x_t$ are inputs, $h_t$ are hidden states, $y_t$ are outputs, $A$, $B$, $C$, $D$ are the system matrices, and $\eta_t$ and $\xi_t$ are independent zero-mean noise (assumed Gaussian to use the Kalman filter). Crucially, only the $y_t$, and not the $h_t$, are observed. A classic approach to learning the dynamics from data is subspace identification [ho1966effective, van2012subspace], for which statistical guarantees only exist in the asymptotic regime or under stringent assumptions. In the presence of noise, these methods are often used to initialize the EM algorithm [roweis1999unifying], a classic heuristic for a non-convex objective.
Another age-old model for dynamical systems is the autoregressive-moving average (ARMA) model [Hamilton94, BoxJenRe94, BroDav09], which models latent perturbations using a moving average
process. A central technique here is to recover an ARMA model by solving the Yule-Walker equations. However, to our knowledge, existing work on provably learning these models is limited to asymptotic guarantees.
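For concreteness, the classical Yule-Walker estimator for a pure AR model can be sketched as follows (the AR(2) coefficients and series length are made up; this illustrates the textbook method, not this paper's algorithm):

```python
import numpy as np

def yule_walker(x, p):
    """Estimate AR(p) coefficients from sample autocovariances (textbook Yule-Walker)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    gamma = np.array([x[: n - k] @ x[k:] / n for k in range(p + 1)])
    # Toeplitz system: Gamma[i, j] = gamma(|i - j|), right-hand side gamma(1..p)
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(Gamma, gamma[1:])

rng = np.random.default_rng(1)
a1, a2 = 0.6, -0.2                       # true AR(2) coefficients (stable)
y = np.zeros(20000)
for t in range(2, len(y)):
    y[t] = a1 * y[t - 1] + a2 * y[t - 2] + rng.normal()
a_hat = yule_walker(y, 2)
print(a_hat)  # consistent estimate of (0.6, -0.2) as the series grows
```

The guarantees for such moment-matching estimators are asymptotic (consistency as the series length grows), which is the gap the non-asymptotic analysis in this paper targets.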
1.2 Our results
We show that under certain stability conditions of the Kalman filter, we can bypass proper identification of the system, and still converge to the performance of the Kalman filter. We take an improper learning approach, reducing this problem to the general problem of learning an autoregressive model.
Our algorithm is based on a simple and familiar technique in time-series analysis: using a sine-wave input design to fit an autoregressive model via least-squares. However, a key problem with the ordinary least-squares approach is that it does not provide learning guarantees under worst-case input (which we have not necessarily seen), i.e., in the $\mathcal{H}_\infty$ norm. Such worst-case bounds are important because in the usual control-theoretic framework, bounds under the $\mathcal{H}_\infty$ norm are used to obtain guarantees for robust control.
To obtain $\mathcal{H}_\infty$ bounds for learning an autoregressive model, we augment our algorithm with an $\mathcal{H}_\infty$ objective, to learn a predictor that is robust in the $\mathcal{H}_\infty$ sense. When applied to the Kalman filter, our work gives (to our knowledge) the first non-asymptotic sample complexity guarantees for learning an optimal autoregressive filter for estimation in an LDS.
1.3 Related work
LDS without hidden state, and FIR’s.
The problem of learning unknown dynamical systems has attracted much recent attention from the machine learning community, due to connections to reinforcement learning and recurrent neural networks. Much progress has been made on the simpler related problem of learning and control in a linear dynamical model with no hidden state. Such a model is defined by
$$h_{t+1} = A h_t + B x_t + \eta_t$$
where $A$, $B$, $x_t$, $\eta_t$ are as before, but the state $h_t$ is now observed. [dean2017sample] consider the linear quadratic regulator (LQR)—the control problem for such an LDS—and prove that the least-squares estimator of the dynamics, given independent rollouts, is sample-efficient in this setting. [simchowitz2018learning] show that access to independent rollouts is unnecessary; the LDS can be identified from a single rollout, even when the system is only marginally stable.
An alternative approach to identifying the dynamical matrices is to learn the system as a finite-impulse response (FIR) filter. This is because the problem of learning an FIR filter can be thought of as a relaxation of the problem of learning an LDS, obtained by “unrolling” the LDS. [tu2017non] use ordinary least-squares with design inputs to learn an FIR filter, and give near-optimal sample complexity bounds. [boczar2018finite] complete the “identify-then-control” pipeline by studying robust control with this estimated FIR filter.
However, because the predictions given by an FIR filter depend only on the inputs $x_t$, and not on the outputs $y_t$, these methods do not suffice when the system has a hidden state. Such filters can only capture the dynamics of stable systems: for unstable or marginally stable systems, the infinite impulse response does not decay, so it is not approximated by a short truncation. Moreover, in these works, prediction performance guarantees are given under observation noise, and become very poor under process noise; indeed, to achieve optimal filtering (as in Kalman filtering), one must regress on the outputs. (See Appendix B for a simple example.) Our approach fills a gap in the literature, by giving a statistical analysis similar to [tu2017non] for autoregressive models.
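The FIR approach can be sketched as follows (made-up impulse response and noise level; a simplified version of the least-squares design-input setup of the FIR works discussed above, regressing only on past inputs):

```python
import numpy as np

rng = np.random.default_rng(2)
g_true = 0.7 ** np.arange(8)            # a stable, truncated impulse response
T, L = 5000, 8
u = rng.normal(size=T)                  # design inputs
y = np.convolve(u, g_true)[:T] + 0.1 * rng.normal(size=T)  # observation noise only

# regression matrix of lagged inputs: column k holds u_{t-k}
X = np.column_stack([np.r_[np.zeros(k), u[:T - k]] for k in range(L)])
g_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.max(np.abs(g_hat - g_true)))   # small: least-squares recovers the FIR
```

Under pure observation noise this works well; when noise is fed back through a hidden state, regressing only on inputs is no longer sufficient, which is the gap discussed above.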
LDS with hidden state, and autoregressive models.
When the system has a hidden state, several recent works analyze settings in which the dynamics can be identified. [hardt2016gradient] show that under certain conditions on the characteristic polynomial of the system’s transition matrix, gradient descent learns the parameters of a single-input single-output LDS. However, they only consider the setting of observation noise, and not process noise (i.e., $\eta_t = 0$). In work concurrent to ours, [simchowitz2019learning], building on [oymak2018non], consider the problem of learning an autoregressive filter, and for the case of an LDS, are able to recover matrices which give an equivalent realization of the LDS. Although they allow for semi-parametric noise and marginally stable systems, their guarantees are for estimating the filter in operator norm, rather than in the more stringent $\mathcal{H}_\infty$ norm.
[AnavaHMS13] show that in the online learning (regret minimization) setting, it is possible to learn an ARMA model sample-efficiently, even in the presence of adversarial (as opposed to i.i.d. Gaussian) noise. However, the regret framework is different from what is required for control, as it ensures performance only on the data that is seen; the predictor is not required to perform well on worst-case input. Furthermore, the constraint on the norm of the MA coefficients, which they require for the dynamical stability of their estimator of the residuals, is very stringent.
Finally, we note the approach of online spectral filtering for prediction in symmetric and asymmetric LDS’s [HSZ17, hazan2018spectral]. In these works, the process noise is only handled up to a multiplicative factor of the optimal filter with knowledge of the system. Intuitively, this “competitive ratio bound” arises because these works consider regressing only on one or a few past observations (in a somewhat rigid manner), rather than having the freedom to imitate an optimal autoregressive filter.
2 Problem setting and preliminaries
We first state the general problem of learning an autoregressive model, and then in Section 2.2 describe the connection to linear dynamical systems. In Section 2.3 we introduce some concepts from control theory and use them to write error bounds in terms of control-theoretic norms of filters (Lemma 2.4).
2.1 Problem statement
A (single-input, single-output) dynamical system converts input signals $(x_t)$ into output signals $(y_t)$ (random variables). We will assume that the data are generated by an autoregressive model:
$$y_t = \sum_{s \ge 1} \alpha_s y_{t-s} + \sum_{s \ge 0} \beta_s x_{t-s} + \xi_t \quad (4)$$
where $(\xi_t)$ is a time series of i.i.d. Gaussian noise, the filters $\alpha = (\alpha_s)$ and $\beta = (\beta_s)$ are supported on finitely many terms, and $x_t = 0$ and $y_t = 0$ for $t \le 0$.
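To make the data-generating process concrete, here is a minimal simulator for an autoregressive model of this general shape (the coefficients are made up for illustration; with zero noise and an impulse input, the output is exactly the unrolled impulse response):

```python
import numpy as np

def simulate_ar(alpha, beta, x, sigma=1.0, rng=None):
    """y_t = sum_s alpha[s]*y_{t-1-s} + sum_s beta[s]*x_{t-s} + sigma*xi_t."""
    if rng is None:
        rng = np.random.default_rng(0)
    T = len(x)
    y = np.zeros(T)
    for t in range(T):
        for s, a in enumerate(alpha):        # autoregressive part (past outputs)
            if t - 1 - s >= 0:
                y[t] += a * y[t - 1 - s]
        for s, b in enumerate(beta):         # input part (current and past inputs)
            if t - s >= 0:
                y[t] += b * x[t - s]
        y[t] += sigma * rng.normal()
    return y

x = np.zeros(100)
x[0] = 1.0                                   # impulse input
y = simulate_ar([0.5], [1.0], x, sigma=0.0)  # no noise: impulse response
print(y[:4])                                 # 1, 0.5, 0.25, 0.125
```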
Let $\alpha, \beta$ be filters. The learner is given black-box access to the system which takes inputs to outputs by (4). During each rollout, the learner specifies an input design $(x_t)$, and receives the corresponding output sequence. After collecting outputs from the rollouts, the learner returns filters $\hat\alpha, \hat\beta$, which specify a map from input to output signals via (4).
For an estimate $(\hat\alpha, \hat\beta)$ of $(\alpha, \beta)$, define the error in the prediction to be the difference between the predicted output and the expected value of $y_t$.
The goal is to learn $(\hat\alpha, \hat\beta)$ such that the expected error in the prediction is a small fraction of the size of the input, plus a small fraction of the elapsed time:
2.2 Connection to the Kalman filter
Our work is motivated by optimal state estimation in LDS’s with hidden state given by the dynamics (1)–(2). The Kalman filter gives the optimal solution in the case that the parameters of the LDS are known and the initial state is drawn from a Gaussian with known mean and covariance. We can compute matrices $A_t$, $B_t$, and $C_t$ such that the optimal estimates of the latent state and the observation are given by a time-varying LDS (taking the observations $y_t$ as feedback) with those matrices.
Taking $t \to \infty$, under mild non-degeneracy conditions the covariance of the latent state conditioned on the observations approaches a fixed covariance matrix, and the matrices $A_t$, $B_t$, and $C_t$ approach certain fixed matrices $A_\infty$, $B_\infty$, and $C_\infty$. Our goal is to learn this steady-state Kalman filter without knowing the parameters of the original LDS.¹ At steady state, given the observations up to time $t$, the actual hidden state and observation are distributed as Gaussians centered at the filter’s estimates, with fixed covariance matrices.

¹Note that if the parameters of the LDS are unknown, then any system for which the law of the observations in (1)–(2) is the same as the law of the actual observations is an equivalent realization. The Kalman filters computed from equivalent realizations all give the same predictions, so we need not distinguish between them.
Denote the block decomposition $B_\infty = (B_x \; B_y)$, where $B_x$ and $B_y$ are the submatrices acting on $x_t$ and $y_t$, respectively. Consider for simplicity the case where the input and output dimensions are 1: if the hidden state has dimension $d$, then $A_\infty \in \mathbb{R}^{d \times d}$, $B_x, B_y \in \mathbb{R}^{d \times 1}$, $C_\infty \in \mathbb{R}^{1 \times d}$, and $D$ is simply a scalar.
We can then “unfold” the Kalman filter into an equivalent autoregressive model (4), with the filters $\alpha, \beta$ and the noise determined by the steady-state matrices.² Note that the autoregressive model captures the law of the random process defined by the LDS (in terms of what is observable at each time step, i.e., with respect to the filtration generated by the observations), without utilizing a hidden state.

²Note that this noise is not to be confused with the noise in (1)–(2): it has larger variance because it also incorporates the uncertainty about the hidden state.
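A quick numerical sanity check of this unfolding, for a scalar system with assumed parameters (a, c, and a steady-state gain k picked for illustration): running the steady-state filter as a recursion gives the same predictions as the explicit autoregressive filter on past observations.

```python
import numpy as np

a, c, k = 0.9, 1.0, 0.3                  # assumed state map, observation map, gain
rng = np.random.default_rng(3)
ys = rng.normal(size=200)                # any observation sequence works here

# steady-state filter run as a recursion:
#   xhat_{t+1} = a*(1 - k*c)*xhat_t + a*k*y_t,   yhat_{t+1} = c*xhat_{t+1}
xhat, preds_rec = 0.0, []
for y in ys:
    xhat = a * (1 - k * c) * xhat + a * k * y
    preds_rec.append(c * xhat)

# the same filter "unrolled" into an autoregressive filter on past observations:
#   yhat_{t+1} = sum_{s>=0} c*a*k * (a*(1 - k*c))**s * y_{t-s}
phi = a * (1 - k * c)
preds_ar = [sum(c * a * k * phi ** s * ys[t - s] for s in range(t + 1))
            for t in range(len(ys))]
max_diff = np.max(np.abs(np.array(preds_rec) - np.array(preds_ar)))
print(max_diff)  # ~0: the recursion and the unrolled filter agree
```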
In this setting, we again attempt to minimize the error between the prediction and the expected value of the output when the dynamics are known.
2.3 Preliminaries on control theory
An impulse response function can equivalently be represented as a power series.
For a sequence $f = (f_s)_{s \ge 0}$, define the transfer function of $f$ by $F(z) = \sum_{s \ge 0} f_s z^s$. We will always denote the transfer function of a sequence by the corresponding capital letter.
Note that if $h = f * g$ (convolution), then $H = FG$ as formal power series, and equality holds as functions for $z$ such that $F(z)$ and $G(z)$ converge absolutely. Translation corresponds to multiplication: the transfer function of $(f_{s-k})_s$ is $z^k F(z)$. Hence, letting $A, B, X, Y, \Xi$ be the transfer functions of $\alpha, \beta, x, y, \xi$, we have from (4) that $Y = AY + BX + \Xi$.
Thus, we can rewrite (4) as $Y = GX + \frac{\Xi}{1-A}$, where $g$, the “unrolled” filter, is such that $G = \frac{B}{1-A}$.
The $\mathcal{H}_2$-norm of a filter $f$ is the $L^2$-norm of its transfer function over the unit circle $\mathbb{T} = \{z \in \mathbb{C} : |z| = 1\}$:
$$\|f\|_{\mathcal{H}_2} = \left( \frac{1}{2\pi} \int_0^{2\pi} |F(e^{i\theta})|^2 \, d\theta \right)^{1/2}.$$
The $\mathcal{H}_\infty$-norm of a filter $f$ is the $L^\infty$-norm of its transfer function over the unit circle:
$$\|f\|_{\mathcal{H}_\infty} = \sup_{|z| = 1} |F(z)|.$$
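Both norms are easy to approximate numerically for a finite filter by evaluating its transfer function on a grid of the unit circle. A sketch using the geometric filter $f_s = \rho^s$ (the convention $F(z) = \sum_s f_s z^s$ and the value of $\rho$ are assumptions for illustration):

```python
import numpy as np

rho, L = 0.5, 200
f = rho ** np.arange(L)                        # f_s = rho**s, truncated
theta = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
F = np.polynomial.polynomial.polyval(np.exp(1j * theta), f)  # F(z) on |z| = 1
h2 = np.sqrt(np.mean(np.abs(F) ** 2))          # L2 average over the circle
hinf = np.max(np.abs(F))                       # sup over the circle
# closed forms (up to truncation): 1/sqrt(1 - rho**2) and 1/(1 - rho)
print(h2, hinf)
```

By Parseval, the computed H2-norm matches the coefficient l2-norm, and the H∞-norm is attained at z = 1 for this nonnegative filter.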
For the rest of the paper we will assume the system is stable, i.e., that the unrolled filter $g$ is in $\mathcal{H}_\infty$, so that $\|g\|_{\mathcal{H}_\infty} < \infty$.³

³The stability condition is necessary to do estimation of a general autoregressive filter in $\mathcal{H}_\infty$-norm. Otherwise, it is impossible to do worst-case estimation over an infinite time horizon with access only to a finite rollout, as an input with infinite response can have arbitrarily small response over a finite horizon. This suggests that to solve the control problem over an infinite time horizon for an unstable system, one should look for weaker assumptions than learning to small $\mathcal{H}_\infty$-error that still allow control.
The $\mathcal{H}_2$-norm represents the steady-state variance under i.i.d. Gaussian noise as input, and the $\mathcal{H}_\infty$-norm represents the maximum norm of the output when the input has $\ell^2$-norm at most 1:
where the output is the convolution of the filter with the input. Because the noise has mean 0, we obtain the following bound.
Suppose that . Then in the setting of Problem 2.1,
We will approximate the true filters with finite-length filters of length $L$, so we need to make sure $L$ is large enough to capture most of the response. For this, we use the following definition and lemma from [tu2017non], which give a sufficient length in terms of the desired error and a norm of the filter.
Definition 2.5 (Sufficient length condition, [tu2017non, Definition 1]).
We say that a Laurent series $\sum_s f_s z^s$ has stability radius $\rho < 1$ if the series converges for $|z| < 1/\rho$. Let $f$ be stable with stability radius $\rho$. Fix $\varepsilon > 0$, and define the sufficient length in terms of $\rho$ and $\varepsilon$.
Note that this “sufficient length condition” is analogous to a dependence on the spectral radius of the transition matrix when learning an LDS. Indeed, a filter corresponding to an LDS has stability radius equal to the spectral radius of its transition matrix.
Lemma 2.6 ([tu2017non, Lemma 4.1]).
Suppose $f$ is stable with stability radius $\rho$. Then the coefficients of $f$ decay geometrically at rate $\rho$; hence, a filter length of order $\log(1/\varepsilon)/\log(1/\rho)$ suffices to approximate $f$ to error $\varepsilon$.
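The intuition behind the lemma can be checked numerically: for a filter whose coefficients decay like $\rho^s$, the tail mass beyond length L is $\rho^L/(1-\rho)$, so a length logarithmic in $1/\varepsilon$ suffices (a sketch with assumed geometric decay, not the lemma's exact constants):

```python
import numpy as np

rho, eps = 0.9, 1e-3
# choose L so that the geometric tail rho**L / (1 - rho) is at most eps
L = int(np.ceil(np.log(1.0 / (eps * (1 - rho))) / np.log(1.0 / rho)))
tail = rho ** L / (1 - rho)          # exact tail sum_{s >= L} rho**s
print(L, tail)
```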
3 Algorithm and main theorem
We motivate our main algorithm, Algorithm 1. The most natural algorithm is the following: let the inputs be sinusoids at equally spaced frequencies, and solve a least-squares problem for the filter coefficients. However, ordinary least-squares will only give estimates for which the error is small for random input, while we desire estimates for which the error is small for worst-case input; in other words, it gives an average-case ($\mathcal{H}_2$), rather than the worst-case ($\mathcal{H}_\infty$), bound that we desire. This is analogous to the difference between estimating a matrix in Frobenius and operator norm; the Frobenius norm trivially bounds the operator norm, but the resulting bound is typically far from optimal. Hence, the sample complexity bound from ordinary least-squares does not have optimal dependence on the filter length. Note that [boczar2018finite] solve the analogous problem for an FIR filter with least-squares without suffering an extra factor, because in that setting, the matrix in the least-squares problem is a fixed matrix depending on the inputs, the error in the estimate is Gaussian, and supremum bounds for Gaussians are applicable. Our setting is more challenging because the $y_t$’s depend on noise that we have no control over.
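The gap between the average-case and worst-case norms is easy to see numerically: an estimation error spread evenly over L filter coefficients has H2-norm $\varepsilon\sqrt{L}$ but H∞-norm $\varepsilon L$ (attained at z = 1), so an H2 guarantee alone leaves a factor $\sqrt{L}$ in the worst case. A small sketch with made-up numbers:

```python
import numpy as np

L, eps = 400, 0.01
diff = eps * np.ones(L)                 # error spread evenly over L coefficients
h2 = np.linalg.norm(diff)               # = eps * sqrt(L) = 0.2
theta = np.linspace(0, 2 * np.pi, 2048, endpoint=False)
F = np.polynomial.polynomial.polyval(np.exp(1j * theta), diff)
hinf = np.max(np.abs(F))                # = eps * L = 4.0, attained at z = 1
print(h2, hinf, hinf / h2)              # the gap is sqrt(L) = 20
```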
The first step of our algorithm is still to solve a least-squares problem. We do this in two parts: first, solve for the autoregressive filter by regressing on zero input, and then, using this estimate, solve for the input filter separately for each frequency. We do this to avoid the error in the first estimate (larger by a factor because it is higher-dimensional) contributing to the error in the second.
The final step is to combine the per-frequency estimates. Because the number of frequencies is larger than the length of the filter (necessary to be able to interpolate to unseen frequencies), we cannot find a single filter that matches each estimate on its frequency. Keeping in mind our $\mathcal{H}_\infty$ objective, we hence optimize a min-max problem over the frequencies to interpolate the estimates.
Note that in the algorithm we can take inputs only at the distinct frequencies, because the remaining signals are trivial; we consider them to make the notation in the proof simpler. For convenience of notation we re-index the time series to start at 0.
refers to the vector.
Theorem 3.1 (Learning an autoregressive model).
To prove the theorem, we establish bounds on the error of the estimated filters, and use Lemma 2.4. Note there is no dependence on the noise magnitude in (29), for the following reason: a smaller noise magnitude means worse estimation of the response to noise by a corresponding factor, but when tested on rollouts with noise of the same magnitude, the error is not affected.
We expect the dependence on the filter length to be optimal: the number of parameters is of that order, and we have access to a comparable number of samples (including samples in the same rollout). We also conjecture that the remaining dependence is unavoidable.
As an immediate corollary, we obtain a theorem for learning the Kalman filter. For simplicity, we state the result when the initial state has the steady-state distribution, to avoid burn-in time arguments.
Corollary 3.2 (Improperly learning the Kalman filter).
Consider the system (1)–(2). Let $A_\infty$, $B_\infty$, $C_\infty$, $D$ be the steady-state Kalman filter matrices, and let $\sigma^2$ be the variance in the estimate of the observation, as defined in Section 2.2. Suppose that $A_\infty$ has spectral radius less than 1, and suppose the rollouts are started with the hidden state drawn from the steady-state distribution. Algorithm 1 with parameters given in Theorem 3.1 returns predictions such that
4 Proof sketch
It will be convenient to first prove the theorem in the case when the burn-in time is infinite. Note that by the stability assumption, for input signals with finite norm, the outputs will not diverge.
Theorem 3.1 holds in the setting when the burn-in time is infinite.
Step 1 (Concentration):
The error from the least-squares problem can be written as a product of the inverse empirical covariance of the regressors and a noise cross-term. A simple way to bound it is to bound the former from below and the latter from above; with enough samples, both concentrate around their expectations.
We show that the matrices in the least-squares problems (25) and (26) concentrate using matrix concentration bounds (Lemma 5.2), and that the noise cross-terms concentrate by martingale concentration (Lemma 5.5). The main complication is to track how the error in the first estimate propagates into the second (see (42) and the following computations).
Step 2 (Generalization):
The bounds we obtain on the estimates in the direction of the $k$th frequency (94) show that the actual solution does well in the min-max problem (27). The solution to (27) will only do better. By concentration, the matrices in the least-squares problem and in the actual expected square loss are comparable. Because the learned filter does well in the min-max problem, it will do comparably well with respect to the actual expected loss, when the input is one of the frequencies that has been tested.
In this step, we already have enough to bound the error in estimation with pure noise and no input signal.
Step 3 (Interpolation):
We have produced a filter that is close to the truth when tested on each of the frequencies, but we need to extend this bound to all frequencies. Considering transfer functions and clearing denominators, this reduces to a problem about polynomial interpolation. We use a theorem from approximation theory (Theorem 5.6) that bounds the maximum of a polynomial on the unit circle, given its values at equispaced points. Note that it is crucial here that the number of parameters in the filter is less than the number of frequencies tested.
Note that we needed to clear the denominator, so we lose a corresponding factor here. We obtain the desired $\mathcal{H}_\infty$ bound, finishing the proof of Theorem 4.1.
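The interpolation step can be checked numerically: for a polynomial of degree d, its maximum over the whole unit circle is comparable to its maximum over N ≫ d equispaced roots of unity (a sketch with a random polynomial; the constants here are illustrative, not those of Theorem 5.6):

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 10, 64
coeffs = rng.normal(size=d + 1)                      # random degree-10 polynomial
pv = np.polynomial.polynomial.polyval
fine = np.exp(1j * np.linspace(0, 2 * np.pi, 8192, endpoint=False))
pts = np.exp(2j * np.pi * np.arange(N) / N)          # N equispaced roots of unity
ratio = np.max(np.abs(pv(fine, coeffs))) / np.max(np.abs(pv(pts, coeffs)))
print(ratio)   # close to 1 when N is much larger than d
```

If instead N were at most d, the polynomial could be large between the sample points while vanishing on all of them, which is why the number of frequencies must exceed the number of parameters.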
Step 4 (Truncation):
Finally, we show that with a large burn-in time, the distribution of ’s will be almost indistinguishable from the steady-state distribution, and hence the algorithm still works.
We first set up notation and make some preliminary observations. A table of notation is provided in Section A. Let the design matrices have as columns the regressors from the rollouts, so that the least-squares problems take matrix form. We index the noise in the various rollouts by the rollout, and also write the noise from a generic rollout without an index (so the rollout noises are independent copies of it).
Define the expected second-moment matrices of the regressors. These matrices do not depend on the rollout, which can be seen as follows. Consider the system response to a complex sinusoidal input. (Although we cannot feed complex values into the system, there is a well-defined response for complex inputs.) Because the noise has the same distribution across rollouts, the expectation does not depend on the rollout, and expanding gives an explicit expression. Note that the resulting matrix has low rank, as its columns are spanned by the responses at the tested frequencies.
Let one predictor denote the expected value of $y_t$ given the inputs and outputs up to time $t$, and let another denote the expected value of $y_t$ given only the inputs.
We first compute the errors in the filter estimates, and then the error in the mean response, together with the analogous expression for the noise response. These are broken up into subexpressions to which we apply concentration bounds.
We calculate the least-squares solution and the resulting error.