Learning Linear Dynamical Systems with Semi-Parametric Least Squares

by Max Simchowitz, et al.

We analyze a simple prefiltered variation of the least squares estimator for the problem of estimation with biased, semi-parametric noise, an error model studied more broadly in causal statistics and active learning. We prove an oracle inequality which demonstrates that this procedure provably mitigates the variance introduced by long-term dependencies. We then demonstrate that prefiltered least squares yields, to our knowledge, the first algorithm that provably estimates the parameters of partially-observed linear systems that attains rates which do not incur a worst-case dependence on the rate at which these dependencies decay. The algorithm is provably consistent even for systems which satisfy the weaker marginal stability condition obeyed by many classical models based on Newtonian mechanics. In this context, our semi-parametric framework yields guarantees for both stochastic and worst-case noise.



1 Introduction

Serial data are ubiquitous in machine learning, control theory, reinforcement learning, and the physical and social sciences. A major challenge is that such data exhibit long-term correlations, which often obscure the effects of design variables on measured observations and drive up the variance of statistical estimators. In the study of linear, time-invariant dynamical (LTI) systems, for example, a vast literature of both classical and contemporary work typically assumes that the system exhibits a property called

strict stability, which ensures that long-term correlations decay geometrically (Verhaegen, 1993). While recent works show this condition can be removed in the special case when the full system state is perfectly observed (Simchowitz et al., 2018; Sarkar and Rakhlin, 2018; Faradonbeh et al., 2018), it is not known whether the condition is necessary in general. Moreover, it is not fully understood whether the rate of decay of correlations is the right quantity to parametrize the difficulty of learning, even for strictly stable systems. Among the many challenges introduced by both non-strictly stable and almost-unstable LTI systems is that the more one probes, the more the long-term correlations compound to yield measurements with large magnitudes, thereby driving up the variance of statistical estimators. This problem of growing variance arises in many other problem domains as well: for example, the reinforcement learning community has produced a great body of work dedicated to reducing variance in the presence of long time horizons and large reward baselines (Sutton and Barto, 1998; Greensmith et al., 2004). This work makes two contributions. First, we analyze a simple prefiltered variation of the least squares estimator (PF-LS) for the problem of estimation with biased, semi-parametric noise, an error model studied more broadly in causal statistics and active learning (Chernozhukov et al., 2017; Krishnamurthy et al., 2018). We prove an oracle inequality which demonstrates that this procedure provably mitigates the variance introduced by long-term dependencies. Second, we demonstrate that prefiltered least squares yields, to our knowledge, the first algorithm that provably estimates the parameters of partially-observed linear systems that attains rates which do not incur a worst-case dependence on the rate at which these dependencies decay. 
The algorithm is provably consistent even for systems which satisfy the weaker marginal stability condition obeyed by many classical models based on Newtonian mechanics. In this context, our semi-parametric framework yields guarantees for both stochastic and worst-case noise.

1.1 Problem Statement

We consider the problem of regressing a sequence of observations to a sequence of inputs for times . For a fixed

, we define the concatenated input vectors

, and assume a serial, semi-parametric relationship between and ; that is, there exists a filtration and a for which


We choose inputs such that which ensures (i) the Neyman-orthogonality condition , indispensable for consistent semi-parametric estimation (Chernozhukov et al., 2017) and (ii) that the inputs have well-conditioned covariance. This setting is motivated by the problem of estimating the parameters of a discrete-time linear system, which evolves according to the updates


where is the sequence of inputs, the sequence of outputs, is a sequence of states, and are sequences of process noise and sensor noise, respectively, and all above matrices are of appropriate dimension. (Footnote 1: We do not estimate and , which are in general unidentifiable without a specific noise model.) Crucially, we do not observe the system states or the noises and . We shall assume that the process and sensor noises can be chosen semi-adversarially in the sense that (i.e., an adversary may only act with a -step delay). This model captures both stochastic and worst-case oblivious noise. A simple recursion shows that this condition implies that (1.1) holds in this setting, with equal to
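To make the state-space model concrete, the update equations above can be simulated directly. The dimensions, noise scales, and the specific matrices below are illustrative choices of ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and matrices (not from the paper): a marginally
# stable double integrator observed through its position coordinate.
n, d, m = 2, 1, 1                      # state, input, and output dimensions
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])            # rho(A) = 1: marginally stable
B = np.array([[0.0], [0.1]])
C = np.array([[1.0, 0.0]])
D = np.zeros((m, d))

T = 100
x = np.zeros(n)
ys, us = [], []
for t in range(T):
    u = rng.standard_normal(d)         # excitation input u_t
    w = 0.01 * rng.standard_normal(n)  # process noise w_t
    v = 0.01 * rng.standard_normal(m)  # sensor noise v_t
    y = C @ x + D @ u + v              # y_t = C x_t + D u_t + v_t
    x = A @ x + B @ u + w              # x_{t+1} = A x_t + B u_t + w_t
    ys.append(y)
    us.append(u)

ys, us = np.array(ys), np.array(us)
```

As in the text, only the pairs (y_t, u_t) would be available to the estimator; the states and noise sequences are latent.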

the length- response function (Footnote 2: We suppress the dependence on to ease notation.) from the inputs to the observation . Examining the dynamical equations (1.2), we see that the error corresponds to the residual part of which does not depend on , and is therefore -adapted. An important recent result due to Oymak and Ozay (2018) demonstrates that for these LTI systems, a consistent estimate of can produce a consistent estimate of an equivalent realization of . (Footnote 3: That is, a pair such that , and for all , .) Furthermore, Oymak and Ozay (2018) show that if the operator norm

is strictly less than one, then ordinary least squares (OLS) can efficiently recover

from the inputs

. Formally, if the process and sensor noises are i.i.d normally distributed, the least squares estimator which uses samples

for some ,


converges to at a rate of . Unfortunately, the condition is quite stringent, and the learning rates degrade as approaches . Indeed, many systems of interest do not even satisfy a weaker condition known as strict stability: , where denotes the spectral radius. For example, simple oscillators, integrators, and elementary systems that arise from Newton’s laws yield realizations where . For instance, the LTI system corresponding to the discretization of the differential equation , with sampling time , includes the matrix . This matrix violates the strict stability condition, yet satisfies . As mentioned above, non-asymptotic bounds for learning LTI systems typically yield rates which depend on the inverse stability gap (Oymak, 2018; Hardt et al., 2016; Shah et al., 2012); for example, Oymak (2018) requires one to select a horizon length for which , which necessitates that . This work, on the other hand, suggests that a dependence on the stability gap can be avoided in many cases, and the difficulty of learning can instead be parametrized by quantities that are often less conservative.
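The gap between the OLS condition and marginal stability is easy to check numerically for the double-integrator discretization mentioned above (the sampling time value below is our illustrative choice):

```python
import numpy as np

tau = 0.1
A = np.array([[1.0, tau],
              [0.0, 1.0]])  # discretization of Newton's second law

rho = max(abs(np.linalg.eigvals(A)))  # spectral radius rho(A)
opnorm = np.linalg.norm(A, 2)         # operator norm ||A||_op

# rho(A) = 1: strict stability rho(A) < 1 fails, so a fortiori the
# stronger OLS requirement ||A||_op < 1 fails (here ||A||_op > 1).
```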

1.2 Prefiltered Least Squares (PF-LS)

In light of the limitations of ordinary least squares, we analyze PF-LS, a simple prefiltering step to improve the estimation of the matrix in the general semi-parametric setting (1.1). In Section 2, we specialize our analysis to establish consistent recovery of any linear dynamical system for which . Prefiltering mitigates the magnitude of the errors by learning a coarse linear filter of future outputs, denoted , for . This filter uses a sequence encoding past observations to estimate via the prediction . We then estimate by regressing the filtered observations to . Concretely, our procedure consists of the following two least-squares steps.


Throughout, we let , and we use the notation to denote the matrix whose rows are and the matrix whose rows are . Our first contribution is the following inequality. Throughout, we use to denote inequality up to universal multiplicative constants.
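The two-step procedure can be sketched as follows. This is a minimal reading of PF-LS, not the paper's exact estimator: the function name `pf_ls`, the ridge parameter `lam`, the past-window length `p`, and the input-concatenation length `k` are all ours, and (1.4)–(1.5) fix details (such as the precise regularization) that we only approximate here.

```python
import numpy as np

def pf_ls(Y, U, p, k, lam=1e-3):
    """Sketch of prefiltered least squares (PF-LS).

    Step 1: fit a regularized linear filter predicting y_t from the
            p previous observations.
    Step 2: regress the prefiltered residuals on the concatenated
            recent inputs (u_t, ..., u_{t-k+1}) to estimate G.
    """
    T, m = Y.shape
    d = U.shape[1]
    rows = range(max(p, k - 1), T)
    # Step 1: ridge regression of y_t on stacked past observations.
    P = np.stack([Y[t - p:t].ravel() for t in rows])
    Ynow = np.stack([Y[t] for t in rows])
    phi = np.linalg.solve(P.T @ P + lam * np.eye(P.shape[1]), P.T @ Ynow)
    resid = Ynow - P @ phi             # prefiltered observations
    # Step 2: OLS of the residuals on the concatenated inputs.
    X = np.stack([U[t - k + 1:t + 1].ravel() for t in rows])
    G, *_ = np.linalg.lstsq(X, resid, rcond=None)
    return G.T                         # estimate of the m x (k*d) matrix G

rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 1))      # toy observation sequence
U = rng.standard_normal((200, 1))      # toy input sequence
G_hat = pf_ls(Y, U, p=5, k=3)
```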

Theorem 1.1 (Prefiltering Oracle Inequality, Informal)

Consider the general semi-parametric setting described in Section 1.1, and suppose that

. Then, with high probability,

Here, hides logarithmic terms in , , and . The term

captures the extent to which prefiltering overfits to , and


describes the data-dependent prediction error of the best filter.

We defer a precise statement of Theorem 1.1 to Theorem 3.3. Note that this result makes no assumptions on the structure of the noise or the features , other than the measurability assumptions that . Moreover, the term captures the actual sequence of errors , rather than an a priori upper bound. For a sense of scaling, the overfitting term is typically dominated by , and by setting , . When on average, this term behaves as , and thus decays at a rate of . In general, we only need to ensure .

1.3 Organization

Section 2 presents a precise statement of the PF-LS oracle inequality for LTI systems, Proposition 2.1, as well as bounds for the associated term in terms of the phase rank of (Definition 2.3). Consistency of estimation for marginally stable LTI systems is presented as a consequence (Corollary 2.2). Section 3 walks the reader through the analysis of the PF-LS estimator, culminating in a formal statement of the oracle inequality, Theorem 3.3. Section 4 provides a proof sketch for results described in Section 2, and Section 5 addresses related work. Complete proofs are deferred to the Appendix, which is divided into three parts: Part I contains graphical illustrations of phase rank (Appendix A), the proof of Corollary 2.2, and the lower bound for OLS, Theorem C.1. Part II pertains to the oracle inequality and related material from Section 3. Part III addresses results specific to LTI systems. The appendix begins with a preface which consolidates notation and outlines the organization of the subsequent appendices in greater detail.

2 Rates for Learning LTI Systems

In the setting of marginally stable systems, we cannot guarantee that grows as due to the possible accumulation of system inputs. Indeed, Theorem C.1 in the appendix shows that the OLS estimator is inconsistent whenever and the system satisfies a weak identifiability criterion. Even for strictly stable systems, it is not clear whether the inverse stability gap , correctly describes the difficulty of estimation. What we show is that by choosing a large enough filter length and features


we can ensure both that for marginally stable systems and that need not depend on the stability gap . Our choice of in (2.7) corresponds to filtering a subsampled history of the outputs in order to predict . This linear prefiltering step is in the spirit of many schemes detailed in the system identification and time-series literature (see Section 5), many of which ensure performance in both stochastic and adversarial settings for strictly stable systems. From the perspective of prefiltered semi-parametric learning, we observe that corresponds to , and that the dynamical equations (1.2) imply that features (2.7) are -adapted. We begin the task of deriving explicit estimation rates for LTI systems by first establishing an oracle inequality for the PF-LS scheme given by (1.4) and (1.5), for two particular noise models.
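A hedged sketch of a feature map in the spirit of (2.7): the exact subsampling pattern and indexing are fixed in the paper, while the stride and the helper name below are our illustration.

```python
import numpy as np

def subsampled_past(Y, t, p, stride):
    """Stack p past observations, subsampled every `stride` steps,
    to form the prefilter input used to predict y_t. The particular
    pattern here is illustrative; (2.7) specifies the exact one."""
    idx = [t - 1 - j * stride for j in range(p)]
    if min(idx) < 0:
        raise ValueError("not enough history before time t")
    return np.concatenate([Y[i] for i in idx])

Y = np.arange(20.0).reshape(20, 1)           # toy scalar observations y_t = t
q = subsampled_past(Y, t=10, p=3, stride=2)  # stacks y_9, y_7, y_5
```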

Definition 2.1 (Noise Models)

In the stochastic noise model, and are conditionally -subgaussian. (Footnote 4: That is, for all , and analogously for .) In the adversarial noise model, the noise processes satisfy the bounds and with probability .

Observe that shaped and/or scaled noise can be addressed by altering and appropriately. The conditions and make the adversarial and stochastic noise models comparable, as in the stochastic noise model we have and by subgaussianity. We shall assume for the rest of the section, and we address general in the Appendix. Now, we introduce the following parameter, which illustrates the dependence of our bounds on the eigenstructure of

, the conditioning of the eigenvalues, and the magnitude of the noises encoded in

and .

Definition 2.2

Let denote the Jordan-normal decomposition of . We define

Lastly, our results apply once satisfies a moderate lower bound. Specifically, we define

where is a sufficiently large constant. Our result bounds for LTI systems in terms of , , and dimension quantities. Furthermore, we let . With these quantities defined, we state a specialized version of our general oracle inequality, Theorem 3.3.

Proposition 2.1

Fix and , and suppose that , , , and that the largest Jordan block of is of size . Choosing some and defining

it holds with probability at least in the stochastic noise model that

In the adversarial noise model, we instead take .

We remark that the logarithmic terms in can be refined further, but we state the above bound for its relative simplicity. In the following subsection, we show that for any marginally stable system, one can ensure that as long as is chosen to be sufficiently large. As a consequence, we verify in Appendix B that combining prefiltered least squares with the Ho-Kalman algorithm analyzed in Oymak and Ozay (2018) provides consistent estimation of the underlying system parameters themselves.
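Since the Ho-Kalman step is invoked here as a black box, the following is a generic sketch of that realization algorithm, not the specific variant analyzed by Oymak and Ozay (2018); the square Hankel shape and helper names are our choices.

```python
import numpy as np

def ho_kalman(markov, n, m=1, d=1):
    """Generic Ho-Kalman sketch: recover a realization (A, B, C) of
    state dimension n from the list [G_1, G_2, ...] of m x d Markov
    parameters G_k = C A^{k-1} B."""
    T1 = (len(markov) - 1) // 2
    # Block Hankel matrices: H[i, j] = G_{i+j+1}, Hplus[i, j] = G_{i+j+2}.
    H = np.block([[markov[i + j] for j in range(T1)] for i in range(T1)])
    Hplus = np.block([[markov[i + j + 1] for j in range(T1)] for i in range(T1)])
    U, s, Vt = np.linalg.svd(H)
    U, s, Vt = U[:, :n], s[:n], Vt[:n]     # rank-n truncation
    O = U * np.sqrt(s)                     # observability factor
    Q = np.sqrt(s)[:, None] * Vt           # controllability factor
    C_hat = O[:m]                          # first block row of O
    B_hat = Q[:, :d]                       # first block column of Q
    A_hat = np.linalg.pinv(O) @ Hplus @ np.linalg.pinv(Q)
    return A_hat, B_hat, C_hat

# Example: scalar system A = 0.5, B = C = 1, so G_k = 0.5 ** (k - 1).
markov = [np.array([[0.5 ** k]]) for k in range(5)]
A_hat, B_hat, C_hat = ho_kalman(markov, n=1)
```

Note that the recovered realization is only determined up to a similarity (here, sign) transformation, which is why one checks the reconstructed Markov parameters rather than the factors themselves.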

Corollary 2.2 (Recovery of System Parameters)

Suppose that the system is minimal (Definition B.1), and . Then for , , and sufficiently large and , there exist constants and depending on system parameters , the dimension , and

; such that with high probability there exists a unitary matrix


Here, is a certain realization of : the output of the Ho-Kalman algorithm on , and is the output of the Ho-Kalman algorithm on .

We remark that the condition that is minimal is necessary to ensure identifiability in the manner described by Corollary 2.2. When minimality fails, Appendix B explains that there always exists a reduced system model equivalent (in an input-output sense) to , which can be estimated in the sense of Corollary 2.2. As mentioned above, Theorem C.1 shows that OLS inconsistently estimates when , even with no process or sensor noise.

2.1 Learning without the stability gap

While Corollary 2.2 ensures consistent estimation for marginally stable systems with , we in fact wish to answer the more ambitious question: how accurately does the inverse stability gap describe the intrinsic difficulty of learning an LTI system? To this end, we describe an alternative criterion we call phase rank, which does not depend on the stability radius in the worst case. We use phase rank to bound the term , and thus, through Proposition 2.1, upper bound the learning rate of the estimator . As a second alternative to the inverse stability gap, in Appendix K we define a condition called strong observability, related to the classical notion of observability in control theory (Hautus, 1983). This condition also allows us to bound estimation rates in terms of quantities that do not degrade (in the worst case) as . Phase Rank: Many approaches in the recent learning theory literature have developed bounds which depend directly on the spectrum of and magnitudes of the state-space realization matrices ; however, many existing works have incurred dependencies on the minimal polynomial of , which is exponentially large in the worst case. Our first approach adopts a new measure of complexity we call phase rank, inspired by Hazan et al. (2018), which derives bounds from the spectrum of without paying for the size of the minimal polynomial. Rather than capturing the effects of all eigenvalues, the phase rank groups together eigenvalues with approximately the same phase, and only considers the eigenvalues of which lie near the boundary of the unit disk. Formally, let denote the complex unit disk, and for a marginally stable , let denote the set (Footnote 5: As we will be taking the maximum over , set is equivalent to multiset for our purposes.) of all pairs , where is an eigenvalue of and is the size of an associated Jordan block. We define phase rank as follows:

Definition 2.3 (Phase Rank)

Let . We say that has -phase rank if there exists such that, for any with , there exist at least elements satisfying

In the above definition, the parameter allows us to group together eigenvalues having approximately the same phase mod , and the parameter controls the ‘width’ of the approximation. The phase rank is typically small for many systems of interest; for example, for real diagonalizable systems, it is at most , regardless of the stability radius, thereby obviating a dependence on in the worst case. Phase rank can also take advantage of benign systems which do exhibit stability; for example, phase rank is equal to for strictly stable systems with . In Appendix A, we give some visual diagrams to aid the intuition for this condition. Moreover, the phase-rank is at most the degree of the “minimal-phase polynomial” of , the measure of complexity studied by Hazan et al. (2018) which inspired this condition. With this definition in hand, we provide the following bound on :
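As a rough numeric companion to Definition 2.3 — a simplification of ours that drops the Jordan-block weighting, ignores phase wraparound, and fixes the near-boundary band, so the quantifiers below only approximate the definition:

```python
import numpy as np

def phase_rank_at_least(A, r, psi, delta=0.05):
    """Crudely check whether every eigenvalue phase near the unit
    circle is matched (up to psi) by at least r near-boundary
    eigenvalues. Simplified from Definition 2.3: Jordan block sizes,
    phase wraparound, and the exact boundary region are ignored."""
    eigs = np.linalg.eigvals(A)
    near = eigs[np.abs(eigs) >= 1 - delta]
    for z in near:
        matches = sum(1 for w in near
                      if abs(np.angle(w) - np.angle(z)) <= psi)
        if matches < r:
            return False
    return True

# A real diagonalizable system: eigenvalue phases are 0 or pi, so each
# near-boundary phase is covered by at least one eigenvalue.
A = np.diag([0.99, -0.99])
```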

Proposition 2.3 (Bounds for Phase Rank)

Suppose that has phase rank , and maximum Jordan block size . Then, for any and and , it holds with probability under the stochastic noise setting of Definition 2.1 that

For systems where the eigendirections of can be ‘disentangled’ when observed by , we show in Appendix I.1.4 that one may decouple the eigenmodes of to only have to consider the phase rank restricted to smaller portions of the spectrum of . For example, if there exists a well-conditioned matrix for which and are diagonal with blocks , , respectively, then we only incur a penalty for the maximum of the phase ranks of and . Appendix I also gives refinements of Proposition 2.3 that take more granular aspects of into account. For constant phase rank , we have the bound , which roughly matches the dependence on term in the bounds in Oymak and Ozay (2018).

3 Oracle Inequality for Prefiltered Least Squares

The goal of this section is to present Theorem 3.3, a technical version of the general oracle inequality Theorem 1.1 for the estimator . To guide the proofs and intuition, we first consider the performance of the estimator which uses an arbitrary, fixed filter , rather than the data-dependent filter chosen by (1.4):


Note that for , is the OLS estimator. To analyze , we must define the associated error . Since are -adapted, is as well. Thus,

is a semi-parametric model describing the relationship between

and . We shall now establish two crucial properties which ensure estimation. The first shows that the matrix is well-conditioned with high probability.

Lemma 3.1 (Lemma C.2 in Oymak and Ozay (2018))

Define the event and number as

Then, if and the sample size satisfies for a sufficiently large , it holds that .

Inverting, we see that it suffices to take for a possibly larger constant to ensure the condition holds. The second property we shall use is Neyman orthogonality, which states that

This property is satisfied in our setting, since is adapted and . Chernozhukov et al. (2017) show that, under general conditions, Neyman orthogonality ensures consistency of least squares, and Krishnamurthy et al. (2018) recently demonstrated that this idea implies consistency in a time-series setting. Our first result adapts this argument to handle the fact that our regression variables have structure; namely, they are concatenated subsequences of . Denoting to be the matrix , we show the following bound.

Proposition 3.2 (Error Bound for Fixed Filter)

For any fixed filter , , and , it holds with probability at least that on ,


where and . In particular, the complement of (3.9) occurs with probability at most .

For a sense of scaling, observe that whenever the sequence is in magnitude on average, then is with high probability, yielding estimation rates of . Proof Sketch: Proposition 3.2 is derived as a special case of a more general result, Theorem E.1, which relies on self-normalized tail bounds for martingale sequences due to Abbasi-Yadkori et al. (2011). This theorem is similar in spirit to the tail bounds obtained by Krishnamurthy et al. (2018) for semi-parametric contextual bandits. The parameter arises from the use of these tools, but it can be chosen quite small due to the doubly-logarithmic dependence in . The novelty of our bound comes from a careful chaining argument in Appendix F.1 specific to semi-parametric regression with the concatenated sequence , based on the techniques in Krahmer et al. (2014). This bound yields a dependence on instead of the larger quantity . In service of this argument, we give a recipe for applying Talagrand’s chaining (Talagrand, 2014) to self-normalized martingale tail bounds in Appendix F, which may be of general interest.
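For reference, the scalar-noise form of the self-normalized bound of Abbasi-Yadkori et al. (2011, Theorem 1) states the following: if $x_s$ is $\mathcal{F}_{s-1}$-measurable and $\eta_s$ is $\mathcal{F}_s$-measurable and conditionally $\sigma$-subgaussian, then for any $V \succ 0$, with probability at least $1-\delta$, simultaneously for all $t$,

$$\Big\| \sum_{s=1}^{t} x_s \eta_s \Big\|_{\bar V_t^{-1}}^2 \;\le\; 2\sigma^2 \log\!\left( \frac{\det(\bar V_t)^{1/2}\,\det(V)^{-1/2}}{\delta} \right), \qquad \bar V_t = V + \sum_{s=1}^{t} x_s x_s^{\top}.$$

Theorem E.1 adapts this style of bound to the semi-parametric, concatenated-regressor setting described above.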

3.1 Statement of the Oracle Inequality

In Proposition 3.2, we bounded the error for the least squares estimate associated with a fixed predictor, , in terms of its associated error . Specifically, Proposition 3.2 implies that when grows as , decays as . In many cases, such as our setting of marginally stable systems, it is not possible to select a filter a priori in such a way that . Instead, in light of Proposition 3.2, one would like to choose the filter which minimizes , and pay for the magnitude of its associated error . Our main result of this section is an oracle inequality, proved in Appendix D, which shows that prefiltering the output sequence essentially accomplishes this goal.

Theorem 3.3 (PF-LS Oracle Inequality)

Let be as in (1.6), and define

Then for any , the following inequality holds with probability , provided satisfies the conditions of Lemma 3.1:

Here, the term corresponds to the error achieved by the best filter , and the term captures the “effective dimension” of the estimation problem, totaling the dimensions of the filter class, inputs , and observations . For the case of linear systems where and , in Appendix H.4 we give an algorithm for selecting the parameter which admits an oracle inequality, Proposition H.4. Moreover, the bound of Theorem 3.3 depends only logarithmically on , so may be taken to be very small. Proof Sketch: In proving Theorem 3.3, we first obtain an intermediate but analogous result in terms of the intermediate quantity , which we bound in Appendix D.2 by using KKT arguments and a variant of Proposition 3.2. To prove the intermediate result, Appendix D.1 considers “slices” along directions , and establishes uniform bounds with respect to a hierarchy of coverings of , each with a different scale and granularity, such that the bounds hold for each covering in the hierarchy simultaneously. This lets us tailor the granularity of the covering for each specific filter , which (a) yields bounds depending on the data-dependent errors , (b) ensures tighter control on filters with smaller norm, and (c) tolerates logarithmically more error as grows. Specializing the uniform bound to requires loose control of (hence the regularization in (1.4)), and the choice of trades off between the magnitude of and the quantity inside the logarithmic term. This is somewhat of an artifact of the proof and is unnecessary if is bounded from below, for example.

4 Proof Sketch for Bounding

Since , it suffices to exhibit some with reasonable operator norm for which grows as . To this end, we define the auxiliary signal and associated observation via

Here, is the state as if the noise and inputs had been “shut off” at time . We further define the features and decompose the error term as , where

Here, describes the approximation error incurred in predicting from the shut-off sequence , and accounts for the additional noise induced by the shut-off sequence. In Propositions 4.1 (resp. G.2), we prove bounds on the total contributions of these two errors under the stochastic (resp. adversarial) noise models outlined in Definition 2.1. For any fixed , the error terms do not grow with time, since they only account for the contribution of noise over time steps; thus, the contribution of to grows as . The terms , on the other hand, may grow with time because they depend on the state , which can grow in magnitude for marginally stable systems under consistent excitation. Fortunately, by the Cayley-Hamilton theorem, we observe that for large enough , there always exists a for which for all . Indeed, let denote the minimal polynomial of , and let . Then, if , a short computation shows that . Unfortunately, this requires to be at least the degree of the minimal polynomial of , which can be as large as in general. Moreover, the minimal polynomial may have exponentially large coefficients, which can amplify the effect of noise in and also affect the contribution of the regularization term . As introduced in Section 2.1, bounded phase rank ensures that there exists a smaller (both in length and in norm) filter than one would obtain by applying the minimal polynomial. Throughout, we shall consider the stochastic case; the adversarial case is similar and deferred to the Appendix. Applying Phase Rank: Proposition 2.3 above gave an explicit bound on in terms of the largest Jordan block of and the phase rank of the system. The formal proof of this bound is deferred to Appendix I. Here, we shall instead provide an informal intuition about why phase rank is a natural quantity. Consider the state transition matrix , and suppose . 
The first coordinate corresponds to a marginally unstable eigenvalue , and the second corresponds to an eigenvalue which is strictly stable by a small margin . Taking , we see the filter corresponding to the polynomial not only exactly cancels the first mode, it also downweights the second mode by a factor of , as . Moreover, we can express the second coordinate as . Due to the geometric decay, this sum roughly depends on only the last terms in the sum. Therefore, even for adversarial noise, should be at most on average. With this observation in hand,
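The mode-cancellation computation in this paragraph can be checked numerically; the phase and gap values below are arbitrary illustrative choices of ours:

```python
import numpy as np

theta, delta = 0.7, 0.05
lam1 = np.exp(1j * theta)                # marginally stable mode, |lam1| = 1
lam2 = (1 - delta) * np.exp(1j * theta)  # mode stable by a small margin delta

# Apply the length-2 filter with coefficients (1, -e^{i*theta}) to each
# mode sequence lam**s:  filtered_s = lam**s - e^{i*theta} * lam**(s-1).
s = 10
f1 = lam1**s - lam1 * lam1**(s - 1)      # first mode: cancelled exactly
f2 = lam2**s - lam1 * lam2**(s - 1)      # second mode: shrunk to delta*(1-delta)**(s-1)
```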