WaveFiltering
Wave-filtering for learning and controlling dynamical systems
We present an efficient and practical algorithm for the online prediction of discrete-time linear dynamical systems with a symmetric transition matrix. We circumvent the non-convex optimization problem using improper learning: carefully overparameterize the class of LDSs by a polylogarithmic factor, in exchange for convexity of the loss functions. From this arises a polynomial-time algorithm with a near-optimal regret guarantee, with an analogous sample complexity bound for agnostic learning. Our algorithm is based on a novel filtering technique, which may be of independent interest: we convolve the time series with the eigenvectors of a certain Hankel matrix.
Linear dynamical systems (LDSs) are a class of state space models which accurately model many phenomena in nature and engineering, and are applied ubiquitously in time-series analysis, robotics, econometrics, medicine, and meteorology. In this model, the time evolution of a system is explained by a linear map on a finite-dimensional hidden state, subject to disturbances from input and noise. Recent interest has focused on the effectiveness of recurrent neural networks (RNNs), a nonlinear variant of this idea, for modeling sequences such as audio signals and natural language.
Central to this field of study is the problem of system identification: given some sample trajectories, output the parameters for an LDS which generalize to predict unseen future data. Viewed directly, this is a non-convex optimization problem, for which efficient algorithms with theoretical guarantees are very difficult to obtain. A standard heuristic for this problem is expectation-maximization (EM), which can find poor local optima in theory and practice.
We consider a different approach: we formulate system identification as an online learning problem, in which neither the data nor predictions are assumed to arise from an LDS. Furthermore, we slightly overparameterize the class of predictors, yielding an online convex program amenable to efficient regret minimization. This carefully chosen relaxation, which is our main theoretical contribution, expands the dimension of the hypothesis class by only a polylogarithmic factor. This construction relies upon recent work on the spectral theory of Hankel matrices.
The result is a simple and practical algorithm for time-series prediction, which deviates significantly from existing methods. We coin the term wave-filtering for our method, in reference to our relaxation’s use of convolution by wave-shaped eigenvectors. We present experimental evidence on both toy data and a physical simulation, showing our method to be competitive in terms of predictive performance, more stable, and significantly faster than existing algorithms.
Consider a discrete-time linear dynamical system with inputs $x_t$, outputs $y_t$, and a latent state $h_t$, which can all be multi-dimensional. With noise vectors $\eta_t, \xi_t$, the system's time evolution is governed by the following equations:
$$h_{t+1} = A h_t + B x_t + \eta_t, \qquad y_t = C h_t + D x_t + \xi_t.$$
If the dynamics $(A, B, C, D)$ are known, then the Kalman filter [Kal60] is known to estimate the hidden state optimally under Gaussian noise, thereby producing optimal predictions of the system's response to any given input. However, this is rarely the case – indeed, real-world systems are seldom purely linear, and rarely are their evolution matrices known.
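For background, one predict/update step of the Kalman filter under these assumptions can be sketched as follows. This is a minimal version without the input terms $Bx_t$, $Dx_t$; the function name and variable names are ours, not the paper's:

```python
import numpy as np

def kalman_predict_update(A, C, Q, R, h_hat, P, y):
    """One Kalman filter step for h' = A h + noise, y = C h + noise.

    h_hat, P: current state estimate and its covariance.
    Q, R: process and observation noise covariances (assumed known).
    """
    # Predict: propagate the estimate and covariance through the dynamics.
    h_pred = A @ h_hat
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction with the new observation y.
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    h_new = h_pred + K @ (y - C @ h_pred)
    P_new = (np.eye(len(h_hat)) - K @ C) @ P_pred
    return h_new, P_new
```

Under Gaussian noise with known $(A, C, Q, R)$ this recursion is the optimal state estimator; the point of the paper is precisely that these matrices are usually unknown.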
We henceforth give a provable, efficient algorithm for the prediction of sequences arising from an unknown dynamical system as above, in which the transition matrix $A$ is symmetric. Our main theoretical contribution is a regret bound for this algorithm, giving nearly-optimal convergence to the lowest mean squared prediction error (MSE) realizable by a symmetric LDS model:
On an arbitrary sequence $\{(x_t, y_t)\}_{t=1}^T$, Algorithm 1 makes predictions $\hat y_t$ which satisfy
$$\mathrm{Regret}(T) = \tilde{O}(\sqrt{T}),$$
compared to the best predictions by a symmetric LDS, while running in polynomial time.
Note that the signal need not be generated by an LDS, and can even be adversarially chosen. In the less general batch (statistical) setting, we use the same techniques to obtain an analogous sample complexity bound for agnostic learning:
For any choice of $\varepsilon, \delta > 0$, given access to an arbitrary distribution $\mathcal{D}$ over training sequences, Algorithm 2, run on $\mathrm{poly}(1/\varepsilon, \log(1/\delta))$ i.i.d. sample trajectories from $\mathcal{D}$, outputs a predictor whose excess loss is at most $\varepsilon$, with probability at least $1 - \delta$, compared to the best symmetric LDS predictor, while running in polynomial time.
Typical regression-based methods require the LDS to be strictly stable, and degrade on ill-conditioned systems; they depend on a spectral radius parameter $\|A\| < 1$. Our proposed method of wave-filtering provably and empirically works even for the hardest case of $\|A\| = 1$. Our algorithm attains the first condition number-independent polynomial guarantees in terms of regret (equivalently, sample complexity) and running time for the MIMO setting. Interestingly, our algorithms never need to learn the hidden state, and our guarantees can be sharpened to handle the case when the dimensionality of the hidden state is infinite.
The modern setting for LDS arose in the seminal work of Kalman [Kal60], who introduced the Kalman filter as a recursive least-squares solution for maximum likelihood estimation (MLE) of Gaussian perturbations to the system. The framework and filtering algorithm have proven to be a mainstay in control theory and time-series analysis; indeed, the term Kalman filter model is often used interchangeably with LDS. We refer the reader to the classic survey [Lju98], and the extensive overview of recent literature in [HMR16].
Ghahramani and Roweis [RG99] suggest using the EM algorithm to learn the parameters of an LDS. This approach, which directly tackles the non-convex problem, is widely used in practice [Mar10a]. However, it remains a long-standing challenge to characterize the theoretical guarantees afforded by EM. We find that it is easy to produce cases where EM fails to identify the correct system.
In a recent result of [HMR16], it is shown for the first time that for a restricted class of systems, gradient descent (also widely used in practice, perhaps better known in this setting as backpropagation) guarantees polynomial convergence rates and sample complexity in the batch setting. Their result applies essentially only to the SISO case (vs. multi-dimensional for us), depends polynomially on the spectral gap (as opposed to no dependence for us), and requires the signal to be created by an LDS (vs. arbitrary for us).
Many different settings have been considered, in which the definition of an LDS takes on many variants. We are interested in discrete time-invariant MIMO (multiple input, multiple output) systems with a finite-dimensional hidden state. (We assume finite dimension for simplicity of presentation; however, it will be evident that the hidden-state dimension has no role in our algorithm, and enters the regret bound only through the norms of the system matrices.) Formally, our model is given as follows:
A linear dynamical system (LDS) is a map from a sequence of input vectors $x_1, \dots, x_T \in \mathbb{R}^n$ to output (response) vectors $y_1, \dots, y_T \in \mathbb{R}^m$ of the form
$$h_{t+1} = A h_t + B x_t + \eta_t, \tag{1}$$
$$y_t = C h_t + D x_t + \xi_t, \tag{2}$$
where $h_0, \dots, h_T \in \mathbb{R}^d$ is a sequence of hidden states, $A, B, C, D$ are matrices of appropriate dimension, and $\eta_t, \xi_t$ are (possibly stochastic) noise vectors.
Unrolling this recursive definition gives the impulse response function, which uniquely determines the LDS. For notational convenience, for invalid indices $t \le 0$, we define $x_t$, $y_t$, and $h_t$ to be the zero vector of appropriate dimension. Then, we have:
$$y_t = \sum_{i=1}^{t} C A^{i-1} \left( B x_{t-i} + \eta_{t-i} \right) + C A^t h_0 + D x_t + \xi_t. \tag{3}$$
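As a concrete reference point, the recurrence above can be simulated directly. The following minimal sketch (function name and conventions ours) rolls out an LDS from a zero initial state:

```python
import numpy as np

def simulate_lds(A, B, C, D, xs, noise_std=0.0, rng=None):
    """Roll out y_t = C h_t + D x_t, then h_{t+1} = A h_t + B x_t.

    xs: array of shape (T, n) of inputs; returns outputs of shape (T, m).
    Gaussian noise of scale noise_std is added when noise_std > 0.
    """
    rng = rng or np.random.default_rng(0)
    d, m = A.shape[0], C.shape[0]
    h = np.zeros(d)                      # initial hidden state h_0 = 0
    ys = []
    for x in xs:
        y = C @ h + D @ x + noise_std * rng.standard_normal(m)
        ys.append(y)
        h = A @ h + B @ x + noise_std * rng.standard_normal(d)
    return np.array(ys)
```

With $A = 0$ this reduces to $y_t = CB\,x_{t-1} + D x_t$, a quick consistency check against Equation (3).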
We will consider the (discrete) time derivative of the impulse response function, given by expanding $y_t - y_{t-1}$ by Equation (3). For the rest of this paper, we focus our attention on systems subject to the following restrictions:
The LDS is Lyapunov stable: $\|A\|_{\mathrm{op}} \le 1$, where $\|\cdot\|_{\mathrm{op}}$ denotes the operator (a.k.a. spectral) norm.
The transition matrix $A$ is symmetric and positive semidefinite. (The psd constraint on $A$ can be removed by augmenting the inputs with extra coordinates; we omit this for simplicity of presentation.)
The first assumption is standard: when the hidden state is allowed to blow up exponentially, fine-grained prediction is futile. In fact, many algorithms only work when $\|A\|_{\mathrm{op}}$ is bounded away from $1$, so that the effect of any particular $x_t$ on the hidden state (and thus the output) dissipates exponentially. We do not require this stronger assumption.
We take a moment to justify assumption (ii), and why this class of systems is still expressive and useful. First, symmetric LDSs constitute a natural class of linearly-observable, linearly-controllable systems with dissipating hidden states (for example, physical systems with friction or heat diffusion). Second, this constraint has been used successfully for video classification and tactile recognition tasks [HSC16]. Interestingly, though our theorems require symmetric $A$, our algorithms appear to tolerate some non-symmetric (and even nonlinear) transitions in practice.

A natural formulation of system identification is that of online sequence prediction. At each time step $t$, an online learner is given an input $x_t$, and must return a predicted output $\hat y_t$. Then, the true response $y_t$ is observed, and the predictor suffers a squared-norm loss of $\|y_t - \hat y_t\|^2$. Over $T$ rounds, the goal is to predict as accurately as the best LDS in hindsight.
Note that the learner is permitted to access the history of observed responses $y_1, \dots, y_{t-1}$. Even in the presence of statistical (non-adversarial) noise, the fixed maximum-likelihood sequence produced by a single choice of LDS parameters $\Theta$ will accumulate error linearly as $\Omega(T)$. Thus, we measure performance against a more powerful comparator, which fixes LDS parameters $\Theta$, and predicts $y_t$ by the previous response $y_{t-1}$ plus the derivative of the impulse response function of $\Theta$ at time $t$.
We will exhibit an online algorithm that can compete against the best $\Theta$ in this setting. Let $\hat y_1, \dots, \hat y_T$ be the predictions made by an online learner, and let $y_1^*, \dots, y_T^*$ be the sequence of predictions, realized by a chosen setting of LDS parameters $\Theta^*$, which minimize total squared error. Then, we define regret by the difference of total squared-error losses:
$$\mathrm{Regret}(T) = \sum_{t=1}^T \|y_t - \hat y_t\|^2 - \sum_{t=1}^T \|y_t - y_t^*\|^2.$$
This setup fits into the standard setting of online convex optimization (in which a sublinear regret bound implies convergence towards optimal predictions), save for the fact that the loss functions are non-convex in the system parameters. Also, note that a randomized construction (set all $x_t = 0$, and let the $y_t$ be i.i.d. Bernoulli random variables) yields a lower bound of $\Omega(\sqrt{T})$ for any online algorithm; this is a standard construction, see, e.g., Theorem 3.2 in [Haz16].

To quantify regret bounds, we must state our scaling assumptions on the (otherwise adversarial) input and output sequences. We assume that the inputs are bounded: $\|x_t\| \le R_x$. Also, we assume that the output signal is Lipschitz in time: $\|y_t - y_{t-1}\| \le L_y$. The latter assumption exists to preclude pathological inputs where an online learner is forced to incur arbitrarily large regret. For a true noiseless LDS, this Lipschitz constant is not too large; see Lemma F.5 in the appendix.
We note that an optimal regret bound can be trivially achieved in this setting by algorithms such as Hedge [LW94], using an exponential-sized discretization of all possible LDS parameters; this is the online equivalent of brute-force grid search. Strikingly, our algorithms achieve essentially the same regret bound, but run in polynomial time.
Much work in system identification, including the EM method, is concerned with explicitly finding the LDS parameters $\Theta = (A, B, C, D)$ which best explain the data. However, it is evident from Equation (3) that the matrix powers $A^i$ cause the least-squares (or any other) loss to be non-convex in $\Theta$. Many methods used in practice, including EM and subspace identification, heuristically estimate each hidden state $h_t$, after which estimating the parameters becomes a convex linear regression problem. However, this first step is far from guaranteed to work in theory or practice.
Instead, we follow the paradigm of improper learning: in order to predict sequences as accurately as the best possible LDS $\Theta^*$, one need not predict strictly from an LDS. The central driver of our algorithms is the construction of a slightly larger hypothesis class $\hat{\mathcal H}$, for which the best predictor in $\hat{\mathcal H}$ is nearly as good as $\Theta^*$. Furthermore, we construct $\hat{\mathcal H}$ so that the loss functions are convex under this new parameterization. From this will follow our efficient online algorithm.
As a warmup example, consider the following overparameterization: pick some time window $\tau < T$, and let the predictions $\hat y_t$ be linear in the concatenation $[x_t, x_{t-1}, \ldots, x_{t-\tau}]$. When $\|A\|_{\mathrm{op}}$ is bounded away from $1$, this is a sound assumption. (This assumption is used in autoregressive models; see Section 6 of [HMR16] for a theoretical treatment.) However, in general, this approximation is doomed to either truncate longer-term input-output dependences (short $\tau$), or suffer from overfitting (long $\tau$). Our main theorem uses an overparameterization whose approximation factor is independent of $\|A\|_{\mathrm{op}}$, and whose sample complexity scales only as $\mathrm{polylog}(T)$.
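The warmup predictor can be fit by ordinary least squares; the following sketch (function name and zero-padding convention ours) makes the truncation trade-off concrete:

```python
import numpy as np

def fit_windowed_regression(xs, ys, k):
    """Fit y_t ≈ M [x_t; x_{t-1}; ...; x_{t-k+1}] by least squares.

    A truncated-memory baseline: sound when the system forgets
    quickly, but forced to trade truncation error (small k)
    against overfitting (large k).
    """
    T, n = xs.shape
    feats = np.zeros((T, k * n))
    for t in range(T):
        for j in range(k):
            if t - j >= 0:                      # zero-pad before time 0
                feats[t, j * n:(j + 1) * n] = xs[t - j]
    M, *_ = np.linalg.lstsq(feats, ys, rcond=None)
    return M, feats
```

When the data really is a length-$k$ moving average of the inputs, this recovers it exactly; longer-memory dependences are silently truncated.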
Our analysis relies crucially on the spectrum of a certain Hankel matrix, a square matrix whose anti-diagonal stripes have equal entries (i.e. $M_{ij}$ is a function of $i + j$). An important example is the Hilbert matrix $H_n$, the $n$-by-$n$ matrix whose $(i, j)$-th entry is $\frac{1}{i + j - 1}$. For example,
$$H_3 = \begin{pmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{3} \\[2pt] \tfrac{1}{2} & \tfrac{1}{3} & \tfrac{1}{4} \\[2pt] \tfrac{1}{3} & \tfrac{1}{4} & \tfrac{1}{5} \end{pmatrix}.$$
This and related matrices have been studied under various lenses for more than a century: see, e.g., [Hil94, Cho83]. A basic fact is that $H_n$ is a positive definite matrix for every $n$. The property we are most interested in is that the spectrum of a positive semidefinite Hankel matrix decays exponentially, a difficult result derived in [BT16] via Zolotarev rational approximations. We state these technical bounds in Appendix E.
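The exponential spectral decay is easy to observe numerically; a small sketch using the Hilbert matrix as the example Hankel matrix:

```python
import numpy as np

# The n-by-n Hilbert matrix H_ij = 1/(i + j - 1) (1-indexed): a
# positive definite Hankel matrix with exponentially decaying spectrum.
n = 20
H = 1.0 / (np.add.outer(np.arange(1, n + 1), np.arange(1, n + 1)) - 1.0)
eigs = np.sort(np.linalg.eigvalsh(H))[::-1]     # descending order

# The 10th eigenvalue is already vanishingly small relative to the top.
ratio = eigs[9] / eigs[0]
```

This decay is exactly why a polylogarithmic number of filters suffices: almost all of the spectral mass sits in the top few eigenvectors.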
Our online algorithm (Algorithm 1) runs online projected gradient descent [Zin03] on the squared loss $\|y_t - \hat y_t\|^2$. Here, each decision variable is a matrix $M_t$ specifying a linear map from featurized inputs $\tilde X_t$ to predictions $\hat y_t$. Specifically, after choosing a certain bank of $k$ filters $\{\phi_j\}$, $\tilde X_t$ consists of convolutions of the input time series with each $\phi_j$ (scaled by certain constants), along with $x_{t-1}$, $x_t$, and $y_{t-1}$. The number of filters $k$ will turn out to be polylogarithmic in $T$.
The filters $\{\phi_j\}$ and scaling factors are given by the top eigenvectors and eigenvalues of the Hankel matrix $Z_T$, whose entries are given by
$$ (Z_T)_{ij} = \frac{2}{(i+j)^3 - (i+j)}. $$

In the language of Section 2.3, one should think of each $M_t$ as arising from an $O(k)$-dimensional hypothesis class $\hat{\mathcal H}$, which replaces the original class of LDS parameters. Theorem 3 gives the key fact that $\hat{\mathcal H}$ approximately contains the class of LDS predictors.
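A sketch of the filter computation, assuming the Hankel entries $2/((i+j)^3 - (i+j))$ stated above (the function name is ours; a practical implementation would use the ODE-based approximation discussed in Appendix B):

```python
import numpy as np

def wave_filters(T, k):
    """Top-k eigenvalues/eigenvectors of the T-by-T Hankel matrix
    with entries Z_ij = 2 / ((i + j)^3 - (i + j)), 1-indexed.

    The eigenvectors are the convolutional filters; the algorithm
    scales features by powers of the eigenvalues.
    """
    idx = np.add.outer(np.arange(1, T + 1), np.arange(1, T + 1))  # i + j
    Z = 2.0 / (idx.astype(float) ** 3 - idx)
    eigvals, eigvecs = np.linalg.eigh(Z)            # ascending order
    top = np.argsort(eigvals)[::-1][:k]
    return eigvals[top], eigvecs[:, top]            # (k,), (T, k)
```

The filters are data-independent: they depend only on the horizon $T$, so they can be precomputed once and reused.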
In Section 4, we provide the precise statement and proof of Theorem 1, the main regret bound for Algorithm 1, with some technical details deferred to the appendix. We also obtain analogous sample complexity results for batch learning; however, on account of some definitional subtleties, we defer all discussion of the offline case, including the statement and proof of Theorem 2, to Appendix A.
We make one final interesting note here, from which the name wave-filtering arises: when plotted coordinate-wise, our filters look like the vibrational modes of an inhomogeneous spring (see Figure 1). We provide some insight on this phenomenon (along with some other implementation concerns) in Appendix B. Succinctly: in the scaling limit, $Z_T$ commutes with a certain second-order Sturm–Liouville differential operator $\mathcal D$. This allows us to approximate the filters with eigenfunctions of $\mathcal D$, using efficient numerical ODE solvers.

We first state the full form of the regret bound achieved by Algorithm 1. (Actually, for a slightly tighter proof, we analyze a restriction of the algorithm which does not learn the portion of $M$ acting on $y_{t-1}$, instead always choosing the identity matrix for that block.)
On any sequence $\{(x_t, y_t)\}_{t=1}^T$, Algorithm 1, with a choice of $k = \Theta(\mathrm{polylog}\, T)$ filters, learning rate $\eta$, and decision-set radius $R_M$ as specified, achieves regret
$$\mathrm{Regret}(T) \;\le\; \tilde O\!\left(\sqrt{T}\right) \cdot \mathrm{poly}\!\left(R_\Theta, R_x, L_y\right),$$
competing with LDS predictors $\Theta = (A, B, C, D)$ with $0 \preceq A \preceq I$ and parameter norms bounded by $R_\Theta$.
Note that the dimensions of the system do not appear explicitly in this bound, though they typically factor into the norm bound $R_\Theta$. In Section 4.1, we state and prove Theorem 3, the convex relaxation guarantee for the filters, which may be of independent interest. This allows us to approximate the optimal LDS in hindsight (the regret comparator) by the loss-minimizing matrix $M^*$. In Section 4.2, we complete the regret analysis using Theorem 3, along with bounds on the diameter and gradient, to conclude Theorem 1.
Since the batch analogue is less general (and uses the same ideas), we defer discussion of Algorithm 2 and Theorem 2 to Appendix A.
Assume for now that $h_0 = 0$; we will remove this assumption at the end, and see that the regret bound is asymptotically the same. Recall (from Section 2.2) that we measure regret compared to predictions obtained by adding the derivative of the impulse response function of an LDS to $y_{t-1}$. Our approximation theorem states that for any LDS $\Theta$, there is some $M \in \hat{\mathcal H}$ which produces approximately the same predictions. Formally:
Let $y_1^*, \dots, y_T^*$ be the online predictions made by an LDS $\Theta = (A, B, C, D)$ satisfying the assumptions of Section 2.1. Then, for any $\varepsilon > 0$, with a choice of $k = \Theta\!\left(\mathrm{polylog}(T/\varepsilon)\right)$, there exists an $M \in \hat{\mathcal H}$ such that
$$\sum_{t=1}^T \big\| M \tilde X_t - y_t^* \big\|^2 \;\le\; \varepsilon.$$
Here, $\tilde X_t$ and $\hat{\mathcal H}$ are defined as in Algorithm 1 (noting that $\tilde X_t$ includes the previous ground truth $y_{t-1}$).
We construct this mapping explicitly. Write $M$ as the block matrix
$$M = \begin{pmatrix} M^{(1)} & \cdots & M^{(k)} & M^{(x')} & M^{(x)} & M^{(y)} \end{pmatrix},$$
where the blocks' dimensions are chosen to align with $\tilde X_t$, the concatenated vector
$$\tilde X_t = \begin{pmatrix} \tilde X_t^{(1)}; & \cdots; & \tilde X_t^{(k)}; & x_{t-1}; & x_t; & y_{t-1} \end{pmatrix},$$
so that the prediction is the block matrix-vector product
$$\hat y_t = \sum_{j=1}^k M^{(j)} \tilde X_t^{(j)} + M^{(x')} x_{t-1} + M^{(x)} x_t + M^{(y)} y_{t-1}.$$
Without loss of generality, assume that $A$ is diagonal, with entries $\{\alpha_l\}_{l=1}^d$. (Write the eigendecomposition $A = U \Lambda U^\top$; then, the LDS with parameters $(\Lambda, U^\top B, C U, D)$ makes the same predictions as the original, with $\Lambda$ diagonal.) Let $b_l$ be the $l$-th row of $B$, and $c_l$ the $l$-th column of $C$. Also, we define a continuous family of vectors $\mu(\alpha) \in \mathbb{R}^T$, with entries $\mu(\alpha)(i) = (\alpha - 1)\,\alpha^{i-1}$. Then, our construction is as follows:

- $M^{(j)} \propto \sum_{l=1}^{d} \langle \mu(\alpha_l), \phi_j \rangle \, c_l b_l^\top$, for each $1 \le j \le k$ (up to the eigenvalue-dependent feature scaling).
- $M^{(x')} = CB - D$, $M^{(x)} = D$, and $M^{(y)} = I_m$.
Below, we give the main ideas for why this works, leaving the full proof to Appendix C.
Since $M^{(y)}$ is the identity, the online learner's task is to predict the differences $y_t - y_{t-1}$ as well as the comparator does, i.e., the derivative of the impulse response function, which we write here:
$$y_t - y_{t-1} = D x_t + (CB - D)\, x_{t-1} + \sum_{i=1}^{t-1} \sum_{l=1}^{d} \mu(\alpha_l)(i)\, \langle b_l, x_{t-1-i} \rangle\, c_l. \tag{4}$$
Notice that the inner sum is an inner product between each coordinate of the past inputs with $\mu(\alpha_l)$ (or a convolution, viewed across the entire time horizon). The crux of our proof is that one can approximate $\mu(\alpha)$ using a linear combination of the filters $\{\phi_j\}$. Writing $\mu_\alpha := \mu(\alpha)$ for short, notice that
$$\int_0^1 \mu_\alpha \mu_\alpha^\top \, d\alpha = Z_T,$$
since the $(i, j)$ entry of the RHS is
$$\int_0^1 (\alpha - 1)^2 \alpha^{i + j - 2} \, d\alpha = \frac{2}{(i+j)^3 - (i+j)}.$$
What follows is a spectral bound for reconstruction error, relying on the low approximate rank of $Z_T$:

Choose any $\alpha \in [0, 1]$. Let $\tilde\mu_\alpha$ be the projection of $\mu_\alpha$ onto the $k$-dimensional subspace of $\mathbb{R}^T$ spanned by $\{\phi_j\}_{j=1}^k$. Then,
$$\|\mu_\alpha - \tilde\mu_\alpha\|_2 \;\le\; O\!\left(c_0^{-k/\log T}\right),$$
for an absolute constant $c_0 > 1$.
By construction of $M$, the prediction $M \tilde X_t$ replaces each $\mu(\alpha_l)$ in Equation (4) with its approximation $\tilde\mu(\alpha_l)$. Hence we conclude that the prediction error at each step is controlled by residual vectors arising from discarding the subspace of dimension $T - k$. Theorem 3 follows by showing that these residuals are small, using Lemma 4.1: it turns out that $\|\mu_\alpha - \tilde\mu_\alpha\|$ is exponentially small in $k / \log T$, which implies the theorem. ∎
Let $\Theta^*$ denote the best LDS predictor, and let $M^*$ be its image under the map from Theorem 3, so that the total squared error of $M^*$'s predictions is within $\varepsilon$ of that of $\Theta^*$. Notice that the loss functions $\|y_t - M \tilde X_t\|^2$ are quadratic in $M$, and thus convex. Algorithm 1 runs online gradient descent [Zin03] on these loss functions, with decision set $\mathcal K = \{ M : \|M\|_F \le R_M \}$. Let $D_{\mathcal K}$ be the diameter of $\mathcal K$, and $G$ be the largest norm of a gradient. We can invoke the classic regret bound:

Online gradient descent, using learning rate $\eta_t = \frac{D_{\mathcal K}}{G \sqrt{t}}$, has regret
$$\mathrm{Regret}_{\mathrm{OGD}}(T) \;\le\; \tfrac{3}{2}\, G D_{\mathcal K} \sqrt{T}.$$

To finish, it remains to show that $D_{\mathcal K}$ and $G$ are small. In particular, since the gradients contain convolutions of the input by the scaled filters (which are not unit vectors), special care must be taken to ensure that these do not grow too quickly. These bounds are shown in Section D.2, giving the correct regret of Algorithm 1 in comparison with the comparator $M^*$. By Theorem 3, $M^*$ competes arbitrarily closely with the best LDS in hindsight, concluding the theorem.
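One projected-gradient step of the kind Algorithm 1 takes can be sketched as follows (Euclidean projection onto a Frobenius-norm ball; the function name and the specific decision set are illustrative assumptions):

```python
import numpy as np

def ogd_step(M, grad, eta, radius):
    """One step of online projected gradient descent on matrices:
    a gradient step, then projection onto {M : ||M||_F <= radius}."""
    M = M - eta * grad
    norm = np.linalg.norm(M)          # Frobenius norm
    if norm > radius:
        M = M * (radius / norm)       # rescale back onto the ball
    return M
```

On a fixed quadratic loss this recursion converges to the minimizer; in the online setting the same update is simply applied to each round's loss in turn.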
Finally, we discuss why it is possible to relax the earlier assumption on the initial hidden state $h_0$. Intuitively, as more of the ground truth responses are revealed, the largest possible effect of the initial state decays. Concretely, in Section D.4, we prove that a comparator who chooses a nonzero $h_0$ can only increase the regret by an additive term that is subsumed by the main bound in the online setting.
In this section, to highlight the appeal of our provable method, we exhibit two minimalistic cases where traditional methods for system identification fail, while ours successfully learns the system. Finally, we note empirically that our method seems not to degrade in practice on certain well-behaved nonlinear systems. In each case, we use a small fixed bank of filters, and a regularized follow-the-leader variant of Algorithm 1 (see Appendix B.2).
We construct two difficult systems, on which we run either EM or subspace identification (SSID; specifically, we use "Deterministic Algorithm 1" from page 52 of [VODM12]), followed by Kalman filtering to obtain predictions. Note that our method runs significantly faster than this traditional pipeline.
In the first example (Figure 2(a), left), we have a SISO system (a single input and output channel) whose system matrices have i.i.d. Gaussian entries, driven by Gaussian noise. Most importantly, the transition matrix $A$ is ill-conditioned, so that there are long-term dependences between input and output. Observe that although EM and SSID both find reasonable guesses for the system's dynamics, these turn out to be local optima. Our method learns to predict as well as the best possible LDS.
The second example (Figure 2(a), right) is a MIMO system with multiple input and output channels, also with Gaussian noise. The transition matrix has a diverse spectrum, and the observation matrix has i.i.d. Gaussian entries. The inputs are random block impulses. This system identification problem is high-dimensional and non-convex; it is thus no surprise that EM and SSID consistently fail to converge.
We remark that although our algorithm has provable regret guarantees only for LDSs with symmetric transition matrices, it appears in experiments to succeed in learning some non-symmetric (even nonlinear) systems in practice, much like the unscented Kalman filter [WVDM00]. In Figure 2(b), we provide a typical learning trajectory for a forced pendulum, under Gaussian noise and random block impulses. Physical systems like this are widely considered in control and robotics, suggesting possible real-world applicability for our method.
We have proposed a novel approach for provably and efficiently learning linear dynamical systems. Our online wave-filtering algorithm attains near-optimal regret in theory; and experimentally outperforms traditional system identification in both prediction quality and running time. Furthermore, we have introduced a “spectral filtering” technique for convex relaxation, which uses convolutions by eigenvectors of a Hankel matrix. We hope that this theoretical tool will be useful in tackling more general cases, as well as other non-convex learning problems.
We thank Holden Lee and Yi Zhang for helpful discussions. We are especially grateful to Holden for a thorough reading of our manuscript, and for pointing out a way to tighten the result in Lemma C.1.
In Appendix B, we discuss some variants of our online algorithm, and offer some tips for implementation. We also provide discussion on the connection of our filters to eigenfunctions of a certain differential operator.
In Appendix F, we verify some easy-to-prove properties of the important vector family $\mu(\alpha)$, for sake of completeness.
The online prediction setting is sensitive to permutation of the time series: that is, the same LDS does not in general map a permuted input sequence to the correspondingly permuted outputs. As such, one must take care when defining the batch case: the output time series (and thus, loss functions) are correlated, so it is not meaningful to assume that they are i.i.d. samples from a distribution. Thus, our online regret bound, which concerns a single episode, does not translate directly. However, our convex relaxation technique still allows us to do efficient improper learning with least-squares regression, giving interesting and novel statistical guarantees. In this section, we provide two possible formulations of the batch setting, along with accompanying theorems.
In both cases, it is most natural to fix an episode length $T$, and consider a rollout of the system to be a single example. For short, let $X$ denote the concatenated vector of inputs for a single example, and $Y$ the concatenated responses. The batch formulation is to learn the dynamics of the system using samples $\{(X^{(i)}, Y^{(i)})\}_{i=1}^N$. Recall that the samples satisfy the boundedness condition $\|x_t\| \le R_x$ and the Lipschitz condition $\|y_t - y_{t-1}\| \le L_y$.
Unlike in the online setting, it will be less confusing in the batch setting to measure the mean squared error of predictions, rather than the total squared error. Thus, in this section, the loss $\ell$ will always refer to mean squared error. As well, to follow statistical learning conventions (for ease of reading), we use $h$ to denote a hypothesis (an LDS) instead of $\Theta$; this should not be confused with the hidden state $h_t$.
Throughout this subsection, assume that $h_0 = 0$.
As noted, the sequential prediction algorithm can be restricted so as to never make updates to the submatrix $M^{(y)}$, keeping it to be the identity matrix. Notice that all other features in $\tilde X_t$ consist of inputs and their convolutions. In other words, we can take the view that the matrix $M$ can be used to predict the differences between successive responses, as a function of the entire (aligned) input time series $X$.
Thus, we can formulate a direct analogue for the online algorithm: learn the mapping from an input time series $X$ to the differences $\Delta Y$, the concatenation of all $y_t - y_{t-1}$. For this, we can use Theorem 3 (the approximation result) directly, and obtain an improper agnostic learning guarantee.
Specifically, let $\mathcal H$ be a subset of the hypothesis class of LDS parameters $\Theta$, subject to the norm bound $R_\Theta$, and choose any approximation tolerance $\varepsilon > 0$. (The distinction between measuring total vs. mean squared error is hidden in the constant in front of the $\log T$ when choosing the number of filters $k$.) Then, Theorem 3 states that choosing $k = \Theta(\mathrm{polylog}(T/\varepsilon))$ ensures the $\varepsilon$-approximate relaxation property. In the language of the batch setting: for each $h \in \mathcal H$ which predicts on the sample with a mean squared error $\ell(h)$, there is some $M \in \hat{\mathcal H}$ so that
$$\hat\ell(M) \;\le\; \ell(h) + \varepsilon.$$
The choice of batch algorithm is clear, in order to mimic Algorithm 1: run least-squares regression from the featurized inputs $\tilde X$ to the differences $\Delta Y$, where $\tilde X$ is the same featurization of the inputs as used in the online algorithm. We describe this procedure fully in Algorithm 2.
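A minimal sketch of this regression step (we add a small ridge term for numerical stability, which is our own choice, not part of Algorithm 2):

```python
import numpy as np

def batch_regression(features, targets, reg=1e-6):
    """Least squares from featurized inputs to response differences.

    features: (N, p) rows of featurized inputs; targets: (N, m).
    Solves (F^T F + reg I) M = F^T Y; predict with features @ M.
    """
    p = features.shape[1]
    M = np.linalg.solve(features.T @ features + reg * np.eye(p),
                        features.T @ targets)
    return M
```

When the targets are exactly linear in the features, the recovered map matches the ground truth up to the (negligible) ridge bias.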
By definition, Algorithm 2 minimizes the empirical MSE loss on the samples; as such, we can derive a PAC-learning bound for regression. We begin with some definitions and assumptions, so that we can state the theorem.
As in the statement of the online algorithm, as a soft dimensionality restriction, we constrain the comparator class $\mathcal H$ to contain LDSs with parameters $\Theta$ satisfying the norm bounds of Theorem 1. For an empirical sample set $S$, let $\hat\ell_S(\cdot)$ denote the empirical MSE loss. Similarly, for a distribution $\mathcal D$, let $\ell_{\mathcal D}(\cdot)$ denote the population MSE loss.
Then, we are able to obtain a sample complexity bound:

Choose any $\varepsilon, \delta > 0$. Let $S$ be a set of $N$ i.i.d. training samples from a distribution $\mathcal D$. Let $M$ be the output of Algorithm 2, with a choice of $k = \Theta(\mathrm{polylog}(T/\varepsilon))$. Let $h^* \in \mathcal H$ be the true loss minimizer. Then, with probability at least $1 - \delta$, it holds that
$$\ell_{\mathcal D}(M) \;\le\; \ell_{\mathcal D}(h^*) + \varepsilon + O\!\left(\sqrt{\frac{\log(1/\delta)}{N}}\right) \cdot \mathrm{poly}(R_\Theta, R_x, L_y, \log T).$$

Lemma D.1 shows that we can restrict $\hat{\mathcal H}$ by a Frobenius norm bound $\|M\|_F \le R_M$.
Thus, the empirical Rademacher complexity of $\hat{\mathcal H}$ on $N$ samples, with this restriction, satisfies
$$\mathcal{R}_N(\hat{\mathcal H}) \;\le\; O\!\left(\frac{R_M}{\sqrt{N}}\right) \cdot \mathrm{poly}(R_x, \log T).$$
Also, no single prediction error (and thus neither the empirical nor population loss) will exceed an upper bound polynomial in $R_M$, $R_x$, and $L_y$. Finally, the loss is $G$-Lipschitz in the matrix $M$, where $G$ is the same upper bound for the gradient as mentioned in Section 4.2; Lemma D.5 states that this is bounded by $\mathrm{poly}(R_\Theta, R_x, L_y, \log T)$.
With all of these facts in hand, a standard Rademacher complexity-dependent generalization bound holds in the improper hypothesis class $\hat{\mathcal H}$ (see, e.g., [BM02]):

With probability at least $1 - \delta$, it holds that
$$\ell_{\mathcal D}(M) \;\le\; \hat\ell_S(M) + 2 G\, \mathcal{R}_N(\hat{\mathcal H}) + O\!\left(\sqrt{\frac{\log(1/\delta)}{N}}\right).$$
A natural question is whether there exists a batch learning algorithm that can use $X$ to predict the responses $Y$ directly, as opposed to the differences $\Delta Y$. This is possible in the regime of low noise: if one has predictions of $\Delta Y$ that are correct up to small MSE, an easy solution is to integrate and obtain predictions for $Y$; however, the errors accumulate by a factor polynomial in $T$. The same agnostic learning guarantee thus costs a rather dramatic $\mathrm{poly}(T)$ factor in sample complexity.
In the regime of low noise, an analogue of our approximation theorem (Theorem 3) is powerful enough to guarantee low error. For convenience and concreteness, we record this here:
Let $h$ be an LDS specified by parameters $\Theta = (A, B, C, D)$, with $0 \preceq A \preceq I$ and norms bounded by $R_\Theta$. Suppose $h$ takes an input sequence $X = (x_1, \dots, x_T)$, and produces output sequence $Y = (y_1, \dots, y_T)$, assuming all noise vectors are 0. Then, for any $\varepsilon > 0$, with a choice of $k = \Theta(\mathrm{polylog}(T/\varepsilon))$, there exists an $M \in \hat{\mathcal H}$ such that
$$\sum_{t=1}^T \|M \tilde X_t - y_t\|^2 \;\le\; \varepsilon,$$
where $\tilde X_t$ is defined as in Algorithm 1, without the $y_{t-1}$ entries.

This fact follows from Theorem 3, setting $\varepsilon / T$ as the desired precision; the cost of this additional precision is only a constant factor in $k$. Furthermore, this is subject to the same Frobenius norm constraint as in Lemma D.1.
Alternatively, in the realizable case (when the samples from $\mathcal D$ are generated by an LDS, possibly with small noise), one can invoke a similar approximate relaxation theorem as Theorem 3. The filters become the eigenvectors of the Hilbert matrix $H_T$, the matrix whose $(i, j)$-th entry is $\frac{1}{i + j - 1}$. This matrix exhibits the same spectral decay as $Z_T$; see [BT16] for precise statements. The proof follows the sketch from Section 4.1, approximating the powers of $A$ by a spectral truncation of a different curve, with entries $\alpha \mapsto \alpha^{i-1}$, sometimes called the moment curve. The Hilbert matrix arises from taking the second moment matrix of the uniform distribution on this curve.
However, we find that this approximation guarantee is insufficient to show the strong regret and agnostic learning bounds we exhibit for learning the derivative of the impulse response function. Nonetheless, we find that regression with these filters works well in practice, even interchangeably in the online algorithm; see Section B.1 for some intuition.
In either of the above settings, it is not quite possible to apply the same argument as in the online setting for pretending that the initial hidden state is zero. When this assumption is removed, the quality of the convex relaxation degrades by an additive term depending on the initial state; see Section D.4. This does not matter much for the regret bound, because it is subsumed by the worst-case regret of online gradient descent.
However, in the batch setting, we take the view of a fixed episode length $T$ and an increasing number of samples $N$, so the contribution of the initial state is no longer asymptotically negligible. In other words, this additive approximation error hinders us from driving the excess loss $\varepsilon$ arbitrarily close to zero, no matter how many filters are selected. In settings where $T$ is large enough, one may find this acceptable.
We present an augmented learning problem in which we can predict as well as an LDS: the initial hidden state is provided in each sample, up to an arbitrary linear transformation. Thus, each sample takes the form $(\tilde h_0, X, Y)$, and it is guaranteed that $h_0 = \Phi \tilde h_0$ for each sample, for a fixed matrix $\Phi$. This $\Phi$ must be well-conditioned for the problem to remain well-posed: our knowledge of $h_0$ should be in the same dynamic range as the ground truth. Concretely, we should assume that the condition number of $\Phi$ is bounded.

The construction is as follows: append $n'$ "dummy" dimensions to the input, and add an impulse of $\tilde h_0$ in those dimensions at time 0. During the actual episode, these dummy inputs are always zero. Then, replacing $B$ with the augmented block matrix $[B \;\; \Phi]$ recovers the behavior of the system. Thus, we can handle this formulation of hidden-state learning in the online or batch setting, incurring no additional asymptotic factors.
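The input augmentation itself is mechanical; a sketch (function and variable names are ours):

```python
import numpy as np

def augment_inputs(xs, init_code, n_prime):
    """Append n' dummy coordinates to each input; at t = 0 they carry
    the encoded initial state as an impulse, afterwards they are zero."""
    T, n = xs.shape
    dummy = np.zeros((T, n_prime))
    dummy[0] = init_code              # impulse encoding the initial state
    return np.hstack([xs, dummy])
```

An LDS run on the augmented inputs, with its input matrix extended by the block of possible initial states, reproduces the original system's behavior.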
We highlight an important special case of the formulation discussed above, which is perhaps the motivating rationale for this altered problem.
Consider a batch system identification setting in which there are only finitely many initial states in the training and test data, and the experimenter can distinguish between these states. This can be interpreted as a set of known initial "configurations" of the system. Then, it is sufficient to augment the data with a one-hot vector in $\mathbb{R}^{n'}$, corresponding to the known initialization in each sample. An important case is when $n' = 1$: when there is only one distinct initial configuration; this occurs frequently in control problems.
In summary, the stated augmentation takes the original LDS with input dimension $n$, and transforms it into one with input dimension $n + n'$. The matrix of initial states (call it $\Phi$), as defined above, is the matrix whose $n'$ columns are the possible initial hidden states, which can be in arbitrary dimension. For convenience, we summarize this observation:
When there are $n'$ known initial hidden states, Theorems 2, 3, and 3b apply to the modified LDS learning problem, with samples of the form $(\tilde h_0, X, Y)$. The input dimension becomes $n + n'$.
We discuss the points mentioned in Section 3 at greater length. Unlike the rest of the appendix, this section contains no technical proofs, and is intended as a user-friendly guide for making the wave-filtering method usable in practice.
We begin by expanding upon the observation, noted in Section 3, that the eigenvectors resemble inhomogeneously-oscillating waves, providing some justification for the heuristic numerical computation of the top eigenvectors of $Z_T$.
Computing the filters directly from $Z_T$ is difficult. In fact, the Hilbert matrix (its close cousin) is notoriously super-exponentially ill-conditioned; it is probably best known as a pathological benchmark for finite-precision numerical linear algebra algorithms. One could ignore efficiency issues, and view this as a data-independent preprocessing step: these filters are deterministic. However, this difficult numerical problem poses an issue for using our method in practice.
Fortunately, as briefly noted in Section 3, some recourse is available. In [Grü82], Grünbaum constructs a tridiagonal matrix which commutes with each Hilbert matrix $H_n$, as defined in Section 2.4. In the appropriate scaling limit as $n \to \infty$, this becomes a Sturm–Liouville differential operator which does not depend on $n$.