On-Line Learning of Linear Dynamical Systems: Exponential Forgetting in Kalman Filters

09/16/2018, by Mark Kozdoba et al.

The Kalman filter is a key tool for time-series forecasting and analysis. We show that the dependence of the Kalman filter's predictions on the past decays exponentially whenever the process noise is non-degenerate. Therefore, the Kalman filter may be approximated by a regression on a few recent observations. Surprisingly, we also show that having some process noise is essential for the exponential decay: with no process noise, the forecast may depend on all of the past uniformly, which makes forecasting more difficult. Based on this insight, we devise an on-line algorithm for improper learning of a linear dynamical system (LDS), which considers only a few of the most recent observations. We use our decay results to provide the first regret bounds with respect to Kalman filters in learning an LDS. That is, we compare the results of our algorithm to the best, in hindsight, Kalman filter for a given signal. The algorithm is also practical: its per-update run-time is linear in the regression depth.

1 Introduction

Linear Dynamical Systems (LDS) are a standard tool in modeling and forecasting time series, with an exceedingly large number of applications. In forecasting with an LDS, typically one first learns the parameters of the LDS, using the maximum likelihood principle, and then uses the Kalman filter to generate predictions. The two features that seem to contribute the most to the success of LDS in practice are the ability of LDS to model a wide range of behaviors, and the recursive nature of the Kalman filter, which allows for fast, real-time forecasts via a constant-time update of the previous estimate. On the other hand, a major difficulty with LDSs is that the process of learning the system parameters, via expectation maximization (EM) or direct likelihood optimization, may be time consuming and prone to getting stuck in local maxima. We refer to Anderson and Moore (1979); West and Harrison (1997); Hamilton (1994); Chui and Chen (2017) for book-length introductions.

Recently, there has been interest in alternative, improper learning approaches, where one approximates the predictions of an LDS by a linear function of a few past observations. The advantage of such approaches is that they convexify the problem: learning the linear function amounts to a convex problem, which avoids the issues brought about by the non-convex nature of the likelihood function. The convexification allows for on-line algorithms, which are typically fast and simple. A crucial advance of these recent approaches is the guarantee that the predictions of the convexified, improper-learning algorithm are at least as good as the predictions of the proper one. One therefore avoids the long learning times and issues related to non-convexity associated with the classical algorithms, while maintaining the statistical performance.

Leading examples of this approach, Anava et al. (2013); Liu et al. (2016); Hazan, Singh, and Zhang (2017), utilise the framework of regret bounds, Cesa-Bianchi and Lugosi (2006), to provide guarantees on the performance of the convexifications. In this framework, one considers a sequence of observations $Y_1, Y_2, \ldots$, with or without additional assumptions. After observing $Y_1, \ldots, Y_t$, an algorithm for improper learning produces a forecast of the next observation. Then, roughly speaking, one shows that the sum of errors of the forecasts thus produced is close to the sum of errors of the best model (in hindsight) from within a certain class. It is said that the algorithm competes against that class.

In this paper, we take several steps towards developing guarantees for an algorithm which competes against Kalman filters. Specifically, we ask under what conditions it is possible to model the predictions of the Kalman filter as a regression on a few past observations. We show that for a natural, large, and well-known class of LDSs, the observable LDSs, the dependence of the Kalman filter on the past decays exponentially if the process noise of the LDS is non-degenerate. Consequently, the predictions of such an LDS can be modeled as auto-regressions. In addition, we show that at least some non-degeneracy of the process noise is necessary for the exponential decay. We provide an example with no process noise where the dependence on the past does not decay exponentially.

Next, based on the decay results, we give an on-line algorithm for time-series prediction and prove regret bounds for it. The algorithm makes predictions of the form $\hat{Y}_t = \sum_{i=1}^{d} c_i Y_{t-i}$, where $Y_{t-1}, \ldots, Y_{t-d}$ are past observations and $c = (c_1, \ldots, c_d)$ is the vector of auto-regression (AR) coefficients, which is updated by the algorithm in an on-line manner.

For any LDS $L$, denote by $\hat{Y}^L_t$ the value of $Y_t$ predicted by the Kalman filter corresponding to $L$, given $Y_1, \ldots, Y_{t-1}$. Denote by $\sum_{t \le T} (Y_t - \hat{Y}_t)^2$ the total error made by the algorithm up to time $T$, and by $\sum_{t \le T} (Y_t - \hat{Y}^L_t)^2$ the total error made by the Kalman filter corresponding to $L$. Let $\mathcal{F}$ be any finite family of observable linear dynamical systems with non-degenerate process noise. We show that for an appropriate regression depth $d$, for any bounded sequence $Y_1, Y_2, \ldots$ and any $\varepsilon > 0$ given in advance, we have

$\frac{1}{T} \sum_{t \le T} \big(Y_t - \hat{Y}_t\big)^2 \;\le\; \frac{1}{T} \min_{L \in \mathcal{F}} \sum_{t \le T} \big(Y_t - \hat{Y}^L_t\big)^2 \;+\; \varepsilon \;+\; \frac{C}{\sqrt{T}},$   (1)

where $C$ is a constant depending on the family $\mathcal{F}$. In words, up to an arbitrarily small $\varepsilon$ given in advance, the average prediction error of the algorithm is as good as or better than the average prediction error of the best Kalman filter in $\mathcal{F}$. We emphasize that while there is a dependence on $\mathcal{F}$ in the bounds, via the constant $C$, the algorithm itself depends on $\mathcal{F}$ only through the regression depth $d$. In particular, the algorithm does not depend on the cardinality of $\mathcal{F}$, and the time complexity of each iteration is $O(d)$.

To summarize, our contributions are as follows. We show that the dependence of the predictions of the Kalman filter on the past is exponentially decaying in a system with non-degenerate process noise, and that therefore Kalman filters may be approximated by regressions on a few recent observations, cf. Theorem 2. We also show that the process noise is essential for the exponential decay. We give an on-line prediction algorithm and prove the first regret bounds against Kalman filters, cf. Theorem 6. Experimentally, we illustrate the performance on a single example in the main body of the text, and on further examples in the supplementary material.

2 Literature

In this section, we review the relevant literature and place the current work in context.

We refer to Hamilton (1994) for an exposition of LDSs, the Kalman filter, and the classical approach to learning the LDS parameters via maximum likelihood optimization. See also Roweis and Ghahramani (1999) for a survey of relations between LDSs and a large variety of other probabilistic models. A general exposition of on-line learning can be found in Hazan (2016).

As discussed in the Introduction, we are concerned with improper learning, where an alternative model can be shown to generate forecasts that are as good as the Kalman filter's, up to any given error. Perhaps the earliest example of improper learning that is still used today is the moving average, or the exponential moving average, Gardner (1985). In this approach, predictions for a process – of a possibly complex nature – are made using a simple auto-regressive (AR) or AR-like model. This approach is very successful in a multitude of engineering applications. Nevertheless, until recently, there were very few guarantees for the performance of such methods.
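To make this classical scheme concrete, the following is a minimal exponential-moving-average forecaster; it is an illustrative sketch only, and the smoothing weight `alpha` is a placeholder value rather than one taken from the cited works.

```python
import numpy as np

def ema_forecast(y, alpha=0.3):
    """One-step-ahead forecasts by simple exponential smoothing (illustrative sketch).

    y     : 1-D array of observations
    alpha : smoothing weight in (0, 1]; a placeholder, not a value from the paper
    Returns f with f[t] = forecast of y[t] made after observing y[0], ..., y[t-1].
    """
    f = np.empty(len(y))
    level = float(y[0])       # initialise the smoothed level with the first observation
    f[0] = level              # no data yet; forecast the first value by itself
    for t in range(1, len(y)):
        f[t] = level                                  # forecast y[t] from past data only
        level = alpha * y[t] + (1.0 - alpha) * level  # update the level after seeing y[t]
    return f
```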

In Anava et al. (2013), the first guarantees regarding prediction of a (non-AR) subset of auto-regressive moving-average (ARMA) processes by AR processes were given, together with an algorithm for finding the appropriate AR. In Liu et al. (2016), these results were extended to a subset of autoregressive integrated moving average (ARIMA) processes, while at the same time the assumptions on the underlying ARMA model were relaxed.

In this paper, we show that AR models may also be used to forecast as well as Kalman filters. One major difference between our results and the previous work is that we obtain approximation results for arbitrary bounded sequences. Indeed, the regret results of Anava et al. (2013) and Liu et al. (2016) only hold under the assumption that the data sequence was generated by a particular fixed ARMA or ARIMA process. Moreover, the constants in the regret bounds of Anava et al. (2013) and Liu et al. (2016) depend on the generating model, and the guaranteed convergence may be arbitrarily slow, even when the sequence to forecast is generated by an appropriate model.

In contrast, we show that, up to an arbitrarily small error given in advance, AR($d$) will perform as well as the Kalman filter on any bounded sequence. We also obtain approximation results in the more general case of bounded-difference sequences.

Another related work is Hazan, Singh, and Zhang (2017), which addresses a different aspect of LDS approximation by ARs. In the case of LDSs with inputs, building on known eigenvalue-decay estimates of Hankel matrices, it is shown that the influence of all past inputs may be effectively approximated by an AR-type model. However, the arguments and the algorithms in Hazan, Singh, and Zhang (2017) were not designed to address model noise. In particular, the algorithm of Hazan, Singh, and Zhang (2017) makes predictions based on the whole history of inputs and on only the single most recent observation, and hence clearly cannot compete with Kalman filters in situations with no inputs. We demonstrate this in the Experiments section.

Also, note that while on-line gradient descent (OGD) is used throughout much recent work in forecasting, Anava et al. (2013); Hazan, Singh, and Zhang (2017); Liu et al. (2016), there are important differences in the per-iteration run-time, as well as in the problems solved. For example, the OGD of Hazan, Singh, and Zhang (2017) requires time growing with $t$ to forecast at time $t$, as it considers the whole history of inputs and an eigen-decomposition of a large matrix at each time step, although the latter could be pre-computed. Our Algorithm 1 has a per-iteration complexity linear in the regression depth $d$, which can be taken to be small. Notably, its run-time at time $t$ does not depend on $t$.

3 Preliminaries

As usual in the literature, West and Harrison (1997), we define a linear dynamical system as

$h_{t+1} = A h_t + \eta_t$,   (2)
$Y_t = \langle g, h_t \rangle + \varepsilon_t$,   (3)

where $Y_t \in \mathbb{R}$ are scalar observations and $h_t \in \mathbb{R}^n$ is the hidden state. $A$ is the state transition matrix which defines the system dynamics, and $g \in \mathbb{R}^n$ is the observation direction. The process-noise terms $\eta_t$ and observation-noise terms $\varepsilon_t$ are mean-zero, normal, independent random variables. For all $t$, the covariance of $\eta_t$ is $Q$ and the variance of $\varepsilon_t$ is $r$. The initial state $h_0$ is a normal random variable with mean $m_0$ and covariance $P_0$.

For $t \ge 1$ denote

$m_t = \mathbb{E}\left[ h_t \mid Y_1, \ldots, Y_t \right]$,   (4)

and let $P_t$ be the covariance matrix of $h_t$ given $Y_1, \ldots, Y_t$. Note that $m_t$ is the estimate of the current hidden state, given the observations. Further, the central quantity of this paper is

$\hat{Y}_{t+1} = \mathbb{E}\left[ Y_{t+1} \mid Y_1, \ldots, Y_t \right] = \langle g, A m_t \rangle$.   (5)

This is the forecast of the next observation, given the current data. The quantities $m_t$ and $P_t$ are known as the Kalman filter. In particular, in this paper we refer to the sequence $\hat{Y}_{t+1}$, $t \ge 0$, as the Kalman filter associated with the LDS $(A, g, Q, r)$.

The Kalman filter satisfies the following recursive update equations. Set $R_{t+1} = A P_t A^{\top} + Q$, the covariance of $h_{t+1}$ given $Y_1, \ldots, Y_t$. Note that in this notation we have $\hat{Y}_{t+1} = \langle g, A m_t \rangle$. Then the update equations of the Kalman filter are:

$m_{t+1} = A m_t + K_{t+1} \left( Y_{t+1} - \langle g, A m_t \rangle \right)$,   (6)
$P_{t+1} = \left( I - K_{t+1} g^{\top} \right) R_{t+1}$,   (7)

where $K_{t+1} = R_{t+1} g / \left( \langle g, R_{t+1} g \rangle + r \right)$ is the Kalman gain and $K_{t+1} g^{\top}$ is the rank-one operator given by the outer product of $K_{t+1}$ and $g$.

An important property of the Kalman filter is that while $m_t$ depends on the observations, the covariance matrix $P_t$ does not. Indeed, note that $R_t$, $K_t$, and $P_t$ are all deterministic sequences which do not depend on the observations.

We explicitly write the recurrence relation for $P_t$:

$P_{t+1} = A P_t A^{\top} + Q - \dfrac{ \left( A P_t A^{\top} + Q \right) g \, g^{\top} \left( A P_t A^{\top} + Q \right) }{ \langle g, \left( A P_t A^{\top} + Q \right) g \rangle + r }$.   (8)

Also write for convenience

$F_t = \left( I - K_t g^{\top} \right) A$, so that $m_t = F_t m_{t-1} + K_t Y_t$.   (9)
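For concreteness, the sketch below runs the recursions above on a scalar-observation LDS and returns the one-step forecasts of (5). It uses the notation reconstructed in this section ($A$, $g$, $Q$, $r$, $m_t$, $P_t$) and is an illustrative implementation, not the authors' code.

```python
import numpy as np

def kalman_one_step_predictions(Y, A, g, Q, r, m0=None, P0=None):
    """Run the Kalman filter (6)-(7) on scalar observations Y and return the
    one-step forecasts hat_Y[t] = E[Y_t | Y_1, ..., Y_{t-1}] of (5).

    A : (n, n) transition matrix, g : (n,) observation direction,
    Q : (n, n) process-noise covariance, r : observation-noise variance.
    """
    n = A.shape[0]
    m = np.zeros(n) if m0 is None else m0.copy()   # filtered state estimate m_t
    P = np.eye(n) if P0 is None else P0.copy()     # filtered covariance P_t
    forecasts = np.empty(len(Y))
    for t, y in enumerate(Y):
        # Predict: propagate the estimate one step ahead.
        m_pred = A @ m                        # predictive mean of the next hidden state
        R = A @ P @ A.T + Q                   # predictive covariance R_t
        forecasts[t] = g @ m_pred             # hat_Y_t = <g, A m_{t-1}>
        # Update: incorporate the newly revealed observation y.
        K = R @ g / (g @ R @ g + r)           # Kalman gain K_t
        m = m_pred + K * (y - g @ m_pred)     # filtered mean m_t
        P = (np.eye(n) - np.outer(K, g)) @ R  # filtered covariance P_t
    return forecasts
```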

A more explicit form of the prediction of $Y_{t+1}$ given $Y_1, \ldots, Y_t$ may be obtained by unrolling (6) and using (9):

$\hat{Y}_{t+1} = \langle g, A m_t \rangle = \sum_{j=0}^{t-1} \left\langle g,\, A F_t F_{t-1} \cdots F_{t-j+1} K_{t-j} \right\rangle Y_{t-j} + \left\langle g,\, A F_t F_{t-1} \cdots F_1 m_0 \right\rangle$,   (10)

where the product of the $F$'s is taken to be the identity when $j = 0$. Choose and fix some regression depth $d \ge 1$. Then for any $t \ge d$, the expectation (10) has the form displayed in Figure 1.

$\hat{Y}_{t+1} = \sum_{j=0}^{d-1} \left\langle g,\, A F_t \cdots F_{t-j+1} K_{t-j} \right\rangle Y_{t-j} + \rho_{t,d}$   (13)

Figure 1: The unrolling of the forecast $\hat{Y}_{t+1}$. The remainder term $\rho_{t,d}$ collects the terms with $j \ge d$ and the initial-state term, and goes to zero exponentially fast with $d$, by Lemma 3.
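As a sanity check on the unrolled form, the following sketch recomputes the forecast of Figure 1 directly from the coefficients $\langle g, A F_t \cdots F_{t-j+1} K_{t-j}\rangle$ and the initial-state term; under the reconstruction above it should agree, up to rounding, with the recursive filter's forecast $\langle g, A m_t\rangle$. The helper name and indexing conventions are ours.

```python
import numpy as np

def unrolled_forecast(Y, A, g, Q, r, m0, P0):
    """Forecast of Y_{t+1} via the unrolled form (10): a weighted sum of past
    observations plus an initial-state term.  Illustrative sketch with
    F_s = (I - K_s g^T) A as in (9).
    """
    n, t = A.shape[0], len(Y)
    P = P0.copy()
    Ks, Fs = [], []
    for _ in range(t):                      # the gains are deterministic: no Y needed
        R = A @ P @ A.T + Q
        K = R @ g / (g @ R @ g + r)
        P = (np.eye(n) - np.outer(K, g)) @ R
        Ks.append(K)
        Fs.append((np.eye(n) - np.outer(K, g)) @ A)
    prod = np.eye(n)                        # running product F_t F_{t-1} ... F_{s+2}
    forecast = 0.0
    for s in range(t - 1, -1, -1):          # s = 0, ..., t-1 indexes Y_{s+1} = Y[s]
        forecast += g @ (A @ prod @ Ks[s]) * Y[s]
        prod = prod @ Fs[s]
    forecast += g @ (A @ prod @ m0)         # contribution of the initial state m_0
    return forecast
```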

Next, a linear system is said to be observable, West and Harrison (1997), if

$\operatorname{rank} \left[ \, g, \; A^{\top} g, \; \ldots, \; \left( A^{\top} \right)^{n-1} g \, \right] = n$.   (14)

Roughly speaking, the pair $(A, g)$ is observable if the state can be recovered from a sufficient number of observations in a noiseless situation. Note that if there were parts of the state that did not influence the observations, those parts would be irrelevant for forecasting purposes. Thus we are only interested in observable LDSs.
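Condition (14) can be checked numerically by forming the observability matrix of the pair $(A, g)$ and testing its rank; this is a generic linear-algebra sketch rather than code from the paper.

```python
import numpy as np

def is_observable(A, g, tol=1e-10):
    """Check observability of the pair (A, g) for the scalar-observation system (2)-(3).

    Builds the observability matrix with rows g^T, g^T A, ..., g^T A^{n-1} and
    tests whether it has full rank n (condition (14)).
    """
    n = A.shape[0]
    rows, v = [], g.astype(float)
    for _ in range(n):
        rows.append(v)
        v = A.T @ v          # next row: g^T A^k corresponds to the vector (A^T)^k g
    O = np.vstack(rows)
    return np.linalg.matrix_rank(O, tol=tol) == n
```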

When $(A, g)$ is observable, it is known, Harrison (1997), that the sequences $P_t$, $R_t$, $K_t$, and $F_t$ converge. See also Anderson and Moore (1979); West and Harrison (1997). We denote the limits by $P$, $R$, $K$, and $F$, respectively. Moreover, the limits satisfy the recursions (6)-(9) as equalities. In particular we have

(15)

Finally, an operator $M$ is non-negative, denoted $M \succeq 0$, if $\langle M x, x \rangle \ge 0$ for all $x$, and is positive, denoted $M \succ 0$, if $\langle M x, x \rangle > 0$ for all $x \ne 0$. Note that $Q$, $P_t$, $R_t$, and their limits are either covariance matrices or limits of such matrices, and thus are symmetric and non-negative.

4 Exponential Decay and AR Approximation

In what follows, we denote by

(16)

the inner products induced by and on , where is the limit of as described above. In particular, we set and rewrite (15) as

(17)

Observe that since , we have , and in particular if then . In other words, if , then and induce proper norms on :

(18)

Next, consider the remainder term in the prediction equation (13), in which we replace $F_t$ and $K_t$ with their limit values $F$ and $K$.

Let us now state and prove the key result of this paper: if $Q \succ 0$, then this dependence on the past decays to zero exponentially fast. The key to the proof will be to consider contractivity properties with respect to a suitably induced norm, rather than with respect to the default inner product.

Theorem 1.

If , then there is
such that for every ,

(19)
Proof.

Set

(20)

Then

(21)
(22)

Therefore we have

(23)

In addition, by (17),

(24)

Combining (23) and (24), we obtain

(25)

Equation (4) immediately implies that is non-increasing. Recall that by (18), is dominated by . However, since both and define proper norms, by the equivalence of finite dimensional norms, the inverse inequality is also true: There exists such that

(26)

Therefore the decrease in (4) must be exponential:

(27)

It is of interest to stress the fact that Theorem 1 does not assume any contractivity properties of $A$. In particular, the very common assumption that the spectral radius of $A$ is bounded by $1$ is not required.
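The exponential forgetting can also be observed numerically: iterate the covariance recursion (8) to (near) steady state and inspect the magnitude of the coefficient of the observation $j$ steps in the past, which in the reconstruction above is $\langle g, A F^{j} K\rangle$ with $F = (I - K g^{\top}) A$. The sketch below assumes that form of the forgetting operator and uses a rotation as the transition matrix, in line with the remark that no contractivity of $A$ is needed.

```python
import numpy as np

def steady_state_coefficients(A, g, Q, r, depth=20, iters=2000):
    """Magnitudes |<g, A F^j K>| of the steady-state AR coefficients, j = 0..depth-1.

    F = (I - K g^T) A is the assumed forgetting operator of the reconstruction
    above; with non-degenerate Q the magnitudes should decay roughly geometrically.
    """
    n = A.shape[0]
    P = np.eye(n)
    for _ in range(iters):                        # iterate the Riccati recursion (8)
        R = A @ P @ A.T + Q
        K = R @ g / (g @ R @ g + r)
        P = (np.eye(n) - np.outer(K, g)) @ R
    F = (np.eye(n) - np.outer(K, g)) @ A
    coeffs, Fj = [], np.eye(n)
    for _ in range(depth):
        coeffs.append(abs(g @ (A @ Fj @ K)))
        Fj = Fj @ F
    return np.array(coeffs)

# Example: a rotation (spectral radius 1) with non-degenerate process noise
# still forgets the past geometrically.
theta = 0.5
A = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
g = np.array([1.0, 0.0])
print(steady_state_coefficients(A, g, Q=0.1 * np.eye(2), r=1.0)[:6])
```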

Let us state and prove our main approximation result:

Theorem 2 (LDS Approximation).

Let be an observable LDS with .

  1. For any , and any , there is , and , such that for every sequence with , and for every ,

    (28)
  2. For any , and any , there is , and , such that for every sequence with , and for every ,

    (29)

We first prove the bound on the remainder term in the prediction equation (13).

Lemma 3 (Remainder-Term Bound).

Let be an observable LDS with .

  1. If a sequence satisfies for all , then there are constants and such that for any and ,

    (30)
  2. If a sequence satisfies for all , then there are constants and such that for all and ,

    (31)
Proof.

Recall that satisfies the recursion (9),

(32)

Denote by and and by the norms induced by and respectively. Set and . By Theorem 1, there is such that is a -contraction with respect to . Fix some such that . Since , there is some such that for all , is a -contraction. In addition, let be such that for all . Set . Fix and set . For , using (32) write as

(33)

Observe that if an operator is a -contraction with respect to , then for any ,

(34)

where is the equivalence constant between and .

For every by (4) we have

(35)

By the choice of , as since the expansion in (35) is only up to , every in (35) is a -contraction and all satisfy .

Combining this with (34) and using triangle inequality, we obtain

(36)

Finally, choose . Note that . Therefore,

(37)
(38)
(39)
(40)

Observe that the term in (40) is a constant, independent of , and that the series in (39) are summable w.r.t . Therefore, in the bounded case , the proof is complete.

In the Lipschitz case, for every , we have

(41)

Substituting this into (37)-(40), and observing that the resulting series are still summable, we obtain

(42)

Thus using

(43)

completes the proof in the Lipschitz case. ∎

We now prove Theorem 2.

Proof.

Recall that is given by (13). Fix some and set , and for . Note that and here corresponds to in the statement of the Theorem. Set also and for , . Clearly with and for every fixed . Next, using Lemma 3, the discrepancy between and the predictor is given by

(44)

in the bounded case. In this case, therefore, choosing regression depth large enough so that and large enough so that for all , for all , suffices to conclude the proof. The proof of the Lipschitz case follows similar lines and is given in the Supplementary Material due to space constraints. ∎

To conclude this section, we discuss the relation between exponential convergence and the non-degenerate noise assumption $Q \succ 0$. Note that the crucial part of Theorem 1, inequality (26), holds if and only if we can guarantee the corresponding lower bound for every vector for which it is required. In particular, this holds when $Q \succ 0$ – that is, when the noise is full dimensional. We now demonstrate that at least some noise is necessary for the exponential decay to hold.

Consider first a one dimensional example.

Example 4.

In dimension one, assume that the observations $Y_t$ are generated by an LDS with $A = 1$, $g = 1$, no process noise ($Q = 0$), and some observation-noise variance $r > 0$. Assume that the true process starts from a deterministic state $h_0 = \mu$. Since we do not know $\mu$, we start the Kalman filter with $m_0 = 0$ and initial covariance $P_0 = 1$.

In this case, clearly the observations are independent samples of a fixed distribution with mean $\mu$ and variance $r$. The Kalman filter in this situation is equivalent to a Bayesian mean estimator with prior distribution $N(0, 1)$. From general considerations, it follows that $P_t \to 0$ and hence $K_t \to 0$. Indeed, if we start with $P_0 = 0$, then we have $P_t = 0$ for all $t$. Since the limit does not depend on the initialization, Harrison (1997), we have $P_t \to 0$ for every initialization. As a side note, in this particular case it can be shown, either via the Bayesian interpretation or directly, that $P_t$ decays as $1/t$. Now, note that $K_t \to 0$, and that for any fixed $j$, the coefficient of $Y_{t-j}$ in (13) converges to $0$ as $t$ grows. Next, for a fixed regression depth $d$, consider the prediction equation (13). On the one hand, we know that $\hat{Y}_{t+1}$ converges to $\mu$ in probability. This is clear, for instance, from the Bayesian estimator interpretation above. On the other hand, the coefficients of all of $Y_t, \ldots, Y_{t-d+1}$ in (13) converge to $0$. It follows, therefore, that the remainder term $\rho_{t,d}$ in (13) converges in probability to $\mu$ as $t \to \infty$. In particular, the remainder term does not converge to $0$. This is in sharp contrast with the exponential convergence of this term to zero in the case $Q \succ 0$, as given by Lemma 3.

The above example can be generalized as follows:

Example 5.

In any dimension $n$, let $(A, g)$ define an LDS such that $A$ is a rotation and such that $(A, g)$ is observable. Again choose $Q = 0$ and some observation-noise variance $r > 0$. As before, let the true process start from a deterministic state $h_0$, and start the filter with $m_0 = 0$ and $P_0 = I$.

Considerations similar to those of the previous example imply that the coefficients of the recent observations in (13) converge to zero, while the forecast itself does not. Consequently, the remainder term will not converge to zero.
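A small computation makes the degenerate case concrete: in the setting of Example 4 ($A = 1$, $g = 1$, $Q = 0$, observation-noise variance $r$), the covariance recursion gives $P_t$ of order $r/t$ and gain $K_t$ of order $1/t$, so the weight of each individual observation vanishes only polynomially. The initialisation below follows the one assumed in Example 4.

```python
import numpy as np

# One-dimensional LDS with no process noise (Example 4): A = 1, g = 1, Q = 0.
r, P = 1.0, 1.0          # observation-noise variance and initial covariance P_0
gains = []
for t in range(1, 201):
    R = P                # A P A^T + Q with A = 1 and Q = 0
    K = R / (R + r)      # scalar Kalman gain K_t
    P = (1.0 - K) * R    # covariance update; here P_t behaves like r / t
    gains.append(K)

# The gain K_t decays like 1/t, not geometrically: each new observation gets weight
# roughly 1/t, and old observations are never forgotten exponentially fast.
print([round(k, 3) for k in gains[:5]], "...", round(gains[-1], 4), "vs 1/t =", round(1 / 200, 4))
```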

5 An Algorithm and Regret Bounds

In this section, we introduce our prediction algorithm and prove the associated regret bounds. Our on-line algorithm maintains a state estimate, which is represented by the regression coefficients $c^{(t)} = \left( c^{(t)}_1, \ldots, c^{(t)}_d \right) \in \mathbb{R}^d$, where $d$ is the regression depth, a parameter of the algorithm. At time step $t$, the algorithm first produces a prediction $\hat{Y}_t$ of the observation $Y_t$, using the current state and the previous observations $Y_{t-1}, \ldots, Y_{t-d}$. Specifically, we will predict $Y_t$ by

$\hat{Y}_t = \sum_{i=1}^{d} c^{(t)}_i Y_{t-i}$.   (45)

After the prediction is made, the true observation $Y_t$ is revealed to the algorithm, and a loss associated with the prediction is computed. Here we consider the quadratic loss for simplicity. The loss function at time $t$ will be given by

$\ell_t(c) = \left( Y_t - \sum_{i=1}^{d} c_i Y_{t-i} \right)^2$.   (46)

In addition, the state is updated. We use the general scheme of on-line gradient descent algorithms, Zinkevich (2003), where the update goes against the direction of the gradient of the current loss. In addition, it is useful to restrict the state to a bounded domain. We will use a Euclidean ball of radius $\mathcal{R}$ as the domain, where $\mathcal{R}$ is a parameter of the algorithm. We denote this domain by $\mathcal{B}$ and denote by $\Pi_{\mathcal{B}}$ the Euclidean projection onto this domain. If the gradient step takes the state outside of the domain, the state is projected back onto $\mathcal{B}$. The pseudo-code is presented in Algorithm 1, where the gradient of the loss at $c$ at time $t$ is given by

$\nabla \ell_t(c) = -2 \left( Y_t - \sum_{i=1}^{d} c_i Y_{t-i} \right) \left( Y_{t-1}, \ldots, Y_{t-d} \right)^{\top}$.   (47)
1:  Input: Regression length $d$, domain bound $\mathcal{R}$.     Observations $Y_1, Y_2, \ldots$, given sequentially.
2:  Set the learning rate $\eta_t$.
3:  Initialize $c^{(1)}$ arbitrarily in $\mathcal{B}$.
4:  for $t = 1$ to $T$ do
5:     Predict $\hat{Y}_t$ using (45)
6:     Observe $Y_t$ and compute the loss $\ell_t\left(c^{(t)}\right)$ of (46)
7:     Update $c^{(t+1)} = \Pi_{\mathcal{B}} \left( c^{(t)} - \eta_t \nabla \ell_t\left(c^{(t)}\right) \right)$ using (47)
8:  end for
Algorithm 1 On-line Gradient Descent

Note a slight abuse of notation in Algorithm 1: the vector $c^{(t)}$ denotes the state at time $t$, while in (45) and elsewhere in the text, $c^{(t)}_i$ denotes the scalar coordinates of $c^{(t)}$. Whether the vector or the coordinates are considered will always be clear from context.
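A direct Python rendering of Algorithm 1 is sketched below; the learning-rate schedule `eta / sqrt(t)` is a standard choice for projected on-line gradient descent (Zinkevich 2003) and is an assumption here, since the exact constants used in the paper are not reproduced above.

```python
import numpy as np

def online_ar_forecast(Y, d=2, radius=10.0, eta=0.1):
    """On-line gradient descent over AR(d) coefficients (a sketch of Algorithm 1).

    Y      : 1-D array of observations, revealed one at a time
    d      : regression depth
    radius : bound of the Euclidean ball onto which the state is projected
    eta    : base rate; eta / sqrt(t) is an assumed schedule, not the paper's constant
    Returns the one-step forecasts hat_Y[t] made before Y[t] is revealed.
    """
    c = np.zeros(d)                        # state: AR coefficients, initialised in the ball
    forecasts = np.zeros(len(Y))
    for t in range(len(Y)):
        past = Y[max(0, t - d):t][::-1]    # Y_{t-1}, ..., Y_{t-d} (fewer at the start)
        past = np.pad(past, (0, d - len(past)))
        forecasts[t] = c @ past            # prediction (45)
        grad = -2.0 * (Y[t] - c @ past) * past        # gradient (47) of the loss (46)
        c = c - (eta / np.sqrt(t + 1)) * grad         # gradient step
        norm = np.linalg.norm(c)
        if norm > radius:                  # Euclidean projection back onto the ball
            c = c * (radius / norm)
    return forecasts
```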

For any LDS $L$, let $\hat{Y}^L_t$, defined by (13), be the prediction that the Kalman filter associated with $L$ makes, given $Y_1, \ldots, Y_{t-1}$. We start all filters with the initial state $m_0 = 0$ and initial covariance $P_0 = I$, the identity matrix. Let $\mathcal{F}$ be any family of LDSs. Then for any sequence $Y_1, Y_2, \ldots$, the quantity

$\mathrm{Regret}_T = \sum_{t=1}^{T} \ell_t\left(c^{(t)}\right) - \min_{L \in \mathcal{F}} \sum_{t=1}^{T} \left( Y_t - \hat{Y}^L_t \right)^2$,   (48)

where $c^{(1)}, c^{(2)}, \ldots$ is the sequence of states produced by Algorithm 1, is called the regret. As discussed in the introduction, the first sum is the total error incurred by the algorithm, and the second is the loss of the best (in hindsight) Kalman filter in $\mathcal{F}$. Therefore, small regret means that the algorithm performs on the sequence as well as the best Kalman filter in $\mathcal{F}$, even if we are allowed to select that Kalman filter in hindsight, after the whole sequence is revealed.
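Once the forecasts of the algorithm and of each filter in the family are available, the regret (48) can be computed directly; the helper below is an illustrative convention (squared-error losses, with the forecast at index t made before Y[t] is revealed), not code from the paper.

```python
import numpy as np

def regret(Y, alg_forecasts, filter_forecasts):
    """Regret (48): total loss of the algorithm minus the loss of the best filter.

    filter_forecasts : dict mapping an LDS identifier to its array of one-step forecasts.
    """
    alg_loss = np.sum((Y - alg_forecasts) ** 2)
    best_filter_loss = min(np.sum((Y - f) ** 2) for f in filter_forecasts.values())
    return alg_loss - best_filter_loss
```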

In the Supplementary Material, we prove the following bound on the regret of Algorithm 1:

Theorem 6.

Let $\mathcal{F}$ be a finite family of LDSs, such that every $L \in \mathcal{F}$ is observable and has $Q \succ 0$. Let a bound on the observations be given. For any $\varepsilon > 0$, there are a regression depth $d$, a domain bound $\mathcal{R}$, and a constant $C$, such that the following holds:

For every bounded sequence $Y_1, Y_2, \ldots$, if $c^{(1)}, c^{(2)}, \ldots$ is the sequence produced by Algorithm 1 with parameters $d$ and $\mathcal{R}$, then for every $T$,

(49)

Due to the limited space in the main body of the text, we describe only the main ideas of the proof here. Similarly to other proofs in this domain, it consists of two steps. In the first step we show that

(50)
Figure 2: The error of AR(2) compared against the Kalman filter, last-value prediction, and spectral filtering, in terms of the mean and standard deviation over 100 runs on Example 7.
Figure 3: The ratio of the errors of the Kalman filter and AR(2) on Example 7, indicated by colours, as a function of the magnitudes of the process and observation noise on the vertical and horizontal axes, respectively. The origin is the top-left corner.
Figure 4: The error of AR($d$) as a function of the regression depth $d$, in terms of the mean and standard deviation over repeated runs on Example 7, for 4 choices of the process and observation noise.

This means that Algorithm 1 performs as well as the best in-hindsight fixed state vector $c \in \mathcal{B}$. This follows from the general results in Zinkevich (2003). In the second step, we use the approximation Theorem 2 to find for each $L \in \mathcal{F}$ an appropriate $c^L \in \mathcal{B}$, such that the predictions $\hat{Y}^L_t$ are approximated by the AR predictions with coefficients $c^L$. It follows from this step that the best Kalman filter in $\mathcal{F}$ performs approximately as well as the best $c^L$. Specifically, we have

(51)

Because by construction $c^L \in \mathcal{B}$, the loss of the best fixed state vector in $\mathcal{B}$ is at most the loss of the best $c^L$, and therefore combining (50) and (51) yields the statement of Theorem 6.

6 Experiments

To illustrate our results, we present experiments on a few well-known examples in the Supplementary Material. Out of those, we chose one to present here:

Example 7 (Adapted from Hazan, Singh, and Zhang (2017)).

Consider the system:

(52)

with the process noise and the observation noise distributed as zero-mean Gaussians, for different choices of their variances.

In Figure 2, we compare the prediction error of 4 methods: the standard baseline last-value prediction $\hat{Y}_{t+1} = Y_t$, also known as persistence prediction; the spectral filtering of Hazan, Singh, and Zhang (2017); the Kalman filter; and AR(2). Here AR(2) is the truncation of the Kalman filter, given by (13) with regression depth $d = 2$ and no remainder term. The average error over 100 observation sequences generated by (52) is shown as a solid line, and its standard deviation is shown as a shaded region. Note that from some time on, spectral filtering essentially performs persistence prediction, since the inputs are zero. Further, note that both the Kalman filter and AR(2) considerably improve upon the performance of last-value prediction.
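The comparison in Figure 2 can be reproduced in spirit with a short script; the two-dimensional system below is a stand-in, since the exact system (52) is not reproduced above, and the AR(2) coefficients are obtained by truncating the steady-state filter as described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in two-dimensional system; the actual system (52) is not reproduced here.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
g = np.array([1.0, 1.0])
Q, r = 0.5 * np.eye(2), 0.5          # process-noise covariance and observation-noise variance

# Simulate T observations from the LDS (2)-(3).
T, n = 300, 2
h = rng.multivariate_normal(np.zeros(n), np.eye(n))
Y = np.zeros(T)
for t in range(T):
    Y[t] = g @ h + np.sqrt(r) * rng.standard_normal()
    h = A @ h + rng.multivariate_normal(np.zeros(n), Q)

# Exact Kalman forecasts, started from m_0 = 0 and P_0 = I as in the text.
m, P = np.zeros(n), np.eye(n)
kalman = np.zeros(T)
for t in range(T):
    R = A @ P @ A.T + Q
    kalman[t] = g @ (A @ m)                   # forecast before Y[t] is revealed
    K = R @ g / (g @ R @ g + r)
    m = A @ m + K * (Y[t] - g @ (A @ m))
    P = (np.eye(n) - np.outer(K, g)) @ R

F = (np.eye(n) - np.outer(K, g)) @ A          # (near) steady-state forgetting operator
c1, c2 = g @ (A @ K), g @ (A @ F @ K)         # truncated AR(2) coefficients from (13)

ar2 = np.zeros(T)
ar2[2:] = c1 * Y[1:-1] + c2 * Y[:-2]          # hat_Y_t = c1 Y_{t-1} + c2 Y_{t-2}
last_value = np.concatenate(([0.0], Y[:-1]))  # persistence prediction

for name, pred in [("Kalman", kalman), ("AR(2)", ar2), ("last value", last_value)]:
    print(name, np.mean((Y[2:] - pred[2:]) ** 2))
```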

In Figure 3, we compare the performance of AR(2) and the Kalman filter under varying magnitudes of the noise. In particular, colour indicates the ratio of the errors of the Kalman filter to the errors of AR(2), where the errors are the average prediction errors over 10 trajectories of (52) for each cell of the heat-map, with each trajectory of length 50. (The exact formula is given in the Supplementary Material.) Consistent with our analysis, one can observe that increasing the variance of the process noise improves the approximation of the Kalman filter by AR(2).

Finally, in Figure 4, we illustrate the decay of the remainder term by presenting the mean (line) and standard deviation (shaded area) of the error as a function of the regression depth $d$. There, 4 choices of the covariance of the process noise and the variance of the observation noise are considered within Example 7, and the error is averaged over repeated runs. As expected, increasing $d$ decreases the error, until the error approaches that of the Kalman filter. Observe again that for a given value of the observation noise, the convergence w.r.t. $d$ is slower for smaller process noise, consistent with our theoretical observations.

7 Conclusions

We have presented a forecasting method which is applicable to arbitrary sequences and comes with a regret bound against a class of methods that includes Kalman filters.

The results could be generalised in several directions. First, since Theorem 2 provides an approximation in absolute value for every large-enough $t$, our regret bounds may easily be extended to other losses. Second, for simplicity, we have considered only bounded sequences. While this is a standard assumption in the literature, it is somewhat restrictive, since, at least theoretically, LDS observations may grow with $t$. For this reason, we have also given the approximation theorem for bounded-difference (Lipschitz) sequences, and the regret results may be extended to this setting as well. One could also provide regret bounds for special cases of LDSs, as surveyed by Roweis and Ghahramani (1999).

References

  • Anava et al. (2013) Anava, O.; Hazan, E.; Mannor, S.; and Shamir, O. 2013. Online learning for time series prediction. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA.
  • Anderson and Moore (1979) Anderson, B., and Moore, J. 1979. Optimal Filtering. Prentice Hall.
  • Cesa-Bianchi and Lugosi (2006) Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, learning, and games. Cambridge university press.
  • Chui and Chen (2017) Chui, C., and Chen, G. 2017. Kalman Filtering: with Real-Time Applications. Springer International Publishing.
  • Gardner (1985) Gardner, E. S. 1985. Exponential smoothing: The state of the art. Journal of Forecasting 4(1):1–28.
  • Hamilton (1994) Hamilton, J. 1994. Time Series Analysis. Princeton University Press.
  • Harrison (1997) Harrison, P. J. 1997. Convergence and the constant dynamic linear model. Journal of Forecasting 16(5).
  • Hazan, Singh, and Zhang (2017) Hazan, E.; Singh, K.; and Zhang, C. 2017. Online learning of linear dynamical systems. In Advances in Neural Information Processing Systems, 6686–6696.
  • Hazan (2016) Hazan, E. 2016. Introduction to online convex optimization. Found. Trends Optim.
  • Liu et al. (2016) Liu, C.; Hoi, S. C. H.; Zhao, P.; and Sun, J. 2016. Online ARIMA algorithms for time series prediction. AAAI'16.
  • Roweis and Ghahramani (1999) Roweis, S., and Ghahramani, Z. 1999. A unifying review of linear Gaussian models. Neural Computation 11(2):305–345.
  • West and Harrison (1997) West, M., and Harrison, J. 1997. Bayesian Forecasting and Dynamic Models (2nd ed.). Springer-Verlag.
  • Zinkevich (2003) Zinkevich, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. ICML.

Appendix A Proof Of Theorem 6

Proof.

Let $\varepsilon > 0$ be given. For every $L \in \mathcal{F}$, let $d_L$ be the approximation rank, $c^L$ the approximating coefficient vector, and $t_L$ the convergence time for which the approximation (28) holds by Theorem 2. Set $d = \max_{L \in \mathcal{F}} d_L$ and consider all $c^L$ as vectors in $\mathbb{R}^d$, by padding with zeros if necessary. Then we have the uniform approximation

(53)

for all bounded sequences, all $L \in \mathcal{F}$, and all sufficiently large $t$. We choose $\mathcal{R}$ large enough that all $c^L$ lie in $\mathcal{B}$, and apply Algorithm 1 with parameters $d$ and $\mathcal{R}$ to the sequence $Y_1, Y_2, \ldots$.

Using the standard results, Theorem 1 of Zinkevich (2003), we have for every $T$,

(54)

These bounds quantify the performance of Algorithm 1 against the best fixed $c \in \mathcal{B}$. We now proceed to compare the loss of the best fixed $c$ with the loss of the best Kalman filter.

In what follows we assume that $t$ is large enough for the approximation (53) to hold. Indeed, since this threshold is independent of $T$, the sequences are bounded, and the family $\mathcal{F}$ is finite, the whole regret up to that time can be dominated by a constant.

It then remains to observe that

(55)

To see this, write