Supervised Learning for Dynamical System Learning

05/20/2015 · Ahmed Hefny et al. · Carnegie Mellon University

Recently there has been substantial interest in spectral methods for learning dynamical systems. These methods are popular since they often offer a good tradeoff between computational and statistical efficiency. Unfortunately, they can be difficult to use and extend in practice: e.g., they can make it difficult to incorporate prior information such as sparsity or structure. To address this problem, we present a new view of dynamical system learning: we show how to learn dynamical systems by solving a sequence of ordinary supervised learning problems, thereby allowing users to incorporate prior knowledge via standard techniques such as L1 regularization. Many existing spectral methods are special cases of this new framework, using linear regression as the supervised learner. We demonstrate the effectiveness of our framework by showing examples where nonlinear regression or lasso let us learn better state representations than plain linear regression does; the correctness of these instances follows directly from our general analysis.


1 Introduction

Likelihood-based approaches to learning dynamical systems, such as EM [1] and MCMC [2], can be slow and suffer from local optima. This difficulty has resulted in the development of so-called "spectral algorithms" [3], which rely on factorization of a matrix of observable moments; these algorithms are often fast, simple, and globally optimal.

Despite these advantages, spectral algorithms fall short in one important aspect compared to EM and MCMC: the latter two methods are meta-algorithms or frameworks that offer a clear template for developing new instances incorporating various forms of prior knowledge. For spectral algorithms, by contrast, there is no clear template to go from a set of probabilistic assumptions to an algorithm. In fact, researchers often relax model assumptions to make the algorithm design process easier, potentially discarding valuable information in the process.

To address this problem, we propose a new framework for dynamical system learning, using the idea of instrumental-variable regression [4, 5] to transform dynamical system learning to a sequence of ordinary supervised learning problems. This transformation allows us to apply the rich literature on supervised learning to incorporate many types of prior knowledge. Our new methods subsume a variety of existing spectral algorithms as special cases.

The remainder of this paper is organized as follows: first we formulate the new learning framework (Sec. 2). We then provide theoretical guarantees for the proposed methods (Sec. 4). Finally, we give two examples of how our techniques let us rapidly design new and useful dynamical system learning methods by encoding modeling assumptions (Sec. 5).

2 A framework for spectral algorithms

Figure 1: A latent-state dynamical system. The observation $o_t$ is determined by the latent state $s_t$ and noise $\epsilon_t$.
Figure 2: Learning and applying a dynamical system with instrumental regression. The predictions from S1 provide training data to S2. At test time, we filter or predict using the weights from S2.

A dynamical system is a stochastic process (i.e., a distribution over sequences of observations) such that, at any time, the distribution of future observations is fully determined by a vector $s_t$ called the latent state. The process is specified by three distributions: the initial state distribution $P(s_1)$, the state transition distribution $P(s_{t+1} \mid s_t)$, and the observation distribution $P(o_t \mid s_t)$. For later use, we write the observation as a function of the state and random noise, $o_t = G(s_t, \epsilon_t)$, as shown in Figure 1.

Given a dynamical system, one of the fundamental tasks is to perform inference, where we predict future observations given a history of observations. Typically this is accomplished by maintaining a distribution or belief over states $P(s_t \mid o_{1:t-1})$, where $o_{1:t-1}$ denotes the first $t-1$ observations. This belief represents both our knowledge and our uncertainty about the true state of the system. Two core inference tasks are filtering and prediction. (There are other forms of inference in addition to filtering and prediction, such as smoothing and likelihood evaluation, but they are outside the scope of this paper.) In filtering, given the current belief and a new observation $o_t$, we calculate an updated belief that incorporates $o_t$. In prediction, we project our belief into the future: given a belief $P(s_t \mid o_{1:t-1})$ we estimate $P(s_{t+k} \mid o_{1:t-1})$ for some $k > 0$ (without incorporating any intervening observations).

The typical approach for learning a dynamical system is to explicitly learn the initial, transition, and observation distributions by maximum likelihood. Spectral algorithms offer an alternate approach to learning: they instead use the method of moments to set up a system of equations that can be solved in closed form to recover estimates of the desired parameters. In this process, they typically factorize a matrix or tensor of observed moments—hence the name “spectral.”

Spectral algorithms often (but not always [6]) avoid explicitly estimating the latent state or the initial, transition, or observation distributions; instead they recover observable operators that can be used to perform filtering and prediction directly. To do so, they use an observable representation: instead of maintaining a belief over latent states, they maintain the expected value of a sufficient statistic of future observations. Such a representation is often called a (transformed) predictive state [7].

In more detail, we define the predictive state $q_t = \mathbb{E}[\psi_t \mid o_{1:t-1}]$, where $\psi_t = \psi(o_{t:t+k-1})$ is a vector of future features. The features are chosen such that $q_t$ determines the distribution of future observations $P(o_{t:t+k-1} \mid o_{1:t-1})$. (For convenience we assume that the system is $k$-observable: that is, the distribution of all future observations is determined by the distribution of the next $k$ observations. Note: not by the next $k$ observations themselves. At the cost of additional notation, this restriction could easily be lifted.) Filtering then becomes the process of mapping a predictive state $q_t$ to $q_{t+1}$ conditioned on $o_t$, while prediction maps a predictive state $q_t$ to $q_{t+1}$ without conditioning on the intervening observation.

A typical way to derive a spectral method is to select a set of moments involving $\psi_t$ and $h_t$, work out the expected values of these moments in terms of the observable operators, then invert this relationship to get an equation for the observable operators in terms of the moments. We can then plug in an empirical estimate of the moments to compute estimates of the observable operators.

While effective, this approach can be statistically inefficient (the goal of being able to solve for the observable operators is in conflict with the goal of maximizing statistical efficiency) and can make it difficult to incorporate prior information (each new source of information leads to new moments and a different and possibly harder set of equations to solve). To address these problems, we show that we can instead learn the observable operators by solving three supervised learning problems.

The main idea is that, just as we can represent a belief about a latent state as the conditional expectation of a vector of observable statistics, we can also represent any other distributions needed for prediction and filtering via their own vectors of observable statistics. Given such a representation, we can learn to filter and predict by learning how to map these vectors to one another.

In particular, the key intermediate quantity for filtering is the "extended and marginalized" belief over the current observation and next state, or equivalently the distribution of the extended future $P(o_{t:t+k} \mid o_{1:t-1})$. We represent this distribution via a vector of features of the extended future, $\xi_t = \xi(o_{t:t+k})$. The features are chosen such that the extended state $p_t = \mathbb{E}[\xi_t \mid o_{1:t-1}]$ determines $P(o_{t:t+k} \mid o_{1:t-1})$. Given $p_t$, filtering and prediction reduce respectively to conditioning on and marginalizing over $o_t$.
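In symbols (notation introduced here for concreteness, matching the conventions used throughout this section):

$q_t = \mathbb{E}[\psi_t \mid o_{1:t-1}], \qquad p_t = \mathbb{E}[\xi_t \mid o_{1:t-1}], \qquad q_{t+1} = f_{\text{filter}}(p_t, o_t) \ \text{ or } \ q_{t+1} = f_{\text{predict}}(p_t),$

where $f_{\text{filter}}$ and $f_{\text{predict}}$ are the model-specific conditioning and marginalization maps defined in the recipe below.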

In many models (including Hidden Markov Models (HMMs) and Kalman filters), the extended state $p_t$ is linearly related to the predictive state $q_t$, a property we exploit for our framework. That is,

$p_t = W q_t \qquad (1)$

for some linear operator $W$. For example, in a discrete system $\xi_t$ can be an indicator vector representing the joint assignment of the next $k+1$ observations, and $\psi_t$ an indicator vector for the next $k$ observations. The matrix $W$ is then the conditional probability table relating the two assignments.

Our goal, therefore, is to learn this mapping $W$. Naïvely, we might try to use linear regression for this purpose, substituting samples of $\psi_t$ and $\xi_t$ in place of $q_t$ and $p_t$, since we cannot observe $q_t$ or $p_t$ directly. Unfortunately, due to the overlap between observation windows, the noise terms on $\psi_t$ and $\xi_t$ are correlated. So, naïve linear regression will give a biased estimate of $W$.

To counteract this bias, we employ instrumental regression [4, 5]. Instrumental regression uses instrumental variables that are correlated with the input $q_t$ but not with the noise. This property provides a criterion to denoise the inputs and outputs of the original regression problem: we remove the part of the input/output that is not correlated with the instrumental variables. In our case, since past observations do not overlap with the future or extended-future windows, they are not correlated with the noise on $\psi_t$ and $\xi_t$, as can be seen in Figure 1. Therefore, we can use history features $h_t = h(o_{1:t-1})$ as instrumental variables.
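The following toy simulation (ours, not from the paper; all variable names are illustrative) shows the bias and its fix: shared noise inflates the naive estimate, while projecting onto the instrument recovers the true map.

```python
# Minimal sketch: naive regression on noisy estimates is biased; an
# instrument (correlated with the input, uncorrelated with the noise)
# denoises both sides and removes the bias.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
W_true = 2.0                    # true linear map from x to y

h = rng.normal(size=N)          # instrument: correlated with x, not with noise
noise = rng.normal(size=N)      # noise shared by input and output
x = h + noise                   # noisy "future features"
y = W_true * x + noise          # noisy "extended future features"

# Naive least squares: biased because noise appears in both x and y.
W_naive = (x @ y) / (x @ x)

# Instrumental regression: S1 denoises x and y by regressing each on h,
# then S2 regresses the denoised output on the denoised input.
x_hat = h * (h @ x) / (h @ h)   # estimate of E[x | h] by linear regression
y_hat = h * (h @ y) / (h @ h)   # estimate of E[y | h] by linear regression
W_iv = (x_hat @ y_hat) / (x_hat @ x_hat)

print(f"naive: {W_naive:.3f}, instrumental: {W_iv:.3f}, true: {W_true}")
# The naive estimate is pulled away from 2.0 by the shared noise;
# the instrumental estimate recovers it up to sampling error.
```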

In more detail, by taking the expectation of (1) given the history features $h_t$, we obtain an instrument-based moment condition: for all $t$,

$\mathbb{E}[\xi_t \mid h_t] = W\, \mathbb{E}[\psi_t \mid h_t]. \qquad (2)$

Assuming that there are enough independent dimensions in $h_t$ that are correlated with $q_t$, we maintain the rank of the moment condition when moving from (1) to (2), and we can recover $W$ by least squares regression if we can compute $\mathbb{E}[\psi_t \mid h_t]$ and $\mathbb{E}[\xi_t \mid h_t]$ for sufficiently many examples $t$.


Fortunately, conditional expectations such as $\mathbb{E}[\psi_t \mid h_t]$ are exactly what supervised learning algorithms are designed to compute. So, we arrive at our learning framework: we first use supervised learning to estimate $\mathbb{E}[\psi_t \mid h_t]$ and $\mathbb{E}[\xi_t \mid h_t]$, effectively denoising the training examples, and then use these estimates to compute $W$ by finding the least squares solution to (2).

In summary, learning and inference of a dynamical system through instrumental regression can be described as follows (a code sketch of the recipe follows the list):

  • Model Specification: Pick features of history $h_t = h(o_{1:t-1})$, future $\psi_t = \psi(o_{t:t+k-1})$, and extended future $\xi_t = \xi(o_{t:t+k})$. $\psi_t$ must be a sufficient statistic for $P(o_{t:t+k-1} \mid o_{1:t-1})$. $\xi_t$ must satisfy

    • $\mathbb{E}[\psi_{t+1} \mid o_{1:t}] = f_{\text{filter}}(\mathbb{E}[\xi_t \mid o_{1:t-1}], o_t)$ for a known function $f_{\text{filter}}$.

    • $\mathbb{E}[\psi_{t+1} \mid o_{1:t-1}] = f_{\text{predict}}(\mathbb{E}[\xi_t \mid o_{1:t-1}])$ for a known function $f_{\text{predict}}$.

  • S1A (Stage 1A) Regression: Learn a (possibly non-linear) regression model to estimate $\bar\psi_t = \mathbb{E}[\psi_t \mid h_t]$. The training data for this model are pairs $(h_t, \psi_t)$ across time steps $t$. (Our bounds assume that the training time steps are sufficiently spaced for the underlying process to mix, but in practice, the error will only get smaller if we consider all time steps $t$.)

  • S1B Regression: Learn a (possibly non-linear) regression model to estimate $\bar\xi_t = \mathbb{E}[\xi_t \mid h_t]$. The training data for this model are pairs $(h_t, \xi_t)$ across time steps $t$.

  • S2 Regression: Use the feature expectations estimated in S1A and S1B to train a model to predict $\bar\xi_t = W \bar\psi_t$, where $W$ is a linear operator. The training data for this model are estimates of $(\bar\psi_t, \bar\xi_t)$ obtained from S1A and S1B across time steps $t$.

  • Initial State Estimation: Estimate an initial state $q_1 = \mathbb{E}[\psi_1]$ by averaging $\psi_1$ across several example realizations of our time series. (Assuming ergodicity, we can set the initial state to be the empirical average vector of future features in a single long sequence, $\hat q_1 = \frac{1}{T}\sum_{t=1}^{T}\psi_t$.)

  • Inference: Starting from the initial state $q_1$, we can maintain the predictive state through filtering: given $q_t$ we compute $p_t = W q_t$. Then, given the observation $o_t$, we can compute $q_{t+1} = f_{\text{filter}}(p_t, o_t)$. Or, in the absence of $o_t$, we can predict the next state $q_{t+1} = f_{\text{predict}}(p_t)$. Finally, by definition, the predictive state $q_t$ is sufficient to compute $P(o_{t:t+k-1} \mid o_{1:t-1})$. (It might seem reasonable to learn the composite map from $q_t$ and $o_t$ to $q_{t+1}$ directly, thereby avoiding the need to separately estimate $W$ and condition on $o_t$. Unfortunately, this map is nonlinear for common models such as HMMs.)
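The following is a minimal Python sketch of this recipe, assuming ridge regression for all three stages and a model-specific $f_{\text{filter}}$ supplied by the caller; the feature construction and names are placeholders, not the paper's reference implementation.

```python
# Two-stage instrumental regression with linear/ridge models (a sketch).
import numpy as np
from sklearn.linear_model import Ridge

def two_stage_regression(H, Psi, Xi, lam=1e-3):
    """H, Psi, Xi: arrays of shape (N, d_h), (N, d_psi), (N, d_xi) holding
    history, future, and extended-future features at matched time steps."""
    s1a = Ridge(alpha=lam).fit(H, Psi)   # S1A: estimate E[psi_t | h_t]
    s1b = Ridge(alpha=lam).fit(H, Xi)    # S1B: estimate E[xi_t | h_t]
    Q = s1a.predict(H)                   # denoised predictive states
    P = s1b.predict(H)                   # denoised extended states
    # S2: ridge solution of P ~ Q W^T, i.e. p_t = W q_t row by row.
    W = np.linalg.solve(Q.T @ Q + lam * np.eye(Q.shape[1]), Q.T @ P).T
    q0 = Psi.mean(axis=0)                # initial state: average future features
    return W, q0

def run_filter(W, q0, observations, f_filter):
    """f_filter is the model-specific conditioning step mapping the
    extended state p_t and observation o_t to the next predictive state."""
    q = q0
    for o in observations:
        p = W @ q             # extend: p_t = W q_t
        q = f_filter(p, o)    # condition on the observed o_t
    return q
```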

The process of learning and inference is depicted in Figure 2. Modeling assumptions are reflected in the choice of the statistics $h_t$, $\psi_t$, and $\xi_t$, as well as the regression models in stages S1A and S1B. Table 1 demonstrates that we can recover existing spectral algorithms for dynamical system learning using linear S1 regression. In addition to providing a unifying view of some successful learning algorithms, the new framework also paves the way for extending these algorithms in a theoretically justified manner, as we demonstrate in the experiments below.

  • Spectral algorithm for HMMs [3]: future features are projected indicator vectors of the next $k$ observations, $\psi_t = U^\top e(o_{t:t+k-1})$, where $e(\cdot)$ is an indicator vector and $U$ spans the range of the predictive state (typically the top left singular vectors of the joint probability table of future and past observations); extended future features are defined analogously over the next $k+1$ observations. A state normalizer is estimated from the S1A output states.

  • SSID for Kalman filters (time-dependent gain): future features are $U^\top o_{t:t+k-1}$ together with second-moment terms, where $U$ spans the range of the predictive state (typically the top left singular vectors of the covariance between future and past windows); extended future features are formed by stacking the corresponding moments of $o_{t:t+k}$. The extended state specifies a Gaussian distribution in which conditioning on $o_t$ is straightforward.

  • SSID for stable Kalman filters (constant gain): future features are $U^\top o_{t:t+k-1}$ ($U$ obtained as above). The steady-state covariance is estimated by solving a Riccati equation [8]; the extended state together with the steady-state covariance specifies a Gaussian distribution in which conditioning on $o_t$ is straightforward.

  • Uncontrolled HSE-PSR [9]: future features are evaluation functionals of a characteristic kernel; filtering uses the kernel Bayes rule [10].

Table 1: Examples of existing spectral algorithms reformulated as two-stage instrumental regression with linear S1 regression. Here $o_{t:t+k-1}$ is the vector formed by stacking observations $o_t$ through $o_{t+k-1}$, and $\otimes$ denotes the outer product. Details and derivations can be found in the supplementary material.

3 Related Work

This work extends predictive state learning algorithms for dynamical systems, which include spectral algorithms for Kalman filters [11], Hidden Markov Models [3, 12], Predictive State Representations (PSRs) [13, 14] and Weighted Automata [15]. It also extends kernel variants such as [9], which builds on [16]. All of the above work effectively uses linear regression or linear ridge regression (although not always in an obvious way).

One common aspect of predictive state learning algorithms is that they exploit the covariance structure between future and past observation sequences to obtain an unbiased observable state representation. Boots and Gordon [17] note the connection between this covariance and (linear) instrumental regression in the context of the HSE-HMM. We use this connection to build a general framework for dynamical system learning where the state space can be identified using arbitrary (possibly nonlinear) supervised learning methods. This generalization lets us incorporate prior knowledge to learn compact or regularized models; our experiments demonstrate that this flexibility lets us take better advantage of limited data.

Reducing the problem of learning dynamical systems with latent state to supervised learning bears similarity to Langford et al.’s sufficient posterior representation (SPR) [18], which encodes the state by the sufficient statistics of the conditional distribution of the next observation and represents system dynamics by three vector-valued functions that are estimated using supervised learning approaches. While SPR allows all of these functions to be non-linear, it involves a rather complicated training procedure involving multiple iterations of model refinement and model averaging, whereas our framework only requires solving three regression problems in sequence. In addition, the theoretical analysis of [18] only establishes the consistency of SPR learning assuming that all regression steps are solved perfectly. Our work, on the other hand, establishes convergence rates based on the performance of S1 regression.

4 Theoretical Analysis

In this section we present error bounds for two-stage instrumental regression. These bounds hold regardless of the particular S1 regression method used, assuming that the S1 predictions converge to the true conditional expectations. The bounds imply that our overall method is consistent.

Let $\{(x_i, y_i, z_i)\}_{i=1}^N$ be i.i.d. triplets of input, output, and instrumental variables. (Lack of independence will result in slower convergence in proportion to the mixing time of our process.) Let $\bar x_i$ and $\bar y_i$ denote $\mathbb{E}[x_i \mid z_i]$ and $\mathbb{E}[y_i \mid z_i]$. And, let $\hat x_i$ and $\hat y_i$ denote the estimates of $\bar x_i$ and $\bar y_i$ produced by the S1A and S1B regression steps. Here $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$; in our dynamical-system setting, $x_i = \psi_{t_i}$, $y_i = \xi_{t_i}$, and $z_i = h_{t_i}$.

We want to analyze the convergence of the output of S2 regression, that is, of the weights $\hat W_\lambda$ given by ridge regression between S1A outputs and S1B outputs:

$\hat W_\lambda = \Big(\tfrac{1}{N}\textstyle\sum_{i=1}^{N} \hat y_i \otimes \hat x_i\Big)\Big(\tfrac{1}{N}\textstyle\sum_{i=1}^{N} \hat x_i \otimes \hat x_i + \lambda I\Big)^{-1} \qquad (3)$

Here $\otimes$ denotes tensor (outer) product, and $\lambda > 0$ is a regularization parameter that ensures the invertibility of the estimated covariance.

Before we state our main theorem we need to quantify the quality of S1 regression in a way that is independent of the S1 functional form. To do so, we place a bound on the S1 error and assume that this bound converges to zero: given the definition below, for each fixed $\delta$, $\eta_{N,\delta} \to 0$ as $N \to \infty$.

Definition 1 (S1 Regression Bound).

For any $\delta \in (0, 1)$ and $N \in \mathbb{N}$, the S1 regression bound $\eta_{N,\delta} > 0$ is a number such that, with probability at least $1 - \delta$, for all $1 \le i \le N$:

$\|\hat x_i - \bar x_i\| \le \eta_{N,\delta} \quad \text{and} \quad \|\hat y_i - \bar y_i\| \le \eta_{N,\delta}.$

In many applications, $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{Z}$ will be finite dimensional real vector spaces. However, for generality we state our results in terms of arbitrary reproducing kernel Hilbert spaces. In this case S2 uses kernel ridge regression, leading to methods such as HSE-PSRs. For this purpose, let $\Sigma_{\bar x}$ and $\Sigma_{\bar y}$ denote the (uncentered) covariance operators of $\bar x$ and $\bar y$ respectively: $\Sigma_{\bar x} = \mathbb{E}[\bar x \otimes \bar x]$ and $\Sigma_{\bar y} = \mathbb{E}[\bar y \otimes \bar y]$. And, let $\mathcal{R}(\Sigma_{\bar x})$ denote the closure of the range of $\Sigma_{\bar x}$.


With the above assumptions, Theorem 2 gives a generic error bound on S2 regression in terms of S1 regression. If $\mathcal{X}$ and $\mathcal{Y}$ are finite dimensional and $\Sigma_{\bar x}$ has full rank, then using ordinary least squares (i.e., setting $\lambda = 0$) will give the same bound, but with $\lambda$ in the first two terms replaced by the minimum eigenvalue of $\Sigma_{\bar x}$, and the last term dropped.

Theorem 2.

Assume that $\|\bar x\|, \|\bar y\| \le c < \infty$ almost surely. Assume $W$ is a Hilbert-Schmidt operator, and let $\hat W_\lambda$ be as defined in (3). Then, with probability at least $1 - \delta$, for each $q \in \mathcal{R}(\Sigma_{\bar x})$ such that $\|q\| \le 1$, the error $\|\hat W_\lambda q - W q\|$ is bounded by a sum of three terms: two that scale with the S1 regression bound $\eta_{N,\delta}$ (and inversely with $\lambda$), and a third term that vanishes as $\lambda \to 0$; the explicit expression is given in the supplementary material.


We defer the proof to the supplementary material. The supplementary material also provides explicit finite-sample bounds (including expressions for the constants hidden by the $O(\cdot)$ notation), as well as concrete examples of S1 regression bounds for practical regression models.

Theorem 2 assumes that the state $q$ is in $\mathcal{R}(\Sigma_{\bar x})$. For dynamical systems, all valid states satisfy this property. However, with finite data, estimation errors may cause the estimated state $\hat q_t$ (i.e., the result of applying the learned filter to a finite history) to have a non-zero component in the orthogonal complement $\mathcal{R}(\Sigma_{\bar x})^{\perp}$. Lemma 3 bounds the effect of such errors: it states that, in a stable system, this component gets smaller as S1 regression performs better. The main limitation of Lemma 3 is the assumption that $f_{\text{filter}}$ is $L$-Lipschitz, which essentially means that the model's estimated probability for each observation is bounded below. There is no way to guarantee this property in practice; so, Lemma 3 provides suggestive evidence rather than a guarantee that our learned dynamical system will predict well.

Lemma 3.

For observations $o_{1:t}$, let $\hat q_t$ be the estimated state given $o_{1:t-1}$, and let $\tilde q_t$ be the projection of $\hat q_t$ onto $\mathcal{R}(\Sigma_{\bar x})$. Assume $f_{\text{filter}}$ is $L$-Lipschitz on $\mathcal{R}(\Sigma_{\bar x})$ when evaluated at the observed $o_t$, and that the states remain bounded. Given the assumptions of Theorem 2, the following holds for all $t$ with probability at least $1 - \delta$: the component $\|\hat q_t - \tilde q_t\|$ is bounded by a quantity that diminishes at the same rate as the S1 regression bound $\eta_{N,\delta}$ (the explicit bound is given in the supplementary material).

Since the Lipschitz constant $L$ is bounded, the prediction error due to this orthogonal component diminishes at the same rate as $\eta_{N,\delta}$.

5 Experiments and Results

We now demonstrate examples of tweaking the S1 regression to gain an advantage. In the first experiment we show that nonlinear regression can be used to reduce the number of parameters needed in S1, thereby improving statistical performance for learning an HMM. In the second experiment we show that we can encode prior knowledge as regularization.

5.1 Learning A Knowledge Tracing Model

In this experiment we attempt to model and predict the performance of students learning from an interactive computer-based tutor. We use the Bayesian knowledge tracing (BKT) model [19], which is essentially a 2-state HMM: the state represents whether a student has learned a knowledge component (KC), and the observation represents success or failure on each question in a sequence of questions that cover this KC. Figure 3 summarizes the model. The events denoted by guessing, slipping, learning, and forgetting typically have relatively low probabilities.

Figure 3: Transitions and observations in BKT. Each node represents a possible value of the state or observation. Solid arrows represent transitions while dashed arrows represent observations.

5.1.1 Data Description

We evaluate the model using the “Geometry Area (1996-97)” data available from DataShop [20]. This data was generated by students learning introductory geometry, and contains attempts by 59 students in 12 knowledge components. As is typical for BKT, we consider a student’s attempt at a question to be correct iff the student entered the correct answer on the first try, without requesting any hints from the help system. Each training sequence consists of a sequence of first attempts for a student/KC pair. We discard sequences of length less than 5, resulting in a total of 325 sequences.


5.1.2 Models and Evaluation

Under the (reasonable) assumption that the two states have distinct observation probabilities, this model is 1-observable. Hence we define the predictive state to be the expected next observation, which results in the following statistics: $\psi_t = e(o_t)$ and $\xi_t = e(o_t) \otimes e(o_{t+1})$, where $e(o_t)$ is a 2-dimensional indicator vector and $\otimes$ denotes the Kronecker product. Given these statistics, the extended state $p_t$ is a (flattened) joint probability table of $(o_t, o_{t+1})$.
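Concretely, filtering in this model reduces to a small table operation (our rendering, consistent with the statistics above): the extended state $p_t = W q_t \in \mathbb{R}^4$ can be reshaped into a $2 \times 2$ table $\Pi$ with $\Pi_{ij} \approx P(o_t = i, o_{t+1} = j \mid o_{1:t-1})$. Conditioning on an observed outcome $o_t = i$ gives the next predictive state

$q_{t+1} = \Pi_{i,:} \big/ \textstyle\sum_j \Pi_{ij},$

while marginalizing over $o_t$ (for prediction) gives $q_{t+1} = \Pi^\top \mathbf{1}$, the vector of column sums.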

We compare three models that differ by history features and S1 regression method:

Spec-HMM: This baseline uses $h_t = \psi_{t-1}$ (an indicator vector of the previous observation) and linear S1 regression, making it equivalent to the spectral HMM method of [3], as detailed in the supplementary material.

Feat-HMM: This baseline represents $h_t$ by an indicator vector of the joint assignment of the previous $b$ observations (we set $b$ to 4) and uses linear S1 regression. This is essentially a feature-based spectral HMM [12]. It thus incorporates more history information compared to Spec-HMM at the expense of increasing the number of S1 parameters exponentially in $b$.

LR-HMM: This model represents $h_t$ by a binary vector of length $b$ encoding the previous $b$ observations and uses logistic regression as the S1 model. Thus, it uses the same history information as Feat-HMM but reduces the number of parameters to $O(b)$ at the expense of inductive bias.
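A sketch of the LR-HMM S1A stage under the setup above, with scikit-learn's LogisticRegression standing in for the MATLAB routine used in the paper; the helper name and data layout are assumptions:

```python
# S1A for LR-HMM (a sketch): histories are the previous b binary outcomes,
# the future statistic is the next outcome, and logistic regression gives
# the denoised predictive state q_t = P(correct | history).
import numpy as np
from sklearn.linear_model import LogisticRegression

def s1a_logistic(sequences, b=4):
    """sequences: list of 0/1 arrays of first-attempt correctness
    for each student/KC pair."""
    H, y = [], []
    for seq in sequences:
        for t in range(b, len(seq)):
            H.append(seq[t - b:t])   # previous b observations, raw binary
            y.append(seq[t])         # next observation
    model = LogisticRegression().fit(np.array(H), np.array(y))
    # model.predict_proba(features)[:, 1] yields the q_t estimates.
    return model
```

With raw binary features the model has only $b + 1$ parameters, versus the $2^b$-dimensional indicator representation used by Feat-HMM.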

We evaluated the above models using 1000 random splits of the 325 sequences into 200 training and 125 testing sequences. For each testing observation we compute the absolute error between the actual observation and its predicted expected value. We report the mean absolute error for each split. The results are displayed in Figure 4. (The differences have similar sign but smaller magnitude if we use RMSE instead of MAE.)

We see that, while incorporating more history information increases accuracy (Feat-HMM vs. Spec-HMM), being able to incorporate the same information using a more compact model gives an additional gain in accuracy (LR-HMM vs. Feat-HMM). We also compared the LR-HMM method to an HMM trained using expectation maximization (EM). We found that the LR-HMM model is much faster to train than EM while being on par with it in terms of prediction error. (We used MATLAB's built-in logistic regression and EM functions.)


Model                                   Spec-HMM   Feat-HMM   LR-HMM   EM
Training time (relative to Spec-HMM)    1          1.02       2.219    14.323
Figure 4: Experimental results: each graph compares the performance of two models (measured by mean absolute error) on 1000 train/test splits. The black line is the diagonal $y = x$; points below this line indicate that the model on the vertical axis achieves lower error than the model on the horizontal axis. The table shows training time.

5.2 Modeling Independent Subsystems Using Lasso Regression

Spectral algorithms for Kalman filters typically use the left singular vectors of the covariance between history and future features as a basis for the state space. However, this basis hides any sparsity that might be present in our original basis. In this experiment, we show that we can instead use lasso (without dimensionality reduction) as our S1 regression algorithm to discover sparsity. This is useful, for example, when the system consists of multiple independent subsystems, each of which affects a subset of the observation coordinates.

To test this idea we generate a sequence of 30-dimensional observations from a Kalman filter. Observation dimensions 1 through 10 and 11 through 20 are generated from two independent subsystems of state dimension 5. Dimensions 21 through 30 are generated from white noise. Each subsystem's transition and observation matrices have random Gaussian coordinates, with the transition matrix scaled to have a maximum eigenvalue of 0.95. States and observations are perturbed by Gaussian noise.
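A sketch of this data-generating process follows (the noise scales are assumptions; the paper's exact covariances are not reproduced here):

```python
# Synthetic data: two independent 5-dim subsystems driving dims 1-10 and
# 11-20, plus 10 dimensions of pure white noise.
import numpy as np

rng = np.random.default_rng(0)

def random_subsystem(d_state=5, d_obs=10):
    A = rng.normal(size=(d_state, d_state))
    A *= 0.95 / np.max(np.abs(np.linalg.eigvals(A)))  # max eigenvalue 0.95
    C = rng.normal(size=(d_obs, d_state))
    return A, C

(A1, C1), (A2, C2) = random_subsystem(), random_subsystem()
T = 1000
x1, x2 = np.zeros(5), np.zeros(5)
obs = np.zeros((T, 30))
for t in range(T):
    x1 = A1 @ x1 + rng.normal(scale=0.1, size=5)   # state noise (scale assumed)
    x2 = A2 @ x2 + rng.normal(scale=0.1, size=5)
    obs[t, :10]   = C1 @ x1 + rng.normal(scale=0.1, size=10)  # subsystem 1
    obs[t, 10:20] = C2 @ x2 + rng.normal(scale=0.1, size=10)  # subsystem 2
    obs[t, 20:]   = rng.normal(size=10)                       # white noise
```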

We estimate the state space basis using 1000 examples (assuming 1-observability) and compare the singular vectors of the past-to-future regression matrix to those obtained from the lasso regression matrix. The result is shown in Figure 5. Clearly, using lasso as the S1 regression results in a basis that better matches the structure of the underlying system.
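Continuing the sketch above (it reuses the `obs` array generated there), lasso S1 regression might look like the following; `MultiTaskLasso` and the penalty `alpha` are our choices, not necessarily the paper's:

```python
# Lasso as S1 regression (assuming 1-observability): regress one-step
# futures on one-step pasts with an L1 penalty, then inspect the left
# singular vectors of the sparse weight matrix.
from sklearn.linear_model import MultiTaskLasso
import numpy as np

past, future = obs[:-1], obs[1:]                   # one-step windows
s1 = MultiTaskLasso(alpha=0.05).fit(past, future)  # alpha is an assumption
W1 = s1.coef_                                      # (30, 30) sparse weights

U, s, Vt = np.linalg.svd(W1)
basis = U[:, :10]          # candidate state-space basis vectors
# For the two-subsystem process, these vectors concentrate on coordinates
# 1-10 or 11-20, unlike the dense singular vectors of the raw covariance.
```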

Figure 5: Left singular vectors of (left) the true linear predictor from past to future observations, (middle) the covariance matrix between past and future features, and (right) the S1 sparse regression weights. Each column corresponds to a singular vector (only absolute values are depicted). Singular vectors are ordered by their mean coordinate, interpreting absolute values as a probability distribution over coordinates.

6 Conclusion

In this work we developed a general framework for dynamical system learning using supervised learning methods. The framework relies on two key principles: first, we extend the idea of predictive state to include extended state as well, allowing us to represent all of inference in terms of predictions of observable features. Second, we use past features as instruments in an instrumental regression, denoising state estimates that then serve as training examples to estimate system dynamics.

We have shown that this framework encompasses and provides a unified view of some previous successful dynamical system learning algorithms. We have also demonstrated that it can be used to extend existing algorithms to incorporate nonlinearity and regularizers, resulting in better state estimates. As future work, we would like to apply this framework to leverage additional techniques such as manifold embedding and transfer learning in stage 1 regression. We would also like to extend the framework to controlled processes.

References

  • [1] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970.
  • [2] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, London, 1996.
  • [3] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.
  • [4] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2000.
  • [5] J.H. Stock and M.W. Watson. Introduction to Econometrics. Addison-Wesley series in economics. Addison-Wesley, 2011.
  • [6] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.
  • [7] Matthew Rosencrantz and Geoff Gordon. Learning low dimensional predictive representations. In ICML ’04: Twenty-first international conference on Machine learning, pages 695–702, 2004.
  • [8] P. van Overschee and L.R. de Moor. Subspace identification for linear systems: theory, implementation, applications. Kluwer Academic Publishers, 1996.
  • [9] Byron Boots, Arthur Gretton, and Geoffrey J. Gordon. Hilbert space embeddings of predictive state representations. In Proc. 29th Intl. Conf. on Uncertainty in Artificial Intelligence (UAI), 2013.
  • [10] Kenji Fukumizu, Le Song, and Arthur Gretton. Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14(1):3753–3783, 2013.
  • [11] Byron Boots. Spectral Approaches to Learning Predictive Representations. PhD thesis, Carnegie Mellon University, December 2012.
  • [12] Sajid Siddiqi, Byron Boots, and Geoffrey J. Gordon. Reduced-rank hidden Markov models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-2010), 2010.
  • [13] Byron Boots, Sajid Siddiqi, and Geoffrey Gordon. Closing the learning-planning loop with predictive state representations. International Journal of Robotics Research, 30:954–956, 2011.
  • [14] Byron Boots and Geoffrey Gordon. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In Proceedings of the 25th National Conference on Artificial Intelligence (AAAI-2011), 2011.
  • [15] Borja Balle, William Hamilton, and Joelle Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In Tony Jebara and Eric P. Xing, editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1386–1394. JMLR Workshop and Conference Proceedings, 2014.
  • [16] L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In Proc. 27th Intl. Conf. on Machine Learning (ICML), 2010.
  • [17] Byron Boots and Geoffrey Gordon. Two-manifold problems with applications to nonlinear system identification. In Proc. 29th Intl. Conf. on Machine Learning (ICML), 2012.
  • [18] John Langford, Ruslan Salakhutdinov, and Tong Zhang. Learning nonlinear dynamic models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pages 593–600, 2009.
  • [19] Albert T. Corbett and John R. Anderson. Knowledge tracing: Modelling the acquisition of procedural knowledge. User Model. User-Adapt. Interact., 4(4):253–278, 1995.
  • [20] Kenneth R. Koedinger, R. S. J. Baker, K. Cunningham, A. Skogsholm, B. Leber, and John Stamper. A data repository for the EDM community: The PSLC DataShop. Handbook of Educational Data Mining, pages 43–55, 2010.
  • [21] Le Song, Jonathan Huang, Alexander J. Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pages 961–968, 2009.
  • [22] Daniel Hsu, Sham M Kakade, and Tong Zhang. Tail inequalities for sums of random matrices that depend on the intrinsic dimension. Electronic Communications in Probability, 17(14):1–13, 2012.
  • [23] Joel A. Tropp. User-friendly tools for random matrices: An introduction. NIPS Tutorial, 2012.
  • [24] Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pages 9.1–9.24, 2012.

Appendix A Spectral and HSE Dynamical System Learning as Regression

In this section we provide examples of mapping some of the successful dynamical system learning algorithms to our framework.

A.1 HMM

In this section we show that we can use the instrumental regression framework to reproduce the spectral algorithm for learning HMMs [3]. We consider 1-observable models, but the argument applies to $k$-observable models. In this case we use $\psi_t = e(o_t)$ and $\xi_t = e(o_t) \otimes e(o_{t+1})$, where $e(\cdot)$ denotes an indicator vector and $\otimes$ denotes the Kronecker product. Let $P_{2,1}$ be the joint probability table of two consecutive observations, and let $\hat P_{2,1}$ be its estimate from the data. We start with the (very restrictive) case where $P_{2,1}$ is invertible. Given samples of $h_t$, $\psi_t$, and $\xi_t$, in S1 regression we apply linear regression to learn estimates of

$\bar\psi_t = \mathbb{E}[\psi_t \mid h_t] \qquad$ (A.1)
$\bar\xi_t = \mathbb{E}[\xi_t \mid h_t] \qquad$ (A.2)

In S2 regression, we learn the matrix $W$ that gives the least squares solution to the system of equations $\bar\xi_t = W \bar\psi_t$, which gives

$\hat W = \Big(\textstyle\sum_t \bar\xi_t \bar\psi_t^\top\Big)\Big(\textstyle\sum_t \bar\psi_t \bar\psi_t^\top\Big)^{-1} \qquad$ (A.3)

Having learned the matrix $W$, we can estimate $p_t = W q_t$ starting from a state $q_t$. Since $p_t$ specifies a joint distribution over $o_t$ and $o_{t+1}$, we can easily condition on $o_t$ (or marginalize over it) to obtain $q_{t+1}$. We will show that this is equivalent to learning and applying observable operators as in [3]: for a given value $x$ of $o_t$, define

$B_x = \Pi_x W \qquad$ (A.4)

where $\Pi_x$ is the matrix that selects the block of rows of $W$ corresponding to $o_t = x$. (Following the notation used in [3], $B_x$ plays the role of the observable operator.) Filtering multiplies the state by $B_x$ and renormalizes,

with a normalization constant given by the sum of the entries of the selected block. (A.5)

Now we move to a more realistic setting, where the predictive state has lower dimension than the observation vector. We therefore project the predictive state using a matrix $U$ that preserves the dynamics, by requiring that $U^\top O$ be invertible (i.e., the columns of $U$ are an independent set spanning the range of the HMM observation matrix $O$).

It can be shown [3] that the range of $O$ coincides with the range of $P_{2,1}$. Therefore, we can use the leading left singular vectors of $\hat P_{2,1}$, which corresponds to replacing the linear regression in S1A with a reduced rank regression. However, for the sake of our discussion we will use the singular vectors of $P_{2,1}$ itself. In more detail, let $U S V^\top$ be the rank-$m$ SVD decomposition of $P_{2,1}$. We use the projected future features $U^\top \psi_t$ and a correspondingly projected version of $\xi_t$. S1 weights are then estimated as before on the projected features, and S2 weights are given by the corresponding least squares solution. (A.6)

In the limit of infinite data, $U$ spans the range of $O$, and hence $U^\top O$ is invertible. Substituting into (A.6) and simplifying, we define, similar to the full-rank case, for each observation $x$ a selector matrix $\Pi_x$ and an observation operator

$B_x \qquad$ (A.7)

This is exactly the observation operator obtained in [3]. However, instead of using (A.6), they use (A.7) with $P_{3,x,1}$ and $P_{2,1}$ replaced by their empirical estimates.
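For reference, the corresponding quantities in [3] (reproduced here from that paper's notation; the identification with our $W$ is as sketched above) are

$\hat b_1 = \hat U^\top \hat P_1, \qquad \hat b_\infty = (\hat P_{2,1}^\top \hat U)^{+} \hat P_1, \qquad \hat B_x = (\hat U^\top \hat P_{3,x,1})(\hat U^\top \hat P_{2,1})^{+},$

where $\hat P_1$, $\hat P_{2,1}$, and $\hat P_{3,x,1}$ are empirical estimates of the unigram probabilities, the bigram probability table, and the trigram probability table restricted to middle observation $x$, respectively.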

Note that for a state $q_t$, the operator $B_x$ produces the updated state up to normalization. The normalization constant involves a vector $b_\infty$ satisfying $b_\infty^\top q = 1$ for any valid predictive state $q$. To estimate $b_\infty$ we solve this condition for states estimated from all possible values of the history features $h$. This gives a linear system whose columns represent the states obtained for all possible values of $h$, which in turn yields the same estimator proposed in [3].

A.2 Stationary Kalman Filter

A Kalman filter is given by

$s_{t+1} = A s_t + \nu_t, \qquad o_t = C s_t + \epsilon_t, \qquad \nu_t \sim \mathcal{N}(0, Q), \ \epsilon_t \sim \mathcal{N}(0, R).$

We consider the case of a stationary filter, where the state covariance is independent of $t$. We choose our statistics to be stacked windows of observations, $\psi_t = o_{t:t+k-1}$ and $\xi_t = o_{t:t+k}$, where a window of observations is represented by stacking individual observations into a single vector. It can be shown [11, 8] that the expectation of such a window given the current state is linear in the state, and it follows that the predictive state is the image of the latent state under the extended observation operator (the matrix formed by stacking $C, CA, CA^2, \dots$). It follows that the window length $k$ must be large enough for this operator to have full column rank. Let $U$ be the matrix of left singular vectors corresponding to non-zero singular values; then the projection of the extended observation operator through $U$ is invertible, and we can rewrite the model in a form that matches the instrumental regression framework. For the steady-state case (constant Kalman gain), one can estimate the steady-state covariance given the data and the estimated parameters by solving a Riccati equation as described in [8]. The predictive mean and the steady-state covariance then specify a joint Gaussian distribution over the next observations, where marginalization and conditioning can be easily performed.

We can also consider a Kalman filter that is not in the steady state (i.e., the Kalman gain is not constant). In this case we need to maintain sufficient statistics for a predictive Gaussian distribution (i.e., mean and covariance). Let $\mathrm{vec}(\cdot)$ denote the vectorization operation, which stacks the columns of a matrix into a single vector. We can stack the mean and the vectorized second moment into a single vector that we refer to as the 1st+2nd moments vector. We do the same for the future and the extended future. We could, in principle, perform linear regression on these 1st+2nd moment vectors, but that requires an unnecessarily large number of parameters. Instead, we can learn an S1A regression function of the form (A.8)-(A.9), in which the mean is predicted by linear regression and the second moment is obtained from the covariance of the residuals of the 1st moment regression. This is still a linear model in terms of 1st+2nd moment vectors, and hence we can do the same for the S1B and S2 regression models. This way, the extended belief vector (the expectation of the 1st+2nd moments of the extended future) fully specifies a joint Gaussian distribution over the extended future observations.