Mixed LICORS: A Nonparametric Algorithm for Predictive State Reconstruction

11/15/2012
by   Georg M. Goerg, et al.
0

We introduce 'mixed LICORS', an algorithm for learning nonlinear, high-dimensional dynamics from spatio-temporal data, suitable for both prediction and simulation. Mixed LICORS extends the recent LICORS algorithm (Goerg and Shalizi, 2012) from hard clustering of predictive distributions to a non-parametric, EM-like soft clustering. This retains the asymptotic predictive optimality of LICORS, but, as we show in simulations, greatly improves out-of-sample forecasts with limited data. The new method is implemented in the publicly-available R package "LICORS" (http://cran.r-project.org/web/packages/LICORS/).

READ FULL TEXT VIEW PDF

page 6

page 7

page 8

06/08/2015

The LICORS Cabinet: Nonparametric Algorithms for Spatio-temporal Prediction

Spatio-temporal data is intrinsically high dimensional, so unsupervised ...
05/14/2021

A Play on Birds! The staRVe Package for Analyzing Spatio-Temporal Point-Referenced Data in R

We present the R package staRVe for analyzing spatio-temporal point-refe...
12/09/2018

Spatio-Temporal Models for Big Multinomial Data using the Conditional Multivariate Logit-Beta Distribution

We introduce a Bayesian approach for analyzing high-dimensional multinom...
11/11/2018

A Survey of Mixed Data Clustering Algorithms

Most of the datasets normally contain either numeric or categorical feat...
02/04/2022

Identification of prognostic and predictive biomarkers in high-dimensional data with PPLasso

In clinical trials, identification of prognostic and predictive biomarke...
08/28/2020

Nowcasting in a Pandemic using Non-Parametric Mixed Frequency VARs

This paper develops Bayesian econometric methods for posterior and predi...

1 Introduction

Recently Goerg and Shalizi (2012) introduced light cone reconstruction of states (LICORS), a nonparametric procedure for recovering predictive states from spatio-temporal data. Every spatio-temporal process has an associated, latent prediction process, whose measure-valued states are the optimal local predictions of the future at each point in space and time (Shalizi, 2003)

. LICORS consistently estimates this prediction process from a single realization of the manifest space-time process, through agglomerative clustering in the space of predictive distributions; estimated states are clusters. This converges on the minimal set of states capable of optimal prediction of the original process.

Experience with other clustering problems shows that soft-threshold techniques often predict much better than hard-threshold methods. Famously, while -means (Lloyd, 1982)

is very fast and robust, the expectation maximization (EM) algorithm

(Dempster et al., 1977)

in Gaussian mixture models gives better clustering results. Moreover, mixture models allow clusters to be interpreted probabilistically, and new data to be assigned to old clusters.

With this inspiration, we introduce mixed LICORS, a soft-thresholding version of (hard) LICORS. This embeds the prediction process framework of Shalizi (2003) in a mixture-model setting, where the predictive states correspond to the optimal mixing weights on extremal distributions, which are themselves optimized for forecasting. Our proposed nonparametric, EM-like algorithm then follows naturally.

After introducing the prediction problem and fixing notation, we explain the mixture-model and hidden-state-space interpretations of predictive states (§2). We then present our nonparametric EM algorithm for estimation, with automatic selection of the number of predictive states (§3), and the corresponding prediction algorithm (§4). After demonstrating that mixed LICORS predicts better out of sample than hard-clustering procedures (§5), we review the proposed method and discuss future work (§6).

2 A Predictive States Model for Spatio-temporal Processes

We fix notation and the general set-up for predicting spatio-temporal processes, following Shalizi (2003). We observe a random field , discrete- or continuous-valued, at each point on a regular spatial lattice

, at each moment

of discrete time , or observational in all. The field is if space is dimensional (plus for time); video is . is a norm (e.g., Euclidean) on the spatial coordinates .

To optimally predict an unknown (future) given (past) observed data , we need to know . Estimating this conditional distribution is quite difficult if can depend on all variables in . Since information in a physical system can typically only propagate at a finite speed , can only depend on a subset of . We therefore only have to consider local, spatio-temporal neighborhoods of as possibly relevant for prediction.

2.1 Light Cones

The past light cone (PLC) of is all past events that could have influenced ,

(1)

Analogously the future light cone (FLC) is all future events which could be influenced by . In practice, we limit PLCs and FLCs to finite horizons and ; together with , these fix the dimensionality of light-cones; see Figure 1 for an illustration.

(a) field
(b) field
Figure 1: Geometry of past ( ) and future () light cones, with , , .

For physically-plausible processes, then, we only have to find the conditional distribution of given its PLC . Doing this for every in space-time, however, is still excessive. Not only do spatio-temporal patterns recur, but different histories can have similar predictive consequences. Thus we need not find for each separately, but can first summarize PLCs by predictively sufficient statistics, , and then only find .

We now construct the minimal sufficient statistic, , following Shalizi (2003), to which we refer for some mathematical details, and references on predictive sufficiency.

Definition 2.1 (Equivalent configurations).

The past configurations at and at are predictively equivalent, , if they have the same conditional distributions for FLCs, i.e.,

Let be the equivalence class of , the set of all past configurations and coordinates that predict the same future as does at . Let

(2)

be the function mapping each to its predictive equivalence class. The values can take are predictive states; they are the minimal statistics which are sufficient for predicting from (Shalizi, 2003).

Assumption 2.2 (Conditional invariance).

The predictive distribution of a PLC configuration does not change over time or space. That is, for all , all , and all past light-cone configurations ,

(3)

We may thus regard as an equivalence relation among PLC configurations, and as a function over alone.

Assumption 2.2 enables inference from a single realization of the process (as opposed to multiple independent realizations). For readability, we encode the space-time index by a single index .222Appendix A in the SM gives an explicit rule (Eq. (31)) to map each to a unique .

The future is independent of the past given the predictive state (Shalizi, 2003),

(4)

and, by construction,

(5)

Thus the prediction problem simplifies to estimating for each .

2.2 A Statistical Predictive-States Model

The focus on predictive distributions in constructing the predictive states does not prevent us from using them to model the joint distribution.

Proposition 2.3.

The joint pdf of satisfies

(6)

where is the margin of the spatio-temporal process, i.e., all points in the field that do not have a completely observed PLC.

Proof.

See Supplementary Material, §A. ∎

Proposition 2.3 shows that the focus on conditional predictive distributions does not restrict the applicability of the light cone setup to forecasts alone, but is in fact a generative representation for any spatio-temporal process.

2.2.1 Minimal Sufficient Statistical Model

Given Eq. (6) simplifies to (omitting )

(7)

using (4). Any particular implicitly specifies the number of predictive states , and all predictive distributions . However, in practice only and are observed; the mapping is exactly what we are trying to estimate.

2.3 Predictive States as Hidden Variables

Since there is a one-to-one correspondence between the mapping and the set of equivalence classes / predictive states which are its range, Shalizi (2003) and Goerg and Shalizi (2012) do not formally distinguish between them. Here, however, it is important to keep distinction between the predictive state space and the mapping . Our EM algorithm is based on the idea that the predictive state is a hidden variable, , taking values in the finite state space , and the mixture weights of PLC are the soft-threshold version of the mapping . It is this hidden variable interpretation of predictive states that we use to estimate the minimal sufficient statistic , the hidden state space , and the predictive distributions .

Introducing the abbreviation for , the latent variable approach lets (7) be written as

(8)

Eq. (8) is the pdf of a component mixture model with complete data, and is a randomized version of .

2.4 Log-likelihood of

From (8) the complete data log-likelihood is, neglecting a term,

(9)

where and the second equality follows since for one and only one , and otherwise.

The “parameters” in (9) are and ; and are observed, and is a hidden variable. The optimal mapping is the one that maximizes (9):

(10)

Without any constraints on or the maximum is obtained for and ; “the most faithful description of the data is the data”.333On the other extreme is a field with only predictive state, i.e. the iid case. As this tells us nothing about the underlying dynamics, we must put some constraints on and/or to get a useful solution. For now, assume that is fixed and we only have to estimate ; in Section 3.3, we will give a data-driven procedure to choose .

2.5 Nonparametric Likelihood Approximation

To solve (10) with we need to evaluate (9) for candidate solutions . We cannot do this directly, since (9) involves the unobserved . Moreover, predictive distributions can have arbitrary shapes, so we want to use nonparametric methods, but this inhibits direct evaluation of the component distributions .

We solve both difficulties together by using a nonparametric variant of the expectation-maximization (EM) algorithm (Dempster et al., 1977). Following the recent nonparametric EM literature (Hall et al., 2005; Benaglia et al., 2011; Bordes et al., 2007), we approximate the

in the log-likelihood with kernel density estimators (KDEs) using a previous estimate

. That is we approximate (9) with :

(11)

where an equivalent version of is given below in (20).

3 EM Algorithm for Predictive State Space Assignment

Since maps to , the hidden state variable and the “parameter” play the same role. This in turn results in similar E and M steps. Figure 2 gives an overview of the proposed algorithm.

3.1 Expectation Step

The E-step requires the expected log-likelihood

(12)

where expectation is taken with respect to , the conditional distribution of the hidden variable given the data and the current estimate . Using (9) we obtain

(13)

As for we use KDEs and obtain an approximate expected log-likelihood

(14)

The conditional distribution of given its FLC and PLC, , comes from Bayes’ rule,

(15)

the second equality following from the conditional independence of and given the state .

For brevity, let , an weight matrix

, whose rows are probability distributions over states. This

is the soft-thresholding version of , so we can write the expected log-likelihood in terms of ,

(16)

The current can be used to update (conditional) probabilities in (15) by

(17)
(18)
(19)

where i) is the effective sample size of state , ii) and are weighted mean and covariance matrix estimators of the PLCs using the th column of , and iii) the FLC distribution is estimated with a weighted444We also tried a hard-threshold estimator, but we found that the soft-threshold KDE performed better. KDE (wKDE)

(20)
Here the weights are again the th column of , and is a kernel function with a state-dependent bandwidth . We used a Gaussian kernel, and to get a good, cluster-adaptive bandwidth , we pick out on those for which (hard-thresholding of weights; cf. Benaglia et al. 2011) and apply Silverman’s rule-of-thumb-bandwidth (bw.ndr0 in the R function density). After estimation, we normalize each in (17), .

Ideally, we would use a non-parametric estimate for the PLC distribution, e.g., forest density estimators (Liu et al., 2011; Chow and Liu, 1968). Currently, however, such estimators are too slow to handle many iterations at large , so we model state-conditional PLC distributions as multivariate Gaussians. Simulations suggest that this is often adequate in practice.

3.2 Approximate Maximization Step

In parametric problems, the M-step solves

(21)

to improve the estimate. Starting from a guess , the EM algorithm iterates (12) and (21) to convergence.

In nonparametric problems, finding an that increases is difficult, since wKDEs with non-zero bandwidth are not maximizing the likelihood; they are not even guaranteed to increase it. Optimizing (14) by brute force isn’t computationally feasible either, as it would mean searching state assignments (cf. Bordes et al. 2007).

Initialization: Set . Split data in and . Initialize states randomly from Boolean . E-step: Obtain updated via (17). Approximate M-step: Update mixture pdfs via (20) with . Out-of-sample Prediction: Evaluate out-of-sample MSE for by predicting FLCs from PLCs in . Set . Temporary convergence: Iterate 1 - 3 until

(22)
Merging: Estimate pairwise distances (or test)
(23)
If : determine and merge these columns
(24)
Omit column from , set , and re-start iterations at 1.
If : return with lowest out-of-sample MSE.

Figure 2: Mixed LICORS: nonparametric EM algorithm for predictive state recovery.

However, in our particular setting the parameter space and the expectation of the hidden variable coincide, since is a soft-thresholding version of . Furthermore, none of the estimates above requires a deterministic mapping; they are all weighted MLEs or KDEs. Thus, like Benaglia et al. (2009), we take the weights from the E-step, , to update each component distribution using (20). This in turn can then be plugged into (11) to update the likelihood function, and in (17) for the next E-step.

The wKDE update does not solve (21) nor does it provably increase the log-likelihood (although in simulations it often does so). We thus use cross-validation (CV) to select the best , and henceforth do not rely on an ever-increasing log-likelihood as the usual stopping rule in EM algorithms (see Section 5 for details).

3.3 Data-driven Choice of : Merge States To Obtain Minimal Sufficiency

One advantage of the mixture model in (8) is that predictive states have, by definition, comparable conditional distributions. Since conditional densities can be tested for equality by a nonparametric two-sample test (or using a distribution metric), we can merge nearby classes. We thus propose a data-driven automatic selection of , which solves this key challenge in fitting mixture models: 1) start with a sufficiently large number of clusters, ; 2) test for equality of predictive distribution each time the EM reaches a (local) optimum; 3) merge until (iid case) – step 5 in Fig. 2; 4) choose the best model (and thus ) by CV.

4 Forecasting Given New Data

The estimate can be used to forecast given a new . Integrating out yields a mixture

(25)

As is independent of we do not have to re-estimate them for each , but can use the wKDEs in (20) from the training data.

The mixture weights are in general different for each PLC and can again be estimated using Bayes’s rule (with the important difference that now we only condition on , not on ):

(26)

After re-normalization of , the predictive distribution (25) can be estimated via

(27)

A point forecast can then be obtained by a weighted combination of point estimates in each component (e.g. weighted mean), or by the mode of the full distribution. In the simulations we use the weighted average from each component as the prediction of .

5 Simulations

To evaluate the predictive ability of mixed LICORS in a practical, non-asymptotic context we use the following simulation. The continuous-valued field has a discrete latent state space . The observable field

evolves according to a conditional Gaussian distribution,

(28)
and initial conditions: (29)

The state space evolves with the observable field,

(30)

where is the closest integer to . In words, Eq. (30) says that the latent state is the rounded difference between the sample average of the nearest sites at and the sample average of the nearest sites at . Thus and .

If we include the present in the FLC, (28) gives , making FLC distributions one-dimensional. As is integer-valued and the conditional distribution becomes if , the system has predictive states, , distinguished by the conditional mean . Thus , .

(a) observed field
(b) state-space
Figure 3: Simulation of (28) – (30).
Figure 4: Mixed LICORS with and starting states. Nonparametric estimates of conditional predictive distributions (top-left); trace plots for log-likelihood and MSE (right).

Figure 3 shows one realization of (28) – (30) for (vertical) and (left to right), where we discarded the first time steps as burn-in to avoid too much dependence on the initial conditions (29). While the (usually unobserved) state space has distinctive green temporal traces and also alternating red and blue patches, the observed field is too noisy to clearly see any of these patterns.555All computation was done in R (R Development Core Team, 2010).

(a) estimated predictive states
(b) true predictive states
(c) residuals
Figure 5: Mixed LICORS model fit and residual check.

Figure 4 summarizes one run of mixed LICORS with initial states, , and distance in (23). The first time steps were used as training data, and the second half as test data. The optimal , which minimized the out-of-sample weighted MSE, occurred at iteration , with estimated predictive states. The trace plots show large temporary drops (increases) in the log-likelihood (MSE) whenever the EM reaches a local optimum and merges two states. After merging, the forecasting performance and log-likelihood quickly return to — or even surpass — previous optima.

The predictions from in Fig. 4(a) show that mixed LICORS is practically unbiased – compare to the visually indistinguishable true state space in Fig. 4(b). The residuals in Fig. 4(c)

show no obvious patterns except for a larger variance in the right half (training vs. test data).

5.1 Mixed versus Hard LICORS

Mixed LICORS does better than hard LICORS at forecasting. We use independent realizations of (28) – (30) and for each one we train the model on the first half, and test it on the second (future) half. Lower out-of-sample future MSEs are, of course, better.

We use EM as outlined in Fig. 2 with states, maximum number of iterations. We keep the estimate with the lowest out-of-sample MSE over

independent runs. The first run is initialized with a K-means++ clustering

(Arthur and Vassilvitskii, 2007) on the PLC space; state initializations in the remaining nine runs were uniformly at random from .

To test whether mixed LICORS accurately estimates the mapping , we also predict FLCs of an independently generated realization of the same process. If the out-of-sample MSE for the independent field is the same as the out-of-sample MSE for the future evolution of the training data, then mixed LICORS is not just memorizing fluctuations in any given realization, but estimates characteristics of the random field. We calculated both weighted-mixture forecasts, and the forecast of the state with the highest weight.

Figure 6: Comparing mixed and hard LICORS by forecast MSEs.

Figure 6 shows the results. Mixed LICORS greatly improves upon the hard LICORS estimates, with up to reduction in out-of-sample MSE. As with hard LICORS, the MSE for future and independent realizations are essentially the same, showing that mixed LICORS generalizes from one realization to the whole process. As the weights often already are / assignments, weighted-mixture prediction is only slightly better than using the highest-weight state.

5.2 Simulating New Realizations of The Underlying System

Recall that all simulations use (29) as initial conditions. If we want to know the effect of different starting conditions, then we can simply use Eqs. (28) & (30) to simulate that process, since they fully specify the evolution of the stochastic process. In experimental studies, however, researchers usually lack such generative models; learning one is often the point of the experiment in the first place.

Since mixed LICORS estimates joint and conditional predictive distributions, and not only the conditional mean, it is possible to simulate a new realization from an estimated model. Figure 7 outlines this simulation procedure. We will now demonstrate that mixed LICORS can be used instead to simulate from different initial conditions without knowing Eqs. (28) & (30).

Initialize field . Set . Fetch all PLCs at time : For each : draw state draw If , set and go to step 1. Otherwise return simulation .

Figure 7: Simulate new realization from spatio-temporal process on using true and estimated dynamics (use (4) in 2a; (20) in 2b).
(a) simulations from true dynamics
(b) simulations from estimated dynamics
Figure 8: Simulating another realization of (28) & (30) but with different starting conditions.

For example, Fig. 7(a) shows a simulation using the true mechanisms in (28) & (30) with starting conditions and in alternating patches of ten times , ten times , ten times , etc. (total of patches since ). The first couple of columns (on the left) are influenced by different starting conditions, but the initial effect dies out soon (since ) and similar structures (left to right traces) as in simulations with (29) emerge (Fig. 3).

Figure 7(b) shows simulations solely using the mixed LICORS estimates in Fig. 4. While the patterns are quantitatively different (due to random sampling), the qualitative structures are strikingly similar. Thus mixed LICORS can not only accurately estimate , but also learn the optimal prediction rule (28) solely from the observed data .

6 Discussion

Mixed LICORS attacks the problem of reconstructing the predictive state space of a spatio-temporal process through a nonparametric, EM-like algorithm, softly clustering observed histories by their predictive consequences. Mixed LICORS is a probabilistic generalization of hard LICORS and can thus be easily adapted to other statistical settings such as classification or regression. Simulations show that it greatly outperforms its hard-clustering predecessor.

However, like other state-of-the-art nonparametric EM-like algorithms (Hall et al., 2005; Bordes et al., 2007; Mallapragada et al., 2010), theoretical properties of our procedure are not yet well understood. In particular, the nonparametric estimation of mixture models poses identifiability problems (Benaglia et al., 2009, §2 and references therin). Here, we demonstrated that in practice mixed LICORS does not suffer from identifiability problems, and outperforms (identifiable) hard-clustering methods.

We also demonstrate that mixed LICORS can learn spatio-temporal dynamics from data, which can then be used for simulating new experiments, whether from the observed initial conditions or from new ones. Simulating from observed conditions allows for model checking; simulating from new ones makes predictions about unobserved behavior. Thus mixed LICORS can in principle make a lot of expensive, time- and labor-intensive experimental studies much more manageable and easier to plan. In particular, mixed LICORS can be applied to e.g., functional magnetic resonance imaging (fMRI) data to analyze and forecast complex, spatio-temporal brain activity. Due to space limits we refer to future work.

Acknowledgments

We thank Stacey Ackerman-Alexeeff, Dave Albers, Chris Genovese, Rob Haslinger, Martin Nilsson Jacobi, Heike Jänicke, Kristina Klinkner, Cristopher Moore, Jean-Baptiste Rouquier, Chad Schafer, Rafael Stern, and Chris Wiggins for valuable discussion, and Larry Wasserman for detailed suggestions that have improved all aspects of this work. GMG was supported by an NSF grant (# DMS 1207759). CRS was supported by grants from INET and from the NIH (# 2 R01 NS047493).

References

Appendix A Proofs

Here we will show that the joint pdf of the entire field conditioned on its margin (Figure 9, gray area) equals the product of the predictive conditional distributions.

Proof of Proposition 2.3.

To simplify notation, we assign index numbers to the space-time grid to ensure that the PLC of cannot contain if . We can do this by iterating through space and time in increasing order over time (and, for fixed , any order over space):

(31)
Figure 9: Margin of spatio-temporal field in .

We assume that the process we observed is part of a larger field on an extended grid , with and , for a total of space-time points, . The margin are all points , that do not have a fully observed past light cone. Formally,

(32)

The size of depends on the past horizon as well as the speed of propagation , .

In Figure 9, the part of the field with fully observed PLCs are marked in red. Points in the margin, in gray, have PLCs extending into the fully unobserved region (blue). Points in the red area have a PLC that lies fully in the red or partially in the gray, but never in the blue area. As can be see in Fig. 9 the margin at each is a constant fraction of space, thus overall grows linearly with ; it does not grow with an increasing , but stays constant.

For simplicity, assume that are from the truncated (red) field, such that all their PLCs are observed (they may lie in ), and the remaining s lie in (with a PLC that is only partially unobserved). Furthermore let . Thus

The first term factorizes as

where the second-to-last equality follows since by (31), , and the last equality holds since is conditional independent of the rest given its own PLC (due to limits in information propagation over space-time).

By induction,

(33)
(34)

This shows that the conditional log-likelihood maximization we use for our estimators is equivalent (up to a constant) to full joint maximum likelihood estimation. ∎

Appendix B Predictive States and Mixture Models

Another way to understand predictive states is as the extremal distributions of an optimal mixture model (Lauritzen, 1974, 1984).

To predict any variable , we have to know its distribution . If, as often happens, that distribution is very complicated, we may try to decompose it into a mixture of simpler “base” or “extremal” distributions, , with mixing weights ,

(35)

The familiar Gaussian mixture model, for instance, makes the extremal distributions to be Gaussians (with indexing both expectations and variances), and makes the mixing weights a combination of delta functions, so that becomes a weighted sum of finitely-many Gaussians.

The conditional predictive distribution of in (35) is a weighted average over the extremal conditional distributions ,

(36)

This only makes the forecasting problem harder, unless , that is, unless is a predictively sufficient statistic for . The most parsimonious mixture model is the one with the minimal sufficient statistic, . This shows that predictive states are the best “parameters” in (35) for optimal forecasting. Using them,

(37)
(38)

where is the probability that the predictive state of is , and . Since each light cone has a unique predictive state,

(39)

Thus the predictive distribution given is just

(40)

Now the forecasting problem simplifies to mapping to its predictive state, ; the appropriate distribution-valued forecast is , and point forecasts are derived from it as needed.

This mixture-model point of view highlights how prediction benefits from grouping points by their predictive consequences, rather than by spatial proximity (as a Gaussian mixture would do). For us, this means clustering past light-cone configurations according to the similarity of their predictive distributions, not according to (say) the Euclidean geometry. Mixed LICORS thus learns a new geometry for the system, which is optimized for forecasting.