1 Introduction
Recently Goerg and Shalizi (2012) introduced light cone reconstruction of states (LICORS), a nonparametric procedure for recovering predictive states from spatiotemporal data. Every spatiotemporal process has an associated, latent prediction process, whose measurevalued states are the optimal local predictions of the future at each point in space and time (Shalizi, 2003)
. LICORS consistently estimates this prediction process from a single realization of the manifest spacetime process, through agglomerative clustering in the space of predictive distributions; estimated states are clusters. This converges on the minimal set of states capable of optimal prediction of the original process.
Experience with other clustering problems shows that softthreshold techniques often predict much better than hardthreshold methods. Famously, while means (Lloyd, 1982)
is very fast and robust, the expectation maximization (EM) algorithm
(Dempster et al., 1977)in Gaussian mixture models gives better clustering results. Moreover, mixture models allow clusters to be interpreted probabilistically, and new data to be assigned to old clusters.
With this inspiration, we introduce mixed LICORS, a softthresholding version of (hard) LICORS. This embeds the prediction process framework of Shalizi (2003) in a mixturemodel setting, where the predictive states correspond to the optimal mixing weights on extremal distributions, which are themselves optimized for forecasting. Our proposed nonparametric, EMlike algorithm then follows naturally.
After introducing the prediction problem and fixing notation, we explain the mixturemodel and hiddenstatespace interpretations of predictive states (§2). We then present our nonparametric EM algorithm for estimation, with automatic selection of the number of predictive states (§3), and the corresponding prediction algorithm (§4). After demonstrating that mixed LICORS predicts better out of sample than hardclustering procedures (§5), we review the proposed method and discuss future work (§6).
2 A Predictive States Model for Spatiotemporal Processes
We fix notation and the general setup for predicting spatiotemporal processes, following Shalizi (2003). We observe a random field , discrete or continuousvalued, at each point on a regular spatial lattice
, at each moment
of discrete time , or observational in all. The field is if space is dimensional (plus for time); video is . is a norm (e.g., Euclidean) on the spatial coordinates .To optimally predict an unknown (future) given (past) observed data , we need to know . Estimating this conditional distribution is quite difficult if can depend on all variables in . Since information in a physical system can typically only propagate at a finite speed , can only depend on a subset of . We therefore only have to consider local, spatiotemporal neighborhoods of as possibly relevant for prediction.
2.1 Light Cones
The past light cone (PLC) of is all past events that could have influenced ,
(1) 
Analogously the future light cone (FLC) is all future events which could be influenced by . In practice, we limit PLCs and FLCs to finite horizons and ; together with , these fix the dimensionality of lightcones; see Figure 1 for an illustration.
For physicallyplausible processes, then, we only have to find the conditional distribution of given its PLC . Doing this for every in spacetime, however, is still excessive. Not only do spatiotemporal patterns recur, but different histories can have similar predictive consequences. Thus we need not find for each separately, but can first summarize PLCs by predictively sufficient statistics, , and then only find .
We now construct the minimal sufficient statistic, , following Shalizi (2003), to which we refer for some mathematical details, and references on predictive sufficiency.
Definition 2.1 (Equivalent configurations).
The past configurations at and at are predictively equivalent, , if they have the same conditional distributions for FLCs, i.e.,
Let be the equivalence class of , the set of all past configurations and coordinates that predict the same future as does at . Let
(2) 
be the function mapping each to its predictive equivalence class. The values can take are predictive states; they are the minimal statistics which are sufficient for predicting from (Shalizi, 2003).
Assumption 2.2 (Conditional invariance).
The predictive distribution of a PLC configuration does not change over time or space. That is, for all , all , and all past lightcone configurations ,
(3) 
We may thus regard as an equivalence relation among PLC configurations, and as a function over alone.
Assumption 2.2 enables inference from a single realization of the process (as opposed to multiple independent realizations). For readability, we encode the spacetime index by a single index .^{2}^{2}2Appendix A in the SM gives an explicit rule (Eq. (31)) to map each to a unique .
The future is independent of the past given the predictive state (Shalizi, 2003),
(4) 
and, by construction,
(5) 
Thus the prediction problem simplifies to estimating for each .
2.2 A Statistical PredictiveStates Model
The focus on predictive distributions in constructing the predictive states does not prevent us from using them to model the joint distribution.
Proposition 2.3.
The joint pdf of satisfies
(6) 
where is the margin of the spatiotemporal process, i.e., all points in the field that do not have a completely observed PLC.
Proof.
See Supplementary Material, §A. ∎
Proposition 2.3 shows that the focus on conditional predictive distributions does not restrict the applicability of the light cone setup to forecasts alone, but is in fact a generative representation for any spatiotemporal process.
2.2.1 Minimal Sufficient Statistical Model
2.3 Predictive States as Hidden Variables
Since there is a onetoone correspondence between the mapping and the set of equivalence classes / predictive states which are its range, Shalizi (2003) and Goerg and Shalizi (2012) do not formally distinguish between them. Here, however, it is important to keep distinction between the predictive state space and the mapping . Our EM algorithm is based on the idea that the predictive state is a hidden variable, , taking values in the finite state space , and the mixture weights of PLC are the softthreshold version of the mapping . It is this hidden variable interpretation of predictive states that we use to estimate the minimal sufficient statistic , the hidden state space , and the predictive distributions .
2.4 Loglikelihood of
From (8) the complete data loglikelihood is, neglecting a term,
(9) 
where and the second equality follows since for one and only one , and otherwise.
The “parameters” in (9) are and ; and are observed, and is a hidden variable. The optimal mapping is the one that maximizes (9):
(10) 
Without any constraints on or the maximum is obtained for and ; “the most faithful description of the data is the data”.^{3}^{3}3On the other extreme is a field with only predictive state, i.e. the iid case. As this tells us nothing about the underlying dynamics, we must put some constraints on and/or to get a useful solution. For now, assume that is fixed and we only have to estimate ; in Section 3.3, we will give a datadriven procedure to choose .
2.5 Nonparametric Likelihood Approximation
To solve (10) with we need to evaluate (9) for candidate solutions . We cannot do this directly, since (9) involves the unobserved . Moreover, predictive distributions can have arbitrary shapes, so we want to use nonparametric methods, but this inhibits direct evaluation of the component distributions .
We solve both difficulties together by using a nonparametric variant of the expectationmaximization (EM) algorithm (Dempster et al., 1977). Following the recent nonparametric EM literature (Hall et al., 2005; Benaglia et al., 2011; Bordes et al., 2007), we approximate the
in the loglikelihood with kernel density estimators (KDEs) using a previous estimate
. That is we approximate (9) with :(11) 
where an equivalent version of is given below in (20).
3 EM Algorithm for Predictive State Space Assignment
Since maps to , the hidden state variable and the “parameter” play the same role. This in turn results in similar E and M steps. Figure 2 gives an overview of the proposed algorithm.
3.1 Expectation Step
The Estep requires the expected loglikelihood
(12) 
where expectation is taken with respect to , the conditional distribution of the hidden variable given the data and the current estimate . Using (9) we obtain
(13) 
As for we use KDEs and obtain an approximate expected loglikelihood
(14) 
The conditional distribution of given its FLC and PLC, , comes from Bayes’ rule,
(15) 
the second equality following from the conditional independence of and given the state .
For brevity, let , an weight matrix
, whose rows are probability distributions over states. This
is the softthresholding version of , so we can write the expected loglikelihood in terms of ,(16) 
The current can be used to update (conditional) probabilities in (15) by
(17)  
(18)  
(19) 
where i) is the effective sample size of state , ii) and are weighted mean and covariance matrix estimators of the PLCs using the th column of , and iii) the FLC distribution is estimated with a weighted^{4}^{4}4We also tried a hardthreshold estimator, but we found that the softthreshold KDE performed better. KDE (wKDE)
Ideally, we would use a nonparametric estimate for the PLC distribution, e.g., forest density estimators (Liu et al., 2011; Chow and Liu, 1968). Currently, however, such estimators are too slow to handle many iterations at large , so we model stateconditional PLC distributions as multivariate Gaussians. Simulations suggest that this is often adequate in practice.
3.2 Approximate Maximization Step
In parametric problems, the Mstep solves
(21) 
to improve the estimate. Starting from a guess , the EM algorithm iterates (12) and (21) to convergence.
In nonparametric problems, finding an that increases is difficult, since wKDEs with nonzero bandwidth are not maximizing the likelihood; they are not even guaranteed to increase it. Optimizing (14) by brute force isn’t computationally feasible either, as it would mean searching state assignments (cf. Bordes et al. 2007).
However, in our particular setting the parameter space and the expectation of the hidden variable coincide, since is a softthresholding version of . Furthermore, none of the estimates above requires a deterministic mapping; they are all weighted MLEs or KDEs. Thus, like Benaglia et al. (2009), we take the weights from the Estep, , to update each component distribution using (20). This in turn can then be plugged into (11) to update the likelihood function, and in (17) for the next Estep.
The wKDE update does not solve (21) nor does it provably increase the loglikelihood (although in simulations it often does so). We thus use crossvalidation (CV) to select the best , and henceforth do not rely on an everincreasing loglikelihood as the usual stopping rule in EM algorithms (see Section 5 for details).
3.3 Datadriven Choice of : Merge States To Obtain Minimal Sufficiency
One advantage of the mixture model in (8) is that predictive states have, by definition, comparable conditional distributions. Since conditional densities can be tested for equality by a nonparametric twosample test (or using a distribution metric), we can merge nearby classes. We thus propose a datadriven automatic selection of , which solves this key challenge in fitting mixture models: 1) start with a sufficiently large number of clusters, ; 2) test for equality of predictive distribution each time the EM reaches a (local) optimum; 3) merge until (iid case) – step 5 in Fig. 2; 4) choose the best model (and thus ) by CV.
4 Forecasting Given New Data
The estimate can be used to forecast given a new . Integrating out yields a mixture
(25) 
As is independent of we do not have to reestimate them for each , but can use the wKDEs in (20) from the training data.
The mixture weights are in general different for each PLC and can again be estimated using Bayes’s rule (with the important difference that now we only condition on , not on ):
(26) 
After renormalization of , the predictive distribution (25) can be estimated via
(27) 
A point forecast can then be obtained by a weighted combination of point estimates in each component (e.g. weighted mean), or by the mode of the full distribution. In the simulations we use the weighted average from each component as the prediction of .
5 Simulations
To evaluate the predictive ability of mixed LICORS in a practical, nonasymptotic context we use the following simulation. The continuousvalued field has a discrete latent state space . The observable field
evolves according to a conditional Gaussian distribution,
(28)  
and initial conditions:  (29) 
The state space evolves with the observable field,
(30) 
where is the closest integer to . In words, Eq. (30) says that the latent state is the rounded difference between the sample average of the nearest sites at and the sample average of the nearest sites at . Thus and .
If we include the present in the FLC, (28) gives , making FLC distributions onedimensional. As is integervalued and the conditional distribution becomes if , the system has predictive states, , distinguished by the conditional mean . Thus , .
Figure 3 shows one realization of (28) – (30) for (vertical) and (left to right), where we discarded the first time steps as burnin to avoid too much dependence on the initial conditions (29). While the (usually unobserved) state space has distinctive green temporal traces and also alternating red and blue patches, the observed field is too noisy to clearly see any of these patterns.^{5}^{5}5All computation was done in R (R Development Core Team, 2010).
Figure 4 summarizes one run of mixed LICORS with initial states, , and distance in (23). The first time steps were used as training data, and the second half as test data. The optimal , which minimized the outofsample weighted MSE, occurred at iteration , with estimated predictive states. The trace plots show large temporary drops (increases) in the loglikelihood (MSE) whenever the EM reaches a local optimum and merges two states. After merging, the forecasting performance and loglikelihood quickly return to — or even surpass — previous optima.
The predictions from in Fig. 4(a) show that mixed LICORS is practically unbiased – compare to the visually indistinguishable true state space in Fig. 4(b). The residuals in Fig. 4(c)
show no obvious patterns except for a larger variance in the right half (training vs. test data).
5.1 Mixed versus Hard LICORS
Mixed LICORS does better than hard LICORS at forecasting. We use independent realizations of (28) – (30) and for each one we train the model on the first half, and test it on the second (future) half. Lower outofsample future MSEs are, of course, better.
We use EM as outlined in Fig. 2 with states, maximum number of iterations. We keep the estimate with the lowest outofsample MSE over
independent runs. The first run is initialized with a Kmeans++ clustering
(Arthur and Vassilvitskii, 2007) on the PLC space; state initializations in the remaining nine runs were uniformly at random from .To test whether mixed LICORS accurately estimates the mapping , we also predict FLCs of an independently generated realization of the same process. If the outofsample MSE for the independent field is the same as the outofsample MSE for the future evolution of the training data, then mixed LICORS is not just memorizing fluctuations in any given realization, but estimates characteristics of the random field. We calculated both weightedmixture forecasts, and the forecast of the state with the highest weight.
Figure 6 shows the results. Mixed LICORS greatly improves upon the hard LICORS estimates, with up to reduction in outofsample MSE. As with hard LICORS, the MSE for future and independent realizations are essentially the same, showing that mixed LICORS generalizes from one realization to the whole process. As the weights often already are / assignments, weightedmixture prediction is only slightly better than using the highestweight state.
5.2 Simulating New Realizations of The Underlying System
Recall that all simulations use (29) as initial conditions. If we want to know the effect of different starting conditions, then we can simply use Eqs. (28) & (30) to simulate that process, since they fully specify the evolution of the stochastic process. In experimental studies, however, researchers usually lack such generative models; learning one is often the point of the experiment in the first place.
Since mixed LICORS estimates joint and conditional predictive distributions, and not only the conditional mean, it is possible to simulate a new realization from an estimated model. Figure 7 outlines this simulation procedure. We will now demonstrate that mixed LICORS can be used instead to simulate from different initial conditions without knowing Eqs. (28) & (30).
For example, Fig. 7(a) shows a simulation using the true mechanisms in (28) & (30) with starting conditions and in alternating patches of ten times , ten times , ten times , etc. (total of patches since ). The first couple of columns (on the left) are influenced by different starting conditions, but the initial effect dies out soon (since ) and similar structures (left to right traces) as in simulations with (29) emerge (Fig. 3).
Figure 7(b) shows simulations solely using the mixed LICORS estimates in Fig. 4. While the patterns are quantitatively different (due to random sampling), the qualitative structures are strikingly similar. Thus mixed LICORS can not only accurately estimate , but also learn the optimal prediction rule (28) solely from the observed data .
6 Discussion
Mixed LICORS attacks the problem of reconstructing the predictive state space of a spatiotemporal process through a nonparametric, EMlike algorithm, softly clustering observed histories by their predictive consequences. Mixed LICORS is a probabilistic generalization of hard LICORS and can thus be easily adapted to other statistical settings such as classification or regression. Simulations show that it greatly outperforms its hardclustering predecessor.
However, like other stateoftheart nonparametric EMlike algorithms (Hall et al., 2005; Bordes et al., 2007; Mallapragada et al., 2010), theoretical properties of our procedure are not yet well understood. In particular, the nonparametric estimation of mixture models poses identifiability problems (Benaglia et al., 2009, §2 and references therin). Here, we demonstrated that in practice mixed LICORS does not suffer from identifiability problems, and outperforms (identifiable) hardclustering methods.
We also demonstrate that mixed LICORS can learn spatiotemporal dynamics from data, which can then be used for simulating new experiments, whether from the observed initial conditions or from new ones. Simulating from observed conditions allows for model checking; simulating from new ones makes predictions about unobserved behavior. Thus mixed LICORS can in principle make a lot of expensive, time and laborintensive experimental studies much more manageable and easier to plan. In particular, mixed LICORS can be applied to e.g., functional magnetic resonance imaging (fMRI) data to analyze and forecast complex, spatiotemporal brain activity. Due to space limits we refer to future work.
Acknowledgments
We thank Stacey AckermanAlexeeff, Dave Albers, Chris Genovese, Rob Haslinger, Martin Nilsson Jacobi, Heike Jänicke, Kristina Klinkner, Cristopher Moore, JeanBaptiste Rouquier, Chad Schafer, Rafael Stern, and Chris Wiggins for valuable discussion, and Larry Wasserman for detailed suggestions that have improved all aspects of this work. GMG was supported by an NSF grant (# DMS 1207759). CRS was supported by grants from INET and from the NIH (# 2 R01 NS047493).
References
 Arthur and Vassilvitskii (2007) D. Arthur and S. Vassilvitskii. kmeans++: The advantages of careful seeding. In H. Gabow, editor, Proceedings of the 18th Annual ACMSIAM Symposium on Discrete Algorithms [SODA07], pages 1027–1035, Philadelphia, 2007. Society for Industrial and Applied Mathematics. URL http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
 Benaglia et al. (2009) T. Benaglia, D. Chauveau, and D. R. Hunter. An EMlike algorithm for semi and nonparametric estimation in multivariate mixtures. Journal of Computational and Graphical Statistics, 18:505–526, 2009. URL http://sites.stat.psu.edu/~dhunter/papers/mvm.pdf.
 Benaglia et al. (2011) T. Benaglia, D. Chauveau, and D. R. Hunter. Bandwidth selection in an EMlike algorithm for nonparametric multivariate mixtures. In D. R. Hunter, D. S. P. Richards, and J. L. Rosenberger, editors, Nonparametric Statistics and Mixture Models: A Festschrift in Honor of Thomas P. Hettmansperger. World Scientific, Singapore, 2011. URL http://hal.archivesouvertes.fr/docs/00/35/32/97/PDF/npEM_bandwidth.pdf.
 Bordes et al. (2007) L. Bordes, D. Chauveau, and P. Vandekerkhove. A stochastic EM algorithm for a semiparametric mixture model. Computational Statistics and Data Analysis, 51:5429–5443, 2007. doi: 10.1016/j.csda.2006.08.015.
 Chow and Liu (1968) C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT14:462–467, 1968.
 Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38, 1977. URL http://www.jstor.org/pss/2984875.
 Goerg and Shalizi (2012) G. M. Goerg and C. R. Shalizi. LICORS: Light cone reconstruction of states for nonparametric forecasting of spatiotemporal systems. Technical report, Department of Statistics, Carnegie Mellon University, 2012. URL http://arxiv.org/abs/1206.2398.
 Hall et al. (2005) P. Hall, A. Neeman, R. Pakyari, and R. Elmore. Nonparametric inference in multivariate mixtures. Biometrika, 92:667–678, 2005.
 Lauritzen (1974) S. L. Lauritzen. Sufficiency, prediction and extreme models. Scandinavian Journal of Statistics, 1:128–134, 1974. URL http://www.jstor.org/pss/4615564.
 Lauritzen (1984) S. L. Lauritzen. Extreme point models in statistics. Scandinavian Journal of Statistics, 11:65–91, 1984. URL http://www.jstor.org/pss/4615945. With discussion and response.

Liu et al. (2011)
H. Liu, M. Xu, H. Gu, A. Gupta, J. Lafferty, and L. Wasserman.
Forest density estimation.
Journal of Machine Learning Research
, 12:907–951, 2011. URL http://jmlr.csail.mit.edu/papers/v12/liu11a.html.  Lloyd (1982) S. Lloyd. Least squares quantization in PCM. Information Theory, IEEE Transactions on, 28(2):129 – 137, 1982. ISSN 00189448. doi: 10.1109/TIT.1982.1056489.

Mallapragada et al. (2010)
P. K. Mallapragada, R. Jin, and A. K. Jain.
Nonparametric mixture models for clustering.
In E. R. Hancock, R. C. Wilson, T. Windeatt, I. Ulusoy, and
F. Escolano, editors,
Structural, Syntactic, and Statistical Pattern Recognition [SSPR//SPR 2010]
, pages 334–343, New York, 2010. SpringerVerlag. doi: 10.1007/9783642149801˙32.  R Development Core Team (2010) R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010. URL http://www.Rproject.org/. ISBN 3900051070.
 Shalizi (2003) C. R. Shalizi. Optimal nonlinear prediction of random fields on networks. Discrete Mathematics and Theoretical Computer Science, AB(DMCS):11–30, 2003. URL http://arxiv.org/abs/math.PR/0305160.
Appendix A Proofs
Here we will show that the joint pdf of the entire field conditioned on its margin (Figure 9, gray area) equals the product of the predictive conditional distributions.
Proof of Proposition 2.3.
To simplify notation, we assign index numbers to the spacetime grid to ensure that the PLC of cannot contain if . We can do this by iterating through space and time in increasing order over time (and, for fixed , any order over space):
(31) 
We assume that the process we observed is part of a larger field on an extended grid , with and , for a total of spacetime points, . The margin are all points , that do not have a fully observed past light cone. Formally,
(32) 
The size of depends on the past horizon as well as the speed of propagation , .
In Figure 9, the part of the field with fully observed PLCs are marked in red. Points in the margin, in gray, have PLCs extending into the fully unobserved region (blue). Points in the red area have a PLC that lies fully in the red or partially in the gray, but never in the blue area. As can be see in Fig. 9 the margin at each is a constant fraction of space, thus overall grows linearly with ; it does not grow with an increasing , but stays constant.
For simplicity, assume that are from the truncated (red) field, such that all their PLCs are observed (they may lie in ), and the remaining s lie in (with a PLC that is only partially unobserved). Furthermore let . Thus
The first term factorizes as
where the secondtolast equality follows since by (31), , and the last equality holds since is conditional independent of the rest given its own PLC (due to limits in information propagation over spacetime).
By induction,
(33)  
(34) 
This shows that the conditional loglikelihood maximization we use for our estimators is equivalent (up to a constant) to full joint maximum likelihood estimation. ∎
Appendix B Predictive States and Mixture Models
Another way to understand predictive states is as the extremal distributions of an optimal mixture model (Lauritzen, 1974, 1984).
To predict any variable , we have to know its distribution . If, as often happens, that distribution is very complicated, we may try to decompose it into a mixture of simpler “base” or “extremal” distributions, , with mixing weights ,
(35) 
The familiar Gaussian mixture model, for instance, makes the extremal distributions to be Gaussians (with indexing both expectations and variances), and makes the mixing weights a combination of delta functions, so that becomes a weighted sum of finitelymany Gaussians.
The conditional predictive distribution of in (35) is a weighted average over the extremal conditional distributions ,
(36) 
This only makes the forecasting problem harder, unless , that is, unless is a predictively sufficient statistic for . The most parsimonious mixture model is the one with the minimal sufficient statistic, . This shows that predictive states are the best “parameters” in (35) for optimal forecasting. Using them,
(37)  
(38) 
where is the probability that the predictive state of is , and . Since each light cone has a unique predictive state,
(39) 
Thus the predictive distribution given is just
(40) 
Now the forecasting problem simplifies to mapping to its
predictive state, ; the appropriate
distributionvalued forecast is , and point
forecasts are derived from it as needed.
This mixturemodel point of view highlights how prediction benefits from grouping points by their predictive consequences, rather than by spatial proximity (as a Gaussian mixture would do). For us, this means clustering past lightcone configurations according to the similarity of their predictive distributions, not according to (say) the Euclidean geometry. Mixed LICORS thus learns a new geometry for the system, which is optimized for forecasting.