I Introduction
Modeling spatiotemporal data, such as high-resolution video, is hard. The sheer dimensionality of the data often makes global inference methods intractable. Similar curses of dimensionality for textual and time-series data have been met with great success by HMMs (Rabiner, 1989), which use localized models for prediction and tractable inference on sequences. Inspired by this example, we look to localized models for spatiotemporal data, such as video and fMRI data. Light cone methods, such as mixed LICORS (Goerg and Shalizi, 2013), successfully reduce the global inference task to iterating a tractable, localized one. These methods can be used both for regression (point predictions of real-valued outputs from input variables) and for computing probability densities. The latter property allows one to tractably compute distributions over spaces of events, e.g., over the space of all possible videos, just as HMMs induce probability distributions over the set of all possible sequences (Figure 1). This ability could make light cone decompositions as general and useful for modeling spatiotemporal data as HMMs are for textual and time-series data.
The goals of this manuscript are thus: (1) showing how light cone decompositions help make spatiotemporal modeling tasks tractable; (2) introducing three easy-to-implement light cone algorithms, allowing others to begin experimenting with light cone methods; (3) assessing the predictive accuracy of light cone methods on two video prediction tasks; and (4) providing a finite sample guarantee on the error of predictive state light cone methods. We begin with some preliminaries.
II Notation and Preliminaries
Given a random field observed at each point of a regular spatial lattice at a sequence of discrete time instants, we seek to approximate a joint likelihood over the observations of the spatiotemporal process, and to accurately forecast the future of the process. Since causal influences in physical systems propagate only at finite speed, we follow Parlitz and Merkwirth (2000) and adopt the concept of light cones, defined as the set of events that could influence a given point. Formally, the past light cone (PLC) of a point is the set of all past variables that could have affected it.^{1}
^{1} Strictly, we should distinguish the light cone proper, which is a region of spacetime, from the configuration of the random field over this region; we elide the distinction for brevity.
Similarly, a future light cone (FLC) is the set of all future events that could be affected by . As a practical matter, not all past (or future) events are equally informative, since more recent events tend to exert greater causal influence. Thus, in practice, we can approximate the true past light cone with a much smaller subset light cone, improving tractability without incurring much predictive error.
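To make the decomposition concrete, here is a minimal sketch of light cone extraction from a video-like array. This is our own illustration, not the paper's code: the square cone geometry, the choice of propagation speed `c = 1`, and the convention that the present point belongs to the FLC are assumptions.

```python
import numpy as np

def extract_light_cones(field, hp=1, hf=1, c=1):
    """Decompose a (time, height, width) field into (PLC, FLC) pairs.

    The PLC of a point collects all values within `hp` past time steps
    whose spatial offset is reachable at speed `c`; the FLC does the same
    for `hf` future steps (here including the present point itself).
    """
    T, H, W = field.shape
    plcs, flcs = [], []
    # skip margins so every extracted cone is fully observed
    for t in range(hp, T - hf):
        for i in range(c * hp, H - c * hf):
            for j in range(c * hp, W - c * hf):
                plc = [field[t - d, i + di, j + dj]
                       for d in range(1, hp + 1)
                       for di in range(-c * d, c * d + 1)
                       for dj in range(-c * d, c * d + 1)]
                flc = [field[t + d, i + di, j + dj]
                       for d in range(0, hf + 1)
                       for di in range(-c * d, c * d + 1)
                       for dj in range(-c * d, c * d + 1)]
                plcs.append(plc)
                flcs.append(flc)
    return np.array(plcs), np.array(flcs)
```

With `hp = hf = c = 1`, each PLC is the 3-by-3 patch one step back (9 values) and each FLC is the present point plus the 3-by-3 patch one step ahead (10 values); the margin handling mirrors the exclusion of marginal light cones in the experiments of §V.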
Furthermore, we adopt the conditional independence assumption for light cones given in Goerg and Shalizi (2013), which allows the joint likelihood to factor into a product of conditional likelihoods. Indexing each light cone by a single integer $i$ for simplicity of notation, the joint pdf of the field $X$ factorizes as
$$P(X) \;\propto\; \prod_{i} P\!\left(x_i \mid \mathrm{PLC}_i\right),$$
where $x_i$ denotes the field value at site $i$, and the proportionality accounts for incompletely observed light cones along the edge of the field.
Given this factorization, it becomes natural to seek equivalence classes of light cones: that is, to cluster light cones into sets based on the similarity of their conditional distributions. Such equivalence classes of past light cones are predictive states (Knight, 1975; Goerg and Shalizi, 2012), and our immediate goals become twofold: first, to discover these latent predictive states (i.e., learn a mapping from PLCs to predictive states), and second, to estimate the conditional distribution over each future light cone given its predictive state. Goerg and Shalizi (2012) introduced LICORS as a nonparametric method of predictive state reconstruction, followed by mixed LICORS (Goerg and Shalizi, 2013) as a mixture model extension of LICORS, in which each future light cone is forecast using a mixture of predictive states. Mixed LICORS has predictive advantages over the original LICORS, but requires finding an $N \times K$ matrix of weights (where $N$ is the number of light cones and $K$ the number of predictive states) using a form of EM, where each weight is determined using a kernel density estimate over all points. Each EM iteration therefore takes time quadratic in $N$, slowing mixed LICORS considerably for large $N$. Almost equally daunting, the original algorithms are quite complex and difficult to implement and debug, inhibiting their adoption.
III Contributions
We review the use of light cones for localized spatiotemporal prediction. We introduce two simplified nonparametric methods for the predictive state reconstruction task, and a simple regression light cone method for fast and accurate forecasting. The first predictive state method, Moonshine, is a simple meta-algorithm consisting of basic clustering steps combined with dimensionality reduction and nonparametric density estimation. Moonshine is instance-based and requires no iterative likelihood maximization, yet retains many of the qualities of the more complex mixed LICORS method. The second predictive state algorithm, One Hundred Proof (OHP), simplifies the Moonshine approach further: it clusters in the space of future light cones, then uses the clusters to obtain state-specific nonparametric density estimates over the spaces of PLCs and FLCs. Being simplified approximations of the mixed LICORS system, these algorithms are much easier to implement than the LICORS algorithms, yet retain many of their forecasting and modeling strengths.
We further conduct two sets of empirical experiments demonstrating the predictive power of light cone methods on video-like data, and report the results. Lastly, we give a large sample theoretical guarantee for light cone predictive state systems.
The remainder is structured as follows. §IV describes the Moonshine, One Hundred Proof, and light cone linear regression algorithms. §V describes the experimental setup for two real-world spatiotemporal prediction tasks, and gives the results of the algorithms and baselines. §VII gives an upper bound on the estimation error of our methods. §VIII reviews related and future work, while §IX summarizes our findings.
IV Methods
Our simple predictive state reconstruction methods build upon the principles introduced in Goerg and Shalizi (2013) for mixed LICORS. Both new methods reconstruct a set of predictive states and a soft mapping from past light cones to states, through nonparametric density estimation over the space of light cones. That is, for every past light cone the methods compute a normalized weight for each state. Unlike mixed LICORS, the new methods avoid having to explicitly construct the full light-cone-by-state weight matrix, yet retain the benefits of soft-membership mixture modeling.
After describing the reconstruction algorithms, we discuss how one can determine the conditional probability density of an observation given its past light cone, and how to use this conditional density in forecasts. We then describe an additional pure regression light cone method, useful for fast and accurate forecasting without state reconstruction. Appendix 2 describes parameter settings and practical implementation issues that arise when using the algorithms.
IV-A Moonshine
Moonshine begins by decomposing the random field into its component light cones, shown at far left in Figure 2. The algorithm then proceeds through two successive stages of clustering, separated by a dimension-reduction step. The main steps of Moonshine are given in Algorithm 1.
The output of the procedure is a set of predictive states, each of which consists of a set of PLCs and FLCs. The predictive states are used to create a pair of nonparametric density estimates, one over PLCs and one over FLCs, which jointly identify each state.
Initial Clustering: For the first clustering step, Moonshine uses a density-based clustering approach (Ester et al., 1996) to cluster the light cones in the space of PLCs, under the assumption that similar PLCs have similar predictive consequences. Such clustering methods need a specified local-neighborhood size, so we begin with small neighborhoods, progressively increase the neighborhood size until 90% of all points are clustered, and assign the remaining points to the nearest cluster center (effectively hybridizing density-based clustering with k-means). This allows for good coverage while avoiding formation of a single, all-encompassing cluster. (Alternative clustering algorithms, e.g., Zahn (1971); Gokcay and Principe (2002); Zhao et al. (2015), would also work.)
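This progressive density-based step might be sketched as follows. This is our own minimal rendering using scikit-learn's DBSCAN; the initial radius, the growth schedule for the neighborhood radius, and the exact 90% threshold are assumed parameters, not values from the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def progressive_density_cluster(plcs, min_samples=5, eps0=0.1, growth=1.5,
                                coverage=0.9):
    """Grow the DBSCAN neighborhood radius until `coverage` of the PLCs
    are clustered, then assign leftover noise points to the nearest
    cluster center (the k-means-like final step)."""
    eps = eps0
    labels = np.full(len(plcs), -1)
    while (labels >= 0).mean() < coverage:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(plcs)
        eps *= growth
    # centers of the discovered clusters (noise points excluded)
    centers = np.array([plcs[labels == k].mean(axis=0)
                        for k in range(labels.max() + 1)])
    for idx in np.where(labels == -1)[0]:
        labels[idx] = np.argmin(np.linalg.norm(centers - plcs[idx], axis=1))
    return labels
```

Stopping the radius growth at the coverage threshold, rather than clustering everything density-wise, is what avoids the single all-encompassing cluster the text warns about.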
Density Estimation and Dimensionality Reduction: The FLCs associated with each cluster (mapped through their respective light cones) are used to form kernel density estimates over the space of FLCs. In other words, each cluster consists of some set of associated FLCs, and these FLCs are then used to estimate a density over the FLC space. We evaluate each cluster's density at a common set of randomly selected points, whose number is a parameter controlling the degree of dimensionality reduction. The log-probability ratio is taken between the first point and each of the remaining points. This vector of log-probability ratios forms the "signature" of the cluster, following the construction of a canonical sufficient statistic for exponential family distributions (Kulhavý, 1996, p. 123).
Merging Clusters: If the number of clusters is greater than the maximum number of predictive states specified for the model, we cluster again to reduce the number. We cluster the low-dimensional signature vectors with k-means++ (Arthur and Vassilvitskii, 2007) to form the final predictive states. The original light cones are then assigned to the resulting states, so each predictive state has a unique set of PLCs and FLCs with which to form nonparametric density estimates over both the PLC and FLC spaces.
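The signature construction and merging steps might be sketched as follows. Again this is our own rendering: `gaussian_kde` stands in for the paper's FLC density estimator, and the number of shared evaluation points (which sets the signature dimension) is an assumed parameter.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.cluster import KMeans

def cluster_signatures(flcs, labels, eval_points, max_states=10, seed=0):
    """Build a low-dimensional log-ratio 'signature' per initial cluster,
    then merge the clusters by running k-means++ on the signatures."""
    sigs = []
    for k in np.unique(labels):
        kde = gaussian_kde(flcs[labels == k].T)   # density over FLC space
        logp = kde.logpdf(eval_points.T)          # log-densities at shared points
        sigs.append(logp[1:] - logp[0])           # log-ratios vs. the first point
    sigs = np.array(sigs)
    n_states = min(max_states, len(sigs))
    merge = KMeans(n_clusters=n_states, init="k-means++",
                   n_init=10, random_state=seed).fit_predict(sigs)
    # relabel every original light cone with its merged predictive state
    state_of = {k: merge[i] for i, k in enumerate(np.unique(labels))}
    return np.array([state_of[k] for k in labels])
```

Because every cluster's density is evaluated at the same shared points, the log-ratio vectors are directly comparable, which is what lets ordinary k-means++ act as the merge step.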
IV-B One Hundred Proof (OHP)
OHP simplifies Moonshine, with a single clustering step and a subsequent mapping of light cones to clusters. The main difference is the space in which the clustering occurs: Moonshine clusters in the space of PLCs, but OHP clusters in the space of FLCs. Clustering in FLC space effectively groups past light cones by their predictive consequences, learning a geometry in which points with similar futures are "near" each other regardless of differences in their histories. This results in predictive states whose future distributions have near-minimal expected variance (Arthur and Vassilvitskii, 2007), so that once we are sure which state a new PLC maps to, we are highly certain of what outcome the state will generate.
To motivate this choice, imagine that all pasts map to some small set of distinct futures, such as the letters of a discrete finite alphabet. Given an input past $x$, we want to estimate a probability function over the output $y$. One way to do this is to group all occurrences of each future letter $y$, and use that group to estimate the distribution via Bayes' Theorem, namely
$$P(y \mid x) \;\propto\; P(x \mid y)\,P(y).$$
Using nonparametric density estimation over the pasts observed with outcome $y$, we can estimate the first quantity on the right-hand side, and the normalized count of member outcomes estimates the second. This example extends easily to continuous quantities, by clustering in the space of observed future outcomes and substituting predictive states for the finite alphabet; this is the motivation for the OHP algorithm.
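A toy version of this argument, with a two-letter alphabet of futures (purely illustrative; the data and names are ours):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Pasts are 1-D; pasts near 0 tend to produce future "a", pasts near 4 produce "b".
rng = np.random.default_rng(2)
pasts = np.concatenate([rng.normal(0, 0.3, 200), rng.normal(4, 0.3, 100)])
futures = np.array(["a"] * 200 + ["b"] * 100)

# One KDE per future letter estimates P(past | future); counts give P(future).
kdes = {y: gaussian_kde(pasts[futures == y]) for y in ("a", "b")}
prior = {y: (futures == y).mean() for y in ("a", "b")}

def posterior(x):
    """P(future | past = x) via Bayes' Theorem, normalized over the alphabet."""
    scores = {y: kdes[y](np.atleast_1d(x))[0] * prior[y] for y in ("a", "b")}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

print(posterior(0.1))   # mass concentrated on "a"
print(posterior(3.9))   # mass concentrated on "b"
```

Replacing the letters with clusters found in the continuous space of future outcomes gives exactly the OHP construction described above.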
The two steps of OHP are (Algorithm 2):

Cluster FLCs: After decomposing our spatiotemporal process into light cones, we cluster the FLCs using k-means++. The number of clusters (which will become the number of predictive states) is a user-defined parameter.

Map Light Cones: We then map the original light cones to our clusters, and produce our final predictive states, which consist of unique sets of PLCs and FLCs.
As in Moonshine, the FLCs and PLCs of each state are used to compute nonparametric density estimates over the spaces of FLCs and PLCs, providing estimators of the state-conditional FLC and PLC densities, respectively. Algorithm 2 outlines the process of state reconstruction for OHP.
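Under our reading of Algorithm 2, the whole OHP reconstruction fits in a few lines. As before, `gaussian_kde` stands in for the paper's density estimator, and the structure of the returned model is our own choice:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.cluster import KMeans

def ohp_reconstruct(plcs, flcs, n_states, seed=0):
    """Cluster FLCs with k-means++, then equip each resulting predictive
    state with density estimates over its member PLCs and FLCs."""
    states = KMeans(n_clusters=n_states, init="k-means++",
                    n_init=10, random_state=seed).fit_predict(flcs)
    model = []
    for s in range(n_states):
        member = states == s
        model.append({
            "prior": member.mean(),                  # P(s) = n_s / N
            "plc_kde": gaussian_kde(plcs[member].T), # density over member PLCs
            "flc_kde": gaussian_kde(flcs[member].T), # density over member FLCs
            "mean_flc": flcs[member].mean(axis=0),   # state mean, for forecasts
        })
    return states, model
```

The single clustering pass over FLCs, followed by the per-state density estimates, is all there is to it, which is the sense in which OHP is the furthest simplification of the mixed LICORS approach.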
IV-C Predictive Distributions for Light Cone Systems
Given the states reconstructed by Moonshine or OHP, we can estimate predictive distributions as follows. Writing $\ell^-$ for a PLC, $\ell^+$ for the corresponding FLC, and $s_j$ for the predictive states, the conditional probability (or probability density) of $\ell^+$ given $\ell^-$ is obtained by mixing over the predictive states, namely
$$P(\ell^+ \mid \ell^-) = \sum_j P(\ell^+ \mid s_j, \ell^-)\, P(s_j \mid \ell^-) \quad (1)$$
$$= \sum_j P(\ell^+ \mid s_j)\, P(s_j \mid \ell^-), \quad (2)$$
where the second equality follows from the conditional independence of $\ell^+$ and $\ell^-$ given the predictive state $s_j$. The terms $P(s_j \mid \ell^-)$ serve as the mixture weights, and Bayes's Theorem yields
$$P(s_j \mid \ell^-) = \frac{P(\ell^- \mid s_j)\, P(s_j)}{P(\ell^-)} \quad (3)$$
$$\propto P(\ell^- \mid s_j)\, P(s_j). \quad (4)$$
All of the quantities in (2) and (4) can be estimated using our reconstructed predictive states, each of which is associated with unique sets of PLCs and FLCs. We estimate $P(s_j)$ by $n_j / N$, where $N$ is the total number of light cone observations and $n_j$ is the number of light cones assigned to state $s_j$. The two state-conditioned densities $P(\ell^+ \mid s_j)$ and $P(\ell^- \mid s_j)$ are estimated using nonparametric density estimation techniques (such as kernel density estimation) based on the FLCs and PLCs associated with each state. Thus we get
$$\widehat{P}(\ell^+ \mid \ell^-) \;\propto\; \sum_j \hat{f}_j(\ell^+)\, \hat{g}_j(\ell^-)\, \frac{n_j}{N}, \quad (5)$$
where $\hat{f}_j$ and $\hat{g}_j$ denote the nonparametric density estimates of the two corresponding conditional densities.
When we need a point prediction of the future light cone $\ell^+$, we use the conditional mean:
$$\hat{\ell}^+ = \mathbb{E}[\ell^+ \mid \ell^-] \quad (6)$$
$$= \sum_j \mathbb{E}[\ell^+ \mid s_j, \ell^-]\, P(s_j \mid \ell^-) \quad (7)$$
$$= \sum_j \mathbb{E}[\ell^+ \mid s_j]\, P(s_j \mid \ell^-). \quad (8)$$
Replacing $P(s_j \mid \ell^-)$ with (4), plugging in the estimated densities and probabilities, and using the mean future value for state $s_j$ (denoted $\mu_j$) to estimate $\mathbb{E}[\ell^+ \mid s_j]$, we obtain the final prediction rule
$$\hat{\ell}^+ = \sum_j \mu_j\, \widehat{P}(s_j \mid \ell^-) \quad (9)$$
$$= \frac{\sum_j \mu_j\, \hat{g}_j(\ell^-)\, n_j}{\sum_j \hat{g}_j(\ell^-)\, n_j}, \quad (10)$$
which is simply a mixture of the per-state mean predictions $\mu_j$, suitably weighted by each state's estimated PLC density $\hat{g}_j(\ell^-)$ and membership count $n_j$.
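This prediction rule can be sketched directly, assuming a `model` list that holds, for each state, a PLC density estimate, a count-based prior, and the state's mean future value (the structure and names are ours):

```python
import numpy as np

def predict(plc, model):
    """Point forecast for a new PLC: a mixture of per-state mean futures,
    weighted by P(s | plc), which is proportional to g_s(plc) * P(s)."""
    weights = np.array([m["plc_kde"](plc.reshape(-1, 1))[0] * m["prior"]
                        for m in model])
    weights = weights / weights.sum()          # normalize over the states
    means = np.stack([m["mean_flc"] for m in model])
    return weights @ means                     # weighted mixture of state means
```

A PLC near one state's region of PLC space drives that state's weight toward one, so the forecast smoothly interpolates between the state means rather than snapping to a single cluster.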
IV-D Light Cone Linear Regression
If only predictive regression is needed, and not a full generative model, one can perform linear regression directly on light cones. Light cone linear regression uses the same light cone decomposition as the LICORS, Moonshine, and OHP methods, but learns a regression rule directly from past light cones to future light cone values. This has the advantages of extremely fast prediction, good forecasting accuracy, and simple implementation. We evaluate the performance of light cone linear regression on two real-world forecasting tasks in §V.
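Given the decomposition, the whole method reduces to a few lines. This sketch uses scikit-learn's plain `LinearRegression`, consistent with the implementation noted in §V, though any regressor could be substituted:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def light_cone_regression(plcs, flcs):
    """Learn a direct linear map from past light cones to future values."""
    return LinearRegression().fit(plcs, flcs)

# Usage: model = light_cone_regression(train_plcs, train_flcs)
#        forecasts = model.predict(new_plcs)
```

Because the mapping is a single linear transform of the PLC vector, prediction is one matrix multiply per pixel, which is the source of the method's speed.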
V Experimental Setup
To evaluate the effectiveness of light cone methods, we attempt spatiotemporal forecasting on real-world data.
V-A Forecasting Task 1: Electrostatic Potentials
For the first task, the data come from a set of experiments measuring electrostatic potential changes in organic electronic materials (Hoffmann et al., 2013).^{2} We learn a common set of predictive states across experiments, and perform frame-by-frame prediction on a single held-out experiment, effectively cross-validating across experiments.
^{2} Specifically, the data were collected using Kelvin probe force microscopy to measure spatiotemporal changes in electrostatic charge regions on the surface of a poly(3-hexyl)thiophene film.
Each experiment consists of 7–10 time slices, or frames. Each frame is a 256-by-256 matrix of scalar measurements, which we call pixels, since the data resemble video in structure. Predictions are performed for the 254-by-254 interior pixels in each frame after the first, so that each pixel is predicted from a full light cone, excluding marginal light cones.
V-B Forecasting Task 2: Human Speaker Video
For the second task, we predict the next frame of a full-resolution video recording of a human speaker, used in generating an intelligent avatar agent.^{3} In this task, we perform leave-one-frame-out predictions, cross-validating across video frames. Each frame consists of 440-by-330 pixels, of which predictions are performed on the 428-by-328 interior pixels, again excluding marginal light cones. Every fifth frame of the video is retained, and light cones are extracted from roughly one hundred of these skip frames. Forty thousand light cones are subsampled for tractability and used for cross-validation.
^{3} Used with permission from GetAbby (True Image Interactive, LLC).
V-C Comparison Methods and Parameter Settings
We compare the performance of the predictive state reconstruction and forecasting systems with some simple baseline methods. For all light cone methods, the same set of light cones was extracted from the data, using common values of the propagation speed and the past and future horizons, so that all methods see PLCs and FLCs of the same fixed dimensions. We evaluate the performance of the mixed LICORS system, implemented by the authors following Goerg and Shalizi (2013). For tractability, only twenty thousand light cones were used in training each fold for the first task, and forty thousand for the second. Kernel density estimators were used for both PLC and FLC density estimation, to improve predictive performance. Initialization was performed using k-means++, with a small convergence threshold for the EM iterations. For light cone linear regression, we use the linear regression implementation in the scikit-learn package for Python (Pedregosa et al., 2011), version 15.2.
The simplest method we compare against is the "Future-like-the-Past" (FLTP) method, which simply takes the previous value of a pixel and uses it as the prediction for that pixel in the current frame. The K-nearest-neighbor (KNN) regressor takes a past light cone as input, finds the K nearest PLCs in Euclidean distance, and outputs the weighted average of their future light cone values as the current prediction. Below, we report results from the scikit-learn implementation of KNeighborsRegressor with default parameter settings.
V-D Performance Metrics
We compare performance in terms of mean-squared error (MSE) and correlation (Pearson r) with the ground truth. Additionally, for the three distributional methods (mixed LICORS, Moonshine, and OHP) we measure the average per-pixel log-likelihood (Avg. LL) of the predictions, an estimate of the (negative) cross-entropy between the model and the truth, and the perplexity, with lower perplexity being better. For the distributional methods, we test performance both for a large maximum number of states (100) and a small number of states (10).
To avoid negative infinities when model likelihoods are sufficiently close to zero, we apply smoothing to the three distributional models: any likelihood estimate mapping to zero is replaced by a small fixed positive floor value.
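These metrics might be computed as follows. This is our sketch: the floor value is an assumption (the paper's exact constant did not survive extraction), and the base-2 log-likelihood convention is inferred from the relationship between the reported Avg. LL and perplexity columns.

```python
import numpy as np

def evaluate(truth, pred, lik=None, floor=1e-6):
    """MSE, Pearson r, and (for distributional methods) the average
    per-pixel log-likelihood and perplexity 2**(-avg LL), with zero
    likelihoods floored to avoid negative infinities."""
    truth, pred = np.ravel(truth), np.ravel(pred)
    out = {"MSE": np.mean((truth - pred) ** 2),
           "Pearson r": np.corrcoef(truth, pred)[0, 1]}
    if lik is not None:
        avg_ll = np.mean(np.log2(np.maximum(np.ravel(lik), floor)))
        out["Avg. LL"] = avg_ll
        out["Perplexity"] = 2.0 ** (-avg_ll)
    return out
```

The floor acts exactly like the smoothing described above: a pixel the model assigns zero likelihood contributes a large but finite penalty instead of driving the average to negative infinity.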
V-E Qualitative Results
Light cone systems compare favorably to state-of-the-art deep learning methods, such as Mathieu et al. (2015) (seen in Figure 4), which improves on earlier work by Ranzato et al. (2014). The amount of blurring and structural aberration is noticeable in their prediction examples, reproduced here. Compare Figure 5, where a light cone system (mixed LICORS) is used to predict the next frame of human video: the light cone predictions maintain strong structural consistency and minimal blurring, at the cost of some quantization effects (due to predictive state clustering).
For the electrostatic potentials prediction task, Figs. 6 and 7 show three frames of predictions each for Moonshine and OHP, respectively. Each next frame (top to bottom) is predicted using models trained on the remaining six experiments, given PLCs from the previous frame. Error percentage was calculated as the absolute pixelwise prediction error, expressed as a proportion of the maximum dynamic range of the actual values or predictions. Qualitatively, both methods do well, capturing much of the changing dynamics in each frame. The methods have trouble representing the extreme values at the two "hotspots" (visible in the error plots in the third columns), giving instead oversmoothed predictions. Outside those extreme regions, the error residuals lack obvious structure and are relatively small.
V-F Quantitative Results
| Method | States | MSE | 95% CI | Pearson r | 95% CI | Avg. LL | 95% CI | Perplexity |
|---|---|---|---|---|---|---|---|---|
| Future-like-the-Past | | 0.778 | [0.777, 0.780] | 0.615 | [0.614, 0.616] | | | |
| KNN Regression | | 0.852 | [0.851, 0.853] | 0.506 | [0.505, 0.506] | | | |
| Light Cone Linear Regression | | 0.607 | [0.606, 0.608] | 0.628 | [0.627, 0.628] | | | |
| Mixed LICORS | 100 | 0.569 | [0.567, 0.571] | 0.663 | [0.661, 0.665] | -1.034 | [-1.110, -0.964] | 2.052 |
| Moonshine | 100 | 0.570 | [0.569, 0.572] | 0.656 | [0.655, 0.657] | -0.672 | [-0.727, -0.617] | 1.593 |
| One Hundred Proof | 100 | 0.592 | [0.591, 0.593] | 0.641 | [0.640, 0.642] | -1.724 | [-2.127, -1.321] | 3.303 |
| Mixed LICORS | 10 | 0.566 | [0.565, 0.567] | 0.668 | [0.667, 0.669] | -1.022 | [-1.096, -0.947] | 2.030 |
| Moonshine | 10 | 0.609 | [0.605, 0.613] | 0.625 | [0.622, 0.628] | -0.722 | [-0.767, -0.678] | 1.650 |
| One Hundred Proof | 10 | 0.597 | [0.595, 0.598] | 0.648 | [0.646, 0.649] | -0.682 | [-0.757, -0.608] | 1.605 |
Table I shows how well each method did at predicting electrostatic potentials (Task 1). Mixed LICORS and Moonshine have the lowest MSE, with 95% confidence intervals disjoint from those of the other methods. Mixed LICORS also has the highest (Pearson) correlation with the true values. Lastly, of the generative methods (i.e., mixed LICORS, Moonshine, and One Hundred Proof), Moonshine and OHP have the highest average log-likelihood and lowest perplexity. Thus, mixed LICORS and Moonshine provide the best overall performance on this dataset.
Restricting ourselves to the generative methods with a compact number of states (10), mixed LICORS has the lowest average MSE, while Moonshine and One Hundred Proof have the best probabilistic performance, giving the highest likelihoods and lowest perplexities for the data.
| Method | States | MSE | 95% CI | Pearson r | 95% CI | Avg. LL | 95% CI | Perplexity |
|---|---|---|---|---|---|---|---|---|
| Future-like-the-Past | | 0.031 | [0.031, 0.031] | 0.984 | [0.984, 0.984] | | | |
| KNN Regression | | 0.033 | [0.033, 0.033] | 0.984 | [0.984, 0.984] | | | |
| Light Cone Linear Regression | | 0.028 | [0.028, 0.028] | 0.986 | [0.986, 0.986] | | | |
| Mixed LICORS | 100 | 0.038 | [0.038, 0.038] | 0.981 | [0.981, 0.981] | 0.102 | [0.099, 0.105] | 0.932 |
| Moonshine | 100 | 0.039 | [0.039, 0.039] | 0.981 | [0.981, 0.981] | 0.925 | [0.874, 0.976] | 0.527 |
| One Hundred Proof | 100 | 1.060 | [0.460, 1.659] | 0.911 | [0.871, 0.952] | -6.48 | [-8.025, -4.948] | 89.641 |
Table II gives the results for video prediction (Task 2). Light cone linear regression has the strongest overall performance, with low error and high correlation with the ground truth. However, the strong temporal consistency of this dataset allows even the FLTP method to perform remarkably well, outperforming the predictive state light cone methods. While forecasting is relatively easy for this task, the ability to estimate a likelihood model for such data gives the predictive state methods an edge over pure regression methods.
VI Discussion
In this manuscript, we have tested an existing light cone method (mixed LICORS), qualitatively comparing it to deep learning methods, and introduced three new light cone methods (light cone linear regression, Moonshine, and OHP). The latter two predictive state methods are successive approximations of the approach used by mixed LICORS, with OHP pushing the limit of how far the approximation can be simplified. OHP proves to be one approximation too far, since the increased simplification comes at the cost of degraded performance.
On the first real-world spatiotemporal regression task, we find that the three LICORS-inspired methods (mixed LICORS, Moonshine, and One Hundred Proof) accurately forecast the changing dynamics of the underlying spatiotemporal system. Furthermore, being generative methods, they can be used to compute the likelihood of spatiotemporal data. Moonshine and One Hundred Proof (OHP) are conceptually simple, easy-to-implement alternatives to the full mixed LICORS system, and give comparable performance for likelihood estimation and forecasting on this task. Although OHP is the simplest method, it fails to perform well in some contexts, such as the second video prediction task, showing a trade-off between method simplicity and forecasting performance.
Light cone linear regression is a fast and simple method that performs well on both prediction tasks. It does not estimate likelihoods over data as the other predictive state methods do, although moving to generalized linear models would allow this. It demonstrates the effectiveness of light cone decompositions and remains a useful approach.
Overall, the best performance on each task was achieved or shared by the three new methods: Moonshine has the best probabilistic modeling performance on both tasks, light cone linear regression has the best forecasting performance on the second task, and OHP models well in the constrained setting of a limited number of states. Moonshine has better probabilistic modeling performance than mixed LICORS on these tasks, and statistically indistinguishable forecasting capability (see Tables I (100-state case) and II). While it might be argued that the improvement is modest, we must remember that these methods are approximations; that they improve performance at all is surprising.
Although OHP has limited forecasting ability, it does manage to model at least one of the datasets well, showing that its simplified form is not entirely without merit. At the very least, this shows when approximations become too simple to accomplish complex tasks. Negative results are important, especially for detecting such boundaries.
VII Theoretical Results
We state a result for light cone predictive state systems, with the proof given in Appendix I.
We wish to bound the error of our estimated distribution over futures given pasts, i.e., the error of $\widehat{P}(\ell^+ \mid \ell^-)$. For a fixed random sample of data, let $\widetilde{P}$ denote the optimal estimate of $P(\ell^+ \mid \ell^-)$ constructable from the sample. By the triangle inequality,
$$\big| \widehat{P} - P \big| \;\le\; \big| \widehat{P} - \widetilde{P} \big| + \big| \widetilde{P} - P \big|.$$
The second summand on the right-hand side is the gap between the optimal estimate and the truth, which we assume shrinks in probability with the sample size (as in Goerg and Shalizi 2012). We focus on the first term, the gap between our light-cone-based nonparametric estimator and the optimal estimate. For this quantity, we state our main result:
Theorem 1.
For a fixed data sample of size $n$, let $\widetilde{P}$ denote the optimal estimator based on that sample and $\widehat{P}$ the light cone estimator based on the same sample. Suppose the relevant densities are bounded by a constant, and that the stated convergence condition holds for every predictive state. Then for any error tolerance, any confidence level, and sufficiently large $n$, the gap $\lvert \widehat{P} - \widetilde{P} \rvert$ is small with high probability, where the bound involves the smallest sum of weights over the predictive states and the kernel bandwidth.
Proof sketch (see appendix for details): For the first term, we mix over states and use the chain rule to condition. We then add and subtract the corresponding optimal state-conditional terms, and split the sum into two parts. By the stated assumptions, the second sum is bounded and decreases to zero, so for sufficiently large $n$ it is smaller than any fixed tolerance. The first sum we bound with high probability, using a Hoeffding bound for dependent data (van de Geer, 2002). The result follows directly from application of the Hoeffding bound.
VIII Related and Future Work
VIII-A Related Work
Our debt to Goerg and Shalizi (2012, 2013) needs no elaboration. We share the same general framework, but aim at simpler algorithms, even at the cost of some predictive power. The work on LICORS grew out of earlier work on predictive Markovian representations of non-Markovian time series (Knight, 1975; Crutchfield and Young, 1989; Shalizi and Crutchfield, 2001; Shalizi and Klinkner, 2004), whose transfer to spatiotemporal data was originally aimed at unsupervised pattern analysis in natural systems (Shalizi et al., 2004, 2006); our qualitative results suggest Moonshine and OHP remain suitable for this, as well as for prediction. The formalism used in this line of work is mathematically equivalent to the "predictive representations of state" introduced by Littman et al. (2002), lately the focus of much interest in conjunction with spectral estimation methods (Boots and Gordon, 2011). Both formalisms are also equivalent to observable operator models (Jaeger, 2000) and to "sufficient posterior" representations (Langford et al., 2009); our approach may suggest new estimation algorithms within those formalisms.
VIII-B Future Work
Light cone methods, such as the three described here, hold promise for the prediction of dynamical systems. Given the flexibility and generality of light cone decompositions, one can easily extend such methods to handle full-color video (e.g., Figure 8) and Kinect™ sensor depth video. These applications are the focus of current and future research.
The "rate-limiting step" for approximate light cone methods like Moonshine and OHP is the speed of nonparametric density estimation: methods that scale poorly in the number of observations are of limited use. To that end, future research into fast approximate nonparametric density estimation will improve the computational efficiency of the methods presented.
The theoretical properties of the two predictive state methods will be further explored in a future paper, especially with regard to the tradeoffs in their approximation to what LICORS or mixed LICORS would do, and the influence of the new algorithms’ internal randomness.
IX Conclusion
Faced with the task of learning to accurately model video-like data, we explored the strengths and drawbacks of light cone decomposition methods and proposed new simplified nonparametric predictive state methods inspired by the mixed LICORS algorithm (Goerg and Shalizi, 2013). The methods, Moonshine and One Hundred Proof, do not require costly iterative EM training or the memory-intensive formation of an explicit weight matrix, yet retain generative modeling capability and are competitive in predictive performance with the original mixed LICORS method. The methods perform well on one real-world data task, effectively capturing spatiotemporal structure and outperforming baseline methods, while a light cone version of linear regression performs well on the remaining task. Overall, we see that light cone decompositions of complex spatiotemporal data open opportunities to tractably estimate probability densities and accurately forecast changing systems. By introducing simplified versions of light cone algorithms, we hope to encourage further exploration and application of this general technique.
References
 Arthur and Vassilvitskii [2007] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

 Boots and Gordon [2011] B. Boots and G. Gordon. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In W. Burgard and D. Roth, editors, Proceedings of the 25th National Conference on Artificial Intelligence (AAAI-2011), pages 293–300, Menlo Park, California, 2011. AAAI.
 Crutchfield and Young [1989] J. P. Crutchfield and K. Young. Inferring statistical complexity. Physical Review Letters, 63:105–108, 1989.
 Ester et al. [1996] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
 Goerg and Shalizi [2012] G. M. Goerg and C. R. Shalizi. LICORS: Light cone reconstruction of states for nonparametric forecasting of spatiotemporal systems. arXiv preprint arXiv:1206.2398, 2012.
 Goerg and Shalizi [2013] G. M. Goerg and C. R. Shalizi. Mixed LICORS: A nonparametric algorithm for predictive state reconstruction. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pages 289–297, 2013.
 Gokcay and Principe [2002] E. Gokcay and J. Principe. Information theoretic clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(2):158–171, Feb 2002. ISSN 01628828. doi: 10.1109/34.982897.
 Hoffmann et al. [2013] P. B. Hoffmann, A. G. Gagorik, X. Chen, and G. R. Hutchison. Asymmetric surface potential energy distributions in organic electronic materials via kelvin probe force microscopy. The Journal of Physical Chemistry C, 117(36):18367–18374, 2013.
 Jaeger [2000] H. Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12:1371–1398, 2000. doi: 10.1162/089976600300015411.
 Knight [1975] F. B. Knight. A predictive view of continuous time processes. Annals of Probability, 3:573–596, 1975.
 Kulhavý [1996] R. Kulhavý. Recursive Nonlinear Estimation: A Geometric Approach, volume 216 of Lecture Notes in Control and Information Sciences. SpringerVerlag, Berlin, 1996. pp. 115.
 Langford et al. [2009] J. Langford, R. Salakhutdinov, and T. Zhang. Learning nonlinear dynamic models. In A. Danyluk, L. Bottou, and M. Littman, editors, Proceedings of the 26th Annual International Conference on Machine Learning [ICML 2009], pages 593–600, New York, 2009. Association for Computing Machinery.
 Littman et al. [2002] M. L. Littman, R. S. Sutton, and S. Singh. Predictive representations of state. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS 2001), pages 1555–1561, Cambridge, Massachusetts, 2002. MIT Press.
 Mathieu et al. [2015] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
 Parlitz and Merkwirth [2000] U. Parlitz and C. Merkwirth. Prediction of spatiotemporal time series based on reconstructed local states. Physical Review Letters, 84:1890–1893, 2000.

 Parzen [1962] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, pages 1065–1076, 1962.
 Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

 Rabiner [1989] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
 Ranzato et al. [2014] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
 Rosenblatt et al. [1956] M. Rosenblatt et al. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832–837, 1956.
 Shalizi and Crutchfield [2001] C. R. Shalizi and J. P. Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. Journal of Statistical Physics, 104:817–879, 2001.
 Shalizi and Klinkner [2004] C. R. Shalizi and K. L. Klinkner. Blind construction of optimal nonlinear recursive predictors for discrete sequences. In M. Chickering and J. Y. Halpern, editors, Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004), pages 504–511, Arlington, Virginia, 2004. AUAI Press.
 Shalizi et al. [2004] C. R. Shalizi, K. L. Klinkner, and R. Haslinger. Quantifying self-organization with optimal predictors. Physical Review Letters, 93:118701, 2004. doi: 10.1103/PhysRevLett.93.118701.
 Shalizi et al. [2006] C. R. Shalizi, R. Haslinger, J.-B. Rouquier, K. L. Klinkner, and C. Moore. Automatic filters for the detection of coherent structure in spatiotemporal systems. Physical Review E, 73:036104, 2006.

 van de Geer [2002] S. A. van de Geer. On Hoeffding’s inequality for dependent random variables. In H. Dehling, T. Mikosch, and M. Sorensen, editors, Empirical Process Techniques for Dependent Data, pages 161–169. Birkhäuser, Boston, 2002.
 Zahn [1971] C. T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 100(1):68–86, 1971.
 Zhao et al. [2015] H. Zhao, P. Poupart, Y. Zhang, and M. Lysy. SoF: Soft-cluster matrix factorization for probabilistic clustering. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI 2015, 2015.
Appendix I: Proofs
Lemma 1.
Let denote the density for state under the true assignment matrix and let . Given an isolated change in the weight , the difference between the density estimates and is bounded by
Proof.
Furthermore, we can bound this quantity by
∎
Lemma 2.
Let be defined as in Lemma 1. Given a fixed data sample of size , for all , and we have
Proof.
Once the sample is fixed, becomes a deterministic function of the sample, and becomes a deterministic constant. Following van de Geer [2002], we define
where denotes that the two functions only differ at the th matrix entry, and are constant (degenerate) random variables for a fixed sample and
Then, for all , and , we have
Given a fixed sample of size , choose such that for all . Then
Because for all , we have
Having already established that , we set and obtain
∎
Theorem 1.
For a fixed data sample of size , let denote the optimal estimator based on that sample and be the light cone estimator based on the same sample. Let be bounded by a constant for all . If for all , then for any , , , and sufficiently large ,
where is the (smallest) sum of weights for the predictive states, and is a kernel of bandwidth .
Proof.
Therefore,
For sufficiently large , and , given that is bounded and . Therefore, for sufficiently large ,
where the penultimate inequality follows from Lemma 2. ∎
Appendix II: Implementation Details
We now discuss how various parameter settings were chosen for the two algorithms, as well as some computational techniques used to improve runtime performance.
Choosing Number of States
In both mixed LICORS and Moonshine, a user must specify the maximum number of predictive states for the model, which effectively controls the complexity of the model. In OHP, one must specify the exact number of predictive states, since the number is determined by a k-means++ [Arthur and Vassilvitskii, 2007] clustering step. In all cases, this number can be chosen based on user preference for simpler models, or cross-validation may be used to find the number of states that gives the best predictive performance on held-out data.
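The cross-validation step can be sketched as a generic selection loop. Here `fit_fn` and `score_fn` are hypothetical placeholders for fitting and scoring whichever light cone algorithm is being tuned; they are not part of the algorithms themselves:

```python
def choose_num_states(train, heldout, candidate_ks, fit_fn, score_fn):
    """Pick the number of predictive states with the best held-out score.

    `fit_fn(data, k)` fits a model with (at most) k states; `score_fn(model,
    data)` returns its predictive score on held-out data (higher is better).
    Both are placeholders for the light cone method being tuned.
    """
    best_k, best_score = None, float("-inf")
    for k in candidate_ks:
        score = score_fn(fit_fn(train, k), heldout)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

The same loop serves both mixed LICORS/Moonshine (maximum number of states) and OHP (exact number of states).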
Dimensionality Reduction Choice in Moonshine
Another parameter that must be chosen is the degree of dimensionality reduction applied when forming distribution signatures in Moonshine. The data can guide this choice (through cross-validation), or a user preference for more compact models can argue for greater degrees of dimensionality reduction. The fewer the dimensions, the less discriminative the signatures, and thus the higher the likelihood of merging clusters.
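As one illustration, such a reduction could be a PCA projection of the signatures computed via the SVD. This is only a sketch under the assumption that a linear projection is acceptable; Moonshine's signatures could equally be reduced by another method:

```python
import numpy as np

def reduce_signatures(signatures, n_dims):
    """Project distribution signatures onto their top n_dims principal
    components (a PCA sketch; any comparable reduction would do)."""
    S = np.asarray(signatures, dtype=float)
    S = S - S.mean(axis=0)          # center each signature dimension
    # rows of vt are the right-singular vectors of the centered data,
    # i.e. the principal directions in decreasing order of variance
    _, _, vt = np.linalg.svd(S, full_matrices=False)
    return S @ vt[:n_dims].T
```

Smaller `n_dims` yields more compact (but less discriminative) signatures, matching the trade-off described above.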
Density Based Clustering Considerations
When using density-based clustering such as DBSCAN [Ester et al., 1996], two issues arise. First, a suitable local neighborhood size must be chosen (controlled by an ε parameter). Second, such methods can be computationally expensive and thus slow. To address the first issue, we take an iterative search approach: we begin with very small neighborhood sizes, then increase them until a significant portion of the data is clustered, while keeping that proportion below 100%. To address the second issue, we use DBSCAN to cluster only a seed portion of all observations, then assign the remaining observations to the nearest cluster centers, which greatly improves runtime. Controlling the proportion of data used for seeding versus the portion assigned to cluster centers affects the degree of forced convexity of the resulting clusters, and also determines the total runtime of the clustering. Fewer seed points result in faster clustering, but with more convex-shaped (e.g., k-means-like) clusters.
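The seed-and-assign strategy can be sketched generically. In this sketch, `cluster_seed_fn` stands in for DBSCAN (or any clusterer returning integer labels, with -1 for noise); the function name and interface are assumptions for illustration, not our exact implementation:

```python
import numpy as np

def seed_and_assign(X, seed_frac, cluster_seed_fn, rng=None):
    """Cluster a random seed subset of X with `cluster_seed_fn` (e.g. DBSCAN),
    then assign every observation to the nearest seed-cluster center."""
    rng = np.random.default_rng(rng)
    seed_idx = rng.choice(len(X), size=max(1, int(seed_frac * len(X))),
                          replace=False)
    seeds = X[seed_idx]
    seed_labels = np.asarray(cluster_seed_fn(seeds))
    # one center per seed cluster; noise points (label -1) are dropped
    centers = np.array([seeds[seed_labels == k].mean(axis=0)
                        for k in sorted(set(seed_labels.tolist())) if k != -1])
    # nearest-center assignment for all observations, seeds included
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers
```

A smaller `seed_frac` speeds up the expensive clustering step but pushes the final clusters toward the convex, nearest-center shapes noted above.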
Scaling
Since Moonshine and OHP cluster based on distances, it is important to normalize the scaling of all axes and dynamic ranges in all experiments. Additionally, if the scale of the training light cones differs from the scale of the test light cones, predictive performance will suffer.
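A minimal sketch of this convention, assuming simple per-dimension standardization (the function names are illustrative): the statistics are fit on the training light cones only and reused, unchanged, on the test light cones:

```python
import numpy as np

def fit_scaler(train_plcs):
    """Per-dimension mean and standard deviation of the training light cones."""
    mu = train_plcs.mean(axis=0)
    sigma = train_plcs.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant dimensions
    return mu, sigma

def apply_scaler(plcs, mu, sigma):
    """Apply the *training* statistics to any set of light cones."""
    return (plcs - mu) / sigma
```

Rescaling the test light cones with their own statistics, rather than the training statistics, is exactly the train/test scale mismatch warned about above.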
Nonparametric Density Estimation
Nonparametric density estimation techniques are instance-based and become slow as the number of observations grows. Our algorithms use kernel density estimators [Rosenblatt et al., 1956, Parzen, 1962], for which we retain only a randomly chosen subsample of five hundred points in each cluster to compute the densities. The resulting systems still perform well, as shown in §V, while remaining computationally tractable.
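The subsampling trick can be sketched as follows, assuming a Gaussian kernel for concreteness (the kernel, bandwidth, and function name here are illustrative assumptions, not the exact estimator settings used in the experiments):

```python
import numpy as np

def kde_log_density(x, cluster_points, bandwidth=0.5, max_points=500, rng=None):
    """Log of a Gaussian kernel density estimate at x, supported on at most
    `max_points` randomly chosen points from the cluster."""
    rng = np.random.default_rng(rng)
    pts = np.asarray(cluster_points, dtype=float)
    if len(pts) > max_points:
        # the subsampling trick: keep a random subset of support points
        pts = pts[rng.choice(len(pts), size=max_points, replace=False)]
    d = pts.shape[1]
    diffs = (np.asarray(x, dtype=float) - pts) / bandwidth
    # log of each Gaussian kernel evaluated at x
    log_k = (-0.5 * np.sum(diffs ** 2, axis=1)
             - d * np.log(bandwidth) - 0.5 * d * np.log(2.0 * np.pi))
    return np.log(np.mean(np.exp(log_k)))
```

Capping the support at a few hundred points keeps each density evaluation O(`max_points`) regardless of cluster size, which is the source of the tractability claimed above.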