The LICORS Cabinet: Nonparametric Algorithms for Spatio-temporal Prediction

06/08/2015
by   George D. Montanez, et al.
Carnegie Mellon University

Spatio-temporal data is intrinsically high dimensional, so unsupervised modeling is only feasible if we can exploit structure in the process. When the dynamics are local in both space and time, this structure can be exploited by splitting the global field into many lower-dimensional "light cones". We review light cone decompositions for predictive state reconstruction, introducing three simple light cone algorithms. These methods allow for tractable inference of spatio-temporal data, such as full-frame video. The algorithms make few assumptions on the underlying process yet have good predictive performance and can provide distributions over spatio-temporal data, enabling sophisticated probabilistic inference.


I Introduction

Modeling spatio-temporal data, such as high-resolution video, is hard. The sheer dimensionality of the data often makes global inference intractable. Similar curses of dimensionality for textual and time-series data have been met with great success by HMMs (Rabiner, 1989), which use localized models for prediction and tractable inference on sequences. Inspired by this example, we look to localized models for spatio-temporal data, like video and fMRI data. Light cone methods, such as mixed LICORS (Goerg and Shalizi, 2013), successfully reduce the global inference task to iterating a tractable, localized one. These methods can be used both for regression (point predictions of real-valued outputs from input variables) and for computing probability densities. The latter property allows one to tractably compute distributions over spaces of events, e.g., over the space of all possible videos, just as HMMs induce probability distributions over the set of all possible sequences (Figure 1). This ability could make light cone decompositions as general and useful for modeling spatio-temporal data as HMMs are for textual and time-series data.

Fig. 1: Probability densities over the space of all strings and the space of all videos.

The goals of this manuscript are thus: (1) Showing how light cone decompositions help make spatio-temporal modeling tasks tractable; (2) Introducing three easy-to-implement light cone algorithms, allowing others to begin experimenting with light cone methods; (3) Assessing the predictive accuracy of light cone methods on two video prediction tasks; and (4) Providing a finite sample guarantee on the error of predictive state light cone methods. We begin with some preliminaries.

II Notation and Preliminaries

Given a random field X(r, t), observed at each point r of a regular spatial lattice at discrete time instants t, we seek to approximate a joint likelihood over the observations of the spatio-temporal process, and to accurately forecast the future of the process. Since causal influences in physical systems only propagate at finite speed (denoted c), we follow Parlitz and Merkwirth (2000) and adopt the concept of light cones, defined as the set of events that could influence X(r, t). Formally, a past light cone (PLC) is the set of all past variables (strictly, we should distinguish the light cone proper, which is a region of space-time, from the configuration of the random field over this region; we elide the distinction for brevity) that could have affected X(r, t):

ℓ⁻(r, t) = { X(s, u) : u ≤ t, ‖s − r‖ ≤ c (t − u) }.

Similarly, a future light cone (FLC) is the set of all future events that could be affected by X(r, t):

ℓ⁺(r, t) = { X(s, u) : u > t, ‖s − r‖ ≤ c (u − t) }.

As a practical matter, not all past (or future) events are equally informative, since more recent events tend to exert greater causal influence. Thus, in practice, we can approximate the true past light cone with a much smaller subset light cone, improving tractability without incurring much predictive error.
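For concreteness, the following Python sketch shows one way to decompose a (T, H, W) array into (PLC, FLC) vector pairs. It is illustrative only: the horizons h_p and h_f, the speed c, and the square spatial neighborhoods are placeholder choices, not the settings used in our experiments.

import numpy as np

def extract_light_cones(field, h_p=1, h_f=1, c=1):
    """Decompose a (T, H, W) array into (PLC, FLC) vector pairs.

    h_p and h_f are the past/future temporal horizons and c the propagation
    speed; these are illustrative defaults, not the paper's settings. Square
    neighborhoods of radius c*dt stand in for the spatial cross-sections of
    the cone at temporal offset dt.
    """
    T, H, W = field.shape
    m = c * max(h_p, h_f)                      # margin guaranteeing full cones
    plcs, flcs = [], []
    for t in range(h_p, T - h_f):
        for i in range(m, H - m):
            for j in range(m, W - m):
                past = [field[t, i, j]]        # present value (u <= t convention)
                for dt in range(1, h_p + 1):   # slices of the past cone
                    r = c * dt
                    past.extend(field[t - dt, i - r:i + r + 1,
                                      j - r:j + r + 1].ravel())
                future = []
                for dt in range(1, h_f + 1):   # slices of the future cone
                    r = c * dt
                    future.extend(field[t + dt, i - r:i + r + 1,
                                        j - r:j + r + 1].ravel())
                plcs.append(past)
                flcs.append(future)
    return np.asarray(plcs), np.asarray(flcs)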

Furthermore, we adopt the conditional independence assumption for light cones given in Goerg and Shalizi (2013), which allows the joint likelihood to factorize into a product of conditional likelihoods. Indexing each light cone by a single integer i for simplicity of notation, the joint pdf of the field factorizes as

P(X) ∝ ∏_{i=1}^{N} p(x_i | ℓ⁻_i),

where x_i denotes the FLC associated with site i, ℓ⁻_i its PLC, and the proportionality accounts for incompletely observed light cones along the edge of the field.

Given this factorization, it becomes natural to seek equivalence classes of light cones, i.e., to cluster light cones into sets based on the similarity of their conditional distributions. Such equivalence classes of past light cones are predictive states (Knight, 1975; Goerg and Shalizi, 2012), and our immediate goals become twofold: first, to discover these latent predictive states (i.e., learn a mapping from PLCs to predictive states), and second, to estimate the conditional distribution over FLCs given their predictive state. Goerg and Shalizi (2012) introduced LICORS as a nonparametric method of predictive state reconstruction, followed by mixed LICORS (2013) as a mixture-model extension of LICORS, in which each future light cone is forecast using a mixture of extremal predictive states. Mixed LICORS has predictive advantages over the original LICORS, but requires finding an N × K matrix of weights (where N is the number of light cones and K the number of predictive states) using a form of EM, in which each weight is determined using a kernel density estimate over all N points. Each EM iteration therefore requires work quadratic in N, slowing mixed LICORS considerably for large N. Almost equally daunting, the original algorithms are quite complex and difficult to implement and debug, inhibiting their adoption.

III Contributions

We review the use of light cones for localized spatio-temporal prediction. We introduce two simplified nonparametric methods for the predictive state reconstruction task and a simple regression light cone method for fast and accurate forecasting. The first predictive state method, Moonshine, is a simple meta-algorithm consisting of basic clustering steps combined with dimensionality reduction and nonparametric density estimation. Moonshine is instance-based and requires no iterative likelihood maximization, yet retains many of the qualities of the more complex mixed LICORS method. The second predictive state algorithm, One Hundred Proof (OHP), simplifies the Moonshine approach further and consists of clustering in the space of future light cones, using the clusters to obtain state-specific nonparametric density estimates over the space of PLCs and FLCs. These simple algorithms are much easier to implement than the LICORS algorithms, being simplified approximations of the mixed LICORS system, yet retain many of their forecasting and modeling strengths.

We further conduct two sets of empirical experiments showing the predictive power of light cone methods for predicting video-like data, and report results. Lastly, we give a large sample theoretical guarantee for light cone predictive state systems.

The remainder is structured as follows. §IV describes the Moonshine, One Hundred Proof, and light cone linear regression algorithms. §V describes the experimental setup for two real-world spatio-temporal prediction tasks and gives the results of the algorithms and baselines. §VII gives an upper bound on the estimation error of our methods. §VIII reviews related and future work, while §IX summarizes our findings.

IV Methods

Our simple predictive state reconstruction methods build upon the principles introduced in Goerg and Shalizi (2013) for mixed LICORS. Both new methods reconstruct a set of predictive states and a soft mapping from past light cones to states, through the use of nonparametric density estimation over the space of light cones. That is, for each past light cone ℓ⁻_i the methods compute the normalized weight P(s_j | ℓ⁻_i) of each state s_j for light cone i. Unlike mixed LICORS, the new methods avoid having to explicitly construct an N × K weight matrix, yet retain the benefits of soft-membership mixture modeling.

After describing the reconstruction algorithms, we discuss how one can determine the conditional probability density of an observation given its past light cone, and how to use this conditional density in forecasts. We then describe an additional pure regression light cone method, useful for fast and accurate forecasting without state reconstruction. Appendix II describes parameter settings and practical implementation issues that arise when using the algorithms.

IV-A Moonshine

Fig. 2: Component stages of the Moonshine algorithm.
1:  Decompose the spatio-temporal process into light cone (PLC, FLC) observation tuples.
2:  Cluster PLCs using density-based clustering.
3:  Compute cluster-conditioned density estimates at shared, randomly selected reference points.
4:  if number of clusters > maximum number of states then
5:     Merge clusters in the reduced-dimension signature space.
6:  end if
7:  Map original light cones to final clusters.
Algorithm 1 Moonshine

Moonshine begins by decomposing the random field into its component light cones, shown at far left in Figure 2. The algorithm then proceeds through two successive stages of clustering, separated by a dimension-reduction step. The main steps of Moonshine are given in Algorithm 1.

The output of the procedure is a set of predictive states, each of which consists of a set of PLCs and FLCs. The predictive states are used to create a pair of nonparametric density estimates, one over PLCs and one over FLCs, which jointly identify each state.

Initial Clustering: For the first clustering step, Moonshine uses a density-based clustering approach (Ester et al., 1996) to cluster the light cones in the space of PLCs, which assumes that similar PLCs have similar predictive consequences. Such clustering methods need a specified local-neighborhood size, so we begin with small neighborhoods, progressively increase the size until 90% of all points are clustered, and assign the remaining points to the nearest cluster center (effectively hybridizing density-based clustering with k-means). This allows for good coverage while avoiding formation of a single, all-encompassing cluster. (Alternative clustering algorithms, e.g., Zahn (1971); Gokcay and Principe (2002); Zhao et al. (2015), would also work.)

Density Estimation and Dimensionality Reduction: The FLCs associated with each cluster (mapped through their respective light cones) are used to form kernel density estimates over the space of FLCs. In other words, each cluster consists of some set of associated FLCs, and these FLCs are then used to estimate densities over the FLC space. We evaluate each cluster's density estimate at a small set of randomly selected reference points, whose number is a parameter that controls the degree of dimensionality reduction, and take the log-probability ratio between the first point and each of the remaining points. This vector of log-probability ratios forms the "signature" of the cluster, following the construction of a canonical sufficient statistic for exponential family distributions (Kulhavý, 1996, p. 123).

Merging Clusters: If the number of clusters is greater than the maximum number of predictive states specified for the model, we cluster again to reduce the number. We cluster the low-dimensional signature vectors with k-means++ (Arthur and Vassilvitskii, 2007) to form the final predictive states. The original light cones are then assigned to the resulting states, so each predictive state has a unique set of PLCs and FLCs with which to form nonparametric density estimates over both the PLC and FLC spaces.
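The following Python sketch compresses the Moonshine pipeline into one function. It is illustrative rather than a reference implementation: the helper cluster_plcs (a DBSCAN/k-means hybrid, sketched in Appendix II), the parameter names, and the bandwidth and reference-point counts are placeholder choices.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

def moonshine(plcs, flcs, max_states=10, n_ref=20, bandwidth=0.5, seed=0):
    """Illustrative Moonshine sketch (not the paper's exact implementation).

    1. Cluster PLCs with a density-based scheme (see Appendix II sketch).
    2. Build a low-dimensional 'signature' per cluster from log-density
       ratios of its FLC kernel density estimate at shared reference points.
    3. If there are too many clusters, merge them with k-means++ on the
       signatures, then map every light cone to its final state.
    """
    rng = np.random.default_rng(seed)
    labels = cluster_plcs(plcs)                      # assumed helper, Appendix II
    clusters = np.unique(labels)

    # Shared reference points at which every cluster's FLC density is evaluated.
    ref = flcs[rng.choice(len(flcs), size=n_ref, replace=False)]
    signatures = []
    for c in clusters:
        kde = KernelDensity(bandwidth=bandwidth).fit(flcs[labels == c])
        logp = kde.score_samples(ref)
        signatures.append(logp[1:] - logp[0])        # log-probability ratios
    signatures = np.asarray(signatures)

    if len(clusters) > max_states:                   # merge clusters in signature space
        merge = KMeans(n_clusters=max_states, init="k-means++",
                       n_init=10, random_state=seed).fit(signatures)
        state_of_cluster = dict(zip(clusters, merge.labels_))
        states = np.array([state_of_cluster[c] for c in labels])
    else:
        states = labels

    # Per-state density estimates over PLC and FLC space jointly identify each state.
    models = {}
    for s in np.unique(states):
        models[s] = dict(
            plc_kde=KernelDensity(bandwidth=bandwidth).fit(plcs[states == s]),
            flc_kde=KernelDensity(bandwidth=bandwidth).fit(flcs[states == s]),
            weight=float(np.mean(states == s)),
            mean_flc=flcs[states == s].mean(axis=0),
        )
    return states, models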

IV-B One Hundred Proof (OHP)

Fig. 3: The One Hundred Proof algorithm’s input and output. Light cones are input, clustered using the FLCs, which results in density estimates for PLCs and FLCs for each state. Densities are drawn as one-dimensional for simplicity, but are typically multi-dimensional, continuous objects.
1:  Decompose spatio-temporal process into light cone (PLC, FLC) observation pairs.
2:  Cluster FLCs using k-means++ clustering.
3:  Map original light cone pairs to final clusters.
Algorithm 2 One Hundred Proof

OHP simplifies Moonshine, with a single clustering step and a subsequent mapping of light cones to clusters. The main difference is the space in which the clustering occurs: Moonshine clusters in the space of PLCs, but OHP clusters in the space of FLCs. Clustering in FLC space effectively groups past light cones by their predictive consequences, learning a geometry in which points with similar futures are "near" each other regardless of differences in their histories. This results in predictive states with expected near-minimal-variance future distributions (Arthur and Vassilvitskii, 2007), such that once we are sure of which state a new PLC maps to, we are highly certain of what outcome the state will generate.

To motivate this choice, imagine that all pasts map to some small set of distinct futures, such as the letters of a discrete finite alphabet. Given an input past x we want to estimate a probability function over the output y. One way to do this is to group all occurrences of future y and use that group to estimate the distribution via Bayes' theorem, namely

P(y | x) ∝ p(x | y) P(y).

Using nonparametric density estimation over the pasts observed with outcome y, we can estimate the first quantity on the right-hand side, and the normalized count of member outcomes estimates the second. This example extends easily to continuous quantities by clustering in the space of observed future outcomes and substituting predictive states for the finite alphabet, which is the motivation for the OHP algorithm.

The two steps of OHP are (Algorithm 2):

  1. Cluster FLCs: After decomposing our spatio-temporal process into light cones, we cluster the FLCs using k-means++. The number of clusters (which will become the number of predictive states) is a user-defined parameter.

  2. Map Light Cones: We then map the original light cones to our clusters, and produce our final predictive states, which consist of unique sets of PLCs and FLCs.

As in the case of Moonshine, the FLCs and PLCs for each state are used to compute nonparametric density estimates over the space of FLCs and PLCs, providing estimators for the state-conditioned FLC and PLC densities, respectively. Algorithm 2 outlines the process of state reconstruction for OHP.
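A corresponding sketch for OHP is even shorter (again illustrative, with placeholder bandwidth and state count):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

def one_hundred_proof(plcs, flcs, n_states=10, bandwidth=0.5, seed=0):
    """Illustrative OHP sketch: cluster in FLC space, then build per-state
    kernel density estimates over both PLC and FLC space."""
    km = KMeans(n_clusters=n_states, init="k-means++",
                n_init=10, random_state=seed).fit(flcs)
    states = km.labels_
    models = {}
    for s in range(n_states):
        idx = states == s
        models[s] = dict(
            plc_kde=KernelDensity(bandwidth=bandwidth).fit(plcs[idx]),   # p(PLC | state)
            flc_kde=KernelDensity(bandwidth=bandwidth).fit(flcs[idx]),   # p(FLC | state)
            weight=float(idx.mean()),                                    # P(state)
            mean_flc=flcs[idx].mean(axis=0),                             # state mean future
        )
    return states, models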

IV-C Predictive Distributions for Light Cone Systems

Given the states reconstructed by Moonshine or OHP, we can estimate predictive distributions as follows. The conditional probability (or probability density) of an FLC x given its PLC ℓ⁻ is obtained by mixing over the predictive states s_1, …, s_K, namely

p(x | ℓ⁻) = ∑_{j=1}^{K} p(x | s_j, ℓ⁻) P(s_j | ℓ⁻)   (1)
          = ∑_{j=1}^{K} p(x | s_j) P(s_j | ℓ⁻),      (2)

where the second equality follows from the conditional independence of x and ℓ⁻ given the predictive state s_j. The terms P(s_j | ℓ⁻) serve as the mixture weights, and Bayes' theorem yields

P(s_j | ℓ⁻) = p(ℓ⁻ | s_j) P(s_j) / p(ℓ⁻)                           (3)
            = p(ℓ⁻ | s_j) P(s_j) / ∑_{k=1}^{K} p(ℓ⁻ | s_k) P(s_k).  (4)

All of the quantities in (2) and (4) can be estimated using our reconstructed predictive states, which are each associated with unique sets of PLCs and FLCs. We estimate P(s_j) by N_j / N, where N is the total number of light cone observations and N_j is the number of light cones assigned to state s_j. The two state-conditioned densities p(x | s_j) and p(ℓ⁻ | s_j) are estimated using nonparametric density estimation techniques (such as kernel density estimation) based on their associated FLCs and PLCs. Thus we get

p̂(x | ℓ⁻) = ∑_{j=1}^{K} p̂(x | s_j) [ p̂(ℓ⁻ | s_j) (N_j/N) / ∑_{k=1}^{K} p̂(ℓ⁻ | s_k) (N_k/N) ],   (5)

where p̂(x | s_j) and p̂(ℓ⁻ | s_j) denote the nonparametric density estimates of the two corresponding conditional densities.

When we need a point prediction of x, we use the conditional mean:

E[x | ℓ⁻] = ∫ x p(x | ℓ⁻) dx                              (6)
          = ∫ x ∑_{j=1}^{K} p(x | s_j) P(s_j | ℓ⁻) dx      (7)
          = ∑_{j=1}^{K} P(s_j | ℓ⁻) E[x | s_j].            (8)

Replacing P(s_j | ℓ⁻) with (4), plugging in the estimated densities and probabilities, and using the mean future value for state s_j (denoted μ_j) to estimate E[x | s_j], we obtain the final prediction rule

x̂(ℓ⁻) = ∑_{j=1}^{K} P̂(s_j | ℓ⁻) μ_j                                                         (9)
       = ∑_{j=1}^{K} [ p̂(ℓ⁻ | s_j) (N_j/N) / ∑_{k=1}^{K} p̂(ℓ⁻ | s_k) (N_k/N) ] μ_j,          (10)

which is simply a suitably weighted mixture of the mean predictions for each state.
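Given the per-state density estimates (e.g., the models dictionaries from the sketches in §IV-A and §IV-B), the rules above reduce to a few lines; the sketch below is illustrative and follows the equation numbering above.

import numpy as np

def state_weights(plc, models):
    """Estimated P(state | PLC) via Bayes' theorem, Eqs. (3)-(4)."""
    logw = np.array([m["plc_kde"].score_samples(plc[None, :])[0] + np.log(m["weight"])
                     for m in models.values()])
    logw -= logw.max()                           # for numerical stability
    w = np.exp(logw)
    return w / w.sum()

def predictive_log_density(flc, plc, models):
    """Estimated log p(FLC | PLC): the mixture of Eq. (5)."""
    w = state_weights(plc, models)
    dens = np.array([np.exp(m["flc_kde"].score_samples(flc[None, :])[0])
                     for m in models.values()])
    return float(np.log(w @ dens))

def point_prediction(plc, models):
    """Conditional-mean forecast: the weighted mixture of state means, Eqs. (9)-(10)."""
    w = state_weights(plc, models)
    means = np.stack([m["mean_flc"] for m in models.values()])
    return w @ means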

IV-D Light Cone Linear Regression

If only predictive regression is needed and not a full generative model, one can perform linear regression directly using light cones. Light cone linear regression uses the same light cone decomposition as the LICORS, Moonshine and OHP methods, but learns a regression rule directly from past light cones to future light cone values. This has the advantages of extremely fast prediction and good forecasting accuracy, along with simple implementation. We evaluate the performance of light cone linear regression on two real-world forecasting tasks, in §V.
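A minimal sketch using scikit-learn (assuming the extract_light_cones helper from §II and hypothetical train_field/test_field arrays):

from sklearn.linear_model import LinearRegression

# Flattened PLC vectors are the regressors; FLC vectors are the (multi-output) targets.
plcs_train, flcs_train = extract_light_cones(train_field)
plcs_test, flcs_test = extract_light_cones(test_field)

model = LinearRegression().fit(plcs_train, flcs_train)  # single multi-output fit
flcs_pred = model.predict(plcs_test)                    # fast point forecasts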

V Experimental Setup

In order to evaluate the effectiveness of light cone methods, we attempt spatio-temporal forecasting on real-world data.

V-A Forecasting Task 1: Electrostatic Potentials

For the first task, the data come from a set of experiments measuring electrostatic potential changes in organic electronic materials (Hoffmann et al., 2013). (Specifically, the data were collected using kelvin probe force microscopy to measure spatio-temporal changes in electrostatic charge regions on the surface of a poly(3-hexyl)thiophene film.) We learn a common set of predictive states across experiments, and do frame-by-frame prediction on a single held-out experiment, effectively cross-validating across experiments.

Each experiment consists of 7–10 time slices, or frames. Each frame is a 256-by-256 matrix of scalar measurements, which we call pixels, since the data resembles video in structure. Predictions are performed for 254-by-254 pixels in each frame after the first, which allows for each pixel to be predicted based on a full light cone, thus excluding marginal light cones.

V-B Forecasting Task 2: Human Speaker Video

For the second task, we predict the next frame of a full-resolution video from a recording of a human speaker, used in generating an intelligent avatar agent. (Used with permission from GetAbby, True Image Interactive, LLC.) In this task, we perform leave-one-frame-out predictions, cross-validating across video frames. Each frame consists of 440-by-330 pixels, of which predictions are performed on the 428-by-328 interior pixels, again excluding marginal light cones. Every fifth frame of the video is retained, and light cones are extracted from roughly one hundred of these retained (skip) frames. Forty thousand light cones are subsampled for tractability and used for cross-validation.

V-C Comparison Methods and Parameter Settings

We compare the performance of predictive state reconstruction and forecasting systems with some simple baseline methods. For all light cone methods, the same set of light cones was extracted from the data, using a common choice of past horizon, future horizon, and propagation speed, which fixes the dimensions of the PLC and FLC vectors. We evaluate the performance of the mixed LICORS system, implemented by the authors following Goerg and Shalizi (2013). For tractability, only twenty thousand light cones were used in training each fold for the first task, and forty thousand for the second. Kernel density estimators were used for both PLC density estimation and FLC density estimation, to improve predictive performance. Initialization was performed using k-means++, and a small convergence threshold was used for the EM iterations. For light cone linear regression, we use the linear regression implementation in the scikit-learn package for Python (Pedregosa et al., 2011), version 0.15.2.

The simplest method we compare against is the "predict the value from the last frame" (Future-like-the-Past, FLTP) method, which simply takes the previous value of a pixel and uses it as the prediction for the pixel in the current frame. The k-nearest neighbor regressor takes as input a past light cone, finds the k nearest PLCs in Euclidean space, and outputs the weighted average of their individual future light cone values as the current prediction. Below, we report results from the scikit-learn implementation of KNeighborsRegressor with default parameter settings.
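For reference, both baselines amount to a few lines of scikit-learn code (frames, plcs_train, flcs_train, and plcs_test are assumed arrays from the light cone extraction step):

from sklearn.neighbors import KNeighborsRegressor

# "Future-like-the-Past" (FLTP): each pixel's prediction is its value in the previous frame.
fltp_pred = frames[:-1]                      # predictions aligned with frames[1:]

# k-nearest-neighbor regression on past light cones, default parameters.
knn = KNeighborsRegressor().fit(plcs_train, flcs_train)
knn_pred = knn.predict(plcs_test)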

V-D Performance Metrics

We compared performance in terms of mean-squared error (MSE) and correlation (Pearson r) with the ground truth. Additionally, for the three distributional methods (mixed LICORS, Moonshine, and OHP) we measured the average per-pixel log-likelihood (Avg. LL) of the predictions, an estimate of the (negative) cross-entropy between the model and the truth, and the perplexity, with lower perplexity being better. For the distributional methods, we tested performance both for a large maximum number of states (100) and a small number of states (10).

To avoid negative infinities appearing when model likelihoods are sufficiently close to zero, we apply smoothing to the three distributional models, replacing any likelihood estimate of zero with a small positive constant.
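The metrics can be computed as below. The smoothing floor is an illustrative value (the constant used in our experiments is not reproduced here), and the base-2 perplexity convention, perplexity = 2^(−Avg. LL), is our reading of the relationship between the log-likelihood and perplexity columns in Tables I and II.

import numpy as np
from scipy.stats import pearsonr

def evaluate(y_true, y_pred, likelihoods=None, floor=1e-6):
    """MSE and Pearson r for all methods; Avg. LL and perplexity for distributional ones.

    floor is an illustrative smoothing constant, not the value used in the paper;
    perplexity is computed as 2**(-Avg. LL), consistent with the tables below.
    """
    out = {
        "mse": float(np.mean((y_true - y_pred) ** 2)),
        "pearson_r": float(pearsonr(y_true.ravel(), y_pred.ravel())[0]),
    }
    if likelihoods is not None:
        lik = np.maximum(likelihoods, floor)        # smooth zero likelihoods
        avg_ll = float(np.mean(np.log2(lik)))       # average per-pixel log-likelihood
        out["avg_ll"] = avg_ll
        out["perplexity"] = 2.0 ** (-avg_ll)
    return out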

V-E Qualitative Results

Fig. 4: Prediction examples of Mathieu et al. (2015)
Fig. 5: Predicting video with mixed LICORS light cone system.
Fig. 6: Predicting electrostatic potentials with Moonshine.
Fig. 7: Predicting electrostatic potentials with OHP.

Light cone systems compare favorably to state-of-the-art deep learning methods, such as Mathieu et al. (2015) (seen in Figure 4), which improves on earlier work by Ranzato et al. (2014). The amount of blurring and structural aberration is noticeable in their prediction examples, reproduced here. Compare with Figure 5, where a light cone system (mixed LICORS) is used to predict the next frame of human video. The light cone predictions maintain strong structural consistency and minimal blurring, at the cost of some quantization effects (due to predictive state clustering).

For the electrostatic potentials prediction task, Figs. 6 and 7 show three frames of predictions each for Moonshine and OHP, respectively. The next frame (top to bottom) is predicted using models trained on the remaining six experiments, given PLCs from the previous frame. Error percentage was calculated as the pixel-wise absolute error normalized by the maximum dynamic range of the actual values or predictions over the true and predicted testing frames. Qualitatively, both methods do well, capturing much of the changing dynamics in each frame. The methods have trouble representing the extreme values at the two "hotspots" (visible in the error plots in the third columns), giving instead over-smoothed predictions. Other than those extreme regions, the error residuals lack obvious structure and are relatively small.

V-F Quantitative Results

Method | MSE | 95% CI | Pearson r | 95% CI | Avg. LL | 95% CI | Perplexity
Future-like-the-Past | 0.778 | [0.777, 0.780] | 0.615 | [0.614, 0.616] | n/a | n/a | n/a
KNN Regression | 0.852 | [0.851, 0.853] | 0.506 | [0.505, 0.506] | n/a | n/a | n/a
Light Cone Linear Regression | 0.607 | [0.606, 0.608] | 0.628 | [0.627, 0.628] | n/a | n/a | n/a
Mixed LICORS 100 | 0.569 | [0.567, 0.571] | 0.663 | [0.661, 0.665] | -1.034 | [-1.110, -0.964] | 2.052
Moonshine 100 | 0.570 | [0.569, 0.572] | 0.656 | [0.655, 0.657] | -0.672 | [-0.727, -0.617] | 1.593
One Hundred Proof 100 | 0.592 | [0.591, 0.593] | 0.641 | [0.640, 0.642] | -1.724 | [-2.127, -1.321] | 3.303
Mixed LICORS 10 | 0.566 | [0.565, 0.567] | 0.668 | [0.667, 0.669] | -1.022 | [-1.096, -0.947] | 2.030
Moonshine 10 | 0.609 | [0.605, 0.613] | 0.625 | [0.622, 0.628] | -0.722 | [-0.767, -0.678] | 1.650
One Hundred Proof 10 | 0.597 | [0.595, 0.598] | 0.648 | [0.646, 0.649] | -0.682 | [-0.757, -0.608] | 1.605
TABLE I: Results for predicting electrostatic potentials.

Table I shows how well each method did at predicting electrostatic potentials (Task 1). Mixed LICORS and Moonshine have the lowest MSE, with 95% confidence intervals disjoint from the intervals of other methods. Mixed LICORS also has the highest (Pearson) correlation with the true values. Lastly, of the generative methods (i.e., mixed LICORS, Moonshine, and One Hundred Proof), Moonshine and OHP achieve the highest average log-likelihoods and lowest perplexities across the two state settings. Thus, mixed LICORS and Moonshine provide the best overall performance on the dataset.

Restricting ourselves to the generative methods with a compact number of states (10), mixed LICORS has the lowest average MSE, while Moonshine and One Hundred Proof have the best probabilistic performance, giving the highest likelihoods and lowest perplexities for the data.

Method | MSE | 95% CI | Pearson r | 95% CI | Avg. LL | 95% CI | Perplexity
Future-like-the-Past | 0.031 | [0.031, 0.031] | 0.984 | [0.984, 0.984] | n/a | n/a | n/a
KNN Regression | 0.033 | [0.033, 0.033] | 0.984 | [0.984, 0.984] | n/a | n/a | n/a
Light Cone Linear Regression | 0.028 | [0.028, 0.028] | 0.986 | [0.986, 0.986] | n/a | n/a | n/a
Mixed LICORS 100 | 0.038 | [0.038, 0.038] | 0.981 | [0.981, 0.981] | 0.102 | [0.099, 0.105] | 0.932
Moonshine 100 | 0.039 | [0.039, 0.039] | 0.981 | [0.981, 0.981] | 0.925 | [0.874, 0.976] | 0.527
One Hundred Proof 100 | 1.060 | [0.460, 1.659] | 0.911 | [0.871, 0.952] | -6.48 | [-8.025, -4.948] | 89.641
TABLE II: Results for predicting video of human speakers.

Table II gives the results from video prediction (Task 2). Light cone linear regression has the strongest overall performance, with low error and high correlation to the ground truth. However, the strong temporal consistency of this dataset allows even the FLTP method to perform remarkably well, outperforming the predictive state light cone methods. While forecasting is relatively easy for this task, being able to estimate a likelihood model for such data gives the predictive state methods an edge over pure regression methods.

VI Discussion

In this manuscript, we have tested an existing light cone method (mixed LICORS), qualitatively comparing it to deep learning methods, and introduced three new light cone methods (light cone linear regression, Moonshine, and OHP). The latter two predictive state methods are successive approximations of the approach used by mixed LICORS, with OHP pushing the limit of how far the approximation could be simplified. OHP is demonstrated to be one approximation too far, since the increased simplification comes at the cost of degraded performance.

On the first real-world spatio-temporal regression task, we find that the three LICORS-inspired methods (mixed LICORS, Moonshine and One Hundred Proof) are able to accurately forecast the changing dynamics of the underlying spatio-temporal system. Furthermore, being generative methods, they can be used to compute the likelihood of spatio-temporal data. Moonshine and One Hundred Proof (OHP) are conceptually simple, easy to implement alternatives to the full mixed LICORS system, which give comparable performance for likelihood estimation and forecasting on this task. Although OHP is the simplest method, it fails to perform well in some contexts, such as the second video prediction task, showing a trade-off between method simplicity and forecasting performance.

Light cone linear regression is a fast and simple method, and is able to perform well on both prediction tasks. It does not estimate likelihoods over data as do the other predictive-state methods, but moving to generalized linear models would allow this. It shows the effectiveness of light cone decompositions and remains a useful approach.

Overall, the best performance on all tasks was achieved or shared by the three new methods, with Moonshine having the best probabilistic modeling performance on both tasks, light cone linear regression having the best forecasting performance on the second task, and OHP having good modeling performance in the constrained setting of a limited number of states. Moonshine has better probabilistic modeling performance than mixed LICORS on these tasks, and statistically indistinguishable forecasting capability (see Tables I (100-state case) and II). While it might be argued that the improvement is not large, we have to remind ourselves that these are approximations; that they improve performance at all is surprising.

Although OHP has limited forecasting ability, it does manage to model at least one of the datasets well, showing that its simplified form is not entirely without merit. This, at the very least, shows when approximations become too simple to accomplish complex tasks; negative results are important, especially for locating such boundaries.

VII Theoretical Results

We state a result for light cone predictive state systems, with proof given in Appendix I.

We wish to bound the error of our estimated distribution over futures given pasts, namely, the error of p̂(x | ℓ⁻). For a fixed random sample of data, let p*(x | ℓ⁻) denote the optimal estimate of p(x | ℓ⁻) constructable from the sample. We begin by noting that

|p̂(x | ℓ⁻) − p(x | ℓ⁻)| ≤ |p̂(x | ℓ⁻) − p*(x | ℓ⁻)| + |p*(x | ℓ⁻) − p(x | ℓ⁻)|.

The second summand on the right-hand side is the gap between the optimal estimate and the truth, which we assume shrinks in probability with the sample size (as in Goerg and Shalizi 2012). We focus on the first term, which is the gap between our light-cone-based nonparametric estimator and the optimal estimate. For this quantity, we state our main result:

Theorem 1.

For a fixed data sample of size , let denote the optimal estimator based on that sample and be the light cone estimator based on the same sample. Let be bounded by a constant for all . If

for all , then for any , , , and sufficiently large ,

where is the (smallest) sum of weights for the predictive states and is a bandwidth kernel.

Proof sketch (see appendix for details): For the quantity of interest, we first mix over states and use the chain rule to condition. We then add and subtract a common term and split the sum into two parts. By the stated assumptions, the second sum is bounded and decreasing to zero, so that for a sufficiently large sample it is smaller than any fixed tolerance. The first sum we bound with high probability using a Hoeffding bound for dependent data (van de Geer, 2002). The result follows directly from application of the Hoeffding bound.

VIII Related and Future Work

VIII-A Related Work

Our debt to Goerg and Shalizi (2012, 2013) needs no elaboration. We share the same general framework, but aim at simpler algorithms, even if it costs some predictive power. The work on LICORS grows out of earlier work on predictive Markovian representations of non-Markovian time series (Knight, 1975; Crutchfield and Young, 1989; Shalizi and Crutchfield, 2001; Shalizi and Klinkner, 2004), whose transfer to spatio-temporal data was originally aimed at unsupervised pattern analysis in natural systems (Shalizi et al., 2004, 2006); our qualitative results suggest Moonshine and OHP remain suitable for this, as well as for prediction. The formalism used in this line of work is mathematically equivalent to the “predictive representations of state” introduced by Littman et al. (2002), and lately the focus of much interest in conjunction with spectral estimation methods (Boots and Gordon, 2011). Both formalisms are also equivalent to observable operator models (Jaeger, 2000) and to “sufficient posterior” representations (Langford et al., 2009); our approach may suggest new estimation algorithms within those formalisms.

VIII-B Future Work

Fig. 8: Color prediction of human film data, using mixed LICORS light cone forecasting.

Light cone methods, such as the three described here, hold promise for the prediction of dynamical systems. Given the flexibility and generality of light cone decompositions, one can easily extend such methods to handle full-color video (e.g., Figure 8), and Kinect™-sensor depth video. These applications are the focus of current and future research.

The “rate limiting step” for approximate light cone methods like Moonshine and OHP is the speed of nonparametric density estimation. Methods that scale poorly in the number of observations are of limited use. Towards that end, future research into fast approximate nonparametric density estimation will improve the computational efficiency of the methods presented.

The theoretical properties of the two predictive state methods will be further explored in a future paper, especially with regard to the trade-offs in their approximation to what LICORS or mixed LICORS would do, and the influence of the new algorithms’ internal randomness.

IX Conclusion

Faced with the task of learning to accurately model video-like data, we explore the strengths and drawbacks of light cone decomposition methods and propose new simplified nonparametric predictive state methods inspired by the mixed LICORS algorithm (Goerg and Shalizi, 2013). The methods, Moonshine and One Hundred Proof, do not require costly iterative EM training or the memory-intensive formation of an explicit N × K weight matrix, yet retain generative modeling capabilities and are competitive in predictive performance with the original mixed LICORS method. The methods are shown to perform well on one real-world data task, effectively capturing spatio-temporal structure and outperforming baseline methods, while a light cone version of linear regression performs well on the remaining task. Overall, we see that light cone decompositions of complex spatio-temporal data open opportunities to tractably estimate probability densities and accurately forecast changing systems. By introducing simplified versions of light cone algorithms, we hope to encourage further exploration and application of this general technique.

References

  • Arthur and Vassilvitskii [2007] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
  • Boots and Gordon [2011] B. Boots and G. Gordon. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In W. Burgard and D. Roth, editors, Proceedings of the 25th National Conference on Artificial Intelligence (AAAI-2011), pages 293–300, Menlo Park, California, 2011. AAAI.
  • Crutchfield and Young [1989] J. P. Crutchfield and K. Young. Inferring statistical complexity. Physical Review Letters, 63:105–108, 1989.
  • Ester et al. [1996] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
  • Goerg and Shalizi [2012] G. M. Goerg and C. R. Shalizi. LICORS: Light cone reconstruction of states for non-parametric forecasting of spatio-temporal systems. arXiv preprint arXiv:1206.2398, 2012.
  • Goerg and Shalizi [2013] G. M. Goerg and C. R. Shalizi. Mixed LICORS: A nonparametric algorithm for predictive state reconstruction. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pages 289–297, 2013.
  • Gokcay and Principe [2002] E. Gokcay and J. Principe. Information theoretic clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(2):158–171, Feb 2002. ISSN 0162-8828. doi: 10.1109/34.982897.
  • Hoffmann et al. [2013] P. B. Hoffmann, A. G. Gagorik, X. Chen, and G. R. Hutchison. Asymmetric surface potential energy distributions in organic electronic materials via kelvin probe force microscopy. The Journal of Physical Chemistry C, 117(36):18367–18374, 2013.
  • Jaeger [2000] H. Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12:1371–1398, 2000. doi: 10.1162/089976600300015411.
  • Knight [1975] F. B. Knight. A predictive view of continuous time processes. Annals of Probability, 3:573–596, 1975.
  • Kulhavý [1996] R. Kulhavý. Recursive Nonlinear Estimation: A Geometric Approach, volume 216 of Lecture Notes in Control and Information Sciences. Springer-Verlag, Berlin, 1996. pp. 115.
  • Langford et al. [2009] J. Langford, R. Salakhutdinov, and T. Zhang. Learning nonlinear dynamic models. In A. Danyluk, L. Bottou, and M. Littman, editors, Proceedings of the 26th Annual International Conference on Machine Learning [ICML 2009], pages 593–600, New York, 2009. Association for Computing Machinery.
  • Littman et al. [2002] M. L. Littman, R. S. Sutton, and S. Singh. Predictive representations of state. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS 2001), pages 1555–1561, Cambridge, Massachusetts, 2002. MIT Press.
  • Mathieu et al. [2015] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
  • Parlitz and Merkwirth [2000] U. Parlitz and C. Merkwirth. Prediction of spatiotemporal time series based on reconstructed local states. Physical Review Letters, 84:1890–1893, 2000.
  • Parzen [1962] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, pages 1065–1076, 1962.
  • Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Rabiner [1989] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
  • Ranzato et al. [2014] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
  • Rosenblatt et al. [1956] M. Rosenblatt et al. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832–837, 1956.
  • Shalizi and Crutchfield [2001] C. R. Shalizi and J. P. Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. Journal of Statistical Physics, 104:817–879, 2001.
  • Shalizi and Klinkner [2004] C. R. Shalizi and K. L. Klinkner. Blind construction of optimal nonlinear recursive predictors for discrete sequences. In M. Chickering and J. Y. Halpern, editors, Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004), pages 504–511, Arlington, Virginia, 2004. AUAI Press.
  • Shalizi et al. [2004] C. R. Shalizi, K. L. Klinkner, and R. Haslinger. Quantifying self-organization with optimal predictors. Physical Review Letters, 93:118701, 2004. doi: 10.1103/PhysRevLett.93.118701.
  • Shalizi et al. [2006] C. R. Shalizi, R. Haslinger, J.-B. Rouquier, K. L. Klinkner, and C. Moore. Automatic filters for the detection of coherent structure in spatiotemporal systems. Physical Review E, 73:036104, 2006.
  • van de Geer [2002] S. A. van de Geer. On Hoeffding's inequality for dependent random variables. In H. Dehling, T. Mikosch, and M. Sorensen, editors, Empirical Process Techniques for Dependent Data, pages 161–169. Birkhäuser, Boston, 2002.
  • Zahn [1971] C. T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. Computers, IEEE Transactions on, 100(1):68–86, 1971.
  • Zhao et al. [2015] H. Zhao, P. Poupart, Y. Zhang, and M. Lysy. Sof: Soft-cluster matrix factorization for probabilistic clustering. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI 2015, 2015.

Appendix I: Proofs

Lemma 1.

Let denote the density for a state under the true assignment matrix. Given an isolated change in a single weight, the difference between the corresponding density estimates is bounded by

Proof.
(11)
(12)
(13)
(14)
(15)
(16)

Furthermore, we can bound this quantity by

(17)
(18)
(19)

Lemma 2.

Let be defined as in Lemma 1. Given a fixed data sample of size , for all , and we have

Proof.

Once the sample is fixed, becomes a deterministic function of the sample, and becomes a deterministic constant. Following van de Geer (2002), we define

(20)
(21)
(22)

where denotes that the two functions only differ at the th matrix entry, and are constant (degenerate) random variables for a fixed sample and

(23)
(24)
(25)

Then, for all , and , we have

(27)
(28)
(29)
(30)

Given a fixed sample of size , choose such that for all . Then

(31)
(32)
(33)

Because for all , we have

(34)
(35)

Having already established that , we set and obtain

(36)

Theorem 1.

For a fixed data sample of size , let denote the optimal estimator based on that sample and be the light cone estimator based on the same sample. Let be bounded by a constant for all . If for all , then for any , , , and sufficiently large ,

where is the (smallest) sum of weights for the predictive states, and is a kernel of bandwidth .

Proof.
(37)
(38)
(39)
(40)
(41)
(42)

Therefore,

(44)
(45)
(46)
(47)

For sufficiently large , and , given that is bounded and . Therefore, given sufficiently large,

(48)
(49)
(50)
(51)

where the penultimate inequality follows from Lemma 2. ∎

Appendix II: Implementation Details

We now discuss the choice of various parameter settings for the two algorithms, as well as some computational techniques used to improve runtime performance.

Choosing Number of States

In both mixed LICORS and Moonshine a user must specify the maximum number of predictive states for the model, which effectively controls the complexity of the model. In OHP, one must specify the exact number of predictive states, since the number is determined by a k-means++ [Arthur and Vassilvitskii, 2007] clustering step. In all cases, this number can be chosen based on user preference for simpler models, or cross-validation may be used to find the number of states that gives the best predictive performance on held-out data.

Dimensionality Reduction Choice in Moonshine

Another parameter that must be chosen is the degree of dimensionality reduction when forming distribution signatures in Moonshine. Data can guide this choice (through cross-validation), or user preference for more compact models can guide the choice for greater degrees of dimensionality reduction. The fewer the number of dimensions, the less discriminative the signatures, and thus, the higher the likelihood of merging clusters.

Density Based Clustering Considerations

When using density-based clustering such as DBSCAN [Ester et al., 1996], two issues arise. First, a suitable local neighborhood size must be chosen (controlled by an ε parameter). Second, such methods can be computationally expensive and thus slow. To address the first issue, we take an iterative search approach, beginning with very small neighborhood sizes and increasing them until a significant portion of the data is clustered, while keeping the proportion below 100%. To address the second issue, we use DBSCAN to cluster only a seed portion of all observations, then assign the remaining observations to the nearest cluster centers, which greatly improves runtime. Controlling the proportion of data used for seeding versus the portion assigned to cluster centers affects the degree of forced convexity of the resulting clusters, and also determines the total runtime of the clustering. Fewer seed points result in faster clustering, but with more convex-shaped (e.g., k-means-like) clusters.
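A sketch of this hybrid scheme, which also serves as the cluster_plcs helper assumed in the Moonshine sketch of §IV-A, is given below. The 90% coverage target follows §IV-A; the seed fraction, eps schedule, and min_samples are placeholder choices, not the settings used in our experiments.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances_argmin

def cluster_plcs(plcs, target_coverage=0.9, seed_fraction=0.2,
                 eps_start=0.05, eps_growth=1.5, min_samples=5, seed=0):
    """DBSCAN on a seed subsample with a growing neighborhood size, then
    nearest-center assignment of all points (illustrative parameter values)."""
    rng = np.random.default_rng(seed)
    n_seed = max(1, int(seed_fraction * len(plcs)))
    seed_pts = plcs[rng.choice(len(plcs), size=n_seed, replace=False)]

    eps = eps_start
    while True:                                  # grow eps until enough seeds cluster
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(seed_pts)
        if np.mean(labels != -1) >= target_coverage:
            break
        eps *= eps_growth

    # Cluster centers from the seeded density-based clustering.
    clusters = np.unique(labels[labels != -1])
    centers = np.stack([seed_pts[labels == c].mean(axis=0) for c in clusters])

    # Assign every PLC (including DBSCAN noise points) to its nearest center,
    # hybridizing density-based clustering with k-means-style assignment.
    return pairwise_distances_argmin(plcs, centers)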

Scaling

Since Moonshine and OHP cluster based on distances, it is important to normalize the scaling of all axes and the dynamic ranges of all experiments. Additionally, if the scale of the training light cones differs from the scale of the test light cones, predictive performance will suffer.

Nonparametric Density Estimation

Nonparametric density estimation techniques are instance-based and slow down as the number of observations grows. Our algorithms use kernel density estimators [Rosenblatt et al., 1956; Parzen, 1962], for which we retain only a randomly chosen subsample of five hundred points in each cluster to compute the densities. The resulting systems still perform well, as shown in §V, while remaining computationally tractable.