I Introduction
In this paper, we develop a method for estimating spatiotemporal covariance and apply it to video modeling and prediction. The covariance of a spatiotemporal process manifests itself as a multiframe covariance, i.e., the covariance not only between pixels or features in a single frame, but also between pixels or features in a set of nearby frames. In streaming applications, the covariance may be estimated at each time over a sliding window of T frames. If each frame contains p spatial components, e.g., pixels, then the covariance is described by a pT-by-pT matrix:
(1) \Sigma = \mathrm{Cov}\left[\left[x_1^T, x_2^T, \ldots, x_T^T\right]^T\right],
where x_k \in \mathbb{R}^p denotes the pixels or other features of interest in the kth video frame. We make the standard piecewise stationarity assumption that \Sigma can be approximated as unchanging over each consecutive set of T frames.
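To make the windowing concrete, the following sketch (our own illustration, not code from the paper; `multiframe_sample_covariance` is a hypothetical helper name) builds the pT-by-pT multiframe sample covariance from a sequence of frames using a sliding window that advances one frame at a time:

```python
import numpy as np

def multiframe_sample_covariance(frames, T):
    """Sliding-window multiframe sample covariance.

    frames: (N, p) array with one row of p pixels/features per frame.
    Each window of T consecutive frames is stacked into a single
    pT-dimensional sample; the window advances by one frame at a time.
    Returns the (pT, pT) sample covariance of those samples.
    """
    N, p = frames.shape
    # Stack each length-T window into one pT-vector sample.
    samples = np.stack([frames[t:t + T].ravel() for t in range(N - T + 1)])
    centered = samples - samples.mean(axis=0)
    return centered.T @ centered / (len(samples) - 1)
```

Note that consecutive windows overlap, so the samples are correlated; the trade-offs of this choice are discussed in Section III.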
As pT (the product of the number of pixels p and the number of frames T) can be very large even for moderately large p and T, the number of degrees of freedom (pT(pT+1)/2) in the covariance matrix can greatly exceed the number of i.i.d. samples available to estimate it. One way to handle this problem is to introduce structure and/or sparsity into the covariance matrix, thus reducing the number of parameters to be estimated. In many spatiotemporal applications it is expected (and confirmed by experiment) that significant sparsity exists in the inverse pixel correlation matrix due to Markovian relations between neighboring pixels and frames. Sparsity alone, however, is not sufficient, and applying standard sparse methods such as GLasso directly to the spatiotemporal covariance matrix is computationally prohibitive [2]. A natural nonsparse alternative is to introduce structure by modeling the covariance matrix as the Kronecker product of two smaller matrices, i.e.,
(2) \Sigma \approx A \otimes B.
When the measurements are Gaussian with a covariance of this form they are said to follow a matrix-normal distribution [2]. This model lends itself to coordinate decompositions [3, 1, 4]. For spatiotemporal data, we consider the natural decomposition of space (features) vs. time (frames) [1, 4]. In this setting, the factor acting across pixels is the "spatial covariance" and the factor acting across frames is the "time covariance." Previous applications of the model of Equation (2) include MIMO wireless channel modeling as a transmit vs. receive decomposition [5], geostatistics [6], genomics [7], multi-task learning [8], collaborative filtering [9], mine detection [10], and recommendation systems [3]. An extension to the representation (2) introduced in [1] approximates the covariance matrix using a sum of Kronecker product factors
(3) \Sigma \approx \sum_{i=1}^{r} A_i \otimes B_i,
where r is the separation rank.
This allows for a more accurate approximation of the covariance when it is not in Kronecker product form but most of its energy lies in the first few Kronecker components. An algorithm for fitting the model (3) to a measured sample covariance matrix was introduced in [1]. The Kronecker sum model does not naturally accommodate additive noise, since the diagonal elements (variances) must conform to the Kronecker structure.
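As a concrete illustration of the LS fit of (3) (a minimal sketch under our own naming, not the PRLS implementation of [1]): rearranging the pT × pT matrix into a T² × p² matrix whose rows are the flattened blocks turns the nearest-sum-of-Kroneckers problem into a truncated SVD [11]:

```python
import numpy as np

def kron_sum_approx(S, T, p, r=1):
    """LS sum-of-Kroneckers fit via SVD of the rearranged matrix.

    Views the (pT, pT) matrix S as a T x T grid of p x p blocks and
    rearranges it so that the nearest sum of r Kronecker products
    A_k kron B_k is given by the rank-r truncated SVD.
    """
    # Row (i, j) of R is the flattened (i, j) block of S.
    R = np.array([S[i*p:(i+1)*p, j*p:(j+1)*p].ravel()
                  for i in range(T) for j in range(T)])   # (T^2, p^2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Split each singular value evenly between the two factors.
    factors = [(np.sqrt(s[k]) * U[:, k].reshape(T, T),
                np.sqrt(s[k]) * Vt[k].reshape(p, p)) for k in range(r)]
    return sum(np.kron(A, B) for A, B in factors)
```

When S is exactly a single Kronecker product, the r = 1 fit recovers it; the decay of the singular values of the rearranged matrix indicates how many Kronecker terms a covariance needs.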
In this paper, we extend the Kronecker sum model, and the PRLS algorithm of [1], by adding a structured diagonal matrix to (3). We call this the Diagonally Loaded Kronecker Sum model and, although it has additional parameters, we show that it performs significantly better at predicting video data. We also derive the asymptotic Cramér-Rao lower bound on the estimation MSE of the ML predictor coefficients under both standard covariance estimation and Kronecker estimation.
The rest of this paper is organized as follows. Section II introduces the diagonally loaded Kronecker sum model and an LS algorithm for its estimation. We also derive the CRB-based asymptotic predictor performance gain obtained when estimating the covariance using the model of (2). In Section III we present results on the accuracy of the multiple-Kronecker representation of real-world video spatiotemporal covariances. Section IV presents our comparative prediction performance results using CMU activity video data, and we conclude the paper in Section V.
II Sums-of-Kronecker Covariance Representation for Prediction
As an example application of covariance estimation, we turn in this section to the use of estimated spatiotemporal covariance matrices for prediction tasks. Given the covariance matrix \Sigma and mean \mu of a vector x with nonoverlapping subvectors x_1 and x_2, the ML predictor of x_1 given x_2 is
(4) \hat{x}_1 = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2),
where \Sigma_{12} and \Sigma_{22} are the appropriate submatrices of \Sigma [8].
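In code, the predictor of Equation (4) is a one-liner (a sketch; `ml_predict` and the `n1` split index are our own notation, not the paper's):

```python
import numpy as np

def ml_predict(x2, mu, Sigma, n1):
    """ML predictor of Eq. (4): x1_hat = mu1 + S12 S22^{-1} (x2 - mu2).

    mu, Sigma: mean and covariance of the stacked vector [x1; x2];
    n1 is the length of the subvector x1 being predicted.
    """
    mu1, mu2 = mu[:n1], mu[n1:]
    S12 = Sigma[:n1, n1:]
    S22 = Sigma[n1:, n1:]
    # Solve rather than explicitly invert S22 for numerical stability.
    return mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
```

For example, with zero mean and covariance [[2, 1], [1, 2]], observing x_2 = 1 yields the prediction 0.5 for x_1.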
II-A Modified LS Algorithm for Prediction Tasks
Although the Kronecker structure of video space-time covariance matrices is strong, the diagonal elements of any covariance matrix are strongly affected by uncorrelated noise in the system [8], which does not replicate across the matrix in a Kronecker fashion. Hence, for example, the Kronecker estimate will overestimate positive in-frame correlations. Since the diagonal elements of a covariance matrix strongly influence its inverse, and by extension the predictor coefficients, this can cause a significant loss of accuracy.
To correct this problem, we propose to approximate the covariance using the diagonally loaded Kronecker model
(5) \Sigma \approx \sum_{i=1}^{r} A_i \otimes B_i + D,
where D is diagonal [8].
Since the diagonal addition is arbitrary, it does not matter what values the Kronecker portion assigns to the diagonal elements. Hence we treat them as don't cares in the least-squares low-separation-rank approximation. We thus turn to the estimation of the Kronecker factors and D from the sample covariance matrix, with the diagonal elements treated as don't cares.
Following the rearrangement of the sample covariance as in [11], this becomes the problem of finding a rank-one (low-rank for multiple Kronecker terms) approximation to a matrix in which the intersections of a set of rows and columns are not included in the LS objective function [11].
For notational simplicity, multiply the rearranged matrix by permutation matrices to put it in the form
(6) \tilde{R} = \begin{bmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{bmatrix},
where the don't cares are now contained in the block R_{11}. We also partition the columns u_i and v_i of the permuted rank-r approximation factors in the same way. As shown in [11], the vectors u_i and v_i can be rearranged to form the Kronecker factors A_i and B_i, respectively. We thus have the weighted LS problem
(7) \min_{\{u_i, v_i\}} \left\| W \circ \left( \tilde{R} - \sum_{i=1}^{r} u_i v_i^T \right) \right\|_F^2,
where W is a binary weight matrix equal to zero on the don't-care entries and one elsewhere. Our algorithm is then:
1) Rearrange the sample covariance to form \tilde{R}.
2) Solve the (biconvex) weighted LS rank-r approximation problem in Equation (7). We use the iterative method of alternating projections [12] over the u_i and the v_i, initializing with the unweighted SVD solution since the number of missing values is relatively small.
3) Reform the solution vectors to obtain the Kronecker factors and hence the Kronecker portion of the estimate.
4) Determine D, which must be diagonal. We set its entries to the positive part of the difference between the sample variances and the corresponding diagonal entries of the Kronecker estimate; the zero cutoff helps preserve positive semidefiniteness. Further additions to D can be made for regularization.
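Step 2 can be sketched for the rank-one case as a simple alternating least-squares loop (our own minimal illustration, not the exact implementation; the closed-form row and column updates follow from fixing one factor at a time in the weighted objective):

```python
import numpy as np

def masked_rank1(R, mask, iters=500):
    """Rank-one weighted LS approximation with don't-care entries.

    R: matrix to approximate; mask: boolean array, False where the
    entry is a don't care (excluded from the LS objective).
    Initialized from the unweighted SVD, as in step 2, since the
    number of masked entries is small.
    """
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    u, v = np.sqrt(s[0]) * U[:, 0], np.sqrt(s[0]) * Vt[0]
    W = mask.astype(float)
    for _ in range(iters):
        # Weighted LS update of u with v fixed, then of v with u fixed.
        u = (W * R) @ v / np.maximum(W @ v**2, 1e-12)
        v = (W * R).T @ u / np.maximum(W.T @ u**2, 1e-12)
    return u, v
```

Step 4 then sets the diagonal load to the positive part of the difference between the sample variances and the diagonal of the reformed Kronecker estimate.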
We found that prediction accuracy is typically better if the diagonally loaded LS approximation is applied to the sample correlation instead of the sample covariance.
II-B Cramér-Rao Bound (CRB) on Predictor Coefficients
The CRB on the asymptotic optimal performance of an unbiased estimator of a Kronecker product covariance matrix from i.i.d. samples is given by [11]
(8)
where the permutation matrix is as described in [11], and the structure matrices are chosen so that the Kronecker parameterization holds (allowing for imposition of certain types of structure).
The predictor coefficients are W = \Sigma_{12} \Sigma_{22}^{-1}. Let \alpha = \mathrm{vec}(W). Then the asymptotic CRB of \alpha is
(9) \mathrm{CRB}(\alpha) = J \, \mathrm{CRB}(\theta) \, J^T,
where J is the Jacobian of \alpha with respect to the covariance parameters \theta. The entries of J corresponding to the portions of \Sigma not used in the predictor coefficients are trivially zero. For the \Sigma_{12} portion,
(10) \frac{\partial \, \mathrm{vec}(W)}{\partial \, \mathrm{vec}(\Sigma_{12})^T} = \Sigma_{22}^{-1} \otimes I.
For the \Sigma_{22} portion,
(11) \frac{\partial \, \mathrm{vec}(W)}{\partial \, \mathrm{vec}(\Sigma_{22})^T} = -\left( \Sigma_{22}^{-1} \otimes \Sigma_{12}\Sigma_{22}^{-1} \right).
Now that the CRB has been derived for the predictor coefficients, it is possible to obtain the asymptotic loss in accuracy of the Kronecker-based predictor relative to the infinite-training-sample predictor. Assume that x_1 and x_2 are independent of the training samples. Without loss of generality, assume \mu = 0. Define the prediction error under the estimated coefficients \hat{W},
(12) e = x_1 - \hat{W} x_2 = (x_1 - W x_2) - (\hat{W} - W) x_2,
where W = \Sigma_{12}\Sigma_{22}^{-1} are the true predictor coefficients. Thus e decomposes into the error of the optimal predictor and the error due to estimating W. Also, by independence of \hat{W} and (x_1, x_2), the two terms are uncorrelated. Since the CRB assumes an unbiased estimator, assume that the estimator \hat{W} of W is unbiased. Then E[e] = 0. The error covariance is then given by
(13) \mathrm{Cov}(e)_{ii} = \left[ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right]_{ii} + \mathrm{tr}\left( \Sigma_{22}\, \mathrm{Cov}(\hat{w}_i) \right),
where \hat{w}_i denotes the ith row of \hat{W}. The asymptotic covariance of the predictor coefficient estimates is given by the CRB for the predictor coefficients (9), thus giving an asymptotic lower bound on the covariance of the additional error resulting from the use of the estimated W instead of the true W.
For comparison, the asymptotic CRB for covariance estimation with no structural knowledge (arbitrary \Sigma) can be obtained by removing the Kronecker structural constraint in Equation (8).
III Sum-of-Kronecker-Products Decomposition Accuracy in Video
In this section, we focus on the MSE accuracy of approximating sample covariance matrices using the sum-of-Kroneckers approximation, in particular the number of Kronecker terms required to obtain a good approximation. Since we are focusing on the MSE, the standard sum-of-Kroneckers approximation of Equation (3) is appropriate.
III-A Texture Video
For the computation of the sample covariance, we use a sliding-window approach: to obtain each new multiframe sample, the window is incremented by one frame. Linear dependence is avoided, but the samples are clearly not independent. However, this approach improves the learning rate and enforces stationarity, which could otherwise be lost in periodic scenes.
We examine the accuracy of the Kronecker approximation for a video of blowing grass (15 frames/sec). Due to the size of the image, it is divided into blocks for estimation. Figure 1 shows a still frame of the video, along with the variance image for reference. The % RMSE of using a single Kronecker factor to approximate the 15-frame, 2000-sample sample covariances is shown for each block. Due to the relatively low number of samples compared to the number of variables (1500), a significant portion of this error is likely due to noise in the sample covariance.
Figure 2 shows the Kronecker approximation error for 3-, 5-, and 7-frame blocks as a function of downsampling factor and pixel block size.
III-B Human Activity Video
As in the previous section, we compute the sample covariance using a sliding window incremented by one frame per sample. This in effect forces near block-Toeplitz structure (for time stationarity) in the sample covariance.
We applied covariance estimation to the CMU human activity mocap videos. These videos are processed for the dataset to give the positions of a set of fixed points on the human body as a function of time (downsampled to 40 frames/sec) as the subject performs an activity. We used videos of a person walking and a person performing fencing moves. This type of data would also arise when feature points in a video are being tracked.
The points travel through space, causing mean drift. To remove this, we preprocessed the data by computing the error of a frame-ahead linear extrapolator (hereafter referred to as the zeroth-order predictor) based on two frames close together in time. We then applied covariance estimation to the result. This setup allows for causal prediction of the original variables several frames ahead.
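Concretely, assuming the extrapolation uses the two immediately preceding frames (a detail we are sketching here, not stated exactly in the text), the zeroth-order residual for frame t is x_t − (2x_{t−1} − x_{t−2}):

```python
import numpy as np

def zeroth_order_residuals(x):
    """Residuals of a one-frame-ahead linear extrapolator.

    x: (N, p) array of tracked point coordinates per frame.
    Frame t is predicted as 2*x[t-1] - x[t-2] (linear extrapolation
    from the two preceding frames); the residual removes mean drift,
    and covariance estimation is then applied to it.
    """
    return x[2:] - (2.0 * x[1:-1] - x[:-2])
```

Any purely linear drift in the trajectories is annihilated by this preprocessing, leaving only the higher-order motion to be modeled.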
In Figure 3, we show LS Kronecker covariance approximation results for CMU videos of fencing (44 points) and walking (downsampled to 14 points because of sample paucity). Approximations to multiframe sample covariances learned using 500 and 100 samples, respectively, for the fencing and walking videos are considered. The RMS energy of the first 10 Kronecker product factors is shown for several different covariance sizes, as well as the % RMSE of using only the first Kronecker product. The low number of samples, for the walking video especially, creates noise in the sample covariance; thus the RMSE values are somewhat inflated relative to the true covariance.
IV Results for Time-Series Prediction
IV-A Asymptotic CRB
Example CRB-based asymptotic MLE prediction accuracy results (found using Equation (13)) are shown in Figure 4, along with Monte Carlo averages of prediction performance as a function of training sample size and the performance achieved using perfect knowledge of the covariance ("omniscient"). The covariance matrix used was a Kronecker product LS approximation to a 7-frame covariance with 2-frame-ahead prediction, learned from the fencing video. The same covariance was used for both the sample covariance and Kronecker cases, the only difference being that the Kronecker estimator has prior information that the covariance has Kronecker structure. As can be seen, our asymptotic CRB results match the asymptotic empirical performance well.
IV-B Forward Prediction
Figure 5 shows the RMSE results for forward prediction averaged over 100 consecutive frames of the CMU fencing video as a function of learning sample size. The prediction methods compared are the original (zeroth-order) predictor, the sample covariance (after L2 regularization [3]), the standard LS Kronecker model (2), and the diagonally corrected Kronecker model. Using the original predictor corresponds to using the mean (0) as the prediction; thus its performance can always be achieved using infinite regularization. In Figure 5, the covariance is learned on the samples immediately prior to the location at which prediction occurs.
While most of the fencing video had strong enough Kronecker structure that additional Kronecker factors did not improve prediction, some portions had sufficiently high Kronecker rank to warrant their use. By resampling from learned video covariances, we analyzed the RMSE as a function of the number of Kronecker factors used for both the standard [1] and diagonally corrected Kronecker methods. It was found that, due to very poor conditioning, the standard Kronecker-based predictions became unstable when more than one Kronecker factor was used, whereas the diagonally corrected estimate remained accurate.
We took sample covariances learned from the fencing video and used them to generate new 10-frame sample covariances using the sliding-window approach. The 5-frame-ahead prediction RMSE was then computed over 200 Monte Carlo runs for the standard [1] and diagonally corrected Kronecker methods, as well as for the relearned sample covariance and the true covariance, as a function of the number of Kronecker factors used. Both Kronecker covariance estimates were forced to be positive semidefinite by projection.
Figure 6 shows the results using 15 training samples (left) and 200 (right). Note the poor standard Kronecker results when using more than one Kronecker term, demonstrating the benefit of diagonal correction in the multiple-Kronecker case.
IV-C Forward Prediction with Partial Data
An additional prediction task that may arise is forward prediction when more recent history is available for some variables than for others. In the two-part case we want
(14)
where the two sets of variables are observed up to different times. The predictor (Equation (4)) thus incorporates both "forward" and "sideways" prediction. For forward-only prediction, the structure of the single-Kronecker model results in pixel predictions that are weighted averages of only the corresponding pixel values in the previous frames [8], with weights depending only on the time factor. In the partial-prediction case this structure disappears, resulting in the use of cross-pixel information. Since the number of pixels is typically larger than the number of frames, this results in an increase in the number of parameters. In addition, as discussed earlier, when uncorrelated noise is present the standard Kronecker estimate has a tendency to overestimate inter-pixel correlations. Since the covariances we use have rather poor conditioning (large correlations), we expect the predictions using the standard Kronecker estimate to degrade significantly even for large numbers of samples, and the diagonally corrected method to perform better.
For this experiment, we used the walking video from the CMU dataset. Figure 7 shows the RMSE averaged over 100 frames of predicting 2/3 of the points (1/3 are observed at all times) 5 frames ahead using a 20frame covariance. Note the failure of the standard Kronecker as anticipated, while the diagonally corrected Kronecker has lower error in the low sample regime than the regularized sample covariance.
V Conclusion
We considered the sum-of-Kronecker-products representation for covariance matrices developed in [1] and examined its applicability to spatiotemporal covariance estimation, especially in video. It was found that a small sum (usually two) of Kronecker factors is a good approximation to the covariance of the CMU videos and, due to the reduction in the number of parameters, gives improved low-sample estimation as compared to the standard sample covariance matrix.
We also proposed a diagonally loaded sum-of-Kronecker-products representation, which resulted in improved prediction performance. As an example application, we used it for prediction of human motion patterns in the CMU videos. In addition to the significant potential computational advantage of using a single Kronecker factor, it was found that the representation allowed accurate prediction using significantly fewer training samples than are needed by the sample covariance matrix.
To analyze this gain, we derived the CRB for the predictor coefficients and the optimal asymptotic predictor performance assuming an underlying Kronecker covariance as well as for the unstructured covariance case.
In certain cases, the use of multiple Kronecker factors with the diagonally corrected method gave improved performance as the number of samples increased sufficiently, whereas the standard method gave worse performance. This allowed the small sum-of-Kroneckers representation to be competitive even for large numbers of samples.
VI Acknowledgements
This research was partially supported by ARO under grant W911NF-11-1-0391 and by AFRL under grant FA8650-07-D-1220-0006. The CMU data was obtained from mocap.cs.cmu.edu. The dataset was created with funding from NSF EIA-0196217.
References
[1] T. Tsiligkaridis and A. Hero, "Covariance estimation in high dimensions via Kronecker product expansions," arXiv:1302.2686, Feb. 2013.
[2] T. Tsiligkaridis, A. Hero, and S. Zhou, "On convergence of Kronecker graphical lasso algorithms," IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1743–1755, 2013.
[3] G. I. Allen and R. Tibshirani, "Transposable regularized covariance models with an application to missing data imputation," Annals of Applied Statistics, vol. 4, no. 2, pp. 764–790, 2010.
[4] M. G. Genton, "Separable approximations of space-time covariance matrices," Environmetrics, vol. 18, no. 7, pp. 681–695, 2007.
[5] K. Werner and M. Jansson, "Estimation of Kronecker structured channel covariances using training data," in Proceedings of EUSIPCO, 2007.
[6] N. Cressie, Statistics for Spatial Data. Wiley, New York, 1993.
[7] J. Yin and H. Li, "Model selection and estimation in the matrix normal graphical model," Journal of Multivariate Analysis, vol. 107, 2012.
[8] E. Bonilla, K. M. Chai, and C. Williams, "Multi-task Gaussian process prediction," in NIPS, 2007.
[9] K. Yu, J. Lafferty, S. Zhu, and Y. Gong, "Large-scale collaborative prediction using a nonparametric random effects model," in ICML, 2009, pp. 1185–1192.
[10] Y. Zhang and J. Schneider, "Learning multiple tasks with a sparse matrix-normal penalty," Advances in Neural Information Processing Systems, vol. 23, pp. 2550–2558, 2010.
[11] K. Werner, M. Jansson, and P. Stoica, "On estimation of covariance matrices with Kronecker product structure," IEEE Transactions on Signal Processing, vol. 56, no. 2, pp. 478–491, 2008.
[12] A. M. Buchanan and A. W. Fitzgibbon, "Damped Newton algorithms for matrix factorization with missing data," in Computer Vision and Pattern Recognition, vol. 2, 2005, pp. 316–322.