1 Introduction
The Shannon-Nyquist theorem dictates that to sense features at a particular frequency, we must sample uniformly at twice that rate. For generic imaging applications, this sampling rate might be too high; in modern digital cameras, the sensed image is invariably compressed immediately without much loss in quality. For other applications, such as high-speed imaging and sensing in the non-visual spectrum, camera/sensor designs based on the Shannon-Nyquist theorem lead to impractical and costly designs. Part of the reason for this is that Shannon-Nyquist sampling theory does not exploit any structure in the sensed signal beyond bandlimitedness. Signals with redundant structure can potentially be sensed more parsimoniously. This is the key idea underlying the new field of compressive sensing (CS) [7]. When the signal of interest exhibits a sparse representation, CS enables sensing at measurement rates below the Nyquist rate. Indeed, signal recovery is possible from a number of measurements that is proportional to the sparsity level of the signal, as opposed to its bandwidth.
In this paper, we consider the problem of sensing videos compressively. Our interest in this problem is motivated by the success of video compression algorithms, which indicates that videos are highly redundant. Bridging the gap between compression and sensing can lead to compelling camera designs that significantly reduce the amount of data sensed and enable designs for application domains where sensing is inherently costly.
Video CS is challenging for two main reasons:

Ephemeral nature of videos: The scene changes during the measurement process; moreover, we cannot obtain additional measurements of an event after it has occurred.

High-dimensional signals: Videos are significantly higher-dimensional than images. This makes the recovery process computationally intensive.
One way to address these challenges is to narrow our scope to certain parametric models that are suitable for a broad class of videos; this morphs the video recovery problem into one of parameter estimation and provides a scaffold for addressing the challenges listed above.
In this paper, we develop a CS framework for videos modeled as linear dynamical systems (LDSs), which is motivated, in part, by the extensive use of such models in characterizing dynamic textures [10, 15, 33], activity modeling, and video clustering [37]. Parametric models, like LDSs, offer lower-dimensional representations for otherwise high-dimensional videos. This significantly reduces the number of free parameters that need to be estimated and, as a consequence, reduces the amount of data that needs to be sensed. In the context of video sensing, LDSs offer interesting tradeoffs by characterizing the video signal using a mix of dynamic/time-varying parameters and static/time-invariant parameters. Further, the generative nature of LDSs provides a prior for the evolution of the video in both forward and reverse time. To a large extent, this property helps us circumvent the challenges presented by the ephemeral nature of videos.
The paper makes the following contributions. We propose a framework called CS-LDS for video acquisition using an LDS model coupled with sparse priors for the parameters of the LDS model. The core of the framework is a two-step measurement strategy that enables the recovery of the LDS parameters from compressive measurements by solving a sequence of linear and convex problems. We demonstrate that CS-LDS is capable of sensing videos with far fewer measurements than the Nyquist rate. Finally, the LDS parameters form an important class of features for activity recognition and scene analysis, thereby making our camera designs purposive [25] as well.
2 Background
2.1 Compressive sensing
CS deals with the recovery of a signal $x \in \mathbb{R}^N$ from undersampled linear measurements of the form $y = \Phi x + e$, where $\Phi \in \mathbb{R}^{M \times N}$, with $M < N$, is the measurement matrix and $e$ is the measurement noise [7, 14]. Estimating $x$ from the measurements $y$ is ill-conditioned, since the linear system formed by $\Phi$ is underdetermined. CS works under the assumption that the signal $x$ is sparse in a basis $\Psi$; that is, the signal $\theta$, defined as $x = \Psi\theta$, has at most $K \ll N$ nonzero components. Exploiting the sparsity of $\theta$, the signal $x$ can be recovered exactly from $M = O(K \log(N/K))$ measurements provided the matrix $\Phi\Psi$ satisfies the so-called restricted isometry property (RIP) [4]. In particular, when $\Psi$ is an orthonormal basis and the entries of the matrix $\Phi$ are i.i.d. samples from a sub-Gaussian distribution, the product $\Phi\Psi$ satisfies the RIP. Further, the signal can be recovered from $y$ by solving a convex problem of the form

(1) $\min_{\theta} \|\theta\|_1 \quad \text{subject to} \quad \|y - \Phi\Psi\theta\|_2 \le \epsilon,$

where $\epsilon$ is an upper bound on the measurement noise $\|e\|_2$. It can be shown that the solution to (1) is, with high probability, the $K$-sparse solution that we seek. The theoretical guarantees of CS have been extended to compressible signals, where the sorted coefficients of $\theta$ decay rapidly according to a power law [22].

A wide range of algorithms exists to solve (1) under various approximations or reformulations [7, 38]. Greedy techniques such as Orthogonal Matching Pursuit [28] and CoSaMP [26] solve the sparse approximation problem efficiently with strong convergence properties and low computational complexity. It is also simple to impose structural constraints such as block sparsity into CoSaMP, giving variants such as model-based CoSaMP [3].
2.2 Video compressive sensing
In this paper, we model a video as a sequence of time-indexed images. Specifically, if $x_t \in \mathbb{R}^N$ is the image of a scene at time $t$, then $\{x_t, t = 1, \ldots, T\}$ is the video of the scene from time $1$ to $T$. Further, we also refer to $x_t$ as the “video frame” at time $t$.
In video CS, the goal is to sense a time-varying scene using compressive measurements of the form $y_t = \Phi_t x_t$, where $y_t$, $\Phi_t$, and $x_t$ are the compressive measurements, the measurement matrix, and the video frame at time $t$, respectively. Given the sequence of compressive measurements $\{y_t\}$, our goal is to recover the video $\{x_t\}$. There are currently two fundamentally different imaging architectures for video CS: the single pixel camera (SPC) and the programmable pixel camera. The SPC [16] uses a single or a small number of sensing elements. Typically, a photodetector is used to obtain a single measurement at each time instant of the form $y = \langle \phi, x \rangle$, where $\phi$ is a pseudorandom vector of $+1$s and $-1$s. Typically, under an assumption of a slowly varying scene, consecutive measurements from the SPC are grouped as measurements of the same video frame. This assumption works only when the scene motion is small or when the number of measurements associated with a frame is small. The SPC provides complete freedom in the spatial multiplexing of pixels; however, there is no temporal multiplexing. In contrast, programmable pixel cameras [43, 31, 23] use a full-frame sensor array; during each exposure of the sensor array, the shutter at each pixel is temporally modulated. This enables extensive temporal multiplexing but a limited amount of spatial multiplexing. A key advantage of SPC-based designs is that they can operate efficiently at wavelengths (such as the far infrared) that require exotic detectors; in such cases, building a full-frame sensor can be prohibitively expensive.

To date, recovery algorithms for the SPC have used various signal models to reconstruct the sensed scene. Wakin et al. [45] use 3D wavelets as the sparsifying basis for recovering videos from compressive measurements. Park and Wakin [27] use a coarse-to-fine estimation framework wherein the video, reconstructed at a coarse scale, is used to estimate motion vectors that are subsequently used to design dictionaries for reconstruction at a finer scale. Vaswani [40] and Vaswani and Lu [41] use a sequential framework that exploits the similarity of the signal support between adjacent frames of a video. Under this model, a frame of video is reconstructed using a linear inversion over the support at the previous time instant and a small-scale CS recovery over the residue to detect components beyond the known support. Cevher et al. [9] provide a CS framework for directly sensing innovations over a static scene, thereby enabling background subtraction from compressive measurements.
2.3 Linear dynamical system model for video sequences
Linear dynamical systems (LDSs) represent an important class of parametric models for timeseries data. A wide variety of spatiotemporal signals have often been modeled as realizations of LDSs. These include dynamic textures [15], traffic scenes [10], video inpainting [13], multicamera tracking [2] and human activities [37]. The interested reader is referred to [36]
for a survey of the use of LDSs as a concise representation for a wide range of computer vision problems.
Intuitively, an LDS for a video comprises two models. First, an observation model that suggests that frames of the video lie close to a $d$-dimensional subspace; the frame of the video at time $t$ can be represented as $x_t \approx C z_t$, where $C$ is a basis for the subspace and $z_t$ are the subspace coefficients, or the state vector, at time $t$. Second, the trajectory that the video charts out in this $d$-dimensional subspace varies smoothly, is predictable, and is modeled by a linear evolution of the form $z_{t+1} \approx A z_t$. Figure 1 provides an example of an LDS.
We now formally define the LDS for a video. The model equations are given by

(2) $x_t = C z_t + w_t,$

(3) $z_{t+1} = A z_t + v_t,$

where $z_t \in \mathbb{R}^d$ is the state vector at time $t$, $d$ is the dimension of the state space, $A \in \mathbb{R}^{d \times d}$ is the state transition matrix, $C \in \mathbb{R}^{N \times d}$ is the observation matrix, and $x_t \in \mathbb{R}^N$ represents the observed measurements, where for the videos of interest in this paper, $d \ll N$. $w_t$ and $v_t$ are noise components modeled as Gaussian with zero mean vector and covariance matrices given by $R$ and $Q$, respectively. The Gaussian assumption for the process noise is not necessarily an optimal one, but is made for the sake of simplifying the model estimation algorithm. It is known to work well for representing a large class of dynamic textures [15].
An LDS is parameterized by the matrix pair $(C, A)$. Note that the choice of $(C, A)$ and the state sequence $\{z_t\}$ is unique only up to a linear transformation, given the inherent ambiguities in the notion of a state space. In particular, given any invertible matrix $P \in \mathbb{R}^{d \times d}$, the LDS defined by $(CP, P^{-1}AP)$ with the state sequence $\{P^{-1} z_t\}$ is equivalent to the LDS defined by $(C, A)$ with the state sequence $\{z_t\}$. This lack of uniqueness has implications that we will touch upon later in Section 5.
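The PCA-based fitting procedure discussed next (space-filter, then time-filter) can be made concrete with a short sketch; the synthetic rotation dynamics, dimensions, and noise-free data below are illustrative assumptions:

```python
import numpy as np

def fit_lds(X, d):
    """PCA-based LDS fit: C from the top-d principal components (space-filter),
    then a least-squares state transition matrix A (time-filter).
    X is N x T, one video frame per column; d is the state dimension."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    C = U[:, :d]                       # observation matrix
    Z = np.diag(s[:d]) @ Vt[:d, :]     # state sequence, d x T
    # Solve z_{t+1} ~ A z_t in the least-squares sense.
    A = Z[:, 1:] @ np.linalg.pinv(Z[:, :-1])
    return C, A, Z

# Demo on noise-free data generated from a known rotation LDS.
rng = np.random.default_rng(1)
N, d, T = 50, 2, 100
th = 0.3
A_true = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
C_true, _ = np.linalg.qr(rng.standard_normal((N, d)))
Z_true = np.zeros((d, T))
Z_true[:, 0] = [1.0, 0.0]
for t in range(1, T):
    Z_true[:, t] = A_true @ Z_true[:, t - 1]
C_hat, A_hat, Z_hat = fit_lds(C_true @ Z_true, d)
```

Because of the state-space ambiguity noted above, $\widehat{C}$ and $\widehat{A}$ match the true parameters only up to an invertible transform; what is preserved is the subspace spanned by $C$ and the eigenvalues of $A$.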
Given a video sequence, the most common approach to fitting an LDS model is to first estimate a lower-dimensional embedding of the observations via principal component analysis (PCA) and then learn the temporal dynamics captured in $A$ and, equivalently, the state sequence $\{z_t\}$. The most popular model estimation algorithms are N4SID [39], PCA-ID [35], and expectation-maximization (EM) [10]. N4SID is a subspace identification algorithm that provides an asymptotically optimal solution for the model parameters. However, for large problems the computational requirements make this method prohibitive. PCA-ID [35] is a suboptimal solution to the learning problem. It makes the assumption that estimation of the observation matrix $C$ and the state transition matrix $A$ can be performed separately, which makes it possible to estimate the parameters of the model very efficiently via PCA. Under this assumption, one first estimates the observation matrix $C$ (space-filter) and then uses the result to estimate the state transition matrix $A$ (time-filter) [15]. This learning problem can also be posed as a maximum likelihood estimation of the model parameters that maximizes the likelihood of the observations, which can be solved by the EM algorithm [10].

3 CS-LDS Architecture
We provide a high-level overview of our proposed framework for video CS; the goal here is to build a CS framework, implementable on the SPC, for videos that are modeled as LDSs. We flesh out the details in Sections 4 and 5. This amounts to estimating the LDS parameters from compressive measurements; i.e., we seek to recover the model parameters $C$ and $\{z_t\}$ given compressive measurements of the form $y_t = \Phi_t x_t = \Phi_t C z_t$. We recall that $C$ is the time-invariant observation matrix of the LDS, and $x_t$ and $z_t$ are the video frame and the state at time $t$, respectively. The compressive measurements are hence expressed as bilinear terms in the unknown parameters $C$ and $\{z_t\}$. Handling bilinear unknowns typically requires nonconvex optimization techniques, thereby invalidating conventional CS recovery algorithms. To avoid this, we propose a two-step sensing method that is specifically designed to address the bilinearity; we refer to this sensing method and its associated recovery algorithm as the CS-LDS framework [34].
Measurement model:
We summarize the CS-LDS measurement model as follows. At time $t$, we take two sets of measurements:

(4) $\widetilde{y}_t = \widetilde{\Phi} x_t, \qquad \widehat{y}_t = \widehat{\Phi}_t x_t,$

where $\widetilde{y}_t \in \mathbb{R}^{\widetilde{M}}$ and $\widehat{y}_t \in \mathbb{R}^{\widehat{M}}$, such that the total number of measurements at each frame is $M = \widetilde{M} + \widehat{M}$.^1 The measurement matrix in (4) is composed of two distinct components: the time-invariant part $\widetilde{\Phi}$ and the time-varying part $\widehat{\Phi}_t$. We denote by $\widetilde{y}_t$ the common measurements and by $\widehat{y}_t$ the innovation measurements.

^1 The SPC obtains only one measurement at each time instant. Multiple measurements for a video frame are obtained by grouping consecutive measurements from the SPC. When $M$ is small compared to the sampling rate of the SPC, this is an acceptable approximation, especially for slowly varying scenes.
We solve for the LDS parameters in two steps. First, we obtain an estimate of the state sequence using only the common measurements $\{\widetilde{y}_t\}$. Second, we use this state sequence estimate to recover the observation matrix $C$ using the innovation measurements.
State sequence estimation:
We recover the state sequence using only the common measurements $\{\widetilde{y}_t\}$. The key idea is that when $\{x_t\}$ form the observations of an LDS with system matrices $(C, A)$, the measurements $\{\widetilde{y}_t = \widetilde{\Phi} x_t\}$ form the observations of an LDS with system matrices $(\widetilde{\Phi}C, A)$. Estimation of the state sequence can now be mapped to a simple exercise in system identification. In particular, an estimate of the state sequence can be obtained from the singular value decomposition (SVD) of the block-Hankel matrix

(5) $H = \begin{bmatrix} \widetilde{y}_1 & \widetilde{y}_2 & \cdots & \widetilde{y}_{T-d'+1} \\ \widetilde{y}_2 & \widetilde{y}_3 & \cdots & \widetilde{y}_{T-d'+2} \\ \vdots & & & \vdots \\ \widetilde{y}_{d'} & \widetilde{y}_{d'+1} & \cdots & \widetilde{y}_T \end{bmatrix},$

where $d' \ge 1$ is the number of block rows. Given the SVD $H = U\Sigma V^T$, the state sequence estimate is given by the $d$ dominant singular values and right singular vectors, $[\widehat{z}_1, \ldots, \widehat{z}_{T-d'+1}] = \Sigma_d V_d^T$.
In Section 4, we leverage results from system identification to analyze the properties of this particular estimate as well as characterize the number of measurements required.
Observation matrix estimation:
Given an estimate of the state sequence, $\{\widehat{z}_t\}$, the relationship between the observation matrix $C$ and the innovation measurements is linear; i.e., $\widehat{y}_t = \widehat{\Phi}_t C \widehat{z}_t$. In addition, $C$ is time-invariant. Hence, we can accumulate innovation measurements over a duration of time to stably reconstruct $C$. This significantly reduces the number of innovation measurements required at each frame. This is especially important in the context of sensing videos, since the scene changes as we acquire measurements. Hence, requiring fewer measurements for each reconstructed frame of the video implies less error due to motion blur.
Using the estimates of the state sequence $\{\widehat{z}_t\}$, we can recover $C$ by solving the following convex problem:

(6) $\min_C \sum_{k=1}^{d} \|\Psi^T c_k\|_1 \quad \text{subject to} \quad \sum_t \|y_t - \Phi_t C \widehat{z}_t\|_2^2 \le \epsilon^2,$

where $c_k$ denotes the $k$th column of $C$ and $\Psi$ is a sparsifying basis for the columns of $C$. Note that, in (6), we use all of the compressive measurements obtained for each frame of the video; that is, we use both the common and innovation measurements, since the common measurements, much like the innovation measurements, are linear measurements of the frames. Further, as we show later in Section 5.2, ambiguities in the estimation of the state sequence induce a structured sparsity pattern in the support of $\Psi^T C$. The convex program (6) can be modified to incorporate such constraints. In addition, in Section 5, we also propose a greedy alternative for solving a variant of the convex program.
To summarize, the two-step measurement process described in (4) enables a two-step recovery (see Figure 2). First, we obtain an estimate of the state sequence using an SVD of just the common measurements. Second, we use the state sequence estimate to recover the observation matrix using a convex program. The details of these two steps are discussed in the next two sections.
4 Estimating the state sequence
In this section, we discuss methods to estimate the state sequence $\{z_t\}$ from the compressive measurements $\{\widetilde{y}_t\}$. In particular, we seek to establish sufficient conditions under which the state sequence can be estimated reliably.
4.1 Observability of the state sequence
Consider the compressive measurements given by

(7) $\widetilde{y}_t = \widetilde{\Phi} x_t + e_t,$

where $\widetilde{y}_t \in \mathbb{R}^{\widetilde{M}}$ are the compressive measurements at time $t$, $\widetilde{\Phi}$ is the corresponding measurement matrix, and $e_t$ is the measurement noise. Note that $\widetilde{\Phi}$ is time-invariant; hence, (7) is the part of the measurement model described in (4) relating to the common measurements. A key observation is that, when $\{x_t\}$ form the observations of an LDS defined by $(C, A)$, the compressive measurement sequence $\{\widetilde{y}_t\}$ forms an LDS as well; that is,

$\widetilde{y}_t = (\widetilde{\Phi} C) z_t + e_t, \qquad z_{t+1} = A z_t + v_t.$

The LDS associated with $\{\widetilde{y}_t\}$ is parameterized by the system matrices $(\widetilde{\Phi}C, A)$. Estimating the state sequence from the observations of an LDS is possible only when the LDS is observable [5]. Thus, it is important to consider the question of observability of the LDS parameterized by $(\widetilde{\Phi}C, A)$.^2

^2 Observability of LDSs in the context of CS has been studied earlier by Wakin et al. [46], who consider the scenario when the observation matrix is randomly generated and the initial state vector is sparse. In contrast, the analysis we present is for a non-sparse state vector.
Definition 4.1 (Observability of an LDS [5])
An LDS is observable if, for any possible state sequence, the current state can be estimated from a finite number of observations.
Lemma 4.2 (Test for observability of an LDS [5])
An LDS defined by the system matrices $(C, A)$, with state space dimension $d$, is observable if and only if the observability matrix

(8) $\mathcal{O}(C, A) = \begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \\ CA^{d-1} \end{bmatrix}$
is full rank.
A necessary condition for the observability of the LDS defined by $(\widetilde{\Phi}C, A)$ is that the LDS defined by $(C, A)$ is observable. However, for the LDSs we consider in this paper, $d \ll N$; for such systems, the LDS defined by $(C, A)$ is observable. Given this assumption, we consider the observability of the LDS parameterized by $(\widetilde{\Phi}C, A)$ next.
Lemma 4.3
For $\widetilde{M} \ge d$, the LDS defined by $(\widetilde{\Phi}C, A)$ is observable, with high probability, if $(C, A)$ is observable and the entries of the matrix $\widetilde{\Phi}$ are sampled i.i.d. from a sub-Gaussian distribution.

This is established by proving that $\mathrm{rank}(\widetilde{\Phi}C) = d$ when $\widetilde{M} \ge d$. Assume that $\mathrm{rank}(\widetilde{\Phi}C) < d$, i.e., $\exists\, u \ne 0$ such that $\widetilde{\Phi}Cu = 0$. Let $\phi^T$ be a row of $\widetilde{\Phi}$. The event that $\langle \phi, Cu \rangle = 0$ is one of negligible probability when the elements of $\phi$ are i.i.d. according to a sub-Gaussian distribution such as the Gaussian or Bernoulli. Hence, $\mathrm{rank}(\widetilde{\Phi}C) = d$ with high probability when $\widetilde{M} \ge d$.
Observability is the key criterion for recovering the state sequence from the common measurements. When the LDS associated with the common measurements is observable, we can estimate the state sequence (up to a linear transformation) by factorizing the block-Hankel matrix $H$ in (5). $H$ can be written as

$H = \begin{bmatrix} \widetilde{\Phi}C \\ \widetilde{\Phi}CA \\ \vdots \\ \widetilde{\Phi}CA^{d'-1} \end{bmatrix} \begin{bmatrix} z_1 & z_2 & \cdots & z_{T-d'+1} \end{bmatrix}.$

Hence, when the observability matrix is full rank, we can recover the state sequence by factoring the Hankel matrix using the SVD. Suppose the SVD of the Hankel matrix is $H = U\Sigma V^T$. Then, the estimate of the state sequence is obtained by

(9) $[\widehat{z}_1, \widehat{z}_2, \ldots, \widehat{z}_{T-d'+1}] = \Sigma_d V_d^T,$

where $\Sigma_d$ is the diagonal matrix containing the $d$ largest singular values in $\Sigma$, and $V_d$ is the matrix composed of the right singular vectors corresponding to these singular values. The estimate of the state sequence obtained from (9) differs from its true value by a linear transformation. This is a fundamental ambiguity that stems from the lack of uniqueness in the definition of the state space (see Section 2.3). The state sequence estimate in (9) can be improved, especially for high levels of measurement noise, by using the system identification techniques mentioned in Section 2.3. However, the simplicity of this estimate makes it amenable to further analysis.
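A minimal sketch of this Hankel-SVD state estimate, on noise-free synthetic data with illustrative dimensions (the recovered states match the true ones only up to an invertible transform, as discussed above):

```python
import numpy as np

def block_hankel(Y, d_prime):
    """Block-Hankel matrix of the measurement sequence Y (m x T) with d_prime block rows."""
    m, T = Y.shape
    cols = T - d_prime + 1
    return np.vstack([Y[:, i:i + cols] for i in range(d_prime)])

def estimate_states(Y, d, d_prime=1):
    """State estimate from the d largest singular values / right singular vectors."""
    U, s, Vt = np.linalg.svd(block_hankel(Y, d_prime), full_matrices=False)
    return np.diag(s[:d]) @ Vt[:d, :]

# Demo: common measurements of a noise-free LDS with M_tilde = 4 > d = 2.
rng = np.random.default_rng(4)
N, d, T = 60, 2, 80
th = 0.25
A = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
C = rng.standard_normal((N, d))
Z = np.zeros((d, T))
Z[:, 0] = [1.0, 0.5]
for t in range(1, T):
    Z[:, t] = A @ Z[:, t - 1]
Phi = rng.standard_normal((4, N))   # time-invariant common measurement matrix
Y = Phi @ (C @ Z)                   # common measurements
Z_hat = estimate_states(Y, d)       # rows span the same space as the true states
```

Since the data are noise-free and the measured LDS is observable, the row space of `Z_hat` coincides with that of the true state sequence `Z`.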
When $\widetilde{M} > 1$, we can choose to factorize a smaller-sized Hankel matrix provided $\widetilde{M} d' \ge d$. Note that when $d' = 1$, we do not enforce the constraints provided by the state transition model, thereby reducing the LDS to a linear system. For $d' > 1$, we enforce the state transition model over $d'$ successive time instants; i.e., we enforce $\widetilde{y}_{t+k} \approx \widetilde{\Phi} C A^k z_t$ for $k = 0, \ldots, d'-1$. Larger values of $d'$ lead to smoother state sequences, since the estimates conform to the state transition model for longer durations.

We next study the observability properties of specific classes of interesting LDSs and the conditions on $\widetilde{M}$ under which the observability of $(\widetilde{\Phi}C, A)$ holds.
4.2 Case: $\widetilde{M} = 1$
A particularly interesting scenario is when we obtain exactly one common measurement for each video frame. For such a scenario, $\widetilde{M} = 1$ and, hence, the measurement matrix can be written as a row vector: $\widetilde{\Phi} = \phi^T$. We now establish conditions under which the observability matrix is full rank for this particular scenario. Let $b = C^T \phi$. We seek a condition under which the observability matrix, or equivalently its transpose,

(10) $\mathcal{O}^T = \begin{bmatrix} b & A^T b & (A^T)^2 b & \cdots & (A^T)^{d-1} b \end{bmatrix},$

is full rank.^3 We concentrate on the specific scenario where the matrix $A$ (and hence, $A^T$) is diagonalizable, i.e., $A^T = P \Lambda P^{-1}$, where $P$ is an invertible matrix (hence, full rank) and $\Lambda$ is a diagonal matrix with diagonal elements $\lambda_1, \ldots, \lambda_d$. For such matrices, the transpose of the observability matrix can be written as

$\mathcal{O}^T = \begin{bmatrix} b & P\Lambda P^{-1} b & \cdots & P\Lambda^{d-1} P^{-1} b \end{bmatrix} = P \begin{bmatrix} q & \Lambda q & \cdots & \Lambda^{d-1} q \end{bmatrix},$

where $q = P^{-1} b$, and further into

$\mathcal{O}^T = P \,\mathrm{diag}(q) \begin{bmatrix} 1 & \lambda_1 & \cdots & \lambda_1^{d-1} \\ \vdots & & & \vdots \\ 1 & \lambda_d & \cdots & \lambda_d^{d-1} \end{bmatrix}.$

We can now establish a sufficient condition for when the observability matrix is full rank.

^3 There is an interesting connection to Krylov-subspace methods here. In Krylov-subspace methods, a low-rank approximation to a matrix $M$ is obtained by forming the matrix $[q, Mq, M^2 q, \ldots]$ with $q$ randomly chosen. Convergence proofs for this method are closely related to Theorem 4.4. To the best of our knowledge, diagonalizability of $M$ plays an important role in most of these proofs. The interested reader is referred to [32] for more details.
Theorem 4.4
Let $\widetilde{\Phi} = \phi^T$ and let the elements of $\phi$ be i.i.d. from a sub-Gaussian distribution. Then, with high probability, the observability matrix is full rank when the state transition matrix $A$ is diagonalizable and its eigenvectors and eigenvalues are distinct.

From the discussion above, the observability matrix can be written as a product of three square matrices: $P$, the matrix of eigenvectors of $A^T$; a diagonal matrix with entries defined by the vector $q = P^{-1} C^T \phi$; and a Vandermonde matrix defined by the vector of eigenvalues of $A$. When the eigenvectors and eigenvalues are distinct, the first and last matrices are full rank. Given that the elements of $\phi$ are i.i.d., the probability that any entry of $q$ is zero is negligible and, hence, the diagonal matrix is full rank with high probability. Since the product of full-rank square matrices is itself full rank, this implies that the observability matrix is full rank with high probability.
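A quick numerical check of this result: for a generic random state transition matrix (diagonalizable with distinct eigenvalues, with probability one) and a single random measurement row, the observability matrix of $(\phi^T C, A)$ comes out full rank. All sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 100, 8
A = rng.standard_normal((d, d))
A /= np.max(np.abs(np.linalg.eigvals(A)))   # normalize spectral radius for conditioning
C = rng.standard_normal((N, d))
phi = rng.standard_normal(N)                # single common-measurement row (M_tilde = 1)
b = phi @ C                                 # 1 x d observation row of the measured LDS
# Observability matrix (10): rows b A^k for k = 0, ..., d-1.
O = np.vstack([b @ np.linalg.matrix_power(A, k) for k in range(d)])
rank_O = np.linalg.matrix_rank(O)           # full rank (= d) with high probability
```

Repeating this with many random draws of `A` and `phi` never produces a rank-deficient observability matrix, in line with the "negligible probability" argument in the proof.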
Remark: Theorem 4.4 requires that the state transition matrix be full rank (nonzero eigenvalues) and be diagonalizable with distinct eigenvalues. Most matrices are diagonalizable (once we allow complex eigenvalues), and hence the requirement that the state transition matrix be diagonalizable is not restrictive. A more restrictive condition is requiring the eigenvalues of the matrix to be distinct. Unfortunately, this eliminates some commonly observed state transition matrices, such as the identity matrix, which is associated with Brownian motion. Nonetheless, Theorem 4.4 is intriguing, since it guarantees recovery of the state sequence even when we obtain only one common measurement per time instant. This is immensely useful in reducing the number of measurements required to sense a video sequence. Interestingly, we can reduce $\widetilde{M}$ even further. This is achieved by not obtaining common measurements at some time instants.
4.3 Missing measurements: the case $\widetilde{M} < 1$
If we do not obtain common measurements at some time instants, is it still possible to obtain an estimate of the state sequence? One way to view this problem is that we have incomplete knowledge of the Hankel matrix defined in (5), and we seek to complete this matrix. Matrix completion, especially for low-rank matrices, has received significant attention recently [30, 6, 8].
Given that the Hankel matrix in (5) is low rank for videos modeled as LDSs, we formulate the missing measurement recovery problem as one of matrix completion. Suppose that we have the common measurements only at the time instants given by the index set $\Omega$; i.e., we have knowledge of $\{\widetilde{y}_t, t \in \Omega\}$. We can recover the missing measurements by exploiting the low-rank property of $H$. Specifically, we solve the following problem to obtain the missing measurements:

$\min \; \mathrm{rank}(H) \quad \text{subject to} \quad H \text{ consistent with } \{\widetilde{y}_t, t \in \Omega\}.$

However, $\mathrm{rank}(\cdot)$ is a nonconvex function, which renders the above problem NP-hard. In practice, we can solve a convex relaxation of this problem^4

(11) $\min \; \|H\|_* \quad \text{subject to} \quad H \text{ consistent with } \{\widetilde{y}_t, t \in \Omega\},$

where $\|H\|_*$ is the nuclear norm of the matrix $H$, which equals the sum of its singular values. Once we fill in the missing measurements, we use (9) to recover an estimate of the state sequence.

^4 Historically, the use of nuclear norm-based optimization for system identification goes back to Fazel et al. [19, 20]. Since then, there has been much work towards establishing the equivalence of these two problems [30, 6]. Further, the convex program in (11) was used for video inpainting in [13].
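In practice, nuclear-norm programs of this kind are handed to an off-the-shelf solver such as CVX. As a dependency-free sketch of the same low-rank completion idea, the following replaces the nuclear-norm program with a simple alternation between imputing the missing entries and a hard rank projection; the rank, matrix size, and fraction of missing entries are all illustrative:

```python
import numpy as np

def complete_lowrank(H_obs, mask, rank, n_iter=500):
    """Fill entries where mask is False by alternating imputation with a
    hard rank-`rank` SVD projection; a heuristic stand-in for the convex program."""
    H = np.where(mask, H_obs, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(H, full_matrices=False)
        H_low = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]
        H = np.where(mask, H_obs, H_low)   # keep observed entries, update the rest
    return H

# Demo: a 30 x 40 rank-3 matrix with roughly 30% of its entries missing.
rng = np.random.default_rng(3)
H_true = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 40))
mask = rng.random(H_true.shape) > 0.3    # True = observed
H_hat = complete_lowrank(np.where(mask, H_true, 0.0), mask, rank=3)
```

Because the number of observed entries comfortably exceeds the degrees of freedom of a rank-3 matrix, the iteration recovers the missing entries to high accuracy.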
An important quantity to characterize is the proportion of time instants in which we can choose to not obtain common measurements. This amounts to developing a sampling theorem for the completion of lowrank Hankel matrices; to the best of our knowledge, there has been little theoretical work on this problem. Instead, we address it empirically in Section 6.
5 Estimating the observation matrix
In this section, we discuss estimation of the observation matrix $C$ given the estimates of the state sequence $\{\widehat{z}_t\}$.
5.1 Need for innovation measurements
Given estimates of the state sequence $\{\widehat{z}_t\}$, the compressive measurements are linear in the matrix $C$, which enables a host of conventional least-squares-based methods as well as sparsity-based recovery algorithms to estimate $C$. However, recall that $C$ is an $N \times d$ matrix and, hence, the common measurements by themselves are not enough to recover $C$ unless $\widetilde{M}$ is large.

The common measurements used in the estimation of the state sequence are measured using a time-invariant measurement matrix $\widetilde{\Phi}$. A time-invariant measurement matrix, by itself, is not sufficient for estimating $C$ unless $\widetilde{M}$ is very large. To alleviate this problem, we take additional compressive measurements of each frame using a time-varying measurement matrix. Let $\widehat{y}_t = \widehat{\Phi}_t x_t$, where $\widehat{y}_t$ and $\widehat{\Phi}_t$ are the compressive measurements and the corresponding measurement matrix at time $t$. As mentioned earlier in Section 3, we refer to these as innovation measurements. Noting that $C$ is a time-invariant parameter, we can collect innovation measurements over a period of time before reconstructing $C$. This enables a significant reduction in the number of measurements taken at each time instant.
5.2 Structured sparsity for $C$
Individual frames of a video, being images, exhibit sparsity/compressibility in certain transform bases, such as wavelets and the DCT. If the supports of the frames are highly overlapping (this is to be expected given the redundancies in a video), then the columns of $C$ are compressible in the same transform bases, a consequence of $C$ being a basis for the frames of the video. Further, note that the columns of $C$ are also the top principal components and, hence, capture the dominant motion patterns in the scene; when motion in the scene is spatially correlated, the columns of $C$ are compressible in a wavelet/DCT basis. For these reasons, we assume that the columns of $C$ are compressible in a wavelet/DCT basis and employ sparse priors in the recovery of the observation matrix $C$. We can potentially obtain an estimate of $C$ by solving the following convex program:

(12) $\min_C \sum_{k=1}^{d} \|\Psi_k^T c_k\|_1 \quad \text{subject to} \quad \sum_t \|y_t - \Phi_t C \widehat{z}_t\|_2^2 \le \epsilon^2.$

Here, we denote the columns of the matrix $C$ as $c_k$, $k = 1, \ldots, d$. $\Psi_k$ is a sparsifying basis for the $k$th column of $C$; we have the freedom to choose different sparsifying bases for different columns of $C$.
The assumption of compressibility in a transform basis was sufficient for all the videos we test on (see Section 6). However, it is entirely possible that a video is not compressible in a transform basis. There are two possible ways to address such a scenario. First, given training data, we can use dictionary learning algorithms [24] to learn an appropriate basis in which the columns of $C$ are sparse/compressible. Second, in the absence of training data, we revert to least-squares-based methods to recover $C$; in such cases, we would typically need more measurements to recover $C$.
However, the convex program (12) is not sufficient as-is to recover $C$. The reason for this stems from ambiguities in the definition of the LDS (see Section 2.3). The use of the SVD for recovering the state sequence introduces an ambiguity in the estimates of the state sequence in the form of $\widehat{z}_t = P^{-1} z_t$, where $P$ is an invertible matrix. As a consequence, this will lead to an estimate $\widehat{C}$ satisfying $\widehat{C} = CP$. Suppose the columns of $C$ are $K$-sparse (equivalently, compressible for a certain value of $K$), each in $\Psi$, with support $S_k$ for the $k$th column. Then, the columns of $\widehat{C} = CP$ are potentially $dK$-sparse with identical supports $S = \cup_k S_k$. The support is exactly $dK$-sparse when the $S_k$ are disjoint and $P$ is dense. At first glance, this seems to be a significant drawback, since the overall sparsity of $\widehat{C}$ has increased to $d^2 K$ (the sparsity of $C$ is $dK$). However, this apparent increase in sparsity is alleviated by the columns having identical supports, which can be exploited in the recovery process [17].
Given the estimates $\{\widehat{z}_t\}$, we estimate the matrix $\widehat{C}$ by solving the following convex program:

(13) $\min_C \sum_{j=1}^{N} \|w_j\|_2 \quad \text{subject to} \quad \sum_t \|y_t - \Phi_t C \widehat{z}_t\|_2^2 \le \epsilon^2, \quad W = \Psi^T C,$

where $w_j$ is the $j$th row of the matrix $W = \Psi^T C$ and $\Psi$ is a sparsifying basis for the columns of $C$. The above problem is an instance of an $\ell_1/\ell_2$ mixed-norm optimization that promotes group sparsity; in this instance, we use it to promote group column sparsity in the matrix $\Psi^T C$, i.e., all columns have the same sparsity pattern.
There are multiple efficient ways to solve (13), including solvers such as SPGL1 [38] and model-based CoSaMP [3]. Algorithm 1 summarizes the model-based CoSaMP algorithm used for recovering the observation matrix $C$. The specific model used here is a union-of-subspaces model that groups each row of $\Psi^T C$ into a single subspace/model.
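The key structural change relative to plain CoSaMP is the model-projection step: instead of keeping the largest individual entries of $\Psi^T C$, one keeps entire rows, which enforces a common support across columns. A minimal sketch of that projection (sizes and values are illustrative):

```python
import numpy as np

def row_sparse_project(W, K):
    """Keep the K rows of W with the largest l2 norms and zero out the rest:
    the union-of-subspaces projection used for columns sharing a common support."""
    row_norms = np.linalg.norm(W, axis=1)
    keep = np.argsort(row_norms)[-K:]
    out = np.zeros_like(W)
    out[keep, :] = W[keep, :]
    return out

W = np.array([[3.0, 4.0],
              [0.1, 0.0],
              [1.0, -1.0],
              [0.0, 0.2]])
W2 = row_sparse_project(W, 2)   # keeps the two rows with largest l2 norm
```

Replacing the entrywise pruning and thresholding steps of CoSaMP with this row-wise projection yields the model-based variant used here.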
5.3 Value of $T$

For stable recovery of the observation matrix $C$, we need in total $O(dK \log(N/K))$ measurements; for a large class of practical solvers, a rule of thumb is a total of about $5dK$ measurements. Given that we measure $\widehat{M}$ time-varying compressive measurements at each time instant, over a period of $T$ time instants we have $T\widehat{M}$ compressive measurements for estimating $C$. Hence, for stable recovery of $C$, we need approximately

(14) $T \widehat{M} \ge 5 d K.$

This indicates extremely favorable operating scenarios for the CS-LDS framework, especially when $T$ is large (as in high frame-rate capture). Let $T = t_s f_s$, where $t_s$ is the time duration of the video in seconds and $f_s$ is the sampling rate of the measurement device. The number of compressive measurements required per time instant in this case is $\widehat{M} \ge 5dK/(t_s f_s)$. Given that the complexity of the LDS typically (however, not always) depends on $t_s$, for a fixed $t_s$ the number of measurements required per time instant decreases as $1/f_s$ as the sampling rate is increased. Indeed, as the sampling rate increases, $\widehat{M}$ can be decreased while keeping $T\widehat{M}$ constant. This will ensure that (14) is satisfied, enabling stable recovery of $C$.
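As a back-of-the-envelope illustration of this budget, with every number below (frame size, state dimension, sparsity, per-frame innovation measurements, and the constant of five measurements per sparse coefficient) being an illustrative assumption rather than a value from the text:

```python
# Rough measurement-budget arithmetic for recovering C (all values illustrative).
N = 128 * 128      # pixels per frame
d = 10             # state-space dimension
K = 500            # per-column sparsity of C in the transform basis
M_hat = 20         # innovation measurements per frame

total_needed = 5 * d * K            # rule-of-thumb total measurement count
T_needed = total_needed // M_hat    # frames over which to aggregate innovations
compression = (T_needed * M_hat) / (T_needed * N)   # fraction of Nyquist samples
```

Under these assumptions, aggregating a modest number of innovation measurements over a long enough capture achieves the budget while sensing only a tiny fraction of the Nyquist-rate samples per frame.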
5.4 Mean + LDS

In many instances, a dynamical scene is modeled better as an LDS over a static background; that is, $x_t = \mu + C z_t + w_t$. This can be handled with two small modifications to Algorithm 1. First, the state sequence is obtained by performing an SVD on the Hankel matrix modified such that each row sums to zero. This works under the assumption that the sample mean of $\{\widetilde{y}_t\}$ is equal to $\widetilde{\Phi}\mu$, the compressive measurement of $\mu$. Second, given that the support of $\mu$ need not be similar to that of the columns of $C$, the resulting optimization problem can be reformulated as

(15) $\min_{C, \mu} \; \|\Psi_\mu^T \mu\|_1 + \sum_{j=1}^{N} \|w_j\|_2 \quad \text{subject to} \quad \sum_t \|y_t - \Phi_t(\mu + C \widehat{z}_t)\|_2^2 \le \epsilon^2, \quad W = \Psi^T C,$

where $\Psi_\mu$ is a sparsifying basis for the mean $\mu$. As with the convex formulation, the model-based CoSaMP algorithm described in Algorithm 1 can be modified to incorporate the mean term $\mu$; an additional modification here is the requirement to specify a priori the sparsity of the mean $\mu$.
6 Experiments
We present a range of experiments validating various aspects of the CS-LDS framework. We use permuted noiselets [12] for the measurement matrices, since they have a fast, scalable implementation. We use the term compression ratio to denote the reduction in the number of measurements as compared to the Nyquist rate. Finally, we use the reconstruction SNR to evaluate the recovered videos. Given the ground truth video $X$ and a reconstruction $\widehat{X}$, the reconstruction SNR in dB is defined by

(16) $\mathrm{SNR} = 20 \log_{10} \frac{\|X\|_F}{\|X - \widehat{X}\|_F}.$
We compare CS-LDS against frame-by-frame CS, where each frame of the video is recovered separately using conventional CS techniques. We use the term oracle LDS when the parameters and video reconstruction are obtained by operating on the original data itself. Oracle LDS estimates the parameters using a rank-$d$ approximation of the ground truth data. The reconstruction SNR of the oracle LDS gives an upper bound on the achievable SNR. Finally, the ambiguity in the observation matrix (due to the non-uniqueness of the SVD-based factorization) as estimated by oracle LDS and CS-LDS is resolved by finding the best linear transformation that registers the two estimates.
6.1 State sequence estimation
We first provide empirical verification of the results derived in Sections 4.1 and 4.2. It is worth noting that, in the absence of noise, Theorem 4.4 suggests exact recovery of the state sequence. In practice, it is important to check the robustness of the estimate to measurement noise. Figure 3(a) analyzes the performance of the state space estimation for different values of the number of common measurements and different SNRs of the measurement noise. We define the input SNR in dB as , where is the standard deviation of the noise. Here, we consider the scenario when . The underlying state space dimension is with frames. As expected, for high input SNRs, the reconstruction SNR is very high even for small values of . In addition, the accuracy at is acceptable even at low input SNRs.

Next, we validate the implications of Section 4.3, where we discuss the scenario of , by simulating various proportions of missing common measurements. Figure 3(b) shows the reconstruction SNR for the Hankel matrix in (5) for varying amounts of missing measurements. We recover the Hankel matrix by solving (11) using CVX [21]. Figure 3(b) demonstrates a very high reconstruction SNR even at a very high rate of missing measurements. As mentioned earlier, not having to sense common measurements at every frame is very useful, since we can stagger our acquisition of common and innovation measurements. In theory, this enables a measurement strategy in which we need to sense only one measurement per frame of the video, without having to group consecutive measurements of the SPC. Hence, we can aim to reconstruct videos at the sampling rate of the SPC. To the best of our knowledge, this is the first video CS acquisition design capable of doing this.
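Since (11) is solved with CVX in MATLAB, a lightweight stand-in can illustrate the same idea of recovering a low-rank matrix from incomplete entries. The alternating-projection heuristic below is our substitution for the paper's nuclear-norm program, and all names are illustrative:

```python
import numpy as np

def complete_low_rank(M_obs, mask, rank, iters=500):
    """Fill in missing entries of an (approximately) rank-`rank` matrix by
    alternating between a rank-truncated SVD and re-imposing the observed
    entries; a simple heuristic stand-in for nuclear-norm minimization."""
    X = np.where(mask, M_obs, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]  # best rank-`rank` approx
        X = np.where(mask, M_obs, X)                 # keep measured entries
    return X
```

On a synthetic rank-2 matrix with roughly 30% of its entries missing, this heuristic recovers the matrix to a small relative error, mirroring the robustness to missing common measurements reported in Figure 3(b).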
6.2 Dynamic Textures
Our test dataset comprises videos from the DynTex dataset [29]. We used the mean+LDS model of Section 5.4 for all the video CS experiments, with the 2D DCT as the sparsifying basis for the columns of and 2D wavelets as the sparsifying basis for the mean. We used the model-based CoSAMP solver of Algorithm 1 for these results, since it provides explicit control over the sparsity of the mean and of the columns of . We used (14) as a guide for selecting these values.
Figure 4 shows the reconstruction of a dynamic texture from the DynTex dataset [29]. The reconstruction results are at a compression ratio at which frame-to-frame CS recovery is completely infeasible. However, the dynamic component of the scene is relatively small, which allows us to recover the video from relatively few measurements. The reconstruction SNRs of the recovered videos shown are as follows: oracle LDS = dB, frame-to-frame CS = dB, and CS-LDS = dB.
Figure 5 shows the reconstruction of a video of 6 blinking LED lights from the DynTex dataset. We show reconstruction results at different compression ratios as well as at different image resolutions. It is noteworthy that, even at a compression, the reconstruction at a resolution of pixels preserves fine details.
Performance with measurement noise:
We validate the performance of our recovery algorithm under various amounts of measurement noise. Note that the columns of associated with larger singular values are inherently better conditioned to deal with this measurement error; the columns corresponding to smaller singular values are invariably estimated with higher error. Figure 6 shows the performance of the recovery algorithm at various levels of measurement noise. The effect of the measurement noise on the reconstructions is perceptible only at low input SNRs. In part, this robustness is due to the LDS model mismatch dominating the reconstruction error at high input SNRs; as the input SNR drops significantly below the model mismatch term, the measurement noise predictably starts influencing the reconstructions more. This provides a certain amount of flexibility in the design of potential CS-LDS cameras.
Computation time and spatial resolution:
Figure 7 shows the recovery algorithm applied to a video of length frames at different spatial resolutions. Shown in Figure 7 are the time taken for each recovery, which scales gracefully with increasing spatial resolution, and the reconstruction SNR, which approaches the performance of the oracle LDS. The improvement in reconstruction comes from the increase in the number of compressive measurements at higher resolutions, since the compression ratio is held fixed. However, this comes at the cost of requiring a faster compressive camera, since a larger number of measurements must be acquired in the same time.
Gallery of results:
6.3 Application in activity analysis
As mentioned in Section 2.3, LDSs are often used in classification problems, especially in the context of scene/activity analysis. A key experiment in this context is to check whether the CS-LDS framework recovers videos that are sufficiently informative for such applications. To this end, we experiment with two activity analysis datasets: the UCSD Traffic Dataset [10] and the UMD Human Activity Dataset [42].
Activity recognition methodology
In both the scenarios considered here (single human activity and traffic), we model the observed video using the linear dynamical model framework. For recognition, we use the Procrustes distance [11] between the column spaces of the observability matrices in conjunction with a nearest-neighbor classifier. Given the observability matrix defined in (8), let be an orthonormal matrix such that . Given two LDSs, the squared Procrustes distance between them is given by , where and . We use this distance function in a nearest-neighbor classifier in both activity classification experiments.
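The distance can be sketched as follows; the closed form 2(k − Σσ) used below is one standard expression for the orthogonal-Procrustes distance between k-dimensional subspaces, and the code names are ours rather than the paper's:

```python
import numpy as np

def procrustes_dist_sq(O1, O2):
    """Squared Procrustes distance between the column spaces of two
    observability matrices: min over orthogonal R of ||A1 - A2 R||_F^2,
    where A1, A2 are orthonormal bases (obtained here via QR).
    This equals 2 * (k - sum of singular values of A2^T A1)."""
    A1, _ = np.linalg.qr(O1)
    A2, _ = np.linalg.qr(O2)
    s = np.linalg.svd(A2.T @ A1, compute_uv=False)
    return 2.0 * (A1.shape[1] - s.sum())
```

Identical column spaces give distance zero regardless of the particular bases, and mutually orthogonal k-dimensional subspaces attain the maximum 2k; a nearest-neighbor classifier then simply picks the training LDS minimizing this distance.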
The UCSD Traffic Dataset
[10] consists of videos capturing traffic of three types: light, moderate, and heavy. Each video is of length frames at a resolution of pixels. Figure 9 shows reconstruction results on a traffic sequence from the dataset. We perform a classification of the videos into these three categories; four different train-test scenarios are provided with the dataset. For comparison, we also perform the same experiments by fitting the LDS model to the original frames (oracle LDS). We perform classification at two different values of the state space dimension and at a fixed compression ratio of . Table 1 shows the classification results. The last two rows of the table show comparative results obtained using a probabilistic kernel on dynamic texture models [10] in conjunction with SVMs; since [10] reports only an aggregate number rather than results for each individual experiment, only that aggregate is shown. It can be seen that, even without sophisticated nonlinear classifiers, we obtain comparable performance with a simple nearest-neighbor classifier on the dynamic texture model parameters. This shows that the obtained parameters possess discriminatory properties and can be used in conjunction with the more sophisticated classifiers that build on dynamic texture models, as in [10].
The UMD Human Activity Dataset
[42] consists of videos, each of length frames, depicting different activities: pick up object, jog, push, squat, wave, kick, bend, throw, turn around, and talk on cellphone. Each activity was repeated times, so there are a total of sequences in the dataset. As with the traffic experiment, we use an LDS model on the image intensity values without any feature extraction. Images were cropped to contain the human and resized to . The state space dimension was fixed at and the compression was varied from to . We performed a leave-one-execution-out test; the results are summarized in Table 2. As can be seen, the CS-LDS framework obtains classification performance comparable to the oracle LDS. For this dataset, both the oracle LDS and CS-LDS obtained a perfect classification score up to a compression ratio of . Further, as shown in Table 2, we obtain performance comparable to a far more sophisticated method employing advanced shape-based features for activity recognition. This suggests that the CS-LDS framework should be extremely useful in a wide range of applications beyond video recovery, and can provide a basis for acquiring more sophisticated features to tackle challenging activity recognition problems.

                              Expt 1   Expt 2   Expt 3   Expt 4   Average
(d = 10)
  Oracle LDS                   85.71    85.93    87.5     92.06    87.8%
  CS-LDS                       84.12    87.5     89.06    85.71    86.59%
(d = 5)
  Oracle LDS                   77.77    82.81    92.18    80.95    83.42%
  CS-LDS                       85.71    73.43    78.1     76.1     78.34%
State KL-SVM (d = 10) [10]     n.a.     n.a.     n.a.     n.a.     93%
State KL-SVM (d = 5) [10]      n.a.     n.a.     n.a.     n.a.     87%
Activity  Shape dynamics [44]  

Pickup Object  100  100  100  100 
Jog  100  100  90  100 
Push  100  90  50  100 
Squat  90  100  100  100 
Wave  100  100  60  100 
Kick  100  90  80  100 
Bend  100  100  100  100 
Throw  100  100  90  100 
Turn Around  100  100  100  100 
Talk on Cellphone  100  20  10  100 
Average  94%  90%  78%  100% 
7 Discussion
In this paper, we have proposed a framework for the compressive acquisition of dynamic scenes modeled as LDSs. In particular, this paper emphasizes the power of predictive/generative video models: we have shown that a strong model for the scene dynamics enables stable video reconstructions at very low measurement rates. Notably, it enables the estimation of the state sequence associated with a video even with a fractional number of common measurements per video frame. The use of CS-LDS for dynamic scene modeling and classification also highlights the purposive nature of the framework.
Implementation issues:
The results provided in this paper are mainly based on simulations. While a full-fledged hardware implementation is beyond the scope of this paper, we discuss some of the key issues and challenges in obtaining such results. Focusing on the single pixel camera (SPC) as our imaging architecture, the achievable compression and resolution are limited by the amount of motion in the scene and by the sampling rate of the camera. We discuss the roles these two parameters play in practice.
The amount of motion determines an inherent notion of frame rate for the video; note that real-life scenes have no intrinsic "frame rate". If the scene changes negligibly over a time duration , then (for the largest such value of ) becomes a good measure of the frame rate of the scene. For example, static scenes do not change over an infinite time duration and hence can be sensed at fps. Given that we seek to sense this scene at a spatial resolution of pixels, a Nyquist camera would need to operate at measurements per second.
Suppose this scene over a duration of seconds can be well approximated by a -dimensional LDS; then the total number of free variables to estimate is approximately for the state sequence and for the observation matrix. An SPC operating at samples per second obtains a total of compressive measurements. If CS-LDS were employed at a compression ratio of , then . The key dependences here are on how , and change as a function of . In particular, even if and increase as , would still need to scale linearly in to maintain the same compression level.
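The counting argument can be made concrete with a back-of-the-envelope computation; every number below is a hypothetical illustration, not a value from the paper:

```python
# Illustrative sensing budget for an SPC-based CS-LDS camera.
# All numbers are hypothetical, chosen only to make the counting
# argument of the text concrete.
n = 128 * 128      # pixels per frame (spatial resolution)
d = 10             # LDS state-space dimension
T_sec = 10         # capture duration in seconds
R = 100_000        # SPC sampling rate (measurements per second)
f = 300            # video frames to reconstruct

unknowns = d * f + n * d              # state sequence + observation matrix
measurements = R * T_sec              # total compressive measurements
compression = (n * f) / measurements  # reduction vs. Nyquist (n per frame)

print(unknowns, measurements, round(compression, 2))
```

With these (made-up) numbers, the one million compressive measurements comfortably exceed the roughly 167,000 unknowns while still operating at about a 5x reduction relative to Nyquist-rate acquisition; doubling the resolution `n` scales the unknowns and the Nyquist budget, which is exactly the dependence discussed above.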
Connection to affine rank minimization:
The pioneering work of Fazel [18] on convex optimization techniques for system identification problems has interesting parallels to the ideas proposed in this paper. One of the key ideas espoused in [18] is that, when the video sequence is an LDS, the block-Hankel matrix is low rank. Given linear measurements of the video frames, one can then solve an affine rank minimization problem to recover the video. However, such methods operate on the Hankel matrix directly and lead to computationally infeasible designs even for videos of very small dimensions. In contrast, CS-LDS has been shown to be fast and computationally feasible for very large videos involving millions of variables. The key is our two-step solution, which separates the unknowns into two manageable sets and solves for each separately.
Universality:
An attractive property of random-matrix-based CS measurement is the universality of the measurement process. Universality implies that the sensing process is independent of the subsequent reconstruction algorithm. This makes the sensing design "future-proof": if we devise a more sophisticated and powerful recovery algorithm in the future, we do not need to redesign the camera or the sensing framework. The CS-LDS framework violates this property: the two-step measurement process of Section 3, which is key to breaking the bilinearity introduced by the LDS prior, implies that the CS-LDS design is not universal. An intriguing direction for future research is the design of a universal CS-LDS measurement process.

Online tracking:
We have made the assumption of a static observation matrix . However, as the length of the video increases, the assumption of a static can be satisfied only by increasing the state space dimension. An alternative approach is to allow a time-varying observation matrix and to track it from the compressive measurements. This would give us the benefit of a low state space dimension while remaining accurate when we sense over long durations.
Beyond LDS:
Figure 10 captures the relative performance of the MPEG-4 compression algorithm and CS-LDS on a video. MPEG-4 has access to the ground truth video and, as a consequence, achieves significantly better compression for the same recovery performance (see Figure 10(a)). Further, it is worth noting that the nonlinear encoding in MPEG-4 produces errors that are imperceptible; hence, even at the same level of reconstruction error, it produces videos of higher visual quality (see Figure 10(b,c)). This points to the inherent drawbacks of a linear encoder. While the CS-LDS framework makes a compelling case study of LDSs for video CS, its applicability to arbitrary videos is limited. In particular, it does not extend to simple nonstationary scenes such as people walking or panning cameras (see the result associated with Figure 8(h)). This motivates the search for models more general than LDSs. In this regard, a promising line of future research is to leverage models from the video compression literature for CS recovery.
Acknowledgments
ACS and RGB were partially supported by the grants NSF CCF0431150, CCF0728867, CCF0926127, CCF1117939, ARO MURI W911NF0910383, W911NF0710185, DARPA N660011114090, N6600111C4092, N660010812065, ONR N000141210124 and AFOSR FA95500910432.
RC was partially supported by the Office of Naval Research under the Grant N000141210124.
References
 [1] CS-LDS Project webpage. URL = http://www.ece.rice.edu/~as48/research/cslds.
 [2] M. Ayazoglu, B. Li, C. Dicle, M. Sznaier, and O. I. Camps, Dynamic subspacebased coordinated multicamera tracking, in IEEE Intl. Conf. Comp. Vision, 2011.
 [3] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, Modelbased compressive sensing, IEEE Trans. Inf. Theory, 56 (2010), pp. 1982–2001.
 [4] R. G. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, A simple proof of the restricted isometry property for random matrices, Constr. Approx., 28 (2008), pp. 253–263.
 [5] R. W. Brockett, Finite Dimensional Linear Systems, Wiley, 1970.
 [6] E. J. Candès and B. Recht, Exact matrix completion via convex optimization, Found. Comp. Math., 9 (2009), pp. 717–772.
 [7] E. J. Candès, J. Romberg, and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Trans. Inf. Theory, 52 (2006), pp. 489–509.
 [8] E. J. Candès and T. Tao, The power of convex relaxation: Nearoptimal matrix completion, IEEE Trans. Inf. Theory, 56 (2010), pp. 2053–2080.
 [9] V. Cevher, A. C. Sankaranarayanan, M. F. Duarte, D. Reddy, R. G. Baraniuk, and R. Chellappa, Compressive sensing for background subtraction, in Euro. Conf. Comp. Vision, Oct. 2008.
 [10] A. B. Chan and N. Vasconcelos, Probabilistic kernels for the classification of autoregressive visual processes, in IEEE Conf. Comp. Vision and Pattern Recog, June 2005.
 [11] Y. Chikuse, Statistics on special manifolds, Springer Verlag, 2003.
 [12] R. Coifman, F. Geshwind, and Y. Meyer, Noiselets, Appl. Comp. Harm. Anal., 10 (2001), pp. 27–44.
 [13] T. Ding, M. Sznaier, and O. I. Camps, A rank minimization approach to video inpainting, in IEEE Intl. Conf. Comp. Vision, 2007.
 [14] D. L. Donoho, Compressed sensing, IEEE Trans. Inf. Theory, 52 (2006), pp. 1289–1306.
 [15] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto, Dynamic textures, Intl. J. Comp. Vision, 51 (2003), pp. 91–109.
 [16] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, Singlepixel imaging via compressive sampling, IEEE Signal Process. Mag., 25 (2008), pp. 83–91.
 [17] M. F. Duarte, M. B. Wakin, D. Baron, S. Sarvotham, and R. G. Baraniuk, Measurement bounds for sparse signal ensembles via graphical models, IEEE Trans. Inf. Theory, 59 (2013), pp. 4280–4289.
 [18] M. Fazel, Matrix rank minimization with applications, PhD thesis, Stanford University, 2002.

 [19] M. Fazel, H. Hindi, and S. P. Boyd, A rank minimization heuristic with application to minimum order system approximation, in IEEE Amer. Control Conf., June 2001.
 [20] M. Fazel, H. Hindi, and S. P. Boyd, Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices, in IEEE Amer. Control Conf., June 2003.
 [21] M. Grant and S. Boyd, CVX: Matlab software for disciplined convex programming, version 1.21, available at http://cvxr.com/cvx, 2011.
 [22] J. Haupt and R. Nowak, Signal reconstruction from noisy random projections, IEEE Trans. Inf. Theory, 52 (2006), pp. 4036–4048.
 [23] Y. Hitomi, J. Gu, M. Gupta, T. Mitsunaga, and S. K. Nayar, Video from a single coded exposure photograph using a learned overcomplete dictionary, in IEEE Intl. Conf. Comp. Vision, Nov. 2011.
 [24] K. KreutzDelgado, J. F. Murray, B. D. Rao, K. Engan, T. W. Lee, and T. J. Sejnowski, Dictionary learning algorithms for sparse representation, Neural Comp., 15 (2003), pp. 349–396.
 [25] S. K. Nayar, V. Branzoi, and T. E. Boult, Programmable imaging: Towards a flexible camera, Intl. J. Comp. Vision, 70 (2006), pp. 7–22.
 [26] D. Needell and J. A. Tropp, CoSaMP: Iterative signal recovery from incomplete and inaccurate samples, Appl. Comp. Harm. Anal., 26 (2009), pp. 301–321.
 [27] J. Y. Park and M. B. Wakin, A multiscale framework for compressive sensing of video, in Pict. Coding Symp., May 2009.
 [28] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition, in Asilomar Conf. Signals Sys. Comp., Nov. 1993.
 [29] R. Péteri, S. Fazekas, and M.J. Huiskes, DynTex: A comprehensive database of dynamic textures, Pattern Recog. Letters, 31 (2010), pp. 1627–1632.
 [30] B. Recht, M. Fazel, and P. A. Parrilo, Guaranteed minimumrank solutions of linear matrix equations via nuclear norm minimization, arXiv:0706.4138, (2007).
 [31] D. Reddy, A. Veeraraghavan, and R. Chellappa, P2C2: Programmable pixel compressive camera for high speed imaging, in IEEE Conf. Comp. Vision and Pattern Recog, June 2011.
 [32] Y. Saad, Krylov subspace methods for solving large unsymmetric linear systems, Math. Comput., 37 (1981), pp. 105–126.
 [33] P. Saisan, G. Doretto, Y. Wu, and S. Soatto, Dynamic texture recognition, in IEEE Conf. Comp. Vision and Pattern Recog, Dec. 2001.
 [34] A. C. Sankaranarayanan, P. Turaga, R. Baraniuk, and R. Chellappa, Compressive acquisition of dynamic scenes, in Euro. Conf. Comp. Vision, Sep. 2010.
 [35] S. Soatto, G. Doretto, and Y. N. Wu, Dynamic textures, in IEEE Intl. Conf. Comp. Vision, July 2001.
 [36] M. Sznaier, Compressive information extraction: A dynamical systems approach, in System Identification, vol. 16, 2012, pp. 1559–1568.
 [37] P. Turaga, A. Veeraraghavan, and R. Chellappa, Unsupervised view and rate invariant clustering of video sequences, Comp. Vision and Image Understd., 113 (2009), pp. 353–371.
 [38] E. van den Berg and M. P. Friedlander, Probing the pareto frontier for basis pursuit solutions, SIAM J. Scientific Comp., 31 (2008), pp. 890–912.
 [39] P. Van Overschee and B. De Moor, N4SID: Subspace algorithms for the identification of combined deterministicstochastic systems, Automatica, 30 (1994), pp. 75–93.
 [40] N. Vaswani, Kalman filtered compressed sensing, in IEEE Conf. Image Process., Oct. 2008.
 [41] N. Vaswani and W. Lu, ModifiedCS: Modifying compressive sensing for problems with partially known support, in Intl. Symp. Inf. Theory, June 2009.
 [42] A. Veeraraghavan, R. Chellappa, and A. K. RoyChowdhury, The function space of an activity, in IEEE Conf. Comp. Vision and Pattern Recog, June 2006.
 [43] A. Veeraraghavan, D. Reddy, and R. Raskar, Coded strobing photography: Compressive sensing of high speed periodic events, IEEE Trans. Pattern Anal. Mach. Intell., 33 (2011), pp. 671–686.
 [44] A. Veeraraghavan, A. K. RoyChowdhury, and R. Chellappa, Matching shape sequences in video with applications in human movement analysis, IEEE Trans. Pattern Anal. Mach. Intell., 27 (2005), pp. 1896–1909.
 [45] M. B. Wakin, J. N. Laska, M. F. Duarte, D. Baron, S. Sarvotham, D. Takhar, K. F. Kelly, and R. G. Baraniuk, Compressive imaging for video representation and coding, in Pict. Coding Symp., Apr. 2006.
 [46] M. B. Wakin, B. M. Sanandaji, and T. L. Vincent, On the observability of linear systems from random, compressive measurements, in IEEE Conf. on Decision and Control, Dec. 2010.