1 Introduction
The mathematical theory of compressive sensing (CS) [1] asserts that one can acquire signals from measurements whose rate is much lower than the Nyquist rate suggested by the signal bandwidth. Whereas the CS theory is now well developed, challenges concerning hardware implementations [2, 3, 4, 5, 6, 7] of CS-based acquisition devices, especially in optics, have only started being addressed. This paper introduces a color video CS camera capable of capturing low-frame-rate measurements at acquisition, with high-frame-rate video recovered subsequently via computation (decompression of the measured data).
The Coded Aperture Compressive Temporal Imaging (CACTI) [8] system uses a moving binary mask pattern to modulate a video sequence many times within the integration time, prior to integration by the detector. The number of high-speed frames recovered from a coded-exposure measurement depends on the speed of video modulation; within the CACTI framework, modulating the video corresponds to moving the mask within the integration time. If N_F frames are to be recovered per compressive measurement by a camera collecting data at f frames-per-second (fps), the time variation of the code is required to be N_F f fps. The liquid-crystal-on-silicon (LCoS) modulator used in [9, 10] can modulate at such rates by pre-storing the exposure codes but, because the coding pattern is continuously changed at each pixel throughout the exposure, it requires considerable energy consumption. The mechanical modulator in [8], by contrast, modulates the exposure through periodic mechanical translation of a single mask (coded aperture), using a piezoelectric translator that consumes minimal energy. CACTI has now been extended to color video [11], capturing the "R", "G" and "B" channels of the scene. With appropriate reconstruction algorithms [12, 13, 14], N_F color video frames can be recovered from a single grayscale measurement.
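The code-rate requirement above is a simple product of the capture rate and the number of frames recovered per measurement. A minimal check, using the numbers that appear in the real-data experiments later in the paper (30 fps capture, 22 frames recovered per measurement):

```python
# Required mask modulation rate: the mask must change N_F times within each
# exposure, so the code varies at capture_fps * frames_per_measurement fps.
def required_code_rate(capture_fps, frames_per_measurement):
    return capture_fps * frames_per_measurement

# 30 fps capture with 22 recovered frames per measurement -> 660 fps code rate
assert required_code_rate(30, 22) == 660
```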
While numerous algorithms have been used for CS inversion, the Bayesian CS algorithm [16] offers the significant advantage of providing a full posterior distribution. This paper develops a new Bayesian inversion algorithm to reconstruct videos based on raw measurements acquired by the color-CACTI camera. By exploiting the hybrid three-dimensional (3D) tree structure of the wavelet and DCT (discrete cosine transform) coefficients, we have developed a hidden Markov tree (HMT) [17] model in the context of a Bayesian framework. Research in [18, 19, 20, 21, 22] has shown that by employing the HMT structure of an image, the number of CS measurements can be reduced. This paper extends the HMT to 3D, and a sophisticated 3D tree structure is developed for video CS, with color-CACTI shown as an example. Experimental results with both simulated and real datasets verify the performance of the proposed algorithm. The basic model and inversion method may be applied to any of the compressive video cameras discussed above.
2 Color-CACTI
2.1 Coding Scenario
Let X(x, y, t) be the continuous/analog spatiotemporal volume of the video being measured; T(x - \delta_x(t), y - \delta_y(t)) represents a moving mask (code), with (\delta_x(t), \delta_y(t)) denoting its spatial translation at time t; and S(x, y) denotes the camera spatial sampling function, with spatial resolution \Delta. The coded aperture compressive camera system modulates each temporal segment of duration \Delta t with the moving mask (the motion is periodic, with the period equal to \Delta t), and collapses (sums) the coded video into a single photograph Y_k(x, y):
(1)  Y_k(x, y) = \int_{(k-1)\Delta t}^{k\Delta t} X(x, y, t) \, T(x - \delta_x(t), y - \delta_y(t)) \, S(x, y) \, dt
for k = 1, 2, \dots, with the detector size N_x \times N_y pixels. The set of data Y_k, which below we represent as Y, corresponds to the kth compressive measurement. The code/mask is here binary, corresponding to photon transmission and blocking (see Figure 1).
2.2 Measurement Model
Denote by X_k \in \mathbb{R}^{N_x \times N_y \times N_F} the original continuous video sampled in space and in time (N_F discrete temporal frames, X_{k,1}, \dots, X_{k,N_F}, within the time window of the kth compressive measurement). We also define the correspondingly discretized masks

(2)  Z_{k,i}(m, n) = T(m\Delta - \delta_x(t_i), \, n\Delta - \delta_y(t_i)), \quad i = 1, \dots, N_F.

We can rewrite (1) as

(3)  Y_k = \sum_{i=1}^{N_F} Z_{k,i} \odot X_{k,i} + E_k
(4)  Y_k(m, n) = \sum_{i=1}^{N_F} Z_{k,i}(m, n) \, X_{k,i}(m, n) + E_k(m, n)

where E_k is an added noise term, Y_k, E_k \in \mathbb{R}^{N_x \times N_y}, and \odot denotes elementwise multiplication (Hadamard product). In (3), Z_{k,i} denotes the mask/code at the ith shift position (approximately discretized in time), and X_{k,i} is the underlying video, for video frame i within CS measurement k. Dropping subscript k for simplicity, (3) can be written as

(5)  y = \Phi x + e
(6)  \Phi = [\mathrm{diag}(\mathrm{vec}(Z_1)), \dots, \mathrm{diag}(\mathrm{vec}(Z_{N_F}))]
(7)  x = [\mathrm{vec}(X_1)^T, \dots, \mathrm{vec}(X_{N_F})^T]^T

where y = \mathrm{vec}(Y), e = \mathrm{vec}(E), and \mathrm{vec}(\cdot) is standard vectorization.
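The two equivalent forms of the measurement model, the elementwise (Hadamard) sum over coded frames and the matrix form with a projection matrix built from diagonal blocks of vectorized masks, can be checked numerically. The sketch below uses illustrative sizes, not those of the camera:

```python
import numpy as np

rng = np.random.default_rng(0)
Nx, Ny, NF = 4, 4, 3

mask = rng.integers(0, 2, size=(Nx, Ny))                 # one binary mask
Z = np.stack([np.roll(mask, i, axis=0) for i in range(NF)])  # shifted copies
X = rng.random((NF, Nx, Ny))                             # underlying video frames

# Hadamard-sum form: coded snapshot = sum of mask-modulated frames
Y = np.sum(Z * X, axis=0)

# Matrix form: Phi is a row of diagonal blocks, one per shifted mask
Phi = np.hstack([np.diag(Z[i].ravel()) for i in range(NF)])
x = np.concatenate([X[i].ravel() for i in range(NF)])
y = Phi @ x

# the two forms agree
assert np.allclose(y, Y.ravel())
```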
2.3 Mosaicing and Demosaicing of Color Video
We record temporally compressed measurements for RGB colors on a Bayer-filter mosaic, where the three colors are arranged in the pattern shown at the bottom right of Figure 1. The single coded image is partitioned into four components, one for R, one for B, and two for G (each one quarter the size of the original spatial image). The CS recovery (video from a single measurement) is performed separately on these four mosaiced components, prior to demosaicing, as shown in Figure 1(b). One may also jointly perform CS inversion on all four components, with the hope of sharing information on the importance of (here wavelet and DCT) components; this was also done, and the results were very similar to processing R, B, G1 and G2 separately. Note that this is the key difference between color-CACTI and the previous CACTI work in [8].
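The partition into four quarter-size components is straightforward subsampling of the mosaic. A minimal sketch, assuming an RGGB Bayer layout (the actual layout is shown in Figure 1):

```python
import numpy as np

def bayer_split(img):
    """Split a Bayer-mosaic image into its R, G1, G2, B quarter-size planes
    (RGGB layout assumed for illustration)."""
    R  = img[0::2, 0::2]
    G1 = img[0::2, 1::2]
    G2 = img[1::2, 0::2]
    B  = img[1::2, 1::2]
    return R, G1, G2, B

img = np.arange(16).reshape(4, 4)
R, G1, G2, B = bayer_split(img)
# each component is one quarter the size of the full image
assert R.shape == (2, 2)
```

CS inversion is then run on each plane separately, and the four recovered videos are demosaiced back into full-resolution color frames.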
3 Bayesian Compressive Sensing for Video Reconstruction
3.1 3D Tree Structure of Wavelet Coefficients
An image’s zero-tree structure [23] has been investigated thoroughly since the advent of wavelets [24]. The 3D wavelet-tree structure of video, an extension of the 2D image case, has also attracted extensive attention in the literature [25]. Xiong et al. introduced a tree-based representation to characterize the block-DCT transform associated with JPEG [26]. For the video representation, we here use a wavelet basis in space and a DCT in time.
Consider a video sequence of N_F frames with N_x \times N_y spatial pixels, and let (i_x, i_y, i_t) denote the indices of the DCT/wavelet coefficients. Assume there are L levels (scales) of coefficients (L = 3 in Figure 2). The parent-children linkage of the coefficients is as follows: a) a root node, in the scaling (LL) band, has 7 children, one in each finer-orientation subband at the same location; b) an internal node has 8 children, at the doubled indices in the next-finer scale; and c) a leaf node has no children.
When the tree structure is used with a 3D DCT, we consider a fixed 3D DCT block size and define the tree within each block. The parent-children linkage is the same as with the wavelet coefficients [26].
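The internal-node linkage above follows the usual dyadic indexing convention; the exact index arithmetic in the paper is elided by extraction, so the sketch below assumes that convention: a coefficient at (i_x, i_y, i_t) has the eight children obtained by doubling each index and adding 0 or 1.

```python
from itertools import product

def children(ix, iy, it):
    """Children of an internal 3D coefficient under dyadic indexing
    (assumed convention): (2*ix+dx, 2*iy+dy, 2*it+dt), d in {0, 1}^3."""
    return [(2 * ix + dx, 2 * iy + dy, 2 * it + dt)
            for dx, dy, dt in product((0, 1), repeat=3)]

# an internal node has exactly 8 children
assert len(children(1, 2, 3)) == 8
```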
The properties of wavelet coefficients that lead to the Bayesian model derived in the following section are [17]:
1) Large/small values of wavelet coefficients generally persist across the scales of the wavelet tree (the two states of the binary part of the model developed in the following section).
2) Persistence becomes stronger at finer scales (the confidence of the probability of the binary part is proportional to the number of coefficients at that scale).
3) The magnitude of the wavelet coefficients decreases exponentially as we move to finer scales. In this paper, we use a multiplicative gamma prior [27], a typical shrinkage prior, on the nonzero wavelet coefficients at different scales to embed this decay.
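Under the multiplicative gamma prior [27], the precision at scale s is a product of s gamma draws, so the expected precision grows geometrically with scale and coefficient magnitudes shrink at finer scales. A minimal sketch; the shape parameter a > 1 is an illustrative choice, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(1)
a, L, n = 3.0, 5, 20000                 # shape, number of levels, Monte Carlo draws

delta = rng.gamma(a, 1.0, size=(n, L))  # delta_l ~ Gamma(a, 1)
precision = np.cumprod(delta, axis=1)   # alpha_s = prod_{l<=s} delta_l
mean_prec = precision.mean(axis=0)

# E[alpha_s] = a**(s+1), so average precision grows (magnitudes decay) with scale
assert np.all(np.diff(mean_prec) > 0)
```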
3.2 Statistical Bayesian Model
Let W_x \in \mathbb{R}^{N_x \times N_x}, W_y \in \mathbb{R}^{N_y \times N_y}, W_t \in \mathbb{R}^{N_F \times N_F} be orthonormal matrices defining bases such as wavelets or the DCT [24]. Define

(8)  x = (W_t \otimes W_y \otimes W_x) \, \theta
(9)  A = \Phi \, (W_t \otimes W_y \otimes W_x)

where \theta symbolizes the 3D wavelet/DCT coefficients corresponding to x, and \otimes denotes the Kronecker product. It is worth noting here that A is the 3D transform of the projection matrix \Phi. Unlike the models used in [19, 20, 18], where the projection matrix is applied directly to the wavelet/DCT coefficients, in the coding strategy of color-CACTI we obtain the projection matrix from the hardware, by capturing the response of the mask at different positions. Following this, we transform \Phi row-by-row to the wavelet/DCT domain, to obtain A.
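The identity behind the row-by-row transform is simply associativity: with orthonormal W, y = \Phi x = (\Phi W)\theta = A\theta. A numerical check, with random orthonormal matrices standing in for the wavelet/DCT bases and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
Nx, Ny, Nt = 3, 3, 2

def orth(n):
    # random orthonormal basis as a stand-in for a wavelet/DCT matrix
    return np.linalg.qr(rng.standard_normal((n, n)))[0]

Wx, Wy, Wt = orth(Nx), orth(Ny), orth(Nt)
W = np.kron(Wt, np.kron(Wy, Wx))         # 3D basis via Kronecker product

Phi = rng.integers(0, 2, size=(Nx * Ny, Nx * Ny * Nt)).astype(float)
A = Phi @ W                              # row-by-row transform of Phi

theta = rng.standard_normal(Nx * Ny * Nt)
x = W @ theta                            # video from its 3D coefficients

# measuring the video equals measuring its coefficients through A
assert np.allclose(Phi @ x, A @ theta)
```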
The measurement noise is modeled as zero-mean Gaussian with precision matrix (inverse of the covariance matrix) \alpha_0 I_N, where I_N is the identity matrix. We have:

(10)  y \sim \mathcal{N}(A\theta, \alpha_0^{-1} I_N)
To model the sparsity of the 3D wavelet/DCT coefficients, a spike-and-slab prior is imposed on \theta:

(11)  \theta = z \odot w

where w is a vector of non-sparse coefficients and z is a binary vector (zero/one indicators) denoting the two states of the HMT [17], with "zero" signifying the "low state" and "one" symbolizing the "high state". Note that when coefficients lie in the "low state", they are explicitly set to zero, which leads to sparsity.
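The spike-and-slab construction is a pointwise product of the indicator vector and the Gaussian slab; coefficients in the low state are exactly zero, not merely small. A minimal sketch with an illustrative sparsity level:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10

w = rng.standard_normal(N)       # non-sparse (slab) coefficients
z = rng.binomial(1, 0.3, size=N) # binary HMT states (0 = low, 1 = high)
theta = z * w                    # spike-and-slab coefficients

# low-state coefficients are exactly zero
assert np.all(theta[z == 0] == 0)
```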
To model the linkage of the tree structure across the scales of the wavelet/DCT, the binary vector z is drawn from a Bernoulli distribution; the parent-children linkage is manifested by the probability of this vector, a child being more likely to lie in the "high state" when its parent does. The vector w is drawn from a Gaussian distribution, with the precision modeled by a multiplicative gamma prior.
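The parent-dependent Bernoulli probabilities make significant coefficients persist across scales (property 1 above). A sketch of sampling the indicators down a tree, here simplified to one dimension with two children per node and illustrative probabilities:

```python
import numpy as np

rng = np.random.default_rng(4)
pi_high, pi_low, levels = 0.9, 0.05, 6   # illustrative values

z = [np.array([1])]                      # root assumed in the high state
for s in range(1, levels):
    parent = np.repeat(z[-1], 2)         # two children per node in this sketch
    p = np.where(parent == 1, pi_high, pi_low)
    z.append(rng.binomial(1, p))         # child state depends on parent state

assert z[-1].shape == (2 ** (levels - 1),)
```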
The full Bayesian model is:

(12)  y \,|\, \theta, \alpha_0 \sim \mathcal{N}(A\theta, \alpha_0^{-1} I_N)
(13)  \theta_{s,i} = z_{s,i} \, w_{s,i}
(14)  w_{s,i} \sim \mathcal{N}(0, \alpha_s^{-1})
(15)  z_{s,i} \sim \mathrm{Bernoulli}(\pi_{s,i})
(16)  \pi_{s,i} = \pi_0 \ \text{if } s = 0; \quad \pi_s^{h} \ \text{if } z_{\mathrm{pa}(s,i)} = 1; \quad \pi_s^{l} \ \text{if } z_{\mathrm{pa}(s,i)} = 0

where \theta_{s,i} denotes the ith component at level s, \mathrm{pa}(s,i) indexes its parent, and s = 0 denotes the scaling coefficients of the wavelet (or the DC level of a DCT).

(17)  \pi_0 \sim \mathrm{Beta}(e_0, f_0)
(18)  \pi_s^{h} \sim \mathrm{Beta}(e_s^{h}, f_s^{h})
(19)  \pi_s^{l} \sim \mathrm{Beta}(e_s^{l}, f_s^{l})
(20)  \alpha_s = \prod_{l=1}^{s} \delta_l
(21)  \delta_l \sim \mathrm{Gamma}(a_0, 1)
(22)  \alpha_0 \sim \mathrm{Gamma}(c_0, d_0)
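The hierarchy described above (Gaussian likelihood, spike-and-slab coefficients, parent-linked Bernoulli states, multiplicative-gamma precisions) can be sampled generatively as a sanity check. The toy sketch below flattens the tree into per-level groups; all sizes and hyperparameter values are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, L = 8, 4, 3                        # coefficients, measurements, levels
level = np.repeat(np.arange(L), N // L + 1)[:N]   # level index of each coefficient

alpha0 = rng.gamma(1.0, 1.0)             # noise precision
pi = rng.beta(1.0, 1.0, size=L)          # per-level "high state" probability
delta = rng.gamma(2.0, 1.0, size=L)
alpha = np.cumprod(delta)                # multiplicative-gamma precisions

z = rng.binomial(1, pi[level])           # binary states
w = rng.normal(0.0, 1.0 / np.sqrt(alpha[level]))  # slab coefficients
theta = z * w                            # spike-and-slab coefficients

A = rng.standard_normal((M, N))          # stand-in projection matrix
y = A @ theta + rng.normal(0.0, 1.0 / np.sqrt(alpha0), size=M)
assert y.shape == (M,)
```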
In the experiments, we use the following settings:
(23)  
(24)  
(25)  
(26) 
where n_s is the number of coefficients at the sth level, and N is the length of \theta.
3.3 Inference
4 Experimental Results
Both simulated and real datasets are adopted to verify the performance of the proposed model for video reconstruction. The hyperparameters are set as in [19, 20]. Best results are found when W_x and W_y are wavelets (here the Daubechies-8 [24]) and W_t corresponds to a DCT. The proposed tree-structure Bayesian CS inversion algorithm is compared with the following algorithms: i) the generalized alternating projection (GAP) algorithm [14, 13]; ii) two-step iterative shrinkage/thresholding (TwIST) [28] (with a total-variation norm); iii) K-SVD [29], with orthogonal matching pursuit (OMP) [30] used for inversion; iv) a Gaussian mixture model (GMM) based inversion algorithm [12]; and v) the linearized Bregman algorithm [31]. The sparsity of the DCT or wavelet coefficients is exploited in linearized Bregman and GAP, with the same transformation as in the proposed model. GMM and K-SVD are patch-based algorithms, and we used a separate dataset for training purposes: a batch of training videos was used to pre-train K-SVD and GMM, and we selected the best reconstruction results for presentation here.

4.1 Simulation Datasets
We consider a scene in which a basketball player performs a dunk; this video is challenging due to the complicated motion of the players and the varying lighting conditions; see the example video frames in Figure 1(a). We consider a binary mask, with 1/0 coding drawn at random from Bernoulli(0.5); the code is shifted spatially via the coding mechanism in Figure 1(a), as in our physical camera. It can be seen clearly that the proposed tree-structure Bayesian CS algorithm demonstrates improved PSNR performance for the inversion.
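The comparison above uses PSNR between reconstructed and ground-truth frames. For reference, a minimal implementation, assuming frames normalized to [0, 1]:

```python
import numpy as np

def psnr(ref, rec, peak=1.0):
    """Peak signal-to-noise ratio in dB between a reference frame and a
    reconstruction, with pixel values assumed in [0, peak]."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(rec, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# uniform error of 0.1 on a [0, 1] scale gives MSE = 0.01, i.e. 20 dB
ref = np.zeros((4, 4))
rec = np.full((4, 4), 0.1)
assert abs(psnr(ref, rec) - 20.0) < 1e-6
```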
4.2 Real Datasets
We test our algorithm using real datasets captured by our color-CACTI camera, with selected results shown in Figures 4-5. Figure 4 shows low-frame-rate (captured at 30 fps) compressive measurements of fruit falling and rebounding, and the corresponding high-frame-rate reconstructed video sequences. On the left are shown four contiguous measurements, and on the right the 22 frames reconstructed per measurement. Note the spin of the red apple and the rebound of the orange in the reconstructed frames. Figure 5 shows a purple hammer hitting a red apple, with 3 contiguous measurements; the hitting process is clearly visible in the reconstructed frames.
5 Conclusion
We have implemented a color video CS camera, color-CACTI, capable of compressively capturing videos at a low frame rate and reconstructing them at a high frame rate. A tree-structure Bayesian compressive sensing framework is developed for the video CS inversion by exploiting the 3D tree structure of the wavelet/DCT coefficients. Both simulated and real datasets demonstrate the efficacy of the proposed model.
References
 [1] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.

[2] A. C. Sankaranarayanan, P. K. Turaga, R. G. Baraniuk, and R. Chellappa, “Compressive acquisition of dynamic scenes,” 11th European Conference on Computer Vision, Part I, pp. 129–142, September 2010.
 [3] A. C. Sankaranarayanan, C. Studer, and R. G. Baraniuk, “CS-MUVI: Video compressive sensing for spatial-multiplexing cameras,” IEEE International Conference on Computational Photography, pp. 1–10, April 2012.
 [4] M. B. Wakin, J. N. Laska, M. F. Duarte, D. Baron, S. Sarvotham, D. Takhar, K. F. Kelly, and R. G. Baraniuk, “Compressive imaging for video representation and coding,” Proceedings of the Picture Coding Symposium, pp. 1–6, April 2006.
 [5] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, “Single-pixel imaging via compressive sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 83–91, 2008.

[6] D. Kittle, K. Choi, A. Wagadarikar, and D. J. Brady, “Multiframe image estimation for coded aperture snapshot spectral imagers,” Applied Optics, vol. 49, no. 36, pp. 6824–6833, December 2010.
 [7] A. Veeraraghavan, D. Reddy, and R. Raskar, “Coded strobing photography: Compressive sensing of high speed periodic videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 4, pp. 671–686, April 2011.
 [8] P. Llull, X. Liao, X. Yuan, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady, “Coded aperture compressive temporal imaging,” Optics Express, vol. 21, no. 9, pp. 10526–10545, 2013.
 [9] Y. Hitomi, J. Gu, M. Gupta, T. Mitsunaga, and S. K. Nayar, “Video from a single coded exposure photograph using a learned overcomplete dictionary,” IEEE International Conference on Computer Vision (ICCV), pp. 287–294, November 2011.

[10] D. Reddy, A. Veeraraghavan, and R. Chellappa, “P2C2: Programmable pixel compressive camera for high speed imaging,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 329–336, June 2011.
 [11] X. Yuan, P. Llull, X. Liao, J. Yang, G. Sapiro, D. J. Brady, and L. Carin, “Low-cost compressive sensing for color video and depth,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
 [12] J. Yang, X. Yuan, X. Liao, P. Llull, G. Sapiro, D. J. Brady, and L. Carin, “Video compressive sensing using Gaussian mixture models,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4863–4878, 2014.
 [13] X. Yuan, J. Yang, P. Llull, X. Liao, G. Sapiro, D. J. Brady, and L. Carin, “Adaptive temporal compressive sensing for video,” International Conference on Image Processing, 2013.
 [14] X. Liao, H. Li, and L. Carin, “Generalized alternating projection for weighted ℓ_{2,1} minimization with applications to model-based compressive sensing,” SIAM Journal on Imaging Sciences, vol. 7, no. 2, pp. 797–823, 2014.
 [15] “AVT Guppy PRO, technical manual,” Allied Vision Technologies, v 4.0, July 2012.
 [16] S. Ji, Y. Xue, and L. Carin, “Bayesian compressive sensing,” IEEE Transactions on Signal Processing, vol. 56, no. 6, pp. 2346–2356, June 2008.

[17] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk, “Wavelet-based statistical signal processing using hidden Markov models,” IEEE Transactions on Signal Processing, vol. 46, no. 4, pp. 886–902, April 1998.
 [18] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, “Model-based compressive sensing,” IEEE Transactions on Information Theory, vol. 56, no. 4, pp. 1982–2001, April 2010.
 [19] L. He and L. Carin, “Exploiting structure in wavelet-based Bayesian compressive sensing,” IEEE Transactions on Signal Processing, vol. 57, no. 9, pp. 3488–3497, September 2009.
 [20] L. He, H. Chen, and L. Carin, “Tree-structured compressive sensing with variational Bayesian analysis,” IEEE Signal Processing Letters, vol. 17, no. 3, pp. 233–236, 2010.
 [21] S. Som and P. Schniter, “Compressive imaging using approximate message passing and a Markov-tree prior,” IEEE Transactions on Signal Processing, vol. 60, no. 7, pp. 3439–3448, 2012.
 [22] Z. Song and A. Dogandžić, “A max-product EM algorithm for reconstructing Markov-tree sparse signals from compressive samples,” IEEE Transactions on Signal Processing, 2013.
 [23] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3445–3462, December 1993.
 [24] S. Mallat, A wavelet tour of signal processing: The sparse way, Academic Press, 2008.
 [25] Y. Chen and W. A. Pearlman, “Three-dimensional subband coding of video using the zero-tree method,” SPIE Symposium on Visual Communications and Image Processing, vol. 46, pp. 1302–1309, 1996.
 [26] Z. Xiong, O. G. Guleryuz, and M. T. Orchard, “A DCT-based embedded image coder,” IEEE Signal Processing Letters, vol. 3, no. 11, pp. 289–290, November 1996.
 [27] A. Bhattacharya and D. B. Dunson, “Sparse bayesian infinite factor models,” Biometrika, vol. 98, no. 2, pp. 291–306, 2011.
 [28] J.M. BioucasDias and M.A.T. Figueiredo, “A new TwIST: Twostep iterative shrinkage/thresholding algorithms for image restoration,” IEEE Transactions on Image Processing, vol. 16, no. 12, pp. 2992–3004, December 2007.
 [29] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, November 2006.
 [30] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, October 2004.
 [31] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, “Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing,” SIAM Journal on Imaging Sciences, vol. 1, no. 1, pp. 143–168, 2008.