Tree-Structure Bayesian Compressive Sensing for Video

10/12/2014 ∙ by Xin Yuan, et al. ∙ Duke University

A Bayesian compressive sensing framework is developed for video reconstruction based on the color coded aperture compressive temporal imaging (CACTI) system. By exploiting the three-dimensional (3D) tree structure of the wavelet and discrete cosine transform (DCT) coefficients, a Bayesian compressive sensing inversion algorithm is derived to reconstruct (up to 22) color video frames from a single monochromatic compressive measurement. Both simulated and real datasets are used to verify the performance of the proposed algorithm.




1 Introduction

The mathematical theory of compressive sensing (CS) [1] asserts that one can acquire signals from measurements whose rate is much lower than the total bandwidth. Whereas the CS theory is now well developed, challenges concerning hardware implementations [2, 3, 4, 5, 6, 7] of CS-based acquisition devices, especially in optics, have only begun to be addressed. This paper introduces a color video CS camera that captures low-frame-rate measurements at acquisition, with high-frame-rate video recovered subsequently via computation (decompression of the measured data).

The Coded Aperture Compressive Temporal Imaging (CACTI) [8] system uses a moving binary mask pattern to modulate a video sequence many times within the integration time, prior to integration by the detector. The number of high-speed frames recovered from a coded-exposure measurement depends on the speed of video modulation. Within the CACTI framework, modulating the video times per second corresponds to moving the mask pixels within the integration time . If frames are to be recovered per compressive measurement by a camera collecting data at frames-per-second (fps), the time variation of the code is required to be fps. The liquid-crystal-on-silicon (LCoS) modulator used in [9, 10] can modulate as fast as fps by pre-storing the exposure codes, but, because the coding pattern is continuously changed at each pixel throughout the exposure, it consumes considerable energy (). The mechanical modulator in [8], by contrast, modulates the exposure through periodic mechanical translation of a single mask (coded aperture), using a piezoelectronic translator that consumes minimal energy (). CACTI [8] has now been extended to color video [11], capturing the “R”, “G” and “B” channels of the scene. With appropriate reconstruction algorithms [12, 13, 14], multiple color video frames can be recovered from a single gray-scale measurement.

Figure 1: (a) First row shows color (RGB) frames of the original high-speed video; second row shows each color frame rearranged into a Bayer-filter mosaic; third row depicts the (horizontally) moving mask used to modulate the high-speed frames (black is zero, white is one), with fast translation provided by a piezoelectronic translator. Fourth row shows the modulated frames, whose sum gives a single coded-exposure photo. (b) Recovered RGB frames arranged into a Bayer-filter mosaic (second row), which is de-mosaicked to give the color frames (first row). (c) The de-mosaicing process [15].

While numerous algorithms have been used for CS inversion, the Bayesian CS algorithm [16] has the significant advantage of providing a full posterior distribution. This paper develops a new Bayesian inversion algorithm to reconstruct videos from raw measurements acquired by the color-CACTI camera. By exploiting the hybrid three-dimensional (3D) tree structure of the wavelet and DCT (discrete cosine transform) coefficients, we develop a hidden Markov tree (HMT) [17] model within a Bayesian framework. Research in [18, 19, 20, 21, 22] has shown that by employing the HMT structure of an image, the number of CS measurements can be reduced. This paper extends the HMT to 3D, and a 3D tree structure is developed for video CS, with color-CACTI shown as an example. Experimental results with both simulated and real datasets verify the performance of the proposed algorithm. The basic model and inversion method may be applied to any of the compressive video cameras discussed above.

2 Color-CACTI

2.1 Coding Scenario

Let be the continuous/analog spatiotemporal volume of the video being measured; represents a moving mask (code) with denoting its spatial translation at time ; and denotes the camera spatial sampling function, with spatial resolution . The coded aperture compressive camera system modulates each temporal segment of duration with the moving mask (the motion is periodic with the period equal to ), and collapses (sums) the coded video into a single photograph ( ):


and , with the detector size pixels. The set of data , which below we represent as , corresponds to the th compressive measurement. The code/mask is here binary, corresponding to photon transmission and blocking (see Figure 1).

2.2 Measurement Model

Denote , defining the original continuous video sampled in space and in time ( discrete temporal frames, , within the time window of the th compressive measurement). We also define


We can rewrite (1) as


where is an added noise term, , and denotes element-wise multiplication (Hadamard product). In (3), denotes the mask/code at the th shift position (approximately discretized in time), and is the underlying video, for video frame within CS measurement . Dropping subscript for simplicity, (3) can be written as


where and

is standard vectorization.
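The coded-exposure measurement described above (each frame modulated element-wise by a shifted mask, then summed over time, plus noise) can be sketched numerically. This is a minimal illustration rather than the authors' implementation: the sizes, the circular shift via `np.roll`, the noise level, and the names `masks`, `frames`, and `Phi` are all assumptions. The matrix form, a row of diagonal blocks built from the vectorized masks, is one standard way to write such a model and is assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, n_frames = 8, 8, 4  # illustrative sizes (assumptions)

# One binary mask, translated horizontally by one pixel per frame.
# np.roll wraps around; a physical mask slides instead (simplification).
base_mask = rng.integers(0, 2, size=(nx, ny)).astype(float)
masks = np.stack(
    [np.roll(base_mask, k, axis=1) for k in range(n_frames)], axis=2
)

# High-speed frames to be compressed (random stand-in data).
frames = rng.random((nx, ny, n_frames))

# Coded exposure: Hadamard product per frame, summed over time, plus noise.
noise = 0.01 * rng.standard_normal((nx, ny))
Y = np.sum(masks * frames, axis=2) + noise

# Equivalent vectorized form y = Phi x + noise.
# Phi is a row of diagonal blocks, one block per mask position.
Phi = np.hstack(
    [np.diag(masks[:, :, k].flatten(order="F")) for k in range(n_frames)]
)
x = frames.flatten(order="F")  # stacks the vectorized frames in time order
y = Phi @ x + noise.flatten(order="F")
```

In this sketch `Phi` has `nx * ny` rows and `nx * ny * n_frames` columns, so recovering `x` from `y` is underdetermined by a factor equal to the number of frames, which is why a sparsity-promoting prior is needed for inversion.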

2.3 Mosaicing and De-mosaicing of Color Video

We record temporally compressed measurements for RGB colors on a Bayer-filter mosaic, where the three colors are arranged in the pattern shown at the bottom right of Figure 1. The single coded image is partitioned into four components, one each for R and B and two for G (each one quarter the size of the original spatial image). The CS recovery (video from a single measurement) is performed separately on these four mosaiced components, prior to demosaicing as shown in Figure 1(b). One may also jointly perform CS inversion on all four components, with the hope of sharing information on the importance of (here wavelet and DCT) components; this was also done, and the results were very similar to processing R, B, G1 and G2 separately. Note that this is the key difference between color-CACTI and the previous CACTI work in [8].
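The partition of the coded image into four mosaiced components can be illustrated with array slicing. The sketch below assumes an RGGB layout (R at even rows and even columns); the actual layout depends on the sensor, and the function names `split_bayer` and `merge_bayer` are hypothetical.

```python
import numpy as np

def split_bayer(mosaic):
    """Split a Bayer mosaic into four planes (RGGB layout assumed)."""
    return {
        "R": mosaic[0::2, 0::2],
        "G1": mosaic[0::2, 1::2],
        "G2": mosaic[1::2, 0::2],
        "B": mosaic[1::2, 1::2],
    }

def merge_bayer(planes):
    """Inverse of split_bayer: interleave the four planes into one mosaic."""
    h, w = planes["R"].shape
    mosaic = np.empty((2 * h, 2 * w), dtype=planes["R"].dtype)
    mosaic[0::2, 0::2] = planes["R"]
    mosaic[0::2, 1::2] = planes["G1"]
    mosaic[1::2, 0::2] = planes["G2"]
    mosaic[1::2, 1::2] = planes["B"]
    return mosaic

# Stand-in for a single coded measurement on the Bayer sensor.
coded = np.arange(36, dtype=float).reshape(6, 6)
planes = split_bayer(coded)
restored = merge_bayer(planes)
```

Each plane would be inverted independently (one video per plane), and the recovered planes re-interleaved frame by frame before demosaicing.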

3 Bayesian Compressive Sensing for Video Reconstruction

3.1 3D Tree Structure of Wavelet Coefficients

An image’s zero-tree structure [23] has been investigated thoroughly since the advent of wavelets [24]. The 3D wavelet tree structure of video, an extension of the 2D image case, has also attracted extensive attention in the literature [25]. Xiong et al. introduced a tree-based representation to characterize the block-DCT transform associated with JPEG [26]. For the video representation, we use wavelets in space and the DCT in time.

Figure 2: 3D tree structure of wavelet coefficients.

Consider a video sequence with frames of spatial pixels, and let denote the indices of the DCT/wavelet coefficients. Assume there are levels (scales) of the coefficients ( in Figure 2). The parent-children linkage of the coefficients is as follows: a) a root node has 7 children, , where denotes the size of the scaling (LL) coefficients; b) an internal node has 8 children ; and c) a leaf node has no children.

When the tree structure is used with the 3D DCT, we take the block size of the 3D DCT to be , and . The parent-children linkage is the same as for the wavelet coefficients [26].
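The parent-children linkage (7 children for a root node, 8 for an internal node, none for a leaf) can be made concrete with a small index function. The explicit index formulas are missing from this copy, so the sketch below assumes the usual dyadic pyramid layout with 0-based indices, as in 3D zero-tree coding; the function name `children` and its arguments are hypothetical.

```python
from itertools import product

def children(node, root_size, full_size):
    """Children of a 3D coefficient index under an assumed dyadic layout.

    node:      (i, j, k), 0-based index into the coefficient array
    root_size: shape of the scaling (LL) block
    full_size: shape of the whole coefficient array
    """
    i, j, k = node
    s1, s2, s3 = root_size
    offsets = list(product((0, 1), repeat=3))
    if i < s1 and j < s2 and k < s3:
        # Root node: one child in each of the 7 coarsest detail subbands.
        return [
            (i + a * s1, j + b * s2, k + c * s3)
            for a, b, c in offsets
            if (a, b, c) != (0, 0, 0)
        ]
    if 2 * i >= full_size[0] or 2 * j >= full_size[1] or 2 * k >= full_size[2]:
        return []  # leaf node at the finest scale
    # Internal node: a 2 x 2 x 2 block at the next finer scale.
    return [(2 * i + a, 2 * j + b, 2 * k + c) for a, b, c in offsets]

root_kids = children((0, 0, 0), (2, 2, 2), (8, 8, 8))
inner_kids = children((2, 0, 0), (2, 2, 2), (8, 8, 8))
leaf_kids = children((4, 0, 0), (2, 2, 2), (8, 8, 8))
```

With a 2 x 2 x 2 scaling block inside an 8 x 8 x 8 array, every non-root coefficient is reached exactly once by following these links, which is what lets a per-coefficient state depend on a unique parent.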

The properties of wavelet coefficients that lead to the Bayesian model derived in the following section are [17]:
1) Large/small values of wavelet coefficients generally persist across the scales of the wavelet tree (the two states of the binary part of the model developed in the following section).

2) Persistence becomes stronger at finer scales (the confidence of the probability of the binary part is proportional to the number of coefficients at that scale).

3) The magnitude of the wavelet coefficients decreases exponentially as we move to the finer scales. In this paper, we use a multiplicative gamma prior [27], a typical shrinkage prior, for the non-zero wavelet coefficients at different scales to embed this decay.

3.2 Statistical Bayesian Model

Let , , be orthonormal matrices defining bases such as wavelets or the DCT [24]. Define


where symbolizes the 3D wavelet/DCT coefficients corresponding to and , and denotes the Kronecker product. It is worth noting that is the 3D transform of the projection matrix . Unlike the model used in [19, 20, 18], where the projection matrix is applied directly to the wavelet/DCT coefficients, in the coding strategy of color-CACTI we obtain the projection matrix from the hardware by capturing the response of the mask at different positions. Following this, we transform row-by-row to the wavelet/DCT domain, to obtain .
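The role of the Kronecker product and the row-by-row transform of the projection matrix can be checked numerically. This sketch makes several assumptions not taken from the paper: an orthonormal DCT is used along all three axes for brevity (the paper uses wavelets in space), the Kronecker factors are ordered to match column-major vectorization, and all names (`dct_matrix`, `Psi`, `Phi_tilde`) are illustrative.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

rng = np.random.default_rng(0)
nx, ny, nt = 4, 4, 3  # illustrative sizes
P1, P2, P3 = dct_matrix(nx), dct_matrix(ny), dct_matrix(nt)

# Separable 3D transform written as a single matrix acting on vec(X).
Psi = np.kron(P3, np.kron(P2, P1))

X = rng.random((nx, ny, nt))
x = X.flatten(order="F")
theta = Psi @ x  # 3D transform coefficients of the video

# Projection matrix from binary masks (random stand-ins here).
masks = rng.integers(0, 2, size=(nx, ny, nt)).astype(float)
Phi = np.hstack(
    [np.diag(masks[:, :, k].flatten(order="F")) for k in range(nt)]
)

# Transform each row of Phi into the coefficient domain.
Phi_tilde = Phi @ Psi.T
```

Because `Psi` is orthonormal, `Phi @ x` equals `Phi_tilde @ theta`, so the inversion can be carried out entirely on the coefficients, where the tree-structured sparsity prior applies.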

The measurement noise is modeled as zero-mean Gaussian with precision matrix (inverse of the covariance matrix) , where is the identity matrix. We have:


To model the sparsity of the 3D coefficients of wavelet/DCT, the spike-and-slab prior is imposed on as:


where is a vector of non-sparse coefficients and is a binary vector (zero/one indicators) denoting the two states of the HMT [17], with “zero” signifying the “low-state” in the HMT and “one” symbolizing the “high-state”. Note that when the coefficients lie in the “low-state”, they are explicitly set to zero, which leads to the sparsity.

To model the linkage of the tree structure across the scales of the wavelet/DCT, we use the binary vector, , which is drawn from a Bernoulli distribution. The parent-children linkage is manifested by the probability of this vector. We model as drawn from a Gaussian distribution with the precision modeled as a multiplicative Gamma prior.

The full Bayesian model is:


where denotes the th component at level , and denotes the scaling coefficients of wavelet (or DC level of a DCT).
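To make the structure of the prior concrete, the following sketch draws one sample from a simplified version of it: binary states whose probability depends on the parent state, Gaussian weights whose per-level precision is a running product of gamma variables, and coefficients formed as the element-wise product of the two. The full model equations and hyperparameter settings are missing from this copy, so every numeric value below is illustrative, the state probabilities are fixed constants rather than random variables with their own priors, and the flat "child index // 8" parent map is an assumption standing in for the 3D index linkage.

```python
import numpy as np

rng = np.random.default_rng(1)

n_levels = 3
n_root = 4  # coefficients at the coarsest detail level (assumption)
pi_root, pi_high, pi_low = 0.9, 0.6, 0.02  # illustrative, not the paper's
a_delta = 3.0  # gamma shape for the multiplicative factors (illustrative)

# Multiplicative gamma prior: the precision at level s is a running
# product, so precisions tend to grow (weights shrink) at finer scales.
delta = rng.gamma(shape=a_delta, scale=1.0, size=n_levels)
tau = np.cumprod(delta)

z_levels, theta_levels = [], []
for s in range(n_levels):
    n_s = n_root * 8 ** s  # each node has 8 children in the 3D tree
    if s == 0:
        prob = np.full(n_s, pi_root)
    else:
        # State of each coefficient's parent at the previous level.
        parent = z_levels[s - 1][np.arange(n_s) // 8]
        prob = np.where(parent == 1, pi_high, pi_low)
    z = rng.binomial(1, prob)  # Bernoulli state per coefficient
    w = rng.normal(0.0, 1.0 / np.sqrt(tau[s]), size=n_s)
    z_levels.append(z)
    theta_levels.append(w * z)  # spike-and-slab: exact zeros in the low state
```

Setting `pi_low` close to zero makes a zero parent almost always produce zero children, which is the persistence-across-scales property the tree prior is meant to capture.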


In the experiments, we use the following settings:


where is the number of coefficients at th level, and is the length of .

3.3 Inferences

We use variational Bayesian methods to infer the parameters in the model, as in [20]. The posterior inference of is thus different from that of the model in [20], and we show it below:


where denotes the expectation in .

4 Experimental Results

Both simulated and real datasets are adopted to verify the performance of the proposed model for video reconstruction. The hyperparameters are set as ; the same as used in [19, 20]. Best results are found when and are wavelets (here the Daubechies-8 [24]) and corresponds to a DCT. The proposed tree-structure Bayesian CS inversion algorithm is compared with the following algorithms: 1) the generalized alternating projection (GAP) algorithm [14, 13]; 2) two-step iterative shrinkage/thresholding (TwIST) [28] (with total variation norm); 3) K-SVD [29] with orthogonal matching pursuit (OMP) [30] used for inversion; 4) a Gaussian mixture model (GMM) based inversion algorithm [12]; and 5) the linearized Bregman algorithm [31]. The -norm of DCT or wavelet coefficients is adopted in linearized Bregman and GAP with the same transformation as the proposed model. GMM and K-SVD are patch-based algorithms, and we used a separate dataset for training. A batch of training videos was used to pre-train K-SVD and GMM, and we selected the best reconstruction results for presentation here.

4.1 Simulation Datasets

We consider a scene in which a basketball player performs a dunk; this video is challenging due to the complicated motion of the basketball players and the varying lighting conditions; see the example video frames in Figure 1(a). We consider a binary mask, with 1/0 coding drawn at random from Bernoulli(0.5); the code is shifted spatially via the coding mechanism in Figure 1(a), as in our physical camera. The video frames are spatially, and we choose . Figure 3 shows that the proposed tree-structure Bayesian CS algorithm achieves improved PSNR performance for the inversion.

Figure 3: PSNR comparison of proposed tree-structure Bayesian CS inversion method, GAP, TwIST, linearized Bregman, K-SVD, and GMM algorithms with the simulated dataset.
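For reference, the PSNR metric used in such comparisons can be computed as follows. This is the standard definition, 10 log10(peak^2 / MSE); the function name `psnr` and the assumption that pixel values lie in [0, 1] (so `peak=1.0`) are illustrative choices, not details taken from the paper.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy check: a constant error of 0.1 gives an MSE of 0.01, about 20 dB.
truth = np.zeros((4, 4))
recon = np.full((4, 4), 0.1)
value = psnr(truth, recon)
```

For video, the metric is typically evaluated per reconstructed frame so that methods can be compared frame by frame.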

4.2 Real Datasets

We test our algorithm using real datasets captured by our color-CACTI camera, with selected results shown in Figures 4-5. Figure 4 shows low-framerate (captured at 30 fps) compressive measurements of fruit falling and rebounding, together with the corresponding high-framerate reconstructed video sequences. On the left are four contiguous measurements, and on the right are the 22 frames reconstructed per measurement. Note the spin of the red apple and the rebound of the orange in the reconstructed frames. Figure 5 shows a purple hammer hitting a red apple, captured in 3 contiguous measurements. The hitting process is clearly visible in the reconstructed frames.

Figure 4: Reconstruction results of real dataset with “plastic fruits” (a yellow orange and a red apple) falling and rebounding.
Figure 5: Reconstruction results of a hammer hitting an apple.

5 Conclusion

We have implemented a color video CS camera, color-CACTI, capable of compressively capturing and reconstructing videos at low and high frame rates, respectively. A tree-structure Bayesian compressive sensing framework is developed for the video CS inversion by exploiting the 3D tree structure of the wavelet/DCT coefficients. Both simulated and real datasets demonstrate the efficacy of the proposed model.


  • [1] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
  • [2] A. C. Sankaranarayanan, P. K. Turaga, R. G. Baraniuk, and R. Chellappa, “Compressive acquisition of dynamic scenes,” 11th European Conference on Computer Vision, Part I, pp. 129–142, September 2010.
  • [3] A. C. Sankaranarayanan, C. Studer, and R. G. Baraniuk, “CS-MUVI: Video compressive sensing for spatial-multiplexing cameras,” IEEE International Conference on Computational Photography, pp. 1–10, April 2012.
  • [4] M. B. Wakin, J. N. Laska, M. F. Duarte, D. Baron, S. Sarvotham, D. Takhar, K. F. Kelly, and R. G. Baraniuk, “Compressive imaging for video representation and coding,” Proceedings of the Picture Coding Symposium, pp. 1–6, April 2006.
  • [5] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, K. F. Kelly, T. Sun, and R. G. Baraniuk, “Single pixel imaging via compressive sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 83–91, 2008.
  • [6] D. Kittle, K. Choi, A. Wagadarikar, and D. J. Brady, “Multiframe image estimation for coded aperture snapshot spectral imagers,” Applied Optics, vol. 49, no. 36, pp. 6824–6833, December 2010.
  • [7] A. Veeraraghavan, D. Reddy, and R. Raskar, “Coded strobing photography: Compressive sensing of high speed periodic videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 4, pp. 671–686, April 2011.
  • [8] P. Llull, X. Liao, X. Yuan, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady, “Coded aperture compressive temporal imaging,” Optics Express, vol. 21, no. 9, pp. 10526–10545, 2013.
  • [9] Y. Hitomi, J. Gu, M. Gupta, T. Mitsunaga, and S. K. Nayar, “Video from a single coded exposure photograph using a learned over-complete dictionary,” IEEE International Conference on Computer Vision (ICCV), pp. 287–294, November 2011.
  • [10] D. Reddy, A. Veeraraghavan, and R. Chellappa, “P2C2: Programmable pixel compressive camera for high speed imaging,” IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 329–336, June 2011.
  • [11] X. Yuan, P. Llull, X. Liao, J. Yang, G. Sapiro, D. J. Brady, and L. Carin, “Low-cost compressive sensing for color video and depth,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [12] J. Yang, X. Yuan, X. Liao, P. Llull, G. Sapiro, D. J. Brady, and L. Carin, “Video compressive sensing using Gaussian mixture models,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4863–4878, 2014.
  • [13] X. Yuan, J. Yang, P. Llull, X. Liao, G. Sapiro, D. J. Brady, and L. Carin, “Adaptive temporal compressive sensing for video,” International Conference on Image Processing, 2013.
  • [14] X. Liao, H. Li, and L. Carin, “Generalized alternating projection for weighted-ℓ2,1 minimization with applications to model-based compressive sensing,” SIAM Journal on Imaging Sciences, vol. 7, no. 2, pp. 797–823, 2014.
  • [15] “AVT Guppy PRO, technical manual,” Allied Vision Technologies, v 4.0, July 2012.
  • [16] S. Ji, Y. Xue, and L. Carin, “Bayesian compressive sensing,” IEEE Transactions on Signal Processing, vol. 56, no. 6, pp. 2346–2356, June 2008.
  • [17] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk, “Wavelet-based statistical signal processing using hidden Markov models,” IEEE Transactions on Signal Processing, vol. 46, no. 4, pp. 886–902, April 1998.
  • [18] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, “Model-based compressive sensing,” IEEE Transactions on Information Theory, vol. 56, no. 4, pp. 1982–2001, April 2010.
  • [19] L. He and L. Carin, “Exploiting structure in wavelet-based Bayesian compressive sensing,” IEEE Transactions on Signal Processing, vol. 57, no. 9, pp. 3488–3497, September 2009.
  • [20] L. He, H. Chen, and L. Carin, “Tree-structured compressive sensing with variational Bayesian analysis,” IEEE Signal Processing Letters, vol. 17, no. 3, pp. 233–236, 2010.
  • [21] S. Som and P. Schniter, “Compressive imaging using approximate message passing and a Markov-tree prior,” IEEE Transactions on Signal Processing, pp. 3439–3448, 2012.
  • [22] Z. Song and A. Dogandžić, “A max-product EM algorithm for reconstructing Markov-tree sparse signals from compressive samples,” IEEE Transactions on Signal Processing, 2013.
  • [23] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3445–3462, December 1993.
  • [24] S. Mallat, A wavelet tour of signal processing: The sparse way, Academic Press, 2008.
  • [25] Y. Chen and W.A. Pearlman, “Three-dimensional subband coding of video using the zero-tree method,” SPIE Symposium on Visual Communications and Imaging Processing, vol. 46, pp. 1302–1309, 1996.
  • [26] Z. Xiong, O. G. Guleryuz, and M. T. Orchard, “A DCT-based embedded image coder,” IEEE Signal Processing Letters, vol. 3, no. 11, pp. 289–290, November 1996.
  • [27] A. Bhattacharya and D. B. Dunson, “Sparse Bayesian infinite factor models,” Biometrika, vol. 98, no. 2, pp. 291–306, 2011.
  • [28] J.M. Bioucas-Dias and M.A.T. Figueiredo, “A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Transactions on Image Processing, vol. 16, no. 12, pp. 2992–3004, December 2007.
  • [29] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, November 2006.
  • [30] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, October 2004.
  • [31] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, “Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing,” SIAM Journal on Imaging Sciences, vol. 1, no. 1, pp. 143–168, 2008.