Deep Fully-Connected Networks for Video Compressive Sensing

03/16/2016 ∙ by Michael Iliadis, et al. ∙ Northwestern University

In this work we present a deep learning framework for video compressive sensing. The proposed formulation enables recovery of video frames in a few seconds at significantly improved reconstruction quality compared to previous approaches. Our investigation starts by learning a linear mapping between video sequences and corresponding measured frames which turns out to provide promising results. We then extend the linear formulation to deep fully-connected networks and explore the performance gains using deeper architectures. Our analysis is always driven by the applicability of the proposed framework on existing compressive video architectures. Extensive simulations on several video sequences document the superiority of our approach both quantitatively and qualitatively. Finally, our analysis offers insights into understanding how dataset sizes and number of layers affect reconstruction performance while raising a few points for future investigation. Code is available at Github: https://github.com/miliadis/DeepVideoCS


1 Introduction

The subdivision of time by motion picture cameras, the frame rate, limits the temporal resolution of a camera system. Even though frame rate increases beyond a certain point may be imperceptible to the human eye, high speed motion picture capture has long been a goal in the scientific imaging and cinematography communities. Despite the increasing availability of high speed cameras through falling hardware prices, fundamental restrictions still limit the maximum achievable frame rates.

Video compressive sensing (CS) aims at increasing the temporal resolution of a sensor by incorporating additional hardware components into the camera architecture and employing powerful computational techniques for high speed video reconstruction. The additional components operate at higher frame rates than the camera’s native temporal resolution, giving rise to low frame rate multiplexed measurements which can later be decoded to extract the unknown observed high speed video sequence. Besides its use for high speed motion capture Llull2015 , video CS also has applications in coherent imaging (e.g., holography) for tracking high-speed events Wang2017 (e.g., particle tracking, observing moving biological samples). The benefits of video CS are even more pronounced for non-visible light applications where high speed cameras are rarely available or prohibitively expensive (e.g., millimeter-wave imaging, infrared imaging) Babacan2011 ; Chen2015 .

Video CS comes in two incarnations, namely, spatial CS and temporal CS. Spatial video CS architectures stem from the well-known single-pixel camera Duarte2008 , which performs spatial multiplexing per measurement, and enable video recovery by expediting the capturing process. They either employ fast readout circuitry to capture information at video rates Chen2014 or parallelize the single-pixel architecture using multiple sensors, each one responsible for sampling a separate spatial area of the scene Chen2015 ; Wang2015 .

In this work, we focus on temporal CS where multiplexing occurs across the time dimension. Figure 1 depicts this process, where a spatio-temporal volume of size $W \times H \times t$ is modulated by $t$ binary random masks during the exposure time of a single capture, giving rise to a coded frame of size $W \times H$.

We denote the vectorized versions of the unknown spatio-temporal volume and the captured coded frame as $\mathbf{x}$ and $\mathbf{y}$, respectively. Each vectorized sampling mask is expressed as $\boldsymbol{\phi}_i$, $i = 1, \dots, t$, giving rise to the measurement model

$\mathbf{y} = \boldsymbol{\Phi} \mathbf{x}, \qquad (1)$

where $\boldsymbol{\Phi} = \left[ \mathrm{diag}(\boldsymbol{\phi}_1), \dots, \mathrm{diag}(\boldsymbol{\phi}_t) \right]$ and $\mathrm{diag}(\cdot)$ creates a diagonal matrix from its vector argument.

Figure 1: Temporal compressive sensing measurement model.
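To make the measurement model concrete, the following sketch simulates equation (1) without ever forming the full matrix $\boldsymbol{\Phi}$: each coded frame is simply the sum over time of the video frames weighted element-wise by their binary masks. The array shapes and function names here are illustrative assumptions, not taken from the released code.

    import numpy as np

    def capture_coded_frame(video_block, masks):
        """video_block: (H, W, t) grayscale frames; masks: (H, W, t) binary masks.
        Returns the (H, W) coded measurement frame, i.e., y = Phi x in Eq. (1)."""
        return np.sum(video_block * masks, axis=2)

    # Illustrative sizes only.
    H, W, t = 64, 64, 16
    video = np.random.rand(H, W, t)                        # unknown high-speed frames
    masks = (np.random.rand(H, W, t) < 0.5).astype(float)  # binary random masks
    coded = capture_coded_frame(video, masks)              # single captured coded frame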

Various successful temporal CS architectures have been proposed. Their differences mainly involve the implementation of the random masks on the optical path (i.e., the measurement matrix in Figure 1). Digital micromirror devices (DMD), spatial light modulators (SLM) and liquid crystal on silicon (LCoS) were used in Chen2015 ; Wang2015 ; Gao2014 ; Liu2013 ; Reddy2011 while translating printed masks were employed in Koller2015 ; Llull2013b . Moreover, a few architectures have eliminated additional optical elements by directly programming the chip’s readout mode through hardware circuitry modifications Fernandez-Cull2014 ; Orchard2012 ; Spinoulas2015 .

Despite their reasonable performance, temporal CS architectures lack practicality. The main drawback is that existing reconstruction algorithms (e.g., using sparsity models Chen2015 ; Holloway2012 , combining sparsity and dictionary learning Liu2013 , or using Gaussian mixture models Yang2015 ; Yang2014 ) are often too computationally intensive, rendering the reconstruction process painfully slow. Even with parallel processing, recovery times make video CS prohibitive for modern commercial camera architectures.

In this work, we address this problem by employing deep learning and show that video frames can be recovered in a few seconds at significantly improved reconstruction quality compared to existing approaches.

Our contributions are summarized as follows:

  1. We present the first deep learning architecture for temporal video CS reconstruction, based on fully-connected neural networks, which learns to map temporal CS measurements directly to video frames. For this task to be practical, a measurement mask with a repeated pattern is proposed.

  2. We show that a simple linear regression-based approach learns to reconstruct video frames adequately at a minimal computational cost. Such a reconstruction could be used as an initial point for other video CS algorithms.

  3. The learning paradigm is extended to deeper architectures, exhibiting reconstruction quality and computational cost improvements compared to previous methods.

2 Motivation and Related Work

Deep learning LeCun2015 is a burgeoning research field which has demonstrated state-of-the-art performance in a multitude of machine learning and computer vision tasks, such as image recognition He2015 or object detection Pinheiro2015 .

In simple terms, deep learning tries to mimic the human brain by training large multi-layer neural networks with vast amounts of training samples that describe a given task. Such networks have proven very successful in problems where analytical modeling is not easy or straightforward (e.g., a variety of computer vision tasks Krizhevsky2012 ; Lecun1998 ).

The popularity of neural networks in recent years has led researchers to explore the capabilities of deep architectures even in problems where analytical models often exist and are well understood (e.g., restoration problems Burger2012 ; Schuler2013 ; Xie2012 ). Even though the performance improvement is not as pronounced as in classification problems, many proposed architectures have achieved state-of-the-art performance in problems such as deconvolution, denoising, inpainting, and super-resolution.

More specifically, investigators have employed a variety of architectures: deep fully-connected networks or multi-layer perceptrons (MLPs) Burger2012 ; Schuler2013 ; stacked denoising auto-encoders (SDAEs) Xie2012 ; Agostinelli2013 ; Fleet2014 ; Vincent2010 , which are MLPs whose layers are pre-trained to provide improved weight initialization; convolutional neural networks (CNNs) Wang2015 ; Sun2015 ; Dong2015 ; Lecun1989 ; Ren2015 ; Li2014 ; and recurrent neural networks (RNNs) Yan2015 .

Based on such success in restoration problems, we wanted to explore the capabilities of deep learning for the video CS problem. However, the majority of existing architectures involve outputs whose dimensionality is smaller than that of the input (e.g., classification) or of the same size (e.g., denoising/deblurring). Hence, devising an architecture that estimates a number of unknowns larger than the number of given inputs is not necessarily straightforward.

Two recent studies, utilizing SDAEs Mousavi2015 or CNNs Kulkarni2016 , have been presented on spatial CS for still images exhibiting promising performance. Our work constitutes the first attempt to apply deep learning on temporal video CS. Our approach differs from prior 2D image restoration architectures Burger2012 ; Schuler2013 since we are recovering a 3D volume from 2D measurements.

3 Deep Networks for Compressed Video

3.1 Linear mapping

We started our investigation by posing the question: can training data be used to find a linear mapping $\mathbf{W}$ such that $\mathbf{x} = \mathbf{W} \mathbf{y}$? Essentially, this question asks for the inverse of $\boldsymbol{\Phi}$ in equation (1) which, of course, does not exist. Clearly, such a matrix would be huge to store for full frames; instead, one can apply the same logic on video blocks Liu2013 .

We collect a set of training video blocks, denoted by $\mathbf{x}_i$, each of size $w_b \times h_b \times t$. The measurement model per block is now $\mathbf{y}_i = \boldsymbol{\Phi}_b \mathbf{x}_i$, where $\mathbf{y}_i$ is of size $w_b \times h_b$ and $\boldsymbol{\Phi}_b$ refers to the corresponding measurement matrix per block.

Collecting a set of $Q$ video blocks, we obtain the matrix equation

$\mathbf{Y} = \boldsymbol{\Phi}_b \mathbf{X}, \qquad (2)$

where $\mathbf{Y} = [\mathbf{y}_1, \dots, \mathbf{y}_Q]$, $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_Q]$, and $\boldsymbol{\Phi}_b$ is the same for all blocks. The linear mapping we are after can be calculated in the least-squares sense as

$\mathbf{W} = \mathbf{X} \mathbf{Y}^T \left( \mathbf{Y} \mathbf{Y}^T \right)^{-1}, \qquad (3)$

where $\mathbf{W}$ is of size $w_b h_b t \times w_b h_b$.

Figure 2: Average reconstruction performance of linear mapping for videos (unrelated to the training data), using measurement matrices with varying percentages of nonzero elements.

Intuitively, such an approach would not necessarily be expected to even provide a solution due to ill-posedness. However, it turns out that, if $Q$ is sufficiently large and the matrix $\boldsymbol{\Phi}_b$ has at least one nonzero in each row (i.e., each spatial location is sampled at least once over time), estimating the $\mathbf{x}_i$'s from the $\mathbf{y}_i$'s provides surprisingly good performance.

Specifically, we obtain measurements from a test video sequence by applying the same $\boldsymbol{\Phi}_b$ to each video block and then reconstruct all blocks using the learnt $\mathbf{W}$. Figure 2 depicts the average peak signal-to-noise ratio (PSNR) and structural similarity metric (SSIM) Wang2004 for the reconstruction of the test video sequences using different realizations of the random binary matrix $\boldsymbol{\Phi}_b$ with varying percentages of nonzero elements. The empty bars correspond to realizations for which there was no solution due to the lack of nonzeros at some spatial location. In these experiments, $t$ was selected to simulate the reconstruction of a full block of frames from a single captured coded frame.
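A minimal sketch of the linear mapping of equations (2)-(3) follows, assuming the vectorized per-block notation above (columns of $\mathbf{X}$ are video blocks, columns of $\mathbf{Y}$ their measurements); the function names and the use of NumPy are our own illustrative choices rather than the paper's implementation.

    import numpy as np

    def learn_linear_mapping(X, Y):
        """X: (n_x, Q) vectorized training video blocks; Y: (n_y, Q) their measurements.
        Returns W of size (n_x, n_y) minimizing ||X - W Y||_F, i.e., Eq. (3)."""
        # W = X Y^T (Y Y^T)^{-1}; solve the normal equations instead of inverting.
        return np.linalg.solve(Y @ Y.T, Y @ X.T).T

    def reconstruct_block(W, y):
        """Recover a vectorized video block from a single vectorized measurement."""
        return W @ y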

3.2 Measurement Matrix Construction

Based on the performance in Figure 2, extending the linear mapping of equation (3) to a nonlinear mapping using deep networks seemed increasingly promising. In order for such an approach to be practical, though, reconstruction has to be performed on blocks and each block must be sampled with the same measurement matrix $\boldsymbol{\Phi}_b$. Furthermore, such a measurement matrix should be realizable in hardware. Hence we propose constructing a $\boldsymbol{\Phi}$ which consists of repeated identical building blocks of size $w_b \times h_b \times t$, as presented in Figure 3. Such a matrix can be straightforwardly implemented on existing systems employing DMDs, SLMs or LCoS Chen2015 ; Wang2015 ; Gao2014 ; Liu2013 ; Reddy2011 . At the same time, in systems utilizing translating masks Koller2015 ; Llull2013b , a repeated mask can be printed and shifted appropriately to produce the same effect.

In the remainder of this paper, the building block is a random binary array with a fixed percentage of nonzero elements, and $t$ is chosen such that a single coded frame encodes an entire block of $t$ frames; the compression ratio is therefore $1/t$. In addition, since every building block of the proposed matrix $\boldsymbol{\Phi}$ is identical, reconstruction can be performed on overlapping blocks of size $w_b \times h_b \times t$ with a fixed spatial overlap. Such overlap can usually aid in improving reconstruction quality. The chosen percentage of nonzeros was an arbitrary choice, since the results of Figure 2 did not suggest that a specific percentage is particularly beneficial in terms of reconstruction quality.

Figure 3: Construction of the proposed full measurement matrix by repeating a three dimensional random array (building block) in the horizontal and vertical directions.
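The repeated-block construction of Figure 3 can be sketched as follows; the frame and block dimensions and the fraction of nonzeros used here are placeholders, since the paper's exact values are not restated in this text.

    import numpy as np

    def build_repeated_masks(frame_h, frame_w, h_b, w_b, t, nonzero_frac, seed=0):
        """Tile one random binary (h_b, w_b, t) building block over the full frame."""
        rng = np.random.default_rng(seed)
        block = (rng.random((h_b, w_b, t)) < nonzero_frac).astype(float)
        # Every spatial location should be sampled at least once over time.
        assert block.sum(axis=2).min() > 0, "some pixel is never sampled"
        return np.tile(block, (frame_h // h_b, frame_w // w_b, 1))

    # Placeholder dimensions and nonzero fraction.
    masks = build_repeated_masks(256, 256, h_b=8, w_b=8, t=16, nonzero_frac=0.5)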

3.3 Multi-layer Network Architecture

In this section, we extend the linear formulation to MLPs and investigate the performance in deeper structures.

Choice of Network Architecture. We consider an end-to-end MLP architecture to learn a nonlinear function $f(\cdot)$ that maps a measured frame patch $\mathbf{y}_i$ via several hidden layers to a video block $\mathbf{x}_i$, as illustrated in Figure 4. The MLP architecture was chosen for the problem of video CS reconstruction due to the following two considerations:

Figure 4: Illustration of the proposed deep learning architecture for video compressive sensing.
  1. The first hidden layer should be a fully-connected layer that provides a 3D signal from the compressed 2D measurements. This is necessary for temporal video CS since, in contrast to the super-resolution problem (or other related image reconstruction problems) where a low-resolution image is given as input, here we are given CS encoded measurements. Thus, the convolution assumption does not hold and a convolutional layer cannot be employed as the first layer.

  2. Following that, one could argue that the subsequent layers could be 3D convolutional layers Tran2015 . Although that would sound reasonable for our problem, in practice the small block size used in this paper does not allow convolutions to be effective. Increasing the block size so that convolutions can be applied would dramatically increase the network complexity for 3D volumes such as videos; even a modest increase in the input block size would cause the first fully-connected layer to contain an impractically large number of parameters. Besides, such small block sizes have provided good reconstruction quality in dictionary learning approaches used for CS video reconstruction Liu2013 , where it was shown that choosing larger block sizes led to worse reconstruction quality.

Thus, MLPs (i.e., fully-connected layers throughout the network) were considered more reasonable for our work, and we found that, when applied to blocks, they capture the motion and spatial details of videos adequately.

It is interesting to note here that another approach would be to learn the mapping between $\boldsymbol{\Phi}_b^T \mathbf{y}_i$ and $\mathbf{x}_i$, since the matrix $\boldsymbol{\Phi}_b$ is known Mehta17 . Such an approach could provide better pixel localization, since $\boldsymbol{\Phi}_b^T$ places the values of $\mathbf{y}_i$ at the pixel locations that were sampled to produce the summation along the temporal direction. However, such an architecture would require additional weights between the input and the first hidden layer, since the input would now be of size $w_b h_b t$ instead of $w_b h_b$. This approach was tested and resulted in almost identical performance, albeit at a higher computational cost, hence it is not presented here.

Network Architecture Design. As illustrated in Figure 4, each hidden layer $\mathcal{L}_k$, $k = 1, \dots, K$, is defined as

$\mathbf{h}_k(\mathbf{y}) = \sigma\left( \mathbf{b}_k + \mathbf{W}_k \mathbf{h}_{k-1}(\mathbf{y}) \right), \qquad (4)$

where $\mathbf{b}_k$ is the bias vector and $\mathbf{W}_k$ is the weight matrix containing the linear filters of layer $k$. $\mathbf{W}_1$ connects the input measurement $\mathbf{y}$ to the first hidden layer (i.e., $\mathbf{h}_0(\mathbf{y}) = \mathbf{y}$), while for the remaining hidden layers $\mathbf{h}_{k-1}(\mathbf{y})$ is the output of the previous layer. The last hidden layer is connected to the output layer via weights $\mathbf{W}_{K+1}$ and bias $\mathbf{b}_{K+1}$ without a nonlinearity. The nonlinear function $\sigma(\cdot)$ is the rectified linear unit (ReLU) Nair2010 , defined as $\sigma(x) = \max(0, x)$. In our work we considered two different network architectures, one with 4 and another with 7 hidden layers.
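The architecture of equation (4) can be sketched as below. The original implementation uses Caffe; this PyTorch version and the chosen layer widths are illustrative assumptions only.

    import torch.nn as nn

    def make_video_cs_mlp(n_in, n_out, hidden_dim, num_hidden):
        """Fully-connected network of Eq. (4): ReLU hidden layers, linear output."""
        layers, width = [], n_in
        for _ in range(num_hidden):
            layers += [nn.Linear(width, hidden_dim), nn.ReLU(inplace=True)]
            width = hidden_dim
        layers.append(nn.Linear(width, n_out))   # output layer, no nonlinearity
        return nn.Sequential(*layers)

    # e.g., an 8x8 measurement patch mapped to an 8x8x16 video block with 7 hidden layers
    net = make_video_cs_mlp(n_in=8 * 8, n_out=8 * 8 * 16, hidden_dim=8 * 8 * 16, num_hidden=7)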

To train the proposed MLP, we learn all the weights and biases of the model. The set of parameters is denoted as $\theta = \{\mathbf{W}_k, \mathbf{b}_k\}$ and is updated by the backpropagation algorithm Rumelhart1988 , minimizing the quadratic error between the set of training mapped measurements $f(\mathbf{y}_i; \theta)$ and the corresponding video blocks $\mathbf{x}_i$. The loss function is the mean squared error (MSE), which is given by

$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| f(\mathbf{y}_i; \theta) - \mathbf{x}_i \right\|_2^2. \qquad (5)$

The MSE was used in this work since our goal is to optimize the PSNR, which is directly related to the MSE.

Figure 5: Example frames from the video sequences used for training.

4 Experiments

We compare our proposed deep architecture with state-of-the-art approaches both quantitatively and qualitatively. The proposed approaches are evaluated both assuming noiseless measurements and in the presence of measurement noise. Finally, we investigate the performance of our methods under different network parameters (e.g., number of layers) and training set sizes. The metrics used for evaluation were the PSNR and SSIM.

4.1 Training Data Collection

For deep neural networks, increasing the number of training samples is usually synonymous with improved performance. We collected a diverse set of training samples using high-definition videos from Youtube, depicting natural scenes. The video sequences were converted to grayscale, and all of them are unrelated to the test set. We randomly extracted 10 million video blocks, keeping the number of blocks extracted per video proportional to its duration. These blocks were used as the network output, while the corresponding input was obtained by multiplying each sample with the measurement matrix (see subsection 3.2 for details). Example frames from the video sequences used for training are shown in Figure 5.
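A sketch of how such training pairs can be generated is shown below; the shapes, names, and sampling strategy are illustrative assumptions rather than the exact pipeline of the released code.

    import numpy as np

    def sample_training_pair(video, block_masks, rng):
        """video: (H, W, T) grayscale frames; block_masks: (h_b, w_b, t) binary masks.
        Returns (input, target) = (vectorized coded patch, vectorized video block)."""
        h_b, w_b, t = block_masks.shape
        H, W, T = video.shape
        i = rng.integers(0, H - h_b + 1)
        j = rng.integers(0, W - w_b + 1)
        k = rng.integers(0, T - t + 1)
        x = video[i:i + h_b, j:j + w_b, k:k + t]   # target video block
        y = np.sum(x * block_masks, axis=2)        # coded measurement (network input)
        return y.ravel(), x.ravel()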

4.2 Implementation Details

Our networks were trained with mini-batch updates. We normalized the input per-feature to zero mean and standard deviation one. The weights of each layer were initialized to random values uniformly distributed in a symmetric interval determined by the size $n$ of the previous layer Xavier2010 . We used Stochastic Gradient Descent (SGD) with a fixed starting learning rate, which was divided by a constant factor after a fixed number of iterations. The momentum was set to 0.9 and we further used gradient norm clipping to keep the gradients within a certain range. Gradient clipping is a widely used technique in recurrent neural networks to avoid exploding gradients Pascanu2013 .
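A hedged sketch of the above training recipe (SGD with momentum, a stepped learning-rate drop, the MSE loss of equation (5), and gradient-norm clipping) is given below in PyTorch; the numeric values and the stand-in network are placeholders, since the exact hyper-parameters are not reproduced in this text and the original implementation uses Caffe.

    import torch

    # Stand-in network; in practice this is the MLP sketched in subsection 3.3.
    net = torch.nn.Sequential(torch.nn.Linear(64, 1024), torch.nn.ReLU(),
                              torch.nn.Linear(1024, 1024))

    optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9)  # momentum 0.9 as stated
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100000, gamma=0.1)
    criterion = torch.nn.MSELoss()                                        # MSE loss of Eq. (5)

    def train_step(y_batch, x_batch, clip_threshold=10.0):
        optimizer.zero_grad()
        loss = criterion(net(y_batch), x_batch)
        loss.backward()
        # Gradient-norm clipping, as in recurrent-network training (Pascanu2013).
        torch.nn.utils.clip_grad_norm_(net.parameters(), clip_threshold)
        optimizer.step()
        scheduler.step()   # advances the learning-rate schedule once per iteration
        return loss.item()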

4.3 Comparison with Previous Methods

We compare our method with the state-of-the-art video compressive sensing methods:

  • GMM-TP, a Gaussian mixture model (GMM)-based algorithm Yang2014 .

  • MMLE-GMM, a maximum marginal likelihood estimator (MMLE), that maximizes the likelihood of the GMM of the underlying signals given only their linear compressive measurements Yang2015 .

For temporal CS reconstruction, data driven models usually perform better than standard sparsity-based schemes Yang2015 ; Yang2014 . Indeed, both GMM-TP and MMLE-GMM have demonstrated superior performance compared to existing approaches in the literature such as Total-Variation (TV) or dictionary learning Liu2013 ; Yang2015 ; Yang2014 , hence we did not include experiments with the latter methods.

For GMM-TP Yang2014 we followed the settings proposed by the authors and used our training data (randomly selecting a subset of samples) to train the underlying GMM parameters. We found that our training data provided better performance compared to the data used by the authors. In our experiments we denote this method by GMM-4, indicating reconstruction of overlapping blocks with a spatial overlap of 4 pixels, as discussed in subsection 3.2.

MMLE Yang2015 is a self-training method, but it is sensitive to initialization. Satisfactory performance is obtained only when MMLE is combined with a good starting point. In Yang2015 , GMM-TP Yang2014 with fully overlapping patches (denoted in our experiments as GMM-1) was used to initialize the MMLE. We denote the combined method as GMM-1+MMLE. For fairness, we also conducted experiments in which our method is used as the starting point for the MMLE.

Table 1: Average PSNR and SSIM for the reconstruction of the first frames of each test video sequence (Electric Ball, Horse, Bow & Arrow, Bus, Dogs, City, Crew, Filament, Hammer, Football, Kayak, Porsche, Golf, and Basketball), comparing W-10M, FC7-10M, GMM-4 Yang2014 , GMM-1 Yang2015 , FC7-10M+MMLE, and GMM-1+MMLE Yang2015 . Maximum values are highlighted for each side (left/right) of the table. The time (bottom row) refers to the average time for reconstructing a sequence of frames from a single captured frame.

In our methods, a collection of overlapping patches of size $w_b \times h_b$ is extracted from each coded measurement, and each patch is subsequently reconstructed into a video block of size $w_b \times h_b \times t$. Overlapping areas of the recovered video blocks are then averaged to obtain the final video reconstruction, as depicted in Figure 4. The step of the overlapping patches was dictated by the special construction of the utilized measurement matrix, as discussed in subsection 3.2.
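The patch-wise reconstruction and averaging step can be sketched as follows, with assumed shapes and a generic decode_block callable standing in for the trained network.

    import numpy as np

    def reconstruct_video(coded, decode_block, h_b, w_b, t, step):
        """coded: (H, W) coded frame; decode_block: maps an (h_b, w_b) patch to an
        (h_b, w_b, t) video block (e.g., the trained MLP). Overlaps are averaged."""
        H, W = coded.shape
        out = np.zeros((H, W, t))
        weight = np.zeros((H, W, 1))
        for i in range(0, H - h_b + 1, step):
            for j in range(0, W - w_b + 1, step):
                out[i:i + h_b, j:j + w_b, :] += decode_block(coded[i:i + h_b, j:j + w_b])
                weight[i:i + h_b, j:j + w_b, 0] += 1.0
        return out / np.maximum(weight, 1.0)   # average the overlapping areas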

We consider six different architectures:

  • W-10M, a simple linear mapping (equation (3)) trained on 10 million samples.

  • FC4-1M, a 4 hidden-layer MLP trained on 1 million samples (randomly selected from our 10 million samples).

  • FC4-10M, a 4 hidden-layer MLP trained on 10 million samples.

  • FC7-1M, a 7 hidden-layer MLP trained on 1 million samples (randomly selected from our 10 million samples).

  • FC7-10M, a 7 hidden-layer MLP trained on 10 million samples.

  • FC7-10M+MMLE, a 7 hidden-layer MLP trained on 10 million samples, used as an initialization to the MMLE Yang2015 method.

Note that the subset of 1 million randomly selected samples used for training FC4-1M and FC7-1M was the same.

Our test set consists of 14 video sequences. They involve a set of videos that were used for dictionary training in Liu2013 , provided by the authors, as well as the “Basketball” video sequence used by Yang2015 . All video sequences are unrelated to the training set (see subsection 4.1 for details). For fair comparisons, the same measurement mask, constructed according to subsection 3.2, was used for all methods. The code implementations of all previous methods are publicly available, provided by their authors.

4.4 Reconstruction Results

Quantitative reconstruction results for all video sequences with all tested algorithms are presented in Table 1 and average performance is summarized in Figure 7. The presented metrics refer to average performance for the reconstruction of the first frames of each video sequence, using consecutive captured coded frames through the video CS measurement model of equation (1). In both Table 1 and Figure 7, results are divided into two parts. The first part lists the reconstruction performance of the tested approaches without the MMLE step, while the second compares the performance of the best candidates among the proposed and previous methods, respectively, with a subsequent MMLE step Yang2015 . In Table 1 the best performing algorithms are highlighted for each part, while the bottom row presents average reconstruction time requirements for the recovery of the video frames encoded in a single captured coded frame.

Figure 6: Qualitative reconstruction comparison of frames from two video sequences between our methods and GMM-1 Yang2015 , GMM-1+MMLE Yang2015 .
Figure 7: Average PSNR and SSIM over all video sequences for several methods.

Our FC7-10M and FC7-10M+MMLE yield the highest PSNR and SSIM values for all video sequences. Specifically, FC7-10M provides a clear average PSNR improvement over GMM-1 Yang2015 , and when these two methods are used to initialize the MMLE Yang2015 algorithm, FC7-10M+MMLE likewise outperforms GMM-1+MMLE Yang2015 in average PSNR. Notice also that FC7-10M alone achieves higher PSNR than the combined GMM-1+MMLE. The highest PSNR and SSIM values over all test sequences are reported for the FC7-10M+MMLE method. However, the average reconstruction time per sequence for this method is almost two hours, while the second best, FC7-10M, requires only a few seconds at a small loss in average PSNR. We conclude that, when time is critical, FC7-10M should be the preferred reconstruction method.

Qualitative results of selected video frames are shown in Figure 6. The proposed MLP architectures, including the linear regression model, recover motion favorably, while the additional hidden layers improve the spatial resolution of the scene (see supplementary material for example reconstructed videos). One can clearly observe the sharper edges and high frequency details produced by the FC7-10M and FC7-10M+MMLE methods compared to previously proposed algorithms.

Figure 8: PSNR comparison for all the frames of the test video sequences between the proposed method FC7-10M and the previous method GMM-4 Yang2014 .

Due to the extremely long reconstruction times of previous methods, the results presented in Table 1 and Figure 7 refer to only the first frames of each video sequence, as mentioned above. Figure 8 compares the PSNR for all the frames of video sequences using our FC7-10M algorithm and the fastest previous method GMM-4 Yang2014 , while Figure 9 depicts representative snapshots for some of them. The varying PSNR performance across the frames of a frame block is consistent for both algorithms and is reminiscent of the reconstruction tendency observed in other video CS papers in the literature Koller2015 ; Llull2013b ; Yang2015 ; Yang2014 .

Figure 9: Qualitative reconstruction performance of video frames between the proposed method FC7-10M and the previous method GMM-4 Yang2014 . The corresponding PSNR results for all video frames are shown in Figure 8.

4.5 Reconstruction Results with Noise

Previously, we evaluated the proposed algorithms assuming noiseless measurements. In this subsection, we investigate the performance of the presented deep architectures under the presence of measurement noise. Specifically, the measurement model of equation (1) is now modified to

$\mathbf{y} = \boldsymbol{\Phi} \mathbf{x} + \mathbf{n}, \qquad (6)$

where $\mathbf{n}$ is the additive measurement noise vector.

We employ our best architecture, utilizing 7 hidden layers, and follow two different training schemes. In the first one, the network is trained on the 10 million samples discussed in subsection 4.3 (i.e., the same FC7-10M network as before), while in the second, the network is trained using the same data pairs after adding random Gaussian noise to each measurement vector $\mathbf{y}_i$. Each vector was corrupted with a level of noise such that the signal-to-noise ratio (SNR) is uniformly selected within a range of dB values, giving rise to a set of 10 million noisy samples for training. We denote the network trained on the noisy dataset as FC7N-10M.
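The noisy training data can be generated along the following lines; the SNR range shown is a placeholder, since the exact dB range is not restated in this text.

    import numpy as np

    def add_noise_at_snr(y, snr_db, rng):
        """Corrupt a measurement vector y with white Gaussian noise at the given SNR (dB)."""
        signal_power = np.mean(y ** 2)
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        return y + rng.normal(0.0, np.sqrt(noise_power), size=y.shape)

    rng = np.random.default_rng(0)
    snr_db = rng.uniform(20.0, 40.0)   # placeholder SNR range in dB
    # y_noisy = add_noise_at_snr(y, snr_db, rng)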

We now compare the performance of the two proposed architectures with the previous methods GMM-4 and GMM-1 in the presence of measurement noise. We did not include experiments with the MMLE counterparts of the algorithms since, as we observed earlier, the performance improvement is always related to the starting point of the MMLE algorithm. Figure 10 shows the average performance comparison for the reconstruction of the first frames of each tested video sequence under different levels of measurement noise, while Figure 11 depicts example reconstructed frames.

Figure 10: Average PSNR and SSIM over all video sequences for several methods under different levels of measurement noise.
Figure 11: Qualitative reconstruction comparison between our methods and GMM-4 Yang2014 , GMM-1 Yang2015 under different levels of measurement noise. The original frame and corresponding inset are presented in Figure 6.

As we can observe, the network trained on noiseless data (FC7-10M) provides good performance for low measurement noise and reaches performance similar to GMM-1 for more severe noise levels. The network trained on noisy data (FC7N-10M) proves more robust to noise severity, achieving better performance than GMM-1 under all tested noise levels.

Despite proving more robust to noise, our algorithms in general recover motion favorably but, for high noise levels, residual noise remains throughout the reconstructed scene (observe the results for the higher noise levels in Figure 11). Such degradation could be combated by cascading our architecture with a deep denoising architecture (e.g., Burger2012 ) or a denoising algorithm to remove the noise artifacts. Ideally, for a specific camera system, training data would be collected with that system, such that the deep architecture incorporates the noise characteristics of the underlying sensor.

4.6 Run Time

Run time comparisons for several methods are reported in the bottom row of Table 1. All previous approaches are implemented in MATLAB. Our deep learning methods are implemented in the Caffe package Jia2014 , and all algorithms were executed on the same machine. We observe that the deep learning approaches outperform the previous approaches by several orders of magnitude. Note that a direct comparison between the methods is not trivial due to the different implementations. Nevertheless, previous methods solve an optimization problem during reconstruction, while our MLP is a feed-forward network that requires only a few matrix-vector multiplications.

4.7 Number of Layers and Dataset Size

From Figure 7 we observe that as the number of training samples increases the performance consistently improves. However, the improvement achieved by increasing the number of layers (from 4 to 7) for architectures trained on small datasets (e.g., 1M) is not significant (performance is almost the same). This is perhaps expected, as one may argue that in order to achieve higher performance with extra layers (thus, more parameters to train) more training data would be required. Intuitively, adding hidden layers enables the network to learn more complex functions. Indeed, reconstruction performance on our 10 million dataset is slightly higher for FC7-10M than for FC4-10M: the average PSNR over all test videos is 32.66 dB for FC4-10M and 32.91 dB for FC7-10M. This suggests that 4 hidden layers are sufficient to learn the mappings in our 10M training set. However, we wanted to explore the possible performance benefits of adding extra hidden layers to the network architecture.

In order to provide more insight into the slight performance improvement of FC7-10M compared to FC4-10M, we visualize in Figure 12 an example video block from our training set and its respective reconstruction using the two networks. We observe that FC7-10M reconstructs the patches of the video block slightly better than FC4-10M, suggesting that the additional parameters help in fitting the training data more accurately. Furthermore, we observed that the reconstruction performance on our validation set was better for FC7-10M than for FC4-10M. Note that a small validation set was kept for tuning the hyper-parameters during training and that we also employed weight-norm regularization to prevent overfitting. Increasing the number of hidden layers further did not help in our experiments, as we did not observe any additional performance improvement on our validation set. Thus, we found that learning to reconstruct training patches accurately was important for our problem.

Figure 12: Qualitative reconstruction comparison for a video block of the training set. The first row shows patches from the original video block; the second row shows the reconstruction using the trained network with 7 hidden layers (FC7-10M); the third row shows the reconstruction using the trained network with 4 hidden layers (FC4-10M). The slight improvement in reconstruction quality using network FC7-10M is apparent, and its norm reconstruction error is lower than that of FC4-10M.

5 Conclusions

To the best of our knowledge, this work constitutes the first deep learning architecture for temporal video compressive sensing reconstruction. We demonstrated superior performance compared to existing algorithms while reducing reconstruction time to a few seconds. At the same time, we focused on the applicability of our framework to existing compressive camera architectures, suggesting that their commercial use could be viable. We believe that this work can be extended in three directions: 1) exploring the performance of variant architectures such as RNNs, 2) investigating the training of deeper architectures, and 3) examining the reconstruction performance on real video sequences acquired by a temporal compressive sensing camera.

References

  • (1) F. Agostinelli, M. R. Anderson, and H. Lee. Adaptive multi-column deep neural networks with application to robust image denoising. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Adv. Neural Inf. Process. Syst. 26, pages 1493–1501. Curran Associates, Inc., 2013.
  • (2) S. D. Babacan, M. Luessi, L. Spinoulas, A. K. Katsaggelos, N. Gopalsami, T. Elmer, R. Ahern, S. Liao, and A. Raptis. Compressive passive millimeter-wave imaging. In IEEE Int. Conf. Image Processing, pages 2705–2708, Sept 2011.
  • (3) H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 2392–2399, June 2012.
  • (4) H. Chen, M. S. Asif, A. C. Sankaranarayanan, and A. Veeraraghavan. FPA-CS: Focal plane array-based compressive imaging in short-wave infrared. In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 2358–2366, June 2015.
  • (5) H. Chen, Z. Weng, Y. Liang, C. Lei, F. Xing, M. Chen, and S. Xie. High speed single-pixel imaging via time domain compressive sampling. In CLEO: 2014, page JTh2A.132. Optical Society of America, 2014.
  • (6) Z. Cui, H. Chang, S. Shan, B. Zhong, and X. Chen. Deep network cascade for image super-resolution. In Computer Vision – ECCV 2014, volume 8693 of Lecture Notes in Computer Science, pages 49–64. Springer International Publishing, 2014.
  • (7) C. Dong, C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell., 38(2):295–307, Feb. 2016.
  • (8) M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk. Single-Pixel imaging via compressive sampling. IEEE Signal Process. Mag., 25(2):83–91, Mar. 2008.
  • (9) C. Fernandez-Cull, B. M. Tyrrell, R. D’Onofrio, A. Bolstad, J. Lin, J. W. Little, M. Blackwell, M. Renzi, and M. Kelly. Smart pixel imaging with computational-imaging arrays. In Proc. SPIE, volume 9070, pages 90703D–90703D–13, 2014.
  • (10) L. Gao, J. Liang, C. Li, and L. V. Wang. Single-Shot compressed ultrafast photography at one hundred billion frames per second. Nature, 516:74–77, 2014.
  • (11) X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Y. W. Teh and M. Titterington, editors, Proc. Int. Conf. Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.
  • (12) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 770–778, June 2016.
  • (13) J. Holloway, A. C. Sankaranarayanan, A. Veeraraghavan, and S. Tambe. Flutter shutter video camera for compressive sensing of videos. In IEEE Int. Conf. Comp. Photography, pages 1–9, April 2012.
  • (14) Y. Huang, W. Wang, and L. Wang. Bidirectional recurrent convolutional networks for multi-frame super-resolution. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Adv. Neural Inf. Process. Syst. 28, pages 235–243. Curran Associates, Inc., 2015.
  • (15) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proc. ACM Int. Conf. Multimedia, MM ’14, pages 675–678, New York, NY, USA, 2014. ACM.
  • (16) R. Koller, L. Schmid, N. Matsuda, T. Niederberger, L. Spinoulas, O. Cossairt, G. Schuster, and A. K. Katsaggelos. High spatio-temporal resolution video with compressed sensing. Opt. Express, 23(12):15992–16007, June 2015.
  • (17) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Adv. Neural Inf. Process. Syst. 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • (18) K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok. ReconNet: Non-iterative reconstruction of images from compressively sensed measurements. In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 449–458, June 2016.
  • (19) Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
  • (20) Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Dec. 1989.
  • (21) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, Nov. 1998.
  • (22) D. Liu, J. Gu, Y. Hitomi, M. Gupta, T. Mitsunaga, and S. K. Nayar. Efficient space-time sampling with pixel-wise coded exposure for high-speed imaging. IEEE Trans. Pattern Anal. Mach. Intell., 36(2):248–260, Feb 2014.
  • (23) P. Llull, X. Liao, X. Yuan, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady. Coded aperture compressive temporal imaging. Opt. Express, 21(9):10526–10545, May 2013.
  • (24) P. Llull, X. Yuan, X. Liao, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady. Temporal Compressive Sensing for Video, pages 41–74. Springer International Publishing, Cham, 2015.
  • (25) J. Mehta and A. Majumdar. Rodeo: Robust de-aliasing autoencoder for real-time medical image reconstruction. Pattern Recognition, 63:499–510, 2017.
  • (26) A. Mousavi, A. B. Patel, and R. G. Baraniuk. A deep learning approach to structured signal recovery. In Annual Allerton Conf. Communication, Control, and Computing, pages 1336–1343, Sept 2015.
  • (27) V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In J. Fürnkranz and T. Joachims, editors, Proc. Int. Conf. Machine Learning, pages 807–814. Omnipress, 2010.
  • (28) G. Orchard, J. Zhang, Y. Suo, M. Dao, D. T. Nguyen, S. Chin, C. Posch, T. D. Tran, and R. Etienne-Cummings. Real time compressive sensing video reconstruction in hardware. IEEE Trans. Emerg. Sel. Topics Circuits Syst., 2(3):604–615, Sept. 2012.
  • (29) R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In S. Dasgupta and D. McAllester, editors, Proc. Int. Conf. Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310–1318, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
  • (30) P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In Proc. Int. Conf. Neural Inf. Process. Systems, NIPS’15, pages 1990–1998, Cambridge, MA, USA, 2015. MIT Press.
  • (31) D. Reddy, A. Veeraraghavan, and R. Chellappa. P2C2: Programmable pixel compressive camera for high speed imaging. In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 329–336, June 2011.
  • (32) J. S. Ren, L. Xu, Q. Yan, and W. Sun. Shepard convolutional neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Adv. Neural Inf. Process. Syst. 28, pages 901–909. Curran Associates, Inc., 2015.
  • (33) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Neurocomputing: Foundations of research. chapter Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, Cambridge, MA, USA, 1988.
  • (34) C. Schuler, H. Burger, S. Harmeling, and B. Scholkopf. A machine learning approach for non-blind image deconvolution. In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 1067–1074, June 2013.
  • (35) L. Spinoulas, K. He, O. Cossairt, and A. Katsaggelos. Video compressive sensing with on-chip programmable subsampling. In Proc. IEEE Conf. Comp. Vision Pattern Recognition Workshops, pages 49–57, June 2015.
  • (36) J. Sun, W. Cao, Z. Xu, and J. Ponce. Learning a convolutional neural network for non-uniform motion blur removal. In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 769–777, June 2015.
  • (37) D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In IEEE Int. Conf. Computer Vision, pages 4489–4497, Dec 2015.
  • (38) P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, Dec. 2010.
  • (39) J. Wang, M. Gupta, and A. C. Sankaranarayanan. LiSens- A scalable architecture for video compressive sensing. In Proc. IEEE Conf. Comp. Photography, pages 1–9, April 2015.
  • (40) Z. Wang, A. C. Bovik, H. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, April 2004.
  • (41) Z. Wang, L. Spinoulas, K. He, L. Tian, O. Cossairt, A. K. Katsaggelos, and H. Chen. Compressive holographic video. Opt. Express, 25(1):250–262, Jan 2017.
  • (42) J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Adv. Neural Inf. Process. Syst. 25, pages 341–349. Curran Associates, Inc., 2012.
  • (43) L. Xu, J. S. Ren, C. Liu, and J. Jia. Deep convolutional neural network for image deconvolution. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Adv. Neural Inf. Process. Syst. 27, pages 1790–1798. Curran Associates, Inc., 2014.
  • (44) J. Yang, X. Liao, X. Yuan, P. Llull, D. J. Brady, G. Sapiro, and L. Carin. Compressive sensing by learning a Gaussian mixture model from measurements. IEEE Trans. Image Processing, 24(1):106–119, Jan. 2015.
  • (45) J. Yang, X. Yuan, X. Liao, P. Llull, D. J. Brady, G. Sapiro, and L. Carin. Video compressive sensing using Gaussian mixture models. IEEE Trans. Image Processing, 23(11):4863–4878, Nov. 2014.