1 Introduction
The subdivision of time by motion picture cameras, the frame rate, limits the temporal resolution of a camera system. Even though frame rate increases beyond a few tens of Hz may be imperceptible to human eyes, high-speed motion picture capture has long been a goal in the scientific imaging and cinematography communities. Despite the increasing availability of high-speed cameras through the reduction of hardware prices, fundamental restrictions still limit the maximum achievable frame rates.
Video compressive sensing (CS) aims at increasing the temporal resolution of a sensor by incorporating additional hardware components into the camera architecture and employing powerful computational techniques for high-speed video reconstruction. The additional components operate at higher frame rates than the camera's native temporal resolution, giving rise to low-frame-rate multiplexed measurements which can later be decoded to extract the unknown observed high-speed video sequence. Beyond its use for high-speed motion capture Llull2015 , video CS also has applications in coherent imaging (e.g., holography) for tracking high-speed events Wang2017 (e.g., particle tracking, observing moving biological samples). The benefits of video CS are even more pronounced for non-visible light applications where high-speed cameras are rarely available or prohibitively expensive (e.g., millimeter-wave imaging, infrared imaging) Babacan2011 ; Chen2015 .
Video CS comes in two incarnations, namely, spatial CS and temporal CS. Spatial video CS architectures stem from the well-known single-pixel camera Duarte2008 , which performs spatial multiplexing per measurement, and enable video recovery by expediting the capturing process. They either employ fast readout circuitry to capture information at video rates Chen2014 or parallelize the single-pixel architecture using multiple sensors, each one responsible for sampling a separate spatial area of the scene Chen2015 ; Wang2015 .
In this work, we focus on temporal CS, where multiplexing occurs across the time dimension. Figure 1 depicts this process, where a spatio-temporal volume of size $W_f \times H_f \times T$ is modulated by $T$ binary random masks during the exposure time of a single capture, giving rise to a coded frame of size $W_f \times H_f$.
We denote the vectorized versions of the unknown spatio-temporal volume and the captured coded frame as $x \in \mathbb{R}^{W_f H_f T}$ and $y \in \mathbb{R}^{W_f H_f}$, respectively. Each vectorized sampling mask is expressed as $\phi_t \in \{0,1\}^{W_f H_f}$, $t = 1, \dots, T$, giving rise to the measurement model

$y = \Phi x, \qquad (1)$

where $\Phi = \left[ \mathrm{diag}(\phi_1), \dots, \mathrm{diag}(\phi_T) \right]$ and $\mathrm{diag}(\cdot)$ creates a diagonal matrix from its vector argument.
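As a concrete illustration of equation (1), the sketch below simulates one coded capture; the block dimensions and the roughly 50% nonzero random mask are hypothetical choices, not the settings used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

W, H, T = 8, 8, 16          # hypothetical spatio-temporal volume size
x = rng.random((W * H, T))  # unknown volume, one column per (vectorized) frame
phi = (rng.random((W * H, T)) < 0.5).astype(float)  # binary masks, ~50% nonzeros

# Coded frame: y = sum_t diag(phi_t) x_t, i.e. an element-wise mask-and-sum
y = (phi * x).sum(axis=1)

# Equivalent matrix form y = Phi x with Phi = [diag(phi_1) ... diag(phi_T)]
Phi = np.hstack([np.diag(phi[:, t]) for t in range(T)])
assert np.allclose(Phi @ x.T.ravel(), y)
```

The element-wise form is what a physical mask implements per exposure; the matrix form makes the compression explicit, since $\Phi$ maps $W H T$ unknowns to $W H$ measurements.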
Various successful temporal CS architectures have been proposed. Their differences mainly involve the implementation of the random masks on the optical path (i.e., the measurement matrix $\Phi$ in Figure 1). Digital micromirror devices (DMDs), spatial light modulators (SLMs), and liquid crystal on silicon (LCoS) were used in Chen2015 ; Wang2015 ; Gao2014 ; Liu2013 ; Reddy2011 , while translating printed masks were employed in Koller2015 ; Llull2013b . Moreover, a few architectures have eliminated additional optical elements by directly programming the chip's readout mode through hardware circuitry modifications FernandezCull2014 ; Orchard2012 ; Spinoulas2015 .
Despite their reasonable performance, temporal CS architectures lack practicality. The main drawback is that existing reconstruction algorithms (e.g., using sparsity models Chen2015 ; Holloway2012 , combining sparsity with dictionary learning Liu2013 , or using Gaussian mixture models Yang2015 ; Yang2014 ) are often too computationally intensive, rendering the reconstruction process painfully slow. Even with parallel processing, recovery times make video CS prohibitive for modern commercial camera architectures. In this work, we address this problem by employing deep learning and show that video frames can be recovered in a few seconds at significantly improved reconstruction quality compared to existing approaches.
Our contributions are summarized as follows:

We present the first deep learning architecture for temporal video CS reconstruction, based on fully-connected neural networks, which learns to map temporal CS measurements directly to video frames. For such a task to be practical, a measurement mask with a repeated pattern is proposed.

We show that a simple linear regression-based approach learns to reconstruct video frames adequately at a minimal computational cost. Such a reconstruction could be used as an initial point for other video CS algorithms.

The learning paradigm is extended to deeper architectures, exhibiting reconstruction quality and computational cost improvements over previous methods.
2 Motivation and Related Work
Deep learning LeCun2015 is a burgeoning research field which has demonstrated state-of-the-art performance in a multitude of machine learning and computer vision tasks, such as image recognition He2015 or object detection Pinheiro2015 . In simple terms, deep learning tries to mimic the human brain by training large multi-layer neural networks with vast amounts of training samples describing a given task. Such networks have proven very successful in problems where analytical modeling is not easy or straightforward (e.g., a variety of computer vision tasks Krizhevsky2012 ; Lecun1998 ).
The popularity of neural networks in recent years has led researchers to explore the capabilities of deep architectures even in problems where analytical models exist and are well understood (e.g., restoration problems Burger2012 ; Schuler2013 ; Xie2012 ). Even though the performance improvement is not as pronounced as in classification problems, many proposed architectures have achieved state-of-the-art performance in problems such as deconvolution, denoising, inpainting, and super-resolution.
More specifically, investigators have employed a variety of architectures: deep fully-connected networks or multi-layer perceptrons (MLPs) Burger2012 ; Schuler2013 ; stacked denoising autoencoders (SDAEs) Xie2012 ; Agostinelli2013 ; Fleet2014 ; Vincent2010 , which are MLPs whose layers are pre-trained to provide improved weight initialization; convolutional neural networks (CNNs) Wang2015 ; Sun2015 ; Dong2015 ; Lecun1989 ; Ren2015 ; Li2014 ; and recurrent neural networks (RNNs) Yan2015 . Based on such success in restoration problems, we wanted to explore the capabilities of deep learning for the video CS problem. However, the majority of existing architectures involve outputs whose dimensionality is smaller than that of the input (e.g., classification) or of the same size (e.g., denoising/deblurring). Hence, devising an architecture that estimates $N$ unknowns given $M$ inputs, where $M < N$, is not necessarily straightforward. Two recent studies, utilizing SDAEs Mousavi2015 or CNNs Kulkarni2016 , have been presented on spatial CS for still images, exhibiting promising performance. Our work constitutes the first attempt to apply deep learning to temporal video CS. Our approach differs from prior 2D image restoration architectures Burger2012 ; Schuler2013 since we are recovering a 3D volume from 2D measurements.
3 Deep Networks for Compressed Video
3.1 Linear mapping
We started our investigation by posing the question: can training data be used to find a linear mapping $W$ such that $x = W y$? Essentially, this question asks for the inverse of $\Phi$ in equation (1), which, of course, does not exist. Clearly, such a matrix would be huge to store for full frames; instead, one can apply the same logic on video blocks Liu2013 .
We collect a set of training video blocks, denoted by $x_i$, of size $w \times h \times T$. Therefore, the measurement model per block is now $y_i = \Phi_b x_i$, where $y_i$ is of size $wh \times 1$ and $\Phi_b$ refers to the corresponding measurement matrix per block.
Collecting a set of $N$ video blocks, we obtain the matrix equation

$Y = \Phi_b X, \qquad (2)$

where $Y = [y_1, \dots, y_N]$, $X = [x_1, \dots, x_N]$, and $\Phi_b$ is the same for all blocks. The linear mapping we are after can be calculated as

$W = X Y^T \left( Y Y^T \right)^{-1}, \qquad (3)$

where $W$ is of size $whT \times wh$.
Intuitively, such an approach would not necessarily be expected to even provide a solution due to ill-posedness. However, it turns out that, if $N$ is sufficiently large and the matrix $\Phi_b$ has at least one nonzero in each row (i.e., each spatial location is sampled at least once over time), the estimation of the $x_i$'s from the $y_i$'s provides surprisingly good performance.
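The linear decoder of equation (3) amounts to a least-squares fit over the training blocks and can be computed directly. The sketch below uses hypothetical sizes and mask density, and forces at least one nonzero per spatial location so that $Y Y^T$ is invertible:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: wh pixels per patch, T frames, N training blocks
wh, T, N = 16, 4, 2000
X = rng.random((wh * T, N))               # vectorized training video blocks

# Random binary per-block mask, sampling every spatial location at least once
phi = rng.random((wh, T)) < 0.5
phi[np.arange(wh), rng.integers(0, T, wh)] = True
Phi_b = np.hstack([np.diag(phi[:, t].astype(float)) for t in range(T)])

Y = Phi_b @ X                             # per-block coded measurements (wh x N)

# Linear decoder of equation (3): W = X Y^T (Y Y^T)^{-1}
W = X @ Y.T @ np.linalg.inv(Y @ Y.T)

X_hat = W @ Y                             # reconstruct all blocks in one product
```

In practice `np.linalg.lstsq` (or a pseudo-inverse) is numerically preferable to forming $(Y Y^T)^{-1}$ explicitly; the explicit form is kept here to mirror equation (3).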
Specifically, we obtain measurements from a test video sequence by applying the same $\Phi_b$ per video block and then reconstruct all blocks using the learnt $W$. Figure 2 depicts the average peak signal-to-noise ratio (PSNR) and structural similarity metric (SSIM) Wang2004 for the reconstruction of video sequences using different realizations of the random binary matrix $\Phi_b$ with varying percentages of nonzero elements. The empty bars refer to realizations for which there was no solution due to the lack of nonzeros at some spatial location. In these experiments, $\Phi_b$ was selected to simulate the reconstruction of $T$ frames from a single captured coded frame.
3.2 Measurement Matrix Construction
Based on the performance in Figure 2, investigating the extension of the linear mapping to a nonlinear mapping using deep networks seemed increasingly promising. In order for such an approach to be practical, though, reconstruction has to be performed on blocks, and each block must be sampled with the same measurement matrix $\Phi_b$. Furthermore, such a measurement matrix should be realizable in hardware. Hence, we propose constructing a $\Phi$ which consists of repeated identical building blocks of size $w \times h \times T$, as presented in Figure 3. Such a matrix can be straightforwardly implemented on existing systems employing DMDs, SLMs or LCoS Chen2015 ; Wang2015 ; Gao2014 ; Liu2013 ; Reddy2011 . At the same time, in systems utilizing translating masks Koller2015 ; Llull2013b , a repeated mask can be printed and shifted appropriately to produce the same effect.
In the remainder of this paper, we select a building block $\Phi_b$ of size $w \times h \times T$ as a random binary matrix with a fixed percentage of nonzero elements and tile it to construct $\Phi$; the resulting compression ratio is $1 : T$. In addition, for the proposed matrix $\Phi$, each block is the same, allowing reconstruction of overlapping blocks of size $w \times h \times T$ with spatial overlap. Such overlap can usually aid in improving reconstruction quality. The particular percentage of nonzeros was a random choice, since the results of Figure 2 did not suggest that a specific percentage is particularly beneficial in terms of reconstruction quality.
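A repeated mask of this kind can be generated by tiling a single building block across the frame; the building-block and frame sizes below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)

w, h, T = 8, 8, 16                   # hypothetical building-block size
Wf, Hf = 256, 256                    # hypothetical frame size (multiples of w, h)
block = rng.random((w, h, T)) < 0.5  # random binary building block

# Tile the block across the frame so that every (w x h) patch aligned to the
# grid is sampled by the identical per-block measurement matrix Phi_b
mask = np.tile(block, (Wf // w, Hf // h, 1))
```

Because the tiled mask is periodic with period $(w, h)$, patches extracted at steps that are multiples of the block size (or of its repetition pattern) all share the same $\Phi_b$, which is exactly what block-wise reconstruction with a single learnt decoder requires.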
3.3 Multilayer Network Architecture
In this section, we extend the linear formulation to MLPs and investigate the performance of deeper structures.
Choice of Network Architecture. We consider an end-to-end MLP architecture to learn a nonlinear function $f(\cdot)$ that maps a measured frame patch $y$ via several hidden layers to a video block $x$, as illustrated in Figure 4. The MLP architecture was chosen for the problem of video CS reconstruction due to the following two considerations:

The first hidden layer should be a fully-connected layer that can produce a 3D signal from the compressed 2D measurements. This is necessary for temporal video CS since, in contrast to the super-resolution problem (or other related image reconstruction problems) where a low-resolution image is given as input, here we are given CS-encoded measurements. Thus, the convolution assumption does not hold, and therefore a convolutional layer cannot be employed as a first layer.

Following that, one could argue that the subsequent layers could be 3D convolutional layers Tran2015 . Although that sounds reasonable for our problem, in practice the small block size used in this paper does not allow for convolutions to be effective. Increasing the block size so that convolutions can be applied would dramatically increase the network complexity for 3D volumes such as videos; the first fully-connected layer alone would then contain an enormous number of parameters. Besides, such small block sizes have provided good reconstruction quality in dictionary learning approaches used for CS video reconstruction Liu2013 , where it was shown that choosing larger block sizes led to worse reconstruction quality.
Thus, MLPs (i.e., applying fully-connected layers throughout the network) were considered more reasonable in our work, and we found that, when applied to small blocks, they capture the motion and spatial details of videos adequately.
It is interesting to note here that another approach would be to try learning the mapping between $\Phi^T y$ and $x$, since the matrix $\Phi$ is known Mehta17 . Such an approach could provide better pixel localization, since $\Phi^T y$ places the values of $y$ in the corresponding pixel locations that were sampled to produce the summation in the temporal direction. However, such an architecture would require additional weights between the input and the first hidden layer, since the input would now be of size $whT$ instead of $wh$. This approach was tested and resulted in almost identical performance, albeit at a higher computational cost, hence it is not presented here.
Network Architecture Design.
As illustrated in Figure 4, each hidden layer $\mathcal{L}_k$, $k = 1, \dots, K$, is defined as

$h_k(y) = \sigma\left( b_k + W_k \, h_{k-1}(y) \right), \qquad (4)$

where $b_k$ is the bias vector and $W_k$ is the weight matrix containing the linear filters of layer $k$. $h_0(y) = y$ connects the input to the first hidden layer, while for the remaining hidden layers the input is the output of the previous layer. The last hidden layer is connected to the output layer via weights $W_{K+1}$ and biases $b_{K+1}$ without nonlinearity. The nonlinear function $\sigma(\cdot)$ is the rectified linear unit (ReLU) Nair2010 , defined as $\sigma(z) = \max(z, 0)$. In our work we considered two different network architectures, one with 4 and another with 7 hidden layers. To train the proposed MLP, we learn all the weights and biases of the model. The set of parameters, denoted as $\Theta = \{ W_1, \dots, W_{K+1}, b_1, \dots, b_{K+1} \}$, is updated by the backpropagation algorithm Rumelhart1988 , minimizing the quadratic error between the set of training mapped measurements and the corresponding video blocks. The loss function is the mean squared error (MSE), given by

$L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| f(y_i; \Theta) - x_i \right\|_2^2. \qquad (5)$

The MSE was used in this work since our goal is to optimize the PSNR, which is directly related to the MSE.
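A minimal sketch of the forward pass of equation (4) and the loss of equation (5), with a 4-hidden-layer configuration; the layer widths and input/output sizes are hypothetical, not the trained models' dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sizes: wh-dimensional measurement in, wh*T-dimensional block out
wh, whT = 64, 64 * 16
sizes = [wh, 512, 512, 512, 512, whT]   # 4 hidden layers of hypothetical width

# Uniform initialization scaled by the previous layer's size
params = [(rng.uniform(-1.0, 1.0, (n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(y, params):
    """h_k = sigma(b_k + W_k h_{k-1}), with no nonlinearity at the output layer."""
    h = y
    for i, (Wk, bk) in enumerate(params):
        h = Wk @ h + bk
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)       # ReLU, sigma(z) = max(z, 0)
    return h

def mse_loss(x_hat, x):
    """Mean squared error of equation (5) for a single training pair."""
    return float(np.mean((x_hat - x) ** 2))

y = rng.random(wh)
x_hat = forward(y, params)
```

The actual models were trained with backpropagation in a deep learning framework; this sketch only fixes the layer structure and loss.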
4 Experiments
We compare our proposed deep architecture with state-of-the-art approaches both quantitatively and qualitatively. The proposed approaches are evaluated both assuming noiseless measurements and under the presence of measurement noise. Finally, we investigate the performance of our methods under different network parameters (e.g., number of layers) and training set sizes. The metrics used for evaluation were the PSNR and SSIM.
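Since PSNR is the primary metric, a reference implementation helps fix conventions. The sketch below assumes intensities normalized to a known peak value, which is an assumption on our part since the normalization is not restated here:

```python
import numpy as np

def psnr(x, x_ref, peak=1.0):
    """Peak signal-to-noise ratio in dB, for signals with values in [0, peak]."""
    mse = np.mean((np.asarray(x, float) - np.asarray(x_ref, float)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

For example, a reconstruction that differs from the reference by a constant 0.1 (with peak 1.0) scores 20 dB, since MSE = 0.01 and $10 \log_{10}(1/0.01) = 20$.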
4.1 Training Data Collection
For deep neural networks, increasing the number of training samples is usually synonymous with improved performance. We collected a diverse set of training samples using high-definition videos from YouTube, depicting natural scenes, which were converted to grayscale. All videos are unrelated to the test set. We randomly extracted 10 million video blocks of size $w \times h \times T$, while keeping the number of blocks extracted per video proportional to its duration. These data were used as output, while the corresponding input was obtained by multiplying each sample with the measurement matrix $\Phi_b$ (see subsection 3.2 for details). Example frames from the video sequences used for training are shown in Figure 5.
4.2 Implementation Details
Our networks were trained for a large number of iterations using mini-batch Stochastic Gradient Descent (SGD). We normalized the input per-feature to zero mean and standard deviation one. The weights of each layer were initialized to random values uniformly distributed in $(-1/\sqrt{s}, 1/\sqrt{s})$, where $s$ is the size of the previous layer Xavier2010 . The starting learning rate was reduced after a fixed number of iterations. The momentum was set to 0.9, and we further used norm-based gradient clipping with a fixed threshold to keep the gradients in a certain range. Gradient clipping is a widely used technique in recurrent neural networks to avoid exploding gradients Pascanu2013 .

4.3 Comparison with Previous Methods
We compare our method with the stateoftheart video compressive sensing methods:

GMM-TP, a Gaussian mixture model (GMM)-based algorithm Yang2014 .

MMLE-GMM, a maximum marginal likelihood estimator (MMLE) that maximizes the likelihood of the GMM of the underlying signals given only their linear compressive measurements Yang2015 .
For temporal CS reconstruction, data-driven models usually perform better than standard sparsity-based schemes Yang2015 ; Yang2014 . Indeed, both GMM-TP and MMLE-GMM have demonstrated superior performance compared to existing approaches in the literature, such as Total Variation (TV) or dictionary learning Liu2013 ; Yang2015 ; Yang2014 ; hence we did not include experiments with the latter methods.
For GMM-TP Yang2014 , we followed the settings proposed by the authors and used our training data (randomly selecting a subset of samples) to train the underlying GMM parameters. We found that our training data provided better performance compared to the data used by the authors. In our experiments, we denote this method by GMM4, referring to reconstruction of overlapping blocks with a spatial overlap of 4 pixels, as discussed in subsection 3.2.
MMLE Yang2015 is a self-training method, but it is sensitive to initialization. Satisfactory performance is obtained only when MMLE is combined with a good starting point. In Yang2015 , GMM-TP Yang2014 with fully overlapping patches (denoted in our experiments as GMM1) was used to initialize the MMLE. We denote the combined method as GMM1+MMLE. For fairness, we also conducted experiments where our method is used as a starting point for the MMLE.
Table 1: Average PSNR and SSIM for each test video sequence (Electric Ball, Horse, Bow & Arrow, Bus, Dogs, City, Crew, Filament, Hammer, Football, Kayak, Porsche, Golf, and Basketball) reconstructed with W10M, FC710M, GMM4 Yang2014 , GMM1 Yang2015 , FC710M+MMLE, and GMM1+MMLE Yang2015 . The bottom row reports the average reconstruction time per method.
In our methods, a collection of overlapping patches of size $w \times h$ is extracted from each coded measurement and subsequently reconstructed into video blocks of size $w \times h \times T$. Overlapping areas of the recovered video blocks are then averaged to obtain the final video reconstruction results, as depicted in Figure 4. The step of the overlapping patches was chosen to match the special construction of the utilized measurement matrix, as discussed in subsection 3.2.
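The averaging of overlapping reconstructed blocks can be sketched as follows; the block size, step, and frame size are hypothetical stand-ins for the actual settings:

```python
import numpy as np

def assemble(blocks, positions, frame_shape, block_shape):
    """Average overlapping reconstructed blocks into the full video volume.

    blocks:    list of (w, h, T) arrays, one per extracted patch
    positions: list of (row, col) top-left corners of each patch
    """
    w, h, T = block_shape
    out = np.zeros(frame_shape + (T,))
    count = np.zeros(frame_shape + (T,))
    for blk, (r, c) in zip(blocks, positions):
        out[r:r+w, c:c+h] += blk
        count[r:r+w, c:c+h] += 1
    return out / np.maximum(count, 1)   # guard against uncovered pixels

# Hypothetical example: 8x8x4 blocks on a 12x12 frame with a step of 4 pixels
w, h, T = 8, 8, 4
positions = [(r, c) for r in (0, 4) for c in (0, 4)]
blocks = [np.ones((w, h, T)) for _ in positions]
video = assemble(blocks, positions, (12, 12), (w, h, T))
```

Each output pixel is the mean of all block estimates covering it, which suppresses blocking artifacts at patch boundaries.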
We consider six different architectures:

W10M, a simple linear mapping (equation (3)) trained on 10 million samples.

FC41M, a 4-hidden-layer MLP trained on 1 million samples (randomly selected from our 10 million samples).

FC410M, a 4-hidden-layer MLP trained on 10 million samples.

FC71M, a 7-hidden-layer MLP trained on 1 million samples (randomly selected from our 10 million samples).

FC710M, a 7-hidden-layer MLP trained on 10 million samples.

FC710M+MMLE, a 7-hidden-layer MLP trained on 10 million samples, used as an initialization to the MMLE Yang2015 method.

Note that the subset of 1 million randomly selected samples used for training FC41M and FC71M was the same.
Our test set consists of fourteen video sequences. They involve a set of videos that were used for dictionary training in Liu2013 , provided by the authors, as well as the “Basketball” video sequence used in Yang2015 . All video sequences are unrelated to the training set (see subsection 4.1 for details). For fair comparisons, the same measurement mask was used in all methods, according to subsection 3.2. All code implementations are publicly available, provided by the respective authors.
4.4 Reconstruction Results
Quantitative reconstruction results for all video sequences with all tested algorithms are presented in Table 1, and average performance is summarized in Figure 7. The reported metrics refer to the average performance for the reconstruction of the first frames of each video sequence, using the corresponding consecutive captured coded frames through the video CS measurement model of equation (1). In both Table 1 and Figure 7, results are divided into two parts. The first part lists the reconstruction performance of the tested approaches without the MMLE step, while the second compares the performance of the best candidates among the proposed and previous methods, respectively, with a subsequent MMLE step Yang2015 . In Table 1, the best performing algorithms are highlighted for each part, while the bottom row presents the average time required to recover $T$ video frames from a single captured coded frame.
Our FC710M and FC710M+MMLE yield the highest PSNR and SSIM values for all video sequences. Specifically, FC710M achieves a consistent average PSNR improvement over GMM1 Yang2015 . When these two methods are used to initialize the MMLE Yang2015 algorithm, FC710M+MMLE maintains an average PSNR gain over GMM1+MMLE Yang2015 . Notice also that FC710M alone achieves higher average PSNR than the combined GMM1+MMLE. The highest PSNR and SSIM values over all test sequences are reported for the FC710M+MMLE method. However, the average reconstruction time for this method is almost two hours, while the second best, FC710M, requires only a few seconds at a slightly lower average PSNR. We conclude that, when time is critical, FC710M should be the preferred reconstruction method.
Qualitative results for selected video frames are shown in Figure 6. The proposed MLP architectures, including the linear regression model, favorably recover motion, while the additional hidden layers focus on improving the spatial resolution of the scene (see supplementary material for example reconstructed videos). One can clearly observe the sharper edges and high-frequency details produced by the FC710M and FC710M+MMLE methods compared to previously proposed algorithms.
Due to the extremely long reconstruction times of previous methods, the results presented in Table 1 and Figure 7 refer to only the first frames of each video sequence, as mentioned above. Figure 8 compares the PSNR over all frames of the test video sequences using our FC710M algorithm and the fastest previous method, GMM4 Yang2014 , while Figure 9 depicts representative snapshots for some of them. The varying PSNR performance across the frames of each reconstructed block is consistent for both algorithms and is reminiscent of the reconstruction tendency observed in other video CS papers in the literature Koller2015 ; Llull2013b ; Yang2015 ; Yang2014 .
4.5 Reconstruction Results with Noise
Previously, we evaluated the proposed algorithms assuming noiseless measurements. In this subsection, we investigate the performance of the presented deep architectures under the presence of measurement noise. Specifically, the measurement model of equation (1) is now modified to

$y = \Phi x + n, \qquad (6)$

where $n$ is the additive measurement noise vector.
We employ our best architecture, utilizing 7 hidden layers, and follow two different training schemes. In the first one, the network is trained on the 10 million samples, as discussed in subsection 4.3 (i.e., the same FC710M network as before), while in the second, the network is trained using the same data pairs after adding random Gaussian noise to each vector $y_i$. Each vector was corrupted with a level of noise such that the signal-to-noise ratio (SNR) is uniformly selected from a range of values, giving rise to a set of noisy samples for training. We denote the network trained on the noisy dataset as FC7N10M.
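Corrupting a measurement at a prescribed SNR can be sketched as follows; the SNR range and measurement size below are hypothetical placeholders, not the values used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(4)

def add_noise_at_snr(y, snr_db, rng):
    """Corrupt y with white Gaussian noise at the requested SNR (in dB)."""
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return y + rng.normal(0.0, np.sqrt(noise_power), y.shape)

y = rng.random(4096)                   # a hypothetical coded measurement vector
snr_db = rng.uniform(20.0, 40.0)       # hypothetical per-sample SNR draw
y_noisy = add_noise_at_snr(y, snr_db, rng)
```

Drawing a fresh SNR per training vector exposes the network to a distribution of noise severities rather than a single fixed level.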
We now compare the performance of the two proposed architectures with the previous methods GMM4 and GMM1 using measurement noise. We did not include experiments with the MMLE counterparts of the algorithms since, as we observed earlier, the performance improvement is always related to the starting point of the MMLE algorithm. Figure 10 shows the average performance comparison for the reconstruction of the first frames of each tested video sequence under different levels of measurement noise while Figure 11 depicts example reconstructed frames.
As we can observe, the network trained on noiseless data (FC710M) provides good performance for low measurement noise and reaches performance similar to GMM1 for more severe noise levels. The network trained on noisy data (FC7N10M) proves more robust to noise severity, achieving better performance than GMM1 under all tested noise levels.
Despite proving more robust to noise, our algorithms generally recover motion favorably, but for high noise levels residual noise remains throughout the reconstructed scene (observe the results for the highest noise level in Figure 11). Such degradation could be combated by cascading our architecture with a deep denoising architecture (e.g., Burger2012 ) or a denoising algorithm to remove the noise artifacts. Ideally, for a specific camera system, data would be collected with that system and the deep architecture trained on them, so that it incorporates the noise characteristics of the underlying sensor.
4.6 Run Time
Run time comparisons for several methods are presented in the bottom row of Table 1. All previous approaches are implemented in MATLAB. Our deep learning methods are implemented in the Caffe package Jia2014 , and all algorithms were executed on the same machine. We observe that the deep learning approaches outperform the previous approaches by several orders of magnitude in speed. Note that a direct comparison between the methods is not trivial due to the different implementations. Nevertheless, previous methods solve an optimization problem during reconstruction, while our MLP is a feed-forward network that requires only a few matrix-vector multiplications.

4.7 Number of Layers and Dataset Size
From Figure 7, we observe that, as the number of training samples increases, the performance consistently improves. However, the improvement achieved by increasing the number of layers (from 4 to 7) for architectures trained on small datasets (e.g., 1M samples) is not significant (performance is almost the same). This is perhaps expected, as one may argue that, in order to achieve higher performance with extra layers (and thus more parameters to train), more training data would be required. Intuitively, adding hidden layers enables the network to learn more complex functions. Indeed, reconstruction performance on our 10 million sample dataset is slightly higher for FC710M than for FC410M: the average PSNR over all test videos is 32.66 dB for FC410M and 32.91 dB for FC710M. This suggests that 4 hidden layers are sufficient to learn the mappings in our 10M training set. However, we wanted to explore the possible performance benefits of adding extra hidden layers to the network architecture.
In order to provide more insight regarding the slight performance improvement of FC710M over FC410M, we visualize in Figure 12 an example video block from our training set and its respective reconstruction using the two networks. We observe that FC710M reconstructs the patches of the video block slightly better than FC410M, suggesting that the additional parameters help in fitting the training data more accurately. Furthermore, we observed that the reconstruction performance on our validation set was better for FC710M than for FC410M. Note that a small validation set was kept for tuning the hyper-parameters during training and that we also employed weight regularization (a norm penalty on the weights) to prevent overfitting. Increasing the number of hidden layers further did not help in our experiments, as we did not observe any additional performance improvement on our validation set. Thus, we found that learning to reconstruct training patches accurately was important in our problem.
5 Conclusions
To the best of our knowledge, this work constitutes the first deep learning architecture for temporal video compressive sensing reconstruction. We demonstrated superior performance compared to existing algorithms while reducing reconstruction time to a few seconds. At the same time, we focused on the applicability of our framework to existing compressive camera architectures, suggesting that their commercial use could be viable. We believe that this work can be extended in three directions: 1) exploring the performance of variant architectures such as RNNs, 2) investigating the training of deeper architectures, and 3) examining the reconstruction performance on real video sequences acquired by a temporal compressive sensing camera.
References
 (1) F. Agostinelli, M. R. Anderson, and H. Lee. Adaptive multicolumn deep neural networks with application to robust image denoising. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Adv. Neural Inf. Process. Syst. 26, pages 1493–1501. Curran Associates, Inc., 2013.
 (2) S. D. Babacan, M. Luessi, L. Spinoulas, A. K. Katsaggelos, N. Gopalsami, T. Elmer, R. Ahern, S. Liao, and A. Raptis. Compressive passive millimeter-wave imaging. In IEEE Int. Conf. Image Processing, pages 2705–2708, Sept. 2011.

 (3) H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 2392–2399, June 2012.
 (4) H. Chen, M. S. Asif, A. C. Sankaranarayanan, and A. Veeraraghavan. FPA-CS: Focal plane array-based compressive imaging in short-wave infrared. In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 2358–2366, June 2015.
 (5) H. Chen, Z. Weng, Y. Liang, C. Lei, F. Xing, M. Chen, and S. Xie. High speed single-pixel imaging via time domain compressive sampling. In CLEO: 2014, page JTh2A.132. Optical Society of America, 2014.
 (6) Z. Cui, H. Chang, S. Shan, B. Zhong, and X. Chen. Deep network cascade for image super-resolution. In Computer Vision – ECCV 2014, volume 8693 of Lecture Notes in Computer Science, pages 49–64. Springer International Publishing, 2014.
 (7) C. Dong, C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell., 38(2):295–307, Feb. 2016.
 (8) M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk. Single-pixel imaging via compressive sampling. IEEE Signal Process. Mag., 25(2):83–91, Mar. 2008.
 (9) C. Fernandez-Cull, B. M. Tyrrell, R. D’Onofrio, A. Bolstad, J. Lin, J. W. Little, M. Blackwell, M. Renzi, and M. Kelly. Smart pixel imaging with computational-imaging arrays. In Proc. SPIE, volume 9070, pages 90703D–90703D–13, 2014.
 (10) L. Gao, J. Liang, C. Li, and L. V. Wang. Single-shot compressed ultrafast photography at one hundred billion frames per second. Nature, 516:74–77, 2014.

 (11) X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Y. W. Teh and M. Titterington, editors, Proc. Int. Conf. Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, May 2010. PMLR.
 (12) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 770–778, June 2016.
 (13) J. Holloway, A. C. Sankaranarayanan, A. Veeraraghavan, and S. Tambe. Flutter shutter video camera for compressive sensing of videos. In IEEE Int. Conf. Comp. Photography, pages 1–9, April 2012.
 (14) Y. Huang, W. Wang, and L. Wang. Bidirectional recurrent convolutional networks for multi-frame super-resolution. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Adv. Neural Inf. Process. Syst. 28, pages 235–243. Curran Associates, Inc., 2015.
 (15) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proc. ACM Int. Conf. Multimedia, MM ’14, pages 675–678, New York, NY, USA, 2014. ACM.
 (16) R. Koller, L. Schmid, N. Matsuda, T. Niederberger, L. Spinoulas, O. Cossairt, G. Schuster, and A. K. Katsaggelos. High spatiotemporal resolution video with compressed sensing. Opt. Express, 23(12):15992–16007, June 2015.
 (17) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Adv. Neural Inf. Process. Syst. 25, pages 1097–1105. Curran Associates, Inc., 2012.
 (18) K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok. ReconNet: Non-iterative reconstruction of images from compressively sensed measurements. In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 449–458, June 2016.
 (19) Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
 (20) Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Dec. 1989.
 (21) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, Nov. 1998.
 (22) D. Liu, J. Gu, Y. Hitomi, M. Gupta, T. Mitsunaga, and S. K. Nayar. Efficient space-time sampling with pixel-wise coded exposure for high-speed imaging. IEEE Trans. Pattern Anal. Mach. Intell., 36(2):248–260, Feb. 2014.
 (23) P. Llull, X. Liao, X. Yuan, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady. Coded aperture compressive temporal imaging. Opt. Express, 21(9):10526–10545, May 2013.
 (24) P. Llull, X. Yuan, X. Liao, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady. Temporal Compressive Sensing for Video, pages 41–74. Springer International Publishing, Cham, 2015.
 (25) J. Mehta and A. Majumdar. RODEO: Robust de-aliasing autoencoder for real-time medical image reconstruction. Pattern Recognition, 63:499–510, 2017.
 (26) A. Mousavi, A. B. Patel, and R. G. Baraniuk. A deep learning approach to structured signal recovery. In Annual Allerton Conf. Communication, Control, and Computing, pages 1336–1343, Sept. 2015.
 (27) V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In J. Fürnkranz and T. Joachims, editors, Proc. Int. Conf. Machine Learning, pages 807–814. Omnipress, 2010.
 (28) G. Orchard, J. Zhang, Y. Suo, M. Dao, D. T. Nguyen, S. Chin, C. Posch, T. D. Tran, and R. Etienne-Cummings. Real-time compressive sensing video reconstruction in hardware. IEEE Trans. Emerg. Sel. Topics Circuits Syst., 2(3):604–615, Sept. 2012.
 (29) R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In S. Dasgupta and D. McAllester, editors, Proc. Int. Conf. Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310–1318, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
 (30) P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In Proc. Int. Conf. Neural Inf. Process. Systems, NIPS’15, pages 1990–1998, Cambridge, MA, USA, 2015. MIT Press.
 (31) D. Reddy, A. Veeraraghavan, and R. Chellappa. P2C2: Programmable pixel compressive camera for high speed imaging. In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 329–336, June 2011.
 (32) J. S. Ren, L. Xu, Q. Yan, and W. Sun. Shepard convolutional neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Adv. Neural Inf. Process. Syst. 28, pages 901–909. Curran Associates, Inc., 2015.
 (33) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. In Neurocomputing: Foundations of Research, pages 696–699. MIT Press, Cambridge, MA, USA, 1988.
 (34) C. Schuler, H. Burger, S. Harmeling, and B. Schölkopf. A machine learning approach for non-blind image deconvolution. In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 1067–1074, June 2013.
 (35) L. Spinoulas, K. He, O. Cossairt, and A. Katsaggelos. Video compressive sensing with on-chip programmable subsampling. In Proc. IEEE Conf. Comp. Vision Pattern Recognition Workshops, pages 49–57, June 2015.
 (36) J. Sun, W. Cao, Z. Xu, and J. Ponce. Learning a convolutional neural network for non-uniform motion blur removal. In Proc. IEEE Conf. Comp. Vision Pattern Recognition, pages 769–777, June 2015.
 (37) D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In IEEE Int. Conf. Computer Vision, pages 4489–4497, Dec. 2015.
 (38) P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, Dec. 2010.
 (39) J. Wang, M. Gupta, and A. C. Sankaranarayanan. LiSens: A scalable architecture for video compressive sensing. In Proc. IEEE Conf. Comp. Photography, pages 1–9, April 2015.
 (40) Z. Wang, A. C. Bovik, H. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, April 2004.
 (41) Z. Wang, L. Spinoulas, K. He, L. Tian, O. Cossairt, A. K. Katsaggelos, and H. Chen. Compressive holographic video. Opt. Express, 25(1):250–262, Jan. 2017.
 (42) J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Adv. Neural Inf. Process. Syst. 25, pages 341–349. Curran Associates, Inc., 2012.
 (43) L. Xu, J. S. Ren, C. Liu, and J. Jia. Deep convolutional neural network for image deconvolution. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Adv. Neural Inf. Process. Syst. 27, pages 1790–1798. Curran Associates, Inc., 2014.
 (44) J. Yang, X. Liao, X. Yuan, P. Llull, D. J. Brady, G. Sapiro, and L. Carin. Compressive sensing by learning a Gaussian mixture model from measurements. IEEE Trans. Image Process., 24(1):106–119, Jan. 2015.
 (45) J. Yang, X. Yuan, X. Liao, P. Llull, D. J. Brady, G. Sapiro, and L. Carin. Video compressive sensing using Gaussian mixture models. IEEE Trans. Image Process., 23(11):4863–4878, Nov. 2014.