Solving Inverse Computational Imaging Problems using Deep Pixel-level Prior
Generative models based on deep neural networks are quite powerful in modelling natural image statistics. In particular, deep autoregressive models provide state-of-the-art performance, in terms of log-likelihood scores, by modelling tractable densities over the image manifold. In this work, we employ a learned deep autoregressive model as a data prior for solving different inverse problems in computational imaging. We demonstrate how our approach can reconstruct images which have better pixel-level consistency than the existing deep autoencoder based approaches. We also show how randomly dropping the update of some pixels in every iteration helps in a better image reconstruction. We test our approach on three computational imaging setups: Single Pixel Camera, LiSens and FlatCam, with real and simulated measurements. We obtain better reconstructions than the state-of-the-art methods for these problems, in terms of both perceptual quality and quantitative metrics such as PSNR and SSIM.
Computational imaging systems enable us to extract much more information from the visual world than traditional imaging systems. This is achieved by jointly designing optics, to encode the desired signal information, and algorithms, to reconstruct the signal from those measurements. Signal reconstruction corresponds to inverting the forward model used in acquiring the measurements. Hence, reconstruction algorithms for different computational imaging devices amount to solving different inverse problems. Solving these inverse problems becomes challenging as they are often ill-posed. For compressive imaging setups such as Single Pixel Camera (SPC) [2], [3], high speed imaging [4], [5] and compressive hyperspectral imaging [6], the reconstruction becomes ill-posed as the number of measurements is much smaller than the signal dimension.
Generally, to solve ill-posed problems, we need to incorporate prior information about the signal to be reconstructed. Traditionally, these priors are either analytically derived or handcrafted based on observations. For example, sparsity of image gradients [7] and sparsity of coefficients in the wavelet and DCT domains [8] have been used for solving inverse imaging problems. However, the underlying data distribution may not precisely follow these analytic priors, leading to poor solutions in challenging scenarios. Dictionary learning [9] methods, being data driven, are an improvement over these analytic priors. However, being limited by patch size, they cannot account for the long range dependencies which are necessary for handling global multiplexing in compressive image reconstruction.
On the other hand, deep learning based reconstruction algorithms have recently led to state-of-the-art results in solving such ill-posed problems in computational imaging [10], [11], [12], [13]. These approaches typically learn an inverse mapping from measurements to the signal by minimizing the reconstruction loss on a set of training examples. However, this kind of training, popularly known as discriminative learning, makes the network task specific. Furthermore, we need to retrain the network for various parameter settings of the forward model. For example, for every new setting of measurement rate and sensing matrix in SPC, we need to relearn the network parameters. Instead of having to design or retrain a different network for each task and parameter setting, it would be more efficient to have a generalized framework which can be used for solving various inverse problems.
A more flexible approach would be to learn the natural image statistics using a generative model and use it for solving various inverse problems. Recently, deep generative models, especially those using the autoregressive framework [14, 15, 16], have led to state-of-the-art performance in modeling the natural image manifold. Autoregressive models factorize the image distribution as a 2D directed causal graph and hence model it as a 1D sequence where the current pixel's distribution is conditioned on the causal context. By employing deep neural networks for summarizing the causal context, autoregressive models excel at capturing long range dependencies in images. Also, being pixel-level models, they explicitly account for higher order correlations like texture patterns and sharp edges within a neighbourhood. Thus, these models are capable of generating visually convincing and crisp images [14]. Examples of deep autoregressive image models are the recurrent image density estimator (RIDE) [15], pixel recurrent neural networks (PixelRNN) and their CNN equivalent (PixelCNN) [14], and PixelCNN++ [16].
We show that deep autoregressive generative models are ideally suited for solving various computational imaging problems for the following reasons. First, they explicitly model the distribution of each pixel in relation to its causal neighbors. Thus, when used as an image prior, this explicit pixel dependency modeling helps to better reconstruct low level details without artifacts (see Figure 1). Second, this framework gives us an explicit expression for the image prior, which can be used for MAP inference. Moreover, the entire framework is differentiable, which makes it amenable to gradient based inference. Third, their ability to capture long range dependencies in images makes them ideal for handling global multiplexing in compressive imaging setups. Given these advantages of deep autoregressive models, we use them for solving various computational imaging problems such as Single Pixel Camera (SPC) [2], Line Sensor (LiSens) [3] and lensless imaging with FlatCam [17]. Our results demonstrate that we perform better than the current state-of-the-art methods among both traditional and learning based approaches.
In summary we make the following contributions:
We propose a versatile approach which employs the same learned prior model for solving various computational imaging problems.
We utilize backpropagation to the inputs for obtaining tractable estimates of the prior gradients and employ them for solving inverse problems using MAP inference.
We observe that randomly dropping the gradient updates for a certain percentage of pixels at every iteration helps in reconstructing the texture better. We analyze the effect of this pixel dropout ratio on the quality of reconstructions.
We demonstrate better reconstructions than the existing state-of-the-art methods for three computational imaging problems: Single Pixel Camera, LiSens, and FlatCam.
Compressive imaging: Single Pixel Camera (SPC) [2] is a classic example of compressive imaging. It uses a programmable digital micromirror device (DMD) array to multiplex the scene onto a single photodetector. Using different settings on the DMD, we can sequentially acquire a set of measurements. Thus, the scene at full resolution is reconstructed from far fewer measurements than pixels. Compressive imaging systems pose a viable solution for high resolution imaging in non-visible parts of the spectrum, where full frame sensors are very expensive.
The measurement bandwidth of the SPC is limited by the operating speed of the DMDs (tens of kHz for commercially available units). At this speed, SPC cannot be extended to high resolution video sensing. On one end, we have exorbitantly expensive full frame sensors (Nyquist sampling) for high resolution imaging in non-visible bands, and on the other, we have SPC, an inexpensive compressive sensing setup but with low measurement rates. Wang et al. [3] propose LiSens, a line sensor based compressive camera which lies midway between these two imaging extremes. Each pixel in the line sensor is mapped to a row in the DMD array. Thus, unlike SPC, where the whole scene is multiplexed, here only rows of the scene are multiplexed.
Lensless imaging: FlatCam [17] and DiffuserCam [18] are novel imaging systems which get rid of the conventional lens optics. Instead, they use an amplitude mask and a diffuser, respectively, to encode light coming from different parts of the scene onto the sensor. As a result, information localized at a point in the scene gets spread throughout the sensor, making priors essential for accurate recovery of the image. These works use traditional reconstruction algorithms such as Total Variation norm and Tikhonov regularization, which are quick but do not provide natural looking reconstructions.
Reconstruction with analytical priors: Many algorithms have been proposed for compressive image reconstruction. Typically, reconstruction algorithms use regularization, exploiting the sparsity of spatial gradients in natural images. The Total Variation (TV) minimization prior [7, 19] is the most commonly used prior based on this sparsity. Chengbo et al. [20] propose an efficient augmented Lagrangian based TV minimization for CS reconstruction. Recent approaches involving compressive architectures such as FPA-CS [21], LiSens [3], and video CS [22] demonstrated successful results with the TV minimization prior. However, at lower measurement rates, reconstructions suffer from the piecewise smooth modeling of the TV prior and results tend to be blocky, as noted by recent works [10, 23]. Metzler et al. [24] propose a denoiser based CS reconstruction algorithm. Specifically, they use a Gaussian denoiser with the approximate message passing algorithm (D-AMP). At very low measurement rates, the denoiser tends to produce overly smooth images, as recently shown by Dave et al. [23] and Kulkarni et al. [10].
Data driven CS reconstruction: Duarte et al. [25] propose an approach for simultaneous learning of the sensing matrix and dictionary atoms. Due to the small patch size of the atoms, their usage for compressive image reconstruction is limited to local multiplexing, unlike the actual SPC, which involves global multiplexing of the scene. Reconstruction algorithms using convolutional neural networks (CNNs) typically take measurements from an image patch as input and try to output the image by minimizing the reconstruction loss. Kulkarni et al. [10] proposed ReconNet, and Yao et al. [11] proposed a network with residual connections for reconstruction. Although these approaches lead to non-iterative and hence faster inference, being task specific, they only work for the fixed settings of the sensing matrix and measurement rates used for training. Changing the settings requires retraining the architecture, which is not very appealing. Also, being patch-wise, they fail to account for global multiplexing in SPC.
Deep generative models: With the success of deep neural networks, there have been multiple works proposing deep generative models, which explicitly or implicitly try to model the distribution of natural images. Examples include latent representation models like generative adversarial networks (GANs) by Goodfellow et al. [26] and variational autoencoders (VAEs) by Kingma et al. [27], and autoregressive models like RIDE by Theis et al. [15], PixelRNN/PixelCNN by Oord et al. [14], and PixelCNN++ by Salimans et al. [16]. GANs learn to transform samples from a Gaussian distribution into samples on the natural image manifold via a generator network, which is trained with an adversarial learning framework involving a discriminator network. VAEs are a probabilistic framework of autoencoders that learn to encode and decode the images from a distribution.
Autoregressive models factorize an image as a 2D directed graph by conditioning the distribution of the current pixel $x_{ij}$ on the pixels before it in a raster scan, $x_{<ij}$. Modeling this conditional density is analogous to sequence modeling, and initial methods proposed to use spatial 2D recurrent neural networks, given their efficacy in modeling sequences. RIDE by Theis et al. [15] uses 2D Long Short Term Memory (LSTM) units called SpatialLSTMs for modeling the causal context, and Gaussian scale mixtures (GSMs) for parametrizing the distribution. PixelRNN by Oord et al. [14] uses a much more complex architecture with LSTMs and residual connections to better handle the causal context. Importantly, it models the conditional density as a discrete distribution over the 256 possible pixel values. PixelRNN has resulted in state-of-the-art negative log-likelihood (NLL) scores. However, due to the sequential nature of distribution modeling, both training and sampling are computationally demanding, with the runtime growing as $O(N)$, where $N$ is the total number of pixels. Oord et al. proposed PixelCNN, which is a convolutional version of PixelRNN. This improved the training time by a large factor at the cost of a slight loss in accuracy, since with convolutions we can only capture a bounded context. Salimans et al. [16] proposed PixelCNN++, which builds on PixelCNN by employing a discretized mixture of logistics for modeling the distribution, dropout regularization, and additional skip connections. It improves on the NLL score of PixelRNN on the CIFAR-10 dataset, leading to state-of-the-art results.
Deep image priors: When solving linear inverse problems using the alternating direction method of multipliers (ADMM) algorithm, Venkatakrishnan et al. [28] observed that it results in two decoupled optimizations. The first one enforces the data prior while the second enforces data fidelity to the observation. The first step can be thought of as a denoising problem; thus, a denoiser can be employed to solve this step, thereby avoiding the need for an explicit image prior. Venkatakrishnan et al. [28] use denoisers like BM3D [29] in the ADMM setting for image restoration. Inspired by this, recent methods propose learning-based proximal operators for the denoising step of ADMM. Examples include OneNet by Chang et al. [1], the CNN denoiser by Zhang et al. [30], and the work of Meinhardt et al. [31]. In this work, we compare our explicit natural image prior based MAP inference with the learned proximal operator of OneNet.
Our evaluations show that our results are superior to OneNet. It is important to note that OneNet’s proximal operator uses adversarial loss [26] which is known to result in sharper recovery of details.
In this paper, we extend our previous work, Dave et al. [23] (RIDE-CS), where we used the recurrent image density estimator (RIDE) for CS reconstruction. We observed that the sequential nature of the recurrent networks in RIDE makes it too slow for inference and training (the computational cost is proportional to the image size). Also, in our experiments, the two layer RIDE fails to yield results comparable to recent approaches like OneNet [1]. Here, we explore sophisticated deep autoregressive models which are an order of magnitude faster than RIDE-CS for both training and inference. We apply the deep autoregressive model based inference to recent frameworks in computational imaging like LiSens [3] and FlatCam [17]. We enhance the inference algorithm by incorporating the augmented Lagrangian method when necessary. In addition, we improve texture recovery using pixel-wise stochastic gradient updates.
Consider $X$ to be a matrix corresponding to a natural image and $f(\cdot)$ to be a linear transformation corresponding to the forward model of a computational camera. The measurements obtained, $Y$, can be written as $Y = f(X)$. Our goal is to reconstruct the image $X$ from the measurements $Y$.
Discriminative networks learn the inverse mapping $f^{-1}$ by modelling it as a deep neural network and minimizing the reconstruction error on a set of training examples $\{(X_i, Y_i)\}$. Hence, the inverse mapping is implicitly dependent on the forward model $f$. Dealing with reconstructions for multiple forward models would require learning separate networks for each model, which can be expensive.
For our generative approach, we model the distribution of natural images using a deep autoregressive model. We formulate the inverse problem as MAP inference. Hence, the estimated image $\hat{X}$ can be written as
$$\hat{X} = \arg\max_{X} \; p(X \mid Y) \tag{1}$$
$$\hat{X} = \arg\max_{X} \; p(Y \mid X)\, p(X) \tag{2}$$
The likelihood term $p(Y \mid X)$ varies for different imaging systems based on the forward model, but the image prior $p(X)$ remains the same. Thus, we need to learn the prior only once for all the problems.
Let the column vector $x$ represent the rasterized version of the image matrix $X$, i.e., obtained by taking pixels row by row. The forward models that we consider in this work are as follows.
Here, we randomly mark a certain number of pixels in an image as missing by setting their values to zero. Hence $y$, the vectorized version of the resultant image, can be written as
$$y = m \odot x \tag{3}$$
where $\odot$ denotes the Hadamard product and $m$ is a Bernoulli random vector. The above equation can also be expressed in matrix-vector multiplication form as:
$$y = M x \tag{4}$$
where $M$ is a subsampling matrix.
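The equivalence of the Hadamard form (Eq. 3) and the matrix form (Eq. 4) can be checked numerically. The sketch below is only illustrative: it assumes the subsampling matrix is the diagonal 0/1 matrix built from the Bernoulli mask, and the image size and keep probability are arbitrary choices, not values from this work.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.uniform(size=(4, 5))   # toy "image" (size is an arbitrary assumption)
x = X.reshape(-1)              # rasterize row by row

# Bernoulli mask: 1 = pixel observed, 0 = pixel missing.
# The keep probability of 0.2 is a hypothetical choice.
m = rng.binomial(1, 0.2, size=x.size)

y_hadamard = m * x             # element-wise (Hadamard) form
M = np.diag(m)                 # assumed matrix form: diagonal 0/1 matrix
y_matrix = M @ x               # matrix-vector form

print(np.allclose(y_hadamard, y_matrix))
```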
In SPC [2], the DMD array optically multiplexes the scene onto a single pixel sensor. By changing the orientation of the array, we get different multiplexing patterns, which result in different measurements. If $y$ is the vector of single pixel measurements from SPC and $\Phi$ is the compressive sensing matrix, then we have the forward model as:
$$y = \Phi x \tag{5}$$
In LiSens [3], the 2D image of the scene formed on the DMD plane is mapped onto a 1D line sensor, which essentially captures the 1D integral of the 2D image (along rows or columns). If $Y$ is the matrix formed by stacking line sensor measurements from LiSens and $\Phi$ is the sensing matrix, then we have
$$Y = \Phi X \tag{6}$$
FlatCam [17] replaces the lens system by a coded amplitude mask close to the sensor. For ease of calibration, this mask is designed to be separable, i.e., it can be written as an outer product of two one-dimensional patterns. Neglecting diffraction effects, it was shown in [17] that using such a mask, the measurements obtained on the FlatCam sensor can be written as
$$Y = \Phi_L X \Phi_R^T \tag{7}$$
where $\Phi_L$ and $\Phi_R$ are matrices corresponding to 1D convolution of the scene along the rows and columns respectively.
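To make the shapes of the three forward models concrete, here is a small NumPy sketch. It assumes the forms y = Phi x for SPC, Y = Phi X for LiSens, and Y = Phi_L X Phi_R^T for FlatCam; all matrix sizes and the random Gaussian entries are hypothetical placeholders rather than calibrated sensing matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
X = rng.uniform(size=(n, n))   # toy image
x = X.reshape(-1)              # row-by-row rasterization

# SPC: every measurement multiplexes the whole scene (global multiplexing).
Phi_spc = rng.standard_normal((16, n * n))
y_spc = Phi_spc @ x            # 16 scalar measurements

# LiSens (assumed form): the same 1D multiplexing is applied to every column of X.
Phi_ls = rng.standard_normal((4, n))
Y_ls = Phi_ls @ X              # 4 line-sensor readouts of length n

# FlatCam: separable model acting on rows and columns.
Phi_L = rng.standard_normal((12, n))
Phi_R = rng.standard_normal((12, n))
Y_fc = Phi_L @ X @ Phi_R.T     # 12 x 12 sensor image

# The separable model equals one large matrix acting on the rasterized image.
Y_fc_vec = np.kron(Phi_L, Phi_R) @ x
print(y_spc.shape, Y_ls.shape, Y_fc.shape)
```

The Kronecker identity at the end shows why the separable FlatCam model is convenient: it avoids storing the full matrix that would act on the rasterized image.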
Here we model the dependencies between pixels using a directed probabilistic chain. The pixel $x_i$ depends on all the pixels before the index $i$ in $x$, which we denote as $x_{<i}$. Hence the joint distribution over the pixels in the image can be factorized as
$$p(x) = \prod_{i} p(x_i \mid x_{<i}) \tag{8}$$
In this work, we use the state-of-the-art autoregressive generative model PixelCNN++ [16]. Here, the context for the conditional distribution of each pixel is modelled using a deep convolutional neural network with residual connections. The convolution kernels are masked appropriately to ensure that the context of a pixel does not depend on the pixels after it. The conditional distribution is then modelled as a mixture of logistic distributions, where the parameters of the distribution depend on the context. This model is then learned on RGB images using maximum likelihood training.
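The chain-rule factorization in Eq. 8 can be illustrated with a toy model. The conditional used below is a hypothetical stand-in (a Gaussian centred on the mean of the preceding pixels), not PixelCNN++; it only shows how the joint log-density is accumulated in raster-scan order.

```python
import numpy as np

def toy_conditional_logpdf(x_i, context, sigma=0.1):
    """Hypothetical log p(x_i | x_<i): Gaussian centred on the mean of the context."""
    mean = context.mean() if context.size > 0 else 0.5
    return -0.5 * ((x_i - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def autoregressive_log_prob(image):
    """Sum the conditional log-densities in raster-scan order (chain rule)."""
    x = image.reshape(-1)  # row-by-row rasterization
    return sum(toy_conditional_logpdf(x[i], x[:i]) for i in range(x.size))

smooth = np.full((4, 4), 0.5)
noisy = np.random.default_rng(0).uniform(0.0, 1.0, size=(4, 4))

# Under this toy prior a constant image scores higher than random noise.
print(autoregressive_log_prob(smooth), autoregressive_log_prob(noisy))
```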
Once the model is trained, it can be used to solve different inference tasks, as we describe below. Sampling from autoregressive models is slow because of their sequential nature, which limits their utility. However, for our approach, we only require the gradients of the log-density with respect to the image, $\nabla_x \log p(x)$. These can be computed efficiently using backpropagation to the inputs.
In this section, we discuss inference methods for the various forward models discussed earlier. We want the desired solution to have a higher likelihood (lower NLL) under the image prior and at the same time satisfy the constraints specified by the forward model. For this, we perform projected gradient descent. We divide our approach into three categories based on the amount of noise and the kind of forward model. The hard constraint (equality) method is used when there is little or no measurement noise (Section IV-A). For certain imaging models like FlatCam, there is no closed form for the projection operator. We instead use the Augmented Lagrangian Method (ALM), see Section IV-B. For the cases of high noise, the measurements deviate significantly from the forward model, and the soft constraint (inequality) method is used (Section IV-C). Further, in Sections IV-D and IV-E, we describe two implementation techniques which have proved useful for our approach.
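We first analyze the case when the measurement is obtained directly from the imaging model without any noise. $y$ is then a deterministic function of $x$, and hence the likelihood term corresponds to constraints. The problem can be formulated as
$$\hat{x} = \arg\max_{x} \; p(x) \quad \text{s.t.} \quad y = f(x) \tag{9}$$
where $f$ is provided by the imaging model. The signal prior model $p_\theta(x)$ is the learned autoregressive model with parameters $\theta$. Also, we constrain the intensity of the image to lie within the valid pixel range, between $x_{\min}$ and $x_{\max}$. Thus our problem is given by:
$$\hat{x} = \arg\min_{x} \; -\log p_\theta(x) \quad \text{s.t.} \quad y = f(x), \;\; x_{\min} \le x \le x_{\max} \tag{10}$$
Let $A$ and $B$ denote the constraint sets $\{x : y = f(x)\}$ and $\{x : x_{\min} \le x \le x_{\max}\}$ respectively.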
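We use projected gradient descent to solve this constrained optimization, which involves performing the following steps iteratively:
$$\tilde{x}^{(t)} = x^{(t)} + \eta \, \nabla_x \log p_\theta(x^{(t)}) \tag{11}$$
$$\bar{x}^{(t)} = P_A(\tilde{x}^{(t)}) \tag{12}$$
$$x^{(t+1)} = P_B(\bar{x}^{(t)}) \tag{13}$$
where $\eta$ is the step size, and $P_A$ and $P_B$ are the projection operators onto the constraint sets $A$ and $B$ respectively. For Eq. 11, backpropagation to the inputs is used to get the data gradients. For Eq. 13, pixels in the image are clipped to the valid intensity range in every iteration.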
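$P_A$ is different for different imaging problems. For the randomly missing pixels case,
$$P_A(x) = m \odot y + (\mathbf{1} - m) \odot x \tag{14}$$
where $m$ is the binary mask of observed pixels and $\mathbf{1}$ is a vector of ones of the same size as $x$. This implies that we should only update the missing pixels and leave the other pixels unchanged, which is intuitive.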
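A minimal sketch of the projection for the missing pixels case follows, assuming it takes the form m * y + (1 - m) * x with m equal to 1 at observed pixels. The helper name `project_missing_pixels` and the toy data are hypothetical.

```python
import numpy as np

def project_missing_pixels(x, y, m):
    """Reset observed pixels (m == 1) to their measured values; keep the rest of x."""
    return m * y + (1 - m) * x

rng = np.random.default_rng(0)
truth = rng.uniform(size=10)
m = (rng.uniform(size=10) < 0.5).astype(float)   # 1 = observed, 0 = missing
y = m * truth                                    # measurement with missing pixels zeroed
x = rng.uniform(size=10)                         # current estimate

x_proj = project_missing_pixels(x, y, m)
print(np.allclose(m * x_proj, y))
```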
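For the Single Pixel Camera we have
$$P_A(x) = x - \Phi^T (\Phi \Phi^T)^{-1} (\Phi x - y) \tag{15}$$
where $x$ and $y$ are vector representations of matrices $X$ and $Y$ respectively. We consider row-orthonormalized matrices $\Phi$ for compressive sensing, hence $\Phi \Phi^T$ is an identity matrix.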
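The projection for SPC can be sketched as follows, assuming a sensing matrix with orthonormal rows so that the inverse term drops out. Obtaining such a matrix through a QR decomposition of a Gaussian matrix is one possible construction and an assumption here; the function names and sizes are hypothetical.

```python
import numpy as np

def random_orthonormal_rows(num_meas, num_pixels, rng):
    """Gaussian matrix whose rows are orthonormalized via a reduced QR decomposition."""
    G = rng.standard_normal((num_pixels, num_meas))
    Q, _ = np.linalg.qr(G)          # Q: (num_pixels, num_meas), orthonormal columns
    return Q.T                      # rows are orthonormal, so Phi @ Phi.T = I

def project_onto_measurements(x, y, Phi):
    """Projection onto {x : Phi x = y}, valid when Phi @ Phi.T is the identity."""
    return x - Phi.T @ (Phi @ x - y)

rng = np.random.default_rng(0)
num_pixels, num_meas = 64, 16       # 25% measurement rate on a toy 8x8 image
Phi = random_orthonormal_rows(num_meas, num_pixels, rng)

x_true = rng.uniform(size=num_pixels)
y = Phi @ x_true                    # noiseless measurements
x0 = rng.uniform(size=num_pixels)   # arbitrary current estimate
x_proj = project_onto_measurements(x0, y, Phi)

print(np.linalg.norm(Phi @ x0 - y), np.linalg.norm(Phi @ x_proj - y))
```

After the projection the measurement residual vanishes up to floating point error, while the component of the estimate outside the row space of the sensing matrix is left unchanged.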
For the LiSens case, similar to SPC, we have
$$P_A(X) = X - \Phi^T (\Phi \Phi^T)^{-1} (\Phi X - Y) \tag{16}$$
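For the case of FlatCam reconstruction, the matrices $\Phi_L$ and $\Phi_R$ are ill-conditioned and cannot be inverted, so a closed-form solution for the projection operator does not exist. We therefore consider the augmented Lagrangian corresponding to the constrained problem in Eq. 10, with a dual parameter $\Lambda$ and a penalty parameter $\rho$:
$$L_\rho(X, \Lambda) = -\log p_\theta(X) + \langle \Lambda, \; Y - \Phi_L X \Phi_R^T \rangle + \frac{\rho}{2} \, \| Y - \Phi_L X \Phi_R^T \|_F^2 \tag{17}$$
However, instead of minimizing the Lagrangian with respect to the primal variable in each iteration, we just take one step of gradient descent. We further separate the gradient descent into two steps: one depends entirely on the prior while the other depends entirely on the imaging model. With step size $\eta$, the update steps are as follows.
$$\tilde{X}^{(t)} = X^{(t)} + \eta \, \nabla_X \log p_\theta(X^{(t)}) \tag{18}$$
$$\bar{X}^{(t)} = \tilde{X}^{(t)} + \eta \, \Phi_L^T \left( \Lambda^{(t)} + \rho \, (Y - \Phi_L \tilde{X}^{(t)} \Phi_R^T) \right) \Phi_R \tag{19}$$
$$X^{(t+1)} = P_B(\bar{X}^{(t)}) \tag{20}$$
$$\Lambda^{(t+1)} = \Lambda^{(t)} + \rho \, (Y - \Phi_L X^{(t+1)} \Phi_R^T) \tag{21}$$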
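The following is a rough sketch of an augmented Lagrangian loop of this kind: a prior step, a gradient step on the data terms, clipping, and a dual update. The exact ordering of the steps, the zero prior gradient, the [0, 1] clipping range, and the well-conditioned toy matrices (chosen so that a fixed step size is safe) are all assumptions for illustration; real FlatCam matrices are ill-conditioned.

```python
import numpy as np

def alm_flatcam(Y, Phi_L, Phi_R, prior_grad, step, rho, iters, lo=0.0, hi=1.0):
    """Primal gradient steps and dual ascent on an augmented Lagrangian (sketch)."""
    X = np.zeros((Phi_L.shape[1], Phi_R.shape[1]))
    Lam = np.zeros_like(Y)                               # dual variable
    for _ in range(iters):
        X = X + step * prior_grad(X)                     # step that depends only on the prior
        R = Y - Phi_L @ X @ Phi_R.T                      # measurement residual
        X = X + step * Phi_L.T @ (Lam + rho * R) @ Phi_R # step that depends only on the model
        X = np.clip(X, lo, hi)                           # keep intensities in the assumed range
        Lam = Lam + rho * (Y - Phi_L @ X @ Phi_R.T)      # dual ascent
    return X

rng = np.random.default_rng(0)
n = 8
Phi_L, _ = np.linalg.qr(rng.standard_normal((n, n)))     # toy orthogonal matrices
Phi_R, _ = np.linalg.qr(rng.standard_normal((n, n)))
X_true = rng.uniform(size=(n, n))
Y = Phi_L @ X_true @ Phi_R.T

zero_prior = lambda X: np.zeros_like(X)                  # hypothetical stand-in for the learned prior
X_hat = alm_flatcam(Y, Phi_L, Phi_R, zero_prior, step=0.05, rho=10.0, iters=100)
print(np.linalg.norm(Y - Phi_L @ X_hat @ Phi_R.T))
```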
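Consider the case when the sensor has measurement noise $n$,
$$y = f(x) + n \tag{22}$$
Assume the measurement noise to be Gaussian distributed with standard deviation $\sigma$, i.e.
$$n \sim \mathcal{N}(0, \sigma^2 I) \tag{23}$$
$$p(y \mid x) \propto \exp\left( -\frac{\| y - f(x) \|_2^2}{2 \sigma^2} \right) \tag{24}$$
The MAP estimation problem can hence be reduced to
$$\hat{x} = \arg\min_{x} \; -\log p_\theta(x) + \lambda \, \| y - f(x) \|_2^2 \tag{25}$$
where $\lambda$ has to be estimated if we do not know the standard deviation of the measurement noise. Since the constraints are not exact here, we replace the projection onto the constraint space by a step towards maximizing the likelihood. Hence, we replace Eq. 12 by gradient descent over the negative log-likelihood with step size $\eta$,
$$\bar{x}^{(t)} = \tilde{x}^{(t)} - \eta \lambda \, \nabla_x \| y - f(x) \|_2^2 \, \Big|_{x = \tilde{x}^{(t)}} \tag{26}$$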
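For a linear forward model, the soft constraint update amounts to one gradient step on a weighted squared data misfit. The sketch below assumes the misfit has the form weight * ||y - Phi x||^2; the function name, step size and weight are hypothetical, and the sensing matrix with orthonormal rows is only a convenient toy choice.

```python
import numpy as np

def likelihood_step(x, y, Phi, step, weight):
    """One gradient-descent step on weight * ||y - Phi x||^2 (soft data-fidelity step)."""
    return x + 2.0 * step * weight * (Phi.T @ (y - Phi @ x))

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((64, 16)))
Phi = Q.T                                        # toy sensing matrix with orthonormal rows

x_true = rng.uniform(size=64)
y = Phi @ x_true + 0.05 * rng.standard_normal(16)   # noisy measurements
x = rng.uniform(size=64)                            # current estimate

before = np.linalg.norm(y - Phi @ x)
x_new = likelihood_step(x, y, Phi, step=0.5, weight=0.5)
after = np.linalg.norm(y - Phi @ x_new)
print(before, after)
```

Unlike the hard projection, this step only moves the estimate part of the way towards the measurements, which is the desired behaviour when the measurements themselves are noisy.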
We observe that if we update all the pixels in the gradient update (Eq. 11), then we get washed out reconstructions. The autoregressive prior directly models the correlation between neighbouring pixels. Hence it tends to assign the same values to neighbouring pixels. We combat this problem by randomly selecting a certain fraction of pixels to update in each step. Hence, not all pixels get updated at every step. We call this pixel dropout, and to incorporate it, we replace the gradient in Eq. 11 by stochastic gradients, i.e.,
$$\tilde{x}^{(t)} = x^{(t)} + \eta \left( d \odot \nabla_x \log p_\theta(x^{(t)}) \right) \tag{27}$$
where $\eta$ is the step size and $d$ is a random binary mask with the percentage of zeros determined by the pixel dropout ratio. This is analogous to the case of training deep neural networks, where Stochastic Gradient Descent (SGD) helps in escaping from sharp local minima [32]. Here, the washed out reconstructions correspond to sharp local minima owing to the strong correlation between pixels. We demonstrate the effect of the amount of pixel dropout on the reconstructions in Section VI-D1.
Our prior model is trained on fixed-size patches, hence the input to the prior network has to match the patch size. While we perform the likelihood step on the entire image, our approach is designed such that the prior gradient update, projection, and clipping steps are separate. Before the prior gradient update, we split the image into a batch of patches. Before performing the likelihood step, we stitch the patches back into the original dimensions.
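Both implementation details, the pixel dropout mask and the splitting and stitching of patches, can be sketched in a few lines. Non-overlapping patches and image sizes that are multiples of the patch size are assumptions of this sketch, and all function names are hypothetical.

```python
import numpy as np

def pixel_dropout_mask(shape, dropout_ratio, rng):
    """Binary mask: 0 with probability dropout_ratio (pixel not updated), else 1."""
    return (rng.uniform(size=shape) >= dropout_ratio).astype(float)

def split_into_patches(image, p):
    """Split an (H, W) image into a batch of non-overlapping p x p patches."""
    H, W = image.shape
    return image.reshape(H // p, p, W // p, p).swapaxes(1, 2).reshape(-1, p, p)

def stitch_patches(patches, H, W):
    """Inverse of split_into_patches."""
    p = patches.shape[-1]
    return patches.reshape(H // p, W // p, p, p).swapaxes(1, 2).reshape(H, W)

rng = np.random.default_rng(0)
image = np.arange(64, dtype=float).reshape(8, 8)

patches = split_into_patches(image, 4)           # batch of four 4x4 patches
restored = stitch_patches(patches, 8, 8)         # back to the original layout

d = pixel_dropout_mask(image.shape, 0.25, rng)   # about 25% of the entries are zero
toy_gradient = np.ones_like(image)               # placeholder for the prior gradient
masked_gradient = d * toy_gradient               # only the selected pixels get updated
print(patches.shape, d.mean())
```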
Our approach is summarized as follows:
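As a rough illustration of how the pieces fit together (prior gradient step with pixel dropout, projection onto the measurement constraint, and clipping), here is a toy sketch for the missing pixels case. It is not the authors' implementation: the smoothness gradient stands in for the gradient of the learned PixelCNN++ prior, the [0, 1] intensity range is assumed, momentum is omitted, and all names and hyperparameters are hypothetical.

```python
import numpy as np

def toy_prior_grad(X):
    """Hypothetical stand-in for the prior gradient: pull each pixel towards its neighbours."""
    P = np.pad(X, 1, mode="edge")
    nbr_mean = (P[:-2, 1:-1] + P[2:, 1:-1] + P[1:-1, :-2] + P[1:-1, 2:]) / 4.0
    return nbr_mean - X

def reconstruct(y, m, prior_grad, step=0.5, dropout=0.25, iters=300, seed=0):
    """Projected gradient ascent on the prior with pixel dropout (inpainting case)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=y.shape)                   # random uniform initialization
    for _ in range(iters):
        d = rng.uniform(size=X.shape) >= dropout    # pixel dropout mask
        X = X + step * d * prior_grad(X)            # stochastic prior gradient step
        X = m * y + (1 - m) * X                     # projection: keep observed pixels
        X = np.clip(X, 0.0, 1.0)                    # clip to the assumed intensity range
    return X

truth = np.tile(np.linspace(0.0, 1.0, 16), (16, 1))                     # smooth toy image
m = (np.random.default_rng(1).uniform(size=truth.shape) < 0.5).astype(float)
y = m * truth                                                           # half the pixels missing

X_hat = reconstruct(y, m, toy_prior_grad)
print(np.abs(X_hat - truth).mean())
```

For SPC or LiSens, the projection line would be replaced by the corresponding measurement projection, and for noisy measurements by a gradient step on the data misfit.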
We train PixelCNN++ on the downsampled ImageNet data introduced in [14] for 6 epochs. The batch size is kept at 36 and the number of filter channels at 100. The rest of the parameters are the same as the ones used for training PixelCNN++ on ImageNet in [16]. We obtain a negative log-likelihood score of 3.66 on test data and 3.5 on train data, which is consistent with the numbers reported in [16] for similar data.
With this learned model, we use our proposed algorithm, as described in Algorithm 1, for the experiments described below. An initial image is sampled from a uniform random distribution. However, we observe that starting with different initial images does not have much effect on the final converged reconstruction. We use momentum in the gradient update for faster convergence, with its value set to 0.9. The step size, maximum number of iterations, and likelihood weight for each experiment are mentioned in the subsequent sections.
For reconstructing color images, we consider multiplexing along individual color channels. Hence, we have separate sensing matrices for the three channels and obtain a separate measurement vector for each channel.
We have made the code of our implementation for the task of Single Pixel Camera reconstruction available online (https://github.com/adaveiitm/deeppixellevelprior).
We use the original implementation of [1] available online (https://github.com/rickchang/OneNet) with certain modifications as mentioned below.
For simulating the color Single Pixel Camera, the original implementation rasterizes the entire image into a single vector and creates one matrix to compress this into a single measurement vector. We believe that this might not be feasible to implement in a real system. Hence, we modify their implementation to instead simulate separate matrices for each channel as in Section V-A.
While simulating SPC measurements on large images, the original implementation only deals with local multiplexing. It breaks the images down into patches and compresses each of these patches separately. We modify this to deal with the more challenging case of global multiplexing, where we compress the entire image.
We extend the original implementation for LiSens and FlatCam as well, by considering the above modifications and incorporating the respective forward models.
We use the provided model, which was trained on ImageNet for 2 epochs, for testing. We found that the results were very much dependent on the alpha parameter (penalty parameter), which had to be tuned for each image to get the best solution.
For comparisons with TVAL3 (TV minimization by Augmented Lagrangian and ALternating direction algorithms) [20], we use the MATLAB implementation (http://www.caam.rice.edu/~optimization/L1/TVAL3/) with the default parameters. The number of iterations is set to 80. For color image reconstruction, we update each channel separately using TVAL3.
In this section, we present the reconstructions from our approach and compare them with the existing state-of-the-art approaches. To begin with, we illustrate the ability of an autoregressive prior to reconstruct pixel level details using an example of missing pixel inpainting. For this, we randomly mask out pixels from the image and use our prior to reconstruct these missing pixels. We do this by keeping the observed pixel values fixed and updating the missing pixels to maximize the prior log-likelihood. Specifically, we take an image of size 384x512 and mask 80% of the pixels in the initial image, as can be seen in Figure 2. We compare our results with those of OneNet [1]. Details such as the text outlines are recovered much better in our reconstruction, which is also reflected quantitatively in terms of PSNR and SSIM. We use a step size of 75 and run for approximately 1000 iterations.
For all three imaging setups (SPC, LiSens and FlatCam), we perform reconstructions on both simulated data and real measurements. In the case of simulation, we compare our reconstructions with TVAL3 [20] and OneNet [1]. In the case of reconstructions from real measurements, we compare our results with TVAL3. OneNet experiments failed to converge to a stable point in this case, hence we could not provide a comparison with this approach. For real LiSens at 66% measurements, although OneNet converges, the results obtained were very poor compared to other approaches.
We show quantitative and qualitative comparisons of simulated SPC reconstruction results on images of sizes 128x128 and 256x256 in Table I and Figure 10 respectively. The measurement rates considered are 10% and 25% for 128x128, and 5% and 10% for 256x256. Similar to RIDE-CS [23], we generate the sensing matrix as a random Gaussian matrix with orthonormal rows. We perform gradient descent and the projection operation on the compressed image for 2000 iterations in the case of 25% measurement rate and for 2500 iterations in the case of 10% measurement rate. We use a step size of 7.5 and the hard constraint projection method. In all cases, we initialize with a random image from a uniform distribution. We compare our results to [1] and are able to show significant improvement in reconstruction results in terms of PSNR and SSIM values. Our reconstructions have better edges and textures compared to the reconstructions from OneNet.
Table I: simulated SPC reconstruction (M.R. = measurement rate in %, PSNR in dB)
Name      M.R.  TVAL3          OneNet         Ours
                PSNR   SSIM    PSNR   SSIM    PSNR   SSIM
bird      10    23.67  0.91    23.92  0.93    29.52  0.97
          25    29.67  0.97    26.89  0.96    32.96  0.98
building  10    18.81  0.61    23.85  0.86    25.93  0.88
          25    22.72  0.79    24.06  0.87    32.05  0.96
cat       10    23.27  0.72    25.15  0.82    26.68  0.85
          25    26.87  0.85    26.60  0.88    31.23  0.94
flower    10    20.07  0.68    23.39  0.84    26.22  0.89
          25    24.84  0.86    25.13  0.90    31.05  0.96
parrot    10    18.49  0.64    25.82  0.89    27.59  0.90
          25    23.67  0.84    26.79  0.91    32.18  0.95
mean      10    20.86  0.72    24.43  0.87    27.19  0.90
          25    25.55  0.86    25.74  0.90    31.89  0.96
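For reference, PSNR values like those reported in the table can be computed from a ground-truth image and a reconstruction with the standard definition sketched below. The peak value of 1.0 (images scaled to [0, 1]) is an assumption; the intensity scaling used for the reported numbers is not stated here.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is the maximum possible intensity."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

reference = np.zeros((8, 8))
estimate = np.full((8, 8), 0.1)      # uniform error of 0.1 gives an MSE of 0.01
print(psnr(reference, estimate))
```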
We show our real SPC reconstruction results in Figure 4. The data for this experiment was provided to us by the authors of [3]. We obtain the real SPC sensor measurements at 30% and 15% measurement rates. The images we reconstruct in this case are grayscale images. Here also, we use the hard constraint projection method for inference. We compare our results with TVAL3 and RIDE-CS [23]. Our method performs better than both RIDE-CS and TVAL3 in terms of PSNR and SSIM values. Apart from these measures, we observe that our method produces a sharper reconstruction. We use the same hyperparameters and training procedures as in the simulated case.
The reconstruction in the case of simulated LiSens is done at 25% and 40% measurement rates. Our LiSens experiments, similar to the SPC experiments, have been done on both 128x128 and 256x256 images, as shown in Table II and Figure 11 respectively. We compare our reconstructions with those obtained using OneNet. Our method provides better results in terms of visual perception as well as mean PSNR and SSIM values. Our reconstructions have well-defined boundaries for the different objects in the image and do not produce the artifacts which are observed in the case of OneNet. We have used the hard constraint method for the simulated LiSens reconstruction for approximately 2000 iterations with a step size of 7.5.
Table II: simulated LiSens reconstruction (M.R. = measurement rate in %, PSNR in dB)
Name      M.R.  TVAL3          OneNet         Ours
                PSNR   SSIM    PSNR   SSIM    PSNR   SSIM
bird      25    24.59  0.95    24.98  0.82    27.13  0.96
          40    29.34  0.98    27.52  0.96    34.14  0.99
building  25    18.72  0.67    21.16  0.79    30.87  0.95
          40    23.41  0.82    22.41  0.84    35.06  0.98
cat       25    23.41  0.67    27.27  0.89    29.95  0.94
          40    25.83  0.87    29.03  0.92    34.65  0.97
flower    25    21.00  0.72    27.85  0.91    26.54  0.88
          40    23.66  0.83    30.79  0.95    30.21  0.93
parrot    25    15.27  0.65    26.02  0.90    30.17  0.94
          40    19.75  0.85    27.99  0.93    32.35  0.96
mean      25    20.60  0.73    25.45  0.89    28.93  0.94
          40    24.40  0.87    27.55  0.92    33.28  0.97
The real LiSens experiments have been done at 16% and 33% measurement rates, using the measurements provided by the authors of [3]. We compare our real LiSens reconstructions with TVAL3 in Figure 7. Our method reconstructs the low level details in the image better. Our reconstruction has little or no blur compared to TVAL3 and is sharper in terms of object boundaries. We use the hard constraint method for reconstruction with 25% dropout in the pixel-wise update. We use a step size of 7.5 and 2000 iterations for reconstruction, similar to the simulated experiment.
The matrices and in the FlatCam imaging model are estimated using the calibration procedure described in [17]. Since we work with RGB images, separate and matrices are calibrated for each of the R, G and B channels with the help of a Bayer color filter array on the sensor. We compare our results with OneNet and L2 regularization on two 256x256 images, as shown in Figure 5. Our method achieves better PSNR and SSIM and perceptually better reconstructions. It produces the least blurry solution, and objects in the image have well-defined boundaries. We use 25% pixel dropout and perform 1000 iterations of the augmented Lagrangian method with the step size as and as 10.
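For orientation, the separable forward model and the L2-regularized baseline can be sketched as follows. This is a minimal sketch under stated assumptions: the names `phi_l`, `phi_r` and `tau` are illustrative (the symbols are not recoverable from the text above), the model is taken to be `Y = phi_l @ X @ phi_r.T` for one color channel, and the closed form is the standard Tikhonov solution for such a model, not necessarily the exact baseline code used for Figure 5.

```python
import numpy as np

def flatcam_forward(X, phi_l, phi_r):
    """Separable forward model for one color channel (noise-free)."""
    return phi_l @ X @ phi_r.T

def flatcam_l2_reconstruct(Y, phi_l, phi_r, tau):
    """Minimize ||phi_l X phi_r^T - Y||_F^2 + tau ||X||_F^2 in closed form."""
    U_l, s_l, Vt_l = np.linalg.svd(phi_l, full_matrices=False)
    U_r, s_r, Vt_r = np.linalg.svd(phi_r, full_matrices=False)
    Y_hat = U_l.T @ Y @ U_r  # measurements expressed in the singular bases
    # Per-coefficient Tikhonov filter: s_i s_j / (s_i^2 s_j^2 + tau)
    weights = np.outer(s_l, s_r) / (np.outer(s_l ** 2, s_r ** 2) + tau)
    return Vt_l.T @ (weights * Y_hat) @ Vt_r
```

Because the model is separable, only two small SVDs are needed instead of one decomposition of a matrix acting on the full vectorized image. For RGB data the same routine would be applied per channel with that channel's calibrated matrices.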
We use the data provided by the authors of [17]. The original images were displayed on a monitor and captured using FlatCam. With a Bayer color filter on the sensor, separate measurements are obtained for the three color channels. We compare our reconstructions with L2 regularization, as shown in Figure 6. Our reconstructions are more accurate in terms of brightness, boundaries and sharpness. We use the soft-constraint formulation for reconstruction, with the same hyperparameters as in the simulated case.
We observe that reconstructions from real FlatCam measurements are qualitatively not as good as those from real SPC and LiSens measurements. This is because the forward model assumed in this case is inaccurate. Firstly, there are calibration errors in estimating the and matrices. Secondly, the forward model in [17] relies on the separability assumption, which introduces model error.
In this experiment, we vary the fraction of pixels that are not updated in each iteration and observe the effect on the reconstructed image (Figure 8). When the dropout ratio is zero, textured regions of the image are oversmoothed. With a considerable dropout ratio (), texture is reconstructed better, giving higher PSNR and SSIM. Increasing the ratio further, however, makes the reconstructions noisy and reduces their quality. Thus, for all our experiments, we used dropout.
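The pixel dropout described here can be sketched as a masked gradient step. The sketch assumes that each pixel is dropped independently with probability `drop_ratio` and that a fresh mask is drawn at every iteration; how the mask is sampled in the actual experiments (for example, whether it is shared across color channels) is not specified above, and the function name is illustrative.

```python
import numpy as np

def dropout_update(x, grad, step, drop_ratio, rng):
    """Gradient step in which a random fraction of pixels is left unchanged."""
    keep = rng.random(x.shape) >= drop_ratio  # True where the pixel is updated
    return x + step * grad * keep, keep

# Usage: one update of a 64x64 image with 25% of the pixels frozen.
rng = np.random.default_rng(0)
x = np.zeros((64, 64))
grad = np.ones((64, 64))
x_new, keep = dropout_update(x, grad, step=0.5, drop_ratio=0.25, rng=rng)
```

Setting `drop_ratio=0.0` recovers the ordinary full update, which corresponds to the oversmoothed case in the ablation.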
Although we train our model on color ImageNet data, we observe that in practice the approach also works well for reconstructing grayscale images. We compare our reconstruction with that of RIDECS [23], which uses the autoregressive model RIDE [15] as the image prior. In Figure 9, we compare the reconstruction of a grayscale image from Single Pixel Camera measurements using our approach and RIDECS for measurement rate. The reconstruction obtained with our approach is better than that of RIDECS. This is because we use PixelCNN++, which is a deeper network than RIDE and hence has greater representational power. In addition, the running time of our approach ( minutes) is much lower than that of RIDECS ( minutes). Our approach is CNN-based and can therefore be parallelized over multiple GPUs, whereas RIDECS relies on a network of spatial LSTMs, which are hard to parallelize.
So far, all our experiments have used a different measurement matrix for each color channel. In OneNet [1], however, the authors consider a single matrix that multiplexes across the three color channels, which might not be feasible to implement in a real system. For this ablation experiment, we adopt the original setting of [1] and compare their reconstructions with ours for SPC reconstruction at a 10% measurement rate on the 9 ImageNet test images used in [1]. The PSNR and SSIM values are reported in Table III. Our approach achieves higher PSNR than OneNet on all nine images, and higher SSIM on all but the table image.
Figure Name  OneNet PSNR (dB)  OneNet SSIM  Ours PSNR (dB)  Ours SSIM
ball  24.696  0.9023  26.656  0.9300
dalmatian  20.650  0.8314  21.812  0.8518
dog  26.873  0.8734  28.552  0.8952
field  26.470  0.9112  29.017  0.9149
man  29.152  0.9460  31.787  0.9540
mountain  25.484  0.8821  28.993  0.8912
table  19.397  0.8083  20.955  0.6662
woman  25.512  0.8518  27.321  0.8906
wolf  25.976  0.8839  28.355  0.9061
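The difference between the two measurement settings compared in Table III comes down to the shape of the measurement operator. The sketch below is illustrative only: the function names are hypothetical, the matrices are random Gaussian placeholders rather than the patterns used in either work, and the image is a small toy array.

```python
import numpy as np

def measure_per_channel(image, matrices):
    """One measurement matrix per color channel; image has shape (H, W, 3)."""
    return [A @ image[..., c].ravel() for c, A in enumerate(matrices)]

def measure_multiplexed(image, A):
    """A single matrix acting on all three channels stacked into one vector."""
    return A @ image.ravel()

# Usage at a 10% measurement rate on a toy 8x8 RGB image.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
n = 8 * 8                                # pixels per channel
m = int(round(0.1 * n))                  # measurements per channel
per_channel = [rng.standard_normal((m, n)) for _ in range(3)]
joint = rng.standard_normal((int(round(0.1 * 3 * n)), 3 * n))
y_separate = measure_per_channel(image, per_channel)
y_joint = measure_multiplexed(image, joint)
```

In the multiplexed setting every measurement mixes pixels from all three channels, whereas in the per-channel setting each measurement depends on a single channel only.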

We demonstrate the efficacy of a deep pixel-level image prior for ill-posed reconstruction in different computational imaging problems. Among the three proposed inference approaches (hard-constraint, soft-constraint and ALM-based), the soft-constraint method works well overall and can handle noisy measurements by appropriately varying the tuning parameter, . However, when the measurements contain little or no noise, the hard-constraint method performs as well as the soft-constraint method, with the additional advantage of being parameter-free, and is hence preferable. In fact, for our real experiments on SPC (Figure 4) and LiSens (Figure 7), we use hard-constraint inference, which produces reasonable results. For cases such as FlatCam, noninvertibility of prevents the use of hard-constraint inference.
Our approach combines the versatility of image priors with the rich feature representation of deep neural networks. Being pixel-level, it explicitly accounts for pixel-level correlations, resulting in consistent texture and edges. We evaluate on both simulated forward models and data from real setups. In all cases, both quantitative metrics and qualitative comparisons suggest that our approach performs better than traditional methods and current state-of-the-art learning-based methods. An interesting direction for future work is to incorporate deviations from the forward model, due to calibration and model errors, into our approach to further improve reconstruction quality for FlatCam.
This work is supported by Qualcomm Innovation Fellowship (QInF) 2016 and 2017. We would like to thank Dr. Aswin Sankaranarayanan and Jian Wang from CMU for sharing the real measurements for SPC and LiSens setup. We would like to thank Dr. Ashok Veeraraghavan, Vivek Boominathan, Jasper Tan from Rice University for sharing the FlatCam data and for useful discussions.
“BM3D image denoising with shape-adaptive principal component analysis,” in SPARS’09 - Signal Processing with Adaptive Sparse Structured Representations, 2009.