1 Introduction
In the context of structural signal recovery, the task of image reconstruction from the compressive sampling has been closely associated with computational imaging shapiro2008computational using a single pixel camera wakin2006architecture; sank2015video. Single pixel camera architectures are of particular interest when imaging outside the visible range of the electromagnetic spectrum in cases where detector technology is expensive or difficult to manufacture. This approach to image acquisition involves illuminating an object scene using a sampling device which produces structured light in the form of 2D pseudorandom patterns. For each pattern, the intensity of the back scattered light is measured by a single pixel photodetector. In the computational imaging paradigm wakin2006architecture, each measurement corresponds to the inner product between a sensing pattern and the image to be reconstructed. This can be formulated as:
(1) $y_i = \langle \phi_i, x \rangle + e_i, \quad i = 1, \dots, M$

where $x \in \mathbb{R}^N$ is the image rearranged as a vector, $\phi_i$, $i = 1, \dots, M$, are random sensing patterns (also rearranged into vector form), $e_i$ are measurement errors and $y_i$ are the measurements. The number of sensing patterns $M$ can be much smaller than the total number of pixels $N$ comprising the reconstructed image, resulting in a measurement ratio of $R = M/N$. A digital micromirror device (DMD) is widely used as the sampling component in single pixel camera architectures and for coded aperture imaging sun2016single; sun2019single; Lochocki:16; Zhang:17; Sun:16; chiranjan2016implementation. It contains a 2D array of micromirrors (hence the name), and each micromirror can be positioned at one of two angles, placing it in either an activated or an inactivated state. When the array is illuminated by a uniform light source, shifting the micromirrors between states produces different binary sensing patterns, such as random Bernoulli or Hadamard patterns, which are projected onto the object scene of interest.
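As a concrete illustration of the sensing model in Eq. (1), the following sketch simulates single-pixel measurements with random binary patterns. The image size, pattern count and noise level are illustrative choices, not values from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 64 * 64        # number of pixels in the vectorised image x (illustrative)
M = 410            # number of sensing patterns, giving R = M/N ~ 10%

x = rng.random(N)                                    # hypothetical object scene
Phi = rng.integers(0, 2, size=(M, N)).astype(float)  # binary sensing patterns
e = rng.normal(0.0, 1e-3, size=M)                    # measurement errors

# Each measurement y_i is the inner product <phi_i, x> plus noise: the total
# back-scattered intensity recorded by the single-pixel photodetector.
y = Phi @ x + e

R = M / N          # measurement ratio
assert y.shape == (M,)
```

Reconstruction then amounts to inverting this underdetermined linear map, which is what the sparse-optimization and network-based methods discussed below address.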
To reconstruct signals/images from compressively sampled measurements, compressed sensing (CS) donoho2006compressed; candes2006robust, and more specifically sparse optimization methods such as NESTA Becker2011_NESTA and ADMM MAL016, have been proposed and have become the predominant algorithms used in a variety of applications. However, one major drawback of these numerical nonlinear optimization methods is that they often take several minutes to recover a single large image at good quality.
Deep neural networks (DNNs) have become prevalent in a broad range of image processing tasks krizhevsky2012imagenet; girshick2014rich; long2015fully; gu2018recent; tu2019survey. Specifically, DNNs have been shown to achieve favorable results in image recovery tai2017memnet. Motivated by this success in image reconstruction tasks, DNNs were subsequently investigated for image reconstruction problems based on compressively sensed image data mousavi2015deep; adler2016deep; mousavi2017learning; mousavi2017deepcodec; kulkarni2016reconnet; yao2017dr; xie2017adaptive; xie2017fully; zhao2019visualizing; xie2018full; shi2019image. These neural network based solutions were reported to outperform state-of-the-art compressed sensing algorithms in terms of speed, accuracy and data compression.
Although a variety of different network architectures have been proposed, few were deliberately designed to be adaptable to the sensing hardware. To date, two issues remain unsolved. First, the real-valued sensing patterns of all existing neural network implementations for this application were stored in 32-bit floating-point format. Although high-precision sensing patterns can be used for software simulation of image sampling on modern GPUs, this is not a realistic representation of sampling with structured-light sensing hardware, where binary patterns are instead used to reduce sampling complexity. Second, previous methods assumed that the sensing patterns and the reconstructed images have the same resolution. The size of the recovered image therefore depends on the size of the sensing patterns (for dense-connection based methods) or the number of convolutional patch-sampling operations (for convolution-based methods). For large images, these methods produce large intermediate feature maps and increase the number of operations required to recover an image, because the number of sampling measurements and convolutional computations depends on the size of the feature maps. In addition, when the patterns are loaded into hardware such as a DMD, the maximum reconstruction resolution is limited by the (fixed) size of the mirror array in the sensing device.
The limitations of previous methods motivated us to design a hardware-friendly deep learning solution, incorporating binary sensing patterns to reconstruct high-resolution images. Previous papers have highlighted the importance of integrating DNN solutions with hardware zhao2019visualizing; xie2018full. In this respect, we go one step further than previous work and provide evidence that our architecture performs well with imaging hardware. We propose a new network architecture that:
Uses a mixed-weights network with sparse binary patterns, which lends itself naturally to hardware implementation and can be trained in an end-to-end manner. Unlike floating-point numbers, binary patterns are appropriate for both the sampling and measuring hardware. Specifically, the sparse binary patterns can be represented on a DMD without the need for any additional modulation and require less on-board memory. Our approach effectively increases the light-intensity sensitivity of the single pixel camera (the photodiode) and the analogue-to-digital conversion range, compared with methods based on real-valued sensing patterns.

Uses a novel sensing-reconstruction scheme, which we term low-resolution sensing with high-resolution reconstruction (LSHR), to directly reconstruct high-resolution images from low-resolution sampled measurements. Given a pattern generated by a DMD of fixed size, the network reconstructs a high-resolution image with more pixels than the number of micromirrors in the array. This low-throughput sampling scheme results in smaller feature maps, and therefore fewer computational operations are required. Hence, it is more efficient than previously reported methods for use with hardware imaging setups.

Has a residual-correction subnet that consists of a chain of recursive residual blocks, with weights shared between blocks. Compared with previous methods, our structure further reduces the model size, making it ideal for the limited on-board memory capacity of the hardware (e.g. a single pixel camera), while yielding higher reconstruction PSNR.

Achieves state-of-the-art results on benchmark datasets and has been validated on proof-of-concept hardware.
The remainder of this paper is organized as follows: In Section 2, we review the related work on sensing patterns. We describe the design of our proposed network in Section 3. In Section 4 we show software simulation results for our model and compare them with existing methods. In Section 5, we present the work of integrating the model with hardware. Finally, in Section 6 we conclude our discussion and suggest potential future directions for the work.
2 Related work
The concept of neural network based image reconstruction was first implemented using a fully-connected network mousavi2015deep. Thereafter, the problem was approached using convolutional neural networks, which avoid the fixed-size input image constraint. We organize the related methods mousavi2015deep; kulkarni2016reconnet; yao2017dr; mousavi2017learning; adler2016deep; mousavi2017deepcodec; xie2017adaptive; xie2017fully; iliadis2018deep; iliadis2016deepbinarymask; zhao2019visualizing; xie2018full; shi2019image into three categories according to the type of sensing pattern used (randomly generated, learned and binary) and discuss relevant prior work below.

Networks based on pre-generated (static) patterns. A stacked denoising autoencoder (SDA) comprising fully-connected layers was previously implemented mousavi2015deep. It was trained with measurements acquired by sensing images with pre-generated random Gaussian patterns. Inspired by SDA, ReconNet kulkarni2016reconnet was subsequently proposed. It improved accuracy by extending the network with additional convolutional layers of different kernel sizes. However, because the fully-connected layer incurred heavy computation and a large model size, the sensing area was constrained to small patches of the original image. In a post-processing step, the reconstructed small patches were concatenated to form the whole image, and BM3D dabov2007image was then applied to smooth the edges between patches. The performance of ReconNet was further improved by DRNet yao2017dr, in which the convolutional layers were replaced with residual blocks that make the network easier to train, although sensing was still done in small patches. In contrast to previous methods that used fixed (pre-generated) Gaussian sensing patterns, DeepInverse mousavi2017learning generated random patterns in real time for sampling images.
Networks based on learned patterns. Some of the work described in the previous paragraph has been modified such that the sensing patterns adapt to a particular set of images through a learning process. The SDA was further adapted to learn the patterns with a fully-connected layer that inputs an image $x$ directly into the network. The fully-connected layer was trained to obtain the measurements $y$ when presented with $x$. This operation can be represented as $y = f(Wx + b)$, where $f$ is an activation function and $W$ and $b$ are the weights and bias of the fully-connected layer. A similar structure to SDA was also proposed that employed a fully-connected neural network to implement block-based compressed sensing adler2016deep. The model was trained to jointly optimize the sensing patterns and the network weights. DeepInverse was also optimized, resulting in a new model named DeepCodec mousavi2017deepcodec, which had an encoder-decoder architecture. The network was trained to take measurements from images using several convolutional layers. Unlike SDA, it gradually reduced the dimension of the intermediate feature maps prior to generating the measurements, and its efficiency was improved by applying convolutional layers. ReconNet was also further improved using learned patterns xie2017adaptive; zhao2019visualizing. Before training, the fully-connected layer was initialized with random Gaussian patterns, which were then updated during training. For testing, the trained patterns were fixed to perform the sensing. The results showed further improvements in reconstruction accuracy due to learning the patterns. However, the fully-connected layer caused intensive computation and blocking artifacts in the reconstructed images. To deal with these limitations, the authors proposed two networks, xie2017fully and xie2018full, that sensed images with a convolutional layer with a small stride to avoid the blocking artifacts.
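The fully-connected sensing step $y = f(Wx + b)$ used by these learned-pattern methods can be sketched as follows. The dimensions, the ReLU choice and the random initialisation are illustrative assumptions, not values from the papers:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 33 * 33            # flattened input patch size (illustrative)
m = 109                # number of learned measurements (illustrative)

x = rng.random(n)                  # input image patch, vectorised
W = rng.normal(size=(m, n))        # learned sensing weights of the FC layer
b = rng.normal(size=m)             # learned bias

def f(z):
    # An activation function; ReLU is one common choice.
    return np.maximum(z, 0.0)

# The fully-connected sensing operation: y = f(W x + b)
y = f(W @ x + b)
assert y.shape == (m,)
```

During training, $W$ and $b$ are optimized jointly with the reconstruction layers, so the patterns adapt to the image distribution.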
Networks based on a binary matrix. Neural networks with binary weights were initially designed for image classification tasks cour2015binaryconnect; rastegari2016xnor. A network for video reconstruction using binary patterns was described in iliadis2018deep. The network applied a 3D binary sampling matrix to downsample a sequence of temporal video frames and learned a nonlinear mapping between the measurements and the reconstructed frames via fully-connected layers. A more recent network, DeepBinaryMask iliadis2016deepbinarymask, followed the same strategy of using a binary downsampling matrix for sensing video frames but introduced a learning procedure for generating the masks. However, this work focused on temporal compression, which is functionally different from the spatial compression task that is the focus of our work. Inspired by the SDA, a network with an improved architecture was proposed to implement CS image reconstruction shi2019image. Unlike previous reconstruction methods, its initial reconstruction consisted of multiple convolutions and a reshape operation. The convolution, in principle, is functionally equivalent to a fully-connected layer, which fixed the reconstructed image size. After the convolution, the reconstructed 1D vector was reshaped into an initial 2D image. In this work, the authors experimentally tested their model with binary weights and bipolar weights for image sampling. However, the simple replacement of the sampling patterns did not involve optimization of the overall network, and the reported results indicated that the reconstruction accuracy of these two types of weights was suboptimal compared with their floating-point-based model.
In Section 3, we describe our own network architecture, which aims to solve the aforementioned limitations of the existing methods.
3 Overview of the proposed network
In this section, the network structure is explained in detail. The architecture is shown in Figure 1. It is functionally divided into two parts, i.e. the image reconstruction subnet, and the residual correction subnet.
Our LSHR scheme assumes an object scene is sampled with low-resolution patterns. In practical applications, ground-truth high-resolution images are not known a priori. During the training stage, we use the original images as our ground truth and resample these at low resolution to simulate the image quality typical of current single pixel imaging systems. These low-resolution and ground-truth image pairs are used to train our network.
The image reconstruction subnet samples the low-resolution input images with binary patterns to generate the measurements. From those measurements, the transposed convolution layer learns a nonlinear mapping to generate a low-resolution version of the reconstructed image. After that, the residual correction subnet learns the detail corrections and upscales the image to the final high-resolution size with a phase shift operation. Together, these two parts reconstruct the high-resolution image directly from the low-resolution sampling.
3.1 Image reconstruction subnet
The image reconstruction subnet learns both the binary patterns and how to reconstruct the image from the measurements. During training, the sampling process of the computational imaging is simulated by a convolutional layer, where the convolutional kernels act as the digital micromirror array and the kernel values (weights) act as binary patterns. When the trained model is integrated with the hardware, the learned kernel values can be uploaded to the digital micromirror array to perform the sampling, and the measurements of the back-scattered light intensity are sent back to the network to reconstruct the image.
The schematic of the image reconstruction subnet is shown in Figure 2. The sampling and reconstruction can be formulated as

(2) $y = W_b \ast \mathcal{D}(x), \quad \tilde{x} = \mathcal{T}(y; W_t, b_t)$

where $\tilde{x}$ is the reconstructed preliminary image, $\mathcal{T}$ is the transposed convolution with real-valued kernels $W_t$ and bias $b_t$, and $\mathcal{D}$ downscales the original images to simulate the sampling process. The measurements $y$ are generated by the convolution of the image $\mathcal{D}(x)$ with the binary kernels $W_b$, where each kernel corresponds to a sensing pattern. In our work, we studied two approaches to generating the binary patterns, i.e. pre-generated and learned patterns. We describe these in detail below and compare their performance (Section 4).
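A minimal sketch of the sensing step, with each binary kernel applied to non-overlapping blocks of the image, i.e. a stride-$k$ convolution. The image and kernel sizes are illustrative, and NumPy stands in for the network's convolutional layer:

```python
import numpy as np

rng = np.random.default_rng(2)

H = 32              # low-resolution sensing image side (illustrative)
k = 8               # sensing kernel (pattern) side (illustrative)
M = 6               # number of binary kernels, i.e. measurements per block

x = rng.random((H, H))                                  # downscaled image D(x)
B = rng.integers(0, 2, size=(M, k, k)).astype(float)    # binary kernels

# A convolution with stride k: every non-overlapping k x k block of the
# image is sensed by each binary kernel, giving an M-channel measurement map.
blocks = x.reshape(H // k, k, H // k, k).transpose(0, 2, 1, 3)  # (4, 4, k, k)
y = np.einsum('ijkl,mkl->mij', blocks, B)                       # (M, 4, 4)
assert y.shape == (M, H // k, H // k)
```

Each output channel collects the inner products of one pattern with every image block, which is exactly what the photodetector would record as that pattern scans the scene.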
Randomly pre-generated binary weights.
In this approach, the patterns were randomly generated and remained static during training. Before training, we initialized the binary weights from a random Bernoulli distribution, applied to each kernel independently. During the training process, we updated only the weights of the rest of the network. In this approach, the network was trained to fit a specific set of static binary patterns. In our experiments, we compared this scheme with the learned binary weights to study the benefit of weight optimization during training.

Learned binary weights.
The kernels were initialized with real-valued weights following a uniform distribution over a range symmetric about zero. This ensured the initialized weights were equally assigned to positive and negative values. Since real-valued weights were necessary for the network optimizer during training, these were used for gradient calculation. They were then mapped to binary values and applied to the sensing kernels for forward propagation. The binarization scheme is
(3) $W_b = \begin{cases} 1 & \text{if } W_r \ge 0 \\ 0 & \text{otherwise} \end{cases}$

where $W_b$ are the 0/1 binary weights and $W_r$ are the real-valued weights. Note that in our network, only the binary kernels were involved in the convolution operations. In addition, we clipped the real-valued weights to a fixed range. This ensured an effective binarization mapping, since values far outside the range have no additional impact on the binarization process. We also applied norm regularization to the weights to avoid the risk of gradient explosion.
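The binarization of Eq. (3) together with the weight clipping can be sketched as follows. The $[-1, 1]$ clipping range here is an assumption for illustration:

```python
import numpy as np

def binarize(w_real):
    """Map real-valued weights to {0, 1} by sign thresholding (Eq. 3)."""
    return (w_real >= 0).astype(w_real.dtype)

def clip_weights(w_real, lo=-1.0, hi=1.0):
    """Clip real-valued weights so very large magnitudes cannot accumulate;
    the [-1, 1] range is an illustrative assumption."""
    return np.clip(w_real, lo, hi)

w = np.array([-1.7, -0.2, 0.0, 0.4, 2.3])
w = clip_weights(w)
print(binarize(w))   # [0. 0. 1. 1. 1.]
```

The real-valued copy is what the optimizer updates; only its binarized image is used in the forward convolution.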
3.2 Residual correction subnet
Taking the output of the image reconstruction subnet as input, the residual correction subnet predicts the fine details, resulting in a high-resolution output image. The schematic of the residual correction subnet is shown in the red block in Figure 1. This subnet has two branches: upscaling and residual mapping. During training, the upscaling branch interpolates the intermediate input image to the required size of the high-resolution output. The residual mapping branch learns the reconstruction residual (fine details) between the upscaled intermediate input image and the original ground-truth image using long-term recursive residual blocks. The outputs of the two branches are added element-wise to reconstruct the final high-resolution image. In the remainder of this section, we describe the long-term recursive residual blocks and the image upscaling process.
The conventional residual block is formulated as $y = \mathcal{F}(x, W) + h(x)$, where $x$ and $y$ are the input and output of the residual block, $W$ indicates the weights of the residual block, $\mathcal{F}$ learns the residual mapping between the input and the output, and $h(x) = x$ is the identity mapping function. Our approach differs from this conventional formulation. All of our blocks have skip connections to the intermediate reconstructed features, which we refer to as long-term connections, and the blocks share weights, forming a recursive chain. The sequence of blocks in our network is shown in Figure 3. We used two convolutional layers with a pre-activation function in each block. For the identity mapping, we connected the feature maps associated with the low-resolution input (generated by the first convolutional layer) to the output of each block. This long-term connection directly relates these features to the outputs of the deep residual blocks. This can be formulated as
(4) $x_n = \mathcal{F}(x_{n-1}; W) + x_0, \quad \mathcal{F}(x; W) = W_2 \ast \sigma(W_1 \ast \sigma(x))$

where $\mathcal{F}$ is the residual mapping function of the $n$-th block, $x_0$ denotes the initial features, and $x_n$ is the output of the $n$-th block. $W = \{W_1, W_2\}$ are the weights and $\sigma$ is the Leaky ReLU activation function he2015delving. The $k$-th layer in each block shared the same weights $W_k$, where $k \in \{1, 2\}$. This formed a recursive structure and significantly reduced the total number of model parameters.
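A toy version of the recursive chain in Eq. (4), with one shared weight set and a long-term skip from the initial features $x_0$ to every block's output. The channel count, the matrix stand-in for the two conv layers, and the leak rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

C = 8                                            # feature channels (illustrative)
W_shared = rng.normal(scale=0.1, size=(C, C))    # one weight set shared by all blocks

def leaky_relu(z, a=0.2):
    return np.where(z > 0, z, a * z)

def residual_block(x, W):
    """One block of the recursive chain; a single matrix product stands in
    for the block's two pre-activated conv layers, purely for illustration."""
    return W @ leaky_relu(x)

x0 = rng.normal(size=C)     # initial features from the first conv layer
x = x0
for _ in range(6):          # six recursive blocks, all reusing W_shared
    # Long-term skip: every block's output is tied back to x0, not only to
    # its immediate input: x_n = F(x_{n-1}; W) + x0
    x = residual_block(x, W_shared) + x0

assert x.shape == (C,)
```

Because `W_shared` is reused by every block, the parameter count is independent of the chain depth, which is the memory saving the paper exploits.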
Image upscaling was implemented at the end of the residual correction subnet. After the residual mapping branch extracted the residual from the preliminary low-resolution image, we applied a phase shift layer shi2016real to enlarge the learned residual by the upscaling factor, giving high-resolution residual features. We set the network such that the high-resolution residual features have the same number of channels (one for grayscale and three for RGB) as the final image. In the upscaling branch, we also enlarged the image size by the same factor with the phase shift operation. The residual and the image were then added element-wise to generate the output image at high resolution. In our experiment, we set the upscaling factor to 2.
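The phase shift (sub-pixel, or depth-to-space) operation of shi2016real can be sketched as follows; the feature sizes are illustrative:

```python
import numpy as np

def phase_shift(feat, r):
    """Depth-to-space: (r*r*C, H, W) feature maps -> (C, H*r, W*r) image,
    as in the sub-pixel convolution layer of Shi et al. 2016."""
    C2, H, W = feat.shape
    C = C2 // (r * r)
    # Split the channel axis into the two sub-pixel offsets, then interleave
    # them with the spatial axes.
    out = feat.reshape(C, r, r, H, W).transpose(0, 3, 1, 4, 2)
    return out.reshape(C, H * r, W * r)

feat = np.arange(4 * 3 * 3, dtype=float).reshape(4, 3, 3)  # r = 2, C = 1
img = phase_shift(feat, r=2)
assert img.shape == (1, 6, 6)
```

Each output pixel is taken from one channel of the low-resolution map, so the layer rearranges values rather than interpolating them; all the learning happens in the convolutions that produce `feat`.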
3.3 Network training
The details of the network structure used in our experiment are illustrated in Figure 4, and the network structure code can be downloaded from our GitHub repository. The proposed network consists of two functionally different subnets, each containing different types of weights. A straightforward strategy, used in previous work, for training such a heterogeneous network is to train the two parts separately in a pipeline manner: the image reconstruction subnet is first trained and then used as a pretrained model for training the whole network. This approach can be viewed as either a two-step training strategy or a semi-decoupled strategy yao2017dr. In contrast, we trained the heterogeneous network in a purely end-to-end fashion. The two parts of the network were trained jointly, with a separate learning-rate update scheme for each. Specifically, for the image reconstruction subnet, we set a larger initial learning rate with faster decay. This encouraged rapid updating of the binary weights in the early stages of training and slower updates in the later stages, facilitating the residual correction subnet in recovering fine image details. For the residual correction subnet, we initialized a relatively small learning rate with a slower decay rate, since the residual correction for the details is more difficult to learn.
Denoting the original image as $x$, we aim to train the whole network to reconstruct the high-resolution image $\hat{x} = f(x; W)$, where $W$ denotes the weights of the model. We associated the loss function with the outputs of both subnets (parts), i.e. the reconstructed low-resolution image and the upscaled high-resolution image, to train the network. In contrast to the common $\ell_2$-norm loss function used in previous work, we trained the network using the Charbonnier loss function, a differentiable variant of the $\ell_1$ norm. Given the generated images from both parts, our loss function is written as

(5) $\mathcal{L}(W) = \frac{1}{B} \sum_{i=1}^{B} \left[ \lambda\, \rho\!\left(\tilde{x}_i - \tilde{x}_i^{gt}\right) + \lambda\, \rho\!\left(\hat{x}_i - x_i\right) \right] + \beta \lVert W \rVert^2$

where $B$ is the batch size and $\rho(z) = \sqrt{z^2 + \epsilon^2}$ denotes the Charbonnier penalty. The second term is the norm regularization for the weights. Our experiments indicated that images generated using the Charbonnier loss function were usually sharper than those obtained using an $\ell_2$-norm loss function. We accumulated the loss of both subnets. The low-resolution ground-truth image $\tilde{x}^{gt}$ was generated by downsizing the original image using bicubic interpolation. The scalar weight $\lambda$ controls the influence of each part in the loss function; in our experiment, we set a fixed value of $\lambda$ for each part. This multi-loss function forms a supervision scheme that can control the residual training at each part of the network.
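A sketch of the Charbonnier penalty and the two-part loss of Eq. (5). The eps value, the weight `lam`, and the omission of the weight-regularization term are illustrative simplifications:

```python
import numpy as np

def charbonnier(z, eps=1e-3):
    """Charbonnier penalty, a smooth variant of the l1 norm:
    rho(z) = sqrt(z^2 + eps^2)."""
    return np.sqrt(z * z + eps * eps)

def loss(pred_lr, gt_lr, pred_hr, gt_hr, lam=0.5):
    """Multi-loss over both sub-net outputs (lam is an illustrative weight);
    the weight-norm regularization term of Eq. (5) is omitted here."""
    low = charbonnier(pred_lr - gt_lr).mean()
    high = charbonnier(pred_hr - gt_hr).mean()
    return lam * low + lam * high

rng = np.random.default_rng(4)
gt = rng.random((16, 16))
pred = gt + 0.05 * rng.normal(size=(16, 16))
gt_lr, pred_lr = gt[::2, ::2], pred[::2, ::2]   # stand-in for bicubic downsizing
print(loss(pred_lr, gt_lr, pred, gt))
```

Unlike the squared error, the penalty grows roughly linearly in $|z|$ for $|z| \gg \epsilon$, which is why it is less dominated by outliers and tends to give sharper reconstructions.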
4 Experiments
We conducted a series of tests to study the performance of the network. First, we evaluated the image reconstruction quality (see Section 4.3) on three datasets. Our learned-pattern and fixed-pattern binary models showed the first and second highest peak signal-to-noise ratio (PSNR) compared to the methods reviewed in Section 2. In Section 4.4, we analyze how fixed and learned patterns affected the model training process. Finally, in Section 4.5, we assess the reconstruction efficiency of the network in comparison with the other tested methods.
4.1 Datasets
We used the DIV2K image dataset agustsson2017ntire for training and validation. We applied data augmentation to the training images. Specifically, we randomly cropped fixed-size small patches from each of the images that comprise the DIV2K dataset to generate the training images. In addition, we randomly applied flipping and rotation to the original patches. We used the cropped image patches as ground-truth images for the high-resolution output.
Three datasets were used to test the model's performance. First, we used a benchmark dataset of 11 test images, which has been used in existing work, to evaluate the reconstructed image quality and compare it with the results of previous methods. Second, we evaluated the proposed method on a much larger dataset, the test set of ILSVRC2017, comprising natural images from a wide range of classes ILSVRC15. It is known that natural images are often approximately sparse in the domain of the discrete cosine transform (DCT) and the wavelet transform taubman2012jpeg2000, and CS is an efficient method for approximate recovery of such images. Since our method is an alternative to CS, we also evaluated the performance of our structured signal recovery method on images with various levels of sparsity. For this experiment, we generated a DCT-sparse version of the ILSVRC2017 test set, controlling the sparsity of the DCT coefficients as follows: each image was first transformed into the DCT domain, where the coefficients were reordered by magnitude; we then applied a set of percentage thresholds on coefficient magnitude such that only a given fraction of the coefficients was retained and all other coefficients were set to zero.
4.2 Setting network parameters and hyperparameters
For the image reconstruction subnet, we used the same pattern size for both the sensing kernels and the transposed convolution kernels. For the residual blocks, we used a fixed kernel size for the convolutional layers with leaky ReLU activation, and the same number of channels for each of the convolutional layers.
The network was trained in batches with the Adam optimizer, using separate initial learning rates and decay rates for the image reconstruction subnet (larger initial learning rate, faster decay) and the residual correction subnet (smaller initial learning rate, slower decay), as described in Section 3.3. The proposed method was trained on an NVidia GeForce GTX 1080Ti GPU.
In our experiment, we trained the network with measurement ratios $R = M/N$ of 1%, 10% and 25%, where $M$ is the number of sampling kernels and $N$ is the number of pixels in the sensing images. Accordingly, the numbers of binary kernels for the 128×128 benchmark sampling images are 164, 1638 and 4096.
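The kernel counts can be checked against the measurement ratio $R = M/N$ for the 128×128 ($N = 16384$) benchmark images:

```python
# Sanity check of the binary kernel counts against R = M/N.
N = 128 * 128                      # 16384 pixels in the sensing image
for M in (164, 1638, 4096):
    print(M, round(M / N, 3))      # -> 0.01, 0.1, 0.25 respectively
```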
4.3 Image reconstruction results
We evaluated our model on the benchmark dataset and compared the results with seven recently proposed methods: ReconNet kulkarni2016reconnet, DRNet yao2017dr, AdpRec xie2017adaptive, FullyConv xie2017fully, 2FC2Res zhao2019visualizing, FullyBlock Net xie2018full, and CSNet shi2019image. To be consistent with previous work, we used PSNR as the metric. The comparison results are summarized in Table 3, from which it can be seen that our network with learned patterns achieved the highest average PSNR at all three measurement ratios. Note that the comparisons with FullyBlock Net and CSNet follow the protocols reported in their respective papers; our model with learned patterns shows better results under the same protocol.
Example images reconstructed by the different methods at measurement ratios of 1%, 10% and 25% are shown in Figures 5, 6 and 7 respectively. Our model reconstructed more detail than the other methods, resulting in images that are visually sharper. At the lowest measurement ratio (1%), blocking artifacts are not observed in the output images generated by the FullyConv network and our network. This is because both methods used a convolutional layer rather than a fully-connected layer to implement the sensing; the network could therefore be trained in an end-to-end fashion, and no post-processing was required to smooth the output images. At the measurement ratio of 10%, blocking artifacts are eliminated for all methods, since a sufficient number of measurements were acquired. At the highest measurement ratio (25%), the FullyConv network is visually comparable to our method, but our learned-weights model still achieved a higher PSNR value.
The difference between the results for the static patterns and the learned patterns of our network is most significant at the lowest measurement ratio. The learned-patterns model achieved a better average PSNR and reconstructed more detail. This implies that learning binary weights can help preserve more detail at the same measurement ratio and makes the model converge faster, thereby reducing the training time.
Table 1: Example raw images and their reconstructions at DCT sparsity levels of original, 20%, 10%, 5% and 1% (images omitted).
Next, we evaluated the model on the ILSVRC2017 test dataset. Figure 8 shows the mean PSNR values of the images reconstructed from the ILSVRC2017 test set. The mean PSNR values produced by our method in this large-scale test are similar to those produced on the small benchmark set, indicating good generalization. Furthermore, the PSNR values increase with increased sparsity, indicating that the model also performs well on DCT-sparse images.
We also found that the PSNR values of the reconstructed images at the three measurement ratios tend to become similar as we increase the sparsity of the image in the DCT domain. Examples of reconstructed images are presented in Table 1.
4.4 Model training analysis with fixed and learned binary sampling schemes
First, we analyzed the training efficiency by monitoring the validation loss under both sampling schemes. We found that training with learned patterns produced a faster loss reduction at all three measurement ratios (as shown in Figure 9) than training with fixed patterns. When the measurement ratio was increased, the discrepancy between the losses of the two networks also increased. Furthermore, the network with learned patterns yielded a lower final loss than the fixed-patterns network, especially at the two larger measurement ratios. Even though the learned-patterns network showed some instability compared to the fixed-patterns network¹, it is still beneficial since it can be trained more quickly.
¹In the static scheme, the sampling patterns were not involved in the backpropagation calculation; only the real-valued weights in the rest of the network were updated. In the learned scheme, the binary weights were updated at each step, and the binarization function introduced fluctuations in the gradient calculation, which made training less stable.
Next, we analyzed the sparsity of the learned patterns by tracking the percentage of active pixels (with value 1) in the patterns during pattern updates. In compressive sampling theory, one typically uses a small number of dense sensing patterns (with roughly equal numbers of ones and zeros), in contrast with raster-scan sensing, in which each pattern is maximally sparse (contains a single active pixel) and records the intensity of one pixel at a time. Sparse patterns, however, are more efficient for single pixel imaging hardware, as they require less on-board memory. Our approach effectively adapts the sparsity of the patterns to the measurement ratio and hence finds a compromise between sensing efficiency and hardware performance. Specifically, we initialized all patterns from a single-precision uniform distribution (as required for model optimization), which was subsequently binarized to form patterns with a similar number of ones and zeros. However, the number of ones decreased dramatically during training, since the model at large sampling ratios does not necessarily need dense patterns. In contrast, at the relatively small measurement ratio of 1%, the number of ones remained consistently high, which suggests that more information was sampled by each pattern. As a result, the sampling patterns at 10% and 25% contain fewer ones than the patterns at 1%, as seen in Figure 10. This variation with R implies that the learning process can generate efficient binary sampling patterns that adapt to different measurement ratios.
4.5 Analysis of the reconstruction efficiency
We analyzed the computational efficiency of the network by calculating its time and space complexity, introduced below. The results demonstrate that our model achieves a good balance between computational cost and model size for the best image quality.
Reconstruction efficiency and model size of 8 methods

                 | Image restoration     | Residual correction
Name             | # Weights   | Format  | # Conv layers | Structure | Share weights | Kernel size
ReconNet         | 1024        | 32-bit  | 6             | Plain     | No            |
DRNet            | 1024        | 32-bit  | 12            | 4 blocks  | No            |
AdpRec           | 1024        | 32-bit  | 6             | Plain     | No            |
2FC2Res          | 1024        | 32-bit  | 6             | 2 blocks  | No            |
FullyConv        | 2560 × 256  | 32-bit  |               |           | No            |
FullyBlock Net   | 2560 × 256  | 32-bit  | 25            | 12 blocks | No            |
CSNet            | 1024        | 32-bit  | 12            | 5 blocks  | No            |
Ours             | 2560 × 256  | 1-bit   | 12            | 6 blocks  | Yes           |
To determine the relative computational efficiency of our network, we compared the model size (space complexity) and the number of operations (time complexity) of our network's image reconstruction layer with those of the other networks used in prior work (see Table 2). The comparison is based on the reconstruction of a single-channel (greyscale) image of size with a measurement ratio , and holds for any image size. The time and space complexity are formulated as and , where is the size of the feature map, is the size of the kernel, and and are the numbers of input and output channels respectively.
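As a sketch, the standard per-layer complexity figures above (multiply-accumulate count proportional to the feature-map area times K²·C_in·C_out, weight count K²·C_in·C_out, biases ignored) can be evaluated directly; the layer dimensions chosen here are illustrative, not the elided values from the text:

```python
def conv_time_complexity(h, w, k, c_in, c_out):
    """Multiply-accumulate operations for one stride-1 convolutional
    layer producing an h x w feature map."""
    return h * w * k * k * c_in * c_out

def conv_space_complexity(k, c_in, c_out):
    """Number of weights in one convolutional layer (biases ignored)."""
    return k * k * c_in * c_out

# Example: a 3x3 convolution with 64 input and 64 output channels
# applied to a 32x32 feature map.
ops = conv_time_complexity(32, 32, 3, 64, 64)
weights = conv_space_complexity(3, 64, 64)
print(ops, weights)  # operations scale with the map area; weights do not
```

Summing these two quantities over all layers gives the per-network totals compared in Table 2.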
Our network has the smallest model size of all the tested networks and lower time complexity than the FullyConv network. Note that ReconNet, DRNet, AdpRec, and 2FC2Res perform fewer operations in the initial image reconstruction step because these networks use fully-connected layers. However, a fully-connected layer can only be trained for a specific image size, which is less practical.
For the residual-correction part, our recursive residual block with the LSHR sampling scheme generates smaller intermediate feature maps and uses fewer model weights, thereby reducing the computational burden. In the FullyConv and FullyBlock Net networks, images are reconstructed directly at the high-resolution size, and the reconstruction error is then corrected by applying convolutions to feature maps of the same size as the high-resolution test image. Since the time complexity is proportional to , the square of the image size, the computational cost of these networks quadruples when the output image size is doubled. In contrast, our network reconstructs the image at low resolution, performs the convolutional operations on small feature maps, and upscales back to the original size only at the last layer. The number of operations performed by our network is therefore of order , four times fewer than for FullyConv and FullyBlock Net. Furthermore, the number of blocks does not affect the total number of weights, since weights are shared between blocks to form a recursive residual structure. Specifically, the weights are shared only between the first layers (and, separately, the second layers) of the six two-layer recursive residual blocks.
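The quadratic saving from operating at low resolution, and the depth-independence of the shared-weight count, can both be checked numerically. A sketch, using the standard H·W·K²·C_in·C_out operation count with illustrative layer sizes:

```python
def conv_ops(h, w, k, c_in, c_out):
    """Multiply-accumulates for one stride-1 convolution on an h x w map."""
    return h * w * k * k * c_in * c_out

# Correcting residuals on a half-resolution feature map (16x16 instead of
# 32x32) performs a quarter of the operations of the full-resolution case,
# since the cost is proportional to the map area h * w.
full_res = conv_ops(32, 32, 3, 64, 64)
half_res = conv_ops(16, 16, 3, 64, 64)
assert full_res == 4 * half_res

# Weight sharing: six recursive blocks reuse one set of weights, so the
# parameter count is that of a single two-layer block, independent of depth.
weights_per_block = 2 * (3 * 3 * 64 * 64)   # two conv layers per block
num_blocks = 6
shared_total = weights_per_block            # one shared set for all blocks
unshared_total = num_blocks * weights_per_block
print(shared_total, unshared_total)
```

With sharing, depth adds operations but no weights, which is exactly the trade captured in the "Share weights" column of Table 2.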
The last part of our analysis evaluated the performance of the network for different numbers of residual blocks in our recursive structure, since the depth of the recursive residual block affects the reconstruction accuracy. Figure 11 shows that image quality increases as blocks are added, with the best trade-off between time and accuracy obtained with the 6-block structure; adding further blocks degrades image quality. In principle, more residual blocks could improve the capacity of the residual mapping, but in practice a deeper network is harder to train. Figure 11 also shows that the reconstruction time increases linearly with the number of blocks. Our final model therefore uses 6 blocks, which gave the best accuracy at a reasonable reconstruction time.
PSNR on the benchmark images at three measurement ratios (two image/method groups per row)
Image  Methods  R=0.25  R=0.1  R=0.01  Image  Methods  R=0.25  R=0.1  R=0.01  
Barbara  ReconNet  23.58dB  22.17dB  19.08dB  Boats  ReconNet  27.83dB  24.56dB  18.82dB  
DRNet  25.77dB  22.69dB  18.65dB  DRNet  30.09dB  25.58dB  18.67dB  
AdpRec  27.40dB  24.28dB  21.36dB  AdpRec  32.47dB  28.80dB  21.09dB  
FullyConv  28.59dB  24.28dB  22.06dB  FullyConv  33.88dB  29.48dB  22.3dB  
2FC2Res  27.92dB  24.27dB  21.48dB  2FC2Res  33.59dB  29.12dB  21.29dB  
Ours (static)  27.52dB  24.57dB  22.03dB  Ours (static)  32.05dB  29.55dB  22.59dB  
Ours (Learned)  31.11dB  24.56dB  22.34dB  Ours (Learned)  34.13dB  29.59dB  23.31dB  
Fingerprint  ReconNet  26.15dB  20.99dB  15.01dB  Cameraman  ReconNet  23.48dB  21.54dB  17.51dB  
DRNet  27.65dB  22.03dB  14.73dB  DRNet  25.62dB  22.46dB  17.08dB  
AdpRec  32.31dB  26.55dB  16.22dB  AdpRec  27.11dB  24.97dB  19.74dB  
FullyConv  32.91dB  27.36dB  16.33dB  FullyConv  28.99dB  25.62dB  20.63dB  
2FC2Res  32.17dB  25.92dB  16.22dB  2FC2Res  28.84dB  25.07dB  19.98dB  
Ours (static)  30.36dB  26.07dB  17.10dB  Ours (static)  28.68dB  26.53dB  20.84dB  
Ours (Learned)  33.38dB  26.40dB  17.23dB  Ours (Learned)  30.63dB  26.56dB  21.35dB  
Flinstones  ReconNet  22.74dB  19.04dB  14.14dB  Foreman  ReconNet  32.08dB  29.02dB  22.03dB  
DRNet  26.19dB  21.09dB  14.01dB  DRNet  33.53dB  29.20dB  20.59dB  
AdpRec  27.94dB  23.83dB  16.12dB  AdpRec  36.18dB  33.51dB  25.53dB  
FullyConv  30.26dB  24.98dB  16.92dB  FullyConv  38.10dB  34.00dB  27.26dB  
2FC2Res  29.72dB  24.94dB  16.27dB  2FC2Res  38.25dB  34.29dB  25.77dB  
Ours (static)  28.00dB  24.34dB  16.81dB  Ours (static)  35.34dB  33.13dB  26.36dB  
Ours (Learned)  31.01dB  24.66dB  17.27dB  Ours (Learned)  36.91dB  33.45dB  27.13dB  
Lena  ReconNet  27.47dB  24.48dB  18.51dB  House  ReconNet  29.96dB  26.74dB  20.30dB  
DRNet  29.42dB  25.39dB  17.97dB  DRNet  31.83dB  27.53dB  19.61dB  
AdpRec  31.63dB  28.50dB  21.49dB  AdpRec  34.38dB  31.43dB  22.93dB  
FullyConv  33.00dB  28.97dB  22.51dB  FullyConv  36.22dB  32.36dB  23.67dB  
2FC2Res  32.97dB  28.86dB  21.57dB  2FC2Res  35.35dB  31.45dB  22.92dB  
Ours (static)  31.60dB  29.37dB  23.13dB  Ours (static)  34.80dB  32.55dB  24.82dB  
Ours (Learned)  34.18dB  29.57dB  23.52dB  Ours (Learned)  36.61dB  33.73dB  25.12dB  
Monarch  ReconNet  24.95dB  21.49dB  15.61dB  Peppers  ReconNet  25.74dB  22.72dB  17.39dB  
DRNet  27.95dB  23.10dB  15.33dB  DRNet  28.49dB  24.32dB  16.90dB  
AdpRec  29.25dB  26.65dB  17.70dB  AdpRec  29.65dB  26.67dB  19.75dB  
FullyConv  32.63dB  27.61dB  18.46dB  FullyConv  32.90dB  28.72dB  21.38dB  
2FC2Res  32.46dB  27.60dB  17.85dB  2FC2Res  32.82dB  27.52dB  20.05dB  
Ours (static)  31.51dB  28.71dB  20.09dB  Ours (static)  31.20dB  28.23dB  21.52dB  
Ours (Learned)  34.20dB  29.07dB  20.79dB  Ours (Learned)  33.51dB  28.61dB  22.10dB  
Parrot  ReconNet  26.66dB  23.36dB  18.93dB  Mean  ReconNet  26.42dB  23.28dB  17.94dB  
DRNet  28.73dB  23.94dB  18.01dB  DRNet  28.66dB  24.32dB  17.44dB  
AdpRec  30.51dB  27.59dB  21.67dB  AdpRec  30.80dB  27.53dB  20.33dB  
FullyConv  32.13dB  27.92dB  22.49dB  FullyConv  32.69dB  28.30dB  21.27dB  
2FC2Res  31.89dB  27.93dB  21.77dB  2FC2Res  32.36dB  27.91dB  20.47dB  
Ours (static)  32.64dB  29.84dB  22.57dB  Ours (static)  31.25dB  28.44dB  21.62dB  
Ours (Learned)  34.75dB  30.18dB  23.01dB  Ours (Learned)  33.68dB  28.67dB  22.11dB  
Mean  CSNet {0,1}  –  26.39dB  20.62dB  Mean  
CSNet  –  28.37dB  21.02dB  FullyBlock Net  33.57dB  28.94dB  22.12dB  
Ours (Learned)  33.68dB  28.67dB  22.11dB  Ours (Learned)  33.66dB  29.04dB  22.79dB 

Results of CSNet {0,1} and CSNet at R = 0.25 were not reported in their work shi2019image.

The FullyBlock Net xie2018full was tested only on a subset of the standard benchmark, specifically seven of its images. To compare with their results, the table also reports our results on the same subset.
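All accuracy figures in the table above are PSNR values in dB. For reference, a minimal sketch of the standard PSNR computation for 8-bit greyscale images (the test images here are synthetic, for illustration only):

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images whose pixel
    values lie in [0, peak]."""
    mse = np.mean((reference.astype(np.float64)
                   - reconstruction.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(2)
img = rng.integers(0, 256, size=(64, 64))
noisy = np.clip(img + rng.normal(0.0, 5.0, size=(64, 64)), 0, 255)
print(f"{psnr(img, noisy):.2f} dB")
```

Higher values indicate better reconstruction; a gain of roughly 3 dB corresponds to halving the mean squared error.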
5 Implementation on hardware
In real-world applications, signal/image sampling is usually performed by optical devices, which inevitably introduce noise and artifacts into the image data. Computer simulations alone provide no guarantee that an image recovery network architecture will be robust to these aspects of practical single pixel imaging systems. It is therefore important to validate the efficacy of our LSHR-Net software solution, which uses learned binary patterns, on typical single pixel imaging hardware.
Our hardware comprised a silicon planar photodetector with a purpose-designed amplifier circuit, lenses, and a light projector. The photodetector had a peak sensitivity at a wavelength of and a sensitive area of . The circuit was connected to an Arduino board, which performed 10-bit analog-to-digital conversion (1024 scales). For evaluation purposes, we used test images from a database as an alternative to setting up unique object scenes. The test images were multiplied, in software, with each of the sampling patterns (forming modulated images) and projected using a TI DLP LightCrafter evaluation module with a built-in DMD plane with a array. The size, in pixels, of the sampling patterns was constrained by the sensitivity of the photodetector and the resolution of the analog-to-digital conversion; a good practical resolution for the sampling patterns was found to be 16x16 pixels. Each modulated test image was focused onto the photodetector using a set of lenses with focal lengths of , and . A filter with fixed attenuation was used to reduce the light intensity at the photodetector, thereby avoiding saturation. We recorded the light intensity of the modulated images and sent these measurements as input data to the model.
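The measurement chain described above (pattern modulation, photodetection, 10-bit analog-to-digital conversion) can be mimicked in simulation. A sketch; the noise level and intensity normalization are illustrative assumptions, not measured values from our setup:

```python
import numpy as np

def single_pixel_measurement(image, pattern, adc_bits=10, noise_std=0.001,
                             rng=None):
    """Simulate one single pixel measurement: modulate the image with a
    binary pattern, integrate the transmitted intensity, add detector
    noise, and quantize with an adc_bits analog-to-digital converter."""
    rng = rng or np.random.default_rng(0)
    intensity = np.sum(image * pattern) / image.size   # normalized to [0, 1]
    intensity += rng.normal(0.0, noise_std)            # amplifier/detector noise
    levels = 2 ** adc_bits                             # 1024 scales for 10 bits
    return int(np.clip(np.rint(intensity * (levels - 1)), 0, levels - 1))

rng = np.random.default_rng(3)
image = rng.uniform(0.0, 1.0, size=(16, 16))       # 16x16 scene, as in our setup
pattern = (rng.uniform(size=(16, 16)) >= 0.5)      # binary DMD pattern
print(single_pixel_measurement(image, pattern))
```

Repeating this for every sampling pattern yields the measurement vector that is fed to the reconstruction network.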
For the hardware experiments, we trained our model on the MNIST dataset lecunmnisthandwrittendigit2010 using the same training settings described in Section 4. The network was trained with MNIST images. The model was evaluated on 18 randomly selected test images of handwritten digits (9 each from MNIST and the Omniglot dataset). We used the Omniglot dataset lake2015human, which consists of natural-language characters, to demonstrate that the proposed method can generalize to datasets whose images differ in structure from the training set.
The model reconstructed images directly from the photodetector measurements at a super-resolution size of . We evaluated performance at the same measurement ratios used in Section 4.2. Results on MNIST and Omniglot are shown in Figures 12 and 13 respectively. The reconstruction quality of the character structure improved as the number of measurements increased. At the same time, artifacts can be seen in the reconstructed images; these are caused predominantly by noise in the hardware setup (e.g. the amplifier circuit). The average SNR of the recorded measurement signal was dB. Moreover, Figures 12 and 13 show that the reconstructed images at are more pixelated than those at and . Visually, the model produced better reconstruction quality, although this is due to the smoothing effect also seen in Figure 5.
6 Conclusions
In this paper, we have proposed a hardware-friendly method for image reconstruction from compressively sensed measurements, using mixed-weights deep neural networks. The proposed method, which consists of sampling and reconstruction networks, was specifically designed to ease hardware realization, in particular integration with a single pixel camera. Our novel LSHR-Net uses trainable binary sampling patterns that can be deployed on a single pixel camera's DMD sampling array. LSHR-Net samples light intensity at low resolution and reconstructs images with high-resolution detail. It thereby reduces the number of measurements at a given measurement ratio and reduces the convolutional computing cost, improving the efficiency of the reconstruction process significantly compared with previous work. To reduce the hardware storage required for image reconstruction, the reconstruction network is equipped with long-term recursive residual blocks. Its weight-sharing strategy makes the trained models much more compact than those of previously reported network architectures, requiring less onboard storage in the imaging hardware. Experimental results on benchmark image datasets indicate that our method yields better image quality than that reported in previous work for a number of different measurement ratios. We also implemented our method on proof-of-concept hardware and demonstrated that it can sample images as compact measurements and then recover them successfully. Our network architecture has potential applications beyond the scope of single pixel imaging; for example, it may be adapted for similar imaging modalities such as coded aperture imaging and structured light sensing. An efficient approach to network training for different imaging modalities may involve transfer learning, which could be the focus of future work in this area.
Moreover, for a specific hardware setup, fine-tuning after the initial deployment can potentially yield improvements in image quality through software alone.
Acknowledgements
We dedicate this article to the memory of Craig Douglas (Seismicstuff Ltd), who designed and made the amplifier circuit used in our hardware experiments.