In the context of structured signal recovery, the task of image reconstruction from compressively sampled measurements is closely associated with computational imaging shapiro2008computational using a single pixel camera wakin2006architecture; sank2015video. Single pixel camera architectures are of particular interest when imaging outside the visible range of the electromagnetic spectrum, where detector technology is expensive or difficult to manufacture. This approach to image acquisition involves illuminating an object scene using a sampling device which produces structured light in the form of 2D pseudo-random patterns. For each pattern, the intensity of the back-scattered light is measured by a single pixel photo-detector. In the computational imaging paradigm wakin2006architecture, each measurement corresponds to the inner product between a sensing pattern and the image to be reconstructed. This can be formulated as:
$y = \Phi x + e$, where $x \in \mathbb{R}^N$ is the image rearranged as a vector, the rows of $\Phi \in \mathbb{R}^{M \times N}$ are random sensing patterns (also concatenated into vector form), $e \in \mathbb{R}^M$ are measurement errors and $y \in \mathbb{R}^M$ are the measurements. The number of sensing patterns $M$ can be much smaller than the total number of pixels $N$ comprising the reconstructed image, resulting in a measurement ratio of $R = M/N$.
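As a concrete sketch of this measurement model (the image size, pattern count, noise level and random seed below are illustrative assumptions, not values from this work):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 32x32 image (N = 1024 pixels) sampled with
# M = 103 binary patterns gives a measurement ratio R = M/N of about 10%.
N = 32 * 32
M = 103

x = rng.random(N)                      # image rearranged as a vector
Phi = rng.integers(0, 2, size=(M, N))  # binary Bernoulli sensing patterns
e = rng.normal(0.0, 1e-3, size=M)      # measurement errors

# Each measurement is the inner product between a sensing pattern (a row of
# Phi) and the image vector.
y = Phi @ x + e
```

Recovering $x$ from $y$ is the ill-posed inverse problem addressed by the CS solvers and neural networks discussed below.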
A digital micro-mirror device (DMD) is widely used as the sampling component in single pixel camera architectures and for coded aperture imaging sun2016single; sun2019single; Lochocki:16; Zhang:17; Sun:16; chiranjan2016implementation. It contains a 2D array of micro-mirrors (hence the name) and each micro-mirror can be positioned at one of two angles, placing it in either an activated or inactivated state. When the array is illuminated by a uniform light source, shifting the micro-mirrors between states produces different binary sensing patterns, such as random Bernoulli or Hadamard patterns, which are projected onto the object scene of interest.
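A minimal sketch of generating the two kinds of binary patterns named above (pattern counts, sizes and the 'on' probability are illustrative assumptions):

```python
import numpy as np

def bernoulli_patterns(n_patterns, h, w, p=0.5, seed=0):
    """Random binary patterns: each micro-mirror is 'on' with probability p."""
    rng = np.random.default_rng(seed)
    return (rng.random((n_patterns, h, w)) < p).astype(np.uint8)

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

# A +1/-1 Hadamard row maps onto a binary DMD pattern via (H + 1) / 2.
H = hadamard(8)
hadamard_binary = ((H + 1) // 2).astype(np.uint8)
bernoulli = bernoulli_patterns(4, 8, 8)
```

In a single pixel camera, each such pattern would be uploaded to the DMD in turn and one intensity measurement recorded per pattern.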
To reconstruct signals/images from compressively sampled measurements, compressed sensing (CS) algorithms donoho2006compressed; candes2006robust, more precisely sparse optimization methods such as NESTA Becker2011_NESTA and ADMM MAL016, have been proposed and have become the predominant approach in a variety of applications. However, one major drawback of these nonlinear numerical optimization methods is that they often take several minutes to recover a single large image at good quality.
Deep neural networks (DNNs) have become prevalent in a broad range of image processing tasks krizhevsky2012imagenet; girshick2014rich; long2015fully; gu2018recent; tu2019survey. Specifically, DNNs have been shown to achieve favorable results in image recovery tai2017memnet. Motivated by this success, DNNs were subsequently investigated for image reconstruction from compressively sensed image data mousavi2015deep; adler2016deep; mousavi2017learning; mousavi2017deepcodec; kulkarni2016reconnet; yao2017dr; xie2017adaptive; xie2017fully; zhao2019visualizing; xie2018full; shi2019image. These neural network based solutions were reported to outperform state-of-the-art compressed sensing algorithms in terms of speed, accuracy and data compression.
Although a variety of network architectures have been proposed, few were deliberately designed to be adaptable to the sensing hardware. To date, two issues remain unsolved. First, the real-valued sensing patterns of all existing neural network implementations for this application are stored in 32-bit floating-point format. Although high-precision sensing patterns can be used for software simulation of image sampling on modern GPUs, this is not a realistic representation of sampling with structured light sensing hardware, where binary patterns are used instead to reduce sampling complexity. Second, previous methods assumed that the sensing patterns and the reconstructed images have the same resolution. Therefore, the size of the recovered image depends on the size of the sensing patterns (for dense-connection based methods) or the number of convolutional patch-sampling operations (for convolution-based methods). For large images, these methods produce large intermediate feature maps and increase the number of operations required to recover an image, because the number of sampling measurements and convolutional computations depends on the size of the feature maps. In addition, when the patterns are loaded into hardware such as a DMD, the maximum reconstruction resolution is limited by the fixed size of the mirror array in the sensing device.
The limitations of previous methods motivated us to design a hardware-friendly deep learning solution, incorporating binary sensing patterns to reconstruct high-resolution images. Previous papers have highlighted the importance of integrating DNN solutions with hardware zhao2019visualizing; xie2018full. In this respect, we go one step further than previous work and provide evidence that our architecture performs well with imaging hardware. We propose a new network architecture that:
Uses a mixed-weights network with sparse binary patterns which lends itself naturally to hardware implementation and can be trained in an end-to-end manner. Unlike floating-point numbers, binary patterns are appropriate for both sampling and measuring hardware. Specifically, the sparse binary patterns can be represented on a DMD without the need for any additional modulation and require less on-board memory usage. Our approach effectively increases the light intensity sensitivity of the single pixel camera (the photo-diode) and the analogue to digital conversion range, compared with methods based on real-valued sensing patterns.
Uses a novel sensing-reconstruction scheme, which we term low-resolution sensing with high-resolution reconstruction (LSHR), to directly reconstruct high-resolution images from low-resolution sampled measurements. Given a pattern generated by a DMD of fixed size, the network reconstructs a high-resolution image which has more pixels than the number of micro-mirrors in the array. This low-throughput sampling scheme results in smaller feature maps, and therefore, fewer computational operations are required. Hence, it is more efficient than previously reported methods for use with hardware imaging set-ups.
Has a residual-correction sub-net that consists of a chain of recursive residual blocks, where weights are shared between different blocks. Compared with previous methods, our structure further reduces the model size, making it ideal for the limited onboard memory capacity of the hardware (e.g. single pixel camera) while yielding higher reconstruction PSNR accuracy.
Achieves state-of-the-art results on benchmark datasets and has been validated on proof-of-concept hardware.
The remainder of this paper is organized as follows: In Section 2, we review the related work on sensing patterns. We describe the design of our proposed network in Section 3. In Section 4 we show software simulation results for our model and compare them with existing methods. In Section 5, we present the work of integrating the model with hardware. Finally, in Section 6 we conclude our discussion and suggest potential future directions for the work.
2 Related work
The concept of neural network based image reconstruction was first implemented using a fully-connected network mousavi2015deep. Thereafter, the problem was approached using convolutional neural networks, which avoid the fixed-size input image constraint. We organize the related methods mousavi2015deep; kulkarni2016reconnet; yao2017dr; mousavi2017learning; adler2016deep; mousavi2017deepcodec; xie2017adaptive; xie2017fully; iliadis2018deep; iliadis2016deepbinarymask; zhao2019visualizing; xie2018full; shi2019image into three categories according to the type of sensing pattern used (randomly generated, learned and binary) and discuss relevant prior work below.
Networks based on pre-generated (static) patterns. A stacked denoising auto-encoder (SDA) comprising fully-connected layers was previously implemented mousavi2015deep. It was trained with measurements acquired by sensing images with pre-generated random Gaussian patterns. Inspired by SDA, ReconNet kulkarni2016reconnet was subsequently proposed. It improved the accuracy by extending the network with additional convolutional layers of different kernel sizes. However, because the fully-connected layer incurred heavy computation and a large model size, the sensing area was constrained to small patches of the original image. In a post-processing step, the reconstructed small patches were concatenated to form the whole image, and BM3D dabov2007image was then applied to smooth the edges between patches. The performance of ReconNet was further improved by DR-Net yao2017dr, in which the convolutional layers were replaced with residual blocks that make the network easier to train; however, the sensing was still done in small patches. In contrast to previous methods that used fixed (pre-generated) Gaussian sensing patterns, DeepInverse mousavi2017learning used real-time generation of random patterns for sampling images.
Networks based on learned patterns. Some of the work described in the previous paragraph has been modified so that the sensing patterns adapt to a particular set of images through a learning process. The SDA was further adapted to learn the patterns with a fully-connected layer that takes an image $x$ directly as input. The fully-connected layer was trained to produce the measurements $y$ when presented with $x$. This operation can be represented as $y = f(Wx + b)$, where $f$ is an activation function and $W$ and $b$ are the weights and bias of the fully-connected layer. A similar structure to SDA was also proposed that employed a fully-connected neural network to implement block-based compressed sensing adler2016deep. The model was trained to jointly optimize the sensing patterns and the network weights. DeepInverse was also optimized, resulting in a new model named DeepCodec mousavi2017deepcodec with an encoder-decoder architecture. The network was trained to take measurements from images using several convolutional layers. Unlike SDA, it gradually reduced the dimension of the intermediate feature maps prior to generating the measurements, and the efficiency was improved by applying convolutional layers. ReconNet was also further improved using learned patterns xie2017adaptive and zhao2019visualizing
. Before training, the fully-connected layer was initialized with random Gaussian patterns and then updated during training. For testing, the trained patterns were fixed to perform the sensing. The results showed further improvements in reconstruction accuracy due to learning the patterns. However, the fully-connected layer caused intensive computation and blocking artifacts in the reconstructed images. To address these limitations, the authors proposed two networks xie2017fully; xie2018full that sense images with a convolutional layer using a small stride step to avoid the blocking artifacts.
Networks based on a binary matrix. Neural networks with binary weights were initially designed for image classification tasks cour2015binaryconnect; rastegari2016xnor. A network for video reconstruction using binary patterns was described in iliadis2018deep. The network applied a 3D binary sampling matrix to down-sample a sequence of temporal video frames and learned a non-linear mapping between the measurements and the reconstructed frames via fully-connected layers. A more recent network, DeepBinaryMask iliadis2016deepbinarymask, followed the same strategy of using a binary down-sampling matrix for sensing video frames but introduced a learning procedure for generating the masks. However, this work focused on temporal compression, which is functionally different from the spatial compression task that is the focus of our work. Inspired by the SDA, a network with an improved architecture was proposed to implement CS image reconstruction shi2019image. Unlike previous reconstruction methods, its initial reconstruction consisted of multiple convolutions and a reshape operation. The convolution is, in principle, functionally equivalent to a fully-connected layer, which fixed the reconstructed image size. After the convolution, the reconstructed 1D vector was reshaped into an initial 2D image. The authors experimentally tested their model with binary weights and bipolar weights for image sampling. However, this simple replacement of sampling patterns did not involve optimization of the overall network, and the reported results indicated that the reconstruction accuracy with these two types of weights was sub-optimal compared with the floating-point-based model.
In Section 3, we describe our own network architecture, which aims to solve the aforementioned limitations of the existing methods.
3 Overview of the proposed network
In this section, the network structure is explained in detail. The architecture is shown in Figure 1. It is functionally divided into two parts, i.e. the image reconstruction sub-net, and the residual correction sub-net.
Our LSHR scheme assumes an object scene is sampled with low-resolution patterns. In practical applications, ground truth, high-resolution, images are not known a priori. During the training stage, we use the original images as our ground-truth and resample these at low resolution for the purposes of simulating image quality typical of current single pixel imaging systems. These low resolution and ground truth image pairs are used to train our network.
The image reconstruction sub-net samples the low-resolution input images with binary patterns to generate the measurements. From those measurements, the transposed convolution layer learns a non-linear mapping to generate a low-resolution version of the reconstructed image. After that, the residual correction sub-net learns the detail corrections and up-scales the image to the final high-resolution size with a phase shift operation. Together these two parts are able to reconstruct the high-resolution image directly from the low-resolution sampling.
3.1 Image reconstruction sub-net
The image reconstruction sub-net learns both the binary patterns and how to reconstruct the image from the measurements. During the training, the sampling process of the computational imaging is done using a convolutional layer where the convolutional kernels act as the digital mirror array and the kernel values (weights) act as binary patterns. When the trained model is integrated with the hardware, the learned kernel values can be uploaded to the digital mirror array to do the sampling and the measurements of the back scattered light intensity are sent back to the network to reconstruct the image.
The schematic of the image reconstruction sub-net is shown in Figure 2. The sampling and reconstruction can be formulated as
$\hat{x} = T(B * D(x))$, where $\hat{x}$ is the reconstructed preliminary image, $T(\cdot)$ is the transposed convolution whose real-valued kernels and bias are $W$ and $b$ respectively, and $D(\cdot)$ down-scales the original image $x$ to simulate the sampling process. The measurements $y = B * D(x)$ are generated by the convolution of the image with the binary kernels $B$, where each kernel corresponds to a sensing pattern. In our work, we studied two approaches to generating the binary patterns, i.e. pre-generated and learned patterns. We describe these in detail below and compare their performance (Section 4).
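A numpy sketch of this sampling-and-reconstruction pipeline (kernel counts and sizes are illustrative assumptions; in the network these operations are learned convolution and transposed-convolution layers):

```python
import numpy as np

def sample(image, kernels):
    """Simulated sampling: slide each binary kernel over the image with a
    stride equal to the kernel size, producing one measurement per position."""
    m, k, _ = kernels.shape
    H, W = image.shape
    out = np.empty((m, H // k, W // k))
    for p, ker in enumerate(kernels):
        for i in range(H // k):
            for j in range(W // k):
                block = image[i*k:(i+1)*k, j*k:(j+1)*k]
                out[p, i, j] = np.sum(block * ker)
    return out

def reconstruct(measurements, kernels_t, bias=0.0):
    """Transposed convolution with real-valued kernels: each scalar
    measurement paints a weighted k x k patch into the preliminary image."""
    m, h, w = measurements.shape
    k = kernels_t.shape[-1]
    img = np.zeros((h * k, w * k))
    for p in range(m):
        for i in range(h):
            for j in range(w):
                img[i*k:(i+1)*k, j*k:(j+1)*k] += measurements[p, i, j] * kernels_t[p]
    return img + bias
```

Here each binary kernel plays the role of a DMD pattern applied to non-overlapping image blocks, and the transposed convolution maps each scalar measurement back to a $k \times k$ patch of the preliminary image.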
Randomly pre-generated binary weights.
In this approach, the patterns were randomly generated and remained static during training. Before training, we initialized the binary weights from a random Bernoulli distribution, applied to each kernel independently. During the training process, we updated only the weights of the rest of the network, so the network was trained to fit a specific set of static binary patterns. In our experiments, we compared this scheme with the learned binary weights to study the benefit of weight optimization during training.
Learned binary weights.
The kernels were initialized with real-valued weights drawn from a uniform distribution over a range symmetric about zero. This ensured the initialized weights were equally assigned to positive and negative values. Since real-valued weights are necessary for the network optimizer during training, these were used for gradient calculation. They were then mapped to binary values and applied to the sensing kernels for forward propagation. The binarization scheme maps each real-valued weight $w_r$ to a binary weight $w_b$, with $w_b = 1$ if $w_r \ge 0$ and $w_b = 0$ otherwise. Note that in our network, only the binary kernels were involved in the convolution operations. In addition, we clipped the real-valued weights to a bounded range. This kept the binarization mapping effective, since very large values outside the range do not have a significant impact on the binarization process. We also applied a weight-norm regularization to avoid the risk of gradient explosion.
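The scheme above can be sketched as follows; the zero threshold and the clipping range $[-1, 1]$ are assumptions consistent with the symmetric initialization, and the gradient step illustrates the usual straight-through-estimator trick for training binary weights:

```python
import numpy as np

def binarize(w_real):
    """Map real-valued weights to {0, 1}: non-negative -> 1, negative -> 0
    (assumed zero threshold, consistent with the symmetric initialization)."""
    return (w_real >= 0).astype(w_real.dtype)

def clip_weights(w_real, lo=-1.0, hi=1.0):
    """Clip the real-valued shadow weights; values far outside the assumed
    [-1, 1] range would not change the binarized output anyway."""
    return np.clip(w_real, lo, hi)

# Forward pass uses the binary weights; the gradient is applied to the
# real-valued shadow weights as if binarization were the identity
# (the straight-through estimator).
w = np.array([-0.7, -0.1, 0.0, 0.4, 1.8])
w_bin = binarize(w)                           # -> [0, 0, 1, 1, 1]
grad = np.array([0.2, -0.5, 0.1, 0.3, -0.2])  # gradient w.r.t. w_bin
w = clip_weights(w - 0.1 * grad)              # SGD step on shadow weights
```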
3.2 Residual correction sub-net
Taking the output of the image reconstruction sub-net as input, the residual correction sub-net predicts the fine details resulting in a high-resolution output image. The schematic of the residual correction sub-net is shown in the red block in Figure 1
. This sub-net has two branches: up-scaling and residual mapping. During the training, the upscaling branch interpolates the intermediate input image to the required size of the high-resolution output. The residual mapping branch learns the reconstruction residual (fine details) between the upscaled intermediate input image and the original ground truth image using the long-term recursive residual blocks. The outputs of the two branches are added element-wise to reconstruct the final high-resolution image. In the remainder of the section, we describe the long-term recursive residual blocks and the image upscaling processes.
The conventional residual block is formulated as $y = F(x, W) + h(x)$, where $x$ and $y$ are the input and output of the residual block, $W$ indicates the weights of the block, $F$ learns the residual mapping between the input and the output, and $h(x) = x$ is the identity mapping function. Our approach differs from this conventional formulation. All of our blocks have skip connections to the intermediate reconstructed features, which we refer to as long-term connections, and the blocks share weights, forming a recursive chain. The sequence of the blocks in our network is shown in Figure 3. We used two convolutional layers with a pre-activation function in each block. For the identity mapping, we connected the feature maps associated with the low-resolution input (generated by the first convolutional layer) to the output of each block. This long-term connection directly relates these features to the outputs of the deep residual blocks. This can be formulated as $x_b = F_b(x_{b-1}) + x_0$, where $F_b$ is the residual mapping function of the $b$-th block, $x_0$ is the initial feature map, and $x_b$ is the output of the $b$-th block. The weights are shared across blocks and the Leaky ReLU activation function he2015delving is used. The $i$-th layer in each block shared the same weights, where $i \in \{1, 2\}$. This formed a recursive structure and reduced the total number of model parameters significantly.
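The recursive chain with long-term skip connections can be sketched with matrices standing in for the convolutional layers (layer sizes are illustrative assumptions; six two-layer blocks follow the structure described above):

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def residual_block(x, w1, w2):
    """Two layers with pre-activation; matrices stand in for convolutions."""
    return w2 @ leaky_relu(w1 @ leaky_relu(x))

def recursive_chain(x0, w1, w2, n_blocks=6):
    """Every block reuses the same (w1, w2), and each block output is added
    to the initial feature x0 through a long-term skip connection:
    x_b = F(x_{b-1}) + x_0."""
    x = x0
    for _ in range(n_blocks):
        x = residual_block(x, w1, w2) + x0
    return x
```

Because (w1, w2) are shared, adding more blocks increases depth without increasing the parameter count.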
The image upscaling was implemented at the end of the residual correction sub-net. After the residual mapping branch extracted the residual from the preliminary low-resolution image, we applied a phase shift layer shi2016real to enlarge the learned residual by an upscaling factor $r$, producing high-resolution residual features. We set the network such that the high-resolution residual features have the same number of channels (one for grayscale and three for RGB) as the final image. In the up-scaling branch, we also enlarged the image by the factor $r$ with the phase shift operation. The residual and the image were then added, element-wise, to generate the high-resolution output image. In our experiment, we set the upscaling factor to $r = 2$.
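The phase shift (sub-pixel) operation can be sketched in numpy; the channel layout below is one common convention and an assumption on our part:

```python
import numpy as np

def phase_shift(features, r):
    """Rearrange an (r*r*C, H, W) feature map into a (C, r*H, r*W) image,
    i.e. the sub-pixel (pixel shuffle) upscaling of shi2016real."""
    c2, H, W = features.shape
    C = c2 // (r * r)
    x = features.reshape(C, r, r, H, W)   # split the channel dimension
    x = x.transpose(0, 3, 1, 4, 2)        # -> (C, H, r, W, r)
    return x.reshape(C, H * r, W * r)
```

With $r = 2$, four feature channels of size $H \times W$ are interleaved into one channel of size $2H \times 2W$, so the upscaling happens only at the very end of the network.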
3.3 Network training
The details of the network structure used in our experiment are illustrated in Figure 4. The network structure code can be downloaded from our GitHub repository. The proposed network consists of two functionally different sub-nets which contain different types of weights. A straightforward strategy used in previous work to train such a heterogeneous network is to train the two parts separately in a pipeline manner: the image reconstruction sub-net is first trained and then used as a pre-trained model for training the whole network. This approach can be viewed as either a two-step training strategy or a semi-decoupled strategy yao2017dr. In contrast, we trained the heterogeneous network in a purely end-to-end fashion. The two parts of the network were trained jointly with a separate learning-rate update scheme for each. Specifically, for the image reconstruction sub-net, we set a larger initial learning rate with faster decay. This encouraged rapid updating of the binary weights in the early stages of training and slower updates later on, facilitating the residual correction sub-net in recovering the fine image details. For the residual correction sub-net, we initialized a relatively small learning rate with a slower decay rate, since the residual correction for the details is more difficult to learn.
Denoting the original image as $x$, we aim to train the whole network to reconstruct the high-resolution image $\hat{x} = f(x; W)$, where $W$ denotes the weights of the model. We associated the loss function with the outputs of both sub-nets (parts), i.e. the reconstructed low-resolution image and the upscaled high-resolution image. In contrast to the common $\ell_2$-norm loss function used in previous work, we trained the network using the Charbonnier loss function, a differentiable variant of the $\ell_1$-norm. Given the generated image $\hat{x}_s$ at upscaling factor $s$, our loss function is written as
$L = \frac{1}{n} \sum_{i=1}^{n} \sum_{s} \alpha_s \, \rho\!\left(\hat{x}_s^{(i)} - x_s^{(i)}\right) + \lambda \lVert W \rVert_2^2$, where $n$ is the batch size and $\rho(z) = \sqrt{z^2 + \varepsilon^2}$ denotes the Charbonnier penalty. The second term is the $\ell_2$-norm regularization for the weights. Our experiments indicated that images generated using the Charbonnier loss function were usually sharper than those obtained with an $\ell_2$-norm loss. We accumulated the loss of both sub-nets. The low-resolution ground truth image was generated by downsizing the original image using bicubic interpolation. The scalar weight $\alpha_s$, set empirically for each part in our experiments, controls the influence of each term in the loss function. This multi-loss function forms a supervision scheme that can control the residual training at each part of the network.
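A sketch of the loss described above (the epsilon, the per-part weights and the regularization strength are illustrative assumptions, since the trained values are not reproduced here):

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier penalty rho(z) = sqrt(z^2 + eps^2), averaged over pixels:
    a smooth, differentiable variant of the l1 norm."""
    return np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))

def total_loss(outputs, targets, alphas, weights, lam=1e-4):
    """Multi-loss over both sub-net outputs (low- and high-resolution) plus
    an l2 regularization term on the model weights."""
    data = sum(a * charbonnier(o, t) for a, o, t in zip(alphas, outputs, targets))
    reg = lam * sum(np.sum(w ** 2) for w in weights)
    return data + reg
```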
We conducted a series of tests to study the performance of the network. First, we evaluated the image reconstruction quality (see Section 4.3) on three datasets. Our learned-pattern and fixed-pattern binary models showed the first and second highest peak signal-to-noise ratio (PSNR) compared with the four methods reviewed in Section 2. In Section 4.4, we analyze how fixed and learned patterns affected the model training process. Finally, in Section 4.5, we assess the reconstruction efficiency of the network in comparison with the other tested methods.
We used the DIV2K image dataset agustsson2017ntire for training and validation, with data augmentation applied to the training images. Specifically, we randomly cropped small patches from each of the images comprising the DIV2K dataset to generate the training images, and randomly applied flipping and rotation to the original patches. We used the cropped image patches as ground truth images for the high-resolution output.
Three datasets were used to test the model's performance. First, we used a benchmark dataset of 11 test images, which has been used in existing work, to evaluate the reconstructed image quality and compare it with the results of previous methods. Secondly, we evaluated the proposed method on a much larger dataset, the test set of ILSVRC2017, comprising natural images from 1,000 classes ILSVRC15. It is known that natural images are often approximately sparse in the domain of the discrete cosine transform (DCT) and the wavelet transform taubman2012jpeg2000, and CS is an efficient method for approximate recovery of such images. Since our method is an alternative to CS, we also evaluated the performance of our structured signal recovery method on images of various levels of sparsity. For this experiment, we generated a DCT-sparse version of the ILSVRC2017 test set and controlled the sparsity of the DCT coefficients as follows: each image was first transformed into the DCT domain, where the coefficients were reordered by magnitude; we then set five percentage thresholds on coefficient magnitude such that a fixed fraction of the largest-magnitude coefficients was retained and all other coefficients were set to zero.
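The DCT-sparsification step can be sketched as follows (square grayscale images are assumed; the orthonormal DCT matrix is built directly rather than via a library transform):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0] /= np.sqrt(2.0)
    return D

def dct_sparsify(image, keep_fraction):
    """Keep only the largest-magnitude fraction of 2D DCT coefficients,
    zero the rest, and transform back to the image domain."""
    D = dct_matrix(image.shape[0])
    coeffs = D @ image @ D.T                 # 2D DCT
    k = max(1, int(keep_fraction * coeffs.size))
    thresh = np.sort(np.abs(coeffs).ravel())[-k]
    coeffs[np.abs(coeffs) < thresh] = 0.0
    return D.T @ coeffs @ D                  # inverse 2D DCT
```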
4.2 Setting network parameters and hyperparameters
For the image reconstruction sub-net, we used patterns for both the sensing kernels and the transposed convolution kernels. For the residual blocks, the kernel size for the convolutional layers was and we used leaky ReLU activation with leaky rate . We used channels for each of the convolutional layers.
The network was trained with a batch size of using the Adam optimizer for epochs. For the image reconstruction sub-net, we set the initial learning rate and the decay rate to and respectively. For the residual correction sub-net, we set the initial learning rate and the decay rate to and respectively. We set the decay step to . The proposed method was trained on an NVidia GeForce GTX 1080Ti GPU.
In our experiment, we trained the network with different measurement ratios, $R = M/N$, of 1%, 10% and 25%, where $M$ is the number of sampling kernels and $N$ is the number of pixels in the sensing images. Accordingly, the numbers of binary kernels for the 128x128 benchmark sampling images are 164, 1638 and 4096.
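As a quick arithmetic check that the kernel counts match the stated ratios:

```python
# For a 128x128 sensing image there are N = 16384 pixels; the kernel counts
# reported above correspond to measurement ratios R = M/N of 1%, 10% and 25%.
N = 128 * 128
ratios = {M: M / N for M in (164, 1638, 4096)}
for M, R in ratios.items():
    print(M, round(R, 4))
# prints:
# 164 0.01
# 1638 0.1
# 4096 0.25
```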
4.3 Image reconstruction results
We evaluated our model on the benchmark dataset and compared the results with seven recently proposed methods: ReconNet kulkarni2016reconnet, DR-Net yao2017dr, Adp-Rec xie2017adaptive, Fully-Conv xie2017fully, 2FC2Res zhao2019visualizing, Fully-Block Net xie2018full, and CSNet shi2019image. To be consistent with previous work, we used the PSNR as the metric. The comparison results are summarized in Table 3. From the table, it can be seen that our network with learned patterns achieved the highest average PSNR at all three measurement ratios. Note the comparison with the Fully-Block Net and CSNet follows protocols that were reported in their work. Our model with learned patterns indicates better results using the same protocol.
Example images reconstructed by the different methods at measurement ratios of 1%, 10% and 25% are shown in Figures 5, 6 and 7 respectively. Our model reconstructed more details than the other methods, resulting in images that are visually sharper. At the lowest measurement ratio of 1%, the blocking effect is not observed in the output images generated by the Fully-Conv network and our network. This is because both methods used a convolutional layer rather than a fully-connected layer to implement the sensing; the network could therefore be trained in an end-to-end fashion and no post-processing was required to smooth the output images. At the measurement ratio of 10%, the blocking effect is eliminated for all methods, since a sufficient number of measurements was acquired. At the highest measurement ratio of 25%, the Fully-Conv network is visually comparable to our method, but our learned-weights model still achieved a higher PSNR value.
The difference between the results relating to the static patterns and the learned patterns, of our network, is significant at the measurement ratio of . The learned-patterns model achieved better average PSNR and reconstructed more detail. This implies that learning binary weights can help preserve more detail for the same measurement ratio and make the model converge faster, thereby reducing the training time.
Next, we evaluated the model on the ILSVRC2017 test dataset. Figure 8 shows the mean PSNR values of the reconstructed images from ILSVRC2017 test set. The mean PSNR values produced by our method in a large-scale test are similar to those produced on a small benchmark set, indicating good generalization. Furthermore, PSNR values increase with increased sparsity. This indicates that the model performs well also on DCT-sparse images.
We also found that the PSNR of the reconstructed images, at three measurement ratios, tend to be similar when we increase the sparsity of the image in the DCT domain. We present examples of reconstructed images in Table 1.
4.4 Model training analysis with fixed and binary sampling schemes
First, we analyzed the training efficiency by monitoring the validation loss under both sampling schemes. We found that training with the learned patterns produced a faster loss reduction at all three measurement ratios (as shown in Figure 9) than training with fixed patterns. When the measurement ratio was increased, the discrepancy between the losses of the two networks also increased. Furthermore, the network with learned patterns yielded a lower final loss than the fixed-patterns network, especially for $R$ of 10% and 25%. Even though the learned-patterns network showed some instability compared with the fixed-patterns network, it is still beneficial since it can be trained more quickly. (In the static scheme, the sampling patterns were not involved in the back-propagation calculation; only the real-valued weights in the rest of the network were updated. In the learned scheme, the binary weights were updated at each step, and the binarization function introduced fluctuations in the gradient calculation, which made the training progress less stable.)
Next, we analyzed the sparsity of the learned patterns by tracking the percentage of valid pixels (with value 1) in the patterns during pattern updates. In compressive sampling theory, one typically uses a small number of dense sensing patterns (equal numbers of ones and zeros), in contrast with raster-scan sensing, in which each pattern is maximally sparse (it contains a single 'on' pixel) and records the intensity of one pixel at a time. Conversely, sparse patterns are more efficient for single pixel imaging hardware as they require less on-board memory. Our approach effectively adapts the sparsity of the patterns to the measurement ratio and hence finds a compromise between sensing efficiency and hardware performance. Specifically, we initialized all patterns using a single-precision uniform distribution over a symmetric range (as required for model optimization), which was subsequently binarized to form patterns with a similar number of ones and zeros. However, the number of ones decreased dramatically during training, since the model at large sampling rates does not necessarily need dense patterns. In contrast, for the relatively small measurement ratio of 1%, the number of ones remained consistently high, which suggests that more information was sampled by each pattern. As a result, the sampling patterns at 10% and 25% contain fewer ones than the patterns at 1%, as seen in Figure 10. This variation with $R$ implies that the learning process can generate efficient binary sampling patterns that adapt to different measurement ratios.
4.5 Analysis of the reconstruction efficiency
We analyzed the computational efficiency of the network by calculating its time and space complexity, introduced below. The results demonstrate that our model achieves a good balance between computational cost and model size for the best image quality.
Table 2: Reconstruction efficiency and model size of 8 methods (columns: name, number of weights, format, number of convolutional layers, structure, weight sharing, and kernel size).
To determine the relative computational efficiency of our network, we compared the model size (space complexity) and the number of operations (time complexity) of our network's image-reconstruction layers with those of the other networks used in prior work (see Table 2). The comparison is based on the reconstruction of a single-channel (greyscale) image at a fixed measurement ratio, and it remains valid for any image size. For a convolutional layer, the time and space complexity are $O(F^2 \cdot K^2 \cdot C_{in} \cdot C_{out})$ and $O(K^2 \cdot C_{in} \cdot C_{out})$ respectively, where $F$ is the side length of the feature map, $K$ is the side length of the kernel, and $C_{in}$ and $C_{out}$ are the numbers of input and output channels.
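The per-layer cost model above can be evaluated directly. The sketch below assumes the standard convolution cost formulas stated in the text; the specific layer sizes (32x32 feature map, 3x3 kernel, 64 channels) are illustrative and not taken from the paper.

```python
# Sketch of the per-layer convolution cost model: time complexity
# F^2 * K^2 * C_in * C_out and space (parameter) cost K^2 * C_in * C_out.
# Layer sizes below are illustrative assumptions.

def conv_time_ops(F, K, c_in, c_out):
    """Multiply-accumulate count for one conv layer on an F x F feature map."""
    return F * F * K * K * c_in * c_out

def conv_params(K, c_in, c_out):
    """Weight count (space complexity) for one conv layer, bias omitted."""
    return K * K * c_in * c_out

# Example: a 3x3 convolution mapping 64 -> 64 channels on a 32x32 feature map.
print(conv_time_ops(32, 3, 64, 64))  # 37748736
print(conv_params(3, 64, 64))        # 36864
```

Summing these two quantities over all layers gives the time and space complexity figures compared in Table 2.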
Our network has the smallest model size of all the tested networks and a lower time complexity than the Fully-Conv network. Note that ReconNet, DR-Net, Adp-Rec, and 2RC2Res perform fewer operations in the initial image-reconstruction step because they use fully-connected layers; however, a fully-connected layer can only be trained for one specific image size, which is less practical.
For the residual-correction part, our recursive residual block with the LSHR sampling scheme generates smaller intermediate feature maps and uses fewer model weights, thereby reducing the computational burden. In the Fully-Conv and Fully-Block Net networks, images are reconstructed directly at the high-resolution size, and the reconstruction error is then corrected by applying convolutions to feature maps of the same size as the high-resolution test image. Since the time complexity scales with $F^2$, the square of the feature-map size, the computational cost of these networks increases quadratically when the output image size is doubled. In contrast, our network reconstructs the image at low resolution, performs the convolutional operations on small feature maps, and upscales back to the original size only at the last layer; this reduces the number of operations by a factor of four relative to the Fully-Conv and Fully-Block Net. Furthermore, the number of blocks does not affect the total number of weights, since weights are shared between blocks to form a recursive residual structure: the weights are shared between the first layers (and likewise the second layers) of each of the six two-layer recursive residual blocks.
The last part of our analysis evaluated the performance of the network for different numbers of residual blocks in the recursive structure, since the depth of the recursive residual block affects reconstruction accuracy. Figure 11 shows that image quality increases as blocks are added, with the best trade-off between accuracy and speed obtained with the 6-block structure; adding further blocks degrades image quality. In principle, more residual blocks could improve the capability of the residual mapping, but in practice a deeper network is harder to train. Figure 11 also shows that reconstruction time increases linearly with the number of blocks. Our final model therefore uses 6 blocks, which gave the best accuracy at a reasonable reconstruction time.
Reconstruction quality (dB) with static and learned sampling patterns at three measurement ratios:

|Ours (static)||27.52dB||24.57dB||22.03dB||Ours (static)||32.05dB||29.55dB||22.59dB|
|Ours (Learned)||31.11dB||24.56dB||22.34dB||Ours (Learned)||34.13dB||29.59dB||23.31dB|
|Ours (static)||30.36dB||26.07dB||17.10dB||Ours (static)||28.68dB||26.53dB||20.84dB|
|Ours (Learned)||33.38dB||26.40dB||17.23dB||Ours (Learned)||30.63dB||26.56dB||21.35dB|
|Ours (static)||28.00dB||24.34dB||16.81dB||Ours (static)||35.34dB||33.13dB||26.36dB|
|Ours (Learned)||31.01dB||24.66dB||17.27dB||Ours (Learned)||36.91dB||33.45dB||27.13dB|
|Ours (static)||31.60dB||29.37dB||23.13dB||Ours (static)||34.80dB||32.55dB||24.82dB|
|Ours (Learned)||34.18dB||29.57dB||23.52dB||Ours (Learned)||36.61dB||33.73dB||25.12dB|
|Ours (static)||31.51dB||28.71dB||20.09dB||Ours (static)||31.20dB||28.23dB||21.52dB|
|Ours (Learned)||34.20dB||29.07dB||20.79dB||Ours (Learned)||33.51dB||28.61dB||22.10dB|
|Ours (static)||32.64dB||29.84dB||22.57dB||Ours (static)||31.25dB||28.44dB||21.62dB|
|Ours (Learned)||34.75dB||30.18dB||23.01dB||Ours (Learned)||33.68dB||28.67dB||22.11dB|
|Ours (Learned)||33.66dB||29.04dB||22.79dB|
Results for CSNet0,1 and CSNet at R = 25% were not reported in their work shi2019image.
The Fully-Block Net xie2018full was tested only on a subset of the standard benchmark, specifically seven images from the standard benchmark set; to compare with their results, we present our results on the same subset in the table.
5 Implementation on hardware
In real-world applications, the signal/image sampling is usually done by optical devices which inevitably introduce noise and artifacts into the image data. Computer simulations alone provide no guarantees that an image recovery network architecture will be robust to these aspects of practical single-pixel imaging systems. Therefore it is important to validate the efficacy of our LSHR-Net software solution, which uses learned binary patterns, with respect to typical single pixel imaging hardware.
Our hardware comprised a silicon planar photo-detector with a purpose-designed amplifier circuit, lenses, and a light projector. The photo-detector had a peak sensitivity at a wavelength of and a sensitive area of . We connected the circuit to an Arduino board, which performed 10-bit analog-to-digital conversion (1024 levels). For evaluation purposes, we used test images from a database as an alternative to setting up unique object scenes. Test images were multiplied, in software, with each of the sampling patterns (forming modulated images) and projected using a TI DLP LightCrafter evaluation module with a built-in DMD plane. The pixel size of the sampling patterns was constrained by the sensitivity of the photo-detector and the analog-to-digital conversion resolution; a good practical resolution for the sampling patterns was found to be 16x16 pixels. Each modulated test image was focused onto the photo-detector using a set of lenses with focal lengths of , and . A filter with fixed attenuation was used to reduce the light intensity at the photo-detector, thereby avoiding saturation. We recorded the light intensity of the modulated images and sent these measurements as input data to the model.
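The measurement chain above follows the inner-product model from the introduction, with the Arduino's 10-bit converter quantizing each reading to one of 1024 codes. The sketch below simulates one such measurement; the 16x16 pattern size matches the text, while the scene values and the full-scale intensity are illustrative assumptions.

```python
# Sketch of the single-pixel measurement model: each measurement is the
# inner product of a binary sampling pattern with the scene, digitized by
# a 10-bit ADC (1024 levels). Scene values and full-scale range are
# illustrative assumptions.
import random

def measure(scene, pattern, full_scale, levels=1024):
    """Inner product of pattern and scene, quantized to an ADC code."""
    intensity = sum(s * p for s, p in zip(scene, pattern))
    code = int(intensity / full_scale * (levels - 1))
    return max(0, min(levels - 1, code))

random.seed(1)
n = 16 * 16                                   # 16x16 sampling patterns
scene = [random.uniform(0.0, 1.0) for _ in range(n)]
pattern = [random.randint(0, 1) for _ in range(n)]
full_scale = float(n)                         # max possible summed intensity
print(0 <= measure(scene, pattern, full_scale) < 1024)  # prints: True
```

Repeating this for each sampling pattern yields the vector of measurements that is fed to the reconstruction network.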
For the hardware experiments, we trained our model on the MNIST dataset lecun-mnisthandwrittendigit-2010 using the same training settings described in Section 4. The model was evaluated on 18 randomly selected test images of handwritten characters (9 each from the MNIST and Omniglot datasets). We used the Omniglot dataset lake2015human, which consists of characters from a range of natural languages, to demonstrate that the proposed method generalizes to images whose structure differs from that of the training set.
The model reconstructed images directly from the photo-detector measurements at super-resolution size. We evaluated performance at the same measurement ratios used in Section 4.2; results on MNIST and Omniglot are shown in Figures 12 and 13 respectively. The reconstruction quality of the character structure improved as the number of measurements increased. At the same time, artifacts can be seen in the reconstructed images; these are caused predominantly by noise in the hardware setup (e.g. the amplifier circuit), which limited the average SNR of the recorded measurement signal. Figures 12 and 13 also show that the reconstructions at the lowest measurement ratio are more pixelated than those at the higher ratios, although visually the reconstruction quality benefits from the smoothing effect also seen in Figure 5.
In this paper, we have proposed a hardware-friendly method for image reconstruction from compressively sensed measurements using mixed-weight deep neural networks. The proposed method, which consists of sampling and reconstruction networks, was designed specifically to ease hardware realization, in particular integration with a single pixel camera. Our novel LSHR-Net uses trainable binary sampling patterns that can be deployed on a single pixel camera's DMD sampling array; it samples light intensity at low resolution and reconstructs images with high-resolution detail. This effectively reduces the number of measurements at a given measurement ratio and lowers the convolutional computing cost, significantly improving the efficiency of the reconstruction process compared with previous work. To reduce the hardware storage required for image reconstruction, the reconstruction network employs long-term recursive residual blocks with a weight-sharing strategy, which makes our trained models much more compact than previously reported network architectures and requires less on-board storage in the imaging hardware. Experimental results on benchmark image datasets indicate that our method yields better image quality than previous work at a number of different measurement ratios. We also implemented our method on proof-of-concept hardware and demonstrated that it can sample images as compact measurements and then successfully recover them. Our network architecture has potential applications beyond single pixel imaging; for example, it may be adapted to similar imaging modalities such as coded aperture imaging and structured light sensing. An efficient approach to training the network for different imaging modalities may involve transfer learning, which could be the focus of future work in this area.
Moreover, for a specific hardware setup, fine-tuning after the initial deployment of hardware can potentially yield improvements in image quality using software alone.
We dedicate this article to the memory of Craig Douglas (Seismicstuff Ltd), who designed and made the amplifier circuit used in our hardware experiment.