Information is one of the most important components of the human world, and visual information accounts for a large share of it. Billions of images and videos surround our daily life. Computer vision has undergone a huge resurgence in recent years, since deep learning has made a significant difference in this field. Researchers have shown that deep learning has achieved breakthroughs in two broad categories. The first category is high-level computer vision tasks, for example image and video classification or recognition, object detection, image captioning, and visual tracking. The second category is low-level reconstruction tasks, for example image denoising, super-resolution, style transfer, and optical flow estimation.
Research on inverse problems in imaging has been carried out for decades and covers various low-level computer vision tasks. Compressive sensing (CS) is a typical inverse problem in imaging. Conventional CS recovers the signal with optimization algorithms. However, these algorithms are hard to implement and computationally expensive. Applying deep neural networks to inverse problems in imaging makes it possible to recover images from CS measurements in real time. Data-driven CS learns the recovery network from training data. Adp-Rec jointly trains the encoder and decoder and brings significant improvement in reconstruction quality. The fully convolutional measurement network (FCMN) is the first to measure and recover full images. However, all the above methods focus on the pixel level and ignore high-level structure information. As a result, the reconstructed images look overly smooth and have an unsatisfactory visual effect. To overcome this drawback, we add high-level perceptual information to CS. The question is how to add high-level perceptual information to the low-level CS task.
Recently, perceptual loss has been used in many reconstruction tasks, such as style transfer. These tasks combine low-level detailed information with high-level semantic information, and perceptual loss is a widely used way to achieve this. Because perceptual loss is defined in feature space, it transfers the ability to capture high-level structure information to the recovery network. Thus, the recovered images contain rich structure information. Inspired by these applications, we propose perceptual CS, which focuses on sensing and recovering structure information by employing perceptual loss in the CS framework. We use FCMN as the base network to measure and recover scene images, and train it with perceptual loss. We find that this framework is capable of capturing and recovering structure information, especially at extremely low measurement rates, where the measurements contain only a very limited amount of information.
The contribution of this paper is perceptual CS, which can measure and recover the structure information of scene images. Note that only one deconvolutional layer and one Res-block are used in the proposed framework. This serves only as an illustration; one can employ a deeper network architecture if necessary.
Moreover, perceptual CS suggests a universal architecture. One can change the loss network, using pre-trained or dynamic feature extractors, for more specific tasks. In this paper, we use VGG as an example. Our code will be available on GitHub (https://github.com/jiang-du/Perceptual-CS) for further reproduction.
The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 describes the technical design and theoretical analysis of the proposed framework. Section 4 presents experimental results of perceptual CS with detailed analysis. Section 5 draws the conclusion.
2 Related Work
2.1 Compressive Sensing
CS theory proves that a signal can be reconstructed after being sampled at sub-Nyquist rates, as long as the signal is sparse in a certain domain. Reconstructing the signal from its measurements is an ill-posed problem. Traditional CS usually solves an optimization problem, which leads to high computational complexity.
Recently, deep neural networks (DNNs) have been applied to CS tasks. These DNN-based methods can be divided into two categories, depending on whether the measurement and reconstruction processes are trained jointly. The first category trains the recovery network while the measurement part is fixed, as in SDA, ReconNet, and DeepInverse. SDA is the first to apply a deep learning approach to the CS recovery problem, and uses fully-connected layers in the recovery part. ReconNet uses a fully-connected layer along with convolutional layers to recover signals block by block, whereas DeepInverse uses purely convolutional layers. Because the measurement part in these methods is a fixed random Gaussian matrix, it may be mismatched with the learned recovery part.
The second category jointly trains the measurement part and the recovery part, as in Deepcodec, Adaptive, and FCMN. These methods overcome the problem that the measurement part is independent of the recovery part. Deepcodec is a framework in which both the measurement and the approximate inverse process are learned end-to-end by a deep fully-connected encoder-decoder network. In the adaptive measurement network, a fully-connected layer serving as the measurement matrix is trained along with a super-resolution network serving as the recovery part. FCMN is the first to use a fully convolutional network, in which the measurement part is implemented with an overlapped convolution operation. All these methods recover the scene image at the pixel level and ignore the structure information of images.
2.2 Perceptual Loss
Recently, perceptual loss has been widely used in many image reconstruction tasks. It can recover images with a better visual effect since it is defined in feature space. Typically, perceptual loss calculates the Euclidean distance between the feature maps of the reconstructed images and those of the labels, taken from the same layer of the same pre-trained classification network. Perceptual loss reflects the feature-level similarity between the label and output images, which makes the reconstructed images retain high-level structure information. In contrast, per-pixel loss focuses on pixel-level similarity, which only preserves low-level pixel information.
Perceptual loss achieves better visual quality than per-pixel loss in most image restoration tasks. For example, Johnson et al. use perceptual loss for style transfer and super-resolution, and the output images have sharper edges than those obtained with per-pixel loss. SRGAN trained with perceptual loss generates more photo-realistic super-resolved images than SRGAN trained with MSE loss. When used in image inpainting, perceptual loss produces satisfactory results due to the addition of high-level context. Additionally, perceptual loss helps to retain finer details in image editing. Inspired by the advantages of perceptual loss in preserving structure and detail, we apply it to the CS field, where it also performs well.
3 Perceptual CS Framework
In this section, we introduce the technical design of the perceptual CS framework. The architecture is shown in Fig. 1. It consists of two parts: a compressive sensing network and a perceptual loss network. The compressive sensing network originally performs reconstruction in a pixel-wise manner. With the perceptual loss network added, the perceptual CS network preserves the structure information of the recovered images. With the help of perceptual recovery, the proposed network is able to acquire high-level perceptual information.
The compressive sensing network measures and recovers full scene images. Processing the full image provides a receptive field large enough to make perceptual reconstruction possible. In the perceptual loss network, we employ a classification network, VGG19, as an auxiliary network. Its role is to extract the perceptual information of the images.
3.1 Full Image Compressive Sensing Network
In most existing CS methods, the scene image is measured and recovered block by block, and each block is reshaped into a column vector. This breaks the structure of the full image. Besides, the computational cost of these methods increases dramatically as the image becomes larger, because the mapping from the scene image to the measurements is fully connected and the number of parameters in the sensing matrix explodes. For a large image, the memory consumption of the sensing matrix alone makes it nearly impossible to design a full-image sensing matrix, let alone measure the full image with it.
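To make the parameter growth concrete, the following minimal sketch compares the number of entries in a fully-connected sensing matrix with the number of weights in a small convolutional measurement layer. The image size, measurement rate, kernel size and channel count are hypothetical values chosen only for illustration; they are not the settings used in this paper.

```python
def dense_sensing_params(height, width, rate):
    """Entries in a fully-connected sensing matrix for one vectorized image."""
    n = height * width          # length of the vectorized image
    m = round(rate * n)         # number of measurements at this rate
    return m * n


def conv_sensing_params(kernel, channels):
    """Weights in a single-input-channel convolutional measurement layer (no bias)."""
    return kernel * kernel * channels


# Hypothetical sizes, for illustration only.
h, w, rate = 256, 256, 0.01
dense = dense_sensing_params(h, w, rate)
conv = conv_sensing_params(kernel=32, channels=10)
dense_mib = dense * 4 / 2**20   # memory of the dense matrix in float32, in MiB

print(f"dense matrix entries: {dense}  ({dense_mib:.2f} MiB as float32)")
print(f"conv layer weights:   {conv}")
```

The dense count grows with the square of the number of pixels, while the convolutional count is independent of the image size.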
Inspired by the fully convolutional measurement network (FCMN), we employ a fully convolutional architecture to measure and recover the scene images in the proposed framework, which avoids the explosion in the number of parameters. Moreover, the fully convolutional architecture preserves the spatial correspondence among pixels, instead of reshaping them into a column vector. In this way, block effects are largely removed from the recovered images thanks to the overlapped convolutional measurement, and the structure information of the whole image is preserved. Furthermore, the full-image method makes it possible to use perceptual loss for semantic reconstruction.
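The overlapped convolutional measurement can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the FCMN implementation: the kernel size, stride, channel count and image size are hypothetical, and the kernels are random placeholders for weights that would be learned in practice. The stride is smaller than the kernel size, so neighboring measurement windows overlap.

```python
import random


def conv_measure(image, kernels, stride):
    """Measure a 2-D image with strided convolution kernels (no padding)."""
    h, w = len(image), len(image[0])
    k = len(kernels[0])                      # kernels are k x k
    out = []
    for kern in kernels:                     # one output channel per kernel
        channel = []
        for top in range(0, h - k + 1, stride):
            row = []
            for left in range(0, w - k + 1, stride):
                acc = 0.0
                for i in range(k):
                    for j in range(k):
                        acc += kern[i][j] * image[top + i][left + j]
                row.append(acc)
            channel.append(row)
        out.append(channel)
    return out


random.seed(0)
size, k, stride, channels = 16, 4, 2, 1      # hypothetical settings
image = [[random.random() for _ in range(size)] for _ in range(size)]
kernels = [[[random.gauss(0, 1) for _ in range(k)] for _ in range(k)]
           for _ in range(channels)]         # placeholders for learned weights

y = conv_measure(image, kernels, stride)
num_meas = sum(len(row) for ch in y for row in ch)
rate = num_meas / (size * size)              # measurement rate = measurements / pixels
print(num_meas, rate)
```

In this toy setting, the measurement rate is controlled by the stride and the number of kernels rather than by the size of a dense matrix.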
Although the convolution and deconvolution layers alone can recover the image, we enhance the proposed framework with residual learning for a better visual effect. Specifically, we add just one residual block, which already works quite well, as shown in Fig 2 (b). One can add more residual blocks for further improvement if necessary.
3.2 Perceptual Reconstruction for Compressive Sensing
In the proposed network, we focus on perceptual recovery. In the classic CS task, the recovery network minimizes the error in pixel space. To extract structure information, we instead recover the scene image in feature space: in place of MSE loss, we use perceptual loss.
Pixel-wise loss: In classic CNN-based CS, the loss function is usually defined as the pixel-wise mean squared error (MSE) loss, referred to as (1). This pixel-wise loss minimizes the average Euclidean distance between the reconstructed images and the labels over the parameters of the whole network, including the measurement and recovery parts. Although the MSE loss in (1) helps to achieve reconstructed images with a high peak signal-to-noise ratio (PSNR), the reconstructed images usually look smooth and the structure information is not clear. We can see in Fig 2 (b) that the face and the hat of the person are very smooth compared with the original image in Fig 2 (a). In particular, the wrinkles on the face cannot be clearly seen.
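Following the description of the pixel-wise loss as an average Euclidean distance between reconstructions and labels, it can be sketched in the form below. The notation is an assumption introduced here for illustration: $x_i$ is the $i$-th label image, $f(x_i;\theta)$ is the network output obtained by measuring and recovering $x_i$, $\theta$ collects the parameters of the measurement and recovery parts, and $N$ is the number of training images. The normalization constant may differ in a given implementation.

```latex
\mathcal{L}_{\mathrm{MSE}}(\theta)
  = \frac{1}{N} \sum_{i=1}^{N} \left\| f(x_i;\theta) - x_i \right\|_2^2
```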
Perceptual loss: Since popular classification networks work by extracting features from an image, we can exploit this ability in the proposed method. Thus, we apply the perceptual loss, which is computed on the feature map of a chosen layer of VGG19 for a given input image. Different from (1), a typical perceptual loss is defined as the (squared, normalized) Euclidean distance between the feature maps generated from the reconstructed image and the label. When applying CS at a very low measurement rate, we do not care much about detailed texture and instead emphasize structural information. As shown in Fig 2 (c) and (d), the structure information is recovered better; in particular, the hat of the person has richer structure than in Fig 2 (b).
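As a sketch, the (squared, normalized) Euclidean distance in feature space can be written as below. The notation is an assumption introduced here for illustration: $\phi_j(\cdot)$ is the feature map of the $j$-th layer of VGG19 with shape $C_j \times H_j \times W_j$, $x$ is the label image, and $f(x;\theta)$ is the image reconstructed by the network with parameters $\theta$.

```latex
\mathcal{L}_{\mathrm{perc}}^{\phi,j}(\theta)
  = \frac{1}{C_j H_j W_j}
    \left\| \phi_j\big(f(x;\theta)\big) - \phi_j(x) \right\|_2^2
```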
In practice, we define the loss function on two layers of VGG19 (pooling 2 or pooling 3) as examples. The results are shown in Fig 2 (c) and (d). The feature maps of the bottom layers contain detailed low-level information, while the top layers carry more high-level semantic features. Other layers can also be chosen according to different requirements. In this paper, we do not apply perceptual loss on very high-level layers, because in compressive sensing the higher levels drop so much information that the mapping is nearly impossible to invert, even with a pre-trained network.
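The trade-off between layer depth and retained information can be illustrated with a toy example. The sketch below is not VGG19: it uses repeated 2x2 average pooling as a hypothetical stand-in feature extractor, purely to show how a loss defined on deeper features stops responding first to fine texture and then to coarse structure.

```python
def avg_pool2(img):
    """Hypothetical stand-in feature extractor: 2x2 average pooling."""
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
             for j in range(0, len(img[0]) - 1, 2)]
            for i in range(0, len(img) - 1, 2)]


def mse(a, b):
    """Mean squared error between two 2-D arrays of equal shape."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum((p - q) ** 2 for p, q in zip(flat_a, flat_b)) / len(flat_a)


def feature_loss(recon, label, extractor, depth):
    """MSE between features after applying the extractor `depth` times."""
    fr, fl = recon, label
    for _ in range(depth):
        fr, fl = extractor(fr), extractor(fl)
    return mse(fr, fl)


n = 4
texture = [[float((i + j) % 2) for j in range(n)] for i in range(n)]      # fine checkerboard
edge = [[0.0 if j < n // 2 else 1.0 for j in range(n)] for i in range(n)]  # coarse vertical edge
flat = [[0.5] * n for _ in range(n)]                                       # over-smoothed output

print(mse(flat, texture), feature_loss(flat, texture, avg_pool2, 1))
print(feature_loss(flat, edge, avg_pool2, 1), feature_loss(flat, edge, avg_pool2, 2))
```

In this toy setting, the pixel loss penalizes the missing checkerboard texture, one level of pooling ignores the texture but still penalizes the missing edge, and two levels of pooling can no longer distinguish the edge from a flat image.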
4 Experiments with Analysis
In this section, we conduct experiments to illustrate the performance of the proposed perceptual CS framework. We test our framework on a standard dataset containing 11 grayscale images. We also compare the reconstruction results with those of some typical CS methods. Furthermore, we take some reconstruction results as examples for a detailed analysis of the performance of the proposed method.
Experiment Setup. The learning rate is set separately for each of the two VGG19 layers on which the perceptual loss is defined. The batch size is set to 5 during training. For each measurement rate, the network is trained for a fixed number of iterations. We use the Caffe framework for network training and MATLAB for testing. Our computer is equipped with an Intel Core i7-6700K CPU at 4.0 GHz, 4 NVIDIA GeForce GTX Titan XP GPUs, and 128 GB of RAM, and runs the Ubuntu 16.04 operating system. The training dataset consists of 800 images, downsampled and cropped from the 800 images in the DIV2K dataset.
Results with analysis. The following is the analysis of the experimental results at different measurement rates.
At a measurement rate of 1% (Fig 3), the observation is as follows.
All existing CS-based image reconstruction methods rely on MSE loss, whereas FCMN, which processes the full image, makes the use of perceptual loss promising.
At a measurement rate of 4% (Fig 4), the observations are as follows:
(1) DR-Net achieves the highest PSNR among the random Gaussian methods, since it adds several Res-blocks that fully converge in the reconstruction stage.
(2) The method with adaptive measurement in Fig 4 (d) adopts one Res-block and achieves the highest PSNR. The comparison among several typical methods, including DR-Net, is shown in Fig 4, where FCMN with the full image gets the best result in terms of PSNR.
It should be pointed out that only one Res-block is used in both FCMN and the proposed framework. One can add more Res-blocks for further improvement.
(3) With just one Res-block, perceptual loss in Fig 4 (e) and (f) works well and improves on FCMN. Structure information is kept. In some cases, even weak structure can become strong (see Fig 4 (f) compared with Fig 4 (a) and (d)).
It should be noted that, even though PSNR is worse with perceptual loss, the structure information is clearly reconstructed.
Evaluation of Perceptual CS. To evaluate the performance of the proposed method, we assess the quality of the reconstructed images with PSNR and SSIM. Furthermore, we use the Mean Opinion Score (MOS) to test the visual effect of these methods. In this metric, each image is scored by 26 volunteers and the final score is the average value. Quality is rated on a scale from 1 to 5, where 1 denotes the lowest quality and 5 the highest. All test images are shuffled randomly before being scored and are displayed group by group. Each group contains six reconstructed images obtained with different methods. All participants take the test on the same computer screen, from the same angle and distance. The distance from the screen to the participants is 50 cm, and their eyes are at the same height as the center of the screen.
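The MOS computation described above amounts to averaging the per-volunteer ratings of an image. A minimal sketch follows; the ratings are hypothetical values made up for illustration and are not the scores collected in this study.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average of integer ratings on the 1 (lowest) to 5 (highest) scale."""
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating out of range: {r}")
    return mean(ratings)


# Hypothetical ratings from 26 volunteers for one reconstructed image.
ratings = [4] * 10 + [5] * 6 + [3] * 8 + [2] * 2
mos = mean_opinion_score(ratings)
print(len(ratings), round(mos, 2))
```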
The detailed comparison of mean PSNR, SSIM and MOS is shown in Table 1, from which we can draw the following conclusion. Our method achieves the highest MOS rating. The PSNR and SSIM values of the typical methods are higher, since their loss function is defined as the Euclidean distance between the output and the label. In contrast, perceptual CS concentrates more on the visual effect, and is therefore beneficial for MOS rather than for PSNR and SSIM.
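For reference, PSNR is a direct function of the pixel-wise MSE, which is why methods trained with MSE loss tend to score well on it. The sketch below uses the standard definition with an assumed peak value of 1.0 and tiny hypothetical images; it also shows that two reconstructions with visually different errors can share the same PSNR.

```python
import math


def psnr(recon, label, peak=1.0):
    """Peak signal-to-noise ratio in dB between two 2-D arrays of equal shape."""
    diffs = [(r - l) ** 2
             for recon_row, label_row in zip(recon, label)
             for r, l in zip(recon_row, label_row)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


label = [[0.0, 1.0], [1.0, 0.0]]
smooth = [[0.5, 0.5], [0.5, 0.5]]       # structure washed out everywhere
one_wrong = [[0.0, 1.0], [1.0, 1.0]]    # structure kept, a single pixel wrong

print(psnr(smooth, label), psnr(one_wrong, label))
```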
Moreover, we give some examples with color images. For color images, we measure and recover the R, G and B channels separately, and then combine them into a whole color image. The results of perceptual CS on color images are shown in Fig 5, together with a comparison with existing methods. The figure shows that the visual effect of perceptual CS is quite good.
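The channel-wise processing of color images can be sketched as follows. The `measure` and `recover` functions here are hypothetical placeholders (2x2 average pooling and nearest-neighbor upsampling) standing in for the trained measurement and recovery networks; only the per-channel structure is the point of the example.

```python
def measure(channel):
    """Hypothetical measurement: 2x2 average pooling of one channel."""
    return [[(channel[i][j] + channel[i][j + 1]
              + channel[i + 1][j] + channel[i + 1][j + 1]) / 4.0
             for j in range(0, len(channel[0]) - 1, 2)]
            for i in range(0, len(channel) - 1, 2)]


def recover(meas):
    """Hypothetical recovery: nearest-neighbor upsampling by a factor of 2."""
    out = []
    for row in meas:
        wide = [v for v in row for _ in range(2)]
        out.append(list(wide))
        out.append(list(wide))
    return out


def reconstruct_color(rgb):
    """Measure and recover each of the three channels independently."""
    return [recover(measure(channel)) for channel in rgb]


# A 4x4 color image with constant channels, given as [R, G, B].
rgb = [[[value] * 4 for _ in range(4)] for value in (0.25, 0.5, 0.75)]
out = reconstruct_color(rgb)
print(len(out), len(out[0]), len(out[0][0]))
```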
In terms of hardware implementation, we follow the approach of existing work in which a sliding window is used to measure the scene. Similarly, we can replace the random Gaussian measurement matrix with the learned parameters of the convolution layer in the measurement network. The reconstruction part does not run on the optical device, so only the measurement part needs to be implemented in this way.
5 Conclusion
In this paper, we propose perceptual CS for sensing and recovering structured scene images. The proposed framework manages to recover structure information from CS measurements. We believe this work may open a door towards semantic sensing and recovery.
Acknowledgments
This work is supported by the Natural Science Foundation (NSF) of China (61472301, 61632019) and a Ministry of Education project (6141A02011601).
References
-  Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: Dataset and study. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. vol. 3, p. 2 (2017)
-  Baraniuk, R.G.: Compressive sensing [lecture notes]. IEEE Signal Processing Magazine 24(4), 118–121 (July 2007). https://doi.org/10.1109/MSP.2007.4286571
-  Baraniuk, R.G.: More is less: signal processing and the data deluge. Science 331(6018), 717–719 (2011)
-  Candes, E.J., Tao, T.: Decoding by linear programming. IEEE Transactions on Information Theory 51(12), 4203–4215 (2005)
-  Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G.: Stylebank: An explicit representation for neural image style transfer. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
-  Cheng, J., Tsai, Y.H., Wang, S., Yang, M.H.: Segflow: Joint learning for video object segmentation and optical flow. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
-  Dong, C., Chen, C.L., He, K., Tang, X.: Learning a Deep Convolutional Network for Image Super-Resolution. Springer International Publishing (2014)
-  Donoho, D.L.: Compressed sensing. IEEE Transactions on Information Theory 52(4), 1289–1306 (April 2006). https://doi.org/10.1109/TIT.2006.871582
-  Figueiredo, M.A.T., Nowak, R.D., Wright, S.J.: Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing 1(4), 586–597 (2008)
-  Fu, J., Zheng, H., Mei, T.: Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
-  Gan, Z., Gan, C., He, X., Pu, Y., Tran, K., Gao, J., Carin, L., Deng, L.: Semantic compositional networks for visual captioning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
-  He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask r-cnn. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
-  Huang, R., Zhang, S., Li, T., He, R.: Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2458–2467 (July 2017)
-  Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
-  Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the 22Nd ACM International Conference on Multimedia. pp. 675–678. MM ’14, ACM, New York, NY, USA (2014). https://doi.org/10.1145/2647868.2654889
-  Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. pp. 694–711. Springer International Publishing, Cham (2016)
-  Kulkarni, K., Lohit, S., Turaga, P., Kerviche, R., Ashok, A.: Reconnet: Non-iterative reconstruction of images from compressively sensed measurements. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 449–458 (June 2016)
-  Kunis, S., Rauhut, H.: Random sampling of sparse trigonometric polynomials, ii. orthogonal matching pursuit versus basis pursuit. Foundations of Computational Mathematics 8(6), 737–763 (2008)
-  Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., Shi, W.: Photo-realistic single image super-resolution using a generative adversarial network. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
-  Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
-  Lohit, S., Kulkarni, K., Kerviche, R., Turaga, P., Ashok, A.: Convolutional neural networks for non-iterative reconstruction of compressively sensed images. arXiv preprint arXiv:1708.04669 (2017)
-  Lucas, A., Iliadis, M., Molina, R., Katsaggelos, A.K.: Using deep neural networks for inverse problems in imaging: Beyond analytical methods. IEEE Signal Processing Magazine 35(1), 20–36 (Jan 2018). https://doi.org/10.1109/MSP.2017.2760358
-  McCann, M.T., Jin, K.H., Unser, M.: Convolutional neural networks for inverse problems in imaging: A review. IEEE Signal Processing Magazine 34(6), 85–95 (Nov 2017). https://doi.org/10.1109/MSP.2017.2739299
-  Mousavi, A., Baraniuk, R.G.: Learning to invert: Signal recovery via deep convolutional networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 2272–2276 (2017)
-  Mousavi, A., Dasarathy, G., Baraniuk, R.G.: Deepcodec: Adaptive sensing and recovery via deep convolutional neural networks. arXiv preprint arXiv:1707.03386 (2017)
-  Mousavi, A., Patel, A.B., Baraniuk, R.G.: A deep learning approach to structured signal recovery. In: Communication, Control, and Computing. pp. 1336–1343 (2016)
-  Ranjan, R., Sankaranarayanan, S., Bansal, A., Bodla, N., Chen, J.C., Patel, V.M., Castillo, C.D., Chellappa, R.: Deep learning for understanding faces: Machines may be just as good, or better, than humans. IEEE Signal Processing Magazine 35(1), 66–83 (Jan 2018). https://doi.org/10.1109/MSP.2017.2764116
-  ITU-R: Recommendation BT.500-10: Methodology for the subjective assessment of the quality of television pictures. ITU-R Rec. BT.500-10 (2000)
-  Shen, W., Liu, R.: Learning residual images for face attribute manipulation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1225–1233 (July 2017)
-  Shi, G., Gao, D., Song, X., Xie, X., Chen, X., Liu, D.: High-resolution imaging via moving random exposure and its simulation. IEEE Transactions on Image Processing 20(1), 276–282 (Jan 2011). https://doi.org/10.1109/TIP.2010.2052271
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. Computer Science (2014)
-  Song, Y., Ma, C., Gong, L., Zhang, J., Lau, R.W.H., Yang, M.H.: Crest: Convolutional residual learning for visual tracking. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
-  Tropp, J.A., Gilbert, A.C.: Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory 53(12), 4655–4666 (2007)
-  Wang, Y., Long, M., Wang, J., Yu, P.S.: Spatiotemporal pyramid network for video action recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
-  Xie, X., Du, J., Wang, C., Shi, G., Xu, X., Wang, Y.: Fully convolutional measurement network for compressive sensing image reconstruction. Neurocomputing (2018). https://doi.org/10.1016/j.neucom.2018.04.084
-  Xie, X., Wang, Y., Shi, G., Wang, C., Du, J., Han, X.: Adaptive measurement network for cs image reconstruction. In: CCF Chinese Conference on Computer Vision. pp. 407–417. Springer (2017)
-  Yao, H., Dai, F., Zhang, D., Ma, Y., Zhang, S., Zhang, Y.: Dr-net: Deep residual reconstruction network for image compressive sensing. arXiv preprint arXiv:1702.05743 (2017)
-  Yao, T., Pan, Y., Li, Y., Qiu, Z., Mei, T.: Boosting image captioning with attributes. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
-  Yeh, R.A., Chen, C., Lim, T.Y., Schwing, A.G., Hasegawa-Johnson, M., Do, M.N.: Semantic image inpainting with deep generative models. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6882–6890 (July 2017)
-  Yun, S., Choi, J., Yoo, Y., Yun, K., Young Choi, J.: Action-decision networks for visual tracking with deep reinforcement learning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)