Image sampling aims either to generate a low-resolution (LR) image from a high-resolution (HR) image or, in reverse, to reconstruct the HR image from the LR one. Single-image up-sampling is widely used in computer vision applications including HDTV, medical imaging, satellite imaging, and surveillance, where high-frequency details are required on demand. As the use of mobile social networks (e.g., Google+, WeChat, and Twitter) continues to grow, thumbnail down-sampling has become another important way to optimize data storage and transmission over limited-capacity channels.
Image up-sampling, also known as super-resolution (SR), has been studied for decades. Early methods, including bicubic interpolation, Lanczos resampling, gradient profiles, and patch redundancy, are based on statistical image priors or internal patch representations. More recently, learning-based methods such as neighbor embedding, sparse coding, and random forests have been proposed to model the mapping from LR to HR patches. However, image up-sampling is highly ill-posed, since the HR-to-LR process involves non-invertible down-sampling.
Due to their powerful learning capability, deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in many computer vision tasks, such as image classification, object detection, and image segmentation. Recently, CNNs have been used to address this ill-posed inverse problem, demonstrating superiority over traditional learning paradigms. The first example of this approach, the super-resolution convolutional neural network (SRCNN), predicted the nonlinear LR-to-HR mapping in an end-to-end manner. To reduce computational complexity, fast SRCNN (FSRCNN) and the efficient sub-pixel convolutional neural network (ESPCN) up-scaled the resolution only at the output layer. Kim et al. developed a very deep super-resolution (VDSR) network with 20 convolutional layers trained by residual learning, and Mao et al. proposed a 30-layer residual encoder-decoder (RED) network with symmetric skip connections to facilitate training. The deeply-recursive convolutional network (DRCN) introduced a very deep recursive layer via a chain structure with 16 recursions, and the deep recursive residual network (DRRN) adopted recursive residual units to control the model parameters while increasing depth.
Despite achieving excellent performance, the CNN-based methods are highly dependent on interpolation-based down/up-sampling, as shown in Fig. 1. The limitations of these methods arise from two aspects:
Down-sampling. A model trained for a specific interpolation does not work well with other interpolations, and different interpolations significantly alter restoration accuracy. Therefore, all the solutions mentioned above generally assume that the degradation is a bicubic interpolation (the default setting of in Matlab) when shrinking an image. However, the bicubic function, which is based on a weighting transformation, represents a cheap down-sampling process that discards high-frequency details useful for HR image restoration.
Up-sampling. With cheap up-sampling, the networks [18, 19, 20, 21] increase the resolution during preprocessing or at the first layer to learn the interpolation’s residual. However, the up-sampling does not add information to solve the ill-posed restoration problem. To replace cheap interpolation, ESPCN and FSRCNN adopt sub-pixel shuffling and a deconvolution layer, respectively, to improve efficiency. However, without cheap up-sampling, they must carry the input through the network and restore the details like an auto-encoder, and so converge slowly.
To address the above problems, we propose a deep sampling network (DSN) without any cheap interpolation that is trained for simultaneous down- and up-sampling.
A learnable down-sampling subnetwork (Down-SNet) is trained without supervision, thereby preserving more information and transmitting a better visual effect to the LR image. For self-supervision, the super-pixel residual is adopted with a novel activation function, called quantized bilateral ReLU (Q-BReLU).
An up-sampling subnetwork (Up-SNet) learns a sub-pixel residual with dense connections to accelerate convergence and improve performance. First, the dense pixel representation trained with deep supervision extracts the multi-scale features by multi-level dictionaries, and second, the sub-pixel residual restores the HR result without cheap up-sampling.
Compared with traditional down-sampling methods, Down-SNet preserves more useful information and generates photo-realistic LR images. Compared with existing CNN-based up-sampling methods, the co-trained DSN achieves the best performance with lower computational complexity.
2 CNN-based Sampling in Related Works
We first design an experiment to investigate the sampling in CNN-based SR methods. In this section, we re-implement the baseline SRCNN model with the Adam optimizer (the SRCNN (9-1-5) re-implemented here achieves better performance than the 32.39 dB reported by the authors). The learning rate decreases by a factor of 0.1 from to every 50 epochs.
2.1 Learn Restoration for Specified Down-sample
Almost all CNN-based SR methods [15, 16, 17, 19, 18, 20, 21, 23] are trained for a specified down-sampling. As shown in Table 1, a mismatch between the down-sampling used in training and in testing results in poor restorations, even worse than direct bicubic interpolation. This is because the CNNs learn a mapping targeted at the specified degradation. Moreover, different down-sampling methods (nearest-neighbor, bilinear, and bicubic) produce significantly different convergence rates and restoration accuracies, as shown in Fig. 2(a). The degradation with bicubic down-sampling retains more useful information, so the bicubic model converges faster and achieves better performance. However, bicubic interpolation is still a cheap down-sampling process that discards details useful for HR image restoration. In this paper, we simultaneously train a deep sampling network for down-sampling and the corresponding up-sampling.
2.2 Learn Mapping from Cheap Up-sample
In the popular CNN-based SR methods (e.g., SRCNN, VDSR, DRCN, and DRRN), cheap up-sampling is used to increase the resolution during preprocessing or at the first network layer. As shown in Fig. 2(b), interpolations do not add information to improve restoration accuracy. Instead of improving accuracy, the complex (bicubic) preprocessing prematurely introduces smooth and inaccurate interpolation, resulting in a hard-to-train network. Conversely, the nearest-neighbor interpolation achieves the best performance because it selects the raw value of the nearest point and does not consider the values of neighboring points. To address this problem, ESPCN and FSRCNN carry the input through the network and restore details without cheap up-sampling. However, carrying the input information through a deep network slows convergence. Therefore, we propose sub-pixel residual learning to accelerate convergence and improve performance.
3 Deep Sampling Network
In this paper, we propose a deep sampling network (DSN) composed of a down-sampling subnetwork and an up-sampling subnetwork. The DSN architecture is presented in Fig. 3.
3.1 Unsupervised Down-sampling Subnetwork
A learnable down-sampling subnetwork (Down-SNet) is trained without supervision, which learns the super-pixel residual with a novel activation function, called quantized bilateral ReLU (Q-BReLU).
3.1.1 Super-pixel Residual Learning
Without a supervised signal, Down-SNet can proactively retain useful information and discard redundant information. However, the multi-layer network is an end-to-end relationship requiring very long-term memory. For this reason, the LR image generated from the learned features contains artifacts. We can simply solve this problem by super-pixel residual-learning.
In Down-SNet, the pixel of LR output of scale factor is largely similar to the super-pixel of the HR input H. Therefore, we define the super-pixel residual image for down-sampling , where reduces the integer and is a neighborhood with the size of . In , most values are likely to be very low and even close to zero. Formally, the down-sampling is denoted , which includes an inference model (three convolutional layers of size and stride) and a down-sampling layer (a convolutional layer of size and stride of ). The original mapping is recast into
which can be implemented by feedforward neural networks with shortcut connections  and average pooling.
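The recast mapping can be read as "average-pooled super-pixel plus a learned residual". The following is a minimal sketch of that reading, not the authors' implementation. It assumes that the super-pixel value is the floor of the mean over each non-overlapping block (suggested by "reduces the integer" and "average pooling" in the text) and that the residual is supplied by the caller in place of the network prediction. The names `superpixel_mean` and `downsample_with_residual` are hypothetical.

```python
import math


def superpixel_mean(hr, s):
    """Floor of the mean over each non-overlapping s x s block (average pooling)."""
    h, w = len(hr), len(hr[0])
    out = []
    for i in range(0, h - h % s, s):
        row = []
        for j in range(0, w - w % s, s):
            block = [hr[i + di][j + dj] for di in range(s) for dj in range(s)]
            row.append(math.floor(sum(block) / (s * s)))
        out.append(row)
    return out


def downsample_with_residual(hr, residual, s):
    """LR image = shortcut (pooled super-pixel) + residual (stand-in for the network output)."""
    base = superpixel_mean(hr, s)
    return [[b + r for b, r in zip(brow, rrow)] for brow, rrow in zip(base, residual)]


hr = [[10, 12, 200, 202],
      [11, 13, 201, 203],
      [50, 50, 90, 94],
      [50, 50, 92, 96]]
zero_residual = [[0, 0], [0, 0]]
lr = downsample_with_residual(hr, zero_residual, 2)
```

With a zero residual the output is plain average pooling, which is why most residual values can stay close to zero during training.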
3.1.2 Quantized Bilateral ReLU (Q-BReLU)
Standard nonlinear activation functions such as the rectified linear unit (ReLU) offer local linearity to overcome the vanishing gradient problem. However, ReLU is designed for classification problems rather than image restoration. In particular, ReLU only inhibits values less than zero, which might lead to response overflow, especially without supervision. Moreover, a general digital image is quantized to integers between 0 and 255.
To overcome this limitation, here we propose the quantized bilateral rectified linear unit (Q-BReLU) to keep bilateral restraint and response quantization, as shown in Fig. 4. Q-BReLU is a variation of BReLU , which is adopted for haze transmission restoration. BReLU is defined as , where is the marginal value. Denoting for terse expression, Q-BReLU is defined as
where is the number of quantities.
Center quantization: the high-precision value is rounded to the nearest quantization interval (between neighboring blue dashes); Zero mean deviation: the positive/negative (green/yellow area) deviation balance out the approximate bias.
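The two properties above (bilateral restraint and center quantization) can be sketched in a few lines. This is one plausible form under stated assumptions, not the paper's exact definition: the response is clipped to `[t_min, t_max]` as in BReLU and then rounded to the nearest of `n + 1` uniformly spaced levels. Treating `n` as the number of quantization intervals, and the default values 0, 255, and 255, are assumptions made for illustration.

```python
import math


def brelu(x, t_min=0.0, t_max=255.0):
    """Bilateral ReLU: clip the response to [t_min, t_max]."""
    return min(t_max, max(t_min, x))


def q_brelu(x, t_min=0.0, t_max=255.0, n=255):
    """Clip, then round to the nearest quantization level (center quantization)."""
    step = (t_max - t_min) / n              # width of one quantization interval
    clipped = brelu(x, t_min, t_max)
    k = math.floor((clipped - t_min) / step + 0.5)  # index of the nearest level
    return t_min + k * step


samples = [q_brelu(v) for v in (-3.2, 127.4, 127.6, 300.0)]
```

Because values are rounded to the nearest level rather than truncated, positive and negative rounding errors tend to cancel, which matches the zero mean deviation property.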
However, the gradient of Q-BReLU alternates between 0 and according to (2). We exploit an approximate gradient with local continuity for backpropagation learning. To retain center quantization and zero mean deviation, BReLU is adopted as a spline function to fit Q-BReLU, as shown in Fig. 4(b). Therefore, the approximate gradient of Q-BReLU is defined as
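A hedged sketch of this surrogate gradient: since BReLU is used to fit Q-BReLU, the backward pass can use the slope of BReLU, which is 1 inside the clipping range and 0 outside. The function names, the default range, and the treatment of the boundary points (strict inequalities) are assumptions for illustration.

```python
def q_brelu_grad(x, t_min=0.0, t_max=255.0):
    """Surrogate derivative: slope of BReLU, 1 inside the clipping range, 0 outside."""
    return 1.0 if t_min < x < t_max else 0.0


def backward(upstream_grads, inputs, t_min=0.0, t_max=255.0):
    """Chain rule with the surrogate derivative, applied element-wise."""
    return [g * q_brelu_grad(x, t_min, t_max) for g, x in zip(upstream_grads, inputs)]


grads = backward([0.5, 0.5, 0.5], [-1.0, 100.3, 260.0])
```

Responses that have overflowed the valid range receive no gradient, while in-range responses are updated as if the quantization step were the identity.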
To verify the impact of Q-BReLU, we illustrate an example of down-sampling in Fig. 5. The LR image generated without Q-BReLU contains irrational noise, while the result with Q-BReLU appears natural.
3.2 Residual Up-sampling Subnetwork
The up-sampling subnetwork (Up-SNet) is a dense architecture that learns the sub-pixel residual, achieving a high performance and efficiency without cheap interpolation.
3.2.1 Dense Pixel Representation
Up-SNet densely connects the pixel representation for pixel-wise prediction. Each layer produces feature maps, so it follows that the -th layer has input feature maps. In the deep connection layers, a large number of feature maps increases the computational cost and model size. A bottleneck convolution has been shown to improve computational efficiency and keep the model compact. The performance improvement of the dense pixel representation comes from three aspects:
Multi-scale feature. For up-scaling, different components may be relevant to different neighbourhood scales in the LR image. Up-SNet is a multi-scale architecture: the receptive field gets larger as the network stacks more layers. Given a fixed kernel of size , there are multi-scale streams corresponding to receptive fields, respectively.
Deeply-supervised learning. Image up-sampling is a low-level vision task, where the kernels in the shallow layers can be shared to recursively boost performance. However, recursions are hard to train due to exploding and/or vanishing gradients. Skip connections, similar to the deeply-supervised learning, overcome the vanishing gradient problem and enhance feature propagation.
Multi-level dictionaries. Sparse-coding is a representative example-based up-sampling method, where sparse coefficients are passed into a dictionary to restore HR patches. Up-SNet can be viewed as a type of sparse coding: convolutional kernels of size are equivalent to dictionaries, and the bottlenecks with nonlinear activation functions are equivalent to sparse coefficients. With dense connections, the neural unit learns multi-level dictionaries.
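The bookkeeping behind dense connectivity and the multi-scale streams can be illustrated with two small helpers. In a densely connected block, a layer receives the input maps plus the outputs of all preceding layers, and stacking stride-1, undilated convolutions grows the receptive field linearly. The symbols `k0` (input feature maps) and `k` (maps produced per layer), as well as the example values, are assumptions and not taken from the paper.

```python
def dense_input_channels(layer_index, k0, k):
    """Input feature maps of the layer_index-th layer (1-based) under dense connectivity."""
    return k0 + k * (layer_index - 1)


def receptive_field(num_layers, kernel_size):
    """Receptive field of num_layers stacked stride-1, undilated convolutions."""
    return num_layers * (kernel_size - 1) + 1


# Hypothetical configuration: 16 input maps, 8 new maps per layer, 3 x 3 kernels.
channels = [dense_input_channels(i, 16, 8) for i in range(1, 5)]
fields = [receptive_field(n, 3) for n in range(1, 5)]
```

The linear growth in input channels is what motivates the bottleneck convolution, and the list of receptive fields shows the distinct scales that the dense connections expose to later layers.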
3.2.2 Sub-pixel Residual Learning
The HR image can be decomposed into low-frequency information (low-resolution image) and high-frequency information (residual image). In Up-SNet, the input image L and output image S share the same low-frequency information. Without any cheap interpolation, we adopt a sub-pixel residual learning to transmit the LR input to the HR result.
Depending on the sub-pixel location in HR space, the residual patterns containing channels are activated by a convolution of size . Sub-pixel shuffle is a periodic operator that rearranges the elements of an tensor to a tensor of shape . Mathematically, the sub-pixel residual image for up-sampling is written as , where denotes the remainder operator. To learn the sub-pixel residual image similarly to (1), the restoration result is defined by
where is the sub-pixel residual prediction. It is effectively implemented using a tile layer and an element-wise sum layer.
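The tile-plus-sum implementation can be sketched for a single-channel image as follows. This is an illustrative reading, not the authors' code: the LR pixel is tiled over `r * r` channels, the predicted residual is added element-wise, and the periodic shuffle moves each channel to its sub-pixel position. The channel ordering `r * (y % r) + (x % r)` is one common convention and is an assumption here, as are the function names.

```python
def pixel_shuffle(t, r):
    """Rearrange an H x W x (r*r) tensor (nested lists) into an rH x rW image."""
    h, w = len(t), len(t[0])
    return [[t[y // r][x // r][r * (y % r) + (x % r)] for x in range(w * r)]
            for y in range(h * r)]


def upsample_with_subpixel_residual(lr, residual, r):
    """Tile the LR pixel over r*r channels, add the residual, then shuffle."""
    h, w = len(lr), len(lr[0])
    summed = [[[lr[i][j] + residual[i][j][c] for c in range(r * r)]
               for j in range(w)]
              for i in range(h)]
    return pixel_shuffle(summed, r)


lr = [[10, 20]]
residual = [[[0, 1, 2, 3], [0, 0, 0, 0]]]   # stand-in for the network prediction
sr = upsample_with_subpixel_residual(lr, residual, 2)
```

With a zero residual the result equals nearest-neighbor up-sampling of the LR input, so the network only has to predict the high-frequency difference at each sub-pixel location.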
4.1 Implementation Details
The model is trained on 91 images from and 200 images from the training set in , which are widely used for SR [18, 20, 21, 23]. Following , only the luminance channel in YCbCr color space is considered, because humans are more sensitive to luminance changes. We train a specific network for each scale factor ().
is used instead of ReLU as the activation function, except in the output layer. The layers with residual learning are initialized by drawing randomly from a Gaussian distribution (), because most values in the residual images are likely to be zero or small. The other filter weights are initialized according to .
In the training phase, we rotate the images through , , and for data augmentation. Sub-images are extracted to ensure that all pixels in the original image appear once and only once as the ground truth of the training data. For , , and , we set the size of training sub-images as 60, 69, and 72, respectively. The model is trained with L1 loss using an Adam  optimizer in the Caffe  package. The learning rate decreases by half from to every 50 epochs. The final layer learns 10 times slower as in . Based on the parameters above, training DSN with a batch-size of 256 takes about one day using one Nvidia GeForce GTX 1080 GPU.
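Two of the training details above lend themselves to a short sketch: extracting sub-images so that each pixel is used exactly once, and halving the learning rate every 50 epochs. The sketch assumes non-overlapping tiles with stride equal to the tile size (leftover border pixels are dropped) and clamps the rate at a floor. The values `1e-3` and `1e-5` are hypothetical placeholders, since the actual start and end rates are not given in the text.

```python
def tile_origins(height, width, size):
    """Top-left corners of non-overlapping size x size sub-images."""
    return [(y, x)
            for y in range(0, height - size + 1, size)
            for x in range(0, width - size + 1, size)]


def halving_lr(epoch, base_lr, min_lr, period=50):
    """Halve the learning rate every `period` epochs, never going below min_lr."""
    return max(min_lr, base_lr * 0.5 ** (epoch // period))


origins = tile_origins(120, 180, 60)
# Hypothetical start and floor values, chosen only for illustration.
schedule = [halving_lr(e, 1e-3, 1e-5) for e in (0, 49, 50, 120, 1000)]
```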
4.2 Image Reduced-Resolution Comparisons
Existing image down-sampling (e.g., nearest-neighbor, bilinear, and bicubic) is based on local weighting. Interpolation transformation struggles to find the pixel-wise weights of plausible solutions, and its results are typically over-smooth and of poor perceptual quality; that is, they lose valuable high-frequency details such as texture. We illustrate this problem in Fig. 6, where multiple potential solutions with high textural details are weighted to create smooth bilinear or bicubic results. The solutions based on linear functions (bilinear and bicubic) appear overly smooth due to the pixel-wise weighting of possible solutions, whereas the solution based on nearest-neighbor simply picks a single sample in the manifold space. In contrast, Down-SNet learns the residual from a pixel-wise average (average pooling) towards the potential manifold and produces perceptually more convincing solutions.
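A toy example makes the two failure modes concrete. On a checkerboard texture, a weighted average flattens the pattern to uniform gray, while nearest-neighbor keeps only one sample per block. Average pooling is used here as a simple stand-in for linear weighting, which is an assumption made for brevity; the helper names are hypothetical.

```python
def nearest_down(img, s):
    """Keep the top-left sample of every s x s block."""
    return [row[::s] for row in img[::s]]


def average_down(img, s):
    """Replace every s x s block by its mean (average pooling)."""
    h, w = len(img), len(img[0])
    return [[sum(img[i + di][j + dj] for di in range(s) for dj in range(s)) / (s * s)
             for j in range(0, w - s + 1, s)]
            for i in range(0, h - s + 1, s)]


# 4 x 4 checkerboard with values 0 and 255.
checker = [[255 * ((x + y) % 2) for x in range(4)] for y in range(4)]
by_nearest = nearest_down(checker, 2)
by_average = average_down(checker, 2)
```

Neither output retains any trace of the texture, which is the gap a learned residual on top of the pooled value is meant to close.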
The proposed method provides a powerful Down-SNet for generating photo-realistic LR images of high perceptual quality. Down-SNet encourages the LR image to move towards regions of the potential manifold with high probability of containing photo-realistic textures. Fig. 7 shows two standard test images (lena and baboon) generated by Down-SNet compared to traditional down-sampling methods. Down-SNet generates relatively sharper and richer textures.
To quantitatively assess the down-sampling performance, bicubic degradation is replaced by Down-SNet to generate training samples for HR restoration. We retrain SR networks with three representative architectures: a plain network (SRCNN), a residual network (VDSR), and a dense network (Up-SNet). Down-SNet remains fixed during the optimization process. Because it preserves useful information, Down-SNet brings a significant improvement in HR restoration, as shown in Table 3. Although Down-SNet is trained with Up-SNet in DSN, it generalizes well to the other network architectures. Therefore, Down-SNet can be used as a CNN-based down-sampling method to replace traditional interpolation in image processing.
| Dataset | SRCNN | VDSR | Up-SNet | DSN |
4.3 Image Super-Resolution Comparisons
With the powerful down-sampling process, more useful information is preserved for image up-sampling. To further assess the co-trained DSN for SR, DSN is evaluated using three different scale factors (, , ) on four datasets [24, 33, 29, 34]. We compute the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) to compare against five recent methods: FSRCNN, VDSR, DRCN, DRRN, and MemNet. As shown in Table 4, the proposed DSN outperforms the other methods. Qualitative comparisons with SRCNN, FSRCNN, and VDSR, using their public codes, are illustrated in Fig. 8. Our method produces relatively sharper edges and contours, while the other methods generate blurry results. In addition, existing methods produce severe distortions in some reconstructed results, whereas DSN reconstructs the texture patterns and avoids the distortions.
| Dataset | Scale | FSRCNN | VDSR | DRCN | DRRN | MemNet | DSN |
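For reference, PSNR on 8-bit luminance images follows the standard definition, 10 log10(peak^2 / MSE). The sketch below is a minimal pure-Python version for small nested-list images. It does not reproduce any evaluation protocol details such as border cropping, which the text does not specify.

```python
import math


def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared = [(a - b) ** 2
               for ref_row, res_row in zip(reference, restored)
               for a, b in zip(ref_row, res_row)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


reference = [[100, 100], [100, 100]]
restored = [[101, 99], [100, 100]]
value = psnr(reference, restored)  # MSE is 0.5 for this pair
```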
In addition, we evaluate efficiency in terms of execution time using the public code of the compared methods. The experiments are conducted with an Intel CPU (Xeon E5-2620, 2.1 GHz) and an NVIDIA GPU (GeForce GTX 1080). Fig. 9 shows the PSNR of the compared methods versus execution time. The up-sampling phase of DSN outperforms existing methods. Even on a mobile CPU platform (A9 of iPhone 6S), our method for scale factor implemented with the ncnn library (https://github.com/Tencent/ncnn) processes a image in approximately 200 ms. Therefore, DSN can be used to generate thumbnails and reconstruct HR images in a wide range of mobile applications.
4.4 Image Compression Comparisons
Image compression is a fundamental and well-studied engineering problem that aims to reduce irrelevance and redundancy for storage and transmission. As the bits per pixel (bpp) decrease, high compression ratios cause blocking artifacts or noise in the decoded images. Existing image codecs usually consist of transformation, quantization, and entropy coding. Recently, deep learning-based image compression methods [35, 36] have achieved competitive performance. However, they are incompatible with existing image codecs, limiting their widespread application in engineering. DSN is compatible with existing image coding standards and can be used to improve image compression. In DSN, Down-SNet produces a compact transformation for encoding using existing codecs, and Up-SNet reconstructs the decoded image to avoid blocking artifacts.
To evaluate the performance of DSN for image compression, we conduct experiments with standard compression methods including JPEG and JPEG2000. For compression evaluation, luminance values are usually considered in YCbCr color space. Bits for header information of compressed files count towards the bit rate of the compared methods. Because JPEG does not offer lossless compression, for the comparison with JPEG we use DSN as the compression transform and a common file compressor for quantization coding. The raw image is down-sampled by Down-SNet and stored as a .pgm file, an uncompressed format. Then, the .pgm file is coded by 7-Zip (http://www.7-zip.org/) with solid compression. For the comparison with JPEG2000, we simply adopt OpenJPEG (http://www.openjpeg.org/) with lossless compression to code the LR image generated by Down-SNet.
| DSN + 7-Zip | JPEG () | DSN + J2K | J2K () |
In Table 5, we evaluate the compression ratio on Set5 at a similar distortion level measured by SSIM. We use DSN trained with scale factor as the transformation. For JPEG and JPEG2000, we test the codecs at quality parameter and compression ratio , respectively. The comparisons show that the proposed method significantly outperforms JPEG and JPEG2000 in terms of bpp. To demonstrate the qualitative nature of compression artifacts, we show a representative example of the compressed image butterfly with in Fig. 10.
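The bit-rate bookkeeping for such a comparison can be sketched as follows. The sketch assumes that bpp is measured against the pixel count of the original HR image, so that the coded LR file (headers included) is charged to the full-resolution picture, and that the source is an 8-bit luminance image. The file size and image dimensions below are hypothetical numbers, not results from the paper.

```python
def bits_per_pixel(file_bytes, hr_height, hr_width):
    """Bits spent per pixel of the original HR image, headers included."""
    return 8.0 * file_bytes / (hr_height * hr_width)


def compression_ratio(file_bytes, hr_height, hr_width, bit_depth=8):
    """Ratio between the raw bit depth and the achieved bits per pixel."""
    return bit_depth / bits_per_pixel(file_bytes, hr_height, hr_width)


# Hypothetical example: a 256 x 256 luminance image whose coded LR file is 4096 bytes.
bpp = bits_per_pixel(4096, 256, 256)
ratio = compression_ratio(4096, 256, 256)
```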
In image sampling, down-sampling loses useful information, and up-sampling at the first layer does not provide extra information. To address these problems, here we propose a deep sampling network (DSN). DSN is an end-to-end system without any cheap interpolation that simultaneously learns mappings for resolution reduction and improvement. The down-sampling subnetwork in DSN can also be applied to generate photo-realistic LR images and to replace traditional interpolation in image processing. Moreover, our experimental results reveal that the co-trained network achieves state-of-the-art SR performance at higher speed and improves image compression with existing image coding standards.
-  Goto, T., Fukuoka, T., Nagashima, F., Hirano, S., Sakurai, M.: Super-resolution system for 4k-hdtv. In: Pattern Recognition (ICPR), 2014 22nd International Conference on, IEEE (2014) 4453–4458
-  Shi, W., Caballero, J., Ledig, C., Zhuang, X., Bai, W., Bhatia, K., de Marvao, A.M.S.M., Dawes, T., O’Regan, D., Rueckert, D.: Cardiac image super-resolution with global correspondence using multi-atlas patchmatch. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2013) 9–16
-  Thornton, M., Atkinson, P.M., Holland, D.: Sub-pixel mapping of rural land cover objects from fine spatial resolution satellite sensor imagery using super-resolution pixel-swapping. International Journal of Remote Sensing 27(3) (2006) 473–491
-  Zhang, L., Zhang, H., Shen, H., Li, P.: A super-resolution reconstruction algorithm for surveillance images. Signal Processing 90(3) (2010) 848–859
-  De Boor, C.: Bicubic spline interpolation. Studies in Applied Mathematics 41(1-4) (1962) 212–218
-  Duchon, C.E.: Lanczos filtering in one and two dimensions. Journal of Applied Meteorology 18(8) (1979) 1016–1022
-  Sun, J., Xu, Z., Shum, H.Y.: Image super-resolution using gradient profile prior. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE (2008) 1–8
-  Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: Computer Vision, 2009 IEEE 12th International Conference on, IEEE (2009) 349–356
-  Chang, H., Yeung, D.Y., Xiong, Y.: Super-resolution through neighbor embedding. In: Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. Volume 1., IEEE (2004) I–I
-  Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE transactions on image processing 19(11) (2010) 2861–2873
-  Schulter, S., Leistner, C., Bischof, H.: Fast and accurate image upscaling with super-resolution forests. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3791–3799
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
-  Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2014) 580–587
-  Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915 (2016)
-  Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: European Conference on Computer Vision, Springer (2014) 184–199
-  Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network. In: European Conference on Computer Vision, Springer (2016) 391–407
-  Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 1874–1883
-  Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 1646–1654
-  Mao, X., Shen, C., Yang, Y.B.: Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In: Advances in Neural Information Processing Systems. (2016) 2802–2810
-  Kim, J., Kwon Lee, J., Mu Lee, K.: Deeply-recursive convolutional network for image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 1637–1645
-  Tai, Y., Yang, J., Liu, X.: Image super-resolution via deep recursive residual network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017)
-  Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Tai, Y., Yang, J., Liu, X., Xu, C.: Memnet: A persistent memory network for image restoration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 4539–4547
-  Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding. (2012)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
-  Cai, B., Xu, X., Jia, K., Qing, C., Tao, D.: Dehazenet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing 25(11) (2016) 5187–5198
-  Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016)
-  Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 2818–2826
-  Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on. Volume 2., IEEE (2001) 416–423
-  He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE international conference on computer vision. (2015) 1026–1034
-  Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on Multimedia, ACM (2014) 675–678
-  Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38(2) (2016) 295–307
-  Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: International conference on curves and surfaces, Springer (2010) 711–730
-  Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 5197–5206
-  Theis, L., Shi, W., Cunningham, A., Huszár, F.: Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395 (2017)
-  Ballé, J., Laparra, V., Simoncelli, E.P.: End-to-end optimized image compression. arXiv preprint arXiv:1611.01704 (2016)