Super-resolution aims to transform low resolution (LR) images into images with high resolution (HR). In computer vision, super-resolution is relevant for applications where high-frequency information and detailing is desirable yet not always captured,e.g. medical imaging, satellite imaging, surveillance, etc. Our interest is in single image super-resolution (SISR), a special case where only one LR image is available. SISR is an ill-posed inverse problem with multiple solutions, i.e. multiple HR images could lead to the same LR image after applying a low-resolution filter.
SISR been solved with many different approaches. Early approaches relied on spatial relationships present in the image and include nearest neighbours , bi-cubic interpolation . Other techniques build upon the assumption that the underlying signal is sparse, e.g. compressive sensing  and dictionary learning 
. These early approaches tend to yield HR images which are blurry and or noisy, due mainly to the fact that these methods are crafted heuristically. As a result, they cannot sufficiently recover enough information to yield sharp photo-realistic images.
Deep learning, and more specifically, convolutional neural networks (CNNs), has afforded significant performance gains in SISR. The newest frameworks [6, 7, 8, 9, 10] learn feature representations for super-resolution in a supervised end-to-end manner and are achieving ever-improving performance both in terms of speed and quantitative evaluation. However, despite their impressive results, most deep networks are being used as black boxes to make point estimates. The regressed outputs represent only a mean value and give no indication of model uncertainty. In the case of SISR, this means that we make (possibly very accurate) guesses for the unknown pixels, but we have no idea how good these guesses are. This can be highly problematic for applications such as microscopy or medical imaging, where model confidence is just as important as the actual output.
We would like to leverage the strong learning capabilities of deep networks and embed them within a known statistical framework for which we can derive model uncertainty. As such, we propose deep kriging to solve SISR. Kriging [11, 12] is a geostatistics method used for spatial interpolation of attributes such as topography and natural resources. It has close relations to Gaussian process (GP) regression, although some sources oversimplify and describe the two as being equivalent. The main difference is that kriging makes stationary and ergodic assumptions on the random field representing the data, while GP regression considers it as a Gaussian process.111We refer the reader to our supplementary materials for a more detailed comparison of the two.
Kriging interpolates unknown values by taking a weighted average of known values and results in an unbiased estimator with minimum variance. One can solve for the weights by inverting a covariance matrix computed on the known instances. Kriging makes for a natural extension to the application of SISR, where HR images are interpolated from pixels observed in the LR version of the image. The computational cost of the kriging is cubic with respect to the number of known instances; hence it is preferable to work locally. However, interpolated results are often overly smooth and depend on the choice of the covariance function.
Our contributions can be summarized as follows:
We propose a novel deep learning-based statistical estimator with SISR results competitive with state-of-the-art.
We propose a deep kriging approach that solves for kriging weights and performs the kriging in a single network; this allows us to perform a supervised form of spatial interpolation which can be applied not only to SISR, but other geostatistical computations such as field reconstructions.
Our proposed deep kriging is a hybrid deep learning and statistical framework for which we can derive statistical bias and variance; such measures of model performance and uncertainty are not commonly available for other deep networks, making us the first to model and compute pixel uncertainty in deep SISR approaches.
2 Related work
Prior to the use of deep learning, SISR approaches applied variants of dictionary learning [13, 3, 4, 14, 15]. Patches were extracted from the low-resolution images and mapped to their corresponding high-resolution version which are then stitched together to increase the image resolution. Other learning-based approaches to increase image resolution include [16, 17, 18].
State-of-the-art SISR methods are deep-learning-based [8, 6, 7, 10, 9]. The VDSR  and DRCN  approaches showed the benefits of working with image residuals for super-resolution. The DRRN approach  generalizes VDSR and concludes that the deeper the network, the better the super-resolved image. We also use a residual network in our approach, but unlike all the other deep SISR methods, we are solving for a set of filter weights to perform the super-resolution with our network rather than directly estimating the HR image itself.
Several unsupervised techniques have also been developed for SISR [19, 20], though their performance is usually poorer than supervised approaches. One work of note , uses GP regression to perform super-resolution and resembles ours in spirit in that we both model pixel intensities as a random process regressed from neighbouring pixels. However,  is unsupervised while our approach learn the weights from supervise examples.
Our proposed approach can be thought of as a form of local filtering , and is most similar in spirit to works which combine deep learning and GP regression [23, 24]. However, we differ from these techniques since we do not apply GP regression but a modified version of local kriging. These techniques are similar to us in that they also consider the data to follow a random process. However, we do not explicitly learn the covariance matrix, which offers us more flexibility in the relationships we wish to express. Furthermore, we treat each image as following a random process; the set of training images is then a set of random processes, while the aforementioned works consider all the training data to follow the same process.
3 Deep Kriging
We begin with a short overview on classical kriging in Section 3.1 before we introduce our proposed method of supervised kriging in Section 3.2 and elaborate on its statistical properties in Section 3.3. Finally, we show how the proposed form of deep kriging can be implemented in a deep network in Section 3.4.
3.1 Classical kriging
Consider an image as a realization of a random field
. This random field is a collection of random variables,i.e. where each is a random variable and is a position index. Unknown values on at position , 222For concise notation, we use to denote , to denote and ., can be interpolated linearly from known realizations , with normalized weights , i.e.
The weights are found by minimizing the true risk:
We can equate the with the term in Eq. 2 because the constraint that the weights must sum up to implies that . However, since we do not have access to different realizations of the random field, this variance cannot be solved directly.
To infer the variance from a single event, the theory of geostatistics replaces the classical statistics assumption of having independent and identically distributed (iid) random variables with assumptions of stationarity and ergodicity on the random field. The random field is assumed to be first and second order stationary. This means that does not depend on and that depends only on
. In addition, the field is assumed to be second order ergodic, so a covariance or mean estimate in the probability domain is equivalent inthe spatial domain. This second assumption implies that an empirical covariance can be estimated and depends only on a distancebetween two realizations, i.e.
. However, it is difficult to work directly with the empirical covariance, since it can be noisy and also is not guaranteed to form a positive semi-definite matrix. As such, in kriging, we use the true covariance by fitting a parametric model. While different models can be used, the most common one is Gaussian, with covariancebetween points and defined as where and are parameters of the model.
A Lagrange multiplier (with constant ) can be used to minimize the risk function with the constraints on the weights , leading to the following cost function:
and the associated solution expressed in matrix form as:
The process of kriging then is reduced to solving for the weights . This involves inverting the covariance matrix on the left. Since the matrix is of dimension , where is the number of observed pixels in the image (patch), it can be very computationally expensive. Even though one can limit with local kriging, needs to be sufficiently large in order to accurately capture and represent the spatial relationships. We aim to bypass the matrix inversion and still maintain sufficient generalization power by using a deep network to directly learn the weights instead.
3.2 Supervised kriging
Consider , a subset of the discrete space , as a representation of the support space of a 2D image. We denote an image as and as its value at pixel . This assumes we work with greyscale images, which is a standard assumption made in most SISR algorithms [7, 6, 9].
We assume that we are given a set of training image pairs , where is an up-sampled low resolution version of with the same size. Our objective is to super-resolve the low-resolution images of the test set . We further assume that
is a realization of a first and second moment stationary random process. For convenience we denote the distribution of the random process. In addition is a realization of a random field . We denote the training set as , where is a joint meta-distribution where each training pair follows its own random process.
In classical kriging, the estimated is expressed as a linear combination of all the observed samples i.e. pixels of . In the case of super-resolution, the observations come from the low-resolution image . Furthermore, we estimate as a linear combination of only local observations of some window radius , leading to the estimator:
When no confusion is possible we denote , with . We found that a window of provided the best results. To find , we want to minimize the true risk as given in Eq. 2, leading to:
This risk is a compromise between the geostatistical model and the deep learning model, where the covariance of each field is learned through supervised learning. Furthermore, if we replace the true expectation with the empirical one and the random field with its realization, we arrive at
where is the number of pixels in image .
Now, rather than solving for the weights by inverting the covariance matrix in Eq. 4, we propose learning this set of weights directly with a CNN. This allow us to capture and model more complex spatial relationships between the data than the classical model with covariances. As such, the estimator becomes , where , with and representing the CNN network function and its parameters respectively.
3.3 Statistical properties
We can derive statistical properties for our proposed estimator . Given , the bias can be expressed as
Since and the random field is first-order stationary, we are left with zero bias according to the weights of the neural network, i.e. . In other words, our network does not add bias to the field . As such, from to , if the two fields have the same first moment, then no bias added to 333Of course, we cannot account for the low resolution process that produced . This point is critical since in classical statistics a good estimator should have zero bias and minimal variance.
By definition, the variance of our estimator is
Again, we estimate the covariance given . Since we assume that is second-order stationary, the covariance of depends only on the distance between two points, i.e.
By setting , we arrive at:
where the covariance is estimated from the low resolution image . The variance depends on the position . Note, however, since we directly estimate the weights without making assumptions on the covariance function, we cannot compute the variance and an approximation is needed to estimate the covariance empirically from . Eq. 10 however is particularly interesting since it shows that the estimator variance at is directly proportional to the variance of and the different values of near . The bigger these values are, the more uncertain the estimator is.
3.4 Network Implementation and Training
So far, the theory which we have presented on supervised kriging is generic and does not indicate how one should minimize the risk in Eq. 7 to solve for the kriging weights. We solve for the weights with a CNN network which we show in Fig. 1
. The weight estimation branch is composed of a residual network of 9 residual units; the residual unit itself has 2 convolutional layers, each preceded by a batch normalization
and a ReLU.
This branch follows a similar architecture as [9, 6], with the difference that they apply this architecture to directly estimate the HR image, while we use it to estimate our kriging weights, i.e. as a local dynamic filter . In addition, contrary to  our network is not recurrent.
The network learns a total of 20 convolution layers to determine the kriging weights. The first 19 are of depth 128, the last prediction layer has depth
and outputs the kriging weight vector. The weights are applied to the repeated input image via a point-wise multiplication and summation along the depth dimension to yield the HR output. Overall, the network with is very lightweight, inference on a sized image takes only seconds with a titan X.
Note our network is actually learning how to estimate the kriging weights , i.e. as a local dynamic filter. To derive this dynamic filter end-to-end, we adapt the following formulation of the Eq. 5 in convolution terms:
with or the Dirac function. The filter is denoted by “repeat input” in Fig. 1, since applying the Dirac functions in each position is a way to extract the pixel neighbourhood at position . After having determined the weights, we normalize them so that they sum up to , as given by the constraint in Eq. 11. We point-wise multiply the normalized weights to the output of the repeat filter; to arrive at the super-resolved image, we simply sum along the depth.
We implement our network in Tensorflow, minimizing the loss in Eq.7 with the Adam optimizer, a batch size of 8 and a learning rate of
. We apply gradient clipping and dropout between the last ReLU and the convolution layer that predicts, with a dropout rate of 0.3. The effective depth of our network is , with the number of times the residual units are applied. In our case, we use .
4.1 Datasets, Pre-processing and Evaluation
For training, we follow [9, 6] and use the 291 images combined from  and the Berkeley Segmentation datasets  . We further augment the data by rotating (, and ) and scaling the data (, , ). We limit to these fixed grades of rotation and scaling to follow the same protocol as . Like [9, 6], we work with patches of size
, sampled from the augmented dataset with a stride of. Testing is then performed on the four commonly used benchmarks: Set 5 , Set 14 , B100  and Urban 100 . To be consistent with [9, 6], we convert the RGB images to YCbCr and only resolved the Y component. This is then combined with an upsampled Cb and Cr component and converted back into RGB. Up-sampling is done via bi-cubic interpolation.
We evaluate the resulting super-resolved images quantitatively with the peak signal-to-noise ratio (PSNR): where and are the ground truth and the super-resolved images respectively and refers to the number of pixels in the (high-resolution) image. A higher PSNR corresponds to better results. We also evaluate using Structural Similarity (SSIM) , which measures the similarity between two images. The closer the SSIM value is to , the more similar the super-resolved image to the ground truth high resolution image.
4.2 Comparison to state-of-the-art
We compare our PSNR and SSIM measures for the different scales against state-of-the-art in Table 1. The first two methods, bi-cubic and local kriging are unsupervised, while all others are supervised approaches. We use the Matlab implementation of bicubic interpolation.
For the local kriging, we use a neighbourhood of and a stride of . At this sparse setting, unsupervised kriging does very well in comparison to bicubic interpolation. However, it is already extremely slow, since for each patch, we need to compute the empirical covariance, true covariance, and then invert the covariance matrix. In total, it takes approximately 1 minute for an image of size . In comparison, our proposed deep kriging, on the same image, applied densely is one magnitude faster, and takes only seconds with a titan x.
Looking at our approach (reported in the second last column in Table 1) with respect to the supervised methods, our performance is comparable to the DRRN  B1U9 setting, DRCN  and VSDR ; all three have networks with depth similar to ours. The best results are reported by the DRRN B1U25 , which has 50 convolution layers and is more than twice as deep as our network.
Given the trend of the community to work on deep learning-based approaches for super-resolution, as well as the fact that no labelled data is required for training, one can work with (infintely) large datasets [33, 8]. For fair comparison with state-of-the-art, however, we omit from this table the techniques which do not use the fixed 291 image training set as per the training protocol set by [9, 6].
4.3 Model uncertainty
One of the key strengths of our technique is the fact that we can estimate model uncertainty. We show in Figure 2 the estimated variance for each pixel based on Equation 10. To evaluate this equation, we need to apply covariance models for and . We do this by first estimating the empirical covariance and then solving for and of the Gaussian model in section 3.1 that is the closest to the empirical covariance.
The main advantage of our model of uncertainty is that we can have information on the black box CNN. It gives us feedback about the reliability of results provided by an unknown image. One can see in Figure 2
our uncertainty estimate. The estimated variance has a higher value than the real PSNR this is due to the fact that we have a noisy estimation of the PSNR from a low resolution image. Holistically, however, the are similar, in that areas with high variance corresponds to high PSNR. Quantitatively, we find for Set 5 that 91.9% of the super-resolved pixel values fall within 3 standard deviations based on the estimated variance. In addition, we evaluate the similarity between the images resulting from the PSNR and the one resulting from the variance in set 5. We use the correlation to measure the similarity and find that these images have high correlation up to 0.8.
|Scale||Bicubic||local kringing||SRCNN||VDSR||DRCN||DRRN B1U9||DRRN B1U25||ours|
In this paper, we have proposed a joint deep learning and statistical framework for single image super-resolution based on a form of supervised kriging. We solve for the kriging weights in the manner of a local dynamic filter and apply it directly to the low resolution image, all within a single network that can be learned end-to-end. Since we work within the known statistical framework of kriging, we can estimate model uncertainty, something typically not possible for deep networks. More specifically, we show through derivations that the statistical estimator generated by our network is unbiased and we calculate its variance.
This research was funded by the German Research Foun- dation (DFG) as part of the research training group GRK 1564 Imaging New Modalities.
-  H. Chang, D. Yeung, and Y. Xiong. Super-resolution through neighbor embedding. In CVPR, volume 1, pages I–I. IEEE, 2004.
-  R. Keys. Cubic convolution interpolation for digital image processing. IEEE Trans. on Acoustics, Speech, and Signal Processing, 29(6):1153–1160, 1981.
-  David L Donoho. Compressed sensing. IEEE Trans. on Information Theory, 52(4):1289–1306, 2006.
-  J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang. Coupled dictionary training for image super-resolution. IEEE Transactions on Image Processing (TIP), 21(8):3467–3478, 2012.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  J. Kim, L. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
-  J. Kim, L. Kwon Lee, and K. Mu Lee. Deeply-recursive convolutional network for image super-resolution. In CVPR, 2016.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In CVPR Workshops, 2017.
-  Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In CVPR, June 2017.
-  M. S. M. Sajjadi, B. Scholkopf, and M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In ICCV, Oct 2017.
-  Georges Matheron. Random sets and integral geometry. John Wiley & Sons, 1975.
-  Noel Cressie. Statistics for spatial data. John Wiley & Sons, 2015.
-  W. Freeman, T. Jones, and E. Pasztor. Example-based super-resolution. IEEE Computer Graphics and Applications, 22(2):56–65, 2002.
-  T. Peleg and M. Elad. A statistical prediction model based on sparse representations for single image super-resolution. IEEE Transactions on Image Processing (TIP), 23(6):2569–2582, 2014.
-  W. Dong, L. Zhang, G. Shi, and X. Wu. Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transactions on Image Processing (TIP), 20(7):1838–1857, 2011.
-  S. Zhao, H. Han, and S. Peng. Wavelet-domain hmt-based image super-resolution. In International Conference on Image Processing (ICIP), volume 2, pages II–953. IEEE, 2003.
-  K. Zhang, X. Gao, D. Tao, and X. Li. Single image super-resolution with non-local means and steering kernel regression. IEEE Transactions on Image Processing (TIP), 21(11):4544–4556, 2012.
-  G. Anbarjafari and H. Demirel. Image super resolution based on interpolation of wavelet domain high frequency subbands and the spatial domain input image. ETRI journal, 32(3):390–394, 2010.
-  L. Xu and J. Jia. Two-phase kernel estimation for robust motion deblurring. In ECCV, pages 157–170. Springer, 2010.
-  A. Marquina and S. Osher. Image super-resolution by tv-regularization and bregman iteration. Journal of Scientific Computing, 37(3):367–382, 2008.
-  H. He and W. C. Siu. Single image super-resolution using Gaussian process regression. In CVPR, 2011.
-  B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool. Dynamic filter networks. In NIPS, 2016.
-  A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P Xing. Deep kernel learning. In AISTATS, 2016.
-  A. Damianou and N. Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
-  Luc Pronzato and Maria-João Rendas. Bayesian local kriging. Technometrics, pages 1–12, 2017.
-  F. Meier, P. Hennig, and S. Schaal. Local gaussian regression. arXiv preprint arXiv:1402.0645, 2014.
Sergey Ioffe and Christian Szegedy.
Batch normalization: Accelerating deep network training by reducing
internal covariate shift.
International Conference on Machine Learning, pages 448–456, 2015.
-  J. Yang, J. Wright, T. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing (TIP), 19(11):2861–2873, 2010.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, volume 2, pages 416–423, July 2001.
-  M. Bevilacqua, A. Roumy, C. Guillemot, and M. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012.
-  J. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, pages 5197–5206, 2015.
-  Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4):600–612, 2004.
-  T. Tong, G. Li, X. Liu, and Q. Gao. Image super-resolution using dense skip connections. In ICCV, Oct 2017.