1 Introduction
Superresolution aims to transform low-resolution (LR) images into images with high resolution (HR). In computer vision, superresolution is relevant for applications where high-frequency information and detail are desirable yet not always captured, e.g. medical imaging, satellite imaging, surveillance, etc. Our interest is in single image superresolution (SISR), a special case where only one LR image is available. SISR is an ill-posed inverse problem with multiple solutions, i.e. multiple HR images could lead to the same LR image after applying a low-resolution filter.

SISR has been solved with many different approaches. Early approaches relied on spatial relationships present in the image and include nearest neighbours [1] and bicubic interpolation [2]. Other techniques build upon the assumption that the underlying signal is sparse, e.g. compressive sensing [3] and dictionary learning [4]. These early approaches tend to yield HR images which are blurry and/or noisy, due mainly to the fact that these methods are crafted heuristically. As a result, they cannot recover enough information to yield sharp, photorealistic images.
Deep learning, and more specifically convolutional neural networks (CNNs) [5], has afforded significant performance gains in SISR. The newest frameworks [6, 7, 8, 9, 10] learn feature representations for superresolution in a supervised end-to-end manner and achieve ever-improving performance, both in terms of speed and quantitative evaluation. However, despite their impressive results, most deep networks are used as black boxes to make point estimates. The regressed outputs represent only a mean value and give no indication of model uncertainty. In the case of SISR, this means that we make (possibly very accurate) guesses for the unknown pixels, but we have no idea how good these guesses are. This can be highly problematic for applications such as microscopy or medical imaging, where model confidence is just as important as the actual output.

We would like to leverage the strong learning capabilities of deep networks and embed them within a known statistical framework from which we can derive model uncertainty. As such, we propose deep kriging to solve SISR. Kriging [11, 12] is a geostatistical method used for spatial interpolation of attributes such as topography and natural resources. It has close relations to Gaussian process (GP) regression, although some sources oversimplify and describe the two as being equivalent. The main difference is that kriging makes stationarity and ergodicity assumptions on the random field representing the data, while GP regression considers it a Gaussian process.¹ (¹We refer the reader to our supplementary materials for a more detailed comparison of the two.)
Kriging interpolates unknown values by taking a weighted average of known values, and results in an unbiased estimator with minimum variance. One can solve for the weights by inverting a covariance matrix computed on the known instances. Kriging is thus a natural fit for SISR, where HR images are interpolated from pixels observed in the LR version of the image. The computational cost of kriging is cubic with respect to the number of known instances; hence it is preferable to work locally. However, interpolated results are often overly smooth and depend on the choice of the covariance function.
Our contributions can be summarized as follows:

We propose a novel deep learning-based statistical estimator with SISR results competitive with the state-of-the-art.

We propose a deep kriging approach that solves for kriging weights and performs the kriging in a single network; this allows us to perform a supervised form of spatial interpolation which can be applied not only to SISR, but other geostatistical computations such as field reconstructions.

Our proposed deep kriging is a hybrid deep learning and statistical framework for which we can derive statistical bias and variance; such measures of model performance and uncertainty are not commonly available for other deep networks, making us the first to model and compute pixel uncertainty in deep SISR approaches.
2 Related work
Prior to the use of deep learning, SISR approaches applied variants of dictionary learning [13, 3, 4, 14, 15]. Patches were extracted from the low-resolution images and mapped to their corresponding high-resolution versions, which were then stitched together to increase the image resolution. Other learning-based approaches to increase image resolution include [16, 17, 18].
State-of-the-art SISR methods are deep-learning-based [8, 6, 7, 10, 9]. The VDSR [6] and DRCN [7] approaches showed the benefits of working with image residuals for superresolution. The DRRN approach [9] generalizes VDSR and concludes that the deeper the network, the better the superresolved image. We also use a residual network in our approach, but unlike all the other deep SISR methods, our network solves for a set of filter weights to perform the superresolution rather than directly estimating the HR image itself.
Several unsupervised techniques have also been developed for SISR [19, 20], though their performance is usually poorer than that of supervised approaches. One work of note [21] uses GP regression to perform superresolution and resembles ours in spirit, in that we both model pixel intensities as a random process regressed from neighbouring pixels. However, [21] is unsupervised, while our approach learns the weights from supervised examples.
Our proposed approach can be thought of as a form of local filtering [22], and is most similar in spirit to works which combine deep learning and GP regression [23, 24]. However, we differ from these techniques in that we do not apply GP regression but a modified version of local kriging. These techniques are similar to ours in that they also consider the data to follow a random process. However, we do not explicitly learn the covariance matrix, which offers us more flexibility in the relationships we wish to express. Furthermore, we treat each image as following its own random process; the set of training images is then a set of random processes, while the aforementioned works consider all the training data to follow the same process.
3 Deep Kriging
We begin with a short overview of classical kriging in Section 3.1, before introducing our proposed method of supervised kriging in Section 3.2 and elaborating on its statistical properties in Section 3.3. Finally, we show how the proposed form of deep kriging can be implemented in a deep network in Section 3.4.
3.1 Classical kriging
Consider an image as a realization of a random field Z = {Z(x)}, i.e. a collection of random variables, where each Z(x) is a random variable and x is a position index. An unknown value of the realization z at a position x_0 (for concise notation, we use z_0 to denote z(x_0), z_i to denote z(x_i), and \lambda_i to denote \lambda(x_i)) can be interpolated linearly from the known realizations z_i with normalized weights \lambda_i, i.e.

(1)   \hat{z}_0 = \sum_{i=1}^{N} \lambda_i z_i,   with   \sum_{i=1}^{N} \lambda_i = 1.

In classical kriging, all known realizations are used in the interpolation and N is the number of pixels in the image, though local variants have been proposed and studied in [25, 26].
The weights \lambda_i are found by minimizing the true risk:

(2)   R(\lambda) = \mathbb{E}\big[(\hat{Z}_0 - Z_0)^2\big] = \mathrm{Var}(\hat{Z}_0 - Z_0).

We can equate the expected squared error with the variance term in Eq. 2 because the constraint that the weights sum to 1 implies that \mathbb{E}[\hat{Z}_0 - Z_0] = 0. However, since we do not have access to different realizations of the random field, this variance cannot be computed directly.
To infer the variance from a single realization, geostatistics replaces the classical statistics assumption of independent and identically distributed (iid) random variables with assumptions of stationarity and ergodicity on the random field. The random field is assumed to be first- and second-order stationary: \mathbb{E}[Z(x)] does not depend on x, and \mathrm{Cov}(Z(x), Z(x+h)) depends only on the offset h. In addition, the field is assumed to be second-order ergodic, so a covariance or mean estimate in the probability domain is equivalent to one in the spatial domain. This second assumption implies that an empirical covariance can be estimated and depends only on the distance h between two realizations, i.e. \hat{C}(h) = \widehat{\mathrm{Cov}}(Z(x), Z(x+h)). However, it is difficult to work directly with the empirical covariance, since it can be noisy and is not guaranteed to form a positive semi-definite matrix. As such, in kriging, we use a valid covariance obtained by fitting a parametric model. While different models can be used, the most common one is Gaussian, with covariance between points x_i and x_j defined as

C(x_i - x_j) = c \exp\big(-\|x_i - x_j\|^2 / a^2\big),

where a and c are parameters of the model.

A Lagrange multiplier (with constant \mu) can be used to minimize the risk function under the constraint \sum_i \lambda_i = 1 on the weights, leading to the following cost function:

(3)   J(\lambda, \mu) = \mathrm{Var}\Big(\sum_i \lambda_i Z_i - Z_0\Big) + 2\mu\Big(\sum_i \lambda_i - 1\Big),

and the associated solution expressed in matrix form as:

(4)   \begin{pmatrix} C & \mathbf{1} \\ \mathbf{1}^{\top} & 0 \end{pmatrix} \begin{pmatrix} \lambda \\ \mu \end{pmatrix} = \begin{pmatrix} c_0 \\ 1 \end{pmatrix},

where C_{ij} = C(x_i - x_j) and (c_0)_i = C(x_i - x_0).
The process of kriging is then reduced to solving for the weights \lambda. This involves inverting the covariance matrix on the left. Since the matrix is of dimension (N+1) x (N+1), where N is the number of observed pixels in the image (patch), it can be very computationally expensive. Even though one can limit N with local kriging, N needs to be sufficiently large to accurately capture and represent the spatial relationships. We aim to bypass the matrix inversion and still maintain sufficient generalization power by using a deep network to directly learn the weights instead.
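To make the classical procedure concrete, the kriging system above can be sketched in a few lines of NumPy. This is an illustrative 1D example, and the Gaussian covariance parameters a and c are arbitrary placeholder values rather than fitted ones:

```python
import numpy as np

def gaussian_cov(h, a=2.0, c=1.0):
    # Gaussian covariance model C(h) = c * exp(-(h/a)^2); a and c are illustrative
    return c * np.exp(-(h / a) ** 2)

def kriging_weights(positions, query):
    """Solve the ordinary kriging system (Eq. 4) for one query position."""
    n = len(positions)
    h = np.abs(positions[:, None] - positions[None, :])  # pairwise distances (1D example)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = gaussian_cov(h)
    A[:n, n] = 1.0  # Lagrange-multiplier column enforcing sum(lambda) = 1
    A[n, :n] = 1.0
    b = np.append(gaussian_cov(np.abs(positions - query)), 1.0)
    sol = np.linalg.solve(A, b)
    return sol[:n]  # drop the Lagrange multiplier

weights = kriging_weights(np.array([0.0, 1.0, 3.0]), 1.5)
```

The last row of the system enforces the normalization constraint, so the returned weights sum to 1 by construction.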
3.2 Supervised kriging
Consider \Omega, a subset of the discrete space \mathbb{Z}^2, as a representation of the support space of a 2D image. We denote an image by z and by z(x) its value at pixel x \in \Omega. This assumes we work with greyscale images, which is a standard assumption made in most SISR algorithms [7, 6, 9].
We assume that we are given a set of training image pairs \mathcal{T} = \{(y_j, z_j)\}_{j=1}^{J}, where y_j is an upsampled low-resolution version of z_j with the same size. Our objective is to superresolve the low-resolution images of the test set. We further assume that z is a realization of a first- and second-moment stationary random process Z; for convenience, we denote by P_Z the distribution of this random process. In addition, y is a realization of a random field Y. We denote the training set as \mathcal{T} \sim \mathcal{D}, where \mathcal{D} is a joint meta-distribution in which each training pair follows its own random process.

In classical kriging, the estimate \hat{z}_0 is expressed as a linear combination of all the observed samples, i.e. the pixels of z. In the case of superresolution, the observations come from the low-resolution image y. Furthermore, we estimate \hat{z}(x) as a linear combination of only local observations within some window radius d, leading to the estimator:

(5)   \hat{z}(x) = \sum_{x_i \in N_d(x)} \lambda_i(x) \, y(x_i),   with   \sum_i \lambda_i(x) = 1,

where N_d(x) is the set of pixel positions within radius d of x. When no confusion is possible, we write \lambda_i = \lambda_i(x) and N = |N_d(x)|. We found that a fixed window size provided the best results. To find \lambda, we want to minimize the true risk as given in Eq. 2, leading to:

(6)   R(\lambda) = \mathbb{E}_{(Y,Z) \sim \mathcal{D}} \Big[ \mathbb{E}\big[(\hat{Z}(x) - Z(x))^2\big] \Big].

This risk is a compromise between the geostatistical model and the deep learning model, where the covariance of each field is learned through supervised learning. Furthermore, if we replace the true expectation with the empirical one and the random field with its realization, we arrive at

(7)   \hat{R}(\lambda) = \frac{1}{J} \sum_{j=1}^{J} \frac{1}{M_j} \sum_{x \in \Omega} \big(\hat{z}_j(x) - z_j(x)\big)^2,

where M_j is the number of pixels in image z_j.
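The empirical risk above is simply a per-image mean squared error averaged over the training set; a minimal sketch (array shapes and names are illustrative):

```python
import numpy as np

def empirical_risk(z_hats, zs):
    """Eq. 7 sketch: average over images j of (1/M_j) * sum_x (z_hat_j(x) - z_j(x))^2."""
    return float(np.mean([np.mean((zh - z) ** 2) for zh, z in zip(z_hats, zs)]))
```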
Now, rather than solving for the weights by inverting the covariance matrix as in Eq. 4, we propose learning this set of weights directly with a CNN. This allows us to capture and model more complex spatial relationships in the data than the classical covariance-based model. As such, the estimator becomes \hat{z}(x) = \sum_i \lambda_i(x; \Theta) \, y(x_i), where \lambda(x; \Theta) = g_\Theta(y)(x), with g and \Theta representing the CNN network function and its parameters respectively.
3.3 Statistical properties
We can derive statistical properties for our proposed estimator \hat{Z}_0 = \sum_i \lambda_i Y_i. Given y, the bias can be expressed as

\mathbb{E}[\hat{Z}_0] - \mathbb{E}[Y_0] = \sum_i \lambda_i \, \mathbb{E}[Y_i] - \mathbb{E}[Y_0].

Since \sum_i \lambda_i = 1 and the random field Y is first-order stationary, we are left with zero bias with respect to the weights of the neural network, i.e. \mathbb{E}[\hat{Z}_0] = \mathbb{E}[Y_0]. In other words, our network does not add bias to the field Y. As such, from Y to Z, if the two fields have the same first moment, then no bias is added to \hat{Z} (of course, we cannot account for the low-resolution process that produced y). This point is critical, since in classical statistics a good estimator should have zero bias and minimal variance.
By definition, the variance of our estimator is

(8)   \mathrm{Var}(\hat{Z}_0) = \mathrm{Var}\Big(\sum_i \lambda_i Y_i\Big) = \sum_i \sum_j \lambda_i \lambda_j \, \mathrm{Cov}(Y_i, Y_j).

Again, we estimate the covariance given y. Since we assume that Y is second-order stationary, the covariance of Y depends only on the distance between two points, i.e.

(9)   \mathrm{Cov}(Y_i, Y_j) = C(x_i - x_j).

By setting h_{ij} = x_i - x_j, we arrive at:

(10)   \mathrm{Var}(\hat{Z}_0) = \sum_i \sum_j \lambda_i \lambda_j \, C(h_{ij}),

where the covariance C is estimated from the low-resolution image y. The variance thus depends on the position x_0. Note, however, that since we directly estimate the weights without making assumptions on the covariance function, we cannot compute the variance in closed form, and an approximation is needed to estimate the covariance empirically from y. Eq. 10 is nevertheless particularly interesting, since it shows that the estimator variance at x_0 is directly proportional to the variance of Y and to the covariance values near x_0: the larger these values are, the more uncertain the estimator is.
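Once a covariance model is chosen, the estimator variance of Eq. 10 can be evaluated directly from the predicted weights; a minimal sketch, where the weights, positions, and covariance function are placeholders:

```python
import numpy as np

def estimator_variance(weights, positions, cov):
    """Eq. 10 sketch: Var = sum_i sum_j lambda_i * lambda_j * C(x_i - x_j).
    weights: (N,), positions: (N, 2), cov: covariance model applied to distances."""
    h = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    return float(weights @ cov(h) @ weights)
```

As a sanity check, with a constant covariance c the double sum collapses to c * (sum of weights)^2, i.e. simply c for normalized weights.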
3.4 Network Implementation and Training
So far, the theory we have presented on supervised kriging is generic and does not indicate how one should minimize the risk in Eq. 7 to solve for the kriging weights. We solve for the weights with a CNN, shown in Fig. 1. The weight-estimation branch is composed of a residual network of 9 residual units; each residual unit has 2 convolutional layers, each preceded by batch normalization [27] and a ReLU.
This branch follows a similar architecture to [9, 6], with the difference that they apply it to directly estimate the HR image, while we use it to estimate our kriging weights, i.e. as a local dynamic filter [22]. In addition, contrary to [9], our network is not recurrent.
The network learns a total of 20 convolution layers to determine the kriging weights. The first 19 are of depth 128; the last prediction layer has depth (2d+1)^2 and outputs the kriging weight vector \lambda(x). The weights are applied to the repeated input image via a pointwise multiplication and a summation along the depth dimension to yield the HR output. Overall, the network is very lightweight; inference on a typical test image takes only seconds with a Titan X.

Note that our network is actually learning how to estimate the kriging weights \lambda(x), i.e. it acts as a local dynamic filter. To derive this dynamic filter end-to-end, we adapt the following convolutional formulation of Eq. 5:
(11)   \hat{z}(x) = \sum_i \lambda_i(x) \, (\delta_i * y)(x),   with   \sum_i \lambda_i(x) = 1,

where \delta_i denotes the Dirac filter that selects the i-th neighbour. This filter bank is denoted by "repeat input" in Fig. 1, since applying the Dirac functions at each position is a way to extract the pixel neighbourhood at position x. After having determined the weights, we normalize them so that they sum to 1, as required by the constraint in Eq. 11. We pointwise multiply the normalized weights with the output of the repeat filter; to arrive at the superresolved image, we simply sum along the depth.
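The repeat-input and multiply-and-sum steps can be sketched independently of any deep learning framework; a NumPy version assuming a window radius d and edge padding (the padding scheme is our assumption):

```python
import numpy as np

def repeat_input(y, d):
    """Stack (2d+1)^2 shifted copies of y, one per Dirac (one-hot) filter."""
    H, W = y.shape
    pad = np.pad(y, d, mode="edge")
    shifts = [pad[d + dy:d + dy + H, d + dx:d + dx + W]
              for dy in range(-d, d + 1) for dx in range(-d, d + 1)]
    return np.stack(shifts, axis=-1)  # shape (H, W, (2d+1)^2)

def apply_kriging_filter(y, raw_weights, d):
    """Normalize the predicted weights along depth, multiply with the
    repeated input, then sum over depth to obtain the superresolved image."""
    lam = raw_weights / raw_weights.sum(axis=-1, keepdims=True)
    return (repeat_input(y, d) * lam).sum(axis=-1)
```

Because the weights are normalized, a constant input image is reproduced exactly regardless of the predicted weights.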
We implement our network in TensorFlow, minimizing the loss in Eq. 7 with the Adam optimizer and a batch size of 8. We apply gradient clipping, and dropout between the last ReLU and the convolution layer that predicts \lambda, with a dropout rate of 0.3. The effective depth of our network is governed by U, the number of times the residual units are applied.

4 Experiments
4.1 Datasets, Preprocessing and Evaluation
For training, we follow [9, 6] and use the 291 images combined from [28] and the Berkeley Segmentation Dataset [29]. We further augment the data with fixed grades of rotation and scaling, limiting ourselves to these to follow the same protocol as [9]. Like [9, 6], we work with patches sampled from the augmented dataset with a fixed stride. Testing is then performed on four commonly used benchmarks: Set 5 [30], Set 14 [7], B100 [29] and Urban 100 [31]. To be consistent with [9, 6], we convert the RGB images to YCbCr and only resolve the Y component. This is then combined with upsampled Cb and Cr components and converted back into RGB; upsampling is done via bicubic interpolation.

We evaluate the resulting superresolved images quantitatively with the peak signal-to-noise ratio (PSNR):

PSNR = 10 \log_{10}\big( \mathrm{MAX}^2 / \mathrm{MSE} \big),   \mathrm{MSE} = \frac{1}{M} \sum_{x} \big(\hat{z}(x) - z(x)\big)^2,

where z and \hat{z} are the ground-truth and superresolved images respectively, MAX is the peak pixel value, and M refers to the number of pixels in the (high-resolution) image. A higher PSNR corresponds to better results. We also evaluate using the Structural Similarity index (SSIM) [32], which measures the similarity between two images; the closer the SSIM value is to 1, the more similar the superresolved image is to the ground-truth high-resolution image.
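The PSNR metric reduces to a few lines of code; a sketch assuming 8-bit images with peak value 255:

```python
import numpy as np

def psnr(z_hat, z, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = np.mean((z_hat.astype(np.float64) - z.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```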
4.2 Comparison to state-of-the-art
We compare our PSNR and SSIM measures for the different scales against the state-of-the-art in Table 1. The first two methods, bicubic interpolation and local kriging, are unsupervised, while all the others are supervised. We use the Matlab implementation of bicubic interpolation. For the local kriging, we use a fixed local neighbourhood and stride. At this sparse setting, unsupervised kriging does very well in comparison to bicubic interpolation. However, it is already extremely slow, since for each patch we need to compute the empirical covariance, fit the true covariance, and then invert the covariance matrix; in total, it takes approximately 1 minute per image. In comparison, our proposed deep kriging, applied densely to the same image, is an order of magnitude faster, taking only seconds with a Titan X.
Looking at our approach (reported in the second-to-last column of Table 1) with respect to the supervised methods, our performance is comparable to the B1U9 setting of DRRN [9], DRCN [7] and VDSR [6]; all three have networks with depth similar to ours. The best results are reported by DRRN B1U25 [9], which has 50 convolution layers and is more than twice as deep as our network.
Given the trend of the community to work on deep learning-based approaches for superresolution, as well as the fact that no labelled data is required for training, one can work with (infinitely) large datasets [33, 8]. For a fair comparison with the state-of-the-art, however, we omit from this table the techniques which do not use the fixed 291-image training set, as per the training protocol set by [9, 6].
4.3 Model uncertainty
One of the key strengths of our technique is that we can estimate model uncertainty. We show in Figure 2 the estimated variance for each pixel based on Eq. 10. To evaluate this equation, we need a covariance model for the field. We obtain one by first estimating the empirical covariance and then solving for the parameters a and c of the Gaussian model from Section 3.1 that is closest to the empirical covariance.
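Fitting the Gaussian covariance model to an empirical covariance can be done with a simple least-squares search; a sketch, where the horizontal-lag estimator and the grid of candidate range parameters a are our own simplifications:

```python
import numpy as np

def empirical_cov(y, max_lag=8):
    """Empirical covariance as a function of horizontal lag h (stationarity assumed)."""
    y = y - y.mean()
    return np.array([np.mean(y[:, :y.shape[1] - h or None] * y[:, h:])
                     for h in range(max_lag)])

def fit_gaussian_model(emp, lags):
    """Fit C(h) = c * exp(-(h/a)^2): grid search over a, closed-form c for each a."""
    best = None
    for a in np.linspace(0.5, float(lags[-1]), 200):
        basis = np.exp(-(lags / a) ** 2)
        c = float(basis @ emp) / float(basis @ basis)  # least-squares amplitude
        err = float(np.sum((c * basis - emp) ** 2))
        if best is None or err < best[0]:
            best = (err, a, c)
    return best[1], best[2]  # (a, c)
```

Given a noiseless empirical covariance generated from the model itself, the search recovers the generating parameters up to the grid resolution.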
The main advantage of our uncertainty model is that it offers insight into the otherwise black-box CNN: it provides feedback on the reliability of the result for an unseen image. One can see our uncertainty estimate in Figure 2. The estimated variance has a higher value than the real PSNR; this is due to the fact that we have a noisy estimation of the PSNR from a low-resolution image. Holistically, however, the two are similar, in that areas with high variance correspond to areas with high PSNR. Quantitatively, we find for Set 5 that 91.9% of the superresolved pixel values fall within 3 standard deviations of the estimated variance. In addition, we evaluate the similarity between the map derived from the PSNR and the one derived from the variance on Set 5; using correlation as the similarity measure, we find that the two have a high correlation of up to 0.8.


Table 1: PSNR/SSIM comparison against the state-of-the-art for scales x2, x3 and x4.

Dataset    Scale  Bicubic      local kriging  SRCNN        VDSR         DRCN         DRRN B1U9    DRRN B1U25   ours
Set 5      x2     33.66/0.930  35.68/0.923    36.66/0.953  37.53/0.959  37.63/0.959  37.66/0.959  37.74/0.959  37.65/0.960
           x3     30.39/0.868  31.72/0.864    32.75/0.909  33.66/0.921  33.82/0.923  33.93/0.923  34.04/0.924  33.94/0.923
           x4     28.42/0.810  29.83/0.814    30.48/0.862  31.35/0.883  31.53/0.885  31.58/0.886  31.68/0.889  31.56/0.886
Set 14     x2     30.24/0.869  31.26/0.874    32.45/0.907  33.03/0.912  33.04/0.912  33.19/0.913  33.23/0.914  33.10/0.912
           x3     27.55/0.774  27.94/0.781    29.30/0.822  29.77/0.831  29.76/0.831  29.94/0.831  29.96/0.834  29.81/0.831
           x4     26.00/0.703  26.43/0.715    27.50/0.751  28.01/0.767  28.02/0.767  28.02/0.767  28.18/0.770  28.08/0.771
B100       x2     29.56/0.843  31.06/0.848    31.36/0.888  31.90/0.896  31.85/0.894  32.01/0.897  32.05/0.897  32.01/0.896
           x3     27.21/0.843  27.86/0.736    31.36/0.888  28.82/0.798  28.80/0.796  28.91/0.799  28.95/0.800  28.82/0.800
           x4     25.96/0.668  26.55/0.665    26.90/0.711  27.29/0.725  27.23/0.723  27.35/0.726  27.38/0.728  27.36/0.727
Urban 100  x2     26.88/0.840  28.32/0.843    29.50/0.895  30.76/0.914  30.75/0.913  31.02/0.916  31.23/0.919  30.95/0.921
           x3     24.46/0.735  25.06/0.715    29.50/0.799  27.14/0.715  27.15/0.828  27.38/0.833  27.53/0.838
           x4     23.14/0.658  23.70/0.627    24.52/0.722  25.19/0.752  25.14/0.751  25.35/0.757  25.44/0.764  25.10/0.749
5 Conclusions
In this paper, we have proposed a joint deep learning and statistical framework for single image superresolution based on a form of supervised kriging. We solve for the kriging weights in the manner of a local dynamic filter and apply them directly to the low-resolution image, all within a single network that can be learned end-to-end. Since we work within the known statistical framework of kriging, we can estimate model uncertainty, something typically not possible for deep networks. More specifically, we show through derivations that the statistical estimator generated by our network is unbiased, and we calculate its variance.
Acknowledgement
This research was funded by the German Research Foundation (DFG) as part of the research training group GRK 1564 Imaging New Modalities.
References
 [1] H. Chang, D. Yeung, and Y. Xiong. Superresolution through neighbor embedding. In CVPR, volume 1, pages I–I. IEEE, 2004.
 [2] R. Keys. Cubic convolution interpolation for digital image processing. IEEE Trans. on Acoustics, Speech, and Signal Processing, 29(6):1153–1160, 1981.
 [3] David L Donoho. Compressed sensing. IEEE Trans. on Information Theory, 52(4):1289–1306, 2006.
 [4] J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang. Coupled dictionary training for image superresolution. IEEE Transactions on Image Processing (TIP), 21(8):3467–3478, 2012.
 [5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [6] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image superresolution using very deep convolutional networks. In CVPR, 2016.
 [7] J. Kim, J. Kwon Lee, and K. Mu Lee. Deeply-recursive convolutional network for image superresolution. In CVPR, 2016.
 [8] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image superresolution. In CVPR Workshops, 2017.
 [9] Y. Tai, J. Yang, and X. Liu. Image superresolution via deep recursive residual network. In CVPR, June 2017.
 [10] M. S. M. Sajjadi, B. Scholkopf, and M. Hirsch. EnhanceNet: Single image superresolution through automated texture synthesis. In ICCV, Oct 2017.
 [11] Georges Matheron. Random sets and integral geometry. John Wiley & Sons, 1975.
 [12] Noel Cressie. Statistics for spatial data. John Wiley & Sons, 2015.
 [13] W. Freeman, T. Jones, and E. Pasztor. Examplebased superresolution. IEEE Computer Graphics and Applications, 22(2):56–65, 2002.
 [14] T. Peleg and M. Elad. A statistical prediction model based on sparse representations for single image superresolution. IEEE Transactions on Image Processing (TIP), 23(6):2569–2582, 2014.
 [15] W. Dong, L. Zhang, G. Shi, and X. Wu. Image deblurring and superresolution by adaptive sparse domain selection and adaptive regularization. IEEE Transactions on Image Processing (TIP), 20(7):1838–1857, 2011.
 [16] S. Zhao, H. Han, and S. Peng. Waveletdomain hmtbased image superresolution. In International Conference on Image Processing (ICIP), volume 2, pages II–953. IEEE, 2003.
 [17] K. Zhang, X. Gao, D. Tao, and X. Li. Single image superresolution with nonlocal means and steering kernel regression. IEEE Transactions on Image Processing (TIP), 21(11):4544–4556, 2012.
 [18] G. Anbarjafari and H. Demirel. Image super resolution based on interpolation of wavelet domain high frequency subbands and the spatial domain input image. ETRI journal, 32(3):390–394, 2010.
 [19] L. Xu and J. Jia. Twophase kernel estimation for robust motion deblurring. In ECCV, pages 157–170. Springer, 2010.
 [20] A. Marquina and S. Osher. Image superresolution by tvregularization and bregman iteration. Journal of Scientific Computing, 37(3):367–382, 2008.
 [21] H. He and W. C. Siu. Single image superresolution using Gaussian process regression. In CVPR, 2011.
 [22] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool. Dynamic filter networks. In NIPS, 2016.
 [23] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P Xing. Deep kernel learning. In AISTATS, 2016.
 [24] A. Damianou and N. Lawrence. Deep Gaussian processes. In AISTATS, pages 207–215, 2013.
 [25] Luc Pronzato and MariaJoão Rendas. Bayesian local kriging. Technometrics, pages 1–12, 2017.
 [26] F. Meier, P. Hennig, and S. Schaal. Local gaussian regression. arXiv preprint arXiv:1402.0645, 2014.

 [27] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
 [28] J. Yang, J. Wright, T. Huang, and Y. Ma. Image superresolution via sparse representation. IEEE Transactions on Image Processing (TIP), 19(11):2861–2873, 2010.
 [29] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, volume 2, pages 416–423, July 2001.
 [30] M. Bevilacqua, A. Roumy, C. Guillemot, and M. AlberiMorel. Lowcomplexity singleimage superresolution based on nonnegative neighbor embedding. 2012.
 [31] J. Huang, A. Singh, and N. Ahuja. Single image superresolution from transformed selfexemplars. In CVPR, pages 5197–5206, 2015.
 [32] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4):600–612, 2004.
 [33] T. Tong, G. Li, X. Liu, and Q. Gao. Image superresolution using dense skip connections. In ICCV, Oct 2017.