Image super-resolution (SR) aims to estimate a high-resolution (HR) image from low-resolution (LR) observations. Due to the information loss in the image degradation process, SR is inherently an ill-posed problem. The earliest works, based on image interpolation, estimate the HR image from local statistics of the LR image. Typical methods include bilinear, bicubic and new edge-directed interpolation, which predict the HR pixels by utilizing the spatial relationship between LR and HR pixels. Later, many successive works [1, 2] regard image SR as a maximum a posteriori (MAP) estimation problem and impose various priors to constrain the inverse estimation. In these methods, priors and constraints are typically designed in a heuristic way, and are thus insufficient to represent the diversified patterns of natural images.
Learning-based methods obtain a mapping between LR and HR images from a large training set with dynamically learned prior knowledge. Sparse-representation-based methods such as [3] learn the mapping by building coupled LR and HR patch dictionaries. Neighbor embedding (NE) methods linearly combine HR neighbors to infer the HR image. Timofte et al. [4] proposed an adjusted anchored neighborhood regression method for image SR. Li et al. [5] proposed a neighbor-preserving method that utilizes HR reference patches only when reconstructing the high-frequency regions of LR images. Recently, deep-learning-based methods [6, 7, 8, 9] have been proposed. SRCNN [6] is the first method that utilizes a three-layer convolutional network for image SR. In [7], a sparse prior is incorporated into the network. Then, networks based on residual learning [8] and sub-band recovery with edge guidance [9] were constructed to recover the HF signal and offer state-of-the-art performance.
Despite the impressive results achieved by learning-based methods, some HF information is still lost because of the ill-posed nature of image SR and the fact that minimizing mean squared error leads to “regression to the mean”. As a result, several methods have recently been proposed that additionally compensate for the HF information loss with online-retrieved HR references. Yue et al. [11] directly utilized the references to enhance the SR result by patch matching and patch blending. Li et al. [12] used the retrieved HR image patches to learn a more accurate sparse distribution. Liu et al. [13] utilized a group-structured sparse representation to further exploit the nonlocal dependency of the HR references. However, several important issues are still not fully considered in these methods. For example, their fusion schemes do not effectively extract external HF information for compensation, and may even introduce artifacts. Besides, they do not make full use of internal redundancy to benefit the recovery of HF information.
To address the above issues, we propose a unified deep network that additionally utilizes online retrieved data to facilitate image SR. Our work can efficiently extract an HF map from multiple HR references that are retrieved based on the intermediately inferred SR image.
The contributions of this paper are as follows: (1) This is the first work that efficiently extracts high-frequency information from HR references and successfully compensates for the HF information loss of the SR result within a deep framework. (2) Our work shows that it is feasible to model internal and external images jointly, achieving a more accurate and robust fusion of internal and external information for HF recovery. (3) Compared with both previous deep-learning-based methods and online compensation SR methods, our approach achieves superior, new state-of-the-art performance.
2 Dual High-Frequency Recovery Network
Given an LR image $X$, we predict the HR image from $X$ with the aid of retrieved HR reference images using our dual high-frequency recovery network (DHN). In this paper, we use $R$ to denote a reference image. The architecture of the proposed DHN is illustrated in Fig. 1. DHN consists of two components: the internal high-frequency inference network (IHN) and the external high-frequency compensation network (EHN). IHN infers the missing HF information merely from the internal data in $X$. The intermediate SR image $\hat{Y}$ is then generated by combining the internally inferred HF (IHF) map with the simply up-sampled LR image $X^u$. EHN further enhances the final SR result by adding the externally extracted HF (EHF) map, obtained from the aligned retrieved HR reference images, to the intermediate image $\hat{Y}$.
2.1 Internal high-frequency inference network
The first component, IHN, proposed in [9], is utilized to initially reconstruct the LR image from its own information. As shown in Fig. 1, the LR image $X$ and its edge map, extracted by a hand-crafted edge detector, are used as the input of IHN. The recurrent network of IHN then estimates the IHF map from this input. IHN also predicts an HR edge map, which further guides the HF map estimation.
With the inferred IHF map, the intermediate SR image $\hat{Y}$ is then generated as follows:
$$\hat{Y} = X^u \oplus f_I(X),$$
where $\oplus$ is the direct (pixel-wise) sum operation, $f_I(\cdot)$ represents the process by which IHN infers the IHF map from the LR image $X$, and $X^u$ is the image simply up-sampled from $X$. Specifically, at scaling factor $s$, the value of each pixel in $X$ is copied to the pixels belonging to the corresponding $s \times s$ patch in $X^u$. We then define the loss of IHN as the combination of the losses of the predicted HR edge map and of $\hat{Y}$, each measured by the mean squared error (MSE) against the ground-truth signal.
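The simple up-sampling step and the direct summation described above can be sketched as follows (a minimal numpy illustration; in the actual network the IHF map comes from the trained IHN, so a placeholder array stands in for it here):

```python
import numpy as np

def nearest_upsample(lr: np.ndarray, s: int) -> np.ndarray:
    """Copy each LR pixel to the corresponding s-by-s patch of the HR grid."""
    return np.repeat(np.repeat(lr, s, axis=0), s, axis=1)

def intermediate_sr(lr: np.ndarray, ihf_map: np.ndarray, s: int) -> np.ndarray:
    """Combine the up-sampled LR image with the inferred IHF map by direct summation."""
    return nearest_upsample(lr, s) + ihf_map
```

For example, a 2-by-2 LR image at scaling factor 2 yields a 4-by-4 intermediate image whose low-frequency content is the pixel-replicated LR image.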
2.2 External high-frequency compensation network
IHN works well in predicting the HF map from an LR image. However, not all HF information can be recovered in this process, as shown in Fig. 2 (e). This inspires us to construct EHN to further extract a significant EHF map from each HR reference, with the trained IHN kept fixed. Note that during the training process, the reference image $R$ is generated from the ground-truth HR image.
As shown in Fig. 2, there are common illumination and color differences between the LR image and its reference images. Moreover, the references contain much useless low-frequency information that may hinder HF extraction. Therefore, we take two measures to improve the robustness of extracting the EHF map. First, the contrast of the label images is additionally adjusted during training to simulate common illumination and color differences. Second, instead of directly inputting $R$, we utilize the difference image between $R$ and its intermediate SR image $\hat{R}$ as the input of EHN, where $\hat{R}$ is obtained by up-sampling the down-sampled version of $R$ with IHN. The difference image is chosen for its efficiency in reducing illumination and color differences and in removing redundant low-frequency information.
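The difference-image construction can be sketched as follows. Since the trained IHN is not reproducible here, block averaging followed by nearest-neighbor up-sampling stands in for the down-sample/IHN-up-sample round trip (an illustrative assumption):

```python
import numpy as np

def down_up(img: np.ndarray, s: int) -> np.ndarray:
    # Stand-in for "down-sample, then super-resolve with IHN":
    # s-by-s block averaging followed by nearest-neighbor up-sampling.
    h, w = img.shape
    lr = img.reshape(h // s, s, w // s, s).mean(axis=(1, 3))
    return np.repeat(np.repeat(lr, s, axis=0), s, axis=1)

def ehn_input(reference: np.ndarray, s: int) -> np.ndarray:
    # Difference image: reference minus its intermediate SR image.
    # This removes low-frequency content and much of the illumination/color gap.
    h, w = reference.shape
    ref = reference[:h - h % s, :w - w % s]  # crop so the size divides by s
    return ref - down_up(ref, s)
```

A flat region produces a near-zero difference image, which is exactly the redundant low-frequency content the network should ignore.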
EHN then extracts the EHF map from this input with its recurrent network. The final reconstructed result $\tilde{Y}$ is derived by:
$$\tilde{Y} = \hat{Y} \odot f_E(D),$$
where $f_E(\cdot)$ is the formulation of the process by which EHN extracts the EHF map from the difference image $D$, and $\odot$ represents the combination of the intermediate image $\hat{Y}$ and the EHF map. During training, $\odot$ directly adds the EHF map to $\hat{Y}$. In testing, the EHF map is utilized based on patch matching results, which will be elaborated in Sec. 3.3. The loss of EHN is defined as the MSE between $\tilde{Y}$ and the raw ground-truth image.
3 Online Retrieval For Compensation
Unlike the training process, in the testing process we retrieve the HR reference images online for compensation, and the extracted HF map is fused based on patch matching results.
3.1 Reference retrieval and registration
Key points are first detected in the query image. Then a 144-dimension feature vector containing discriminative information is extracted for each patch centered at a key point. Finally, the bag-of-words (BoW) model is used for indexing and retrieving reference images with the extracted feature vectors.
However, we cannot directly use the retrieved HR references to compensate for the information loss, because the references differ from the query image in scale and viewpoint. As a result, each reference is aligned to the query image for best compensation. We first detect the SIFT features [16] of the query image and of each reference and match their feature points. The RANSAC algorithm is then performed over the matched points to find the best homography transformation matrix. Finally, the aligned reference images are derived based on the transformation matrices.
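The alignment step, estimating a homography from matched feature points with RANSAC, can be sketched in plain numpy (in practice a library routine such as OpenCV's findHomography would be used; the DLT solver, iteration count and inlier threshold below are illustrative assumptions):

```python
import numpy as np

def dlt_homography(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Direct linear transform: fit a 3x3 homography from >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    H = vt[-1].reshape(3, 3)          # null-space vector = homography entries
    return H / H[2, 2]

def project(H: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Apply homography H to an (N, 2) array of points."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, iters=500, thresh=2.0, seed=0):
    """Keep the homography with the most inliers over random 4-point samples."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = dlt_homography(src[idx], dst[idx])
        err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers of the best model for accuracy.
    return dlt_homography(src[best_inliers], dst[best_inliers]), best_inliers
```

The aligned reference is then obtained by warping it with the estimated matrix.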
3.2 Patch matching
After obtaining the aligned references, the EHF map of each reference is extracted as described in Sec. 2.2. Since the pixels in the aligned references still do not exactly correspond to the pixels at the same positions of the intermediate SR image $\hat{Y}$, the extracted EHF values cannot be directly added to $\hat{Y}$. Therefore, patch matching is utilized to find corresponding pixels between each aligned reference and $\hat{Y}$ to guide the combination of $\hat{Y}$ and the EHF maps.
There are usually significant differences in illumination, color and resolution between the intermediate SR image $\hat{Y}$ and the aligned HR references. For better matching, we first use the intermediate SR image $\hat{R}$ of each aligned reference for matching, which shares a similar resolution level with $\hat{Y}$. Then, each $\hat{R}$ is transformed to reduce the effect of the illumination difference:
$$\hat{R}' = \frac{\sigma_{\hat{Y}}}{\sigma_{\hat{R}}}\bigl(\hat{R} - \mu_{\hat{R}}\bigr) + \mu_{\hat{Y}},$$
where $\hat{R}'$ is the transformed result, and $\mu$ and $\sigma$ are the mean and standard deviation values of all pixels of the corresponding image, respectively. Then $\hat{Y}$ is split into overlapped query patches at a step size of 4, and we search for the corresponding patch of each query patch within a search window on each $\hat{R}'$.
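The mean/standard-deviation transfer above amounts to the following small numpy sketch, with the global statistics computed over all pixels as in the text (the epsilon guard against a constant image is an added assumption):

```python
import numpy as np

def transfer_illumination(ref_sr: np.ndarray, query_sr: np.ndarray) -> np.ndarray:
    """Match the global mean and standard deviation of ref_sr to those of query_sr."""
    mu_r, sigma_r = ref_sr.mean(), ref_sr.std()
    mu_q, sigma_q = query_sr.mean(), query_sr.std()
    # Normalize the reference, then rescale to the query's statistics.
    return (ref_sr - mu_r) / (sigma_r + 1e-8) * sigma_q + mu_q
```

After the transform, the reference shares the query's global brightness and contrast, so subsequent patch distances reflect structure rather than illumination.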
Since small patches contain little structural information about the raw images, patch matching with small patch sizes is inaccurate. Thus we perform patch matching between $\hat{Y}$ and the references with large patches. Considering that a patch in $\hat{Y}$ may have no exact corresponding large patch in a reference, a method that adaptively adjusts the patch size according to the patch difference is adopted for more accurate matching.
Let $P_q$ denote the query patch centered at a given position in $\hat{Y}$ and $P_c$ denote a candidate patch in the reference. We search for the best matching candidate patch of $P_q$ within the search window centered at the corresponding position in the reference. The patch distance between $P_q$ and $P_c$ is defined as:
$$d(P_q, P_c) = \lVert P_q - P_c \rVert_2^2 + \lambda \,\lVert \nabla P_q - \nabla P_c \rVert_2^2,$$
where $\nabla$ is the operation that calculates the gradients of the patches, and $\lambda$ is the weighting parameter that controls the relative importance of the pixel value differences and their gradient differences; its value is fixed empirically in this paper. Besides, the DC components of the patches are removed before the distance computation.
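Under these definitions, the patch distance can be sketched as below; the finite-difference gradient operator and the default λ are illustrative choices, since the paper's exact value of λ is not reproduced here:

```python
import numpy as np

def patch_distance(p: np.ndarray, q: np.ndarray, lam: float = 1.0) -> float:
    """Distance combining pixel differences and gradient differences, DC removed."""
    p = p - p.mean()           # remove DC component of each patch
    q = q - q.mean()
    gpy, gpx = np.gradient(p)  # simple finite-difference gradients
    gqy, gqx = np.gradient(q)
    pixel_term = np.mean((p - q) ** 2)
    grad_term = np.mean((gpy - gqy) ** 2 + (gpx - gqx) ** 2)
    return pixel_term + lam * grad_term
```

Because the DC component is removed, two patches that differ only by a constant brightness offset have distance zero, which is exactly the illumination robustness the matching step needs.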
This patch distance is referred to as the gradient mean square error (GMSE), and the matching score of a query patch is set as the minimum GMSE between the query patch and the candidate patches within the search window. In particular, this score is consistent with the quality of patch matching. In order to improve the quality of patch matching, the patch size is therefore adaptively adjusted according to the score as follows:
Patch matching is performed at the initial patch size, and a smaller size is used when the minimum GMSE is too large according to Eq. 5. With the chosen sliding step, the closest candidate patch is found. However, a large step size may miss a better matching patch in the reference. Thus we further search for a candidate patch of the same size within a smaller search window centered at the previously matched position, with a finer step size.
3.3 External high-frequency information utilization
After patch matching, the pixels at the same positions within the matched patches of $\hat{Y}$ and each reference are matched. The EHF map is then combined with $\hat{Y}$ based on these pixel-wise matching correlations. For each pixel of $\hat{Y}$, the final extracted HF value is computed over the set of its matched pixels in the HF maps extracted from all of the references: the HF value of each matched pixel is weighted according to the GMSE distance between the patches that the two pixels belong to, and the weighted values are normalized over the number of elements in the set. Note that the pixel-wise correlations between $\hat{Y}$ and each EHF map here are the same as the correlations built between $\hat{Y}$ and each aligned reference.
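One pixel-wise fusion consistent with this description can be sketched as follows; the exponential weighting of the GMSE distances and the bandwidth `h` are illustrative assumptions, not the paper's exact formula:

```python
import numpy as np

def fuse_hf(matches, h: float = 10.0) -> float:
    """Fuse the EHF values of one pixel's matched pixels across all references.

    `matches` is a list of (hf_value, gmse_distance) pairs; each contribution is
    weighted by exp(-d / h) (an assumed weighting that decays with the GMSE
    distance) and the result is normalized by the total weight.
    """
    if not matches:
        return 0.0  # no external information was matched for this pixel
    values = np.array([v for v, _ in matches])
    weights = np.exp(-np.array([d for _, d in matches]) / h)
    return float(np.sum(weights * values) / np.sum(weights))
```

Well-matched patches (small GMSE) dominate the fused HF value, while poorly matched ones contribute little, which suppresses the artifacts that direct blending would introduce.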
Finally, the resulting SR image is obtained by directly adding the final extracted HF map to the intermediate reconstructed SR image $\hat{Y}$.
4 Experiments

4.1 Experimental settings
We train our DHN on the 91 images of [3] and the 200 training images of BSD500 [17]. Besides, as mentioned in Sec. 2.2, during the training of EHN, the contrast of the ground-truth images is first adjusted with random perturbations so that the HF map can be extracted from the references more robustly.
With the above 291 images, we first convert the images to the YCbCr color space and utilize only the luminance channel; the chrominance channels are simply up-sampled with the bicubic method in the testing process. We then generate sub-images from the dataset images with a stride of 16 pixels. For down-sampling, the images are first blurred and then down-sampled with factors of 2, 3 and 4. As a result, around 10 thousand sub-images are obtained for training. The learning rate is initialized to a fixed value.
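The blur-then-decimate degradation used to synthesize the LR training images can be sketched as follows (the Gaussian kernel width and radius are illustrative assumptions; the paper's exact blur settings are not reproduced here):

```python
import numpy as np

def gaussian_kernel(sigma: float, radius: int) -> np.ndarray:
    """1-D normalized Gaussian kernel used separably in both directions."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def degrade(hr: np.ndarray, s: int, sigma: float = 1.0) -> np.ndarray:
    """Blur with a separable Gaussian, then decimate by factor s."""
    k = gaussian_kernel(sigma, radius=2)
    # Apply the 1-D kernel along rows, then along columns (zero-padded borders).
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, hr)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)
    return blurred[::s, ::s]
```

Training pairs are then cropped from the HR image and its degraded counterpart at the corresponding positions.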
We compare our algorithm with several SR methods, including a typical learning-based SR method (denoted as NE) and two online compensation methods [11, 13] (denoted as Landmark and GSSR, respectively). For fair comparison, we add the retrieved HR reference images to the training set of the learning-based method NE. Besides, the intermediate results derived by IHN [9] are shown as the baseline, which is one of the most recent deep-learning-based SR methods without external references. The testing images are chosen from the Oxford Building dataset. There are 8 testing images in total, named ‘a’ to ‘h’ and shown in Fig. 3. For each testing image, we retrieve 4 reference images to extract the EHF map for enhancement.
Note that the experimental results of Landmark are enhanced by only a single reference image; the complete experimental results will be updated soon.
4.2 Experimental results and analysis
Table 1 shows the objective results on the chosen images. Our proposed method obtains the best PSNR and SSIM values at every down-sampling scale factor for all of the 8 images. Even for image ‘c’ shown in Fig. 2, which differs greatly from its reference images in illumination and color, the proposed method still achieves a substantial gain over the baseline: 0.54, 0.58 and 0.30 dB in PSNR at scale factors 2, 3 and 4, respectively. Note that although Landmark successfully recovers some HF information, it does not perform well in the objective comparison. This is caused by the artifacts it introduces, which will also be analyzed later.
Subjective results are shown in Fig. 4. Because the images in the Oxford Building dataset have large resolutions, we only show part of each chosen image to compare the quality of the HF reconstruction more clearly. We also enlarge some HF regions in Fig. 4 for further comparison.
The edge-preserving method NE successfully obtains sharper edges. However, it fails to reconstruct the finer HF details. Landmark successfully merges some HF signal of the HR references into the result images, but artifacts are sometimes introduced by incorrect patch matching or inappropriate patch blending, as shown in Fig. 4. As a result, the visual quality of Landmark’s results is relatively low. The sparse-coding-based method GSSR does not consider the position information of the reference patches; when there are many similar reference patches, more noise is brought into its SR results. The edge-guided baseline method [9] also reconstructs some HF signal well, but without information from the HR references it fails to recover the details of more complex regions. In contrast, our method obtains the best results in HF reconstruction. Owing to the robustness of the HF extraction and the correct patch matching, no artifacts are introduced and our results also exhibit the best visual quality.
To evaluate the effectiveness of our training policy with input random perturbation and normalization for EHN, we also compare with the case of directly inputting the raw HR ground truth (denoted as DHN-d); the other training details are kept the same. The objective comparison between our proposed method and DHN-d is shown in Table 2. Although DHN-d is capable of extracting some HF information and outperforms the baseline, our training policy still brings an additional 0.40 dB gain.
5 Conclusion

In this paper, we propose a deep online compensation framework for image super-resolution. With the IHF map estimated by IHN, we first obtain an intermediate SR result by combining the IHF map with a simply up-sampled LR image. The EHF map is then extracted from HR references retrieved online for compensation, and the final SR result is obtained by adding the EHF map to the intermediate SR result. Extensive experimental results demonstrate that the proposed method robustly extracts the external HF map from the reference images and significantly improves the SR results through the compensation brought by the EHF map.
-  J. Sun, J. Sun, Z. Xu, and H. Y. Shum, “Gradient profile prior and its applications in image super-resolution and enhancement,” IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1529–1542, June 2011.
-  A. Marquina and S. Osher, “Image super-resolution by TV-regularization and Bregman iteration,” Journal of Scientific Computing, vol. 37, no. 3, pp. 367–382, December 2008.
-  J. Yang, J. Wright, T. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
-  R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Proc. Asian Conference on Computer Vision, 2014.
-  Y. Li, J. Liu, W. Yang, and Z. Guo, “Neighborhood regression for edge-preserving image super-resolution,” in Proc. IEEE Int’l Conf. Acoustics, Speech, and Signal Processing, 2015.
-  C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proc. European Conference on Computer Vision, 2014, pp. 184–199.
-  D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, and T. S. Huang, “Robust single image super-resolution via deep networks with sparse prior,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3194–3207, 2016.
-  J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2016.
-  W. Yang, J. Feng, J. Yang, F. Zhao, J. Liu, Z. Guo, and S. Yan, “Deep edge guided recurrent residual learning for image super-resolution,” arXiv preprint arXiv:1604.08671, 2016.
-  R. Timofte, V. De Smet, and L. Van Gool, “Semantic super-resolution: When and where is it useful?,” Computer Vision and Image Understanding, vol. 142, pp. 1–12, 2016.
-  H. Yue, X. Sun, J. Yang, and F. Wu, “Landmark image super-resolution by retrieving web images,” IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 4865–4875, December 2013.
-  Y. Li, W. Dong, G. Shi, and X. Xie, “Learning parametric distributions for image super-resolution: Where patch matching meets sparse coding,” in Proc. IEEE Int’l Conf. Computer Vision, 2015.
-  J. Liu, W. Yang, X. Zhang, and Z. Guo, “Retrieval compensated group structured sparsity for image super-resolution,” IEEE Transactions on Multimedia, vol. PP, no. 99, 2016.
-  H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
-  F. Li and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2005, pp. 524–531.
-  D. Lowe, “Distinctive image features from scale-invariant keypoints,” Int’l Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, November 2004.
-  P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898–916, 2011.