Texture Hallucination for Large-Scale Painting Super-Resolution

by   Yulun Zhang, et al.

We aim to super-resolve digital paintings, synthesizing realistic details from high-resolution reference painting materials for very large scaling factors (e.g., 8×, 16×). However, previous single image super-resolution (SISR) methods either lose textural details or introduce unpleasant artifacts. Reference-based SR (Ref-SR) methods, on the other hand, can transfer textures to some extent, but remain impractical for very large scales and struggle to keep fidelity with the original input. To solve these problems, we propose an efficient high-resolution hallucination network for very large scaling factors, with an efficient network structure and feature transfer. To transfer more detailed textures, we design a wavelet texture loss, which helps to enhance high-frequency components. At the same time, to reduce the smoothing effect brought by the image reconstruction loss, we relax the reconstruction constraint with a degradation loss, which ensures consistency between downscaled super-resolution results and low-resolution inputs. We also collect a high-resolution (e.g., 4K resolution) painting dataset, PaintHD, by considering both physical size and image resolution. We demonstrate the effectiveness of our method with extensive experiments on PaintHD, comparing against state-of-the-art SISR and Ref-SR methods.





1 Introduction

Super-resolution for digital painting images has important value in both cultural and research terms. Many historical masterpieces were damaged, and their digital replications are low-resolution and low-quality due to the technological limitations of earlier days. Recovering their fine details is crucial for maintaining and protecting human heritage. Restoring high-resolution paintings is also a valuable research problem for computer scientists, owing to the rich content and texture of paintings at varying scales. Solving this problem could lay the cornerstone for the more challenging recovery of general natural images.

Super-resolving painting images is particularly challenging, as very large upscaling factors (8×, 16×, or even larger) are required to recover the brush and canvas details of artworks, so that a viewer can fully appreciate the aesthetics as from the original painting. Fortunately, there is an abundance of existing artworks scanned in high resolution, which provide references for the common texture details shared among most paintings. Therefore, in this paper we formulate a reference-based super-resolution problem setting for digital paintings with large upscaling factors.

To this end, we collect a large-scale, high-quality dataset of oil paintings with diverse contents and styles. We explore new deep network architectures with efficient memory usage and run time for large upscaling factors, and design a wavelet-based texture loss and a degradation loss to achieve high perceptual quality and fidelity at the same time. Our proposed method can hallucinate realistic details based on the given reference images, which is especially desirable for large-factor image upscaling. Compared to previous state-of-the-art methods, our proposed method achieves significantly improved visual results, which is verified in our human subjective evaluation. Our trained model can also be applied to natural photos and generates promising results. In Fig. 1, we compare with other state-of-the-art methods for large scaling factors.

In summary, the main contributions of this work are:

  • We proposed a reference-based image super-resolution framework for large upscaling factors (e.g., 8× and 16×) with novel training objectives. Specifically, we proposed a wavelet texture loss and a degradation loss, which allow the network to transfer more detailed textures.

  • We collected a new digital painting dataset PaintHD with high-quality images and detailed meta information, by considering both physical and resolution sizes.

  • We achieved significantly improved visual results over previous state-of-the-art single image super-resolution (SISR) and reference-based SR (Ref-SR) methods. This opens a new technical direction for Ref-SR with large upscaling factors on general natural images.

2 Related Work

Recent work on deep-learning-based methods for image super-resolution [22, 24, 20, 35, 36, 37] clearly outperforms more traditional methods [9, 3, 27] in terms of either PSNR/SSIM or visual quality. In this section, we focus on the former for conciseness.

2.1 Single Image Super-Resolution

Single image super-resolution (SISR) recovers a high-resolution image directly from its low-resolution (LR) counterpart. The pioneering SRCNN proposed by Dong et al. [5] made the breakthrough of introducing deep learning to SISR, achieving performance superior to traditional methods. Inspired by this seminal work, many representative works [29, 16, 17, 6, 25, 22, 35] were proposed to further explore the potential of deep learning and have continuously raised the baseline performance of SISR. In SRCNN and its follow-ups VDSR [16] and DRCN [17], the input LR image is upscaled to the target size through interpolation before being fed into the network for recovery of details. Later works demonstrated that extracting features from the LR image directly and learning the upscaling process improves both quality and efficiency. For example, Dong et al. [6] provide the LR image directly to the network and use a deconvolution for upscaling. Shi et al. [25] further speed up the upscaling process using sub-pixel convolutions, which became widely adopted in recent works. Current state-of-the-art performance is achieved by EDSR [22] and RCAN [35]. EDSR takes inspiration from ResNet [13], using long skip connections and sub-pixel convolutions to achieve stronger edges and finer texture. RCAN introduced channel attention to enforce better learning of high-frequency information.

Once larger upscaling factors became achievable, e.g., 4× and 8×, many empirical studies [20, 24, 37] demonstrated that the commonly used quality measurements PSNR and SSIM are not representative of visual quality, i.e., higher visual quality may come with lower PSNR, a fact first investigated by Johnson et al. [15] and Ledig et al. [20]. The former investigated a perceptual loss using VGG [26], while the latter proposed SRGAN by introducing a GAN [11] loss into SISR, which significantly boosted visual quality compared to previous works. Based on SRGAN [20], Sajjadi et al. [24] further adopted a texture loss to enhance textural realism. Along with higher visual quality, these GAN-based SR methods also introduce artifacts or new textures synthesized depending on the content, which contribute to increased perceived fidelity at the expense of reduced PSNR.

Although SISR has been studied for decades, it is still limited by its ill-posed nature, making it difficult to recover fine texture detail for upscaling factors of 8× or 16×. Consequently, most existing SISR methods are limited to a maximum of 4×; beyond that, they suffer serious quality degradation. Works that attempted 8× upscaling, e.g., LapSRN [19] and RCAN [35], found that visual quality degrades quadratically as the upscaling factor increases.

2.2 Reference-based Super-Resolution

Different from SISR, reference-based SR (Ref-SR) methods attempt to utilize self or external information to enhance texture. Freeman et al. [8] proposed the first work on Ref-SR, which replaced LR patches with fitting HR ones from a database/dictionary. Later works [7, 14] treated the input LR image itself as the database, from which references were extracted to enhance textures. These methods benefit the most from repeated patterns under perspective transformation. Light field imaging is another area of interest for Ref-SR, where HR references can be captured alongside the LR light field with only a small offset. This makes it easier to align the reference to the LR input, facilitating the transfer of high-frequency information [2, 38]. CrossNet [39] took advantage of deep learning to align the input and reference by estimating the flow between them, and achieved state-of-the-art performance.

A more generic scenario for Ref-SR is to relax the constraints on references, i.e., the references may present large spatial/color shifts from the input. In the extreme case, references and inputs may contain unrelated content. Sun et al. [27] used global scene descriptors and internet-scale image databases to find similar scenes that provide ideal example textures. Yue et al. [32] proposed a similar idea, retrieving similar images from the web and performing global registration and local matching. Recent works [30, 37] leveraged deep models and significantly improved Ref-SR performance, e.g., in visual quality and generalization capacity.

Our proposed method further extends the feasible scaling factor of previous Ref-SR methods from 4× to 16×. More importantly, as opposed to the previous approach [37], which treats the transfer of high-frequency information from the reference as a style transfer task, we conduct texture transfer only in the high-frequency band, which reduces the transfer effect (and hence the distortion) on the low-frequency content.

3 Approach

We aim to hallucinate the super-resolution (SR) image for large scaling factors from its low-resolution (LR) input, transferring highly detailed textures from a high-resolution (HR) reference. However, most previous Ref-SR methods [39, 37] can mainly handle relatively small scaling factors (e.g., 4×). To achieve visually pleasing results with larger scaling factors, we first build a more compact network (see Fig. 2) and then apply novel loss functions to the output.

3.1 Pipeline

We first define the number of levels according to the scaling factor. Inspired by SRNTT [37], we conduct texture swapping in the feature space to transfer highly detailed textures to the output (Fig. 2). The feature upscaler acts as the main stream for upscaling the input LR image. Meanwhile, the reference feature, which carries richer texture, is extracted by the deep feature extractor. At the finest level (largest scale), the reference feature is transferred to the output.

Figure 2: The pipeline of our proposed method.

As demonstrated in recent works [21, 35], the batch normalization (BN) layers commonly used in deep models to stabilize training tend to degrade SR performance. Therefore, we avoid BN layers in our feature upscaling model. More importantly, this largely reduces GPU memory usage, as a BN layer consumes a similar amount of GPU memory as a convolutional layer.


To efficiently transfer high-frequency information from the reference, we swap features at the finest level, where the reference features are swapped according to the local feature matching between the input and the reference. Since patch matching is time-consuming, it is conducted at a lower level (with smaller spatial size), where we obtain the feature matching information.

The matching operates on features from a neural feature extractor (e.g., VGG19 [26]) at that lower level. The LR input is first upscaled by bicubic interpolation with the scaling factor. To match the frequency band of the upscaled input, we downscale the reference and then upscale it with the same scaling factor. For each feature patch of the upscaled input, we find its best-matched patch in the resampled reference, i.e., the one with the highest similarity.
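The patch matching step can be illustrated with a minimal sketch. It assumes single-channel feature maps stored as nested lists and uses cosine similarity (normalized inner product) as the similarity measure; both are simplifications chosen for illustration, and the helper names `extract_patches`, `unit`, and `match_patches` are hypothetical, not taken from the paper.

```python
import math

def extract_patches(feat, size=3):
    """Return every size x size patch of a 2-D feature map as a flat list."""
    h, w = len(feat), len(feat[0])
    return [
        [feat[i + di][j + dj] for di in range(size) for dj in range(size)]
        for i in range(h - size + 1)
        for j in range(w - size + 1)
    ]

def unit(patch):
    """Scale a patch to unit length (all-zero patches are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in patch)) or 1.0
    return [v / norm for v in patch]

def match_patches(input_feat, ref_feat, size=3):
    """For each input patch, return the index of the most similar reference patch.

    Similarity is the cosine similarity between flattened patches (an assumption).
    """
    ref_patches = [unit(p) for p in extract_patches(ref_feat, size)]
    matches = []
    for patch in extract_patches(input_feat, size):
        q = unit(patch)
        scores = [sum(a * b for a, b in zip(q, r)) for r in ref_patches]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches

# Toy example: the query is the bottom-right 3x3 crop of the reference.
ref = [[1, 0, 0, 2], [0, 3, 0, 0], [0, 0, 1, 5], [4, 0, 0, 1]]
query = [row[1:4] for row in ref[1:4]]
matches = match_patches(query, ref)
```

Because the cost grows with the product of the number of input and reference patches, running this search on a smaller feature map, as the text describes, reduces the work considerably.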

Then, using the matching information, we transfer features at the finest level and obtain the new, transferred feature. The feature transfer operation takes as input the neural features extracted from the high-resolution reference at that level.

On the other hand, we also extract deep features from the LR input and upscale them with the scaling factor, yielding the upscaled input feature. To introduce the transferred feature into the image hallucination, we fuse the transferred feature and the upscaled input feature by using residual learning, and finally reconstruct the output. In this process, the two features are concatenated channel-wise, passed through several residual blocks, and fed to a reconstruction layer.
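The matching, transfer, and fusion steps described in this subsection can be summarized in equation form. The following is only a sketch in assumed notation: none of the symbols are taken from the source, level indices are omitted, and the placement of the residual connection is one plausible reading of "residual learning".

```latex
\begin{align}
M      &= H_{M}\big(\phi(I_{LR\uparrow}),\ \phi(I_{Ref\downarrow\uparrow})\big), \\
F_{T}  &= H_{T}\big(\phi(I_{Ref}),\ M\big), \\
F_{SR} &= H_{FSR}(I_{LR}), \\
I_{SR} &= H_{Rec}\big(H_{Res}([F_{SR}, F_{T}]) + F_{SR}\big).
\end{align}
```

Here $I_{LR\uparrow}$ stands for the bicubically upscaled input, $I_{Ref\downarrow\uparrow}$ for the reference after downscaling and upscaling, $\phi$ for the neural feature extractor (applied at the coarse level for matching and at the finest level for transfer), $M$ for the matching information, $F_{T}$ for the transferred feature, $F_{SR}$ for the upscaled input feature, $[\cdot,\cdot]$ for channel-wise concatenation, $H_{Res}$ for the residual blocks, and $H_{Rec}$ for the reconstruction layer.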

Although we can already achieve super-resolved results with larger scaling factors by using the above simplifications and improvements, we still aim to transfer highly-detailed texture from reference even in such challenging cases (i.e., very large scaling factors). For that, we propose wavelet texture and degradation losses.

3.2 Wavelet Texture Loss

Motivation. Textures are mainly composed of high-frequency components, and LR images contain fewer high-frequency components as the scaling factor grows. If we apply the loss functions (including the texture loss) in the color image space, it remains hard to recover or transfer these high-frequency components. However, if we pay more attention to the high-frequency components and relax the reconstruction in the color image space, this issue can be alleviated. More specifically, we aim to transfer as many textures as possible from the reference by applying the texture loss on the high-frequency components. The wavelet transform is a suitable way to decompose a signal into bands with different frequency content.

Haar wavelet. Inspired by WCT [31], where a wavelet-corrected transfer was proposed, we first apply the Haar wavelet to obtain different components. The Haar wavelet transformation has four kernels, $\{LL^{T}, LH^{T}, HL^{T}, HH^{T}\}$, where $L$ and $H$ denote the low- and high-pass filters,

$$L^{T} = \frac{1}{\sqrt{2}}\begin{bmatrix}1 & 1\end{bmatrix}, \qquad H^{T} = \frac{1}{\sqrt{2}}\begin{bmatrix}-1 & 1\end{bmatrix}.$$

As a result, such a wavelet operation splits the signal into four channels, capturing low-frequency and high-frequency components. We denote the extraction operations for these four channels by their kernels: LL, LH, HL, and HH, respectively. We then pay more attention to the recovery of high-frequency components through the wavelet texture loss.
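As an illustration of the four-channel split, here is a minimal one-level 2-D Haar decomposition in pure Python. It assumes the standard Haar filters L = [1, 1]/√2 and H = [−1, 1]/√2 applied with stride 2, so each 2×2 block of the input yields one coefficient per band. The function name `haar_decompose` and the band naming convention (first letter for the vertical filter, second for the horizontal one) are assumptions made for this sketch.

```python
def haar_decompose(img):
    """One-level 2-D Haar transform of an image with even height and width.

    Returns a dict with four half-resolution bands: LL, LH, HL, HH.
    Each band applies the outer product of two 1-D Haar filters to
    non-overlapping 2x2 blocks [[a, b], [c, d]].
    """
    h, w = len(img), len(img[0])
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for i in range(0, h, 2):
        rows = {name: [] for name in bands}
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            rows["LL"].append((a + b + c + d) / 2)    # local average (smooth content)
            rows["LH"].append((-a + b - c + d) / 2)   # change along the horizontal axis
            rows["HL"].append((-a - b + c + d) / 2)   # change along the vertical axis
            rows["HH"].append((a - b - c + d) / 2)    # diagonal detail
        for name in bands:
            bands[name].append(rows[name])
    return bands

flat = haar_decompose([[3, 3], [3, 3]])          # constant patch
checker = haar_decompose([[1, 0], [0, 1]])       # diagonal pattern
vertical_edge = haar_decompose([[0, 1], [0, 1]]) # intensity changes left to right
```

A constant patch ends up entirely in LL, whereas the diagonal pattern puts energy into HH, which is why a loss restricted to the high-frequency bands emphasizes fine texture rather than smooth color regions.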

Wavelet texture loss. As investigated in WCT [31], in the Haar wavelet the low-pass filter extracts smooth surfaces and parts of the texture, while the high-pass filters capture higher-frequency components (e.g., horizontal, vertical, and diagonal edge-like textures).

Ideally, it would be wise to apply the texture loss on each channel split by the Haar wavelet. However, as we calculate the texture loss at different scales, such a choice would incur very heavy GPU memory usage and running time. Moreover, since it is very difficult for the network to transfer highly detailed texture at very large scaling factors, focusing on the reconstruction of the most desired parts is a better choice. Consequently, we propose a wavelet texture loss computed with a single high-frequency kernel.

In this loss, the high-frequency component is extracted from the upscaled output with that kernel and compared with the transferred feature in the same feature map space. The comparison is made between Gram matrices computed at each level, scaled by a level-specific normalization weight, and the difference is measured with the Frobenius norm.
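The Gram-matrix comparison at the heart of a texture loss can be sketched as follows. This is an illustrative example rather than the paper's implementation: it assumes the feature channels passed in have already been restricted to a high-frequency band, represents each channel as a flat list of values, and uses a hypothetical `weight` argument for the normalization factor.

```python
import math

def gram_matrix(features):
    """features: list of C channels, each a flat list of N values.

    Returns the C x C matrix of inner products between channels.
    """
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]

def frobenius_distance(g1, g2):
    """Frobenius norm of the difference between two equally sized matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for r1, r2 in zip(g1, g2)
                         for x, y in zip(r1, r2)))

def texture_loss(feat_out, feat_ref, weight=1.0):
    """Weighted distance between the Gram matrices of two feature sets."""
    return weight * frobenius_distance(gram_matrix(feat_out), gram_matrix(feat_ref))

# Two channels with two spatial positions each.
loss_same = texture_loss([[1, 2], [3, 4]], [[1, 2], [3, 4]])
loss_shuffled = texture_loss([[1, 2], [3, 4]], [[2, 1], [4, 3]])  # positions swapped
loss_diff = texture_loss([[1, 0], [0, 1]], [[1, 0], [0, 2]])
```

The Gram matrix only records correlations between channels, so swapping spatial positions consistently across channels leaves the loss at zero. This is what makes it a texture statistic rather than a pixel-wise reconstruction term.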

As shown in Eq. (5), we mainly focus on the texture reconstruction of higher-frequency components, which transfers more textures with a somewhat creative ability. We then further relax the reconstruction constraint by proposing a degradation loss.

Figure 3: Illustration of our proposed degradation loss. We minimize the degradation loss between the downscaled output and the original input.

3.3 Degradation Loss

Motivation. Most previous single image SR methods (e.g., RCAN [35]) mainly concentrate on minimizing the loss between the upscaled image and the ground truth. For small scaling factors (e.g., 2×), those methods achieve excellent results with very high PSNR values. However, when the scaling factor becomes very large, their results suffer from heavy smoothing artifacts and lack favorable textures (see Fig. 1). On the other hand, as we try to transfer as many textures to the results as possible, emphasizing the overall reconstruction of the upscaled image may also smooth some transferred textures. To alleviate such texture over-smoothing artifacts, we additionally introduce the LR input into network optimization.

Degradation loss. Compared with image SR, where favorable results are hard to obtain, image downscaling is relatively easy. It is possible to learn a degradation network that maps the HR image to an LR one. We train such a network by using the HR ground truth as input and minimizing the loss between its output and the original LR counterpart.

With the degradation network, we are able to mimic the degradation process from HR to LR, which can be a many-to-one mapping. Namely, there exist many upscaled images corresponding to the same original LR image, which helps to relax the constraints on the reconstruction. To make use of this property, we narrow the gap between the downscaled output and the original LR input. As shown in Fig. 3, we formulate this as a degradation loss.

Here, the downscaled image is produced from the SR output by the degradation network at the given scaling factor, and the gap to the LR input is measured with a norm of their difference. With the proposed loss functions in place, we next give details about the implementation.
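The many-to-one property that motivates the degradation loss can be seen in a small sketch. Note the assumptions: the paper learns a degradation network, whereas this example substitutes fixed average pooling as a stand-in downscaler, and it uses the mean absolute difference as the norm. The function names are hypothetical.

```python
def average_pool_downscale(img, factor):
    """Downscale a 2-D image by averaging non-overlapping factor x factor blocks.

    Stand-in for a learned degradation network (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    return [
        [sum(img[i + di][j + dj] for di in range(factor) for dj in range(factor))
         / factor ** 2
         for j in range(0, w, factor)]
        for i in range(0, h, factor)
    ]

def degradation_loss(sr, lr, factor):
    """Mean absolute difference between the downscaled SR output and the LR input."""
    down = average_pool_downscale(sr, factor)
    diffs = [abs(a - b) for rd, rl in zip(down, lr) for a, b in zip(rd, rl)]
    return sum(diffs) / len(diffs)

lr = [[1.0]]
sr_smooth = [[1, 1], [1, 1]]     # flat output
sr_textured = [[0, 2], [2, 0]]   # textured output with the same local mean
sr_off = [[2, 2], [2, 2]]        # output inconsistent with the LR input

loss_smooth = degradation_loss(sr_smooth, lr, 2)
loss_textured = degradation_loss(sr_textured, lr, 2)
loss_off = degradation_loss(sr_off, lr, 2)
```

Both the flat and the textured outputs are consistent with the LR input and incur no penalty, while the inconsistent output is penalized. A pixel-wise reconstruction loss against a single ground truth would instead penalize any texture that differs from it.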

3.4 Implementation Details

Loss functions. We also adopt three other common loss functions [15, 20, 24, 37]: reconstruction, perceptual, and adversarial losses. We briefly introduce them as follows.

The perceptual loss compares feature maps extracted from the 1st convolutional layer before the 5th max-pooling layer of the VGG-19 [26] network, feature map by feature map.

We also adopt WGAN-GP [12] for adversarial training [11], formulated as a min-max game between a generator and a discriminator, in which generated samples follow the model distribution and real samples follow the data distribution. For simplicity, we mainly focus here on the adversarial loss for the generator.
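For reference, the standard WGAN-GP objective from [12] and the corresponding generator loss can be written as below. The notation is that of the WGAN-GP literature and is assumed here, since the symbols are not preserved in this text: $G$ and $D$ are the generator and discriminator, $\tilde{x} = G(z)$ is a generated sample, $\mathbb{P}_{r}$ and $\mathbb{P}_{g}$ are the data and model distributions, and $\mathcal{D}$ is the set of 1-Lipschitz functions.

```latex
\begin{align}
\min_{G}\max_{D \in \mathcal{D}}\ &
  \mathbb{E}_{x \sim \mathbb{P}_{r}}\big[D(x)\big]
  - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_{g}}\big[D(\tilde{x})\big],
  \qquad \tilde{x} = G(z), \\
\mathcal{L}_{adv} &= -\,\mathbb{E}_{\tilde{x} \sim \mathbb{P}_{g}}\big[D(\tilde{x})\big].
\end{align}
```

In WGAN-GP, the Lipschitz constraint on $D$ is encouraged by adding a gradient penalty $\lambda\,\mathbb{E}_{\hat{x}}\big[(\lVert\nabla_{\hat{x}} D(\hat{x})\rVert_{2} - 1)^{2}\big]$ to the discriminator loss, where $\hat{x}$ is sampled along lines between real and generated samples.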


Training. The weights for the five loss terms are 1, 10, 1, 10, and 10, respectively. To stabilize the training process, we first pre-train the network for 2 epochs. Then, all the losses are applied to train for another 20 epochs. We implement our model with TensorFlow and apply the Adam optimizer [18] with learning rate 10.

Figure 4: Visual results of RCAN [35], SRNTT [37], and our method on CUFED5. Our result is visually more pleasing than the others and contains plausible texture details. Panels: (a) LR, (b) Ours, (d) RCAN.
Figure 5: Examples from our collected PaintHD dataset.

4 Dataset

For large upscaling factors, e.g., 8× and 16×, small input images (e.g., 30×30) whose original HR counterparts have rich texture significantly increase the arbitrariness/smoothness of texture recovery, because fewer pixels impose looser content constraints on it. Existing datasets for Ref-SR are unsuitable for such large upscaling factors (see Fig. 4). Therefore, we collect a new dataset of high-resolution painting images that carry rich and diverse stroke and canvas textures.

The new dataset, named PaintHD, is sourced from the Google Art Project [4], which is a collection of very large zoomable images. In total, we collected over 13,600 images, some of which reach gigapixel resolution. Intuitively, an image with more pixels does not necessarily present finer texture, since the physical size of the corresponding painting may be large as well. To measure the richness of texture, we therefore use the physical size of each painting to calculate the pixels per inch (PPI) of its image. Finally, we construct a training set of 2,000 images and a testing set of 100 images with relatively higher PPI. Fig. 5 shows some examples from PaintHD, which contains abundant textures.
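To make the PPI criterion concrete, the sketch below ranks two made-up records by pixel density. The exact PPI formula used for PaintHD is not given in the text; averaging the horizontal and vertical densities is an assumption, and the record fields and function names are hypothetical.

```python
def pixels_per_inch(width_px, height_px, width_in, height_in):
    """Average of horizontal and vertical pixel densities (assumed definition)."""
    return 0.5 * (width_px / width_in + height_px / height_in)

def rank_by_ppi(paintings):
    """Sort painting records from highest to lowest PPI."""
    return sorted(
        paintings,
        key=lambda p: pixels_per_inch(
            p["width_px"], p["height_px"], p["width_in"], p["height_in"]
        ),
        reverse=True,
    )

# Illustrative records only, not entries from PaintHD.
paintings = [
    {"name": "large_canvas", "width_px": 8000, "height_px": 6000,
     "width_in": 200.0, "height_in": 150.0},
    {"name": "small_panel", "width_px": 3000, "height_px": 2000,
     "width_in": 30.0, "height_in": 20.0},
]
ranked = rank_by_ppi(paintings)
```

Although the first record has eight times as many pixels, the second captures each inch of canvas with 100 pixels instead of 40, so it ranks higher on this measure of texture richness.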

To further evaluate the generalization capacity of the proposed method, we also test on the CUFED5 [37] dataset, which is designed specifically for Ref-SR validation. It contains 126 groups of samples. Each group consists of one HR image and four references with different levels of similarity to the HR image. For simplicity, we adopt the most similar reference for each HR image to construct the testing pairs. The images in CUFED5 are of much lower resolution, e.g., 500×300, than those in the proposed PaintHD dataset.

5 Experimental Results

We conduct extensive experiments to validate the contributions of each component in our method. We demonstrate the effectiveness of our method by comparing with other state-of-the-art methods quantitatively and visually. We also provide more details, results, and analyses in the supplementary material (http://yulunzhang.com/papers/PaintingSR_supp_arXiv.pdf).

5.1 Ablation Study

5.1.1 Effect of Wavelet Texture Loss

The wavelet texture loss is imposed on the high-frequency band of the feature maps, rather than applied directly to raw features as in SRNTT [37] and traditional style transfer [10]. A comparison between the wavelet texture loss and the traditional texture loss is illustrated in Fig. 6. To highlight the difference, the weights on the texture losses during training are increased by 100 times compared to the default setting in Section 3.4. Comparing Figs. 6(c) and 6(d), the result without wavelet is significantly affected by the texture/color of the reference (Fig. 6(b)) and loses its identity with respect to the input content. By contrast, the result with wavelet still preserves texture and color similar to the ground truth (Fig. 6(a)).

Figure 6: Comparison of super-resolved results with and without wavelet. Panels: (a) HR, (b) Reference, (c) w/o Wavelet, (d) w/ Wavelet.

5.1.2 Effect of Degradation Loss

To demonstrate the effectiveness of our proposed degradation loss, we use SRResNet [20] as the backbone. The SRResNet baseline is trained using only the reconstruction loss. Following the scheme in Fig. 3, we further introduce the degradation network and the degradation loss to train another SRResNet. We evaluate both models on four standard benchmark datasets: Set5 [1], Set14 [33], B100 [23], and Urban100 [14], and report PSNR (dB) and SSIM [28] values in Tab. 1. SRResNet trained with the degradation loss obtains higher PSNR and SSIM values. These observations not only demonstrate the effectiveness of the degradation loss, but are also consistent with our analyses in Section 3.3.

5.2 Quantitative Results

We compare our method with state-of-the-art SISR and Ref-SR methods, using the following public implementations:
EDSR: https://github.com/thstkdgus35/EDSR-PyTorch
RCAN: https://github.com/yulunzhang/RCAN
SRGAN: https://github.com/tensorlayer/srgan
SRNTT: https://github.com/ZZUTK/SRNTT

The SISR methods are EDSR [22], RCAN [35], and SRGAN [20], where RCAN [35] achieved state-of-the-art performance in terms of PSNR (dB). Due to limited space, we only include the state-of-the-art Ref-SR method SRNTT [37] for comparison. However, most of those methods were not originally designed for very large scaling factors. To make them suitable for 8× and 16× SR, we adopt them with some modifications. In the 8× case, we use RCAN [35] to first upscale the input by 2×. The upscaled intermediate result is then the input to EDSR, SRGAN, and SRNTT, which upscale it by a further 4×. Analogously, in the 16× case, we use RCAN to first upscale by 4×. The intermediate result is then fed into RCAN, EDSR, SRGAN, and SRNTT, which further upscale it by 4×. For our method, we directly upscale the input by 8× or 16× in these two cases. We show quantitative results in Tab. 2, where we make some interesting and thought-provoking observations.

Data    Set5          Set14         B100          Urban100
w/o     32.05/0.091   28.49/0.780   27.57/0.735   26.07/0.784
w/      32.24/0.896   28.65/0.783   27.62/0.738   26.13/0.787

Table 1: Investigation of the degradation loss. We show PSNR (dB)/SSIM values without (w/o) and with (w/) the degradation loss.
Data      CUFED5 (8×)   CUFED5 (16×)  PaintHD (8×)  PaintHD (16×)
Bicubic   21.63/0.572   19.75/0.509   23.73/0.432   22.33/0.384
EDSR      23.02/0.653   20.70/0.548   24.42/0.477   22.90/0.405
RCAN      23.37/0.666   20.71/0.548   24.43/0.478   22.91/0.406
SRGAN     22.93/0.642   20.54/0.537   24.21/0.466   22.75/0.396
SRNTT     21.96/0.594   20.16/0.507   23.21/0.401   22.19/0.350
Ours      20.36/0.541   18.51/0.442   22.49/0.361   20.69/0.259
Ours-     22.40/0.635   19.71/0.526   24.02/0.461   22.13/0.375

Table 2: Quantitative results (PSNR/SSIM) of different SR methods for 8× and 16× on two datasets: CUFED5 [37] and our collected PaintHD. The methods are grouped into two categories: SISR (top group) and Ref-SR (bottom). We highlight the best results for each case. ‘Ours-’ denotes our method trained using only the reconstruction loss.
Figure 7: Visual results (8×) of different SR methods on PaintHD. The first column shows LR inputs and references.

First, SISR methods obtain higher PSNR and SSIM values than Ref-SR methods. This is reasonable because SISR methods mainly aim to minimize MSE, which favors higher PSNR values. However, when the scaling factor becomes larger, the gap among SISR methods also becomes very small. This means it is difficult to distinguish the performance of different SISR methods by considering PSNR/SSIM alone.
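The link between MSE and PSNR follows directly from the definition PSNR = 10 · log10(peak² / MSE), as the short sketch below shows. It assumes 8-bit images (peak value 255) stored as nested lists; the function name `psnr` is chosen for illustration.

```python
import math

def psnr(img_a, img_b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    sq_diffs = [(x - y) ** 2
                for row_a, row_b in zip(img_a, img_b)
                for x, y in zip(row_a, row_b)]
    mse = sum(sq_diffs) / len(sq_diffs)
    # Identical images have zero error, i.e., unbounded PSNR.
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

identical = psnr([[10, 20]], [[10, 20]])
off_by_one = psnr([[10, 20]], [[11, 21]])      # MSE = 1
worst_case = psnr([[0, 0]], [[255, 255]])      # MSE = peak^2
```

Since PSNR is a monotonically decreasing function of MSE, a network trained to minimize MSE is optimizing PSNR directly, whereas sharp textures that deviate slightly from the ground truth are penalized even when they look more realistic.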

Another interesting observation is that, within the Ref-SR group, SRNTT [37] achieves higher PSNR/SSIM values than our method. This is predictable, as RCAN [35], which has achieved top PSNR performance, performs the first upscaling stage for SRNTT. When we train our network with the reconstruction loss only, denoted as ‘Ours-’, we achieve PSNR values comparable to SRNTT. Moreover, ‘Ours-’ achieves a significant SSIM improvement, which means our method transfers more structural information.

Based on the observations and analyses above, we conclude that SR methods should be evaluated in more perceptually meaningful ways, instead of depending only on PSNR/SSIM values. Consequently, we conduct further comparisons through visual results and a user study.

Figure 8: Visual results (16×) of different SR methods on PaintHD. The first column shows LR inputs and references.

5.3 Visual Comparisons

As PaintHD contains very high-resolution images with abundant textures, showing zoomed-in image patches is a practical way to compare methods. To keep the details of the high-resolution patches visible, we cannot show patches from too many methods. As a result, we only show visual comparisons with the state-of-the-art SISR and Ref-SR methods RCAN [35] and SRNTT [37].

We show visual comparisons in Figs. 7 and 8 for the 8× and 16× cases, respectively. In the 8× case, RCAN can handle the task to some degree, because the LR input retains abundant details for reconstruction. However, RCAN still suffers from some blurring artifacts due to its PSNR-oriented loss function (e.g., an ℓ1-norm loss). By transferring textures from the reference and using other loss functions (e.g., texture, perceptual, and adversarial losses), SRNTT [37] performs visually better than RCAN, but it still can hardly transfer more detailed textures. In contrast, our method clearly reduces the blurring artifacts and transfers more vivid textures.

When the case becomes more challenging, namely 16×, both RCAN and SRNTT generate obvious over-smoothing artifacts (see Fig. 8). For RCAN, the main reason is that it aims to narrow the pixel-wise difference between the output and the ground truth. Such an optimization encourages low-frequency components but restrains the generation of high-frequency ones. SRNTT was originally designed for 4× upscaling, which restricts its ability at larger scaling factors, and it does not pay particular attention to the recovery of high-frequency components. In contrast, our method first establishes a trainable network directly from the original LR input to the target HR output. Moreover, we focus more on the reconstruction of high-frequency components with the wavelet texture loss. We further relax the constraint between the output and the ground truth by proposing the degradation loss. As a result, our method alleviates the over-smoothing artifacts to some degree and recovers more detailed textures (see Fig. 8).

5.4 User Study

Since the traditional metrics PSNR and SSIM are not consistent with visual quality [15, 20, 24, 37], we conducted a user study following the setting of SRNTT [37] to compare our results with those from other methods, i.e., SRNTT [37], SRGAN [20], RCAN [35], and EDSR [22]. EDSR and RCAN achieve state-of-the-art performance in terms of PSNR/SSIM, while SRGAN (SISR) and SRNTT (Ref-SR) focus more on visual quality. All methods are tested on a random subset of CUFED5 and PaintHD at upscaling factors of 8× and 16×. In each query, the user is asked to select the visually better one of two side-by-side images super-resolved from the same LR input, i.e., one from our method and the other from another method. In total, we collected 3,480 votes, and the results are shown in Fig. 9. The height of a bar indicates the percentage of users who favor our results over those from the method denoted under the bar.

In general, our results achieve better visual quality at both upscaling factors, and our relative advantage over the other methods is larger at 16×. The main reason lies in the texture transfer from references. As the upscaling factor increases, more details are lost in the LR input, and these are difficult to recover solely with the deep model. Thus, external high-frequency information becomes more important to texture recovery, which explains the gap of 5% to 10% between the results at 8× and 16×.

Figure 9: User study on the results of SRNTT, SRGAN, RCAN, EDSR, and ours on the PaintHD and CUFED5 datasets. The bar corresponding to each method indicates the percentage of votes favoring ours over that method. The two colors indicate the different upscaling factors, i.e., 8× and 16×.

6 Conclusion

We aim to hallucinate painting images with very large upscaling factors and transfer detailed high-resolution (HR) textures from HR reference images. This task is very challenging. Popular single image super-resolution (SISR) methods can hardly transfer textures from reference images. Reference-based SR (Ref-SR) methods, on the other hand, can transfer textures to some degree, but can hardly handle very large scaling factors. We address this problem by first constructing an efficient Ref-SR network that is suitable for very large scaling factors. To transfer more detailed textures, we propose a wavelet texture loss that focuses on high-frequency components. To alleviate the potential over-smoothing artifacts caused by the reconstruction constraint, we further relax it by proposing a degradation loss. We collect a high-quality painting dataset, PaintHD, on which we conduct extensive experiments and compare with other state-of-the-art methods. We achieve significant improvements over both SISR and Ref-SR methods. We believe such a Ref-SR network holds promise for general natural images as well.


  • [1] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, 2012.
  • [2] Vivek Boominathan, Kaushik Mitra, and Ashok Veeraraghavan. Improving resolution and depth-of-field of light field cameras using a hybrid imaging system. In ICCP, 2014.
  • [3] Hong Chang, Dit-Yan Yeung, and Yimin Xiong. Super-resolution through neighbor embedding. In CVPR, 2004.
  • [4] Wikimedia Commons. Google art project. https://commons.wikimedia.org/wiki/Category:Google_Art_Project, 2018.
  • [5] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In ECCV, 2014.
  • [6] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In ECCV, 2016.
  • [7] Gilad Freedman and Raanan Fattal. Image and video upscaling from local self-examples. TOG, 2011.
  • [8] William T Freeman, Thouis R Jones, and Egon C Pasztor. Example-based super-resolution. IEEE Computer Graphics and Applications, (2):56–65, 2002.
  • [9] William T Freeman, Egon C Pasztor, and Owen T Carmichael. Learning low-level vision. IJCV, 2000.
  • [10] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
  • [11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [12] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In NeurIPS, 2017.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [14] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, 2015.
  • [15] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
  • [16] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
  • [17] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In CVPR, 2016.
  • [18] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [19] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep Laplacian pyramid networks for fast and accurate super-resolution. In CVPR, 2017.
  • [20] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • [21] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In AISTATS, 2015.
  • [22] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In CVPRW, 2017.
  • [23] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
  • [24] Mehdi SM Sajjadi, Bernhard Schölkopf, and Michael Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In ICCV, 2017.
  • [25] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
  • [26] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [27] Libin Sun and James Hays. Super-resolution from internet-scale scene matching. In ICCP, 2012.
  • [28] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.
  • [29] Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, and Thomas Huang. Deep networks for image super-resolution with sparse prior. In ICCV, 2015.
  • [30] Wenhan Yang, Sifeng Xia, Jiaying Liu, and Zongming Guo. Reference-guided deep super-resolution via manifold localized external compensation. TCSVT, 2018.
  • [31] Jaejun Yoo, Youngjung Uh, Sanghyuk Chun, Byeongkyu Kang, and Jung-Woo Ha. Photorealistic style transfer via wavelet transforms. In ICCV, 2019.
  • [32] Huanjing Yue, Xiaoyan Sun, Jingyu Yang, and Feng Wu. Landmark image super-resolution by retrieving web images. TIP, 2013.
  • [33] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In Proc. 7th Int. Conf. Curves Surf., 2010.
  • [34] He Zhang and Vishal M Patel. Densely connected pyramid dehazing network. In CVPR, 2018.
  • [35] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In ECCV, 2018.
  • [36] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In CVPR, 2018.
  • [37] Zhifei Zhang, Zhaowen Wang, Zhe Lin, and Hairong Qi. Image super-resolution by neural texture transfer. In CVPR, 2019.
  • [38] Haitian Zheng, Minghao Guo, Haoqian Wang, Yebin Liu, and Lu Fang. Combining exemplar-based approach and learning-based approach for light field super-resolution using a hybrid imaging system. In ICCV, 2017.
  • [39] Haitian Zheng, Mengqi Ji, Haoqian Wang, Yebin Liu, and Lu Fang. CrossNet: An end-to-end reference-based super resolution network using cross-scale warping. In ECCV, 2018.