PyTorch version of the paper 'Enhanced Deep Residual Networks for Single Image Super-Resolution' (CVPRW 2017)
The feed-forward architectures of recently proposed deep super-resolution networks learn representations of low-resolution inputs, and the non-linear mapping from those to high-resolution output. However, this approach does not fully address the mutual dependencies of low- and high-resolution images. We propose Deep Back-Projection Networks (DBPN) that exploit iterative up- and down-sampling layers, providing an error feedback mechanism for projection errors at each stage. We construct mutually-connected up- and down-sampling stages, each of which represents different types of image degradation and high-resolution components. We show that extending this idea to allow concatenation of features across up- and down-sampling stages (Dense DBPN) allows us to further improve super-resolution, yielding superior results and in particular establishing new state-of-the-art results for large scaling factors such as 8x across multiple data sets.
Significant progress in deep learning for vision [15, 13, 5, 40, 27, 34, 17] has recently been propagating to the field of super-resolution (SR) [20, 30, 6, 12, 21, 22, 25, 43].
Single image SR is an ill-posed inverse problem where the aim is to recover a high-resolution (HR) image from a low-resolution (LR) image. A currently typical approach is to construct an HR image by learning a non-linear LR-to-HR mapping, implemented as a deep neural network [6, 7, 38, 25, 22, 23, 43]. These networks compute a sequence of feature maps from the LR image, culminating with one or more upsampling layers to increase resolution and finally construct the HR image. In contrast to this purely feed-forward approach, the human visual system is believed to use feedback connections to guide the task toward the relevant results [9, 24, 26]. Perhaps hampered by the lack of such feedback, current SR networks with only feed-forward connections have difficulty representing the LR-to-HR relation, especially for large scaling factors.
On the other hand, feedback connections were used effectively by one of the early SR algorithms, iterative back-projection. It iteratively computes the reconstruction error and then fuses it back to tune the HR image intensity. Although it has been proven to improve image quality, the result still suffers from ringing and chessboard effects. Moreover, this method is sensitive to the choice of parameters such as the number of iterations and the blur operator, leading to variability in results.
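As a rough illustration of the classical procedure described above, the following sketch runs iterative back-projection on a 1-D signal. The block-average downsampler, the sample-repeating upsampler, the zero initialization, and the `step` parameter are simplifying assumptions chosen for readability; they stand in for the blur and back-projection operators of a real implementation.

```python
def downsample(x, s):
    """Average each block of s samples (assumed stand-in for blur + decimation)."""
    return [sum(x[i:i + s]) / s for i in range(0, len(x), s)]


def upsample(x, s):
    """Repeat each sample s times (assumed stand-in for the back-projection kernel)."""
    return [v for v in x for _ in range(s)]


def iterative_back_projection(lr, s, num_iters=5, step=0.5):
    """Refine an HR estimate so that its downsampled version approaches lr."""
    hr = [0.0] * (len(lr) * s)  # assumed initial HR guess: all zeros
    for _ in range(num_iters):
        simulated_lr = downsample(hr, s)                    # project HR guess to LR
        error = [o - m for o, m in zip(lr, simulated_lr)]   # reconstruction error
        correction = upsample(error, s)                     # back-project the error
        hr = [h + step * c for h, c in zip(hr, correction)]
    return hr


if __name__ == "__main__":
    estimate = iterative_back_projection([1.0, 2.0, 4.0], s=2, num_iters=3)
    print(downsample(estimate, 2))
```

With these operators the remaining LR error shrinks by the factor `1 - step` per iteration, which also shows why the result depends on hand-picked parameters such as the number of iterations.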
Inspired by iterative back-projection, we construct an end-to-end trainable architecture based on the idea of iterative up- and down-sampling: Deep Back-Projection Networks (DBPN). Our networks successfully handle large scaling factors, as shown in Fig. 1. Our work provides the following contributions:
(1) Error feedback. We propose an iterative error-correcting feedback mechanism for SR, which calculates both up- and down-projection errors to guide the reconstruction toward better results. Here, the projection errors are used to characterize or constrain the features in early layers. A detailed explanation is given in Section 3.
(2) Mutually connected up- and down-sampling stages. Feed-forward architectures, which can be considered a one-way mapping, only map rich representations of the input to the output space. This approach fails to map LR to HR images well, especially for large scaling factors, due to the limited features available in the LR space. Therefore, our networks focus not only on generating variants of the HR features using upsampling layers but also on projecting them back to the LR space using downsampling layers. This connection is shown in Fig. 2 (d), alternating between up- (blue box) and down-sampling (gold box) stages, which represent the mutual relation of LR and HR images.
(3) Deep concatenation. Our networks represent different types of image degradation and HR components. This ability enables the networks to reconstruct the HR image using deep concatenation of the HR feature maps from all of the up-sampling steps. Unlike other networks, our reconstruction directly utilizes different types of LR-to-HR features without propagating them through the sampling layers as shown by the red arrow in Fig. 2 (d).
(4) Improvement with dense connection. We improve the accuracy of our network by densely connecting each up- and down-sampling stage to encourage feature reuse.
Deep network SR methods can be primarily divided into four types, as shown in Fig. 2.
(a) Predefined upsampling commonly uses interpolation as the upsampling operator to produce a middle-resolution (MR) image. This schema was first proposed by SRCNN to learn an MR-to-HR non-linear mapping with simple convolutional layers. Later, improved networks exploited residual learning [22, 43] and recursive layers. However, this approach might introduce new noise from the MR image.
(b) Single upsampling offers a simple yet effective way to increase the spatial resolution. This approach was proposed by FSRCNN and ESPCN. These methods have been proven effective at increasing the spatial resolution and replacing predefined operators. However, they fail to learn complicated mappings due to the limited capacity of the networks. EDSR, the winner of NTIRE2017, belongs to this type. However, it requires a large number of filters in each layer and a lengthy training time, around eight days as stated by the authors. These problems open the opportunity to propose lighter networks that can preserve HR components better.
(c) Progressive upsampling was recently proposed in LapSRN. It progressively reconstructs multiple SR images at different scales in one feed-forward network. For the sake of simplification, we can say that this network is a stack of single upsampling networks, which relies only on limited LR features. Due to this fact, LapSRN is outperformed even by our shallow networks, especially for large scaling factors, as shown in the experimental results.
(d) Iterative up and downsampling is proposed by our networks. We focus on increasing the sampling rate of SR features in different depths and distribute the tasks to calculate the reconstruction error to each stage. This schema enables the networks to preserve the HR components by learning various up- and down-sampling operators while generating deeper features.
Rather than learning a non-linear mapping from input to target space in one step, feedback networks decompose the prediction process into multiple steps, which allows the model to have a self-correcting procedure. Feedback procedures have been implemented in various computing tasks [3, 35, 47, 29, 49, 39, 32].
In the context of human pose estimation, Carreira et al. proposed an iterative error feedback by iteratively estimating and applying a correction to the current estimation. PredNet is an unsupervised recurrent network that predictively codes future frames by recursively feeding the predictions back into the model. For image segmentation, Li et al. learn implicit shape priors and use them to improve the prediction. However, to our knowledge, feedback procedures have not been applied to SR.
Previous work introduced perceptual losses based on high-level features extracted from pre-trained networks. Ledig et al. proposed SRGAN, which is considered a single upsampling method. It proposed the natural image manifold that is able to create photo-realistic images by specifically formulating a loss function based on the Euclidean distance between feature maps extracted from VGG19 and SRResNet.
Our networks can be extended with an adversarial loss, serving as the generator network. However, we optimize our network using only an objective function such as the mean squared error (MSE). Therefore, instead of training DBPN with the adversarial loss, we compare DBPN with SRResNet, which is also optimized by MSE.
Back-projection is well known as an efficient iterative procedure to minimize the reconstruction error. Previous studies have proven the effectiveness of back-projection [51, 11, 8, 46]. Originally, back-projection was designed for the case with multiple LR inputs. However, given only one LR input image, the updating procedure can be obtained by upsampling the LR image using multiple upsampling operators and calculating the reconstruction error iteratively. Timofte et al. mentioned that back-projection can improve the quality of the SR image. Zhao et al. proposed a method to refine high-frequency texture details with an iterative projection process. However, the initialization that leads to an optimal solution remains unknown. Most of the previous studies involve constant, unlearnable predefined parameters such as the blur operator and the number of iterations.
To extend this algorithm, we develop an end-to-end trainable architecture that guides the SR task using mutually connected up- and down-sampling stages to learn the non-linear relation of LR and HR images. The mutual relation between HR and LR images is constructed by creating iterative up- and down-projection units, where the up-projection unit generates HR features and the down-projection unit then projects them back to the LR space, as shown in Fig. 2 (d). This schema enables the networks to preserve the HR components by learning various up- and down-sampling operators and generating deeper features to construct numerous LR and HR features.
Let $I^h$ and $I^l$ be the HR and LR images with sizes $M^h \times N^h$ and $M^l \times N^l$, respectively, where $M^l < M^h$ and $N^l < N^h$. The main building block of our proposed DBPN architecture is the projection unit, which is trained (as part of the end-to-end training of the SR system) to map either an LR feature map to an HR map (up-projection), or an HR map to an LR map (down-projection).
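The up-projection unit is defined as follows:
|scale up:|$H_0^t = (L^{t-1} * p_t)\uparrow_s$|(1)|
|scale down:|$L_0^t = (H_0^t * g_t)\downarrow_s$|(2)|
|residual:|$e_t^l = L_0^t - L^{t-1}$|(3)|
|scale residual up:|$H_1^t = (e_t^l * q_t)\uparrow_s$|(4)|
|output feature map:|$H^t = H_0^t + H_1^t$|(5)|
where $*$ is the spatial convolution operator, $\uparrow_s$ and $\downarrow_s$ are, respectively, the up- and down-sampling operators with scaling factor $s$, and $p_t$, $g_t$, $q_t$ are (de)convolutional layers at stage $t$.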
This projection unit takes the previously computed LR feature map $L^{t-1}$ as input, and maps it to an (intermediate) HR map $H_0^t$; then it attempts to map it back to an LR map $L_0^t$ ("back-project"). The residual (difference) $e_t^l$ between the observed LR map $L^{t-1}$ and the reconstructed $L_0^t$ is mapped to HR again, producing a new intermediate (residual) map $H_1^t$; the final output of the unit, the HR map $H^t$, is obtained by summing the two intermediate HR maps. This step is illustrated in the upper part of Fig. 3.
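The five steps of the up-projection unit can be sketched on a 1-D signal as follows. This is only an illustration under strong assumptions: the learned (de)convolutional layers are replaced by fixed operators with scalar weights (`p`, `g`, and `q` are hypothetical stand-ins, not trained filters), so the sketch shows the data flow rather than the actual layers.

```python
def up(x, s, w=1.0):
    """Assumed stand-in for a learned upsampling layer: repeat each sample s times, scaled by w."""
    return [w * v for v in x for _ in range(s)]


def down(x, s, w=1.0):
    """Assumed stand-in for a learned downsampling layer: average blocks of s samples, scaled by w."""
    return [w * sum(x[i:i + s]) / s for i in range(0, len(x), s)]


def up_projection(l_prev, s, p=1.0, g=1.0, q=1.0):
    """Map an LR feature map to an HR feature map with error feedback."""
    h0 = up(l_prev, s, p)                           # scale up
    l0 = down(h0, s, g)                             # scale down (back-project)
    residual = [a - b for a, b in zip(l0, l_prev)]  # LR projection error
    h1 = up(residual, s, q)                         # scale residual up
    return [a + b for a, b in zip(h0, h1)]          # output feature map


if __name__ == "__main__":
    print(up_projection([1.0, 2.0], s=2, p=1.0, g=0.5, q=1.0))
```

When the back-projected map reproduces the input exactly (here `p = g = 1`), the residual is zero and the unit reduces to plain upsampling; otherwise the residual term corrects the first HR estimate.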
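The down-projection unit is defined very similarly, but now its job is to map its input HR map $H^t$ to the LR map $L^t$, as illustrated in the lower part of Fig. 3.
|scale down:|$L_0^t = (H^t * g'_t)\downarrow_s$|(6)|
|scale up:|$H_0^t = (L_0^t * p'_t)\uparrow_s$|(7)|
|residual:|$e_t^h = H_0^t - H^t$|(8)|
|scale residual down:|$L_1^t = (e_t^h * g'_t)\downarrow_s$|(9)|
|output feature map:|$L^t = L_0^t + L_1^t$|(10)|
The notation follows the up-projection unit, with primes marking the layers of the down-projection unit.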
We organize projection units in a series of stages, alternating between $H$ and $L$. These projection units can be understood as a self-correcting procedure which feeds a projection error to the sampling layer and iteratively changes the solution by feeding back the projection error.
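The alternation of the two unit types can be sketched as below. As before, this is a simplified 1-D illustration: fixed repeat and average operators are assumed in place of the learned layers, and the chain is assumed to start and end with an up-projection unit so that every stage contributes one HR map. The function name `run_stages` is hypothetical.

```python
def up(x, s):
    """Assumed stand-in for a learned upsampling layer (sample repetition)."""
    return [v for v in x for _ in range(s)]


def down(x, s):
    """Assumed stand-in for a learned downsampling layer (block average)."""
    return [sum(x[i:i + s]) / s for i in range(0, len(x), s)]


def up_projection(lr_map, s):
    h0 = up(lr_map, s)
    err = [a - b for a, b in zip(down(h0, s), lr_map)]  # LR projection error
    return [a + b for a, b in zip(h0, up(err, s))]


def down_projection(hr_map, s):
    l0 = down(hr_map, s)
    err = [a - b for a, b in zip(up(l0, s), hr_map)]    # HR projection error
    return [a + b for a, b in zip(l0, down(err, s))]


def run_stages(lr_init, s, num_up_units):
    """Alternate up- and down-projection units; return the HR map of every stage."""
    hr_maps = []
    lr_map = lr_init
    for t in range(num_up_units):
        hr_map = up_projection(lr_map, s)
        hr_maps.append(hr_map)
        if t < num_up_units - 1:          # no down-projection after the last stage
            lr_map = down_projection(hr_map, s)
    return hr_maps


if __name__ == "__main__":
    for stage, hr in enumerate(run_stages([1.0, 2.0], s=2, num_up_units=3), start=1):
        print(stage, hr)
```

The returned list of HR maps corresponds to the features that are later concatenated for reconstruction.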
The projection unit uses large filters. Other existing networks avoid large filters because they slow down convergence and might produce sub-optimal results. However, the iterative use of our projection units enables the network to overcome this limitation and achieve better performance on large scaling factors, even with shallow networks.
The dense inter-layer connectivity pattern in DenseNets has been shown to alleviate the vanishing-gradient problem, produce improved features, and encourage feature reuse. Inspired by this, we propose to improve DBPN by introducing dense connections in the projection units, yielding Dense DBPN (D-DBPN).
Unlike the original DenseNets, we avoid dropout and batch norm, which are not suitable for SR because they remove the range flexibility of the features. Instead, we use a convolution layer for feature pooling and dimensional reduction [42, 12] before entering the projection unit.
In D-DBPN, the input for each unit is the concatenation of the outputs from all previous units. Let $L^{\tilde{t}}$ and $H^{\tilde{t}}$ be the inputs for the dense up- and down-projection units, respectively. They are generated using a convolutional layer that merges all previous outputs from each unit, as shown in Fig. 4. This improvement enables us to generate the feature maps effectively, as shown in the experimental results.
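To make the bookkeeping of dense connections concrete, the sketch below lists, for each unit in execution order, how many earlier feature maps are concatenated at its input. It assumes that concatenation happens only among maps of the same resolution (LR maps feed up-projection units, HR maps feed down-projection units) and that the initial features are not part of the concatenation; both points are assumptions of this illustration, and the function names are hypothetical.

```python
def dense_unit_inputs(num_up_units):
    """Return (unit name, number of concatenated input maps) in execution order."""
    plan = []
    num_hr = 0  # HR maps produced so far
    num_lr = 0  # LR maps produced so far (initial features excluded by assumption)
    for t in range(1, num_up_units + 1):
        # The first up-projection unit reads the initial LR features (one map).
        plan.append((f"up{t}", max(num_lr, 1)))
        num_hr += 1
        if t < num_up_units:
            plan.append((f"down{t}", num_hr))
            num_lr += 1
    return plan


def units_needing_merge(plan):
    """Units whose input is a concatenation of several maps need a merging layer."""
    return [name for name, num_inputs in plan if num_inputs > 1]


if __name__ == "__main__":
    plan = dense_unit_inputs(3)
    print(plan)
    print(units_needing_merge(plan))
```

Under these assumptions the first three units each receive a single map, so only later units need a merging layer in front of them.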
The proposed D-DBPN is illustrated in Fig. 5. It can be divided into three parts: initial feature extraction, projection, and reconstruction, as described below. Here, let $\mathrm{conv}(f, n)$ be a convolutional layer, where $f$ is the filter size and $n$ is the number of filters.
Initial feature extraction. We construct initial LR feature maps $L^0$ from the input using a convolutional layer with $n_0$ filters. Then a second convolutional layer is used to reduce the dimension from $n_0$ to $n_R$ before entering the projection step, where $n_0$ is the number of filters used in the initial LR feature extraction and $n_R$ is the number of filters used in each projection unit.
Back-projection stages. Following initial feature extraction is a sequence of projection units, alternating between construction of LR and HR feature maps $L^t$ and $H^t$; each unit has access to the outputs of all previous units.
Reconstruction. Finally, the target HR image is reconstructed as $I^{sr} = f_{Rec}([H^1, H^2, \ldots, H^t])$, where $f_{Rec}$ is a convolutional layer used for reconstruction and $[H^1, H^2, \ldots, H^t]$ refers to the concatenation of the feature maps produced in each up-projection unit.
Due to the definitions of these building blocks, our network architecture is modular. We can easily define and train networks with different numbers of stages, controlling the depth. For a network with $T$ stages, we have the initial extraction stage (2 layers), and then $T$ up-projection units and $T-1$ down-projection units, each with 3 layers, followed by the reconstruction (one more layer). However, for the dense network, we add a convolutional layer in each projection unit, except the first three units.
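The depth of the non-dense network follows directly from this description. The sketch below assumes that a network with `num_stages` stages contains `num_stages` up-projection units and `num_stages - 1` down-projection units; this split is an assumption of the illustration, chosen because it reproduces the 12- and 24-layer networks mentioned in the depth analysis.

```python
def dbpn_depth(num_stages, layers_per_unit=3, extraction_layers=2, reconstruction_layers=1):
    """Count convolutional layers of a non-dense network with the given number of stages."""
    num_up_units = num_stages          # assumed: one up-projection unit per stage
    num_down_units = num_stages - 1    # assumed: no down-projection after the last stage
    num_units = num_up_units + num_down_units
    return extraction_layers + layers_per_unit * num_units + reconstruction_layers


if __name__ == "__main__":
    for stages in (2, 4, 6):
        print(stages, dbpn_depth(stages))
```

With the default layer counts this simplifies to six layers per stage.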
In the proposed networks, the filter size in the projection unit varies with the scaling factor. For enlargement, we use a convolutional layer with a stride of four and a padding of two. Finally, the enlargement uses a convolutional layer with a stride of eight and a padding of two. (Footnote 1: We found these settings to work well based on general intuition and preliminary experiments.)
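The relation between stride, padding, and filter size can be checked with the standard output-size formulas for convolution and transposed convolution (dilation 1, no output padding). The sketch below assumes that each projection layer should scale the spatial size exactly by its stride; the kernel sizes it derives are an inference under that assumption, not values quoted from the text.

```python
def deconv_out(size, kernel, stride, padding):
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_out(size, kernel, stride, padding):
    """Output size of a strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def exact_scale_kernel(stride, padding):
    """Kernel size for which deconv_out(size, ...) equals stride * size."""
    return stride + 2 * padding


if __name__ == "__main__":
    for stride in (4, 8):
        kernel = exact_scale_kernel(stride, padding=2)
        hr = deconv_out(32, kernel, stride, 2)
        lr = conv_out(hr, kernel, stride, 2)
        print(f"stride={stride} kernel={kernel}: 32 -> {hr} -> {lr}")
```

Using the same kernel, stride, and padding for the down-sampling convolution maps the HR size back to the original LR size, which is what the projection units require.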
We initialize the weights with a standard deviation computed as $\sqrt{2/n_l}$, where $n_l = f_t^2 n_t$, $f_t$ is the filter size, and $n_t$ is the number of filters. For example, with and , the std is . All convolutional and deconvolutional layers are followed by parametric rectified linear units (PReLUs).
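A minimal sketch of the initialization rule, assuming the standard deviation takes the form $\sqrt{2/n_l}$ with $n_l = f^2 n$ for filter size $f$ and $n$ filters. The filter size and filter count used in the usage lines are hypothetical illustration values.

```python
import math


def init_std(filter_size, num_filters):
    """Standard deviation sqrt(2 / n_l) with n_l = filter_size^2 * num_filters."""
    n_l = filter_size ** 2 * num_filters
    return math.sqrt(2.0 / n_l)


if __name__ == "__main__":
    # Hypothetical layer: 3x3 filters, 64 filters.
    print(round(init_std(3, 64), 4))
```

Larger filters or more filters give a smaller standard deviation, which keeps the scale of the activations roughly constant across layers.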
Training uses several datasets, including the ImageNet dataset, without augmentation. (Footnote 2: The comparison on DIV2K only is available in the supplementary material.) To produce LR images, we downscale the HR images by particular scaling factors using bicubic interpolation. We use a batch size of 20 with size for the LR image, while the HR image size corresponds to the scaling factor. The learning rate is initialized to for all layers and decreases by a factor of 10 every iterations, for total iterations. For optimization, we use Adam with momentum set to and weight decay set to . All experiments were conducted using Caffe and MATLAB R2017a on NVIDIA TITAN X GPUs.
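The step-wise learning-rate decay described above can be written as a small function. The base learning rate and step size in the usage lines are hypothetical placeholders, since the actual values are not legible here; only the decay factor of 10 comes from the text.

```python
def step_lr(iteration, base_lr, step_size, gamma=0.1):
    """Learning rate after `iteration` steps, divided by 10 every `step_size` iterations."""
    return base_lr * gamma ** (iteration // step_size)


if __name__ == "__main__":
    # Hypothetical settings for illustration only.
    for it in (0, 999, 1000, 2500):
        print(it, step_lr(it, base_lr=1e-4, step_size=1000))
```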
Depth analysis. To demonstrate the capability of our projection unit, we construct multiple networks (), (), and () from the original DBPN. In the feature extraction, we use followed by . Then, we use for the reconstruction. The input and output image are luminance only.
The results on enlargement are shown in Fig. 6. DBPN outperforms the state-of-the-art methods. Starting with our shallowest network, it gives a higher PSNR than VDSR, DRCN, and LapSRN. This network uses only 12 convolutional layers with a smaller number of filters than VDSR, DRCN, and LapSRN. At its best, it achieves dB, which is dB, dB, and dB better than VDSR, DRCN, and LapSRN, respectively. The second network improves on all four existing state-of-the-art methods (VDSR, DRCN, LapSRN, and DRRN). At its best, it achieves dB, which is dB, dB, dB, and dB better than VDSR, DRCN, LapSRN, and DRRN, respectively. In total, it uses 24 convolutional layers, the same depth as LapSRN. Compared to DRRN (up to 52 convolutional layers), it clearly shows the effectiveness of our projection unit. Finally, the deepest network outperforms all methods with dB, which is dB, dB, dB, and dB better than VDSR, DRCN, LapSRN, and DRRN, respectively.
The results of enlargement are shown in Fig. 7. The networks outperform the current state of the art for enlargement, which clearly shows the effectiveness of our proposed networks on large scaling factors. However, we found no significant performance gain between the proposed networks, especially for the and networks, where the difference is only dB.
For the sake of low computation for real-time processing, we construct the network, which is a lighter version of the network, . We only use followed by for the initial feature extraction. Nevertheless, its results outperform SRCNN, FSRCNN, and VDSR on both and enlargement. Moreover, our network performs better than VDSR with and fewer parameters on and enlargement, respectively.
Our network has about fewer parameters and higher PSNR than LapSRN on enlargement. Finally, D-DBPN has about fewer parameters, and approximately the same PSNR, compared to EDSR on enlargement. On the enlargement, D-DBPN has about fewer parameters with better PSNR compared to EDSR. This evidence shows that our networks have the best trade-off between performance and number of parameters.
Deep concatenation. Each projection unit is used to distribute the reconstruction step by constructing features which represent different details of the HR components. Deep concatenation is also closely related to the number of back-projection stages: generating more detailed features from the projection units also increases the quality of the results. In Fig. 10, it is shown that each stage successfully generates diverse features to reconstruct the SR image.
Dense connection. We implement D-DBPN-L, which is a densely connected version of the network, to show how dense connections can improve the network's performance in all cases, as shown in Table 1. On enlargement, the dense network, D-DBPN-L, gains dB and dB over DBPN-L on Set5 and Set14, respectively. On , the gaps are even larger: D-DBPN-L is dB and dB higher than DBPN-L on Set5 and Set14, respectively.
To confirm the ability of the proposed network, we performed several experiments and analyses. We compare our network with eight state-of-the-art SR algorithms: A+, SRCNN, FSRCNN, VDSR, DRCN, DRRN, LapSRN, and EDSR. We carry out extensive experiments using 5 datasets: Set5, Set14, BSDS100, Urban100, and Manga109. Each dataset has different characteristics. Set5, Set14 and BSDS100 consist of natural scenes; Urban100 contains urban scenes with details in different frequency bands; and Manga109 is a dataset of Japanese manga. Due to the computation limit of Caffe, we have to divide each image in Urban100 and Manga109 into four parts and then calculate PSNR separately.
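Dividing an image into four parts can be sketched as below, with the image given as a list of pixel rows. The text does not say how the parts are cut, so the 2-by-2 grid split at the image midpoints is an assumption of this illustration, and the function name is hypothetical.

```python
def split_into_quadrants(image):
    """Split a 2-D image (list of rows) into four parts on a 2x2 grid."""
    height, width = len(image), len(image[0])
    mid_row, mid_col = height // 2, width // 2
    return [
        [row[:mid_col] for row in image[:mid_row]],   # top-left
        [row[mid_col:] for row in image[:mid_row]],   # top-right
        [row[:mid_col] for row in image[mid_row:]],   # bottom-left
        [row[mid_col:] for row in image[mid_row:]],   # bottom-right
    ]


if __name__ == "__main__":
    image = [[r * 6 + c for c in range(6)] for r in range(4)]
    for part in split_into_quadrants(image):
        print(part)
```

Each part can then be super-resolved and evaluated on its own, which bounds the memory needed per forward pass.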
Our final network, D-DBPN, uses then for the initial feature extraction and for the back-projection stages. In the reconstruction, we use . RGB color channels are used for input and output image. It takes less than four days to train.
PSNR and structural similarity (SSIM) were used to quantitatively evaluate the proposed method. Note that higher PSNR and SSIM values indicate better quality. As in existing networks, all measurements used only the luminance channel (Y). For SR by factor , we crop pixels near the image boundary before evaluation, as in [31, 7]. Some of the existing networks, such as SRCNN, FSRCNN, VDSR, and EDSR, did not perform enlargement. To this end, we retrained the existing networks using the authors' code with the recommended parameters.
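The PSNR measurement with boundary cropping can be sketched as follows. The inputs are assumed to be luminance (Y) images already extracted as lists of rows with an 8-bit peak value of 255; the `border` width is left as a parameter because the number of cropped pixels is tied to the scaling factor.

```python
import math


def psnr_y(reference, estimate, border=0, peak=255.0):
    """PSNR in dB between two luminance images, ignoring `border` pixels on each side."""
    height, width = len(reference), len(reference[0])
    squared_errors = [
        (reference[i][j] - estimate[i][j]) ** 2
        for i in range(border, height - border)
        for j in range(border, width - border)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return float("inf")  # identical images inside the evaluated region
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    ref = [[0.0] * 4 for _ in range(4)]
    est = [[1.0] * 4 for _ in range(4)]
    print(round(psnr_y(ref, est, border=1), 2))
```

Cropping the border removes pixels whose reconstruction is affected by padding, so the score reflects the interior of the image only.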
We show the quantitative results in Table 2. Our D-DBPN outperforms the existing methods by a large margin at all scales, except EDSR. For the and enlargement, we have comparable PSNR with EDSR. However, EDSR tends to generate stronger edges than the ground truth, which leads to misleading information in several cases. The result of EDSR for the eyelashes in Fig. 11 shows that they were interpreted as a stripe pattern. On the other hand, our result generates softer patterns, which are subjectively closer to the ground truth. On the butterfly image, EDSR separates the white pattern, which shows that EDSR tends to construct regular patterns such as circles and stripes, while D-DBPN constructs the same pattern as the ground truth. This observation is supported by the results on the Urban100 dataset, which consists of many regular patterns from buildings. On Urban100, EDSR is dB higher than D-DBPN.
Our network shows its effectiveness on the enlargement. D-DBPN outperforms all of the existing methods by a large margin. Interesting results are shown on the Manga109 dataset, where D-DBPN obtains dB, which is dB better than EDSR, while on the Urban100 dataset, D-DBPN achieves 23.25, which is only dB better than EDSR. The results show that our networks perform better on fine-structure images such as manga characters, even though we do not use any animation images in training.
The results of enlargement are visually shown in Fig. 12. Qualitatively, D-DBPN is able to preserve the HR components better than other networks. This shows that our networks can not only extract features but also create contextual information from the LR input to generate HR components in the case of large scaling factors, such as enlargement.
We have proposed Deep Back-Projection Networks for single image super-resolution. Unlike previous methods, which predict the SR image in a feed-forward manner, our proposed networks directly increase the SR features using multiple up- and down-sampling stages, feed the error predictions at each depth of the network back to revise the sampling results, and then accumulate the self-correcting features from each upsampling stage to create the SR image. We use error feedback from the up- and down-scaling steps to guide the network to a better result. The results show the effectiveness of the proposed network compared to other state-of-the-art methods. Moreover, our proposed network successfully outperforms other state-of-the-art methods on large scaling factors such as 8x enlargement.
Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pages 391–407. Springer, 2016.