Official implementation of block state-based recursive network (BSRN) for super-resolution in TensorFlow
Recently, several deep learning-based image super-resolution methods have been developed by stacking massive numbers of layers. However, this leads to large model sizes and high computational complexity; thus, some recursive parameter-sharing methods have also been proposed. Nevertheless, their designs do not properly utilize the potential of the recursive operation. In this paper, we propose a novel, lightweight, and efficient super-resolution method that maximizes the usefulness of the recursive architecture by introducing a block state-based recursive network. By utilizing the block state, the recursive part of our model can easily track the status of the current image features. We show the benefits of the proposed method in terms of model size, speed, and efficiency. In addition, we show that our method outperforms the other state-of-the-art methods.
Single-image super-resolution is the task of obtaining a high-resolution image from a given low-resolution image. It is an ill-posed problem, since image details have to be estimated despite the lack of spatial information. Many researchers have proposed various approaches that can generate upscaled images of better quality than simple interpolation methods such as nearest-neighbor, bilinear, and bicubic upscaling.
Recently, deep learning techniques have spread to the super-resolution field. For example, Dong et al. proposed the super-resolution convolutional neural network (SRCNN) model, which showed much improved performance in comparison to the previous approaches. Lim et al. suggested the enhanced deep super-resolution (EDSR) model, which employs residual connections and various optimization techniques.
Many recent deep learning-based super-resolution methods tend to stack many more layers to obtain better upscaled images, but this dramatically increases the number of model parameters. For instance, the EDSR model requires about 43M parameters, which is at least 400 times more than the SRCNN model. To deal with this, recursive approaches that use some parameters repeatedly have been proposed, including deeply-recursive convolutional network (DRCN), deep recursive residual network (DRRN), and dual-state recurrent network (DSRN).
The recursive super-resolution methods can be regarded as kinds of recurrent neural networks (RNNs). RNNs are usually employed when the sequential relation of the data is significant, such as in language modeling and human activity recognition. The beauty of RNNs comes from their two-fold structure: the recurrent unit handles not only the current input data but also the previously processed features. Since the previously processed features contain historical information, RNNs can properly deal with the sequential dependency of the inputs.
However, two characteristics of the existing recursive super-resolution methods hinder them from fully exploiting the usefulness of the RNNs. First, there are no intermediate inputs and only the previously processed features are provided to the recurrent unit. Second, the final output of the recurrent unit is directly used to obtain the final upscaled image. In this situation, the output of the recurrent unit has to contain not only the super-resolved features, but also the historical information that is not useful in the non-recursive post-processing part.
To alleviate this problem, we propose a novel super-resolution method using block state-based recursive network (BSRN). Our method employs a so-called “block state” along with the input features in the recursive part, which is a separate information storage to keep historical features. Thanks to this design, our method achieves various benefits over the previous recursive super-resolution methods in terms of image quality, lightness, speed, and efficiency. As shown in Figure 1, our method achieves the best performance in terms of image quality, while the model complexity is significantly reduced. In addition, the BSRN model can generate the super-resolved images in a progressive manner, which is useful for real-world applications such as progressive image loading.
The rest of the paper is organized as follows. First, we discuss the related work in Section 2. Then, the overall structure of the proposed method is explained in Section 3. We present several experiments for in-depth analysis of our method in Section 4, including examining effectiveness of the newly introduced recursive structure and comparison with the other state-of-the-art methods. Finally, we conclude our work in Section 5.
Before deep learning emerged, feature extraction-based methods were widely used for super-resolution, such as sparse representation-based and Bayes forest-based approaches. This trend changed when deep learning showed significantly better performance in image classification tasks. Dong et al. pioneered deep learning-based super-resolution by introducing SRCNN, which enhances the interpolated image via three convolutional layers. Kim et al. proposed very deep super-resolution (VDSR), which stacks 20 convolutional layers to improve the performance. Lim et al. suggested the EDSR model, which employs more than 64 convolutional layers. These methods share the basic empirical rule of deep learning: deeper and larger models can achieve better performance.
As we addressed in the introduction, super-resolution methods sharing model parameters have been proposed. DRCN, introduced by Kim et al., demonstrates the effectiveness of parameter sharing by recursively applying the feature extraction layer 16 times. Tai et al. proposed DRRN, which employs a residual network (ResNet) with shared model parameters. They also proposed the memory network (MemNet) model, which contains groups of recursive parts called “memory blocks” with skip connections across them. Han et al. considered DRCN and DRRN as RNNs employing recurrent states, and proposed DSRN, which uses dual recurrent states. Ahn et al. developed the cascading residual network (CARN) model, which employs cascading residual blocks that share their model parameters. Although these methods can be regarded as RNNs, as Han et al. mentioned, none of them uses a separate state that is used only in the recursive part and not in the non-recursive post-processing part.
Some researchers proposed super-resolution methods that do not rely on shared parameters but have small numbers of model parameters. For example, Lai et al.  introduced the Laplacian pyramid super-resolution network (LapSRN) method, which progressively upscales the input image by a factor of 2. Hui et al.  proposed the information distillation network (IDN) method, which employs long and short feature extraction paths to maximize the amount of extracted information from the given low-resolution image. Along with the recursive super-resolution methods, the performance of these methods is also compared with that of our proposed method in Section 4.5.
We observe the following three common techniques in the previous work. First, increasing the spatial resolution at a later stage reduces the computational complexity compared with upscaling at the initial stage [3, 6, 12]. Second, employing multiple residual connections is beneficial for obtaining better upscaled images [15, 24]. Third, obtaining multiple upscaled images from the same super-resolution model and combining them into one provides better quality than acquiring a single image directly [14, 19, 25]. Along with the newly introduced block state-based architecture, our proposed method is built on this empirical knowledge.
In this section, we present how our super-resolution model works in detail. Similar to the existing super-resolution methods, our BSRN model can be divided into three parts: initial feature extraction, feature processing in a recursive manner, and upscaling. Figure 2 shows the overall structure of our method. As shown in the figure, the main objective of the super-resolution task is to obtain an image $\hat{Y}$, which is upscaled from a given low-resolution image $X$, where we want $\hat{Y}$ to be the same as the ground-truth image $Y$. Briefly, the initial features are extracted from the given input image. Then, the extracted features are further processed via a recursive residual block (RRB, Figure 3), which is employed multiple times with the same parameters. The final image is obtained from the upscaling module.
The BSRN model takes a low-resolution input image $X \in \mathbb{R}^{w \times h \times 3}$ consisting of three channels of the RGB color space, where $w \times h$ is the resolution of the image. Before we recursively process it, a convolutional layer extracts the initial features of the image, which can be represented as
$$H_0 = W_0 * X + b_0 \qquad (1)$$
where $W_0$ and $b_0$ are the weight and bias matrices, respectively, and the operator $*$ denotes the convolution operation. A variable $c$ determines the number of convolutional channels, thus the last dimension of $H_0$ is $c$.
Starting from the initial features $H_0$, our model performs the recursive operations in the shared part named “recursive residual block (RRB),” which is shown in Figure 3. The RRB takes two matrices as inputs at a given iteration $t$: the feature matrix $H_{t-1} \in \mathbb{R}^{w \times h \times c}$ that has been processed at previous iterations from the original input image and an additional matrix $S_{t-1} \in \mathbb{R}^{w \times h \times s}$ called “block state,” where $s$ determines the feature dimension of $S_{t-1}$. As shown in Figure 2, the block state matrix is not derived from the input image features. Instead, the initial block state matrix $S_0$ is initialized with zero values. Note that $H_t$ and $S_t$ have the same spatial dimension but different feature dimensions.
An RRB consists of three concatenated convolution (C-Conv) layers and one concatenated rectified linear unit (C-ReLU) layer. A C-Conv layer first concatenates two input matrices along the last dimension, performs a convolutional operation, and splits the result into two output matrices having the sizes of the input matrices. In other words, when $H$ and $S$ are given, a C-Conv layer concatenates them (i.e., $[H, S]$), applies convolution as
$$[H', S'] = W * [H, S] + b \qquad (2)$$
and splits the result into $H'$ and $S'$, where $W$ and $b$ are the weight and bias matrices, respectively. A C-ReLU layer performs element-wise ReLU operations for the two inputs. In addition, two residual connections are involved for better performance as in the previous work [13, 15]. After processing $H_{t-1}$ and $S_{t-1}$ with three C-Conv layers, one C-ReLU layer, and two residual connections for the $H_{t-1}$ part, the RRB outputs $H_t$ and $S_t$, which then serve as the inputs of the same RRB for the next recursion. This recursive process is performed $R$ times, which produces $H_R$ and $S_R$.
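The concatenate, convolve, and split steps of a C-Conv layer can be sketched in plain Python. This is only an illustration under stated assumptions: it uses a 1×1 convolution (a per-pixel linear map) to stay short, whereas the text does not state the kernel size here, and the function names `c_conv_1x1` and `c_relu` are hypothetical rather than taken from the official code.

```python
from typing import List, Tuple

Pixel = List[float]              # feature vector at one spatial position
FeatureMap = List[List[Pixel]]   # indexed as [row][col][channel]


def c_conv_1x1(h: FeatureMap, s: FeatureMap,
               weight: List[List[float]], bias: List[float]
               ) -> Tuple[FeatureMap, FeatureMap]:
    """Concatenate h and s per pixel, apply a 1x1 convolution, split back.

    Assumption: a 1x1 kernel, so the convolution is a per-pixel linear map.
    weight has (c + s) rows of (c + s) values; bias has (c + s) values.
    """
    c = len(h[0][0])  # number of feature channels
    h_out, s_out = [], []
    for h_row, s_row in zip(h, s):
        h_out_row, s_out_row = [], []
        for h_px, s_px in zip(h_row, s_row):
            x = h_px + s_px  # concatenation along the last dimension
            y = [sum(w * v for w, v in zip(w_row, x)) + b
                 for w_row, b in zip(weight, bias)]
            h_out_row.append(y[:c])  # first c channels: image features
            s_out_row.append(y[c:])  # remaining channels: block state
        h_out.append(h_out_row)
        s_out.append(s_out_row)
    return h_out, s_out


def _relu(fmap: FeatureMap) -> FeatureMap:
    return [[[max(0.0, v) for v in px] for px in row] for row in fmap]


def c_relu(h: FeatureMap, s: FeatureMap) -> Tuple[FeatureMap, FeatureMap]:
    """Apply element-wise ReLU to both inputs independently."""
    return _relu(h), _relu(s)
```

Because the two outputs keep the channel counts of the two inputs, the same layer can be applied again at the next recursion without any reshaping.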
There are two ways of configuring the BSRN model to get better performance: increasing the number of convolutional channels $c$ and increasing the feature dimension of the block state $s$. When $c$ increases, the number of model parameters increases across all parts of the model, including the initial feature extraction, RRB, and upscaling parts. On the other hand, an increased $s$ affects the number of model parameters only in the RRB part because the block state is involved only in the RRB. Therefore, employing the block state is more beneficial for keeping the model compact than using a larger number of convolutional channels.
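The effect of the two configuration choices on the parameter count can be checked with simple arithmetic. The sketch below assumes 3×3 kernels with one bias per output channel, which is an assumption and not stated in the text; the helper names are hypothetical. It only counts the initial convolution and the three C-Conv layers of the RRB.

```python
def conv_params(in_channels: int, out_channels: int, kernel: int = 3) -> int:
    """Weights plus biases of one convolution (assumed kernel x kernel)."""
    return kernel * kernel * in_channels * out_channels + out_channels


def initial_conv_params(c: int, kernel: int = 3) -> int:
    """Initial feature extraction: RGB input (3 channels) to c channels."""
    return conv_params(3, c, kernel)


def rrb_params(c: int, s: int, kernel: int = 3, num_c_conv: int = 3) -> int:
    """Each C-Conv maps the concatenated (c + s) channels to (c + s) channels."""
    return num_c_conv * conv_params(c + s, c + s, kernel)


if __name__ == "__main__":
    for c, s in [(64, 0), (64, 64), (96, 0)]:
        print(f"c={c:3d} s={s:3d} "
              f"initial={initial_conv_params(c):6d} rrb={rrb_params(c, s):7d}")
```

Under these assumptions, changing the block state channels alters only the RRB count, while changing the convolutional channels also alters the initial convolution (and, in the full model, the upscaling part, which is not counted here).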
In addition, because the block state can serve as a “memory,” the RRB can keep track of the status of the current image features over the recursive operations. When the block state does not exist, it is hard to track the current status because it has to be latently written on the image features (i.e., $H_t$). This may lead to quality degradation of the upscaled images, since both the image features and the current status are fed into the upscaling part. We investigate the effectiveness of employing the block state in Section 4.3.
Finally, the BSRN model upscales the processed feature matrix $H_R$ to generate an upscaled image $\hat{Y}_R$. In particular, we use the depth-to-space operation as in the previous super-resolution models [3, 6], which is also known as sub-pixel convolution. For instance, in the upscaling part by a factor of 2, the first convolutional layer outputs the processed matrix having a size of $w \times h \times 4c$, the depth-to-space operator modifies the shape of the matrix to $2w \times 2h \times c$, and the last convolutional layer outputs the final upscaled image having a shape of $2w \times 2h \times 3$. Note that the block state is not used in the upscaling part.
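The depth-to-space rearrangement itself is easy to illustrate without any framework. The following is a minimal sketch on nested lists indexed as `[row][col][channel]`; the channel ordering is assumed to follow the convention that each group of `block * block` depth chunks fills one `block × block` spatial tile in row-major order, and the function name is hypothetical.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as [row][col][channel]


def depth_to_space(x: Image, block: int) -> Image:
    """Move channel groups into spatial positions.

    Input shape:  height x width x (block * block * out_depth)
    Output shape: (height * block) x (width * block) x out_depth
    """
    height, width, depth = len(x), len(x[0]), len(x[0][0])
    if depth % (block * block) != 0:
        raise ValueError("depth must be divisible by block * block")
    out_depth = depth // (block * block)
    out = [[[0.0] * out_depth for _ in range(width * block)]
           for _ in range(height * block)]
    for row in range(height):
        for col in range(width):
            for dy in range(block):
                for dx in range(block):
                    # Each (dy, dx) offset takes one contiguous channel chunk.
                    start = (dy * block + dx) * out_depth
                    out[row * block + dy][col * block + dx] = \
                        list(x[row][col][start:start + out_depth])
    return out
```

For a factor of 2, a single pixel with four channels becomes a 2×2 patch with one channel, which is why the preceding convolution must produce four times as many channels as the layer that follows.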
Our model can generate upscaled images not only from the final processed feature matrix $H_R$ but also from the intermediate feature matrices $H_t$. Therefore, with our model, it is possible to generate the upscaled images in a progressive manner. In addition, it is known that combining multiple outputs can improve the quality of the super-resolved images [9, 14]. Thus, we adopt a similar approach to obtain the final upscaled image $\hat{Y}$ by combining the intermediate outputs $\hat{Y}_t$ via the weighted sum in (3), where $r$ is a so-called “frequency control variable,” which will be explained later. The weighting term in (3) controls the contribution of each intermediate output, so that the later outputs contribute more than the earlier outputs. This enables our model to generate intermediate upscaled images with progressively improved quality.
The variable $r$ in (3) controls the frequency of the progressive upscaling. For example, when $R = 16$ and $r = 4$, $\hat{Y}$ is obtained from the weighted sum of $\hat{Y}_4$, $\hat{Y}_8$, $\hat{Y}_{12}$, and $\hat{Y}_{16}$. Since a larger value of $r$ reduces the number of times the upscaling part is employed, it is beneficial for reducing the processing time needed to generate the final super-resolved image. We discuss the influence of changing $r$ in Section 4.4.
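The selection of intermediate outputs and their combination can be sketched as follows. The example assumes 16 recursions and a frequency control variable of 4, which yields four intermediate outputs. The weights used in the usage note (powers of two) are purely an assumption chosen so that later outputs count more; they are not taken from the text, and both function names are hypothetical.

```python
from typing import List, Sequence


def output_iterations(num_recursions: int, frequency: int) -> List[int]:
    """Iterations at which the upscaling part is applied."""
    return list(range(frequency, num_recursions + 1, frequency))


def combine_outputs(outputs: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[float]:
    """Normalized weighted sum of flattened intermediate images."""
    if len(outputs) != len(weights):
        raise ValueError("one weight per intermediate output is required")
    total = sum(weights)
    size = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs)) / total
            for i in range(size)]


if __name__ == "__main__":
    steps = output_iterations(16, 4)
    print(steps)
    # Hypothetical weights that grow with the iteration index.
    weights = [2.0 ** k for k in range(len(steps))]
    fake_outputs = [[0.0], [0.0], [0.0], [15.0]]
    print(combine_outputs(fake_outputs, weights))
```

With a frequency of 1 the upscaling part would run 16 times, and with a frequency of 16 only once, which is the source of the speed difference discussed in the text.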
The loss function of our model is calculated from the weighted sum of the pixel-by-pixel L1 loss, as given in (4). The loss is evaluated over the spatial resolution shared by the upscaled image $\hat{Y}$ and the ground-truth image $Y$, by comparing the pixel values of the two images at each position.
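A pixel-by-pixel L1 loss and a weighted combination of several such losses can be sketched as below. Two points are assumptions rather than facts from the text: the loss is normalized by the number of compared values, and the weighting is applied per intermediate output. The function names are hypothetical.

```python
from typing import List, Sequence

Image = List[List[List[float]]]  # indexed as [row][col][channel]


def l1_loss(upscaled: Image, ground_truth: Image) -> float:
    """Mean absolute difference over all pixels and channels."""
    total, count = 0.0, 0
    for row_u, row_g in zip(upscaled, ground_truth):
        for px_u, px_g in zip(row_u, row_g):
            for u, g in zip(px_u, px_g):
                total += abs(u - g)
                count += 1
    return total / count


def weighted_l1_loss(outputs: Sequence[Image], ground_truth: Image,
                     weights: Sequence[float]) -> float:
    """Normalized weighted sum of the L1 losses of several outputs."""
    weighted = sum(w * l1_loss(out, ground_truth)
                   for w, out in zip(weights, outputs))
    return weighted / sum(weights)
```

As a quick check, comparing a one-pixel image with values (1, 2) against (0, 4) gives absolute differences of 1 and 2, so the mean is 1.5.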
We conduct three experiments to investigate the advantages of the BSRN model. First, we examine the effectiveness of employing the block state. Second, we explore the role of the frequency control variable $r$. Finally, we compare our models with the other state-of-the-art methods.
We employ the DIV2K dataset for training the BSRN models, which is widely used for training the recent super-resolution models [3, 15]. For evaluating the performance of our models, we use four benchmark datasets: Set5, Set14, BSD100, and Urban100.
We build both single-scale (×4) and multi-scale (×2, ×3, and ×4) BSRN models. The single-scale models are used to find out the benefits of the block state and frequency control variable, and the multi-scale model is used to evaluate the performance of our model in comparison to the other super-resolution methods across different scales. The number of the recursive operations $R$ and the frequency control variable $r$ are set to 16 and 1, respectively.
We implement the training and evaluation code of the BSRN model on the TensorFlow framework (the code is available at https://github.com/idearibosome/tf-bsrn-sr). For each training step, eight image patches are randomly cropped from the training images. A cropping size of 32×32 pixels is used for training the single-scale BSRN model and 48×48 pixels is used for the multi-scale BSRN model. For data augmentation, the image patches are then randomly flipped and rotated. For the multi-scale BSRN model, one of the upscaling paths (i.e., ×2, ×3, and ×4) is randomly selected for every training step. The super-resolved images are obtained from our model by feeding the image patches. Then, the loss is calculated using (4) and the Adam optimization method is used to update the model parameters. To prevent the vanishing or exploding gradients problem, we employ the L2 norm-based gradient clipping method, which clips each gradient so that its L2 norm fits within a fixed bound. The initial learning rate is halved at regular intervals of training steps. The single-scale and multi-scale BSRN models are trained for separately chosen total numbers of steps.
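L2 norm-based gradient clipping and a step-wise halving of the learning rate can be sketched in a few lines. The clipping bound, the initial rate, and the decay interval shown in the usage note are placeholders chosen for illustration only, since the corresponding values are not available in this text; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def clip_by_l2_norm(gradient: Sequence[float], max_norm: float) -> List[float]:
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm or norm == 0.0:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


def halved_learning_rate(initial_rate: float, step: int,
                         decay_steps: int) -> float:
    """Halve the learning rate once every decay_steps training steps."""
    return initial_rate * 0.5 ** (step // decay_steps)


if __name__ == "__main__":
    # Placeholder values, not the settings used in the experiments.
    print(clip_by_l2_norm([3.0, 4.0], max_norm=1.0))
    print(halved_learning_rate(1.0, step=250, decay_steps=100))
```

Clipping rescales the whole gradient rather than each element separately, so the update direction is preserved while its magnitude is bounded.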
As explained in Section 3.2, the BSRN model can be trained with various numbers of the convolutional channels (i.e., $c$) and the block state channels (i.e., $s$). Here, we investigate the effectiveness of employing the block state by comparing the single-scale BSRN models having an upscaling factor of 4, which are trained with and without using the block state. For the models with the block state, the number of the convolutional channels $c$ is fixed to 64 and the number of the block state channels $s$ is changed from 1 to 64. For the models without the block state, on the other hand, $s$ is fixed to 0 and $c$ is changed from 64 to 96. All models are tested with the same configuration.
Figure 4 compares the performance of the trained BSRN models in terms of the number of parameters and the PSNR values measured for the BSD100 dataset. Overall, the models both with and without the block state tend to show better performance as the feature dimension (and consequently the number of parameters) increases. However, the BSRN models with the block state outperform the models without the block state when the same numbers of parameters are used. This strongly supports the idea that separating the place to store historical information from that for the image features helps to improve the quality of the upscaled images.
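For reference, PSNR follows the standard definition based on the mean squared error. The sketch below works on flattened pixel lists and assumes 8-bit images with a peak value of 255; benchmark-specific conventions such as color conversion or border cropping are not modeled, and the function name is hypothetical.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], test: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, small differences in PSNR correspond to small relative changes in the mean squared error.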
We further examine changes of the activation patterns of the BSRN models over the recursive iterations. Figure 5 shows the intermediate features $H_t$, the block states $S_t$, and the upscaled images $\hat{Y}_t$ of the BSRN models trained without and with the block state, where the corresponding PSNR values are also reported. The values of the intermediate features and block states are averaged along the last dimension. The models both with and without the block state generate upscaled images with gradually improved quality in terms of the PSNR values over the iteration $t$. However, the changes of the intermediate features are largely different. When the block state is not employed (Figure 5 (a)), both the patterns of the activation and the range of the values change drastically, even though the super-resolved images do not. This implies that the RRB of the model without the block state has difficulty in generating progressively improved features and relies heavily on the latter part (i.e., the upscaling part) to generate upscaled images of good quality. On the other hand, employing the block state (Figure 5 (b)) results in much more stable activations of $H_t$ than in the model without the block state. Instead, the block states show major changes in their details, which provide historical information that can be used to produce gradually improved upscaled images over the iterations. This confirms that our model properly utilizes the block state along with the intermediate output features, which leads to better performance.
Table 1: Average processing time (s), PSNR (dB), and SSIM on the BSD100 dataset for various values of $r$ (upscaling factor of 4).
Our model can be configured with the frequency of progressive outputs via $r$, along with the number of recursive iterations $R$. While $R$ determines how many times the RRB is used to generate the final upscaled image, $r$ determines how many intermediate images are obtained from the model to generate the final image (i.e., how many times the upscaling part is employed), which is $R/r$. Note that neither $R$ nor $r$ affects the number of model parameters.
In our proposed model, the upscaling part accounts for most of the computation time because of the larger number of convolutional filters required for the depth-to-space operation and the increased spatial resolution after that operation. To verify this, we examine a trained BSRN model by testing it with different values of $r$ and compare the results in terms of speed and quality of the upscaled images (i.e., PSNR and SSIM).
Table 1 shows the average processing time spent on upscaling an image by a factor of 4, PSNR values, and SSIM values for the BSD100 dataset for various values of $r$. The processing time is measured on an NVIDIA GeForce GTX 1080 GPU. As expected, the processing time largely decreases when $r$ increases. For example, the BSRN model tested with a larger $r$ requires more than 5 times less processing time than the model tested with a smaller $r$. Nevertheless, the PSNR value decreases by only 0.002 dB and the SSIM value even remains the same. This confirms that increasing $r$ significantly increases the processing speed with only negligible quality degradation. In addition, the experimental result implies that our proposed model is capable of real-time processing. For example, with a sufficiently large $r$, our model can upscale more than 30 images per second, which is a common frame rate of videos.
Table 2: Number of model parameters and PSNR / SSIM values of the compared methods on the four benchmark datasets.
| Scale | Method | # params | Set5 (PSNR / SSIM) | Set14 (PSNR / SSIM) | BSD100 (PSNR / SSIM) | Urban100 (PSNR / SSIM) |
| ×2 | VDSR | 666K | 37.53 / 0.9587 | 33.03 / 0.9124 | 31.90 / 0.8960 | 30.76 / 0.9140 |
| ×2 | DRCN | 1,774K | 37.63 / 0.9588 | 33.04 / 0.9118 | 31.85 / 0.8942 | 30.75 / 0.9133 |
| ×2 | LapSRN | 436K | 37.52 / 0.959 | 33.08 / 0.913 | 31.80 / 0.895 | 30.41 / 0.910 |
| ×2 | DRRN | 298K | 37.74 / 0.9591 | 33.23 / 0.9136 | 32.05 / 0.8973 | 31.23 / 0.9188 |
| ×2 | MemNet | 686K | 37.78 / 0.9597 | 33.23 / 0.9142 | 32.08 / 0.8978 | 31.31 / 0.9195 |
| ×2 | DSRN | 1,200K | 37.66 / 0.959 | 33.15 / 0.913 | 32.10 / 0.897 | 30.97 / 0.916 |
| ×2 | IDN | 553K | 37.83 / 0.9600 | 33.30 / 0.9148 | 32.08 / 0.8985 | 31.27 / 0.9196 |
| ×2 | CARN | 964K | 37.76 / 0.9590 | 33.52 / 0.9166 | 32.09 / 0.8978 | 31.51 / 0.9312 |
| ×2 | BSRN (Ours) | 594K | 37.78 / 0.9591 | 33.43 / 0.9155 | 32.11 / 0.8983 | 31.92 / 0.9261 |
| ×3 | VDSR | 666K | 33.66 / 0.9213 | 29.77 / 0.8314 | 28.82 / 0.7976 | 27.14 / 0.8279 |
| ×3 | DRCN | 1,774K | 33.82 / 0.9226 | 29.76 / 0.8311 | 28.80 / 0.7963 | 27.15 / 0.8276 |
| ×3 | DRRN | 298K | 34.03 / 0.9244 | 29.96 / 0.8349 | 28.95 / 0.8004 | 27.53 / 0.8378 |
| ×3 | MemNet | 686K | 34.09 / 0.9248 | 30.00 / 0.8350 | 28.96 / 0.8001 | 27.56 / 0.8376 |
| ×3 | DSRN | 1,200K | 33.88 / 0.922 | 30.26 / 0.837 | 28.81 / 0.797 | 27.16 / 0.828 |
| ×3 | IDN | 553K | 34.11 / 0.9253 | 29.99 / 0.8354 | 28.95 / 0.8013 | 27.42 / 0.8359 |
| ×3 | CARN | 1,149K | 34.29 / 0.9255 | 30.29 / 0.8407 | 29.06 / 0.8034 | 27.38 / 0.8404 |
| ×3 | BSRN (Ours) | 779K | 34.32 / 0.9255 | 30.25 / 0.8404 | 29.07 / 0.8039 | 28.04 / 0.8497 |
| ×4 | VDSR | 666K | 31.35 / 0.8838 | 28.01 / 0.7674 | 27.29 / 0.7251 | 25.18 / 0.7524 |
| ×4 | DRCN | 1,774K | 31.53 / 0.8854 | 28.02 / 0.7670 | 27.23 / 0.7233 | 25.14 / 0.7510 |
| ×4 | LapSRN | 872K | 31.54 / 0.885 | 28.19 / 0.772 | 27.32 / 0.728 | 25.21 / 0.756 |
| ×4 | DRRN | 298K | 31.68 / 0.8888 | 28.21 / 0.7720 | 27.38 / 0.7284 | 25.44 / 0.7638 |
| ×4 | MemNet | 686K | 31.74 / 0.8893 | 28.26 / 0.7723 | 27.40 / 0.7281 | 25.50 / 0.7630 |
| ×4 | DSRN | 1,200K | 31.40 / 0.883 | 28.07 / 0.770 | 27.25 / 0.724 | 25.08 / 0.717 |
| ×4 | IDN | 553K | 31.82 / 0.8903 | 28.25 / 0.7730 | 27.41 / 0.7297 | 25.41 / 0.7632 |
| ×4 | CARN | 1,112K | 32.13 / 0.8937 | 28.60 / 0.7806 | 27.58 / 0.7349 | 26.07 / 0.7837 |
| ×4 | BSRN (Ours) | 742K | 32.14 / 0.8937 | 28.56 / 0.7803 | 27.57 / 0.7353 | 26.03 / 0.7835 |
Finally, we compare the performance of the multi-scale BSRN model with the other state-of-the-art super-resolution methods, including VDSR, DRCN, LapSRN, DRRN, MemNet, DSRN, IDN, and CARN. The DRCN, DRRN, MemNet, DSRN, and CARN models contain parameter-sharing parts. The VDSR, LapSRN, and IDN methods are also included in the comparison, since they have been recently proposed and have similar numbers of model parameters to ours.
Table 2 shows the performance of the state-of-the-art methods and ours in terms of the PSNR and SSIM values on the four benchmark datasets. The number of model parameters required to obtain the super-resolved image with the given upscaling factor for each method is also provided. First, the BSRN model outperforms the other methods that do not employ any recursive operations or parameter-sharing, including VDSR, LapSRN, and IDN. For example, our method achieves a quality gain of 0.31 dB for a scale factor of 2 on the BSD100 dataset over the LapSRN model. This confirms that recursive processing helps to obtain better super-resolved images while keeping the number of model parameters small.
In addition, our model employs far fewer parameters than DRCN, DSRN, and CARN. For instance, the BSRN model uses up to 70% fewer model parameters than the DRCN model. Nevertheless, our proposed model outperforms DRCN and DSRN, and shows comparable performance to CARN. In particular, BSRN shows almost the same performance as CARN despite the smaller model size. This indicates that the proposed method handles the image features better than the other state-of-the-art methods.
Figure 6 provides a showcase of the images reconstructed by our proposed model and the other state-of-the-art methods. The figure shows that the BSRN model is highly reliable in recovering textures from the low-resolution images. For example, our method successfully upscales fine details of the structures in the Urban100 dataset, which results in clearer outputs, while the other methods produce highly blurred images or images containing large amounts of artifacts. This confirms that the BSRN model produces visually pleasing super-resolved images.
In this paper, we introduced the BSRN model, which employs a novel recursive operation using the block state, for super-resolution tasks. We explained the benefits and efficiency of our model in terms of the number of model parameters, quality measures (i.e., PSNR and SSIM), and speed. In addition, comparison with the other state-of-the-art methods showed that our method can generate upscaled images of better quality than the others.
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: A system for large-scale machine learning. In: Proceedings of the USENIX Symposium on Operating Systems Design and Implementation. pp. 265–283 (2016)
Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: Dataset and study. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 126–135 (2017)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
Salvador, J., Perez-Pellitero, E.: Naive Bayes super-resolution forest. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 325–333 (2015)