Single-image super-resolution (SISR) is a well-studied computer vision problem whose goal is to create a high-resolution image from a single low-resolution image. By its nature, it is an ill-posed problem. Starting with the seminal work of Dong et al., the problem has been addressed with deep-learning approaches. Dong's model used a CNN with only 3 layers and beat the traditional approaches. Later, to decrease the computational load, FSRCNN proposed postponing the upscaling to the end of the network so that most of the computation and feature extraction is done at low resolution. Shi et al. proposed ESPCN, which replaced the transposed-convolution layer with a Depth2Space operator. Later, Kim et al. proposed VDSR, a 20-layer network, and showed that increasing the number of parameters can improve a network's performance. EDSR, proposed by Lim et al., further improved the state of the art by increasing the number of layers and omitting BatchNorm layers from the network. Later, Yu et al. proposed WDSR, a network with 75M parameters and improved super-resolution results. Indeed, increasing the number of parameters improves the performance of a network, but it also makes the network harder to use in many practical real-time scenarios. For these reasons, researchers started working on efficient models, which aim to match the image-reconstruction performance of millions-of-parameter networks while still being applicable in real-time scenarios [33, 20]. To decrease the number of parameters, recursive networks have been employed [14, 29], but the number of FLOPS is very high for these networks. Besides this work, some works incorporate the attention idea into the SISR domain, such as [24, 34], which increases the receptive field and hence the performance of the network while keeping the parameter count low, at the cost of an increased number of operations.
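The Depth2Space (pixel-shuffle) operator introduced by ESPCN can be sketched as follows; the layer sizes here are illustrative, not ESPCN's exact configuration. Convolutions run at low resolution, and the final shuffle rearranges each group of r*r channels into an r-times larger spatial grid:

```python
import torch
import torch.nn as nn

# Sketch of ESPCN-style sub-pixel upsampling (hypothetical layer sizes):
# all convolutions operate at low resolution, and a final Depth2Space
# (PixelShuffle) rearranges r*r channel groups into an r-times larger grid.
class SubPixelUpsampler(nn.Module):
    def __init__(self, in_channels: int = 3, scale: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels * scale ** 2,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)  # the Depth2Space operator

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(x))

x = torch.randn(1, 3, 16, 16)        # low-resolution input
y = SubPixelUpsampler(scale=3)(x)    # 3x super-resolved output
print(y.shape)                       # torch.Size([1, 3, 48, 48])
```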
In this context, Hui et al. proposed IDN, which uses a channel-splitting method to separate high-level features from low-level ones while keeping the number of parameters low and maintaining acceptable performance. IMDN further investigated the channel-splitting idea at a finer granularity and improved both performance and inference speed. Besides channel splitting, IMDN employed Intermediate Information Collection (IIC) at the global level to accumulate information from the different information multi-distilling blocks (IMDBs), and inside the IMDBs it used a Progressive Refinement Module (PRM), which splits the outputs of the convolution layers such that a portion of the information flows directly to the end of the block while the rest is fed to the next Conv2d layer for further refinement.
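The progressive-refinement idea above can be sketched as follows; the channel counts and number of steps are illustrative assumptions, not IMDN's exact configuration. At each step a portion of the channels is "distilled" and sent straight to the block output, while the remainder is refined by the next convolution:

```python
import torch
import torch.nn as nn

# Minimal sketch of IMDN-style progressive refinement (channel counts are
# illustrative): each step splits off `distill` channels that flow directly
# to the block output; the rest is refined by the next convolution layer.
def progressive_refine(x, convs, distill):
    retained = []
    for conv in convs:
        x = torch.relu(conv(x))
        # split off `distill` channels; they flow straight to the block end
        d, x = torch.split(x, [distill, x.shape[1] - distill], dim=1)
        retained.append(d)
    retained.append(x)                  # keep the final remainder as well
    return torch.cat(retained, dim=1)   # aggregate all distilled portions

channels, distill = 64, 16
convs, c = nn.ModuleList(), channels
for _ in range(3):                      # each conv sees fewer channels
    convs.append(nn.Conv2d(c, c, 3, padding=1))
    c -= distill

out = progressive_refine(torch.randn(1, channels, 8, 8), convs, distill)
print(out.shape)  # torch.Size([1, 64, 8, 8])
```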
Although IMDN is an efficient and well-performing network, its global information fusion module (IIC) and IMDB blocks are not ideal, and there is still room for improvement. To this end, following the Network-in-Network and Inception spirit, we propose the Global Progressive Refinement Module (GPRM), an extension of the PRM to the global setting, in place of the IIC module. Using the GPRM gives us the flexibility to control the number of parameters while still integrating mid-level information into the end of the network. To further reduce the number of parameters and operations, we propose Grouped Information Distilling Blocks (GIDB), building blocks that employ grouped convolutions. Using grouped convolutions increases the room for further optimization during deployment. Furthermore, by incorporating block-based non-local attention (NLA) blocks at the global level, we further improve the performance of the proposed model.
The reconstruction efficiency of the model is demonstrated on various datasets, and its inference efficiency is demonstrated using NVIDIA TensorRT, since it is training-framework agnostic and optimizes the network for the hardware at hand.
2 Related Works
As with many computer vision problems, SISR has benefited greatly from recent advancements in deep learning. The first deep-learning SISR model was the work of Dong et al. Later, by postponing the upscaling stage to the end of the network and processing the input image at a lower resolution, FSRCNN improved the inference speed; it also replaced the ReLU activation with PReLU. Later, VDSR introduced a deeper network with a long upscaling skip connection, showing that deeper networks improve performance and that long skip connections help with optimization. The same spirit continued with recursive architectures, where a shared-parameter sub-network is applied repeatedly, at the cost of increased operations, to solve the SISR problem. LapSRN aimed at efficient super-resolution and used Laplacian pyramids to progressively extract features and reconstruct images at different scales with the same network. EDSR improved the reconstruction results by eliminating Batch Normalization layers from the network and increasing the number of parameters to 43M. WDSR further increased the parameters of the model to 75M and improved on EDSR's results. RDN used DenseNet-style intermediate feature aggregation with residual blocks. More recently, researchers have incorporated new ideas (such as grouped convolutions and attention layers) into super-resolution networks [24, 3, 4]. One obvious conclusion from these advancements is that as the number of parameters increases, so does model performance; however, this comes at the cost of the model becoming impractical to deploy. For these reasons, research interest in SISR has recently shifted towards building efficient models [33, 20]. IDN follows this spirit: it uses channel splitting to distil features efficiently. IMDN further improves on this idea, using channel splitting at a finer granularity and proposing the information multi-distilling block (IMDB), which also includes a contrast-aware channel attention (CCA) layer. At the global level, the distilled information from the IMDBs is aggregated using Intermediate Information Collection (IIC). In this type of information collection, information from intermediate levels flows directly to the end of the model. Indeed, this can be seen as a subset of the information-collection scheme used in DenseNet and RDN, where the DenseNet structure in RDN allows intermediate-to-intermediate flow as well.
The problem of a deep-learning model not being practically applicable arises in other fields as well. Because of this, researchers have proposed different approaches to make a model run in real time, such as hand-picked architectures/blocks, network pruning/sparsification, knowledge distilling, quantization, and network architecture search (NAS).
Hand-picked architectures focus on manually designed architectures and blocks. Network sparsification and pruning, such as , follow a different approach and try to eliminate the redundancies in a larger network to arrive at a more efficient one. Knowledge distilling  uses a heavy teacher and a lighter student network in a setting where the teacher guides the student. Quantization, such as , focuses on the deployment side and tries to maintain network performance under lighter arithmetic operations. Network architecture search  goes beyond these ideas and tries to find the network architecture itself in an optimization setting.
Indeed, these ideas can be used to design super-resolution networks as well. For this purpose, Li et al.  proposed a differentiable pruning model. Their method reduced the number of parameters, FLOPS, and run-time of EDSR Baseline  and several other networks by a significant amount. In , Li et al. proposed layer-wise differentiable network architectures to adjust the channel sizes of predefined networks and successfully reduced the number of parameters of EDSR Baseline while improving its performance. Song et al.  proposed an evolutionary network-search algorithm for efficiently searching residual dense blocks for super-resolution networks. Wu et al.  proposed a trilevel NAS algorithm for optimizing the networks, cells, and kernels of super-resolution networks at the same time. In , Li et al. followed a different approach for reducing the number of parameters, learning a filter basis for convolutional layers. Their method compresses the number of parameters of EDSR Baseline by up to 93%.
While designing IMDeception, we followed a manual approach, since the other approaches can still be applied to push its limits further.
3 Proposed Method
In this section, we describe the details of the proposed network. As mentioned before, the main motivation of this paper is efficiency, while keeping performance at a level comparable to million-parameter networks. As a starting point, IMDN  is selected as the baseline of our work. The original IMDN architecture can be seen in Fig. 2(a).
Variations of this network are already known to be high-performing [20, 33, 23], and improving it is challenging because the mechanisms it already employs, such as the Progressive Refinement Module (PRM) (Fig. 2(c)), Contrast-Aware Channel Attention (CCA) (Fig. 2(c)), and Intermediate Information Collection (IIC) (Fig. 2(a)), are very efficient. These modules were studied further in the original work, and each module's contribution to the final model was noted. The individual contributions of each module can be seen in Tab. 1.
Note that in IMDN, the PRM and CCA are used locally inside the IMDBs, whereas IIC is used in a more global setting. Also note that the improvement provided by the PRM is much larger than that of the CCA and IIC; furthermore, the number of parameters drops with the PRM. Motivated by these facts, and inspired by the Inception network's repeated structure, we created a network in which the PRM is repeated locally inside the blocks and globally among the blocks, improving performance and reducing the number of parameters. This is done in such a way that the IIC in the global setting is replaced with the proposed Global PRM (Fig. 1(a)). Furthermore, CCA layers are used in every IMDB, but their performance contributions are marginal compared to the number of operations and parameters they add to the network. However, since attention layers are great at increasing the receptive field, we decided to use a limited number of block-based non-local attention blocks  in our proposed network's main path. To further reduce the number of parameters and operations of the network, every single Conv2D operation inside the IMDB is replaced with Gblocks (Fig. 1(b)), as in XLSR, which are based on grouped convolutions. We call these grouped-convolution-based structures Grouped Information Distilling Blocks (GIDB). Although grouped convolutions are not well optimized in training frameworks, if utilized correctly within an inference-oriented framework they can lead to speed-ups, as noted in [7, 3], especially on mobile devices, where efficient network structures are usually employed.
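A Gblock of the kind described above (a grouped 3x3 convolution followed by a 1x1 convolution that mixes information across groups) can be sketched as follows; the channel width of 64 is an assumption for illustration. The comparison shows why grouping reduces the parameter count relative to a dense 3x3 convolution:

```python
import torch
import torch.nn as nn

# Sketch of a Gblock: a grouped 3x3 convolution followed by a 1x1
# convolution that lets information flow between the groups.
# Channel width (64) is an illustrative assumption.
class GBlock(nn.Module):
    def __init__(self, channels: int = 64, groups: int = 4):
        super().__init__()
        self.grouped = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=1, groups=groups)
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.mix(self.grouped(x))

block = GBlock(channels=64, groups=4)
dense = nn.Conv2d(64, 64, 3, padding=1)       # ordinary dense 3x3 conv
gb_params = sum(p.numel() for p in block.parameters())
dense_params = sum(p.numel() for p in dense.parameters())
y = block(torch.randn(1, 64, 8, 8))           # shape is preserved
print(gb_params, dense_params)                # the Gblock is much smaller
```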
Mathematically, the model can be described as follows. Given a low-resolution image $I_{LR}$, the super-resolved image $I_{SR}$ can be obtained as:

$$I_{SR} = \mathcal{M}(I_{LR})$$
Here, $\mathcal{M}$ is our proposed optimized super-resolution model. At the beginning of the network, a 64-channel 3x3 convolution is employed for feature extraction, as in IMDN; let $F_0$ represent these features. These features are both transferred to the end of the network and processed in the Global Progressive Refinement Module as follows:
In the above equations, $S$, $\mathcal{C}$, and $\mathrm{NLA}$ represent 3:1-ratio channel splitting, channel concatenation, and block-based non-local attention, respectively. $F_i$ are the channel-split features of block $B_i$, our proposed Grouped Information Distilling Block (GIDB). Note that here the GPRM is used for global feature distilling and aggregation, operating on the outputs of the GIDBs. At the local level, features are further processed by the GIDBs as follows:
Here, $\mathrm{Conv}_{1\times1}$ represents the 1x1 convolution operation used for information fusion. Note that at the local level, input features are processed and refined in a grouped fashion using Gblocks. A Gblock is implemented as a 3x3 grouped convolution (groups=4) followed by a cascaded 1x1 convolution that allows information flow between the groups. Grouping the information and processing the features in a grouped fashion reduces the number of parameters at almost no cost in performance. The detailed implementation, along with the activation functions used, can be seen in Fig. 1(b).
The output of the GPRM module, $F_{GPRM}$, is further processed to construct the super-resolved image as follows:
Here, $\mathrm{Conv}_{1\times1}$ and $\mathrm{Conv}_{3\times3}$ represent 1x1 and 3x3 convolution layers, respectively. We used Leaky ReLU (slope=0.05) activations. $\mathrm{Up}$ is the upsampling layer, implemented as shown in Fig. 2(b).
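The reconstruction stage described above can be sketched as follows; the channel width and the exact layer ordering are assumptions for illustration. A 1x1 convolution fuses the GPRM output, a 3x3 convolution with Leaky ReLU (slope=0.05) refines it, and a pixel-shuffle upsampler produces the super-resolved image:

```python
import torch
import torch.nn as nn

# Hedged sketch of the reconstruction tail (channel width and ordering
# assumed): 1x1 fusion conv -> LeakyReLU(0.05) -> 3x3 conv producing
# 3*scale^2 channels -> PixelShuffle upsampling to the SR image.
class ReconstructionTail(nn.Module):
    def __init__(self, channels: int = 64, scale: int = 4):
        super().__init__()
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.LeakyReLU(0.05)
        self.refine = nn.Conv2d(channels, 3 * scale ** 2,
                                kernel_size=3, padding=1)
        self.up = nn.PixelShuffle(scale)

    def forward(self, f):
        return self.up(self.refine(self.act(self.fuse(f))))

f = torch.randn(1, 64, 32, 32)       # GPRM output features
sr = ReconstructionTail()(f)         # 4x super-resolved RGB image
print(sr.shape)                      # torch.Size([1, 3, 128, 128])
```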
Our proposed network structure, which we call IMDeception and which combines all of these ideas, can be seen in Fig. 2.
Note that we use a global PRM among the GIDBs and a local PRM, as in IMDB, inside each GIDB. Our proposed architecture defines a class of highly efficient architectures sharing the same structure but with different channel numbers on the filters. As can be seen from Fig. 2, we define the complexity of the models using a channel-width parameter, which can be adjusted depending on the needs. From our experiments, we have observed that even the variant with no attention blocks still shows high reconstruction performance with excellent inference timings. The performance of various IMDeception networks, varying this parameter and the presence of attention blocks, can be seen in Tab. 2.
| Model | IMDeception | + NLA | | + NLA | |
|---|---|---|---|---|---|
| Div2K Val. (PSNR) | 29.02 | 28.82 | 28.70 | 28.48 | 28.45 |
4.2 Training Details
The proposed model () was trained in two phases, with the same general settings used in both.
For the first phase, we used the Charbonnier loss as in Eq. 5 and trained for 2000 epochs, which took 2 days and 7 hours on a single NVIDIA Tesla V100. See Fig. 4 for the training curves and learning-rate policy.
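The Charbonnier loss is a smooth, differentiable approximation of the L1 loss; a minimal sketch is given below. The epsilon value here is a commonly used choice and is an assumption, not necessarily the value used in the paper's Eq. 5:

```python
import torch

# Charbonnier loss: sqrt((x - y)^2 + eps^2), averaged over all elements.
# A smooth approximation of L1; eps=1e-3 is a common (assumed) choice.
def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor,
                     eps: float = 1e-3) -> torch.Tensor:
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

pred = torch.zeros(2, 3, 8, 8)
target = torch.ones(2, 3, 8, 8)
loss = charbonnier_loss(pred, target)
print(float(loss))  # ~1.0 for a unit per-pixel error
```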
The second phase of training started from the best checkpoint; this time the L2 norm was used as the loss function, and the model was trained for 1300 epochs, which took 1 day and 16 hours.
In this section, the proposed architecture's PSNR results are given on various datasets. The PSNR results of IMDeception and other state-of-the-art methods can be seen in Tab. 3. The experiments show that, although IMDeception () has a very limited number of parameters and FLOPS, it performs on par with state-of-the-art algorithms. In particular, IMDeception's performance on the Urban100 and Manga109 datasets is well above that of E-RFDN, IMDN, CARN, and LapSRN, and is surpassed only by EDSR (which has 43M parameters). An interesting result is that IMDeception ()'s PSNR on Urban100 and Manga109 surpasses LapSRN although it has only 7% of its parameters. The number-of-parameters and PSNR results of these methods are best seen in Fig. 5.
Another important property of IMDeception is its precise reconstruction of repeated structures and patterns, which can be seen in Fig. 6.
The Div2K test-set PSNR is 28.73.
In terms of run-time, our proposed method has great potential for optimization on edge devices, thanks to its parallel grouped convolutions and reduced number of parameters. As can be seen from Tab. 3, the proposed model defines a set of efficient architectures that can be used on different devices, with different inference run-times and good reconstruction performance. As a reference, and as an indication of this potential, we ran our proposed models on an NVIDIA RTX 2080 Super and an NVIDIA Jetson Xavier AGX (30W). To do this, we converted the trained models to ONNX format and used NVIDIA's TensorRT engine builder to convert them into inference engines that exploit the hardware's full potential. The run-times are listed in Tab. 4. Note that IMDeception can run on this edge device at up to 24 fps while outputting high-resolution 2K images. An important conclusion from the run-time experiments is that although the number of parameters and FLOPS of IMDeception is lower than that of the compared model, its inference run-times are higher. This is because GPUs are usually optimized for channel sizes that are powers of 2; since 12 is not a power of 2, additional processing is required on the GPU, negating the benefits of the reduced number of parameters and FLOPS. This is an important conclusion, since this phenomenon is not observable during inference with a training framework such as PyTorch.
Note the increased inference time: 12 is not a power of 2, and GPUs are optimized with kernels specific to power-of-2 sizes.
We proposed an efficient model based on the IMDN network, called IMDeception. IMDeception employs the proposed Global Progressive Refinement Module (GPRM), an extension of the Progressive Refinement Module (PRM). Unlike the PRM, which works only with Conv2d layers at the local scale, the GPRM can be used with any arbitrary block, as we did with the newly proposed Grouped Information Distilling Blocks (GIDB). Both of the proposed mechanisms/blocks can be used in different networks and structures. They are designed with efficiency in mind, reducing the number of parameters and FLOPS while maintaining high performance. The GPRM is an efficient way of combining features and can be an alternative to DenseNet-style or IIC-style feature-aggregation methods. One nice feature is that it separates the aggregated part from the distilled part, which helps control the network size while maintaining performance. The GIDB, on the other hand, uses grouped convolutions, which, if implemented with efficiency in mind, can provide a speed boost during inference. We also showed that the proposed model performs very well on various datasets and has great inference timings on different hardware, including the NVIDIA Jetson Xavier AGX.
-  Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., July 2017.
-  Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In Eur. Conf. Comput. Vis.
-  Mustafa Ayazoglu. Extremely lightweight quantization robust real-time single-image super resolution for mobile devices. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 2472–2479, 2021.
-  Ming Zhuo Chen and Jun Ming Wu. Group feature information distillation network for single image super-resolution. In 2021 7th International Conference on Computer and Communications (ICCC), pages 1827–1831, 2021.
-  Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. volume 8692, pages 184–199, 2014.
-  Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. Eur. Conf. Comput. Vis., 9906:391–407, 2016.
-  Perry Gibson, José Cano, Jack Turner, Elliot J. Crowley, Michael O’Boyle, and Amos Storkey. Optimizing grouped convolutions on edge devices, 2020.
-  Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Int. Conf. Comput. Vis., pages 1398–1406, 2017.
-  Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In Adv. Neural Inform. Process. Syst., 2015.
-  Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2261–2269, 2017.
-  Zheng Hui, Xinbo Gao, Yunchu Yang, and Xiumei Wang. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia. ACM, Oct. 2019.
-  Zheng Hui, Xiumei Wang, and Xinbo Gao. Fast and accurate single image super-resolution via information distillation network. In IEEE Conf. Comput. Vis. Pattern Recog., pages 723–731, 2018.
-  Nikhil Iyer, V. Thejas, Nipun Kwatra, Ramachandran Ramjee, and Muthian Sivathanu. Wide-minima density hypothesis and the explore-exploit learning rate schedule. CoRR, abs/2003.03977, 2020.
-  Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution, 2015.
-  Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. IEEE Conf. Comput. Vis. Pattern Recog., pages 1646–1654, 2016.
-  Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
-  Yawei Li, Shuhang Gu, Luc Van Gool, and Radu Timofte. Learning filter basis for convolutional neural network compression. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
-  Yawei Li, Shuhang Gu, Kai Zhang, Luc Van Gool, and Radu Timofte. DHP: differentiable meta pruning via hypernetworks. In Eur. Conf. Comput. Vis., 2020.
-  Yawei Li, Wen Li, Martin Danelljan, Kai Zhang, Shuhang Gu, Luc Van Gool, and Radu Timofte. The heterogeneity hypothesis: Finding layer-wise differentiated network architectures. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2144–2153, 2021.
-  Yawei Li, Kai Zhang, Luc Van Gool, Radu Timofte, et al. Ntire 2022 challenge on efficient super-resolution: Methods and results. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2022.
-  Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog., July 2017.
-  Min Lin, Qiang Chen, and Shuicheng Yan. Network in network, 2014.
-  Jie Liu, Jie Tang, and Gangshan Wu. Residual feature distillation network for lightweight image super-resolution. In Eur. Conf. Comput. Vis.
-  Ben Niu, Weilei Wen, Wenqi Ren, Xiangde Zhang, Lianping Yang, Shuzhen Wang, Kaihao Zhang, Xiaochun Cao, and Haifeng Shen. Single image super-resolution via a holistic attention network. Int. Conf. Comput. Vis., 2020.
-  Tao Sheng, Chen Feng, Shaojie Zhuo, Xiaopeng Zhang, Liang Shen, and Mickey Aleksic. A quantization-friendly separable convolution for mobilenets. CoRR, abs/1803.08607, 2018.
-  Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. IEEE Conf. Comput. Vis. Pattern Recog., pages 1874–1883, 2016.
-  Dehua Song, Chang Xu, Xu Jia, Yiyi Chen, Chunjing Xu, and Yunhe Wang. Efficient residual dense block search for image super-resolution. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):12007–12014, Apr. 2020.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1–9, 2015.
-  Ying Tai, Jian Yang, and Xiaoming Liu. Image super-resolution via deep recursive residual network. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
-  Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks, 2017.
-  Yan Wu, Zhiwu Huang, Suryansh Kumar, Rhea Sanjay Sukthanker, Radu Timofte, and Luc Van Gool. Trilevel neural architecture search for efficient single image super-resolution. CoRR, abs/2101.06658, 2021.
-  Jiahui Yu, Yuchen Fan, Jianchao Yang, Ning Xu, Zhaowen Wang, Xinchao Wang, and Thomas S. Huang. Wide activation for efficient and accurate image super-resolution. IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2018.
-  Kai Zhang, Martin Danelljan, Yawei Li, Radu Timofte, et al. Aim 2020 challenge on efficient super-resolution: Methods and results. In Computer Vision – ECCV 2020 Workshops, pages 5–40, 2020.
-  Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Eur. Conf. Comput. Vis., pages 294–310, 2018.
-  Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
-  Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. IEEE Conf. Comput. Vis. Pattern Recog., pages 8697–8710, 2018.