FreqNet: A Frequency-domain Image Super-Resolution Network with Discrete Cosine Transform

by   Runyuan Cai, et al.

Single image super-resolution (SISR) is an ill-posed problem that aims to obtain a high-resolution (HR) output from a low-resolution (LR) input, during which extra high-frequency information is supposed to be added to improve the perceptual quality. Existing SISR works mainly operate in the spatial domain by minimizing the mean squared reconstruction error. Despite the high peak signal-to-noise ratio (PSNR) results, it is difficult to determine whether the model correctly adds the desired high-frequency details. Some residual-based structures have been proposed to guide the model to focus on high-frequency features implicitly. However, how to verify the fidelity of those artificial details remains a problem, since the interpretation from spatial-domain metrics is limited. In this paper, we propose FreqNet, an intuitive pipeline from the frequency-domain perspective, to solve this problem. Inspired by existing frequency-domain works, we convert images into discrete cosine transform (DCT) blocks, then reform them to obtain the DCT feature maps, which serve as the input and target of our model. A specialized pipeline is designed, and we further propose a frequency loss function to fit the nature of our frequency-domain task. Our SISR method in the frequency domain can learn the high-frequency information explicitly and provides fidelity and good perceptual quality for the SR images. We further observe that our model can be merged with other spatial super-resolution models to enhance the quality of their original SR output.



1 Introduction

Single image super-resolution (SISR) aims to recover high-frequency details for a high-resolution (HR) image from one of its degraded low-resolution (LR) versions. After years of development, SISR has been widely used in many computer vision tasks, such as media content enhancement (Glasner et al., 2009), medical imaging (Oktay et al., 2016) and satellite imaging (Yıldırım and Gungor, 2012). Traditional state-of-the-art SR methods mainly adopt the example-based strategy (Glasner et al., 2009), exploiting internal similarities or learning a mapping from an external dictionary. Sparse-coding-based SR (Yang et al., 2008) is one of the most representative methods.

Recently, deep convolutional neural network (CNN) based SISR methods have achieved significant improvements over traditional methods. Deep learning-based methods treat this problem as a dense image regression task, learning an end-to-end mapping function between LR and HR images that is represented by a CNN. Dong et al. (2016a) proposed SRCNN, which first brought deep learning into SISR by using a three-layer CNN to represent the mapping function. The residual block (He et al., 2015) was later introduced into SISR in SRResNet (Ledig et al., 2017) and improved in EDSR (Lim et al., 2017). The residual block makes it possible to build deeper or wider networks. Zhang et al. (2018b) and Tong et al. (2017) adopted dense blocks (Huang et al., 2017) to combine features from different levels. Zhang et al. (2018a) improved the residual block by adding channel attention. Building on the progress of non-blind methods, blind super-resolution methods (Liu et al., 2021), which aim at complex degradation models in real scenarios, have received increasing attention recently.

The SISR methods mentioned above commonly minimize the mean squared error (MSE) between the recovered SR image and the HR ground truth as the optimization target. Minimizing spatial MSE also maximizes the peak signal-to-noise ratio (PSNR), which is a common measure used to evaluate SR algorithms. However, such a pipeline often produces blurry results, because the high-frequency textures are largely destroyed in the degradation process and are hard to predict. SISR approaches based on generative adversarial networks (GANs) (Goodfellow et al., 2020) have been proposed to relieve this problem. However, the unpleasant hallucinations and artifacts caused by GANs pose further challenges. Zhang et al. (2018a) proposed a residual-in-residual (RIR) structure to bypass the redundant low-frequency information through multiple skip connections, implicitly guiding the network to focus on learning high-frequency information. However, since the commonly used PSNR and structural similarity index measure (SSIM) are based on per-pixel error and global image information, respectively, their sensitivity to high-frequency details is limited. To the best of our knowledge, current spatial-domain methods do not have an explicit approach for learning high-frequency information and verifying the fidelity of the artificial details in the output.

To practically resolve this problem, we propose FreqNet, a frequency-domain-based super-resolution network, to directly learn the reconstruction of high-frequency features. The proposed network contains two parallel flows, the Spatial Extraction Network (SEN) and the Frequency Reconstruction Network (FRN), in order to make use of both domains' information. We first convert both LR and HR images to frequency coefficients using the discrete cosine transform (DCT) (Rao and Yip, 1990), then reshape them to obtain DCT feature maps. The SEN takes standard LR images as input and passes them through the spatial feature reconstruction trunk and the down-sampling shrinking trunk to obtain one component of the target HR DCT feature maps. The FRN operates purely in the frequency domain: it takes LR DCT feature maps as input and passes them through the frequency-domain reconstruction trunk to obtain the other component. The weighted sum of the two components forms our final frequency-domain output, which can be converted to the SR image through the inverse discrete cosine transform (iDCT) (Rao and Yip, 1990). Because this output is organized as separate frequency channels, we can easily merge it with the output of any other SR model to enhance that model's high-frequency details. We further propose the depth-wise residual block (DWRB) and the deformable residual block (DRB), implemented in the FRN and the SEN respectively, to better use the characteristics of the frequency-domain feature maps. As the ability of spatial MSE (and PSNR) to capture high-frequency detail is very limited, we propose a frequency-domain loss function to evaluate the quality of the output SR image.

Overall, our contributions are three-fold: (1) We propose FreqNet, a frequency-domain-based SISR network, to learn the high-frequency features explicitly with a specially designed pipeline. Our network can produce perceptually satisfying results with high-fidelity details. (2) We propose depth-wise residual block structure and deformable residual block structure to fit the nature of frequency-domain feature extraction. Both structures can improve our network’s reconstruction quality and feature extraction ability. (3) We propose a frequency-domain loss function and a corresponding metric that measures output quality from the accuracy of high-frequency detail reconstruction.

2 Related Works

Numerous deep learning-based image SR methods have been studied in the computer vision community. Here we focus on CNN-based methods and on works on frequency-domain learning.

2.1 Image Super-Resolution with CNN

Numerous methods have proven the effectiveness of the CNN-based pipeline on image super-resolution tasks. The pioneering work was done by Dong et al. (2016a), whose SRCNN achieved superior performance over previous works. Kim et al. proposed VDSR (Kim et al., 2016a) and DRCN (Kim et al., 2016b), introducing residual learning to ease the training difficulty and significantly improve accuracy. Tai et al. introduced recursive blocks in DRRN (Tai et al., 2017b) and memory blocks in MemNet (Tai et al., 2017a). A faster network structure, FSRCNN (Dong et al., 2016b), was proposed to accelerate the pipeline of SRCNN. Ledig et al. (2017) introduced ResNet (He et al., 2015) to construct a deeper network, SRResNet, for image SR. They also proposed SRGAN, with perceptual losses (Johnson et al., 2016) and a generative adversarial network (GAN) (Goodfellow et al., 2020), for photo-realistic SR. This GAN-based model was further developed in ESRGAN (Wang et al., 2018), which confirmed that dropping the batch normalization layers can result in better performance. Although SRGAN and ESRGAN can alleviate blurring and over-smoothing artifacts, their predicted results may not be faithfully reconstructed and may contain unpleasing artifacts. By removing unnecessary modules in conventional residual networks, Lim et al. (2017) proposed EDSR and MDSR, which achieve significant improvement. Zhang et al. (2018a) introduced channel attention to the residual block. Blind-SR methods have also received increasing attention recently, aiming at complex degradation models in real scenarios by estimating the degradation kernel using an extra module (Gu et al., 2019). The methods of Zhang et al. (2021), Bell-Kligler et al. (2019) and Wang et al. (2021) achieved state-of-the-art performance in real-world scenarios with multiple modelling strategies.

However, all these CNN-based methods operate on the spatial domain. The information on the frequency domain is not directly used, though the recovery of high-frequency information is precisely the target of the image super-resolution task.

2.2 Frequency-Domain based Deep Learning

Projecting images to the frequency domain provides a new perspective for various computer vision tasks, and remarkable performance has been achieved in some frequency-domain works. Torfason et al. (2018), Xu et al. (2018) and Wu et al. (2018) jointly train auto-encoder-based networks on compression and inference tasks with frequency-domain input. Gueguen et al. (2018) extract features from the frequency domain to classify images. Ehrlich and Davis (2019) propose a model conversion algorithm to convert spatial-domain CNN models to the frequency domain. Xu et al. (2020) propose a method of learning in the frequency domain using DCT-based sparse image representations, showing that frequency-domain information can be used directly in current CNN models without a complex model transition procedure. Nash et al. (2021) further translate the DCT representation into a sequence of DCT channel, spatial location, and DCT coefficient triples, and achieve state-of-the-art performance on image generation and restoration tasks with a Transformer-based auto-regressive architecture.

The essence of the SR task is to recover the information of high-frequency channels in the image. Hence, frequency-domain features are informative for HR reconstruction and can potentially enhance the performance with proper methods. However, there is no existing SR method using the characteristics of DCT feature maps. Hence, we propose a super-resolution pipeline on the DCT domain, which we will present in detail in the following section.

Precisely, the overall architecture is given in Figure 1. In Section 3.1, we introduce the image conversion process that projects the spatial image to the frequency domain. In Sections 3.2 and 3.3, we explain the architecture and components of FreqNet in detail. The proposed frequency-domain loss function is presented in Section 3.4.

3 Method

Figure 1: The architecture of FreqNet. Our FreqNet contains two parallel data flows: the Spatial Extraction Network (SEN) and the Frequency Reconstruction Network (FRN), taking the LR image and the LR DCT feature maps as input, respectively. The final output is the weighted sum of the predicted DCT feature maps from the two sub-networks. The loss is computed between the GT DCT feature maps (at the top right of the figure) and the final output.

We propose a frequency-domain-based pipeline for image super-resolution. Our method consists of an image conversion process that converts the spatial image to the frequency domain and a specialized network for training with the frequency-domain information. As shown in Figure 1, our proposed network consists of two parallel sub-networks, which operate on spatial-domain and frequency-domain inputs respectively, to make use of both domains' information. We first explain the image conversion process in the following section. Details of the architecture are discussed in Section 3.2.

3.1 Image Conversion to the Frequency Domain

Following the JPEG codec, we first transform the original RGB images to the zero-centered, normalized YCbCr color space, which contains a brightness component Y (luma) and two color components Cb and Cr (chroma). Then we upsample the LR image to the same size as the HR image.
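The color conversion step can be illustrated with a minimal sketch. The coefficients are the full-range JPEG (JFIF) RGB to YCbCr conversion; the function name `rgb_to_centered_ycbcr` and the normalization by 128 are assumptions made for illustration, since the paper does not state the exact scaling.

```python
def rgb_to_centered_ycbcr(r, g, b):
    """Map 8-bit RGB values to zero-centered (Y, Cb, Cr), roughly in [-1, 1]."""
    # Full-range JPEG (JFIF) conversion; Cb and Cr carry a +128 offset.
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    # Assumed normalization: subtract the mid-level 128, then divide by 128.
    return tuple((c - 128.0) / 128.0 for c in (y, cb, cr))


# Mid-gray maps to (approximately) the origin of the zero-centered space.
mid_gray = rgb_to_centered_ycbcr(128, 128, 128)
# Pure red has a strongly positive Cr and a negative Cb.
pure_red = rgb_to_centered_ycbcr(255, 0, 0)
```

Centering the values around zero matches the zero-centered pixel blocks that the DCT step expects.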

3.1.1 Generation of DCT Blocks

Figure 2: Converting pixel blocks to DCT blocks

To get frequency-domain information, we crop the images into pixel blocks of uniform size, then pass them through a Discrete Cosine Transform (DCT) module. The DCT projects an image onto a collection of cosine components that stand for different frequencies of 2D signals. Given a block size $M$, the two-dimensional DCT converts a zero-centered $M \times M$ pixel block $P$ into an $M \times M$ DCT block $D$:

$$D(u,v) = \frac{2}{M}\, C(u)\, C(v) \sum_{x=0}^{M-1} \sum_{y=0}^{M-1} P(x,y) \cos\left[\frac{(2x+1)u\pi}{2M}\right] \cos\left[\frac{(2y+1)v\pi}{2M}\right]$$

where $u$ and $v$ are the horizontal and vertical frequency indices in the DCT block, $x$ and $y$ are the horizontal and vertical pixel indices in the pixel block, and $C(k)$, equal to $1/\sqrt{2}$ for $k = 0$ and to $1$ otherwise, is a normalizing scale factor that enforces orthonormality.

For a standard DCT in the JPEG codec, the block size $M$ is 8, which means that any information in an $8 \times 8$ pixel block can be represented by a linear combination of 64 2D signals. However, in $\times 4$ image super-resolution, the $8 \times 8$ block is upscaled to $32 \times 32$. Thus we perform the DCT with block size 32. At this stage, both the LR and HR images are converted into frequency-domain blocks that contain the DCT information of 1024 frequency channels.
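The block transform can be sketched in a few lines. The following is a minimal pure-Python illustration of an orthonormal 2D DCT-II and its inverse, written as matrix products; the helper names (`dct_matrix`, `dct2`, `idct2`) are hypothetical, and a practical pipeline would more likely rely on an optimized library routine.

```python
import math


def dct_matrix(m):
    """Orthonormal DCT-II basis: row u samples the u-th cosine at m points."""
    rows = []
    for u in range(m):
        scale = math.sqrt(1.0 / m) if u == 0 else math.sqrt(2.0 / m)
        rows.append([scale * math.cos((2 * x + 1) * u * math.pi / (2 * m))
                     for x in range(m)])
    return rows


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct2(block):
    """2D DCT of a square zero-centered pixel block: D = C P C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """Inverse 2D DCT: P = C^T D C, valid because C is orthonormal."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


# A constant block has all of its energy in the DC coefficient D[0][0].
flat_block = [[0.5] * 8 for _ in range(8)]
flat_dct = dct2(flat_block)
```

The same functions apply unchanged to the 32 x 32 blocks used in the paper; only the block size passed in differs.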

3.1.2 Reforming DCT feature maps

Figure 3: Reforming DCT blocks to DCT feature maps

Given the limited perceptual ability of the human eye, not all information in the frequency range can be perceived. Many other DCT-based methods induce sparsity in DCT blocks through quantization. However, for the super-resolution task, we want to preserve as much information as possible. Thus, as illustrated in Figure 3, we perform a region selection on the DCT block: only the values inside the top-left $R \times R$ selected region are preserved for the next step. In practice, we choose $R = 10$ as a trade-off between training difficulty and perceptual quality. We will explain later how we handle the values outside the selected region.

Following the processing method proposed by Xu et al. (2020), we flatten the selected regions of the DCT blocks into DCT vectors of length $R^2$. Then we place these vectors at their corresponding spatial positions, forming a cuboid of size $\frac{H}{32} \times \frac{W}{32} \times R^2$, where $H$ and $W$ are the height and width of the original image. This cuboid is a collection of DCT feature maps: each channel along the third dimension is a frequency-domain feature map that contains the information of the frequency it represents.
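The reforming step can be sketched as follows. This is an illustrative version only: the function names are hypothetical, and the row-major flattening order inside the selected region is an assumption, since the ordering used by the authors is not given in the text.

```python
def select_and_flatten(dct_block, r):
    """Keep the top-left r x r region of one DCT block as a vector of length r*r."""
    # Assumed ordering: row-major, so channel index c = u * r + v.
    return [dct_block[u][v] for u in range(r) for v in range(r)]


def reform_to_feature_maps(block_grid, r):
    """Turn a grid of DCT blocks into DCT feature maps.

    block_grid[by][bx] is the DCT block at block position (by, bx).
    The result is a nested list of shape (H/M) x (W/M) x (r*r).
    """
    return [[select_and_flatten(block, r) for block in row] for row in block_grid]


# Toy example: a 2 x 3 grid of 4 x 4 blocks whose entries encode (u, v) as 10*u + v.
toy_block = [[10 * u + v for v in range(4)] for u in range(4)]
grid = [[toy_block for _ in range(3)] for _ in range(2)]
maps = reform_to_feature_maps(grid, r=2)
```

Each position of the resulting cuboid holds one flattened vector, so reading a fixed channel index across all positions gives one frequency-domain feature map.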

3.1.3 Channel-wise Normalization

We further perform normalization on each frequency channel. For channel $c$ of the frequency-domain feature maps $F$, we perform:

$$\hat{F}_c = \frac{F_c - \mu_c}{\sigma_c}$$

where $\mu_c$ and $\sigma_c$ denote the mean and standard deviation of channel $c$, pre-calculated on our training set.

Unlike quantization, this normalization process does not change the relative intensity of each feature map, thus guaranteeing the integrity of information. The purpose of this operation is to project the values to a suitable range for learning.
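A minimal sketch of the channel-wise normalization and its inverse is given below. The function names and the small `eps` guard against zero variance are assumptions added for illustration; they are not taken from the paper.

```python
import math


def channel_stats(vectors):
    """Per-channel mean and standard deviation over a list of DCT vectors."""
    n = len(vectors)
    channels = len(vectors[0])
    means = [sum(v[c] for v in vectors) / n for c in range(channels)]
    stds = [math.sqrt(sum((v[c] - means[c]) ** 2 for v in vectors) / n)
            for c in range(channels)]
    return means, stds


def normalize(vector, means, stds, eps=1e-8):
    """Standardize each channel with its pre-computed statistics."""
    return [(x - m) / (s + eps) for x, m, s in zip(vector, means, stds)]


def denormalize(vector, means, stds, eps=1e-8):
    """Exact inverse of normalize, used before the inverse DCT."""
    return [x * (s + eps) + m for x, m, s in zip(vector, means, stds)]


# Two toy DCT vectors with two channels of very different scale.
train_vectors = [[1.0, 10.0], [3.0, 30.0]]
means, stds = channel_stats(train_vectors)
normed = normalize([3.0, 30.0], means, stds)
restored = denormalize(normed, means, stds)
```

Because the mapping is affine per channel, it is invertible, which is why it preserves the information that quantization would discard.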

3.2 Architecture of FreqNet

As shown in Figure 1, our FreqNet contains two parallel data flows, the Spatial Extraction Network (SEN) and the Frequency Reconstruction Network (FRN), in order to make use of both domains' information.

The SEN takes the up-scaled LR image $I_{LR}$ as input. Only one convolutional layer is used to extract the shallow feature $F_0$ from the LR input. $F_0$ is then passed through the Reconstruction Trunk (RT), which contains a sequence of multiple Residual Groups (RG) (Zhang et al., 2018a) and Deformable Residual Groups (DRG), to convert the spatial feature maps into frequency-domain features $F_{RT}$. Then we feed $F_{RT}$ to the Shrinking Trunk (ST), which consists of 4 down-sampling convolution layers that gradually shrink the scale of the feature maps while maintaining the number of channels. The final output $O_{SEN}$ is one component of the target HR DCT feature maps. The overall process can be written as:

$$O_{SEN} = H_{ST}\left(H_{RT}\left(H_{conv}(I_{LR})\right)\right)$$

where $H_{conv}$ denotes the first convolution operation, and $H_{RT}$ and $H_{ST}$ denote the RT and ST structures.

The FRN operates purely in the frequency domain. We take the pre-processed LR DCT feature maps $D_{LR}$ as input and pass them through the frequency-domain reconstruction trunk (FRT), which contains a sequence of depth-wise residual groups (DWRG) and RGs, to obtain the other component of the target HR DCT feature maps, denoted $O_{FRN}$. A skip connection is added to take advantage of the similarities between the input and the target, thus drawing attention towards the difference in the high-frequency channels. The overall process can be written as:

$$O_{FRN} = H_{FRT}(D_{LR}) + D_{LR}$$

where $H_{FRT}$ denotes the FRT structure.

The outputs of the two sub-networks have the same size, and a weighted element-wise sum is applied to get the final output $O$:

$$O = \alpha\, O_{SEN} + \beta\, O_{FRN}$$

where $\alpha$ and $\beta$ are the pre-defined weights of the two components.

The output $O$ is further fed to a 2-stage inverse Discrete Cosine Transform (iDCT) module, which is the inverse of the data-processing pipeline defined in Section 3.1. We first project $O$ back to its original range of values by performing denormalization on each channel. Then, in stage 1, we reform the DCT feature maps back into $32 \times 32$ DCT blocks, in which the top-left $R \times R$ region holds the predicted values and the rest of the block is filled with information from the LR DCT blocks. In stage 2, we apply the iDCT to get the final SR image.
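The weighted sum and stage 1 of the inverse pipeline can be sketched as below. The function names, the toy weights of 0.5 and the row-major channel layout are assumptions for illustration; denormalization is expected to happen between the two steps and the iDCT itself afterwards.

```python
def weighted_sum(sen_vec, frn_vec, alpha, beta):
    """Element-wise weighted sum of the two sub-network outputs."""
    return [alpha * s + beta * f for s, f in zip(sen_vec, frn_vec)]


def restore_dct_block(pred_vec, lr_block, r):
    """Stage 1: rebuild a full DCT block from a predicted vector of length r*r.

    The top-left r x r region takes the predicted coefficients (row-major,
    an assumed layout); every other coefficient is copied from the LR block.
    """
    block = [row[:] for row in lr_block]  # copy so the LR block stays intact
    for u in range(r):
        for v in range(r):
            block[u][v] = pred_vec[u * r + v]
    return block


# Toy 4 x 4 LR block filled with -1, and a predicted 2 x 2 region.
lr_block = [[-1.0] * 4 for _ in range(4)]
pred = weighted_sum([1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], alpha=0.5, beta=0.5)
restored_block = restore_dct_block(pred, lr_block, r=2)
```

Filling the unselected coefficients from the up-sampled LR block keeps whatever high-frequency content the input already carries instead of zeroing it.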

3.3 Modified Residual Group

Figure 4: Different types of RG and RB implemented in FreqNet

Inspired by the success of residual groups (RG) in Zhang et al. (2018a), we take the RG as the basic module of our network. As shown in Figure 4, an RG is a sequence of residual blocks (RB) (Lim et al., 2017) with an in-group skip connection between the input and output features. The original RB can be written as:

$$F_n = F_{n-1} + H_{conv}\left(\mathrm{ReLU}\left(H_{conv}(F_{n-1})\right)\right)$$

where $H_{conv}$ denotes a convolution layer, $F_{n-1}$ is the feature from the previous block and $F_n$ is the feature passed to the next layer. As described in Section 3.1.3, the final output of the network should follow a zero-centered distribution, so we replace the ReLU layer with LeakyReLU with a high negative slope to fit our case.

3.3.1 Deformable Residual Group

The RG structure makes it possible to achieve large depth, consequently providing a large receptive field. However, uniformly extending the receptive field does not always benefit tasks that require high precision, such as the reconstruction of frequency-domain feature maps, because of the potentially redundant information it brings in. The deformable convolution layer (DefConv) (Zhu et al., 2019) can be a solution. By learning an offset, DefConv is able to constrain the sampling area, so that each convolution operation focuses only on the valuable region and the impact of the redundant receptive area is reduced.

Thus, as shown in Figure 4, we integrate DefConv into the RB by replacing part of its original convolutional layers with deformable convolution layers. This gives the deformable residual block (DRB), which is the basic module of the deformable residual group (DRG). The proposed DRB structure provides better guidance on the receptive field and thus yields more accurate feature extraction from the previous layer. We place a sequence of DRGs in the reconstruction trunk of the SEN sub-network, after a sequence of regular RGs, to improve the robustness of the reconstructed frequency-domain features.

3.3.2 Depth-wise Residual Group

For most spatial-domain tasks, the intermediate deep feature maps are abstract and strongly correlated. However, through the reforming method defined in Section 3.1.2, the frequency-domain feature maps have concrete semantic information and are less correlated with each other. To better reflect this characteristic, we propose the depth-wise residual block (DWRB), which replaces the first convolution layer in the RB with a depth-wise convolution layer (Howard et al., 2017):

$$F_n = F_{n-1} + H_{conv}\left(\mathrm{LeakyReLU}\left(H_{dw}(F_{n-1})\right)\right)$$

where $H_{dw}$ denotes the depth-wise convolution layer. A depth-wise convolution layer performs a 2D convolution on each channel of the input without merging information from other channels. This makes the module focus on extracting information from each channel itself for the next stage of reconstruction, rather than relying on global information. The depth-wise residual group (DWRG) is an RG that deploys the DWRB instead of the RB.
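The channel independence of a depth-wise convolution can be checked with a small pure-Python sketch. It is illustrative only (zero padding, stride 1, and the function name are assumptions); an actual network would use a framework layer with one filter group per channel.

```python
def depthwise_conv2d(x, kernels):
    """Depth-wise 2D convolution with zero padding and stride 1.

    x:       input of shape [C][H][W]
    kernels: one odd-sized k x k kernel per channel, shape [C][k][k]
    Each output channel depends only on the matching input channel.
    """
    out = []
    for chan, ker in zip(x, kernels):
        h, w, k = len(chan), len(chan[0]), len(ker)
        pad = k // 2
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += chan[ii][jj] * ker[di][dj]
                res[i][j] = acc
        out.append(res)
    return out


# Channel 0 uses an identity kernel, channel 1 doubles its input.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
doubling = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]
x = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
out = depthwise_conv2d(x, [identity, doubling])
```

Changing the values of one input channel leaves every other output channel untouched, which is the property the DWRB relies on when each channel represents a distinct frequency.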

3.4 Frequency-domain Loss Function

The definition of our frequency-domain loss function is critical to the performance of our network. Commonly, the loss function of the super-resolution task is based on the pixel-wise mean squared error (MSE), as minimizing spatial MSE also maximizes the peak signal-to-noise ratio (PSNR). However, solutions from MSE optimization can achieve high PSNR while lacking high-frequency content, which results in unsatisfying perceptual quality with overly smooth textures (Ledig et al., 2017).

For our frequency-domain super-resolution, this problem can be addressed in an intuitive way. Since the target is a series of frequency-domain feature maps with a semantic meaning assigned to each channel, we can allocate a different weight to each frequency channel while computing the loss, thus explicitly guiding the network to focus on the reconstruction of selected high-frequency channels. Following Lai et al. (2019), we further replace the MSE backbone with the Charbonnier loss, which better handles the outliers that are more likely to appear in frequency-domain samples. The proposed frequency-domain loss function $L_f$ is calculated as:

$$L_f = \sum_{c} \frac{w_c}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} \rho\left(O_c(i,j) - O^{GT}_c(i,j)\right), \qquad \rho(x) = \sqrt{x^2 + \epsilon^2}$$

where $\rho$ is the backbone of the Charbonnier loss, $W$ and $H$ denote the width and height of the output feature maps, $w_c$ denotes the weight assigned to channel $c$, and $O$ and $O^{GT}$ denote the predicted and ground-truth frequency-domain feature maps, with $O$ as previously defined in equation 8.
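A channel-weighted Charbonnier loss of this kind can be sketched as follows. The per-channel averaging over spatial positions, the value of `eps` and the function names are assumptions made for illustration; the exact normalization used by the authors is not recoverable from the text.

```python
import math


def charbonnier(x, eps=1e-3):
    """Charbonnier penalty: a smooth, outlier-tolerant variant of |x|."""
    return math.sqrt(x * x + eps * eps)


def frequency_loss(pred, target, weights, eps=1e-3):
    """Weighted Charbonnier loss over frequency channels.

    pred, target: feature maps of shape [C][H][W]
    weights:      one weight per channel
    Each channel's mean penalty is scaled by its weight, then summed.
    """
    total = 0.0
    for p_chan, t_chan, w_c in zip(pred, target, weights):
        n = len(p_chan) * len(p_chan[0])
        penalty = sum(charbonnier(p - t, eps)
                      for p_row, t_row in zip(p_chan, t_chan)
                      for p, t in zip(p_row, t_row))
        total += w_c * penalty / n
    return total


# The same unit error costs ten times more on the heavily weighted channel.
target = [[[0.0, 0.0]], [[0.0, 0.0]]]
err_low = [[[1.0, 1.0]], [[0.0, 0.0]]]
err_high = [[[0.0, 0.0]], [[1.0, 1.0]]]
weights = [1.0, 10.0]
loss_low = frequency_loss(err_low, target, weights)
loss_high = frequency_loss(err_high, target, weights)
```

Raising a channel's weight makes errors on that frequency dominate the gradient, which is how the loss steers training toward the selected high-frequency channels.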

4 Experiment Results

4.1 Experimental Settings

Our experimental settings about datasets, degradation models, evaluation metric and training settings are declared below:

Datasets and degradation model. Following Timofte et al. (2017), we use the 800 training images from the DIV2K dataset (Timofte et al., 2017) as the training set. For testing, we use four standard benchmark datasets: Set5 (Bevilacqua et al., 2012), Set14 (Zeyde et al., 2012), BSD100 (Martin et al., 2001) and Manga109 (Matsui et al., 2016). We conduct experiments with the Bicubic degradation model.

Evaluation Metrics. The SR results are evaluated with PSNR on the luminance channel (Y channel) of the transformed YCbCr space. We also propose a frequency-domain reconstruction metric (FRM) on the luminance channel that measures the quality of the reconstructed high-frequency features:


Training Settings. We crop the 800 training images into mini patches, forming pairs of LR and HR patches. The relative location of each pair of LR and HR patches is strictly identical. Our model is trained with the ADAM optimizer (Kingma and Ba, 2015). We implement the Cosine Learning Rate (CosLR) strategy, which periodically adjusts the learning rate $\eta_t$ at epoch $t$ of a period with the equation:

$$\eta_t = \eta_{min} + \frac{1}{2}\left(\eta_{max} - \eta_{min}\right)\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$$

where $\eta_{max}$ and $\eta_{min}$ are the maximum and minimum learning rates and $T = 30$ is the number of epochs in each period. We use PyTorch (Paszke et al., 2017) to implement our method on an Nvidia GeForce RTX 2080 Ti GPU.
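A periodic cosine schedule of this kind can be sketched in a few lines. The function name and the learning-rate bounds used in the example are hypothetical placeholders; only the period of 30 epochs comes from the text.

```python
import math


def cosine_lr(epoch, lr_max, lr_min, period=30):
    """Cosine learning rate that restarts at lr_max every `period` epochs."""
    t_cur = epoch % period  # position inside the current period
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / period))


# Hypothetical bounds, chosen only to show the shape of the schedule.
schedule = [cosine_lr(e, lr_max=1e-4, lr_min=1e-6) for e in range(60)]
```

Within each period the rate decays smoothly from the maximum toward the minimum, then jumps back to the maximum at the start of the next period.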

The channel-wise weight allocation of our proposed loss function (Equation 13) is discussed in detail in Section 4.3.

4.2 Results with Bicubic Degradation Model

We quantitatively compare our method with 8 state-of-the-art methods, including SRCNN (Dong et al., 2016a), FSRCNN (Dong et al., 2016b), EDSR (Lim et al., 2017), RDN (Zhang et al., 2018b), RRDB (Wang et al., 2018) and its perceptual-driven counterpart ESRGAN (Wang et al., 2018), and MSRResNet (Ledig et al., 2017) and its perceptual-driven counterpart MSRResNet-GAN (Ledig et al., 2017). We further perform visual comparisons with these two GAN-based methods and their PSNR-oriented versions to demonstrate the perceptual quality and fidelity of the output of our model.

4.2.1 Quantitative Results by PSNR/FRM

Table 1 shows quantitative comparisons for our SR task, in which we compare the average PSNR and FRM on the Y channel. The PSNR results of the ESRGAN and MSRResNet pairs are computed using the released models. For the other models, the PSNR results are cited from their papers. All the FRM results are computed using released models. Our model has the best FRM with a slight decrease in PSNR, which shows that our method reconstructs key high-frequency information more accurately. Meanwhile, although GAN-based methods visually provide more high-frequency details, their FRM values are generally low, reflecting the lack of accuracy of the high-frequency information reconstructed by such methods. We discuss the visual behavior in detail in Section 4.2.2.

| Method | Set5 PSNR | Set5 FRM | Set14 PSNR | Set14 FRM | Manga109 PSNR | Manga109 FRM | BSD100 PSNR | BSD100 FRM |
| Bicubic | 28.78 | 40.06 | 26.38 | 39.11 | 24.89 | 39.65 | 26.33 | 38.97 |
| SRCNN | 30.48 | 40.01 | 27.50 | 39.09 | 27.58 | 39.71 | 26.90 | 39.11 |
| FSRCNN | 30.72 | 40.13 | 27.61 | 39.12 | 27.90 | 39.77 | 26.98 | 39.09 |
| MSRResNet | 32.22 | 40.19 | 28.63 | 39.26 | 30.48 | 40.04 | 27.59 | 39.31 |
| MSRResNet-GAN | 29.40 | 39.64 | 26.02 | 38.84 | 27.69 | 39.12 | 25.16 | 39.01 |
| EDSR | 32.46 | 40.32 | 28.80 | 39.61 | 31.02 | 40.46 | 27.71 | 39.25 |
| RDN | 32.47 | 40.27 | 28.81 | 39.47 | 31.00 | 40.71 | 27.71 | 39.23 |
| RRDB | 32.60 | 40.34 | 28.88 | 40.14 | 31.16 | 40.63 | 27.76 | 39.52 |
| ESRGAN | 29.56 | 39.38 | 26.19 | 38.79 | 28.03 | 39.28 | 25.32 | 38.86 |
| FreqNet (Ours) | 32.08 | 43.56 | 28.47 | 42.60 | 30.23 | 40.91 | 27.51 | 40.87 |
Table 1: Quantitative results (PSNR and FRM) with the Bicubic degradation model on the Y channel. Best and second best results are highlighted and underlined.

4.2.2 Visual Results

(a) Image “126007.png”
(b) Image “182053.png”
(c) Image “210088.png”
(d) Image “21077.png”
(e) Image “351093.png”
Figure 5: Visual comparison for ×4 SR with the Bicubic degradation model on the BSD100 dataset.

In figure 5, we show visual comparisons of SR results with the Bicubic Degradation model on BSD100 datasets. For images “126007.png” and “351093.png”, we observe that our method has more precise building contours than the PSNR method, contains more details, and does not have the excessive texture as in the GAN method. For image “210088.png”, we observe that our method produces the best face pattern and eye details for the clownfish. For image “21077.png”, our method better restores the text “cas” over the other methods. And in image “182053.png”, our method predicts the arches correctly while having fewer unnecessary artifacts.

4.3 Effects of Frequency-domain Loss Function and Modified RG

We study the effects of proposed Deformable Residual Group, Depth-wise Residual Group and the Frequency-domain Loss Function.

4.3.1 Settings and Effects of Frequency-domain Loss Function.

As defined in Equation 13, each channel has a pre-assigned weight. We propose a statistical procedure to decide the weight of each channel coarsely. As shown in Figure 6, given a pair of HR and up-sampled LR DCT blocks after the region selection process (i.e. Figure 3, (d)), of size $R \times R$, we perform 8 computations in total, one for each region size from $3 \times 3$ to $10 \times 10$. For each computation, we keep the DCT region at the top left of the original DCT block unchanged and set the values outside the selected region to 0. Then we perform the iDCT on both the LR and HR DCT blocks and compute the mean pixel-wise residual between the two converted pixel blocks.

| Region | 3 | 4-3 | 5-4 | 6-5 | 7-6 | 8-7 | 9-8 | 10-9 |
| Weight | 1 | 1 | 5 | 10 | 10 | 5 | 1 | 1 |
Table 2: Weight Allocation.

We randomly picked 1000 samples from the training set and accumulated the residuals over them for each region size. The change in the accumulated residual between two consecutive region sizes reflects the difference between the HR and LR images that is contributed by the additional frequency channels, and we take it as proportional to the importance of those channels. Based on these statistics, we allocate the weights shown in Table 2, where Region $a$-$b$ denotes the additional channels between the top-left $b \times b$ region and the top-left $a \times a$ region of the DCT block (the first entry, 3, denotes the top-left $3 \times 3$ region itself), and the second row gives the weight assigned to these channels in Equation 13.
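The allocation in Table 2 can be expanded into one weight per channel, as in the sketch below. The names `RING_WEIGHTS` and `channel_weights`, the row-major channel order and the default region size of 10 (the largest region in Table 2) are assumptions for illustration.

```python
# Weight of the channels added when the region grows to size n x n (Table 2).
# The entry for 3 covers the whole top-left 3 x 3 region.
RING_WEIGHTS = {3: 1, 4: 1, 5: 5, 6: 10, 7: 10, 8: 5, 9: 1, 10: 1}


def channel_weights(r=10):
    """Return one weight per channel of an r x r region, in row-major order."""
    weights = []
    for u in range(r):
        for v in range(r):
            # Smallest square region that contains coefficient (u, v).
            ring = max(max(u, v) + 1, 3)
            weights.append(RING_WEIGHTS[ring])
    return weights


weights = channel_weights()
```

A coefficient at position (u, v) first appears in the square region of size max(u, v) + 1, so the largest weights fall on the bands added by the 6 x 6 and 7 x 7 regions.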

Figure 6: Progressively calculate residuals between HR and LR pixel-blocks under different size of region selection.

To demonstrate the effect of the proposed frequency-domain loss function, we run the training process with MSE and with the frequency-domain loss respectively, and compare the outputs of the two models on Set5. Both the PSNR and the FRM of the model supervised by the frequency-domain loss are higher than those of the MSE-supervised model, and its output images contain more accurate high-frequency texture. Figure 7 compares the SR results for image “bird.png” between the MSE-supervised FreqNet and the FreqNet supervised by the frequency-domain loss after the same number of iterations. The model supervised by the frequency-domain loss produces more high-frequency details.

Figure 7: Comparison of FreqNet supervised by MSE and by the frequency-domain loss. The result supervised by the frequency-domain loss contains more high-frequency details.

4.3.2 Effects of Deformable(DRG) and Depth-wise Residual Group(DWRG).

We perform a series of ablation experiments by replacing DRG and/or DWRG with the original RG, to demonstrate the effect of our modified RG structures.

| Metric | RG only | + DRG | + DWRG | + DRG + DWRG |
| PSNR on Set5 | 31.88 | 32.06 | 31.91 | 32.08 |
| FRM on Set5 | 43.24 | 43.51 | 43.29 | 43.56 |
Table 3: Ablation experiments on DRG and DWRG. We use PSNR and our proposed FRM as the metrics.

For each group, the number of residual blocks is set to 10. As shown in Table 3, the PSNR on Set5 increases by 0.18 dB when we replace specific RGs with DRGs and by 0.03 dB when we replace specific RGs with DWRGs, and the best performance is obtained by using both of them. The FRM on Set5 also increases when we replace RG with DRG and with DWRG, by 0.27 and 0.05 respectively. The comparison shows the effectiveness of our proposed modified Residual Group architectures.

4.4 High-frequency Detail Enhancement based on other SR Models

As the output of our proposed model is a group of separate frequency-domain feature maps, we can easily merge it with the output of other SR models and thereby enhance selected high-frequency channels. We first apply a process similar to that of Section 3.1 to convert the output of the other SR model into its own group of frequency-domain feature maps, then replace certain channels in this group with the corresponding channels of the FreqNet output to get the merged output.
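The channel replacement can be sketched as follows. The function name and the choice of which channels to replace are hypothetical; the sketch only shows the mechanics of swapping whole frequency channels between two outputs of the same shape.

```python
def merge_channels(other_maps, freqnet_maps, channels_to_replace):
    """Replace selected frequency channels of another model's output.

    other_maps, freqnet_maps: feature maps of shape [C][H][W]
    channels_to_replace:      indices of the channels taken from FreqNet
    """
    replace = set(channels_to_replace)
    return [freqnet_maps[c] if c in replace else other_maps[c]
            for c in range(len(other_maps))]


# Toy example with three 1 x 2 channels: take channels 1 and 2 from FreqNet.
other = [[[0.0, 0.0]], [[0.1, 0.1]], [[0.2, 0.2]]]
ours = [[[9.0, 9.0]], [[9.1, 9.1]], [[9.2, 9.2]]]
merged = merge_channels(other, ours, channels_to_replace=[1, 2])
```

After merging, the feature maps go through the same denormalization and inverse DCT steps as a regular FreqNet output.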

Figure 8: Merging with the output of FreqNet can reduce unreasonable artifacts from the GAN method while maintaining the details in the picture.

Specifically, we can merge our output with GAN-based SR models. As shown in Figure 8, we merge the output of MSRResNet-GAN (Wang et al., 2018) with the output of our model for image “ARMS.png” in Manga109 (Matsui et al., 2016); the results are presented in the Y channel. The excessive artifacts from the GAN can be corrected by channel replacement, while the reasonable high-frequency information that does not harm fidelity is preserved. This method is practical when the output is blurred due to the difficulty of prediction.

5 Conclusions

We propose FreqNet, a frequency-domain image super-resolution model that explicitly learns the reconstruction of high-frequency details from LR images. We propose the depth-wise residual group (DWRG) and deformable residual group (DRG) structures to fit the characteristics of the frequency-domain task and improve the ability of our network. Meanwhile, we propose a frequency-domain loss function and the frequency-domain reconstruction metric (FRM), which measures the quality of high-frequency detail reconstruction. The quantitative and visual results demonstrate the effectiveness of our method, and the output of our network can further be merged with that of other SR models as a post-processing enhancement.


  • S. Bell-Kligler, A. Shocher, and M. Irani (2019) Blind super-resolution kernel estimation using an internal-gan. In NeurIPS, Cited by: §2.1.
  • M. Bevilacqua, A. Roumy, C. Guillemot, and M. A. Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference, pp. 135.1–135.10. External Links: ISBN 1-901725-46-4 Cited by: §4.1.
  • C. Dong, C. C. Loy, K. He, and X. Tang (2016a) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, pp. 295–307. Cited by: §1, §2.1, §4.2.
  • C. Dong, C. C. Loy, and X. Tang (2016b) Accelerating the super-resolution convolutional neural network. In ECCV, Cited by: §2.1, §4.2.
  • M. Ehrlich and L. S. Davis (2019) Deep residual learning in the jpeg transform domain. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3483–3492. Cited by: §2.2.
  • D. Glasner, S. Bagon, and M. Irani (2009) Super-resolution from a single image. pp. 349–356. Cited by: §1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2020) Generative adversarial networks. Commun. ACM 63 (11), pp. 139–144. External Links: ISSN 0001-0782 Cited by: §1, §2.1.
  • J. Gu, H. Lu, W. Zuo, and C. Dong (2019) Blind super-resolution with iterative kernel correction. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1604–1613. Cited by: §2.1.
  • L. Gueguen, A. Sergeev, B. Kadlec, R. Liu, and J. Yosinski (2018) Faster neural networks straight from jpeg. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Cited by: §2.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. External Links: 1512.03385 Cited by: §1, §2.1.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. ArXiv abs/1704.04861. Cited by: §3.3.2.
  • G. Huang, Z. Liu, and K. Q. Weinberger (2017) Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. Cited by: §1.
  • J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV, Cited by: §2.1.
  • J. Kim, J. K. Lee, and K. M. Lee (2016a) Accurate image super-resolution using very deep convolutional networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1646–1654. Cited by: §2.1.
  • J. Kim, J. K. Lee, and K. M. Lee (2016b) Deeply-recursive convolutional network for image super-resolution. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1637–1645. Cited by: §2.1.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §4.1.
  • W. Lai, J. Huang, N. Ahuja, and M. Yang (2019) Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, pp. 2599–2613. Cited by: §3.4.
  • C. Ledig, L. Theis, F. Huszár, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi (2017) Photo-realistic single image super-resolution using a generative adversarial network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 105–114. Cited by: §1, §2.1, §3.4, §4.2.
  • B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. External Links: 1707.02921 Cited by: §1, §2.1, §3.3, §4.2.
  • A. Liu, Y. Liu, J. Gu, Y. Qiao, and C. Dong (2021) Blind image super-resolution: a survey and beyond. ArXiv abs/2107.03055. Cited by: §1.
  • D. Martin, C. Fowlkes, D. Tal, and J. Malik (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vol. 2, pp. 416–423. Cited by: §4.1.
  • Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa (2016) Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications 76, pp. 21811–21838. Cited by: §4.1, §4.4.
  • C. Nash, J. Menick, S. Dieleman, and P. W. Battaglia (2021) Generating images with sparse representations. ArXiv abs/2103.03841. Cited by: §2.2.
  • O. Oktay, W. Bai, M. Lee, R. Guerrero, K. Kamnitsas, J. Caballero, A. de Marvao, S. Cook, D. O’Regan, and D. Rueckert (2016) Multi-input cardiac image super-resolution using convolutional neural networks. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2016, S. Ourselin, L. Joskowicz, M. R. Sabuncu, G. Unal, and W. Wells (Eds.), Cham, pp. 246–254. External Links: ISBN 978-3-319-46726-9 Cited by: §1.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.
  • K. R. Rao and P. Yip (1990) Discrete cosine transform: algorithms, advantages, applications. Academic Press Professional, Inc., USA. External Links: ISBN 012580203X Cited by: §1.
  • Y. Tai, J. Yang, X. Liu, and C. Xu (2017a) MemNet: a persistent memory network for image restoration. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 4549–4557. Cited by: §2.1.
  • Y. Tai, J. Yang, and X. Liu (2017b) Image super-resolution via deep recursive residual network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2790–2798. Cited by: §2.1.
  • R. Timofte, E. Agustsson, L. V. Gool, M. Yang, L. Zhang, B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee, X. Wang, Y. Tian, K. Yu, Y. Zhang, S. Wu, C. Dong, L. Lin, Y. Qiao, C. C. Loy, W. Bae, J. Yoo, Y. Han, J. C. Ye, J. Choi, M. Kim, Y. Fan, J. Yu, W. Han, D. Liu, H. Yu, Z. Wang, H. Shi, X. Wang, T. S. Huang, Y. Chen, K. Zhang, W. Zuo, Z. Tang, L. Luo, S. Li, M. Fu, L. Cao, W. Heng, G. Bui, T. Le, Y. Duan, D. Tao, R. Wang, X. Lin, J. Pang, J. Xu, Y. Zhao, X. Xu, J. Pan, D. Sun, Y. Zhang, X. Song, Y. Dai, X. Qin, X. Huynh, T. Guo, H. S. Mousavi, T. H. Vu, V. Monga, C. Cruz, K. Egiazarian, V. Katkovnik, R. Mehta, A. K. Jain, A. Agarwalla, C. V. S. Praveen, R. Zhou, H. Wen, C. Zhu, Z. Xia, Z. Wang, and Q. Guo (2017) NTIRE 2017 challenge on single image super-resolution: methods and results. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1110–1121. Cited by: §4.1.
  • T. Tong, G. Li, X. Liu, and Q. Gao (2017) Image super-resolution using dense skip connections. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 4809–4817. Cited by: §1.
  • R. Torfason, F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. V. Gool (2018) Towards image understanding from deep compression without decoding. ArXiv abs/1803.06131. Cited by: §2.2.
  • X. Wang, L. Xie, C. Dong, and Y. Shan (2021) Real-esrgan: training real-world blind super-resolution with pure synthetic data. ArXiv abs/2107.10833. Cited by: §2.1.
  • X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and X. Tang (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In ECCV Workshops, Cited by: §2.1, §4.2, §4.4.
  • C. Wu, M. Zaheer, H. Hu, R. Manmatha, A. Smola, and P. Krähenbühl (2018) Compressed video action recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6026–6035. Cited by: §2.2.
  • K. Xu, M. Qin, F. Sun, Y. Wang, Y. Chen, and F. Ren (2020) Learning in the frequency domain. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1737–1746. Cited by: §2.2, §3.1.2.
  • K. Xu, Z. Zhang, and F. Ren (2018) LAPRAN: a scalable laplacian pyramid reconstructive adversarial network for flexible compressive sensing reconstruction. ArXiv abs/1807.09388. Cited by: §2.2.
  • J. Yang, J. Wright, T. Huang, and Y. Ma (2008) Image super-resolution as sparse representation of raw image patches. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §1.
  • D. Yıldırım and O. Gungor (2012) A novel image fusion method using ikonos satellite images. Journal of Geodesy and Geoinformation 1, pp. 27–34. Cited by: §1.
  • R. Zeyde, M. Elad, and M. Protter (2012) On single image scale-up using sparse-representations. In Curves and Surfaces, J. Boissonnat, P. Chenin, A. Cohen, C. Gout, T. Lyche, M. Mazure, and L. Schumaker (Eds.), Berlin, Heidelberg, pp. 711–730. External Links: ISBN 978-3-642-27413-8 Cited by: §4.1.
  • K. Zhang, J. Liang, L. V. Gool, and R. Timofte (2021) Designing a practical degradation model for deep blind image super-resolution. ArXiv abs/2103.14006. Cited by: §2.1.
  • Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. R. Fu (2018a) Image super-resolution using very deep residual channel attention networks. In ECCV, Cited by: §1, §1, §2.1, §3.2, §3.3.
  • Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018b) Residual dense network for image super-resolution. External Links: 1802.08797 Cited by: §1, §4.2.
  • X. Zhu, H. Hu, S. C. Lin, and J. Dai (2019) Deformable convnets v2: more deformable, better results. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9300–9308. Cited by: §3.3.1.