1 Introduction
Single image super-resolution (SISR), the task that restores low-resolution (LR) images to high-resolution (HR) images, is an active research topic that can be utilized in several applications such as surveillance [zou2011very], medical and astronomical image processing [shi2013cardiac, chen2018efficient, li2018super].
Early SISR approaches [wang2014srcnn, kim2016accurate, lim2017enhanced, zhang2018image, zhang2018residual] focus on generating a single high-quality output for a given input LR image by improving the Peak Signal-to-Noise Ratio (PSNR) of the predicted HR outputs. Since those studies utilize an $L_1$ or $L_2$ loss between the generated and ground-truth HR images, they suffer from an over-smoothing problem. As an alternative to PSNR-oriented models, GAN-based methods [ledig2017srgan, wang2018esrgan] were proposed to generate photo-realistic super-resolved images.
Unfortunately, multiple possible HR images exist for a single LR image, and the aforementioned deterministic models, which improve the image quality of a single output, cannot address this ill-posed nature of super-resolution. SRFlow [lugmayr2020srflow] learns the distribution of HR images consistent with a given LR image and predicts diverse HR images, improving photo-realism, diversity, and LR consistency at once. Subsequently, NCSR [kim2021noise] adopts the noise-conditioned layers suggested in SoftFlow [kim2020softflow], and HCFlow [liang21hierarchical] proposes a hierarchical conditional flow for diversity and higher image quality. However, flow-based models often generate undesired artifacts in HR outputs, which lowers image quality, and the diversity of their outputs is not significantly improved compared to SRFlow.
We observe that super-resolution models predict, from the given LR image, the missing high-frequency information of the HR image, which is responsible for diverse details such as the shape of foliage and the direction of fur. Previous super-resolution models [lugmayr2020srflow, kim2021noise] predict not only the high-frequency information but also the low-frequency information of the HR images. This leads to inefficient training, and these models have difficulty increasing the diversity and image quality of the super-resolution outputs.
In this paper, we propose FS-NCSR (Frequency-Separated Noise-Conditioned Normalizing Flow for Super-Resolution), which applies frequency separation to NCSR. We reconstruct the low-frequency information of the HR outputs by bicubic upsampling of the LR images, without any learnable parameters, and predict the high-frequency information by training a flow-based model. By doing so, we increase the diversity of the learned super-resolution space in both the 4× and 8× settings and improve the super-resolution quality by reducing the number of artifacts. Our contributions can be summarized as follows:
- We propose a flow-based algorithm for high-quality, diverse super-resolution output using noise-conditioned affine coupling and frequency separation.
- By filtering out the low-frequency information of the target image, the generative model focuses on producing high-frequency outputs and improves super-resolution quality.
- We expand the filtered input data distribution by adding noise to the sparse high-frequency image to increase output diversity.
2 Related Works
2.1 Single Image Super Resolution
Super-resolution has long been studied in computer vision. Before deep learning-based methods were applied, sparse coding [dai2015jointlyRegressedRegressors, sun2012sceneMatching, yang2008sparseRepr, yang2010sparseRepr] and local linear regression [timofte2013anchNeighReg, timofte2014A+, yang2013simpleFuncSR] were widely used. Many deep learning-based methods have been proposed for SISR since SRCNN [wang2014srcnn], which exploited CNN layers trained with a pixel-wise reconstruction loss. After SRCNN, many variations have been suggested, including [wang2018esrgan]. However, because CNN-based methods rely on an $L_1$ or $L_2$ loss, they generate blurry images. GAN-based methods, first suggested by SRGAN [ledig2017srgan], have shown improvements by employing adversarial and perceptual losses. Although GAN-based methods generate images of good quality [ledig2017srgan, wang2018esrgan], their diversity is limited: they generate only one image for a given input.
2.2 Normalizing Flow
Flow-based models were first proposed by [dinh2014nice] for modeling complex high-dimensional densities. Because flow-based models learn the whole distribution, they have been widely used to map a simple distribution, such as a Gaussian, to a complex one. Invertible neural networks have been adopted to realize this mapping [dinh2014nice, dinh2017realnvp, kingma2018glow]. Early flow-based models did not show great improvements relative to GAN-based models. However, SRFlow [lugmayr2020srflow], which adopts a negative log-likelihood loss, improved image quality and diversity simultaneously. Because SRFlow is trained with the negative log-likelihood loss, it can learn the whole distribution, which leads to much more diverse images than GAN-based methods. NCSR [kim2021noise] showed further improvements in image quality and diversity by providing the network with noise: it adds a conditional noise layer, which essentially resolves the distribution discrepancy between simple data and complex data.
2.3 Frequency Separation
The study of the frequency domain based on the Fast Fourier Transform (FFT) algorithm [cooley1965algorithm] played a crucial role in traditional signal processing. Accordingly, before the era of deep learning, studying frequency information was important in image restoration research. It is well known that the high-frequency information of an image contributes greatly to its sharpness and fine detail. We can therefore say that the recent success of deep learning-based approaches in realistic image generation is due to their success in synthesizing the high-frequency information of the desired images.
Consequently, in recent image restoration research including super-resolution, there exist approaches [whang2021deblurring] in which low-frequency and high-frequency information are separated and treated by separate neural networks, and approaches [suvorov2021resolution] in which an FFT-based layer is designed to better process information in the frequency domain. We observed that when the former approaches were combined with NCSR, the NLL training of the flow-based model became unstable. In the case of the latter approaches, the existing FFT-based layers are not suitable for the flow-based approach due to their non-invertible nature.
3 Methods
Given an LR image, our goal is to learn a diverse super-resolution space corresponding to that image. From the perspective of the frequency domain, we propose a more efficient method to increase the diversity of the learned space. In this section, we introduce our point of view and the proposed method. We begin with a brief background related to our work.

3.1 Background
Various model frameworks (e.g., Generative Adversarial Networks [goodfellow2014generative], Normalizing Flows [rezende2015variational], and diffusion probabilistic models [ho2020denoising]) have been proposed in recent deep learning-based generative model research, each with its own strengths and weaknesses along with excellent performance. Among them, the flow-based model configures a mapping between the desired data distribution and a latent distribution (e.g., Gaussian) through a series of invertible transformations. Such an invertible mapping $f_\theta$ from data $\mathbf{y}$ to latent $\mathbf{z} = f_\theta(\mathbf{y})$ enables an explicit computation of the negative log-likelihood (NLL) by the change-of-variables formula:
$$-\log p_{\mathbf{y}}(\mathbf{y}) = -\log p_{\mathbf{z}}\big(f_\theta(\mathbf{y})\big) - \log\left|\det \frac{\partial f_\theta}{\partial \mathbf{y}}(\mathbf{y})\right| \tag{1}$$
Flow-based models minimize the NLL directly, and they are widely known to show decent mode coverage of the desired data distribution.
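To make the change-of-variables computation concrete, the following minimal sketch evaluates the NLL of a scalar under a toy one-dimensional affine flow and cross-checks it against the closed-form Gaussian NLL. The flow, its parameters `shift` and `scale`, and all function names are illustrative assumptions; they are not part of SRFlow, NCSR, or FS-NCSR.

```python
import math


def standard_normal_logpdf(z):
    """Log-density of a 1-D standard Gaussian latent variable."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_nll(y, shift, scale):
    """NLL of y under the toy invertible map z = (y - shift) / scale (scale > 0)."""
    z = (y - shift) / scale          # forward pass through the flow
    log_det = -math.log(scale)       # log |dz/dy| of the affine map
    # Change of variables: -log p(y) = -log p_z(z) - log |dz/dy|
    return -standard_normal_logpdf(z) - log_det


def gaussian_nll(y, mean, std):
    """Closed-form NLL of N(mean, std^2), used only as a cross-check."""
    return 0.5 * ((y - mean) / std) ** 2 + math.log(std) + 0.5 * math.log(2.0 * math.pi)


nll = affine_flow_nll(1.3, shift=0.5, scale=2.0)
```

The affine flow pushes a Gaussian with mean `shift` and standard deviation `scale` onto a standard Gaussian, so the two functions agree up to floating-point error.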
Based on this advantage of the flow-based approach, SRFlow [lugmayr2020srflow] first showed that flow-based modeling of the conditional distribution of the HR image can successfully learn the super-resolution space corresponding to the given LR input. One of its variants, NCSR [kim2021noise], added a noise-conditional layer to SRFlow to generate more diverse super-resolution outputs. The results of these previous works show that the ill-posedness of super-resolution can be addressed from the perspective of super-resolution space learning. To take advantage of the flow-based model's good mode coverage, we propose a method to learn a more diverse super-resolution space with the NCSR architecture.
3.2 High-Frequency Information
There are various ways to configure a high-pass filter and a low-pass filter to separate high-frequency and low-frequency information. To avoid affecting the stability of the NLL training of the flow-based model, we utilize the bicubic downsampling-upsampling process as the low-pass filter, $\mathcal{L}_s$, with a specific scale factor $s$. The corresponding high-pass filter, $\mathcal{H}_s$, computes the high-frequency information of the given input by subtracting the low-frequency information from the HR target $\mathbf{y}$:
$$\mathcal{H}_s(\mathbf{y}) = \mathbf{y} - \mathcal{L}_s(\mathbf{y}) \tag{2}$$
There are also other frequency separation methods. Some configure $\mathcal{L}_s$ and $\mathcal{H}_s$ based on the FFT, and others utilize a known 3×3 (or 5×5) kernel. In the former case, the filtering threshold is an additional hyperparameter that depends heavily on the individual image. In both cases, matching the low-frequency information of the LR input $\mathbf{x}$ and the HR target $\mathbf{y}$ requires an additional process, such as a neural network, which leads to instability of the NLL training.
With this simple high-pass filter, sparse high-frequency information can be obtained efficiently, since the LR input $\mathbf{x}$ is already the bicubically downsampled HR target, so that $\mathcal{L}_s(\mathbf{y})$ is simply the bicubic upsampling of $\mathbf{x}$. This leads to our proposed method, which achieves efficient training without additional memory or networks compared to the previous flow-based approaches.
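The downsample-then-upsample low-pass filter and the subtraction-based high-pass filter described above can be sketched in a few lines. To keep the example dependency-free, box averaging and nearest-neighbour repetition are used here as stand-ins for bicubic downsampling and upsampling, so the numbers differ from what a bicubic kernel would give; the function names are hypothetical.

```python
def box_downsample(img, s):
    """Average each s x s block (stand-in for bicubic downsampling).

    Assumes the image height and width are divisible by s.
    """
    h, w = len(img), len(img[0])
    return [[sum(img[i * s + di][j * s + dj] for di in range(s) for dj in range(s)) / (s * s)
             for j in range(w // s)]
            for i in range(h // s)]


def nearest_upsample(img, s):
    """Repeat each pixel s x s times (stand-in for bicubic upsampling)."""
    return [[img[i // s][j // s] for j in range(len(img[0]) * s)]
            for i in range(len(img) * s)]


def low_pass(img, s):
    """Low-frequency part: downsample, then upsample back to the original size."""
    return nearest_upsample(box_downsample(img, s), s)


def high_pass(img, s):
    """High-frequency part: the image minus its low-frequency part."""
    low = low_pass(img, s)
    return [[p - q for p, q in zip(row, low_row)] for row, low_row in zip(img, low)]


hr = [[0, 4, 8, 8],
      [4, 8, 8, 8],
      [1, 1, 2, 6],
      [1, 1, 6, 2]]
low = low_pass(hr, 2)
high = high_pass(hr, 2)
```

By construction, `low` plus `high` reproduces `hr` exactly, and flat regions of the image contribute only zeros to `high`, which illustrates why the high-frequency target is sparse.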
3.3 Overall Method
We propose FS-NCSR (Frequency Separating Noise-Conditioned Normalizing Flow for Super-Resolution), a generative model for super-resolution that produces only the high-frequency information of the target HR image, without the redundant low-frequency information that is readily available from the LR input. Our overall model architecture is shown in Figure 1.
In the training process of flow-based models, dequantization is commonly applied [kingma2018glow, ho2019flow++] for better performance. As can be readily checked in Figure 2 and Table 3, the high-frequency information is relatively sparse compared to HR images, and training the model with this kind of information is difficult. In the previous work of NCSR, the idea of SoftFlow [kim2020softflow] was used: noise of different levels is added to the input instead of the dequantization process. This can be interpreted as an attempt to broaden the sparse regions of the desired data distribution, from the perspective of score matching [song2019generative, song2020score], which is currently in the spotlight of generative modeling. Therefore, we applied the same idea of SoftFlow [kim2020softflow] to deal with sparse information, and it was crucial for the training stability of the proposed method.
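Now, analogously to [lugmayr2020srflow, kim2021noise], we can formulate the training process of our method as follows:
$$\begin{aligned} \mathbf{h}^{+} &= \mathcal{H}_s(\mathbf{y}) + \mathbf{v}, \qquad \mathbf{x}^{+} = \mathbf{x} + \tilde{\mathbf{v}}, \\ \mathbf{z} &= f_\theta(\mathbf{h}^{+}; \mathbf{x}^{+}), \end{aligned} \tag{3}$$
where $\mathcal{H}_s(\mathbf{y})$ is the high-frequency information of the HR target $\mathbf{y}$ (Section 3.2), $\mathbf{v}$ is the added noise, and $\tilde{\mathbf{v}}$ indicates the noise resized to the same size as the LR input $\mathbf{x}$. Also similar to [lugmayr2020srflow, kim2021noise], we formulate the loss function using only the NLL:
$$-\log p(\mathbf{h}^{+} \mid \mathbf{x}^{+}, \theta) = -\log p_{\mathbf{z}}\big(f_\theta(\mathbf{h}^{+}; \mathbf{x}^{+})\big) - \log\left|\det \frac{\partial f_\theta}{\partial \mathbf{h}^{+}}(\mathbf{h}^{+}; \mathbf{x}^{+})\right| \tag{4}$$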
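The noise injection used during training can be illustrated with a small sketch on 1-D signals. It assumes a SoftFlow-style scheme in which a noise level is drawn uniformly from `[0, max_level]` and Gaussian noise of that standard deviation is added to the high-frequency target; resizing the noise to the LR size is done here by block averaging. The uniform sampling, the averaging, and every name below are assumptions made for illustration, not details confirmed by the text.

```python
import random


def add_training_noise(high, lr, s, max_level, rng):
    """Perturb a 1-D high-frequency target and its LR condition with shared noise.

    high: high-frequency target of length len(lr) * s
    lr:   low-resolution condition
    """
    level = rng.uniform(0.0, max_level)               # noise level (assumed uniform)
    noise = [rng.gauss(0.0, level) for _ in high]     # noise at HR size
    # Resize the noise to the LR size by averaging blocks of s samples (assumption).
    noise_lr = [sum(noise[i * s:(i + 1) * s]) / s for i in range(len(lr))]
    noisy_high = [h + n for h, n in zip(high, noise)]
    noisy_lr = [x + n for x, n in zip(lr, noise_lr)]
    return noisy_high, noisy_lr, level


rng = random.Random(0)
high = [0.0, 0.5, -0.5, 0.0]
lr = [1.0, 2.0]
noisy_high, noisy_lr, level = add_training_noise(high, lr, 2, 0.1, rng)
```

With `max_level` set to zero the function returns its inputs unchanged, which corresponds to training without any perturbation.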
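The model trained with the proposed method requires no additional cost at inference compared to the previous approaches, since the low-frequency information is readily given by the LR input $\mathbf{x}$. The super-resolution output $\hat{\mathbf{y}}$ is obtained by:
$$\begin{aligned} \hat{\mathbf{h}} &= f_\theta^{-1}(\mathbf{z}; \mathbf{x}), \\ \hat{\mathbf{y}} &= \hat{\mathbf{h}} + \mathrm{Up}_s(\mathbf{x}), \end{aligned} \tag{5}$$
where $\mathbf{z}$ is the random noise sampled from the latent distribution $p_{\mathbf{z}}$, and $\mathrm{Up}_s$ denotes bicubic upsampling by the scale factor $s$.
From this frequency-domain perspective, super-resolution is the process of generating the high-frequency information corresponding to $\mathbf{x}$, since the low-frequency information is already available from $\mathrm{Up}_s(\mathbf{x})$.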
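The inference step, sampling a latent, inverting the flow to obtain high-frequency content, and adding it to the upsampled LR input, can be sketched on 1-D signals as follows. The "flow" here is a hypothetical element-wise scaling chosen only because it is trivially invertible, nearest-neighbour repetition stands in for bicubic upsampling, and the `temperature` argument scales the latent standard deviation; none of this reflects the actual FS-NCSR network.

```python
import random


def upsample_1d(x, s):
    """Repeat each sample s times (stand-in for bicubic upsampling)."""
    return [v for v in x for _ in range(s)]


def toy_inverse_flow(z, cond):
    """Hypothetical inverse flow: scale the latent by a condition-dependent factor."""
    return [zi * (0.05 + 0.01 * abs(ci)) for zi, ci in zip(z, cond)]


def toy_forward_flow(h, cond):
    """Exact inverse of toy_inverse_flow, mapping high-frequency content to the latent."""
    return [hi / (0.05 + 0.01 * abs(ci)) for hi, ci in zip(h, cond)]


def super_resolve(x, s, temperature, rng):
    """Sample one output: upsampled LR input plus generated high-frequency content."""
    low = upsample_1d(x, s)                                     # low-frequency part
    z = [temperature * rng.gauss(0.0, 1.0) for _ in low]        # latent sample
    high = toy_inverse_flow(z, low)                             # high-frequency part
    return [l + h for l, h in zip(low, high)]


rng = random.Random(0)
lr = [1.0, 3.0]
sample_a = super_resolve(lr, 2, 0.9, rng)
sample_b = super_resolve(lr, 2, 0.9, rng)
```

Different latent samples give different outputs that share the same low-frequency part, and a temperature of zero collapses the output to the upsampled LR input.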
4 Experiments
4.1 Datasets
We utilize the DF2K dataset, a merged dataset of DIV2K [agustsson2017ntire] and Flickr2K (https://github.com/limbee/NTIRE2017), for training and evaluation. The DIV2K dataset consists of 800, 100, and 100 high-resolution images in the train, validation, and test splits, respectively. The Flickr2K dataset comprises 2650 high-resolution images. The train split of DIV2K and the whole Flickr2K dataset are merged and used for training. We evaluate our model on the validation split of DIV2K.
We tried to increase the amount of training data by including images crawled from the Unsplash website (https://unsplash.com), but this did not improve the diversity or visual quality of the super-resolved images. Thus, we do not include the crawled dataset in this research.
During training, we randomly crop 160×160 patches from the original HR images and use them as HR samples. We obtain LR samples by downsampling the HR patches with a bicubic kernel and use the resulting HR and LR patches as HR-LR pairs for training. We train our model on RGB channels and randomly apply horizontal flips and 90-degree rotations for data augmentation.
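A minimal sketch of the patch sampling and augmentation described above is given below for a single-channel image stored as nested lists. Applying the flip with probability 0.5 and rotating by a random multiple of 90 degrees are assumptions; the text only states that flips and 90-degree rotations are applied randomly. The bicubic downsampling that produces the LR patch is omitted.

```python
import random


def random_crop(img, size, rng):
    """Cut a random size x size patch out of a 2-D image."""
    top = rng.randrange(len(img) - size + 1)
    left = rng.randrange(len(img[0]) - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]


def hflip(img):
    """Mirror the image horizontally."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate the image by 90 degrees counter-clockwise."""
    return [list(col) for col in zip(*img)][::-1]


def augment(patch, rng):
    """Random horizontal flip followed by a random multiple of 90-degree rotations."""
    if rng.random() < 0.5:
        patch = hflip(patch)
    for _ in range(rng.randrange(4)):
        patch = rot90(patch)
    return patch


rng = random.Random(0)
image = [[r * 300 + c for c in range(300)] for r in range(200)]
hr_patch = augment(random_crop(image, 160, rng), rng)
```

Because the patch is square, every combination of flip and rotation keeps the 160×160 shape.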

4.2 Training
We use the Adam optimizer [kingma2014adam] with $\beta_1$ = 0.9, $\beta_2$ = 0.99, $\epsilon$ = , and set the initial learning rate as . Following [kim2021noise], the learning rate is halved at 50%, 75%, 90%, and 95% of the total training steps. We train our network with a batch size of 16 on a V100 GPU. The 4× network was trained for 180k steps and the 8× network for 220k steps.
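The step-wise halving schedule can be written as a small function. The initial learning rate is passed in as a hypothetical `base_lr` argument, and the milestones are expressed as integer percentages so that the comparisons are exact.

```python
def learning_rate(step, total_steps, base_lr, milestones_pct=(50, 75, 90, 95)):
    """Learning rate at a given step when it is halved at each milestone.

    milestones_pct: percentages of total_steps at which the rate is halved.
    """
    passed = sum(1 for pct in milestones_pct if step * 100 >= pct * total_steps)
    return base_lr * 0.5 ** passed


# Example with the 180k-step budget of the 4x network and a placeholder base rate.
schedule = [learning_rate(step, 180_000, 1.0) for step in (0, 90_000, 135_000, 162_000, 171_000)]
```

With a base rate of 1.0, the rate after the last milestone is 1/16 of the initial value.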
4.3 Evaluation
We evaluate our model and other baselines based on three criteria: photo-realism, diversity of super-resolution space, and image consistency on LR. We adopt LPIPS [zhang2018unreasonable] to evaluate photo-realism, diversity score to evaluate diversity, and LR PSNR to evaluate LR consistency.
LPIPS. LPIPS is the distance between the super-resolved and the ground-truth HR image. The distance is measured on the feature space of AlexNet [krizhevsky2012imagenet].
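Diversity Score. To measure the diversity of models meaningfully, Lugmayr et al. [Lugmayr_2021_CVPR] proposed the diversity score. Let $\mathbf{y}$ be the ground-truth HR image and $\mathbf{y}^{k}$ be the $k$-th patch of $\mathbf{y}$. Generating $M$ samples from the super-resolution model, the $i$-th super-resolved image is $\hat{\mathbf{y}}^{i}$ and its $k$-th patch is $\hat{\mathbf{y}}^{i,k}$, where $i \in \{1, \dots, M\}$. Then the diversity score over $K$ patches can be computed as follows:
$$S_M = \frac{1}{\bar{d}_M}\left(\bar{d}_M - \frac{1}{K}\sum_{k=1}^{K} \min_{i \in \{1,\dots,M\}} d\big(\mathbf{y}^{k}, \hat{\mathbf{y}}^{i,k}\big)\right) \tag{6}$$
where the minimum distance on a global sample, $\bar{d}_M$, is defined as follows:
$$\bar{d}_M = \min_{i \in \{1,\dots,M\}} \frac{1}{K}\sum_{k=1}^{K} d\big(\mathbf{y}^{k}, \hat{\mathbf{y}}^{i,k}\big) \tag{7}$$
We use LPIPS as the distance function $d$, and set $M = 10$.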
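Given a precomputed table of patch-wise distances, the diversity score reduces to comparing the best single sample with the best per-patch choice across samples. The sketch below assumes that `distances[i][k]` holds the distance between the `k`-th ground-truth patch and the `k`-th patch of the `i`-th sample, and that the score is reported as a percentage (multiplied by 100), which matches the magnitude of the values in the tables but is an assumption about the scaling.

```python
def diversity_score(distances):
    """Diversity score from distances[i][k] for sample i and patch k."""
    num_samples = len(distances)
    num_patches = len(distances[0])
    # Global best: the single sample with the lowest average patch distance.
    global_best = min(sum(row) / num_patches for row in distances)
    # Local best: for each patch, pick the closest sample, then average.
    local_best = sum(min(distances[i][k] for i in range(num_samples))
                     for k in range(num_patches)) / num_patches
    return 100.0 * (global_best - local_best) / global_best


identical = diversity_score([[0.5, 0.25], [0.5, 0.25]])   # no diversity
varied = diversity_score([[0.25, 0.5], [0.5, 0.25]])      # samples differ per patch
```

A deterministic model produces identical samples, so the local and global bests coincide and the score is zero; the score grows when different samples are closest to the ground truth in different patches.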
LR PSNR. The super-resolved output of the model must be consistent with the original LR input. Thus, we measure the PSNR (Peak Signal-to-Noise Ratio) between the downsampled super-resolved image and the given input LR image.
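As a rough illustration of the LR PSNR computation, the sketch below downsamples a super-resolved image and compares it with the LR input. Box averaging stands in for the downsampling kernel used in the actual evaluation, and a peak value of 255 is assumed; both are assumptions made for this example.

```python
import math


def box_downsample(img, s):
    """Average each s x s block (stand-in for the evaluation downsampling kernel)."""
    h, w = len(img), len(img[0])
    return [[sum(img[i * s + di][j * s + dj] for di in range(s) for dj in range(s)) / (s * s)
             for j in range(w // s)]
            for i in range(h // s)]


def psnr(a, b, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized 2-D images."""
    count = len(a) * len(a[0])
    mse = sum((p - q) ** 2 for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b)) / count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


def lr_psnr(sr, lr, s):
    """PSNR between the downsampled super-resolved image and the LR input."""
    return psnr(box_downsample(sr, s), lr)


lr = [[10.0, 20.0]]
consistent_sr = [[10.0, 10.0, 20.0, 20.0], [10.0, 10.0, 20.0, 20.0]]
shifted_sr = [[11.0, 11.0, 21.0, 21.0], [11.0, 11.0, 21.0, 21.0]]
```

A perfectly consistent output gives an infinite PSNR, while an output whose downsampled version is off by one intensity level everywhere gives about 48.13 dB.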
Model | Diversity | LPIPS | LR PSNR |
---|---|---|---|
RRDB [wang2018esrgan] | 0 | 0.253 | 49.20 |
ESRGAN [wang2018esrgan] | 0 | 0.124 | 39.03 |
ESRGAN+ [Rakotonirina_2020] | 22.13 | 0.279 | 35.45 |
SRFlow [lugmayr2020srflow] | 25.26 | 0.120 | 49.97 |
HCFlow [liang21hierarchical] | 22.73 | 0.116 | 49.46 |
NCSR [kim2021noise] | 26.72 | 0.119 | 50.75 |
FS-NCSR (Ours) | 29.44 | 0.127 | 49.31 |
4.4 Quantitative Results
We compare our model, FS-NCSR, with diverse baseline models: RRDB [wang2018esrgan], ESRGAN [wang2018esrgan], ESRGAN+ [Rakotonirina_2020], SRFlow [Lugmayr_2021_CVPR], HCFlow [liang21hierarchical], and NCSR [kim2021noise]. RRDB is trained with an $L_1$ loss against the ground-truth HR image and is consequently oriented toward maximizing PSNR. ESRGAN and ESRGAN+ are GAN-based methods that are common baselines for photo-realistic super-resolution. RRDB and ESRGAN are deterministic models, so their diversity scores are zero. SRFlow, HCFlow, and NCSR are stochastic super-resolution models that can super-resolve diverse photo-realistic images from the given input LR image. For all the flow-based super-resolution models, the temperature is set to 0.9, except for the NCSR 8× model, for which it is 0.85.
We measure the diversity score, LPIPS, and LR PSNR of our model and compare them with the reported results of the baselines. We evaluate all the models in the 4× super-resolution setting. As shown in Table 1, our proposed model, FS-NCSR, achieves the highest diversity score in the 4× setting. The diversity score of FS-NCSR is significantly higher than that of NCSR [kim2021noise], which indicates that frequency separation plays a key role in improving diversity. Although FS-NCSR achieves a lower LR PSNR and a higher LPIPS than SRFlow [lugmayr2020srflow], HCFlow [liang21hierarchical], and NCSR, the increase in diversity is large relative to this degradation and can compensate for it. In addition, we observe fewer artifacts and failure cases in the generated samples of FS-NCSR than in those of NCSR. We discuss this qualitative comparison in Section 4.5.
We also evaluate all the models except ESRGAN+ [Rakotonirina_2020] in the 8× super-resolution setting. As presented in Table 2, FS-NCSR outperforms all the other methods in terms of diversity score and LPIPS. FS-NCSR also achieves an LR PSNR comparable to that of SRFlow [lugmayr2020srflow], the model with the highest LR PSNR. These results show that FS-NCSR outperforms all the other methods in terms of photo-realism and diversity, and that frequency separation is a decisive factor.
Model | Diversity | LPIPS | LR PSNR |
---|---|---|---|
RRDB [wang2018esrgan] | 0 | 0.419 | 45.43 |
ESRGAN [wang2018esrgan] | 0 | 0.277 | 31.35 |
SRFlow [lugmayr2020srflow] | 25.31 | 0.272 | 50.00 |
NCSR [kim2021noise] | 26.8 | 0.278 | 44.55 |
FS-NCSR (Ours) | 26.9 | 0.257 | 48.90 |
To clearly demonstrate the effect of frequency separation, we additionally report the metric trajectories during the training of FS-NCSR and NCSR [kim2021noise]. We measure LPIPS and the diversity score at 150k, 160k, 170k, and 180k steps for each model. The results are presented in Figure 6. At every measured iteration, the FS-NCSR weights yield higher diversity and lower LPIPS than the NCSR weights of the same iteration. These results show that frequency separation consistently improves the diversity and photo-realism of the model output during the training process.

4.5 Qualitative Results
The qualitative result in Figure 4 shows that the direction and density of the leaves differ slightly across the 5 outputs. Thus, the proposed method not only achieves a higher diversity score than previous approaches but also generates outputs with diverse details that are visually distinguishable. This means that frequency separation can enhance the mode coverage of the flow-based model.
We now qualitatively compare our result with the output of NCSR to verify the effect of frequency separation. As discussed in Section 4.4, the LPIPS of FS-NCSR in the 4× setting is slightly worse (higher) than that of the existing flow-based approaches. However, Figure 3 shows that FS-NCSR can reproduce characters more clearly than NCSR. This qualitatively confirms that, although training focused on high-frequency information performs slightly worse on LPIPS, the actual outputs do not suffer a degradation of image quality compared to the existing methods.
The existing SRFlow and NCSR models show repeated failure cases where artifacts appear in specific images (e.g., 0807 and 0828 from the DIV2K validation set). In the case of 0807, for instance, when SRFlow and NCSR generated the corresponding 4× super-resolved outputs, all outputs were failure cases because artifacts appeared. On the other hand, when FS-NCSR generated the 4× super-resolved outputs of the same image, 4 out of 10 outputs were free of artifacts, and even in the 6 failure cases the artifacts were less severe than those of NCSR. Figure 5 shows how the degree of the artifacts differs between the NCSR and FS-NCSR outputs, along with artifact-free FS-NCSR results compared to the ground-truth image.
4.6 Ablation: Comparison of Generated High-Frequency Information
So far, we have discussed the results both quantitatively and qualitatively using only the super-resolved outputs. We additionally compare the results from the perspective of frequency information. Since sparse high-frequency information plays a key role in the proposed method, we investigate how the proposed method affects the sparsity of the generated high-frequency information. For this purpose, the generated high-frequency information $\hat{\mathbf{h}}$ is first quantized to the uint8 range. Then the Sparsity and Relative Sparsity (RS) are computed as follows:
$$\begin{aligned} \mathrm{Sparsity}(\hat{\mathbf{h}}) &= \frac{1}{HWC}\sum_{i,j,k} \mathbb{1}\big[\hat{\mathbf{h}}_{ijk} = 0\big], \\ \mathrm{RS}(\hat{\mathbf{h}}) &= \frac{\mathrm{Sparsity}(\hat{\mathbf{h}})}{\mathrm{Sparsity}(\mathbf{h})}, \end{aligned} \tag{8}$$
where $(H, W, C)$ is the shape of a given image and $\mathbf{h}$ is the ground-truth high-frequency information. Since the ground-truth high-frequency information is already sparse, the RS accounts for the sparsity of each ground-truth image for a fairer comparison.
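A minimal sketch of these two statistics is given below, under the assumption that sparsity is the fraction of entries in the quantized high-frequency map equal to a designated zero level, and that RS is the ratio of the generated sparsity to the ground-truth sparsity. The `zero_level` parameter and the function names are hypothetical, and the quantization step itself is left out because its details are not specified here.

```python
def sparsity(quantized, zero_level=0):
    """Fraction of entries in a quantized 2-D map that equal the zero level."""
    flat = [value for row in quantized for value in row]
    return sum(1 for value in flat if value == zero_level) / len(flat)


def relative_sparsity(generated, ground_truth, zero_level=0):
    """Sparsity of the generated map relative to that of the ground truth."""
    return sparsity(generated, zero_level) / sparsity(ground_truth, zero_level)


generated = [[0, 3], [0, 0]]
ground_truth = [[0, 3], [5, 0]]
rs = relative_sparsity(generated, ground_truth)
```

An RS above 1 means the generated map contains more zero entries than the ground truth, that is, it carries less high-frequency content.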
Model | Average Sparsity | Average RS |
---|---|---|
NCSR [kim2021noise] | 66.2% | 1.123 |
FS-NCSR (Ours) | 66.0% | 1.120 |
See Table 3. Although our proposed method shows slightly lower average sparsity and average RS than NCSR, the outputs generated by both NCSR and FS-NCSR are about 10% sparser than the ground-truth high-frequency information, resulting in a lack of information compared to the ground truth. This margin verifies that a significant loss of information still exists from the perspective of the frequency domain, which needs to be addressed in future studies.
5 NTIRE 2022 Challenge
Team | LPIPS | LR PSNR | Div. Score | MOR |
---|---|---|---|---|
IMAG_ZW | 0.171 | 48.14 | 21.938 | 3.57 |
FS-NCSR (Ours) | 0.126 | 50.13 | 28.853 | 3.67 |
IMAG_WZ | 0.169 | 45.20 | 27.320 | 3.34 |
SSS | 0.110 | 44.70 | 13.285 | _ |
NCSR | 0.117 | 50.54 | 26.041 | _ |
SRFlow | 0.122 | 49.86 | 25.008 | 3.62 |
ESRGAN | 0.124 | 38.74 | 0.000 | 3.52 |

Team | LPIPS | LR PSNR | Div. Score | MOR |
---|---|---|---|---|
FS-NCSR (Ours) | 0.257 | 50.37 | 26.539 | 4.510 |
SSS | 0.237 | 37.43 | 13.548 | 4.850 |
NCSR | 0.259 | 48.64 | 26.941 | 4.503 |
SRFlow | 0.282 | 47.72 | 25.582 | 4.775 |
ESRGAN | 0.284 | 30.65 | 0.000 | 4.452 |
Our proposed method, FS-NCSR, achieved competitive results in both tracks of the NTIRE 2022 "Learning Super Resolution Space Challenge" [lugmayr2022ntire]. See Tables 4 and 5 for the challenge results of the 4× and 8× tracks, respectively. In the 4× track, FS-NCSR obtained the highest diversity score among the existing and newly proposed methods by a relatively large margin. It also obtained the best LPIPS and LR PSNR among this year's participants, although this did not lead to the best MOR. In the 8× track, FS-NCSR was this year's only method that achieved results comparable to last year's approaches. The improvement in LR PSNR suggests that frequency separation improves consistency with the low-resolution input.
6 Conclusion
We propose a flow-based algorithm, FS-NCSR, to learn the high-frequency information of the super-resolution space. Based on the relation between the high-frequency information and the high-quality details of a given image, we train the generative model for super-resolution to produce the high-frequency information corresponding to the low-resolution input. With a simple high-pass filter that uses the low-frequency information of the low-resolution input, we successfully increase super-resolution diversity without affecting the stability of the flow-based NLL training or degrading visual quality. We also confirm that the frequency separation of FS-NCSR reduces the failure cases caused by artifacts and therefore significantly improves the quality of the super-resolution output.
7 Acknowledgement
This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)].