Image super-resolution (SR) aims to reconstruct a high-resolution (HR) image from its low-resolution (LR) counterpart. SR is an important class of image processing techniques and has been widely applied to real-world tasks [sr_medical, sr_medical_3d, sr_surveillance, sr_face], and many SR models have been proposed [srcnn, vdsr, srresnet, edsr, esrgan]. Many of them pre-define SR as a pixel-level mapping from LR to HR and adopt an MAE/MSE (mean absolute error and mean squared error, respectively) loss to pursue a high Peak Signal-to-Noise Ratio (PSNR); such models are termed PSNR-oriented models (POMs).
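As a minimal illustration of the quantities named above, the following dependency-free sketch computes MAE, MSE, and PSNR for two flattened pixel lists. The pixel values and the 8-bit peak value of 255 are illustrative assumptions, not values from this paper.

```python
import math

def mae(a, b):
    """Mean absolute error between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def mse(a, b):
    """Mean squared error between two equally sized pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, max_value=255.0):
    """PSNR in dB; max_value is the assumed upper bound of the color space."""
    err = mse(a, b)
    if err == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / err)

# Hypothetical ground-truth and super-resolved pixels.
hr = [52.0, 60.0, 61.0, 200.0]
sr = [50.0, 62.0, 61.0, 196.0]
print(mae(hr, sr), mse(hr, sr), psnr(hr, sr))
```

Because PSNR is a monotone function of MSE, a model trained with an MSE loss directly maximizes PSNR, which is why such models are called PSNR-oriented.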
Existing methods [edsr, esrgan] typically crop input LR images into patches during training. However, the cropped LR patches lack contextual information, and their high-frequency details were already lost during down-sampling. This increases the probability that two regions are similar in their LR form even though they were not identical before down-sampling. If the similarity of these two LR regions exceeds the model’s discriminative ability, the model is optimized to map both regions to the distribution center of their HR regions in the feature space (Fig. 1(a)). The candidate HR regions share very similar low-frequency signals (which is why their LR regions are similar) but differ in their details, and a prediction at their center averages these details away. As a result, the generated images are over-smoothed. We name this phenomenon “the center-oriented optimization (COO) problem”; it hinders POMs from producing images with clear details.
In addition, Pianykh et al. revealed that the key to human perception of image quality is the high-frequency information [human_perceptual]. In contrast, POMs use MAE/MSE loss functions that treat areas of different frequencies identically, while low-frequency regions account for most of an image. This drives POMs to favor restoring the low-frequency regions. Two mainstream types of existing approaches aim to improve the perceptual quality of generated images, namely building larger models and using GAN-based methods; both indirectly weaken the COO problem and reduce the low-frequency tendency. In particular, larger models raise the upper bound of the discriminative ability, thus achieving more accurate mappings from LR to HR. GAN-based methods generate fake details whose distribution is close to that of the details of real-world HR images. However, both approaches lead to more computation and higher latency. Moreover, GAN-based methods are unstable and time-consuming to train.
In this paper, we propose a Detail Enhanced Contrastive Loss (DECLoss) for SR networks to alleviate the COO problem and eliminate the low-frequency tendency. The main idea of DECLoss is to precisely map an LR region to an HR region, even if that HR region is not the LR region’s ground truth. DECLoss consists of two calculation steps: 1) high-frequency enhancement and 2) spatial contrastive learning. To match human perception, we first enhance image details by compensating the high-frequency information in the Fourier domain. We then reshape the SR patches and their corresponding HR patches into sequences of mini-patches. For each SR mini-patch, within a training batch, we select all HR mini-patches that are sufficiently similar to the SR mini-patch’s ground truth as positive samples, and the remaining HR mini-patches as negative samples. Next, we introduce contrastive learning to reduce the distance to the positive samples and increase the distance to the negative samples. The SR model can therefore accurately map each distinct LR region to an HR region, which augments the details of the generated image (Fig. 1(b)). This is the first work to combine contrastive learning with region similarity in SR. Without any adversarial operations, DECLoss is stable and straightforward. In summary, our main contributions are three-fold:
1) Revealing Obstructions for High Perceptual Quality. We reveal the factors that hinder POMs from generating images of high perceptual quality, namely the center-oriented optimization problem and the low-frequency tendency. In particular, we are the first to define the COO problem and to quantify its effects on SR models.
2) Proposing a Detail Enhanced Contrastive Loss (DECLoss). We propose a novel perceptual-driven loss function. Based on the image frequency transformation and contrastive learning, DECLoss alleviates the COO problem and eliminates the low-frequency tendency, thereby achieving higher perceptual quality.
3) Extensive Experiments. Extensive experiments demonstrate the efficiency and effectiveness of our method. Without any adversarial operations, our DECLoss-based EDSR [edsr] achieves performance equivalent to that of a GAN-based method while being 3.60× faster; combined with RaGAN [esrgan], our RRDB [esrgan] model outperforms a variety of state-of-the-art methods.
2 Related Work
2.1 Image Super-Resolution
Image super-resolution is an important image restoration task in computer vision [sr_dual, sr_unpaired, sr_meta, sr2021addersr, sr2021learning, sr2021masa]. SRCNN [srcnn] first applied convolutional neural networks to SR tasks, and VDSR [vdsr] then used a very deep network for SR. After He et al. [resnet] proposed ResNet for learning residuals, SRResNet [srresnet] introduced ResBlock [resnet] to expand the network depth. EDSR [edsr] further enhanced the efficiency of residual methods to advance SR results. DRCN [drcn], DRRN [drrn], and CARN [carn] also adopted ResBlock [resnet] for recursive learning. Then, RRDB [esrgan] and RDN [rdn] used dense connections to augment information from former layers. RCAN [rcan] and RFA [rfa] explored attention mechanisms within deep SR models.
To improve the perceptual quality and restore the details that are missing because of the COO problem, Johnson et al. [perceptual_loss] proposed a perceptual loss, and Zhang et al. [lpips] proposed a learned model, “LPIPS”, to measure perceptual distances between images. SRGAN [srresnet] first introduced a GAN-based method into an SR model, where an adversarial loss was used during training. ESRGAN [esrgan] made significant progress for GAN-based methods; images generated by ESRGAN look more natural in texture. In contrast, Beby-GAN [beby_gan] paid more attention to generating fake details, thereby further improving the perceptual quality. In particular, their research on the over-smoothing phenomenon in SR inspired us to discover the COO problem. Although GAN-based methods are effective in generating fake details, they are costly to train because of the two-stage gradient backpropagation and the system I/O (input/output; in this paper, this mainly refers to tensors).
2.2 Contrastive Learning
Contrastive learning has demonstrated its effectiveness in unsupervised representation learning tasks [byol, contrast_representation, contrast_understanding]. It learns representations in which similar samples stay close to each other while dissimilar samples are far apart [simclr, moco, caron2020unsupervised]. Many super-resolution approaches have also applied contrastive learning to improve their robustness. For example, DASR [wang2021unsupervised] applied contrastive learning to degradation representations. Wang et al. [wang2021towards] proposed a distillation method with contrastive learning. Zhang et al. [zhang2021blind] used a bidirectional contrastive loss to identify both high-frequency and low-frequency features.
3 Center-Oriented Optimization Problem
The center-oriented optimization (COO) problem is as follows: when dealing with several hard-to-discriminate LR inputs, POMs tend to map an LR input to the distribution center of all potential HRs, which results in over-smoothed outputs, as shown in Fig. 2. We therefore first describe mathematically how the COO problem influences the quality of POMs, and then establish a function that measures this influence.
3.1 Problem Description
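POMs usually pre-define super-resolution as mapping an LR input $I^{LR}$ to an SR image $I^{SR}$ through a function $F_{\theta}$, so as to make $I^{SR}$ as similar as possible to the ground truth $I^{HR}$. Thus, we have:

$$I^{SR} = F_{\theta}(I^{LR}).$$

Then, to minimize the distance $d(\cdot,\cdot)$ between an SR image $I^{SR}$ and its corresponding ground truth $I^{HR}$, the function updates its parameter $\theta$ by using:

$$\hat{\theta} = \arg\min_{\theta}\, d\big(F_{\theta}(I^{LR}),\, I^{HR}\big).$$

However, in the real world, a region $I^{LR}_{1}$ of an LR image might be so similar to another LR region $I^{LR}_{2}$ that the model cannot discriminate between them, while their corresponding HR regions $I^{HR}_{1}$ and $I^{HR}_{2}$ differ in details. Then, we have:

$$F_{\theta}(I^{LR}_{1}) \approx F_{\theta}(I^{LR}_{2}) \approx I^{SR}, \qquad D = d\big(I^{SR}, I^{HR}_{1}\big) + d\big(I^{SR}, I^{HR}_{2}\big),$$

where $D$ denotes the summed distance of the two similar SR patches to their ground truths. Therefore, the model is forced to minimize $D$, so that $I^{SR}$ must be close to the center of the HRs:

$$I^{SR} \approx \tfrac{1}{2}\big(I^{HR}_{1} + I^{HR}_{2}\big). \tag{6}$$

We then extend Eq. 6 to the entire dataset:

$$I^{SR}_{i} \approx \sum_{j} p_{ij}\, I^{HR}_{j}, \tag{7}$$

where $p_{ij}$ is the mapping probability from LR $i$ to HR $j$. For example, if the LRs are not similar to each other, an example mapping probability is [0.97, 0.02, 0.01]; however, if the LRs are similar to each other, the probability may be [0.4, 0.4, 0.2]. Unfortunately, it is difficult to calculate these probabilities.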
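The pull toward the center can be checked numerically. The sketch below assumes an MSE objective and three hypothetical scalar HR values; it uses the two example probability vectors from the text and finds the single prediction that minimizes the expected squared error by gradient descent. Under MSE this minimizer is the probability-weighted mean of the candidate HRs (under MAE it would be the weighted median instead).

```python
def weighted_center(targets, probs):
    """Probability-weighted mean of the candidate HR values."""
    return sum(p * t for t, p in zip(targets, probs))

def best_constant(targets, probs, steps=2000, lr=0.05):
    """Minimize sum_j p_j * (pred - t_j)^2 over a single prediction."""
    pred = 0.0
    for _ in range(steps):
        grad = sum(2.0 * p * (pred - t) for t, p in zip(targets, probs))
        pred -= lr * grad
    return pred

# Hypothetical HR pixel values for three candidate regions.
targets = [0.0, 1.0, 0.5]
probs_distinct = [0.97, 0.02, 0.01]  # LRs are easy to tell apart
probs_similar = [0.4, 0.4, 0.2]      # LRs are hard to tell apart

print(best_constant(targets, probs_distinct))  # stays near the first HR value
print(best_constant(targets, probs_similar))   # drifts to the center
```

With distinguishable inputs the optimum stays close to the intended ground truth, whereas with similar inputs it lands midway between the candidates, which corresponds to the over-smoothed output described above.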
Note that image down-sampling discards high-frequency information, i.e., the candidate HRs usually look the same in their low-frequency regions but differ in their high-frequency details. Therefore, the details of the generated images are blurry, even though these high-frequency details are very important to the perceptual quality.
3.2 Description Function
Because the probabilities defined in Eq. 7 are difficult to calculate, we instead measure the intensity of the COO problem (ICOO) by using feature distances. We first assume that each LR image is well mapped to its corresponding HR image by the mapping function. The distance between each SR image and its corresponding HR image is then approximately zero, while its distances to the other HR images are positive; thus we have:
where the score denotes the mapping accuracy of an SR image, and the number of HR images considered is either the length of the dataset or the number of HR images most similar to that SR image (a top-$k$ selection). The condition of Eq. 9 can be relaxed so that an SR image may reach any potential HR image, even if that HR image is not its ground truth. Since the clarity of image details is the key to human perception of image quality [human_perceptual], mapping each detail exactly onto its own ground truth is usually not essential. Thus, Eq. 9 is rewritten as:
where the reference is the HR image with the shortest distance from the SR image. Then, we can describe the intensity of the COO problem through the sum of the scores defined in Eq. 10 over multiple generated images:
where the sum runs over a proper subset of the dataset, and we apply the logarithm function to avoid values that are too small. Eq. 11 is the core function that reflects the intensity of the COO problem (ICOO). Note that ICOO measures the similarity between the SR and HR distributions rather than the similarity of two images. Such distribution similarities represent the overall SR image quality. The ICOO measurement is described in Sec. 5.2.
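The exact forms of Eqs. 9 to 11 are not reproduced here, so the following is only one plausible reading of the description, not the paper's definition. It assumes a Euclidean distance, scores each SR mini-patch by its distance to the nearest HR mini-patch relative to its total distance to all candidates, and averages the logarithm of these scores. The name `icoo_like`, the `eps` constant, and the toy vectors are hypothetical.

```python
import math

def l2(a, b):
    """Euclidean distance between two flattened mini-patches."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def icoo_like(sr_patches, hr_patches, eps=1e-8):
    """Mean log of (nearest-HR distance / total distance to all HRs).

    Lower values mean each SR patch sits close to some real HR patch;
    higher values mean SR patches sit between HR patches (near a center).
    """
    logs = []
    for s in sr_patches:
        dists = [l2(s, h) for h in hr_patches]
        score = (min(dists) + eps) / (sum(dists) + eps)
        logs.append(math.log(score))
    return sum(logs) / len(logs)

hr_set = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
sharp_sr = [[0.0, 0.01], [1.0, 0.99]]   # each close to one HR patch
blurry_sr = [[0.5, 0.5], [0.5, 0.5]]    # equidistant from all HR patches

print(icoo_like(sharp_sr, hr_set), icoo_like(blurry_sr, hr_set))
```

In this toy setting the center-like outputs receive the larger value, matching the intended direction of an intensity measure for the COO problem.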
Detail enhanced contrastive loss (DECLoss) is a novel loss function for SR that aims to alleviate the COO problem and eliminate the low-frequency tendency, so as to increase the perceptual quality of SR images.
DECLoss is a perceptual-driven loss function for image SR. It does not require any generative adversarial processes, which are inefficient to train. Fig. 3 shows that DECLoss consists of two steps: the high-frequency enhancement and the spatial contrastive loss, which address the low-frequency tendency and the COO problem, respectively. First, we enhance the details of the SR images and their HR images, paying more attention to the high-frequency regions, which better match human perception [human_perceptual]. Second, we reshape both the SR images and the HR images into sequences of mini-patches. DECLoss with a smaller patch size is more sensitive to feature similarities, which improves the clustering results. Finally, we introduce contrastive learning to cluster the mini-patches according to the similarities of their corresponding HRs. Note that reducing the distance between similar SR images removes the limitation that POMs can only map an LR to its own ground truth. In addition, enlarging the distance between different groups makes it harder to map an LR to the distribution center of similar HRs (Eq. 6).
4.2 High-Frequency Enhancement
In order to increase the high-frequency details produced by POMs, we enhance the high-frequency components of the image in Fourier space. As illustrated in Fig. 3(b), we first use the Discrete Fourier Transform (DFT) to map an SR image and its ground truth (each denoted as $x$) to the Fourier domain:

$$X = \mathrm{DFT}(x), \qquad X \in \mathbb{C}^{H \times W},$$

where $H$ and $W$ are the height and width of the image $x$, respectively, and each component of $X$ is defined as:

$$X(u, v) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} x(h, w)\, e^{-i 2\pi \left( \frac{uh}{H} + \frac{vw}{W} \right)}.$$

We then multiply the inverse Gaussian kernel $G$ element-wise with the Fourier matrix $X$ to obtain:

$$\hat{X} = G \odot X,$$

where each component of $G$ is an inverse Gaussian function of the frequency coordinates $(u, v)$ that is governed by two control variables. Finally, we use the Inverse Discrete Fourier Transform (IDFT) to map the Fourier matrix $\hat{X}$ back to the image domain:

$$\hat{x} = \mathrm{IDFT}(\hat{X}),$$

where each component of $\hat{x}$ is computed from real and imag, the real part and the imaginary part of the complex IDFT output, respectively. Note that the inverse Gaussian kernel assigns a smooth importance to regions of different frequencies: it suppresses the low frequencies and enhances the high frequencies, so as to alleviate the low-frequency tendency.
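The frequency-domain step can be sketched on a tiny single-channel image with a naive DFT. The kernel form used here, `1 - alpha * exp(-(u^2 + v^2) / (2 * sigma^2))`, is an assumption standing in for the paper's inverse Gaussian kernel, and `alpha` and `sigma` are hypothetical names for its two control variables; taking the real part of the IDFT output is likewise an assumption.

```python
import cmath
import math

def dft2(img, inverse=False):
    """Naive 2D DFT (or inverse DFT) of a small H x W nested list."""
    h, w = len(img), len(img[0])
    sign = 1.0 if inverse else -1.0
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = sign * 2.0 * math.pi * (u * y / h + v * x / w)
                    acc += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = acc / (h * w) if inverse else acc
    return out

def inverse_gaussian_kernel(h, w, alpha=1.0, sigma=1.0):
    """Assumed kernel: small at low frequencies, close to 1 at high ones."""
    kernel = []
    for u in range(h):
        fu = min(u, h - u)  # wrap-around distance from the DC term
        row = []
        for v in range(w):
            fv = min(v, w - v)
            row.append(1.0 - alpha * math.exp(-(fu * fu + fv * fv) / (2.0 * sigma ** 2)))
        kernel.append(row)
    return kernel

def enhance_high_freq(img, alpha=1.0, sigma=1.0):
    """DFT -> element-wise kernel -> IDFT, keeping the real part."""
    h, w = len(img), len(img[0])
    spec = dft2(img)
    k = inverse_gaussian_kernel(h, w, alpha, sigma)
    filtered = [[spec[u][v] * k[u][v] for v in range(w)] for u in range(h)]
    return [[c.real for c in row] for row in dft2(filtered, inverse=True)]

flat = [[5.0] * 4 for _ in range(4)]                # no detail at all
edge = [[0.0, 0.0, 9.0, 9.0] for _ in range(4)]     # one vertical edge
flat_out = enhance_high_freq(flat)
edge_out = enhance_high_freq(edge)
```

With `alpha=1.0` the constant image is suppressed entirely, while the edge survives as a zero-mean signal, so a loss computed on the enhanced images is dominated by detail rather than by smooth regions. A practical implementation would use a fast Fourier transform from a numerical library instead of the quadruple loop.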
4.3 Spatial Contrastive Loss
The spatial contrastive loss is the key component for alleviating the COO problem. Based on the similarities of the HRs, we control the mapping range of the COO problem by using contrastive clustering. Fig. 3(a) illustrates that we first reshape the high-frequency enhanced (Sec. 4.2) input patches $x \in \mathbb{R}^{B \times C \times H \times W}$ into a sequence of flattened 2D mini-patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original patch, $C$ is the number of channels, $B$ is the batch size, $(P, P)$ is the resolution of each mini-patch, and $N = BHW / P^2$ is the resulting number of mini-patches. In terms of similarity, a mini-patch is more polarized than a full patch, which benefits the subsequent clustering operations.
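The reshaping step can be sketched as follows for one patch stored as nested lists in channel-first order. Non-overlapping mini-patches and the helper name `to_mini_patches` are assumptions made for illustration.

```python
def to_mini_patches(patch, p):
    """Split a [C][H][W] patch into flattened, non-overlapping p x p mini-patches."""
    c, h, w = len(patch), len(patch[0]), len(patch[0][0])
    if h % p or w % p:
        raise ValueError("patch size must be divisible by the mini-patch size")
    minis = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            # Flatten channel, then row, then column of this mini-patch.
            minis.append([patch[ch][top + dy][left + dx]
                          for ch in range(c)
                          for dy in range(p)
                          for dx in range(p)])
    return minis

# One single-channel 4 x 4 patch whose pixels are numbered 0..15.
patch = [[[float(y * 4 + x) for x in range(4)] for y in range(4)]]
minis = to_mini_patches(patch, 2)
print(len(minis), minis[0], minis[3])
```

A 4 x 4 patch with a mini-patch size of 2 yields four vectors of length four; applying the same function to every patch in a batch and concatenating the results gives the full sequence of mini-patches.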
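We first measure the cosine similarities of each SR-to-HR pair and each SR-to-SR pair of mini-patches:

$$S^{sh}_{ij} = \frac{\hat{x}^{SR}_{p,i} \cdot \hat{x}^{HR}_{p,j}}{\|\hat{x}^{SR}_{p,i}\|\,\|\hat{x}^{HR}_{p,j}\|}, \qquad S^{ss}_{ij} = \frac{\hat{x}^{SR}_{p,i} \cdot \hat{x}^{SR}_{p,j}}{\|\hat{x}^{SR}_{p,i}\|\,\|\hat{x}^{SR}_{p,j}\|}, \qquad i, j \in \{1, \dots, N\},$$

where $N$ is the number of HR mini-patches in a batch. To better discriminate the positive samples from the negative samples, we regard the PSNR similarities of HR-to-HR as a mask $M$:

$$M_{ij} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}\big(\hat{x}^{HR}_{p,i},\, \hat{x}^{HR}_{p,j}\big)},$$

where $\hat{\cdot}$ denotes high-frequency enhanced (Sec. 4.2), the subscript $p$ denotes mini-patch, MSE is the mean over the mini-patch elements of the squared $\ell_2$-norm of the difference, and MAX is the upper bound of the color space. In particular, if the similarity is greater than a threshold, the input is regarded as a positive sample; otherwise, it is a negative sample. Then, the scores of the positive samples and negative ones are represented as: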
where are temperatures. Note that is equivalent to the original contrastive learning matrix [simclr]. DECLoss is defined as:
| Config. | Model | L1 | VGG [perceptual_loss] | DECLoss | GAN [srresnet] | RaGAN [esrgan] | PSNR | LPIPS | ICOO | GPU hours |
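The spatial contrastive step of Sec. 4.3 can be sketched as below. Since the exact score and loss definitions are not reproduced here, an InfoNCE-style form is swapped in as an assumption: HR mini-patches whose PSNR to the ground truth of an SR mini-patch exceeds a threshold count as positives, all others as negatives, and each group has its own temperature. The SR-to-SR term is omitted, and `threshold`, `tau_pos`, `tau_neg`, and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two flattened mini-patches."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def psnr(a, b, max_value=1.0):
    """PSNR in dB between two mini-patches with values in [0, max_value]."""
    err = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if err == 0 else 10.0 * math.log10(max_value ** 2 / err)

def contrastive_loss(sr, hr, threshold=30.0, tau_pos=1.0, tau_neg=1.0):
    """InfoNCE-style loss with a PSNR mask over HR-to-HR pairs (assumed form)."""
    total = 0.0
    for i, s in enumerate(sr):
        pos, neg = 0.0, 0.0
        for h in hr:
            sim = cosine(s, h)
            # hr[i] is the ground truth of sr[i]; it is always a positive.
            if psnr(hr[i], h) > threshold:
                pos += math.exp(sim / tau_pos)
            else:
                neg += math.exp(sim / tau_neg)
        total += -math.log(pos / (pos + neg))
    return total / len(sr)

hr_minis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
sr_sharp = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]]
sr_center = [[0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]  # midway between HRs

print(contrastive_loss(sr_sharp, hr_minis), contrastive_loss(sr_center, hr_minis))
```

In this toy case an output at the center of the two HR mini-patches is penalized more than an output that commits to one of them, which is the behavior the loss is meant to encourage.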
4.4 Loss Function
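Inspired by the Perceptual Loss [perceptual_loss], we apply the $L_1$ loss and the perceptual loss in our model. The $L_1$ loss is the 1-norm distance between a generated image $I^{SR}$ and its ground truth $I^{HR}$, which ensures the image reconstruction quality. Thus, the $L_1$ loss is defined as:

$$\mathcal{L}_{1} = \big\| I^{SR} - I^{HR} \big\|_{1}. \tag{24}$$

The perceptual loss measures the distance in the feature space. We use a pre-trained VGG-19 [vgg] to generate the feature map, denoted as $\phi$:

$$\mathcal{L}_{per} = d\big(\phi(I^{SR}),\, \phi(I^{HR})\big), \tag{25}$$

where $d(\cdot,\cdot)$ is the distance in the feature space. The total loss function is defined as:

$$\mathcal{L} = \lambda_{1} \mathcal{L}_{1} + \lambda_{2} \mathcal{L}_{per} + \lambda_{3} \mathcal{L}_{DEC}, \tag{26}$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the weights that balance the different loss terms, and $\mathcal{L}_{DEC}$ is the DECLoss of Sec. 4.3.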
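As a minimal sketch of how the three terms combine, the snippet below computes a mean-normalized L1 term on flattened pixels and forms the weighted sum. The perceptual and DECLoss values are passed in as plain numbers because a VGG-19 feature extractor is out of scope here; the weights, the mean normalization, and all numeric values are illustrative assumptions.

```python
def l1_loss(sr, hr):
    """Mean absolute difference between two flattened images (assumed normalization)."""
    return sum(abs(s - h) for s, h in zip(sr, hr)) / len(sr)

def total_loss(l1, perceptual, dec, w_l1=1.0, w_per=1.0, w_dec=1.0):
    """Weighted sum of the restoration, perceptual, and DECLoss terms."""
    return w_l1 * l1 + w_per * perceptual + w_dec * dec

sr_pixels = [0.2, 0.4, 0.9, 0.1]
hr_pixels = [0.0, 0.5, 1.0, 0.1]
restoration = l1_loss(sr_pixels, hr_pixels)

# Hypothetical values for the other two terms and for the weights.
loss = total_loss(restoration, perceptual=0.3, dec=0.7, w_l1=1.0, w_per=0.5, w_dec=2.0)
print(restoration, loss)
```

Setting a weight to zero recovers the corresponding ablation, for example a configuration trained with the restoration and DECLoss terms only.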
We train all the models on the DIV2K [div2k] dataset, which consists of 2K-resolution images: 800 training images, 100 validation images, and 100 test images. To obtain the LR training data, we down-sample the HR images using bicubic interpolation. We evaluate all models on widely used SR benchmarks: DIV2K (test) [div2k], BSD100 [bsd100], and Urban100 [urban100]. We mainly conduct experiments on the EDSR and RRDB models. Following [edsr], all experiments are performed with the same scaling factor between LR and HR images, and we crop corresponding patches from the LR and HR images. To ensure fair comparisons, the batch size of all models is set to 32. There are 1,000 batches in each training round.
We divide the training process into two stages. First, we pre-train all the models with the $L_1$ loss (Eq. 24) as an initialization for 200 epochs, including 5 warm-up epochs, using a cosine learning rate schedule. This pre-training with the $L_1$ loss helps the model converge to a certain range. Second, the generator is trained using the loss function defined in Eq. 26. The mini-patch size is chosen as a trade-off between the number of mini-patches and the computational resources. A cosine decay schedule is again applied to the learning rate. For the optimization, we use the Adam algorithm [adam]. We implement our models with PyTorch and train them on 4 A100 GPUs.
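A cosine learning-rate schedule with warm-up, as used in the pre-training stage, can be sketched per epoch as follows. The 200 epochs and 5 warm-up epochs come from the text; the linear warm-up shape, the base learning rate `1e-4`, and the floor `min_lr` are assumptions, since the original values are not available here.

```python
import math

def lr_at(epoch, total_epochs=200, warmup_epochs=5, base_lr=1e-4, min_lr=0.0):
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine decay."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(e) for e in range(200)]
print(schedule[0], schedule[4], schedule[100], schedule[199])
```

In a PyTorch training loop the same curve is usually obtained by combining a warm-up scheduler with a cosine annealing scheduler rather than by computing the rate by hand.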
5.2 Intensity of COO Problem
We proposed a function to measure the intensity of the COO problem (Eq. 11), termed ICOO. ICOO calculates the distribution similarity between SRs and HRs rather than the similarity of two images. For each SR image, eight SR mini-patches are randomly cropped, and for each HR image, 100 HR mini-patches are cropped. We average the results over ten rounds to reduce the influence of randomness. Fig. 5 shows the results of 20 randomly selected models, with various architectures and initializations, on Urban100 [urban100]. The Spearman rank correlation between ICOO and LPIPS is 0.892, demonstrating a high positive correlation between the two. We also include our ICOO metric in the comparisons.
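The rank correlation reported above is read here as Spearman's rank correlation, which is an assumption based on the wording. It can be computed as the Pearson correlation of the ranks, with tied values sharing their average rank. The two score lists below are hypothetical and are not results from this paper.

```python
import math

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Hypothetical per-model scores for two metrics.
metric_a = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_b = [2.0, 1.0, 4.0, 3.0, 5.0]
print(spearman(metric_a, metric_b))
```

Because only the ordering matters, a value close to 1 means that the two metrics rank the models almost identically even when their scales differ.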
Comparison of different loss configurations. We conducted an ablation study with different loss configurations to show the effectiveness of our DECLoss. In Table 1 and Fig. 4, Config. 1 gives the results of the basic EDSR; Config. 2 gives the results of DECLoss with the restoration loss $L_1$; Config. 3 gives the results of the perceptual loss; Configs. 4, 5, and 9 give the results of the GAN-based losses; and Configs. 6 and 8 give the results of our DECLoss trained with the perceptual loss [esrgan]. The results of our DECLoss in Configs. 2 and 6 outperform those of Configs. 1 and 3, respectively. Config. 6 also performs on par with the GAN-based Configs. 4 and 5, and Config. 8 achieves scores similar to those of Config. 9. We also examine the orthogonality of DECLoss by training it together with RaGAN [esrgan] in Configs. 7 and 10: the LPIPS of Configs. 7 and 10 is 0.004 and 0.002 lower than that of Configs. 5 and 9, respectively.
Influence of contrastive temperatures. Following the original contrastive learning [simclr], we introduced two temperature values to balance the intensity of the positive and negative samples. As shown in Fig. 6, we first varied one temperature over [0.5, 1.0, 1.5, 2.0, 4.0, 8.0] while fixing the other, and then varied the second temperature over the same values while fixing the first. Since contrastive learning in SR is relatively simple, larger temperatures can increase the learning efficiency. The results demonstrate that, within a certain range, the two temperatures have no significant influence on the experimental results.
Influence of different patch sizes. The mini-patch size is also an important trade-off between model performance and resource consumption, since the memory required by the similarity computation grows rapidly as the mini-patch size shrinks. As shown in Table 2, we evaluated EDSR [edsr] with mini-patch sizes of [1, 2, 3, 4, 8] on DIV2K [div2k]. To avoid running out of GPU memory, we set the batch size to 16 for this experiment. Interestingly, within a certain range, the smaller the mini-patch size is, the better the perceptual quality. This indicates that contrastive learning maps similar HR details more accurately with smaller mini-patches, which yields better LPIPS.
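The memory trade-off can be made concrete with a rough count, assuming that the dominant cost is the pairwise similarity matrix between all mini-patches in a batch. The batch size of 16 and the mini-patch sizes come from the text; the HR patch size of 192 is a hypothetical value chosen only because every tested mini-patch size divides it.

```python
def similarity_entries(batch, patch, mini):
    """Return (number of mini-patches, entries of the N x N similarity matrix)."""
    n = batch * (patch // mini) ** 2
    return n, n * n

for mini in [1, 2, 3, 4, 8]:
    n, entries = similarity_entries(batch=16, patch=192, mini=mini)
    print(mini, n, entries)
```

Under this assumption, halving the mini-patch size multiplies the number of mini-patches by 4 and the number of matrix entries by 16, which explains why very small mini-patches quickly exhaust GPU memory.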
| Benchmark | Metric | Bicubic | EDSR [edsr] | SRFlow [srflow] | RankSRGAN [ranksrgan] | DECLoss | ESRGAN [esrgan] | DECLoss+ |
5.4 Comparisons with State-of-the-art Methods
In addition to verifying the effectiveness of our DECLoss, we compared it with state-of-the-art methods. Note that DECLoss does not need to outperform the GAN-based methods; its advantage is that it requires less GPU training time to achieve similar accuracy. As shown in Table 3, our DECLoss, trained on RRDB, outperforms POMs. Compared with the GAN-based methods RankSRGAN [ranksrgan] and ESRGAN [esrgan], our DECLoss performs equivalently to RankSRGAN while being 2.4× faster, and its LPIPS on Urban100 is 0.005 higher than that of ESRGAN while being 1.57× faster. To show the orthogonality of our DECLoss, we also apply RaGAN [esrgan] together with DECLoss on RRDB, denoted as DECLoss+, which outperforms a variety of state-of-the-art methods. We illustrate these results in Fig. 7. Compared with the detail-enhanced SR model SRFlow [srflow], DECLoss+ achieves an LPIPS that is 0.011 lower on Urban100 while being 12.85× faster.
In this work, we identified two factors that hinder the perceptual quality of PSNR-oriented models: the center-oriented optimization problem and the low-frequency tendency. To alleviate these problems, we proposed the Detail Enhanced Contrastive Loss (DECLoss), which consists of a high-frequency enhancement module and a spatial contrastive learning loss. Without any generative adversarial processes, our DECLoss achieved performance equivalent to that of GAN-based methods while being faster. The experimental results demonstrated the efficiency of our method. It is also interesting that a smaller mini-patch size leads to better perceptual quality (Table 2). However, because of hardware memory constraints, the mini-patch size cannot be reduced indefinitely; how to balance this trade-off is an important direction for our future work.