FCL-GAN: A Lightweight and Real-Time Baseline for Unsupervised Blind Image Deblurring

04/16/2022
by   Suiyi Zhao, et al.

Blind image deblurring (BID) remains a challenging and significant task. Benefiting from the strong fitting ability of deep learning, paired data-driven supervised BID methods have made great progress. However, paired data are usually synthesized by hand, and realistic blurs are more complex than synthetic ones, which makes supervised methods inept at modeling realistic blurs and hinders their real-world applications. As such, unsupervised deep BID methods without paired data offer certain advantages, but current methods still suffer from some drawbacks, e.g., bulky model size, long inference time, and strict image resolution and domain requirements. In this paper, we propose a lightweight and real-time unsupervised BID baseline, termed Frequency-domain Contrastive Loss Constrained Lightweight CycleGAN (shortly, FCL-GAN), with attractive properties, i.e., no image domain limitation, no image resolution limitation, 25x lighter than the SOTA, and 5x faster than the SOTA. To guarantee the lightweight property and performance superiority, two new collaboration units, called the lightweight domain conversion unit (LDCU) and the parameter-free frequency-domain contrastive unit (PFCU), are designed. LDCU implements inter-domain conversion in a lightweight manner. PFCU further explores the similarity measure, external difference and internal connection between blurred-domain and sharp-domain images in the frequency domain, without involving extra parameters. Extensive experiments on several image datasets demonstrate the effectiveness of our FCL-GAN in terms of performance, model size and inference time.


1. Introduction

Figure 1. Performance comparison between the SOTA unsupervised method UID-GAN (Lu et al., 2019) and our FCL-GAN. PSNR, SSIM (Wang et al., 2004) and NIQE (Mittal et al., 2012) are compared on the GoPro test set (Nah et al., 2017), and "Runtime" denotes the time to infer a 1280x720 image on an Nvidia RTX 2080 Ti.
Figure 2. The architecture of FCL-GAN, which consists of two collaborative units: LDCU and PFCU. LDCU performs conversion between the blurred and sharp domains, and PFCU pulls similar latent representations together and pushes dissimilar latent representations apart in the frequency domain.

Blind image deblurring (BID), as a classical multimedia processing task, aims at recovering a latent sharp image from a blurred input. Blurred pictures are common in the real world; they greatly degrade image quality and harm both low-level vision perception and high-level tasks. Conventional optimization-based methods assume the latent sharp image satisfies various priors (Bai et al., 2020; Joshi et al., 2009; Pan et al., 2014b; Chen et al., 2019; Pan et al., 2016; Yan et al., 2017), and transform the deblurring problem into a maximum a posteriori optimization. However, these methods require complex iterative optimization with long inference time, and their deblurring results often contain heavy, unpleasant artifacts.

In recent years, data-driven deep BID methods (usually supervised) (Nah et al., 2017; Tao et al., 2018; Kupyn et al., 2018; Zhang et al., 2019a; Suin et al., 2020; Zhang et al., 2020; Chen et al., 2021; Zamir et al., 2021a; Cho et al., 2021; Zamir et al., 2021b) have achieved superior performance, benefiting from the rapid development of deep learning. Data-driven supervised BID methods usually use a large amount of synthetic paired data to train a deep neural network (DNN) with various cost functions, and then learn an end-to-end mapping from blurred to sharp images. Note that a large amount of synthetic paired data is the key to the success of supervised BID methods. However, collecting paired data by hand is expensive, and strictly paired data are usually impossible to obtain in reality. Besides, realistic blur is more complex and diverse than synthetic blur, and synthetic data cannot reflect all kinds of blur, which limits the performance of current supervised deblurring methods in practical applications. These challenges have given rise to unsupervised deep BID methods (Nimisha et al., 2018; Lu et al., 2019; Chen et al., 2018; Zhao et al., 2021), which aim at learning the mapping from blurred to sharp images without any paired data.

Deep unsupervised BID methods are rarely studied because they are more difficult and challenging than supervised models. Specifically, due to the lack of strong constraints in the unpaired case, researchers tend to design bulky deep models to capture weak connections between input and output, which usually results in long inference time. As a result, deploying such bulky models onto mobile devices for online, real-time computation is challenging. Besides, current unsupervised deep models perform BID under numerous restrictions, e.g., a specific image domain (Nimisha et al., 2018; Lu et al., 2019), the need for multiple frames during training (Chen et al., 2018), and small input resolutions (Zhao et al., 2021). Clearly, these restrictions directly limit the development and real-world application of unsupervised BID models.

In this paper, we therefore propose a simple and effective deep unsupervised BID baseline to address the aforementioned shortcomings of current studies. From a performance point of view, the proposed baseline should satisfy the following conditions: 1) high model performance, i.e., achieving SOTA performance among unsupervised methods; 2) lightweight model size, i.e., meeting the configuration requirements of most current devices; 3) fast inference speed, i.e., the baseline can work in real time, below 33.33 ms/frame. From the perspective of overcoming limitations, the baseline should also satisfy the following conditions: 1) no image domain limitation, i.e., it can work on an arbitrary image domain; 2) no image resolution limitation, i.e., it can process high-resolution images; 3) no multi-frame input limitation, i.e., it performs single-input-single-output mapping. To illustrate that our baseline meets the above conditions, we take a high-resolution natural image as input and compare against the SOTA's output in Figure 1. Clearly, our method outperforms the SOTA in these aspects.

The major contributions of this paper are summarized as follows:

  • We propose a lightweight and real-time baseline (FCL-GAN) for unsupervised BID. FCL-GAN drives the task with a frequency-domain contrastive loss constrained lightweight CycleGAN, as shown in Figure 2. Two new collaborative units, called the lightweight domain conversion unit (LDCU) and the parameter-free frequency-domain contrastive unit (PFCU), are designed; together they ensure the "lightweight" and "real-time" properties while overcoming the limitations of current unsupervised methods. To the best of our knowledge, this is the first lightweight and real-time deep unsupervised BID method, and it can serve as a guiding baseline for future research.

  • Although CycleGAN (Zhu et al., 2017) is a popular architecture trained without paired data via a cycle-consistency loss, it still suffers from large chromatic aberration, severe artifacts and model redundancy when handling the BID task (see Figure 3). We therefore disassemble and analyze its model structure and basic constituent units, and present a novel lightweight domain conversion unit (LDCU) that deblurs degraded images better in a lightweight and real-time manner.

  • To further improve deblurring performance without extra parameters, we introduce the new concept of frequency-domain contrastive learning (FCL) and present a parameter-free frequency-domain contrastive unit (PFCU) that measures the similarity between latent representations in the frequency domain as a contrastive constraint. FCL also addresses the shortcomings of contrastive learning on the deblurring task.

  • Extensive experiments on several datasets demonstrate the SOTA unsupervised deblurring performance of our FCL-GAN, with stronger generalization ability to handle real-world blurs. Specifically, FCL-GAN has a smaller model size (24.6 MB vs. 606.8 MB) and less inference time (0.011 s vs. 0.062 s) than the current SOTA method.

2. Related Work

2.1. Data-Driven Deep BID

Deep fully-supervised BID. Benefiting from large-scale paired training data, supervised BID methods can learn accurate mappings more easily. For example, a "multi-scale" end-to-end training manner (Nah et al., 2017) was proposed for direct deblurring without blur kernels. Kupyn et al. (Kupyn et al., 2018, 2019) leverage CGAN (Mirza and Osindero, 2014) and WGAN-GP (Gulrajani et al., 2017) to obtain visually realistic deblurring results in a generative-adversarial manner. Zhang et al. (Zhang et al., 2020) provide a real-world blurred image dataset and design an effective GAN-based network (DBGAN) to model real-world blur. The "multi-patch" strategy is also used to partition the image in the spatial dimensions and then perform coarse-to-fine progressive deblurring (Zhang et al., 2019a; Zamir et al., 2021a). Instead of using convolution, Zamir et al. (Zamir et al., 2021b) introduce and refine the transformer (Vaswani et al., 2017) to deblur high-resolution images, achieving state-of-the-art (SOTA) fully-supervised BID performance.

Deep semi-supervised BID. Semi-supervised methods learn an approximate blurred-sharp mapping from small-scale paired data and large-scale unpaired data. Compared with fully-supervised BID, it is more difficult for semi-supervised BID to learn a blurred-sharp mapping. Therefore, Nimisha et al. (M et al., 2018) first try to estimate the global camera motion from small-scale paired data in a semi-supervised manner and use the estimated global camera motion to perform single image deblurring and change detection.

Deep unsupervised BID. The training of unsupervised methods does not involve paired data; instead, large-scale unpaired data are used for deblurring. Compared with fully-/semi-supervised BID, unsupervised BID methods find it more difficult to learn an accurate blurred-sharp mapping due to the weak constraint between blurred and sharp images. For example, a self-supervised optimization scheme (Chen et al., 2018) was proposed on top of existing deblurring models, which utilizes consecutive video frames and introduces a physically-based blur model during training to improve deblurring performance. Strictly speaking, this is not an unsupervised BID technique, since it is built on a fully supervised model and consecutive frames. Based on the simple generative adversarial network (GAN) (Goodfellow et al., 2014), Nimisha et al. (Nimisha et al., 2018) introduce a scale-space gradient loss and a reblurring loss to self-supervise the model to perform domain-specific deblurring. Lu et al. (Lu et al., 2019) disentangle the content and blur of the blurred image over domain-specific datasets, so that the blur can be easily removed from the blurred image. Zhao et al. (Zhao et al., 2021) focus on the chromatic aberration problem of unsupervised methods and propose a blur offset estimation and adaptive blur correction strategy to maintain color information while deblurring, which achieves better unsupervised BID performance.

2.2. Contrastive Learning

With the advancement of self-supervised and unsupervised techniques, contrastive learning has been attracting more and more attention. In general, contrastive learning maps the original data into a latent representation space in which anchors are pulled close to positive samples and pushed away from negative samples. In this way, the model not only learns from positive signals but also benefits from correcting undesirable behaviors. In recent years, contrastive learning has been widely used in various high-level vision tasks, e.g., object detection (Xie et al., 2021), medical image segmentation (Chaitanya et al., 2020) and image captioning (Dai and Lin, 2017), where it has achieved superior performance and is receiving increasing attention. More recently, contrastive learning has been successfully applied to various low-level vision tasks and achieved SOTA performance, e.g., image denoising (Dong et al., 2021), image dehazing (Wu et al., 2021) and image super-resolution (Zhang et al., 2021). Similarly, Chen et al. (Chen et al., 2022) first introduce contrastive learning to the unsupervised single image deraining task and impose contrastive constraints at the feature level, which achieves SOTA performance in unsupervised single image deraining.

3. Proposed Baseline Method

3.1. Architecture

We show the architecture and learning process of our FCL-GAN in Figure 2. It has two main cooperating units, i.e., LDCU and PFCU. LDCU is a lightweight unit that implements interconversion between images of different domains. PFCU applies contrastive learning to the deblurring task, making the output anchor closer to the positive sample. In what follows, we detail the interactive process between the LDCU and the PFCU. For ease of description, we begin with some basic definitions:

  • “Stream”: The process of data transfer, e.g., “positive stream” is used to transfer positive samples and positive latent representations. When the positive samples are sharp, the “stream” will be denoted as “sharp-guide stream”.

  • "Buffer": The place where samples used in previous epochs are stored. Samples inside the "buffer" are treated as negative samples. For example, for the "sharp-guide" streams, the samples inside the "buffer" are all blurred.

LDCU is the basis of the whole FCL-GAN architecture and implements domain conversion. As shown in Figure 2, LDCU contains two branches, i.e., the deblurring branch (blurred to sharp) and the reblurring branch (sharp to blurred). Correspondingly, PFCU imposes different contrastive constraints on the two branches, i.e., a sharp-guide contrastive constraint for the deblurring branch and a blurred-guide contrastive constraint for the reblurring branch. Taking the deblurring branch as an example, the latent (restored) images, the sharp images and the images in the sharp-guide negative buffer are transferred to the PFCU as the anchor, positive samples and negative samples, respectively, to obtain the latent representations. Then, the similarities are calculated and the contrastive constraints are imposed. Note that the framework of FCL-GAN is lightweight and real-time, since LDCU is lightweight and real-time, and PFCU does not involve extra parameters or calculations during inference.
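To make the buffer mechanism concrete, below is a minimal PyTorch-style sketch of a negative-sample buffer, loosely modeled on the image pool commonly used in CycleGAN-style training; the class name, capacity and replacement policy are illustrative assumptions rather than the authors' released implementation.

```python
import random
import torch

class NegativeBuffer:
    """Stores samples from previous iterations to serve as contrastive negatives.

    Minimal sketch: capacity and sampling policy are illustrative assumptions.
    """

    def __init__(self, capacity: int = 50):
        self.capacity = capacity
        self.samples = []

    def push(self, batch: torch.Tensor) -> None:
        # Detach so stored negatives do not keep old computation graphs alive.
        for img in batch.detach().cpu():
            if len(self.samples) < self.capacity:
                self.samples.append(img)
            else:
                # Randomly overwrite an old sample once the buffer is full.
                self.samples[random.randrange(self.capacity)] = img

    def sample(self, n: int) -> torch.Tensor:
        # Draw n negatives (with replacement while the buffer is still small).
        picks = [random.choice(self.samples) for _ in range(n)]
        return torch.stack(picks)

# Usage idea: for the sharp-guide stream, blurred images are pushed into the
# buffer and later drawn as negatives for the contrastive constraint, e.g.
# buffer = NegativeBuffer(); buffer.push(blurred_batch); negs = buffer.sample(8)
```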

3.2. Lightweight Domain Conversion Unit

Let B be the blurred image domain and S be the sharp image domain; our ultimate goal is to map the images in B, without ground truth, to S. To this end, we introduce LDCU as the backbone of our FCL-GAN. LDCU contains two generators (G_B→S and G_S→B) and two discriminators (D_S and D_B): G_B→S (G_S→B) maps images in B (S) to S (B), and D_S (D_B) discriminates the authenticity of images in S (B). Motivated by (Wei et al., 2021; Chen et al., 2022), two functional circuits are set up to perform deblurring and blur generation. Taking the deblurring circuit as an example, given images in B, G_B→S maps them to images in S, and G_S→B then remaps them to images in B, i.e., B→S→B. Throughout the deblurring circuit, D_S is used to discriminate whether the generated image is truly an image in S. Similarly, the blur generation circuit accepts images in S for the opposite mapping, i.e., S→B→S.
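As a rough illustration of the two circuits, the following sketch wires up generic generator and discriminator modules; the names G_b2s, G_s2b, D_s, D_b and the L1 cycle-consistency term follow common CycleGAN practice and are assumptions for illustration, not the released FCL-GAN code.

```python
import torch
import torch.nn as nn

def cycle_step(G_b2s: nn.Module, G_s2b: nn.Module,
               D_s: nn.Module, D_b: nn.Module,
               blurred: torch.Tensor, sharp: torch.Tensor):
    """One forward pass of the deblurring (B->S->B) and blur-generation
    (S->B->S) circuits. Sketch only; the L1 cycle loss is an assumption."""
    l1 = nn.L1Loss()

    # Deblurring circuit: B -> S -> B.
    fake_sharp = G_b2s(blurred)        # latent sharp estimate
    rec_blurred = G_s2b(fake_sharp)    # remapped back to the blurred domain
    adv_deblur = D_s(fake_sharp)       # D_s judges whether fake_sharp looks sharp

    # Blur-generation circuit: S -> B -> S.
    fake_blurred = G_s2b(sharp)
    rec_sharp = G_b2s(fake_blurred)
    adv_reblur = D_b(fake_blurred)     # D_b judges whether fake_blurred looks blurred

    cycle_loss = l1(rec_blurred, blurred) + l1(rec_sharp, sharp)
    return fake_sharp, fake_blurred, adv_deblur, adv_reblur, cycle_loss
```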

In general, converting images between B and S can be regarded as an image-to-image translation task. As a classical framework in this area, CycleGAN (Zhu et al., 2017) has a very powerful ability to learn inter-domain differences. However, CycleGAN cannot achieve the expected deblurring result, and there are several potential reasons for this: 1) Incompatibility of network architectures. Unlike image-to-image translation, the network architecture of a deblurring method is elaborate, and sometimes a slight change can severely disrupt the results, e.g., changing BN to IN, as can be seen in the ablation studies; 2) Different degrees of task complexity. Image-to-image translation tasks (e.g., zebra-horse, orange-apple and summer-winter) tend to have a straightforward inter-domain difference, so a deep network can easily learn the precise inter-domain difference. However, the inter-domain difference for image deblurring is usually complex, which is reflected by the blurring degree. In the extreme cases, when the blurring degree is very large, one cannot even read any information from the blurred image; when the blurring degree is very small, the blurred image can almost be regarded as a sharp image. Therefore, directly migrating CycleGAN to the deblurring task is not feasible and leads to a series of negative effects, such as chromatic aberration, severe artifacts and parameter redundancy, as shown in Figure 3.

As such, we strive to make the network lightweight and more effective. Specifically, we perform a fully recursive decomposition of the entire LDCU, carefully design each minimal component, and finally reconstitute LDCU from these elaborate components. We mainly focus on refining the generator in LDCU, since it directly carries out the domain conversion and is the key component of LDCU. Generally speaking, the number of generator parameters is much higher than that of the discriminator in the whole generation-discrimination structure, because a sufficient number of parameters better exploits the powerful nonlinear mapping capability of a DNN. However, the generator tends to saturate once the number of parameters reaches a certain level; beyond that point, additional parameters make the model redundant and seriously hinder its inference and deployment.

To minimize model redundancy, we introduce and perform the following operations from bottom to top: 1) meta design; 2) lightweight structure design. Next, we detail the operations.

Meta design. Convolutional neural networks (CNNs) consist of a stack of non-divisible units, e.g., convolution (Conv), batch normalization (BN), instance normalization (IN) and the rectified linear unit (ReLU). However, how to efficiently organize these non-divisible units to build a deblurring network is a problem worth exploring. Conv and ReLU are necessary, because they respectively support the CNN's parametric learning capability and nonlinear fitting capability. Whether the normalization unit is a mandatory factor, however, deserves discussion.

For supervised deblurring, researchers usually do not use normalization units in the model structure, since adding extra normalization under adequately constrained conditions may severely hinder the model from learning an accurate 1-to-1 blurred-to-sharp mapping (Nah et al., 2017; Tao et al., 2018; Zhang et al., 2019a; Park et al., 2020; Tsai et al., 2021; Zamir et al., 2021a; Cho et al., 2021). In some other fields (e.g., image-to-image translation and style transfer), however, researchers do use IN in the model structure, because those tasks expect to learn a 1-to-1 style mapping (Zhu et al., 2017; Yi et al., 2017; Zhang et al., 2019b). In unsupervised deblurring, the lack of strong constraints and the diversity of blurring degrees lead to an n-to-n style mapping. Besides, BN normalizes over multiple samples (a mini-batch) rather than a single instance as IN does, and is therefore more likely to learn global differences among domains. To this end, we introduce the concepts of basic and residual metas to learn this weakly-constrained style mapping based on BN, as shown in Figure 4 (a) and (e). We analyze the importance of the proposed basic and residual metas in detail in the ablation studies.
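To make the meta design concrete, here is a minimal PyTorch sketch of a Conv-BN-ReLU basic meta (form (a)) and a BN-based residual meta with a skip connection; the kernel size and the exact simplification used in the paper's form (e) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BasicMeta(nn.Module):
    """Conv-BN-ReLU basic meta (form (a)); the 3x3 kernel is an assumption."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch),   # BN (not IN): normalizes over the mini-batch
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)


class ResidualMeta(nn.Module):
    """Residual meta: two BN-based convolutions plus a skip connection.

    The paper's simplified form (e) may differ in detail; this two-convolution
    variant is only an illustrative guess.
    """

    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)
```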

Figure 3. Visualizing the negative effects when CycleGAN (Zhu et al., 2017) is applied directly to the deblurring task. CycleGAN often causes severe chromatic aberrations (in both groups, about 1.5 times larger than ours), model redundancy (107.90 MB vs. our 24.56 MB) and artifacts.
Figure 4. Different forms of basic and residual metas. "Norm" denotes the normalization in the basic meta, which can be BN, IN or norm-free. LDCU is based on (a) and (e). (g) is a schematic diagram showing that norm-free leads to instability during training.

Lightweight structure design. In the field of image restoration, many efforts have been devoted to designing complex models to obtain the desired results. However, no matter how complex these models are, they are all based on two underlying structures, i.e., the encoder-decoder structure (Nah et al., 2017) (see Figure 5 (b)) or the single-scale pipeline (Ren et al., 2019) (see Figure 5 (a)). The encoder-decoder structure can effectively abstract content information but cannot maintain the spatial details of the images. The single-scale pipeline can preserve accurate spatial information, but cannot easily abstract the image content; under the same settings, its number of parameters equals that of the encoder-decoder structure. From the viewpoint of inference speed, however, without the encoding-decoding process the single-scale model forwards at the original resolution, which slows inference and hurts real-time capability. For brevity, we use encoder-decoder and single-scale to denote these two structures below.

Considering the strengths and weaknesses of the two existing structures, Zamir et al. (Zamir et al., 2021a) add an additional original-resolution module after the encoder-decoder, which balances the advantages of both but also introduces extra parameters and inference time. Inspired by Cho et al. (Cho et al., 2021), we instead introduce a new structure, termed the lightweight encoder-decoder (LED), to incorporate the strengths of both encoder-decoder and single-scale, as shown in Figure 5 (c). Compared with the encoder-decoder, LED stacks residual metas at a larger resolution, which is more favorable for preserving spatial details. Compared with single-scale, LED contains both encoding and decoding, which facilitates abstraction of the image content.
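Building on the meta sketches above, the following is a rough illustration of how an LED-style generator could stack residual metas at a relatively shallow (higher-resolution) stage; the channel widths, depth and up/downsampling operators are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LEDGenerator(nn.Module):
    """Lightweight encoder-decoder (LED) sketch: residual metas are stacked at
    a larger resolution than a conventional deep bottleneck. Reuses the
    BasicMeta/ResidualMeta classes sketched earlier; widths/depths are guesses."""

    def __init__(self, base_ch: int = 32, n_res: int = 6):
        super().__init__()
        self.head = BasicMeta(3, base_ch)                      # full resolution
        self.down = BasicMeta(base_ch, base_ch * 2, stride=2)  # one downsampling step
        # Residual metas kept at 1/2 resolution (shallower than usual), which
        # helps preserve spatial detail while still abstracting content.
        self.body = nn.Sequential(*[ResidualMeta(base_ch * 2) for _ in range(n_res)])
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            BasicMeta(base_ch * 2, base_ch),
        )
        self.tail = nn.Conv2d(base_ch, 3, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.head(x)
        f = self.down(f)
        f = self.body(f)
        f = self.up(f)
        return torch.tanh(self.tail(f))
```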

Figure 5. Different structures of DNN models. "ResMetas" denotes several stacked residual metas. For simplicity, only two basic metas are shown in the encoding/decoding stages.

Figure 6 (a) illustrates the difference between LED and the encoder-decoder, i.e., the ResMetas of the encoder-decoder are all gathered in the deepest layer, while our LED's can be distributed in shallower layers. Figures 6 (b) and (c) compare the number of parameters and the number of calculations of the two structures under the same setting. It is known that, for the number of parameters, encoder-decoder = single-scale, while for the number of calculations, encoder-decoder < single-scale. However, as we can see from Figure 6, LED's parameter count is much lower than that of the encoder-decoder while its number of calculations is the same. We can therefore conclude that LED inherits the advantages of both encoder-decoder and single-scale while making the model lightweight without introducing extra calculations. We further illustrate the effectiveness of our LED in the ablation studies.

3.3. Parameter-Free Frequency-Domain Contrastive Unit (PFCU)

The difference between blurred and sharp images is hard to explain in the spatial domain. In the frequency domain, however, the difference can be explained by the fact that a blurred image loses the high-frequency signal of the sharp image. A few efforts have investigated the frequency domain for the BID task, but they are limited to two approaches: either imposing frequency-domain constraints between the latent image and the ground truth (Cho et al., 2021), or integrating the fast Fourier transform (FFT) and inverse fast Fourier transform (IFFT) into the model (Mao et al., 2021). However, these approaches are ground-truth-dependent and difficult to interpret.

Figure 6. (a) Difference between LED and the encoder-decoder. (b) and (c) compare the number of parameters and the number of calculations indicated by the red and blue arrows.

We have studied a large number of blurred and sharp images in the frequency domain and found that, without normalizing the images, the blurred images tend to appear almost entirely black, whereas the sharp images tend to appear much whiter. Besides, the higher the degree of blurring, the darker the frequency-domain image, as shown in Figure 7. However, it is very challenging to directly measure the difference between blurred and sharp frequency-domain images in an unsupervised manner. Therefore, we turn to powerful contrastive learning.

Two key questions need to be addressed to fully exploit the power of contrastive learning: 1) How do we obtain the latent representation of samples? 2) How do we calculate the similarity between the latent representations? Note that DCD-GAN (Chen et al., 2022) uses an additional branch to obtain latent representations and the cosine similarity function to define distances. However, this approach makes the model tend to update the extra branch and ignore the task itself.

To address the above problems, and following the principle of being lightweight, we introduce a novel parameter-free frequency-domain contrastive unit (PFCU) for contrastive learning in the frequency domain. Figure 2 shows the structure of PFCU in detail; it consists of several layers, i.e., an FFT layer, a modulus-performing layer, a binarization layer and a de-marginalization layer. Specifically, given an input sample x, the FFT layer first performs a fast Fourier transform on x to obtain the complex-valued frequency-domain output. The modulus-performing layer then takes the modulus to obtain a real-valued representation. The binarization layer binarizes this representation according to a threshold value (zero). Finally, the de-marginalization layer preserves the central region to further highlight the frequency-domain variation and yields the latent representation r of the input x. The whole process of obtaining the latent representation can be expressed by the following formula:
r = DeMarg(Bin(|FFT(x)|)),   (1)

where FFT(·), |·|, Bin(·) and DeMarg(·) denote the FFT layer, the modulus-performing layer, the binarization layer and the de-marginalization layer, respectively.

Figure 7. Different degrees of blurred and sharp images in the frequency domain without normalization.
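A minimal sketch of how the PFCU latent representation of Eq. (1) could be computed in PyTorch follows; the frequency shift, the 8-bit-style quantization applied before the zero threshold, and the crop ratio of the de-marginalization step are our assumptions, since the exact scaling is not reproduced here.

```python
import torch

def pfcu_representation(x: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Parameter-free latent representation: FFT -> modulus -> binarize -> crop.

    x: image tensor of shape (B, C, H, W). keep_ratio (central region kept by
    the de-marginalization layer) is an assumed value, not the paper's setting.
    """
    freq = torch.fft.fft2(x)                          # FFT layer (complex output)
    freq = torch.fft.fftshift(freq, dim=(-2, -1))     # center low frequencies (assumed)
    mag = torch.abs(freq)                             # modulus-performing layer
    # The text thresholds at zero; raw float magnitudes are almost never exactly
    # zero, so we assume an unnormalized 8-bit-style quantization first
    # (this scaling step is our assumption, not the paper's).
    quantized = torch.clamp(mag, 0, 255).round()
    binary = (quantized > 0).float()                  # binarization layer
    B, C, H, W = binary.shape                         # de-marginalization: keep center
    h, w = int(H * keep_ratio), int(W * keep_ratio)
    top, left = (H - h) // 2, (W - w) // 2
    return binary[..., top:top + h, left:left + w]
```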

After obtaining the latent representations of samples, calculating their similarity is the most critical issue to be addressed. Note that it is not feasible to directly use the traditional cosine similarity, because there are no learnable parameters in the computation of the latent representations, and two frequency-domain images with only small differences may receive a low cosine similarity score, which is clearly unreasonable (see Figure 8). For this purpose, considering the difference in black coverage between latent representations, we design a new similarity measure. Specifically, given two latent representations r1 and r2 and a hyperparameter k, we first chunk r1 and r2 into k chunks each, and then calculate the black coverage of each chunk. The black coverage of a chunk is calculated as follows:

c(P) = (1 / (H × W)) · Σ_{i=1..H} Σ_{j=1..W} 1[P(i, j) = 0],   (2)

where P is one chunk of a chunked latent representation (e.g., a chunk of r1 or r2 above), P(i, j) is the element in the i-th row and j-th column of P, 1[·] is the indicator function, and H and W are the height and width of P, respectively.

Then, we define and calculate the similarity between two latent representations in the frequency domain based on their per-chunk black coverages as follows:

(3)
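The black-coverage computation of Eq. (2) is straightforward to sketch; because the exact form of Eq. (3) is not reproduced in this excerpt, the final aggregation below (one minus the mean absolute coverage difference) is only an assumed stand-in for the paper's similarity, and the k x k chunking grid is likewise an assumed reading of the hyperparameter.

```python
import torch

def black_coverage(chunk: torch.Tensor) -> torch.Tensor:
    """Fraction of zero-valued (black) pixels in one chunk (Eq. (2))."""
    return (chunk == 0).float().mean()

def chunk_coverages(rep: torch.Tensor, k: int) -> torch.Tensor:
    """Split a binary representation (..., H, W) into a k x k grid of chunks
    and return the black coverage of each chunk."""
    H, W = rep.shape[-2:]
    hs, ws = H // k, W // k
    covs = []
    for i in range(k):
        for j in range(k):
            covs.append(black_coverage(rep[..., i * hs:(i + 1) * hs,
                                            j * ws:(j + 1) * ws]))
    return torch.stack(covs)

def fd_similarity(rep1: torch.Tensor, rep2: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Similarity from per-chunk black coverages; this aggregation is an
    illustrative stand-in for Eq. (3), not the paper's exact definition."""
    c1, c2 = chunk_coverages(rep1, k), chunk_coverages(rep2, k)
    return 1.0 - (c1 - c2).abs().mean()
```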
| Category | Method | GoPro PSNR/SSIM | GoPro CSE (Ratio) | GoPro NIQE | HIDE PSNR/SSIM | HIDE CSE (Ratio) | Runtime (ms) | Model Size (MB) |
| Deep supervised methods | Gong et al. (Gong et al., 2017) | 26.40/0.863 | / | / | / | / | / | 39.00 |
| Deep supervised methods | DeepDeblur (Nah et al., 2017) | 29.08/0.914 | / | 4.921 | 25.73/0.874 | / | 4330 | 89.40 |
| Deep supervised methods | SRN (Tao et al., 2018) | 30.26/0.934 | / | 4.834 | 28.36/0.915 | / | 1870 | 27.50 |
| Deep unsupervised methods | CycleGAN (Zhu et al., 2017) | 22.54/0.720 | 2.676 | 4.359 | 21.81/0.690 | 3.300 | 12.55 | 107.90 |
| Deep unsupervised methods | DualGAN (Yi et al., 2017) | 22.86/0.722 | 4.384 | 4.176 | / | / | / | 324.20 |
| Deep unsupervised methods | UID-GAN (Lu et al., 2019) | 23.56/0.738 | 1.532 | 5.289 | 22.70/0.715 | 2.147 | 62.06 | 606.80 |
| Deep unsupervised methods | Ours | 24.84/0.771 | 1 | 3.924 | 23.43/0.732 | 1 | 10.86 | 24.56 |
Table 1. Performance comparison on two benchmark datasets: GoPro (Nah et al., 2017) and HIDE (Shen et al., 2019).
Figure 8. Validation of our similarity measure against cosine similarity in the frequency domain. The numbers on the lines are the similarity between the two representations. The most unreasonable cosine similarity value is highlighted in red.
Figure 9. Visual comparison of image deblurring on the GoPro dataset (Nah et al., 2017).
Figure 10. Visual comparison of image deblurring on the CelebA dataset (Liu et al., 2015).

For ease of understanding, we denote the entire process of calculating the similarity between two latent representations by sim(·, ·). Figure 8 shows some results demonstrating the effectiveness of our similarity measure against the cosine similarity. After solving the latent representation acquisition and similarity measurement problems, we can easily apply contrastive learning to our FCL-GAN. Specifically, given an anchor a, a positive exemplar p and N negative exemplars n_1, ..., n_N, we include the following contrastive loss in training:

| Category | Method | PSNR/SSIM | CSE (Ratio) |
| Optimization-based methods | Pan et al. (Pan et al., 2014a) | 15.16/0.380 | / |
| Optimization-based methods | Xu et al. (Xu et al., 2013) | 16.84/0.470 | / |
| Optimization-based methods | Pan et al. (Pan et al., 2014b) | 17.34/0.520 | / |
| Optimization-based methods | Pan et al. (Pan et al., 2016) | 17.59/0.540 | / |
| Optimization-based methods | Krishnan et al. (Krishnan et al., 2011) | 18.51/0.560 | / |
| Deep supervised methods | DeblurGAN (Kupyn et al., 2018) | 18.86/0.540 | / |
| Deep supervised methods | DeepDeblur (Nah et al., 2017) | 18.26/0.570 | / |
| Deep unsupervised methods | CycleGAN (Zhu et al., 2017) | 19.40/0.560 | 1.918 |
| Deep unsupervised methods | UID-GAN (Lu et al., 2019) | 20.81/0.650 | 2.279 |
| Deep unsupervised methods | Ours | 21.07/0.652 | 1 |
Table 2. Performance comparison on CelebA (Liu et al., 2015).
L_cl = -log( exp(sim(a, p)/τ) / ( exp(sim(a, p)/τ) + Σ_{i=1..N} exp(sim(a, n_i)/τ) ) ),   (4)

where N is the number of negative exemplars and τ is the temperature coefficient, which is set to 0.07 in all experiments.
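A sketch of an InfoNCE-style implementation of Eq. (4), plugging in the frequency-domain similarity sketched above; including the positive pair in the denominator follows the standard formulation and is an assumption about the exact definition used by the authors.

```python
import torch

def frequency_contrastive_loss(anchor, positive, negatives, sim_fn, tau: float = 0.07):
    """InfoNCE-style loss: pull the anchor toward the positive and away from
    N negatives, using a parameter-free similarity sim_fn (e.g. fd_similarity).
    Sketch only; sim_fn is expected to return a scalar tensor per pair."""
    pos = torch.exp(sim_fn(anchor, positive) / tau)
    negs = torch.stack([torch.exp(sim_fn(anchor, n) / tau) for n in negatives])
    return -torch.log(pos / (pos + negs.sum()))
```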

3.4. Loss Function

In addition to the contrastive loss L_cl, we also use an adversarial loss L_adv, a cycle-consistency loss L_cyc and a TV regularization term L_tv for deblurring. The TV regularization is applied only to the restored sharp images, while the other losses are applied to both the sharp and blurred domains. The total loss function is the weighted combination of L_adv, L_cyc, L_cl and L_tv.
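A hedged sketch of how the four terms could be combined, together with a standard anisotropic TV term; the balancing weights are illustrative placeholders, since the exact weighting is not given in this excerpt.

```python
import torch

def tv_regularization(img: torch.Tensor) -> torch.Tensor:
    """Total-variation regularization on a restored sharp image (B, C, H, W)."""
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw

def total_objective(l_adv, l_cyc, l_cl, l_tv,
                    w_cyc: float = 10.0, w_cl: float = 1.0, w_tv: float = 1.0):
    """Weighted sum of adversarial, cycle-consistency, contrastive and TV terms.
    The weights are illustrative assumptions, not the authors' values."""
    return l_adv + w_cyc * l_cyc + w_cl * l_cl + w_tv * l_tv
```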

4. Experiments

4.1. Experimental Settings

Datasets. In this paper, we evaluate each BID method on four widely-used image datasets, including a natural image dataset GoPro (Nah et al., 2017), a human-aware blurring dataset HIDE (Shen et al., 2019), a human face (domain-specific) dataset CelebA (Liu et al., 2015), and a real-world blur dataset RealBlur (namely, RealBlur-J and RealBlur-R) (Rim et al., 2020).

Evaluation Metrics. We use two widely-used reference metrics (PSNR and SSIM (Wang et al., 2004)), one no-reference metric (NIQE (Mittal et al., 2012)) and one chromatic aberration metric (i.e., color sensitive error, CSE (Zhao et al., 2021)) for evaluation. We also compare the model size to measure the lightweight property. Besides, to measure the real-time performance of each model, we compare the inference time on an Nvidia RTX 2080 Ti. In our experimental results, the symbol ↑ means higher is better, while ↓ means lower is better.

Compared Methods. We compare our FCL-GAN with 13 methods, including five optimization-based methods ((Pan et al., 2014a), (Pan et al., 2014b), (Pan et al., 2016), (Xu et al., 2013), (Krishnan et al., 2011)); four data-driven deep supervised methods ((Gong et al., 2017), DeblurGAN (Kupyn et al., 2018), DeepDeblur (Nah et al., 2017), SRN (Tao et al., 2018)); and three data-driven deep unsupervised methods (DualGAN (Yi et al., 2017), CycleGAN (Zhu et al., 2017), UID-GAN (Lu et al., 2019)). We prefer to use the pre-trained models; if the settings are inconsistent or no pre-trained model is available, we retrain the method using the code provided by the authors.

Figure 11. Visual comparison of image deblurring on the RealBlur-J dataset (Rim et al., 2020).

Implementation Details. The proposed baseline model is implemented in PyTorch 1.10 and trained on an Nvidia RTX 3090 with 24 GB memory. We train for 80 epochs using Adam (Kingma and Ba, 2015) with β1 = 0.5 and β2 = 0.999 for optimization. The initial learning rate is set to 0.0001 and halved every 20 epochs.
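These settings map directly onto a standard PyTorch setup; the sketch below assumes a generic model, data loader and loss callable (all placeholders), and uses a StepLR schedule to halve the learning rate every 20 epochs.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def build_optimizer(model: nn.Module):
    """Adam with beta1=0.5, beta2=0.999, lr=1e-4 halved every 20 epochs,
    matching the settings stated above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.999))
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
    return optimizer, scheduler

def train(model: nn.Module, loader: DataLoader, compute_loss, epochs: int = 80):
    """Generic training loop sketch; compute_loss stands in for the FCL-GAN losses."""
    optimizer, scheduler = build_optimizer(model)
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            optimizer.step()
        scheduler.step()   # halve the learning rate every 20 epochs
```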

4.2. Experimental Results

1) Results on GoPro. We train our model on the training set of GoPro and evaluate it on the test set, as shown in Table 1. Note that the CSE metric is computed between the deblurred image and the ground truth to evaluate each deep unsupervised deblurring model on each dataset, and we report the ratio of each method's CSE to ours for comparison. We see that our FCL-GAN substantially outperforms existing deep unsupervised BID methods in terms of model effectiveness, while also obtaining a significant advantage in inference time and model size.

2) Results on HIDE. To quantitatively and qualitatively compare the generalization ability of each model, we directly apply the models pre-trained on GoPro to the HIDE dataset. As shown in Table 1 and Figure 9, our proposed FCL-GAN has stronger generalization ability and outperforms all data-driven deep unsupervised methods. We also find that our FCL-GAN obtains results that are highly competitive with some deep supervised deblurring models.

3) Results on CelebA. To evaluate the ability of each method to process domain-specific blurred images, we train and test each model on the CelebA face dataset. From Table 2 and Figure 10, our method again achieves the best performance and visual quality among all unsupervised methods, and deep unsupervised methods tend to outperform some deep supervised methods in this case.

4) Results on RealBlur. To examine the ability of each model to handle real-world blur, we directly apply the models pre-trained on GoPro to the RealBlur dataset, as shown in Table 3 and Figure 11. From Table 3, our method outperforms all unsupervised methods, and Figure 11 shows that it handles real-world blurred images better.

| Category | Method | RealBlur-J PSNR/SSIM | RealBlur-J CSE | RealBlur-R PSNR/SSIM | RealBlur-R CSE |
| Supervised | DeblurGAN (Kupyn et al., 2018) | 27.97/0.834 | / | 33.79/0.903 | / |
| Supervised | DeepDeblur (Nah et al., 2017) | 27.87/0.827 | / | 32.51/0.841 | / |
| Unsupervised | CycleGAN (Zhu et al., 2017) | 19.79/0.633 | 2.898 | 12.38/0.242 | 3.024 |
| Unsupervised | UID-GAN (Lu et al., 2019) | 22.87/0.671 | 1.394 | 16.64/0.323 | 2.985 |
| Unsupervised | Ours | 25.35/0.736 | 1 | 28.37/0.663 | 1 |
Table 3. Performance comparison on RealBlur (Rim et al., 2020).
| Datasets | Metrics | Ablated variant 1 | Ablated variant 2 | Ours (full) |
| GoPro | PSNR/SSIM | 24.56/0.749 | 24.73/0.765 | 24.84/0.771 |
| CelebA | PSNR/SSIM | 20.83/0.648 | 21.01/0.651 | 21.07/0.652 |
Table 4. Ablation studies for the loss functions on the GoPro (Nah et al., 2017) and CelebA (Liu et al., 2015) datasets.

4.3. Ablation Studies

1) Effectiveness of the loss functions. To verify the effectiveness of the loss functions used in training, we ablate the contrastive loss L_cl and the TV regularization L_tv on the GoPro and CelebA datasets. From the results in Table 4, both terms have positive effects, and removing either of them degrades performance, with one having a clearly larger negative impact.

2) Effectiveness of the designed basic meta and residual meta. In this study, we evaluate three different forms of each, as shown in Figure 4. For the basic meta, we consider: (a) Conv-BN-ReLU, (b) Conv-IN-ReLU, and (c) Conv-ReLU. For the residual meta, we also consider three forms: (d) the basic form of the resblock (He et al., 2016); (e) a simplified version of (d); and (f) the most widely used form in the deblurring field (Nah et al., 2017; Zhang et al., 2019a; Cho et al., 2021). Table 5 shows the performance of various combinations of basic meta and residual meta. We draw the following conclusions for unsupervised deblurring: 1) for the gain in performance, BN > norm-free > IN; 2) introducing IN substantially degrades performance; 3) the simplified version of the residual meta performs better. Besides, norm-free seems to have a similar effect to BN, but it leads to training instability in our experiments, as shown in Figure 4 (g).

| Combination | (a)+(d) | (a)+(e) | (b)+(d) | (b)+(e) | (c)+(d) | (c)+(e) | (c)+(f) |
| PSNR/SSIM | 23.61/0.736 | 24.84/0.771 | 19.40/0.635 | 20.11/0.641 | 23.47/0.729 | 24.55/0.765 | 24.20/0.762 |
Table 5. Ablation studies for different combinations of basic metas and residual metas on GoPro (Nah et al., 2017).
| Structures | PSNR/SSIM | Model Size | Runtime |
| Encoder-decoder | 24.66/0.763 | 50.30 MB | 15.19 ms |
| Single-scale | 24.69/0.770 | 50.30 MB | 76.94 ms |
| Our LED | 24.84/0.771 | 24.56 MB | 10.86 ms |
Table 6. Ablation studies for different structures on GoPro (Nah et al., 2017).

3) Effectiveness of the LED structure. We compare the deblurring performance, model size and inference time of the three structures in Figure 5. Table 6 reports the comparison results. As can be seen, our LED is lighter, infers faster, and performs better.

5. Conclusion

We have discussed the limitations of existing unsupervised deep BID methods and proposed a lightweight and real-time unsupervised BID baseline (FCL-GAN). We analyze blurred and sharp images in the frequency domain and introduce frequency-domain contrastive learning to obtain superior performance. Qualitative and quantitative results show that our method achieves SOTA performance for unsupervised BID and is even highly competitive with some supervised deep BID models in the domain-specific case. In terms of lightweight and real-time inference, our FCL-GAN outperforms all existing image deblurring models (whether supervised or unsupervised), needing only 0.011 s to process a high-resolution (1280x720) image on an Nvidia RTX 2080 Ti. In the future, we will consider the deployment of our lightweight model and explore new strategies to further improve deblurring performance.

6. Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (62072151, 61732007, 61932009 and 62020106007), and the Anhui Provincial Natural Science Fund for Distinguished Young Scholars (2008085J30). Zhao Zhang is the corresponding author of this paper.

References

  • Y. Bai, H. Jia, M. Jiang, X. Liu, X. Xie, and W. Gao (2020) Single-image blind deblurring using multi-scale latent structure prior. IEEE Trans. Circuits Syst. Video Technol. 30 (7), pp. 2033–2045. Cited by: §1.
  • K. Chaitanya, E. Erdil, N. Karani, and E. Konukoglu (2020) Contrastive learning of global and local features for medical image segmentation with limited annotations. In Proceedings of the Annual Conference on Neural Information Processing Systems, virtual, Cited by: §2.2.
  • H. G. Chen, J. Gu, O. Gallo, M. Liu, A. Veeraraghavan, and J. Kautz (2018) Reblur2Deblur: deblurring videos via self-supervised learning. In Proceedings of the IEEE International Conference on Computational Photography, Pittsburgh, PA, USA, pp. 1–9. Cited by: §1, §1, §2.1.
  • L. Chen, F. Fang, T. Wang, and G. Zhang (2019) Blind image deblurring with local maximum gradient prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 1742–1750. Cited by: §1.
  • L. Chen, X. Lu, J. Zhang, X. Chu, and C. Chen (2021) HINet: half instance normalization network for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, virtual, pp. 182–192. Cited by: §1.
  • X. Chen, J. Pan, K. Jiang, Y. Li, Y. Huang, C. Kong, L. Dai, and Z. Fan (2022) Unpaired deep image deraining using dual contrastive learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2, §3.2, §3.3.
  • S. Cho, S. Ji, J. Hong, S. Jung, and S. Ko (2021) Rethinking coarse-to-fine approach in single image deblurring. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, pp. 4621–4630. Cited by: §1, §3.2, §3.2, §3.3, §4.3.
  • B. Dai and D. Lin (2017) Contrastive learning for image captioning. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp. 898–907. Cited by: §2.2.
  • N. Dong, M. Maggioni, Y. Yang, E. Pérez-Pellitero, A. Leonardis, and S. McDonagh (2021) Residual contrastive learning for joint demosaicking and denoising. CoRR abs/2106.10070. Cited by: §2.2.
  • D. Gong, J. Yang, L. Liu, Y. Zhang, I. D. Reid, C. Shen, A. van den Hengel, and Q. Shi (2017) From motion blur to motion flow: A deep learning solution for removing heterogeneous motion blur. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 3806–3815. Cited by: Table 1, §4.1.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, Quebec, Canada, pp. 2672–2680. Cited by: §2.1.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp. 5767–5777. Cited by: §2.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 770–778. Cited by: §4.3.
  • N. Joshi, C. L. Zitnick, R. Szeliski, and D. J. Kriegman (2009) Image deblurring and denoising using color priors. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Miami, Florida, USA, pp. 1550–1557. Cited by: §1.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, Cited by: §4.1.
  • D. Krishnan, T. Tay, and R. Fergus (2011) Blind deconvolution using a normalized sparsity measure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, pp. 233–240. Cited by: Table 2, §4.1.
  • O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas (2018) DeblurGAN: blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 8183–8192. Cited by: §1, §2.1, Table 2, §4.1, Table 3.
  • O. Kupyn, T. Martyniuk, J. Wu, and Z. Wang (2019) DeblurGAN-v2: deblurring (orders-of-magnitude) faster and better. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea (South), pp. 8877–8886. Cited by: §2.1.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, pp. 3730–3738. Cited by: Figure 10, Table 2, §4.1, Table 4.
  • B. Lu, J. Chen, and R. Chellappa (2019) Unsupervised domain-specific deblurring via disentangled representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 10225–10234. Cited by: Figure 1, §1, §1, §2.1, Table 1, Table 2, §4.1, Table 3.
  • N. T. M, V. Rengarajan, and R. Ambasamudram (2018) Semi-supervised learning of camera motion from A blurred image. In Proceedings of the IEEE International Conference on Image Processing, Athens, Greece, pp. 803–807. Cited by: §2.1.
  • X. Mao, Y. Liu, W. Shen, Q. Li, and Y. Wang (2021) Deep residual fourier transformation for single image deblurring. CoRR abs/2111.11745. Cited by: §3.3.
  • M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. CoRR abs/1411.1784. Cited by: §2.1.
  • A. Mittal, A. K. Moorthy, and A. C. Bovik (2012) No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 21 (12), pp. 4695–4708. Cited by: Figure 1, §4.1.
  • S. Nah, T. H. Kim, and K. M. Lee (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 257–265. Cited by: Figure 1, §1, §2.1, Figure 9, §3.2, §3.2, Table 1, Table 2, §4.1, §4.1, §4.3, Table 3, Table 4, Table 5, Table 6.
  • T. M. Nimisha, S. Kumar, and A. N. Rajagopalan (2018) Unsupervised class-specific deblurring. In Proceedings of the 15th European Conference, Munich, Germany, pp. 358–374. Cited by: §1, §1, §2.1.
  • J. Pan, Z. Hu, Z. Su, and M. Yang (2014a) Deblurring face images with exemplars. In Proceedings of the 13th European Conference, Zurich, Switzerland, pp. 47–62. Cited by: Table 2, §4.1.
  • J. Pan, Z. Hu, Z. Su, and M. Yang (2014b) Deblurring text images via l0-regularized intensity and gradient prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, pp. 2901–2908. Cited by: §1, Table 2, §4.1.
  • J. Pan, D. Sun, H. Pfister, and M. Yang (2016) Blind image deblurring using dark channel prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 1628–1636. Cited by: §1, Table 2, §4.1.
  • D. Park, D. U. Kang, J. Kim, and S. Y. Chun (2020) Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training. In Proceedings of the 16th European Conference, Glasgow, UK, pp. 327–343. Cited by: §3.2.
  • D. Ren, W. Zuo, Q. Hu, P. Zhu, and D. Meng (2019) Progressive image deraining networks: A better and simpler baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 3937–3946. Cited by: §3.2.
  • J. Rim, H. Lee, J. Won, and S. Cho (2020) Real-world blur dataset for learning and benchmarking deblurring algorithms. In Proceedings of the 16th European Conference, Glasgow, UK, pp. 184–201. Cited by: Figure 11, §4.1, Table 3.
  • Z. Shen, W. Wang, X. Lu, J. Shen, H. Ling, T. Xu, and L. Shao (2019) Human-aware motion deblurring. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea (South), pp. 5571–5580. Cited by: Table 1, §4.1.
  • M. Suin, K. Purohit, and A. N. Rajagopalan (2020) Spatially-attentive patch-hierarchical network for adaptive motion deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 3603–3612. Cited by: §1.
  • X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia (2018) Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 8174–8182. Cited by: §1, §3.2, Table 1, §4.1.
  • F. Tsai, Y. Peng, Y. Lin, C. Tsai, and C. Lin (2021) BANet: blur-aware attention networks for dynamic scene deblurring. CoRR abs/2101.07518. Cited by: §3.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, pp. 5998–6008. Cited by: §2.1.
  • Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13 (4), pp. 600–612. Cited by: Figure 1, §4.1.
  • Y. Wei, Z. Zhang, Y. Wang, M. Xu, Y. Yang, S. Yan, and M. Wang (2021) DerainCycleGAN: rain attentive cyclegan for single image deraining and rainmaking. IEEE Trans. Image Process. 30, pp. 4788–4801. Cited by: §3.2.
  • H. Wu, Y. Qu, S. Lin, J. Zhou, R. Qiao, Z. Zhang, Y. Xie, and L. Ma (2021) Contrastive learning for compact single image dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, virtual, pp. 10551–10560. Cited by: §2.2.
  • E. Xie, J. Ding, W. Wang, X. Zhan, H. Xu, P. Sun, Z. Li, and P. Luo (2021) DetCo: unsupervised contrastive learning for object detection. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, pp. 8372–8381. Cited by: §2.2.
  • L. Xu, S. Zheng, and J. Jia (2013) Unnatural L0 sparse representation for natural image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, pp. 1107–1114. Cited by: Table 2, §4.1.
  • Y. Yan, W. Ren, Y. Guo, R. Wang, and X. Cao (2017) Image deblurring via extreme channels prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 6978–6986. Cited by: §1.
  • Z. Yi, H. Zhang, P. Tan, and M. Gong (2017) DualGAN: unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, pp. 2868–2876. Cited by: §3.2, Table 1, §4.1.
  • S. W. Zamir, A. Arora, S. H. Khan, M. Hayat, F. S. Khan, M. Yang, and L. Shao (2021a) Multi-stage progressive image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, virtual, pp. 14821–14831. Cited by: §1, §2.1, §3.2, §3.2.
  • S. W. Zamir, A. Arora, S. H. Khan, M. Hayat, F. S. Khan, and M. Yang (2021b) Restormer: efficient transformer for high-resolution image restoration. CoRR abs/2111.09881. Cited by: §1, §2.1.
  • H. Zhang, Y. Dai, H. Li, and P. Koniusz (2019a) Deep stacked hierarchical multi-patch network for image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, pp. 5978–5986. Cited by: §1, §2.1, §3.2, §4.3.
  • H. Zhang, W. Chen, H. He, and Y. Jin (2019b) Disentangled makeup transfer with generative adversarial network. CoRR abs/1907.01144. Cited by: §3.2.
  • J. Zhang, S. Lu, F. Zhan, and Y. Yu (2021) Blind image super-resolution via contrastive representation learning. CoRR abs/2107.00708. Cited by: §2.2.
  • K. Zhang, W. Luo, Y. Zhong, L. Ma, B. Stenger, W. Liu, and H. Li (2020) Deblurring by realistic blurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 2734–2743. Cited by: §1, §2.1.
  • S. Zhao, Z. Zhang, R. Hong, M. Xu, H. Zhang, M. Wang, and S. Yan (2021) Unsupervised color retention network and new quantization metric for blind motion deblurring. Cited by: §1, §1, §2.1, §4.1.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, pp. 2242–2251. Cited by: 2nd item, Figure 3, §3.2, §3.2, Table 1, Table 2, §4.1, Table 3.