DeblurGAN-v2: Deblurring (Orders-of-Magnitude) Faster and Better

We present a new end-to-end generative adversarial network (GAN) for single image motion deblurring, named DeblurGAN-v2, which considerably boosts state-of-the-art deblurring efficiency, quality, and flexibility. DeblurGAN-v2 is based on a relativistic conditional GAN with a double-scale discriminator. For the first time, we introduce the Feature Pyramid Network into deblurring, as a core building block in the generator of DeblurGAN-v2. It can flexibly work with a wide range of backbones, to navigate the balance between performance and efficiency. Plugging in sophisticated backbones (e.g., Inception-ResNet-v2) leads to solid state-of-the-art deblurring. Meanwhile, with light-weight backbones (e.g., MobileNet and its variants), DeblurGAN-v2 runs 10-100 times faster than the nearest competitors while maintaining close to state-of-the-art results, implying the option of real-time video deblurring. We demonstrate that DeblurGAN-v2 obtains very competitive performance on several popular benchmarks, in terms of deblurring quality (both objective and subjective) as well as efficiency. Besides, we show the architecture to be effective for general image restoration tasks too. Our codes, models and data are available at:






Code Repositories


[ICCV 2019] "DeblurGAN-v2: Deblurring (Orders-of-Magnitude) Faster and Better" by Orest Kupyn, Tetiana Martyniuk, Junru Wu, Zhangyang Wang


1 Introduction

Figure 1: The SSIM-FLOPs trade-off plot on the GoPro dataset. Compared to three state-of-the-art competitors (in blue), DeblurGAN [20], DeepDeblur [32] and Scale-Recurrent Network (SRN) [45], the DeblurGAN-v2 models (with different backbones, in red) achieve superior or comparable quality and are much more efficient.

This paper focuses on the challenging setting of single-image blind motion deblurring. Motion blurs are commonly found in photos taken by hand-held cameras, or in low-frame-rate videos containing moving objects. Blurs degrade the human perceptual quality, and challenge subsequent computer vision analytics. Real-world blurs typically have unknown and spatially varying blur kernels, and are further complicated by noise and other artifacts.

The recent prosperity of deep learning has led to significant progress in the image restoration field [48, 27]. Specifically, generative adversarial networks (GANs) [8] often yield sharper and more plausible textures than classical feed-forward encoders and have seen success in image super-resolution [22] and in-painting [53]. Recently, [20] introduced GANs to deblurring by treating it as a special image-to-image translation task [12]. The proposed model, called DeblurGAN, was demonstrated to restore perceptually pleasing and sharp images from both synthetic and real-world blurry images. DeblurGAN was also 5 times faster than its closest competitor at the time [32].

Built on the success of DeblurGAN, this paper aims to make another substantial push on GAN-based motion deblurring. We introduce a new framework, called DeblurGAN-v2, that improves over DeblurGAN in both deblurring performance and inference efficiency, and that enables high flexibility over the quality-efficiency spectrum. (An informal note: we quite like the sense of humor in [38], quoted as: “We present some updates to YOLO. We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell.” That well describes what we have done to DeblurGAN, too, although we consider DeblurGAN-v2 a non-incremental upgrade of DeblurGAN, with significant performance and efficiency improvements.) Our innovations are summarized below:

  • Framework Level: We construct a new conditional GAN framework for deblurring. For the generator, we introduce the Feature Pyramid Network (FPN), which was originally developed for object detection [26], to the image restoration task for the first time. For the discriminator, we adopt a relativistic discriminator [15] that wraps a least-squares loss [29], with two columns that evaluate the global (image) and local (patch) scales respectively.

  • Backbone Level: While the above framework is agnostic to the generator backbones, the choice would affect deblurring quality and efficiency. To pursue the state-of-the-art deblurring quality, we plug in a sophisticated Inception-ResNet-v2 backbone. To shift towards being more efficient, we adopt MobileNet, and further create its variant with depth-wise separable convolutions (MobileNet-DSC). The latter two become extremely compact in size and fast at inference.

  • Experiment Level: We present extensive experiments on three popular benchmarks to show the state-of-the-art (or close) performance (PSNR, SSIM, and perceptual quality) achieved by DeblurGAN-v2. In terms of efficiency, DeblurGAN-v2 with MobileNet-DSC is 11 times faster than DeblurGAN [20], over 100 times faster than [32, 45], and has a model size of just 4 MB, implying the possibility of real-time video deblurring. We also present a subjective study of the deblurring quality on real blurry images. Lastly, we show the potential of our models in general image restoration, as extra flexibility.

2 Related work

2.1 Image Deblurring

Figure 2: DeblurGAN-v2 pipeline architecture.

Single image motion deblurring is traditionally treated as a deconvolution problem, and can be tackled in either a non-blind or a blind manner. The former assumes a given or pre-estimated blur kernel [39, 52]. The latter is more realistic yet highly ill-posed. Earlier models rely on natural image priors to regularize deblurring [19, 35, 24, 4]. However, most handcrafted priors cannot capture the complicated blur variations in real images well.

Emerging deep learning techniques have driven breakthroughs in image restoration tasks. Sun et al. [43] exploited a convolutional neural network (CNN) for blur kernel estimation. Gong et al. [7] used a fully convolutional network to estimate the motion flow. Besides those kernel-based methods, end-to-end kernel-free CNN methods were explored to restore a clean image from the blurry input directly, e.g., [32, 34]. The latest work by Tao et al. [45] extended the Multi-Scale CNN from [32] to a Scale-Recurrent CNN for blind image deblurring, with impressive results.

The success of GANs for image restoration has impacted single image deblurring as well, since Ramakrishnan et al. [37] first solved image deblurring by referring to the image translation idea [12]. Lately, Kupyn et al. [20] introduced DeblurGAN, which exploited Wasserstein GAN [1] with the gradient penalty [9] and the perceptual loss [14].

2.2 Generative adversarial networks

A GAN [8] consists of two models, a discriminator $D$ and a generator $G$, that form a two-player minimax game. The generator learns to produce artificial samples and is trained to fool the discriminator, with the goal of capturing the real data distribution. In particular, as a popular GAN variant, conditional GANs [30] have been widely applied to image-to-image translation problems, with image restoration and enhancement as special cases. They take the label or an observed image, in addition to the latent code, as inputs.

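The minimax game with the value function $V(D, G)$ is formulated as the following [8] (fake-real labels set to 0-1):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$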

Such an objective function is notoriously hard to optimize, and one needs to deal with many challenges, e.g., mode collapse and gradient vanishing/explosion, during the training process. To fix the vanishing gradients and stabilize the training, the Least Squares GAN (LSGAN) discriminator [29] introduces a loss function that provides smoother and non-saturating gradients. The authors observe that the log-type loss in [8] saturates quickly, as it ignores the distance from $x$ to the decision boundary. In contrast, an $L_2$ loss provides gradients proportional to that distance, so that fake samples farther away from the boundary receive larger penalties. The proposed loss function also minimizes the Pearson $\chi^2$ divergence, which leads to better training stability. The LSGAN objective function is written as:

$$\min_D V(D) = \frac{1}{2}\mathbb{E}_{x \sim p_{data}(x)}\left[(D(x) - 1)^2\right] + \frac{1}{2}\mathbb{E}_{z \sim p_z(z)}\left[D(G(z))^2\right] \tag{2}$$

Another relevant improvement to GANs is the Relativistic GAN [15]. It uses a relativistic discriminator to estimate the probability that the given real data is more realistic than randomly sampled fake data. As the author advocated, this accounts for the a priori knowledge that half of the data in the mini-batch is fake. Relativistic discriminators show more stable and computationally efficient training than other GAN types, including WGAN-GP [9], which was used in DeblurGAN-v1.

3 DeblurGAN-v2 Architecture

An overview of the DeblurGAN-v2 architecture is illustrated in Figure 2. It restores a sharp image $I_S$ from a single blurred image $I_B$, via the trained generator.

3.1 Feature Pyramid Deblurring

Existing CNNs for image deblurring (and other restoration problems) [22, 32] typically refer to ResNet-like structures. Most state-of-the-art methods [32, 45] deal with different levels of blur by utilizing multi-stream CNNs with an input image pyramid at different scales. However, processing images at multiple scales is time-consuming and memory-demanding. We introduce the idea of Feature Pyramid Networks [26] to image deblurring (more generally, the field of image restoration and enhancement), for the first time to the best of our knowledge. We treat this novel approach as a lighter-weight alternative for incorporating multi-scale features.

The FPN module was originally designed for object detection [26]. It generates multiple feature map layers which encode different semantics and contain better quality information. FPN comprises a bottom-up and a top-down pathway. The bottom-up pathway is the usual convolutional network for feature extraction, along which the spatial resolution is downsampled, but more semantic context information is extracted and compressed. Through the top-down pathway, the FPN reconstructs higher spatial resolution from the semantically rich layers. The lateral connections between the bottom-up and top-down pathways supplement high-resolution details and help localize objects.

Our architecture consists of an FPN backbone from which we take five final feature maps of different scales as the output. Those features are later up-sampled to the same $\frac{1}{4}$ input size and concatenated into one tensor, which contains the semantic information on different levels. We additionally add two upsampling and convolutional layers at the end of the network to restore the original image size and reduce artifacts. Similar to [20, 28], we introduce a direct skip connection from the input to the output, so that the learning focuses on the residue. The input images are normalized to [-1, 1]. We also use a tanh activation layer to keep the output in the same range. In addition to the multi-scale feature aggregation capability, FPN also strikes a balance between accuracy and speed, as shown in the experiments.
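The input normalization and the residual output described above can be sketched in a few lines. This is a minimal illustration on plain Python lists rather than the authors' code: the helper names `normalize` and `residual_output` are hypothetical, 8-bit input is assumed, and the final clamp to [-1, 1] is an added assumption, since the sum of the input and a tanh-bounded residual could otherwise leave that range.

```python
import math

def normalize(pixels_8bit):
    """Map 8-bit intensities in [0, 255] to the range [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels_8bit]

def residual_output(blurred, raw_residual):
    """Add a tanh-bounded residual to the normalized blurred input.

    The clamp is an assumption of this sketch: it keeps the sum in [-1, 1].
    """
    restored = []
    for x, r in zip(blurred, raw_residual):
        y = x + math.tanh(r)          # skip connection: input + predicted residue
        restored.append(max(-1.0, min(1.0, y)))
    return restored

blurred = normalize([0, 128, 255])
# Hypothetical raw network outputs for the three pixels.
restored = residual_output(blurred, [0.0, 0.5, 3.0])
print(blurred, restored)
```

With a zero residual the output equals the blurred input, which is why the network only needs to learn the correction.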

3.2 Choice of Backbones: Trade-off between Performance and Efficiency

The new FPN-embedded architecture is agnostic to the choice of feature extractor backbones. With this plug-and-play property, we gain the flexibility to navigate the spectrum of accuracy and efficiency. By default, we choose ImageNet-pretrained backbones to convey more semantic-related features. As one option, we use Inception-ResNet-v2 [44] to pursue strong deblurring performance, although we find other backbones such as SE-ResNeXt [11] to be similarly effective.

The demand for efficient restoration models has recently drawn increasing attention, due to the prevailing need for mobile on-device image enhancement [54, 50, 47]. To explore this direction, we choose the MobileNet V2 backbone [40] as one option. To reduce the complexity further, we try a more aggressive option on top of DeblurGAN-v2 with MobileNet V2, by replacing all normal convolutions in the full network (including those not in the backbone) with Depthwise Separable Convolutions [5]. The resulting model is denoted as MobileNet-DSC, and can provide extremely lightweight and efficient image deblurring.
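To make the saving from depthwise separable convolutions concrete, the weight counts of the two layer types can be compared directly. The sketch below uses the standard counting formulas (bias terms omitted); the layer sizes are hypothetical and are not taken from the DeblurGAN-v2 models.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def dsc_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Hypothetical layer: 128 input channels, 128 output channels, 3 x 3 kernel.
standard = conv_params(128, 128, 3)
separable = dsc_params(128, 128, 3)
print(standard, separable, standard / separable)
```

For a 3 × 3 kernel the reduction approaches a factor of 9 as the number of output channels grows, which is the source of the compactness of the MobileNet-DSC variant.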

To make this flexibility available to practitioners, our code implements the switch of backbones as a simple one-line command, which is compatible with many state-of-the-art pre-trained networks.

3.3 Double-Scale RaGAN-LS Discriminator

Instead of the WGAN-GP discriminator in DeblurGAN [20], we suggest several upgrades in DeblurGAN-v2. We first adopt the relativistic “wrapping” [15] on the LSGAN [29] cost function, creating a new RaGAN-LS loss:

$$L_D^{RaLSGAN} = \mathbb{E}_{x \sim p_{data}(x)}\left[\left(D(x) - \mathbb{E}_{z \sim p_z(z)} D(G(z)) - 1\right)^2\right] + \mathbb{E}_{z \sim p_z(z)}\left[\left(D(G(z)) - \mathbb{E}_{x \sim p_{data}(x)} D(x) + 1\right)^2\right] \tag{3}$$

It is observed to make training notably faster and more stable than the WGAN-GP objective. We also empirically conclude that the generated results possess higher perceptual quality and are overall sharper. Correspondingly, the adversarial loss for the DeblurGAN-v2 generator is obtained by optimizing (3) w.r.t. $G$.
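The relativistic least-squares discriminator loss described above can be sketched on plain lists of discriminator scores. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the generator loss with swapped targets follows the usual relativistic average formulation of [15], which is an assumption here because the text only states that the generator optimizes the same objective.

```python
def mean(values):
    return sum(values) / len(values)

def ragan_ls_d_loss(d_real, d_fake):
    """Discriminator loss: real scores should exceed the mean fake score by 1,
    and fake scores should fall below the mean real score by 1."""
    mean_real, mean_fake = mean(d_real), mean(d_fake)
    real_term = mean([(r - mean_fake - 1.0) ** 2 for r in d_real])
    fake_term = mean([(f - mean_real + 1.0) ** 2 for f in d_fake])
    return real_term + fake_term

def ragan_ls_g_loss(d_real, d_fake):
    """Generator loss with the two targets swapped (assumed formulation)."""
    mean_real, mean_fake = mean(d_real), mean(d_fake)
    real_term = mean([(r - mean_fake + 1.0) ** 2 for r in d_real])
    fake_term = mean([(f - mean_real - 1.0) ** 2 for f in d_fake])
    return real_term + fake_term

# Hypothetical discriminator scores for a mini-batch of two real and two fake images.
print(ragan_ls_d_loss([1.0, 1.0], [0.0, 0.0]))
print(ragan_ls_g_loss([1.0, 1.0], [0.0, 0.0]))
```

Because each score is compared against the average score of the opposite class, the loss depends only on the gap between real and fake scores, not on their absolute values.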

Extending to Both Global and Local Scales. Isola et al. [12] proposed a PatchGAN discriminator that operates on image patches of size 70 × 70, which proves to produce sharper results than the standard “global” discriminator that operates on the full image. The PatchGAN idea was adopted in DeblurGAN [20].

However, we observed that for highly non-uniform blurred images, especially when complex object movements are involved, the “global” scales are still essential for discriminators to incorporate full spatial contexts [13]. To take advantage of both global and local features, we propose to use a double-scale discriminator, consisting of one local branch that operates on patch levels as in [12], and one global branch that is fed the full input image. We observe that this allows DeblurGAN-v2 to better handle larger and more heterogeneous real blurs.

Overall Loss Function. For training image restoration GANs, one needs to compare the reconstructed and the original images at the training stage, under some metric. One common option is the pixel-space loss $L_P$, e.g., the simplest $L_1$ or $L_2$ distance. As [22] suggested, using $L_P$ tends to yield oversmoothed pixel-space outputs. [20] proposed to use the perceptual distance [14] as a form of “content” loss $L_X$. In contrast to $L_2$, it computes the Euclidean loss on the VGG19 [41] conv3_3 feature maps. We incorporate those prior insights and use a hybrid three-term loss for training DeblurGAN-v2:

$$L_G = 0.5 \cdot L_P + 0.006 \cdot L_X + 0.01 \cdot L_{adv} \tag{4}$$

The $L_{adv}$ term contains both global and local discriminator losses. Also, we choose the mean-square-error (MSE) loss as $L_P$: although DeblurGAN did not include an $L_P$ term, we find that it helps correct color and texture distortions.

3.4 Training Datasets

Figure 3: Visual comparison of synthesized blurry images, without interpolation (a,c) and with interpolation (b,d).

| Method | PSNR | SSIM | Time | FLOPs |
|---|---|---|---|---|
| Sun et al. [43] | 24.64 | 0.842 | 20 min | N/A |
| Xu et al. [51] | 25.10 | 0.890 | 13.41 s | N/A |
| DeepDeblur [32] | 29.23 | 0.916 | 4.33 s | 1760.04G |
| SRN [45] | 30.10 | 0.932 | 1.6 s | 1434.82G |
| DeblurGAN [20] | 28.70 | 0.927 | 0.85 s | 678.29G |
| DeblurGAN-v2 (Inception-ResNet-v2) | 29.55 | 0.934 | 0.35 s | 411.34G |
| DeblurGAN-v2 (MobileNet) | 28.17 | 0.925 | 0.06 s | 43.75G |
| DeblurGAN-v2 (MobileNet-DSC) | 28.03 | 0.922 | 0.04 s | 14.83G |

Table 1: Performance and efficiency comparison on the GoPro test dataset. All models were tested on the linear image subset.

| Method | PSNR | SSIM |
|---|---|---|
| Sun et al. [43] | 25.22 | 0.773 |
| DeepDeblur [32] | 26.48 | 0.807 |
| SRN [45] | 26.75 | 0.837 |
| DeblurGAN [20] | 26.10 | 0.816 |
| DeblurGAN-v2 (Inception-ResNet-v2) | 26.72 | 0.836 |
| DeblurGAN-v2 (MobileNet) | 26.36 | 0.820 |
| DeblurGAN-v2 (MobileNet-DSC) | 26.35 | 0.819 |

Table 2: PSNR and SSIM comparison on the Kohler dataset.

The GoPro dataset [32] uses the GoPro Hero 4 camera to capture 240 frames per second (fps) video sequences, and generates blurred images by averaging consecutive short-exposure frames. It is a common benchmark for image motion deblurring, containing 3,214 blurry/clear image pairs. We follow the same split as [32], using 2,103 pairs for training and the remaining 1,111 pairs for evaluation.

The DVD dataset [42] collects 71 real-world videos captured by various devices such as iPhone 6s, GoPro Hero 4 and Nexus 5x, at 240 fps. The authors then generated 6,708 synthetic blurry and sharp pairs by averaging consecutive short-exposure frames to approximate a longer exposure [46]. The dataset was initially used for video deblurring but was later also brought to the image deblurring field.

The NFS dataset [16] was initially proposed to benchmark visual object tracking. It consists of 75 videos captured with high-frame-rate cameras from iPhone 6 and iPad Pro. Additionally, 25 sequences are collected from YouTube, captured at 240 fps with a variety of different devices. It covers a variety of scenes including sport, skydiving, underwater, wildlife, roadside, and indoor scenes.

Training data preparation: Conventionally, the blurry frames are averaged from consecutive clean frames. However, we notice unrealistic ghost effects when observing the directly averaged frames, as in Figure 3(a)(c). To alleviate that, we first use a video frame interpolation model [33] to increase the original 240-fps videos to 3840 fps, then perform average pooling over the same time window (but now with more frames). It leads to smoother and more continuous blurs, as in Figure 3(b)(d). Experimentally, this data preparation did not noticeably impact PSNR/SSIM but was observed to improve the visual quality results.
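The effect of denser temporal sampling on the averaged blur can be illustrated with a toy example. The sketch below is an assumption-laden simplification: frames are tiny one-row images held in Python lists, the helper names are hypothetical, and the intermediate frames are generated directly rather than by a learned interpolation model such as [33].

```python
def average_frames(frames):
    """Average equally sized frames (2-D lists of intensities) pixel by pixel."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / n for x in range(width)]
            for y in range(height)]

def moving_dot(position, width=8):
    """A 1 x width frame with a single bright pixel at `position`."""
    return [[1.0 if x == position else 0.0 for x in range(width)]]

# Coarse sampling: the dot is captured at only two positions in the exposure window.
coarse = average_frames([moving_dot(p) for p in (0, 4)])
# Dense sampling of the same window: the dot is captured at every position in between.
dense = average_frames([moving_dot(p) for p in range(0, 5)])
print(coarse[0])
print(dense[0])
```

The coarse average contains two separated half-intensity copies of the dot (a ghost effect), while the dense average forms one continuous streak, which is the behavior the interpolation step is meant to recover.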

4 Experimental evaluation

Figure 4: Visual comparison on the Kohler dataset.

4.1 Implementation Details

We implemented all of our models using PyTorch [36]. We compose our training set by selecting every second frame from the GoPro and DVD datasets, and every tenth frame from the NFS dataset, in the hope of reducing overfitting to any specific dataset. We then train DeblurGAN-v2 on the resulting set of approximately 10,000 image pairs. Three backbones are evaluated: Inception-ResNet-v2, MobileNet, and MobileNet-DSC. The first targets high-performance deblurring, while the latter two are more suited for resource-constrained edge applications. Specifically, the extremely lightweight DeblurGAN-v2 (MobileNet-DSC) has 96% fewer parameters than DeblurGAN-v2 (Inception-ResNet-v2).

All models were trained on a single Tesla-P100 GPU, with the Adam [17] optimizer and a learning rate of $10^{-4}$ for 150 epochs, followed by another 150 epochs with a linear decay to $10^{-7}$. We freeze the pre-trained backbone weights for 3 epochs, and then we unfreeze all weights and continue the training. The parts that are not pre-trained are initialized with random Gaussian weights. The training takes 5 days to converge. The models are fully convolutional, and thus can be applied to images of arbitrary size.
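A constant-then-linear-decay schedule of the kind described above can be written as a small function of the epoch index. This is a sketch under stated assumptions: the function name is hypothetical, and the default start and end rates (1e-4 and 1e-7) are illustrative values that can be replaced by whatever rates are actually used.

```python
def learning_rate(epoch, base_lr=1e-4, final_lr=1e-7,
                  constant_epochs=150, decay_epochs=150):
    """Hold base_lr for constant_epochs, then decay linearly to final_lr."""
    if epoch < constant_epochs:
        return base_lr
    # Fraction of the decay phase that has elapsed, capped at 1.
    progress = min(1.0, (epoch - constant_epochs) / decay_epochs)
    return base_lr + (final_lr - base_lr) * progress

for epoch in (0, 149, 150, 225, 300):
    print(epoch, learning_rate(epoch))
```

In a PyTorch training loop the same function could be passed to a lambda-based scheduler after dividing by the base rate, but that wiring is left out here to keep the sketch dependency-free.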

Figure 5: Qualitative comparison on the “face2” test image of the Lai dataset [21]. Panels: (a) blurred photo; (b) Whyte et al. [49]; (c) Krishnan [19]; (d) Sun et al. [43]; (e) Xu et al. [51]; (f) Pan et al. [35]; (g) DeepDeblur [32]; (h) SRN [45]; (i) DeblurGAN [20]; (j) DeblurGAN-v2 [best visual quality]; (k) DeblurGAN-v2 [high efficiency]; (l) DeblurGAN-v2 [highest efficiency]. DeblurGAN-v2 models are artifact-free, in contrast to other neural and non-CNN algorithms, producing smoother and visually more pleasing results.

Figure 6: Visual comparison example on the Restore Dataset. Panels: (a) degraded photo; (b) DeblurGAN; (c) DeblurGAN-v2 (Inception-ResNet-v2); (d) clean photo.

4.2 Quantitative Evaluation on GoPro Dataset

We compare our models with a number of state-of-the-art methods: one is a traditional method by Xu et al. [51], while the rest are deep learning-based: Sun et al. [43], DeepDeblur [32], SRN [45], and DeblurGAN [20]. We compare on both standard performance metrics (PSNR, SSIM) and inference efficiency (averaged running time per image, measured on a single GPU). Results are summarized in Table 1.
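For reference, PSNR follows directly from the mean squared error between the restored image and the ground truth. The sketch below uses the standard definition on flat lists of intensities; the function name and the 8-bit peak value of 255 are assumptions of this illustration, and it says nothing about the exact evaluation script used for the reported numbers.

```python
import math

def psnr(reference, restored, peak=255.0):
    """PSNR in dB between two equally sized images given as flat intensity lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf   # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Hypothetical 4-pixel images that differ by one intensity level everywhere.
print(psnr([100, 100, 100, 100], [101, 99, 101, 99]))
```

Since PSNR is a monotone function of MSE, a model trained with a pure MSE objective is directly optimized for it, which is relevant when comparing against GAN-based models trained with perceptual and adversarial terms.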

In terms of PSNR/SSIM, DeblurGAN-v2 (Inception-ResNet-v2) and SRN are ranked top-2: DeblurGAN-v2 (Inception-ResNet-v2) has slightly lower PSNR, which is not surprising since it was not trained under pure MSE loss; but it outperforms SRN in SSIM. However, we are very encouraged to observe that DeblurGAN-v2 (Inception-ResNet-v2) takes 78% less inference time than SRN. Moreover, two of our light-weight models, DeblurGAN-v2 (MobileNet) and DeblurGAN-v2 (MobileNet-DSC), show SSIMs (0.925 and 0.922) on par with the other two latest deep deblurring methods, DeblurGAN (0.927) and DeepDeblur (0.916), while being up to 100 times faster.

In particular, MobileNet-DSC only costs 0.04s per image, which even enables near real-time video frame deblurring, for 25-fps videos. To our best knowledge, DeblurGAN-v2 (MobileNet-DSC) is the only deblurring method so far that can simultaneously achieve (reasonably) high performance and that high inference efficiency.
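The efficiency figures quoted in this section follow from simple arithmetic on the per-image times in Table 1. The short sketch below reproduces that arithmetic; the dictionary and helper names are hypothetical, and only the timing values themselves come from the table.

```python
# Per-image inference times in seconds, copied from Table 1.
seconds_per_image = {
    "DeepDeblur": 4.33,
    "SRN": 1.6,
    "DeblurGAN-v2 (Inception-ResNet-v2)": 0.35,
    "DeblurGAN-v2 (MobileNet-DSC)": 0.04,
}

def frames_per_second(method):
    """Throughput implied by the per-image time."""
    return 1.0 / seconds_per_image[method]

def time_saving(method, baseline):
    """Fraction of the baseline's inference time that the method saves."""
    return 1.0 - seconds_per_image[method] / seconds_per_image[baseline]

def speedup(method, baseline):
    """How many times faster the method is than the baseline."""
    return seconds_per_image[baseline] / seconds_per_image[method]

fps = frames_per_second("DeblurGAN-v2 (MobileNet-DSC)")
saving = time_saving("DeblurGAN-v2 (Inception-ResNet-v2)", "SRN")
ratio = speedup("DeblurGAN-v2 (MobileNet-DSC)", "DeepDeblur")
print(fps, saving, ratio)
```

A time of 0.04 s per image corresponds to 25 frames per second, which is the basis for the near real-time claim for 25-fps video.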

4.3 Quantitative Evaluation on Kohler dataset

The Kohler dataset [18] consists of 4 images, each blurred with 12 different kernels. It is a standard benchmark for evaluating blind deblurring algorithms. The dataset was generated by recording and analyzing real camera motion, which was then played back on a robot platform such that a sequence of sharp images was recorded, sampling the 6D camera motion trajectory.

| Method | PSNR | SSIM | Inference time | Resolution |
|---|---|---|---|---|
| WFA | 28.35 | N/A | N/A | N/A |
| DVD (single) | 28.37 | 0.913 | 1.0 s | 960 × 540 |
| DeblurGAN-v2 | 28.54 | 0.929 | 0.06 s | 1280 × 720 |

Table 3: Results on the DVD dataset.

The comparison results are reported in Table 2. Similarly to GoPro, SRN and DeblurGAN-v2 (Inception-ResNet-v2) remain the best two PSNR/SSIM performers, but this time SRN is marginally superior in both. However, note that, as in the GoPro case, this “almost tie” was achieved while DeblurGAN-v2 (Inception-ResNet-v2) costs only 1/5 of SRN’s inference complexity. Moreover, both DeblurGAN-v2 (MobileNet) and DeblurGAN-v2 (MobileNet-DSC) outperform DeblurGAN on the Kohler dataset in both SSIM and PSNR, which is impressive given that the former two are much lighter.

Figure 4 displays visual examples on the Kohler dataset. DeblurGAN-v2 effectively restores the edges and textures, without noticeable artifacts. SRN for this specific example shows some color artifacts when zoomed in.

4.4 Quantitative Evaluation on DVD dataset

| Method | Subjective score |
|---|---|
| Blurry input | 1 |
| Krishnan [19] | 1.08 |
| Whyte et al. [49] | 0.57 |
| Xu et al. [51] | 0.77 |
| Sun et al. [43] | 0.64 |
| Pan et al. [35] | 0.91 |
| DeepDeblur [32] | 1.08 |
| SRN [45] | 1.68 |
| DeblurGAN [20] | 1.29 |
| DeblurGAN-v2 (Inception-ResNet-v2) | 1.74 |
| DeblurGAN-v2 (MobileNet) | 1.44 |
| DeblurGAN-v2 (MobileNet-DSC) | 1.32 |

Table 4: Average subjective scores of deblurring results on the Lai dataset [21].

We next test DeblurGAN-v2 on the DVD testing set used in [42], but in a single-frame setting (treating all frames as individual images) without using multiple frames together. We compare with two strong video deblurring methods: WFA [6] and DVD [42]. For the latter, we adopt the authors’ self-reported results when using a single frame as the model input (denoted as “single”), for a fair comparison. As shown in Table 3, DeblurGAN-v2 (MobileNet) outperforms WFA and DVD (single), while being at least 17 times faster (DVD was tested on a reduced resolution of 960 × 540, while DeblurGAN-v2 was tested on 1280 × 720).

While not specifically optimized for video deblurring, DeblurGAN-v2 shows good potential, and we will extend it to video deblurring as future work.

4.5 Subjective Evaluation on Lai dataset

The Lai dataset [21] has real-world blurry images of different qualities and resolutions collected in various types of scenes. Those real images have no clean/sharp counterparts, making a full-reference quantitative evaluation impossible. Following [21], we conduct a subjective survey to compare the deblurring performance on those real images.

We fit a Bradley-Terry model [2] to estimate the subjective score for each method so that they can be ranked, following the same routine as the previous benchmark work [23, 25]. Each blurry image is processed with each of the following algorithms: Krishnan [19], Whyte et al. [49], Xu et al. [51], Sun et al. [43], Pan et al. [35], DeepDeblur [32], SRN [45], DeblurGAN [20], and the three DeblurGAN-v2 variants (Inception-ResNet-v2, MobileNet, MobileNet-DSC). The eleven deblurring results, together with the original blurry image, are sent for pairwise comparison to construct the winning matrix. We collect the pair comparison results from 22 human raters. We observed good consensus and small inter-person variances among raters, which makes the scores reliable.
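Fitting a Bradley-Terry model to a winning matrix can be sketched with the classical iterative update, in which each method's score is its total number of wins divided by a sum over its comparisons weighted by the current scores. This is a generic illustration, not the routine of [23, 25]: the function name, the toy matrix, and the normalization of scores to sum to one are all assumptions.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry scores from wins[i][j], the number of times
    method i was preferred over method j."""
    n = len(wins)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (scores[i] + scores[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom if denom > 0 else scores[i])
        norm = sum(updated)
        scores = [s / norm for s in updated]   # normalization is a choice of this sketch
    return scores

# Hypothetical winning matrix for three methods, 10 comparisons per pair.
wins = [[0, 7, 9],
        [3, 0, 6],
        [1, 4, 0]]
scores = fit_bradley_terry(wins)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Only the ordering of the fitted scores is used for ranking, so any positive rescaling of the scores leaves the conclusions unchanged.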

The subjective scores are reported in Table 4. We did not normalize the scores due to the absence of ground truth: as a result, it is the score rank rather than the absolute score value that matters here. It can be observed that deep learning-based deblurring algorithms, in general, have more favorable visual results than traditional methods (some of which even make the visual quality worse than the blurry input). DeblurGAN [20] outperforms DeepDeblur [32], but lags behind SRN [45]. With the Inception-ResNet-v2 backbone, DeblurGAN-v2 demonstrates clearly superior perceptual quality over SRN, making it the top performer in terms of subjective quality. The MobileNet and MobileNet-DSC variants of DeblurGAN-v2 show minor performance degradations compared to the Inception-ResNet-v2 version. However, both are still preferred by subjective raters over DeepDeblur and DeblurGAN, while being one to two orders of magnitude faster according to the timings in Table 1.

Figure 5 displays visual comparison examples on deblurring the “face2” image. DeblurGAN-v2 (Inception-ResNet-v2) (Figure 5(j)) and SRN (Figure 5(h)) are the two most favored results, both balancing well between edge sharpness and overall smoothness. When zooming in, SRN is found to still generate some ghost artifacts on this example, e.g., the white “intrusion” from the collar into the bottom right face region. In comparison, DeblurGAN-v2 (Inception-ResNet-v2) shows artifact-free deblurring. Besides, the DeblurGAN-v2 (MobileNet) and DeblurGAN-v2 (MobileNet-DSC) results are also smooth and visually better than DeblurGAN, though less sharp than DeblurGAN-v2 (Inception-ResNet-v2).

4.6 Ablation Study and Analysis

We perform an ablation study on the effect of specific components of the DeblurGAN-v2 pipeline. Starting from the original DeblurGAN (ResNet G, local-scale patch D, WGAN-GP + perceptual loss), we gradually inject our modifications to the generator (adding FPN), the discriminator (adding the global scale), and the loss (replacing the WGAN-GP loss with RaGAN-LS, and adding an MSE term). The results are summarized in Table 5. We can see that all our proposed components steadily improve both PSNR and SSIM. In particular, the FPN module contributes most significantly. Also, adding either the MSE or the perceptual loss benefits both training stability and final results.

| Configuration | PSNR | SSIM |
|---|---|---|
| DeblurGAN (starting point) | 28.70 | 0.927 |
| + FPN | 29.26 | 0.931 |
| + FPN + Global D | 29.29 | 0.932 |
| + FPN + Global D + RaGAN-LS | 29.37 | 0.933 |
| DeblurGAN-v2 (FPN + Global D + RaGAN-LS + MSE Loss) | 29.55 | 0.934 |
| Removing perceptual loss (replace 0.5 with 0 in $L_G$) | 28.81 | 0.924 |

Table 5: Ablation study on the GoPro dataset, based on DeblurGAN-v2 (Inception-ResNet-v2).

As an extra baseline for the efficiency of FPN, we tried to create a “compact” version of SRN, with roughly the same FLOPs (456 GFLOPs) as DeblurGAN-v2 Inception-ResNet-v2 (411 GFLOPs). We reduced the number of ResBlocks by 2/3 in each EBlock/DBlock while keeping the 3-scale recurrent structure. We then compared it with DeblurGAN-v2 (Inception-ResNet-v2) on GoPro, where that “compact” SRN only achieved PSNR = 28.92 dB and SSIM = 0.9324. We also tried channel pruning [10] to reduce SRN FLOPs, and the result was no better.

| Method | PSNR | SSIM |
|---|---|---|
| Degraded | 22.056 | 0.873 |
| DeblurGAN | 26.435 | 0.892 |
| DeblurGAN-v2 (Inception-ResNet-v2) | 26.916 | 0.894 |
| DeblurGAN-v2 (MobileNet-DSC) | 25.412 | 0.891 |

Table 6: PSNR/SSIM comparison on the Restore Dataset.

4.7 Extension to General Restoration

Real-world natural images commonly go through multiple kinds of degradation (noise, blur, compression, etc.) at once, and a few recent works were devoted to such joint enhancement tasks [31, 55]. We study the effect of DeblurGAN-v2 on the task of general image restoration. While this is not the main focus of this paper, we intend to show the general architectural superiority of DeblurGAN-v2, especially for the modifications made w.r.t. DeblurGAN.

We synthesize a new challenging Restore Dataset. We take 600 images from GoPro, and 600 images from DVD, both with motion blurs already (same as above). We then use the albumentations library [3] to further add Gaussian and speckle noise, JPEG compression, and up-scaling artifacts to those images. Eventually, we split 8000 images for training and 1200 for testing. We train and compare DeblurGAN-v2 (Inception-ResNet-v2), DeblurGAN-v2 (MobileNet-DSC), and DeblurGAN. As shown in Table 6 and Fig. 6, DeblurGAN-v2 (Inception-ResNet-v2) achieves the best PSNR, SSIM, and visual quality.

5 Conclusion

This paper introduces DeblurGAN-v2, a powerful and efficient image deblurring framework, with promising quantitative and qualitative results. DeblurGAN-v2 enables switching between different backbones, for flexible trade-offs between performance and efficiency. We plan to extend DeblurGAN-v2 to real-time video enhancement, and to better handling of mixed degradations.

Acknowledgements: O. Kupyn and T. Martyniuk were supported by SoftServe, Let’s Enhance, and UCU. J. Wu and Z. Wang were supported by NSF Award RI-1755701. The authors thank Arseny Kravchenko, Andrey Luzan and Yifan Jiang for constructive discussions, and Igor Krashenyi and Oles Dobosevych for computational resources.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §2.1.
  • [2] R. A. Bradley and M. E. Terry (1952) Rank analysis of incomplete block designs: i. the method of paired comparisons. Biometrika 39 (3/4), pp. 324–345. Cited by: §4.5.
  • [3] A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov, and A. A. Kalinin (2018) Albumentations: fast and flexible image augmentations. arXiv preprint arXiv:1809.06839. Cited by: §4.7.
  • [4] C. Chang and J. Wu (2014) A new single image deblurring algorithm using hyper laplacian priors.. In ICS, pp. 1015–1022. Cited by: §2.1.
  • [5] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    pp. 1251–1258. Cited by: §3.2.
  • [6] M. Delbracio and G. Sapiro (2015) Burst deblurring: removing camera shake through fourier burst accumulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2385–2393. Cited by: §4.4.
  • [7] D. Gong, J. Yang, L. Liu, Y. Zhang, I. Reid, C. Shen, A. Van Den Hengel, and Q. Shi (2016) From Motion Blur to Motion Flow: a Deep Learning Solution for Removing Heterogeneous Motion Blur. Cited by: §2.1.
  • [8] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative Adversarial Networks. Cited by: §1, §2.2, §2.2.
  • [9] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §2.1, §2.2.
  • [10] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397. Cited by: §4.6.
  • [11] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §3.2.
  • [12] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2016) Image-to-image translation with conditional adversarial networks. arXiv. Cited by: §1, §2.1, §3.3, §3.3.
  • [13] Y. Jiang, X. Gong, D. Liu, Y. Cheng, C. Fang, X. Shen, J. Yang, P. Zhou, and Z. Wang (2019) EnlightenGAN: deep light enhancement without paired supervision. arXiv preprint arXiv:1906.06972. Cited by: §3.3.
  • [14] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, Cited by: §2.1, §3.3.
  • [15] A. Jolicoeur-Martineau (2018) The relativistic discriminator: a key element missing from standard gan. arXiv preprint arXiv:1807.00734. Cited by: 1st item, §2.2, §3.3.
  • [16] H. Kiani Galoogahi, A. Fagg, C. Huang, D. Ramanan, and S. Lucey (2017) Need for speed: a benchmark for higher frame rate object tracking. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1125–1134. Cited by: §3.4.
  • [17] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §4.1.
  • [18] R. Köhler, M. Hirsch, B. Mohler, B. Schölkopf, and S. Harmeling (2012) Recording and playback of camera shake: benchmarking blind deconvolution with a real-world database. In Proceedings of the 12th European Conference on Computer Vision - Volume Part VII, ECCV’12, Berlin, Heidelberg, pp. 27–40. ISBN 978-3-642-33785-7. Cited by: §4.3.
  • [19] D. Krishnan, T. Tay, and R. Fergus (2011) Blind deconvolution using a normalized sparsity measure. In CVPR 2011, pp. 233–240. Cited by: §2.1, 4(c), §4.5, Table 4.
  • [20] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas (2018) Deblurgan: blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8183–8192. Cited by: Figure 1, 3rd item, §1, §2.1, §3.1, §3.3, §3.3, §3.3, Table 1, Table 2, Figure 4, 4(i), §4.2, §4.5, §4.5, Table 4.
  • [21] W. Lai, J. Huang, Z. Hu, N. Ahuja, and M. Yang (2016) A comparative study for single image blind deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1709. Cited by: Figure 5, §4.5, Table 4.
  • [22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §1, §3.1, §3.3.
  • [23] B. Li, W. Ren, D. Fu, D. Tao, D. Feng, W. Zeng, and Z. Wang (2019) Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing 28 (1), pp. 492–505. Cited by: §4.5.
  • [24] L. Li, J. Pan, W. Lai, C. Gao, N. Sang, and M. Yang (2018) Learning a discriminative prior for blind image deblurring. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.1.
  • [25] S. Li, I. B. Araujo, W. Ren, Z. Wang, E. K. Tokuda, R. H. Junior, R. Cesar-Junior, J. Zhang, X. Guo, and X. Cao (2019) Single image deraining: a comprehensive benchmark analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3838–3847. Cited by: §4.5.
  • [26] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125. Cited by: 1st item, §3.1, §3.1.
  • [27] D. Liu, Z. Wang, Y. Fan, X. Liu, Z. Wang, S. Chang, and T. Huang (2017) Robust video super-resolution with learned temporal dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2507–2515. Cited by: §1.
  • [28] D. Liu, B. Wen, X. Liu, Z. Wang, and T. S. Huang (2018) When image denoising meets high-level vision tasks: a deep learning approach. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 842–848. Cited by: §3.1.
  • [29] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, and Z. Wang (2016) Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076. Cited by: 1st item, §2.2, §3.3.
  • [30] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. CoRR abs/1411.1784. Cited by: §2.2.
  • [31] J. Mustaniemi, J. Kannala, J. Matas, S. Särkkä, and J. Heikkilä (2018) LSD - joint denoising and deblurring of short and long exposure images with convolutional neural networks. CoRR abs/1811.09485. Cited by: §4.7.
  • [32] S. Nah, T. Hyun, K. Kyoung, and M. Lee (2016) Deep Multi-scale Convolutional Neural Network for Dynamic Scene Deblurring. Cited by: Figure 1, 3rd item, §1, §2.1, §3.1, §3.4, Table 1, Table 2, Figure 4, 4(g), §4.2, §4.5, §4.5, Table 4.
  • [33] S. Niklaus, L. Mai, and F. Liu (2017) Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 261–270. Cited by: §3.4.
  • [34] M. Noroozi, P. Chandramouli, and P. Favaro (2017) Motion Deblurring in the Wild. Cited by: §2.1.
  • [35] J. Pan, D. Sun, H. Pfister, and M. Yang (2016) Blind image deblurring using dark channel prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1628–1636. Cited by: §2.1, 4(f), §4.5, Table 4.
  • [36] PyTorch. Note: Cited by: §4.1.
  • [37] S. Ramakrishnan, S. Pachori, A. Gangopadhyay, and S. Raman (2017) Deep generative filter for motion deblurring. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 2993–3000. Cited by: §2.1.
  • [38] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: footnote 1.
  • [39] W. Ren, J. Zhang, L. Ma, J. Pan, X. Cao, W. Zuo, W. Liu, and M. Yang (2018) Deep non-blind deconvolution via generalized low-rank approximation. In Advances in Neural Information Processing Systems, pp. 295–305. Cited by: §2.1.
  • [40] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv preprint arXiv:1801.04381. Cited by: §3.2.
  • [41] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.3.
  • [42] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang (2017) Deep video deblurring for hand-held cameras. In CVPR, Cited by: §3.4, §4.4.
  • [43] J. Sun, W. Cao, Z. Xu, and J. Ponce (2015) Learning a Convolutional Neural Network for Non-uniform Motion Blur Removal. Cited by: §2.1, Table 1, Table 2, 4(d), §4.2, §4.5, Table 4.
  • [44] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §3.2.
  • [45] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia (2018) Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8174–8182. Cited by: Figure 1, 3rd item, §2.1, §3.1, Table 1, Table 2, Figure 4, 4(h), §4.2, §4.5, §4.5, Table 4.
  • [46] J. Telleen, A. Sullivan, J. Yee, O. Wang, P. Gunawardane, I. Collins, and J. Davis (2007) Synthetic shutter speed imaging. In Computer Graphics Forum, Cited by: §3.4.
  • [47] Y. Wang, T. Nguyen, Y. Zhao, Z. Wang, Y. Lin, and R. Baraniuk (2018) Energynet: energy-efficient dynamic inference. NeurIPS CDNNRIA Workshop. Cited by: §3.2.
  • [48] Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang, and T. S. Huang (2016) D3: deep dual-domain based fast restoration of jpeg-compressed images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2764–2772. Cited by: §1.
  • [49] O. Whyte, J. Sivic, A. Zisserman, and J. Ponce (2010) Non-uniform deblurring for shaken images. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 491–498. Cited by: 4(b), §4.5, Table 4.
  • [50] J. Wu, Y. Wang, Z. Wu, Z. Wang, A. Veeraraghavan, and Y. Lin (2018) Deep k-means: re-training and parameter sharing with harder cluster assignments for compressing deep convolutions. In International Conference on Machine Learning, pp. 5359–5368. Cited by: §3.2.
  • [51] L. Xu, S. Zheng, and J. Jia (2013) Unnatural L0 Sparse Representation for Natural Image Deblurring. Cited by: Table 1, 4(e), §4.2, §4.5, Table 4.
  • [52] X. Xu, J. Pan, Y. Zhang, and M. Yang (2018) Motion blur kernel estimation via deep learning. IEEE Transactions on Image Processing 27 (1), pp. 194–205. Cited by: §2.1.
  • [53] R. A. Yeh, C. Chen, T. Lim, M. Hasegawa-Johnson, and M. N. Do (2016) Semantic image inpainting with perceptual and contextual losses. CoRR abs/1607.07539. Cited by: §1.
  • [54] K. Yu, C. Dong, L. Lin, and C. Change Loy (2018) Crafting a toolchain for image restoration by deep reinforcement learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2443–2452. Cited by: §3.2.
  • [55] X. Zhang, H. Dong, Z. Hu, W. Lai, F. Wang, and M. Yang (2018) Gated fusion network for joint image deblurring and super-resolution. In BMVC, Cited by: §4.7.