Image Deblurring using Generative Adversarial Networks
We present an end-to-end learning approach for motion deblurring, based on a conditional GAN and a content loss. It improves the state-of-the-art in terms of peak signal-to-noise ratio, structural similarity measure and visual appearance. The quality of the deblurring model is also evaluated in a novel way, on a real-world problem: object detection on (de-)blurred images. The method is 5 times faster than the closest competitor. We also present a novel method for generating synthetic motion-blurred images from sharp ones, which allows realistic dataset augmentation. The model, training code and dataset are available at https://github.com/KupynOrest/DeblurGAN
This work addresses blind motion deblurring of a single photograph. Significant progress has recently been achieved in the related areas of image super-resolution and inpainting by applying generative adversarial networks (GANs). GANs are known for their ability to preserve texture details in images, and to create solutions that are close to the real image manifold and look perceptually convincing.
Inspired by recent work on image super-resolution and image-to-image translation by generative adversarial networks, we treat deblurring as a special case of such image-to-image translation. We present DeblurGAN, an approach based on conditional generative adversarial networks and a multi-component loss function. Unlike previous work, we use a Wasserstein GAN with gradient penalty and a perceptual loss. This encourages solutions which are perceptually hard to distinguish from real sharp images and allows restoring finer texture details than when using traditional MSE or MAE as an optimization target.
We make three contributions. First, we propose a loss and architecture which obtain state-of-the-art results in motion deblurring while being 5x faster than the fastest competitor. Second, we present a method based on random trajectories for generating a motion-deblurring training dataset in an automated fashion from a set of sharp images. We show that combining it with an existing motion deblurring dataset improves results compared to training on real-world images only. Finally, we present a novel dataset and method for evaluating deblurring algorithms based on how much they improve object detection results.
The common formulation of the non-uniform blur model is the following:

I_B = k(M) ∗ I_S + N,

where I_B is the blurred image, k(M) are unknown blur kernels determined by the motion field M, I_S is the sharp latent image, ∗ denotes convolution, and N is additive noise.
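The blur model above can be illustrated with a minimal NumPy sketch. Note this simplification assumes a single, spatially uniform kernel, whereas the model in the paper allows a different kernel per pixel, determined by the motion field M; the function name and parameters are illustrative.

```python
import numpy as np

def apply_uniform_blur(sharp, kernel, noise_std=0.0, seed=0):
    """Toy version of I_B = k * I_S + N with one spatially uniform kernel k.

    sharp: 2D array (grayscale image), kernel: 2D array summing to 1.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # edge padding keeps the output the same size as the input
    padded = np.pad(sharp, ((ph, ph), (pw, pw)), mode="edge")
    blurred = np.zeros_like(sharp, dtype=float)
    # accumulate shifted copies weighted by the kernel (correlation form)
    for dy in range(kh):
        for dx in range(kw):
            blurred += kernel[dy, dx] * padded[dy:dy + sharp.shape[0],
                                               dx:dx + sharp.shape[1]]
    # additive noise term N
    rng = np.random.default_rng(seed)
    return blurred + rng.normal(0.0, noise_std, sharp.shape)
```

For example, a horizontal motion blur is `kernel = np.ones((1, 5)) / 5.0`; because the kernel is normalized, a constant image is left unchanged.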
The family of deblurring problems is divided into two types: blind and non-blind deblurring. Early work mostly focused on non-blind deblurring, making the assumption that the blur kernels are known. Most of it relies on the classical Lucy-Richardson algorithm, or on the Wiener or Tikhonov filter, to perform the deconvolution operation and obtain an estimate of the sharp image. Commonly the blur function is unknown, and blind deblurring algorithms estimate both the latent sharp image and the blur kernels. Finding a blur function for each pixel is an ill-posed problem, and most existing algorithms rely on heuristics, image statistics and assumptions about the sources of the blur. One family of methods addresses blur caused by camera shake by considering the blur to be uniform across the image. First, the camera motion is estimated in terms of the induced blur kernel, and then the effect is reversed by performing a deconvolution operation. Starting with the success of Fergus et al., many methods have been developed over the last ten years. Some of them are based on an iterative approach, which improves the estimate of the motion kernel and the sharp image on each iteration by using parametric prior models. However, the running time, as well as the stopping criterion, is a significant problem for these kinds of algorithms. Others use assumptions of local linearity of the blur function and simple heuristics to quickly estimate the unknown kernel. These methods are fast but work well only on a small subset of images.
Recently, Whyte et al. developed a novel algorithm for non-uniform blind deblurring based on a parametrized geometric model of the blurring process in terms of the rotational velocity of the camera during exposure. Similarly, Gupta et al. made the assumption that the blur is caused only by 3D camera movement. With the success of deep learning, several approaches based on convolutional neural networks (CNNs) have appeared over the last few years. Sun et al. use a CNN to estimate the blur kernel, Chakrabarti predicts complex Fourier coefficients of the motion kernel to perform non-blind deblurring in Fourier space, whereas Gong et al. use a fully convolutional network for motion flow estimation. All of these approaches use a CNN to estimate the unknown blur function. Recently, kernel-free end-to-end approaches by Noroozi et al. and Nah et al. use a multi-scale CNN to directly deblur the image. Ramakrishnan et al. use a combination of the pix2pix framework and densely connected convolutional networks to perform blind kernel-free image deblurring. Such methods are able to deal with different sources of blur.
The idea of generative adversarial networks, introduced by Goodfellow et al., is to define a game between two competing networks: the discriminator and the generator. The generator receives noise as an input and generates a sample. The discriminator receives a real and a generated sample and tries to distinguish between them. The goal of the generator is to fool the discriminator by generating perceptually convincing samples that cannot be distinguished from real ones. The game between the generator G and discriminator D is the minimax objective:

min_G max_D  E_{x∼P_r}[log(D(x))] + E_{x̃∼P_g}[log(1 − D(x̃))],

where P_r is the data distribution and P_g is the model distribution, defined by x̃ = G(z), with the input z a sample from a simple noise distribution. GANs are known for their ability to generate samples of good perceptual quality; however, training of the vanilla version suffers from many problems such as mode collapse and vanishing gradients. Minimizing the value function in a GAN is equal to minimizing the Jensen-Shannon divergence between the data and model distributions. Arjovsky et al. discuss the difficulties in GAN training caused by the JS divergence approximation and propose to use the Earth-Mover (also called Wasserstein-1) distance W(P_r, P_g). The value function for WGAN is constructed using the Kantorovich-Rubinstein duality:

min_G max_{D ∈ 𝒟}  E_{x∼P_r}[D(x)] − E_{x̃∼P_g}[D(x̃)],

where 𝒟 is the set of 1-Lipschitz functions and P_g is once again the model distribution. The idea here is that the critic value approximates K · W(P_r, P_g), where K is a Lipschitz constant and W is the Wasserstein distance. In this setting, the discriminator network is called a critic, and it approximates the distance between the samples. To enforce the Lipschitz constraint in WGAN, Arjovsky et al. add weight clipping to the critic's weights. Gulrajani et al. propose to add a gradient penalty term instead:

λ E_{x̂∼P_x̂}[(‖∇_x̂ D(x̂)‖₂ − 1)²]
to the value function as an alternative way to enforce the Lipschitz constraint. This approach is robust to the choice of generator architecture and requires almost no hyperparameter tuning. This is crucial for image deblurring, as it allows using novel lightweight neural network architectures in contrast to the standard deep ResNet architectures previously used for image deblurring.
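The gradient penalty term above can be sketched in PyTorch as follows. This is a minimal, generic implementation of the WGAN-GP penalty (samples x̂ are taken on straight lines between real and generated samples); the default λ = 10 follows the WGAN-GP paper and the function name is our own.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP penalty: push the critic's gradient norm towards 1
    at random interpolates between real and fake samples."""
    # one interpolation coefficient per sample, broadcast over all other dims
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = critic(x_hat)
    # gradient of critic output w.r.t. the interpolated input
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```

`create_graph=True` is what makes the penalty itself differentiable, so it can be added to the critic loss and backpropagated.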
Generative adversarial networks have been applied to different image-to-image translation problems, such as super-resolution, style transfer, product photo generation and others. Isola et al. provide a detailed overview of those approaches and present the conditional GAN architecture also known as pix2pix. Unlike a vanilla GAN, a cGAN learns a mapping from an observed image x and a random noise vector z to the output image y. Isola et al. also put a condition on the discriminator, and use a U-net architecture for the generator and a Markovian discriminator, which allows achieving perceptually superior results on many tasks, including synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images.
The goal is to recover a sharp image I_S given only a blurred image I_B as input, so no information about the blur kernel is provided. Deblurring is done by a trained CNN, to which we refer as the generator. For each blurred image I_B it estimates a corresponding sharp image. In addition, during the training phase we introduce a critic network and train both networks in an adversarial manner.
We formulate the loss function as a combination of content and adversarial loss:

L = L_GAN + λ · L_X,

where λ equals 100 in all experiments. Unlike Isola et al., we do not condition the discriminator, as we do not need to penalize mismatch between the input and output. Adversarial loss. Most papers on conditional GANs use the vanilla GAN objective as the loss function. Recent work provides an alternative in the least-squares GAN, which is more stable and generates higher-quality results. We use WGAN-GP as the critic function, which is shown to be robust to the choice of generator architecture. Our preliminary experiments with different architectures confirmed these findings, and we are able to use an architecture much lighter than ResNet152; see the next subsection. The adversarial loss is calculated as follows:

L_GAN = Σ_n −D(G(I_B))
DeblurGAN trained without the GAN component converges, but produces smooth and blurry images.
Content loss. Two classical choices for the "content" loss function are the L1 (MAE) loss and the L2 (MSE) loss on raw pixels. Using those functions as the sole optimization target leads to blurry artifacts on generated images, due to the pixel-wise averaging of possible solutions in pixel space. Instead, we adopt the recently proposed perceptual loss. The perceptual loss is a simple L2 loss, but computed on the difference between the CNN feature maps of the generated and target images. It is defined as follows:

L_X = (1 / (W_{i,j} H_{i,j})) Σ_{x=1}^{W_{i,j}} Σ_{y=1}^{H_{i,j}} (φ_{i,j}(I_S)_{x,y} − φ_{i,j}(G(I_B))_{x,y})²

where φ_{i,j} is the feature map obtained by the j-th convolution (after activation) before the i-th maxpooling layer within the VGG19 network pretrained on ImageNet, and W_{i,j}, H_{i,j} are the dimensions of the feature maps. In our work we use activations from the VGG3,3 convolutional layer. The activations of deeper layers represent features of higher abstraction. The perceptual loss focuses on restoring general content, while the adversarial loss focuses on restoring texture details. DeblurGAN trained without the perceptual loss, or with a simple MSE on pixels instead, does not converge to a meaningful state.
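A sketch of the perceptual loss in PyTorch, written generically over a feature extractor φ so it stays self-contained: in the paper φ is a fixed, ImageNet-pretrained VGG19 feature map (e.g. a truncated `torchvision.models.vgg19(...).features`), which we do not instantiate here.

```python
import torch
import torch.nn.functional as F

def perceptual_loss(phi, generated, target):
    """MSE between feature maps phi(.) rather than raw pixels.

    phi: a frozen feature extractor (in the paper: VGG19 conv features);
    mse_loss already averages over all feature-map elements, which plays
    the role of the 1 / (W_{i,j} H_{i,j}) normalization (up to the
    channel factor).
    """
    f_gen = phi(generated)
    f_tgt = phi(target)
    return F.mse_loss(f_gen, f_tgt)
```

Because φ is frozen, gradients flow only into the generator that produced `generated`; the extractor's weights are never updated.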
Additional regularization. We also tried adding TV regularization, but the model trained with it yields worse performance: 27.9 PSNR on the GoPro dataset vs. 28.7 without it.
The generator CNN architecture is shown in Figure 3. It is similar to the one proposed by Johnson et al. for the style transfer task. It contains two strided convolution blocks with stride 1/2, nine residual blocks (ResBlocks) and two transposed convolution blocks. Each ResBlock consists of a convolution layer, an instance normalization layer, and ReLU activation. Dropout regularization with a probability of 0.5 is added after the first convolution layer in each ResBlock. In addition, we introduce a global skip connection which we refer to as ResOut. The CNN learns a residual correction I_R to the blurred image I_B, so that I_S = I_B + I_R. We find that such a formulation makes training faster and the resulting model generalizes better. During the training phase, we define a critic network, which is a Wasserstein GAN with gradient penalty, to which we refer as WGAN-GP. The architecture of the critic network is identical to PatchGAN [16, 22]. All the convolutional layers except the last are followed by an InstanceNorm layer and LeakyReLU with α = 0.2.
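The ResBlock structure and the ResOut global skip can be sketched as below. This is a heavily reduced illustration, not the released model: channel counts, block counts and the down/upsampling stages are omitted or shrunk, and the clamp to [-1, 1] is one common choice for keeping I_B + I_R in the image range.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Conv + InstanceNorm + ReLU, dropout after the first conv, local skip."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
            nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class TinyDeblurGenerator(nn.Module):
    """Reduced sketch of the generator; the key idea is the global skip
    (ResOut): the network predicts a residual I_R and outputs I_B + I_R."""
    def __init__(self, ch=8, n_blocks=2):  # the paper uses 9 ResBlocks
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        self.tail = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, blurred):
        residual = self.tail(self.blocks(self.head(blurred)))
        # global skip connection (ResOut): I_S = I_B + I_R, clamped
        return torch.clamp(blurred + residual, -1.0, 1.0)
```

Predicting a residual instead of the full image means the identity mapping is trivially available at initialization, which is one intuition for why such formulations train faster.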
There is no easy method to obtain image pairs of corresponding sharp and blurred images for training. A typical approach is to use a high frame-rate camera and simulate blur by averaging sharp frames from video [27, 25]. It allows creating realistic blurred images but limits the image space only to scenes present in the recorded videos, and makes it complicated to scale the dataset. Sun et al. create synthetically blurred images by convolving clean natural images with one of 73 possible linear motion kernels; Xu et al. also use linear motion kernels to create synthetically blurred images. Chakrabarti creates blur kernels by sampling 6 random points and fitting a spline to them. We take a step further and propose a method which simulates more realistic and complex blur kernels. We follow the idea of random trajectory generation described by Boracchi and Foi. The kernels are then generated by applying sub-pixel interpolation to the trajectory vector. Each trajectory vector is a complex-valued vector which corresponds to the discrete positions of an object following 2D random motion in a continuous domain. Trajectory generation is done by a Markov process, summarized in Algorithm 1. The position of the next point of the trajectory is randomly generated based on the previous point's velocity and position, a Gaussian perturbation, an impulse perturbation and a deterministic inertial component.
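The Markov process described above can be sketched as follows. The structure (complex-valued positions; velocity updated from a Gaussian perturbation, a rare impulse perturbation, and an inertial pull) follows Boracchi and Foi, but the constants and parameter values here are illustrative, not the paper's.

```python
import numpy as np

def random_trajectory(n=64, max_len=60.0, p_impulse=0.005, seed=0):
    """Generate a complex-valued 2D motion trajectory of n points.

    Each complex number encodes an (x, y) position; the blur kernel is
    later obtained by sub-pixel interpolation of these positions.
    """
    rng = np.random.default_rng(seed)
    pos = np.zeros(n, dtype=complex)
    # initial velocity: random direction, fixed step length
    vel = np.exp(1j * rng.uniform(0, 2 * np.pi)) * max_len / n
    for t in range(1, n):
        impulse = 0.0
        if rng.random() < p_impulse:
            # rare large impulse perturbation (sudden "shake")
            impulse = -2.0 * vel * np.exp(1j * rng.uniform(0, 2 * np.pi))
        gaussian = (rng.normal() + 1j * rng.normal()) * max_len / n
        inertia = -0.1 * pos[t - 1] * max_len / n  # pull back towards origin
        vel = vel + (impulse + gaussian + inertia) / n
        vel = vel / abs(vel) * max_len / n  # keep the step length fixed
        pos[t] = pos[t - 1] + vel
    return pos
```

Rasterizing `pos` onto a small grid (with sub-pixel interpolation) and normalizing the result to sum to 1 yields a blur kernel that can be convolved with a sharp image.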
We implemented all of our models using the PyTorch deep learning framework. The training was performed on a single Maxwell GTX Titan-X GPU using three different datasets. The first model, to which we refer as DeblurGAN_WILD, was trained on random 256x256 crops from the 1000 GoPro training dataset images, downscaled by a factor of two. The second one, DeblurGAN_Synth, was trained on 256x256 patches from the MS COCO dataset, blurred by the method presented in the previous section. We also trained DeblurGAN_Comb on a combination of synthetically blurred images and images taken in the wild, where the ratio of synthetically generated images to images taken by a high frame-rate camera is 2:1. As the models are fully convolutional and are trained on image patches, they can be applied to images of arbitrary size. For optimization we follow the WGAN-GP training procedure and perform 5 gradient descent steps on the critic, then one step on the generator, using Adam as the solver. The learning rate is set initially to 10^-4 for both generator and critic. After the first 150 epochs we linearly decay the rate to zero over the next 150 epochs. At inference time we follow the idea of Isola et al. and apply both dropout and instance normalization. All the models were trained with a batch size of 1, which showed empirically better results on validation. The training phase took 6 days for one DeblurGAN network.
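The alternating optimization schedule can be sketched as one outer training iteration. This is a simplified illustration, not the released training code: `gp_fn` stands for a gradient-penalty function, `content_loss` for the perceptual loss, and the function name and signature are our own.

```python
import torch

def train_step(generator, critic, content_loss, blurred, sharp,
               opt_g, opt_c, gp_fn, n_critic=5, lam=100.0):
    """One outer iteration: n_critic=5 critic updates, then one generator
    update with L = L_adv + lambda * L_content (lambda = 100 in the paper)."""
    for _ in range(n_critic):
        opt_c.zero_grad()
        fake = generator(blurred).detach()  # no generator gradients here
        # WGAN critic loss: maximize D(real) - D(fake), plus gradient penalty
        loss_c = critic(fake).mean() - critic(sharp).mean() \
                 + gp_fn(critic, sharp, fake)
        loss_c.backward()
        opt_c.step()
    opt_g.zero_grad()
    restored = generator(blurred)
    # generator: fool the critic and stay close to the sharp image in content
    loss_g = -critic(restored).mean() + lam * content_loss(restored, sharp)
    loss_g.backward()
    opt_g.step()
    return loss_g.item()
```

In practice both optimizers would be Adam with the schedule described above (initial rate 10^-4, then linear decay).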
|Method||Sun et al.||Nah et al.||Xu et al.||DeblurGAN|
|Time||20 min||4.33 s||13.41 s||0.85 s|
The GoPro dataset consists of 2103 pairs of blurred and sharp images in 720p quality, taken from various scenes. We compare the results of our models with state-of-the-art models on standard metrics and also show the running time of each algorithm on a single GPU. Results are in Table 1. DeblurGAN shows superior results in terms of structural similarity, is close to state-of-the-art in peak signal-to-noise ratio and provides better-looking results by visual inspection. In contrast to other neural models, our network does not use an L2 distance in pixel space, so it is not directly optimized for the PSNR metric. It can handle blur caused by camera shake and object movement, does not suffer from the usual artifacts of kernel estimation methods, and at the same time has more than 6x fewer parameters compared to the Multi-scale CNN, which heavily speeds up inference. Deblurred images from the test on the GoPro dataset are shown in Figure 7.
The Kohler dataset consists of 4 images, each blurred with 12 different kernels. This is a standard benchmark dataset for the evaluation of blind deblurring algorithms. The dataset is generated by recording and analyzing real camera motion, which is played back on a robot platform such that a sequence of sharp images is recorded sampling the 6D camera motion trajectory. Results are in Table 2 and are similar to the GoPro evaluation.
|Method||Sun et al.||Nah et al.||Xu et al.||Whyte et al.||DeblurGAN|
Object detection is one of the most well-studied problems in computer vision, with applications in domains from autonomous driving to security. During the last few years, approaches based on deep convolutional neural networks have shown state-of-the-art performance compared to traditional methods. However, those networks are trained on limited datasets, and in real-world settings images are often degraded by different artifacts, including motion blur. Similar to previous studies, we investigate the influence of motion blur on object detection and propose a new way to evaluate the quality of a deblurring algorithm, based on object detection results from a pretrained YOLO network.
For this, we constructed a dataset of sharp and blurred street views by simulating camera shake using a high frame-rate video camera. Following prior work, we take a random number (between 5 and 25) of frames taken by a 240 fps camera and compute the blurred version of the middle frame as the average of those frames. All frames are gamma-corrected, and the inverse function is then applied to obtain the final blurred frame. Overall, the dataset consists of 410 pairs of blurred and sharp images, taken on streets and parking lots with different numbers and types of cars.
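The averaging step above can be sketched as follows. Because stored images are gamma-encoded, the frames are first linearized, averaged in linear light, and then re-encoded; γ = 2.2 here is a common assumption for the camera response, and the function name is our own.

```python
import numpy as np

def blur_from_frames(frames, gamma=2.2):
    """Simulate motion blur from a burst of sharp high-frame-rate frames.

    frames: sequence of HxW (or HxWxC) arrays with values in [0, 1],
    assumed gamma-encoded. Averaging is done in linear light.
    """
    # linearize: undo the gamma encoding
    linear = np.stack([np.power(f, gamma) for f in frames])
    # average in linear light, then re-apply the gamma encoding
    return np.power(linear.mean(axis=0), 1.0 / gamma)
```

Averaging in linear light rather than directly on encoded pixel values is what makes the simulated blur match the physics of a long exposure.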
Blur sources include both camera shake and blur caused by car movement. The dataset and supplementary code are available online. Sharp images are fed into the YOLO network, and the result, after visual verification, is assigned as ground truth. YOLO is then run on the blurred and recovered versions of the images, and the average recall and precision between the obtained results and the ground truth are calculated. This approach measures the quality of deblurring models on a real-life problem and correlates with the visual quality and sharpness of the generated images, in contrast to the standard PSNR metric. Precision is, in general, higher on blurry images, as there are no sharp object boundaries and smaller objects are not detected, as shown in Figure 9.
Results are shown in Table 3. DeblurGAN significantly outperforms competitors in terms of recall and F1 score.
|Method||Precision||Recall||F1 score|
|Nah et al.||0.834||0.552||0.665|
We described a kernel-free blind motion deblurring learning approach and introduced DeblurGAN which is a Conditional Adversarial Network that is optimized using a multi-component loss function. In addition to this, we implemented a new method for creating a realistic synthetic motion blur able to model different blur sources. We introduce a new benchmark and evaluation protocol based on results of object detection and show that DeblurGAN significantly helps detection on blurred images.
The authors were supported by the ELEKS Ltd., ARVI Lab, Czech Science Foundation Project GACR P103/12/G084, the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry of Science, Research and Economy, and the Province of Upper Austria in the frame of the COMET center, the CTU student grant SGS17/185/OHK3/3T/13. We thank Huaijin Chen for finding the bug in peak-signal-to-noise ratio evaluation.