Removing blur caused by camera shake has always been a challenging problem in the computer vision literature due to its ill-posed nature. Motion blur caused by relative motion between the camera and the object in 3D space induces a spatially varying blurring effect over the entire image. In this paper, we propose a novel deep filter based on a Generative Adversarial Network (GAN) architecture, integrated with a global skip connection and dense connectivity, to tackle this problem. By bypassing the process of blur kernel estimation, our model significantly reduces the test time, which is necessary for practical applications. Experiments on benchmark datasets prove the effectiveness of the proposed method, which outperforms state-of-the-art blind deblurring algorithms both quantitatively and qualitatively.
Motion blur is a common problem which occurs predominantly when capturing an image using lightweight devices like mobile phones. Due to the finite exposure interval and the relative motion between the capturing device and the captured object, the obtained image is often blurred. In , it was shown that standard network models, trained only on high-quality images, suffer a significant degradation in performance when applied to images degraded by blur due to defocus or subject/camera motion. Thus, there is a serious need to tackle the issue of blurring in images. Blur induced by motion is spatially non-uniform, and the blur kernel is unknown. Owing to depth variation, object segmentation boundaries, and the relative motion between the camera and scene objects, estimating a spatially varying non-uniform kernel is quite difficult. In this paper, we introduce a generative adversarial network (GAN) based deep learning architecture to address this challenging problem. We obtain significantly better results than the state-of-the-art algorithms proposed to solve the problem of image deblurring.
Most of the previous works in the literature tackle the problem of camera deshaking by modelling it as a blind deconvolution problem and using image statistics as priors or regularizers to obtain the blur kernels. While these methods have achieved great success on benchmark datasets, the restrictive assumptions in their methods and algorithms limit their practical applicability. Moreover, most of these works have been dedicated to solving blind deconvolution under the assumption that the blur kernel is spatially uniform; very few works address spatially varying blur kernels. To tackle non-uniform blind deblurring, previous works divide the image into smaller regions and estimate the blur kernels for each region separately . Once the kernels are obtained for each of the local regions, the regions are deblurred and combined using the OLA (overlap-add) method to generate the final deconvolved image. Works which exploit deep learning first predict the probabilistic distribution of motion blur in a small region of the given image and then utilize this blur estimate to recover the sharp image . To the best of our knowledge, only one work has attempted to directly recover the sharp image from the given blurred image . However, it is computationally expensive, as the authors exploit a multi-scale framework to obtain the deblurred image. We therefore aim to recover the artifact-free image directly, without using a multi-scale framework. An exhaustive survey of blind deblurring algorithms can be found in .
In our model, we enable every convolutional unit in the deep network to make independent decisions based on the entire array of lower level activations. Unlike  and  which use residual blocks as primary workhorses through element-wise summation of lower level activations with higher level outputs, we want information from different semantic levels to flow unaltered throughout the network. To achieve this, we propose a densely connected ‘generative network’.
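The distinction between residual and dense connections can be sketched in a few lines of numpy. The shapes below are illustrative only: a residual block merges lower-level activations into higher-level outputs by element-wise summation, whereas a dense connection concatenates them along the channel axis, so the lower-level maps reach deeper layers unaltered.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations with layout (channels, height, width); shapes are illustrative.
low = rng.standard_normal((64, 8, 8))   # lower-level activations
high = rng.standard_normal((64, 8, 8))  # output of the current block

# Residual connection: element-wise sum. Lower-level information is merged
# (and potentially altered) before reaching deeper layers.
residual_out = low + high

# Dense connection: channel-wise concatenation. Lower-level activations are
# passed on unaltered alongside the new features.
dense_out = np.concatenate([low, high], axis=0)

# The original 'low' maps survive bit-for-bit inside the dense output.
assert np.array_equal(dense_out[:64], low)
```

This is why a dense generator lets every convolutional unit see the full array of lower-level activations rather than a summed mixture.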
| Method | MBMF | MS-CNN | OURS |
|--------|------|--------|------|
| Time | 0.72 sec | 2.2 sec | 0.3 sec |
| Xu et al. | 25.1858 | 0.8960 | 0.9614 | 0.9081 | 0.9527 | 4.1811 | 0.8644 | 0.9570 |
| Sun et al. | 24.6890 | 0.8561 | 0.9308 | 0.8691 | 0.9427 | 4.1132 | 0.8430 | 0.9532 |
| Xu et al. | 25.95 | 0.7474 | 0.8358 | 0.8309 | 0.9563 | 2.4140 | 0.7478 | 0.9271 |
| Sun et al. | 24.58 | 0.7379 | 0.8059 | 0.8255 | 0.9393 | 2.3897 | 0.7303 | 0.9087 |
| Xu et al. | 27.47 | 0.7506 | 0.8115 | 0.8810 | 0.9642 | 2.5025 | 0.7698 | 0.9309 |
| Sun et al. | 25.12 | 0.7281 | 0.7748 | 0.7990 | 0.9401 | 2.1963 | 0.7267 | 0.9108 |
Our architecture consists of a densely connected generator and a discriminator. The task of the generator is to recycle features spanning multiple receptive scales to generate an image that fools the discriminator into thinking that the generated image came from the target distribution. Thus, given a blurred image, we can generate a visually appealing and statistically consistent deblurred image. The task of the discriminator is to correctly identify which distribution each of its input images came from, analysing different patches in each image to make its decision. We elaborate on both our generator and discriminator models in detail below.
Unlike , we do not reduce the dimension of the information, keeping it constant throughout the network. While this does give rise to memory constraints, it protects the network from generating the checkerboard artifacts commonly found in networks that rely on deconvolution to generate visually appealing images . Instead, through feature re-use across all levels in the generator network, our model exhibits high generation performance with a much smaller network depth than the other CNN-based methods used for non-uniform motion deblurring . This enables smoother training, faster test time, and efficient memory usage. Our generator model, as shown in Fig. 1, consists of 4 parts: the head, the dense field, the tail, and the global skip connection. We describe each of them in detail below.
a) The Head: We define the hyper-parameter ‘channel-rate’ (chr) as the constant number of activation channels output by each convolutional layer; we set the channel-rate to 64. The head comprises a single convolutional layer which convolves over the raw input image and outputs 4 × chr (256) feature activation maps. This provides sufficient first-level activation maps to trigger the densely connected stack of layers.
b) The Dense Field: This section consists of a number of convolutional ‘blocks’ placed sequentially one after the other, with the output of every block fully connected to the outputs of all the blocks ahead of it. The dense connection is efficiently achieved in practice by concatenating the output activation maps of every layer in the dense field with the output maps of the layers before it. Hence, the number of activation maps input to a dense block grows by ‘chr’ with every block. The structure of a dense block is shown in Fig. 2. The first operation is a Leaky ReLU, which not only adds non-linearity to the incoming activations but also avoids sparse gradients, which could compromise GAN training stability. The convolution that follows ‘chokes’ the number of activation maps being convolved later to a maximum equal to ‘chr’. This conserves parameter and data memory in the deeper layers of the dense field, where the number of raw activation channels entering will be 6 × chr (384) or more. The convolution at the final layer of each dense block uses ‘chr’ filters, giving rise to ‘chr’ activation maps at the end of each dense block. The convolutions along the dense field alternate between ‘spatial’ convolution and ‘dilated’ convolution with a linearly increasing dilation factor ; we use dilated convolution at every even-numbered layer within the dense field. The dilation factor increases linearly to a maximum at the centre of the dense field and then reduces symmetrically until we arrive at the tail. This increases the receptive field at an exponential rate with every layer while the parameter space increases only linearly, and hence introduces higher disparity between the multiple scales of activation maps that arrive at subsequent dense layers. We avoid pooling and strided convolution operations to keep the dimensions of the output maps constant and equal to the image size throughout the network. Adding dropout at the end of each block effectively adds Gaussian noise to the input of each layer in the generator (G), which prevents the GAN collapse problem by enabling G to blindly model shake distributions other than a pure delta distribution.
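The two bookkeeping rules above — linear growth of input channels along the dense field, and a dilation factor that rises to the centre and falls symmetrically — can be sketched in pure Python. The head's 256 output maps and the 10-block field length follow the text; the exact shape of the dilation ramp is an assumption for illustration.

```python
CHR = 64            # channel-rate: constant maps emitted per layer
HEAD_OUT = 4 * CHR  # the 256 activation maps produced by the head

def input_channels(block_index: int) -> int:
    """Maps entering dense block `block_index` (1-indexed): the head's output
    plus one chr-sized bundle appended by every preceding block."""
    return HEAD_OUT + (block_index - 1) * CHR

def dilation_schedule(n_blocks: int) -> list:
    """Hypothetical dilation factors across the dense field: rising linearly
    to the centre, then falling symmetrically toward the tail."""
    half = n_blocks // 2
    rising = list(range(1, half + 1))
    middle = [half + 1] if n_blocks % 2 else []
    return rising + middle + rising[::-1]

print([input_channels(i) for i in (1, 2, 3, 10)])  # [256, 320, 384, 832]
print(dilation_schedule(10))                       # [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
```

Note that `input_channels(3)` reproduces the 6 × chr (384) figure quoted for the deeper layers, which is what the ‘choking’ convolution is there to compress.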
In our GAN framework, the discriminator is the primary agent guiding the statistics that the generator employs to create restored images. Moreover, we do not want the discriminator network to be so deep that it simply memorizes the easier task of classification. We employ a Markovian patch discriminator  with 10 convolutional layers, which acts like a non-overlapping sliding window that looks for well-defined structural features at several local patches. This also enforces rich coloration in the generated natural images .
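The patch-wise behaviour of a Markovian discriminator can be mimicked with a numpy reshape: the image is tiled into non-overlapping patches and each patch receives its own score. The scoring function here (mean intensity) is a stand-in for the learned 10-layer network; only the tiling logic is the point of the sketch.

```python
import numpy as np

def patch_scores(img: np.ndarray, patch: int) -> np.ndarray:
    """Split `img` (H, W) into non-overlapping patch x patch tiles and return
    one score per tile. A real patch discriminator learns the scoring
    function; mean intensity is used here purely as a placeholder."""
    h, w = img.shape
    tiles = img[: h - h % patch, : w - w % patch]
    tiles = tiles.reshape(h // patch, patch, w // patch, patch)
    return tiles.mean(axis=(1, 3))  # one realism score per local patch

img = np.arange(16.0).reshape(4, 4)
scores = patch_scores(img, 2)
print(scores.shape)  # (2, 2): each patch is judged separately
```

Because every decision depends only on a local window, the discriminator models the image as a Markov random field of patches rather than judging the image globally.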
a) Pixel-wise and Adversarial Loss: Traditionally, learning-based image restoration works have used an ℓ1 or ℓ2 loss between the ground truth and the rectified image as the chief objective function . When an adversarial framework is used for such a purpose , this loss is pooled with the adversarial loss, which measures how well the generator is performing with respect to fooling the discriminator. However, solely using a pixel-wise loss in deep CNN models leads to overly smooth images, as pixel-wise error functions tend to converge at the mean of all possible solutions in the image manifold whenever they encounter uncertainty . This creates dull images without many sharp edges and, most importantly, with the blur still largely intact at edges and corners. At the same time, solely using the adversarial loss does retain edges and gives rise to a more realistic color distribution . However, it compromises on two things: it still has no abstract idea of structure, and the discriminator judges generator performance based on the output image alone, with no regard to the blurred input. We remove these limitations by leveraging perceptual loss and adding it to the net loss function given in Eqn. 4.
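The perceptual loss, in the standard feature-space form popularized by Johnson et al. (the notation $I^{S}$ for the sharp ground truth and $I^{B}$ for the blurred input is assumed here), reads:

```latex
\mathcal{L}_{perceptual} \;=\; \frac{1}{W H} \sum_{x=1}^{W} \sum_{y=1}^{H}
\Big( \phi\big(I^{S}\big)_{x,y} \;-\; \phi\big(G(I^{B})\big)_{x,y} \Big)^{2}
```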
Here, $W$ and $H$ are the width and height of the feature maps at the chosen ReLU layer of the VGG-16 network , and $\phi$ denotes the forward pass through the VGG-16 network up to that ReLU layer.
We feed two image pairs into the discriminator in our GAN framework. One pair consists of the input blurred image and the corresponding output image generated by the generator, whereas the other pair consists of the input blurred image and the corresponding ground truth sharp image. This converges with the generator modelling the conditional distribution of the latent image given the input image, a result that helps the generated images maintain high statistical consistency between a given input and its output. This is essentially what we need, because we want ‘G’ to maintain the output’s dependency on the blurred input to accommodate different kinds and amounts of shake blur, and to prevent it from swaying too far away in its effort to fool the discriminator. Hence, we can view a conditional GAN as a ‘relevance regularizer’ in an image-to-image network. Mathematically, this would change the original GAN optimization problem used in our task, which would be given by:
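In its standard form (Goodfellow et al.), with $x$ a sample from the data distribution and $z$ the generator's input, the original GAN objective is:

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big]
\;+\;
\mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```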
to a conditional loss function which needs to be minimized, given by
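With the blurred input $I^{B}$ conditioning both terms and $I^{S}$ denoting the sharp ground truth (notation assumed here), the standard conditional form is:

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{I^{B},\, I^{S}}\big[\log D(I^{B}, I^{S})\big]
\;+\;
\mathbb{E}_{I^{B}}\big[\log\big(1 - D(I^{B}, G(I^{B}))\big)\big]
```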
Thus, the combined loss function for our network is,
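One plausible assembly of the components discussed above — the pairing of each weight with each term is an assumption here, not a statement of the exact formula — is:

```latex
\mathcal{L} \;=\; \mathcal{L}_{adv}
\;+\; \lambda_{1}\, \mathcal{L}_{pixel}
\;+\; \lambda_{2}\, \mathcal{L}_{perceptual}
```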
where λ1 and λ2 are hyperparameters, set to 145 and 170 respectively in our experiments. From Table 3, we notice a significant boost in performance across all metrics on introducing this technique. At this stage, our network has already outperformed the two baseline models modified and trained for our task: the very deep, sequential ResNet model used by  and the hourglass U-net model used by . It is worth noting that our dense model with far fewer layers (10 dense blocks) not only outperformed, but also converged faster than, the model in  with 15 residual blocks, showing that our model and the framework work together much better.
We implemented our model using the Torch7 library. All experiments were performed on a workstation with an i7 processor and an NVIDIA GTX Titan X GPU.
Network Parameters: We optimize our loss function through the ADAM scheme and converge it using stochastic gradient descent (SGD). Throughout the experiments, we kept the batch size for training at 3 and fixed the base learning rate and momentum to and respectively. Similar to , we use instance normalization instead of training batch statistics during test time.
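Instance normalization, unlike batch normalization, computes its statistics per sample and per channel over the spatial dimensions only, so no running batch statistics are needed at test time. A minimal numpy sketch (tensor layout `(N, C, H, W)` assumed):

```python
import numpy as np

def instance_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each (sample, channel) plane of x with its own spatial
    mean and variance; eps guards against division by zero."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(1).standard_normal((3, 64, 8, 8))
y = instance_norm(x)

# Every (sample, channel) plane now has ~zero mean and ~unit variance,
# independently of the other samples in the batch.
assert abs(float(y[0, 0].mean())) < 1e-6
assert abs(float(y[0, 0].std()) - 1.0) < 1e-2
```

This per-instance behaviour is why it can be used identically at training and test time, with no dependence on batch composition.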
To train our model, we extracted patches of size from the GoPro dataset and combined them with images sampled randomly from the MS-COCO and ImageNet datasets (resized to ) to generate our training dataset. We then apply non-uniform blurs similar to  to the images sampled from MS-COCO and ImageNet. We also perform data augmentation using translational and rotational flipping, producing a final dataset of 0.5 million training pairs of blurred and sharp images.
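The flip/rotation augmentation above can be sketched with numpy array operations; the particular set of six variants generated here is an illustrative choice, not the paper's exact recipe (the same transform would be applied to the blurred and sharp patch of a pair in lockstep).

```python
import numpy as np

def augment(patch: np.ndarray) -> list:
    """Return flipped and rotated copies of a training patch: the original,
    a horizontal flip, a vertical flip, and three 90-degree rotations."""
    variants = [patch, np.fliplr(patch), np.flipud(patch)]
    variants += [np.rot90(patch, k) for k in (1, 2, 3)]
    return variants

patch = np.arange(9).reshape(3, 3)
variants = augment(patch)
print(len(variants))  # 6 variants from a single patch
```

Applied to every extracted patch pair, a handful of such geometric transforms multiplies the effective dataset size several-fold at negligible cost.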
We have designed a novel, end-to-end conditional GAN-based filter model which performs blind restoration of shaken images. Our results show that our model and framework outperform the state-of-the-art methods for non-uniform deblurring. The fast execution time of our model makes it easily deployable in cameras and photo-editing tools. We show that densely connected convolutional networks can be as effective for image generation as they are for classification.
Shubham Pachori and Shanmuganathan Raman were supported through an ISRO RESPOND grant.
Image-to-image translation with conditional adversarial networks. In IEEE CVPR, 2017.