## 1 Introduction

In digital photography, motion blur is a common and longstanding problem in which the blurring is induced by the relative motion between the camera and the subject [lai2016comparative].
In classical image processing, such a motion blur is generally regarded as a motion kernel being applied on the original sharp image through a linear operation, *e.g.*, convolution.
Often in practice, however, neither the blur kernel nor the original image is known a priori, and thus the task becomes to estimate both from the blurry input image.
In image processing, the term blind deconvolution is commonly used to denote the task of image restoration without explicit knowledge of either the impulse response function, also known as the point-spread function (PSF), or the original sharp image [lai2016comparative, levin2009understanding].
The blurred image is typically formulated as:

$$y = k \ast x + n, \tag{1}$$

where $x$ and $k$ are the unknown original clean image and the blur kernel, respectively, $n$ is the additive measurement noise, generally modeled as additive white Gaussian noise (AWGN) with variance $\sigma^2$, and $\ast$ represents the 2D convolution operator. Hence, the task of blind deconvolution is to estimate a sharp $x$ and the corresponding $k$ from an infinite set of feasible pairs $(x, k)$ using only the blurry image $y$, making it an ill-posed and very challenging problem.

A judicious approach to such problems is to utilize prior knowledge about the statistics of natural images and/or motion kernels. A multitude of algorithms exists to efficiently estimate the image and kernel using prior knowledge of the model [fergus2006removing, levin2006blind, xu2010two]. A majority of them are based on the maximum-a-posteriori (MAP) framework,
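To make the forward model of Eq. (1) concrete, the following sketch simulates a blurred observation $y = k \ast x + n$. All names and values here are illustrative (a random array stands in for the sharp image, a box kernel for a real motion PSF):

```python
import numpy as np

def conv2d_same(img, ker):
    """'Same'-size 2D convolution with zero padding (a minimal stand-in for
    scipy.signal.convolve2d(img, ker, mode='same'))."""
    kh, kw = ker.shape
    padded = np.pad(img, ((kh // 2, kh - 1 - kh // 2), (kw // 2, kw - 1 - kw // 2)))
    flipped = ker[::-1, ::-1]  # convolution = correlation with the flipped kernel
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

rng = np.random.default_rng(0)
x = rng.random((32, 32))          # stand-in for the sharp image x
k = np.full((5, 5), 1.0 / 25.0)   # toy 5x5 box kernel standing in for a motion PSF
sigma = 0.01                      # illustrative noise standard deviation
y = conv2d_same(x, k) + sigma * rng.standard_normal(x.shape)  # y = k * x + n
```

In blind deconvolution, only `y` would be observed; both `x` and `k` must be recovered from it.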

$$\max_{x,\,k}\; p(x, k \mid y) \;\propto\; p(y \mid x, k)\, p(x)\, p(k), \tag{2}$$

where $p(y \mid x, k)$ is the likelihood of the noisy output $y$ given a certain pair $(x, k)$, which corresponds to the data fidelity term, and $p(x)$ and $p(k)$ are the priors of the original image and blur kernel, respectively.
Note that Eq. (2) holds under the assumption that the sharp original image and the blur kernel are independent.
These MAP-based algorithms are often iterative in nature and usually rely on the sparsity-inducing regularizers, either in gradient domain [xu2010two, krishnan2011blind, xu2013unnatural]
or more generally in sparsifying transformation domain
[pan2018deblurring].
However, prior knowledge alone is usually not enough: for instance, Levin *et al.* [levin2009understanding] show that MAP-based methods may converge to the trivial solution of an impulse kernel, returning the noisy input image itself as the output.
By carefully designing an appropriate regularizer and selecting proper step sizes and learning rates, one may recover a sharper image.
These parameters are, however, difficult to determine analytically, as they depend heavily on the noisy input image itself, and thus do not admit any generalization.

Data-driven methods, on the other hand, attempt to determine a non-linear mapping that deblurs the noisy image by learning the appropriate parameter choices for an underlying image dataset using deep neural networks (DNNs) [xu2018motion, chakrabarti2016neural]. Given the training dataset, one can use a DNN either to extract features from the noisy image to estimate the blur kernel [chakrabarti2016neural] or to directly learn the mapping to the sharp image [nah2017deep]. Although these methods achieve substantial performance in certain practical scenarios, they often fail to handle the various complex and large blur kernels encountered in blind deconvolution. Moreover, the structure of such networks is usually determined empirically, and thus they often lack inherent interpretability. Recent works generate attribution-based maps to explain a network's decisions; however, these disregard the untapped potential of the model knowledge [agarwal2019removing].

In order to enjoy the advantages of both model-based iterative algorithms and data-driven learning strategies, one may exploit the idea of deep unfolding [hershey2014deep, gregor2010learning].
In particular, [hershey2014deep] shows that propagation through a neural network is, in fact, *equivalent* to executing the iterative algorithm a finite number of times, and thus the trained network can be naturally interpreted as a parameter-optimized algorithm.
In recent years, deep unfolding networks have gained a significant amount of attention in various branches of signal processing [hershey2014deep, khobahi2019deepsignal, khobahi2019deepradar, bertocchi2019deep, khobahi2019model].
However, in the context of blind image deconvolution, the extent of deep unfolding capabilities remains largely unexplored.
Recently, Li *et al.* [li2019algorithm] performed motion deblurring by unfolding an iterative algorithm that relies on a total-variation (TV) regularization prior in the image gradient domain [perrone2016clearer].
Although this approach performs better than state-of-the-art model-based and data-driven blind deconvolution counterparts, the strict requirement of training a network on a specific dataset makes the algorithm impractical for real-time usage: it requires a ground-truth dataset to begin with.
Additionally, in practice, motion kernels are not known in advance (*e.g.*, in drone image processing), and hence acquiring a labeled dataset is not possible in a supervised learning scenario.

In this paper, we propose a novel technique to unfold an iterative algorithm that estimates the latent clean image and corresponding blur kernel on the fly — a zero-shot self-supervised algorithm. In particular, we use the classical Richardson-Lucy blind deconvolution algorithm [fish1995blind] to construct the network structure and iteratively estimate the clean image and the kernel. We experimentally verify the performance of our algorithm and compare it with [li2019algorithm] and other iterative algorithms and recent neural network approaches.

## 2 Problem Formulation

In this section, we lay the groundwork for our proposed model-aware deep architecture for the problem of blind deconvolution. To this end, we consider an extension of the Richardson-Lucy (RL) algorithm as a baseline to design a deep neural network such that each layer imitates the behavior of one iteration of the RL algorithm.

Generally, the problem of blind deconvolution can be cast as the following optimization problem:

$$\min_{x \ge 0,\, k \ge 0}\;\; \frac{1}{2}\left\| y - k \ast x \right\|_2^2 + \lambda\, \mathrm{TV}(x), \tag{3}$$

where the first term represents the data fidelity term and $\lambda$ is the regularization coefficient for the total variation (TV) regularizer $\mathrm{TV}(x)$ operated on the image $x$. The RL algorithm seeks to recover the sharp image $x$ and the blur kernel $k$ in an iterative manner as described in [fish1995blind]. Starting from an initial guess $x^{(0)}$ for the sharp image and $k^{(0)}$ for the kernel, the update steps for the image and the kernel at the $t$-th iteration are given by

$$x^{(t+1)} = x^{(t)} \odot \left[ \tilde{k}^{(t)} \ast \frac{y}{k^{(t)} \ast x^{(t)}} \right], \tag{4a}$$

$$k^{(t+1)} = k^{(t)} \odot \left[ \tilde{x}^{(t+1)} \ast \frac{y}{k^{(t)} \ast x^{(t+1)}} \right], \tag{4b}$$

where $\odot$ represents the Hadamard (element-wise) product, the division inside the brackets is likewise element-wise, and $\tilde{(\cdot)}$ denotes the flipped version of the vector/matrix argument.
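As an illustrative sketch of the multiplicative updates in Eq. (4a)-(4b), the following implements blind RL with cyclic (FFT-based) convolutions, storing the kernel as an image-sized array. The function names, the flat initializations, and the per-iteration kernel normalization are illustrative choices, not the exact scheme of [fish1995blind]:

```python
import numpy as np

def fft_conv(a, b):
    """Cyclic 2D convolution of two same-sized arrays via the FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

def fft_corr(a, b):
    """Cyclic 2D correlation: convolution of `a` with the flipped `b`."""
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))

def rl_blind(y, n_iter=20, eps=1e-8):
    """Alternate the two multiplicative RL updates (Eq. 4a, 4b), starting
    from flat guesses for the image and the (image-sized) kernel."""
    x = np.full_like(y, y.mean())       # flat image guess
    k = np.full_like(y, 1.0 / y.size)   # flat kernel guess (sums to 1)
    for _ in range(n_iter):
        ratio = y / (fft_conv(k, x) + eps)
        x = x * fft_corr(ratio, k)      # image update, Eq. (4a)
        ratio = y / (fft_conv(k, x) + eps)
        k = k * fft_corr(ratio, x)      # kernel update, Eq. (4b)
        k = k / (k.sum() + eps)         # keep the kernel normalized
    return x, k
```

Because every factor is non-negative, the multiplicative form automatically preserves non-negativity of both estimates.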

## 3 Blind Deconvolution via Deep-URL

In order to obtain a model-aware deep architecture, we slightly over-parameterize the iterations of the RL algorithm (see Eq. (4a)-(4b)) and unfold them onto the layers of a deep neural network, where each layer corresponds to one iteration of the baseline iterative algorithm. Namely, we fix the total computational complexity of the RL algorithm by fixing the total number of iterations, realizing it as a DNN with $L$ layers. Thus, by substituting the kernel in Eq. (4a)-(4b) with trainable parameters $W_1^{(l)}$ and $W_2^{(l)}$, we reformulate each subsequent iterative operation as:

$$x^{(l+1)} = \mathcal{S}\!\left( x^{(l)} \odot \mathcal{R}\!\left[ \tilde{W}_1^{(l)} \ast \frac{y}{W_1^{(l)} \ast x^{(l)}} \right] \right), \tag{5a}$$

$$k^{(l+1)} = \mathcal{S}\!\left( k^{(l)} \odot \mathcal{R}\!\left[ \tilde{W}_2^{(l)} \ast \frac{y}{W_2^{(l)} \ast x^{(l+1)}} \right] \right), \tag{5b}$$

where $W_1^{(l)}$ and $W_2^{(l)}$ are the weights for the $l$-th layer. Furthermore, $\mathcal{S}(\cdot)$ represents the *Sigmoid* activation function and $\mathcal{R}(\cdot)$ denotes the rectified linear unit (ReLU). Note that there exist two implicit constraints on the recovered sharp image and the kernel: (a) both $x$ and $k$ are non-negative, and (b) each element of $x$ and $k$ must meet a range constraint. Hence, in order to ensure constraint (a), each convolution operation is activated by a $\mathcal{R}(\cdot)$ function and, in addition, we use the *Sigmoid* activation after each update step to satisfy constraint (b).
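A minimal numpy sketch of one unfolded layer in the spirit of Eq. (5a)-(5b) follows; the exact placement of the trainable weights `w_x`, `w_k` and of the activations is an assumption for illustration, and cyclic FFT convolutions stand in for the paper's convolution operator:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def fft_conv(a, b):
    """Cyclic 2D convolution of two same-sized arrays via the FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

def fft_corr(a, b):
    """Cyclic 2D correlation, i.e. convolution with the flipped `b`."""
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))

def deep_url_layer(x, k, y, w_x, w_k, eps=1e-8):
    """One unfolded layer: the RL correction terms, now driven by trainable
    weights w_x and w_k, pass through a ReLU (non-negativity), and a Sigmoid
    keeps each updated estimate within a bounded range."""
    x_new = sigmoid(x * relu(fft_corr(y / (fft_conv(w_x, x) + eps), w_x)))
    k_new = sigmoid(k * relu(fft_corr(y / (fft_conv(w_k, x_new) + eps), w_k)))
    return x_new, k_new
```

Stacking $L$ such layers, each with its own `w_x`, `w_k`, yields the forward pass of the unfolded network.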

Let $\Theta^{(l)} = \{ W_1^{(l)}, W_2^{(l)} \}$ denote the set of trainable parameters of layer $l$, and let $\Theta = \{ \Theta^{(l)} \}_{l=1}^{L}$. Using the iterative updates from Eq. (5a)-(5b), we formulate the training of our proposed model-aware deep network, the Deep Unfolded Richardson-Lucy (Deep-URL) architecture, as follows:

$$\min_{\Theta}\;\; \mathcal{L}\!\left( y,\; k^{(L)} \ast x^{(L)} \right), \tag{6}$$

where the loss function $\mathcal{L}(\cdot,\cdot)$ is the negative of the structural similarity index (SSIM) [wang2004image] between the true blurred image $y$ and the reconstructed blurred image $k^{(L)} \ast x^{(L)}$. It is worth mentioning that the proposed deep architecture, in conjunction with the proposed learning method, manifests itself as a self-supervised learning process, where the degraded image $y$ is the only information used for estimating the sharp image $x$ and the blur kernel $k$.
Fig. 1 illustrates the proposed Deep-URL architecture and the training process.
Finally, Algorithm 1 summarizes the joint optimization process for updating $\Theta$.
Note that, *once the self-supervised model is optimized for a given blur kernel, the learned weights can be directly used for deblurring any image blurred with the same kernel.*
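For illustration, the negative-SSIM loss of Eq. (6) can be sketched with global (single-window) statistics; this is a simplification of the locally windowed index of [wang2004image], which averages this quantity over patches:

```python
import numpy as np

def neg_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Negative SSIM between images a and b with values in [0, 1], computed
    from global statistics; c1 and c2 are the usual stabilizing constants."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
    return -ssim  # minimized when the two images match structurally
```

The loss attains its minimum of $-1$ when the reconstructed blurred image matches the observation exactly, which is what makes the purely self-supervised objective of Eq. (6) possible.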

## 4 Experiments

In this section, we investigate the performance of the proposed Deep-URL framework and compare it with several other state-of-the-art methods in the context of blind deconvolution. First, we compare the performance of Deep-URL with the baseline RL algorithm using the standard MNIST handwritten digit dataset [lecun1998gradient]. Second, we use the Levin dataset [levin2009understanding] to compare Deep-URL with existing iterative and deep learning-based blind deconvolution methods proposed in [li2019algorithm, chakrabarti2016neural, nah2017deep].

Optimization setup. The training of Deep-URL (Eq. (6)) is carried out using the RMSprop optimizer together with an adaptive learning rate scheme, in which the initial learning rate is decayed by a fixed factor upon reaching preset fractions of the total number of epochs. In addition, the TV regularization coefficient $\lambda$ was kept fixed across all experiments. All trainable parameters were initialized using a uniform distribution. We performed a batch-wise optimization on images blurred using the same kernel to enhance the performance of Deep-URL.

Evaluation metrics. Inspired by [li2019algorithm], we use the following metrics to evaluate the performance of our proposed method: (1) structural similarity index (SSIM), (2) peak signal-to-noise ratio (PSNR), (3) improvement in signal-to-noise ratio (ISNR) for the quality of the reconstructed image $x$, and (4) root-mean-square error (RMSE) for comparing the recovered blur kernel with the original $k$. In the sequel, we use the terms PSF and blur kernel interchangeably.

Table 1: Average performance of the baseline RL algorithm and Deep-URL (D-URL) on MNIST; the two column pairs correspond to two settings of the number of layers/iterations.

| Metrics | RL | D-URL | RL | D-URL |
| --- | --- | --- | --- | --- |
| PSNR (dB) | 10.3919 | 18.2821 | 10.4742 | 19.7075 |
| ISNR (dB) | 0.0651 | 7.9554 | 0.0764 | 9.3096 |
| SSIM | 0.4453 | 0.7669 | 0.4484 | 0.8206 |
| RMSE (1e-3) | 38.54 | 4.396 | 38.07 | 4.399 |
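The image-quality metrics used above can be sketched as follows (assuming images normalized to a peak value of 1; `isnr` here measures the PSNR gain of the restoration over the blurry input):

```python
import numpy as np

def psnr(x_true, x_hat, peak=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((x_true - x_hat) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def isnr(x_true, y, x_hat):
    """Improvement in SNR: the restoration's PSNR gain over the blurry
    observation y, both measured against the ground truth x_true."""
    return psnr(x_true, x_hat) - psnr(x_true, y)

def rmse(k_true, k_hat):
    """Root-mean-square error between the true and recovered kernels."""
    return np.sqrt(np.mean((k_true - k_hat) ** 2))
```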

MNIST dataset results.
For this experiment, we consider the well-known MNIST dataset. We randomly draw 1000 sample images from the MNIST training dataset and use the same motion kernels provided by [levin2009understanding].
In particular, we convolve each original image with a randomly chosen kernel from this set to generate the corresponding degraded image.
Table 1 reports the performance of the proposed Deep-URL framework and of the original RL algorithm run for the same number of iterations as the number of Deep-URL layers $L$.
It is evident from Table 1 that the proposed method significantly outperforms the baseline RL algorithm across all evaluation metrics.
Interestingly, Deep-URL achieves better performance in terms of recovering both the original image and the PSF even with only a few layers, which is presumably due to the hybrid model-based and data-driven nature of the proposed method.
Moreover, Deep-URL attains a very high average ISNR value for the recovered image, roughly 121× higher than that of the original RL algorithm. Note that the RMSE between the original and the reconstructed PSF is roughly 8.55× smaller for the proposed method than for the RL algorithm. Comparing the evaluation performance of Deep-URL across the two settings of $L$, it is evident that increasing the number of layers yields a much larger gain across all evaluation metrics than it does for the baseline RL algorithm.
Finally, from Fig. 2, we found that the classical RL algorithm is sensitive to the number of iterations, and its performance fluctuates on a random set of 100 MNIST images.
In contrast, the performance of Deep-URL consistently increases as we increase the number of iterations, *i.e.*, the number of layers.

Levin dataset results. For this experiment, we use the dataset provided by [levin2009understanding], a widely used benchmark in several deblurring works [li2019algorithm, krishnan2011blind, xu2010two]. It comprises 4 grayscale images and 8 motion blur kernels, for a total of 32 motion-blurred images. Table 2 summarizes the performance of Deep-URL in comparison with the baseline RL algorithm as well as the methodologies proposed in [chakrabarti2016neural], [nah2017deep] and [li2019algorithm] on the same dataset. It can be observed from Table 2 that Deep-URL significantly outperforms the baseline RL algorithm across all image and kernel evaluation metrics. In contrast to the other methods, which involve a priori learning on training images, Deep-URL is a self-deblurring framework, yet it performs on par (PSNR) or better (ISNR and SSIM) on the image quality evaluation metrics. Interestingly, a roughly 1.8× increase in ISNR can be observed for Deep-URL when compared to [li2019algorithm]. Regarding the reconstructed blur kernel, we found that most pixels did not converge to exactly zero, and hence a higher RMSE was obtained when reconstructing the motion kernel blindly. From Fig. 3, we observe that Deep-URL reconstructs smoother images with fewer artifacts compared to other state-of-the-art methods.

Table 2: Performance comparison on the Levin dataset; the two D-URL columns correspond to a shallower and a deeper network, respectively.

| Metrics | [li2019algorithm] | [nah2017deep] | [chakrabarti2016neural] | RL | D-URL | D-URL |
| --- | --- | --- | --- | --- | --- | --- |
| PSNR (dB) | 27.15 | 24.51 | 23.18 | 19.42 | 24.85 | 27.12 |
| ISNR (dB) | 3.79 | 1.35 | 0.02 | -2.98 | 5.36 | 6.95 |
| SSIM | 0.88 | 0.81 | 0.81 | 0.53 | 0.89 | 0.91 |
| RMSE (1e-3) | 3.87 | - | - | 10.10 | 8.08 | 7.10 |

## 5 Conclusion

In this work, we considered the problem of blind deconvolution and proposed the Deep-URL framework, a model-aware deep blind deconvolution architecture obtained by unfolding the Richardson-Lucy algorithm (Sec. 3). Quantitative and qualitative evaluations (Sec. 4) show that Deep-URL achieves superior performance compared to both its baseline RL algorithm and several existing blind deconvolution techniques. In contrast to other MAP-based frameworks, Deep-URL does not exhibit convergence to the trivial solution of an impulse-like kernel.
