Planar Geometry and Latent Scene Recovery from a Single Motion Blurred Image

04/07/2019
by   Kuldeep Purohit, et al.

Existing works on motion deblurring either ignore the effects of depth-dependent blur or assume a multi-layered scene wherein each layer is modeled as a fronto-parallel plane. In this work, we consider the case of 3D scenes with piecewise planar structure, i.e., a scene that can be modeled as a combination of multiple planes with arbitrary orientations. We first propose an approach for estimating the normal of a planar scene from a single motion blurred observation. We then develop an algorithm for automatic recovery of the number of planes, the parameters corresponding to each plane, and the camera motion from a single motion blurred image of a multiplanar 3D scene. Finally, we propose a first-of-its-kind approach to recover the planar geometry and latent image of the scene by adopting an alternating minimization framework built on our findings. Experiments on synthetic and real data reveal that our proposed method achieves state-of-the-art results.


1 Introduction

Recovery of 3D structure from images is an extensively researched area in computer vision. Algorithms for scene geometry recovery find applications in visual servoing, video conferencing, tracking, active vision, augmented reality, etc. Well-known cues for depth recovery include disparity [1], optical flow [2], texture [3, 4], shading [5], defocus blur [6], and motion blur [7, 8, 9]. While depth estimation has been of general interest, some works in the literature target the case of inferring piecewise planar geometry (Manhattan model). This was primarily motivated by the fact that the world around us can, in many cases, be modeled as being piecewise planar. Estimating 3D geometry in terms of planar parameters has significant advantages, including reduced computational complexity and robustness to pixel-level errors in depth cues.

Many works in the literature specifically address the task of inferring planar scene geometry from a single image. To recover the surface orientation, foreshortening of texture was used as a cue in [4], whereas [3] used local variations of spatial frequencies. The orientation of text planes was estimated using perspective geometry in [10]. The work in [11] showed that higher-order correlations in the frequency domain caused by the projection of a planar texture are proportional to the orientation of the plane. [12] proposed a method to determine the surface normal using projective geometry and spectral analysis. While all the above methods work under the assumption of clean images, very few works attempt to make use of cues from degradations (in the form of blur) to estimate the plane normal. In [13], optical blur is used as a cue to estimate the planar orientation from a single image. The works in [14], [15] utilize motion blur to infer the surface normal of the scene from a single motion blurred image, but assume in-plane translational camera motion.

There have been a few attempts to estimate the complete 3D structure of a scene from a single image using learning-based approaches. The work in [16] used a Markov Random Field trained via supervised learning to infer a set of plane parameters associated with the scene. [17] proposed an approach to identify multiple distinct planes and estimate their orientation from a single image of an outdoor urban scene by learning the relationship between appearance and structure from a large set of labeled examples.

Lately, convolutional neural networks (CNNs) are being increasingly used to address the ill-posedness of single image depth estimation. They are trained on specific datasets formed with the help of multi-view images or depth sensors [18, 19, 20] to predict a depth map from a single image. However, the performance of these methods degrades on general test images that differ from the labeled data available during training. Moreover, accurate depth estimation becomes a challenge in the presence of motion blur, since the fine-level depth cues get subdued.

Motion blurred images have attracted increased attention in research, owing to the ubiquity of mobile phones and hand-held imaging devices. Recent years have witnessed significant progress in single image motion deblurring. While the standard blind deblurring algorithms such as [21, 22, 23, 24, 25] consider the motion blur to be uniform across the image, various methods have been proposed to handle blur variations due to camera rotational motion [26, 27, 28, 29, 30] and scene depth variations [31, 32]. However, none of the existing approaches address multi-planar inclined scenes.

Although the problems of blur estimation and depth estimation are individually quite challenging, a few attempts have been made in the literature to tackle the two jointly. Among the existing methods on motion deblurring, the ones closest to ours are [31] and [32]. The work in [31] proposes to jointly estimate depth and non-uniform blur from a single blurred image, but is designed to handle only piecewise fronto-parallel planar scenes. [32] was designed to remove the motion blur effects caused in underwater imaging by modeling the blur via a virtual depth map characterized using a single exponential function. While a few other works on depth-aware motion deblurring such as [33, 34, 35] have also been proposed, they rely on multiple observations.

Recently, various learning based methods have been proposed to remove heterogeneous blur from a single blurred image. [36] trained a CNN to predict a probability distribution of motion blur at the patch level. To recover the latent image, [37] estimated a dense motion flow with a fully convolutional neural network. [38] used adversarial training to learn blur-invariant features to perform motion deblurring. End-to-end trainable multi-scale CNN models are proposed in [39, 40] to restore the latent image directly. Although the above methods attempt to solve the deblurring problem in more generic settings, their performance depends purely on the training data and the learning capability of the underlying networks. While the learning based models have been shown to handle a few types of heterogeneous blur, their performance on generic blurred images is not guaranteed. At the same time, the performance of some of these methods on standard datasets such as [41] reveals that conventional methods still outperform the learning based approaches when it comes to specific image formation models.

In this paper, we not only extend our previous work in [14] but also make several other contributions. First, we show how the approach in [14] can be modified to account for camera motions involving rotations as well. We then develop a fully automatic, first-of-its-kind approach to recover the number of planes, the parameters corresponding to each plane, and the camera motion from a single motion blurred image of a scene with multiple planes. These results are then used to pose an alternating minimization problem to recover the complete scene geometry as well as the latent image of the scene. On motion blurred images, our depth estimates are more accurate than those of learning based approaches [19, 20] due to the cues present in the blur kernels and the additional constraints present in the motion deblurring algorithm. In this paper, we relax the majority of the constraints enforced in previous works. Unlike [14], which handles the case of in-plane translational motion of the camera alone, our proposed approach for normal estimation can handle more general kinds of camera motion. In addition, we also tackle the case of multi-planar scenes and propose a novel formulation for deblurring such scenes. Unlike [31], which requires user interaction and relies on a piecewise fronto-parallel planar assumption, our approach is fully automatic and uses only a piecewise planar representation of the scene.

The key contributions of our work are summarized below:

  • This is the first work in literature to perform surface normal estimation from general motion blur present in a single image.

  • We develop a fully-automatic algorithm to estimate the number of planes, parameters corresponding to each plane, and the camera motion from a single motion blurred image.

  • We propose an elegant alternating minimization approach to jointly estimate the scene geometry and latent image from a single motion blurred image. Our proposed approach is able to deliver state-of-the-art results on single image depth-aware deblurring.

The remainder of this paper is organized as follows. In Section 2, we introduce the motion blur image formation model for scenes which can be approximated by a single plane. Our proposed approach for normal estimation directly from the blur kernels is introduced in Section 3. The image formation model for multi-planar scenes and the extension of the proposed normal estimation method to such scenes are described in Section 4. In Section 5, as an application of our findings, we use the estimated normals to perform blind deblurring of a scene containing multiple inclined planes. This is followed by experimental results in Section 6.

2 Motion blur model for planar scenes

In this section, we introduce the image formation model corresponding to a single planar scene. If the scene is planar, the blurred image $g$ can be expressed as the aggregation of warped instances of the sharp image $f$ as [26, 42]

$$g(\mathbf{x}) = \frac{1}{T_e} \int_{0}^{T_e} f\big(H_t(\mathbf{x})\big)\, dt \tag{1}$$

where $H_t$ refers to the homography (at time instant $t$) which defines the geometric transformation between the latent image and the warped image that gets projected onto the image plane corresponding to the camera pose at $t$, and $T_e$ is the exposure time during which the camera sensor is exposed to light from the scene. In the remainder of the paper, we will use the following discrete equivalent model of Eq. (1) [26, 43]

$$g(\mathbf{x}) = \sum_{k \in \mathbb{T}} \omega(k)\, f\big(H_k(\mathbf{x})\big) \tag{2}$$

where $g$ is the blurred image, $f$ is the latent image, $H_k$ is the homography of camera pose $k$, and $\omega$ is the transformation spread function (TSF) defined over a discrete camera pose space $\mathbb{T}$. The TSF is a $|\mathbb{T}| \times 1$ vector (where $|\mathbb{T}|$ denotes the cardinality of $\mathbb{T}$) with positive values for those poses over which the camera has moved and zeros for all other poses. The value of $\omega(k)$ represents the fraction of the total exposure duration during which the camera remained at pose $k$. By this definition, it is straightforward to see that $\omega$ encodes the camera motion information in a compact form.
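The discrete model in Eq. (2) amounts to a weighted sum of homography-warped copies of the latent image. The following is a minimal sketch of that idea, assuming NumPy, nearest-neighbour sampling, and pixel coordinates with the origin at the top-left corner; the function names `warp_nearest` and `blur_with_tsf` are illustrative and do not come from the paper.

```python
import numpy as np

def warp_nearest(image, H):
    """Sample `image` at H(x) for every output pixel x (nearest neighbour, zero outside)."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # homogeneous pixel coordinates
    q = H @ pts
    qx = np.rint(q[0] / q[2]).astype(int)
    qy = np.rint(q[1] / q[2]).astype(int)
    valid = (qx >= 0) & (qx < w) & (qy >= 0) & (qy < h)
    out = np.zeros(h * w)
    out[valid] = image[qy[valid], qx[valid]]
    return out.reshape(h, w)

def blur_with_tsf(image, homographies, weights):
    """Weighted sum of warped images; weights play the role of the TSF entries."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # TSF entries are fractions of the exposure time
    return sum(w * warp_nearest(image, H) for w, H in zip(weights, homographies))
```

With two poses of equal weight, an identity and a 2-pixel horizontal shift, a single bright pixel is spread into two half-intensity pixels, which is the point-spread behaviour the TSF is meant to capture.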

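The works in [33, 43] have shown that the blurred image $g$ can be related to the latent image $f$ through a space-varying convolution as

$$g(\mathbf{x}) = f *_v h(\mathbf{x}, \mathbf{u}) \tag{3}$$

where $h(\mathbf{x}, \mathbf{u})$ denotes the PSF at the pixel position $\mathbf{x}$, which is also a function of the spatial coordinates defined by $\mathbf{u}$, and $*_v$ denotes the space-varying blurring operation, indicating that $h$ will vary when $\mathbf{x}$ changes. The PSF value can be expressed in terms of the TSF $\omega$ as [43]

$$h(\mathbf{x}, \mathbf{u}) = \sum_{k \in \mathbb{T}} \omega(k)\, \delta\big(\mathbf{u} - (\mathbf{x}_k - \mathbf{x})\big) \tag{4}$$

where $\delta$ is the 2D Dirac delta function, and $\mathbf{x}_k$ denotes the transformed image coordinates obtained by applying the homography $H_k$ of pose $k$ at pixel position $\mathbf{x}$. Note that the PSF represents the displacements undergone by an image point due to the underlying camera motion, and hence the relation in Eq. (4) encodes the motion of scene points in the image plane when the camera moves according to the path defined by $\omega$.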
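The relation between the TSF and a local PSF can be illustrated by accumulating each pose weight at the displacement that pose induces at a given pixel. This is a sketch under assumptions: displacements are rounded to the nearest pixel, the kernel size is a hypothetical parameter, and `psf_from_tsf` is an illustrative name.

```python
import numpy as np

def psf_from_tsf(x, y, homographies, weights, size=31):
    """Build the PSF at pixel (x, y): one weighted impulse per camera pose."""
    half = size // 2
    psf = np.zeros((size, size))
    for H, w in zip(homographies, weights):
        p = H @ np.array([x, y, 1.0])
        dx = int(np.rint(p[0] / p[2] - x))  # horizontal displacement in pixels
        dy = int(np.rint(p[1] / p[2] - y))  # vertical displacement in pixels
        if abs(dx) <= half and abs(dy) <= half:
            psf[half + dy, half + dx] += w  # displacements outside the window are dropped
    return psf
```

Because the displacement depends on the pixel position through the homography, calling this function at different locations generally yields different kernels, which is the space-varying behaviour described above.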

Figure 1: Camera setup.

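We can express the homography $H_k$ of camera pose $k$ as a function of the camera pose and plane parameters as

$$H_k = K \left( R_k + \frac{1}{d}\, T_k\, \mathbf{n}^T \right) K^{-1} \tag{5}$$

where $R_k$ and $T_k$ are the rotation and translation matrices corresponding to the camera pose $k$, $\mathbf{n}$ is the normal of the underlying planar scene, $d$ is the perpendicular distance between the camera center and the scene plane, and $K = \mathrm{diag}(\nu, \nu, 1)$ is the camera intrinsic matrix, with $\nu$ being the focal length of the camera.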
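As an illustration of how a plane-induced homography can be assembled from a camera pose and plane parameters, here is a minimal sketch. It assumes the common form K (R + T nᵀ / d) K⁻¹ with K = diag(focal, focal, 1); the sign convention and the helper names `planar_homography` and `rot_z` are assumptions for illustration and may differ from the paper's exact definition.

```python
import numpy as np

def rot_z(theta):
    """In-plane rotation (about the optical axis) by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def planar_homography(R, T, n, d, focal):
    """Assumed form: H = K (R + T n^T / d) K^{-1}, normalised so H[2, 2] == 1."""
    K = np.diag([focal, focal, 1.0])
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)                 # unit plane normal
    T = np.asarray(T, dtype=float).reshape(3, 1)
    H = K @ (R + (T @ n.reshape(1, 3)) / d) @ np.linalg.inv(K)
    return H / H[2, 2]
```

For a fronto-parallel plane and a pure in-plane translation, the result reduces to a uniform pixel shift of `focal * t / d`, which matches the observation that such scenes produce space-invariant blur.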

Next we discuss our proposed approach to estimate the normal of a planar scene by making use of the blur kernels extracted from different locations in a single motion blurred image.

3 Normal Estimation from Blur Kernels

This section describes our approach, wherein we employ PSFs extracted from various locations in a motion blurred image to estimate the surface normal of the underlying scene. Fig. 2 shows blur kernels induced at four different points in the image plane for two different cases: a fronto-parallel planar scene, and an inclined plane with surface normal . For both cases, to generate the blur kernels, we have used the same camera motion, which involves only in-plane translations. Clearly, for the fronto-parallel planar scene, all the blur kernels are identical, since all the scene points are at the same depth and the camera motion contains only in-plane translations. However, for the inclined planar scene, the size of the blur kernel varies with the scene depth, indicating that the blur kernels themselves carry a cue about the surface normal of the underlying scene. This is the key observation which motivated us to formulate a technique that determines the surface normal using the pixel-shift information contained in the PSFs from different locations in a blurred image.

Figure 2: Blur kernels at 4 different locations on an image blurred due to in-plane translational camera motion. First row corresponds to fronto-parallel scene and the second row corresponds to an inclined planar scene with .
Figure 3: Blur kernels at 4 corner locations of an image blurred due to camera trajectory involving translations and rotations. The first row corresponds to a fronto-parallel scene and second row corresponds to an inclined planar scene with .

In general, the homography (in Eq. (5)) is a function of six-dimensional (6D) camera motion (3D rotations and 3D translations). However, recent works [26, 41] have shown that the effect of camera motion encountered in practice can be well approximated using in-plane rotations and translations, thereby reducing the space of camera motion from 6D to 3D without compromising the validity of the image formation model. Hence, we too adopt this approximation to reduce the ill-posedness of the problems that we address. Thus, we use a homography parameterized by translations along the $x$-axis ($t_x$) and $y$-axis ($t_y$), and rotation about the $z$-axis ($\theta_z$). The expression for the homography in Eq. (5) can therefore be simplified to the following form

(6)

Furthermore, a recent work [29] has shown that, for typical handshakes, the blur induced by in-plane rotations of the camera can be modeled very well with a small-angle approximation (i.e., $\cos\theta_z \approx 1$ and $\sin\theta_z \approx \theta_z$ for the in-plane rotation angle $\theta_z$). Our proposed solution for normal estimation exploits the linearization afforded by this small-angle model of rotational motion. Thus, for the case of general camera motion, the overall homography matrix can be simplified to the following form.

(7)

For the case of an inclined scene with orientation $\mathbf{n}$, consider a single camera pose that is involved in the formation of the PSF at a given position. This camera pose shifts the intensity at pixel location $(x, y)$ to a new location $(x', y')$, which can be determined as

(8)

Eq. (8) implies that the pixel shifts are no longer constant, but vary as a function of the spatial coordinates $x$ and $y$. This, in turn, leads to variation in the blur kernels as well.

The linearity of the relationship between the pixel shifts and the surface normal becomes more apparent if the quantity being considered is the difference between the shifts caused by two different camera poses, as in the following relation

(9)

where , , , , and . The relation in Eq. (9) can be rearranged to obtain a linear relation between the unknown $\mathbf{n}$ and the pixel shifts along the $x$ and $y$ directions induced at a location $\mathbf{x}$ as

(10)
(11)

As can be deduced from Eq. (10) and Eq. (11), unlike the case for pure in-plane translations, the PSFs induced by general camera motion will be spatially varying even for the case of fronto-parallel planar scenes. This is illustrated in Fig. 3, where we show 4 blur kernels corresponding to different positions in the image plane for a fronto-parallel planar scene (first row) as well as an inclined plane (second row). As is evident, the blur kernels are no longer spatially invariant for the case of fronto-parallel scene. However, by comparing the kernels from first and second row, it can be observed that the variation induced by the translational camera motion in each blur kernel still carries information about the surface normal.
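The chain from the plane-induced homography to the linear relations can be sketched as follows. This is a reconstruction under assumed notation and sign conventions, not the paper's typeset equations: the homography is taken as $K(R + \frac{1}{d} T \mathbf{n}^T)K^{-1}$ with $K = \mathrm{diag}(\nu, \nu, 1)$, the pose is $(t_x, t_y, \theta_z)$ with no out-of-plane translation, the normal is $\mathbf{n} = (n_x, n_y, n_z)^T$, and $\Delta$ denotes the difference of a quantity between two camera poses.

```latex
% Assumed model; small-angle approximation cos(theta_z) ~ 1, sin(theta_z) ~ theta_z.
\begin{align*}
H &\approx
\begin{bmatrix}
1 + \frac{t_x n_x}{d} & -\theta_z + \frac{t_x n_y}{d} & \frac{\nu\, t_x n_z}{d} \\[2pt]
\theta_z + \frac{t_y n_x}{d} & 1 + \frac{t_y n_y}{d} & \frac{\nu\, t_y n_z}{d} \\[2pt]
0 & 0 & 1
\end{bmatrix} \\
x' - x &= \frac{t_x}{d}\,\big(n_x x + n_y y + \nu\, n_z\big) - \theta_z\, y \\
y' - y &= \frac{t_y}{d}\,\big(n_x x + n_y y + \nu\, n_z\big) + \theta_z\, x \\
\Delta x &=
\begin{bmatrix} x & y & 1 \end{bmatrix}
\begin{bmatrix}
\frac{\Delta t_x\, n_x}{d} \\[2pt]
\frac{\Delta t_x\, n_y}{d} - \Delta\theta_z \\[2pt]
\frac{\nu\, \Delta t_x\, n_z}{d}
\end{bmatrix},
\qquad
\Delta y =
\begin{bmatrix} x & y & 1 \end{bmatrix}
\begin{bmatrix}
\frac{\Delta t_y\, n_x}{d} + \Delta\theta_z \\[2pt]
\frac{\Delta t_y\, n_y}{d} \\[2pt]
\frac{\nu\, \Delta t_y\, n_z}{d}
\end{bmatrix}
\end{align*}
```

Under these assumptions, the first and third entries of the $\Delta x$ coefficient vector and the second and third entries of the $\Delta y$ coefficient vector are free of $\Delta\theta_z$. Their ratios therefore give $n_x / (\nu\, n_z)$ and $n_y / (\nu\, n_z)$, which is consistent with the observation that the two relations have to be used together once rotation is present.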

Although the presence of the rotational term preempts the recovery of the surface normal directly from the quantities in the right-most column vector of Eq. (10) alone, we can utilize the information from both Eq. (10) and Eq. (11) together to overcome this issue. We treat the entries of the right-most column vectors in Eq. (10) and Eq. (11) as two unknown parameter vectors. Similar to the case of in-plane translations, we can collect pixel shifts from multiple locations in the image to form an overdetermined set of linear equations in these unknowns as follows.

(12)
(13)

We use the difference between the extreme points of the locally estimated PSFs to compute the pixel-shift differences along the two image axes. By making use of multiple PSFs computed from different locations in the blurred image, we can solve for the two parameter vectors using least-squares error minimization. From these estimates, and with the help of the relations in Eq. (10) and Eq. (11), the estimated parameters and the components of the normal are related as

(14)
(15)

Hence, we can obtain the components of the surface normal (up to a scale factor ambiguity) as follows

(16)

The common scale factor can be removed by enforcing unit norm constraint to yield the final estimate of normal. Note that the normal estimate obtained in this way not only handles practically occurring camera motion, but also provides a normal estimate with minimal correspondence requirements. A minimum of 3 correspondences is sufficient to obtain the normal estimate by solving Eqs. (12)-(13).
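The least-squares step can be sketched in a few lines. The sketch assumes an affine shift model, `dx = a1*x + a2*y + a3` and `dy = b1*x + b2*y + b3`, in which `a1`, `a3`, `b2` and `b3` are unaffected by rotation and satisfy `a1/a3 = n_x / (focal * n_z)` and `b2/b3 = n_y / (focal * n_z)`. This parameterization, the requirement that both translation differences be nonzero, and the name `estimate_normal` are assumptions for illustration rather than the paper's exact equations.

```python
import numpy as np

def estimate_normal(points, dx, dy, focal):
    """Least-squares plane normal from pixel-shift differences at several locations.

    points: (M, 2) array of (x, y) relative to the principal point, M >= 3, not collinear.
    dx, dy: shift differences between the two extreme camera poses at each point.
    """
    pts = np.asarray(points, dtype=float)
    A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    a, *_ = np.linalg.lstsq(A, np.asarray(dx, dtype=float), rcond=None)
    b, *_ = np.linalg.lstsq(A, np.asarray(dy, dtype=float), rcond=None)
    # Use only the rotation-free coefficients (assumed model, see lead-in).
    n = np.array([focal * a[0] / a[2], focal * b[1] / b[2], 1.0])
    return n / np.linalg.norm(n)  # remove the common scale with a unit-norm constraint
```

Three non-collinear locations already determine the six coefficients; additional locations make the system overdetermined and average out noise in the PSF end points.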

4 Multi-planar-scenes

Until now, we have been discussing scenes which can be modeled using a single inclined plane. However, natural scenes often comprise multiple inclined planes. In this section, we first introduce the image formation model of multi-planar scenes, followed by a discussion of our approach for normal estimation of all the planes in the scene from a single motion blurred image. We express the blurred image as the summation of disjoint regions from a set of blurred images as

(17)

where (for i=1,..,N) refers to a set of disjoint masks which define the spatial support of different planes in the blurry image. Each blurred image corresponds to a unique plane in the scene, which can be related to the underlying latent image as

(18)

where refers to the homography (at time instant ) which defines the transformation between plane in latent image and the corresponding region in the warped image. We can express as a function of the camera pose (rotation and translation ) and plane parameters (normal and perpendicular distance ) as

(19)

Thus the homographies corresponding to different planes in the scene are governed by the same set of camera pose vectors but different set of plane parameters.

Next, we propose an approach to automatically estimate the number of planes and the associated normals from a single motion blurred image. Our approach proceeds as follows. We estimate PSFs corresponding to overlapping patches at various positions in the input image. Note that our method for normal estimation (discussed in Section 3) requires only 3 PSFs to estimate the normal corresponding to a single plane. We apply RANSAC [44] in an iterative fashion to identify the PSFs, and hence the patches, corresponding to individual planes. To identify the first plane in the scene, we use RANSAC as follows. We first compute a normal from 3 randomly chosen PSFs. Next, we replace one PSF among the three with another from the whole collection and estimate the normal again. If the newly estimated normal is close to the one obtained from the first three, we declare the new PSF an inlier. We repeat this process to identify all the inlier PSFs in the entire collection. If the number of inliers exceeds a specified threshold, we declare the entire collection of inlier PSFs as corresponding to a single normal. This process is then repeated with the remaining PSFs (obtained by removing the inlier PSFs from the whole collection) to identify other normals in the scene, and continues until the remaining collection of PSFs no longer contains sufficient inliers that agree on a single normal. At that point, we obtain the number of planes in the scene and the corresponding inliers. At the end of RANSAC, the normal estimate for each plane is refined by solving for a single normal using all the inliers corresponding to it. The outlier PSFs which remain after RANSAC are deemed to correspond to noisy PSF estimates.
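The sequential RANSAC procedure described above can be sketched as follows. Each PSF is summarized as a row `(x, y, dx, dy)`, and the normal fit reuses an assumed affine shift model in which the rotation-free coefficients give the normal. The angular threshold, minimum inlier count, trial count, and all function names are hypothetical choices for illustration, not values from the paper.

```python
import numpy as np

def fit_normal(obs, focal):
    """Unit normal from rows (x, y, dx, dy) under an assumed affine shift model."""
    A = np.column_stack([obs[:, 0], obs[:, 1], np.ones(len(obs))])
    a = np.linalg.lstsq(A, obs[:, 2], rcond=None)[0]
    b = np.linalg.lstsq(A, obs[:, 3], rcond=None)[0]
    with np.errstate(divide="ignore", invalid="ignore"):
        n = np.array([focal * a[0] / a[2], focal * b[1] / b[2], 1.0])
        return n / np.linalg.norm(n)

def angle_deg(n1, n2):
    """Angle between two unit normals in degrees, ignoring sign."""
    return np.degrees(np.arccos(np.clip(abs(n1 @ n2), 0.0, 1.0)))

def sequential_ransac(obs, focal, max_angle=5.0, min_inliers=4, trials=100, seed=0):
    """Peel off one plane at a time; returns [(normal, inlier_indices)], leftover indices."""
    rng = np.random.default_rng(seed)
    remaining = np.arange(len(obs))
    planes = []
    while len(remaining) >= max(3, min_inliers):
        best = None
        for _ in range(trials):
            sample = rng.choice(remaining, size=3, replace=False)
            n0 = fit_normal(obs[sample], focal)
            if not np.all(np.isfinite(n0)):
                continue  # degenerate (e.g. collinear) sample
            inliers = list(sample)
            for i in remaining:
                if i in sample:
                    continue
                # Swap one sample member for candidate i and re-estimate the normal.
                n1 = fit_normal(obs[np.array([sample[0], sample[1], i])], focal)
                if np.all(np.isfinite(n1)) and angle_deg(n0, n1) <= max_angle:
                    inliers.append(i)
            if best is None or len(inliers) > len(best):
                best = inliers
        if best is None or len(best) < min_inliers:
            break  # no further plane has enough support
        best = np.array(best)
        planes.append((fit_normal(obs[best], focal), best))  # refine on all inliers
        remaining = np.setdiff1d(remaining, best)
    return planes, remaining
```

The number of recovered planes is simply `len(planes)`, and the leftover indices correspond to PSFs treated as noisy estimates.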

In the following section, we will explore the potential of our proposed approach for normal estimation to recover the complete scene geometry as well as the underlying latent image.

5 Multi-Planar Motion deblurring

In this section, we introduce our approach for recovery of complete scene geometry and restoration of the latent image assuming the availability of a single motion blurred image of a multi-planar scene. As discussed in Section 4, from a single motion blurred image, we can recover the normals corresponding to all the planes. However, to recover the latent image, we need to solve for the remaining unknown variables in the image formation model (Eq. (17)-(18)).

Consider the discrete equivalent model of blurred image formation derived from Eq. (2), Eq. (17), and Eq. (18) as given by

(20)

From Eq. (20) it can be observed that, for latent image estimation, we need to recover the camera motion (), depth values ( for i=1,..,N), and an accurate estimate of plane segmentation masks ( for i=1,..,N). We first employ the inlier blur kernels and the normal estimate obtained from RANSAC (in Section 4) to estimate the TSF () and the depth parameters ( for i=1,..,N). This is then followed by an alternating minimization scheme where we solve for both the latent image () and segmentation masks ( for i=1,..,N) to yield the final restored image.

5.1 Estimation of camera motion and depth values

To estimate TSF and depth values, we make use of the inlier PSFs and the normal estimates obtained from RANSAC in Section 4. Consider a spatial location x lying on the plane of the scene. The PSF at can be related to TSF as [43]

(21)

Eq. (21) relates the inlier PSFs with corresponding depth values and underlying camera motion. As is evident from Eq. (19), although the camera motion is the same for the entire image, the effective pixel motion experienced by each scene point depends on the normal and the depth value of the corresponding plane. To solve for the TSF, we define the depth of one plane to be reference depth and solve for the scalar factor corresponding to all other planes [43].

The relation in Eq. (21) can be expressed in matrix-vector multiplication form as

(22)

where is a motion matrix which embeds the motion of a point light source at with respect to the camera poses in , and and are the column vector forms of and . Note that the entries of depend on the plane normal and unknown scale factor too. By aggregating such relations corresponding to all the inlier PSFs we can obtain an equation of the following form.

(23)

where and . The total number of inlier PSFs obtained from RANSAC is denoted as . Since the measurement matrix requires knowledge of depth values corresponding to each plane, we cannot use Eq. (23) alone to solve for the camera motion . Hence we choose to alternatively update the camera motion and depth values until convergence.

TSF refinement: Once the scale factors are known, we can build the matrix M in Eq. (23) and then estimate by solving the following optimization problem

(24)

Here, the estimate is indexed by the iteration number. We apply an $\ell_1$-norm based sparsity prior on the TSF to enforce the fact that the camera motion occupies only a few poses in the entire search space. The weight of the prior is controlled through a scalar regularization weight. We solve Eq. (24) using the alternating direction method of multipliers (ADMM) [45] to obtain the TSF estimate for the current iteration.
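To make the TSF refinement step concrete, here is a small ADMM sketch for a sparsity-regularized, non-negative least-squares problem. The objective `0.5*||M w - h||^2 + lam*||w||_1` with `w >= 0` is an assumed form chosen for illustration (the exact objective of Eq. (24) is not reproduced here), and `lam`, `rho`, and the iteration count are hypothetical settings.

```python
import numpy as np

def tsf_admm(M, h, lam=0.01, rho=1.0, iters=200):
    """ADMM for min_w 0.5*||M w - h||^2 + lam*||w||_1 subject to w >= 0.

    M: stacked motion matrix (one column per candidate camera pose).
    h: stacked, vectorised inlier PSFs.
    """
    n = M.shape[1]
    lhs = M.T @ M + rho * np.eye(n)   # system matrix of the quadratic sub-problem
    Mth = M.T @ h
    z = np.zeros(n)
    u = np.zeros(n)
    for _ in range(iters):
        w = np.linalg.solve(lhs, Mth + rho * (z - u))  # least-squares update
        z = np.maximum(w + u - lam / rho, 0.0)          # soft-threshold + non-negativity
        u = u + w - z                                   # scaled dual update
    return z
```

The returned vector is sparse and non-negative, matching the expectation that the camera visits only a few poses during the exposure.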

Scale factor refinement: To refine the scale factors, we form a set of candidate scale factors around 1 and search for the ones which best agree with the current TSF estimate and the inlier PSFs corresponding to each plane. We first use the camera motion obtained from the previous iteration to generate the PSFs at all the locations and for all the candidate scale factors. For each plane, the kernels generated at the locations of all the inlier PSFs of that plane are compared with the respective inlier PSFs to update the scale factor of the plane. To update it, we solve the following optimization problem

(25)

where refers to the set of spatial locations corresponding to all the inlier PSFs of plane.

In the first iteration, we estimate the TSF by setting all the scale factors to unity. The TSF estimate thus obtained is then used for updating the scale factors. Using the updated scale factors, we re-estimate the TSF using Eq. (24). This refinement of the TSF and the scale factors is repeated until all the scale factors converge.
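The scale factor update for one plane can be viewed as a grid search: for each candidate scale, PSFs are rendered at the inlier locations and compared with the observed inlier PSFs. The sketch below assumes a squared-error comparison and a caller-supplied `render_psf(x, y, scale)` function; both are illustrative assumptions, since the exact cost of Eq. (25) is not reproduced here.

```python
import numpy as np

def refine_scale(observed_psfs, locations, render_psf, candidates):
    """Return the candidate scale whose rendered PSFs best match the observed ones.

    render_psf(x, y, scale) is a hypothetical callable that generates the PSF at
    (x, y) from the current TSF estimate for a plane at the given relative depth.
    """
    costs = []
    for s in candidates:
        err = 0.0
        for (x, y), h_obs in zip(locations, observed_psfs):
            err += np.sum((render_psf(x, y, s) - h_obs) ** 2)
        costs.append(err)
    return candidates[int(np.argmin(costs))]
```

Alternating this search with the TSF estimate, starting from all scales equal to 1, gives the refinement loop described above.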

5.2 Image restoration and recovery of segmentation masks

In this section, we discuss our approach to recover the latent image by making use of the estimates obtained in the previous sections. Since latent image estimation requires knowledge of the segmentation masks, the problem is still ill-posed. Hence, we employ an alternating minimization (AM) scheme, where we iteratively repeat latent image estimation and segmentation mask recovery to arrive at the desired solution. The two sub-problems in our AM scheme are discussed next.

Figure 4: Results for a synthetically blurred single layer scene with and camera motion containing only in-plane translations. (a) Ground truth depth map generated using the plane parameters. (b) Ground truth input image. (c) Input blurred image generated using the depth map. (d) A pair of ground truth (top) and estimated kernels (bottom) obtained from two different locations in the blurred image. Restored image using (e) [39], (f) [40], (g) [30] and (h) the proposed approach.

Latent image estimation: The relation in Eq. (20) can be expressed in a matrix-vector multiplication form as follows

(26)

where g and f are the lexicographically ordered form of and ,respectively. The matrix which embeds the pixel motion corresponding to plane is built according to the camera motion and the parameters of plane. is a diagonal matrix built based on the segmentation mask . The matrix subsumes the pixel motions corresponding to all the points in the scene. From known estimates of the scene plane parameters (, ), camera motion (), and plane segmentation masks (), we estimate the latent image by solving the following form of optimization.

(27)

Here, to obtain , we apply norm based prior (weighted by the scale factor ) to enforce natural sparsity of latent image gradients [46] and then solve the resulting optimization using ADMM [45].

Estimation of segmentation masks: We estimate the segmentation masks by posing the task as a multi-label MRF optimization problem in which the labels indicate the assignment of pixels to planes. This optimization is then solved using graph cuts [47]. For a pixel at p, we define the cost of assigning a label as

(28)

where is the data cost to assign the label to pixel p, is a neighborhood of pixels around p, is the smoothness cost to assign the labels to the adjacent pixels p, q and is the scalar weight on the smoothness term. We use the following form of cost function to compute the data cost corresponding to .

(29)

It is straightforward to see that the above data cost enforces the label assignment to respect the image formation model in Eq. (26). The smoothness cost has the following form.

(30)

where is a scalar value. This is used to enforce the fact that adjacent pixels in the image are more likely to have identical labels, i.e; the pixels corresponding to a single plane will form a contiguous region.
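The labeling cost can be illustrated with a small energy evaluator. The sketch assumes a squared-difference data cost between the observed blurred image and the latent image re-blurred under each plane hypothesis, and a Potts smoothness term on a 4-connected grid; these forms, the parameter names `beta` and `lam`, and the function names are assumptions for illustration. The graph cut solver itself is not implemented here; only the energy it would minimize is evaluated.

```python
import numpy as np

def data_costs(blurred, predictions):
    """Per-pixel, per-label cost; predictions has shape (N, H, W), one image per plane."""
    return (np.asarray(predictions, dtype=float) - blurred[None]) ** 2

def potts_energy(labels, costs, beta=1.0, lam=1.0):
    """Total energy: data term plus lam * beta for every 4-neighbour label change."""
    h, w = labels.shape
    rows, cols = np.mgrid[0:h, 0:w]
    data = costs[labels, rows, cols].sum()
    changes = (labels[:, 1:] != labels[:, :-1]).sum() + (labels[1:, :] != labels[:-1, :]).sum()
    return data + lam * beta * changes
```

A data-only labeling is `np.argmin(costs, axis=0)`; the smoothness term then penalizes fragmented label maps, favouring contiguous regions per plane.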

We start our AM by solving for the latent image, initializing the mask corresponding to the background layer as all 1s and the masks of the other layers as all 0s. This is followed by alternating refinement of the masks and the latent image to yield the final restored image as well as an accurate layer segmentation map.

6 Experiments

In this section, we validate the proposed method on both synthetic and real examples. We also show quantitative and qualitative comparisons with state-of-the-art blind deblurring approaches. For normal estimation of all the scene planes, we have estimated the PSFs from overlapping patches of size with an overlap factor of . To estimate the blur kernel for a selected patch we used an off-the-shelf blind motion deblurring technique in [23]. To find the extremities of blur kernels, we use the PSF end point localization approach from [9]. These PSF estimates are then used in our RANSAC based approach to identify the number of planes and associated normals. In the RANSAC algorithm, the PSF estimate which induces a deviation of more than degrees in the normal estimate is treated as an outlier.

In all our experiments, the number of iterations for alternating refinement of TSF and depth values in Section 5.1 as well as the AM between the latent image estimation and segmentation mask recovery was set to . To solve various optimization problems discussed in previous sections, the value of , , and were set to , , and , respectively. All these parameters were found empirically through experimentation. For the segmentation mask recovery using Eq. (28), we used the image obtained by applying smoothing filter [48] on the estimated latent image from Eq. (27). As observed in [31], the smoothing filter not only helps in countering the adverse effects of the small edges during depth estimation, it also helps in recovering strong gradients in the latent image which, in turn, ensure better convergence of the subsequent AM approach.

Figure 5: Results for a synthetically blurred two-layer scene, with background layer as a fronto-parallel plane and foreground layer having . (a) Ground truth depth map (generated using the plane parameters and segmentation masks). (b) Input blurred image generated using the depth map and camera trajectory from [41]. (c) Recovered depth-map obtained using the estimated plane parameters and segmentation masks. Restored image using (d) the proposed approach, (e) [39] , (f) [49], (g) [40] and (h) [31].
Figure 6: Results for a synthetically blurred three-layer scene. (a) Ground truth depth map (generated using the plane parameters and segmentation masks). (b) Input blurred image generated using the depth map and camera trajectory from [41]. (c) Recovered depth map obtained using the estimated plane parameters and segmentation masks. Restored image using (d) the proposed approach, (e) [39], (f) [49], (g) [40] and (h) [31].

6.1 Synthetic Experiments

To generate synthetic test examples, we used the trajectories from the dataset of [41] to simulate the camera motion. Images from the dataset of [50] were used as ground truth images for the different layers. Layer masks were created manually as binary masks of arbitrary shapes. For all the synthetic experiments, we set the focal length to 1000 pixels.
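As a rough illustration of how a ground truth depth map can be generated from plane parameters and segmentation masks, the sketch below uses the standard pinhole relation: a plane with unit normal n = (n_x, n_y, n_z) and offset d satisfies n·X = d, and a pixel at (x, y) relative to the principal point back-projects to X = Z (x/f, y/f, 1), so Z = d / (n_x x/f + n_y y/f + n_z). The function names, the choice of the image centre as principal point, and the back-to-front layer ordering are assumptions made for this example only.

```python
def plane_depth(normal, d, x, y, f=1000.0):
    """Depth Z at pixel offset (x, y) from the principal point for the
    plane n . X = d, under a pinhole camera with focal length f (pixels)."""
    nx, ny, nz = normal
    denom = nx * x / f + ny * y / f + nz
    if denom <= 0:
        raise ValueError("viewing ray does not hit the plane in front of the camera")
    return d / denom


def depth_map(width, height, layers, f=1000.0):
    """Composite a depth map from (normal, d, mask) layers.

    Layers are given back to front; later layers overwrite earlier ones
    wherever their binary mask is True. The principal point is assumed
    to lie at the image centre.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    depth = [[None] * width for _ in range(height)]
    for normal, d, mask in layers:
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    depth[r][c] = plane_depth(normal, d, c - cx, r - cy, f)
    return depth


# Toy 3x1 image: fronto-parallel background plus an inclined foreground.
background = ((0.0, 0.0, 1.0), 2.0, [[True, True, True]])
foreground = ((0.6, 0.0, 0.8), 1.0, [[False, True, True]])
toy_depth = depth_map(3, 1, [background, foreground])
```

A fronto-parallel layer yields constant depth inside its mask, whereas an inclined layer yields depth that varies smoothly across the mask, which is what makes the blur depend on position within a single plane.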

To evaluate our scheme for normal and latent image estimation quantitatively, we created a synthetic dataset comprising 10 blurred images of 3D scenes with single and multiple planes. We assess our normal estimation scheme by computing the angular error between the ground truth normal and the estimated normal. To assess our deblurring scheme, we use the PSNR (Peak Signal to Noise Ratio) and SSIM (Structural Similarity Measure) values of the restored images, calculated with respect to the corresponding ground truth images. These values are compared with those of state-of-the-art deblurring approaches, whose results were generated using the implementations provided by the respective authors.
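For reference, the two simpler metrics can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used for the reported numbers: images are taken to be equally sized 2D lists of grey values with a peak of 255, PSNR is computed as 10 log10(peak^2 / MSE), and the function names are hypothetical. SSIM is omitted because it involves local windowed statistics that would not fit a short example.

```python
import math


def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    ref = [v for row in reference for v in row]
    res = [v for row in restored for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref, res)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def angular_error_deg(n_true, n_est):
    """Angle in degrees between the ground truth and estimated normals."""
    dot = sum(a * b for a, b in zip(n_true, n_est))
    norm = (math.sqrt(sum(a * a for a in n_true))
            * math.sqrt(sum(b * b for b in n_est)))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


# Toy values chosen only to exercise the functions.
toy_psnr = psnr([[100, 100]], [[110, 90]])
toy_angle = angular_error_deg((0.0, 0.0, 1.0), (1.0, 0.0, 0.0))
```

Averaging `angular_error_deg` over all planes of a scene, or over all scenes of a dataset, gives the kind of average angular error quoted in the text.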

A synthetic example of a single-layer scene blurred using in-plane translational motion of the camera and plane normal is shown in Fig. 4. Using our algorithm, the estimated value of the normal for this image is which is quite close to the true normal (the angular error is degrees). Fig. 5 and Fig. 6 show synthetic examples corresponding to a two-layer and a three-layer scene, respectively. In both cases, images were blurred using camera motion involving translations and rotations, and the background was set to be fronto-parallel. While the foreground layer of the example in Fig. 5 was blurred using , we used the normals and for the two foreground layers in Fig. 6. For the scene in Fig. 5, the normals estimated using the proposed method were found to be and which amounts to an average error of degrees. Proceeding similarly for Fig. 6, the average angular error for the three normals was found to be degrees. The average angular error for our synthetic dataset is degrees.

Qualitative comparisons for deblurring are shown in Figs. 4, 5 and 6. It can be seen that our approach recovers scene texture faithfully, while the results of existing methods contain visible artifacts. The results of the learning-based approaches [39], [49] and [40] contain artifacts at the planar boundaries and in densely textured regions. Although undesirable, such local deviations from the ground truth are often found in the results of generative models, since the outputs of these networks are not constrained to follow the image formation model. For Fig. 4, the approach of [30] deblurs only a few regions since it does not model depth variations. Similar issues are found in the results of the multi-planar deblurring algorithm of [31] in Figs. 5 and 6, as it does not handle inclined planes. Note that manually marked regions (belonging to each plane) were provided as input to [31]. In contrast, our method is able to automatically segment the scenes and deblur them faithfully. The superiority of our results is also reflected in the quantitative comparisons provided in Table 1.

Method      [39]     [49]     [40]     [31]     Ours
PSNR (dB)   25.49    25.23    26.02    27.25    29.12
SSIM        0.7200   0.7573   0.7783   0.8346   0.9068
Table 1: Quantitative comparison of deblurring using our method with other state-of-the-art blind deblurring algorithms on the synthetically blurred dataset.

6.2 Real Experiments

Figure 7: Results of depth estimation and deblurring for a scene containing a single plane. (a) Input blurred image. (b) Depth map generated using [19]. (c) Depth map generated using [20]. (d) Estimated depth map using our method. The next row shows deblurred images using (e) [39], (f) [49], (g) [40], and (h) the proposed approach.
Figure 8: Results of depth estimation and deblurring for scenes containing two planes. Subfigures (a,k) show the input blurred images, (b,l) the depth maps generated using [19], (c,m) the depth maps generated using [20], (d,n) the depth maps used by [31], (e,o) the estimated depth maps using our method. The second and fourth rows show the deblurring results of [31] (f,p), [39] (g,q), [49] (h,r), [40] (i,s), and our method (j,t).
Figure 9: Results of depth estimation and deblurring for scenes containing three planes. Subfigures (a,k) show the input blurred images, (b,l) the depth maps generated using [19], (c,m) the depth maps generated using [20], (d,n) the depth maps used by [31], (e,o) the estimated depth maps using our method. The second and fourth rows show the deblurring results of [31] (f,p), [39] (g,q), [49] (h,r), [40] (i,s), and our method (j,t).

The real experiments are carried out using images captured with a Xiaomi Mi5 camera in the presence of general camera shake. To compare deblurring performance, we applied the conventional non-uniform motion deblurring method of [31] and the learning-based models of [39], [49], and [40] to individual images. For depth estimation, we compare with the recent learning-based single image depth estimation methods of [19] and [20].

In the first example, we consider a scenario where a large billboard is present at an inclination to the camera, as shown in Fig. 7. The scene can be modeled as a single inclined plane. By following the same procedure as outlined in the synthetic case, outlier PSFs were removed and only the authentic blur kernels were used to estimate the TSF. Note that for real examples, we do not have knowledge of the true normal. Using our algorithm, the estimated value of normal for this image is which is visually consistent with the scene inclination. Note that our estimated depth-map appears more consistent with the scene than the results of [19, 20], since we utilize the information present in the blur-kernels and enforce a planar constraint.

Fig. 8 shows results and comparisons on real blurred images containing two-layer scenes. In Fig. 8(a), both planes are inclined along the horizontal direction with respect to the camera, while in Fig. 8(k), the foreground is approximately fronto-parallel and the background is inclined along the vertical direction. Our normal estimates for Fig. 8(a) are and while for Fig. 8(k), they are and , respectively. Our depth estimates concur with the scene geometry. The results of [19, 20] describe the scene depth variation at a very coarse level but contain various depth discontinuities at a fine level. The superiority of our depth segmentation can be attributed to the constraints present in our deblurring algorithm.

In terms of deblurring performance, the method of [31] is able to partially deblur some regions in the scene (owing to the manually supplied depth segmentation), but suffers from incomplete deblurring and ringing artifacts in inclined regions. The results of [39], [49], and [40] suffer from incomplete deblurring while introducing artifacts in textured regions. Our method leads to better deblurring results.

The next set of examples, containing three-layer scenes, is shown in Fig. 9. The intermediate results of iterative depth estimation on the 4th test image are shown in Fig. 10(b-f). Note that the three different planes are clearly distinguishable in the final iteration. Again, it can be seen that our approach recovers scene depth and texture faithfully, while the results of existing methods contain visible artifacts in inclined regions.

Figure 10: Estimated depth-maps for the real blurred image from Fig. 9(a) from our AM scheme, recorded after each iteration. Note that the three different planes are clearly distinguishable in the final iteration.

7 Conclusions

We formulated the underlying relationship between the surface normal of a planar scene and the induced space-variant nature of blur due to camera motion. By utilizing the correspondences among the extreme points of the PSFs, we proposed a new approach to solve for the surface normal of a planar scene. The method yields robust normal estimates even on real images, and these estimates can be conveniently plugged into an existing image formation model for restoration of motion-blurred 3D scenes. Finally, we proposed a first-of-its-kind scheme to estimate the orientation of multiple planes from a single motion blurred image and utilized it to deblur the image. Our proposed approach achieves state-of-the-art results for the task of single image 3D scene motion deblurring.

References

  • [1] Sang Hwa Lee and Siddharth Sharma. Real-time disparity estimation algorithm for stereo camera systems. IEEE transactions on Consumer electronics, 57(3), 2011.
  • [2] Behzad Shahraray and Michael K Brown. Robust depth estimation from optical flow. In Computer Vision., Second International Conference on, pages 641–650. IEEE, 1988.
  • [3] Boaz J Super and Alan C Bovik. Planar surface orientation from texture spatial frequencies. Pattern Recognition, 28(5):729–743, 1995.
  • [4] Lisa Gottesfeld Brown and Haim Shvaytser. Surface orientation from projective foreshortening of isotropic texture autocorrelation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):584–588, 1990.
  • [5] Ruo Zhang, Ping-Sing Tsai, James Edwin Cryer, and Mubarak Shah. Shape-from-shading: a survey. IEEE transactions on pattern analysis and machine intelligence, 21(8):690–706, 1999.
  • [6] Subhasis Chaudhuri and Ambasamudram N Rajagopalan. Depth from defocus: a real aperture imaging approach. Springer Science & Business Media, 2012.
  • [7] Paramanand Chandramouli and A Rajagopalan. Inferring image transformation and structure from motion-blurred images. In BMVC, pages 73–1, 2010.
  • [8] Huei-Yung Lin and Chia-Hong Chang. Depth recovery from motion blurred images. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 1, pages 135–138. IEEE, 2006.
  • [9] Yali Zheng, Shohei Nobuhara, and Yaser Sheikh. Structure from motion blur in low light. In CVPR, pages 2569–2576. IEEE, 2011.
  • [10] Paul Clark and Majid Mirmehdi. Estimating the orientation and recovery of text planes in a single image. In BMVC, pages 1–10, 2001.
  • [11] Hany Farid and Jana Kosecka. Estimating planar surface orientation using bispectral analysis. IEEE Transactions on image processing, 16(8):2154–2160, 2007.
  • [12] Thomas Greiner, Shivani G Rao, and Sukhendu Das. Estimation of orientation of a textured planar surface using projective equations and separable analysis with m-channel wavelet decomposition. Pattern Recognition, 43(1):230–243, 2010.
  • [13] Scott McCloskey and Michael Langer. Planar orientation from blur gradients in a single image. In CVPR, pages 2318–2325. IEEE, 2009.
  • [14] M Purnachandra Rao, AN Rajagopalan, and Guna Seetharaman. Inferring plane orientation from a single motion blurred image. In ICPR, pages 2089–2094. IEEE, 2014.
  • [15] Subeesh Vasu, AN Rajagopalan, and Gunasekaran Seetharaman. Tapping motion blur for robust normal estimation of planar scenes. In ICIP, pages 2761–2765. IEEE, 2015.
  • [16] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. IEEE transactions on pattern analysis and machine intelligence, 31(5):824–840, 2009.
  • [17] Osian Haines and Andrew Calway. Detecting planes and estimating their orientation from a single image. In BMVC, pages 1–11, 2012.
  • [18] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, pages 2366–2374, 2014.
  • [19] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
  • [20] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
  • [21] Rob Fergus, Barun Singh, Aaron Hertzmann, Sam T Roweis, and William T Freeman. Removing camera shake from a single photograph. In ACM transactions on graphics (TOG), volume 25, pages 787–794. ACM, 2006.
  • [22] Sunghyun Cho and Seungyong Lee. Fast motion deblurring. In ACM Transactions on Graphics (TOG), volume 28, page 145. ACM, 2009.
  • [23] Li Xu and Jiaya Jia. Two-phase kernel estimation for robust motion deblurring. ECCV, pages 157–170, 2010.
  • [24] Libin Sun, Sunghyun Cho, Jue Wang, and James Hays. Edge-based blur kernel estimation using patch priors. In ICCP, pages 1–8. IEEE, 2013.
  • [25] Tomer Michaeli and Michal Irani. Blind deblurring using internal patch recurrence. In ECCV, pages 783–798. Springer, 2014.
  • [26] Ankit Gupta, Neel Joshi, C Lawrence Zitnick, Michael Cohen, and Brian Curless. Single image deblurring using motion density functions. ECCV, pages 171–184, 2010.
  • [27] Michael Hirsch, Christian J Schuler, Stefan Harmeling, and Bernhard Schölkopf. Fast removal of non-uniform camera shake. In ICCV, pages 463–470. IEEE, 2011.
  • [28] Oliver Whyte, Josef Sivic, Andrew Zisserman, and Jean Ponce. Non-uniform deblurring for shaken images. International journal of computer vision, 98(2):168–186, 2012.
  • [29] Subeesh Vasu and A. N. Rajagopalan. From local to global: Edge profiles to camera motion in blurred images. In CVPR, July 2017.
  • [30] Yanyang Yan, Wenqi Ren, Yuanfang Guo, Rui Wang, and Xiaochun Cao. Image deblurring via extreme channels prior. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [31] Zhe Hu, Li Xu, and Ming-Hsuan Yang. Joint depth estimation and camera shake removal from single blurry image. In CVPR, pages 2893–2900, 2014.
  • [32] Karthik Seemakurthy, Subeesh Vasu, and Rajagopalan Ambasamudram. Deskewing by space-variant deblurring. In BMVC, 2016.
  • [33] Michal Sorel and Jan Flusser. Space-variant restoration of images degraded by camera motion blur. IEEE Transactions on Image Processing, 17(2):105–116, 2008.
  • [34] Li Xu and Jiaya Jia. Depth-aware motion deblurring. In ICCP, pages 1–8. IEEE, 2012.
  • [35] Chandramouli Paramanand and Ambasamudram N Rajagopalan. Non-uniform motion deblurring for bilayer scenes. In CVPR, pages 1115–1122, 2013.
  • [36] Jian Sun, Wenfei Cao, Zongben Xu, Jean Ponce, et al. Learning a convolutional neural network for non-uniform motion blur removal. In CVPR, pages 769–777, 2015.
  • [37] Dong Gong, Jie Yang, Lingqiao Liu, Yanning Zhang, Ian Reid, Chunhua Shen, AVD Hengel, and Qinfeng Shi. From motion blur to motion flow: a deep learning solution for removing heterogeneous motion blur. In CVPR, 2017.
  • [38] TM Nimisha, Akash Kumar Singh, and AN Rajagopalan. Blur-invariant deep learning for blind-deblurring. In ICCV, pages 4752–4760, 2017.
  • [39] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, volume 2017, 2017.
  • [40] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. arXiv preprint arXiv:1802.01770, 2018.
  • [41] Rolf Köhler, Michael Hirsch, Betty Mohler, Bernhard Schölkopf, and Stefan Harmeling. Recording and playback of camera shake: Benchmarking blind deconvolution with a real-world database. ECCV, pages 27–40, 2012.
  • [42] O. Whyte, J. Sivic, A. Zisserman, and J. Ponce. Non-uniform deblurring for shaken images. In CVPR, 2010.
  • [43] Chandramouli Paramanand and AN Rajagopalan. Shape from sharp and motion-blurred image pair. International journal of computer vision, 107(3):272–292, 2014.
  • [44] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  • [45] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
  • [46] Yilun Wang, Junfeng Yang, Wotao Yin, and Yin Zhang. A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences, 1(3):248–272, 2008.
  • [47] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on pattern analysis and machine intelligence, 23(11):1222–1239, 2001.
  • [48] Li Xu, Cewu Lu, Yi Xu, and Jiaya Jia. Image smoothing via L0 gradient minimization. In ACM Transactions on Graphics (TOG), volume 30, page 174. ACM, 2011.
  • [49] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiří Matas. Deblurgan: Blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8183–8192, 2018.
  • [50] Libin Sun and James Hays. Super-resolution from internet-scale scene matching. In Proceedings of the IEEE Conf. on International Conference on Computational Photography (ICCP), 2012.