A Motion Free Approach to Dense Depth Estimation in Complex Dynamic Scene

02/11/2019 · by Suryansh Kumar et al. · Australian National University

Despite the recent success of deep learning methods in per-frame monocular dense depth estimation of rigid scenes, they fail to achieve similar success for complex dynamic scenes, such as MPI Sintel [4]. Moreover, conventional geometric methods that address this problem with a piece-wise rigid scene model require a reliable estimation of motion parameters for each local model, which is difficult to obtain and validate. In this work, we show that, given per-pixel optical flow correspondences between two consecutive frames and a sparse depth prior for the reference frame, we can recover the dense depth map for the successive frames without solving for motion parameters. By assigning a locally rigid structure to the piece-wise planar approximation of a dynamic scene, which transforms as rigidly as possible over frames, we demonstrate that we can bypass the motion estimation step. In essence, our formulation provides a new way to think about and recover the dense depth map of a complex dynamic scene that is recursive, incremental and motion free in nature, and can therefore also be integrated with modern neural network frameworks for large-scale depth-estimation applications. Our proposed method does not make any prior assumption about the rigidity of a dynamic scene; as a result, it is applicable to a wide range of scenarios. Experimental results show that our method can effectively provide the depth for successive/multiple frames of a dynamic scene without using any motion parameters.


1 Introduction

Dense depth estimation of complex dynamic scenes from two consecutive frames has recently gained enormous attention from several industries involved in augmented reality, autonomous driving, movies, etc. Applications such as obstacle detection [21] and robot navigation [20] need reliable depth to develop autonomous systems. Although recent research on this problem has provided some promising theory and results, its success strongly depends on accurate estimates of 3D motion parameters [19, 26].

Figure 1: Given two consecutive monocular perspective frames (a), (b) of a complex dynamic scene, the dense optical flow correspondences between them (d), and an approximate sparse depth prior for the reference frame (c), our algorithm, under the piecewise planar approximation of a dynamic scene, gives a per-pixel depth estimate for the next frame (f) without solving for any motion parameters. (e) Ground-truth depth.

To our knowledge, almost all existing geometric solutions to this problem have tried to fit the well-established theory of rigid reconstruction to estimate the per-pixel depth of dynamic scenes from monocular images [24, 19, 26]. Hence, these extensions are intricate to execute and depend heavily on reliable per-object or per-superpixel [1] motion estimates [24, 19, 26]. The main issue with these frameworks is that, even if the depth for the first/reference frame is known, we must solve for per-superpixel or per-object motion to obtain the depth for the next frame. As a result, the composition of their objective function fails to utilize the depth knowledge and therefore does not integrate into large-scale applications. In this work, we argue that in a dynamic scene, if the depth for the reference frame is known, then it seems “unnecessary or at least undesirable” to estimate motion to recover the dense depth map for the next frame. Therefore, relative motion estimation as an essential paradigm for obtaining the depth of a complex dynamic scene becomes optional given prior knowledge about the depth of the reference frame and dense optical flow between frames. To support our argument, we propose a new motion-free approach which is easy to implement and allows users to avoid the complexity associated with optimization on the SE(3) manifold.

We posit that recent geometric methods for this task have been limited by their inherent dependence on motion parameters. Consequently, we present an alternative method that casts dynamic-scene depth estimation as a global as-rigid-as-possible (ARAP) optimization problem which is motion free. Inspired by the prior work [19], we model the dynamic scene as a set of locally planar surfaces; however, that work constrains the movement of each local planar structure using the homography [23] and its relative motion between frames. In contrast, we propose that the ARAP constraint over a dynamic scene does not need 3D motion parameters, and that its definition based purely on the 3D Euclidean distance metric is sufficient regularization to supply the depth for the next frame. At this point, one may ask: “Why an ARAP assumption for a dynamic scene?”

Consider a general real-world dynamic scene: the change we observe in the scene between consecutive time frames is not arbitrary, rather it is regular. If we observe a local transformation closely, it changes rigidly, while the overall transformation that the scene undergoes is non-rigid. Therefore, assuming that the dynamic scene deforms as rigidly as possible is quite convincing and works well in practice for most real-world dynamic scenes.

To use this ARAP model, we first decompose the dynamic scene into a collection of moving planes. We consider the K-nearest neighbors of each superpixel [1] —which is an approximation of a surfel in the projective space— to define our ARAP model. For each superpixel, we choose three points, i.e., an anchor point (the center of the plane) and two other non-collinear points. Since the depth for the reference frame is assumed to be known (for at least 3 non-collinear points per superpixel), we can estimate the per-plane normal for the reference frame; to estimate the per-plane normal for the next frame, we likewise need the depth of at least 3 non-collinear points per plane (a minimal sketch of this normal computation is given below). If the per-pixel depth for the reference frame is known, the ARAP model can be extended to the pixel level without any loss of generality. The only reason for the discrete planar approximation is computational complexity.
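A minimal sketch (assuming numpy arrays for the 3D points; not the authors' code) of why three non-collinear points per plane suffice: two in-plane edge vectors taken from the anchor point determine the plane normal via a cross product.

  import numpy as np

  def plane_normal(anchor, p1, p2):
      # Unit normal of the plane through three non-collinear 3D points.
      n = np.cross(p1 - anchor, p2 - anchor)   # two in-plane edge vectors -> normal direction
      return n / np.linalg.norm(n)             # normalize; collinear inputs would give a zero vector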

Our ARAP model defined over planes does not take into account the depth continuity along the boundaries of the planes. We address this in a subsequent step by solving a depth-continuity-constrained optimization problem using the TRW-S algorithm [14] (see Fig. 1 for a sample result).

In this work, we make the following contributions:

  • We propose an approach to estimate the dense depth map of a complex dynamic scene that circumvents explicit parameterization of the inter-frame motion by imposing an as-rigid-as-possible constraint on the depth estimation.

  • Our algorithm, under the piece-wise planar and as-rigid-as-possible assumptions, appropriately encapsulates the behavior of a dynamic scene to estimate per-pixel depth.

  • Although the formulation is presented for the classical case of two consecutive frames, it is incremental in nature and therefore easy to extend to multiple frames without estimating any 3D motion parameters. Experimental results over multiple frames show the validity of our claim.

2 Related Work and Our Motivation

Recently, numerous papers motivated by the success of neural networks have been published on dense depth estimation of a dynamic scene from images [31, 8, 30, 6]. Notably, none of these works shows results on the MPI Sintel dataset [4]. For brevity, we limit our discussion to the recent papers that take a geometric approach to this problem, leading to an easy discourse of our contributions. We also briefly discuss why our formulation can be more beneficial to learning algorithms for this task than other geometric approaches [19, 26].

(a)
(b)
Figure 2: (a) Piece-wise planar approximation of a dynamic scene. Each superpixel is assumed to be an approximation of a 3D plane in the projective space. The center of the plane is shown with a filled circle (anchor point). (b) Decomposition of the scene into a local graph structure. Locally rigid graph model with its k-nearest neighbor is shown for the reference frame and the next frame.

The motion-free approach to estimating the 3D geometry of a rigid scene introduced by Li [22], and its extension [13] to a single non-rigidly deforming object, are restricted to handling a few sparse points over multiple frames (M views, N points). To the best of our knowledge, two significant classes of work have been proposed in the recent past for estimating the dense depth map of an entire dynamic scene from two consecutive monocular images [24, 19, 26]; however, all of these methods are motion dependent. These works can broadly be classified as (a) object-level motion segmentation approaches and (b) object-level motion segmentation free approaches.

(a) Object-level motion segmentation approach: Ranftl et al. [26] proposed a two/three-stage approach to solve dense monocular depth estimation of a dynamic scene. Given the dense optical flow field, the method first performs object-level motion segmentation using epipolar geometry [10]. The per-object motion segmentation is then used to perform object-level 3D reconstruction via triangulation [10]. To obtain a scene-consistent depth map, ordering and smoothness constraints are employed over a Quick-Shift superpixel [29] graph to deliver the final result.

(b) Object-level motion segmentation free approach: Kumar et al. [19] argued that “in a general dynamic scene setting, the task of densely segmenting rigidly moving object or parts is not trivial”. They proposed an over-parametrized algorithm that solves this task without object-specific motion segmentation. Their method, dubbed “Superpixel Soup”, showed that under two mild assumptions about the dynamic scene, namely (a) the deformation of the scene is locally rigid and globally as rigid as possible and (b) the scene can be approximated by a piece-wise planar model, a scale-consistent 3D reconstruction of a dynamic scene can be obtained for both frames with high accuracy. Inspired by the locally rigid assumption, Noraky et al. [24] recently proposed a method that uses optical flow and a depth prior to estimate the pose and 3D reconstruction of a deformable object.

Challenges with such geometric approaches: Although these methods provide a plausible direction for solving this challenging problem, their usage in real-world applications is very limited. The major challenge with these approaches is the correct estimation of motion parameters. The method proposed by Ranftl et al. [26] estimates per-object relative rigid motion, which is not a sensible choice if the objects themselves are deforming. On the other hand, methods such as [24, 19] estimate per-superpixel/per-region relative rigid motion, which is sensitive to the size of the superpixels and the distance of the surfel from the camera.

The point we are trying to make is: given the depth for the reference frame of a dynamic scene, can we correctly estimate the depth for the next frame using the aforementioned approaches? Maybe yes, but then we have to again estimate the relative rigid motion for each object or superpixel, and so on. Inspired by the “as-rigid-as-possible” (ARAP) intuition [19], we show in this work that if we know the depth for the reference frame and the dense optical flow correspondences between consecutive frames, then estimating relative motion is not essential under the locally planar assumption. We can successfully estimate the depth for the next frame by exploiting a global as-rigid-as-possible constraint. These depth estimates can further be refined using a boundary depth continuity constraint.

The next concern could be: why solve this problem in a motion-free way? Keeping in mind the success of deep learning approaches for per-frame dense depth estimation, our cost function can directly provide the depth for the next frame of a dynamic scene without any motion estimate. And since the choice of reference frame and next frame is relative, it further provides a recursive way to improve depth estimates over iterations if supplied with appropriate priors. Moreover, our formulation provides the flexibility to solve for depth at the pixel level rather than at an object or superpixel level, which is hard to realize using motion-based approaches [24, 19, 26]. Nevertheless, to reduce the overall computational cost, we optimize our objective function at the superpixel level.

3 Piecewise Planar Scene Model

Inspired by the recent work on dense depth estimation of a general dynamic scene [19], our model parameterizes the scene as a collection of piece-wise planar surfaces, where each local plane is assumed to be moving over frames. The global deformation of the entire scene is assumed to be as rigid as possible. Moreover, we assign the center of each plane (the anchor point) to act as a representative for all the points within that plane (see Fig. 2). In addition to the anchor point of each plane, we take two more points from the same plane so that the three points are non-collinear (see Fig. 3). This strategy is used to define our as-rigid-as-possible constraint between the reference frame and the next frame without using any motion parameters. As the depth for the reference frame and the optical flow between the two successive frames are assumed to be known a priori, each local planar region is described using only four parameters —normal and depth, instead of nine [19].

Our model first assigns each pixel of the reference frame to a superpixel using the SLIC algorithm [1], and each of these superpixels then acts as a representative of its 3D plane geometry. Since the global geometry of the dynamic scene is assumed to deform as rigidly as possible, we solve for the depth in the next frame subject to the constraint that the transformation each plane undergoes from the reference frame to the next frame is as small as possible. The solution to the ARAP global constraint provides the depths of three points per plane in the next frame, which are used to estimate the normal and depth of the plane. The estimated depth and normal of each plane are then used to calculate the per-pixel depth in the next frame (a minimal sketch of this last step is given below).
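The final per-pixel step amounts to intersecting each pixel's viewing ray with its plane. A minimal sketch, assuming the intrinsic matrix K and the pixel coordinates (u, v) are available (the function and argument names are hypothetical, not from the paper):

  import numpy as np

  def pixel_depth_from_plane(u, v, n, d, K):
      # Depth along the viewing ray of pixel (u, v) for the plane n . X = d,
      # where n is the unit plane normal and d the plane offset.
      ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # back-projected ray direction
      ray /= np.linalg.norm(ray)                       # unit viewing ray
      return d / (n @ ray)                             # scale so that depth * ray lies on the plane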

Although our algorithm is described for the classical two-frame case, it is easy to extend to the multi-frame case. The energy function we define below is solved in two steps. First, we solve for the depth of each superpixel in the next frame using the as-rigid-as-possible constraint. Due to the piece-wise planar approximation of the scene, this solution introduces depth discontinuities along the boundaries. In the second step, to remove the blocky artifacts —due to the discretization of the scene— we smooth the obtained depth along the boundaries of all the estimated 3D planes using TRW-S [14]. If the ARAP cost function is extended to the pixel level, the boundary continuity constraint can be avoided [11]. Nevertheless, over-segmentation of the scene provides a good enough approximation of a dynamic scene and is computationally easy to handle.

3.1 Model overview

Notation: We refer to the two consecutive perspective images as the reference frame and the next frame, respectively. Vectors are represented by bold lowercase letters and matrices by bold uppercase letters. The 1-norm and 2-norm of a vector are denoted by $\|\cdot\|_1$ and $\|\cdot\|_2$, respectively.

3.2 As-Rigid-As-Possible (ARAP)

The idea of the ARAP constraint is well known in practice and has been widely used for shape modeling and shape manipulation [12]. Recently, Kumar et al. [19] exploited this idea to estimate a scale-consistent dense 3D structure of a dynamic scene. Our use of the ARAP constraint is inspired by the same idea, i.e., restrict the deformation such that the overall transformation of the scene between frames is as small as possible.

Consider the depths of two neighboring 3D points, measured from the reference camera coordinate frame, in the two consecutive frames, together with their image coordinates in the reference frame and in the next frame. Given the intrinsic camera calibration matrix, each image coordinate yields a unit vector in the direction of its 3D point in the reference frame; the corresponding unit vectors in the next frame are defined analogously (see Fig. 2(a)). Using these notations, we define the ARAP constraint as:

(1)

Here, the outer summation runs over the total number of planes used to approximate the scene, and the inner summation runs over the K neighboring planes local to each superpixel (see Fig. 2(b)). An exponential weight fall-off based on the image distance of the points is used, i.e., the rigidity constraint is slowly relaxed as points get farther apart in the image space. This constraint encapsulates our idea that the change in the distance of a point relative to its local neighbors in the next frame should be as small as possible. Note that the inner summation runs over the local neighborhood rather than over all planes, for the reasons discussed earlier. A plausible algebraic form of this term is sketched below.
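A plausible form of Eq. (1), written with assumed symbols (not reproduced from the paper): X_i and X'_i denote the 3D anchor point of plane i in the reference and next frame, each obtained by scaling the corresponding unit viewing ray by its depth; N is the number of planes, K(i) the index set of the K nearest planes, and w_ij the exponential image-distance weight.

  % Hedged sketch of Eq. (1); the exact form in the paper may differ.
  \begin{equation*}
  E_{\mathrm{ARAP}}\big(\{d'_i\}\big) \;=\; \sum_{i=1}^{N} \sum_{j \in \mathcal{K}(i)}
    w_{ij}\,\Big|\,\big\| \mathbf{X}'_i - \mathbf{X}'_j \big\|_2
    - \big\| \mathbf{X}_i - \mathbf{X}_j \big\|_2 \Big|,
  \qquad
  w_{ij} = \exp\!\big(-\beta\,\|\mathbf{x}_i - \mathbf{x}_j\|_2\big),
  \end{equation*}

  where x_i, x_j are the image coordinates of the anchor points and beta is a fall-off parameter.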

Figure 3: Intuition on orientation and shape regularization. The anchor point and the two non-collinear points are shown in red and green, respectively. Dark red lines show the change in the next frame.

3.3 Orientation and Shape Regularization

Solving the ARAP constraint provides the depths of three non-collinear points per plane in the next frame. We use these three depth estimates per plane to solve for the plane normals in the next frame. Let the 3D points corresponding to the three depths of a superpixel in the next frame be the anchor point and its two companion points. We estimate the normals in the next frame as:

(2)

where a superscript is used to denote the anchor point, which is assumed to lie at the center of each plane (see Fig. 3). Rewriting Eq. (2) in terms of depth gives
(3)
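Under the same assumed notation, Eqs. (2)-(3) plausibly take the following form, with X'^a_i, X'^1_i, X'^2_i the anchor and the two companion points of plane i in the next frame, x̄' the corresponding unit viewing rays, and d' their depths; this is a sketch, not the paper's exact equations.

  % Hedged sketch of Eqs. (2)-(3): normal from the cross product of two in-plane
  % edge vectors, then the same expression with each point written as depth times ray.
  \begin{align*}
  \mathbf{n}'_i &\propto \big(\mathbf{X}'^{1}_i - \mathbf{X}'^{a}_i\big) \times
                          \big(\mathbf{X}'^{2}_i - \mathbf{X}'^{a}_i\big),\\
  \mathbf{n}'_i &\propto \big(d'^{1}_i\,\bar{\mathbf{x}}'^{1}_i - d'^{a}_i\,\bar{\mathbf{x}}'^{a}_i\big)
                 \times \big(d'^{2}_i\,\bar{\mathbf{x}}'^{2}_i - d'^{a}_i\,\bar{\mathbf{x}}'^{a}_i\big),
  \end{align*}

  followed by normalization of n'_i to unit length.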

(a) Orientation smoothness constraint: Once we compute the normal of each plane and the 3D coordinates of its anchor point, which lies on the plane, we estimate the depth of the plane as follows

(4)

The computed depth of the plane is then used to solve for the per-pixel depth in the next frame —assuming the intrinsic camera matrix is known [19, 10]. To encourage smoothness in the change of angles between adjacent planes (see Fig. 3), we define the orientation regularization as

(5)

where the weight is an empirical constant and the penalty is a truncated function with a scalar truncation parameter.
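A hedged sketch of what Eqs. (4)-(5) may look like under the assumed notation, with lambda_o and tau_o hypothetical names for the empirical weight and truncation threshold: the plane offset follows from the unit normal and the anchor point, and the orientation term penalizes, up to truncation, the disagreement between normals of neighboring planes.

  % Hedged sketch of Eqs. (4)-(5); the paper's exact expressions may differ.
  \begin{equation*}
  \delta'_i = \mathbf{n}'^{\top}_i \mathbf{X}'^{a}_i,
  \qquad
  E_{\mathrm{orient}} = \lambda_o \sum_{i} \sum_{j \in \mathcal{K}(i)}
     \min\!\big(1 - \mathbf{n}'^{\top}_i \mathbf{n}'_j,\; \tau_o\big).
  \end{equation*}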

(b) Shape smoothness constraint: In our representation, the dynamic scene is approximated by a collection of piecewise planar regions. Hence, the per-pixel depth obtained using Eq. (1) to Eq. (4) may exhibit discontinuities along the boundaries of the planes in 3D (see Fig. 3). To enforce smoothness of the 3D coordinates of adjacent planes along their region of separation, we define the shape smoothness constraint as

(6)

Here, the first set denotes the boundary pixels of a superpixel that are shared with boundary pixels of other superpixels, and the weight takes into account the color consistency of the plane along the boundary points —a weak continuity constraint [3]. Since all pixels within the same plane are assumed to share the same model, smoothness within a plane is not essential. Similar to the orientation regularization, the penalty is a truncated function with a scalar truncation parameter. The overall optimization procedure of our method is provided in Algorithm 1.
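Similarly, a hedged sketch of Eq. (6), with all symbols assumed: for every boundary pixel p shared by adjacent planes i and j, the 3D points predicted by the two plane models should agree, weighted by a color-consistency term w_p and truncated at a threshold tau_s.

  % Hedged sketch of Eq. (6); A(i) are the planes adjacent to plane i, B_ij their shared
  % boundary pixels, X'_p(i) the 3D point of pixel p under plane model i, and w_p a
  % color-consistency weight (e.g., exponential in the local color difference).
  \begin{equation*}
  E_{\mathrm{shape}} = \sum_{i} \sum_{j \in \mathcal{A}(i)} \sum_{p \in \mathcal{B}_{ij}}
     w_p \, \min\!\big(\big\| \mathbf{X}'_p(i) - \mathbf{X}'_p(j) \big\|_2,\; \tau_s\big).
  \end{equation*}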

  Input: two consecutive frames, the dense optical flow between them, the camera intrinsics, and the sparse depth prior for the reference frame.
  Output: Dense depth map for the next frame.
  1: Over-segment the reference frame into superpixels [1].
  2: Assign an anchor point for each superpixel and two other points in the same plane such that the three points are non-collinear (see Fig. 3).
  3: Use K-NN algorithm over superpixels to get the K-nearest neighbor index set.
  4: Solve for per-superpixel depth in the next frame
(7)
Note: The second constraint provides a trust region for the fast and proper convergence of this non-convex problem (Fig. 10); it can be thought of as a max/min restriction on the scene deformation (see the Python sketch after Algorithm 1).
  5: Estimate the normal of each plane in the next frame using Eq. (3).
  6: Estimate the depth of each plane using Eq. (4).
  7: Solve for the per-pixel depth in the next frame using each plane's depth, normal, and image coordinates.
  8: Refine the depth of the next frame by minimizing Eq. (5) and Eq. (6) with respect to depth and normal [14].
(8)
  9: (Optional) To generalize the idea to multiple frames, repeat the above steps, making the next frame the new reference frame and the new frame the next frame.
Algorithm 1: A Motion Free Approach
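The sketch below illustrates Step 4 (the ARAP stage) in Python. It is an illustrative reconstruction under assumed inputs and parameter names, not the authors' C++/MATLAB implementation, and it omits the TRW-S refinement: rays_ref/rays_next are (P, 3, 3) arrays of unit viewing rays for the anchor point and the two companion points of each of the P superpixels (next-frame rays come from the optical-flow-warped pixels and the intrinsics), depth_ref is a (P, 3) array of known reference-frame depths, neighbors is a (P, K) index array from a K-nearest-neighbor search over superpixel centers, and weights holds the image-distance weights. SciPy's SLSQP stands in for the SQP solver mentioned in the paper.

  import numpy as np
  from scipy.optimize import minimize

  def backproject(depths, rays):
      # 3D points = depth * unit viewing ray, shape (P, 3, 3).
      return depths[..., None] * rays

  def arap_cost(d_flat, rays_ref, rays_next, depth_ref, neighbors, weights):
      P = rays_ref.shape[0]
      X_ref = backproject(depth_ref, rays_ref)
      X_next = backproject(d_flat.reshape(P, 3), rays_next)
      cost = 0.0
      for i in range(P):
          # Intra-plane rigidity: the three sampled points keep their mutual distances.
          for a in range(3):
              for b in range(a + 1, 3):
                  cost += abs(np.linalg.norm(X_next[i, a] - X_next[i, b])
                              - np.linalg.norm(X_ref[i, a] - X_ref[i, b]))
          # Inter-plane ARAP term: anchor-to-anchor distances to the K nearest planes.
          for k, j in enumerate(neighbors[i]):
              cost += weights[i, k] * abs(np.linalg.norm(X_next[i, 0] - X_next[j, 0])
                                          - np.linalg.norm(X_ref[i, 0] - X_ref[j, 0]))
      return cost

  def solve_arap_depths(rays_ref, rays_next, depth_ref, neighbors, weights, eps=0.2):
      # eps bounds the admissible per-point depth change, mimicking the trust-region
      # constraint noted in Step 4; its value here is an assumption, not published.
      d0 = depth_ref.ravel()
      bounds = [((1.0 - eps) * d, (1.0 + eps) * d) for d in d0]
      res = minimize(arap_cost, d0, method="SLSQP", bounds=bounds,
                     args=(rays_ref, rays_next, depth_ref, neighbors, weights))
      return res.x.reshape(depth_ref.shape)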

4 Experimental Evaluation

We performed the experimental evaluation of our approach on two benchmark datasets, namely MPI Sintel [4] and KITTI [7]. These two datasets provide a complex and realistic environment to test and compare our dense depth estimation algorithm. We compared the accuracy of our approach against two recent state-of-the-art methods [19, 26] that use a geometric approach to solve dynamic-scene dense depth estimation from monocular images. These comparisons are performed using three different dense optical flow estimation algorithms, namely PWC-Net [27], FlowFields [2] and Full Flow [5]. All depth estimation accuracies are reported using the mean relative error (MRE) metric. Let $\hat{d}_i$ be the estimated depth and $d_i$ the ground-truth depth at point $i$; then the MRE is defined as

$\mathrm{MRE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{d}_i - d_i|}{d_i}$,    (9)

where $N$ denotes the total number of points. The statistical results for DMDE [26] and Superpixel Soup [19] are taken from their published work for comparison.
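For reference, a minimal numpy version of the MRE metric of Eq. (9), assuming d_est and d_gt are aligned arrays of estimated and ground-truth depths over the evaluated points:

  import numpy as np

  def mean_relative_error(d_est, d_gt):
      # MRE = (1/N) * sum |d_est - d_gt| / d_gt over the N evaluated points.
      return float(np.mean(np.abs(d_est - d_gt) / d_gt))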

Implementation Details:

We over-segment the reference frame into 1000-1200 superpixels using the SLIC algorithm [1] to approximate the scene. Almost all experiments use the same fixed settings of the empirical parameters (one fixed to 1 and another in the range 20-25). To compute dense optical flow correspondences between images, we used both traditional methods and a deep-learning framework, namely PWC-Net [27], FlowFields [2] and Full Flow [5]. The depth for the reference image is initialized using the Monodepth [8] model on the KITTI dataset and using the Superpixel Soup algorithm [19] on the MPI Sintel dataset. The reason for this inconsistent choice is that available deep-neural-network depth estimation models fail to provide reasonable depth estimates on the MPI dataset (see supplementary material). The proposed optimization is solved in two stages: first, Eq. (7) is optimized using the SQP [25] algorithm, and then Eq. (8) is optimized using the TRW-S [14] algorithm. The choice of optimizer is purely empirical, and the user may choose a different optimization algorithm to solve the same cost function. The algorithm is implemented in C++/MATLAB and takes 10-12 minutes on a commodity desktop computer to provide the results.

The evaluation is performed under two different experimental settings. In the first setting, given a sparse depth estimate (i.e., for three non-collinear points per superpixel) of a dynamic scene for the reference frame, we estimate the per-pixel depth for the next frame. In the second setting, we generalize this two-frame depth estimation to multiple frames by propagating the depth estimates over frames. For easy understanding, MATLAB code showing our ARAP idea on synthetic examples of a dynamic scene is provided in the supplementary material.

Figure 4: Depth results on the MPI Sintel dataset [4] for the next frame under the two-frame experimental setting. Our results and the ground-truth depth maps are shown in separate rows.

4.1 MPI Sintel

This dataset provides an ideal setting to evaluate depth estimation algorithms for complex dynamic scenes. It contains image sequences with considerable motion and severe illumination change. Moreover, the large number of non-planar scenes and non-rigid deformations makes it a suitable choice to test the piece-wise planar assumption. We selected seven scenes from the clean category of this dataset to test our method.

OF Methods DMDE [26] S. Soup [19] Ours
PWC Net [27] - - 0.1848
Flow Fields [2] 0.2970 0.1669 0.1943
Full Flow [5] - 0.1933 0.2144
Table 1: Comparison of dense depth estimation methods under the two consecutive frame setting against state-of-the-art approaches on the MPI Sintel dataset [4]. For consistency, the evaluations are performed using the mean relative error (MRE) metric.

(a) Two-frame results: For the two-frame case, the reference frame depth is initialized using the recently proposed Superpixel Soup algorithm [19]. The optical flow between the frames is computed using PWC-Net [27], Flow Fields [2] or Full Flow [5]. Table 1 shows the statistical performance of our method against other geometric approaches. The statistics clearly show that we can perform almost equally well without motion estimation. Qualitative results in this setting are shown in Fig. 4.

(b) Multi-frame results: In the multi-frame setting, only the depth for the first frame is initialized. The result obtained for the next frame is then used to estimate the dense depth maps of the upcoming frames. Since we are dealing with a dynamic scene, the environment changes gradually, and therefore the error starts to accumulate over frames. Fig. 9(a) reflects this propagation of error over frames. Qualitative results over multiple frames are shown in Fig. 5.

Figure 5: Results on MPI Sintel dataset [4] under multi-frame experimental setting. (a) Image frame for which the depth is initialized. (b) Depth estimation results using our method over frames.

4.2 KITTI

The KITTI dataset has emerged as a standard benchmark to evaluate the performance of dense depth estimation algorithms. It contains images of outdoor driving scenes with varying lighting conditions and large camera motion. We tested our algorithm on both the KITTI raw data and the KITTI 2015 benchmark. For the KITTI dataset, we used the Monodepth method [8] to initialize the reference frame depth. Dense optical flow correspondences are obtained using the same aforementioned methods. For consistency, the depth estimation error on the KITTI dataset is evaluated up to the same 50-meter depth cap used in [8].

Two-frame results: The KITTI 2015 scene flow dataset provides consecutive frame pairs of dynamic scenes for testing. Table 2 reports the depth estimation statistics of our algorithm in comparison to other competing methods. Here, our results are slightly better when using PWC-Net [27] optical flow with Monodepth [8] depth initialization. Fig. 6 shows qualitative results of our approach in comparison to Monodepth [8] for the next frame.

Multi-frame results: To test the performance of our algorithm in the multi-frame setting, we used the KITTI raw dataset, specifically from the city, residential and road categories. The depth for only the first frame is initialized using the Monodepth deep-learned model, and we then estimate the depth for the upcoming frames. Due to the very large per-frame displacement in the scene (around 150 pixels), the error accumulation curve for the KITTI dataset (Fig. 9(b)) is somewhat steeper than for MPI Sintel. Fig. 7 and Fig. 9(b) show the qualitative results and the depth error accumulation over frames on the KITTI raw dataset, respectively.

Figure 6: Results on the KITTI 2015 benchmark dataset under the two-frame experimental setting. Monodepth [8] results on the same sequence for the next frame are included for qualitative comparison.
OF Methods DMDE [26] S. Soup [19] Ours
PWC Net [27] - - 0.1182
Flow Fields [2] 0.1460 0.1268 0.1372
Full Flow [5] - 0.1437 0.1665
Table 2: Comparison of dense depth estimation under the two consecutive frame setting against state-of-the-art approaches on the KITTI dataset [7]. For consistency, the evaluations are performed using the mean relative error (MRE) metric. The results are better due to the Monodepth initialization of the reference frame.

5 Statistical Analysis

Besides the experimental evaluations under the aforementioned variable initializations, we also conducted other experiments to better understand the behavior of the proposed method. For the readers' convenience, these experiments use the synthetic example shown in Fig. 8; MATLAB code is provided in the supplementary material for reference.

(a) Effect of the number of superpixels: The number of superpixels used to approximate the dynamic scene can affect the performance of our method. Too few superpixels can yield poor depth results, whereas a very large number of superpixels increases the computation time. Fig. 9(c) shows how the depth estimation accuracy changes with the number of superpixels. The curve suggests that, for KITTI and MPI Sintel, 1000-1200 superpixels provide a reasonable approximation of the dynamic scenes.

(b) Effect of the number of nearest neighbors: The number of K-nearest neighbors used to define the local rigidity graph has a direct effect on the performance of the algorithm. Although the values used here work well for the tested benchmarks, this is purely an empirical parameter and can differ for a distinct dynamic scene. Fig. 9(d) shows the performance of the algorithm as a function of the number of nearest neighbors.

(c) Performance of the algorithm under noisy initialization: This experiment studies the sensitivity of the method to noisy depth initialization. Fig. 10(a) shows the change in the 3D reconstruction accuracy as the noise level varies from 1% to 9%. We introduced Gaussian noise using MATLAB's randn() function; the results are documented for the example shown in Fig. 8 after repeating the experiment 10 times and averaging. We observe that the quality of our results becomes debatable only when the noise level gets high.
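A small Python sketch of how such a perturbation can be reproduced (mirroring MATLAB's randn(); the function name and interface are our assumptions):

  import numpy as np

  def add_depth_noise(depth_ref, percent, seed=None):
      # Zero-mean Gaussian noise with a standard deviation equal to the given
      # percentage of each depth value (e.g., percent = 1 ... 9 as in Fig. 10(a)).
      rng = np.random.default_rng(seed)
      sigma = (percent / 100.0) * depth_ref
      return depth_ref + sigma * rng.standard_normal(depth_ref.shape)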

Figure 7: Results on the KITTI raw dataset under the multi-frame experimental setup. (a) Reference image for which the depth is initialized. (b) Dense depth results over frames using our algorithm.
Figure 8: Synthetic example used for an in-depth behavioral analysis of the ARAP model. Two objects deform independently on top of a rigid background motion, and the objects are at a finite separation from the background. For numerical details of this example, please see the supplementary material.
(a)
(b)
(c)
(d)
Figure 9: (a)-(b) Accumulation of error over frames for the MPI and KITTI datasets, respectively. (c) Change in the depth estimation accuracy w.r.t. the number of superpixels. (d) Variation in the depth accuracy as a function of the number of nearest neighbors.
(a)
(b)
(c)
(d)
Figure 10: (a) Depth results for the next frame with different levels of Gaussian noise in the reference-frame coordinate initialization. (b) Variation in the performance with the change in the value of the deformation bound for the synthetic example. (c) Convergence curve of the ARAP objective function. (d) Quick convergence with similar accuracy on the same example, achieved by using the restricted isometry constraint.

(d) Performance of the algorithm under the restricted isometry constraint: While minimizing the ARAP objective function under this constraint, we restrict the trust region of the optimization. This makes the algorithm work extremely well —both in terms of timing and accuracy— if approximate knowledge about the deformation that the scene may undergo is available a priori. Fig. 10(b) shows the 3D reconstruction accuracy as a function of the bound parameter for the example shown in Fig. 8. Clearly, if we have approximate knowledge about the scene transformation, we can obtain high accuracy in less time. Fig. 10(d) illustrates the quick convergence obtained by using this constraint over a suitable range of the bound parameter (a plausible form of the bound is sketched below).
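For concreteness, one plausible form of such a restricted-isometry-style trust region (an assumption on our part; epsilon is a hypothetical name for the bound parameter):

  % Hedged sketch of the trust-region constraint used alongside Eq. (7).
  \begin{equation*}
  (1 - \epsilon)\,\big\| \mathbf{X}_i - \mathbf{X}_j \big\|_2 \;\le\;
  \big\| \mathbf{X}'_i - \mathbf{X}'_j \big\|_2 \;\le\;
  (1 + \epsilon)\,\big\| \mathbf{X}_i - \mathbf{X}_j \big\|_2 .
  \end{equation*}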

(e) Nature of convergence of the proposed ARAP optimization:
1) Without restricted isometry constraint: The as-rigid-as-possible minimization alone is a good enough constraint to provide acceptable results. However, it may take a considerable number of iterations to do so. Fig. 10(c) shows the convergence curve.

2) With restricted isometry constraint: Employing an approximate bound on the deformation that the scene may undergo in the next time instance helps fast convergence with similar accuracy. Fig. 10(d) shows that the same accuracy can be achieved in 60-70 iterations.

6 Limitation and Discussion

Even though our method works well for diverse dynamic scenes, a few challenges remain with the formulation. First, a very noisy depth initialization for the reference frame can produce unsettling results. Second, our method is challenged by the sudden appearance or removal of dynamic subjects in the scene, in which case it may need reinitialization of the reference depth. Lastly, well-known limitations such as occlusion and temporal consistency, especially in regions close to the image boundaries, can also affect the accuracy of our algorithm.

Discussion: In our defense, we would like to state that motion-based methods for structure from motion are prone to noisy data as well. Algorithms such as motion averaging [9], M-estimators and random sampling [28] are quite often used to rectify their solutions.

(a) Why do we choose a geometric approach to initialize our algorithm on the MPI dataset? The LKVO network [30] is one of the top-performing networks for dense depth estimation on the KITTI dataset. Our application of this network to the MPI dataset produced unsatisfactory results. Qualitative results obtained using this network on the clean class, along with the training parameters, are provided in the supplementary material for reference.

(b) What do we gain or lose with our motion-free approach? Estimating all conceivable motions in a complex dynamic scene from images is a challenging task; in that respect, our method provides an alternative way to obtain per-pixel depth without estimating any 3D motion. However, in achieving this we allow gauge freedom between the frames (the temporal relations in 3D over frames are not recovered).

7 Conclusion

Estimating the per-pixel depth of a dynamic scene in which complex motions are prevalent is a challenging task. Quite naturally, previous methods rely on standard motion estimation techniques to solve this problem, which is in fact non-trivial for a non-rigid scene. In contrast, this paper introduces a new way to perceive this problem, one that removes motion estimation as a compulsory step. By closely observing the behavior of most real-world dynamic scenes, it can be inferred that they transform locally rigidly and globally as rigidly as possible. This observation allows us to propose a motion-free algorithm for dense depth estimation under a piece-wise planar approximation of the scene. Results on benchmark datasets show the competence of our 3D motion-free idea.

Acknowledgement.

This work is funded in part by the ARC Centre of Excellence for Robotic Vision (CE140100016), the ARC Discovery project on 3D computer vision for geo-spatial localisation (DP190102261), ARC DECRA project DE140100180, and the Natural Science Foundation of China (61420106007, 61871325).

References

  • [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 34, pages 2274–2282. IEEE, 2012.
  • [2] C. Bailer, B. Taetz, and D. Stricker. Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation. In IEEE International Conference on Computer Vision, pages 4015–4023, 2015.
  • [3] A. Blake and A. Zisserman. Visual reconstruction. MIT press, 1987.
  • [4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, pages 611–625. Springer, 2012.
  • [5] Q. Chen and V. Koltun. Full flow: Optical flow estimation by global optimization over regular grids. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4706–4714, 2016.
  • [6] R. Garg, B. V. Kumar, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016.
  • [7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. In Int. J. Rob. Res., volume 32, pages 1231–1237, Sept. 2013.
  • [8] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, page 7, 2017.
  • [9] V. M. Govindu. Motion averaging in 3d reconstruction problems. In Riemannian Computing in Computer Vision. To Appear. Springer, 2015.
  • [10] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
  • [11] M. Hornáček, F. Besse, J. Kautz, A. Fitzgibbon, and C. Rother. Highly overparameterized optical flow using patchmatch belief propagation. In European Conference on Computer Vision, pages 220–234. Springer, 2014.
  • [12] T. Igarashi, T. Moscovich, and J. F. Hughes. As-rigid-as-possible shape manipulation. In ACM transactions on Graphics, volume 24, pages 1134–1141. ACM, 2005.
  • [13] P. Ji, H. Li, Y. Dai, and I. Reid. “maximizing rigidity” revisited: A convex programming approach for generic 3d shape reconstruction from multiple perspective views. In ICCV, pages 929–937. IEEE, 2017.
  • [14] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006.
  • [15] S. Kumar. Jumping manifolds: Geometry aware dense non-rigid structure from motion. arXiv preprint arXiv:1902.01077, 2019.
  • [16] S. Kumar, A. Cherian, Y. Dai, and H. Li. Scalable dense non-rigid structure-from-motion: A grassmannian perspective. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [17] S. Kumar, Y. Dai, and H. Li. Spatio-temporal union of subspaces for multi-body non-rigid structure-from-motion. In Pattern Recognition, volume 71, pages 428–443. Elsevier, May 2017.
  • [18] S. Kumar, Y. Dai, and H. Li. Multi-body non-rigid structure-from-motion. In International Conference on 3D Vision (3DV), pages 148–156. IEEE, 2016.
  • [19] S. Kumar, Y. Dai, and H. Li. Monocular dense 3d reconstruction of a complex dynamic scene from two perspective frames. In IEEE International Conference on Computer Vision, pages 4649–4657, Oct 2017.
  • [20] S. Kumar, A. Dewan, and K. M. Krishna. A bayes filter based adaptive floor segmentation with homography and appearance cues. In Proceedings of the Eighth Indian Conference on Computer Vision, Graphics and Image Processing, page 54. ACM, 2012.
  • [21] S. Kumar, M. S. Karthik, and K. M. Krishna. Markov random field based small obstacle discovery over images. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 494–500. IEEE, 2014.
  • [22] H. Li. Multi-view structure computation without explicitly estimating motion. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2777–2784. IEEE, 2010.
  • [23] E. Malis and M. Vargas. Deeper understanding of the homography decomposition for vision-based control. PhD thesis, INRIA, 2007.
  • [24] J. Noraky and V. Sze. Depth estimation of non-rigid objects for time-of-flight imaging. In IEEE International Conference on Image Processing, pages 2925–2929. IEEE, 2018.
  • [25] M. J. Powell. A fast algorithm for nonlinearly constrained optimization calculations. In Numerical analysis, pages 144–157. Springer, 1978.
  • [26] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4058–4066, 2016.
  • [27] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [28] P. H. Torr and D. W. Murray. The development and comparison of robust methods for estimating the fundamental matrix. International Journal of Computer Vision, 24(3):271–300, 1997.
  • [29] A. Vedaldi and S. Soatto. Quick shift and kernel methods for mode seeking. In European Conference on Computer Vision, pages 705–718. Springer, 2008.
  • [30] C. Wang, J. Miguel Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In IEEE Conference on Computer Vision and Pattern Recognition, June 2018.
  • [31] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, page 7, 2017.