Superpixel Soup: Monocular Dense 3D Reconstruction of a Complex Dynamic Scene

11/19/2019 ∙ by Suryansh Kumar, et al. ∙ ETH Zurich, Australian National University

This work addresses the task of dense 3D reconstruction of a complex dynamic scene from images. The prevailing approach to this task is composed of a sequence of steps and depends on the success of several pipelines in its execution. To overcome the limitations of such existing algorithms, we propose a unified approach. We assume that a dynamic scene can be approximated by numerous piecewise planar surfaces, where each planar surface enjoys its own rigid motion and the global change in the scene between two frames is as-rigid-as-possible (ARAP). Consequently, our model of a dynamic scene reduces to a soup of planar structures and the rigid motions of these local planar structures. Using a planar over-segmentation of the scene, we reduce this task to solving a "3D jigsaw puzzle" problem: correctly assembling each rigid piece to construct a 3D shape that complies with the geometry of the scene under the ARAP assumption. Further, we show that our approach provides an effective solution to the inherent scale ambiguity in structure-from-motion under perspective projection. We provide extensive experimental results and evaluation on several benchmark datasets. Quantitative comparison with competing approaches shows state-of-the-art performance.




1 Introduction

The task of reconstructing the 3D geometry of a scene from images, popularly known as structure-from-motion (SfM), is a fundamental problem in computer vision. Initial introductions and working solutions to this problem can be found as early as the 1970s and 1980s [ullman1979interpretation] [grimson1981images] [longuet1981computer], which Blake et al. discussed comprehensively in their seminal work [blake1987visual]. While this field of study was in the past largely dominated by sparse feature-based reconstruction of rigid scenes [hartley1997triangulation] [hartley1997defense] [hartley2003multiple] [tomasi1993pictures] [tomasi1992shape] and of non-rigid objects [bregler2000recovering] [dai2014simple] [lee2013procrustean] [kumar2016multi] [kumar2017spatio], in recent years, with the surge in computational resources, dense 3D reconstruction of the scene has been introduced and successfully demonstrated [newcombe2015dynamicfusion] [newcombe2011dtam] [ranftl2016dense].

A dense solution to this inverse problem is essential due to its increasing demand in many real-world applications, from the animation and entertainment industry to robotics (VSLAM). In particular, the proliferation of monocular cameras in almost all modern mobile devices has elevated the demand for sophisticated dense reconstruction algorithms. When the scene is static and the camera is moving, 3D reconstruction from images can be achieved using conventional rigid structure-from-motion techniques [hartley2003multiple] [agarwal2011building] [schoenberger2016sfm] [schoenberger2016mvs]: under such settings, elegant geometrical constraints can help explain the camera's motion [hartley1997defense] [govindu2001combining], which is later used to realize the dense 3D reconstruction of the scene [schoenberger2016sfm] [schoenberger2016mvs] [newcombe2011dtam] [triggs1999bundle]. In contrast, modeling an arbitrary dynamic scene can be very challenging, and such geometrical constraints may fail when multiple rigidly moving objects are observed by a moving camera. Although each individual rigid object can be reconstructed up to an arbitrary scale (assuming motion segmentation is provided), the reconstruction of the whole dynamic scene is generally impossible, simply because the relative scales among all the moving shapes cannot be determined in a globally consistent way. Furthermore, since all the estimated motions are relative to each other, one cannot distinguish camera motion from object motion. Therefore, prior information about the objects, or the scene, and their relation to the frame of reference is used to fix the placement of these objects relative to each other.

Fig. 1: Dense 3D reconstruction of a complex dynamic scene, where both the camera and the objects are moving with respect to each other. The top left shows a sample reconstruction on the messi sequence from the YouTube Objects dataset [prest2012learning]. The top right shows the reconstruction on the alley_1 sequence from the MPI Sintel dataset [butler2012naturalistic].

Hence, from the above discussion, it can be argued that 3D reconstruction of a general dynamic scene is non-trivial. Nevertheless, it is an important problem to solve, as many real-world applications need a reliable solution to it. Consider, for example, understanding a traffic scene: a typical outdoor traffic scene consists of both the multiple rigid motions of vehicles and the non-rigid motion of pedestrians. To model such scenarios, it is important to have an algorithm that can provide dense 3D information from images.

Recently, Ranftl et al. [ranftl2016dense] proposed a three-step approach to procure dense 3D reconstruction of a general dynamic scene using two consecutive perspective frames. Concretely, it performs object-level motion segmentation, followed by per-object 3D reconstruction, and finally solves for scale ambiguity. In a general dynamic setting, however, the task of densely segmenting rigidly moving objects or parts is not trivial. Consequently, inferring motion models for deforming shapes becomes very challenging. Furthermore, object-level segmentation, which builds upon the assumption of multiple rigid motions, fails to describe more general scenarios such as when the objects themselves are deforming. 3D reconstruction algorithms that depend on the motion segmentation of objects suffer accordingly.

Motivated by such limitations, we propose a unified approach that neither performs any object-level motion segmentation nor assumes any prior knowledge about the scene rigidity type, and is still able to recover a scale-consistent dense reconstruction of a complex dynamic scene. Our formulation naturally encapsulates a solution to the inherent scale ambiguity in perspective structure-from-motion, which is a very challenging problem in general. We show that by using two prior assumptions, about the 3D scene and about the deformation, we can effectively pin down the unknown relative scales and obtain a globally consistent dense 3D reconstruction of a dynamic scene from its two perspective views. The two basic assumptions we make about the dynamic scene are:

  1. The dynamic scene can be approximated by a collection of piecewise planar surfaces each having its own rigid motion.

  2. The deformation of the scene between two frames is locally-rigid but globally as-rigid-as-possible.


  • Piecewise planar model: Our method models a dynamic scene as a collection of piecewise planar regions. Given two perspective images (a reference image and a next image) of a general dynamic scene, our method first over-segments the reference image into superpixels. This collection of superpixels is assumed to approximate the dynamic scene in projective space. One may argue that modeling the dynamic scene per pixel could be more compelling; however, modeling the scene using planar regions makes the problem computationally tractable for optimization and inference [bleyer2011object, vogel20153d].

  • Locally rigid and globally as-rigid-as-possible: We implicitly assume that each local plane undergoes a rigid motion. Supposing every individual superpixel corresponds to a small planar patch moving rigidly in 3D space, and that dense optical flow between the frames is given, we can estimate its location in 3D using a rigid reconstruction pipeline [hartley2003multiple, vogel20113d]. Since the relative scales of these patches are not determined, they float in 3D space as an unorganized superpixel soup. Under the assumption that the change between the frames is not arbitrary but rather regular or smooth, the scene can be assumed to change as rigidly as possible globally. Using this intuition, our method finds for each superpixel an appropriate scale under which the entire set of superpixels can be assembled (glued) together coherently, forming a piecewise-smooth surface, as if playing a game of "3D jigsaw puzzle". Hence, we call our method the "SuperPixel Soup" algorithm (see Fig. 2 for a conceptual visualization).

In this paper, we show that the aforementioned assumptions can faithfully model most real-world dynamic scenarios. Furthermore, we encapsulate these assumptions in a simple optimization problem, which is solved using a combination of continuous and discrete optimization algorithms [benson2002interior, benson2014interior, kolmogorov2006convergent]. We demonstrate the benefit of our approach on available benchmark datasets such as KITTI [geiger2013vision], MPI Sintel [butler2012naturalistic] and Virtual KITTI [gaidon2016virtual]. The statistical comparison shows that our algorithm outperforms many available state-of-the-art methods by a significant margin.

Fig. 2: Reconstructing a 3D surface from a soup of un-scaled superpixels via solving a 3D Superpixel Jigsaw puzzle problem.

2 Related Works

The solution to SfM has undergone prodigious development since its inception [ullman1979interpretation]. Even after such remarkable development in this field, the choice of algorithm depends on the complexity of the object motion and the environment. In this work, we utilize the idea of local rigidity to solve dense reconstruction of a general dynamic scene. The concept of rigidity is not new in the structure-from-motion problem [ullman1979interpretation] [longuet1987computer] and has been effectively applied as a global constraint to solve large-scale reconstruction problems [agarwal2011building]. The idea of global rigidity has also been exploited to solve for structure and motion over multiple frames via a factorization approach [tomasi1992shape].

The literature on structure from motion and its treatment of different scenarios is very extensive. Consequently, for brevity, we only discuss the previous works that are of direct relevance to dynamic 3D reconstruction from monocular images. The linear low-rank model has been used for dense non-rigid reconstruction. Kumar et al. [kumar2018scalable, Kumar_2019_CVPR] and Garg et al. [garg2013dense] solved the task with an orthographic camera model, assuming feature matches across multiple frames are given as input. Fayad et al. [fayad2010piecewise] recovered deformable surfaces with a quadratic approximation, again from multiple frames. Taylor et al. [taylor2010non] proposed a piecewise rigid solution using locally-rigid SfM to reconstruct a soup of rigid triangles.

While the method of Taylor et al. [taylor2010non] is conceptually similar to ours, there are major differences:

  1. We achieve two-view dense reconstruction, while [taylor2010non] relies on multiple views.

  2. We use a perspective camera model, while they rely on an orthographic camera model.

  3. We solve the scale-indeterminacy issue, which is an inherent ambiguity for 3D reconstruction under perspective projection. The method of Taylor et al. [taylor2010non] does not suffer from this, but at the cost of being restricted to the orthographic camera model.

Recently, Russell et al. [russell2014video] and Ranftl et al. [ranftl2016dense] used object-level segmentation for dense dynamic 3D reconstruction. In contrast, our method is free from object segmentation, and hence circumvents the difficulty associated with motion segmentation in a dynamic setting.

The template-based approach is yet another method for deformable surface reconstruction. Yu et al. [yu2015direct] proposed a direct approach to capture the dense, detailed 3D geometry of generic, complex non-rigid meshes using a single RGB camera. While it works for generic surfaces, the requirement of a template prevents its wider application to more general scenes. Wang et al. [wang2016template] introduced a template-free approach to reconstruct a poorly-textured, deformable surface. Nevertheless, its success is restricted to a single deforming surface rather than an entire dynamic scene. Varol et al. [varol2009template] reconstructed deformable surfaces based on a piecewise reconstruction, assuming overlapping patches to be consistent over the entire surface, but again limited to the reconstruction of a single deformable surface.

While the conceptual idea of our work appeared in ICCV 2017, this journal version provides: (i) an in-depth realization of our overall optimization; (ii) qualitative comparison with [ranftl2016dense] and Video-PopUp [russell2014video], as well as statistical comparison with a deep-learning method [zhou2017unsupervised]; (iii) a comprehensive ablation test showing the importance of each term in the overall optimization; (iv) extensive performance analysis showing the effect of varying the number of superpixels, the choice of k-nearest neighbors, the choice of dense optical flow algorithm, and the shape of the superpixels; (v) a detailed discussion of failure cases, of the choice of the Euclidean metric for nearest-neighbor graph construction, and of the limitations of our work, with possible directions for improvement.

3 Motivation and Contribution

The formulation proposed in this work is motivated by the following challenges in dense structure-from-motion of a dynamic scene.

3.1 Object level motion segmentation

To solve dense reconstruction of an entire dynamic scene from perspective images, the usual first step is to perform object-level motion segmentation in order to infer distinct motion models for the multiple rigidly moving objects in the scene. As alluded to before, dense segmentation of moving objects in a dynamic scene is itself a challenging task. Moreover, non-rigidly moving objects may themselves be composed of a union of distinct motion models. Therefore, object-level segmentation built upon the assumption of per-object rigid motion will fail to describe a general dynamic scene. This motivates us to develop an algorithm that can recover a dense, detailed 3D model of a complex dynamic scene from its two perspective images, without object-level motion segmentation as an essential intermediate step.

3.2 Separate treatment for rigid SfM and non-rigid SfM

Our investigation shows that algorithms for deformable object 3D reconstruction often differ from those for rigidly moving objects. Not only the solutions, but even the assumptions vary significantly, e.g., orthographic projection and low-rank shape [bregler2000recovering] [dai2014simple] [lee2013procrustean] [kumar2017spatio]. The reason for such divergence is perfectly valid, given the under-constrained nature of the problem itself. This motivated us to develop an algorithm that can provide 3D reconstruction of an entire dynamic scene, including non-rigidly deforming objects, under the same assumptions and formulation.

Although accomplishing this goal for arbitrary non-rigid deformation remains an open problem, experiments suggest that our framework, under the aforementioned assumptions about the scene and the deformation, can reconstruct a general dynamic scene irrespective of the scene rigidity type. This is thanks to recent advances in dense optical flow algorithms [bailer2015flow] [chen2016full], which can reliably capture smooth non-rigid deformation across frames. These robust dense optical flow algorithms allow us to exploit the local motion of deforming surfaces. Thus, our formulation is able to bridge the gap between rigid and non-rigid SfM.

The main contributions of our work are as follows:

  1. A framework that decouples dense 3D reconstruction of a complex dynamic scene from object-level motion segmentation.

  2. A common framework for dense two-frame 3D reconstruction of a complex dynamic scene (including deformable objects), which achieves state-of-the-art performance.

  3. A new idea to resolve the inherent relative scale ambiguity problem in monocular 3D reconstruction by exploiting the as-rigid-as-possible (ARAP) constraint [sorkine2007rigid].

4 Outline of the Algorithm

Before providing the details of our algorithm, we would like to introduce some common notations that are used throughout the paper.

4.1 Notation

We represent two consecutive images as the reference image and the next image, respectively. Vectors are represented by bold lower-case letters and matrices by bold upper-case letters. The subscripts 'a' and 'b' denote the anchor point and a boundary point, respectively; for example, they identify the anchor point and boundary points of a superpixel in the image space. The 1-norm and 2-norm of a vector, and the Frobenius norm of a matrix, are written in the standard way.

4.2 Overview

We first over-segment the reference image into superpixels, then model the deformation of the scene by a union of piecewise rigid motions of these superpixels. Specifically, we divide the overall non-rigid reconstruction into a local rigid reconstruction of each superpixel, followed by an assembly process which glues all these individual local reconstructions together in a globally coherent manner. While the concept of this divide-and-conquer procedure looks simple, there is a fundamental difficulty of scale indeterminacy in its implementation. Scale indeterminacy refers to the well-known fact that with a moving camera one can only recover 3D structure up to an unknown scale. In our method, the individual rigid reconstruction of each superpixel can only be determined up to an unknown scale, and the assembly of the entire non-rigid scene is possible only if these relative scales among the superpixels are resolved, which is itself a challenging open task.
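To make the scale-assembly idea concrete, here is a minimal numerical sketch in Python with NumPy. All points, motions, and the per-patch grid search are invented for illustration and are not the paper's actual solver: three patch anchors are reconstructed per patch up to scale, and the unknown scale of one patch is recovered by enforcing the ARAP prior that inter-anchor distances are preserved across the two frames.

```python
import numpy as np

# Toy illustration: recover the unknown relative scale of one superpixel
# patch from the ARAP prior (inter-anchor distances preserved over time).

theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.2, 0.0, 0.0])

# True 3D anchor points of three planar patches in the reference frame.
A1, B1, C1 = np.array([0., 0., 5.]), np.array([1., 0., 5.]), np.array([-1., 0.5, 4.])
# The whole scene moves rigidly to the next frame (globally as-rigid-as-possible).
A2, B2, C2 = (R @ A1 + t), (R @ B1 + t), (R @ C1 + t)

# Per-patch two-view reconstruction returns each patch only up to scale.
# Assume A and C are already scale-resolved; B carries an unknown scale of 2.
s_true = 2.0
B1_hat, B2_hat = B1 / s_true, B2 / s_true

def arap_residual(s):
    """Squared violation of distance preservation along B's K-NN edges."""
    r = 0.0
    for P1, P2 in [(A1, A2), (C1, C2)]:
        d_ref = np.linalg.norm(P1 - s * B1_hat)   # edge length, frame 1
        d_nxt = np.linalg.norm(P2 - s * B2_hat)   # edge length, frame 2
        r += (d_ref - d_nxt) ** 2
    return r

grid = np.linspace(0.1, 5.0, 4901)
best = grid[np.argmin([arap_residual(s) for s in grid])]
print(round(best, 2))  # recovers the unknown scale, ~2.0
```

In the full method, the scales of all superpixels are solved jointly within the energy of §4.4 rather than by a per-patch grid search.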

In this paper, we show how this can be done using two very mild assumptions (§3.2). Under these assumptions, our method solves for the unknown relative scales and obtains a globally-coherent dense 3D reconstruction of a complex dynamic scene from its two perspective views.

4.3 Problem Statement

To implement the above idea of piecewise rigid reconstruction, we first partition the reference image into a set of superpixels, where each superpixel is parametrized by its boundary pixels and an anchor point corresponding to the centroid of the superpixel in the image plane. Such a superpixel partition of the image plane naturally induces a piecewise-smooth over-segmentation of the corresponding 3D scene surface. Although surfel is perhaps a better term, we nevertheless call each element of this set a "3D superpixel" for ease of exposition. We further assume each 3D superpixel is a small 3D planar patch, parameterized by a surface normal, a 3D anchor point, and 3D boundary points (i.e., the pre-images of the 2D anchor and boundary points). We assume every 3D superpixel moves rigidly, with a relative rotation, a translation direction, and an unknown scale.

With our notation introduced, we can state our idea more precisely: given two intrinsically calibrated perspective images of a generally dynamic scene and the corresponding dense optical flow field, our task is to reconstruct a piecewise-planar approximation of the dynamic scene surface. The deformable scene surface in the reference frame and the one in the second frame are parametrized by their respective 3D superpixels, where each superpixel is described by its surface normal and an anchor point. Any 3D plane can be determined by an anchor point and a surface normal. If one can estimate the correct placement of all the 3D anchor points and all the surface normals corresponding to the reference frame, the problem is solved, since each superpixel in the second frame is related to its counterpart in the reference frame via a locally rigid transformation.

The overall procedure of our method is presented in Algorithm 1.

  Input: Two consecutive image frames of a dynamic scene and dense optical flow correspondences between them.
  Output: 3D reconstruction for both images.
  1. Divide the reference image into superpixels and construct a K-NN graph to represent the entire scene as a graph defined over these superpixels §4.4.
  2. Employ two-view epipolar geometry to recover the rigid motion and shape for each 3D superpixel §4.5.
  3. Optimize the proposed energy function to assemble (or glue) and align all the reconstructed superpixels (“3D Superpixel Jigsaw Puzzle”) §4.5.2.
  Note: The procedure of the above algorithm looks simple; there is, however, a fundamental difficulty of scale indeterminacy in its execution.
Algorithm 1 :  SuperPixel Soup

4.4 Formulation

We begin by briefly reiterating our representation. We partition the reference image into a set of superpixels, whose corresponding set in the 3D world is a set of planar patches; the respective sets for the next frame are defined equivalently. The mapping of each element between the reference frame and the next frame differs by a rigid transformation in SE(3), the special Euclidean group. In our formulation, each 3D plane is described by its geometric and motion parameters, with the total number of superpixels fixed by the over-segmentation (see Fig. 3). Similarly, in the image space the two frames are related per superpixel through the plane-induced homography [hartley2003multiple] (the unknown scale appears in both the numerator and the denominator, so it does not affect the homography transformation). Here, the homography involves the intrinsic camera matrix and the depth of the plane. Using these notations and definitions, we build a K-NN graph.

Build a K-NN graph: Using the over-segmentation of the reference image (which is the projection of a set of 3D planes) and the Euclidean distance metric, we construct a K-NN graph in the image space connecting each anchor point to its K nearest anchor points. The graph vertices are the anchor points, which connect to other anchor points via graph edges. The distance between any two vertices is taken as the Euclidean distance between them. Here, we assume the Euclidean distance is a valid graph metric to describe the edge length between any two local vertices; such an assumption is valid for local compactness (Euclidean spaces are locally compact). Interested readers may refer to [burago2001course] [williamson1987constructing] [whiteley2004rigidity] for comprehensive details. 'K' is the number of nearest neighbors used to construct the local graph structure. This K-NN graph relation helps to constrain the motion and continuity of the scene (defined in terms of optical flow and depth). To impose a hard constraint, we build the K-NN graph using anchor points beyond the immediate neighbors (Fig. 4).
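A minimal sketch of this construction (the anchor coordinates below are hypothetical; in the paper, the anchors are superpixel centroids):

```python
import numpy as np

def build_knn_graph(anchors, k):
    """Edge list (i, j): each anchor connects to its k nearest anchors
    under the Euclidean metric in image space."""
    anchors = np.asarray(anchors, dtype=float)
    d = np.linalg.norm(anchors[:, None, :] - anchors[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # no self-edges
    nn = np.argsort(d, axis=1)[:, :k]    # k nearest neighbors per vertex
    return [(i, int(j)) for i in range(len(anchors)) for j in nn[i]]

# Hypothetical superpixel anchor points (image coordinates).
anchors = [(10, 10), (12, 11), (40, 42), (41, 40), (80, 15)]
edges = build_knn_graph(anchors, k=2)
print(edges[:2])  # vertex 0 connected to its two closest anchors
```

A k-d tree would replace the dense distance matrix for the 1,000-2,000 superpixels used in practice; the brute-force version above keeps the sketch dependency-free.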

This K-NN graph is crucial in the establishment of the local rigidity constraint, which is the basis of our assumption. This graph structure allows us to enforce our assumption, i.e., that the shape is as rigid as possible globally and rigid locally.

Fig. 3: Illustration of the modeling of a continuous scene under a piecewise rigid and planar assumption. Each superpixel is composed of a set containing geometric parameters (the normal, anchor point, and boundary points of a plane in 3D) and motion parameters (i.e., rotation and translation).

As-Rigid-As-Possible (ARAP) Energy Term: Our method is built on the idea that the correct scales of 3D superpixels can be estimated by enforcing prior assumptions that govern the deformation of the dynamic surface. Specifically, we require that, locally, the motion that each 3D-superpixel undergoes is rigid, and globally the entire dynamic scene surface must move as rigid as possible (ARAP). In other words, while the dynamic scene is globally non-rigid, its deformation must be regular in the sense that it deforms as rigidly as possible. To implement this idea, we define an ARAP-energy term as:


Here, the first term favors smooth motion between the local neighbors, while the second term encourages inter-node distances between the anchor node and its K nearest neighbor nodes to be preserved before and after motion (hence as-rigid-as-possible; see Fig. 4). We define the weighting parameters as:


These weights are set to be inversely proportional to the distance between two superpixels. This reflects our intuition that the further apart two superpixels are, the weaker the energy is. Although there may be redundant information in these two terms w.r.t. scale estimation, we keep them for motion refinement §4.5.2. Note that this term is only defined over anchor points; hence, it enforces no depth smoothness along boundaries. The weighting term advocates local rigidity by penalizing over the distance between anchor points. This allows immediate neighbors to have smooth deformation over time. Also, note that the ARAP energy is generally non-convex. This non-convexity is due to the second term in Eq. 1, where we have a minus sign between two norm terms. The constant in Eq. 2 is empirical.
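One plausible form of such a distance-based weight is sketched below. The exact functional form and its normalization are assumptions for illustration; the paper only states that the weight decays with the anchor distance.

```python
import numpy as np

# Hypothetical inverse-distance weight between two superpixel anchor points.
def arap_weight(xa_i, xa_j):
    dist = np.linalg.norm(np.asarray(xa_i, float) - np.asarray(xa_j, float))
    return 1.0 / (1.0 + dist)  # 1 for coincident anchors, decays with distance

w_near = arap_weight((10, 10), (12, 11))
w_far = arap_weight((10, 10), (200, 180))
print(w_near > w_far)  # True: nearby superpixels interact more strongly
```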

The ARAP term alone is good enough to provide reasonably correct scales; however, the piecewise planar composition of a continuous 3D space creates discontinuities near the boundaries of each plane. For this reason, we incorporate additional constraints to fix this depth discontinuity and further refine the motion and geometry of each superpixel via neighboring relations. We call these constraints the Planar Re-projection, 3D Continuity, and Orientation energy constraints.

Fig. 4: Demonstration of the as-rigid-as-possible constraint. Superpixel segmentation in the reference frame is used to decompose the entire scene into a set of anchor points. The schematic shows the construction of the K-NN graph around a particular anchor point (shown in red). We constrain the local 3D coordinate transformation both before and after motion (green shows the K-NN relation in the reference frame, yellow shows the relation in the next frame, after motion). We want this transformation to be as rigid as possible.

Planar Re-projection Energy Term: With the assumption that each superpixel represents a plane in 3D, it must satisfy the corresponding planar re-projection constraint in the 2D image space. This re-projection cost reflects the average dissimilarity in the optical flow correspondences across the entire superpixel due to motion. Therefore, it helps us constrain the surface normal, rotation, and translation direction such that they obey the observed planar homography in the image space. To index the pixels inside a superpixel, we use an operator that returns the coordinates of each pixel inside that superpixel. Using it, we define


Here, the two pixel coordinates are the optical flow correspondence of a pixel inside the superpixel in the reference frame and the next frame, respectively. The cardinality operator gives the number of pixels in a set, and the trade-off scalar is chosen empirically. A natural question may arise: since this term is independent of scale, what is its purpose, and how does it help? Kindly refer to §4.5.2 for details.
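The homography-based residual can be sketched as follows. The camera, plane, pixel set, and flow correspondences are synthetic stand-ins, and the convention H = K (R + t nᵀ/d) K⁻¹ assumes points on the plane satisfy nᵀX = d:

```python
import numpy as np

# Synthetic intrinsics (an assumption for this sketch).
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])

def plane_homography(R, t, n, d):
    """Plane-induced homography for a plane n^T X = d moving by (R, t)."""
    return K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)

def reprojection_error(H, pixels, flow_targets):
    """Mean 2D distance between H-warped pixels and their flow matches."""
    err = 0.0
    for (u, v), (u2, v2) in zip(pixels, flow_targets):
        p = H @ np.array([u, v, 1.0])
        p = p[:2] / p[2]                      # dehomogenize
        err += np.linalg.norm(p - np.array([u2, v2]))
    return err / len(pixels)                  # average over the superpixel

# Fronto-parallel plane at depth d = 4, pure translation along x.
R, t = np.eye(3), np.array([0.4, 0.0, 0.0])
n, d = np.array([0.0, 0.0, 1.0]), 4.0
H = plane_homography(R, t, n, d)
pixels = [(300., 220.), (310., 230.), (305., 225.)]
flow = [tuple((H @ np.array([u, v, 1.0]))[:2] / (H @ np.array([u, v, 1.0]))[2])
        for (u, v) in pixels]                 # ideal optical flow targets
print(reprojection_error(H, pixels, flow))   # ~0 when motion/geometry agree
```

A wrong geometry hypothesis (e.g., the same plane at the wrong depth) warps the pixels away from their flow matches and inflates the residual, which is what lets this term constrain the normal, rotation, and translation direction.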

3D Continuity Energy Term: In the case of a dynamic scene, where both the camera and the objects are in motion, it is quite apparent that the scene will undergo some changes across frames. Hence, assuming unremitting global continuity under a piecewise-planar model of a dynamic scene is unreasonable. Instead, a local weak continuity constraint can be enforced, a constraint that can be broken occasionally [hinton1977relaxation], i.e., local planes are connected to a few of their neighbors. Accordingly, we want to allow local neighbors to be piecewise continuous. To favor this continuous or smooth surface reconstruction, we require neighboring superpixels to have a smooth depth transition at their boundaries. To do so, we define a 3D continuity energy term as:

Fig. 5: The 3D continuity energy favors a continuous surface for planes that share common boundary points. a)-d) The smaller the energy, the smoother the surface becomes (the color bar shows the energy).

where the two matrices collect the corresponding boundary points in the 2D image space and in 3D Euclidean space, with one column per boundary pixel of the superpixel. Since in our representation geometry and motion are shared among all pixels within a superpixel, regularization within the superpixel is not explicitly needed. Thus, we only concentrate on the shared boundary pixels to regularize our energy. Note that the neighboring relationship here is different from the one in the ARAP term: here, neighbors share common boundaries with each other.

To encourage the geometry to be approximately smooth locally where the object has a similar appearance, we color-weight the energy term along the boundary pixels. For each boundary pixel of a given superpixel, we consider its 4-connected neighboring pixels for the weighting. Using this idea, we obtain:


which weighs the inter-plane transition by color difference. The weighting is computed over the set containing the 4-connected pixels of each boundary pixel shared with the neighboring superpixel. The color-based weighting term plays an important role in allowing for the "weak continuity constraint", i.e., it gradually allows for occasional discontinuity [hinton1977relaxation] [blake1983least].

To better understand the implication of this constraint, consider two boundary points in the image space. Generally, if these two points lie on different planes, they will not coincide in 3D space before and after motion. Hence, we compute the 3D distance between boundary pixels in both the reference frame and the next frame, which leads to our goal of penalizing distance along shared edges (see Fig. 5). This term therefore ensures that the 3D coordinates across superpixel boundaries are continuous in both frames. The challenge here is to reach a satisfactory solution for overall scene continuity, almost everywhere in both frames [blake1987visual]. The truncation function in Eq. 4 caps the penalty, and, similar to Eq. 2, the constant in Eq. 5 is chosen empirically.
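A minimal sketch of the truncated, color-weighted boundary penalty. The threshold, the color scale, and the exponential weight are assumed constants for illustration, not values from the paper:

```python
import numpy as np

def truncated(x, tau):
    """Robust penalty: grows with x but is capped at tau, so an occasional
    depth discontinuity ('weak continuity') is not penalized without bound."""
    return min(x, tau)

def boundary_continuity(Xb_i, Xb_j, color_i, color_j, tau=0.5, gamma=10.0):
    """Penalty between matched 3D boundary points of two neighboring planes,
    down-weighted when the colors across the shared boundary differ."""
    w = np.exp(-gamma * np.linalg.norm(np.asarray(color_i) - np.asarray(color_j)))
    gap = np.linalg.norm(np.asarray(Xb_i, float) - np.asarray(Xb_j, float))
    return w * truncated(gap, tau)

# Small 3D gap, identical colors: strongly penalized until closed.
same = boundary_continuity([1, 1, 4.0], [1, 1, 4.1],
                           (0.2, 0.2, 0.2), (0.2, 0.2, 0.2))
# Large 3D gap across an appearance edge: weight and truncation forgive it.
edge = boundary_continuity([1, 1, 4.0], [1, 1, 6.0],
                           (0.2, 0.2, 0.2), (0.9, 0.1, 0.1))
print(edge < same)  # True
```

Note how the large depth jump across the appearance edge incurs less penalty than the small gap between similar colors, which is exactly the weak-continuity behavior the term is designed to allow.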

Orientation Energy Term: To encourage smoothness in the orientation of neighboring planes, we add one more geometric constraint, defined as follows.


Here, the neighbor index is the same as in the 3D continuity term, and a truncated penalty function is used. Intuitively, it encourages similarity between neighboring normals and truncates any value beyond the threshold.
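A sketch of this term for a single neighbor pair (the dissimilarity measure and the threshold value are assumptions for illustration):

```python
import numpy as np

def orientation_penalty(n_i, n_j, tau=0.5):
    """Truncated penalty on the dissimilarity of two plane normals."""
    n_i = np.asarray(n_i, float); n_j = np.asarray(n_j, float)
    n_i = n_i / np.linalg.norm(n_i); n_j = n_j / np.linalg.norm(n_j)
    dissim = 1.0 - float(n_i @ n_j)   # 0 when parallel, larger when tilted
    return min(dissim, tau)           # truncate values beyond tau

print(orientation_penalty([0, 0, 1], [0, 0, 1]))  # 0.0 for parallel normals
print(orientation_penalty([0, 0, 1], [1, 0, 0]))  # 0.5, capped at tau
```

The truncation keeps genuine orientation discontinuities (e.g., a wall meeting the ground) from being penalized without bound, in the same spirit as the 3D continuity term.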

Combined Energy Function: Equipped with all these constraints, we define our overall energy function to obtain a scale-consistent 3D reconstruction of a complex dynamic scene. Our goal is to estimate the depth, surface normal, and scale for each 3D planar superpixel. The key is to estimate the unknown relative scales. We solve this by minimizing the following energy function:


The equality constraint on the scales fixes the unknown freedom of a global scale. The constraint on the rotation is imposed to restrict the rotation matrix to lie on the SO(3) manifold. In our formulation, the rotation matrix represents the combined Euler 3D angles. Although there are other efficient representations for 3D rotation, we use the matrix representation as it arises naturally from the epipolar geometric constraint; hence, further post-conversion steps can be avoided. The constants are included for numerical consistency.

4.5 Implementation

We partition the reference image into 1,000-2,000 superpixels [achanta2012slic]. The trade-off parameters were tuned differently for different datasets. To optimize the proposed energy function (Eq. 7), we require an initial set of proposals for motion and geometry.

Fig. 6: a) Superpixeled reference image. b) Individual superpixel depths with arbitrary scales (unorganized superpixel soup). c) Recovered depth map using our approach (organized superpixel soup). d) Ground-truth depth map.

4.5.1 Initial Proposal Generation

We exploit the piece-wise rigid and planar assumptions to estimate an initial proposal for geometry and motion. We start by estimating a homography for each superpixel using dense feature correspondences. The piece-wise rigid assumption helps to approximately estimate the rotation and the correct translation direction via triangulation and the cheirality check [hartley2003multiple] [hartley1997triangulation]. To obtain the correct normal direction and an initial depth estimate, we solve the following set of equations for each superpixel:


The reason we choose this strategy to obtain the normal is that a simple decomposition of the homography matrix into rotation, translation, and normal can lead to a sign ambiguity [varol2009template] [malis2007deeper]. Nevertheless, if one has the correct rotation and translation direction, which we infer from the cheirality check, then inferring the normal becomes easy (the obtained normal must be normalized). Here, we assume the depth to be a positive constant and the initial arbitrary reconstruction to lie in the +Z direction. This strategy of gathering nine variables per superpixel (six for motion and three for geometry) gives us a good enough estimate to start the minimization of our overall energy function. (If a superpixel is very small, we use the optical flow of its neighboring superpixels to estimate its motion parameters.)
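The least-squares recovery of the normal from a known rotation and translation can be sketched as follows. This is an illustrative reading of the strategy above, assuming the standard plane-induced homography H = R + t nᵀ / d with a fixed positive plane depth d; the function name is ours.

```python
import numpy as np

def normal_from_homography(H, R, t, d=1.0):
    """Recover the plane normal n from H = R + t n^T / d, assuming the
    rotation R and the translation direction t are already known from
    the cheirality check.  In least squares:
        H - R = t n^T / d   =>   n = d (H - R)^T t / ||t||^2
    The result is normalised, as the text requires."""
    n = d * (H - R).T @ t / (t @ t)
    return n / np.linalg.norm(n)
```

Because R and t are fixed beforehand, this avoids the sign ambiguity of a full homography decomposition.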

To initialize the 3D vectors in our formulation, we use the following well-known relation:


where the pixel coordinates are lifted to 3D using the camera intrinsic parameters, which can be read off from the intrinsic matrix.
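This is the standard pinhole back-projection X = z K⁻¹ (u, v, 1)ᵀ; a minimal sketch (function name ours):

```python
import numpy as np

def backproject(u, v, depth, K):
    """Lift a pixel (u, v) with depth z into a 3D camera-frame point
    using the pinhole relation X = z * K^{-1} [u, v, 1]^T,
    where K is the camera intrinsic matrix."""
    x = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return depth * x

# Example intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
X = backproject(320.0, 240.0, 2.0, K)  # principal point at depth 2
```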

4.5.2 Optimization

With a good enough initialization of the variables, we start to optimize our energy function (Eq. 7). A globally optimal solution is hard to achieve due to the non-convex nature of the proposed cost function. However, it can be solved efficiently using interior-point methods [benson2002interior] [benson2014interior]. Although the solution found by an interior-point method is at best a local minimizer, empirically it yields a good 3D reconstruction. In our experiments, we initialized all the relative scales with the same constant value.
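To make the setup concrete, here is a toy sketch of a constrained scale-consistency problem. SciPy's `trust-constr` solver stands in for the interior-point solvers cited above, and the pairwise-ratio energy is an invented, simplified convex stand-in, not Eq. 7 itself.

```python
import numpy as np
from scipy.optimize import LinearConstraint, minimize

# Simplified stand-in for the energy: couple per-superpixel scales k
# through hypothetical pairwise consistency ratios.
rng = np.random.default_rng(0)
ratio = rng.uniform(0.5, 1.5, size=8)

def energy(k):
    return float(np.sum((k[1:] - ratio[1:] * k[:-1]) ** 2))

# The equality constraint k[0] = 1 removes the global-scale gauge
# freedom, mirroring the equality constraint in our energy function.
gauge = LinearConstraint(np.eye(8)[:1], 1.0, 1.0)
res = minimize(energy, np.ones(8), method="trust-constr", constraints=[gauge])
```

At the optimum, all scales are expressed relative to the first superpixel, which is exactly the role of the gauge-fixing constraint in our formulation.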

Next, we employ a particle-based refinement algorithm to rectify our initial motion and geometry beliefs. Specifically, we use the Max-Product Particle Belief Propagation (MP-PBP) procedure with the TRW-S algorithm [kolmogorov2006convergent] to optimize over the surface normals, rotations, translations, and depths of all 3D superpixels using Eq. 10. We generate 50 particles as proposals for the unknown parameters around the current beliefs to initiate refinement moves. Repeating this strategy for 5-10 iterations, we obtain a smooth and refined 3D structure of the dynamic scene.
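The particle-refinement loop can be caricatured as follows. This is a drastic simplification of MP-PBP with TRW-S, kept only to convey the propose-and-keep-best pattern; all names and parameter values are illustrative.

```python
import numpy as np

def particle_refine(x0, energy, n_particles=50, n_iters=10, sigma=0.05, seed=0):
    """Toy particle-based refinement: draw proposals around the current
    belief, score them with the energy, and keep the best.  The current
    belief is always kept, so the energy never increases."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        proposals = x + sigma * rng.standard_normal((n_particles, x.size))
        proposals = np.vstack([x, proposals])          # keep current belief
        x = proposals[np.argmin([energy(p) for p in proposals])]
    return x

# Refine a 2-parameter belief towards the minimum of a toy energy.
x = particle_refine(np.array([1.0, 1.0]), lambda p: float(np.sum((p - 0.8) ** 2)))
```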


5 Experiments and Results

We evaluated our formulation both qualitatively and quantitatively on several standard benchmark datasets, namely MPI Sintel [butler2012naturalistic], KITTI [geiger2013vision], VKITTI [gaidon2016virtual], and the YouTube Objects dataset [prest2012learning]. All these datasets contain images of dynamic scenes where both the camera and the objects move w.r.t. each other. To test reconstruction on deformable objects, we used the Paper, T-shirt [varol2009template] [varol2012constrained], and Back sequences [garg2013dense]. For evaluation, we selected the most commonly used error metric, the mean relative error.

Evaluation Metric: To keep the evaluation metric consistent with previous work [ranftl2016dense], we use the mean relative error (MRE), defined as MRE = (1/N) Σᵢ |dᵢ − dᵢᵍᵗ| / dᵢᵍᵗ, where dᵢ and dᵢᵍᵗ denote the estimated and ground-truth depth respectively and N is the total number of points. The error is computed after properly re-scaling the recovered depth, as the reconstruction is obtained up to an unknown global scale. Quantitative evaluation on the YouTube-Objects and Back datasets is missing due to the absence of ground truth.
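A minimal implementation of this metric, including the global-scale alignment, might look like the following. The function name and the closed-form least-squares scale are our choices; the paper does not specify how the re-scaling is computed.

```python
import numpy as np

def mean_relative_error(d_est, d_gt):
    """MRE = (1/N) * sum_i |d_i - d_i_gt| / d_i_gt, computed after the
    recovered depth is aligned to the ground truth with the
    least-squares global scale s = <d_est, d_gt> / <d_est, d_est>,
    since the reconstruction is only defined up to a global scale."""
    d_est = np.asarray(d_est, dtype=float)
    d_gt = np.asarray(d_gt, dtype=float)
    s = np.dot(d_est, d_gt) / np.dot(d_est, d_est)
    return np.mean(np.abs(s * d_est - d_gt) / d_gt)
```

Note that with this alignment, any depth map that is a scalar multiple of the ground truth scores an MRE of exactly zero, which is the intended behaviour for a scale-ambiguous reconstruction.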

To show that the same formulation works well for both rigid and non-rigid cases, we evaluated our method on different types of scenes containing rigid, non-rigid, and complex dynamic content, i.e., a composition of both.

5.1 Experimental Setup and Results

Experimental setup and processing time: We partition the reference image using SLIC superpixels [achanta2012slic]. We used a current state-of-the-art optical flow algorithm to compute dense optical flow [bailer2015flow]. To initialize the motion and geometry variables, we used the procedure discussed in §4.5.1. The interior-point algorithm [benson2002interior] [benson2014interior] and TRW-S [kolmogorov2006convergent] were employed to solve the proposed optimization. We implemented our algorithm in MATLAB/C++. Our modified implementation (modified from our ICCV implementation [kumar2017monocular]) takes an average of 15-20 minutes per image pair at the tested image resolution. The processing time is estimated on a regular desktop with an Intel Core i7 processor (16 GB RAM) using 50 refinement particles per superpixel.

Results on MPI Sintel Dataset: We begin our analysis of experimental results with the MPI Sintel dataset [butler2012naturalistic]. This dataset is derived from an animation movie featuring complex scenes. It contains highly dynamic sequences with large motions, significant illumination changes, and non-rigidly moving objects. It has emerged as a standard benchmark for evaluating dense optical flow algorithms and, recently, has also been used in the evaluation of dense 3D reconstruction methods for general dynamic scenes [ranftl2016dense].

The presence of non-rigid objects in the scene makes it a prominent choice for testing our algorithm. It is particularly challenging for the piece-wise planar assumption due to the many small and irregular shapes in the scene. Additionally, the availability of ground-truth depth maps makes quantitative analysis much easier. We selected 120 pairs of images to test our method, drawn from alley_1, ambush_4, mountain_1, sleeping_1, and temple_2. Fig. 7 shows some qualitative results on a few images taken from this sub-group of the MPI Sintel dataset.

Fig. 7: Qualitative results using our algorithm in a complex dynamic scene. Example images are taken from MPI Sintel dataset [butler2012naturalistic]. Top row: Input reference image from sleeping_1, sleeping_2, shaman_3, temple_2, alley_2 sequence (from left to right). Middle row: Ground-truth depth map for the respective frames. Bottom row: Recovered depth map using our method.
Fig. 8: Qualitative results using our algorithm for the outdoor scenes. Examples are taken from VKITTI dataset [gaidon2016virtual]. Top row: Input reference image. Middle row: Ground-truth depth map for the respective frames. Bottom row: Recovered depth map using our method.

Results on VKITTI Dataset: The Virtual KITTI dataset [gaidon2016virtual] contains computer-rendered, photo-realistic outdoor driving scenes that resemble the KITTI dataset. The advantage of this dataset is that it provides perfect ground truth for many measurements. Furthermore, it allows testing dense 3D reconstruction algorithms on distortion-free and noise-free images, facilitating quick experimentation. We selected 120 pairs of images from 0001_morning, 0002_morning, 0006_morning, and 0018_morning. Our qualitative results in comparison to the ground-truth depth maps are shown in Fig. 8.

Results on KITTI Dataset: The KITTI dataset [geiger2013vision] features real-world outdoor scenes targeting autonomous driving applications. The KITTI images are taken from a camera mounted on top of a car. It is a challenging dataset, as it contains scenes with large camera motion and realistic lighting conditions. In contrast to the aforementioned datasets, it only provides sparse ground-truth 3D information, which makes evaluation a bit strenuous. Nonetheless, it captures noisy real-world situations and is therefore well suited to test 3D reconstruction algorithms for the general dynamic-scene case. We selected the 00-09 sub-categories from the odometry dataset to evaluate and compare our results. We calculated the mean relative error only over the provided sparse 3D LiDAR points, after adjusting the global scale. Fig. 9 shows some qualitative results on a few images.

Results on Non-Rigid Sequences: We also tested our method on some commonly used dense non-rigid sequences, namely kinect_paper [varol2009template], kinect_tshirt [varol2009template], and the back sequence [garg2013dense]. (The intrinsic matrix for the back sequence is not provided with the dataset; we estimated an approximate value using the 2D-3D relation available from Garg et al. [garg2013dense].) Most benchmark approaches to non-rigid structure from motion use multiple frames and an orthographic camera model. Despite using only two frames and a perspective camera model, we are able to capture the deformation of non-rigid objects and achieve a reliable reconstruction. Qualitative results for the dense non-rigid object sequences are shown in Fig. 10. To compute the mean relative error, we align and scale our shape (fixing the global ambiguity) w.r.t. the ground-truth shape.

Fig. 9: Qualitative results on KITTI Dataset [geiger2013vision]. The second row shows the obtained depth map for the respective frames. Note: Dense ground-truth depth data is not available with this dataset.
Fig. 10: Dense 3D reconstruction of the objects that are undergoing non-rigid deformation over frames. Top row: Input reference frame from Back sequence [garg2013dense], Paper sequence [varol2009template][varol2012constrained] and t-shirt sequence[varol2009template][varol2012constrained]. Bottom row: Qualitative 3D reconstruction results for the respective deforming object.

5.2 Comparison

We compared the performance of our algorithm against several dynamic reconstruction methods, namely the Block Matrix Method (BMM) [dai2014simple], the Point Trajectory Approach (PTA) [akhter2011trajectory], Grouping-based Low-Rank Reconstruction (GLRT) [fragkiadaki2014grouping], Depth Transfer (DT) [karsch2014depth], DMDE [ranftl2016dense], and ULDEMV [zhou2017unsupervised]. The comparison is made on the available benchmark datasets, i.e., MPI Sintel (MPI-S), KITTI, VKITTI, kinect_tshirt (k_tshirt), and kinect_paper (k_paper). Table I provides the statistical results of our method in comparison to the baseline approaches on these datasets. Our method outperforms the others on the outdoor sequences and provides commendable performance on the deformable sequences. Additionally, we performed a qualitative comparison on MPI Sintel [butler2012naturalistic], KITTI [geiger2013vision], and the YouTube Objects dataset [prest2012learning]. Fig. 11 and Fig. 12 provide the visual comparison of our method with the other competing methods. It can be observed that our method consistently delivers superior performance on all of these datasets. While compiling the results, the per-frame comparison is also made over the entire sequence. Evaluation on the KITTI dataset is done only for the provided sparse 3D LiDAR points. Fig. 13(a), Fig. 13(b), and Fig. 14(c) show the per-category statistical performance of our approach against other competing methods on the benchmark datasets.

Fig. 11: Qualitative comparison of our method with DMDE [ranftl2016dense] on the MPI Sintel [butler2012naturalistic] and KITTI [geiger2013vision] datasets. Left to right: for each input reference image, we show its ground-truth depth map (GT Depth), the depth map reported by DMDE [ranftl2016dense], and the depth map obtained using our approach. Note: the dense GT depth map for the KITTI dataset is taken from the DMDE [ranftl2016dense] work.

5.3 Performance Analysis

Besides the statistical comparison, we conducted other experiments to analyze the behavior of our algorithm. These experiments supply an in-depth understanding of the dependence of our algorithm on its input modules.

Fig. 12: Qualitative comparison of our approach with Video-PopUp [russell2014video]. Clearly, our method provides a denser and more detailed reconstruction of the scene. In the second row, the t-shirt is missing from the Video-PopUp [russell2014video] result; by contrast, our method has no such holes. Note: The results presented here for Video-PopUp are taken from their webpage, since the source code provided by the authors crashes frequently.
Dataset    | DT     | GLRT   | BMM    | PTA    | DMDE   | Ours
MPI-S      | 0.4833 | 0.4101 | 0.3121 | 0.3177 | 0.297  | 0.1643
V-KITTI    | 0.2630 | 0.3237 | 0.2894 | 0.2742 | -      | 0.0925
KITTI      | 0.2703 | 0.4112 | 0.3903 | 0.4090 | 0.148  | 0.1254
k_paper    | 0.2040 | 0.0920 | 0.0322 | 0.0520 | -      | 0.0472
k_tshirt   | 0.2170 | 0.1030 | 0.0443 | 0.0420 | -      | 0.0480
TABLE I: Performance comparison (MRE errors). For DMDE [ranftl2016dense] we use its previously reported results, as its implementation is not publicly available. SF, MF, and TF refer to single-frame, multi-frame, and two-frame approaches, respectively: DT [karsch2014depth], GLRT [fragkiadaki2014grouping], BMM [dai2014simple], PTA [akhter2011trajectory], DMDE [ranftl2016dense].

Performance with variation in number of superpixels: Our method uses SLIC-based over-segmentation of the reference frame to discretize the 3D space. Therefore, the number of superpixels used to represent the scene plays a crucial role in the accuracy of the piece-wise continuous reconstruction. If the number of superpixels is very high, the estimation of motion parameters becomes tricky, and neighboring superpixels must be used to estimate the rigid motion, which leads to computational challenges. In contrast, a small number of superpixels is unable to capture the intrinsic details of a complex dynamic scene. So, a trade-off between the two is often the better choice. Fig. 14(a) plots the variation in depth error with the number of superpixels.

Fig. 13: Quantitative comparison of our method with PTA [akhter2011trajectory], BMM [dai2014simple], GLRT [fragkiadaki2014grouping], and DT [karsch2014depth] on benchmark datasets. The depth error is calculated after adjusting the numerical scale of the obtained depth map to the ground-truth value, to account for the global scale ambiguity. (a)-(b) Comparison on the MPI Sintel [butler2012naturalistic], Virtual KITTI [gaidon2016virtual], and KITTI [geiger2013vision] datasets. These numerical values show the fidelity of reconstruction that can be achieved on these benchmarks using our formulation.
Fig. 14: (a) Change in mean relative depth error with the number of superpixels. It can be observed that after 1,000 superpixels the MRE more or less saturates, with no significant effect on the overall accuracy. However, motion estimation becomes critical as the number of superpixels increases. (b) Performance evaluation in RMSE (in meters) with state-of-the-art optical flow methods in comparison to the ground-truth optical flow (MPI Sintel [butler2012naturalistic] dataset). (c) Mean relative depth error comparison with a recently proposed unsupervised learning-based approach (ULDEMV [zhou2017unsupervised]) on the KITTI dataset [geiger2013vision].

Performance with regular grid as image superpixels: Under the piece-wise planar assumption, it is not only the number of superpixels that affects the accuracy of reconstruction but also the superpixel pattern. To analyze this dependency, we took the worst possible case, i.e., dividing the reference image into a regular grid of approximately 1,000 cells, and compared its performance against 1,000 SLIC superpixels. Our observations clearly show a decline in performance compared to SLIC superpixels; however, the difference in accuracy is not very significant (see Fig. 15).
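For reference, such a regular-grid baseline can be generated as follows. This is one plausible construction, not the paper's exact tiling; the 436x1024 frame size used in the example is MPI Sintel's.

```python
import numpy as np

def grid_superpixels(h, w, n_target=1000):
    """Partition an h x w image into roughly n_target rectangular cells,
    the 'regular grid' baseline compared against SLIC superpixels.
    Returns an (h, w) integer label map."""
    n_side = int(round(np.sqrt(n_target * w / h)))   # cells per row
    m_side = max(1, int(round(n_target / n_side)))   # cells per column
    rows = np.minimum((np.arange(h) * m_side) // h, m_side - 1)
    cols = np.minimum((np.arange(w) * n_side) // w, n_side - 1)
    return rows[:, None] * n_side + cols[None, :]

labels = grid_superpixels(436, 1024)   # ~1,000 cells on an MPI Sintel frame
```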

Fig. 15: Effects of superpixel pattern on the reconstruction of a dynamic scene. a) with SLIC as superpixels (MRE for the shown frame is 0.0912) b) with uniform grid as superpixels (MRE achieved for the given frame is 0.1442).

Effects of K in K-NN Graph: In our method, the ARAP energy term is evaluated over a K-nearest-neighbor graph, and different values of K lead to different 3D reconstruction results. We conducted an experiment on the flying dragon sequence to analyze the effect of varying K on the performance of our algorithm; the result is shown in Fig. 16. As K increases, the rigidity constraint is enforced over a larger neighborhood, which pushes the 3D reconstruction towards a globally rigid solution. On the other hand, a very small value of K fails to constrain the within-object motion. In most of our experiments, we used a small value of K, which achieved satisfactory 3D reconstruction. Also, increasing K directly increases the overall algorithmic complexity.
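Building such a K-NN graph over the superpixel anchor points is straightforward, e.g., with a k-d tree. This is an illustrative sketch: `cKDTree` is simply one convenient implementation, and the default K=4 echoes the value shown in Fig. 16, not a prescribed setting.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_graph(anchors, K=4):
    """Build the K-nearest-neighbour graph over the (N, D) array of
    superpixel anchor points on which the ARAP term is evaluated.
    Returns an (N, K) array of neighbour indices."""
    tree = cKDTree(anchors)
    # Query K+1 neighbours because each point's nearest neighbour
    # is itself; drop that self-match afterwards.
    _, idx = tree.query(anchors, k=K + 1)
    return idx[:, 1:]
```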

Fig. 16: Effect of parameter K in building the K-NN graph. Our algorithm results in good reconstruction if a suitable K is chosen, in accordance with the levels of complexity in a dynamic scene. (b) Ground-truth depth-map (scaled for illustration purpose). (c) when K=4, a reasonable reconstruction is obtained. (d) when K=20, regions tend to grow bigger. (Best viewed in color.)
Fig. 17: (a)-(b) The reference frame and the next frame. This is a very challenging case for proper scale recovery from monocular images with dynamic motion. In both examples, the motion of the girl between two consecutive frames is very large, and therefore the neighboring relations among the planes (superpixels in the image domain) are violated in the consecutive frames. In such cases, our method may not recover the correct scale for each moving plane in 3D. In the first example, the complicated motion of the girl's feet leads to a wrong scale estimate. In the second example, the cart along with the girl moves w.r.t. the camera; the girl's hand has substantial motion between consecutive frames, which leads to an incorrect scale estimate. (c)-(d) Ground-truth and obtained depth map, respectively.

Performance variation using different optical flow algorithms: As our method takes dense optical flow correspondences between frames as input, its performance is directly affected by their accuracy. To analyze this sensitivity, we tested our method with the ground-truth optical flow and with a few state-of-the-art optical flow methods [bailer2015flow] [chen2016full]. In Fig. 14(b), we show the 3D reconstruction performance evaluated in RMSE (root mean square error, RMSE = sqrt((1/N) Σᵢ (dᵢ − dᵢᵍᵗ)²), where dᵢ and dᵢᵍᵗ denote the estimated and ground-truth depth respectively and N is the total number of points) with different optical flow inputs. This experiment reveals the importance of accurate dense optical flow for reconstructing a dynamic scene. While the ground-truth optical flow naturally achieves the best performance, the differences among the state-of-the-art flows are not dramatic. We therefore conclude that our method can achieve reliable results with the available dense optical flow algorithms.

6 Limitations and Discussion

The success of our method depends on the effectiveness of the piece-wise planar and as-rigid-as-possible assumptions. As a result, our method may fail if the piece-wise smooth model is no longer a valid approximation of the dynamic scene. For example, very fine or very small structures that are considerably far from the camera are difficult to recover under the piece-wise planar assumption. When may the as-rigid-as-possible assumption fail? When the motion of a dynamic object between consecutive frames is so large that most of its neighboring relations in the reference frame are violated in the next frame. Additionally, if a non-rigid shape shrinks or expands over frames, such as a deflating or inflating balloon, the ARAP model fails. A couple of such situations are discussed in Fig. 17. The other major limitation of our method is the overall processing time.

6.1 Discussion

1. Directions to reduce the processing time of our algorithm: Our algorithm is computationally expensive on a regular desktop machine, because the formulation solves a higher-order graph optimization problem with particle-based refinement using TRW-S. To speed up the processing, we are incorporating recent research on fast interior-point optimization and message-passing algorithms [pearson2017fast, Tourani_2018_ECCV] into our framework. We believe that solving our optimization with these algorithms, along with better computational capabilities, can significantly reduce the processing time of our method.

2. Suitability of the Euclidean distance metric between graph vertices: Generally, the Euclidean distance between graph vertices works well under our piece-wise planar assumption of a dynamic scene. However, there are situations where it may not be an appropriate metric, for example when the shape of the superpixels is affected by noise, or when curved surfaces are modeled with a piece-wise planar graph structure. To handle such special cases, it is better to measure distance in an embedding space (isometric embedding) or to use an intrinsic (e.g., geodesic) metric. To be precise, depending on how the deforming structure evolves over time, the choice of a suitable metric may vary. Interested readers are encouraged to study the field of intrinsic metrics on graphs [keller2015intrinsic].

Ablation Analysis: To understand the contribution of the different energy terms to the overall optimization, we performed an ablation analysis. Firstly, in the proposed optimization framework, the 3D continuity term is defined over boundaries between neighboring superpixels, which alone is not sufficient to constrain the motion beyond its immediate neighbors. Secondly, the re-projection and orientation terms have nothing to do with scale computation whatsoever; hence, combining these three terms is not enough to explain the correct scale of each object present in the scene. On the other hand, the as-rigid-as-possible term is defined for each superpixel's anchor point over the K-NN graph structure, but it does not take into account the alignment of the planes in 3D along their boundaries; as a result, the overall reconstruction suffers. This demonstrates that all the terms are essential for reliable dynamic 3D reconstruction. Fig. 18 illustrates the contribution of the different terms toward the final reconstruction, and Table II provides numerical values showing their importance to the overall performance. It can be observed that the improvement due to the normal-orientation constraint is not very significant.

Fig. 18: Effect of using the "as rigid as possible", "planar re-projection", "3D continuity", and "orientation" terms. Top row: By enforcing the "as rigid as possible" term only, the recovered relative scales are correct but the reconstructed planes are misaligned with respect to their neighbors. Middle row: With the planar re-projection, 3D continuity, and orientation terms enforced, the resulting 3D reconstruction achieves continuous neighboring boundaries; however, the relative scales of the planes in 3D are not correct. Bottom row: By enforcing the "as rigid as possible" term along with all the other smoothness terms, we can handle both the relative scales and the 3D reconstruction of a complex dynamic scene.
Sequence   | Re-projection | +ARAP  | +3D continuity | +Orientation
alley_1    | 0.2248        | 0.2022 | 0.1697         | 0.1606
ambush_4   | 0.2381        | 0.2093 | 0.1701         | 0.1676
mountain_1 | 0.2127        | 0.1923 | 0.1492         | 0.1405
sleeping_1 | 0.2418        | 0.2026 | 0.1912         | 0.1823
TABLE II: Contribution of each individual energy term to the overall optimization. Each column shows the mean relative reconstruction error after adding the respective energy term; the + sign symbolizes the addition of all the energy terms (columns) to its left.

7 Conclusion

In this paper, we have explored, investigated, and supplied a distinct perspective on one of the classical problems in geometric computer vision: reconstructing a dense 3D model of a complex, dynamic, and generally non-rigid scene from two perspective images. This topic is often considered a very challenging task in structure from motion. In spite of these challenges, we have demonstrated that dense, detailed 3D reconstruction of dynamic scenes is, in fact, possible, provided that certain prior assumptions about the scene geometry and the deformation in the scene are satisfied. Both assumptions we use are mild, realistic, and commonly satisfied in real-world scenarios. Our comprehensive evaluation on benchmark datasets shows that our approach to dense monocular 3D reconstruction of a general dynamic scene provides better results than the competing methods. That said, we think deeper research building on our idea may help in the development of more sophisticated SfM algorithms.


Acknowledgments: This research is supported in part by the Australian Research Council (ARC) Centre of Excellence for Robotic Vision (CE140100016), ARC-Discovery (DP 190102261), and ARC-LIEF (190100080), the Natural Science Foundation of China grants (61871325, 61420106007, 61671387), the "New Generation of Artificial Intelligence" major project under Grant 2018AAA0102800, and ARC grant DE140100180, and in part by a research gift from Baidu RAL (ApolloScapes-Robotics and Autonomous Driving Lab). The authors gratefully acknowledge the Data Science GPU gift award by NVIDIA Corporation. We thank all the reviewers and the AE for their constructive suggestions.