1 Introduction
The task of reconstructing 3D geometry of the scene from images —popularly known as structurefrommotion (SfM), is a fundamental problem in computer vision. An initial introduction and working solution to this problem can be found as early as 1970’s and 1980’s
[ullman1979interpretation] [grimson1981images] [longuet1981computer], which Blake et al. discussed comprehensively in their seminal work [blake1987visual]. While this field of study was in the past largely dominated by sparse feature-based reconstruction of rigid scenes [hartley1997triangulation] [hartley1997defense] [hartley2003multiple] [tomasi1993pictures] [tomasi1992shape] and non-rigid objects [bregler2000recovering] [dai2014simple] [lee2013procrustean] [kumar2016multi] [kumar2017spatio], in recent years, with the surge in computational resources, dense 3D reconstruction of the scene has been introduced and successfully demonstrated [newcombe2015dynamicfusion] [newcombe2011dtam] [ranftl2016dense]. A dense solution to this inverse problem is essential due to its increasing demand in many real-world applications, from the animation and entertainment industries to robotics (V-SLAM). In particular, the proliferation of monocular cameras in almost all modern mobile devices has elevated the demand for sophisticated dense reconstruction algorithms. When the scene is static and the camera is moving, 3D reconstruction of such scenes from images can be achieved using conventional rigid structure-from-motion techniques [hartley2003multiple] [agarwal2011building] [schoenberger2016sfm] [schoenberger2016mvs]: under such settings, elegant geometrical constraints can explain the camera's motion [hartley1997defense] [govindu2001combining], which is later used to realize a dense 3D reconstruction of the scene [schoenberger2016sfm] [schoenberger2016mvs] [newcombe2011dtam] [triggs1999bundle]. In contrast, modeling an arbitrary dynamic scene can be very challenging. Such geometrical constraints may fail when multiple rigidly moving objects are observed by a moving camera. Although each individual rigid object can be reconstructed up to an arbitrary scale (assuming motion segmentation is provided), the reconstruction of the whole dynamic scene is generally impossible, simply because the relative scales among all the moving shapes cannot be determined in a globally consistent way. Furthermore, since all the estimated motions are relative to each other, one cannot distinguish camera motion from object motion. Therefore, prior information about the objects, or the scene, and their relation to the frame of reference is used to fix the placement of these objects relative to each other.
Hence, from the above discussion, it can be argued that the solution to 3D reconstruction of a general dynamic scene is non-trivial. Nevertheless, it is an important problem to solve, as many real-world applications need a reliable solution. Consider, for example, the understanding of a traffic scene: a typical outdoor traffic scene consists of both multiple rigid motions of vehicles and the non-rigid motion of pedestrians. To model such scenarios, it is important to have an algorithm that can provide dense 3D information from images.
Recently, Ranftl et al. [ranftl2016dense] proposed a three-step approach to procure dense 3D reconstruction of a general dynamic scene using two consecutive perspective frames. Concretely, it performs object-level motion segmentation, followed by per-object 3D reconstruction, and finally solves for scale ambiguity. We know that in a general dynamic setting, the task of densely segmenting rigidly moving objects or parts is not trivial. Consequently, inferring motion models for deforming shapes becomes very challenging. Furthermore, object-level segmentation, built upon the assumption of multiple rigid motions, fails to describe more general scenarios such as when the objects themselves are deforming. As a result, 3D reconstruction algorithms that depend on object-level motion segmentation suffer.
Motivated by such limitations, we propose a unified approach that neither performs any object-level motion segmentation nor assumes any prior knowledge about the scene rigidity type, and is still able to recover a scale-consistent dense reconstruction of a complex dynamic scene. Our formulation instinctively encapsulates the solution to the inherent scale ambiguity in perspective structure from motion, which is a very challenging problem in general. We show that by using two prior assumptions, about the 3D scene and its deformation, we can effectively pin down the unknown relative scales and obtain a globally consistent dense 3D reconstruction of a dynamic scene from its two perspective views. The two basic assumptions we use about the dynamic scene are:

The dynamic scene can be approximated by a collection of piecewise planar surfaces each having its own rigid motion.

The deformation of the scene between two frames is locally rigid but globally as-rigid-as-possible.


Piecewise planar model: Our method models a dynamic scene as a collection of piecewise planar regions. Given two perspective images (reference image) and (next image) of a general dynamic scene, our method first over-segments the reference image into superpixels. This collection of superpixels is assumed to approximate the dynamic scene in the projective space. It can be argued that modeling the dynamic scene per pixel would be more compelling; however, modeling the scene using planar regions makes the problem computationally tractable for optimization or inference [bleyer2011object, vogel20153d].

Locally rigid and globally as-rigid-as-possible: We implicitly assume that each local plane undergoes a rigid motion. Supposing every individual superpixel corresponds to a small planar patch moving rigidly in 3D space and that dense optical flow between the frames is given, we can estimate its location in 3D using a rigid reconstruction pipeline [hartley2003multiple, vogel20113d]. Since the relative scales of these patches are not determined correctly, they float in 3D space as an unorganized superpixel soup. Under the assumption that the change between the frames is not arbitrary but rather regular or smooth, the scene can be assumed to change as rigidly as possible globally. Using this intuition, our method finds for each superpixel an appropriate scale, under which the entire set of superpixels can be assembled (glued) together coherently, forming a piecewise smooth surface, as if playing a game of “3D jigsaw puzzle”. Hence, we call our method the “SuperPixel Soup” algorithm (see Fig. 2 for a conceptual visualization).
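The gluing step can be illustrated with a tiny numpy sketch (all coordinates and scales below are invented for illustration; this is not the paper's optimization). Two neighbouring patches are reconstructed up to unknown scales; requiring their shared boundary point to coincide recovers the relative scale in closed form:

```python
import numpy as np

# Two neighbouring planar patches; their true geometry shares a boundary
# point at (1, 0, 5). All values are made up for illustration.
patch_a = np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 5.0]])   # [anchor, boundary]
patch_b = np.array([[2.0, 0.0, 5.0], [1.0, 0.0, 5.0]])   # [anchor, boundary]

# Per-patch rigid reconstructions arrive with arbitrary unknown scales.
recon_a = 0.7 * patch_a
recon_b = 1.9 * patch_b

# Fix patch a's scale to 1 and solve patch b's relative scale so the shared
# boundary points coincide:  min_s || s * recon_b[1] - recon_a[1] ||^2.
b = recon_b[1]
s = (b @ recon_a[1]) / (b @ b)     # closed-form 1-D least squares
glued_b = s * recon_b              # patch b "glued" into patch a's frame
assert np.allclose(glued_b[1], recon_a[1])
```

In the full method the same principle is applied jointly over all superpixels through the ARAP and continuity energies, rather than pairwise as here.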
In this paper, we show that our aforementioned assumptions can faithfully model most real-world dynamic scenarios. Furthermore, we encapsulate these assumptions in a simple optimization problem, which is solved using a combination of continuous and discrete optimization algorithms [benson2002interior, benson2014interior, kolmogorov2006convergent]. We demonstrate the benefit of our approach on available benchmark datasets such as KITTI [geiger2013vision], MPI Sintel [butler2012naturalistic] and Virtual KITTI [gaidon2016virtual]. The statistical comparison shows that our algorithm outperforms many available state-of-the-art methods by a significant margin.
2 Related Works
The solution to SfM has undergone prodigious development since its inception [ullman1979interpretation]. Even after such remarkable development in this field, the choice of algorithm depends on the complexity of the object motion and the environment. In this work, we utilize the idea of (local) rigidity to solve dense reconstruction of a general dynamic scene. The concept of rigidity is not new in the structure-from-motion problem [ullman1979interpretation] [longuet1987computer] and has been effectively applied as a global constraint to solve large-scale reconstruction problems [agarwal2011building]. The idea of global rigidity to solve for structure and motion has also been exploited over multiple frames via a factorization approach [tomasi1992shape].
The literature on structure from motion and its treatment of different scenarios is very extensive. Consequently, for brevity, we only discuss the previous works that are of direct relevance to dynamic 3D reconstruction from monocular images. The linear low-rank model has been used for dense non-rigid reconstruction. Kumar et al. [kumar2018scalable, Kumar_2019_CVPR] and Garg et al. [garg2013dense] solved the task with an orthographic camera model, assuming feature matches across multiple frames are given as input. Fayad et al. [fayad2010piecewise] recovered deformable surfaces with a quadratic approximation, again from multiple frames. Taylor et al. [taylor2010non] proposed a piecewise rigid solution using locally-rigid SfM to reconstruct a soup of rigid triangles.
While the method of Taylor et al. [taylor2010non] is conceptually similar to ours, there are major differences:

We achieve two-view dense reconstruction, while [taylor2010non] relies on multiple views (N ≥ 4).

We use a perspective camera model, while they rely on an orthographic camera model.

We solve the scale-indeterminacy issue, which is an inherent ambiguity for 3D reconstruction under perspective projection; the method of Taylor et al. [taylor2010non] does not suffer from it, but at the cost of being restricted to the orthographic camera model.
Recently, Russell et al. [russell2014video] and Ranftl et al. [ranftl2016dense] used object-level segmentation for dense dynamic 3D reconstruction. In contrast, our method is free from object segmentation, hence circumventing the difficulty associated with motion segmentation in a dynamic setting.
The template-based approach is yet another method for deformable surface reconstruction. Yu et al. [yu2015direct] proposed a direct approach to capture dense, detailed 3D geometry of generic, complex non-rigid meshes using a single RGB camera. While it works for generic surfaces, the requirement of a template prevents its wider application to more general scenes. Wang et al. [wang2016template] introduced a template-free approach to reconstruct a poorly-textured, deformable surface. Nevertheless, its success is restricted to a single deforming surface rather than an entire dynamic scene. Varol et al. [varol2009template] reconstructed deformable surfaces based on a piecewise reconstruction, assuming overlapping patches to be consistent over the entire surface, but again limited to the reconstruction of a single deformable surface.
While the conceptual idea of our work appeared in ICCV 2017, this journal version provides: (i) an in-depth realization of our overall optimization; (ii) qualitative comparison with [ranftl2016dense] and Video-PopUp [russell2014video], as well as statistical comparison with a deep-learning method [zhou2017unsupervised]; (iii) a comprehensive ablation test showing the importance of each term in the overall optimization; (iv) extensive performance analysis showing the effect of varying the number of superpixels, the choice of k-nearest neighbors, the choice of dense optical flow algorithm, and the shape of the superpixels; (v) a detailed discussion of failure cases, the choice of the Euclidean metric for nearest-neighbor graph construction, and the limitations of our work, with possible directions for improvement.
3 Motivation and Contribution
The formulation proposed in this work is motivated by the following considerations in dense structure from motion of a dynamic scene.
3.1 Object level motion segmentation
To solve dense reconstruction of an entire dynamic scene from perspective images, the first step usually practiced is to perform object-level motion segmentation to infer distinct motion models for the multiple rigidly moving objects in the scene. As alluded to before, dense segmentation of moving objects in a dynamic scene is in itself a challenging task. Also, non-rigidly moving objects may themselves be composed of a union of distinct motion models. Therefore, object-level segmentation built upon the assumption of per-object rigid motion will fail to describe a general dynamic scene. This motivates us to develop an algorithm that can recover a dense, detailed 3D model of a complex dynamic scene from its two perspective images, without object-level motion segmentation as an essential intermediate step.
3.2 Separate treatment of rigid SfM and non-rigid SfM
Our investigation shows that algorithms for deformable-object 3D reconstruction often differ from those for rigidly moving objects. Not only the solutions, but even the assumptions vary significantly, e.g., orthographic projection and low-rank shape [bregler2000recovering] [dai2014simple] [lee2013procrustean] [kumar2017spatio]. The reason for such inadequacy is perfectly valid, given the under-constrained nature of the problem itself. This motivated us to develop an algorithm that can provide “3D reconstruction of an entire dynamic scene and of a non-rigidly deforming object under the same assumptions and formulation.”
Although accomplishing this goal for arbitrary non-rigid deformation remains an open problem, experiments suggest that our framework, under the aforementioned assumptions about the scene and the deformation, can reconstruct a general dynamic scene irrespective of the scene rigidity type. We benefit from recent advances in dense optical flow algorithms [bailer2015flow] [chen2016full], which can reliably capture smooth non-rigid deformation over frames. These robust dense optical flow algorithms allow us to exploit the local motion of deforming surfaces. Thus, our formulation is able to bridge the gap between rigid and non-rigid SfM.
The main contributions of our work are as follows:

A framework that disentangles dense 3D reconstruction of a complex dynamic scene from object-level motion segmentation.

A common framework for dense two-frame 3D reconstruction of a complex dynamic scene (including deformable objects), which achieves state-of-the-art performance.

A new idea to resolve the inherent relative-scale ambiguity problem in monocular 3D reconstruction by exploiting the as-rigid-as-possible (ARAP) constraint [sorkine2007rigid].
4 Outline of the Algorithm
Before providing the details of our algorithm, we would like to introduce some common notations that are used throughout the paper.
4.1 Notation
We represent the two consecutive images as , , also referred to as the reference image and the next image respectively. Vectors are represented by bold lower-case letters, such as ‘’, and matrices by bold upper-case letters, such as ‘’. The subscripts ‘a’ and ‘b’ denote the anchor point and boundary point respectively; e.g., , represent the anchor point and a boundary point of a superpixel in the image space. The 1-norm and 2-norm of a vector are denoted as and respectively. For matrices, the Frobenius norm is denoted as .
4.2 Overview
We first over-segment the reference image into superpixels, then model the deformation of the scene by a union of piecewise rigid motions of these superpixels. Specifically, we divide the overall non-rigid reconstruction into a local rigid reconstruction of each superpixel, followed by an assembly process which glues all these individual local reconstructions in a globally coherent manner. While the concept of this divide-and-conquer procedure looks simple, there is, however, a fundamental difficulty in its implementation: scale indeterminacy. Scale indeterminacy refers to the well-known fact that with a moving camera one can only recover 3D structure up to an unknown scale. In our method, the individual rigid reconstruction of each superpixel can only be determined up to an unknown scale; the assembly of the entire non-rigid scene is possible if and only if the relative scales among the superpixels are solved, which is, however, a challenging open task in itself.
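The scale indeterminacy itself is easy to verify numerically. The following sketch (synthetic camera, points and motion, not taken from the paper) shows that jointly scaling the structure and the translation leaves both projections unchanged, so no image measurement can reveal the global scale:

```python
import numpy as np

# Synthetic intrinsics for illustration.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])

def project(K, R, t, X):
    """Project 3D points X (N x 3) into a camera with pose (R, t)."""
    x = (K @ (R @ X.T + t[:, None])).T
    return x[:, :2] / x[:, 2:3]

rng = np.random.default_rng(0)
X = rng.uniform([-1, -1, 4], [1, 1, 8], size=(10, 3))   # scene points
R = np.eye(3)
t = np.array([0.2, 0.0, 0.1])

s = 3.7                                                  # arbitrary scale
img1 = project(K, np.eye(3), np.zeros(3), X)
img2 = project(K, R, t, X)
img1_s = project(K, np.eye(3), np.zeros(3), s * X)       # scale structure...
img2_s = project(K, R, s * t, s * X)                     # ...and translation

# Both image pairs are identical: the scale s is unobservable.
assert np.allclose(img1, img1_s) and np.allclose(img2, img2_s)
```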
In this paper, we show how this can be done using two very mild assumptions (§3.2). Under these assumptions, our method solves for the unknown relative scales and obtains a globally-coherent dense 3D reconstruction of a complex dynamic scene from its two perspective views.
4.3 Problem Statement
To implement the above idea of piecewise rigid reconstruction, we first partition the reference image into a set of superpixels , where each superpixel is parametrized by its boundary pixels and an anchor point corresponding to the centroid of the superpixel in the image plane. Such a superpixel partition of the image plane naturally induces a piecewise-smooth over-segmentation of the corresponding 3D scene surface. We denote this set of 3D scene surfaces as = . Although surfel is perhaps a better term, we nevertheless call them “3D superpixels” for ease of exposition. We further assume each 3D superpixel (‘’) is a small 3D planar patch, which is parameterized by its surface normal , 3D anchor-point , and 3D boundary-points (i.e., these are the pre-images of and ). We assume every 3D superpixel moves rigidly according to , where represents the relative rotation, the translation direction, and the unknown scale.
With our notation and symbols introduced, we are in a position to state our idea more precisely. Given two intrinsically calibrated perspective images and of a generally dynamic scene and the corresponding dense optical flow field, our task is to reconstruct a piecewise-planar approximation of the dynamic scene surface. The deformable scene surface in the reference frame (i.e., ) and the one in the second frame (i.e., ) are parametrized by their respective 3D superpixels and , where each is described by its surface normal and an anchor point . Any 3D plane can be determined by an anchor point and a surface normal . If one can estimate the correct placement of all the 3D anchor points and all the surface normals corresponding to the reference frame, the problem is solved, since each element of is related to via a transformation (locally rigid).
The overall procedure of our method is presented in Algorithm 1.
4.4 Formulation
We begin by briefly reiterating some of our representation. We partition the reference image into a set , whose corresponding set in the 3D world is . Similarly, and are the respective sets for the next frame. The mapping of each element between the reference frame and the next frame differs by a rigid transformation, i.e., via a transformation (also known as the special Euclidean group); for instance, = , where and . In our formulation, each 3D plane is described by = {(, ) }, where is the total number of superpixels (see Fig. 3). Similarly, in the image space the two frames are related through the plane-induced homography [hartley2003multiple] (the scale is introduced in both the numerator and the denominator to clarify that it does not affect the homography transformation). Here, is the intrinsic camera matrix and is the depth of the plane. Using these notations and definitions, we build a KNN graph.
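The plane-induced homography can be sketched numerically. Assuming the common convention H = K (R + t n^T / d) K^{-1} for points on the plane n · X = d (the intrinsics, plane and motion below are synthetic, not the paper's values), H transfers pixels of the plane from the reference image to the next image:

```python
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
Kinv = np.linalg.inv(K)

# A small planar patch: points X with n . X = d (plane in camera-1 frame).
n = np.array([0.0, 0.0, 1.0])
d = 5.0
X = np.array([[0.3, 0.2, 5.0], [-0.4, 0.1, 5.0], [0.2, -0.3, 5.0]])

# Rigid motion of the patch relative to the second view.
theta = 0.05
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
t = np.array([0.1, -0.05, 0.2])

# Plane-induced homography H = K (R + t n^T / d) K^{-1}.
H = K @ (R + np.outer(t, n) / d) @ Kinv

def proj(P):
    p = P / P[:, 2:3]                  # normalize by depth
    return (K @ p.T).T[:, :2]          # pixel coordinates

x1 = proj(X)                           # pixels in the reference image
x2 = proj((R @ X.T).T + t)             # pixels in the next image

x1h = np.hstack([x1, np.ones((3, 1))])
x2_H = (H @ x1h.T).T
x2_H = x2_H[:, :2] / x2_H[:, 2:3]
assert np.allclose(x2, x2_H)           # H maps plane pixels between views
```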
Build a KNN graph: Using the over-segmentation of the reference image (which is the projection of a set of 3D planes ) and the Euclidean distance metric, we construct a KNN graph in the image space connecting each anchor point to its K-nearest anchor points. The graph vertices () are composed of anchor points, which connect to other anchor points via graph edges (). The distance between any two vertices () is taken as the Euclidean distance between them. Here, we assume the Euclidean distance is a valid graph metric to describe the edge length between any two local vertices; such an assumption is valid for local compactness (Euclidean spaces are locally compact). Interested readers may refer to [burago2001course] [williamson1987constructing] [whiteley2004rigidity] for comprehensive details. Here, ‘K’ is the number of nearest neighbors used to construct the local graph structure. This KNN graph helps to constrain the motion and continuity of the space (defined in terms of optical flow and depth). To impose a hard constraint, we build the KNN graph using anchor points beyond their immediate neighbors (Fig. 4).
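A minimal numpy version of the KNN-graph construction over anchor points might look as follows (the anchor coordinates are invented; at scale a KD-tree would replace the dense distance matrix):

```python
import numpy as np

def knn_graph(anchors, K):
    """Connect each anchor point to its K nearest anchors (Euclidean)."""
    diff = anchors[:, None, :] - anchors[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(dist, np.inf)            # no self-edges
    nbrs = np.argsort(dist, axis=1)[:, :K]    # K nearest per vertex
    edges = {(i, j) for i in range(len(anchors)) for j in nbrs[i]}
    return nbrs, edges

# Toy anchor points (superpixel centroids) in the image plane.
anchors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
nbrs, edges = knn_graph(anchors, K=2)
```

Note that the resulting edges are directed; the far-away anchor still connects to its two nearest neighbors, which is what allows the graph to constrain anchors beyond immediate neighbors.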
This KNN graph is crucial for establishing the local rigidity constraint, which is the basis of our assumption. This graph structure allows us to enforce our assumption that the shape is rigid locally and as rigid as possible globally.
As-Rigid-As-Possible (ARAP) Energy Term: Our method is built on the idea that the correct scales of the 3D superpixels can be estimated by enforcing prior assumptions that govern the deformation of the dynamic surface. Specifically, we require that, locally, the motion each 3D superpixel undergoes is rigid, and that globally the entire dynamic scene surface moves as rigidly as possible (ARAP). In other words, while the dynamic scene is globally non-rigid, its deformation must be regular in the sense that it deforms as rigidly as possible. To implement this idea, we define an ARAP energy term as:
(1)  
Here, the first term favors smooth motion between local neighbors, while the second term encourages inter-node distances between an anchor node and its K nearest neighbor nodes (denoted as ) to be preserved before and after the motion (hence as-rigid-as-possible; see Fig. 4). We define the weighting parameters as:
(2) 
These weights are set to be inversely proportional to the distance between two superpixels, reflecting our intuition that the further apart two superpixels are, the weaker the energy should be. Although there may be redundant information in these two terms w.r.t. scale estimation, we keep both for motion refinement (§4.5.2). Note that this term is defined only over anchor points; hence, it enforces no depth smoothness along boundaries. The weighting term in advocates local rigidity by penalizing over the distance between anchor points, allowing immediate neighbors to deform smoothly over time. Also note that is generally non-convex. This non-convexity is due to the second term in Eq. 1, where we have a minus sign between two norm terms. In Eq. 2, is an empirical constant.
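Since Eq. 1 and Eq. 2 are typeset as images here, the sketch below implements only one plausible reading of the second (distance-preservation) term with inverse-distance weights; the weighting form and the constant are illustrative, not the paper's exact definition:

```python
import numpy as np

def arap_energy(anchors0, anchors1, nbrs, c=1.0):
    """Distance-preservation part of an ARAP-style energy on a KNN graph.

    anchors0, anchors1 : (N, 3) anchor positions before / after the motion.
    nbrs               : (N, K) indices of each node's K nearest neighbours.
    The penalty is the change of each inter-node distance, weighted inversely
    by how far apart the two anchors are (illustrative weighting).
    """
    E = 0.0
    for i in range(len(anchors0)):
        for j in nbrs[i]:
            d0 = np.linalg.norm(anchors0[i] - anchors0[j])
            d1 = np.linalg.norm(anchors1[i] - anchors1[j])
            w = 1.0 / (d0 + c)           # inverse-distance weight
            E += w * abs(d1 - d0)        # as-rigid-as-possible penalty
    return E

# A rigid translation of the whole graph preserves all inter-node distances.
pts = np.array([[0.0, 0, 5], [1.0, 0, 5], [0.0, 1, 5]])
nbrs = np.array([[1, 2], [0, 2], [0, 1]])
assert arap_energy(pts, pts + np.array([0.3, 0.1, 0.2]), nbrs) < 1e-9
```

A non-rigid change (e.g., moving a single anchor) makes the energy strictly positive, which is the signal used to pin down wrong relative scales.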
alone is good enough to provide reasonably correct scales; however, the piecewise planar composition of a continuous 3D space creates discontinuities near the boundaries of each plane. For this reason, we incorporate additional constraints to fix this depth discontinuity and to further refine the motion and geometry of each superpixel via neighboring relations. We call these constraints the Planar Reprojection, 3D Continuity, and Orientation energy terms.
Planar Reprojection Energy Term: Under the assumption that each superpixel represents a plane in 3D, it must satisfy the corresponding planar reprojection error in the 2D image space. This reprojection cost reflects the average dissimilarity with the optical flow correspondences across the entire superpixel due to the motion. It therefore helps us constrain the surface normal, rotation and translation direction such that they obey the observed planar homography in the image space. To refer to any pixel inside a superpixel, we use the operator ; e.g., gives the coordinates of a pixel inside . Using it, we define
(3)  
Here, and are the optical flow correspondences of a pixel inside a superpixel in the reference frame and the next frame respectively. The operator represents the cardinality of a set. is a trade-off scalar chosen empirically. A natural question that may arise is: this term is independent of scale, so what is the purpose of this constraint? We refer the reader to §4.5.2 for details.
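A hedged sketch of such a reprojection cost, measured as the average homography transfer error against the optical-flow correspondences of one superpixel (the exact form of Eq. 3 is not reproduced here):

```python
import numpy as np

def reprojection_energy(H, pix_ref, pix_next):
    """Mean homography transfer error over the pixels of one superpixel.

    H        : 3x3 plane-induced homography of the superpixel.
    pix_ref  : (M, 2) pixel coordinates inside the superpixel (reference).
    pix_next : (M, 2) their optical-flow correspondences in the next frame.
    """
    ph = np.hstack([pix_ref, np.ones((len(pix_ref), 1))])
    q = (H @ ph.T).T
    q = q[:, :2] / q[:, 2:3]
    # Average dissimilarity over all pixels of the superpixel.
    return float(np.mean(np.linalg.norm(q - pix_next, axis=1)))

# Identity homography with identical correspondences gives zero energy.
p = np.array([[10.0, 20.0], [11.0, 21.0], [12.0, 19.0]])
assert reprojection_energy(np.eye(3), p, p) == 0.0
```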
3D Continuity Energy Term: In the case of a dynamic scene, where both the camera and the objects are in motion, it is quite apparent that the scene will undergo some changes across frames. Hence, assuming unremitting global continuity under a piecewise planar model of a dynamic scene is unreasonable. Instead, a local weak continuity constraint can be enforced, i.e., a constraint that can be broken occasionally [hinton1977relaxation]: local planes are connected to a few of their neighbors. Accordingly, we want to allow local neighbors to be piecewise continuous. To favor such continuous or smooth surface reconstruction, we require neighboring superpixels to have a smooth depth transition at their boundaries. To do so, we define a 3D continuity energy term as:
(4)  
where , represent the corresponding matrices in the 2D image space and the 3D Euclidean space (, where is the total number of boundary pixels for a superpixel). Since, in our representation, geometry and motion are shared among all pixels within a superpixel, regularization within the superpixel is not explicitly needed. Thus, we concentrate only on the shared boundary pixels to regularize our energy. Note that the neighboring relationship in is different from that in the term: here, the neighbors share common boundaries with each other.
To encourage the geometry to be approximately smooth locally when the object has similar appearance, we color-weight the energy term along the boundary pixels. For each boundary pixel of a given superpixel, we consider its 4-connected neighboring pixels for weighting. Using this idea, for we obtain:
(5) 
which weighs the inter-plane transition by color difference. The symbol denotes the set containing the 4-connected pixels of each boundary pixel shared with a neighboring superpixel. The color-based weighting term plays an important role in allowing for the “weak continuity constraint”, i.e., gradually allowing for occasional discontinuities [hinton1977relaxation] [blake1983least].
To better understand the implication of this constraint, consider two boundary points in the image space . Generally, if these two points lie on different planes, they will not coincide in 3D space before and after the motion. Hence, we compute the 3D distance between boundary pixels in both the reference frame and the next frame, which leads to our goal of penalizing distances along shared edges (see Fig. 5). This term therefore ensures that the 3D coordinates across superpixel boundaries are continuous in both frames. The challenge here is to reach a satisfactory solution for overall scene continuity almost everywhere in both frames [blake1987visual]. In Eq. 4, is a truncation function defined as , and, similar to in Eq. 2, in Eq. 5 is a constant chosen empirically.
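One plausible reading of Eq. 4 and Eq. 5, with a truncated 3D boundary distance and an exponential colour weight (the truncation threshold, weighting form, and constants below are illustrative, as the original equations are typeset as images):

```python
import numpy as np

def continuity_energy(bnd_i, bnd_j, colors_i, colors_j, tau=0.5, sigma=10.0):
    """Colour-weighted, truncated 3D distance along a shared boundary.

    bnd_i, bnd_j : (M, 3) 3D positions of matched boundary points of two
                   neighbouring superpixels (same frame).
    colors_i/j   : (M,) pixel intensities used for the colour weighting.
    tau          : truncation threshold (weak continuity: the constraint
                   may be broken at genuine depth discontinuities).
    """
    d = np.linalg.norm(bnd_i - bnd_j, axis=1)
    d = np.minimum(d, tau)                             # truncation function
    w = np.exp(-np.abs(colors_i - colors_j) / sigma)   # colour weighting
    return float(np.sum(w * d))

# Coinciding boundaries cost nothing; a 5 m gap is capped at tau per pixel.
b = np.array([[0.0, 0, 5], [0.1, 0, 5]])
col = np.array([100.0, 100.0])
assert continuity_energy(b, b, col, col) == 0.0
```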
Orientation Energy Term: To encourage smoothness in the orientation of neighboring planes, we add one more geometric constraint, i.e., , defined as follows.
(6) 
Here, the neighbor index is the same as in the 3D continuity term. denotes the truncated penalty function, defined as . Intuitively, it encourages similarity between neighboring normals and truncates any value larger than .
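A sketch of one plausible form of this term, penalising 1 − |n_i · n_j| (zero for parallel unit normals) and truncating the penalty; the exact penalty and threshold in Eq. 6 are not reproduced here:

```python
import numpy as np

def orientation_energy(normals, nbrs, tau=0.5):
    """Truncated dissimilarity between neighbouring unit plane normals.

    Truncating at tau keeps a genuine crease (very different normals)
    from dominating the energy.
    """
    E = 0.0
    for i in range(len(normals)):
        for j in nbrs[i]:
            E += min(1.0 - abs(normals[i] @ normals[j]), tau)
    return E

# Parallel normals incur no penalty.
n = np.array([[0.0, 0, 1], [0.0, 0, 1]])
assert orientation_energy(n, [[1], [0]]) == 0.0
```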
Combined Energy Function: Equipped with all these constraints, we define our overall energy function to obtain a scale-consistent 3D reconstruction of a complex dynamic scene. Our goal is to estimate the depth (), surface normal () and scale for each 3D planar superpixel; the key is to estimate the unknown relative scale . We solve this by minimizing the following energy function:
(7)  
The equality constraint on fixes the unknown freedom of a global scale. The constraint on is imposed to restrict the rotation matrix to lie on the manifold. In our formulation, the rotation matrix represents the combined Euler 3D angles. Although there are other efficient representations for 3D rotation, we use the matrix representation as it arises naturally via the epipolar geometric constraint; hence, further post-conversion steps can be avoided. The constants are included for numerical consistency.
4.5 Implementation
We partition the reference image into 1,000 to 2,000 superpixels [achanta2012slic]. Parameters such as , , , , were tuned differently for different datasets. To perform optimization of the proposed energy function (Eq. 7), we require an initial set of proposals for motion and geometry.
4.5.1 Initial Proposal Generation
We exploit the piecewise rigid and planar assumptions to estimate an initial proposal for geometry and motion. We start by estimating a homography for each superpixel using dense feature correspondences. The piecewise rigid assumption helps in the approximate estimation of the rotation and the correct translation direction via triangulation and the cheirality check [hartley2003multiple] [hartley1997triangulation]. To obtain the correct normal direction and an initial depth estimate, we solve the following set of equations for each superpixel:
(8) 
We choose this strategy to obtain the normal because a simple decomposition of the homography matrix into rotation, translation and normal can lead to sign ambiguity [varol2009template] [malis2007deeper]. Nevertheless, if one has the correct rotation and translation direction, which we infer from the cheirality check, then inferring the normal becomes easy (the solution for the normal must be normalized). Here, we assume the depth ‘’ to be a positive constant and the initial arbitrary reconstruction to be in the +Z direction. This strategy of gathering 9-dimensional variables (6 motion variables and 3 geometry variables) for each individual superpixel gives us a good enough estimate to start the minimization of our overall energy function (if a superpixel is very small, we use the optical flow of the neighboring superpixels to estimate its motion parameters).
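The normal-recovery step can be sketched as follows. Given R, the translation direction t and a fixed positive depth d, the normal follows linearly from the homography via t n^T / d = K^{-1} H K − R (all values are synthetic, and H is built from the ground truth here rather than estimated from correspondences):

```python
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
Kinv = np.linalg.inv(K)

# Ground-truth plane and motion for one superpixel (d fixed to a positive
# constant; the initial reconstruction is taken along +Z).
n_true = np.array([0.1, -0.2, 1.0])
n_true /= np.linalg.norm(n_true)
d = 1.0
R = np.eye(3)
t = np.array([0.2, 0.1, -0.05])

# In practice H is estimated per superpixel from dense correspondences;
# here it is synthesised so the recovery can be checked exactly.
H = K @ (R + np.outer(t, n_true) / d) @ Kinv

# Given R and t (from the cheirality check), solve t n^T / d = K^-1 H K - R.
M = Kinv @ H @ K - R
n = d * (t @ M) / (t @ t)
n /= np.linalg.norm(n)             # the recovered normal must be normalised
assert np.allclose(n, n_true)
```

Because t n^T is a rank-one matrix, projecting M onto t isolates n without the sign ambiguity of a blind homography decomposition.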
To initialize the 3D vectors in our formulation, we use the following well-known relation:
(9) 
where are the image coordinates and are the camera intrinsic parameters, which can be inferred from the matrix.
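Eq. 9 is the standard back-projection X = Z · K⁻¹ [u, v, 1]^T; a short sketch with a synthetic intrinsic matrix:

```python
import numpy as np

def backproject(u, v, depth, K):
    """Lift pixel (u, v) with depth Z to 3D:  X = Z * K^{-1} [u, v, 1]^T."""
    return depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
X = backproject(320.0, 240.0, 5.0, K)   # principal point -> on the optical axis
assert np.allclose(X, [0.0, 0.0, 5.0])
```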
4.5.2 Optimization
With a good enough initialization of the variables, we start to optimize our energy function (Eq. 7). A globally optimal solution is hard to achieve due to the non-convex nature of the proposed cost function; however, it can be solved efficiently using interior-point methods [benson2002interior] [benson2014interior]. Although the solution found by an interior-point method is at best a local minimizer, empirically it appears to give good 3D reconstructions. In our experiments, we initialized all ’s with an initial value of .
Next, we employ a particle-based refinement algorithm to rectify our initial motion and geometry beliefs. Specifically, we used the Max-Product Particle Belief Propagation (MP-PBP) procedure with the TRW-S algorithm [kolmogorov2006convergent] to optimize over the surface normals, rotations, translations and depths of all 3D superpixels using Eq. 10. We generated 50 particles as proposals for the unknown parameters around the already known beliefs to initiate refinement moves. Repeating this strategy for 5 to 10 iterations, we obtain a smooth and refined 3D structure of the dynamic scene.
(10) 
5 Experiments and Results
We evaluated our formulation both qualitatively and quantitatively on several standard benchmark datasets, namely MPI Sintel [butler2012naturalistic], KITTI [geiger2013vision], VKITTI [gaidon2016virtual] and the YouTube Objects dataset [prest2012learning]. All these datasets contain images of dynamic scenes where both the camera and objects are in motion w.r.t. each other. To test the reconstruction results on deformable objects, we used the Paper, T-shirt [varol2009template] [varol2012constrained] and Back sequences [garg2013dense]. For evaluating the results, we selected the most commonly used error metric, i.e., the mean relative error.
Evaluation Metric: To keep the evaluation metric consistent with previous work [ranftl2016dense], we used the mean relative error (MRE) metric for evaluation. MRE is defined as . Here, , denote the estimated and ground-truth depth respectively, with being the total number of points. The error is computed after properly rescaling the recovered depth, as the reconstruction is obtained up to an unknown global scale. Quantitative evaluations for the YouTube Objects dataset and the Back dataset are missing due to the absence of ground truth. To show that the same formulation works well for both rigid and non-rigid cases, we evaluated our method on different types of scenes containing rigid, non-rigid, and complex dynamic content, i.e., compositions of both.
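The MRE computation, including the global rescaling step, can be sketched as below; a least-squares fit is one common way to resolve the unknown global scale (the paper does not specify which rescaling it uses, so this choice is an assumption):

```python
import numpy as np

def mean_relative_error(d_est, d_gt):
    """MRE = (1/N) * sum |s * d_est - d_gt| / d_gt, after resolving the
    unknown global scale s by a least-squares fit to the ground truth."""
    s = (d_est @ d_gt) / (d_est @ d_est)       # optimal global rescaling
    return float(np.mean(np.abs(s * d_est - d_gt) / d_gt))

gt = np.array([2.0, 4.0, 8.0])
# A globally rescaled but otherwise perfect depth map scores (near) zero.
assert mean_relative_error(3.1 * gt, gt) < 1e-9
```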
5.1 Experimental Setup and Results
Experimental setup and processing time: We partition the reference image using SLIC superpixels [achanta2012slic]. We used a current state-of-the-art optical flow algorithm to compute dense optical flow [bailer2015flow]. To initialize the motion and geometry variables, we used the procedure discussed in §4.5.1. The interior-point algorithm [benson2002interior] [benson2014interior] and TRW-S [kolmogorov2006convergent] were employed to solve the proposed optimization. We implemented our algorithm in MATLAB/C++. Our modified implementation (adapted from our ICCV implementation [kumar2017monocular]) takes an average of 15 to 20 minutes to produce the result for images of size . The processing time is estimated on a regular desktop with an Intel Core i7 processor (16 GB RAM) for 50 refinement particles per superpixel.
Results on MPI Sintel Dataset: We begin our analysis of experimental results with the MPI Sintel dataset [butler2012naturalistic]. This dataset is derived from an animation movie featuring complex scenes. It contains highly dynamic sequences with large motions, significant illumination changes, and non-rigidly moving objects. This dataset has emerged as a standard benchmark to evaluate dense optical flow algorithms, and recently it has also been used in the evaluation of dense 3D reconstruction methods for general dynamic scenes [ranftl2016dense].
The presence of non-rigid objects in the scene makes it a prominent choice for testing our algorithm. It is a challenging dataset, particularly for the piecewise planar assumption, due to the presence of many small and irregular shapes in the scene. Additionally, the availability of ground-truth depth maps makes quantitative analysis much easier. We selected 120 pairs of images to test our method, which include images from alley_1, ambush_4, mountain_1, sleeping_1 and temple_2. Fig. 7 shows some qualitative results on a few images taken from this subgroup of the MPI Sintel dataset.
Results on VKITTI Dataset: The Virtual KITTI dataset [gaidon2016virtual] contains computer-rendered photorealistic outdoor driving scenes which resemble the KITTI dataset. The advantage of using this dataset is that it provides perfect ground truth for many measurements. Furthermore, it allows algorithms for dense 3D reconstruction to be tested with distortion-free and noise-free images, facilitating quick experimentation. We selected 120 pairs of images from 0001_morning, 0002_morning, 0006_morning and 0018_morning. Our qualitative results in comparison to the ground-truth depth maps are shown in Fig. 8.
Results on KITTI Dataset: The KITTI dataset [geiger2013vision] features real-world outdoor scenes targeting autonomous driving applications. The KITTI images are taken from a camera mounted on top of a car. It is a challenging dataset, as it contains scenes with large camera motion and realistic lighting conditions. In contrast to the aforementioned datasets, it only contains sparse ground-truth 3D information, which makes evaluation a bit strenuous. Nonetheless, it captures noisy real-world situations and is therefore well suited to test 3D reconstruction algorithms for the general dynamic scene case. We selected the 0009 subcategory from the odometry dataset to evaluate and compare our results. We calculated the mean relative error only over the provided sparse 3D LiDAR points, after adjusting the global scale. Fig. 9 shows some qualitative results on a few images.
Results on Non-Rigid Sequences: We also tested our method on some commonly used dense non-rigid sequences, namely kinect_paper [varol2009template], kinect_tshirt [varol2009template], and the back sequence [garg2013dense] (note: the intrinsic matrix for the back sequence is not available with the dataset; we estimated an approximate value using the 2D-3D relation available from Garg et al. [garg2013dense]). Most benchmark approaches to non-rigid structure from motion use multiple frames and an orthographic camera model. Despite being a two-frame method with a perspective camera model, we are able to capture the deformation of non-rigid objects and achieve reliable reconstructions. Qualitative results for the dense non-rigid object sequences are shown in Fig. 10. To compute the mean relative error, we align and scale our shape (fixing the global ambiguity) w.r.t. the ground-truth shape.
5.2 Comparison
We compared the performance of our algorithm against several dynamic reconstruction methods, namely, the Block Matrix Method (BMM) [dai2014simple], the Point Trajectory Approach (PTA) [akhter2011trajectory], Low-rank Reconstruction (GBLR) [fragkiadaki2014grouping], Depth Transfer (DT) [karsch2014depth], DMDE [ranftl2016dense], and ULDEMV [zhou2017unsupervised]. This comparison is made on the available benchmark datasets, i.e., MPI Sintel (MPIS), KITTI, VKITTI, kinect_tshirt (k_tshirt), and kinect_paper (k_paper). Table I provides the statistical results of our method in comparison to the baseline approaches on these datasets. Our method outperforms the others on the outdoor sequences and provides commendable performance on the deformable sequences. Additionally, we performed a qualitative comparison on MPI Sintel [butler2012naturalistic], KITTI [geiger2013vision] and the YouTube Object dataset [prest2012learning]. Fig. 11 and Fig. 12 provide the visual comparison of our method against other competing methods. It can be observed that our method consistently delivers superior performance on all of these datasets. While compiling the results, a per-frame comparison was also made over the entire sequence. Evaluation on the KITTI dataset is done only for the provided sparse 3D LiDAR points. Fig. 13(a), Fig. 13(b), and Fig. 14(c) show the per-category statistical performance of our approach against other competing methods on the benchmark datasets.
5.3 Performance Analysis
Besides the statistical comparison, we conducted other experiments to analyze the behavior of our algorithm. These experiments provide an in-depth understanding of the dependency of our algorithm on its input modules.
Table I: Mean relative error comparison on the benchmark datasets (lower is better; '-' marks results not available).

Data       BMM     PTA     GBLR    DT      DMDE    Ours
MPIS       0.4833  0.4101  0.3121  0.3177  0.297   0.1643
VKITTI     0.2630  0.3237  0.2894  0.2742  -       0.0925
KITTI      0.2703  0.4112  0.3903  0.4090  0.148   0.1254
k_paper    0.2040  0.0920  0.0322  0.0520  -       0.0472
k_tshirt   0.2170  0.1030  0.0443  0.0420  -       0.0480
Performance with variation in the number of superpixels: Our method uses SLIC-based over-segmentation of the reference frame to discretize the 3D space. Therefore, the number of superpixels used to represent the real world plays a crucial role in the accuracy of the piecewise continuous reconstruction. If the number of superpixels is very high, estimating the motion parameters becomes tricky; neighboring superpixels must then be used jointly to estimate the rigid motion, which leads to computational challenges. In contrast, a small number of superpixels is unable to capture the intrinsic details of a complex dynamic scene. So, a trade-off between the two is often a better choice. Fig. 14(a) shows the variation of depth error with the number of superpixels.
(Fig. 14(c): Mean relative depth error comparison with a recently proposed unsupervised learning-based approach (ULDEMV [zhou2017unsupervised]) on the KITTI dataset [geiger2013vision].)
Performance with a regular grid as image superpixels: Under the piecewise planar assumption, it is not only the number of superpixels that affects the accuracy of reconstruction but also the type of superpixel pattern. To analyze this dependency, we took the worst possible case, i.e., dividing the reference image into approximately 1000 regular grid cells, and compared its performance against 1000 SLIC superpixels. Our observations clearly show a decline in performance in comparison to SLIC superpixels. However, the difference in accuracy is not very significant (see Fig. 15).
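The regular-grid baseline is simple to reproduce. Below is a minimal sketch that assigns each pixel to one of approximately 1000 grid cells; the function name and the row/column heuristic are our assumptions, not the paper's (SLIC itself would be obtained from an off-the-shelf implementation).

```python
import math

def regular_grid_labels(height, width, n_cells=1000):
    """Assign each pixel a cell label from an approximately n_cells
    regular grid -- the baseline compared against SLIC superpixels."""
    # choose rows/cols so rows*cols is close to n_cells and the cells
    # are roughly square for the given aspect ratio
    rows = max(1, round(math.sqrt(n_cells * height / width)))
    cols = max(1, round(n_cells / rows))
    labels = [[min(y * rows // height, rows - 1) * cols
               + min(x * cols // width, cols - 1)
               for x in range(width)]
              for y in range(height)]
    return labels, rows * cols

# e.g. a 120x160 image partitioned into ~1000 cells
labels, n = regular_grid_labels(120, 160, 1000)
```

Unlike SLIC, the grid ignores image content entirely, which is why it serves as the worst-case superpixel pattern in this comparison.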
Effect of K in the KNN Graph: In our method, the ARAP energy term is evaluated using the K-nearest-neighbor graph. Different values of K lead to different 3D reconstruction results. An experiment on the flying dragon sequence was conducted to analyze the effect of varying K on the performance of our algorithm; the result is shown in Fig. 16. With an increase in K, the rigidity constraint is enforced over a larger neighborhood, which directs the 3D reconstruction toward a globally rigid solution. On the other hand, a very small value of K fails to constrain the within-object motion. In most of our experiments, we used a K in the range of , which achieved satisfactory 3D reconstruction. Also, increasing the value of K directly increases the overall algorithmic complexity.
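The KNN graph and an ARAP-style cost can be sketched as follows. The paper's actual ARAP term, defined over per-superpixel rotations and translations, is richer than this; the distance-preservation version below, and all names in it, are illustrative simplifications of the idea.

```python
import math

def knn_graph(points, k):
    """For each anchor point, return the indices of its k nearest
    neighbours (brute force; fine for a few thousand superpixels)."""
    nbrs = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        nbrs.append(order[:k])
    return nbrs

def arap_energy(before, after, nbrs):
    """Simplified as-rigid-as-possible cost: sum of squared changes in
    distance between each anchor and its K nearest neighbours across
    the two frames. Zero for a rigid motion of the whole neighbourhood."""
    e = 0.0
    for i, js in enumerate(nbrs):
        for j in js:
            d0 = math.dist(before[i], before[j])
            d1 = math.dist(after[i], after[j])
            e += (d1 - d0) ** 2
    return e

# a pure translation preserves all pairwise distances (energy 0),
# while a scaling violates them (energy > 0)
before = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
nbrs = knn_graph(before, 2)
translated = [(x + 2.0, y + 3.0) for x, y in before]
scaled = [(2.0 * x, 2.0 * y) for x, y in before]
```

This also illustrates the failure mode noted later in the text: a shape that uniformly shrinks or expands incurs a high ARAP cost even though the deformation is plausible.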
Performance variation using different optical flow algorithms: As our method uses dense optical flow correspondences between frames as input, its performance is directly affected by the flow accuracy. To analyze the sensitivity of our method to different optical flow inputs, we conducted experiments with the ground-truth optical flow and a few state-of-the-art optical flow methods [bailer2015flow] [chen2016full]. In Fig. 14(b), we show the 3D reconstruction performance evaluated in RMSE (Root Mean Square Error, defined as $\sqrt{\frac{1}{P}\sum_{i=1}^{P}(d_i - d_i^{gt})^2}$, where $d_i$ and $d_i^{gt}$ denote the estimated and ground-truth depth respectively and $P$ is the total number of points) with different optical flow inputs. This experiment reveals the importance of dense optical flow for the accurate reconstruction of a dynamic scene. While ground-truth optical flow naturally achieves the best performance, the differences in results across state-of-the-art optical flow methods are not dramatic. Therefore, we conclude that our method can achieve reliable results with the available dense optical flow algorithms.
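The RMSE from the definition above can be written directly; the helper name is illustrative.

```python
import math

def rmse(est, gt):
    """RMSE = sqrt((1/P) * sum (d_i - d_i^gt)^2) over P depth values."""
    return math.sqrt(sum((e - g) ** 2 for e, g in zip(est, gt)) / len(est))
```

Unlike the MRE used in the main comparison, RMSE is not normalized by the ground-truth depth, so errors on distant points weigh more heavily.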
6 Limitations and Discussion
The success of our method depends on the effectiveness of the piecewise planar and as-rigid-as-possible assumptions. As a result, our method may fail if the piecewise smooth model is no longer a valid approximation of the dynamic scene. For example, very fine or very small structures that are considerably far from the camera are difficult to recover under the piecewise planar assumption. When may the as-rigid-as-possible assumption fail? When the motion of the dynamic objects between consecutive frames is so large that most of the neighboring relations in the reference frame are violated in the next frame. Additionally, if a non-rigid shape shrinks or expands over frames, such as a deflating or inflating balloon, the ARAP model fails. A couple of examples of such situations are discussed in Fig. 17. The other major limitation of our method is the overall processing time.
6.1 Discussion
1. Directions to reduce the processing time of our algorithm: Our algorithm is computationally expensive to execute on a regular desktop machine. This is due to its formulation, which solves a higher-order graph optimization problem with particle-based refinement using TRW-S. To speed up the processing time, we are incorporating some of the recent research in fast interior-point optimization and message-passing algorithms [pearson2017fast] [Tourani_2018_ECCV] into our framework. We believe that solving our optimization with these algorithms, along with better computational capabilities, can significantly reduce the processing time of our method.
2. Suitability of the Euclidean distance metric between graph vertices: Generally, the Euclidean distance metric between graph vertices works well under our piecewise planar assumption of a dynamic scene. However, there are situations where it may not be an appropriate metric, for example, when the shape of the superpixels is affected by noise, or when modeling curved surfaces with a piecewise planar graph structure. To handle such special cases, it is better to measure distance in an embedding space (isometric embedding) or to use another suitable metric. To be precise, depending on the shape of the deforming structure over time, the choice of a suitable metric may vary. Interested readers are encouraged to study the field of intrinsic metrics on graphs [keller2015intrinsic].
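One concrete intrinsic alternative to the plain Euclidean metric is the graph geodesic (shortest-path) distance with Euclidean edge lengths. The sketch below uses Dijkstra's algorithm; the function name and the toy graph are our illustrative assumptions.

```python
import heapq
import math

def geodesic_distances(points, edges, src):
    """Shortest-path distance from src over a graph whose edge weights
    are the Euclidean lengths of the edges -- an intrinsic metric that
    respects the graph structure rather than straight-line distance."""
    adj = {i: [] for i in range(len(points))}
    for i, j in edges:
        w = math.dist(points[i], points[j])
        adj[i].append((j, w))
        adj[j].append((i, w))
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# on a bent chain of vertices, the geodesic distance between the
# endpoints exceeds their straight-line (Euclidean) distance
pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
d = geodesic_distances(pts, [(0, 1), (1, 2)], 0)
```

For a curved surface discretized as a graph, this distance follows the surface rather than cutting through space, which is exactly the property the Euclidean metric lacks.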
Ablation Analysis: To understand the contribution of the different energy terms to the overall optimization, we performed an ablation analysis. Firstly, in the proposed optimization framework, the 3D continuity term is defined over boundaries between neighboring superpixels, which alone is not sufficient to constrain the motion beyond the immediate neighbors. Secondly, the remaining terms have nothing to do with scale computation whatsoever; hence, combining these three terms is not good enough to explain the correct scale for each object present in the scene. On the other hand, the as-rigid-as-possible term is defined for each superpixel's anchor point over the KNN graph structure; however, it does not take into account the alignment of the planes in 3D along the boundaries. As a result, dropping any term makes the overall reconstruction suffer. This demonstrates that all the terms are essential for reliable dynamic 3D reconstruction. Fig. 18 illustrates the contribution of the different terms to the final reconstruction result. Table II provides numerical values showing the importance of the different terms to the overall performance of our algorithm. It can be observed that the improvement due to the normal orientation constraint is not very significant.
Data
alley_1     0.2248  0.2022  0.1697  0.1606
ambush_4    0.2381  0.2093  0.1701  0.1676
mountain_1  0.2127  0.1923  0.1492  0.1405
sleeping_1  0.2418  0.2026  0.1912  0.1823
7 Conclusion
In this paper, we have explored, investigated, and supplied a distinct perspective on one of the classical problems in geometric computer vision, i.e., reconstructing a dense 3D model of a complex, dynamic, and generally non-rigid scene from two of its perspective images. This topic is often considered a very challenging task in structure from motion. In spite of these challenges, we have demonstrated that dense, detailed 3D reconstruction of dynamic scenes is, in fact, possible, provided that certain prior assumptions about the scene geometry and the deformation in the scene are satisfied. Both of the assumptions we used are mild, realistic, and commonly satisfied in real-world scenarios. Our comprehensive evaluation on the benchmark datasets shows that our new insight into dense monocular 3D reconstruction of a general dynamic scene provides better results than other competing methods. That said, we think more profound research on top of our idea may help in the development of more sophisticated SfM algorithms.
Acknowledgements:
This research is supported in part by the Australian Research Council ARC Centre of Excellence for Robotic Vision (CE140100016), ARC Discovery (DP 190102261) and ARC LIEF (190100080), the Natural Science Foundation of China grants (61871325, 61420106007, 61671387), the "New Generation of Artificial Intelligence" major project under Grant 2018AAA0102800, and ARC grant DE140100180, and in part by a research gift from Baidu RAL (ApolloScapes, Robotics and Autonomous Driving Lab). The authors gratefully acknowledge the Data Science GPU gift award by NVIDIA Corporation. We thank all the reviewers and the AE for their constructive suggestions.