Dense Object Reconstruction from RGBD Images with Embedded Deep Shape Representations

10/11/2018 ∙ by Lan Hu, et al.

Most problems involving simultaneous localization and mapping can nowadays be solved using one of two fundamentally different approaches. The traditional approach is given by a least-squares objective, which minimizes many local photometric or geometric residuals over explicitly parametrized structure and camera parameters. Unmodeled effects violating the Lambertian surface assumption or geometric invariances of individual residuals are countered through statistical averaging or the addition of robust kernels and smoothness terms. Aiming at more accurate measurement models and the inclusion of higher-order shape priors, the community more recently shifted its attention to deep end-to-end models for solving geometric localization and mapping problems. However, at test time, these feed-forward models ignore the more traditional geometric or photometric consistency terms, which limits their ability to recover fine details and can lead to complete failure in corner-case scenarios. With an application to dense object modeling from RGBD images, our work aims at taking the best of both worlds by embedding modern higher-order object shape priors into classical iterative residual minimization objectives. We demonstrate a general ability to improve mapping accuracy with respect to each modality alone, and present a successful application to real data.




1 Introduction and review of existing work

The ability of a machine to perceive and localize within its immediate surroundings has long been recognized as a fundamental enabler of several game-changing future technologies, including mixed or augmented reality, robotics, and intelligent vehicles. Whether we are talking about a passively moving head-mounted display in an augmented reality application or an actively navigating self-driving car, the 3D perception task remains similar: Process the continuous input data stream from all the available sensors to solve the mutual problem of simultaneous localization and mapping (SLAM).

In an effort to include 3D geometric priors into the estimation, the community has recently also explored deep end-to-end models for the feed-forward generation of depth maps or even full object models. For example, [Wu et al., 2017, Girdhar et al., 2016, Yang et al., 2017, Smith and Meger, 2017] use encoder-decoder networks to output binary 3D occupancy grids from single images of an object. [Di et al., 2016, Yan et al., 2016, Rezende et al., 2016] furthermore train networks to perform reconstruction by utilizing multi-view images. While these models lead to surprising performance, they ignore, at least at test time, the more traditional but often valid geometric or photometric consistency constraints altogether. As a result, deep feed-forward models have difficulty reconstructing fine details, and often fail to provide confidence measures or satisfying performance in corner-case scenarios that are insufficiently represented in the training set.

In an effort to push the performance of SLAM formulations to the next level, the community has recently investigated various strategies to combine both modalities and include prior, high-level knowledge into classical iterative residual minimization frameworks:

  • The simplest approach of including higher-level knowledge is given by explicitly representing the common 1D or 2D geometric structures of man-made environments. Straightforward examples are given by lines and planes [Salas-Moreno et al., 2014, Micusik and Wildenauer, 2015]. Although a well-explored idea, this technique already achieves remarkable properties that we also pursue in our work: A low-dimensional dense representation of the environment that implicitly enforces smoothness by encoding the higher-order shape of the environment.

  • Starting from these simple geometric primitives, the community has moved on to the usage of image-based semantic object detection modules. The latter typically incorporate the experience given by countless training examples into a state-of-the-art deep convolutional neural network, which is used online for retrieving a plausible CAD model to represent objects in the environment. Our work is similar in that we also leverage semantic information to change or even simplify the object representations. However, we do not rely on CAD models. While object detection modules are typically tested for generalization ability, picking a concrete model from a database for the actual SLAM optimization objective [Civera et al., 2011, Salas-Moreno et al., 2013, Gupta et al., 2015, Gálvez-López et al., 2016, Mu et al., 2016] removes this advantage and limits accurate inference to a finite set of objects for which an exact model is known in advance.

  • More recently, the community defined the concept of semantic mapping, which denotes the augmentation of generated 3D models by semantic information. However, methods such as [Koppula et al., 2011, Stückler et al., 2015, Kundu et al., 2014, McCormac et al., 2017] hardly explore the benefit of semantic knowledge in geometric inference. They merely transfer local semantic knowledge to a global representation where multiple observations from different view-points are fused (note however that the semantic recognition part can be readily extended to use dense depth information as well). A notable exception is presented in [Häne et al., 2017], where local smoothness constraints are rendered dependent on the semantic knowledge.

  • The works closest to our approach are given by [Bloesch et al., 2018] and [Zhu et al., 2017], which utilize low-dimensional latent feature vectors to achieve low-rank representations of 2.5D depth maps and point clouds, respectively. However, [Bloesch et al., 2018] aims at learning a general representation for entire depth maps without employing semantic knowledge. [Zhu et al., 2017] employs shape representations for individual objects, however only in the form of sparse point clouds for very simple shapes and without mechanisms to deal with occlusions. Furthermore, the work leaves open questions about how a shape generator mapping from a low-dimensional latent space to the full object geometry is effectively split off from the hourglass architecture taken from [Fan et al., 2016]. A further notable work is given by [Dame et al., 2013], which however does not yet benefit from powerful modern deep architectures.

Our contributions are as follows. We apply the idea of embedded deep shape representations to a novel RGBD incremental tracking and mapping framework for single objects. Our representation in 3D is dense and object centric, and we introduce an occlusion mask-supported strategy to actively find a trade-off between imposing priors generated by the network, and traditional residuals with respect to the measurements. We conclude with a successful application to a challenging real example, and reconstruct a dense model of a chair from real Kinect images. Section 2 introduces our deep shape representation. Section 3 explains how this model is embedded into an incremental tracking and mapping framework. Section 4 finally concludes with our experimental evaluation on both artificial and real data.

2 Object shape representation learning

This section introduces our higher-level object shape representations. We start by motivating the basic form of the representation, as well as the need for functions mapping to and from the defined shape space. We then present the exact architecture devised to realize these mappings, followed by the details of the corresponding training procedure.

2.1 Motivation for higher-level models

The traditional geometric formulation of a simultaneous tracking and mapping problem starts with the definition of a measurement model. Let be points on an object, and the poses of viewpoints from where the object is observed. Without loss of generality, point measurements are given by


where is a projection function that respects the intrinsic parameters of the sensor, and is an independent noise component. The residuals between the measurements and the estimated poses and points are then given by


A popular alternative to solve the tracking and mapping problem is to search for poses and points that minimize the sum of least squares residuals. However, in order to improve the conditioning of the problem, the energy is often complemented by a dual smoothness term that enforces neighbouring points to remain close to each other. The final objective may read


where is an indicator function that indicates whether point is visible in view , is the set of index-pairs for points that are defined to be neighbors, and and are robust cost functions. There are several problems with this objective. The first one is that the dimensionality of the problem is potentially very high, especially if we are considering the dense scenario. Then, since not all residuals are sensitive w.r.t. the camera pose, the correct solution depends on the additional smoothness term. Even though it can be solved efficiently via a primal-dual method, it does raise the complexity of the optimization problem. The final problem of the formulation is that the measurement model is very simple, and does not take gross errors caused by reflections or complicated illumination conditions into account. The robust kernels and the smoothness term have only limited ability to deal with such effects.
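For reference, the measurement model, the residuals, and the regularized objective described above can be written in a generic form. This is only a sketch: the symbols used here (object points $\mathbf{p}_i$, poses $\mathbf{T}_j$, projection $\pi$, noise $\mathbf{n}_{ij}$, visibility indicator $v_{ij}$, neighbor set $\mathcal{N}$, robust costs $\rho$ and $\rho_s$) are notation assumed for illustration and need not match the authors' own.

```latex
% Assumed notation, for illustration only.
% (1) measurement model, (2) residuals, (3) regularized objective
\begin{align}
\mathbf{x}_{ij} &= \pi(\mathbf{T}_j, \mathbf{p}_i) + \mathbf{n}_{ij}, \\
\mathbf{r}_{ij}(\mathbf{T}_j, \mathbf{p}_i) &= \mathbf{x}_{ij} - \pi(\mathbf{T}_j, \mathbf{p}_i), \\
\{\hat{\mathbf{T}}_j, \hat{\mathbf{p}}_i\} &= \operatorname*{argmin}_{\{\mathbf{T}_j\}, \{\mathbf{p}_i\}}
  \sum_{i,j} v_{ij}\, \rho\big(\mathbf{r}_{ij}(\mathbf{T}_j, \mathbf{p}_i)\big)
  + \sum_{(k,l) \in \mathcal{N}} \rho_s\big(\mathbf{p}_k - \mathbf{p}_l\big).
\end{align}
```

The number of unknowns grows with the number of points, which is the dimensionality problem raised in the paragraph above.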

We now assume that we have a compact way to describe the object’s shape given by an -dimensional shape descriptor and a function to map from the latent shape space to the full 3D geometry. Let be the set of points describing the object’s shape and generated by . The objective of tracking and mapping may now be reformulated as


a much simpler formulation that has a few encouraging properties such as low-dimensionality, implicit smoothness in the generated point cloud, and—assuming that the shape representation is strong enough to only generate valid shapes—a high resilience against effects that are not modeled by simple measurement functions.
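A generic way to write this reformulated objective is sketched below. The notation is assumed for illustration and may differ from the authors': $\mathbf{z}$ is the latent shape descriptor, $g$ the shape generator, $\mathbf{q}_i(\mathbf{z})$ the $i$-th generated point, $\mathbf{x}_{ij}$ the measurement of point $i$ in view $j$, $\mathbf{T}_j$ the pose of view $j$, $\pi$ the projection function, $v_{ij}$ the visibility indicator, and $\rho$ a robust cost.

```latex
% Assumed notation, for illustration only.
\begin{equation}
\{\hat{\mathbf{T}}_j\}, \hat{\mathbf{z}} = \operatorname*{argmin}_{\{\mathbf{T}_j\}, \mathbf{z}}
  \sum_{i,j} v_{ij}\, \rho\big(\mathbf{x}_{ij} - \pi(\mathbf{T}_j, \mathbf{q}_i(\mathbf{z}))\big),
  \qquad \{\mathbf{q}_i(\mathbf{z})\}_i = g(\mathbf{z}).
\end{equation}
```

No explicit smoothness term appears, since the generated point set is smooth by construction and the only structure unknowns are the entries of $\mathbf{z}$.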

2.2 Architecture

There are various ways to formulate the low-dimensional shape representation such as PCA or LLE, but we choose here auto-encoders which have proven to be potentially very good at this task [Girdhar et al., 2016]. As indicated in Figure 1, we structure the object shape in the original space with a binary voxel grid, and train an auto-encoder to reproduce a full model of the object. Unlike [Wu et al., 2017, Smith and Meger, 2017, Yan et al., 2016, Rezende et al., 2016], which generate the full shape from the RGBD image directly, the voxel grid fuses the information from different views more efficiently, and also does not limit the number of views. We only use simple occupancy information for map encoding, where 1 represents an occupied cell, and 0 an empty one. All our grids (i.e. the input which is the measurements, denoted , the output, denoted , and the ground truth 3D shape, denoted ) are occupancy grids.

In this work, we focus on the example object class of chairs, which we believe to be interesting because its instances share commonalities while at the same time exhibiting relatively complex shapes and intra-class shape variations. Note that, after training is finished, we may separate the encoder from the decoder, and thus obtain our mapping function from the latent shape space to the full geometry that we would like to embed into a residual minimization framework. This is particularly supported by the fact that we do not employ any skip connections in the network.

Figure 1: Inspired by [Girdhar et al., 2016], we use a deep auto-encoder to train a generic low-dimensional shape representation for a certain class of objects. After training is completed, encoder and decoder are separated and used for initialization and iterative residual minimization, respectively.

Besides obtaining a low-dimensional shape representation, we also want our network to learn how to predict the full shape from only partial input binary fields. By doing so, we enable our network and in particular the encoder part to initialize the latent shape descriptor directly from our fused measurements (i.e. one or several fused depth images; note that, once camera poses are known, a partial 3D binary field is readily obtained from 2.5D depth images by recalculating 3D points and finding intersecting voxels). However, note that we still limit ourselves to the intrinsic object shape, and therefore only use similarly oriented chairs. The geometry of Euclidean poses is well understood, and explicitly parametrized in our formulation.

Figure 2: Detailed architecture of our encoder-decoder network.

The dimensionality of the shape representation is the most important design factor. Choosing a high dimensionality may generate many saddle points in the shape space [Dauphin et al., 2014], which makes it difficult for our subsequent iterative residual minimization scheme to converge to the optimal solution. While [Yang et al., 2017], [Dai et al., 2017], and [Wu et al., 2016] use more than dimensions, we managed to reduce the number to , and thus smooth out the topology of our optimization space.

The detailed structure of our auto-encoder neural network is given by the encoder and decoder illustrated in Figure 2. Our encoder is a down-sampling network, encoding a binary grid into the latent shape representation with a dimensionality of . There are five convolution layers. The first four layers have a similar structure, each of them applying a bank of convolution filters with stride , followed by a batch normalization and a leaky ReLU activation function. The fifth layer consists of 3D average pooling and a 3D convolution with filters and stride , which are used to substitute the full linear transformation to avoid over-fitting and sustain spatial information [Lin et al., 2013]. The function of the decoder is to generate the full shape from the latent shape representation. Our decoder has four deconvolution layers with stride , followed by a batch normalization and a leaky ReLU activation function except for the last layer, which has a tanh activation function. Due to the range of the output, we add a linear transformation mapping from [-1, 1] to [0, 1] after the tanh function.

2.3 Training

We use to represent the value at position in a 3D voxel grid . The loss function of the network is given by the binary cross entropy


where is the resolution , the target value in , and the estimated value in . We derive our training data from CAD chair rendering models [Chang et al., 2015]. The input shapes are given by partial voxel grids obtained from depth images. We therefore start by hypothesizing a virtual depth camera scanning frames from different view-points for each CAD model. The altitude and azimuth of the views are drawn randomly from and , respectively. The principal axes of the views intersect the object center, and the -axis of each view remains horizontal. Virtual depth images are generated by projecting each triangle into the image, finding the intersecting pixels, and intersecting the corresponding rays with the triangle in 3D to recover the depth. A depth check is added to handle occlusions.
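The binary cross entropy between a target grid and an estimated grid can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' training code: grids are assumed to be flattened into lists, the loss is averaged over voxels (the normalization is an assumption), and the function name and the clamping constant `eps` are hypothetical.

```python
import math


def binary_cross_entropy(target, estimate, eps=1e-7):
    """Mean binary cross entropy between two flattened voxel grids.

    target   -- occupancy values in {0, 1} (or soft values in [0, 1])
    estimate -- predicted occupancy probabilities in [0, 1]
    eps      -- hypothetical clamping constant that avoids log(0)
    """
    if len(target) != len(estimate):
        raise ValueError("grids must contain the same number of voxels")
    total = 0.0
    for t, e in zip(target, estimate):
        e = min(max(e, eps), 1.0 - eps)  # keep the logarithms finite
        total += -(t * math.log(e) + (1.0 - t) * math.log(1.0 - e))
    return total / len(target)  # averaging is an assumption of this sketch
```

An uninformative prediction of 0.5 for every voxel yields a loss of log 2 per voxel, while a perfect prediction drives the loss towards zero.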

Figure 3: Processing pipeline for generating artificial partial binary grid scans.

Once a depth image is generated, it is transformed back into a 3D point cloud using the virtual camera parameters. The last step then consists of transforming all point clouds into binary voxel fields, which is easily achieved by setting all the voxels that contain a 3D point to 1, and the rest to 0. Note that, in order to make sure the ground truth example is really complete, the point cloud used to generate the ground truth binary field is derived straight from the CAD model without passing through the synthesized depth frames. Also note that the transformation from the point clouds to the binary fields first employs anisotropic scaling and a translational shift to make sure the grid resolution is ideally exploited. For each training example, we generate these transformation parameters only once straight from the CAD model, and then apply them without change to the partial views as well. We obtain pairs of partial and complete voxel grids for each chair and use different chairs for training our network, which means training pairs in total.
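The voxelization step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the binary field is stored sparsely as a set of occupied voxel indices (every index in the set is 1, all others 0), and the function names `fit_grid_transform` and `voxelize` are hypothetical.

```python
def fit_grid_transform(points, resolution):
    """Per-axis (anisotropic) scale and shift that map the bounding box of
    `points` onto the voxel index range [0, resolution - 1]."""
    mins = [min(p[a] for p in points) for a in range(3)]
    maxs = [max(p[a] for p in points) for a in range(3)]
    scales = [
        (resolution - 1) / (maxs[a] - mins[a]) if maxs[a] > mins[a] else 1.0
        for a in range(3)
    ]
    return mins, scales


def voxelize(points, mins, scales, resolution):
    """Return the set of occupied voxel indices (i, j, k)."""
    occupied = set()
    for p in points:
        idx = tuple(int(round((p[a] - mins[a]) * scales[a])) for a in range(3))
        if all(0 <= c < resolution for c in idx):  # drop points outside the grid
            occupied.add(idx)
    return occupied


# The transform is fitted once on the complete model and reused for partial views.
full_model = [(0.0, 0.0, 0.0), (2.0, 4.0, 8.0)]
mins, scales = fit_grid_transform(full_model, resolution=5)
complete_grid = voxelize(full_model, mins, scales, resolution=5)
partial_grid = voxelize([(1.0, 2.0, 4.0)], mins, scales, resolution=5)
```

Fitting the transform on the complete model and reusing it for the partial scans keeps partial and complete grids aligned voxel by voxel.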

We initialize all the convolution and normalization parameters of the network by sampling from the Gaussian distributions and . The parameters of the Adam solver are and , and the learning rate is set to for batches. The entire network is trained from scratch with two 1080Ti GPUs using PyTorch.

3 Embedding into Iterative Residual Minimization

This section illustrates how we iteratively refine the generated shape by embedding the model into a traditional residual minimization framework. The computation is divided into two stages. The first one involves the generation of a 3D shape prior by a standard feed-forward execution of the network. The second part then consists of the iterative refinement during which the consistency with the original measurements is taken into account. The section concludes with a brief exposition of our incremental, alternating tracking and mapping scheme. Figure 4 illustrates a break-down of our complete pipeline.

Figure 4: Overview of our shape prior based reconstruction pipeline. After segmentation, the frames are first compensated for their pose and fused to generate an occupancy measurement as well as an occlusion mask . is sent into the auto-encoder network to generate the shape prior . Then, , , and are used to construct cross-entropy residuals with respect to the output of the network. Finally, these residuals are iteratively minimized over the latent shape representation .

3.1 Shape prior generation

We start with a plain feed-forward application of the entire auto-encoder network to initialize our latent object shape representation and obtain a prior on the full 3D geometry. With respect to Figure 4, this is the top part of the flow-chart. The procedure works as follows. We start by applying an object segmentation in each RGBD frame (on synthetic datasets, this step is not necessary, and on real datasets, we simply perform an RGB-image based foreground-background segmentation). Assuming that the pose of each RGBD frame has already been identified, we then take the 3D points measured on the object’s surface and transform them into the world frame (which coincides with the object frame in our work). How to estimate and gradually update the camera pose of each frame will be discussed in Section 3.4. Similarly to the training dataset generation, we complete the input generation by estimating isotropic scale factors from the 3D points, and then transform each partial point cloud into a binary occupancy grid using the discretization function . If denotes the matrix of 3D points observed in frame , the fused measurement is finally given as


where is the element-wise OR operation between two voxel grids. is the input for our auto-encoder predicting the full shape , where is our initial low-rank shape representation.
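The fusion of several posed frames into one measurement grid can be sketched as follows. The sketch rests on several assumptions that are not taken from the paper: poses are camera-to-world transforms given as a rotation matrix and a translation vector, the grid covers the cube starting at the world origin with uniform `voxel_size`, grids are stored as sets of occupied indices (so the element-wise OR becomes a set union), and all function names are hypothetical.

```python
def to_world(points, rotation, translation):
    """Apply x_world = R * x_cam + t to every 3D point."""
    return [
        tuple(sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
              for r in range(3))
        for p in points
    ]


def discretize(points, voxel_size, resolution):
    """Occupied voxel indices for points inside the grid volume."""
    cells = set()
    for p in points:
        idx = tuple(int(c // voxel_size) for c in p)
        if all(0 <= i < resolution for i in idx):
            cells.add(idx)
    return cells


def fuse_frames(frames, voxel_size, resolution):
    """Element-wise OR of the per-frame occupancy grids (set union)."""
    fused = set()
    for points, rotation, translation in frames:
        world_points = to_world(points, rotation, translation)
        fused |= discretize(world_points, voxel_size, resolution)
    return fused


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frames = [
    ([(0.5, 0.5, 0.5)], IDENTITY, (0.0, 0.0, 0.0)),
    ([(0.5, 0.5, 0.5)], IDENTITY, (1.0, 0.0, 0.0)),
]
fused_grid = fuse_frames(frames, voxel_size=1.0, resolution=4)
```

Because the OR is commutative and idempotent, frames can be fused in any order and re-fusing a frame leaves the grid unchanged.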

3.2 Iterative shape refinement

We now proceed to the core of our contribution. Most existing approaches employing shape priors do not subject the network’s prediction to any further post-processing. The prediction of the network however tends to be blurry and often misses out on fine structure details. It furthermore comes without any guarantees and—as shown in numerous works—networks can indeed be “fooled” by inputs that otherwise would seem to be relatively normal examples. We therefore construct residuals of the network output with respect to the original measurements, and iteratively minimize those residuals as a function of our latent shape representation . Note that, although the introduction of binary occupancy grids puts this into a different form, the basic idea of this approach is similar to the one presented in (4), namely a shape generator embedded into traditional residuals.

Figure 5: Visualization of the mask . Purple points are on the surface. Checking the distance of each voxel from the camera center and comparing against the measurements then defines free space (invisible) and occluded points (in green).

Using only the measurements to construct these residuals however harbors a pitfall. Not every point in space may have been visible, and parameters defining the shape in unobserved regions would become fully unconstrained (recall that the encoder is no longer employed during residual minimization). Our optimization needs to be further constrained on the prior , especially in the unobserved regions. Finding a good balance is however difficult, as we still want to remain able to exploit the details that are potentially captured by in the observed parts of space.

To solve this problem, we introduce the occlusion mask , which indicates which points in space have not been observed by any previous measurements. The mask also has a voxel grid format of similar resolution as and . As explained in Figure 5, the mask can easily be constructed by a simple heuristic that sets voxels on the surface as well as all voxels with smaller depth from the camera center than the one measured at the reprojection location to zero, and leaves all other points equal to one (i.e. the occluded parts). Note that masks from different views can be combined by an element-wise AND operation.
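The mask heuristic can be illustrated with a deliberately simplified camera model. The sketch below assumes an orthographic camera looking along the +z axis of the grid, so that every (x, y) column is one viewing ray; a real depth camera would instead reproject each voxel through its intrinsics. It further assumes that a ray without a depth measurement saw only free space. The function names are hypothetical, and the mask is stored as the set of occluded voxels (value 1).

```python
def occlusion_mask(depth, resolution):
    """Build the occlusion mask for one view.

    depth[(x, y)] is the z index of the measured surface voxel along the ray
    through column (x, y); a missing entry or None means nothing was hit.
    Returns the set of occluded voxels (mask value 1); all others are 0.
    """
    occluded = set()
    for x in range(resolution):
        for y in range(resolution):
            d = depth.get((x, y))
            if d is None:
                continue  # assumption: the whole ray was observed as free space
            # Surface voxel and everything in front of it are observed (0);
            # everything behind the surface is occluded (1).
            for z in range(d + 1, resolution):
                occluded.add((x, y, z))
    return occluded


def combine_masks(masks):
    """Element-wise AND: a voxel stays occluded only if no view observed it."""
    combined = set(masks[0])
    for mask in masks[1:]:
        combined &= mask
    return combined


view_a = occlusion_mask({(0, 0): 1}, resolution=3)
view_b = {(0, 0, 2), (1, 1, 1)}  # mask of a second, hypothetical view
both = combine_masks([view_a, view_b])
```

As more views are added, the AND can only shrink the occluded set, which matches the observation that the prior matters most for the first few frames.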

The mask allows us to balance between the measurements and the prior information to update our shape representation, especially in the occluded regions. Our objective for optimizing the structure is finally given by


where is the binary cross entropy function already introduced in Section 2.3, and is an overall trade-off factor defined as . governs the overall amount at which we enforce the prior in occluded regions. Especially for the first few frames, the measurement coverage may not be very high, thus making the regularization on the prior more important. Once more frames have been accumulated, the measurements themselves are typically strong enough to regularize the latent shape representation .
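One plausible reading of this objective is sketched below: voxels that have been observed are compared against the fused measurement, while occluded voxels are pulled towards the network prior, weighted by the trade-off factor. The exact form and the definition of the trade-off factor are not reproduced here; the split into two masked cross-entropy terms, the averaging, and all names (`refinement_cost`, `lam`) are assumptions of this sketch.

```python
import math


def _bce(target, estimate, eps=1e-7):
    """Binary cross entropy of a single voxel, with clamping to avoid log(0)."""
    estimate = min(max(estimate, eps), 1.0 - eps)
    return -(target * math.log(estimate)
             + (1.0 - target) * math.log(1.0 - estimate))


def refinement_cost(output, measurement, prior, occluded, lam):
    """Masked cost for the decoder output of the current latent vector.

    output, measurement, prior, occluded -- flattened grids of equal length
    occluded[i] == 1 marks a voxel that no view has observed
    lam -- hypothetical trade-off weight on the prior term
    """
    data_term, prior_term = 0.0, 0.0
    for o, m, p, c in zip(output, measurement, prior, occluded):
        if c:
            prior_term += _bce(p, o)  # unobserved: stay close to the prior
        else:
            data_term += _bce(m, o)   # observed: agree with the measurement
    return (data_term + lam * prior_term) / len(output)
```

With `lam = 0` the occluded voxels are unconstrained, which is exactly the failure mode described above; increasing `lam` ties them more strongly to the feed-forward prediction.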

3.3 Optimization

We explored two strategies for refining the latent shape representation based on minimizing (7):

  • Gradient descent: The partial derivatives of (7) w.r.t. are readily computed by applying gradient back-propagation in PyTorch. It is however difficult to optimize the shape as the decoder represents a highly non-linear function, and the optimization problem thus turns non-convex. Local optimization methods are thus prone to get trapped in local minima and saddle points.

  • Covariance Matrix Adaptation Evolution Strategy (CMA-ES): CMA-ES is a search method known for its state-of-the-art performance in derivative-free optimization of non-linear or non-convex optimization problems [Hansen, 2016]. In contrast to most classical methods, fewer assumptions are made on the underlying objective function, and it can easily be applied to a black-box decoder network for which derivatives are very hard to compute. As shown in Figure 6, gradient descent converges smoothly, which means it is easy to get trapped in the nearest local extremum or saddle point. CMA-ES in contrast declines in a fluctuating fashion, indicating that it is much better at overcoming local minima. Further comparisons between gradient descent and CMA-ES are given in Section 4.
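The sample-and-select principle behind derivative-free search can be illustrated with a much simpler evolution strategy. The sketch below is not CMA-ES: it samples from an isotropic Gaussian with a fixed step size and adapts neither a covariance matrix nor the step size. The toy cost function, the population sizes, and all names are assumptions chosen for illustration.

```python
import math
import random


def evolution_strategy(cost, x0, sigma, iterations,
                       population=20, elite=5, seed=0):
    """Minimal derivative-free search (a simplified stand-in, not CMA-ES)."""
    rng = random.Random(seed)
    mean = list(x0)
    best, best_cost = list(x0), cost(x0)
    for _ in range(iterations):
        # Sample candidates around the current mean.
        samples = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                   for _ in range(population)]
        samples.sort(key=cost)
        if cost(samples[0]) < best_cost:  # remember the best candidate so far
            best, best_cost = samples[0], cost(samples[0])
        # Move the mean to the average of the best candidates.
        top = samples[:elite]
        mean = [sum(s[d] for s in top) / elite for d in range(len(mean))]
    return best, best_cost


def toy_cost(x):
    """Non-convex toy objective with many local minima (hypothetical)."""
    return sum(v * v + 2.0 * (1.0 - math.cos(3.0 * v)) for v in x)


best, best_cost = evolution_strategy(toy_cost, [2.0, -2.0],
                                     sigma=0.5, iterations=50)
```

Because candidates are sampled around the mean rather than following the local slope, the search can jump across shallow local minima, which is consistent with the fluctuating convergence described for CMA-ES above.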

3.4 Incremental tracking and mapping

So far we have assumed that the poses are simply given, which is unrealistic. In practice, the pose of a newly arriving frame is simply initialized by running the ICP algorithm w.r.t. previous frames [Besl and McKay, 1992]. The frame is added to our set of keyframes if sufficient displacement has been detected. Each time the set of keyframes is incremented, we fuse new grids for the measurements and the occlusion mask, and rerun our mapping paradigm. Keyframes can furthermore be realigned with the mapping result in an alternating fashion, which removes the global drift.
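The keyframe test can be sketched as a simple displacement check between the new pose and the last keyframe pose. The criterion and the thresholds below are hypothetical, since the text only states that sufficient displacement must be detected; poses are assumed to be camera-to-world transforms given as a rotation matrix and a translation vector.

```python
import math


def relative_motion(pose_a, pose_b):
    """Translation distance and rotation angle (degrees) between two poses,
    each given as (R, t) with R a 3x3 nested list and t a 3-vector."""
    (ra, ta), (rb, tb) = pose_a, pose_b
    distance = math.sqrt(sum((tb[i] - ta[i]) ** 2 for i in range(3)))
    # trace(Ra^T Rb) equals the sum of element-wise products of Ra and Rb.
    trace = sum(ra[i][j] * rb[i][j] for i in range(3) for j in range(3))
    cos_angle = min(1.0, max(-1.0, (trace - 1.0) / 2.0))
    return distance, math.degrees(math.acos(cos_angle))


def is_keyframe(pose, last_keyframe_pose,
                min_distance=0.1, min_angle_deg=10.0):
    """True if the frame moved far enough from the last keyframe.
    Both thresholds are hypothetical values chosen for illustration."""
    distance, angle = relative_motion(last_keyframe_pose, pose)
    return distance >= min_distance or angle >= min_angle_deg
```

Each accepted keyframe would then trigger a new fusion of the measurement grid and the occlusion mask, followed by another round of shape refinement.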

4 Experiments

We evaluate our method qualitatively and quantitatively on both synthetic data and real data. We start by tuning the step size of gradient descent in a dedicated offline experiment. As illustrated in Figure 6, averaging the residual behavior over many different iterative minimization procedures and for different step sizes reveals that a value of leads to fast convergence without overshooting. The starting point of gradient descent is always set to . CMA-ES is implemented using the Python package pycma. The initial mean vector is again set to , and the initial covariance is set to (the value has again been identified by a dedicated experiment). The maximum number of iterations is set to 350 for both methods. To make quantitative statements about the quality of the reconstruction, we evaluate the Intersection-over-Union (IoU) between the predicted and the ground truth 3D binary occupancy grid, which is defined as follows:


where is an indicator function, and is set to .
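The IoU computation can be sketched as follows. The predicted grid holds continuous values and is binarized with a threshold before comparison; the default threshold of 0.5 and the function name are assumptions of this sketch, and grids are assumed to be flattened into lists.

```python
def intersection_over_union(predicted, ground_truth, threshold=0.5):
    """IoU between a thresholded prediction and a binary ground truth grid.

    predicted    -- flattened grid of occupancy probabilities
    ground_truth -- flattened binary grid (0 or 1)
    threshold    -- hypothetical binarization threshold for the prediction
    """
    pred = [p > threshold for p in predicted]
    truth = [g > 0.5 for g in ground_truth]
    intersection = sum(1 for p, g in zip(pred, truth) if p and g)
    union = sum(1 for p, g in zip(pred, truth) if p or g)
    # Two empty grids are treated as a perfect match (a convention chosen here).
    return intersection / union if union else 1.0
```

For example, a prediction that marks two voxels as occupied, only one of which is occupied in a ground truth with two occupied voxels, gives an intersection of 1 and a union of 3.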

Figure 6: The left figure shows the convergence behavior of CMA-ES and gradient descent. The right one shows the convergence of gradient descent for different step sizes.

4.1 Iterative refinement on synthetic data

We render synthetic test cases again by using CAD models of chairs from ShapeNet. However, unlike the test cases generated for training our network, here we always generate a continuous stream of images captured along a virtual orbit around the chair. This allows us to test the incremental mapping performance of our paradigm. Our results on synthetic data focus on the mapping part, hence the positions and orientations of each frame are fixed to their original values. We evaluate the plain auto-encoder prediction and the refined results after gradient descent and CMA-ES. Figure 7 shows the IoU results obtained for the different approaches and notably averaged over different random chairs.

Our experiments show that CMA-ES based iterative refinement performs better than the auto-encoder alone, especially as the number of frames is increasing. The performance of the auto-encoder is simply limited to what has been covered by the training set. The two additional solid lines in Figure 7 show an upper and a lower bound of the IoU error obtained by additionally setting all occluded voxels in the measurement grid to either 0 or 1, respectively. Owing to the sparse structure of the chair, most of the unobserved parts are indeed empty, so the upper bound turns out to be very accurate. (Note: when repeating the same experiment with more voluminous chairs, the upper bound quickly becomes worse than the CMA-ES output, as shown in Figure 8.)

Figure 7: IoU of pure auto-encoder prediction and CMA-ES and gradient descent based refinements.
Figure 8: IoU error for voluminous chairs

Note however that in the general case there is no simple rule to define the value of the unobserved voxels, which is why the result from CMA-ES has to be interpreted as the best result. Figure 9 shows some visual results comparing the output of CMA-ES against the prior from the auto-encoder and ground truth. It can be concluded that the refined model provides more detail, especially for fine structures on the legs and the back of the chair. Figure 10 furthermore explains why, on average, gradient descent cannot outperform CMA-ES. Gradient descent simply converges to the nearest local minimum or saddle point, a potentially wrong solution. Compared to CMA-ES, it consequently has very little ability to escape a wrong local minimum even as new depth frames are obtained.

Figure 9: Visualization of CMA-ES based refinement results (in yellow). Most of the reconstructed results exceed the results of the auto-encoder (in green) and are able to gradually approach ground truth (in red). denotes the number of frames taken into account. The blue column denotes the fused measurement grid suffering from occlusions. Fine details recovered through embedding into iterative residual minimization are highlighted.
Figure 10: Illustration of failure cases of gradient descent. While CMA-ES (fourth column) is able to use additional frames to eventually approach groundtruth (and outperform the auto-encoder), gradient descent (third column) gets trapped in a local minimum early on.

4.2 Iterative refinement on real data

Although the application to real data remains very challenging owing to the segmentation errors and the depth noise, we managed to obtain a few successful results by applying the incremental tracking and mapping pipeline outlined in Section 3.4 to a few sequences of real Kinect images from the chair dataset [Choi et al., 2016]. Figure 11 shows one of our obtained results. It confirms that the incremental addition of new frames—each time followed by iterative minimization of the cross-entropy residuals—is able to recover fine structure details compared to the feed-forward result from the auto-encoder alone. Although serving as a basic proof of concept, note that the procedure currently still relies on manual assistance for segmenting the chair (e.g. depth thresholding) and an initial guess for the pose of the first frame.

Figure 11: Result on real data. For the left figure, the first column shows the employed real images of a chair, the second column the segmented chair, the third column the fused measurement occupancy grid, the fourth the result from the auto-encoder, and the last the final result after iterative residual minimization. The right figure also shows a surface estimate of the final result, along with the estimated camera poses.

5 Discussion

We regard the present work as a fundamental cornerstone in lifting the parameterization of SLAM in unknown environments to a higher level. We successfully create a marriage between classical but powerful residual minimization techniques and modern deep architectures, able to provide complete and detailed reconstructions at the level of objects even in the light of partially occluded measurements. Despite the employment of residual errors with respect to the measurements, the fact that the estimated shape needs to remain a point in a previously trained latent shape space leads to a good ability to deal with missing data and unmodeled, disturbing influences on the measurements. The present work deals with only a single object, and the world frame coincides with the object frame. We intend to pursue this promising avenue and extend our work to more complex environments with multiple objects of different types. We furthermore intend to introduce a representation that permits joint optimization of poses and geometry.

The community has recently raised interesting questions about the possibility of end-to-end learning in 3D reconstruction. We argue here that from a practical perspective we do not yet have valid incentives to generalize our networks to influences that we already know how to model very well. For example, we started our investigation by using the offspring of [Girdhar et al., 2016], which generalizes to arbitrary object poses at the input of the network. Our observation is that this significantly increases the dimensionality of the shape space, which in turn affects computational complexity all while reducing the quality of the reconstruction. Our work serves as an example of how to combine valid existing models for poses and residual errors with modern deep architectures, and in particular restrict the generalization domain of the network to parameters for which no explicit model is available (i.e. the intrinsic object shape). By inserting explicit, minimal representations for the well-understood geometric transformations, we achieve a low-dimensional overall parametrization and outstanding performance.