1 Introduction and review of existing work
The ability of a machine to perceive and localize within its immediate surroundings has long been recognized as a fundamental enabler of several game-changing future technologies, including mixed or augmented reality, robotics, and intelligent vehicles. Whether we are talking about a passively moving head-mounted display in an augmented reality application or an actively navigating self-driving car, the 3D perception task remains similar: Process the continuous input data stream from all the available sensors to solve the mutual problem of simultaneous localization and mapping (SLAM).
In an effort to include 3D geometric priors into the estimation, the community has recently also explored deep end-to-end models for the feed-forward generation of depth maps or even full object models. For example, [Wu et al., 2017, Girdhar et al., 2016, Yang et al., 2017, Smith and Meger, 2017] use encoder-decoder networks to output binary 3D occupancy grids from single images of an object. [Di et al., 2016, Yan et al., 2016, Rezende et al., 2016] furthermore train networks to perform reconstruction from multi-view images. While these models lead to surprising performance, they ignore, at least at test time, the more traditional but often valid geometric or photometric consistency constraints altogether. As a result, deep feed-forward models have difficulty reconstructing fine details, and often fail to provide confidence measures or satisfying performance in corner-case scenarios that are insufficiently represented in the training set.
In an effort to push the performance of SLAM formulations to the next level, the community has recently investigated various strategies to combine both modalities and include prior, high-level knowledge into classical iterative residual minimization frameworks:
The simplest approach of including higher-level knowledge is given by explicitly representing the common 1D or 2D geometric structures of man-made environments. Straightforward examples are given by lines and planes [Salas-Moreno et al., 2014, Micusik and Wildenauer, 2015]. Although a well-explored idea, this technique already achieves remarkable properties that we also pursue in our work: A low-dimensional dense representation of the environment that implicitly enforces smoothness by encoding the higher-order shape of the environment.
Starting from these simple geometric primitives, the community has moved on to the use of image-based semantic object detection modules. The latter typically incorporate the experience given by countless training examples into a state-of-the-art deep convolutional neural network, which is used online for retrieving a plausible CAD model to represent objects in the environment. Our work is similar in that we also leverage semantic information to change or even simplify the object representations. However, we do not rely on CAD models. While object detection modules are typically tested for generalization ability, picking a concrete model from a database for the actual SLAM optimization objective [Civera et al., 2011, Salas-Moreno et al., 2013, Gupta et al., 2015, Gálvez-López et al., 2016, Mu et al., 2016] removes this advantage and limits accurate inference to a finite set of objects for which an exact model is known in advance.
More recently, the community defined the concept of semantic mapping, which denotes the augmentation of generated 3D models by semantic information. However, methods such as [Koppula et al., 2011, Stückler et al., 2015, Kundu et al., 2014, McCormac et al., 2017] hardly explore the benefit of semantic knowledge in geometric inference. They merely transfer local semantic knowledge to a global representation where multiple observations from different view-points are fused (note, however, that the semantic recognition part can be readily extended to use dense depth information as well). A notable exception is presented in [Häne et al., 2017], where local smoothness constraints are rendered dependent on the semantic knowledge.
Our work is most closely related to [Bloesch et al., 2018] and [Zhu et al., 2017], which utilize low-dimensional latent feature vectors to achieve low-rank representations of 2.5D depth maps and point clouds, respectively. However, [Bloesch et al., 2018] aims at learning a general representation for entire depth maps without employing semantic knowledge. [Zhu et al., 2017] employs shape representations for individual objects, but only in the form of sparse point clouds for very simple shapes and without mechanisms to deal with occlusions. Furthermore, the work leaves open questions about how a shape generator mapping from a low-dimensional latent space to the full object geometry is effectively split off from the hourglass architecture taken from [Fan et al., 2016]. A further notable work is given by [Dame et al., 2013], which however does not yet benefit from powerful modern deep architectures.
Our contributions are as follows. We apply the idea of embedded deep shape representations to a novel RGBD incremental tracking and mapping framework for single objects. Our representation in 3D is dense and object centric, and we introduce an occlusion mask-supported strategy to actively find a trade-off between imposing priors generated by the network, and traditional residuals with respect to the measurements. We conclude with a successful application to a challenging real example, and reconstruct a dense model of a chair from real Kinect images. Section 2 introduces our deep shape representation. Section 3 explains how this model is embedded into an incremental tracking and mapping framework. Section 4 finally concludes with our experimental evaluation on both artificial and real data.
2 Object shape representation learning
This section introduces our higher-level object shape representations. We start by motivating the basic form of the representation, as well as the need for functions mapping to and from the defined shape space. We then present the exact architecture devised to realize these mappings, followed by the details of the corresponding training procedure.
2.1 Motivation for higher-level models
The traditional geometric formulation of a simultaneous tracking and mapping problem starts with the definition of a measurement model. Let $\mathbf{p}_i$ be points on an object, and $\mathbf{T}_j$ the poses of viewpoints from where the object is observed. Without loss of generality, point measurements are given by
\[ \mathbf{m}_{ij} = \pi(\mathbf{T}_j, \mathbf{p}_i) + \mathbf{n}_{ij}, \tag{1} \]
where $\pi$ is a projection function that respects the intrinsic parameters of the sensor, and $\mathbf{n}_{ij}$ is an independent noise component. The residuals between the measurements and the estimated poses $\hat{\mathbf{T}}_j$ and points $\hat{\mathbf{p}}_i$ are then given by
\[ \mathbf{r}_{ij} = \mathbf{m}_{ij} - \pi(\hat{\mathbf{T}}_j, \hat{\mathbf{p}}_i). \tag{2} \]
A popular way to solve the tracking and mapping problem is to search for poses and points that minimize the sum of least squares residuals. However, in order to improve the conditioning of the problem, the energy is often complemented by a dual smoothness term that enforces neighbouring points to remain close to each other. The final objective may read
\[ \min_{\{\hat{\mathbf{T}}_j\}, \{\hat{\mathbf{p}}_i\}} \; \sum_{i} \sum_{j} v_{ij} \, \rho_1\!\left( \| \mathbf{r}_{ij} \|^2 \right) + \sum_{(k,l) \in \mathcal{N}} \rho_2\!\left( \| \hat{\mathbf{p}}_k - \hat{\mathbf{p}}_l \|^2 \right), \tag{3} \]
where $v_{ij}$ is an indicator function that indicates whether point $i$ is visible in view $j$, $\mathcal{N}$ is the set of index-pairs for points that are defined to be neighbors, and $\rho_1$ and $\rho_2$ are robust cost functions. There are several problems with this objective. The first one is that the dimensionality of the problem is potentially very high, especially if we are considering the dense scenario. Then, since not all residuals are sensitive w.r.t. the camera pose, the correct solution depends on the additional smoothness term. Even though it can be solved efficiently via a primal-dual method, it does raise the complexity of the optimization problem. The final problem of the formulation is that the measurement model is very simple, and does not take gross errors caused by reflections or complicated illumination conditions into account. The robust kernels and the smoothness term have only limited ability to deal with such effects.
We now assume that we have a compact way to describe the object's shape given by an $n$-dimensional shape descriptor $\mathbf{s}$ and a function $G(\cdot)$ to map from the latent shape space to the full 3D geometry. Let $\{\mathbf{p}_i(\mathbf{s})\} = G(\mathbf{s})$ be the set of points describing the object's shape and generated by $G$. The objective of tracking and mapping may now be reformulated as
\[ \min_{\{\hat{\mathbf{T}}_j\}, \mathbf{s}} \; \sum_{i} \sum_{j} v_{ij} \, \rho_1\!\left( \| \mathbf{m}_{ij} - \pi(\hat{\mathbf{T}}_j, \mathbf{p}_i(\mathbf{s})) \|^2 \right), \tag{4} \]
a much simpler formulation that has a few encouraging properties such as low dimensionality, implicit smoothness in the generated point cloud, and (assuming that the shape representation is strong enough to only generate valid shapes) a high resilience against effects that are not modeled by simple measurement functions.
There are various ways to formulate the low-dimensional shape representation such as PCA or LLE, but we choose auto-encoders here, which have proven to be potentially very good at this task [Girdhar et al., 2016]. As indicated in Figure 1, we structure the object shape in the original space with a binary voxel grid, and train an auto-encoder to reproduce a full model of the object. Unlike [Wu et al., 2017, Smith and Meger, 2017, Yan et al., 2016, Rezende et al., 2016], which generate the full shape from the RGBD image directly, the voxel grid fuses the information from different views more efficiently, and also does not limit the number of views. We only use simple occupancy information for map encoding, where 1 represents an occupied cell, and 0 an empty one. All our grids (i.e. the input given by the measurements, denoted $\mathbf{X}$, the output, denoted $\mathbf{Y}$, and the ground truth 3D shape, denoted $\mathbf{Y}^{*}$) are occupancy grids.
In this work, we focus on the example object class of chairs, which we believe to be interesting because its instances share commonalities while at the same time exhibiting relatively complex shapes and intra-class shape variations. Note that, after training is finished, we may separate the encoder from the decoder, and thus obtain our mapping function from the latent shape space to the full geometry that we would like to embed into a residual minimization framework. This is particularly supported by the fact that we do not employ any skip connections in the network.
Besides obtaining a low-dimensional shape representation, we also want our network to learn how to predict the full shape from only partial input binary fields. By doing so, we enable our network and in particular the encoder part to initialize the latent shape descriptor directly from our fused measurements (i.e. one or several fused depth images; note that, once camera poses are known, a partial 3D binary field is readily obtained from 2.5D depth images by recalculating 3D points and finding intersecting voxels). However, note that we still limit ourselves to the intrinsic object shape, and therefore only use similarly oriented chairs. The geometry of Euclidean poses is well understood, and explicitly parametrized in our formulation.
The dimensionality of the shape representation is the most important design factor. Choosing a high dimensionality may generate many saddle points in the shape space [Dauphin et al., 2014], which makes it difficult for our subsequent iterative residual minimization scheme to converge to the optimal solution. While [Yang et al., 2017], [Dai et al., 2017], and [Wu et al., 2016] use more than dimensions, we managed to reduce the number to , and thus smooth out the topology of our optimization space.
The detailed structure of our auto-encoder neural network is given by the encoder and decoder illustrated in Figure 2. Our encoder is a down-sampling network, encoding a binary grid into the latent shape representation with a dimensionality of . There are five convolution layers. The first four layers have a similar structure, each of them applying a bank of convolution filters with stride, followed by a batch normalization and a leaky ReLU activation function. The fifth layer is made up of 3D average pooling and a 3D convolution with filters and stride , which are used to substitute the full linear transformation to avoid over-fitting and preserve spatial information [Lin et al., 2013]. The function of the decoder is to generate the full shape from the latent shape representation. Our decoder has four deconvolution layers with stride , followed by a batch normalization and a leaky ReLU activation function except for the last layer, which has a tanh activation function. Due to the range of the output, we add a linear transformation mapping from $[-1, 1]$ to $[0, 1]$ after the tanh function.
We use $\mathbf{V}_{ijk}$ to represent the value at position $(i, j, k)$ in a 3D voxel grid $\mathbf{V}$. The loss function of the network is given by the binary cross entropy
\[ L(\mathbf{Y}, \mathbf{Y}^{*}) = -\frac{1}{N^3} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{N} \left[ \mathbf{Y}^{*}_{ijk} \log \mathbf{Y}_{ijk} + (1 - \mathbf{Y}^{*}_{ijk}) \log (1 - \mathbf{Y}_{ijk}) \right], \tag{5} \]
where $N$ is the grid resolution, $\mathbf{Y}^{*}_{ijk}$ the target value in $\{0, 1\}$, and $\mathbf{Y}_{ijk}$ the estimated value in $[0, 1]$. We derive our training data from CAD chair rendering models [Chang et al., 2015]. The input shapes are given by partial voxel grids obtained from depth images. We therefore start by hypothesizing a virtual depth camera scanning frames from different view-points for each CAD model. The altitude and azimuth of the views randomly range from and , respectively. The principal axes of the views intersect the object center, and the -axis of each view remains horizontal. Virtual depth images are generated by projecting each triangle into the image, finding the intersecting pixels, and intersecting the corresponding rays with the triangle in 3D to recover the depth. A depth check is added to handle occlusions.
Once a depth image is generated, it is transformed back into a 3D point cloud using the virtual camera parameters. The last step then consists of transforming all point clouds into binary voxel fields, which is easily achieved by setting all the voxels that contain a 3D point to 1, and the rest to 0. Note that, in order to make sure the ground truth example is really complete, the point cloud used to generate the ground truth binary field is derived straight from the CAD model rather than from the synthesized depth frames. Also note that the transformation from the point clouds to the binary fields first employs anisotropic scaling and a translational shift to make sure the grid resolution is ideally exploited. For each training example, we generate these transformation parameters only once straight from the CAD model, and then apply them without change to the partial views as well. We obtain pairs of partial and complete voxel grids for each chair and use different chairs for training our network, which means training pairs in total.
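The conversion from a depth image to a binary voxel field described above can be sketched as follows. This is a minimal illustration rather than the training pipeline itself: the pinhole parameters `fx`, `fy`, `cx`, `cy`, the function names, and the convention that a depth of 0 means "no measurement" are all assumptions made for the example.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Turn a depth image (H x W, 0 = no measurement) into an N x 3 point cloud
    expressed in the camera frame (assumed pinhole model)."""
    v, u = np.nonzero(depth > 0)      # pixel rows (v) and columns (u) with valid depth
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def fit_grid_transform(model_points, resolution):
    """Per-axis (anisotropic) scale and shift that map the bounding box of the
    complete model onto the index range [0, resolution)."""
    lo = model_points.min(axis=0)
    extent = model_points.max(axis=0) - lo
    extent = np.where(extent > 0, extent, 1.0)   # guard against degenerate axes
    scale = (resolution - 1e-6) / extent
    return scale, lo

def voxelize(points, scale, shift, resolution):
    """Set every voxel that contains at least one point to 1, all others to 0."""
    grid = np.zeros((resolution,) * 3, dtype=np.uint8)
    idx = np.floor((points - shift) * scale).astype(int)
    inside = np.all((idx >= 0) & (idx < resolution), axis=1)
    i, j, k = idx[inside].T
    grid[i, j, k] = 1
    return grid
```

In this sketch the scale and shift are fitted once on the complete model and then reused unchanged for every partial view, which mirrors the procedure described in the text. Points obtained from `backproject` would still need to be moved into the object frame with the camera pose before calling `voxelize`.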
3 Embedding into Iterative Residual Minimization
This section illustrates how we iteratively refine the generated shape by embedding the model into a traditional residual minimization framework. The computation is divided into two stages. The first one involves the generation of a 3D shape prior by a standard feed-forward execution of the network. The second part then consists of the iterative refinement during which the consistency with the original measurements is taken into account. The section concludes by a brief exposition of our incremental, alternating tracking and mapping scheme. Figure 4 illustrates a break-down of our complete pipeline.
3.1 Shape prior generation
We start with a plain feed-forward application of the entire auto-encoder network to initialize our latent object shape representation and obtain a prior on the full 3D geometry. With respect to Figure 4, this is the top part of the flow-chart. The procedure works as follows. We first apply an object segmentation in each RGBD frame (on synthetic datasets, this step is not necessary, and on real datasets, we simply perform an RGB-image based foreground-background segmentation). Assuming that the pose of each RGBD frame has already been identified, we then take the 3D points measured on the object's surface and transform them into the world frame (which coincides with the object frame in our work). How to estimate and gradually update the camera pose of each frame will be discussed in Section 3.4. Similarly to the training dataset generation, we complete the input generation by estimating isotropic scale factors from the 3D points, and then transform each partial point cloud into a binary occupancy grid using the discretization function $D(\cdot)$. If $\mathbf{P}_j$ denotes the matrix of 3D points observed in frame $j$, the fused measurement over $k$ frames is finally given as
\[ \mathbf{X} = D(\mathbf{P}_1) \vee D(\mathbf{P}_2) \vee \dots \vee D(\mathbf{P}_k), \tag{6} \]
where $\vee$ is the element-wise OR operation between two voxel grids. $\mathbf{X}$ is the input for our auto-encoder predicting the full shape $\mathbf{Y}_0$ from $\mathbf{s}_0$, our initial low-rank shape representation.
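The fusion of per-frame grids by an element-wise OR is simple enough to sketch in a few lines. The function name below is hypothetical, and the grids are assumed to share the same shape and the same scale and shift.

```python
import numpy as np

def fuse_measurements(grids):
    """Element-wise OR over a list of equally shaped binary occupancy grids."""
    stacked = [np.asarray(g).astype(bool) for g in grids]
    return np.logical_or.reduce(stacked).astype(np.uint8)
```

Because OR is associative and idempotent, the fused grid can equally be updated incrementally each time a new keyframe arrives, without revisiting earlier frames.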
3.2 Iterative shape refinement
We now proceed to the core of our contribution. Most existing approaches employing shape priors do not subject the network’s prediction to any further post-processing. The prediction of the network however tends to be blurry and often misses out on fine structure details. It furthermore comes without any guarantees and—as shown in numerous works—networks can indeed be “fooled” by inputs that otherwise would seem to be relatively normal examples. We therefore construct residuals of the network output with respect to the original measurements, and iteratively minimize those residuals as a function of our latent shape representation . Note that, although the introduction of binary occupancy grids puts this into a different form, the basic idea of this approach is similar to the one presented in (4), namely a shape generator embedded into traditional residuals.
Using only the measurements to construct these residuals, however, hides a trap. Not every point in space may have been visible, and parameters defining the shape in unobserved regions would become fully unconstrained (recall that the encoder is no longer employed during residual minimization). Our optimization needs to be further constrained by the prior $\mathbf{Y}_0$ (the initial prediction of the network), especially in the unobserved regions. Finding a good balance is however difficult, as we still want to remain able to exploit the details that are potentially captured by the measurements $\mathbf{X}$ in the observed parts of space.
To solve this problem, we introduce the occlusion mask $\mathbf{M}$, which indicates which points in space have not been observed by any previous measurements. The mask also has a voxel grid format with the same resolution as the measurement grid and the prior. As explained in Figure 5, $\mathbf{M}$ can easily be constructed by a simple heuristic that sets voxels on the surface as well as all voxels with smaller depth from the camera center than the one measured at the reprojection location to zero, and leaves all other points equal to one (i.e. the occluded parts). Note that masks from different views can be combined by an element-wise AND operation.
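The mask heuristic can be illustrated with a short sketch. Several details here are assumptions, not taken from the text: a pinhole camera with intrinsic matrix `K`, a pose convention in which a point `p` in the object frame maps to `R @ p + t` in the camera frame, nearest-pixel lookup, and the choice to treat voxels that project outside the image or onto pixels without depth as unobserved.

```python
import numpy as np

def occlusion_mask(depth, K, R, t, centers, shape, tol=0.0):
    """Return a grid with 0 for voxels seen as free space or surface, 1 otherwise.

    depth   : H x W depth image, 0 where no depth was measured (assumption)
    centers : (N, 3) voxel centres in the object frame, N = prod(shape)
    tol     : slack added to the measured depth, e.g. half a voxel (assumption)
    """
    cam = centers @ R.T + t                 # voxel centres in the camera frame
    z = cam[:, 2]
    n = len(centers)
    front = z > 0                           # only points in front of the camera project
    u = np.zeros(n, dtype=int)
    v = np.zeros(n, dtype=int)
    u[front] = np.round(K[0, 0] * cam[front, 0] / z[front] + K[0, 2]).astype(int)
    v[front] = np.round(K[1, 1] * cam[front, 1] / z[front] + K[1, 2]).astype(int)
    h, w = depth.shape
    valid = front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.zeros(n)
    d[valid] = depth[v[valid], u[valid]]    # measured depth at the reprojection
    observed = valid & (d > 0) & (z <= d + tol)
    mask = np.ones(n, dtype=np.uint8)       # default: occluded / unobserved
    mask[observed] = 0
    return mask.reshape(shape)

def combine_masks(masks):
    """A voxel stays occluded only if it is occluded in every view (element-wise AND)."""
    stacked = [np.asarray(m).astype(bool) for m in masks]
    return np.logical_and.reduce(stacked).astype(np.uint8)
```

The AND combination reflects that a single view observing a voxel is enough to remove it from the occluded set.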
The mask allows us to balance between the measurements and the prior information to update our shape representation, especially in the occluded regions. Our objective for optimizing the structure is finally given by
where is the binary cross entropy function already introduced in Section 2.3, and is an overall trade-off factor defined as . governs the overall amount at which we enforce the prior in occluded regions. Especially for the first few frames, the measurement coverage may not be very high, thus making the regularization on the prior more important. Once more frames have been accumulated, the measurements themselves are typically strong enough to regularize the latent shape representation .
We explored two strategies for refining the latent shape representation based on minimizing (7):
Gradient descent: The partial derivatives of (7) w.r.t. the latent shape representation are readily computed by applying gradient back-propagation in PyTorch. It is however difficult to optimize the shape as the decoder represents a highly non-linear function, and the optimization problem thus turns non-convex. Local optimization methods are thus prone to getting trapped in local minima and saddle points.
Covariance Matrix Adaptation Evolution Strategy (CMA-ES): CMA-ES is a search method known for its state-of-the-art performance in derivative-free optimization of non-linear or non-convex optimization problems [Hansen, 2016]. In contrast to most classical methods, fewer assumptions are made on the underlying objective function, and it can easily be applied to a black-box decoder network for which derivatives are very hard to compute. As shown in Figure 6, gradient descent converges smoothly, which suggests that it easily gets trapped in the nearest local extremum or saddle point. CMA-ES in contrast declines in a fluctuating fashion, indicating that it is much better at overcoming local minima. Further comparisons between gradient descent and CMA-ES are given in Section 4.
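To make the idea of derivative-free, population-based search over a latent vector concrete, the following sketch implements a deliberately simplified evolution strategy. It is not CMA-ES: it keeps an isotropic search distribution with a fixed step-size decay and performs no covariance or step-size adaptation. The function names, the population size, the decay factor, and the toy cost (standing in for the decoder followed by the cross-entropy objective) are all assumptions made for illustration.

```python
import numpy as np

def simple_es(objective, x0, sigma0, iterations=100, popsize=16, seed=0):
    """Minimal (mu, lambda) evolution strategy with isotropic sampling.
    Only evaluates the objective as a black box; no gradients are needed."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(x0, dtype=float).copy()
    sigma = float(sigma0)
    best_x, best_f = mean.copy(), objective(mean)
    n_parents = max(1, popsize // 4)
    for _ in range(iterations):
        # Sample candidates around the current mean.
        candidates = mean + sigma * rng.standard_normal((popsize, mean.size))
        values = np.array([objective(c) for c in candidates])
        order = np.argsort(values)
        if values[order[0]] < best_f:
            best_f = float(values[order[0]])
            best_x = candidates[order[0]].copy()
        # Move the mean to the average of the best candidates.
        mean = candidates[order[:n_parents]].mean(axis=0)
        sigma *= 0.97                       # fixed decay instead of adaptation
    return best_x, best_f

def toy_cost(z):
    """Non-convex toy function with many local minima (placeholder objective)."""
    z = np.asarray(z, dtype=float)
    return float(np.sum(z ** 2 + 1.0 - np.cos(2.0 * np.pi * z)))

z0 = np.full(4, 1.3)
z_best, f_best = simple_es(toy_cost, z0, sigma0=0.5)
```

In the setting of this section, the objective passed to such a search would decode a candidate latent vector into a voxel grid and evaluate the cost against the measurements and the prior. A full CMA-ES implementation additionally adapts the covariance of the sampling distribution, which is what the comparison in the text relies on.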
3.4 Incremental tracking and mapping
So far we have assumed that the poses are simply given, which is unrealistic. In practice, the pose of a newly arriving frame is initialized by running the ICP algorithm w.r.t. previous frames [Besl and McKay, 1992]. The frame is added to our set of keyframes if sufficient displacement has been detected. Each time the set of keyframes is incremented, we fuse new grids for the measurements and the occlusion mask, and rerun our mapping paradigm. Keyframes can furthermore be realigned with the mapping result in an alternating fashion, which removes the global drift.
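The keyframe decision can be sketched as a threshold on the relative motion between the last keyframe and the current frame. The text does not specify how displacement is measured, so the criterion below (translation distance or rotation angle exceeding a threshold) and the threshold values are purely illustrative assumptions.

```python
import numpy as np

def relative_motion(T_ref, T_cur):
    """Translation distance and rotation angle (radians) between two 4 x 4 poses."""
    T_rel = np.linalg.inv(T_ref) @ T_cur
    dist = float(np.linalg.norm(T_rel[:3, 3]))
    cos_angle = (np.trace(T_rel[:3, :3]) - 1.0) / 2.0
    angle = float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return dist, angle

def is_new_keyframe(T_ref, T_cur, min_dist=0.1, min_angle=np.deg2rad(10.0)):
    """Hypothetical rule: accept the frame once it has moved or turned enough."""
    dist, angle = relative_motion(T_ref, T_cur)
    return bool(dist > min_dist or angle > min_angle)
```

Whenever this test succeeds, the measurement grid and the occlusion mask would be refused with the new frame and the shape refinement rerun, as described above.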
We evaluate our method qualitatively and quantitatively on both synthetic data and real data. We start by tuning the step size of gradient descent in a dedicated offline experiment. As illustrated in Figure 6, averaging the residual behavior over many different iterative minimization procedures and for different step sizes reveals that a value of leads to fast convergence without overshooting. The starting point of gradient descent is always set to . CMA-ES is implemented with the Python package pycma. The initial mean vector is again set to , and the initial covariance is set to (the value has again been identified by a dedicated experiment). The maximum number of iterations is set to 350 for both methods. To make quantitative statements about the quality of the reconstruction, we evaluate the Intersection-over-Union (IoU) between the predicted and the ground truth 3D binary occupancy grid, which is defined as follows:
\[ \mathrm{IoU} = \frac{\sum_{i,j,k} I(\mathbf{Y}_{ijk} > t) \, I(\mathbf{Y}^{*}_{ijk})}{\sum_{i,j,k} I\!\left( I(\mathbf{Y}_{ijk} > t) + I(\mathbf{Y}^{*}_{ijk}) \right)}, \tag{8} \]
where $\mathbf{Y}$ is the predicted grid, $\mathbf{Y}^{*}$ the ground truth grid, $I(\cdot)$ is an indicator function, and the threshold $t$ is set to .
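The IoU between a predicted occupancy grid and a binary ground truth grid can be computed as sketched below. The default threshold of 0.5 is an assumption chosen for the example, as is the convention of returning 1.0 when both grids are empty.

```python
import numpy as np

def voxel_iou(pred, gt, threshold=0.5):
    """Intersection-over-Union between a thresholded prediction and a binary grid."""
    p = np.asarray(pred) > threshold
    g = np.asarray(gt).astype(bool)
    union = np.logical_or(p, g).sum()
    if union == 0:
        return 1.0                      # both grids empty: treat as perfect overlap
    return float(np.logical_and(p, g).sum() / union)
```

The numerator counts voxels occupied in both grids, and the denominator counts voxels occupied in at least one of them, matching the indicator-based definition.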
4.1 Iterative refinement on synthetic data
We render synthetic test cases again by using CAD models of chairs from ShapeNet. However, unlike the test cases generated for training our network, here we always generate a continuous stream of images captured along a virtual orbit around the chair. This allows us to test the incremental mapping performance of our paradigm. Our results on synthetic data focus on the mapping part, hence the positions and orientations of each frame are fixed to their original values. We evaluate the plain auto-encoder prediction and the refined results after gradient descent and CMA-ES. Figure 8 shows the IoU results obtained for the different approaches and notably averaged over different random chairs.
Our experiments show that CMA-ES based iterative refinement performs better than the auto-encoder alone, especially as the number of frames increases. The performance of the auto-encoder is simply limited to what has been covered by the training set. The two additional solid lines in Figure 8 show an upper and a lower bound of the IoU obtained by additionally setting all occluded voxels in the measurement grid to either 0 or 1, respectively. Owing to the sparse structure of the chair, most of the unobserved parts are indeed empty, so the upper bound turns out to be very accurate. (Note: when repeating the same experiment with more voluminous chairs, the upper bound quickly becomes worse than the CMA-ES output; the result is shown in Figure 8.)
Note however that in the general case there is no simple rule to define the value of the unobserved voxels, which is why the result from CMA-ES has to be interpreted as the best result. Figure 9 shows some visual results comparing the output of CMA-ES against the prior from the auto-encoder and ground truth. It can be concluded that the refined model provides more detail, especially for fine structures on the legs and the back of the chair. Figure 10 furthermore explains why, on average, gradient descent cannot outperform CMA-ES. Gradient descent simply converges to the nearest local minimum or saddle point, a potentially wrong solution. Compared to CMA-ES, it consequently has very little ability to escape a wrong local minimum, even as new depth frames are obtained.
4.2 Iterative refinement on real data
Although the application to real data remains very challenging owing to the segmentation errors and the depth noise, we managed to obtain a few successful results by applying the incremental tracking and mapping pipeline outlined in Section 3.4 to a few sequences of real Kinect images from the chair dataset [Choi et al., 2016]. Figure 11 shows one of our obtained results. It confirms that the incremental addition of new frames—each time followed by iterative minimization of the cross-entropy residuals—is able to recover fine structure details compared to the feed-forward result from the auto-encoder alone. Although serving as a basic proof of concept, note that the procedure currently still relies on manual assistance for segmenting the chair (e.g. depth thresholding) and an initial guess for the pose of the first frame.
We regard the present work as a fundamental cornerstone in lifting the parameterization of SLAM in unknown environments to a higher level. We successfully create a marriage between classical but powerful residual minimization techniques and modern deep architectures, able to provide complete and detailed reconstructions at the level of objects even in the light of partially occluded measurements. Despite the employment of residual errors with respect to the measurements, the fact that the estimated shape needs to remain a point in a previously trained latent shape space leads to a good ability to deal with missing data and unmodeled, disturbing influences on the measurements. The present work deals with only a single object, and the world frame coincides with the object frame. We intend to pursue this promising avenue and extend our work to more complex environments with multiple objects of different types. We furthermore intend to introduce a representation that permits joint optimization of poses and geometry.
The community has recently raised interesting questions about the possibility of end-to-end learning in 3D reconstruction. We argue here that from a practical perspective we do not yet have valid incentives to generalize our networks to influences that we already know how to model very well. For example, we started our investigation by using the offspring of [Girdhar et al., 2016], which generalizes to arbitrary object poses at the input of the network. Our observation is that this significantly increases the dimensionality of the shape space, which in turn affects computational complexity all while reducing the quality of the reconstruction. Our work serves as an example of how to combine valid existing models for poses and residual errors with modern deep architectures, and in particular restrict the generalization domain of the network to parameters for which no explicit model is available (i.e. the intrinsic object shape). By inserting explicit, minimal representations for the well-understood geometric transformations, we achieve a low-dimensional overall parametrization and outstanding performance.
- Wu et al.  Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum. Marrnet: 3d shape reconstruction via 2.5 d sketches. In Advances In Neural Information Processing Systems, pages 540–550, 2017.
- Girdhar et al.  Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision, pages 484–499. Springer, 2016.
- Yang et al.  Bo Yang, Hongkai Wen, Sen Wang, Ronald Clark, Andrew Markham, and Niki Trigoni. 3d object reconstruction from a depth view with adversarial learning. arXiv preprint arXiv:1708.07969, 2017.
- Smith and Meger  Edward Smith and David Meger. Improved adversarial systems for 3d object generation and reconstruction. arXiv preprint arXiv:1707.09557, 2017.
- Di et al.  Xinhan Di, Rozenn Dahyot, and Mukta Prasad. Deep shape from a low number of silhouettes. In European Conference on Computer Vision, pages 251–265. Springer, 2016.
- Yan et al.  Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances in Neural Information Processing Systems, pages 1696–1704, 2016.
- Rezende et al.  Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In Advances In Neural Information Processing Systems, pages 4996–5004, 2016.
- Salas-Moreno et al.  Renato F Salas-Moreno, Ben Glocker, Paul HJ Kelly, and Andrew J Davison. Dense planar slam. In Mixed and Augmented Reality (ISMAR), 2014 IEEE International Symposium on, pages 157–164. IEEE, 2014.
- Micusik and Wildenauer  Branislav Micusik and Horst Wildenauer. Descriptor free visual indoor localization with line segments. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3165–3173. IEEE, 2015.
- Civera et al.  Javier Civera, Dorian Gálvez-López, Luis Riazuelo, Juan D Tardós, and JMM Montiel. Towards semantic slam using a monocular camera. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pages 1277–1284. IEEE, 2011.
- Salas-Moreno et al.  Renato F Salas-Moreno, Richard A Newcombe, Hauke Strasdat, Paul HJ Kelly, and Andrew J Davison. Slam++: Simultaneous localisation and mapping at the level of objects. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1352–1359. IEEE, 2013.
- Gupta et al.  Saurabh Gupta, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Aligning 3d models to rgb-d images of cluttered scenes. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 4731–4740. IEEE, 2015.
- Gálvez-López et al.  Dorian Gálvez-López, Marta Salas, Juan D Tardós, and JMM Montiel. Real-time monocular object slam. Robotics and Autonomous Systems, 75:435–449, 2016.
- Mu et al.  Beipeng Mu, Shih-Yuan Liu, Liam Paull, John Leonard, and Jonathan P How. Slam with objects using a nonparametric pose graph. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pages 4602–4609. IEEE, 2016.
- Koppula et al.  Hema S Koppula, Abhishek Anand, Thorsten Joachims, and Ashutosh Saxena. Semantic labeling of 3d point clouds for indoor scenes. In Advances in neural information processing systems, pages 244–252, 2011.
- Stückler et al.  Jörg Stückler, Benedikt Waldvogel, Hannes Schulz, and Sven Behnke. Dense real-time mapping of object-class semantics from rgb-d video. Journal of Real-Time Image Processing, 10(4):599–609, 2015.
- Kundu et al.  Abhijit Kundu, Yin Li, Frank Dellaert, Fuxin Li, and James M Rehg. Joint semantic segmentation and 3d reconstruction from monocular video. In European Conference on Computer Vision, pages 703–718. Springer, 2014.
- McCormac et al.  John McCormac, Ankur Handa, Andrew Davison, and Stefan Leutenegger. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 4628–4635. IEEE, 2017.
- Häne et al.  C Häne, C Zach, A Cohen, and M Pollefeys. Dense semantic 3d reconstruction. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
- Bloesch et al.  Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, and Andrew J Davison. Codeslam-learning a compact, optimisable representation for dense visual slam. arXiv preprint arXiv:1804.00874, 2018.
- Zhu et al.  Rui Zhu, Chaoyang Wang, Chen-Hsuan Lin, Ziyan Wang, and Simon Lucey. Semantic photometric bundle adjustment on natural sequences. arXiv preprint arXiv:1712.00110, 2017.
- Fan et al.  Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point set generation network for 3d object reconstruction from a single image. CoRR, abs/1612.00603, 2016.
- Dame et al.  A Dame, V A Prisacariu, C Y Ren, and I Reid. Dense reconstruction using 3d object shape priors. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
- Dauphin et al.  Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pages 2933–2941, 2014.
- Dai et al.  Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), volume 3, 2017.
- Wu et al.  Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
- Lin et al.  Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
- Chang et al.  Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
- Hansen  Nikolaus Hansen. The cma evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016.
- Besl and McKay  Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–607. International Society for Optics and Photonics, 1992.
- Choi et al.  Sungjoon Choi, Qian-Yi Zhou, Stephen Miller, and Vladlen Koltun. A large dataset of object scans. arXiv preprint arXiv:1602.02481, 2016.