I Introduction
In robotics, simultaneous localization and mapping (SLAM) with high-level semantic features has become an important component of scene understanding [1]. Object-oriented features are well suited to view-independent loop closure in SLAM, and they are readily used in reinforcement learning tasks such as object search [2, 3]. In particular, object recognition is widely used to capture meaningful features and to infer categories and viewpoints [4, 5]. However, the probabilistic observation model of the object has received little attention. Recent object recognition methods can be applied to object detection and to the evaluation of classification probabilities [6, 7]. Nevertheless, these methods concentrate on detecting objects precisely in a single image rather than on estimating the probability distribution of the object. It is therefore cumbersome to apply them to probabilistic SLAM, which requires a tractable observation model that permits numerical analysis. The intractable probability distribution of the object's shape also makes the observation model difficult to estimate. Moreover, since a 3D scene observed by a monocular camera or a range finder is a single view, a mobile robot can observe only part of an object at a glance. Because an object has many possible single views, the entire 3D shape of the object would ideally be scanned to achieve view-independent feature matching, loop closure, and complete volumetric maps. However, it is hard to obtain the full shapes of objects while a mobile robot performs various tasks in real time.
The goal of this paper is to approximate the intractable observation model of an object from its single view. To exploit the single view that the mobile robot actually acquires, we let the algorithm infer the variational likelihood of the full shape from a single view. We take advantage of the variational autoencoder (VAE) [8] and let the latent variables obtained from the observed object follow a tractable prior. As a consequence, SLAM with the data association problem can be analyzed numerically using the expectation maximization (EM) algorithm. To approximate the observation distribution through the generative model, we introduce a Bayesian network similar to that of [9]. An overview of the proposed method is shown in Fig. 1.
Our contributions are twofold. First, we show that the semantic features can be replaced by variational latent variables in the EM formulation of probabilistic SLAM. Second, we introduce an encoding method for the generative model with a Bayesian network of the 3D object, exploiting its single view.
II Related Work
Works on SLAM with features typically address the problem of data association. When objects are used as features, object recognition based on learned nonlinear functions is usually performed. Obtaining a closed-form solution is therefore challenging, since the posterior including the data association follows an intractable distribution. To relax the problem, traditional approaches to SLAM divide it into a front end and a back end [10, 11]. In the front end, feature extraction and data association are solved by relying on a detection algorithm such as object recognition. Localization is then performed in the back end using a filter-based or graph optimization method. Because of this partitioned structure, once data association fails in the front end, it is hard to avoid a large localization error in the back end.
To overcome these issues, several methods have been developed to correct false data associations by weighting the association results [12, 13]. However, these methods are limited in dealing with uncorrelated data association and loop closing. [14] proposed a probabilistic data association formulation for semantic SLAM using the EM algorithm. They started with the pose optimization problem and introduced the random variable of data association as a latent variable. An approximate maximum a posteriori (MAP) solution of the SLAM problem can therefore be achieved even when the initial data association fails. However, existing semantic SLAM performs data association based on object recognition algorithms such as [6, 7, 15], which usually have intractable observation probability distributions. Thus, even if the EM algorithm is used, the iterative solution of the object label is obtained in the expectation step by treating the label as a latent variable along with the data association. Moreover, this method hardly considers that observing the full shape with a single mobile robot in real time is infeasible.
Therefore, we show that an observation model approximated with variational inference enables the complete EM algorithm for the semantic SLAM problem. To formulate the generative model for variational inference, we assume that latent variables related to factors such as class, instance category, and camera position are involved in creating the object. In this respect, our proposed model is similar to the studies of disentangled representation in [16, 17]. These works associated the latent variables with specific elements of the object or face, such as image contrast, object category, facial angle, emotional expression, and gender. In this way, they studied how objects are generated when the latent variables corresponding to specific elements change.
Since most objects are observed as a single view, we train the VAE to infer the variational likelihood of the full shape from a single view of an object. For implementation and evaluation, we assume that the single view obtained from a range finder or depth camera is represented as a voxel grid, as is the full shape.
For SLAM with the data association problem, we will show that it is unnecessary in practice to estimate the full shape of the object, as we only need the encoded features obtained from the encoder. However, our network is an autoencoder, so it can also be used as a shape retrieval network from a single view. From this perspective, [18] is similar to our study. [19, 20, 21] are also similar to our work in that they attempt to match a 3D shape to a single view, although they hardly consider probabilistic approaches.
III Graphical Models for Likelihood of 3D Object
Suppose the mobile robot observes the full shape of an object and represents it as a voxel grid. In our work, we use this voxelized object shape as a semantic feature . To learn the generative model for estimating the observation model of the 3D object, it is useful to introduce a Bayesian graphical model as in [9]. Similar to [16], we assume that the Bayesian random process for the 3D object involves : the category, the characteristics of the observed instance, the camera viewpoint around the center of the object, and translations in the voxel grid. The category denotes the class of objects related to the rough appearance, and stands for the detailed shape of the individual instance. We represent this Bayesian process as a directed acyclic graph model in Fig. 2(a).
For the generative model, the joint probability distribution can be denoted as . We simply assume that the prior over these factors factorizes and that each of the factorized terms follows a uniform distribution. Since the likelihood of the Bayesian process is too complex to handle, we resort to approximations using a deep generative model instead. For our work, a VAE with multilayered neural networks is adopted [8]. We display the graphical model with variational latent variables in Fig. 2(b). Then the lower bound of the likelihood is represented as follows:
(1)
By mean-field inference [8, 22], we assume that . Hence the variational likelihood can be factorized as for all symbol that . Similarly, we assume that the prior of can also be factorized as the product of conditional densities such that . Then the KL divergence term in (1) can be represented as follows:
In many studies using VAEs, the prior of is simply assumed to be . However, since the prior now depends on , we let where , i.e., a nonlinear function with parameter . We therefore construct neural networks for , which denote the prior distributions and must also be inferred from the training data. For simplicity, we can have , which is a diagonal covariance matrix. For further simplicity, we let and leave only as trainable variables. Similar to the prior, we assume that is a multivariate Gaussian with diagonal covariance, as in [8].
Contrary to our earlier assumption, the mobile robot usually observes objects as a single view such as an RGB-D image, so obtaining the full shape in real time is challenging in practice. To infer the true observation model of the full shape, which is ideal for view-independent feature matching and loop closure, we exploit the single view and let the variational likelihood indirectly infer from it the latent variables that imply the full shape. Therefore, similar to [18], we redefine the variational likelihood as , where is an observed single view of an object. In this paper, is assumed to be a voxelized grid converted from segmented depth images or point clouds.
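For orientation, the bound described in this section has the generic form of a conditional VAE lower bound in which the encoder sees only the single view. The block below is a sketch in notation introduced here for illustration, not the paper's own symbols: x is the full voxelized shape, x_s the observed single view, z the latent variables, l the set of labels (category, instance, viewpoint, translation), and theta and phi the decoder and encoder parameters. It assumes the shape depends on the labels only through z.

```latex
\log p_\theta(x \mid l)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x_s)}\!\left[ \log p_\theta(x \mid z) \right]
  \;-\;
  D_{\mathrm{KL}}\!\left( q_\phi(z \mid x_s) \,\middle\|\, p(z \mid l) \right)
```

The inequality holds for any choice of the encoder distribution, which is why the encoder can be conditioned on the single view while the reconstruction target remains the full shape.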
IV Variational Latent Variables and Semantic Features
IV-A Variational Latent Variables and Semantic SLAM with Data Association
Consider the localization and mapping problem with object semantic features. As in [14], assume that we have a collection of static landmarks. The goal of the semantic SLAM is to estimate the 3D coordinate and label of the landmark, and robot poses given a set of object observations . A ’th object detection from keyframe is composed of a full shape and a 3D coordinate for the center of its bounding box. With the latent variables for data association, the EM formulation for the semantic SLAM can be represented as follows:
(2)  
(3) 
is the set of all possible data associations representing that the object detection of landmark was obtained from the robot state . Also, is the set of all possible data associations such that the th detection is assigned to the th landmark. For more details of the EM formulation, please refer to [14]. Note that, unlike [14], we treat only the data association as a latent variable in the EM algorithm, and not the object label.
Now assume that the observation events are i.i.d., i.e., . Here, if we set a landmark label and apply the lower bound in (1) to approximate the true likelihood , the EM formulation in (2) and (3) can be represented as follows:
(4)  
(5) 
where and (for more details, see Appendix I). Consequently, when performing the EM with variational latent variables, we can simply replace with
which is a tractable Gaussian distribution, and substitute
which is the encoded feature from the observed single view . In other words, the semantic feature can be replaced with the encoded feature . Therefore, even if the full shape is hard to observe and its observation model is intractable, we can exploit the single view to approximately infer the true weight and the optimal solutions for EM.
IV-B MLE with Variational Latent Variables
In addition to the EM formulation, the variational lower bound can be used to solve the classification problem of the semantic features. Consider the maximum likelihood estimation (MLE) of . With (1), the approximated MLE can be denoted as follows:
(6) 
(see Appendix II for the proof). The approximate optimal solution is therefore obtained from the MLE with the tractable variational prior and the encoded feature . As in the EM case, the classification of the full shape can be obtained approximately from the MLE, since we use the single view to encode features that ultimately represent the full shape. Note that the use of 3D voxelized shapes as features is not a limitation of our method; the algorithm can be applied to other forms of features such as mesh grids or RGB-D images.
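The substitution described in this section, evaluating a tractable Gaussian prior at the encoded feature instead of an intractable shape likelihood, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes unit-covariance priors, and the names `prior_means`, `mle_label`, and `association_weights` are hypothetical. The weight function normalizes one detection over candidate landmarks only, ignoring the pose and position terms and the sum over association sets in the full EM formulation of [14].

```python
import math


def log_gaussian_unit_cov(z, mu):
    """Log density of N(z; mu, I) for equal-length sequences z and mu."""
    d = len(z)
    sq_dist = sum((zi - mi) ** 2 for zi, mi in zip(z, mu))
    return -0.5 * sq_dist - 0.5 * d * math.log(2.0 * math.pi)


def mle_label(z, prior_means):
    """Label whose prior assigns the highest density to the encoded feature z."""
    return max(prior_means, key=lambda lab: log_gaussian_unit_cov(z, prior_means[lab]))


def association_weights(z, landmark_labels, prior_means):
    """Normalized weights of one detection over candidate landmarks.

    Simplified assumption: each weight is proportional to the prior density
    of the landmark's label evaluated at z (a softmax over log densities).
    """
    logs = [log_gaussian_unit_cov(z, prior_means[lab]) for lab in landmark_labels]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example with two hypothetical class priors in a 2D latent space.
prior_means = {"chair": [0.0, 0.0], "sofa": [3.0, 0.0]}
z = [0.4, -0.2]  # stands in for the encoder output of one single view
label = mle_label(z, prior_means)
weights = association_weights(z, ["chair", "sofa"], prior_means)
```

Because the priors share the same covariance, the MLE label reduces to the nearest prior mean in the latent space.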
V Training Details
V-A Data Augmentation
To train and evaluate the proposed method, we use the ModelNet10 and ModelNet40 datasets, which are subsets of the ModelNet 3D CAD dataset. Two training sets are used in our experiments: one consists of single views from 12 viewpoints of each object, and the other of single views from 24 viewpoints. The voxels of the single views are created through perspective projection. During training, random translations are applied to each sample, following the procedure in [23]. In addition, two copies of every sample are used, and random noise is added to one of them. Because our graphical model involves not only the categories but also the instance labels, translation, and viewpoint, random flipping is not performed, since it can change the instance characteristics and the viewpoint index of an object.
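The augmentation steps above can be sketched on a small occupancy grid. This is an illustration under stated assumptions rather than the procedure of [23]: the shift range, the zero-filled border, the bit-flip noise model, and the function names `translate`, `add_noise`, and `augment` are all hypothetical, and both copies are assumed to share the same translation.

```python
import random


def translate(grid, shift):
    """Shift a cubic occupancy grid by integer offsets (dx, dy, dz), zero-filling the border."""
    n = len(grid)
    dx, dy, dz = shift
    out = [[[0] * n for _ in range(n)] for _ in range(n)]
    for x in range(n):
        for y in range(n):
            for z in range(n):
                sx, sy, sz = x - dx, y - dy, z - dz
                if 0 <= sx < n and 0 <= sy < n and 0 <= sz < n:
                    out[x][y][z] = grid[sx][sy][sz]
    return out


def add_noise(grid, flip_prob, rng):
    """Flip each voxel independently with probability flip_prob (assumed noise model)."""
    return [[[1 - v if rng.random() < flip_prob else v for v in row]
             for row in plane] for plane in grid]


def augment(grid, max_shift, flip_prob, rng):
    """Return two translated copies of one sample: a clean one and a noisy one."""
    shift = tuple(rng.randint(-max_shift, max_shift) for _ in range(3))
    clean = translate(grid, shift)
    noisy = add_noise(clean, flip_prob, rng)
    return clean, noisy


# Toy 4x4x4 grid with a single occupied voxel.
n = 4
grid = [[[0] * n for _ in range(n)] for _ in range(n)]
grid[1][1][1] = 1
shifted = translate(grid, (1, 0, -1))
clean, noisy = augment(grid, max_shift=1, flip_prob=0.05, rng=random.Random(0))
```

No flipping is included, in line with the remark that mirroring would alter the instance and viewpoint labels.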
V-B Loss Function
V-B1 Lower Bound
To train the VAE, the negative variational lower bound from (1) is used as the loss function. Since we assume that and are multivariate Gaussians with diagonal covariance, the KL divergence in (1) is expressed as follows:
where and are the th components of and , respectively. The expectation term in (1) can be estimated by the reparameterization trick [8] as follows:
where . Since the 3D shape of the object is represented as binary random variables, we let be a Bernoulli distribution.
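The three ingredients of this loss can be sketched in plain Python: the closed-form KL divergence between two diagonal Gaussians, the reparameterized sample, and the Bernoulli log-likelihood of the voxels. This is a minimal sketch, not the training code; `decode` is a hypothetical stand-in for the decoder network, and all function names are introduced here for illustration.

```python
import math
import random


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over components."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total


def reparameterize(mu_q, var_q, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu_q, var_q)]


def bernoulli_log_likelihood(target, prob, eps=1e-7):
    """Sum of log Bernoulli probabilities of binary voxels under predicted occupancies."""
    total = 0.0
    for t, p in zip(target, prob):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total


def negative_lower_bound(target, decode, mu_q, var_q, mu_p, var_p, rng, n_samples=1):
    """Monte Carlo estimate of the negative lower bound: KL term minus expected log-likelihood."""
    recon = 0.0
    for _ in range(n_samples):
        z = reparameterize(mu_q, var_q, rng)
        recon += bernoulli_log_likelihood(target, decode(z))
    return kl_diag_gaussians(mu_q, var_q, mu_p, var_p) - recon / n_samples


# Toy check with a decoder that always predicts occupancy 0.5 for two voxels.
loss = negative_lower_bound(
    target=[1, 0],
    decode=lambda z: [0.5, 0.5],
    mu_q=[0.0], var_q=[1.0], mu_p=[0.0], var_p=[1.0],
    rng=random.Random(0),
)
```

With the prior covariance fixed to the identity, as assumed in Section III, `var_p` is a vector of ones and only the prior mean enters the KL term.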
V-B2 Regularization of the Prior Distribution
When the class or instance labels of objects differ, the encoded features corresponding to those objects should also differ, which is ideal for solving the EM or MLE problem. Therefore, in Section III we introduced a prior for each label and constructed neural networks for the nonlinear function that outputs the mean (note that the variance is assumed to be ).
The problem is that the initial values of the weights and biases of the neural networks are close to [24], so the encoded features of different objects vary little at the beginning of learning. Moreover, also has small variation regardless of the labels for a similar reason. Consequently, even after learning converges, the performance of the EM and MLE is poor. It is therefore necessary to keep the distance between the prior means in the latent space above some threshold. To do this, we define a regularization loss as follows:
where
The threshold denotes the minimum distance in the latent space between means according to the different labels.
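One way to realize such a minimum-distance constraint is a hinge penalty on the pairwise distances between prior means. The sketch below is an assumption about the form of the regularizer, not the paper's expression: the function name `prior_separation_loss` and the use of the Euclidean distance are hypothetical choices.

```python
import math


def prior_separation_loss(means, threshold):
    """Sum of hinge penalties max(0, threshold - distance) over all pairs of prior means."""
    loss = 0.0
    for a in range(len(means)):
        for b in range(a + 1, len(means)):
            dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(means[a], means[b])))
            loss += max(0.0, threshold - dist)
    return loss


# Two prior means at distance 5: penalized only when the threshold exceeds 5.
close_penalty = prior_separation_loss([[0.0, 0.0], [3.0, 4.0]], threshold=10.0)
no_penalty = prior_separation_loss([[0.0, 0.0], [3.0, 4.0]], threshold=5.0)
```

The penalty vanishes once every pair of means is at least the threshold apart, so it pushes the priors apart early in training without affecting well-separated labels.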
V-B3 Reconstruction
In the proposed algorithm, the VAE can be regarded as a retrieval network from a single view to the full shape. The expectation term and the KL divergence term in (1) can likewise be regarded as a shape retrieval term and a regularization factor, respectively. In our case, we find that when the KL divergence term strongly restricts the network, training easily falls into local minima and gives false results such as all-zero voxels or a centralized spherical bulb shape. Therefore, we change the range of the binary variables for the target voxelized shape from to , to increase the scale of the gradient for the shape retrieval loss as in [25]. The modified loss for shape retrieval is then as follows:
where and are the th components of the binary occupancy variables for the predicted full shape and the target (true) shape of the object, respectively. With this modification, our network can efficiently converge to the optimal state, which infers the restricted encoded features and correct full shapes from single views.
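The effect of rescaling the targets can be illustrated with a weighted cross-entropy on rescaled targets. Everything numeric here is an assumption made for the sketch: the rescaled target values -1 and 2, the weight `gamma`, and the output clipping are modeled on the style of loss used in [25] and are not confirmed by the text, and `modified_bce` is a hypothetical name.

```python
import math


def modified_bce(target, pred, gamma=0.97, low=-1.0, high=2.0, clip=(0.1, 0.999)):
    """Weighted cross-entropy on rescaled targets, averaged over voxels.

    target: binary occupancies in {0, 1}; pred: predicted occupancies in (0, 1).
    Targets are mapped to {low, high} (assumed values) before the loss is computed,
    and predictions are clipped so the loss stays bounded.
    """
    total = 0.0
    for t, o in zip(target, pred):
        t_scaled = high if t == 1 else low
        o = min(max(o, clip[0]), clip[1])
        total += (-gamma * t_scaled * math.log(o)
                  - (1.0 - gamma) * (1.0 - t_scaled) * math.log(1.0 - o))
    return total / len(target)


# The loss still rewards correct predictions for occupied and empty voxels.
occupied_good = modified_bce([1], [0.9])
occupied_bad = modified_bce([1], [0.5])
empty_good = modified_bce([0], [0.1])
empty_bad = modified_bce([0], [0.5])
```

With these assumed values, the gradient with respect to the prediction of an occupied voxel is roughly twice that of the standard cross-entropy, which is the stated purpose of the range change.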
V-C Training Networks
We construct our VAE with basic convolutional neural networks and dense layers. An overview of the proposed network is shown in Fig. 3. The encoder of the VAE is composed of 9 convolutional layers and 2 dense layers. Similarly, the decoder is implemented with 2 dense layers and 9 transposed convolutional layers. For the prior distribution, we construct 2 dense layers. We use batch normalization for efficient and stable learning. Dropout regularization is applied to the hidden layer to add noise to the latent variables. For nonlinearity, all layers use the exponential linear unit (ELU) except the last layer, which uses a sigmoid function to implement the Bernoulli distribution. We use the Adam optimizer to train our VAE network.
VI Experiments
VI-A Variational Latent Variable and Object Shape
The latent variables are sampled from the variational likelihood, which approximates the distribution . Therefore the encoded features have correlations with the full shape . For verification, we compare the MLE results of class label using (6) with the results of the classifier trained with full shape of objects. We use ModelNet10 dataset for these evaluations. The structure of the classifier is the same as the encoder of the VAE except for the last layer, which consists of an additional dense layer with softmax activation. As shown in Fig. 4, the confusion matrices for MLE and full shape classification results show similar aspects.
Meanwhile, the prior networks trained together with the autoencoder represent the distributions of the latent variables, which encode the class, instance, and observation position. The parameters of the prior therefore also reflect the characteristics of each label. To examine this, we show the distance between the parameters, especially , in Fig. 4. The prior of the class label is the distribution of the latent variable that determines the rough shape of the object. Therefore, the closer the shapes of two classes are, the smaller the distance between their prior parameters in the latent space. As a result, the distances show a pattern similar to the confusion matrices of the full shapes.
VI-B Classification and Reconstruction
We also compare the MLE results with the classification results of state-of-the-art multi-view based classification algorithms in Table I. Our results show better performance than most of the other multi-view or voxelized full-shape based classification algorithms. In addition, despite the simple architecture of the encoding layers, the proposed algorithm achieves results competitive with [25], which shows the best single-view performance by applying a deep ResNet structure (8 convolutional layers for the shallowest path, 45 convolutional layers for the deepest path, and 2 dense layers).
Methods                    ModelNet10    ModelNet40
DeepPano [26]              88.66%        82.54%
VoxNet [23]                92.00%        83.00%
3D-GAN [18]                91.00%        83.30%
GIFT [27]                  92.35%        83.10%
PANORAMA-NN [28]           91.12%        90.70%
LightNet [29]              93.94%        88.93%
VRN [25] (single view)     -             88.98%
proposed (single view)     91.35%        86.82%
The full shape of the observed single view can also be inferred from the trained autoencoder. The retrieval results are shown in Table II, and some samples of the retrieval results are displayed in Fig. 6. We also report the precision-recall curve in Fig. 5. Interestingly, unlike the classification results, the retrieval accuracies for the ModelNet10 and ModelNet40 datasets are not significantly different. This is because some classes with distinctive shape features, such as airplanes or cars, are easier to reconstruct than the furniture classes.
Meanwhile, since the scale, color, and material of the object are not considered at all during learning, reasoning about objects from the geometric information of single views alone has limitations. For example, in the last row on the right side of Fig. 6, the full shape is reasonably inferred from the single view. However, due to the lack of other information, the MLE result indicates that the object is most likely a chair, not a sofa.
Methods                    ModelNet10            ModelNet40
                           AUC       mAP         AUC       mAP
PANORAMA-NN [28]           -         87.39%      -         83.45%
ShapeNets [30]             69.28%    68.26%      49.94%    49.23%
DeepPano [26]              85.45%    84.18%      77.63%    76.81%
GIFT [27]                  92.35%    91.12%      83.10%    81.94%
proposed (single view)     81.94%    82.72%      84.94%    83.82%
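For readers unfamiliar with the mAP column, the metric can be sketched as follows. This is the common definition of average precision over a ranked retrieval list, given here under the assumption that each list contains all relevant items; the exact evaluation protocol behind the table is not specified in the text, and the function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision of one query from a ranked list of 0/1 relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precisions."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Two toy queries: relevant items at ranks 1 and 3, and at rank 2 only.
ap_first = average_precision([1, 0, 1])
map_score = mean_average_precision([[1, 0, 1], [0, 1, 0]])
```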
VII Conclusion
Since a high-dimensional feature such as the 3D shape of an object follows an intractable observation model, numerical analysis for Bayesian inference problems such as semantic SLAM becomes challenging. To overcome this problem, we show that the semantic features can be replaced by variational latent variables. We also present a feature encoding method using the variational generative model. Since observing the full shape in real time is challenging, the proposed algorithm infers the variational likelihood of the full shape from the single view. The complex observation model of the 3D object is thereby approximated by a tractable distribution. Consequently, the encoded features and their priors enable numerical analysis for probabilistic estimation. Experiments are conducted to evaluate the algorithm on the 3D CAD dataset. To analyze the approximated distributions and encoded features, we perform classification with maximum likelihood estimation and shape retrieval through the decoding process.
Appendix I : Variational Lower Bound and EM Algorithm
As we assume that and each of the factorized priors is a uniform distribution, the term for the observation model can be represented as where is constant. Therefore, we can apply the variational lower bound to approximate the likelihood for observation model in EM formulation. Since we are focusing on the likelihood of , in the remaining part we let and for convenience.
VII-A Variational Lower Bound for Expectation Step
After the VAE training converges, the variational lower bound in (1) approximates the log-likelihood of in (2). Therefore, we can rewrite the term as follows:
(7) 
where
Substituting (7) into (2) yields:
Since , and are independent of and , we can reduce the fraction as:
(8) 
Focusing on the term , we now expand the negative KL divergence term as:
Note that since we assume the priors of the variational latent variables are multivariate Gaussians with diagonal covariances in Section III, we can represent as , where , and . Similarly, we let , where , and . Then we continue:
where is the normalization term for . Since there is no constraint to , we simply let (see Section III). The term then can be rewritten as:
(9) 
As the exponential terms in (9) depend not on the data association but on , substituting (9) into (8) and reducing the fraction finally yields:
VII-B Variational Lower Bound for Maximization Step
Similar to the expectation step, we can also apply the variational likelihood for the maximization step. Since we assumed that , (3) can be rewritten as follows:
Note that only and in (7) and (9) are related to . Therefore substituting (7) and (9) subsequently, we have:
Since , we can further expand the equation as follows:
Appendix II : MLE with Variational Latent Variables
References

[1] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “Slam++: Simultaneous localisation and mapping at the level of objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1352–1359.
[2] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 3357–3364.
[3] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” arXiv preprint arXiv:1611.05397, 2016.
[4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[5] A. Kanezaki, Y. Matsushita, and Y. Nishida, “Rotationnet: Joint learning of object classification and viewpoint estimation using unaligned 3d object dataset,” arXiv preprint arXiv:1603.06208, 2016.
[6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[7] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” arXiv preprint, vol. 1612, 2016.
[8] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[9] L. Fei-Fei and P. Perona, “A bayesian hierarchical model for learning natural scene categories,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2. IEEE, 2005, pp. 524–531.
[10] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al., “Fastslam: A factored solution to the simultaneous localization and mapping problem,” in AAAI/IAAI, 2002, pp. 593–598.
[11] S. Thrun and M. Montemerlo, “The graph slam algorithm with applications to large-scale mapping of urban structures,” The International Journal of Robotics Research, vol. 25, no. 5-6, pp. 403–429, 2006.
[12] N. Sünderhauf and P. Protzel, “Towards a robust back-end for pose graph slam,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 1254–1261.
[13] N. Sünderhauf and P. Protzel, “Switchable constraints for robust pose graph slam,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 1879–1884.
[14] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, “Probabilistic data association for semantic slam,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 1722–1729.
[15] M. Zhu, N. Atanasov, G. J. Pappas, and K. Daniilidis, “Active deformable part models inference,” in European Conference on Computer Vision. Springer, 2014, pp. 281–296.
[16] A. Dosovitskiy, J. T. Springenberg, M. Tatarchenko, and T. Brox, “Learning to generate chairs, tables and cars with convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 692–705, 2017.
[17] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, “Deep convolutional inverse graphics network,” in Advances in Neural Information Processing Systems, 2015, pp. 2539–2547.
[18] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,” in Advances in Neural Information Processing Systems, 2016, pp. 82–90.
[19] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta, “Learning a predictable and generative vector representation for objects,” in European Conference on Computer Vision. Springer, 2016, pp. 484–499.
[20] J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum, “Marrnet: 3d shape reconstruction via 2.5d sketches,” in Advances in Neural Information Processing Systems, 2017, pp. 540–550.
[21] B. Yang, H. Wen, S. Wang, R. Clark, A. Markham, and N. Trigoni, “3d object reconstruction from a single depth view with adversarial learning,” arXiv preprint arXiv:1708.07969, 2017.
[22] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther, “Ladder variational autoencoders,” in Advances in Neural Information Processing Systems, 2016, pp. 3738–3746.
[23] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 922–928.
[24] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[25] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, “Generative and discriminative voxel modeling with convolutional neural networks,” arXiv preprint arXiv:1608.04236, 2016.
[26] B. Shi, S. Bai, Z. Zhou, and X. Bai, “Deeppano: Deep panoramic representation for 3d shape recognition,” IEEE Signal Processing Letters, vol. 22, no. 12, pp. 2339–2343, 2015.
[27] S. Bai, X. Bai, Z. Zhou, Z. Zhang, Q. Tian, and L. J. Latecki, “Gift: Towards scalable 3d shape retrieval,” IEEE Transactions on Multimedia, vol. 19, no. 6, pp. 1257–1271, 2017.
[28] P. Papadakis, I. Pratikakis, T. Theoharis, and S. Perantonis, “Panorama: A 3d shape descriptor based on panoramic views for unsupervised 3d object retrieval,” International Journal of Computer Vision, vol. 89, no. 2, pp. 177–192, 2010.
[29] S. Zhi, Y. Liu, X. Li, and Y. Guo, “Toward real-time 3d object recognition: A lightweight volumetric cnn framework using multi-task learning,” Computers & Graphics, 2017.
[30] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912–1920.