The ability to build and use a map of discrete environmental landmarks to navigate is one of the greatest strengths of the landmark-based simultaneous localization and mapping (SLAM) paradigm, but hinges critically on reliable recognition of previously mapped landmarks, i.e. data association. Consistent data association over long periods of time is vital if we aim to achieve robust robot navigation in the operational limit as “time goes to infinity.” Unfortunately, long-term data association is substantially more challenging than short-term data association, since uncertainty in robot pose may grow large enough that many previously observed landmarks may be reasonable, albeit erroneous, candidates for a loop closure. For this reason, any mechanism by which we can associate landmarks uniquely is of interest.
Recent advances in the capabilities and reliability of deep neural networks for object detection and feature extraction have motivated the joint use of semantics both to distinguish landmarks (taken to be objects in the environment) and to infer over time the correct semantic class of each landmark, i.e. semantic SLAM. However, no detection system can be expected to have perfect accuracy. Rather than build navigation systems that depend on perfect detection and classification, we aim to develop methods that take into consideration the error characterization of perception systems like neural network-based object detectors.
Robustness to misclassification and pose uncertainty requires the abilities to represent and resolve ambiguity in data association, and to reject incorrect loop closures. Traditional methods represent the problem of finding the correct set of hypotheses as a tree-search problem, where each node of the tree represents an association decision. Mitigating the complexity of search requires careful pruning of plausible hypotheses. Rather than explicitly search over association hypotheses, in this work we marginalize out data association variables at each point in time, allowing associations to arise implicitly from the inference of pose and landmark values.
The main contribution of this work is an approximate max-marginalization procedure for data associations which provides theoretical grounding for a “max-mixture”-type factor [23] within the context of semantic SLAM (shown in Figure 1), thus taking steps toward a unifying perspective on previous work in “robust SLAM” and recent work on data association for semantic SLAM, e.g. [2, 6]. Approximate max-marginalization eliminates data association variables in a way that preserves standard Gaussian distribution assumptions in SLAM in what otherwise becomes a non-Gaussian inference problem. Our representation makes use of the error characterization of an object detector, taking uncertainty into consideration to fuse detection information with geometric information from other sensors, like stereo cameras or lidar. Lastly, the proposed method supports loop closure rejection through a null-hypothesis association, which we experimentally find to be critical in providing robustness to odometry noise and misclassification in the semantic SLAM problem.
The remainder of this paper proceeds as follows. In Section II we discuss related works on the topics of data association, robust SLAM, and semantic SLAM. We describe the problem of semantic SLAM with unknown data association in Section III, where we outline our approach to data association at a high level. In Section IV we describe in detail the max-marginalization procedure and the data association weight computation, and define the “semantic max-mixture factor” used for optimization, including the representation of null-hypothesis data association for loop closure rejection. Finally, Section V provides experimental results demonstrating the robustness of the proposed approach to odometry noise and misclassification during indoor semantic navigation, along with results from the KITTI dataset [8].
II Related Work
The proposed work intersects the topics of data association, robust SLAM, and semantic SLAM. Classical work on data association stems from the target-tracking literature, where probabilistic data association (PDA) [1] and multi-hypothesis tracking (MHT) [25] were introduced. Subsequently, these methods were applied to early filtering-based SLAM solutions making Gaussian noise assumptions [4, 3]. FastSLAM [20] later introduced a particle filtering-based approach to the non-Gaussian inference problem of data association, in which a data association sampler serves as an alternative to explicit search over associations.
Later work in the area of pose-graph optimization focused on themes of multi-hypothesis SLAM and outlier rejection, with the introduction of methods like switchable constraints [29], max-mixtures [23], and junction tree inference methods [27]. These works consider mitigating the effects of perceptual aliasing, often in the context of laser scan matching or appearance-based loop closure (see, for example, [18]), whereas in semantic SLAM we also want to locate and classify discrete objects in the environment. Nonetheless, as we demonstrate in this work, landmark-based semantic SLAM shares similar challenges. Specifically, in this work we take an approximate Bayesian inference perspective on the semantic SLAM problem and arrive at a specific case of the max-mixtures method [23] where component weights are directly computed as probabilities of candidate associations.
In the area of semantic SLAM, classifications from object detectors have been used to aid data association [2, 6, 32, 26, 21]. While many such works consider maximum-likelihood data association [32, 26], recent works have considered probabilistic data association making use of expectation-maximization [2], or alternate between sampling data associations and recomputing SLAM solutions [21]. In contrast to [2], we model data associations as a mixture, rather than averaging solutions for different associations. We also marginalize out poses and landmarks when computing data association probabilities, whereas in [2], point estimates of robot poses and landmarks are used to compute data association weights. In both [2] and [21], convergence requires iteratively recomputing data associations and performing factor graph optimization; we aim to avoid the complexity associated with this recomputation. In previous work [6], we addressed this problem for the general non-Gaussian SLAM case using nonparametric belief propagation. While that approach gives rich uncertainties representing data association ambiguity, in this work our goal is maximum a posteriori inference specifically in the nonlinear Gaussian case.
III Semantic SLAM with Unknown Data Association
We define the semantic SLAM problem to be the inference of vehicle poses $X = \{x_i\}$ and landmark states $L = \{\ell_j\}$, given a set of measurements $Z$. This corresponds to the following maximum a posteriori (MAP) inference problem:
$$\hat{X}, \hat{L} = \operatorname*{arg\,max}_{X, L} \; p(X, L \mid Z). \tag{1}$$
We consider the vehicle state space $x_i \in \mathrm{SE}(3)$ throughout, while we consider landmark states containing both geometric and semantic components, i.e. $\ell_j \in \mathbb{R}^3 \times \mathcal{C}$ (for landmarks with only a positional component) or $\ell_j \in \mathrm{SE}(3) \times \mathcal{C}$, where $\mathcal{C}$ is a fixed, a priori known set of discrete semantic classes. When necessary, we denote the separate pose/positional and semantic components of a landmark as $\ell^p$ and $\ell^s$, respectively. We assume multiple measurements can be made at each point in time, such that $z_t^k$ is the $k$-th measurement made at time $t$. Lastly, we consider the measurement space as having jointly geometric and semantic components, e.g. range-bearing measurements with semantic class in $\mathcal{C}$ or 6-DoF pose measurements in $\mathrm{SE}(3)$ with semantic class in $\mathcal{C}$.
When associations between measurements and landmarks are not known, they must be inferred. Specifically, we take $D$ to be the set of associations of measurements at all points in time to landmarks, such that $d_t^k = j$ indicates that the $k$-th measurement taken at time $t$ corresponds to landmark $\ell_j$. The most common approach to SLAM with unknown data association is that of maximum likelihood, in which the most probable set of data associations is computed and fixed, then used to solve for the most probable robot poses and landmark states. This approach can be brittle, as a single incorrect association can move the optimal solution for robot poses and landmark states far from their true values.
In order to mitigate the effects of data association errors, one may consider probabilistic data association, in which multiple associations for a measurement are given consideration commensurate with their probability. Generally this corresponds to marginalization of the data association variables, i.e.
$$\hat{X}, \hat{L} = \operatorname*{arg\,max}_{X, L} \sum_{D} p(X, L \mid D, Z)\, p(D \mid Z) = \operatorname*{arg\,max}_{X, L} \; \mathbb{E}_{D}\left[ p(X, L \mid D, Z) \mid Z \right]. \tag{2}$$
Common assumptions of additive Gaussian measurement noise make $p(X, L \mid D, Z)$ Gaussian, such that the resulting belief is a sum of Gaussians, which generally falls outside the realm of traditional nonlinear least-squares optimization approaches to SLAM.
A recent approach to semantic SLAM [2] preserves the Gaussian nature of the problem by replacing the above expectation with one over the corresponding log-probability, leading to an expectation-maximization algorithm for optimization. Convergence requires recomputation of the association weights, and prior to convergence the solution will lie somewhere between those obtained given fixed associations.
We propose an alternative solution to the MAP inference problem in which the “sum-marginal” above is replaced by the “max-marginal”:
$$\hat{X}, \hat{L} = \operatorname*{arg\,max}_{X, L} \; \max_{D} \; p(X, L \mid D, Z)\, p(D \mid Z). \tag{3}$$
Each component of the max-marginal is a weighted Gaussian, while the $\max$ operator simply acts to switch to the “best” data associations for any given point in the latent space of $X$ and $L$. The optimal solution for the max-marginal is identical to the MAP solution for the true posterior in (2) [16]. Exact computation of the true max-marginal is generally intractable due to the combinatorial number of plausible data associations, but as we will show, several reasonable approximations make the max-marginal a practical method for dealing with data association ambiguity.
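To make the difference between the sum-marginal and the max-marginal concrete, the following is a minimal one-dimensional sketch. It is an illustration only, not the implementation used in this work: the two candidate landmark positions, the association weights (0.7 and 0.3), the standard deviation, and all function names are assumptions chosen for the example.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def sum_marginal(x, weights, means, std):
    """Sum-mixture: every association hypothesis contributes at x."""
    return sum(w * gaussian_pdf(x, m, std) for w, m in zip(weights, means))

def max_marginal(x, weights, means, std):
    """Max-mixture: only the best-scoring hypothesis at x is kept."""
    return max(w * gaussian_pdf(x, m, std) for w, m in zip(weights, means))

def best_hypothesis(x, weights, means, std):
    """Index of the hypothesis selected by the max operator at x."""
    scores = [w * gaussian_pdf(x, m, std) for w, m in zip(weights, means)]
    return scores.index(max(scores))

# Hypothetical example: two candidate landmarks at 0.0 and 4.0.
weights, means, std = [0.7, 0.3], [0.0, 4.0], 0.5
for x in (0.0, 1.9, 2.3, 4.0):
    print(x,
          best_hypothesis(x, weights, means, std),
          sum_marginal(x, weights, means, std),
          max_marginal(x, weights, means, std))
```

In this sketch the selected hypothesis switches between x = 1.9 and x = 2.3. On either side of the switch the negative logarithm of the max-marginal is a single quadratic, which is what keeps the problem compatible with nonlinear least-squares solvers.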
IV Max-Mixture Semantic SLAM
We consider a standard Gaussian SLAM framework with an odometry model $p(x_{i+1} \mid x_i)$ that is Gaussian with covariance $Q_i$ with respect to the relative transform from $x_i$ to $x_{i+1}$, and a Gaussian geometric measurement model $p(z^p \mid x_i, \ell_j^p)$ with covariance $\Gamma$ with respect to the nonlinear function $h(x_i, \ell_j^p)$. (The function $h$ in this work is taken to be a relative transform in the case of full landmark pose measurements, or bearing, elevation, and range when only landmark position information is available.) Semantic measurements with model $p(z^s \mid \ell_j^s)$ are assumed independent of the pose from which a landmark was observed, as well as of the landmark's position, as in [6, 2].
The unnormalized posterior $p(X, L \mid Z)$ can be written in the following general factor graph formulation:
$$p(X, L \mid Z) \propto \prod_{i} f_i(V_i), \quad V_i \subseteq \{X, L\}, \tag{4}$$
where each factor $f_i$ is in correspondence with one of the relevant (odometric or landmark) measurement models. From the measurement models, there is a clear partition between the geometric information from the odometry and geometric landmark measurement models (which depend on both the robot poses and landmark locations) and the semantic information, which depends only on the class of the associated landmark. Since we do not know data associations, we instead infer them from data. At a high level, our approach is to apply variable elimination to data associations to produce an equivalent factor graph with data associations marginalized out. The proposed max-mixture semantic SLAM approach approximates optimization over the max-marginal in (3). In particular, we introduce a proactive max-marginalization procedure for computing data association weights. For associations to previous landmarks, as well as for the null-hypothesis case, the max-marginal over candidate associations is represented as a factor (in the factor graph SLAM framework) taking on the form of a “max-mixture” [23]. We term these factors semantic max-mixture factors. The addition of null-hypothesis data association enables the rejection of incorrect loop closures. The resulting factor graph is amenable to optimization using standard nonlinear least-squares techniques, from which the optimal robot and landmark states can be recovered.
IV-A Proactive Max-Marginalization
Exact computation of the max-marginal over all possible data associations in Equation (3) is computationally expensive due to the combinatorial growth in the size of the set of possible data associations over time. Marginalizing out data associations proactively, i.e. as new measurements are made, and ignoring the influence of future measurements on association probabilities allows us to mitigate the complexity of full max-marginalization.
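In particular, suppose we have some set of previous measurements $Z^-$ and new measurements $Z^+$, so that $Z = Z^- \cup Z^+$. We aim to compute the max-marginal over associations to the new measurements, denoted $D^+$. Formally, we have the following:
$$p(X, L \mid D^+, Z) \propto p(Z^+ \mid X, L, D^+)\, p(X, L \mid Z^-) \tag{5}$$
from Bayes’ rule, where we have used the conditional independences $p(Z^+ \mid X, L, D^+, Z^-) = p(Z^+ \mid X, L, D^+)$ and $p(X, L \mid D^+, Z^-) = p(X, L \mid Z^-)$, since $D^+$ consists of associations to only measurements outside of $Z^-$. Applying max-marginalization to data associations, we obtain:
$$\hat{p}(X, L \mid Z) \propto \max_{D^+} \left[ p(Z^+ \mid X, L, D^+)\, p(D^+ \mid Z) \right] p(X, L \mid Z^-). \tag{6}$$
Here $p(X, L \mid Z^-)$ is the (potentially non-Gaussian) posterior distribution over poses and landmarks after sum-marginalization of data associations to the measurements $Z^-$. For the purposes of optimization in the Gaussian case, we take this as the max-marginal $\hat{p}(X, L \mid Z^-)$.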
The consequences of this simple change are significant: evaluating the $\max$ operators in the above expression no longer requires examination of previous associations and can be done in linear time for the most recent measurement. We thus arrive at an approximate max-product algorithm for SLAM with unknown data associations.
IV-B Data Association Weight Computation
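Consider a single measurement $z_t^k = (z^p, z^s)$ of a landmark $\ell_j$, which in our case consists of the joint geometric and semantic measurement of the landmark. We assume the data association probability $p(d_t^k = j \mid Z)$ is proportional to the likelihood $p(z_t^k \mid d_t^k = j, Z^-)$ with poses and landmarks marginalized out (see Figure 1(a)). From the factored measurement model assumption, this likelihood can be broken into the product of separate semantic and geometric likelihoods:
$$p(z_t^k \mid d_t^k = j, Z^-) = p(z^s \mid d_t^k = j, Z^-)\, p(z^p \mid d_t^k = j, Z^-). \tag{7}$$
Each term on the right-hand side can be expanded as follows into the summation over landmark classes:
$$p(z^s \mid d_t^k = j, Z^-) = \sum_{c \in \mathcal{C}} p(z^s \mid \ell_j^s = c)\, p(\ell_j^s = c \mid Z^-), \tag{8}$$
and integral over robot pose and landmark location:
$$p(z^p \mid d_t^k = j, Z^-) = \iint p(z^p \mid x_t, \ell_j^p)\, p(x_t, \ell_j^p \mid Z^-)\, dx_t\, d\ell_j^p. \tag{9}$$
With data associations marginalized out, the belief $p(x_t, \ell_j^p \mid Z^-)$ would be generally non-Gaussian. Consequently, we again make an approximation and use the single Gaussian component corresponding to the max-marginal evaluated at the current estimate of $X$ and $L$, which we denote $\hat{p}(x_t, \ell_j^p \mid Z^-)$:
$$p(z^p \mid d_t^k = j, Z^-) \approx \iint p(z^p \mid x_t, \ell_j^p)\, \hat{p}(x_t, \ell_j^p \mid Z^-)\, dx_t\, d\ell_j^p. \tag{10}$$
Since all of the terms in the integral are now Gaussian, it can be simplified as follows, based on the method of [14]:
$$p(z^p \mid d_t^k = j, Z^-) \approx \frac{1}{\sqrt{|2\pi R_{tj}|}} \exp\left( -\frac{1}{2} \left\| h(\hat{\mu}_{tj}) - z^p \right\|^2_{R_{tj}} \right), \tag{11}$$
where $\hat{\mu}_{tj}$ is the mean of the joint distribution over $x_t$ and $\ell_j^p$, and $\|e\|^2_{R} = e^\top R^{-1} e$. The covariance $R_{tj}$ is defined as:
$$R_{tj} = H_{tj}\, \Sigma_{tj}\, H_{tj}^\top + \Gamma, \tag{12}$$
where $\Sigma_{tj}$ is the block joint covariance matrix between pose $x_t$ and candidate landmark $\ell_j$, $H_{tj}$ is the Jacobian of the measurement function $h$, and $\Gamma$ is the covariance of the geometric measurement model. This result, combined with the expression in (8), gives the marginal likelihood in (7) that we normalize to compute data association probabilities.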
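The weight computation described above can be sketched in a deliberately simplified scalar setting. The sketch assumes a one-dimensional world in which the geometric measurement is the relative position of the landmark with respect to the robot, so the measurement Jacobian is [-1, 1] and the innovation covariance reduces to a scalar. The function names, the dictionary layout of a candidate, and all numeric values are hypothetical; a real system would use the full Jacobian and the joint marginal covariance recovered from the optimizer.

```python
import math

def geometric_likelihood(z, x_mean, l_mean, joint_cov, meas_var):
    """Gaussian marginal likelihood of a 1D measurement z = l - x + noise.

    joint_cov is the 2x2 joint covariance [[var_x, cov_xl], [cov_xl, var_l]].
    With Jacobian H = [-1, 1], the innovation variance is
    R = H * joint_cov * H^T + meas_var.
    """
    (var_x, cov_xl), (_, var_l) = joint_cov
    r = var_x - 2.0 * cov_xl + var_l + meas_var
    residual = z - (l_mean - x_mean)
    return math.exp(-0.5 * residual * residual / r) / math.sqrt(2.0 * math.pi * r)

def semantic_likelihood(observed_class, class_belief, confusion):
    """Sum over landmark classes c of p(detection | c) * p(c | past data).

    confusion[c][k] is the assumed probability of detecting class k
    when the true class is c.
    """
    return sum(confusion[c][observed_class] * p_c
               for c, p_c in enumerate(class_belief))

def association_weights(z, observed_class, x_mean, var_x, candidates,
                        meas_var, confusion):
    """Normalized association probabilities (uniform prior assumed)."""
    likelihoods = []
    for cand in candidates:
        joint_cov = [[var_x, cand["cov_xl"]], [cand["cov_xl"], cand["var"]]]
        geo = geometric_likelihood(z, x_mean, cand["mean"], joint_cov, meas_var)
        sem = semantic_likelihood(observed_class, cand["class_belief"], confusion)
        likelihoods.append(geo * sem)
    total = sum(likelihoods)
    return [v / total for v in likelihoods]

# Hypothetical example: two candidates that are equally plausible
# geometrically, so the semantic term decides the weights.
confusion = [[0.9, 0.1], [0.1, 0.9]]
candidates = [
    {"mean": 1.8, "var": 0.04, "cov_xl": 0.0, "class_belief": [0.9, 0.1]},
    {"mean": 2.2, "var": 0.04, "cov_xl": 0.0, "class_belief": [0.1, 0.9]},
]
example_weights = association_weights(2.0, 0, 0.0, 0.01, candidates, 0.01, confusion)
print(example_weights)
```

Because the two candidates sit symmetrically around the predicted measurement, their geometric likelihoods cancel in the normalization and the weights are determined by the semantic likelihoods alone.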
IV-C Semantic Max-Mixture Factor
Assuming uniform priors on data associations, the distribution $p(d_t^k \mid Z)$ is proportional to the marginal likelihood in (7) and can simply be normalized over all assignments to $d_t^k$. This results in a set $\mathcal{H}$ of candidate landmark hypotheses, max-marginalization of which produces a max-mixture factor for a measurement $z_t^k$:
$$f(x_t, L) = \max_{j \in \mathcal{H}} \; p(d_t^k = j \mid Z)\, p(z_t^k \mid x_t, \ell_j). \tag{13}$$
The max-marginalization step is visualized in Figure 1(b), where we have eliminated the data association variable from the inference process. By augmenting the candidate set to be $\mathcal{H} \cup \{\varnothing\}$, we allow a null-hypothesis data association to be made. In practice, we assume a probability for the null-hypothesis and normalize the remaining data associations such that the total probability of the augmented hypothesis set equals 1. The null-hypothesis component is assumed to be Gaussian with large standard deviation.
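As an illustration of how such a factor might be evaluated, the sketch below computes the negative log of a one-dimensional max-mixture with a null-hypothesis component. Several details are assumptions made for the example and are not taken from this work: the null component is centred on the highest-weight candidate, all candidates share one standard deviation, and the names and numbers are hypothetical.

```python
import math

def with_null_hypothesis(weights, null_prob):
    """Rescale candidate weights so that they and the null weight sum to 1."""
    total = sum(weights)
    scaled = [(1.0 - null_prob) * w / total for w in weights]
    return scaled, null_prob

def gaussian_neg_log(residual, std):
    """Negative log of a 1D Gaussian density, including its normalizer."""
    return 0.5 * (residual / std) ** 2 + math.log(std * math.sqrt(2.0 * math.pi))

def max_mixture_cost(z, predictions, weights, std, null_prob, null_std):
    """Negative log of max_j w_j N(z; prediction_j, std^2) with a null component.

    Returns (cost, index); index is None when the null hypothesis wins.
    Assumption: the null component is centred on the highest-weight candidate.
    Weights must be strictly positive.
    """
    scaled, w_null = with_null_hypothesis(weights, null_prob)
    costs = [gaussian_neg_log(z - p, std) - math.log(w)
             for p, w in zip(predictions, scaled)]
    best = min(range(len(costs)), key=costs.__getitem__)
    anchor = predictions[scaled.index(max(scaled))]
    null_cost = gaussian_neg_log(z - anchor, null_std) - math.log(w_null)
    if null_cost < costs[best]:
        return null_cost, None
    return costs[best], best

# Hypothetical example: predicted measurements for two candidate landmarks.
predictions, cand_weights = [0.0, 4.0], [0.7, 0.3]
for z in (0.1, 4.2, 12.0):
    print(z, max_mixture_cost(z, predictions, cand_weights, 0.5, 0.1, 100.0))
```

A measurement close to either prediction selects that candidate, while a measurement far from both (z = 12.0 here) is absorbed by the broad null component, so it exerts almost no pull on the estimate. This is the mechanism by which a bad loop closure can be rejected.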
Finally, we can recover maximum a posteriori landmark semantic class estimates (assuming uniform priors) as in [2], using the data association probabilities stored as the component weights of the max-mixture factors.
V Experimental Results
All computational experiments were implemented in C++ using the Robot Operating System (ROS) [24] and the implementation of iSAM2 [15] within the GTSAM [5] library for optimization and covariance recovery. We demonstrate our approach on 3D visual SLAM tasks using data collected during indoor navigation with an MIT RACECAR vehicle (https://mit-racecar.github.io/) equipped only with a ZED stereo camera [28], as well as with stereo image data from the KITTI dataset [8, 7]. Experiments were run on a single core of a 2.2 GHz Intel i7 CPU. We use evo [10] for trajectory evaluation. We provide trajectory comparisons for all of the methods tested without landmarks visualized; more detailed visualizations and videos can be found on the project page: https://github.com/MarineRoboticsGroup/mixtures_semantic_slam.
In both experiments, we compared two variants of the proposed method, semantic max-mixtures (MM) and semantic max-mixtures with null-hypothesis data association (MM+NH), to a known data association (Known DA) baseline, to naïve maximum-likelihood (ML) data association (which makes a single association with the landmark maximizing Eq. (7)), and to an expectation-maximization approach similar to that of [2], here referred to as Gaussian probabilistic data association (GPDA). We use a threshold on the marginal likelihood in (7) to determine new landmarks, as well as to produce the set of landmark candidates. This is similar to standard Mahalanobis distance-based thresholds, but also considers the semantic likelihood. (We use a test with confidence 0.9 and a null-hypothesis weight of 0.1.)
V-A MIT RACECAR Dataset
We collected roughly 25 minutes of data during indoor navigation with the MIT RACECAR mobile robot platform over a roughly 1.08 km trajectory. We sampled AprilTag [31, 19] detection keyframes at a rate of 1 Hz, resulting in 702 observations of 262 unique tags. Odometry was obtained using the ZED stereo camera visual odometry [28]. The use of AprilTags uniquely allows us to obtain a baseline “ground-truth” solution with known data associations. We artificially assigned semantic labels to each AprilTag by considering the true tag ID modulo $K$ for a $K$-class semantic SLAM problem. In the experiments presented in this paper, we use $K = 2$ classes, as we found it to be one of the most challenging situations (in general, with a “good” detector, the presence of many unique semantic classes among landmarks makes the data association problem easier). While AprilTags give generally accurate orientation information, we typically cannot expect this of neural network-based object detectors. For this reason, we set a large standard error on roll, pitch, and yaw of AprilTag detections to prevent orientation information from giving any substantial data association cues. Furthermore, this experimental setup allows us to apply classification error and simulate additional odometry noise in a repeatable way to study the trade-off in data association performance with noise in classification and odometry.
In Figure 3, we provide box-plots summarizing statistical results of trajectory error on the MIT RACECAR dataset. Specifically, we considered robustness to simulated additional odometry noise and detector misclassification. We calibrated an initial odometry model, but in testing we add simulated Gaussian noise in $x$, $y$, and yaw (using a fixed baseline noise model defined in the robot frame), multiplied by a scale factor varying from 0 to 10. We simulate classification error for a misclassification rate $p$ (from 0% to 50%) by sampling from a semantic measurement model whose confusion matrix has $1 - p$ on the diagonal and the remaining probability mass $p$ on the off-diagonal. We find that while errors in all methods increase with added noise in odometry and misclassification, all of the probabilistic methods generally outperform maximum-likelihood. Most significantly, we find that the addition of the null-hypothesis to our approach drastically reduces error in all tests. Beyond the ability of the null-hypothesis method to reject bad loop closures, the addition of the null hypothesis may help prevent the max-mixtures approach from becoming “stuck” in a local optimum by decreasing the high cost associated with being “between” hypotheses, a region that must be crossed before hypothesis switching can take place. We contextualize these quantitative results with qualitative trajectories plotted in Figure 4 for a 2-class problem with 10% misclassification and 10% additional odometry error.
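The noise injection used in this kind of experiment can be sketched as follows. This is a hedged illustration rather than the evaluation code: the even split of the misclassification rate over the off-diagonal entries, the choice to scale the noise standard deviation (rather than the covariance), the baseline sigmas, and all function names are assumptions.

```python
import random

def confusion_matrix(num_classes, error_rate):
    """Rows are true classes: 1 - error_rate on the diagonal, and the
    remaining mass split evenly over the other classes (assumed)."""
    off = error_rate / (num_classes - 1)
    return [[1.0 - error_rate if i == j else off for j in range(num_classes)]
            for i in range(num_classes)]

def true_class(tag_id, num_classes):
    """Artificial semantic label: tag ID modulo the number of classes."""
    return tag_id % num_classes

def sample_detection(tag_id, num_classes, error_rate, rng):
    """Sample a (possibly wrong) detected class for a tag."""
    row = confusion_matrix(num_classes, error_rate)[true_class(tag_id, num_classes)]
    return rng.choices(range(num_classes), weights=row, k=1)[0]

def noisy_odometry(dx, dy, dyaw, base_sigmas, scale, rng):
    """Add zero-mean Gaussian noise with std = scale * baseline sigma
    to an odometry increment expressed in the robot frame."""
    sx, sy, syaw = base_sigmas
    return (dx + rng.gauss(0.0, scale * sx),
            dy + rng.gauss(0.0, scale * sy),
            dyaw + rng.gauss(0.0, scale * syaw))

# Hypothetical usage with a fixed seed so that runs are repeatable.
rng = random.Random(0)
detections = [sample_detection(tag, 2, 0.1, rng) for tag in range(10)]
step = noisy_odometry(1.0, 0.0, 0.05, (0.05, 0.05, 0.01), 2.0, rng)
print(detections, step)
```

Seeding the generator is what makes the added noise repeatable across the compared methods.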
Table I: Translation error
| Method | Max Error | Mean Error | Median Error | RMSE |
| MM + NH | 19.88 | 6.48 | 6.44 | 7.70 |
Table II: Rotation error
| Method | Max Error | Mean Error | Median Error | RMSE |
| MM + NH | 0.58 | 0.043 | 0.037 | 0.053 |
V-B KITTI Dataset
We also evaluate our approach on stereo camera data from the KITTI dataset odometry sequence 5 [8]. In our experiments, we use the MobileNet-SSD object detector [13, 12, 17], from which detections were obtained at approximately 10 Hz. We threshold the confidence of the detector at 0.8, using detections of cars as landmarks. We use VISO2 stereo odometry [9] for visual odometry. We estimate the range and bearing to cars as the average range and bearing to all points tracked by VISO2 that project into the bounding box for a given car detection. Despite this very noisy landmark signal, we show qualitatively in Figure 5 that all of the probabilistic data association methods successfully recover reasonable trajectory estimates, while the maximum-likelihood approach fails catastrophically due to incorrect loop closures. These results are corroborated by the translation and rotation errors, summarized in Tables I and II, respectively.
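The range and bearing estimate described above can be sketched as a simple average over the tracked points that fall inside a detection box. The data layout, the camera-frame convention (x right, y down, z forward), and the function name are assumptions for illustration; they are not the interface of VISO2 or of the detector.

```python
import math

def range_bearing_from_points(points, bbox):
    """Average range and bearing of tracked 3D points inside a bounding box.

    points: iterable of (u, v, x, y, z), where (u, v) is the pixel location
            and (x, y, z) is the point in the camera frame
            (assumed x right, y down, z forward).
    bbox:   (u_min, v_min, u_max, v_max) in pixels.
    Returns (range, bearing), or None when no tracked point lies in the box.
    """
    u_min, v_min, u_max, v_max = bbox
    ranges, bearings = [], []
    for u, v, x, y, z in points:
        if u_min <= u <= u_max and v_min <= v <= v_max:
            ranges.append(math.sqrt(x * x + y * y + z * z))
            bearings.append(math.atan2(x, z))
    if not ranges:
        return None
    return sum(ranges) / len(ranges), sum(bearings) / len(bearings)

# Hypothetical example: two tracked points inside the box, one outside.
tracked = [
    (100, 100, 0.0, 0.0, 10.0),
    (110, 105, 0.0, 0.0, 12.0),
    (500, 500, 5.0, 0.0, 5.0),
]
print(range_bearing_from_points(tracked, (90, 90, 120, 120)))
```

Background points that happen to project into the box are averaged in as well, which is one reason such a landmark signal is noisy.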
VI Conclusion and Future Work
We have proposed an approach to semantic SLAM with probabilistic data association based on approximate max-marginalization of data associations. This led to a “max-mixture”-type approach to factor graph SLAM, amenable to nonlinear least-squares optimization. We have shown how mixture component weights can be computed from joint semantic and geometric measurements, and how null-hypothesis associations can be incorporated to reject bad loop closures. We evaluated the proposed approach on real stereo image data with known data associations using AprilTags [31] under a variety of simulated odometry and detection noise models, as well as on stereo image data from the KITTI dataset using noisy detections of cars as landmarks. We have shown that our approach is competitive with recent expectation-maximization methods for data association in semantic SLAM and drastically outperforms the common maximum-likelihood approach, particularly with use of null-hypothesis data association, while being similarly easy to implement within existing factor graph optimization frameworks. The particular benefits of the null-hypothesis in the proposed framework suggest that the ability to reject incorrect loop closures is a necessity for semantic SLAM systems relying on object detectors, and this work is one step toward unifying existing literature in robust SLAM with recent work on data association for object-level/semantic SLAM.
In this work we used classifications from object detectors to disambiguate data associations, but alternative semantic descriptors can be modeled and incorporated similarly, for example using neural network-based feature matching techniques [30, 33, 11]. Additionally, our experience simulating semantic SLAM with AprilTags suggested that orientation can provide very useful cues for data association. While we focused on approaches that give limited geometric information about objects, methods like [22] that infer the full pose of objects may greatly improve accuracy and robustness of data association.
[1] (1975) Tracking in a cluttered environment with probabilistic data association. Automatica 11 (5), pp. 451–460. Cited by: §II.
[2] (2017) Probabilistic data association for semantic SLAM. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 1722–1729. Cited by: §I, §II, §III, §IV-C, §IV, §V.
[3] (1991) Probabilistic data association for dynamic world modeling: a multiple hypothesis approach. In Advanced Robotics, 1991. ‘Robots in Unstructured Environments’, 91 ICAR., Fifth International Conference on, pp. 1287–1294. Cited by: §II.
[4] (1994) Modeling a dynamic environment using a Bayesian multiple hypothesis approach. Artificial Intelligence 66 (2), pp. 311–344. Cited by: §II.
[5] (2012) Factor graphs and GTSAM: a hands-on introduction. Technical report, Georgia Institute of Technology. Cited by: §V.
[6] (2019) Multimodal semantic SLAM with probabilistic data association. In 2019 IEEE International Conference on Robotics and Automation (ICRA). Cited by: §I, §II, §IV.
[7] (2013) Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR). Cited by: §V.
[8] (2012) Are we ready for autonomous driving? The KITTI Vision Benchmark Suite. In , Cited by: §I, §V-B, §V.
[9] (2011-06) StereoScan: dense 3D reconstruction in real-time. In IEEE Intelligent Vehicles Symposium, Baden-Baden, Germany. Cited by: §V-B.
[10] (2017) Evo: Python package for the evaluation of odometry and SLAM. Note: https://github.com/MichaelGrupp/evo Cited by: §V.
[11] (2015) MatchNet: unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3279–3286. Cited by: §VI.
[12] MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §V-B.
[13] (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7310–7311. Cited by: §V-B.
[14] (2009) Covariance recovery from a square root information matrix for data association. Robotics and Autonomous Systems 57 (12), pp. 1198–1210. Cited by: §IV-B.
[15] (2012) iSAM2: incremental smoothing and mapping using the Bayes tree. The International Journal of Robotics Research 31 (2), pp. 216–235. Cited by: §V.
[16] (2009) Probabilistic graphical models: principles and techniques. MIT Press. Cited by: §III.
[17] (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Cited by: §V-B.
[18] (2015) Visual place recognition: a survey. IEEE Transactions on Robotics 32 (1), pp. 1–19. Cited by: §II.
[19] (2019-08) Long-duration fully autonomous operation of rotorcraft unmanned aerial systems for remote-sensing data acquisition. Journal of Field Robotics, pp. arXiv:1908.06381. Cited by: §V-A.
[20] (2002) FastSLAM: a factored solution to the simultaneous localization and mapping problem. In Proc. of the AAAI National Conference on Artificial Intelligence, 2002. Cited by: §II.
[21] (2016) SLAM with objects using a nonparametric pose graph. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pp. 4602–4609. Cited by: §II.
[22] (2018) QuadricSLAM: constrained dual quadrics from object detections as landmarks in semantic SLAM. IEEE Robotics and Automation Letters (RA-L). Cited by: §VI.
[23] (2013) Inference on networks of mixtures for robust robot mapping. The International Journal of Robotics Research 32 (7), pp. 826–840. Cited by: §I, §II, §IV.
[24] (2009) ROS: an open-source robot operating system. In ICRA Workshop on Open Source Software. Cited by: §V.
[25] (1979) An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control 24 (6), pp. 843–854. Cited by: §II.
[26] (2013) SLAM++: simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1352–1359. Cited by: §II.
[27] (2014) Hybrid inference optimization for robust pose graph estimation. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2675–2682. Cited by: §II.
[28] StereoLabs ZED camera. Cited by: §V-A, §V.
[29] (2012) Switchable constraints for robust pose graph SLAM. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 1879–1884. Cited by: §II.
[30] L2-Net: deep learning of discriminative patch descriptor in Euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 661–669. Cited by: §VI.
[31] (2016-10) AprilTag 2: efficient and robust fiducial detection. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4193–4198. Cited by: §V-A, §VI.
[32] (2018) CubeSLAM: monocular 3D object detection and SLAM without prior models. arXiv preprint arXiv:1806.00557. Cited by: §II.
[33] (2016) LIFT: learned invariant feature transform. In European Conference on Computer Vision, pp. 467–483. Cited by: §VI.