I Introduction
The ability to build and use a map of discrete environmental landmarks to navigate is one of the greatest strengths of the landmark-based simultaneous localization and mapping (SLAM) paradigm, but it hinges critically on reliable recognition of previously mapped landmarks, i.e. data association. Consistent data association over long periods of time is vital if we aim to achieve robust robot navigation in the operational limit as “time goes to infinity.” Unfortunately, long-term data association is substantially more challenging than short-term data association, since uncertainty in robot pose may grow large enough that many previously observed landmarks become reasonable, albeit erroneous, candidates for a loop closure. For this reason, any mechanism by which we can associate landmarks uniquely is of interest.
Recent advances in the capabilities and reliability of deep neural networks for object detection and feature extraction have motivated the use of semantics both to distinguish landmarks, taken to be objects in the environment, and to infer over time the correct semantic class of each landmark, i.e. semantic SLAM. However, no detection system can be expected to have perfect accuracy. Rather than build navigation systems that depend on perfect detection and classification, we aim to develop methods that take into consideration the error characterization of perception systems like neural network-based object detectors.
Robustness to misclassification and pose uncertainty requires the abilities to represent and resolve ambiguity in data association, and to reject incorrect loop closures. Traditional methods represent the problem of finding the correct set of hypotheses as a tree-search problem, where each node of the tree represents an association decision. Mitigating the complexity of search requires careful pruning of plausible hypotheses. Rather than explicitly search over association hypotheses, in this work we marginalize out data association variables at each point in time, allowing associations to arise implicitly from the inference of pose and landmark values.
The main contribution of this work is an approximate max-marginalization procedure for data associations which provides theoretical grounding for a “max-mixture”-type factor [23] within the context of semantic SLAM (shown in Figure 1), thus taking steps toward a unifying perspective on previous work in “robust SLAM” and recent work on data association for semantic SLAM, e.g. [2]. Approximate max-marginalization eliminates data association variables in a way that preserves standard Gaussian distribution assumptions in SLAM in what otherwise becomes a non-Gaussian inference problem [6]. Our representation makes use of error characterization of an object detector, taking into consideration uncertainty to fuse detection information with geometric information from other sensors, like stereo cameras or lidar. Lastly, the proposed method incorporates loop closure rejection via a null-hypothesis association, which we experimentally find to be critical in providing robustness to odometry noise and misclassification in the semantic SLAM problem.
The remainder of this paper proceeds as follows. In Section II we discuss related works on the topics of data association, robust SLAM, and semantic SLAM. We describe the problem of semantic SLAM with unknown data association in Section III, where we outline our approach to data association at a high level. In Section IV we describe in detail the max-marginalization procedure and data association weight computation, and define the “semantic max-mixture factor” used for optimization, including the representation of null-hypothesis data association for loop closure rejection. Finally, experimental results demonstrating the robustness of the proposed approach to odometry noise and misclassification during indoor semantic navigation and results from the KITTI dataset [8] are provided in Section V.
II Related Work
The proposed work intersects the topics of data association, robust SLAM, and semantic SLAM. Classical work on data association stems from target-tracking literature, where probabilistic data association (PDA) [25] and multi-hypothesis tracking (MHT) [1] were introduced. Subsequently these methods were applied to early filtering-based SLAM solutions making Gaussian noise assumptions [4, 3]. FastSLAM [20] later introduced a particle filtering-based approach to the non-Gaussian inference problem of data association; a data association sampler was introduced that serves as an alternative to explicit search over associations.
Later work in the area of pose-graph optimization focused on themes of multi-hypothesis SLAM and outlier rejection, with the introduction of methods like switchable constraints [29], max-mixtures [23], and junction tree inference methods [27]. These works consider mitigating the effects of perceptual aliasing, often in the context of laser scan matching or appearance-based loop closure (see, for example, [18]), whereas in semantic SLAM we also want to locate and classify discrete objects in the environment. Nonetheless, as we demonstrate in this work, landmark-based semantic SLAM shares similar challenges. Specifically, in this work we take an approximate Bayesian inference perspective to the semantic SLAM problem and arrive at a specific case of the max-mixtures method [23] where component weights are directly computed as probabilities of candidate associations.
In the area of semantic SLAM, classifications from object detectors have been used to aid data association [2, 6, 32, 26, 21]. While many such works consider maximum-likelihood data association [32, 26], recent works have considered probabilistic data association making use of expectation-maximization [2], or alternate between sampling data associations and recomputing SLAM solutions [21]. In contrast to [2], we model data associations as a mixture, rather than averaging solutions for different associations. We also marginalize out poses and landmarks when computing data association probabilities, whereas in [2], point estimates of robot poses and landmarks are used to compute data association weights. In both [2] and [21], convergence requires iteratively recomputing data associations and performing factor graph optimization; we aim to avoid the complexity associated with this recomputation. In previous work [6], we addressed this problem for the general non-Gaussian SLAM case using nonparametric belief propagation. While that approach gives rich uncertainties representing data association ambiguity, in this work our goal is maximum a posteriori inference specifically in the nonlinear Gaussian case.
III Semantic SLAM with Unknown Data Association
We define the semantic SLAM problem to be the inference of vehicle poses $X$ and landmark states $L$, given a set of measurements $Z$. This corresponds to the following maximum a posteriori (MAP) inference problem:
$$\hat{X}, \hat{L} = \operatorname*{argmax}_{X, L} \; p(X, L \mid Z) \tag{1}$$
We consider the vehicle state space $\mathrm{SE}(3)$ throughout, while we consider landmark states containing both geometric and semantic components, i.e. $\ell_j \in \mathbb{R}^3 \times \mathcal{C}$ (for landmarks with only a positional component) or $\ell_j \in \mathrm{SE}(3) \times \mathcal{C}$ (for 6 degree-of-freedom landmark pose estimation), where $\mathcal{C}$ is a fixed, a priori known set of discrete semantic classes. When necessary, we denote the separate pose/positional components and semantic components of a landmark as $\ell_j^p$ and $\ell_j^s$, respectively. We assume multiple measurements can be made at each point in time, such that $z_t^k$ is the $k$-th measurement made at time $t$. Lastly, we consider the measurement space as having jointly geometric and semantic components, e.g. range-bearing measurements with semantic class in $\mathcal{C}$ or 6DoF pose measurements in $\mathrm{SE}(3)$ with semantic class in $\mathcal{C}$.
When associations between measurements and landmarks are not known, they must be inferred. Specifically, we take $D$ to be the set of associations of measurements at all points in time to landmarks, such that $d_t^k = j$ indicates that the $k$-th measurement taken at time $t$ corresponds to landmark $\ell_j$. The most common approach to SLAM with unknown data association is that of maximum likelihood, in which the most probable set of data associations is computed and fixed, then used to solve for the most probable robot poses and landmark states. This approach can be brittle, as a single incorrect association can move the optimal solution for robot poses and landmark states far from their true values.
In order to mitigate the effects of data association errors one may consider probabilistic data association, in which multiple associations for a measurement are given consideration commensurate with their probability. Generally this corresponds to marginalization of the data association variable $D$, i.e.
$$\hat{X}, \hat{L} = \operatorname*{argmax}_{X, L} \; \mathbb{E}_{D}\left[ p(X, L \mid D, Z) \mid Z \right] = \operatorname*{argmax}_{X, L} \; \sum_{D} p(X, L \mid D, Z)\, p(D \mid Z) \tag{2}$$
Common assumptions of additive Gaussian measurement noise make $p(X, L \mid D, Z)$ Gaussian, such that the resulting belief is a sum of Gaussians, which generally falls outside the realm of traditional nonlinear least-squares optimization approaches to SLAM.
A recent approach to semantic SLAM [2] preserves the Gaussian nature of the problem by replacing the above expectation with one over , leading to an expectation-maximization algorithm for optimization. Convergence requires recomputation of the association weights , and prior to convergence the solution will lie somewhere between those obtained given fixed associations.
We propose an alternative solution to the MAP inference problem in which the “sum-marginal” above is replaced by the “max-marginal”:
$$\hat{X}, \hat{L} = \operatorname*{argmax}_{X, L} \; \max_{D} \; p(X, L \mid D, Z)\, p(D \mid Z) \tag{3}$$
Each component of the max-marginal is a weighted Gaussian, while the $\max$ operator simply acts to switch to the “best” data associations for any given point in the latent space of $X$ and $L$. The optimal solution for the max-marginal is identical to the MAP solution for the true posterior in (2) [16]. Exact computation of the true max-marginal is generally intractable due to the combinatorial number of plausible data associations, but as we will show, several reasonable approximations make the max-marginal a practical method for dealing with data association ambiguity.
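As a purely illustrative sketch of the difference between the sum-marginal and the max-marginal, the following Python snippet evaluates both for a one-dimensional toy problem with two candidate associations. The component weights, means, and standard deviations are hypothetical values chosen for illustration; they are not taken from the experiments in this paper.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a 1D Gaussian N(mean, sigma^2) evaluated at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def sum_marginal(x, components):
    """Sum over hypotheses of weight * Gaussian density (a sum of Gaussians)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


def max_marginal(x, components):
    """Largest weighted Gaussian component at x and the index attaining it."""
    values = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    best = max(range(len(values)), key=values.__getitem__)
    return values[best], best


# Hypothetical example: two candidate associations as (weight, mean, sigma).
components = [(0.7, 0.0, 1.0), (0.3, 5.0, 1.0)]
for x in (0.0, 2.0, 4.0, 5.0):
    value, winner = max_marginal(x, components)
    print(x, sum_marginal(x, components), value, winner)
```

At every point the max-marginal is a single weighted Gaussian, so its negative logarithm is a quadratic cost plus a constant, whereas the sum-marginal is not. The index returned by `max_marginal` changes as `x` moves from one mode to the other, which is the hypothesis "switching" behavior described above.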
IV Max-Mixture Semantic SLAM
We consider a standard Gaussian SLAM framework with an odometry model that is Gaussian with covariance with respect to the relative transform from to , denoted , and a Gaussian geometric measurement model with covariance , with respect to the nonlinear function (in this work, this function is taken to be a relative transform in the case of full landmark pose measurements, or bearing, elevation, and range when only landmark position information is available). Semantic measurements with model are assumed independent of the pose from which a landmark was observed, as well as its position, and as in [6, 2] they are taken as samples from a categorical distribution with probability vector defined by a classifier confusion matrix (assumed to be known a priori). Lastly, we assume the geometric and semantic measurements factor as .
The unnormalized posterior can be written in the following general factor graph formulation:
(4) 
where each factor is in correspondence with one of the relevant (odometric or landmark) measurement models. From the measurement models, there is a clear partition of geometric information from the odometry and geometric landmark measurement models (which depend on both the robot poses and landmark locations) and the semantic information, which depends only on the class of the associated landmark. Since we do not know data associations, we instead infer them from data. At a high level, our approach is to apply variable elimination to data associations to produce an equivalent factor graph with data associations marginalized out. The proposed max-mixture semantic SLAM approach approximates optimization over the max-marginal in (3). In particular, we introduce a proactive max-marginalization procedure for computing data association weights. For associations to previous landmarks, as well as for the null-hypothesis case, the max-marginal over candidate associations is represented as a factor (in the factor graph SLAM framework) taking on the form of a “max-mixture” [23]. We term these factors semantic max-mixture factors. The addition of null-hypothesis data association enables the rejection of incorrect loop closures. The resulting factor graph is amenable to optimization using standard nonlinear least-squares techniques, from which the optimal robot and landmark states can be recovered.
IV-A Proactive Max-Marginalization
Exact computation of the max-marginal over all possible data associations in Equation (3) is computationally expensive due to the combinatorial growth in the size of the set of possible data associations over time. Marginalizing out data associations proactively, i.e. as new measurements are made, and ignoring the influence of future measurements on association probabilities allows us to mitigate the complexity of full max-marginalization.
In particular, suppose we have some set of previous measurements and new measurements . We aim to compute the max-marginal over associations to the new measurements, denoted . Formally, we have the following:
(5) 
from Bayes’ rule, where we have used the conditional independences , and , since consists of associations to only measurements outside of . Applying max-marginalization to data associations, we obtain:
(6) 
Here is the (potentially non-Gaussian) posterior distribution over poses and landmarks after sum-marginalization of data associations to the measurements . For the purposes of optimization in the Gaussian case, we take this as the max-marginal .
The consequences of this simple change are significant: evaluating the operators in the above expression no longer requires examination of previous associations and can be done in linear time for the most recent measurement. The result is that we have arrived at an approximate max-product algorithm for SLAM with unknown data associations.


IV-B Data Association Weight Computation
Consider a single measurement $z$ of a landmark $\ell_j$, which in our case consists of the joint geometric and semantic measurement $z = (z^g, z^s)$ of the landmark. We assume the data association probability $p(d = j \mid Z)$ is proportional to the likelihood $p(z \mid d = j)$ with poses and landmarks marginalized out (see Figure 1(a)). From the factored measurement model assumption, the likelihood $p(z \mid d = j)$ can be broken into the product of separate semantic and geometric likelihoods:
$$p(z \mid d = j) = p(z^s \mid d = j)\, p(z^g \mid d = j) \tag{7}$$
Each term on the right-hand side can be expanded as follows into the summation over landmark classes:
$$p(z^s \mid d = j) = \sum_{c \in \mathcal{C}} p(z^s \mid \ell_j^s = c)\, p(\ell_j^s = c) \tag{8}$$
and integral over robot pose and landmark location:
$$p(z^g \mid d = j) = \iint p(z^g \mid x, \ell_j^p)\, p(x, \ell_j^p)\, dx\, d\ell_j^p \tag{9}$$
With data associations marginalized out, the belief $p(x, \ell_j^p)$ would be generally non-Gaussian. Consequently, we again make an approximation and use the single Gaussian component corresponding to the max-marginal evaluated at the current estimate of $X$ and $L$, which we denote $\hat{p}(x, \ell_j^p)$:
$$p(z^g \mid d = j) \approx \iint p(z^g \mid x, \ell_j^p)\, \hat{p}(x, \ell_j^p)\, dx\, d\ell_j^p \tag{10}$$
Since all of the terms in the integral are now Gaussian, it can be simplified as follows, based on the method of [14]:
$$p(z^g \mid d = j) \approx \mathcal{N}\left(z^g;\, h(\mu),\, R\right) \tag{11}$$
where $\mu$ is the mean of the joint distribution over $x$ and $\ell_j^p$, and $h$ is the measurement function. The covariance $R$ is defined as:
$$R = H \Sigma H^\top + \Gamma \tag{12}$$
where $\Sigma$ is the block joint covariance matrix between pose $x$ and candidate landmark $\ell_j$, $H$ is the Jacobian of the measurement function, and $\Gamma$ is the covariance of the geometric measurement model. This result, combined with the expression in (8), gives the marginal likelihood in (7) that we normalize to compute data association probabilities.
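The weight computation described above can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions, not the implementation used in our experiments: the geometric measurement is taken to be scalar, so that the innovation covariance (assumed here to have the standard form of the Jacobian-projected joint covariance plus measurement noise) reduces to a single variance, and the dictionary keys `class_belief`, `residual`, `jacobian`, and `joint_cov` are hypothetical names introduced for illustration.

```python
import math


def semantic_likelihood(observed_class, class_belief, confusion):
    """Sum over landmark classes c of p(observed | c) * p(c).

    confusion[c][k] is the probability that a landmark of true class c
    is detected as class k (each row sums to one).
    """
    return sum(confusion[c][observed_class] * p_c
               for c, p_c in enumerate(class_belief))


def geometric_likelihood(residual, jacobian, joint_cov, meas_var):
    """Gaussian likelihood of a scalar residual z - h(mu).

    The variance is H * Sigma * H^T + gamma, with H given as a row vector
    (assumed scalar-measurement case).
    """
    n = len(jacobian)
    projected = sum(jacobian[i] * joint_cov[i][j] * jacobian[j]
                    for i in range(n) for j in range(n))
    var = projected + meas_var
    return math.exp(-0.5 * residual ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def association_weights(candidates, observed_class, confusion, meas_var):
    """Normalize the product of semantic and geometric likelihoods."""
    scores = [
        semantic_likelihood(observed_class, c["class_belief"], confusion)
        * geometric_likelihood(c["residual"], c["jacobian"], c["joint_cov"], meas_var)
        for c in candidates
    ]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical two-class example with two candidate landmarks.
confusion = [[0.9, 0.1], [0.1, 0.9]]
joint_cov = [[0.2, 0.05], [0.05, 0.3]]
candidates = [
    {"class_belief": [0.8, 0.2], "residual": 0.5, "jacobian": [-1.0, 1.0], "joint_cov": joint_cov},
    {"class_belief": [0.2, 0.8], "residual": 0.6, "jacobian": [-1.0, 1.0], "joint_cov": joint_cov},
]
weights = association_weights(candidates, 0, confusion, 0.1)
print(weights)
```

In this example the first candidate both agrees better with the detected class and has the smaller residual, so it receives the larger weight. For vector-valued measurements the scalar variance would be replaced by the full innovation covariance and a Mahalanobis distance.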
IV-C Semantic Max-Mixture Factor
Assuming uniform priors on data associations, the distribution is proportional to the marginal likelihood in (7) and can simply be normalized over all assignments to . This results in a set of candidate landmark hypotheses, max-marginalization of which produces a max-mixture factor for a measurement :
(13) 
The max-marginalization step is visualized in Figure 1(b), where we have eliminated the data association variable from the inference process. By augmenting the candidate set to be , we allow a null-hypothesis data association to be made. In practice, we assume a probability for the null hypothesis and normalize the remaining data associations such that the total probability of the augmented hypothesis set equals 1. The null-hypothesis component is assumed to be Gaussian with large standard deviation (e.g. ).
Finally, we can recover maximum a posteriori landmark semantic class estimates (assuming uniform priors) as in [2] as follows:
(14) 
which are recovered using the data association probabilities stored as the component weights of the max-mixture factors.
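To make the role of the null hypothesis concrete, the sketch below shows one way the weight renormalization and the evaluation of a max-mixture cost could look for scalar residuals. It is an illustration under stated assumptions rather than the implementation used in this work: the null-hypothesis standard deviation of 100.0 and the choice to give the null component the same residual as the first candidate are hypothetical.

```python
import math


def add_null_hypothesis(weights, null_weight):
    """Append a null hypothesis and rescale the rest so all weights sum to one."""
    total = sum(weights)
    scaled = [(1.0 - null_weight) * w / total for w in weights]
    return scaled + [null_weight]


def max_mixture_cost(residuals, sigmas, weights):
    """Negative log of max_j w_j * N(r_j; 0, sigma_j^2) and the selected index."""
    costs = [
        0.5 * (r / s) ** 2 + math.log(s * math.sqrt(2.0 * math.pi)) - math.log(w)
        for r, s, w in zip(residuals, sigmas, weights)
    ]
    best = min(range(len(costs)), key=costs.__getitem__)
    return costs[best], best


# Two candidate landmarks plus a null hypothesis with weight 0.1.
weights = add_null_hypothesis([0.6, 0.4], 0.1)
sigmas = [0.1, 0.1, 100.0]  # last entry: assumed large null-hypothesis sigma

# Measurement close to candidate 0: the candidate component is selected.
near_cost, near_index = max_mixture_cost([0.05, 4.0, 0.05], sigmas, weights)
# Measurement far from every candidate: the null hypothesis is selected.
far_cost, far_index = max_mixture_cost([3.0, 4.0, 3.0], sigmas, weights)
print(weights, near_index, far_index)
```

Because the selected component is always a single Gaussian, the cost passed to the optimizer remains a weighted least-squares term. When every candidate residual is large, the broad null component has the lowest cost and contributes almost no gradient, which is how an incorrect loop closure is effectively rejected.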
V Experimental Results
All computational experiments were implemented in C++ using the Robot Operating System (ROS) [24] and the implementation of iSAM2 [15] within the GTSAM [5] library for optimization and covariance recovery. We demonstrate our approach on 3D visual SLAM tasks using data collected during indoor navigation with an MIT RACECAR vehicle (https://mitracecar.github.io/) equipped only with a ZED stereo camera [28], as well as with stereo image data from the KITTI dataset [8, 7]. Experiments were run on a single core of a 2.2 GHz Intel i7 CPU. We use evo [10] for trajectory evaluation. We provide trajectory comparisons for all of the methods tested without landmarks visualized, but more detailed visualizations and videos can be found on the project page: https://github.com/MarineRoboticsGroup/mixtures_semantic_slam.
In both experiments, we compared two variants of the proposed method, semantic max-mixtures (MM) and semantic max-mixtures with null-hypothesis data association (MM+NH), to a known data association (Known DA) baseline, naïve maximum-likelihood (ML) data association (which makes a single association with the landmark maximizing Eq. (7)), and an expectation-maximization approach similar to that of [2], here referred to as Gaussian probabilistic data association (GPDA). We use a threshold on the marginal likelihood in (7) to determine new landmarks, as well as to produce the set of landmark candidates. This is similar to standard Mahalanobis distance-based thresholds, but also considers the semantic likelihood. We use a $\chi^2$ test with confidence 0.9 and a null-hypothesis weight of 0.1.
V-A MIT RACECAR Dataset
We collected roughly 25 minutes of data during indoor navigation with the MIT RACECAR mobile robot platform over a roughly 1.08 km trajectory. We sampled AprilTag [31, 19] detection keyframes at a rate of 1 Hz, resulting in 702 observations of 262 unique tags. Odometry was obtained using the ZED stereo camera visual odometry [28]. The use of AprilTags uniquely allows us to obtain a baseline “ground-truth” solution with known data associations. We artificially assigned semantic labels to each AprilTag by considering the true tag ID modulo $C$ for a $C$-class semantic SLAM problem. In the experiments presented in this paper, we use classes, as we found it to be one of the most challenging situations. (In general, with a “good” detector, the presence of many unique semantic classes among landmarks makes the data association problem easier.) While AprilTags give generally accurate orientation information, we typically cannot expect this of neural network-based object detectors. For this reason, we set a large standard error on roll, pitch, and yaw of AprilTag detections to prevent orientation information from giving any substantial data association cues. Furthermore, this experimental setup allows us to apply classification error and simulate additional odometry noise in a repeatable way to study the trade-off in data association performance with noise in classification and odometry.
In Figure 3, we provide boxplots summarizing statistical results of trajectory error on the MIT RACECAR dataset. Specifically, we considered robustness to simulated additional odometry noise and detector misclassification. We calibrated an initial odometry model, but in testing we add simulated Gaussian noise of , , and yaw (we take the baseline simulated noise model as , , and for yaw, in robot frame), multiplied by a scale factor varying from 0 to 10. We simulate classification error for a misclassification rate (from 0% to 50%) by sampling from a semantic measurement model with confusion matrix equal to on the diagonal and on the off-diagonal. We find that while errors in all methods increase with added noise in odometry and misclassification, all of the probabilistic methods generally outperform maximum-likelihood. Most significantly, we find that the addition of the null hypothesis to our approach drastically reduces error in all tests. Beyond the ability of the null-hypothesis method to reject bad loop closures, the addition of the null hypothesis may help prevent the max-mixtures approach from becoming “stuck” in a local optimum by decreasing the high cost associated with being “between” hypotheses, a region that must be crossed before hypothesis switching can take place. We contextualize these quantitative results with qualitative trajectories plotted in Figure 4 for a 2-class problem with 10% misclassification and 10% additional odometry error.
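The label assignment and misclassification simulation described above can be sketched as follows. This is an illustrative sketch only: the specific confusion matrix structure used here (one minus the error rate on the diagonal, with the error rate split evenly across the off-diagonal entries) is an assumption, and the function names are hypothetical.

```python
import random


def tag_class(tag_id, num_classes):
    """Assign a semantic class to a tag as its ID modulo the number of classes."""
    return tag_id % num_classes


def make_confusion(num_classes, error_rate):
    """Confusion matrix with (1 - error_rate) on the diagonal.

    Assumption: the error mass is spread uniformly over the other classes.
    Requires num_classes >= 2.
    """
    off = error_rate / (num_classes - 1)
    return [[1.0 - error_rate if i == j else off for j in range(num_classes)]
            for i in range(num_classes)]


def sample_detection(true_class, confusion, rng):
    """Sample a detected class from the row of the true class."""
    classes = range(len(confusion))
    return rng.choices(classes, weights=confusion[true_class])[0]


rng = random.Random(0)  # fixed seed so that simulated noise is repeatable
confusion = make_confusion(2, 0.1)
detections = [sample_detection(tag_class(tag_id, 2), confusion, rng)
              for tag_id in range(10)]
print(detections)
```

Seeding the generator is what makes the simulated classification error repeatable across the methods being compared.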
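Table I: Translation error on KITTI odometry sequence 5.

| Method | Max Error | Mean Error | Median Error | RMSE |
| --- | --- | --- | --- | --- |
| ML | 126.46 | 52.78 | 59.66 | 62.35 |
| GPDA | 30.76 | 10.52 | 8.90 | 12.11 |
| MM | 23.23 | 9.31 | 8.31 | 11.37 |
| MM + NH | 19.88 | 6.48 | 6.44 | 7.70 |

Table II: Rotation error on KITTI odometry sequence 5.

| Method | Max Error | Mean Error | Median Error | RMSE |
| --- | --- | --- | --- | --- |
| ML | 0.42 | 0.15 | 0.11 | 0.19 |
| GPDA | 0.51 | 0.06 | 0.052 | 0.069 |
| MM | 0.54 | 0.055 | 0.049 | 0.065 |
| MM + NH | 0.58 | 0.043 | 0.037 | 0.053 |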
V-B KITTI Dataset
We also evaluate our approach on stereo camera data from the KITTI dataset odometry sequence 5 [8]. In our experiments, we use the MobileNet-SSD object detector [13, 12, 17], from which detections were obtained at approximately 10 Hz. We threshold the confidence of the detector at 0.8, using detections of cars as landmarks. We use VISO2 stereo odometry for visual odometry [9]. We estimate the range and bearing to cars as the average range and bearing to all points tracked by VISO2 that project into the bounding box for a given car detection. Despite this very noisy landmark signal, we show qualitatively in Figure 5 that all of the probabilistic data association methods successfully recover reasonable trajectory estimates, while the maximum-likelihood approach fails catastrophically due to incorrect loop closures. These results are corroborated by the translation and rotation errors, summarized in Tables I and II, respectively.
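The range and bearing estimate described above can be sketched as a simple average over tracked points that fall inside a detection's bounding box. This is a hypothetical illustration, not the code used for these experiments: the point format `(u, v, x, y, z)`, the bounding box format, and the camera frame convention (x right, y down, z forward) are all assumptions.

```python
import math


def range_bearing_from_points(points, bbox):
    """Average range and bearing of tracked points inside a bounding box.

    points: iterable of (u, v, x, y, z), with pixel coordinates (u, v) and
            3D position (x, y, z) in the camera frame (assumed convention:
            x right, y down, z forward).
    bbox:   (u_min, v_min, u_max, v_max) in pixels.
    Returns (mean_range, mean_bearing), or None if no point is inside.
    """
    u_min, v_min, u_max, v_max = bbox
    ranges, bearings = [], []
    for u, v, x, y, z in points:
        if u_min <= u <= u_max and v_min <= v <= v_max:
            ranges.append(math.sqrt(x * x + y * y + z * z))
            bearings.append(math.atan2(x, z))
    if not ranges:
        return None
    return sum(ranges) / len(ranges), sum(bearings) / len(bearings)


# Hypothetical example: two points inside the box, one outside.
points = [(10, 10, 0.0, 0.0, 2.0), (12, 11, 0.0, 0.0, 4.0), (100, 100, 5.0, 0.0, 5.0)]
estimate = range_bearing_from_points(points, (0, 0, 20, 20))
print(estimate)
```

Averaging bearings directly is reasonable here because all points inside a single bounding box lie within the camera's field of view, so the angles do not wrap around.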
VI Conclusion and Future Work
We have proposed an approach to semantic SLAM with probabilistic data association based on approximate max-marginalization of data associations. This led to a “max-mixture”-type approach to factor graph SLAM, amenable to nonlinear least-squares optimization. We have shown how mixture component weights can be computed from joint semantic and geometric measurements, and how null-hypothesis associations can be incorporated to reject bad loop closures. We evaluated the proposed approach on real stereo image data with known data associations using AprilTags [31] under a variety of simulated odometry and detection noise models, as well as on stereo image data from the KITTI dataset using noisy detections of cars as landmarks. We have shown that our approach is competitive with recent expectation-maximization methods for data association in semantic SLAM and drastically outperforms the common maximum-likelihood approach, particularly with use of null-hypothesis data association, while being similarly easy to implement within existing factor graph optimization frameworks. The particular benefits of the null hypothesis in the proposed framework suggest that the ability to reject incorrect loop closures is a necessity for semantic SLAM systems relying on object detectors, and this work is one step toward unifying existing literature in robust SLAM with recent work on data association for object-level/semantic SLAM.
In this work we used classifications from object detectors to disambiguate data associations, but alternative semantic descriptors can be modeled and incorporated similarly, for example using neural network-based feature matching techniques [30, 33, 11]. Additionally, our experience simulating semantic SLAM with AprilTags suggested that orientation can provide very useful cues for data association. While we focused on approaches that give limited geometric information about objects, methods like [22] that infer the full pose of objects may greatly improve accuracy and robustness of data association.
References
[1] (1975) Tracking in a cluttered environment with probabilistic data association. Automatica 11 (5), pp. 451–460.
[2] (2017) Probabilistic data association for semantic SLAM. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 1722–1729.
[3] (1991) Probabilistic data association for dynamic world modeling: a multiple hypothesis approach. In Advanced Robotics, 1991. ’Robots in Unstructured Environments’, 91 ICAR., Fifth International Conference on, pp. 1287–1294.
[4] (1994) Modeling a dynamic environment using a bayesian multiple hypothesis approach. Artificial Intelligence 66 (2), pp. 311–344.
[5] (2012) Factor graphs and GTSAM: a hands-on introduction. Technical report, Georgia Institute of Technology.
[6] (2019) Multimodal semantic SLAM with probabilistic data association. In 2019 IEEE International Conference on Robotics and Automation (ICRA).
[7] (2013) Vision meets robotics: the kitti dataset. International Journal of Robotics Research (IJRR).
[8] (2012) Are we ready for autonomous driving? the KITTI Vision Benchmark Suite. In Conference on Computer Vision and Pattern Recognition (CVPR).
[9] (2011-06) StereoScan: Dense 3d Reconstruction in Real-time. In IEEE Intelligent Vehicles Symposium, Baden-Baden, Germany.
[10] (2017) Evo: python package for the evaluation of odometry and slam. Note: https://github.com/MichaelGrupp/evo
[11] (2015) Matchnet: unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3279–3286.
[12] (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[13] (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7310–7311.
[14] (2009) Covariance recovery from a square root information matrix for data association. Robotics and autonomous systems 57 (12), pp. 1198–1210.
[15] (2012) ISAM2: incremental smoothing and mapping using the Bayes tree. The International Journal of Robotics Research 31 (2), pp. 216–235.
[16] (2009) Probabilistic graphical models: principles and techniques. MIT press.
[17] (2016) SSD: single shot multibox detector. In European conference on computer vision, pp. 21–37.
[18] (2015) Visual place recognition: a survey. IEEE Transactions on Robotics 32 (1), pp. 1–19.
[19] (2019-08) Long-duration fully autonomous operation of rotorcraft unmanned aerial systems for remote-sensing data acquisition. Journal of Field Robotics, arXiv:1908.06381.
[20] (2002) FastSLAM: a factored solution to the simultaneous localization and mapping problem. In Proc. of the AAAI National Conference on Artificial Intelligence, 2002.
[21] (2016) SLAM with objects using a nonparametric pose graph. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pp. 4602–4609.
[22] (2018) QuadricSLAM: Constrained Dual Quadrics from Object Detections as Landmarks in Semantic SLAM. IEEE Robotics and Automation Letters (RAL).
[23] (2013) Inference on networks of mixtures for robust robot mapping. The International Journal of Robotics Research 32 (7), pp. 826–840.
[24] (2009) ROS: an open-source robot operating system. In ICRA Workshop on Open Source Software.
[25] (1979) An algorithm for tracking multiple targets. IEEE transactions on Automatic Control 24 (6), pp. 843–854.
[26] (2013) SLAM++: simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1352–1359.
[27] (2014) Hybrid inference optimization for robust pose graph estimation. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2675–2682.
[28] StereoLabs ZED camera.
[29] (2012) Switchable constraints for robust pose graph SLAM. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 1879–1884.
[30] (2017) L2net: deep learning of discriminative patch descriptor in euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 661–669.
[31] (2016-10) AprilTag 2: Efficient and robust fiducial detection. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4193–4198. ISBN 9781509037629.
[32] (2018) CubeSLAM: monocular 3D object detection and SLAM without prior models. arXiv preprint arXiv:1806.00557.
[33] (2016) Lift: learned invariant feature transform. In European Conference on Computer Vision, pp. 467–483.