Monocular simultaneous localization and mapping (SLAM) is a classical problem that has been tackled in various forms in the robotics and computer vision communities for more than 15 years. Starting from the seminal work of Davison, impressive results have been obtained in the construction of sparse or semi-dense 3D maps and in visual odometry [2, 11, 4], with a single camera. Given the availability and low price of this kind of sensor, many applications have been developed on top of monocular SLAM systems.
One of the main limits of monocular SLAM systems is that, because of the projective nature of the sensor, the scale of the 3D scene is not observable. This has two important implications: (1) The scale of the camera trajectory and of the reconstructed map is arbitrary and typically depends on choices made during system initialization; (2) As long as no loop closure process is applied to the map and the trajectory (usually with some form of bundle adjustment), the scale error may drift without bound. For example, in Fig. 1, top, the basic version of ORB-SLAM (without loop closure) outputs the green path on one of the KITTI dataset urban video sequences. The ground truth appears in red. The scale drift explains why the internal scale estimate clearly increases during the whole experiment. When loop closure processes are applied, the global scale is made coherent over the map and the trajectory, but again, at an arbitrary value. Since the true scale factor plays a critical role in many applications (mobile robotics, augmented reality, etc.), automatic methods to infer it are important.
Fig. 1: We estimate the scale correction to apply to the camera trajectory and the map (top and middle figures), using Bayesian inference based on a detector of instances of pre-defined object classes (e.g. cars, bottom figure), with prior distributions specified on the height of these objects. The top figure shows, for a KITTI sequence, the trajectory reconstructed by ORB-SLAM (without loop closure) in green, our corrected trajectory in blue, and the ground truth in red.
The main idea of this work is that, based on the semantic content of what a monocular system can perceive, and even if each perceived cue gives uncertain evidence on scale, a robot should be able to infer the global scale of the structures present in the scene. Potential contradictions between cues can be handled efficiently within a Bayesian framework, which allows one to specify and fuse the uncertain knowledge given by each visual cue. Evidence of this inference process in animal visual systems has been exhibited. As an example, when it observes a scene containing cars and houses, the human brain, based on its prior knowledge of the size of typical cars and houses, can infer depths and distances, even though there is a slight possibility that all of these objects are just small objects in a toy world. Based on this idea of using general semantic knowledge (e.g., detections of cars at the bottom of Fig. 1), we build a Bayesian inference system for monocular SLAM scale correction. This system allows us to produce, in the case of Fig. 1, the blue path, much closer to the ground truth than the green one (without correction).
II Related work
Monocular SLAM has been a tool of choice in 3D scene and camera trajectory reconstruction, e.g. in mobile robotics or augmented reality, in particular because monocular systems are widespread and inexpensive. Two categories of online techniques coexist in the literature for this purpose: those that use Bayesian filtering and those that extend the traditional bundle adjustment algorithms to online systems [9, 12]. The latter have attained outstanding results in recent years, at much larger scales than the former. However, all existing methods share two limitations: (i) the scale of the reconstruction is inherently unknown, and (ii) the consecutive reconstructions/pose estimations may introduce scale drift, which makes the global maps or the complete trajectories inconsistent. Most of the classical SLAM systems [3, 11] use ad-hoc elements in the initialization phase, i.e., known objects or known motions, to set the reconstruction scale at the beginning. To limit the scale drift, loop closure techniques reset the scale consistently with its initial value.
An obvious solution to the scale recovery problem is to upgrade the sensors to devices capable of measuring depth (e.g., Kinect) or displacements (e.g., IMU sensors), but this may be costly or simply infeasible. In this work, we focus on using only the semantic content of RGB images, together with uncertain prior information on this content, to infer the global scale of the reconstruction. Previous works in this direction include , where the output of an object recognition system is used in the map/trajectory optimization, and , where object detection was used to simplify the map building process with depth cameras. In both cases, databases of specific object instances were used, whereas our work uses more general object classes.
Closer to our approach,  proposes a scale estimation method that tracks the user's face and uses it as a cue for determining the scale. The method is designed for cell phones equipped with front and rear cameras and would be difficult to extend to more generic monocular systems. In , , and  in a context of monocular vision embedded on cars, the scale is integrated based on the knowledge of the camera height above the ground, and based on local planarity assumptions in the observed scene. Again, our method can be applied in more generic settings, although we evaluate it here in this road navigation context.
In , the approach is similar to ours as it also uses size priors for the detections and is applied to urban scenes. Nevertheless, we do not rely on consecutive object detections (which requires solving data association) and instead get observations from any object detection on which projections of points of the reconstructed 3D cloud lie. Additionally, by relying only on points projected from the 3D map, we limit the inclusion of information coming from dynamic objects. We use a Bayesian formulation that allows us to integrate different elements of previous knowledge, such as a prior on the variations of the scale correction factor.
Finally, our work is also reminiscent of approaches that perform machine learning-based depth inference from the texture of monocular images. Here we combine the strength of recent deep learning detection techniques  with the power and flexibility of Bayesian inference, so as to integrate available prior knowledge in a principled way.
In a first version of this work , we adopted a strategy similar to the one presented here, but this paper introduces three novel contributions:
- the estimated scale correction parameter is now associated with a motion model, and its variations are related to the SLAM system scale drift (see Section IV-A);
- a new, more robust, probabilistic observation model is proposed (see Section IV-B);
- the method is now implemented in a state-of-the-art monocular SLAM system, which allows for improved evaluation.
III Detection-based scale estimation
In this section, we present the core elements of our detection-based scale correction system.
III-A Notations and definitions
From now on, we assume that we run a monocular SLAM algorithm (such as ). We denote the camera calibration matrix by $K$. The camera pose at time $t$ is referred to as $T_t$. 3D points, reconstructed by the SLAM algorithm, are indexed by $i$ and referred to as $p_i^w$ in the world frame and as $p_i^c$ in the camera frame. Points projected on the image frame at time $t$ are noted as $\pi(K p_i^c)$, where $\pi$ is the standard perspective projection function.
As explained hereafter, we rely on a generic object detector that, given an image in our video sequence, outputs a list of detected objects, together with the class they belong to. This detector (see Section V-B) has been trained to detect instances from dozens of classes. In frame $t$, we denote the set of object detections as $D_t$, and the set of sets of detections done at frames $1, \dots, t$ as $D_{1:t}$. Each individual detection is noted as $d \in D_t$. We define two functions $r$ and $c$, such that $r(d)$ is the rectangular region in the image corresponding to the detection, like the rectangles depicted in Fig. 1, and $c(d)$ is the object class of the detection (e.g., “truck”, “car”, “bottle”…). Finally, we introduce a prior height distribution $p_c(h)$, built beforehand for each object class $c$ and defined on $\mathbb{R}^+$.
A system such as ORB-SLAM  maintains a local map $M_t$ of points expressed in the world frame. It is a subset of the global map that contains the set $\mathcal{K}_1$ of points from keyframes that share map points with the current frame, and the set $\mathcal{K}_2$ with neighbors of $\mathcal{K}_1$ in the covisibility graph. The local map is used for tracking purposes and it is optimized via bundle adjustment every time a new keyframe is added. Most other SLAM systems work in a similar way. Our aim in this work is to estimate the scalar $s_t$, which we define as the correction to apply to the local 3D map or to the local trajectory in order to obtain the correct scale of the local map maintained by a visual SLAM system, at time $t$.
III-B Problem statement
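We define the scale correction $s_t$ as a random variable to be estimated at time $t$ such that, for any pair of points $p_i^w, p_j^w$ of the local map, the true metric distance between them is given by $\delta^{*}(p_i^w, p_j^w) = s_t\,\delta_t(p_i^w, p_j^w) + \nu$, where $\delta_t$ is the distance measure in the current reconstruction and $\nu$ is a reconstruction error noise.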
Since the local map is used for tracking, we can recover the camera trajectory with its correct metric scale by estimating the scale correction $s_t$. Given $T_t$, the pose of the camera at time $t$ according to the visual SLAM system, the pose with its correct metric scale can then be computed incrementally by
where builds a similarity from the rigid transformation and the scale factor .
In the following, we develop a Bayesian formulation for the estimation of this local scale correction as the mode of the posterior distribution:
$$\hat{s}_t = \arg\max_{s_t}\; p(s_t \mid D_{1:t}, M_{1:t}),$$
i.e., conditioned on the observation of detected objects $D_{1:t}$ and on the SLAM local reconstructions $M_{1:t}$.
III-C Bayesian framework for estimating the scale correction
As mentioned above, we stress that, because of the scale drift inherent to monocular SLAM systems, the global scale correction varies with time. To estimate it, we use observations from object detections for whose classes we have priors (e.g., priors on car heights in Fig. 1), and we use a dynamical model to cope with potential variations in the internal scale of the SLAM algorithm, i.e., a rough model of the dynamics of scale drift. As we do not have a detailed knowledge of these variations (they probably depend on the internal logic of each SLAM algorithm), we use a simple dynamic model from frame to frame
$$s_t = s_{t-1} + \epsilon_t, \qquad (1)$$
where $s_t$ is the scale correction at frame $t$ and $\epsilon_t \sim \mathcal{N}(0, \sigma_t^2)$. We will explain how to select $\sigma_t$ in Section IV-A.
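In frame $t$, let us suppose that we are capable of getting a set of object detections $D_t$. By applying Bayes' formula, with $s_t$ the scale correction and $M_t$ the local map at frame $t$,
$$p(s_t \mid D_{1:t}, M_{1:t}) \propto p(D_t \mid s_t, M_t) \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid D_{1:t-1}, M_{1:t-1})\, ds_{t-1}. \qquad (2)$$
For the sake of clarity in this derivation, let us first suppose that $D_t$ consists of a single detection $d$ of an object belonging to class $c$, whose height $h$ is unknown and follows the class height prior $p_c(h)$. Marginalizing over $h$ gives
$$p(s_t \mid D_{1:t}, M_{1:t}) \propto \left[\int p(d \mid s_t, h, M_t)\, p_c(h)\, dh\right] \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid D_{1:t-1}, M_{1:t-1})\, ds_{t-1}. \qquad (3)$$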
Through the formula above, we obtain a recursive Bayes filter that updates the scale correction estimate at each new frame by incorporating three terms: (1) a transition probability $p(s_t \mid s_{t-1})$ that models the scale drift in the SLAM algorithm; (2) a likelihood term $p(d \mid s_t, h, M_t)$ that evaluates the probability of having the observed detection $d$, given the current point cloud $M_t$ built by the SLAM algorithm, a possible height $h$ for the detected object of class $c$, and a global scale correction $s_t$; (3) a prior on heights $p_c(h)$, specific to the class $c$ of the detected object.
This means that, at each step, we can update the posterior on the scale correction $s_t$. We implemented the previous inference scheme in two ways: as a discrete Bayes filter and as a Kalman filter. Using one or the other depends mainly on the context and on the nature of the involved distributions. In the first case, we use a histogram representation for the posterior distribution and for $p_c(h)$, the prior probability of the object height. By discretizing the possible heights over a pre-defined interval into values $h_k$, we can compute the likelihood term as $\sum_k p(d \mid s_t, h_k, M_t)\, p_c(h_k)$. In the second case, when the involved distributions are Gaussian and the models linear, we have an instance of the Kalman filter, which takes a simple form of mean/variance updates (see Section IV-D).
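The histogram variant described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the grid names (`s_grid`, `h_grid`), the Gaussian form of the per-height likelihood, and all numeric values are hypothetical choices made for the example, not the settings used in the experiments.

```python
import numpy as np

def gaussian(x, mean, std):
    """Gaussian density, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def predict(posterior, s_grid, sigma_t):
    """Prediction step: diffuse the histogram with a Gaussian transition."""
    # transition[i, j] approximates p(s_t = s_grid[i] | s_{t-1} = s_grid[j]).
    transition = gaussian(s_grid[:, None], s_grid[None, :], sigma_t)
    transition /= transition.sum(axis=0, keepdims=True)
    return transition @ posterior

def update(predicted, s_grid, h_grid, height_prior, h_hat, sigma_obs):
    """Correction step for one detection, marginalizing over the height."""
    # Each candidate height h_k yields a scale observation h_k / h_hat
    # (assumed observation model); the likelihood sums over the height prior.
    z = h_grid / h_hat
    likelihood = gaussian(s_grid[:, None], z[None, :], sigma_obs) @ height_prior
    posterior = likelihood * predicted
    return posterior / posterior.sum()

# Hypothetical usage: flat initial belief, car-like height prior.
s_grid = np.linspace(0.5, 3.0, 251)
h_grid = np.linspace(1.0, 2.0, 101)
height_prior = gaussian(h_grid, 1.5, 0.1)
height_prior /= height_prior.sum()
belief = np.full(s_grid.size, 1.0 / s_grid.size)
belief = predict(belief, s_grid, sigma_t=0.05)
belief = update(belief, s_grid, h_grid, height_prior, h_hat=1.0, sigma_obs=0.05)
```

The mode of `belief` then plays the role of the scale correction estimate; several detections in one frame would simply call `update` once per detection under the conditional independence assumption.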
Note that in the more general case of several detections in $D_t$, and by assuming conditional independence between the different detections observed in frame $t$, we have
$$p(D_t \mid s_t, M_t) = \prod_{d \in D_t} p(d \mid s_t, M_t).$$
In the following, we give details on these three distributions.
IV Definition of the probabilistic models
IV-A Transition probability
As stated before, the transition distribution $p(s_t \mid s_{t-1})$ allows us to encode time variations of the global scale correction. These variations are caused by the accumulation of errors in the mapping and tracking threads of the SLAM algorithm.
Experiments show that larger global scale variations occur when the camera experiences greater angular displacement. For this reason, the transition distribution $p(s_t \mid s_{t-1})$ is modeled as a Gaussian distribution centered at $s_{t-1}$, with a standard deviation $\sigma_t$ that varies from frame to frame and is proportional to the angular displacement of the camera.
Let be the angular displacement (in degrees) along the rotation axis between and . Let be the last time since the scale was updated, then we define .
The standard deviation is then calculated as
The values observed to work in practice are , , and . These values for and have been determined from the observed variations of the scale correction along several test sequences.
IV-B Likelihood of detections
The term $p(d \mid s_t, h, M_t)$ is the probability that the detected object has the dimensions in pixels with which it was detected, given that the object has a real size $h$, that the scale is $s_t$, and that the local map is $M_t$.
The general idea to evaluate it is to estimate the height of the detected object using the detection $d$ and the local map $M_t$, then to obtain a scale correction estimate and compare it with $s_t$.
Let $P_d$ be the points from the local map, transformed into the camera frame, whose projection lies inside the detection region $r(d)$ given the current pose and map parameters. We assume that, in the world frame in which the SLAM system does its tracking and mapping, we can identify the vertical direction. We assume that the detected object surface is parallel to the vertical direction and that the object is oriented vertically. From $P_d$, we will first construct a point $\bar{p}$ that will lie on a vertical straight line to be used to infer the object height. Let $q_k$ be the projections of the points of $P_d$ onto the plane that is perpendicular to the vertical direction and passes through the camera position.
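Selecting the map points that fall inside a detection can be sketched as below. This is a minimal example assuming a pinhole camera, points already expressed in the camera frame, and an axis-aligned box given as `(u_min, v_min, u_max, v_max)` in pixels; the function name and box layout are illustrative choices, not taken from any specific SLAM or detector API.

```python
import numpy as np

def points_in_detection(points_cam, K, box):
    """Return the camera-frame points whose projection falls inside `box`.

    points_cam: (N, 3) array of 3D points in the camera frame.
    K: 3x3 calibration matrix.
    box: (u_min, v_min, u_max, v_max) in pixels (assumed layout).
    """
    pts = np.asarray(points_cam, dtype=float)
    pts = pts[pts[:, 2] > 0.0]            # keep points in front of the camera
    proj = (K @ pts.T).T                  # homogeneous image coordinates
    uv = proj[:, :2] / proj[:, 2:3]       # perspective division
    u_min, v_min, u_max, v_max = box
    inside = ((uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) &
              (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max))
    return pts[inside]

# Hypothetical usage with a toy calibration matrix.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
cloud = np.array([[0.0, 0.0, 10.0],    # projects to (50, 50): inside
                  [5.0, 0.0, 10.0],    # projects to (100, 50): outside
                  [0.0, 0.0, -5.0]])   # behind the camera: discarded
selected = points_in_detection(cloud, K, box=(40.0, 40.0, 60.0, 60.0))
```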
We assume that the projected points $q_k$ are sorted in increasing order of their distance to the camera position, given by the SLAM system. The representative point $\bar{p}$ is obtained as a weighted average of the $q_k$, giving higher weight to points closer to the camera except for a small portion of the closest points. This is done in order to filter out points that do not lie on the surface of the object, in particular points from the background inside the detection region, or points appearing due to partial occlusions. This can be observed in Figure 2, left, with the points lying on the object surface in green, the points closer to the camera (which can lie, for example, on the ground) in yellow, and the points farther from the camera (e.g., on a building behind the car) in blue. Finally, $\bar{p}$ is depicted in red.
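The averaging of the points is done with a gamma density evaluated at the index position $k$ of each depth-sorted point $q_k$, with parameters that were determined to work well in practice. Hence, we can estimate the representative point $\bar{p}$ as
$$\bar{p} = \frac{\sum_k w_k\, q_k}{\sum_k w_k}, \qquad (5)$$
where $w_k$ is the gamma density evaluated at $k$.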
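The rank-based weighting can be sketched as follows. The shape and scale values below are placeholders chosen only so that the weights peak at the second-closest point; they are not the parameters used in this work, and the input points are assumed to be already projected onto the horizontal plane through the camera, which sits at the origin.

```python
import math
import numpy as np

def gamma_pdf(x, shape, scale):
    """Gamma density with the given shape and scale, evaluated elementwise."""
    return x ** (shape - 1.0) * np.exp(-x / scale) / (math.gamma(shape) * scale ** shape)

def representative_point(points, shape=2.0, scale=2.0):
    """Weighted average of points, weighted by a gamma density on depth rank.

    shape/scale are illustrative placeholders, not the paper's values.
    """
    pts = np.asarray(points, dtype=float)
    order = np.argsort(np.linalg.norm(pts, axis=1))   # closest point first
    ranks = np.arange(1, len(pts) + 1, dtype=float)   # index position 1..N
    weights = gamma_pdf(ranks, shape, scale)
    weights /= weights.sum()
    return weights @ pts[order], weights

# Hypothetical usage: ten points at distances 1..10 along one axis.
toy_points = np.array([[float(k), 0.0, 0.0] for k in range(10, 0, -1)])
p_bar, w = representative_point(toy_points)
```

With these placeholder parameters the very closest point receives less weight than the second one, and the weights then decay with rank, which mirrors the filtering of foreground and background outliers described above.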
The 3D line $L$ is defined as the line passing through the representative point $\bar{p}$ with vertical direction (see Fig. 2, right). Let $\bar{x}$ be the projection of $\bar{p}$ on the image with the current camera parameters, and $l$ a line in the image passing through $\bar{x}$ and such that the plane obtained by back-projecting $l$ is vertical.
We consider the intersections $x_1$ and $x_2$ of this line with the boundary of the detection region $r(d)$, depicted in Figure 2 with green dots on the image plane, while $\bar{x}$ is the red dot. These two image points are taken as the vertical extremities of the object.
Let $R_1$ and $R_2$ be the 3D map rays obtained by back-projecting the image points $x_1$ and $x_2$, respectively. We define $P_1 = R_1 \cap L$ and $P_2 = R_2 \cap L$. These two points are taken as the vertical extremities of the object in the 3D map, as seen in Figure 2 (in green).
Fig. 2: Left: top view of a 3D object corresponding to a car detection. The dots correspond to 3D points that project inside the detection region. The green dots lie on the surface of the object, the yellow dots are closer to the camera than the object’s surface (they could correspond to an occluded part of the car), and the blue dots are farther from the camera than the object’s surface (they correspond to points in the background of the detection region). A representative of these points, the red dot $\bar{p}$, which is more likely to lie on the surface of the object, is obtained by averaging all the points with weights given by a gamma distribution evaluated at their depth ranking value. Right: projection of the object on the image. The red dot $\bar{p}$ corresponds to a point on the surface of the car, and $\bar{x}$ is the projection of this point on the image. $x_1$ and $x_2$ are the vertical extremities of the object on the image, and $P_1$ and $P_2$ correspond to the extremities in the world frame. $L$ is a line parallel to the vertical direction passing through $\bar{p}$, and $l$ is the projection of this line in the image. $R_1$ and $R_2$ correspond to the back projection of $x_1$ and $x_2$, respectively.
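Then the object height can be estimated as the Euclidean distance between the two 3D extremities, $\hat{h} = \|P_1 - P_2\|$. The scale correction observation, given the true height $h$, is calculated as $h / \hat{h}$. Finally, the likelihood of the detection is evaluated as
$$p(d \mid s_t, h, M_t) = \mathcal{N}\!\left(s_t;\; h/\hat{h},\; \sigma_{obs}^2\right), \qquad (6)$$
with $\mathcal{N}$ a Gaussian density with mean $h/\hat{h}$ and standard deviation $\sigma_{obs}$. The next section describes how $\sigma_{obs}$ can be evaluated.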
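The geometric construction and the resulting likelihood can be sketched as follows, assuming a pinhole camera at the origin of the camera frame and a known vertical direction expressed in that frame. The helper names, the least-squares ray/line intersection, and the toy numbers are assumptions made for this example rather than the implementation used in the experiments.

```python
import numpy as np

def ray_line_point(ray_dir, line_point, line_dir):
    """Point of a 3D line that meets (or is closest to) a ray from the origin."""
    # Solve lam * ray_dir - mu * line_dir = line_point in the least-squares sense.
    A = np.stack([ray_dir, -line_dir], axis=1)
    (_lam, mu), *_ = np.linalg.lstsq(A, line_point, rcond=None)
    return line_point + mu * line_dir

def scale_likelihood(s, K, x_top, x_bottom, p_bar, vertical, h, sigma_obs):
    """Gaussian likelihood of scale s given one detection and a true height h."""
    K_inv = np.linalg.inv(K)
    # Back-project the two image extremities and intersect with the vertical line.
    P1 = ray_line_point(K_inv @ np.array([*x_top, 1.0]), p_bar, vertical)
    P2 = ray_line_point(K_inv @ np.array([*x_bottom, 1.0]), p_bar, vertical)
    h_hat = np.linalg.norm(P1 - P2)      # object height in map units
    z = h / h_hat                        # scale correction observation
    lik = np.exp(-0.5 * ((s - z) / sigma_obs) ** 2) / (sigma_obs * np.sqrt(2.0 * np.pi))
    return lik, z, h_hat

# Hypothetical usage: object 10 map units ahead, spanning 2 map units vertically.
K = np.array([[100.0, 0.0, 0.0], [0.0, 100.0, 0.0], [0.0, 0.0, 1.0]])
p_bar = np.array([0.0, 0.0, 10.0])
vertical = np.array([0.0, 1.0, 0.0])
lik, z, h_hat = scale_likelihood(0.75, K, (0.0, -10.0), (0.0, 10.0),
                                 p_bar, vertical, h=1.5, sigma_obs=0.1)
```

In this toy configuration the object measures 2 map units, so a true height of 1.5 m gives a scale observation of 0.75, where the likelihood peaks.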
IV-C Observation noise variances
We define , the observation noise, to quantify roughly the uncertainty on each scale observation. We evaluate it as the standard deviation on using uncertainty propagation, as follows. Let , with one of the points in as defined in the previous section, expressed in the camera frame. The variance on the depth of , , is approximated as the variance of the distances with the weights of Eq. 5. We refer to it as .
The distance can be expressed as with and ( and are considered as constant, here, as they are the slopes of and and do not depend on ). Hence, the standard deviation on is roughly . Now so can be approximated as
IV-D Posterior updates
When the scale correction and height prior distributions are represented as discrete distributions, the implementation of Eq. 3 is straightforward.
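When the height prior distributions are Gaussian, $p_c(h) = \mathcal{N}(h; \bar{h}_c, \sigma_c^2)$, the Bayes filter can be implemented as a Kalman filter, where the current scale correction estimate is represented by its mean/variance before and after correction, $\hat{s}_t^-, \hat{s}_t^+$ (means) and $P_t^-, P_t^+$ (variances). The prediction step results from Eq. 1, with $\sigma_t$ the standard deviation of the transition model:
$$\hat{s}_t^- = \hat{s}_{t-1}^+, \qquad (7)$$
$$P_t^- = P_{t-1}^+ + \sigma_t^2. \qquad (8)$$
In the correction step, the difference with the traditional Kalman filter is that we have to marginalize the height variable $h$ over the prior on the object class, i.e. compute $\int p(d \mid s_t, h, M_t)\, p_c(h)\, dh$. To simplify the evaluation and keep the result as a Gaussian, we use Eq. 6 for a fixed value of $h$, the prior mean $\bar{h}_c$, i.e., $p(d \mid s_t, M_t) \approx p(d \mid s_t, \bar{h}_c, M_t)$.
In that case, we deduce that $p(d \mid s_t, M_t)$ is a Gaussian centered at $z_t = \bar{h}_c / \hat{h}$, where $\hat{h}$ is the object height estimated in the map, with variance $\sigma_{obs}^2$. The update expressions follow for the means and variances:
$$\hat{s}_t^+ = \hat{s}_t^- + k_t\,(z_t - \hat{s}_t^-), \qquad (9)$$
$$P_t^+ = (1 - k_t)\, P_t^-, \qquad k_t = \frac{P_t^-}{P_t^- + \sigma_{obs}^2}. \qquad (10)$$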
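A scalar Kalman filter of this kind can be sketched in a few lines. This is a generic textbook form written under the assumptions stated above (random-walk transition, one Gaussian scale observation per detection); the function names and the numbers in the usage lines are illustrative only.

```python
def kalman_predict(mean, var, sigma_t):
    """Prediction: the mean is unchanged, the variance grows by sigma_t**2."""
    return mean, var + sigma_t ** 2

def kalman_correct(mean, var, z, sigma_obs):
    """Correction with one scale observation z of standard deviation sigma_obs."""
    gain = var / (var + sigma_obs ** 2)
    return mean + gain * (z - mean), (1.0 - gain) * var

# Hypothetical usage: one frame with a single car detection.
h_prior_mean = 1.5   # assumed mean car height, in meters
h_hat = 1.0          # assumed object height measured in the map
mean, var = kalman_predict(1.0, 0.75, sigma_t=0.5)
mean, var = kalman_correct(mean, var, z=h_prior_mean / h_hat, sigma_obs=1.0)
```

Frames without any detection would only run `kalman_predict`, so the variance keeps growing until a new observation arrives.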
V Experimental results
V-A Description of the experimental setup
For a quantitative evaluation of the scale estimation for correcting scale drift, the algorithm is run on 10 sequences of the KITTI dataset . Each sequence consists of a driving scenario in an urban environment with varying speeds and distances. We want to stress that, although the application presented in these experiments is quite specific (monocular vision for road vehicles), the proposed method is much more generic and can be used in many other scenarios. We have chosen this application to measure its potential benefits, because of the existence of well documented datasets, such as KITTI. Sequences to are considered here, except for sequence , since it is in a highway in which the SLAM algorithm (ORB-SLAM) fails due to the high speed. The sequences come with ground truth poses for evaluation. The evaluation computes errors between relative transformations for every subsequence of length meters as proposed in . Here, as our algorithm evaluates the scale correction, we only present results on translational errors. The rotational errors are a consequence of the SLAM algorithm and do not depend on the scale.
YOLO9000  is used for detecting car instances, and the minimum confidence threshold is set to . We could have considered more object classes but their presence in the KITTI dataset is marginal (a few “truck” or “bus” objects only). Object detection is run every 5 frames. As it can be seen in Table II, the number of updates, i.e. of integrations of observations in the Bayesian framework, is quite variable. In sequence or , there are approximately updates per frame; in sequence , this number falls to . Of course, this has an impact on the final errors (see below).
The prior distribution for the car’s height is set as a Gaussian with mean meters. The mean was chosen in accordance with the report by the International Council on Clean Transportation  for average car height in 2015. Based on these facts, we selected the Kalman Filter implementation of the algorithm, equations 7 and 8 for prediction and equations 9 and 10 for correction.
The ORB-SLAM and YOLO algorithms run in real time, and the Kalman filter implementation of the scale correction estimation adds negligible additional processing time, which guarantees the real time performance of the algorithm.
V-C Evaluation and discussion
[Table II: translational errors per KITTI sequence; columns: Bayesian, Update only, Avg. Scale, Stereo]
We can see in Figure 3(a) the evolution of the scale estimate (in bold) along with the scale observations corresponding to the KITTI sequence , i.e. the same as the illustration of Fig. 1. Our scale correction estimate is clearly decreasing from values , i.e. as ORB-SLAM is sub-estimating the scale, to values towards the end of the sequence, i.e. as ORB-SLAM is over-estimating the scale. This effect is perceptible in Fig. 1, through the path estimated by ORB-SLAM: distances in the trajectory produced by ORB-SLAM without scale estimation (in green) are seen to be overestimated later in time.
(c). The time is indicated by the color of the posterior (lighter colors mean later times). Updates can also have a higher effect on the posterior at moments where a large scale drift is expected due to high rotational motion, as suggested by Eq. 4.
In Table II, we compare the errors obtained with different approaches for scale estimation for the 10 KITTI sequences analyzed: (i) in the second column, our method as presented in this paper; (ii) in the third column, our method without the scale correction motion model, i.e. roughly as in ; (iii) in the fourth column, a very simple method that computes an average value of the scale correction and applies it to the map and the trajectory (this corresponds to neglecting the scale drift effect); (iv) in the fifth column, to give a hint of the precision reached by a 3D sensor, we give the results of ORB-SLAM 2 with the stereo datasets; finally, (v) the sixth column gives results from the monocular system developed in , where the camera height over the road is known.
Analyzing the results for our methods (second and third columns) in the different sequences, one can see a strong correlation between the obtained errors and the average number of updates per frame as described in Table II, as expected. For example, in sequences , , , where a lot of cars were detected, the results are very good, with errors significantly lower than . Conversely, sequences , , with their scarce car detections, give rather poor results. Sequence , for instance, is a short sequence on a highway, without static vehicles, and produces only 4 update steps. However, (bottom row), the overall error levels are lower than . Note that introducing the motion model with varying variance has improved the performance of  by a factor of 3. As expected, not including the scale drift (fourth column) leads to very poor performance. Finally, a detector such as  is quite versatile, so we could use it at its maximum potential by integrating other classes to detect, e.g. road signs, house doors and windows.
In Fig. 4, we give two more examples of reconstructed trajectories with/without our scale correction and with/without motion model for the scale correction factor. Our method allows the final trajectory (in blue) to get very close to the ground truth (in red). Similarly, in Fig. 5, we give the errors of these same methodologies, for different path lengths, and averaged over the 10 sequences. Again, our method yields very reasonable errors, between 4 and 7.
Some of the best monocular systems with scale correction,  and , have average errors of and , respectively, which are very similar to the average error of our method, . But these monocular methods are specific to driving scenarios, based on a given fixed camera height and an observable plane. On the other hand, state-of-the-art methods for scale estimation based on object detection, , have errors of on average. Our method outperforms state-of-the-art methods of scale estimation based on object detection while achieving similar performance to state-of-the-art monocular systems with scale correction, but within a more general framework.
We have presented a Bayes filter algorithm that estimates the scale correction to apply to the output of a monocular SLAM algorithm so as to obtain correct maps and trajectories. The observation model uses object detections given by a generic object detector, and integrates height priors over the objects from the detected classes. A probabilistic motion model is proposed in order to model the scale drift. In the light of the very promising results obtained on the KITTI dataset, we will focus our efforts on obtaining a better model for the scale drift, whose evolution over time seems to exhibit a clear structure.
References
- [1] European vehicle market statistics pocketbook 2015/2016. Technical report, The International Council on Clean Transportation, 2015.
- [2] C. Forster, M. Pizzoli, and D. Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In Proc. of IEEE Int. Conf. on Robotics and Automation (ICRA), 2014.
- [3] A. J. Davison. Real-Time Simultaneous Localisation and Mapping with a Single Camera. In Int. Conf. Comput. Vis., 2003.
- [4] J. Engel and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In Proc. of European Conference on Computer Vision (ECCV), 2014.
- [5] D. P. Frost, O. Kähler, and D. W. Murray. Object-aware bundle adjustment for correcting monocular scale drift. In Proc. of Int. Conf. on Robotics and Automation, pages 4770–4776, May 2016.
- [6] D. Gálvez-López, M. Salas, J. D. Tardós, and J. Montiel. Real-time monocular object SLAM. Robotics and Autonomous Systems, 75, Part B:435–449, 2016.
- [7] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
- [8] J. Gräter, T. Schwarze, and M. Lauer. Robust scale estimation for monocular visual odometry using structure from motion and vanishing points. In 2015 IEEE Intelligent Vehicles Symposium (IV), pages 475–480, June 2015.
- [9] G. Klein and D. Murray. Parallel Tracking and Mapping for Small AR Workspaces. In Proc. of Int. Symp. Mix. Augment. Real. IEEE, Nov. 2007.
- [10] S. B. Knorr and D. Kurz. Leveraging the user’s face for absolute scale estimation in handheld monocular SLAM. Proc. of Int. Symp. on Mixed and Augmented Reality, 00:11–17, 2016.
- [11] R. Mur-Artal, J. Montiel, and J. Tardós. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
- [12] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. arXiv preprint arXiv:1610.06475, 2016.
- [13] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proc. of Int. Symp. on Mixed and Augmented Reality. IEEE, October 2011.
- [14] G. Nützi, S. Weiss, D. Scaramuzza, and R. Siegwart. Fusion of IMU and vision for absolute scale estimation in monocular SLAM. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2011.
- [15] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- [16] R. Salas-Moreno, R. Newcombe, H. Strasdat, P. Kelly, and A. Davison. SLAM++: Simultaneous localisation and mapping at the level of objects. In Proc. of Int. Conf. on Computer Vision and Pattern Recognition, 2013.
- [17] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D Scene Structure from a Single Still Image. IEEE Trans. Pattern Anal. Mach. Intell., 31(5), May 2009.
- [18] S. Song, M. Chandraker, and C. C. Guest. Parallel, real-time monocular visual odometry. In Proc. of IEEE Int. Conf. on Robotics and Automation, pages 4698–4705, May 2013.
- [19] E. Sucar and J.-B. Hayet. Probabilistic global scale estimation for monoSLAM based on generic object detection. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
- [20] J. M. Wolfe, K. R. Kluender, D. M. Levi, L. M. Bartoshuk, R. S. Herz, R. Klatzky, S. J. Lederman, and D. M. Merfeld. Sensation and Perception. Sinauer Associates, 2014.
- [21] D. Zhou, Y. Dai, and H. Li. Reliable scale estimation and correction for monocular visual odometry. In 2016 IEEE Intelligent Vehicles Symposium (IV), pages 490–495, 2016.