IV-SLAM: Introspective Vision for Simultaneous Localization and Mapping

by   Sadegh Rabiee, et al.
The University of Texas at Austin

Existing solutions to visual simultaneous localization and mapping (V-SLAM) assume that errors in feature extraction and matching are independent and identically distributed (i.i.d), but this assumption is known to not be true – features extracted from low-contrast regions of images exhibit wider error distributions than features from sharp corners. Furthermore, V-SLAM algorithms are prone to catastrophic tracking failures when sensed images include challenging conditions such as specular reflections, lens flare, or shadows of dynamic objects. To address such failures, previous work has focused on building more robust visual frontends to filter out challenging features. In this paper, we present introspective vision for SLAM (IV-SLAM), a fundamentally different approach for addressing these challenges. IV-SLAM explicitly models the noise process of reprojection errors from visual features to be context-dependent, and hence non-i.i.d. We introduce an autonomously supervised approach for IV-SLAM to collect training data to learn such a context-aware noise model. Using this learned noise model, IV-SLAM guides feature extraction to select more features from parts of the image that are likely to result in lower noise, and further incorporates the learned noise model into the joint maximum likelihood estimation, thus making it robust to the aforementioned types of errors. We present empirical results to demonstrate that IV-SLAM 1) is able to accurately predict sources of error in input images, 2) reduces tracking error compared to V-SLAM, and 3) increases the mean distance between tracking failures by more than 70% compared to V-SLAM.




I Introduction

Visual simultaneous localization and mapping (V-SLAM) extracts features from observed images, and identifies correspondences between features across time-steps. By jointly optimizing the re-projection error of such features along with motion information derived from odometry or inertial measurement units (IMUs), V-SLAM reconstructs the trajectory of a robot along with a sparse 3D map of the locations of the features in the world. To accurately track the location of the robot and build a map of the world, V-SLAM requires selecting features from static objects, and correctly and consistently identifying correspondences between features. Unfortunately, despite extensive work on filtering out bad features [1, 29, 7] or rejecting unlikely correspondence matches [30, 17, 27], V-SLAM solutions still suffer from errors stemming from incorrect feature matches and features extracted from moving objects. Furthermore, V-SLAM solutions assume that re-projection errors are independent and identically distributed (i.i.d), an assumption that we know to be false: features extracted from low-contrast regions or from regions with repetitive textures exhibit wider error distributions than features from regions with sharp, locally unique corners. As a consequence of such assumptions, and the reliance on robust frontends to filter out bad features, even state of the art V-SLAM solutions suffer from catastrophic failures when encountering challenging scenarios such as specular reflections, lens flare, and shadows of moving objects encountered by robots in the real world.

We present introspective vision for SLAM (IV-SLAM), a fundamentally different approach for addressing these challenges – instead of relying on a robust frontend to filter out bad features, IV-SLAM builds a context-aware total noise model [28] that explicitly accounts for heteroscedastic noise, and learns to account for bad correspondences, moving objects, non-rigid objects, and other causes of errors. During a training phase, IV-SLAM uses reference poses, computed via SLAM with a supervisory sensor, to identify regions of images where features result in reprojection errors inconsistent with the reference poses. With this training data, IV-SLAM learns to predict the reliability of features as a function of locations on a provided image. During the online inference phase, the predicted reliability is then used to guide informative feature extraction by the IV-SLAM frontend, and to generate image-feature-specific robust loss functions when solving for the pose of the camera in the backend. Thus, IV-SLAM is capable of learning to identify causes of V-SLAM failures in an autonomously supervised manner, and is subsequently able to leverage the learned model to improve the robustness and accuracy of tracking during actual deployments.

Our experimental results demonstrate that IV-SLAM 1) is able to accurately predict sources of error in input images as identified by ground truth in simulation, 2) reduces tracking error on both simulation and real-world data, and 3) significantly increases the mean distance between tracking failures when operating under challenging real-world conditions that frequently lead to catastrophic tracking failures of V-SLAM.

II Related Work

There exists a plethora of research on designing and learning distinctive image feature descriptors. This includes classical hand-crafted descriptors such as ORB and SIFT, as well as more recent learned ones [17, 27, 2, 15, 26] that mainly rely on Siamese and triplet network architectures to generate a feature vector for an input image patch. Selecting interest points at which to extract these descriptors is traditionally done by convolving the image with specific filters [4] or by using first-order approximations such as the Harris detector. More recently, CNN-based approaches have become popular [14, 20]. Cieslewski et al. [8] train a network that, given an input image, outputs a score map for selecting interest points.

Once features are extracted at selected keypoints, pruning the subset of them predicted to be unreliable is done in different ways. Alt et al. [1] train a classifier to predict good features at the descriptor level, and Wang and Zhang use hand-crafted heuristics to determine good SIFT features. Carlone and Karaman [7] propose a feature pruning method for visual-inertial SLAM that uses the estimated velocity of the camera to reject image features that are predicted to exit the scene in the immediate future frames.

A line of work leverages scene semantic information for feature selection. Kaneko et al. [13] run input images through a semantic segmentation network and limit feature extraction to semantic classes that are assumed to be more reliable, such as static objects. Ganti and Waslander [11] follow a similar approach while taking into account the uncertainty of the segmentation network in their feature selection process. While these approaches benefit from using the contextual information in the image, they are limited to hand-enumerated lists of sources of error. Moreover, not all potential sources of failure fall into a semantic class for which well-trained segmentation networks exist; repetitive textures, shadows, and reflections are examples of such sources.

Pruning bad image correspondences once features are matched across frames is another active area of research. RANSAC [10] is the traditional solution to this problem, and more recently deep learning approaches have been developed [16, 25] that use permutation-equivariant network architectures and predict outlier correspondences by processing the coordinates of pairs of matched features. While the goal of these methods is to discard outlier correspondences, not all bad matches are outliers: there is a grey area of features that, for reasons such as specularity, shadows, or motion blur, are not localized as accurately as other features without clearly being outliers.

Our work is agnostic of the feature descriptor type and the feature matching method at hand. It is similar to [8] in that it learns to predict unreliable regions for feature extraction in a self-supervised manner. However, it goes beyond being a learned keypoint detector and applies to the full V-SLAM solution by exploiting the predicted feature reliability scores to generate a context-aware loss function for the bundle adjustment problem. Unlike available approaches for learning good feature correspondences, which require accurate ground truth poses of the camera for training [16, 25, 30], our work only requires rough estimates of the reference pose of the camera. IV-SLAM is inspired by early works on introspective vision [18, 9] and applies the idea to visual SLAM.

III Visual SLAM

In visual SLAM, the pose of the camera is estimated and a 3D map of the environment is built by finding correspondences in the image space across frames. For each new input image (or stereo image pair), image features are extracted and matched with those from previous frames. Then, the solution to SLAM is given by

$$X^* = \arg\max_{X} P(Z \mid X)\, P(X \mid U), \quad (1)$$

where $X = \{x_1, \dots, x_t\}$ is the sequence of state vectors composed of the pose of the camera and the map, $Z = \{z_{t,k}\}$ represents observations made by the robot, and $U$ are the control commands and/or odometry and IMU measurements. $P(Z \mid X)$ is the observation likelihood for image feature correspondences, given the estimated poses of the camera and the map. For each time-step $t$, the V-SLAM frontend processes image $I_t$ to extract features $z_{t,k}$ associated with 3D map points $p_{t,k}$. The observation error here is the reprojection error of $p_{t,k}$ onto the image and is defined as

$$e_{t,k} = z_{t,k} - \hat{z}_{t,k}, \qquad \hat{z}_{t,k} = C\left(R_t\, p_{t,k} + T_t\right), \quad (2)$$

where $\hat{z}_{t,k}$ is the prediction of the observation $z_{t,k}$, $C$ is the camera matrix, and $R_t$ and $T_t$ are the rotation and translation parts of $x_t$, respectively. In the absence of control commands and IMU/odometry measurements, Eq. 1 reduces to the bundle adjustment problem, which is formulated as a nonlinear optimization:

$$X^* = \arg\min_{X} \sum_{t,k} \rho\left(e_{t,k}^{\top}\, \Sigma_{t,k}^{-1}\, e_{t,k}\right), \quad (3)$$

where $\Sigma_{t,k}$ is the covariance matrix associated with the scale at which a feature has been extracted and $\rho$ is the loss function applied to the squared error. The choice of noise model for the observation error has a significant effect on the accuracy of the maximum likelihood (ML) solution to SLAM [28]. There exists a body of work on developing robust loss functions [5, 3] that targets improving the performance of vision tasks in the presence of outliers.
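As a concrete numerical sketch of the reprojection error defined above (NumPy; the camera matrix values and symbol names are hypothetical, chosen only to mirror the text):

```python
import numpy as np

# Pinhole camera intrinsics (hypothetical values for illustration).
C = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

def reproject(R, T, p):
    """Project a 3D map point p into the image given camera rotation R
    and translation T (world -> camera frame)."""
    p_cam = R @ p + T          # transform the point into the camera frame
    z = C @ p_cam              # apply the camera matrix
    return z[:2] / z[2]        # perspective division -> pixel coordinates

def reprojection_error(z_obs, R, T, p):
    """Observation error e = z - z_hat for one matched feature."""
    return z_obs - reproject(R, T, p)
```

For example, with an identity pose, a point on the optical axis at depth 2 projects to the principal point (320, 240), and a feature observed one pixel to the right yields an error of (1, 0).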

The observation error $e_{t,k}$ is known to have a non-Gaussian distribution in the context of V-SLAM due to the frequent outliers that occur in image correspondences [28]. Instead, it is assumed to be drawn from long-tailed distributions such as piecewise densities with a central peaked inlier region and a wide outlier region. While $e_{t,k}$ is usually modeled to be i.i.d., there exist obvious reasons why this is not a realistic assumption. Image features extracted from objects with high-frequency surface textures can be localized less accurately; whether the underlying object is static or dynamic affects the observation error; and the presence of multiple similar-looking instances of an object in the scene can lead to correspondence errors. These are all examples of how the error distribution can change across observations from frame to frame and across regions of the same image. In the next section, we explain how IV-SLAM leverages the contextual information available in the input images to learn an improved noise model that better represents the non-i.i.d. nature of the observation error.

IV Introspective Vision for SLAM

Fig. 1: IV-SLAM pipeline during inference.

IV-SLAM models the observation error distribution as dependent on the observations, i.e. heteroscedastic rather than identically distributed. Let $f$ be the probability density function (PDF) of the observation error; we want $f \propto \exp(-\rho)$, where $\rho$ is a loss function from the space of robust loss functions [3]. In this paper, we choose $\rho = \rho_\delta$ from the space of Huber loss functions, specifically

$$\rho_\delta(s) = \begin{cases} s, & s \le \delta^2 \\ 2\delta\sqrt{s} - \delta^2, & s > \delta^2, \end{cases} \quad (4)$$

where $s$ is the squared error value and $\delta$ is an observation-dependent parameter of the loss function, correlated with the reliability of the observations. IV-SLAM learns an empirical estimate of $\delta$ such that the corresponding error distribution better models the observed error values. During the training phase, input images and estimated observation error values are used to learn to predict the reliability of image features at any point on an image. During the inference phase, a context-aware $\delta$ is estimated for each observation using the predicted reliability score, where a smaller value of $\delta$ corresponds to an unreliable observation. The resultant loss function is then used in Eq. 3 to solve for $X^*$. Fig. 1 illustrates the IV-SLAM pipeline during inference. In the rest of this section, we explain the training and inference phases of IV-SLAM in detail.
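A minimal sketch of the Huber loss in Eq. 4, showing how a smaller δ (as predicted for unreliable features) flattens the penalty on large errors:

```python
import numpy as np

def huber(s, delta):
    """Huber loss on the squared error s: quadratic inside the inlier
    region (s <= delta**2), linear in sqrt(s) outside it."""
    s = np.asarray(s, dtype=float)
    return np.where(s <= delta**2, s, 2.0 * delta * np.sqrt(s) - delta**2)
```

For a large squared error such as s = 16, the loss under δ = 1 is smaller than under δ = 2, so a feature predicted to be unreliable contributes less to the optimization.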

IV-A Self-Supervised Training

One of the main properties of IV-SLAM as an introspective perception algorithm is its capability to generate labeled data required for its training in a self-supervised and automated manner. This empowers the system to follow a continual learning approach and exploit logged data during robot deployments for improving perception. In the following, we explain the underlying machine-learned model of IV-SLAM as well as the automated procedure for its training data collection.

IV-A1 Introspection Function

In order to apply a per-observation loss function $\rho_\delta$, IV-SLAM learns an introspection function $\mathcal{I}$ that, given an input image and a location $(u, v)$ on the image, predicts a cost value $c$ that represents a measure of reliability for image features extracted at $(u, v)$. Higher values of $c$ indicate a less reliable feature. $\mathcal{I}$ is implemented as a deep neural network that, given an input image, outputs an image of the same size whose pixel values are the predicted costs. We refer to this output as the image feature costmap. In other words, the costmap is a heatmap of feature unreliability: regions with high pixel values indicate regions of the input image that are unreliable for extracting and matching image features. In this paper, a fully convolutional network architecture composed of MobileNetV2 [19] as the encoder and a transposed convolution layer with deep supervision as the decoder is used. We use the same architecture as Zhou et al. [31] for the task of image segmentation.

IV-A2 Collection of Training Data

Fig. 2: Training label generation for the introspective function.

IV-SLAM requires a set of pairs of input images and their corresponding target image feature costmaps to train the introspection function. Training is performed offline and, although it is mainly unsupervised, rough estimates of the reference pose of the camera are required for pruning the auto-generated training data when the SLAM tracking accuracy is detected to be poor. In this work, the reference poses are obtained by running a 3D lidar-based SLAM solution [23] on logged lidar data offline. The automated procedure for generating the dataset, presented in Algorithm 1, is as follows: the core SLAM algorithm is run on the images, and at each frame the Mahalanobis distance $D$ between the reference and estimated poses of the camera is calculated as

$$D = \sqrt{\xi^{\top}\, \Sigma_{\mathrm{ref}}^{-1}\, \xi}, \qquad \xi = \log\left(T_{\mathrm{est}}\, T_{\mathrm{ref}}^{-1}\right)^{\vee}, \quad (5)$$

where $\xi$ denotes the corresponding element of the relative pose in the Lie algebra $\mathfrak{se}(3)$, and $\Sigma_{\mathrm{ref}}$ is the covariance of the reference pose of the camera, approximated as a constant lower bound for the covariance of the reference SLAM solution. A chi-squared test is applied to $D^2$, and if it fails, the current frame is flagged as unreliable and no training label is generated for it.
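The frame-gating step can be sketched as follows. The 95% significance level and 6 degrees of freedom (a 6-DOF pose error) are assumptions for illustration, since the excerpt does not state the exact threshold:

```python
import numpy as np

# 95th-percentile chi-squared critical value for 6 degrees of freedom
# (assumed significance level; the paper's choice may differ).
CHI2_95_6DOF = 12.592

def is_tracking_unreliable(xi, sigma_ref):
    """Flag a frame whose pose error xi (6-vector in se(3)) is
    statistically inconsistent with the reference pose covariance."""
    d2 = xi @ np.linalg.solve(sigma_ref, xi)  # squared Mahalanobis distance
    return d2 > CHI2_95_6DOF
```

A zero pose error always passes the test, while a large error relative to the reference covariance causes the frame to be dropped from the training set.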

At each frame recognized as reliable for training data labeling, reprojection error values are calculated for all matched image features. A normalized cost value $c_k = e_k^{\top}\, \Sigma_k^{-1}\, e_k$ is then computed for each image feature, where $\Sigma_k$ denotes the diagonal covariance matrix associated with the scale at which the feature has been extracted. The set of sparse cost values calculated for each frame is then converted to a costmap the same size as the input image. This is achieved using a Gaussian process regressor: given the set of feature locations and their corresponding cost values, it estimates cost values for all points on a grid with a cell size of 10 pixels. This low-resolution costmap is then resized using bilinear interpolation to obtain the full-resolution costmap. The variance values estimated by the Gaussian process regressor are used in the same manner to generate an uncertainty map. An image mask is then generated that masks out all pixels in the estimated costmap whose uncertainty values exceed a threshold. Areas of the input image with few extracted features are an example of such high-uncertainty regions. Figure 2 shows the estimated costmap and image mask for an example input image. The generated costmaps, along with the input images, are then used to train the introspection function using a stochastic gradient descent (SGD) optimizer and a mean squared error (MSE) loss that is applied only to the unmasked regions of the image.
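The sparse-costs-to-costmap step can be sketched as below. A Nadaraya–Watson kernel regressor stands in for the paper's Gaussian process regressor (an assumption for brevity), and nearest-neighbor upsampling stands in for the bilinear interpolation; grid cells far from any feature receive little evidence and are masked out, mirroring the uncertainty mask:

```python
import numpy as np

def sparse_costs_to_costmap(locs, costs, img_shape, cell=10,
                            length_scale=30.0, evidence_thresh=0.5):
    """Convert sparse per-feature costs (locs: (N, 2) pixel coordinates,
    costs: (N,)) into a dense costmap plus a boolean trust mask."""
    h, w = img_shape
    gh, gw = h // cell, w // cell
    coarse = np.zeros((gh, gw))
    mask = np.zeros((gh, gw), dtype=bool)
    for i in range(gh):
        for j in range(gw):
            center = np.array([(j + 0.5) * cell, (i + 0.5) * cell])
            d2 = np.sum((locs - center) ** 2, axis=1)
            wts = np.exp(-0.5 * d2 / length_scale**2)  # RBF kernel weights
            total = wts.sum()
            if total > evidence_thresh:        # enough nearby features
                coarse[i, j] = wts @ costs / total
                mask[i, j] = True              # cell is trusted for training
    # Upsample back to image resolution (nearest-neighbor here for
    # simplicity; the paper uses bilinear interpolation).
    costmap = np.kron(coarse, np.ones((cell, cell)))
    trust = np.kron(mask, np.ones((cell, cell))).astype(bool)
    return costmap, trust
```

With a few features clustered in one corner, only that corner of the costmap is trusted; the rest of the image is masked out of the training loss.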

The advantage of such a self-supervised training scheme is that it removes the need for highly accurate ground truth poses of the camera, which would have been necessary if image features were to be evaluated by common approaches such as the epipolar error across different frames.

IV-B Robust Estimation in IV-SLAM

During inference, input images are run through the introspection function $\mathcal{I}$, which outputs estimated costmaps. IV-SLAM uses these costmaps both to guide the feature extraction process and to adjust the contribution of extracted features when solving for the state vector $X$.

Guided feature extraction.    Each image is divided into equally sized cells, and the maximum number of image features to be extracted from each cell is set to be inversely proportional to the sum of the corresponding costmap pixel values within that cell. This helps IV-SLAM prevent situations where the majority of extracted features in a frame are unreliable.

Reliability-aware bundle adjustment.    Extracted features from the input image are matched with map points, and for each matched image feature extracted at pixel location $(u, v)$, a specific loss function is generated as defined in Eq. 4. The loss-function parameter $\delta$ is computed from the cost predicted at $(u, v)$, scaled by $\delta_0$, a positive constant and hyperparameter that defines the range over which $\delta$ can be adjusted. We pick $\delta_0$ from a percentile of the chi-squared distribution for a stereo observation. In other words, for each image feature, the Huber loss is adjusted such that features predicted to be less reliable (larger predicted cost) have a less steep associated loss function (smaller $\delta$). Lastly, the tracked features along with their loss functions are plugged into Eq. 3, and the solution to the bundle adjustment problem, i.e. the current pose of the camera as well as adjustments to the previously estimated poses and the locations of map points, is estimated using a nonlinear optimizer.

1: Input: set of matched image features {z_k} and map points {p_k} for the current frame, estimated camera pose x_est, reference camera pose x_ref, reference camera pose covariance Σ_ref
2: Output: costmap image
3: g ← costmap computation grid cell size
4: (h, w) ← output costmap size
5: if IsTrackingUnreliable(x_est, x_ref, Σ_ref) then
6:     return null
7: end if
8: for k = 1 to N do
9:     c_k ← CalcReprojectionError(z_k, p_k, x_est)
10: end for
11: for i = 1 to floor(h / g) do
12:     for j = 1 to floor(w / g) do
13:         q ← center of grid cell (i, j)
14:         M[i, j] ← GaussianProc(q, {z_k}, {c_k})
15:     end for
16: end for
17: return ResizeImage(M, (h, w))
Algorithm 1 Training Label Generation

V Experimental Results

In this section: 1) we evaluate how well IV-SLAM predicts the reliability of image features (Section V-B); 2) we show that IV-SLAM improves the tracking accuracy of a state-of-the-art V-SLAM algorithm and reduces the frequency of its tracking failures (Section V-C); and 3) we examine samples of the sources of failure that IV-SLAM has learned to associate with degraded V-SLAM performance (Section V-D).

To evaluate IV-SLAM, we implement it on top of the stereo version of ORB-SLAM. We pick ORB-SLAM because it already has several levels of feature-match pruning and outlier rejection in place, which indicates that the remaining failure cases that we address with introspective vision cannot be resolved with meticulously engineered outlier rejection methods.

V-a Experimental Setup

State-of-the-art vision-based SLAM algorithms have shown great performance on popular benchmark datasets such as KITTI [12] and EuRoC [6]. These datasets, however, do not fully reflect the many difficult situations that can occur when robots are deployed in the wild and over extended periods of time [24]. In order to assess the effectiveness of IV-SLAM at improving visual SLAM performance, we collect simulated and real-world datasets that expose these algorithms to challenging situations such as reflections, glare, shadows, and dynamic objects.

Simulation.    In order to evaluate IV-SLAM in a controlled setting where we have accurate poses of the camera and ground-truth depth of the scene, we use AirSim [22], a photo-realistic simulation environment. A car agent is equipped with a stereo pair of RGB cameras as well as a depth camera that provides ground-truth depth readings for every pixel in the left camera frames. A dataset consisting of the sensor readings as well as the ground-truth pose of the cameras is recorded by driving the car around the publicly available City environment. The same trajectories are repeated under different environmental and weather conditions, such as clear weather, wet roads, and the presence of snow and leaf accumulation on the road. The trajectories include fast turns and high-speed maneuvers that pose challenging situations for the task of SLAM. The data is then split into train and test sets, each composed of separate full deployment sessions (uninterrupted trajectories), such that both sets include data from all environmental conditions. The dataset comprises stereo image pairs spanning long trajectories traversed by the car in both the train and test sets.

Real-world.    We also evaluate IV-SLAM on real-world data collected using a Clearpath Jackal wheeled mobile robot. The robot is equipped with a stereo pair of Point Grey cameras that capture RGB images, as well as a Velodyne VLP-16 3D lidar. The dataset is collected over the span of a week in a college campus setting, in both indoor and outdoor environments and under different lighting conditions. As in the simulation scenario, the data is split into train and test sets consisting of separate uninterrupted robot deployment sessions, each covering long trajectories traversed by the robot. The reference pose of the robot is estimated by a 3D lidar-based SLAM solution, LeGO-LOAM [23]. The algorithm is run on the data offline, with extended optimization rounds for increased accuracy. The reference camera poses are then calculated using the extrinsic calibration of the cameras, and are used both for evaluating the performance of the vision-based SLAM solutions under test and for loosely supervising the training of IV-SLAM as explained in Section IV.

V-B Feature Reliability Prediction

Fig. 3: Mean reprojection error for image features sorted 1) randomly (baseline), 2) based on predicted reliability (Introspection func.), 3) based on ground truth reprojection error (ideal)

As explained in Section IV-A1, IV-SLAM’s introspection function (IF) predicts the reliability of each extracted image feature by predicting feature-specific noise model parameters. In this section, we assess IF’s success in achieving its objective.

For this purpose, we compare the IF's prediction of the reliability of image features with their corresponding ground-truth reprojection error. Since obtaining the ground-truth reprojection error requires access to the ground-truth 3D coordinates of objects associated with each image feature, as well as accurate reference poses of the camera, we conduct this experiment in simulation. IV-SLAM is trained on the simulation dataset with the method explained in Section IV. The IF is then run on all images in the test set along with the original ORB-SLAM. For each image feature extracted and matched by ORB-SLAM, we log its cost as predicted by the IF, as well as its corresponding ground-truth reprojection error. We then sort all image features in ascending order with respect to 1) ground-truth reprojection errors and 2) predicted cost values. Figure 3 illustrates the mean reprojection error for the top x% of features in each of these ordered lists, for varying x. The lower the area under a curve in this figure, the more accurate the corresponding image feature evaluator is at sorting features based on their reliability. The curve corresponding to the ground-truth reprojection errors indicates the ideal scenario, where all image features are sorted correctly. The baseline is the mean reprojection error obtained when features are sorted randomly (mean error over 1000 trials), corresponding to the case where no introspection is available and all features are treated equally. As can be seen, using the IF significantly improves image feature reliability assessment.
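The curves of Fig. 3 can be reproduced with a few lines of NumPy (a sketch with hypothetical inputs; lower scores mean a feature is predicted to be more reliable):

```python
import numpy as np

def mean_error_of_top_fraction(errors, scores, fracs):
    """For each fraction x in fracs, sort features ascending by `scores`
    and return the mean ground-truth reprojection error of the top
    x fraction of features."""
    order = np.argsort(scores)                     # best-scored first
    sorted_err = np.asarray(errors, dtype=float)[order]
    n = len(sorted_err)
    return np.array([sorted_err[:max(1, int(x * n))].mean() for x in fracs])
```

With a perfect evaluator (scores equal to the true errors), the curve stays as low as possible for every fraction; a random ordering yields a flat curve at the overall mean error.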

                 Real-World                        Simulation
Method        Trans. Err. (%)  Rot. Err.  MDBF   Trans. Err. (%)  Rot. Err.  MDBF
IV-SLAM            5.85          0.523    621.1      11.69          0.147    450.4
ORB-SLAM           9.20          0.558    357.1      15.28          0.186    312.7
TABLE I: Aggregate results for simulation and real-world experiments.
Fig. 4: Per-trajectory comparison of the performance of IV-SLAM and ORB-SLAM in the simulation experiment: (a) tracking failure count, (b) RMSE of translational error, and (c) RMSE of rotational error over consecutive fixed-length horizons.
Fig. 5: Per-trajectory comparison of the performance of IV-SLAM and ORB-SLAM in the real-world experiment: (a) tracking failure count, (b) RMSE of translational error, and (c) RMSE of rotational error over consecutive fixed-length horizons.

V-C Tracking Accuracy and Tracking Failures

We compare our introspection-enabled version of ORB-SLAM with the original algorithm in terms of camera pose tracking accuracy and robustness. Both algorithms are run on the test data and their estimated camera poses are recorded. If an algorithm loses track due to a lack of sufficient feature matches across frames, tracking is reinitialized and continued from after the point of failure along the trajectory, and the event is logged as an instance of tracking failure for the corresponding SLAM algorithm. The relative pose error (RPE) is then calculated for both algorithms at consecutive pairs of frames that are a fixed distance apart, as defined in [21]. Given the larger scale of the environment and the faster speed of the agent in the simulation dataset, we pick a larger frame separation for the simulation than for the real-world environment. Figures 4 and 5 compare the per-trajectory root-mean-square error (RMSE) of the rotational and translational parts of the RPE, as well as the tracking failure counts, for IV-SLAM and ORB-SLAM in both experimental environments. Table I summarizes the results, showing RMSE values calculated over all trajectories. The results demonstrate that IV-SLAM outperforms the original ORB-SLAM, both reducing tracking error and increasing the mean distance between failures (MDBF).
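A simplified sketch of the translational part of the RPE metric (positions only, no rotational alignment, and a frame-index step instead of a metric distance; all simplifications relative to the full RPE definition in [21]):

```python
import numpy as np

def translational_rpe_rmse(est, ref, step):
    """RMSE of the translational relative pose error over pose pairs
    `step` frames apart; est and ref are (N, 3) arrays of camera
    positions from the estimated and reference trajectories."""
    d_est = est[step:] - est[:-step]      # estimated relative motion
    d_ref = ref[step:] - ref[:-step]      # reference relative motion
    err = np.linalg.norm(d_est - d_ref, axis=1)
    return np.sqrt(np.mean(err ** 2))
```

A trajectory that is merely shifted by a constant offset has zero RPE, which is why the metric measures local drift rather than absolute position error.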

Fig. 6: Snapshots of IV-SLAM running on real-world data (top row) and in simulation (bottom row). Green and red points on the image represent the reliable and unreliable tracked image features, respectively, as predicted by the introspection model. Detected sources of failure include shadow of the robot, surface reflections, pedestrians, glare, and ambiguous image features extracted from high-frequency textures such as asphalt or vegetation.

V-D Qualitative Results

Fig. 7: Example deployment session of the robot. IV-SLAM successfully follows the reference camera trajectory while ORB-SLAM leads to severe tracking errors caused by image features extracted on the shadow of the robot.

In order to better understand how IV-SLAM improves upon the underlying SLAM algorithm, and what it has learned to achieve the improved performance, we look at sample qualitative results.

Figure 7 demonstrates an example deployment session of the robot from the real-world dataset and compares the reference pose of the camera with the estimated trajectories by both algorithms under test. It shows how image features extracted from the shadow of the robot cause significant tracking errors for ORB-SLAM, while IV-SLAM successfully handles such challenging situations. Figure 6 illustrates further potential sources of failure picked up by IV-SLAM during inference, and demonstrates that the algorithm has learned to down-weight image features extracted from sources such as shadow of the robot, surface reflections, lens glare, and pedestrians in order to achieve improved performance.

VI Conclusion

In this paper, we introduced IV-SLAM: a self-supervised approach for learning to predict sources of failure for V-SLAM and to estimate a context-aware noise model for image correspondences. We empirically demonstrated that IV-SLAM improves the accuracy and robustness of a state-of-the-art V-SLAM solution with extensive simulated and real-world data. IV-SLAM currently only uses static input images to predict the reliability of features. As future work, we would like to incorporate robot motion data in the design and also leverage the predictions of the introspection function in motion planning to reduce the probability of failures for V-SLAM.


  • [1] N. Alt, S. Hinterstoisser, and N. Navab (2010) Rapid selection of reliable templates for visual tracking. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1355–1362. Cited by: §I, §II.
  • [2] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk (2016) Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, Vol. 1, pp. 3. Cited by: §II.
  • [3] J. T. Barron (2019) A general and adaptive robust loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4331–4339. Cited by: §III, §IV.
  • [4] H. Bay, T. Tuytelaars, and L. Van Gool (2006) Surf: speeded up robust features. In European conference on computer vision, pp. 404–417. Cited by: §II.
  • [5] M. J. Black and A. Rangarajan (1996) On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. International journal of computer vision 19 (1), pp. 57–91. Cited by: §III.
  • [6] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart (2016) The EuRoC micro aerial vehicle datasets. The International Journal of Robotics Research 35 (10), pp. 1157–1163. Cited by: §V-A.
  • [7] L. Carlone and S. Karaman (2018) Attention and anticipation in fast visual-inertial navigation. IEEE Transactions on Robotics 35 (1), pp. 1–20. Cited by: §I, §II.
  • [8] T. Cieslewski, K. G. Derpanis, and D. Scaramuzza (2019) SIPs: succinct interest points from unsupervised inlierness probability learning. In 2019 International Conference on 3D Vision (3DV), pp. 604–613. Cited by: §II, §II.
  • [9] S. Daftry, S. Zeng, J. A. Bagnell, and M. Hebert (2016) Introspective perception: learning to predict failures in vision systems. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1743–1750. Cited by: §II.
  • [10] M. A. Fischler and R. C. Bolles (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §II.
  • [11] P. Ganti and S. L. Waslander (2019) Network uncertainty informed semantic feature selection for visual slam. In 2019 16th Conference on Computer and Robot Vision (CRV), Vol. , pp. 121–128. Cited by: §II.
  • [12] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §V-A.
  • [13] M. Kaneko, K. Iwami, T. Ogawa, T. Yamasaki, and K. Aizawa (2018) Mask-slam: robust feature-based monocular slam by masking using semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 258–266. Cited by: §II.
  • [14] K. Lenc and A. Vedaldi (2016) Learning covariant feature detectors. In European conference on computer vision, pp. 100–117. Cited by: §II.
  • [15] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas (2017) Working hard to know your neighbor’s margins: local descriptor learning loss. In Advances in Neural Information Processing Systems, pp. 4826–4837. Cited by: §II.
  • [16] K. Moo Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua (2018) Learning to find good correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2666–2674. Cited by: §II, §II.
  • [17] Y. Ono, E. Trulls, P. Fua, and K. M. Yi (2018) LF-net: learning local features from images. In Advances in neural information processing systems, pp. 6234–6244. Cited by: §I, §II.
  • [18] S. Rabiee and J. Biswas (2019) IVOA: introspective vision for obstacle avoidance. arXiv preprint arXiv:1903.01028. Cited by: §II.
  • [19] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §IV-A1.
  • [20] N. Savinov, A. Seki, L. Ladicky, T. Sattler, and M. Pollefeys (2017)

    Quad-networks: unsupervised learning to rank for interest point detection

    In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1822–1830. Cited by: §II.
  • [21] D. Schubert, T. Goll, N. Demmel, V. Usenko, J. Stückler, and D. Cremers (2018) The tum vi benchmark for evaluating visual-inertial odometry. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1680–1687. Cited by: §V-C.
  • [22] S. Shah, D. Dey, C. Lovett, and A. Kapoor (2017) AirSim: high-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, External Links: arXiv:1705.05065, Link Cited by: §V-A.
  • [23] T. Shan and B. Englot (2018) Lego-loam: lightweight and ground-optimized lidar odometry and mapping on variable terrain. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4758–4765. Cited by: §IV-A2, §V-A.
  • [24] X. Shi, D. Li, P. Zhao, Q. Tian, Y. Tian, Q. Long, C. Zhu, J. Song, F. Qiao, L. Song, et al. (2019) Are we ready for service robots? the openloris-scene datasets for lifelong slam. arXiv preprint arXiv:1911.05603. Cited by: §V-A.
  • [25] W. Sun, W. Jiang, E. Trulls, A. Tagliasacchi, and K. M. Yi (2020) ACNe: attentive context normalization for robust permutation-equivariant learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11286–11295. Cited by: §II, §II.
  • [26] Y. Tian, B. Fan, and F. Wu (2017) L2-net: deep learning of discriminative patch descriptor in euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 661–669. Cited by: §II.
  • [27] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas (2019) Sosnet: second order similarity regularization for local descriptor learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11016–11025. Cited by: §I, §II.
  • [28] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon (1999) Bundle adjustment–a modern synthesis. In International workshop on vision algorithms, pp. 298–372. Cited by: §I, §III.
  • [29] X. Wang and H. Zhang (2006) Good image features for bearing-only slam. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2576–2581. Cited by: §I, §II.
  • [30] A. R. Zamir, T. Wekel, P. Agrawal, C. Wei, J. Malik, and S. Savarese (2016)

    Generic 3d representation via pose estimation and matching

    In European Conference on Computer Vision, pp. 535–553. Cited by: §I, §II.
  • [31] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641. Cited by: §IV-A1.