Despite its simplicity and age, Random Sample Consensus (RANSAC) remains an important method for robust optimization, and it is a vital component of many state-of-the-art vision pipelines [39, 40, 29, 6]. RANSAC allows accurate estimation of model parameters from a set of observations of which some are outliers. To this end, RANSAC iteratively chooses random sub-sets of observations, so-called minimal sets, to create model hypotheses. Hypotheses are ranked according to their consensus with all observations, and the top-ranked hypothesis is returned as the final estimate.
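As a concrete illustration, the loop described above can be sketched for 2D line fitting; the residual threshold, the iteration count, and the line parametrization below are hypothetical choices for this toy example, not taken from the paper:

```python
import random

def fit_line(p, q):
    # Line through two points as (slope, intercept); assumes a non-vertical line.
    (x1, y1), (x2, y2) = p, q
    m = (y2 - y1) / (x2 - x1)
    return m, y1 - m * x1

def ransac_line(points, iterations=100, threshold=0.1, seed=0):
    rng = random.Random(seed)
    best_model, best_score = None, -1
    for _ in range(iterations):
        # Draw a minimal set (2 points for a line) and create a hypothesis.
        sample = rng.sample(points, 2)
        try:
            m, b = fit_line(*sample)
        except ZeroDivisionError:
            continue  # degenerate minimal set (vertical line)
        # Rank the hypothesis by its consensus with all observations (inlier count).
        score = sum(abs(y - (m * x + b)) < threshold for x, y in points)
        if score > best_score:
            best_model, best_score = (m, b), score
    return best_model, best_score
```

A single outlier in the minimal set corrupts the hypothesis entirely, but such hypotheses receive a low consensus score and are discarded; this is why RANSAC only needs one outlier-free minimal set to succeed.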
The main limitation of RANSAC is its poor performance in domains with many outliers. As the ratio of outliers increases, RANSAC requires exponentially many iterations to find an outlier-free minimal set. Implementations of RANSAC therefore often restrict the maximum number of iterations, and return the best model found so far.
In this work, we combine RANSAC with a neural network that predicts a weight for each observation. The weights ultimately guide the sampling of minimal sets. We call the resulting algorithm Neural-Guided RANSAC (NG-RANSAC). A comparison of our method with vanilla RANSAC can be seen in Fig. 1.
Yi train a neural network to classify observations as outliers or inliers, fitting final model parameters only to the latter. Although designed to replace RANSAC, their method achieves its best results when combined with RANSAC at test time, where RANSAC removes any outliers that the neural network might have missed. This motivates us to train the neural network in conjunction with RANSAC in a principled fashion, rather than imposing RANSAC afterwards.
Instead of interpreting the neural network output as soft inlier labels for a robust model fit, we let the output weights guide RANSAC hypothesis sampling. Intuitively, the neural network should learn to decrease weights for outliers, and increase them for inliers. Due to the robustness of RANSAC, this paradigm gives the neural network substantial flexibility, allowing a certain misclassification rate without negative effects on the final fitting accuracy. The distinction between inliers and outliers, as well as which misclassifications are tolerable, is solely guided by the minimization of the task loss function during training. Furthermore, our formulation of NG-RANSAC facilitates training with any (non-differentiable) task loss function, and any (non-differentiable) model parameter solver, making it broadly applicable. For example, when fitting essential matrices, we may use the 5-point algorithm rather than the (differentiable) 8-point algorithm which other learned robust estimators rely on [56, 36]. The flexibility in choosing the task loss also allows us to train NG-RANSAC in a self-supervised fashion, using maximization of the inlier count as the training objective.
They formulated a prior probability of sparse feature matches being valid based on matching scores. While this has a positive effect on RANSAC performance in some applications, feature matching scores, or other hand-crafted heuristics, were clearly not designed to guide hypothesis search. In particular, calibration of such ad-hoc measures can be difficult, as reliance on over-confident but wrong prior probabilities can yield situations where the same few observations are sampled repeatedly. This fact was recognized by Chum and Matas, who proposed PROSAC, a variant of RANSAC that uses side-information only to change the order in which RANSAC draws minimal sets. In the worst case, if the side-information is not useful at all, their method degenerates to vanilla RANSAC. NG-RANSAC takes a different approach by (i) learning the weights that guide hypothesis search rather than using hand-crafted heuristics, and (ii) integrating RANSAC itself into the training process, which leads to self-calibration of the predicted weights.
Recently, Brachmann proposed differentiable RANSAC (DSAC) to learn a camera re-localization pipeline. Unfortunately, we cannot directly use DSAC to learn hypothesis sampling, since DSAC is only differentiable w.r.t. observations, not sampling weights. However, NG-RANSAC applies a trick also used to make DSAC differentiable, namely the optimization of the expected task loss during training. While we do not rely on DSAC, neural guidance can be used in conjunction with DSAC (NG-DSAC) to train neural networks that predict observations and observation confidences at the same time.
We summarize our main contributions:
We present NG-RANSAC, a formulation of RANSAC with learned guidance of hypothesis sampling. We can use any (non-differentiable) task loss, and any (non-differentiable) minimal solver for training.
Choosing the inlier count itself as training objective facilitates self-supervised learning of NG-RANSAC.
We use NG-RANSAC to estimate epipolar geometry of image pairs from sparse correspondences, where it surpasses competing robust estimators.
We combine neural guidance with differentiable RANSAC (NG-DSAC) to train neural networks that make accurate predictions for some parts of the input, while neglecting other parts. These models achieve competitive results for horizon line estimation, and state-of-the-art accuracy for camera re-localization.
2 Related Work
RANSAC was introduced in 1981 by Fischler and Bolles. Since then, it has been extended in various ways; see the survey by Raguram. Combining some of the most promising improvements, Raguram created the Universal RANSAC (USAC) framework, which represents the state of the art among classic RANSAC variants. USAC includes guided hypothesis sampling according to PROSAC, more accurate model fitting according to Locally Optimized RANSAC, and more efficient hypothesis verification according to Optimal Randomized RANSAC. Many of the improvements proposed for RANSAC could also be applied to NG-RANSAC, since we do not require differentiability of such add-ons. We only impose restrictions on how hypotheses are generated, namely according to a learned probability distribution.
RANSAC is not often used in recent machine-learning-heavy vision pipelines. Notable exceptions include geometric problems like object instance pose estimation [3, 5, 21], and camera re-localization [41, 51, 28, 8, 46], where RANSAC is coupled with decision forests or neural networks that predict image-to-object correspondences. However, in most of these works, RANSAC is not part of the training process because of its non-differentiability. DSAC [4, 6] overcomes this limitation by making the hypothesis selection a probabilistic action, which facilitates optimization of the expected task loss during training. However, DSAC is limited in which derivatives can be calculated. DSAC allows differentiation w.r.t. observations; for example, we can use it to calculate gradients w.r.t. the image coordinates of a sparse correspondence. However, DSAC does not model observation selection, and hence we cannot use it to optimize a matching probability. By showing how to learn neural guidance, we close this gap. The combination with DSAC enables the full flexibility of learning both observations and their selection probabilities.
Besides DSAC, a differentiable robust estimator, there has recently been work on learning robust estimators. We discussed the work of Yi in the introduction. Ranftl and Koltun take a similar but iterative approach, reminiscent of Iteratively Reweighted Least Squares (IRLS), for fundamental matrix estimation. In each iteration, a neural network predicts observation weights for a weighted model fit, taking into account the residuals of the previous iteration. Both approaches have shown considerable improvements over vanilla RANSAC, but require differentiable minimal solvers and task loss functions. NG-RANSAC outperforms both approaches, and is more flexible when it comes to defining the training objective. This flexibility also enables us to train NG-RANSAC in a self-supervised fashion, which is possible with neither approach.
Preliminaries. We address the problem of fitting model parameters to a set of observations that are contaminated by noise and outliers. For example, the model could be a fundamental matrix that describes the epipolar geometry of an image pair, and the observations could be the set of SIFT correspondences we extract for the image pair. To calculate model parameters from observations, we utilize a solver, for example the 8-point algorithm. However, calculating the model from all observations will result in a poor estimate due to outliers. Instead, we can calculate the model from a small subset of observations of minimal cardinality, a so-called minimal set; for example, eight correspondences for a fundamental matrix when using the 8-point algorithm. RANSAC is an algorithm to choose an outlier-free minimal set such that the resulting estimate is accurate. To this end, RANSAC randomly chooses minimal sets to create a pool of model hypotheses.
RANSAC includes a strategy to adaptively choose the number of hypotheses, based on an online estimate of the outlier ratio. The strategy guarantees that an outlier-free set will be sampled with a user-defined probability. For tasks with large outlier ratios, the number of hypotheses calculated this way can be exponentially large, and it is usually clamped to a maximum value. For notational simplicity, we take the perspective of a fixed number of hypotheses, but do not restrict the use of an early-stopping strategy in practice.
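The standard adaptive stopping criterion can be sketched as follows; the confidence level and the maximum iteration count are illustrative defaults, not values from the paper:

```python
import math

def ransac_iterations(outlier_ratio, minimal_set_size,
                      confidence=0.99, max_iterations=100000):
    # Probability that a single random minimal set is outlier-free.
    w = (1.0 - outlier_ratio) ** minimal_set_size
    if w >= 1.0:
        return 1
    if w <= 0.0:
        return max_iterations
    # Smallest N such that 1 - (1 - w)^N >= confidence, clamped to a maximum.
    n = math.log(1.0 - confidence) / math.log(1.0 - w)
    return min(int(math.ceil(n)), max_iterations)
```

Note how quickly the count explodes: at 50% outliers an 8-element minimal set already requires over a thousand iterations, and at 90% outliers the formula exceeds any practical budget, which is exactly the regime neural guidance targets.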
RANSAC chooses a model hypothesis as the final estimate according to a scoring function:
The scoring function measures the consensus of a hypothesis with all observations, and is traditionally implemented as inlier counting.
Neural Guidance. RANSAC chooses observations uniformly at random to create the hypothesis pool. We aim at sampling observations instead according to a learned distribution that is parametrized by a neural network. Note that this is a categorical distribution over the discrete set of observations, not a continuous distribution in observation space. We wish to learn the network parameters in a way that increases the chance of selecting outlier-free minimal sets, which will result in accurate estimates. We sample a hypothesis pool by sampling observations and minimal sets independently,
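A minimal numeric sketch of such guided sampling, assuming the per-observation weights have already been predicted by a network (the weights below are hypothetical):

```python
import random

def sample_pool(weights, pool_size, set_size, seed=0):
    # Each observation index in each minimal set is drawn independently from
    # the categorical distribution given by `weights` (in NG-RANSAC these
    # weights would be the output of the neural network).
    rng = random.Random(seed)
    return [rng.choices(range(len(weights)), weights=weights, k=set_size)
            for _ in range(pool_size)]
```

Observations with weight zero are never selected, so a network that learns to down-weight outliers effectively shrinks the sampling domain to the inliers.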
From a pool, we estimate model parameters with RANSAC according to Eq. 1. For training, we assume that we can measure the quality of the estimate with a task loss function. The task loss can be calculated w.r.t. a ground truth model, or self-supervised, by using the inlier count of the final estimate. We wish to learn the distribution in a way that we receive a small task loss with high probability. Inspired by DSAC, we define our training objective as the minimization of the expected task loss:
We compute the gradients of the expected task loss w.r.t. the network parameters as
Integrating over all possible hypothesis pools to calculate the expectation is infeasible. Therefore, we approximate the gradients by drawing samples:
Note that gradients of the task loss function do not appear in the expression above. Therefore, differentiability of the task loss, the robust solver (RANSAC) or the minimal solver is not required. These components merely generate a training signal for steering the sampling probability in a good direction. Due to the approximation by sampling, the variance of the gradients of Eq. 5 can be high. We apply a standard variance-reduction technique from reinforcement learning by subtracting a baseline:
We found a simple baseline, the average loss over the samples drawn for a training image, to be sufficient. Subtracting the baseline moves the probability distribution towards hypothesis pools with lower-than-average loss for each training example.
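A toy numeric sketch of the resulting gradient estimate with the mean-loss baseline; the scalar `log_prob_grads` values stand in for the gradients of the log-probabilities of the sampled pools:

```python
def policy_gradient(log_prob_grads, losses):
    # Score-function (REINFORCE) estimate of the gradient of the expected
    # task loss, with the mean loss over samples subtracted as a baseline.
    # log_prob_grads[j]: d/dtheta log p(pool_j)  (scalar toy case)
    # losses[j]: task loss of the RANSAC estimate obtained from pool_j
    baseline = sum(losses) / len(losses)
    return sum(g * (l - baseline)
               for g, l in zip(log_prob_grads, losses)) / len(losses)
```

When all sampled pools achieve the same loss, the estimate is exactly zero: the baseline removes the constant component of the signal, which would otherwise only add variance.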
Combination with DSAC. Brachmann proposed a RANSAC-based pipeline where a neural network with learnable parameters predicts the observations themselves. End-to-end training of the pipeline, and therefore learning the observations, is possible by turning the hypothesis selection of RANSAC (cf. Eq. 1) into a probabilistic action:
This differentiable variant of RANSAC (DSAC) chooses a hypothesis randomly according to a distribution calculated from hypothesis scores. The training objective aims at learning network parameters such that hypotheses with low task loss are chosen with high probability:
In the following, we extend the formulation of DSAC with neural guidance (NG-DSAC). We let the neural network predict observations and, additionally, a probability associated with each observation. Intuitively, the neural network can express a confidence in its own predictions through this probability. This can be useful if a certain input to the neural network contains no information about the desired model. In this case, the observation prediction is necessarily an outlier, and the best the neural network can do is to label it as such by assigning a low probability. We combine the training objectives of NG-RANSAC (Eq. 3) and DSAC (Eq. 8), which yields:
where we again construct the pool from individual observation probabilities according to Eq. 2. The training objective of NG-DSAC consists of two expectations. Firstly, the expectation w.r.t. sampling a hypothesis pool according to the probabilities predicted by the neural network. Secondly, the expectation w.r.t. sampling a final estimate from the pool according to the scoring function. As in NG-RANSAC, we approximate the first expectation via sampling, as integrating over all possible hypothesis pools is infeasible. The second expectation we can calculate analytically, as in DSAC, since it integrates over the discrete set of hypotheses in a given pool. Similar to Eq. 6, we give the approximate gradients of NG-DSAC as:
where we use a shorthand for the task loss of a hypothesis. The calculation of gradients for NG-DSAC requires the derivative of the task loss (note the last part of Eq. 10), because the task loss depends on the network parameters via the observations. Therefore, training NG-DSAC requires a differentiable task loss function, a differentiable scoring function, and a differentiable minimal solver. Note that we inherit these restrictions from DSAC. In return, NG-DSAC allows for learning observations and observation confidences at the same time.
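The analytically computed inner expectation, with hypotheses selected via a softmax over their scores, can be sketched as follows (illustrative, not the paper's implementation):

```python
import math

def expected_loss(scores, losses):
    # DSAC-style probabilistic selection: hypothesis j is chosen with
    # probability softmax(scores)[j]; the objective is the expected loss.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # shift for numerical stability
    z = sum(exps)
    return sum(e / z * l for e, l in zip(exps, losses))
```

With uniform scores this reduces to the mean loss over the pool, while a strongly dominant score concentrates the expectation on that hypothesis, recovering the hard argmax selection of vanilla RANSAC in the limit.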
We evaluate neural guidance on multiple classic computer vision tasks. Firstly, we apply NG-RANSAC to estimating the epipolar geometry of image pairs in the form of essential matrices and fundamental matrices. Secondly, we apply NG-DSAC to horizon line estimation and camera re-localization. We present the main experimental results here, and refer to the appendix for details about network architectures, hyper-parameters and further qualitative results. Our implementation is based on PyTorch, and we will make the code publicly available for all tasks discussed below.
4.1 Essential Matrix Estimation
Epipolar geometry describes the geometry of two images that observe the same scene. In particular, two corresponding image points in the left and right image, belonging to the same 3D point, satisfy the epipolar constraint defined by the fundamental matrix. We can estimate the fundamental matrix uniquely (but only up to scale) from 8 correspondences, or from 7 correspondences with multiple solutions. The essential matrix is a special case of the fundamental matrix for the case that the calibration parameters of both cameras are known. The essential matrix can be estimated from 5 correspondences. Decomposing the essential matrix allows us to recover the relative pose between the observing cameras, and is a central step in image-based 3D reconstruction. As such, estimating the fundamental or essential matrix of an image pair is a classic and well-researched problem in computer vision.
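The relation between the two matrices, E = K2^T F K1 for intrinsics K1 and K2, can be written out directly (a sketch using plain nested lists rather than a linear algebra library):

```python
def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def essential_from_fundamental(F, K1, K2):
    # E = K2^T F K1: with known intrinsics K1, K2, the essential matrix is
    # the fundamental matrix expressed in normalized image coordinates.
    return matmul(matmul(transpose(K2), F), K1)
```

With identity intrinsics (i.e. already normalized image coordinates), the essential and fundamental matrices coincide.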
In the following, we first evaluate NG-RANSAC for the calibrated case and estimate essential matrices from SIFT correspondences. For the sake of comparability with the recent learned robust estimator of Yi, we adhere closely to their evaluation setup, and compare to their results.
Datasets. Yi evaluate their approach in outdoor as well as indoor settings. For the outdoor datasets, they select five scenes from a structure-from-motion (SfM) dataset: Buckingham, Notredame, Sacre Coeur, St. Peter’s and Reichstag. They pick two additional scenes from a second dataset: Fountain and Herzjesu. They reconstruct each scene using a SfM tool to obtain ‘ground truth’ camera poses, and co-visibility constraints for selecting image pairs. For indoor scenes, Yi choose 16 sequences from the SUN3D dataset, which readily comes with ground truth poses captured by KinectFusion. See Appendix A for a listing of all scenes. Indoor scenarios are typically very challenging for sparse feature-based approaches because of texture-less surfaces and repetitive elements (see Fig. 1 for an example). Yi train their best model using one outdoor scene (St. Peter’s) and one indoor scene (Brown 1), and test on all remaining sequences (6 outdoor, 15 indoor). Yi kindly provided us with their exact data splits, and we use their setup. Note that training and testing are performed on completely separate scenes; the neural network has to generalize to unknown environments.
Via the essential matrix, we recover the relative camera pose up to scale, and compare it to the ground truth pose as follows. We measure the angular error between the pose rotations, as well as the angular error between the pose translation vectors, in degrees. We take the maximum of the two values as the final angular error. We calculate the cumulative error curve for each test sequence, and compute the area under the curve (AUC) up to a threshold of 5, 10 or 20 degrees. Finally, we report the average AUC over all test sequences (but separately for the indoor and outdoor setting).
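A sketch of the AUC metric as the normalized area under the step-shaped cumulative error curve; the integration scheme below is an assumption, and the exact evaluation code of Yi may differ in detail:

```python
def pose_auc(errors, threshold):
    # Area under the cumulative error curve (recall over error) up to
    # `threshold`, normalized to [0, 1].
    errors = sorted(errors)
    n = len(errors)
    area, recall, prev = 0.0, 0.0, 0.0
    for i, e in enumerate(errors):
        if e > threshold:
            break
        area += recall * (e - prev)  # recall is a step function of the error
        recall = (i + 1) / n
        prev = e
    area += recall * (threshold - prev)
    return area / threshold
```

An AUC of 1 means every pose error is zero; an AUC of 0 means no pose error falls below the threshold.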
Implementation. Yi train a neural network to classify a set of sparse correspondences into inliers and outliers. They represent each correspondence as a 4D vector combining the 2D coordinates in the left and right image. Their network is inspired by PointNet, and processes each correspondence independently by a series of multilayer perceptrons (MLPs). Global context is infused by using instance normalization in-between layers. We re-build this architecture in PyTorch, and train it according to NG-RANSAC (Eq. 3). That is, the network predicts weights to guide RANSAC sampling instead of inlier class labels. We use the angular error between the estimated relative pose and the ground truth pose as task loss. As minimal solver, we use the 5-point algorithm. To speed up training, we initialize the network by learning to predict the distance of each correspondence to the ground truth epipolar line; see Appendix A for details. We initialize for 75k iterations, and train according to Eq. 3 for 25k iterations. We optimize using Adam with a learning rate of 10. For each training image, we extract 2000 SIFT correspondences, and sample hypothesis pools with hypotheses. We use a low number of hypotheses during training to obtain variation when sampling pools. For testing, we increase the number of hypotheses. We use an inlier threshold of 10 assuming normalized image coordinates using the camera calibration parameters.
Results. We compare NG-RANSAC to the inlier classification (InClass) of Yi. They use their approach with SIFT as well as LIFT features; the former works better for outdoor scenes, the latter for indoor scenes. We also include results for DeMoN, a learned SfM pipeline, and GMS, a semi-dense approach using ORB features. See Fig. 2 a) for results. RANSAC achieves poor results for indoor and outdoor scenes across all thresholds, scoring as the weakest method among all competitors. Coupling it with neural guidance (NG-RANSAC) elevates it to the leading position with a comfortable margin. See also Fig. 3 for qualitative results.
NG-RANSAC outperforms InClass of Yi despite some similarities: both use the same network architecture, are based on SIFT correspondences, and use RANSAC at test time. Yi train using a hybrid classification-regression loss based on the 8-point algorithm, and ultimately compare essential matrices using the squared error. Their training objective is therefore very different from the evaluation procedure. During evaluation, they use RANSAC with the 5-point algorithm on top of their inlier predictions, and measure the angular error. NG-RANSAC incorporates all these components in its training procedure, and therefore optimizes the correct objective.
Using Side-Information. The evaluation procedure defined by Yi is well suited to test a robust estimator in high-outlier domains. However, it underestimates what classical approaches can achieve on these datasets. Lowe’s ratio criterion is often deployed to drastically reduce the amount of outlier correspondences. It is based on the distance ratio between the best and second-best SIFT match. Matches with a ratio above a threshold (we use 0.8) are removed before running RANSAC. We denote the ratio criterion as +Ratio in Fig. 2 b), and observe a drastic improvement when it is combined with SIFT and RANSAC. This classic approach outperforms all learned methods of Fig. 2 a), and is competitive with NG-RANSAC (when not using any side-information). We can push performance even further by applying RootSIFT normalization to the SIFT descriptors. By training NG-RANSAC on ratio-filtered RootSIFT correspondences, using distance ratios as additional network input (denoted by +SI), we achieve the best accuracy.
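The ratio criterion itself is a one-liner; the tuple layout of the matches below is a hypothetical representation:

```python
def ratio_filter(matches, threshold=0.8):
    # Lowe's ratio criterion: keep a match only if the descriptor distance to
    # the best match is clearly smaller than to the second-best match.
    # matches: list of (match_id, best_distance, second_best_distance)
    return [mid for mid, d1, d2 in matches if d1 / d2 <= threshold]
```

A ratio near 1 indicates an ambiguous match (two descriptors are almost equally close), which is the typical signature of repetitive structure and hence of likely outliers.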
Self-supervised Learning. We can train NG-RANSAC in a self-supervised fashion when we define a task loss that assesses the quality of an estimate independent of a ground truth model. A natural choice is the inlier count of the final estimate. We found the inlier count to be a very stable training signal, even at the beginning of training, such that we require no special initialization of the network. We report results of self-supervised NG-RANSAC in Fig. 2 c). It outperforms all competitors, but achieves slightly worse accuracy than NG-RANSAC trained with supervision. A supervised task loss allows NG-RANSAC to adapt more precisely to the evaluation measure used at test time. For the datasets used so far, the process of image pairing relies on co-visibility information, and therefore on a form of supervision. In the next section, we train NG-RANSAC fully self-supervised by using the ordering of sequential data to assemble image pairs.
4.2 Fundamental Matrix Estimation
We apply NG-RANSAC to fundamental matrix estimation, comparing it to the learned robust estimator of Ranftl and Koltun, denoted Deep F-Mat. They propose an iterative procedure where a neural network estimates observation weights for a robust model fit. The residuals of the last iteration are an additional input to the network in the next iteration. The network architecture is similar to the one of Yi used in the previous section. Correspondences are represented as 4D vectors, and the descriptor matching ratio serves as an additional input. Each observation is processed by a series of MLPs with instance normalization interleaved. Deep F-Mat was published very recently, and its code is not yet available. We therefore follow the evaluation procedure described by the authors, and compare to their results.
Datasets. Ranftl and Koltun evaluate their method on various datasets that involve custom reconstructions which are not publicly available. Therefore, we compare to their method on the Kitti dataset, which is publicly available. Ranftl and Koltun train their method on sequences 00-05 of the Kitti odometry benchmark, and test on sequences 06-10. They form image pairs by taking subsequent images within a sequence. For each pair, they extract SIFT correspondences and apply Lowe’s ratio criterion with a threshold of 0.8.
Evaluation Metrics. Ranftl and Koltun evaluate using multiple metrics. They measure the percentage of inlier correspondences of the final model. They calculate the F-score over correspondences, where true positives are inliers of both the ground truth model and the estimated model. The F-score measures the alignment of the estimated and true fundamental matrix in image space. Both metrics use an inlier threshold of 0.1px. Finally, they calculate the mean and median epipolar error of inlier correspondences w.r.t. the ground truth model, using an inlier threshold of 1px.
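The F-score over correspondences can be sketched as follows, with the inlier sets represented as sets of correspondence indices (an assumption about the data layout):

```python
def f_score(est_inliers, gt_inliers):
    # True positives are correspondences that are inliers of both the
    # estimated and the ground truth fundamental matrix.
    tp = len(est_inliers & gt_inliers)
    if tp == 0:
        return 0.0
    precision = tp / len(est_inliers)
    recall = tp / len(gt_inliers)
    return 2 * precision * recall / (precision + recall)
```

As the harmonic mean of precision and recall, the F-score penalizes both spurious inliers of a wrong estimate and missed inliers of the ground truth model.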
Implementation. We cannot use the architecture of Deep F-Mat, which is designed for iterative application. Therefore, we re-use the architecture of Yi from the previous section for NG-RANSAC (also see Appendix B for details). We adhere to the training setup described in Sec. 4.1 with the following changes. We observed faster training convergence on Kitti, so we omit the initialization stage, and directly optimize the expected task loss (Eq. 3) for 300k iterations. Since Ranftl and Koltun evaluate using multiple metrics, the choice of the task loss function is not obvious. Hence, we train multiple variants with different objectives (%Inliers, F-score and Mean error) and report the corresponding results. As minimal solver, we use the 7-point algorithm and a RANSAC threshold of 0.1px, and we draw hypothesis pools per training image.
Results. We report results in Fig. 4, where we compare NG-RANSAC with RANSAC, USAC and Deep F-Mat. Note that USAC also uses guided sampling based on matching ratios according to the PROSAC strategy. NG-RANSAC outperforms the classical approaches RANSAC and USAC. NG-RANSAC also performs slightly better than Deep F-Mat. We observe that the choice of the training objective has a small but significant influence on the evaluation. All metrics are highly correlated, and optimizing a metric during training generally also achieves good (but not necessarily best) accuracy on this metric at test time. Interestingly, optimizing the inlier count during training performs competitively, although it is a self-supervised objective. Fig. 3 shows a qualitative result on Kitti.
4.3 Horizon Lines
We fit a parametric model, the horizon line, to a single image. The horizon can serve as a cue in image understanding or for image editing. Traditionally, this task is solved via vanishing-point detection and geometric reasoning [37, 24, 57, 42], often assuming a Manhattan or Atlanta world. We take a simpler approach and use a general-purpose CNN that predicts a set of 64 2D points from the image, to which we fit a line with RANSAC; see Fig. 5. The network has two output branches (see Appendix C for details), predicting (i) the 2D points, and (ii) probabilities for guided sampling.
Dataset. We evaluate on the HLW dataset, which is a collection of SfM datasets with annotated horizon lines. Test and training images partly show the same scenes, and the horizon line can lie outside the image area.
Evaluation Metric. As is common practice on HLW, we measure the maximum distance between the estimated horizon and ground truth within the image, normalized by image height. We calculate the AUC of the cumulative error curve up to a threshold of 0.25.
Implementation. We train using the NG-DSAC objective (Eq. 9) from scratch for 300k iterations. As task loss, we use the normalized maximum distance between the estimated and the true horizon. For hypothesis scoring, we use a soft inlier count. We train using Adam with a learning rate of 10. For each training image, we draw hypothesis pools with hypotheses. We also draw 16 hypotheses at test time. We compare to DSAC, which we train similarly but with the probability branch disabled.
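A soft inlier count replaces the hard threshold test with a sigmoid so that the score becomes differentiable; the softness parameter `beta` below is a hypothetical choice, not the paper's value:

```python
import math

def soft_inlier_count(residuals, threshold, beta=5.0):
    # Sigmoid relaxation of the hard inlier test r < threshold; `beta`
    # controls how sharply the sigmoid approximates the hard threshold.
    total = 0.0
    for r in residuals:
        z = beta * (r - threshold)
        if z >= 0:  # numerically stable sigmoid evaluation
            e = math.exp(-z)
            total += e / (1.0 + e)
        else:
            total += 1.0 / (1.0 + math.exp(z))
    return total
```

For large `beta` this converges to ordinary inlier counting, while small `beta` gives smoother, lower-variance gradients during end-to-end training.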
Results. We report results in Fig. 5. DSAC and NG-DSAC achieve competitive accuracy on this dataset, ranking among the top methods. NG-DSAC has a small but significant advantage over DSAC alone. Our method is only surpassed by SLNet, an architecture designed to find semantic lines in images. SLNet generates a large number of random candidate lines, selects a candidate via classification, and refines it with a predicted offset. In principle, SLNet could be coupled with neural guidance for informed candidate sampling. Unfortunately, the code of SLNet is not online, and the authors did not respond to inquiries.
4.4 Camera Re-Localization
We estimate the absolute 6D camera pose (position and orientation) w.r.t. a known scene from a single RGB image.
Dataset. We evaluate on the Cambridge Landmarks dataset. It is comprised of RGB images depicting five landmark buildings in Cambridge, UK. (We omitted one additional scene, Street: like other learning-based methods before, we failed to achieve sensible results for this scene. By visual inspection, the corresponding SfM reconstruction seems to be of poor quality, which potentially harms the training process.) Ground truth poses were generated by running a SfM pipeline.
Evaluation Metric. We measure the median translational error of estimated poses for each scene. (The median rotational accuracies are between 0.2 and 0.3 degrees for all scenes, and hardly vary between methods.)
Implementation. We build on the publicly available DSAC++ pipeline, which is a scene coordinate regression method. A neural network predicts a 3D coordinate in scene space for each image pixel. We recover the pose from these 2D-3D correspondences using a perspective-n-point solver within a RANSAC loop. The DSAC++ pipeline implements geometric pose optimization in a fully differentiable way, which facilitates end-to-end training. We re-implement the neural network integration of DSAC++ in PyTorch (the original uses LUA/Torch). We also update the network architecture of DSAC++ by using a ResNet instead of a VGGNet. As with horizon line estimation, we add a second output branch to the network that estimates a probability distribution over scene coordinate predictions for guided RANSAC sampling. We denote this extended architecture NG-DSAC++. We adhere to the training procedure and hyper-parameters of DSAC++ (see Appendix D), but optimize the NG-DSAC objective (Eq. 9) during end-to-end training. As task loss, we use the average of the rotational and translational error w.r.t. the ground truth pose. We sample hypothesis pools per training image, and increase the number of hypotheses for testing.
Results. We report our quantitative results in Fig. 7. Firstly, we observe a significant improvement for most scenes when using DSAC++ with a ResNet architecture. Secondly, comparing DSAC++ with NG-DSAC++, we notice a small to moderate, but consistent, improvement in accuracy. The advantage of using neural guidance is largest for the Great Court scene, which features large ambiguous grass areas, and large areas of sky visible in many images. NG-DSAC++ learns to ignore such areas; see the visualization in Fig. 6 a). The network learns to mask these areas solely guided by the task loss during training, as it fails to predict accurate scene coordinates for them. In Fig. 6 b), we visualize the internal representation learned by DSAC++ and NG-DSAC++ for one scene. The representation of DSAC++ is very noisy, as it tries to optimize geometric constraints for sky and grass pixels. NG-DSAC++ learns a cleaner representation by focusing entirely on buildings.
We have presented NG-RANSAC, a robust estimator using guided hypothesis sampling according to learned probabilities. For training, we can incorporate non-differentiable task loss functions and non-differentiable minimal solvers. Using the inlier count as training objective allows us to train NG-RANSAC in a self-supervised fashion. We applied NG-RANSAC to multiple classic computer vision tasks and observed a consistent improvement over RANSAC alone.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 647769). The computations were performed on an HPC Cluster at the Center for Information Services and High Performance Computing (ZIH) at TU Dresden.
Appendix A Essential Matrix Estimation
List of Scenes Used for Training and Testing.
St. Peter's (Outdoor)
brown_bm_3 - brown_bm_3 (Indoor)
brown_cogsci_2 - brown_cogsci_2
brown_cogsci_6 - brown_cogsci_6
brown_cogsci_8 - brown_cogsci_8
brown_cs_3 - brown_cs3
brown_cs_7 - brown_cs7
harvard_c4 - hv_c4_1
harvard_c10 - hv_c10_2
harvard_corridor_lounge - hv_lounge1_2
harvard_robotics_lab - hv_s1_2
hotel_florence_jx - florence_hotel_stair_room_all
mit_32_g725 - g725_1
mit_46_6conf - bcs_floor6_conf_1
mit_46_6lounge - bcs_floor6_long
mit_w85g - g_0
mit_w85h - h2_1
Network Architecture. As mentioned in the main paper, we replicated the architecture of Yi  for our experiments on epipolar geometry (estimating essential and fundamental matrices). For a schematic overview, see Fig. 8. The network takes a set of feature correspondences as input, and predicts as output a weight for each correspondence, which we use to guide RANSAC hypothesis sampling. The network consists of a series of multilayer perceptrons (MLPs) that process each correspondence independently. We implement the MLPs with convolutions. The network infuses global context via instance normalization layers , and accelerates training via batch normalization . The main body of the network is comprised of 12 blocks with skip connections . Each block consists of two linear layers, each followed by instance normalization, batch normalization and a ReLU activation. We apply a Sigmoid activation to the last layer, and normalize by dividing by the sum of outputs. (The original architecture of Yi  uses a slightly different output processing, since they use the output as weights for a robust model fit: a ReLU activation followed by a tanh activation.)
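The block structure and output normalization described above can be sketched as follows. This is a minimal PyTorch sketch, not the authors' implementation; the channel width and all module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResNetBlock(nn.Module):
    """One main-body block: two 1x1 convolutions over the set of
    correspondences, each followed by instance normalization, batch
    normalization and ReLU, plus a skip connection."""
    def __init__(self, channels=128):
        super().__init__()
        def unit():
            return nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.InstanceNorm1d(channels),  # infuses global context
                nn.BatchNorm1d(channels),     # accelerates training
                nn.ReLU(inplace=True),
            )
        self.body = nn.Sequential(unit(), unit())

    def forward(self, x):
        # x: (batch, channels, num_correspondences)
        return x + self.body(x)

def output_weights(logits):
    """Sigmoid on the last layer, then normalization by the sum of
    outputs, yielding a sampling distribution over correspondences."""
    w = torch.sigmoid(logits)
    return w / w.sum(dim=-1, keepdim=True)
```

Twelve such blocks stacked form the main body; the final per-correspondence logits are turned into sampling weights by `output_weights`.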
Initialization Procedure. We initialize our network in the following way. We define a target sampling distribution using the ground truth essential matrix given for each training pair. Intuitively, the target distribution should return a high probability when a correspondence is aligned with the ground truth essential matrix $E$, and a low probability otherwise. We assume that correspondence $y = (x, x')$ is a 4D vector containing two 2D image coordinates $x$ and $x'$ (3D in homogeneous coordinates). We define the epipolar error of correspondence $y$ w.r.t. essential matrix $E$:
\[
e(y, E) = \frac{\left(x'^\top E x\right)^2}{[Ex]_1^2 + [Ex]_2^2 + [E^\top x']_1^2 + [E^\top x']_2^2},
\]
where $[\cdot]_j$ returns the $j$th entry of a vector. Using the epipolar error, we define the target sampling distribution:
\[
p(y; E) \propto \exp\left(-\frac{e(y, E)}{\tau}\right).
\]
Parameter $\tau$ controls the softness of the target distribution; we set it to the inlier threshold we use for RANSAC. To initialize our network, we minimize the KL divergence between the network prediction and the target distribution. We initialize for 75k iterations using Adam  with a learning rate of and a batch size of 32.
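The epipolar error and target distribution can be sketched in NumPy as below. This is a hedged reconstruction: the Sampson-style error form and the placeholder softness value are assumptions consistent with the definitions above, not the authors' exact code.

```python
import numpy as np

def epipolar_error(x, xp, E):
    """Epipolar (Sampson-style) error of a correspondence (x, x')
    under essential matrix E; x, x' are homogeneous 3-vectors."""
    Ex = E @ x
    Etxp = E.T @ xp
    num = float(xp @ E @ x) ** 2
    den = Ex[0]**2 + Ex[1]**2 + Etxp[0]**2 + Etxp[1]**2
    return num / den

def target_distribution(corrs, E, tau=1e-3):
    """Normalized target sampling probabilities: high for
    correspondences aligned with the ground truth E, low otherwise.
    tau (the softness / inlier threshold) is a placeholder value."""
    errs = np.array([epipolar_error(x, xp, E) for x, xp in corrs])
    p = np.exp(-errs / tau)
    return p / p.sum()
```

During initialization, the network's predicted weights would be fit to this distribution via the KL divergence.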
Implementation Details. For the following components we rely on the implementations provided by OpenCV : the 5-point algorithm , epipolar error, SIFT features , feature matching, and essential matrix decomposition. We extract 2000 features per input image, which yields 2000 correspondences per image pair after matching. When applying Lowe’s ratio criterion  for filtering, and hence reducing the number of correspondences, we randomly duplicate correspondences to restore the number of 2000. We minimize the expected task loss using Adam  with a learning rate of and a batch size of 32. We choose hyperparameters based on the validation error of the Reichstag scene. We observe that the magnitude of the validation error corresponds well to the magnitude of the training error, so a validation set would not be strictly required.
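The ratio filtering with subsequent random duplication can be sketched as follows. The input format and the ratio value are illustrative assumptions; in practice the `(best, second-best)` distances would come from a k=2 nearest-neighbour SIFT matcher.

```python
import numpy as np

def ratio_filter_and_pad(matches, n=2000, ratio=0.8, rng=None):
    """Apply Lowe's ratio criterion to candidate matches, then
    randomly duplicate the survivors to restore exactly n
    correspondences. `matches` is a list of
    (correspondence, best_dist, second_dist) tuples."""
    rng = rng or np.random.default_rng(0)
    kept = [c for c, d1, d2 in matches if d1 < ratio * d2]
    kept = np.array(kept)
    if 0 < len(kept) < n:
        # duplicate random survivors until the count is restored
        idx = rng.integers(len(kept), size=n - len(kept))
        kept = np.concatenate([kept, kept[idx]])
    return kept
```

Restoring a fixed correspondence count keeps the network input size constant regardless of how aggressively the ratio test filters.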
Qualitative Results. We present additional qualitative results for indoor and outdoor scenarios in Fig. 9. We compare results of RANSAC and NG-RANSAC, also visualizing neural guidance as predicted by our network. We obtain these results in the high-outlier setup, without using Lowe’s ratio criterion and without using side-information as additional network input.
Appendix B Fundamental Matrix Estimation
Implementation Details. We reuse the architecture of Fig. 8. We normalize the image coordinates of feature matches before passing them to the network: we subtract the mean coordinate and divide by the coordinate standard deviation, where we calculate mean and standard deviation over the training set. Ranftl and Koltun  fit the final fundamental matrix to the top 20 weighted correspondences as predicted by their network. Similarly, we re-fit the final fundamental matrix to the largest inlier set found by NG-RANSAC. This refinement step results in a small but noticeable increase in accuracy. For the following components we rely on the implementations provided by OpenCV : the 7-point algorithm, epipolar error, SIFT features  and feature matching.
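The coordinate normalization described above is a standard standardization; a minimal sketch (function names are ours):

```python
import numpy as np

def training_stats(train_corrs):
    """Per-dimension mean and standard deviation computed once over
    all correspondences of the training set."""
    stacked = np.concatenate(train_corrs, axis=0)
    return stacked.mean(axis=0), stacked.std(axis=0)

def normalize_coords(corrs, mean, std):
    """Standardize feature-match coordinates with the training-set
    statistics before feeding them to the network."""
    return (corrs - mean) / std
```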
Appendix C Horizon Lines
Network Architecture. We provide a schematic of our network architecture for horizon line estimation in Fig. 11. The network takes a px RGB image as input. We re-scale images of arbitrary aspect ratio such that the long side is px, and symmetrically zero-pad the short side to px. The network has two output branches. The first branch predicts a set of 2D points, our observations , to which we fit the horizon line. We apply a Sigmoid and re-scale output points to [-1.5, 1.5] in relative image coordinates to support horizon lines outside the image area. We implement the network in a fully convolutional way : each output point is predicted for a patch, i.e. a restricted receptive field, of the input image. Therefore, we shift the coordinate of each output point to the center of its associated patch.
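The shift of each output point to its patch center can be sketched as below. This is our reading of the description, with an assumed convention that the predicted offsets in [-1.5, 1.5] are expressed relative to the image extent; the exact convention is not specified in the text.

```python
import numpy as np

def to_image_points(raw, patch_size):
    """raw: (H, W, 2) per-patch network outputs in [-1.5, 1.5]
    relative coordinates. Shift each predicted point to the centre
    of its associated patch and return absolute pixel positions."""
    H, W, _ = raw.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # centre pixel of each patch in the (padded) input image
    centers = np.stack([xs + 0.5, ys + 0.5], axis=-1) * patch_size
    # offsets are scaled by the full image extent (assumption)
    extent = np.array([W * patch_size, H * patch_size])
    return centers + raw * extent
```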
The second output branch predicts sampling probabilities for the output points. We apply a Sigmoid to the output of the second branch, and normalize by dividing by the sum of outputs. During training, we block the gradients of the second output branch when back-propagating to the base network. The sampling gradients have larger variance and magnitude than the observation gradients of the first branch, especially in the beginning of training, which has a negative effect on the convergence of the network as a whole. Intuitively, we want to give priority to the observation predictions because they determine the accuracy of the final model parameters. The sampling prediction should address deficiencies in the observation predictions without influencing them too much. The gradient blockade ensures these properties.
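In PyTorch, the gradient blockade amounts to detaching the base features before the sampling branch. A minimal sketch (layer sizes are illustrative, not the actual architecture):

```python
import torch
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    """Base network with an observation head and a sampling head;
    gradients from the sampling head are blocked from reaching the
    base network via detach()."""
    def __init__(self, dim=64):
        super().__init__()
        self.base = nn.Linear(8, dim)
        self.obs_head = nn.Linear(dim, 2)   # 2D observations
        self.prob_head = nn.Linear(dim, 1)  # sampling logits

    def forward(self, x):
        feat = torch.relu(self.base(x))
        obs = self.obs_head(feat)
        # detach(): the sampling branch cannot back-propagate
        # into the base network
        logits = self.prob_head(feat.detach()).squeeze(-1)
        w = torch.sigmoid(logits)
        probs = w / w.sum(dim=-1, keepdim=True)
        return obs, probs
```

Losses on `probs` then update only the sampling head, while losses on `obs` train the full network.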
Implementation Details. We use a differentiable soft inlier count  as scoring function:
\[
s(h, \mathcal{Y}) = \sum_{i} \operatorname{sig}\left[\beta\left(\tau - d(y_i, h)\right)\right],
\]
where $d(y_i, h)$ denotes the point-line distance between observation $y_i$ and line hypothesis $h$. Hyperparameter $\alpha$ determines the softness of the scoring distribution in DSAC, $\beta$ determines the softness of the Sigmoid, and $\tau$ is the inlier threshold. We choose these hyperparameters based on a grid search for the minimal training loss.
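The soft inlier count is a one-liner; writing it in PyTorch makes its differentiability explicit (parameter names follow the text, values are for illustration only):

```python
import torch

def soft_inlier_count(dists, beta, tau):
    """Differentiable soft inlier count: a sigmoid-relaxed count of
    observations whose distance to the hypothesis is below the
    inlier threshold tau, with sigmoid softness beta."""
    return torch.sigmoid(beta * (tau - dists)).sum()
```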
As discussed in the main paper, we use the normalized maximum distance between a line hypothesis and the ground truth horizon in the image as task loss . This can lead to stability issues when we sample line hypotheses with very steep slope. Therefore, we clamp the task loss to a maximum of 1, the normalized image height.
As mentioned before, some images in the HLW dataset  have their horizon outside the image. Some of these images contain virtually no visual cue as to where the horizon lies. Therefore, we find it beneficial to use a robust variant of the task loss that limits the influence of such outliers:
\[
\ell'(h, h^*) =
\begin{cases}
\ell(h, h^*) & \text{if } \ell(h, h^*) < t,\\
\sqrt{\ell(h, h^*)} & \text{otherwise,}
\end{cases}
\]
i.e. we use the square root of the task loss after a magnitude of $t$, which is the magnitude up to which the AUC is calculated when evaluating on HLW .
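Combined with the clamping to the normalized image height, the robust loss reads as follows. This is a literal sketch of the prose description; the switch point `t` is left as a parameter because its value is elided in the text.

```python
import math

def robust_task_loss(loss, t):
    """Robust horizon-line task loss: clamp to 1 (the normalized
    image height), use the raw loss below magnitude t and its
    square root above it."""
    loss = min(loss, 1.0)
    return loss if loss < t else math.sqrt(loss)
```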
Appendix D Camera Re-Localization
Network Architecture. We provide a schematic of our network architecture for camera re-localization in Fig. 13. The network is a FCN  that takes an RGB image as input and predicts dense outputs, sub-sampled by a factor of 8. The network has two output branches. The first branch predicts 3D scene coordinates , our observations , to which we fit the 6D camera pose. The second output branch predicts sampling probabilities for the scene coordinates. We apply a Sigmoid to the output of the second branch, and normalize by dividing by the sum of outputs. During training, we block the gradients of the second output branch when back-propagating to the base network. The sampling gradients have larger variance and magnitude than the observation gradients of the first branch, especially in the beginning of training, which has a negative effect on the convergence of the network as a whole. Intuitively, we want to give priority to the scene coordinate predictions because they determine the accuracy of the pose estimate. The sampling prediction should address deficiencies in the scene coordinate predictions without influencing them too much. The gradient blockade ensures these properties.
Implementation details. We follow the three-stage training procedure proposed by Brachmann and Rother for DSAC++ .
Firstly, we optimize the distance between predicted and ground truth scene coordinates. We obtain ground truth scene coordinates by rendering the sparse reconstructions given in the Cambridge Landmarks dataset . We ignore pixels with no corresponding 3D point in the reconstruction. Since the reconstructions contain outlier 3D points, we use the following robust distance:
\[
e(y, y^*) =
\begin{cases}
\|y - y^*\| & \text{if } \|y - y^*\| \le 10\,\text{m},\\
\sqrt{\|y - y^*\|} & \text{otherwise,}
\end{cases}
\]
i.e. we use the Euclidean distance up to a threshold of 10m, after which we use the square root of the Euclidean distance. We train the first stage for 500k iterations using Adam  with a learning rate of and a batch size of 1 image.
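A vectorized sketch of this robust scene-coordinate distance, again following the prose description literally:

```python
import numpy as np

def robust_coord_distance(pred, gt, t=10.0):
    """Robust distance for scene-coordinate initialization:
    Euclidean distance up to t = 10m, square root of the distance
    beyond it, limiting the influence of outlier 3D points."""
    d = np.linalg.norm(pred - gt, axis=-1)
    return np.where(d <= t, d, np.sqrt(d))
```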
Secondly, we optimize the reprojection error of the scene coordinate predictions to the ground truth camera pose. Similar to the first stage, we use a robust distance function with a threshold of 10px after which we use the square root of the reprojection error. We train the second stage for 300k iterations using Adam  with a learning rate of and a batch size of 1 image.
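The reprojection error minimized in this stage can be sketched as below; the pinhole projection and the argument layout are standard assumptions, not the authors' exact code.

```python
import numpy as np

def reprojection_errors(scene_coords, R, t, K, pixels):
    """Reprojection error of predicted 3D scene coordinates under
    the ground truth pose (R, t) with camera intrinsics K, measured
    against the pixel positions the coordinates were predicted for."""
    cam = scene_coords @ R.T + t        # world -> camera frame
    proj = cam @ K.T                    # camera -> image plane
    proj = proj[:, :2] / proj[:, 2:3]   # perspective division
    return np.linalg.norm(proj - pixels, axis=1)
```

In training, these per-pixel errors would be passed through the same square-root robustification beyond the 10px threshold.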
Thirdly, we optimize the expected task loss according to the NG-DSAC objective, as explained in the main paper. As task loss we use . We measure the angle between the estimated camera rotation and the ground truth rotation in degrees, and the distance between the estimated camera position and the ground truth position in meters. As with horizon line estimation (see previous section), we use a soft inlier count as hypothesis scoring function with hyperparameters , and . We train the third stage for 200k iterations using Adam  with a learning rate of and a batch size of 1 image.
Learned 3D Representations. We visualize the internal 3D scene representations learned by DSAC++ and NG-DSAC++ in Fig. 14 for two more scenes.
-  R. Arandjelovic. Three things everyone should know to improve object retrieval. In CVPR, 2012.
-  J. Bian, W.-Y. Lin, Y. Matsushita, S.-K. Yeung, T. D. Nguyen, and M.-M. Cheng. GMS: Grid-based motion statistics for fast, ultra-robust feature correspondence. In CVPR, 2017.
-  E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother. Learning 6D object pose estimation using 3D object coordinates. In ECCV, 2014.
-  E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother. DSAC-Differentiable RANSAC for camera localization. In CVPR, 2017.
-  E. Brachmann, F. Michel, A. Krull, M. Y. Yang, S. Gumhold, and C. Rother. Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In CVPR, 2016.
-  E. Brachmann and C. Rother. Learning less is more-6D camera localization via 3D surface regression. In CVPR, 2018.
-  G. Bradski. OpenCV. Dr. Dobb’s Journal of Software Tools, 2000.
-  T. Cavallari, S. Golodetz, N. A. Lord, J. Valentin, L. Di Stefano, and P. H. Torr. On-the-fly adaptation of regression forests for online camera relocalisation. In CVPR, 2017.
-  O. Chum and J. Matas. Matching with PROSAC - Progressive sample consensus. In CVPR, 2005.
-  O. Chum and J. Matas. Optimal randomized RANSAC. TPAMI, 2008.
-  O. Chum, J. Matas, and J. Kittler. Locally optimized RANSAC. In DAGM, 2003.
-  M. A. Fischler and R. C. Bolles. Random Sample Consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981.
-  X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng. Complete solution classification for the perspective-three-point problem. TPAMI, 2003.
-  A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
-  R. I. Hartley. In defense of the eight-point algorithm. TPAMI, 1997.
-  R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
-  K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  J. Heinly, J. L. Schönberger, E. Dunn, and J.-M. Frahm. Reconstructing the World* in Six Days *(As Captured by the Yahoo 100 Million Image Dataset). In CVPR, 2015.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
-  O. H. Jafari, S. K. Mustikovela, K. Pertsch, E. Brachmann, and C. Rother. iPose: Instance-aware 6D pose estimation of partly occluded objects. 2018.
-  A. Kendall, M. Grimes, and R. Cipolla. PoseNet: A convolutional network for real-time 6-DoF camera relocalization. In ICCV, 2015.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
-  F. Kluger, H. Ackermann, M. Y. Yang, and B. Rosenhahn. Deep learning for vanishing point detection using an inverse gnomonic projection. In GCPR, 2017.
-  J.-T. Lee, H.-U. Kim, C. Lee, and C.-S. Kim. Semantic line detection and its applications. In ICCV, 2017.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
-  D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
-  D. Massiceti, A. Krull, E. Brachmann, C. Rother, and P. H. S. Torr. Random forests versus neural networks - what’s best for camera localization? In ICRA, 2017.
-  R. Mur-Artal and J. D. Tardós. ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. T-RO, 2017.
-  R. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proc. ISMAR, 2011.
-  D. Nistér. An efficient solution to the five-point relative pose problem. TPAMI, 2004.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
-  C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
-  R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J.-M. Frahm. USAC: A universal framework for random sample consensus. TPAMI, 2013.
-  R. Raguram, J.-M. Frahm, and M. Pollefeys. A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus. In ECCV, 2008.
-  R. Ranftl and V. Koltun. Deep fundamental matrix estimation. In ECCV, 2018.
-  C. Rother. A new approach for vanishing point detection in architectural environments. In BMVC, 2002.
-  E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
-  T. Sattler, B. Leibe, and L. Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. TPAMI, 2016.
-  J. L. Schönberger and J.-M. Frahm. Structure-from-Motion Revisited. In CVPR, 2016.
-  J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In CVPR, 2013.
-  G. Simon, A. Fond, and M.-O. Berger. A-contrario horizon-first vanishing point detection using second-order grouping laws. In ECCV, 2018.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
-  C. Strecha, W. von Hansen, L. J. V. Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In CVPR, 2008.
-  R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
-  H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In CVPR, 2018.
-  B. Tordoff and D. W. Murray. Guided sampling and consensus for motion estimation. In ECCV, 2002.
-  P. H. S. Torr and A. Zisserman. MLESAC: A new robust estimator with application to estimating image geometry. CVIU, 2000.
-  D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, 2016.
-  B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion network for learning monocular stereo. In CVPR, 2017.
-  J. Valentin, M. Nießner, J. Shotton, A. Fitzgibbon, S. Izadi, and P. H. S. Torr. Exploiting uncertainty in regression forests for accurate camera relocalization. In CVPR, 2015.
-  S. Workman, M. Zhai, and N. Jacobs. Horizon lines in the wild. In BMVC, 2016.
-  C. Wu. Towards linear-time incremental structure from motion. In 3DV, 2013.
-  J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In ICCV, 2013.
-  K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned invariant feature transform. In ECCV, 2016.
-  K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua. Learning to find good correspondences. In CVPR, 2018.
-  M. Zhai, S. Workman, and N. Jacobs. Detecting vanishing points using global image context in a non-manhattan world. In CVPR, 2016.