1 Introduction
Despite its simplicity and age, Random Sample Consensus (RANSAC) [12] remains an important method for robust optimization, and is a vital component of many state-of-the-art vision pipelines [39, 40, 29, 6]. RANSAC allows accurate estimation of model parameters from a set of observations of which some are outliers. To this end, RANSAC iteratively chooses random subsets of observations, so-called minimal sets, to create model hypotheses. Hypotheses are ranked according to their consensus with all observations, and the top-ranked hypothesis is returned as the final estimate.
The main limitation of RANSAC is its poor performance in domains with many outliers. As the ratio of outliers increases, RANSAC requires exponentially many iterations to find an outlier-free minimal set. Implementations of RANSAC therefore often restrict the maximum number of iterations, and return the best model found so far [7].
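To make the exponential growth concrete, the standard analysis gives the number of iterations needed to sample at least one outlier-free minimal set with a desired confidence. A minimal sketch (the function name and defaults are ours, not from the paper):

```python
import math

def ransac_iterations(confidence, outlier_ratio, minimal_set_size):
    """Iterations needed so that, with probability `confidence`, at least one
    sampled minimal set is outlier-free."""
    w = 1.0 - outlier_ratio                  # inlier ratio
    p_clean = w ** minimal_set_size          # chance one minimal set is outlier-free
    if p_clean >= 1.0:
        return 1
    if p_clean <= 0.0:
        return math.inf
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_clean))
```

For essential matrix fitting with a minimal set size of 5, already 70% outliers push the count well past a thousand iterations, which motivates the clamping mentioned above.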
In this work, we combine RANSAC with a neural network that predicts a weight for each observation. The weights ultimately guide the sampling of minimal sets. We call the resulting algorithm Neural-Guided RANSAC (NG-RANSAC). A comparison of our method with vanilla RANSAC can be seen in Fig. 1.
When developing NG-RANSAC, we took inspiration from recent work on learned robust estimators [56, 36]. In particular, Yi et al. [56] train a neural network to classify observations as outliers or inliers, fitting final model parameters only to the latter. Although designed to replace RANSAC, their method achieves best results when combined with RANSAC at test time, where it removes any outliers that the neural network might have missed. This motivates us to train the neural network in conjunction with RANSAC in a principled fashion, rather than imposing it afterwards.
Instead of interpreting the neural network output as soft inlier labels for a robust model fit, we let the output weights guide RANSAC hypothesis sampling. Intuitively, the neural network should learn to decrease weights for outliers, and increase them for inliers. This paradigm yields substantial flexibility for the neural network, allowing a certain misclassification rate without negative effects on the final fitting accuracy due to the robustness of RANSAC. The distinction between inliers and outliers, as well as which misclassifications are tolerable, is solely guided by the minimization of the task loss function during training. Furthermore, our formulation of NG-RANSAC facilitates training with any (non-differentiable) task loss function, and any (non-differentiable) model parameter solver, making it broadly applicable. For example, when fitting essential matrices, we may use the 5-point algorithm rather than the (differentiable) 8-point algorithm which other learned robust estimators rely on
[56, 36]. The flexibility in choosing the task loss also allows us to train NG-RANSAC self-supervised, using maximization of the inlier count as training objective.

The idea of using guided sampling in RANSAC is not new. Tordoff and Murray first proposed to guide the hypothesis search of MLESAC [48] using side-information [47]. They formulated a prior probability of sparse feature matches being valid based on matching scores. While this has a positive effect on RANSAC performance in some applications, feature matching scores, or other hand-crafted heuristics, were clearly not designed to guide hypothesis search. In particular, calibration of such ad-hoc measures can be difficult, as reliance on over-confident but wrong prior probabilities can yield situations where the same few observations are sampled repeatedly. This fact was recognized by Chum and Matas who proposed PROSAC [9], a variant of RANSAC that uses side-information only to change the order in which RANSAC draws minimal sets. In the worst case, if the side-information is not useful at all, their method degenerates to vanilla RANSAC. NG-RANSAC takes a different approach by (i) learning the weights to guide hypothesis search rather than using hand-crafted heuristics, and (ii) integrating RANSAC itself in the training process, which leads to self-calibration of the predicted weights.

Recently, Brachmann et al. proposed differentiable RANSAC (DSAC) to learn a camera re-localization pipeline [4]. Unfortunately, we cannot directly use DSAC to learn hypothesis sampling since DSAC is only differentiable w.r.t. observations, not w.r.t. sampling weights. However, NG-RANSAC applies a similar trick also used to make DSAC differentiable, namely the optimization of the expected task loss during training. While we do not rely on DSAC, neural guidance can be used in conjunction with DSAC (NG-DSAC) to train neural networks that predict observations and observation confidences at the same time.
We summarize our main contributions:

We present NG-RANSAC, a formulation of RANSAC with learned guidance of hypothesis sampling. We can use any (non-differentiable) task loss, and any (non-differentiable) minimal solver for training.

Choosing the inlier count itself as training objective facilitates self-supervised learning of NG-RANSAC.

We use NG-RANSAC to estimate epipolar geometry of image pairs from sparse correspondences, where it surpasses competing robust estimators.

We combine neural guidance with differentiable RANSAC (NG-DSAC) to train neural networks that make accurate predictions for parts of the input, while neglecting other parts. These models achieve competitive results for horizon line estimation, and state-of-the-art accuracy for camera re-localization.
2 Related Work
RANSAC was introduced in 1981 by Fischler and Bolles [12]. Since then, it has been extended in various ways; see the survey by Raguram et al. [35]. Combining some of the most promising improvements, Raguram et al. created the Universal RANSAC (USAC) framework [34] which represents the state-of-the-art of classic RANSAC variants. USAC includes guided hypothesis sampling according to PROSAC [9], more accurate model fitting according to Locally Optimized RANSAC [11], and more efficient hypothesis verification according to Optimal Randomized RANSAC [10]. Many of the improvements proposed for RANSAC could also be applied to NG-RANSAC since we do not require differentiability of such add-ons. We only impose restrictions on how to generate hypotheses, namely according to a learned probability distribution.
RANSAC is not often used in recent machine-learning-heavy vision pipelines. Notable exceptions include geometric problems like object instance pose estimation [3, 5, 21], and camera re-localization [41, 51, 28, 8, 46], where RANSAC is coupled with decision forests or neural networks that predict image-to-object correspondences. However, in most of these works, RANSAC is not part of the training process because of its non-differentiability. DSAC [4, 6] overcomes this limitation by making the hypothesis selection a probabilistic action, which facilitates optimization of the expected task loss during training. However, DSAC is limited in which derivatives can be calculated. DSAC allows differentiation w.r.t. observations. For example, we can use it to calculate the gradient w.r.t. the image coordinates of a sparse correspondence. However, DSAC does not model observation selection, and hence we cannot use it to optimize a matching probability. By showing how to learn neural guidance, we close this gap. The combination with DSAC enables the full flexibility of learning both observations and their selection probability.

Besides DSAC, a differentiable robust estimator, there has recently been some work on learning robust estimators. We discussed the work of Yi et al. [56] in the introduction. Ranftl and Koltun [36] take a similar but iterative approach, reminiscent of Iteratively Reweighted Least Squares (IRLS), for fundamental matrix estimation. In each iteration, a neural network predicts observation weights for a weighted model fit, taking into account the residuals of the last iteration. Both [56] and [36] have shown considerable improvements over vanilla RANSAC, but require differentiable minimal solvers and task loss functions. NG-RANSAC outperforms both approaches, and is more flexible when it comes to defining the training objective. This flexibility also enables us to train NG-RANSAC in a self-supervised fashion, possible with neither [56] nor [36].
3 Method
Preliminaries. We address the problem of fitting model parameters $\mathbf{h}$ to a set of observations $\mathcal{Y}$ that is contaminated by noise and outliers. For example, $\mathbf{h}$ could be a fundamental matrix that describes the epipolar geometry of an image pair [16], and $\mathcal{Y}$ could be the set of SIFT correspondences [27] we extract for the image pair. To calculate model parameters from the observations, we utilize a solver $f$, for example the 8-point algorithm [15]. However, calculating $\mathbf{h} = f(\mathcal{Y})$ from all observations will result in a poor estimate due to outliers. Instead, we can calculate $\mathbf{h}$ from a small subset (minimal set) of observations $\mathcal{Y}_j \subseteq \mathcal{Y}$ with cardinality $N$: $\mathbf{h}_j = f(\mathcal{Y}_j)$. For example, $N = 8$ for a fundamental matrix when using the 8-point algorithm. RANSAC [12] is an algorithm to choose an outlier-free minimal set from $\mathcal{Y}$, such that the resulting estimate is accurate. To this end, RANSAC randomly chooses $M$ minimal sets to create a pool $\mathcal{H} = \{\mathbf{h}_j\}_{j=1}^{M}$ of model hypotheses.
RANSAC includes a strategy to adaptively choose the number of hypotheses $M$, based on an online estimate of the outlier ratio [12]. The strategy guarantees that an outlier-free set will be sampled with a user-defined probability. For tasks with large outlier ratios, $M$ calculated like this can be exponentially large, and is usually clamped to a maximum value [7]. For notational simplicity, we take the perspective of a fixed $M$, but do not restrict the use of an early-stopping strategy in practice.
RANSAC chooses the model hypothesis with the highest score as the final estimate $\hat{\mathbf{h}}$, according to a scoring function $s$:
$$\hat{\mathbf{h}} = \underset{\mathbf{h}_j \in \mathcal{H}}{\arg\max}\; s(\mathbf{h}_j, \mathcal{Y}) \qquad (1)$$
The scoring function measures the consensus of a hypothesis with all observations, and is traditionally implemented as inlier counting [12].
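As a concrete instance of Eq. 1 and the sampling loop, a toy 2D line-fitting RANSAC might look as follows (solver, threshold and pool size are illustrative choices of ours, not values from the paper):

```python
import numpy as np

def fit_line(minimal_set):
    # minimal solver f: a line (slope, intercept) through two 2D points (N = 2)
    (x1, y1), (x2, y2) = minimal_set
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

def score(h, points, tau=0.1):
    # scoring function s: inlier count under threshold tau
    slope, intercept = h
    residuals = np.abs(points[:, 1] - (slope * points[:, 0] + intercept))
    return int((residuals < tau).sum())

def ransac(points, M=64, tau=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    hypotheses = []
    for _ in range(M):
        idx = rng.choice(len(points), size=2, replace=False)  # minimal set
        hypotheses.append(fit_line(points[idx]))
    return max(hypotheses, key=lambda h: score(h, points, tau))  # Eq. 1
```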
Neural Guidance. RANSAC chooses observations uniformly at random to create the hypothesis pool $\mathcal{H}$. We instead aim at sampling observations according to a learned distribution $p(y; \mathbf{w})$ that is parametrized by a neural network with parameters $\mathbf{w}$. That is, we select observations $y \sim p(y; \mathbf{w})$. Note that $p(y; \mathbf{w})$ is a categorical distribution over the discrete set of observations $\mathcal{Y}$, not a continuous distribution in observation space. We wish to learn parameters $\mathbf{w}$ in a way that increases the chance of selecting outlier-free minimal sets, which will result in accurate estimates $\hat{\mathbf{h}}$. We sample a hypothesis pool $\mathcal{H} \sim p(\mathcal{H}; \mathbf{w})$ by sampling observations and minimal sets independently,
$$p(\mathcal{H}; \mathbf{w}) = \prod_{j=1}^{M} p(\mathbf{h}_j; \mathbf{w}), \quad \text{with} \quad p(\mathbf{h}_j; \mathbf{w}) = \prod_{i=1}^{N} p(y_{j,i}; \mathbf{w}) \qquad (2)$$
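Sampling a pool in this fashion can be sketched with PyTorch, where the network output is a vector of (unnormalized) log weights over the observations (names are ours; for simplicity we draw the indices of each minimal set without replacement):

```python
import torch

def sample_guided_pool(log_weights, M, N, generator=None):
    """Sample M minimal sets of N observation indices from the learned
    categorical distribution p(y; w) over observations."""
    probs = torch.softmax(log_weights, dim=0)  # categorical over |Y| observations
    return torch.stack([
        torch.multinomial(probs, N, replacement=False, generator=generator)
        for _ in range(M)  # minimal sets are drawn independently
    ])
```

Replacing the learned probabilities with a uniform vector recovers vanilla RANSAC sampling.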
From a pool $\mathcal{H}$, we estimate model parameters $\hat{\mathbf{h}}$ with RANSAC according to Eq. 1. For training, we assume that we can measure the quality of the estimate with a task loss function $\ell(\hat{\mathbf{h}})$. The task loss can be calculated w.r.t. a ground truth model $\mathbf{h}^*$, or self-supervised, by using the inlier count of the final estimate: $\ell(\hat{\mathbf{h}}) = -s(\hat{\mathbf{h}}, \mathcal{Y})$. We wish to learn the distribution $p(y; \mathbf{w})$ in a way that we receive a small task loss with high probability. Inspired by DSAC [4], we define our training objective as the minimization of the expected task loss:
$$L(\mathbf{w}) = \mathbb{E}_{\mathcal{H} \sim p(\mathcal{H}; \mathbf{w})}\left[\ell(\hat{\mathbf{h}})\right] \qquad (3)$$
We compute the gradients of the expected task loss w.r.t. the network parameters $\mathbf{w}$ as
$$\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}) = \mathbb{E}_{\mathcal{H} \sim p(\mathcal{H}; \mathbf{w})}\left[\ell(\hat{\mathbf{h}})\, \frac{\partial}{\partial \mathbf{w}} \log p(\mathcal{H}; \mathbf{w})\right] \qquad (4)$$
Integrating over all possible hypothesis pools to calculate the expectation is infeasible. Therefore, we approximate the gradients by drawing $K$ samples $\mathcal{H}_k \sim p(\mathcal{H}; \mathbf{w})$:
$$\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}) \approx \frac{1}{K} \sum_{k=1}^{K} \ell(\hat{\mathbf{h}}_k)\, \frac{\partial}{\partial \mathbf{w}} \log p(\mathcal{H}_k; \mathbf{w}) \qquad (5)$$
Note that gradients of the task loss function do not appear in the expression above. Therefore, differentiability of the task loss $\ell$, the robust solver (RANSAC) or the minimal solver $f$ is not required. These components merely generate a training signal for steering the sampling probability $p(y; \mathbf{w})$ in a good direction. Due to the approximation by sampling, the gradient variance of Eq. 5 can be high. We apply a standard variance reduction technique from reinforcement learning by subtracting a baseline $b$ [45]:

$$\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}) \approx \frac{1}{K} \sum_{k=1}^{K} \left[\ell(\hat{\mathbf{h}}_k) - b\right] \frac{\partial}{\partial \mathbf{w}} \log p(\mathcal{H}_k; \mathbf{w}) \qquad (6)$$
We found a simple baseline in the form of the average loss per training image sufficient, i.e. $b = \frac{1}{K} \sum_{k=1}^{K} \ell(\hat{\mathbf{h}}_k)$. Subtracting the baseline moves the probability distribution towards hypothesis pools with lower-than-average loss for each training example.
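The gradient estimate of Eq. 6 is a REINFORCE-style estimator. A minimal PyTorch sketch of one training step might look as follows; `ransac_fn` and `task_loss_fn` are hypothetical stand-ins for the robust solver and the task loss, and, as a simplification, we sample all M·N observation indices with replacement:

```python
import torch

def ng_ransac_grad_step(log_weights, ransac_fn, task_loss_fn, K=8, M=16, N=5):
    """Accumulate gradients for Eq. 6: sample K hypothesis pools, score each
    with the (non-differentiable) task loss, and reinforce the log-probability
    of the sampled observations, with the mean loss as baseline b."""
    probs = torch.softmax(log_weights, dim=0)
    losses, log_ps = [], []
    for _ in range(K):
        idx = torch.multinomial(probs, M * N, replacement=True)  # pool H_k
        h_hat = ransac_fn(idx.view(M, N))           # Eq. 1, no gradients needed
        losses.append(task_loss_fn(h_hat))          # scalar, non-differentiable
        log_ps.append(torch.log(probs[idx]).sum())  # log p(H_k; w)
    losses = torch.tensor(losses)
    baseline = losses.mean()                        # b = average loss per image
    # surrogate whose gradient matches Eq. 6; backprop flows through log_ps only
    surrogate = ((losses - baseline) * torch.stack(log_ps)).mean()
    surrogate.backward()
```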
Combination with DSAC. Brachmann et al. [4] proposed a RANSAC-based pipeline where a neural network with parameters $\mathbf{w}$ predicts the observations $\mathcal{Y}(\mathbf{w})$ themselves. End-to-end training of the pipeline, and therefore learning the observations, is possible by turning the hypothesis selection of RANSAC (cf. Eq. 1) into a probabilistic action:
$$P(j \mid \mathcal{H}) = \frac{\exp\left[s(\mathbf{h}_j, \mathcal{Y})\right]}{\sum_{j'} \exp\left[s(\mathbf{h}_{j'}, \mathcal{Y})\right]} \qquad (7)$$
This differentiable variant of RANSAC (DSAC) chooses a hypothesis randomly according to a distribution calculated from hypothesis scores. The training objective aims at learning network parameters $\mathbf{w}$ such that hypotheses with low task loss are chosen with high probability:
$$L(\mathbf{w}) = \mathbb{E}_{j \sim P(j \mid \mathcal{H})}\left[\ell(\mathbf{h}_j)\right] \qquad (8)$$
In the following, we extend the formulation of DSAC with neural guidance (NG-DSAC). We let the neural network predict observations $y_i(\mathbf{w})$ and, additionally, a probability $p(y_i; \mathbf{w})$ associated with each observation. Intuitively, the neural network can express a confidence in its own predictions through this probability. This can be useful if a certain input for the neural network contains no information about the desired model $\mathbf{h}$. In this case, the observation prediction is necessarily an outlier, and the best the neural network can do is to label it as such by assigning a low probability. We combine the training objectives of NG-RANSAC (Eq. 3) and DSAC (Eq. 8), which yields:
$$L(\mathbf{w}) = \mathbb{E}_{\mathcal{H} \sim p(\mathcal{H}; \mathbf{w})}\, \mathbb{E}_{j \sim P(j \mid \mathcal{H})}\left[\ell(\mathbf{h}_j)\right] \qquad (9)$$
where we again construct $p(\mathcal{H}; \mathbf{w})$ from the individual $p(y_i; \mathbf{w})$'s according to Eq. 2. The training objective of NG-DSAC consists of two expectations. Firstly, the expectation w.r.t. sampling a hypothesis pool according to the probabilities predicted by the neural network. Secondly, the expectation w.r.t. sampling a final estimate from the pool according to the scoring function. As in NG-RANSAC, we approximate the first expectation via sampling, as integrating over all possible hypothesis pools is infeasible. The second expectation we can calculate analytically, as in DSAC, since it integrates over the discrete set of hypotheses in a given pool $\mathcal{H}_k$. Similar to Eq. 6, we give the approximate gradients of NG-DSAC as:
$$\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}) \approx \frac{1}{K} \sum_{k=1}^{K} \left( \left[\mathbb{E}_k - b\right] \frac{\partial}{\partial \mathbf{w}} \log p(\mathcal{H}_k; \mathbf{w}) + \frac{\partial}{\partial \mathbf{w}} \mathbb{E}_k \right) \qquad (10)$$
where we use $\mathbb{E}_k$ as a stand-in for $\mathbb{E}_{j \sim P(j \mid \mathcal{H}_k)}\left[\ell(\mathbf{h}_j)\right]$. The calculation of gradients for NG-DSAC requires the derivative of the task loss (note the last part of Eq. 10), because $\ell$ depends on the parameters $\mathbf{w}$ via the observations $\mathcal{Y}(\mathbf{w})$. Therefore, training NG-DSAC requires a differentiable task loss function $\ell$, a differentiable scoring function $s$, and a differentiable minimal solver $f$. Note that we inherit these restrictions from DSAC. In return, NG-DSAC allows for learning observations and observation confidences at the same time.
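The probabilistic selection of Eqs. 7 and 8 boils down to a softmax over hypothesis scores, with the inner expectation computed analytically over the discrete pool; a sketch (assuming scores and per-hypothesis losses are given as tensors):

```python
import torch

def dsac_expected_loss(scores, losses):
    """Hypothesis selection as a probabilistic action: `scores` are
    s(h_j, Y) for the pool, `losses` are l(h_j); the expectation over
    the discrete pool is computed analytically via a softmax."""
    p = torch.softmax(scores, dim=0)  # P(j | H), cf. Eq. 7
    return (p * losses).sum()         # E_j[l(h_j)], cf. Eq. 8
```

Since the result is differentiable w.r.t. both scores and losses, gradients can flow back into the observation predictions, as required by Eq. 10.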
4 Experiments
We evaluate neural guidance on multiple, classic computer vision tasks. Firstly, we apply NG-RANSAC to estimating epipolar geometry of image pairs in the form of essential matrices and fundamental matrices. Secondly, we apply NG-DSAC to horizon line estimation and camera re-localization. We present the main experimental results here, and refer to the appendix for details about network architectures, hyperparameters and more qualitative results. Our implementation is based on PyTorch [32], and we will make the code publicly available for all tasks discussed below.
4.1 Essential Matrix Estimation
Epipolar geometry describes the geometry of two images that observe the same scene [16]. In particular, two image points $\mathbf{x}$ and $\mathbf{x}'$ in the left and right image corresponding to the same 3D point satisfy $\mathbf{x}'^{\top} \mathbf{F} \mathbf{x} = 0$, where $\mathbf{F}$ denotes the fundamental matrix. We can estimate $\mathbf{F}$ uniquely (but only up to scale) from 8 correspondences, or from 7 correspondences with multiple solutions [16]. The essential matrix $\mathbf{E}$ is a special case of the fundamental matrix for the case that the calibration parameters $\mathbf{K}$ and $\mathbf{K}'$ of both cameras are known: $\mathbf{E} = \mathbf{K}'^{\top} \mathbf{F} \mathbf{K}$. The essential matrix can be estimated from 5 correspondences [31]. Decomposing the essential matrix allows to recover the relative pose between the observing cameras, and is a central step in image-based 3D reconstruction [40]. As such, estimating the fundamental or essential matrix of an image pair is a classic and well-researched problem in computer vision.
In the following, we first evaluate NG-RANSAC for the calibrated case and estimate essential matrices from SIFT correspondences [27]. For the sake of comparability with the recent, learned robust estimator of Yi et al. [56], we adhere closely to their evaluation setup, and compare to their results.
Datasets. Yi et al. [56] evaluate their approach in outdoor as well as indoor settings. For the outdoor datasets, they select five scenes from the structure-from-motion (SfM) dataset of [19]: Buckingham, Notre Dame, Sacre Coeur, St. Peter's and Reichstag. They pick two additional scenes from [44]: Fountain and Herz-Jesu. They reconstruct each scene using a SfM tool [53] to obtain 'ground truth' camera poses, and co-visibility constraints for selecting image pairs. For indoor scenes, Yi et al. choose 16 sequences from the SUN3D dataset [54] which readily comes with ground truth poses captured by KinectFusion [30]. See Appendix A for a listing of all scenes. Indoor scenarios are typically very challenging for sparse feature-based approaches because of texture-less surfaces and repetitive elements (see Fig. 1 for an example). Yi et al. train their best model using one outdoor scene (St. Peter's) and one indoor scene (Brown 1), and test on all remaining sequences (6 outdoor, 15 indoor). Yi et al. kindly provided us with their exact data splits, and we use their setup. Note that training and testing are performed on completely separate scenes, i.e. the neural network has to generalize to unknown environments.
Evaluation Metric. Via the essential matrix, we recover the relative camera pose up to scale, and compare it to the ground truth pose as follows. We measure the angular error between the pose rotations, as well as the angular error between the pose translation vectors, in degrees. We take the maximum of the two values as the final angular error. We calculate the cumulative error curve for each test sequence, and compute the area under the curve (AUC) up to a threshold of 5°, 10° or 20°. Finally, we report the average AUC over all test sequences (but separately for the indoor and outdoor setting).

Implementation. Yi et al. [56] train a neural network to classify a set of sparse correspondences into inliers and outliers. They represent each correspondence as a 4D vector combining the 2D coordinates in the left and right image. Their network is inspired by PointNet [33], and processes each correspondence independently by a series of multi-layer perceptrons (MLPs). Global context is infused by using instance normalization [49] in between layers. We re-build this architecture in PyTorch, and train it according to NG-RANSAC (Eq. 3). That is, the network predicts weights to guide RANSAC sampling instead of inlier class labels. We use the angular error between the estimated relative pose and the ground truth pose as task loss $\ell$. As minimal solver $f$, we use the 5-point algorithm [31]. To speed up training, we initialize the network by learning to predict the distance of each correspondence to the ground truth epipolar line, see Appendix A for details. We initialize for 75k iterations, and train according to Eq. 3 for 25k iterations. We optimize using Adam [23]. For each training image, we extract 2000 SIFT correspondences, and sample $K$ hypothesis pools with $M$ hypotheses each. We use a low number of hypotheses during training to obtain variation when sampling pools. For testing, we increase the number of hypotheses. We use an inlier threshold assuming normalized image coordinates using the camera calibration parameters.

Results. We compare NG-RANSAC to the inlier classification (InClass) of Yi et al. [56]. They use their approach with SIFT as well as LIFT [55] features. The former works better for outdoor scenes, the latter works better for indoor scenes. We also include results for DeMoN [50], a learned SfM pipeline, and GMS [2], a semi-dense approach using ORB features [38]. See Fig. 2 a) for results. RANSAC achieves poor results for indoor and outdoor scenes across all thresholds, scoring as the weakest method among all competitors. Coupling it with neural guidance (NG-RANSAC) elevates it to the leading position with a comfortable margin. See also Fig. 3 for qualitative results.
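The pose error and AUC described above can be sketched as follows (our own helper code; we take the absolute cosine for the translation direction, since its sign is not recoverable from the essential matrix):

```python
import numpy as np

def pose_error_deg(R_est, t_est, R_gt, t_gt):
    """Maximum of rotational and translational angular error, in degrees."""
    cos_r = np.clip((np.trace(R_est.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    cos_t = abs(t_est @ t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    cos_t = np.clip(cos_t, -1.0, 1.0)
    return max(np.degrees(np.arccos(cos_r)), np.degrees(np.arccos(cos_t)))

def auc_at(errors, threshold):
    """Area under the cumulative error curve (recall over error), normalized
    by the threshold so that a perfect estimator scores 1."""
    errors = np.sort(np.asarray(errors, dtype=float))
    area, prev_e, recall = 0.0, 0.0, 0.0
    for i, e in enumerate(errors):
        if e > threshold:
            break
        area += recall * (e - prev_e)  # recall is a step function of the error
        prev_e, recall = e, (i + 1) / len(errors)
    return (area + recall * (threshold - prev_e)) / threshold
```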
NG-RANSAC outperforms InClass of Yi et al. [56] despite some similarities. Both use the same network architecture, are based on SIFT correspondences, and both use RANSAC at test time. Yi et al. [56] train using a hybrid classification-regression loss based on the 8-point algorithm, and ultimately compare essential matrices using the squared error. Therefore, their training objective is very different from the evaluation procedure. During evaluation, they use RANSAC with the 5-point algorithm on top of their inlier predictions, and measure the angular error. NG-RANSAC incorporates all these components in its training procedure, and therefore optimizes the correct objective.
Using Side-Information. The evaluation procedure defined by Yi et al. [56] is well suited to test a robust estimator in high-outlier domains. However, it underestimates what classical approaches can achieve on these datasets. Lowe's ratio criterion [27] is often deployed to drastically reduce the amount of outlier correspondences. It is based on the distance ratio of the best and second-best SIFT match. Matches with a ratio above a threshold (we use 0.8) are removed before running RANSAC. We denote the ratio criterion as +Ratio in Fig. 2 b), and observe a drastic improvement when combined with SIFT and RANSAC. This classic approach outperforms all learned methods of Fig. 2 a) and is competitive to NG-RANSAC (when not using any side-information). We can push performance even further by applying RootSIFT normalization to the SIFT descriptors [1]. By training NG-RANSAC on ratio-filtered RootSIFT correspondences, using distance ratios as additional network input (denoted by +SI), we achieve best accuracy.
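Lowe's ratio criterion from the paragraph above can be sketched with plain NumPy (brute-force L2 matching; the helper is our own):

```python
import numpy as np

def ratio_filter(desc1, desc2, ratio=0.8):
    """Keep a putative match (i -> best j) only if the best descriptor
    distance is below `ratio` times the second-best distance."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(desc1))
    keep = d[rows, best] < ratio * d[rows, second]
    return [(int(i), int(best[i])) for i in np.flatnonzero(keep)]
```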
Self-supervised Learning. We can train NG-RANSAC self-supervised when we define a task loss $\ell$ that assesses the quality of an estimate independent of a ground truth model $\mathbf{h}^*$. A natural choice is the inlier count of the final estimate. We found the inlier count to be a very stable training signal, even in the beginning of training, such that we require no special initialization of the network. We report results of self-supervised NG-RANSAC in Fig. 2 c). It outperforms all competitors, but achieves slightly worse accuracy than NG-RANSAC trained with supervision. A supervised task loss allows NG-RANSAC to adapt more precisely to the evaluation measure used at test time. For the datasets used so far, the process of image pairing uses co-visibility information, and therefore a form of supervision. In the next section, we learn NG-RANSAC fully self-supervised by using the ordering of sequential data to assemble image pairs.
4.2 Fundamental Matrix Estimation
We apply NG-RANSAC to fundamental matrix estimation, comparing it to the learned robust estimator of Ranftl and Koltun [36], denoted Deep F-Mat. They propose an iterative procedure where a neural network estimates observation weights for a robust model fit. The residuals of the last iteration are an additional input to the network in the next iteration. The network architecture is similar to the one used in [56]. Correspondences are represented as 4D vectors, and the descriptor matching ratio serves as an additional input. Each observation is processed by a series of MLPs with instance normalization interleaved. Deep F-Mat was published very recently, and the code is not yet available. We therefore follow the evaluation procedure described in [36] and compare to their results.
Datasets. Ranftl and Koltun [36] evaluate their method on various datasets that involve custom reconstructions which are not publicly available. Therefore, we compare to their method on the Kitti dataset [14], which is publicly available. Ranftl and Koltun [36] train their method on sequences 00-05 of the Kitti odometry benchmark, and test on sequences 06-10. They form image pairs by taking subsequent images within a sequence. For each pair, they extract SIFT correspondences and apply Lowe's ratio criterion [27] with a threshold of 0.8.
Evaluation Metric. Ranftl and Koltun [36] evaluate using multiple metrics. They measure the percentage of inlier correspondences of the final model. They calculate the F-score over correspondences, where true positives are inliers of both the ground truth model and the estimated model. The F-score measures the alignment of the estimated and the true fundamental matrix in image space. Both metrics use an inlier threshold of 0.1px. Finally, they calculate the mean and median epipolar error of inlier correspondences w.r.t. the ground truth model, using an inlier threshold of 1px.
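The F-score over correspondences can be computed from the two inlier masks (a small helper of our own):

```python
import numpy as np

def epipolar_fscore(inliers_est, inliers_gt):
    """F-score over correspondences, where true positives are correspondences
    that are inliers to both the estimated and the ground truth model."""
    tp = np.sum(inliers_est & inliers_gt)
    if tp == 0:
        return 0.0
    precision = tp / np.sum(inliers_est)
    recall = tp / np.sum(inliers_gt)
    return 2 * precision * recall / (precision + recall)
```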
Implementation. We cannot use the architecture of Deep F-Mat, which is designed for iterative application. Therefore, we reuse the architecture of Yi et al. [56] from the previous section for NG-RANSAC (also see Appendix B for details). We adhere to the training setup described in Sec. 4.1 with the following changes. We observed faster training convergence on Kitti, so we omit the initialization stage, and directly optimize the expected task loss (Eq. 3) for 300k iterations. Since Ranftl and Koltun [36] evaluate using multiple metrics, the choice of the task loss function is not clear. Hence, we train multiple variants with different objectives (%Inliers, F-score and Mean error) and report the corresponding results. As minimal solver $f$, we use the 7-point algorithm and a RANSAC threshold of 0.1px, and we draw $K$ hypothesis pools with $M$ hypotheses each per training image.
Results. We report results in Fig. 4 where we compare NG-RANSAC with RANSAC, USAC [34] and Deep F-Mat. Note that USAC also uses guided sampling based on matching ratios, according to the PROSAC strategy [9]. NG-RANSAC outperforms the classical approaches RANSAC and USAC. NG-RANSAC also performs slightly better than Deep F-Mat. We observe that the choice of the training objective has a small but significant influence on the evaluation. All metrics are highly correlated, and optimizing a metric in training generally also achieves good (but not necessarily best) accuracy using this metric at test time. Interestingly, optimizing the inlier count during training performs competitively, despite being a self-supervised objective. Fig. 3 shows a qualitative result on Kitti.
4.3 Horizon Lines
We fit a parametric model, the horizon line, to a single image. The horizon can serve as a cue in image understanding [52] or for image editing [25]. Traditionally, this task is solved via vanishing point detection and geometric reasoning [37, 24, 57, 42], often assuming a Manhattan or Atlanta world. We take a simpler approach and use a general-purpose CNN that predicts a set of 64 2D points based on the image, to which we fit a line with RANSAC, see Fig. 5. The network has two output branches (see Appendix C for details), predicting (i) the 2D points $y_i(\mathbf{w})$, and (ii) probabilities $p(y_i; \mathbf{w})$ for guided sampling.

Dataset. We evaluate on the HLW dataset [52], which is a collection of SfM datasets with annotated horizon line. Test and training images partly show the same scenes, and the horizon line can be outside the image area.
Evaluation Metric. As is common practice on HLW, we measure the maximum distance between the estimated horizon and ground truth within the image, normalized by image height. We calculate the AUC of the cumulative error curve up to a threshold of 0.25.
Implementation. We train using the NG-DSAC objective (Eq. 9) from scratch for 300k iterations. As task loss $\ell$, we use the normalized maximum distance between the estimated and the true horizon. For hypothesis scoring $s$, we use a soft inlier count [6]. We train using Adam [23]. For each training image, we draw $K$ hypothesis pools with 16 hypotheses each, and we also draw 16 hypotheses at test time. We compare to DSAC, which we train similarly but with the probability branch disabled.
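The soft inlier count [6] replaces the hard threshold test with a sigmoid so that the score $s$ becomes differentiable; a sketch (the threshold and softness values are illustrative, not from the paper):

```python
import torch

def soft_inlier_count(residuals, tau=0.05, beta=100.0):
    """Differentiable hypothesis score s: a sigmoid relaxation of the hard
    inlier count. `tau` is the inlier threshold, `beta` controls how sharply
    the sigmoid approximates the hard threshold test."""
    return torch.sigmoid(beta * (tau - residuals)).sum()
```

As `beta` grows, the score approaches the hard inlier count used in vanilla RANSAC.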
Results. We report results in Fig. 5. DSAC and NG-DSAC achieve competitive accuracy on this dataset, ranking among the top methods. NG-DSAC has a small but significant advantage over DSAC alone. Our method is only surpassed by SLNet [25], an architecture designed to find semantic lines in images. SLNet generates a large number of random candidate lines, selects a candidate via classification, and refines it with a predicted offset. In principle, SLNet could be coupled with neural guidance for informed candidate sampling. Unfortunately, the code of SLNet is not publicly available, and the authors did not respond to inquiries.
4.4 Camera Re-Localization
We estimate the absolute 6D camera pose (position and orientation) w.r.t. a known scene from a single RGB image.
Dataset. We evaluate on the Cambridge Landmarks [22] dataset. It is comprised of RGB images depicting five landmark buildings in Cambridge, UK.¹ Ground truth poses were generated by running a SfM pipeline.

¹We omitted one additional scene (Street). Like other learning-based methods before [6], we failed to achieve sensible results for this scene. By visual inspection, the corresponding SfM reconstruction seems to be of poor quality, which potentially harms the training process.
Evaluation Metric. We measure the median translational error of estimated poses for each scene.²

²The median rotational accuracies are between 0.2° and 0.3° for all scenes, and hardly vary between methods.
Implementation. We build on the publicly available DSAC++ pipeline [6], which is a scene coordinate regression method [41]. A neural network predicts for each image pixel a 3D coordinate in scene space. We recover the pose from the 2D-3D correspondences using a perspective-n-point solver [13] within a RANSAC loop. The DSAC++ pipeline implements geometric pose optimization in a fully differentiable way, which facilitates end-to-end training. We re-implement the neural network integration of DSAC++ in PyTorch (the original uses LUA/Torch). We also update the network architecture of DSAC++ by using a ResNet [18] instead of a VGGNet [43]. As with horizon line estimation, we add a second output branch to the network for estimating a probability distribution over scene coordinate predictions for guided RANSAC sampling. We denote this extended architecture NG-DSAC++. We adhere to the training procedure and hyperparameters of DSAC++ (see Appendix D), but optimize the NG-DSAC objective (Eq. 9) during end-to-end training. As task loss $\ell$, we use the average of the rotational and translational error w.r.t. the ground truth pose. We sample $K$ hypothesis pools with $M$ hypotheses per training image, and increase the number of hypotheses for testing.

Results. We report our quantitative results in Fig. 7. Firstly, we observe a significant improvement for most scenes when using DSAC++ with a ResNet architecture. Secondly, comparing DSAC++ with NG-DSAC++, we notice a small to moderate, but consistent, improvement in accuracy. The advantage of using neural guidance is largest for the Great Court scene, which features large ambiguous grass areas, and large areas of sky visible in many images. NG-DSAC++ learns to ignore such areas, see the visualization in Fig. 6 a). The network learns to mask these areas solely guided by the task loss during training, as it fails to predict accurate scene coordinates for them. In Fig. 6 b), we visualize the internal representation learned by DSAC++ and NG-DSAC++ for one scene. The representation of DSAC++ is very noisy, as it tries to optimize geometric constraints for sky and grass pixels. NG-DSAC++ learns a cleaner representation by focusing entirely on buildings.
5 Conclusion
We have presented NG-RANSAC, a robust estimator using guided hypothesis sampling according to learned probabilities. For training, we can incorporate non-differentiable task loss functions and non-differentiable minimal solvers. Using the inlier count as training objective allows us to train NG-RANSAC self-supervised. We applied NG-RANSAC to multiple classic computer vision tasks and observe a consistent improvement over RANSAC alone.
Acknowledgements:
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 647769). The computations were performed on an HPC Cluster at the Center for Information Services and High Performance Computing (ZIH) at TU Dresden.
Appendix A Essential Matrix Estimation
List of Scenes Used for Training and Testing.
Training:
- Saint Peter’s (Outdoor)
- brown_bm_3 – brown_bm_3 (Indoor)

Testing (Outdoor):
- Buckingham
- Notre Dame
- Sacre Coeur
- Reichstag
- Fountain
- Herz-Jesu

Testing (Indoor):
- brown_cogsci_2 – brown_cogsci_2
- brown_cogsci_6 – brown_cogsci_6
- brown_cogsci_8 – brown_cogsci_8
- brown_cs_3 – brown_cs3
- brown_cs_7 – brown_cs7
- harvard_c4 – hv_c4_1
- harvard_c10 – hv_c10_2
- harvard_corridor_lounge – hv_lounge1_2
- harvard_robotics_lab – hv_s1_2
- hotel_florence_jx – florence_hotel_stair_room_all
- mit_32_g725 – g725_1
- mit_46_6conf – bcs_floor6_conf_1
- mit_46_6lounge – bcs_floor6_long
- mit_w85g – g_0
- mit_w85h – h2_1
Network Architecture. As mentioned in the main paper, we replicate the architecture of Yi et al. [56] for our experiments on epipolar geometry (estimating essential and fundamental matrices). For a schematic overview see Fig. 8. The network takes a set of feature correspondences as input, and predicts as output a weight for each correspondence which we use to guide RANSAC hypothesis sampling. The network consists of a series of multi-layer perceptrons (MLPs) that process each correspondence independently. We implement the MLPs with convolutions. The network infuses global context via instance normalization layers [49], and accelerates training via batch normalization [20]. The main body of the network comprises 12 blocks with skip connections [18]
. Each block consists of two linear layers, each followed by instance normalization, batch normalization and a ReLU activation [17]. We apply a Sigmoid activation to the last layer, and normalize by dividing by the sum of outputs. (The original architecture of Yi et al. [56] uses slightly different output processing, a ReLU activation followed by a tanh activation, since they use the output as weights for a robust model fit.)

Initialization Procedure. We initialize our network in the following way. We define a target sampling distribution using the ground truth essential matrix given for each training pair. Intuitively, the target distribution should assign a high probability when a correspondence is aligned with the ground truth essential matrix $E$, and a low probability otherwise. We assume that a correspondence $c$ is a 4D vector containing the two 2D image coordinates $x$ and $x'$ (3D in homogeneous coordinates). We define the epipolar error $e(c, E)$ of a correspondence $c$ w.r.t. the essential matrix $E$:
$$e(c, E) = \frac{x'^\top E x}{\sqrt{[Ex]_1^2 + [Ex]_2^2 + [E^\top x']_1^2 + [E^\top x']_2^2}} \qquad (11)$$
where $[\cdot]_i$ returns the $i$-th entry of a vector. Using the epipolar error, we define the target sampling distribution:
$$p(c; E) \propto \exp\left(-\frac{e(c, E)^2}{2\beta^2}\right) \qquad (12)$$
Parameter $\beta$ controls the softness of the target distribution; we use a value corresponding to the inlier threshold we use for RANSAC. To initialize our network, we minimize the KL divergence between the network prediction and the target distribution. We initialize for 75k iterations using Adam [23] with a learning rate of and a batch size of 32.
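The initialization objective can be sketched as follows. This is an illustrative, framework-free version with toy numbers: the Gaussian form of the target distribution mirrors Eq. 12, while the function names, the example errors and the value of beta are our own assumptions.

```python
import math

def target_distribution(epipolar_errors, beta):
    """Target sampling distribution: Gaussian in the epipolar error
    (cf. Eq. 12), normalized over all correspondences of an image pair."""
    weights = [math.exp(-(e * e) / (2.0 * beta * beta)) for e in epipolar_errors]
    total = sum(weights)
    return [w / total for w in weights]

def normalized_sigmoid(logits):
    """Network output processing: Sigmoid per correspondence,
    then division by the sum, yielding a categorical distribution."""
    s = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    total = sum(s)
    return [v / total for v in s]

def kl_divergence(p, q):
    """KL(p || q), the initialization objective between target p
    and network prediction q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy example: three correspondences, the last one with a large
# epipolar error (a likely outlier, so its target probability is tiny).
p = target_distribution([0.0005, 0.001, 0.05], beta=0.001)
q = normalized_sigmoid([0.2, 0.1, 0.3])
loss = kl_divergence(p, q)  # minimized w.r.t. the network weights
```

In the actual training setup, `q` would be produced by the correspondence network and the minimization would run over its parameters with Adam, as described above.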
Implementation Details. For the following components we rely on the implementations provided by OpenCV [7]: the 5-point algorithm [31], epipolar error, SIFT features [27], feature matching, and essential matrix decomposition. We extract 2000 features per input image, which yields 2000 correspondences per image pair after matching. When applying Lowe’s ratio criterion [27] for filtering, and hence reducing the number of correspondences, we randomly duplicate correspondences to restore the number of 2000. We minimize the expected task loss using Adam [23] with a learning rate of and a batch size of 32. We choose hyperparameters based on the validation error of the Reichstag scene. We observe that the magnitude of the validation error corresponds well to the magnitude of the training error, so a validation set would not be strictly required.

Qualitative Results. We present additional qualitative results for indoor and outdoor scenarios in Fig. 9. We compare results of RANSAC and NG-RANSAC, also visualizing the neural guidance predicted by our network. We obtain these results in the high-outlier setup, without using Lowe’s ratio criterion and without using side information as additional network input.
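The correspondence duplication mentioned in the implementation details above can be sketched as follows; the function name and the stand-in correspondence format are our own, illustrative choices.

```python
import random

def pad_by_duplication(correspondences, target_count=2000, seed=None):
    """After Lowe's ratio test removes matches, randomly duplicate
    surviving correspondences until the fixed input size is restored,
    so the network always sees the same input dimensions."""
    rng = random.Random(seed)
    padded = list(correspondences)
    while len(padded) < target_count:
        padded.append(rng.choice(correspondences))
    return padded

# e.g. 1400 survivors padded back to 2000 entries;
# tuples stand in for the 4D correspondence vectors
survivors = [(i, i) for i in range(1400)]
padded = pad_by_duplication(survivors, target_count=2000, seed=0)
```

Since duplicated correspondences carry identical coordinates, they receive identical sampling weights and do not bias the guided sampling beyond slightly over-representing their match.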
Appendix B Fundamental Matrix Estimation
Implementation Details. We reuse the architecture of Fig. 8. We normalize the image coordinates of feature matches before passing them to the network: we subtract the mean coordinate and divide by the coordinate standard deviation, where we calculate mean and standard deviation over the training set. Ranftl and Koltun [36] fit the final fundamental matrix to the top 20 weighted correspondences as predicted by their network. Similarly, we refit the final fundamental matrix to the largest inlier set found by NG-RANSAC. This refinement step results in a small but noticeable increase in accuracy. For the following components we rely on the implementations provided by OpenCV [7]: the 7-point algorithm, epipolar error, SIFT features [27] and feature matching.

Appendix C Horizon Lines
Network Architecture. We provide a schematic of our network architecture for horizon line estimation in Fig. 11. The network takes a px RGB image as input. We rescale images of arbitrary aspect ratio such that the long side is px. We symmetrically zero-pad the short side to px. The network has two output branches. The first branch predicts a set of 2D points, our observations, to which we fit the horizon line. We apply a Sigmoid and rescale output points to [-1.5, 1.5] in relative image coordinates to support horizon lines outside the image area. We implement the network in a fully convolutional way [26]; each output point is predicted for a patch, i.e. a restricted receptive field, of the input image. Therefore, we shift the coordinate of each output point to the center of its associated patch.

The second output branch predicts sampling probabilities for the output points. We apply a Sigmoid to the output of the second branch, and normalize by dividing by the sum of outputs. During training, we block the gradients of the second output branch when backpropagating to the base network. The sampling gradients have larger variance and magnitude than the observation gradients of the first branch, especially in the beginning of training, which has a negative effect on the convergence of the network as a whole. Intuitively, we want to give priority to the observation predictions because they determine the accuracy of the final model parameters. The sampling predictions should address deficiencies in the observation predictions without influencing them too much. The gradient blockade ensures these properties.
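The point decoding of the first branch can be sketched as follows. This is an illustrative version: the image size, patch size, and the exact composition of patch center plus offset are assumptions, since these dimensions are not given above.

```python
import math

def decode_points(raw_outputs, image_size=256, patch_size=32):
    """Map raw network outputs to 2D observation points:
    Sigmoid, affine rescaling to [-1.5, 1.5] in relative coordinates,
    then a shift to the center of the patch each output belongs to.
    Grid layout and sizes are assumed for illustration."""
    patches_per_side = image_size // patch_size
    points = []
    for idx, (rx, ry) in enumerate(raw_outputs):
        # Sigmoid maps to (0, 1); affine map to (-1.5, 1.5)
        sx = 1.0 / (1.0 + math.exp(-rx)) * 3.0 - 1.5
        sy = 1.0 / (1.0 + math.exp(-ry)) * 3.0 - 1.5
        # center of the associated patch, in relative coordinates [0, 1]
        row, col = divmod(idx, patches_per_side)
        cx = (col + 0.5) * patch_size / image_size
        cy = (row + 0.5) * patch_size / image_size
        points.append((cx + sx, cy + sy))
    return points
```

A raw output of zero thus places the point exactly at its patch center, while extreme activations can push it up to 1.5 image heights outside the image, supporting horizons beyond the image area.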
Implementation Details. We use a differentiable soft inlier count [6] as scoring function:
$$s(h, \mathcal{Y}) = \sum_i \mathrm{sig}\left(\beta \left(\tau - d(y_i, h)\right)\right) \qquad (13)$$
where $d(y_i, h)$ denotes the point-line distance between observation $y_i$ and line hypothesis $h$. Hyperparameter $\alpha$ determines the softness of the scoring distribution in DSAC, $\beta$ determines the softness of the Sigmoid, and $\tau$ is the inlier threshold. We choose hyperparameters based on a grid search for the minimal training loss.
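The soft inlier count can be sketched as follows; the hyperparameter values in the signature are placeholders, since the values used in the paper are not given above.

```python
import math

def sigmoid(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def point_line_distance(point, line):
    """Distance of point (x, y) to a line (a, b, c) with ax + by + c = 0."""
    a, b, c = line
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)

def soft_inlier_count(hypothesis, observations, tau=0.05, beta=100.0):
    """Differentiable relaxation of the inlier count (cf. Eq. 13):
    each observation contributes sig(beta * (tau - distance)),
    i.e. ~1 well inside the threshold and ~0 far outside it."""
    return sum(sigmoid(beta * (tau - point_line_distance(y, hypothesis)))
               for y in observations)
```

Unlike a hard inlier count, this score changes smoothly as a hypothesis moves, which is what makes it usable inside the differentiable DSAC pipeline.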
As discussed in the main paper, we use the normalized maximum distance between a line hypothesis and the ground truth horizon within the image as task loss. This can lead to stability issues when we sample line hypotheses with very steep slope. Therefore, we clamp the task loss to a maximum of 1, i.e. the normalized image height.
As mentioned before, some images in the HLW dataset [52] have their horizon outside the image. Some of these images contain virtually no visual cue where the horizon exactly lies. Therefore, we find it beneficial to use a robust variant of the task loss that limits the influence of such outliers. We use:
(14) 
That is, we use the square root of the task loss beyond a magnitude of , which is the magnitude up to which the AUC is calculated when evaluating on HLW [52].
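One way to realize such a robust, clamped task loss is sketched below. The switching threshold and the exact functional form beyond it are assumptions of ours (chosen so the loss stays continuous and sub-linear past the threshold), since the constants are elided above.

```python
import math

def robust_horizon_loss(max_line_distance, threshold=0.25):
    """Task loss: normalized maximum distance between a line hypothesis
    and the ground truth horizon, clamped to 1 (the normalized image
    height), with square-root growth beyond a threshold to limit the
    influence of outlier images (assumed functional form)."""
    loss = min(max_line_distance, 1.0)  # clamp to normalized image height
    if loss <= threshold:
        return loss
    # sqrt(threshold * loss) equals threshold at loss == threshold,
    # so the robustified loss is continuous at the switching point
    return math.sqrt(threshold * loss)
```

The square-root branch keeps the gradient signal for bad hypotheses non-zero while preventing a few images with an invisible horizon from dominating the expected task loss.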
Appendix D Camera Re-Localization
Network Architecture. We provide a schematic of our network architecture for camera re-localization in Fig. 13. The network is an FCN [26] that takes an RGB image as input, and predicts dense outputs, subsampled by a factor of 8. The network has two output branches. The first branch predicts 3D scene coordinates [41], our observations, to which we fit the 6D camera pose. The second output branch predicts sampling probabilities for the scene coordinates. We apply a Sigmoid to the output of the second branch, and normalize by dividing by the sum of outputs. During training, we block the gradients of the second output branch when backpropagating to the base network. The sampling gradients have larger variance and magnitude than the observation gradients of the first branch, especially in the beginning of training, which has a negative effect on the convergence of the network as a whole. Intuitively, we want to give priority to the scene coordinate predictions because they determine the accuracy of the pose estimate. The sampling predictions should address deficiencies in the scene coordinate predictions without influencing them too much. The gradient blockade ensures these properties.
Implementation Details. We follow the three-stage training procedure proposed by Brachmann and Rother for DSAC++ [6].
Firstly, we optimize the distance between predicted and ground truth scene coordinates. We obtain ground truth scene coordinates by rendering the sparse reconstructions given in the Cambridge Landmarks dataset [22]. We ignore pixels with no corresponding 3D point in the reconstruction. Since the reconstructions contain outlier 3D points, we use the following robust distance:
(15) 
That is, we use the Euclidean distance up to a threshold of 10m, after which we use the square root of the Euclidean distance. We train the first stage for 500k iterations using Adam [23] with a learning rate of and a batch size of 1 image.
Secondly, we optimize the reprojection error of the scene coordinate predictions to the ground truth camera pose. Similar to the first stage, we use a robust distance function with a threshold of 10px after which we use the square root of the reprojection error. We train the second stage for 300k iterations using Adam [23] with a learning rate of and a batch size of 1 image.
Thirdly, we optimize the expected task loss according to the NG-DSAC objective as explained in the main paper. As task loss we use . We measure the angle between the estimated camera rotation and the ground truth rotation in degrees. We measure the distance between the estimated camera position and the ground truth position in meters. As with horizon line estimation (see previous section), we use a soft inlier count as hypothesis scoring function. We train the third stage for 200k iterations using Adam [23] with a learning rate of and a batch size of 1 image.
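The two pose error components can be computed as sketched below; this is an illustrative, framework-free version using the standard trace formula for the angle between rotation matrices, with function names of our own.

```python
import math

def rotation_angle_deg(r_est, r_gt):
    """Angle in degrees between two 3x3 rotation matrices (nested lists),
    via the trace of the relative rotation R_est^T * R_gt."""
    # trace(R_est^T R_gt) = sum over element-wise products
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # trace(R_rel) = 1 + 2 cos(angle); clamp for numerical safety
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def position_error_m(t_est, t_gt):
    """Euclidean distance in meters between estimated and
    ground truth camera positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))
```

These per-image errors are the quantities entering the task loss of the third training stage and the evaluation in Fig. 7.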
Learned 3D Representations. We visualize the internal 3D scene representations learned by DSAC++ and NGDSAC++ in Fig. 14 for two more scenes.
References
 [1] R. Arandjelovic. Three things everyone should know to improve object retrieval. In CVPR, 2012.
 [2] J. Bian, W.-Y. Lin, Y. Matsushita, S.-K. Yeung, T. D. Nguyen, and M.-M. Cheng. GMS: Grid-based motion statistics for fast, ultra-robust feature correspondence. In CVPR, 2017.
 [3] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother. Learning 6D object pose estimation using 3D object coordinates. In ECCV, 2014.
 [4] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother. DSAC - Differentiable RANSAC for camera localization. In CVPR, 2017.
 [5] E. Brachmann, F. Michel, A. Krull, M. Y. Yang, S. Gumhold, and C. Rother. Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In CVPR, 2016.
 [6] E. Brachmann and C. Rother. Learning less is more - 6D camera localization via 3D surface regression. In CVPR, 2018.
 [7] G. Bradski. OpenCV. Dr. Dobb’s Journal of Software Tools, 2000.
 [8] T. Cavallari, S. Golodetz, N. A. Lord, J. Valentin, L. Di Stefano, and P. H. Torr. On-the-fly adaptation of regression forests for online camera relocalisation. In CVPR, 2017.
 [9] O. Chum and J. Matas. Matching with PROSAC - Progressive sample consensus. In CVPR, 2005.
 [10] O. Chum and J. Matas. Optimal randomized RANSAC. TPAMI, 2008.
 [11] O. Chum, J. Matas, and J. Kittler. Locally optimized RANSAC. In DAGM, 2003.
 [12] M. A. Fischler and R. C. Bolles. Random Sample Consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981.
 [13] X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng. Complete solution classification for the perspective-three-point problem. TPAMI, 2003.
 [14] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
 [15] R. I. Hartley. In defense of the eight-point algorithm. TPAMI, 1997.
 [16] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.

 [17] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
 [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [19] J. Heinly, J. L. Schönberger, E. Dunn, and J.-M. Frahm. Reconstructing the World* in Six Days *(As Captured by the Yahoo 100 Million Image Dataset). In CVPR, 2015.
 [20] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [21] O. H. Jafari, S. K. Mustikovela, K. Pertsch, E. Brachmann, and C. Rother. iPose: Instance-aware 6D pose estimation of partly occluded objects. 2018.
 [22] A. Kendall, M. Grimes, and R. Cipolla. PoseNet: A convolutional network for realtime 6DoF camera relocalization. In ICCV, 2015.
 [23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [24] F. Kluger, H. Ackermann, M. Y. Yang, and B. Rosenhahn. Deep learning for vanishing point detection using an inverse gnomonic projection. In GCPR, 2017.
 [25] J.-T. Lee, H.-U. Kim, C. Lee, and C.-S. Kim. Semantic line detection and its applications. In ICCV, 2017.
 [26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
 [28] D. Massiceti, A. Krull, E. Brachmann, C. Rother, and P. H. S. Torr. Random forests versus neural networks  what’s best for camera localization? In ICRA, 2017.
 [29] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. TRO, 2017.
 [30] R. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proc. ISMAR, 2011.
 [31] D. Nistér. An efficient solution to the five-point relative pose problem. TPAMI, 2004.
 [32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
 [33] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
 [34] R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J.-M. Frahm. USAC: A universal framework for random sample consensus. TPAMI, 2013.
 [35] R. Raguram, J.-M. Frahm, and M. Pollefeys. A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus. In ECCV, 2008.
 [36] R. Ranftl and V. Koltun. Deep fundamental matrix estimation. In ECCV, 2018.
 [37] C. Rother. A new approach for vanishing point detection in architectural environments. In BMVC, 2002.
 [38] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
 [39] T. Sattler, B. Leibe, and L. Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. TPAMI, 2016.
 [40] J. L. Schönberger and J.-M. Frahm. Structure-from-Motion Revisited. In CVPR, 2016.
 [41] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In CVPR, 2013.
 [42] G. Simon, A. Fond, and M.-O. Berger. A-contrario horizon-first vanishing point detection using second-order grouping laws. In ECCV, 2018.
 [43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
 [44] C. Strecha, W. von Hansen, L. J. V. Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multiview stereo for high resolution imagery. In CVPR, 2008.
 [45] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
 [46] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In CVPR, 2018.
 [47] B. Tordoff and D. W. Murray. Guided sampling and consensus for motion estimation. In ECCV, 2002.
 [48] P. H. S. Torr and A. Zisserman. MLESAC: A new robust estimator with application to estimating image geometry. CVIU, 2000.
 [49] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, 2016.
 [50] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In CVPR, 2017.
 [51] J. Valentin, M. Nießner, J. Shotton, A. Fitzgibbon, S. Izadi, and P. H. S. Torr. Exploiting uncertainty in regression forests for accurate camera relocalization. In CVPR, 2015.
 [52] S. Workman, M. Zhai, and N. Jacobs. Horizon lines in the wild. In BMVC, 2016.
 [53] C. Wu. Towards linear-time incremental structure from motion. In 3DV, 2013.
 [54] J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In ICCV, 2013.
 [55] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned invariant feature transform. In ECCV, 2016.
 [56] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua. Learning to find good correspondences. In CVPR, 2018.
 [57] M. Zhai, S. Workman, and N. Jacobs. Detecting vanishing points using global image context in a non-Manhattan world. In CVPR, 2016.