1 Introduction
In traditional Computer Vision, many tasks can be solved by finding the singular or eigenvector corresponding to the smallest, often zero, singular or eigenvalue of the matrix encoding a linear system. Examples include estimating essential matrices or homographies from matched keypoints and computing pose from 3D to 2D correspondences.
In the era of Deep Learning, there is growing interest in embedding these methods within a deep architecture to allow end-to-end training. For example, it has recently been shown that such an approach can be used to train networks to detect and match keypoints in image pairs while accounting for the global consistency of the correspondences [1]. More generally, this approach would allow us to explicitly encode notions of geometry within deep networks, thus sparing the network the need to relearn what has been known for decades and making it possible to learn from smaller amounts of training data.
One way to implement this approach is to design a network whose output defines a matrix and train it so that the smallest singular vector or eigenvector of the matrices it produces is as close as possible to the ground-truth one. This is the strategy used in [1] to simultaneously establish correspondences and compute the corresponding Essential Matrix: The network’s outputs are weights discriminating inlier correspondences from outliers and are used to assemble an auxiliary matrix whose smallest eigenvector is the sought-for Essential Matrix.
The main obstacle to implementing this approach is that it requires being able to differentiate the singular value decomposition (SVD) or the eigendecomposition (ED) in a stable manner to train the network, a non-trivial problem that has already received considerable attention [2, 3, 4]. As a result, these decompositions are already part of standard Deep Learning frameworks, such as TensorFlow [5] or PyTorch [6]. However, they ignore two key practical issues. First, when optimizing with respect to the matrix itself or with respect to parameters defining it, the vector corresponding to the smallest singular value or eigenvalue may switch abruptly as the relative magnitudes of these values change, which is essentially non-differentiable. This is illustrated in the example of Fig. 1, discussed in detail in Section 2. Second, computing the gradient requires dividing by the difference between two singular values or eigenvalues, which could be zero. While a solution to the latter was proposed in [2], the former is unavoidable.



[Figure 1: panels (a) and (b)]
In this paper, we therefore introduce an approach to training a deep network whose loss depends on the eigenvector corresponding to a zero eigenvalue of a matrix, which is either the output of the network or a function of it, without explicitly performing an SVD or ED. Our loss is fully differentiable, does not suffer from the instabilities the above-mentioned problems can cause, and can be naturally incorporated in a deep learning architecture. In practice, because image measurements are never perfect, the eigenvalue is never strictly zero. This, however, does not affect the computation either, which makes our approach robust to noise.
To demonstrate this in a Deep Learning context, we evaluate our approach on the tasks of training a network to find globally-consistent keypoint correspondences using the essential matrix and training another to remove outliers for pose estimation when solving the Perspective-n-Point (PnP) problem. In both cases, our approach delivers state-of-the-art results, whereas using the standard implementation of singular and eigenvalue decomposition provided in TensorFlow results in either the learning procedure not converging or in significantly worse performance.
2 Motivation
To illustrate the problems associated with differentiating eigenvectors and eigenvalues, consider the outlier rejection toy example depicted by Fig. 1. The inputs are 3D points lying on a plane and drawn in black, and an outlier 3D point shown in red, which we assume to be very far from the plane. Suppose we want to assign a binary weight to each point (1 for inliers, 0 for outliers) such that the eigenvector corresponding to the smallest eigenvalue of the weighted covariance matrix is close to the ground-truth one in the least-squares sense. When the weight assigned to the outlier is 0, it would be $\mathbf{e}_1$, which is also the normal to the plane and is shown in green. However, if at some point during optimization, typically at initialization, we assign the weight 1 to the outlier, $\mathbf{e}_1$ will correspond to the largest eigenvalue instead of the smallest, and the eigenvector corresponding to the smallest eigenvalue will be the vector $\mathbf{e}_2$ shown in blue, which is perpendicular to $\mathbf{e}_1$. As a result, if we initially set all weights to 1 and optimize them so that the smallest eigenvector approaches the plane normal, the gradient values will depend on the coordinates of $\mathbf{e}_2$. At some point during the optimization, if everything goes well, the weight assigned to the outlier will become small enough so that the smallest eigenvector switches from being $\mathbf{e}_2$ to being $\mathbf{e}_1$, which introduces a large jump in the gradient vector whose values will now depend on the coordinates of $\mathbf{e}_1$ instead of $\mathbf{e}_2$.
In this simple case, this kind of instability does not preclude eventual convergence. However, in more complex situations, we found that it does, as evidenced by our experiments. This problem was already noted in [1] in the context of learning keypoint correspondences. To circumvent this issue, the algorithm in [1] had to first rely on a classification loss to determine the potential inlier correspondences before incorporating the loss based on the essential matrix to impose geometric constraints, which requires eigendecomposition. This ensured that the network weights were already good enough to prevent eigenvector switching when starting to minimize the geometrybased loss.
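The switching behavior described in this section can be reproduced numerically. The following is a minimal NumPy sketch under our own assumptions: the point counts, coordinate ranges, outlier position, and the helper names `smallest_eigvec` and `alignment` are illustrative choices, not values taken from the experiments.

```python
import numpy as np

def smallest_eigvec(points, weights):
    """Smallest eigenvector of the weighted covariance of `points` (N x 3)."""
    w = np.asarray(weights, dtype=float)
    mean = (w[:, None] * points).sum(axis=0) / w.sum()
    X = points - mean
    M = X.T @ (w[:, None] * X)       # weighted covariance X^T W X
    _, vecs = np.linalg.eigh(M)      # eigh returns eigenvalues in ascending order
    return vecs[:, 0]

# Hypothetical data: 50 inliers on the plane z = 0 and one distant outlier.
rng = np.random.default_rng(0)
inliers = np.column_stack([rng.uniform(-1, 1, 50),
                           rng.uniform(-1, 1, 50),
                           np.zeros(50)])
points = np.vstack([inliers, [[0.0, 0.0, 50.0]]])
normal = np.array([0.0, 0.0, 1.0])

def alignment(outlier_weight):
    """|cosine| between the smallest eigenvector and the plane normal."""
    w = np.ones(len(points))
    w[-1] = outlier_weight
    return abs(smallest_eigvec(points, w) @ normal)

for w_out in (1.0, 0.1, 0.01, 0.001, 0.0):
    print(f"outlier weight {w_out:6.3f} -> alignment {alignment(w_out):.3f}")
```

With the outlier fully weighted, the smallest eigenvector lies almost within the plane. Once the weight becomes small enough, it jumps to the plane normal rather than rotating towards it smoothly, which is the discontinuity a gradient-based optimizer has to cope with.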
3 Related Work
In recent years, the need to integrate geometric methods and mathematical tools into Deep Learning frameworks has led to the reformulation of a number of them in network terms. For example, [7] considers spatial transformations of image regions with CNNs. The set of such transformations is extended in [8]. In a different context, [9] derives a differentiation of the Cholesky decomposition that could be integrated in Deep Learning frameworks.
Unfortunately, the set of geometric Computer Vision problems that these methods can handle remains relatively limited. In particular, there is no widely accepted deep-learning way to solve the many geometric problems that reduce to finding least-squares solutions of linear systems. In this work, we consider two such problems: Computing the essential matrix from keypoint correspondences in an image pair and estimating the 3D pose of an object from 3D-to-2D correspondences, both of which we briefly discuss below.
Estimating the Essential matrix from correspondences.
The eigenvalue-based solution to this problem has been known for decades [10, 11, 12] and remains the standard way to compute Essential matrices [13]. The real focus of research in this area has been to establish reliable keypoint correspondences and to eliminate outliers. In this context, variations of RANSAC [14], such as MLESAC [15] and Least Median of Squares (LMedS) [16], and very recently GMS [17], have become popular. For a comprehensive study of such methods, we refer the interested reader to [18]. With the emergence of Deep Learning, there has been a trend towards moving away from this decades-old knowledge and instead applying a black-box approach where a Deep Network is trained to directly estimate the rotation and translation matrices [19, 20] without a priori geometrical knowledge. The very recent work of [1] attempts to reconcile these two opposing trends by embedding the geometric constraints into a Deep Net and has demonstrated superior performance for this task when the correspondences are hard to establish.
Estimating 3D pose from 3D-to-2D correspondences.
This is known as the Perspective-n-Point (PnP) problem. It has also been investigated for decades and is also amenable to an eigendecomposition-based solution [12], many variations of which have been proposed over the years [21, 22, 23, 24]. DSAC [25] is the only approach we know of that integrates the PnP solver into a Deep Network. As explicitly differentiating through the PnP solver is not optimization friendly, the authors apply the log trick used in the reinforcement learning literature. This amounts to using a numerical approximation of the derivative from random samples, which is not ideal, given that an analytical alternative exists. Moreover, DSAC only works for grid configurations and known scenes. By contrast, the method we propose in this work has an analytical form, with no need for stochastic sampling.
Differentiating the eigen- and singular value decomposition.
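Whether computing the essential matrix, estimating 3D pose, or solving any other least-squares problem, incorporating an eigendecomposition solver into a deep network requires differentiating the eigendecomposition. Expressions for such derivatives have been given in [2, 3] and reformulated in terms that are compatible with backpropagation in [4]. Specifically, as shown in [4], for a matrix $\mathbf{M}$ written as $\mathbf{M} = \mathbf{V}\boldsymbol{\Sigma}\mathbf{V}^\top$, the variations of the eigenvectors with respect to the matrix, used to compute derivatives, are
$$\partial\mathbf{V} = \mathbf{V}\left(\tilde{\mathbf{K}}^\top \circ \left(\mathbf{V}^\top\,\partial\mathbf{M}\,\mathbf{V}\right)\right), \tag{1}$$
where $\sigma_i = \Sigma_{ii}$, and
$$\tilde{K}_{ij} = \begin{cases} \dfrac{1}{\sigma_i - \sigma_j}, & i \neq j, \\ 0, & i = j. \end{cases} \tag{2}$$
As can be seen from Eq. 2, if two eigenvalues are equal, that is, $\sigma_i = \sigma_j$ for some $i \neq j$, the denominator becomes 0, thus creating numerical instabilities. The same can be said about singular value decomposition.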
A solution to this was proposed in [2], and singular and eigenvalue decomposition have been used within deep networks for problems where all the singular values are used and their order is irrelevant [26, 27]. In the context of spectral clustering, the approach of [28] also proposed a solution that eliminates the need for explicit eigendecomposition. This solution, however, was dedicated to the scenario where one seeks to use all non-zero eigenvalues, assuming a matrix of constant rank.
Here, by contrast, we tackle problems where what matters is a single eigenvalue or singular value. In this case, the order of the eigenvalues is important. However, this order can change during training, which results in a non-differentiable switch from one eigenvector to another, as in the toy example of Section 2. In turn, this leads to numerical instabilities, which can prevent convergence. In [1], this problem is finessed by first training the network using a classification loss that does not depend on eigenvectors. Only once a sufficiently good solution is found, that is, a solution close enough to the correct one for vector switching not to happen anymore, is the loss term that depends on the eigenvector associated to the smallest eigenvalue turned on. As we will show later, we can achieve state-of-the-art results without the need for such a heuristic, by deriving a more robust, eigendecomposition-free loss function.
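To make the role of the eigenvalue differences concrete, the sketch below evaluates the standard first-order perturbation result for a symmetric matrix, $\partial\mathbf{V} = \mathbf{V}(\mathbf{K}^\top \circ (\mathbf{V}^\top \partial\mathbf{M}\,\mathbf{V}))$ with $K_{ij} = 1/(\sigma_i - \sigma_j)$ for $i \neq j$. The function name `eigvec_differential` and the test matrices are our own illustrative assumptions, not part of any framework.

```python
import numpy as np

def eigvec_differential(M, dM):
    """First-order change dV of the eigenvectors of a symmetric matrix M
    under a symmetric perturbation dM, together with the factor matrix K."""
    s, V = np.linalg.eigh(M)
    diff = s[:, None] - s[None, :]           # diff[i, j] = s_i - s_j
    K = np.zeros_like(diff)
    off_diag = ~np.eye(len(s), dtype=bool)
    K[off_diag] = 1.0 / diff[off_diag]       # grows without bound as s_i -> s_j
    dV = V @ (K.T * (V.T @ dM @ V))
    return dV, K

dM = np.array([[0.3, 0.1, -0.2],
               [0.1, -0.5, 0.4],
               [-0.2, 0.4, 0.2]])

# Well-separated eigenvalues: the factors in K stay moderate.
_, K_far = eigvec_differential(np.diag([1.0, 2.0, 4.0]), dM)
# Two nearly equal eigenvalues: the corresponding factor explodes.
_, K_near = eigvec_differential(np.diag([0.0, 1e-8, 1.0]), dM)
print(np.abs(K_far).max(), np.abs(K_near).max())
```

The second call mimics the zero-eigenvalue setting with a nearly repeated eigenvalue, where a single entry of the factor matrix dominates the gradient.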
4 Our Approach
We introduce an approach that enables us to work with eigenvectors corresponding to zero eigenvalues within an end-to-end learning formalism, while being subject to neither the gradient instabilities due to vector switching discussed in Section 2 nor to difficulties caused by repeated eigenvalues. To this end, we derive a loss function that directly operates on the matrix whose eigenvectors or singular vectors we are interested in but without explicitly performing an SVD or ED.
In this section, we first discuss the generic scenario in which the matrix of interest is directly the output of the network. We then consider the slightly more involved case where the network predicts weights that themselves define the matrix, which corresponds to our application scenarios. Note that, while we discuss our approach in the context of Deep Learning, it is applicable to any optimization framework where one seeks to optimize a loss function based on the smallest eigenvector of a matrix with respect to the parameters that define this matrix.
4.1 Generic Scenario
Given an input measurement $\mathbf{x}$, let us denote by $f_\theta(\mathbf{x})$ the output of a deep network with parameters $\theta$. Here, we consider the case where the output of the network is a matrix, which we write as $\mathbf{A}_\theta = f_\theta(\mathbf{x})$. Our goal is to tackle problems where the loss function of the network depends on the smallest eigenvector $\mathbf{e}_\theta$ of $\mathbf{A}_\theta^\top\mathbf{A}_\theta$, which ensures that the matrix is symmetric.
Typically, one can use an $\ell_2$ loss of the form $\|\mathbf{e}_\theta - \bar{\mathbf{e}}\|^2$, where $\bar{\mathbf{e}}$ is the ground-truth smallest eigenvector. The standard approach to addressing this, as followed in [4, 1], consists of explicitly differentiating this loss w.r.t. $\mathbf{e}_\theta$, then w.r.t. $\mathbf{A}_\theta$ and finally w.r.t. $\theta$ via backpropagation. As discussed above, however, this is not optimization friendly.
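To overcome this, we propose to define a new loss motivated by the linear equation that defines eigenvectors and eigenvalues. Specifically, if $\mathbf{e}$ is an eigenvector of $\mathbf{A}^\top\mathbf{A}$ with eigenvalue $\lambda$, it satisfies
$$\mathbf{A}^\top\mathbf{A}\,\mathbf{e} = \lambda\,\mathbf{e}. \tag{3}$$
Since eigenvectors have unit norm, i.e., $\mathbf{e}^\top\mathbf{e} = 1$, multiplying both sides of this equation from the left by $\mathbf{e}^\top$ yields
$$\mathbf{e}^\top\mathbf{A}^\top\mathbf{A}\,\mathbf{e} = \lambda. \tag{4}$$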
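In this paper, we consider zero eigenvalue problems, that is, $\lambda = 0$. Since $\mathbf{A}^\top\mathbf{A}$ is positive semidefinite, we have that $\mathbf{e}^\top\mathbf{A}^\top\mathbf{A}\,\mathbf{e} \geq 0$ for any $\mathbf{e}$. Given the ground-truth eigenvector $\bar{\mathbf{e}}$ that we seek to predict, this lets us define the loss function
$$L(\theta) = \bar{\mathbf{e}}^\top\mathbf{A}_\theta^\top\mathbf{A}_\theta\,\bar{\mathbf{e}}. \tag{5}$$
Intuitively, this loss aims to find the parameters $\theta$ such that $\bar{\mathbf{e}}$ is an eigenvector of the resulting matrix with minimum eigenvalue, that is, zero in our case, assuming that we can truly reach the global minimum of our loss. However, this loss alone has multiple globally optimal solutions, including the trivial one $\mathbf{A}_\theta = \mathbf{0}$.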
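To address this, we note that this trivial solution has not only one zero eigenvalue corresponding to eigenvector $\bar{\mathbf{e}}$, but that all its eigenvalues are zero. Since, in practice, we typically search for matrices that have a single zero eigenvalue, we propose to maximize the projection of the data along the directions orthogonal to $\bar{\mathbf{e}}$. Such a projection can be achieved by making use of the orthogonal complement to $\bar{\mathbf{e}}$, given by $(\mathbf{I} - \bar{\mathbf{e}}\bar{\mathbf{e}}^\top)$, where $\mathbf{I}$ is the identity matrix. By defining $\bar{\mathbf{A}}_\theta = \mathbf{A}_\theta(\mathbf{I} - \bar{\mathbf{e}}\bar{\mathbf{e}}^\top)$, we can then rewrite our loss function as
$$L(\theta) = \bar{\mathbf{e}}^\top\mathbf{A}_\theta^\top\mathbf{A}_\theta\,\bar{\mathbf{e}} - \alpha\,\mathrm{tr}\!\left(\bar{\mathbf{A}}_\theta^\top\bar{\mathbf{A}}_\theta\right), \tag{6}$$
where $\mathrm{tr}(\cdot)$ computes the trace of a matrix and $\alpha$ sets the relative influence of the two terms. Note that we can apply the same strategy to cases where multiple eigenvalues are zero, by reducing the orthogonal space to only the directions corresponding to non-zero eigenvalues, and introducing the first term for all eigenvectors whose eigenvalues we want to be zero.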
For numerical stability, we further propose to bound the second term in the range $[0, 1]$. To do so, we rewrite our loss as
$$L(\theta) = \bar{\mathbf{e}}^\top\mathbf{A}_\theta^\top\mathbf{A}_\theta\,\bar{\mathbf{e}} + \alpha\, e^{-\beta\,\mathrm{tr}\left(\bar{\mathbf{A}}_\theta^\top\bar{\mathbf{A}}_\theta\right)}, \tag{7}$$
where $\beta$ is a scalar. This loss is fully differentiable, and can thus be used to learn the parameters of a deep network. Since it does not explicitly depend on performing an eigendecomposition at every iteration of the optimization, it suffers from neither the eigenvector switching problem nor the non-unique eigenvalue problem.
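As a concrete illustration, the following NumPy sketch evaluates a loss of this kind for a given matrix. It assumes the form "residual along the ground-truth vector, plus $\alpha\exp(-\beta\,\cdot)$ of the energy in the orthogonal complement"; the function name `eig_free_loss` and the default values of `alpha` and `beta` are hypothetical.

```python
import numpy as np

def eig_free_loss(A, e_bar, alpha=1.0, beta=1.0):
    """Eigendecomposition-free loss for a matrix A and a target null vector e_bar.

    Assumed form: e^T A^T A e + alpha * exp(-beta * tr(A_bar^T A_bar)),
    with A_bar = A (I - e e^T) and e the normalized e_bar.
    """
    e = e_bar / np.linalg.norm(e_bar)
    P = np.eye(len(e)) - np.outer(e, e)      # projector onto the complement of e
    residual = e @ A.T @ A @ e               # zero iff A e = 0
    A_bar = A @ P
    spread = np.trace(A_bar.T @ A_bar)       # energy orthogonal to e
    return residual + alpha * np.exp(-beta * spread)

e_bar = np.array([0.0, 0.0, 1.0])
A_good = np.array([[1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])        # e_bar spans the null space of A_good
A_trivial = np.zeros((2, 3))                 # the trivial all-zero solution

print(eig_free_loss(A_good, e_bar), eig_free_loss(A_trivial, e_bar))
```

No call to an eigensolver appears anywhere, so the gradient with respect to `A` involves only matrix products and an exponential. The all-zero matrix makes the first term vanish but is penalized by the second.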
4.2 Learning to Predict Weights
In practice, the problem of interest is often more constrained than training a network to directly output a matrix $\mathbf{A}_\theta$. In particular, in this paper, we consider problems where the goal is to predict a weight $w_i$ for each element of the input. This typically leads to formulations where the matrix of interest has the form $\mathbf{X}^\top\mathbf{W}\mathbf{X}$, with $\mathbf{X}$ a data matrix and $\mathbf{W}$ a diagonal matrix whose elements are the weights $w_i$. Below, we introduce the formulation for each of the applications that we consider in our experiments.
4.2.1 Outlier Rejection with 3D Points.
To show that we can indeed backpropagate nicely through the proposed loss formulation where directly using the analytical gradient fails, we first briefly revisit the toy outlier rejection problem used to motivate our approach in Section 2. For this experiment, we do not train a Deep Network, or perform any learning procedure. Instead, given $N$ 3D points $\mathbf{p}_i$, including inliers and outliers, we directly optimize the weight $w_i$ of each point. At every step of optimization, given the current weight values, we compute the weighted mean of the points $\boldsymbol{\mu} = \sum_i w_i \mathbf{p}_i / \sum_i w_i$. Let $\mathbf{X}$ be the $N \times 3$ matrix of mean-subtracted 3D points. We then compute the weighted covariance matrix $\mathbf{X}^\top\mathbf{W}\mathbf{X}$, where $\mathbf{W}$ is a diagonal matrix whose elements are the weights $w_i$. The smallest eigenvector of $\mathbf{X}^\top\mathbf{W}\mathbf{X}$ then defines the direction of noise.
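The sketch below optimizes per-point weights for this toy problem without any eigendecomposition. It is an illustration under several assumptions of ours, not necessarily the exact objective used in the experiments: the loss is taken to be $\sum_i w_i a_i + \alpha \exp(-\beta \sum_i w_i b_i)$, where $a_i$ and $b_i$ are the squared components of the centered point along and orthogonal to the ground-truth normal; the weighted mean is treated as a constant when differentiating; weights are clipped to $[0, 1]$; and the data, step size, and function name `plane_fit_weights` are hypothetical.

```python
import numpy as np

def plane_fit_weights(points, e_bar, alpha=1.0, beta=0.1, lr=0.01, steps=500):
    """Projected gradient descent on per-point weights (all initialized to 1)."""
    w = np.ones(len(points))
    P = np.eye(3) - np.outer(e_bar, e_bar)     # projector orthogonal to e_bar
    for _ in range(steps):
        mu = (w[:, None] * points).sum(axis=0) / w.sum()
        X = points - mu                        # mean-subtracted points
        a = (X @ e_bar) ** 2                   # per-point energy along e_bar
        b = np.sum((X @ P) ** 2, axis=1)       # per-point energy orthogonal to e_bar
        # Gradient of sum(w * a) + alpha * exp(-beta * sum(w * b)) w.r.t. w,
        # with the weighted mean held fixed (a simplification).
        grad = a - alpha * beta * np.exp(-beta * (w @ b)) * b
        w = np.clip(w - lr * grad, 0.0, 1.0)
    return w

rng = np.random.default_rng(0)
inliers = np.column_stack([rng.uniform(-1, 1, 20),
                           rng.uniform(-1, 1, 20),
                           np.zeros(20)])
points = np.vstack([inliers, [[0.3, -0.2, 5.0]]])   # last point is the outlier
e_bar = np.array([0.0, 0.0, 1.0])

weights = plane_fit_weights(points, e_bar)
print("outlier weight:", weights[-1], "min inlier weight:", weights[:-1].min())
```

Because the gradient of each weight depends only on the point itself and on a scalar, there is no abrupt change when the ordering of the eigenvalues of the weighted covariance matrix changes.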
4.2.2 Keypoint Matching with the Essential Matrix.
For this task, to isolate the effect of the loss function only, we followed the same setup as in [1]. Specifically, we used the same network architecture as in [1], which takes $N$ correspondences between two 2D points as input and outputs an $N$-dimensional vector of weights, that is, one weight for each correspondence.
Formally, let
$$\mathbf{q}_i = [u_i, v_i, u'_i, v'_i]^\top \tag{9}$$
encode the coordinates of correspondence $i$ in the two images. Following the 8-point algorithm [10], we construct a matrix $\mathbf{X} \in \mathbb{R}^{N \times 9}$, each row of which is computed from one correspondence vector $\mathbf{q}_i$ as
$$\mathbf{X}^{(i)} = [u_i u'_i,\; u_i v'_i,\; u_i,\; v_i u'_i,\; v_i v'_i,\; v_i,\; u'_i,\; v'_i,\; 1], \tag{10}$$
where $\mathbf{X}^{(i)}$ denotes row $i$ of $\mathbf{X}$. A weighted version of the 8-point algorithm [29] then computes the essential matrix as the smallest eigenvector of $\mathbf{X}^\top\mathbf{W}\mathbf{X}$, with $\mathbf{W}$ the diagonal matrix of weights.
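The construction of the data matrix and the weighted eigenvector solution can be sketched as follows. The row ordering below corresponds to the constraint $\mathbf{x}^\top \mathbf{E}\,\mathbf{x}' = 0$ with $\mathbf{E}$ flattened row by row, which is one possible convention and an assumption on our part; the function names are hypothetical.

```python
import numpy as np

def eight_point_rows(x1, x2):
    """Build the N x 9 matrix whose row i encodes x1_i^T E x2_i = 0.

    x1, x2: (N, 2) arrays of normalized image coordinates (u, v) and (u', v').
    E is assumed to be flattened in row-major order.
    """
    u, v = x1[:, 0], x1[:, 1]
    up, vp = x2[:, 0], x2[:, 1]
    one = np.ones_like(u)
    return np.column_stack([u * up, u * vp, u,
                            v * up, v * vp, v,
                            up, vp, one])

def weighted_essential(x1, x2, w):
    """Smallest eigenvector of X^T W X, reshaped into a 3 x 3 matrix."""
    X = eight_point_rows(x1, x2)
    M = X.T @ (w[:, None] * X)
    _, vecs = np.linalg.eigh(M)      # ascending eigenvalues
    return vecs[:, 0].reshape(3, 3)
```

The result is defined only up to sign and scale, and it is not constrained to have the singular-value structure of a valid essential matrix. Explicitly calling `eigh` here is what the baseline does; the loss discussed in this paper needs only `X`, the weights, and the ground-truth vector.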
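Let $\bar{\mathbf{X}} = \mathbf{X}(\mathbf{I} - \bar{\mathbf{e}}\bar{\mathbf{e}}^\top)$, where $\bar{\mathbf{e}}$ is the ground-truth eigenvector representing the true essential matrix. We can then write an eigendecomposition-free essential loss as
$$L = \bar{\mathbf{e}}^\top\mathbf{X}^\top\mathbf{W}\mathbf{X}\,\bar{\mathbf{e}} + \alpha\, e^{-\beta\,\mathrm{tr}\left(\bar{\mathbf{X}}^\top\mathbf{W}\bar{\mathbf{X}}\right)}. \tag{11}$$
Given a set of training samples, consisting of image pairs with ground-truth essential matrices, we can then use this loss, instead of the classification loss or essential loss of [1], to train a network to predict the weights.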
Note that, as suggested by [12] and done in [1], we use the 2D coordinates normalized using the camera intrinsics as input to the network.
When calculating the loss, as suggested by [11], we move the centroid of the reference points to the origin of the coordinate system and scale the points so that their RMS distance to the origin is equal to $\sqrt{2}$. This means that we also have to scale and translate $\bar{\mathbf{e}}$ accordingly.
4.2.3 3D-to-2D Correspondences for Pose Estimation.
The goal of this problem, also known as the Perspective-n-Point (PnP) problem [21], is to determine the absolute pose (rotation and translation) of a calibrated camera, given known 3D points and corresponding 2D image points.
For this task, as we are still dealing with sparse correspondences, we use the same network architecture as before for 2D-to-2D correspondences, except that we now have one additional input dimension, since we have 3D-to-2D correspondences.
This network takes $N$ correspondences between 3D and 2D points as input and outputs an $N$-dimensional vector of weights, still one weight for each correspondence.
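Mathematically, we can denote the input correspondences as
$$\mathbf{q}_i = [x_i, y_i, z_i, u_i, v_i]^\top, \tag{12}$$
where $x_i, y_i, z_i$ are the coordinates of a 3D point, and $u_i$, $v_i$ denote the corresponding image location. According to [12], we have
$$\lambda_i\,[u_i, v_i, 1]^\top = [\mathbf{R} \mid \mathbf{t}]\,[x_i, y_i, z_i, 1]^\top. \tag{13}$$
To recover the pose, we then follow the Direct Linear Transform (DLT) method [12]. This consists of constructing the matrix $\mathbf{X} \in \mathbb{R}^{2N \times 12}$, every two rows of which are computed from one correspondence $\mathbf{q}_i$ as
$$\begin{aligned} \mathbf{X}^{(2i-1)} &= [x_i,\, y_i,\, z_i,\, 1,\, 0,\, 0,\, 0,\, 0,\, -u_i x_i,\, -u_i y_i,\, -u_i z_i,\, -u_i], \\ \mathbf{X}^{(2i)} &= [0,\, 0,\, 0,\, 0,\, x_i,\, y_i,\, z_i,\, 1,\, -v_i x_i,\, -v_i y_i,\, -v_i z_i,\, -v_i], \end{aligned} \tag{14}$$
where $\mathbf{X}^{(i)}$ denotes row $i$ of $\mathbf{X}$. Then, the solution of the weighted PnP problem can be obtained as the eigenvector of $\mathbf{X}^\top\mathbf{W}\mathbf{X}$ corresponding to the smallest eigenvalue. Therefore, we can define a PnP loss similar to the one of Eq. 11 for 2D-to-2D correspondences, but with $\mathbf{X}$ defined as discussed above, and, given training samples, each consisting of a set of 3D-to-2D correspondences with corresponding ground-truth eigenvector encoding the pose, train a network to predict weights such that we obtain the correct pose via DLT. As in the 2D-to-2D case, we use the normalized coordinate system for the 2D coordinates.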
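A weighted DLT of this kind can be sketched in a few lines of NumPy. The row layout below is the textbook DLT for a $3 \times 4$ pose matrix flattened row by row and for normalized image coordinates; it is our assumption that this matches the construction above, and the function names are hypothetical.

```python
import numpy as np

def dlt_rows(pts3d, pts2d):
    """Build the 2N x 12 DLT matrix from N 3D points and their 2D projections."""
    n = len(pts3d)
    Ph = np.hstack([pts3d, np.ones((n, 1))])     # homogeneous 3D points, (N, 4)
    u, v = pts2d[:, :1], pts2d[:, 1:2]
    zeros = np.zeros_like(Ph)
    X = np.empty((2 * n, 12))
    X[0::2] = np.hstack([Ph, zeros, -u * Ph])    # constraint from the u coordinate
    X[1::2] = np.hstack([zeros, Ph, -v * Ph])    # constraint from the v coordinate
    return X

def weighted_dlt(pts3d, pts2d, w):
    """Smallest eigenvector of X^T W X, reshaped to 3 x 4 (up to sign and scale)."""
    X = dlt_rows(pts3d, pts2d)
    w_rows = np.repeat(w, 2)                     # one weight per correspondence
    M = X.T @ (w_rows[:, None] * X)
    _, vecs = np.linalg.eigh(M)
    return vecs[:, 0].reshape(3, 4)
```

Setting the weight of a correspondence to zero removes both of its rows from the problem, which is how predicted weights can suppress outliers before the pose is extracted.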
Note that the characteristics of the rotation matrix, that is, orthogonality and determinant 1, are not preserved by the DLT solution. Therefore, to make the result a valid rotation matrix, we refine the DLT results by the generalized Procrustes algorithm [30, 31], which is a common postprocessing technique for PnP algorithms. Note that this step is not involved during training, but only in the validation process to select the best model and at test time.
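One common way to turn a DLT estimate into a valid pose is an SVD-based orthogonal Procrustes projection, sketched below. This is a simplified stand-in that we assume for illustration; the refinement of [30, 31] may differ in its details, and `pose_from_dlt` is a hypothetical name.

```python
import numpy as np

def pose_from_dlt(P):
    """Recover a rotation and translation from a 3 x 4 DLT estimate.

    The DLT solution is only defined up to sign and scale, so we first fix
    the sign, then project the 3 x 3 block onto the set of rotations.
    """
    if np.linalg.det(P[:, :3]) < 0:
        P = -P                                   # resolve the sign ambiguity
    U, s, Vt = np.linalg.svd(P[:, :3])
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    R = U @ np.diag([1.0, 1.0, d]) @ Vt          # closest rotation, det(R) = +1
    t = P[:, 3] / s.mean()                       # undo the unknown scale
    return R, t
```

Since this step is applied only for validation and at test time, it does not need to be differentiable.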
5 Experiments
We now present our results for the three tasks discussed above, that is, plane fitting as in Section 2, distinguishing good keypoint correspondences from bad ones, and solving the Perspective-n-Point (PnP) problem. We rely on a TensorFlow implementation using the Adam [32] optimizer, with a learning rate of , unless stated otherwise, and default parameters. When training a network for keypoint matching and PnP, we used mini-batches of 32 samples and, in the plane fitting case, we also tested vanilla gradient descent in addition to Adam.
5.1 Plane Fitting
The setup is the one discussed in Section 2. We randomly sampled 100 3D points on the $xy$ plane. Specifically, we uniformly sampled $x$ and $y$. We then added zero-mean Gaussian noise in the $z$ dimension. We also generate outliers in a similar way, where $x$ and $y$ are uniformly sampled in the same range, and $z$ is sampled from a Gaussian distribution with mean 50 and standard deviation of 5. For the baselines that directly use the analytical gradients of SVD and ED, we take the objective function to be $\min \|\mathbf{e} \pm \bar{\mathbf{e}}\|^2$, where $\mathbf{e}$ is the minimum eigenvector of the matrix in Eq. 8 and $\bar{\mathbf{e}}$ is the ground-truth noise direction, which is also the plane normal and is the vector $[0, 0, 1]^\top$ in this case. Note that we consider both $+\bar{\mathbf{e}}$ and $-\bar{\mathbf{e}}$ and take the minimum distance, denoted by the $\pm$ and the $\min$ in the loss function. For this problem, both solutions are correct due to the sign ambiguity of eigendecomposition, and we need to take this into account.
We consider two ways of computing analytical gradients, one using the SVD and the other the self-adjoint eigendecomposition (Eigh), which both yield mathematically valid solutions. To implement our approach, we rely on Eq. 8.
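The sign ambiguity of the baseline objective can be handled with a small helper such as the one below. It assumes a squared Euclidean distance, and the function name is hypothetical.

```python
import numpy as np

def sign_invariant_sq_dist(e, e_bar):
    """Squared distance between e and e_bar, ignoring the sign of e."""
    return min(np.sum((e - e_bar) ** 2), np.sum((e + e_bar) ** 2))

e_bar = np.array([0.0, 0.0, 1.0])
print(sign_invariant_sq_dist(-e_bar, e_bar))                      # flipped sign
print(sign_invariant_sq_dist(np.array([1.0, 0.0, 0.0]), e_bar))   # orthogonal vector
```

An eigensolver may return either $\mathbf{e}$ or $-\mathbf{e}$ for the same matrix, so without the minimum a perfectly correct estimate could receive the largest possible penalty.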
Fig. 2 shows the evolution of the loss as the optimization proceeds when using either Adam or vanilla gradient descent, when a single outlier is present. Note that SVD and Eigh have exactly the same behavior because they constitute two equivalent ways of solving the same problem. Using Adam in conjunction with either one initially yields a very slow decrease in the loss function, until it suddenly drops to zero when the switch of the eigenvector with the smallest eigenvalue occurs. By contrast, our approach produces a much more gradual decrease in the loss with no overly large gradients ever being generated. The difference in behavior is even more drastic when vanilla gradient descent is used instead of Adam: SVD and Eigh take millions of iterations to converge. We tried multiple learning rates, none of which led to faster convergence. We provide the results with different learning rates in the supplementary material.
We also evaluate the behavior of our method and the baselines in the presence of more outliers. As shown in Fig. 3, while both our method and the baseline still present the same convergence patterns as before, our approach correctly recovers the inliers and outliers, while the SVD baseline discards many inliers and even accepts outliers.
Note that, while in this plane-fitting example the SVD- or Eigh-based methods converge, in the more complex cases below, this is not always true.
5.2 Keypoint Matching
To evaluate our method on a real-world problem, we use the SUN3D dataset [33]. For a fair comparison, we trained our network on the same data as [1], that is, the “brownbm305” sequence, and evaluate it on the test sequences used for testing in [20, 1]. Additionally, to show that our method is not overfitting, we also test on a completely different dataset, the “fountainP11” and “HerzJesusP8” sequences of [34].
We follow the evaluation protocol of [1], which constitutes the state of the art in keypoint matching, and only change the loss function to our own loss of Eq. 11. We set the hyperparameters of our loss to values that we empirically found to work well for 2D-to-2D keypoint matching. We compare our method against that of [1], both in its original implementation that involves minimizing a classification loss first and then without that initial step, which we denote as “Essential_Only”. The latter is designed to show how critical the initial classification-based minimization of [1] is. In addition, we also compare against standard RANSAC [35], LMedS [36], MLESAC [15], and GMS [17] to provide additional reference points. We do this in terms of the performance metric used in [1] and referred to as mean Average Precision (mAP). This metric is computed by observing the ratio of accurately recovered poses given a certain maximum threshold, and taking the area under the curve of this graph.
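A metric of this kind can be sketched as follows. The threshold grid (5-degree steps up to a 20-degree maximum) and the function name `pose_map` are assumptions made for illustration; the exact binning used in [1] may differ.

```python
import numpy as np

def pose_map(errors_deg, max_threshold=20.0, step=5.0):
    """Mean, over a grid of thresholds, of the fraction of poses whose
    angular error is below the threshold (a discrete area under the curve)."""
    errors = np.asarray(errors_deg, dtype=float)
    thresholds = np.arange(step, max_threshold + step, step)
    accuracy = [(errors <= th).mean() for th in thresholds]
    return float(np.mean(accuracy))

print(pose_map([1.0, 7.0, 12.0, 30.0]))
```

Averaging over thresholds rewards methods that recover many poses very accurately, rather than only counting poses below a single cut-off.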
We summarize the results in Fig. 4 and provide numbers for individual datasets in the supplementary material. Our approach performs roughly on par with [1], the state-of-the-art method on keypoint matching, and outperforms all the other baselines. Importantly, “Essential_Only” severely underperforms and even often fails completely. In short, instead of having to find a workaround to the eigenvector switching problem as in [1], we can directly optimize our objective function, which is far more generally applicable. Furthermore, the workaround in [1] would converge to a suboptimal solution, as the classification loss depends on a user-selected decision boundary, that is, a heuristic definition of inliers. By contrast, our method can simply discover the inliers automatically while training, thanks to the second term in Eq. 7.
In Fig. 5, we compare the correspondences classified as inliers by our method to those of RANSAC on image pairs from the dataset of [34] and SUN3D, respectively. Note that even the correspondences that are misclassified as inliers are very close to being inliers. By contrast, RANSAC yields much larger errors.
5.3 PnP
Following standard practice for evaluating PnP algorithms [21, 24], we used the procedure of [24] to generate a synthetic dataset composed of 3D-to-2D correspondences with noise and outliers added. Each training example comprises two thousand 3D points, and we set the ground-truth translation of the camera pose to be their centroid.
We then create a random ground-truth rotation $\mathbf{R}$ and project the 3D points to the image plane of our virtual camera. As in REPPnP [24], we apply Gaussian noise to these projections. We create random outliers by assigning 3D points to arbitrary valid 2D image positions.
We train a neural network with the same architecture as in the keypoint matching case, except that it now takes 3D-to-2D correspondences as input. We set the hyperparameters of our loss empirically for this task. During training, to learn to be robust to outliers, we randomly select between 100 and 1000 of the two thousand matches and turn them into outliers. In other words, the two thousand training matches will contain a random number of outliers that our network will learn to filter out.
We compare our method against modern PnP methods: EPnP [21], OPnP [23], PPnP [30], RPnP [37] and REPPnP [24]. We also evaluate the DLT [12], since our loss formulation is based on it. Among these methods, REPPnP is the one most specifically designed to handle outliers. We also report the performance of two commonly used baselines that leverage RANSAC [14], P3P [22]+RANSAC and EPnP+RANSAC. For other methods, RANSAC did not bring noticeable improvements, and we omitted them in the graph for better visual clarity.
To compare all these methods, we use standard rotation and translation error metrics [38]. Specifically, we report the closest arc distance in radians for the rotation matrix measured using quaternions, and the distance between the translation vectors normalized by the ground truth. To demonstrate the effect of outliers at test time, we fix the number of matches to be 200 and vary the number of outliers. We perform each experiment 100 times and report the average.
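Error metrics of this kind can be sketched as below. We assume that the rotation error is the angle of the relative rotation, computed here from the trace rather than from quaternions (the two give the same angle), and that the translation error is the Euclidean distance divided by the norm of the ground-truth translation. The function names are hypothetical.

```python
import numpy as np

def rotation_error(R_gt, R_est):
    """Angle in radians of the relative rotation between R_gt and R_est."""
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def translation_error(t_gt, t_est):
    """Distance between translations, relative to the ground-truth norm."""
    return float(np.linalg.norm(t_gt - t_est) / np.linalg.norm(t_gt))

angle = 0.3
R_est = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
print(rotation_error(np.eye(3), R_est))
print(translation_error(np.array([0.0, 0.0, 2.0]), np.array([0.0, 0.0, 3.0])))
```

The clipping guards against values marginally outside $[-1, 1]$ caused by rounding, which would otherwise make `arccos` return NaN.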
Fig. 13 summarizes the results. We outperform all other methods significantly, especially when the number of outliers increases. REPPnP is the one competing method that seems least affected. As long as the number of outliers is small, it is on a par with us, but past a certain point (when there are more than 40 outliers, that is, 20% of the total), its performance, particularly in terms of rotation error, degrades quickly whereas ours does not.
As in the keypoint matching case, we have tried to compute the results of a network relying explicitly on eigendecomposition and minimizing the norm of the difference between the ground-truth eigenvector and the predicted one. However, we found that such a network was unable to converge, as depicted in Fig. 7, where we compare the loss evolution of this approach to that of ours. This again clearly shows the benefits of our eigendecomposition-free approach.
6 Conclusion
We have introduced a novel approach to training deep networks that rely on losses computed from an eigenvector corresponding to a zero eigenvalue of a matrix defined by the network’s output. Our loss does not suffer from the numerical instabilities of analytical differentiation of eigendecomposition, and converges to the correct solution much faster. We have demonstrated the effectiveness of our method on the tasks of keypoint matching in real images and outlier rejection for the PnP problem. In both cases, our new loss has allowed us to achieve state-of-the-art results.
Since many Computer Vision tasks rely on least-squares solutions to linear systems, we will investigate the use of our approach for other ones, such as homography estimation. Furthermore, we hope that our work will contribute to imbuing Deep Learning techniques with traditional Computer Vision knowledge, thus avoiding discarding decades of valuable research, and leading to more principled frameworks.
7 Acknowledgements
This research was supported in part by the National Natural Science Foundation of China under Grant 61603291 and the program of introducing talents of discipline to university B13043. This work was also supported in part by the Swiss Commission for Technology and Innovation. This work was performed while Zheng Dang was visiting the CVLab, EPFL, Switzerland.
References
 [1] Yi, K., Trulls, E., Ono, Y., Lepetit, V., Salzmann, M., Fua, P.: Learning to Find Good Correspondences. In: CVPR. (2018)
 [2] Papadopoulo, T., Lourakis, M.: Estimating the jacobian of the singular value decomposition: Theory and applications. In: ECCV. (2000) 554–570
 [3] Giles, M.: Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation. In: Advances in Automatic Differentiation. (2008) 35–44
 [4] Ionescu, C., Vantzos, O., Sminchisescu, C.: Matrix backpropagation for Deep Networks with Structured Layers. (2015)

 [5] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: A System for Large-Scale Machine Learning. In: USENIX Conference on Operating Systems Design and Implementation. (2016) 265–283
 [6] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: NIPS Autodiff Workshop. (2017)
 [7] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial Transformer Networks. In: NIPS. (2015) 2017–2025
 [8] Handa, A., Bloesch, M., Patraucean, V., Stent, S., McCormac, J., Davison, A.: Gvnn: Neural Network Library for Geometric Computer Vision. In: ECCV. (2016)
 [9] Murray, I.: Differentiation of the Cholesky Decomposition. arXiv Preprint (2016)
 [10] Longuet-Higgins, H.: A Computer Algorithm for Reconstructing a Scene from Two Projections. Nature 293 (1981) 133–135
 [11] Hartley, R.: In Defense of the Eight-Point Algorithm. PAMI 19(6) (June 1997) 580–593
 [12] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press (2000)
 [13] Nister, D.: An Efficient Solution to the Five-Point Relative Pose Problem. In: CVPR. (June 2003)
 [14] Fischler, M., Bolles, R.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications ACM 24(6) (1981) 381–395
 [15] Torr, P., Zisserman, A.: MLESAC: A New Robust Estimator with Application to Estimating Image Geometry. CVIU 78 (2000) 138–156

 [16] Rousseeuw, P., Leroy, A.: Robust Regression and Outlier Detection. Wiley (1987)
 [17] Bian, J., Lin, W., Matsushita, Y., Yeung, S., Nguyen, T., Cheng, M.: GMS: Grid-Based Motion Statistics for Fast, Ultra-Robust Feature Correspondence. In: CVPR. (2017)
 [18] Raguram, R., Chum, O., Pollefeys, M., Matas, J., Frahm, J.M.: USAC: A Universal Framework for Random Sample Consensus. PAMI 35(8) (2013) 2022–2038
 [19] Zamir, A.R., Wekel, T., Agrawal, P., Malik, J., Savarese, S.: Generic 3D Representation via Pose Estimation and Matching. In: ECCV. (2016)
 [20] Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., Brox, T.: Demon: Depth and Motion Network for Learning Monocular Stereo. In: CVPR. (2017)
 [21] Lepetit, V., Moreno-Noguer, F., Fua, P.: EPnP: An Accurate O(n) Solution to the PnP Problem. IJCV (2009)
 [22] Kneip, L., Scaramuzza, D., Siegwart, R.: A Novel Parametrization of the Perspective-Three-Point Problem for a Direct Computation of Absolute Camera Position and Orientation. In: CVPR. (2011) 2969–2976
 [23] Zheng, Y., Kuang, Y., Sugimoto, S., Astrom, K., Okutomi, M.: Revisiting the PnP Problem: A Fast, General and Optimal Solution. In: ICCV. (2013)
 [24] Ferraz, L., Binefa, X., Moreno-Noguer, F.: Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection. In: CVPR. (2014) 501–508
 [25] Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., Gumhold, S., Rother, C.: DSAC – Differentiable RANSAC for Camera Localization. arXiv Preprint (2016)
 [26] Huang, G., Liu, Z., Weinberger, K., van der Maaten, L.: Densely Connected Convolutional Networks. In: CVPR. (2017)
 [27] Huang, Z., Wan, C., Probst, T., Gool, L.V.: Deep learning on lie groups for skeleton-based action recognition. In: CVPR. (2017) 6099–6108
 [28] Law, M., Urtasun, R., Zemel, R.S.: Deep spectral clustering learning. In: ICML. (2017) 1985–1994
 [29] Zhang, Z.: Determining the Epipolar Geometry and Its Uncertainty: A Review. IJCV 27(2) (1998) 161–195
 [30] Garro, V., Crosilla, F., Fusiello, A.: Solving the PnP Problem with Anisotropic Orthogonal Procrustes Analysis. In: 3DPVT. (2012) 262–269
 [31] Schönemann, P.: A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika 31(1) (1966) 1–10
 [32] Kingma, D., Ba, J.: Adam: A Method for Stochastic Optimisation. In: ICLR. (2015)
 [33] Xiao, J., Owens, A., Torralba, A.: SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels. In: ICCV. (2013)
 [34] Strecha, C., Hansen, W., Van Gool, L., Fua, P., Thoennessen, U.: On Benchmarking Camera Calibration and Multi-View Stereo for High Resolution Imagery. In: CVPR. (2008)
 [35] Cantzler, H.: Random Sample Consensus (RANSAC) (2005) CVonline.
 [36] Simpson, D.: Introduction to Rousseeuw (1984) Least Median of Squares Regression. In: Breakthroughs in Statistics. Springer (1997) 433–461
 [37] Li, S., Xu, C., Xie, M.: A Robust O(n) Solution to the Perspective-n-Point Problem. PAMI (2012) 1444–1450
 [38] Crivellaro, A., Rad, M., Verdie, Y., Yi, K., Fua, P., Lepetit, V.: Robust 3D Object Tracking from Monocular Images Using Stable Parts. PAMI (2017)
 [39] Heinly, J., Schoenberger, J., Dunn, E., Frahm, J.M.: Reconstructing the World in Six Days. In: CVPR. (2015)
 [40] Wu, C.: Towards Linear-Time Incremental Structure from Motion. In: 3DV. (2013)
1 Appendix
We provide additional details about the results presented in Section 5 of the main paper.
1.1 Plane Fitting
As mentioned in Section 5.1 of the main paper, we tested a range of learning rates. Figs. 8 and 9 depict the learning curves of both our loss and the standard SVD/Eigh-based loss for several of them, using either Adam or vanilla gradient descent (GD) as optimizer. As can be seen from the plots, our approach always converges and correctly finds the inliers, as indicated by the bar plots on the right. While SVD/Eigh do converge when using Adam, this requires an eigenvector switch, and learning fails when using GD. Furthermore, some inliers are systematically classified as outliers and vice versa.
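To make the eigenvector switch concrete, the following minimal sketch fits a plane as the smallest eigenvector of a weighted scatter matrix. It is an illustration under stated assumptions, not the implementation used in our experiments: the function name `fit_plane_normal`, the helper `normal_for`, the toy points and the weight values are all hypothetical.

```python
import numpy as np

def fit_plane_normal(points, weights):
    """Weighted plane fit: return the eigenvector of the weighted scatter
    matrix associated with the smallest eigenvalue (the plane normal)."""
    w = weights / weights.sum()                 # normalize weights to sum to 1
    centroid = w @ points                       # weighted mean of the points
    centered = points - centroid
    scatter = (centered * w[:, None]).T @ centered
    _, eigenvectors = np.linalg.eigh(scatter)   # eigenvalues in ascending order
    return eigenvectors[:, 0]

# Toy data (hypothetical): four points on the plane z = 0 and four on x = 0.
inliers = np.array([[1, 1, 0], [1, -1, 0], [-1, 1, 0], [-1, -1, 0]], dtype=float)
outliers = np.array([[0, 1, 1], [0, 1, -1], [0, -1, 1], [0, -1, -1]], dtype=float)
points = np.vstack([inliers, outliers])

def normal_for(w_in, w_out):
    """Plane normal when inliers get weight w_in and outliers get weight w_out."""
    weights = np.array([w_in] * 4 + [w_out] * 4)
    return np.abs(fit_plane_normal(points, weights))  # abs removes the sign ambiguity

print(normal_for(0.51, 0.49))  # weights slightly favor the plane z = 0
print(normal_for(0.49, 0.51))  # weights slightly favor the plane x = 0
```

In this toy setup, a small change in the weights makes the returned normal jump from the z-axis to the x-axis. This is the kind of abrupt switch that a loss defined directly on the smallest eigenvector has to cope with.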
1.2 Keypoint Matching
Here, we provide the detailed keypoint matching results on the SUN3D and Strecha [34] datasets. Specifically, we compare the mAP of the baselines and of our model on the individual sequences of these datasets in Figs. 10, 11 and 12 for error thresholds of 5, 10 and 20, respectively. The general trend is the same as the average one reported in the main paper: our method performs essentially on par with the state-of-the-art method of [1], but without the need for pretraining with a different loss.
1.3 PnP
We evaluated our PnP approach on real data, using the dataset of [39]. Specifically, the 3D points in this dataset were obtained using the Structure-from-Motion algorithm of [40], which also provides a rotation matrix and translation vector for each image. We treat these rotations and translations as ground truth to compare different PnP algorithms. Given a pair of images, we extract SIFT features at the reprojection of the 3D points in one image, and match these features to SIFT keypoints detected in the other image. This procedure produces erroneous correspondences, which a robust PnP algorithm should discard.
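As an illustration of how an estimated pose can be compared against such ground truth, the sketch below computes a rotation error and a translation error. The specific metrics (geodesic angle between rotations, Euclidean distance between translations) are common choices that we assume here for illustration; this section does not specify the exact definitions used, and the function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle, in degrees, between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against round-off pushing the cosine outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def translation_error(t_est, t_gt):
    """Euclidean distance between two translation vectors (assumed metric)."""
    return np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))

# Toy check (hypothetical values): a 10-degree rotation about the z-axis.
angle = np.radians(10.0)
R_gt = np.eye(3)
R_est = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0,            0.0,           1.0]])
rot_err = rotation_error_deg(R_est, R_gt)
trans_err = translation_error([0.0, 0.0, 1.0], [0.0, 0.0, 0.0])
print(rot_err, trans_err)
```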
In this example, we used the model trained on the synthetic data described in Section 5.3 of the main paper. We apply the model without any fine-tuning, that is, the model is trained on purely synthetic data only. We report the quantitative results of these experiments, performed on four image pairs, in Tables 1 and 2. Our approach yields much lower errors than all the baselines. In Fig. 13, we compare the reprojection of the 3D points on the input image after applying the rotation and translation obtained with our model and with EPnP+RANSAC. The points obtained with our approach reproject much more closely to the ground-truth image locations than those of this baseline, even though EPnP+RANSAC is the best-performing baseline, on par with OPnP and P3P+RANSAC. For the other baselines, we are unable to provide similar figures because the errors reported in Tables 1 and 2 translate to points reprojecting outside the input image. This underscores the strength of our approach, which, despite being trained on synthetic data, nevertheless works on real images.
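The reprojection comparison described above can be sketched with a standard pinhole camera model. This is only an illustrative sketch: the intrinsic matrix `K`, the function names and the toy values are assumptions, not taken from our experimental setup.

```python
import numpy as np

def project_points(points_3d, R, t, K):
    """Project Nx3 world points into the image using pose (R, t) and intrinsics K."""
    cam = points_3d @ R.T + t            # world -> camera coordinates
    pix = cam @ K.T                      # apply the (assumed) intrinsic matrix
    return pix[:, :2] / pix[:, 2:3]      # perspective division

def mean_reprojection_error(points_3d, points_2d, R, t, K):
    """Mean pixel distance between reprojected points and observed 2D locations."""
    diff = project_points(points_3d, R, t, K) - points_2d
    return np.linalg.norm(diff, axis=1).mean()

# Toy example (hypothetical values): identity pose and a simple intrinsic matrix.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
points_3d = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
projected = project_points(points_3d, R, t, K)
print(projected)
```

A pose with large rotation or translation errors can send the projected points outside the image bounds, which is why no such figure can be drawn for some of the baselines.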
Table 1

| Methods | Ours | REPPnP | EPnP | RPnP | OPnP | PPnP | DLT | EPnP+RANSAC | P3P+RANSAC |
|---|---|---|---|---|---|---|---|---|---|
| Reichstag | 0.1319 | 36.0852 | 36.5411 | 12.6649 | 3.7012 | 18.6296 | 35.1560 | 3.1161 | 3.1161 |
| Florence | 0.0416 | 32.6205 | 34.6684 | 24.2201 | 1.5771 | 5.0694 | 38.3336 | 2.4004 | 2.4004 |
| Prague | 0.0274 | 39.6677 | 39.4239 | 4.6647 | 2.7118 | 16.4462 | 35.1211 | 2.5443 | 2.5443 |
| Notredame | 0.0293 | 36.9159 | 32.7849 | 16.2304 | 2.8228 | 8.7266 | 32.8611 | 4.2309 | 4.263 |
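Table 2

| Methods | Ours | REPPnP | EPnP | RPnP | OPnP | PPnP | DLT | EPnP+RANSAC | P3P+RANSAC |
|---|---|---|---|---|---|---|---|---|---|
| Reichstag | 0.0110 | 0.2477 | 1.4313 | 0.3885 | 0.2346 | 1.2724 | 0.3687 | 0.1821 | 0.1821 |
| Florence | 0.0135 | 1.9399 | 1.6896 | 1.7971 | 0.8719 | 0.4659 | 1.9069 | 0.7640 | 0.7640 |
| Prague | 0.0069 | 0.6156 | 1.4631 | 1.4440 | 0.8715 | 0.8278 | 1.6928 | 0.8045 | 0.8045 |
| Notredame | 0.0046 | 1.8663 | 1.7654 | 1.7910 | 0.8382 | 1.0669 | 1.7918 | 1.0605 | 1.0615 |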