Eigendecomposition-Free Training of Deep Networks for Linear Least-Square Problems

04/15/2020 ∙ by Zheng Dang, et al. ∙ EPFL Xi'an Jiaotong University University of Victoria 0

Many classical Computer Vision problems, such as essential matrix computation and pose estimation from 3D to 2D correspondences, can be tackled by solving a linear least-square problem, which can be done by finding the eigenvector corresponding to the smallest, or zero, eigenvalue of a matrix representing a linear system. Incorporating this in deep learning frameworks would allow us to explicitly encode known notions of geometry, instead of having the network implicitly learn them from data. However, performing eigendecomposition within a network requires the ability to differentiate this operation. While theoretically doable, this introduces numerical instability in the optimization process in practice. In this paper, we introduce an eigendecomposition-free approach to training a deep network whose loss depends on the eigenvector corresponding to a zero eigenvalue of a matrix predicted by the network. We demonstrate that our approach is much more robust than explicit differentiation of the eigendecomposition using two general tasks, outlier rejection and denoising, with several practical examples including wide-baseline stereo, the perspective-n-point problem, and ellipse fitting. Empirically, our method has better convergence properties and yields state-of-the-art results.



There are no comments yet.


page 3

page 4

page 8

page 13

page 14

page 15

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The solution of many Computer Vision problems can be obtained by finding the zero singular- or eigen-value of a matrix, that is, by solving a linear least-square problem. This is true of model-fitting tasks ranging from plane and ellipse fitting [1, 2] to wide-baseline stereo [3] and the perspective-n-point (PnP) problem [4, 5, 6, 7, 8]. With the advent of Deep Learning, there has been a push to embed such least-squares approaches in deep architectures to enforce constraints. For example, we have recently shown that this could be done when training networks to detect and match keypoints in image pairs while accounting for the global epipolar geometry [9]. More generally, this approach makes it possible to explicitly incorporate geometrical knowledge into the network [10]. As a result, it operates in a more constrained space, which allows learning from smaller amounts of training data.

The currently dominant way to implement this approach is to design a network whose output is a matrix and train the network so that the smallest singular- or eigen-vector of that matrix is as close as possible to the ground-truth one 

[9, 10]. This requires differentiating the singular value decomposition (SVD) or the eigendecomposition (ED) in a stable manner at training time. This problem has received considerable attention [11, 12, 13]

and decomposition algorithms are already part of standard Deep Learning frameworks, such as TensorFlow 


or PyTorch 

[15]. Unfortunately, these algorithms suffer from two main weaknesses. First, when optimizing with respect to the matrix coefficients or with respect to parameters defining them, the vector corresponding to the smallest singular value or eigenvalue may change abruptly as the relative magnitudes of these values evolve, which is a non-differentiable behavior. This is illustrated in Fig. 1, discussed in detail in Section 2, and demonstrated experimentally in Section 5.1. Second, computing the derivatives of the eigenvector with respect to the coefficients of the input matrix requires dividing by the difference between two singular values or eigenvalues, which could be very small. While a solution to the second problem was proposed in [11], the first remains and handicaps all methods that depend on SVD or ED.

In this paper, we therefore introduce a method that can enforce the same global constraints as the SVD- and ED-based methods but without explicitly performing these decompositions, thus eliminating the non-differentiability issue and improving numerical stability. More specifically, given the network weights , let be the matrix that the network outputs and let

be the true eigenvector corresponding to the zero eigenvalue. We introduce a loss function that is minimized when

while disallowing trivial solutions such as . In practice, image measurements are not perfect and the eigenvalue is never strictly zero. However, this does not significantly affect the computation, which makes our approach robust to noise.

We will demonstrate that our approach delivers better results than using the standard implementation of singular- and eigen-value decomposition provided in TensorFlow for plane and ellipse fitting, solving the PnP problem, and distinguishing good keypoint correspondences from bad ones. A preliminary version of this paper appeared in [16]. In this extended version, we use a toy example to better motivate our approach and show how the gradients behave; we expand the formulation to handle denoising; we perform simultaneous outlier rejection and denoising; we demonstrate the application of our method to ellipse fitting. Note that, while we only discuss our approach in the context of Deep Learning, it is applicable to any optimization problem in which one seeks to optimize a loss function based on the smallest eigenvector of a matrix with respect to the parameters that define this matrix.

(a) (b)
Fig. 1: Eigenvector switching. (a) 3D points lying on a plane in black and distant outlier in red. (b) When the weights assigned to all the points are one, the eigenvector corresponding to the smallest eigenvalue is , the vector shown in blue in (a), and on the right in the top portion of (b), where we sort the eigenvectors by decreasing eigenvalue. As the optimization progresses and the weight assigned to the outlier decreases, the eigenvector corresponding to the smallest eigenvalue switches to , the vector shown in green in (a), which introduces a sharp change in the gradient values.

2 Motivation

To illustrate the problems associated with differentiating eigenvectors and eigenvalues, consider the outlier rejection example depicted by Fig. 1. The inputs are 3D inlier points lying on a plane and an outlier assumed to be very far from the plane. The inliers are shown in black and the outlier in red.

Suppose we want to assign a binary weight to each point—1 for inliers, 0 for outliers—such that the eigenvector corresponding to the smallest eigenvalue of the weighted covariance matrix is close to the ground-truth one in the least-square sense. When the weight assigned to the outlier is 0, it would be , which is also the normal to the plane and is shown in green. However, if at any point during the optimization we assign the weight 1 to the outlier, will correspond to the largest eigenvalue instead of the smallest one and the eigenvector corresponding to the smallest eigenvalue will be the vector shown in blue, which is perpendicular to . As a result, if we initially set all weights to 1 and optimize them so that the smallest eigenvector approaches the plane normal, the gradient values will depend on the coordinates of . If everything goes well, at some point during the optimization, the weight assigned to the outlier will become small enough so that the smallest eigenvector switches from being to being .

However, this introduces a large jump in the gradient vector whose values will now depend on the coordinates of instead of . In this simple case, the resulting instability does not preclude eventual convergence. However, in more complex situations, we found that it does. We already noted this problem in [9] when simultaneously establishing keypoint correspondences and estimating the essential matrix. Because of it, we had to first rely solely on a classification loss to determine the potential inlier correspondences before incorporating the loss based on the essential matrix to impose geometric constraints, which requires ED. This ensured that the network weights were already good enough to prevent eigenvector switching when starting to minimize the geometry-based loss. This strategy, however, requires determining ground-truth inlier/outlier labels, which in practice may be difficult to obtain. Furthermore, this scheme may not even be possible in cases where one cannot provide an auxiliary classification task—for example in the traditional task of removing additive noise in data. As we will show later, our method does not suffer from this.

3 Related Work

In recent years, the need to integrate geometric methods and mathematical tools into Deep Learning frameworks has led to the reformulation of a number of them in network terms. For example, [17] considers spatial transformations of image regions with CNNs. The set of such transformations is extended in [18]. In a different context, [19] derives a differentiation of the Cholesky decomposition that could be integrated in Deep Learning frameworks.

Unfortunately, the set of geometric Computer Vision problems that these methods can handle remains relatively limited. In particular, there is no widely accepted deep-learning way to solve the many geometric problems that reduce to finding the least-square solution to a linear system. In this work, we consider three such problems: Ellipse fitting, estimating the 3D pose of an object from 3D-to-2D correspondences, and computing the essential matrix from keypoint correspondences in an image pair, all of which we briefly discuss below.

Ellipse fitting.

This is a traditional Computer Vision problem, where one finds the ellipse equation that best fits a set of points. This could be used, for example, to detect people’s eyes in images by fitting ellipses to detected image edges. One of the simplest and most straightforward way to solve this problem is the Direct Linear Transform (DLT) 

[1, 2]. The DLT corresponds to finding an algebraic least-square solution to the problem by performing ED or SVD and retaining the eigen- or singular-vector that corresponds to the minimum eigen- or singular-value. Besides the DLT, other methods focus on minimizing the geometric distance instead of the algebraic one [1], and exploit constraints to enforce the result to truly be an ellipse, as opposed to a general conic, which also encompasses parabolas and hyperbolas [20, 21]. In practice, outliers are handled in an iterative manner, where the solution is refined via robust techniques such as M-estimators [1] or Least Median of Squares (LMedS) [22]. For a detailed review of ellipse fitting, we refer the reader to [2].

Estimating 3D pose from 3D-to-2D correspondences. This is known as the Perspective-n-Point (PnP) problem. It has also been investigated for decades and is also amenable to an eigendecomposition-based solution [3], many variations of which have been proposed over the years [4, 23, 6, 5]. DSAC [24]

is the only approach we know of that integrates the PnP solver into a Deep Network. As explicitly differentiating through the PnP solver is not optimization friendly, the authors apply the log trick used in the reinforcement learning literature. This amounts to using a numerical approximation of the derivatives from random samples, which is not ideal, given that an analytical alternative exists. Moreover, DSAC only works for grid configurations and known scenes. By contrast, the method we propose in this work has an analytical form, with no need for stochastic sampling.

Estimating the Essential matrix from correspondences. The eigenvalue-based solution to this problem has been known for decades [25, 26, 3] and remains the standard way to compute Essential matrices [27]. The real focus of research in this area has been to establish reliable keypoint correspondences and to eliminate outliers. In this context, variations of RANSAC [28], such as MLESAC [29] and least median of squares (LMeds) [22], and very recently GMS [30], have become popular. For a comprehensive study of such methods, we refer the interested reader to [31]. With the emergence of Deep Learning, there has been a trend towards moving away from this decades-old knowledge and apply instead a black-box approach where a Deep Network is trained to directly estimate the rotation and translation matrices [32, 33] without a priori geometrical knowledge. Our very recent work [9] attempts to reconcile these two opposing trends by embedding the geometric constraints into a Deep Net and has demonstrated superior performance for this task when the correspondences are hard to establish. A similar approach was also proposed in the contemporary work of [10].

Base Image
Matching Image
Fig. 2: Qualitative PnP results for outlier removal. Left two columns: Pairs of images between which we transfer the 2D points through the 2D-3D correspondences for Top: Florence and Bottom: Prague. From the third column to the fifth column: For each pair, we show in gray the reprojection of the 3D point cloud after applying the rotation and translation predicted by our model, OPnP, and EPnP+RANSAC, respectively. The red dots correspond to the ground-truth locations. Note that our model’s predictions cover the ground truth much more closely than the compared methods.

Differentiating the eigen- and singular value decomposition. Whether computing the essential matrix, estimating 3D pose, or solving any other least-square problem, incorporating an eigendecomposition-solver into a deep network requires differentiating the eigendecomposition. Expressions for such derivatives have been given in [11, 12] and reformulated in terms that are compatible with back-propagation in [13]. Specifically, as shown in [13], for a matrix written as , the variations of the eigenvectors with respect to the matrix, used to compute derivatives, are


where , and


As can be seen from Eq. 2, when eigenvalues are small, the denominator becomes small, resulting in a large number being multiplied to the backward flow of gradients. This causes numerical instability. The effect is exaggerated when both eigenvalues of the denominator are small, even more so when they are similar , which we found to happen in practice. The same can be said about singular value decomposition.

A solution to this was proposed in [11], and singular- and eigen-value decomposition have been used within deep networks for problems where all the singular values are exploited and their order is irrelevant [34, 35]

. In the context of spectral clustering, the approach of 

[36] also proposed a solution that eliminates the need for explicit eigendecomposition. This solution, however, was dedicated to the scenario where one seeks to use all non-zero eigenvalues, assuming a matrix of constant rank.

Here, by contrast, we tackle problems where what matters is a single eigen- or singular-value. In this case, the order of the eigenvalues is important. However, this order can change during training, which results in a non-differentiable switch from one eigenvector to another, as in the toy example of Section 2. In turn, this leads to numerical instabilities, which can prevent convergence. In [9]

, we overcame this problem by first training the network using a classification loss that does not depend on eigenvectors. Only once a sufficiently good solution is found, that is, a solution close enough to the correct one for vector switching not to happen anymore, is the loss term that depends on the eigenvector associated to the smallest eigenvalue turned on. While effective, this strategy inherently relies on having access to classification labels. As such, it is ill-suited to tackle denoising scenarios. As we show later, we can achieve state-of-the-art results without the need for such a heuristic, by deriving a more robust, eigendecomposition-free loss function.

4 Our Approach

We introduce an approach to working with eigenvectors corresponding to zero eigenvalues within an end-to-end learning formalism that eliminates the gradient instabilities due to vector switching discussed in Section 2 and the difficulties caused by repeated eigenvalues. To this end, we derive a loss function that directly operates on the matrix whose eigen- or singular-vectors we are interested in without explicitly performing either SVD or ED.

Below, we first discuss the generic case in which the network takes a set of of measurements as input and directly outputs the matrix of interest. We then consider two specialized but important cases. In the first, the network outputs weights defining the matrix, which corresponds to model fitting with outlier rejection. In the second, we consider the measurements to be noisy and allow them to be adjusted to account for the noise.

(a) Ours
(c) Ours
Fig. 3: Qualitative comparison of our results with those of RANSAC. From left to right, in each column, (a) our results and (b) RANSAC results on the “fountain-P11” of [37], (c) our results and (d) RANSAC results on the “brown-bm-3-05” of SUN3D. First two rows: We display the correspondences that each algorithm labeled as inliers. The true positives are shown in blue and the false ones in red. The false positives of our approach are still close to being correct, while those of RANSAC are truly wrong. Third row: With the estimated essential matrices, we plot the epipolar lines and their corresponding points for the false positives for each method. Note how, although the false positives look senseless for RANSAC, they are indeed following an epipolar geometry that captures both true and false positives. Fourth row: Same as in the third row, but with the ground-truth essential matrix. Our false positives are still close to the true epipolar line, whereas the ones from RANSAC are not.

4.1 Generic Scenario

Given an input measurement , let us denote by the output of a deep network with parameters . Let the network output be a matrix .

A seemingly natural approach to exploiting ED within the network would consist of minimizing the loss , where is the ground-truth smallest eigenvector as in [13, 9]. As discussed in Section 2, however, this requires differentiating the ED result to perform back-propagation and is not optimization-friendly. Furthermore, as argued in [3] for the fundamental matrix, the direct use of the norm as a distance measure is not optimal, since all the entries do not have equal importance.

To avoid the instabilities of eigenvector differentiation and use the algebraic error advocated for in [3], we consider the standard definition of a zero eigen-value , i.e.,


This suggests defining the loss function


where is the ground truth zero eigen-vector. As is always positive semi-definite, is always greater than or equal to zero.

However, there are many matrices that satisfy Eq. 4, including . To rule out such trivial solutions, we propose to minimize while maximizing the norm of the projection of the data vectors along the directions orthogonal to . In the plane-fitting example of Section 2, this would mean maximizing the projections of the data vectors on the plane whose normal is . The motivation behind this arises from the fact that maximizing along the directions orthogonal to does not preclude the correct eigenvalue to go to zero. By contrast, maximizing all eigenvalues would conflict with our original objective of Eq. 4, particularly because it would aim to maximize the projection of in the direction of , which ideally should be zero.

To prevent this, instead of directly regularizing , we rely on the matrix defined as



is the identity matrix. Note that

whatever the value . In other words,

projects all vectors onto the hyperplane normal to

. To prevent the trivial solutions discussed above, we would like the magnitude of these projections to be far from zero, or, in other words, the eigenvalues corresponding to directions orthogonal to to be as large as possible. As the trace of a matrix is the sum of its eigenvalues, this can be achieved by maximizing the trace of . We therefore introduce


as a loss to be minimized along with . Note that the terms defined in Eqs. 4 and 6 act on the eigenvalues of the same matrix but in different directions. can be rewritten as and as . Because of this, does not prevent us from obtaining a matrix with a truly zero eigenvalue in the direction .

However, is bounded by zero whereas is not and can become very large. In practice, we have observed it to reach values of and thus dominate the loss. Furthermore, in our deep learning context, during training, the network will produce one matrix for every training sample, and thus will sum over multiple trace terms. The magnitude of these terms will then significantly vary across the training samples, putting more focus on some training samples and making it hard to define the relative weight of and . This is illustrated by the bottom row of Figure 4, where we show the evolution of the trace for 3 samples. Even when all outliers are removed, the values for the 3 samples have different magnitudes.

To overcome this and facilitate balancing the two loss terms, we use a robust kernel, , that mitigates the effects of extreme trace values and brings the regularizer values of all training sample to a commensurate magnitude in the range . The resulting training curves for the same three samples as before are shown in the middle row of Figure 4. Note that they are of similar magnitudes and much lower than the trace itself, thus interfering less with our main objective . Specifically, we take our complete loss function to be


where and are positive scalars. Ideally, when there is a perfect solution, such that the outliers’ weights are all zero and the inliers have no noise, the value of would not matter as both terms would go to zero. However, in practice, because of noise and measurement errors, this does not occur, and it is necessary to control the influence of the two terms. governs whether we focus on reducing the first term – amount of estimation error – or the second one – amount of information that is preserved in the non-null space according to ground truth. The former loosely relates to the accuracy of the solution, and the latter to the recall. controls to what degree we would like to consider the second term, as the trace value can easily go up to thousands, making it effectively reach zero without a proper value. This loss is fully differentiable and can thus be used to learn the network parameters. We provide its derivatives in the next section.

Fig. 4: Loss evolution for different training samples. This example was obtained using synthetic data for the wide-baseline stereo task. As in the plane fitting experiment, no network is involved and we directly optimize the matches weights. We set the total number of matches to be 100 and initialize all the weights to be one in the three data samples, represented as different colors: one sample with 10 outliers (blue), one with 30 outliers (magenta), and one with 50 outliers (cyan). We show, from top to bottom, our main loss term, our regularizer and the corresponding trace value . In each case, we plot the evolution of the term as optimization progresses. As the weights applied to the matches remove the outliers, approaches zero for all cases. Note, however, that, while also decreases, it remains far from zero, and more importantly, reaches final values that differ significantly for the different samples. By contrast, our regularizer has much lower values that are of similar magnitudes for all samples, as indicated by the fact that the three curves overlap. This facilitates setting the weights of this regularizer.

4.2 Derivatives for the General Case

To compute the derivatives of the loss w.r.t. the parameters

, we first follow the chain rule to obtain


We can write the first term in the equation above in index notation as


where we omitted the explicit dependency of and on for the sake of conciseness. Using the same notation, the second term can be written as


where , and . Computing the derivatives with respect to , we obtain


Then, turning them back to matrix form, we write


Here, we can see that modifying and leads to directly modifying the magnitude of the gradient.

Finally, bringing Eq. 13 into Eq. 8 lets us write a gradient descent update of the network parameters as


We then use this for ADAM optimization.

Note that if we were to directly use instead of in the second loss term, the gradients would become


In contrast to Eq. 13, the second term contributes to the entire matrix , including in the direction of , which then cancels out, or at least reduces, the contribution of the term.

4.3 Model Fitting and Outlier Rejection

(a) Loss Evolution with vanilla gradient-descent
(b) Loss Evolution with Adam
Fig. 5: Plane fitting in the presence of a single outlier. We report the results for Single Value Decomposition(SVD), self-adjoint Eigendecomposition(Eigh), and for our loss function in Eq. 16. (a) Results using vanilla gradient-descent. To visualize all results in a single graph, we plot the results in log scale. (b) Results using Adam [38]. Regardless of the choice of optimizers, SVD and Eigh slowly decrease in the beginning, followed by a sudden drop, denoted with the red dot. By contrast, our method converges gradually and smoothly
(a) Loss Evolution
(b) Inliers with SVD
(c) Inliers with Ours
(d) Switching illustration
(e) Gradient Evolution
Fig. 6: Plane fitting in the presence of multiple outliers. We report results for Singular Value Decomposition (SVD), self-adjoint Eigendecomposition (Eigh), and our loss function. (a) Loss evolution for all methods. While our loss converges smoothly, SVD and Eigh remain stagnant until the shear drop at iteration 352. (b) Inlier selection of SVD. The inliers chosen by SVD are shown in red, and the outliers in blue. Positions 1 to 100 are true inliers. With multiple outliers, SVD discards many inliers, as well as estimating outliers as inliers until iteration 352, denoted with the yellow vertical line, where it no longer makes mistakes. By contrast, as shown in (c), our approach correctly rejects the outliers and accepts all inliers, starting from iteration 10, again marked with the yellow vertical line. Our method also does not show the sporadic change in the selection of inliers visible in the case of SVD in (b). (d) Illustration of switching. We draw the three eigenvectors of the ground-truth plane as the three horizontal lines, and the Euclidean distance to them from the first, second, and third eigenvectors estimated by SVD as yellow, blue, and red lines, respectively. Note the abrupt transition at iteration 352, due to the first eigenvector correctly following the largest ground-truth eigenvector starting from this iteration, whereas it was following the smallest until this iteration. This is clearly harmful for optimization. (e) Magnitude of the gradient for each iteration with SVD, in log scale. Note the drastic change in iteration 352.

The generic formulation presented above assumes that all measurements have the same weight and that the network predicts all the coefficients of the matrix instead of a number of design parameters from which the matrix can be computed. In practice, this is rarely the case. For problems such as plane-fitting, ellipse-fitting, PnP, and estimating the essential matrix, the data typically comes in the form of a set of measurements for where the are vectors of dimension , where for plane-fitting, for ellipse-fitting, for PnP, and for wide-baseline stereo. In such problems, the network must be trained to assign to each measurement a weight between 0 and 1 such that the zero eigenvalue of the matrix is the ground-truth vector , where is the matrix obtained by grouping all the vectors into an matrix and the diagonal matrix obtained from the -vector of weights predicted by the network.

In this scenario, the matrix of Eq. 3 becomes , can be written as , and the loss function of Eq. 7 becomes


For the sake of completeness, we briefly remind the reader of how the the vectors can be obtained from image measurements in the remainder of this section. For more details, we refer the interested readers to standard texts such as [3].

4.3.1 Plane Fitting

Let us first consider the toy outlier rejection problem used to motivate our approach in Section 2. Note that, for this experiment, we do not train a deep network, or perform any learning procedure. Instead, given 3D points , including inliers and outliers, we directly optimize the weight of each point. At every optimization step, given the current weight values, we compute the weighted mean of the points . We then define to be the matrix of mean-subtracted 3D points.

4.3.2 Ellipse Fitting

(a) Outlier removal
(b) Denoising
Fig. 7: Qualitative ellipse fitting results. (a) Fitting results for various methods in the presence of outliers and minor noise. (b) Results in the presence of noise only. Our method gives the most accurate results. Note that in (a), our method is the only method that fits the true ellipse in the center well, due to the abundance of outliers in this setup.
(a) Outlier removal
(b) Denoise
(c) Outlier removal and denoising, with moderate noise
(d) Outlier removal and denoising, with moderate outliers
Fig. 8: Quantitative ellipse fitting results. We report the center point error as in [20]. (a) Outlier removal results. (b) Denoising results. (c) Simultaneous outlier removal and denoising results with moderate noise of . (d) Simultaneous outlier removal and denoising results with 10 outliers. Our method performs best in all cases.

For this task, given a set of 2D coordinates, we aim to find the parameters of the ellipse that best fits the data. In contrast to the plane-fitting case, here we make use of a deep network, specifically the same one as in [9]. However, because the task is different than that tackled in [9], our network takes as input 2D coordinates, not correspondences. We use this network to predict the weights of Eq. 16, which is a diagonal matrix, with one weight for each 2D coordinate.

Formally, for each 2D point , let


We can then write a standard equation for an ellipse as


where , , , , , and define the parameters of the ellipse. We can therefore write


for this ellipse fitting case.

4.3.3 Perspective-n-Points

The goal of this problem is to determine the absolute pose (rotation and translation) of a calibrated camera, given known 3D points and corresponding 2D image points. For this task, we again use a deep network with the same architecture as that in [9]. Here, however, the network takes as input correspondences between 3D and 2D points and outputs a -dimensional vector of weights, one for each correspondence.

Mathematically, we can denote the input correspondences as


where are the coordinates of a 3D point, and , denote the corresponding image location. According to [3], we have


Following the Direct Linear Transform (DLT) method [3], the transformation parameters can be obtained as the zero-valued singular vector of a data matrix , every two rows of which are computed from one correspondence as


where denotes row of . We can then directly use this data matrix in Eq. 16. As suggested by [3] and also done in [9], we use the normalized coordinate system for the 2D coordinates. We assume that the correspondences for this setup are given, as in [9]

Note that the characteristics of the rotation matrix, that is, orthogonality and determinant 1, are not preserved by the DLT solution. Therefore, to make the result a valid rotation matrix, we refine the DLT results by the generalized Procrustes algorithm [7, 39], which is a common post-processing technique for PnP algorithms. Note that this step is not involved during training, but only in the validation process to select the best model and at test time.

4.3.4 Essential Matrix Estimation

This task is the same as that studied in [9], that is, estimating camera motion from 2D-to-2D correspondences. A critical downside of [9] is the use of the DLT to regress to the camera pose, which is solved via SVD/ED. This method therefore suffers from the switching problem discussed in Section 2. By contrast, we convert the DLT problem to a cost that directly reflects the original task.

To isolate the effect of the loss function only, we follow the same setup as in [9]. Specifically, we use the same network architecture as in [9], which takes correspondences between two 2D points as input and outputs a -dimensional vector of weights, that is, one weight for each correspondence. For the same purpose, we also use the same strategy for creating correspondences—a simple nearest neighbor matching from one image to the other.

Formally, let


encode the coordinates of correspondence in the two images. Following the 8-point algorithm [25], we construct a matrix , each row of which is computed from one correspondence vector as


where denotes row of . We can then directly use this matrix in the loss of Eq. 16, which corresponds to a weighted version of the 8-point algorithm [40].

Note that, as suggested by [3] and done in [9], we normalize the 2D coordinates to using the camera intrinsics as input to the network. Furthermore, when calculating the loss, as suggested by [26], we move the centroid of the reference points to the origin of the coordinate system and scale the points so that their RMS distance to the origin is equal to . This means that we also have to scale and translate accordingly.

Here, our formulation originates from the DLT. As a result, our loss minimizes the algebraic distance, which could be suboptimal. While there exist other formulations for solving for the essential matrix, such as the gold-standard method that minimizes the geometric distance [3], these methods rely on iterative minimization. Including such an iterative process in a loss function can be achieved by solving a bi-level optimization problem, as discussed in [41]. However, this requires second-order derivatives, and thus may become expensive in the context of deep networks, due to the large number of parameters. Alternatively, one can unroll a fixed number of iterations, but this number needs to be determined and this requires an initialization. Our method can be thought of as a way to provide such an initialization. We leave the task of incorporating a nonlinear refinement in the training process given our initialization as future work, and focus on the DLT case, which allows us to demonstrate the benefits of our approach over direct eigenvalue minimization. Note that, as an alternative, other loss functions can be used, such as a classification loss on the keypoint weights [9] or the residual error using the symmetric epipolar distance [10]. These methods, however, require defining a threshold, whose precise value can have a significant impact on the results.

4.4 Handling Noisy Measurements

Beside outliers, real-world measurements are typically affected by noise, which deteriorates the quality of the estimate of interest. Here, we show that our framework also allows us to address denoising problems in addition to outlier rejection, by making the network estimate a refinement of the input measurements. In other words, while outlier rejection can be expressed by estimating a multiplicative factor, that is , denoising translates to predicting an additive one.

To remove noise from input data, we propose to train our networks to output, in addition to the weights , a denoising matrix that has the same dimension as the input data. Thus, instead of directly using , we create the data matrix with the denoised data, . However, directly using the denoised data matrix instead of in Eq. 16 is problematic as, with the denoising vector, one could add arbitrary displacements in the direction corresponding to the largest eigenvalues to reach very large values. To overcome this, we rely on the assumption that the displacements should be small, which can be encoded by the constraint for a given threshold corresponding to the noise value. To avoid using hard constraints, which are difficult to handle with deep networks, we use soft constraints, which can be thought as the Lagrangian version of the hard ones. Altogether, we thus define our loss as


where is the ground-truth eigenvector, and , expressed in terms of the original measurements, not the denoised ones. We found this to be more stable than using , because the latter would still favor generating large displacements along the directions orthogonal to , which contradicts the last loss term and deteriorates the distinction between inliers and outliers, as outliers can then be handled using either a zero weight or a large displacement .

We apply the denoising term for the task of ellipse fitting and PnP. For both cases, we only correct for the noise in the 2D data, that is, and in Eq. 19 and and in Eq. 22.

We exclude essential matrix estimation from this setup, for the following reasons. First, for wide-baseline stereo, the same level of noise can have drastically different effects on the estimation, depending on the distance to the epipoles. For example, even the slightest noise near an epipole can greatly harm estimation. Second, and more importantly, the outlier ratio in wide-baseline stereo is much larger than in other tasks, e.g., 91% in the cogsci2-05 sequence from the SUN3D dataset. As such, the positive and negative labels are highly unbalanced, and the negative labels dominate the denoising operation, which translates to the positive ones being removed. As a consequence performing outlier removal alone has proven more robust than jointly with denoising, particularly because our network filters the matches so that the inliers attain a large majority, and the final RANSAC step then further increases robustness.

5 Experiments

(a) Rotation error
(b) Translation error
Fig. 9: Quantitative PnP results for outlier removal. (a) Rotation and (b) translation errors for our method and several baselines. Our method gives extremely stable results despite the abundance of outliers, whereas all compared methods perform significantly worse as the number of outliers increases. Even when these methods perform well on either rotation or translation, they do not perform well on both. By contrast, Ours yields near zero errors for both measures up to 130 outliers (i.e., 65%).
(a) Rotation error
(b) Translation error
Fig. 10: Quantitative PnP results for denoising. (a) Rotation and (b) translation errors for our method and several baselines with varying amount of additive noise on the 2D coordinates. Our method gives the best results in terms of rotation, and performs similarly to the best performing method, OPnP, for translation. Note that for translation, most methods, including ours, give extremely low error.
(a) Rotation error
(b) Translation error
(c) Rotation error
(d) Translation error
Fig. 11: Quantitative PnP results for simultaneous outlier removal and denoising. (a) Rotation and (b)

translation errors for varying number of outliers with additive Gaussian noise with standard deviation of 20 pixels on the 2D coordinates.

(c) Rotation and (d) translation errors for varying amount of additive noise on the 2D coordinates with 50 outliers. Our method gives the best results in all cases.
(a) Rotation error
(b) Translation error
Fig. 12: Quantitative PnP results on real scenes. The results are averaged over 1000 images. Note that our method gives the best results.

We now present our results for the four tasks discussed above, that is, plane fitting as in Section 2, ellipse fitting, solving the PnP problem, and distinguishing good keypoint correspondences from bad ones. We rely on a TensorFlow implementation using the Adam [38] optimizer, with a learning rate of , unless stated otherwise, and default parameters. When training a network, for ellipse fitting, keypoint matching and PnP, we used mini-batches of 32 samples. We used the same network as in [9] for all three tasks. The network takes observations as input, which can be either 2D locations for ellipse fitting, 3D-to-2D correspondences for PnP, or 2D-to-2D correspondences for wide-baseline stereo, and produces a

-dimensional array of weights, one for each observation. The weights indicate the likelihood of each observation to be an inlier. The observations are processed independently by weight-sharing perceptrons. The network consists of 12 residual blocks. In the each residual block, Context Normalization is applied before batch normalization and ReLU to aggregate the global information. The code to reproduce our experiments is available on Github


5.1 Plane Fitting

We first evaluate the effectiveness of the proposed method in a simple toy setup, where the task is to remove outliers. The setup is the one discussed in Section 2. We randomly sampled 100 3D points on the plane. Specifically, we uniformly sampled and . We then added zero-mean Gaussian noise with standard deviation in the dimension. We also generated outliers in a similar way, where and are uniformly sampled in the same range, and

is sampled from a Gaussian distribution with mean 50 and standard deviation 1. For the baselines that directly use the analytical gradients of SVD and ED, we take the objective function to be

, where is the minimum eigenvector of and is the ground-truth noise direction ( in Eq. 16), which is also the plane normal and is the vector in this case. Note that we consider both and take the minimum distance, denoted by the and the in the loss function. For this problem, both solutions are correct due to the sign ambiguity of ED, which should be taken into account.

We consider two ways of computing analytical gradients, one using the SVD and the other the self-adjoint eigendecomposition (Eigh), which both yield mathematically valid solutions. To implement our approach, we rely on Eq. 16, as we are mostly interested in removing outliers for this experiment. For the learning rate, we use for the single outlier case, and for the multiple outlier one.

Fig. 4(b) shows the evolution of the loss as the optimization proceeds when using Adam and vanilla gradient descent with a single outlier. Note that SVD and Eigh have exactly the same behavior because they constitute two equivalent ways of solving the same problem. Using Adam in conjunction with either one initially yields a very slow decrease in the loss function, until it suddenly drops to zero after hundreds of iterations. In case of vanilla gradient descent, SVD and Eigh require hundreds of thousands of optimization steps to converge. By contrast, our approach produces a much more gradual decrease in the loss.

We further show in Fig. 6 how our formulation works compared to that of SVD- and Eigh-based methods in the presence of twenty outliers. As the problem is harder, SVD- and Eigh-based losses take more iterations to converge than in the single outlier case. With Adam, the convergence suddenly happens after 352 iterations, when a switch of the eigenvector with the smallest eigenvalue occurs, as shown in Fig. 6 (d).

In Fig. 6 (b) and (c), we show the inliers detected during optimization. In a plane fitting example, only three inliers are required to solve the problem, thus not all inliers are used as shown by Fig. 6 (b). However, as our formulation encourages the inclusion of inliers in the optimization through the second term in Eq. (16), we are able to recover all inliers as shown in Fig. 6 (c).

In Fig.6 (d) and (e), we can clearly see the abrupt change in the optimization at iteration 352. In Fig.6 (d), we plot the distance between the ground-truth and the estimated eigenvectors obtained through eigendecomposition, where the zero-point for each eigenvector is placed differently for better visualization. To be specific, we place the eigen-vector to the ground-truth position which closest to it. The distance between the point and the ground-truth position which below it indicates how close they are. Also for better visualization, we set the interval to be 10 and down sample the points. One can clearly see that until iteration 352, the eigenvectors are wrongly assigned, which is then suddenly corrected. In Fig. 6 (e), we show the devastating effect this switch causes; the magnitude of the gradient increases drastically, which is potentially harmful for optimization. Specifically, at this point, most of the inliers have been removed by SVD and Eigh, and thus the smallest two eigenvalues are close to zero. Therefore, because of Eq. 2, the gradients remain very large, even after this switch happens.

While in this simple plane fitting example the SVD- or Eigh-based methods converge, in more complex cases, this is not always true, as we will show next with different applications.

5.2 Ellipse Fitting

To evaluate our approach on the task of ellipse fitting, we built a synthetic dataset. We generated ten thousand ellipses with random centers in the rectangle whose top right corner coordinate is and bottom left corner coordinate is , angles in the range , long and short axes in the range , thus their top, right, bottom, and left coordinates are 1, 1, -1, and -1, respectively. Working in this confined range allows us to create ellipses that are distinguishable from noise and outliers with the human eye, and yet have enough variability to capture diverse types of ellipses. While this setup is constrained to a given scale range, in practice, ellipses at different scale ranges can easily be dealt with by normalizing and denormalizing as in [26].

We compare our method against other ellipse fitting techniques that are discussed in [1]. In particular, we evaluate the robust version of the ellipse-Specific method (Spec), Least-Squares based on Orthogonal distance (LSO), Least-Squares based on Algebraic distance (LSA), gradient Weighted Least-Squares based on Algebraic distance (WLSA), ellipse fitting with M-Estimator using the Cauchy loss function, and ellipse fitting with Least Median of Squares (LMedS). As in [20], we report the error of the estimated center point with arbitrary orientation for each method.

5.2.1 Outlier Removal

For this experiment, we incorporate outliers to our dataset by adding random points within the unit rectangle around the origin. Specifically, we randomly select of the 200 2D points and convert them to outliers. We further add a Gaussian noise with standard deviation of . We set and and rely on Eq. (16), as we would like to evaluate outlier removal only.

Quantitative results are summarized in Fig. 8 (a), and qualitative results are shown in Fig. 7 (a). Our method outperforms all other methods. Note that, in Fig. 7 (a), our method is the only one able to recover an ellipse that fits the ground-truth points.

5.2.2 Denoising

To test the performance of our method on denoising, we add Gaussian noise to the points, but do not include any outliers. We vary the noise level from zero to . Note that, in this pure denoising scenario, we do not predict a weight matrix , and the outputs of the network are the displacements only. Since we penalize overly large displacements , our formulation doesn’t suffer from the trivial solution where the matrix output by the network goes to zero. There is no need to encourage the projections in the direction orthogonal to the eigenvector of interest to remain large, so we set and in Eq. 25.

As in the outlier rejection case, we report the center point error. We run each experiment 100 times and report the average. The results are summarized in Fig. 8 (b). We also show a qualitative example in Fig. 7 (b). Our method gives the most robust results with respect to the additive Gaussian noise.

5.2.3 Simultaneous Outlier Removal and Denoising

Finally, we combine both sources of perturbations and randomly choose up to 150 outliers among the 200 points and add a Gaussian noise with a standard deviation up to . For the loss function in Eq. 25, we set , and .

During testing, we fix either the number of outliers or the degree of noise and vary the other. Specifically, to show robustness against noise in the presence of outliers, we fix the number of outliers to 10 and vary the standard derivation of Gaussian noise in the range . To show robustness to outliers, we fix the Gaussian noise to be and vary the number of outliers in the range . The results are summarized in Fig. 8 (c) and (d), where we can clearly see that we outperform the baselines.

5.3 Perspective-n-Points

Let us now turn to the PnP problem. For this application, as discussed below, we perform either outlier removal, or denoising, or both tasks jointly.

5.3.1 Outlier Removal on Synthetic Data

Following standard practice for evaluating PnP algorithms [4, 5], we generate a synthetic dataset composed of 3D-to-2D correspondences with noise and outliers. Each training example comprises two hundred 3D points, and we set the ground-truth translation of the camera pose to be their centroid. We then create a random ground-truth rotation , and project the 3D points to the image plane of our virtual camera. As in REPPnP [5], we apply Gaussian noise with a standard deviation of to these projections. We generate random outliers by assigning 3D points to arbitrary valid 2D image positions.

To focus on outlier removal only, we again first use the formulation in Eq. 16. For the hyper-parameters in Eq. 16, we empirically found that and works well for this task. During training, we randomly select between 0 and 150 of the two hundred matches and turn them into outliers. In other words, the two hundred training matches will contain a random number of outliers that our network will learn to filter out.

We compare our method against modern PnP methods, EPnP [4], OPnP [6], PPnP [7], RPnP [8] and REPPnP [5]. We also evaluate the DLT [3], since our loss formulation is based on it. Among these methods, REPPnP is the one most specifically designed to handle outliers. As in the keypoint matching case, we tried to compute the results of a network relying explicitly on eigendecomposition and minimizing the norm of the difference between the ground-truth eigenvector and the predicted one. However, we found that such a network was unable to converge. We also report the performance of two commonly used baselines that leverage RANSAC [28], P3P [23]+RANSAC and EPnP+RANSAC. For the other methods, RANSAC didn’t bring noticeable improvements, so we omitted them in the graph for better visual clarity. For all baselines, we used the default parameters provided with their implementation, which were tuned on equivalent synthetic datasets. For the RANSAC-based ones, we used the default RANSAC parameter from OpenCV. Note that our method could also make use of RANSAC, but we use it on its own here. While tuning the RANSAC parameter would affect the final results, all the methods, including ours if we were to use RANSAC, would benefit from it, which would leave our conclusions unchanged.

For this comparison, we use standard rotation and translation error metrics [42]. To demonstrate the effect of outliers at test time, we fix the number of matches to 200 and vary the number of outliers from to . We run each experiment 100 times and report the average.

Fig. 9 summarizes the results. We outperform all other methods significantly, especially when the number of outliers increases. REPPnP is the one competing method that seems least affected by the outliers. As long as the number of outliers is small, it is on a par with us. However, passed a certain point—when there are more than 40 outliers, that is, 20% of the total number of correspondences—its performance, particularly in terms of rotation error, decreases quickly, whereas ours does not.

5.3.2 Outlier Removal on Real Data

We then evaluate our PnP outlier removal approach on the real dataset of [43]. Specifically, the 3D points in this dataset were obtained using the SfM algorithm of [44], which also provides a rotation matrix and translation vector for each image. We treat these rotations and translations as ground truth to compare different PnP algorithms. Given a pair of images, we extract SIFT features at the reprojection of the 3D points in one image, and match these features to SIFT keypoints detected in the other image. This procedure produces erroneous correspondences, which a robust PnP algorithm should discard. In this example, we used the model trained on the synthetic data described before. Note that we apply the model without any fine-tuning, that is, the model is only trained with purely synthetic data. We observed that, except for EPnP+RANSAC, OPnP and P3P+RANSAC, the predictions of the baselines are far from the ground truth, which led to points reprojecting outside the image.

In Fig 11(b), we provide quantitative results on this real dataset. These results are averaged over 1000 images. In Fig. 2, we compare the reprojection of the 3D points in the input image after applying the rotation and translation obtained with our model and with OPnP and EPnP+RANSAC. Note our better accuracy, both quantitatively and qualitatively.

Fig. 13: Keypoint matching results. We report the accuracy of the estimated relative pose in terms of the mean Average Precision (mAP) measure of [9]. (a) Results for the SUN3D dataset. (b) Results for the dataset of [37]. Our method performs on par with the state-of-the-art method of [9], denoted as “Classification + Essential”, without the need of any pre-training. Note the significant performance gap between “Essential_Only”, which utilizes eigendecomposition directly, and our method which is eigendecomposition-free.

5.3.3 Denoising on Synthetic Data

We further evaluate the performance of our method on the denoising task for PnP. We keep the same experimental setup as for outlier removal, except we no longer have outliers, but instead an additive Gaussian noise on the 2D coordinates. We apply a zero-mean Gaussian noise, with varying standard deviation between and pixels. As in the ellipse-fitting case, to focus on denoising only, the network only outputs the displacements , not the weights , and we set and in Eq. 25 and empirically set . As in the case of outlier rejection, we run the experiment 100 times and report the average.

We summarize the results of this experiment in Fig. 10. When the noise level is small, all compared methods give reasonable results. However, as the level of noise increases, our method remains accurate, whereas the others start failing.

5.3.4 Simultaneous Outlier Removal and Denoising

We now tackle the case where both noise and outliers are present, using synthetic data. This is a combination of the two cases in Sections 5.3.1 and 5.3.3.

For the loss function in Eq. 25, we empirically set , and . We follow two different evaluation scenarios to demonstrate the performance in terms of outlier rejection and denoising. To this end, we vary either the number of outliers or the degree of noise, while keeping the other as a moderate value. Specifically, to demonstrate the effect of outliers at test time, we fix the image noise to be 20 pixels and vary the number of outliers from 10 to 150, out of 200 matches. To evaluate the effect of image noise, we fix the number of outliers to be 50, out of 200 matches, and vary the image noise from 5 to 50 pixels. We run the each experiment 100 times and report the average.

Fig. 11 summarizes the results of these experiments. In Fig. 11 (a) and (b), we show the results for a varying number of outliers. Note that, even in the presence of noise, our method outperforms all others significantly. This is also true when varying the noise level, as evidenced by Fig. 11 (c) and (d).

5.4 Wide-baseline stereo

To evaluate our method on a real-world problem, we use the SUN3D dataset [45]. For a fair comparison, we trained our network on the same data as in [9], that is, the “brown-bm-3-05” sequence, and evaluate it on the test sequences used for testing in [33, 9]. Additionally, to show that our method is not overfitting, we also test on a completely different dataset, the “fountain-P11” and “Herz-Jesus-P8” sequences of [37].

We follow the evaluation protocol of [9], which constitutes the state-of-the-art in keypoint matching, and only change the loss function to our own loss of Eq. 16. We use and , which we empirically found to work well for 2D-to-2D keypoint matching. We compare our method against our previous work [9], both in its original implementation that involves minimizing a classification loss first and then without that initial step, which we denote as “Essential_Only”. The latter is designed to show how critical the initial classification-based minimization of [9] is. In addition, we also compare against standard RANSAC [46], LMeds [47], MLESAC [29], and GMS [30] to provide additional reference points. We do this in terms of the performance metric used in [9] and referred to as mean Average Precision (mAP). This metric is computed by observing the ratio of accurately recovered poses given a certain maximum threshold, and taking the area under the curve of this graph. We tuned the parameters of the baselines, so as to obtain the best results. For RANSAC, we used the threshold employed in [9] of 0.01.

We summarize the results in Figs. 3 and 13. Our approach performs on par with [9], the state-of-the-art method for keypoint matching, and outperforms all the other baselines, without the need of any pre-training. Importantly, “Essential_Only” severely underperforms and even often fails completely. In short, instead of having to find a workaround to the eigenvector switching problem as in [9], we can directly optimize our objective function, which is far more generally applicable. As discussed previously in Section 2, the workaround in [9] converges to a solution that depends on the pre-defined threshold for the inliers, set heuristically by the user, and is not applicable to other problems involving additive noise. By contrast, our method can simply discover the inliers automatically while training, thanks to the second term in Eq. 7, and is also applicable to removal of additive noise.

In the top row of Fig. 3

, we compare the correspondences classified as inliers by our method to those of RANSAC on image pairs from the dataset of 

[37] and SUN3D, respectively. Note that even the correspondences that are misclassified as inliers by our approach are very close to being inliers. By contrast, RANSAC yields much larger errors. In the middle row of Fig. 3, we show the epipolar geometry of the outliers, denoted as red lines. These lines were obtained using the estimated essential matrix in the top figure, and the ground-truth one in the bottom figure. Note that, while they may seem senseless to the human eye, the outliers lie very close to the estimated epipolar lines, and sometimes even the ground-truth ones. This is mainly because, with wide-baseline stereo, the depth of each point cannot be recovered.

6 Conclusion

We have introduced a novel approach to training deep networks that rely on losses computed from an eigenvector corresponding to a zero eigenvalue of a matrix defined by the network’s output. Our loss does not suffer from the numerical instabilities of analytical differentiation of eigendecomposition, and converges to the correct solution much faster. We have demonstrated the effectiveness of our method on two main tasks, outlier removal and denoising, with applications to keypoint matching, PnP and ellipse fitting. In all cases, our new loss has allowed us to achieve state-of-the-art results.

In the future, we will investigate extensions of our approach to handle non-zero eigenvalues, in particular the case of maximum eigenvalue. Furthermore, we hope that our work will contribute to imbuing Deep Learning techniques with traditional Computer Vision knowledge, thus avoiding discarding decades of valuable research, and favoring the development more principled frameworks.


  • [1] Z. Zhang, “Parameter Estimation Techniques: A Tutorial with Application to Conic Fitting,” IVC, vol. 15, no. 1, pp. 59–76, 1997.
  • [2] K. Kanatani, Y. Sugaya, and Y. Kanazawa, “Ellipse Fitting for Computer Vision: Implementation and Applications,” Synthesis Lectures on Computer Vision, pp. 1–141, 2016.
  • [3] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision.   Cambridge University Press, 2000.
  • [4] V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An Accurate O(n) Solution to the PnP Problem,” IJCV, 2009.
  • [5] L. Ferraz, X. Binefa, and F. Moreno-Noguer, “Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection,” in CVPR, 2014, pp. 501–508.
  • [6] Y. Zheng, Y. Kuang, S. Sugimoto, K. Åström, and M. Okutomi, “Revisiting the PnP Problem: A Fast, General and Optimal Solution,” in ICCV, 2013.
  • [7] V. Garro, F. Crosilla, and A. Fusiello, “Solving the PnP Problem with Anisotropic Orthogonal Procrustes Analysis,” in 3DPVT, 2012, pp. 262–269.
  • [8] S. Li, C. Xu, and M. Xie, “A Robust O(n) Solution to the Perspective-N-Point Problem,” PAMI, pp. 1444–1450, 2012.
  • [9] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua, “Learning to Find Good Correspondences,” in CVPR, 2018.
  • [10] R. Ranftl and V. Koltun, “Deep Fundamental Matrix Estimation,” in ECCV, 2018.
  • [11] T. Papadopoulo and M. Lourakis, “Estimating the jacobian of the singular value decomposition: Theory and applications,” in ECCV, 2000, pp. 554–570.
  • [12] M. Giles, “Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation,” in Advances in Automatic Differentiation, 2008, pp. 35–44.
  • [13]

    C. Ionescu, O. Vantzos, and C. Sminchisescu, “Matrix Backpropagation for Deep Networks with Structured Layers,” in

    CVPR, 2015.
  • [14]

    M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A System for Large-Scale Machine Learning,” in

    USENIX Conference on Operating Systems Design and Implementation, 2016, pp. 265–283.
  • [15] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. Devito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic Differentiation in Pytorch,” in NIPS, 2017.
  • [16] Z. Dang, K. M. Yi, Y. Hu, F. Wang, P. Fua, and M. Salzmann, “Eigendecomposition-Free Training of Deep Networks with Zero Eigenvalue-Based Losses,” in ECCV, 2018.
  • [17]

    M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial Transformer Networks,” in

    NIPS, 2015, pp. 2017–2025.
  • [18]

    A. Handa, M. Bloesch, V. Patraucean, S. Stent, J. McCormac, and A. Davison, “Gvnn: Neural Network Library for Geometric Computer Vision,” in

    ECCV, 2016.
  • [19] I. Murray, “Differentiation of the Cholesky Decomposition,” arXiv Preprint, 2016.
  • [20] A. Fitzgibbon, M. Pilu, and R. Fisher, “Direct Least-Squares Fitting of Ellipses,” PAMI, vol. 21, no. 5, pp. 476–480, May 1999.
  • [21] R. Halır and J. Flusser, “Numerically Stable Direct Least Squares Fitting of Ellipses,” in International Conference in Central Europe on Computer Graphics and Visualization, 1998.
  • [22] P. Rousseeuw and A. Leroy,

    Robust Regression and Outlier Detection

    .   Wiley, 1987.
  • [23] L. Kneip, D. Scaramuzza, and R. Siegwart, “A Novel Parametrization of the Perspective-Three-Point Problem for a Direct Computation of Absolute Camera Position and Orientation,” in CVPR, 2011, pp. 2969–2976.
  • [24] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “DSAC – Differentiable RANSAC for Camera Localization,” ARXIV, 2016.
  • [25] H. Longuet-Higgins, “A Computer Algorithm for Reconstructing a Scene from Two Projections,” Nature, vol. 293, pp. 133–135, 1981.
  • [26] R. Hartley, “In Defense of the Eight-Point Algorithm,” PAMI, vol. 19, no. 6, pp. 580–593, June 1997.
  • [27] D. Nister, “An Efficient Solution to the Five-Point Relative Pose Problem,” in CVPR, June 2003.
  • [28] M. Fischler and R. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography,” Communications ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [29] P. Torr and A. Zisserman, “MLESAC: A New Robust Estimator with Application to Estimating Image Geometry,” CVIU, vol. 78, pp. 138–156, 2000.
  • [30] J. Bian, W. Lin, Y. Matsushita, S. Yeung, T. Nguyen, and M. Cheng, “GMS: Grid-Based Motion Statistics for Fast, Ultra-Robust Feature Correspondence,” in CVPR, 2017.
  • [31] R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J.-M. Frahm, “USAC: A Universal Framework for Random Sample Consensus,” PAMI, vol. 35, no. 8, pp. 2022–2038, 2013.
  • [32] A. R. Zamir, T. Wekel, P. Agrawal, J. Malik, and S. Savarese, “Generic 3D Representation via Pose Estimation and Matching,” in ECCV, 2016.
  • [33] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox, “Demon: Depth and Motion Network for Learning Monocular Stereo,” in CVPR, 2017.
  • [34] Z. Huang and L. V. Gool, “A Riemannian Network for SPD Matrix Learning,” in AAAI, 2017, p. 6.
  • [35] Z. Huang, C. Wan, T. Probst, and L. V. Gool, “Deep Learning on Lie Groups for Skeleton-Based Action Recognition,” in CVPR, 2017, pp. 6099–6108.
  • [36] M. Law, R. Urtasun, and R. S. Zemel, “Deep Spectral Clustering Learning,” in ICML, 2017, pp. 1985–1994.
  • [37] C. Strecha, W. Hansen, L. Van Gool, P. Fua, and U. Thoennessen, “On Benchmarking Camera Calibration and Multi-View Stereo for High Resolution Imagery,” in CVPR, 2008.
  • [38] D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimisation,” in ICLR, 2015.
  • [39] P. Schönemann, “A Generalized Solution of the Orthogonal Procrustes Problem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
  • [40] Z. Zhang, “Determining the Epipolar Geometry and Its Uncertainty: A Review,” IJCV, vol. 27, no. 2, pp. 161–195, 1998.
  • [41] S. Gould, B. Fernando, A. Cherian, P. Anderson, R. S. Cruz, and E. Guo, “On differentiating parameterized argmin and argmax problems with application to bi-level optimization,” arXiv preprint arXiv:1607.05447, 2016.
  • [42] A. Crivellaro, M. Rad, Y. Verdie, K. M. Yi, P. Fua, and V. Lepetit, “Robust 3D Object Tracking from Monocular Images Using Stable Parts,” PAMI, 2018.
  • [43] J. Heinly, J. Schoenberger, E. Dunn, and J.-M. Frahm, “Reconstructing the World in Six Days,” in CVPR, 2015.
  • [44] C. Wu, “Towards Linear-Time Incremental Structure from Motion,” in 3DV, 2013.
  • [45] J. Xiao, A. Owens, and A. Torralba, “SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels,” in ICCV, 2013.
  • [46] H. Cantzler, “Random Sample Consensus (RANSAC),” 2005, cVonline.
  • [47] D. Simpson, “Introduction to Rousseeuw (1984) Least Median of Squares Regression,” in Breakthroughs in Statistics.   Springer, 1997, pp. 433–461.