The solution of many Computer Vision problems can be obtained by finding the zero singular- or eigen-value of a matrix, that is, by solving a linear least-squares problem. This is true of model-fitting tasks ranging from plane and ellipse fitting [1, 2] to wide-baseline stereo and the perspective-n-point (PnP) problem [4, 5, 6, 7, 8]. With the advent of Deep Learning, there has been a push to embed such least-squares approaches in deep architectures to enforce constraints. For example, we have recently shown that this could be done when training networks to detect and match keypoints in image pairs while accounting for the global epipolar geometry. More generally, this approach makes it possible to explicitly incorporate geometrical knowledge into the network. As a result, it operates in a more constrained space, which allows learning from smaller amounts of training data.
The currently dominant way to implement this approach is to design a network whose output is a matrix and to train the network so that the smallest singular- or eigen-vector of that matrix is as close as possible to the ground-truth one [9, 10]. This requires differentiating the singular value decomposition (SVD) or the eigendecomposition (ED) in a stable manner at training time. This problem has received considerable attention [11, 12, 13],
and decomposition algorithms are already part of standard Deep Learning frameworks, such as TensorFlow
or PyTorch. Unfortunately, these algorithms suffer from two main weaknesses. First, when optimizing with respect to the matrix coefficients or with respect to parameters defining them, the vector corresponding to the smallest singular value or eigenvalue may change abruptly as the relative magnitudes of these values evolve, which is a non-differentiable behavior. This is illustrated in Fig. 1, discussed in detail in Section 2, and demonstrated experimentally in Section 5.1. Second, computing the derivatives of the eigenvector with respect to the coefficients of the input matrix requires dividing by the difference between two singular values or eigenvalues, which could be very small. While a solution to the second problem was proposed in , the first remains and handicaps all methods that depend on SVD or ED.
In this paper, we therefore introduce a method that can enforce the same global constraints as the SVD- and ED-based methods but without explicitly performing these decompositions, thus eliminating the non-differentiability issue and improving numerical stability. More specifically, given the network weights $\boldsymbol{\theta}$, let $\mathbf{A}(\boldsymbol{\theta})$ be the matrix that the network outputs and let
$\mathbf{e}$ be the true eigenvector corresponding to the zero eigenvalue. We introduce a loss function that is minimized when $\mathbf{A}(\boldsymbol{\theta})\,\mathbf{e} = \mathbf{0}$ while disallowing trivial solutions such as $\mathbf{A}(\boldsymbol{\theta}) = \mathbf{0}$. In practice, image measurements are not perfect and the eigenvalue is never strictly zero. However, this does not significantly affect the computation, which makes our approach robust to noise.
We will demonstrate that our approach delivers better results than using the standard implementation of singular- and eigen-value decomposition provided in TensorFlow for plane and ellipse fitting, solving the PnP problem, and distinguishing good keypoint correspondences from bad ones. A preliminary version of this paper appeared in . In this extended version, we use a toy example to better motivate our approach and show how the gradients behave; we expand the formulation to handle denoising; we perform simultaneous outlier rejection and denoising; we demonstrate the application of our method to ellipse fitting. Note that, while we only discuss our approach in the context of Deep Learning, it is applicable to any optimization problem in which one seeks to optimize a loss function based on the smallest eigenvector of a matrix with respect to the parameters that define this matrix.
To illustrate the problems associated with differentiating eigenvectors and eigenvalues, consider the outlier rejection example depicted by Fig. 1. The inputs are 3D inlier points lying on a plane and an outlier assumed to be very far from the plane. The inliers are shown in black and the outlier in red.
Suppose we want to assign a binary weight to each point—1 for inliers, 0 for outliers—such that the eigenvector corresponding to the smallest eigenvalue of the weighted covariance matrix is close to the ground-truth one in the least-squares sense. When the weight assigned to the outlier is 0, this eigenvector would be $\mathbf{e}$, which is also the normal to the plane and is shown in green. However, if at any point during the optimization we assign the weight 1 to the outlier, $\mathbf{e}$ will correspond to the largest eigenvalue instead of the smallest one, and the eigenvector corresponding to the smallest eigenvalue will be the vector $\mathbf{v}$ shown in blue, which is perpendicular to $\mathbf{e}$. As a result, if we initially set all weights to 1 and optimize them so that the smallest eigenvector approaches the plane normal, the gradient values will depend on the coordinates of $\mathbf{v}$. If everything goes well, at some point during the optimization, the weight assigned to the outlier will become small enough so that the smallest eigenvector switches from being $\mathbf{v}$ to being $\mathbf{e}$.
However, this introduces a large jump in the gradient vector, whose values will now depend on the coordinates of $\mathbf{e}$ instead of $\mathbf{v}$. In this simple case, the resulting instability does not preclude eventual convergence. However, in more complex situations, we found that it does. We already noted this problem in  when simultaneously establishing keypoint correspondences and estimating the essential matrix. Because of it, we had to first rely solely on a classification loss to determine the potential inlier correspondences before incorporating the loss based on the essential matrix to impose geometric constraints, which requires ED. This ensured that the network weights were already good enough to prevent eigenvector switching when starting to minimize the geometry-based loss. This strategy, however, requires ground-truth inlier/outlier labels, which in practice may be difficult to obtain. Furthermore, this scheme may not even be possible in cases where one cannot define an auxiliary classification task—for example, in the traditional task of removing additive noise from data. As we will show later, our method does not suffer from this.
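This switching behavior can be reproduced in a few lines. The sketch below is a minimal NumPy illustration (the point counts and the outlier position are our own choices, not the paper's exact settings): it builds the weighted covariance of the toy problem and shows that the smallest eigenvector rotates by 90 degrees when the outlier's weight drops from 1 to 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inliers on the z = 0 plane, plus one distant outlier along the normal.
inliers = np.column_stack([rng.uniform(-1, 1, 100),
                           rng.uniform(-1, 1, 100),
                           np.zeros(100)])
outlier = np.array([[0.0, 0.0, 50.0]])
points = np.vstack([inliers, outlier])

def smallest_eigenvector(w):
    """Eigenvector of the weighted covariance with the smallest eigenvalue."""
    mean = (w[:, None] * points).sum(0) / w.sum()
    X = points - mean
    C = X.T @ (w[:, None] * X)
    vals, vecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    return vecs[:, 0]

w = np.ones(len(points))
v_with_outlier = smallest_eigenvector(w)     # an in-plane direction
w[-1] = 0.0
v_without_outlier = smallest_eigenvector(w)  # the plane normal (0, 0, 1)

# The smallest eigenvector flips by 90 degrees as the outlier weight drops.
print(abs(v_with_outlier[2]), abs(v_without_outlier[2]))
```

With the outlier weighted 1, the variance along the normal dominates, so the smallest eigenvector lies in the plane (its z-component is near 0); once the outlier is down-weighted, it snaps to the normal. No infinitesimal change of the weights produces an intermediate vector, which is precisely the non-differentiable behavior discussed above.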
3 Related Work
In recent years, the need to integrate geometric methods and mathematical tools into Deep Learning frameworks has led to the reformulation of a number of them in network terms. For example,  considers spatial transformations of image regions with CNNs. The set of such transformations is extended in . In a different context,  derives a differentiation of the Cholesky decomposition that could be integrated in Deep Learning frameworks.
Unfortunately, the set of geometric Computer Vision problems that these methods can handle remains relatively limited.
In particular, there is no widely accepted deep-learning way to solve the many geometric problems that reduce to finding the least-square solution to a linear system.
In this work, we consider three such problems: ellipse fitting, estimating the
3D pose of an object from 3D-to-2D correspondences, and computing the essential matrix from keypoint correspondences in an image pair, all of which we briefly discuss below.
Ellipse fitting. This is a traditional Computer Vision problem, where one finds the ellipse equation that best fits a set of points. This can be used, for example, to detect people’s eyes in images by fitting ellipses to detected image edges. One of the simplest and most straightforward ways to solve this problem is the Direct Linear Transform (DLT) [1, 2]. The DLT corresponds to finding an algebraic least-squares solution to the problem by performing ED or SVD and retaining the eigen- or singular-vector that corresponds to the minimum eigen- or singular-value. Besides the DLT, other methods focus on minimizing the geometric distance instead of the algebraic one , and exploit constraints to enforce the result to truly be an ellipse, as opposed to a general conic, which also encompasses parabolas and hyperbolas [20, 21]. In practice, outliers are handled in an iterative manner, where the solution is refined via robust techniques such as M-estimators  or Least Median of Squares (LMedS) . For a detailed review of ellipse fitting, we refer the reader to .
Estimating 3D pose from 3D-to-2D correspondences. This is known as the Perspective-n-Point (PnP) problem. It has been investigated for decades and is also amenable to an eigendecomposition-based solution , many variations of which have been proposed over the years [4, 23, 6, 5]. DSAC 
is the only approach we know of that integrates the PnP solver into a Deep Network. As explicitly differentiating through the PnP solver is not optimization friendly, the authors apply the log trick used in the reinforcement learning literature. This amounts to using a numerical approximation of the derivatives from random samples, which is not ideal, given that an analytical alternative exists. Moreover, DSAC only works for grid configurations and known scenes. By contrast, the method we propose in this work has an analytical form, with no need for stochastic sampling.
Estimating the Essential matrix from correspondences. The eigenvalue-based solution to this problem has been known for decades [25, 26, 3] and remains the standard way to compute Essential matrices . The real focus of research in this area has been to establish reliable keypoint correspondences and to eliminate outliers. In this context, variations of RANSAC , such as MLESAC  and least median of squares (LMedS) , and very recently GMS , have become popular. For a comprehensive study of such methods, we refer the interested reader to . With the emergence of Deep Learning, there has been a trend towards moving away from this decades-old knowledge and applying instead a black-box approach, where a Deep Network is trained to directly estimate the rotation and translation matrices [32, 33] without a priori geometrical knowledge. Our very recent work  attempts to reconcile these two opposing trends by embedding the geometric constraints into a Deep Net and has demonstrated superior performance for this task when the correspondences are hard to establish. A similar approach was also proposed in the contemporary work of .
Differentiating the eigen- and singular value decomposition. Whether computing the essential matrix, estimating 3D pose, or solving any other least-squares problem, incorporating an eigendecomposition-solver into a deep network requires differentiating the eigendecomposition. Expressions for such derivatives have been given in [11, 12] and reformulated in terms that are compatible with back-propagation in . Specifically, as shown in , for a matrix $\mathbf{A}$ written as $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^\top$, the variations of the eigenvectors with respect to the matrix, used to compute derivatives, are
$$d\mathbf{U} = \mathbf{U}\left(\tilde{\mathbf{K}} \circ \left(\mathbf{U}^\top d\mathbf{A}\,\mathbf{U}\right)\right), \qquad (1)$$
where $\circ$ denotes the element-wise product, and
$$\tilde{K}_{ij} = \begin{cases} \dfrac{1}{\sigma_i - \sigma_j}, & i \neq j,\\ 0, & i = j. \end{cases} \qquad (2)$$
As can be seen from Eq. 2, when two eigenvalues are close to each other, the denominator becomes small, multiplying the backward flow of gradients by a large number. This causes numerical instability. The effect is exaggerated when both eigenvalues in the denominator are small, and even more so when they are similar , which we found to happen in practice. The same can be said about singular value decomposition.
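The magnitude of these factors is easy to check numerically. The sketch below (our own illustration) forms the off-diagonal factors of Eq. 2 for a covariance matrix with two nearly equal small eigenvalues and shows that the backward gradients get scaled by a factor of a million.

```python
import numpy as np

def eig_backward_factors(C):
    """Off-diagonal factors 1/(lambda_i - lambda_j) that scale the
    backward gradients of an eigendecomposition (cf. Eq. 2)."""
    vals = np.linalg.eigvalsh(C)
    n = len(vals)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                K[i, j] = 1.0 / (vals[i] - vals[j])
    return vals, K

# Two nearly equal small eigenvalues -> huge backward factors.
C = np.diag([1e-6, 2e-6, 1.0])
vals, K = eig_backward_factors(C)
print(np.abs(K).max())  # 1e6: gradients are multiplied by a huge number
```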
A solution to this was proposed in , and singular- and eigen-value decompositions have been used within deep networks for problems where all the singular values are exploited and their order is irrelevant [34, 35]. In the context of spectral clustering, the approach of  also proposed a solution that eliminates the need for explicit eigendecomposition. This solution, however, was dedicated to the scenario where one seeks to use all non-zero eigenvalues, assuming a matrix of constant rank.
Here, by contrast, we tackle problems where what matters is a single eigen- or singular-value. In this case, the order of the eigenvalues is important. However, this order can change during training, which results in a non-differentiable switch from one eigenvector to another, as in the toy example of Section 2. In turn, this leads to numerical instabilities, which can prevent convergence. In 
, we overcame this problem by first training the network using a classification loss that does not depend on eigenvectors. Only once a sufficiently good solution is found, that is, a solution close enough to the correct one for vector switching not to happen anymore, is the loss term that depends on the eigenvector associated with the smallest eigenvalue turned on. While effective, this strategy inherently relies on having access to classification labels. As such, it is ill-suited to tackle denoising scenarios. As we show later, we can achieve state-of-the-art results without the need for such a heuristic, by deriving a more robust, eigendecomposition-free loss function.
4 Our Approach
We introduce an approach to working with eigenvectors corresponding to zero eigenvalues within an end-to-end learning formalism that eliminates the gradient instabilities due to vector switching discussed in Section 2 and the difficulties caused by repeated eigenvalues. To this end, we derive a loss function that directly operates on the matrix whose eigen- or singular-vectors we are interested in without explicitly performing either SVD or ED.
Below, we first discuss the generic case in which the network takes a set of measurements as input and directly outputs the matrix of interest. We then consider two specialized but important cases. In the first, the network outputs weights defining the matrix, which corresponds to model fitting with outlier rejection. In the second, we consider the measurements to be noisy and allow them to be adjusted to account for the noise.
4.1 Generic Scenario
Given an input measurement $\mathbf{x}$, let us denote by $f_{\boldsymbol{\theta}}(\mathbf{x})$ the output of a deep network with parameters $\boldsymbol{\theta}$. Let this network output be a matrix $\mathbf{A} = f_{\boldsymbol{\theta}}(\mathbf{x})$.
A seemingly natural approach to exploiting ED within the network would consist of minimizing the loss $\|\mathbf{v} - \mathbf{e}\|^2$, where $\mathbf{v}$ is the eigenvector of $\mathbf{A}$ corresponding to its smallest eigenvalue and $\mathbf{e}$ is the ground-truth smallest eigenvector, as in [13, 9]. As discussed in Section 2, however, this requires differentiating the ED result to perform back-propagation and is not optimization-friendly. Furthermore, as argued in  for the fundamental matrix, the direct use of this norm as a distance measure is not optimal, since all the entries do not have equal importance.
To avoid the instabilities of eigenvector differentiation and use the algebraic error advocated for in , we consider the standard definition of a zero eigenvalue, i.e.,
$$\mathbf{A}\,\mathbf{e} = \mathbf{0}. \qquad (3)$$
This suggests defining the loss function
$$\mathcal{L}(\boldsymbol{\theta}) = \mathbf{e}^\top \mathbf{A}^\top \mathbf{A}\,\mathbf{e}, \qquad (4)$$
where $\mathbf{e}$ is the ground-truth zero eigenvector. As $\mathbf{A}^\top\mathbf{A}$ is always positive semi-definite, $\mathcal{L}(\boldsymbol{\theta})$ is always greater than or equal to zero.
However, there are many matrices $\mathbf{A}$ that satisfy Eq. 4, including $\mathbf{A} = \mathbf{0}$. To rule out such trivial solutions, we propose to minimize $\mathcal{L}(\boldsymbol{\theta})$ while maximizing the norm of the projection of the data vectors along the directions orthogonal to $\mathbf{e}$. In the plane-fitting example of Section 2, this would mean maximizing the projections of the data vectors on the plane whose normal is $\mathbf{e}$. The motivation behind this arises from the fact that maximizing the projections along the directions orthogonal to $\mathbf{e}$ does not preclude the correct eigenvalue from going to zero. By contrast, maximizing all eigenvalues would conflict with our original objective of Eq. 4, particularly because it would aim to maximize the projection of the data in the direction of $\mathbf{e}$, which ideally should be zero.
To prevent this, instead of directly regularizing $\mathbf{A}$, we rely on the matrix $\bar{\mathbf{A}}$ defined as
$$\bar{\mathbf{A}} = \mathbf{A}\left(\mathbf{I} - \mathbf{e}\mathbf{e}^\top\right), \qquad (5)$$
where $\mathbf{I}$ is the identity matrix. Note that $\bar{\mathbf{A}}\,\mathbf{e} = \mathbf{0}$ whatever the value of $\mathbf{A}$. In other words,
$\mathbf{I} - \mathbf{e}\mathbf{e}^\top$ projects all vectors onto the hyperplane normal to $\mathbf{e}$. To prevent the trivial solutions discussed above, we would like the magnitude of these projections to be far from zero, or, in other words, the eigenvalues corresponding to directions orthogonal to $\mathbf{e}$ to be as large as possible. As the trace of a matrix is the sum of its eigenvalues, this can be achieved by maximizing the trace of $\bar{\mathbf{A}}^\top\bar{\mathbf{A}}$. We therefore introduce
$$\mathcal{L}_{\mathrm{tr}}(\boldsymbol{\theta}) = -\operatorname{tr}\left(\bar{\mathbf{A}}^\top\bar{\mathbf{A}}\right) \qquad (6)$$
as a loss to be minimized along with $\mathcal{L}(\boldsymbol{\theta})$. Note that the terms defined in Eqs. 4 and 6 act on the eigenvalues of the same matrix but in different directions. $\mathcal{L}(\boldsymbol{\theta})$ can be rewritten as $\operatorname{tr}(\mathbf{A}\,\mathbf{e}\mathbf{e}^\top\mathbf{A}^\top)$ and $\operatorname{tr}(\bar{\mathbf{A}}^\top\bar{\mathbf{A}})$ as $\operatorname{tr}(\mathbf{A}(\mathbf{I} - \mathbf{e}\mathbf{e}^\top)\mathbf{A}^\top)$. Because of this, $\mathcal{L}_{\mathrm{tr}}(\boldsymbol{\theta})$ does not prevent us from obtaining a matrix with a truly zero eigenvalue in the direction $\mathbf{e}$.
However, $\mathcal{L}(\boldsymbol{\theta})$ is bounded below by zero, whereas $\operatorname{tr}(\bar{\mathbf{A}}^\top\bar{\mathbf{A}})$ is not bounded above and can become very large. In practice, we have observed it to reach very large values and thus dominate the loss. Furthermore, in our deep learning context, during training, the network will produce one matrix for every training sample, and the loss will thus sum over multiple trace terms. The magnitude of these terms will then vary significantly across the training samples, putting more focus on some of them and making it hard to define the relative weight of $\mathcal{L}(\boldsymbol{\theta})$ and $\mathcal{L}_{\mathrm{tr}}(\boldsymbol{\theta})$. This is illustrated by the bottom row of Figure 4, where we show the evolution of the trace for 3 samples. Even when all outliers are removed, the values for the 3 samples have different magnitudes.
To overcome this and facilitate balancing the two loss terms, we use a robust kernel that mitigates the effects of extreme trace values and brings the regularizer values of all training samples to a commensurate magnitude in the range $(0, 1]$. The resulting training curves for the same three samples as before are shown in the middle row of Figure 4. Note that they are of similar magnitudes and much lower than the trace itself, thus interfering less with our main objective $\mathcal{L}(\boldsymbol{\theta})$. Specifically, we take our complete loss function to be
$$\mathcal{L}_{\mathrm{full}}(\boldsymbol{\theta}) = \mathbf{e}^\top \mathbf{A}^\top \mathbf{A}\,\mathbf{e} + \alpha\, e^{-\beta\,\operatorname{tr}\left(\bar{\mathbf{A}}^\top\bar{\mathbf{A}}\right)}, \qquad (7)$$
where $\alpha$ and $\beta$ are positive scalars. Ideally, when there is a perfect solution, such that the outliers’ weights are all zero and the inliers have no noise, the value of $\alpha$ would not matter, as both terms would go to zero. However, in practice, because of noise and measurement errors, this does not occur, and it is necessary to control the influence of the two terms. $\alpha$ governs whether we focus on reducing the first term – the amount of estimation error – or the second one – the amount of information that is preserved in the non-null space according to the ground truth. The former loosely relates to the accuracy of the solution, and the latter to the recall. $\beta$ controls to what degree we consider the second term: the trace value can easily go up to thousands, which would make the exponential effectively reach zero without a proper value of $\beta$. This loss is fully differentiable and can thus be used to learn the network parameters. We provide its derivatives in the next section.
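As a sanity check, the loss can be sketched in a few lines of NumPy. The exponential robust kernel and the values of alpha and beta below are illustrative assumptions, not the paper's settings; the point is that a non-trivial matrix satisfying A e = 0 scores lower than the trivial solution A = 0.

```python
import numpy as np

def eigfree_loss(A, e, alpha=10.0, beta=1e-3):
    """Eigendecomposition-free loss: e^T A^T A e plus a trace-based
    regularizer that penalizes trivial solutions such as A = 0.
    alpha and beta are illustrative values, not the paper's settings."""
    e = e / np.linalg.norm(e)
    fit = e @ A.T @ A @ e                     # zero iff A e = 0
    A_bar = A @ (np.eye(len(e)) - np.outer(e, e))
    reg = alpha * np.exp(-beta * np.trace(A_bar.T @ A_bar))
    return fit + reg

rng = np.random.default_rng(1)
e = np.array([0.0, 0.0, 1.0])
A_good = rng.normal(size=(3, 3))
A_good[:, 2] = 0.0            # A e = 0, yet A is far from trivial
A_trivial = np.zeros((3, 3))

print(eigfree_loss(A_good, e), eigfree_loss(A_trivial, e))
```

For `A_trivial` the fitting term vanishes, but the regularizer takes its maximum value alpha, so the trivial solution is penalized exactly as intended.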
4.2 Derivatives for the General Case
To compute the derivatives of the loss w.r.t. the parameters
$\boldsymbol{\theta}$, we first follow the chain rule to obtain
$$\frac{\partial \mathcal{L}_{\mathrm{full}}}{\partial \boldsymbol{\theta}} = \frac{\partial \mathcal{L}_{\mathrm{full}}}{\partial \mathbf{A}}\,\frac{\partial \mathbf{A}}{\partial \boldsymbol{\theta}}. \qquad (8)$$
We can write the first term of the loss in index notation as
$$\mathbf{e}^\top \mathbf{A}^\top \mathbf{A}\,\mathbf{e} = \sum_i \Big(\sum_j A_{ij}\, e_j\Big)^2, \qquad (9)$$
where we omitted the explicit dependency of $\mathbf{A}$ on $\boldsymbol{\theta}$ for the sake of conciseness. Using the same notation, the second term can be written as
$$\alpha\, e^{-\beta\,\operatorname{tr}(\bar{\mathbf{A}}^\top\bar{\mathbf{A}})} = \alpha\, \exp\Big(-\beta \sum_{m,n} \bar{A}_{mn}^2\Big), \qquad (10)$$
where $\bar{A}_{mn} = \sum_k A_{mk}\left(\delta_{kn} - e_k e_n\right)$, and $\delta_{kn}$ denotes the Kronecker delta. Computing the derivatives with respect to $A_{ij}$, we obtain
$$\frac{\partial}{\partial A_{ij}} \sum_m \Big(\sum_n A_{mn}\, e_n\Big)^2 = 2\, e_j \sum_k A_{ik}\, e_k, \qquad (11)$$
$$\frac{\partial}{\partial A_{ij}}\, \alpha\, \exp\Big(-\beta \sum_{m,n} \bar{A}_{mn}^2\Big) = -2\,\alpha\beta\, \exp\Big(-\beta \sum_{m,n} \bar{A}_{mn}^2\Big)\, \bar{A}_{ij}. \qquad (12)$$
Then, turning them back to matrix form, we write
$$\frac{\partial \mathcal{L}_{\mathrm{full}}}{\partial \mathbf{A}} = 2\,\mathbf{A}\,\mathbf{e}\mathbf{e}^\top - 2\,\alpha\beta\, e^{-\beta\,\operatorname{tr}\left(\bar{\mathbf{A}}^\top\bar{\mathbf{A}}\right)}\,\bar{\mathbf{A}}. \qquad (13)$$
Here, we can see that modifying $\alpha$ and $\beta$ leads to directly modifying the magnitude of the gradient.
We then use this for Adam optimization.
Note that, if we were to directly use $\mathbf{A}$ instead of $\bar{\mathbf{A}}$ in the second loss term, the gradients would become
$$\frac{\partial \mathcal{L}_{\mathrm{full}}}{\partial \mathbf{A}} = 2\,\mathbf{A}\,\mathbf{e}\mathbf{e}^\top - 2\,\alpha\beta\, e^{-\beta\,\operatorname{tr}\left(\mathbf{A}^\top\mathbf{A}\right)}\,\mathbf{A}. \qquad (14)$$
In contrast to Eq. 13, the second term contributes to the entire matrix $\mathbf{A}$, including in the direction of $\mathbf{e}$, which then cancels out, or at least reduces, the contribution of the first term.
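The matrix-form gradient of the fitting term is easy to verify numerically. The sketch below (our own check, not from the paper) compares the analytic expression 2 A e e^T against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
e = np.array([0.0, 0.0, 1.0])

def first_term(A):
    """Fitting term e^T A^T A e."""
    return e @ A.T @ A @ e

# Analytic gradient of e^T A^T A e w.r.t. A is 2 A e e^T.
analytic = 2.0 * A @ np.outer(e, e)

# Central finite differences as a sanity check.
eps = 1e-6
numeric = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        Ap, Am = A.copy(), A.copy()
        Ap[i, j] += eps
        Am[i, j] -= eps
        numeric[i, j] = (first_term(Ap) - first_term(Am)) / (2 * eps)

print(np.abs(analytic - numeric).max())  # agrees up to finite-difference error
```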
4.3 Model Fitting and Outlier Rejection
The generic formulation presented above assumes that all measurements have the same weight and that the network predicts all the coefficients of the matrix instead of a number of design parameters from which the matrix can be computed. In practice, this is rarely the case. For problems such as plane-fitting, ellipse-fitting, PnP, and estimating the essential matrix, the data typically comes in the form of a set of measurements $\mathbf{x}_i$ for $1 \le i \le n$, where the $\mathbf{x}_i$ are vectors of dimension $d$, with $d = 3$ for plane-fitting, $d = 6$ for ellipse-fitting, $d = 12$ for PnP, and $d = 9$ for wide-baseline stereo. In such problems, the network must be trained to assign to each measurement a weight between 0 and 1 such that the eigenvector corresponding to the zero eigenvalue of the matrix
$$\mathbf{A} = \mathbf{X}^\top \mathbf{W}\,\mathbf{X} \qquad (16)$$
is the ground-truth vector $\mathbf{e}$, where $\mathbf{X}$ is the matrix obtained by grouping all the $\mathbf{x}_i$ vectors into an $n \times d$ matrix and $\mathbf{W}$ is the diagonal matrix obtained from the $n$-vector of weights predicted by the network.
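In code, building the weighted data matrix of Eq. 16 and reading off the model is direct. The sketch below is our own illustration, with hand-set weights standing in for the network's output: once the outlier's weight is zero, the smallest eigenvector of X^T W X recovers the normal of the z = 0 plane.

```python
import numpy as np

def weighted_data_matrix(X, w):
    """A = X^T W X of Eq. 16: X is n x d, w an n-vector of
    per-measurement weights in [0, 1]."""
    return X.T @ np.diag(w) @ X

rng = np.random.default_rng(3)
X = np.column_stack([rng.uniform(-1, 1, 50),
                     rng.uniform(-1, 1, 50),
                     np.zeros(50)])          # inliers on the z = 0 plane
X = np.vstack([X, [0.0, 0.0, 50.0]])         # one outlier
w = np.ones(51)
w[-1] = 0.0                                  # ideal network output

A = weighted_data_matrix(X, w)
vals, vecs = np.linalg.eigh(A)               # ascending eigenvalues
print(vals[0], vecs[:, 0])                   # zero eigenvalue, normal +-(0, 0, 1)
```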
For the sake of completeness, we briefly remind the reader of how the $\mathbf{x}_i$ vectors can be obtained from image measurements in the remainder of this section. For more details, we refer the interested reader to standard texts such as .
4.3.1 Plane Fitting
Let us first consider the toy outlier rejection problem used to motivate our approach in Section 2. Note that, for this experiment, we do not train a deep network or perform any learning procedure. Instead, given 3D points $\mathbf{p}_i$, including inliers and outliers, we directly optimize the weight of each point. At every optimization step, given the current weight values, we compute the weighted mean of the points. We then define $\mathbf{X}$ to be the matrix of mean-subtracted 3D points.
4.3.2 Ellipse Fitting
For this task, given a set of 2D coordinates, we aim to find the parameters of the ellipse that best fits the data. In contrast to the plane-fitting case, here we make use of a deep network, specifically the same one as in . However, because the task is different from that tackled in , our network takes as input 2D coordinates, not correspondences. We use this network to predict the weight matrix $\mathbf{W}$ of Eq. 16, which is diagonal, with one weight for each 2D coordinate.
Formally, for each 2D point $(u_i, v_i)$, let
$$\mathbf{x}_i = \left[u_i^2,\; u_i v_i,\; v_i^2,\; u_i,\; v_i,\; 1\right]^\top.$$
We can then write a standard equation for an ellipse as
$$a u_i^2 + b u_i v_i + c v_i^2 + d u_i + e v_i + f = \mathbf{x}_i^\top \left[a,\, b,\, c,\, d,\, e,\, f\right]^\top = 0,$$
where $a$, $b$, $c$, $d$, $e$, and $f$ define the parameters of the ellipse. We can therefore write $\mathbf{e} = \left[a,\, b,\, c,\, d,\, e,\, f\right]^\top$
for this ellipse fitting case.
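As an illustration, this construction can be checked on a synthetic ellipse. The sketch below (our own example, not from the paper) recovers the conic parameters of (u/2)^2 + v^2 = 1, i.e., coefficients proportional to (1, 0, 4, 0, 0, -4), via the DLT.

```python
import numpy as np

def ellipse_design_row(u, v):
    """x_i = [u^2, uv, v^2, u, v, 1] for the conic a u^2 + ... + f = 0."""
    return np.array([u * u, u * v, v * v, u, v, 1.0])

# Noise-free points on the ellipse (u/2)^2 + v^2 = 1.
t = np.linspace(0.0, 2 * np.pi, 40, endpoint=False)
X = np.array([ellipse_design_row(2 * np.cos(a), np.sin(a)) for a in t])

_, _, Vt = np.linalg.svd(X)
e = Vt[-1]            # singular vector of the smallest singular value
e = e / e[0]          # fix the scale for comparison
print(np.round(e, 6))  # proportional to [1, 0, 4, 0, 0, -4]
```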
4.3.3 Perspective-n-Point (PnP)
The goal of this problem is to determine the absolute pose (rotation and translation) of a calibrated camera, given known 3D points and corresponding 2D image points. For this task, we again use a deep network with the same architecture as that in . Here, however, the network takes as input correspondences between 3D and 2D points and outputs an $n$-dimensional vector of weights, one for each correspondence.
Mathematically, we can denote the input correspondences as
$$\mathbf{x}_i = \left[X_i,\; Y_i,\; Z_i,\; u_i,\; v_i\right]^\top,$$
where $(X_i, Y_i, Z_i)$ are the coordinates of a 3D point, and $(u_i, v_i)$ denote the corresponding image location. According to the perspective projection model, we have
$$s_i \left[u_i,\; v_i,\; 1\right]^\top = \mathbf{P} \left[X_i,\; Y_i,\; Z_i,\; 1\right]^\top,$$
where $s_i$ is a scale factor and $\mathbf{P}$ the $3 \times 4$ projection matrix. Following the Direct Linear Transform (DLT) method , the transformation parameters can be obtained as the zero-valued singular vector $\mathbf{e} = \left[\mathbf{P}^{(1)},\, \mathbf{P}^{(2)},\, \mathbf{P}^{(3)}\right]^\top$ of a data matrix $\mathbf{X}$, every two rows of which are computed from one correspondence as
$$\begin{bmatrix} X_i & Y_i & Z_i & 1 & 0 & 0 & 0 & 0 & -u_i X_i & -u_i Y_i & -u_i Z_i & -u_i \\ 0 & 0 & 0 & 0 & X_i & Y_i & Z_i & 1 & -v_i X_i & -v_i Y_i & -v_i Z_i & -v_i \end{bmatrix},$$
where $\mathbf{P}^{(j)}$ denotes row $j$ of $\mathbf{P}$. We can then directly use this data matrix in Eq. 16. As suggested by  and also done in , we use the normalized coordinate system for the 2D coordinates. We assume that the correspondences for this setup are given, as in .
Note that the characteristics of the rotation matrix, that is, orthogonality and determinant 1, are not preserved by the DLT solution. Therefore, to make the result a valid rotation matrix, we refine the DLT results by the generalized Procrustes algorithm [7, 39], which is a common post-processing technique for PnP algorithms. Note that this step is not involved during training, but only in the validation process to select the best model and at test time.
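The two DLT rows per correspondence can be sketched as follows. The synthetic check below is our own example (with the intrinsics set to the identity and an arbitrary pose): it projects points with a known P and recovers it, up to scale and sign, from the smallest singular vector.

```python
import numpy as np

def pnp_rows(Xw, u, v):
    """Two DLT rows for one 3D-2D correspondence; Xw is the 3D point."""
    X, Y, Z = Xw
    return np.array([
        [X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u],
        [0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v],
    ])

rng = np.random.default_rng(4)
P = np.hstack([np.eye(3), [[0.1], [0.2], [2.0]]])   # arbitrary pose, K = I
pts = rng.uniform(-1, 1, (8, 3)) + [0, 0, 5]        # points in front of camera

rows = []
for Xw in pts:
    s = P @ np.append(Xw, 1.0)                      # project with the known P
    rows.append(pnp_rows(Xw, s[0] / s[2], s[1] / s[2]))
M = np.vstack(rows)

p = np.linalg.svd(M)[2][-1]                         # zero-valued singular vector
P_est = p.reshape(3, 4)
P_est *= np.sign(P_est[2, 3]) * np.linalg.norm(P) / np.linalg.norm(P_est)
print(np.abs(P_est - P).max())  # close to zero
```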
4.3.4 Essential Matrix Estimation
This task is the same as that studied in , that is, estimating camera motion from 2D-to-2D correspondences. A critical downside of  is the use of the DLT to regress to the camera pose, which is solved via SVD/ED. This method therefore suffers from the switching problem discussed in Section 2. By contrast, we convert the DLT problem to a cost that directly reflects the original task.
To isolate the effect of the loss function only, we follow the same setup as in . Specifically, we use the same network architecture as in , which takes correspondences between 2D points in the two images as input and outputs an $n$-dimensional vector of weights, that is, one weight for each correspondence. For the same purpose, we also use the same strategy for creating correspondences—a simple nearest neighbor matching from one image to the other.
Let $\mathbf{x}_i = \left[u_i,\; v_i,\; u_i',\; v_i'\right]^\top$ encode the coordinates of correspondence $i$ in the two images. Following the 8-point algorithm , we construct a matrix $\mathbf{X}$, each row of which is computed from one correspondence vector as
$$\left[u_i' u_i,\; u_i' v_i,\; u_i',\; v_i' u_i,\; v_i' v_i,\; v_i',\; u_i,\; v_i,\; 1\right].$$
Note that, as suggested by  and done in , we normalize the 2D coordinates using the camera intrinsics before giving them to the network. Furthermore, when calculating the loss, as suggested by , we move the centroid of the reference points to the origin of the coordinate system and scale the points so that their RMS distance to the origin is equal to $\sqrt{2}$. This means that we also have to scale and translate $\mathbf{e}$ accordingly.
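The row construction and the normalization can be sketched as follows (our own illustration; `epipolar_row` and `hartley_normalize` are hypothetical helper names, not from the paper's code).

```python
import numpy as np

def epipolar_row(u, v, up, vp):
    """Row of X for one correspondence (u, v) <-> (u', v'), so that
    row . vec(E) = [u', v', 1] E [u, v, 1]^T = 0 (row-major vec)."""
    return np.array([up * u, up * v, up, vp * u, vp * v, vp, u, v, 1.0])

def hartley_normalize(pts):
    """Center the points and scale them to RMS distance sqrt(2)."""
    centroid = pts.mean(0)
    centered = pts - centroid
    rms = np.sqrt((centered ** 2).sum(1).mean())
    s = np.sqrt(2.0) / rms
    T = np.array([[s, 0, -s * centroid[0]],
                  [0, s, -s * centroid[1]],
                  [0, 0, 1.0]])
    return centered * s, T

pts = np.random.default_rng(5).uniform(0, 100, (20, 2))
normed, T = hartley_normalize(pts)
print(np.sqrt((normed ** 2).sum(1).mean()))  # sqrt(2) ~ 1.414
```

Returning the similarity transform `T` alongside the normalized points allows the estimated matrix (or the ground-truth vector) to be mapped back to the original coordinate frame.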
Here, our formulation originates from the DLT. As a result, our loss minimizes the algebraic distance, which could be suboptimal. While there exist other formulations for solving for the essential matrix, such as the gold-standard method that minimizes the geometric distance , these methods rely on iterative minimization. Including such an iterative process in a loss function can be achieved by solving a bi-level optimization problem, as discussed in . However, this requires second-order derivatives, and thus may become expensive in the context of deep networks, due to the large number of parameters. Alternatively, one can unroll a fixed number of iterations, but this number needs to be determined and this requires an initialization. Our method can be thought of as a way to provide such an initialization. We leave the task of incorporating a nonlinear refinement in the training process given our initialization as future work, and focus on the DLT case, which allows us to demonstrate the benefits of our approach over direct eigenvalue minimization. Note that, as an alternative, other loss functions can be used, such as a classification loss on the keypoint weights  or the residual error using the symmetric epipolar distance . These methods, however, require defining a threshold, whose precise value can have a significant impact on the results.
4.4 Handling Noisy Measurements
Besides outliers, real-world measurements are typically affected by noise, which deteriorates the quality of the estimate of interest. Here, we show that our framework also allows us to address denoising problems in addition to outlier rejection, by making the network estimate a refinement of the input measurements. In other words, while outlier rejection can be expressed as estimating a multiplicative factor, denoising translates to predicting an additive one.
To remove noise from the input data, we propose to train our networks to output, in addition to the weights $\mathbf{w}$, a denoising term $\Delta\mathbf{X}$ that has the same dimension as the input data. Thus, instead of directly using $\mathbf{X}$, we create the data matrix with the denoised data, $\tilde{\mathbf{X}} = \mathbf{X} + \Delta\mathbf{X}$. However, directly using the denoised data matrix $\tilde{\mathbf{X}}$ instead of $\mathbf{X}$ in Eq. 16 is problematic since, with the denoising term, one could add arbitrary displacements in the directions corresponding to the largest eigenvalues to reach very large values. To overcome this, we rely on the assumption that the displacements should be small, which can be encoded by the constraint $\|\Delta\mathbf{X}\| \le \epsilon$ for a given threshold $\epsilon$ corresponding to the noise level. To avoid using hard constraints, which are difficult to handle with deep networks, we use soft constraints, which can be thought of as the Lagrangian version of the hard ones. Altogether, we thus define our loss as
$$\mathcal{L}_{\mathrm{denoise}}(\boldsymbol{\theta}) = \mathbf{e}^\top \tilde{\mathbf{A}}^\top \tilde{\mathbf{A}}\,\mathbf{e} + \alpha\, e^{-\beta\,\operatorname{tr}\left(\bar{\mathbf{A}}^\top\bar{\mathbf{A}}\right)} + \gamma\, \|\Delta\mathbf{X}\|_F^2,$$
where $\mathbf{e}$ is the ground-truth eigenvector, $\tilde{\mathbf{A}} = \tilde{\mathbf{X}}^\top \mathbf{W}\,\tilde{\mathbf{X}}$, and $\bar{\mathbf{A}}$ is expressed in terms of the original measurements, not the denoised ones. We found this to be more stable than using $\tilde{\mathbf{A}}\left(\mathbf{I} - \mathbf{e}\mathbf{e}^\top\right)$, because the latter would still favor generating large displacements along the directions orthogonal to $\mathbf{e}$, which contradicts the last loss term and deteriorates the distinction between inliers and outliers, as outliers can then be handled using either a zero weight or a large displacement.
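A compact sketch of this loss makes the three terms explicit. The alpha, beta, and gamma values below are illustrative assumptions rather than the paper's settings, and `denoising_loss` is our own function name.

```python
import numpy as np

def denoising_loss(X, dX, w, e, alpha=10.0, beta=1e-3, gamma=1.0):
    """Denoising variant: the fitting term uses the corrected data X + dX,
    the trace regularizer keeps the ORIGINAL X, and a soft penalty keeps
    the displacements dX small."""
    e = e / np.linalg.norm(e)
    W = np.diag(w)
    Xt = X + dX
    A_tilde = Xt.T @ W @ Xt                 # denoised data matrix
    fit = e @ A_tilde.T @ A_tilde @ e       # zero iff A_tilde e = 0
    A = X.T @ W @ X                         # original data matrix
    A_bar = A @ (np.eye(len(e)) - np.outer(e, e))
    reg = alpha * np.exp(-beta * np.trace(A_bar.T @ A_bar))
    return fit + reg + gamma * (dX ** 2).sum()
```

With noisy plane points, a displacement that exactly cancels the noise zeroes the fitting term while incurring only the small soft-constraint penalty, so it scores lower than leaving the data untouched.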
We exclude essential matrix estimation from this setup, for the following reasons. First, for wide-baseline stereo, the same level of noise can have drastically different effects on the estimation, depending on the distance to the epipoles. For example, even the slightest noise near an epipole can greatly harm the estimation. Second, and more importantly, the outlier ratio in wide-baseline stereo is much larger than in the other tasks, e.g., 91% in the cogsci2-05 sequence from the SUN3D dataset. As such, the positive and negative labels are highly unbalanced, and the negative labels dominate the denoising operation, which translates to the positive samples being removed. As a consequence, performing outlier removal alone has proven more robust than performing it jointly with denoising, particularly because our network filters the matches so that the inliers attain a large majority, and the final RANSAC step then further increases robustness.
[Figure: (a) Rotation and (b) translation errors for a varying number of outliers, with additive Gaussian noise with a standard deviation of 20 pixels on the 2D coordinates. (c) Rotation and (d) translation errors for a varying amount of additive noise on the 2D coordinates, with 50 outliers. Our method gives the best results in all cases.]
5 Experiments
We now present our results for the four tasks discussed above, that is, plane fitting as in Section 2, ellipse fitting, solving the PnP problem, and distinguishing good keypoint correspondences from bad ones. We rely on a TensorFlow implementation using the Adam  optimizer, with a learning rate of , unless stated otherwise, and default parameters. When training a network, for ellipse fitting, keypoint matching, and PnP, we used mini-batches of 32 samples. We used the same network as in  for all three tasks. The network takes observations as input, which can be either 2D locations for ellipse fitting, 3D-to-2D correspondences for PnP, or 2D-to-2D correspondences for wide-baseline stereo, and produces an
$n$-dimensional array of weights, one for each observation. The weights indicate the likelihood of each observation being an inlier. The observations are processed independently by weight-sharing perceptrons. The network consists of 12 residual blocks. In each residual block, Context Normalization is applied before batch normalization and ReLU to aggregate the global information. The code to reproduce our experiments is available at https://github.com/Dangzheng/Eig-Free-release.
5.1 Plane Fitting
We first evaluate the effectiveness of the proposed method in a simple toy setup, where the task is to remove outliers. The setup is the one discussed in Section 2. We randomly sampled 100 3D points on a plane. Specifically, we uniformly sampled the $x$ and $y$ coordinates. We then added zero-mean Gaussian noise in the $z$ dimension. We also generated outliers in a similar way, where $x$ and $y$ are uniformly sampled in the same range, and
$z$ is sampled from a Gaussian distribution with mean 50 and standard deviation 1. For the baselines that directly use the analytical gradients of SVD and ED, we take the objective function to be
$$\min\left(\|\mathbf{v} - \mathbf{e}\|^2,\; \|\mathbf{v} + \mathbf{e}\|^2\right),$$
where $\mathbf{v}$ is the minimum eigenvector of $\mathbf{X}^\top \mathbf{W}\,\mathbf{X}$ and $\mathbf{e}$ is the ground-truth noise direction ($\mathbf{e}$ in Eq. 16), which is also the plane normal, i.e., the vector $\left[0,\, 0,\, 1\right]^\top$ in this case. Note that we consider both $\mathbf{e}$ and $-\mathbf{e}$ and take the minimum distance, denoted by the $\min$ over the two signs in the loss function. For this problem, both solutions are correct due to the sign ambiguity of ED, which should be taken into account.
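This sign-invariant distance used for the baselines can be sketched in one line (our own function name); it treats e and -e as equally correct targets.

```python
import numpy as np

def sign_invariant_distance(v, e):
    """min(||v - e||^2, ||v + e||^2): eigenvectors are defined only up to
    sign, so the baseline loss must consider both e and -e."""
    return min(((v - e) ** 2).sum(), ((v + e) ** 2).sum())

e = np.array([0.0, 0.0, 1.0])
print(sign_invariant_distance(-e, e))  # 0.0, despite the flipped sign
```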
We consider two ways of computing analytical gradients, one using the SVD and the other the self-adjoint eigendecomposition (Eigh), which both yield mathematically valid solutions. To implement our approach, we rely on Eq. 16, as we are mostly interested in removing outliers for this experiment. For the learning rate, we use different values for the single-outlier case and for the multiple-outlier one.
Fig. 4(b) shows the evolution of the loss as the optimization proceeds when using Adam and vanilla gradient descent with a single outlier. Note that SVD and Eigh have exactly the same behavior because they constitute two equivalent ways of solving the same problem. Using Adam in conjunction with either one initially yields a very slow decrease in the loss function, until it suddenly drops to zero after hundreds of iterations. In the case of vanilla gradient descent, SVD and Eigh require hundreds of thousands of optimization steps to converge. By contrast, our approach produces a much more gradual decrease in the loss.
We further show in Fig. 6 how our formulation behaves compared to the SVD- and Eigh-based ones in the presence of twenty outliers. As the problem is harder, the SVD- and Eigh-based losses take more iterations to converge than in the single-outlier case. With Adam, convergence suddenly happens after 352 iterations, when the eigenvector with the smallest eigenvalue switches, as shown in Fig. 6 (d).
In Fig. 6 (b) and (c), we show the inliers detected during optimization. In a plane-fitting example, only three inliers are required to solve the problem, and thus not all inliers are used, as shown in Fig. 6 (b). However, as our formulation encourages the inclusion of inliers through the second term in Eq. (16), we are able to recover all inliers, as shown in Fig. 6 (c).
In Fig. 6 (d) and (e), we can clearly see the abrupt change in the optimization at iteration 352. In Fig. 6 (d), we plot the distance between the ground-truth and the estimated eigenvectors obtained through eigendecomposition, where the zero point for each eigenvector is placed differently for better visualization. Specifically, we plot each estimated eigenvector at the ground-truth position closest to it, so that the distance between a point and the ground-truth position below it indicates how close the two are. Also for better visualization, we downsample the points, keeping one out of every 10 iterations. One can clearly see that, until iteration 352, the eigenvectors are wrongly assigned, and that the assignment is then suddenly corrected. In Fig. 6 (e), we show the devastating effect this switch causes: the magnitude of the gradient increases drastically, which is potentially harmful for optimization. Specifically, at this point, most of the inliers have been removed by SVD and Eigh, and thus the two smallest eigenvalues are close to zero. Therefore, because of Eq. 2, the gradients remain very large even after the switch happens.
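This blow-up can be reproduced in isolation. The analytical gradient of an eigenvector contains factors inversely proportional to differences of eigenvalues (cf. Eq. 2); the following sketch, with an illustrative helper of our own naming, shows how this factor explodes when the two smallest eigenvalues of a symmetric matrix come close:

```python
import numpy as np

def eigvec_grad_scale(A):
    """Magnitude of the 1 / (lambda_1 - lambda_0) factor that enters the
    analytical gradient of the smallest eigenvector of a symmetric
    matrix A: the gradient blows up when the two smallest eigenvalues
    are nearly equal."""
    w = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
    return 1.0 / np.abs(w[1] - w[0])

# Well-separated eigenvalues: moderate gradient scale.
well_sep = eigvec_grad_scale(np.diag([0.0, 1.0, 2.0]))   # = 1.0
# Near-degenerate smallest eigenvalues: the scale explodes.
near_deg = eigvec_grad_scale(np.diag([0.0, 1e-6, 2.0]))  # ~= 1e6
```

This is exactly the situation after the switch: with most inliers removed, the two smallest eigenvalues are both close to zero, and the denominator above nearly vanishes.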
While in this simple plane fitting example the SVD- or Eigh-based methods converge, in more complex cases, this is not always true, as we will show next with different applications.
5.2 Ellipse Fitting
To evaluate our approach on the task of ellipse fitting, we built a synthetic dataset. We generated ten thousand ellipses with random centers, orientations, and long and short axes, sampled so that each ellipse fits within the square whose top, right, bottom, and left extents are 1, 1, -1, and -1, respectively. Working in this confined range allows us to create ellipses that are distinguishable from noise and outliers with the human eye, and yet exhibit enough variability to capture diverse types of ellipses. While this setup is constrained to a given scale range, in practice, ellipses at other scales can easily be dealt with by normalizing the data and denormalizing the results.
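As a sketch of this data generation, one could proceed as follows; the sampling ranges below are illustrative placeholders, not the exact values used in our experiments:

```python
import numpy as np

def sample_ellipse(n_points=200, rng=None):
    """Sample points on a random ellipse that fits inside the
    [-1, 1] x [-1, 1] square. The sampling ranges are illustrative."""
    if rng is None:
        rng = np.random.default_rng(0)
    cx, cy = rng.uniform(-0.3, 0.3, size=2)   # center
    theta = rng.uniform(0.0, np.pi)           # orientation
    a = rng.uniform(0.3, 0.6)                 # long axis
    b = rng.uniform(0.1, a)                   # short axis
    t = rng.uniform(0.0, 2.0 * np.pi, n_points)
    # Points on the axis-aligned ellipse, then rotate and translate.
    x, y = a * np.cos(t), b * np.sin(t)
    c, s = np.cos(theta), np.sin(theta)
    return np.stack([cx + c * x - s * y, cy + s * x + c * y], axis=1)
```

With these ranges, every sampled point lies within 0.3 + 0.6 = 0.9 of the origin, so the ellipse is guaranteed to stay inside the unit square.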
We compare our method against classical ellipse fitting techniques. In particular, we evaluate the robust version of the ellipse-specific method (Spec), Least-Squares based on Orthogonal distance (LSO), Least-Squares based on Algebraic distance (LSA), gradient Weighted Least-Squares based on Algebraic distance (WLSA), ellipse fitting with an M-Estimator using the Cauchy loss function, and ellipse fitting with Least Median of Squares (LMedS). For each method, we report the error of the estimated center point.
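The LSA baseline is itself an instance of the zero-eigenvalue problems this paper studies: the conic coefficients are the eigenvector of a scatter matrix associated with its smallest eigenvalue. A minimal sketch, with a helper name of our own choosing:

```python
import numpy as np

def fit_conic_lsa(points):
    """Least-squares algebraic conic fit: the coefficients
    q = (A, B, C, D, E, F) minimizing ||M q|| subject to ||q|| = 1 are
    the eigenvector of M^T M with the smallest eigenvalue."""
    x, y = points[:, 0], points[:, 1]
    # Each row of M evaluates A x^2 + B x y + C y^2 + D x + E y + F.
    M = np.stack([x**2, x * y, y**2, x, y, np.ones_like(x)], axis=1)
    _, vecs = np.linalg.eigh(M.T @ M)
    return vecs[:, 0]  # eigenvector of the smallest eigenvalue
```

For points on the unit circle, for instance, the recovered coefficients are proportional to (1, 0, 1, 0, 0, -1), up to the usual sign ambiguity.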
5.2.1 Outlier Removal
For this experiment, we incorporate outliers into our dataset by adding random points within the unit rectangle around the origin. Specifically, we randomly select a subset of the 200 2D points and convert them to outliers. We further add Gaussian noise to the inlier points. We rely on Eq. (16) with empirically set hyper-parameters, as we would like to evaluate outlier removal only.
5.2.2 Denoising
To test the performance of our method on denoising, we add Gaussian noise to the points but do not include any outliers, varying the noise level starting from zero. Note that, in this pure denoising scenario, we do not predict a weight matrix, and the outputs of the network are the displacements only. Since we penalize overly large displacements, our formulation does not suffer from the trivial solution where the matrix output by the network goes to zero. There is thus no need to encourage the projections in the direction orthogonal to the eigenvector of interest to remain large, so we disable the corresponding terms in Eq. 25.
As in the outlier rejection case, we report the center point error. We run each experiment 100 times and report the average. The results are summarized in Fig. 8 (b). We also show a qualitative example in Fig. 7 (b). Our method gives the most robust results with respect to the additive Gaussian noise.
5.2.3 Simultaneous Outlier Removal and Denoising
Finally, we combine both sources of perturbation: we randomly choose up to 150 outliers among the 200 points, while also adding Gaussian noise. The hyper-parameters of the loss function in Eq. 25 are set empirically.
During testing, we fix either the number of outliers or the degree of noise and vary the other. Specifically, to show robustness against noise in the presence of outliers, we fix the number of outliers to 10 and vary the standard deviation of the Gaussian noise. To show robustness to outliers, we fix the Gaussian noise level and vary the number of outliers. The results are summarized in Fig. 8 (c) and (d), where we can clearly see that we outperform the baselines.
5.3 The Perspective-n-Point (PnP) Problem
Let us now turn to the PnP problem. For this application, as discussed below, we perform either outlier removal, or denoising, or both tasks jointly.
5.3.1 Outlier Removal on Synthetic Data
Following standard practice for evaluating PnP algorithms [4, 5], we generate a synthetic dataset composed of 3D-to-2D correspondences with noise and outliers. Each training example comprises two hundred 3D points, and we set the ground-truth translation of the camera pose to be their centroid. We then create a random ground-truth rotation and project the 3D points onto the image plane of our virtual camera. As in REPPnP, we apply Gaussian noise to these projections. We generate random outliers by assigning 3D points to arbitrary valid 2D image positions.
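A minimal sketch of this data generation might look as follows; the focal length, noise level, and sampling ranges are illustrative, and for simplicity we use a fixed translation that keeps all points in front of the camera rather than the point centroid:

```python
import numpy as np

def make_pnp_sample(n=200, n_outliers=50, noise_std=2.0,
                    f=800.0, rng=None):
    """Generate noisy 3D-to-2D correspondences with outliers for a
    virtual pinhole camera. All numeric values are placeholders."""
    if rng is None:
        rng = np.random.default_rng(0)
    pts3d = rng.uniform(-1.0, 1.0, size=(n, 3))
    # Random ground-truth rotation via QR of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    R = Q * np.sign(np.linalg.det(Q))   # proper rotation, det = +1
    t = np.array([0.0, 0.0, 6.0])       # all depths stay positive
    cam = pts3d @ R.T + t               # points in the camera frame
    uv = f * cam[:, :2] / cam[:, 2:3]   # pinhole projection
    uv += rng.normal(scale=noise_std, size=uv.shape)  # image noise
    # Turn the first n_outliers matches into outliers: random 2D points.
    uv[:n_outliers] = rng.uniform(-f, f, size=(n_outliers, 2))
    return pts3d, uv, R, t
```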
To focus on outlier removal only, we again first use the formulation in Eq. 16, whose hyper-parameters we empirically found to work well for this task. During training, we randomly select between 0 and 150 of the two hundred matches and turn them into outliers. In other words, the two hundred training matches contain a random number of outliers that our network learns to filter out.
We compare our method against modern PnP methods, namely EPnP, OPnP, PPnP, RPnP and REPPnP. We also evaluate the DLT, since our loss formulation is based on it. Among these methods, REPPnP is the one most specifically designed to handle outliers. As in the keypoint matching case, we tried to compute the results of a network relying explicitly on eigendecomposition and minimizing the norm of the difference between the ground-truth eigenvector and the predicted one. However, we found that such a network was unable to converge. We also report the performance of two commonly-used baselines that leverage RANSAC, namely P3P+RANSAC and EPnP+RANSAC. For the other methods, RANSAC did not bring noticeable improvements, so we omitted them from the graph for better visual clarity. For all baselines, we used the default parameters provided with their implementations, which were tuned on equivalent synthetic datasets. For the RANSAC-based ones, we used the default RANSAC parameters from OpenCV. Note that our method could also make use of RANSAC, but we use it on its own here. While tuning the RANSAC parameters would affect the final results, all the methods, including ours if we were to use RANSAC, would benefit from it, which would leave our conclusions unchanged.
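Since our loss formulation is based on the DLT, it is worth recalling that the DLT is itself a smallest-singular-vector problem: each correspondence contributes two linear equations in the twelve entries of the projection matrix. A minimal sketch, assuming noise-free correspondences and uncalibrated coordinates:

```python
import numpy as np

def dlt_projection_matrix(pts3d, uv):
    """Direct Linear Transform: recover the 3x4 projection matrix P
    from n >= 6 correspondences as the right singular vector of a
    2n x 12 design matrix associated with its smallest singular value."""
    rows = []
    for (X, Y, Z), (u, v) in zip(pts3d, uv):
        p = [X, Y, Z, 1.0]
        # u = (P1 . X) / (P3 . X) and v = (P2 . X) / (P3 . X) rewritten
        # as two homogeneous linear equations in the entries of P.
        rows.append(p + [0.0] * 4 + [-u * c for c in p])
        rows.append([0.0] * 4 + p + [-v * c for c in p])
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)  # defined up to scale (and sign)
```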
For this comparison, we use standard rotation and translation error metrics. To demonstrate the effect of outliers at test time, we fix the number of matches to 200 and vary the number of outliers. We run each experiment 100 times and report the average.
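One common convention for these metrics, which may differ in detail from the exact definition we use, can be sketched as:

```python
import numpy as np

def pose_errors(R_gt, t_gt, R_est, t_est):
    """Rotation error as the angle of the relative rotation, and
    translation error as a relative norm (one common convention)."""
    # For a rotation matrix, trace(R_gt^T R_est) = 1 + 2 cos(angle).
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    trans_err = np.linalg.norm(t_gt - t_est) / np.linalg.norm(t_gt)
    return rot_err_deg, trans_err
```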
Fig. 9 summarizes the results. We outperform all other methods significantly, especially when the number of outliers increases. REPPnP is the one competing method that seems least affected by the outliers. As long as the number of outliers is small, it is on a par with us. However, past a certain point, when there are more than 40 outliers, that is, 20% of the total number of correspondences, its performance, particularly in terms of rotation error, degrades quickly, whereas ours does not.
5.3.2 Outlier Removal on Real Data
We then evaluate our PnP outlier removal approach on a real dataset. Specifically, the 3D points in this dataset were obtained using an SfM algorithm, which also provides a rotation matrix and translation vector for each image. We treat these rotations and translations as ground truth to compare the different PnP algorithms. Given a pair of images, we extract SIFT features at the reprojections of the 3D points in one image and match them to SIFT keypoints detected in the other image. This procedure produces erroneous correspondences, which a robust PnP algorithm should discard. In this experiment, we used the model trained on the synthetic data described above. Note that we apply the model without any fine-tuning, that is, the model was trained on purely synthetic data. We observed that, except for EPnP+RANSAC, OPnP and P3P+RANSAC, the predictions of the baselines are far from the ground truth, which leads to points reprojecting outside the image.
In Fig. 11 (b), we provide quantitative results on this real dataset, averaged over 1000 images. In Fig. 2, we compare the reprojections of the 3D points in the input image after applying the rotation and translation obtained with our model, OPnP, and EPnP+RANSAC. Note our better accuracy, both quantitative and qualitative.
5.3.3 Denoising on Synthetic Data
We further evaluate the performance of our method on the denoising task for PnP. We keep the same experimental setup as for outlier removal, except that we no longer have outliers, but instead additive Gaussian noise on the 2D coordinates. We apply zero-mean Gaussian noise of varying standard deviation. As in the ellipse-fitting case, to focus on denoising only, the network outputs only the displacements, not the weights, and we disable the corresponding terms in Eq. 25, setting the remaining hyper-parameter empirically. As for outlier rejection, we run each experiment 100 times and report the average.
We summarize the results of this experiment in Fig. 10. When the noise level is small, all compared methods give reasonable results. However, as the level of noise increases, our method remains accurate, whereas the others start failing.
5.3.4 Simultaneous Outlier Removal and Denoising
For the loss function in Eq. 25, we set the hyper-parameters empirically. We follow two different evaluation scenarios to demonstrate performance in terms of outlier rejection and denoising. To this end, we vary either the number of outliers or the degree of noise, while keeping the other fixed at a moderate value. Specifically, to demonstrate the effect of outliers at test time, we fix the image noise to 20 pixels and vary the number of outliers from 10 to 150, out of 200 matches. To evaluate the effect of image noise, we fix the number of outliers to 50, out of 200 matches, and vary the image noise from 5 to 50 pixels. We run each experiment 100 times and report the average.
Fig. 11 summarizes the results of these experiments. In Fig. 11 (a) and (b), we show the results for a varying number of outliers. Note that, even in the presence of noise, our method outperforms all others significantly. This is also true when varying the noise level, as evidenced by Fig. 11 (c) and (d).
5.4 Wide-Baseline Stereo
To evaluate our method on a real-world problem, we use the SUN3D dataset. For a fair comparison, we trained our network on the same data as our previous work, that is, the “brown-bm-3-05” sequence, and evaluate it on the test sequences used in [33, 9]. Additionally, to show that our method is not overfitting, we also test on a completely different dataset, namely the “fountain-P11” and “Herz-Jesus-P8” sequences of Strecha et al.
We follow the evaluation protocol of our previous work, which constitutes the state of the art in keypoint matching, and only change the loss function to our own loss of Eq. 16, with hyper-parameters that we empirically found to work well for 2D-to-2D keypoint matching. We compare our method against our previous work, both in its original implementation, which involves minimizing a classification loss first, and without that initial step, which we denote as “Essential_Only”. The latter is designed to show how critical the initial classification-based minimization is. In addition, we also compare against standard RANSAC, LMedS, MLESAC, and GMS to provide additional reference points. We report the performance metric of our previous work, referred to as mean Average Precision (mAP). This metric is computed by plotting the ratio of accurately recovered poses as a function of a maximum error threshold and taking the area under the resulting curve. We tuned the parameters of the baselines so as to obtain the best results. For RANSAC, we used a threshold of 0.01.
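The mAP computation described above can be sketched as follows; the maximum threshold and step size are illustrative, not the exact values used in the evaluation:

```python
import numpy as np

def pose_mAP(errors, max_threshold=20.0, step=1.0):
    """mean Average Precision: the ratio of poses recovered below each
    error threshold, averaged over thresholds up to a maximum, i.e.,
    the normalized area under the accuracy-vs-threshold curve."""
    errors = np.asarray(errors, dtype=float)
    thresholds = np.arange(step, max_threshold + step, step)
    # Accuracy at each threshold: fraction of poses with smaller error.
    accs = [(errors <= th).mean() for th in thresholds]
    return float(np.mean(accs))
```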
We summarize the results in Figs. 3 and 13. Our approach performs on par with the state-of-the-art method for keypoint matching and outperforms all the other baselines, without the need for any pre-training. Importantly, “Essential_Only” severely underperforms and often fails completely. In short, instead of having to find a workaround to the eigenvector switching problem, we can directly optimize our objective function, which is far more generally applicable. As discussed in Section 2, the workaround converges to a solution that depends on a pre-defined inlier threshold, set heuristically by the user, and is not applicable to other problems involving additive noise. By contrast, our method can simply discover the inliers automatically while training, thanks to the second term in Eq. 7, and is also applicable to the removal of additive noise.
In the top row of Fig. 3, we compare the correspondences classified as inliers by our method to those of RANSAC on image pairs from the two test datasets. Note that even the correspondences that are misclassified as inliers by our approach are very close to being inliers. By contrast, RANSAC yields much larger errors. In the middle row of Fig. 3, we show the epipolar geometry of the outliers, depicted as red lines. These lines were obtained using the estimated essential matrix in the top figure and the ground-truth one in the bottom one. Note that, while the outliers may seem senseless to the human eye, they lie very close to the estimated epipolar lines, and sometimes even to the ground-truth ones. This is mainly because, with wide-baseline stereo, the depth of each point cannot be recovered.
6 Conclusion
We have introduced a novel approach to training deep networks that rely on losses computed from an eigenvector corresponding to a zero eigenvalue of a matrix defined by the network’s output. Our loss does not suffer from the numerical instabilities of the analytical differentiation of eigendecomposition, and converges to the correct solution much faster. We have demonstrated the effectiveness of our method on two main tasks, outlier removal and denoising, with applications to keypoint matching, PnP and ellipse fitting. In all cases, our new loss has allowed us to achieve state-of-the-art results.
In the future, we will investigate extensions of our approach to handle non-zero eigenvalues, in particular the case of the maximum eigenvalue. Furthermore, we hope that our work will contribute to imbuing Deep Learning techniques with traditional Computer Vision knowledge, thus avoiding discarding decades of valuable research, and favoring the development of more principled frameworks.
References
-  Z. Zhang, “Parameter Estimation Techniques: A Tutorial with Application to Conic Fitting,” IVC, vol. 15, no. 1, pp. 59–76, 1997.
-  K. Kanatani, Y. Sugaya, and Y. Kanazawa, “Ellipse Fitting for Computer Vision: Implementation and Applications,” Synthesis Lectures on Computer Vision, pp. 1–141, 2016.
-  R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
-  V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An Accurate O(n) Solution to the PnP Problem,” IJCV, 2009.
-  L. Ferraz, X. Binefa, and F. Moreno-Noguer, “Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection,” in CVPR, 2014, pp. 501–508.
-  Y. Zheng, Y. Kuang, S. Sugimoto, K. Åström, and M. Okutomi, “Revisiting the PnP Problem: A Fast, General and Optimal Solution,” in ICCV, 2013.
-  V. Garro, F. Crosilla, and A. Fusiello, “Solving the PnP Problem with Anisotropic Orthogonal Procrustes Analysis,” in 3DPVT, 2012, pp. 262–269.
-  S. Li, C. Xu, and M. Xie, “A Robust O(n) Solution to the Perspective-N-Point Problem,” PAMI, pp. 1444–1450, 2012.
-  K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua, “Learning to Find Good Correspondences,” in CVPR, 2018.
-  R. Ranftl and V. Koltun, “Deep Fundamental Matrix Estimation,” in ECCV, 2018.
-  T. Papadopoulo and M. Lourakis, “Estimating the jacobian of the singular value decomposition: Theory and applications,” in ECCV, 2000, pp. 554–570.
-  M. Giles, “Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation,” in Advances in Automatic Differentiation, 2008, pp. 35–44.
-  C. Ionescu, O. Vantzos, and C. Sminchisescu, “Matrix Backpropagation for Deep Networks with Structured Layers,” in CVPR, 2015.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A System for Large-Scale Machine Learning,” in USENIX Conference on Operating Systems Design and Implementation, 2016, pp. 265–283.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. Devito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic Differentiation in Pytorch,” in NIPS, 2017.
-  Z. Dang, K. M. Yi, Y. Hu, F. Wang, P. Fua, and M. Salzmann, “Eigendecomposition-Free Training of Deep Networks with Zero Eigenvalue-Based Losses,” in ECCV, 2018.
-  M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial Transformer Networks,” in NIPS, 2015, pp. 2017–2025.
-  A. Handa, M. Bloesch, V. Patraucean, S. Stent, J. McCormac, and A. Davison, “Gvnn: Neural Network Library for Geometric Computer Vision,” in ECCV, 2016.
-  I. Murray, “Differentiation of the Cholesky Decomposition,” arXiv Preprint, 2016.
-  A. Fitzgibbon, M. Pilu, and R. Fisher, “Direct Least-Squares Fitting of Ellipses,” PAMI, vol. 21, no. 5, pp. 476–480, May 1999.
-  R. Halíř and J. Flusser, “Numerically Stable Direct Least Squares Fitting of Ellipses,” in International Conference in Central Europe on Computer Graphics and Visualization, 1998.
-  P. Rousseeuw and A. Leroy, Robust Regression and Outlier Detection. Wiley, 1987.
-  L. Kneip, D. Scaramuzza, and R. Siegwart, “A Novel Parametrization of the Perspective-Three-Point Problem for a Direct Computation of Absolute Camera Position and Orientation,” in CVPR, 2011, pp. 2969–2976.
-  E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “DSAC – Differentiable RANSAC for Camera Localization,” arXiv Preprint, 2016.
-  H. Longuet-Higgins, “A Computer Algorithm for Reconstructing a Scene from Two Projections,” Nature, vol. 293, pp. 133–135, 1981.
-  R. Hartley, “In Defense of the Eight-Point Algorithm,” PAMI, vol. 19, no. 6, pp. 580–593, June 1997.
-  D. Nister, “An Efficient Solution to the Five-Point Relative Pose Problem,” in CVPR, June 2003.
-  M. Fischler and R. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography,” Communications ACM, vol. 24, no. 6, pp. 381–395, 1981.
-  P. Torr and A. Zisserman, “MLESAC: A New Robust Estimator with Application to Estimating Image Geometry,” CVIU, vol. 78, pp. 138–156, 2000.
-  J. Bian, W. Lin, Y. Matsushita, S. Yeung, T. Nguyen, and M. Cheng, “GMS: Grid-Based Motion Statistics for Fast, Ultra-Robust Feature Correspondence,” in CVPR, 2017.
-  R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J.-M. Frahm, “USAC: A Universal Framework for Random Sample Consensus,” PAMI, vol. 35, no. 8, pp. 2022–2038, 2013.
-  A. R. Zamir, T. Wekel, P. Agrawal, J. Malik, and S. Savarese, “Generic 3D Representation via Pose Estimation and Matching,” in ECCV, 2016.
-  B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox, “Demon: Depth and Motion Network for Learning Monocular Stereo,” in CVPR, 2017.
-  Z. Huang and L. V. Gool, “A Riemannian Network for SPD Matrix Learning,” in AAAI, 2017, p. 6.
-  Z. Huang, C. Wan, T. Probst, and L. V. Gool, “Deep Learning on Lie Groups for Skeleton-Based Action Recognition,” in CVPR, 2017, pp. 6099–6108.
-  M. Law, R. Urtasun, and R. S. Zemel, “Deep Spectral Clustering Learning,” in ICML, 2017, pp. 1985–1994.
-  C. Strecha, W. Hansen, L. Van Gool, P. Fua, and U. Thoennessen, “On Benchmarking Camera Calibration and Multi-View Stereo for High Resolution Imagery,” in CVPR, 2008.
-  D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimisation,” in ICLR, 2015.
-  P. Schönemann, “A Generalized Solution of the Orthogonal Procrustes Problem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
-  Z. Zhang, “Determining the Epipolar Geometry and Its Uncertainty: A Review,” IJCV, vol. 27, no. 2, pp. 161–195, 1998.
-  S. Gould, B. Fernando, A. Cherian, P. Anderson, R. S. Cruz, and E. Guo, “On differentiating parameterized argmin and argmax problems with application to bi-level optimization,” arXiv preprint arXiv:1607.05447, 2016.
-  A. Crivellaro, M. Rad, Y. Verdie, K. M. Yi, P. Fua, and V. Lepetit, “Robust 3D Object Tracking from Monocular Images Using Stable Parts,” PAMI, 2018.
-  J. Heinly, J. Schoenberger, E. Dunn, and J.-M. Frahm, “Reconstructing the World in Six Days,” in CVPR, 2015.
-  C. Wu, “Towards Linear-Time Incremental Structure from Motion,” in 3DV, 2013.
-  J. Xiao, A. Owens, and A. Torralba, “SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels,” in ICCV, 2013.
-  H. Cantzler, “Random Sample Consensus (RANSAC),” CVonline, 2005.
-  D. Simpson, “Introduction to Rousseeuw (1984) Least Median of Squares Regression,” in Breakthroughs in Statistics. Springer, 1997, pp. 433–461.