Eigendecomposition-free Training of Deep Networks with Zero Eigenvalue-based Losses

03/21/2018 ∙ by Zheng Dang, et al. ∙ EPFL, Xi'an Jiaotong University, University of Victoria

Many classical Computer Vision problems, such as essential matrix computation and pose estimation from 3D to 2D correspondences, can be solved by finding the eigenvector corresponding to the smallest, or zero, eigenvalue of a matrix representing a linear system. Incorporating this in deep learning frameworks would allow us to explicitly encode known notions of geometry, instead of having the network implicitly learn them from data. However, performing eigendecomposition within a network requires the ability to differentiate this operation. Unfortunately, while theoretically doable, this introduces numerical instability in the optimization process in practice. In this paper, we introduce an eigendecomposition-free approach to training a deep network whose loss depends on the eigenvector corresponding to a zero eigenvalue of a matrix predicted by the network. We demonstrate on several tasks, including keypoint matching and 3D pose estimation, that our approach is much more robust than explicit differentiation of the eigendecomposition. It has better convergence properties and yields state-of-the-art results on both tasks.

1 Introduction

In traditional Computer Vision, many tasks can be solved by finding the singular- or eigen-vector corresponding to the smallest, often zero, singular- or eigen-value of the matrix encoding a linear system. Examples include estimating essential matrices or homographies from matched keypoints and computing pose from 3D to 2D correspondences.

In the era of Deep Learning, there is growing interest in embedding these methods within a deep architecture to allow end-to-end training. For example, it has recently been shown that such an approach can be used to train networks to detect and match keypoints in image pairs while accounting for the global consistency of the correspondences [1]. More generally, this approach would allow us to explicitly encode notions of geometry within deep networks, thus sparing the network the need to re-learn what has been known for decades and making it possible to learn from smaller amounts of training data.

One way to implement this approach is to design a network whose output defines a matrix and train it so that the smallest singular- or eigen-vector of each matrix it produces is as close as possible to the ground-truth one. This is the strategy used in [1] to simultaneously establish correspondences and compute the corresponding Essential Matrix: the network's outputs are weights discriminating inlier correspondences from outliers and are used to assemble an auxiliary matrix whose smallest eigenvector is the sought-for Essential Matrix.

The main obstacle to implementing this approach is that it requires being able to differentiate the singular value decomposition (SVD) or the eigendecomposition (ED) in a stable manner to train the network, a non-trivial problem that has already received considerable attention [2, 3, 4]. As a result, these decompositions are already part of standard Deep Learning frameworks, such as TensorFlow [5] or PyTorch [6]. However, they ignore two key practical issues. First, when optimizing with respect to the matrix itself or with respect to parameters defining it, the vector corresponding to the smallest singular value or eigenvalue may switch abruptly as the relative magnitudes of these values change, which is essentially non-differentiable. This is illustrated in the example of Fig. 1, discussed in detail in Section 2. Second, computing the gradient requires dividing by the difference between two singular values or eigenvalues, which could be zero. While a solution to the latter was proposed in [2], the former is unavoidable.

Figure 1: Eigenvector switching. (a) 3D points lying on a plane in black and distant outlier in red. (b) When the weights assigned to all the points are one, the eigenvector corresponding to the smallest eigenvalue is the vector shown in blue in (a), which appears on the right in the top portion of (b), where we sort the eigenvectors by decreasing eigenvalue. As the optimization progresses and the weight assigned to the outlier decreases, the eigenvector corresponding to the smallest eigenvalue switches to the vector shown in green in (a), which introduces a sharp change in the gradient values.

In this paper, we therefore introduce an approach to training a deep network whose loss depends on the eigenvector corresponding to a zero eigenvalue of a matrix, which is either the output of the network or a function of it, without explicitly performing an SVD or ED. Our loss is fully differentiable, does not suffer from the instabilities the above-mentioned problems can cause, and can be naturally incorporated in a deep learning architecture. In practice, because image measurements are never perfect, the eigenvalue is never strictly zero. This, however, does not affect the computation either, which makes our approach robust to noise.

To demonstrate this in a Deep Learning context, we evaluate our approach on the tasks of training a network to find globally-consistent keypoint correspondences using the essential matrix and training another to remove outliers for pose estimation when solving the Perspective-n-Point (PnP) problem. In both cases, our approach delivers state-of-the-art results, whereas using the standard implementation of singular- and eigen-value decomposition provided in TensorFlow results in either the learning procedure not converging or in significantly worse performance.

2 Motivation

To illustrate the problems associated with differentiating eigenvectors and eigenvalues, consider the outlier rejection toy example depicted in Fig. 1. The inputs are 3D points lying on a plane, drawn in black, and an outlier 3D point, shown in red, which we assume to be very far from the plane. Suppose we want to assign a binary weight to each point (1 for inliers, 0 for outliers) such that the eigenvector corresponding to the smallest eigenvalue of the weighted covariance matrix is close to the ground-truth one in the least-squares sense. When the weight assigned to the outlier is 0, this eigenvector is the normal to the plane, shown in green. However, if at some point during optimization, typically at initialization, we assign the weight 1 to the outlier, the plane normal will correspond to the largest eigenvalue instead of the smallest, and the eigenvector corresponding to the smallest eigenvalue will be the vector shown in blue, which is perpendicular to the normal. As a result, if we initially set all weights to 1 and optimize them so that the smallest eigenvector approaches the plane normal, the gradient values will depend on the coordinates of the blue vector. At some point during the optimization, if everything goes well, the weight assigned to the outlier will become small enough that the smallest eigenvector switches from the blue vector to the green one, which introduces a large jump in the gradient vector, whose values will now depend on the coordinates of the green vector instead of the blue one.
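The switch described above can be reproduced numerically. The following is a minimal NumPy sketch rather than the authors' code: the sampling ranges, the noise level, the outlier position, and the helper names `make_points` and `smallest_eigvec_alignment` are all assumptions chosen for illustration.

```python
import numpy as np

def make_points(seed=0):
    """Hypothetical toy data: 100 inliers near the plane z = 0 plus one distant outlier."""
    rng = np.random.default_rng(seed)
    inliers = np.column_stack([
        rng.uniform(0.0, 1.0, 100),
        rng.uniform(0.0, 1.0, 100),
        rng.normal(0.0, 1e-3, 100),
    ])
    outlier = np.array([[0.5, 0.5, 50.0]])
    return np.vstack([inliers, outlier])

def smallest_eigvec_alignment(points, outlier_weight):
    """|cos| of the angle between the smallest eigenvector and the plane normal."""
    w = np.ones(len(points))
    w[-1] = outlier_weight                 # the outlier is the last point
    mean = (w @ points) / w.sum()          # weighted mean
    X = points - mean
    cov = X.T @ (w[:, None] * X)           # weighted covariance X^T W X
    _, vecs = np.linalg.eigh(cov)          # eigenvalues come back in ascending order
    normal = np.array([0.0, 0.0, 1.0])
    return abs(vecs[:, 0] @ normal)

points = make_points()
for w_out in [1.0, 1e-2, 1e-4, 0.0]:
    a = smallest_eigvec_alignment(points, w_out)
    print(f"outlier weight {w_out:g}: alignment with normal {a:.3f}")
```

With these assumed values, the alignment jumps from near zero to near one as the outlier weight shrinks, instead of changing smoothly, which is the discontinuity that a gradient-based optimizer has to cross.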

In this simple case, this kind of instability does not preclude eventual convergence. However, in more complex situations, we found that it does, as evidenced by our experiments. This problem was already noted in [1] in the context of learning keypoint correspondences. To circumvent this issue, the algorithm in [1] had to first rely on a classification loss to determine the potential inlier correspondences before incorporating the loss based on the essential matrix to impose geometric constraints, which requires eigendecomposition. This ensured that the network weights were already good enough to prevent eigenvector switching when starting to minimize the geometry-based loss.

3 Related Work

In recent years, the need to integrate geometric methods and mathematical tools into Deep Learning frameworks has led to the reformulation of a number of them in network terms. For example, [7] considers spatial transformations of image regions with CNNs. The set of such transformations is extended in [8]. In a different context, [9] derives a differentiation of the Cholesky decomposition that could be integrated in Deep Learning frameworks.

Unfortunately, the set of geometric Computer Vision problems that these methods can handle remains relatively limited. In particular, there is no widely accepted deep-learning way to solve the many geometric problems that reduce to finding the least-squares solution of linear systems. In this work, we consider two such problems: computing the essential matrix from keypoint correspondences in an image pair and estimating the 3D pose of an object from 3D-to-2D correspondences, both of which we briefly discuss below.

Estimating the Essential matrix from correspondences.

The eigenvalue-based solution to this problem has been known for decades [10, 11, 12] and remains the standard way to compute Essential matrices [13]. The real focus of research in this area has been to establish reliable keypoint correspondences and to eliminate outliers. In this context, variations of RANSAC [14], such as MLESAC [15] and Least Median of Squares (LMedS) [16], and very recently GMS [17], have become popular. For a comprehensive study of such methods, we refer the interested reader to [18]. With the emergence of Deep Learning, there has been a trend towards moving away from this decades-old knowledge and instead applying a black-box approach in which a Deep Network is trained to directly estimate the rotation and translation matrices [19, 20] without a priori geometrical knowledge. The very recent work of [1] attempts to reconcile these two opposing trends by embedding the geometric constraints into a Deep Net and has demonstrated superior performance for this task when the correspondences are hard to establish.

Estimating 3D pose from 3D-to-2D correspondences.

This is known as the Perspective-n-Point (PnP) problem. It has also been investigated for decades and is also amenable to an eigendecomposition-based solution [12], many variations of which have been proposed over the years [21, 22, 23, 24]. DSAC [25] is the only approach we know of that integrates the PnP solver into a Deep Network. As explicitly differentiating through the PnP solver is not optimization friendly, the authors apply the log trick used in the reinforcement learning literature. This amounts to using a numerical approximation of the derivative from random samples, which is not ideal, given that an analytical alternative exists. Moreover, DSAC only works for grid configurations and known scenes. By contrast, the method we propose in this work has an analytical form, with no need for stochastic sampling.

Differentiating the eigen- and singular value decomposition

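Whether computing the essential matrix, estimating 3D pose, or solving any other least-squares problem, incorporating an eigendecomposition-based solver into a deep network requires differentiating the eigendecomposition. Expressions for such derivatives have been given in [2, 3] and reformulated in terms that are compatible with back-propagation in [4]. Specifically, as shown in [4], for a matrix $\mathbf{M}$ written as $\mathbf{M} = \mathbf{V}\boldsymbol{\Sigma}\mathbf{V}^\top$, the variations of the eigenvectors with respect to the matrix, used to compute derivatives, are

$$\partial\mathbf{V} = \mathbf{V}\left(\mathbf{K}^\top \circ \left(\mathbf{V}^\top \partial\mathbf{M}\,\mathbf{V}\right)_{sym}\right), \qquad (1)$$

where $\circ$ denotes the element-wise product, $\mathbf{M}_{sym} = \frac{1}{2}(\mathbf{M}^\top + \mathbf{M})$, and

$$K_{ij} = \begin{cases} \dfrac{1}{\sigma_i - \sigma_j}, & i \neq j, \\ 0, & i = j. \end{cases} \qquad (2)$$

As can be seen from Eq. 2, if two eigenvalues are equal, that is, $\sigma_i = \sigma_j$, the denominator becomes 0, thus creating numerical instabilities. The same can be said about singular value decomposition.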
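To make the instability concrete, the sketch below builds a matrix of reciprocal eigenvalue differences and shows how its largest entry grows as two eigenvalues approach each other. This is an illustration only: the function name `k_matrix` and the sample eigenvalues are assumptions, and the code is not taken from any framework's gradient implementation.

```python
import numpy as np

def k_matrix(eigvals):
    """K[i, j] = 1 / (sigma_i - sigma_j) for i != j, and 0 on the diagonal."""
    diff = eigvals[:, None] - eigvals[None, :]
    K = np.zeros_like(diff)
    off_diag = ~np.eye(len(eigvals), dtype=bool)
    K[off_diag] = 1.0 / diff[off_diag]     # diverges as two eigenvalues get close
    return K

def largest_entry(gap):
    """Largest |K| entry for hypothetical eigenvalues (3, 1 + gap, 1)."""
    return np.abs(k_matrix(np.array([3.0, 1.0 + gap, 1.0]))).max()

for gap in [1.0, 1e-3, 1e-6]:
    print(f"eigenvalue gap {gap:g}: max |K| = {largest_entry(gap):.3g}")
```

Since the gradient of the eigenvectors is scaled by these entries, a gap that closes during training translates directly into very large gradient magnitudes.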

A solution to this was proposed in [2], and singular- and eigen-value decomposition have been used within deep networks for problems where all the singular values are used and their order is irrelevant [26, 27]. In the context of spectral clustering, the approach of [28] also proposed a solution that eliminates the need for explicit eigendecomposition. This solution, however, was dedicated to the scenario where one seeks to use all non-zero eigenvalues, assuming a matrix of constant rank.

Here, by contrast, we tackle problems where what matters is a single eigen- or singular-value. In this case, the order of the eigenvalues is important. However, this order can change during training, which results in a non-differentiable switch from one eigenvector to another, as in the toy example of Section 2. In turn, this leads to numerical instabilities, which can prevent convergence. In [1], this problem is finessed by first training the network using a classification loss that does not depend on eigenvectors. Only once a sufficiently good solution is found, that is, a solution close enough to the correct one for vector switching not to happen anymore, is the loss term that depends on the eigenvector associated to the smallest eigenvalue turned on. As we will show later, we can achieve state-of-the-art results without the need for such a heuristic, by deriving a more robust, eigendecomposition-free loss function.

4 Our Approach

We introduce an approach that enables us to work with eigenvectors corresponding to zero eigenvalues within an end-to-end learning formalism, while being subject to neither the gradient instabilities due to vector switching discussed in Section 2 nor to difficulties caused by repeated eigenvalues. To this end, we derive a loss function that directly operates on the matrix whose eigen- or singular-vectors we are interested in but without explicitly performing an SVD or ED.

In this section, we first discuss the generic scenario in which the matrix of interest is directly the output of the network. We then consider the slightly more involved case where the network predicts weights that themselves define the matrix, which corresponds to our application scenarios. Note that, while we discuss our approach in the context of Deep Learning, it is applicable to any optimization framework where one seeks to optimize a loss function based on the smallest eigenvector of a matrix with respect to the parameters that define this matrix.

4.1 Generic Scenario

Given an input measurement $\mathbf{x}$, let us denote by $f_\theta(\mathbf{x})$ the output of a deep network with parameters $\theta$. Here, we consider the case where the output of the network is a matrix, which we write as $\mathbf{A}_\theta = f_\theta(\mathbf{x})$. Our goal is to tackle problems where the loss function of the network depends on the smallest eigenvector $\mathbf{e}_\theta$ of $\mathbf{A}_\theta^\top\mathbf{A}_\theta$, a form that ensures that the matrix is symmetric.

Typically, one can use an $\ell_2$ loss of the form $\|\mathbf{e}_\theta - \tilde{\mathbf{e}}\|^2$, where $\tilde{\mathbf{e}}$ is the ground-truth smallest eigenvector. The standard approach to addressing this, as followed in [4, 1], consists of explicitly differentiating this loss w.r.t. $\mathbf{e}_\theta$, then w.r.t. $\mathbf{A}_\theta$ and finally w.r.t. $\theta$ via backpropagation. As discussed above, however, this is not optimization friendly.

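To overcome this, we propose to define a new loss motivated by the linear equation that defines eigenvectors and eigenvalues. Specifically, if $\mathbf{e}$ is an eigenvector of $\mathbf{A}^\top\mathbf{A}$ with eigenvalue $\lambda$, it satisfies

$$\mathbf{A}^\top\mathbf{A}\,\mathbf{e} = \lambda\,\mathbf{e}. \qquad (3)$$

Since eigenvectors have unit norm, i.e., $\mathbf{e}^\top\mathbf{e} = 1$, multiplying both sides of this equation from the left by $\mathbf{e}^\top$ yields

$$\mathbf{e}^\top\mathbf{A}^\top\mathbf{A}\,\mathbf{e} = \lambda. \qquad (4)$$

In this paper, we consider zero eigenvalue problems, that is, $\lambda = 0$. Since $\mathbf{A}^\top\mathbf{A}$ is positive semi-definite, we have that $\mathbf{e}^\top\mathbf{A}^\top\mathbf{A}\,\mathbf{e} \geq 0$ for any $\mathbf{e}$. Given the ground-truth eigenvector $\tilde{\mathbf{e}}$ that we seek to predict, this lets us define the loss function

$$L(\theta) = \tilde{\mathbf{e}}^\top\mathbf{A}_\theta^\top\mathbf{A}_\theta\,\tilde{\mathbf{e}}. \qquad (5)$$

Intuitively, this loss aims to find the parameters $\theta$ such that $\tilde{\mathbf{e}}$ is an eigenvector of the resulting matrix with minimum eigenvalue, that is, zero in our case, assuming that we can truly reach the global minimum of our loss. However, this loss alone has multiple, globally-optimal solutions, including the trivial one $\mathbf{A}_\theta = \mathbf{0}$.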

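To address this, we note that this trivial solution has not only one zero eigenvalue corresponding to eigenvector $\tilde{\mathbf{e}}$, but that all its eigenvalues are zero. Since, in practice, we typically search for matrices that have a single zero eigenvalue, we propose to maximize the projection of the data along the directions orthogonal to $\tilde{\mathbf{e}}$. Such a projection can be achieved by making use of the projector onto the orthogonal complement of $\tilde{\mathbf{e}}$, given by $\mathbf{I} - \tilde{\mathbf{e}}\tilde{\mathbf{e}}^\top$, where $\mathbf{I}$ is the identity matrix. By defining $\bar{\mathbf{A}}_\theta = \mathbf{A}_\theta(\mathbf{I} - \tilde{\mathbf{e}}\tilde{\mathbf{e}}^\top)$, we can then re-write our loss function as

$$L(\theta) = \tilde{\mathbf{e}}^\top\mathbf{A}_\theta^\top\mathbf{A}_\theta\,\tilde{\mathbf{e}} - \alpha\,\mathrm{tr}\left(\bar{\mathbf{A}}_\theta^\top\bar{\mathbf{A}}_\theta\right), \qquad (6)$$

where $\mathrm{tr}(\cdot)$ computes the trace of a matrix and $\alpha$ sets the relative influence of the two terms. Note that we can apply the same strategy to cases where multiple eigenvalues are zero, by reducing the orthogonal space to only the directions corresponding to non-zero eigenvalues, and introducing the first term for all eigenvectors whose eigenvalues we want to be zero.

For numerical stability, we further propose to bound the second term in the range $[0, 1]$. To do so, we re-write our loss as

$$L(\theta) = \tilde{\mathbf{e}}^\top\mathbf{A}_\theta^\top\mathbf{A}_\theta\,\tilde{\mathbf{e}} + \alpha\exp\left(-\beta\,\mathrm{tr}\left(\bar{\mathbf{A}}_\theta^\top\bar{\mathbf{A}}_\theta\right)\right), \qquad (7)$$

where $\beta$ is a scalar. This loss is fully differentiable, and can thus be used to learn the parameters of a deep network. Since it does not explicitly depend on performing an eigendecomposition at every iteration of the optimization, it suffers from neither the eigenvector switching problem, nor the non-unique eigenvalue problem.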
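As an illustration of how such a loss can be evaluated without any eigendecomposition, here is a minimal NumPy sketch. It assumes the loss is the quadratic form of the ground-truth eigenvector plus a bounded exponential of the trace term; the function name `eig_free_loss` and the default values of `alpha` and `beta` are placeholders, not values from the paper.

```python
import numpy as np

def eig_free_loss(A, e_gt, alpha=1.0, beta=1.0):
    """Sketch of an eigendecomposition-free loss (alpha and beta are placeholder values).

    First term:  e^T A^T A e, zero when e_gt lies in the null space of A.
    Second term: alpha * exp(-beta * tr(A_bar^T A_bar)), which penalizes the
                 trivial solution A = 0 and stays within (0, alpha].
    """
    e_gt = e_gt / np.linalg.norm(e_gt)
    residual = A @ e_gt
    data_term = residual @ residual
    projector = np.eye(len(e_gt)) - np.outer(e_gt, e_gt)   # removes the e_gt direction
    A_bar = A @ projector
    trace_term = np.sum(A_bar ** 2)         # tr(A_bar^T A_bar) as a squared Frobenius norm
    return data_term + alpha * np.exp(-beta * trace_term)

e_gt = np.array([0.0, 0.0, 1.0])
good = np.diag([3.0, 6.0, 0.0])             # single zero eigenvalue along e_gt
trivial = np.zeros((3, 3))                  # all eigenvalues are zero
print(eig_free_loss(good, e_gt), eig_free_loss(trivial, e_gt))
```

Only matrix products, a trace, and an exponential are involved, so automatic differentiation through this expression never divides by an eigenvalue gap.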

4.2 Learning to Predict Weights

In practice, the problem of interest is often more constrained than training a network to directly output a matrix $\mathbf{A}_\theta$. In particular, in this paper, we consider problems where the goal is to predict a weight $w_i$ for each element of the input. This typically leads to formulations where $\mathbf{A}_\theta^\top\mathbf{A}_\theta$ has the form $\mathbf{X}^\top\mathbf{W}\mathbf{X}$, with $\mathbf{X}$ a data matrix and $\mathbf{W}$ a diagonal matrix whose elements are the $w_i$s. Below, we introduce the formulation for each of the applications that we consider in our experiments.

4.2.1 Outlier Rejection with 3D Points.

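To show that we can indeed back-propagate nicely through the proposed loss formulation where directly using the analytical gradient fails, we first briefly revisit the toy outlier rejection problem used to motivate our approach in Section 2. For this experiment, we do not train a Deep Network or perform any learning procedure. Instead, given $N$ 3D points $\mathbf{x}_i$, including inliers and outliers, we directly optimize the weight $w_i$ of each point. At every step of optimization, given the current weight values, we compute the weighted mean of the points $\boldsymbol{\mu} = \sum_i w_i\mathbf{x}_i / \sum_i w_i$. Let $\mathbf{X}$ be the matrix of mean-subtracted 3D points. We then compute the weighted covariance matrix $\mathbf{C} = \mathbf{X}^\top\mathbf{W}\mathbf{X}$, where $\mathbf{W}$ is a diagonal matrix whose elements are the $w_i$s. The smallest eigenvector of $\mathbf{C}$ then defines the direction of noise.

Given the ground-truth such eigenvector $\tilde{\mathbf{e}}$, let $\bar{\mathbf{X}} = \mathbf{X}(\mathbf{I} - \tilde{\mathbf{e}}\tilde{\mathbf{e}}^\top)$. We adapt the general formulation of Eq. 7 and formulate the outlier rejection problem as

$$\min_{\mathbf{w}}\; \tilde{\mathbf{e}}^\top\mathbf{X}^\top\mathbf{W}\mathbf{X}\,\tilde{\mathbf{e}} + \alpha\exp\left(-\beta\,\mathrm{tr}\left(\bar{\mathbf{X}}^\top\mathbf{W}\bar{\mathbf{X}}\right)\right). \qquad (8)$$

Note that this translates directly to Eq. 7 by defining $\mathbf{A} = \mathbf{W}^{1/2}\mathbf{X}$, where $\mathbf{W}^{1/2}$ is a diagonal matrix with elements $\sqrt{w_i}$.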
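The weight optimization for this toy problem can be sketched as follows. This is an illustrative NumPy version under several assumptions that are not from the paper: the data ranges, the step size, `alpha = beta = 1`, projected gradient descent with clipping to [0, 1], and a gradient that treats the weighted mean as constant within each step.

```python
import numpy as np

def plane_loss_and_grad(points, w, e_gt, alpha=1.0, beta=1.0):
    """Weighted plane-fitting loss and its gradient w.r.t. w (mean held fixed)."""
    mean = (w @ points) / w.sum()
    X = points - mean                               # mean-subtracted points
    P = np.eye(3) - np.outer(e_gt, e_gt)            # projector orthogonal to e_gt
    r = (X @ e_gt) ** 2                             # per-point (x_i . e_gt)^2
    s = np.sum((X @ P) ** 2, axis=1)                # per-point ||x_bar_i||^2
    bounded = alpha * np.exp(-beta * (w @ s))       # alpha * exp(-beta * tr(X_bar^T W X_bar))
    return w @ r + bounded, r - beta * bounded * s

rng = np.random.default_rng(0)
inliers = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 1, 100),
                           rng.normal(0, 1e-3, 100)])
points = np.vstack([inliers, [[0.5, 0.5, 50.0]]])   # last point is the outlier
e_gt = np.array([0.0, 0.0, 1.0])                    # ground-truth plane normal

w = np.ones(len(points))                            # start with every weight at 1
for _ in range(50):
    loss, grad = plane_loss_and_grad(points, w, e_gt)
    w = np.clip(w - 1e-3 * grad, 0.0, 1.0)          # projected gradient step

print(f"final loss {loss:.2e}, outlier weight {w[-1]:.3f}, min inlier weight {w[:-1].min():.3f}")
```

In this setting the outlier receives by far the largest gradient from the first step on, so its weight is driven down without the optimizer ever having to pass through an eigenvector switch.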

4.2.2 Keypoint Matching with the Essential Matrix.

For this task, to isolate the effect of the loss function only, we followed the same setup as in [1]. Specifically, we used the same network architecture as in [1], which takes $N$ correspondences between two 2D points as input and outputs an $N$-dimensional vector of weights, that is, one weight for each correspondence.

Formally, let

$$\mathbf{q}_i = [u_i, v_i, u'_i, v'_i] \qquad (9)$$

encode the coordinates of correspondence $i$ in the two images. Following the 8-point algorithm [10], we construct a matrix $\mathbf{X} \in \mathbb{R}^{N \times 9}$, each row of which is computed from one correspondence vector $\mathbf{q}_i$ as

$$\mathbf{X}^{(i)} = [u_iu'_i,\; u_iv'_i,\; u_i,\; v_iu'_i,\; v_iv'_i,\; v_i,\; u'_i,\; v'_i,\; 1], \qquad (10)$$

where $\mathbf{X}^{(i)}$ denotes row $i$ of $\mathbf{X}$. A weighted version of the 8-point algorithm [29] then computes the essential matrix as the smallest eigenvector of $\mathbf{X}^\top\mathbf{W}\mathbf{X}$, with $\mathbf{W}$ the diagonal matrix of weights.
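The weighted 8-point computation can be sketched in NumPy as below. The row layout is an assumption: each row is taken to be the outer product of the homogeneous points `[u, v, 1]` and `[u', v', 1]`, flattened row by row, so the recovered 9-vector is the transpose of the usual essential matrix under the convention `x2^T E x1 = 0`. The synthetic scene, the pose, and the function names are likewise illustrative.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])

def eight_point_rows(x1, x2):
    """One row [u u', u v', u, v u', v v', v, u', v', 1] per correspondence."""
    h1 = np.column_stack([x1, np.ones(len(x1))])    # [u, v, 1]
    h2 = np.column_stack([x2, np.ones(len(x2))])    # [u', v', 1]
    return np.einsum("na,nb->nab", h1, h2).reshape(len(x1), 9)

def weighted_eight_point(x1, x2, w):
    """Smallest eigenvector of X^T W X, with W = diag(w)."""
    X = eight_point_rows(x1, x2)
    _, vecs = np.linalg.eigh(X.T @ (w[:, None] * X))
    return vecs[:, 0]

# Hypothetical scene: 30 points seen by two cameras related by (R, t).
rng = np.random.default_rng(0)
c, s = np.cos(0.2), np.sin(0.2)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
t = np.array([1.0, 0.2, 0.1])
pts = rng.uniform([-2, -2, 4], [2, 2, 8], size=(30, 3))
cam2 = pts @ R.T + t
x1 = pts[:, :2] / pts[:, 2:]
x2 = cam2[:, :2] / cam2[:, 2:]
x2[-5:] = rng.uniform(-1, 1, size=(5, 2))           # corrupt the last five matches
w = np.ones(30)
w[-5:] = 0.0                                        # ideal weights: reject the outliers

e = weighted_eight_point(x1, x2, w)
gt = (skew(t) @ R).T.ravel()
alignment = abs(e @ gt) / np.linalg.norm(gt)        # 1 means equal up to sign and scale
print(f"alignment with ground truth: {alignment:.6f}")
```

With ideal weights the outliers contribute nothing to the matrix, so the smallest eigenvector coincides with the ground truth up to sign; a network trained with the loss is meant to produce weights that approach this situation.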

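Let $\bar{\mathbf{X}} = \mathbf{X}(\mathbf{I} - \tilde{\mathbf{e}}\tilde{\mathbf{e}}^\top)$, where $\tilde{\mathbf{e}}$ is the ground-truth eigenvector representing the true essential matrix. We can then write an eigendecomposition-free essential loss as

$$L(\theta) = \tilde{\mathbf{e}}^\top\mathbf{X}^\top\mathbf{W}\mathbf{X}\,\tilde{\mathbf{e}} + \alpha\exp\left(-\beta\,\mathrm{tr}\left(\bar{\mathbf{X}}^\top\mathbf{W}\bar{\mathbf{X}}\right)\right), \qquad (11)$$

where $\mathbf{W}$ is the diagonal matrix of weights predicted by the network with parameters $\theta$.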

Given a set of training samples, consisting of image pairs with ground-truth essential matrices, we can then use this loss, instead of the classification loss or essential loss of [1], to train a network to predict the weights.

Note that, as suggested by [12] and done in [1], we use the 2D coordinates normalized with the camera intrinsics as input to the network.

When calculating the loss, as suggested by [11], we move the centroid of the reference points to the origin of the coordinate system and scale the points so that their RMS distance to the origin is equal to $\sqrt{2}$. This means that we also have to scale and translate $\tilde{\mathbf{e}}$ accordingly.

4.2.3 3D-to-2D Correspondences for Pose Estimation.

The goal of this problem, also known as the Perspective-n-Point (PnP) problem [21], is to determine the absolute pose (rotation and translation) of a calibrated camera, given known 3D points and corresponding 2D image points.

For this task, as we are still dealing with sparse correspondences, we use the same network architecture as before for 2D-to-2D correspondences, except that we now have one additional input dimension, since we have 3D-to-2D correspondences. This network takes $N$ correspondences between 3D and 2D points as input and outputs an $N$-dimensional vector of weights, still one weight for each correspondence.

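Mathematically, we can denote the input correspondences as

$$\mathbf{q}_i = [x_i, y_i, z_i, u_i, v_i], \qquad (12)$$

where $x_i$, $y_i$, $z_i$ are the coordinates of a 3D point, and $u_i$, $v_i$ denote the corresponding image location. According to [12], we have

$$\lambda_i\,[u_i, v_i, 1]^\top = [\mathbf{R} \mid \mathbf{t}]\,[x_i, y_i, z_i, 1]^\top, \qquad (13)$$

where $\lambda_i$ is the depth of the point and $\mathbf{R}$ and $\mathbf{t}$ are the rotation and translation that define the pose.

To recover the pose, we then follow the Direct Linear Transform (DLT) method [12]. This consists of constructing the matrix $\mathbf{X} \in \mathbb{R}^{2N \times 12}$, every two rows of which are computed from one correspondence $\mathbf{q}_i$ as

$$\begin{aligned} \mathbf{X}^{(2i-1)} &= [x_i,\; y_i,\; z_i,\; 1,\; 0,\; 0,\; 0,\; 0,\; -u_ix_i,\; -u_iy_i,\; -u_iz_i,\; -u_i], \\ \mathbf{X}^{(2i)} &= [0,\; 0,\; 0,\; 0,\; x_i,\; y_i,\; z_i,\; 1,\; -v_ix_i,\; -v_iy_i,\; -v_iz_i,\; -v_i], \end{aligned} \qquad (14)$$

where $\mathbf{X}^{(i)}$ denotes row $i$ of $\mathbf{X}$. Then, the solution of the weighted PnP problem can be obtained as the eigenvector of $\mathbf{X}^\top\mathbf{W}\mathbf{X}$ corresponding to the smallest eigenvalue. Therefore, we can define a PnP loss similar to the one of Eq. 11 for 2D-to-2D correspondences, but with $\mathbf{X}$ defined as discussed above. Given training samples, each consisting of a set of 3D-to-2D correspondences with the corresponding ground-truth eigenvector encoding the pose, we then train a network to predict weights such that we obtain the correct pose via DLT. As in the 2D-to-2D case, we use the normalized coordinate system for the 2D coordinates.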
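A weighted DLT of this kind can be sketched in NumPy as follows. The row layout (two rows per correspondence, with the unknown 3x4 matrix `[R | t]` flattened row by row), the synthetic scene, and the function names are assumptions made for this illustration.

```python
import numpy as np

def dlt_rows(pts3d, pts2d):
    """Two rows per 3D-to-2D correspondence for the unknown 3x4 matrix [R | t]."""
    n = len(pts3d)
    P = np.column_stack([pts3d, np.ones(n)])        # homogeneous 3D points, shape (n, 4)
    Z = np.zeros_like(P)
    u, v = pts2d[:, :1], pts2d[:, 1:]
    top = np.hstack([P, Z, -u * P])                 # encodes X_c - u * Z_c = 0
    bottom = np.hstack([Z, P, -v * P])              # encodes Y_c - v * Z_c = 0
    return np.stack([top, bottom], axis=1).reshape(2 * n, 12)

def weighted_dlt(pts3d, pts2d, w):
    """Smallest eigenvector of X^T W X, each weight applied to both rows of a match."""
    X = dlt_rows(pts3d, pts2d)
    W = np.repeat(w, 2)
    _, vecs = np.linalg.eigh(X.T @ (W[:, None] * X))
    return vecs[:, 0].reshape(3, 4)

# Hypothetical scene: 20 points in front of a camera with pose (R, t).
rng = np.random.default_rng(0)
c, s = np.cos(0.3), np.sin(0.3)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
t = np.array([0.1, -0.2, 5.0])
pts3d = rng.uniform(-1, 1, size=(20, 3))
cam = pts3d @ R.T + t
pts2d = cam[:, :2] / cam[:, 2:]                     # normalized image coordinates
pts2d[-5:] = rng.uniform(-1, 1, size=(5, 2))        # corrupt the last five matches
w = np.ones(20)
w[-5:] = 0.0                                        # ideal weights: reject the outliers

P_est = weighted_dlt(pts3d, pts2d, w)
gt = np.hstack([R, t[:, None]])
alignment = abs(P_est.ravel() @ gt.ravel()) / np.linalg.norm(gt)
print(f"alignment with ground truth: {alignment:.6f}")
```

The recovered matrix is defined only up to sign and scale, which is why the check compares directions rather than entries.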

Note that the characteristics of the rotation matrix, that is, orthogonality and determinant 1, are not preserved by the DLT solution. Therefore, to make the result a valid rotation matrix, we refine the DLT results by the generalized Procrustes algorithm [30, 31], which is a common post-processing technique for PnP algorithms. Note that this step is not involved during training, but only in the validation process to select the best model and at test time.
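One simple way to turn a DLT estimate into a valid pose is an SVD-based orthogonal Procrustes projection, sketched below. This is a stand-in for illustration and not necessarily the generalized Procrustes procedure of [30, 31]; the function name `pose_from_dlt` and the test pose are assumptions.

```python
import numpy as np

def pose_from_dlt(P):
    """Project a 3x4 DLT estimate (defined up to sign and scale) onto a valid pose."""
    P = P * np.sign(np.linalg.det(P[:, :3]))        # resolve the sign ambiguity
    U, S, Vt = np.linalg.svd(P[:, :3])
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # guards against reflections
    R = U @ D @ Vt                                  # closest rotation in Frobenius norm
    t = P[:, 3] / S.mean()                          # undo the unknown scale
    return R, t

# Hypothetical pose, scaled and sign-flipped as a DLT eigenvector could be.
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
t_true = np.array([0.1, -0.2, 5.0])
P = -0.4 * np.hstack([R_true, t_true[:, None]])

R_est, t_est = pose_from_dlt(P)
print(np.round(R_est, 3), np.round(t_est, 3))
```

Because this step uses an SVD, it matters that it sits outside the training loop, as noted above: no gradient has to flow through it.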

5 Experiments

We now present our results for the three tasks discussed above, that is, plane fitting as in Section 2, distinguishing good keypoint correspondences from bad ones, and solving the Perspective-n-Point (PnP) problem. We rely on a TensorFlow implementation using the Adam [32] optimizer, with a learning rate of , unless stated otherwise, and default parameters. When training a network for keypoint matching and PnP, we used mini-batches of 32 samples and, in the plane fitting case, we also tested vanilla gradient descent in addition to Adam.

5.1 Plane Fitting

The setup is the one discussed in Section 2. We randomly sampled 100 3D points on the plane. Specifically, we uniformly sampled the two in-plane coordinates and then added zero-mean Gaussian noise in the out-of-plane dimension. We also generated outliers in a similar way: the two in-plane coordinates are uniformly sampled in the same range, and the out-of-plane coordinate is sampled from a Gaussian distribution with mean 50 and standard deviation of 5. For the baselines that directly use the analytical gradients of SVD and ED, we take the objective function to be $\min \|\mathbf{e}_{min} \pm \tilde{\mathbf{e}}\|$, where $\mathbf{e}_{min}$ is the minimum eigenvector of $\mathbf{X}^\top\mathbf{W}\mathbf{X}$ in Eq. 8 and $\tilde{\mathbf{e}}$ is the ground-truth noise direction, which is also the plane normal. Note that we consider both $+\mathbf{e}_{min}$ and $-\mathbf{e}_{min}$ and take the minimum distance, denoted by the $\pm$ and the $\min$ in the loss function. For this problem, both solutions are correct due to the sign ambiguity of eigendecomposition, and we need to take this into account.

We consider two ways of computing analytical gradients, one using the SVD and the other the self-adjoint eigendecomposition (Eigh), which both yield mathematically valid solutions. To implement our approach, we rely on Eq. 8.

Figure 2: Loss evolution for the simple toy example, (a) when the Adam optimizer is used, and (b) when vanilla gradient descent (GD) is applied. We report results for Singular Value Decomposition (SVD), self-adjoint Eigendecomposition (Eigh), and our loss function. For each loss, we tried multiple learning rates and report the best results in terms of convergence. In both cases, our loss formulation converges nicely, whereas SVD and Eigh do not. SVD and Eigh do not optimize well until they reach the point where the eigenvector swap happens, and only then do they start to converge. This is most extreme when GD is used in (b), where it takes millions of iterations to converge even for this very simple example.
Figure 3: Plane fitting in the presence of multiple outliers. (a) Loss evolution with SVD in the hard case. (b) Inliers with SVD. (c) Inliers with ours. With multiple outliers, both our approach and the SVD/Eigh baselines still converge (a). However, as illustrated in (b), where we plot the weight of each input point during optimization, the SVD baseline discards many inliers (positions 1 to 100 are true inliers), while accepting outliers. By contrast, as shown in (c), our approach correctly rejects the outliers and accepts the inliers.

Fig. 2 shows the evolution of the loss as the optimization proceeds when using either Adam or vanilla gradient descent, when a single outlier is present. Note that SVD and Eigh have exactly the same behavior because they constitute two equivalent ways of solving the same problem. Using Adam in conjunction with either one initially yields a very slow decrease in the loss function, until it suddenly drops to zero when the switch of the eigenvector with the smallest eigenvalue occurs. By contrast, our approach produces a much more gradual decrease in the loss with no overly large gradients ever being generated. The difference in behavior is even more drastic when vanilla gradient descent is used instead of Adam: SVD and Eigh take millions of iterations to converge. We tried multiple learning rates, none of which led to faster convergence. We provide the results with different learning rates in the supplementary material.

We also evaluate the behavior of our method and the baselines in the presence of more outliers. As shown in Fig. 3, while both our method and the baseline still present the same convergence patterns as before, our approach correctly recovers the inliers and outliers, while the SVD baseline discards many inliers and even accepts outliers.

Note that, while in this plane-fitting example the SVD- or Eigh-based methods converge, in the more complex cases below, this is not always true.

5.2 Keypoint Matching

To evaluate our method on a real-world problem, we use the SUN3D dataset [33]. For a fair comparison, we trained our network on the same data as [1], that is, the “brown-bm-3-05” sequence, and evaluated it on the test sequences used in [20, 1]. Additionally, to show that our method is not overfitting, we also test on a completely different dataset, the “fountain-P11” and “Herz-Jesus-P8” sequences of [34].

We follow the evaluation protocol of [1], which constitutes the state-of-the-art in keypoint matching, and only change the loss function to our own loss of Eq. 11. We set the hyperparameters of our loss to values that we empirically found to work well for 2D-to-2D keypoint matching. We compare our method against that of [1], both in its original implementation that involves minimizing a classification loss first and then without that initial step, which we denote as “Essential_Only”. The latter is designed to show how critical the initial classification-based minimization of [1] is. In addition, we also compare against standard RANSAC [35], LMedS [36], MLESAC [15], and GMS [17] to provide additional reference points. We do this in terms of the performance metric used in [1] and referred to as mean Average Precision (mAP). This metric is computed by observing the ratio of accurately recovered poses given a certain maximum threshold, and taking the area under the curve of this graph.

Figure 4: Keypoint matching results. We report the accuracy of the estimated relative pose in terms of the mean Average Precision (mAP) measure of [1]. (a) Results for the SUN3D dataset. (b) Results for the dataset of [34]. Our method performs on par with the state-of-the-art method of [1], denoted as “Classification + Essential”, without the need for any pre-training. Note the significant performance gap between “Essential_Only”, which utilizes eigendecomposition directly, and our method, which is eigendecomposition-free.
Figure 5: Qualitative comparison of our results with those of RANSAC. (a) Our results and (b) RANSAC results on the “fountain-P11” sequence of [34]. (c) Our results and (d) RANSAC results on the “brown-bm-3-05” sequence of SUN3D. We display the correspondences that the algorithms labeled as inliers. True positives are shown in green and false positives in red. The false positives of our approach are still close to being correct, while those of RANSAC are truly wrong.

We summarize the results in Fig. 4 and provide numbers for individual datasets in the supplementary material. Our approach performs roughly on par with [1], the state-of-the-art method on keypoint matching, and outperforms all the other baselines. Importantly, “Essential_Only” severely underperforms and even often fails completely. In short, instead of having to find a workaround to the eigenvector switching problem as in [1], we can directly optimize our objective function, which is far more generally applicable. Furthermore, the workaround in [1] would converge to a sub-optimal solution, as the classification loss depends on a user-selected decision boundary, that is, a heuristic definition of inliers. By contrast, our method can simply discover the inliers automatically while training, thanks to the second term in Eq. 7.

In Fig. 5, we compare the correspondences classified as inliers by our method to those of RANSAC on image pairs from the dataset of [34] and SUN3D, respectively. Note that even the correspondences that are misclassified as inliers are very close to being inliers. By contrast, RANSAC yields much larger errors.

5.3 PnP

Following standard practice for evaluating PnP algorithms [21, 24], we used the procedure of [24] to generate a synthetic dataset composed of 3D-to-2D correspondences with noise and outliers added. Each training example comprises two thousand 3D points and we set the ground truth translation of the camera pose to be their centroid.

We then create a random ground-truth rotation and project the 3D points to the image plane of our virtual camera. As in REPPnP [24], we apply Gaussian noise to these projections. We generate outliers by assigning 3D points to arbitrary valid 2D image positions.

We train a neural network with the same architecture as in the keypoint matching case, except that it now takes 3D-to-2D correspondences as input. We set the hyperparameters of our loss to values that we empirically found to work well for this task. During training, to learn to be robust to outliers, we randomly select between 100 and 1000 of the two thousand matches and turn them into outliers. In other words, the two thousand training matches will contain a random number of outliers that our network will learn to filter out.

We compare our method against modern PnP methods, EPnP [21], OPnP [23], PPnP [30], RPnP [37] and REPPnP [24]. We also evaluate the DLT [12], since our loss formulation is based on it. Among these methods, REPPnP is the one most specifically designed to handle outliers. We also report the performance of two commonly used baselines that leverage RANSAC [14], P3P [22]+RANSAC and EPnP+RANSAC. For other methods, RANSAC did not bring noticeable improvements, and we omitted them in the graph for better visual clarity.

To compare all these methods, we use standard rotation and translation error metrics [38]. Specifically, we report the closest arc distance in radians for the rotation matrix measured using quaternions, and the distance between the translation vectors normalized by the ground truth. To demonstrate the effect of outliers at test time, we fix the number of matches to be 200 and vary the number of outliers from to . We perform each experiment 100 times and report the average.
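The two error measures can be sketched as below. This is an illustrative implementation, not the evaluation code of [38]: the function names are hypothetical, the rotation error is computed from the trace of the relative rotation (which gives the same angle as the quaternion-based arc distance), and "normalized by the ground truth" is assumed to mean division by the norm of the ground-truth translation.

```python
import math


def rotation_error(R_est, R_gt):
    """Geodesic angle, in radians, of the relative rotation R_gt^T R_est."""
    # trace(R_gt^T R_est) is the element-wise inner product of the two matrices.
    trace = sum(R_gt[k][i] * R_est[k][i] for i in range(3) for k in range(3))
    # Clamp to guard against round-off pushing the cosine outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def translation_error(t_est, t_gt):
    """Distance between translations, divided by the ground-truth norm."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))
    return diff / math.sqrt(sum(b * b for b in t_gt))
```

Both functions return a single scalar per test case, so averaging over the 100 repetitions of each experiment is a plain arithmetic mean of their outputs.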

(a) Rotation error (degrees)
(b) Translation error (normalized)
Figure 6: PnP results. Our method gives extremely stable results despite the abundance of outliers, whereas all compared methods perform significantly worse as the number of outliers increases. Even when these methods do well for either rotation or translation, they do not perform well on both. Ours, on the other hand, gives near-zero error for both measures up to 130 outliers (i.e., 65%).
(a) Loss with eigendecomposition
(b) Loss with our approach
Figure 7: Loss evolution for the PnP problem. We compare the loss based on explicit eigendecomposition with our loss. Despite our best efforts, we were not able to make the eigendecomposition-based loss converge to anything meaningful, whereas our loss function converges nicely.

Fig. 6 summarizes the results. We outperform all other methods significantly, especially when the number of outliers increases. REPPnP is the competing method that seems least affected. As long as the number of outliers is small, it is on par with us, but past a certain point, namely when there are more than 40 outliers (20% of the total), its performance, particularly in terms of rotation error, decreases quickly, whereas ours does not.

As in the keypoint matching case, we have tried to compute the results of a network relying explicitly on eigendecomposition and minimizing the norm of the difference between the ground-truth eigenvector and the predicted one. However, we found that such a network was unable to converge, as depicted in Fig. 7, where we compare the loss evolution of this approach to that of ours. This again clearly shows the benefits of our eigendecomposition-free approach.

6 Conclusion

We have introduced a novel approach to training deep networks that rely on losses computed from an eigenvector corresponding to a zero eigenvalue of a matrix defined by the network’s output. Our loss does not suffer from the numerical instabilities of analytical differentiation of eigendecomposition, and converges to the correct solution much faster. We have demonstrated the effectiveness of our method on the tasks of keypoint matching in real images and outlier rejection for the PnP problem. In both cases, our new loss has allowed us to achieve state-of-the-art results.

Since many Computer Vision tasks rely on least-squares solutions to linear systems, we will investigate the use of our approach for other tasks, such as homography estimation. Furthermore, we hope that our work will contribute to imbuing Deep Learning techniques with traditional Computer Vision knowledge, thus avoiding discarding decades of valuable research, and leading to more principled frameworks.

7 Acknowledgements

This research was supported in part by the National Natural Science Foundation of China under Grant 61603291 and the program of introducing talents of discipline to university B13043. This work was also supported in part by the Swiss Commission for Technology and Innovation. This work was performed while Zheng Dang was visiting the CVLab, EPFL, Switzerland.

References

  • [1] Yi, K., Trulls, E., Ono, Y., Lepetit, V., Salzmann, M., Fua, P.: Learning to Find Good Correspondences. In: CVPR. (2018)
  • [2] Papadopoulo, T., Lourakis, M.: Estimating the jacobian of the singular value decomposition: Theory and applications. In: ECCV. (2000) 554–570
  • [3] Giles, M.: Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation. In: Advances in Automatic Differentiation. (2008) 35–44
  • [4] Ionescu, C., Vantzos, O., Sminchisescu, C.: Matrix backpropagation for Deep Networks with Structured Layers. (2015)
  • [5] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: Tensorflow: A System for Large-Scale Machine Learning. In: USENIX Conference on Operating Systems Design and Implementation. (2016) 265–283
  • [6] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: NIPS Autodiff Workshop. (2017)
  • [7] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial Transformer Networks. In: NIPS. (2015) 2017–2025
  • [8] Handa, A., Bloesch, M., Patraucean, V., Stent, S., McCormac, J., Davison, A.: Gvnn: Neural Network Library for Geometric Computer Vision. In: ECCV. (2016)
  • [9] Murray, I.: Differentiation of the Cholesky Decomposition. arXiv Preprint (2016)
  • [10] Longuet-Higgins, H.: A Computer Algorithm for Reconstructing a Scene from Two Projections. Nature 293 (1981) 133–135
  • [11] Hartley, R.: In Defense of the Eight-Point Algorithm. PAMI 19(6) (June 1997) 580–593
  • [12] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press (2000)
  • [13] Nister, D.: An Efficient Solution to the Five-Point Relative Pose Problem. In: CVPR. (June 2003)
  • [14] Fischler, M., Bolles, R.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications ACM 24(6) (1981) 381–395
  • [15] Torr, P., Zisserman, A.: MLESAC: A New Robust Estimator with Application to Estimating Image Geometry. CVIU 78 (2000) 138–156
  • [16] Rousseeuw, P., Leroy, A.: Robust Regression and Outlier Detection. Wiley (1987)
  • [17] Bian, J., Lin, W., Matsushita, Y., Yeung, S., Nguyen, T., Cheng, M.: GMS: Grid-Based Motion Statistics for Fast, Ultra-Robust Feature Correspondence. In: CVPR. (2017)
  • [18] Raguram, R., Chum, O., Pollefeys, M., Matas, J., Frahm, J.M.: USAC: A Universal Framework for Random Sample Consensus. PAMI 35(8) (2013) 2022–2038
  • [19] Zamir, A.R., Wekel, T., Agrawal, P., Malik, J., Savarese, S.: Generic 3D Representation via Pose Estimation and Matching. In: ECCV. (2016)
  • [20] Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., Brox, T.: Demon: Depth and Motion Network for Learning Monocular Stereo. In: CVPR. (2017)
  • [21] Lepetit, V., Moreno-noguer, F., Fua, P.: EPnP: An Accurate O(n) Solution to the PnP Problem. IJCV (2009)
  • [22] Kneip, L., Scaramuzza, D., Siegwart, R.: A Novel Parametrization of the Perspective-Three-Point Problem for a Direct Computation of Absolute Camera Position and Orientation. In: CVPR. (2011) 2969–2976
  • [23] Zheng, Y., Kuang, Y., Sugimoto, S., Astrom, K., Okutomi, M.: Revisiting the PnP Problem: A Fast, General and Optimal Solution. In: ICCV. (2013)
  • [24] Ferraz, L., Binefa, X., Moreno-noguer, F.: Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection. In: CVPR. (2014) 501–508
  • [25] Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., Gumhold, S., Rother, C.: DSAC – Differentiable RANSAC for Camera Localization. ARXIV (2016)
  • [26] Huang, G., Liu, Z., Weinberger, K., van der Maaten, L.: Densely Connected Convolutional Networks. In: CVPR. (2017)
  • [27] Huang, Z., Wan, C., Probst, T., Gool, L.V.: Deep learning on lie groups for skeleton-based action recognition. In: CVPR. (2017) 6099–6108
  • [28] Law, M., Urtasun, R., Zemel, R.S.: Deep spectral clustering learning. In: ICML. (2017) 1985–1994
  • [29] Zhang, Z.: Determining the Epipolar Geometry and Its Uncertainty: A Review. IJCV 27(2) (1998) 161–195
  • [30] Garro, V., Crosilla, F., Fusiello, A.: Solving the PnP Problem with Anisotropic Orthogonal Procrustes Analysis. In: 3DPVT. (2012) 262–269
  • [31] Schönemann, P.: A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika 31(1) (1966) 1–10
  • [32] Kingma, D., Ba, J.: Adam: A Method for Stochastic Optimisation. In: ICLR. (2015)
  • [33] Xiao, J., Owens, A., Torralba, A.: SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels. In: ICCV. (2013)
  • [34] Strecha, C., Hansen, W., Van Gool, L., Fua, P., Thoennessen, U.: On Benchmarking Camera Calibration and Multi-View Stereo for High Resolution Imagery. In: CVPR. (2008)
  • [35] Cantzler, H.: Random Sample Consensus (RANSAC) (2005) CVonline.
  • [36] Simpson, D.: Introduction to Rousseeuw (1984) Least Median of Squares Regression. In: Breakthroughs in Statistics. Springer (1997) 433–461
  • [37] Li, S., Xu, C., Xie, M.: A Robust O(n) Solution to the Perspective-N-Point Problem. PAMI (2012) 1444–1450
  • [38] Crivellaro, A., Rad, M., Verdie, Y., Yi, K., Fua, P., Lepetit, V.: Robust 3D Object Tracking from Monocular Images Using Stable Parts. PAMI (2017)
  • [39] Heinly, J., Schoenberger, J., Dunn, E., Frahm, J.M.: Reconstructing the World in Six Days. In: CVPR. (2015)
  • [40] Wu, C.: Towards Linear-Time Incremental Structure from Motion. In: 3DV. (2013)

Appendix

We provide additional details about the results presented in Section 5 of the main paper.

1.1 Plane Fitting

As mentioned in Section 5.1 of the main paper, we tested different learning rates in the range . Figs. 8 and 9 depict the learning curves for both our loss and the standard SVD/Eigh-based loss for several of these learning rates, using either Adam or vanilla gradient descent (GD) as optimizer. As can be seen from the different plots, our approach always converges and correctly finds the inliers, as indicated by the bar plots. While SVD/Eigh do converge when using Adam, this requires an eigenvector switch, and learning fails when using GD. Furthermore, some inliers are systematically classified as outliers and vice-versa.

Figure 8: Loss evolution for the plane fitting problem with Adam. The different rows correspond to different learning rates, from to . On the right, we show the loss evolution for our approach and for SVD/Eigh. On the left and middle, we show bar plots indicating the points that were classified as outliers/inliers by SVD and our approach, respectively. Note that our approach always finds the correct inliers (indices 1 to 100), whereas SVD typically misclassifies points.
Figure 9: Loss evolution for the plane fitting problem with GD. As with Adam in Fig. 8, our approach converges (left) and finds the correct inliers (right). By contrast, SVD/Eigh often do not converge, and when they do, tend to misclassify some points. Note that, for a learning rate of 1, SVD/Eigh returned NaN values during optimization, and we therefore omit this setting here.

1.2 Keypoint Matching

Here, we provide the detailed keypoint matching results on the SUN3D and Strecha [34] datasets. Specifically, we compare the mAP of the baselines and of our model on the individual sequences of these datasets in Figs. 10, 11 and 12 for error thresholds of 5, 10 and 20, respectively. Note that the general trend is the same as the average one reported in the main paper, with our method essentially performing on par with the state-of-the-art method of [1], but without the need for pre-training with a different loss.

(a)
(b)
Figure 10: Keypoint matching mAP with error threshold 5. We report the accuracy of the estimated relative pose in terms of the mean Average Precision (mAP) measure of [1] for the SUN3D dataset and the dataset of [34].
(a)
(b)
Figure 11: Keypoint matching mAP with error threshold 10. We report the accuracy of the estimated relative pose in terms of the mean Average Precision (mAP) measure of [1] for the SUN3D dataset and the dataset of [34].
(a)
(b)
Figure 12: Keypoint matching mAP with error threshold 20. We report the accuracy of the estimated relative pose in terms of the mean Average Precision (mAP) measure of [1] for the SUN3D dataset and the dataset of [34].

1.3 PnP

We evaluated our PnP approach on real data, using the dataset of [39]. Specifically, the 3D points in this dataset were obtained using the Structure-from-Motion algorithm of [40], which also provides a rotation matrix and translation vector for each image. We treat these rotations and translations as ground-truth to compare different PnP algorithms. Given a pair of images, we extract SIFT features at the reprojection of the 3D points in one image, and match these features to SIFT keypoints detected in the other image. This procedure produces erroneous correspondences, which a robust PnP algorithm should discard.

In this example, we used the model trained on the synthetic data described in Section 5.3 of the main paper. We apply the model without any fine-tuning, that is, the model is trained on purely synthetic data only. We report the quantitative results of this experiment, performed on four image pairs, in Tables 1 and 2. Our approach yields much lower errors than all the baselines. In Fig. 13, we compare the reprojection of the 3D points on the input image after applying the rotation and translation obtained with our model and with EPnP+RANSAC. The points obtained with our approach reproject much more closely to the ground-truth image locations than those of this baseline. EPnP+RANSAC constitutes the best-performing baseline, on par with OPnP and P3P+RANSAC. For the other baselines, we are unable to provide similar figures because the errors reported in Tables 1 and 2 translate to points reprojecting outside the input image. This underscores the strength of our approach, which, despite being trained on synthetic data, works on real images.

Methods      Ours     REPPnP    EPnP      RPnP      OPnP     PPnP      DLT       EPnP+RANSAC   P3P+RANSAC
Reichstag    0.1319   36.0852   36.5411   12.6649   3.7012   18.6296   35.1560   3.1161        3.1161
Florence     0.0416   32.6205   34.6684   24.2201   1.5771   5.0694    38.3336   2.4004        2.4004
Prague       0.0274   39.6677   39.4239   4.6647    2.7118   16.4462   35.1211   2.5443        2.5443
Notre-dame   0.0293   36.9159   32.7849   16.2304   2.8228   8.7266    32.8611   4.2309        4.263
Table 1: Comparison of the rotation error of our approach with those of the baselines. The "+RANSAC" suffix indicates that RANSAC was used as a postprocessing step. Our approach yields the lowest error on every image pair.
Methods      Ours     REPPnP   EPnP     RPnP     OPnP     PPnP     DLT      EPnP+RANSAC   P3P+RANSAC
Reichstag    0.0110   0.2477   1.4313   0.3885   0.2346   1.2724   0.3687   0.1821        0.1821
Florence     0.0135   1.9399   1.6896   1.7971   0.8719   0.4659   1.9069   0.7640        0.7640
Prague       0.0069   0.6156   1.4631   1.4440   0.8715   0.8278   1.6928   0.8045        0.8045
Notre-dame   0.0046   1.8663   1.7654   1.7910   0.8382   1.0669   1.7918   1.0605        1.0615
Table 2: Comparison of the translation error of our approach with those of the baselines. The "+RANSAC" suffix indicates that RANSAC was used as a postprocessing step. Our approach yields the lowest error on every image pair.
Figure 13: Qualitative PnP results. The first two columns show the two images in the pair, the second of which is the one whose pose we seek to estimate. In the third and fourth columns, we show the reprojection of the 3D point cloud after applying the rotation and translation predicted by our model and by EPnP+RANSAC, respectively. The red dots correspond to the ground-truth locations and the gray ones to our predictions. Note that our results match the ground truth much more closely than those of the baseline. Top to bottom: Florence, Reichstag, Notre-dame, Prague.