Self-Calibration Supported Robust Projective Structure-from-Motion

by   Rui Gong, et al.
ETH Zurich

Typical Structure-from-Motion (SfM) pipelines rely on finding correspondences across images, recovering the projective structure of the observed scene and upgrading it to a metric frame using camera self-calibration constraints. Solving each problem is mainly carried out independently from the others. For instance, camera self-calibration generally assumes correct matches and a good projective reconstruction have been obtained. In this paper, we propose a unified SfM method, in which the matching process is supported by self-calibration constraints. We use the idea that good matches should yield a valid calibration. In this process, we make use of the Dual Image of Absolute Quadric projection equations within a multiview correspondence framework, in order to obtain robust matching from a set of putative correspondences. The matching process classifies points as inliers or outliers, which is learned in an unsupervised manner using a deep neural network. Together with theoretical reasoning why the self-calibration constraints are necessary, we show experimental results demonstrating robust multiview matching and accurate camera calibration by exploiting these constraints.





1 Introduction

Scene structure and camera motion recovery from uncalibrated images is a fundamental problem in Computer Vision and a major requirement for numerous three-dimensional capture systems. It has been established that, in the absence of any knowledge about the scene and camera, such structure can only be obtained up to a projective ambiguity. As such reconstruction suffers from a severe distortion, it is only useful in some limited applications (such as novel view synthesis). In practice, most applications require the recovered projective scene structure to be upgraded to metric. This upgrade is, however, not achievable without the calibration of the camera. Traditionally, camera calibration relies on information obtained from a known calibration object present in the scene. Other methods rely on available measurements directly extracted from the scene. One can argue that relying on scene information may not be reliable as the assumed constraints might not even be present in many cases.

With the development of flexible camera calibration techniques [1, 2], cameras with fixed intrinsic parameters can be reliably and accurately calibrated once and used so long as the parameters are kept unchanged. If the known calibration of the camera remains unchanged during the acquisition, 3D reconstruction boils down to solving linear systems of equations. However, these parameters may very well vary before or during the capture of the entire image sequence. The change may not take place in every image captured but may, nevertheless, occur under change of focus or zoom. As re-calibrating the camera in this fashion is not always possible, it is safe to assume - in most cases - the camera to be uncalibrated at any instant. Means to calibrate it, other than relying on a special pattern or scene knowledge, are hence necessary. One way to do so is to resort to the more advanced and flexible approach of camera self-calibration, i.e. the recovery of the camera’s parameters using solely point correspondences across images. Point correspondences across images make it possible to locate a virtual object, the so-called Absolute Conic (AC), that is present in all scenes. The AC is a special conic lying on the plane at infinity whose projection onto images is independent of the rigid motion of the camera. In particular, the AC carries the advantage of projecting onto an image conic (IAC) whose location depends only upon the intrinsic parameters of the camera under consideration. Camera constraints, such as partial knowledge or full constancy of the parameters, are used to fix the AC and its supporting plane and hence calibrate the camera. This task is generally cast into the problem of recovering a single object, the so-called Dual Absolute Quadric (DAQ), encoding information about both the IAC (hence the camera intrinsics) and the AC’s supporting plane (i.e. the plane at infinity).

The recovery of the DAQ is a challenging nonlinear problem in which correct correspondences are assumed to be available. Using point feature detectors [3, 4], it is possible to extract a good number of reliable features in an image. However, finding good matches between the features obtained from two images of the same scene is not an easy task. The problem becomes even more difficult when it comes to simultaneously matching points across multiple images. Matching based on the epipolar constraint is a widely used technique for two views: given a point in one image, its corresponding point in the second image lies on a known line. This constraint is not sufficient to reject all the outliers, as an outlier in the other image may still lie on that line. When multiple images are matched by considering only one pair of images at a time, a significant number of outliers may remain. These outliers can further be rejected by enforcing 3-view or N-view constraints. In practice, only constraints over up to 3 views are used, due to the expensive computational cost for a higher number of views.

In general, feature extraction and matching are tackled independently from the self-calibration problem. However, it is unrealistic to address the camera self-calibration problem under the assumption that perfectly matched sets of pixels across images are available. One cannot help but notice that, in order to facilitate the correspondence process, camera self-calibration techniques are often tested only on ordered image sequences when real images are considered. In the presence of mismatches, camera self-calibration techniques are doomed to failure. Note that when camera parameters are known, they can be used to support the matching of features through the inspection of the re-projection residual errors. However, no such approach exists when the calibration is unknown. The basic idea on which our work is based is as follows: if self-calibration were a linear process, then one could use it to support the correspondence process in a way similar to the role of the fundamental matrix, the trifocal (or multi-focal) matching tensors or even re-projection: valid correspondences must necessarily lead to valid fundamental matrices, valid multi-focal tensors and a valid re-projection of the reconstructed scene. However, self-calibration constraints being inherently nonlinear, they - seemingly - are of no use to support the search for correspondences across uncalibrated images. We argue here that a packaged solution for 3D reconstruction from multiple views will not be complete unless both of these problems are solved together, i.e., a robust multiview matching and a reliable camera self-calibration that exploit one another instead of being solved independently. How useful would point correspondences be that yield an inaccurate, false or even impossible calibration? Not useful at all! Self-calibrating a camera with every candidate set of matches is unrealistic and computationally prohibitive. Also, self-calibration being a nonlinear problem, it may very well fail because of numerical optimization considerations rather than false matches. It is thus of the utmost importance to find a proper formulation to express the likely existence of a valid calibration (given a point-set match) rather than going all the way to recover the camera parameters. The main goal of this paper is specifically to match multi-view correspondences with the support of such intractable self-calibration constraints. To the best of our knowledge, this is the first work that (i) performs deep projective structure-from-motion, and (ii) exploits the self-calibration constraints for the task of multi-view matching.

In this work, we design a deep unified framework for projective structure-from-motion and camera self-calibration, to support the multi-view matching process. Using a set of putative correspondences across multiple views, the proposed framework predicts inlier/outlier scores of the correspondences together with camera intrinsics and the plane at infinity. Notably, our deep network is trained end-to-end in an unsupervised manner. The unsupervised training for intrinsics and plane at infinity is possible thanks to the self-calibration constraints expressed in the form of DAQ projection equations. In fact, we show that adding the self-calibration constraint makes our model more robust in difficult settings such as few points, few views and high outlier rates. We show the practicality of our method, in terms of both robustness and accuracy, on real data and on extreme cases of synthetic data.

2 Related Works

Camera self-calibration is widely known to be a difficult problem. This is mainly due to two reasons: the nonlinear nature of the underlying equations and the numerous critical motion sequences [5]. Critical motions cause various levels of reconstruction ambiguities and lead to the failure of camera calibration. The preliminary work based on Kruppa’s equations proposed in [6] is historically seen as the first self-calibration method. However, its application to three or more views provides weaker constraints than those obtained through subsequent methods such as the one based on the modulus constraints [7] and the one relying on the Dual Absolute Quadric [8]. This is because Kruppa’s constraints rely only on the dual images of the AC and do not enforce that those images correspond to a unique conic (the AC) lying on the plane at infinity [9]. The plane at infinity and the AC are estimated in either of two ways: one after another (stratified) or simultaneously (direct). The stratified method given in [10] for affine cameras was extended to perspective cameras in [11], with further developments in [12, 7, 13]. Scene constraints combined with camera constraints over multiple views are described in [14, 15, 16]. The use of the modulus constraints to locate the plane at infinity was introduced by Pollefeys in [7]. The direct methods, which simultaneously estimate the plane at infinity and the dual IAC, basically deal with the DAQ [8, 17, 18]. The DAQ was introduced for camera self-calibration by Triggs [8], who proposed both quasi-linear and sequential programming methods to locate it. Pollefeys et al. [19] showed that the DAQ computation could be used for metric reconstruction under general motion even for varying focal length. In the case of a moving camera with varying parameters, there exist no tight constraints on the position of the plane at infinity. Either chirality constraints [20] or the finiteness constraint [21] are used within iterative search schemes. Stratified methods are sensitive to critical motions [9], whereas direct methods are less affected by critical sequences [22, 23] but are not flexible enough to be used in many cases.

All the above methods start with the common assumption that perfect correspondences are available among all the images, which is not practical. To the best of our knowledge, there is no research work that simultaneously deals with multiview matching of randomly captured images and self-calibration. One of the earliest multiview matching works for a set of unordered real images is presented in [24]. However, this method does not find the correspondences among all the images. Recent methods for multi-view matching, although often in a different context, have also been developed [25, 26, 27]. On the other hand, almost all projective SfM works [28, 29, 30] are primarily concerned with accuracy and optimality, without addressing robustness; two notable exceptions are [31, 32]. With the recent developments, learning-based methods for 3D reconstruction and/or self-calibration have also been developed [33, 34, 35, 36, 37, 38, 39, 40]. However, most of these works, when unsupervised, rely on a naive photometric error loss. This loss immediately mandates an ordered image sequence, or images captured under very similar conditions. Regarding learning-based matching for structure and/or motion, a few notable works include [41, 42, 43, 44].

3 Preliminaries

The Fundamental matrix encapsulates all the necessary (projective) geometric relationships for the two-view imaging model. However, when more than two views are involved, more sophisticated relationships (analogous to the Fundamental matrix), involving measurements from all the views, are required. These relationships are known as $N$-view multilinear tensors, such as the trifocal tensor for three views and the quadrifocal tensor for four. Although $N$-view tensors successfully encapsulate the geometric relationships for up to 4 views, their usage is limited due to their computational complexity. Therefore, a common practice for incorporating measurements from multiple views involves the projective factorization method. The process of projective factorization takes 2D point measurements from multiple views and decomposes them into a scene structure and camera matrices that are consistent with this structure.

3.1 Projective Factorization

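Consider $n$ 3D points $\{X_j\}_{j=1}^{n}$ observed by $m$ cameras $\{P_i\}_{i=1}^{m}$. The observed image points are given by $x_{ij}$. For given point correspondences across images $\{x_{ij}\}$, the reconstruction task is to find 3D point coordinates $X_j$ and camera matrices $P_i$ such that,

$$x_{ij} \sim P_i X_j. \tag{1}$$

If we write this equation explicitly by introducing scale variables $\lambda_{ij}$ (or projective depths), we have $\lambda_{ij} x_{ij} = P_i X_j$. Provided that the points are visible in all views (i.e. $x_{ij}$ is known for all $i$ and $j$), the complete set of equations may be written by stacking the vectors and matrices in the following form,

$$\begin{bmatrix} \lambda_{11} x_{11} & \cdots & \lambda_{1n} x_{1n} \\ \vdots & \ddots & \vdots \\ \lambda_{m1} x_{m1} & \cdots & \lambda_{mn} x_{mn} \end{bmatrix} = \begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix} \begin{bmatrix} X_1 & \cdots & X_n \end{bmatrix}. \tag{2}$$

The matrix on the left-hand side is known as the measurement matrix, say $W$. By construction, the matrix $W$ is of rank 4. This equation involves the scale variables $\lambda_{ij}$, which are not part of the measurement, for each measured point $x_{ij}$. Furthermore, note that the decomposition on the right-hand side of the above equality is not unique. To see this, write the right-hand side of (2) as $PX$, with $P$ the stacked camera matrices and $X$ the stacked points, and observe that for any non-singular $4 \times 4$ matrix $H$ we have $W = (PH)(H^{-1}X)$, which also satisfies (2). Such a reconstruction is a projective reconstruction and the matrix $H$ is called a projective homography matrix. There are several approaches that allow decomposing the measurement matrix in the form of Equation (2).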

Sturm/Triggs Factorization: The first method to decompose the measurement matrix $W$ as in (2) was proposed by Sturm and Triggs [45], where an initial estimate of the projective depths $\lambda_{ij}$ is assumed to be known. This may be obtained either from an initial projective reconstruction (for example, using the fundamental matrix) or by simply setting all $\lambda_{ij} = 1$. Once the projective depths are known, the measurement matrix is complete. In the case of noisy measurements, $W$ can be enforced to have rank 4 using the Singular Value Decomposition. Thus, if $W = U D V^\top$, all except the largest four diagonal entries of $D$ are forced to zero, resulting in $\hat{D}$. Then, the rank-constrained measurement matrix is $\hat{W} = U \hat{D} V^\top$. Using such a decomposition, the camera matrices and the scene points are retrieved (keeping only the four non-zero singular directions) as,

$$\begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix} = U \hat{D}, \qquad \begin{bmatrix} X_1 & \cdots & X_n \end{bmatrix} = V^\top. \tag{3}$$
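The rank-4 truncation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `factorize_rank4`, the random test data, and the assumption that the projective depths are already absorbed into the measurement matrix are all ours.

```python
import numpy as np

def factorize_rank4(W):
    """Project W (3m x n) onto rank 4 and split it into cameras and points."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    P = U[:, :4] * s[:4]   # stacked camera matrices, shape (3m, 4)
    X = Vt[:4, :]          # homogeneous scene points, shape (4, n)
    return P, X

# Hypothetical noise-free measurement matrix with known projective depths.
rng = np.random.default_rng(0)
m, n = 5, 30
W = rng.standard_normal((3 * m, 4)) @ rng.standard_normal((4, n))

# Small noise breaks the exact rank-4 property; the SVD truncation restores it.
W_noisy = W + 1e-3 * rng.standard_normal(W.shape)
P, X = factorize_rank4(W_noisy)
W_hat = P @ X

rank = np.linalg.matrix_rank(W_hat)
rel_err = np.linalg.norm(W_hat - W) / np.linalg.norm(W)
print(rank, rel_err)
```

The recovered `P` and `X` agree with the generating factors only up to a 4 x 4 projective homography, which is exactly the ambiguity discussed in Section 3.1.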


3.2 Projective-to-Metric Upgrade

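For simplicity and without loss of generality, we assume that the coordinate frame of the first camera in both projective and metric space coincides with the world frame, such that the first cameras are respectively given by $P_1 = [\,I \mid 0\,]$ and $P_1^{M} = K[\,I \mid 0\,]$, where $K$ is the matrix of camera intrinsics. The projective structure and motion can then be upgraded using,

$$P_i^{M} = P_i H, \qquad X_j^{M} = H^{-1} X_j, \qquad H = \begin{bmatrix} K & 0 \\ -p^\top K & 1 \end{bmatrix}, \tag{4}$$

where $p$ are the coordinates of the so-called plane at infinity, say $\pi_\infty = (p^\top, 1)^\top$, in the projective space whose frame coincides with the first camera.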
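As a sanity check of the upgrade, the sketch below builds the upgrading homography from an intrinsic matrix and the plane-at-infinity coordinates, then verifies two consequences: the first projective camera maps to the calibrated first camera, and the plane at infinity maps to its canonical position. The helper name `upgrade_homography` and the numerical values of `K` and `p` are hypothetical, and the block form of the homography is the standard one assumed here.

```python
import numpy as np

def upgrade_homography(K, p):
    """Build the 4x4 upgrade H = [[K, 0], [-p^T K, 1]]."""
    H = np.eye(4)
    H[:3, :3] = K
    H[3, :3] = -p @ K
    return H

# Hypothetical intrinsics (pixels) and plane-at-infinity coordinates.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
p = np.array([0.01, -0.02, 0.005])
H = upgrade_homography(K, p)

# The first projective camera [I | 0] is upgraded to K [I | 0].
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P1_metric = P1 @ H

# Planes transform with H^T; the plane at infinity becomes (0, 0, 0, 1).
pi_inf = np.append(p, 1.0)
pi_inf_metric = H.T @ pi_inf
print(P1_metric)
print(pi_inf_metric)
```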

3.3 DAQ for Camera Self-Calibration

The Dual Absolute Quadric (DAQ), $Q^*_\infty$, is a special degenerate quadric of planes in the dual 3D space [9]. The canonical form of the DAQ in metric space is given by $\tilde{I} = \mathrm{diag}(1, 1, 1, 0)$, which is fixed under metric transformations and takes the form $Q^*_\infty = H \tilde{I} H^\top$ in the projective space. Using the form of $H$ in (4), with intrinsics $K$ and plane-at-infinity coordinates $p$, one can express the DAQ in projective space with respect to the first camera frame as,

$$Q^*_\infty = \begin{bmatrix} \omega^* & -\omega^* p \\ -p^\top \omega^* & p^\top \omega^* p \end{bmatrix}, \tag{5}$$

where $\omega^* = K K^\top$ is also known as the Dual Image of the Absolute Conic (DIAC) in the first image. Direct self-calibration methods rely on the existence of a DAQ of the form (5). More specifically, one can establish the relationship between the DAQ and the DIAC in each view using the projective projection matrices $P_i$ as follows,

$$\omega^*_i \sim P_i \, Q^*_\infty P_i^\top. \tag{6}$$

In this regard, the task of self-calibration is finding $Q^*_\infty$ that has the structure of (5) and satisfies (6), using the given projective projection matrices.
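The two properties used here (the block structure of the DAQ and its projection onto the DIAC in every view) can be checked numerically. The sketch below is illustrative only: the intrinsics, the plane vector, the single test pose, and the helper `rot_y` are hypothetical, and the camera is assumed to have constant intrinsics so that every view shares the same DIAC.

```python
import numpy as np

def rot_y(angle):
    """Rotation about the y-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
p = np.array([0.01, -0.02, 0.005])

# Upgrade homography and DAQ in the projective frame: Q = H diag(1,1,1,0) H^T.
H = np.eye(4)
H[:3, :3] = K
H[3, :3] = -p @ K
Q = H @ np.diag([1.0, 1.0, 1.0, 0.0]) @ H.T

# Same quadric assembled block by block from the DIAC omega = K K^T.
omega = K @ K.T
Q_blocks = np.zeros((4, 4))
Q_blocks[:3, :3] = omega
Q_blocks[:3, 3] = -omega @ p
Q_blocks[3, :3] = -p @ omega
Q_blocks[3, 3] = p @ omega @ p

# The DAQ is degenerate (rank 3) and the plane at infinity spans its null space.
s = np.linalg.svd(Q, compute_uv=False)
rank = int(np.sum(s > 1e-9 * s[0]))
null_residual = np.linalg.norm(Q @ np.append(p, 1.0))

# A metric camera K [R | t] expressed in the projective frame is P = P^M H^{-1}.
R, t = rot_y(0.3), np.array([0.5, -0.1, 0.2])
P = K @ np.hstack([R, t[:, None]]) @ np.linalg.inv(H)
omega_i = P @ Q @ P.T   # projected DAQ; equals omega for constant intrinsics
print(rank, null_residual)
```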

4 Multiview Matching

The process of multiview matching assumes that the putative correspondences may be contaminated by potentially overwhelmingly many outlying matches. These outliers are filtered by maximizing the consensus set of correspondences that respect the factorization of (2) as well as the DAQ projection of (6). In this process, we are interested in classifying each correspondence with the help of a per-point assignment variable, for a given noise-free but outlier-contaminated measurement matrix, using the following optimization problem,


where $\odot$ denotes the element-wise matrix multiplication, and the remaining unknowns are the projective structure, the motion, and the DAQ. The assignment variable of a point indicates whether the corresponding measurement is an inlier or an outlier. The optimization problem of (7), however, has two major issues that need to be addressed before it can be used in practice. One concerns noise and the other an efficient usage of the DAQ projection equation.

4.1 In the Presence of Noise

When the measurement matrix is also contaminated by noise, the constraint of (7) is often not satisfied by the desired solution. Therefore, we instead seek a matrix that is closest, in Frobenius norm, to the outlier-filtered measurement matrix while ensuring the following,


In fact, any matrix that admits a factorization of the form (2) satisfies the constraint of (8). On the other hand, the rank-4 matrix that minimizes the objective of (8) can be obtained from the singular value decomposition of the outlier-filtered measurement matrix, whenever the assignment variable is known. Note that any matrix of higher rank can be projected onto the rank-4 manifold by setting all except the largest four singular values to zero, similar to the Sturm/Triggs factorization discussed in Section 3.1.
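To illustrate why the distance to the rank-4 manifold is informative about the assignment, the following sketch compares that distance for two assignments on a toy measurement matrix. It assumes a binary assignment vector `z` (1 for inlier, 0 for outlier) and treats masking a column as dropping it, which leaves the rank argument unchanged; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def rank4_residual(W, z):
    """Frobenius distance from the columns selected by z to their closest rank-4 matrix."""
    W_in = W[:, z.astype(bool)]
    s = np.linalg.svd(W_in, compute_uv=False)
    # By the Eckart-Young theorem the distance is the norm of the discarded singular values.
    return float(np.sqrt(np.sum(s[4:] ** 2)))

rng = np.random.default_rng(1)
m, n, n_out = 6, 40, 8

# Rank-4 measurement matrix whose first n_out columns are replaced by outliers.
W = rng.standard_normal((3 * m, 4)) @ rng.standard_normal((4, n))
W[:, :n_out] = rng.standard_normal((3 * m, n_out))

z_all = np.ones(n)       # every correspondence treated as an inlier
z_true = np.ones(n)
z_true[:n_out] = 0       # the correct assignment removes the outliers

res_all = rank4_residual(W, z_all)
res_true = rank4_residual(W, z_true)
print(res_all, res_true)
```

With the correct assignment the residual drops to numerical zero, while keeping the outliers leaves a clearly non-zero residual.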

4.2 DAQ Projection for Constant Intrinsics

The DAQ projection constraint of (6) may turn out to be weak if we assume that each camera can have different intrinsics. This, however, is not a problem in itself. One can still make use of the DAQ projection constraints under a known prior on the intrinsics. The known prior may include zero skew, unit aspect ratio, a principal point close to the image center, or a change in focal length only. In this work, we assume that all the cameras have constant intrinsics. Furthermore, we also need to consider that the projection equation will not be satisfied exactly in the presence of noise. Therefore, we minimize the following objective function,
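One plausible way to turn the constant-intrinsics assumption into a residual is sketched below. It is an assumption on our part, not the objective stated in the paper: each projected DAQ is compared with the shared DIAC after normalizing both by their Frobenius norm, which removes the unknown scale of the projection equation. All names (`daq`, `daq_residual`, `rot_y`) and numerical values are hypothetical.

```python
import numpy as np

def rot_y(angle):
    """Rotation about the y-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def daq(K, p):
    """DAQ with the block structure of (5), built from intrinsics K and plane vector p."""
    omega = K @ K.T
    Q = np.zeros((4, 4))
    Q[:3, :3] = omega
    Q[:3, 3] = -omega @ p
    Q[3, :3] = -p @ omega
    Q[3, 3] = p @ omega @ p
    return Q

def daq_residual(P_list, K, p):
    """Assumed loss: scale-normalized mismatch between each projected DAQ and the shared DIAC."""
    Q = daq(K, p)
    omega = K @ K.T
    omega_n = omega / np.linalg.norm(omega)
    total = 0.0
    for P in P_list:
        proj = P @ Q @ P.T
        total += np.linalg.norm(proj / np.linalg.norm(proj) - omega_n) ** 2
    return total

# Hypothetical ground truth, used only to synthesize consistent projective cameras.
K_true = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
p_true = np.array([0.01, -0.02, 0.005])
H = np.eye(4)
H[:3, :3] = K_true
H[3, :3] = -p_true @ K_true
H_inv = np.linalg.inv(H)
poses = [(rot_y(0.0), np.zeros(3)),
         (rot_y(0.2), np.array([0.5, 0.0, 0.1])),
         (rot_y(-0.3), np.array([-0.4, 0.2, 0.0]))]
P_list = [K_true @ np.hstack([R, t[:, None]]) @ H_inv for R, t in poses]

# The residual vanishes for the true calibration and grows for a wrong focal length.
K_wrong = K_true.copy()
K_wrong[0, 0] = K_wrong[1, 1] = 1040.0
res_true = daq_residual(P_list, K_true, p_true)
res_wrong = daq_residual(P_list, K_wrong, p_true)
print(res_true, res_wrong)
```

Note that the test poses rotate about the y-axis; a pure rotation about the optical axis would leave a wrong focal length undetected, which is one instance of the critical motions mentioned in Section 2.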


4.3 The Matching Objective

The primary goal of multiview matching is to compute an inlier/outlier assignment that also satisfies the DAQ projection conditions. Therefore, we aim at simultaneously estimating the inlier/outlier assignment and the DAQ by maximizing the following surrogate of the objective of (7),


where the two weights account for the influence of the noise term and the constant-intrinsics term, respectively.

4.4 Self-Calibrating Projective SfM

In order to solve the optimization problem of (10), we present our deep self-calibrating projective SfM model (SCPSfM), which simultaneously performs the projective factorization and camera calibration. In this process, we exploit the accuracy and efficiency of deep neural networks as optimizers. Starting from the noise- and outlier-contaminated measurement matrix defined in Section 3.1, we predict the per-correspondence weights defined in (7) to separate inlier from outlier correspondences. Additionally, we calibrate the camera intrinsics defined in (4) and the coordinates of the plane at infinity defined in (5). Our SCPSfM model is a parameterized mapping: using the measurement matrix as input, it predicts inlier/outlier scores and self-calibrates the camera, without requiring any ground truth whatsoever. SCPSfM relies on the DAQ projection and projective factorization constraints presented in (8) and (9). Based on the objective function of (10), we propose a total loss function which combines the projection assignment loss, the DAQ loss, and the inlier loss, resulting in the following total loss:


where the hyper-parameters balance the different loss terms and a sigmoid function is applied within the loss. A further hyper-parameter acts as a threshold that guarantees a minimum number of detected inliers. The projection matrices can be recovered from the factorization according to (2), whereas the DAQ can be derived from the predicted intrinsics and plane at infinity based on (5).

Our input is the measurement matrix, where each correspondence can be seen as a point in a space whose dimension grows with the number of views. Our output assigns an inlier or outlier label to each correspondence. This problem can be seen as a one-class point segmentation problem, for which a typical network for point cloud segmentation naturally meets our requirements. Therefore, we adopt PointNet [46] as the basic building block of our SCPSfM model.

Figure 1: Overview of our SCPSfM model. PointNet-Seg and PointNet-Cls share the same feature extraction part before the 1024-dim global feature extraction. The PointNet-Seg branch predicts a weight for each correspondence to distinguish inliers from outliers. The PointNet-Cls branch regresses the plane at infinity. In addition, the camera intrinsics are set as network parameters. The purple arrow indicates the flow unique to the PointNet-Cls branch, the red arrow the flow unique to the PointNet-Seg branch, and the orange arrow the common flow shared by both.

Implementation. We illustrate our network structure in Fig. 1. Following Section 4.4, the SCPSfM model is a mapping from the measurement matrix to the correspondence weights and the plane at infinity. The network structure can be divided into two branches: one assigns a weight to each correspondence and the other regresses the plane at infinity. In order to ensure constant intrinsics, as stated in Section 4.2, the camera intrinsics are set as network parameters. As shown in Fig. 1, the SCPSfM model combines PointNet-Seg and PointNet-Cls structures to realize the two branches, which share a common part for feature extraction. The PointNet-Seg branch outputs a vector of inlier/outlier scores, one per correspondence. The PointNet-Cls branch regresses the 3-dimensional coordinates of the plane at infinity. We implement the SCPSfM model in PyTorch [47] and use the Adam [48] optimizer to train the network.

Figure 2: F1 score, 2D error, and 3D error comparison between the SCPSfM model (with and without self-calibration constraints) and projective structure from motion (PSfM) [45]. The reported experiments were conducted with varying number of points, number of views, outlier rate, and noise extent (panels (a) to (h)).

5 Experiments

We evaluate the effectiveness of our model on synthetic and real datasets. On the synthetic data, we measure the effect of different factors (the outlier rate, the noise extent, the number of points, and the number of views) in order to show the robustness of our model. We also compare our model with the traditional Projective Structure from Motion (PSfM) method under the same settings. We further perform an ablation study of our model with and without the self-calibration support, i.e. the DAQ loss proposed in Section 4.4. On the real datasets, we combine our model with the state-of-the-art projective structure from motion method PSfM [32]. The real experiments also demonstrate that our model quickly rejects most of the outliers. We use our method to reject outliers and obtain the outlier-filtered measurement matrix. The final structure and motion are then recovered by feeding the outlier-filtered measurement matrix into the PSfM pipeline. This choice is primarily motivated by the final step of outlier filtering and bundle adjustment offered by PSfM. In fact, our experiments show that the proposed method is complementary to the PSfM pipeline.

5.1 Synthetic Dataset

5.1.1 Experiments Setup.

In order to create the synthetic dataset, we randomly generate 3D points and their projections under random camera motion. The rotation of the camera is sampled uniformly from a fixed set of angles around a single axis, and the translation of the camera is sampled uniformly from a fixed range. The 3D points are randomly sampled from fixed ranges along each axis. In order to guarantee that the measurement matrix is meaningful under high outlier rates and high noise, the projective depth used in the measurement matrix is the ground-truth projective depth instead of the estimate calculated from the fundamental matrix as done in [45]. The synthetic measurement matrix used to explore the effect of the number of points and the number of views has a fixed dimension, of which only a part consists of valid measurements; the remaining columns and rows are filled with zeros. The synthetic measurement matrix used to explore the effect of the outlier rate and the noise extent also has a fixed dimension, without zero columns or rows. Outlier correspondences are introduced by exchanging some of the corresponding points, and Gaussian noise is added to all elements of the measurement matrix. The hyper-parameters in Eq. (11) are kept fixed for all the synthetic experiments. The learning rate is set to 0.001. To evaluate the performance of the different methods, the F1 score, the 2D error, and the 3D error are adopted as metrics, following [42, 32]. The F1 score is calculated from the inlier detection precision and the inlier detection recall. The 3D error measures the discrepancy between the reconstructed 3D points and the ground-truth 3D points. The 2D error is the root mean square error, in pixels, between the reprojected 2D points and the ground-truth 2D points, averaged over all the points.
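For concreteness, two of the evaluation metrics can be sketched as follows. This is a minimal sketch under our own assumptions: the F1 score is taken as the harmonic mean of inlier precision and recall, and the 2D error as the root mean square of the per-point reprojection distances. The function names and the toy labels are hypothetical.

```python
import numpy as np

def f1_score(pred_inlier, true_inlier):
    """Harmonic mean of inlier precision and recall for binary labels."""
    pred = np.asarray(pred_inlier, dtype=bool)
    true = np.asarray(true_inlier, dtype=bool)
    tp = np.sum(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / np.sum(pred)
    recall = tp / np.sum(true)
    return float(2 * precision * recall / (precision + recall))

def rmse_2d(x_reproj, x_true):
    """Root mean square reprojection error in pixels; inputs have shape (num_points, 2)."""
    sq_dist = np.sum((np.asarray(x_reproj) - np.asarray(x_true)) ** 2, axis=1)
    return float(np.sqrt(np.mean(sq_dist)))

# Toy labels: three true positives, one false negative, one false positive.
true_inlier = np.array([1, 1, 1, 1, 0, 0])
pred_inlier = np.array([1, 1, 1, 0, 1, 0])
score = f1_score(pred_inlier, true_inlier)

# Toy reprojections: one point is off by (3, 4) pixels, the other is exact.
err = rmse_2d(np.array([[3.0, 4.0], [0.0, 0.0]]), np.zeros((2, 2)))
print(score, err)
```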

5.1.2 Experimental Results.

To explore the effect of four factors (the number of points, the number of views, the outlier rate, and the noise extent) on our model and on the traditional projective structure from motion (PSfM) method, we conduct controlled experiments in which one factor is varied at a time; the results are shown in Fig. 2. Fig. 2, including panels (a) and (e), shows that the performance of all the methods increases with an increasing number of points. It also shows that our model, both with and without the self-calibration part, performs better than the PSfM method. When the self-calibration constraints are introduced, the performance of our model for fewer points improves further. In Fig. 2, including panels (b) and (f), PSfM fails in the regime of high outlier rates, whereas our models with and without calibration constraints continue to provide meaningful results at higher outlier rates. These experiments verify that the use of self-calibration constraints helps to further improve the robustness of our model. In Fig. 2, including panels (c) and (g), the performance of PSfM drops quickly when the noise extent increases, whereas our model remains very stable in terms of F1 score. It is natural that the 2D and 3D errors of our model increase with increasing noise. Similarly, the performance of all the methods improves with an increasing number of views, which can be seen in Fig. 2, including panels (d) and (h). In the same figures, it can also be seen that our model still performs better than the PSfM method. As expected, our model with self-calibration constraints is better here again. Overall, our model with or without self-calibration constraints performs better than PSfM under a changing number of points, outlier rate, noise extent, and number of views, when measured in terms of F1 score, 2D error and 3D error. More importantly, our model with self-calibration constraints is consistently better than the one without, even if only by a small margin in some cases. For more results and analysis, please refer to the supplementary material.

5.2 Real Dataset

5.2.1 Experiments Setup.

To verify the effectiveness of our model on real data, we use multi-view image datasets: Courtyard [49], West Side [49], Dome [49] and KITTI [50]. In order to guarantee that there are common correspondences across multiple images, some of the views were rejected. The total number of views, the number of correspondences, and the image size are listed in Table 1. Due to the limited scale of the datasets, we report the training results to evaluate our method. Please note that our method is fully unsupervised. In the whole sequence, except for KITTI, every 10 multi-view images are used to generate one measurement matrix, i.e., the number of views is 10. For KITTI, the number of views is set to 11. Except for KITTI, the points in the 2D images are detected by SIFT [3] and correspondences are then established by the Brute Force Matcher [51]. For the KITTI dataset, the points and the correspondences are obtained through the Shi-Tomasi detector [52] and optical flow [51]. Due to the unavailability of the projective depth in the real datasets, the projective depth in the measurement matrix is estimated from the fundamental matrix and the epipole following [45]. In order to evaluate the performance of the different methods, the same 2D error metric as in the synthetic experiments is adopted. In addition, the run-times of the different methods are compared. Some qualitative matching results are shown in Fig. 3.

Sequence        Size        Views  Corrsp.  Ours+PSfM            PSfM [32]            COLMAP [53]
                                            2D error  Time (s)   2D error  Time (s)   2D error  Time (s)
Courtyard [49]  1936×1296   21     3000     0.2195    16.89      0.2506    46.74      0.4226    1696
West Side [49]  1936×1296   97     3000     0.2686    28.05      0.5216    118.93     0.5728    5141
Dome [49]       1296×1936   81     3000     0.1462    21.50      0.1554    30.49      -         -
KITTI [50]      1242×375    334    200      0.5259    0.08       Not Available        -         -

Table 1: Performance comparison between our method combined with PSfM, the original PSfM, and COLMAP on the real data. The 2D error and running time are reported for comparison; lower is better for both. We run PSfM and our method in the very same setup. The experimental setup for COLMAP is different, as it also uses camera intrinsics and a different pipeline; we therefore report the COLMAP results from [32]. Due to this difference in experimental setup, the COLMAP results are not meant to be compared directly and are reported here only for a general overview.

5.2.2 Experimental Results.

In Table 1, we list the 2D error and the run-time comparisons on the real datasets between our model combined with PSfM and pure PSfM. By first taking advantage of our model for rejecting the outliers, the outlier rate of the measurement matrix fed into the PSfM pipeline becomes much lower. In this way, it becomes easier for the PSfM pipeline to refine and reconstruct the structure and motion. Table 1 shows that the combination of our model with PSfM outperforms pure PSfM in terms of 2D error on the Courtyard, West Side, and Dome datasets: 0.2195 vs. 0.2506, 0.2686 vs. 0.5216, and 0.1462 vs. 0.1554, respectively. Besides, our model makes the PSfM refinement much faster: 16.89s vs. 46.74s, 28.05s vs. 118.93s, and 21.50s vs. 30.49s, respectively. On the West Side dataset in particular, we obtain a 48.5% improvement in 2D error and a 76.4% acceleration compared to the pure PSfM method. Moreover, in the difficult setting where only a few correspondences are available, as on the KITTI dataset, the pure PSfM method does not work and cannot produce a final result. Nevertheless, our model can work independently in this setting, reconstructing from a few correspondences and reaching a meaningful 2D reprojection error of 0.5259. In this setup, our method takes only 0.08s with GPU acceleration. This further confirms the conclusion drawn from the synthetic dataset that our model is more stable and robust than the traditional projective structure from motion method when fewer point correspondences are available. Due to space limitations, more quantitative and qualitative results on real data are provided in the supplementary material.
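The relative gains quoted above follow directly from the entries of Table 1. The short script below recomputes them as relative reductions with respect to pure PSfM; the dictionary layout and the helper name `relative_reduction` are ours, while the numbers are copied from the table.

```python
# (2D error, run time in seconds) copied from Table 1.
ours_psfm = {"Courtyard": (0.2195, 16.89), "West Side": (0.2686, 28.05), "Dome": (0.1462, 21.50)}
psfm = {"Courtyard": (0.2506, 46.74), "West Side": (0.5216, 118.93), "Dome": (0.1554, 30.49)}

def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline

summary = {
    name: (relative_reduction(psfm[name][0], ours_psfm[name][0]),
           relative_reduction(psfm[name][1], ours_psfm[name][1]))
    for name in psfm
}
for name, (err_gain, time_gain) in summary.items():
    print(f"{name}: 2D error reduced by {err_gain:.1f}%, run time reduced by {time_gain:.1f}%")
```

For West Side this reproduces the 48.5% and 76.4% figures stated in the text.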

Figure 3: Visualization of the correspondence inliers detected by our model on the Courtyard, West Side, Dome, and KITTI datasets.

5.2.3 Camera Intrinsics Prediction.

Since we use the DAQ constraints to realize self-calibration in an unsupervised way, we also validate the camera intrinsics predicted by our method. Using the ground-truth camera intrinsics of the synthetic dataset (Section 5.1) and of the real KITTI dataset (Section 5.2), we computed the errors in predicting the focal length. The intrinsics prediction error is calculated through , where is the predicted focal length while is the ground truth. On average, our model achieves the accuracy of and in predicting the focal length, respectively on the synthetic and the real KITTI dataset.
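To illustrate how a focal length can be read off the DAQ projection, the sketch below uses the standard relation that the dual image of the absolute conic is ω* ∼ P Q* Pᵀ = K Kᵀ. Several things here are assumptions made for illustration only: the camera is synthetic, the quadric is given in a metric frame (Q* = diag(1, 1, 1, 0)), the camera has zero skew and unit aspect ratio, and the relative error |f_pred − f_gt| / f_gt is a common choice that may differ from the formula used in the paper. None of this is the network's actual prediction procedure.

```python
import numpy as np


def project_daq(P: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Dual image of the absolute conic, omega* ~ P Q* P^T (defined up to scale)."""
    return P @ Q @ P.T


def focal_from_diac(omega: np.ndarray) -> float:
    """Focal length from omega* = K K^T, assuming zero skew and unit aspect ratio."""
    w = omega / omega[2, 2]  # remove the unknown projective scale
    u = w[0, 2]              # x-coordinate of the principal point
    return float(np.sqrt(w[0, 0] - u * u))


def relative_focal_error(f_pred: float, f_gt: float) -> float:
    """Assumed error metric: |f_pred - f_gt| / f_gt."""
    return abs(f_pred - f_gt) / f_gt


# Hypothetical ground-truth intrinsics and pose, for illustration only.
f_gt = 800.0
K = np.array([[f_gt, 0.0, 320.0],
              [0.0, f_gt, 240.0],
              [0.0, 0.0, 1.0]])
theta = 0.3
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([[0.1], [-0.2], [1.5]])
P = 3.7 * K @ np.hstack([R, t])      # arbitrary scale on the camera matrix
Q = np.diag([1.0, 1.0, 1.0, 0.0])    # DAQ in a metric frame (assumption)

f_hat = focal_from_diac(project_daq(P, Q))
print(f_hat, relative_focal_error(f_hat, f_gt))
```

The scale factor on `P` cancels after normalizing by the bottom-right entry, which is why the focal length is recovered regardless of how the projective camera is scaled.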

6 Conclusion

In this work, we propose the self-calibrating projective structure-from-motion (SCPSfM) model, a unified framework for projective structure from motion and self-calibration. To the best of our knowledge, it is the first unsupervised deep model for solving the projective structure-from-motion problem. By exploiting the projective factorization, our model outperforms the traditional projective structure-from-motion method in terms of both robustness and accuracy. Moreover, when the self-calibration constraints, i.e., the DAQ constraints, are also exploited, the performance improves further, especially in the cases of few views, few points, and high outlier rates.

The experiments on the synthetic and real datasets verify the effectiveness of our model in recovering structure and motion together with self-calibration, while being accurate and extremely robust to outliers.


References

  • [1] Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11) (November 2000) 1330–1334
  • [2] Tsai, R.: A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal on Robotics and Automation 3(4) (August 1987) 323–344
  • [3] Lowe, D.G.: Object Recognition from Local Scale-Invariant Features. (1999) 1150–1157
  • [4] Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. 1 (2006) 404–417
  • [5] Sturm, P.: Critical motion sequences for monocular self-calibration and uncalibrated Euclidean reconstruction. Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on (1997) 1100–1105

  • [6] Faugeras, O., Luong, Q.T., Maybank, S.: Camera self-calibration: Theory and experiments. (1992) 321–334
  • [7] Pollefeys, M., Van Gool, L., Oosterlinck, A.: The modulus constraint: a new constraint for self-calibration. International Conference on Pattern Recognition (1996) 31–42
  • [8] Triggs, B.: Autocalibration and Absolute Quadric. International Conference on Computer Vision and Pattern Recognition (CVPR’97) (1997) 609–614
  • [9] Hartley, R., Zisserman, A.: Multiple view geometry. Cambridge University Press (2003)
  • [10] Koenderink, J., van Doorn, A.: Affine structure from motion. Journal of the Optical Society of America. A, Optics and image science 8(2) (February 1991) 377–385
  • [11] Faugeras, O.: Stratification of three-dimensional vision: projective, affine, and metric representations: errata. J. Opt. Soc. Am. A 12(7) (July 1995) 1606+
  • [12] Luong, Q.T., Viéville, T.: Canonical Representations for the Geometries of Multiple Projective Views. Computer Vision and Image Understanding 64(2) (September 1996) 193–229
  • [13] Adlakha, D., Habed, A., Morbidi, F., Demonceaux, C., Mathelin, M.d.: Quarch: A new quasi-affine reconstruction stratum from vague relative camera orientation knowledge. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 1082–1090
  • [14] Liebowitz, D., Zisserman, A.: Combining scene and auto-calibration constraints. Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on 1 (1999) 293–300 vol.1
  • [15] Sturm, P., Maybank, S.: On Plane-Based Camera Calibration: A General Algorithm, Singularities, Applications (1999)
  • [16] Faugeras, O., Laveau, S., Robert, L., Csurka, G., Zeller, C.: 3d reconstruction of urban scene from sequence of images. Technical report, INRIA (1995)
  • [17] Habed, A., Pani Paudel, D., Demonceaux, C., Fofi, D.: Efficient pruning lmi conditions for branch-and-prune rank and chirality-constrained estimation of the dual absolute quadric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 493–500
  • [18] Chandraker, M., Agarwal, S., Kahl, F., Nistér, D., Kriegman, D.: Practical autocalibration. Computer Vision and Pattern Recognition (2007)
  • [19] Pollefeys, M., Koch, R., Van Gool, L.: Self-Calibration and Metric Reconstruction in Spite of Varying and Unknown Internal Camera Parameters. (1998) 90–95
  • [20] Nister, D.: Untwisting a projective reconstruction. International Journal of Computer Vision (November 2004) 165–183
  • [21] Gherardi, R., Fusiello, A.: Practical autocalibration. European Conference on Computer Vision (2010)
  • [22] Sturm, P., Triggs, B.: A factorization based algorithm for multi-image projective structure and motion. European Conference on Computer Vision, Cambridge, England (April 1996) 709–720
  • [23] Gurdjos, P., Bartoli, A., Sturm, P.: Is dual linear self-calibration artificially ambiguous? In: 2009 IEEE 12th International Conference on Computer Vision, IEEE (2009) 88–95
  • [24] Schaffalitzky, F., Zisserman, A.: Multi-view Matching for Unordered Image Sets, or “How Do I Organize My Holiday Snaps?”. (2002) 414–431
  • [25] Montserrat, D.M., Chen, J., Lin, Q., Allebach, J.P., Delp, E.J.: Multi-view matching network for 6d pose estimation. arXiv preprint arXiv:1911.12330 (2019)
  • [26] Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: European Conference on Computer Vision, Springer (2016) 501–518
  • [27] Serlin, Z., Yang, G., Sookraj, B., Belta, C., Tron, R.: Distributed and consistent multi-image feature matching via quickmatch. arXiv preprint arXiv:1910.13317 (2019)
  • [28] Mahamud, S., Hebert, M., Omori, Y., Ponce, J.: Provably-convergent iterative methods for projective structure from motion. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001. Volume 1., IEEE (2001) I–I
  • [29] Hartley, R., Schaffalitzky, F.: Powerfactorization: 3d reconstruction with missing or uncertain data. In: Australia-Japan advanced workshop on computer vision. Volume 74. (2003) 76–85
  • [30] Oliensis, J., Hartley, R.: Iterative extensions of the sturm/triggs algorithm: Convergence and nonconvergence. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(12) (2007) 2217–2233
  • [31] Dai, Y., Li, H., He, M.: Element-wise factorization for n-view projective reconstruction. In: European Conference on Computer Vision, Springer (2010) 396–409
  • [32] Magerand, L., Del Bue, A.: Practical projective structure from motion (p2sfm). In: ICCV. (2017)
  • [33] Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 1851–1858
  • [34] Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 3828–3838
  • [35] Chen, Y., Schmid, C., Sminchisescu, C.: Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 7063–7072
  • [36] Pedra, A.V.B.M., Mendonça, M., Finocchio, M.A.F., de Arruda, L.V.R., Castanho, J.E.C.: Camera calibration using detection and neural networks. IFAC Proceedings Volumes 46(7) (2013) 245–250
  • [37] Bogdan, O., Eckstein, V., Rameau, F., Bazin, J.C.: Deepcalib: a deep learning approach for automatic intrinsic calibration of wide field-of-view cameras. In: Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production. (2018) 1–10
  • [38] Hold-Geoffroy, Y., Sunkavalli, K., Eisenmann, J., Fisher, M., Gambaretto, E., Hadap, S., Lalonde, J.F.: A perceptual measure for deep single image camera calibration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 2354–2363
  • [39] Gordon, A., Li, H., Jonschkowski, R., Angelova, A.: Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 8977–8986
  • [40] Zhuang, B., Tran, Q.H., Ji, P., Lee, G.H., Cheong, L.F., Chandraker, M.: Degeneracy in self-calibration revisited and a deep learning solution for uncalibrated slam. arXiv preprint arXiv:1907.13185 (2019)
  • [41] Ranftl, R., Koltun, V.: Deep fundamental matrix estimation. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 284–299
  • [42] Probst, T., Paudel, D.P., Chhatkuli, A., Gool, L.V.: Unsupervised learning of consensus maximization for 3d vision problems. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019) 929–938
  • [43] Brachmann, E., Rother, C.: Neural-guided ransac: Learning where to sample model hypotheses. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 4322–4331
  • [44] Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., Gumhold, S., Rother, C.: Dsac-differentiable ransac for camera localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 6684–6692
  • [45] Sturm, P., Triggs, B.: A factorization based algorithm for multi-image projective structure and motion. In: European conference on computer vision (ECCV). (1996)
  • [46] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: CVPR. (2017)
  • [47] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: NIPS. (2019)
  • [48] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. ICLR (2014)
  • [49] Olsson, C., Enqvist, O.: Stable structure from motion for unordered image collections. In: Scandinavian Conference on Image Analysis. (2011)
  • [50] Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR) (2013)
  • [51] Bradski, G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000)
  • [52] Shi, J., et al.: Good features to track. In: CVPR. (1994)
  • [53] Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR). (2016)

7 Additional Results on Synthetic Dataset

In Section 5.1 of the main paper, our self-calibration supported robust projective structure-from-motion model (SCPSfM) is compared quantitatively with the traditional projective structure-from-motion (PSfM) method under different settings of the number of points, the number of views, the outlier rate, and the noise extent. Fig. 2 of the main paper compares our model with the calibration constraint against our model without calibration and against the PSfM method. It shows that our model, with or without the self-calibration constraint, outperforms the traditional PSfM method under all settings. Moreover, our model with the self-calibration constraint consistently performs better than our model without it, as can be seen from the clear margin between the two curves in Fig. 2 of the main paper. This margin is visible in all cases when varying the number of points, the number of views, and the outlier rate. However, because only a low range of noise extent was explored there, the curve of our model with the calibration constraint shows only a small improvement over the curve without it when the noise extent varies (ref. Fig. 2(c)(g)(k) in the main paper). To show the robustness of our model and the benefit it draws from the self-calibration constraint under different noise extents, we provide more experimental results on the synthetic dataset here, increasing the noise extent beyond the range used in the main paper. The results are plotted in Fig. 4, which shows that our model with the self-calibration constraint is more robust and performs much better, especially under high noise. Notably, our model with the self-calibration constraint can withstand the noise, while the PSfM method and our model without the calibration constraint do not work at all under such high noise. This further verifies the robustness of our SCPSfM model and the advantage it gains from the self-calibration constraint.
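Fig. 4 reports an F1 score for the inlier/outlier classification. As a reminder of what that number measures, here is a minimal sketch assuming the standard F1 definition with inliers as the positive class; the function name and the toy labels are hypothetical and are not taken from the experiments.

```python
import numpy as np


def inlier_f1(pred_inlier, true_inlier) -> float:
    """F1 score of an inlier/outlier labeling, with inliers as the positive class."""
    pred = np.asarray(pred_inlier, dtype=bool)
    truth = np.asarray(true_inlier, dtype=bool)
    tp = int(np.sum(pred & truth))    # inliers correctly kept
    fp = int(np.sum(pred & ~truth))   # outliers wrongly kept
    fn = int(np.sum(~pred & truth))   # inliers wrongly rejected
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


# Toy labels: 1 = inlier, 0 = outlier.
truth = [1, 1, 1, 0, 0, 1]
pred = [1, 1, 0, 1, 0, 1]
print(inlier_f1(pred, truth))  # 0.75
```

In the toy example three inliers are kept, one outlier slips through, and one inlier is rejected, so precision and recall are both 0.75.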

8 Additional Results on Real Dataset

In Section 5.2 of the main paper, we provide a quantitative performance comparison between our model combined with PSfM and pure PSfM on the real dataset. Table 1 of the main paper shows that our model both accelerates PSfM and reduces its error. To further verify this conclusion, we provide comparison results on additional real datasets, listed in Table 2. The experimental setup is exactly the same as in Section 5.2 of the main paper. Table 2 shows that the combination of our model with PSfM outperforms pure PSfM in 2D error: 0.2387 vs. 0.3187, 0.1576 vs. 0.1665, 0.2106 vs. 0.4261, and 0.1596 vs. 0.1912. The speed of PSfM also improves substantially with our model: 23.43s vs. 45.41s, 24.05s vs. 35.99s, 18.61s vs. 72.52s, and 28.75s vs. 44.78s. This further supports the benefit of our model for the accuracy and speed of projective structure-from-motion. Besides the quantitative results, Fig. 5 shows qualitative results: the correspondence inliers detected by our method combined with PSfM on the additional real datasets.

Sequence                                               Ours+PSfM            PSfM [32]
Name                 Size         Views  Corrsp.      2D error  Time (s)   2D error  Time (s)
De Guerre [49]       1296 × 1936  20     2000         0.2387    23.43      0.3187    45.41
Lund Cathedral [49]  1296 × 1936  50     3000         0.1576    24.05      0.1665    35.99
UWO [49]             1296 × 1936  20     3000         0.2106    18.61      0.4261    72.52
Water Tower [49]     1296 × 1936  170    3000         0.1596    28.75      0.1912    44.78
Table 2: Performance comparison between our method combined with PSfM and the original PSfM on the real data. The 2D error and running time are reported for comparisons. The best values for the 2D error and time are in bold. We run PSfM and our methods in the very same setup.
Figure 4: F1 score, 2D error, and 3D error comparison between the SCPSfM model (with and without self-calibration constraints) and projective structure-from-motion (PSfM) [45]. The reported experiments vary the noise extent while keeping the number of points, the number of views, and the outlier rate fixed.
Figure 5: Visualization of the correspondence inliers detected by our model combined with PSfM on the De Guerre, Lund Cathedral, UWO, and Water Tower datasets.