1 Introduction
Scene structure and camera motion recovery from uncalibrated images is a fundamental problem in Computer Vision and a major requirement for numerous three-dimensional capture systems. It has been established that, in the absence of any knowledge about the scene and camera, such structure can only be obtained up to a projective ambiguity. As such a reconstruction suffers from severe distortion, it is only useful in a limited set of applications (such as novel view synthesis). In practice, most applications require the recovered projective scene structure to be upgraded to metric. This upgrade is, however, not achievable without calibrating the camera. Traditionally, camera calibration relies on information obtained from a known calibration object present in the scene. Other methods rely on measurements directly extracted from the scene. One can argue that relying on scene information is fragile, as the assumed constraints may simply not be present in many cases.
With the development of flexible camera calibration techniques [1, 2], cameras with fixed intrinsic parameters can be reliably and accurately calibrated once and used so long as the parameters are kept unchanged. If the known calibration of the camera remains unchanged during the acquisition, 3D reconstruction boils down to solving linear systems of equations. However, these parameters may very well vary before or during the capture of the entire image sequence. The change may not take place in every image captured but may, nevertheless, occur under a change of focus or zoom. As recalibrating the camera after each such change is not always possible, it is safe to assume, in most cases, that the camera is uncalibrated at any instant. Means to calibrate it, other than relying on a special pattern or scene knowledge, are hence necessary. One way to do so is to resort to the more advanced and flexible approach of camera self-calibration, i.e. the recovery of the camera’s parameters using solely point correspondences across images. Point correspondences across images make it possible to locate a virtual object, the so-called Absolute Conic (AC), that is present in all scenes. The AC is a special conic lying on the plane at infinity whose projection onto images is independent of the rigid motion of the camera. In particular, the AC has the advantage of projecting onto an image conic (the image of the absolute conic, IAC) whose location depends only upon the intrinsic parameters of the camera under consideration. Camera constraints, such as partial knowledge or full constancy of the parameters, are used to fix the AC and its supporting plane and hence calibrate the camera. This task is generally cast as the problem of recovering a single object, the so-called Dual Absolute Quadric (DAQ), which encodes information about both the IAC (hence the camera intrinsics) and the AC’s supporting plane (i.e. the plane at infinity).
The recovery of the DAQ is a challenging nonlinear problem in which correct correspondences are assumed to be available. Using point feature detectors [3, 4], it is possible to extract a good number of reliable features in an image. However, finding good matches between the features obtained from two images of the same scene is not an easy task. The problem becomes even more difficult when it comes to simultaneously matching points across multiple images. Matching based on the epipolar constraint is a widely used technique for two views: given a point in one image, its corresponding point in the second image lies on a known line. This constraint is not sufficient to reject all outliers, as a mismatched point in the other image may still lie on that line. When multiple images are matched by considering only one pair of images at a time, a significant number of outliers may remain. These outliers can be further rejected by enforcing three-view or N-view constraints. In practice, only constraints over up to three views are used, because of the high computational cost for a larger number of views.
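The limitation of the two-view epipolar test can be illustrated with a minimal sketch. The function name `epipolar_distance` and the toy fundamental matrix are assumptions made for illustration (identity intrinsics and a pure translation along the x axis); they are not taken from the method described here.

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Distance from x2 (image 2) to the epipolar line F @ x1 of x1 (image 1)."""
    x1h = np.append(x1, 1.0)
    x2h = np.append(x2, 1.0)
    line = F @ x1h                       # line (a, b, c): a*u + b*v + c = 0
    return abs(x2h @ line) / np.hypot(line[0], line[1])

# Assumed toy setup: F = [t]_x with t = (1, 0, 0), so epipolar lines are horizontal.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])

x1 = np.array([100.0, 50.0])
true_match = np.array([120.0, 50.0])     # correct correspondence
mismatch = np.array([300.0, 50.0])       # wrong point that lies on the same line
off_line = np.array([120.0, 58.0])       # wrong point away from the line

d_true = epipolar_distance(F, x1, true_match)
d_mismatch = epipolar_distance(F, x1, mismatch)
d_off = epipolar_distance(F, x1, off_line)
```

Both `true_match` and `mismatch` have zero distance to the epipolar line, so the two-view test cannot separate them; only `off_line` is rejected.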
In general, feature extraction and matching are tackled independently from the self-calibration problem. However, it is unrealistic to address camera self-calibration under the assumption that perfectly matched sets of pixels across images are available. Tellingly, in order to facilitate the correspondence process, camera self-calibration techniques are often tested only on ordered image sequences when real images are considered. In the presence of mismatches, camera self-calibration techniques are doomed to failure. Note that when camera parameters are known, they can be used to support the matching of features through the inspection of the reprojection residual errors. However, no such approach exists when the calibration is unknown. The basic idea on which our work is based is as follows: if self-calibration were a linear process, then one could use it to support the correspondence process in a way similar to the role of the fundamental matrix, the trifocal (or multifocal) matching tensors or even reprojection: valid correspondences must necessarily lead to valid fundamental matrices, valid multifocal tensors and a valid reprojection of the reconstructed scene. However, self-calibration constraints being inherently nonlinear, they are seemingly of no use to support the search for correspondences across uncalibrated images. We argue here that a packaged solution for 3D reconstruction from multiple views will not be complete unless both of these problems are solved together, i.e., a robust multiview matching and a reliable camera self-calibration that exploit one another instead of being solved independently. How useful would point correspondences be if they yield an inaccurate, false or even impossible calibration? Not useful at all! Self-calibrating a camera with every candidate set of matches is unrealistic and computationally prohibitive.
Also, self-calibration being a nonlinear problem, it may very well fail because of numerical optimization issues rather than false matches. It is thus of the utmost importance to find a proper formulation to express the likely existence of a valid calibration (given a point-set match) rather than going all the way to recovering the camera parameters. The main goal of this paper is specifically to match multiview correspondences with the support of such intractable self-calibration constraints. To the best of our knowledge, this is the first work that (i) performs deep projective structure-from-motion, and (ii) exploits the self-calibration constraints for the task of multiview matching.
In this work, we design a deep unified framework for projective structure-from-motion and camera self-calibration, to support the multiview matching process. Using a set of putative correspondences across multiple views, the proposed framework predicts inlier/outlier scores of the correspondences together with the camera intrinsics and the plane at infinity. Notably, our deep network is trained end-to-end in an unsupervised manner. The unsupervised training for the intrinsics and the plane at infinity is possible thanks to the self-calibration constraints expressed in the form of DAQ projection equations. Our experiments show that the model is robust on its own and improves further when the self-calibration constraint is added, particularly in difficult settings such as few points, few views and high outlier rates. We show the practicality of our method, in terms of both robustness and accuracy, on real data and on extreme cases of synthetic data.
2 Related Works
Camera self-calibration is widely known to be a difficult problem. This is mainly due to two reasons: the nonlinear nature of the underlying equations and the numerous critical motion sequences [5]. Critical motions cause various levels of reconstruction ambiguity and lead to the failure of camera calibration. The preliminary work based on Kruppa’s equations proposed in [6] is historically seen as the first self-calibration method. However, its application to three or more views provides weaker constraints than those obtained through subsequent methods such as the one based on the modulus constraint [7] and the one relying on the Dual Absolute Quadric [8]. This is because Kruppa’s constraints rely only on the dual images of the AC and do not enforce that those images correspond to a unique conic (the AC) lying on the plane at infinity [9]. The plane at infinity and the AC are estimated in either of two ways: one after the other (stratified) or simultaneously (direct). The stratified method given in [10] for affine cameras was extended to perspective cameras in [11] with further developments in [12, 7, 13]. The combination of scene constraints with camera constraints over multiple views is described in [14, 15, 16]. The use of the modulus constraint to locate the plane at infinity was introduced by Pollefeys in [7]. The direct methods, which simultaneously estimate the plane at infinity and the dual IAC, basically deal with the DAQ [8, 17, 18]. The DAQ was introduced for camera self-calibration by Triggs [8], who proposed both quasi-linear and sequential programming methods to locate it. Pollefeys et al. [19] showed that the DAQ computation could be used for metric reconstruction under general motion even for varying focal length. In the case of a moving camera with varying parameters, there exist no tight constraints on the position of the plane at infinity. Either chirality constraints [20] or the finiteness constraint [21] are used within iterative search schemes. Stratified methods are sensitive to critical motions [9], whereas direct methods are less affected by critical sequences [22, 23] but are not flexible enough to be used in many cases.
All the above methods start with the common assumption that perfect correspondences are available among all the images, which is not practical. To the best of our knowledge, there is no research work that simultaneously deals with multiview matching of randomly captured images and self-calibration. One of the initial multiview matching works for a set of unordered real images is presented in [24]. However, this method does not find the correspondences among all the images. Recent methods for multiview matching, although often in a different context, have also been developed [25, 26, 27]. On the other hand, almost all projective SfM works [28, 29, 30] are primarily concerned with accuracy and optimality, without addressing robustness, with the two notable exceptions of [31, 32].
With recent developments, learning-based methods for 3D reconstruction and/or self-calibration have also been proposed [33, 34, 35, 36, 37, 38, 39, 40]. However, most of these works, when unsupervised, rely on a naive photometric error loss. This assumption immediately mandates ordered image sequences, or images captured under very similar conditions. Regarding learning-based matching for structure and/or motion, a few notable works include [41, 42, 43, 44].
3 Preliminaries
The Fundamental matrix encapsulates all the necessary (projective) geometric relationships for the two-view imaging model. However, when more than two views are involved, more sophisticated relationships (analogous to the Fundamental matrix), involving measurements from all the views, are required. These relationships are known as multi-view multilinear tensors, such as the trifocal tensor for three views and the quadrifocal tensor for four. Although view tensors successfully encapsulate the geometric relationships for up to four views, their usage is limited by their computational complexity. Therefore, a common practice for incorporating measurements from multiple views is the projective factorization method. Projective factorization takes 2D point measurements from multiple views and decomposes them into a scene structure and camera matrices that are consistent with this structure.
3.1 Projective Factorization
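Consider $n$ 3D points $X_j$, $j = 1, \dots, n$, observed by $m$ cameras $P_i$, $i = 1, \dots, m$. The observed image points are given by $x_{ij}$. For given point correspondences across images $\{x_{ij}\}$, the reconstruction task is to find 3D point coordinates $X_j$ and camera matrices $P_i$ such that,
$$x_{ij} \simeq P_i X_j. \tag{1}$$
If we write this equation explicitly by introducing scale variables (or projective depths) $\lambda_{ij}$, we have $\lambda_{ij} x_{ij} = P_i X_j$. Provided that the points are visible in all views (i.e. $x_{ij}$ is known for all $i$ and $j$), the complete set of equations may be written by stacking the vectors and matrices in the following form,
$$\begin{bmatrix} \lambda_{11} x_{11} & \cdots & \lambda_{1n} x_{1n} \\ \vdots & \ddots & \vdots \\ \lambda_{m1} x_{m1} & \cdots & \lambda_{mn} x_{mn} \end{bmatrix} = \begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix} \begin{bmatrix} X_1 & \cdots & X_n \end{bmatrix}. \tag{2}$$
The matrix on the left-hand side is known as the measurement matrix, say $W$. By construction, the matrix $W$ is of rank 4. This equation involves the scale variables $\lambda_{ij}$, which are not part of the measurement, for each measured point $x_{ij}$. Furthermore, note that the decomposition on the right-hand side of the above equality is not unique. To see this, write $P$ and $X$ for the stacked camera and point matrices in (2) and observe that, with any nonsingular $4 \times 4$ matrix $H$, we have $W = (P H)(H^{-1} X)$, which also satisfies (2). Such a reconstruction is a projective reconstruction and the matrix $H$ is called a projective homography matrix. There are several approaches that allow decomposing the measurement matrix in the form of Equation (2).
Sturm/Triggs Factorization: The first solution for decomposing $W$ as in (2) was proposed by Sturm and Triggs [45], where the initial estimate of the projective depths $\lambda_{ij}$ is assumed to be known. This may be obtained either from an initial projective reconstruction (for example, using the fundamental matrix) or by simply setting all $\lambda_{ij} = 1$. Once the projective depths are known, the measurement matrix $W$ is complete. In the case of noisy measurements, $W$ can be enforced to have rank 4 using the Singular Value Decomposition. Thus, if $W = U D V^\top$, all except the largest four diagonal entries of $D$ are forced to zero, resulting in $\hat{D}$. Then, the rank-constrained measurement matrix is $\hat{W} = U \hat{D} V^\top$. Using such a decomposition, the camera matrices and the scene points are retrieved from the four nonzero columns of $U \hat{D}$ and the corresponding four rows of $V^\top$ as,
$$\begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix} = U \hat{D}, \qquad \begin{bmatrix} X_1 & \cdots & X_n \end{bmatrix} = V^\top. \tag{3}$$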
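The rank-4 factorization and the projective ambiguity described above can be sketched numerically. This is a minimal illustration, assuming that the projective depths are already folded into the measurement matrix; the names `rank4_factorize`, `P_true`, `X_true` and the random test data are hypothetical.

```python
import numpy as np

def rank4_factorize(W):
    """Truncate W (3m x n) to rank 4 and split it into cameras and points."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    P = U[:, :4] * s[:4]      # stacked 3x4 cameras, shape (3m, 4)
    X = Vt[:4, :]             # homogeneous points, shape (4, n)
    return P, X

rng = np.random.default_rng(0)
m, n = 5, 30                                      # views, points (toy sizes)
P_true = rng.normal(size=(3 * m, 4))
X_true = np.vstack([rng.normal(size=(3, n)), np.ones((1, n))])
W = P_true @ X_true                               # noise-free, rank 4 by construction
W_noisy = W + 1e-3 * rng.normal(size=W.shape)     # noise raises the rank

P, X = rank4_factorize(W_noisy)

# Projective ambiguity: any nonsingular 4x4 H yields an equally valid factorization.
H = np.eye(4) + 0.1 * rng.normal(size=(4, 4))
P_alt, X_alt = P @ H, np.linalg.inv(H) @ X
```

The recovered `P` and `X` generally differ from `P_true` and `X_true` by an unknown 4x4 homography, which is exactly the ambiguity that the metric upgrade has to resolve.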
3.2 Projective-to-Metric Upgrade
For simplicity and without loss of generality, we assume that the coordinate frame of the first camera in both projective and metric space coincides with the world frame, such that the first cameras are respectively given by $P_1 = [\,I \mid 0\,]$ and $P_1^M = K [\,I \mid 0\,]$. The projective structure and motion can then be upgraded using,
$$P_i^M = P_i H, \qquad X_j^M = H^{-1} X_j, \qquad H = \begin{bmatrix} K & 0 \\ -p^\top K & 1 \end{bmatrix}, \tag{4}$$
where $p$ holds the coordinates of the so-called plane at infinity, say $\pi_\infty = (p^\top, 1)^\top$, in the projective space whose frame coincides with the first camera.
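The upgrade can be checked numerically with a small sketch. It assumes the textbook form of the upgrading homography from [9], built from an intrinsics matrix `K` and a 3-vector `p` for the plane at infinity; the function name and the numeric values are hypothetical.

```python
import numpy as np

def upgrade_homography(K, p):
    """4x4 projective-to-metric homography H = [[K, 0], [-p^T K, 1]] (assumed form)."""
    H = np.eye(4)
    H[:3, :3] = K
    H[3, :3] = -p @ K
    return H

# Hypothetical intrinsics (normalized image units) and plane-at-infinity vector.
K = np.array([[1.2, 0.0, 0.1],
              [0.0, 1.2, -0.05],
              [0.0, 0.0, 1.0]])
p = np.array([0.02, -0.01, 0.03])
H = upgrade_homography(K, p)

P1 = np.hstack([np.eye(3), np.zeros((3, 1))])   # first projective camera [I | 0]
P1_metric = P1 @ H                               # cameras are upgraded as P H

X = np.array([0.5, -0.2, 2.0, 1.0])              # a projective scene point
X_metric = np.linalg.solve(H, X)                 # points are upgraded as H^{-1} X

plane_inf = np.append(p, 1.0)                    # plane at infinity, projective frame
plane_inf_metric = H.T @ plane_inf               # planes transform with H^T
```

The first camera becomes `K [I | 0]`, the plane at infinity is mapped to its canonical position `(0, 0, 0, 1)`, and the image projections are unchanged because `(P H)(H^{-1} X) = P X`.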
3.3 DAQ for Camera Self-Calibration
The Dual Absolute Quadric (DAQ), $Q^*_\infty$, is a special degenerate quadric of planes in the dual 3D space [9]. The canonical form of the DAQ in metric space is given by $\tilde{I} = \mathrm{diag}(1, 1, 1, 0)$, which is fixed under metric transformations and takes the form $Q^*_\infty = H \tilde{I} H^\top$ in the projective space. Using the form of $H$ in (4), one can express the DAQ in projective space with respect to the first camera frame as,
$$Q^*_\infty = \begin{bmatrix} \omega^* & -\omega^* p \\ -p^\top \omega^* & p^\top \omega^* p \end{bmatrix}, \tag{5}$$
where $\omega^* = K K^\top$ is also known as the Dual Image of the Absolute Conic (DIAC) in the first image. Direct self-calibration methods rely on the existence of a DAQ of the form (5). More specifically, one can establish the relationships between the DAQ and the DIAC $\omega^*_i$ in each view using the projective projection matrices as follows,
$$\omega^*_i \simeq P_i Q^*_\infty P_i^\top. \tag{6}$$
In this regard, the task of self-calibration is finding a $Q^*_\infty$ that has the structure of (5) and satisfies (6), using the given projective projection matrices.
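A short numerical sketch can make the DAQ projection relation concrete. It assumes the textbook parameterization from [9], in which the DAQ is built from intrinsics `K` and a plane-at-infinity vector `p`; all function names and numeric values are hypothetical.

```python
import numpy as np

def daq(K, p):
    """Dual absolute quadric in the projective frame of the first camera (assumed form)."""
    w = K @ K.T                                  # dual image of the absolute conic
    Q = np.empty((4, 4))
    Q[:3, :3] = w
    Q[:3, 3] = -w @ p
    Q[3, :3] = -p @ w
    Q[3, 3] = p @ w @ p
    return Q

def upgrade_homography(K, p):
    H = np.eye(4)
    H[:3, :3] = K
    H[3, :3] = -p @ K
    return H

rng = np.random.default_rng(2)
K = np.array([[1.2, 0.0, 0.1],
              [0.0, 1.2, -0.05],
              [0.0, 0.0, 1.0]])
p = np.array([0.02, -0.01, 0.03])
Q = daq(K, p)
H = upgrade_homography(K, p)

# A metric camera K [R | t] with a random orthonormal R, expressed projectively.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
t = rng.normal(size=(3, 1))
P_metric = K @ np.hstack([R, t])
P_proj = P_metric @ np.linalg.inv(H)

diac_from_projection = P_proj @ Q @ P_proj.T     # should reproduce K K^T
```

In this noise-free setting the projection reproduces `K @ K.T` exactly; with real projective cameras the equality holds only up to scale, since each camera matrix has an arbitrary scale.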
4 Multiview Matching
The process of multiview matching assumes that the putative correspondences may be contaminated by a potentially overwhelming number of outlying matches. These outliers are filtered by maximizing the consensus set of correspondences that respect both the factorization of (2) and the DAQ projection of (6). In this process, we are interested in classifying correspondences with the help of , for a given noise-free, outlier-contaminated measurement matrix , using the following optimization problem,
(7) 
where denotes the elementwise matrix multiplication and are the projective structure, motion, and DAQ, respectively. Note that the assignment variable implies that the measurement corresponding to the point is an inlier; otherwise it is an outlier. The optimization problem of (7), however, has two major issues that need to be addressed prior to being used in practice. One concerns noise and the other an efficient use of the DAQ projection equation.
4.1 In the Presence of Noise
When the measurement matrix is also contaminated by noise, the constraint of (7) is often not satisfied for the desired solution. Therefore, we instead seek a matrix , which is closest, in Frobenius norm, to the outlier-filtered measurement matrix by ensuring the following,
(8) 
In fact, any with satisfies the constraint of (8). On the other hand, the rank-4 matrix that minimizes the objective of (8) can be obtained by using the singular value decomposition of , whenever the assignment variable is known. Note that any matrix of higher rank can be projected onto the rank-4 manifold by setting all but the four largest singular values to zero, similar to the Sturm/Triggs factorization discussed in Section 3.1.
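The effect of the assignment variable on the rank-4 projection can be sketched as follows. This is only an illustration under assumed conventions: the assignment is taken to be a binary vector `z` over the columns of the measurement matrix, and outlier columns are dropped rather than masked (zeroed columns would give the same residual). The function name and the toy data are hypothetical.

```python
import numpy as np

def rank4_projection(W, z):
    """Closest rank-4 matrix (Frobenius norm) to the columns of W selected by z."""
    Wz = W[:, z.astype(bool)]
    U, s, Vt = np.linalg.svd(Wz, full_matrices=False)
    W_hat = (U[:, :4] * s[:4]) @ Vt[:4, :]        # keep the four largest singular values
    return W_hat, np.linalg.norm(Wz - W_hat)

rng = np.random.default_rng(1)
W = rng.normal(size=(15, 4)) @ rng.normal(size=(4, 20))   # rank-4 inlier structure
W[:, :3] = rng.normal(size=(15, 3))                        # three outlier columns

z_all = np.ones(20)                  # treat every column as an inlier
z_clean = z_all.copy()
z_clean[:3] = 0                      # correct assignment: drop the outliers

_, residual_all = rank4_projection(W, z_all)
_, residual_clean = rank4_projection(W, z_clean)
```

With the correct assignment the residual vanishes (up to round-off), whereas keeping the outlier columns leaves a clearly nonzero residual. This gap is the signal that a consensus-maximizing formulation can exploit.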
4.2 DAQ Projection for Constant Intrinsics
The DAQ projection constraint of (6) may turn out to be weak if we assume that each camera can have different intrinsics. This, however, is not a problem in itself. One can still make use of the DAQ projection constraints under a known prior on the intrinsics. The known prior may include zero skew, unit aspect ratio, a principal point close to the image center, or changes only in focal length. In this work, we assume that all the cameras have constant intrinsics. Furthermore, we also need to consider that the projection equation will not be satisfied exactly in the presence of noise. Therefore, we minimize the following objective function,
(9) 
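One plausible way to turn the DAQ projection into a residual for constant intrinsics is sketched below. This is not necessarily the objective of (9): the Frobenius normalization used to remove the unknown per-view scale, and all names (`daq`, `daq_residual`, the toy cameras), are assumptions made for illustration.

```python
import numpy as np

def daq(K, p):
    """Dual absolute quadric from intrinsics K and plane vector p (textbook form)."""
    w = K @ K.T
    return np.block([[w, -(w @ p)[:, None]],
                     [-(p @ w)[None, :], np.array([[p @ w @ p]])]])

def daq_residual(cameras, K, p):
    """Sum over views of || normalized(P Q P^T) - normalized(K K^T) ||_F^2."""
    w = K @ K.T
    w = w / np.linalg.norm(w)
    Q = daq(K, p)
    total = 0.0
    for P in cameras:
        w_i = P @ Q @ P.T
        total += np.linalg.norm(w_i / np.linalg.norm(w_i) - w) ** 2
    return total

rng = np.random.default_rng(3)
K = np.array([[1.2, 0.0, 0.1],
              [0.0, 1.2, -0.05],
              [0.0, 0.0, 1.0]])
p_true = np.array([0.02, -0.01, 0.03])

# Inverse of the upgrading homography, H^{-1} = [[K^{-1}, 0], [p^T, 1]].
H_inv = np.eye(4)
H_inv[:3, :3] = np.linalg.inv(K)
H_inv[3, :3] = p_true

cameras = []
for _ in range(4):
    R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    t = rng.normal(size=(3, 1))
    scale = rng.uniform(0.5, 2.0)                 # arbitrary projective scale
    cameras.append(scale * (K @ np.hstack([R, t])) @ H_inv)

res_true = daq_residual(cameras, K, p_true)
res_wrong = daq_residual(cameras, K, p_true + 0.3)
```

The residual is zero for the plane at infinity that generated the cameras and positive for a perturbed one, provided the cameras translate; under pure rotation the plane vector has no influence on the projection.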
4.3 The Matching Objective
The primary goal of multiview matching is to compute an inlier/outlier assignment that also satisfies the DAQ projection conditions. Therefore, we aim at simultaneously estimating and by maximizing the surrogate objective of (7), stated as follows,
(10) 
where are the weights that take care of the influence of noise and constant intrinsics factors, respectively.
4.4 Self-Calibrating Projective SfM
In order to solve the optimization problem of (10), we present our deep self-calibrating projective SfM model (SCPSfM), which simultaneously performs projective factorization and camera calibration. In this process, we exploit the optimization accuracy and efficiency of a deep neural network. Starting from the noise- and outlier-contaminated measurement matrix , defined in Section 3.1, we predict per-correspondence weights , defined in (7), to detect the inlier () and outlier () correspondences. Additionally, we also accurately calibrate the camera intrinsics defined in (4) and the coordinates of the plane at infinity defined in (5). Let us denote our SCPSfM model as , which is parameterized by . Using the measurement matrix as input, our SCPSfM model predicts inlier/outlier scores and self-calibrates the camera, without requiring any ground truth whatsoever. SCPSfM relies on the DAQ projection and projective factorization constraints presented in (8) and (9). Based on the objective function of (10), we propose the total loss function , which combines the projection assignment loss , the DAQ loss , and the inlier loss , resulting in the following total loss:
(11) 
where and are hyperparameters that balance the different losses, and represents the sigmoid function. The hyperparameter represents a threshold that guarantees a minimum number of detected inliers. The projection matrices can be recovered from according to (2), whereas the DAQ can be derived using based on (5).
Our input is the measurement matrix, where each correspondence can be seen as a point in dimensional space. Our output assigns an inlier or outlier label to each correspondence. This problem can be seen as a one-class point segmentation problem, and a typical network for point cloud segmentation naturally meets our requirement. Therefore, we adopt PointNet [46] as the basic building block of our SCPSfM model.
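The idea of treating each correspondence as a point and scoring it with a PointNet-style segmentation network can be sketched in a few lines. This is a toy forward pass with random, untrained weights written in NumPy; it is not the architecture used here, and the layer sizes, the name `score_points` and the feature dimension `d` are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def score_points(points, params):
    """Per-point scores in (0, 1) from an (n, d) array of points."""
    W1, W2, W3 = params
    local = relu(points @ W1)                      # shared per-point layer
    global_feat = local.max(axis=0)                # order-independent pooling
    joint = np.hstack([local, np.broadcast_to(global_feat, local.shape)])
    logits = relu(joint @ W2) @ W3                 # per-point head on local + global
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))     # sigmoid score per point

rng = np.random.default_rng(4)
n, d, h = 50, 30, 16                               # points, input dim, hidden width
params = (rng.normal(size=(d, h)) * 0.1,
          rng.normal(size=(2 * h, h)) * 0.1,
          rng.normal(size=(h, 1)) * 0.1)

points = rng.normal(size=(n, d))                   # one row per correspondence
scores = score_points(points, params)

perm = rng.permutation(n)
scores_permuted = score_points(points[perm], params)
```

Because the only interaction between points is the max pooling, reordering the correspondences simply reorders the scores, which is the property that makes this family of networks suitable for unordered sets of matches.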
Implementation. We illustrate our network structure in Fig. 1. Based on the theory presented in Section 4.4, we denote the SCPSfM model as the mapping . The network structure of our SCPSfM model can be divided into two branches: one assigns a weight to each correspondence and the other regresses the plane at infinity . In order to ensure constant intrinsics , as stated in Section 4.2, the camera intrinsics are set as network parameters, a part of . As shown in Fig. 1, the SCPSfM model combines PointNetSeg and PointNetCls structures to realize the two branches. The two branches share a common part for feature extraction. The PointNetSeg branch outputs the dimensional vector of inlier/outlier scores for the correspondences. The PointNetCls branch regresses the 3-dimensional coordinates of the plane at infinity . We implement the SCPSfM model in PyTorch [47] and use the Adam [48] optimizer to train the network.
5 Experiments
We evaluated the effectiveness of our model on synthetic and real datasets. On the synthetic data, we measure the effect of different factors: the outlier rate , noise extent , number of points and number of views , in order to show the robustness of our model. We also compare our model with the traditional Projective Structure from Motion (PSfM) method under the same settings. We further perform an ablation study of our model with and without the self-calibration support, i.e., the DAQ loss proposed in Section 4.4. On the real dataset, we combine our model with the state-of-the-art projective structure from motion method, PSfM [32]. The real experiments also demonstrate that our model quickly rejects most of the outliers. We use our method to reject outliers and obtain the outlier-filtered measurement matrix. The final structure and motion are then recovered by feeding the outlier-filtered measurement matrix into the PSfM pipeline. This choice is primarily motivated by the final step of outlier filtering and bundle adjustment offered by PSfM. In fact, our experiments show that the proposed method is complementary to the PSfM pipeline.
5.1 Synthetic Dataset
5.1.1 Experiments Setup.
In order to create the synthetic dataset, we randomly generate number of points and number of projections with random camera motion. The rotation of the camera is sampled uniformly from the set around axis and the translation of the camera is uniformly sampled from . The 3D points are randomly sampled from along and axis and along axis. In order to guarantee that the measurement matrix is meaningful under high outlier rate and high noise extent, the projective depth used in the measurement matrix is the ground truth projective depth instead of the estimated projective depth calculated from the fundamental matrix as done in [45]. The synthetic measurement matrix for exploring the effect of the number of points and the number of views has a fixed dimension of , but only of which consists of valid measurements. The rest of the columns and rows are filled with zeros. The dimension of the synthetic measurement matrix for exploring the effect of the outlier rate and the noise extent is fixed as without zero columns or rows. The outlier correspondences are introduced by exchanging some of the correspondence points, and Gaussian noise is added to all elements of the measurement matrix. We set the hyperparameters , , in Eq. (11) for all the synthetic experiments. The learning rate is set to 0.001. In order to evaluate the performance of the different methods, the F1 score, 2D error and 3D error are adopted as evaluation metrics, following [42, 32]. The F1 score is calculated from the inlier detection accuracy and the inlier detection recall rate. The 3D error is calculated by where is the reconstructed 3D point and is the ground truth 3D point. The 2D error is the root mean square error in pixels between the reprojected 2D point and the ground truth 2D point, averaged over all the points.
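As a concrete reading of the F1 metric, the sketch below computes it from binary inlier labels. It assumes that the "inlier detection accuracy" mentioned above corresponds to precision (the fraction of predicted inliers that are true inliers); the function name and the toy labels are hypothetical.

```python
import numpy as np

def f1_score(pred_inlier, true_inlier):
    """F1 of inlier detection from binary prediction and ground-truth labels."""
    pred = np.asarray(pred_inlier, dtype=bool)
    true = np.asarray(true_inlier, dtype=bool)
    tp = np.sum(pred & true)                      # correctly detected inliers
    precision = tp / max(np.sum(pred), 1)         # assumed meaning of "accuracy"
    recall = tp / max(np.sum(true), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = [1, 1, 1, 0, 0, 1]
true = [1, 1, 0, 0, 1, 1]
score = f1_score(pred, true)
```

In this toy case three of the four predicted inliers are correct and three of the four true inliers are found, so precision, recall and F1 all equal 0.75.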
5.1.2 Experimental Results.
To explore the effect of four factors, namely the number of points , the number of views , the outlier rate and the noise extent , on our model and the traditional projective structure from motion (PSfM) method, we conduct controlled-variable experiments, whose results are shown in Fig. 4. Fig. 2, Fig. 1(a) and Fig. 1(e) show that the performance of all the methods increases with an increasing number of points. They also show that our model, with or without the self-calibration part, performs better than the PSfM method. When the self-calibration constraints are introduced, the performance of our model for fewer points improves further. In Fig. 2, Fig. 1(b) and Fig. 1(f), PSfM fails in the regime of high outlier rates, whereas our models with/without calibration constraints provide meaningful results up to outlier rate, respectively. These experiments verify that the use of self-calibration constraints helps to further improve the robustness of our model. In Fig. 2, Fig. 1(c) and Fig. 1(g), the performance of PSfM drops quickly when the noise extent increases, whereas our model remains very stable in terms of F1 score. It is natural that the 2D and 3D errors of our model increase with increasing noise. Similarly, the performance of all the methods improves with an increasing number of views, which can be seen in Fig. 2, Fig. 1(d) and Fig. 1(h). In the same figures, it can also be seen that our model still performs better than the PSfM method. As expected, our model with self-calibration constraints is better here again. Overall, our model with/without self-calibration constraints performs better than PSfM under changing number of points, outlier rate, noise extent and number of views, when measured in terms of F1 score, 2D and 3D errors. More importantly, our model with self-calibration constraints is consistently better than the one without, even if it is by a small margin in some cases. For more results and analysis, please refer to the supplementary material.
5.2 Real Dataset
5.2.1 Experiments Setup.
To verify the effectiveness of our model on real data, we use multiview image datasets, namely Courtyard [49], West Side [49], Dome [49] and KITTI [50]. In order to guarantee that there are common correspondences across multiple images, some of the views were rejected. The total number of views, the number of correspondences, and the image size are listed in Table 2. Due to the limited scale of the datasets, we report the training results to evaluate our method. Please note that our method is fully unsupervised. In the whole sequence, except for KITTI, every 10 multiview images are used to generate one measurement matrix, i.e., number of views . For KITTI, the number of views is set to 11. Except for KITTI, the points in the 2D images are detected by SIFT [3] and the correspondences are then established by the Brute Force Matcher [51]. For the KITTI dataset, point detection and correspondence matching are performed with the Shi-Tomasi detector [52] and optical flow [51]. Since the projective depth is unavailable in the real datasets, the projective depth in the measurement matrix is estimated from the fundamental matrix and the epipole following [45]. In order to evaluate the performance of the different methods, the same 2D error metric as in the synthetic data experiments is adopted. In addition, the runtimes of the different methods are compared. Some qualitative matching results are shown in Fig. 3.
Sequence                                      Ours+PSfM            PSfM [32]                 COLMAP [53]
Name            Size         Views  Corrsp.   2D error  Time (s)   2D error       Time (s)   2D error  Time (s)
Courtyard [49]  1936 × 1296  21     3000      0.2195    16.89      0.2506         46.74      0.4226    1696
West Side [49]  1936 × 1296  97     3000      0.2686    28.05      0.5216         118.93     0.5728    5141
Dome [49]       1296 × 1936  81     3000      0.1462    21.50      0.1554         30.49      -         -
KITTI [50]      1242 × 375   334    200       0.5259    0.08       Not Available  -          -         -
5.2.2 Experimental Results.
In Table 2, we list the 2D error and the runtime comparisons on the real datasets between our model combined with PSfM and pure PSfM. By first using our model to reject the outliers, the outlier rate of the measurement matrix fed into the PSfM pipeline becomes much lower. In this way, it becomes easier for the PSfM pipeline to refine the reconstructed structure and motion. Table 2 shows that the combination of our model with PSfM outperforms pure PSfM in terms of 2D error on all the real datasets where both are available: 0.2195 vs. 0.2506, 0.2686 vs. 0.5216, and 0.1462 vs. 0.1554, respectively. Besides, our model makes the PSfM refinement much faster: 16.89s vs. 46.74s, 28.05s vs. 118.93s, and 21.50s vs. 30.49s, respectively. In particular, on the West Side dataset, we obtain a 48.5% improvement in 2D error and a 76.4% reduction in runtime compared to the pure PSfM method. Moreover, in the difficult setting where only a few correspondences are available, as on the KITTI dataset, the pure PSfM method does not work and cannot produce a final result. Nevertheless, our model can work independently in this setting to reconstruct from a few correspondences and reaches a meaningful 2D reprojection error of 0.5259. In this setup, our method takes only 0.08s with GPU acceleration. This further confirms the conclusion drawn from the synthetic dataset that our model is more stable and robust than the traditional projective structure from motion method when fewer point correspondences are available. Due to space limitations, more quantitative and qualitative results on real data are provided in the supplementary material.
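The West Side percentages quoted above follow from the table entries if both are read as relative reductions with respect to pure PSfM. The helper below is a hypothetical illustration of that arithmetic.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# West Side entries from the table: 2D error and runtime in seconds.
error_gain = relative_reduction(0.5216, 0.2686)
time_gain = relative_reduction(118.93, 28.05)
```

Rounded to one decimal, `error_gain` is 48.5 and `time_gain` is 76.4. Note that the runtime figure is a relative reduction, which corresponds to a speed-up factor of roughly 4.2.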
5.2.3 Camera Intrinsics Prediction.
Since we use the DAQ constraints to realize self-calibration in an unsupervised way, we also validate the camera intrinsics predicted by our method. Using the ground truth camera intrinsics , in the synthetic dataset in Section 5.1, and the real KITTI dataset, in Section 5.2, we computed the errors in predicting the focal length. The intrinsics prediction error is calculated through , where is the predicted focal length while is the ground truth. On average, our model achieves an accuracy of and in predicting the focal length on the synthetic and the real KITTI datasets, respectively.
6 Conclusion
In this work, we propose the self-calibrating projective structure from motion (SCPSfM) model, a unified framework for projective structure from motion and self-calibration. To the best of our knowledge, it is the first unsupervised deep model for solving the projective structure from motion problem. By exploiting projective factorization, our model outperforms the traditional projective structure from motion method in terms of both robustness and accuracy. Moreover, when the self-calibration constraints, i.e., the DAQ constraints, are further exploited, the performance improves further, especially in the cases of few views, few points, and high outlier rates.
The experiments on the synthetic and real datasets verify the effectiveness of our model on recovering structure and motion together with selfcalibration, while being accurate and extremely robust to outliers.
References
 [1] Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11) (November 2000) 1330–1334
 [2] Tsai, R.: A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal on Robotics and Automation 3(4) (August 1987) 323–344
 [3] Lowe, D.G.: Object Recognition from Local Scale-Invariant Features. (1999) 1150–1157
 [4] Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. 1 (2006) 404–417
 [5] Sturm, P.: Critical motion sequences for monocular self-calibration and uncalibrated Euclidean reconstruction. Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on (1997) 1100–1105
 [6] Faugeras, Q. Luong, O., Maybank, S.: Camera selfcalibration: Theory and experiments. (1992) 321–334
 [7] Pollefeys, L. Gool, M., Oosterlinck, A.: The modulus constraint: a new constraint for self calibration. International conference of pattern recognition (1996) 31–42
 [8] Triggs, B.: Autocalibration and Absolute Quadric. International Conference on Computer Vision and Pattern Recognition (CVPR’97) (1997) 609–614
 [9] Hartley, R., Zisserman, A.: Multiple view geometry. Cambridge University Press (2003)
 [10] Koenderink, J., van Doorn, A.: Affine structure from motion. Journal of the Optical Society of America. A, Optics and image science 8(2) (February 1991) 377–385
 [11] Faugeras, O.: Stratification of three-dimensional vision: projective, affine, and metric representations: errata. J. Opt. Soc. Am. A 12(7) (July 1995) 1606+
 [12] Luong, Q.T., Viéville, T.: Canonical Representations for the Geometries of Multiple Projective Views. Computer Vision and Image Understanding 64(2) (September 1996) 193–229
 [13] Adlakha, D., Habed, A., Morbidi, F., Demonceaux, C., Mathelin, M.d.: QUARCH: A new quasi-affine reconstruction stratum from vague relative camera orientation knowledge. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 1082–1090
 [14] Liebowitz, D., Zisserman, A.: Combining scene and autocalibration constraints. Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on 1 (1999) 293–300 vol.1
 [15] Sturm, P., Maybank, S.: On Plane-Based Camera Calibration: A General Algorithm, Singularities, Applications (1999)
 [16] Faugeras, O., Laveau, S., Robert, L., Csurka, G., Zeller, C.: 3D reconstruction of urban scene from sequence of images. Technical report, INRIA (1995)
 [17] Habed, A., Pani Paudel, D., Demonceaux, C., Fofi, D.: Efficient pruning LMI conditions for branch-and-prune rank and chirality-constrained estimation of the dual absolute quadric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 493–500
 [18] Chandraker, M., Agarwal, S., Kahl, F., Nistér, D., Kriegman, D.: Practical autocalibration. Computer Vision and Pattern Recognition (2007)
 [19] Pollefeys, M., Van Gool, L., Koch, R.: Self-Calibration and Metric Reconstruction in Spite of Varying and Unknown Internal Camera Parameters. (1998) 90–95
 [20] Nister, D.: Untwisting a projective reconstruction. International Journal of Computer Vision (November 2004) 165–183
 [21] Gherardi, R., Fusiello, A.: Practical autocalibration. European Conference on Computer Vision (2010)
 [22] Sturm, P., Triggs, B.: A factorization based algorithm for multi-image projective structure and motion. European Conference on Computer Vision, Cambridge, England (April 1996) 709–720
 [23] Gurdjos, P., Bartoli, A., Sturm, P.: Is dual linear self-calibration artificially ambiguous? In: 2009 IEEE 12th International Conference on Computer Vision, IEEE (2009) 88–95
 [24] Schaffalitzky, F., Zisserman, A.: Multi-view Matching for Unordered Image Sets, or "How Do I Organize My Holiday Snaps?". (2002) 414–431
 [25] Montserrat, D.M., Chen, J., Lin, Q., Allebach, J.P., Delp, E.J.: Multi-view matching network for 6D pose estimation. arXiv preprint arXiv:1911.12330 (2019)
 [26] Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: European Conference on Computer Vision, Springer (2016) 501–518
 [27] Serlin, Z., Yang, G., Sookraj, B., Belta, C., Tron, R.: Distributed and consistent multi-image feature matching via QuickMatch. arXiv preprint arXiv:1910.13317 (2019)
 [28] Mahamud, S., Hebert, M., Omori, Y., Ponce, J.: Provably-convergent iterative methods for projective structure from motion. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001. Volume 1., IEEE (2001) I–I
 [29] Hartley, R., Schaffalitzky, F.: PowerFactorization: 3D reconstruction with missing or uncertain data. In: Australia-Japan advanced workshop on computer vision. Volume 74. (2003) 76–85
 [30] Oliensis, J., Hartley, R.: Iterative extensions of the sturm/triggs algorithm: Convergence and nonconvergence. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(12) (2007) 2217–2233
 [31] Dai, Y., Li, H., He, M.: Element-wise factorization for n-view projective reconstruction. In: European Conference on Computer Vision, Springer (2010) 396–409
 [32] Magerand, L., Del Bue, A.: Practical projective structure from motion (p2sfm). In: ICCV. (2017)
 [33] Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 1851–1858
 [34] Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 3828–3838
 [35] Chen, Y., Schmid, C., Sminchisescu, C.: Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 7063–7072
 [36] Pedra, A.V.B.M., Mendonça, M., Finocchio, M.A.F., de Arruda, L.V.R., Castanho, J.E.C.: Camera calibration using detection and neural networks. IFAC Proceedings Volumes 46(7) (2013) 245–250
 [37] Bogdan, O., Eckstein, V., Rameau, F., Bazin, J.C.: DeepCalib: a deep learning approach for automatic intrinsic calibration of wide field-of-view cameras. In: Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production. (2018) 1–10
 [38] Hold-Geoffroy, Y., Sunkavalli, K., Eisenmann, J., Fisher, M., Gambaretto, E., Hadap, S., Lalonde, J.F.: A perceptual measure for deep single image camera calibration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 2354–2363
 [39] Gordon, A., Li, H., Jonschkowski, R., Angelova, A.: Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 8977–8986
 [40] Zhuang, B., Tran, Q.H., Ji, P., Lee, G.H., Cheong, L.F., Chandraker, M.: Degeneracy in self-calibration revisited and a deep learning solution for uncalibrated SLAM. arXiv preprint arXiv:1907.13185 (2019)
 [41] Ranftl, R., Koltun, V.: Deep fundamental matrix estimation. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 284–299
 [42] Probst, T., Paudel, D.P., Chhatkuli, A., Gool, L.V.: Unsupervised learning of consensus maximization for 3d vision problems. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019) 929–938
 [43] Brachmann, E., Rother, C.: Neural-guided RANSAC: Learning where to sample model hypotheses. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 4322–4331
 [44] Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., Gumhold, S., Rother, C.: DSAC - differentiable RANSAC for camera localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 6684–6692
 [45] Sturm, P., Triggs, B.: A factorization based algorithm for multi-image projective structure and motion. In: European conference on computer vision (ECCV). (1996)
 [46] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: CVPR. (2017)
 [47] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: NIPS. (2019)
 [48] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. ICLR (2014)
 [49] Olsson, C., Enqvist, O.: Stable structure from motion for unordered image collections. In: Scandinavian Conference on Image Analysis. (2011)
 [50] Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR) (2013)
 [51] Bradski, G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000)
 [52] Shi, J., et al.: Good features to track. In: CVPR. (1994)
 [53] Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
7 Additional Results on Synthetic Dataset
In Section 5.1 of the main paper, we quantitatively compare our self-calibration supported robust projective structure-from-motion model (SCPSfM) with the traditional projective structure-from-motion (PSfM) method under different numbers of points, numbers of views, outlier rates, and noise extents. Fig. 2 of the main paper shows that our model, with or without the self-calibration constraint, outperforms the traditional PSfM method under all settings. Moreover, our model with the self-calibration constraint consistently performs better than our model without it, as seen from the clear margin between the corresponding curves when varying the number of points, the number of views, and the outlier rate. However, because of the low range used to explore the effect of the noise extent, the curve of our model with the calibration constraint shows only a small improvement over the curve without it when the noise extent is varied (see Fig. 2(c)(g)(k) in the main paper). To show the robustness of our model and the benefit it draws from the self-calibration constraint under different noise levels, we provide further experimental results on the synthetic dataset here, in which the noise extent is increased beyond the range used in the main paper. The results are plotted in Fig. 4, which shows that our model with the self-calibration constraint is more robust and performs much better, especially under high noise.
Notably, our model with the self-calibration constraint withstands the noise, whereas the PSfM method and our model without the calibration constraint fail under such high noise. This further verifies the robustness and advantage that our SCPSfM model gains from the self-calibration constraint.
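A test bed of this kind can be sketched as follows. The generator below only illustrates the four settings varied in the experiment (number of points, number of views, outlier rate, noise extent). The camera model, the value ranges, the image size, and the function name make_synthetic_tracks are assumptions made for illustration, not the dataset construction used in the main paper.

```python
import numpy as np

def make_synthetic_tracks(n_points=100, n_views=10, outlier_rate=0.2,
                          noise_std=1.0, seed=0):
    """Return noisy 2D tracks (n_views, n_points, 2) and a boolean outlier mask."""
    rng = np.random.default_rng(seed)
    f, cx, cy = 800.0, 320.0, 240.0           # assumed intrinsics (pixels)
    K = np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])

    # Random 3D points in a unit cube, in homogeneous coordinates.
    X = rng.uniform(-1.0, 1.0, size=(n_points, 3))
    X_h = np.hstack([X, np.ones((n_points, 1))])

    tracks = np.empty((n_views, n_points, 2))
    for v in range(n_views):
        # Small rotation about the y-axis and a translation keeping depths positive.
        a = rng.uniform(-0.3, 0.3)
        R = np.array([[np.cos(a), 0.0, np.sin(a)],
                      [0.0, 1.0, 0.0],
                      [-np.sin(a), 0.0, np.cos(a)]])
        t = np.array([rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5), 5.0])
        P = K @ np.hstack([R, t[:, None]])
        x = (P @ X_h.T).T
        tracks[v] = x[:, :2] / x[:, 2:3]       # perspective division

    # Additive Gaussian pixel noise on every observation.
    tracks += rng.normal(0.0, noise_std, size=tracks.shape)

    # Replace a random subset of observations with uniformly random image points.
    is_outlier = rng.random((n_views, n_points)) < outlier_rate
    n_out = int(is_outlier.sum())
    tracks[is_outlier] = rng.uniform([0.0, 0.0], [2 * cx, 2 * cy], size=(n_out, 2))
    return tracks, is_outlier

tracks, is_outlier = make_synthetic_tracks(noise_std=5.0)
print(tracks.shape, is_outlier.mean())
```

Raising noise_std while keeping the other arguments fixed mimics the higher-noise regime explored in this section.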
8 Additional Results on Real Dataset
In Section 5.2 of the main paper, we quantitatively compare our model combined with PSfM against pure PSfM on the real dataset. Table 1 of the main paper shows that our model both accelerates PSfM and reduces its error. To further verify this conclusion, we provide comparison results on additional real datasets, listed in Table 2. The experimental setup is exactly the same as in Section 5.2 of the main paper. Table 2 shows that the combination of our model with PSfM outperforms pure PSfM in terms of 2D error: 0.2387 vs. 0.3187, 0.1576 vs. 0.1665, 0.2106 vs. 0.4261 and 0.1596 vs. 0.1912. Moreover, our model also makes PSfM considerably faster: 23.43s vs. 45.41s, 24.05s vs. 35.99s, 18.61s vs. 72.52s and 28.75s vs. 44.78s. This further confirms the benefit of our model for the accuracy and speed of projective structure-from-motion. Besides the quantitative results, Fig. 5 provides qualitative results showing the correspondence inliers detected by our method combined with PSfM on the additional real datasets.
| Sequence | Size | Views | Corrsp. | Ours+PSfM 2D error | Ours+PSfM Time (s) | PSfM [32] 2D error | PSfM [32] Time (s) |
|---|---|---|---|---|---|---|---|
| De Guerre [49] | 1296 × 1936 | 20 | 2000 | 0.2387 | 23.43 | 0.3187 | 45.41 |
| Lund Cathedral [49] | 1296 × 1936 | 50 | 3000 | 0.1576 | 24.05 | 0.1665 | 35.99 |
| UWO [49] | 1296 × 1936 | 20 | 3000 | 0.2106 | 18.61 | 0.4261 | 72.52 |
| Water Tower [49] | 1296 × 1936 | 170 | 3000 | 0.1596 | 28.75 | 0.1912 | 44.78 |
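The gains reported in the table can be summarized as relative error reductions and speed-up factors. The short script below does this arithmetic on the values copied from the table. The dictionary layout and the function name summarize are illustrative only.

```python
# Values copied from the table above: (2D error, time in seconds).
results = {
    "De Guerre":      {"ours": (0.2387, 23.43), "psfm": (0.3187, 45.41)},
    "Lund Cathedral": {"ours": (0.1576, 24.05), "psfm": (0.1665, 35.99)},
    "UWO":            {"ours": (0.2106, 18.61), "psfm": (0.4261, 72.52)},
    "Water Tower":    {"ours": (0.1596, 28.75), "psfm": (0.1912, 44.78)},
}

def summarize(results):
    """Relative 2D-error reduction and speed-up of Ours+PSfM over pure PSfM."""
    summary = {}
    for name, r in results.items():
        err_ours, t_ours = r["ours"]
        err_psfm, t_psfm = r["psfm"]
        summary[name] = {
            "error_reduction": (err_psfm - err_ours) / err_psfm,
            "speedup": t_psfm / t_ours,
        }
    return summary

summary = summarize(results)
for name, s in summary.items():
    print(f"{name:15s} error reduction: {100 * s['error_reduction']:5.1f}%  "
          f"speed-up: {s['speedup']:.2f}x")
```

On these numbers, the UWO sequence benefits most in both respects (roughly half the 2D error and close to a fourfold speed-up), while Lund Cathedral shows the smallest error reduction.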