1 Introduction
Structure-from-Motion (SfM) refers to a process in which a set of 3D points is reconstructed from their projections onto a given set of images. Almost all SfM techniques fall into one of two categories: sequential or incremental techniques snavely2006photo; agarwal2011building; frahm2010building; wu2013towards; furukawa2010towards; havlena2009randomized; snavely2008skeletal, and global or batch techniques jiang2013global; moulon2013global; arie2012global; crandall2011discrete; sweeney2015optimizing. As incremental approaches often suffer from large drift error, global methods are recommended for better accuracy and more consistent models. The global methods consist of the following main steps: 1) Pairwise image registration: estimating relative rotations and movement directions between pairs of images using feature points torr2000mlesac; nister2004efficient; kukelova2008polynomial. The result is a graph, i.e. the viewing graph, whose edges are the relative pose measurements. 2) Solving the viewing graph: estimating the camera poses ozyesil2015robust; sweeney2015optimizing; jiang2013global; zhu2018very; hartley2013rotation; chatterjee2017robust; arrigoni2018robust, which is the topic of this paper. 3) Triangulation: reconstructing 3D points by triangulating corresponding points from the estimated camera poses hartley1997triangulation; stewenius2005hard; byrod2007improving. 4) Bundle adjustment: refining camera poses and 3D points by minimizing the reprojection error triggs1999bundle; lourakis2009sba.
Since the relative translation observations contain only the movement directions and not their scale, solving the viewing graph is the challenging step in the SfM process. To simplify the problem, almost all viewing graph solvers determine the rotations using only the rotation observations and then use them to compute the camera positions wilson2014robust; govindu2001combining; ozyesil2015robust; jiang2013global. Obtaining a set of rotations from their relative observations is known as the rotation averaging problem and has been studied in the computer vision community nasiri2018linear; hartley2011l1; arrigoni2014robust; hartley2013rotation.
The rotations obtained by solving the rotation averaging problem are used to change the representation of the direction observations from local coordinates to a global coordinate system. The camera location estimation problem is to find camera positions from these relative directions in the global coordinate frame. One solution is to find the camera locations that minimize the sum of squared direction errors wilson2014robust. The difficulty of solving this nonlinear cost function led most of the available methods in the literature to change the cost function from the direction error to the displacement error. To this end, unknown scale factors are added to the problem, which must be estimated together with the camera positions. Some of these approaches use the cross product of the solution translations with the direction observations as the displacement error govindu2001combining; arie2012global. These approaches suffer from the property that the cross-product distance decreases once the angle error exceeds 90 degrees. The constrained least-squares tron2014distributed; tron2009distributed and least-unsquared chordal distances ozyesil2015stable; ozyesil2015robust are other commonly used cost functions. Using the displacement error instead of the direction error biases the errors in proportion to the length of the edges.
In this paper, we show that the constrained least-squares formulation for solving camera positions has an inherent limitation that degrades its performance. We therefore propose an iterative algorithm that solves the original displacement error or direction error cost functions, starting from the solution of the constrained least-squares formulation. Experimental results show the efficacy of the proposed methods. We also propose iterative methods for solving position and rotation registration simultaneously. These methods solve a pose graph optimization (PGO) problem in each step, which, thanks to the recent success of PGO solvers, can be solved efficiently and with high accuracy. Experimental results show that our proposed methods significantly outperform the most accurate methods proposed in the literature.
2 Preliminaries
2.1 Viewing Graph
Given $n$ views or images of a scene, a subset of the $\binom{n}{2}$ pairwise relative motions can be estimated. The camera poses, as the vertices, and the pairwise relative motions, as the observed edges, constitute the viewing graph. Each relative motion observation comprises a relative rotation and a relative direction. Relative directions are vectors from one endpoint of an edge to the other and are represented in the local coordinate frames of the vertices.
A set of vertices, $V$, and its associated set of edges, $E$, form a viewing graph $G = (V, E)$. The parameters of the $i$-th vertex, $v_i$, are the position and the orientation of the camera with respect to a global coordinate frame, i.e. $v_i = (R_i, t_i)$. An edge between the $i$-th and $j$-th vertices, represented by $e_{ij}$, comprises a noisy measurement of the relative rotation, $R_{ij}$, and a noisy measurement of the direction of movement in the local coordinate frame, i.e. $\gamma_{ij}$.
Solving a viewing graph means finding the unknown parameters of the vertices, i.e. the camera poses, that match the edges, i.e. the relative observations, as closely as possible. Mathematically speaking, the viewing graph optimization problem solves for the vertex parameters that minimize the following mismatch cost function,
(1) 
where, and are distance metrics and . Weights can be added to the cost function to handle different confidence levels of the measurements and to balance the errors of the two parts of the cost function.
2.2 Common Metrics
Rotation distance metrics are widely studied under the title of rotation averaging hartley2013rotation; chatterjee2017robust. The common metric, which is also used in our proposed methods, is the Frobenius norm of the difference of the rotation matrices:
$d_R(R_a, R_b) = \lVert R_a - R_b \rVert_F$ (2)
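As a small numerical illustration of the Frobenius rotation distance, the sketch below evaluates it for rotations about the z-axis. The helper names `rot_z` and `frobenius_distance` are hypothetical and chosen only for this example. It also checks the standard identity that, for two rotation matrices separated by a relative angle $\theta$, the Frobenius distance equals $2\sqrt{2}\,\sin(\theta/2)$.

```python
import math

def rot_z(theta):
    """Rotation matrix about the z-axis by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def frobenius_distance(Ra, Rb):
    """Frobenius norm of the difference of two 3x3 matrices."""
    return math.sqrt(sum((Ra[i][j] - Rb[i][j]) ** 2
                         for i in range(3) for j in range(3)))

identity = rot_z(0.0)
for deg in (0, 30, 90, 180):
    theta = math.radians(deg)
    measured = frobenius_distance(identity, rot_z(theta))
    # Closed form for two rotations with relative angle theta.
    closed_form = 2.0 * math.sqrt(2.0) * math.sin(theta / 2.0)
    print(deg, round(measured, 6), round(closed_form, 6))
```

The distance grows monotonically with the relative angle on the interval from 0 to 180 degrees, which is one reason this metric is convenient for rotation averaging.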
Several distance metrics can also be used for the direction part of the cost function. Some cases are listed in wilson2014robust. The commonly used metrics are the chordal and the orthogonal distances, which are formulated as follows:
Chordal: $d_{ch}(u, v) = \lVert u - v \rVert_2$ (3)
Orthogonal: $d_{\perp}(u, v) = \lVert P_{u^{\perp}} v \rVert_2 = \lVert u \times v \rVert_2$ (4)
where $u$ is a unit vector, $P_{u^{\perp}}$ is the projector onto the orthogonal complement of the span of $u$, and “$\times$” represents the cross product.
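The two metrics can be compared on a concrete pair of vectors. The following is a minimal sketch, assuming a unit reference vector `u`. The function names are hypothetical. It computes the orthogonal distance in two ways, by removing the component along `u` and by the cross product, to illustrate that both agree when `u` has unit length.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def chordal(u, v):
    """Euclidean distance between the endpoints of u and v."""
    return norm([a - b for a, b in zip(u, v)])

def orthogonal_by_projection(u, v):
    """Norm of v after removing its component along the unit vector u."""
    coeff = dot(u, v)
    return norm([b - coeff * a for a, b in zip(u, v)])

def orthogonal_by_cross(u, v):
    """Norm of the cross product; equals the projection form for unit u."""
    return norm(cross(u, v))

u = [1.0, 0.0, 0.0]                          # unit reference direction
v = [0.5, math.sqrt(3.0) / 2.0, 0.0]         # unit vector at 60 degrees from u
print(chordal(u, v), orthogonal_by_projection(u, v), orthogonal_by_cross(u, v))
```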
Common approaches separate the two parts of the cost function. The first part, i.e. the rotation averaging problem hartley2013rotation; nasiri2018linear, does not depend on the positions and can be solved on its own to obtain the vertex orientations. The rotation averaging problem is defined as:
(5) 
The obtained orientations are used to rotate the direction observations from the local coordinate frames into the global coordinate frame. The second part of the cost function then depends only on the positions and is known as the “camera location estimation” problem. The objective function is designed to minimize the direction errors and is formulated as:
(6) 
There are global scale and translation ambiguities in the solutions of (6); if the optimization method can cope with this over-parametrization, there is no need to remove them. Some existing methods, however, add constraints to resolve the ambiguities. The constraints or fixing the first vertex at the origin are the common constraints to remove the translation ambiguity, and the constraints like or can be used to remove the scale ambiguity.
A different approach to finding the camera locations is to minimize the displacement error instead of the direction error. In this case, the problem is formulated as:
(7) 
In this case, it is important to impose a constraint like to prevent the trivial solution of placing all camera positions at the same point.
The combinations of two distance metrics, i.e. orthogonal and chordal, and the two aforementioned approaches, i.e. minimizing directions or displacements error, form a set of four different error criteria which were used by different methods to solve the camera location estimation problem. The geometric interpretation of these error criteria is shown in Fig. 1.
3 Limitations of Existing Methods
In this section, we first discuss a limitation of an existing method for solving the cost function of (7). The method is fast and gives relatively good results, and we use it as the initialization procedure of our proposed methods. In the next part, we show the limitation of the orthogonal distance, and therefore of the existing methods in the literature that use this cost function.
3.1 Linear constraint problem
A common approach to estimating camera locations is to solve the following alternative problem:
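$\min_{\{t_i\}, \{\lambda_{ij}\}} \sum_{(i,j) \in E} \lVert t_j - t_i - \lambda_{ij} \gamma_{ij} \rVert_2^2$ (8)
where $t_i$ are the camera positions, $\gamma_{ij}$ are the direction observations expressed in the global frame, and the scale factors, $\lambda_{ij}$, are unknown parameters. For $\lambda_{ij} = \lVert t_j - t_i \rVert$, problem (8) is the same as (7) with the chordal distance, which minimizes the displacement error, i.e. the Euclidean distance between the endpoints of the vectors $t_j - t_i$ and $\lVert t_j - t_i \rVert \gamma_{ij}$. In practice, the constraint $\lambda_{ij} = \lVert t_j - t_i \rVert$ is dropped and replaced by the linear constraint $\lambda_{ij} \geq 1$, which also handles the scale ambiguity. Like arrigoni2014robust, we refer to the problem of (8) with the constraint $\lambda_{ij} \geq 1$ as the constrained least-squares (CLS).
The cost function of (8) depends quadratically on the global scale, i.e. the overall spatial extent of the vertex positions. Reducing the global scale reduces the cost quadratically. Obviously, setting the global scale to zero results in zero cost, with the trivial solution in which all positions coincide. However, the constraint $\lambda_{ij} \geq 1$ prevents a solver from reaching this trivial solution. In fact, the $\lambda_{ij}$ represent the distances between the vertices, and the constraint prevents the vertices from getting too close to each other.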
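The quadratic dependence on the global scale can be checked numerically. The sketch below assumes that the cost of (8) has the usual CLS form, a sum over edges of $\lVert t_j - t_i - \lambda_{ij}\gamma_{ij} \rVert^2$. The function names and the toy graph are hypothetical and only serve the illustration. Scaling all positions and scale factors by a common factor `s` multiplies the cost by `s**2`, so the cost vanishes at `s = 0`; this is the trivial solution that a lower bound on the scale factors rules out.

```python
def cls_cost(positions, edges, directions, scales):
    """Sum of squared residuals t_j - t_i - lambda_ij * gamma_ij over all edges."""
    total = 0.0
    for (i, j), g, lam in zip(edges, directions, scales):
        r = [positions[j][k] - positions[i][k] - lam * g[k] for k in range(3)]
        total += sum(x * x for x in r)
    return total

# Toy graph (assumed data): three cameras and three noisy unit directions.
positions = [[0.0, 0.0, 0.0], [1.0, 0.2, 0.0], [0.3, 1.1, 0.0]]
edges = [(0, 1), (1, 2), (0, 2)]
directions = [[1.0, 0.0, 0.0], [-0.6, 0.8, 0.0], [0.0, 1.0, 0.0]]
scales = [1.0, 1.2, 1.0]

def scaled_cost(s):
    """Cost after multiplying all positions and scale factors by s."""
    scaled_positions = [[s * x for x in p] for p in positions]
    scaled_scales = [s * lam for lam in scales]   # s < 1 may violate lambda >= 1
    return cls_cost(scaled_positions, edges, directions, scaled_scales)

base = scaled_cost(1.0)
print(base, scaled_cost(0.5), scaled_cost(0.0))
```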
At first glance, it seems that the constraint $\lambda_{ij} \geq 1$ causes the global scale to be determined by the minimum distance between the vertices. This would imply that the smallest scale factor, i.e. the one corresponding to the minimum distance between the vertices, is set to one, and that the other scale factors estimate the distances between their endpoints divided by this minimum distance. In practice, however, a significant number of the $\lambda_{ij}$ are set exactly to one. In fact, the $\lambda_{ij}$ of a subset of edges that are shorter than the others are set to one. This increases the cost on this subset of edges, but it reduces the global scale and hence the cost on the other, longer edges. Therefore, the total cost decreases, but the $\lambda_{ij}$, and consequently the positions, deviate from their true values.
Fig. 2 compares the $\lambda_{ij}$ obtained by CLS and by one of our proposed methods for a dataset. In this figure, there are a large number of vertex pairs with small $\lVert t_j - t_i \rVert$ for which $\lambda_{ij}$ is set to $1$ and therefore does not match $\lVert t_j - t_i \rVert$. Furthermore, the figure shows that for CLS the mismatches also occur for large $\lambda_{ij}$. In contrast, in our proposed method the $\lambda_{ij}$ and their corresponding $\lVert t_j - t_i \rVert$ match completely.
3.2 Orthogonal metric problem
The orthogonal distance is a common distance metric used in the literature for solving the camera locations. It is an easier metric to optimize, and interesting convergence results can be obtained for it. It can be shown that the following cost function, which contains multipliers, can be solved iteratively and at its minimum is equal to the orthogonal distance metric (see the Appendix for more details).
(9) 
The orthogonal distance decreases for an angle error of more than 90 degrees. This means that the angle between an estimated relative position and its direction observation, and also the Euclidean distance between them, can increase while their orthogonal distance decreases. This can lead to undesirable results in methods based on this distance metric.
We construct a simple example to demonstrate the problem of the orthogonal distance. Fig. 3 shows a graph with six vertices. Vertices 5 and 6, placed near the center of the graph, are closer to each other than the other vertices are. The rotation observations are assumed to be exact and the direction observations are perturbed by an angular error. The cross product method of govindu2001combining, the ShapeFit method of goldstein2016shapefit, and the CLS algorithm are applied to the problem, and the results are shown in the figure. The relative positions of the first four vertices in all three methods are close to their true relative positions, and these four vertices are located at the four corners of a rhombus. The positions of vertices 5 and 6 estimated by CLS are far from their true relative positions, but these vertices are still inside the rhombus of the first four vertices. In contrast, the positions of vertices 5 and 6 estimated by the other two methods lie outside the rhombus, so that their positions relative to one of the other vertices point in the direction opposite to the true relative direction. Note that when the angle between the relative position and the direction observation is about 180 degrees, i.e. they point in opposite directions, the orthogonal distance between them is very small. Therefore, the cost value obtained by the methods based on the orthogonal distance is minimized even though some vertices are placed at completely different positions relative to the other vertices than their true relative positions.
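The non-monotonic behaviour of the orthogonal distance can be seen with a short angle sweep. This is a minimal sketch with a hypothetical helper `distances`, using two unit vectors in the plane. The chordal distance keeps growing up to 180 degrees, whereas the orthogonal distance peaks at 90 degrees and falls back to zero for opposite directions.

```python
import math

def distances(angle_deg):
    """Chordal and orthogonal distances between unit vectors angle_deg apart."""
    a = math.radians(angle_deg)
    u = (1.0, 0.0)                       # direction observation
    v = (math.cos(a), math.sin(a))       # estimated unit direction
    chordal = math.hypot(u[0] - v[0], u[1] - v[1])
    orthogonal = abs(u[0] * v[1] - u[1] * v[0])   # magnitude of the 2D cross product
    return chordal, orthogonal

for angle in (0, 45, 90, 135, 180):
    c, o = distances(angle)
    print(f"{angle:3d} deg  chordal={c:.3f}  orthogonal={o:.3f}")
```

An estimate that points exactly the wrong way is therefore scored as well as a perfect one by the orthogonal distance, which is the failure mode of the six-vertex example.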
4 The Proposed Methods
To overcome the weakness of the linear constraint discussed in Section 3.1, we propose an iterative solver that is used in two different approaches, minimizing the sum of squared displacement errors or the sum of squared direction errors, to solve the viewing graph problem. The main idea is to use the $\lambda_{ij}$ obtained by CLS in the first iteration and then to replace them with fixed coefficients computed from the positions obtained in the previous iteration. We are not able to prove the convergence of the iterative solvers, but we observe experimentally that all of our iterative solvers converge.
To minimize the sum of squared displacement errors, the formulation of (8) is employed and the $\lambda_{ij}$ are set to the distance between the positions of vertices $i$ and $j$ obtained in the previous iteration of the algorithm. Therefore, upon convergence, the algorithm yields $\lambda_{ij} = \lVert t_j - t_i \rVert$, where the $t_i$ are the estimated camera positions, and we are solving the actual optimization problem of (7).
The proposed algorithm, summarized in Alg. 1, consists of the following steps:
1) Use the RS algorithm nasiri2020novel to solve (5) and find the orientations of the vertices. (RS uses the Frobenius norm as the rotation distance.)
2) Use the 1DSFM algorithm wilson2014robust to remove the outliers.
3) Solve (8) under the common constraint $\lambda_{ij} \geq 1$ to estimate the camera positions.
4) Replace the parameters $\lambda_{ij}$ with the fixed coefficients $\lVert t_j - t_i \rVert$.
5) Repeat the last two steps until convergence.
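The alternation between solving for positions and resetting the scale factors can be sketched as follows. This is an illustrative sketch, not the authors' implementation. It assumes the cost of (8) has the form $\sum \lVert t_j - t_i - \lambda_{ij}\gamma_{ij} \rVert^2$, skips the rotation averaging and outlier removal steps, fixes vertex 0 at the origin to remove the translation ambiguity, and takes the initial scale factors as an argument in place of a CLS solve. The function names `solve_positions` and `iterate_scales` are hypothetical.

```python
import numpy as np

def solve_positions(n, edges, directions, scales):
    """Least-squares positions for fixed scale factors, with vertex 0 at the origin."""
    B = np.zeros((len(edges), n))                 # signed incidence matrix
    for row, (i, j) in enumerate(edges):
        B[row, i], B[row, j] = -1.0, 1.0
    rhs = scales[:, None] * directions            # fixed translation observations
    sol, *_ = np.linalg.lstsq(B[:, 1:], rhs, rcond=None)   # drop vertex 0 column
    return np.vstack([np.zeros(3), sol])

def iterate_scales(n, edges, directions, init_scales, max_iters=100, tol=1e-10):
    """Alternate between a linear solve and resetting scales to edge lengths."""
    scales = np.asarray(init_scales, dtype=float)
    idx = np.asarray(edges)
    for _ in range(max_iters):
        t = solve_positions(n, edges, directions, scales)
        new_scales = np.linalg.norm(t[idx[:, 1]] - t[idx[:, 0]], axis=1)
        converged = np.max(np.abs(new_scales - scales)) < tol
        scales = new_scales
        if converged:
            break
    return t, scales

# Noise-free toy problem (assumed data): the true lengths are a fixed point.
truth = np.array([[0., 0., 0.], [1., 0., 0.], [1., 2., 0.], [0., 1., 1.]])
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]
diffs = np.array([truth[j] - truth[i] for i, j in edges])
lengths = np.linalg.norm(diffs, axis=1)
directions = diffs / lengths[:, None]
t, scales = iterate_scales(4, edges, directions, lengths)
print(np.round(t, 6))
```

With exact directions and scale factors equal to the true edge lengths, one linear solve reproduces the ground-truth positions and the update leaves the scale factors unchanged. With noisy data, no such guarantee is implied by this sketch.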
To minimize the directions error instead of displacements error, we can use the formulation of (9). Therefore the and step of the algorithm is replaced by:

Replace parameters with fixed coefficients .
Fixing the parameters s converts the direction observations to translation observations. This means that the given viewing graph is converted to a pose graph, and the problem turns into a pose graph optimization problem:
(10) 
The literature on pose graph optimization offers a wide variety of algorithms, among which there are algorithms that solve for the orientations and positions of the vertices simultaneously nasiri2020novel; rosen2019se. This was the motivation for our second method, which optimizes the positions and the orientations together. The idea is to change the step of the algorithm with the state-of-the-art PS algorithm proposed in nasiri2020novel to find the vertices’ poses. The proposed algorithm is summarized in Alg. 2.
In both of our proposed algorithms, we use the displacement error in the cost function without any constraint to avoid the trivial solution. Since we use a good initialization obtained by the CLS algorithm, we did not observe any problems in the experiments. Because of the scale problem, using the displacement error in the full viewing graph loss, which was solved by turning the problem into a PGO in our method, seems unjustified. However, using this error in the full viewing graph loss yields an important insight that becomes clear later in the experiments.
5 Experiments
To evaluate the proposed algorithms, the challenging real datasets of wilson2014robust are used. The camera pose estimates computed by Bundler snavely2006photo and presented in wilson2014robust are considered as the ground truth. In this section, Alg. 1C and Alg. 2C stand for using the displacement error in the proposed algorithms, and Alg. 1O and Alg. 2O stand for using the direction error in each iteration of the algorithms. (An efficient MATLAB implementation of the proposed algorithms is available at http://visionlab.ut.ac.ir/resources/vgorls.zip.)
As the first step, the weakness of the constraint $\lambda_{ij} \geq 1$, which is discussed in Section 3.1, is shown experimentally. To this end, the orientations of the cameras are estimated by solving the rotation averaging problem of (5) using RS nasiri2020novel. Outliers are removed by the 1DSFM algorithm, and then the positions are estimated by CLS, i.e. by solving (8) under the constraint $\lambda_{ij} \geq 1$. The results are compared to the results of Alg. 1C. Fig. 4 compares the histograms of the ratio of the scale factors to the corresponding vertex distances obtained by CLS and Alg. 1C on different datasets. It can be seen that all scale factors obtained by Alg. 1C, unlike those of CLS, match their corresponding vertex distances.
In the following, the recovery performance of the proposed algorithms is evaluated in comparison to the most accurate algorithms proposed in the literature. The compared methods are the Cross method of govindu2001combining, the CLS method used in tron2009distributed; tron2014distributed, and the LUD method proposed in ozyesil2015robust. Since the problem has natural global rotation, translation, and scale ambiguities, a scaled Euclidean transformation should be found that aligns the camera positions to the ground truth as closely as possible. Mathematically, we solve
(11) 
where $R$, $t$, and $s$ are the rotation matrix, translation vector, and scale parameter of the scaled Euclidean transformation. The comparison criteria are the mean, the median, and the root mean squared (RMS) distance errors. The root mean squared distance error is given by
(12) 
where is the number of edges.
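The alignment and the three error statistics can be sketched as below. This is a hedged illustration rather than the evaluation code used in the paper. It uses the standard closed-form similarity alignment based on the singular value decomposition (the Umeyama method), assumes that the estimated positions are mapped onto the ground truth, and averages the errors over cameras. The function names `align_similarity` and `position_errors` are hypothetical.

```python
import numpy as np

def align_similarity(est, gt):
    """Least-squares scale s, rotation R, translation t with gt ~ s * R @ est + t."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))   # cross-covariance of the two sets
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                             # keep R a proper rotation
    R = U @ S @ Vt
    var_e = (E ** 2).sum() / len(est)              # variance of the estimated set
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * R @ mu_e
    return s, R, t

def position_errors(est, gt):
    """Mean, median, and RMS distance errors after similarity alignment."""
    s, R, t = align_similarity(est, gt)
    aligned = s * est @ R.T + t
    e = np.linalg.norm(aligned - gt, axis=1)
    return e.mean(), np.median(e), np.sqrt((e ** 2).mean())

# Toy check (assumed data): gt is an exact similarity transform of est.
est = np.array([[0., 0., 0.], [1., 0., 0.], [0., 2., 0.], [0., 0., 3.], [1., 1., 1.]])
c, sn = np.cos(0.7), np.sin(0.7)
R0 = np.array([[c, -sn, 0.], [sn, c, 0.], [0., 0., 1.]])
gt = 2.5 * est @ R0.T + np.array([1., -2., 0.5])
print(position_errors(est, gt))
```

Since the RMS error is never smaller than the mean error, the largest of the three statistics reported for a method on a given dataset is its RMS error.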
The histograms of the rotation and direction measurement errors before and after applying 1DSFM for Tower of London are shown in Fig. 5. The figure shows that most of the outliers were removed and that the output contains only a few outliers. Similar results were obtained for the other datasets.
The results of the various methods on the different datasets are listed in Table 1. The results show that on almost all datasets the proposed methods outperform the others. Further, the comparison of Alg. 1 and Alg. 2 demonstrates that minimizing the cost function over rotations and translations simultaneously results in better camera localization. Comparing the results of Alg. 1C to Alg. 1O and of Alg. 2C to Alg. 2O shows that, although the differences are small, minimizing the displacement error instead of the direction error results in better camera localization. Because of the scale problem discussed in the section on the proposed methods, using Alg. 2C is not justified. Interestingly, however, this algorithm shows the best performance, which indicates that the scales obtained by solving the problem with our proposed method are adequate. This shows the importance of using correct weights for combining the two terms of the viewing graph optimization problem.
Dataset name  m  n  Cross (3 columns)  CLS (3 columns)  LUD (3 columns)  Alg. 1C (3 columns)  Alg. 1O (3 columns)  Alg. 2C (3 columns)  Alg. 2O (3 columns)
Alamo  46353  479  15.94  12.96  12.52  5.36  2.89  1.78  4.89  2.56  1.43  5.46  2.57  1.18  5.39  2.61  1.30  5.14  2.17  0.93  4.83  2.11  0.95 
Ellis Island  4512  193  30.85  26.18  23.26  11.19  6.87  4.22  10.92  6.17  3.57  10.00  5.71  3.07  10.86  6.17  3.28  10.00  5.37  2.65  10.29  5.50  2.68 
Madrid Metropolis  4150  253  51.45  35.65  29.23  21.66  14.65  11.36  20.63  13.47  10.16  18.55  12.23  8.51  20.33  13.20  10.01  16.90  10.95  7.05  17.28  10.98  6.79 
Montreal Notre Dame  26330  385  12.13  11.14  11.99  2.70  1.82  1.13  2.62  1.76  1.10  2.54  1.66  1.01  2.87  1.84  1.05  2.45  1.59  0.88  2.60  1.67  0.96 
Notre Dame  45048  501  13.32  9.81  7.83  6.26  4.00  2.97  6.27  3.70  2.48  5.71  3.18  2.05  5.84  3.27  2.20  5.28  2.71  1.60  5.39  2.78  1.68 
NYC Library  5614  260  15.29  13.80  14.05  4.31  3.14  2.28  4.04  2.82  1.88  5.06  3.55  2.41  4.60  3.15  2.05  4.85  3.03  1.92  4.87  2.96  1.82 
Piazza del Popolo  12027  234  16.43  14.29  12.56  3.30  2.24  1.48  3.10  2.03  1.30  3.04  1.95  1.17  3.31  2.11  1.18  2.35  1.47  0.94  2.52  1.55  0.91 
Tower of London  7539  371  76.41  66.35  60.20  39.01  26.75  19.02  36.52  24.32  15.37  34.16  20.49  11.86  38.13  25.15  16.79  32.99  18.19  9.26  35.24  20.69  10.96 
Union Square  5919  468  18.73  14.84  10.81  12.91  9.02  6.61  12.67  8.68  6.27  13.17  9.07  6.74  12.89  9.02  6.89  12.80  8.43  5.75  12.90  8.71  6.19 
Vienna Cathedral  36729  624  34.58  27.46  22.41  24.27  18.81  14.12  22.84  17.62  13.17  20.90  15.59  10.85  25.45  19.20  13.15  19.56  14.42  10.06  23.32  17.29  11.99 
Yorkminster  10249  317  27.97  13.74  5.94  15.99  8.68  5.77  17.09  7.85  4.99  15.00  7.64  4.74  14.89  7.45  4.81  14.39  6.88  4.05  14.27  6.70  3.81 
6 Conclusion
In this paper, we proposed iterative methods for solving the camera location estimation problem. One of our proposed methods fixes the unknown scale factors, i.e. the $\lambda_{ij}$ that appear in the common cost function (8), in each iteration. Hence, the solution is obtained by solving a sequence of least-squares problems. The scale factors in each iteration are set to the distances between the corresponding vertices in the previous iteration. We observed experimentally that this helps the final scale factors converge to the distances between the corresponding vertices, i.e. $\lambda_{ij} = \lVert t_j - t_i \rVert$. This solves the problem of the CLS algorithm, which bunches up a significant number of scale factors at the lower bound.
Fixing the scale factors converts the direction observations to translation observations and converts the given viewing graph to a pose graph. This motivated us to use the state-of-the-art PS method nasiri2020novel to obtain the rotations and positions of the cameras simultaneously and to propose the second algorithm. Obtaining the best results by solving the full cost of the viewing graph optimization problem is an important result. It is a step toward solving the challenging viewing graph optimization problem and calls on other researchers to put emphasis on developing algorithms that consider both terms of the viewing graph optimization problem together.
Both algorithms can solve the original cost function, in which the chordal distance of the directions is minimized. However, the experiments show that minimizing the commonly used displacement distance metric (which weights the direction errors by the edge lengths) results in better performance on real datasets. This is an interesting result, because the displacement distance metric has a scale problem, which can be more important in the case of the full viewing graph optimization solved using our Alg. 2C. The fact that Alg. 2C performs best shows the importance of having good weights for combining the two terms of the viewing graph optimization problem. An interesting direction for future work is obtaining the covariance of the noise in order to use accurate weights for combining the two terms in the loss function of the viewing graph optimization.
We observed experimentally that all our proposed algorithms converge. Another important line of future research that we are undertaking is developing theoretical convergence results for our proposed algorithms.
Appendix
The author of govindu2001combining proposed an iteratively reweighted algorithm for minimizing the orthogonal distance of the direction errors, i.e. for solving (6) with the orthogonal distance. The strategy is to minimize the orthogonal distance of the displacement errors, which can be solved in closed form, and then to reweight the errors by the squared lengths of the edges. Although the weights equal the edge lengths upon convergence, there is no guarantee that the algorithm converges.
We use a novel formulation in which the solution is directly the minimizer of (6) with the orthogonal distance as the distance metric. We use the cost function of (9), which places multipliers before the second term. Problem (9) does not need any constraint, and the minimizer of its cost function is directly the solution of (6) with the orthogonal distance metric.
Suppose that the set of is the optimal solution of (9). So we have,
(13) 
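Since the relative position scaled by its multiplier ranges over all points on the line through the origin along the relative position, it can be concluded from (13) that the optimal scaled relative position is the point on this line nearest to the endpoint of the direction observation. This nearest point is obtained by projecting the direction observation onto the line, and its Euclidean distance to the observation is equal to the norm of the cross product of the direction observation and the normalized relative position.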
Problem (9) can be solved by a two-block coordinate descent approach. The algorithm repeats the following two steps. In one step, the cost is optimized over the multipliers while the positions are fixed, and in the next step the cost is optimized over the positions while the multipliers are fixed. Both steps are linear least-squares problems and can be solved in closed form. It is easy to show that the proposed coordinate descent algorithm satisfies the convergence conditions of Theorem 1 of rouzban2019rate. Therefore, the algorithm finds the minimizer of the sum of squared orthogonal distances of the direction errors. Although we were able to prove a convergence result for minimizing the orthogonal distance, we obtained poor results with this method in the experiments.
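The multiplier step of the coordinate descent has a simple closed form for each edge. The sketch below assumes that each term of (9) has the form $\lVert \gamma_{ij} - \alpha_{ij}(t_j - t_i) \rVert^2$ with a scalar multiplier $\alpha_{ij}$. This form and the names `best_multiplier`, `gamma`, and `d` are assumptions made for illustration. For a fixed relative position, the optimal multiplier is a one-dimensional least-squares solution, and the remaining residual equals the orthogonal distance given by the cross product.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def best_multiplier(gamma, d):
    """Minimizer over alpha of ||gamma - alpha * d||^2 for a fixed vector d."""
    return dot(gamma, d) / dot(d, d)

gamma = [1.0, 0.0, 0.0]      # unit direction observation
d = [2.0, 2.0, 0.0]          # current relative position t_j - t_i (assumed values)

alpha = best_multiplier(gamma, d)
residual = [g - alpha * x for g, x in zip(gamma, d)]
res_norm = math.sqrt(dot(residual, residual))

# Orthogonal distance: norm of gamma x d divided by the length of d.
c = cross(gamma, d)
cross_norm = math.sqrt(dot(c, c)) / math.sqrt(dot(d, d))
print(alpha, res_norm, cross_norm)
```

The position step, with the multipliers held fixed, is a linear least-squares problem of the same kind as the one solved for fixed scale factors in the displacement formulation.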