 # A linear method for camera pair self-calibration and multi-view reconstruction with geometrically verified correspondences

We examine 3D reconstruction of architectural scenes from unordered sets of uncalibrated images. We introduce a linear method to self-calibrate and find the metric reconstruction of a camera pair. We assume unknown and different focal lengths but otherwise known internal camera parameters, and a known projective reconstruction of the camera pair. We recover two possible camera configurations in space and use the cheirality condition, that all 3D scene points are in front of both cameras, to disambiguate the solution. We show in two theorems, first that the two solutions are in mirror positions, and then the relations between their viewing directions. Our new method performs on par (median rotation error $\Delta R = 3.49^\circ$) with the standard approach of Kruppa equations ($\Delta R = 3.77^\circ$) for self-calibration followed by the 5-point algorithm for calibrated metric reconstruction of a camera pair. We reject erroneous image correspondences by introducing a method to examine whether point correspondences appear in the same order along the x, y image axes in image pairs. We evaluate this method by its precision and recall, and show that it improves the robustness of point matches in architectural and general scenes. Finally, we integrate all the introduced methods into a 3D reconstruction pipeline. We utilize the numerous camera pair metric reconstructions using rotation-averaging algorithms and a novel method to average focal length estimates.


## 1 Introduction

Multi-view geometry (mvg) is a Computer Vision (CV) subfield that attempts to understand the structure of the 3D world given a collection of its images (hartley2003multiple). As binocular human vision is naturally 3D, the same underlying principles allow the recovery of the 3D world structure in mvg reconstruction methods. However, a prerequisite is to have calibrated cameras, an assumption that is violated in unordered image sets. In this paper we focus on self-calibration and multi-view reconstruction using relations between camera pairs.

Assuming a camera pair with unknown and different focal lengths as the only unknown internal parameters, a standard approach to self-calibration and metric reconstruction first applies the 7 point algorithm (hartley2003multiple) inside a RANSAC (ransacalgcit) procedure to find the fundamental matrix. In this projective framework, the Kruppa Equations (krupparef) are used to determine the unknown focal lengths. Next, applying the 5 point algorithm (5palgcit) inside a RANSAC procedure, leads to a metric reconstruction. Since focal lengths are recovered in a projective framework, only epipolar geometry constraints may be used to check the solution plausibility. Solving self-calibration and metric reconstruction problems simultaneously permits the application of the more intuitive and restrictive geometric arguments of the metric framework.

Self-calibration methods are derived from relations on the dual absolute conic (DAC) and the dual image of the absolute conic (DIAC) (176; 195; krupparef). However, existing methods require three or four images to provide a solution (176; 195), use numerical methods to determine the DAC (176), provide an initial DAC estimate that violates the rank-2 condition (176; 195), and do not examine the relations between the recovered putative solutions (176). In mvg reconstructions, additional assumptions have been made to determine focal lengths, such as the availability of EXIF tags (sfmrot2; snavely2010bundler), equality of focal lengths across all images (martinec2007robust; stewenius2005minimal) and vanishing point correspondences (sinha2010multi).

Towards a multi-view reconstruction, camera pairs have been utilized. Different estimates for a rotation matrix can be combined with a rotation averaging algorithm (rotavealg2) and reconstructions of pairs of images can be combined with rotation registration methods (rotreg1; rotreg2; rotavealg1) to initialize an instance of Structure-from-Motion with known rotations (sfmrot1; sfmrot2) and produce a multiple-view reconstruction.

Erroneous solutions in mvg problems are directly caused by erroneous or noisy image correspondences. Two complementary approaches, applying RANSAC procedures to repeatedly sample minimal point sets and verifying the initial point correspondences, have been utilized to improve the validity of the recovered solution (chum2012homography; snavely2006photo).

In this paper, we derive a linear method for the self-calibration and metric reconstruction of camera pairs with unknown and different focal lengths, unifying two problems that were previously solved independently into a single system of equations. We further disambiguate the two solutions recovered by our method through the derivation of two theorems about the solutions' relations. We improve the robustness and applicability of this method by introducing a procedure to verify tentative point correspondences between images, using the Longest Common/Increasing Subsequence (LCS/LIS) problem (fredman1975computing). The verification method is tailored for outdoor scenes of buildings, and is based on enforcing expected geometric properties of such scenes. We integrate our aforementioned methods into a multi-view reconstruction pipeline, utilizing $L_1$-norm averaging algorithms and introducing a method to average different estimates of a single focal length $f_1$, which uses the structure of the problem, specifically that each estimate for $f_1$ comes from a pair of images and is thus paired with a second estimate $f_2$.

The rest of this article is organized as follows: Section 2 provides background on the reconstruction problem and the verification of image correspondences. Section 3 introduces our method for self-calibration and metric reconstruction. Section 4 presents our method for correspondence verification between the images: first, Section 4.1 presents a method which is reduced to the LCS problem, and then Section 4.2 presents the final practical algorithm. In Section 5, we integrate our methods into a reconstruction pipeline. In doing so, we develop novel averaging methods for estimates recovered from image pairs. Results for camera pair reconstruction, correspondence verification, focal length averaging and multi-view reconstruction are given in Section 6.

## 2 Background & Related Work

In the following, bold font (e.g. $\mathbf{x}$) is used for vectors and capital normal font (e.g. $A$) is reserved for matrices.

### 2.1 Elements of multiple view geometry

In this section we summarize basic notions about the projection of 3D scenes to 2D planes (hartley2003multiple; faugeras2004geometry). In a metric reconstruction, parallel world lines converge on the plane at infinity $\pi_\infty = (0,0,0,1)^T$. The absolute conic $\Omega_\infty$ is a conic on $\pi_\infty$ which satisfies $x_1^2+x_2^2+x_3^2=0$, $x_4=0$, where $(x_1,x_2,x_3,x_4)^T$ is the homogeneous representation of world points.

By taking all the planes tangent to $\Omega_\infty$, we construct the dual absolute quadric $Q^*_\infty$, which is the dual surface of $\Omega_\infty$. $Q^*_\infty$ is described in a metric reconstruction by the matrix

$$Q^*_\infty=\begin{bmatrix}I_{3\times3}&\mathbf{0}_3\\\mathbf{0}_3^T&0\end{bmatrix}\qquad(1)$$

Now, considering projective reconstructions of 3-space and projections to the image plane, we have the following Results (hartley2003multiple):

###### Result 1.

The projection of a dual quadric $Q^*$ by projection matrix $P$ to the image plane is the dual conic

$$C^*=PQ^*P^T$$
###### Result 2.

If 3-space is transformed by a homography $H$, that is $X'=HX$, then planes of 3-space are transformed according to

$$\pi'=H^{-T}\pi$$
###### Result 3.

If $H$ is a $4\times4$ matrix representing a projective transformation of 3-space, then the fundamental matrices corresponding to the pairs of camera matrices $(P_1,P_2)$ and $(P_1H,P_2H)$ are the same.

###### Result 4.

Suppose the rank-2 matrix $F$ can be decomposed in two different ways as

$$F=[\mathbf{a}]_\times A\qquad F=[\hat{\mathbf{a}}]_\times\hat{A}$$

then

$$\hat{\mathbf{a}}=\kappa\mathbf{a}\qquad\hat{A}=\kappa^{-1}(A+\mathbf{a}\mathbf{v}^T)$$

for some non-zero constant $\kappa$ and 3-vector $\mathbf{v}$.

Using the preceding Results, we formulate the equations to solve the camera self-calibration problem and to determine the position of the plane at infinity in a projective reconstruction.

### 2.2 Notation

We summarize our notation in Table 1.

### 2.3 Verifying point correspondences

In standard approaches to finding image correspondences, regions of interest are described by local feature descriptors (lowe2004distinctive). Consequently, erroneous matches occur between image regions of similar local appearance.

A way to reject erroneous matches is using arguments about the geometry of the depicted scenes (geometric verification). In SIFT features, the Hough transform was used to acquire the orientation of the detected features (lowe2004distinctive). Another common approach applies a rudimentary transform (e.g. affine, similarity) between the images, to reject some correspondences before fitting the full model (turcot2009better; philbin2007object; perd2009efficient; chum2004enhancing). For these methods we note that:

• Most require specific image features to be extracted, from which special parameters are used to fit the transform

• They result in the rejection of a large number of correspondences

• When we tried using a similarity transform to geometrically verify matches in the reconstruction problem, our results did not improve

Another direction is to improve the covariance of local feature descriptors (perd2009efficient; chum2012homography; chum2006geometric). The regions of interest can be first transformed before extracting a feature descriptor (perd2009efficient) or ellipses may be matched instead of points (chum2012homography).

Finally, the neighborhoods of putative matched points are examined in some verification methods. Such approaches include counting the number of correspondences between the neighborhoods of two tentative point matches (sivic2003video) or examining the order of matched features between the neighborhoods and counting the number of features out of order (wu2009bundling). The geometric verification method we propose uses properties concerning the order of matched points as well.

### 2.4 Approaches to multiple view reconstruction

In a reconstruction pipeline, Structure from Motion (SfM) is initially solved to recover camera poses and sparse 3D structure, assuming image point correspondences and self-calibrated cameras. The fundamental method to solve SfM is Bundle Adjustment (BA) (BAalgcit), an iterative, numerical algorithm to minimize the reprojection error of the recovered solution.

In standard approaches to SfM, a sequence of SfM sub-problems is solved (sequential SfM) (snavely2010bundler; snavely2006photo; wu2011multicore). In each iteration, more, possibly uncalibrated, cameras and world points are added to the SfM problem, which is solved using BA. However, such methods are sensitive to the initial camera pair selection, solve a large number of optimization problems numerically, and optimize an objective function with possibly multiple local minima.

A different approach has been developed for solving the SfM with known rotations problem, within the framework of optimal algorithms in multiple-view geometry (mvg) (dalalyan2009l_1; hartley2007optimal; sfmrot1; sfmrot2; olsson2010generalized; zach2010practical). In this formulation, the camera rotation matrices $R_i$ are given. SfM is formulated as a convex-optimization problem, for which a unique global minimum exists. For the actual solution of SfM with known rotations, either a sequence of second-order cone programs (SOCPs) is solved to arrive at an exact solution, or approximate solutions are recovered by solving SOCPs or linear programs (martinec2007robust; enqvist2011non; sfmrot2; sinha2010multi). BA may still be applied as a last fine-tuning of the solution.

A SfM solution, allows the reconstruction of a low number of 3D points (sparse point cloud), limited by the number of image points correspondences. Multi-view stereo (mvs) algorithms can be used at this point to produce a dense point cloud, which contains a much larger number of 3D points (furukawa2010accurate). Finally, surface reconstruction algorithms can be used to produce a 3D surface (kazhdan2006poisson).

## 3 A method for Metric Reconstruction in pairs of Uncalibrated Images

### 3.1 Formulation of System Equations

Let us consider two cameras, and further assume that the coordinate system of camera 1 is aligned with the world coordinate system. Let us further assume that the corresponding image coordinate systems are selected so that the internal parameters of each camera can be written as

$$K_i=\begin{bmatrix}f_i&0&0\\0&f_i&0\\0&0&1\end{bmatrix}$$

where $f_i$ is the focal length. The previous assumptions are routinely employed in multiple view geometry and are thoroughly discussed in the literature (hartley2003multiple).
We start from a projective reconstruction of the 2 cameras, given by $P_{P1}, P_{P2}$, which is related to the metric reconstruction $P_{M1}, P_{M2}$ by a world (3D) homography $H$ as in

$$P_{M1}=P_{P1}H\qquad P_{M2}=P_{P2}H\qquad(2)$$

Using Result 1 and Eq. (1), we project $Q^*_\infty$ to the image plane of camera 2. For this projection, we have:

$$\begin{bmatrix}f_2^2&0&0\\0&f_2^2&0\\0&0&1\end{bmatrix}=\omega^*_2=P_{P2}HQ^*_\infty H^TP_{P2}^T\qquad(3)$$
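The projection relation above can be checked numerically. Below is a minimal sketch (not from the paper; all numeric values are illustrative) verifying that for a metric camera $P=K_2[R_2\,|\,\mathbf{t}_2]$ (i.e., $H=I$), projecting $Q^*_\infty$ with Result 1 yields $\omega^*_2=K_2K_2^T=\mathrm{diag}(f_2^2,f_2^2,1)$:

```python
import numpy as np

def rot_x(theta):
    # Rotation about the x axis, used only to build a generic camera pose.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

f2 = 800.0                              # illustrative focal length of camera 2
K2 = np.diag([f2, f2, 1.0])             # internal parameters, form of K_i above
R2 = rot_x(0.1)                         # some rotation
t2 = np.array([[0.5], [0.0], [1.0]])    # some translation

P_M2 = K2 @ np.hstack([R2, t2])         # metric camera 2 (3x4)

Q_inf = np.zeros((4, 4))                # dual absolute quadric, Eq. (1)
Q_inf[:3, :3] = np.eye(3)

omega2 = P_M2 @ Q_inf @ P_M2.T          # projection of Q*_inf (Result 1)

print(np.allclose(omega2, K2 @ K2.T))   # True: omega2 = diag(f2^2, f2^2, 1)
print(np.sqrt(omega2[0, 0]))            # recovers f2 (~800.0)
```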

To introduce the unknowns in Eq. (3), we use the canonical representation of the projective reconstruction, so that $P_{P1}=[I\,|\,\mathbf{0}]$. From Eq. (2), we have for the homography

$$H=\begin{bmatrix}K_1&\mathbf{0}\\\mathbf{v}^T&\sigma\end{bmatrix}$$

where the 3-vector $\mathbf{v}$ is yet undetermined and the scale factor can be ignored ($\sigma=1$).
To fully determine $H$, we turn to the plane at infinity

$$\pi_{\infty,P}\triangleq(\mathbf{p}^T\;1)^T\triangleq(p_1\;p_2\;p_3\;1)^T$$

Using Result 2 we arrive at

$$H=\begin{bmatrix}K_1&\mathbf{0}\\-\mathbf{p}^TK_1&1\end{bmatrix}\qquad(4)$$

Substituting $H$ from Eq. (4) into Eq. (3) we get

$$\begin{bmatrix}f_2^2&0&0\\0&f_2^2&0\\0&0&1\end{bmatrix}=P_{P2}\begin{bmatrix}K_1K_1^T&-K_1K_1^T\mathbf{p}\\-\mathbf{p}^TK_1K_1^T&\mathbf{p}^TK_1K_1^T\mathbf{p}\end{bmatrix}P_{P2}^T\qquad(5)$$

Eq. (5) comprises a non-linear system with respect to the five unknowns (the plane at infinity coordinates $p_1,p_2,p_3$ and the focal lengths $f_1,f_2$) we must determine to acquire a metric reconstruction of the scene. We note that $\omega^*_2$ is symmetric by definition, and is also homogeneous, thus it provides five independent equations.

### 3.2 Linearization

In Eq. (5), we substitute

$$P_{P2}\triangleq\begin{bmatrix}p_{11}&p_{12}&p_{13}&p_{14}\\p_{21}&p_{22}&p_{23}&p_{24}\\p_{31}&p_{32}&p_{33}&p_{34}\end{bmatrix}$$

We group the unknowns in the following complexes

$$\mathbf{x}_o\triangleq\begin{pmatrix}f_1^2&f_2^2&f_1^2p_1^2+f_1^2p_2^2+p_3^2&p_3&f_1^2p_1&f_1^2p_2\end{pmatrix}^T\qquad(6)$$

The augmented matrix for the linear system

$$A\mathbf{x}_o=\mathbf{b}\qquad(7)$$

is then given by

$$\begin{pmatrix}p_{21}^2+p_{22}^2&-1&p_{24}^2&-2p_{23}p_{24}&-2p_{21}p_{24}&-2p_{22}p_{24}&-p_{23}^2\\p_{21}p_{31}+p_{22}p_{32}&0&p_{24}p_{34}&-p_{23}p_{34}-p_{24}p_{33}&-p_{21}p_{34}-p_{24}p_{31}&-p_{22}p_{34}-p_{24}p_{32}&-p_{23}p_{33}\\p_{11}p_{31}+p_{12}p_{32}&0&p_{14}p_{34}&-p_{13}p_{34}-p_{14}p_{33}&-p_{11}p_{34}-p_{14}p_{31}&-p_{12}p_{34}-p_{14}p_{32}&-p_{13}p_{33}\\p_{11}^2+p_{12}^2&-1&p_{14}^2&-2p_{13}p_{14}&-2p_{11}p_{14}&-2p_{12}p_{14}&-p_{13}^2\\p_{11}p_{21}+p_{12}p_{22}&0&p_{14}p_{24}&-p_{13}p_{24}-p_{14}p_{23}&-p_{11}p_{24}-p_{14}p_{21}&-p_{12}p_{24}-p_{14}p_{22}&-p_{13}p_{23}\\p_{31}^2+p_{32}^2&0&p_{34}^2&-2p_{33}p_{34}&-2p_{31}p_{34}&-2p_{32}p_{34}&-p_{33}^2+1\end{pmatrix}\qquad(8)$$

We derived the above equations (in order of appearance) from elements $(2,2)$, $(2,3)$, $(1,3)$, $(1,1)$, $(1,2)$, $(3,3)$ of $\omega^*_2$. In the following, we use the first five equations, as explained in Section 3.4.

The matrix of Eq. (8) is rank deficient. Thus, we have presented a linear system of five (in the best case) linearly-independent equations in six unknowns. To solve it, we turn to the polynomial relations between the coordinates of $\mathbf{x}_o$.

### 3.3 Recovering the solutions

Taking five of the equations of (7), we have the linear system

$$A_5\mathbf{x}_o=\mathbf{b}_5\qquad(9)$$

Applying Gaussian elimination to (9), we bring the augmented matrix to the form

$$\left[\begin{array}{cccccc|c}1&0&0&0&0&0&b_1\\0&1&0&0&0&0&b_2\\0&0&1&0&0&0&b_3\\0&0&0&1&0&c&b_4\\0&0&0&0&1&d&b_5\end{array}\right]\qquad(10)$$

where:

1. The elements of the first five columns are in the usual form expected when we apply Gaussian elimination in the general case

2. The zeros in the sixth column of the first two rows are a result of the problem's structure, that is, of the special relations in Eq. (8)

3. Finally, the zero in the sixth column of the third row is as given when we use the canonical representation for the projective reconstruction, which is:

$$P_{P1}=[I\,|\,\mathbf{0}],\qquad P_{P2}=[[\mathbf{a}]_\times F\,|\,\mathbf{a}]\qquad(11)$$

where $\mathbf{a}$ is the left null vector of $F$, $F^T\mathbf{a}=\mathbf{0}$. By using the canonical pair, the leftmost block of $A$ becomes rank deficient, and consequently $A$ has linearly-dependent row-vectors.

The derivation of Eq. (10) is given in the Supplementary Material.

To solve for the focal lengths $f_1,f_2$ and the plane at infinity $\mathbf{p}$, we now have from (10)

$$f_1^2=b_1\qquad(12)$$
$$f_2^2=b_2\qquad(13)$$
$$f_1^2p_1^2+f_1^2p_2^2+p_3^2=b_3\qquad(14)$$
$$p_3+cf_1^2p_2=b_4\qquad(15)$$
$$f_1^2p_1+df_1^2p_2=b_5\qquad(16)$$

We substitute $f_1^2p_1$ from (16) and $p_3$ from (15) into Eq. (14), and obtain a second-order equation with respect to $f_1^2p_2$. Thus, we determine $f_1^2,f_2^2$ uniquely and $p_1,p_2,p_3$ with a two-way ambiguity. We refer to those two solutions as

$$\mathbf{x}_o^1=\begin{pmatrix}b_1&b_2&b_3&p_3&f_1^2p_1&f_1^2p_2\end{pmatrix}^T\qquad\mathbf{x}_o^2=\begin{pmatrix}b_1&b_2&b_3&p_3'&f_1^2p_1'&f_1^2p_2'\end{pmatrix}^T\qquad(17)$$

### 3.4 The effect of homogeneous representation on the derived equations

In homogeneous coordinate systems, representations that are equal up to a multiplicative constant refer to the same entity. We explore here how this ambiguity affects the formulation of Eq. (7).

Let

• $P_{GT1}, P_{GT2}$ be the ground truth camera matrices we aim to recover

• $P_{P1}, P_{P2}$ be the starting projective reconstruction

$P_{GT}$ is related to $P'_{GT}$, a reconstruction whose first camera has the canonical form, by the homography

$$H_k=\begin{bmatrix}K_1^{-1}&\mathbf{0}\\\mathbf{0}^T&1\end{bmatrix}$$

Thus, we get from $P_{GT}$ to $P'_{GT}$ by homography $H_k$, from $P_P$ to $P'_{GT}$ by $H'$, and from $P_P$ to $P_{GT}$ by $H'H_k^{-1}$.

For the camera pairs, we have the fundamental matrices with decompositions

$$F_P=[\mathbf{a}]_\times A\qquad F'_{GT}=[\hat{\mathbf{a}}]_\times\hat{A}$$

From Result 3, since the reconstructions are related by $H'$, they share a common fundamental matrix. Since fundamental matrices are homogeneous entities, we have

$$F_P=\epsilon F'_{GT}$$

Now, we turn to Result 4 and get

$$\hat{\mathbf{a}}=\kappa\mathbf{a}\qquad\hat{A}=\epsilon^{-1}\kappa^{-1}(A+\mathbf{a}\mathbf{v}^T)$$

We write the previous equations in matrix form to get the projective transformation

$$H'\triangleq\begin{bmatrix}\kappa^{-1}\epsilon^{-1}I&\mathbf{0}\\\kappa^{-1}\epsilon^{-1}\mathbf{v}^T&\kappa\end{bmatrix}$$

which satisfies

$$\kappa^{-1}\epsilon^{-1}P'_{GT1}=P_{P1}H'\qquad P'_{GT2}=P_{P2}H'\qquad(18)$$

Now, for the overall homography $H'H_k^{-1}$ we get

$$\begin{bmatrix}\kappa^{-1}\epsilon^{-1}K_1&\mathbf{0}\\\kappa^{-1}\epsilon^{-1}\mathbf{v}^TK_1&\kappa\end{bmatrix}$$

We set the bottom-right element to $1$, as we disregard the true scale of the reconstruction, and get the final form of the homography

$$\begin{bmatrix}\kappa^{-1}\epsilon^{-1}K_1&\mathbf{0}\\\kappa^{-1}\epsilon^{-1}\mathbf{v}^TK_1&1\end{bmatrix}\qquad(19)$$

One should compare the homographies of Eq. (19) and Eq. (4). Using Eq. (4) instead of Eq. (19), we get from $P_P$ the metric pair

$$P_{M1}=[K_1\,|\,\mathbf{0}]\qquad P_{M2}=[\mu K_2R_2\,|\,\mathbf{a}]$$

We observe that the translation direction is correct, but the left-most block of camera 2 is multiplied by a constant $\mu$.

To see how the constants in Eq. (19) affect Eq. (7), we substitute Eq. (19) in Eq. (3) and get for $\omega^*_2$ the expression

$$\omega^*_2=P_{P2}\begin{bmatrix}(\kappa\epsilon)^{-2}K_1K_1^T&-(\kappa\epsilon)^{-2}K_1K_1^T\mathbf{p}\\-(\kappa\epsilon)^{-2}\mathbf{p}^TK_1K_1^T&(\kappa\epsilon)^{-2}\mathbf{p}^TK_1K_1^T\mathbf{p}\end{bmatrix}P_{P2}^T$$

To avoid the determination of additional unknowns in Eq. (7), we note that

• All equations derived from elements off the diagonal are of the form $(\kappa\epsilon)^{-2}(\cdots)=0$, thus the constant can be eliminated

• The equation derived from element $(3,3)$ cannot be used without determining additional constants. So, we may only use the remaining five of the six original equations of (7)

The complete method to solve the metric reconstruction and self-calibration problem follows:

1. We solve the system (7), keeping five equations and discarding the equation derived from element $(3,3)$

2. In the previous step (1), we recovered $f_1,f_2$ and the plane at infinity $\mathbf{p}$. To fully determine the metric reconstruction, many different approaches are possible. We propose to repeat step 1, putting camera 2 at the origin of the coordinate system (in place of camera 1). This can be done by transposing the fundamental matrix for the camera pair. Following this approach, we may additionally determine the unknown constant

3. Using the homography of Eq. (4) or Eq. (19), we recover the metric reconstruction $P_{M1},P_{M2}$. Depending on the homography used, one camera matrix ($P_{M2}$ for Eq. (4) or $P_{M1}$ for Eq. (19)) will have its left-most block multiplied by a constant. This has no effect on the correctness of the representation, and the image points are the same in each case

### 3.5 Solution disambiguation and geometric relations of the two solutions

Figure 1: We visualise the geometric relations of Theorems 1, 2. In the graph, we display the centers of projection (C1, C2) and viewing directions (v1, v2) for each of the two solutions of Eq. (17). In pink, we display solution 1 and in red solution 2. The common plane of C1, C2, v1, v2 is highlighted

We use the Cheirality condition (Corollary 1) to determine the valid solution of Eq. (7). Whenever the two recovered solutions represent cameras with divergent viewing directions, the Cheirality condition is more likely to identify the valid solution. We explore in Theorems 1 and 2 the geometric relations between the two solutions, aiming to visualize the solutions' relations and disambiguation. Proofs of Theorems 1 and 2 are outlined in Figs. 2 and 3. Full proofs are given in the Supplemental Material.

Figure 2: An illustration of the intermediate results leading to Theorem 1

Figure 3: An illustration of the intermediate results leading to Theorem 2
###### Theorem 1.

Let

$$\{P^1_{m1},P^1_{m2}\},\quad\{P^2_{m1},P^2_{m2}\}$$

denote the reconstructions derived from (17). Then, cameras $P^1_{m2},P^2_{m2}$ are in mirror positions with respect to the origin (position of camera 1). The centers of projection satisfy

$$C^1_{m2}=-C^2_{m2}$$
###### Theorem 2.

Let camera 1 be positioned on the origin of the world coordinate system, with a viewing direction aligned to the $z$ axis. We denote by $\mathbf{v}^1_{m2},\mathbf{v}^2_{m2}$ the viewing directions of $P^1_{m2},P^2_{m2}$ and by $C^1_{m2},C^2_{m2}$ the position vectors of the corresponding centers of projection. Then, $C^1_{m2},C^2_{m2}$ bisect the angles formed by $\mathbf{v}^1_{m2},\mathbf{v}^2_{m2}$, in the plane defined by $C^1_{m2},C^2_{m2},\mathbf{v}^1_{m2},\mathbf{v}^2_{m2}$. Thus, we have:

$$\angle(C^1_{m2},\mathbf{v}^1_{m2})=\angle(C^1_{m2},\mathbf{v}^2_{m2})\qquad(20)$$
$$\angle(C^2_{m2},\mathbf{v}^1_{m2})=\angle(C^2_{m2},\mathbf{v}^2_{m2})\qquad(21)$$

From Theorems 1, 2, we easily deduce that

$$\angle(C^1_{m2},\mathbf{v}^j_{m2})+\angle(C^2_{m2},\mathbf{v}^j_{m2})=180^\circ,\quad j=1,2$$
###### Corollary 1.

The correct one of the solutions (17) can be identified by requiring all world points that are visible in both images to lie in the space in front of both cameras.
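As an illustration of this cheirality test, the sketch below (an assumed helper, not the paper's exact implementation) uses the standard signed-depth formula of Hartley & Zisserman to check whether triangulated homogeneous world points lie in front of both cameras:

```python
import numpy as np

def depth(P, X):
    """Signed depth of homogeneous world point X w.r.t. camera P = [M | p4],
    using depth = sign(det M) * w / (T * ||m3||), where x = P X = (., ., w),
    T is the last coordinate of X and m3 the third row of M."""
    M = P[:, :3]
    w = P[2] @ X
    T = X[3]
    return (np.sign(np.linalg.det(M)) * w) / (T * np.linalg.norm(M[2]))

def passes_cheirality(P1, P2, points_h):
    """points_h: homogeneous world points visible in both images."""
    return all(depth(P1, X) > 0 and depth(P2, X) > 0 for X in points_h)

# Toy check: two canonical-like cameras, one point in front, one behind.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_front = np.array([0.0, 0.0, 5.0, 1.0])
X_back = np.array([0.0, 0.0, -5.0, 1.0])
print(passes_cheirality(P1, P2, [X_front]))   # True
print(passes_cheirality(P1, P2, [X_back]))    # False
```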

## 4 Geometric verification of tentative image correspondences

### 4.1 Reduction to Longest Common Subsequence problem

The geometric property we pursue to enforce on tentative image correspondences is the order of the imaged points with respect to the horizontal and vertical image directions. We have:

• If a point, A, is imaged to the left of a point, B, in the first image, then A should be to the left of B in the second image as well. We call this property Consistency-x

• Similarly, if a point, A, is below another point, B, in the first image, then A should be below B in the second image as well. We call this property Consistency-y

To see how we can arrive at the LCS/LIS problem we examine each one of the two Consistency properties independently. We present here the analysis concerning Consistency-x.

We start with a formal definition of Consistency-x. A set of correspondences $S=\{(p_1^i,p_2^i)\}$, where $p_k^i$ is the x-coordinate of a point $i$ in image $k$, has Consistency-x if, for all points of image 1 in $S$:

1. All points in image 1 that are in the Consistent-x set and are to the left of $p_1^j$ match in image 2 with points that are to the left of $p_2^j$:

$$S=\{(p_1^i,p_2^i)\}:\;\forall i\,\forall j\quad p_1^i\leq p_1^j\implies p_2^i\leq p_2^j,\qquad(p_1^i,p_2^i),(p_1^j,p_2^j)\in S$$

2. All points in image 1 that are in the Consistent-x set and are to the right of $p_1^j$ match in image 2 with points that are to the right of $p_2^j$:

$$S=\{(p_1^i,p_2^i)\}:\;\forall i\,\forall j\quad p_1^i\geq p_1^j\implies p_2^i\geq p_2^j,\qquad(p_1^i,p_2^i),(p_1^j,p_2^j)\in S$$

We seek the most populous set of correspondences which is Consistent-x. We can reduce the Consistency-x problem to LIS in the following way: We sort the points of image 1 with respect to the x-axis. This sorting is a permutation of the sequence of correspondences. We apply this same permutation to the matched points of image 2 and get the sequence of their x-coordinates. We seek the LIS of this last sequence.

The LCS/LIS problems are efficiently solved with complexity $O(n\log n)$ (aldous1999longest; hunt1977fast; fredman1975computing), or even $O(n\log\log n)$ (van1975preserving) if special data structures are implemented. To solve LCS/LIS, we used the patience sorting algorithm (aldous1999longest).
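For concreteness, a compact patience-sorting-style LIS routine is sketched below (our illustration, not the paper's code; `seq` would hold the image-2 x-coordinates after sorting the correspondences by their image-1 x-coordinate):

```python
import bisect

def longest_increasing_subsequence(seq):
    """One longest strictly increasing subsequence of seq, O(n log n).
    For the non-strict '<=' of the Consistency definitions, replace
    bisect_left with bisect_right."""
    piles = []   # piles[k]: smallest tail of an increasing subseq of length k+1
    idx = []     # idx[k]: index in seq of piles[k]
    back = []    # back[i]: predecessor of element i in its subsequence
    for i, x in enumerate(seq):
        k = bisect.bisect_left(piles, x)
        if k == len(piles):
            piles.append(x); idx.append(i)
        else:
            piles[k] = x; idx[k] = i
        back.append(idx[k - 1] if k > 0 else -1)
    # Walk the predecessor links from the top of the last pile.
    out, i = [], idx[-1] if idx else -1
    while i != -1:
        out.append(seq[i]); i = back[i]
    return out[::-1]

print(longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 4, 5, 6]
```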

#### 4.1.1 Perplexities of the combined Consistency-x,y problem and an efficient approximate method

In the combined Consistency-x,y problem we seek the largest subset of correspondences which is consistent in both the x and y axes. The relation "Consistency-x and Consistency-y" is not transitive, which can easily be observed with a counter-example. Thus, the Consistency-x,y relation is not a partial order, a condition sufficient to rule out a reduction to LCS/LIS (fredman1975computing).

Formally, in the Consistency-x,y problem, we seek a set $S$ of image correspondences so that:

• $S$ has the Consistency-x property

• $S$ has the Consistency-y property

• The number of elements ($|S|$) of the set is maximized

We propose an approximate solution by the following method:

1. We find the largest Consistent-x subset, $S_x$, solving an LIS problem

2. We find the largest Consistent-y subset of $S_x$, solving again an LIS problem

In our suboptimal solution Consistency-x,y holds, thus our primary aim to reject erroneous matches is achieved. Nevertheless, some true matches are rejected.
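The two-pass approximation above can be sketched as follows (helper names are ours; we use a longest non-decreasing subsequence to honor the $\leq$ in the Consistency definitions, and `matches` holds `((x1, y1), (x2, y2))` point pairs):

```python
import bisect

def lis_indices(seq):
    """Indices of one longest non-decreasing subsequence of seq."""
    piles, idx, back = [], [], []
    for i, x in enumerate(seq):
        k = bisect.bisect_right(piles, x)      # '<=' comparisons allow ties
        if k == len(piles):
            piles.append(x); idx.append(i)
        else:
            piles[k] = x; idx[k] = i
        back.append(idx[k - 1] if k > 0 else -1)
    out, i = [], idx[-1] if idx else -1
    while i != -1:
        out.append(i); i = back[i]
    return out[::-1]

def consistent_xy(matches):
    # Pass 1: sort by x in image 1, LIS on the x-coordinates in image 2.
    order = sorted(range(len(matches)), key=lambda i: matches[i][0][0])
    keep = [order[i] for i in lis_indices([matches[j][1][0] for j in order])]
    # Pass 2: same procedure on the survivors, using the y-coordinates.
    order = sorted(keep, key=lambda i: matches[i][0][1])
    keep = [order[i] for i in lis_indices([matches[j][1][1] for j in order])]
    return [matches[i] for i in keep]
```

Applied to three order-preserving matches plus one shuffled outlier, the outlier is rejected in the first pass.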

### 4.2 A practical verification method

Figure 4: Effect of camera coordinate system rotation on the depiction of parallel lines. The figures were created by projecting a 3D structure at a constant depth (z=zconst) to the image plane, thus effects of different scene depth are not shown

The consistency properties we introduced depend on assumptions on the geometric structure of the scene. In photos of architectural scenes, the $y$ axis of the camera coordinate system usually aligns with the vector perpendicular to the ground, supporting the Consistency assumptions. In special cases, as in photographs of houses on a street, the camera $x$ axis may also be aligned between photographs.

Such assumptions may be violated between different views. The effects of camera rotations on a scene are illustrated in Fig. 4. We observe that:

• Lines parallel to the $x$ or $y$ axis in one image may appear tilted in another, if the camera coordinate systems are not aligned. The same effect is caused by scene depth variation

• The relative order of points may change between two images. Moreover, it is more likely for two points to change order with respect to the $y$-axis if those points are close in the $y$ axis but distant in the $x$ axis

Still, in the case of photographs of architectural scenes, we can assume small rotations around the axes, as the photographer's position in space is constrained. In-plane rotations are uncommon and can nevertheless be fixed automatically (gallagher2005using).

#### 4.2.1 Approximations to Consistency properties

As the Consistency properties are violated by projective phenomena, enforcing them strictly leads to the rejection of many true correspondences. Thus, we relax the Consistency properties to arrive at a practical verification method. We describe the method concerned with the order of points on the $y$ axis; similar modifications apply to the Consistency-x property.

First, we introduce a threshold value $T$ to allow violations in the order of points that remain within a predefined distance range. So, two consecutive sequence points $s_i, s_{i+1}$ are considered in correct order if

$$s_i-T\leq s_{i+1}$$

where $s_i$ is the coordinate of the i-th point in the ordered sequence. We set $T$ as a fraction $\alpha$ of the maximum distance, in the $y$ axis, of any two points in the image we examine that matched to points in the paired image:

$$T=\alpha(y_{max}-y_{min})\qquad(22)$$

In the following we refer to this process as “setting the threshold as a percentage of image size”.

We propose to use a recursive method, acting on image subregions of different size. We solve a sequence of LIS problems, each one with a different $T$ value, set as a percentage of the size of the region examined:

1. We solve an LIS problem using a threshold $T$ set as a percentage of image size. The result is a consistent set of correspondences

2. We split the image in two subregions, each with an equal number of correspondences. The split is done on the $x$ axis

3. (Recursion): We repeat the process on each of the two subregions. We terminate if the region size is smaller than a predefined number of pixels

The recursive method has the advantage of allowing larger violations of the order in the $y$ axis for points that are distant in the $x$ axis, as explained in Section 4.2. Concerning the computational complexity, we have:

$$\text{Complexity}=n\log n+\sum_L\sum_i n_i\log n_i\leq(L+1)\,n\log n,\quad\text{since }n_i\leq n,\ \textstyle\sum_i n_i\leq n$$

where $L$ is the number of recursion steps. $L$ depends on the initial image size and the terminal region size. Consequently, the recursive method adds no significant computational burden to the initial LIS problem formulation.

Finally, we remark that other approaches, such as dropping the recursion or fitting a simple transform that maps lines between the images to estimate the $T$ value, produced worse results than the proposed recursive method.
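A compact sketch of the recursive procedure follows (our illustration under assumed parameter values; for clarity we use a simple O(n²) dynamic program for the thresholded order instead of patience sorting):

```python
def thresholded_lis(seq, T):
    """Indices of a longest subsequence whose consecutive elements satisfy
    the relaxed order s_i - T <= s_{i+1} of Section 4.2.1."""
    n = len(seq)
    best, back = [1] * n, [-1] * n
    for k in range(n):
        for j in range(k):
            if seq[j] - T <= seq[k] and best[j] + 1 > best[k]:
                best[k], back[k] = best[j] + 1, j
    i = max(range(n), key=best.__getitem__) if n else -1
    out = []
    while i != -1:
        out.append(i); i = back[i]
    return out[::-1]

def verify_recursive(pts, alpha=0.05, min_extent=50.0):
    """pts: (x1, y1, y2) triples, i.e. the position in image 1 and the
    y-coordinate of the matched point in image 2.  alpha and min_extent
    are assumed illustrative values."""
    if not pts:
        return []
    order = sorted(pts, key=lambda p: p[1])        # order by y in image 1
    ys2 = [p[2] for p in order]
    T = alpha * (max(ys2) - min(ys2))              # threshold of Eq. (22)
    kept = [order[i] for i in thresholded_lis(ys2, T)]
    xs = [p[0] for p in kept]
    if len(kept) < 2 or max(xs) - min(xs) < min_extent:
        return kept                                # region small enough: stop
    kept.sort(key=lambda p: p[0])                  # split on the x axis,
    mid = len(kept) // 2                           # equal counts per side
    return (verify_recursive(kept[:mid], alpha, min_extent)
            + verify_recursive(kept[mid:], alpha, min_extent))
```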

## 5 An application to the multiple-view reconstruction problem

We integrate our methods for the geometric verification of image correspondences and the pair-based estimation of $f, R$ into existing pipelines to solve the multiple-view reconstruction problem and produce a 3D model of a scene.

Our approach is outlined in Fig. 5. The final reconstruction is done using the non-sequential SfM with known rotations formulation of (sfmrot2), which we modify extensively, using the methods of the preceding sections as well as the averaging algorithms we describe in the following.

Figure 5: Our pipeline to reconstruct a 3D scene from an unordered set of 2D photographs. In the first row, we display a flow diagram of the algorithm stages. Novel parts are displayed in green. The second row outlines the core methods we use. We highlight methods we introduced in preceding sections. The two final rows contain a visualization of data type and most important references per stage

### 5.1 Averaging pair-based solutions for f,R

In this paper we introduced computationally efficient methods for $f, R$ estimation, which we apply to randomly sampled minimal correspondence sets, in a way that resembles RANSAC procedures (ransacalgcit). The multiple $f, R$ estimates, one from each minimal sample, are then averaged to produce the final solutions.

In the case of $f$, we introduce a novel averaging method. In the case of pairwise rotations $R$, we apply the Weiszfeld algorithm (rotavealg2; rotavealg1), which converges to the median ($L_1$-average) rotation. We also use a form of the Weiszfeld algorithm (multiple rotation averaging) in the rotation registration problem to get the final camera rotation matrices (Section 5.1.2).

#### 5.1.1 Focal length estimates

The distribution of $f$ estimates collected from all the possible image pairs can be skewed or multimodal (Fig. 6), in which case the mean or median estimate will not correctly determine the value of $f$.

Figure 6: f estimates distribution. f estimates were collected from all available camera pairs. We observe that for some cameras (right), focal length can be readily determined. The opposite holds for other cameras (left). Data from castle-P30 (strecha2008benchmarking)

We introduce new measures to evaluate the fit of focal length estimates. We initially introduce the Confidence count (cc) and then modify cc using the problem structure to introduce the Joint confidence count (Jcc). We assume that in each image pair that contains image $i$, we receive a number of correct and a number of erroneous estimates for $f_i$, and that erroneous estimates originating from different image pairs vary significantly in value, whereas correct ones aggregate.

Figure 7: A demonstration of confidence count computation. Each disk represents a fi estimate. We compute the cc for the central value, depicted here with a bold border. This cc depends on the number of estimates within a β=10% range, depicted with a red square in the picture

We visualize the cc computation in Fig. 7. Simplifying aspects of the computation, we can describe it as a binning procedure, where the bin range is adapted to contain all estimates within a $\beta$ deviation:

1. We collect all estimates of $f_i$, originating from all the different images we have matched with image $i$

2. For each estimate, we count the number of estimates within a $\beta$ error range of it. This sum is the confidence count of the estimate

3. We normalize cc values to the $[0,1]$ range. This step is critical for the Jcc computation

To further improve the estimation, we introduce the Jcc (Fig. 8). Since each $f_1$ estimate is paired with some $f_2$ estimate (the estimates were computed in an image pair), we expect that if $f_1$ is a good estimate then $f_2$ will be accurate too. To compute the Jcc, we follow a procedure similar to cc, but this time each estimate in range contributes a different amount to the Jcc sum. This amount is proportional to the cc of the $f_2$ estimate that is paired with $f_1$. Good estimates have higher confidence counts, and contribute more to the Jcc.

Figure 8: A demonstration of joint confidence count computation. Each disk represents a f1 estimate. We compute the Jcc for the central value, depicted here with a bold border. This time, each disk is divided in half, to demonstrate that each f1 estimate is paired with one f2 estimate. We use different disk colors for different cameras. Jcc depends on the sum of elements within range (inside the red square). In contrast with cc computation, each element contributes a different amount to the sum, depending on cc of f2

In greater detail, to compute the Jcc of an estimate f of image i, we have:

1. Let j index the images we matched with image i. For each image j we have:

• From all estimates within range of f, we pick the ones that originate from pair (i, j).

• Since every estimate originating from pair (i, j) is matched to an estimate of image j, from the estimates of the previous step we get the corresponding estimates of image j

• For each of these estimates of image j, we have a confidence count cc. We take their mean. We do not use the direct sum, to diminish the influence of a large number of low cc’s.

2. Jcc is the sum of the previous mean values.
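The Jcc computation can be sketched as below. The function name and input layout are our assumptions: we precompute the cc of every paired f2-side estimate (normalized to [0, 1], as in the cc step) and pass it alongside each f1-side estimate:

```python
def joint_confidence_count(f_center, pair_estimates, beta=0.10):
    """Jcc for one estimate f_center of image i.

    pair_estimates maps every image j matched with i to a list of
    (f_i_estimate, cc_of_paired_f_j_estimate) tuples; the cc of the
    paired estimate is assumed precomputed and normalized to [0, 1]."""
    lo, hi = f_center * (1 - beta), f_center * (1 + beta)
    jcc = 0.0
    for ests in pair_estimates.values():
        # cc's of the paired estimates whose f_i partner is in range
        in_range = [cc for f, cc in ests if lo <= f <= hi]
        if in_range:
            # mean instead of direct sum, so that many low-cc
            # estimates from one pair cannot dominate
            jcc += sum(in_range) / len(in_range)
    return jcc
```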

#### 5.1.2 Rotation estimates

In this section, we summarize rotation averaging using the Weiszfeld algorithm (rotavealg2; rotavealg1). The Weiszfeld algorithm returns the L1-mean of a set of points in Euclidean space. Many different distance metrics have been defined for rotation matrices (rotavealg1). We limit our analysis here to

 dgeometric(R, S) ≜ angle of rotation RS⁻¹

The Weiszfeld algorithm is a gradient-descent method and is guaranteed to converge to the true L1-mean in the case of single rotation averaging, i.e. the averaging of pairwise estimates of a single rotation.

The L1-mean of estimates R1, …, Rn of a single rotation is the rotation R that minimizes:

 ∑_{i=1}^{n} dgeometric(Ri, R)

In the case of rotation registration, by contrast, the convergence of the Weiszfeld algorithm is not guaranteed.

In detail, we applied the Weiszfeld algorithm to weight the estimates of the pairwise rotations that we acquired through random sampling of minimal point sets, yielding a single averaged solution.
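A single-rotation Weiszfeld averaging step under the geometric distance above can be sketched as follows. This is our illustrative implementation, not the authors' code; it assumes rotation angles well below π and uses the standard tangent-space formulation (map all rotations to the tangent space at the current estimate, take one Weiszfeld geometric-median step, map back to SO(3)):

```python
import numpy as np

def log_so3(R):
    """Rotation matrix -> axis-angle vector (angle * unit axis).
    Assumes the rotation angle is well below pi."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * w / (2 * np.sin(theta))

def exp_so3(v):
    """Axis-angle vector -> rotation matrix (Rodrigues formula)."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return np.eye(3)
    k = v / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def weiszfeld_rotation_mean(Rs, n_iter=20, eps=1e-9):
    """L1-mean of rotation estimates Rs via the Weiszfeld algorithm."""
    R = Rs[0]
    for _ in range(n_iter):
        vs = [log_so3(S @ R.T) for S in Rs]
        norms = [max(np.linalg.norm(v), eps) for v in vs]
        # Weiszfeld (geometric-median) step in the tangent space
        step = sum(v / m for v, m in zip(vs, norms)) / sum(1 / m for m in norms)
        R = exp_so3(step) @ R
    return R
```

For estimates consisting of three rotations by 0.1 rad and one outlier by 0.5 rad about the same axis, the L1-mean stays at the 0.1 rad rotation, illustrating the robustness of the L1-mean to outliers.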

In the rotation registration problem we applied the Weiszfeld algorithm in the following manner:

1. We construct the rotation graph, with one node for every image and an edge between two nodes if we know the relative rotation between the respective images. We take a spanning tree in this graph and, composing relative rotations along the tree edges, we get the initial rotation estimates

2. For every node in the graph, we use all available relative rotations to obtain (generally inconsistent) estimates of the node’s rotation, by composing each relative rotation with the current estimate at the neighboring node. We average these estimates with one iteration of the Weiszfeld algorithm

3. We repeat the previous step a fixed number of times

In all our experiments we used the same fixed number of iterations of this procedure.

## 6 Results & Discussion

### 6.1 Metric Reconstruction in Pairs of Images

We implemented the Kruppa equations, a well-studied and popular method for camera self-calibration, and used them as the reference method for the estimation of internal camera parameters. To compare the methods, we used synthetic camera projection matrices and image correspondences. We added Gaussian noise to the image point positions, and not to world points or other entities, to simulate actual noisy correspondences.

We observed that our method (Sec. 3) and the Kruppa method produce identical estimates. In rare cases with extremely noise-corrupted correspondences, our method failed and the Kruppa method produced largely erroneous focal length estimates.

We conclude that the two methods are equivalent concerning the self-calibration problem. Still, our method is advantageous in additionally providing a metric reconstruction.

Next, we evaluate camera pair reconstruction. We compared our method to the 5-Point (5P) algorithm (5palgcit). We used both approaches as initialization to BA (BAalgcit) and evaluated the quality of the final reconstructions (Tab. 2). We used a multiple-view dataset (datasetcit) and determined the relative positions of all camera pairs with point correspondences. The same focal length estimates were used in both compared approaches; these estimates were obtained by the method we introduced in Sec. 3. The two methods we compared were:

• Initialize BA with our method: We randomly sampled minimal subsets of correspondences and averaged the acquired solutions with rotation averaging (rotavealg2). We allowed for 20 BA iterations

• Initialize BA with 5P algorithm: We used a RANSAC procedure to sample minimal 5P subsets and to pick the solution. We allowed for 20 BA iterations

To quantify the reconstruction error, we used the angle (ΔR) between the estimated and the true relative rotation between two paired views.

The initialization of BA is important, both to improve convergence and to reduce the computational cost. We observe that both the 5P algorithm and our method can be used as BA initialization with similar performance (Tab. 2). This result implies that, to further reduce the reconstruction error, we should improve other problem parameters, such as the image correspondences and the focal length estimates.
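The angular error between an estimated and a ground-truth rotation is a standard quantity; a minimal sketch (the function name is ours):

```python
import numpy as np

def rotation_error_deg(R_est, R_true):
    """Angle, in degrees, of the relative rotation R_est @ R_true^T,
    i.e. the geodesic distance between the two rotations."""
    c = (np.trace(R_est @ R_true.T) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))
```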

### 6.2 Geometric verification of tentative correspondences

Figure 9: Demonstration of tentative correspondence verification. Left: Initial correspondences. Right: Verified correspondences using the geometric verification method we introduced

In geometric verification, correspondences are classified as correct or erroneous. We evaluate this classifier using precision and recall. Two different datasets were used:

• Dataset (datasetcit): The set contains outdoor scenes of landmark buildings. The ground truth camera matrices are provided, from which we can separate correct and erroneous correspondences. In detail, from the given camera matrices we recover the fundamental matrix and then evaluate Sampson’s approximation to the geometric error for each tentative correspondence (sampsoncit)

• Dataset (wongiccv2011): This set contains both architectural scenes and scenes with objects. We give performance results on each of those two subsets independently. In this set, point correspondences between images are provided and labeled as correct or erroneous

The results are presented in Tables 3, 4. Precision of the classifier is more important than its recall, as it is more important to have an outlier-free set of correspondences than to recover all true correspondences. Furthermore, the recursive verification method discards erroneous matches with very high precision. This result supports our argument that points more distant in one image axis, e.g. x, are more likely to violate order with respect to the other axis, e.g. y. Finally, performance varies with scene type. The development of our method was based on scene properties found in architectural scenes. In scenes composed of objects, differences in scene geometry and the increased freedom in the viewer’s position cause more violations of the Consistency properties. In Dataset (datasetcit), the performance in scenes of one main building, and consequently of a single main horizontal and vertical direction, such as Fountain-P11, entry-P10 and Herz-Jesu-P8, reaches almost flawless precision. In more complex scenes that include more buildings, such as castle-P30, the achieved precision degrades.
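The Sampson approximation used to label correspondences in the first dataset is a standard first-order approximation to the geometric error; a sketch (function name ours, points as homogeneous 3-vectors):

```python
import numpy as np

def sampson_error(F, x1, x2):
    """First-order (Sampson) approximation to the geometric error of a
    tentative correspondence x1 <-> x2 with respect to a fundamental
    matrix F (x2^T F x1 = 0 holds for a perfect, noise-free match)."""
    Fx1 = F @ x1        # epipolar line of x1 in the second image
    Ftx2 = F.T @ x2     # epipolar line of x2 in the first image
    num = float(x2 @ Fx1) ** 2
    den = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return num / den
```

Correct matches yield values near zero; thresholding this error separates correct from erroneous correspondences.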

### 6.3 Improving focal length estimation in multi-view reconstructions

We show in Tab. 5 the improved estimates we obtain with cc. Further improvement is achieved by the Jcc measure. To quantify the error in focal length estimation we use (chandraker2007autocalibration; gherardi2010practical; kukelova2008polynomial):

 Δf ≜ |f̂/f − 1|, where f̂ is an estimate of f

In Fig. 10 we present a non-ambiguous Jcc distribution from which the focal length can be correctly determined, in contrast to the estimate distributions displayed in Fig. 6.
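The error measure above is straightforward to compute; a one-line sketch (function name ours):

```python
def focal_error(f_hat, f_true):
    """Relative focal length error: Delta f = |f_hat / f_true - 1|."""
    return abs(f_hat / f_true - 1.0)
```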

### 6.4 Multi-view reconstruction in unordered image sets

In Tab. 6, we provide quantitative performance measures for multi-view reconstructions acquired by applying the proposed pipeline (Fig. 5) on unordered image datasets, with no other input apart from the scene photographs. To quantify the reconstruction error in camera translation, we used the angle between the translation estimate and the available true translation. In Fig. 11 we qualitatively display the results of the proposed reconstruction pipeline. The results in Tab. 6 and Fig. 11 demonstrate that the introduced methods can be used in unordered image sets to produce quality reconstructions of the photographed scenes.

Figure 11: 3D reconstruction results, as dense point clouds. Top Row: Datasets (strecha2008benchmarking), Herz-Jesu-P25 (left) and Fountain-P11. Middle Row: Dataset castle-P19 (strecha2008benchmarking) (left) and photo set of the Monument of Lysicrates, Athens (right). Bottom Row: Photo sets of locations in Athens, Parthenon (left) and Karyatids (right). The scenes in Athens were photographed by the authors with a simple compact camera

## 7 Conclusions

Using the DIAC, we developed a linear self-calibration and metric reconstruction method. Two theorems describe the relative configuration of the two recovered solutions and support the use of the Cheirality condition for solution disambiguation. Comparisons to the Kruppa equations and the 5P algorithm revealed that our method performs similarly to these standard approaches. Subsequently, we showed that the large number of estimates produced by our self-calibration and metric reconstruction method can be utilised through averaging methods, shifting our focus from choosing the best single solution to finding the best solution-averaging method, e.g. an optimised focal length estimate prior to self-calibration. We also developed a general method to verify point matches between images, which can be solved by reduction to LCS. The corresponding verification method can be used in any problem that takes image correspondences as input. The verification method successfully rejected outliers in both architectural and general scenes, with more success in the former category. All our methods were integrated into a full multiple-view reconstruction pipeline that produces visually high-quality reconstructions on both standard datasets and image sets we shot using a conventional camera. Multi-view reconstructions were obtained by combining camera pair reconstructions using rotation averaging algorithms and a novel approach to averaging focal length estimates.

## Appendix A Gaussian elimination in self-calibration and metric reconstruction equations

To simplify the expressions, we introduce shorthand notation and permute the elements of the unknown vector with the permutation

 4 ↔ 6

We denote the permuted vector and the corresponding system matrix accordingly. Using this notation, we write the system as

 ⎡ r1  ϕ11 Pγ2            ψ1 ⎤
 ⎢ r2  ϕ12 Pγ2 + ϕ22 Pγ3  ψ2 ⎥
 ⎢ r3  ϕ13 Pγ1 + ϕ23 Pγ3  ψ3 ⎥   (23)
 ⎢ r4  ϕ14 Pγ1            ψ4 ⎥
 ⎣ r5  ϕ15 Pγ1 + ϕ25 Pγ2  ψ5 ⎦

where the ri denote the rows, the Pγi are vectors, and the ϕij, ψi are appropriate constants of no special structure.

We aim to eliminate certain elements in the rows and columns of the permuted system matrix and then to apply regular Gaussian elimination. This is generally possible owing to the structure of the rows in (23), which are linear combinations of the vectors Pγ1, Pγ2, Pγ3; moreover, using the canonical projective reconstruction allows us to substitute

 Pγ3=d1Pγ1+d2Pγ2 (24)

Thus, the elimination of these elements is straightforward by applying row operations to the matrix. We then apply ordinary Gaussian elimination to reduce the system to the form of (10).

## Appendix B Geometric relations between the two recovered solutions for metric reconstruction of a camera pair

We proceed with the proofs of Theorems 1 and 2.

###### Result 5.

Let P denote a projection matrix. The center of projection C has no image, as it is projected to the point (0, 0, 0)⊤, which does not correspond to any image point. Equivalently, C is a right null-vector of P, i.e. PC = 0.

###### Result 6.

Let P denote a projection matrix. P can be decomposed as

 P = [KR | −KRC]

Results 5, 6 describe properties of the camera position C. The following Result is concerned with the camera viewing direction.

###### Result 7.

Assume a projection matrix

 P = [M | p]

Let the vector m3 denote the third row of M. Then the vector

 v = det(M) m3

is in the direction of the principal axis (the viewing direction) of and is directed towards the front of the camera.
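Results 5 and 7 translate directly into code; a minimal sketch with numpy (the function names are ours, and the null-vector is recovered via SVD, which assumes a finite camera center so that dehomogenization is possible):

```python
import numpy as np

def camera_center(P):
    """Center of projection C as the right null-vector of the 3x4
    projection matrix P (Result 5): P @ C = 0."""
    _, _, Vt = np.linalg.svd(P)
    C = Vt[-1]
    return C / C[-1]  # dehomogenize (assumes a finite camera center)

def principal_axis(P):
    """Viewing direction v = det(M) * m3, where P = [M | p] and m3 is
    the third row of M (Result 7); v points towards the camera front."""
    M = P[:, :3]
    return np.linalg.det(M) * M[2, :]
```

For P = [I | −C] the recovered center is C and the principal axis is the +z direction, matching the canonical camera.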

The next two lemmas describe properties of metric reconstructions derived from Eq. (17).

###### Lemma 1.

Let

 P1^{m2} = [K1² R1 | a1]
 P2^{m2} = [K2² R2 | a2]

be the projection matrices for cameras 1 and 2 derived from Eq. (17).

Then

 a1=a2≜a
###### Proof.

Considering:

1. The form of homography (4)