Camera pose estimation is a fundamental problem in computer vision, which aims at determining the pose and orientation of a camera solely from an image. This localization problem appears in many interesting real-world applications, such as for the navigation of self-driving cars , in incremental environment mapping such as Structure-from-Motion (SfM) [1, 11, 13], or for augmented reality [8, 9, 14], where a significant component are algorithms that aim to estimate an accurate camera pose in the world from image data.
Given a three-dimensional point-cloud model of a scene, the classical, but also state-of-the-art approach to absolute camera pose estimation consists of a two-step procedure. First, one matches a large number of features in the two-dimensional camera image with corresponding features in the three-dimensional scene. Then one uses these putative correspondences to determine the pose and orientation of the camera. Typically, the matches obtained in the first step contain many incorrect associations, forcing the second step to use filtering techniques to reject incorrect matches. Subsequently, the absolute 6 degrees-of-freedom (DoF) camera pose is estimated, for example, with a perspective-point pose solver  within a RANSAC scheme .
In this work we concentrate on the second step of the camera pose problem. That is, we consider the task of estimating the camera pose and orientation from a (potentially large) set of already calculated image-to-scene correspondences.
Further, we assume that we are given a common direction between the world and camera frames. For example, inertial sensors, available on any smart-phone nowadays, allow to estimate the vertical gravity direction in the three-dimensional camera coordinate system. This alignment of the vertical direction fixes two degrees of freedom for the rotation between the frames and we are left to estimate four degrees of freedom out of the general six. To obtain four equations (in the four remaining degrees of freedom), this setup requires two pairs of image-to-scene correspondences111As we will see later in detail, each correspondence imposes two constraints on the camera pose. for a minimal solver. Hence a corresponding naive RANSAC-based scheme requires filtering steps, where in each iterations a pose hypothesis based on a different pair of correspondences is computed and verified against all other correspondences.
Recently, Zeisl et al. 
proposed a Hough-voting inspired outlier filtering and camera posing approach, which computes the camera pose up to an accuracy offrom a set of D-D correspondences, in time, under the same alignment assumptions of the vertical direction. In this paper we propose new algorithms that work considerably faster in practice, but under milder assumptions. Our method is based on a reduction of the problem to a problem of counting -approximate incidences between points and surfaces, where a point is -approximately incident (or just -incident) to a surface if the (suitably defined) distance between and is at most . This notion has recently been introduced by a subset of the authors in , and applied in a variety of instances, involving somewhat simpler scenarios than the one considered here. Our approach enables us to compute a camera pose when the number of correspondences is large, and many of which are expected to be outliers. In contrast, a direct application of RANSAC-based methods on such inputs is very slow, since the fraction of inliers is small. In the limit, trying all pairs of matches involves RANSAC iterations. Moreover, our methods enhance the quality of the posing considerably , since each generated candidate pose is close to (i.e., consistent with) with many of the correspondences.
Our results. We formalize the four degree-of-freedom camera pose problem as an approximate incidences problem in Section 2. Each 2D-3D correspondence is represented as a two-dimensional surface in the 4-dimensional pose-space, which is the locus of all possible positions and orientations of the camera that fit the correspondence exactly. Ideally, we would like to find a point (a pose) that lies on as many surfaces as possible, but since we expect the data to be noisy, and the exact problem is inefficient to solve anyway, we settle for an approximate version, in which we seek a point with a large number of approximate incidences with the surfaces.
Formally, we solve the following problem. We have an error parameter , we lay down a grid on of side length , and compute, for each vertex of the grid, a count of surfaces that are approximately incident to , so that (i) every surface that is -incident to is counted in , and (ii) every surface that is counted in is -incident to , for some small constant (but not all -incident surfaces are necessarily counted). We output the grid vertex with the largest count (or a list of vertices with the highest counts, if so desired).
As we will comment later, (a) restricting the algorithm to grid vertices only does not miss a good pose : a vertex of the grid cell containing serves as a good substitute for , and (b) we have no real control on the value of , which might be much larger than the number of surfaces that are -incident to , but all the surfaces that we count are ‘good’—they are reasonably close to . In the computer vision application, and in many related applications, neither of these issues is significant.
We give three algorithms for this camera-pose approximate-incidences problem. The first algorithm simply computes the grid cells that each surface intersects, and considers the number of intersecting surfaces per cell as its approximate -incidences count. This method takes time for all vertices of our -size grid. We then describe a faster algorithm using geometric duality, in Section 3. It uses a coarser grid in the primal space and switches to a dual 5-dimensional space (a 5-tuple is needed to specify a D-D correspondence and its surface, now dualized to a point). In the dual space each query (i.e., a vertex of the grid) becomes a 3-dimensional surface, and each original 2-dimensional surface in the primal 4-dimensional space becomes a point. This algorithm takes time, and is asymptotically faster than the simple algorithm for .
Finally, we give a general method for constructing an approximate incidences data structure for general -dimensional algebraic surfaces (that satisfy certain mild conditions) in , in Section 4. It extends the technique of Fonseca and Mount , designed for the case of hyperplanes, and takes time, where the degree of the polynomial in depends on the number of parameters needed to specify a surface, the dimension of the surfaces, and the dimension of the ambient space. We first present and analyze this technique in full generality, and then apply it to the surfaces obtained for our camera posing problem. In this case, the data structure requires storage and is constructed in roughly the same time. This is asymptotically faster than our primal-dual scheme when (for the term dominates and these two methods are asymptotically the same). Due to its generality, the latter technique is easily adapted to other surfaces and thus is of general interest and potential. In contrast, the primal-dual method requires nontrivial adaptation as it switches from one approximate-incidences problem to another and the dual space and its distance function depend on the type of the input surfaces.
We implemented our algorithms and compared their performance on real and synthetic data. Our experimentation shows that, for commonly used values of and in practical scenarios (, ), the primal-dual scheme is considerably faster than the other algorithms, and should thus be the method of choice. Due to lack of space, the experimentation details are omitted in this version, with the exception of a few highlights. They can be found in the appendix.
2 From camera positioning to approximate incidences
Suppose we are given a pre-computed three-dimensional scene and a two-dimensional picture of it. Our goal is to deduce from this image the location and orientation of the camera in the scene. In general, the camera, as a rigid body in 3-space, has six degrees of freedom, three of translation and three of rotation (commonly referred to as the yaw, pitch and roll). We simplify the problem by making the realistic assumption, that the vertical direction of the scene is known in the camera coordinate frame (e.g., estimated by en inertial sensor on smart phones). This allows us to rotate the camera coordinate frame such that its -axis is parallel to the world -axis, thereby fixing the pitch and roll of the camera and leaving only four degrees of freedom , where is the location of the camera center, say, and is its yaw, i.e. horizontal the orientation of the optical axis around the vertical direction. See Figure 1.
By preprocessing the scene, we record the spatial coordinates of a discrete (large) set of salient points. We assume that some (ideally a large number) of the distinguished points are identified in the camera image, resulting in a set of image-to-scene correspondences. Each correspondence is parameterized by five parameters, the spatial position and the position in the camera plane of view of the same salient point. Our goal is to find a camera pose so that as many correspondences as possible are (approximately) consistent with it, i.e., the ray from the camera center to goes approximately through in the image plane, when the yaw of the camera is .
2.1 Camera posing as an -incidences problem
Each correspondence and its 5-tuple define a two-dimensional surface in parametric 4-space, which is the locus of all poses of the camera at which it sees at coordinates in its image. For correspondences, we have a set of such surfaces. We prove that each point in the parametric -space of camera poses that is close to a surface , in a suitable metric defined in that -space, represents a camera pose where is projected to a point in the camera viewing plane that is close to , and vice versa (see Section 2.2 for the actual expressions for these projections). Therefore, a point in -space that is close to a large number of surfaces represents a camera pose with many approximately consistent correspondences, which is a strong indication of being close to the correct pose.
Extending the notation used in the earlier work , we say that a point is -incident to a surface if . Our algorithms approximate, for each vertex of a grid of side length , the number of -incident surfaces and suggest the vertex with the largest count as the best candidate for the camera pose. This work extends the approximate incidences methodology in  to the (considerably more involved) case at hand.
2.2 The surfaces
Let be a salient point in , and assume that the camera is positioned at
. We represent the orientation of the vector, within the world frame, by its spherical coordinates , except that, unlike the standard convention, we take to be the angle with the -plane (rather than with the -axis):
In the two-dimensional frame of the camera the -coordinates model the view of , which differs from above polar representation of the vector only by the polar orientation of the viewing plane itself. Writing for , we have
We note that using does not distinguish between and , but we will restrict to lie in or in similar narrower ranges, thereby resolving this issue.
We use with coordinates as our primal space, where each point models a possible pose of the camera. Each correspondence is parameterized by the triple , and defines a two-dimensional algebraic surface of degree at most , whose equations (in ) are given in (1). It is the locus of all camera poses at which it sees at image coordinates . We can rewrite these equations into the following parametric representation of , expressing and as functions of and :
For a camera pose , and a point , we write
In this notation we can write the Equations (1) characterizing (when regarded as equations in ) as and .
2.3 Measuring proximity
Given a guessed pose of the camera, we want to measure how well it fits the scene that the camera sees. For this, given a correspondence , we define the frame distance between and as the -distance between and , where, as in Eq. (3), , . That is,
Note that are the coordinates at which the camera would see if it were placed at position , so the frame distance is the -distance between these coordinates and the actual coordinates at which the camera sees ; this serves as a natural measure of how close is to the actual pose of the camera.
We are given a viewed scene of distinguished points (correspondences) . Let denote the set of surfaces , representing these correspondences. We assume that the salient features and the camera are all located within some bounded region, say . The replacement of by makes its range unbounded, so we break the problem into four subproblems, in each of which is confined to some sector. In the first subproblem we assume that , so . The other three subproblems involve the ranges , , and . We only consider here the first subproblem; the treatment of the others is fully analogous. In each such range, replacing by does not incur the ambiguity of identifying with .
Given an error parameter , we seek an approximate pose of the camera, at which many correspondences are within frame distance at most from , as given in (4).
The following two lemmas relate our frame distance to the Euclidean distance. Their (rather technical) proofs are given in the appendix.
Let , and let be the surface associated with a correspondence . Let be a point on such that (where denotes the Euclidean norm). If
(i) , and
(ii) , for some absolute constant ,
then for some constant that depends on .
Informally, Condition (i) requires that the absolute value of the coordinate of the position of in the viewing plane, with the camera positioned at , is not too large (i.e., that is not too close to ). We can ensure this property by restricting the camera image to some suitably bounded -range.
Similarly, Condition (ii) requires that the -projection of the vector is not too small. It can be violated in two scenarios. Either we look at a data point that is too close to , or we see it looking too much ‘upwards’ or ‘downwards’. We can ensure that the latter situation does not arise, by restricting the camera image, as in the preceding paragraph, to some suitably bounded -range too. That done, we ensure that the former situation does not arise by requiring that the physical distance between and be at least some multiple of .
The next lemma establishes the converse connection.
Let be a camera pose and a correspondence, such that . Assume that , for some absolute constant , and consider the point where (see Eq. (2))
Then and , for some constant , again depending on . Informally, the condition means that the orientation of the camera, when it is positioned at and sees at coordinate of the viewing plane is not too close to . This is a somewhat artificial constraint that is satisfied by our restriction on the allowed yaws of the camera (the range of ).
A Simple algorithm. Using Lemma 2.3 and Lemma 2.3 we can derive a simple naive solution which does not require any of the sophisticated machinery developed in this work. We construct a grid over , of cells , each of dimensions , where is the constant of Lemma 2.3. We use this non-square grid since we want to find -approximate incidences in terms of frame distance. For each cell of we compute the number of surfaces that intersect . This gives an approximate incidences count for the center of . Further details and a precise statement can be found in the appendix.
3 Primal-dual algorithm for geometric proximity
Following the general approach in , we use a suitable duality, with some care. We write , for suitable parameters , and , whose concrete values are fixed later, and apply the decomposition scheme developed in  tailored to the case at hand. Specifically, we consider the coarser grid in the primal space, of cell dimensions , where is is the constant from Lemma 2.3, that tiles up the domain of possible camera positions. For each cell of , let denote the set of surfaces that cross either or one of the eight cells adjacent to in the -directions.222The choice of , is arbitrary, but it is natural for the analysis, given in the appendix. The duality is illustrated in Figure 2.
We discretize the set of all possible positions of the camera by the vertices of the finer grid , defined as , with replacing , that tiles up . The number of these candidate positions is . For each vertex , we want to approximate the number of surfaces that are -incident to , and output the vertex with the largest count as the best candidate for the position of the camera. Let be the subset of contained in . We ensure that the boxes of are pairwise disjoint by making them half open, in the sense that if is the vertex of a box that has the smallest coordinates, then the box is defined by , , , . This makes the sets pairwise disjoint as well. Put and . We have for each . Since the surfaces are two-dimensional algebraic surfaces of constant degree, each of them crosses cells of , so we have .
We now pass to the dual five-dimensional space. Each point in that space represents a correspondence . We use the first three components as the first three coordinates, but modify the - and -coordinates in a manner that depends on the primal cell . Let be the midpoint of the primal box . For each we map , where , to the point , where and , with and as given in (3). We have If crosses then , for some absolute constant , provided that the following two properties hold, for some absolute constant (the constant depends on ).
(i) , and
(ii) , where are the -coordinates of the center of .
By Corollary 2, the points , for the surfaces that cross , lie in the region . We partition into a grid of small congruent boxes, each of dimensions .
Exactly as in the primal setup, we make each of these boxes half-open, thereby making the sets of dual vertices in the smaller boxes pairwise disjoint. We assign to each of these dual cells the set of dual points that lie in , and the set of the dual surfaces that cross either or one of the eight cells adjacent to in the -directions. Put and . Since the dual cells are pairwise disjoint, we have . Since the dual surfaces are three-dimensional algebraic surfaces of constant degree, each of them crosses grid cells, so .
We compute, for each dual surface , the sum , over the dual cells that are either crossed by or that one of their adjacent cells in the -directions is crossed by . We output the vertex of with the largest resulting count, over all primal cells .
The following theorem establishes the correctness of our technique. Its proof is given in Appendix B.
Suppose that for every cell and for every point and every such that intersects either or one of its adjacent cells in the -directions, we have that, for some absolute constant ,
(ii) , and
Then (a) For each , every pair at frame distance is counted (as an -incidence of ) by the algorithm. (b) For each , every pair that we count lies at frame distance , for some constant depending on .
3.1 Running time analysis
The cost of the algorithm is clearly proportional to over all primal cells and the dual cells associated with each cell . We have
Optimizing the choice of and , we choose and . These choices make sense as long as each of , lies between and . That is, and , or , where and are absolute constants (that depend on ).
If , we use only the primal setup, taking (for the primal subdivision). The cost is then Similarly, if , we use only the dual setup, taking and , and the cost is thus Adding everything together, to cover all three subranges, the running time is then Substituting , we get a running time of The first term dominates when and . In conclusion, we have the following result. Given data points that are seen (and identified) in a two-dimensional image taken by a vertically positioned camera, and an error parameter , where the viewed points satisfy the assumptions made in Theorem 3, we can compute, in time, a vertex of that maximizes the approximate count of -incident correspondences, where “approximate” means that every correspondence whose surface is at frame distance at most from is counted and every correspondence that we count lies at frame distance at most from , for some fixed constant .
Restricting ourselves only to grid vertices does not really miss any solution. We only lose a bit in the quality of approximation, replacing by a slightly large constant multiple thereof, when we move from the best solution to a vertex of its grid cell.
4 Geometric proximity via canonical surfaces
In this section we present a general technique to preprocess a set of algebraic surfaces into a data structure that can answer approximate incidences queries. In this technique we round the original surfaces into a set of canonical surfaces, whose size depends only on , such that each original surface has a canonical surface that is “close” to it. Then we build an octree-based data structure for approximate incidences queries with respect to the canonical surfaces. However, to reduce the number of intersections between the cells of the octree and the surfaces, we further reduce the number of surfaces as we go from one level of the octree to the next, by rounding them in a coarser manner into a smaller set of surfaces.
This technique has been introduced by Fonseca and Mount  for the case of hyperplanes. We describe as a warmup step, in Section C of the appendix, our interpretation of their technique applied to hyperplanes. We then extend here the technique to general surfaces, and apply it to the specific instance of 2-surfaces in 4-space that arise in the camera pose problem.
We have a set of -dimensional surfaces in that cross the unit cube , and a given error parameter . We assume that each surface is given in parametric form, where the first coordinates are the parameters, so its equations are
Moreover, we assume that each is defined in terms of essential parameters , and additional free additive parameters , one free parameter for each dependent coordinate. Concretely, we assume that the equations defining the surface , parameterized by and (we then denote as ), are
For each equation of the surface that does not have a free parameter in the original expression, we introduce an artificial free parameter, and initialize its value to . (We need this separation into essential and free parameters for technical reasons that will become clear later.) We assume that (resp., ) varies over (resp., ).
Remark. The distinction between free and essential parameters seems to be artificial, but yet free parameters do arise in certain basic cases, such as the case of hyperplanes discussed in Section C of the appendix. In the case of our 2-surfaces in 4-space, the parameter is free, and we introduce a second artificial free parameter into the equation for . The number of essential parameters is (they are ,,, and ).
We assume that the functions are all continuous and differentiable, in all of their dependent variables , and (this is a trivial assumption for ), and that they satisfy the following two conditions.
(i) Bounded gradients. for each , for any and any , where is some absolute constant. Here (resp., ) means the gradient with respect to only the variables (resp., ).
(ii) Lipschitz gradients. for each , for any and any , , where
is some absolute constant. This assumption is implied by the assumption that all the eigenvalues of the mixed part of the Hessian matrixhave absolute value bounded by .
4.1 Canonizing the input surfaces
We first replace each surface by a canonical “nearby” surface . Let where is the constant from Condition (ii). We get from (resp., from ) by rounding each coordinate in the essential parametric domain (resp., in the parametric domain ) to a multiple of . Note that each of the artificial free parameters (those that did not exist in the original equations) has the initial value for all surfaces, and remains in the rounded surfaces. We get canonical rounded surfaces, where is the number of original parameters, that is, the number of essential parameters plus the number of non-artificial free parameters; in the worst case we have .
For a surface and its rounded version we have, for each ,
where is some intermediate value, which is irrelevant due to Condition (i).
We will use the -norm of the difference vector as the measure of proximity between the surfaces and at , and denote it as . The maximum measures the global proximity of the two surfaces. (Note that it is an upper bound on the Hausdorff distance between the two surfaces.) We thus have when is the canonical surface approximating .
We define the weight of each canonical surface to be the number of original surfaces that got rounded to it, and we refer to the set of all canonical surfaces by .
4.2 Approximately counting -incidences
We describe an algorithm for approximating the -incidences counts of the surfaces in and the vertices of a grid of side length .
We construct an octree decomposition of , all the way to subcubes of side length such that each vertex of is the center of a leaf-cube. We propagate the surfaces of down this octree, further rounding each of them within each subcube that it crosses.
The root of the octree corresponds to , and we set . At level of the recursion, we have subcubes of of side length . For each such , we set to be the subset of the surfaces in (that have been produced at the parent cube of ) that intersect . We now show how to further round the surfaces of , so as to get a coarser set of surfaces that we associate with , and that we process recursively within .
At any node at level of our rounding process, each surface of is of the form , for where , and .
(a) For each the function is a translation of . That is for some constant . Thus the gradients of also satisfy Conditions (i) and (ii).
(b) is some vector of essential parameters, and each coordinate of is an integer multiple of , where .
(c) is a vector of free parameters, each is a multiple of .
Note that the surfaces in , namely the set of initial canonical surfaces constructed in Section 4.1, are of this form (for and ). We get from by the following steps. The first step just changes the presentation of and , and the following steps do the actual rounding to obtain .
Let be the point in of smallest coordinates and set . We rewrite the equations of each surface of as follows: , for , where , and , for . Note that in this reformulation we have not changed the essential parameters, but we did change the free parameters from to , where depends on , , , and . Note also that for .
We replace the essential parameters of a surface by , which we obtain by rounding each coordinate of to the nearest integer multiple of . So the rounded surface has the equations , for . Note that we also have that , for .
For each surface, we round each free parameter , , to an integral multiple of , and denote the rounded vector by . Our final equations for each rounded surface that we put in are for .
By construction, when and and and get rounded to the same vectors and then the corresponding two surfaces in get rounded to the same surface in . The weight of each surface in is the sum of the weights of the surfaces in that got rounded to it, which, by induction, is the number of original surfaces that are recursively rounded to it. In the next step of the recursion the ’s of the parametrization of the surfaces in are the functions defined above.
The total weight of the surface in for a leaf cell is the approximate -incidences count that we associate with the center of .
4.3 Error analysis
We now bound the error incurred by our discretization. We start with the following lemma, whose proof is given in Appendix .
Let be a cell of the octtree and let , for be a surface obtained in Step 1 of the rounding process described above. For any , for any , and for each , we have
where is the constant of Condition (ii), and consists of the first coordinates of the point in of smallest coordinates.
For any , for any , , and for each , we have
where is the constant of Condition (ii).
Using the triangle inequality and Lemma 4.3, we get that
Since , , and , the lemma follows. ∎
We now bound the number of surfaces in . Since and each of its coordinates is a multiple of , we have at most different values for . To bound the number of possible values of , we prove the following lemma (see the appendix for the proof).
Let , for , be a surface in . For each , we have , where is the constant of Condition (i).
Lemma 4.3 implies that each , , has only possible values, for a total of at most possible values for . Combining the number of possible values for and